
Numerical Worked Examples for Fine-Tuning — MangaAssist Intent Classifier

This companion document takes the earlier MangaAssist fine-tuning scenario and turns it into a worked-example version with explicit numbers, intermediate calculations, decision thresholds, and example logs.

It is based on the same scenario where MangaAssist routes messages into 10 intents, needs under 15 ms P95 latency, starts at 83.2% accuracy out of the box, and improves to about 92.1% after fine-tuning.


1. Assumptions for the Numerical Dry Run

The original scenario assumes:

  • 50K production examples
  • 5K synthetic examples
  • 10 intent classes
  • major imbalance, e.g. product_discovery = 22% and escalation = 3%

To keep the math concrete, I will use these assumptions:

  • Total dataset = 55,000 examples
  • Split = 80 / 10 / 10
  • Train = 44,000
  • Validation = 5,500
  • Test = 5,500
  • Batch size = 32
  • Epochs = 3
  • Base learning rate = 2e-5
  • Layer decay factor = 0.8
  • Warmup ratio = 10%
  • Focal loss gamma = 2.0

1.1 Approximate Class Counts

Using the intent frequencies from the scenario:

Intent | Frequency | Count (of 55,000) | Approx. Train Count (80%)
product_discovery | 22% | 12,100 | 9,680
product_question | 15% | 8,250 | 6,600
recommendation | 18% | 9,900 | 7,920
faq | 8% | 4,400 | 3,520
order_tracking | 12% | 6,600 | 5,280
return_request | 7% | 3,850 | 3,080
promotion | 5% | 2,750 | 2,200
checkout_help | 4% | 2,200 | 1,760
escalation | 3% | 1,650 | 1,320
chitchat | 6% | 3,300 | 2,640
Total | 100% | 55,000 | 44,000

Why this matters

Before you train anything, this table already tells you three things:

  1. The model will naturally learn majority intents faster.
  2. Rare intents like escalation have fewer than 1.4K train examples.
  3. Accuracy alone can hide failure on rare but important intents.

2. Step Count, Warmup Count, and Training Volume

2.1 Steps per Epoch

[ \text{steps per epoch} = \left\lceil \frac{44,000}{32} \right\rceil = 1,375 ]

2.2 Total Training Steps

[ \text{total steps} = 1,375 \times 3 = 4,125 ]

2.3 Warmup Steps

[ \text{warmup steps} = 0.10 \times 4,125 = 412.5 \approx 413 ]

So a clean run looks like:

  • Steps 1 to 413 = warmup
  • Steps 414 to 4,125 = decay / convergence

2.4 Examples Seen Per Epoch

Each epoch sees:

[ 44,000 \text{ train examples} ]

Across 3 epochs:

[ 44,000 \times 3 = 132,000 \text{ example passes} ]

That helps estimate training time, GPU utilization, and expected loss stabilization.
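
As a quick check, here is the same arithmetic in a few lines of Python; the variable names and values are just the assumptions from Section 1:

import math

train_size, batch_size, epochs = 44_000, 32, 3
warmup_ratio = 0.10

steps_per_epoch = math.ceil(train_size / batch_size)    # 1,375
total_steps = steps_per_epoch * epochs                   # 4,125
warmup_steps = math.ceil(warmup_ratio * total_steps)     # 412.5 rounded up -> 413
example_passes = train_size * epochs                     # 132,000
print(steps_per_epoch, total_steps, warmup_steps, example_passes)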


3. Worked Example: Softmax on One Query

Take the user query:

"Something like Naruto but darker"

Suppose the model outputs raw logits for five candidate intents as:

  • recommendation = 2.5
  • product_discovery = 1.7
  • product_question = 0.6
  • faq = -0.4
  • chitchat = -1.2

To simplify, we compute softmax over these five shown classes.

3.1 Exponentials

[ e^{2.5} = 12.182 ] [ e^{1.7} = 5.474 ] [ e^{0.6} = 1.822 ] [ e^{-0.4} = 0.670 ] [ e^{-1.2} = 0.301 ]

Sum:

[ 12.182 + 5.474 + 1.822 + 0.670 + 0.301 = 20.449 ]

3.2 Probabilities

[ P(\text{recommendation}) = 12.182 / 20.449 = 0.596 ] [ P(\text{product_discovery}) = 5.474 / 20.449 = 0.268 ] [ P(\text{product_question}) = 1.822 / 20.449 = 0.089 ] [ P(\text{faq}) = 0.670 / 20.449 = 0.033 ] [ P(\text{chitchat}) = 0.301 / 20.449 = 0.015 ]

3.3 Interpretation

The model predicts:

  • Top-1 = recommendation (59.6%)
  • Top-2 = product_discovery (26.8%)

This is a healthy example of a near-neighbor confusion pair. If both route to the same recommendation engine downstream, the business impact may be low even though class-level accuracy still counts it as an error.
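
A minimal sketch that reproduces these probabilities; the logits are the assumed values above, and subtracting the max logit is the standard numerical-stability trick that does not change the result:

import numpy as np

logits = {"recommendation": 2.5, "product_discovery": 1.7,
          "product_question": 0.6, "faq": -0.4, "chitchat": -1.2}
z = np.array(list(logits.values()))
exp_z = np.exp(z - z.max())
probs = exp_z / exp_z.sum()
for intent, p in zip(logits, probs):
    print(f"{intent:18s} {p:.3f}")   # recommendation 0.596, product_discovery 0.268, ...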


4. Worked Example: Cross-Entropy Loss

From the scenario, for one-hot labels the cross-entropy is:

[ \mathcal{L}_{CE} = -\log(\hat{y}_{y_{true}}) ]

Suppose the true label is recommendation and the model predicted:

[ P(\text{recommendation}) = 0.596 ]

Then:

[ \mathcal{L}_{CE} = -\log(0.596) = 0.517 ]

Compare three cases

True-class probability | Cross-entropy | Meaning
0.90 | 0.105 | easy correct example
0.60 | 0.511 | moderately confident correct
0.10 | 2.303 | badly wrong

Decision meaning

During training:

  • loss near 2.3 means the classifier is still close to random for a 10-class setup (ln 10 ≈ 2.303)
  • loss below 0.5 means the model is learning strong class boundaries
  • train loss much lower than val loss is a warning sign of overfitting
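
The table values, recomputed with the natural log used throughout this document:

import numpy as np

for p_true in (0.90, 0.60, 0.10):
    print(f"p_true={p_true:.2f}  cross_entropy={-np.log(p_true):.3f}")
# prints 0.105, 0.511, 2.303, matching the table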


5. Worked Example: Focal Loss

The scenario uses focal loss for class imbalance:

[ \mathcal{L}_{FL} = -\alpha_t (1 - p_t)^\gamma \log(p_t) ]

Let:

  • true class = escalation
  • (p_t = 0.35)
  • (\gamma = 2)
  • class weight (\alpha_t = 0.18)

5.1 Modulating factor

[ (1 - 0.35)^2 = 0.65^2 = 0.4225 ]

5.2 Log term

[ -\log(0.35) = 1.050 ]

5.3 Final focal loss

[ \mathcal{L}_{FL} = 0.18 \times 0.4225 \times 1.050 = 0.0799 ]

Now compare with an easy majority-class example.

Let:

  • true class = product_discovery
  • (p_t = 0.92)
  • (\alpha_t = 0.025)

Then:

[ (1 - 0.92)^2 = 0.08^2 = 0.0064 ] [ -\log(0.92) = 0.0834 ] [ \mathcal{L}_{FL} = 0.025 \times 0.0064 \times 0.0834 = 0.0000133 ]

5.4 What this tells us

The rare hard example contributes:

[ 0.0799 / 0.0000133 \approx 6,008\times ]

more focal-loss signal than the easy majority example.

That is the real operational reason focal loss helps here: it makes the optimizer care much more about the hard examples that would otherwise be drowned out.
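
The same two examples pushed through a small focal-loss helper; α, p, and γ are the assumed values above:

import numpy as np

def focal_loss(p_t, alpha_t, gamma=2.0):
    # per-example focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

hard_rare = focal_loss(p_t=0.35, alpha_t=0.18)     # ~0.0799 (escalation, hard)
easy_major = focal_loss(p_t=0.92, alpha_t=0.025)   # ~1.33e-05 (product_discovery, easy)
print(hard_rare / easy_major)                       # ~6,000x more loss signal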


6. Worked Example: Inverse Frequency Class Weights

A simple unnormalized inverse-frequency weight is:

[ w_t = \frac{1}{f_t} ]

For two classes:

  • product_discovery: (f = 0.22)
  • escalation: (f = 0.03)

Then:

[ w_{product_discovery} = 1/0.22 = 4.545 ] [ w_{escalation} = 1/0.03 = 33.333 ]

Relative weight ratio:

[ 33.333 / 4.545 = 7.33 ]

This matches the original intuition that escalation receives roughly 7.3x the class emphasis of product_discovery.

Practical note

You normally normalize weights before feeding them into the loss, but the ratio is the important thing for reasoning.
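
A sketch of one common normalization: scale the raw inverse-frequency weights so they average to 1 across the 10 classes (frequencies from Section 1.1). The 7.33x ratio is unchanged by the scaling.

freq = {"product_discovery": 0.22, "recommendation": 0.18, "product_question": 0.15,
        "order_tracking": 0.12, "faq": 0.08, "return_request": 0.07,
        "chitchat": 0.06, "promotion": 0.05, "checkout_help": 0.04, "escalation": 0.03}

raw = {intent: 1.0 / f for intent, f in freq.items()}        # unnormalized 1/f weights
scale = len(raw) / sum(raw.values())                          # so the mean weight is 1.0
weights = {intent: w * scale for intent, w in raw.items()}

print(round(weights["escalation"] / weights["product_discovery"], 2))   # 7.33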


7. Worked Example: Discriminative Learning Rates

The original scenario uses:

[ \eta_l = \eta_{base} \cdot \xi^{(L-l)} ]

with:

  • (\eta_{base} = 2 \times 10^{-5})
  • (\xi = 0.8)
  • 6 encoder layers

7.1 Layer-wise Learning Rates

Layer | Formula | LR
Head | base | 0.00002000
Layer 5 | 2e-5 | 0.00002000
Layer 4 | 2e-5 × 0.8 | 0.00001600
Layer 3 | 2e-5 × 0.8^2 | 0.00001280
Layer 2 | 2e-5 × 0.8^3 | 0.00001024
Layer 1 | 2e-5 × 0.8^4 | 0.00000819
Layer 0 | 2e-5 × 0.8^5 | 0.00000655
Embeddings | 2e-5 × 0.8^6 | 0.00000524
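
A sketch of how the table above could be generated, e.g. when building optimizer parameter groups; layer 5 is the top encoder layer, and the names are illustrative rather than a specific library's API:

base_lr, decay, num_layers = 2e-5, 0.8, 6   # encoder layers 0..5

layer_lrs = {"head": base_lr}
for layer in range(num_layers - 1, -1, -1):                    # 5, 4, ..., 0
    layer_lrs[f"layer_{layer}"] = base_lr * decay ** (num_layers - 1 - layer)
layer_lrs["embeddings"] = base_lr * decay ** num_layers

for name, lr in layer_lrs.items():
    print(f"{name:12s} {lr:.8f}")   # matches the table, e.g. embeddings 0.00000524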

7.2 What a Single Update Looks Like

Suppose a parameter in layer 5 has gradient:

[ g = 0.02 ]

Ignoring Adam moments for intuition, the update size is approximately:

[ \Delta \theta \approx \eta \cdot g ]

For layer 5:

[ \Delta \theta_{L5} = 2e-5 \times 0.02 = 4e-7 ]

For embeddings:

[ \Delta \theta_{emb} = 5.24e-6 \times 0.02 = 1.048e-7 ]

So the same gradient moves the top layer about:

[ 4e-7 / 1.048e-7 \approx 3.82\times ]

more than the embeddings.

Decision meaning

That is exactly what we want:

  • top layers adapt more
  • bottom layers stay stable
  • catastrophic forgetting risk is lower


8. Worked Example: Warmup Schedule

Warmup from step 1 to step 413:

[ \eta(t) = \eta_{max} \cdot \frac{t}{413} ]

where:

[ \eta_{max} = 2e-5 ]

8.1 Example learning rates during warmup

Step | LR
1 | 2e-5 × 1/413 ≈ 4.84e-8
50 | 2e-5 × 50/413 ≈ 2.42e-6
100 | ≈ 4.84e-6
200 | ≈ 9.69e-6
413 | 2.00e-5
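
A minimal linear-warmup plus linear-decay schedule that reproduces the warmup values above; this is a sketch, and in practice you would use the scheduler provided by your training framework:

def lr_at(step, max_lr=2e-5, warmup_steps=413, total_steps=4125):
    """Linear warmup to max_lr, then linear decay to zero."""
    if step <= warmup_steps:
        return max_lr * step / warmup_steps
    return max_lr * (total_steps - step) / (total_steps - warmup_steps)

for s in (1, 50, 100, 200, 413, 2000, 4125):
    print(s, f"{lr_at(s):.2e}")   # 4.84e-08 at step 1, 2.00e-05 at step 413, 0 at step 4125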

Decision meaning

If the model is unstable at the start, warmup prevents the randomly initialized head from sending large bad gradients into the encoder.

In logs, lack of warmup often looks like:

  • training loss spikes
  • gradient norm spikes
  • validation accuracy oscillating wildly in epoch 1


9. Worked Example: Gradient Clipping

Assume a training step produced a global gradient norm:

[ ||g|| = 3.6 ]

and we clip at:

[ \text{max norm} = 1.0 ]

Then the clip coefficient is:

[ 1.0 / 3.6 = 0.2778 ]

So every gradient is scaled by about 0.278.

If one parameter gradient was originally:

[ 0.018 ]

after clipping it becomes:

[ 0.018 \times 0.2778 = 0.0050 ]
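
The same arithmetic as a sketch of global-norm clipping, the rule most frameworks apply: scale all gradients by max_norm / total_norm whenever the total norm exceeds the threshold.

grad_norm, max_norm = 3.6, 1.0
clip_coef = min(1.0, max_norm / grad_norm)   # 0.2778; a no-op when the norm is already below 1.0

g = 0.018                                     # one parameter's gradient before clipping
print(round(clip_coef, 4), round(g * clip_coef, 4))   # 0.2778, 0.005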

Decision meaning

Clipping protects the run from occasional bad batches, especially:

  • mislabeled samples
  • long noisy multilingual examples
  • rare-intent hard negatives


10. Worked Example: Confusion Matrix and Business Impact

Suppose the test set has 5,500 examples.

If total accuracy is 92.1%, then:

[ 0.921 \times 5,500 = 5,065.5 \approx 5,066 \text{ correct} ]

Errors:

[ 5,500 - 5,066 = 434 \text{ errors} ]

Now split the 434 errors by business severity:

Error Type | Count | Business Impact
recommendation ↔ product_discovery | 170 | low
product_question ↔ recommendation | 80 | medium
order_tracking → product_discovery | 60 | high
return_request → faq | 45 | high
escalation → chitchat | 20 | very high
other | 59 | mixed

Decision meaning

A strong ML engineer should not stop at total accuracy.

Two models could both be 92% accurate, but one makes mostly low-impact errors while the other makes many high-cost routing mistakes.

So one new metric to add is:

Weighted business error score

Example weights:

  • low = 1
  • medium = 3
  • high = 7
  • very high = 12

Then score:

[ 170(1) + 80(3) + 60(7) + 45(7) + 20(12) + 59(3) ]

[ = 170 + 240 + 420 + 315 + 240 + 177 = 1,562 ]
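
The same score as code, so it can be recomputed automatically for every candidate model; the counts and severity weights are the assumed values above:

severity_weight = {"low": 1, "medium": 3, "high": 7, "very_high": 12, "mixed": 3}

errors = [
    ("recommendation <-> product_discovery", 170, "low"),
    ("product_question <-> recommendation",   80, "medium"),
    ("order_tracking -> product_discovery",   60, "high"),
    ("return_request -> faq",                 45, "high"),
    ("escalation -> chitchat",                20, "very_high"),
    ("other",                                 59, "mixed"),
]

score = sum(count * severity_weight[severity] for _, count, severity in errors)
print(score)   # 1562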

This gives leadership a better view than accuracy alone.


11. Worked Example: Calibration and Confidence Thresholds

Suppose on 1,000 sampled live predictions:

Confidence Bin | Count | Average Confidence | Actual Accuracy
0.9 to 1.0 | 500 | 0.94 | 0.91
0.8 to 0.9 | 220 | 0.85 | 0.82
0.7 to 0.8 | 130 | 0.75 | 0.71
0.6 to 0.7 | 80 | 0.65 | 0.58
0.5 to 0.6 | 40 | 0.55 | 0.45
below 0.5 | 30 | 0.38 | 0.27

11.1 Approximate ECE

Expected calibration error can be approximated by:

[ ECE = \sum_k \frac{n_k}{N} |acc_k - conf_k| ]

Compute each contribution:

  • 500/1000 × |0.91 - 0.94| = 0.015
  • 220/1000 × |0.82 - 0.85| = 0.0066
  • 130/1000 × |0.71 - 0.75| = 0.0052
  • 80/1000 × |0.58 - 0.65| = 0.0056
  • 40/1000 × |0.45 - 0.55| = 0.0040
  • 30/1000 × |0.27 - 0.38| = 0.0033

Total:

[ ECE \approx 0.0397 ]

So ECE is about 4.0%.
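
The binned ECE computation as a sketch over the table above; the counts, average confidences, and accuracies are the assumed values:

bins = [   # (count, avg_confidence, actual_accuracy)
    (500, 0.94, 0.91), (220, 0.85, 0.82), (130, 0.75, 0.71),
    (80, 0.65, 0.58), (40, 0.55, 0.45), (30, 0.38, 0.27),
]

n_total = sum(count for count, _, _ in bins)   # 1,000
ece = sum(count / n_total * abs(acc - conf) for count, conf, acc in bins)
print(round(ece, 4))   # ~0.0397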

Decision meaning

A useful threshold policy could be:

  • if top-1 confidence >= 0.80 → auto-route
  • if confidence 0.55 to 0.80 → auto-route but log for audit
  • if confidence < 0.55 → fallback to LLM or human-safe path

This is one of the best additions you can make to the current design.


12. Worked Example: Drift Detection with KL Divergence

The original scenario mentions retraining when drift grows enough and accuracy drops over time.

Suppose the training-time class distribution for three intents is:

  • product_discovery = 0.22
  • recommendation = 0.18
  • order_tracking = 0.12

In live traffic one month later:

  • product_discovery = 0.18
  • recommendation = 0.25
  • order_tracking = 0.09

The KL divergence between the training-time distribution P and the live distribution Q is:

[ D_{KL}(P||Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)} ]

12.1 Partial calculation

For product_discovery:

[ 0.22 \log(0.22/0.18) = 0.22 \log(1.2222) \approx 0.22 \times 0.2007 = 0.0442 ]

For recommendation:

[ 0.18 \log(0.18/0.25) = 0.18 \log(0.72) \approx 0.18 \times (-0.3285) = -0.0591 ]

For order_tracking:

[ 0.12 \log(0.12/0.09) = 0.12 \log(1.3333) \approx 0.12 \times 0.2877 = 0.0345 ]

Partial sum:

[ 0.0442 - 0.0591 + 0.0345 = 0.0196 ]
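
The three contributions as code. This is a partial sum only; a full check would cover all 10 intents and, in practice, smooth near-zero live frequencies to keep the log finite.

import math

train_freq = {"product_discovery": 0.22, "recommendation": 0.18, "order_tracking": 0.12}
live_freq  = {"product_discovery": 0.18, "recommendation": 0.25, "order_tracking": 0.09}

partial_kl = sum(p * math.log(p / live_freq[intent]) for intent, p in train_freq.items())
print(round(partial_kl, 4))   # ~0.0196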

With all 10 classes included, total KL could easily cross 0.02, which was already discussed as an alert region in the original setup.

Decision meaning

KL alone is not enough. A stronger drift gate is:

  • KL divergence > 0.02, OR
  • low-confidence rate > 8%, OR
  • live sampled accuracy < 90%

That combination is much safer than one metric alone.


13. Worked Example: Cost per Misroute Avoided

From the original scenario:

  • baseline accuracy = 83.2%
  • fine-tuned accuracy = 92.1%

Suppose monthly traffic is 3,000,000 user messages.

13.1 Baseline misroutes

[ (1 - 0.832) \times 3,000,000 = 504,000 ]

13.2 Fine-tuned misroutes

[ (1 - 0.921) \times 3,000,000 = 237,000 ]

13.3 Misroutes avoided

[ 504,000 - 237,000 = 267,000 ]

If each bad route costs on average $0.003 in extra infra/LLM cost and $0.010 in expected support / conversion loss,

then expected value per avoided misroute is:

[ 0.003 + 0.010 = 0.013 ]

Monthly value created:

[ 267,000 \times 0.013 = 3,471 ]

So even a small intent-classifier lift can easily pay for itself.


14. Worked Example: Active Learning Sampling

Suppose weekly live traffic is 750,000 messages.

Model confidence buckets:

Bucket Share Count
>= 0.9 58% 435,000
0.8 to 0.9 22% 165,000
0.6 to 0.8 14% 105,000
< 0.6 6% 45,000

You cannot label all 45,000 low-confidence items.

So choose:

  • 100 from escalation-related low confidence
  • 100 from multilingual low confidence
  • 100 from recommendation/discovery confusion
  • 100 from promotion / seasonal shift traffic
  • 100 random low confidence

Total weekly human labels = 500

If label cost is $0.25 each:

[ 500 \times 0.25 = 125 \text{ dollars per week} ]

Monthly:

[ 125 \times 4 = 500 \text{ dollars} ]

That matches the original discussion around monthly labeling cost.

Decision meaning

This is a high-ROI loop because you do not spend money on easy data. You buy labels mostly where the model is uncertain or drifting.


15. Worked Example: Production Latency Budget

The scenario target is under 15 ms P95.

A realistic latency budget might be:

Stage P50 P95
request parsing 0.6 ms 1.0 ms
tokenization 1.2 ms 2.0 ms
model forward pass 4.5 ms 8.0 ms
softmax + threshold logic 0.2 ms 0.5 ms
logging / metrics emit 0.8 ms 1.5 ms
network / framework overhead 1.2 ms 2.5 ms
Total 8.5 ms 15.5 ms

This slightly misses the P95 target. (Summing per-stage P95s is a conservative upper bound, since every stage rarely hits its worst case on the same request, but the budget is clearly tight.)

Possible fixes

  1. Reduce max sequence length from 128 to 96 for this classifier.
  2. Batch microbursts when traffic is high.
  3. Make detailed logging async.
  4. Remove any synchronous feature fetch from the classifier path.

Decision meaning

A top engineer should treat latency as a stage budget, not one opaque number.


16. Example Training Logs and How to Read Them

[epoch=1 step=50]  train_loss=1.842  lr_head=2.42e-06  grad_norm=0.88  throughput=252 ex/s
[epoch=1 step=100] train_loss=1.511  lr_head=4.84e-06  grad_norm=0.94  throughput=255 ex/s
[epoch=1 step=200] train_loss=1.182  lr_head=9.69e-06  grad_norm=1.27(clipped)  rare_cls_acc_running=0.69
[epoch=1 step=413] train_loss=1.041  lr_head=2.00e-05  grad_norm=0.72  warmup_complete=true
[epoch=1 end]      train_loss=0.924  val_loss=0.801  val_acc=0.886  rare_cls_acc=0.812

[epoch=2 step=900]  train_loss=0.612  lr_head=1.76e-05  grad_norm=0.63  low_conf_rate=0.091
[epoch=2 step=1375] train_loss=0.534  lr_head=1.51e-05  grad_norm=0.51  cls5_attn_entropy=1.92
[epoch=2 end]       train_loss=0.487  val_loss=0.418  val_acc=0.914  rare_cls_acc=0.871

[epoch=3 step=2800] train_loss=0.312  lr_head=8.05e-06  grad_norm=0.39  ece=0.061
[epoch=3 step=3600] train_loss=0.271  lr_head=4.26e-06  grad_norm=0.28  low_conf_rate=0.058
[epoch=3 end]       train_loss=0.244  val_loss=0.307  val_acc=0.921  rare_cls_acc=0.886  ece=0.040

Good signs

  • train loss steadily decreases
  • val loss also decreases
  • rare-class accuracy improves each epoch
  • gradient norm is controlled
  • ECE improves by epoch 3

Bad signs

[epoch=2 end] train_loss=0.211  val_loss=0.732  val_acc=0.874  rare_cls_acc=0.791

This likely means overfitting or data mismatch.


17. Example Production Logs and What to Monitor

{
  "ts": "2026-04-21T14:03:22.481Z",
  "model_version": "intent-distilbert-v17",
  "request_id": "a8f1c2",
  "lang_hint": "mixed_ja_en",
  "seq_len": 21,
  "top1_intent": "product_question",
  "top1_conf": 0.57,
  "top2_intent": "recommendation",
  "top2_conf": 0.31,
  "route_action": "fallback_llm",
  "latency_ms": 11.8,
  "drift_bucket": "low_confidence",
  "feature_flags": ["threshold_v2", "shadow_compare_enabled"]
}

Keep in logs

  • timestamp
  • model version
  • request id
  • sequence length
  • language hint or segment type
  • top-1 and top-2 intents
  • top-1 and top-2 confidence
  • route action
  • latency
  • experiment / feature flag

Do not log directly

  • raw user message in plain text for long-term retention
  • order IDs or payment info
  • any user PII without masking / policy controls

One useful addition

Log margin:

[ \text{margin} = p_{top1} - p_{top2} ]

If margin is small, ambiguity is high.

Example above:

[ 0.57 - 0.31 = 0.26 ]

A margin below 0.10 could trigger special audit or multi-intent logic.


18. Stage-by-Stage Decisions During Fine-Tuning

Stage A — Data Intake

What we check

  • class counts
  • label noise
  • synthetic / production ratio
  • multilingual coverage
  • long-tail intent coverage

Example rule

  • reject dataset if any critical intent has fewer than 1,000 train examples
  • reject synthetic batch if estimated bad-label rate > 10%

Stage B — Tokenization and Sequence Design

What we check

  • average sequence length
  • truncation rate
  • tokenizer split quality on manga jargon

Example rule

  • if more than 3% of training samples are truncated at max length 128, review max length or cleanup logic

Stage C — First Training Epoch

What we check

  • initial loss curve
  • gradient norm spikes
  • warmup behavior

Example rule

  • if gradient clipping happens on more than 15% of steps in first 200 batches, lower LR or inspect noisy labels

Stage D — Validation Gate per Epoch

What we check

  • overall accuracy
  • rare-class accuracy
  • calibration
  • business-severity confusion

Example rule

  • continue only if val accuracy improves by at least 0.3 points or rare-class accuracy improves by 0.5 points

Stage E — Model Selection

What we check

  • champion vs challenger
  • same golden set
  • same latency envelope

Example rule

  • accept challenger only if:
      • overall accuracy is not worse by more than 0.2 points
      • rare-class accuracy is better by at least 0.5 points, or the business error score is lower by 5%
      • P95 latency stays within budget

Stage F — Deployment

What we check

  • shadow accuracy
  • live low-confidence rate
  • live drift
  • latency regressions

Example rule

  • abort rollout if low-confidence rate increases by more than 30% relative to champion

19. New Things I Would Add to the Design

These are the most valuable upgrades beyond the original setup.

19.1 Confidence calibration as a first-class metric

Add:

  • ECE
  • a reliability diagram
  • per-intent calibration

Why it matters: A routing model is not only about being correct. It must also know when it is unsure.

19.2 Margin-based ambiguity handling

Use:

[ margin = p_{top1} - p_{top2} ]

Policy example:

  • confidence >= 0.80 and margin >= 0.20 → direct route
  • confidence between 0.55 and 0.80, or margin < 0.20 → guarded route
  • confidence < 0.55 → LLM fallback or human-safe workflow
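
A sketch of that policy as a single routing function; the thresholds are the example values above, not tuned production constants:

def route(top1_conf: float, top2_conf: float) -> str:
    margin = top1_conf - top2_conf
    if top1_conf >= 0.80 and margin >= 0.20:
        return "direct_route"
    if top1_conf < 0.55:
        return "llm_fallback"
    return "guarded_route"   # mid confidence, or high confidence with a small margin

print(route(0.57, 0.31))   # guarded_route (margin = 0.26)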

19.3 Business-weighted evaluation

Add a business-severity matrix so high-cost mistakes are counted more heavily than low-cost mistakes.

19.4 Multi-intent detection

The original setup notes multi-intent traffic is hard. A strong next step is:

  • top-2 routing
  • a binary multi-intent detector
  • a hierarchical pipeline: first single-intent vs multi-intent, then fine-grained class

19.5 Open-set / out-of-distribution detection

Add an unknown_or_other or OOD gate for strange queries such as:

  • spam
  • policy abuse
  • unsupported language patterns
  • non-retail questions

19.6 PEFT / LoRA experiment track

Even for a small encoder, comparing full fine-tuning, LoRA, and partial unfreezing could reduce retraining cost and make rapid iteration easier.

19.7 Better multilingual handling

Since Japanese-English mixing is already a challenge, a new experiment path could compare:

  • a DistilBERT baseline
  • a multilingual MiniLM / XLM-R style encoder
  • a domain-adapted tokenizer vocabulary

19.8 Data quality score per batch

Track per batch:

  • average sequence length
  • clipped ratio
  • label entropy
  • loss percentile
  • language mix

That helps detect when a training run is bad because the data batch is bad, not because the model is bad.

19.9 Counterfactual evaluation

For each golden example, store small perturbations:

  • "Naruto" → "Bleach"
  • "return" → "refund"
  • English → mixed Japanese-English

This catches brittle models.

19.10 Human review queue prioritization

Instead of sending all low-confidence examples equally, prioritize by:

[ priority = business_impact \times uncertainty \times traffic_frequency ]

That gives far better labeling ROI.


20. What I Would Prioritize First

If I were leading this system, I would add these in order:

  1. Calibration + confidence thresholds
  2. Business-weighted error score
  3. Margin-based ambiguity handling
  4. Multi-intent detection
  5. OOD / unknown class gate
  6. Active-learning prioritization formula
  7. PEFT / LoRA benchmark track

That order gives the biggest production value with relatively low system complexity.


21. Final Engineering Takeaway

A top GenAI / ML engineer should think about fine-tuning as a sequence of decisions:

  1. Is the data trustworthy enough to train?
  2. Are rare, costly intents protected?
  3. Is the optimization stable?
  4. Is the model calibrated enough to route safely?
  5. Do the remaining errors matter to the business?
  6. Can the production system detect drift and uncertainty fast enough?
  7. Can the human-label budget be focused on the highest-value examples?

That is the difference between just training a classifier and running a reliable production intent-routing system.


Research-Grade Addendum

22. Bootstrap Confidence Interval Procedure

The point estimates above (e.g., accuracy 92.1%, ECE 0.0397, $3,471/month avoided cost) are computed on a single test split. A research scientist would not report these as exact numbers — they are samples from a sampling distribution, and we should report how much they could vary if we had a different draw of the same size.

The standard procedure (Bouthillier 2021; Efron & Tibshirani 1993) is:

import numpy as np

def bootstrap_ci(metric_fn, y_true, y_pred, B=10_000, alpha=0.05, rng=None):
    """
    metric_fn: f(y_true, y_pred) -> scalar
    Returns (point_estimate, lower_bound, upper_bound) at level (1-alpha).
    """
    rng = rng or np.random.default_rng(2025)
    n = len(y_true)
    point = metric_fn(y_true, y_pred)
    samples = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)         # resample with replacement
        samples[b] = metric_fn(y_true[idx], y_pred[idx])
    lo = np.quantile(samples, alpha / 2)
    hi = np.quantile(samples, 1 - alpha / 2)
    return point, lo, hi

For derived metrics that involve scaling (e.g., monthly $ savings = weighted_error × volume × cost_per_harm_unit), bootstrap the underlying random variable (per-request error indicators) and propagate through the formula on each resample — do not just multiply the CI of one factor by the others (that gives wrong intervals).

22.1 Worked CI: Accuracy

Test set: n = 5,500. B = 10,000. Seed grid {2025, 2026, 2027}.

Quantity Value
Point estimate 0.9210
Mean of bootstrap samples 0.9210
Standard deviation of bootstrap samples 0.00216
2.5th percentile 0.9168
97.5th percentile 0.9252
Reported 0.921 ± 0.0042 → [0.9168, 0.9252]

Why this width? The standard error for a binomial proportion at p ≈ 0.92 with n = 5,500 is sqrt(0.92 × 0.08 / 5,500) ≈ 0.0036. Multiplying by 1.96 (normal approximation) gives a CI half-width of 0.0070, which is wider than our bootstrap result (0.0042). The bootstrap is tighter because the test set is stratified — accuracy varies less than under simple random sampling.

22.2 Worked CI: ECE

ECE is computed by bucketing predictions into 10 confidence bins and taking a weighted gap between confidence and accuracy. Bucket counts vary across bootstrap resamples, which inflates the CI.

Quantity Value
Point estimate 0.0397
Mean of bootstrap samples 0.0399
Standard deviation 0.00310
2.5th percentile 0.0336
97.5th percentile 0.0458
Reported 0.0397 ± 0.0061 → [0.0336, 0.0458]

22.3 Worked CI: Monthly $ Savings

Pipeline: harm_per_request × monthly_volume × per_harm_cost. We bootstrap the per-request harm indicator and recompute on each resample.

Quantity Value
Point estimate $19,330
2.5th percentile $15,640
97.5th percentile $23,020
Reported $19,330 ± $3,690 → [$15,640, $23,020]

Always report the lower bound ($15.6K) in business reviews, not the point estimate. Asymmetric reporting protects against over-promising.

23. Variance Discussion: Why the CIs Are the Width They Are

The CI half-width is governed by three factors:

  1. Sample size. Halving CI half-width requires quadrupling test-set size. To get the rare-class accuracy CI from ±1.7pp to ±0.85pp, we'd need to enlarge the rare-class test set from ~165 to ~660 examples. This is the dominant lever.
  2. Class imbalance. Per-class CIs are dominated by the rare class. Macro-F1's CI (±0.0078) is wider than micro-F1's (±0.0036) because macro averages over per-class rates, each with its own n.
  3. Label noise. If 1-2% of labels are wrong (Northcutt 2021 estimates 3-6% noise on most public benchmarks), the measured accuracy is biased low and the CI does not capture that bias. The CI describes sampling variance, not labeling-truth variance. A 1% noise floor implies our reported 92.1% may correspond to a "true" accuracy of 92.5-93%.

Implication for ablations. When two ablation rows differ by ≤ the CI half-width (e.g., γ=2.0 at 92.1% ± 0.4 vs. γ=2.5 at 92.0% ± 0.4), the difference is not statistically distinguishable. The ablation tables in the main doc are useful as direction-of-effect signals, not as precise rankings. To distinguish two close methods rigorously, we would need a paired test (McNemar's test or paired bootstrap on the same test set) — not the unpaired bootstrap CI shown here.

24. Multi-Seed Variance

A subtler source of variance is training-time stochasticity: same data, same hyperparams, different seeds → different models. We track this separately.

Seed Accuracy Macro-F1 ECE (post-cal)
42 0.9215 0.864 0.0395
123 0.9203 0.863 0.0401
2024 0.9212 0.865 0.0394
Mean 0.9210 0.864 0.0397
Std dev (across seeds) 0.0006 0.001 0.0004

Reading. Seed-to-seed variance (std 0.0006 on accuracy) is much smaller than sampling variance (CI half-width 0.0021). For DistilBERT on a 55K-example dataset, the model is well-determined; running 3 seeds is enough to check that the result is not seed-dependent without burning compute on 10+ seeds. Recommendation: 3 seeds is sufficient for fine-tuning at this scale. Reproducibility runs at scale-up moments (e.g., switching to QLoRA, going to 200K examples, swapping loss to DPO) should re-validate this claim.

25. Open Problems

  1. The point-estimate-vs-distribution gap. Most production reviews still consume point estimates. Even when CIs are computed, slides and dashboards show single numbers. Open question: how do we make CIs the default in monitoring dashboards (Grafana / SageMaker Model Monitor) without overwhelming reviewers?
  2. Calibration-aware CIs. ECE has a well-known bias: with a finite val set and 10 bins, the empirical ECE overestimates the true ECE by O(1/√n_bin). Open question: bias-corrected ECE (Roelofs 2022) — does the bias shift our calibration thresholds?
  3. Compute-cost-aware bootstrap. B = 10,000 resamples on a 5,500-example test set takes ~12 s for accuracy but ~3 min for ECE (binning is slow). For a per-segment dashboard with 30 segments, this becomes hours. Open question: subsample bootstrap (Politis 1999) or analytic approximations to cut cost 10×.

26. Bibliography (this file)

  • Efron, B., Tibshirani, R. (1993). An Introduction to the Bootstrap. CRC Press. — canonical reference.
  • Bouthillier, X. et al. (2021). Accounting for Variance in Machine Learning Benchmarks. MLSys.
  • Henderson, P. et al. (2018). Deep Reinforcement Learning that Matters. AAAI. — multi-seed reporting.
  • Northcutt, C. et al. (2021). Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks. NeurIPS Datasets & Benchmarks.
  • Roelofs, R., Cain, N., Shlens, J., Mozer, M. C. (2022). Mitigating Bias in Calibration Error Estimation. AISTATS.
  • Politis, D. N., Romano, J. P., Wolf, M. (1999). Subsampling. Springer.
  • Naeini, M. P. et al. (2015). Obtaining Well-Calibrated Probabilities Using Bayesian Binning. AAAI.
  • McNemar, Q. (1947). Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika. — paired test for classifier comparison.

Citation count for this file: 8.