Fine-Tuning Dry Run Document — Intent Classifier (DistilBERT) for MangaAssist
This document is a practical dry run of the fine-tuning process for the uploaded MangaAssist intent-classification scenario. It is designed to help you understand the math, formulas, training flow, production logs, decision metrics, and how engineering decisions are made at each stage.
1) Goal of the System
MangaAssist sends each user message to an intent classifier before calling downstream services. The classifier must map messages like:
- "Show me horror manga"
- "Where is my order?"
- "Something like Naruto but darker"
into one of 10 intents, while meeting a P95 latency target under 15 ms. The uploaded scenario uses DistilBERT, starts from 83.2% domain accuracy out of the box, and improves to about 92.1% after domain fine-tuning.
2) Problem Framing
Business problem
If intent classification is wrong:
- users get routed to the wrong service,
- extra LLM or backend calls may be triggered,
- customer experience drops,
- cost rises.
ML problem
This is a multi-class text classification problem with:
- 10 classes,
- class imbalance,
- domain jargon,
- slang,
- Japanese-English mixed text,
- multi-intent ambiguity.
Why this is hard
The uploaded scenario highlights four challenges:
1. manga jargon,
2. slang,
3. mixed intents,
4. Japanese-English mixing.
3) End-to-End Fine-Tuning Lifecycle
flowchart TD
A[Collect labeled production logs] --> B[Clean + normalize text]
B --> C[Add filtered synthetic data]
C --> D[Train/val/test split]
D --> E[Tokenization]
E --> F[Load pre-trained DistilBERT]
F --> G[Fine-tune with focal loss + discriminative LR + warmup]
G --> H[Validate on golden set]
H --> I{Pass gates?}
I -- Yes --> J[Register model]
I -- No --> K[Debug data/hparams/labels]
K --> G
J --> L[Compile + deploy]
L --> M[Monitor latency, drift, confidence, accuracy]
M --> N{Need retraining?}
N -- Yes --> A
N -- No --> M
4) Stage-by-Stage Decision View
| Stage | Main Question | Key Inputs | Main Metrics | Typical Decision |
|---|---|---|---|---|
| Problem framing | What are we predicting and why? | user flows, intents, latency budget | business impact, routing error cost | use classifier as first routing layer |
| Data audit | Is data good enough? | production logs, labels, class counts | label noise, class balance, ambiguity rate | clean labels, add rules, collect more rare classes |
| Model selection | Which model fits accuracy + latency? | candidate models | accuracy, P95 latency, memory, cost | choose DistilBERT over larger/slower models |
| Loss selection | How do we handle imbalance? | class frequency | rare-class recall, macro F1 | use focal loss + class weights |
| Optimization | How do we fine-tune safely? | LR, scheduler, epochs | train/val loss, gradient norm | discriminative LR + warmup + clipping |
| Validation | Is model truly better? | val/test/golden set | accuracy, per-class recall, calibration | promote only if all gates pass |
| Deployment | Can it serve at target latency? | compiled artifact, infra | P50/P95/P99, errors, throughput | deploy with shadow or blue/green |
| Monitoring | Is production still healthy? | live traffic, sampled labels | drift, low confidence, live accuracy | trigger retrain or rollback |
5) The Core Model Dry Run
Model architecture
The scenario uses DistilBERT:
- embeddings,
- 6 transformer encoder layers,
- [CLS] pooled representation,
- linear classification head,
- softmax over 10 intents.
One-message dry run
Example input:
"Is this isekai peak or mid?"
Step 1: tokenization
A WordPiece tokenizer splits text into tokens or subwords.
Illustrative tokenization:
[CLS] is this is ##ek ##ai peak or mid ? [SEP]
Step 2: embeddings
Each token becomes a vector:
- token embedding,
- position embedding,
- summed into a 768-dimensional representation.
Step 3: transformer encoding
The 6 layers contextualize meaning. Lower layers mostly preserve language structure; higher layers adapt more to task semantics. The uploaded scenario notes gradient magnitude is strongest in the head and upper layers, and weakest in embeddings and lower layers.
Step 4: classification head
The [CLS] vector goes through a linear layer to produce 10 logits.
Example logits:
product_discovery: 0.8
product_question: 1.7
recommendation: 1.5
faq: -0.2
order_tracking: -1.4
return_request: -1.1
promotion: -0.6
checkout_help: -1.7
escalation: -2.0
chitchat: -1.0
Step 5: softmax to probabilities
The softmax converts logits into probabilities:
[ \hat{y}_i = \frac{e^{z_i/T}}{\sum_{j=1}^{C} e^{z_j/T}} ]
where:
- (z_i) is the logit for class (i),
- (T) is the temperature,
- (C = 10) is the number of classes.
Example probabilities (softmax of the logits above at T = 1, rounded):
product_question: 0.37
recommendation: 0.30
product_discovery: 0.15
others combined: 0.18
Predicted class: product_question
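A minimal sketch of Step 5, assuming T = 1 and plain Python with no ML framework. It applies the softmax formula to the example logits from Step 4 and prints the three most likely intents. The helper name `softmax` and the `LOGITS` dictionary are illustrative and not part of the scenario.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Example logits from Step 4, keyed by intent name.
LOGITS = {
    "product_discovery": 0.8,
    "product_question": 1.7,
    "recommendation": 1.5,
    "faq": -0.2,
    "order_tracking": -1.4,
    "return_request": -1.1,
    "promotion": -0.6,
    "checkout_help": -1.7,
    "escalation": -2.0,
    "chitchat": -1.0,
}

probs = dict(zip(LOGITS, softmax(list(LOGITS.values()))))
predicted = max(probs, key=probs.get)

# Show the three highest-probability intents.
for intent, p in sorted(probs.items(), key=lambda kv: -kv[1])[:3]:
    print(f"{intent}: {p:.2f}")
print("predicted:", predicted)
```

Lowering `temperature` below 1 sharpens the distribution and raising it flattens it, while the predicted class stays the same.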
Step 6: compare with true label
Suppose true label is recommendation.
The prediction is wrong, so the loss will push probability mass away from product_question and toward recommendation.
6) Math and Formulas You Should Understand
6.1 Cross-Entropy Loss
For one-hot labels:
[ \mathcal{L}_{CE} = -\log(\hat{y}_{y_{true}}) ]
If the true class probability is:
- 0.9, loss = (-\log(0.9) \approx 0.105),
- 0.1, loss = (-\log(0.1) \approx 2.303).
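The arithmetic above can be reproduced with a few lines of plain Python. This is only a sketch: `cross_entropy` is an illustrative helper that takes the true-class probability directly, rather than a framework loss function.

```python
import math

def cross_entropy(p_true):
    """Cross-entropy for a one-hot label: -log of the true-class probability."""
    return -math.log(p_true)

# The loss grows quickly as the true-class probability shrinks.
for p in (0.9, 0.5, 0.1):
    print(f"p_true={p:.1f} -> loss={cross_entropy(p):.3f}")
```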
Why it matters
- correct + confident prediction -> small loss,
- wrong + confident prediction -> large loss.
Decision takeaway
Use cross-entropy as the baseline. If class imbalance is hurting rare classes, move to focal loss.
6.2 Focal Loss
The scenario uses focal loss to handle imbalance:
[ \mathcal{L}_{FL} = -\alpha_t (1 - p_t)^{\gamma} \log(p_t) ]
where:
- (p_t) = probability of the true class,
- (\alpha_t) = class weight,
- (\gamma) = focusing parameter.
Intuition
- easy examples get down-weighted,
- hard examples get emphasized,
- rare classes benefit more.
Example
If (p_t = 0.9) and (\gamma = 2), the focal weight is:
[ (1 - 0.9)^2 = 0.01 ]
So easy examples contribute very little.
If (p_t = 0.1) with the same (\gamma), the focal weight is:
[ (1 - 0.1)^2 = 0.81 ]
Hard examples still contribute strongly. This is why focal loss improved rare-class performance in the uploaded scenario.
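To make the down-weighting concrete, the sketch below compares focal loss with cross-entropy for the two probabilities used in the example. It assumes (\alpha_t = 1) so that only the focusing term is visible; `focal_loss` and `cross_entropy` are illustrative helpers, not the scenario's training code.

```python
import math

def cross_entropy(p_t):
    """Plain cross-entropy given the true-class probability."""
    return -math.log(p_t)

def focal_loss(p_t, gamma=2.0, alpha_t=1.0):
    """Focal loss for one example: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

for p_t in (0.9, 0.1):
    ce = cross_entropy(p_t)
    fl = focal_loss(p_t)
    # The ratio focal/CE equals the focal weight (1 - p_t)^gamma.
    print(f"p_t={p_t}: CE={ce:.4f} focal={fl:.4f} focal/CE={fl / ce:.2f}")
```

With (\gamma = 0) the focal weight is 1 and the function reduces to ordinary cross-entropy, which is a quick sanity check for any implementation.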
Decision takeaway
Use focal loss when:
- class imbalance is real,
- rare-class recall matters,
- you want no inference overhead.
6.3 Class Weights
The scenario defines class weights from inverse frequency:
[ \alpha_t = \frac{1}{\text{freq}(t)} \cdot \frac{1}{\sum_{c=1}^{C} 1/\text{freq}(c)} ]
Intuition
Rare classes like escalation get higher weight than frequent classes like product_discovery. The uploaded scenario notes the rare escalation class gets about 7.3x the weight of product_discovery.
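As a rough check of the weighting formula, the sketch below computes normalized inverse-frequency weights. It assumes the class shares from the example data-audit log in Stage A stand in for freq(t); whether the scenario derived its weights from exactly these shares is not stated, so treat the numbers as illustrative.

```python
# Class shares (percent) copied from the example data-audit log in Stage A.
FREQ = {
    "product_discovery": 22.1, "recommendation": 18.0, "product_question": 15.2,
    "faq": 8.0, "order_tracking": 12.1, "return_request": 7.0, "promotion": 5.0,
    "checkout_help": 4.1, "escalation": 3.0, "chitchat": 5.5,
}

def inverse_frequency_weights(freq):
    """Normalized inverse-frequency weights: alpha_t = (1/freq_t) / sum_c(1/freq_c)."""
    inverse = {name: 1.0 / share for name, share in freq.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}

weights = inverse_frequency_weights(FREQ)
ratio = weights["escalation"] / weights["product_discovery"]
print(f"escalation / product_discovery = {ratio:.2f}")
```

Because the normalizer cancels in the ratio, the relative weight of two classes is simply the inverse ratio of their frequencies, which with these shares lands close to the roughly 7.3x figure quoted above.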
Decision takeaway
If the model ignores small classes, increase class weights carefully. Too much weighting can overfit noise in rare labels.
6.4 Gradient Flow and Why Top Layers Change More
The uploaded scenario shows that during fine-tuning:
- the classification head gets the strongest gradients,
- upper encoder layers adapt more,
- embeddings and lower layers move very little.
This leads to the engineering idea of discriminative learning rates.
6.5 Discriminative Learning Rate
[ \eta_l = \eta_{base} \cdot \xi^{(L-l)} ]
where:
- (\eta_{base}) = top-layer LR,
- (\xi) = decay factor,
- (L) = total layers,
- (l) = current layer.
The uploaded scenario uses a base rate near 2e-5 and a decay factor near 0.8.
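The per-layer rates implied by this formula can be tabulated with a short sketch. It assumes (L = 6) encoder layers, a base rate of 2e-5, and a decay of 0.8, and it treats index 0 as the embedding layer; that indexing is an assumption of this example rather than something the scenario specifies.

```python
def layer_learning_rates(base_lr=2e-5, decay=0.8, num_layers=6):
    """Return {layer_index: lr} with eta_l = base_lr * decay ** (num_layers - l).

    Index num_layers is the top encoder layer; index 0 is treated here as the
    embedding layer (an assumption about how layers are numbered).
    """
    return {l: base_lr * decay ** (num_layers - l) for l in range(num_layers, -1, -1)}

rates = layer_learning_rates()
for layer, lr in rates.items():
    print(f"layer {layer}: lr={lr:.2e}")
```

In a real training script these values would typically be passed to the optimizer as separate parameter groups, one per layer.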
Intuition
- top layers need more freedom,
- bottom layers should preserve general language knowledge,
- this reduces catastrophic forgetting.
Decision takeaway
If you see unstable lower-layer drift or overfitting, reduce lower-layer LR or freeze lower layers temporarily.
6.6 Warmup + Decay Schedule
[ \eta(t)= \begin{cases} \eta_{max} \cdot \frac{t}{T_{warmup}} & t < T_{warmup} \\ \eta_{max} \cdot \frac{T_{total}-t}{T_{total}-T_{warmup}} & t \ge T_{warmup} \end{cases} ]
Why warmup matters
At step 0 the classifier head is randomly initialized. If the LR is too large immediately:
- early gradients are noisy,
- the encoder gets damaged,
- training becomes unstable.
The uploaded scenario uses 10% warmup.
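The piecewise schedule can be sketched as a single function of the step index. This example assumes a peak rate of 2e-5, 10% warmup, and the 4688 total steps that appear in the example training logs; `learning_rate` is an illustrative helper, not a library call.

```python
def learning_rate(step, total_steps, max_lr=2e-5, warmup_ratio=0.10):
    """Linear warmup to max_lr, then linear decay to zero at total_steps."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    return max_lr * (total_steps - step) / (total_steps - warmup_steps)

TOTAL_STEPS = 4688  # step count shown in the example training logs
for step in (0, 50, 250, 468, 1800, 4100, 4688):
    print(f"step={step:>4} lr={learning_rate(step, TOTAL_STEPS):.2e}")
```

Plotting this curve next to the logged LR is a cheap way to confirm that the scheduler in the training job is configured as intended.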
Decision takeaway
If the first few hundred steps look unstable, increase warmup or reduce base LR.
6.7 Gradient Clipping
For a gradient vector (g) with norm (\lVert g \rVert):
[ g_{clipped} = g \cdot \min\left(1, \frac{\tau}{\lVert g \rVert}\right) ]
where (\tau) is the maximum norm, often 1.0.
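A minimal sketch of clipping by global norm, written in plain Python over a flat list of gradient values. The helper `clip_by_global_norm` is hypothetical; deep learning frameworks ship their own equivalents that operate on parameter tensors.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale a flat list of gradient values so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grads], norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0])  # norm 5.0 -> scaled by 0.2
untouched, _ = clip_by_global_norm([0.3, 0.4])          # norm 0.5 -> unchanged
print(norm_before, clipped, untouched)
```

Clipping rescales the whole gradient, so its direction is preserved and only the step length is limited.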
Why it matters
It prevents exploding updates from hard examples or noisy batches.
Decision takeaway
If gradient norm spikes or loss suddenly explodes, clipping is one of the first safeguards.
6.8 AdamW Update Rule
Conceptually:
[ \theta \leftarrow \theta - \eta \cdot \frac{\hat{m}}{\sqrt{\hat{v}} + \epsilon} - \eta \lambda \theta ]
where:
- (\eta) = learning rate,
- (\hat{m}) = bias-corrected first moment,
- (\hat{v}) = bias-corrected second moment,
- (\epsilon) = small constant for numerical stability,
- (\lambda) = weight decay.
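The update rule can be traced by hand for a single scalar parameter. In the sketch below, `beta1`, `beta2`, `eps`, and `weight_decay` are commonly used values chosen for illustration; the scenario does not state them, so treat them as assumptions.

```python
import math

def adamw_step(theta, grad, m, v, t, lr=2e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter; t is the 1-based step."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Adaptive step plus decoupled weight decay, both using the old theta.
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps) - lr * weight_decay * theta
    return theta, m, v

theta, m, v = 0.5, 0.0, 0.0
for t, grad in enumerate([0.2, 0.1, -0.05], start=1):
    theta, m, v = adamw_step(theta, grad, m, v, t)
    print(f"t={t} theta={theta:.8f}")
```

Note that the weight-decay term is applied directly to the parameter instead of being added to the gradient, which is the difference from Adam with L2 regularization.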
Decision takeaway
AdamW is preferred over plain Adam because its weight decay is decoupled from the adaptive gradient update, so regularization behaves more predictably during transformer fine-tuning.
6.9 KL-Divergence for Drift Detection
The scenario uses KL divergence to compare production intent distribution with training distribution:
[ D_{KL}(P \parallel Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)} ]
where:
- (P) = live traffic distribution,
- (Q) = reference/training distribution.
The uploaded scenario uses rough drift signals like:
- ~0.012 -> monitor,
- ~0.028 -> retrain.
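A small sketch of the drift computation, using natural logarithms. The reference distribution reuses the class shares from the example data-audit log in Stage A, while the `LIVE` mix is entirely hypothetical and exists only to produce a non-zero divergence.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) in nats for two distributions given as {label: probability}."""
    return sum(p[k] * math.log(p[k] / max(q.get(k, 0.0), eps))
               for k in p if p[k] > 0)

# Reference shares from the example data-audit log (converted to fractions).
REFERENCE = {
    "product_discovery": 0.221, "recommendation": 0.180, "product_question": 0.152,
    "faq": 0.080, "order_tracking": 0.121, "return_request": 0.070,
    "promotion": 0.050, "checkout_help": 0.041, "escalation": 0.030, "chitchat": 0.055,
}
# Hypothetical live mix with more order-tracking and return traffic.
LIVE = {
    "product_discovery": 0.18, "recommendation": 0.16, "product_question": 0.15,
    "faq": 0.08, "order_tracking": 0.18, "return_request": 0.09,
    "promotion": 0.05, "checkout_help": 0.04, "escalation": 0.03, "chitchat": 0.04,
}

drift = kl_divergence(LIVE, REFERENCE)
print(f"KL(live || reference) = {drift:.4f}")
```

The `eps` floor guards against a class that appears in live traffic but had zero share in the reference distribution, which would otherwise cause a division by zero.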
Decision takeaway
Drift alone is not enough. Combine:
- KL divergence,
- low confidence rate,
- live accuracy from sampled labels,
- business incidents.
7) Single Training Step Dry Run
flowchart TD
A[Batch of 32 examples] --> B[Tokenize + pad to max_len]
B --> C[Forward pass through DistilBERT]
C --> D[Get logits for 10 classes]
D --> E[Apply softmax]
E --> F[Compute focal loss]
F --> G[Backpropagation]
G --> H[Clip gradients]
H --> I[Optimizer step]
I --> J[Scheduler step]
J --> K[Zero gradients]
Tensor view
- input_ids: (32, 128)
- attention_mask: (32, 128)
- hidden states: (32, 128, 768)
- pooled [CLS]: (32, 768)
- logits: (32, 10)
- loss: scalar
8) What Engineers Observe During Fine-Tuning
8.1 Core training metrics
| Metric | Why it matters | Healthy pattern | Warning sign |
|---|---|---|---|
| train loss | optimization progress | decreases steadily | flat or explodes |
| val loss | generalization | decreases, then stabilizes | rises while train loss falls |
| accuracy | top-line classification quality | improves by epoch | unstable or plateau too early |
| macro F1 | balanced view across classes | improves with rare classes | lower than accuracy by a lot |
| rare-class recall | important for small classes | rises after focal loss/weights | near zero or unstable |
| confusion matrix | where mistakes happen | errors cluster in similar classes | critical classes collapse |
| confidence on correct predictions | calibration + separability | increases | low and flat |
| confidence on wrong predictions | overconfidence risk | stays moderate or drops | very high confidence on errors |
8.2 Optimization metrics
| Metric | Why it matters | Warning sign |
|---|---|---|
| learning rate | confirms scheduler works | wrong warmup/decay behavior |
| gradient norm | training stability | spikes or collapse to near zero |
| layer-wise gradient norm | confirms top layers adapt more | lower layers moving too much |
| parameter update norm | actual step size | too large or vanishing |
| weight norm | model stability | runaway growth |
8.3 Data quality metrics
| Metric | Why it matters | Warning sign |
|---|---|---|
| class distribution | imbalance check | rare classes too small |
| label noise rate | bad supervision hurts training | high disagreement in audits |
| duplicate rate | train/val leakage risk | repeated samples across splits |
| text length distribution | truncation risk | important content chopped |
| synthetic vs production ratio | noise and realism balance | too much synthetic data |
8.4 System metrics
| Metric | Why it matters | Warning sign |
|---|---|---|
| GPU utilization | efficiency | very low -> input bottleneck |
| tokens/sec or examples/sec | throughput | sudden drop |
| step time | training stability | high jitter |
| memory / VRAM | OOM prevention | near max or fragmented |
| data loader latency | pipeline bottleneck | GPU idle time |
9) Stage-by-Stage Decisions in Real Engineering Terms
Stage A — Data collection and audit
What to check
- Is the label taxonomy clean?
- Are the 10 intents clearly distinguishable from one another?
- Are there hidden multi-intent messages?
- Are rare classes too small?
Decisions
- merge overlapping intents if ambiguity is too high,
- add labeling guidelines,
- add synthetic examples only after filtering,
- sample more low-confidence production examples.
Typical decision metrics
- class frequency,
- annotator agreement,
- label noise estimate,
- percentage of ambiguous examples,
- percentage of multi-intent messages.
Example production data audit log
[DATA_AUDIT] run_id=ft_2026_04_20_01
[DATA_AUDIT] total_examples=55000 production=50000 synthetic=5000
[DATA_AUDIT] label_distribution={product_discovery:22.1, recommendation:18.0, product_question:15.2, faq:8.0, order_tracking:12.1, return_request:7.0, promotion:5.0, checkout_help:4.1, escalation:3.0, chitchat:5.5}
[DATA_AUDIT] duplicate_rate=1.8%
[DATA_AUDIT] jp_en_mixed=11.7%
[DATA_AUDIT] multi_intent_estimate=17.9%
[DATA_AUDIT] synthetic_low_quality_estimate=9.6%
[DECISION] filter_synthetic=yes reason="noise above 5% threshold"
Stage B — Train/val/test split
What to check
- no leakage,
- stratification maintained,
- all rare intents appear in val/test,
- no duplicate conversation chunks across splits.
Decisions
- use stratified split,
- optionally group by conversation/user to avoid leakage,
- reserve a hand-curated golden set.
Typical decision metrics
- class parity across splits,
- duplicate cross-split count,
- user/thread leakage count.
Stage C — Model selection
The uploaded scenario compares TinyBERT, DistilBERT, and RoBERTa, and picks DistilBERT because it balances accuracy, latency, and cost better for the 15 ms routing budget.
Decision logic
- If this classifier is in the critical path, latency matters a lot.
- A small accuracy gain from a heavier model may not justify slower routing.
- Cost-per-quality-point matters.
Example model selection log
[MODEL_BENCH] candidates=[tinybert, distilbert, roberta_base]
[MODEL_BENCH] tinybert acc=87.3 p95_ms=5 monthly_cost=89
[MODEL_BENCH] distilbert acc=92.1 p95_ms=12 monthly_cost=178
[MODEL_BENCH] roberta_base acc=94.8 p95_ms=28 monthly_cost=348
[DECISION] selected=distilbert reason="meets accuracy threshold and fits p95 latency budget"
Stage D — Loss function choice
The uploaded scenario compares standard CE, weighted CE, oversampling, focal loss, and focal + weighted, and chooses focal + weighted because rare-class accuracy is best with little training overhead.
Decision logic
- If overall accuracy is fine but rare classes are poor -> try weighting or focal loss.
- If oversampling increases training time too much and duplicates noisy rare labels -> prefer focal loss.
Example loss-choice log
[ABLATION] standard_ce overall_acc=91.2 rare_acc=78.4 train_min=36
[ABLATION] weighted_ce overall_acc=91.5 rare_acc=84.2 train_min=36
[ABLATION] oversampling overall_acc=91.8 rare_acc=85.1 train_min=52
[ABLATION] focal overall_acc=92.1 rare_acc=87.8 train_min=38
[ABLATION] focal_weighted overall_acc=92.1 rare_acc=88.6 train_min=38
[DECISION] selected=focal_weighted gamma=2.0 reason="best rare-class accuracy with low training overhead"
Stage E — Optimization setup
What to decide
- base learning rate,
- discriminative LR decay,
- warmup ratio,
- batch size,
- epochs,
- gradient clipping,
- weight decay.
Typical starting point from the scenario
- base LR ~2e-5,
- warmup ~10%,
- batch size 32,
- epochs 3,
- gamma 2.0,
- layer LR decay ~0.8,
- clip grad norm 1.0.
What engineers watch live
- step loss,
- smoothed loss,
- val loss each epoch,
- gradient norm,
- LR curve,
- throughput,
- GPU memory.
Example training logs
[TRAIN] run_id=ft_2026_04_20_01 epoch=1 step=50/4688 lr_head=2.1e-06 lr_bottom=5.5e-07 loss=1.842 grad_norm=0.71 gpu_mem_gb=6.8 ex_per_sec=122
[TRAIN] run_id=ft_2026_04_20_01 epoch=1 step=250/4688 lr_head=1.05e-05 lr_bottom=2.7e-06 loss=1.214 grad_norm=0.88 gpu_mem_gb=6.9 ex_per_sec=121
[TRAIN] run_id=ft_2026_04_20_01 epoch=1 step=469/4688 lr_head=2.10e-05 lr_bottom=5.2e-06 loss=1.067 grad_norm=0.93 gpu_mem_gb=6.9 ex_per_sec=120
[TRAIN] run_id=ft_2026_04_20_01 epoch=2 step=1800/4688 lr_head=1.62e-05 lr_bottom=4.0e-06 loss=0.612 grad_norm=0.64 gpu_mem_gb=6.9 ex_per_sec=121
[TRAIN] run_id=ft_2026_04_20_01 epoch=3 step=4100/4688 lr_head=2.80e-06 lr_bottom=7.0e-07 loss=0.301 grad_norm=0.39 gpu_mem_gb=6.9 ex_per_sec=120
Decision examples during training
- Loss exploding early -> reduce LR or increase warmup.
- Val loss rises after epoch 2 -> stop at best checkpoint.
- Rare-class recall flat -> re-check weights, label quality, or sampling.
- GPU underutilized -> fix data loader bottleneck.
Stage F — Validation and checkpoint selection
What to check
- overall accuracy,
- macro F1,
- rare-class recall,
- confusion matrix,
- latency on validation harness,
- calibration,
- no regression on critical intents.
Golden-set promotion gate
Use a hand-reviewed set with edge cases:
- slang,
- mixed language,
- multi-intent,
- rare intents,
- ambiguous examples.
Example validation log
[VAL] run_id=ft_2026_04_20_01 epoch=1 loss=0.88 acc=90.6 macro_f1=88.9 rare_recall=82.7 ece=0.061
[VAL] run_id=ft_2026_04_20_01 epoch=2 loss=0.39 acc=91.9 macro_f1=90.8 rare_recall=87.2 ece=0.043
[VAL] run_id=ft_2026_04_20_01 epoch=3 loss=0.31 acc=92.1 macro_f1=91.2 rare_recall=88.6 ece=0.037
[CHECKPOINT] best_epoch=3 criterion="max macro_f1 subject to latency < 15ms"
Decision logic
Promote only if:
- overall accuracy beats the current model,
- rare-class metrics do not regress,
- calibration is acceptable,
- latency is still within budget.
Stage G — Deployment readiness
What to check before serving
- model artifact loads correctly,
- tokenizer version matches,
- compiled serving artifact works,
- endpoint meets P50/P95/P99 goals,
- health checks pass,
- rollback plan exists.
Example deployment log
[DEPLOY] candidate_model=v14 tokenizer_hash=9b2a1 compiled=true target=inf2.xlarge
[LOAD_TEST] p50_ms=7.9 p95_ms=12.4 p99_ms=18.3 error_rate=0.02% rps=500
[SHADOW] agreement_with_champion=96.1% disagreement_rate=3.9%
[SHADOW] critical_intent_regression=false
[DECISION] promote=yes strategy="blue_green"
Stage H — Production monitoring
Key live metrics
| Metric | Why it matters | Example threshold |
|---|---|---|
| P50/P95/P99 latency | routing must stay fast | P95 < 15 ms |
| endpoint error rate | serving health | < 0.5% |
| low-confidence rate | uncertainty / drift | < 8% |
| live sampled accuracy | real quality | > 90% |
| rare-class sampled recall | safety on small classes | > 85% |
| KL divergence | distribution drift | alert > 0.02 |
| intent mix shift | business or drift changes | investigate large changes |
| fallback/escalation rate | downstream impact | investigate spike |
Example production logs
[SERVE] ts=2026-04-20T10:41:12Z model=v14 req_id=8f2a latency_ms=8.7 intent=product_question confidence=0.74 tokens=14
[SERVE] ts=2026-04-20T10:41:13Z model=v14 req_id=8f2b latency_ms=11.1 intent=order_tracking confidence=0.96 tokens=6
[SERVE] ts=2026-04-20T10:41:14Z model=v14 req_id=8f2c latency_ms=13.9 intent=recommendation confidence=0.51 fallback_rerank=true tokens=22
[MONITOR_HOURLY] model=v14 p50_ms=8.1 p95_ms=12.8 p99_ms=19.7 err_rate=0.08% low_conf_rate=6.2% kl_div=0.011
[MONITOR_DAILY] model=v14 sampled_acc=91.7 rare_recall=87.9 fallback_rate=4.1%
[MONITOR_WEEKLY] model=v14 kl_div=0.029 sampled_acc=89.8 low_conf_rate=9.3%
[ALERT] retrain_trigger=true reason="drift + live accuracy degradation"
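The alerting logic behind such a log can be sketched as a simple threshold check. The limits below are taken from the "Key live metrics" table, and the two snapshots reuse values from the example hourly and weekly monitor lines. How breaches are combined into a retrain decision is not specified in the scenario, so this hypothetical `breached_metrics` helper only lists them.

```python
# Thresholds taken from the "Key live metrics" table above.
THRESHOLDS = {
    "kl_div": ("max", 0.02),
    "low_conf_rate": ("max", 8.0),   # percent
    "sampled_acc": ("min", 90.0),    # percent
}

def breached_metrics(snapshot):
    """Return the names of metrics in snapshot that violate their threshold."""
    breached = []
    for name, (kind, limit) in THRESHOLDS.items():
        value = snapshot.get(name)
        if value is None:
            continue  # metric not reported in this snapshot
        if (kind == "max" and value > limit) or (kind == "min" and value < limit):
            breached.append(name)
    return breached

weekly = {"kl_div": 0.029, "sampled_acc": 89.8, "low_conf_rate": 9.3}
hourly = {"kl_div": 0.011, "low_conf_rate": 6.2}
print("weekly:", breached_metrics(weekly))
print("hourly:", breached_metrics(hourly))
```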
10) Important Decision Metrics Beyond Basic Accuracy
These are the metrics strong GenAI / ML engineers care about, not just top-line accuracy.
10.1 Macro F1
Useful because overall accuracy can hide weak rare classes.
[ F1 = 2 \cdot \frac{precision \cdot recall}{precision + recall} ]
Why it matters
If large classes dominate, accuracy can look good while small classes fail badly.
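The gap between accuracy and macro F1 is easy to see on a toy case. The counts below are hypothetical: a two-class problem with one large class handled well and one small class handled poorly. The helpers `f1_score` and `macro_f1` are illustrative names.

```python
def f1_score(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(counts):
    """Unweighted mean of per-class F1; counts maps class -> (tp, fp, fn)."""
    return sum(f1_score(*c) for c in counts.values()) / len(counts)

# Hypothetical (tp, fp, fn) counts; each class's fp equals the other's fn.
counts = {"product_discovery": (900, 30, 40), "escalation": (10, 40, 30)}
correct = sum(tp for tp, _, _ in counts.values())
total = sum(tp + fn for tp, _, fn in counts.values())
accuracy = correct / total
print(f"accuracy={accuracy:.3f} macro_f1={macro_f1(counts):.3f}")
```

Accuracy is dominated by the large class, while macro F1 gives the small class an equal vote, so it drops sharply when that class is handled badly.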
10.2 Per-class recall
Especially important for:
- escalation,
- checkout_help,
- promotion,
- any business-critical or compliance-sensitive intent.
Why it matters
A bad miss rate on a small but important intent may hurt operations more than a small dip in overall accuracy.
10.3 Calibration / ECE
A classifier should not only be accurate; its confidence should mean something.
Expected Calibration Error conceptually compares:
- predicted confidence,
- actual correctness.
Why it matters
If the model often reports 0.95 confidence but is correct only 75% of the time, it is overconfident.
This hurts fallback routing and active learning selection.
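One common way to compute ECE is to bin predictions by confidence and average the gap between accuracy and confidence in each bin. The sketch below assumes 10 equal-width bins and uses a toy input that mirrors the overconfidence example above; the scenario does not say which binning it uses.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE: bin predictions by confidence, then average |accuracy - confidence|
    per bin, weighted by the fraction of samples in that bin."""
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * num_bins), num_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(h for _, h in bucket) / len(bucket)
        ece += len(bucket) / total * abs(accuracy - avg_conf)
    return ece

# Overconfident toy model: always says 0.95 but is right 3 times out of 4.
overconfident = expected_calibration_error([0.95] * 4, [1, 1, 1, 0])
# Calibrated toy model: says 0.75 and is right 3 times out of 4.
calibrated = expected_calibration_error([0.75] * 4, [1, 1, 1, 0])
print(f"overconfident ECE={overconfident:.2f} calibrated ECE={calibrated:.2f}")
```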
10.4 Low-confidence traffic rate
Percentage of requests where top prediction confidence is below a threshold, for example 0.6.
Why it matters
A rise often indicates:
- new user behavior,
- domain drift,
- broken preprocessing,
- too much ambiguity.
10.5 Confusion concentration
Study which intent pairs are commonly confused.
Why it matters
Some confusion is acceptable if both intents route to similar downstream systems. Other confusion is severe if it sends the user to the wrong workflow.
Example:
- product_discovery vs recommendation may be acceptable,
- order_tracking vs product_discovery is much worse.
10.6 Cost-per-quality-point
This is a practical engineering metric.
[ \text{cost per quality point} = \frac{\Delta \text{monthly cost}}{\Delta \text{accuracy points}} ]
The uploaded scenario explicitly reasons this way when comparing DistilBERT and RoBERTa.
Why it matters
It keeps the team from over-optimizing for tiny quality gains at large cost.
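Applied to the figures in the example model-selection log from Stage C, the formula can be evaluated as follows. The `BENCH` dictionary simply copies those log values; the cost unit is whatever the log uses, which the scenario does not state.

```python
# Figures copied from the example model-selection log (Stage C).
BENCH = {
    "tinybert": {"acc": 87.3, "monthly_cost": 89},
    "distilbert": {"acc": 92.1, "monthly_cost": 178},
    "roberta_base": {"acc": 94.8, "monthly_cost": 348},
}

def cost_per_quality_point(baseline, candidate):
    """Extra monthly cost paid per extra accuracy point when moving to candidate."""
    delta_cost = BENCH[candidate]["monthly_cost"] - BENCH[baseline]["monthly_cost"]
    delta_acc = BENCH[candidate]["acc"] - BENCH[baseline]["acc"]
    return delta_cost / delta_acc

step_up = cost_per_quality_point("tinybert", "distilbert")
step_further = cost_per_quality_point("distilbert", "roberta_base")
print(f"tinybert -> distilbert: {step_up:.1f} per point")
print(f"distilbert -> roberta_base: {step_further:.1f} per point")
```

With these numbers, each accuracy point gained by moving from DistilBERT to RoBERTa costs several times more than a point gained by moving from TinyBERT to DistilBERT, before even considering the latency budget.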
10.7 Retraining ROI
Measure whether labeling + retraining cost is justified by recovered quality.
Why it matters
A good MLOps team does not retrain on habit alone. It retrains when:
- quality has drifted,
- business value is clear,
- new labels are informative.
11) Important Failure Modes to Watch
11.1 Label noise
Symptoms:
- train loss stays oddly high,
- confusing samples dominate error analysis,
- model confidence is unstable.
Action:
- re-audit labels,
- improve annotation policy,
- remove contradictory samples.
11.2 Catastrophic forgetting
Symptoms:
- early instability,
- lower layers move too much,
- general language behavior worsens.
Action:
- lower LR,
- add warmup,
- freeze lower layers temporarily,
- shorten training.
11.3 Overfitting
Symptoms:
- train loss keeps dropping,
- val loss rises,
- confidence gets sharper but wrong more often on new data.
Action:
- early stop,
- reduce epochs,
- strengthen regularization,
- improve validation set quality.
11.4 Serving mismatch
Symptoms:
- offline accuracy is good but live accuracy is poor.
Action:
- verify tokenizer parity,
- verify preprocessing parity,
- compare offline and online text normalization,
- inspect shadow-mode disagreements.
11.5 Drift masked by stable accuracy
Sometimes top-line accuracy looks okay but intent mix has changed.
Action:
- inspect KL divergence,
- inspect low-confidence rate,
- inspect classwise performance,
- inspect business routing outcomes.
12) Recommended Promotion Gates
Use a clear gate table before moving a model to production.
| Gate | Proposed rule |
|---|---|
| Overall accuracy | candidate >= current champion |
| Macro F1 | candidate >= current champion |
| Rare-class recall | no critical regression |
| Calibration | ECE not worse than allowed margin |
| P95 latency | < 15 ms |
| Error rate in load test | below operational threshold |
| Drift robustness | passes shadow test on recent traffic |
| Explainability / audit | confusion matrix + sample review completed |
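
The numeric gates in this table can be encoded as a small check. In the sketch below, the candidate metrics come from the example validation and load-test logs, while the champion metrics, the allowed ECE margin, and the error-rate limit are hypothetical placeholders. Treating "no critical regression" as "not lower than the champion" is also an assumption, and the two review-based gates are left to humans.

```python
def check_promotion_gates(candidate, champion, max_ece_increase=0.01,
                          p95_budget_ms=15.0, max_error_rate=0.005):
    """Return {gate_name: passed} for the numeric gates in the table above."""
    return {
        "overall_accuracy": candidate["acc"] >= champion["acc"],
        "macro_f1": candidate["macro_f1"] >= champion["macro_f1"],
        "rare_class_recall": candidate["rare_recall"] >= champion["rare_recall"],
        "calibration": candidate["ece"] <= champion["ece"] + max_ece_increase,
        "p95_latency": candidate["p95_ms"] < p95_budget_ms,
        "load_test_error_rate": candidate["error_rate"] < max_error_rate,
    }

# Candidate values from the example validation (epoch 3) and load-test logs.
candidate = {"acc": 92.1, "macro_f1": 91.2, "rare_recall": 88.6,
             "ece": 0.037, "p95_ms": 12.4, "error_rate": 0.0002}
# Hypothetical champion metrics, invented for this illustration only.
champion = {"acc": 91.4, "macro_f1": 90.1, "rare_recall": 86.0, "ece": 0.045}

gates = check_promotion_gates(candidate, champion)
for name, passed in gates.items():
    print(f"{name}: {'pass' if passed else 'FAIL'}")
print("promote:", all(gates.values()))
```

Returning a per-gate result rather than a single boolean makes it easier to log which gate blocked a promotion.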
13) Recommended Dashboard Layout
Training dashboard
- train loss
- val loss
- accuracy
- macro F1
- rare-class recall
- LR curve
- gradient norm
- GPU utilization
- step time
Validation dashboard
- confusion matrix
- per-class precision/recall/F1
- reliability plot / ECE
- confidence histograms
- top false positives
- top false negatives
Production dashboard
- P50/P95/P99 latency
- request volume
- error rate
- low-confidence rate
- intent distribution
- KL divergence
- sampled accuracy
- rare-class recall
- fallback rate
- rollback status
14) Practical Decision Tree
flowchart TD
A[Training run finished] --> B{Val loss lower?}
B -- No --> C[Stop or reduce LR / epochs]
B -- Yes --> D{Macro F1 improved?}
D -- No --> E[Inspect class imbalance and confusion matrix]
D -- Yes --> F{Rare-class recall improved?}
F -- No --> G[Adjust focal gamma / class weights / data]
F -- Yes --> H{Calibration acceptable?}
H -- No --> I[Apply calibration or revise training]
H -- Yes --> J{Latency under budget?}
J -- No --> K[Optimize serving or choose smaller model]
J -- Yes --> L[Promote to shadow mode]
L --> M{Shadow regressions?}
M -- Yes --> N[Rollback and inspect mismatches]
M -- No --> O[Promote to production]
15) What a Strong Engineer Would Say in Review
A strong ML/GenAI engineer would not say:
- “Accuracy improved, so we are done.”
They would say:
- “The model improved overall, but more importantly rare-class recall improved without breaking the latency budget.”
- “Upper layers adapted as expected, lower-layer drift stayed controlled, and warmup prevented early instability.”
- “Calibration improved, so confidence can be used for fallback routing and active learning.”
- “Production monitoring combines latency, confidence, drift, and sampled accuracy, so we can retrain based on evidence, not guesswork.”
- “We chose DistilBERT because it is the best system-level decision, not just the best raw-accuracy model.”
16) Final Engineering Summary
For this MangaAssist fine-tuning scenario:
- DistilBERT is the right operating point because it balances quality, latency, and cost.
- Focal loss + class weights is the right imbalance strategy because it improves rare classes with no inference penalty.
- Discriminative learning rates + warmup are critical because lower layers should be preserved while upper layers adapt.
- Validation must go beyond accuracy into macro F1, rare-class recall, confusion matrix, calibration, and latency.
- Production readiness is not just model quality; it includes logs, shadow tests, serving performance, drift signals, and retraining ROI.
That is the full dry run mindset: from data, to math, to optimization, to decision gates, to production observability.
17) Suggested Next Extensions
If you want to deepen this document later, add:
1. a worked numerical example of focal loss on one batch,
2. a confusion-matrix interpretation section for the 10 intents,
3. active-learning sampling math,
4. calibration plots and ECE computation,
5. a fallback routing design for low-confidence predictions.