
Fine-Tuning Dry Run Document — Intent Classifier (DistilBERT) for MangaAssist

This document is a practical dry run of the fine-tuning process for the uploaded MangaAssist intent-classification scenario. It is designed to help you understand the math, formulas, training flow, production logs, decision metrics, and how engineering decisions are made at each stage.


1) Goal of the System

MangaAssist sends each user message to an intent classifier before calling downstream services. The classifier must map messages like:

  • "Show me horror manga"
  • "Where is my order?"
  • "Something like Naruto but darker"

into one of 10 intents, while meeting a P95 latency target under 15 ms. The uploaded scenario uses DistilBERT, starts from 83.2% domain accuracy out of the box, and improves to about 92.1% after domain fine-tuning.


2) Problem Framing

Business problem

If intent classification is wrong:

  • users get routed to the wrong service,
  • extra LLM or backend calls may be triggered,
  • customer experience drops,
  • cost rises.

ML problem

This is a multi-class text classification problem with:

  • 10 classes,
  • class imbalance,
  • domain jargon,
  • slang,
  • Japanese-English mixed text,
  • multi-intent ambiguity.

Why this is hard

The uploaded scenario highlights four challenges:

  1. manga jargon,
  2. slang,
  3. mixed intents,
  4. Japanese-English mixing.


3) End-to-End Fine-Tuning Lifecycle

flowchart TD
    A[Collect labeled production logs] --> B[Clean + normalize text]
    B --> C[Add filtered synthetic data]
    C --> D[Train/val/test split]
    D --> E[Tokenization]
    E --> F[Load pre-trained DistilBERT]
    F --> G[Fine-tune with focal loss + discriminative LR + warmup]
    G --> H[Validate on golden set]
    H --> I{Pass gates?}
    I -- Yes --> J[Register model]
    I -- No --> K[Debug data/hparams/labels]
    K --> G
    J --> L[Compile + deploy]
    L --> M[Monitor latency, drift, confidence, accuracy]
    M --> N{Need retraining?}
    N -- Yes --> A
    N -- No --> M

4) Stage-by-Stage Decision View

| Stage | Main Question | Key Inputs | Main Metrics | Typical Decision |
|---|---|---|---|---|
| Problem framing | What are we predicting and why? | user flows, intents, latency budget | business impact, routing error cost | use classifier as first routing layer |
| Data audit | Is data good enough? | production logs, labels, class counts | label noise, class balance, ambiguity rate | clean labels, add rules, collect more rare classes |
| Model selection | Which model fits accuracy + latency? | candidate models | accuracy, P95 latency, memory, cost | choose DistilBERT over larger/slower models |
| Loss selection | How do we handle imbalance? | class frequency | rare-class recall, macro F1 | use focal loss + class weights |
| Optimization | How do we fine-tune safely? | LR, scheduler, epochs | train/val loss, gradient norm | discriminative LR + warmup + clipping |
| Validation | Is the model truly better? | val/test/golden set | accuracy, per-class recall, calibration | promote only if all gates pass |
| Deployment | Can it serve at target latency? | compiled artifact, infra | P50/P95/P99, errors, throughput | deploy with shadow or blue/green |
| Monitoring | Is production still healthy? | live traffic, sampled labels | drift, low confidence, live accuracy | trigger retrain or rollback |

5) The Core Model Dry Run

Model architecture

The scenario uses DistilBERT:

  • embeddings,
  • 6 transformer encoder layers,
  • [CLS] pooled representation,
  • linear classification head,
  • softmax over 10 intents.

One-message dry run

Example input:

"Is this isekai peak or mid?"

Step 1: tokenization

A WordPiece tokenizer splits text into tokens or subwords.

Illustrative tokenization:

[CLS] is this is ##ek ##ai peak or mid ? [SEP]

Step 2: embeddings

Each token becomes a vector:

  • token embedding,
  • position embedding,
  • summed into a 768-dimensional representation.

Step 3: transformer encoding

The 6 layers contextualize meaning. Lower layers mostly preserve language structure; higher layers adapt more to task semantics. The uploaded scenario notes gradient magnitude is strongest in the head and upper layers, and weakest in embeddings and lower layers.

Step 4: classification head

The [CLS] vector goes through a linear layer to produce 10 logits.

Example logits:

product_discovery: 0.8
product_question: 1.7
recommendation: 1.5
faq: -0.2
order_tracking: -1.4
return_request: -1.1
promotion: -0.6
checkout_help: -1.7
escalation: -2.0
chitchat: -1.0

Step 5: softmax to probabilities

The softmax converts logits into probabilities:

\[ \hat{y}_i = \frac{e^{z_i/T}}{\sum_{j=1}^{C} e^{z_j/T}} \]

where:

  • \(z_i\) is the logit for class \(i\),
  • \(T\) is the temperature,
  • \(C = 10\) is the number of classes.

Example probabilities:

product_question: 0.41
recommendation: 0.34
product_discovery: 0.14
others combined: 0.11

Predicted class: product_question
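Step 5 can be reproduced with a few lines of plain Python (a sketch, not the serving code); the exact probabilities differ slightly from the rounded illustrative values above.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a dict of class logits."""
    exps = {k: math.exp(z / T) for k, z in logits.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}

# Example logits from Step 4
logits = {
    "product_discovery": 0.8, "product_question": 1.7, "recommendation": 1.5,
    "faq": -0.2, "order_tracking": -1.4, "return_request": -1.1,
    "promotion": -0.6, "checkout_help": -1.7, "escalation": -2.0, "chitchat": -1.0,
}

probs = softmax(logits)
predicted = max(probs, key=probs.get)  # argmax -> "product_question"
```

The probabilities sum to 1 and preserve the logit ordering, which is all the routing layer needs.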

Step 6: compare with true label

Suppose true label is recommendation. The prediction is wrong, so the loss will push probability mass away from product_question and toward recommendation.


6) Math and Formulas You Should Understand

6.1 Cross-Entropy Loss

For one-hot labels:

\[ \mathcal{L}_{CE} = -\log(\hat{y}_{y_{true}}) \]

If the true class probability is:

  • 0.9, loss = \(-\log(0.9) \approx 0.105\)
  • 0.1, loss = \(-\log(0.1) \approx 2.303\)

Why it matters

  • correct + confident prediction -> small loss,
  • wrong + confident prediction -> large loss.

Decision takeaway

Use cross-entropy as the baseline. If class imbalance is hurting rare classes, move to focal loss.
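The two loss values quoted above can be checked directly; a minimal sketch:

```python
import math

def cross_entropy(p_true):
    """CE loss for one-hot labels: -log of the probability on the true class."""
    return -math.log(p_true)

easy = cross_entropy(0.9)  # confident and correct -> small loss, ~0.105
hard = cross_entropy(0.1)  # confident and wrong -> large loss, ~2.303
```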


6.2 Focal Loss

The scenario uses focal loss to handle imbalance:

\[ \mathcal{L}_{FL} = -\alpha_t (1 - p_t)^{\gamma} \log(p_t) \]

where:

  • \(p_t\) = probability of the true class,
  • \(\alpha_t\) = class weight,
  • \(\gamma\) = focusing parameter.

Intuition

  • easy examples get down-weighted,
  • hard examples get emphasized,
  • rare classes benefit more.

Example

If \(p_t = 0.9\) and \(\gamma = 2\), the focal weight is

\[ (1 - 0.9)^2 = 0.01 \]

so easy examples contribute very little. If instead \(p_t = 0.1\), the weight is

\[ (1 - 0.1)^2 = 0.81 \]

so hard examples still contribute strongly. This is why focal loss improved rare-class performance in the uploaded scenario.

Decision takeaway

Use focal loss when:

  • class imbalance is real,
  • rare-class recall matters,
  • you want no inference overhead.
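A minimal sketch of the focal-loss formula above, checking the 0.01 and 0.81 focal weights (with \(\alpha_t = 1\) for simplicity):

```python
import math

def focal_loss(p_t, gamma=2.0, alpha_t=1.0):
    """Focal loss on the true class: alpha * (1 - p)^gamma * (-log p)."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.9)  # weight (1-0.9)^2 = 0.01 -> near-zero loss
hard = focal_loss(0.1)  # weight (1-0.1)^2 = 0.81 -> still substantial loss
```

Compared with plain cross-entropy, the easy example's contribution shrinks by two orders of magnitude while the hard example keeps most of its gradient signal.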


6.3 Class Weights

The scenario defines class weights from inverse frequency:

\[ \alpha_t = \frac{1}{\text{freq}(t)} \cdot \frac{1}{\sum_{c=1}^{C} 1/\text{freq}(c)} \]

Intuition

Rare classes like escalation get higher weight than frequent classes like product_discovery. The uploaded scenario notes the rare escalation class gets about 7.3x the weight of product_discovery.

Decision takeaway

If the model ignores small classes, increase class weights carefully. Too much weighting can overfit noise in rare labels.
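The inverse-frequency weighting can be sketched with the illustrative label distribution from the Stage A data audit log in this document; the escalation-to-product_discovery ratio comes out near the ~7.3x quoted above.

```python
# Label distribution percentages from the Stage A audit log (illustrative).
freq = {
    "product_discovery": 22.1, "recommendation": 18.0, "product_question": 15.2,
    "order_tracking": 12.1, "faq": 8.0, "return_request": 7.0, "chitchat": 5.5,
    "promotion": 5.0, "checkout_help": 4.1, "escalation": 3.0,
}

# Normalized inverse-frequency class weights.
norm = sum(1.0 / f for f in freq.values())
alpha = {c: (1.0 / f) / norm for c, f in freq.items()}

ratio = alpha["escalation"] / alpha["product_discovery"]  # ~7.4x
```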


6.4 Gradient Flow and Why Top Layers Change More

The uploaded scenario shows that during fine-tuning:

  • the classification head gets the strongest gradients,
  • upper encoder layers adapt more,
  • embeddings and lower layers move very little.

This leads to the engineering idea of discriminative learning rates.


6.5 Discriminative Learning Rate

\[ \eta_l = \eta_{base} \cdot \xi^{(L-l)} \]

where:

  • \(\eta_{base}\) = top-layer LR,
  • \(\xi\) = decay factor,
  • \(L\) = total layers,
  • \(l\) = current layer.

The uploaded scenario uses a base rate near 2e-5 and a decay factor near 0.8.

Intuition

  • top layers need more freedom,
  • bottom layers should preserve general language knowledge,
  • this reduces catastrophic forgetting.

Decision takeaway

If you see unstable lower-layer drift or overfitting, reduce lower-layer LR or freeze lower layers temporarily.
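A sketch of the per-layer LR formula, using the scenario's base rate 2e-5 and decay 0.8 over DistilBERT's 6 encoder layers (the layer indexing here is an illustrative convention, with 0 = embeddings and 6 = head):

```python
def layer_lr(eta_base, xi, L, l):
    """Discriminative LR: eta_base * xi^(L - l), so lower layers move less."""
    return eta_base * xi ** (L - l)

eta_base, xi, L = 2e-5, 0.8, 6
lrs = {l: layer_lr(eta_base, xi, L, l) for l in range(L + 1)}
# head (l=6) trains at 2e-5; embeddings (l=0) at 2e-5 * 0.8^6 ~= 5.2e-6
```

This roughly matches the lr_head / lr_bottom ratio visible in the training logs in Stage E.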


6.6 Warmup + Decay Schedule

\[ \eta(t)= \begin{cases} \eta_{max} \cdot \frac{t}{T_{warmup}} & t < T_{warmup} \\ \eta_{max} \cdot \frac{T_{total}-t}{T_{total}-T_{warmup}} & t \ge T_{warmup} \end{cases} \]

Why warmup matters

At step 0 the classifier head is random. If the LR is too large immediately:

  • early gradients are noisy,
  • the encoder gets damaged,
  • training becomes unstable.

The uploaded scenario uses 10% warmup.

Decision takeaway

If the first few hundred steps look unstable, increase warmup or reduce base LR.
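The warmup-plus-linear-decay schedule above, as a small function (step counts are illustrative, matched to the 4688-step runs in the training logs):

```python
def lr_at(t, eta_max, t_warmup, t_total):
    """Linear warmup to eta_max, then linear decay to zero."""
    if t < t_warmup:
        return eta_max * t / t_warmup
    return eta_max * (t_total - t) / (t_total - t_warmup)

eta_max, t_total = 2e-5, 4688
t_warmup = int(0.10 * t_total)  # 10% warmup, as in the scenario
```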


6.7 Gradient Clipping

For a gradient with norm \(\|g\|\):

\[ g_{clipped} = g \cdot \min\left(1, \frac{\tau}{\|g\|}\right) \]

where \(\tau\) is the max norm, often 1.0.

Why it matters

It prevents exploding updates from hard examples or noisy batches.

Decision takeaway

If gradient norm spikes or loss suddenly explodes, clipping is one of the first safeguards.
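A sketch of norm-based clipping on a toy gradient vector:

```python
import math

def clip_by_norm(grad, tau=1.0):
    """Scale a gradient vector so its L2 norm is at most tau."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, tau / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

spiky = [3.0, 4.0]             # norm 5.0, would overshoot
clipped = clip_by_norm(spiky)  # rescaled so the norm is exactly tau = 1.0
```

Small gradients pass through unchanged; only spikes are rescaled, so the update direction is preserved.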


6.8 AdamW Update Rule

Conceptually:

\[ \theta \leftarrow \theta - \eta \cdot \frac{\hat{m}}{\sqrt{\hat{v}} + \epsilon} - \eta \lambda \theta \]

where:

  • \(\hat{m}\) = bias-corrected first moment,
  • \(\hat{v}\) = bias-corrected second moment,
  • \(\lambda\) = weight decay.

Decision takeaway

AdamW is preferred over plain Adam because weight decay behaves more cleanly for transformer fine-tuning.
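A scalar-parameter sketch of one AdamW step with decoupled weight decay (hyperparameters are common illustrative defaults, not the scenario's exact config):

```python
import math

def adamw_step(theta, grad, m, v, t, eta=2e-5, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    """One AdamW update for a scalar parameter; weight decay is decoupled,
    i.e. applied directly to theta rather than mixed into the gradient."""
    m = b1 * m + (1 - b1) * grad          # first moment
    v = b2 * v + (1 - b2) * grad * grad   # second moment
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps) - eta * wd * theta
    return theta, m, v

theta, m, v = 0.5, 0.0, 0.0
theta, m, v = adamw_step(theta, grad=0.2, m=m, v=v, t=1)
```

The decoupling is the key difference from plain Adam: the decay term shrinks weights by a fixed fraction per step regardless of gradient statistics.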


6.9 KL-Divergence for Drift Detection

The scenario uses KL divergence to compare production intent distribution with training distribution:

\[ D_{KL}(P \parallel Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)} \]

where:

  • \(P\) = live traffic distribution,
  • \(Q\) = reference/training distribution.

The uploaded scenario uses rough drift signals like:

  • ~0.012 -> monitor,
  • ~0.028 -> retrain.

Decision takeaway

Drift alone is not enough. Combine:

  • KL divergence,
  • low-confidence rate,
  • live accuracy from sampled labels,
  • business incidents.
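A sketch of the KL computation. The three-class toy distributions are invented for illustration, but note how even a modest intent-mix shift already lands near the ~0.028 retrain signal:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) over matching class keys; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(p[i] * math.log(p[i] / q[i]) for i in p if p[i] > 0)

# Invented distributions: 10% of traffic shifted from discovery to tracking.
train = {"discovery": 0.40, "tracking": 0.30, "other": 0.30}
live  = {"discovery": 0.30, "tracking": 0.40, "other": 0.30}

drift = kl_divergence(live, train)  # ~0.029, above the retrain signal
```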


7) Single Training Step Dry Run

flowchart TD
    A[Batch of 32 examples] --> B[Tokenize + pad to max_len]
    B --> C[Forward pass through DistilBERT]
    C --> D[Get logits for 10 classes]
    D --> E[Apply softmax]
    E --> F[Compute focal loss]
    F --> G[Backpropagation]
    G --> H[Clip gradients]
    H --> I[Optimizer step]
    I --> J[Scheduler step]
    J --> K[Zero gradients]

Tensor view

  • input_ids: (32, 128)
  • attention_mask: (32, 128)
  • hidden states: (32, 128, 768)
  • pooled [CLS]: (32, 768)
  • logits: (32, 10)
  • loss: scalar

8) What Engineers Observe During Fine-Tuning

8.1 Core training metrics

| Metric | Why it matters | Healthy pattern | Warning sign |
|---|---|---|---|
| train loss | optimization progress | decreases steadily | flat or explodes |
| val loss | generalization | decreases, then stabilizes | rises while train loss falls |
| accuracy | top-line classification quality | improves by epoch | unstable or plateaus too early |
| macro F1 | balanced view across classes | improves with rare classes | much lower than accuracy |
| rare-class recall | important for small classes | rises after focal loss/weights | near zero or unstable |
| confusion matrix | where mistakes happen | errors cluster in similar classes | critical classes collapse |
| confidence on correct predictions | calibration + separability | increases | low and flat |
| confidence on wrong predictions | overconfidence risk | stays moderate or drops | very high confidence on errors |

8.2 Optimization metrics

| Metric | Why it matters | Warning sign |
|---|---|---|
| learning rate | confirms the scheduler works | wrong warmup/decay behavior |
| gradient norm | training stability | spikes or collapse to near zero |
| layer-wise gradient norm | confirms top layers adapt more | lower layers moving too much |
| parameter update norm | actual step size | too large or vanishing |
| weight norm | model stability | runaway growth |

8.3 Data quality metrics

| Metric | Why it matters | Warning sign |
|---|---|---|
| class distribution | imbalance check | rare classes too small |
| label noise rate | bad supervision hurts training | high disagreement in audits |
| duplicate rate | train/val leakage risk | repeated samples across splits |
| text length distribution | truncation risk | important content chopped |
| synthetic vs production ratio | noise and realism balance | too much synthetic data |

8.4 System metrics

| Metric | Why it matters | Warning sign |
|---|---|---|
| GPU utilization | efficiency | very low -> input bottleneck |
| tokens/sec or examples/sec | throughput | sudden drop |
| step time | training stability | high jitter |
| memory / VRAM | OOM prevention | near max or fragmented |
| data loader latency | pipeline bottleneck | GPU idle time |

9) Stage-by-Stage Decisions in Real Engineering Terms

Stage A — Data collection and audit

What to check

  • Is the label taxonomy clean?
  • Are the 10 intents mutually understandable?
  • Are there hidden multi-intent messages?
  • Are rare classes too small?

Decisions

  • merge overlapping intents if ambiguity is too high,
  • add labeling guidelines,
  • add synthetic examples only after filtering,
  • sample more low-confidence production examples.

Typical decision metrics

  • class frequency,
  • annotator agreement,
  • label noise estimate,
  • percentage of ambiguous examples,
  • percentage of multi-intent messages.

Example production data audit log

[DATA_AUDIT] run_id=ft_2026_04_20_01
[DATA_AUDIT] total_examples=55000 production=50000 synthetic=5000
[DATA_AUDIT] label_distribution={product_discovery:22.1, recommendation:18.0, product_question:15.2, faq:8.0, order_tracking:12.1, return_request:7.0, promotion:5.0, checkout_help:4.1, escalation:3.0, chitchat:5.5}
[DATA_AUDIT] duplicate_rate=1.8%
[DATA_AUDIT] jp_en_mixed=11.7%
[DATA_AUDIT] multi_intent_estimate=17.9%
[DATA_AUDIT] synthetic_low_quality_estimate=9.6%
[DECISION] filter_synthetic=yes reason="noise above 5% threshold"

Stage B — Train/val/test split

What to check

  • no leakage,
  • stratification maintained,
  • all rare intents appear in val/test,
  • no duplicate conversation chunks across splits.

Decisions

  • use stratified split,
  • optionally group by conversation/user to avoid leakage,
  • reserve a hand-curated golden set.

Typical decision metrics

  • class parity across splits,
  • duplicate cross-split count,
  • user/thread leakage count.

Stage C — Model selection

The uploaded scenario compares TinyBERT, DistilBERT, and RoBERTa, and picks DistilBERT because it balances accuracy, latency, and cost better for the 15 ms routing budget.

Decision logic

  • If this classifier is in the critical path, latency matters a lot.
  • A small accuracy gain from a heavier model may not justify slower routing.
  • Cost-per-quality-point matters.

Example model selection log

[MODEL_BENCH] candidates=[tinybert, distilbert, roberta_base]
[MODEL_BENCH] tinybert acc=87.3 p95_ms=5 monthly_cost=89
[MODEL_BENCH] distilbert acc=92.1 p95_ms=12 monthly_cost=178
[MODEL_BENCH] roberta_base acc=94.8 p95_ms=28 monthly_cost=348
[DECISION] selected=distilbert reason="meets accuracy threshold and fits p95 latency budget"

Stage D — Loss function choice

The uploaded scenario compares standard CE, weighted CE, oversampling, focal loss, and focal + weighted, and chooses focal + weighted because rare-class accuracy is best with little training overhead.

Decision logic

  • If overall accuracy is fine but rare classes are poor -> try weighting or focal loss.
  • If oversampling increases training time too much and duplicates noisy rare labels -> prefer focal loss.

Example loss-choice log

[ABLATION] standard_ce overall_acc=91.2 rare_acc=78.4 train_min=36
[ABLATION] weighted_ce overall_acc=91.5 rare_acc=84.2 train_min=36
[ABLATION] oversampling overall_acc=91.8 rare_acc=85.1 train_min=52
[ABLATION] focal overall_acc=92.1 rare_acc=87.8 train_min=38
[ABLATION] focal_weighted overall_acc=92.1 rare_acc=88.6 train_min=38
[DECISION] selected=focal_weighted gamma=2.0 reason="best rare-class accuracy with low training overhead"

Stage E — Optimization setup

What to decide

  • base learning rate,
  • discriminative LR decay,
  • warmup ratio,
  • batch size,
  • epochs,
  • gradient clipping,
  • weight decay.

Typical starting point from the scenario

  • base LR ~ 2e-5,
  • warmup ~ 10%,
  • batch size 32,
  • epochs 3,
  • gamma 2.0,
  • layer LR decay ~ 0.8,
  • clip grad norm 1.0.

What engineers watch live

  • step loss,
  • smoothed loss,
  • val loss each epoch,
  • gradient norm,
  • LR curve,
  • throughput,
  • GPU memory.

Example training logs

[TRAIN] run_id=ft_2026_04_20_01 epoch=1 step=50/4688 lr_head=2.1e-06 lr_bottom=5.5e-07 loss=1.842 grad_norm=0.71 gpu_mem_gb=6.8 ex_per_sec=122
[TRAIN] run_id=ft_2026_04_20_01 epoch=1 step=250/4688 lr_head=1.05e-05 lr_bottom=2.7e-06 loss=1.214 grad_norm=0.88 gpu_mem_gb=6.9 ex_per_sec=121
[TRAIN] run_id=ft_2026_04_20_01 epoch=1 step=469/4688 lr_head=2.10e-05 lr_bottom=5.2e-06 loss=1.067 grad_norm=0.93 gpu_mem_gb=6.9 ex_per_sec=120
[TRAIN] run_id=ft_2026_04_20_01 epoch=2 step=1800/4688 lr_head=1.62e-05 lr_bottom=4.0e-06 loss=0.612 grad_norm=0.64 gpu_mem_gb=6.9 ex_per_sec=121
[TRAIN] run_id=ft_2026_04_20_01 epoch=3 step=4100/4688 lr_head=2.80e-06 lr_bottom=7.0e-07 loss=0.301 grad_norm=0.39 gpu_mem_gb=6.9 ex_per_sec=120

Decision examples during training

  • Loss exploding early -> reduce LR or increase warmup.
  • Val loss rises after epoch 2 -> stop at best checkpoint.
  • Rare-class recall flat -> re-check weights, label quality, or sampling.
  • GPU underutilized -> fix data loader bottleneck.

Stage F — Validation and checkpoint selection

What to check

  • overall accuracy,
  • macro F1,
  • rare-class recall,
  • confusion matrix,
  • latency on validation harness,
  • calibration,
  • no regression on critical intents.

Golden-set promotion gate

Use a hand-reviewed set with edge cases:

  • slang,
  • mixed language,
  • multi-intent,
  • rare intents,
  • ambiguous examples.

Example validation log

[VAL] run_id=ft_2026_04_20_01 epoch=1 loss=0.88 acc=90.6 macro_f1=88.9 rare_recall=82.7 ece=0.061
[VAL] run_id=ft_2026_04_20_01 epoch=2 loss=0.39 acc=91.9 macro_f1=90.8 rare_recall=87.2 ece=0.043
[VAL] run_id=ft_2026_04_20_01 epoch=3 loss=0.31 acc=92.1 macro_f1=91.2 rare_recall=88.6 ece=0.037
[CHECKPOINT] best_epoch=3 criterion="max macro_f1 subject to latency < 15ms"

Decision logic

Promote only if:

  • overall accuracy beats the current model,
  • rare-class metrics do not regress,
  • calibration is acceptable,
  • latency is still within budget.
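These promotion rules can be encoded as a single gate function. The champion numbers below are hypothetical; the candidate numbers are taken from the validation and load-test logs in this document, and the thresholds are the scenario's budgets.

```python
def passes_gates(candidate, champion, max_ece=0.05, p95_budget_ms=15.0):
    """Promotion gate sketch: every check must pass (thresholds illustrative)."""
    return (
        candidate["acc"] >= champion["acc"]
        and candidate["macro_f1"] >= champion["macro_f1"]
        and candidate["rare_recall"] >= champion["rare_recall"]
        and candidate["ece"] <= max_ece
        and candidate["p95_ms"] < p95_budget_ms
    )

champion  = {"acc": 91.4, "macro_f1": 90.1, "rare_recall": 86.0}  # hypothetical
candidate = {"acc": 92.1, "macro_f1": 91.2, "rare_recall": 88.6,  # from the logs
             "ece": 0.037, "p95_ms": 12.4}
```

A single failed gate (for example P95 creeping past 15 ms) blocks promotion regardless of accuracy gains.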


Stage G — Deployment readiness

What to check before serving

  • model artifact loads correctly,
  • tokenizer version matches,
  • compiled serving artifact works,
  • endpoint meets P50/P95/P99 goals,
  • health checks pass,
  • rollback plan exists.

Example deployment log

[DEPLOY] candidate_model=v14 tokenizer_hash=9b2a1 compiled=true target=inf2.xlarge
[LOAD_TEST] p50_ms=7.9 p95_ms=12.4 p99_ms=18.3 error_rate=0.02% rps=500
[SHADOW] agreement_with_champion=96.1% disagreement_rate=3.9%
[SHADOW] critical_intent_regression=false
[DECISION] promote=yes strategy="blue_green"

Stage H — Production monitoring

Key live metrics

| Metric | Why it matters | Example threshold |
|---|---|---|
| P50/P95/P99 latency | routing must stay fast | P95 < 15 ms |
| endpoint error rate | serving health | < 0.5% |
| low-confidence rate | uncertainty / drift | < 8% |
| live sampled accuracy | real quality | > 90% |
| rare-class sampled recall | safety on small classes | > 85% |
| KL divergence | distribution drift | alert > 0.02 |
| intent mix shift | business or drift changes | investigate large changes |
| fallback/escalation rate | downstream impact | investigate spikes |

Example production logs

[SERVE] ts=2026-04-20T10:41:12Z model=v14 req_id=8f2a latency_ms=8.7 intent=product_question confidence=0.74 tokens=14
[SERVE] ts=2026-04-20T10:41:13Z model=v14 req_id=8f2b latency_ms=11.1 intent=order_tracking confidence=0.96 tokens=6
[SERVE] ts=2026-04-20T10:41:14Z model=v14 req_id=8f2c latency_ms=13.9 intent=recommendation confidence=0.51 fallback_rerank=true tokens=22

[MONITOR_HOURLY] model=v14 p50_ms=8.1 p95_ms=12.8 p99_ms=19.7 err_rate=0.08% low_conf_rate=6.2% kl_div=0.011
[MONITOR_DAILY] model=v14 sampled_acc=91.7 rare_recall=87.9 fallback_rate=4.1%
[MONITOR_WEEKLY] model=v14 kl_div=0.029 sampled_acc=89.8 low_conf_rate=9.3%
[ALERT] retrain_trigger=true reason="drift + live accuracy degradation"

10) Important Decision Metrics Beyond Basic Accuracy

These are the metrics strong GenAI / ML engineers care about, not just top-line accuracy.

10.1 Macro F1

Useful because overall accuracy can hide weak rare classes.

\[ F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \]

Why it matters

If large classes dominate, accuracy can look good while small classes fail badly.
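A two-class sketch of exactly that situation, with invented counts: accuracy is 92%, but macro F1 is only about 0.65 because the rare class is failing.

```python
def macro_f1(confusion):
    """Macro F1 from a confusion matrix: rows = true class, cols = predicted."""
    classes = list(confusion)
    f1s = []
    for c in classes:
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in classes if r != c)
        fn = sum(confusion[c][p] for p in classes if p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Invented counts: the dominant class hides a collapsing rare class.
cm = {
    "discovery":  {"discovery": 90, "escalation": 0},
    "escalation": {"discovery": 8,  "escalation": 2},
}
accuracy = (90 + 2) / 100  # 0.92 looks fine; macro F1 tells the truth
```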


10.2 Per-class recall

Especially important for:

  • escalation,
  • checkout_help,
  • promotion,
  • any business-critical or compliance-sensitive intent.

Why it matters

A bad miss rate on a small but important intent may hurt operations more than a small dip in overall accuracy.


10.3 Calibration / ECE

A classifier should not only be accurate; its confidence should mean something.

Expected Calibration Error conceptually compares:

  • predicted confidence,
  • actual correctness.

Why it matters

If the model often reports 0.95 confidence but is correct only 75% of the time, it is overconfident. This hurts fallback routing and active-learning selection.
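A binned-ECE sketch; the four-sample toy example encodes exactly the 0.95-confidence, 75%-correct situation described above, giving an ECE of 0.20.

```python
def ece(confidences, correct, n_bins=10):
    """Expected Calibration Error: per-bin |avg confidence - accuracy|,
    weighted by the fraction of samples in each bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    err = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(ok for _, ok in b) / len(b)
        err += (len(b) / total) * abs(avg_conf - acc)
    return err

# Overconfident model: claims 0.95 but is right on only 3 of 4 samples.
confs = [0.95] * 4
hits  = [1, 1, 1, 0]
```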


10.4 Low-confidence traffic rate

Percentage of requests where top prediction confidence is below a threshold, for example 0.6.

Why it matters

A rise often indicates:

  • new user behavior,
  • domain drift,
  • broken preprocessing,
  • too much ambiguity.


10.5 Confusion concentration

Study which intent pairs are commonly confused.

Why it matters

Some confusion is acceptable if both intents route to similar downstream systems. Other confusion is severe if it sends the user to the wrong workflow.

Example:

  • product_discovery vs recommendation may be acceptable,
  • order_tracking vs product_discovery is much worse.


10.6 Cost-per-quality-point

This is a practical engineering metric.

\[ \text{cost per quality point} = \frac{\Delta \text{monthly cost}}{\Delta \text{accuracy points}} \]

The uploaded scenario explicitly reasons this way when comparing DistilBERT and RoBERTa.

Why it matters

It keeps the team from over-optimizing for tiny quality gains at large cost.
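Applied to the benchmark numbers from the model-selection log earlier in this document:

```python
# Benchmark numbers from the [MODEL_BENCH] log in Stage C.
distilbert = {"acc": 92.1, "cost": 178}
roberta    = {"acc": 94.8, "cost": 348}

cost_per_point = ((roberta["cost"] - distilbert["cost"])
                  / (roberta["acc"] - distilbert["acc"]))
# ~$63/month per extra accuracy point, before counting the latency budget it breaks
```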


10.7 Retraining ROI

Measure whether labeling + retraining cost is justified by recovered quality.

Why it matters

A good MLOps team does not retrain on habit alone. It retrains when:

  • quality has drifted,
  • business value is clear,
  • new labels are informative.


11) Similar Important Failure Modes to Watch

11.1 Label noise

Symptoms:

  • train loss stays oddly high,
  • confusing samples dominate error analysis,
  • model confidence is unstable.

Action:

  • re-audit labels,
  • improve the annotation policy,
  • remove contradictory samples.

11.2 Catastrophic forgetting

Symptoms:

  • early instability,
  • lower layers move too much,
  • general language behavior worsens.

Action:

  • lower the LR,
  • add warmup,
  • freeze lower layers temporarily,
  • shorten training.

11.3 Overfitting

Symptoms:

  • train loss keeps dropping,
  • val loss rises,
  • confidence gets sharper but is wrong more often on new data.

Action:

  • stop early,
  • reduce epochs,
  • strengthen regularization,
  • improve validation set quality.

11.4 Serving mismatch

Symptoms: offline accuracy is good, but live accuracy is poor.

Action:

  • verify tokenizer parity,
  • verify preprocessing parity,
  • compare offline and online text normalization,
  • inspect shadow-mode disagreements.

11.5 Drift masked by stable accuracy

Sometimes top-line accuracy looks okay but intent mix has changed.

Action:

  • inspect KL divergence,
  • inspect the low-confidence rate,
  • inspect class-wise performance,
  • inspect business routing outcomes.


12) Promotion Gates

Use a clear gate table before moving a model to production.

| Gate | Proposed rule |
|---|---|
| Overall accuracy | candidate >= current champion |
| Macro F1 | candidate >= current champion |
| Rare-class recall | no critical regression |
| Calibration | ECE not worse than allowed margin |
| P95 latency | < 15 ms |
| Error rate in load test | below operational threshold |
| Drift robustness | passes shadow test on recent traffic |
| Explainability / audit | confusion matrix + sample review completed |

13) Dashboards to Maintain

Training dashboard

  • train loss
  • val loss
  • accuracy
  • macro F1
  • rare-class recall
  • LR curve
  • gradient norm
  • GPU utilization
  • step time

Validation dashboard

  • confusion matrix
  • per-class precision/recall/F1
  • reliability plot / ECE
  • confidence histograms
  • top false positives
  • top false negatives

Production dashboard

  • P50/P95/P99 latency
  • request volume
  • error rate
  • low-confidence rate
  • intent distribution
  • KL divergence
  • sampled accuracy
  • rare-class recall
  • fallback rate
  • rollback status

14) Practical Decision Tree

flowchart TD
    A[Training run finished] --> B{Val loss lower?}
    B -- No --> C[Stop or reduce LR / epochs]
    B -- Yes --> D{Macro F1 improved?}
    D -- No --> E[Inspect class imbalance and confusion matrix]
    D -- Yes --> F{Rare-class recall improved?}
    F -- No --> G[Adjust focal gamma / class weights / data]
    F -- Yes --> H{Calibration acceptable?}
    H -- No --> I[Apply calibration or revise training]
    H -- Yes --> J{Latency under budget?}
    J -- No --> K[Optimize serving or choose smaller model]
    J -- Yes --> L[Promote to shadow mode]
    L --> M{Shadow regressions?}
    M -- Yes --> N[Rollback and inspect mismatches]
    M -- No --> O[Promote to production]

15) What a Strong Engineer Would Say in Review

A strong ML/GenAI engineer would not say:

  • “Accuracy improved, so we are done.”

They would say:

  • “The model improved overall, but more importantly rare-class recall improved without breaking the latency budget.”
  • “Upper layers adapted as expected, lower-layer drift stayed controlled, and warmup prevented early instability.”
  • “Calibration improved, so confidence can be used for fallback routing and active learning.”
  • “Production monitoring combines latency, confidence, drift, and sampled accuracy, so we can retrain based on evidence, not guesswork.”
  • “We chose DistilBERT because it is the best system-level decision, not just the best raw-accuracy model.”


16) Final Engineering Summary

For this MangaAssist fine-tuning scenario:

  • DistilBERT is the right operating point because it balances quality, latency, and cost.
  • Focal loss + class weights is the right imbalance strategy because it improves rare classes with no inference penalty.
  • Discriminative learning rates + warmup are critical because lower layers should be preserved while upper layers adapt.
  • Validation must go beyond accuracy into macro F1, rare-class recall, the confusion matrix, calibration, and latency.
  • Production readiness is not just model quality; it includes logs, shadow tests, serving performance, drift signals, and retraining ROI.

That is the full dry run mindset: from data, to math, to optimization, to decision gates, to production observability.


17) Suggested Next Extensions

If you want to deepen this document later, add:

  1. a worked numerical example of focal loss on one batch,
  2. a confusion-matrix interpretation section for the 10 intents,
  3. active-learning sampling math,
  4. calibration plots and ECE computation,
  5. a fallback routing design for low-confidence predictions.