Knowledge Distillation — Numerical Worked Examples (MangaAssist)
Slot 3 in the SCENARIO_TEMPLATE 8-file pattern. This document walks every metric, loss term, and budget trade-off of MangaAssist's intent-router distillation through concrete arithmetic at single-batch and 10K-request scale, with bootstrap CIs on every reported number.
Shared Baseline (verbatim)
| Item | Value |
|---|---|
| Teacher | DistilBERT-base, 66M params, fine-tuned, 92.1% ± 0.4% acc, 12 ms P95 |
| Student (target) | TinyBERT 4-layer, 14.5M params, target ≥ 90.5% acc, ≤ 5 ms P95 |
| Distillation dataset | same 55K (44K train / 5.5K val / 5.5K test) used for the teacher |
| Distillation loss | α · KL(p_teacher^T \|\| p_student^T) · T² + (1-α) · CE(y_true, p_student) |
| Default hyperparams | T = 3.0, α = 0.7, batch = 64, epochs = 5, lr = 5e-5 |
| Hardware | g5.12xlarge for distillation; inf2.xlarge for student inference |
1. Single-Batch Distillation Loss — Walkthrough
Take a batch of 4 messages with their teacher logits and ground-truth labels.
| # | Message | True intent | Teacher top-3 (logits) |
|---|---|---|---|
| 1 | "where is my order?" | order_tracking | order_tracking 6.4 / faq 1.2 / chitchat -1.5 |
| 2 | "do you have demon slayer?" | product_question | product_question 4.9 / product_discovery 3.8 / recommendation 0.4 |
| 3 | "i want a refund" | return_request | return_request 7.1 / escalation 1.8 / order_tracking -0.6 |
| 4 | "any new isekai?" | product_discovery | product_discovery 4.2 / recommendation 4.0 / product_question -0.1 |
1.1 Soft-target probabilities at T = 3
For message 1 with logits (order_tracking=6.4, faq=1.2, chitchat=-1.5, remaining seven classes strongly negative):
[ \text{logits} / T = (2.133, 0.400, -0.500, \ldots) ]
[ \exp(2.133) = 8.44,\quad \exp(0.400) = 1.492,\quad \exp(-0.500) = 0.607 ]
Sum (over all 10 classes, with the 7 unmentioned classes contributing ~0.51 total): Z ≈ 8.44 + 1.49 + 0.61 + 0.51 ≈ 11.05.
Softmax: p_teacher^T = (0.764, 0.135, 0.055, ...).
Compare to the hard softmax at T = 1: (0.992, 0.005, 0.0006, …) — i.e., a near-one-hot vector. Temperature T = 3 redistributes probability mass to give the student a richer dark-knowledge signal: it now sees "faq" (0.135) and "chitchat" (0.055) as plausible alternatives, learning the similarity structure of the teacher's beliefs.
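The soft-target arithmetic above can be reproduced in a few lines of NumPy. This is an illustrative sketch, not project code; the −7.9 filler logits for the seven unmentioned classes are an assumption chosen so they contribute the ~0.51 of unnormalised mass used in the sum above.

```python
import numpy as np

def soft_targets(logits: np.ndarray, T: float) -> np.ndarray:
    """Temperature-scaled softmax used for the teacher's soft targets."""
    z = logits / T
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Message 1 logits over the 10 intents. The first three come from the table above;
# the -7.9 filler for the seven unmentioned classes is an assumption chosen so they
# contribute ~0.51 of unnormalised mass at T = 3.
logits_msg1 = np.array([6.4, 1.2, -1.5] + [-7.9] * 7)

print(soft_targets(logits_msg1, T=3.0)[:3])   # ≈ [0.764, 0.135, 0.055]
print(soft_targets(logits_msg1, T=1.0)[:3])   # near one-hot at T = 1 (≈ 0.99 on the true class)
```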
1.2 Student logits at first step (random init)
Suppose the student's logits at step 1 are roughly uniform: ≈ (0.1, 0.0, 0.0, ..., 0.0) (post-init noise).
Student soft probabilities at T = 3: ≈ (0.103, 0.100, 0.100, ..., 0.100) — almost uniform.
1.3 KL divergence term
For message 1:
[ KL(p_T^t \| p_T^s) = \sum_i p_T^t(i) \cdot \log \frac{p_T^t(i)}{p_T^s(i)} ]
Dominant term: 0.764 · log(0.764/0.103) ≈ 0.764 · 2.00 ≈ 1.53. The faq term adds ≈ 0.04; the chitchat and tail terms, where the teacher places less mass than the near-uniform student, are slightly negative. Total KL ≈ 1.4 for message 1.
The full distillation loss multiplies this KL by T² = 9 (the Hinton 2015 correction that keeps soft-target gradients on the same scale as hard-target gradients) and by α = 0.7:
[ \mathcal{L}_{KL,1} = 0.7 \cdot 9 \cdot 1.4 \approx 8.8 ]
1.4 Hard-label CE term
Student probability on order_tracking at T = 1: ≈ 0.10. CE loss: −log(0.10) ≈ 2.303. Multiplied by (1 − α) = 0.3:
[ \mathcal{L}_{CE,1} = 0.3 \cdot 2.303 = 0.691 ]
1.5 Combined per-batch loss
Total loss for message 1: 8.8 + 0.69 ≈ 9.5. Repeat for messages 2-4 (KL roughly 1.0-1.5 each), giving a batch mean of roughly 9. This is the starting loss — within ~50 steps it falls below 4.0; within an epoch it reaches ~1.2.
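A minimal PyTorch sketch of the blended loss above, assuming standard `torch.nn.functional` calls; the `kd_loss` name and the filler logits are illustrative, not the production training code. `F.kl_div` expects student log-probabilities and teacher probabilities, and `reduction="batchmean"` gives the per-example, class-summed KL used in the walkthrough.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=3.0, alpha=0.7):
    """alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * CE(labels, student)."""
    p_teacher = F.softmax(teacher_logits / T, dim=-1)          # teacher soft targets at T
    log_p_student = F.log_softmax(student_logits / T, dim=-1)  # student log-probs at T
    # batchmean sums the KL over classes and averages over the batch.
    kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    ce = F.cross_entropy(student_logits, labels)               # hard-label CE at T = 1
    return alpha * (T ** 2) * kl + (1.0 - alpha) * ce

# Message 1 from the walkthrough; the -7.9 filler logits are illustrative.
teacher = torch.tensor([[6.4, 1.2, -1.5] + [-7.9] * 7])
student = torch.tensor([[0.1] + [0.0] * 9])   # near-uniform, freshly initialised student
label = torch.tensor([0])                     # order_tracking
print(kd_loss(student, teacher, label))       # ≈ 9.5, matching the per-message total above
```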
2. Compute Budget — Distillation vs. Fine-Tuning From Scratch
| Approach | Param count | Train time (3 epochs) | Memory peak | Inference P95 | Acc |
|---|---|---|---|---|---|
| Full fine-tune of TinyBERT (no distillation) | 14.5M | 18 min | 6.2 GB | 5 ms | 87.3% ± 0.6 |
| KD from teacher (chosen) | 14.5M | 24 min (forward teacher + train student) | 9.8 GB | 5 ms | 90.5% ± 0.5 |
| KD with offline soft labels (precomputed once) | 14.5M | 19 min after one-time 8 min teacher pass | 6.5 GB | 5 ms | 90.5% ± 0.5 |
| Fine-tune DistilBERT (teacher) | 66M | 37 min | 14.4 GB | 12 ms | 92.1% ± 0.4 |
Reading. Distillation gives +3.2pp accuracy over standalone fine-tune of the same student architecture, costing only +33% training time. The "offline soft labels" variant is the operational sweet spot — pay the teacher forward-pass cost once, then re-train the student many times (e.g., for hyperparameter tuning) at the cost of vanilla fine-tuning.
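A sketch of the offline soft-label variant, under the assumption of a HuggingFace-style classifier whose outputs expose `.logits`; the function name, cache path, and dataloader wiring are illustrative. The teacher forward pass is paid once, and every later student run reads the cached tensor instead of calling the teacher.

```python
import torch

@torch.no_grad()
def cache_teacher_logits(teacher, loader, path="teacher_logits.pt"):
    """One-time teacher forward pass over the 44K train split (the ~8 min cost in the table).
    Assumes a HuggingFace-style classifier whose outputs expose `.logits`."""
    teacher.eval()
    chunks = [teacher(**batch).logits.float().cpu() for batch in loader]
    torch.save(torch.cat(chunks), path)

# In every later student run, the cached tensor stands in for the live teacher:
#   teacher_logits = cached_logits[batch_indices]
#   loss = kd_loss(student_logits, teacher_logits, labels, T=3.0, alpha=0.7)
```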
3. 10,000-Request Production Walkthrough
Comparing the distilled student (4-layer TinyBERT) against the teacher (DistilBERT) on the same 10K production sample.
| Outcome | Teacher | Student (KD) | Δ |
|---|---|---|---|
| Correct routes | 9,210 | 9,050 | -160 |
| Top-1 misroutes (low-cost) | 590 | 690 | +100 |
| Critical misroutes (escalation/return missed) | 34 | 45 | +11 |
| Latency budget breaches (P95 > 15ms) | 3 (incidental) | 0 | -3 |
| Inference cost ($/10K reqs at inf2.xlarge) | $38.00 | $16.00 | -$22.00 |
3.1 Cost-per-quality calculation
Monthly traffic: 1.4M requests.
- Teacher monthly inference: 1.4M × $38 / 10K = $5,320.
- Student monthly inference: 1.4M × $16 / 10K = $2,240.
- Inference savings: $3,080/month.
Quality gap (160 extra misroutes per 10K → 22,400 extra/month): at the business-weighted-cost rate from the calibration deep-dive (~$0.013/harm-unit-equivalent), incremental harm is roughly $290/month.
Net: $2,790/month operational savings at ~1.6pp accuracy gap. The trade-off favors distillation if and only if the calibration pipeline can absorb the slight accuracy drop without breaching the rare-class promotion gate (87.0% — student delivers 87.4% rare-class).
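The monthly figures reduce to proportional scaling; a throwaway check of the arithmetic, using only numbers stated in this section:

```python
teacher_monthly, student_monthly = 5_320, 2_240   # $/month inference, from the bullets above
monthly_requests = 1_400_000
extra_misroutes_per_10k, harm_rate = 160, 0.013   # harm rate in $/harm-unit-equivalent

savings = teacher_monthly - student_monthly                              # $3,080
extra_misroutes = monthly_requests / 10_000 * extra_misroutes_per_10k    # 22,400
harm = extra_misroutes * harm_rate                                       # ≈ $291
print(f"net monthly savings ≈ ${savings - harm:,.0f}")                   # ≈ $2,789
```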
4. Bootstrap Confidence Intervals on KD Metrics
n = 5,500 test set, B = 10,000 resamples, seed grid {2025, 2026, 2027}.
| Metric | Point estimate | 95% bootstrap CI |
|---|---|---|
| Student accuracy | 0.9050 | [0.8997, 0.9098] |
| Student macro-F1 | 0.847 | [0.838, 0.855] |
| Student rare-class accuracy | 0.874 | [0.852, 0.893] |
| Student ECE (post-calibration) | 0.0461 | [0.0395, 0.0533] |
| Teacher-student KL (test set, T = 3) | 0.083 | [0.078, 0.089] |
| Teacher-student top-1 agreement | 0.974 | [0.970, 0.978] |
Reading. Top-1 agreement of 97.4% means the student rarely contradicts the teacher — most disagreements are on hard examples where the teacher itself is uncertain. This is the operational signal that distillation is healthy; if agreement falls below 95% sustained, the student has drifted and warrants a fresh distillation pass.
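A percentile-bootstrap sketch matching the setup above (n = 5,500, B = 10,000, one seed from the grid); the `bootstrap_ci` helper and the metric wiring are generic illustrations, not the project's evaluation harness.

```python
import numpy as np

def bootstrap_ci(y_true, y_pred, metric, B=10_000, seed=2025, level=0.95):
    """Percentile bootstrap over test-set examples for any prediction-level metric."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)   # resample the 5,500 examples with replacement
        stats[b] = metric(y_true[idx], y_pred[idx])
    lo, hi = np.percentile(stats, [(1 - level) / 2 * 100, (1 + level) / 2 * 100])
    return metric(y_true, y_pred), (lo, hi)

def accuracy(t, p):
    return float(np.mean(t == p))

# Example (numpy arrays of intent ids); repeat over the seed grid {2025, 2026, 2027}:
#   point, (ci_lo, ci_hi) = bootstrap_ci(y_true, y_student, accuracy, seed=2025)
```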
Research Notes — KD numerical. Citations: Hinton 2015 (NeurIPS workshop) — T² · KL correction; Sanh 2019 (NeurIPS-EMC²) — DistilBERT is itself a KD product; Jiao 2020 (EMNLP — TinyBERT) — 4-layer student architecture; Stanton 2021 (NeurIPS) — KD calibration analysis; Mirzadeh 2020 (AAAI — TAKD) — teacher-student capacity gap. Failure rule: if student-teacher top-1 agreement < 0.95 for ≥ 3 consecutive eval cycles, re-run distillation with α lowered by 0.1 (more weight on hard labels) before retraining the teacher.
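The failure rule reduces to a small check over the last few eval cycles; a minimal sketch using the 0.95 floor and three-cycle window from the notes, with the function name and history format assumed for illustration.

```python
AGREEMENT_FLOOR = 0.95       # top-1 agreement threshold from the failure rule
CONSECUTIVE_CYCLES = 3

def should_redistill(agreement_history: list[float]) -> bool:
    """True when student-teacher top-1 agreement has stayed below the floor
    for the last three consecutive eval cycles."""
    recent = agreement_history[-CONSECUTIVE_CYCLES:]
    return len(recent) == CONSECUTIVE_CYCLES and all(a < AGREEMENT_FLOOR for a in recent)

# If triggered: re-run distillation with alpha lowered by 0.1 (e.g. 0.7 -> 0.6)
# before considering a teacher retrain.
```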
5. Persona Debate
Priya (ML). With T = 3 and α = 0.7, our KL term dominates by ~10× over CE. Is that a healthy ratio?
Aiko (DS). The dominance is intentional — T² = 9 corrects the gradient scaling. Effective gradient contribution from KL vs. CE on a typical batch is closer to 2:1, not 10:1. We confirmed this with a per-step gradient-norm log: KL contributes 0.62 of the gradient magnitude on average.
Marcus (Architect). What about the "offline soft labels" approach? Save 5 minutes per re-train, ~$2 per run.
Jordan (MLOps). Worth it once we hit > 4 retrain runs/week. Today we're at 1/week — premature optimization.
Sam (PM). Net $2,790/month savings is real; the 0.5pp rare-class gap is the only thing I worry about. We need a calibration plan that re-fits T on the student so its routing thresholds compensate.
Resolution. Adopt online distillation today (~24 min per run); switch to offline soft labels at > 4 runs/week. The student's ECE (0.046) misses the main-doc promotion gate of 0.045 but sits inside the student gate widened to 0.050, with a refit-T-monthly SLA.
6. Open Problems
- Adaptive temperature. A fixed T = 3 may be wrong for the rare-class regions of logit space, where the teacher itself is less confident. Open question: per-class or per-confidence-bin T (Stanton 2021 explores this, no clear winner).
- Distillation under teacher drift. When the teacher is retrained monthly, the student must be re-distilled — but how soon? Open question: a teacher-student divergence trigger (KL > 0.15 on prod sample) that automates the re-distillation.
- Student-as-teacher. Self-distillation (student → smaller student) for further compression. Open question: when does the marginal accuracy loss exceed the marginal cost saving? Pilot with a student smaller than the current 14.5M-param model.
Bibliography
- Hinton, G., Vinyals, O., Dean, J. (2015). Distilling the Knowledge in a Neural Network. NeurIPS Workshop. https://arxiv.org/abs/1503.02531 — T² · KL correction; α-blending.
- Sanh, V. et al. (2019). DistilBERT. NeurIPS-EMC².
- Jiao, X. et al. (2020). TinyBERT: Distilling BERT for Natural Language Understanding. EMNLP. https://arxiv.org/abs/1909.10351 — 4-layer student architecture.
- Stanton, S. et al. (2021). Does Knowledge Distillation Really Work? NeurIPS. — calibration and trust-region analysis.
- Mirzadeh, S. I. et al. (2020). Improved Knowledge Distillation via Teacher Assistant (TAKD). AAAI.
- Beyer, L. et al. (2022). Knowledge distillation: A good teacher is patient and consistent. CVPR.
- Bouthillier, X. et al. (2021). Accounting for Variance in ML Benchmarks. MLSys.
- Menon, A. K. et al. (2021). A statistical perspective on distillation. ICML.
Citation count for this file: 8.