Student Calibration & Quality Metrics — Knowledge Distillation (MangaAssist)
Slot 4 in the SCENARIO_TEMPLATE 8-file pattern. This document covers the calibration and quality metrics specific to a distilled student: ECE, Brier, NLL, teacher-agreement (top-1, KL, JSD), and the quality-gap-vs-teacher metric. Distilled students are systematically miscalibrated in ways that differ from from-scratch fine-tuned models — this file captures those differences.
Why Student Calibration Is Its Own Topic
A from-scratch fine-tuned model is overconfident on inputs similar to its training data (Guo 2017). A distilled student carries an additional miscalibration source: the soft labels at temperature T inflate entropy at training time, but at inference time the student is read at T = 1, where its outputs are sharper than the teacher's. Stanton 2021 and Menon 2021 document this experimentally: distilled students end up more confident than their teachers where the teacher was already confident, and less confident where the teacher was uncertain.
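A minimal sketch of the mechanism (NumPy, hypothetical 4-class logits): the same logits read at the distillation temperature T = 3 versus T = 1 show how much entropy the training-time soft targets carry that the inference-time read does not.

```python
import numpy as np

def softmax_t(logits, T):
    """Softmax at temperature T: T > 1 flattens the distribution, T < 1 sharpens it."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # stabilize the exponent
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(p):
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

logits = np.array([4.0, 1.0, 0.5, 0.2])  # hypothetical logits, illustration only
print(entropy(softmax_t(logits, 3.0)))   # high-entropy soft target seen in training
print(entropy(softmax_t(logits, 1.0)))   # much sharper distribution read at inference
```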
This matters for MangaAssist because the routing layer trusts the probabilities for its threshold-based gating (calibration deep-dive in Intent-Classification/). A poorly-calibrated student forces the routing thresholds to be re-tuned independently of the teacher's thresholds.
Shared Baseline (verbatim)
| Item | Value |
|---|---|
| Teacher | DistilBERT, 92.1% ± 0.4% acc, ECE 0.040, T = 1.6 |
| Student | TinyBERT 4-layer, 90.5% ± 0.5% acc, target ECE ≤ 0.050 post-cal |
| Distillation recipe | T = 3, α = 0.7, 5 epochs |
| Calibration evaluation set | same 5,500-example test set as teacher |
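For reference, a minimal sketch of how the recipe's T = 3 and α = 0.7 enter the training objective, assuming the standard Hinton 2015 formulation (PyTorch; all names are illustrative):

```python
import torch
import torch.nn.functional as F

def distillation_loss(s_logits, t_logits, labels, T=3.0, alpha=0.7):
    """Hinton 2015 KD objective with the baseline's T = 3, alpha = 0.7.
    The soft term is scaled by T^2 so its gradient magnitude matches the hard term."""
    soft = F.kl_div(
        F.log_softmax(s_logits / T, dim=-1),
        F.softmax(t_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(s_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```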
1. Pre-Calibration State of the Distilled Student
| Metric | Teacher (post-cal, T=1.6) | Student (no calibration) | Student (post-cal, T=1.4) |
|---|---|---|---|
| Accuracy | 92.1% ± 0.4 | 90.5% ± 0.5 | 90.5% ± 0.5 (preserved) |
| ECE (10 bins) | 0.040 ± 0.005 | 0.078 ± 0.008 | 0.046 ± 0.006 |
| Brier score | 0.071 ± 0.004 | 0.094 ± 0.006 | 0.078 ± 0.005 |
| NLL | 0.301 ± 0.020 | 0.347 ± 0.024 | 0.319 ± 0.021 |
| Mean confidence (correct) | 0.85 | 0.93 | 0.86 |
| Mean confidence (incorrect) | 0.48 | 0.66 | 0.52 |
Reading. Pre-calibration, the student is meaningfully more overconfident than the pre-calibration teacher. The mean confidence on incorrect predictions (0.66) is the dangerous metric — a student that says "0.66" when wrong is harder to gate than one that says "0.48" when wrong. Post-calibration with T = 1.4 (the optimal temperature is model-specific, so the student's T is fit independently of the teacher's T = 1.6), ECE drops to 0.046, just under the 0.050 promotion gate.
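A sketch of the 10-bin equal-width ECE behind these numbers (Naeini 2015, as used by Guo 2017); `confidences` holds top-1 probabilities and `correct` is a 0/1 vector, both hypothetical names:

```python
import numpy as np

def ece(confidences, correct, n_bins=10):
    """Equal-width-binned expected calibration error:
    sum over bins of (bin weight) * |bin accuracy - bin mean confidence|."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = len(confidences)
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            err += mask.sum() / total * gap
    return err
```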
2. Teacher-Student Agreement Metrics
These metrics are unique to distillation — they measure how faithfully the student preserves the teacher's behavior, separately from how accurate either is.
| Metric | Definition | Value | 95% CI |
|---|---|---|---|
| Top-1 agreement | fraction where argmax(student) = argmax(teacher) | 0.974 | [0.970, 0.978] |
| Top-3 agreement | fraction where teacher's top-1 is in student's top-3 | 0.997 | [0.995, 0.998] |
| KL(teacher || student) at T=1 | mean over test set | 0.067 | [0.061, 0.074] |
| KL(student || teacher) at T=1 | reverse | 0.083 | [0.076, 0.091] |
| Jensen-Shannon divergence | symmetric | 0.041 | [0.037, 0.045] |
| Confidence gap (mean \|p_t − p_s\|) | absolute prob difference on top-1 | 0.072 | [0.068, 0.076] |
Reading. The KL asymmetry (0.067 forward vs 0.083 reverse) tells us the student is slightly more confident than the teacher on average — the reverse KL(student || teacher) penalizes the student for assigning high probability where the teacher assigned low probability, so the larger reverse value means the student sometimes overcommits to predictions the teacher treated as uncertain. This is consistent with the calibration findings above.
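A sketch of how the agreement table is computed, assuming `t_logits` and `s_logits` are (N, C) arrays from the shared test set:

```python
import numpy as np
from scipy.special import rel_entr, softmax

def agreement_metrics(t_logits, s_logits):
    """Teacher-student agreement at T = 1: top-1 match rate, both KL directions,
    and Jensen-Shannon divergence (rel_entr(x, y) = x * log(x / y) elementwise)."""
    p_t = softmax(t_logits, axis=-1)
    p_s = softmax(s_logits, axis=-1)
    top1 = (p_t.argmax(-1) == p_s.argmax(-1)).mean()
    kl_ts = rel_entr(p_t, p_s).sum(-1).mean()  # KL(teacher || student), forward
    kl_st = rel_entr(p_s, p_t).sum(-1).mean()  # KL(student || teacher), reverse
    m = 0.5 * (p_t + p_s)
    jsd = 0.5 * (rel_entr(p_t, m).sum(-1) + rel_entr(p_s, m).sum(-1)).mean()
    return {"top1": top1, "kl_ts": kl_ts, "kl_st": kl_st, "jsd": jsd}
```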
Research Notes — agreement. Citations: Stanton 2021 (NeurIPS) — disagreement metrics are diagnostic; Beyer 2022 (CVPR) — patient teacher reduces top-1 disagreement by ~30%; Menon 2021 (ICML) — statistical view of distillation links KL to a Fisher-information regularizer.
3. Calibration-Method Sweep (student-specific)
Same methods as in the Intent-Classification calibration deep-dive, applied to the student.
| Method | Accuracy preserved? | ECE | Brier | When to prefer |
|---|---|---|---|---|
| No calibration | yes | 0.078 ± 0.008 | 0.094 ± 0.006 | never |
| Temperature scaling, T = 1.4 (chosen) | yes | 0.046 ± 0.006 | 0.078 ± 0.005 | default |
| Vector scaling | yes | 0.044 ± 0.006 | 0.077 ± 0.005 | when class-conditional miscal evident |
| Isotonic OvR | no | 0.038 ± 0.005 | 0.075 ± 0.005 | when ECE budget < 0.04 |
| Histogram binning (10 bins) | no | 0.054 ± 0.007 | 0.082 ± 0.006 | low-data regime only |
| Distillation-aware calibration (DAC, fit on T_distill-soft labels) | yes | 0.041 ± 0.005 | 0.076 ± 0.005 | active research; not yet production-ready |
| Joint teacher-student fit (Menon 2021 inspired) | yes | 0.043 ± 0.005 | 0.077 ± 0.005 | research |
Reading. Vanilla temperature scaling is robust and simple; isotonic gives lower ECE but flips predictions ~0.6% of the time, complicating the promotion gate. Recommendation: keep temperature scaling at T = 1.4 for the student; revisit DAC in Q2 next year.
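For reference, a minimal sketch of the temperature fit itself (Guo 2017): a single scalar T chosen to minimize NLL on held-out validation logits. `val_logits` and `val_labels` are assumed array names:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import log_softmax

def fit_temperature(val_logits, val_labels):
    """Fit one scalar T by minimizing validation NLL; accuracy is preserved
    because dividing logits by a positive T never changes the argmax."""
    def nll(T):
        logp = log_softmax(val_logits / T, axis=-1)
        return -logp[np.arange(len(val_labels)), val_labels].mean()
    res = minimize_scalar(nll, bounds=(0.5, 5.0), method="bounded")
    return res.x  # ~1.4 for this student on our validation split
```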
4. Quality-Gap-vs-Teacher Tracking
A subtle metric: the teacher-student accuracy gap can widen even when each model, monitored on its own, stays within its alert band. We track:
\[ \Delta_{\text{quality}} = \text{acc}_{\text{teacher}} - \text{acc}_{\text{student}} \]
| Period | Teacher | Student | Δ | Notes |
|---|---|---|---|---|
| Day of distillation | 92.1% | 90.5% | 1.6pp | initial gap |
| Week 1 | 92.0% | 90.4% | 1.6pp | stable |
| Week 4 | 91.8% | 90.0% | 1.8pp | drift; both degrade |
| Week 8 (no re-distill) | 91.5% | 89.5% | 2.0pp | gap widening |
| After re-distill | 92.0% | 90.6% | 1.4pp | gap restored |
SLA: if Δ_quality grows by ≥ 0.5pp over the initial gap for two consecutive weeks, trigger re-distillation. Re-distilling with the latest teacher closes the gap without a full retrain.
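A minimal sketch of the SLA trigger, assuming weekly gap measurements in percentage points:

```python
def should_redistill(gaps_pp, initial_gap_pp=1.6, slack_pp=0.5, weeks=2):
    """True if the teacher-student gap exceeds initial_gap + slack for
    `weeks` consecutive weekly measurements (the re-distillation SLA)."""
    run = 0
    for gap in gaps_pp:
        run = run + 1 if gap >= initial_gap_pp + slack_pp else 0
        if run >= weeks:
            return True
    return False

# Trajectory from the table above: the week-8 gap (2.0pp) is still below
# the 2.1pp trigger, so this series does not yet fire the SLA.
print(should_redistill([1.6, 1.6, 1.8, 2.0]))  # False
```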
5. Confidence Intervals — Student Quality (additional)
| Metric | Point estimate | 95% bootstrap CI |
|---|---|---|
| Student selective accuracy at coverage = 0.90 | 0.937 | [0.928, 0.945] |
| Student AURC | 0.0264 | [0.0231, 0.0303] |
| Coverage at risk threshold 0.05 | 0.79 | [0.76, 0.82] |
| ECE on rare-class only (escalation) | 0.092 | [0.071, 0.117] |
Reading. The rare-class ECE (0.092) is much worse than overall ECE (0.046). Distilled students inherit the teacher's miscalibration on rare classes and may amplify it. Routing thresholds for escalation should be set on rare-class-conditional ECE, not overall ECE.
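Sketches of the two quantities this section leans on, reusing `ece()` from the Section 1 sketch; all array names are hypothetical:

```python
import numpy as np

def selective_accuracy(confidences, correct, coverage=0.90):
    """Accuracy on the most-confident `coverage` fraction of predictions."""
    conf = np.asarray(confidences)
    order = np.argsort(-conf)
    k = int(np.ceil(coverage * len(conf)))
    return np.asarray(correct, dtype=float)[order[:k]].mean()

def class_conditional_ece(confidences, correct, labels, cls, n_bins=10):
    """ECE restricted to examples whose true label is `cls` — the
    rare-class-conditional number the escalation thresholds should use.
    Requires ece() as defined in the Section 1 sketch."""
    mask = np.asarray(labels) == cls
    return ece(np.asarray(confidences)[mask],
               np.asarray(correct)[mask], n_bins=n_bins)
```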
6. Failure-Mode Tree
```mermaid
flowchart TD
    A[Student calibration alert] --> B{Symptom?}
    B -- ECE drift > 0.005 --> C[Refit T on rolling 14-day val]
    B -- top-1 agreement < 0.95 --> D[Trigger re-distillation from latest teacher]
    B -- rare-class ECE > 0.10 --> E[Per-class Platt scaling on rare classes only]
    B -- KL teacher-student > 0.15 --> F[Audit teacher drift first, then re-distill]
    B -- coverage at risk threshold drops 5pp --> G[Tighten routing threshold, not the calibrator]
    C --> H{Recovers?}
    H -- yes --> I[Hot-swap T, no model deploy]
    H -- no --> J[Re-distill from latest teacher]
    D --> J
    F --> K["Check teacher main-doc dry-run gates; retrain teacher first if needed"]
```
Research Notes — student calibration. Citations: Stanton 2021 (NeurIPS); Menon 2021 (ICML); Guo 2017 (ICML — temperature scaling); Beyer 2022 (CVPR — patient teacher); Tian 2020 (ICLR — contrastive distillation calibration); Cho 2019 (ICCV — efficacy of teacher size).
7. Persona Debate
Aiko (DS). Pre-cal student ECE is 0.078 — almost double the teacher's. We can't ship without calibration.
Priya (ML). Temperature 1.4 fits cleanly. Why not also calibrate against the teacher's soft probabilities directly?
Aiko. Distillation-aware calibration (DAC) is interesting — fit the student to match teacher probabilities at T = 1, not just true labels. Cuts ECE to 0.041 in our pilot. But it adds a training-time step we'd need to validate.
Marcus (Architect). Adoption cost?
Jordan (MLOps). Two more configs to track, one more validation gate. Not breaking, but it's extra surface area.
Sam (PM). What's the user-visible delta between 0.046 ECE and 0.041 ECE?
Aiko. Roughly the same false-rejection rate at routing thresholds — the gain is in the unsafe-route region (high confidence but wrong). DAC reduces that by ~15% in our pilot, but the CI is wide.
Resolution. Ship temperature scaling now. Pilot DAC in Q2 against a 30-day production sample; if false-rejection rate or unsafe-route rate improves significantly (paired bootstrap, p < 0.05), promote DAC.
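A sketch of the paired-bootstrap comparison the resolution calls for, assuming per-example 0/1 failure flags (e.g. unsafe-route events) for both calibrators on the same 30-day sample:

```python
import numpy as np

def paired_bootstrap_p(baseline_flags, candidate_flags, n_boot=10_000, seed=0):
    """One-sided paired bootstrap: resample example indices with replacement
    and return the fraction of resamples where the candidate (e.g. DAC) is
    NOT better than the baseline — an approximate p-value for the promotion test."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(baseline_flags, float) - np.asarray(candidate_flags, float)
    idx = rng.integers(0, len(diff), size=(n_boot, len(diff)))
    boot_means = diff[idx].mean(axis=1)
    return float((boot_means <= 0.0).mean())
```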
8. Open Problems
- Per-class temperature. A single T cannot fix per-class miscalibration when classes have different difficulty profiles. Vector scaling helps, but for distillation the teacher's per-class confidence pattern is what matters most. Open question: a vector-scaling variant that takes teacher's per-class confidence profile as a prior.
- Calibration under teacher drift. When the teacher is retrained, the student's calibrator becomes stale faster than the model itself. Open question: a teacher-aware drift detector that triggers calibrator refit before the model needs re-distillation.
- OOD calibration of the student. Out-of-domain inputs hit a smaller student harder (less capacity to flag confidently-wrong). Open question: does Outlier Exposure (Hendrycks 2019) help the student more than the teacher in absolute terms?
Bibliography
- Hinton, G. et al. (2015). Distilling the Knowledge in a Neural Network. NeurIPS Workshop.
- Stanton, S. et al. (2021). Does Knowledge Distillation Really Work? NeurIPS.
- Menon, A. K. et al. (2021). A statistical perspective on distillation. ICML.
- Guo, C. et al. (2017). On Calibration of Modern Neural Networks. ICML.
- Beyer, L. et al. (2022). Knowledge distillation: A good teacher is patient and consistent. CVPR.
- Tian, Y., Krishnan, D., Isola, P. (2020). Contrastive Representation Distillation. ICLR.
- Cho, J. H., Hariharan, B. (2019). On the Efficacy of Knowledge Distillation. ICCV.
- Naeini, M. P., Cooper, G. F., Hauskrecht, M. (2015). Obtaining Well Calibrated Probabilities Using Bayesian Binning into Quantiles. AAAI.
- Hendrycks, D., Mazeika, M., Dietterich, T. (2019). Deep Anomaly Detection with Outlier Exposure. ICLR.
Citation count for this file: 9.