Student Calibration & Quality Metrics — Knowledge Distillation (MangaAssist)
Slot 4 in the SCENARIO_TEMPLATE 8-file pattern. This document covers the calibration and quality metrics specific to a distilled student: ECE, Brier, NLL, teacher-agreement (top-1, KL, JSD), and the quality-gap-vs-teacher metric. Distilled students are systematically miscalibrated in ways that differ from from-scratch fine-tuned models — this file captures those differences.
Why Student Calibration Is Its Own Topic
A from-scratch fine-tuned model is overconfident on inputs similar to its training data (Guo 2017). A distilled student carries an additional miscalibration source: the soft labels at temperature T inflate entropy at training time, but at inference time the student is read at T = 1, where its outputs are sharper than the teacher's. Stanton 2021 and Menon 2021 document this experimentally: distilled students end up more confident than their teachers where the teacher was already confident, and less confident where the teacher was uncertain.
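A minimal sketch of the mechanism (NumPy, hypothetical 4-class logits): the same logits read at the distillation temperature T = 3 versus T = 1 show how much entropy the training-time soft targets carry that the inference-time read does not.

```python
import numpy as np

def softmax_t(logits, T):
    """Softmax at temperature T: T > 1 flattens the distribution, T < 1 sharpens it."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # stabilize the exponent
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(p):
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

logits = np.array([4.0, 1.0, 0.5, 0.2])  # hypothetical logits, illustration only
print(entropy(softmax_t(logits, 3.0)))   # high-entropy soft target seen in training
print(entropy(softmax_t(logits, 1.0)))   # much sharper distribution read at inference
```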
This matters for MangaAssist because the routing layer trusts the probabilities for its threshold-based gating (calibration deep-dive in Intent-Classification/). A poorly-calibrated student forces the routing thresholds to be re-tuned independently of the teacher's thresholds.
Shared Baseline (verbatim)
| Item | Value |
|---|---|
| Teacher | DistilBERT, 92.1% ± 0.4% acc, ECE 0.040, T = 1.6 |
| Student | TinyBERT 4-layer, 90.5% ± 0.5% acc, target ECE ≤ 0.050 post-cal |
| Distillation recipe | T = 3, α = 0.7, 5 epochs |
| Calibration evaluation set | same 5,500-example test set as teacher |
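For reference, a minimal sketch of how the recipe's T = 3 and α = 0.7 enter the training objective, assuming the standard Hinton 2015 formulation (PyTorch; all names are illustrative):

```python
import torch
import torch.nn.functional as F

def distillation_loss(s_logits, t_logits, labels, T=3.0, alpha=0.7):
    """Hinton 2015 KD objective with the baseline's T = 3, alpha = 0.7.
    The soft term is scaled by T^2 so its gradient magnitude matches the hard term."""
    soft = F.kl_div(
        F.log_softmax(s_logits / T, dim=-1),
        F.softmax(t_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(s_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```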
1. Pre-Calibration State of the Distilled Student
| Metric | Teacher (post-cal, T=1.6) | Student (no calibration) | Student (post-cal, T=1.4) |
|---|---|---|---|
| Accuracy | 92.1% ± 0.4 | 90.5% ± 0.5 | 90.5% ± 0.5 (preserved) |
| ECE (10 bins) | 0.040 ± 0.005 | 0.078 ± 0.008 | 0.046 ± 0.006 |
| Brier score | 0.071 ± 0.004 | 0.094 ± 0.006 | 0.078 ± 0.005 |
| NLL | 0.301 ± 0.020 | 0.347 ± 0.024 | 0.319 ± 0.021 |
| Mean confidence (correct) | 0.85 | 0.93 | 0.86 |
| Mean confidence (incorrect) | 0.48 | 0.66 | 0.52 |
Reading. Pre-calibration, the student is meaningfully more overconfident than the pre-calibration teacher. The mean confidence on incorrect predictions (0.66) is the dangerous metric — a student that says "0.66" when wrong is harder to gate than one that says "0.48" when wrong. Post-calibration with T = 1.4 (the optimal temperature is model-specific, so the student's T is fit independently of the teacher's T = 1.6), ECE drops to 0.046, just under the 0.050 promotion gate.
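A sketch of the 10-bin equal-width ECE behind these numbers (Naeini 2015, as used by Guo 2017); `confidences` holds top-1 probabilities and `correct` is a 0/1 vector, both hypothetical names:

```python
import numpy as np

def ece(confidences, correct, n_bins=10):
    """Equal-width-binned expected calibration error:
    sum over bins of (bin weight) * |bin accuracy - bin mean confidence|."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = len(confidences)
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            err += mask.sum() / total * gap
    return err
```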
2. Teacher-Student Agreement Metrics
These metrics are unique to distillation — they measure how faithfully the student preserves the teacher's behavior, separately from how accurate either is.
| Metric | Definition | Value | 95% CI |
|---|---|---|---|
| Top-1 agreement | fraction where argmax(student) = argmax(teacher) | 0.974 | [0.970, 0.978] |
| Top-3 agreement | fraction where teacher's top-1 is in student's top-3 | 0.997 | [0.995, 0.998] |
| KL(teacher || student) at T=1 | mean over test set | 0.067 | [0.061, 0.074] |
| KL(student || teacher) at T=1 | reverse | 0.083 | [0.076, 0.091] |
| Jensen-Shannon divergence | symmetric | 0.041 | [0.037, 0.045] |
| Confidence gap (mean \|p_t − p_s\|) | absolute prob difference on top-1 | 0.072 | [0.068, 0.076] |
Reading. The KL asymmetry (0.067 forward vs 0.083 reverse) tells us the student is slightly more confident than the teacher on average — the reverse KL(student || teacher) penalizes the student for assigning high probability where the teacher assigned low probability, so the larger reverse value means the student sometimes overcommits to predictions the teacher treated as uncertain. This is consistent with the calibration findings above.
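A sketch of how the agreement table is computed, assuming `t_logits` and `s_logits` are (N, C) arrays from the shared test set:

```python
import numpy as np
from scipy.special import rel_entr, softmax

def agreement_metrics(t_logits, s_logits):
    """Teacher-student agreement at T = 1: top-1 match rate, both KL directions,
    and Jensen-Shannon divergence (rel_entr(x, y) = x * log(x / y) elementwise)."""
    p_t = softmax(t_logits, axis=-1)
    p_s = softmax(s_logits, axis=-1)
    top1 = (p_t.argmax(-1) == p_s.argmax(-1)).mean()
    kl_ts = rel_entr(p_t, p_s).sum(-1).mean()  # KL(teacher || student), forward
    kl_st = rel_entr(p_s, p_t).sum(-1).mean()  # KL(student || teacher), reverse
    m = 0.5 * (p_t + p_s)
    jsd = 0.5 * (rel_entr(p_t, m).sum(-1) + rel_entr(p_s, m).sum(-1)).mean()
    return {"top1": top1, "kl_ts": kl_ts, "kl_st": kl_st, "jsd": jsd}
```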
Research Notes — agreement. Citations: Stanton 2021 (NeurIPS) — disagreement metrics are diagnostic; Beyer 2022 (CVPR) — patient teacher reduces top-1 disagreement by ~30%; Menon 2021 (ICML) — statistical view of distillation links KL to a Fisher-information regularizer.
3. Calibration-Method Sweep (student-specific)
Same methods as in the Intent-Classification calibration deep-dive, applied to the student.
| Method | Accuracy preserved? | ECE | Brier | When to prefer |
|---|---|---|---|---|
| No calibration | yes | 0.078 ± 0.008 | 0.094 ± 0.006 | never |
| Temperature scaling, T = 1.4 (chosen) | yes | 0.046 ± 0.006 | 0.078 ± 0.005 | default |
| Vector scaling | yes | 0.044 ± 0.006 | 0.077 ± 0.005 | when class-conditional miscal evident |
| Isotonic OvR | no | 0.038 ± 0.005 | 0.075 ± 0.005 | when ECE budget < 0.04 |
| Histogram binning (10 bins) | no | 0.054 ± 0.007 | 0.082 ± 0.006 | low-data regime only |
| Distillation-aware calibration (DAC, fit on T_distill-soft labels) | yes | 0.041 ± 0.005 | 0.076 ± 0.005 | active research; not yet production-ready |
| Joint teacher-student fit (Menon 2021 inspired) | yes | 0.043 ± 0.005 | 0.077 ± 0.005 | research |
Reading. Vanilla temperature scaling is robust and simple; isotonic gives lower ECE but flips predictions ~0.6% of the time, complicating the promotion gate. Recommendation: keep temperature scaling at T = 1.4 for the student; revisit DAC in Q2 next year.
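For reference, a minimal sketch of the temperature fit itself (Guo 2017): a single scalar T chosen to minimize NLL on held-out validation logits. `val_logits` and `val_labels` are assumed array names:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import log_softmax

def fit_temperature(val_logits, val_labels):
    """Fit one scalar T by minimizing validation NLL; accuracy is preserved
    because dividing logits by a positive T never changes the argmax."""
    def nll(T):
        logp = log_softmax(val_logits / T, axis=-1)
        return -logp[np.arange(len(val_labels)), val_labels].mean()
    res = minimize_scalar(nll, bounds=(0.5, 5.0), method="bounded")
    return res.x  # ~1.4 for this student on our validation split
```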
4. Quality-Gap-vs-Teacher Tracking
A subtle metric: the teacher-student accuracy gap can widen even when each model, monitored on its own, stays within its alert band. We track:
\[ \Delta_{\text{quality}} = \text{acc}_{\text{teacher}} - \text{acc}_{\text{student}} \]
| Period | Teacher | Student | Δ | Notes |
|---|---|---|---|---|
| Day of distillation | 92.1% | 90.5% | 1.6pp | initial gap |
| Week 1 | 92.0% | 90.4% | 1.6pp | stable |
| Week 4 | 91.8% | 90.0% | 1.8pp | drift; both degrade |
| Week 8 (no re-distill) | 91.5% | 89.5% | 2.0pp | gap widening |
| After re-distill | 92.0% | 90.6% | 1.4pp | gap restored |
SLA: if Δ_quality grows by ≥ 0.5pp over the initial gap for two consecutive weeks, trigger re-distillation. Re-distilling with the latest teacher closes the gap without a full retrain.
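A minimal sketch of the SLA trigger, assuming weekly gap measurements in percentage points:

```python
def should_redistill(gaps_pp, initial_gap_pp=1.6, slack_pp=0.5, weeks=2):
    """True if the teacher-student gap exceeds initial_gap + slack for
    `weeks` consecutive weekly measurements (the re-distillation SLA)."""
    run = 0
    for gap in gaps_pp:
        run = run + 1 if gap >= initial_gap_pp + slack_pp else 0
        if run >= weeks:
            return True
    return False

# Trajectory from the table above: the week-8 gap (2.0pp) is still below
# the 2.1pp trigger, so this series does not yet fire the SLA.
print(should_redistill([1.6, 1.6, 1.8, 2.0]))  # False
```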
5. Confidence Intervals — Student Quality (additional)
| Metric | Point estimate | 95% bootstrap CI |
|---|---|---|
| Student selective accuracy at coverage = 0.90 | 0.937 | [0.928, 0.945] |
| Student AURC | 0.0264 | [0.0231, 0.0303] |
| Coverage at risk threshold 0.05 | 0.79 | [0.76, 0.82] |
| ECE on rare-class only (escalation) | 0.092 | [0.071, 0.117] |
Reading. The rare-class ECE (0.092) is much worse than overall ECE (0.046). Distilled students inherit the teacher's miscalibration on rare classes and may amplify it. Routing thresholds for escalation should be set on rare-class-conditional ECE, not overall ECE.
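Sketches of the two quantities this section leans on, reusing `ece()` from the Section 1 sketch; all array names are hypothetical:

```python
import numpy as np

def selective_accuracy(confidences, correct, coverage=0.90):
    """Accuracy on the most-confident `coverage` fraction of predictions."""
    conf = np.asarray(confidences)
    order = np.argsort(-conf)
    k = int(np.ceil(coverage * len(conf)))
    return np.asarray(correct, dtype=float)[order[:k]].mean()

def class_conditional_ece(confidences, correct, labels, cls, n_bins=10):
    """ECE restricted to examples whose true label is `cls` — the
    rare-class-conditional number the escalation thresholds should use.
    Requires ece() as defined in the Section 1 sketch."""
    mask = np.asarray(labels) == cls
    return ece(np.asarray(confidences)[mask],
               np.asarray(correct)[mask], n_bins=n_bins)
```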
6. Failure-Mode Tree
```mermaid
flowchart TD
    A[Student calibration alert] --> B{Symptom?}
    B -- ECE drift > 0.005 --> C[Refit T on rolling 14-day val]
    B -- top-1 agreement < 0.95 --> D[Trigger re-distillation from latest teacher]
    B -- rare-class ECE > 0.10 --> E[Per-class Platt scaling on rare classes only]
    B -- KL teacher-student > 0.15 --> F[Audit teacher drift first, then re-distill]
    B -- coverage at risk threshold drops 5pp --> G[Tighten routing threshold, not the calibrator]
    C --> H{Recovers?}
    H -- yes --> I[Hot-swap T, no model deploy]
    H -- no --> J[Re-distill from latest teacher]
    D --> J
    F --> K["Check teacher main-doc dry-run gates; retrain teacher first if needed"]
```
Research Notes — student calibration. Citations: Stanton 2021 (NeurIPS); Menon 2021 (ICML); Guo 2017 (ICML — temperature scaling); Beyer 2022 (CVPR — patient teacher); Tian 2020 (ICLR — contrastive distillation calibration); Cho 2019 (ICCV — efficacy of teacher size).
7. Persona Debate
Aiko (DS). Pre-cal student ECE is 0.078 — almost double the teacher's. We can't ship without calibration.
Priya (ML). Temperature 1.4 fits cleanly. Why not also calibrate against the teacher's soft probabilities directly?
Aiko. Distillation-aware calibration (DAC) is interesting — fit the student to match teacher probabilities at T = 1, not just true labels. Cuts ECE to 0.041 in our pilot. But it adds a training-time step we'd need to validate.
Marcus (Architect). Adoption cost?
Jordan (MLOps). Two more configs to track, one more validation gate. Not breaking, but it's extra surface area.
Sam (PM). What's the user-visible delta between 0.046 ECE and 0.041 ECE?
Aiko. Roughly the same false-rejection rate at routing thresholds — the gain is in the unsafe-route region (high confidence but wrong). DAC reduces that by ~15% in our pilot, but the CI is wide.
Resolution. Ship temperature scaling now. Pilot DAC in Q2 against a 30-day production sample; if false-rejection rate or unsafe-route rate improves significantly (paired bootstrap, p < 0.05), promote DAC.
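A sketch of the paired-bootstrap comparison the resolution calls for, assuming per-example 0/1 failure flags (e.g. unsafe-route events) for both calibrators on the same 30-day sample:

```python
import numpy as np

def paired_bootstrap_p(baseline_flags, candidate_flags, n_boot=10_000, seed=0):
    """One-sided paired bootstrap: resample example indices with replacement
    and return the fraction of resamples where the candidate (e.g. DAC) is
    NOT better than the baseline — an approximate p-value for the promotion test."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(baseline_flags, float) - np.asarray(candidate_flags, float)
    idx = rng.integers(0, len(diff), size=(n_boot, len(diff)))
    boot_means = diff[idx].mean(axis=1)
    return float((boot_means <= 0.0).mean())
```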
8. Open Problems
- Per-class temperature. A single T cannot fix per-class miscalibration when classes have different difficulty profiles. Vector scaling helps, but for distillation the teacher's per-class confidence pattern is what matters most. Open question: a vector-scaling variant that takes teacher's per-class confidence profile as a prior.
- Calibration under teacher drift. When the teacher is retrained, the student's calibrator becomes stale faster than the model itself. Open question: a teacher-aware drift detector that triggers calibrator refit before the model needs re-distillation.
- OOD calibration of the student. Out-of-domain inputs hit a smaller student harder (less capacity to flag confidently-wrong). Open question: does Outlier Exposure (Hendrycks 2019) help the student more than the teacher in absolute terms?
Bibliography
- Hinton, G. et al. (2015). Distilling the Knowledge in a Neural Network. NeurIPS Workshop.
- Stanton, S. et al. (2021). Does Knowledge Distillation Really Work? NeurIPS.
- Menon, A. K. et al. (2021). A statistical perspective on distillation. ICML.
- Guo, C. et al. (2017). On Calibration of Modern Neural Networks. ICML.
- Beyer, L. et al. (2022). Knowledge distillation: A good teacher is patient and consistent. CVPR.
- Tian, Y., Krishnan, D., Isola, P. (2020). Contrastive Representation Distillation. ICLR.
- Cho, J. H., Hariharan, B. (2019). On the Efficacy of Knowledge Distillation. ICCV.
- Naeini, M. P., Cooper, G. F., Hauskrecht, M. (2015). Obtaining Well Calibrated Probabilities Using Bayesian Binning into Quantiles. AAAI.
- Hendrycks, D., Mazeika, M., Dietterich, T. (2019). Deep Anomaly Detection with Outlier Exposure. ICLR.
Citation count for this file: 9.