Teacher Rotation Strategy — Knowledge Distillation (MangaAssist)
Slot 9 (discovery / evolution) in the SCENARIO_TEMPLATE 8-file pattern. The earlier KD docs treat the teacher as static. In production, the teacher is retrained monthly (per the Intent-Classification dry-run doc), and any change in the teacher invalidates assumptions baked into the student. This doc handles teacher rotation: how to manage the lifecycle of multiple teacher checkpoints over time without thrashing the student.
Why This Is a Real Problem
Three observations from production:
- Teacher updates don't always require student updates. A 0.2pp teacher accuracy improvement is below the student's bootstrap CI half-width — re-distilling is wasted compute.
- Teacher updates can break student calibration silently. Even when student accuracy is preserved, the student's calibrator (T = 1.4) was fit against a specific teacher; updating the teacher shifts the soft-label distribution, and ECE can drift even though accuracy holds.
- Multiple teachers concurrently in production. During a teacher canary period, one student may have been distilled against teacher v_n while a separate experiment uses teacher v_{n+1}. We need a clean way to handle this.
A research-grade rotation strategy answers four questions:
- When to re-distill (drift triggers, not calendar)?
- Which teacher to use (latest, ensemble, or stable LTS)?
- How to roll out the new student (replace, ensemble, or A/B)?
- How to roll back (which teacher version do we keep frozen)?
Shared Baseline (verbatim)
| Item | Value |
|---|---|
| Teacher | DistilBERT, retrained monthly (per dry-run doc) |
| Student | TinyBERT 4-layer, distilled from a single teacher checkpoint |
| Teacher checkpoint store | s3://mangaassist-ml-prod/teachers/distilbert-v{n}/ with metadata: training data hash, accuracy, ECE, T |
1. Re-Distillation Trigger Matrix
| Trigger | Threshold | Action | Rationale |
|---|---|---|---|
| Teacher accuracy delta | ≥ +0.5pp | re-distill | a meaningful capability lift; ignore noise |
| Teacher ECE delta | ≥ +0.01 | re-distill | recalibration alone won't carry through soft labels |
| Teacher KL drift to old teacher | ≥ 0.05 on 14-day prod sample | re-distill | distribution-of-soft-labels has shifted |
| Student top-1 agreement with teacher | < 0.95 | re-distill | gap has opened by >2pp from initial 0.974 |
| Quality gap drift | acc_t − acc_s > 2.1pp (initial 1.6pp gap + 0.5pp drift allowance) | re-distill | student is falling behind |
| Teacher major version (architecture change) | always | re-distill | non-negotiable |
| None of the above | — | skip | preserve compute |
Implication. Re-distillation is event-driven, not monthly by default. Over the last 12 months we re-distilled 7 times (vs 12 monthly retrains), saving roughly $210 in compute and about 5 days of monitoring overhead.
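To make the matrix concrete, here is a minimal sketch of how the triggers could be evaluated against checkpoint metadata. The threshold values mirror the table above; the dataclasses, field names, and `should_redistill` helper are illustrative assumptions, not the production pipeline's API.

```python
from dataclasses import dataclass

# Illustrative containers; field names are assumptions, not the registry schema.
@dataclass
class TeacherStats:
    accuracy: float          # held-out accuracy, in percentage points
    ece: float               # expected calibration error
    major_version: int       # architecture-level version
    kl_to_previous: float    # mean KL(new || old) over soft labels on a 14-day prod sample

@dataclass
class StudentStats:
    accuracy: float          # held-out accuracy, in percentage points
    top1_agreement: float    # fraction of prod samples where student top-1 == teacher top-1

def should_redistill(old_t: TeacherStats, new_t: TeacherStats, student: StudentStats) -> list[str]:
    """Return the list of fired triggers from the matrix in section 1; an empty list means skip."""
    fired = []
    if new_t.major_version != old_t.major_version:
        fired.append("teacher major version change")
    if new_t.accuracy - old_t.accuracy >= 0.5:
        fired.append("teacher accuracy delta >= +0.5pp")
    if new_t.ece - old_t.ece >= 0.01:
        fired.append("teacher ECE delta >= +0.01")
    if new_t.kl_to_previous >= 0.05:
        fired.append("teacher soft-label KL drift >= 0.05")
    if student.top1_agreement < 0.95:
        fired.append("student/teacher top-1 agreement < 0.95")
    if new_t.accuracy - student.accuracy > 1.6 + 0.5:
        fired.append("quality gap drift > 2.1pp")
    return fired
```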
2. Which Teacher to Use
Three strategies, each with trade-offs:
| Strategy | Description | Pros | Cons | When to prefer |
|---|---|---|---|---|
| Latest checkpoint (chosen) | always distill from the newest teacher | simple; tracks latest knowledge | brittle to teacher anomalies | high-frequency change, healthy teacher monitoring |
| Long-Term-Stable (LTS) checkpoint | freeze a teacher every quarter; distill against LTS | predictable student behavior | falls behind; misses recent intents | regulated environments |
| Teacher ensemble | distill against average soft-label of last K teachers | smoother soft labels; less variance | K-fold compute cost | when single-teacher noise is documented |
Reading. We use the latest checkpoint but log the teacher version in every student artifact. This makes rollback (§4) trivial: redeploy the previous student.
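As a sketch of what the metadata pinning could look like, the snippet below writes a student manifest that records the paired teacher checkpoint. Only the checkpoint-store layout comes from the baseline table; the file names, fields, and function are hypothetical.

```python
import hashlib
import json
import pathlib

def write_student_manifest(student_dir: str, teacher_version: int,
                           teacher_data_hash: str, calib_temperature: float) -> None:
    """Pin the paired teacher into the student artifact so rollback means
    "redeploy the previous student". Field and file names are illustrative."""
    weights = pathlib.Path(student_dir) / "model.safetensors"   # hypothetical weights file name
    manifest = {
        "teacher_checkpoint": f"s3://mangaassist-ml-prod/teachers/distilbert-v{teacher_version}/",
        "teacher_training_data_hash": teacher_data_hash,
        "calibration_temperature": calib_temperature,            # e.g. T = 1.4, refit per distillation
        "student_weights_sha256": hashlib.sha256(weights.read_bytes()).hexdigest(),
    }
    (pathlib.Path(student_dir) / "manifest.json").write_text(json.dumps(manifest, indent=2))
```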
Research Notes — multi-teacher. Citations: You et al. 2017 (KDD — Learning from Multiple Teacher Networks); Tarvainen & Valpola 2017 (NeurIPS — Mean Teacher); Zhao et al. 2022 (CVPR — Decoupled Knowledge Distillation, DKD) — supports separating soft-label and hard-label contributions, useful when teachers disagree.
3. Rollout Pattern: Re-Distillation as a Staged Deploy
A new student is never deployed in one step. Distillation itself is fast (~24 min), but downstream systems depend on the student's calibrator and routing thresholds, so every swap goes through a staged deploy.
| Stage | Traffic | Duration | Pass criterion |
|---|---|---|---|
| Shadow | 0% (mirror) | 24 h | acceptance suite (§6 of the dry-run doc) |
| Canary 5% | 5% | 48 h | rare-class accuracy ≥ 87.0% on canary slice |
| Canary 25% | 25% | 48 h | weighted error ≤ baseline + 0.5pp |
| 50% | 50% | 24 h | no production-weighted regression |
| 100% | full | n/a | full deploy; old student frozen for 14 days as rollback target |
If any gate fails: revert to the previous student (pinned to its paired teacher version), file an incident, and revisit the trigger thresholds.
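One way to express the stage table and the revert rule in code, as a sketch: the traffic fractions, durations, and gate thresholds come from the table above, while the `ROLLOUT_STAGES` structure, stage names, and `advance` helper are illustrative, not the deploy system's API.

```python
# Staged rollout from section 3 expressed as data; thresholds come from the table.
ROLLOUT_STAGES = [
    {"name": "shadow",    "traffic": 0.00, "hours": 24, "gate": "acceptance suite (dry-run doc §6)"},
    {"name": "canary-5",  "traffic": 0.05, "hours": 48, "gate": "rare-class accuracy >= 87.0% on canary slice"},
    {"name": "canary-25", "traffic": 0.25, "hours": 48, "gate": "weighted error <= baseline + 0.5pp"},
    {"name": "half",      "traffic": 0.50, "hours": 24, "gate": "no production-weighted regression"},
    {"name": "full",      "traffic": 1.00, "hours": None, "gate": None},  # old student frozen 14 days
]

def advance(stage_index: int, gate_passed: bool) -> str:
    """Advance to the next stage on a passing gate; any failure reverts to the previous student."""
    if not gate_passed:
        return "rollback: redeploy previous student (paired with its teacher), file incident"
    if stage_index + 1 >= len(ROLLOUT_STAGES):
        return "done: old student frozen 14 days as rollback target"
    return f"promote to {ROLLOUT_STAGES[stage_index + 1]['name']}"
```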
4. Rollback Discipline
Three rollback dimensions to track:
- Student artifact (model weights + calibrator + thresholds): keep last 3 versions hot, last 12 cold.
- Teacher artifact: keep last 6 versions cold; the paired teacher for each student is metadata-pinned.
- Decision metadata (the trigger that caused this re-distillation, the gate that passed): keep forever in the model registry.
Rollback time budget: 90 minutes from paging on-call to traffic restoration. Verified quarterly via game-day exercises.
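A sketch of the retention policy as data plus a pruning helper; the retention counts come from the bullets above, while the registry-entry shape and the function itself are assumptions.

```python
# Retention rules from section 4; the registry-entry shape is an assumption.
RETENTION = {
    "student": {"hot": 3, "cold": 12},           # weights + calibrator + thresholds
    "teacher": {"hot": 0, "cold": 6},            # paired teacher also pinned in each student manifest
    "decision_metadata": {"keep_forever": True}, # trigger + gate results, kept in the model registry
}

def prunable(versions: list, kind: str) -> list:
    """Given registry entries sorted newest-first, return those the policy allows deleting."""
    policy = RETENTION[kind]
    if policy.get("keep_forever"):
        return []
    return versions[policy["hot"] + policy["cold"]:]
```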
5. Confidence Intervals on Lifecycle Metrics
n = 12 months of production logs.
| Metric | Point | 95% CI |
|---|---|---|
| Re-distillations per month | 0.58 | [0.34, 0.92] |
| Mean time between re-distillations | 51 days | [38, 70] |
| Re-distillation success rate (pass canary) | 86% | [62%, 96%] |
| Mean compute cost per re-distillation | $32 | [$28, $37] |
| Mean rollback time (when triggered) | 47 min | [38, 58] |
Reading. ~6 re-distillations/year averaging $32 each = ~$190/year compute. Compared to monthly forced re-distillation (~$380/year), event-driven saves ~50% with no observed quality regression.
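The intervals above can be reproduced with a percentile bootstrap over the 12 months of logs. The doc does not state the exact resampling scheme, so the snippet below is one plausible implementation, with illustrative sample values rather than the real log.

```python
import numpy as np

def bootstrap_ci(samples: np.ndarray, stat=np.mean, n_boot: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI: resample with replacement, recompute the statistic,
    and take the alpha/2 and 1 - alpha/2 quantiles."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(samples), size=(n_boot, len(samples)))
    stats = stat(samples[idx], axis=1)
    return float(np.quantile(stats, alpha / 2)), float(np.quantile(stats, 1 - alpha / 2))

# Example: per-event compute cost in dollars (illustrative numbers only).
costs = np.array([29.0, 31.5, 36.8, 30.2, 33.1, 28.7, 34.9])
print(bootstrap_ci(costs))
```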
6. Failure-Mode Tree
```mermaid
flowchart TD
    A[Re-distillation triggered or scheduled] --> B[Fetch latest teacher checkpoint]
    B --> C{Teacher healthy?}
    C -- accuracy and ECE in expected range --> D[Run distillation pipeline]
    C -- regressed --> E["Skip re-distillation; alert teacher pipeline owners"]
    D --> F[Run student acceptance suite]
    F -- pass --> G[Shadow deploy 24 h]
    F -- fail --> H{Failure type?}
    H -- insufficient quality --> I["Adjust alpha or epochs; retry once"]
    H -- calibration drift --> J["Refit T before deploy; retry once"]
    H -- top-1 agreement low --> K["Switch to ensemble teacher (last 3 checkpoints); retry"]
    G --> L["Canary deploy stages: 5%, 25%, 50%, 100%"]
    L -- gate failure --> M["Rollback to previous student; keep teacher in cold storage"]
    L -- pass all stages --> N["Promote new student; freeze old artifact for 14 days"]
```
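The acceptance-suite failure branch of the tree, restated as a dispatch table; the handler strings and retry-limit logic are illustrative, with each remedy matching one retry path in the diagram.

```python
# Remedies for acceptance-suite failures; names are illustrative, one per retry path above.
FAILURE_REMEDIES = {
    "insufficient_quality": "adjust alpha or epochs, retry once",
    "calibration_drift":    "refit temperature T before deploy, retry once",
    "low_top1_agreement":   "distill against ensemble of last 3 teacher checkpoints, retry",
}

def handle_acceptance_failure(failure_type: str, attempts: int) -> str:
    """Each failure type gets at most one retry; anything else aborts and alerts the owners."""
    if attempts >= 1:
        return "abort: keep current student, alert pipeline owners"
    return FAILURE_REMEDIES.get(failure_type, "abort: unknown failure type, alert pipeline owners")
```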
7. Persona Debate
Jordan (MLOps). Event-driven re-distillation feels right. Calendar-driven was burning compute on no-op months.
Aiko (DS). I want one more trigger: rare-class drift. Accuracy on escalation can drop without overall accuracy dropping. Add a per-class alarm at -2pp on rare classes.
Priya (ML). Already in §1 implicitly via "quality gap drift" — let me make it explicit.
Marcus (Architect). Teacher ensemble idea — defer it. Adds 3× the soft-label storage and a coordination layer. Not worth it until we see single-teacher anomalies repeatedly.
Sam (PM). Six re-distills/year at $32 each is fine; rollback discipline is the real win. Quarterly game-days are non-negotiable.
Resolution. Adopt event-driven trigger matrix; latest-checkpoint strategy; staged rollout with paired teacher metadata. Re-evaluate ensemble strategy after a single-teacher incident or in 12 months, whichever comes first.
8. Open Problems
- Continuous distillation. Instead of discrete re-distillation events, can we continuously update the student via online KD against the latest teacher? Tarvainen & Valpola's Mean Teacher offers one direction; a minimal online-KD sketch follows this list. Open question: can online KD preserve calibration without per-batch validation?
- Knowledge accumulation across teachers. Teacher v_{n+1} may know things teacher v_n didn't — and the student should remember both. Open question: a continual-distillation regime where the student is exposed to the union of soft-label distributions over a window of teachers (intersects Learning-Strategies/06 continual learning).
- Teacher-aware OOD. When the teacher recognizes new intents (after a discovery-pipeline-driven taxonomy update), the student must inherit the new label semantics. Open question: a "label-aware re-distillation" that preserves old labels' soft profiles while adopting new ones.
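For the continuous-distillation direction, the sketch below shows a single online-KD step against the current teacher, assuming HuggingFace-style classifiers where `model(**batch).logits` returns class logits. It only makes the direction concrete; the open question about preserving calibration without per-batch validation is untouched here.

```python
import torch
import torch.nn.functional as F

def online_kd_step(student, teacher, batch, optimizer, T: float = 1.4) -> float:
    """One online-KD step: minimise KL between temperature-softened teacher and
    student distributions on a production batch. Assumes HuggingFace-style models."""
    teacher.eval()
    student.train()
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits
    student_logits = student(**batch).logits
    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)   # standard Hinton-style temperature scaling of the KD gradient
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```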
Bibliography
- Hinton, G. et al. (2015). Distilling the Knowledge in a Neural Network. NeurIPS Workshop.
- Stanton, S. et al. (2021). Does Knowledge Distillation Really Work? NeurIPS.
- Tarvainen, A., Valpola, H. (2017). Mean Teachers are Better Role Models. NeurIPS.
- You, S., Xu, C., Xu, C., Tao, D. (2017). Learning from Multiple Teacher Networks. KDD.
- Zhao, B., Cui, Q., Song, R., Qiu, Y., Liang, J. (2022). Decoupled Knowledge Distillation. CVPR.
- Mirzadeh, S. I. et al. (2020). Improved Knowledge Distillation via Teacher Assistant. AAAI.
- Beyer, L. et al. (2022). A good teacher is patient and consistent. CVPR.
- Pineau, J. et al. (2021). Reproducibility checklist. NeurIPS.
- Gebru, T. et al. (2021). Datasheets for Datasets. CACM.
- Bouthillier, X. et al. (2021). Accounting for Variance. MLSys.
Citation count for this file: 10.