Teacher Rotation Strategy — Knowledge Distillation (MangaAssist)
Slot 9 (discovery / evolution) in the SCENARIO_TEMPLATE 8-file pattern. The earlier KD docs treat the teacher as static. In production, the teacher is retrained monthly (per the Intent-Classification dry-run doc), and any change in the teacher invalidates assumptions baked into the student. This doc handles teacher rotation: how to manage the lifecycle of multiple teacher checkpoints over time without thrashing the student.
Why This Is a Real Problem
Three observations from production:
- Teacher updates don't always require student updates. A 0.2pp teacher accuracy improvement is below the student's bootstrap CI half-width — re-distilling is wasted compute.
- Teacher updates can break student calibration silently. Even when student accuracy is preserved, the student's calibrator (T = 1.4) was fit against a specific teacher; updating the teacher shifts the soft-label distribution, and ECE can drift even though accuracy holds.
- Multiple teachers concurrently in production. During a teacher canary period, one student may have been distilled against teacher v_n while a separate experiment uses teacher v_{n+1}. We need a clean way to handle this.
A research-grade rotation strategy answers four questions:
- When to re-distill (drift triggers, not calendar)?
- Which teacher to use (latest, ensemble, or stable LTS)?
- How to roll out the new student (replace, ensemble, or A/B)?
- How to roll back (which teacher version do we keep frozen)?
Shared Baseline (verbatim)
| Item | Value |
|---|---|
| Teacher | DistilBERT, retrained monthly (per dry-run doc) |
| Student | TinyBERT 4-layer, distilled from a single teacher checkpoint |
| Teacher checkpoint store | s3://mangaassist-ml-prod/teachers/distilbert-v{n}/ with metadata: training data hash, accuracy, ECE, T |
1. Re-Distillation Trigger Matrix
| Trigger | Threshold | Action | Rationale |
|---|---|---|---|
| Teacher accuracy delta | ≥ +0.5pp | re-distill | a meaningful capability lift; ignore noise |
| Teacher ECE delta | ≥ +0.01 | re-distill | recalibration alone won't carry through soft labels |
| Teacher KL drift to old teacher | ≥ 0.05 on 14-day prod sample | re-distill | distribution-of-soft-labels has shifted |
| Student top-1 agreement with teacher | < 0.95 | re-distill | gap has opened by >2pp from initial 0.974 |
| Quality gap drift | acc_t − acc_s > 2.1pp (initial 1.6pp gap + 0.5pp drift allowance) | re-distill | student is falling behind |
| Teacher major version (architecture change) | always | re-distill | non-negotiable |
| None of the above | — | skip | preserve compute |
Implication. Re-distillation is event-driven, not monthly by default. Over the last 12 months we re-distilled 7 times (vs 12 monthly retrains), saving roughly $210 in compute and about 5 days of monitoring overhead.
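To make the matrix concrete, here is a minimal sketch of how the triggers could be evaluated against checkpoint metadata. The threshold values mirror the table above; the dataclasses, field names, and `should_redistill` helper are illustrative assumptions, not the production pipeline's API.

```python
from dataclasses import dataclass

# Illustrative containers; field names are assumptions, not the registry schema.
@dataclass
class TeacherStats:
    accuracy: float          # held-out accuracy, in percentage points
    ece: float               # expected calibration error
    major_version: int       # architecture-level version
    kl_to_previous: float    # mean KL(new || old) over soft labels on a 14-day prod sample

@dataclass
class StudentStats:
    accuracy: float          # held-out accuracy, in percentage points
    top1_agreement: float    # fraction of prod samples where student top-1 == teacher top-1

def should_redistill(old_t: TeacherStats, new_t: TeacherStats, student: StudentStats) -> list[str]:
    """Return the list of fired triggers from the matrix in section 1; an empty list means skip."""
    fired = []
    if new_t.major_version != old_t.major_version:
        fired.append("teacher major version change")
    if new_t.accuracy - old_t.accuracy >= 0.5:
        fired.append("teacher accuracy delta >= +0.5pp")
    if new_t.ece - old_t.ece >= 0.01:
        fired.append("teacher ECE delta >= +0.01")
    if new_t.kl_to_previous >= 0.05:
        fired.append("teacher soft-label KL drift >= 0.05")
    if student.top1_agreement < 0.95:
        fired.append("student/teacher top-1 agreement < 0.95")
    if new_t.accuracy - student.accuracy > 1.6 + 0.5:
        fired.append("quality gap drift > 2.1pp")
    return fired
```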
2. Which Teacher to Use
Three strategies, each with trade-offs:
| Strategy | Description | Pros | Cons | When to prefer |
|---|---|---|---|---|
| Latest checkpoint (chosen) | always distill from the newest teacher | simple; tracks latest knowledge | brittle to teacher anomalies | high-frequency change, healthy teacher monitoring |
| Long-Term-Stable (LTS) checkpoint | freeze a teacher every quarter; distill against LTS | predictable student behavior | falls behind; misses recent intents | regulated environments |
| Teacher ensemble | distill against average soft-label of last K teachers | smoother soft labels; less variance | K-fold compute cost | when single-teacher noise is documented |
Reading. We use the latest checkpoint but log the teacher version in every student artifact. This makes rollback (§4) trivial: redeploy the previous student.
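As a sketch of what the metadata pinning could look like, the snippet below writes a student manifest that records the paired teacher checkpoint. Only the checkpoint-store layout comes from the baseline table; the file names, fields, and function are hypothetical.

```python
import hashlib
import json
import pathlib

def write_student_manifest(student_dir: str, teacher_version: int,
                           teacher_data_hash: str, calib_temperature: float) -> None:
    """Pin the paired teacher into the student artifact so rollback means
    "redeploy the previous student". Field and file names are illustrative."""
    weights = pathlib.Path(student_dir) / "model.safetensors"   # hypothetical weights file name
    manifest = {
        "teacher_checkpoint": f"s3://mangaassist-ml-prod/teachers/distilbert-v{teacher_version}/",
        "teacher_training_data_hash": teacher_data_hash,
        "calibration_temperature": calib_temperature,            # e.g. T = 1.4, refit per distillation
        "student_weights_sha256": hashlib.sha256(weights.read_bytes()).hexdigest(),
    }
    (pathlib.Path(student_dir) / "manifest.json").write_text(json.dumps(manifest, indent=2))
```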
Research Notes — multi-teacher. Citations: You et al. 2017 (KDD — Learning from Multiple Teacher Networks); Tarvainen & Valpola 2017 (NeurIPS — Mean Teacher); Zhao et al. 2022 (CVPR — Decoupled Knowledge Distillation, DKD) — supports separating soft-label and hard-label contributions, useful when teachers disagree.
3. Rollout Pattern: Re-Distillation as a Staged Deploy
A new student is never deployed in one step. Distillation itself is fast (~24 min), but downstream systems depend on the student's calibrator and routing thresholds, so every swap goes through a staged deploy.
| Stage | Traffic | Duration | Pass criterion |
|---|---|---|---|
| Shadow | 0% (mirror) | 24 h | acceptance suite (§6 of the dry-run doc) |
| Canary 5% | 5% | 48 h | rare-class accuracy ≥ 87.0% on canary slice |
| Canary 25% | 25% | 48 h | weighted error ≤ baseline + 0.5pp |
| 50% | 50% | 24 h | no production-weighted regression |
| 100% | full | n/a | full deploy; old student frozen for 14 days as rollback target |
If any gate fails: revert to the previous student (pinned to its paired teacher version), file an incident, and revisit the trigger thresholds.
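One way to express the stage table and the revert rule in code, as a sketch: the traffic fractions, durations, and gate thresholds come from the table above, while the `ROLLOUT_STAGES` structure, stage names, and `advance` helper are illustrative, not the deploy system's API.

```python
# Staged rollout from section 3 expressed as data; thresholds come from the table.
ROLLOUT_STAGES = [
    {"name": "shadow",    "traffic": 0.00, "hours": 24, "gate": "acceptance suite (dry-run doc §6)"},
    {"name": "canary-5",  "traffic": 0.05, "hours": 48, "gate": "rare-class accuracy >= 87.0% on canary slice"},
    {"name": "canary-25", "traffic": 0.25, "hours": 48, "gate": "weighted error <= baseline + 0.5pp"},
    {"name": "half",      "traffic": 0.50, "hours": 24, "gate": "no production-weighted regression"},
    {"name": "full",      "traffic": 1.00, "hours": None, "gate": None},  # old student frozen 14 days
]

def advance(stage_index: int, gate_passed: bool) -> str:
    """Advance to the next stage on a passing gate; any failure reverts to the previous student."""
    if not gate_passed:
        return "rollback: redeploy previous student (paired with its teacher), file incident"
    if stage_index + 1 >= len(ROLLOUT_STAGES):
        return "done: old student frozen 14 days as rollback target"
    return f"promote to {ROLLOUT_STAGES[stage_index + 1]['name']}"
```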
4. Rollback Discipline
Three rollback dimensions to track:
- Student artifact (model weights + calibrator + thresholds): keep last 3 versions hot, last 12 cold.
- Teacher artifact: keep last 6 versions cold; the paired teacher for each student is metadata-pinned.
- Decision metadata (the trigger that caused this re-distillation, the gate that passed): keep forever in the model registry.
Rollback time budget: 90 minutes from paging on-call to traffic restoration. Verified quarterly via game-day exercises.
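A sketch of the retention policy as data plus a pruning helper; the retention counts come from the bullets above, while the registry-entry shape and the function itself are assumptions.

```python
# Retention rules from section 4; the registry-entry shape is an assumption.
RETENTION = {
    "student": {"hot": 3, "cold": 12},           # weights + calibrator + thresholds
    "teacher": {"hot": 0, "cold": 6},            # paired teacher also pinned in each student manifest
    "decision_metadata": {"keep_forever": True}, # trigger + gate results, kept in the model registry
}

def prunable(versions: list, kind: str) -> list:
    """Given registry entries sorted newest-first, return those the policy allows deleting."""
    policy = RETENTION[kind]
    if policy.get("keep_forever"):
        return []
    return versions[policy["hot"] + policy["cold"]:]
```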
5. Confidence Intervals on Lifecycle Metrics
n = 12 months of production logs.
| Metric | Point | 95% CI |
|---|---|---|
| Re-distillations per month | 0.58 | [0.34, 0.92] |
| Mean time between re-distillations | 51 days | [38, 70] |
| Re-distillation success rate (pass canary) | 86% | [62%, 96%] |
| Mean compute cost per re-distillation | $32 | [$28, $37] |
| Mean rollback time (when triggered) | 47 min | [38, 58] |
Reading. ~6 re-distillations/year averaging $32 each = ~$190/year compute. Compared to monthly forced re-distillation (~$380/year), event-driven saves ~50% with no observed quality regression.
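The intervals above can be reproduced with a percentile bootstrap over the 12 months of logs. The doc does not state the exact resampling scheme, so the snippet below is one plausible implementation, with illustrative sample values rather than the real log.

```python
import numpy as np

def bootstrap_ci(samples: np.ndarray, stat=np.mean, n_boot: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI: resample with replacement, recompute the statistic,
    and take the alpha/2 and 1 - alpha/2 quantiles."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(samples), size=(n_boot, len(samples)))
    stats = stat(samples[idx], axis=1)
    return float(np.quantile(stats, alpha / 2)), float(np.quantile(stats, 1 - alpha / 2))

# Example: per-event compute cost in dollars (illustrative numbers only).
costs = np.array([29.0, 31.5, 36.8, 30.2, 33.1, 28.7, 34.9])
print(bootstrap_ci(costs))
```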
6. Failure-Mode Tree
```mermaid
flowchart TD
    A[Re-distillation triggered or scheduled] --> B[Fetch latest teacher checkpoint]
    B --> C{Teacher healthy?}
    C -- accuracy and ECE in expected range --> D[Run distillation pipeline]
    C -- regressed --> E["Skip re-distillation; alert teacher pipeline owners"]
    D --> F[Run student acceptance suite]
    F -- pass --> G[Shadow deploy 24 h]
    F -- fail --> H{Failure type?}
    H -- insufficient quality --> I["Adjust alpha or epochs; retry once"]
    H -- calibration drift --> J["Refit T before deploy; retry once"]
    H -- top-1 agreement low --> K["Switch to ensemble teacher (last 3 checkpoints); retry"]
    G --> L["Canary deploy stages: 5%, 25%, 50%, 100%"]
    L -- gate failure --> M["Rollback to previous student; keep teacher in cold storage"]
    L -- pass all stages --> N["Promote new student; freeze old artifact for 14 days"]
```
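The acceptance-suite failure branch of the tree, restated as a dispatch table; the handler strings and retry-limit logic are illustrative, with each remedy matching one retry path in the diagram.

```python
# Remedies for acceptance-suite failures; names are illustrative, one per retry path above.
FAILURE_REMEDIES = {
    "insufficient_quality": "adjust alpha or epochs, retry once",
    "calibration_drift":    "refit temperature T before deploy, retry once",
    "low_top1_agreement":   "distill against ensemble of last 3 teacher checkpoints, retry",
}

def handle_acceptance_failure(failure_type: str, attempts: int) -> str:
    """Each failure type gets at most one retry; anything else aborts and alerts the owners."""
    if attempts >= 1:
        return "abort: keep current student, alert pipeline owners"
    return FAILURE_REMEDIES.get(failure_type, "abort: unknown failure type, alert pipeline owners")
```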
7. Persona Debate
Jordan (MLOps). Event-driven re-distillation feels right. Calendar-driven was burning compute on no-op months.
Aiko (DS). I want one more trigger: rare-class drift. Accuracy on escalation can drop without overall accuracy dropping. Add a per-class alarm at -2pp on rare classes.
Priya (ML). Already in §1 implicitly via "quality gap drift" — let me make it explicit.
Marcus (Architect). Teacher ensemble idea — defer it. Adds 3× the soft-label storage and a coordination layer. Not worth it until we see single-teacher anomalies repeatedly.
Sam (PM). Six re-distills/year at $32 each is fine; rollback discipline is the real win. Quarterly game-days are non-negotiable.
Resolution. Adopt event-driven trigger matrix; latest-checkpoint strategy; staged rollout with paired teacher metadata. Re-evaluate ensemble strategy after a single-teacher incident or in 12 months, whichever comes first.
8. Open Problems
- Continuous distillation. Instead of discrete re-distillation events, can we continuously update the student via online KD against the latest teacher? Tarvainen & Valpola's Mean Teacher offers one direction; a minimal online-KD sketch follows this list. Open question: can online KD preserve calibration without per-batch validation?
- Knowledge accumulation across teachers. Teacher v_{n+1} may know things teacher v_n didn't — and the student should remember both. Open question: a continual-distillation regime where the student is exposed to the union of soft-label distributions over a window of teachers (intersects Learning-Strategies/06 continual learning).
- Teacher-aware OOD. When the teacher recognizes new intents (after a discovery-pipeline-driven taxonomy update), the student must inherit the new label semantics. Open question: a "label-aware re-distillation" that preserves old labels' soft profiles while adopting new ones.
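For the continuous-distillation direction, the sketch below shows a single online-KD step against the current teacher, assuming HuggingFace-style classifiers where `model(**batch).logits` returns class logits. It only makes the direction concrete; the open question about preserving calibration without per-batch validation is untouched here.

```python
import torch
import torch.nn.functional as F

def online_kd_step(student, teacher, batch, optimizer, T: float = 1.4) -> float:
    """One online-KD step: minimise KL between temperature-softened teacher and
    student distributions on a production batch. Assumes HuggingFace-style models."""
    teacher.eval()
    student.train()
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits
    student_logits = student(**batch).logits
    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)   # standard Hinton-style temperature scaling of the KD gradient
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```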
Bibliography
- Hinton, G. et al. (2015). Distilling the Knowledge in a Neural Network. NeurIPS Workshop.
- Stanton, S. et al. (2021). Does Knowledge Distillation Really Work? NeurIPS.
- Tarvainen, A., Valpola, H. (2017). Mean Teachers are Better Role Models. NeurIPS.
- You, S., Xu, C., Xu, C., Tao, D. (2017). Learning from Multiple Teacher Networks. KDD.
- Zhao, B., Cui, Q., Song, R., Qiu, Y., Liang, J. (2022). Decoupled Knowledge Distillation. CVPR.
- Mirzadeh, S. I. et al. (2020). Improved Knowledge Distillation via Teacher Assistant. AAAI.
- Beyer, L. et al. (2022). A good teacher is patient and consistent. CVPR.
- Pineau, J. et al. (2021). Reproducibility checklist. NeurIPS.
- Gebru, T. et al. (2021). Datasheets for Datasets. CACM.
- Bouthillier, X. et al. (2021). Accounting for Variance. MLSys.
Citation count for this file: 10.