
Teacher Rotation Strategy — Knowledge Distillation (MangaAssist)

Slot 9 (discovery / evolution) in the SCENARIO_TEMPLATE 8-file pattern. The earlier KD docs treat the teacher as static. In production, the teacher is retrained monthly (per the Intent-Classification dry-run doc), and any change in the teacher invalidates assumptions baked into the student. This doc handles teacher rotation: how to manage the lifecycle of multiple teacher checkpoints over time without thrashing the student.

Why This Is a Real Problem

Three observations from production:

  1. Teacher updates don't always require student updates. A 0.2pp teacher accuracy improvement is below the student's bootstrap CI half-width — re-distilling is wasted compute.
  2. Teacher updates can break student calibration silently. Even when student accuracy is preserved, the student's calibrator (T = 1.4) was fit against a specific teacher; updating the teacher shifts the soft-label distribution and can drift ECE.
  3. Multiple teachers run concurrently in production. During a teacher canary period, one student may have been distilled against teacher v_n while a separate experiment uses teacher v_{n+1}. We need a clean way to handle this.

A research-grade rotation strategy answers four questions:

  1. When to re-distill (drift triggers, not calendar)?
  2. Which teacher to use (latest, ensemble, or stable LTS)?
  3. How to roll out the new student (replace, ensemble, or A/B)?
  4. How to roll back (which teacher version do we keep frozen)?

Shared Baseline (verbatim)

| Item | Value |
| --- | --- |
| Teacher | DistilBERT, retrained monthly (per dry-run doc) |
| Student | TinyBERT 4-layer, distilled from a single teacher checkpoint |
| Teacher checkpoint store | s3://mangaassist-ml-prod/teachers/distilbert-v{n}/ with metadata: training data hash, accuracy, ECE, T |
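
For concreteness, a minimal sketch of reading that per-checkpoint metadata from the store. The metadata.json filename and field names are assumptions for illustration, not the actual registry schema.

```python
# Sketch only: assumes each checkpoint prefix holds a metadata.json with the
# fields listed in the table above; file name and schema are illustrative.
import json
from dataclasses import dataclass

import boto3


@dataclass
class TeacherCheckpoint:
    version: int
    data_hash: str       # hash of the training data snapshot
    accuracy: float      # held-out accuracy
    ece: float           # expected calibration error
    temperature: float   # calibration temperature T


def load_teacher_metadata(version: int, bucket: str = "mangaassist-ml-prod") -> TeacherCheckpoint:
    """Fetch the metadata pinned next to a teacher checkpoint."""
    key = f"teachers/distilbert-v{version}/metadata.json"   # assumed layout
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    meta = json.loads(body)
    return TeacherCheckpoint(
        version=version,
        data_hash=meta["training_data_hash"],
        accuracy=meta["accuracy"],
        ece=meta["ece"],
        temperature=meta["T"],
    )
```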

1. Re-Distillation Trigger Matrix

| Trigger | Threshold | Action | Rationale |
| --- | --- | --- | --- |
| Teacher accuracy delta | ≥ +0.5pp | re-distill | a meaningful capability lift; ignore noise |
| Teacher ECE delta | ≥ +0.01 | re-distill | recalibration alone won't carry through soft labels |
| Teacher KL drift to old teacher | ≥ 0.05 on 14-day prod sample | re-distill | distribution of soft labels has shifted |
| Student top-1 agreement with teacher | < 0.95 | re-distill | gap has opened by >2pp from initial 0.974 |
| Quality gap drift | acc_t − acc_s > 1.6 + 0.5 (initial + drift) | re-distill | student is falling behind |
| Teacher major version (architecture change) | always | re-distill | non-negotiable |
| None of the above | n/a | skip | preserve compute |

Implication. Re-distillation is not monthly by default — it's event-driven. Empirically (last 12 months) we re-distilled 7 times (vs 12 monthly retrains), saving ~$210 in compute and ~5 days of monitoring overhead.
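
A minimal sketch of the trigger matrix as code, assuming the deltas and drift statistics are pre-computed upstream (e.g. on the 14-day prod sample). The container and function names are hypothetical; the thresholds are the ones in the table.

```python
# Sketch of the trigger matrix in §1. Inputs are assumed to be pre-computed;
# any single fired trigger is enough to schedule a re-distillation.
from dataclasses import dataclass


@dataclass
class TriggerInputs:
    teacher_acc_delta_pp: float       # new teacher acc minus old teacher acc, in pp
    teacher_ece_delta: float          # new ECE minus old ECE
    teacher_kl_drift: float           # mean KL(new teacher || old teacher) on the prod sample
    student_teacher_agreement: float  # top-1 agreement of current student with new teacher
    quality_gap_pp: float             # acc_teacher - acc_student, in pp
    teacher_major_version_bump: bool  # architecture change


def should_redistill(x: TriggerInputs) -> tuple[bool, list[str]]:
    """Return (re-distill?, list of fired triggers)."""
    fired = []
    if x.teacher_major_version_bump:
        fired.append("teacher major version (architecture change)")
    if x.teacher_acc_delta_pp >= 0.5:
        fired.append("teacher accuracy delta >= +0.5pp")
    if x.teacher_ece_delta >= 0.01:
        fired.append("teacher ECE delta >= +0.01")
    if x.teacher_kl_drift >= 0.05:
        fired.append("teacher KL drift >= 0.05")
    if x.student_teacher_agreement < 0.95:
        fired.append("student top-1 agreement < 0.95")
    if x.quality_gap_pp > 1.6 + 0.5:
        fired.append("quality gap drift > initial 1.6pp + 0.5pp")
    return bool(fired), fired
```

The returned trigger list is exactly what should land in the decision metadata kept in the model registry (§4).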


2. Which Teacher to Use

Three strategies, each with trade-offs:

| Strategy | Description | Pros | Cons | When to prefer |
| --- | --- | --- | --- | --- |
| Latest checkpoint (chosen) | always distill from the newest teacher | simple; tracks latest knowledge | brittle to teacher anomalies | high-frequency change, healthy teacher monitoring |
| Long-Term-Stable (LTS) checkpoint | freeze a teacher every quarter; distill against LTS | predictable student behavior | falls behind; misses recent intents | regulated environments |
| Teacher ensemble | distill against average soft labels of the last K teachers | smoother soft labels; less variance | K-fold compute cost | when single-teacher noise is documented |

Reading. We use the latest checkpoint but log the teacher version in every student artifact. This makes rollback (§4) trivial: redeploy the previous student.

Research Notes — multi-teacher. Citations: You et al. 2017 (KDD — Learning from Multiple Teacher Networks); Tarvainen & Valpola 2017 (NeurIPS — Mean Teacher); Zhao et al. 2022 (CVPR — Decoupled Knowledge Distillation, DKD) — supports separating target-class and non-target-class contributions in the KD loss, useful when teachers disagree.
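
For reference, a sketch of what the (not adopted) ensemble option would compute: a uniform average of temperature-scaled teacher soft labels fed into the KD loss. Names, shapes, and the temperature argument are illustrative.

```python
# Sketch of the teacher-ensemble option: uniformly average the temperature-scaled
# soft labels of the last K teacher checkpoints before computing the KD loss.
import numpy as np


def soft_labels(logits: np.ndarray, T: float) -> np.ndarray:
    """Temperature-scaled softmax over the class axis."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def ensemble_soft_labels(per_teacher_logits: list[np.ndarray], T: float) -> np.ndarray:
    """per_teacher_logits: K arrays of shape [batch, num_classes], one per checkpoint."""
    return np.mean([soft_labels(l, T) for l in per_teacher_logits], axis=0)
```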


3. Rollout Pattern: Re-Distillation as a Staged Deploy

A new student is never deployed in one step — even though distillation is fast (~24 min), the student's calibration and routing thresholds are downstream-coupled to it.

| Stage | Traffic | Duration | Pass criterion |
| --- | --- | --- | --- |
| Shadow | 0% (mirror) | 24 h | acceptance suite (§6 of the dry-run doc) |
| Canary 5% | 5% | 48 h | rare-class accuracy ≥ 87.0% on canary slice |
| Canary 25% | 25% | 48 h | weighted error ≤ baseline + 0.5pp |
| 50% | 50% | 24 h | no production-weighted regression |
| 100% | full | n/a | full deploy; old student frozen for 14 days as rollback target |

If any gate fails: revert to the previous student (which is paired with its teacher version), file an incident, refresh the trigger thresholds.
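
The stage table can also be carried as data so the deploy tooling and this doc cannot drift apart. A sketch, assuming hypothetical gate identifiers that map onto the pass criteria above:

```python
# The §3 stage table as data. Gate identifiers are placeholders for the real checks.
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutStage:
    name: str
    traffic: float    # fraction of production traffic routed to the new student
    duration_h: int   # hold time before the gate is evaluated
    gate: str         # identifier of the pass criterion


ROLLOUT_PLAN = [
    RolloutStage("shadow",    0.00, 24, "acceptance_suite"),
    RolloutStage("canary_5",  0.05, 48, "rare_class_acc_ge_87"),
    RolloutStage("canary_25", 0.25, 48, "weighted_error_le_baseline_plus_0.5pp"),
    RolloutStage("half",      0.50, 24, "no_prod_weighted_regression"),
    RolloutStage("full",      1.00,  0, "promote_and_freeze_old_student_14d"),
]


def run_rollout(evaluate_gate) -> bool:
    """Walk the stages in order; any gate failure aborts and takes the revert path above."""
    for stage in ROLLOUT_PLAN:
        if not evaluate_gate(stage):
            return False   # revert to previous student, file incident, refresh thresholds
    return True
```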


4. Rollback Discipline

Three rollback dimensions to track:

  1. Student artifact (model weights + calibrator + thresholds): keep last 3 versions hot, last 12 cold.
  2. Teacher artifact: keep last 6 versions cold; the paired teacher for each student is metadata-pinned.
  3. Decision metadata (the trigger that caused this re-distillation, the gate that passed): keep forever in the model registry.

Rollback time target: 90 minutes from paging the on-call to traffic restoration. Verified quarterly via game-day exercises.
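
A sketch of the decision-metadata record that travels with each student artifact. Field names are illustrative; the point is that the paired teacher version, the firing trigger, and the passed gates are pinned, so rollback requires no archaeology.

```python
# Sketch of the decision-metadata record kept forever in the model registry.
# Field names and retention policy comments are illustrative, not the production schema.
from dataclasses import dataclass, field


@dataclass
class StudentRegistryRecord:
    student_artifact: str          # weights + calibrator + thresholds bundle id (hypothetical)
    paired_teacher_version: str    # teacher checkpoint pin used for this distillation
    calibrator_temperature: float  # T refit against the paired teacher
    trigger: str                   # which §1 trigger caused this re-distillation
    gates_passed: list[str] = field(default_factory=list)   # §3 stages that passed
```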


5. Confidence Intervals on Lifecycle Metrics

n = 12 months of production logs.

| Metric | Point | 95% CI |
| --- | --- | --- |
| Re-distillations per month | 0.58 | [0.34, 0.92] |
| Mean time between re-distillations | 51 days | [38, 70] |
| Re-distillation success rate (pass canary) | 86% | [62%, 96%] |
| Mean compute cost per re-distillation | $32 | [$28, $37] |
| Mean rollback time (when triggered) | 47 min | [38, 58] |

Reading. ~6 re-distillations/year averaging $32 each = ~$190/year compute. Compared to monthly forced re-distillation (~$380/year), event-driven saves ~50% with no observed quality regression.
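
Intervals like the ones above can be produced with a plain percentile bootstrap over the 12 monthly observations. A sketch with illustrative counts; the table's CIs were not necessarily produced by this exact code.

```python
# Percentile bootstrap over 12 monthly observations. The counts below are
# illustrative placeholders, not the real production log values.
import numpy as np


def bootstrap_ci(samples: np.ndarray, stat=np.mean, n_boot: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    rng = np.random.default_rng(seed)
    n = len(samples)
    boot = np.array([stat(samples[rng.integers(0, n, n)]) for _ in range(n_boot)])
    return float(np.quantile(boot, alpha / 2)), float(np.quantile(boot, 1 - alpha / 2))


monthly_redistills = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1])  # illustrative
print(monthly_redistills.mean(), bootstrap_ci(monthly_redistills))
```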


6. Failure-Mode Tree

flowchart TD
    A[Re-distillation triggered or scheduled] --> B[Fetch latest teacher checkpoint]
    B --> C{Teacher healthy?}
    C -- accuracy and ECE in expected range --> D[Run distillation pipeline]
    C -- regressed --> E[Skip re-distillation, alert teacher pipeline owners]
    D --> F[Run student acceptance suite]
    F -- pass --> G[Shadow deploy 24h]
    F -- fail --> H{Failure type?}
    H -- insufficient quality --> I[Adjust alpha or epochs, retry once]
    H -- calibration drift --> J[Refit T pre-deploy, retry once]
    H -- top-1 agreement low --> K[Switch to ensemble of last 3 teacher checkpoints, retry]
    G --> L[Canary deploy stages 5, 25, 50, 100%]
    L -- gate failure --> M[Rollback to previous student, keep teacher in cold storage]
    L -- pass all stages --> N[Promote new student, freeze old artifact for 14 days]
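
The acceptance-failure branch of the tree maps onto a small retry policy. The action names below mirror the flowchart nodes; the actual pipeline hooks are placeholders.

```python
# Retry policy for the acceptance-failure branch of the flowchart (names illustrative).
RETRY_POLICY = {
    "insufficient_quality": {"action": "adjust_alpha_or_epochs",      "max_retries": 1},
    "calibration_drift":    {"action": "refit_T_pre_deploy",          "max_retries": 1},
    "low_top1_agreement":   {"action": "ensemble_last_3_checkpoints", "max_retries": 1},
}
```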

7. Persona Debate

Jordan (MLOps). Event-driven re-distillation feels right. Calendar-driven was burning compute on no-op months.

Aiko (DS). I want one more trigger: rare-class drift. Accuracy on escalation can drop without overall accuracy dropping. Add a per-class alarm at -2pp on rare classes.

Priya (ML). Already in §1 implicitly via "quality gap drift" — let me make it explicit.

Marcus (Architect). Teacher ensemble idea — defer it. Adds 3× the soft-label storage and a coordination layer. Not worth it until we see single-teacher anomalies repeatedly.

Sam (PM). Six re-distills/year at $32 each is fine; rollback discipline is the real win. Quarterly game-days are non-negotiable.

Resolution. Adopt event-driven trigger matrix; latest-checkpoint strategy; staged rollout with paired teacher metadata. Re-evaluate ensemble strategy after a single-teacher incident or in 12 months, whichever comes first.


8. Open Problems

  1. Continuous distillation. Instead of discrete re-distillation events, can we continuously update the student via online KD against the latest teacher? Tarvainen & Valpola's Mean Teacher offers one direction. Open question: can online KD preserve calibration without per-batch validation?
  2. Knowledge accumulation across teachers. Teacher v_{n+1} may know things teacher v_n didn't — and the student should remember both. Open question: a continual-distillation regime where the student is exposed to the union of soft-label distributions over a window of teachers (intersects Learning-Strategies/06 continual learning).
  3. Teacher-aware OOD. When the teacher recognizes new intents (after a discovery-pipeline-driven taxonomy update), the student must inherit the new label semantics. Open question: a "label-aware re-distillation" that preserves old labels' soft profiles while adopting new ones.

Bibliography

  • Hinton, G. et al. (2015). Distilling the Knowledge in a Neural Network. NeurIPS Workshop.
  • Stanton, S. et al. (2021). Does Knowledge Distillation Really Work? NeurIPS.
  • Tarvainen, A., Valpola, H. (2017). Mean Teachers are Better Role Models. NeurIPS.
  • You, S., Xu, C., Xu, C., Tao, D. (2017). Learning from Multiple Teacher Networks. KDD.
  • Zhao, B., Cui, Q., Song, R., Qiu, Y., Liang, J. (2022). Decoupled Knowledge Distillation. CVPR.
  • Mirzadeh, S. I. et al. (2020). Improved Knowledge Distillation via Teacher Assistant. AAAI.
  • Beyer, L. et al. (2022). A good teacher is patient and consistent. CVPR.
  • Pineau, J. et al. (2021). Reproducibility checklist. NeurIPS.
  • Gebru, T. et al. (2021). Datasheets for Datasets. CACM.
  • Bouthillier, X. et al. (2021). Accounting for Variance. MLSys.

Citation count for this file: 10.