Model Compression & Optimization — Folder Index

Two compression techniques are covered here: Knowledge Distillation (KD, topic 05), which shrinks the MangaAssist intent router from a 66M-parameter teacher to a 14.5M-parameter student, and Mixture of Experts (MoE, topic 15), which routes requests by genre, task, or workflow when MangaAssist must answer with deeper domain knowledge.

KD is fully expanded into the 8-file SCENARIO_TEMPLATE pattern (see ../SCENARIO_TEMPLATE.md): files 3-10 below, layered on the same 2-file base (main doc + scenario doc). MoE is at the basic 2-file depth and will be expanded in a future phase.


Knowledge Distillation (topic 05) — Reading Order

| # | File | Persona | Why read it next |
|---|------|---------|------------------|
| 1 | 05-knowledge-distillation-pipeline.md | Priya + Aiko | start here — theory, math, architecture, training code |
| 2 | 05-knowledge_distillation_scenarios_mangaassist.md | Sam + Marcus | MangaAssist scenarios with persona debate |
| 3 | 05-distillation_numerical_worked_examples_mangaassist.md | Aiko | concrete arithmetic; bootstrap CIs; T² · KL walkthrough |
| 4 | 05-student_calibration_and_quality_metrics_mangaassist.md | Aiko + Jordan | student-specific calibration; teacher-agreement metrics |
| 5 | 05-distillation_business_weighted_error_mangaassist.md | Sam + Marcus | total-cost decision; sensitivity to volume and cost matrix |
| 6 | 05-mangaassist_distillation_dry_run_intuition.md | Jordan | end-to-end execution intuition |
| 7 | 05-mangaassist_distillation_failure_scenarios_scale.md | Priya | failure-at-scale playbook |
| 8 | 05-mangaassist_distillation_solution_decisions.md | Marcus | architectural/loss decision tree |
| 9 | 05-mangaassist_kd_prompt_improved_libraries_hardware.md | Jordan | infrastructure & library choices |
| 10 | 05-teacher_rotation_strategy_mangaassist.md | Sam + Aiko | how teachers evolve; re-distillation triggers; rollback discipline |

Tip. Files 1-2 establish the technique. Files 3-5 are the metrics + cost trio (math, calibration, business). Files 6-9 are operational playbooks. File 10 closes the lifecycle loop (when to re-distill, how to roll back).

See the canonical sub-problem mapping in ../SCENARIO_TEMPLATE.md §3.7: for KD, sub-problem A = "failure at scale" (file 7), sub-problem B = "solution decisions" (file 8), discovery = "teacher rotation" (file 10).
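
Before reading file 3's arithmetic, a minimal sketch of the objective those baseline hyperparameters (T = 3, α = 0.7) plug into may help orient; the function name and reduction choices below are illustrative assumptions, not code copied from the pipeline doc.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      T: float = 3.0, alpha: float = 0.7) -> torch.Tensor:
    """Hinton-style KD objective: alpha-weighted soft-label KL plus
    (1 - alpha)-weighted hard-label cross-entropy. The T**2 factor keeps
    soft-loss gradient magnitudes comparable across temperatures
    (Hinton et al. 2015); this is the T² · KL term file 3 walks through.
    """
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    soft = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T ** 2)
    hard = F.cross_entropy(student_logits, hard_labels)
    return alpha * soft + (1 - alpha) * hard
```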


Mixture of Experts (topic 15)

| # | File | Status |
|---|------|--------|
| 1 | 15-mixture-of-experts-routing.md | exists (main doc) |
| 2 | 15-mixture_of_experts_scenarios_mangaassist.md | exists (scenario doc) |
| 3-9 | not yet expanded | planned |

When MoE is expanded, the planned slot mapping per ../SCENARIO_TEMPLATE.md §3.7 is:

  • Sub-problem A: load-balancing loss sensitivity (sketched below)
  • Sub-problem B: expert-collapse detection
  • Discovery: expert-library growth
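
Sub-problem A will probe how sensitive routing quality is to the load-balancing auxiliary loss. As a reference point, the sketch below implements the widely used Switch-Transformer-style balancing term; the function name and tensor shapes are illustrative assumptions, not drawn from the two MoE docs above.

```python
import torch

def load_balancing_loss(router_logits: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Switch-Transformer-style auxiliary loss: num_experts * sum_e(f_e * P_e),
    where f_e is the fraction of tokens whose top-1 route is expert e and
    P_e is the mean router probability assigned to expert e. The loss is
    minimized (value 1.0) when both distributions are uniform.
    """
    probs = torch.softmax(router_logits, dim=-1)        # (num_tokens, num_experts)
    top1 = probs.argmax(dim=-1)                         # hard top-1 expert per token
    f = torch.bincount(top1, minlength=num_experts).float() / top1.numel()
    p = probs.mean(dim=0)                               # mean gate probability per expert
    return num_experts * torch.sum(f * p)

# Balanced routing scores ≈ 1.0; a collapsed router scores near num_experts.
balanced = torch.randn(1024, 8)                         # random logits ≈ balanced routing
collapsed = torch.zeros(1024, 8); collapsed[:, 0] = 10.0  # everything routed to expert 0
print(load_balancing_loss(balanced, 8).item())          # ≈ 1.0
print(load_balancing_loss(collapsed, 8).item())         # ≈ 8.0
```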


Shared Baseline (verbatim)

| Item | Value |
|------|-------|
| Teacher | DistilBERT-base, 66M params, 92.1% ± 0.4% acc, 12 ms P95, $5,320/mo @ 1.4M reqs |
| Student | TinyBERT 4-layer, 14.5M params, 90.5% ± 0.5% acc, 5 ms P95, $2,240/mo @ 1.4M reqs |
| Distillation | T = 3, α = 0.7, 5 epochs, 24 min wall-clock per re-distill |
| Calibration | student post-cal T = 1.4, ECE 0.046 ± 0.006 |
| Re-distillation cadence | event-driven (~6/year), not calendar |
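
The cost rows above imply the savings that file 5's total-cost decision builds on; the snippet below only re-derives figures already present in the table (no new data).

```python
# Back-of-envelope economics from the shared baseline (values verbatim).
teacher_cost, student_cost = 5320, 2240   # $/month @ 1.4M reqs
monthly_reqs = 1_400_000

monthly_savings = teacher_cost - student_cost    # $3,080/month
teacher_per_req = teacher_cost / monthly_reqs    # ≈ $0.0038 per request
student_per_req = student_cost / monthly_reqs    # ≈ $0.0016 per request
redistill_hours_per_year = 6 * 24 / 60           # ~6 event-driven runs × 24 min ≈ 2.4 h

print(f"${monthly_savings}/mo saved; "
      f"{teacher_per_req:.4f} vs {student_per_req:.4f} $/req; "
      f"re-distill overhead ≈ {redistill_hours_per_year:.1f} h/yr")
```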

Personas

Same 5 personas as Intent-Classification/README.md: Priya (ML), Marcus (Architect), Aiko (DS), Jordan (MLOps), Sam (PM)


Glossary (KD-specific)

| Term | Definition |
|------|------------|
| Teacher | larger pre-trained model whose soft-label distribution is the supervision signal |
| Student | smaller model trained to mimic the teacher's outputs |
| Soft labels | softmax(z / T) for T > 1; capture the teacher's similarity structure |
| Dark knowledge | the off-target probability mass in soft labels (Hinton 2015) |
| T (temperature) | softmax temperature; T > 1 spreads probability across classes |
| α (loss mix) | weight on the KL term vs the hard-label CE term in the distillation loss |
| Top-1 agreement | fraction of inputs where argmax(student) = argmax(teacher) |
| TAKD | Teacher Assistant Knowledge Distillation (Mirzadeh 2020) |
| DKD | Decoupled Knowledge Distillation (Zhao 2022) |
| Patient teacher | a teacher applied with strong augmentation and long training (Beyer 2022) |
| Re-distillation trigger | event-driven condition (drift, gap) that forces a new student build |
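
To make soft labels, dark knowledge, and top-1 agreement concrete, here is a small self-contained demo; the 4-class logits are invented for illustration and are not MangaAssist outputs.

```python
import torch
import torch.nn.functional as F

# Temperature spreads probability mass onto near-miss classes; this
# off-target mass is the "dark knowledge" the student learns from.
logits = torch.tensor([4.0, 2.5, 1.0, -1.0])
print(F.softmax(logits / 1.0, dim=-1))  # T=1: ≈ [0.78, 0.17, 0.04, 0.01]
print(F.softmax(logits / 3.0, dim=-1))  # T=3 (baseline): ≈ [0.46, 0.28, 0.17, 0.09]

# Top-1 agreement: fraction of inputs where student and teacher pick
# the same argmax class.
def top1_agreement(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> float:
    match = student_logits.argmax(dim=-1) == teacher_logits.argmax(dim=-1)
    return match.float().mean().item()
```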

Folder Citation Index

KD foundations

  • Hinton, G. et al. (2015). Distilling the Knowledge in a Neural Network. NeurIPS Workshop.
  • Sanh, V. et al. (2019). DistilBERT. NeurIPS-EMC².
  • Jiao, X. et al. (2020). TinyBERT. EMNLP.

KD theory & analysis

  • Stanton, S. et al. (2021). Does Knowledge Distillation Really Work? NeurIPS.
  • Menon, A. K. et al. (2021). A statistical perspective on distillation. ICML.
  • Cho, J. H., Hariharan, B. (2019). On the Efficacy of KD. ICCV.

KD variants

  • Mirzadeh, S. I. et al. (2020). Teacher Assistant KD. AAAI.
  • Beyer, L. et al. (2022). Patient & consistent teacher. CVPR.
  • Tian, Y. et al. (2020). Contrastive Representation Distillation. ICLR.
  • You, S. et al. (2017). Multiple Teacher Networks. KDD.
  • Zhao, B. et al. (2022). Decoupled KD. CVPR.
  • Tarvainen, A., Valpola, H. (2017). Mean Teachers. NeurIPS.

Survey

  • Gou, J. et al. (2021). Knowledge Distillation: A Survey. IJCV.

Cost / energy

  • Strubell, E. et al. (2019). Energy and Policy Considerations. ACL.
  • Elkan, C. (2001). Cost-Sensitive Learning. IJCAI.

Reproducibility / variance

  • Bouthillier, X. et al. (2021). Variance in ML Benchmarks. MLSys.
  • Pineau, J. et al. (2021). NeurIPS Reproducibility Checklist.
  • Gebru, T. et al. (2021). Datasheets. CACM.

Audit Checklist (KD only — MoE pending)

  • All 8 KD files exist (some are pre-existing, some new in Phase C)
  • Folder README exists (this file)
  • Shared baseline verbatim in every doc
  • Numerical, calibration, business, and teacher-rotation deep-dives carry CIs and citations
  • Pre-existing dry-run / failure-scenarios / solution-decisions / kd-prompt files retain their content (not rewritten)
  • MoE expansion (planned)
  • Master 00-mangaassist_fine_tuning_topic_scenario_map.md updated with new file links

Cross-Folder Pointers