Model Compression & Optimization — Folder Index
Two compression techniques are covered here: Knowledge Distillation (KD, topic 05), which shrinks the MangaAssist intent router from a 66M-param teacher to a 14.5M-param student, and Mixture of Experts (MoE, topic 15), which routes requests by genre, task, or workflow when MangaAssist must answer with deeper knowledge.
KD is fully expanded into the 8-file SCENARIO_TEMPLATE pattern (see ../SCENARIO_TEMPLATE.md): two core docs plus eight expansion files, ten files in all. MoE is at the basic 2-file depth and will be expanded in a future phase.
Knowledge Distillation (topic 05) — Reading Order
| # | File | Persona | Why read it next |
|---|---|---|---|
| 1 | 05-knowledge-distillation-pipeline.md | Priya + Aiko | start here — theory, math, architecture, training code |
| 2 | 05-knowledge_distillation_scenarios_mangaassist.md | Sam + Marcus | MangaAssist scenarios with persona debate |
| 3 | 05-distillation_numerical_worked_examples_mangaassist.md | Aiko | concrete arithmetic; bootstrap CIs; T² · KL walkthrough |
| 4 | 05-student_calibration_and_quality_metrics_mangaassist.md | Aiko + Jordan | student-specific calibration; teacher-agreement metrics |
| 5 | 05-distillation_business_weighted_error_mangaassist.md | Sam + Marcus | total-cost decision; sensitivity to volume and cost matrix |
| 6 | 05-mangaassist_distillation_dry_run_intuition.md | Jordan | end-to-end execution intuition |
| 7 | 05-mangaassist_distillation_failure_scenarios_scale.md | Priya | failure-at-scale playbook |
| 8 | 05-mangaassist_distillation_solution_decisions.md | Marcus | architectural/loss decision tree |
| 9 | 05-mangaassist_kd_prompt_improved_libraries_hardware.md | Jordan | infrastructure & library choices |
| 10 | 05-teacher_rotation_strategy_mangaassist.md | Sam + Aiko | how teachers evolve; re-distillation triggers; rollback discipline |
Tip. Files 1-2 establish the technique. Files 3-5 are the metrics + cost trio (math, calibration, business). Files 6-9 are operational playbooks. File 10 closes the lifecycle loop (when to re-distill, how to roll back).
See the canonical sub-problem mapping in ../SCENARIO_TEMPLATE.md §3.7: for KD, sub-problem A = "failure at scale" (file 7), sub-problem B = "solution decisions" (file 8), discovery = "teacher rotation" (file 10).
Mixture of Experts (topic 15)
| # | File | Status |
|---|---|---|
| 1 | 15-mixture-of-experts-routing.md | exists (main doc) |
| 2 | 15-mixture_of_experts_scenarios_mangaassist.md | exists (scenario doc) |
| 3-10 | not yet expanded | planned |
When MoE is expanded, the planned slot mapping per ../SCENARIO_TEMPLATE.md §3.7 is:
- Sub-problem A: load-balancing loss sensitivity (a minimal loss sketch follows this list)
- Sub-problem B: expert-collapse detection
- Discovery: expert-library growth
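Before that expansion lands, here is a minimal sketch of the auxiliary loss sub-problem A will stress-test, using the common Switch-Transformer-style formulation; the function name and the hard top-1 dispatch are illustrative assumptions, not this folder's settled design.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor) -> torch.Tensor:
    """Switch-style auxiliary loss: num_experts * sum_i(f_i * P_i).

    router_logits: (num_tokens, num_experts) pre-softmax gating scores.
    f_i = fraction of tokens hard-dispatched (top-1) to expert i;
    P_i = mean softmax probability the router assigns to expert i.
    Perfectly uniform routing gives the minimum value 1.0, so a value
    creeping upward is an early expert-collapse signal (sub-problem B).
    """
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)              # (tokens, experts)
    top1 = probs.argmax(dim=-1)                           # hard dispatch
    f = F.one_hot(top1, num_experts).float().mean(dim=0)  # dispatch fractions
    p = probs.mean(dim=0)                                 # mean router probs
    return num_experts * torch.sum(f * p)
```

Sub-problem A's sensitivity question then reduces to a sweep over the coefficient that scales this term in the total loss.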
Shared Baseline (verbatim)
| Item | Value |
|---|---|
| Teacher | DistilBERT-base, 66M params, 92.1% ± 0.4% acc, 12 ms P95, $5,320/mo @ 1.4M reqs |
| Student | TinyBERT 4-layer, 14.5M params, 90.5% ± 0.5% acc, 5 ms P95, $2,240/mo @ 1.4M reqs |
| Distillation | T = 3, α = 0.7, 5 epochs, 24 min wall-clock per re-distill |
| Calibration | student post-cal T = 1.4, ECE 0.046 ± 0.006 |
| Re-distillation cadence | event-driven (~6/year), not calendar |
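The distillation row above (T = 3, α = 0.7) maps one-to-one onto the standard Hinton-style loss. A minimal PyTorch sketch, assuming batched teacher and student logits are already in hand; the T² factor is the scaling walked through in file 3.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, hard_labels,
            T: float = 3.0, alpha: float = 0.7) -> torch.Tensor:
    """Hinton-style distillation loss with the shared-baseline settings.

    alpha weights the soft (KL) term against hard-label cross-entropy.
    Dividing logits by T spreads probability mass across classes (soft
    labels); multiplying the KL term by T**2 restores the gradient
    magnitude that the 1/T softening removes.
    """
    soft_targets = F.softmax(teacher_logits / T, dim=-1)        # teacher dist.
    log_student = F.log_softmax(student_logits / T, dim=-1)
    soft_loss = F.kl_div(log_student, soft_targets, reduction="batchmean")
    hard_loss = F.cross_entropy(student_logits, hard_labels)   # T = 1 CE
    return alpha * (T ** 2) * soft_loss + (1 - alpha) * hard_loss
```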
Personas
Same 5 personas as Intent-Classification/README.md:
- Priya (ML), Marcus (Architect), Aiko (DS), Jordan (MLOps), Sam (PM)
Glossary (KD-specific)
| Term | Definition |
|---|---|
| Teacher | larger pre-trained model whose soft-label distribution is the supervision signal |
| Student | smaller model trained to mimic the teacher's outputs |
| Soft labels | softmax(z / T) for T > 1; capture the teacher's similarity structure |
| Dark knowledge | the off-target probability mass in soft labels (Hinton 2015) |
| T (temperature) | softmax temperature; T > 1 spreads probability across classes |
| α (loss mix) | weight on KL term vs hard-label CE term in distillation loss |
| Top-1 agreement | fraction where argmax(student) = argmax(teacher) |
| TAKD | Teacher Assistant Knowledge Distillation (Mirzadeh 2020) |
| DKD | Decoupled Knowledge Distillation (Zhao 2022) |
| Patient teacher | distillation with a long training schedule and consistent, shared augmentation for teacher and student (Beyer 2022) |
| Re-distillation trigger | event-driven condition (drift, gap) that forces a new student build |
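The last two glossary rows combine into one operational check. A sketch of what an event-driven trigger can look like; the threshold values below are hypothetical placeholders, and the real conditions and rollback discipline are specified in file 10.

```python
from dataclasses import dataclass

@dataclass
class StudentHealth:
    top1_agreement: float  # fraction where argmax(student) == argmax(teacher)
    ece: float             # student post-calibration ECE
    accuracy_gap: float    # teacher accuracy minus student accuracy

def redistillation_triggers(h: StudentHealth,
                            min_agreement: float = 0.95,  # hypothetical
                            max_ece: float = 0.06,        # hypothetical
                            max_gap: float = 0.025) -> list[str]:
    """Return the tripped event-driven triggers (empty list = healthy).

    The point is the shape of the check, not the numbers: re-distillation
    fires on observed drift or a widening gap, not on a calendar.
    """
    tripped = []
    if h.top1_agreement < min_agreement:
        tripped.append("teacher-student top-1 agreement below floor")
    if h.ece > max_ece:
        tripped.append("student calibration drifted")
    if h.accuracy_gap > max_gap:
        tripped.append("teacher-student accuracy gap widened")
    return tripped
```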
Folder Citation Index
KD foundations
- Hinton, G. et al. (2015). Distilling the Knowledge in a Neural Network. NeurIPS Workshop.
- Sanh, V. et al. (2019). DistilBERT. NeurIPS-EMC².
- Jiao, X. et al. (2020). TinyBERT. EMNLP.
KD theory & analysis
- Stanton, S. et al. (2021). Does Knowledge Distillation Really Work? NeurIPS.
- Menon, A. K. et al. (2021). A statistical perspective on distillation. ICML.
- Cho, J. H., Hariharan, B. (2019). On the Efficacy of KD. ICCV.
KD variants
- Mirzadeh, S. I. et al. (2020). Teacher Assistant KD. AAAI.
- Beyer, L. et al. (2022). Patient & consistent teacher. CVPR.
- Tian, Y. et al. (2020). Contrastive Representation Distillation. ICLR.
- You, S. et al. (2017). Multiple Teacher Networks. KDD.
- Zhao, B. et al. (2022). Decoupled KD. CVPR.
- Tarvainen, A., Valpola, H. (2017). Mean Teachers. NeurIPS.
Survey
- Gou, J. et al. (2021). Knowledge Distillation: A Survey. IJCV.
Cost / energy
- Strubell, E. et al. (2019). Energy and Policy Considerations. ACL.
- Elkan, C. (2001). Cost-Sensitive Learning. IJCAI.
Reproducibility / variance
- Bouthillier, X. et al. (2021). Variance in ML Benchmarks. MLSys.
- Pineau, J. et al. (2021). Improving Reproducibility in Machine Learning Research (NeurIPS Reproducibility Program). JMLR.
- Gebru, T. et al. (2021). Datasheets. CACM.
Audit Checklist (KD only — MoE pending)
- All 10 KD files exist (some are pre-existing, some new in Phase C)
- Folder README exists (this file)
- Shared baseline verbatim in every doc
- Numerical, calibration, business, and teacher-rotation deep-dives carry CIs and citations
- Pre-existing dry-run / failure-scenarios / solution-decisions / kd-prompt files retain their content (not rewritten)
- MoE expansion (planned)
- Master 00-mangaassist_fine_tuning_topic_scenario_map.md updated with new file links
Cross-Folder Pointers
- Master curriculum index → ../README.md
- Topic-scenario map → ../00-mangaassist_fine_tuning_topic_scenario_map.md
- Template → ../SCENARIO_TEMPLATE.md
- Sister Tier-1 folders: Intent-Classification, Embedding-Fine-Tuning, Retrieval-Fine-Tuning (RAFT), Fine-Tuning-Techniques (LoRA), Alignment-RLHF