Model Compression & Optimization — Folder Index

Two compression techniques are covered here: Knowledge Distillation (KD, topic 05), which shrinks the MangaAssist intent router from a 66M-parameter teacher to a 14.5M-parameter student, and Mixture of Experts (MoE, topic 15), which routes requests by genre, task, or workflow when MangaAssist must answer with deeper domain knowledge.

KD is fully expanded into the 8-file SCENARIO_TEMPLATE pattern (see ../SCENARIO_TEMPLATE.md): files 3-10 below, layered on the same 2-file base (main doc + scenario doc). MoE is at the basic 2-file depth and will be expanded in a future phase.


Knowledge Distillation (topic 05) — Reading Order

| # | File | Persona | Why read it next |
|---|------|---------|------------------|
| 1 | 05-knowledge-distillation-pipeline.md | Priya + Aiko | start here — theory, math, architecture, training code |
| 2 | 05-knowledge_distillation_scenarios_mangaassist.md | Sam + Marcus | MangaAssist scenarios with persona debate |
| 3 | 05-distillation_numerical_worked_examples_mangaassist.md | Aiko | concrete arithmetic; bootstrap CIs; T² · KL walkthrough |
| 4 | 05-student_calibration_and_quality_metrics_mangaassist.md | Aiko + Jordan | student-specific calibration; teacher-agreement metrics |
| 5 | 05-distillation_business_weighted_error_mangaassist.md | Sam + Marcus | total-cost decision; sensitivity to volume and cost matrix |
| 6 | 05-mangaassist_distillation_dry_run_intuition.md | Jordan | end-to-end execution intuition |
| 7 | 05-mangaassist_distillation_failure_scenarios_scale.md | Priya | failure-at-scale playbook |
| 8 | 05-mangaassist_distillation_solution_decisions.md | Marcus | architectural/loss decision tree |
| 9 | 05-mangaassist_kd_prompt_improved_libraries_hardware.md | Jordan | infrastructure & library choices |
| 10 | 05-teacher_rotation_strategy_mangaassist.md | Sam + Aiko | how teachers evolve; re-distillation triggers; rollback discipline |

Tip. Files 1-2 establish the technique. Files 3-5 are the metrics + cost trio (math, calibration, business). Files 6-9 are operational playbooks. File 10 closes the lifecycle loop (when to re-distill, how to roll back).

See the canonical sub-problem mapping in ../SCENARIO_TEMPLATE.md §3.7: for KD, sub-problem A = "failure at scale" (file 7), sub-problem B = "solution decisions" (file 8), discovery = "teacher rotation" (file 10).
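
Before reading file 3's arithmetic, a minimal sketch of the objective those baseline hyperparameters (T = 3, α = 0.7) plug into may help orient; the function name and reduction choices below are illustrative assumptions, not code copied from the pipeline doc.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      T: float = 3.0, alpha: float = 0.7) -> torch.Tensor:
    """Hinton-style KD objective: alpha-weighted soft-label KL plus
    (1 - alpha)-weighted hard-label cross-entropy. The T**2 factor keeps
    soft-loss gradient magnitudes comparable across temperatures
    (Hinton et al. 2015); this is the T² · KL term file 3 walks through.
    """
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    soft = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T ** 2)
    hard = F.cross_entropy(student_logits, hard_labels)
    return alpha * soft + (1 - alpha) * hard
```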


Mixture of Experts (topic 15)

| # | File | Status |
|---|------|--------|
| 1 | 15-mixture-of-experts-routing.md | exists (main doc) |
| 2 | 15-mixture_of_experts_scenarios_mangaassist.md | exists (scenario doc) |
| 3-9 | not yet expanded | planned |

When MoE is expanded, the planned slot mapping per ../SCENARIO_TEMPLATE.md §3.7 is:

  • Sub-problem A: load-balancing loss sensitivity (sketched below)
  • Sub-problem B: expert-collapse detection
  • Discovery: expert-library growth
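
Sub-problem A will probe how sensitive routing quality is to the load-balancing auxiliary loss. As a reference point, the sketch below implements the widely used Switch-Transformer-style balancing term; the function name and tensor shapes are illustrative assumptions, not drawn from the two MoE docs above.

```python
import torch

def load_balancing_loss(router_logits: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Switch-Transformer-style auxiliary loss: num_experts * sum_e(f_e * P_e),
    where f_e is the fraction of tokens whose top-1 route is expert e and
    P_e is the mean router probability assigned to expert e. The loss is
    minimized (value 1.0) when both distributions are uniform.
    """
    probs = torch.softmax(router_logits, dim=-1)        # (num_tokens, num_experts)
    top1 = probs.argmax(dim=-1)                         # hard top-1 expert per token
    f = torch.bincount(top1, minlength=num_experts).float() / top1.numel()
    p = probs.mean(dim=0)                               # mean gate probability per expert
    return num_experts * torch.sum(f * p)

# Balanced routing scores ≈ 1.0; a collapsed router scores near num_experts.
balanced = torch.randn(1024, 8)                         # random logits ≈ balanced routing
collapsed = torch.zeros(1024, 8); collapsed[:, 0] = 10.0  # everything routed to expert 0
print(load_balancing_loss(balanced, 8).item())          # ≈ 1.0
print(load_balancing_loss(collapsed, 8).item())         # ≈ 8.0
```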


Shared Baseline (verbatim)

| Item | Value |
|------|-------|
| Teacher | DistilBERT-base, 66M params, 92.1% ± 0.4% acc, 12 ms P95, $5,320/mo @ 1.4M reqs |
| Student | TinyBERT 4-layer, 14.5M params, 90.5% ± 0.5% acc, 5 ms P95, $2,240/mo @ 1.4M reqs |
| Distillation | T = 3, α = 0.7, 5 epochs, 24 min wall-clock per re-distill |
| Calibration | student post-cal T = 1.4, ECE 0.046 ± 0.006 |
| Re-distillation cadence | event-driven (~6/year), not calendar |
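
The cost rows above imply the savings that file 5's total-cost decision builds on; the snippet below only re-derives figures already present in the table (no new data).

```python
# Back-of-envelope economics from the shared baseline (values verbatim).
teacher_cost, student_cost = 5320, 2240   # $/month @ 1.4M reqs
monthly_reqs = 1_400_000

monthly_savings = teacher_cost - student_cost    # $3,080/month
teacher_per_req = teacher_cost / monthly_reqs    # ≈ $0.0038 per request
student_per_req = student_cost / monthly_reqs    # ≈ $0.0016 per request
redistill_hours_per_year = 6 * 24 / 60           # ~6 event-driven runs × 24 min ≈ 2.4 h

print(f"${monthly_savings}/mo saved; "
      f"{teacher_per_req:.4f} vs {student_per_req:.4f} $/req; "
      f"re-distill overhead ≈ {redistill_hours_per_year:.1f} h/yr")
```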

Personas

Same 5 personas as Intent-Classification/README.md: Priya (ML), Marcus (Architect), Aiko (DS), Jordan (MLOps), Sam (PM)


Glossary (KD-specific)

| Term | Definition |
|------|------------|
| Teacher | larger pre-trained model whose soft-label distribution is the supervision signal |
| Student | smaller model trained to mimic the teacher's outputs |
| Soft labels | softmax(z / T) for T > 1; capture the teacher's similarity structure |
| Dark knowledge | the off-target probability mass in soft labels (Hinton 2015) |
| T (temperature) | softmax temperature; T > 1 spreads probability across classes |
| α (loss mix) | weight on the KL term vs the hard-label CE term in the distillation loss |
| Top-1 agreement | fraction of inputs where argmax(student) = argmax(teacher) |
| TAKD | Teacher Assistant Knowledge Distillation (Mirzadeh 2020) |
| DKD | Decoupled Knowledge Distillation (Zhao 2022) |
| Patient teacher | a teacher applied with strong augmentation and long training (Beyer 2022) |
| Re-distillation trigger | event-driven condition (drift, gap) that forces a new student build |
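
To make soft labels, dark knowledge, and top-1 agreement concrete, here is a small self-contained demo; the 4-class logits are invented for illustration and are not MangaAssist outputs.

```python
import torch
import torch.nn.functional as F

# Temperature spreads probability mass onto near-miss classes; this
# off-target mass is the "dark knowledge" the student learns from.
logits = torch.tensor([4.0, 2.5, 1.0, -1.0])
print(F.softmax(logits / 1.0, dim=-1))  # T=1: ≈ [0.78, 0.17, 0.04, 0.01]
print(F.softmax(logits / 3.0, dim=-1))  # T=3 (baseline): ≈ [0.46, 0.28, 0.17, 0.09]

# Top-1 agreement: fraction of inputs where student and teacher pick
# the same argmax class.
def top1_agreement(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> float:
    match = student_logits.argmax(dim=-1) == teacher_logits.argmax(dim=-1)
    return match.float().mean().item()
```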

Folder Citation Index

KD foundations

  • Hinton, G. et al. (2015). Distilling the Knowledge in a Neural Network. NeurIPS Workshop.
  • Sanh, V. et al. (2019). DistilBERT. NeurIPS-EMC².
  • Jiao, X. et al. (2020). TinyBERT. EMNLP.

KD theory & analysis

  • Stanton, S. et al. (2021). Does Knowledge Distillation Really Work? NeurIPS.
  • Menon, A. K. et al. (2021). A statistical perspective on distillation. ICML.
  • Cho, J. H., Hariharan, B. (2019). On the Efficacy of KD. ICCV.

KD variants

  • Mirzadeh, S. I. et al. (2020). Teacher Assistant KD. AAAI.
  • Beyer, L. et al. (2022). Patient & consistent teacher. CVPR.
  • Tian, Y. et al. (2020). Contrastive Representation Distillation. ICLR.
  • You, S. et al. (2017). Multiple Teacher Networks. KDD.
  • Zhao, B. et al. (2022). Decoupled KD. CVPR.
  • Tarvainen, A., Valpola, H. (2017). Mean Teachers. NeurIPS.

Survey

  • Gou, J. et al. (2021). Knowledge Distillation: A Survey. IJCV.

Cost / energy

  • Strubell, E. et al. (2019). Energy and Policy Considerations. ACL.
  • Elkan, C. (2001). Cost-Sensitive Learning. IJCAI.

Reproducibility / variance

  • Bouthillier, X. et al. (2021). Variance in ML Benchmarks. MLSys.
  • Pineau, J. et al. (2021). NeurIPS Reproducibility Checklist.
  • Gebru, T. et al. (2021). Datasheets. CACM.

Audit Checklist (KD only — MoE pending)

  • All 8 KD files exist (some are pre-existing, some new in Phase C)
  • Folder README exists (this file)
  • Shared baseline verbatim in every doc
  • Numerical, calibration, business, and teacher-rotation deep-dives carry CIs and citations
  • Pre-existing dry-run / failure-scenarios / solution-decisions / kd-prompt files retain their content (not rewritten)
  • MoE expansion (planned)
  • Master 00-mangaassist_fine_tuning_topic_scenario_map.md updated with new file links

Cross-Folder Pointers