MangaAssist Knowledge Distillation — Production Dry Run, Library Comparison, and Hardware Playbook
This document expands the original MangaAssist distillation pipeline into a production-focused runbook with:
- numerical dry runs,
- epoch-by-epoch training evolution,
- early stopping logic,
- production log examples,
- hardware-specific strategies,
- library ecosystem comparison,
- OpenAI-based distillation options,
- Mermaid diagrams for engineering reviews.
0. Scope and Reading Guide
This document covers four practical distillation tracks for MangaAssist:
1. Intent classifier distillation: DistilBERT teacher → TinyBERT student.
2. Response model distillation (managed-model path): OpenAI gpt-4.1 teacher → fine-tuned gpt-4.1-mini student.
3. Response model distillation (self-hosted path): teacher outputs from a stronger managed LLM → LoRA-tuned Llama 3 8B student.
4. Reranker distillation: large cross-encoder teacher → compact student reranker.
Where exact throughput or latency numbers are not guaranteed by vendors, they are marked as dry-run engineering estimates and should be treated as planning numbers, not contractual benchmarks.
1. What Distillation Means in MangaAssist
Knowledge distillation trains a smaller student model to imitate a stronger teacher model.
For MangaAssist, distillation matters when the teacher is too expensive or too slow to use on every request:
- a managed LLM gives the best final answer quality,
- a large reranker lifts ranking quality but breaks latency budgets,
- an intent classifier needs to run on CPU or Lambda,
- unlabeled production logs need weak supervision from a stronger model.
1.1 MangaAssist production targets
| Component | Teacher | Student | Why distill |
|---|---|---|---|
| Intent classification | DistilBERT (66M) | TinyBERT (14.5M) | Lower Lambda latency, smaller cold-start memory |
| Response model | gpt-4.1 / strong managed LLM | gpt-4.1-mini or Llama 3 8B | Lower serving cost, lower p95 latency |
| Reranker | large cross-encoder | compact 4-layer ONNX reranker | fit within 15–20 ms inline ranking budget |
| Ambiguous intent resolution | ensemble + rules + adjudicator | single distilled classifier | simpler deployment path |
1.2 Success criteria
A student is useful only if the total system gets better on business constraints, not just model loss:
- quality stays inside promotion gates,
- latency meets SLO,
- cost drops enough to matter,
- operational simplicity improves,
- safety/refusal behavior does not regress.
2. End-to-End Distillation Flow
flowchart TD
A[Production traffic logs] --> B[Filter PII / deduplicate / stratify]
B --> C[Build distillation dataset]
C --> D[Teacher labeling]
D --> E[Hard labels + soft labels + metadata]
E --> F[Student training]
F --> G[Offline evaluation]
G --> H[Shadow evaluation]
H --> I[Canary deploy]
I --> J[Promotion or rollback]
C --> C1[Human corrected set]
C --> C2[Refusal and escalation set]
C --> C3[Rare intents / long-tail queries]
D --> D1[Teacher scores]
D --> D2[Teacher responses]
D --> D3[Teacher uncertainty]
G --> G1[Accuracy / win rate]
G --> G2[Hallucination / refusal precision]
G --> G3[Latency / cost]
3. Dataset Design for MangaAssist
We start from the production-oriented scenario and make it concrete.
3.1 Response distillation dataset
| Source | Count | Purpose |
|---|---|---|
| production prompts | 25,000 | realistic traffic distribution |
| teacher responses | 25,000 | target behavior |
| human-corrected responses | 5,000 | fix teacher mistakes |
| refusal / escalation examples | 2,000 | preserve support behavior |
| total supervised rows | 32,000 | final response training set |
3.2 Suggested split
| Split | Count | Notes |
|---|---|---|
| train | 25,600 | 80% |
| validation | 3,200 | 10% |
| test / gate | 3,200 | 10%, frozen |
| golden human review set | 500 | never used for training |
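As a concrete sketch, the split can be frozen with a fixed seed so the gate and golden sets never drift between runs. The file name and row format below are assumptions, not the real pipeline schema.
import json
import random

random.seed(13)  # fixed seed keeps gate/golden membership frozen across runs

rows = [json.loads(line) for line in open("mangaassist_response_rows.jsonl")]
random.shuffle(rows)

golden, rest = rows[:500], rows[500:]   # golden set carved out first, never trained on
n = len(rest)
train = rest[: int(0.8 * n)]
val = rest[int(0.8 * n): int(0.9 * n)]
test = rest[int(0.9 * n):]              # frozen gate set

for name, split in [("train", train), ("val", val), ("test", test), ("golden", golden)]:
    with open(f"split_{name}.jsonl", "w") as f:
        for r in split:
            f.write(json.dumps(r) + "\n")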
3.3 Intent classifier dataset
For classifier distillation, use a larger but lighter dataset:
| Source | Count |
|---|---|
| labeled intent examples | 12,000 |
| paraphrased augmentations | 12,000 |
| recent unlabeled queries + teacher soft labels | 20,000 |
| rare-escalation and high-risk phrases | 4,000 |
| total | 48,000 |
3.4 Why production logs matter
Production logs tell you things a clean benchmark usually hides:
- where users are ambiguous,
- where teacher confidence is low,
- which intents are rare but high-risk,
- which prompts are long and expensive,
- where refusals are triggered,
- where the teacher hallucinates catalog or shipping facts.
A good distillation run does not only log the final answer. It logs the decision process around that answer.
4. What to Log During Distillation
4.1 Training-time logs
These logs are the minimum useful set:
| Log field | Why it matters |
|---|---|
| epoch | locate training stage |
| step | trend within epoch |
| loss_total | overall optimization |
| loss_hard | fit to labels |
| loss_kd | fit to teacher distribution |
| loss_feature | hidden-state matching quality |
| loss_attention | attention transfer quality |
| grad_norm | detect instability |
| lr | tie behavior to schedule |
| tokens_per_sec or samples_per_sec | training efficiency |
| gpu_mem_gb | capacity planning |
| teacher_entropy_mean | teacher softness / ambiguity level |
| student_entropy_mean | whether student is overconfident |
| rare_class_recall | long-tail protection |
| refusal_precision | policy safety |
| hallucination_rate | factuality check |
4.2 Evaluation-time logs
| Metric | Meaning |
|---|---|
| teacher preference match | how often student matches teacher choice |
| human win rate vs base student | student quality lift vs pre-distilled baseline |
| catalog hallucination rate | factual risk on catalog facts |
| escalation precision | whether support escalation is triggered correctly |
| refusal precision / recall | whether unsafe/out-of-scope queries are handled correctly |
| cost per 1K responses | business reason for distillation |
| p50 / p95 / p99 latency | deployment readiness |
4.3 Example production log schema
{
"event": "distill_epoch_end",
"run_id": "kd_resp_v12",
"student": "gpt-4.1-mini-ft-v12",
"teacher": "gpt-4.1-2025-04-14",
"epoch": 3,
"loss_total": 0.812,
"loss_ce": 0.471,
"loss_kd": 0.958,
"teacher_pref_match": 0.872,
"human_win_rate_vs_base": 0.681,
"catalog_hallucination_rate": 0.036,
"refusal_precision": 0.942,
"cost_per_1k_responses_usd": 4.80,
"cost_reduction_vs_teacher": 0.61,
"p95_latency_ms": 420
}
5. Distillation Math Refresher
5.1 Hard-label loss
[ \mathcal{L}_{hard} = -\sum_{c=1}^{C} y_c \log p_S(c|x) ]
This learns the correct class, but it throws away the teacher's view of near-miss classes.
5.2 Soft-label KD loss
[ \mathcal{L}_{KD} = T^2 \cdot D_{KL}(p_T^{(T)} \Vert p_S^{(T)}) ]
with
[ p_i^{(T)} = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}} ]
and combined loss:
[ \mathcal{L} = (1-\alpha)\mathcal{L}_{hard} + \alpha \mathcal{L}_{KD} ]
For hidden-state matching:
[ \mathcal{L}_{feature} = \sum_l \text{MSE}(W_l h_S^{(l)}, h_T^{(m(l))}) ]
For attention transfer:
[ \mathcal{L}_{attn} = \sum_l \text{MSE}(A_S^{(l)}, A_T^{(m(l))}) ]
Final total for TinyBERT-style two-stage training:
[ \mathcal{L}_{total} = \lambda_{KD}\mathcal{L}_{KD} + \lambda_{CE}\mathcal{L}_{hard} + \lambda_{feat}\mathcal{L}_{feature} + \lambda_{attn}\mathcal{L}_{attn} ]
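A minimal PyTorch sketch of these losses, assuming classification logits and a single matched hidden-layer pair. The projection module standing in for W_l is an assumption, and real TinyBERT training maps several layers rather than one.
import torch
import torch.nn.functional as F

def kd_total_loss(s_logits, t_logits, labels, s_hidden, t_hidden, proj,
                  T=4.0, alpha=0.7, lam_feat=1.0):
    # hard-label cross-entropy against gold labels
    ce = F.cross_entropy(s_logits, labels)
    # temperature-softened KL against the teacher distribution, scaled by T^2
    kd = F.kl_div(
        F.log_softmax(s_logits / T, dim=-1),
        F.softmax(t_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    # hidden-state matching: project student states into teacher width first
    feat = F.mse_loss(proj(s_hidden), t_hidden)
    return (1 - alpha) * ce + alpha * kd + lam_feat * feat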
6. Dry Run A — DistilBERT → TinyBERT Intent Distillation
This is the best first production dry run because it is cheap, fast, and easy to evaluate.
6.1 Setup
- Teacher: fine-tuned DistilBERT, 66M params
- Student: TinyBERT 4L, 14.5M params
- Classes: 10 intent classes
- Temperature: 4
- Alpha: 0.7
- Max sequence length: 64
- Training mode:
- Stage 1: feature matching
- Stage 2: output KD + CE
- Training dataset: 48,000 rows
- Validation set: 4,800
- Gate set: 4,800
6.2 Stage 0 baseline
| Model | Accuracy | Rare-class recall | Refusal precision | Warm latency |
|---|---|---|---|---|
| Teacher DistilBERT | 92.1% | 88.4% | 96.8% | 15 ms |
| Base TinyBERT, no KD | 84.6% | 73.9% | 90.7% | 5 ms |
The raw gap to close is:
- accuracy gap = 92.1 - 84.6 = 7.5 points
- rare-class recall gap = 88.4 - 73.9 = 14.5 points
6.3 Stage 1 — feature matching dry run
Assume:
- A100 80 GB
- bf16
- batch size 128
- seq len 64
- teacher and student in memory at once
Steps per epoch:
[ \text{steps/epoch} = \lceil 48,000 / 128 \rceil = 375 ]
If effective throughput is ~320 sequences/sec including teacher forward + hidden-state extraction, then:
[ 48,000 / 320 \approx 150 \text{ sec} \approx 2.5 \text{ min / epoch} ]
For 6 epochs, estimated wall-clock:
[ 6 \times 2.5 = 15 \text{ min} ]
6.4 Stage 1 epoch table
| Epoch | Feature loss | Attention loss | Val acc (probe head) | Rare recall | Notes |
|---|---|---|---|---|---|
| 1 | 1.842 | 0.913 | 78.4% | 66.2% | student still unstable |
| 2 | 1.221 | 0.642 | 81.0% | 69.4% | large gain, hidden states aligning |
| 3 | 0.944 | 0.501 | 82.6% | 71.3% | gradients smooth |
| 4 | 0.791 | 0.438 | 83.2% | 72.0% | diminishing returns start |
| 5 | 0.714 | 0.401 | 83.5% | 72.4% | small gain |
| 6 | 0.701 | 0.392 | 83.5% | 72.5% | plateau |
6.5 Stage 1 stopping rule
Stop Stage 1 when all are true for 2 consecutive epochs:
- feature loss improvement < 2%
- attention loss improvement < 2%
- probe-head val accuracy gain < 0.2 points
That happens at epoch 5–6, so we stop after epoch 6.
6.6 Stage 2 — output KD + CE dry run
Start from the Stage 1 checkpoint.
Assume:
- batch size 128
- same sequence length
- output-only KD is cheaper than feature KD
- effective throughput ~650 seq/sec
Epoch time:
[ 48,000 / 650 \approx 74 \text{ sec} \approx 1.2 \text{ min / epoch} ]
For 8 epochs, estimated time:
[ 8 \times 1.2 \approx 9.6 \text{ min} ]
6.7 Stage 2 epoch table
| Epoch | CE loss | KD loss | Total loss | Val acc | Rare recall | Refusal precision | Notes |
|---|---|---|---|---|---|---|---|
| 1 | 0.921 | 1.404 | 1.259 | 86.8% | 77.4% | 92.1% | big jump from Stage 1 |
| 2 | 0.772 | 1.115 | 1.012 | 88.1% | 79.9% | 93.4% | teacher structure learned |
| 3 | 0.681 | 0.982 | 0.892 | 88.8% | 81.1% | 94.2% | stable |
| 4 | 0.622 | 0.914 | 0.826 | 89.2% | 82.2% | 94.8% | near best |
| 5 | 0.593 | 0.878 | 0.792 | 89.3% | 82.6% | 95.1% | best val accuracy |
| 6 | 0.582 | 0.871 | 0.785 | 89.3% | 82.8% | 95.0% | no real gain |
| 7 | 0.579 | 0.870 | 0.784 | 89.2% | 82.7% | 94.8% | slight overfit signs |
| 8 | 0.567 | 0.873 | 0.781 | 89.1% | 82.3% | 94.5% | overfit |
6.8 Where to stop
We stop at epoch 5 for deployment candidate selection because:
- best validation accuracy is first reached at epoch 5,
- rare recall keeps improving slightly after that, but overall val accuracy does not,
- refusal precision starts flattening,
- epoch 7–8 shows classic memorization: training loss still falls, validation stops improving.
6.9 Classifier gate outcome
| Gate | Threshold | Epoch 5 result | Pass |
|---|---|---|---|
| accuracy | >= 89.0% | 89.3% | yes |
| rare-class recall | >= 80.0% | 82.6% | yes |
| refusal precision | >= 94.0% | 95.1% | yes |
| Lambda warm latency | <= 6 ms | 5 ms | yes |
| Lambda cold p95 | <= 150 ms | 122 ms | yes |
6.10 Final classifier business result
| Metric | Teacher | Distilled student | Delta |
|---|---|---|---|
| Accuracy | 92.1% | 89.3% | -2.8 |
| Rare recall | 88.4% | 82.6% | -5.8 |
| Warm latency | 15 ms | 5 ms | 3.0× faster |
| Model size | 264 MB | 58 MB | 4.6× smaller |
| Cost per 1M requests (CPU/Lambda dry run) | $38 | $17 | 55% lower |
7. Dry Run B — Response Distillation with OpenAI Models
This is the managed-model path.
7.1 Why include OpenAI here
If MangaAssist wants a smaller managed student instead of self-hosting, a practical path is:
- teacher: gpt-4.1
- student base: gpt-4.1-mini
- fine-tuning method: supervised fine-tuning using teacher outputs + human corrections + refusal data
This is not pure logit-level KD. It is response distillation through SFT.
7.2 Data recipe
We reuse the 32,000 row response dataset:
- 25,000 production prompts
- 25,000 teacher answers
- 5,000 human-corrected answers
- 2,000 refusal/escalation examples
We transform each row into a chat-style training example:
{
"messages": [
{"role": "system", "content": "You are MangaAssist. Follow catalog-safe, retrieval-grounded answer rules."},
{"role": "user", "content": "I want romance manga with adult characters."},
{"role": "assistant", "content": "Here are three good options..."}
]
}
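A minimal conversion sketch; the `prompt` and `answer` field names are assumptions about the internal row schema.
import json

SYSTEM_PROMPT = "You are MangaAssist. Follow catalog-safe, retrieval-grounded answer rules."

def to_chat_example(row: dict) -> dict:
    # row is one supervised example from the 32,000-row set in Section 3.1
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": row["prompt"]},
            {"role": "assistant", "content": row["answer"]},
        ]
    }

with open("mangaassist_distill_train.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(to_chat_example(row)) + "\n")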
7.3 Label weighting strategy
Not all rows should count equally.
| Row type | Weight | Why |
|---|---|---|
| human-corrected | 2.0 | best ground truth |
| teacher response, clean | 1.0 | useful target |
| refusal / escalation | 2.5 | safety critical |
| low-confidence teacher rows | 0.5 | avoid copying ambiguity too strongly |
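Managed fine-tuning APIs do not generally expose fractional per-example loss weights, so one hedged way to approximate this table is sampling with replacement in proportion to weight before writing the JSONL. The `row_type` field and its values are assumptions.
import random

ROW_WEIGHTS = {
    "human_corrected": 2.0,
    "teacher_clean": 1.0,
    "refusal_escalation": 2.5,
    "teacher_low_confidence": 0.5,
}

random.seed(13)
weighted_rows = []
for row in rows:
    w = ROW_WEIGHTS.get(row["row_type"], 1.0)
    copies = int(w)                       # whole copies of the row
    if random.random() < (w - copies):    # fractional part becomes a coin flip
        copies += 1
    weighted_rows.extend([row] * copies)
random.shuffle(weighted_rows)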
7.4 Managed-model dry run assumptions
- training rows: 25,600
- validation rows: 3,200
- average prompt + answer length: 420 tokens
- total tokens / epoch:
[ 25,600 \times 420 = 10,752,000 \text{ tokens} ]
If 4 epochs are trained, the total training token volume is:
[ 4 \times 10.752M = 43.008M \text{ tokens} ]
7.5 Offline evaluation rubric
Each answer is scored on 5 axes:
| Axis | Weight |
|---|---|
| factuality / no catalog hallucination | 0.30 |
| recommendation relevance | 0.25 |
| policy / escalation correctness | 0.20 |
| answer format quality | 0.15 |
| conciseness | 0.10 |
Rubric score:
[ \text{rubric} = 0.30F + 0.25R + 0.20P + 0.15Q + 0.10C ]
7.6 Epoch-by-epoch dry run
| Epoch | Val rubric | Teacher preference match | Human win rate vs base student | Hallucination rate | Refusal precision | Notes |
|---|---|---|---|---|---|---|
| 1 | 4.08 / 5 | 81.2% | 59.0% | 5.4% | 91.8% | clear improvement, still weak on catalog facts |
| 2 | 4.23 / 5 | 84.9% | 64.1% | 4.2% | 93.7% | close to gate |
| 3 | 4.31 / 5 | 87.2% | 68.1% | 3.6% | 94.4% | best balanced checkpoint |
| 4 | 4.30 / 5 | 87.4% | 68.4% | 4.1% | 93.8% | slight overfit / style memorization |
7.7 Promotion gate
| Metric | Gate | Epoch 3 |
|---|---|---|
| teacher preference match | >= 85% | 87.2% |
| human win rate vs base student | >= 65% | 68.1% |
| catalog hallucination rate | <= 4% | 3.6% |
| refusal precision | >= 94% | 94.4% |
| cost per 1K responses | >= 50% lower than teacher | 61% lower |
7.8 Why epoch 3 wins over epoch 4
Epoch 4 slightly beats epoch 3 on teacher preference match, but not enough to justify:
- hallucination rate rises from 3.6% to 4.1%,
- refusal precision drops,
- style starts becoming too rigid,
- support escalation wording becomes overly templated.
This is a classic production tradeoff: a tiny gain in imitation quality is not worth a measurable regression in safety and factuality.
7.9 Example OpenAI-style distillation workflow
from openai import OpenAI
import json
client = OpenAI()
# 1) Upload training file
train_file = client.files.create(
file=open("mangaassist_distill_train.jsonl", "rb"),
purpose="fine-tune"
)
# 2) Start fine-tuning job on a smaller student
job = client.fine_tuning.jobs.create(
model="gpt-4.1-mini-2025-04-14",
training_file=train_file.id,
method={"type": "supervised"}
)
print(job.id)
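Fine-tuning jobs take a while to run; a minimal polling sketch is below. The status values follow the public fine-tuning API, but re-check the current docs before relying on them.
import time

# 3) Poll until the job reaches a terminal state
while True:
    job = client.fine_tuning.jobs.retrieve(job.id)
    if job.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(60)

print(job.status, job.fine_tuned_model)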
7.10 Example offline teacher-generation step
from openai import OpenAI
import json
client = OpenAI()
def label_with_teacher(prompt: str) -> str:
resp = client.responses.create(
model="gpt-4.1-2025-04-14",
input=[
{
"role": "system",
"content": "You are MangaAssist. Be retrieval-grounded, catalog-safe, and follow escalation policy."
},
{"role": "user", "content": prompt},
]
)
return resp.output_text
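A usage sketch for bulk labeling that persists raw teacher outputs as JSONL, so the teacher never has to sit in the training loop. The retry here is deliberately crude; production labeling should add real rate-limit backoff.
import json
import time

with open("teacher_labels.jsonl", "a") as f:
    for prompt in prompts:  # the 25,000 production prompts
        try:
            answer = label_with_teacher(prompt)
        except Exception:
            time.sleep(5)  # crude single retry; replace with exponential backoff
            answer = label_with_teacher(prompt)
        f.write(json.dumps({"prompt": prompt, "answer": answer}) + "\n")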
7.11 Example response-distillation log
{
"event": "distilled_model_eval",
"student": "gpt-4.1-mini-ft-manga-v03",
"teacher": "gpt-4.1-2025-04-14",
"epoch": 3,
"teacher_preference_match": 0.872,
"human_win_rate_vs_base": 0.681,
"hallucination_rate": 0.036,
"refusal_precision": 0.944,
"cost_reduction": 0.61
}
8. Dry Run C — Self-Hosted Response Distillation into Llama 3 8B
This path is used when MangaAssist wants a self-hosted fallback or cost-controlled serving layer.
8.1 Distillation style
For a hosted teacher such as OpenAI or another managed LLM, we normally do:
- teacher inference offline,
- save teacher outputs,
- fine-tune the student with SFT / LoRA.
That means the teacher is not on GPU during student training.
8.2 Training setup
- student: Llama 3 8B
- fine-tuning: LoRA
- precision: bf16
- seq length: 512
- micro-batch: 8
- gradient accumulation: 8
- effective batch size: 64
- dataset: 32,000 examples
- tokens / example: 512 average padded
Tokens per epoch:
[ 32,000 \times 512 = 16,384,000 ]
If effective training throughput is 5,800 tok/s on a single A100 80 GB with LoRA, then:
[ 16,384,000 / 5,800 \approx 2,825 \text{ sec} \approx 47 \text{ min / epoch} ]
Add eval + checkpoint overhead:
- training: ~47 min / epoch
- eval/checkpoint: ~8 min / epoch
- total: ~55 min / epoch
For 4 epochs:
[ 4 \times 55 \approx 220 \text{ min} \approx 3.7 \text{ hours} ]
8.3 Epoch table
| Epoch | Train loss | Val rubric | Human score (1-5) | Hallucination rate | p95 latency | Notes |
|---|---|---|---|---|---|---|
| 1 | 1.92 | 3.61 | 3.5 | 7.4% | 158 ms | learns format first |
| 2 | 1.41 | 3.83 | 3.8 | 5.5% | 146 ms | content relevance improves |
| 3 | 1.19 | 3.92 | 3.9 | 4.0% | 137 ms | best balanced checkpoint |
| 4 | 1.08 | 3.91 | 3.9 | 4.6% | 135 ms | slight overfit, hallucinations rise |
8.4 Stop rule
Choose the first checkpoint that satisfies all of:
- rubric score improvement < 0.02 on next epoch,
- hallucination rate not increasing,
- support escalation precision not decreasing,
- no meaningful p95 latency benefit from further tuning.
That selects epoch 3.
8.5 Why not train longer
For distillation, longer training can make the student memorize teacher style more than teacher behavior. In MangaAssist that often shows up as:
- repeating stock phrases,
- too much certainty on weak retrieval,
- nicer formatting but worse factuality,
- lower diversity on recommendation prompts.
8.6 Example TRL SFTTrainer snippet
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Meta-Llama-3-8B",
torch_dtype="bfloat16",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer.pad_token = tokenizer.eos_token
config = SFTConfig(
output_dir="./manga_llama_student",
num_train_epochs=4,
per_device_train_batch_size=8,
gradient_accumulation_steps=8,
learning_rate=2e-5,
max_seq_length=512,
bf16=True,
gradient_checkpointing=True,
)
trainer = SFTTrainer(
model=model,
args=config,
train_dataset=train_ds,
eval_dataset=val_ds,
processing_class=tokenizer,
)
trainer.train()
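The setup in 8.2 specifies LoRA, which the snippet above omits. A minimal addition via peft is sketched below; the target modules are a common but assumed choice for Llama-family models.
from peft import LoraConfig

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed choice
    task_type="CAUSAL_LM",
)

# then pass peft_config=peft_config into SFTTrainer(...) above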
9. Dry Run D — Distilled Reranker
9.1 Why this matters
Rerankers are often the hidden latency offender in RAG systems.
If the teacher cross-encoder improves NDCG@10 but adds 40–60 ms, you often need a smaller student to stay in SLO.
9.2 Student objective
[ \mathcal{L} = \alpha \cdot \mathcal{L}_{sup} + (1-\alpha)\cdot \text{MSE}(s_{student}, s_{teacher}) ]
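A minimal sketch of this objective, assuming per-pair positive/negative scores from both models; the margin value is an assumption.
import torch
import torch.nn.functional as F

def reranker_kd_loss(s_pos, s_neg, t_pos, t_neg, alpha=0.5, margin=1.0):
    # supervised pairwise loss: positive passage should outscore the negative
    sup = F.margin_ranking_loss(s_pos, s_neg, torch.ones_like(s_pos), margin=margin)
    # score distillation: regress student scores onto teacher scores
    mse = F.mse_loss(torch.cat([s_pos, s_neg]), torch.cat([t_pos, t_neg]))
    return alpha * sup + (1 - alpha) * mse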
9.3 Example dry run numbers
- teacher NDCG@10: 0.842
- base student NDCG@10: 0.751
- distilled student NDCG@10: 0.791
- teacher p95 latency: 48 ms
- distilled ONNX student p95 latency: 14 ms
9.4 Epoch table
| Epoch | Pairwise loss | MSE score loss | NDCG@10 | p95 latency | Notes |
|---|---|---|---|---|---|
| 1 | 0.491 | 0.228 | 0.771 | 14 ms | learns teacher ordering quickly |
| 2 | 0.444 | 0.192 | 0.784 | 14 ms | strong gain |
| 3 | 0.437 | 0.183 | 0.791 | 14 ms | best checkpoint |
| 4 | 0.433 | 0.181 | 0.790 | 14 ms | no useful gain |
9.5 Stop condition
Stop at epoch 3 because NDCG gain from epoch 3 to 4 is negative.
10. Early Stopping and Tradeoffs
10.1 Do not stop on training loss alone
Training loss almost always keeps falling even after the student has started to overfit.
For distillation, better stop signals are:
- validation KD loss,
- gate-set human score,
- hallucination rate,
- refusal precision,
- rare-class recall.
10.2 Practical stop rules
Classifier stop rules
Stop if 2 of 3 happen for 2 consecutive epochs:
- validation accuracy improves by < 0.15 points,
- rare-class recall improves by < 0.25 points,
- KD loss improves by < 1%.
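A minimal sketch of the classifier rule above; `history` is an assumed list of per-epoch metric dicts.
def should_stop_classifier(history):
    # history: per-epoch dicts with val_acc, rare_recall (in points) and kd_loss
    if len(history) < 3:
        return False
    def stalls(prev, cur):
        return [
            (cur["val_acc"] - prev["val_acc"]) < 0.15,
            (cur["rare_recall"] - prev["rare_recall"]) < 0.25,
            (prev["kd_loss"] - cur["kd_loss"]) < 0.01 * prev["kd_loss"],
        ]
    # stop only if >= 2 of 3 signals stalled in each of the last two epochs
    return all(
        sum(flags) >= 2
        for flags in (stalls(history[-3], history[-2]), stalls(history[-2], history[-1]))
    )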
Response model stop rules
Stop if any of these happen after the minimum epoch count:
- hallucination rate rises by > 0.3 points,
- human score gain < 0.03,
- refusal precision drops,
- teacher preference match gain < 0.2 points.
Reranker stop rules
Stop if:
- NDCG gain < 0.002 across one epoch,
- latency target already met,
- top-3 swap rate on validation set becomes unstable.
10.3 Tradeoff table
| Choice | Benefit | Risk |
|---|---|---|
| higher temperature | more dark knowledge | signal becomes too flat |
| higher alpha | stronger teacher imitation | copies teacher mistakes |
| more epochs | better teacher imitation | hallucination and memorization |
| smaller student | lower latency and cost | capacity floor too low |
| more unlabeled logs | broader coverage | shift/noise if logs are stale |
| more refusal weighting | safer behavior | over-refusal on borderline prompts |
11. Production Logs You Actually Need
11.1 Teacher-label generation logs
{
"event": "teacher_label_generation",
"run_id": "labeling_2026_04_21",
"teacher_model": "gpt-4.1-2025-04-14",
"rows_processed": 25000,
"avg_prompt_tokens": 138,
"avg_response_tokens": 212,
"teacher_refusal_rate": 0.061,
"teacher_entropy_mean": 1.74,
"low_confidence_fraction": 0.084,
"estimated_cost_usd": 148.20
}
11.2 Classifier epoch logs
{
"event": "distill_epoch_end",
"run_id": "tinybert_kd_v05",
"stage": "output_kd",
"epoch": 5,
"loss_total": 0.792,
"loss_ce": 0.593,
"loss_kd": 0.878,
"val_accuracy": 0.893,
"rare_class_recall": 0.826,
"refusal_precision": 0.951,
"throughput_seq_per_sec": 648,
"gpu_mem_gb": 11.8
}
11.3 Response-model gate logs
{
"event": "gate_eval",
"run_id": "resp_openai_kd_v03",
"student": "gpt-4.1-mini-ft-v03",
"teacher": "gpt-4.1-2025-04-14",
"teacher_preference_match": 0.872,
"human_win_rate_vs_base": 0.681,
"catalog_hallucination_rate": 0.036,
"escalation_precision": 0.947,
"cost_reduction": 0.61,
"decision": "promote_shadow"
}
11.4 Canary logs
{
"event": "online_shadow_compare",
"shadow_student": "gpt-4.1-mini-ft-v03",
"control_teacher": "gpt-4.1-2025-04-14",
"sample_size": 5000,
"student_accept_rate": 0.944,
"student_escalation_rate": 0.063,
"teacher_escalation_rate": 0.059,
"student_catalog_hallucination_rate": 0.039,
"teacher_catalog_hallucination_rate": 0.021,
"student_p95_latency_ms": 418,
"teacher_p95_latency_ms": 922
}
12. Library Ecosystem Comparison
This section answers: Which tool should I use, and when?
12.1 Comparison table
| Library / stack | Ease of use | Output KD | Feature KD | Attention KD | Hardware target | Maintenance view | Best use |
|---|---|---|---|---|---|---|---|
| raw PyTorch | medium-low | yes | yes | yes | GPU / CPU | always viable | max flexibility |
| transformers + custom Trainer | high | yes | yes (manual) | yes (manual) | GPU / CPU | active | best general NLP default |
| TextBrewer | medium | yes | yes | yes | GPU mainly | limited / older | classic NLP KD experiments |
| Knowledge-Distillation-Zoo patterns | low-medium | yes | yes | yes | GPU mainly | limited / reference-only | research baselines |
| optimum | high | no training, yes export | n/a | n/a | CPU / ONNX / edge | active | post-distillation optimization |
| trl SFTTrainer | high | response distillation | no | no | GPU | active | LLM response distillation |
| lightning | medium-high | yes | yes | yes | GPU / multi-GPU | active | clean engineering loops |
| Optimum Intel / OpenVINO | high | n/a | n/a | n/a | Intel CPU / GPU / edge | active | CPU-first inference |
| llama.cpp / llama-cpp-python | high for serving | n/a | n/a | n/a | CPU / Apple Silicon / edge | very active | quantized local serving |
How to read “maintenance view”
- active: frequent docs, releases, or official ecosystem support
- limited / older: still usable, but not where you should expect the latest production integrations
- reference-only: best as a pattern source, not as your primary production framework
12.2 Hugging Face transformers + custom Trainer
When to use it
Use this when you want:
- standard HF model loading,
- easy evaluation hooks,
- mixed precision,
- multi-GPU support,
- enough flexibility to add KD losses.
Minimal snippet
import torch
import torch.nn.functional as F
from transformers import Trainer
class KDTrainer(Trainer):
def __init__(self, teacher_model=None, temperature=4.0, alpha=0.7, **kwargs):
super().__init__(**kwargs)
self.teacher = teacher_model.eval()
self.temperature = temperature
self.alpha = alpha
for p in self.teacher.parameters():
p.requires_grad = False
def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
labels = inputs["labels"]
student_outputs = model(**inputs)
with torch.no_grad():
teacher_outputs = self.teacher(**inputs)
s_logits = student_outputs.logits
t_logits = teacher_outputs.logits
ce = F.cross_entropy(s_logits, labels)
kd = F.kl_div(
F.log_softmax(s_logits / self.temperature, dim=-1),
F.softmax(t_logits / self.temperature, dim=-1),
reduction="batchmean"
) * (self.temperature ** 2)
loss = (1 - self.alpha) * ce + self.alpha * kd
return (loss, student_outputs) if return_outputs else loss
Strengths vs raw PyTorch
- less boilerplate,
- built-in metrics/checkpoints,
- integrates with accelerate,
- easy for reproducible training jobs.
Weaknesses vs raw PyTorch
- feature/attention KD still requires manual plumbing,
- custom distributed teacher logic can get messy,
- callback/event model is helpful but not always enough for unusual pipelines.
12.3 TextBrewer
When to use it
Use it when you want a KD-focused NLP library with explicit support for:
- soft-label distillation,
- intermediate feature matching,
- dynamic loss schedules.
Minimal snippet
from textbrewer import GeneralDistiller, TrainingConfig, DistillationConfig
train_config = TrainingConfig(
output_dir="./tb_out",
device="cuda"
)
distill_config = DistillationConfig(
temperature=4,
hard_label_weight=0.3,
kd_loss_weight=0.7
)
distiller = GeneralDistiller(
train_config=train_config,
distill_config=distill_config,
model_T=teacher,
model_S=student,
adaptor_T=teacher_adaptor,
adaptor_S=student_adaptor
)
distiller.train(
optimizer=optimizer,
dataloader=train_loader,
num_epochs=4
)
Strengths vs raw PyTorch
- KD concepts are first-class,
- easier feature and attention matching setup,
- useful for classic BERT/TinyBERT style experiments.
Weaknesses vs raw PyTorch
- ecosystem feels older,
- fewer modern production examples,
- less aligned with current HF + PEFT + LLM workflows.
12.4 Knowledge-Distillation-Zoo patterns
This is better thought of as a pattern repository than a production library.
When to use it
Use it when you want:
- a fast starting point for many KD losses,
- paper-reproduction style experimentation,
- baseline implementations for losses like FitNet, AT, PKT, RKD, etc.
Minimal snippet pattern
# pattern example, not a pip-stable API
logits_loss = kd_criterion(student_logits, teacher_logits)
feat_loss = hint_criterion(student_feat, teacher_feat)
loss = 0.7 * logits_loss + 0.3 * feat_loss
loss.backward()
optimizer.step()
Strengths vs raw PyTorch
- fast way to inspect many loss designs,
- good learning/reference resource.
Weaknesses vs raw PyTorch
- not a full training framework,
- fewer production ergonomics,
- best used as inspiration, not as final platform code.
12.5 optimum for ONNX export + quantization
This is usually the next step after distillation.
When to use it
Use it when the student is already good enough and you now want:
- ONNX export,
- INT8 quantization,
- easier CPU deployment.
Minimal snippet
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig
model = ORTModelForSequenceClassification.from_pretrained("./tinybert_student", export=True)
quantizer = ORTQuantizer.from_pretrained(model)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False)
quantizer.quantize(save_dir="./tinybert_student_int8", quantization_config=qconfig)
Strengths vs raw PyTorch
- much easier ONNX path,
- simpler CPU optimization workflow,
- good post-training deployment step.
Weaknesses vs raw PyTorch
- not a KD trainer by itself,
- export/quantization edge cases still exist for some architectures.
12.6 trl SFTTrainer for LLM-to-LLM response distillation
When to use it
Use it for:
- teacher-response imitation,
- LLM instruction tuning,
- distillation where logits from the teacher are unavailable.
Minimal snippet
from trl import SFTTrainer, SFTConfig
cfg = SFTConfig(
output_dir="./student_out",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=8,
max_seq_length=512
)
trainer = SFTTrainer(
model=student_model,
args=cfg,
train_dataset=train_ds,
eval_dataset=val_ds,
processing_class=tokenizer,
)
trainer.train()
Strengths vs raw PyTorch
- very small amount of code,
- great fit for response distillation,
- integrates well with HF/PEFT stacks.
Weaknesses vs raw PyTorch
- not intended for feature KD,
- less natural for classical classifier KD.
12.7 lightning for clean distillation loops
When to use it
Use it when you want:
- clean engineering separation,
- callbacks,
- multi-GPU structure,
- long-lived training codebase.
Minimal snippet
import lightning as L
import torch.nn.functional as F
class LitKD(L.LightningModule):
def __init__(self, teacher, student, temperature=4.0, alpha=0.7):
super().__init__()
self.teacher = teacher.eval()
self.student = student
self.temperature = temperature
self.alpha = alpha
for p in self.teacher.parameters():
p.requires_grad = False
def training_step(self, batch, batch_idx):
labels = batch["labels"]
s = self.student(**batch).logits
with torch.no_grad():
t = self.teacher(**batch).logits
ce = F.cross_entropy(s, labels)
kd = F.kl_div(
F.log_softmax(s / self.temperature, dim=-1),
F.softmax(t / self.temperature, dim=-1),
reduction="batchmean"
) * (self.temperature ** 2)
loss = (1 - self.alpha) * ce + self.alpha * kd
self.log("train_loss", loss)
return loss
Strengths vs raw PyTorch
- easier large-project organization,
- strong callback/logging pattern,
- good for repeatable MLOps pipelines.
Weaknesses vs raw PyTorch
- adds another abstraction layer,
- sometimes slower to debug unusual distributed issues.
12.8 Optimum Intel / OpenVINO
When to use it
Use it when your distilled student must run mainly on:
- Intel CPUs,
- Intel iGPU / accelerator environments,
- edge or desktop inference.
Minimal snippet
from optimum.intel import OVModelForSequenceClassification
from transformers import AutoTokenizer
model = OVModelForSequenceClassification.from_pretrained(
"./tinybert_student_int8",
export=True
)
tokenizer = AutoTokenizer.from_pretrained("./tinybert_student")
inputs = tokenizer("romance manga with adult cast", return_tensors="pt")
outputs = model(**inputs)
Strengths vs raw PyTorch
- strong CPU-first deployment path,
- useful OpenVINO export and runtime integration,
- practical for low-latency CPU serving.
Weaknesses vs raw PyTorch
- mainly inference-focused,
- best after model training is finished.
12.9 llama.cpp / llama-cpp-python
When to use it
Use it after distillation when you want to serve a quantized student locally or on low-cost hardware.
Minimal snippet
from llama_cpp import Llama
llm = Llama(
model_path="./manga-student-q4_k_m.gguf",
n_ctx=4096,
n_gpu_layers=20
)
resp = llm.create_chat_completion(
messages=[{"role": "user", "content": "Suggest mature romance manga"}]
)
print(resp["choices"][0]["message"]["content"])
Strengths vs raw PyTorch
- simple local serving,
- excellent quantized inference story,
- strong Apple Silicon and CPU usability.
Weaknesses vs raw PyTorch
- serving stack, not training stack,
- you usually convert/export into it after training elsewhere.
13. Hardware-Specific Distillation Strategies
13.1 A. Single A100 80GB
Best use cases
- DistilBERT → TinyBERT full KD
- Llama 8B LoRA response distillation
- large batch experiments
- frequent checkpoint sweeps
Precision
- bf16 preferred for training on A100
- fallback: fp16 if library path requires it
- keep optimizer states in standard mixed-precision defaults
Realistic batch sizes
| Workload | Batch size | Notes |
|---|---|---|
| TinyBERT classifier KD | 128 | teacher + student + features fit comfortably |
| TinyBERT feature KD with seq 128 | 96 | if attention tensors are kept |
| Llama 3 8B LoRA SFT | 8 | good starting micro-batch |
| Llama 3 8B full finetune | usually not recommended here | LoRA/QLoRA preferred |
Throughput dry-run numbers
| Workload | Throughput | Epoch time |
|---|---|---|
| TinyBERT feature KD | ~320 seq/s | ~2.5 min / epoch on 48K rows |
| TinyBERT output KD | ~650 seq/s | ~1.2 min / epoch on 48K rows |
| Llama 3 8B LoRA, seq 512 | ~5,800 tok/s | ~47 min / epoch on 32K rows |
Bottlenecks to watch
- hidden-state extraction cost in feature KD,
- dataloader underfeeding GPU,
- sequence padding waste,
- checkpoint save stalls,
- teacher forward pass doubling compute in online KD.
Important engineering note
For managed teachers like OpenAI, do not keep the teacher in the training loop.
Generate teacher outputs first, then train only the student.
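For a local teacher (such as the DistilBERT classifier), the same idea applies: run the teacher once, cache its logits, and train the student against the cache. A minimal sketch, assuming the dataloader iterates in a fixed order:
import torch

@torch.no_grad()
def cache_teacher_logits(teacher, loader, path="teacher_logits.pt"):
    teacher.eval()
    chunks = []
    for batch in loader:  # loader must be deterministic so ids line up later
        chunks.append(teacher(**batch).logits.cpu())
    torch.save(torch.cat(chunks), path)

# During student training, look up cached logits by example index
# instead of running a teacher forward pass every step.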
13.2 B. Multi-GPU — 2×A10G or 4×A100
DDP vs FSDP for distillation
Use DDP when:
- both teacher and student fit per GPU,
- you want simpler debugging,
- the student is moderate size.
Use FSDP when:
- the student does not fit comfortably as a full replica,
- optimizer memory is the problem,
- you want bigger effective context or batch.
Distillation-specific problem
The teacher must be available on every rank if teacher inference happens online.
You have three patterns:
1. Replicate frozen teacher on every rank: easiest, best for small teachers.
2. Precompute teacher outputs: best for managed teachers or large teachers.
3. FSDP student, replicated teacher: best hybrid for medium teacher + larger student.
DDP pattern with a frozen teacher
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
teacher = TeacherModel().to(local_rank)
student = StudentModel().to(local_rank)
for p in teacher.parameters():
    p.requires_grad = False
teacher.eval()
# Only the student is DDP-wrapped. A fully frozen teacher cannot be wrapped
# (DDP rejects modules with no trainable parameters) and does not need to be;
# each rank simply holds its own replica.
student = DDP(student, device_ids=[local_rank])
for batch in train_loader:
    with torch.no_grad():
        t_logits = teacher(**batch).logits
    s_logits = student(**batch).logits
    loss = kd_loss(s_logits, t_logits, batch["labels"])
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
When to prefer accelerate
Use accelerate when:
- you want one codepath across 1 GPU / many GPUs,
- you may switch between DDP and FSDP,
- you want HF integration without hand-writing launch logic.
Minimal pattern:
from accelerate import Accelerator
accelerator = Accelerator()
student, optimizer, train_loader = accelerator.prepare(student, optimizer, train_loader)
teacher.to(accelerator.device)
teacher.eval()
for batch in train_loader:
with torch.no_grad():
t_logits = teacher(**batch).logits
s_logits = student(**batch).logits
loss = kd_loss(s_logits, t_logits, batch["labels"])
accelerator.backward(loss)
optimizer.step()
optimizer.zero_grad()
Expected scaling dry-run numbers
| Hardware | Workload | Aggregate throughput | Practical note |
|---|---|---|---|
| 2×A10G | TinyBERT KD | ~430 seq/s | ~1.35× to 1.5× over single A10G |
| 4×A100 | TinyBERT KD | ~1,150 seq/s | near-linear if input pipeline is healthy |
| 2×A10G | Llama 8B LoRA | ~1,700 tok/s | use gradient accumulation heavily |
| 4×A100 | Llama 8B LoRA/FSDP | ~17,000 tok/s | strong fit for fast checkpoint sweeps |
DDP vs FSDP tradeoff summary
| Choice | Good | Bad |
|---|---|---|
| DDP | simpler, stable, easy teacher replication | duplicates model memory |
| FSDP | lower memory, bigger models | more complex debugging and checkpointing |
| precomputed teacher cache | fastest train loop, cheapest runtime | more storage and preprocessing |
13.3 C. AWS Lambda / CPU-only inference of the distilled student
This is where distillation usually delivers real business value.
Export flow
flowchart LR
A[PyTorch student checkpoint] --> B[ONNX export with Optimum]
B --> C[INT8 quantization]
C --> D[Package model artifact]
D --> E[Lambda container or zip]
E --> F[Provisioned Concurrency optional]
ONNX export with optimum
from optimum.onnxruntime import ORTModelForSequenceClassification
model = ORTModelForSequenceClassification.from_pretrained(
"./tinybert_student",
export=True
)
model.save_pretrained("./tinybert_student_onnx")
INT8 question: does accuracy degrade further from 89.3%?
In the dry run, yes, but only slightly.
| Variant | Accuracy | Delta vs FP32 distilled |
|---|---|---|
| TinyBERT distilled fp32 | 89.3% | baseline |
| ONNX dynamic INT8 | 89.0% | -0.3 |
| ONNX static INT8 | 88.8% | -0.5 |
This is usually acceptable if the latency gain is material.
Why memory allocation matters
Lambda allocates CPU power proportional to memory.
That means 1 GB is not just “more memory”; it is also more CPU.
Expected Lambda latency dry run
| Memory | Model format | p50 warm | p95 warm | cold start p95 | Notes |
|---|---|---|---|---|---|
| 512 MB | PyTorch fp32 | 31 ms | 44 ms | 240 ms | misses strict target |
| 512 MB | ONNX INT8 | 18 ms | 29 ms | 148 ms | usable but tight |
| 1024 MB | ONNX INT8 | 11 ms | 18 ms | 103 ms | recommended |
| 1536 MB | ONNX INT8 | 9 ms | 15 ms | 95 ms | diminishing returns |
Practical recommendation
For MangaAssist intent classification:
- deploy ONNX INT8
- start with 1024 MB
- move to 512 MB only if traffic cost pressure is high and p95 still meets SLO
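A minimal handler sketch for the ONNX INT8 student; loading the session at module scope keeps warm invocations fast. The artifact paths and the single `query` event field are assumptions.
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# loaded once per container, reused across warm invocations
session = ort.InferenceSession("/opt/model/model_quantized.onnx")
tokenizer = AutoTokenizer.from_pretrained("/opt/model")

def handler(event, context):
    enc = tokenizer(event["query"], truncation=True, max_length=64, return_tensors="np")
    # feed only the inputs the exported graph actually declares
    inputs = {i.name: enc[i.name] for i in session.get_inputs()}
    logits = session.run(None, inputs)[0]
    return {"intent_id": int(np.argmax(logits, axis=-1)[0])}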
13.4 D. Apple Silicon (M2 / M3) for local development
What it is good for
- correctness testing,
- small dry runs,
- prompt formatting validation,
- tiny classifier distillation experiments,
- local quantized student inference.
PyTorch MPS basics
import torch
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
model = model.to(device)
batch = {k: v.to(device) for k, v in batch.items()}
Practical guidance
- prefer fp16 or fp32-style workflows for local experimentation,
- keep batch sizes conservative,
- expect some operator gaps or performance differences versus CUDA,
- use Apple Silicon mainly for development, not final throughput claims.
Realistic local batch sizes
| Workload | M2/M3 batch size |
|---|---|
| TinyBERT KD, seq 64 | 8–16 |
| TinyBERT inference | 32–64 |
| 7B/8B LoRA toy run | 1–2 |
| quantized local inference via llama.cpp | depends on RAM and quant level |
Dry-run throughput
| Workload | Throughput |
|---|---|
| TinyBERT training on MPS | ~110 seq/s |
| TinyBERT inference on MPS | ~260 seq/s |
| 8B quantized local generation | ~18–45 tok/s depending on quant + memory |
Main limitations vs CUDA
- less mature distributed story,
- smaller effective memory ceiling,
- weaker training throughput,
- local results are useful for debugging but should not be treated as production capacity numbers.
14. Example Fine-Tuning Evolution Narratives
These are the kinds of summaries that should be written into experiment notes after every run.
14.1 TinyBERT run summary
- Epoch 1–2: student learns coarse teacher structure; rare-class recall jumps most here.
- Epoch 3–4: student begins separating confusable intents cleanly.
- Epoch 5: best overall tradeoff between accuracy, rare recall, and refusal precision.
- Epoch 6+: training loss still falls, but user-facing metrics flatten.
14.2 OpenAI managed-student run summary
- Epoch 1: student learns answer format and common recommendation phrasing.
- Epoch 2: factuality improves because teacher response structure is internalized.
- Epoch 3: best checkpoint; hallucinations lowest while format quality stays strong.
- Epoch 4: style becomes more rigid and catalog hallucinations creep up.
14.3 Llama 8B run summary
- Epoch 1: formatting and response skeleton improve.
- Epoch 2: answer relevance rises sharply.
- Epoch 3: best human score / hallucination balance.
- Epoch 4: no meaningful answer quality gain, more memorized phrasing.
15. Deployment Decision Framework
15.1 Promotion checklist
A distilled student is promoted only if all pass:
| Category | Gate |
|---|---|
| quality | teacher preference match meets target |
| safety | hallucination/refusal metrics meet target |
| latency | p95 meets SLO |
| cost | at least 50% lower if this is a cost-driven project |
| rollback | last-known-good model available |
| observability | dashboards and alerts are live |
15.2 Shadow deployment plan
- Route 1–5% of traffic to the student in shadow mode.
- Compare answer category, refusal rate, and escalation rate.
- Human review only the disagreement bucket.
- Promote to 10%, then 25%, then 50%, then 100%.
15.3 Rollback triggers
Rollback immediately if any of these cross threshold:
- hallucination rate +1.0 point above baseline,
- refusal precision -2.0 points below baseline,
- rare-class recall -3.0 points,
- p95 latency +20% vs approved benchmark,
- support tickets on recommendation quality spike.
16. Recommended Stack by Scenario
| Scenario | Recommended stack |
|---|---|
| classic NLP classifier KD | transformers + custom Trainer |
| paper-like feature KD experiment | TextBrewer or raw PyTorch |
| response distillation into open model | trl + PEFT + HF |
| managed-model distillation | OpenAI teacher generation + OpenAI SFT |
| CPU deployment | optimum + ONNX + optionally Optimum Intel |
| local quantized serving | llama.cpp / llama-cpp-python |
| long-lived training codebase | lightning or HF + accelerate |
17. Final Recommendations for MangaAssist
17.1 What to run first
Run these in order:
1. DistilBERT → TinyBERT: fastest proof of value. Cheap. Clear metrics.
2. Managed response distillation: gpt-4.1 teacher → fine-tuned gpt-4.1-mini.
3. Self-hosted fallback distillation: same teacher outputs → Llama 3 8B student.
4. Reranker distillation + ONNX INT8: only if reranking latency is still a bottleneck.
17.2 Best stopping points from the dry runs
| Run | Best checkpoint |
|---|---|
| TinyBERT classifier | Stage 2, epoch 5 |
| OpenAI managed student | epoch 3 |
| Llama 3 8B student | epoch 3 |
| Reranker | epoch 3 |
17.3 Core lesson
The correct question is not “Did the loss go down?”
The correct question is:
“At which checkpoint did the student become cheap and fast enough, while still preserving safety, factuality, and user-perceived quality?”
That checkpoint is the one to deploy.
18. Appendix — Extra Useful Metrics
18.1 Teacher confidence spread
[ \text{confidence spread} = p_{top1} - p_{top2} ]
Low spread means ambiguous teacher signal.
These examples are good for:
- soft labels,
- human review prioritization,
- rare-class calibration.
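A tiny sketch of the spread, assuming a batch of softmax probability vectors:
import torch

def confidence_spread(probs: torch.Tensor) -> torch.Tensor:
    top2 = probs.topk(2, dim=-1).values  # shape (batch, 2)
    return top2[:, 0] - top2[:, 1]       # p_top1 - p_top2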
18.2 Example ranking metrics
[ NDCG@k = \frac{DCG@k}{IDCG@k} ]
with
[ DCG@k = \sum_{i=1}^{k} \frac{2^{rel_i}-1}{\log_2(i+1)} ]
18.3 Cost reduction formula
[ \text{cost reduction} = \frac{\text{teacher cost} - \text{student cost}}{\text{teacher cost}} ]
Example:
- teacher cost per 1K responses = 12.4
- student cost per 1K responses = 4.8
[ (12.4 - 4.8) / 12.4 = 0.6129 \approx 61.3\% ]
18.4 Hallucination rate
[ \text{hallucination rate} = \frac{\text{responses with unsupported factual claims}}{\text{all evaluated responses}} ]
Example:
- unsupported claims in sample = 29
- evaluated responses = 800
[ 29/800 = 0.03625 = 3.6\% ]
19. Appendix — Mermaid Diagram for Distillation Decisions
flowchart TD
A[Need lower cost or latency?] -->|No| B[Keep teacher in production]
A -->|Yes| C[Can smaller zero-shot model already pass?]
C -->|Yes| D[Use smaller base model directly]
C -->|No| E[Can teacher outputs be collected offline?]
E -->|Yes| F[Run distillation]
E -->|No| G[Use online KD only if teacher cost is acceptable]
F --> H[Offline eval]
H --> I{Pass quality + safety + latency gates?}
I -->|No| J[Revise data / temperature / student size]
I -->|Yes| K[Shadow deploy]
K --> L{Shadow stable?}
L -->|No| M[Rollback]
L -->|Yes| N[Promote]
20. Appendix — Source-aware Implementation Notes
This document expands the original MangaAssist distillation write-up with:
- the original DistilBERT → TinyBERT and Claude/LLM-style examples,
- concrete OpenAI-based distillation options,
- additional hardware and deployment planning,
- production-log-first dry-run analysis.
Keep the original baseline metrics and diagrams as the “teacher” document, and use this one as the operations + implementation expansion.
21. Official Docs and Repositories to Check While Implementing
These are the main docs/repositories worth checking while turning the dry run into a real pipeline:
- Hugging Face transformers Trainer documentation
- Hugging Face trl SFTTrainer documentation
- Hugging Face optimum ONNX Runtime quantization documentation
- Hugging Face optimum-intel / OpenVINO documentation
- Hugging Face accelerate documentation
- PyTorch DistributedDataParallel documentation
- PyTorch FullyShardedDataParallel documentation
- PyTorch MPS backend documentation
- AWS Lambda memory/CPU allocation documentation
- OpenAI supervised fine-tuning guide
- OpenAI model optimization guide
- OpenAI distillation cookbook example
- TextBrewer GitHub repository and docs
- Knowledge-Distillation-Zoo GitHub repository
- Lightning-AI repository/docs
- llama.cpp repository
- llama-cpp-python documentation
For production use, always re-check:
- supported model versions,
- fine-tuning availability,
- export/quantization compatibility,
- hardware backend support,
- recent release notes before locking the pipeline.