MangaAssist Knowledge Distillation — Numerical Dry Runs, Critical Calculations, and Scale Intuition
This document is a calculation-first companion to the larger MangaAssist distillation guide.
It focuses on worked dry runs, what the logs mean, how training evolves by epoch, and how decisions change when traffic moves from MangaAssist scale to Amazon-like scale.

Important note on scale examples: any references to Amazon-like scale in this document are illustrative engineering scenarios, not claims about Amazon's internal production numbers.
1. What This Document Tries to Build
Most distillation documents explain the formulas correctly but do not build intuition for questions like:
- Why does temperature help?
- Why does a student sometimes improve even when its total loss is still high?
- Why can a tiny change in hallucination rate be acceptable at small scale but catastrophic at very large scale?
- Why does a model that looks cheaper offline still fail promotion?
- When should we stop fine-tuning even if training loss keeps going down?
This document answers those using MangaAssist-style examples.
2. The Core MangaAssist Setting
We will use three concrete distillation tasks:
- **Intent classifier distillation:** DistilBERT teacher → TinyBERT student.
- **Response model distillation:** strong managed teacher (for example an OpenAI teacher such as `gpt-4.1`) → cheaper student (for example `gpt-4.1-mini` or a self-hosted Llama 3 8B).
- **Reranker distillation:** large cross-encoder teacher → smaller latency-safe reranker.
We will use these MangaAssist production constraints:
| Component | Teacher behavior | Student target |
|---|---|---|
| Intent classifier | high accuracy but higher memory/latency | fit on Lambda / CPU and keep most quality |
| Response model | best answer quality, expensive | much lower cost per response |
| Reranker | best ranking quality, too slow inline | preserve top ranking with lower latency |
2.1 Reference promotion gates from the MangaAssist scenario
| Metric | Gate |
|---|---|
| teacher preference match | >= 85% |
| human win rate vs base student | >= 65% |
| catalog hallucination rate | <= 4% |
| cost per 1K responses | at least 50% lower |
These gates are important because the model is not promoted for low loss. It is promoted because it passes quality + latency + cost + safety together.
3. The Smallest Possible Distillation Intuition
3.1 One MangaAssist example
User message:
"I want to return volume 1, but where is my order right now?"
Assume the intent classes are:
- `order_tracking`
- `return_request`
- `checkout_help`
- `faq`
This message is ambiguous. It contains both return language and order status language.
That is exactly where distillation helps most.
4. Dry Run 1 — Softmax and Temperature on One Example
4.1 Teacher and student logits
Let the teacher and student output these logits for the same message.
| Class | Teacher logit | Student logit |
|---|---|---|
| `order_tracking` | 2.8 | 2.1 |
| `return_request` | 2.2 | 1.7 |
| `checkout_help` | 0.5 | 0.8 |
| `faq` | -1.0 | -0.5 |
The teacher thinks order_tracking is best, but also sees strong overlap with return_request.
4.2 Softmax at normal temperature T = 1
Teacher probabilities:
| Class | Probability |
|---|---|
| `order_tracking` | 0.5983 |
| `return_request` | 0.3283 |
| `checkout_help` | 0.0600 |
| `faq` | 0.0134 |
Student probabilities:
| Class | Probability |
|---|---|
| `order_tracking` | 0.4958 |
| `return_request` | 0.3323 |
| `checkout_help` | 0.1351 |
| `faq` | 0.0368 |
At T=1, the distributions are already usable, but the teacher is still relatively sharp.
4.3 Softmax at distillation temperature T = 4
Teacher probabilities:
| Class | Probability |
|---|---|
| `order_tracking` | 0.3559 |
| `return_request` | 0.3063 |
| `checkout_help` | 0.2002 |
| `faq` | 0.1376 |
Student probabilities:
| Class | Probability |
|---|---|
| `order_tracking` | 0.3175 |
| `return_request` | 0.2873 |
| `checkout_help` | 0.2294 |
| `faq` | 0.1658 |
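The four tables above can be reproduced with a short temperature-scaled softmax; a minimal sketch in plain Python, using the logits from the table in 4.1 (nothing here is MangaAssist-specific):

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Divide logits by T, then apply a numerically stable softmax."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Order: order_tracking, return_request, checkout_help, faq
teacher_logits = [2.8, 2.2, 0.5, -1.0]
student_logits = [2.1, 1.7, 0.8, -0.5]

print(softmax_with_temperature(teacher_logits, T=1))  # ~[0.5983, 0.3283, 0.0600, 0.0134]
print(softmax_with_temperature(teacher_logits, T=4))  # ~[0.3559, 0.3063, 0.2002, 0.1376]
print(softmax_with_temperature(student_logits, T=4))  # ~[0.3175, 0.2873, 0.2294, 0.1658]
```

Raising T flattens the distribution without changing the ordering of the classes, which is exactly what the T=4 tables show.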
4.4 What changed, intuitively?
At T=1, the model is saying:

- "This is mostly `order_tracking`."

At T=4, it is saying:

- "This is still mostly `order_tracking`, but `return_request` is close, and `checkout_help` is not impossible."
That extra structure is the dark knowledge.
The student does not just learn the winning class. It learns the shape of the confusion.
5. Dry Run 2 — Hard-Label Cross-Entropy on the Same Example
Assume the ground-truth label is:
`order_tracking`
Student probability for the correct class at T=1 is:
p = 0.4958
Hard-label cross-entropy is:
[ \mathcal{L}_{hard} = -\log(0.4958) = 0.7017 ]
5.1 Intuition
A loss of 0.7017 means:
- the student is not disastrously wrong,
- but it is far from confidently correct,
- and the correct class only has about 49.6% probability.
If the student had predicted 0.90 for the correct class, the loss would be:
[ -\log(0.90) = 0.1054 ]
That is much lower.
So cross-entropy punishes lack of confidence on the correct class.
6. Dry Run 3 — KL Divergence Distillation Loss
Now compute the distillation loss using the softened distributions at T=4.
Teacher soft distribution:
[ p_T = [0.3559, 0.3063, 0.2002, 0.1376] ]
Student soft distribution:
[ p_S = [0.3175, 0.2873, 0.2294, 0.1658] ]
KL divergence:
[ D_{KL}(p_T \| p_S) = \sum_i p_T(i) \log \frac{p_T(i)}{p_S(i)} = 0.007315 ]
Because the loss is computed at temperature T, we multiply by T^2:
[ \mathcal{L}_{KD} = T^2 \cdot D_{KL}(p_T \| p_S) = 16 \cdot 0.007315 = 0.1170 ]
6.1 Why is the KD loss much smaller than the hard loss?
Because the student is closer to the teacher's full probability shape than it is to the one-hot target.
This is a critical intuition:
- hard labels ask: "Did you put almost everything on the winning class?"
- KD asks: "Did you match the teacher's reasoning pattern?"
The student can be imperfect on the final class but still be quite aligned with the teacher's structure.
That is why KD often stabilizes smaller students.
7. Dry Run 4 — Combined Distillation Loss
Use the standard combined loss:
[ \mathcal{L}_{total} = (1-\alpha)\mathcal{L}_{hard} + \alpha\mathcal{L}_{KD} ]
Let alpha = 0.7.
Then:
[ \mathcal{L}_{total} = 0.3 \cdot 0.7017 + 0.7 \cdot 0.1170 ]
[ \mathcal{L}_{total} = 0.2105 + 0.0819 = 0.2924 ]
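Dry Runs 2 through 4 can be chained in a few lines; a minimal sketch, assuming the same logits, T = 4, and alpha = 0.7 as above:

```python
import math

def softmax(logits, T=1.0):
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

teacher_logits = [2.8, 2.2, 0.5, -1.0]
student_logits = [2.1, 1.7, 0.8, -0.5]
true_index = 0                      # order_tracking is the ground-truth label
T, alpha = 4.0, 0.7

# Dry Run 2: hard-label cross-entropy at T = 1
hard_ce = -math.log(softmax(student_logits)[true_index])                  # ~0.7017

# Dry Run 3: KL(teacher || student) on softened distributions, scaled by T^2
p_t, p_s = softmax(teacher_logits, T), softmax(student_logits, T)
kd = (T ** 2) * sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))   # ~0.1170

# Dry Run 4: weighted combination
total = (1 - alpha) * hard_ce + alpha * kd                                # ~0.2924
print(hard_ce, kd, total)
```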
7.1 What this means
Because alpha = 0.7, training is telling the student:
- "I care more about matching the teacher's shape than only matching the hard label."
7.2 What if we change alpha?
Using the same example:
| `alpha` | Combined loss | Interpretation |
|---|---|---|
| 0.3 | 0.5263 | more trust in hard labels |
| 0.5 | 0.4094 | balanced |
| 0.7 | 0.2924 | more trust in teacher |
This does not mean 0.7 is always better. It means:
- if the teacher is reliable and calibrated, a higher `alpha` can help,
- if the teacher is often wrong on edge cases, a high `alpha` copies its mistakes faster.
8. Dry Run 5 — Why Distillation Gives Better Gradients on Ambiguous Classes
For the same example, the student hard-label gradient pattern is approximately:
[ [p_S - y] = [-0.5042, 0.3323, 0.1351, 0.0368] ]
The KD gradient pattern at T=4 before the T^2 scaling is:
[ [p_S^{(T)} - p_T^{(T)}] = [-0.0383, -0.0190, 0.0292, 0.0281] ]
After the T^2 = 16 scaling, the relative effect becomes:
[ [-0.6133, -0.3037, 0.4668, 0.4502] ]
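Both gradient patterns come straight from the softmax outputs, because the gradient of cross-entropy against a softmax with respect to the logits is simply probabilities minus targets. A minimal sketch reproducing the two patterns:

```python
import math

def softmax(logits, T=1.0):
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

teacher_logits = [2.8, 2.2, 0.5, -1.0]
student_logits = [2.1, 1.7, 0.8, -0.5]
one_hot = [1.0, 0.0, 0.0, 0.0]          # order_tracking is the true label

# Hard-label gradient: p_S(T=1) - y
hard_grad = [p - y for p, y in zip(softmax(student_logits), one_hot)]
# ~[-0.5042, 0.3323, 0.1351, 0.0368]

# KD gradient before the T^2 correction: p_S(T=4) - p_T(T=4), then scale by 16
kd_grad = [ps - pt for ps, pt in zip(softmax(student_logits, 4.0),
                                     softmax(teacher_logits, 4.0))]
kd_grad_scaled = [16 * g for g in kd_grad]   # ~[-0.61, -0.30, 0.47, 0.45]
print(hard_grad, kd_grad_scaled)
```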
8.1 Intuition
Hard-label training says:
- push the correct class up,
- push everything else down.
KD says:
- push the correct class up,
- but also keep `return_request` fairly high,
- do not flatten everything else too aggressively.
That matters because MangaAssist traffic is full of mixed-intent user messages.
9. Dry Run 6 — Batch-Level Loss Instead of One Example
Now take a mini-batch of 8 examples.
9.1 Per-example losses
| Example | Hard CE | KD loss after T^2 |
|---|---|---|
| 1 | 0.70 | 0.117 |
| 2 | 0.22 | 0.084 |
| 3 | 1.05 | 0.190 |
| 4 | 0.41 | 0.102 |
| 5 | 0.33 | 0.066 |
| 6 | 1.40 | 0.245 |
| 7 | 0.18 | 0.052 |
| 8 | 0.65 | 0.140 |
Batch averages:
- average hard CE = `0.6175`
- average KD loss = `0.1245`
With alpha = 0.7:
[ \mathcal{L}_{batch} = 0.3 \cdot 0.6175 + 0.7 \cdot 0.1245 = 0.2724 ]
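A minimal sketch of the batch-level combination, using the per-example numbers from the table; the 1.5x-average threshold used to flag dominant examples is an illustrative choice, not part of the MangaAssist pipeline.

```python
hard_ce = [0.70, 0.22, 1.05, 0.41, 0.33, 1.40, 0.18, 0.65]
kd_loss = [0.117, 0.084, 0.190, 0.102, 0.066, 0.245, 0.052, 0.140]
alpha = 0.7

avg_hard = sum(hard_ce) / len(hard_ce)                  # 0.6175
avg_kd = sum(kd_loss) / len(kd_loss)                    # 0.1245
batch_loss = (1 - alpha) * avg_hard + alpha * avg_kd    # ~0.2724

# Flag examples whose hard CE is well above the batch average (examples 3 and 6 here).
dominant = [i + 1 for i, ce in enumerate(hard_ce) if ce > 1.5 * avg_hard]
print(batch_loss, dominant)                             # 0.2724 [3, 6]
```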
9.2 Intuition from the batch
Examples 3 and 6 are the important ones.
They likely correspond to:
- long-tail messages,
- multi-intent messages,
- rare policy phrasing,
- noisy customer text.
If those examples keep dominating the batch loss for many epochs, you should ask:
- is the student too small?
- is the teacher inconsistent?
- are these labels actually wrong?
- is the data distribution too broad for this student size?
10. Dry Run 7 — Feature Matching for TinyBERT
Output KD is not the only signal. TinyBERT-style training also matches intermediate representations.
10.1 Hidden-state example
Suppose a teacher hidden vector and projected student hidden vector are:
Teacher hidden:
[ [0.8, -0.1, 0.2, 0.5] ]
Projected student hidden:
[ [0.5, 0.0, 0.4, 0.1] ]
Mean squared error:
[ \text{MSE} = \frac{(0.8-0.5)^2 + (-0.1-0.0)^2 + (0.2-0.4)^2 + (0.5-0.1)^2}{4} ]
[ = \frac{0.09 + 0.01 + 0.04 + 0.16}{4} = 0.075 ]
10.2 Attention example
Teacher attention:
[ \begin{bmatrix} 0.7 & 0.3 \\ 0.2 & 0.8 \end{bmatrix} ]
Student attention:
[ \begin{bmatrix} 0.6 & 0.4 \\ 0.25 & 0.75 \end{bmatrix} ]
Attention MSE:
[ \frac{(0.7-0.6)^2 + (0.3-0.4)^2 + (0.2-0.25)^2 + (0.8-0.75)^2}{4} = 0.00625 ]
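A minimal sketch of both feature-matching terms. It assumes the student hidden state has already been projected into the teacher's width; in TinyBERT-style training that projection is a learned linear layer, which is omitted here.

```python
def mse(a, b):
    """Mean squared error between two equal-length flat vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

teacher_hidden = [0.8, -0.1, 0.2, 0.5]
student_hidden_projected = [0.5, 0.0, 0.4, 0.1]
print(mse(teacher_hidden, student_hidden_projected))   # 0.075

# Attention maps, flattened row by row.
teacher_attention = [0.7, 0.3, 0.2, 0.8]
student_attention = [0.6, 0.4, 0.25, 0.75]
print(mse(teacher_attention, student_attention))       # 0.00625
```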
10.3 Intuition
The hidden-state loss is larger than the attention loss here.
That suggests:
- attention alignment is already decent,
- but the student representation space is still not matching the teacher well.
In practice, that can mean:
- keep Stage 1 feature distillation longer,
- lower LR for Stage 2,
- or use a slightly wider student.
11. Dry Run 8 — Reranker Distillation and Why Small MSE Can Still Hide Ranking Errors
Suppose the teacher gives these scores for four candidate manga results.
| Candidate | Teacher score |
|---|---|
| A | 0.94 |
| B | 0.91 |
| C | 0.35 |
| D | 0.12 |
Student scores:
| Candidate | Student score |
|---|---|
| A | 0.90 |
| B | 0.92 |
| C | 0.31 |
| D | 0.15 |
Score MSE is tiny:
[ \text{MSE} = 0.00105 ]
That looks excellent.
But the student swapped A and B.
11.1 Why that matters
If candidate A is the truly best result, then a tiny score error can still cause the top recommendation to be wrong.
That is why reranker distillation should monitor ranking metrics, not only score loss.
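A minimal sketch of that check: the score MSE looks excellent, but comparing the rankings induced by the scores catches the A/B swap immediately (candidate IDs are the ones from the tables above).

```python
teacher_scores = {"A": 0.94, "B": 0.91, "C": 0.35, "D": 0.12}
student_scores = {"A": 0.90, "B": 0.92, "C": 0.31, "D": 0.15}

# Score MSE over the shared candidates.
sq_errors = [(teacher_scores[c] - student_scores[c]) ** 2 for c in teacher_scores]
mse = sum(sq_errors) / len(sq_errors)                                        # 0.00105

# Rankings induced by sorting scores in descending order.
teacher_rank = sorted(teacher_scores, key=teacher_scores.get, reverse=True)  # A, B, C, D
student_rank = sorted(student_scores, key=student_scores.get, reverse=True)  # B, A, C, D

print(mse, teacher_rank == student_rank)   # tiny MSE, yet the top-1 result differs
```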
11.2 NDCG intuition
Assume relevance labels:
- A = 3
- B = 2
- C = 0
- D = 1
If the ranking is ideal, NDCG@4 = 1.0.
If the student swaps A and B, the NDCG drops to roughly 0.972.
That looks small.
But at large traffic volume, even a small NDCG drop means many more users see the second-best manga recommendation first.
12. Dry Run 9 — Response Distillation with a Managed Teacher
Now move from classification to response generation.
12.1 MangaAssist response dataset
| Source | Count |
|---|---|
| production prompts | 25,000 |
| human-corrected teacher outputs | 5,000 |
| refusal / escalation examples | 2,000 |
| total useful rows | 32,000 |
Assume we distill:
- teacher: stronger managed model such as
gpt-4.1 - student: cheaper managed student such as
gpt-4.1-minior a self-hosted Llama 3 8B
12.2 Example response-level metrics per epoch
The exact loss is different from classifier KD because for response distillation we often train on teacher outputs rather than direct logits. But the intuition is the same.
| Epoch | Train loss | Val loss | Human win rate vs base student | Teacher preference match | Hallucination rate | Refusal precision |
|---|---|---|---|---|---|---|
| 1 | 1.92 | 1.84 | 48% | 72% | 7.8% | 86% |
| 2 | 1.41 | 1.36 | 57% | 79% | 5.9% | 89% |
| 3 | 1.08 | 1.02 | 64% | 84% | 4.6% | 91% |
| 4 | 0.91 | 0.88 | 67% | 86% | 3.9% | 93% |
| 5 | 0.84 | 0.86 | 67% | 86% | 3.8% | 92% |
| 6 | 0.79 | 0.89 | 66% | 85% | 4.1% | 91% |
12.3 Best checkpoint decision
Checkpoint at epoch 4 is best.
Why not epoch 6?
Because:
- train loss keeps dropping,
- but val loss starts rising,
- hallucination rate starts getting worse,
- human win rate is flat or slightly worse.
That is the exact signature of overfitting to teacher style without improving production quality.
13. Early Stopping Intuition — What to Watch and Why
A good early stopping rule for MangaAssist is not just:
- "stop when validation loss stops improving"
It should be:
- **loss-based rule:** stop if validation loss does not improve by at least 1% for 2 consecutive epochs.
- **quality rule:** stop if human win rate improvement is less than 0.5 points over 2 epochs.
- **safety rule:** stop immediately if hallucination rate worsens for 2 evals in a row.
- **business rule:** do not promote any checkpoint that fails cost/latency/safety gates, even if its loss is best.
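A minimal sketch of how the first three rules could be combined into a single per-epoch stop decision; the history structure and field names are assumptions for illustration, not the MangaAssist log schema. The business rule is a promotion gate rather than a stopping rule, so it is left to the gate check.

```python
def should_stop(history):
    """history: per-epoch dicts with val_loss, win_rate, hallucination_rate (oldest first)."""
    if len(history) < 3:
        return False                      # need two full epoch-over-epoch comparisons
    before, prev, last = history[-3], history[-2], history[-1]

    # Loss rule: val loss failed to improve by at least 1% in 2 consecutive epochs.
    loss_stalled = (prev["val_loss"] > 0.99 * before["val_loss"] and
                    last["val_loss"] > 0.99 * prev["val_loss"])

    # Quality rule: win rate gained less than 0.5 points over the last 2 epochs.
    quality_stalled = (last["win_rate"] - before["win_rate"]) < 0.005

    # Safety rule: hallucination rate worsened in 2 consecutive evals.
    safety_regressed = (before["hallucination_rate"] < prev["hallucination_rate"]
                        < last["hallucination_rate"])

    return loss_stalled or quality_stalled or safety_regressed
```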
13.1 Why training loss alone is dangerous
At later epochs, the student often becomes better at imitating:
- teacher phrasing,
- teacher verbosity,
- teacher style tokens,
- teacher formatting patterns.
But that does not always improve:
- factuality,
- escalation behavior,
- refusal correctness,
- catalog grounding.
So the model can look better mathematically while getting worse operationally.
14. Dry Run 10 — Human-Corrected Data as a Safety Brake
Suppose the teacher-labeled dataset has:
- 25,000 teacher responses,
- 5,000 human-corrected rows,
- 2,000 refusal/escalation rows.
If we up-weight:
- corrected rows by 3x,
- refusal rows by 2x,
then the effective training row count becomes:
[ 25,000 + (5,000 \times 3) + (2,000 \times 2) = 44,000 ]
Effective contribution by group:
| Group | Effective rows | Share |
|---|---|---|
| raw teacher responses | 25,000 | 56.8% |
| human-corrected rows | 15,000 | 34.1% |
| refusal/escalation rows | 4,000 | 9.1% |
14.1 Intuition
Even though the corrected set is only 5,000 rows, weighting makes it a much stronger anchor.
That helps prevent the student from copying teacher mistakes too blindly.
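A minimal sketch of the weighting arithmetic, expressed as per-group sample weights rather than physically duplicating rows (either approach gives the same effective counts; the group names match the table above).

```python
groups = {
    "raw_teacher_responses": {"rows": 25_000, "weight": 1},
    "human_corrected": {"rows": 5_000, "weight": 3},
    "refusal_escalation": {"rows": 2_000, "weight": 2},
}

effective = {name: g["rows"] * g["weight"] for name, g in groups.items()}
total = sum(effective.values())                              # 44,000

for name, rows in effective.items():
    print(f"{name}: {rows} effective rows ({rows / total:.1%})")
# raw_teacher_responses: 25000 effective rows (56.8%)
# human_corrected: 15000 effective rows (34.1%)
# refusal_escalation: 4000 effective rows (9.1%)
```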
15. Production Log Fields That Actually Matter
These are the logs that build intuition in a real pipeline.
15.1 Step log
```json
{
  "event": "kd_train_step",
  "task": "intent_distillation",
  "epoch": 3,
  "global_step": 1840,
  "batch_size": 128,
  "hard_ce": 0.482,
  "kd_kl_t2": 0.109,
  "feature_mse": 0.071,
  "attention_mse": 0.008,
  "total_loss": 0.252,
  "grad_norm": 1.84,
  "lr": 0.000031,
  "throughput_samples_per_sec": 2210
}
```
15.2 How to read it
- `hard_ce = 0.482`: the student is still making meaningful class mistakes.
- `kd_kl_t2 = 0.109`: the student is reasonably aligned with the teacher distribution.
- `feature_mse = 0.071`: internal representations still have room to improve.
- `grad_norm = 1.84`: gradients are healthy, not exploding.
If hard_ce stops improving but feature_mse is still improving, the student may be learning internal structure that will help later.
If feature_mse is flat and hard_ce is noisy, the student may have hit its capacity limit.
15.3 Epoch summary log
```json
{
  "event": "kd_epoch_summary",
  "task": "response_distillation",
  "student": "manga-student-8b-v04",
  "teacher": "managed-teacher-v07",
  "epoch": 4,
  "train_loss": 0.91,
  "val_loss": 0.88,
  "teacher_preference_match": 0.86,
  "human_win_rate_vs_base": 0.67,
  "hallucination_rate": 0.039,
  "refusal_precision": 0.93,
  "avg_output_tokens": 142,
  "cost_per_1k_responses": 0.42,
  "checkpoint_promotable": true
}
```
15.4 Important derived calculations
If the baseline student cost per 1K responses was 0.95 and the distilled student is 0.42, then:
[ \text{cost reduction} = 1 - \frac{0.42}{0.95} = 55.8\% ]
If hallucination_rate = 0.039 on 2,000 reviewed samples, then:
[ 0.039 \times 2000 = 78 ]
That means 78 hallucination cases were found in review.
At this point, the model passes the <= 4% hallucination gate.
16. Dry Run 11 — Cost Intuition at MangaAssist Scale
Assume MangaAssist runs 500,000 responses per day.
Using the earlier blended per-request numbers from the MangaAssist scenario:
- teacher cost per query = `$0.003`
- student cost per query = `$0.0001`
16.1 Daily cost
Teacher daily cost:
[ 500,000 \times 0.003 = 1,500 ]
Student daily cost:
[ 500,000 \times 0.0001 = 50 ]
Daily savings:
[ 1,500 - 50 = 1,450 ]
Monthly savings at 30 days:
[ 1,450 \times 30 = 43,500 ]
16.2 Intuition
At MangaAssist scale, cost savings are already meaningful.
But quality still matters.
If hallucination rate rises by 1.5 percentage points after distillation, then extra bad responses per day are:
[ 500,000 \times 0.015 = 7,500 ]
So the real question is not:
- "Did we save money?"
It is:
- "Did we save enough money to justify 7,500 extra bad responses per day?"
That is why cost and quality must be read together.
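The same arithmetic applies at any traffic level, which makes the scale comparison in the next section mechanical; a minimal sketch with a hypothetical helper, using the illustrative per-query costs and the 1.5-point regression from above.

```python
def distillation_tradeoff(daily_volume, teacher_cost=0.003, student_cost=0.0001,
                          hallucination_regression=0.015):
    """Daily and monthly dollar savings, plus extra bad responses per day."""
    daily_savings = daily_volume * (teacher_cost - student_cost)
    return {
        "daily_savings": daily_savings,
        "monthly_savings": daily_savings * 30,
        "extra_bad_responses_per_day": daily_volume * hallucination_regression,
    }

print(distillation_tradeoff(500_000))      # MangaAssist scale: 1,450 / 43,500 / 7,500
print(distillation_tradeoff(50_000_000))   # Amazon-like scale: 145,000 / 4,350,000 / 750,000
```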
17. Dry Run 12 — Amazon-Like Scale Changes Everything
Now use an illustrative Amazon-like scale scenario.
Assume traffic is:
- `50,000,000` responses per day
Use the same per-query cost numbers only as an engineering illustration.
17.1 Daily cost at Amazon-like scale
Teacher daily cost:
[ 50,000,000 \times 0.003 = 150,000 ]
Student daily cost:
[ 50,000,000 \times 0.0001 = 5,000 ]
Daily savings:
[ 150,000 - 5,000 = 145,000 ]
Monthly savings:
[ 145,000 \times 30 = 4,350,000 ]
17.2 Now look at the same 1.5% quality regression
Extra bad responses per day:
[ 50,000,000 \times 0.015 = 750,000 ]
17.3 Intuition
At very large scale:
- tiny cost differences become millions of dollars,
- tiny quality regressions become hundreds of thousands of bad experiences per day.
This is why large platforms almost never make a full replacement decision from a single offline metric.
They use:
- stratified evals,
- canaries,
- traffic slicing,
- fallback routing,
- selective teacher escalation,
- per-segment promotion gates.
18. Why Sampling Changes at Large Scale
At MangaAssist scale, maybe a 25,000 row training sample already covers a lot of the product surface.
At Amazon-like scale, 25,000 rows may be only a tiny snapshot.
18.1 Same sample size, different meaning
If traffic is 50,000,000/day, then a 25,000 example sample is:
[ \frac{25,000}{50,000,000} = 0.0005 = 0.05\% ]
So 25,000 rows are only 0.05% of one day of traffic.
18.2 Why this matters
Rare patterns become common in absolute count.
If a rare failure mode happens in only 0.1% of requests:
- at `500,000`/day, that is `500` failures/day,
- at `50,000,000`/day, that is `50,000` failures/day.
So at very large scale, the golden set must explicitly include:
- long-tail categories,
- rare policy cases,
- locale variation,
- device variation,
- high-value customer flows,
- safety-critical escalations.
19. Canary Math — Why Big Platforms Detect Regressions Faster
Suppose baseline accuracy is 90.0% and you want to detect a drop to 89.5%.
A rough two-sample test suggests you need about:
- `57,700` samples per arm
Call it 58K queries per arm.
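The 57,700 figure matches a standard two-proportion sample-size approximation; a minimal sketch, assuming the usual 5% significance and 80% power defaults (the document only states the result, so those defaults are an assumption here).

```python
from statistics import NormalDist

def samples_per_arm(p1, p2, significance=0.05, power=0.80):
    """Approximate per-arm sample size for detecting p1 -> p2 in a two-proportion test."""
    z_a = NormalDist().inv_cdf(1 - significance / 2)   # ~1.96
    z_b = NormalDist().inv_cdf(power)                   # ~0.84
    p_bar = (p1 + p2) / 2
    numerator = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5 +
                 z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return numerator / (p1 - p2) ** 2

print(samples_per_arm(0.900, 0.895))   # ~57,700 samples per arm
```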
19.1 What that means at MangaAssist scale
If total traffic is 500,000/day, and your canary gets 10% of traffic:
- canary traffic per day = `50,000`
To collect 58,000 canary samples, you need a little over one day.
19.2 What that means at Amazon-like scale
If total traffic is 50,000,000/day, and your canary gets 1% of traffic:
- canary traffic per day = `500,000`
To collect 58,000 samples, you need only a small part of the day.
19.3 Intuition
Large-scale systems can detect tiny regressions faster if they have good telemetry.
But they also suffer larger absolute damage if they promote a bad model too early.
20. How Fine-Tuning Typically Evolves by Epoch
A useful mental model:
Epoch 1
- student learns broad teacher style,
- loss drops quickly,
- quality may still be unstable,
- hallucination rate is usually still high.
Epoch 2–3
- most of the useful gains arrive,
- ambiguity handling improves,
- preference match jumps,
- long-tail failures begin to shrink.
Epoch 4
- often the best tradeoff zone,
- validation and business metrics align,
- latency and cost are already fixed by model size, so quality becomes the final gate.
Epoch 5+
- train loss still falls,
- validation often flattens or worsens,
- style imitation gets stronger,
- factual drift and teacher quirks may increase.
20.1 The intuition you want
Distillation improvement is often front-loaded.
The first few epochs teach the student the big structure.
Later epochs often teach the student the teacher's imperfections.
21. Decision Table — When to Stop, Resize, or Redesign
| Symptom | Likely meaning | Best action |
|---|---|---|
| train loss down, val loss down, hallu down | healthy learning | continue |
| train loss down, val flat, hallu flat | near convergence | prepare stop |
| train loss down, val up, hallu up | overfitting / teacher style copying | early stop |
| KD loss low but hard CE high | student matches shape but not class enough | increase hard-label weight or extend training |
| hard CE low but KD high | student predicts labels but misses teacher uncertainty | raise alpha or temperature |
| all losses high for many epochs | student too small or data too noisy | larger student / better data |
| low offline loss, poor online business metrics | missing eval dimensions | fix gate design |
22. Two Practical Mermaid Diagrams
22.1 Where the numbers come from
```mermaid
flowchart LR
    A[Teacher logits] --> B[Softmax with T]
    C[Student logits] --> D[Softmax with T]
    B --> E[KL divergence]
    D --> E
    C --> F[Hard-label cross-entropy]
    E --> G[Weighted total loss]
    F --> G
    G --> H[Backprop into student]
```
22.2 How scale changes the decision
```mermaid
flowchart TD
    A["Same model regression: +1.5% hallucination"] --> B["MangaAssist scale: 500K/day"]
    A --> C["Amazon-like scale: 50M/day"]
    B --> D["Extra bad responses: 7,500/day"]
    C --> E["Extra bad responses: 750,000/day"]
    D --> F["Maybe acceptable only with strong savings or fallback use"]
    E --> G["Usually requires canarying, segmentation, routing, and rollback guardrails"]
```
23. Final Intuition Summary
23.1 What distillation is really doing
It is not magic compression.
It is a trade:
- keep as much of the teacher's behavior as possible,
- in a smaller, cheaper, faster model,
- without copying the teacher's mistakes too strongly.
23.2 The critical calculations to internalize
- Hard CE tells you how well the student matches the final label.
- KD / KL loss tells you how well the student matches the teacher's uncertainty structure.
- Feature loss tells you whether the student is learning similar internal representations.
- Hallucination rate tells you whether cheaper answers are safe enough.
- Cost reduction tells you whether the student is financially meaningful.
- Absolute failure count tells you what a small percentage means at real traffic scale.
23.3 The most important intuition at scale
At small scale, you can ask:
- "Is this student good enough?"
At very large scale, you must ask:
- "For which traffic segments is this student good enough?"
- "Where should we still call the stronger teacher?"
- "What tiny percentage regression becomes huge in absolute count?"
That is the real production mindset.
24. Recommended Next Document
After understanding this numerical dry-run guide, the best next document is:
- "MangaAssist Distillation Operations Playbook"
That document should cover:
- dataset refresh policy,
- log schema,
- online canary rules,
- teacher fallback routing,
- segment-wise promotion,
- rollback conditions,
- hardware deployment paths.