MangaAssist Knowledge Distillation — Numerical Dry Runs, Critical Calculations, and Scale Intuition

This document is a calculation-first companion to the larger MangaAssist distillation guide.
It focuses on worked dry runs, what the logs mean, how training evolves by epoch, and how decisions change when traffic moves from MangaAssist scale to Amazon-like scale.

Important note on scale examples: any references to Amazon-like scale in this document are illustrative engineering scenarios, not claims about Amazon's internal production numbers.


1. What This Document Tries to Build

Most distillation documents explain the formulas correctly but do not build intuition for questions like:

  • Why does temperature help?
  • Why does a student sometimes improve even when its total loss is still high?
  • Why can a tiny change in hallucination rate be acceptable at small scale but catastrophic at very large scale?
  • Why does a model that looks cheaper offline still fail promotion?
  • When should we stop fine-tuning even if training loss keeps going down?

This document answers those using MangaAssist-style examples.


2. The Core MangaAssist Setting

We will use three concrete distillation tasks:

  1. Intent classifier distillation
    DistilBERT teacher → TinyBERT student.

  2. Response model distillation
    Strong managed teacher (for example an OpenAI teacher such as gpt-4.1) → cheaper student (for example gpt-4.1-mini or a self-hosted Llama 3 8B).

  3. Reranker distillation
    Large cross-encoder teacher → smaller latency-safe reranker.

We will use these MangaAssist production constraints:

Component          Teacher behavior                          Student target
Intent classifier  high accuracy but higher memory/latency   fit on Lambda / CPU and keep most quality
Response model     best answer quality, expensive            much lower cost per response
Reranker           best ranking quality, too slow inline     preserve top ranking with lower latency

2.1 Reference promotion gates from the MangaAssist scenario

Metric                          Gate
teacher preference match        >= 85%
human win rate vs base student  >= 65%
catalog hallucination rate      <= 4%
cost per 1K responses           at least 50% lower

These gates are important because the model is not promoted for low loss. It is promoted because it passes quality + latency + cost + safety together.


3. The Smallest Possible Distillation Intuition

3.1 One MangaAssist example

User message:

"I want to return volume 1, but where is my order right now?"

Assume the intent classes are:

  1. order_tracking
  2. return_request
  3. checkout_help
  4. faq

This message is ambiguous. It contains both return language and order status language.

That is exactly where distillation helps most.


4. Dry Run 1 — Softmax and Temperature on One Example

4.1 Teacher and student logits

Let the teacher and student output these logits for the same message.

Class           Teacher logit   Student logit
order_tracking  2.8             2.1
return_request  2.2             1.7
checkout_help   0.5             0.8
faq             -1.0            -0.5

The teacher thinks order_tracking is best, but also sees strong overlap with return_request.

4.2 Softmax at normal temperature T = 1

Teacher probabilities:

Class           Probability
order_tracking  0.5983
return_request  0.3283
checkout_help   0.0600
faq             0.0134

Student probabilities:

Class           Probability
order_tracking  0.4958
return_request  0.3323
checkout_help   0.1351
faq             0.0368

At T=1, the distributions are already usable, but the teacher is still relatively sharp.

4.3 Softmax at distillation temperature T = 4

Teacher probabilities:

Class           Probability
order_tracking  0.3559
return_request  0.3063
checkout_help   0.2002
faq             0.1376

Student probabilities:

Class           Probability
order_tracking  0.3175
return_request  0.2873
checkout_help   0.2294
faq             0.1658
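
All four probability tables are easy to reproduce. Here is a minimal numpy sketch (the helper name softmax is ours) that recomputes them from the logits in section 4.1:

import numpy as np

def softmax(logits, T=1.0):
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

teacher = [2.8, 2.2, 0.5, -1.0]
student = [2.1, 1.7, 0.8, -0.5]

for T in (1.0, 4.0):
    print(f"T={T}: teacher={softmax(teacher, T).round(4)}, "
          f"student={softmax(student, T).round(4)}")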

4.4 What changed, intuitively?

At T=1, the model is saying:

  • "This is mostly order_tracking."

At T=4, it is saying:

  • "This is still mostly order_tracking, but return_request is close, and checkout_help is not impossible."

That extra structure is the dark knowledge.

The student does not just learn the winning class. It learns the shape of the confusion.


5. Dry Run 2 — Hard-Label Cross-Entropy on the Same Example

Assume the ground-truth label is:

  • order_tracking

Student probability for the correct class at T=1 is:

  • p = 0.4958

Hard-label cross-entropy is:

[ \mathcal{L}_{hard} = -\log(0.4958) = 0.7017 ]

5.1 Intuition

A loss of 0.7017 means:

  • the student is not disastrously wrong,
  • but it is far from confidently correct,
  • and the correct class only has about 49.6% probability.

If the student had predicted 0.90 for the correct class, the loss would be:

[ -\log(0.90) = 0.1054 ]

That is much lower.

So cross-entropy punishes lack of confidence on the correct class.


6. Dry Run 3 — KL Divergence Distillation Loss

Now compute the distillation loss using the softened distributions at T=4.

Teacher soft distribution:

[ p_T = [0.3559, 0.3063, 0.2002, 0.1376] ]

Student soft distribution:

[ p_S = [0.3175, 0.2873, 0.2294, 0.1658] ]

KL divergence:

[ D_{KL}(p_T \| p_S) = \sum_i p_T(i) \log \frac{p_T(i)}{p_S(i)} = 0.007315 ]

Because the loss is computed at temperature T, we multiply by T^2:

[ \mathcal{L}_{KD} = T^2 \cdot D_{KL}(p_T \| p_S) = 16 \cdot 0.007315 = 0.1170 ]
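
A quick pure-Python check of the KL sum and the T^2 scaling, plus the hard cross-entropy from Dry Run 2:

import math

p_T = [0.3559, 0.3063, 0.2002, 0.1376]   # teacher at T = 4
p_S = [0.3175, 0.2873, 0.2294, 0.1658]   # student at T = 4

kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_T, p_S))
print(round(kl, 5))                  # ~0.00734 (0.007315 with unrounded probabilities)
print(round(16 * kl, 4))             # ~0.1174, the T^2-scaled KD loss
print(round(-math.log(0.4958), 4))   # 0.7016, the hard CE from Dry Run 2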

6.1 Why is the KD loss much smaller than the hard loss?

Because the student is closer to the teacher's full probability shape than it is to the one-hot target.

This is a critical intuition:

  • hard labels ask: "Did you put almost everything on the winning class?"
  • KD asks: "Did you match the teacher's reasoning pattern?"

The student can be imperfect on the final class but still be quite aligned with the teacher's structure.

That is why KD often stabilizes smaller students.


7. Dry Run 4 — Combined Distillation Loss

Use the standard combined loss:

[ \mathcal{L}_{total} = (1-\alpha)\,\mathcal{L}_{hard} + \alpha\,\mathcal{L}_{KD} ]

Let alpha = 0.7.

Then:

[ \mathcal{L}_{total} = 0.3 \cdot 0.7017 + 0.7 \cdot 0.1170 ]

[ \mathcal{L}_{total} = 0.2105 + 0.0819 = 0.2924 ]

7.1 What this means

Because alpha = 0.7, training is telling the student:

  • "I care more about matching the teacher's shape than only matching the hard label."

7.2 What if we change alpha?

Using the same example:

alpha  Combined loss  Interpretation
0.3    0.5263         more trust in hard labels
0.5    0.4094         balanced
0.7    0.2924         more trust in teacher

This does not mean 0.7 is always better. It means:

  • if the teacher is reliable and calibrated, higher alpha can help,
  • if the teacher is often wrong on edge cases, high alpha copies mistakes faster.
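
In training code, this combined objective is only a few lines. A minimal PyTorch-style sketch, assuming raw logits and integer class labels (the function name and defaults are ours, not from the MangaAssist codebase):

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Hard-label cross-entropy, computed at T = 1.
    hard = F.cross_entropy(student_logits, labels)
    # KL(teacher || student) on temperature-softened distributions.
    # kl_div expects student log-probs and teacher probs; the teacher is
    # frozen, so its logits are detached. The T^2 factor keeps the soft-loss
    # gradient magnitude comparable to the hard loss.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits.detach() / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return (1 - alpha) * hard + alpha * kd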

8. Dry Run 5 — Why Distillation Gives Better Gradients on Ambiguous Classes

For the same example, the student hard-label gradient pattern is approximately:

[ [p_S - y] = [-0.5042, 0.3323, 0.1351, 0.0368] ]

The KD gradient pattern at T=4 before the T^2 scaling is:

[ [p_S^{(T)} - p_T^{(T)}] = [-0.0383, -0.0190, 0.0292, 0.0281] ]

After the T^2 = 16 scaling, the relative effect becomes:

[ [-0.6133, -0.3037, 0.4668, 0.4502] ]

8.1 Intuition

Hard-label training says:

  • push the correct class up,
  • push everything else down.

KD says:

  • push the correct class up,
  • but also keep return_request fairly high,
  • do not flatten everything else too aggressively.

That matters because MangaAssist traffic is full of mixed-intent user messages.


9. Dry Run 6 — Batch-Level Loss Instead of One Example

Now take a mini-batch of 8 examples.

9.1 Per-example losses

Example  Hard CE  KD loss after T^2
1        0.70     0.117
2        0.22     0.084
3        1.05     0.190
4        0.41     0.102
5        0.33     0.066
6        1.40     0.245
7        0.18     0.052
8        0.65     0.140

Batch averages:

  • average hard CE = 0.6175
  • average KD loss = 0.1245

With alpha = 0.7:

[ \mathcal{L}_{batch} = 0.3 \cdot 0.6175 + 0.7 \cdot 0.1245 = 0.2724 ]
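
A small sketch that reproduces the batch average and flags the two outlier examples discussed below:

hard = [0.70, 0.22, 1.05, 0.41, 0.33, 1.40, 0.18, 0.65]
kd   = [0.117, 0.084, 0.190, 0.102, 0.066, 0.245, 0.052, 0.140]

alpha = 0.7
mean_hard = sum(hard) / len(hard)            # 0.6175
mean_kd = sum(kd) / len(kd)                  # 0.1245
batch_loss = (1 - alpha) * mean_hard + alpha * mean_kd
print(round(batch_loss, 4))                  # 0.2724

# examples whose hard CE sits far above the batch mean deserve inspection
outliers = [i + 1 for i, h in enumerate(hard) if h > 1.5 * mean_hard]
print(outliers)                              # [3, 6]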

9.2 Intuition from the batch

Examples 3 and 6 are the important ones.

They likely correspond to:

  • long-tail messages,
  • multi-intent messages,
  • rare policy phrasing,
  • noisy customer text.

If those examples keep dominating the batch loss for many epochs, you should ask:

  • is the student too small?
  • is the teacher inconsistent?
  • are these labels actually wrong?
  • is the data distribution too broad for this student size?

10. Dry Run 7 — Feature Matching for TinyBERT

Output KD is not the only signal. TinyBERT-style training also matches intermediate representations.

10.1 Hidden-state example

Suppose a teacher hidden vector and projected student hidden vector are:

Teacher hidden:

[ [0.8, -0.1, 0.2, 0.5] ]

Projected student hidden:

[ [0.5, 0.0, 0.4, 0.1] ]

Mean squared error:

[ \text{MSE} = \frac{(0.8-0.5)^2 + (-0.1-0.0)^2 + (0.2-0.4)^2 + (0.5-0.1)^2}{4} ]

[ = \frac{0.09 + 0.01 + 0.04 + 0.16}{4} = 0.075 ]

10.2 Attention example

Teacher attention:

[ \begin{bmatrix} 0.7 & 0.3 \\ 0.2 & 0.8 \end{bmatrix} ]

Student attention:

[ \begin{bmatrix} 0.6 & 0.4 \\ 0.25 & 0.75 \end{bmatrix} ]

Attention MSE:

[ \frac{(0.7-0.6)^2 + (0.3-0.4)^2 + (0.2-0.25)^2 + (0.8-0.75)^2}{4} = 0.00625 ]
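
Both losses are plain mean squared errors. A numpy sketch (assuming the learned projection has already mapped the student hidden state to the teacher's width):

import numpy as np

teacher_h = np.array([0.8, -0.1, 0.2, 0.5])
student_h = np.array([0.5, 0.0, 0.4, 0.1])              # after the projection
print(round(float(np.mean((teacher_h - student_h) ** 2)), 5))   # 0.075

teacher_attn = np.array([[0.7, 0.3], [0.2, 0.8]])
student_attn = np.array([[0.6, 0.4], [0.25, 0.75]])
print(round(float(np.mean((teacher_attn - student_attn) ** 2)), 5))   # 0.00625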

10.3 Intuition

The hidden-state loss is larger than the attention loss here.

That suggests:

  • attention alignment is already decent,
  • but the student representation space is still not matching the teacher well.

In practice, that can mean:

  • keep Stage 1 feature distillation longer,
  • lower LR for Stage 2,
  • or use a slightly wider student.

11. Dry Run 8 — Reranker Distillation and Why Small MSE Can Still Hide Ranking Errors

Suppose the teacher gives these scores for four candidate manga results.

Candidate  Teacher score
A          0.94
B          0.91
C          0.35
D          0.12

Student scores:

Candidate  Student score
A          0.90
B          0.92
C          0.31
D          0.15

Score MSE is tiny:

[ \text{MSE} = 0.00105 ]

That looks excellent.

But the student swapped A and B.

11.1 Why that matters

If candidate A is the truly best result, then a tiny score error can still cause the top recommendation to be wrong.

That is why reranker distillation should monitor ranking metrics, not only score loss.

11.2 NDCG intuition

Assume relevance labels:

  • A = 3
  • B = 2
  • C = 0
  • D = 1

If the ranking is ideal, NDCG@4 = 1.0.

If the student swaps A and B, NDCG@4 drops to roughly 0.91 (using linear gains and a log2 rank discount; the exact value depends on the NDCG convention).

Averaged into a dashboard metric over mostly unaffected queries, that can still look like a small drop.

But at large traffic volume, even a small NDCG drop means many more users see the second-best manga recommendation first.
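
A sketch that reproduces both the tiny MSE and the top-1 swap, using a linear-gain NDCG (conventions vary; this one matches the ~0.91 above):

import numpy as np

rel = {"A": 3, "B": 2, "C": 0, "D": 1}
teacher = {"A": 0.94, "B": 0.91, "C": 0.35, "D": 0.12}
student = {"A": 0.90, "B": 0.92, "C": 0.31, "D": 0.15}

mse = np.mean([(teacher[c] - student[c]) ** 2 for c in rel])
print(round(float(mse), 5))                    # 0.00105 — looks excellent

def dcg(gains):
    return sum(g / np.log2(i + 2) for i, g in enumerate(gains))

ranking = sorted(rel, key=student.get, reverse=True)    # ['B', 'A', 'C', 'D']
ndcg = dcg([rel[c] for c in ranking]) / dcg(sorted(rel.values(), reverse=True))
print(ranking, round(float(ndcg), 3))          # top result swapped, NDCG ~0.908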


12. Dry Run 9 — Response Distillation with a Managed Teacher

Now move from classification to response generation.

12.1 MangaAssist response dataset

Source                           Count
production prompts               25,000
human-corrected teacher outputs  5,000
refusal / escalation examples    2,000
total useful rows                32,000

Assume we distill:

  • teacher: stronger managed model such as gpt-4.1
  • student: cheaper managed student such as gpt-4.1-mini or a self-hosted Llama 3 8B

12.2 Example response-level metrics per epoch

The exact loss is different from classifier KD because for response distillation we often train on teacher outputs rather than direct logits. But the intuition is the same.

Epoch  Train loss  Val loss  Human win rate vs base student  Teacher preference match  Hallucination rate  Refusal precision
1      1.92        1.84      48%                             72%                       7.8%                86%
2      1.41        1.36      57%                             79%                       5.9%                89%
3      1.08        1.02      64%                             84%                       4.6%                91%
4      0.91        0.88      67%                             86%                       3.9%                93%
5      0.84        0.86      67%                             86%                       3.8%                92%
6      0.79        0.89      66%                             85%                       4.1%                91%

12.3 Best checkpoint decision

Checkpoint at epoch 4 is the best promotion candidate. Epoch 5 shaves a little off validation loss, but it gains nothing in win rate or preference match and gives up a point of refusal precision.

Why not epoch 6?

Because:

  • train loss keeps dropping,
  • but val loss starts rising,
  • hallucination rate starts getting worse,
  • human win rate is flat or slightly worse.

That is the exact signature of overfitting to teacher style without improving production quality.
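
The promotion gates from section 2.1 can be encoded directly. A sketch (thresholds from the scenario; the rows copy the table above):

gates = {
    "teacher_match": lambda v: v >= 0.85,
    "win_rate": lambda v: v >= 0.65,
    "halluc": lambda v: v <= 0.04,
}

epochs = [
    {"epoch": 3, "teacher_match": 0.84, "win_rate": 0.64, "halluc": 0.046},
    {"epoch": 4, "teacher_match": 0.86, "win_rate": 0.67, "halluc": 0.039},
    {"epoch": 5, "teacher_match": 0.86, "win_rate": 0.67, "halluc": 0.038},
    {"epoch": 6, "teacher_match": 0.85, "win_rate": 0.66, "halluc": 0.041},
]

promotable = [e["epoch"] for e in epochs
              if all(check(e[name]) for name, check in gates.items())]
print(promotable)   # [4, 5] — epoch 6 fails the hallucination gate

Epoch 3 misses all three gates and epoch 6 fails on hallucination; epochs 4 and 5 both pass, and preferring the earlier of the two limits the teacher-style overfitting described in section 20.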


13. Early Stopping Intuition — What to Watch and Why

A good early stopping rule for MangaAssist is not just:

  • "stop when validation loss stops improving"

It should be:

  1. loss-based rule
    Stop if validation loss does not improve by at least 1% for 2 consecutive epochs.

  2. quality rule
    Stop if human win rate improvement is less than 0.5 points over 2 epochs.

  3. safety rule
    Stop immediately if hallucination rate worsens for 2 evals in a row.

  4. business rule
    Do not promote any checkpoint that fails cost/latency/safety gates even if its loss is best.
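
A sketch of how the first three rules might combine into one per-epoch check (the history schema and helper name are hypothetical; rule 4 gates promotion rather than training, so it is not part of this function):

def should_stop(history, patience=2):
    # history: per-epoch dicts with val_loss, win_rate, halluc (hypothetical schema)
    if len(history) < patience + 1:
        return False
    window = history[-(patience + 1):]

    # rule 1 (approximate): val loss has not improved by >= 1% across the window
    loss_stalled = window[-1]["val_loss"] > window[0]["val_loss"] * 0.99

    # rule 2: human win rate gained less than 0.5 points over the window
    quality_stalled = window[-1]["win_rate"] - window[0]["win_rate"] < 0.005

    # rule 3: hallucination rate worsened on consecutive evals
    safety_breach = all(window[i + 1]["halluc"] > window[i]["halluc"]
                        for i in range(patience))

    return safety_breach or (loss_stalled and quality_stalled)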

13.1 Why training loss alone is dangerous

At later epochs, the student often becomes better at imitating:

  • teacher phrasing,
  • teacher verbosity,
  • teacher style tokens,
  • teacher formatting patterns.

But that does not always improve:

  • factuality,
  • escalation behavior,
  • refusal correctness,
  • catalog grounding.

So the model can look better mathematically while getting worse operationally.


14. Dry Run 10 — Human-Corrected Data as a Safety Brake

Suppose the teacher-labeled dataset has:

  • 25,000 teacher responses,
  • 5,000 human-corrected rows,
  • 2,000 refusal/escalation rows.

If we up-weight:

  • corrected rows by 3x,
  • refusal rows by 2x,

then the effective training row count becomes:

[ 25,000 + (5,000 \times 3) + (2,000 \times 2) = 44,000 ]

Effective contribution by group:

Group                    Effective rows  Share
raw teacher responses    25,000          56.8%
human-corrected rows     15,000          34.1%
refusal/escalation rows  4,000           9.1%
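
The effective counts and shares are one line per group:

groups = {
    "raw teacher responses": (25_000, 1),
    "human-corrected rows": (5_000, 3),
    "refusal/escalation rows": (2_000, 2),
}

effective = {name: n * w for name, (n, w) in groups.items()}
total = sum(effective.values())                  # 44,000
for name, rows in effective.items():
    print(f"{name}: {rows:,} ({rows / total:.1%})")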

14.1 Intuition

Even though the corrected set is only 5,000 rows, weighting makes it a much stronger anchor.

That helps prevent the student from copying teacher mistakes too blindly.


15. Production Log Fields That Actually Matter

These are the logs that build intuition in a real pipeline.

15.1 Step log

{
  "event": "kd_train_step",
  "task": "intent_distillation",
  "epoch": 3,
  "global_step": 1840,
  "batch_size": 128,
  "hard_ce": 0.482,
  "kd_kl_t2": 0.109,
  "feature_mse": 0.071,
  "attention_mse": 0.008,
  "total_loss": 0.252,
  "grad_norm": 1.84,
  "lr": 0.000031,
  "throughput_samples_per_sec": 2210
}

15.2 How to read it

  • hard_ce = 0.482
    student is still making meaningful class mistakes.

  • kd_kl_t2 = 0.109
    student is reasonably aligned to teacher distribution.

  • feature_mse = 0.071
    internal representations still have room to improve.

  • grad_norm = 1.84
    gradients are healthy; not exploding.

If hard_ce stops improving but feature_mse is still improving, the student may be learning internal structure that will help later.
If feature_mse is flat and hard_ce is noisy, the student may have hit its capacity limit.

15.3 Epoch summary log

{
  "event": "kd_epoch_summary",
  "task": "response_distillation",
  "student": "manga-student-8b-v04",
  "teacher": "managed-teacher-v07",
  "epoch": 4,
  "train_loss": 0.91,
  "val_loss": 0.88,
  "teacher_preference_match": 0.86,
  "human_win_rate_vs_base": 0.67,
  "hallucination_rate": 0.039,
  "refusal_precision": 0.93,
  "avg_output_tokens": 142,
  "cost_per_1k_responses": 0.42,
  "checkpoint_promotable": true
}

15.4 Important derived calculations

If the baseline student cost per 1K responses was 0.95 and the distilled student is 0.42, then:

[ \text{cost reduction} = 1 - \frac{0.42}{0.95} = 55.8\% ]

If hallucination_rate = 0.039 on 2,000 reviewed samples, then:

[ 0.039 \times 2000 = 78 ]

That means 78 hallucination cases were found in review.

At this point, the model passes the <= 4% gate.
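
Both derived numbers are quick to verify:

baseline_cost, distilled_cost = 0.95, 0.42
print(f"cost reduction: {1 - distilled_cost / baseline_cost:.1%}")   # 55.8%

halluc_rate, reviewed = 0.039, 2000
print(f"hallucination cases found: {halluc_rate * reviewed:.0f}")    # 78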


16. Dry Run 11 — Cost Intuition at MangaAssist Scale

Assume MangaAssist runs 500,000 responses per day.

Using the earlier blended per-request numbers from the MangaAssist scenario:

  • teacher cost per query = $0.003
  • student cost per query = $0.0001

16.1 Daily cost

Teacher daily cost:

[ 500,000 \times 0.003 = 1,500 ]

Student daily cost:

[ 500,000 \times 0.0001 = 50 ]

Daily savings:

[ 1,500 - 50 = 1,450 ]

Monthly savings at 30 days:

[ 1,450 \times 30 = 43,500 ]

16.2 Intuition

At MangaAssist scale, cost savings are already meaningful.

But quality still matters.

If hallucination rate rises by 1.5 percentage points after distillation, then extra bad responses per day are:

[ 500,000 \times 0.015 = 7,500 ]

So the real question is not:

  • "Did we save money?"

It is:

  • "Did we save enough money to justify 7,500 extra bad responses per day?"

That is why cost and quality must be read together.


17. Dry Run 12 — Amazon-Like Scale Changes Everything

Now use an illustrative Amazon-like scale scenario.

Assume traffic is:

  • 50,000,000 responses per day

Use the same per-query cost numbers only as an engineering illustration.

17.1 Daily cost at Amazon-like scale

Teacher daily cost:

[ 50,000,000 \times 0.003 = 150,000 ]

Student daily cost:

[ 50,000,000 \times 0.0001 = 5,000 ]

Daily savings:

[ 150,000 - 5,000 = 145,000 ]

Monthly savings:

[ 145,000 \times 30 = 4,350,000 ]

17.2 Now look at the same 1.5% quality regression

Extra bad responses per day:

[ 50,000,000 \times 0.015 = 750,000 ]

17.3 Intuition

At very large scale:

  • tiny cost differences become millions of dollars,
  • tiny quality regressions become hundreds of thousands of bad experiences per day.

This is why large platforms almost never make a full replacement decision from a single offline metric.

They use:

  • stratified evals,
  • canaries,
  • traffic slicing,
  • fallback routing,
  • selective teacher escalation,
  • per-segment promotion gates.
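
The cost/quality tradeoff in Dry Runs 11 and 12 reduces to one function of daily traffic (per-query costs are the illustrative numbers from section 16):

def daily_tradeoff(traffic, teacher_cost=0.003, student_cost=0.0001,
                   regression=0.015):
    savings = traffic * (teacher_cost - student_cost)   # dollars per day
    extra_bad = traffic * regression                    # extra bad responses per day
    return savings, extra_bad

for traffic in (500_000, 50_000_000):
    savings, extra_bad = daily_tradeoff(traffic)
    print(f"{traffic:>10,}/day: save ${savings:,.0f}/day, "
          f"{extra_bad:,.0f} extra bad responses/day")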

18. Why Sampling Changes at Large Scale

At MangaAssist scale, maybe a 25,000 row training sample already covers a lot of the product surface.

At Amazon-like scale, 25,000 rows may be only a tiny snapshot.

18.1 Same sample size, different meaning

If traffic is 50,000,000/day, then a 25,000 example sample is:

[ \frac{25,000}{50,000,000} = 0.0005 = 0.05\% ]

So 25,000 rows are only 0.05% of one day of traffic.

18.2 Why this matters

Rare patterns become common in absolute count.

If a rare failure mode happens in only 0.1% of requests:

  • at 500,000/day, that is 500 failures/day
  • at 50,000,000/day, that is 50,000 failures/day

So at very large scale, the golden set must explicitly include:

  • long-tail categories,
  • rare policy cases,
  • locale variation,
  • device variation,
  • high-value customer flows,
  • safety-critical escalations.

19. Canary Math — Why Big Platforms Detect Regressions Faster

Suppose baseline accuracy is 90.0% and you want to detect a drop to 89.5%.

A rough two-sample test suggests you need about:

  • 57,700 samples per arm

Call it 58K queries per arm.

19.1 What that means at MangaAssist scale

If total traffic is 500,000/day, and your canary gets 10% of traffic:

  • canary traffic per day = 50,000

To collect 58,000 canary samples, you need a little over one day.

19.2 What that means at Amazon-like scale

If total traffic is 50,000,000/day, and your canary gets 1% of traffic:

  • canary traffic per day = 500,000

To collect 58,000 samples, you need under three hours.

19.3 Intuition

Large-scale systems can detect tiny regressions faster if they have good telemetry.

But they also suffer larger absolute damage if they promote a bad model too early.


20. How Fine-Tuning Typically Evolves by Epoch

A useful mental model:

Epoch 1

  • student learns broad teacher style,
  • loss drops quickly,
  • quality may still be unstable,
  • hallucination rate is usually still high.

Epoch 2–3

  • most of the useful gains arrive,
  • ambiguity handling improves,
  • preference match jumps,
  • long-tail failures begin to shrink.

Epoch 4

  • often the best tradeoff zone,
  • validation and business metrics align,
  • latency and cost are already fixed by model size, so quality becomes the final gate.

Epoch 5+

  • train loss still falls,
  • validation often flattens or worsens,
  • style imitation gets stronger,
  • factual drift and teacher quirks may increase.

20.1 The intuition you want

Distillation improvement is often front-loaded.

The first few epochs teach the student the big structure.

Later epochs often teach the student the teacher's imperfections.


21. Decision Table — When to Stop, Resize, or Redesign

Symptom                                         Likely meaning                                          Best action
train loss down, val loss down, hallu down      healthy learning                                        continue
train loss down, val flat, hallu flat           near convergence                                        prepare to stop
train loss down, val up, hallu up               overfitting / teacher style copying                     early stop
KD loss low but hard CE high                    student matches shape but not class enough              increase hard-label weight or extend training
hard CE low but KD high                         student predicts labels but misses teacher uncertainty  raise alpha or temperature
all losses high for many epochs                 student too small or data too noisy                     larger student / better data
low offline loss, poor online business metrics  missing eval dimensions                                 fix gate design

22. Two Practical Mermaid Diagrams

22.1 Where the numbers come from

flowchart LR
    A[Teacher logits] --> B[Softmax with T]
    C[Student logits] --> D[Softmax with T]
    B --> E[KL divergence]
    D --> E
    C --> F[Hard-label cross-entropy]
    E --> G[Weighted total loss]
    F --> G
    G --> H[Backprop into student]

22.2 How scale changes the decision

flowchart TD
    A[Same model regression: +1.5% hallucination] --> B[MangaAssist scale: 500K/day]
    A --> C[Amazon-like scale: 50M/day]
    B --> D[Extra bad responses: 7,500/day]
    C --> E[Extra bad responses: 750,000/day]
    D --> F[Maybe acceptable only with strong savings or fallback use]
    E --> G[Usually requires canarying, segmentation, routing, and rollback guardrails]

23. Final Intuition Summary

23.1 What distillation is really doing

It is not magic compression.

It is a trade:

  • keep as much of the teacher's behavior as possible,
  • in a smaller, cheaper, faster model,
  • without copying the teacher's mistakes too strongly.

23.2 The critical calculations to internalize

  1. Hard CE tells you how well the student matches the final label.
  2. KD / KL loss tells you how well the student matches the teacher's uncertainty structure.
  3. Feature loss tells you whether the student is learning similar internal representations.
  4. Hallucination rate tells you whether cheaper answers are safe enough.
  5. Cost reduction tells you whether the student is financially meaningful.
  6. Absolute failure count tells you what a small percentage means at real traffic scale.

23.3 The most important intuition at scale

At small scale, you can ask:

  • "Is this student good enough?"

At very large scale, you must ask:

  • "For which traffic segments is this student good enough?"
  • "Where should we still call the stronger teacher?"
  • "What tiny percentage regression becomes huge in absolute count?"

That is the real production mindset.


After understanding this numerical dry-run guide, the best next document is:

  • "MangaAssist Distillation Operations Playbook"

That document should cover:

  • dataset refresh policy,
  • log schema,
  • online canary rules,
  • teacher fallback routing,
  • segment-wise promotion,
  • rollback conditions,
  • hardware deployment paths.