
MangaAssist Knowledge Distillation at Scale

Failure Scenarios, Production Signals, and Fixes

This document is written as a production engineering guide for MangaAssist. The “Amazon” framing below should be read as Amazon-scale engineering intuition: very high traffic, strict latency budgets, large catalog churn, strong operational controls, and costly mistakes when small percentage errors become large absolute counts. It is not a claim about any private internal Amazon system.


1. Why Failure Analysis Matters More at Scale

At small scale, a distillation issue may look like a modest metric drop. At large scale, the same issue becomes:

  • thousands of wrong responses per hour,
  • higher cloud cost from fallback traffic,
  • more human escalations,
  • more bad policy decisions copied into the student,
  • slower incident response because logs are too noisy.

Small-scale vs Amazon-scale intuition

Assume MangaAssist handles manga discovery, order help, catalog Q&A, and support escalation.

| Scenario | Daily requests | Hallucination rate | Bad responses/day |
|---|---|---|---|
| Small product | 50,000 | 1.5% | 750 |
| Growing product | 2,000,000 | 1.5% | 30,000 |
| Amazon-scale style scenario | 50,000,000 | 1.5% | 750,000 |

A 0.5% improvement in hallucination rate seems small. But at 50M requests/day:

  • before: 50,000,000 × 0.015 = 750,000
  • after: 50,000,000 × 0.010 = 500,000
  • reduction: 250,000 fewer bad responses/day

That is why production distillation is not just about model compression. It is about error amplification control.


2. End-to-End Distillation Pipeline and Where It Breaks

flowchart LR
    A[Production prompts and logs] --> B[Sampling and filtering]
    B --> C[Teacher labeling]
    C --> D[Human review and policy correction]
    D --> E[Training set build]
    E --> F[Student fine-tuning]
    F --> G[Offline eval]
    G --> H[Shadow deployment]
    H --> I[Canary rollout]
    I --> J[Full production]

    C --> C1[Teacher failure risk]
    D --> D1[Reviewer inconsistency risk]
    E --> E1[Data skew risk]
    F --> F1[Overfitting and unstable loss]
    G --> G1[Offline-online mismatch]
    H --> H1[Latency and fallback explosion]
    I --> I1[Regional hot spots]
    J --> J1[Drift and catalog churn]

Most teams focus only on training loss. At scale, failures happen in every stage.


3. Failure Scenario 1: Teacher Hallucinations Get Copied into the Student

What happened

The teacher answers a query using confident but wrong catalog facts. Example:

  • User asks: “Is Nana available in hardcover?”
  • Catalog says: paperback only
  • Teacher says: “Yes, hardcover edition is in stock”

If this response enters the distillation set, the student learns the wrong behavior.

Why this gets worse at Amazon scale

At small scale, a few wrong labels may be tolerable. At very large scale, even a 0.2% corrupted teacher-label rate can contaminate a huge dataset.

Assume:

  • 25,000,000 unlabeled production examples sampled/month
  • 0.2% teacher hallucination contamination

Then:

25,000,000 × 0.002 = 50,000 wrong teacher targets

That is enough to move the student toward systematic factual mistakes.

Production signal

{
  "event": "distill_eval_slice",
  "slice": "catalog_factuality",
  "teacher_fact_error_rate": 0.006,
  "student_fact_error_rate": 0.011,
  "teacher_confident_error_rate": 0.004,
  "copied_teacher_error_overlap": 0.73
}

How we detect it

We compare:

  • teacher answer vs retrieval evidence,
  • teacher answer vs structured catalog record,
  • student answer vs teacher answer overlap on incorrect facts,
  • confidence on wrong answers.

The dangerous metric is not only fact error rate. It is copied teacher error overlap.

If 73% of student fact errors match teacher errors, the student is not inventing new mistakes. It is copying the teacher.

Fix

  1. Make teacher generation retrieval-grounded.
  2. Reject teacher outputs unsupported by retrieved evidence.
  3. Add human-corrected examples for high-risk domains.
  4. Down-weight examples where teacher confidence is high but retrieval support is low.
  5. Add a factuality verifier before dataset write (see the sketch below).
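
A minimal sketch of steps 2, 4, and 5, assuming hypothetical record fields (teacher_answer, retrieved_evidence, teacher_confidence) and a crude token-overlap support score; a production verifier would use an NLI-style or claim-checking model rather than word overlap.

```python
def support_score(answer: str, evidence_docs: list[str]) -> float:
    """Crude token-overlap support score in [0, 1]; stand-in for a real verifier."""
    answer_tokens = set(answer.lower().split())
    if not answer_tokens:
        return 0.0
    evidence_tokens = set()
    for doc in evidence_docs:
        evidence_tokens.update(doc.lower().split())
    return len(answer_tokens & evidence_tokens) / len(answer_tokens)

def filter_teacher_labels(records, min_support=0.6):
    """Yield (record, weight) pairs to the training-set build stage."""
    for rec in records:
        score = support_score(rec["teacher_answer"], rec["retrieved_evidence"])
        if score < min_support and rec["teacher_confidence"] > 0.9:
            continue  # confident but unsupported: reject before dataset write
        weight = 1.0 if score >= min_support else 0.5  # down-weight weak support
        yield rec, weight
```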

Numerical fix effect

| Stage | Teacher label contamination | Student factual error |
|---|---|---|
| Before filtering | 0.20% | 1.10% |
| After retrieval-support filter | 0.06% | 0.72% |
| After human review on risky slices | 0.03% | 0.54% |

At 50M requests/day, reducing factual error from 1.10% to 0.54% saves:

  • before: 50,000,000 × 0.011 = 550,000 bad factual responses/day
  • after: 50,000,000 × 0.0054 = 270,000
  • improvement: 280,000 fewer/day

4. Failure Scenario 2: Distillation Improves Offline Scores but Hurts Live Traffic

What happened

The student matches the teacher on offline test sets but performs worse in production because the offline set is cleaner than real traffic.

Offline data often has:

  • shorter prompts,
  • fewer typos,
  • fewer multi-intent requests,
  • less adversarial wording,
  • less seasonal catalog drift.

Example

| Metric | Offline | Shadow traffic |
|---|---|---|
| Teacher preference match | 88% | 79% |
| Hallucination rate | 2.8% | 5.7% |
| Escalation recall | 93% | 81% |

The model looked promotable offline. It was not ready online.

Why this explodes at scale

Suppose 8% of traffic is multi-intent and the student is weak on that slice. At 50M/day:

50,000,000 × 0.08 = 4,000,000 multi-intent requests/day

If escalation miss rate rises from 4% to 10% on that slice:

  • before: 4,000,000 × 0.04 = 160,000 misses/day
  • after: 4,000,000 × 0.10 = 400,000 misses/day
  • extra misses: 240,000/day

Production signal

{
  "event": "shadow_compare",
  "slice": "multi_intent_messages",
  "teacher_preference_match": 0.78,
  "student_win_rate": 0.44,
  "policy_escalation_recall": 0.81,
  "baseline_escalation_recall": 0.92
}

Fix

  1. Build eval slices from real production logs, not only curated datasets.
  2. Include noisy inputs: typos, partial messages, language mixing, repeated context.
  3. Gate on slice-level metrics, not only global averages (see the sketch after this list).
  4. Run shadow deployment before canary.
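
A minimal sketch of the slice-level gate in step 3, with illustrative slice names and thresholds; a real gate would read these values from the shadow_compare events shown above.

```python
# Promote only if every slice clears its own floor, not just the global average.
# Slice names and thresholds below are illustrative assumptions.
SLICE_GATES = {
    "multi_intent_messages": {"teacher_preference_match": 0.75, "escalation_recall": 0.90},
    "non_english":           {"teacher_preference_match": 0.70, "escalation_recall": 0.90},
    "catalog_factuality":    {"grounded_answer_rate": 0.85},
}

def promotion_gate(slice_metrics: dict) -> tuple[bool, list[str]]:
    """slice_metrics: {slice_name: {metric_name: value}} measured on shadow traffic."""
    failures = []
    for slice_name, floors in SLICE_GATES.items():
        observed = slice_metrics.get(slice_name, {})
        for metric, floor in floors.items():
            if observed.get(metric, 0.0) < floor:
                failures.append(f"{slice_name}.{metric}={observed.get(metric)} < {floor}")
    return (len(failures) == 0), failures
```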

Amazon-scale lesson

At scale, averages lie. If a student is excellent on 90% easy traffic and poor on 10% costly traffic, the business impact can still be unacceptable.


5. Failure Scenario 3: Student Is Too Small to Absorb Teacher Behavior

What happened

The team compresses too aggressively. Example:

  • teacher: strong reranker or LLM
  • student: tiny model chosen only for cost
  • result: low training loss improvement, weak calibration, poor rare-case behavior

Numerical intuition

Suppose the teacher achieves an NDCG@10 of roughly 0.94 on reranking. Three student sizes are tested.

| Student (params) | NDCG@10 | p95 latency | Cost/1K queries |
|---|---|---|---|
| 120M | 0.92 | 32 ms | $0.18 |
| 40M | 0.88 | 18 ms | $0.09 |
| 10M | 0.76 | 9 ms | $0.05 |

At first glance, the 10M model looks attractive. But assume each 0.01 NDCG drop causes 0.3% fewer correct top-item placements.

Comparing 40M vs 10M:

  • NDCG drop = 0.88 - 0.76 = 0.12, i.e. twelve 0.01-sized steps
  • conversion sensitivity = 12 × 0.3% = 3.6% relative loss

If this affects 8M purchase-influencing sessions/day, that loss is huge.

Production signal

{
  "event": "student_capacity_analysis",
  "student": "tiny-reranker-10m",
  "train_kd_loss_epoch1": 1.92,
  "train_kd_loss_epoch4": 1.61,
  "eval_ndcg10": 0.76,
  "calibration_ece": 0.19,
  "rare_query_recall": 0.41
}

The giveaway is:

  • KD loss improves only a little,
  • rare recall stays weak,
  • calibration is poor,
  • the model underfits even before overfitting becomes a concern.

Fix

  1. Move to a larger student.
  2. Use intermediate teacher or two-hop distillation.
  3. Add feature matching, not only output matching (see the sketch after this list).
  4. Use LoRA or adapter-based student tuning if full fine-tune is unstable.
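
A minimal PyTorch-style sketch of combining output matching with feature matching (step 3); the layer dimensions, temperature, and the 0.7/0.3 weights are illustrative assumptions, not tuned values.

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureMatchingKD(nn.Module):
    """Output-matching plus feature-matching distillation loss (sketch)."""

    def __init__(self, student_dim: int = 512, teacher_dim: int = 1024):
        super().__init__()
        # Project the smaller student hidden state into the teacher's hidden space.
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_logits, teacher_logits,
                student_hidden, teacher_hidden,
                T: float = 2.0, alpha: float = 0.7, beta: float = 0.3):
        # Output matching: KL divergence on temperature-softened logits.
        kd = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        # Feature matching: MSE between projected student and teacher hidden states.
        feat = F.mse_loss(self.proj(student_hidden), teacher_hidden)
        return alpha * kd + beta * feat
```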

Rule of thumb

If the student cannot recover teacher behavior on easy slices after 1 to 2 epochs, compression may be too aggressive. This is a capacity problem, not a patience problem.


6. Failure Scenario 4: Rare but High-Cost Classes Collapse

What happened

The overall accuracy looks good, but rare classes suffer. For MangaAssist, examples include:

  • refund fraud escalation,
  • self-harm wording,
  • legal complaint escalation,
  • payment dispute,
  • account lockout.

These are low-frequency but high-cost.

Numerical example

Assume training distribution:

| Class | Traffic share |
|---|---|
| FAQ | 40% |
| recommendations | 25% |
| order tracking | 20% |
| returns | 10% |
| payment dispute | 3% |
| safety escalation | 2% |

A distilled classifier improves global accuracy from 90.8% to 92.0%, but safety escalation recall drops from 95% to 82%.

At 50M/day:

  • safety traffic = 50,000,000 × 0.02 = 1,000,000/day
  • missed escalations before = 1,000,000 × 0.05 = 50,000/day
  • missed escalations after = 1,000,000 × 0.18 = 180,000/day
  • extra misses = 130,000/day

That global accuracy gain is not worth it.

Production signal

{
  "event": "class_slice_eval",
  "class": "safety_escalation",
  "support": 842,
  "teacher_recall": 0.96,
  "student_recall": 0.82,
  "student_precision": 0.88,
  "hard_label_weight": 0.15,
  "soft_label_weight": 0.85
}

Root cause

Soft labels can blur rare classes into nearby common classes. The student learns ambiguity, but the business needs a hard safe decision.

Fix

  1. Blend soft labels with hard labels.
  2. Upweight rare high-risk classes.
  3. Oversample reviewed rare-case examples.
  4. Use class-specific promotion gates.
  5. Add rules above the student for critical classes.

Numerical adjustment

Original total loss:

L = 0.15 * CE + 0.85 * KD

Safer revised loss for high-risk classifier:

L = 0.45 * CE + 0.55 * KD + 0.30 * rare_class_penalty
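
A minimal PyTorch-style sketch of this revised objective. The rare_class_ids tensor and the temperature are assumptions, and the rare-class penalty is implemented here as extra hard-label cross-entropy on rare-class examples only.

```python
import torch
import torch.nn.functional as F

def blended_kd_loss(student_logits, teacher_logits, hard_labels,
                    rare_class_ids, T: float = 2.0,
                    w_ce: float = 0.45, w_kd: float = 0.55, w_rare: float = 0.30):
    """L = w_ce * CE + w_kd * KD + w_rare * rare_class_penalty (sketch).
    rare_class_ids: 1-D tensor of high-risk class indices."""
    ce = F.cross_entropy(student_logits, hard_labels)

    # Standard soft-label KD term: KL between temperature-softened distributions.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    # Extra hard-label pressure on rare, high-risk classes only.
    rare_mask = torch.isin(hard_labels, rare_class_ids)
    if rare_mask.any():
        rare_penalty = F.cross_entropy(student_logits[rare_mask], hard_labels[rare_mask])
    else:
        rare_penalty = torch.zeros((), device=student_logits.device)

    return w_ce * ce + w_kd * kd + w_rare * rare_penalty
```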

Result:

| Metric | Before | After fix |
|---|---|---|
| Global accuracy | 92.0% | 91.6% |
| Safety recall | 82% | 94% |
| Payment dispute recall | 79% | 90% |

Slight global loss, much better business safety.


7. Failure Scenario 5: Fallback Traffic Explosion Removes the Cost Savings

What happened

The student was meant to reduce cost, but poor confidence calibration caused too many fallback calls to the teacher.

Example routing rule:

  • use student if confidence >= 0.80
  • else call managed teacher

If the student is underconfident, fallback rate jumps.

Numerical example

Assume 10M queries/day.

| Metric | Planned | Actual |
|---|---|---|
| Student handled | 90% | 62% |
| Teacher fallback | 10% | 38% |
| Teacher cost/1K | $3.00 | $3.00 |
| Student cost/1K | $0.20 | $0.20 |

Planned daily cost:

  • student: 9,000,000 / 1000 × 0.20 = $1,800
  • teacher: 1,000,000 / 1000 × 3.00 = $3,000
  • total = $4,800/day

Actual daily cost:

  • student: 6,200,000 / 1000 × 0.20 = $1,240
  • teacher: 3,800,000 / 1000 × 3.00 = $11,400
  • total = $12,640/day

The “cheaper” architecture became 2.63× more expensive than planned.
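
The sensitivity to fallback rate is easy to check with a small cost model; the per-1K prices are the example values from the table above, not real pricing.

```python
def daily_cost(total_queries: int, fallback_rate: float,
               student_per_1k: float = 0.20, teacher_per_1k: float = 3.00) -> float:
    """Daily serving cost in USD as a function of teacher fallback rate."""
    student_q = total_queries * (1 - fallback_rate)
    teacher_q = total_queries * fallback_rate
    return student_q / 1000 * student_per_1k + teacher_q / 1000 * teacher_per_1k

print(daily_cost(10_000_000, 0.10))  # planned: 4800.0
print(daily_cost(10_000_000, 0.38))  # actual: 12640.0
```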

Production signal

{
  "event": "router_health",
  "student_model": "manga-student-8b-v04",
  "student_accept_rate": 0.62,
  "teacher_fallback_rate": 0.38,
  "student_ece": 0.17,
  "teacher_daily_cost_usd": 11400,
  "planned_teacher_daily_cost_usd": 3000
}

Fix

  1. Recalibrate student confidence on shadow traffic.
  2. Use margin-based routing, not only top-score routing.
  3. Add slice-based thresholds.
  4. Allow direct safe refusal instead of fallback on low-value intents.
  5. Distill the teacher better on ambiguous traffic.

Better routing formula

Instead of:

route_to_teacher if max_prob < 0.80

Use:

route_to_teacher if (max_prob < 0.72) OR (top1 - top2 < 0.08) OR risky_slice = true

This reduces unnecessary fallback from uncertainty caused by close classes.
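
A minimal sketch of that routing rule; the thresholds are the example values above, not tuned production numbers.

```python
def route_to_teacher(probs: list[float], risky_slice: bool,
                     min_conf: float = 0.72, min_margin: float = 0.08) -> bool:
    """Margin-based routing: fall back on low confidence, close top-2 classes, or risky slices."""
    top1, top2 = sorted(probs, reverse=True)[:2]
    return (top1 < min_conf) or (top1 - top2 < min_margin) or risky_slice

# Example: top score 0.70 is below 0.72, so this request still goes to the teacher.
route_to_teacher([0.70, 0.25, 0.05], risky_slice=False)  # -> True
```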


8. Failure Scenario 6: Training Looks Stable, but Calibration Breaks

What happened

Loss goes down and accuracy improves, but the student becomes poorly calibrated. It is confidently wrong.

For customer support workflows, this is dangerous because the router trusts the student.

Numerical intuition

Two students both achieve 90% accuracy.

| Model | Accuracy | ECE | Confident wrong answers |
|---|---|---|---|
| Student A | 90% | 0.04 | 2% |
| Student B | 90% | 0.18 | 8% |

Student B is much worse operationally.

At 20M/day:

  • Student A confident-wrong responses: 20,000,000 × 0.02 = 400,000/day
  • Student B confident-wrong responses: 20,000,000 × 0.08 = 1,600,000/day

Production signal

{
  "event": "calibration_eval",
  "split": "shadow_traffic",
  "accuracy": 0.90,
  "ece": 0.18,
  "brier_score": 0.22,
  "confident_error_rate": 0.08
}

Fix

  1. Track ECE, Brier score, and confident-error rate, not just accuracy.
  2. Use temperature scaling after training (see the sketch after this list).
  3. Retrain with more hard labels on risky slices.
  4. Calibrate thresholds per intent family.
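
A minimal sketch of post-hoc temperature scaling (step 2) fitted on held-out shadow-traffic logits; the grid-search range is an assumption, and production code would usually optimize the temperature directly.

```python
import numpy as np

def fit_temperature(logits: np.ndarray, labels: np.ndarray,
                    grid=np.linspace(0.5, 5.0, 91)) -> float:
    """Pick the temperature that minimizes NLL on a held-out set.
    logits: shape (N, C); labels: shape (N,)."""
    best_T, best_nll = 1.0, np.inf
    for T in grid:
        scaled = logits / T
        scaled = scaled - scaled.max(axis=1, keepdims=True)  # numerical stability
        log_probs = scaled - np.log(np.exp(scaled).sum(axis=1, keepdims=True))
        nll = -log_probs[np.arange(len(labels)), labels].mean()
        if nll < best_nll:
            best_T, best_nll = T, nll
    return best_T

# At serving time: calibrated_probs = softmax(student_logits / best_T)
```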

Important intuition

A student with slightly lower accuracy but much better calibration can be the better production model. At scale, routing quality matters almost as much as raw model quality.


9. Failure Scenario 7: Distillation Dataset Becomes Stale

What happened

Manga titles, editions, promotions, and inventory evolve quickly. A student distilled from old production data learns yesterday’s traffic.

Numerical example

Assume:

  • 30% of catalog-related prompts reference titles or editions introduced in the last 60 days.
  • training data is 4 months old.

On fresh-title slice:

| Model | Retrieval hit rate | Grounded answer rate |
|---|---|---|
| Teacher | 93% | 90% |
| Student distilled on old data | 78% | 72% |

At 12M catalog Q&A requests/day, if 30% are fresh-title related:

  • slice traffic = 12,000,000 × 0.30 = 3,600,000/day
  • grounded-answer gap = 0.90 - 0.72 = 0.18
  • extra bad responses = 3,600,000 × 0.18 = 648,000/day

Production signal

{
  "event": "freshness_slice_eval",
  "slice": "new_titles_last_60_days",
  "teacher_grounded_rate": 0.90,
  "student_grounded_rate": 0.72,
  "retrieval_miss_rate": 0.22,
  "dataset_age_days_p50": 97
}

Fix

  1. Build rolling distillation datasets from recent logs.
  2. Weight recent traffic more heavily (see the sketch after this list).
  3. Keep retrieval system fresh and decoupled from student knowledge.
  4. Retrain or refresh on fixed cadence.
  5. Use shadow eval on the newest catalog slice before rollout.
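
A minimal sketch of recency weighting (steps 1 and 2), assuming each example carries an age in days; the 45-day half-life is an assumption to tune against catalog churn.

```python
import numpy as np

def recency_weights(age_days: np.ndarray, half_life_days: float = 45.0) -> np.ndarray:
    """Exponential decay: an example half_life_days old gets half the weight of a fresh one."""
    return 0.5 ** (age_days / half_life_days)

def sample_training_set(example_ids: np.ndarray, age_days: np.ndarray, k: int,
                        rng=np.random.default_rng(0)) -> np.ndarray:
    """Sample k examples, biased toward recent traffic."""
    w = recency_weights(age_days)
    return rng.choice(example_ids, size=k, replace=False, p=w / w.sum())
```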

Amazon-scale takeaway

At large scale, staleness is a first-class bug. Even a strong student drifts quickly if the surrounding business world changes quickly.


10. Failure Scenario 8: Logging Costs and Observability Collapse

What happened

The team logs everything:

  • full teacher response,
  • student response,
  • logits,
  • retrieval context,
  • raw prompt,
  • reranker scores,
  • reviewer notes.

At first this feels useful. At scale it becomes expensive and hard to query.

Numerical example

Assume average inference log payload = 12 KB. At 50M requests/day:

  • total log volume/day = 50,000,000 × 12 KB = 600,000,000 KB
  • approximately 572 GB/day
  • monthly ≈ 17.2 TB

If storage plus indexing costs $120/TB effective:

  • monthly logging cost ≈ 17.2 × 120 = $2,064/month

That number grows fast if you also retain raw contexts and full token-level scores. In real systems, observability cost can be much higher due to replication and query indexing.

Operational failure

Even worse than cost:

  • incidents are harder to debug,
  • dashboards time out,
  • sensitive data risk increases,
  • engineers stop trusting dashboards.

Fix

  1. Log full payloads only for sampled traffic.
  2. Log compact aggregate metrics on all traffic.
  3. Separate hot logs from cold audit storage.
  4. Hash or redact customer text.
  5. Keep per-token data only for debugging cohorts.

| Traffic slice | Logging mode |
|---|---|
| 99.5% normal traffic | compact metrics only |
| 0.4% sampled shadow traffic | full prompt/response + retrieval ids |
| 0.1% incident/risky slice | full forensic logs |
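
A minimal sketch of a logging-mode router implementing the table above; the rates and mode names are the example values, and hash-bucket sampling is one simple way to keep the decision deterministic per request.

```python
import hashlib

def logging_mode(request_id: str, risky_slice: bool) -> str:
    """Decide how much to log for one request (rates are the example values)."""
    if risky_slice:
        return "full_forensic"                       # ~0.1% incident / risky traffic
    # Deterministic hash sampling so a request keeps its bucket on retries.
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 1000
    if bucket < 4:
        return "full_prompt_response_and_retrieval"  # ~0.4% shadow sample
    return "compact_metrics"                         # everything else
```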

11. Failure Scenario 9: Distributed Distillation Training Becomes the Bottleneck

What happened

At Amazon-scale style training volumes, the issue is no longer only model math. It becomes system bottlenecks:

  • teacher inference too slow,
  • GPU underutilized due to data pipeline,
  • all ranks duplicating teacher work,
  • FSDP/DP setup inefficient,
  • checkpoint saves too large and too frequent.

Numerical example

Suppose we distill 100M examples. Per-example teacher forward time = 3 ms. Naively doing teacher inference inline for every student batch:

100,000,000 × 3 ms = 300,000,000 ms = 300,000 s ≈ 83.3 hours

That is only teacher forward time, before student backward.

If preprocessing teacher targets offline reduces teacher work by 85% during training:

  • remaining teacher cost ≈ 83.3 × 0.15 = 12.5 hours

Production signal

{
  "event": "training_system_profile",
  "job": "distill_llm_v12",
  "gpu_utilization": 0.41,
  "input_pipeline_wait_fraction": 0.33,
  "teacher_forward_fraction": 0.38,
  "checkpoint_time_fraction": 0.09,
  "tokens_per_second": 7400
}

GPU utilization at 41% means the math is not the main problem. The system is starving the GPUs.

Fix

  1. Precompute teacher outputs where possible.
  2. Cache soft labels or response targets.
  3. Separate labeling job from student training job.
  4. Use distributed loaders and pinned memory.
  5. Save checkpoints at useful intervals, not too often.

Intuition

At scale, the fastest KD training job is often the one that moves teacher computation out of the training loop.
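
As a sketch of that idea, teacher targets can be written once to a cache that the student training job streams later. batch_teacher_forward and the record layout are assumptions, not a real API.

```python
import json

def precompute_teacher_targets(prompts, batch_teacher_forward, out_path,
                               batch_size=64, top_k=20):
    """Run the teacher once per example and store compact top-k soft labels.
    batch_teacher_forward(batch, top_k) is assumed to return, per prompt,
    (prompt_id, top_k_token_ids, top_k_probs)."""
    with open(out_path, "w") as f:
        for i in range(0, len(prompts), batch_size):
            batch = prompts[i:i + batch_size]
            for prompt_id, token_ids, probs in batch_teacher_forward(batch, top_k=top_k):
                f.write(json.dumps({
                    "prompt_id": prompt_id,
                    "top_k_token_ids": token_ids,  # compact soft labels, not full logits
                    "top_k_probs": probs,
                }) + "\n")

# The student training loop then streams this file instead of calling the teacher inline.
```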


12. Failure Scenario 10: Human Review Layer Does Not Scale Cleanly

What happened

Human review is added to fix teacher mistakes. But reviewers disagree, instructions drift, and throughput is too low for the most important slices.

Numerical example

Assume:

  • sampled risky examples needing review = 80,000/week
  • reviewer capacity = 18,000/week

Backlog growth:

80,000 - 18,000 = 62,000/week

After 4 weeks:

62,000 × 4 = 248,000 backlog

Then the student is trained on stale or unreviewed data.

Production signal

{
  "event": "review_pipeline_health",
  "weekly_incoming": 80000,
  "weekly_reviewed": 18000,
  "backlog": 248000,
  "inter_annotator_agreement": 0.71,
  "median_review_delay_days": 19
}

Fix

  1. Only route the highest-value slices to humans.
  2. Use reviewer adjudication for disagreement slices.
  3. Convert repeated human corrections into policy rules.
  4. Track agreement and reviewer drift.
  5. Sample intelligently, not uniformly (see the sketch below).
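
A minimal sketch of value-based review sampling (step 5); the scoring fields are assumptions about what each candidate example carries.

```python
def select_for_review(candidates, weekly_capacity: int = 18_000):
    """Spend limited reviewer capacity on the highest expected-value examples."""
    def value(example):
        return (example["risk_weight"]          # e.g. safety / payment slices score high
                * example["model_uncertainty"]  # teacher-student disagreement or low margin
                * example["traffic_weight"])    # how common this pattern is in production
    ranked = sorted(candidates, key=value, reverse=True)
    return ranked[:weekly_capacity]
```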

Amazon-scale lesson

Humans are precious. Use them where they create the most model-quality gain per reviewed example.


13. Failure Scenario 11: Canary Looks Fine Globally but Fails by Region, Device, or Language

What happened

Global canary passes, but certain slices break:

  • mobile app traffic,
  • non-English messages,
  • high-latency regions,
  • older devices with tighter CPU budgets.

Numerical example

Global p95 latency improves from 190 ms to 150 ms. Looks good. But mobile low-memory slice worsens:

| Slice | Before | After |
|---|---|---|
| global p95 | 190 ms | 150 ms |
| low-memory mobile p95 | 240 ms | 330 ms |
| non-English factuality | 88% | 76% |

If low-memory mobile traffic is 6M/day and abandonment rises by 2.5%:

6,000,000 × 0.025 = 150,000 extra bad user outcomes/day

Fix

  1. Gate rollout by slice, not only globally.
  2. Deploy separate student variants by hardware target if needed.
  3. Keep multilingual eval slices.
  4. Keep regional canary metrics.

14. Failure Scenario 12: No Clear Stop Rule, So Fine-Tuning Keeps Going Too Long

What happened

Training continues because loss is still going down. But production-relevant metrics stop improving.

Example epoch table

| Epoch | Train KD loss | Eval hard loss | Teacher match | Hallucination | ECE |
|---|---|---|---|---|---|
| 1 | 1.82 | 0.74 | 0.79 | 0.050 | 0.11 |
| 2 | 1.45 | 0.61 | 0.84 | 0.039 | 0.08 |
| 3 | 1.28 | 0.58 | 0.86 | 0.034 | 0.06 |
| 4 | 1.14 | 0.57 | 0.865 | 0.033 | 0.06 |
| 5 | 1.03 | 0.60 | 0.861 | 0.037 | 0.08 |
| 6 | 0.96 | 0.66 | 0.852 | 0.043 | 0.10 |

A common mistake is to keep going because train loss improves. But epoch 4 was the best production point.

Fix

Stop on a promotion score, not just train loss.

Example score:

promotion_score = 0.35 * teacher_match + 0.25 * human_win + 0.20 * safety_recall - 0.10 * hallucination - 0.10 * ece

Pick the checkpoint with the best promotion score on shadow traffic.
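
A minimal sketch of checkpoint selection by that score; the metric names match the formula above, and the shape of the checkpoint dictionary is an assumption.

```python
def promotion_score(m: dict) -> float:
    """Promotion score from shadow-traffic metrics, using the weights above."""
    return (0.35 * m["teacher_match"] + 0.25 * m["human_win"]
            + 0.20 * m["safety_recall"] - 0.10 * m["hallucination"] - 0.10 * m["ece"])

def best_checkpoint(checkpoint_metrics: dict) -> str:
    """checkpoint_metrics: {checkpoint_name: shadow-traffic metric dict}."""
    return max(checkpoint_metrics, key=lambda name: promotion_score(checkpoint_metrics[name]))
```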


15. How We Would Fix This for MangaAssist in an Amazon-Scale Style Environment

Operating model

  1. Teacher labeling is separate from student training:
     • run batch teacher labeling jobs,
     • attach retrieval evidence ids,
     • store compact soft labels and policy tags.

  2. Human review is targeted:
     • factual catalog mismatch,
     • risky escalations,
     • low-margin ambiguous cases,
     • fresh catalog slices.

  3. Promotion gates are slice-based, not just one global score:
     • catalog factuality,
     • rare escalation recall,
     • multilingual slice,
     • long-tail title freshness,
     • device latency.

  4. Routing is calibration-aware:
     • student handles easy traffic,
     • teacher handles ambiguous or high-risk traffic,
     • thresholds tuned per slice.

  5. Refresh is continuous:
     • rolling recent logs,
     • hard example mining,
     • drift alerts,
     • monthly or biweekly refresh depending on catalog volatility.


16. What to Monitor in Production

flowchart TD
    A[Traffic health] --> A1[QPS]
    A --> A2[p50 p95 latency]
    A --> A3[teacher fallback rate]

    B[Quality health] --> B1[teacher preference match]
    B --> B2[human win rate]
    B --> B3[hallucination rate]
    B --> B4[factuality by slice]

    C[Safety health] --> C1[escalation recall]
    C --> C2[confident error rate]
    C --> C3[policy violation rate]

    D[Data health] --> D1[dataset age]
    D --> D2[review backlog]
    D --> D3[slice coverage]

    E[Training health] --> E1[KD loss]
    E --> E2[hard loss]
    E --> E3[ECE]
    E --> E4[rare-class recall]

Minimum metrics I would insist on

| Category | Must-have metric | Why it matters |
|---|---|---|
| Quality | hallucination rate | raw trustworthiness |
| Quality | teacher preference match | distillation success |
| Safety | escalation recall | costly failure prevention |
| Routing | fallback rate | cost control |
| Calibration | ECE | confidence quality |
| Freshness | dataset age p50/p95 | drift control |
| Training | best checkpoint by shadow score | stop rule |

17. Final Intuition

At small scale, distillation is often described as:

  • make the model smaller,
  • keep most of the quality,
  • reduce latency and cost.

At Amazon-like scale, the real story is bigger:

  • teacher mistakes can contaminate millions of examples,
  • tiny calibration errors can multiply teacher fallback cost,
  • rare-class failures can dominate business risk,
  • stale data can create hundreds of thousands of bad answers per day,
  • logging and review pipelines become system bottlenecks,
  • a 0.5% regression is not “small” anymore.

So the mature version of distillation is:

compress the model without compressing trust, safety, and operational control.

That is the mindset needed for MangaAssist if it grows from a manageable chatbot into an Amazon-scale style production system.