
MangaAssist Knowledge Distillation at Scale

Failure Scenarios, Production Signals, and Fixes

This document is written as a production engineering guide for MangaAssist. The “Amazon” framing below should be read as Amazon-scale engineering intuition: very high traffic, strict latency budgets, large catalog churn, strong operational controls, and costly mistakes when small percentage errors become large absolute counts. It is not a claim about any private internal Amazon system.


1. Why Failure Analysis Matters More at Scale

At small scale, a distillation issue may look like a modest metric drop. At large scale, the same issue becomes:

  • thousands of wrong responses per hour,
  • higher cloud cost from fallback traffic,
  • more human escalations,
  • more bad policy decisions copied into the student,
  • slower incident response because logs are too noisy.

Small-scale vs Amazon-scale intuition

Assume MangaAssist handles manga discovery, order help, catalog Q&A, and support escalation.

| Scenario | Daily requests | Hallucination rate | Bad responses/day |
|---|---|---|---|
| Small product | 50,000 | 1.5% | 750 |
| Growing product | 2,000,000 | 1.5% | 30,000 |
| Amazon-scale style scenario | 50,000,000 | 1.5% | 750,000 |

A 0.5% improvement in hallucination rate seems small. But at 50M requests/day:

  • before: 50,000,000 × 0.015 = 750,000
  • after: 50,000,000 × 0.010 = 500,000
  • reduction: 250,000 fewer bad responses/day

That is why production distillation is not just about model compression. It is about error amplification control.


2. End-to-End Distillation Pipeline and Where It Breaks

flowchart LR
    A[Production prompts and logs] --> B[Sampling and filtering]
    B --> C[Teacher labeling]
    C --> D[Human review and policy correction]
    D --> E[Training set build]
    E --> F[Student fine-tuning]
    F --> G[Offline eval]
    G --> H[Shadow deployment]
    H --> I[Canary rollout]
    I --> J[Full production]

    C --> C1[Teacher failure risk]
    D --> D1[Reviewer inconsistency risk]
    E --> E1[Data skew risk]
    F --> F1[Overfitting and unstable loss]
    G --> G1[Offline-online mismatch]
    H --> H1[Latency and fallback explosion]
    I --> I1[Regional hot spots]
    J --> J1[Drift and catalog churn]

Most teams focus only on training loss. At scale, failures happen in every stage.


3. Failure Scenario 1: Teacher Hallucinations Get Copied into the Student

What happened

The teacher answers a query using confident but wrong catalog facts. Example:

  • User asks: “Is Nana available in hardcover?”
  • Catalog says: paperback only
  • Teacher says: “Yes, hardcover edition is in stock”

If this response enters the distillation set, the student learns the wrong behavior.

Why this gets worse at Amazon scale

At small scale, a few wrong labels may be tolerable. At very large scale, even a 0.2% corrupted teacher-label rate can contaminate a huge dataset.

Assume:

  • 25,000,000 unlabeled production examples sampled/month
  • 0.2% teacher hallucination contamination

Then:

25,000,000 × 0.002 = 50,000 wrong teacher targets

That is enough to move the student toward systematic factual mistakes.

Production signal

{
  "event": "distill_eval_slice",
  "slice": "catalog_factuality",
  "teacher_fact_error_rate": 0.006,
  "student_fact_error_rate": 0.011,
  "teacher_confident_error_rate": 0.004,
  "copied_teacher_error_overlap": 0.73
}

How we detect it

We compare:

  • teacher answer vs retrieval evidence,
  • teacher answer vs structured catalog record,
  • student answer vs teacher answer overlap on incorrect facts,
  • confidence on wrong answers.

The dangerous metric is not only fact error rate. It is copied teacher error overlap.

If 73% of student fact errors match teacher errors, the student is not inventing new mistakes. It is copying the teacher.

Fix

  1. Make teacher generation retrieval-grounded.
  2. Reject teacher outputs unsupported by retrieved evidence.
  3. Add human-corrected examples for high-risk domains.
  4. Down-weight examples where teacher confidence is high but retrieval support is low.
  5. Add a factuality verifier before dataset write (see the sketch below).
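
A minimal sketch of steps 2, 4, and 5, assuming hypothetical record fields (teacher_answer, retrieved_evidence, teacher_confidence) and a crude token-overlap support score; a production verifier would use an NLI-style or claim-checking model rather than word overlap.

```python
def support_score(answer: str, evidence_docs: list[str]) -> float:
    """Crude token-overlap support score in [0, 1]; stand-in for a real verifier."""
    answer_tokens = set(answer.lower().split())
    if not answer_tokens:
        return 0.0
    evidence_tokens = set()
    for doc in evidence_docs:
        evidence_tokens.update(doc.lower().split())
    return len(answer_tokens & evidence_tokens) / len(answer_tokens)

def filter_teacher_labels(records, min_support=0.6):
    """Yield (record, weight) pairs to the training-set build stage."""
    for rec in records:
        score = support_score(rec["teacher_answer"], rec["retrieved_evidence"])
        if score < min_support and rec["teacher_confidence"] > 0.9:
            continue  # confident but unsupported: reject before dataset write
        weight = 1.0 if score >= min_support else 0.5  # down-weight weak support
        yield rec, weight
```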

Numerical fix effect

| Stage | Teacher label contamination | Student factual error |
|---|---|---|
| Before filtering | 0.20% | 1.10% |
| After retrieval-support filter | 0.06% | 0.72% |
| After human review on risky slices | 0.03% | 0.54% |

At 50M requests/day, reducing factual error from 1.10% to 0.54% saves:

  • before: 50,000,000 × 0.011 = 550,000 bad factual responses/day
  • after: 50,000,000 × 0.0054 = 270,000
  • improvement: 280,000 fewer/day

4. Failure Scenario 2: Distillation Improves Offline Scores but Hurts Live Traffic

What happened

The student matches the teacher on offline test sets but performs worse in production because the offline set is cleaner than real traffic.

Offline data often has:

  • shorter prompts,
  • fewer typos,
  • fewer multi-intent requests,
  • less adversarial wording,
  • less seasonal catalog drift.

Example

| Metric | Offline | Shadow traffic |
|---|---|---|
| Teacher preference match | 88% | 79% |
| Hallucination rate | 2.8% | 5.7% |
| Escalation recall | 93% | 81% |

The model looked promotable offline. It was not ready online.

Why this explodes at scale

Suppose 8% of traffic is multi-intent and the student is weak on that slice. At 50M/day:

50,000,000 × 0.08 = 4,000,000 multi-intent requests/day

If escalation miss rate rises from 4% to 10% on that slice:

  • before: 4,000,000 × 0.04 = 160,000 misses/day
  • after: 4,000,000 × 0.10 = 400,000 misses/day
  • extra misses: 240,000/day

Production signal

{
  "event": "shadow_compare",
  "slice": "multi_intent_messages",
  "teacher_preference_match": 0.78,
  "student_win_rate": 0.44,
  "policy_escalation_recall": 0.81,
  "baseline_escalation_recall": 0.92
}

Fix

  1. Build eval slices from real production logs, not only curated datasets.
  2. Include noisy inputs: typos, partial messages, language mixing, repeated context.
  3. Gate on slice-level metrics, not only global averages (see the sketch after this list).
  4. Run shadow deployment before canary.
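
A minimal sketch of the slice-level gate in step 3, with illustrative slice names and thresholds; a real gate would read these values from the shadow_compare events shown above.

```python
# Promote only if every slice clears its own floor, not just the global average.
# Slice names and thresholds below are illustrative assumptions.
SLICE_GATES = {
    "multi_intent_messages": {"teacher_preference_match": 0.75, "escalation_recall": 0.90},
    "non_english":           {"teacher_preference_match": 0.70, "escalation_recall": 0.90},
    "catalog_factuality":    {"grounded_answer_rate": 0.85},
}

def promotion_gate(slice_metrics: dict) -> tuple[bool, list[str]]:
    """slice_metrics: {slice_name: {metric_name: value}} measured on shadow traffic."""
    failures = []
    for slice_name, floors in SLICE_GATES.items():
        observed = slice_metrics.get(slice_name, {})
        for metric, floor in floors.items():
            if observed.get(metric, 0.0) < floor:
                failures.append(f"{slice_name}.{metric}={observed.get(metric)} < {floor}")
    return (len(failures) == 0), failures
```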

Amazon-scale lesson

At scale, averages lie. If a student is excellent on 90% easy traffic and poor on 10% costly traffic, the business impact can still be unacceptable.


5. Failure Scenario 3: Student Is Too Small to Absorb Teacher Behavior

What happened

The team compresses too aggressively. Example:

  • teacher: strong reranker or LLM
  • student: tiny model chosen only for cost
  • result: low training loss improvement, weak calibration, poor rare-case behavior

Numerical intuition

Suppose the teacher achieves an NDCG@10 of roughly 0.94 on reranking. Three student sizes are tested.

| Student (params) | NDCG@10 | p95 latency | Cost/1K queries |
|---|---|---|---|
| 120M | 0.92 | 32 ms | $0.18 |
| 40M | 0.88 | 18 ms | $0.09 |
| 10M | 0.76 | 9 ms | $0.05 |

At first glance, the 10M model looks attractive. But assume each 0.01 NDCG drop causes 0.3% fewer correct top-item placements.

Comparing 40M vs 10M:

  • NDCG drop = 0.88 - 0.76 = 0.12, i.e. twelve 0.01-sized steps
  • conversion sensitivity = 12 × 0.3% = 3.6% relative loss

If this affects 8M purchase-influencing sessions/day, that loss is huge.

Production signal

{
  "event": "student_capacity_analysis",
  "student": "tiny-reranker-10m",
  "train_kd_loss_epoch1": 1.92,
  "train_kd_loss_epoch4": 1.61,
  "eval_ndcg10": 0.76,
  "calibration_ece": 0.19,
  "rare_query_recall": 0.41
}

The giveaway is:

  • KD loss improves only a little,
  • rare recall stays weak,
  • calibration is poor,
  • the model underfits even before overfitting becomes a concern.

Fix

  1. Move to a larger student.
  2. Use intermediate teacher or two-hop distillation.
  3. Add feature matching, not only output matching (see the sketch after this list).
  4. Use LoRA or adapter-based student tuning if full fine-tune is unstable.
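
A minimal PyTorch-style sketch of combining output matching with feature matching (step 3); the layer dimensions, temperature, and the 0.7/0.3 weights are illustrative assumptions, not tuned values.

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureMatchingKD(nn.Module):
    """Output-matching plus feature-matching distillation loss (sketch)."""

    def __init__(self, student_dim: int = 512, teacher_dim: int = 1024):
        super().__init__()
        # Project the smaller student hidden state into the teacher's hidden space.
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_logits, teacher_logits,
                student_hidden, teacher_hidden,
                T: float = 2.0, alpha: float = 0.7, beta: float = 0.3):
        # Output matching: KL divergence on temperature-softened logits.
        kd = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        # Feature matching: MSE between projected student and teacher hidden states.
        feat = F.mse_loss(self.proj(student_hidden), teacher_hidden)
        return alpha * kd + beta * feat
```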

Rule of thumb

If the student cannot recover teacher behavior on easy slices after 1 to 2 epochs, compression may be too aggressive. This is a capacity problem, not a patience problem.


6. Failure Scenario 4: Rare but High-Cost Classes Collapse

What happened

The overall accuracy looks good, but rare classes suffer. For MangaAssist, examples include:

  • refund fraud escalation,
  • self-harm wording,
  • legal complaint escalation,
  • payment dispute,
  • account lockout.

These are low-frequency but high-cost.

Numerical example

Assume training distribution:

| Class | Traffic share |
|---|---|
| FAQ | 40% |
| recommendations | 25% |
| order tracking | 20% |
| returns | 10% |
| payment dispute | 3% |
| safety escalation | 2% |

A distilled classifier improves global accuracy from 90.8% to 92.0%, but safety escalation recall drops from 95% to 82%.

At 50M/day:

  • safety traffic = 50,000,000 × 0.02 = 1,000,000/day
  • missed escalations before = 1,000,000 × 0.05 = 50,000/day
  • missed escalations after = 1,000,000 × 0.18 = 180,000/day
  • extra misses = 130,000/day

That global accuracy gain is not worth it.

Production signal

{
  "event": "class_slice_eval",
  "class": "safety_escalation",
  "support": 842,
  "teacher_recall": 0.96,
  "student_recall": 0.82,
  "student_precision": 0.88,
  "hard_label_weight": 0.15,
  "soft_label_weight": 0.85
}

Root cause

Soft labels can blur rare classes into nearby common classes. The student learns ambiguity, but the business needs a hard safe decision.

Fix

  1. Blend soft labels with hard labels.
  2. Upweight rare high-risk classes.
  3. Oversample reviewed rare-case examples.
  4. Use class-specific promotion gates.
  5. Add rules above the student for critical classes.

Numerical adjustment

Original total loss:

L = 0.15 * CE + 0.85 * KD

Safer revised loss for high-risk classifier:

L = 0.45 * CE + 0.55 * KD + 0.30 * rare_class_penalty
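
A minimal PyTorch-style sketch of this revised objective. The rare_class_ids tensor and the temperature are assumptions, and the rare-class penalty is implemented here as extra hard-label cross-entropy on rare-class examples only.

```python
import torch
import torch.nn.functional as F

def blended_kd_loss(student_logits, teacher_logits, hard_labels,
                    rare_class_ids, T: float = 2.0,
                    w_ce: float = 0.45, w_kd: float = 0.55, w_rare: float = 0.30):
    """L = w_ce * CE + w_kd * KD + w_rare * rare_class_penalty (sketch).
    rare_class_ids: 1-D tensor of high-risk class indices."""
    ce = F.cross_entropy(student_logits, hard_labels)

    # Standard soft-label KD term: KL between temperature-softened distributions.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    # Extra hard-label pressure on rare, high-risk classes only.
    rare_mask = torch.isin(hard_labels, rare_class_ids)
    if rare_mask.any():
        rare_penalty = F.cross_entropy(student_logits[rare_mask], hard_labels[rare_mask])
    else:
        rare_penalty = torch.zeros((), device=student_logits.device)

    return w_ce * ce + w_kd * kd + w_rare * rare_penalty
```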

Result:

| Metric | Before | After fix |
|---|---|---|
| Global accuracy | 92.0% | 91.6% |
| Safety recall | 82% | 94% |
| Payment dispute recall | 79% | 90% |

Slight global loss, much better business safety.


7. Failure Scenario 5: Fallback Traffic Explosion Removes the Cost Savings

What happened

The student was meant to reduce cost, but poor confidence calibration caused too many fallback calls to the teacher.

Example routing rule:

  • use student if confidence >= 0.80
  • else call managed teacher

If the student is underconfident, fallback rate jumps.

Numerical example

Assume 10M queries/day.

| Metric | Planned | Actual |
|---|---|---|
| Student handled | 90% | 62% |
| Teacher fallback | 10% | 38% |
| Teacher cost/1K | $3.00 | $3.00 |
| Student cost/1K | $0.20 | $0.20 |

Planned daily cost:

  • student: 9,000,000 / 1000 × 0.20 = $1,800
  • teacher: 1,000,000 / 1000 × 3.00 = $3,000
  • total = $4,800/day

Actual daily cost:

  • student: 6,200,000 / 1000 × 0.20 = $1,240
  • teacher: 3,800,000 / 1000 × 3.00 = $11,400
  • total = $12,640/day

The “cheaper” architecture became 2.63× more expensive than planned.
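
The sensitivity to fallback rate is easy to check with a small cost model; the per-1K prices are the example values from the table above, not real pricing.

```python
def daily_cost(total_queries: int, fallback_rate: float,
               student_per_1k: float = 0.20, teacher_per_1k: float = 3.00) -> float:
    """Daily serving cost in USD as a function of teacher fallback rate."""
    student_q = total_queries * (1 - fallback_rate)
    teacher_q = total_queries * fallback_rate
    return student_q / 1000 * student_per_1k + teacher_q / 1000 * teacher_per_1k

print(daily_cost(10_000_000, 0.10))  # planned: 4800.0
print(daily_cost(10_000_000, 0.38))  # actual: 12640.0
```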

Production signal

{
  "event": "router_health",
  "student_model": "manga-student-8b-v04",
  "student_accept_rate": 0.62,
  "teacher_fallback_rate": 0.38,
  "student_ece": 0.17,
  "teacher_daily_cost_usd": 11400,
  "planned_teacher_daily_cost_usd": 3000
}

Fix

  1. Recalibrate student confidence on shadow traffic.
  2. Use margin-based routing, not only top-score routing.
  3. Add slice-based thresholds.
  4. Allow direct safe refusal instead of fallback on low-value intents.
  5. Distill the teacher better on ambiguous traffic.

Better routing formula

Instead of:

route_to_teacher if max_prob < 0.80

Use:

route_to_teacher if (max_prob < 0.72) OR (top1 - top2 < 0.08) OR risky_slice = true

This reduces unnecessary fallback from uncertainty caused by close classes.
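
A minimal sketch of that routing rule; the thresholds are the example values above, not tuned production numbers.

```python
def route_to_teacher(probs: list[float], risky_slice: bool,
                     min_conf: float = 0.72, min_margin: float = 0.08) -> bool:
    """Margin-based routing: fall back on low confidence, close top-2 classes, or risky slices."""
    top1, top2 = sorted(probs, reverse=True)[:2]
    return (top1 < min_conf) or (top1 - top2 < min_margin) or risky_slice

# Example: top score 0.70 is below 0.72, so this request still goes to the teacher.
route_to_teacher([0.70, 0.25, 0.05], risky_slice=False)  # -> True
```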


8. Failure Scenario 6: Training Looks Stable, but Calibration Breaks

What happened

Loss goes down and accuracy improves, but the student becomes poorly calibrated. It is confidently wrong.

For customer support workflows, this is dangerous because the router trusts the student.

Numerical intuition

Two students both achieve 90% accuracy.

| Model | Accuracy | ECE | Confident wrong answers |
|---|---|---|---|
| Student A | 90% | 0.04 | 2% |
| Student B | 90% | 0.18 | 8% |

Student B is much worse operationally.

At 20M/day:

  • Student A confident-wrong responses: 20,000,000 × 0.02 = 400,000/day
  • Student B confident-wrong responses: 20,000,000 × 0.08 = 1,600,000/day

Production signal

{
  "event": "calibration_eval",
  "split": "shadow_traffic",
  "accuracy": 0.90,
  "ece": 0.18,
  "brier_score": 0.22,
  "confident_error_rate": 0.08
}

Fix

  1. Track ECE, Brier score, and confident-error rate, not just accuracy.
  2. Use temperature scaling after training (see the sketch after this list).
  3. Retrain with more hard labels on risky slices.
  4. Calibrate thresholds per intent family.
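
A minimal sketch of post-hoc temperature scaling (step 2) fitted on held-out shadow-traffic logits; the grid-search range is an assumption, and production code would usually optimize the temperature directly.

```python
import numpy as np

def fit_temperature(logits: np.ndarray, labels: np.ndarray,
                    grid=np.linspace(0.5, 5.0, 91)) -> float:
    """Pick the temperature that minimizes NLL on a held-out set.
    logits: shape (N, C); labels: shape (N,)."""
    best_T, best_nll = 1.0, np.inf
    for T in grid:
        scaled = logits / T
        scaled = scaled - scaled.max(axis=1, keepdims=True)  # numerical stability
        log_probs = scaled - np.log(np.exp(scaled).sum(axis=1, keepdims=True))
        nll = -log_probs[np.arange(len(labels)), labels].mean()
        if nll < best_nll:
            best_T, best_nll = T, nll
    return best_T

# At serving time: calibrated_probs = softmax(student_logits / best_T)
```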

Important intuition

A student with slightly lower accuracy but much better calibration can be the better production model. At scale, routing quality matters almost as much as raw model quality.


9. Failure Scenario 7: Distillation Dataset Becomes Stale

What happened

Manga titles, editions, promotions, and inventory evolve quickly. A student distilled from old production data learns yesterday’s traffic.

Numerical example

Assume:

  • 30% of catalog-related prompts reference titles or editions introduced in the last 60 days.
  • training data is 4 months old.

On fresh-title slice:

| Model | Retrieval hit rate | Grounded answer rate |
|---|---|---|
| Teacher | 93% | 90% |
| Student distilled on old data | 78% | 72% |

At 12M catalog Q&A requests/day, if 30% are fresh-title related:

  • slice traffic = 12,000,000 × 0.30 = 3,600,000/day
  • grounded-answer gap = 0.90 - 0.72 = 0.18
  • extra bad responses = 3,600,000 × 0.18 = 648,000/day

Production signal

{
  "event": "freshness_slice_eval",
  "slice": "new_titles_last_60_days",
  "teacher_grounded_rate": 0.90,
  "student_grounded_rate": 0.72,
  "retrieval_miss_rate": 0.22,
  "dataset_age_days_p50": 97
}

Fix

  1. Build rolling distillation datasets from recent logs.
  2. Weight recent traffic more heavily (see the sketch after this list).
  3. Keep retrieval system fresh and decoupled from student knowledge.
  4. Retrain or refresh on fixed cadence.
  5. Use shadow eval on the newest catalog slice before rollout.
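
A minimal sketch of recency weighting (steps 1 and 2), assuming each example carries an age in days; the 45-day half-life is an assumption to tune against catalog churn.

```python
import numpy as np

def recency_weights(age_days: np.ndarray, half_life_days: float = 45.0) -> np.ndarray:
    """Exponential decay: an example half_life_days old gets half the weight of a fresh one."""
    return 0.5 ** (age_days / half_life_days)

def sample_training_set(example_ids: np.ndarray, age_days: np.ndarray, k: int,
                        rng=np.random.default_rng(0)) -> np.ndarray:
    """Sample k examples, biased toward recent traffic."""
    w = recency_weights(age_days)
    return rng.choice(example_ids, size=k, replace=False, p=w / w.sum())
```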

Amazon-scale takeaway

At large scale, staleness is a first-class bug. Even a strong student drifts quickly if the surrounding business world changes quickly.


10. Failure Scenario 8: Logging Costs and Observability Collapse

What happened

The team logs everything:

  • full teacher response,
  • student response,
  • logits,
  • retrieval context,
  • raw prompt,
  • reranker scores,
  • reviewer notes.

At first this feels useful. At scale it becomes expensive and hard to query.

Numerical example

Assume average inference log payload = 12 KB. At 50M requests/day:

  • total log volume/day = 50,000,000 × 12 KB = 600,000,000 KB
  • approximately 572 GB/day
  • monthly ≈ 17.2 TB

If storage plus indexing costs $120/TB effective:

  • monthly logging cost ≈ 17.2 × 120 = $2,064/month

That number grows fast if you also retain raw contexts and full token-level scores. In real systems, observability cost can be much higher due to replication and query indexing.

Operational failure

Even worse than cost:

  • incidents are harder to debug,
  • dashboards time out,
  • sensitive data risk increases,
  • engineers stop trusting dashboards.

Fix

  1. Log full payloads only for sampled traffic.
  2. Log compact aggregate metrics on all traffic.
  3. Separate hot logs from cold audit storage.
  4. Hash or redact customer text.
  5. Keep per-token data only for debugging cohorts.

| Traffic slice | Logging mode |
|---|---|
| 99.5% normal traffic | compact metrics only |
| 0.4% sampled shadow traffic | full prompt/response + retrieval ids |
| 0.1% incident/risky slice | full forensic logs |
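
A minimal sketch of a logging-mode router implementing the table above; the rates and mode names are the example values, and hash-bucket sampling is one simple way to keep the decision deterministic per request.

```python
import hashlib

def logging_mode(request_id: str, risky_slice: bool) -> str:
    """Decide how much to log for one request (rates are the example values)."""
    if risky_slice:
        return "full_forensic"                       # ~0.1% incident / risky traffic
    # Deterministic hash sampling so a request keeps its bucket on retries.
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 1000
    if bucket < 4:
        return "full_prompt_response_and_retrieval"  # ~0.4% shadow sample
    return "compact_metrics"                         # everything else
```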

11. Failure Scenario 9: Distributed Distillation Training Becomes the Bottleneck

What happened

At Amazon-scale style training volumes, the issue is no longer only model math. It becomes system bottlenecks:

  • teacher inference too slow,
  • GPU underutilized due to data pipeline,
  • all ranks duplicating teacher work,
  • FSDP/DP setup inefficient,
  • checkpoint saves too large and too frequent.

Numerical example

Suppose we distill 100M examples. Per-example teacher forward time = 3 ms. Naively doing teacher inference inline for every student batch:

100,000,000 × 3 ms = 300,000,000 ms = 300,000 s ≈ 83.3 hours

That is only teacher forward time, before student backward.

If preprocessing teacher targets offline reduces teacher work by 85% during training:

  • remaining teacher cost ≈ 83.3 × 0.15 = 12.5 hours

Production signal

{
  "event": "training_system_profile",
  "job": "distill_llm_v12",
  "gpu_utilization": 0.41,
  "input_pipeline_wait_fraction": 0.33,
  "teacher_forward_fraction": 0.38,
  "checkpoint_time_fraction": 0.09,
  "tokens_per_second": 7400
}

GPU utilization at 41% means the math is not the main problem. The system is starving the GPUs.

Fix

  1. Precompute teacher outputs where possible.
  2. Cache soft labels or response targets.
  3. Separate labeling job from student training job.
  4. Use distributed loaders and pinned memory.
  5. Save checkpoints at useful intervals, not too often.

Intuition

At scale, the fastest KD training job is often the one that moves teacher computation out of the training loop.
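
As a sketch of that idea, teacher targets can be written once to a cache that the student training job streams later. batch_teacher_forward and the record layout are assumptions, not a real API.

```python
import json

def precompute_teacher_targets(prompts, batch_teacher_forward, out_path,
                               batch_size=64, top_k=20):
    """Run the teacher once per example and store compact top-k soft labels.
    batch_teacher_forward(batch, top_k) is assumed to return, per prompt,
    (prompt_id, top_k_token_ids, top_k_probs)."""
    with open(out_path, "w") as f:
        for i in range(0, len(prompts), batch_size):
            batch = prompts[i:i + batch_size]
            for prompt_id, token_ids, probs in batch_teacher_forward(batch, top_k=top_k):
                f.write(json.dumps({
                    "prompt_id": prompt_id,
                    "top_k_token_ids": token_ids,  # compact soft labels, not full logits
                    "top_k_probs": probs,
                }) + "\n")

# The student training loop then streams this file instead of calling the teacher inline.
```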


12. Failure Scenario 10: Human Review Layer Does Not Scale Cleanly

What happened

Human review is added to fix teacher mistakes. But reviewers disagree, instructions drift, and throughput is too low for the most important slices.

Numerical example

Assume:

  • sampled risky examples needing review = 80,000/week
  • reviewer capacity = 18,000/week

Backlog growth:

80,000 - 18,000 = 62,000/week

After 4 weeks:

62,000 × 4 = 248,000 backlog

Then the student is trained on stale or unreviewed data.

Production signal

{
  "event": "review_pipeline_health",
  "weekly_incoming": 80000,
  "weekly_reviewed": 18000,
  "backlog": 248000,
  "inter_annotator_agreement": 0.71,
  "median_review_delay_days": 19
}

Fix

  1. Only route the highest-value slices to humans.
  2. Use reviewer adjudication for disagreement slices.
  3. Convert repeated human corrections into policy rules.
  4. Track agreement and reviewer drift.
  5. Sample intelligently, not uniformly (see the sketch below).
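
A minimal sketch of value-based review sampling (step 5); the scoring fields are assumptions about what each candidate example carries.

```python
def select_for_review(candidates, weekly_capacity: int = 18_000):
    """Spend limited reviewer capacity on the highest expected-value examples."""
    def value(example):
        return (example["risk_weight"]          # e.g. safety / payment slices score high
                * example["model_uncertainty"]  # teacher-student disagreement or low margin
                * example["traffic_weight"])    # how common this pattern is in production
    ranked = sorted(candidates, key=value, reverse=True)
    return ranked[:weekly_capacity]
```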

Amazon-scale lesson

Humans are precious. Use them where they create the most model-quality gain per reviewed example.


13. Failure Scenario 11: Canary Looks Fine Globally but Fails by Region, Device, or Language

What happened

Global canary passes, but certain slices break:

  • mobile app traffic,
  • non-English messages,
  • high-latency regions,
  • older devices with tighter CPU budgets.

Numerical example

Global p95 latency improves from 190 ms to 150 ms. Looks good. But mobile low-memory slice worsens:

| Slice | Before | After |
|---|---|---|
| global p95 | 190 ms | 150 ms |
| low-memory mobile p95 | 240 ms | 330 ms |
| non-English factuality | 88% | 76% |

If low-memory mobile traffic is 6M/day and abandonment rises by 2.5%:

6,000,000 × 0.025 = 150,000 extra bad user outcomes/day

Fix

  1. Gate rollout by slice, not only globally.
  2. Deploy separate student variants by hardware target if needed.
  3. Keep multilingual eval slices.
  4. Keep regional canary metrics.

14. Failure Scenario 12: No Clear Stop Rule, So Fine-Tuning Keeps Going Too Long

What happened

Training continues because loss is still going down. But production-relevant metrics stop improving.

Example epoch table

| Epoch | Train KD loss | Eval hard loss | Teacher match | Hallucination | ECE |
|---|---|---|---|---|---|
| 1 | 1.82 | 0.74 | 0.79 | 0.050 | 0.11 |
| 2 | 1.45 | 0.61 | 0.84 | 0.039 | 0.08 |
| 3 | 1.28 | 0.58 | 0.86 | 0.034 | 0.06 |
| 4 | 1.14 | 0.57 | 0.865 | 0.033 | 0.06 |
| 5 | 1.03 | 0.60 | 0.861 | 0.037 | 0.08 |
| 6 | 0.96 | 0.66 | 0.852 | 0.043 | 0.10 |

A common mistake is to keep going because train loss improves. But epoch 4 was the best production point.

Fix

Stop on a promotion score, not just train loss.

Example score:

promotion_score = 0.35 * teacher_match + 0.25 * human_win + 0.20 * safety_recall - 0.10 * hallucination - 0.10 * ece

Pick the checkpoint with the best promotion score on shadow traffic.
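
A minimal sketch of checkpoint selection by that score; the metric names match the formula above, and the shape of the checkpoint dictionary is an assumption.

```python
def promotion_score(m: dict) -> float:
    """Promotion score from shadow-traffic metrics, using the weights above."""
    return (0.35 * m["teacher_match"] + 0.25 * m["human_win"]
            + 0.20 * m["safety_recall"] - 0.10 * m["hallucination"] - 0.10 * m["ece"])

def best_checkpoint(checkpoint_metrics: dict) -> str:
    """checkpoint_metrics: {checkpoint_name: shadow-traffic metric dict}."""
    return max(checkpoint_metrics, key=lambda name: promotion_score(checkpoint_metrics[name]))
```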


15. How We Would Fix This for MangaAssist in an Amazon-Scale Style Environment

Operating model

  1. Teacher labeling is separate from student training:
     • run batch teacher labeling jobs,
     • attach retrieval evidence ids,
     • store compact soft labels and policy tags.

  2. Human review is targeted:
     • factual catalog mismatch,
     • risky escalations,
     • low-margin ambiguous cases,
     • fresh catalog slices.

  3. Promotion gates are slice-based, not just one global score:
     • catalog factuality,
     • rare escalation recall,
     • multilingual slice,
     • long-tail title freshness,
     • device latency.

  4. Routing is calibration-aware:
     • student handles easy traffic,
     • teacher handles ambiguous or high-risk traffic,
     • thresholds tuned per slice.

  5. Refresh is continuous:
     • rolling recent logs,
     • hard example mining,
     • drift alerts,
     • monthly or biweekly refresh depending on catalog volatility.


16. What to Monitor in Production

flowchart TD
    A[Traffic health] --> A1[QPS]
    A --> A2[p50 p95 latency]
    A --> A3[teacher fallback rate]

    B[Quality health] --> B1[teacher preference match]
    B --> B2[human win rate]
    B --> B3[hallucination rate]
    B --> B4[factuality by slice]

    C[Safety health] --> C1[escalation recall]
    C --> C2[confident error rate]
    C --> C3[policy violation rate]

    D[Data health] --> D1[dataset age]
    D --> D2[review backlog]
    D --> D3[slice coverage]

    E[Training health] --> E1[KD loss]
    E --> E2[hard loss]
    E --> E3[ECE]
    E --> E4[rare-class recall]

Minimum metrics I would insist on

| Category | Must-have metric | Why it matters |
|---|---|---|
| Quality | hallucination rate | raw trustworthiness |
| Quality | teacher preference match | distillation success |
| Safety | escalation recall | costly failure prevention |
| Routing | fallback rate | cost control |
| Calibration | ECE | confidence quality |
| Freshness | dataset age p50/p95 | drift control |
| Training | best checkpoint by shadow score | stop rule |

17. Final Intuition

At small scale, distillation is often described as:

  • make the model smaller,
  • keep most of the quality,
  • reduce latency and cost.

At Amazon-like scale, the real story is bigger:

  • teacher mistakes can contaminate millions of examples,
  • tiny calibration errors can multiply teacher fallback cost,
  • rare-class failures can dominate business risk,
  • stale data can create hundreds of thousands of bad answers per day,
  • logging and review pipelines become system bottlenecks,
  • a 0.5% regression is not “small” anymore.

So the mature version of distillation is:

compress the model without compressing trust, safety, and operational control.

That is the mindset needed for MangaAssist if it grows from a manageable chatbot into an Amazon-scale style production system.