MangaAssist Knowledge Distillation at Scale
Deep-Dive Solutions, Alternatives Explored, and Final Tradeoff Decisions
This document is a solution-focused companion to the failure-scenarios guide. It explains, for each major distillation failure mode, what solution was implemented, what other options were explored, and how the final tradeoff was chosen.
Any references to Amazon-scale below should be read as large-scale engineering intuition for a MangaAssist-like chatbot: very high traffic, strict latency budgets, large catalog churn, heavy observability, and small percentage errors that become expensive in absolute terms. They are not claims about a private Amazon internal system.
1. What This Document Tries to Answer
A failure document tells you what can go wrong. A platform or MLOps leader also needs to understand:
- What exact fix was put in place?
- What simpler or cheaper alternatives were considered first?
- Why was one solution chosen over another?
- What was the latency, cost, accuracy, and operational tradeoff?
- Would the same answer still be right at 50K requests/day and 50M requests/day?
This document answers those questions in a MangaAssist setting using an Amazon-scale decision style.
2. Baseline System Context
2.1 MangaAssist distillation targets
| Component | Teacher | Student | Core goal |
|---|---|---|---|
| Intent classifier | DistilBERT | TinyBERT | cut latency and memory while keeping intent quality |
| Response model | strong managed teacher | cheaper student or self-hosted model | lower cost per response without breaking policy/factuality |
| Reranker | large cross-encoder | smaller reranker | preserve ranking quality inside latency budget |
2.2 Promotion gates that matter
| Metric | Gate |
|---|---|
| teacher preference match | >= 85% |
| human win rate vs base student | >= 65% |
| catalog hallucination rate | <= 4% |
| cost per 1K responses | at least 50% lower |
| p95 latency | within component SLO |
| fallback rate | within cost budget |
2.3 Tradeoff rule used in this guide
The final decision is never based on one metric. We use this practical order:
- Safety / policy / factuality
- Latency SLO
- Cost reduction
- Quality retention
- Operational simplicity
That order matters. A model that is cheaper but unsafe does not ship. A model that is accurate but misses p95 latency also does not ship.
3. End-to-End Decision Map
flowchart TD
A[Failure observed] --> B[Quantify business impact]
B --> C[Generate candidate fixes]
C --> D[Offline experiment]
D --> E[Shadow traffic]
E --> F[Canary rollout]
F --> G[Final decision]
G --> G1[Promote]
G --> G2[Rollback]
G --> G3[Defer and keep teacher]
3.1 Decision principle
At small scale, teams often pick whichever option has the best offline metric. At Amazon-scale, the better question is:
Which option minimizes bad absolute outcomes per day while still meeting budget and operability targets?
A 0.3% hallucination reduction at 50M requests/day means:
50,000,000 × 0.003 = 150,000 fewer bad responses/day
That can beat a 5% cost gain if the bad responses are customer-visible and expensive.
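The same arithmetic generalizes into a one-line helper, sketched below in Python; the traffic and rate figures are simply the assumptions from the example above:

```python
def absolute_impact_per_day(requests_per_day: int, rate_delta: float) -> float:
    """Convert a rate change (expressed as a fraction) into absolute outcomes per day."""
    return requests_per_day * rate_delta

# A 0.3-point hallucination reduction at 50M requests/day:
print(absolute_impact_per_day(50_000_000, 0.003))  # 150000.0
```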
4. Failure Scenario Deep Dives
4.1 Failure A — Teacher Hallucinations Were Being Copied Into the Student
The failure
The teacher produced confident but unsupported catalog claims. Example:
- user: “Is Nana hardcover available?”
- catalog truth: paperback only
- teacher: “Yes, hardcover is in stock.”
If this enters the distillation dataset unchanged, the student learns the wrong fact pattern.
Why this became serious at scale
Assume a monthly unlabeled sampling run of 25,000,000 production prompts. If only 0.2% of teacher labels are bad:
25,000,000 × 0.002 = 50,000 corrupted targets
That is not “noise.” That is a systematic dataset poisoning source.
Solution implemented
We implemented a retrieval-grounded teacher labeling gate before dataset write.
Final implemented pipeline
flowchart LR
A[Production prompt] --> B[Retrieve catalog evidence]
B --> C[Teacher generation]
C --> D[Evidence support check]
D --> E{Supported?}
E -->|Yes| F[Write to distillation set]
E -->|No| G[Reject or human review]
Final rule
A teacher response was eligible only if:
- retrieved evidence existed,
- answer entities matched catalog fields,
- support score passed threshold,
- no contradiction with structured catalog facts,
- confidence was not high while support was low.
Production-style acceptance function
write_example =
evidence_present
AND support_score >= 0.78
AND contradiction_score <= 0.10
AND NOT(high_confidence AND low_support)
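For concreteness, here is a minimal Python sketch of the same acceptance rule. The support and contradiction cutoffs come from the rule above; the two thresholds defining "high confidence" and "low support" are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class TeacherLabel:
    evidence_present: bool
    support_score: float        # evidence-support score in [0, 1]
    contradiction_score: float  # contradiction vs structured catalog facts
    confidence: float           # teacher self-reported confidence

def accept_for_distillation(label: TeacherLabel,
                            high_conf: float = 0.90,    # assumed cutoff
                            low_support: float = 0.50,  # assumed cutoff
                            ) -> bool:
    """Reject confident-but-unsupported teacher answers before dataset write."""
    confident_but_unsupported = (label.confidence >= high_conf
                                 and label.support_score < low_support)
    return (label.evidence_present
            and label.support_score >= 0.78
            and label.contradiction_score <= 0.10
            and not confident_but_unsupported)
```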
Numerical effect
| Metric | Before | After implemented fix |
|---|---|---|
| teacher label contamination | 0.20% | 0.04% |
| student factual error on catalog slice | 1.10% | 0.58% |
| copied teacher error overlap | 0.73 | 0.29 |
| dataset acceptance rate | 100% | 86% |
Intuition
We accepted a 14% data volume drop to get a much cleaner label set. At scale, clean data beats slightly larger dirty data.
Other solutions explored
Option 1 — Pure human review of all teacher outputs
Pros:
- best label quality
- strongest control over high-risk domains

Cons:
- too expensive
- too slow for continuous distillation refresh
- reviewer consistency drift
Option 2 — Train student only on human-reviewed examples
Pros:
- lowest hallucination copying risk

Cons:
- dataset too small
- poor coverage of long-tail prompts
- weak teacher-behavior imitation
Option 3 — Accept all teacher labels but down-weight low-confidence examples
Pros:
- simple to implement
- keeps all data

Cons:
- does not solve confident unsupported errors
- still leaks bad facts into student
Option 4 — Add a separate factuality verifier after training only
Pros:
- protects production inference

Cons:
- too late for dataset quality
- student already learns bad patterns
Why the final decision won
| Option | Quality | Cost | Latency impact | Operational complexity | Final decision |
|---|---|---|---|---|---|
| full human review | best | worst | none | high | rejected |
| human-only dataset | low coverage | medium | none | medium | rejected |
| confidence down-weight only | weak | best | none | low | rejected |
| retrieval-grounded gating | strong | good | low | medium | chosen |
Final tradeoff logic
We chose retrieval-grounded gating because it had the best combined outcome:
- large factuality gain,
- acceptable pipeline complexity,
- scalable to millions of examples,
- still compatible with selective human review.
4.2 Failure B — Rare-Class Recall Collapsed After Distillation
The failure
The student became better on common intents but worse on rare important ones. For MangaAssist these could be:
- fraud / abuse signals,
- escalation language,
- legal or policy-sensitive support requests,
- refund edge cases.
Why it becomes dangerous at scale
Suppose a rare class appears in only 0.4% of traffic. At 50M requests/day:
50,000,000 × 0.004 = 200,000 rare-class events/day
If recall falls from 0.91 to 0.82:
- before misses: 200,000 × 0.09 = 18,000
- after misses: 200,000 × 0.18 = 36,000
- extra misses/day: 18,000
A small recall drop becomes a large operational problem.
Solution implemented
We used a hybrid supervision strategy:
total_loss =
0.45 * hard_label_loss
+ 0.35 * KD_soft_loss
+ 0.20 * rare_class_weighted_loss
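As a PyTorch sketch, the hybrid objective could look like the following; the temperature T and the class-weight vector are tunable assumptions, not values from this document:

```python
import torch
import torch.nn.functional as F

def hybrid_kd_loss(student_logits: torch.Tensor,
                   teacher_logits: torch.Tensor,
                   targets: torch.Tensor,
                   class_weights: torch.Tensor,
                   T: float = 2.0) -> torch.Tensor:
    """0.45 * hard CE + 0.35 * temperature-scaled KD + 0.20 * rare-class-weighted CE."""
    hard = F.cross_entropy(student_logits, targets)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)  # standard KD temperature scaling
    rare = F.cross_entropy(student_logits, targets, weight=class_weights)
    return 0.45 * hard + 0.35 * soft + 0.20 * rare
```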
What changed technically
- Kept teacher soft labels for ambiguity learning.
- Preserved hard labels for rare-class truth.
- Applied class weights to high-risk rare classes.
- Required slice-level promotion gates, not only global accuracy.
Promotion rule added
promote only if:
- overall accuracy improves or stays flat
- rare-class recall >= baseline - 1 point
- high-risk escalation precision >= baseline
Numerical result
| Metric | Base student | KD only | Final hybrid |
|---|---|---|---|
| overall accuracy | 87.6% | 89.1% | 88.9% |
| rare-class recall | 89.8% | 82.0% | 90.7% |
| escalation precision | 93.2% | 90.4% | 93.0% |
Intuition
We accepted overall accuracy 0.2 points lower than KD-only because the rare-class protection was much better. At scale, protecting a critical 0.4% slice can matter more than squeezing out one more point of global accuracy.
Other solutions explored
Option 1 — Oversample rare classes only
Pros:
- easy
- increases rare-class visibility

Cons:
- can overfit rare wording
- distorts production distribution
Option 2 — Train a separate rare-class detector
Pros:
- strong specialist performance
- safer for high-risk routing

Cons:
- extra model in serving path
- more orchestration and monitoring
Option 3 — One-vs-rest auxiliary heads
Pros:
- explicit focus on rare critical intents

Cons:
- more complicated model training
- extra calibration work
Why the final decision won
We chose hybrid supervision + slice gates because it gave the best balance of:
- preserving teacher ambiguity signal,
- protecting rare classes,
- avoiding new serving complexity,
- keeping one model in path.
A separate detector was kept as a fallback design, not the default.
4.3 Failure C — Student Was Cheap Offline but Triggered Fallback Explosion Online
The failure
The student passed offline quality, but online confidence was unstable. It escalated too often to the teacher, which erased cost savings.
Example cost math
Assume:
- student cost = $0.0002 per response
- teacher cost = $0.0030 per response
- traffic = 10,000,000 responses/day
Planned fallback rate: 5%
Daily cost:
- student cost: 10,000,000 × $0.0002 = $2,000
- teacher fallback cost: 10,000,000 × 0.05 × $0.003 = $1,500
- total = $3,500/day
Actual fallback rate: 18%
Daily cost:
- student cost: $2,000
- teacher fallback cost: 10,000,000 × 0.18 × $0.003 = $5,400
- total = $7,400/day
Unexpected daily overrun:
$7,400 - $3,500 = $3,900/day
Monthly overrun:
$3,900 × 30 = $117,000/month
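The overrun math compresses into a one-line cost model, sketched here with the per-response costs assumed above:

```python
def daily_serving_cost(traffic: int, fallback_rate: float,
                       student_cost: float = 0.0002,
                       teacher_cost: float = 0.0030) -> float:
    """Every request pays the student; fallback requests additionally pay the teacher."""
    return traffic * (student_cost + fallback_rate * teacher_cost)

planned = daily_serving_cost(10_000_000, 0.05)   # $3,500/day
actual = daily_serving_cost(10_000_000, 0.18)    # $7,400/day
monthly_overrun = (actual - planned) * 30        # $117,000/month
```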
Solution implemented
We replaced simple confidence thresholding with two-stage gating.
Final implemented logic
flowchart LR
A[Student answer] --> B[Confidence score]
B --> C[Risk classifier]
C --> D{High risk?}
D -->|Yes| E[Teacher fallback]
D -->|No| F{Confidence < threshold?}
F -->|Yes| G[Clarify or low-cost fallback]
F -->|No| H[Return student answer]
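A minimal Python sketch of the two-stage gate follows; the risk and confidence thresholds are illustrative placeholders, not tuned production values:

```python
from enum import Enum, auto

class Route(Enum):
    TEACHER_FALLBACK = auto()
    CLARIFY = auto()
    RETURN_STUDENT = auto()

def route(confidence: float, risk_score: float,
          risk_threshold: float = 0.5,
          conf_threshold: float = 0.7) -> Route:
    """Risk check first, confidence check second."""
    if risk_score >= risk_threshold:   # unsafe / high-risk uncertainty
        return Route.TEACHER_FALLBACK
    if confidence < conf_threshold:    # benign uncertainty
        return Route.CLARIFY
    return Route.RETURN_STUDENT
```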
Why this worked
A plain threshold sent too many harmless ambiguous cases to the teacher. The added risk classifier separated:
- unsafe / high-risk uncertainty, from
- benign uncertainty where a clarification question or weaker fallback was enough.
Result
| Metric | Before | After |
|---|---|---|
| fallback rate | 18.0% | 6.4% |
| teacher traffic | 1.8M/day | 0.64M/day |
| daily cost | $7,400 | $3,920 |
| unsafe false accepts | 0.34% | 0.11% |
Other solutions explored
Option 1 — Raise threshold aggressively
Pros:
- fast to change
- reduces teacher fallback immediately

Cons:
- lets more bad student answers through
- weak control on high-risk slices
Option 2 — Always ask a clarification question
Pros:
- cheap
- avoids teacher call

Cons:
- hurts UX
- unnecessary on clear cases
Option 3 — Route all low-confidence queries to a smaller intermediate model
Pros:
- reduces expensive teacher load

Cons:
- adds another serving tier
- more operational burden
Why the final decision won
We chose risk-aware fallback because cost and safety were both first-class. It gave most of the cost benefit of aggressive thresholding without losing the protection needed for unsafe cases.
4.4 Failure D — Calibration Was Good Offline and Bad in Production
The failure
The student looked confident on validation data but was miscalibrated on new production traffic. That broke fallback routing and human escalation rules.
Solution implemented
We added a post-distillation calibration stage using a recent production-like holdout.
Final calibration steps
- Train student with KD.
- Freeze weights.
- Fit temperature / calibration layer on recent holdout.
- Re-evaluate ECE, fallback rate, and escalation precision.
- Promote only if routing metrics improve.
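A minimal temperature-scaling sketch in PyTorch, assuming the frozen student's holdout logits and labels have already been collected:

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Fit one scalar temperature on a recent holdout by minimizing NLL.
    logits: [N, C] frozen-student outputs; labels: [N] int64 targets."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so T stays positive
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_t.exp().item()
```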
Numerical example
| Metric | Before calibration | After calibration |
|---|---|---|
| ECE | 0.118 | 0.034 |
| fallback over-trigger rate | 7.2% | 2.1% |
| unsafe false accept rate | 0.21% | 0.13% |
Other solutions explored
- isotonic regression
- Platt scaling
- classwise thresholds only
- no explicit calibration, rely on more KD epochs
Why the final decision won
A separate calibration stage was chosen because:
- it is cheap,
- it does not require retraining the whole model,
- it directly improves routing decisions,
- it is easy to rerun during drift refreshes.
More KD epochs alone did not reliably fix calibration drift.
4.5 Failure E — Distillation Data Became Stale as Catalog and User Behavior Changed
The failure
The student was trained on teacher outputs from an older catalog and older user query mix. After launch, the model degraded on:
- newly released manga,
- changing inventory,
- new support phrasing,
- seasonal traffic shifts.
Scale effect
At small scale, stale examples may hide inside averages. At Amazon-scale, stale data causes repeated visible errors on top traffic items.
If the top 1% of titles drive 35% of catalog questions, staleness in those entities becomes disproportionately expensive.
Solution implemented
We moved from static dataset refreshes to a rolling distillation refresh.
Final implemented approach
- weekly sample of recent production prompts,
- dedupe and PII filtering,
- teacher relabeling with current retrieval state,
- slice-balanced merge with stable historical gold set,
- canary eval before replacing student.
Data mixing ratio chosen
new_recent_examples = 50%
stable_human_gold = 25%
rare_class_protected_set = 15%
policy_and_refusal_set = 10%
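A sketch of the slice-balanced merge under these ratios; the pool names are placeholders, and sampling with replacement is a simplification:

```python
import random

MIX = {"recent": 0.50, "human_gold": 0.25,
       "rare_protected": 0.15, "policy_refusal": 0.10}

def build_refresh_dataset(pools: dict, total: int = 100_000, seed: int = 7) -> list:
    """Merge example pools according to the fixed mixing ratios above."""
    rng = random.Random(seed)
    dataset = []
    for name, frac in MIX.items():
        # sample each slice to its target share of the merged dataset
        dataset.extend(rng.choices(pools[name], k=int(total * frac)))
    rng.shuffle(dataset)
    return dataset
```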
Why not 100% recent data?
Because it improved freshness but hurt stability and rare-case memory.
Result
| Metric | Static monthly refresh | Rolling weekly refresh |
|---|---|---|
| new-title factual accuracy | 91.0% | 95.8% |
| rare-case stability | 94.4% | 94.1% |
| canary regressions/month | 4 | 1 |
Other solutions explored
Option 1 — Monthly full rebuild only
Pros:
- simple pipeline

Cons:
- staleness window too large
Option 2 — Daily full refresh
Pros:
- freshest labels

Cons:
- expensive
- unstable
- noisy canary outcomes
Option 3 — Retrieval-only fix, no student refresh
Pros:
- cheap if the answer path is retrieval-heavy

Cons:
- does not fix classifier, reranker, or routing behavior
Why the final decision won
Weekly rolling refresh with a protected stable core gave the best balance between:
- freshness,
- stability,
- rare-case retention,
- compute cost.
4.6 Failure F — Training Was Too Slow or Too Expensive at Large Scale
The failure
Teacher inference plus student training became the bottleneck. This is common when:
- teacher is large,
- training set is tens of millions of examples,
- multi-stage KD is used,
- experiments are repeated often.
Solution implemented
We split the pipeline into teacher-label caching + student-only retraining loops.
Final implemented design
flowchart LR
A[Prompt sample] --> B[Teacher batch inference]
B --> C[S3 label cache]
C --> D[Dataset build]
D --> E[Student train loop]
E --> F[Eval and select checkpoint]
Why this matters
Without caching, every training experiment re-pays teacher inference cost. With caching, hyperparameter sweeps mostly pay only student-side compute.
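A sketch of the caching idea, with a local directory standing in for the S3 label cache; in practice the cache key should also include the teacher model version so labels are not reused across teacher upgrades:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("teacher_label_cache")  # local stand-in for the S3 label cache

def cached_teacher_label(prompt: str, teacher_fn) -> dict:
    """Pay teacher inference once per unique prompt; reuse across experiments."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    label = teacher_fn(prompt)           # the expensive teacher call
    path.write_text(json.dumps(label))
    return label
```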
Numerical example
Assume:
- teacher labeling 10M prompts costs $18,000
- one student training run costs $900
- 8 experiment variants are required
Without caching:
8 × ($18,000 + $900) = $151,200
With teacher-label caching:
$18,000 + 8 × $900 = $25,200
Savings:
$151,200 - $25,200 = $126,000
Other solutions explored
- online teacher queries during training
- smaller teacher for all examples
- partial-labeling only for ambiguous slice
- synthetic augmentation instead of real teacher labels
Why the final decision won
Caching won because it created the best experimentation speed per dollar. We still used targeted high-quality teacher relabeling for selected slices when needed.
4.7 Failure G — Logs Were So Large That Incidents Became Harder to Debug
The failure
At large traffic volumes, teams often log too much:
- full prompts,
- full teacher outputs,
- full student outputs,
- top-k distributions,
- retrieval evidence,
- model internals,
- trace metadata.
This improves visibility at first, then becomes a storage and debugging problem.
Simple scale math
Assume average logged payload = 6 KB/event. At 50M events/day:
50,000,000 × 6 KB = 300,000,000 KB/day ≈ 300 GB/day
Monthly:
300 GB × 30 = 9 TB/month
If copied across multiple environments and retention tiers, costs rise quickly.
Solution implemented
We changed to tiered logging.
Final logging policy
| Event type | Log detail |
|---|---|
| healthy normal traffic | compact summary only |
| low-confidence traffic | add top-k scores and routing reason |
| policy or factual error slice | add evidence and verifier output |
| canary / incident mode | expanded debug logs |
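A sketch of how the tiers could translate into code; the field names and the 0.7 confidence cutoff are illustrative, not a production schema:

```python
import logging

log = logging.getLogger("mangaassist.serving")

def log_event(event: dict, incident_mode: bool = False) -> None:
    """Emit more detail only when the event tier warrants it."""
    record = {"request_id": event["request_id"], "route": event["route"]}
    if incident_mode:                            # canary / incident mode
        record = dict(event)                     # expanded debug payload
    elif event.get("policy_or_factual_error"):
        record.update(evidence=event["evidence"],
                      verifier=event["verifier_output"])
    elif event["confidence"] < 0.7:              # low-confidence traffic
        record.update(top_k=event["top_k"],
                      routing_reason=event["routing_reason"])
    log.info(record)
```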
Result
| Metric | Before | After |
|---|---|---|
| avg log payload | 6.0 KB | 1.4 KB |
| daily log volume at 50M requests | 300 GB | 70 GB |
| time to find routing root cause | 2.5 hours | 38 minutes |
Other solutions explored
- full logs for everything with short retention
- heavy sampling only
- no prompt logging at all
- separate debug replica for richer traces
Why the final decision won
Tiered logging preserved enough signal for failures while keeping cost and noise manageable. Pure heavy sampling hid rare but important incidents. Full logs were too expensive and slow to work with.
4.8 Failure H — We Were Stopping Too Late or Too Early
The failure
The team initially watched training loss too closely and business metrics too little. That caused two types of mistakes:
- stop too early because eval loss looked flat,
- stop too late because training loss kept improving while hallucination and calibration worsened.
Solution implemented
We switched from single-metric stopping to multi-signal checkpoint selection.
Final checkpoint selection rule
A checkpoint was eligible only if it satisfied all:
hallucination_rate <= gate
teacher_preference_match >= gate
human_win_rate >= gate
fallback_rate <= budget
rare_class_recall >= protected threshold
p95_latency within SLO
If multiple checkpoints passed, we selected the one with the best weighted score.
Example weighted score
score =
0.30 * teacher_preference_match
+ 0.20 * human_win_rate
- 0.20 * hallucination_rate
- 0.10 * fallback_rate
+ 0.10 * rare_class_recall
+ 0.10 * cost_reduction
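Combining the gates and the weighted score into a selection routine might look like this; metric names mirror the rule above, and the gate values themselves are stand-ins:

```python
def passes_gates(m: dict, gates: dict) -> bool:
    """A checkpoint is eligible only if every gate holds."""
    return (m["hallucination_rate"] <= gates["hallucination_rate"]
            and m["teacher_preference_match"] >= gates["teacher_preference_match"]
            and m["human_win_rate"] >= gates["human_win_rate"]
            and m["fallback_rate"] <= gates["fallback_rate"]
            and m["rare_class_recall"] >= gates["rare_class_recall"]
            and m["p95_latency_ms"] <= gates["p95_latency_ms"])

def weighted_score(m: dict) -> float:
    return (0.30 * m["teacher_preference_match"]
            + 0.20 * m["human_win_rate"]
            - 0.20 * m["hallucination_rate"]
            - 0.10 * m["fallback_rate"]
            + 0.10 * m["rare_class_recall"]
            + 0.10 * m["cost_reduction"])

def select_checkpoint(checkpoints: list, gates: dict):
    """Pick the best eligible checkpoint, or None if no epoch passes all gates."""
    eligible = [c for c in checkpoints if passes_gates(c["metrics"], gates)]
    return max(eligible, key=lambda c: weighted_score(c["metrics"]), default=None)
```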
Example epoch table
| Epoch | Train loss | Eval loss | Teacher match | Hallucination | Fallback | Decision |
|---|---|---|---|---|---|---|
| 1 | 1.82 | 1.64 | 78.2% | 5.2% | 11.0% | too weak |
| 2 | 1.41 | 1.29 | 83.6% | 4.3% | 8.8% | close |
| 3 | 1.18 | 1.14 | 86.4% | 3.6% | 6.7% | passes |
| 4 | 1.05 | 1.12 | 86.7% | 3.5% | 6.5% | passes |
| 5 | 0.94 | 1.19 | 86.1% | 3.9% | 7.0% | regressing |
Final choice
Epoch 4 was selected, not epoch 5. Even though train loss improved at epoch 5, deployment metrics got worse.
Other solutions explored
- early stop on eval loss only
- stop on teacher-match only
- fixed number of epochs every run
- human review only after final epoch
Why the final decision won
The multi-signal checkpoint rule matched production reality best. The model is serving a business workflow, not winning a training-loss contest.
5. Cross-Cutting Solution Patterns That Repeatedly Won
5.1 Keep the teacher strong, but never trust it blindly
Final pattern:
- use teacher for breadth,
- use retrieval and verifier layers for truth control,
- use humans on policy and high-risk slices.
5.2 Protect slices, not just global averages
Global improvement can hide dangerous slice-level regressions. Final pattern:
- slice dashboards,
- promotion gates per slice,
- protected high-risk datasets.
5.3 Separate training quality from serving quality
A model can train well and serve badly. Final pattern:
- calibration stage,
- shadow traffic,
- canary routing evaluation,
- fallback cost monitoring.
5.4 Prefer scalable moderate-complexity fixes over perfect but fragile ones
At Amazon-scale, the best solution is often not the mathematically purest one. It is the one that:
- can run every week,
- is debuggable,
- fits budget,
- does not require heroics.
6. Final Tradeoff Framework Used Before Production Decision
6.1 Candidate solution scorecard
flowchart LR
A[Candidate fix] --> B[Safety score]
A --> C[Quality score]
A --> D[Latency score]
A --> E[Cost score]
A --> F[Operational simplicity]
B --> G[Decision]
C --> G
D --> G
E --> G
F --> G
Example weighted decision table
| Candidate | Safety | Quality | Latency | Cost | Operability | Weighted outcome |
|---|---|---|---|---|---|---|
| full human review | 10 | 10 | 9 | 2 | 3 | 7.2 |
| raw teacher labels | 3 | 6 | 10 | 10 | 9 | 6.5 |
| retrieval-grounded gating | 9 | 8 | 8 | 8 | 7 | 8.2 |
| post-hoc verifier only | 7 | 6 | 7 | 8 | 8 | 7.0 |
Important intuition
The winning solution is often the one with the best balanced score, not the best single-metric score.
6.2 Absolute-impact decision check
Before final rollout, ask:
- What does this percentage change mean in bad outcomes/day?
- What does this fallback rate mean in extra dollars/month?
- What does this recall drop mean in missed escalations/day?
- What does this log volume mean in storage and incident response time?
That is the difference between a good experiment and a good production decision.
7. What the Final Architecture Looked Like
flowchart TD
A[Recent production prompts] --> B[Privacy and dedupe filter]
B --> C[Teacher labeling with retrieval]
C --> D[Evidence / contradiction gate]
D --> E[Human review for high-risk slices]
E --> F[Balanced dataset builder]
F --> G[Student training]
G --> H[Calibration stage]
H --> I[Offline slice eval]
I --> J[Shadow traffic]
J --> K[Canary rollout]
K --> L[Risk-aware fallback in production]
L --> M[Tiered logging and drift monitors]
This architecture was not chosen because it is the fanciest design. It was chosen because each layer solved a failure that mattered in production.
8. Final Takeaways
8.1 The biggest lesson
The hardest part of distillation at scale is usually not the loss function. It is the quality of the teacher labels, slice protection, calibration, fallback economics, and operational visibility.
8.2 The final decision pattern
For MangaAssist in an Amazon-scale style environment, the winning pattern was:
- cleaner teacher labels over bigger dirty datasets,
- slice-safe metrics over global average wins,
- risk-aware fallback over naive confidence routing,
- weekly rolling refresh over static training snapshots,
- multi-signal checkpoint selection over loss-only stopping,
- tiered observability over log-everything debugging.
8.3 What this means for a lead engineer
A lead engineer should be able to explain for every fix:
- what failed,
- why the first obvious fix was not enough,
- what alternatives were explored,
- what tradeoff decided the winner,
- and how the same choice changes when traffic scales by 1000×.
That is the real production intuition behind knowledge distillation systems.