
MangaAssist Knowledge Distillation at Scale

Deep-Dive Solutions, Alternatives Explored, and Final Tradeoff Decisions

This document is a solution-focused companion to the failure-scenarios guide. It explains, for each major distillation failure mode, what solution was implemented, what other options were explored, and how the final tradeoff was chosen.

Any references to Amazon-scale below should be read as large-scale engineering intuition for a MangaAssist-like chatbot: very high traffic, strict latency budgets, large catalog churn, heavy observability, and small percentage errors that translate into large absolute costs. They are not claims about a private Amazon internal system.


1. What This Document Tries to Answer

A failure document tells you what can go wrong. A platform or MLOps leader also needs to understand:

  • What exact fix was put in place?
  • What simpler or cheaper alternatives were considered first?
  • Why was one solution chosen over another?
  • What was the latency, cost, accuracy, and operational tradeoff?
  • Would the same answer still be right at 50K requests/day and 50M requests/day?

This document answers those questions in a MangaAssist setting using an Amazon-scale decision style.


2. Baseline System Context

2.1 MangaAssist distillation targets

| Component | Teacher | Student | Core goal |
|---|---|---|---|
| Intent classifier | DistilBERT | TinyBERT | cut latency and memory while keeping intent quality |
| Response model | strong managed teacher | cheaper student or self-hosted model | lower cost per response without breaking policy/factuality |
| Reranker | large cross-encoder | smaller reranker | preserve ranking quality inside latency budget |

2.2 Promotion gates that matter

| Metric | Gate |
|---|---|
| teacher preference match | >= 85% |
| human win rate vs base student | >= 65% |
| catalog hallucination rate | <= 4% |
| cost per 1K responses | at least 50% lower |
| p95 latency | within component SLO |
| fallback rate | within cost budget |

2.3 Tradeoff rule used in this guide

The final decision is never based on one metric. We use this practical order:

  1. Safety / policy / factuality
  2. Latency SLO
  3. Cost reduction
  4. Quality retention
  5. Operational simplicity

That order matters. A model that is cheaper but unsafe does not ship. A model that is accurate but misses p95 latency also does not ship.


3. End-to-End Decision Map

flowchart TD
    A[Failure observed] --> B[Quantify business impact]
    B --> C[Generate candidate fixes]
    C --> D[Offline experiment]
    D --> E[Shadow traffic]
    E --> F[Canary rollout]
    F --> G[Final decision]

    G --> G1[Promote]
    G --> G2[Rollback]
    G --> G3[Defer and keep teacher]

3.1 Decision principle

At small scale, teams often choose whichever option looks best on a single offline metric. At Amazon-scale, the better question is:

Which option minimizes bad absolute outcomes per day while still meeting budget and operability targets?

A 0.3% hallucination reduction at 50M requests/day means:

50,000,000 × 0.003 = 150,000 fewer bad responses/day

That can beat a 5% cost gain if the bad responses are customer-visible and expensive.


4. Failure Scenario Deep Dives


4.1 Failure A — Teacher Hallucinations Were Being Copied Into the Student

The failure

The teacher produced confident but unsupported catalog claims. Example:

  • user: “Is Nana hardcover available?”
  • catalog truth: paperback only
  • teacher: “Yes, hardcover is in stock.”

If this enters the distillation dataset unchanged, the student learns the wrong fact pattern.

Why this became serious at scale

Assume a monthly unlabeled sampling run of 25,000,000 production prompts. If only 0.2% of teacher labels are bad:

25,000,000 × 0.002 = 50,000 corrupted targets

That is not “noise.” That is a systematic dataset poisoning source.


Solution implemented

We implemented a retrieval-grounded teacher labeling gate before dataset write.

Final implemented pipeline

flowchart LR
    A[Production prompt] --> B[Retrieve catalog evidence]
    B --> C[Teacher generation]
    C --> D[Evidence support check]
    D --> E{Supported?}
    E -->|Yes| F[Write to distillation set]
    E -->|No| G[Reject or human review]

Final rule

A teacher response was eligible only if:

  • retrieved evidence existed,
  • answer entities matched catalog fields,
  • support score passed threshold,
  • no contradiction with structured catalog facts,
  • confidence was not high while support was low.

Production-style acceptance function

write_example =
    evidence_present
    AND support_score >= 0.78
    AND contradiction_score <= 0.10
    AND NOT(high_confidence AND low_support)
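
The same rule as a minimal Python sketch. The score fields and the 0.90 / 0.50 confidence-versus-support cutoffs are illustrative assumptions; only the 0.78 and 0.10 thresholds come from the rule above.

from dataclasses import dataclass

# Thresholds from the rule above; the confidence/support cutoffs are assumed values.
SUPPORT_MIN = 0.78
CONTRADICTION_MAX = 0.10
HIGH_CONFIDENCE = 0.90
LOW_SUPPORT = 0.50

@dataclass
class TeacherLabel:
    evidence_present: bool      # retrieval returned catalog evidence
    support_score: float        # how well the answer is backed by the evidence
    contradiction_score: float  # disagreement with structured catalog facts
    teacher_confidence: float   # teacher's own confidence estimate

def accept_for_distillation(label: TeacherLabel) -> bool:
    """Return True only if the teacher output is safe to write to the dataset."""
    confident_but_unsupported = (
        label.teacher_confidence >= HIGH_CONFIDENCE
        and label.support_score < LOW_SUPPORT
    )
    return (
        label.evidence_present
        and label.support_score >= SUPPORT_MIN
        and label.contradiction_score <= CONTRADICTION_MAX
        and not confident_but_unsupported
    )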

Numerical effect

| Metric | Before | After implemented fix |
|---|---|---|
| teacher label contamination | 0.20% | 0.04% |
| student factual error on catalog slice | 1.10% | 0.58% |
| copied teacher error overlap | 0.73 | 0.29 |
| dataset acceptance rate | 100% | 86% |

Intuition

We accepted a 14% data volume drop to get a much cleaner label set. At scale, clean data beats slightly larger dirty data.


Other solutions explored

Option 1 — Pure human review of all teacher outputs

Pros
  • best label quality
  • strongest control over high-risk domains

Cons
  • too expensive
  • too slow for continuous distillation refresh
  • reviewer consistency drift

Option 2 — Train student only on human-reviewed examples

Pros
  • lowest hallucination copying risk

Cons
  • dataset too small
  • poor coverage of long-tail prompts
  • weak teacher-behavior imitation

Option 3 — Accept all teacher labels but down-weight low-confidence examples

Pros
  • simple to implement
  • keeps all data

Cons
  • does not solve confident unsupported errors
  • still leaks bad facts into student

Option 4 — Add a separate factuality verifier after training only

Pros
  • protects production inference

Cons
  • too late for dataset quality
  • student already learns bad patterns


Why the final decision won

| Option | Quality | Cost | Latency impact | Operational complexity | Final decision |
|---|---|---|---|---|---|
| full human review | best | worst | none | high | rejected |
| human-only dataset | low coverage | medium | none | medium | rejected |
| confidence down-weight only | weak | best | none | low | rejected |
| retrieval-grounded gating | strong | good | low | medium | chosen |

Final tradeoff logic

We chose retrieval-grounded gating because it had the best combined outcome:

  • large factuality gain,
  • acceptable pipeline complexity,
  • scalable to millions of examples,
  • still compatible with selective human review.

4.2 Failure B — Rare-Class Recall Collapsed After Distillation

The failure

The student became better on common intents but worse on rare important ones. For MangaAssist these could be:

  • fraud / abuse signals,
  • escalation language,
  • legal or policy-sensitive support requests,
  • refund edge cases.

Why it becomes dangerous at scale

Suppose a rare class appears in only 0.4% of traffic. At 50M requests/day:

50,000,000 × 0.004 = 200,000 rare-class events/day

If recall falls from 0.91 to 0.82:

  • before misses: 200,000 × 0.09 = 18,000
  • after misses: 200,000 × 0.18 = 36,000
  • extra misses/day: 18,000

A small recall drop becomes a large operational problem.


Solution implemented

We used a hybrid supervision strategy:

total_loss =
    0.45 * hard_label_loss
  + 0.35 * KD_soft_loss
  + 0.20 * rare_class_weighted_loss
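
A minimal PyTorch-style sketch of that mix, assuming the usual temperature-scaled KL formulation for the soft loss; the function name, temperature value, and weight tensor are illustrative, and only the 0.45 / 0.35 / 0.20 split comes from the recipe above.

import torch.nn.functional as F

def hybrid_distillation_loss(student_logits, teacher_logits, labels,
                             rare_class_weights, temperature=2.0):
    """Sketch of the 0.45 / 0.35 / 0.20 hybrid supervision mix described above.

    rare_class_weights: per-class weight tensor, >1.0 for high-risk rare intents.
    """
    # Hard-label cross entropy on ground-truth intents.
    hard_loss = F.cross_entropy(student_logits, labels)

    # KD soft loss: KL divergence between temperature-softened distributions.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Rare-class protection: same hard labels, with class weights applied.
    rare_loss = F.cross_entropy(student_logits, labels, weight=rare_class_weights)

    return 0.45 * hard_loss + 0.35 * kd_loss + 0.20 * rare_loss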

What changed technically

  1. Kept teacher soft labels for ambiguity learning.
  2. Preserved hard labels for rare-class truth.
  3. Applied class weights to high-risk rare classes.
  4. Required slice-level promotion gates, not only global accuracy.

Promotion rule added

promote only if:
- overall accuracy improves or stays flat
- rare-class recall >= baseline - 1 point
- high-risk escalation precision >= baseline

Numerical result

| Metric | Base student | KD only | Final hybrid |
|---|---|---|---|
| overall accuracy | 87.6% | 89.1% | 88.9% |
| rare-class recall | 89.8% | 82.0% | 90.7% |
| escalation precision | 93.2% | 90.4% | 93.0% |

Intuition

We accepted 0.2 points lower overall accuracy than KD-only because the rare-class protection was much better. At scale, protecting a critical 0.4% slice can matter more than squeezing out one more point of global accuracy.


Other solutions explored

Option 1 — Oversample rare classes only

Pros
  • easy
  • increases rare-class visibility

Cons
  • can overfit rare wording
  • distorts production distribution

Option 2 — Train a separate rare-class detector

Pros
  • strong specialist performance
  • safer for high-risk routing

Cons
  • extra model in serving path
  • more orchestration and monitoring

Option 3 — One-vs-rest auxiliary heads

Pros
  • explicit focus on rare critical intents

Cons
  • more complicated model training
  • extra calibration work


Why the final decision won

We chose hybrid supervision + slice gates because it gave the best balance of:

  • preserving teacher ambiguity signal,
  • protecting rare classes,
  • avoiding new serving complexity,
  • keeping one model in path.

A separate detector was kept as a fallback design, not the default.


4.3 Failure C — Student Was Cheap Offline but Triggered Fallback Explosion Online

The failure

The student passed offline quality, but online confidence was unstable. It escalated too often to the teacher, which erased cost savings.

Example cost math

Assume:

  • student cost = $0.0002 per response
  • teacher cost = $0.0030 per response
  • traffic = 10,000,000 responses/day

Planned fallback rate: 5%

Daily cost:

  • student cost: 10,000,000 × 0.0002 = $2,000
  • teacher fallback cost: 10,000,000 × 0.05 × 0.003 = $1,500
  • total = $3,500/day

Actual fallback rate: 18%

Daily cost:

  • student cost: $2,000
  • teacher fallback cost: 10,000,000 × 0.18 × 0.003 = $5,400
  • total = $7,400/day

Unexpected daily overrun:

$7,400 - $3,500 = $3,900/day

Monthly overrun:

$3,900 × 30 = $117,000/month
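
The same arithmetic as a small cost model sketch; the function and variable names are illustrative.

def daily_fallback_cost(traffic, student_cost, teacher_cost, fallback_rate):
    # Every request pays the student once; the fallback fraction also pays the teacher.
    return traffic * student_cost + traffic * fallback_rate * teacher_cost

planned = daily_fallback_cost(10_000_000, 0.0002, 0.0030, 0.05)   # $3,500/day
actual  = daily_fallback_cost(10_000_000, 0.0002, 0.0030, 0.18)   # $7,400/day
monthly_overrun = (actual - planned) * 30                          # $117,000/month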


Solution implemented

We replaced simple confidence thresholding with two-stage gating.

Final implemented logic

flowchart LR
    A[Student answer] --> B[Confidence score]
    B --> C[Risk classifier]
    C --> D{High risk?}
    D -->|Yes| E[Teacher fallback]
    D -->|No| F{Confidence < threshold?}
    F -->|Yes| G[Clarify or low-cost fallback]
    F -->|No| H[Return student answer]

Why this worked

A plain threshold sent too many harmless ambiguous cases to the teacher. The added risk classifier separated:

  • unsafe / high-risk uncertainty, from
  • benign uncertainty where a clarification question or weaker fallback was enough.
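
A minimal sketch of the two-stage gate, assuming a boolean risk signal from the risk classifier and a tunable confidence threshold; the enum names and the 0.70 cutoff are illustrative.

from enum import Enum

class Route(Enum):
    STUDENT = "return_student_answer"
    CLARIFY = "ask_clarification_or_cheap_fallback"
    TEACHER = "escalate_to_teacher"

# Illustrative threshold; the real value comes from canary tuning.
CONFIDENCE_MIN = 0.70

def route_response(student_confidence: float, is_high_risk: bool) -> Route:
    """Two-stage gate: risk first, confidence second."""
    if is_high_risk:
        # Unsafe or policy-sensitive uncertainty always goes to the teacher.
        return Route.TEACHER
    if student_confidence < CONFIDENCE_MIN:
        # Benign uncertainty gets a clarification question or a cheap fallback.
        return Route.CLARIFY
    return Route.STUDENT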

Result

| Metric | Before | After |
|---|---|---|
| fallback rate | 18.0% | 6.4% |
| teacher traffic | 1.8M/day | 0.64M/day |
| daily cost | $7,400 | $3,920 |
| unsafe false accepts | 0.34% | 0.11% |

Other solutions explored

Option 1 — Raise threshold aggressively

Pros
  • fast to change
  • reduces teacher fallback immediately

Cons
  • lets more bad student answers through
  • weak control on high-risk slices

Option 2 — Always ask a clarification question

Pros
  • cheap
  • avoids teacher call

Cons
  • hurts UX
  • unnecessary on clear cases

Option 3 — Route all low-confidence queries to a smaller intermediate model

Pros
  • reduces expensive teacher load

Cons
  • adds another serving tier
  • more operational burden


Why the final decision won

We chose risk-aware fallback because cost and safety were both first-class. It gave most of the cost benefit of aggressive thresholding without losing the protection needed for unsafe cases.


4.4 Failure D — Calibration Was Good Offline and Bad in Production

The failure

The student looked confident on validation data but was miscalibrated on new production traffic. That broke fallback routing and human escalation rules.

Solution implemented

We added a post-distillation calibration stage using a recent production-like holdout.

Final calibration steps

  1. Train student with KD.
  2. Freeze weights.
  3. Fit temperature / calibration layer on recent holdout.
  4. Re-evaluate ECE, fallback rate, and escalation precision.
  5. Promote only if routing metrics improve.
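
A minimal sketch of step 3 using single-parameter temperature scaling fit on the frozen student's holdout logits, one common way to implement this stage (the document does not mandate a specific method); the optimizer settings are illustrative.

import torch
import torch.nn.functional as F

def fit_temperature(logits: torch.Tensor, labels: torch.Tensor,
                    max_iter: int = 50) -> float:
    """Fit one temperature on a recent production-like holdout.

    logits: frozen-student logits on the holdout, shape [N, num_classes].
    labels: ground-truth class ids, shape [N].
    """
    log_t = torch.zeros(1, requires_grad=True)  # optimize log-temperature so T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return float(log_t.exp())  # divide serving logits by this value before softmax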

Numerical example

| Metric | Before calibration | After calibration |
|---|---|---|
| ECE | 0.118 | 0.034 |
| fallback over-trigger rate | 7.2% | 2.1% |
| unsafe false accept rate | 0.21% | 0.13% |

Other solutions explored

  • isotonic regression
  • Platt scaling
  • classwise thresholds only
  • no explicit calibration, rely on more KD epochs

Why the final decision won

A separate calibration stage was chosen because:

  • it is cheap,
  • it does not require retraining the whole model,
  • it directly improves routing decisions,
  • it is easy to rerun during drift refreshes.

More KD epochs alone did not reliably fix calibration drift.


4.5 Failure E — Distillation Data Became Stale as Catalog and User Behavior Changed

The failure

The student was trained on teacher outputs from an older catalog and older user query mix. After launch, the model degraded on:

  • newly released manga,
  • changing inventory,
  • new support phrasing,
  • seasonal traffic shifts.

Scale effect

At small scale, stale examples may hide inside averages. At Amazon-scale, stale data causes repeated visible errors on top traffic items.

If the top 1% of titles drive 35% of catalog questions, staleness in those entities becomes disproportionately expensive.


Solution implemented

We moved from static dataset refreshes to a rolling distillation refresh.

Final implemented approach

  • weekly sample of recent production prompts,
  • dedupe and PII filtering,
  • teacher relabeling with current retrieval state,
  • slice-balanced merge with stable historical gold set,
  • canary eval before replacing student.

Data mixing ratio chosen

new_recent_examples        = 50%
stable_human_gold          = 25%
rare_class_protected_set   = 15%
policy_and_refusal_set     = 10%
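
A small sketch of how the weekly dataset build could enforce those ratios; the pool names follow the mix above, and the sampling details are illustrative.

import random

# Mixing ratios from the block above.
MIX = {
    "new_recent_examples": 0.50,
    "stable_human_gold": 0.25,
    "rare_class_protected_set": 0.15,
    "policy_and_refusal_set": 0.10,
}

def build_weekly_dataset(pools, target_size, seed=7):
    """Sample each pool at its fixed ratio so the weekly refresh cannot
    crowd out the stable gold, rare-class, and policy sets."""
    rng = random.Random(seed)
    dataset = []
    for name, ratio in MIX.items():
        quota = int(target_size * ratio)
        pool = pools[name]
        # Sample without replacement when the pool is large enough;
        # otherwise take the whole pool and accept a slightly smaller mix.
        take = rng.sample(pool, quota) if len(pool) >= quota else list(pool)
        dataset.extend(take)
    rng.shuffle(dataset)
    return dataset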

Why not 100% recent data?

Because it improved freshness but hurt stability and rare-case memory.

Result

| Metric | Static monthly refresh | Rolling weekly refresh |
|---|---|---|
| new-title factual accuracy | 91.0% | 95.8% |
| rare-case stability | 94.4% | 94.1% |
| canary regressions/month | 4 | 1 |

Other solutions explored

Option 1 — Monthly full rebuild only

Pros
  • simple pipeline

Cons
  • staleness window too large

Option 2 — Daily full refresh

Pros
  • freshest labels

Cons
  • expensive
  • unstable
  • noisy canary outcomes

Option 3 — Retrieval-only fix, no student refresh

Pros
  • cheap if the answer path is retrieval-heavy

Cons
  • does not fix classifier, reranker, or routing behavior


Why the final decision won

Weekly rolling refresh with a protected stable core gave the best balance between:

  • freshness,
  • stability,
  • rare-case retention,
  • compute cost.

4.6 Failure F — Training Was Too Slow or Too Expensive at Large Scale

The failure

Teacher inference plus student training became the bottleneck. This is common when:

  • teacher is large,
  • training set is tens of millions of examples,
  • multi-stage KD is used,
  • experiments are repeated often.

Solution implemented

We split the pipeline into teacher-label caching + student-only retraining loops.

Final implemented design

flowchart LR
    A[Prompt sample] --> B[Teacher batch inference]
    B --> C[S3 label cache]
    C --> D[Dataset build]
    D --> E[Student train loop]
    E --> F[Eval and select checkpoint]

Why this matters

Without caching, every training experiment re-pays teacher inference cost. With caching, hyperparameter sweeps mostly pay only student-side compute.
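
A minimal sketch of the label-cache idea, using a local directory to stand in for the S3 label cache; keys include the teacher version and retrieval snapshot so refreshed labels never collide with stale ones. All names here are illustrative.

import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("teacher_label_cache")  # stands in for the S3 label cache

def cache_key(prompt, teacher_version, retrieval_snapshot):
    """Key on everything that changes the label, so stale entries never collide."""
    raw = json.dumps([prompt, teacher_version, retrieval_snapshot], sort_keys=True)
    return hashlib.sha256(raw.encode()).hexdigest()

def get_teacher_label(prompt, teacher_version, retrieval_snapshot, teacher_fn):
    """Return a cached label if present; otherwise pay teacher inference once."""
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / f"{cache_key(prompt, teacher_version, retrieval_snapshot)}.json"
    if path.exists():
        return json.loads(path.read_text())
    label = teacher_fn(prompt)          # the expensive call
    path.write_text(json.dumps(label))
    return label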

Numerical example

Assume:

  • teacher labeling 10M prompts costs $18,000
  • one student training run costs $900
  • 8 experiment variants are required

Without caching:

8 × ($18,000 + $900) = $151,200

With teacher-label caching:

$18,000 + 8 × $900 = $25,200

Savings:

$151,200 - $25,200 = $126,000


Other solutions explored

  • online teacher queries during training
  • smaller teacher for all examples
  • partial-labeling only for ambiguous slice
  • synthetic augmentation instead of real teacher labels

Why the final decision won

Caching won because it created the best experimentation speed per dollar. We still used targeted high-quality teacher relabeling for selected slices when needed.


4.7 Failure G — Logs Were So Large That Incidents Became Harder to Debug

The failure

At large traffic volumes, teams often log too much:

  • full prompts,
  • full teacher outputs,
  • full student outputs,
  • top-k distributions,
  • retrieval evidence,
  • model internals,
  • trace metadata.

This improves visibility at first, then becomes a storage and debugging problem.

Simple scale math

Assume average logged payload = 6 KB/event. At 50M events/day:

50,000,000 × 6 KB = 300,000,000 KB/day ≈ 300 GB/day

Monthly:

300 GB × 30 = 9 TB/month

If copied across multiple environments and retention tiers, costs rise quickly.


Solution implemented

We changed to tiered logging.

Final logging policy

| Event type | Log detail |
|---|---|
| healthy normal traffic | compact summary only |
| low-confidence traffic | add top-k scores and routing reason |
| policy or factual error slice | add evidence and verifier output |
| canary / incident mode | expanded debug logs |
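
A minimal sketch of that policy as a log-record builder; the field names and the 0.70 low-confidence cutoff are placeholder assumptions for the real event schema.

def build_log_record(event, incident_mode=False):
    """Tiered logging: start compact, add detail only when the event earns it."""
    record = {  # tier 1: always logged
        "request_id": event["request_id"],
        "intent": event["intent"],
        "route": event["route"],
        "latency_ms": event["latency_ms"],
    }
    if event["confidence"] < 0.70:  # tier 2: low-confidence traffic
        record["top_k_scores"] = event["top_k_scores"]
        record["routing_reason"] = event["routing_reason"]
    if event.get("policy_or_factual_error"):  # tier 3: error slices
        record["evidence"] = event["evidence"]
        record["verifier_output"] = event["verifier_output"]
    if incident_mode:  # tier 4: canary / incident debugging
        record["debug"] = event.get("debug_payload")
    return record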

Result

| Metric | Before | After |
|---|---|---|
| avg log payload | 6.0 KB | 1.4 KB |
| daily log volume at 50M requests | 300 GB | 70 GB |
| time to find routing root cause | 2.5 hours | 38 minutes |

Other solutions explored

  • full logs for everything with short retention
  • heavy sampling only
  • no prompt logging at all
  • separate debug replica for richer traces

Why the final decision won

Tiered logging preserved enough signal for failures while keeping cost and noise manageable. Pure heavy sampling hid rare but important incidents. Full logs were too expensive and slow to work with.


4.8 Failure H — We Were Stopping Too Late or Too Early

The failure

The team initially watched training loss too closely and business metrics too little. That caused two types of mistakes:

  • stop too early because eval loss looked flat,
  • stop too late because training loss kept improving while hallucination and calibration worsened.

Solution implemented

We switched from single-metric stopping to multi-signal checkpoint selection.

Final checkpoint selection rule

A checkpoint was eligible only if it satisfied all:

hallucination_rate <= gate
teacher_preference_match >= gate
human_win_rate >= gate
fallback_rate <= budget
rare_class_recall >= protected threshold
p95_latency within SLO

If multiple checkpoints passed, select the one with best weighted score.

Example weighted score

score =
  0.30 * teacher_preference_match
+ 0.20 * human_win_rate
- 0.20 * hallucination_rate
- 0.10 * fallback_rate
+ 0.10 * rare_class_recall
+ 0.10 * cost_reduction
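
A minimal sketch combining the eligibility gates and the weighted score into a selection function. Gate values marked as illustrative are assumptions; the others restate the promotion gates from section 2.2.

GATES = {
    "teacher_preference_match_min": 0.85,
    "human_win_rate_min": 0.65,
    "hallucination_rate_max": 0.04,
    "fallback_rate_max": 0.07,        # illustrative fallback cost budget
    "rare_class_recall_min": 0.888,   # illustrative: 89.8% baseline minus 1 point
    "p95_latency_ms_max": 120,        # illustrative component SLO
}

def eligible(ckpt):
    """A checkpoint must satisfy every gate to be considered at all."""
    return (
        ckpt["teacher_preference_match"] >= GATES["teacher_preference_match_min"]
        and ckpt["human_win_rate"] >= GATES["human_win_rate_min"]
        and ckpt["hallucination_rate"] <= GATES["hallucination_rate_max"]
        and ckpt["fallback_rate"] <= GATES["fallback_rate_max"]
        and ckpt["rare_class_recall"] >= GATES["rare_class_recall_min"]
        and ckpt["p95_latency_ms"] <= GATES["p95_latency_ms_max"]
    )

def weighted_score(ckpt):
    """Tie-breaker using the weighting above."""
    return (0.30 * ckpt["teacher_preference_match"]
            + 0.20 * ckpt["human_win_rate"]
            - 0.20 * ckpt["hallucination_rate"]
            - 0.10 * ckpt["fallback_rate"]
            + 0.10 * ckpt["rare_class_recall"]
            + 0.10 * ckpt["cost_reduction"])

def select_checkpoint(checkpoints):
    """Return the best eligible checkpoint, or None to keep the current model."""
    passing = [c for c in checkpoints if eligible(c)]
    return max(passing, key=weighted_score) if passing else None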

Example epoch table

| Epoch | Train loss | Eval loss | Teacher match | Hallucination | Fallback | Decision |
|---|---|---|---|---|---|---|
| 1 | 1.82 | 1.64 | 78.2% | 5.2% | 11.0% | too weak |
| 2 | 1.41 | 1.29 | 83.6% | 4.3% | 8.8% | close |
| 3 | 1.18 | 1.14 | 86.4% | 3.6% | 6.7% | passes |
| 4 | 1.05 | 1.12 | 86.7% | 3.5% | 6.5% | passes |
| 5 | 0.94 | 1.19 | 86.1% | 3.9% | 7.0% | regressing |

Final choice

Epoch 4 was selected, not epoch 5. Even though train loss improved at epoch 5, deployment metrics got worse.


Other solutions explored

  • early stop on eval loss only
  • stop on teacher-match only
  • fixed number of epochs every run
  • human review only after final epoch

Why the final decision won

The multi-signal checkpoint rule matched production reality best. The model is serving a business workflow, not winning a training-loss contest.


5. Cross-Cutting Solution Patterns That Repeatedly Won

5.1 Keep the teacher strong, but never trust it blindly

Final pattern:

  • use teacher for breadth,
  • use retrieval and verifier layers for truth control,
  • use humans on policy and high-risk slices.

5.2 Protect slices, not just global averages

Global improvement can hide dangerous slice-level regressions. Final pattern:

  • slice dashboards,
  • promotion gates per slice,
  • protected high-risk datasets.

5.3 Separate training quality from serving quality

A model can train well and serve badly. Final pattern:

  • calibration stage,
  • shadow traffic,
  • canary routing evaluation,
  • fallback cost monitoring.

5.4 Prefer scalable moderate-complexity fixes over perfect but fragile ones

At Amazon-scale, the best solution is often not the mathematically purest one. It is the one that:

  • can run every week,
  • is debuggable,
  • fits budget,
  • does not require heroics.

6. Final Tradeoff Framework Used Before Production Decision

6.1 Candidate solution scorecard

flowchart LR
    A[Candidate fix] --> B[Safety score]
    A --> C[Quality score]
    A --> D[Latency score]
    A --> E[Cost score]
    A --> F[Operational simplicity]
    B --> G[Decision]
    C --> G
    D --> G
    E --> G
    F --> G

Example weighted decision table

| Candidate | Safety | Quality | Latency | Cost | Operability | Weighted outcome |
|---|---|---|---|---|---|---|
| full human review | 10 | 10 | 9 | 2 | 3 | 7.2 |
| raw teacher labels | 3 | 6 | 10 | 10 | 9 | 6.5 |
| retrieval-grounded gating | 9 | 8 | 8 | 8 | 7 | 8.2 |
| post-hoc verifier only | 7 | 6 | 7 | 8 | 8 | 7.0 |
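
A small sketch of how such a scorecard could be computed. The dimension weights below are illustrative assumptions that reflect the priority order in section 2.3 (the table above does not state the exact weights used), so recomputed outcomes will only approximately match the column shown.

# Illustrative weights; safety first, then quality, latency, cost, operability.
WEIGHTS = {"safety": 0.30, "quality": 0.25, "latency": 0.15, "cost": 0.15, "operability": 0.15}

def weighted_outcome(scores):
    """Collapse 0-10 dimension scores into one comparable number."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

# Example: the retrieval-grounded gating row from the table.
gating = {"safety": 9, "quality": 8, "latency": 8, "cost": 8, "operability": 7}
print(weighted_outcome(gating))  # roughly 8.15 with these assumed weights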

Important intuition

The winning solution is often the one with the best balanced score, not the best single-metric score.


6.2 Absolute-impact decision check

Before final rollout, ask:

  • What does this percentage change mean in bad outcomes/day?
  • What does this fallback rate mean in extra dollars/month?
  • What does this recall drop mean in missed escalations/day?
  • What does this log volume mean in storage and incident response time?

That is the difference between a good experiment and a good production decision.


7. What the Final Architecture Looked Like

flowchart TD
    A[Recent production prompts] --> B[Privacy and dedupe filter]
    B --> C[Teacher labeling with retrieval]
    C --> D[Evidence / contradiction gate]
    D --> E[Human review for high-risk slices]
    E --> F[Balanced dataset builder]
    F --> G[Student training]
    G --> H[Calibration stage]
    H --> I[Offline slice eval]
    I --> J[Shadow traffic]
    J --> K[Canary rollout]
    K --> L[Risk-aware fallback in production]
    L --> M[Tiered logging and drift monitors]

This architecture was chosen not because it is the fanciest design, but because each layer solved a failure that mattered in production.


8. Final Takeaways

8.1 The biggest lesson

The hardest part of distillation at scale is usually not the loss function. It is the quality of the teacher labels, slice protection, calibration, fallback economics, and operational visibility.

8.2 The final decision pattern

For MangaAssist in an Amazon-scale style environment, the winning pattern was:

  • cleaner teacher labels over bigger dirty datasets,
  • slice-safe metrics over global average wins,
  • risk-aware fallback over naive confidence routing,
  • weekly rolling refresh over static training snapshots,
  • multi-signal checkpoint selection over loss-only stopping,
  • tiered observability over log-everything debugging.

8.3 What this means for a lead engineer

A lead engineer should be able to explain for every fix:

  • what failed,
  • why the first obvious fix was not enough,
  • what alternatives were explored,
  • what tradeoff decided the winner,
  • and how the same choice changes when traffic scales by 1000×.

That is the real production intuition behind knowledge distillation systems.