MangaAssist Knowledge Distillation at Scale
Deep-Dive Solutions, Alternatives Explored, and Final Tradeoff Decisions
This document is a solution-focused companion to the failure-scenarios guide. It explains, for each major distillation failure mode, what solution was implemented, what other options were explored, and how the final tradeoff was chosen.
Any references to Amazon-scale below should be read as large-scale engineering intuition for a MangaAssist-like chatbot: very high traffic, strict latency budgets, large catalog churn, heavy observability, and small percentage errors that become expensive in absolute terms. They are not claims about a private Amazon internal system.
1. What This Document Tries to Answer
A failure document tells you what can go wrong. A platform or MLOps leader also needs to understand:
- What exact fix was put in place?
- What simpler or cheaper alternatives were considered first?
- Why was one solution chosen over another?
- What was the latency, cost, accuracy, and operational tradeoff?
- Would the same answer still be right at 50K requests/day and 50M requests/day?
This document answers those questions in a MangaAssist setting using an Amazon-scale decision style.
2. Baseline System Context
2.1 MangaAssist distillation targets
| Component | Teacher | Student | Core goal |
|---|---|---|---|
| Intent classifier | DistilBERT | TinyBERT | cut latency and memory while keeping intent quality |
| Response model | strong managed teacher | cheaper student or self-hosted model | lower cost per response without breaking policy/factuality |
| Reranker | large cross-encoder | smaller reranker | preserve ranking quality inside latency budget |
2.2 Promotion gates that matter
| Metric | Gate |
|---|---|
| teacher preference match | >= 85% |
| human win rate vs base student | >= 65% |
| catalog hallucination rate | <= 4% |
| cost per 1K responses | at least 50% lower |
| p95 latency | within component SLO |
| fallback rate | within cost budget |
2.3 Tradeoff rule used in this guide
The final decision is never based on one metric. We use this practical order:
- Safety / policy / factuality
- Latency SLO
- Cost reduction
- Quality retention
- Operational simplicity
That order matters. A model that is cheaper but unsafe does not ship. A model that is accurate but misses p95 latency also does not ship.
3. End-to-End Decision Map
flowchart TD
A[Failure observed] --> B[Quantify business impact]
B --> C[Generate candidate fixes]
C --> D[Offline experiment]
D --> E[Shadow traffic]
E --> F[Canary rollout]
F --> G[Final decision]
G --> G1[Promote]
G --> G2[Rollback]
G --> G3[Defer and keep teacher]
3.1 Decision principle
At small scale, teams often pick whichever option has the best offline metric. At Amazon-scale, the better question is:
Which option minimizes bad absolute outcomes per day while still meeting budget and operability targets?
A 0.3% hallucination reduction at 50M requests/day means:
50,000,000 × 0.003 = 150,000 fewer bad responses/day
That can beat a 5% cost gain if the bad responses are customer-visible and expensive.
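The same arithmetic generalizes into a one-line helper, sketched below in Python; the traffic and rate figures are simply the assumptions from the example above:

```python
def absolute_impact_per_day(requests_per_day: int, rate_delta: float) -> float:
    """Convert a rate change (expressed as a fraction) into absolute outcomes per day."""
    return requests_per_day * rate_delta

# A 0.3-point hallucination reduction at 50M requests/day:
print(absolute_impact_per_day(50_000_000, 0.003))  # 150000.0
```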
4. Failure Scenario Deep Dives
4.1 Failure A — Teacher Hallucinations Were Being Copied Into the Student
The failure
The teacher produced confident but unsupported catalog claims. Example:
- user: “Is Nana hardcover available?”
- catalog truth: paperback only
- teacher: “Yes, hardcover is in stock.”
If this enters the distillation dataset unchanged, the student learns the wrong fact pattern.
Why this became serious at scale
Assume a monthly unlabeled sampling run of 25,000,000 production prompts. If only 0.2% of teacher labels are bad:
25,000,000 × 0.002 = 50,000 corrupted targets
That is not “noise.” That is a systematic dataset poisoning source.
Solution implemented
We implemented a retrieval-grounded teacher labeling gate before dataset write.
Final implemented pipeline
flowchart LR
A[Production prompt] --> B[Retrieve catalog evidence]
B --> C[Teacher generation]
C --> D[Evidence support check]
D --> E{Supported?}
E -->|Yes| F[Write to distillation set]
E -->|No| G[Reject or human review]
Final rule
A teacher response was eligible only if:
- retrieved evidence existed,
- answer entities matched catalog fields,
- support score passed threshold,
- no contradiction with structured catalog facts,
- confidence was not high while support was low.
Production-style acceptance function
write_example =
evidence_present
AND support_score >= 0.78
AND contradiction_score <= 0.10
AND NOT(high_confidence AND low_support)
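For concreteness, here is a minimal Python sketch of the same acceptance rule. The support and contradiction cutoffs come from the rule above; the two thresholds defining "high confidence" and "low support" are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class TeacherLabel:
    evidence_present: bool
    support_score: float        # evidence-support score in [0, 1]
    contradiction_score: float  # contradiction vs structured catalog facts
    confidence: float           # teacher self-reported confidence

def accept_for_distillation(label: TeacherLabel,
                            high_conf: float = 0.90,    # assumed cutoff
                            low_support: float = 0.50,  # assumed cutoff
                            ) -> bool:
    """Reject confident-but-unsupported teacher answers before dataset write."""
    confident_but_unsupported = (label.confidence >= high_conf
                                 and label.support_score < low_support)
    return (label.evidence_present
            and label.support_score >= 0.78
            and label.contradiction_score <= 0.10
            and not confident_but_unsupported)
```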
Numerical effect
| Metric | Before | After implemented fix |
|---|---|---|
| teacher label contamination | 0.20% | 0.04% |
| student factual error on catalog slice | 1.10% | 0.58% |
| copied teacher error overlap | 0.73 | 0.29 |
| dataset acceptance rate | 100% | 86% |
Intuition
We accepted a 14% data volume drop to get a much cleaner label set. At scale, clean data beats slightly larger dirty data.
Other solutions explored
Option 1 — Pure human review of all teacher outputs
Pros:
- best label quality
- strongest control over high-risk domains

Cons:
- too expensive
- too slow for continuous distillation refresh
- reviewer consistency drift
Option 2 — Train student only on human-reviewed examples
Pros:
- lowest hallucination copying risk

Cons:
- dataset too small
- poor coverage of long-tail prompts
- weak teacher-behavior imitation
Option 3 — Accept all teacher labels but down-weight low-confidence examples
Pros:
- simple to implement
- keeps all data

Cons:
- does not solve confident unsupported errors
- still leaks bad facts into student
Option 4 — Add a separate factuality verifier after training only
Pros:
- protects production inference

Cons:
- too late for dataset quality
- student already learns bad patterns
Why the final decision won
| Option | Quality | Cost | Latency impact | Operational complexity | Final decision |
|---|---|---|---|---|---|
| full human review | best | worst | none | high | rejected |
| human-only dataset | low coverage | medium | none | medium | rejected |
| confidence down-weight only | weak | best | none | low | rejected |
| retrieval-grounded gating | strong | good | low | medium | chosen |
Final tradeoff logic
We chose retrieval-grounded gating because it had the best combined outcome:
- large factuality gain,
- acceptable pipeline complexity,
- scalable to millions of examples,
- still compatible with selective human review.
4.2 Failure B — Rare-Class Recall Collapsed After Distillation
The failure
The student became better on common intents but worse on rare important ones. For MangaAssist these could be:
- fraud / abuse signals,
- escalation language,
- legal or policy-sensitive support requests,
- refund edge cases.
Why it becomes dangerous at scale
Suppose a rare class appears in only 0.4% of traffic. At 50M requests/day:
50,000,000 × 0.004 = 200,000 rare-class events/day
If recall falls from 0.91 to 0.82:
- before misses: 200,000 × 0.09 = 18,000
- after misses: 200,000 × 0.18 = 36,000
- extra misses/day: 18,000
A small recall drop becomes a large operational problem.
Solution implemented
We used a hybrid supervision strategy:
total_loss =
0.45 * hard_label_loss
+ 0.35 * KD_soft_loss
+ 0.20 * rare_class_weighted_loss
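As a PyTorch sketch, the hybrid objective could look like the following; the temperature T and the class-weight vector are tunable assumptions, not values from this document:

```python
import torch
import torch.nn.functional as F

def hybrid_kd_loss(student_logits: torch.Tensor,
                   teacher_logits: torch.Tensor,
                   targets: torch.Tensor,
                   class_weights: torch.Tensor,
                   T: float = 2.0) -> torch.Tensor:
    """0.45 * hard CE + 0.35 * temperature-scaled KD + 0.20 * rare-class-weighted CE."""
    hard = F.cross_entropy(student_logits, targets)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)  # standard KD temperature scaling
    rare = F.cross_entropy(student_logits, targets, weight=class_weights)
    return 0.45 * hard + 0.35 * soft + 0.20 * rare
```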
What changed technically
- Kept teacher soft labels for ambiguity learning.
- Preserved hard labels for rare-class truth.
- Applied class weights to high-risk rare classes.
- Required slice-level promotion gates, not only global accuracy.
Promotion rule added
promote only if:
- overall accuracy improves or stays flat
- rare-class recall >= baseline - 1 point
- high-risk escalation precision >= baseline
Numerical result
| Metric | Base student | KD only | Final hybrid |
|---|---|---|---|
| overall accuracy | 87.6% | 89.1% | 88.9% |
| rare-class recall | 89.8% | 82.0% | 90.7% |
| escalation precision | 93.2% | 90.4% | 93.0% |
Intuition
We accepted overall accuracy 0.2 points lower than KD-only because the rare-class protection was much better. At scale, protecting a critical 0.4% slice can matter more than squeezing out one more point of global accuracy.
Other solutions explored
Option 1 — Oversample rare classes only
Pros:
- easy
- increases rare-class visibility

Cons:
- can overfit rare wording
- distorts production distribution
Option 2 — Train a separate rare-class detector
Pros:
- strong specialist performance
- safer for high-risk routing

Cons:
- extra model in serving path
- more orchestration and monitoring
Option 3 — One-vs-rest auxiliary heads
Pros:
- explicit focus on rare critical intents

Cons:
- more complicated model training
- extra calibration work
Why the final decision won
We chose hybrid supervision + slice gates because it gave the best balance of:
- preserving teacher ambiguity signal,
- protecting rare classes,
- avoiding new serving complexity,
- keeping one model in path.
A separate detector was kept as a fallback design, not the default.
4.3 Failure C — Student Was Cheap Offline but Triggered Fallback Explosion Online
The failure
The student passed offline quality, but online confidence was unstable. It escalated too often to the teacher, which erased cost savings.
Example cost math
Assume:
- student cost = $0.0002 per response
- teacher cost = $0.0030 per response
- traffic = 10,000,000 responses/day
Planned fallback rate: 5%
Daily cost:
- student cost: 10,000,000 × $0.0002 = $2,000
- teacher fallback cost: 10,000,000 × 0.05 × $0.003 = $1,500
- total = $3,500/day
Actual fallback rate: 18%
Daily cost:
- student cost: $2,000
- teacher fallback cost: 10,000,000 × 0.18 × $0.003 = $5,400
- total = $7,400/day
Unexpected daily overrun:
$7,400 - $3,500 = $3,900/day
Monthly overrun:
$3,900 × 30 = $117,000/month
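The overrun math compresses into a one-line cost model, sketched here with the per-response costs assumed above:

```python
def daily_serving_cost(traffic: int, fallback_rate: float,
                       student_cost: float = 0.0002,
                       teacher_cost: float = 0.0030) -> float:
    """Every request pays the student; fallback requests additionally pay the teacher."""
    return traffic * (student_cost + fallback_rate * teacher_cost)

planned = daily_serving_cost(10_000_000, 0.05)   # $3,500/day
actual = daily_serving_cost(10_000_000, 0.18)    # $7,400/day
monthly_overrun = (actual - planned) * 30        # $117,000/month
```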
Solution implemented
We replaced simple confidence thresholding with two-stage gating.
Final implemented logic
flowchart LR
A[Student answer] --> B[Confidence score]
B --> C[Risk classifier]
C --> D{High risk?}
D -->|Yes| E[Teacher fallback]
D -->|No| F{Confidence < threshold?}
F -->|Yes| G[Clarify or low-cost fallback]
F -->|No| H[Return student answer]
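A minimal Python sketch of the two-stage gate follows; the risk and confidence thresholds are illustrative placeholders, not tuned production values:

```python
from enum import Enum, auto

class Route(Enum):
    TEACHER_FALLBACK = auto()
    CLARIFY = auto()
    RETURN_STUDENT = auto()

def route(confidence: float, risk_score: float,
          risk_threshold: float = 0.5,
          conf_threshold: float = 0.7) -> Route:
    """Risk check first, confidence check second."""
    if risk_score >= risk_threshold:   # unsafe / high-risk uncertainty
        return Route.TEACHER_FALLBACK
    if confidence < conf_threshold:    # benign uncertainty
        return Route.CLARIFY
    return Route.RETURN_STUDENT
```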
Why this worked
A plain threshold sent too many harmless ambiguous cases to the teacher. The added risk classifier separated:
- unsafe / high-risk uncertainty, from
- benign uncertainty where a clarification question or weaker fallback was enough.
Result
| Metric | Before | After |
|---|---|---|
| fallback rate | 18.0% | 6.4% |
| teacher traffic | 1.8M/day | 0.64M/day |
| daily cost | $7,400 | $3,920 |
| unsafe false accepts | 0.34% | 0.11% |
Other solutions explored
Option 1 — Raise threshold aggressively
Pros:
- fast to change
- reduces teacher fallback immediately

Cons:
- lets more bad student answers through
- weak control on high-risk slices
Option 2 — Always ask a clarification question
Pros:
- cheap
- avoids teacher call

Cons:
- hurts UX
- unnecessary on clear cases
Option 3 — Route all low-confidence queries to a smaller intermediate model
Pros:
- reduces expensive teacher load

Cons:
- adds another serving tier
- more operational burden
Why the final decision won
We chose risk-aware fallback because cost and safety were both first-class. It gave most of the cost benefit of aggressive thresholding without losing the protection needed for unsafe cases.
4.4 Failure D — Calibration Was Good Offline and Bad in Production
The failure
The student looked confident on validation data but was miscalibrated on new production traffic. That broke fallback routing and human escalation rules.
Solution implemented
We added a post-distillation calibration stage using a recent production-like holdout.
Final calibration steps
- Train student with KD.
- Freeze weights.
- Fit temperature / calibration layer on recent holdout.
- Re-evaluate ECE, fallback rate, and escalation precision.
- Promote only if routing metrics improve.
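A minimal temperature-scaling sketch in PyTorch, assuming the frozen student's holdout logits and labels have already been collected:

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Fit one scalar temperature on a recent holdout by minimizing NLL.
    logits: [N, C] frozen-student outputs; labels: [N] int64 targets."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so T stays positive
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_t.exp().item()
```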
Numerical example
| Metric | Before calibration | After calibration |
|---|---|---|
| ECE | 0.118 | 0.034 |
| fallback over-trigger rate | 7.2% | 2.1% |
| unsafe false accept rate | 0.21% | 0.13% |
Other solutions explored
- isotonic regression
- Platt scaling
- classwise thresholds only
- no explicit calibration, rely on more KD epochs
Why the final decision won
A separate calibration stage was chosen because:
- it is cheap,
- it does not require retraining the whole model,
- it directly improves routing decisions,
- it is easy to rerun during drift refreshes.
More KD epochs alone did not reliably fix calibration drift.
4.5 Failure E — Distillation Data Became Stale as Catalog and User Behavior Changed
The failure
The student was trained on teacher outputs from an older catalog and older user query mix. After launch, the model degraded on:
- newly released manga,
- changing inventory,
- new support phrasing,
- seasonal traffic shifts.
Scale effect
At small scale, stale examples may hide inside averages. At Amazon-scale, stale data causes repeated visible errors on top traffic items.
If the top 1% of titles drive 35% of catalog questions, staleness in those entities becomes disproportionately expensive.
Solution implemented
We moved from static dataset refreshes to a rolling distillation refresh.
Final implemented approach
- weekly sample of recent production prompts,
- dedupe and PII filtering,
- teacher relabeling with current retrieval state,
- slice-balanced merge with stable historical gold set,
- canary eval before replacing student.
Data mixing ratio chosen
new_recent_examples = 50%
stable_human_gold = 25%
rare_class_protected_set = 15%
policy_and_refusal_set = 10%
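A sketch of the slice-balanced merge under these ratios; the pool names are placeholders, and sampling with replacement is a simplification:

```python
import random

MIX = {"recent": 0.50, "human_gold": 0.25,
       "rare_protected": 0.15, "policy_refusal": 0.10}

def build_refresh_dataset(pools: dict, total: int = 100_000, seed: int = 7) -> list:
    """Merge example pools according to the fixed mixing ratios above."""
    rng = random.Random(seed)
    dataset = []
    for name, frac in MIX.items():
        # sample each slice to its target share of the merged dataset
        dataset.extend(rng.choices(pools[name], k=int(total * frac)))
    rng.shuffle(dataset)
    return dataset
```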
Why not 100% recent data?
Because it improved freshness but hurt stability and rare-case memory.
Result
| Metric | Static monthly refresh | Rolling weekly refresh |
|---|---|---|
| new-title factual accuracy | 91.0% | 95.8% |
| rare-case stability | 94.4% | 94.1% |
| canary regressions/month | 4 | 1 |
Other solutions explored
Option 1 — Monthly full rebuild only
Pros:
- simple pipeline

Cons:
- staleness window too large
Option 2 — Daily full refresh
Pros:
- freshest labels

Cons:
- expensive
- unstable
- noisy canary outcomes
Option 3 — Retrieval-only fix, no student refresh
Pros:
- cheap if the answer path is retrieval-heavy

Cons:
- does not fix classifier, reranker, or routing behavior
Why the final decision won
Weekly rolling refresh with a protected stable core gave the best balance between:
- freshness,
- stability,
- rare-case retention,
- compute cost.
4.6 Failure F — Training Was Too Slow or Too Expensive at Large Scale
The failure
Teacher inference plus student training became the bottleneck. This is common when:
- teacher is large,
- training set is tens of millions of examples,
- multi-stage KD is used,
- experiments are repeated often.
Solution implemented
We split the pipeline into teacher-label caching + student-only retraining loops.
Final implemented design
flowchart LR
A[Prompt sample] --> B[Teacher batch inference]
B --> C[S3 label cache]
C --> D[Dataset build]
D --> E[Student train loop]
E --> F[Eval and select checkpoint]
Why this matters
Without caching, every training experiment re-pays teacher inference cost. With caching, hyperparameter sweeps mostly pay only student-side compute.
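A sketch of the caching idea, with a local directory standing in for the S3 label cache; in practice the cache key should also include the teacher model version so labels are not reused across teacher upgrades:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("teacher_label_cache")  # local stand-in for the S3 label cache

def cached_teacher_label(prompt: str, teacher_fn) -> dict:
    """Pay teacher inference once per unique prompt; reuse across experiments."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    label = teacher_fn(prompt)           # the expensive teacher call
    path.write_text(json.dumps(label))
    return label
```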
Numerical example
Assume:
- teacher labeling 10M prompts costs $18,000
- one student training run costs $900
- 8 experiment variants are required
Without caching:
8 × ($18,000 + $900) = $151,200
With teacher-label caching:
$18,000 + 8 × $900 = $25,200
Savings:
$151,200 - $25,200 = $126,000
Other solutions explored
- online teacher queries during training
- smaller teacher for all examples
- partial-labeling only for ambiguous slice
- synthetic augmentation instead of real teacher labels
Why the final decision won
Caching won because it created the best experimentation speed per dollar. We still used targeted high-quality teacher relabeling for selected slices when needed.
4.7 Failure G — Logs Were So Large That Incidents Became Harder to Debug
The failure
At large traffic volumes, teams often log too much:
- full prompts,
- full teacher outputs,
- full student outputs,
- top-k distributions,
- retrieval evidence,
- model internals,
- trace metadata.
This improves visibility at first, then becomes a storage and debugging problem.
Simple scale math
Assume average logged payload = 6 KB/event. At 50M events/day:
50,000,000 × 6 KB = 300,000,000 KB/day ≈ 300 GB/day
Monthly:
300 GB × 30 = 9 TB/month
If copied across multiple environments and retention tiers, costs rise quickly.
Solution implemented
We changed to tiered logging.
Final logging policy
| Event type | Log detail |
|---|---|
| healthy normal traffic | compact summary only |
| low-confidence traffic | add top-k scores and routing reason |
| policy or factual error slice | add evidence and verifier output |
| canary / incident mode | expanded debug logs |
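A sketch of how the tiers could translate into code; the field names and the 0.7 confidence cutoff are illustrative, not a production schema:

```python
import logging

log = logging.getLogger("mangaassist.serving")

def log_event(event: dict, incident_mode: bool = False) -> None:
    """Emit more detail only when the event tier warrants it."""
    record = {"request_id": event["request_id"], "route": event["route"]}
    if incident_mode:                            # canary / incident mode
        record = dict(event)                     # expanded debug payload
    elif event.get("policy_or_factual_error"):
        record.update(evidence=event["evidence"],
                      verifier=event["verifier_output"])
    elif event["confidence"] < 0.7:              # low-confidence traffic
        record.update(top_k=event["top_k"],
                      routing_reason=event["routing_reason"])
    log.info(record)
```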
Result
| Metric | Before | After |
|---|---|---|
| avg log payload | 6.0 KB | 1.4 KB |
| daily log volume at 50M requests | 300 GB | 70 GB |
| time to find routing root cause | 2.5 hours | 38 minutes |
Other solutions explored
- full logs for everything with short retention
- heavy sampling only
- no prompt logging at all
- separate debug replica for richer traces
Why the final decision won
Tiered logging preserved enough signal for failures while keeping cost and noise manageable. Pure heavy sampling hid rare but important incidents. Full logs were too expensive and slow to work with.
4.8 Failure H — We Were Stopping Too Late or Too Early
The failure
The team initially watched training loss too closely and business metrics too little. That caused two types of mistakes:
- stop too early because eval loss looked flat,
- stop too late because training loss kept improving while hallucination and calibration worsened.
Solution implemented
We switched from single-metric stopping to multi-signal checkpoint selection.
Final checkpoint selection rule
A checkpoint was eligible only if it satisfied all:
hallucination_rate <= gate
teacher_preference_match >= gate
human_win_rate >= gate
fallback_rate <= budget
rare_class_recall >= protected threshold
p95_latency within SLO
If multiple checkpoints passed, we selected the one with the best weighted score.
Example weighted score
score =
0.30 * teacher_preference_match
+ 0.20 * human_win_rate
- 0.20 * hallucination_rate
- 0.10 * fallback_rate
+ 0.10 * rare_class_recall
+ 0.10 * cost_reduction
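Combining the gates and the weighted score into a selection routine might look like this; metric names mirror the rule above, and the gate values themselves are stand-ins:

```python
def passes_gates(m: dict, gates: dict) -> bool:
    """A checkpoint is eligible only if every gate holds."""
    return (m["hallucination_rate"] <= gates["hallucination_rate"]
            and m["teacher_preference_match"] >= gates["teacher_preference_match"]
            and m["human_win_rate"] >= gates["human_win_rate"]
            and m["fallback_rate"] <= gates["fallback_rate"]
            and m["rare_class_recall"] >= gates["rare_class_recall"]
            and m["p95_latency_ms"] <= gates["p95_latency_ms"])

def weighted_score(m: dict) -> float:
    return (0.30 * m["teacher_preference_match"]
            + 0.20 * m["human_win_rate"]
            - 0.20 * m["hallucination_rate"]
            - 0.10 * m["fallback_rate"]
            + 0.10 * m["rare_class_recall"]
            + 0.10 * m["cost_reduction"])

def select_checkpoint(checkpoints: list, gates: dict):
    """Pick the best eligible checkpoint, or None if no epoch passes all gates."""
    eligible = [c for c in checkpoints if passes_gates(c["metrics"], gates)]
    return max(eligible, key=lambda c: weighted_score(c["metrics"]), default=None)
```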
Example epoch table
| Epoch | Train loss | Eval loss | Teacher match | Hallucination | Fallback | Decision |
|---|---|---|---|---|---|---|
| 1 | 1.82 | 1.64 | 78.2% | 5.2% | 11.0% | too weak |
| 2 | 1.41 | 1.29 | 83.6% | 4.3% | 8.8% | close |
| 3 | 1.18 | 1.14 | 86.4% | 3.6% | 6.7% | passes |
| 4 | 1.05 | 1.12 | 86.7% | 3.5% | 6.5% | passes |
| 5 | 0.94 | 1.19 | 86.1% | 3.9% | 7.0% | regressing |
Final choice
Epoch 4 was selected, not epoch 5. Even though train loss improved at epoch 5, deployment metrics got worse.
Other solutions explored
- early stop on eval loss only
- stop on teacher-match only
- fixed number of epochs every run
- human review only after final epoch
Why the final decision won
The multi-signal checkpoint rule matched production reality best. The model is serving a business workflow, not winning a training-loss contest.
5. Cross-Cutting Solution Patterns That Repeatedly Won
5.1 Keep the teacher strong, but never trust it blindly
Final pattern:
- use teacher for breadth,
- use retrieval and verifier layers for truth control,
- use humans on policy and high-risk slices.
5.2 Protect slices, not just global averages
Global improvement can hide dangerous slice-level regressions. Final pattern:
- slice dashboards,
- promotion gates per slice,
- protected high-risk datasets.
5.3 Separate training quality from serving quality
A model can train well and serve badly. Final pattern:
- calibration stage,
- shadow traffic,
- canary routing evaluation,
- fallback cost monitoring.
5.4 Prefer scalable moderate-complexity fixes over perfect but fragile ones
At Amazon-scale, the best solution is often not the mathematically purest one. It is the one that:
- can run every week,
- is debuggable,
- fits budget,
- does not require heroics.
6. Final Tradeoff Framework Used Before Production Decision
6.1 Candidate solution scorecard
flowchart LR
A[Candidate fix] --> B[Safety score]
A --> C[Quality score]
A --> D[Latency score]
A --> E[Cost score]
A --> F[Operational simplicity]
B --> G[Decision]
C --> G
D --> G
E --> G
F --> G
Example weighted decision table
| Candidate | Safety | Quality | Latency | Cost | Operability | Weighted outcome |
|---|---|---|---|---|---|---|
| full human review | 10 | 10 | 9 | 2 | 3 | 7.2 |
| raw teacher labels | 3 | 6 | 10 | 10 | 9 | 6.5 |
| retrieval-grounded gating | 9 | 8 | 8 | 8 | 7 | 8.2 |
| post-hoc verifier only | 7 | 6 | 7 | 8 | 8 | 7.0 |
Important intuition
The winning solution is often the one with the best balanced score, not the best single-metric score.
6.2 Absolute-impact decision check
Before final rollout, ask:
- What does this percentage change mean in bad outcomes/day?
- What does this fallback rate mean in extra dollars/month?
- What does this recall drop mean in missed escalations/day?
- What does this log volume mean in storage and incident response time?
That is the difference between a good experiment and a good production decision.
7. What the Final Architecture Looked Like
flowchart TD
A[Recent production prompts] --> B[Privacy and dedupe filter]
B --> C[Teacher labeling with retrieval]
C --> D[Evidence / contradiction gate]
D --> E[Human review for high-risk slices]
E --> F[Balanced dataset builder]
F --> G[Student training]
G --> H[Calibration stage]
H --> I[Offline slice eval]
I --> J[Shadow traffic]
J --> K[Canary rollout]
K --> L[Risk-aware fallback in production]
L --> M[Tiered logging and drift monitors]
This architecture was not chosen because it is the fanciest design. It was chosen because each layer solved a failure that mattered in production.
8. Final Takeaways
8.1 The biggest lesson
The hardest part of distillation at scale is usually not the loss function. It is the quality of the teacher labels, slice protection, calibration, fallback economics, and operational visibility.
8.2 The final decision pattern
For MangaAssist in an Amazon-scale style environment, the winning pattern was:
- cleaner teacher labels over bigger dirty datasets,
- slice-safe metrics over global average wins,
- risk-aware fallback over naive confidence routing,
- weekly rolling refresh over static training snapshots,
- multi-signal checkpoint selection over loss-only stopping,
- tiered observability over log-everything debugging.
8.3 What this means for a lead engineer
A lead engineer should be able to explain for every fix:
- what failed,
- why the first obvious fix was not enough,
- what alternatives were explored,
- what tradeoff decided the winner,
- and how the same choice changes when traffic scales by 1000×.
That is the real production intuition behind knowledge distillation systems.