Distillation Business-Weighted Error Score (MangaAssist)
Slot 5 of the SCENARIO_TEMPLATE 8-file pattern. The Intent-Classification business-weighted-error doc asked which model wins on cost. This doc applies the same lens to the distillation choice: ship the teacher (DistilBERT, more accurate) or the student (TinyBERT, much cheaper)? The answer turns out to be conditional on traffic volume, latency budget, and the rare-class cost matrix.
Why This Is a Distinct Decision
The teacher-vs-student trade-off is not an apples-to-apples accuracy comparison. The two models differ in:
- Inference cost (student is ~2.4× cheaper per request)
- Latency profile (student 5 ms vs teacher 12 ms P95)
- Error distribution (student has more low-cost errors and slightly more critical errors)
- Energy / carbon footprint (~58% reduction with student)
- Maintenance cost (re-distillation cycle is extra ops surface)
A research-grade decision uses a single business objective:
$$
\text{Total Cost}(M) = \text{Inference Cost}(M) + \text{Error Cost}(M) + \text{Operational Cost}(M)
$$
evaluated across realistic traffic volumes and cost matrices.
Shared Baseline (verbatim)
| Item | Value |
|---|---|
| Teacher (M_T) | DistilBERT, 92.1% acc, 12 ms P95, $5,320/mo at 1.4M reqs |
| Student (M_S) | TinyBERT 4-layer, 90.5% acc, 5 ms P95, $2,240/mo at 1.4M reqs |
| Cost matrix | same per-(true,predicted) cost as Intent-Classification business doc |
| Volume bands evaluated | 0.5M, 1.4M (current), 5M, 20M monthly requests |
1. Cost Decomposition (Current Volume = 1.4M Requests/Month)
1.1 Inference cost
- Teacher: 1.4M reqs × $38/10K reqs = $5,320/month
- Student: 1.4M reqs × $16/10K reqs = $2,240/month
- Δ inference: −$3,080/month (favors student)
1.2 Error cost (business-weighted)
Per the Intent-Classification cost matrix and per-request weighted-error calculations:
- Teacher weighted error: 0.0312 → harm units = 1.4M × 0.0312 = 43,680/month
- Student weighted error: 0.0356 → harm units = 1.4M × 0.0356 = 49,840/month
- Δ harm: +6,160 units/month, at $0.013/unit = +$80/month (favors teacher slightly)
1.3 Critical-error cost (escalation/return missed)
- Teacher critical-error rate: 0.0034 → 4,760/month
- Student critical-error rate: 0.0045 → 6,300/month
- Δ critical: +1,540/month, at $1.20/critical-event = +$1,848/month (favors teacher)
1.4 Operational cost
- Teacher: monthly retraining $512 (data labeling) + monitoring (no extra)
- Student: same retraining cost + re-distillation cycle ~$30/month (compute) + extra monitoring/calibration
- Δ operational: +$30/month (marginally favors the teacher)
1.5 Total Cost Comparison
| Component | Teacher | Student | Δ (S − T) |
|---|---|---|---|
| Inference | $5,320 | $2,240 | −$3,080 |
| Weighted error | $568 | $648 | +$80 |
| Critical error | $5,712 | $7,560 | +$1,848 |
| Operational | $512 | $542 | +$30 |
| Total | $12,112 | $10,990 | −$1,122 |
At current volume, the student wins by $1,122/month. But the win is sensitive to the critical-error cost rate ($1.20/event) — see §3.
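The §1 decomposition can be reproduced with a short script. This is a minimal sketch, assuming the per-10K inference rates implied by the monthly figures above ($38/10K teacher, $16/10K student) and the stated per-unit rates ($0.013 per harm unit, $1.20 per critical event):

```python
# Reproduce the Section 1 total-cost decomposition at 1.4M requests/month.
REQS = 1_400_000

def total_cost(infer_per_10k, weighted_err, critical_rate, ops):
    """Monthly total cost from the four components in Section 1."""
    inference = REQS / 10_000 * infer_per_10k   # $ per 10K requests
    weighted  = REQS * weighted_err * 0.013     # $0.013 per harm unit
    critical  = REQS * critical_rate * 1.20     # $1.20 per critical event
    return inference + weighted + critical + ops

teacher = total_cost(38, 0.0312, 0.0034, ops=512)  # DistilBERT
student = total_cost(16, 0.0356, 0.0045, ops=542)  # TinyBERT

print(round(teacher))            # 12112
print(round(student))            # 10990
print(round(student - teacher))  # -1122
```

The printed totals match the §1.5 table, which makes this a convenient regression check whenever a rate in the cost matrix is re-estimated.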
2. Sensitivity Analysis Across Volume Bands
Total cost (in thousands of dollars/month) at each volume:
| Volume | Teacher total | Student total | Winner | Margin |
|---|---|---|---|---|
| 0.5M reqs/mo | $4.4K | $3.9K | student | $0.5K |
| 1.4M (current) | $12.1K | $11.0K | student | $1.1K |
| 5M | $43.3K | $39.3K | student | $4.0K |
| 20M | $173K | $157K | student | $16K |
Reading. The student wins at every volume band. Both the inference savings and the error-cost penalties scale roughly linearly with traffic, but the per-request inference saving (about $2,200 per million requests) outweighs the combined per-request critical-error and weighted-error penalties (about $1,380 per million), so the margin widens with volume and the student becomes the clear choice at scale.
3. Sensitivity to Critical-Error Cost
What if the per-critical-event cost is wrong? Sweep $0.40 → $5.00:
| $ per critical event | Teacher critical-cost | Student critical-cost | Total margin (S − T) | Winner |
|---|---|---|---|---|
| $0.40 | $1,904 | $2,520 | −$2,354 | student |
| $0.80 | $3,808 | $5,040 | −$1,738 | student |
| $1.20 (chosen) | $5,712 | $7,560 | −$1,122 | student |
| $2.00 | $9,520 | $12,600 | +$110 | teacher |
| $3.00 | $14,280 | $18,900 | +$1,650 | teacher |
| $5.00 | $23,800 | $31,500 | +$4,730 | teacher |
Critical decision boundary: the margin is linear in the per-event cost (a fixed student advantage of $2,970/month against 1,540 extra critical events), so the choice flips to the teacher at $2,970 / 1,540 ≈ $1.93 per critical event. This is the lever the cost-matrix audit must protect: if the per-critical-event cost is even moderately under-estimated, we would be shipping the wrong model.
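The break-even follows directly from the §1 deltas: the student's fixed monthly advantage divided by its extra critical events. A minimal sketch:

```python
# Break-even per-critical-event cost (Section 3).
# Fixed monthly components of the student-minus-teacher margin:
delta_inference = -3080   # student is cheaper to serve
delta_weighted  = +80     # student makes slightly more weighted errors
delta_ops       = +30     # re-distillation cycle
fixed_margin = delta_inference + delta_weighted + delta_ops   # -2970

extra_critical = 6300 - 4760   # student's extra critical events/month = 1540

def margin(cost_per_event):
    """Monthly margin (student minus teacher); negative favors the student."""
    return fixed_margin + extra_critical * cost_per_event

break_even = -fixed_margin / extra_critical
print(round(break_even, 2))      # 1.93
print(round(margin(1.20)))       # chosen point: -1122
```

Any re-estimate of the per-critical-event cost can be dropped into `margin()` to see which side of the boundary it lands on.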
4. Sensitivity to Latency Cost
If we monetize latency (engagement and conversion impact), the student gains additional value: users engage more with a snappier system.
| Latency value ($/ms of P95, monthly) | Teacher latency cost | Student latency cost | Net impact on margin |
|---|---|---|---|
| $0/ms (baseline assumption) | $0 | $0 | as above |
| $50/ms | $0 | −$350/mo (7 ms saved × $50) | margin grows to $1,472 |
| $100/ms | $0 | −$700/mo | margin grows to $1,822 |
Reading. If the engagement model values speed (retail latency experiments at comparable scale are often cited in the $50–100/ms range), the margin in favor of the student grows by another $350–700/month.
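Folding latency value into the margin is one line of arithmetic; a sketch assuming the P95 gap (12 − 5 = 7 ms) is the right quantity to monetize:

```python
# Latency-adjusted margin (Section 4).
BASE_MARGIN = 1122        # student's monthly savings before latency value
LATENCY_GAP_MS = 12 - 5   # teacher P95 minus student P95

def adjusted_margin(dollars_per_ms):
    """Student's monthly savings once latency is valued at $X/ms/month."""
    return BASE_MARGIN + LATENCY_GAP_MS * dollars_per_ms

print(adjusted_margin(50))   # 1472
print(adjusted_margin(100))  # 1822
```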
5. Confidence Intervals on the Cost Margin
n = 5,500 test examples, B = 10,000 bootstrap resamples, with uncertainty propagated through the cost calculation.
| Component | Point | 95% CI |
|---|---|---|
| Teacher weighted error | 0.0312 | [0.0288, 0.0341] |
| Student weighted error | 0.0356 | [0.0328, 0.0388] |
| Δ weighted error (paired) | 0.0044 | [0.0014, 0.0078] |
| Net monthly savings, student over teacher | $1,122 | [$320, $2,030] |
Reading. At the 95% lower bound, the student still saves ~$320/month — robust. At the upper bound, ~$2,030/month. Decision is statistically clear at the current cost-matrix point estimate.
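The paired-bootstrap propagation can be sketched as follows. This uses synthetic stand-in data (`simulate_costs` is hypothetical; real values would come from applying the cost matrix to each model's predictions on the shared test set), with n matching §5 and a smaller B for speed:

```python
import random

random.seed(0)
N, B = 5500, 1000   # test examples; resamples (the doc uses B = 10,000)

def simulate_costs(weighted_error_rate):
    # Stand-in for "apply the cost matrix to each prediction": each example
    # contributes 1 harm unit with probability equal to the weighted-error rate.
    return [1.0 if random.random() < weighted_error_rate else 0.0 for _ in range(N)]

teacher_cost = simulate_costs(0.0312)
student_cost = simulate_costs(0.0356)
diffs = [s - t for s, t in zip(student_cost, teacher_cost)]   # paired differences

def bootstrap_ci(values, b=B, alpha=0.05):
    """Percentile CI for the mean of `values` under resampling with replacement."""
    means = sorted(
        sum(random.choices(values, k=len(values))) / len(values) for _ in range(b)
    )
    return means[int(b * alpha / 2)], means[int(b * (1 - alpha / 2)) - 1]

lo, hi = bootstrap_ci(diffs)
# Convert the per-request harm-unit gap to monthly dollars, as in Section 1.2:
to_dollars = lambda d: d * 1_400_000 * 0.013
print(f"weighted-error delta CI: [{lo:.4f}, {hi:.4f}] "
      f"-> monthly $[{to_dollars(lo):.0f}, {to_dollars(hi):.0f}]")
```

The same resampling loop, applied to all four cost components jointly, yields the net-savings interval in the table above.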
6. Failure-Mode Tree
```mermaid
flowchart TD
    A[Monthly business review] --> B{Symptom?}
    B -- "teacher-vs-student critical-error gap widens > 50%" --> C["Audit student rare-class fine-tuning; retrain or re-distill"]
    B -- "inference savings shrink > 25%" --> D["Check student utilization: may have spilled to a larger instance"]
    B -- "per-critical-event cost re-estimated above the §3 break-even" --> E["Re-run §3 sweep; escalate to PM: the choice may flip"]
    B -- "traffic > 5M reqs/mo sustained" --> F["Re-run total-cost sweep: teacher math may shift due to scale-out limits"]
    C --> G["Re-distill or augment rare-class training data"]
    D --> H["Tune batching at serving layer; reset to student instance type"]
    E --> I["Cost-matrix audit with PM and ops; re-derive critical-event cost"]
```
Research Notes — distillation cost. Citations: Elkan 2001 (IJCAI) — cost-matrix decision theory; Provost 2000 (AAAI) — cost-aware threshold moving; Gou 2021 (IJCV — distillation survey) — quality-vs-cost trade-off framing; Strubell 2019 (ACL) — energy/cost reporting in NLP.
7. Persona Debate
Sam (PM). $1,122/month savings is real but small at this volume. What's the long-term case?
Marcus (Architect). At 5M reqs/mo we save $4K/mo; at 20M we save $16K/mo. Plus latency engagement upside.
Aiko (DS). I'm worried about the critical-error rate. 6,300 vs 4,760 monthly missed escalations is 1,540 more frustrated users. Even at $1.20/event the model is robustly student-favored, but if Sam's CSAT model says it's $2/event we flip.
Sam. I'll commit to a quarterly cost-matrix audit with ops. Today's $1.20 was derived from 6 weeks of A/B data — not bulletproof.
Jordan (MLOps). Re-distillation cycle adds two new things to monitor: top-1 teacher-student agreement and quality-gap drift. Both fit in the existing dashboard.
Priya (ML). Long-term, the student lets us experiment with hyperparams 2.4× faster — every tuning sweep returns more value once the teacher pipeline is stable.
Resolution. Ship the student behind a feature flag; canary 5% → 25% over 2 weeks. Track all four cost components weekly. Block full rollout if critical-error cost grows beyond $1,800/month vs teacher.
8. Open Problems
- Per-segment cost modeling. The cost matrix is global; some segments (VIP customers) have higher critical-event cost. Bahnsen 2014's example-dependent CSL applies. Open question: is a per-segment cost matrix worth the operational complexity for our top 1% LTV customers?
- Cost-aware distillation. Today the distillation loss is α · KL + (1 − α) · CE. Could we replace CE with a cost-weighted CE so the student preferentially preserves the teacher's behavior on high-cost predictions? See Khan 2018.
- Carbon cost as a tie-breaker. The student's smaller compute footprint reduces carbon ~58%. As the org adopts carbon-budget tracking, does this become a first-order constraint? Open question: a carbon-aware total-cost formula that internalizes Strubell 2019's accounting.
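One possible shape of the cost-weighted CE term is sketched below. This is an illustration, not a tested recipe: the choice of weight function (worst-case row cost here, one of several options in cost-sensitive learning, cf. Khan 2018) and its normalization are open design questions, and the toy `cost_matrix` is hypothetical:

```python
import math

def cost_weighted_ce(probs, true_idx, cost_matrix):
    """Cross-entropy where each example is up-weighted by the worst-case
    business cost of mispredicting its true class."""
    weight = max(cost_matrix[true_idx])   # worst-case cost for this true class
    return -weight * math.log(probs[true_idx])

# Toy 3-class example: class 2 (e.g. an escalation intent) is expensive to miss.
cost_matrix = [
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
    [5.0, 5.0, 0.0],   # missing class 2 costs 5x
]
student_probs = [0.2, 0.1, 0.7]
loss = cost_weighted_ce(student_probs, true_idx=2, cost_matrix=cost_matrix)
print(round(loss, 4))   # 1.7834
```

In the distillation objective, this term would stand in for CE in α · KL + (1 − α) · CE, pushing the student to stay close to the teacher exactly where mistakes are expensive.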
Bibliography
- Elkan, C. (2001). The Foundations of Cost-Sensitive Learning. IJCAI.
- Provost, F. (2000). Imbalanced Data Sets 101 / threshold moving. AAAI Workshop.
- Bahnsen, A. C. et al. (2014). Example-Dependent Cost-Sensitive Decision Trees. ESWA.
- Khan, S. H. et al. (2018). Cost-Sensitive Deep Feature Learning. IEEE TNNLS.
- Gou, J., Yu, B., Maybank, S. J., Tao, D. (2021). Knowledge Distillation: A Survey. IJCV.
- Strubell, E., Ganesh, A., McCallum, A. (2019). Energy and Policy Considerations for Deep Learning in NLP. ACL.
- Hinton, G. et al. (2015). Distilling the Knowledge in a Neural Network. NeurIPS Workshop.
- Stanton, S. et al. (2021). Does Knowledge Distillation Really Work? NeurIPS.
- Bouthillier, X. et al. (2021). Accounting for Variance. MLSys.
Citation count for this file: 9.