Distillation Business-Weighted Error Score (MangaAssist)
Slot 5 of the SCENARIO_TEMPLATE 8-file pattern. The Intent-Classification business-weighted-error doc asked which model wins on cost. This doc applies the same lens to the distillation choice: ship the teacher (DistilBERT, more accurate) or the student (TinyBERT, much cheaper)? The answer turns out to be conditional on traffic volume, latency budget, and the rare-class cost matrix.
Why This Is a Distinct Decision
The teacher-vs-student trade-off is not an apples-to-apples accuracy comparison. The two models differ in:
- Inference cost (student is ~2.4× cheaper per request)
- Latency profile (student 5 ms vs teacher 12 ms P95)
- Error distribution (student has more low-cost errors and slightly more critical errors)
- Energy / carbon footprint (~58% reduction with student)
- Maintenance cost (re-distillation cycle is extra ops surface)
A research-grade decision uses a single business objective:
$$
\text{Total Cost}(M) = \text{Inference Cost}(M) + \text{Error Cost}(M) + \text{Operational Cost}(M)
$$
evaluated across realistic traffic volumes and cost matrices.
Shared Baseline (verbatim)
| Item | Value |
|---|---|
| Teacher (M_T) | DistilBERT, 92.1% acc, 12 ms P95, $5,320/mo at 1.4M reqs |
| Student (M_S) | TinyBERT 4-layer, 90.5% acc, 5 ms P95, $2,240/mo at 1.4M reqs |
| Cost matrix | same per-(true,predicted) cost as Intent-Classification business doc |
| Volume bands evaluated | 0.5M, 1.4M (current), 5M, 20M monthly requests |
1. Cost Decomposition (Current Volume = 1.4M Requests/Month)
1.1 Inference cost
- Teacher: 1.4M reqs × $38/10K reqs = $5,320/month
- Student: 1.4M reqs × $16/10K reqs = $2,240/month
- Δ inference: −$3,080/month (favors student)
1.2 Error cost (business-weighted)
Per the Intent-Classification cost matrix and per-request weighted-error calculations:
- Teacher weighted error: 0.0312 → harm units = 1.4M × 0.0312 = 43,680/month
- Student weighted error: 0.0356 → harm units = 1.4M × 0.0356 = 49,840/month
- Δ harm: +6,160 units/month, at $0.013/unit = +$80/month (favors teacher slightly)
1.3 Critical-error cost (escalation/return missed)
- Teacher critical-error rate: 0.0034 → 4,760/month
- Student critical-error rate: 0.0045 → 6,300/month
- Δ critical: +1,540/month, at $1.20/critical-event = +$1,848/month (favors teacher)
1.4 Operational cost
- Teacher: monthly retraining $512 (data labeling) + monitoring (no extra)
- Student: same retraining cost + re-distillation cycle ~$30/month (compute) + extra monitoring/calibration
- Δ operational: +$30/month (marginally favors the teacher)
1.5 Total Cost Comparison
| Component | Teacher | Student | Δ (S − T) |
|---|---|---|---|
| Inference | $5,320 | $2,240 | −$3,080 |
| Weighted error | $568 | $648 | +$80 |
| Critical error | $5,712 | $7,560 | +$1,848 |
| Operational | $512 | $542 | +$30 |
| Total | $12,112 | $10,990 | −$1,122 |
At current volume, the student wins by $1,122/month. But the win is sensitive to the critical-error cost rate ($1.20/event) — see §3.
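The §1 decomposition can be reproduced with a short script. This is a minimal sketch, assuming the per-10K inference rates implied by the monthly figures above ($38/10K teacher, $16/10K student) and the stated per-unit rates ($0.013 per harm unit, $1.20 per critical event):

```python
# Reproduce the Section 1 total-cost decomposition at 1.4M requests/month.
REQS = 1_400_000

def total_cost(infer_per_10k, weighted_err, critical_rate, ops):
    """Monthly total cost from the four components in Section 1."""
    inference = REQS / 10_000 * infer_per_10k   # $ per 10K requests
    weighted  = REQS * weighted_err * 0.013     # $0.013 per harm unit
    critical  = REQS * critical_rate * 1.20     # $1.20 per critical event
    return inference + weighted + critical + ops

teacher = total_cost(38, 0.0312, 0.0034, ops=512)  # DistilBERT
student = total_cost(16, 0.0356, 0.0045, ops=542)  # TinyBERT

print(round(teacher))            # 12112
print(round(student))            # 10990
print(round(student - teacher))  # -1122
```

The printed totals match the §1.5 table, which makes this a convenient regression check whenever a rate in the cost matrix is re-estimated.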
2. Sensitivity Analysis Across Volume Bands
Total cost (in thousands of dollars/month) at each volume:
| Volume | Teacher total | Student total | Winner | Margin |
|---|---|---|---|---|
| 0.5M reqs/mo | $4.4K | $3.9K | student | $0.5K |
| 1.4M (current) | $12.1K | $11.0K | student | $1.1K |
| 5M | $43.3K | $39.3K | student | $4.0K |
| 20M | $173K | $157K | student | $16K |
Reading. The student wins at every volume band. Both the inference savings and the error-cost penalties scale roughly linearly with traffic, but the per-request inference saving (about $2,200 per million requests) outweighs the combined per-request critical-error and weighted-error penalties (about $1,380 per million), so the margin widens with volume and the student becomes the clear choice at scale.
3. Sensitivity to Critical-Error Cost
What if the per-critical-event cost is wrong? Sweep $0.40 → $5.00:
| $ per critical event | Teacher critical-cost | Student critical-cost | Total margin (S − T) | Winner |
|---|---|---|---|---|
| $0.40 | $1,904 | $2,520 | −$2,354 | student |
| $0.80 | $3,808 | $5,040 | −$1,738 | student |
| $1.20 (chosen) | $5,712 | $7,560 | −$1,122 | student |
| $2.00 | $9,520 | $12,600 | +$110 | teacher |
| $3.00 | $14,280 | $18,900 | +$1,650 | teacher |
| $5.00 | $23,800 | $31,500 | +$4,730 | teacher |
Critical decision boundary: the margin is linear in the per-event cost (a fixed student advantage of $2,970/month against 1,540 extra critical events), so the choice flips to the teacher at $2,970 / 1,540 ≈ $1.93 per critical event. This is the lever the cost-matrix audit must protect: if the per-critical-event cost is even moderately under-estimated, we would be shipping the wrong model.
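The break-even follows directly from the §1 deltas: the student's fixed monthly advantage divided by its extra critical events. A minimal sketch:

```python
# Break-even per-critical-event cost (Section 3).
# Fixed monthly components of the student-minus-teacher margin:
delta_inference = -3080   # student is cheaper to serve
delta_weighted  = +80     # student makes slightly more weighted errors
delta_ops       = +30     # re-distillation cycle
fixed_margin = delta_inference + delta_weighted + delta_ops   # -2970

extra_critical = 6300 - 4760   # student's extra critical events/month = 1540

def margin(cost_per_event):
    """Monthly margin (student minus teacher); negative favors the student."""
    return fixed_margin + extra_critical * cost_per_event

break_even = -fixed_margin / extra_critical
print(round(break_even, 2))      # 1.93
print(round(margin(1.20)))       # chosen point: -1122
```

Any re-estimate of the per-critical-event cost can be dropped into `margin()` to see which side of the boundary it lands on.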
4. Sensitivity to Latency Cost
If we monetize latency (engagement and conversion impact), the student gains additional value: users engage more with a snappier system.
| Latency value ($/ms of P95, monthly) | Teacher latency cost | Student latency cost | Net impact on margin |
|---|---|---|---|
| $0/ms (baseline assumption) | $0 | $0 | as above |
| $50/ms | $0 | −$350/mo (7 ms saved × $50) | margin grows to $1,472 |
| $100/ms | $0 | −$700/mo | margin grows to $1,822 |
Reading. If the engagement model values speed (retail latency experiments at comparable scale are often cited in the $50–100/ms range), the margin in favor of the student grows by another $350–700/month.
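Folding latency value into the margin is one line of arithmetic; a sketch assuming the P95 gap (12 − 5 = 7 ms) is the right quantity to monetize:

```python
# Latency-adjusted margin (Section 4).
BASE_MARGIN = 1122        # student's monthly savings before latency value
LATENCY_GAP_MS = 12 - 5   # teacher P95 minus student P95

def adjusted_margin(dollars_per_ms):
    """Student's monthly savings once latency is valued at $X/ms/month."""
    return BASE_MARGIN + LATENCY_GAP_MS * dollars_per_ms

print(adjusted_margin(50))   # 1472
print(adjusted_margin(100))  # 1822
```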
5. Confidence Intervals on the Cost Margin
n = 5,500 test examples, B = 10,000 bootstrap resamples, with uncertainty propagated through the cost calculation.
| Component | Point | 95% CI |
|---|---|---|
| Teacher weighted error | 0.0312 | [0.0288, 0.0341] |
| Student weighted error | 0.0356 | [0.0328, 0.0388] |
| Δ weighted error (paired) | 0.0044 | [0.0014, 0.0078] |
| Net monthly savings, student over teacher | $1,122 | [$320, $2,030] |
Reading. At the 95% lower bound, the student still saves ~$320/month — robust. At the upper bound, ~$2,030/month. Decision is statistically clear at the current cost-matrix point estimate.
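The paired-bootstrap propagation can be sketched as follows. This uses synthetic stand-in data (`simulate_costs` is hypothetical; real values would come from applying the cost matrix to each model's predictions on the shared test set), with n matching §5 and a smaller B for speed:

```python
import random

random.seed(0)
N, B = 5500, 1000   # test examples; resamples (the doc uses B = 10,000)

def simulate_costs(weighted_error_rate):
    # Stand-in for "apply the cost matrix to each prediction": each example
    # contributes 1 harm unit with probability equal to the weighted-error rate.
    return [1.0 if random.random() < weighted_error_rate else 0.0 for _ in range(N)]

teacher_cost = simulate_costs(0.0312)
student_cost = simulate_costs(0.0356)
diffs = [s - t for s, t in zip(student_cost, teacher_cost)]   # paired differences

def bootstrap_ci(values, b=B, alpha=0.05):
    """Percentile CI for the mean of `values` under resampling with replacement."""
    means = sorted(
        sum(random.choices(values, k=len(values))) / len(values) for _ in range(b)
    )
    return means[int(b * alpha / 2)], means[int(b * (1 - alpha / 2)) - 1]

lo, hi = bootstrap_ci(diffs)
# Convert the per-request harm-unit gap to monthly dollars, as in Section 1.2:
to_dollars = lambda d: d * 1_400_000 * 0.013
print(f"weighted-error delta CI: [{lo:.4f}, {hi:.4f}] "
      f"-> monthly $[{to_dollars(lo):.0f}, {to_dollars(hi):.0f}]")
```

The same resampling loop, applied to all four cost components jointly, yields the net-savings interval in the table above.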
6. Failure-Mode Tree
```mermaid
flowchart TD
    A[Monthly business review] --> B{Symptom?}
    B -- "teacher-vs-student critical-error gap widens > 50%" --> C["Audit student rare-class fine-tuning; retrain or re-distill"]
    B -- "inference savings shrink > 25%" --> D["Check student utilization: may have spilled to a larger instance"]
    B -- "per-critical-event cost re-estimated above the §3 break-even" --> E["Re-run §3 sweep; escalate to PM: the choice may flip"]
    B -- "traffic > 5M reqs/mo sustained" --> F["Re-run total-cost sweep: teacher math may shift due to scale-out limits"]
    C --> G["Re-distill or augment rare-class training data"]
    D --> H["Tune batching at serving layer; reset to student instance type"]
    E --> I["Cost-matrix audit with PM and ops; re-derive critical-event cost"]
```
Research Notes — distillation cost. Citations: Elkan 2001 (IJCAI) — cost-matrix decision theory; Provost 2000 (AAAI) — cost-aware threshold moving; Gou 2021 (IJCV — distillation survey) — quality-vs-cost trade-off framing; Strubell 2019 (ACL) — energy/cost reporting in NLP.
7. Persona Debate
Sam (PM). $1,122/month savings is real but small at this volume. What's the long-term case?
Marcus (Architect). At 5M reqs/mo we save $4K/mo; at 20M we save $16K/mo. Plus latency engagement upside.
Aiko (DS). I'm worried about the critical-error rate. 6,300 vs 4,760 monthly missed escalations is 1,540 more frustrated users. Even at $1.20/event the model is robustly student-favored, but if Sam's CSAT model says it's $2/event we flip.
Sam. I'll commit to a quarterly cost-matrix audit with ops. Today's $1.20 was derived from 6 weeks of A/B data — not bulletproof.
Jordan (MLOps). Re-distillation cycle adds two new things to monitor: top-1 teacher-student agreement and quality-gap drift. Both fit in the existing dashboard.
Priya (ML). Long-term, the student lets us experiment with hyperparams 2.4× faster — every tuning sweep returns more value once the teacher pipeline is stable.
Resolution. Ship the student behind a feature flag; canary 5% → 25% over 2 weeks. Track all four cost components weekly. Block full rollout if critical-error cost grows beyond $1,800/month vs teacher.
8. Open Problems
- Per-segment cost modeling. The cost matrix is global; some segments (VIP customers) have higher critical-event cost. Bahnsen 2014's example-dependent CSL applies. Open question: is a per-segment cost matrix worth the operational complexity for our top 1% LTV customers?
- Cost-aware distillation. Today the distillation loss is α · KL + (1 − α) · CE. Could we replace CE with a cost-weighted CE so the student preferentially preserves the teacher's behavior on high-cost predictions? See Khan 2018.
- Carbon cost as a tie-breaker. The student's smaller compute footprint reduces carbon ~58%. As the org adopts carbon-budget tracking, does this become a first-order constraint? Open question: a carbon-aware total-cost formula that internalizes Strubell 2019's accounting.
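One possible shape of the cost-weighted CE term is sketched below. This is an illustration, not a tested recipe: the choice of weight function (worst-case row cost here, one of several options in cost-sensitive learning, cf. Khan 2018) and its normalization are open design questions, and the toy `cost_matrix` is hypothetical:

```python
import math

def cost_weighted_ce(probs, true_idx, cost_matrix):
    """Cross-entropy where each example is up-weighted by the worst-case
    business cost of mispredicting its true class."""
    weight = max(cost_matrix[true_idx])   # worst-case cost for this true class
    return -weight * math.log(probs[true_idx])

# Toy 3-class example: class 2 (e.g. an escalation intent) is expensive to miss.
cost_matrix = [
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
    [5.0, 5.0, 0.0],   # missing class 2 costs 5x
]
student_probs = [0.2, 0.1, 0.7]
loss = cost_weighted_ce(student_probs, true_idx=2, cost_matrix=cost_matrix)
print(round(loss, 4))   # 1.7834
```

In the distillation objective, this term would stand in for CE in α · KL + (1 − α) · CE, pushing the student to stay close to the teacher exactly where mistakes are expensive.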
Bibliography
- Elkan, C. (2001). The Foundations of Cost-Sensitive Learning. IJCAI.
- Provost, F. (2000). Imbalanced Data Sets 101 / threshold moving. AAAI Workshop.
- Bahnsen, A. C. et al. (2014). Example-Dependent Cost-Sensitive Decision Trees. ESWA.
- Khan, S. H. et al. (2018). Cost-Sensitive Deep Feature Learning. IEEE TNNLS.
- Gou, J., Yu, B., Maybank, S. J., Tao, D. (2021). Knowledge Distillation: A Survey. IJCV.
- Strubell, E., Ganesh, A., McCallum, A. (2019). Energy and Policy Considerations for Deep Learning in NLP. ACL.
- Hinton, G. et al. (2015). Distilling the Knowledge in a Neural Network. NeurIPS Workshop.
- Stanton, S. et al. (2021). Does Knowledge Distillation Really Work? NeurIPS.
- Bouthillier, X. et al. (2021). Accounting for Variance. MLSys.
Citation count for this file: 9.