
Distillation Business-Weighted Error Score (MangaAssist)

Slot 5 of the SCENARIO_TEMPLATE 8-file pattern. The Intent-Classification business-weighted-error doc framed the question of which model wins on cost. This doc applies the same lens to the distillation choice: should we ship the teacher (DistilBERT, more accurate) or the student (TinyBERT, much cheaper)? It shows that the answer is conditional on traffic volume, latency budget, and the rare-class cost matrix.

Why This Is a Distinct Decision

The teacher-vs-student trade-off is not an apples-to-apples accuracy comparison. The two models differ in:

  • Inference cost (student is ~2.4× cheaper per request)
  • Latency profile (student 5 ms vs teacher 12 ms P95)
  • Error distribution (student has more low-cost errors and slightly more critical errors)
  • Energy / carbon footprint (~58% reduction with student)
  • Maintenance cost (re-distillation cycle is extra ops surface)

A research-grade decision uses a single business objective:

$$\text{Total Cost}(M) = \text{Inference Cost}(M) + \text{Error Cost}(M) + \text{Operational Cost}(M)$$

evaluated across realistic traffic volumes and cost matrices.
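A minimal sketch of this objective in Python; the names and structure are illustrative, not from a shared codebase:

```python
from dataclasses import dataclass

@dataclass
class ModelCosts:
    """Per-request rates and fixed monthly costs for one candidate model."""
    inference_per_req: float    # $ per request served
    weighted_error_rate: float  # expected harm units per request
    harm_unit_cost: float       # $ per harm unit (from the cost matrix)
    critical_error_rate: float  # critical events (missed escalations) per request
    critical_event_cost: float  # $ per critical event
    ops_monthly: float          # fixed $/month (retraining, re-distillation)

def total_cost(m: ModelCosts, volume: float) -> float:
    """TotalCost(M) = InferenceCost(M) + ErrorCost(M) + OperationalCost(M)."""
    inference = volume * m.inference_per_req
    error = volume * (m.weighted_error_rate * m.harm_unit_cost
                      + m.critical_error_rate * m.critical_event_cost)
    return inference + error + m.ops_monthly
```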


Shared Baseline (verbatim)

| Item | Value |
|---|---|
| Teacher (M_T) | DistilBERT, 92.1% acc, 12 ms P95, $5,320/mo at 1.4M reqs |
| Student (M_S) | TinyBERT 4-layer, 90.5% acc, 5 ms P95, $2,240/mo at 1.4M reqs |
| Cost matrix | same per-(true, predicted) cost as the Intent-Classification business doc |
| Volume bands evaluated | 0.5M, 1.4M (current), 5M, 20M monthly requests |

1. Cost Decomposition (Current Volume = 1.4M Requests/Month)

1.1 Inference cost

  • Teacher: 1.4M × $0.0038/request = $5,320/month
  • Student: 1.4M × $0.0016/request = $2,240/month
  • Δ inference: −$3,080/month (favors student)

1.2 Error cost (business-weighted)

Per the Intent-Classification cost matrix and per-request weighted-error calculations:

  • Teacher weighted error: 0.0312 → harm units = 1.4M × 0.0312 = 43,680/month
  • Student weighted error: 0.0356 → harm units = 1.4M × 0.0356 = 49,840/month
  • Δ harm: +6,160 units/month, at $0.013/unit = +$80/month (favors teacher slightly)

1.3 Critical-error cost (escalation/return missed)

  • Teacher critical-error rate: 0.0034 → 4,760/month
  • Student critical-error rate: 0.0045 → 6,300/month
  • Δ critical: +1,540/month, at $1.20/critical-event = +$1,848/month (favors teacher)

1.4 Operational cost

  • Teacher: monthly retraining $512 (data labeling); no extra monitoring
  • Student: same $512 retraining + re-distillation cycle ~$30/month (compute); the extra agreement/drift monitoring fits the existing dashboard (see §7)

Δ operational: +$30/month (favors teacher, marginally)

1.5 Total Cost Comparison

| Component | Teacher | Student | Δ (S − T) |
|---|---|---|---|
| Inference | $5,320 | $2,240 | −$3,080 |
| Weighted error | $568 | $648 | +$80 |
| Critical error | $5,712 | $7,560 | +$1,848 |
| Operational | $512 | $542 | +$30 |
| Total | $12,112 | $10,990 | −$1,122 |

At current volume, the student wins by $1,122/month; the sketch below reproduces these figures. But the win is sensitive to the critical-error cost rate ($1.20/event): see §3.
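A quick consistency check on the table, a sketch using only the rates stated above:

```python
# Reproduce the §1 decomposition at the current volume.
VOLUME = 1_400_000          # requests/month
HARM_UNIT_COST = 0.013      # $ per weighted-error harm unit
CRITICAL_COST = 1.20        # $ per critical event (missed escalation/return)

def decompose(infer_per_req, weighted_err_rate, critical_rate, ops_monthly):
    return {
        "inference":      VOLUME * infer_per_req,
        "weighted_error": VOLUME * weighted_err_rate * HARM_UNIT_COST,
        "critical_error": VOLUME * critical_rate * CRITICAL_COST,
        "operational":    ops_monthly,
    }

teacher = decompose(0.0038, 0.0312, 0.0034, 512)  # 5320 / 568 / 5712 / 512
student = decompose(0.0016, 0.0356, 0.0045, 542)  # 2240 / 648 / 7560 / 542
print(f"teacher total: ${sum(teacher.values()):,.0f}")   # ~$12,112
print(f"student total: ${sum(student.values()):,.0f}")   # ~$10,990
```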


2. Sensitivity Analysis Across Volume Bands

Total cost (in thousands of dollars/month) at each volume:

| Volume | Teacher total | Student total | Winner | Margin |
|---|---|---|---|---|
| 0.5M reqs/mo | $4.7K | $4.3K | student | $0.4K |
| 1.4M (current) | $12.1K | $11.0K | student | $1.1K |
| 5M | $41.9K | $37.9K | student | $4.1K |
| 20M | $166K | $150K | student | $16.4K |

Reading. The student wins at every volume band. Both the inference savings and the extra critical-error cost scale linearly with traffic, but the student's per-request inference savings ($0.0022/req) outweigh its combined per-request error penalties (~$0.0014/req), so the net margin also grows linearly with traffic, making the student the clear choice at scale. The sweep below re-derives the margin column.
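A sketch of the sweep; it assumes the §1 per-request rates hold at every volume, which the failure-mode tree in §6 flags as worth re-checking above 5M reqs/mo:

```python
# Monthly margin in the student's favor, as a function of traffic volume.
# Per-request deltas come straight from §1; the ops delta is a fixed $30/mo.
SAVINGS_PER_REQ = (
    (0.0038 - 0.0016)               # inference savings
    - (0.0356 - 0.0312) * 0.013     # minus extra weighted-error cost
    - (0.0045 - 0.0034) * 1.20      # minus extra critical-error cost
)                                   # ~= $0.00082 per request

for volume in (500_000, 1_400_000, 5_000_000, 20_000_000):
    margin = volume * SAVINGS_PER_REQ - 30   # student carries +$30/mo ops
    print(f"{volume:>10,} reqs/mo -> student saves ${margin:,.0f}/mo")
```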


3. Sensitivity to Critical-Error Cost

What if the per-critical-event cost is wrong? Sweep $0.40 → $5.00:

| $ per critical event | Teacher critical cost | Student critical cost | Total margin (S − T) | Winner |
|---|---|---|---|---|
| $0.40 | $1,904 | $2,520 | −$2,354 | student |
| $0.80 | $3,808 | $5,040 | −$1,738 | student |
| $1.20 (chosen) | $5,712 | $7,560 | −$1,122 | student |
| $2.00 | $9,520 | $12,600 | +$110 | teacher |
| $3.00 | $14,280 | $18,900 | +$1,650 | teacher |
| $5.00 | $23,800 | $31,500 | +$4,730 | teacher |

Critical decision boundary: at ≈$1.93/critical-event, where the student's 1,540 extra critical events/month consume its $2,970/month of non-critical savings, the choice flips to the teacher. This is the lever the cost-matrix audit must protect: if the per-critical-event cost is even moderately under-estimated, we'd be shipping the wrong model. The sketch below derives the flip point.
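A minimal re-derivation, using only the §1 deltas:

```python
# Break-even per-critical-event cost: the point where the student's
# extra critical errors consume its non-critical savings.
NONCRITICAL_SAVINGS = 3_080 - 80 - 30   # $/mo: inference delta minus error and ops deltas
EXTRA_CRITICAL = 6_300 - 4_760          # extra critical events/mo for the student

breakeven = NONCRITICAL_SAVINGS / EXTRA_CRITICAL
print(f"flip point: ${breakeven:.2f}/event")          # ~$1.93

for c in (0.40, 0.80, 1.20, 2.00, 3.00, 5.00):
    margin_s_minus_t = EXTRA_CRITICAL * c - NONCRITICAL_SAVINGS
    print(f"${c:.2f}/event -> margin (S - T) = ${margin_s_minus_t:+,.0f}")
```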


4. Sensitivity to Latency Cost

If we monetize latency (engagement / conversion impact), the student gains additional value: users engage more with a snappier system.

| $ per ms above baseline | Teacher latency cost | Student latency cost | Net impact on margin |
|---|---|---|---|
| $0/ms (baseline assumption) | $0 | $0 | as above ($1,122) |
| $50/ms | $0 (reference) | −$350/mo (7 ms × $50) | margin grows to $1,472 |
| $100/ms | $0 (reference) | −$700/mo (7 ms × $100) | margin grows to $1,822 |

Reading. If the engagement model values speed in the ~$50–100/ms range at this scale, the margin in favor of the student grows by another $350–700/month, as the sketch below shows.
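A trivial but explicit version of the latency adjustment; the $/ms values are this doc's assumptions, not measured figures:

```python
# Latency-adjusted margin: monetize the student's 7 ms P95 advantage
# (12 ms teacher - 5 ms student) on top of the $1,122/mo baseline savings.
BASELINE_MARGIN = 1_122        # $/mo, from the §1.5 total
LATENCY_GAP_MS = 12 - 5        # teacher P95 minus student P95

for dollars_per_ms in (0, 50, 100):
    margin = BASELINE_MARGIN + LATENCY_GAP_MS * dollars_per_ms
    print(f"${dollars_per_ms}/ms -> student margin ${margin:,}/mo")  # 1,122 / 1,472 / 1,822
```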


5. Confidence Intervals on the Cost Margin

n = 5,500 test examples, B = 10,000 bootstrap resamples, propagated through the cost calculation.

| Component | Point | 95% CI |
|---|---|---|
| Teacher weighted error | 0.0312 | [0.0288, 0.0341] |
| Student weighted error | 0.0356 | [0.0328, 0.0388] |
| Δ weighted error (paired) | 0.0044 | [0.0014, 0.0078] |
| Net monthly savings, student over teacher | $1,122 | [$320, $2,030] |

Reading. At the 95% lower bound, the student still saves ~$320/month — robust. At the upper bound, ~$2,030/month. Decision is statistically clear at the current cost-matrix point estimate.
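A sketch of how the propagation could be done, assuming arrays of per-example realized dollar costs (weighted error plus critical error) are available from the labeled test set; the array names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_margin_ci(teacher_costs, student_costs, n_boot=10_000,
                        volume=1_400_000, ops_delta=30.0):
    """95% CI for the student's net monthly savings, via paired bootstrap.

    teacher_costs / student_costs: numpy arrays of length n (here n = 5,500),
    each entry the realized per-example dollar cost (weighted error +
    critical error) of that model on one labeled test example.
    """
    n = len(teacher_costs)
    margins = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)                 # resample paired examples
        delta_err = teacher_costs[idx].mean() - student_costs[idx].mean()
        # add the deterministic per-request inference savings, subtract ops delta
        margins[b] = volume * (delta_err + (0.0038 - 0.0016)) - ops_delta
    return np.percentile(margins, [2.5, 97.5])           # [$ lower, $ upper]
```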


6. Failure-Mode Tree

```mermaid
flowchart TD
    A["Monthly business review"] --> B{"Symptom?"}
    B -->|"critical-error rate gap (teacher vs student) widens > 50%"| C["Audit student rare-class fine-tune; retrain or re-distill"]
    B -->|"inference savings shrink > 25%"| D["Check student utilization: may have spilled to a larger instance"]
    B -->|"per-critical-event cost re-estimated above $1.93"| E["Re-run the §3 sweep; escalate to PM: the choice may flip"]
    B -->|"traffic above 5M reqs/mo sustained"| F["Re-run the total-cost sweep: teacher math may shift under scale-out limits"]
    C --> G["Re-distill or augment rare-class training data"]
    D --> H["Tune batching at the serving layer; reset to the student instance type"]
    E --> I["Cost-matrix audit with PM and ops: re-derive the critical-event cost"]
```

Research Notes — distillation cost. Citations: Elkan 2001 (IJCAI) — cost-matrix decision theory; Provost 2000 (AAAI) — cost-aware threshold moving; Gou 2021 (IJCV — distillation survey) — quality-vs-cost trade-off framing; Strubell 2019 (ACL) — energy/cost reporting in NLP.


7. Persona Debate

Sam (PM). $1,122/month savings is real but small at this volume. What's the long-term case?

Marcus (Architect). At 5M reqs/mo we save $4K/mo; at 20M we save $16K/mo. Plus latency engagement upside.

Aiko (DS). I'm worried about the critical-error rate: 6,300 vs 4,760 monthly missed escalations is 1,540 more frustrated users. Even at $1.20/event the decision is robustly student-favored, but if Sam's CSAT model says it's $2/event, we flip.

Sam. I'll commit to a quarterly cost-matrix audit with ops. Today's $1.20 was derived from 6 weeks of A/B data — not bulletproof.

Jordan (MLOps). Re-distillation cycle adds two new things to monitor: top-1 teacher-student agreement and quality-gap drift. Both fit in the existing dashboard.

Priya (ML). Long-term, the student lets us experiment with hyperparams 2.4× faster — every tuning sweep returns more value once the teacher pipeline is stable.

Resolution. Ship the student behind a feature flag; canary 5% → 25% over 2 weeks. Track all four cost components weekly. Block full rollout if the critical-error cost gap vs the teacher grows beyond $2,900/month (just under the §3 flip point of $2,970).


8. Open Problems

  1. Per-segment cost modeling. The cost matrix is global; some segments (VIP customers) have higher critical-event cost. Bahnsen 2014's example-dependent CSL applies. Open question: is a per-segment cost matrix worth the operational complexity for our top 1% LTV customers?
  2. Cost-aware distillation. Today the distillation loss is α·KL + (1−α)·CE. Could we replace the CE term with a cost-weighted CE so the student preferentially preserves the teacher's behavior on high-cost predictions? See Khan 2018 and the sketch after this list.
  3. Carbon cost as a tie-breaker. Student's smaller compute footprint reduces carbon ~58%. As the org adopts carbon-budget tracking, does this become a first-order constraint? Open question: a carbon-aware total-cost formula that internalizes Strubell 2019's accounting.

Bibliography

  • Elkan, C. (2001). The Foundations of Cost-Sensitive Learning. IJCAI.
  • Provost, F. (2000). Machine Learning from Imbalanced Data Sets 101. AAAI Workshop on Imbalanced Data Sets.
  • Bahnsen, A. C. et al. (2014). Example-Dependent Cost-Sensitive Decision Trees. ESWA.
  • Khan, S. H. et al. (2018). Cost-Sensitive Learning of Deep Feature Representations from Imbalanced Data. IEEE TNNLS.
  • Gou, J., Yu, B., Maybank, S. J., Tao, D. (2021). Knowledge Distillation: A Survey. IJCV.
  • Strubell, E., Ganesh, A., McCallum, A. (2019). Energy and Policy Considerations for Deep Learning in NLP. ACL.
  • Hinton, G. et al. (2015). Distilling the Knowledge in a Neural Network. NeurIPS Workshop.
  • Stanton, S. et al. (2021). Does Knowledge Distillation Really Work? NeurIPS.
  • Bouthillier, X. et al. (2021). Accounting for Variance. MLSys.

Citation count for this file: 9.