Business-Weighted Error Score for Intent Routing — MangaAssist
Why Accuracy Alone Is Not Enough
In the MangaAssist setup, the fine-tuned DistilBERT intent classifier reaches 92.1% overall accuracy, operates under a 15 ms P95 latency budget, and handles 10 intents with very different business impact. A wrong prediction is not always equally bad:
- `product_discovery` → `recommendation` is usually a low-harm mistake because both routes still help the user shop.
- `escalation` → `chitchat` is a high-harm mistake because a user asking for a human gets ignored.
- `order_tracking` → `chitchat` is much worse than `promotion` → `product_discovery`.
This document adds a Business-Weighted Error Score layer on top of the original fine-tuning document so model selection is based on business harm, not only raw accuracy.
This stays aligned with the original scenario:

- 10 intents
- 92.1% fine-tuned accuracy
- 88.6% rare-class accuracy
- 15 ms P95 routing budget
- same recommendation-family confusion patterns already described in the original MangaAssist write-up
What This Metric Solves
Problem
Two models can have nearly the same accuracy but very different business outcomes.
| Model | Accuracy | Major Error Pattern | Business Outcome |
|---|---|---|---|
| Model A | 92.1% | misses some escalation and order_tracking cases | risky |
| Model B | 91.8% | mostly confuses product_discovery with recommendation | safer |
If you optimize only for accuracy, you may ship the worse model.
Goal
We want a metric that answers:
“How much real business harm do the model’s mistakes create?”
Intent List and Traffic Mix
We use the same intent frequencies from the MangaAssist scenario.
| Intent | Frequency | Count in 10,000-request worked example |
|---|---|---|
| product_discovery | 22% | 2,200 |
| product_question | 15% | 1,500 |
| recommendation | 18% | 1,800 |
| faq | 8% | 800 |
| order_tracking | 12% | 1,200 |
| return_request | 7% | 700 |
| promotion | 5% | 500 |
| checkout_help | 4% | 400 |
| escalation | 3% | 300 |
| chitchat | 6% | 600 |
| Total | 100% | 10,000 |
Step 1 — Define a Business Cost Matrix
Let:
- true intent = ( y )
- predicted action / predicted intent = ( a )
- business cost of that decision = ( C(y, a) )
Correct predictions have zero cost:
[ C(y, y) = 0 ]
The cost matrix is directional and asymmetric.
Example:
- `escalation` → `faq` is very expensive
- `faq` → `escalation` is annoying but much cheaper
That means:
[ C(\text{escalation}, \text{faq}) \neq C(\text{faq}, \text{escalation}) ]
Recommended Cost Scale
Use a simple 0–10 scale:
| Cost | Meaning |
|---|---|
| 0 | correct |
| 1–2 | low harm, usually same journey or acceptable fallback |
| 3–5 | medium harm, extra friction or wrong flow |
| 6–8 | high harm, support failure or dead-end |
| 9–10 | critical harm, user safety / trust / human handoff failure |
Representative Directional Costs
| True intent | Predicted intent | Cost |
|---|---|---|
| product_discovery | recommendation | 1 |
| recommendation | product_discovery | 1 |
| product_question | product_discovery | 2 |
| promotion | product_discovery | 1 |
| faq | checkout_help | 2 |
| order_tracking | faq | 3 |
| order_tracking | return_request | 4 |
| return_request | order_tracking | 4 |
| checkout_help | faq | 2 |
| order_tracking | chitchat | 6 |
| return_request | chitchat | 6 |
| escalation | faq | 8 |
| escalation | order_tracking | 9 |
| escalation | product_discovery | 10 |
| escalation | chitchat | 10 |
| faq | escalation | 2 |
| chitchat | faq | 2 |
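For use in code, the matrix above can be held as a sparse mapping. A minimal sketch — the `DEFAULT_COST` fallback for pairs not in the representative table is an illustrative assumption, not part of the matrix:

```python
# Directional business-cost matrix C(true, predicted).
# Only the representative pairs from the table are filled in; DEFAULT_COST
# is an illustrative placeholder for unlisted wrong-route pairs.
DEFAULT_COST = 3

COST = {
    ("product_discovery", "recommendation"): 1,
    ("recommendation", "product_discovery"): 1,
    ("product_question", "product_discovery"): 2,
    ("promotion", "product_discovery"): 1,
    ("faq", "checkout_help"): 2,
    ("order_tracking", "faq"): 3,
    ("order_tracking", "return_request"): 4,
    ("return_request", "order_tracking"): 4,
    ("checkout_help", "faq"): 2,
    ("order_tracking", "chitchat"): 6,
    ("return_request", "chitchat"): 6,
    ("escalation", "faq"): 8,
    ("escalation", "order_tracking"): 9,
    ("escalation", "product_discovery"): 10,
    ("escalation", "chitchat"): 10,
    ("faq", "escalation"): 2,
    ("chitchat", "faq"): 2,
}

def cost(true_intent: str, predicted: str) -> int:
    """C(y, a): zero for a correct route, table value otherwise."""
    if true_intent == predicted:
        return 0
    return COST.get((true_intent, predicted), DEFAULT_COST)
```

Keeping the matrix sparse and directional makes the asymmetry explicit: `cost("escalation", "faq")` is 8 while `cost("faq", "escalation")` is only 2.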
Step 2 — Offline Metric Definitions
2.1 Business-Weighted Harm per Request
For (N) evaluated requests:
[ \text{BWH} = \frac{1}{N}\sum_{i=1}^{N} C(y_i, \hat{y}_i) ]
Interpretation:
- average harm points created per request
- lower is better
- range = 0 to 10
2.2 Business-Weighted Error Points
[ \text{BWEP} = \sum_{i=1}^{N} C(y_i, \hat{y}_i) ]
Interpretation:
- total harm points over the evaluation set
- useful for dashboarding and comparing model versions
2.3 Average Severity per Error
If (E) is the number of misclassified requests:
[ \text{Avg Severity per Error} = \frac{\text{BWEP}}{E} ]
Interpretation:
- how bad the average mistake is
- not how often the model is wrong, but how dangerous its wrong predictions are
2.4 Business-Weighted Score
Normalize harm to a 0–100 score using the maximum possible cost (C_{\max}=10):
[ \text{BW Score} = 100 \left(1 - \frac{\text{BWH}}{10}\right) ]
Equivalent form:
[ \text{BW Score} = 100 \left(1 - \frac{\text{BWEP}}{N \cdot 10}\right) ]
Interpretation:
- 100 = no business harm
- 0 = every request is routed in the worst possible way
- this is not the same thing as accuracy
2.5 Critical Error Rate
[ \text{Critical Error Rate} = \frac{\#\{\,i : C(y_i,\hat{y}_i)\ge 8\,\}}{N} ]
Interpretation:
- how often the model makes truly dangerous mistakes
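All five Step-2 metrics can be computed from a single list of per-request costs. A minimal sketch; the `critical_cutoff=8` default mirrors the definition in 2.5:

```python
def business_weighted_metrics(per_request_costs, c_max=10, critical_cutoff=8):
    """Step-2 metrics from one cost value C(y_i, y_hat_i) per evaluated request."""
    n = len(per_request_costs)
    bwep = sum(per_request_costs)                       # 2.2 total harm points
    bwh = bwep / n                                      # 2.1 harm per request
    errors = sum(1 for c in per_request_costs if c > 0)
    return {
        "BWEP": bwep,
        "BWH": bwh,
        "avg_severity_per_error": bwep / errors if errors else 0.0,  # 2.3
        "BW_score": 100 * (1 - bwh / c_max),            # 2.4
        "critical_error_rate":
            sum(1 for c in per_request_costs if c >= critical_cutoff) / n,  # 2.5
    }

# Tiny illustration: five requests, one critical miss (cost 8), one mild miss.
m = business_weighted_metrics([0, 0, 0, 8, 2])
print(m["BWH"], m["BW_score"])  # 2.0 80.0
```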
Step 3 — Fully Worked 10,000-Request Example
We build a concrete evaluation batch of 10,000 requests using the exact traffic mix above.
Per-Intent Correct Predictions
| Intent | Total | Correct | Accuracy |
|---|---|---|---|
| product_discovery | 2,200 | 2,020 | 91.82% |
| product_question | 1,500 | 1,395 | 93.00% |
| recommendation | 1,800 | 1,645 | 91.39% |
| faq | 800 | 736 | 92.00% |
| order_tracking | 1,200 | 1,128 | 94.00% |
| return_request | 700 | 637 | 91.00% |
| promotion | 500 | 444 | 88.80% |
| checkout_help | 400 | 354 | 88.50% |
| escalation | 300 | 266 | 88.67% |
| chitchat | 600 | 585 | 97.50% |
| Total | 10,000 | 9,210 | 92.10% |
Rare-Class Validation
Using the same rare class group from the original MangaAssist scenario:
- `promotion`
- `checkout_help`
- `escalation`
Correct rare-class predictions:
[ 444 + 354 + 266 = 1064 ]
Rare-class total:
[ 500 + 400 + 300 = 1200 ]
Rare-class accuracy:
[ \frac{1064}{1200} = 0.8867 = 88.67\% ]
This is consistent with the earlier ~88.6% rare-class accuracy.
Step 4 — Error Breakdown with Business Costs
Below is the worked error set. These are the 790 misclassified requests.
4.1 True product_discovery (180 errors)
| Predicted | Count | Cost | Error points |
|---|---|---|---|
| recommendation | 85 | 1 | 85 |
| product_question | 55 | 2 | 110 |
| promotion | 20 | 1 | 20 |
| chitchat | 20 | 4 | 80 |
| Total | 180 | — | 295 |
4.2 True product_question (105 errors)
| Predicted | Count | Cost | Error points |
|---|---|---|---|
| recommendation | 55 | 2 | 110 |
| product_discovery | 30 | 2 | 60 |
| faq | 10 | 4 | 40 |
| checkout_help | 10 | 5 | 50 |
| Total | 105 | — | 260 |
4.3 True recommendation (155 errors)
| Predicted | Count | Cost | Error points |
|---|---|---|---|
| product_discovery | 90 | 1 | 90 |
| product_question | 35 | 2 | 70 |
| promotion | 20 | 2 | 40 |
| chitchat | 10 | 4 | 40 |
| Total | 155 | — | 240 |
4.4 True faq (64 errors)
| Predicted | Count | Cost | Error points |
|---|---|---|---|
| checkout_help | 25 | 2 | 50 |
| return_request | 15 | 4 | 60 |
| order_tracking | 14 | 4 | 56 |
| chitchat | 10 | 5 | 50 |
| Total | 64 | — | 216 |
4.5 True order_tracking (72 errors)
| Predicted | Count | Cost | Error points |
|---|---|---|---|
| return_request | 30 | 4 | 120 |
| faq | 20 | 3 | 60 |
| checkout_help | 12 | 4 | 48 |
| chitchat | 10 | 6 | 60 |
| Total | 72 | — | 288 |
4.6 True return_request (63 errors)
| Predicted | Count | Cost | Error points |
|---|---|---|---|
| order_tracking | 28 | 4 | 112 |
| faq | 15 | 4 | 60 |
| checkout_help | 10 | 4 | 40 |
| chitchat | 10 | 6 | 60 |
| Total | 63 | — | 272 |
4.7 True promotion (56 errors)
| Predicted | Count | Cost | Error points |
|---|---|---|---|
| product_discovery | 26 | 1 | 26 |
| recommendation | 15 | 2 | 30 |
| product_question | 10 | 2 | 20 |
| chitchat | 5 | 4 | 20 |
| Total | 56 | — | 96 |
4.8 True checkout_help (46 errors)
| Predicted | Count | Cost | Error points |
|---|---|---|---|
| faq | 18 | 2 | 36 |
| order_tracking | 10 | 4 | 40 |
| return_request | 8 | 4 | 32 |
| chitchat | 10 | 6 | 60 |
| Total | 46 | — | 168 |
4.9 True escalation (34 errors)
| Predicted | Count | Cost | Error points |
|---|---|---|---|
| faq | 12 | 8 | 96 |
| order_tracking | 8 | 9 | 72 |
| return_request | 6 | 9 | 54 |
| chitchat | 4 | 10 | 40 |
| product_discovery | 4 | 10 | 40 |
| Total | 34 | — | 302 |
4.10 True chitchat (15 errors)
| Predicted | Count | Cost | Error points |
|---|---|---|---|
| product_discovery | 7 | 2 | 14 |
| faq | 4 | 2 | 8 |
| recommendation | 4 | 2 | 8 |
| Total | 15 | — | 30 |
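The ten breakdown tables can be cross-checked mechanically. The sketch below transcribes each (count, cost) pair from Step 4 and re-derives the totals used in Step 5:

```python
# Error breakdown from Step 4: true intent -> list of (count, cost) pairs.
ERRORS = {
    "product_discovery": [(85, 1), (55, 2), (20, 1), (20, 4)],
    "product_question":  [(55, 2), (30, 2), (10, 4), (10, 5)],
    "recommendation":    [(90, 1), (35, 2), (20, 2), (10, 4)],
    "faq":               [(25, 2), (15, 4), (14, 4), (10, 5)],
    "order_tracking":    [(30, 4), (20, 3), (12, 4), (10, 6)],
    "return_request":    [(28, 4), (15, 4), (10, 4), (10, 6)],
    "promotion":         [(26, 1), (15, 2), (10, 2), (5, 4)],
    "checkout_help":     [(18, 2), (10, 4), (8, 4), (10, 6)],
    "escalation":        [(12, 8), (8, 9), (6, 9), (4, 10), (4, 10)],
    "chitchat":          [(7, 2), (4, 2), (4, 2)],
}

total_errors = sum(n for rows in ERRORS.values() for n, _ in rows)
bwep = sum(n * c for rows in ERRORS.values() for n, c in rows)
harm_by_intent = {y: sum(n * c for n, c in rows) for y, rows in ERRORS.items()}

print(total_errors)                   # 790
print(bwep)                           # 2167
print(harm_by_intent["escalation"])   # 302
```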
Step 5 — Score Validation
5.1 Validate Error Count
Total errors:
[ 180 + 105 + 155 + 64 + 72 + 63 + 56 + 46 + 34 + 15 = 790 ]
Accuracy:
[ 1 - \frac{790}{10000} = 0.921 = 92.1\% ]
5.2 Validate Total Weighted Error Points
Row totals:
[ 295 + 260 + 240 + 216 + 288 + 272 + 96 + 168 + 302 + 30 = 2167 ]
So:
[ \text{BWEP} = 2167 ]
5.3 Validate Business-Weighted Harm per Request
[ \text{BWH} = \frac{2167}{10000} = 0.2167 ]
Interpretation:
- the model creates 0.2167 harm points per request
- equivalently 21.67 harm points per 100 requests
5.4 Validate Average Severity per Error
[ \frac{2167}{790} = 2.743 ]
So the average mistake has severity 2.74 on the 0–10 cost scale.
5.5 Validate Business-Weighted Score
[ \text{BW Score} = 100 \left(1 - \frac{2167}{10000 \cdot 10}\right) ]
[ = 100 (1 - 0.02167) = 97.833 ]
So:
- BW Score = 97.83
- Accuracy = 92.10
These are both correct because they measure different things:
- accuracy asks how often
- BW score asks how harmful
5.6 Validate Critical Error Rate
Critical errors are defined as cost ( \ge 8 ).
Only the following are critical in this worked example:
- `escalation` → `faq` = 12
- `escalation` → `order_tracking` = 8
- `escalation` → `return_request` = 6
- `escalation` → `chitchat` = 4
- `escalation` → `product_discovery` = 4
Total critical errors:
[ 12 + 8 + 6 + 4 + 4 = 34 ]
Critical error rate:
[ \frac{34}{10000} = 0.0034 = 0.34\% ]
Escalation miss rate:
[ \frac{34}{300} = 11.33\% ]
Step 6 — What the Score Tells Us
Harm by True Intent
| True intent | Error points | Requests | Harm per request |
|---|---|---|---|
| product_discovery | 295 | 2,200 | 0.134 |
| product_question | 260 | 1,500 | 0.173 |
| recommendation | 240 | 1,800 | 0.133 |
| faq | 216 | 800 | 0.270 |
| order_tracking | 288 | 1,200 | 0.240 |
| return_request | 272 | 700 | 0.389 |
| promotion | 96 | 500 | 0.192 |
| checkout_help | 168 | 400 | 0.420 |
| escalation | 302 | 300 | 1.007 |
| chitchat | 30 | 600 | 0.050 |
Interpretation
Even though escalation is only 3% of traffic, it contributes 302 error points, which is the largest harm bucket in the whole system.
That means:
- `escalation` should get special thresholds
- `escalation` may need a separate auxiliary detector
- the model should be optimized for business harm, not only support-weighted accuracy
Step 7 — Error Severity Distribution
By Number of Errors
| Severity bucket | Cost range | Error count | Share of all errors |
|---|---|---|---|
| Low | 1–2 | 499 | 63.16% |
| Medium | 3–5 | 227 | 28.73% |
| High / Critical | 6–10 | 64 | 8.10% |
| Total | — | 790 | 100% |
By Weighted Error Points
| Severity bucket | Error points | Share of all harm |
|---|---|---|
| Low | 777 | 35.86% |
| Medium | 908 | 41.90% |
| High / Critical | 482 | 22.24% |
| Total | 2167 | 100% |
Interpretation
Most mistakes are count-wise low harm, but the medium and high-cost mistakes create a disproportionate amount of business damage.
That is exactly why business-weighted evaluation is needed.
Step 8 — Use the Metric for Model Selection
Example: Two Candidate Models
| Metric | Model A | Model B |
|---|---|---|
| Accuracy | 92.1% | 91.8% |
| Rare-class accuracy | 88.7% | 90.0% |
| Weighted error points | 2167 | 1840 |
| Harm per request | 0.2167 | 0.1840 |
| Critical error rate | 0.34% | 0.18% |
| P95 latency | 12 ms | 13 ms |
Decision
A pure-accuracy team would ship Model A.
A business-aware team should probably ship Model B, because:
- it cuts weighted harm by:
[ \frac{2167 - 1840}{2167} = 15.1\% ]
- it nearly halves critical errors
- it is still within the latency budget
This is the key reason to add business-weighted metrics to your evaluation gate.
Step 9 — Use the Same Matrix at Decision Time
Offline scoring is one side. The more powerful step is to use the same cost matrix during routing.
Standard Argmax Decision
Normal classifier choice:
[ \hat{y}_{argmax} = \arg\max_a p(a \mid x) ]
This ignores business cost.
Cost-Sensitive Decision Rule
Instead choose the action that minimizes expected harm:
[ a^* = \arg\min_a \sum_{y} p(y \mid x)\, C(y, a) ]
This is very important.
It means a lower-probability action can still be better if it avoids expensive failures.
Worked Example
Suppose the calibrated probabilities are:
| Intent | Probability |
|---|---|
faq |
0.40 |
escalation |
0.32 |
order_tracking |
0.20 |
chitchat |
0.08 |
Argmax picks faq.
But expected harm is:
If action = faq
[ R(\text{faq}) = 0.40 \cdot 0 + 0.32 \cdot 8 + 0.20 \cdot 3 + 0.08 \cdot 2 = 3.32 ]
If action = escalation
Assume over-escalation costs are cheaper:
- `faq` → `escalation` = 2
- `order_tracking` → `escalation` = 3
- `chitchat` → `escalation` = 2
Then:
[ R(\text{escalation}) = 0.40 \cdot 2 + 0.32 \cdot 0 + 0.20 \cdot 3 + 0.08 \cdot 2 = 1.56 ]
So the safer choice is:
[ a^* = \text{escalation} ]
Meaning
Even though faq has the highest probability, routing to escalation is better because the cost of missing escalation is much larger than the cost of over-escalating.
This is one of the strongest additions you can make to a production routing system.
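The worked example can be reproduced directly from the decision rule. The sketch below fills the cost function with only the pairs the example needs; unlisted pairs default to 0 purely for brevity, which is not a real cost assignment:

```python
def min_risk_action(probs, cost, actions=None):
    """a* = argmin_a sum_y p(y|x) * C(y, a)."""
    actions = list(probs) if actions is None else actions
    risk = {a: sum(p * cost(y, a) for y, p in probs.items()) for a in actions}
    return min(risk, key=risk.get), risk

# Only the cost cells this example needs; unlisted pairs return 0 for brevity.
PAIR_COST = {
    ("escalation", "faq"): 8, ("order_tracking", "faq"): 3,
    ("chitchat", "faq"): 2, ("faq", "escalation"): 2,
    ("order_tracking", "escalation"): 3, ("chitchat", "escalation"): 2,
}

def cost(y, a):
    return 0 if y == a else PAIR_COST.get((y, a), 0)

probs = {"faq": 0.40, "escalation": 0.32, "order_tracking": 0.20, "chitchat": 0.08}
best, risk = min_risk_action(probs, cost, actions=["faq", "escalation"])
print(best)  # escalation
```

The same `min_risk_action` works for any calibrated classifier and any cost matrix, which is what makes the rule easy to hot-swap when costs change.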
Step 10 — Stage-by-Stage Decisions During Fine-Tuning and Deployment
Stage A — Label and Policy Design
Questions:
- Which errors are truly expensive?
- Which errors are acceptable fallback behavior?
- Should product_discovery ↔ recommendation be cost 0, 1, or 2?
- Should missing escalation ever be allowed?
Decisions:

- define the cost matrix with PM + support + ops
- make it directional
- keep the scale simple: 0–10
Stage B — Offline Evaluation
Questions:

- does the model meet accuracy?
- does it meet rare-class accuracy?
- what is weighted harm?
- which intent creates the most business harm?

Decisions:

- reject models with lower raw accuracy only if business harm is meaningfully lower
- inspect per-intent harm, not just the global score
Stage C — Calibration
Questions:

- are probabilities trustworthy enough for expected-risk routing?
- is confidence too sharp or too flat?

Decisions:

- apply temperature scaling
- validate ECE, Brier score, and NLL
- use calibrated probabilities before cost-sensitive routing
Stage D — Threshold and Fallback Policy
Questions:

- when should we escalate?
- when should we ask a clarifying question?
- when should we accept a low-cost confusion?
Decisions:
- if expected risk > threshold, send to safer fallback
- if p(escalation) above class-specific threshold, bias toward handoff
- if top-2 margin is small, trigger disambiguation
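The three Stage-D decisions can be sketched as one ordered policy. All threshold values below (`risk_threshold`, `escalation_bias`, `margin`) are illustrative assumptions; the document does not fix them:

```python
def route(probs, expected_risk, risk_threshold=2.5,
          escalation_bias=0.25, margin=0.10):
    """Sketch of the Stage-D policy (assumes probs has at least two intents).
    All three thresholds are illustrative, not tuned values."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    if min(expected_risk.values()) > risk_threshold:
        return "safe_fallback"            # every action is too risky
    if probs.get("escalation", 0.0) >= escalation_bias:
        return "escalation"               # bias toward human handoff
    if probs[ranked[0]] - probs[ranked[1]] < margin:
        return "clarify"                  # ambiguous top-2, ask a question
    return min(expected_risk, key=expected_risk.get)  # min-risk action
```

In production each threshold would be tuned per class on a validation set rather than shared globally.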
Stage E — Production Monitoring
Questions:

- are weighted harm and critical errors increasing?
- is the traffic mix changing?
- are some intents getting sharper or more confused?

Decisions:

- alert on the weighted-harm trend
- alert on the escalation-miss proxy
- retrain if weighted harm crosses the threshold even if accuracy still looks okay
Step 11 — Production Logs to Add
11.1 Per-Request Routing Log
```json
{
  "timestamp": "2026-04-21T14:22:10Z",
  "request_id": "req_9182",
  "text": "I want to talk to a person",
  "top1_intent": "faq",
  "top1_prob": 0.41,
  "top2_intent": "escalation",
  "top2_prob": 0.34,
  "top3_intent": "order_tracking",
  "top3_prob": 0.16,
  "expected_risk_faq": 3.28,
  "expected_risk_escalation": 1.52,
  "chosen_action": "escalation",
  "decision_policy": "cost_sensitive_v2",
  "model_version": "intent-distilbert-v12"
}
```
11.2 Aggregated Monitoring Log
```json
{
  "window": "2026-04-21T14:00:00Z/2026-04-21T15:00:00Z",
  "requests": 18240,
  "accuracy_sampled": 0.919,
  "rare_class_accuracy_sampled": 0.887,
  "weighted_error_points_sampled": 4012,
  "harm_per_request_sampled": 0.220,
  "critical_error_rate_sampled": 0.0031,
  "escalation_miss_rate_sampled": 0.109,
  "p95_latency_ms": 12.4,
  "kl_divergence": 0.016,
  "status": "healthy"
}
```
11.3 Alert Example
```json
{
  "alert_name": "business_harm_regression",
  "model_version": "intent-distilbert-v13",
  "trigger": "harm_per_request_sampled > 0.25 for 3 consecutive windows",
  "current_value": 0.287,
  "previous_champion": 0.216,
  "recommended_action": "rollback_to_v12"
}
```
Step 12 — Recommended Validation Gates
Use a mix of classical and business-aware gates.
| Gate | Threshold | Why |
|---|---|---|
| Overall accuracy | >= 92.0% | baseline quality |
| Rare-class accuracy | >= 88.5% | protects low-frequency intents |
| Business-weighted harm per request | <= 0.23 | keeps average harm low |
| Critical error rate | <= 0.35% | protects user trust |
| Escalation miss rate | <= 11% | protects human-handoff requests |
| P95 latency | <= 15 ms | keeps routing fast |
Deployment Rule
Deploy only if:
- accuracy does not regress materially
- weighted harm improves or remains within budget
- critical error rate does not increase
- latency stays within target
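The gate table and deployment rule combine into a simple release check. A minimal sketch; the metric key names are hypothetical, and the thresholds come from the table above:

```python
GATES = {  # thresholds from the validation-gate table
    "accuracy":             (0.920,  ">="),
    "rare_class_accuracy":  (0.885,  ">="),
    "harm_per_request":     (0.23,   "<="),
    "critical_error_rate":  (0.0035, "<="),
    "escalation_miss_rate": (0.11,   "<="),
    "p95_latency_ms":       (15.0,   "<="),
}

def passes_gates(metrics):
    """Return (ok, failed_gate_names) for a candidate model's metrics."""
    failed = [name for name, (limit, op) in GATES.items()
              if not (metrics[name] >= limit if op == ">=" else metrics[name] <= limit)]
    return not failed, failed
```

A single failed gate blocks deployment, and the returned names feed directly into the rollback alert payload.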
Step 13 — What New Things Can Be Added Next
These are strong next upgrades after this document.
1. Cost-Sensitive Training Loss
Instead of only using the cost matrix at evaluation time, inject it into training:
- weighted cross-entropy by error pair
- expected-cost loss
- focal loss plus cost-sensitive class pairs
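As a sketch of the expected-cost option: the loss below averages the cost row for the true intent under the model's softmax, so a confident correct prediction drives the loss toward zero. Pure Python, single example, no autograd — an illustration of the objective, not a training recipe:

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_cost_loss(logits, cost_row):
    """L(x, y) = sum_a softmax(logits)[a] * C(y, a).
    cost_row[a] = C(y, a) for this example's true intent y, with
    cost_row[y] = 0, so probability mass on the correct route costs nothing."""
    return sum(p * c for p, c in zip(softmax(logits), cost_row))
```

Unlike class-weighted cross-entropy, this objective sees the full pair-specific cost row, so confusing `escalation` with `chitchat` is penalized more than confusing it with `faq`.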
2. Separate Escalation Detector
Because escalation produces the highest harm per request, add:
- binary `needs_human` detector
- OR rule-based backup
- OR ensemble with semantic features
3. Margin + Risk Joint Policy
Use both:

- top1–top2 margin
- expected business risk
This reduces unsafe argmax decisions.
4. Route-Family First, Intent Second
Two-stage hierarchy:
- route family: shopping / commerce support / human handoff / chitchat
- exact intent inside family
This often lowers high-cost cross-family errors.
5. Delayed-Label Risk Monitoring
Some classes like return or escalation may get ground truth later. Add:

- delayed feedback joins
- support-ticket outcome labels
- business complaint rate by predicted route
Mermaid Diagram — How the Metric Fits Into the System
```mermaid
graph TD
    A[User message] --> B[Intent classifier]
    B --> C[Calibrated probabilities]
    C --> D[Expected-risk calculator]
    D --> E{Min-risk action}
    E -->|low-risk content route| F[Recommendation or product flow]
    E -->|support route| G[FAQ or commerce support flow]
    E -->|high-risk uncertainty| H[Escalate or clarify]
    F --> I[Logs]
    G --> I
    H --> I
    I --> J[Offline labeled evaluation]
    J --> K[Business-weighted error score]
    K --> L[Champion / challenger decision]
    L --> M[Deploy or rollback]
```
Minimal Python Validation Snippet
```python
total_requests = 10000
correct = 9210
errors = 790
weighted_error_points = 2167
critical_errors = 34

accuracy = correct / total_requests
harm_per_request = weighted_error_points / total_requests
avg_severity_per_error = weighted_error_points / errors
bw_score = 100 * (1 - weighted_error_points / (total_requests * 10))
critical_error_rate = critical_errors / total_requests

print(round(accuracy, 4))                # 0.921
print(round(harm_per_request, 4))        # 0.2167
print(round(avg_severity_per_error, 4))  # 2.743
print(round(bw_score, 3))                # 97.833
print(round(critical_error_rate, 4))     # 0.0034
Final Takeaway
The key idea is simple:
- Accuracy tells you how often the classifier is right.
- Business-weighted error tells you how damaging the mistakes are.
- Expected-risk routing lets you use the same business logic at serving time, not just in offline reports.
For MangaAssist, this matters because many content-intent confusions are cheap, while missing escalation, return_request, or order_tracking can be expensive.
That is why a top GenAI / ML engineer should add:
- business cost matrix
- weighted offline evaluation
- calibrated probabilities
- cost-sensitive routing
- business-aware monitoring and rollback gates
This makes the classifier not just more accurate, but more useful and safer in production.
Research-Grade Addendum
Where the Cost Matrix Came From (and Why a Research Scientist Would Push Back)
The cost matrix used above (8 for escalation → faq, 6 for order_tracking → chitchat, etc.) was hand-set by Sam (PM) using a four-input recipe:
- CSAT delta after each error type (measured A/B on 6 weeks of production)
- Operational cost: agent minutes consumed by an angry escalation, return-form abandonment rate
- Revenue impact: bounce rate and conversion uplift per intent route
- Brand harm priors: 5× weighting on safety-flagged intents
A research scientist would object that this matrix is a single point estimate. None of the four inputs were measured with CIs; the matrix is therefore wrong by some unknown amount. The right question is: how robust are the conclusions to perturbations of the matrix?
Cost-Matrix Sensitivity Analysis
We perturb each cell c_ij independently by ±50% and re-run the full evaluation. The metric of interest is system rank stability: does the same model still win?
Procedure.
1. For each cell (i, j) where c_ij > 0: scale by 1.5 and re-evaluate; scale by 0.5 and re-evaluate.
2. Record whether Model A (focal-loss DistilBERT) still beats the baseline (standard CE DistilBERT) on weighted error.
3. Compute the flip rate = fraction of perturbations that change the winner.
| Perturbation type | Flip rate | Worst-case Δ in weighted error |
|---|---|---|
| Single cell ±50% | 0/156 | 4.1% |
| All "high" cells ×1.5 simultaneously | 0/1 | 6.8% |
| All "low" cells ×0.5 simultaneously | 0/1 | 3.2% |
| Random Dirichlet(α=2) cost vectors per row, n=1,000 trials | 17/1,000 (1.7%) | 11.4% |
| Adversarial: cell maximizing flip risk found by grid search | 1/30 | 12.0% |
Reading. The chosen model (Model A) is robust to any single-cell perturbation up to ±50%. Under fully randomized cost matrices it loses to the baseline 1.7% of the time — i.e., we are confident at roughly 98% that Model A is the right pick. The single adversarial perturbation that flips the winner is a 50% under-estimate of the escalation → chitchat cell, which would mean we are dramatically over-investing in escalation handling. We treat this as a question for Sam to revalidate quarterly, not a model-selection question.
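The randomized part of the sweep can be sketched as follows. `flip_rate` perturbs every nonzero cell independently within ±50% on each trial and reports how often the winner changes — a simplified stand-in for the per-row Dirichlet procedure in the table; the function and argument names are hypothetical:

```python
import random

def flip_rate(base_cost, errors_a, errors_b,
              n_trials=1000, scale_lo=0.5, scale_hi=1.5, seed=0):
    """Fraction of random cost-matrix perturbations that change which model
    has the lower weighted error. errors_x maps (true, pred) -> error count."""
    rng = random.Random(seed)

    def weighted(errors, cost):
        return sum(n * cost.get(pair, 0.0) for pair, n in errors.items())

    a_wins = weighted(errors_a, base_cost) < weighted(errors_b, base_cost)
    flips = 0
    for _ in range(n_trials):
        perturbed = {pair: c * rng.uniform(scale_lo, scale_hi)
                     for pair, c in base_cost.items()}
        if (weighted(errors_a, perturbed) < weighted(errors_b, perturbed)) != a_wins:
            flips += 1
    return flips / n_trials
```

A flip rate near zero means the model ranking is robust to plausible mis-specification of the matrix.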
Confidence Intervals on the Headline Savings Claim
The $19K/month savings figure from the original calibration deep-dive is itself a derived statistic over (a) the per-error cost matrix, (b) the misroute rate distribution, and (c) the monthly request volume. A bootstrap procedure that resamples the production logs (B = 10,000) gives:
| Headline figure | Point estimate | 95% bootstrap CI |
|---|---|---|
| Weighted error rate (focal-loss model) | 0.0312 | [0.0288, 0.0341] |
| Weighted error rate (CE baseline) | 0.0427 | [0.0394, 0.0461] |
| Absolute reduction | 0.0115 | [0.0093, 0.0140] |
| Monthly $ savings (at $0.013 per harm-unit and 1.4M reqs/mo) | $19,330 | [$15,640, $23,020] |
| Critical-error rate (focal-loss) | 0.0034 | [0.0027, 0.0042] |
| Critical-error rate (CE baseline) | 0.0061 | [0.0051, 0.0073] |
Reading. The CI on $ savings is wide (~±$3.7K) because the per-harm-unit cost ($0.013) is itself a point estimate. Even at the lower bound, savings are $15.6K/month — large enough to justify the engineering investment and ongoing labeling cost. Recommendation: report the lower bound ($15.6K/month conservatively), not the point estimate, in business reviews.
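A percentile bootstrap over per-request costs is enough to reproduce CIs of this kind. A minimal sketch; the default statistic is BWH, the mean cost per request:

```python
import random

def bootstrap_ci(per_request_costs, stat=lambda xs: sum(xs) / len(xs),
                 n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a statistic of per-request harm
    (default: BWH). Resamples requests with replacement."""
    rng = random.Random(seed)
    n = len(per_request_costs)
    stats = sorted(stat([per_request_costs[rng.randrange(n)] for _ in range(n)])
                   for _ in range(n_boot))
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Bootstrapping the dollar figure additionally requires resampling the per-harm-unit cost, which is why its interval is wider than the weighted-error intervals.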
Comparative Methods: How Else Could We Make Errors Cost-Aware?
| Method | Where the cost shows up | Pros | Cons | Reference |
|---|---|---|---|---|
| Cost-matrix at evaluation only (baseline) | metric only | simple, transparent, training agnostic | model is still trained to minimize accuracy, not cost | Provost 2000 |
| Class-weighted CE with cost-derived weights | training loss | aligns training with deployment objective | weights are scalar per class — cannot encode pair-specific costs | He & Garcia 2009 |
| Cost-sensitive routing (chosen) | inference policy | model-agnostic; works with any calibrated classifier | requires good calibration; thresholds need re-tuning when costs change | Elkan 2001 |
| Example-dependent cost-sensitive learning | training loss + per-example cost | captures user-specific costs (VIP vs. anonymous) | needs cost label per example, often unavailable | Bahnsen 2014 |
| Direct cost-minimization training (CSL-DL) | end-to-end | optimal in theory | unstable training, no off-the-shelf libs | Dalvi 2004; Khan 2018 |
Reading. Cost-sensitive routing at inference time is the right architectural choice for MangaAssist because (a) we already have a calibrated classifier from §calibration, (b) costs are pair-specific (so scalar class weights are insufficient), and (c) the inference policy can be hot-swapped when costs change without retraining the model.
Failure-Mode Tree for Business-Weighted Routing
```mermaid
flowchart TD
    A[Weekly business KPI review] --> B{Symptom?}
    B -- weighted error rate ↑ ≥ 0.005 --> C{Source?}
    B -- critical error rate ↑ ≥ 0.001 --> D[Immediate page on-call audit last 7 days of escalation routes]
    B -- $ savings shrinks ≥ 25% MoM --> E{Cost matrix or model?}
    C -- specific intent pair --> F[Targeted retrain on that pair via active sampling]
    C -- broad --> G[Trigger calibration recheck then full retrain]
    D --> H[Roll routing thresholds back to last known-good values]
    E -- model accuracy stable --> I[Cost-matrix audit with PM and ops re-derive c_ij]
    E -- model accuracy degraded --> J[Retrain pipeline as in main doc]
    I --> K[Re-run sensitivity sweep with new matrix gate ≥ 95% rank stability]
```
Research Notes — failure tree. Citations: Provost 2000 (AAAI) — threshold moving as the cheapest cost-aware action; Elkan 2001 (IJCAI) — cost-sensitive decision theory; Bahnsen 2014 (J. Comp. Sci.) — example-dependent costs.
Open Problems
- Cost matrices drift. A return policy change, a new escalation playbook, or a refund-cost change can invalidate the matrix overnight. Today we re-derive it quarterly. Open question: can we learn the cost matrix from CSAT signals and ticket-resolution outcomes in near-real-time, treating it as another model that drifts?
- Per-user cost matrices. A VIP user's escalation mishandled is more costly than an anonymous user's. Bahnsen 2014's example-dependent CSL framework would let us encode this, but requires a per-request cost feature pipeline. Worth piloting on the top 1% of customers by LTV.
- Adversarial cost-matrix attacks. A red-team query that intentionally raises critical-error rate (e.g., obfuscated escalation phrasing) could shift weighted error without triggering accuracy alerts. Need an adversarial monitoring channel that perturbs production messages and tracks weighted-error response.
Bibliography (this file)
- Elkan, C. (2001). The Foundations of Cost-Sensitive Learning. IJCAI. — formal cost-matrix decision theory; optimal threshold `p* = c(0,1) / (c(0,1) + c(1,0))`.
- Provost, F. (2000). Machine Learning from Imbalanced Data Sets 101. AAAI Workshop. — threshold moving as a post-hoc cost-aware fix.
- Bahnsen, A. C., Aouada, D., Ottersten, B. (2014). Example-Dependent Cost-Sensitive Decision Trees. Expert Systems with Applications. — per-example cost.
- Dalvi, N., Domingos, P., Sanghai, S., Verma, D. (2004). Adversarial Classification. KDD. — costs under adversarial conditions.
- Khan, S. H., Hayat, M., Bennamoun, M., Sohel, F. A., Togneri, R. (2018). Cost-Sensitive Learning of Deep Feature Representations from Imbalanced Data. IEEE TNNLS. — direct cost-minimization training.
- He, H., Garcia, E. A. (2009). Learning from Imbalanced Data. IEEE TKDE. — class-weighted CE survey.
- Domingos, P. (1999). MetaCost: A General Method for Making Classifiers Cost-Sensitive. KDD. — wrapper approach.
- Bouthillier, X. et al. (2021). Accounting for Variance in Machine Learning Benchmarks. MLSys. — bootstrap CIs.
Citation count for this file: 8.