Business-Weighted Error Score for Intent Routing — MangaAssist
Why Accuracy Alone Is Not Enough
In the MangaAssist setup, the fine-tuned DistilBERT intent classifier reaches 92.1% overall accuracy, operates under a 15 ms P95 latency budget, and handles 10 intents with very different business impact. A wrong prediction is not always equally bad:
- `product_discovery` → `recommendation` is usually a low-harm mistake because both routes still help the user shop.
- `escalation` → `chitchat` is a high-harm mistake because a user asking for a human gets ignored.
- `order_tracking` → `chitchat` is much worse than `promotion` → `product_discovery`.
This document adds a Business-Weighted Error Score layer on top of the original fine-tuning document so model selection is based on business harm, not only raw accuracy.
This stays aligned with the original scenario:

- 10 intents
- 92.1% fine-tuned accuracy
- 88.6% rare-class accuracy
- 15 ms P95 routing budget
- same recommendation-family confusion patterns already described in the original MangaAssist write-up
What This Metric Solves
Problem
Two models can have nearly the same accuracy but very different business outcomes.
| Model | Accuracy | Major Error Pattern | Business Outcome |
|---|---|---|---|
| Model A | 92.1% | misses some escalation and order_tracking cases | risky |
| Model B | 91.8% | mostly confuses product_discovery with recommendation | safer |
If you optimize only for accuracy, you may ship the worse model.
Goal
We want a metric that answers:
“How much real business harm do the model’s mistakes create?”
Intent List and Traffic Mix
We use the same intent frequencies from the MangaAssist scenario.
| Intent | Frequency | Count in 10,000-request worked example |
|---|---|---|
| product_discovery | 22% | 2,200 |
| product_question | 15% | 1,500 |
| recommendation | 18% | 1,800 |
| faq | 8% | 800 |
| order_tracking | 12% | 1,200 |
| return_request | 7% | 700 |
| promotion | 5% | 500 |
| checkout_help | 4% | 400 |
| escalation | 3% | 300 |
| chitchat | 6% | 600 |
| Total | 100% | 10,000 |
Step 1 — Define a Business Cost Matrix
Let:
- true intent = ( y )
- predicted action / predicted intent = ( a )
- business cost of that decision = ( C(y, a) )
Correct predictions have zero cost:
[ C(y, y) = 0 ]
The cost matrix is directional and asymmetric.
Example:
- `escalation` → `faq` is very expensive
- `faq` → `escalation` is annoying but much cheaper
That means:
[ C(\text{escalation}, \text{faq}) \neq C(\text{faq}, \text{escalation}) ]
Recommended Cost Scale
Use a simple 0–10 scale:
| Cost | Meaning |
|---|---|
| 0 | correct |
| 1–2 | low harm, usually same journey or acceptable fallback |
| 3–5 | medium harm, extra friction or wrong flow |
| 6–8 | high harm, support failure or dead-end |
| 9–10 | critical harm, user safety / trust / human handoff failure |
Representative Directional Costs
| True intent | Predicted intent | Cost |
|---|---|---|
| product_discovery | recommendation | 1 |
| recommendation | product_discovery | 1 |
| product_question | product_discovery | 2 |
| promotion | product_discovery | 1 |
| faq | checkout_help | 2 |
| order_tracking | faq | 3 |
| order_tracking | return_request | 4 |
| return_request | order_tracking | 4 |
| checkout_help | faq | 2 |
| order_tracking | chitchat | 6 |
| return_request | chitchat | 6 |
| escalation | faq | 8 |
| escalation | order_tracking | 9 |
| escalation | product_discovery | 10 |
| escalation | chitchat | 10 |
| faq | escalation | 2 |
| chitchat | faq | 2 |
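For use in code, the matrix above can be held as a sparse mapping. A minimal sketch — the `DEFAULT_COST` fallback for pairs not in the representative table is an illustrative assumption, not part of the matrix:

```python
# Directional business-cost matrix C(true, predicted).
# Only the representative pairs from the table are filled in; DEFAULT_COST
# is an illustrative placeholder for unlisted wrong-route pairs.
DEFAULT_COST = 3

COST = {
    ("product_discovery", "recommendation"): 1,
    ("recommendation", "product_discovery"): 1,
    ("product_question", "product_discovery"): 2,
    ("promotion", "product_discovery"): 1,
    ("faq", "checkout_help"): 2,
    ("order_tracking", "faq"): 3,
    ("order_tracking", "return_request"): 4,
    ("return_request", "order_tracking"): 4,
    ("checkout_help", "faq"): 2,
    ("order_tracking", "chitchat"): 6,
    ("return_request", "chitchat"): 6,
    ("escalation", "faq"): 8,
    ("escalation", "order_tracking"): 9,
    ("escalation", "product_discovery"): 10,
    ("escalation", "chitchat"): 10,
    ("faq", "escalation"): 2,
    ("chitchat", "faq"): 2,
}

def cost(true_intent: str, predicted: str) -> int:
    """C(y, a): zero for a correct route, table value otherwise."""
    if true_intent == predicted:
        return 0
    return COST.get((true_intent, predicted), DEFAULT_COST)
```

Keeping the matrix sparse and directional makes the asymmetry explicit: `cost("escalation", "faq")` is 8 while `cost("faq", "escalation")` is only 2.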
Step 2 — Offline Metric Definitions
2.1 Business-Weighted Harm per Request
For (N) evaluated requests:
[ \text{BWH} = \frac{1}{N}\sum_{i=1}^{N} C(y_i, \hat{y}_i) ]
Interpretation:
- average harm points created per request
- lower is better
- range = 0 to 10
2.2 Business-Weighted Error Points
[ \text{BWEP} = \sum_{i=1}^{N} C(y_i, \hat{y}_i) ]
Interpretation:
- total harm points over the evaluation set
- useful for dashboarding and comparing model versions
2.3 Average Severity per Error
If (E) is the number of misclassified requests:
[ \text{Avg Severity per Error} = \frac{\text{BWEP}}{E} ]
Interpretation:
- how bad the average mistake is
- not how often the model is wrong, but how dangerous its wrong predictions are
2.4 Business-Weighted Score
Normalize harm to a 0–100 score using the maximum possible cost (C_{\max}=10):
[ \text{BW Score} = 100 \left(1 - \frac{\text{BWH}}{10}\right) ]
Equivalent form:
[ \text{BW Score} = 100 \left(1 - \frac{\text{BWEP}}{N \cdot 10}\right) ]
Interpretation:
- 100 = no business harm
- 0 = every request is routed in the worst possible way
- this is not the same thing as accuracy
2.5 Critical Error Rate
[ \text{Critical Error Rate} = \frac{\#\{\,i : C(y_i,\hat{y}_i)\ge 8\,\}}{N} ]
Interpretation:
- how often the model makes truly dangerous mistakes
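All five Step-2 metrics can be computed from a single list of per-request costs. A minimal sketch; the `critical_cutoff=8` default mirrors the definition in 2.5:

```python
def business_weighted_metrics(per_request_costs, c_max=10, critical_cutoff=8):
    """Step-2 metrics from one cost value C(y_i, y_hat_i) per evaluated request."""
    n = len(per_request_costs)
    bwep = sum(per_request_costs)                       # 2.2 total harm points
    bwh = bwep / n                                      # 2.1 harm per request
    errors = sum(1 for c in per_request_costs if c > 0)
    return {
        "BWEP": bwep,
        "BWH": bwh,
        "avg_severity_per_error": bwep / errors if errors else 0.0,  # 2.3
        "BW_score": 100 * (1 - bwh / c_max),            # 2.4
        "critical_error_rate":
            sum(1 for c in per_request_costs if c >= critical_cutoff) / n,  # 2.5
    }

# Tiny illustration: five requests, one critical miss (cost 8), one mild miss.
m = business_weighted_metrics([0, 0, 0, 8, 2])
print(m["BWH"], m["BW_score"])  # 2.0 80.0
```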
Step 3 — Fully Worked 10,000-Request Example
We build a concrete evaluation batch of 10,000 requests using the exact traffic mix above.
Per-Intent Correct Predictions
| Intent | Total | Correct | Accuracy |
|---|---|---|---|
| product_discovery | 2,200 | 2,020 | 91.82% |
| product_question | 1,500 | 1,395 | 93.00% |
| recommendation | 1,800 | 1,645 | 91.39% |
| faq | 800 | 736 | 92.00% |
| order_tracking | 1,200 | 1,128 | 94.00% |
| return_request | 700 | 637 | 91.00% |
| promotion | 500 | 444 | 88.80% |
| checkout_help | 400 | 354 | 88.50% |
| escalation | 300 | 266 | 88.67% |
| chitchat | 600 | 585 | 97.50% |
| Total | 10,000 | 9,210 | 92.10% |
Rare-Class Validation
Using the same rare class group from the original MangaAssist scenario:
- `promotion`
- `checkout_help`
- `escalation`
Correct rare-class predictions:
[ 444 + 354 + 266 = 1064 ]
Rare-class total:
[ 500 + 400 + 300 = 1200 ]
Rare-class accuracy:
[ \frac{1064}{1200} = 0.8867 = 88.67\% ]
This is consistent with the earlier ~88.6% rare-class accuracy.
Step 4 — Error Breakdown with Business Costs
Below is the worked error set. These are the 790 misclassified requests.
4.1 True product_discovery (180 errors)
| Predicted | Count | Cost | Error points |
|---|---|---|---|
| recommendation | 85 | 1 | 85 |
| product_question | 55 | 2 | 110 |
| promotion | 20 | 1 | 20 |
| chitchat | 20 | 4 | 80 |
| Total | 180 | — | 295 |
4.2 True product_question (105 errors)
| Predicted | Count | Cost | Error points |
|---|---|---|---|
| recommendation | 55 | 2 | 110 |
| product_discovery | 30 | 2 | 60 |
| faq | 10 | 4 | 40 |
| checkout_help | 10 | 5 | 50 |
| Total | 105 | — | 260 |
4.3 True recommendation (155 errors)
| Predicted | Count | Cost | Error points |
|---|---|---|---|
| product_discovery | 90 | 1 | 90 |
| product_question | 35 | 2 | 70 |
| promotion | 20 | 2 | 40 |
| chitchat | 10 | 4 | 40 |
| Total | 155 | — | 240 |
4.4 True faq (64 errors)
| Predicted | Count | Cost | Error points |
|---|---|---|---|
| checkout_help | 25 | 2 | 50 |
| return_request | 15 | 4 | 60 |
| order_tracking | 14 | 4 | 56 |
| chitchat | 10 | 5 | 50 |
| Total | 64 | — | 216 |
4.5 True order_tracking (72 errors)
| Predicted | Count | Cost | Error points |
|---|---|---|---|
| return_request | 30 | 4 | 120 |
| faq | 20 | 3 | 60 |
| checkout_help | 12 | 4 | 48 |
| chitchat | 10 | 6 | 60 |
| Total | 72 | — | 288 |
4.6 True return_request (63 errors)
| Predicted | Count | Cost | Error points |
|---|---|---|---|
| order_tracking | 28 | 4 | 112 |
| faq | 15 | 4 | 60 |
| checkout_help | 10 | 4 | 40 |
| chitchat | 10 | 6 | 60 |
| Total | 63 | — | 272 |
4.7 True promotion (56 errors)
| Predicted | Count | Cost | Error points |
|---|---|---|---|
| product_discovery | 26 | 1 | 26 |
| recommendation | 15 | 2 | 30 |
| product_question | 10 | 2 | 20 |
| chitchat | 5 | 4 | 20 |
| Total | 56 | — | 96 |
4.8 True checkout_help (46 errors)
| Predicted | Count | Cost | Error points |
|---|---|---|---|
| faq | 18 | 2 | 36 |
| order_tracking | 10 | 4 | 40 |
| return_request | 8 | 4 | 32 |
| chitchat | 10 | 6 | 60 |
| Total | 46 | — | 168 |
4.9 True escalation (34 errors)
| Predicted | Count | Cost | Error points |
|---|---|---|---|
| faq | 12 | 8 | 96 |
| order_tracking | 8 | 9 | 72 |
| return_request | 6 | 9 | 54 |
| chitchat | 4 | 10 | 40 |
| product_discovery | 4 | 10 | 40 |
| Total | 34 | — | 302 |
4.10 True chitchat (15 errors)
| Predicted | Count | Cost | Error points |
|---|---|---|---|
| product_discovery | 7 | 2 | 14 |
| faq | 4 | 2 | 8 |
| recommendation | 4 | 2 | 8 |
| Total | 15 | — | 30 |
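The ten breakdown tables can be cross-checked mechanically. The sketch below transcribes each (count, cost) pair from Step 4 and re-derives the totals used in Step 5:

```python
# Error breakdown from Step 4: true intent -> list of (count, cost) pairs.
ERRORS = {
    "product_discovery": [(85, 1), (55, 2), (20, 1), (20, 4)],
    "product_question":  [(55, 2), (30, 2), (10, 4), (10, 5)],
    "recommendation":    [(90, 1), (35, 2), (20, 2), (10, 4)],
    "faq":               [(25, 2), (15, 4), (14, 4), (10, 5)],
    "order_tracking":    [(30, 4), (20, 3), (12, 4), (10, 6)],
    "return_request":    [(28, 4), (15, 4), (10, 4), (10, 6)],
    "promotion":         [(26, 1), (15, 2), (10, 2), (5, 4)],
    "checkout_help":     [(18, 2), (10, 4), (8, 4), (10, 6)],
    "escalation":        [(12, 8), (8, 9), (6, 9), (4, 10), (4, 10)],
    "chitchat":          [(7, 2), (4, 2), (4, 2)],
}

total_errors = sum(n for rows in ERRORS.values() for n, _ in rows)
bwep = sum(n * c for rows in ERRORS.values() for n, c in rows)
harm_by_intent = {y: sum(n * c for n, c in rows) for y, rows in ERRORS.items()}

print(total_errors)                   # 790
print(bwep)                           # 2167
print(harm_by_intent["escalation"])   # 302
```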
Step 5 — Score Validation
5.1 Validate Error Count
Total errors:
[ 180 + 105 + 155 + 64 + 72 + 63 + 56 + 46 + 34 + 15 = 790 ]
Accuracy:
[ 1 - \frac{790}{10000} = 0.921 = 92.1\% ]
5.2 Validate Total Weighted Error Points
Row totals:
[ 295 + 260 + 240 + 216 + 288 + 272 + 96 + 168 + 302 + 30 = 2167 ]
So:
[ \text{BWEP} = 2167 ]
5.3 Validate Business-Weighted Harm per Request
[ \text{BWH} = \frac{2167}{10000} = 0.2167 ]
Interpretation:
- the model creates 0.2167 harm points per request
- equivalently 21.67 harm points per 100 requests
5.4 Validate Average Severity per Error
[ \frac{2167}{790} = 2.743 ]
So the average mistake has severity 2.74 on the 0–10 cost scale.
5.5 Validate Business-Weighted Score
[ \text{BW Score} = 100 \left(1 - \frac{2167}{10000 \cdot 10}\right) ]
[ = 100 (1 - 0.02167) = 97.833 ]
So:
- BW Score = 97.83
- Accuracy = 92.10
These are both correct because they measure different things:
- accuracy asks how often
- BW score asks how harmful
5.6 Validate Critical Error Rate
Critical errors are defined as cost ( \ge 8 ).
Only the following are critical in this worked example:
- `escalation` → `faq` = 12
- `escalation` → `order_tracking` = 8
- `escalation` → `return_request` = 6
- `escalation` → `chitchat` = 4
- `escalation` → `product_discovery` = 4
Total critical errors:
[ 12 + 8 + 6 + 4 + 4 = 34 ]
Critical error rate:
[ \frac{34}{10000} = 0.0034 = 0.34\% ]
Escalation miss rate:
[ \frac{34}{300} = 11.33\% ]
Step 6 — What the Score Tells Us
Harm by True Intent
| True intent | Error points | Requests | Harm per request |
|---|---|---|---|
| product_discovery | 295 | 2,200 | 0.134 |
| product_question | 260 | 1,500 | 0.173 |
| recommendation | 240 | 1,800 | 0.133 |
| faq | 216 | 800 | 0.270 |
| order_tracking | 288 | 1,200 | 0.240 |
| return_request | 272 | 700 | 0.389 |
| promotion | 96 | 500 | 0.192 |
| checkout_help | 168 | 400 | 0.420 |
| escalation | 302 | 300 | 1.007 |
| chitchat | 30 | 600 | 0.050 |
Interpretation
Even though escalation is only 3% of traffic, it contributes 302 error points, which is the largest harm bucket in the whole system.
That means:
- `escalation` should get special thresholds
- `escalation` may need a separate auxiliary detector
- the model should be optimized for business harm, not only support-weighted accuracy
Step 7 — Error Severity Distribution
By Number of Errors
| Severity bucket | Cost range | Error count | Share of all errors |
|---|---|---|---|
| Low | 1–2 | 499 | 63.16% |
| Medium | 3–5 | 227 | 28.73% |
| High / Critical | 6–10 | 64 | 8.10% |
| Total | — | 790 | 100% |
By Weighted Error Points
| Severity bucket | Error points | Share of all harm |
|---|---|---|
| Low | 777 | 35.86% |
| Medium | 908 | 41.90% |
| High / Critical | 482 | 22.24% |
| Total | 2167 | 100% |
Interpretation
Most mistakes are count-wise low harm, but the medium and high-cost mistakes create a disproportionate amount of business damage.
That is exactly why business-weighted evaluation is needed.
Step 8 — Use the Metric for Model Selection
Example: Two Candidate Models
| Metric | Model A | Model B |
|---|---|---|
| Accuracy | 92.1% | 91.8% |
| Rare-class accuracy | 88.7% | 90.0% |
| Weighted error points | 2167 | 1840 |
| Harm per request | 0.2167 | 0.1840 |
| Critical error rate | 0.34% | 0.18% |
| P95 latency | 12 ms | 13 ms |
Decision
A pure-accuracy team would ship Model A.
A business-aware team should probably ship Model B, because:
- it cuts weighted harm by:
[ \frac{2167 - 1840}{2167} = 15.1\% ]
- it nearly halves critical errors
- it is still within the latency budget
This is the key reason to add business-weighted metrics to your evaluation gate.
Step 9 — Use the Same Matrix at Decision Time
Offline scoring is one side. The more powerful step is to use the same cost matrix during routing.
Standard Argmax Decision
Normal classifier choice:
[ \hat{y}_{argmax} = \arg\max_a p(a \mid x) ]
This ignores business cost.
Cost-Sensitive Decision Rule
Instead choose the action that minimizes expected harm:
[ a^* = \arg\min_a \sum_{y} p(y \mid x)\, C(y, a) ]
This is very important.
It means a lower-probability action can still be better if it avoids expensive failures.
Worked Example
Suppose the calibrated probabilities are:
| Intent | Probability |
|---|---|
faq |
0.40 |
escalation |
0.32 |
order_tracking |
0.20 |
chitchat |
0.08 |
Argmax picks faq.
But expected harm is:
If action = faq
[ R(\text{faq}) = 0.40 \cdot 0 + 0.32 \cdot 8 + 0.20 \cdot 3 + 0.08 \cdot 2 = 3.32 ]
If action = escalation
Assume over-escalation costs are cheaper:
- `faq` → `escalation` = 2
- `order_tracking` → `escalation` = 3
- `chitchat` → `escalation` = 2
Then:
[ R(\text{escalation}) = 0.40 \cdot 2 + 0.32 \cdot 0 + 0.20 \cdot 3 + 0.08 \cdot 2 = 1.56 ]
So the safer choice is:
[ a^* = \text{escalation} ]
Meaning
Even though faq has the highest probability, routing to escalation is better because the cost of missing escalation is much larger than the cost of over-escalating.
This is one of the strongest additions you can make to a production routing system.
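The worked example can be reproduced directly from the decision rule. The sketch below fills the cost function with only the pairs the example needs; unlisted pairs default to 0 purely for brevity, which is not a real cost assignment:

```python
def min_risk_action(probs, cost, actions=None):
    """a* = argmin_a sum_y p(y|x) * C(y, a)."""
    actions = list(probs) if actions is None else actions
    risk = {a: sum(p * cost(y, a) for y, p in probs.items()) for a in actions}
    return min(risk, key=risk.get), risk

# Only the cost cells this example needs; unlisted pairs return 0 for brevity.
PAIR_COST = {
    ("escalation", "faq"): 8, ("order_tracking", "faq"): 3,
    ("chitchat", "faq"): 2, ("faq", "escalation"): 2,
    ("order_tracking", "escalation"): 3, ("chitchat", "escalation"): 2,
}

def cost(y, a):
    return 0 if y == a else PAIR_COST.get((y, a), 0)

probs = {"faq": 0.40, "escalation": 0.32, "order_tracking": 0.20, "chitchat": 0.08}
best, risk = min_risk_action(probs, cost, actions=["faq", "escalation"])
print(best)  # escalation
```

The same `min_risk_action` works for any calibrated classifier and any cost matrix, which is what makes the rule easy to hot-swap when costs change.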
Step 10 — Stage-by-Stage Decisions During Fine-Tuning and Deployment
Stage A — Label and Policy Design
Questions:
- Which errors are truly expensive?
- Which errors are acceptable fallback behavior?
- Should product_discovery ↔ recommendation be cost 0, 1, or 2?
- Should missing escalation ever be allowed?
Decisions:

- define the cost matrix with PM + support + ops
- make it directional
- keep the scale simple: 0–10
Stage B — Offline Evaluation
Questions:

- does the model meet accuracy?
- does it meet rare-class accuracy?
- what is weighted harm?
- which intent creates the most business harm?

Decisions:

- reject models with lower raw accuracy only if business harm is meaningfully lower
- inspect per-intent harm, not just the global score
Stage C — Calibration
Questions:

- are probabilities trustworthy enough for expected-risk routing?
- is confidence too sharp or too flat?

Decisions:

- apply temperature scaling
- validate ECE, Brier score, and NLL
- use calibrated probabilities before cost-sensitive routing
Stage D — Threshold and Fallback Policy
Questions:

- when should we escalate?
- when should we ask a clarifying question?
- when should we accept a low-cost confusion?
Decisions:
- if expected risk > threshold, send to safer fallback
- if p(escalation) above class-specific threshold, bias toward handoff
- if top-2 margin is small, trigger disambiguation
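The three Stage-D decisions can be sketched as one ordered policy. All threshold values below (`risk_threshold`, `escalation_bias`, `margin`) are illustrative assumptions; the document does not fix them:

```python
def route(probs, expected_risk, risk_threshold=2.5,
          escalation_bias=0.25, margin=0.10):
    """Sketch of the Stage-D policy (assumes probs has at least two intents).
    All three thresholds are illustrative, not tuned values."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    if min(expected_risk.values()) > risk_threshold:
        return "safe_fallback"            # every action is too risky
    if probs.get("escalation", 0.0) >= escalation_bias:
        return "escalation"               # bias toward human handoff
    if probs[ranked[0]] - probs[ranked[1]] < margin:
        return "clarify"                  # ambiguous top-2, ask a question
    return min(expected_risk, key=expected_risk.get)  # min-risk action
```

In production each threshold would be tuned per class on a validation set rather than shared globally.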
Stage E — Production Monitoring
Questions:

- are weighted harm and critical errors increasing?
- is the traffic mix changing?
- are some intents getting sharper or more confused?

Decisions:

- alert on the weighted-harm trend
- alert on the escalation-miss proxy
- retrain if weighted harm crosses the threshold even if accuracy still looks okay
Step 11 — Production Logs to Add
11.1 Per-Request Routing Log
```json
{
  "timestamp": "2026-04-21T14:22:10Z",
  "request_id": "req_9182",
  "text": "I want to talk to a person",
  "top1_intent": "faq",
  "top1_prob": 0.41,
  "top2_intent": "escalation",
  "top2_prob": 0.34,
  "top3_intent": "order_tracking",
  "top3_prob": 0.16,
  "expected_risk_faq": 3.28,
  "expected_risk_escalation": 1.52,
  "chosen_action": "escalation",
  "decision_policy": "cost_sensitive_v2",
  "model_version": "intent-distilbert-v12"
}
```
11.2 Aggregated Monitoring Log
```json
{
  "window": "2026-04-21T14:00:00Z/2026-04-21T15:00:00Z",
  "requests": 18240,
  "accuracy_sampled": 0.919,
  "rare_class_accuracy_sampled": 0.887,
  "weighted_error_points_sampled": 4012,
  "harm_per_request_sampled": 0.220,
  "critical_error_rate_sampled": 0.0031,
  "escalation_miss_rate_sampled": 0.109,
  "p95_latency_ms": 12.4,
  "kl_divergence": 0.016,
  "status": "healthy"
}
```
11.3 Alert Example
```json
{
  "alert_name": "business_harm_regression",
  "model_version": "intent-distilbert-v13",
  "trigger": "harm_per_request_sampled > 0.25 for 3 consecutive windows",
  "current_value": 0.287,
  "previous_champion": 0.216,
  "recommended_action": "rollback_to_v12"
}
```
Step 12 — Recommended Validation Gates
Use a mix of classical and business-aware gates.
| Gate | Threshold | Why |
|---|---|---|
| Overall accuracy | >= 92.0% | baseline quality |
| Rare-class accuracy | >= 88.5% | protects low-frequency intents |
| Business-weighted harm per request | <= 0.23 | keeps average harm low |
| Critical error rate | <= 0.35% | protects user trust |
| Escalation miss rate | <= 11% | protects human-handoff requests |
| P95 latency | <= 15 ms | keeps routing fast |
Deployment Rule
Deploy only if:
- accuracy does not regress materially
- weighted harm improves or remains within budget
- critical error rate does not increase
- latency stays within target
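The gate table and deployment rule combine into a simple release check. A minimal sketch; the metric key names are hypothetical, and the thresholds come from the table above:

```python
GATES = {  # thresholds from the validation-gate table
    "accuracy":             (0.920,  ">="),
    "rare_class_accuracy":  (0.885,  ">="),
    "harm_per_request":     (0.23,   "<="),
    "critical_error_rate":  (0.0035, "<="),
    "escalation_miss_rate": (0.11,   "<="),
    "p95_latency_ms":       (15.0,   "<="),
}

def passes_gates(metrics):
    """Return (ok, failed_gate_names) for a candidate model's metrics."""
    failed = [name for name, (limit, op) in GATES.items()
              if not (metrics[name] >= limit if op == ">=" else metrics[name] <= limit)]
    return not failed, failed
```

A single failed gate blocks deployment, and the returned names feed directly into the rollback alert payload.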
Step 13 — What New Things Can Be Added Next
These are strong next upgrades after this document.
1. Cost-Sensitive Training Loss
Instead of only using the cost matrix at evaluation time, inject it into training:
- weighted cross-entropy by error pair
- expected-cost loss
- focal loss plus cost-sensitive class pairs
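As a sketch of the expected-cost option: the loss below averages the cost row for the true intent under the model's softmax, so a confident correct prediction drives the loss toward zero. Pure Python, single example, no autograd — an illustration of the objective, not a training recipe:

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_cost_loss(logits, cost_row):
    """L(x, y) = sum_a softmax(logits)[a] * C(y, a).
    cost_row[a] = C(y, a) for this example's true intent y, with
    cost_row[y] = 0, so probability mass on the correct route costs nothing."""
    return sum(p * c for p, c in zip(softmax(logits), cost_row))
```

Unlike class-weighted cross-entropy, this objective sees the full pair-specific cost row, so confusing `escalation` with `chitchat` is penalized more than confusing it with `faq`.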
2. Separate Escalation Detector
Because escalation produces the highest harm per request, add:
- binary `needs_human` detector
- OR rule-based backup
- OR ensemble with semantic features
3. Margin + Risk Joint Policy
Use both:

- top1–top2 margin
- expected business risk
This reduces unsafe argmax decisions.
4. Route-Family First, Intent Second
Two-stage hierarchy:
- route family: shopping / commerce support / human handoff / chitchat
- exact intent inside family
This often lowers high-cost cross-family errors.
5. Delayed-Label Risk Monitoring
Some classes like return or escalation may get ground truth later. Add:

- delayed feedback joins
- support-ticket outcome labels
- business complaint rate by predicted route
Mermaid Diagram — How the Metric Fits Into the System
```mermaid
graph TD
    A[User message] --> B[Intent classifier]
    B --> C[Calibrated probabilities]
    C --> D[Expected-risk calculator]
    D --> E{Min-risk action}
    E -->|low-risk content route| F[Recommendation or product flow]
    E -->|support route| G[FAQ or commerce support flow]
    E -->|high-risk uncertainty| H[Escalate or clarify]
    F --> I[Logs]
    G --> I
    H --> I
    I --> J[Offline labeled evaluation]
    J --> K[Business-weighted error score]
    K --> L[Champion / challenger decision]
    L --> M[Deploy or rollback]
```
Minimal Python Validation Snippet
```python
total_requests = 10000
correct = 9210
errors = 790
weighted_error_points = 2167
critical_errors = 34

accuracy = correct / total_requests
harm_per_request = weighted_error_points / total_requests
avg_severity_per_error = weighted_error_points / errors
bw_score = 100 * (1 - weighted_error_points / (total_requests * 10))
critical_error_rate = critical_errors / total_requests

print(round(accuracy, 4))                # 0.921
print(round(harm_per_request, 4))        # 0.2167
print(round(avg_severity_per_error, 4))  # 2.743
print(round(bw_score, 3))                # 97.833
print(round(critical_error_rate, 4))     # 0.0034
Final Takeaway
The key idea is simple:
- Accuracy tells you how often the classifier is right.
- Business-weighted error tells you how damaging the mistakes are.
- Expected-risk routing lets you use the same business logic at serving time, not just in offline reports.
For MangaAssist, this matters because many content-intent confusions are cheap, while missing escalation, return_request, or order_tracking can be expensive.
That is why a top GenAI / ML engineer should add:
- business cost matrix
- weighted offline evaluation
- calibrated probabilities
- cost-sensitive routing
- business-aware monitoring and rollback gates
This makes the classifier not just more accurate, but more useful and safer in production.
Research-Grade Addendum
Where the Cost Matrix Came From (and Why a Research Scientist Would Push Back)
The cost matrix used above (8 for escalation → faq, 6 for order_tracking → chitchat, etc.) was hand-set by Sam (PM) using a four-input recipe:
- CSAT delta after each error type (measured A/B on 6 weeks of production)
- Operational cost: agent minutes consumed by an angry escalation, return-form abandonment rate
- Revenue impact: bounce rate and conversion uplift per intent route
- Brand harm priors: 5× weighting on safety-flagged intents
A research scientist would object that this matrix is a single point estimate. None of the four inputs were measured with CIs; the matrix is therefore wrong by some unknown amount. The right question is: how robust are the conclusions to perturbations of the matrix?
Cost-Matrix Sensitivity Analysis
We perturb each cell c_ij independently by ±50% and re-run the full evaluation. The metric of interest is system rank stability: does the same model still win?
Procedure.
1. For each cell (i, j) where c_ij > 0: scale by 1.5 and re-evaluate; scale by 0.5 and re-evaluate.
2. Record whether Model A (focal-loss DistilBERT) still beats the baseline (standard CE DistilBERT) on weighted error.
3. Compute the flip rate = fraction of perturbations that change the winner.
| Perturbation type | Flip rate | Worst-case Δ in weighted error |
|---|---|---|
| Single cell ±50% | 0/156 | 4.1% |
| All "high" cells ×1.5 simultaneously | 0/1 | 6.8% |
| All "low" cells ×0.5 simultaneously | 0/1 | 3.2% |
| Random Dirichlet(α=2) cost vectors per row, n=1,000 trials | 17/1,000 (1.7%) | 11.4% |
| Adversarial: cell maximizing flip risk found by grid search | 1/30 | 12.0% |
Reading. The chosen model (Model A) is robust to any single-cell perturbation up to ±50%. Under fully randomized cost matrices it loses to the baseline 1.7% of the time — i.e., we are confident at roughly 98% that Model A is the right pick. The single adversarial perturbation that flips the winner is a 50% under-estimate of the escalation → chitchat cell, which would mean we are dramatically over-investing in escalation handling. We treat this as a question for Sam to revalidate quarterly, not a model-selection question.
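The randomized part of the sweep can be sketched as follows. `flip_rate` perturbs every nonzero cell independently within ±50% on each trial and reports how often the winner changes — a simplified stand-in for the per-row Dirichlet procedure in the table; the function and argument names are hypothetical:

```python
import random

def flip_rate(base_cost, errors_a, errors_b,
              n_trials=1000, scale_lo=0.5, scale_hi=1.5, seed=0):
    """Fraction of random cost-matrix perturbations that change which model
    has the lower weighted error. errors_x maps (true, pred) -> error count."""
    rng = random.Random(seed)

    def weighted(errors, cost):
        return sum(n * cost.get(pair, 0.0) for pair, n in errors.items())

    a_wins = weighted(errors_a, base_cost) < weighted(errors_b, base_cost)
    flips = 0
    for _ in range(n_trials):
        perturbed = {pair: c * rng.uniform(scale_lo, scale_hi)
                     for pair, c in base_cost.items()}
        if (weighted(errors_a, perturbed) < weighted(errors_b, perturbed)) != a_wins:
            flips += 1
    return flips / n_trials
```

A flip rate near zero means the model ranking is robust to plausible mis-specification of the matrix.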
Confidence Intervals on the Headline Savings Claim
The $19K/month savings figure from the original calibration deep-dive is itself a derived statistic over (a) the per-error cost matrix, (b) the misroute rate distribution, and (c) the monthly request volume. A bootstrap procedure that resamples the production logs (B = 10,000) gives:
| Headline figure | Point estimate | 95% bootstrap CI |
|---|---|---|
| Weighted error rate (focal-loss model) | 0.0312 | [0.0288, 0.0341] |
| Weighted error rate (CE baseline) | 0.0427 | [0.0394, 0.0461] |
| Absolute reduction | 0.0115 | [0.0093, 0.0140] |
| Monthly $ savings (at $0.013 per harm-unit and 1.4M reqs/mo) | $19,330 | [$15,640, $23,020] |
| Critical-error rate (focal-loss) | 0.0034 | [0.0027, 0.0042] |
| Critical-error rate (CE baseline) | 0.0061 | [0.0051, 0.0073] |
Reading. The CI on $ savings is wide (~±$3.7K) because the per-harm-unit cost ($0.013) is itself a point estimate. Even at the lower bound, savings are $15.6K/month — large enough to justify the engineering investment and ongoing labeling cost. Recommendation: report the lower bound ($15.6K/month conservatively), not the point estimate, in business reviews.
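A percentile bootstrap over per-request costs is enough to reproduce CIs of this kind. A minimal sketch; the default statistic is BWH, the mean cost per request:

```python
import random

def bootstrap_ci(per_request_costs, stat=lambda xs: sum(xs) / len(xs),
                 n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a statistic of per-request harm
    (default: BWH). Resamples requests with replacement."""
    rng = random.Random(seed)
    n = len(per_request_costs)
    stats = sorted(stat([per_request_costs[rng.randrange(n)] for _ in range(n)])
                   for _ in range(n_boot))
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Bootstrapping the dollar figure additionally requires resampling the per-harm-unit cost, which is why its interval is wider than the weighted-error intervals.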
Comparative Methods: How Else Could We Make Errors Cost-Aware?
| Method | Where the cost shows up | Pros | Cons | Reference |
|---|---|---|---|---|
| Cost-matrix at evaluation only (baseline) | metric only | simple, transparent, training agnostic | model is still trained to minimize accuracy, not cost | Provost 2000 |
| Class-weighted CE with cost-derived weights | training loss | aligns training with deployment objective | weights are scalar per class — cannot encode pair-specific costs | He & Garcia 2009 |
| Cost-sensitive routing (chosen) | inference policy | model-agnostic; works with any calibrated classifier | requires good calibration; thresholds need re-tuning when costs change | Elkan 2001 |
| Example-dependent cost-sensitive learning | training loss + per-example cost | captures user-specific costs (VIP vs. anonymous) | needs cost label per example, often unavailable | Bahnsen 2014 |
| Direct cost-minimization training (CSL-DL) | end-to-end | optimal in theory | unstable training, no off-the-shelf libs | Dalvi 2004; Khan 2018 |
Reading. Cost-sensitive routing at inference time is the right architectural choice for MangaAssist because (a) we already have a calibrated classifier from §calibration, (b) costs are pair-specific (so scalar class weights are insufficient), and (c) the inference policy can be hot-swapped when costs change without retraining the model.
Failure-Mode Tree for Business-Weighted Routing
```mermaid
flowchart TD
    A[Weekly business KPI review] --> B{Symptom?}
    B -- weighted error rate ↑ ≥ 0.005 --> C{Source?}
    B -- critical error rate ↑ ≥ 0.001 --> D[Immediate page on-call audit last 7 days of escalation routes]
    B -- $ savings shrinks ≥ 25% MoM --> E{Cost matrix or model?}
    C -- specific intent pair --> F[Targeted retrain on that pair via active sampling]
    C -- broad --> G[Trigger calibration recheck then full retrain]
    D --> H[Roll routing thresholds back to last known-good values]
    E -- model accuracy stable --> I[Cost-matrix audit with PM and ops re-derive c_ij]
    E -- model accuracy degraded --> J[Retrain pipeline as in main doc]
    I --> K[Re-run sensitivity sweep with new matrix gate ≥ 95% rank stability]
```
Research Notes — failure tree. Citations: Provost 2000 (AAAI) — threshold moving as the cheapest cost-aware action; Elkan 2001 (IJCAI) — cost-sensitive decision theory; Bahnsen 2014 (J. Comp. Sci.) — example-dependent costs.
Open Problems
- Cost matrices drift. A return policy change, a new escalation playbook, or a refund-cost change can invalidate the matrix overnight. Today we re-derive it quarterly. Open question: can we learn the cost matrix from CSAT signals and ticket-resolution outcomes in near-real-time, treating it as another model that drifts?
- Per-user cost matrices. A VIP user's escalation mishandled is more costly than an anonymous user's. Bahnsen 2014's example-dependent CSL framework would let us encode this, but requires a per-request cost feature pipeline. Worth piloting on the top 1% of customers by LTV.
- Adversarial cost-matrix attacks. A red-team query that intentionally raises critical-error rate (e.g., obfuscated escalation phrasing) could shift weighted error without triggering accuracy alerts. Need an adversarial monitoring channel that perturbs production messages and tracks weighted-error response.
Bibliography (this file)
- Elkan, C. (2001). The Foundations of Cost-Sensitive Learning. IJCAI. — formal cost-matrix decision theory; optimal threshold `p* = c(0,1) / (c(0,1) + c(1,0))`.
- Provost, F. (2000). Machine Learning from Imbalanced Data Sets 101. AAAI Workshop. — threshold moving as a post-hoc cost-aware fix.
- Bahnsen, A. C., Aouada, D., Ottersten, B. (2014). Example-Dependent Cost-Sensitive Decision Trees. Expert Systems with Applications. — per-example cost.
- Dalvi, N., Domingos, P., Sanghai, S., Verma, D. (2004). Adversarial Classification. KDD. — costs under adversarial conditions.
- Khan, S. H., Hayat, M., Bennamoun, M., Sohel, F. A., Togneri, R. (2018). Cost-Sensitive Learning of Deep Feature Representations from Imbalanced Data. IEEE TNNLS. — direct cost-minimization training.
- He, H., Garcia, E. A. (2009). Learning from Imbalanced Data. IEEE TKDE. — class-weighted CE survey.
- Domingos, P. (1999). MetaCost: A General Method for Making Classifiers Cost-Sensitive. KDD. — wrapper approach.
- Bouthillier, X. et al. (2021). Accounting for Variance in Machine Learning Benchmarks. MLSys. — bootstrap CIs.
Citation count for this file: 8.