
Business-Weighted Error Score for Intent Routing — MangaAssist

Why Accuracy Alone Is Not Enough

In the MangaAssist setup, the fine-tuned DistilBERT intent classifier reaches 92.1% overall accuracy, operates under a 15 ms P95 latency budget, and handles 10 intents with very different business impact. A wrong prediction is not always equally bad:

  • product_discovery → recommendation is usually a low-harm mistake because both routes still help the user shop.
  • escalation → chitchat is a high-harm mistake because a user asking for a human gets ignored.
  • order_tracking → chitchat is much worse than promotion → product_discovery.

This document adds a Business-Weighted Error Score layer on top of the original fine-tuning document so model selection is based on business harm, not only raw accuracy.

This stays aligned with the original scenario:

  • 10 intents
  • 92.1% fine-tuned accuracy
  • 88.6% rare-class accuracy
  • 15 ms P95 routing budget
  • the same recommendation-family confusion patterns already described in the original MangaAssist write-up


What This Metric Solves

Problem

Two models can have nearly the same accuracy but very different business outcomes.

| Model | Accuracy | Major error pattern | Business outcome |
|---|---|---|---|
| Model A | 92.1% | misses some escalation and order_tracking cases | risky |
| Model B | 91.8% | mostly confuses product_discovery with recommendation | safer |

If you optimize only for accuracy, you may ship the worse model.

Goal

We want a metric that answers:

“How much real business harm do the model’s mistakes create?”


Intent List and Traffic Mix

We use the same intent frequencies from the MangaAssist scenario.

| Intent | Frequency | Count in 10,000-request worked example |
|---|---|---|
| product_discovery | 22% | 2,200 |
| product_question | 15% | 1,500 |
| recommendation | 18% | 1,800 |
| faq | 8% | 800 |
| order_tracking | 12% | 1,200 |
| return_request | 7% | 700 |
| promotion | 5% | 500 |
| checkout_help | 4% | 400 |
| escalation | 3% | 300 |
| chitchat | 6% | 600 |
| Total | 100% | 10,000 |

Step 1 — Define a Business Cost Matrix

Let:

  • true intent = ( y )
  • predicted action / predicted intent = ( a )
  • business cost of that decision = ( C(y, a) )

Correct predictions have zero cost:

[ C(y, y) = 0 ]

The cost matrix is directional and asymmetric.

Example:

  • escalation → faq is very expensive
  • faq → escalation is annoying but much cheaper

That means:

[ C(\text{escalation}, \text{faq}) \neq C(\text{faq}, \text{escalation}) ]

Use a simple 0–10 scale:

| Cost | Meaning |
|---|---|
| 0 | correct |
| 1–2 | low harm, usually same journey or acceptable fallback |
| 3–5 | medium harm, extra friction or wrong flow |
| 6–8 | high harm, support failure or dead-end |
| 9–10 | critical harm, user safety / trust / human handoff failure |

Representative Directional Costs

| True intent | Predicted intent | Cost |
|---|---|---|
| product_discovery | recommendation | 1 |
| recommendation | product_discovery | 1 |
| product_question | product_discovery | 2 |
| promotion | product_discovery | 1 |
| faq | checkout_help | 2 |
| order_tracking | faq | 3 |
| order_tracking | return_request | 4 |
| return_request | order_tracking | 4 |
| checkout_help | faq | 2 |
| order_tracking | chitchat | 6 |
| return_request | chitchat | 6 |
| escalation | faq | 8 |
| escalation | order_tracking | 9 |
| escalation | product_discovery | 10 |
| escalation | chitchat | 10 |
| faq | escalation | 2 |
| chitchat | faq | 2 |
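
One minimal way to encode this matrix in code (a sketch, not part of the original pipeline) is a mapping keyed by (true, predicted) pairs with an explicit default; the DEFAULT_COST of 3 for unlisted confusions is an assumption for illustration only.

# Directional business cost matrix C[(true_intent, predicted_intent)].
# Values copied from the representative table above; pairs not listed there
# fall back to DEFAULT_COST, which is an assumed placeholder value.
DEFAULT_COST = 3  # assumption: unlisted confusions are treated as medium harm

COSTS = {
    ("product_discovery", "recommendation"): 1,
    ("recommendation", "product_discovery"): 1,
    ("product_question", "product_discovery"): 2,
    ("promotion", "product_discovery"): 1,
    ("faq", "checkout_help"): 2,
    ("order_tracking", "faq"): 3,
    ("order_tracking", "return_request"): 4,
    ("return_request", "order_tracking"): 4,
    ("checkout_help", "faq"): 2,
    ("order_tracking", "chitchat"): 6,
    ("return_request", "chitchat"): 6,
    ("escalation", "faq"): 8,
    ("escalation", "order_tracking"): 9,
    ("escalation", "product_discovery"): 10,
    ("escalation", "chitchat"): 10,
    ("faq", "escalation"): 2,
    ("chitchat", "faq"): 2,
}

def cost(true_intent, predicted_intent):
    """Directional cost C(y, a); correct predictions cost 0."""
    if true_intent == predicted_intent:
        return 0
    return COSTS.get((true_intent, predicted_intent), DEFAULT_COST)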

Step 2 — Offline Metric Definitions

2.1 Business-Weighted Harm per Request

For (N) evaluated requests:

[ \text{BWH} = \frac{1}{N}\sum_{i=1}^{N} C(y_i, \hat{y}_i) ]

Interpretation:

  • average harm points created per request
  • lower is better
  • range = 0 to 10

2.2 Business-Weighted Error Points

[ \text{BWEP} = \sum_{i=1}^{N} C(y_i, \hat{y}_i) ]

Interpretation:

  • total harm points over the evaluation set
  • useful for dashboarding and comparing model versions

2.3 Average Severity per Error

If (E) is the number of misclassified requests:

[ \text{Avg Severity per Error} = \frac{\text{BWEP}}{E} ]

Interpretation:

  • how bad the average mistake is
  • not how often the model is wrong, but how dangerous its wrong predictions are

2.4 Business-Weighted Score

Normalize harm to a 0–100 score using the maximum possible cost (C_{\max}=10):

[ \text{BW Score} = 100 \left(1 - \frac{\text{BWH}}{10}\right) ]

Equivalent form:

[ \text{BW Score} = 100 \left(1 - \frac{\text{BWEP}}{N \cdot 10}\right) ]

Interpretation:

  • 100 = no business harm
  • 0 = every request is routed in the worst possible way
  • this is not the same thing as accuracy

2.5 Critical Error Rate

[ \text{Critical Error Rate} = \frac{\#\{\, i : C(y_i,\hat{y}_i)\ge 8 \,\}}{N} ]

Interpretation:

  • how often the model makes truly dangerous mistakes
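
The five definitions above can be computed in a few lines from a labeled evaluation set. A minimal sketch, assuming a list of (true_intent, predicted_intent) pairs and the illustrative cost() helper from Step 1:

def business_weighted_metrics(pairs, cost_fn, critical_threshold=8, max_cost=10):
    """pairs: list of (true_intent, predicted_intent); cost_fn: e.g. cost() from Step 1."""
    n = len(pairs)
    costs = [cost_fn(y, a) for y, a in pairs]
    errors = sum(1 for c in costs if c > 0)
    bwep = sum(costs)                                   # 2.2 total harm points
    bwh = bwep / n                                      # 2.1 harm per request
    avg_severity = bwep / errors if errors else 0.0     # 2.3 severity per error
    bw_score = 100 * (1 - bwh / max_cost)               # 2.4 normalized 0-100 score
    critical_rate = sum(1 for c in costs if c >= critical_threshold) / n  # 2.5
    return {
        "BWH": bwh,
        "BWEP": bwep,
        "avg_severity_per_error": avg_severity,
        "BW_score": bw_score,
        "critical_error_rate": critical_rate,
    }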

Step 3 — Fully Worked 10,000-Request Example

We build a concrete evaluation batch of 10,000 requests using the exact traffic mix above.

Per-Intent Correct Predictions

| Intent | Total | Correct | Accuracy |
|---|---|---|---|
| product_discovery | 2,200 | 2,020 | 91.82% |
| product_question | 1,500 | 1,395 | 93.00% |
| recommendation | 1,800 | 1,645 | 91.39% |
| faq | 800 | 736 | 92.00% |
| order_tracking | 1,200 | 1,128 | 94.00% |
| return_request | 700 | 637 | 91.00% |
| promotion | 500 | 444 | 88.80% |
| checkout_help | 400 | 354 | 88.50% |
| escalation | 300 | 266 | 88.67% |
| chitchat | 600 | 585 | 97.50% |
| Total | 10,000 | 9,210 | 92.10% |

Rare-Class Validation

Using the same rare class group from the original MangaAssist scenario:

  • promotion
  • checkout_help
  • escalation

Correct rare-class predictions:

[ 444 + 354 + 266 = 1064 ]

Rare-class total:

[ 500 + 400 + 300 = 1200 ]

Rare-class accuracy:

[ \frac{1064}{1200} = 0.8867 = 88.67\% ]

This is consistent with the earlier ~88.6% rare-class accuracy.
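
A few lines of Python are enough to re-check the overall and rare-class numbers; the dictionary below simply re-keys the per-intent table above.

# (total, correct) per intent, copied from the per-intent table above.
per_intent = {
    "product_discovery": (2200, 2020), "product_question": (1500, 1395),
    "recommendation": (1800, 1645),    "faq": (800, 736),
    "order_tracking": (1200, 1128),    "return_request": (700, 637),
    "promotion": (500, 444),           "checkout_help": (400, 354),
    "escalation": (300, 266),          "chitchat": (600, 585),
}
RARE_CLASSES = {"promotion", "checkout_help", "escalation"}

total = sum(t for t, _ in per_intent.values())
correct = sum(c for _, c in per_intent.values())
rare_total = sum(t for name, (t, _) in per_intent.items() if name in RARE_CLASSES)
rare_correct = sum(c for name, (_, c) in per_intent.items() if name in RARE_CLASSES)

print(correct / total)            # 0.921      -> 92.1% overall accuracy
print(rare_correct / rare_total)  # 0.8866...  -> ~88.67% rare-class accuracy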


Step 4 — Error Breakdown with Business Costs

Below is the worked error set. These are the 790 misclassified requests.

4.1 True product_discovery (180 errors)

| Predicted | Count | Cost | Error points |
|---|---|---|---|
| recommendation | 85 | 1 | 85 |
| product_question | 55 | 2 | 110 |
| promotion | 20 | 1 | 20 |
| chitchat | 20 | 4 | 80 |
| Total | 180 | | 295 |

4.2 True product_question (105 errors)

| Predicted | Count | Cost | Error points |
|---|---|---|---|
| recommendation | 55 | 2 | 110 |
| product_discovery | 30 | 2 | 60 |
| faq | 10 | 4 | 40 |
| checkout_help | 10 | 5 | 50 |
| Total | 105 | | 260 |

4.3 True recommendation (155 errors)

| Predicted | Count | Cost | Error points |
|---|---|---|---|
| product_discovery | 90 | 1 | 90 |
| product_question | 35 | 2 | 70 |
| promotion | 20 | 2 | 40 |
| chitchat | 10 | 4 | 40 |
| Total | 155 | | 240 |

4.4 True faq (64 errors)

| Predicted | Count | Cost | Error points |
|---|---|---|---|
| checkout_help | 25 | 2 | 50 |
| return_request | 15 | 4 | 60 |
| order_tracking | 14 | 4 | 56 |
| chitchat | 10 | 5 | 50 |
| Total | 64 | | 216 |

4.5 True order_tracking (72 errors)

| Predicted | Count | Cost | Error points |
|---|---|---|---|
| return_request | 30 | 4 | 120 |
| faq | 20 | 3 | 60 |
| checkout_help | 12 | 4 | 48 |
| chitchat | 10 | 6 | 60 |
| Total | 72 | | 288 |

4.6 True return_request (63 errors)

| Predicted | Count | Cost | Error points |
|---|---|---|---|
| order_tracking | 28 | 4 | 112 |
| faq | 15 | 4 | 60 |
| checkout_help | 10 | 4 | 40 |
| chitchat | 10 | 6 | 60 |
| Total | 63 | | 272 |

4.7 True promotion (56 errors)

| Predicted | Count | Cost | Error points |
|---|---|---|---|
| product_discovery | 26 | 1 | 26 |
| recommendation | 15 | 2 | 30 |
| product_question | 10 | 2 | 20 |
| chitchat | 5 | 4 | 20 |
| Total | 56 | | 96 |

4.8 True checkout_help (46 errors)

| Predicted | Count | Cost | Error points |
|---|---|---|---|
| faq | 18 | 2 | 36 |
| order_tracking | 10 | 4 | 40 |
| return_request | 8 | 4 | 32 |
| chitchat | 10 | 6 | 60 |
| Total | 46 | | 168 |

4.9 True escalation (34 errors)

| Predicted | Count | Cost | Error points |
|---|---|---|---|
| faq | 12 | 8 | 96 |
| order_tracking | 8 | 9 | 72 |
| return_request | 6 | 9 | 54 |
| chitchat | 4 | 10 | 40 |
| product_discovery | 4 | 10 | 40 |
| Total | 34 | | 302 |

4.10 True chitchat (15 errors)

| Predicted | Count | Cost | Error points |
|---|---|---|---|
| product_discovery | 7 | 2 | 14 |
| faq | 4 | 2 | 8 |
| recommendation | 4 | 2 | 8 |
| Total | 15 | | 30 |
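
Every total claimed in Step 5 can be recomputed mechanically from these tables. A minimal sketch that stores each row of 4.1–4.10 as a (count, cost) pair (variable names are illustrative):

# (count, cost) rows per true intent, copied from tables 4.1-4.10 above.
error_rows = {
    "product_discovery": [(85, 1), (55, 2), (20, 1), (20, 4)],
    "product_question":  [(55, 2), (30, 2), (10, 4), (10, 5)],
    "recommendation":    [(90, 1), (35, 2), (20, 2), (10, 4)],
    "faq":               [(25, 2), (15, 4), (14, 4), (10, 5)],
    "order_tracking":    [(30, 4), (20, 3), (12, 4), (10, 6)],
    "return_request":    [(28, 4), (15, 4), (10, 4), (10, 6)],
    "promotion":         [(26, 1), (15, 2), (10, 2), (5, 4)],
    "checkout_help":     [(18, 2), (10, 4), (8, 4), (10, 6)],
    "escalation":        [(12, 8), (8, 9), (6, 9), (4, 10), (4, 10)],
    "chitchat":          [(7, 2), (4, 2), (4, 2)],
}

total_errors = sum(n for rows in error_rows.values() for n, _ in rows)
total_points = sum(n * c for rows in error_rows.values() for n, c in rows)
points_by_intent = {k: sum(n * c for n, c in rows) for k, rows in error_rows.items()}

print(total_errors)                    # 790
print(total_points)                    # 2167
print(points_by_intent["escalation"])  # 302 -- the largest single harm bucket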

Step 5 — Score Validation

5.1 Validate Error Count

Total errors:

[ 180 + 105 + 155 + 64 + 72 + 63 + 56 + 46 + 34 + 15 = 790 ]

Accuracy:

[ 1 - \frac{790}{10000} = 0.921 = 92.1\% ]

5.2 Validate Total Weighted Error Points

Row totals:

[ 295 + 260 + 240 + 216 + 288 + 272 + 96 + 168 + 302 + 30 = 2167 ]

So:

[ \text{BWEP} = 2167 ]

5.3 Validate Business-Weighted Harm per Request

[ \text{BWH} = \frac{2167}{10000} = 0.2167 ]

Interpretation:

  • the model creates 0.2167 harm points per request
  • equivalently 21.67 harm points per 100 requests

5.4 Validate Average Severity per Error

[ \frac{2167}{790} = 2.743 ]

So the average mistake has a severity of about 2.74 on the 0–10 scale.

5.5 Validate Business-Weighted Score

[ \text{BW Score} = 100 \left(1 - \frac{2167}{10000 \cdot 10}\right) ]

[ = 100 (1 - 0.02167) = 97.833 ]

So:

  • BW Score = 97.83
  • Accuracy = 92.10

These are both correct because they measure different things:

  • accuracy asks how often
  • BW score asks how harmful

5.6 Validate Critical Error Rate

Critical errors are defined as cost ( \ge 8 ).

Only the following are critical in this worked example:

  • escalation → faq = 12
  • escalation → order_tracking = 8
  • escalation → return_request = 6
  • escalation → chitchat = 4
  • escalation → product_discovery = 4

Total critical errors:

[ 12 + 8 + 6 + 4 + 4 = 34 ]

Critical error rate:

[ \frac{34}{10000} = 0.0034 = 0.34\% ]

Escalation miss rate:

[ \frac{34}{300} = 11.33\% ]


Step 6 — What the Score Tells Us

Harm by True Intent

| True intent | Error points | Requests | Harm per request |
|---|---|---|---|
| product_discovery | 295 | 2,200 | 0.134 |
| product_question | 260 | 1,500 | 0.173 |
| recommendation | 240 | 1,800 | 0.133 |
| faq | 216 | 800 | 0.270 |
| order_tracking | 288 | 1,200 | 0.240 |
| return_request | 272 | 700 | 0.389 |
| promotion | 96 | 500 | 0.192 |
| checkout_help | 168 | 400 | 0.420 |
| escalation | 302 | 300 | 1.007 |
| chitchat | 30 | 600 | 0.050 |

Interpretation

Even though escalation is only 3% of traffic, it contributes 302 error points, which is the largest harm bucket in the whole system.

That means:

  • escalation should get special thresholds
  • escalation may need a separate auxiliary detector
  • the model should be optimized for business harm, not only support-weighted accuracy

Step 7 — Error Severity Distribution

By Number of Errors

| Severity bucket | Cost range | Error count | Share of all errors |
|---|---|---|---|
| Low | 1–2 | 499 | 63.16% |
| Medium | 3–5 | 227 | 28.73% |
| High / Critical | 6–10 | 64 | 8.10% |
| Total | | 790 | 100% |

By Weighted Error Points

| Severity bucket | Error points | Share of all harm |
|---|---|---|
| Low | 777 | 35.86% |
| Medium | 908 | 41.90% |
| High / Critical | 482 | 22.24% |
| Total | 2167 | 100% |

Interpretation

Most mistakes are low-harm by count, but the medium- and high-cost mistakes create a disproportionate share of the business damage.

That is exactly why business-weighted evaluation is needed.
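
The two distributions above fall out of the same error_rows structure used in the validation sketch after Step 4 (an illustrative helper, not part of the original pipeline):

def bucket(c):
    # Severity buckets follow the 0-10 scale defined in Step 1.
    if c <= 2:
        return "low"
    if c <= 5:
        return "medium"
    return "high_critical"

counts = {"low": 0, "medium": 0, "high_critical": 0}
points = {"low": 0, "medium": 0, "high_critical": 0}
for rows in error_rows.values():
    for n, c in rows:
        counts[bucket(c)] += n
        points[bucket(c)] += n * c

print(counts)  # {'low': 499, 'medium': 227, 'high_critical': 64}
print(points)  # {'low': 777, 'medium': 908, 'high_critical': 482}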


Step 8 — Use the Metric for Model Selection

Example: Two Candidate Models

| Metric | Model A | Model B |
|---|---|---|
| Accuracy | 92.1% | 91.8% |
| Rare-class accuracy | 88.7% | 90.0% |
| Weighted error points | 2167 | 1840 |
| Harm per request | 0.2167 | 0.1840 |
| Critical error rate | 0.34% | 0.18% |
| P95 latency | 12 ms | 13 ms |

Decision

A pure-accuracy team would ship Model A.

A business-aware team should probably ship Model B, because:

  • it cuts weighted harm by:

[ \frac{2167 - 1840}{2167} = 0.151 = 15.1\% ]

  • it nearly halves critical errors
  • it is still within the latency budget

This is the key reason to add business-weighted metrics to your evaluation gate.


Step 9 — Use the Same Matrix at Decision Time

Offline scoring is only half of the picture. The more powerful step is to use the same cost matrix during routing.

Standard Argmax Decision

Normal classifier choice:

[ \hat{y}_{argmax} = \arg\max_a p(a \mid x) ]

This ignores business cost.

Cost-Sensitive Decision Rule

Instead choose the action that minimizes expected harm:

[ a^* = \arg\min_a \sum_{y} p(y \mid x)\, C(y, a) ]

This matters because a lower-probability action can still be the better choice if it avoids expensive failures.

Worked Example

Suppose the calibrated probabilities are:

| Intent | Probability |
|---|---|
| faq | 0.40 |
| escalation | 0.32 |
| order_tracking | 0.20 |
| chitchat | 0.08 |

Argmax picks faq.

But expected harm is:

If action = faq

[ R(\text{faq}) = 0.32 \cdot 8 + 0.20 \cdot 3 + 0.08 \cdot 2 = 3.32 ]

If action = escalation

Assume over-escalation costs are cheaper:

  • faq → escalation = 2
  • order_tracking → escalation = 3
  • chitchat → escalation = 2

Then:

[ R(\text{escalation}) = 0.40 \cdot 2 + 0.20 \cdot 3 + 0.08 \cdot 2 = 1.56 ]

So the safer choice is:

[ a^* = \text{escalation} ]

Meaning

Even though faq has the highest probability, routing to escalation is better because the cost of missing escalation is much larger than the cost of over-escalating.
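
A minimal sketch of this rule, reusing the illustrative cost() helper from Step 1 and adding the two assumed over-escalation costs from this example as explicit entries:

# Assumed over-escalation costs from the worked example (not in the Step 1 table).
COSTS[("order_tracking", "escalation")] = 3
COSTS[("chitchat", "escalation")] = 2

def min_risk_action(probs, cost_fn):
    """Pick the action a minimizing sum_y p(y|x) * C(y, a) over the candidate intents."""
    risks = {a: sum(p * cost_fn(y, a) for y, p in probs.items()) for a in probs}
    return min(risks, key=risks.get), risks

probs = {"faq": 0.40, "escalation": 0.32, "order_tracking": 0.20, "chitchat": 0.08}
action, risks = min_risk_action(probs, cost)

print(round(risks["faq"], 2))         # 3.32
print(round(risks["escalation"], 2))  # 1.56
print(action)                         # escalation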

This is one of the strongest additions you can make to a production routing system.


Step 10 — Stage-by-Stage Decisions During Fine-Tuning and Deployment

Stage A — Label and Policy Design

Questions:

  • Which errors are truly expensive?
  • Which errors are acceptable fallback behavior?
  • Should product_discovery ↔ recommendation be cost 0, 1, or 2?
  • Should missing escalation ever be allowed?

Decisions:

  • define the cost matrix with PM + support + ops
  • make it directional
  • keep the scale simple: 0–10

Stage B — Offline Evaluation

Questions:

  • does the model meet the accuracy target?
  • does it meet the rare-class accuracy target?
  • what is the weighted harm?
  • which intent creates the most business harm?

Decisions:

  • accept a candidate with lower raw accuracy only if its business harm is meaningfully lower
  • inspect per-intent harm, not just the global score

Stage C — Calibration

Questions:

  • are the probabilities trustworthy enough for expected-risk routing?
  • is confidence too sharp or too flat?

Decisions:

  • apply temperature scaling
  • validate ECE, Brier, NLL
  • use calibrated probabilities before cost-sensitive routing

Stage D — Threshold and Fallback Policy

Questions:

  • when should we escalate?
  • when should we ask a clarifying question?
  • when should we accept a low-cost confusion?

Decisions:

  • if expected risk > threshold, send to the safer fallback
  • if p(escalation) is above a class-specific threshold, bias toward handoff
  • if the top-2 margin is small, trigger disambiguation

Stage E — Production Monitoring

Questions:

  • are weighted harm and critical errors increasing?
  • is the traffic mix changing?
  • are some intents getting sharper or more confused?

Decisions:

  • alert on the weighted harm trend
  • alert on the escalation miss proxy
  • retrain if weighted harm crosses its threshold, even if accuracy still looks okay


Step 11 — Production Logs to Add

11.1 Per-Request Routing Log

{
  "timestamp": "2026-04-21T14:22:10Z",
  "request_id": "req_9182",
  "text": "I want to talk to a person",
  "top1_intent": "faq",
  "top1_prob": 0.41,
  "top2_intent": "escalation",
  "top2_prob": 0.34,
  "top3_intent": "order_tracking",
  "top3_prob": 0.16,
  "expected_risk_faq": 3.28,
  "expected_risk_escalation": 1.52,
  "chosen_action": "escalation",
  "decision_policy": "cost_sensitive_v2",
  "model_version": "intent-distilbert-v12"
}

11.2 Aggregated Monitoring Log

{
  "window": "2026-04-21T14:00:00Z/2026-04-21T15:00:00Z",
  "requests": 18240,
  "accuracy_sampled": 0.919,
  "rare_class_accuracy_sampled": 0.887,
  "weighted_error_points_sampled": 4012,
  "harm_per_request_sampled": 0.220,
  "critical_error_rate_sampled": 0.0031,
  "escalation_miss_rate_sampled": 0.109,
  "p95_latency_ms": 12.4,
  "kl_divergence": 0.016,
  "status": "healthy"
}

11.3 Alert Example

{
  "alert_name": "business_harm_regression",
  "model_version": "intent-distilbert-v13",
  "trigger": "harm_per_request_sampled > 0.25 for 3 consecutive windows",
  "current_value": 0.287,
  "previous_champion": 0.216,
  "recommended_action": "rollback_to_v12"
}

Step 12 — Deployment Gates

Use a mix of classical and business-aware gates.

| Gate | Threshold | Why |
|---|---|---|
| Overall accuracy | >= 92.0% | baseline quality |
| Rare-class accuracy | >= 88.5% | protects low-frequency intents |
| Business-weighted harm per request | <= 0.23 | keeps average harm low |
| Critical error rate | <= 0.35% | protects user trust |
| Escalation miss rate | <= 11% | protects human-handoff requests |
| P95 latency | <= 15 ms | keeps routing fast |

Deployment Rule

Deploy only if:

  1. accuracy does not regress materially
  2. weighted harm improves or remains within budget
  3. critical error rate does not increase
  4. latency stays within target
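
A sketch of how these gates and the rule above could be wired into a release check. The metric keys and the champion-comparison logic are illustrative assumptions; the thresholds are the ones from the gate table.

GATES = {
    "accuracy":             lambda m: m["accuracy"] >= 0.920,
    "rare_class_accuracy":  lambda m: m["rare_class_accuracy"] >= 0.885,
    "harm_per_request":     lambda m: m["harm_per_request"] <= 0.23,
    "critical_error_rate":  lambda m: m["critical_error_rate"] <= 0.0035,
    "escalation_miss_rate": lambda m: m["escalation_miss_rate"] <= 0.11,
    "p95_latency_ms":       lambda m: m["p95_latency_ms"] <= 15.0,
}

def deployment_decision(candidate, champion):
    """candidate/champion: dicts of offline metrics for the two model versions."""
    failed_gates = [name for name, check in GATES.items() if not check(candidate)]
    regressions = [
        key for key in ("harm_per_request", "critical_error_rate")
        if candidate[key] > champion[key]
    ]
    if failed_gates or regressions:
        return f"hold (failed gates: {failed_gates}, regressions: {regressions})"
    return "deploy"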

Step 13 — What New Things Can Be Added Next

These are strong next upgrades after this document.

1. Cost-Sensitive Training Loss

Instead of only using the cost matrix at evaluation time, inject it into training:

  • weighted cross-entropy by error pair
  • expected-cost loss
  • focal loss plus cost-sensitive class pairs

2. Separate Escalation Detector

Because escalation produces the highest harm per request, add:

  • binary needs_human detector
  • OR rule-based backup
  • OR ensemble with semantic features

3. Margin + Risk Joint Policy

Use both:

  • top1-top2 margin
  • expected business risk

This reduces unsafe argmax decisions.
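
A sketch of one way to combine the two signals, assuming the min_risk_action helper from the Step 9 sketch; the margin and risk thresholds are placeholder values, not numbers from the scenario.

def joint_policy(probs, cost_fn, margin_threshold=0.15, risk_threshold=2.0):
    """Combine the top1-top2 probability margin with expected business risk."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    margin = ranked[0][1] - ranked[1][1]
    action, risks = min_risk_action(probs, cost_fn)  # helper from the Step 9 sketch
    if risks[action] > risk_threshold:
        return "escalate_or_clarify"   # even the cheapest action is still too risky
    if margin < margin_threshold and action != ranked[0][0]:
        return "clarify"               # ambiguous, and the min-risk pick disagrees with argmax
    return action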

4. Route-Family First, Intent Second

Two-stage hierarchy:

  1. route family: shopping / commerce support / human handoff / chitchat
  2. exact intent inside family

This often lowers high-cost cross-family errors.

5. Delayed-Label Risk Monitoring

Some classes like return or escalation may get ground truth later. Add:

  • delayed feedback joins
  • support ticket outcome labels
  • business complaint rate by predicted route


Mermaid Diagram — How the Metric Fits Into the System

graph TD
    A[User message] --> B[Intent classifier]
    B --> C[Calibrated probabilities]
    C --> D[Expected-risk calculator]
    D --> E{Min-risk action}
    E -->|low-risk content route| F[Recommendation or product flow]
    E -->|support route| G[FAQ or commerce support flow]
    E -->|high-risk uncertainty| H[Escalate or clarify]
    F --> I[Logs]
    G --> I
    H --> I
    I --> J[Offline labeled evaluation]
    J --> K[Business-weighted error score]
    K --> L[Champion / challenger decision]
    L --> M[Deploy or rollback]

Minimal Python Validation Snippet

total_requests = 10000
correct = 9210
errors = 790
weighted_error_points = 2167
critical_errors = 34

accuracy = correct / total_requests
harm_per_request = weighted_error_points / total_requests
avg_severity_per_error = weighted_error_points / errors
bw_score = 100 * (1 - weighted_error_points / (total_requests * 10))
critical_error_rate = critical_errors / total_requests

print(round(accuracy, 4))                 # 0.921
print(round(harm_per_request, 4))         # 0.2167
print(round(avg_severity_per_error, 4))   # 2.743
print(round(bw_score, 3))                 # 97.833
print(round(critical_error_rate, 4))      # 0.0034

Final Takeaway

The key idea is simple:

  • Accuracy tells you how often the classifier is right.
  • Business-weighted error tells you how damaging the mistakes are.
  • Expected-risk routing lets you use the same business logic at serving time, not just in offline reports.

For MangaAssist, this matters because many content-intent confusions are cheap, while missing escalation, return_request, or order_tracking can be expensive.

That is why a top GenAI / ML engineer should add:

  1. business cost matrix
  2. weighted offline evaluation
  3. calibrated probabilities
  4. cost-sensitive routing
  5. business-aware monitoring and rollback gates

This makes the classifier not just more accurate, but more useful and safer in production.


Research-Grade Addendum

Where the Cost Matrix Came From (and Why a Research Scientist Would Push Back)

The cost matrix used above (8 for escalation → faq, 6 for order_tracking → chitchat, etc.) was hand-set by Sam (PM) using a four-input recipe:

  1. CSAT delta after each error type (measured A/B on 6 weeks of production)
  2. Operational cost: agent minutes consumed by an angry escalation, return-form abandonment rate
  3. Revenue impact: bounce rate and conversion uplift per intent route
  4. Brand harm priors: 5× weighting on safety-flagged intents

A research scientist would object that this matrix is a single point estimate. None of the four inputs were measured with CIs; the matrix is therefore wrong by some unknown amount. The right question is: how robust are the conclusions to perturbations of the matrix?

Cost-Matrix Sensitivity Analysis

We perturb each cell c_ij independently by ±50% and re-run the full evaluation. The metric of interest is system rank stability: does the same model still win?

Procedure:

  1. For each cell (i, j) where c_ij > 0: scale by 1.5 and re-evaluate; scale by 0.5 and re-evaluate.
  2. Record whether Model A (focal-loss DistilBERT) still beats the baseline (standard CE DistilBERT) on weighted error.
  3. Compute the flip rate = the fraction of perturbations that change the winner.

| Perturbation type | Flip rate | Worst-case Δ in weighted error |
|---|---|---|
| Single cell ±50% | 0/156 | 4.1% |
| All "high" cells ×1.5 simultaneously | 0/1 | 6.8% |
| All "low" cells ×0.5 simultaneously | 0/1 | 3.2% |
| Random Dirichlet(α=2) cost vectors per row, n=1,000 trials | 17/1,000 (1.7%) | 11.4% |
| Adversarial: cell maximizing flip risk found by grid search | 1/30 | 12.0% |

Reading. The chosen model (Model A) is robust to any single-cell perturbation up to ±50%. Under fully randomized cost matrices it loses to the baseline 1.7% of the time — i.e., we are confident at roughly 98% that Model A is the right pick. The single adversarial perturbation that flips the winner is a 50% under-estimate of the escalation → chitchat cell, which would mean we are dramatically over-investing in escalation handling. We treat this as a question for Sam to revalidate quarterly, not a model-selection question.
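
A sketch of the single-cell sweep under stated assumptions: evaluate_pair is a hypothetical stand-in for re-running the offline evaluation of both models with a given cost matrix.

import copy

def single_cell_flip_rate(base_costs, evaluate_pair, scales=(0.5, 1.5)):
    """Perturb each nonzero cost cell by the given scales and count winner flips.

    evaluate_pair(costs) is assumed to return
    (weighted_error_model_a, weighted_error_baseline) for one cost matrix.
    """
    base_a, base_b = evaluate_pair(base_costs)
    base_winner = "model_a" if base_a < base_b else "baseline"
    flips, trials = 0, 0
    for cell, value in base_costs.items():
        for scale in scales:
            perturbed = copy.deepcopy(base_costs)
            perturbed[cell] = value * scale
            a, b = evaluate_pair(perturbed)
            trials += 1
            flips += int(("model_a" if a < b else "baseline") != base_winner)
    return flips, trials  # flip rate = flips / trials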

Confidence Intervals on the Headline Savings Claim

The $19K/month savings figure from the original calibration deep-dive is itself a derived statistic over (a) the per-error cost matrix, (b) the misroute rate distribution, and (c) the monthly request volume. A bootstrap procedure that resamples the production logs (B = 10,000) gives:

| Headline figure | Point estimate | 95% bootstrap CI |
|---|---|---|
| Weighted error rate (focal-loss model) | 0.0312 | [0.0288, 0.0341] |
| Weighted error rate (CE baseline) | 0.0427 | [0.0394, 0.0461] |
| Absolute reduction | 0.0115 | [0.0093, 0.0140] |
| Monthly $ savings (at $0.013 per harm-unit and 1.4M reqs/mo) | $19,330 | [$15,640, $23,020] |
| Critical-error rate (focal-loss) | 0.0034 | [0.0027, 0.0042] |
| Critical-error rate (CE baseline) | 0.0061 | [0.0051, 0.0073] |

Reading. The CI on $ savings is wide (~±$3.7K) because the per-harm-unit cost ($0.013) is itself a point estimate. Even at the lower bound, savings are $15.6K/month, which is large enough to justify the engineering investment and ongoing labeling cost. Recommendation: report the conservative lower bound ($15.6K/month), not the point estimate, in business reviews.
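
The resampling itself is standard; a minimal percentile-bootstrap sketch, assuming a list of per-request harm values sampled from production logs (Bouthillier 2021 motivates reporting benchmark variance this way):

import random

def bootstrap_mean_ci(per_request_costs, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean harm per request."""
    rng = random.Random(seed)
    n = len(per_request_costs)
    means = sorted(
        sum(per_request_costs[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi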

Comparative Methods: How Else Could We Make Errors Cost-Aware?

| Method | Where the cost shows up | Pros | Cons | Reference |
|---|---|---|---|---|
| Cost-matrix at evaluation only (baseline) | metric only | simple, transparent, training agnostic | model is still trained to maximize accuracy, not to minimize cost | Provost 2000 |
| Class-weighted CE with cost-derived weights | training loss | aligns training with the deployment objective | weights are scalar per class and cannot encode pair-specific costs | He & Garcia 2009 |
| Cost-sensitive routing (chosen) | inference policy | model-agnostic; works with any calibrated classifier | requires good calibration; thresholds need re-tuning when costs change | Elkan 2001 |
| Example-dependent cost-sensitive learning | training loss + per-example cost | captures user-specific costs (VIP vs. anonymous) | needs a cost label per example, often unavailable | Bahnsen 2014 |
| Direct cost-minimization training (CSL-DL) | end-to-end | optimal in theory | unstable training, no off-the-shelf libs | Dalvi 2004; Khan 2018 |

Reading. Cost-sensitive routing at inference time is the right architectural choice for MangaAssist because (a) we already have a calibrated classifier from §calibration, (b) costs are pair-specific (so scalar class weights are insufficient), and (c) the inference policy can be hot-swapped when costs change without retraining the model.

Failure-Mode Tree for Business-Weighted Routing

flowchart TD
    A[Weekly business KPI review] --> B{Symptom?}
    B -- weighted error rate ↑ ≥ 0.005 --> C{Source?}
    B -- critical error rate ↑ ≥ 0.001 --> D[Immediate page on-call audit last 7 days of escalation routes]
    B -- $ savings shrinks ≥ 25% MoM --> E{Cost matrix or model?}
    C -- specific intent pair --> F[Targeted retrain on that pair via active sampling]
    C -- broad --> G[Trigger calibration recheck then full retrain]
    D --> H[Roll routing thresholds back to last known-good values]
    E -- model accuracy stable --> I[Cost-matrix audit with PM and ops re-derive c_ij]
    E -- model accuracy degraded --> J[Retrain pipeline as in main doc]
    I --> K[Re-run sensitivity sweep with new matrix gate ≥ 95% rank stability]

Research Notes — failure tree. Citations: Provost 2000 (AAAI) — threshold moving as the cheapest cost-aware action; Elkan 2001 (IJCAI) — cost-sensitive decision theory; Bahnsen 2014 (J. Comp. Sci.) — example-dependent costs.

Open Problems

  1. Cost matrices drift. A return policy change, a new escalation playbook, or a refund-cost change can invalidate the matrix overnight. Today we re-derive it quarterly. Open question: can we learn the cost matrix from CSAT signals and ticket-resolution outcomes in near-real-time, treating it as another model that drifts?
  2. Per-user cost matrices. A VIP user's escalation mishandled is more costly than an anonymous user's. Bahnsen 2014's example-dependent CSL framework would let us encode this, but requires a per-request cost feature pipeline. Worth piloting on the top 1% of customers by LTV.
  3. Adversarial cost-matrix attacks. A red-team query that intentionally raises critical-error rate (e.g., obfuscated escalation phrasing) could shift weighted error without triggering accuracy alerts. Need an adversarial monitoring channel that perturbs production messages and tracks weighted-error response.

Bibliography (this file)

  • Elkan, C. (2001). The Foundations of Cost-Sensitive Learning. IJCAI. — formal cost-matrix decision theory; threshold = p* = c(0,1) / (c(0,1) + c(1,0)).
  • Provost, F. (2000). Machine Learning from Imbalanced Data Sets 101. AAAI Workshop. — threshold moving as a post-hoc cost-aware fix.
  • Bahnsen, A. C., Aouada, D., Ottersten, B. (2014). Example-Dependent Cost-Sensitive Decision Trees. Expert Systems with Applications. — per-example cost.
  • Dalvi, N., Domingos, P., Sanghai, S., Verma, D. (2004). Adversarial Classification. KDD. — costs under adversarial conditions.
  • Khan, S. H., Hayat, M., Bennamoun, M., Sohel, F. A., Togneri, R. (2018). Cost-Sensitive Learning of Deep Feature Representations from Imbalanced Data. IEEE TNNLS. — direct cost-minimization training.
  • He, H., Garcia, E. A. (2009). Learning from Imbalanced Data. IEEE TKDE. — class-weighted CE survey.
  • Domingos, P. (1999). MetaCost: A General Method for Making Classifiers Cost-Sensitive. KDD. — wrapper approach.
  • Bouthillier, X. et al. (2021). Accounting for Variance in Machine Learning Benchmarks. MLSys. — bootstrap CIs.

Citation count for this file: 8.