
Confidence Calibration for Fine-Tuned Intent Routing — MangaAssist

This document is the first deep-dive extension to the MangaAssist fine-tuning scenario. It focuses on confidence calibration for a fine-tuned DistilBERT intent classifier so that the router can make safer and more business-aware decisions after training.

It uses the same setting as the original scenario:

  • 10 intent classes
  • DistilBERT fine-tuned for manga retail intent routing
  • under 15 ms P95 latency target
  • accuracy improving from about 83.2% to about 92.1% after fine-tuning
  • production concerns around misroutes, rare intents, drift, and weekly/monthly retraining


1. Why Calibration Matters

A classifier can be accurate but badly calibrated.

That means:

  • when it says 0.95 confidence, it may only be right 80% of the time
  • when it says 0.55 confidence, it may actually be right 70% of the time

For routing systems, that is dangerous.

In MangaAssist, the model prediction is not the final user-facing answer. It is a control signal that decides which downstream service to call:

  • recommendation engine
  • order lookup
  • return flow
  • FAQ pipeline
  • human escalation

So the router should not ask only:

“Which class has the highest score?”

It should also ask:

“How trustworthy is that score?”

Business reason

A wrong route can cause:

  • bad customer experience
  • unnecessary LLM/tool calls
  • wrong workflow activation
  • customer support escalation
  • hidden cost increase

A calibrated classifier lets us do smarter things such as:

  • auto-route only when confidence is reliable
  • send uncertain predictions to fallback logic
  • request clarification for ambiguous cases
  • prioritize human review on low-confidence samples
  • trigger active learning on the most informative traffic


2. Accuracy vs Confidence vs Calibration

These are different concepts.

Accuracy

Accuracy checks whether the top prediction is correct.

[ \text{Accuracy} = \frac{\#\text{ correct predictions}}{\#\text{ total predictions}} ]

Confidence

Confidence is the model's predicted probability for the top class.

If the probabilities after softmax are:

  • recommendation = 0.78
  • product_discovery = 0.14
  • product_question = 0.05
  • others = 0.03 total

then the confidence is 0.78.

Calibration

Calibration checks whether probabilities match reality.

If the model produces 100 predictions with confidence around 0.80, and only 60 are correct, then the model is overconfident.

If it produces 100 predictions with confidence around 0.55, and 75 are correct, then it is underconfident.

A well-calibrated model satisfies:

[ P(\hat{y}=y \mid \hat{p}=0.8) \approx 0.8 ]

That means: among predictions made at 80% confidence, about 80% should be correct.


3. Where Calibration Fits in the Full Lifecycle

```mermaid
flowchart TD
    A[Training Data] --> B[Fine-Tune DistilBERT]
    B --> C[Validation Predictions]
    C --> D[Fit Calibration Layer]
    D --> E[Evaluate ECE / Brier / NLL / Reliability]
    E --> F{Pass Calibration Gate?}
    F -->|Yes| G[Package Model + Calibrator]
    F -->|No| H[Revisit Loss / Data / Temperature]
    G --> I[Deploy to Production]
    I --> J[Online Routing Decisions]
    J --> K[Confidence Thresholding]
    K --> L[Fallback / Clarification / Human Review]
    I --> M[Production Monitoring]
    M --> N[Drift + Confidence Shift + Calibration Drift]
    N --> A
```

4. Baseline Numerical Example: Why Softmax Alone Is Not Enough

Suppose for the query:

“I want something like Naruto but darker”

The fine-tuned model produces logits:

  • recommendation = 3.2
  • product_discovery = 2.4
  • product_question = 0.4
  • faq = -0.7
  • chitchat = -1.0

4.1 Convert logits to probabilities

Softmax:

[ \hat{p}_i = \frac{e^{z_i}}{\sum_j e^{z_j}} ]

Exponentials:

  • (e^{3.2} \approx 24.533)
  • (e^{2.4} \approx 11.023)
  • (e^{0.4} \approx 1.492)
  • (e^{-0.7} \approx 0.497)
  • (e^{-1.0} \approx 0.368)

Sum:

[ 24.533 + 11.023 + 1.492 + 0.497 + 0.368 = 37.913 ]

Probabilities:

  • recommendation = (24.533 / 37.913 \approx 0.647)
  • product_discovery = (11.023 / 37.913 \approx 0.291)
  • product_question = (1.492 / 37.913 \approx 0.039)
  • faq = (0.497 / 37.913 \approx 0.013)
  • chitchat = (0.368 / 37.913 \approx 0.010)

Top class is recommendation with confidence 0.647.

That looks decent.

But if historical validation shows that predictions in the 0.60–0.70 bin are only correct 51% of the time, then the router should not treat 0.647 as “safe enough” for irreversible routing.

That is the whole point of calibration.
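A quick way to sanity-check the arithmetic above is a minimal NumPy sketch; the logits are the hypothetical values from this example.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax: subtract the max logit before exponentiating."""
    z = logits - logits.max()
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Hypothetical logits from the worked example above.
labels = ["recommendation", "product_discovery", "product_question", "faq", "chitchat"]
logits = np.array([3.2, 2.4, 0.4, -0.7, -1.0])

probs = softmax(logits)
for label, p in zip(labels, probs):
    print(f"{label:>18s}: {p:.3f}")   # recommendation: 0.647, matching the hand computation
```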


5. Main Calibration Methods

For this system, the most practical options are:

  1. Temperature scaling
  2. Vector scaling
  3. Platt scaling / logistic calibration
  4. Isotonic regression
  5. Class-wise threshold calibration

For MangaAssist, the best first choice is usually:

Temperature scaling + class-wise routing thresholds

because it is simple, fast, and low-risk.

5.1 Temperature scaling

We learn a scalar temperature (T > 0) on the validation set.

New probabilities:

[ \hat{p}_i^{(T)} = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}} ]

Interpretation:

  • T = 1: no change
  • T > 1: soften probabilities, reduce overconfidence
  • T < 1: sharpen probabilities, increase confidence

Because dividing every logit by the same positive scalar is a monotone transform, this never changes the class ranking. It only changes the confidence values.

5.2 Why temperature scaling is good here

  • tiny extra compute
  • easy to version and deploy
  • does not retrain encoder
  • works well when model is mainly overconfident
  • easy to update during post-training validation

5.3 When temperature scaling is not enough

It may fail if:

  • each class has a very different confidence distortion
  • rare intents are miscalibrated differently from common intents
  • calibration varies heavily by traffic segment
  • drift changes confidence distribution after deployment

Then we may add:

  • class-conditional thresholds
  • vector scaling
  • segment-based calibrators


6. Worked Temperature-Scaling Example

Take the same logits:

  • recommendation = 3.2
  • product_discovery = 2.4
  • product_question = 0.4
  • faq = -0.7
  • chitchat = -1.0

Assume validation optimization finds:

[ T = 1.6 ]

6.1 Scaled logits

[ z_i' = z_i / 1.6 ]

So:

  • recommendation = 2.000
  • product_discovery = 1.500
  • product_question = 0.250
  • faq = -0.438
  • chitchat = -0.625

Exponentials:

  • (e^{2.0} \approx 7.389)
  • (e^{1.5} \approx 4.482)
  • (e^{0.25} \approx 1.284)
  • (e^{-0.438} \approx 0.645)
  • (e^{-0.625} \approx 0.535)

Sum:

[ 7.389 + 4.482 + 1.284 + 0.645 + 0.535 = 14.335 ]

New probabilities:

  • recommendation = (7.389 / 14.335 \approx 0.515)
  • product_discovery = (4.482 / 14.335 \approx 0.313)
  • product_question = (1.284 / 14.335 \approx 0.090)
  • faq = (0.645 / 14.335 \approx 0.045)
  • chitchat = (0.535 / 14.335 \approx 0.037)

What changed?

Before calibration:

  • top class = recommendation
  • confidence = 0.647

After calibration:

  • top class = recommendation
  • confidence = 0.515

The class did not change. The trust level did.

That is exactly what we want.
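The same rescaling as a minimal sketch, assuming the fitted value T = 1.6 from above; the assert makes the argmax-invariance point explicit.

```python
import numpy as np

def temperature_softmax(logits: np.ndarray, T: float) -> np.ndarray:
    """Softmax over logits divided by a temperature T > 0."""
    z = logits / T
    z = z - z.max()            # numerical stability only; the result is unchanged
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([3.2, 2.4, 0.4, -0.7, -1.0])
raw = temperature_softmax(logits, T=1.0)
cal = temperature_softmax(logits, T=1.6)

print(f"raw top confidence:        {raw.max():.3f}")   # ~0.647
print(f"calibrated top confidence: {cal.max():.3f}")   # ~0.515
assert raw.argmax() == cal.argmax()   # temperature never changes the top class
```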


7. Metrics to Evaluate Calibration

The most important metrics are:

  • ECE: Expected Calibration Error
  • MCE: Maximum Calibration Error
  • NLL: Negative Log Likelihood
  • Brier Score
  • Reliability curves
  • Coverage vs risk under routing thresholds

7.1 Expected Calibration Error (ECE)

We split predictions into confidence bins.

For bin (B_m):

  • (\text{acc}(B_m)) = empirical accuracy in bin
  • (\text{conf}(B_m)) = average predicted confidence in bin

ECE:

[ \text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \text{acc}(B_m) - \text{conf}(B_m) \right| ]

7.2 Worked ECE example

Suppose we evaluate 1,000 validation examples and use 5 bins.

| Bin | Confidence Range | Count | Avg Confidence | Accuracy | Gap |
|-----|------------------|-------|----------------|----------|------|
| B1  | 0.0–0.2          | 40    | 0.16           | 0.20     | 0.04 |
| B2  | 0.2–0.4          | 110   | 0.33           | 0.38     | 0.05 |
| B3  | 0.4–0.6          | 210   | 0.52           | 0.49     | 0.03 |
| B4  | 0.6–0.8          | 330   | 0.71           | 0.61     | 0.10 |
| B5  | 0.8–1.0          | 310   | 0.89           | 0.75     | 0.14 |

Compute weighted sum:

[ \text{ECE} = \frac{40}{1000}(0.04) + \frac{110}{1000}(0.05) + \frac{210}{1000}(0.03) + \frac{330}{1000}(0.10) + \frac{310}{1000}(0.14) ]

[ = 0.0016 + 0.0055 + 0.0063 + 0.033 + 0.0434 = 0.0898 ]

So:

[ \text{ECE} \approx 0.090 ]

That is fairly bad for a routing model.

Now suppose after temperature scaling:

| Bin | Count | Avg Confidence | Accuracy | Gap  |
|-----|-------|----------------|----------|------|
| B1  | 55    | 0.17           | 0.18     | 0.01 |
| B2  | 135   | 0.32           | 0.35     | 0.03 |
| B3  | 255   | 0.51           | 0.52     | 0.01 |
| B4  | 315   | 0.69           | 0.68     | 0.01 |
| B5  | 240   | 0.82           | 0.80     | 0.02 |

Then:

[ \text{ECE} = \frac{55}{1000}(0.01) + \frac{135}{1000}(0.03) + \frac{255}{1000}(0.01) + \frac{315}{1000}(0.01) + \frac{240}{1000}(0.02) ]

[ = 0.00055 + 0.00405 + 0.00255 + 0.00315 + 0.0048 = 0.0151 ]

So ECE improved from 0.090 to 0.015.

That is a major win.
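A minimal sketch of the equal-width-bin ECE defined above. The binning convention and the synthetic demo data are illustrative assumptions, not the validation set behind the tables.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=5):
    """ECE: |B_m|/n-weighted average of |acc(B_m) - conf(B_m)| over equal-width bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        mask = (confidences > lo) & (confidences <= hi)
        if i == 0:
            mask |= confidences == lo   # put exact zeros in the first bin
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += (mask.sum() / n) * gap
    return ece

# Synthetic demo: an overconfident model whose accuracy lags confidence by ~0.1.
rng = np.random.default_rng(0)
conf = rng.uniform(0.2, 1.0, size=1000)
hit = rng.uniform(size=1000) < np.clip(conf - 0.1, 0.0, 1.0)
print(f"ECE ~ {expected_calibration_error(conf, hit):.3f}")   # close to 0.1 by construction
```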

7.3 Brier Score

For multiclass, one common form is:

[ \text{Brier} = \frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} (p_{ic} - y_{ic})^2 ]

Lower is better.

Worked single-example case:

Predicted:

  • recommendation = 0.70
  • product_discovery = 0.20
  • product_question = 0.10

True class = recommendation, so the target is:

  • recommendation = 1
  • product_discovery = 0
  • product_question = 0

Brier contribution:

[ (0.70 - 1)^2 + (0.20 - 0)^2 + (0.10 - 0)^2 = 0.09 + 0.04 + 0.01 = 0.14 ]

If instead the prediction was:

  • recommendation = 0.40
  • product_discovery = 0.35
  • product_question = 0.25

then Brier:

[ (0.40 - 1)^2 + 0.35^2 + 0.25^2 = 0.36 + 0.1225 + 0.0625 = 0.545 ]

Much worse.

7.4 Negative Log Likelihood (NLL)

For true class probability (p_t):

[ \text{NLL} = -\log(p_t) ]

If correct-class probability improves from 0.42 to 0.76 after calibration-aware selection and thresholding, NLL goes from:

[ -\log(0.42) \approx 0.867 ]

to:

[ -\log(0.76) \approx 0.275 ]

Lower is better.
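Both metrics in a small sketch, applied to the worked example above (true class at index 0, the hypothetical three-class probability vector).

```python
import numpy as np

def brier_score(probs: np.ndarray, true_idx: np.ndarray) -> float:
    """Multiclass Brier: mean squared error between probability vectors and one-hot labels."""
    n, c = probs.shape
    onehot = np.zeros((n, c))
    onehot[np.arange(n), true_idx] = 1.0
    return float(((probs - onehot) ** 2).sum(axis=1).mean())

def nll(probs: np.ndarray, true_idx: np.ndarray) -> float:
    """Mean negative log-likelihood of the true class."""
    p_true = probs[np.arange(len(true_idx)), true_idx]
    return float(-np.log(p_true).mean())

p = np.array([[0.70, 0.20, 0.10]])   # predicted distribution from the Brier example
y = np.array([0])                    # true class = recommendation
print(f"Brier contribution: {brier_score(p, y):.2f}")   # 0.14
print(f"NLL:                {nll(p, y):.3f}")           # -log(0.70) ~ 0.357
```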


8. Reliability Diagram Interpretation

```mermaid
xychart-beta
    title "Reliability: Before vs After Calibration"
    x-axis [0.1,0.3,0.5,0.7,0.9]
    y-axis "Empirical Accuracy" 0 --> 1
    line "Ideal" [0.1,0.3,0.5,0.7,0.9]
    line "Before" [0.2,0.38,0.49,0.61,0.75]
    line "After" [0.18,0.35,0.52,0.68,0.80]
```

How to read it:

  • the closer the curve is to the diagonal, the better calibrated the model
  • if the curve is below the diagonal, the model is overconfident
  • if the curve is above the diagonal, the model is underconfident

In this scenario, the pre-calibration model is mostly overconfident in higher-confidence bins.


9. Routing Decisions After Calibration

Calibration matters because it changes operational logic.

9.1 Simple threshold policy

Let:

  • top probability = (p_1)
  • second probability = (p_2)
  • margin = (p_1 - p_2)

Use these rules (a code sketch follows the list):

  1. Auto-route if:
     • calibrated confidence (p_1 \ge 0.80)
     • margin (\ge 0.20)

  2. Clarify with user if:
     • (0.55 \le p_1 < 0.80)
     • or margin < 0.20

  3. Fallback / human-safe path if:
     • (p_1 < 0.55)
     • or intent belongs to a high-risk class with low confidence
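A minimal sketch of this three-way policy, using the illustrative thresholds above. `route` and `RoutingDecision` are hypothetical names for this sketch, not an existing API.

```python
from dataclasses import dataclass

@dataclass
class RoutingDecision:
    action: str        # "auto_route", "clarify", or "fallback"
    intent: str
    confidence: float  # calibrated top-class probability p1
    margin: float      # p1 - p2

def route(probs: dict[str, float],
          auto_conf: float = 0.80,
          auto_margin: float = 0.20,
          clarify_conf: float = 0.55) -> RoutingDecision:
    """Threshold policy over calibrated probabilities (illustrative constants)."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    (top_intent, p1), (_, p2) = ranked[0], ranked[1]
    margin = p1 - p2
    if p1 >= auto_conf and margin >= auto_margin:
        action = "auto_route"
    elif p1 >= clarify_conf:
        action = "clarify"      # moderate confidence, or a too-small margin
    else:
        action = "fallback"     # human-safe path
    return RoutingDecision(action, top_intent, p1, margin)

# Case A below: confident, well-separated prediction -> auto-route.
print(route({"order_tracking": 0.91, "faq": 0.05, "return_request": 0.02, "other": 0.02}))
```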

9.2 Worked examples

Case A: Safe auto-route

Calibrated probabilities:

  • order_tracking = 0.91
  • faq = 0.05
  • return_request = 0.02
  • others = 0.02

Margin:

[ 0.91 - 0.05 = 0.86 ]

Decision: route directly to order lookup.

Case B: Clarification needed

Calibrated probabilities:

  • recommendation = 0.58
  • product_discovery = 0.33
  • product_question = 0.05

Margin:

[ 0.58 - 0.33 = 0.25 ]

Confidence is moderate, so ask:

“Are you looking for recommendations similar to Naruto, or do you want to browse dark manga titles?”

Case C: Low-confidence safe fallback

Calibrated probabilities:

  • return_request = 0.39
  • faq = 0.28
  • escalation = 0.18
  • others = 0.15

Decision:

  • do not auto-route to a return flow
  • ask a clarifying question or hand off to safe general support


10. Class-Specific Thresholds

A global threshold is often not enough.

For MangaAssist, business risk differs by class.

Example:

  • confusing recommendation vs product_discovery is annoying but usually recoverable
  • confusing escalation vs chitchat can be much worse
  • confusing return_request vs faq may activate the wrong operational flow

So we can set per-class thresholds.

Example policy

| Intent | Auto-route threshold | Reason |
|--------|----------------------|--------|
| product_discovery | 0.78 | low business risk |
| recommendation | 0.80 | moderate ambiguity |
| product_question | 0.82 | user expects precise answer |
| faq | 0.75 | knowledge path is recoverable |
| order_tracking | 0.88 | wrong route can frustrate users quickly |
| return_request | 0.90 | higher workflow sensitivity |
| promotion | 0.72 | lower business risk |
| checkout_help | 0.88 | payment-related safety |
| escalation | 0.92 | very high support-risk class |
| chitchat | 0.70 | low business risk |
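This policy can live as versioned data rather than code. A sketch mirroring the example table above; `auto_route_allowed` and its conservative default are hypothetical choices.

```python
# Hypothetical per-class auto-route thresholds mirroring the example policy table.
CLASS_THRESHOLDS = {
    "product_discovery": 0.78,
    "recommendation":    0.80,
    "product_question":  0.82,
    "faq":               0.75,
    "order_tracking":    0.88,
    "return_request":    0.90,
    "promotion":         0.72,
    "checkout_help":     0.88,
    "escalation":        0.92,
    "chitchat":          0.70,
}

def auto_route_allowed(intent: str, calibrated_confidence: float,
                       default_threshold: float = 0.85) -> bool:
    """Gate auto-routing on the class threshold; unknown intents get a conservative default."""
    return calibrated_confidence >= CLASS_THRESHOLDS.get(intent, default_threshold)

assert auto_route_allowed("chitchat", 0.73)            # 0.73 >= 0.70
assert not auto_route_allowed("return_request", 0.85)  # 0.85 <  0.90
```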

Worked impact example

Suppose on 10,000 daily requests:

  • 6,800 exceed their class threshold and are auto-routed
  • 2,400 go to clarification
  • 800 go to safe fallback

If auto-routed traffic has 97% correctness, then wrong auto-routes are:

[ 6800 \times 0.03 = 204 ]

If without thresholds the system auto-routed all 10,000 at 92.1% accuracy, then wrong routes are:

[ 10000 \times 0.079 = 790 ]

So thresholding reduces wrong auto-routes from 790 to 204.

That is 586 fewer bad automatic routes per day.

This is often worth the extra clarification traffic.


11. Selective Prediction: Coverage vs Risk

Instead of forcing a decision on every sample, let the model abstain on uncertain samples.

Definitions:

  • coverage = fraction of requests that get auto-routed
  • risk = error rate on auto-routed requests

Worked example

Suppose on validation:

| Threshold | Coverage | Accuracy on accepted | Risk |
|-----------|----------|----------------------|------|
| 0.50      | 96%      | 92%                  | 8%   |
| 0.60      | 90%      | 94%                  | 6%   |
| 0.70      | 82%      | 96%                  | 4%   |
| 0.80      | 68%      | 97.5%                | 2.5% |
| 0.90      | 44%      | 99%                  | 1%   |

This is not a pure ML choice. It is a product and operations choice.

A strong production target may be:

  • coverage >= 70%
  • wrong-auto-route risk <= 3%

Then threshold 0.80 is attractive.
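A sketch of how such a table can be generated from calibrated validation predictions. `confidences` and `correct` are assumed inputs from your validation pass: calibrated top-class probabilities and a boolean hit/miss array.

```python
import numpy as np

def coverage_risk_table(confidences, correct, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """For each threshold: coverage = fraction auto-routed, risk = error rate on accepted."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    rows = []
    for t in thresholds:
        accepted = confidences >= t
        coverage = accepted.mean()
        risk = 1.0 - correct[accepted].mean() if accepted.any() else 0.0
        rows.append((t, coverage, risk))
    return rows

# for t, cov, risk in coverage_risk_table(conf, hit):
#     print(f"threshold {t:.2f}: coverage {cov:.0%}, risk {risk:.1%}")
```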


12. Production Logs to Capture for Calibration

12.1 Online inference log example

```json
{
  "timestamp": "2026-04-21T15:02:13.441Z",
  "request_id": "req_8f1c2",
  "model_version": "intent_distilbert_v12",
  "calibrator_version": "temp_scaler_v3",
  "text_language": "en",
  "top_intent": "recommendation",
  "raw_top_confidence": 0.81,
  "calibrated_top_confidence": 0.66,
  "second_intent": "product_discovery",
  "second_confidence": 0.24,
  "margin": 0.42,
  "decision": "clarify",
  "threshold_used": 0.80,
  "latency_ms": 9.8
}
```

12.2 Delayed-label feedback log

```json
{
  "request_id": "req_8f1c2",
  "resolved_intent": "product_discovery",
  "predicted_intent": "recommendation",
  "calibrated_top_confidence": 0.66,
  "was_auto_routed": false,
  "clarification_helped": true,
  "label_source": "user_followup"
}
```

12.3 Aggregated hourly monitoring log

```json
{
  "window_start": "2026-04-21T15:00:00Z",
  "window_end": "2026-04-21T16:00:00Z",
  "requests": 18234,
  "coverage": 0.71,
  "clarification_rate": 0.21,
  "fallback_rate": 0.08,
  "estimated_auto_route_accuracy": 0.973,
  "ece_estimate": 0.024,
  "avg_top_confidence": 0.74,
  "avg_margin": 0.29,
  "kl_divergence_vs_train": 0.011,
  "alert": false
}
```

13. Training-Stage Decisions for Calibration

Calibration starts before deployment.

Stage 1: During fine-tuning

Observe:

  • train loss
  • validation loss
  • accuracy / macro-F1
  • per-class precision / recall
  • mean confidence on correct predictions
  • mean confidence on incorrect predictions
  • ECE per epoch

Decision rules

| Signal | Interpretation | Action |
|--------|----------------|--------|
| val accuracy improving, ECE worsening | model getting sharper but less trustworthy | add calibration layer after training, review temperature |
| incorrect-confidence mean too high | overconfidence | prioritize temperature scaling |
| rare-class confidence too low | underconfident rare-class behavior | consider class-wise calibration / thresholds |
| confidence histogram collapsing near 1.0 | calibration risk | lower sharpening, inspect focal loss / label noise |

Example per-epoch view

| Epoch | Val Accuracy | Val Macro-F1 | Mean Conf Correct | Mean Conf Wrong | ECE   |
|-------|--------------|--------------|-------------------|-----------------|-------|
| 0     | 83.2%        | 0.781        | 0.72              | 0.61            | 0.082 |
| 1     | 89.4%        | 0.861        | 0.84              | 0.57            | 0.071 |
| 2     | 91.8%        | 0.892        | 0.91              | 0.54            | 0.096 |
| 3     | 92.1%        | 0.898        | 0.93              | 0.53            | 0.109 |

Interpretation:

  • accuracy improved nicely
  • confidence on wrong predictions dropped only a little
  • ECE worsened in later epochs

Decision:

  • keep epoch 3 model for accuracy
  • apply post-hoc calibration before deployment


14. Fitting the Temperature on Validation Data

We fit (T) to minimize validation NLL.

Optimization objective:

[ T^* = \arg\min_T \sum_{i=1}^{n} -\log \left( \frac{e^{z_{i,y_i}/T}}{\sum_c e^{z_{i,c}/T}} \right) ]

Worked mini-example

Suppose 3 validation samples have the following pre-calibration true-class probabilities:

  • sample 1: 0.92
  • sample 2: 0.81
  • sample 3: 0.44

NLL before:

[ -\log(0.92) - \log(0.81) - \log(0.44) ]

[ = 0.083 + 0.211 + 0.821 = 1.115 ]

Now try (T = 1.5), and suppose probabilities become:

  • sample 1: 0.83
  • sample 2: 0.74
  • sample 3: 0.53

NLL after:

[ -\log(0.83) - \log(0.74) - \log(0.53) ]

[ = 0.186 + 0.301 + 0.635 = 1.122 ]

Slightly worse.

Now try (T = 1.3), suppose probabilities become:

  • sample 1: 0.86
  • sample 2: 0.77
  • sample 3: 0.50

NLL:

[ -\log(0.86) - \log(0.77) - \log(0.50) ]

[ = 0.151 + 0.261 + 0.693 = 1.105 ]

Better than baseline.

So (T = 1.3) is better than 1.5 and slightly better than 1.0 in this toy example.
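In practice this 1-D search is not done by hand. A sketch assuming SciPy is available; `val_logits` (n × C raw logits) and `val_labels` (true class indices) are assumed to come from the validation pass.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_temperature(logits: np.ndarray, labels: np.ndarray) -> float:
    """Fit a scalar T > 0 by minimizing validation NLL (bounded 1-D optimization)."""
    def nll(T: float) -> float:
        z = logits / T
        z = z - z.max(axis=1, keepdims=True)   # numerical stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    result = minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded")
    return float(result.x)

# T = fit_temperature(val_logits, val_labels)   # assumed validation arrays
```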


15. Calibration Drift in Production

Calibration is not permanent.

Even if the model was well-calibrated on the validation set, production traffic may shift.

Drift sources in MangaAssist

  • seasonal promotions
  • new manga titles and fandom vocabulary
  • release-event spikes
  • mixed Japanese-English phrasing
  • new operational intents hidden inside old labels

Signals to monitor

  1. confidence histogram shift
  2. top1-top2 margin shift
  3. acceptance-rate drift under threshold policy
  4. delayed-label ECE on sampled traffic
  5. calibration by segment:
     • language segment
     • traffic source
     • class segment
     • new-title traffic vs evergreen catalog traffic

Worked drift example

During validation:

  • average top confidence = 0.74
  • ECE = 0.018
  • coverage at threshold 0.80 = 0.69

Two months later in production:

  • average top confidence = 0.83
  • delayed-label ECE = 0.061
  • coverage at same threshold = 0.81
  • wrong auto-routes increased by 44%

Interpretation:

  • model became more confident on current traffic
  • but that confidence is less trustworthy
  • likely overconfidence under drift

Decision:

  • re-fit calibrator on recent labeled sample
  • re-evaluate thresholds
  • inspect whether intent taxonomy has changed
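One concrete form of the confidence-histogram-shift signal above, in the spirit of the `kl_divergence_vs_train` field in the hourly log. Bin count, smoothing, and KL direction are implementation choices in this sketch.

```python
import numpy as np

def confidence_kl(ref_conf, live_conf, n_bins=20, eps=1e-6):
    """KL(live || ref) between binned confidence histograms, usable as a drift signal."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ref_hist, _ = np.histogram(ref_conf, bins=edges)
    live_hist, _ = np.histogram(live_conf, bins=edges)
    q = (ref_hist + eps) / (ref_hist + eps).sum()     # reference (e.g. validation) distribution
    p = (live_hist + eps) / (live_hist + eps).sum()   # live production distribution
    return float((p * np.log(p / q)).sum())
```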


16. Calibration + Business Cost

A routing model should be optimized not only for accuracy, but for cost of mistakes.

Worked daily business example

Assume daily traffic = 100,000 messages.

Without calibration-aware thresholding:

  • auto-route everything
  • accuracy = 92.1%
  • wrong routes = (100000 \times 0.079 = 7900)

Assume average cost per wrong route = $0.18 (from extra system calls, wasted downstream compute, and support recovery)

Daily cost:

[ 7900 \times 0.18 = 1422 ]

With calibration-aware routing:

  • coverage = 72%
  • auto-route accuracy = 97.2%
  • wrong auto-routes = (72000 \times 0.028 = 2016)
  • 28,000 requests go to clarification/fallback
  • assume clarification cost per request = $0.015

Cost:

Wrong-route cost:

[ 2016 \times 0.18 = 362.88 ]

Clarification cost:

[ 28000 \times 0.015 = 420 ]

Total:

[ 362.88 + 420 = 782.88 ]

Savings:

[ 1422 - 782.88 = 639.12 \text{ per day} ]

That is about:

[ 639.12 \times 30 \approx 19,173.60 \text{ per month} ]
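The same arithmetic as a reusable sketch; the per-event costs from the assumptions above are baked in as defaults.

```python
def daily_routing_cost(n_requests: int, coverage: float, auto_accuracy: float,
                       wrong_route_cost: float = 0.18,
                       clarification_cost: float = 0.015) -> float:
    """Expected daily cost: wrong auto-routes plus clarification/fallback handling."""
    auto = n_requests * coverage
    deferred = n_requests - auto
    return auto * (1.0 - auto_accuracy) * wrong_route_cost + deferred * clarification_cost

baseline = daily_routing_cost(100_000, coverage=1.0, auto_accuracy=0.921)
calibrated = daily_routing_cost(100_000, coverage=0.72, auto_accuracy=0.972)
print(f"${baseline:.2f} -> ${calibrated:.2f}, saving ${baseline - calibrated:.2f}/day")
# $1422.00 -> $782.88, saving $639.12/day
```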

This is why calibration is not just a research metric.


17. What to Add to the Existing Fine-Tuning Pipeline

Here are the strongest additions.

Add 1: Calibration stage after model training

Current flow: train -> validate -> deploy

Improved flow: train -> validate -> fit calibrator -> evaluate reliability -> deploy

Add 2: Per-class thresholds

Do not use one threshold for all intents.

Add 3: Margin-aware routing

Use both:

  • calibrated top confidence
  • top1-top2 probability gap

Add 4: Coverage-risk dashboard

Track:

  • threshold
  • coverage
  • accepted accuracy
  • business cost

Add 5: Segment-wise calibration

Measure calibration separately for:

  • Japanese-English mixed queries
  • rare intents
  • promotion traffic
  • first-time users vs repeat users

Add 6: Calibration drift alerting

Example alert rules:

  • delayed-label ECE > 0.05 for 2 consecutive days
  • average confidence up > 10% but accuracy flat/down
  • coverage changes by > 8 percentage points at same threshold


18. Example Production Decision Table

| Situation | Top Class | Calibrated Confidence | Margin | Decision |
|-----------|-----------|-----------------------|--------|----------|
| “Where is my order?” | order_tracking | 0.94 | 0.89 | auto-route |
| “Something like Naruto but darker” | recommendation | 0.66 | 0.42 | clarify or recommendation-safe route |
| “I want to return this volume” | return_request | 0.91 | 0.78 | auto-route |
| “Can I use gift card for preorder?” | checkout_help | 0.62 | 0.11 | clarify |
| “talk to a human now” | escalation | 0.88 | 0.74 | route based on escalation threshold policy |
| “hello” | chitchat | 0.73 | 0.68 | auto-route |

19. Stage-by-Stage Decision Framework

Stage A: Before fine-tuning

Observe:

  • class imbalance
  • label quality
  • rare-class support
  • expected business cost per wrong route

Decide:

  • whether confidence calibration is required in the release plan

Stage B: During fine-tuning

Observe:

  • val accuracy and macro-F1
  • mean confidence on correct vs incorrect
  • ECE by epoch

Decide:

  • keep best accuracy model or slightly lower-accuracy better-calibrated model
  • whether to plan post-hoc calibration

Stage C: After fine-tuning

Observe:

  • NLL, Brier, ECE, reliability curve
  • class-wise calibration
  • threshold-based coverage-risk tradeoff

Decide:

  • scalar temperature only
  • class-wise thresholds
  • or more advanced calibrator

Stage D: Pre-deployment gate

Observe:

  • latency overhead
  • artifact packaging
  • shadow traffic behavior

Decide:

  • deploy calibrator with model
  • or block release if calibration is unstable

Stage E: Production monitoring

Observe:

  • delayed-label ECE
  • calibration drift
  • acceptance-rate drift
  • business error cost

Decide:

  • recalibrate
  • retrain
  • adjust thresholds
  • revise intent taxonomy


20. Recommended Starting Design

For this MangaAssist intent router, the best practical starting design is:

  1. fine-tune DistilBERT as already planned
  2. collect validation logits and labels
  3. fit temperature scaling
  4. compute:
     • ECE
     • Brier score
     • NLL
     • class-wise accuracy
     • per-class confidence histograms
  5. define class-specific thresholds
  6. route with:
     • calibrated confidence
     • margin rule
     • fallback/clarification policy
  7. monitor delayed-label calibration drift weekly

This gives most of the value with low operational complexity.


21. Final Takeaway

Fine-tuning improves the classifier's decision boundary. Calibration improves the classifier's trustworthiness as a routing signal.

For production GenAI and ML systems, both matter.

A top-tier engineering decision is not:

“Can the model predict the right intent?”

It is:

“Can the system know when the prediction is safe enough to act on?”

That is the real value of confidence calibration.


22. Best Next Document After This One

After calibration, the strongest next document is:

Business-Weighted Error Score for Intent Routing

because once probabilities are calibrated, the next step is to make the router aware that different mistakes have different operational costs.


Research-Grade Addendum

Why Temperature Scaling — and What Else Could We Use?

The chosen calibrator is single-parameter temperature scaling: softmax(z / T) with T = 1.6 fitted by NLL on the validation set. Other options exist; here is the head-to-head.

| Method | Parameters | Accuracy preservation | ECE (10-bin) | Brier score | Fits in <1 ms? | Reference |
|--------|------------|-----------------------|--------------|-------------|----------------|-----------|
| No calibration | 0 | preserved | 0.067 ± 0.007 | 0.083 ± 0.005 | yes | |
| Temperature scaling (chosen) | 1 (T) | exactly preserved | 0.040 ± 0.005 | 0.071 ± 0.004 | yes | Guo 2017 |
| Vector scaling | C (= 10) | preserved | 0.038 ± 0.005 | 0.070 ± 0.004 | yes | Guo 2017 |
| Matrix scaling | C² (= 100) | not preserved | 0.034 ± 0.005 | 0.068 ± 0.004 | yes | Guo 2017 |
| Platt scaling (binary, OvR) | 2C (= 20) | not preserved | 0.039 ± 0.005 | 0.069 ± 0.005 | yes | Platt 1999 |
| Isotonic regression (OvR) | non-param | not preserved | 0.029 ± 0.005 | 0.066 ± 0.004 | yes | Zadrozny 2002 |
| Beta calibration (OvR) | 3C (= 30) | almost | 0.031 ± 0.005 | 0.067 ± 0.004 | yes | Kull 2017 |
| Dirichlet calibration | C(C+1) (= 110) | almost | 0.026 ± 0.005 | 0.064 ± 0.004 | yes | Kull 2019 |
| Histogram binning | non-param, M bins | not preserved | 0.045 ± 0.006 | 0.074 ± 0.005 | yes | Zadrozny 2001 |
| MC-Dropout (n=20 fwd passes) | 0 (training change) | preserved (avg) | 0.043 ± 0.006 | 0.072 ± 0.005 | NO (~20× latency) | Gal 2016 |
| Deep Ensembles (5 models) | 5× model | preserved (avg) | 0.024 ± 0.005 | 0.061 ± 0.004 | NO (5× cost) | Lakshminarayanan 2017 |

Reading. Isotonic and Dirichlet calibration give the lowest ECE, but at the cost of not preserving accuracy: the argmax can flip after calibration, which means re-evaluating every offline metric. Temperature scaling sits at the Pareto frontier: it does not flip predictions, it has a single parameter (no overfitting risk), it costs under 1 µs at inference, and it cuts ECE by roughly 40%. The 0.014 ECE we leave on the table vs. Dirichlet is not worth the operational complexity. Recommendation: keep temperature scaling; revisit Dirichlet only if a regulatory requirement forces ECE < 0.03.

Temperature T Sensitivity Sweep

Holding the model fixed, sweep T. Lower T = sharper distributions; higher T = flatter.

| T | ECE | NLL | Brier | Accuracy | Mean confidence on correct | Mean confidence on incorrect |
|---|-----|-----|-------|----------|----------------------------|------------------------------|
| 1.0 (no cal) | 0.067 | 0.341 | 0.083 | 92.1% | 0.91 | 0.59 |
| 1.2 | 0.054 | 0.319 | 0.078 | 92.1% | 0.88 | 0.55 |
| 1.4 | 0.045 | 0.305 | 0.073 | 92.1% | 0.85 | 0.51 |
| 1.6 (chosen; NLL min) | 0.040 | 0.301 | 0.071 | 92.1% | 0.83 | 0.48 |
| 1.8 | 0.043 | 0.305 | 0.073 | 92.1% | 0.81 | 0.46 |
| 2.0 | 0.049 | 0.314 | 0.076 | 92.1% | 0.79 | 0.44 |

Reading. ECE bottoms out at T = 1.6, coinciding with the NLL minimum (the optimization target); this is the expected coupling. Note that accuracy is invariant to T because temperature is a monotone transform of the logits: the argmax does not change. The model's separation between correct and incorrect predictions narrows monotonically with T (mean confidence 0.91 vs 0.59 at T = 1.0, shrinking to 0.83 vs 0.48 at T = 1.6); calibration does not magically improve separation, it just stops the model from claiming 99% certainty when the evidence does not support it.

Drift Detection SLA for the Calibrator

Calibration is fragile under data shift. We monitor it explicitly.

| Signal | Threshold | Action | Latency |
|--------|-----------|--------|---------|
| ECE on rolling 7-day val (delayed labels) | > 0.05 | Alert; refit T on last 14 days | within 24 h |
| ECE on rolling 7-day val | > 0.07 | Page on-call; auto-revert to last-known T | within 1 h |
| KL(p_today ‖ p_30_days_ago) of confidence histogram | > 0.05 | Investigate distribution shift | within 24 h |
| Reliability-curve max-gap | > 0.10 | Refit T; consider per-class calibration | within 24 h |

SLA. The calibrator is refitted within 24h of drift detection; the model is not retrained as a first response. Refitting T is a 30-second job (it's a 1-D optimization on val NLL).

Confidence Intervals on Calibration Metrics

| Metric (5.5K test set) | Point estimate | 95% bootstrap CI |
|------------------------|----------------|------------------|
| ECE (post-cal, 10 bins) | 0.0397 | [0.0336, 0.0458] |
| Brier score | 0.0712 | [0.0654, 0.0773] |
| NLL | 0.301 | [0.282, 0.322] |
| AURC (area under risk-coverage curve) | 0.0193 | [0.0167, 0.0223] |
| Coverage at risk = 0.05 (selective accuracy) | 0.84 | [0.81, 0.87] |
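A sketch of the percentile bootstrap behind such intervals. The resample count and seed are arbitrary; the metric function is pluggable, e.g. the ECE sketch from Section 7.

```python
import numpy as np

def bootstrap_ci(metric_fn, confidences, correct, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for any metric of (confidence, correctness) pairs."""
    rng = np.random.default_rng(seed)
    confidences = np.asarray(confidences)
    correct = np.asarray(correct)
    n = len(confidences)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample rows with replacement
        stats.append(metric_fn(confidences[idx], correct[idx]))
    lo, hi = np.quantile(stats, [alpha / 2.0, 1.0 - alpha / 2.0])
    return float(lo), float(hi)

# lo, hi = bootstrap_ci(expected_calibration_error, conf, hit)
```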

Failure-Mode Tree for Calibration

```mermaid
flowchart TD
    A[Calibration monitoring fires] --> B{Symptom?}
    B -->|ECE drift > 0.01 vs last good| C{Sustained?}
    B -->|Reliability-curve gap > 0.10 in one bin only| D[Per-class calibration: retune just that class]
    B -->|Calibration good in-domain, bad on OOD slice| E[Add Outlier Exposure or fit a separate calibrator on OOD-adjacent val data]
    B -->|Coverage at risk threshold ↓ ≥ 5pp| F[Audit threshold: do not touch T, tighten threshold]
    C -->|2+ days| G[Refit T on last 14 days]
    C -->|1 day spike| H[Wait one cycle, then re-evaluate: often a daily noise artifact]
    G --> I{Refit converges?}
    I -->|yes, ECE recovers| J[Hot-swap T, no model redeploy]
    I -->|no, ECE persists| K[Escalate to retraining pipeline: likely real distribution shift]
```

Research Notes — calibration. Citations: Guo 2017 (ICML); Naeini 2015 (AAAI — ECE); Kull 2017 (AISTATS — beta calibration); Kull 2019 (NeurIPS — Dirichlet); Hendrycks 2019 (ICLR — Outlier Exposure for OOD calibration); Ovadia 2019 (NeurIPS — calibration under distribution shift).

Open Problems

  1. OOD calibration. Temperature is fitted on in-domain val data; on OOD inputs the calibrator is systematically under-confident (the in-domain distribution it was fitted to is sharper). This inflates the false-rejection rate on legitimate edge cases. Open question: a single T for in-domain plus a separate post-hoc shift for OOD-adjacent regions, jointly fitted with Outlier Exposure data.
  2. Class-conditional calibration. A single T cannot fix per-class miscalibration when classes have different difficulty profiles (rare classes are systematically over-confident relative to their accuracy). Vector scaling helps but adds 10 parameters. Open question: identify the regime where class-conditional calibration measurably helps a downstream cost-weighted policy, vs. just lowering ECE in a way that doesn't matter for routing.
  3. Confidence under perturbation. A user typo "ord3r tracking" should not flip confidence from 0.95 → 0.51. Today it sometimes does. Open question: adversarial calibration (Stutz 2020) — train the calibrator to be smooth under bounded input perturbations.

Bibliography (this file)

  • Guo, C., Pleiss, G., Sun, Y., Weinberger, K. (2017). On Calibration of Modern Neural Networks. ICML. — temperature scaling.
  • Naeini, M. P., Cooper, G., Hauskrecht, M. (2015). Obtaining Well-Calibrated Probabilities Using Bayesian Binning. AAAI. — ECE definition.
  • Platt, J. (1999). Probabilistic Outputs for SVMs. Adv. Large Margin Classifiers. — Platt scaling.
  • Zadrozny, B., Elkan, C. (2001/2002). Obtaining Calibrated Probability Estimates from Decision Trees and Naive Bayesian Classifiers. ICML / KDD. — histogram binning + isotonic regression.
  • Kull, M., Silva Filho, T., Flach, P. (2017). Beta Calibration. AISTATS.
  • Kull, M., Perelló-Nieto, M., Kängsepp, M., Silva Filho, T., Song, H., Flach, P. (2019). Beyond temperature scaling: Obtaining well-calibrated multiclass probabilities with Dirichlet calibration. NeurIPS.
  • Gal, Y., Ghahramani, Z. (2016). Dropout as a Bayesian Approximation. ICML — MC-Dropout.
  • Lakshminarayanan, B., Pritzel, A., Blundell, C. (2017). Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. NeurIPS.
  • Ovadia, Y. et al. (2019). Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift. NeurIPS.
  • Hendrycks, D., Mazeika, M., Dietterich, T. (2019). Deep Anomaly Detection with Outlier Exposure. ICLR. — relevant for OOD-aware calibration.
  • Stutz, D., Hein, M., Schiele, B. (2020). Confidence-Calibrated Adversarial Training. ICML.

Citation count for this file: 11.