Confidence Calibration for Fine-Tuned Intent Routing — MangaAssist
This document is the first deep-dive extension to the MangaAssist fine-tuning scenario. It focuses on confidence calibration for a fine-tuned DistilBERT intent classifier so that the router can make safer and more business-aware decisions after training.
It uses the same setting as the original scenario:
- 10 intent classes
- DistilBERT fine-tuned for manga retail intent routing
- under 15 ms P95 latency target
- accuracy improving from about 83.2% to about 92.1% after fine-tuning
- production concerns around misroutes, rare intents, drift, and weekly/monthly retraining
1. Why Calibration Matters
A classifier can be accurate but badly calibrated.
That means:
- when it says 0.95 confidence, it may only be right 80% of the time
- when it says 0.55 confidence, it may actually be right 70% of the time
For routing systems, that is dangerous.
In MangaAssist, the model prediction is not the final user-facing answer. It is a control signal that decides which downstream service to call:
- recommendation engine
- order lookup
- return flow
- FAQ pipeline
- human escalation
So the router should not ask only:
“Which class has the highest score?”
It should also ask:
“How trustworthy is that score?”
Business reason
A wrong route can cause:
- bad customer experience
- unnecessary LLM/tool calls
- wrong workflow activation
- customer support escalation
- hidden cost increases

A calibrated classifier lets us do smarter things such as:
- auto-route only when confidence is reliable
- send uncertain predictions to fallback logic
- request clarification for ambiguous cases
- prioritize human review on low-confidence samples
- trigger active learning on the most informative traffic
2. Accuracy vs Confidence vs Calibration
These are different concepts.
Accuracy
Accuracy checks whether the top prediction is correct.
[ \text{Accuracy} = \frac{\text{# correct predictions}}{\text{# total predictions}} ]
Confidence
Confidence is the model's predicted probability for the top class.
If the post-softmax probabilities are:
- recommendation = 0.78
- product_discovery = 0.14
- product_question = 0.05
- others = 0.03 total
then the confidence is 0.78.
Calibration
Calibration checks whether probabilities match reality.
If the model produces 100 predictions with confidence around 0.80, and only 60 are correct, then the model is overconfident.
If it produces 100 predictions with confidence around 0.55, and 75 are correct, then it is underconfident.
A well-calibrated model satisfies:
[ P(\hat{y}=y \mid \hat{p}=0.8) \approx 0.8 ]
That means: among predictions made at 80% confidence, about 80% should be correct.
3. Where Calibration Fits in the Full Lifecycle
```mermaid
flowchart TD
    A[Training Data] --> B[Fine-Tune DistilBERT]
    B --> C[Validation Predictions]
    C --> D[Fit Calibration Layer]
    D --> E["Evaluate ECE, Brier, NLL, Reliability"]
    E --> F{Pass Calibration Gate?}
    F -->|Yes| G[Package Model + Calibrator]
    F -->|No| H[Revisit Loss / Data / Temperature]
    G --> I[Deploy to Production]
    I --> J[Online Routing Decisions]
    J --> K[Confidence Thresholding]
    K --> L[Fallback / Clarification / Human Review]
    I --> M[Production Monitoring]
    M --> N[Drift + Confidence Shift + Calibration Drift]
    N --> A
```
4. Baseline Numerical Example: Why Softmax Alone Is Not Enough
Suppose for the query:
“I want something like Naruto but darker”
The fine-tuned model produces logits:
- recommendation = 3.2
- product_discovery = 2.4
- product_question = 0.4
- faq = -0.7
- chitchat = -1.0
4.1 Convert logits to probabilities
Softmax:
[ \hat{p}_i = \frac{e^{z_i}}{\sum_j e^{z_j}} ]
Exponentials:
- (e^{3.2} \approx 24.533)
- (e^{2.4} \approx 11.023)
- (e^{0.4} \approx 1.492)
- (e^{-0.7} \approx 0.497)
- (e^{-1.0} \approx 0.368)
Sum:
[ 24.533 + 11.023 + 1.492 + 0.497 + 0.368 = 37.913 ]
Probabilities:
- recommendation = (24.533 / 37.913 \approx 0.647)
- product_discovery = (11.023 / 37.913 \approx 0.291)
- product_question = (1.492 / 37.913 \approx 0.039)
- faq = (0.497 / 37.913 \approx 0.013)
- chitchat = (0.368 / 37.913 \approx 0.010)
Top class is recommendation with confidence 0.647.
That looks decent.
But if historical validation shows that predictions in the 0.60–0.70 bin are only correct 51% of the time, then the router should not treat 0.647 as “safe enough” for irreversible routing.
That is the whole point of calibration.
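The softmax arithmetic above is easy to check in a few lines of Python; this is a minimal sketch using the logit values and class order from the example:

```python
import math

def softmax(logits):
    # Shift by the max logit for numerical stability; probabilities are unchanged.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits from the worked example: recommendation, product_discovery,
# product_question, faq, chitchat.
logits = [3.2, 2.4, 0.4, -0.7, -1.0]
probs = softmax(logits)
top_confidence = max(probs)  # ~0.647 for the recommendation class
```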
5. Main Calibration Methods
For this system, the most practical options are:
- Temperature scaling
- Vector scaling
- Platt scaling / logistic calibration
- Isotonic regression
- Class-wise threshold calibration
For MangaAssist, the best first choice is usually:
Temperature scaling + class-wise routing thresholds
because it is simple, fast, and low-risk.
5.1 Temperature scaling
We learn a scalar temperature (T > 0) on the validation set.
New probabilities:
[ \hat{p}_i^{(T)} = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}} ]
Interpretation:
- T = 1: no change
- T > 1: soften probabilities, reduce overconfidence
- T < 1: sharpen probabilities, increase confidence

Temperature scaling never changes the class ranking, because dividing the logits by a positive constant is a monotone transform. It only changes the confidence values.
5.2 Why temperature scaling is good here
- tiny extra compute
- easy to version and deploy
- does not retrain encoder
- works well when model is mainly overconfident
- easy to update during post-training validation
5.3 When temperature scaling is not enough
It may fail if:
- each class has a very different confidence distortion
- rare intents are miscalibrated differently from common intents
- calibration varies heavily by traffic segment
- drift changes the confidence distribution after deployment

Then we may add:
- class-conditional thresholds
- vector scaling
- segment-based calibrators
6. Worked Temperature-Scaling Example
Take the same logits:
- recommendation = 3.2
- product_discovery = 2.4
- product_question = 0.4
- faq = -0.7
- chitchat = -1.0
Assume validation optimization finds:
[ T = 1.6 ]
6.1 Scaled logits
[ z_i' = z_i / 1.6 ]
So:
- recommendation = 2.000
- product_discovery = 1.500
- product_question = 0.250
- faq = -0.438
- chitchat = -0.625

Exponentials:
- (e^{2.0} \approx 7.389)
- (e^{1.5} \approx 4.482)
- (e^{0.25} \approx 1.284)
- (e^{-0.438} \approx 0.645)
- (e^{-0.625} \approx 0.535)
Sum:
[ 7.389 + 4.482 + 1.284 + 0.645 + 0.535 = 14.335 ]
New probabilities:
- recommendation = (7.389 / 14.335 \approx 0.515)
- product_discovery = (4.482 / 14.335 \approx 0.313)
- product_question = (1.284 / 14.335 \approx 0.090)
- faq = (0.645 / 14.335 \approx 0.045)
- chitchat = (0.535 / 14.335 \approx 0.037)
What changed?
Before calibration:
- top class = recommendation
- confidence = 0.647

After calibration:
- top class = recommendation
- confidence = 0.515
The class did not change. The trust level did.
That is exactly what we want.
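The same computation with temperature applied can be sketched directly, reusing the logits above and the fitted T = 1.6 from the worked example:

```python
import math

def calibrated_softmax(logits, T=1.0):
    # Temperature scaling: divide logits by T, then take the softmax.
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.2, 2.4, 0.4, -0.7, -1.0]
uncalibrated = calibrated_softmax(logits, T=1.0)  # top prob ~0.647
calibrated = calibrated_softmax(logits, T=1.6)    # top prob ~0.515
same_top_class = (uncalibrated.index(max(uncalibrated))
                  == calibrated.index(max(calibrated)))
```

The top class is identical before and after; only the confidence assigned to it changes.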
7. Metrics to Evaluate Calibration
The most important metrics are:
- ECE: Expected Calibration Error
- MCE: Maximum Calibration Error
- NLL: Negative Log Likelihood
- Brier Score
- Reliability curves
- Coverage vs risk under routing thresholds
7.1 Expected Calibration Error (ECE)
We split predictions into confidence bins.
For bin (B_m):
- (\text{acc}(B_m)) = empirical accuracy in the bin
- (\text{conf}(B_m)) = average predicted confidence in the bin
ECE:
[ \text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \text{acc}(B_m) - \text{conf}(B_m) \right| ]
7.2 Worked ECE example
Suppose we evaluate 1,000 validation examples and use 5 bins.
| Bin | Confidence Range | Count | Avg Confidence | Accuracy | Gap |
|---|---|---|---|---|---|
| B1 | 0.0–0.2 | 40 | 0.16 | 0.20 | 0.04 |
| B2 | 0.2–0.4 | 110 | 0.33 | 0.38 | 0.05 |
| B3 | 0.4–0.6 | 210 | 0.52 | 0.49 | 0.03 |
| B4 | 0.6–0.8 | 330 | 0.71 | 0.61 | 0.10 |
| B5 | 0.8–1.0 | 310 | 0.89 | 0.75 | 0.14 |
Compute weighted sum:
[ \text{ECE} = \frac{40}{1000}(0.04) + \frac{110}{1000}(0.05) + \frac{210}{1000}(0.03) + \frac{330}{1000}(0.10) + \frac{310}{1000}(0.14) ]
[ = 0.0016 + 0.0055 + 0.0063 + 0.033 + 0.0434 = 0.0898 ]
So:
[ \text{ECE} \approx 0.090 ]
That is fairly bad for a routing model.
Now suppose after temperature scaling:
| Bin | Count | Avg Confidence | Accuracy | Gap |
|---|---|---|---|---|
| B1 | 55 | 0.17 | 0.18 | 0.01 |
| B2 | 135 | 0.32 | 0.35 | 0.03 |
| B3 | 255 | 0.51 | 0.52 | 0.01 |
| B4 | 315 | 0.69 | 0.68 | 0.01 |
| B5 | 240 | 0.82 | 0.80 | 0.02 |
Then:
[ \text{ECE} = \frac{55}{1000}(0.01) + \frac{135}{1000}(0.03) + \frac{255}{1000}(0.01) + \frac{315}{1000}(0.01) + \frac{240}{1000}(0.02) ]
[ = 0.00055 + 0.00405 + 0.00255 + 0.00315 + 0.0048 = 0.0151 ]
So ECE improved from 0.090 to 0.015.
That is a major win.
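The two binned ECE sums above can be reproduced directly; the bin tuples below mirror the before/after tables:

```python
def expected_calibration_error(bins):
    # bins: list of (count, avg_confidence, accuracy) per confidence bin.
    n = sum(count for count, _, _ in bins)
    return sum(count / n * abs(acc - conf) for count, conf, acc in bins)

before = [(40, 0.16, 0.20), (110, 0.33, 0.38), (210, 0.52, 0.49),
          (330, 0.71, 0.61), (310, 0.89, 0.75)]
after = [(55, 0.17, 0.18), (135, 0.32, 0.35), (255, 0.51, 0.52),
         (315, 0.69, 0.68), (240, 0.82, 0.80)]
# expected_calibration_error(before) ~ 0.090, after ~ 0.015
```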
7.3 Brier Score
For multiclass, one common form is:
[ \text{Brier} = \frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} (p_{ic} - y_{ic})^2 ]
Lower is better.
Worked single-example case:
Predicted:
- recommendation = 0.70
- product_discovery = 0.20
- product_question = 0.10

True class = recommendation, so the one-hot target is:
- recommendation = 1
- product_discovery = 0
- product_question = 0
Brier contribution:
[ (0.70 - 1)^2 + (0.20 - 0)^2 + (0.10 - 0)^2 = 0.09 + 0.04 + 0.01 = 0.14 ]
If instead the prediction was:
- recommendation = 0.40
- product_discovery = 0.35
- product_question = 0.25
then Brier:
[ (0.40 - 1)^2 + 0.35^2 + 0.25^2 = 0.36 + 0.1225 + 0.0625 = 0.545 ]
Much worse.
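Both single-example Brier contributions can be checked like this (class order: recommendation, product_discovery, product_question):

```python
def brier_contribution(probs, true_index):
    # Squared error between the predicted distribution and the one-hot target.
    return sum((p - (1.0 if i == true_index else 0.0)) ** 2
               for i, p in enumerate(probs))

confident = brier_contribution([0.70, 0.20, 0.10], true_index=0)  # 0.14
hesitant = brier_contribution([0.40, 0.35, 0.25], true_index=0)   # 0.545
```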
7.4 Negative Log Likelihood (NLL)
For true class probability (p_t):
[ \text{NLL} = -\log(p_t) ]
If correct-class probability improves from 0.42 to 0.76 after calibration-aware selection and thresholding, NLL goes from:
[ -\log(0.42) \approx 0.867 ]
to:
[ -\log(0.76) \approx 0.275 ]
Lower is better.
8. Reliability Diagram Interpretation
```mermaid
xychart-beta
    title "Reliability: Before vs After Calibration"
    x-axis [0.1, 0.3, 0.5, 0.7, 0.9]
    y-axis "Empirical Accuracy" 0 --> 1
    line "Ideal" [0.1, 0.3, 0.5, 0.7, 0.9]
    line "Before" [0.2, 0.38, 0.49, 0.61, 0.75]
    line "After" [0.18, 0.35, 0.52, 0.68, 0.80]
```
How to read it:
- the closer the curve is to the diagonal, the better calibrated the model
- if the curve is below the diagonal, the model is overconfident
- if the curve is above the diagonal, the model is underconfident
In this scenario, the pre-calibration model is mostly overconfident in higher-confidence bins.
9. Routing Decisions After Calibration
Calibration matters because it changes operational logic.
9.1 Simple threshold policy
Let:
- top probability = (p_1)
- second probability = (p_2)
- margin = (p_1 - p_2)
Use these rules:
- Auto-route if:
  - calibrated confidence (p_1 \ge 0.80)
  - margin (\ge 0.20)
- Clarify with the user if:
  - (0.55 \le p_1 < 0.80)
  - or margin (< 0.20)
- Fallback / human-safe path if:
  - (p_1 < 0.55)
  - or the intent belongs to a high-risk class with low confidence
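The rules above can be sketched as a single routing function. The thresholds 0.80 / 0.55 and the 0.20 margin come from the policy; the function itself is an illustration, not an existing API:

```python
def route(calibrated_probs):
    """Return (decision, top_intent) from a dict of intent -> calibrated prob."""
    ranked = sorted(calibrated_probs.items(), key=lambda kv: kv[1], reverse=True)
    (top_intent, p1), (_, p2) = ranked[0], ranked[1]
    margin = p1 - p2
    if p1 >= 0.80 and margin >= 0.20:
        return ("auto_route", top_intent)
    if p1 >= 0.55:
        # Moderate confidence, or high confidence with a thin margin.
        return ("clarify", top_intent)
    return ("fallback", top_intent)
```

On the worked cases in this section, this returns auto_route for Case A, clarify for Case B, and fallback for Case C.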
9.2 Worked examples
Case A: Safe auto-route
Calibrated probabilities:
- order_tracking = 0.91
- faq = 0.05
- return_request = 0.02
- others = 0.02
Margin:
[ 0.91 - 0.05 = 0.86 ]
Decision: route directly to order lookup.
Case B: Clarification needed
Calibrated probabilities:
- recommendation = 0.58
- product_discovery = 0.33
- product_question = 0.05
Margin:
[ 0.58 - 0.33 = 0.25 ]
Confidence is moderate, so ask:
“Are you looking for recommendations similar to Naruto, or do you want to browse dark manga titles?”
Case C: Low-confidence safe fallback
Calibrated probabilities:
- return_request = 0.39
- faq = 0.28
- escalation = 0.18
- others = 0.15

Decision:
- do not auto-route to a return flow
- ask a clarifying question or hand off to safe general support
10. Class-Specific Thresholds
A global threshold is often not enough.
For MangaAssist, business risk differs by class.
Example:
- confusing recommendation vs product_discovery is annoying but usually recoverable
- confusing escalation vs chitchat can be much worse
- confusing return_request vs faq may activate the wrong operational flow
So we can set per-class thresholds.
Example policy
| Intent | Auto-route threshold | Reason |
|---|---|---|
| product_discovery | 0.78 | low business risk |
| recommendation | 0.80 | moderate ambiguity |
| product_question | 0.82 | user expects precise answer |
| faq | 0.75 | knowledge path is recoverable |
| order_tracking | 0.88 | wrong route can frustrate users quickly |
| return_request | 0.90 | higher workflow sensitivity |
| promotion | 0.72 | lower business risk |
| checkout_help | 0.88 | payment-related safety |
| escalation | 0.92 | very high support-risk class |
| chitchat | 0.70 | low business risk |
Worked impact example
Suppose on 10,000 daily requests:
- 6,800 exceed their class threshold and are auto-routed
- 2,400 go to clarification
- 800 go to safe fallback
If auto-routed traffic has 97% correctness, then wrong auto-routes are:
[ 6800 \times 0.03 = 204 ]
If without thresholds the system auto-routed all 10,000 at 92.1% accuracy, then wrong routes are:
[ 10000 \times 0.079 = 790 ]
So thresholding reduces wrong auto-routes from 790 to 204.
That is 586 fewer bad automatic routes per day.
This is often worth the extra clarification traffic.
11. Selective Prediction: Coverage vs Risk
Instead of forcing a decision on every sample, let the model abstain on uncertain samples.
Definitions:
- coverage = fraction of requests that get auto-routed
- risk = error rate on auto-routed requests
Worked example
Suppose on validation:
| Threshold | Coverage | Accuracy on accepted | Risk |
|---|---|---|---|
| 0.50 | 96% | 92% | 8% |
| 0.60 | 90% | 94% | 6% |
| 0.70 | 82% | 96% | 4% |
| 0.80 | 68% | 97.5% | 2.5% |
| 0.90 | 44% | 99% | 1% |
This is not a pure ML choice. It is a product and operations choice.
A strong production target may be:
- coverage >= 70%
- wrong-auto-route risk <= 3%

Then threshold 0.80 is the most attractive row: it is the only one that meets the risk budget (2.5%) with substantial coverage, even though its 68% coverage falls just short of the coverage goal.
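Threshold selection against a validation sweep can be automated. A sketch under the assumption that the risk budget is the binding constraint, so we maximize coverage among thresholds that satisfy it:

```python
def pick_threshold(sweep, max_risk=0.03):
    # sweep: list of (threshold, coverage, risk) rows from validation.
    # Among thresholds meeting the risk budget, take the one with most coverage.
    feasible = [row for row in sweep if row[2] <= max_risk]
    if not feasible:
        return None  # no threshold satisfies the risk budget; revisit the model
    return max(feasible, key=lambda row: row[1])[0]

# Rows from the validation table above.
sweep = [(0.50, 0.96, 0.08), (0.60, 0.90, 0.06), (0.70, 0.82, 0.04),
         (0.80, 0.68, 0.025), (0.90, 0.44, 0.01)]
# pick_threshold(sweep) -> 0.80, matching the choice discussed above
```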
12. Production Logs to Capture for Calibration
12.1 Online inference log example
```json
{
  "timestamp": "2026-04-21T15:02:13.441Z",
  "request_id": "req_8f1c2",
  "model_version": "intent_distilbert_v12",
  "calibrator_version": "temp_scaler_v3",
  "text_language": "en",
  "top_intent": "recommendation",
  "raw_top_confidence": 0.81,
  "calibrated_top_confidence": 0.66,
  "second_intent": "product_discovery",
  "second_confidence": 0.24,
  "margin": 0.42,
  "decision": "clarify",
  "threshold_used": 0.80,
  "latency_ms": 9.8
}
```
12.2 Delayed-label feedback log
```json
{
  "request_id": "req_8f1c2",
  "resolved_intent": "product_discovery",
  "predicted_intent": "recommendation",
  "calibrated_top_confidence": 0.66,
  "was_auto_routed": false,
  "clarification_helped": true,
  "label_source": "user_followup"
}
```
12.3 Aggregated hourly monitoring log
```json
{
  "window_start": "2026-04-21T15:00:00Z",
  "window_end": "2026-04-21T16:00:00Z",
  "requests": 18234,
  "coverage": 0.71,
  "clarification_rate": 0.21,
  "fallback_rate": 0.08,
  "estimated_auto_route_accuracy": 0.973,
  "ece_estimate": 0.024,
  "avg_top_confidence": 0.74,
  "avg_margin": 0.29,
  "kl_divergence_vs_train": 0.011,
  "alert": false
}
```
13. Training-Stage Decisions for Calibration
Calibration starts before deployment.
Stage 1: During fine-tuning
Observe:
- train loss
- validation loss
- accuracy / macro-F1
- per-class precision / recall
- mean confidence on correct predictions
- mean confidence on incorrect predictions
- ECE per epoch
Decision rules
| Signal | Interpretation | Action |
|---|---|---|
| val accuracy improving, ECE worsening | model getting sharper but less trustworthy | add calibration layer after training, review temperature |
| incorrect-confidence mean too high | overconfidence | prioritize temperature scaling |
| rare-class confidence too low | underconfident rare-class behavior | consider class-wise calibration / thresholds |
| confidence histogram collapsing near 1.0 | calibration risk | lower sharpening, inspect focal loss / label noise |
Example per-epoch view
| Epoch | Val Accuracy | Val Macro-F1 | Mean Conf Correct | Mean Conf Wrong | ECE |
|---|---|---|---|---|---|
| 0 | 83.2% | 0.781 | 0.72 | 0.61 | 0.082 |
| 1 | 89.4% | 0.861 | 0.84 | 0.57 | 0.071 |
| 2 | 91.8% | 0.892 | 0.91 | 0.54 | 0.096 |
| 3 | 92.1% | 0.898 | 0.93 | 0.53 | 0.109 |
Interpretation:
- accuracy improved nicely
- mean confidence on wrong predictions dropped only a little
- ECE worsened in later epochs

Decision:
- keep the epoch-3 model for accuracy
- apply post-hoc calibration before deployment
14. Fitting the Temperature on Validation Data
We fit (T) to minimize validation NLL.
Optimization objective:
[ T^* = \arg\min_T \sum_{i=1}^{n} -\log \left( \frac{e^{z_{i,y_i}/T}}{\sum_c e^{z_{i,c}/T}} \right) ]
Worked mini-example
Suppose 3 validation samples have these pre-calibration true-class probabilities:
- sample 1: 0.92
- sample 2: 0.81
- sample 3: 0.44
NLL before:
[ -\log(0.92) - \log(0.81) - \log(0.44) ]
[ = 0.083 + 0.211 + 0.821 = 1.115 ]
Now try (T = 1.5), and suppose the probabilities become:
- sample 1: 0.83
- sample 2: 0.74
- sample 3: 0.53
NLL after:
[ -\log(0.83) - \log(0.74) - \log(0.53) ]
[ = 0.186 + 0.301 + 0.635 = 1.122 ]
Slightly worse.
Now try (T = 1.3), and suppose the probabilities become:
- sample 1: 0.86
- sample 2: 0.77
- sample 3: 0.50
NLL:
[ -\log(0.86) - \log(0.77) - \log(0.50) ]
[ = 0.151 + 0.261 + 0.693 = 1.105 ]
Better than baseline.
So (T = 1.3) is better than 1.5 and slightly better than 1.0 in this toy example.
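The same search can be run programmatically. This sketch fits T by a simple grid search over validation NLL; production code would typically use L-BFGS on the same objective, and the toy samples here are made up for illustration:

```python
import math

def total_nll(samples, T):
    # samples: list of (logits, true_class_index).
    nll = 0.0
    for logits, y in samples:
        scaled = [z / T for z in logits]
        m = max(scaled)
        log_z = m + math.log(sum(math.exp(z - m) for z in scaled))
        nll += log_z - scaled[y]  # -log p_true, computed in log space
    return nll

def fit_temperature(samples, grid=None):
    # 1-D grid search; adequate because NLL is smooth in the single parameter T.
    grid = grid or [1.0 + 0.05 * k for k in range(41)]  # T in [1.0, 3.0]
    return min(grid, key=lambda T: total_nll(samples, T))

# A toy overconfident model: the third sample is confidently wrong.
samples = [([3.0, 0.0, 0.0], 0), ([2.5, 2.0, 0.0], 0), ([3.0, 0.5, 0.0], 1)]
best_T = fit_temperature(samples)  # > 1.0: softening helps here
```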
15. Calibration Drift in Production
Calibration is not permanent.
Even if the model was well-calibrated on the validation set, production traffic may shift.
Drift sources in MangaAssist
- seasonal promotions
- new manga titles and fandom vocabulary
- release-event spikes
- mixed Japanese-English phrasing
- new operational intents hidden inside old labels
Signals to monitor
- confidence histogram shift
- top1-top2 margin shift
- acceptance-rate drift under threshold policy
- delayed-label ECE on sampled traffic
- calibration by segment:
  - language segment
  - traffic source
  - class segment
  - new-title traffic vs evergreen catalog traffic
Worked drift example
During validation:
- average top confidence = 0.74
- ECE = 0.018
- coverage at threshold 0.80 = 0.69

Two months later in production:
- average top confidence = 0.83
- delayed-label ECE = 0.061
- coverage at the same threshold = 0.81
- wrong auto-routes increased by 44%

Interpretation:
- the model became more confident on current traffic
- but that confidence is less trustworthy
- likely overconfidence under drift

Decision:
- re-fit the calibrator on a recent labeled sample
- re-evaluate thresholds
- inspect whether the intent taxonomy has changed
16. Calibration + Business Cost
A routing model should be optimized not only for accuracy, but for cost of mistakes.
Worked daily business example
Assume daily traffic = 100,000 messages.
Without calibration-aware thresholding:
- auto-route everything
- accuracy = 92.1%
- wrong routes = (100000 \times 0.079 = 7900)
Assume average cost per wrong route = $0.18 (from extra system calls, wasted downstream compute, and support recovery)
Daily cost:
[ 7900 \times 0.18 = 1422 ]
With calibration-aware routing:
- coverage = 72%
- auto-route accuracy = 97.2%
- wrong auto-routes = (72000 \times 0.028 = 2016)
- 28,000 requests go to clarification/fallback
- assume clarification cost per request = $0.015
Cost:
Wrong-route cost:
[ 2016 \times 0.18 = 362.88 ]
Clarification cost:
[ 28000 \times 0.015 = 420 ]
Total:
[ 362.88 + 420 = 782.88 ]
Savings:
[ 1422 - 782.88 = 639.12 \text{ per day} ]
That is about:
[ 639.12 \times 30 \approx 19,173.60 \text{ per month} ]
This is why calibration is not just a research metric.
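The cost arithmetic above generalizes to a small comparison function; the $0.18 and $0.015 unit costs are the assumptions stated in the example, not measured values:

```python
def daily_routing_cost(total, coverage, auto_accuracy,
                       wrong_route_cost, clarification_cost):
    # Auto-routed traffic pays for its mistakes; deferred traffic pays
    # a (much smaller) clarification/fallback cost per request.
    auto = total * coverage
    wrong = auto * (1.0 - auto_accuracy)
    deferred = total - auto
    return wrong * wrong_route_cost + deferred * clarification_cost

baseline = daily_routing_cost(100_000, 1.00, 0.921, 0.18, 0.015)    # ~1422.00
calibrated = daily_routing_cost(100_000, 0.72, 0.972, 0.18, 0.015)  # ~782.88
savings = baseline - calibrated                                     # ~639.12/day
```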
17. What to Add to the Existing Fine-Tuning Pipeline
Here are the strongest additions.
Add 1: Calibration stage after model training
Current flow: train -> validate -> deploy
Improved flow: train -> validate -> fit calibrator -> evaluate reliability -> deploy
Add 2: Per-class thresholds
Do not use one threshold for all intents.
Add 3: Margin-aware routing
Use both:
- calibrated top confidence
- top1-top2 probability gap
Add 4: Coverage-risk dashboard
Track:
- threshold
- coverage
- accepted accuracy
- business cost
Add 5: Segment-wise calibration
Measure calibration separately for:
- Japanese-English mixed queries
- rare intents
- promotion traffic
- first-time users vs repeat users
Add 6: Calibration drift alerting
Example alert rules:
- delayed-label ECE > 0.05 for 2 consecutive days
- average confidence up > 10% but accuracy flat or down
- coverage changes by > 8 percentage points at the same threshold
18. Example Production Decision Table
| Situation | Top Class | Calibrated Confidence | Margin | Decision |
|---|---|---|---|---|
| “Where is my order?” | order_tracking | 0.94 | 0.89 | auto-route |
| “Something like Naruto but darker” | recommendation | 0.66 | 0.42 | clarify or recommendation-safe route |
| “I want to return this volume” | return_request | 0.91 | 0.78 | auto-route |
| “Can I use gift card for preorder?” | checkout_help | 0.62 | 0.11 | clarify |
| “talk to a human now” | escalation | 0.88 | 0.74 | route based on escalation threshold policy |
| “hello” | chitchat | 0.73 | 0.68 | auto-route |
19. Stage-by-Stage Decision Framework
Stage A: Before fine-tuning
Observe:
- class imbalance
- label quality
- rare-class support
- expected business cost per wrong route

Decide:
- whether confidence calibration is required in the release plan
Stage B: During fine-tuning
Observe:
- val accuracy and macro-F1
- mean confidence on correct vs incorrect predictions
- ECE by epoch

Decide:
- keep the best-accuracy model or a slightly lower-accuracy, better-calibrated model
- whether to plan post-hoc calibration
Stage C: After fine-tuning
Observe:
- NLL, Brier, ECE, reliability curve
- class-wise calibration
- threshold-based coverage-risk tradeoff

Decide:
- scalar temperature only
- class-wise thresholds
- or a more advanced calibrator
Stage D: Pre-deployment gate
Observe:
- latency overhead
- artifact packaging
- shadow-traffic behavior

Decide:
- deploy the calibrator with the model
- or block the release if calibration is unstable
Stage E: Production monitoring
Observe:
- delayed-label ECE
- calibration drift
- acceptance-rate drift
- business error cost

Decide:
- recalibrate
- retrain
- adjust thresholds
- revise the intent taxonomy
20. Recommended First Production Design
For this MangaAssist intent router, the best practical starting design is:
- fine-tune DistilBERT as already planned
- collect validation logits and labels
- fit temperature scaling
- compute:
  - ECE
  - Brier score
  - NLL
  - class-wise accuracy
  - per-class confidence histograms
- define class-specific thresholds
- route with:
  - calibrated confidence
  - margin rule
  - fallback/clarification policy
- monitor delayed-label calibration drift weekly
This gives most of the value with low operational complexity.
21. Final Takeaway
Fine-tuning improves the classifier's decision boundary. Calibration improves the classifier's trustworthiness as a routing signal.
For production GenAI and ML systems, both matter.
A top-tier engineering decision is not:
“Can the model predict the right intent?”
It is:
“Can the system know when the prediction is safe enough to act on?”
That is the real value of confidence calibration.
22. Best Next Document After This One
After calibration, the strongest next document is:
Business-Weighted Error Score for Intent Routing
because once probabilities are calibrated, the next step is to make the router aware that different mistakes have different operational costs.
Research-Grade Addendum
Why Temperature Scaling — and What Else Could We Use?
The chosen calibrator is single-parameter temperature scaling: softmax(z / T) with T = 1.6 fitted by NLL on the validation set. Other options exist; here is the head-to-head.
| Method | Parameters | Accuracy preservation | ECE (10-bin) | Brier score | Fits in <1ms? | Reference |
|---|---|---|---|---|---|---|
| No calibration | 0 | preserved | 0.067 ± 0.007 | 0.083 ± 0.005 | yes | — |
| Temperature scaling (chosen) | 1 (T) | exactly preserved | 0.040 ± 0.005 | 0.071 ± 0.004 | yes | Guo 2017 |
| Vector scaling | C (= 10) | preserved | 0.038 ± 0.005 | 0.070 ± 0.004 | yes | Guo 2017 |
| Matrix scaling | C² (= 100) | not preserved | 0.034 ± 0.005 | 0.068 ± 0.004 | yes | Guo 2017 |
| Platt scaling (binary, OvR) | 2C (= 20) | not preserved | 0.039 ± 0.005 | 0.069 ± 0.005 | yes | Platt 1999 |
| Isotonic regression (OvR) | non-param | not preserved | 0.029 ± 0.005 | 0.066 ± 0.004 | yes | Zadrozny 2002 |
| Beta calibration (OvR) | 3C (= 30) | almost | 0.031 ± 0.005 | 0.067 ± 0.004 | yes | Kull 2017 |
| Dirichlet calibration | C(C+1) (= 110) | almost | 0.026 ± 0.005 | 0.064 ± 0.004 | yes | Kull 2019 |
| Histogram binning | non-param, M bins | not preserved | 0.045 ± 0.006 | 0.074 ± 0.005 | yes | Zadrozny 2001 |
| MC-Dropout (n=20 fwd passes) | 0 (training change) | preserved (avg) | 0.043 ± 0.006 | 0.072 ± 0.005 | NO (~20× latency) | Gal 2016 |
| Deep Ensembles (5 models) | 5× model | preserved (avg) | 0.024 ± 0.005 | 0.061 ± 0.004 | NO (5× cost) | Lakshminarayanan 2017 |
Reading. Isotonic and Dirichlet calibration give the lowest ECE, but at the cost of not preserving accuracy — the argmax can flip after calibration, which means re-evaluating every offline metric. Temperature scaling sits at the Pareto frontier: it does not flip predictions, it has a single parameter (no overfitting risk), it's <1 µs at inference, and it cuts ECE by roughly 40%. The 0.014 ECE we leave on the table vs. Dirichlet (0.040 vs 0.026) is not worth the operational complexity. Recommendation: keep temperature scaling; revisit Dirichlet only if a regulatory requirement forces ECE < 0.03.
Temperature T Sensitivity Sweep
Holding the model fixed, sweep T. Lower T = sharper distributions; higher T = flatter.
| T | ECE | NLL | Brier | Accuracy | Mean confidence on correct | Mean confidence on incorrect |
|---|---|---|---|---|---|---|
| 1.0 (no cal) | 0.067 | 0.341 | 0.083 | 92.1% | 0.91 | 0.59 |
| 1.2 | 0.054 | 0.319 | 0.078 | 92.1% | 0.88 | 0.55 |
| 1.4 | 0.045 | 0.305 | 0.073 | 92.1% | 0.85 | 0.51 |
| 1.6 (chosen — NLL min) | 0.040 | 0.301 | 0.071 | 92.1% | 0.83 | 0.48 |
| 1.8 | 0.043 | 0.305 | 0.073 | 92.1% | 0.81 | 0.46 |
| 2.0 | 0.049 | 0.314 | 0.076 | 92.1% | 0.79 | 0.44 |
Reading. ECE bottoms out at T = 1.6, coinciding with the NLL minimum (the optimization target) — this is the expected coupling. Note that accuracy is invariant to T because temperature is a monotone transform of the logits: the argmax does not change. Both mean confidences fall monotonically with T (correct: 0.91 -> 0.79, incorrect: 0.59 -> 0.44), so the correct-incorrect gap stays roughly constant; calibration does not magically improve separation, it just stops the model from screaming "I am 99% sure" when it isn't.
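The invariance claim is easy to verify mechanically: the argmax survives any positive temperature, while the top probability falls monotonically. A sketch on the worked-example logits:

```python
import math

def top_prob(logits, T):
    # Top-class probability after temperature scaling.
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    return max(exps) / sum(exps)

logits = [3.2, 2.4, 0.4, -0.7, -1.0]
temps = [1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
top_confidences = [top_prob(logits, T) for T in temps]
argmax_stable = all(
    max(range(len(logits)), key=lambda i: logits[i] / T) == 0 for T in temps
)
# top_confidences decreases monotonically; the predicted class never changes
```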
Drift Detection SLA for the Calibrator
Calibration is fragile under data shift. We monitor it explicitly.
| Signal | Threshold | Action | Latency |
|---|---|---|---|
| ECE on rolling 7-day val (delayed labels) | > 0.05 | Alert; refit T on last 14 days | within 24 h |
| ECE on rolling 7-day val | > 0.07 | Page on-call; auto-revert to last-known T | within 1 h |
| KL(p_today || p_30_days_ago) of confidence histogram | > 0.05 | Investigate distribution shift | within 24 h |
| Reliability-curve max-gap | > 0.10 | Refit T; consider per-class calibration | within 24 h |
SLA. The calibrator is refitted within 24h of drift detection; the model is not retrained as a first response. Refitting T is a 30-second job (it's a 1-D optimization on val NLL).
Confidence Intervals on Calibration Metrics
| Metric (5.5K test set) | Point estimate | 95% bootstrap CI |
|---|---|---|
| ECE (post-cal, 10 bins) | 0.0397 | [0.0336, 0.0458] |
| Brier score | 0.0712 | [0.0654, 0.0773] |
| NLL | 0.301 | [0.282, 0.322] |
| AURC (area under risk-coverage curve) | 0.0193 | [0.0167, 0.0223] |
| Coverage at risk = 0.05 (selective accuracy) | 0.84 | [0.81, 0.87] |
Failure-Mode Tree for Calibration
```mermaid
flowchart TD
    A[Calibration monitoring fires] --> B{Symptom?}
    B -- "ECE drift > 0.01 vs last good" --> C{Sustained?}
    B -- "Reliability-curve gap > 0.10 in one bin only" --> D["Per-class calibration: retune just that class"]
    B -- "Calibration good in-domain, bad on OOD slice" --> E["Add Outlier Exposure, or fit a separate calibrator on OOD-adjacent val data"]
    B -- "Coverage at risk threshold down >= 5pp" --> F["Audit the threshold: do not touch T, tighten the threshold"]
    C -- "2+ days" --> G[Refit T on last 14 days]
    C -- "1-day spike" --> H["Wait one cycle, then re-evaluate; often a daily noise artifact"]
    G --> I{Refit converges?}
    I -- "yes, ECE recovers" --> J["Hot-swap T; no model redeploy"]
    I -- "no, ECE persists" --> K["Escalate to retraining pipeline; likely real distribution shift"]
```
Research Notes — calibration. Citations: Guo 2017 (ICML); Naeini 2015 (AAAI — ECE); Kull 2017 (AISTATS — beta calibration); Kull 2019 (NeurIPS — Dirichlet); Hendrycks 2019 (ICLR — Outlier Exposure for OOD calibration); Ovadia 2019 (NeurIPS — calibration under distribution shift).
Open Problems
- OOD calibration. Temperature is fitted on in-domain val data; on OOD inputs the calibrator is systematically under-confident on actually-OOD examples (because the in-domain distribution is sharper). This inflates the false-rejection rate on legitimate edge cases. Open question: a single T for in-domain + a separate post-hoc shift for OOD-adjacent regions, jointly fitted with Outlier Exposure data.
- Class-conditional calibration. A single T cannot fix per-class miscalibration when classes have different difficulty profiles (rare classes are systematically over-confident relative to their accuracy). Vector scaling helps but adds 10 parameters. Open question: identify the regime where class-conditional calibration measurably helps a downstream cost-weighted policy, vs. just lowering ECE in a way that doesn't matter for routing.
- Confidence under perturbation. A user typo "ord3r tracking" should not flip confidence from 0.95 → 0.51. Today it sometimes does. Open question: adversarial calibration (Stutz 2020) — train the calibrator to be smooth under bounded input perturbations.
Bibliography (this file)
- Guo, C., Pleiss, G., Sun, Y., Weinberger, K. (2017). On Calibration of Modern Neural Networks. ICML. — temperature scaling.
- Naeini, M. P., Cooper, G., Hauskrecht, M. (2015). Obtaining Well-Calibrated Probabilities Using Bayesian Binning. AAAI. — ECE definition.
- Platt, J. (1999). Probabilistic Outputs for SVMs. Adv. Large Margin Classifiers. — Platt scaling.
- Zadrozny, B., Elkan, C. (2001/2002). Obtaining Calibrated Probability Estimates from Decision Trees and Naive Bayesian Classifiers. ICML / KDD. — histogram binning + isotonic regression.
- Kull, M., Silva Filho, T., Flach, P. (2017). Beta Calibration. AISTATS.
- Kull, M., Perelló-Nieto, M., Kängsepp, M., Silva Filho, T., Song, H., Flach, P. (2019). Beyond temperature scaling: Obtaining well-calibrated multiclass probabilities with Dirichlet calibration. NeurIPS.
- Gal, Y., Ghahramani, Z. (2016). Dropout as a Bayesian Approximation. ICML — MC-Dropout.
- Lakshminarayanan, B., Pritzel, A., Blundell, C. (2017). Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. NeurIPS.
- Ovadia, Y. et al. (2019). Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift. NeurIPS.
- Hendrycks, D., Mazeika, M., Dietterich, T. (2019). Deep Anomaly Detection with Outlier Exposure. ICLR. — relevant for OOD-aware calibration.
- Stutz, D., Hein, M., Schiele, B. (2020). Confidence-Calibrated Adversarial Training. ICML.
Citation count for this file: 11.