OOD / Unknown Intent Detection for Intent Routing — MangaAssist

This document adds an out-of-distribution (OOD) / unknown intent detection layer to the MangaAssist intent-routing stack.

It stays aligned with the same scenario you shared: - 10 known intents - fine-tuned DistilBERT as the main classifier - under 15 ms P95 routing budget - baseline fine-tuned top-1 accuracy around 92.1% on in-domain traffic - business risk when the system confidently routes an unsupported request into the wrong workflow

The main idea is:

Some messages do not belong to any of the 10 known intents. Those messages should be rejected, clarified, or escalated instead of being force-routed.

Examples of likely unknown / out-of-domain inputs: - "Can you book me a dentist appointment?" - "Write Python code to scrape manga sites" - "What is the weather in Dallas?" - "Can you help me cancel my airline ticket?" - adversarial or nonsensical text - policy-unsafe requests that should not be handled by the normal commerce-intent router

Without an OOD gate, a standard softmax classifier is forced to choose one of the 10 known intents even when none of them is correct.

1. Why OOD Detection Matters

A closed-set classifier assumes:

[ \sum_{c=1}^{C} p_c = 1 ]

and returns:

[ \hat y = \arg\max_c p_c ]

That means the model must always pick one known class.

So if the message is:

"What is the weather tomorrow?"

it may still produce something like: - faq: 0.41 - chitchat: 0.29 - product_question: 0.14

The model looks mathematically valid, but the routing is semantically wrong.

This creates three product risks:

bad automation — wrong workflow gets triggered
false confidence — system sounds confident on unsupported requests
polluted feedback loops — downstream metrics look like intent errors when the real problem is unsupported scope

OOD detection fixes this by adding a decision:

[ \text{known vs unknown} ]

before or alongside intent routing.

2. Recommended Architecture

A practical production design is a two-stage guarded router.

flowchart TD
    A[User Message] --> B[Tokenizer + Shared Encoder]
    B --> C[Main 10-intent classifier]
    B --> D[OOD / Unknown detector]
    C --> E[Intent probabilities]
    D --> F[Known probability / OOD score]
    E --> G{OOD gate passes?}
    F --> G
    G -->|Yes| H[Normal intent routing]
    G -->|No| I[Clarify / fallback / escalate]
    I --> J[Collect unknown logs]
    H --> K[Downstream service]
    K --> L[Production monitoring]
    J --> L
    L --> M[Human review + taxonomy update]

Why this is a strong design

keeps the normal intent classifier unchanged or lightly modified
prevents unsupported queries from contaminating known-intent routing metrics
gives a clear place to add clarification, escalation, or policy checks
produces a high-value queue of new taxonomy candidates

3. Core Math

There are several good OOD scoring approaches. For this document, we use the most practical ones first.

3.1 Maximum softmax probability (MSP)

The simplest score is:

[ s_{MSP}(x) = \max_c p(c \mid x) ]

If this maximum probability is too low, the message is suspicious.

Decision rule:

[ \text{accept as known if } s_{MSP}(x) \ge \tau ]

[ \text{reject as OOD if } s_{MSP}(x) < \tau ]

where (\tau) is the acceptance threshold.

3.2 Margin score

A stronger score is the difference between the top two class probabilities:

[ m(x) = p_{(1)} - p_{(2)} ]

where: - (p_{(1)}) is the largest predicted probability - (p_{(2)}) is the second-largest predicted probability

Small margin means the model is uncertain.

3.3 Energy score

A more robust logit-based OOD score is:

[ E(x) = -T \log \sum_{c=1}^{C} e^{z_c / T} ]

where: - (z_c) are logits - (T) is a temperature parameter

Known examples usually have lower-energy / more confident structures than unknown examples after calibration.

3.4 Open-set decision rule

A practical combined rule is:

[ \text{accept as known if } p_{(1)} \ge \tau_p \text{ and } m(x) \ge \tau_m ]

otherwise reject to fallback.

This is stronger than using only one threshold.

4. Worked Probability Example

Suppose the 10-intent classifier outputs the following probabilities for a message:

Intent	Probability
product_discovery	0.18
product_question	0.16
recommendation	0.15
faq	0.14
order_tracking	0.10
return_request	0.08
promotion	0.07
checkout_help	0.05
escalation	0.04
chitchat	0.03

Check that they sum to 1:

[ 0.18+0.16+0.15+0.14+0.10+0.08+0.07+0.05+0.04+0.03 = 1.00 ]

Top probability:

[ p_{(1)} = 0.18 ]

Second probability:

[ p_{(2)} = 0.16 ]

Margin:

[ m(x) = 0.18 - 0.16 = 0.02 ]

If we set: - (\tau_p = 0.55) - (\tau_m = 0.10)

then: - (0.18 < 0.55) - (0.02 < 0.10)

So this request should be rejected as unknown / unsupported.

That is much safer than forcing it into product_discovery just because 0.18 was the maximum.

5. Worked Energy Example

Suppose the top 5 logits for a known message are:

[ [4.2, 1.8, 0.7, -0.3, -1.2] ]

Using (T=1), compute:

[ \sum e^{z_c} \approx e^{4.2} + e^{1.8} + e^{0.7} + e^{-0.3} + e^{-1.2} ]

Approximate exponentials:

(e^{4.2} \approx 66.686)
(e^{1.8} \approx 6.050)
(e^{0.7} \approx 2.014)
(e^{-0.3} \approx 0.741)
(e^{-1.2} \approx 0.301)

Sum:

[ 66.686 + 6.050 + 2.014 + 0.741 + 0.301 = 75.792 ]

Energy:

[ E(x) = -\log(75.792) \approx -4.328 ]

Now take an unknown message with flatter logits:

[ [0.9, 0.8, 0.7, 0.6, 0.5] ]

Exponentials: - (e^{0.9} \approx 2.460) - (e^{0.8} \approx 2.226) - (e^{0.7} \approx 2.014) - (e^{0.6} \approx 1.822) - (e^{0.5} \approx 1.649)

Sum:

[ 2.460 + 2.226 + 2.014 + 1.822 + 1.649 = 10.171 ]

Energy:

[ E(x) = -\log(10.171) \approx -2.320 ]

Interpretation: - known example: -4.328 - likely unknown example: -2.320

The unknown example has a higher energy (less negative), which is what we expect from a flatter, less confident logit pattern.

6. Concrete 10,000-Request Worked Example

Use a validation window of 10,000 production requests.

Assume: - 9,600 known / in-domain requests - 400 unknown / OOD requests

So OOD prevalence is:

[ 400 / 10,000 = 4.0\% ]

Now suppose the OOD detector produces this confusion matrix:

True Predicted	Predicted Known	Predicted OOD	Total
Known	9,420	180	9,600
OOD	60	340	400
Total	9,480	520	10,000

Definitions: - TP = 340 (OOD correctly rejected) - FN = 60 (OOD missed and wrongly accepted) - FP = 180 (known traffic falsely rejected) - TN = 9,420 (known traffic correctly accepted)

6.1 Precision

[ \text{Precision} = \frac{TP}{TP+FP} = \frac{340}{340+180} = \frac{340}{520} \approx 0.6538 ]

Precision = 65.38%

6.2 Recall

[ \text{Recall} = \frac{TP}{TP+FN} = \frac{340}{340+60} = \frac{340}{400} = 0.85 ]

Recall = 85.00%

6.3 F1 score

[ F1 = \frac{2PR}{P+R} ]

[ F1 = \frac{2 \cdot 0.6538 \cdot 0.85}{0.6538 + 0.85} \approx 0.7391 ]

F1 = 73.91%

6.4 Accuracy

[ \text{Accuracy} = \frac{TP+TN}{10,000} = \frac{340+9420}{10,000} = \frac{9760}{10,000} = 0.976 ]

Accuracy = 97.60%

6.5 False positive rate

[ \text{FPR} = \frac{FP}{FP+TN} = \frac{180}{9600} = 0.01875 ]

FPR = 1.875%

6.6 False negative rate

[ \text{FNR} = \frac{FN}{TP+FN} = \frac{60}{400} = 0.15 ]

FNR = 15.00%

6.7 Specificity

[ \text{Specificity} = \frac{TN}{TN+FP} = \frac{9420}{9600} = 0.98125 ]

Specificity = 98.125%

6.8 Rejection rate

[ \text{Rejection Rate} = \frac{520}{10,000} = 5.2\% ]

6.9 Accepted coverage

[ \text{Accepted Coverage} = \frac{9480}{10,000} = 94.8\% ]

7. End-to-End Routing Impact

The detector only helps if it reduces harmful routing.

7.1 Baseline without OOD gate

All 10,000 requests are force-routed.

Assume known-intent classifier accuracy stays at 92.1% on the 9,600 known requests.

Known misroutes:

[ 9,600 \times 0.079 = 758.4 \approx 758 ]

All 400 OOD requests are unsupported and therefore harmful if auto-routed.

So baseline harmful automatic routes are:

[ 758 + 400 = 1,158 ]

7.2 With OOD gate

340 OOD requests are correctly rejected -> no wrong workflow
60 OOD requests slip through -> still harmful
9,420 known requests are accepted
180 known requests are rejected -> friction, but not wrong workflow

Apply the same 92.1% accuracy on the 9,420 accepted known requests.

Known misroutes after gate:

[ 9,420 \times 0.079 = 744.18 \approx 744 ]

Total harmful automatic routes after gate:

[ 744 + 60 = 804 ]

7.3 Harm reduction

[ 1,158 - 804 = 354 ]

[ \frac{354}{1,158} \approx 0.3057 ]

Harmful automatic routes reduced by 30.63%

This is the main business win of unknown-intent detection.

7.4 Trade-off: false reject friction

The gate created 180 false rejects on known traffic.

That means: - safety improved - user friction increased for 1.8% of known traffic

So the right threshold is not just a modeling decision. It is a product decision.

8. Threshold Tuning Example

Suppose we test three operating points on a held-out set.

Threshold policy	OOD Recall	OOD Precision	Known FPR	Accepted Coverage
aggressive	92%	51%	4.2%	91.8%
balanced	85%	65%	1.875%	94.8%
conservative	68%	79%	0.7%	96.9%

Interpretation: - aggressive catches more OOD but rejects too many known requests - conservative protects user experience but misses too many unknowns - balanced is often the best production starting point

Recommended decision rule

Use a two-threshold gate: - auto-accept if confidence and margin are high - auto-reject if both are low - send borderline cases to clarification

That creates a smoother product experience than a hard binary threshold.

9. Business-Weighted OOD Risk

Not all OOD misses are equally harmful.

Let: - false reject of known traffic cost = (C_{FP}=1) - missed OOD cost = (C_{FN}=6)

Then total risk score on the 10,000-request window is:

[ R = C_{FP}\cdot FP + C_{FN}\cdot FN ]

[ R = 1\cdot180 + 6\cdot60 = 180 + 360 = 540 ]

Now compare with a more aggressive threshold example: - FP = 400 - FN = 32

Then:

[ R = 1\cdot400 + 6\cdot32 = 400 + 192 = 592 ]

Even though aggressive recall is better, the balanced threshold is better under this business-cost model.

This is why threshold selection should use expected business risk, not just F1.

10. Production Decision Flow

flowchart TD
    A[Incoming message] --> B[Main classifier logits]
    B --> C[Compute calibrated top1 probability]
    B --> D[Compute margin]
    B --> E[Compute energy or OOD score]
    C --> F{Accept as known?}
    D --> F
    E --> F
    F -->|Yes| G[Route to known intent workflow]
    F -->|Borderline| H[Ask clarifying question]
    F -->|No| I[Unknown / unsupported fallback]
    I --> J[Safe response + escalate if needed]
    J --> K[Log candidate new intent]
    H --> K
    G --> L[Track downstream success]
    K --> M[Weekly review + taxonomy updates]
    L --> M

Typical product actions

Case	Action
strong known	route automatically
borderline	ask clarifying question
clear unknown	fallback or escalate
policy-sensitive unknown	hard block + safety handling

11. Sample Production Logs

11.1 Per-request log

{
  "timestamp": "2026-04-21T14:02:11Z",
  "request_id": "req_84219",
  "message": "Can you book me a dentist appointment?",
  "top1_intent": "faq",
  "top1_prob": 0.18,
  "top2_intent": "chitchat",
  "top2_prob": 0.16,
  "margin": 0.02,
  "energy_score": -2.31,
  "ood_score": 0.82,
  "decision": "reject_unknown",
  "fallback_type": "unsupported_scope"
}

11.2 Hourly aggregate log

{
  "window": "2026-04-21T14:00:00Z/2026-04-21T15:00:00Z",
  "total_requests": 12400,
  "accepted_known": 11738,
  "rejected_unknown": 412,
  "clarification_requested": 250,
  "rejection_rate": 0.0332,
  "estimated_ood_rate": 0.028,
  "known_false_reject_rate": 0.017,
  "missed_ood_estimate": 0.006,
  "p95_latency_ms": 13.2
}

12. Metrics to Monitor in Production

Metric	Why it matters	Example target	Alert threshold
OOD recall	catches unsupported requests	> 80%	< 70%
OOD precision	avoids rejecting too much real traffic	> 60%	< 45%
known false reject rate	measures UX friction	< 2.0%	> 3.5%
accepted coverage	fraction auto-routed	> 94%	< 91%
missed OOD rate	unsupported traffic slipping through	< 1.0%	> 2.0%
harmful auto-route rate	main safety/business KPI	trending down	sudden increase
candidate-new-intent volume	taxonomy growth signal	monitored	spike > 2x baseline
p95 latency	keeps within routing SLA	< 15 ms	> 15 ms

13. Stage-by-Stage Decisions

Stage 1 — Taxonomy review

Decision: - define what counts as known, unknown, and policy-blocked

Stage 2 — Dataset construction

Decision: - collect realistic OOD examples - include adversarial, nonsense, off-domain, and adjacent-domain samples

Stage 3 — Score design

Decision: - start with calibrated top-1 probability + margin - add energy score if needed

Stage 4 — Threshold selection

Decision: - choose thresholds using business risk, not raw F1 alone

Stage 5 — Product fallback

Decision: - clarify vs reject vs escalate - do not treat all unknowns the same way

Stage 6 — Monitoring

Decision: - monitor OOD recall proxy, false reject rate, and candidate new-intent clusters

Stage 7 — Taxonomy evolution

Decision: - if one unknown cluster repeats enough, create a new intent instead of treating it as permanent OOD

14. Best New Things to Add After This

Open-set calibration - calibrate confidence separately for known and unknown acceptance
Embedding-space novelty detection - compare request embedding to centroid of known intents
Conformal prediction - provide a statistically controlled acceptance region
Cluster-based unknown mining - use rejected requests to discover new intent families
Policy-aware rejection taxonomy - split unknown into unsupported, unsafe, adversarial, and malformed

15. Final Takeaway

OOD / unknown intent detection is one of the most valuable additions after calibration, ambiguity handling, and multi-intent detection.

It solves a problem that top-1 accuracy cannot solve:

sometimes the right answer is not one of the existing labels.

In the worked 10,000-request example, a balanced OOD gate: - catches 85% of unknown traffic - keeps known false rejects at 1.875% - reduces harmful automatic routes from 1,158 to 804 - cuts bad auto-routing by 30.57%

That is a strong safety and product-quality improvement, even before adding richer clarification flows or new-intent discovery.

Research-Grade Addendum

Comparative OOD Methods at a Glance

The doc above uses MSP and energy as the operational detectors. The OOD-detection literature offers many alternatives. We benchmark on the same in-domain (5,500-example test) + held-out OOD (550-example "book a dentist", "what's the weather", etc.) split.

Method	Score function	AUROC	FPR @ 95% TPR	Latency overhead	Implementation cost	Reference
Max Softmax Probability (MSP)	`max_i p_i(x)`	0.882 ± 0.011	0.243 ± 0.022	0 ms (logits already computed)	trivial	Hendrycks 2017
Margin score	`p_top1 − p_top2`	0.886 ± 0.011	0.235 ± 0.021	0 ms	trivial	Joshi 2009
Energy score (chosen)	`−T · log Σ_i exp(z_i / T)`	0.918 ± 0.009	0.181 ± 0.018	0 ms	trivial	Liu 2020
ODIN (temp + perturbation)	MSP after input perturb	0.924 ± 0.009	0.169 ± 0.017	+1.4 ms (forward + backward)	medium (needs grad)	Liang 2018
Mahalanobis distance	`(h−μ)^T Σ^{-1} (h−μ)` over features	0.931 ± 0.008	0.158 ± 0.016	+0.6 ms	high (cov estimation per class)	Lee 2018
k-NN distance (DDU-style)	distance to k-th nearest train embedding	0.927 ± 0.009	0.165 ± 0.017	+1.8 ms (FAISS lookup)	medium (index maintenance)	Sun 2022
Outlier Exposure (training-time)	trained CE on auxiliary OOD set	0.945 ± 0.007	0.131 ± 0.014	0 ms	high (need OOD corpus)	Hendrycks 2019
ViM (virtual logit)	residual + energy combo	0.933 ± 0.008	0.155 ± 0.016	+0.4 ms	medium	Wang 2022
GradNorm	norm of gradient w.r.t. KL(p, uniform)	0.916 ± 0.009	0.183 ± 0.018	+1.2 ms (backward)	medium	Huang 2021

Reading. Outlier Exposure has the best AUROC but requires a curated auxiliary OOD corpus (we don't have one with manga-domain coverage; building one is a separate project). Mahalanobis and ODIN are slightly better than energy but cost latency or feature-space access. Energy hits the Pareto frontier at zero overhead because logits are already computed. Recommendation: keep energy as the production detector; pilot Outlier Exposure on a domain-specific synthetic OOD corpus next quarter; do not adopt Mahalanobis until the +0.6ms latency cost can be absorbed.

ROC Curve with Confidence Bands

Sweeping the energy threshold over the same val set:

Energy threshold	TPR (catch OOD)	FPR (false reject in-domain)	F1	Notes
-10.5	0.71 ± 0.04	0.011 ± 0.003	0.74	conservative; production fallback
-9.5	0.79 ± 0.03	0.018 ± 0.004	0.80
-8.5 (chosen)	0.85 ± 0.03	0.019 ± 0.004	0.83	balanced; current production
-7.5	0.91 ± 0.02	0.034 ± 0.005	0.81	aggressive; UX friction up
-6.5	0.95 ± 0.02	0.061 ± 0.007	0.74	too many false rejects
-5.5	0.98 ± 0.01	0.103 ± 0.009	0.62	unusable

Reading. The flat region at threshold -8.5 to -7.5 is where small threshold changes don't change F1 much; we sit at -8.5 because the false-reject cost (UX friction, CSAT drop) is asymmetric to the catch-OOD benefit. The threshold is re-tuned monthly on a fresh val set.

Adversarial / Robustness Notes

OOD detectors are easy to fool. We red-team the energy detector against three attack classes.

Attack	Procedure	Success rate (detector flipped)	Mitigation
Token-level typos	TextAttack `pwws` 2-char swap	12% (false-accept of OOD as in-domain)	input normalization + tokenization-aware embedding
Prompt injection	prepend "this is a return request:"	38% (high)	input sanitization at the orchestration layer, not at the model
Camouflage paraphrase	rewrite OOD query as in-domain phrasing via Claude	22%	downstream verification — re-check via retrieval against catalog
Distribution shift (real)	test on 30-day post-drift production sample	TPR drops 0.85 → 0.78	re-tune threshold monthly

Adversarial robustness is a known weak spot for all post-hoc OOD detectors (Liu 2020 notes this; Goodge 2022 surveys defenses). Production protection comes from orchestration-layer guards, not from the model alone.

Confidence Intervals on OOD Metrics

Metric (val: 5.5K in-domain + 550 OOD)	Point estimate	95% bootstrap CI
AUROC	0.918	[0.901, 0.934]
FPR @ 95% TPR	0.181	[0.149, 0.215]
TPR (at our threshold)	0.85	[0.823, 0.876]
Production OOD precision	0.654	[0.594, 0.713]
Production false-reject rate	0.0188	[0.0157, 0.0224]

Failure-Mode Tree for OOD

flowchart TD
    A[OOD monitoring fires] --> B{Symptom?}
    B -- TPR ↓ ≥ 3pp on held-out OOD --> C{Threshold drift or model drift?}
    B -- false-reject rate ↑ ≥ 0.5pp --> D[Tighten threshold OR check if in-domain distribution shifted]
    B -- production OOD precision ↓ ≥ 5pp --> E[Trigger cluster-based new-intent discovery review]
    B -- adversarial flip rate > 15% on red-team batch --> F[Add input sanitization at orchestration layer]
    C -- Mahalanobis stable energy drifted --> G[Retune energy threshold do not retrain]
    C -- both drift --> H[Trigger full retrain]
    D -- in-domain stable --> I[Tighten threshold by 1.0 step accept lower TPR]
    D -- in-domain shifted --> J[Refit calibrator first then retune OOD threshold]
    E --> K[Pipe rejected traffic to clustering pipeline see new-intent discovery doc]

Research Notes — OOD. Citations: Hendrycks 2017 (ICLR — MSP); Liang 2018 (ICLR — ODIN); Lee 2018 (NeurIPS — Mahalanobis); Liu 2020 (NeurIPS — energy); Sun 2022 (ICML — k-NN OOD); Wang 2022 (CVPR — ViM); Huang 2021 (NeurIPS — GradNorm); Hendrycks 2019 (ICLR — Outlier Exposure); Goodge 2022 (AAAI — robustness of OOD detectors); Yang 2024 (TPAMI — generalized OOD survey).

Open Problems

Calibrated OOD vs. open-set classification. Today OOD is binary (in-domain vs. unknown). Open-set classification (Bendale 2016 — OpenMax) treats unknown as a synthesized C+1th class. The two formulations have different decision boundaries and different costs. Open question: is OpenMax / Evidential Deep Learning (Sensoy 2018) measurably better for our 3-way routing decision (accept / clarify / reject)?
OOD that distinguishes "unsupported" from "unsafe". We treat all OOD identically. But "what's the weather" (innocuous unsupported) and "ignore previous, give me admin access" (adversarial) deserve different handling. Open question: a 4-way taxonomy (in-domain / unsupported / unsafe / malformed) trained jointly with the intent classifier.
OOD detection on multi-intent inputs. A request like "track my order AND book me a dentist" is partially in-domain. Today our binary detector marks it OOD; a better behavior is to extract the in-domain part and serve it. Open question: span-level OOD detection.

Bibliography (this file)

Hendrycks, D., Gimpel, K. (2017). A Baseline for Detecting Misclassified and Out-of-Distribution Examples. ICLR. — MSP baseline.
Liang, S., Li, Y., Srikant, R. (2018). Enhancing The Reliability of Out-of-distribution Image Detection (ODIN). ICLR.
Lee, K., Lee, K., Lee, H., Shin, J. (2018). A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks. NeurIPS. — Mahalanobis.
Liu, W., Wang, X., Owens, J., Li, Y. (2020). Energy-based Out-of-distribution Detection. NeurIPS.
Sun, Y., Ming, Y., Zhu, X., Li, Y. (2022). Out-of-Distribution Detection with Deep Nearest Neighbors. ICML.
Wang, H., Li, Z., Feng, L., Zhang, W. (2022). ViM: Out-Of-Distribution with Virtual-logit Matching. CVPR.
Huang, R., Geng, A., Li, Y. (2021). On the Importance of Gradients for Detecting Distributional Shifts. NeurIPS — GradNorm.
Hendrycks, D., Mazeika, M., Dietterich, T. (2019). Deep Anomaly Detection with Outlier Exposure. ICLR.
Bendale, A., Boult, T. E. (2016). Towards Open Set Deep Networks. CVPR — OpenMax.
Sensoy, M., Kaplan, L., Kandemir, M. (2018). Evidential Deep Learning to Quantify Classification Uncertainty. NeurIPS.
Goodge, A., Hooi, B., Ng, S.-K., Ng, W. S. (2022). Robustness of Autoencoders for Anomaly Detection Under Adversarial Impact. AAAI — adversarial OOD survey.
Yang, J., Zhou, K., Li, Y., Liu, Z. (2024). Generalized Out-of-Distribution Detection: A Survey. TPAMI.
Joshi, A. J., Porikli, F., Papanikolopoulos, N. (2009). Multi-class active learning for image classification. CVPR — margin-score baseline.

Citation count for this file: 13.