Multi-Intent Detection for Intent Routing — MangaAssist
This document adds a multi-intent detection and routing layer to the MangaAssist intent-classification stack.
It stays aligned with the same scenario you shared:
- 10 intents
- fine-tuned DistilBERT
- under 15 ms P95 routing budget
- baseline fine-tuned top-1 accuracy around 92.1%
- multi-intent traffic around 18%, with messages such as "I want to return this and find something better"
- production routing where a single wrong route can trigger the wrong workflow
The main idea is:
Some messages do not belong to exactly one intent.
They belong to a set of intents.
Examples:
- return_request + recommendation
- order_tracking + return_request
- faq + checkout_help
- escalation + order_tracking
A single-label classifier can still produce a useful top-1 label, but it often drops the second required action. That is why multi-intent handling is a high-value next upgrade after calibration and ambiguity handling.
1. Why Multi-Intent Matters
In the original MangaAssist setup, multi-intent traffic is explicitly called out as one of the hard cases for the classifier.
Examples:
- “I want to return this and find something better”
- “Where is my order and can I still return it?”
- “Can I use a gift card and what is your refund policy?”
- “Talk to a human and check my order status”
A plain softmax classifier assumes:
[ \sum_{c=1}^C p_c = 1 ]
and usually chooses only one class:
[ \hat y = \arg\max_c p_c ]
That is fine for single-intent messages, but it is limiting for requests that genuinely need two workflows.
Failure mode of a single-label router
Suppose the true request is:
return_request + recommendation
If the model predicts only return_request, the user still does not get the recommendation flow they asked for.
So the model can look “correct enough” under top-1 accuracy while still being incomplete from a product perspective.
2. What We Want the System to Predict
Instead of one label, predict a set of labels.
Let the true label vector be:
[ y \in \{0,1\}^C ]
where:
- (y_c = 1) if intent (c) is present
- (y_c = 0) otherwise
For 10 intents:
[ y = [y_1, y_2, \dots, y_{10}] ]
Example:
If a message needs return_request and recommendation, then:
[ y_{\text{return\_request}} = 1, \quad y_{\text{recommendation}} = 1 ]
and all other labels are 0.
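The multi-hot encoding above can be sketched in a few lines of Python. The 10-intent list here is an illustrative assumption — the document only names a subset of the taxonomy:

```python
# Illustrative 10-intent taxonomy; names beyond those in this document are assumptions.
INTENTS = [
    "order_tracking", "return_request", "recommendation", "faq",
    "checkout_help", "escalation", "chitchat", "account_help",
    "payment_issue", "shipping_info",
]

def to_multi_hot(active):
    """Return y in {0,1}^C with y_c = 1 for each intent present in the message."""
    return [1 if name in active else 0 for name in INTENTS]

y = to_multi_hot({"return_request", "recommendation"})
```

A message needing return_request and recommendation yields a vector with exactly two ones and eight zeros.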
3. Recommended Architecture
The best practical design here is usually a two-stage system.
flowchart TD
A[User Message] --> B[Shared DistilBERT Encoder]
B --> C[Stage 1: Multi-Intent Detector<br/>single vs multi]
B --> D[Stage 2: Multi-Label Intent Head<br/>10 sigmoid outputs]
C --> E{Predicted single or multi?}
E -->|Single| F[Standard single route]
E -->|Multi| G[Constrained label-set decode]
G --> H[Pair / workflow policy]
H --> I[Parallel or ordered routing]
I --> J[UI / workflow merge]
J --> K[Logs + monitoring + active learning]
Why this is a good choice
- shared encoder keeps latency low
- binary detector is easy to monitor
- multi-label head handles two-intent and occasional three-intent traffic
- policy layer decides whether flows run in parallel, sequentially, or with one intent dominating another
Recommended routing policy examples
| Predicted label set | Routing policy |
|---|---|
| order_tracking + return_request | fetch order first, then show return eligibility |
| return_request + recommendation | start return flow, then offer replacement suggestions |
| faq + checkout_help | answer policy and checkout guidance in same UX |
| escalation + anything | escalation dominates; preserve secondary context for handoff |
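The routing-policy table can be implemented as a small lookup keyed by frozensets, with escalation handled before the table is consulted. The policy names below are hypothetical placeholders, not a real MangaAssist API:

```python
# Pair-policy table; policy identifiers are illustrative assumptions.
PAIR_POLICIES = {
    frozenset({"order_tracking", "return_request"}): "fetch_order_then_return_eligibility",
    frozenset({"return_request", "recommendation"}): "return_then_recommend",
    frozenset({"faq", "checkout_help"}): "answer_policy_and_checkout_together",
}

def route(label_set):
    # Escalation dominates any pairing; the rest of the set is kept as handoff context.
    if "escalation" in label_set:
        return "escalate_with_context"
    if len(label_set) == 1:
        return "single:" + next(iter(label_set))
    # Unknown pairs fall through to a safe clarification flow.
    return PAIR_POLICIES.get(frozenset(label_set), "clarify_fallback")
```

Keying on frozensets makes the lookup order-independent, which matches the unordered label-set representation recommended later in this document.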
4. Core Math
4.1 Multi-label probabilities
Instead of softmax, use sigmoid independently for each class:
[ q_c = \sigma(z_c) = \frac{1}{1 + e^{-z_c}} ]
where: - (z_c) is the logit for class (c) - (q_c) is the probability that label (c) is present
This does not force the probabilities to sum to 1.
That is exactly what we want, because both return_request and recommendation can be true together.
4.2 Binary cross-entropy loss
For one example and one class:
[ \mathcal{L}_{BCE,c} = -\big[y_c \log(q_c) + (1-y_c)\log(1-q_c)\big] ]
Across all classes:
[ \mathcal{L}_{BCE} = \frac{1}{C} \sum_{c=1}^C -\big[y_c \log(q_c) + (1-y_c)\log(1-q_c)\big] ]
4.3 Weighted BCE
Because some intents are rare, we can up-weight positive examples for rare labels:
[ \mathcal{L}_{WBCE} = -\frac{1}{C} \sum_{c=1}^C w_c \big[y_c \log(q_c) + (1-y_c)\log(1-q_c)\big] ]
where (w_c) can be inverse-frequency or business-weighted.
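A minimal plain-Python sketch of weighted BCE, with a simple inverse-frequency weighting scheme as one possible choice for (w_c). Real training code would use a framework's vectorized loss; this is only to make the formula concrete:

```python
import math

def weighted_bce(y, q, w):
    """Mean weighted binary cross-entropy over C classes."""
    total = 0.0
    for y_c, q_c, w_c in zip(y, q, w):
        total += -w_c * (y_c * math.log(q_c) + (1 - y_c) * math.log(1 - q_c))
    return total / len(y)

def inverse_freq_weights(pos_counts, total):
    """One possible w_c: rarer positive labels get larger weights (+1 avoids div-by-zero)."""
    return [total / (c + 1) for c in pos_counts]
```

With uniform weights (w_c = 1) this reduces to the plain BCE of Section 4.2.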
4.4 Decision rule
At inference time, choose the label set:
[ \hat S = \{c : q_c \ge \tau_c\} ]
where: - (\tau_c) is a threshold for class (c) - thresholds can be global or class-specific
A stronger rule is:
[ \hat S = \{c : q_c \ge \tau_c\} \quad \text{and} \quad |\hat S| \le K ]
where (K) is a max label count such as 2 or 3, to avoid noisy over-prediction.
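The decision rule above can be sketched directly: threshold each sigmoid probability, then keep at most K labels ranked by probability. The global threshold and the K cap are both parameters here:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def decode_label_set(logits, tau=0.5, max_labels=2):
    """Keep labels with sigmoid(z_c) >= tau, capped at the top max_labels by probability."""
    probs = {c: sigmoid(z) for c, z in logits.items()}
    above = sorted((c for c, q in probs.items() if q >= tau),
                   key=lambda c: probs[c], reverse=True)
    return set(above[:max_labels])
```

On the worked logits from Section 5, this returns exactly the return_request + recommendation set.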
5. Worked Example — One Request
Take the message:
“I want to return this and maybe get something darker than Naruto”
Assume visible logits:
- return_request = 2.1
- recommendation = 1.6
- order_tracking = -0.4
- faq = -1.2
- chitchat = -2.0
5.1 Convert logits to sigmoid probabilities
[ q_c = \frac{1}{1 + e^{-z_c}} ]
Computed values:
- return_request = 0.8909
- recommendation = 0.8320
- order_tracking = 0.4013
- faq = 0.2315
- chitchat = 0.1192
If the threshold is (\tau = 0.50), then the predicted set is:
[ \hat S = \{\text{return\_request}, \text{recommendation}\} ]
which is exactly what we want.
5.2 Worked BCE loss for this example
Assume the true set is:
- return_request = 1
- recommendation = 1
- order_tracking = 0
- faq = 0
- chitchat = 0
Then the per-class loss terms are:
- return_request: (-\log(0.8909) = 0.1155)
- recommendation: (-\log(0.8320) = 0.1839)
- order_tracking: (-\log(1-0.4013) = 0.5130)
- faq: (-\log(1-0.2315) = 0.2633)
- chitchat: (-\log(1-0.1192) = 0.1269)
Sum:
[ 1.2026 ]
Average over 5 visible classes:
[ \frac{1.2026}{5} = 0.2405 ]
This example is mostly good, but the model is still carrying some unwanted probability on order_tracking, which contributes extra loss.
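The worked numbers above can be reproduced in a few lines of Python (values match to rounding):

```python
import math

logits = {"return_request": 2.1, "recommendation": 1.6,
          "order_tracking": -0.4, "faq": -1.2, "chitchat": -2.0}
truth = {"return_request": 1, "recommendation": 1,
         "order_tracking": 0, "faq": 0, "chitchat": 0}

# Sigmoid probabilities per class.
q = {c: 1.0 / (1.0 + math.exp(-z)) for c, z in logits.items()}

# Per-class BCE terms, then the mean over the 5 visible classes.
per_class = {c: -(truth[c] * math.log(q[c]) + (1 - truth[c]) * math.log(1 - q[c]))
             for c in logits}
mean_bce = sum(per_class.values()) / len(per_class)
```

Running this recovers the 0.8909 / 0.8320 probabilities, the 1.2026 summed loss, and the 0.2405 mean.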
6. Important Metrics
A multi-intent system should not be judged by plain top-1 accuracy alone.
6.1 Detector metrics
For Stage 1 (single vs multi):
- precision
- recall
- F1
- specificity
- false positive rate
6.2 Label-set metrics
For Stage 2:
- micro precision
- micro recall
- micro F1
- exact-set match
- subset accuracy
- label coverage
- per-pair recall
6.3 Product / workflow metrics
These matter most in production:
- full workflow success rate
- missing-secondary-intent rate
- wrong-extra-intent rate
- escalation-preservation rate
- multi-intent latency overhead
- clarification rate on ambiguous multi-intent cases
7. Fully Worked 10,000-Request Example
This worked example keeps the same overall MangaAssist traffic scale used in the earlier documents.
Assume:
- total requests = 10,000
- single-intent share = 82% → 8,200
- multi-intent share = 18% → 1,800
7.1 Stage 1 — Detect single vs multi
Use a binary detector:
| Actual \ Predicted | Predicted Multi | Predicted Single | Total |
|---|---|---|---|
| Actual Multi | 1,620 | 180 | 1,800 |
| Actual Single | 360 | 7,840 | 8,200 |
| Total | 1,980 | 8,020 | 10,000 |
Validation math
Precision:
[ \frac{1620}{1620 + 360} = \frac{1620}{1980} = 0.8182 = 81.82\% ]
Recall:
[ \frac{1620}{1620 + 180} = \frac{1620}{1800} = 0.9000 = 90.00\% ]
F1:
[ \frac{2PR}{P+R} = \frac{2 \cdot 0.8182 \cdot 0.9000}{0.8182 + 0.9000} = 0.8571 = 85.71\% ]
Specificity:
[ \frac{7840}{7840 + 360} = \frac{7840}{8200} = 0.9561 = 95.61\% ]
Overall detector accuracy:
[ \frac{1620 + 7840}{10000} = \frac{9460}{10000} = 0.9460 = 94.60\% ]
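All five detector numbers follow mechanically from the four confusion-matrix cells; a small helper makes the arithmetic checkable:

```python
def detector_metrics(tp, fn, fp, tn):
    """Precision, recall, F1, specificity, and accuracy from a binary confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return precision, recall, f1, specificity, accuracy

# Cells from the Stage 1 table: multi is the positive class.
m = detector_metrics(tp=1620, fn=180, fp=360, tn=7840)
```

This reproduces the 81.82% / 90.00% / 85.71% / 95.61% / 94.60% figures above.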
7.2 Stage 2 — Multi-label quality on the 1,800 actual multi-intent requests
Assume most multi-intent requests contain 2 true intents, so total true positive labels:
[ 1800 \times 2 = 3600 ]
Suppose the model produces:
- true positive labels = 3,204
- false positive labels = 306
- false negative labels = 396
Then:
Micro precision:
[ \frac{3204}{3204 + 306} = \frac{3204}{3510} = 0.9128 = 91.28\% ]
Micro recall:
[ \frac{3204}{3204 + 396} = \frac{3204}{3600} = 0.8900 = 89.00\% ]
Micro F1:
[ \frac{2 \cdot 0.9128 \cdot 0.8900}{0.9128 + 0.8900} = 0.9013 = 90.13\% ]
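The micro-averaged numbers can be verified the same way from the aggregate TP/FP/FN label counts:

```python
def micro_prf(tp, fp, fn):
    """Micro precision, recall, and F1 from pooled label-level counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1

p, r, f1 = micro_prf(tp=3204, fp=306, fn=396)
```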
7.3 Exact-set match
Suppose the predicted label set is exactly right on 1,368 of the 1,800 multi-intent requests.
[ \frac{1368}{1800} = 0.7600 = 76.00\% ]
This is much stricter than micro F1, because every label in the set must be correct.
7.4 Full workflow success rate
After the multi-label prediction goes through the policy layer, suppose 1,512 of the 1,800 multi-intent requests complete all required actions successfully.
[ \frac{1512}{1800} = 0.8400 = 84.00\% ]
This is the most important product metric.
8. Compare Against the Single-Label Baseline
In the original setup, the fine-tuned model still struggles more on multi-intent traffic than on normal single-intent traffic.
For the worked example, assume the old single-label system fully satisfies 684 of the 1,800 multi-intent requests.
[ \frac{684}{1800} = 0.38 = 38.00\% ]
Now compare:
| Metric | Single-label baseline | Multi-intent system |
|---|---|---|
| multi-intent full workflow success | 38.00% | 84.00% |
| failed multi-intent workflows | 1,116 | 288 |
| improvement in full workflow success | — | +46.00 pts |
| failure reduction | — | 74.19% |
Failure reduction:
[ 1116 - 288 = 828 ]
[ \frac{828}{1116} = 0.7419 = 74.19\% ]
That is why multi-intent handling is worth building. It improves the metric that actually matters: did the user get all requested actions?
9. Threshold Design
A strong production system should not use the same threshold for every label.
Recommended approach:
- lower threshold for high-value secondary intents that are often under-detected
- higher threshold for noisy or low-prevalence labels
- special dominant policy for escalation
Example thresholds:
| Intent | Suggested threshold | Reason |
|---|---|---|
| order_tracking | 0.55 | usually explicit |
| return_request | 0.50 | often paired with order-related flows |
| recommendation | 0.45 | more implicit, easier to miss |
| faq | 0.55 | avoid noisy policy activation |
| checkout_help | 0.55 | often confused with faq |
| escalation | 0.35 | prefer recall over precision for safe handoff |
A good rule is:
[ \text{if } q_{\text{escalation}} \ge 0.35, \text{ preserve escalation in the final route set} ]
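Combining the class-specific thresholds with the escalation rule gives a decode step like the following sketch. The threshold values are the suggestions from the table above; the 0.5 default for unlisted intents is an assumption:

```python
# Class-specific thresholds from the table above (values are suggestions, not tuned).
THRESHOLDS = {
    "order_tracking": 0.55, "return_request": 0.50, "recommendation": 0.45,
    "faq": 0.55, "checkout_help": 0.55, "escalation": 0.35,
}

def decode_with_class_thresholds(probs):
    """Per-class thresholding, with escalation preserved whenever q_escalation >= 0.35."""
    chosen = {c for c, q in probs.items() if q >= THRESHOLDS.get(c, 0.5)}
    if probs.get("escalation", 0.0) >= THRESHOLDS["escalation"]:
        chosen.add("escalation")
    return chosen
```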
10. Better Design Than “Any Two Labels Above 0.5”
A raw sigmoid head is not enough by itself.
Useful additions:
1. count head predicting 1 vs 2 vs 3+ intents
2. pair prior based on common co-occurrence pairs
3. constraint layer disallowing unrealistic sets
4. calibration per label
5. margin / entropy checks before executing expensive parallel workflows
Recommended practical version
- shared DistilBERT encoder
- multi-intent detector
- multi-label sigmoid head
- class-specific thresholds
- pair-prior table
- dominant-intent rule for escalation
- ambiguity fallback for weird sets
11. Production Decision Flow
flowchart TD
A[Sigmoid probabilities q_c] --> B[Apply per-label thresholds]
B --> C[Form candidate set]
C --> D[Apply pair prior / constraints]
D --> E{Contains escalation?}
E -->|Yes| F[Escalate with preserved context]
E -->|No| G{Set size = 1?}
G -->|Yes| H[Single route]
G -->|No| I[Multi-route policy]
I --> J[Parallel or ordered execution]
J --> K[Merge outputs]
K --> L[Log label set, latency, user outcome]
12. Sample Production Logs
12.1 Per-request inference log
{
"request_id": "mi_10422",
"text": "I want to return this and find something better",
"stage1_multi_probability": 0.91,
"predicted_is_multi": true,
"label_probs": {
"return_request": 0.89,
"recommendation": 0.83,
"order_tracking": 0.40,
"faq": 0.23,
"chitchat": 0.12
},
"predicted_label_set": ["return_request", "recommendation"],
"policy": "return_then_recommend",
"latency_ms": 13.4
}
12.2 Detector aggregate log
{
"window": "1h",
"requests": 18420,
"predicted_multi_rate": 0.196,
"actual_multi_rate_delayed_label": 0.182,
"detector_precision_est": 0.814,
"detector_recall_est": 0.892,
"detector_f1_est": 0.851
}
12.3 Label-set aggregate log
{
"window": "1h",
"multi_intent_requests_labeled": 620,
"micro_precision": 0.908,
"micro_recall": 0.887,
"micro_f1": 0.897,
"exact_set_match": 0.752,
"full_workflow_success": 0.836,
"missing_secondary_intent_rate": 0.109
}
13. What to Monitor in Production
Model metrics
- detector precision / recall / F1
- micro F1 on multi-label sets
- exact-set match
- per-pair recall
- escalation preservation rate
Product metrics
- full workflow success
- follow-up clarification rate
- extra-turn rate after multi-intent requests
- handoff-to-human rate after failed multi-intent handling
- CSAT for multi-intent sessions
Risk metrics
- missing-secondary-intent rate
- wrong-extra-intent rate
- multi-route latency overhead
- rare-pair drift
- taxonomy-gap rate (requests that do not fit current intent set cleanly)
14. Stage-by-Stage Engineering Decisions
Stage A — Data labeling
Decision:
- mark one label or multiple labels?
- preserve label order?

Best practice:
- store unordered label set
- separately store dominant workflow when needed
Stage B — Model head
Decision:
- keep softmax?
- move to sigmoid multi-label head?
- add count head?

Best practice:
- add a sigmoid multi-label head
- optionally add a small count head if false-positive extra labels become a problem
Stage C — Thresholding
Decision:
- one global threshold or class-wise thresholds?

Best practice:
- use class-wise thresholds
- calibrate thresholds on validation data for business outcomes, not just micro F1
Stage D — Routing policy
Decision:
- parallel routing or ordered routing?
Best practice:
- use pair-specific policies
- for example, order_tracking + return_request should usually be ordered, not independent
Stage E — Safety
Decision:
- what if the set looks odd?

Best practice:
- if set confidence is weak or the pair is rare, route to clarification or a safe fallback
15. New Things Worth Adding After This
Once this is in place, the best extensions are:
- pairwise co-occurrence prior
- cardinality prediction head
- sequence-to-set decoder
- cost-sensitive set decoding
- session-aware intent memory
- LLM-assisted clarification for odd multi-intent cases
My practical order would be:
1. multi-label head + thresholds
2. pair-policy table
3. count head
4. set calibration
5. cost-sensitive set decoding
16. Final Takeaway
Multi-intent detection matters because a user can ask for two valid things at once, and a single-label system can miss the second need even when the top-1 prediction looks acceptable.
In the worked 10,000-request example:
- actual multi-intent traffic = 1,800
- detector precision = 81.82%
- detector recall = 90.00%
- detector F1 = 85.71%
- multi-label micro F1 = 90.13%
- exact-set match = 76.00%
- full workflow success on multi-intent requests = 84.00%
- failed multi-intent workflows drop from 1,116 to 288
- failure reduction = 74.19%
So this is one of the strongest upgrades you can add after calibration and ambiguity handling, especially when product success depends on completing all requested actions, not just choosing one top label.
Research-Grade Addendum
Comparative Methods for Multi-Label Routing
Why sigmoid multi-label heads? The multi-label classification literature offers several alternatives; the comparison below holds the backbone (DistilBERT), the train/val/test split, and the compute budget fixed across methods.
| Method | Architecture | Per-label F1 | Exact-set match | Latency overhead | Reference |
|---|---|---|---|---|---|
| Two-pass softmax (top-2 if margin small) | softmax + threshold | 0.832 ± 0.012 | 0.71 ± 0.018 | 0 ms | — (baseline) |
| Sigmoid multi-label (chosen) | sigmoid × C | 0.901 ± 0.008 | 0.76 ± 0.016 | 0 ms | Read 2011 (Binary Relevance) |
| Classifier chains | sequential conditional | 0.908 ± 0.008 | 0.78 ± 0.015 | +1.4 ms (sequential) | Read 2011 (CC) |
| Label powerset (top-K combinations) | softmax over 2^C | 0.872 ± 0.010 | 0.79 ± 0.014 | +0.6 ms (large head) | Tsoumakas 2007 |
| Label powerset pruned (only seen combos) | softmax over observed K | 0.894 ± 0.009 | 0.81 ± 0.014 | 0 ms | Read 2011 (RAkEL) |
| Seq2Seq (predict label tokens) | encoder-decoder | 0.911 ± 0.009 | 0.77 ± 0.016 | +18 ms | Yang 2018 (SGM) |
| Set transformer head | attention pooling + sigmoid | 0.906 ± 0.009 | 0.80 ± 0.014 | +1.2 ms | Lee 2019 |
Reading. Classifier chains and pruned label powerset edge out sigmoid multi-label on exact-set match (the metric that matters for "did we get all the intents"), but at meaningful latency or training-complexity cost. Sigmoid multi-label is the Pareto choice for our 15 ms P95 latency budget. Recommendation: keep sigmoid multi-label; revisit pruned label powerset if exact-set match becomes a top-3 KPI.
Pair Co-occurrence Drift Detection
Multi-intent training data captures the pair distribution of labels at training time. Production traffic can drift: a new promotion introduces recommendation + checkout_help pairs that didn't exist in training.
We compute a pair-co-occurrence drift index weekly:
[ \text{PCD} = \frac{1}{2} \sum_{(i,j)} \big| p_{\text{train}}(i,j) - p_{\text{prod}}(i,j) \big| ]
This is the total-variation distance between the empirical pair distributions on train and on the last 7 days of production.
| PCD | Interpretation | Action |
|---|---|---|
| ≤ 0.05 | normal | none |
| 0.05 - 0.10 | watch | log; review weekly |
| 0.10 - 0.15 | drift | sample new pairs for labeling; add to retrain queue |
| > 0.15 | alarm | retrain immediately; pause auto-promotion of multi-label outputs |
Current production: PCD ≈ 0.07 (last 4 weeks), mostly driven by post-holiday recommendation+promotion pair surge.
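The PCD index is a plain total-variation distance and takes only a few lines to compute. The pair distributions are assumed here to be dicts keyed by intent pairs, with values summing to 1 within each distribution:

```python
def pair_cooccurrence_drift(p_train, p_prod):
    """Total-variation distance between two empirical pair distributions.

    Pairs missing from one distribution are treated as probability 0 there.
    """
    pairs = set(p_train) | set(p_prod)
    return 0.5 * sum(abs(p_train.get(k, 0.0) - p_prod.get(k, 0.0)) for k in pairs)
```

The union over keys matters: a brand-new pair in production (probability 0 at training time) contributes its full production mass to the drift.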
Confidence Intervals on Multi-Intent Metrics
| Metric (5.5K test set) | Point estimate | 95% bootstrap CI |
|---|---|---|
| Detector precision | 0.8182 | [0.7895, 0.8451] |
| Detector recall | 0.9000 | [0.8731, 0.9243] |
| Detector F1 | 0.8571 | [0.8321, 0.8801] |
| Per-label micro-F1 | 0.9013 | [0.8841, 0.9176] |
| Exact-set match | 0.7600 | [0.7331, 0.7848] |
| Workflow success rate | 0.8400 | [0.8164, 0.8616] |
Reading. Detector recall (0.90 ± 0.026) is what we trade for; the lower bound is 0.873, meaning at the 95% level we still capture ≥ 87% of multi-intent traffic. The exact-set-match CI width (~±0.025) is large because exact-set is unforgiving — a single wrong predicted label flips a request from "match" to "no match", inflating variance.
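The percentile-bootstrap intervals above can be reproduced with a resampling helper like this sketch. Exact endpoints depend on the resampling seed and the underlying per-request outcomes, which are not included in this document:

```python
import random

def bootstrap_ci(successes, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a success rate over 0/1 per-request outcomes."""
    rng = random.Random(seed)
    n = len(successes)
    # Resample with replacement n_boot times and collect the success rate each time.
    rates = sorted(sum(rng.choices(successes, k=n)) / n for _ in range(n_boot))
    lo = rates[int((alpha / 2) * n_boot)]
    hi = rates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

The same helper applies to any of the 0/1-decomposable metrics in the table (detector hits, exact-set matches, workflow successes); set-level metrics like micro-F1 need resampling at the request level before recomputing the metric.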
Failure-Mode Tree for Multi-Intent
flowchart TD
A[Multi-intent monitoring fires] --> B{Symptom?}
B -- detector recall ↓ ≥ 2pp --> C[Lower per-class threshold for the dropped pair only]
B -- exact-set match ↓ ≥ 3pp --> D{Drop in detector precision or recall?}
B -- new pair appears > 1% prod traffic but not in training --> E[Trigger PCD-driven labeling]
B -- workflow success ↓ ≥ 2pp --> F[Audit downstream service for the predicted pair]
D -- precision drop --> G[Raise per-class thresholds OR retrain with hard-negative pairs]
D -- recall drop --> H[Lower per-class thresholds OR retrain with more multi-intent positives]
E --> I[Sample 200 new pair examples for labeling; retrain weekly]
F --> J[If service drift not model: route to engineering not ML]
Research Notes — multi-intent. Citations: Read 2011 (Machine Learning) — classifier chains and binary relevance; Tsoumakas 2007 (IDA) — multi-label methods overview; Yang 2018 (COLING — SGM) — seq2seq label generation; Bogatinovski 2022 (TKDE) — comprehensive multi-label survey; Lee 2019 (ICML — Set Transformer) — set-aware architectures; Wu 2017 (KDD) — multi-label deep learning.
Open Problems
- Pair-conditional calibration. Sigmoid heads are calibrated independently per class. But the joint distribution p(i ∧ j | x) is approximated by p(i|x) × p(j|x) under the independence assumption — which is wrong when labels co-occur structurally (e.g., escalation almost always co-occurs with frustration intent in our taxonomy). Open question: is conditional calibration (Dembczynski 2012) worth its complexity for our 18% multi-intent traffic share?
- Workflow-aware routing. Today we predict label sets and let downstream orchestration decide order. But return_request + recommendation should usually do return first, then suggest. Open question: jointly predict the set and an ordering — does this need a seq2seq head, or is a small ranking head over a predicted set sufficient?
- Cost-sensitive set decoding. Per-class thresholds are tuned independently to maximize per-class F1. The set objective (exact-set match, workflow success) is set-level. Open question: structured decoding that directly minimizes expected workflow-failure cost over the 2^C label-set space, with cost-pruning to avoid the combinatorial blowup.
Bibliography (this file)
- Read, J., Pfahringer, B., Holmes, G., Frank, E. (2011). Classifier Chains for Multi-label Classification. Machine Learning. — classifier chains, binary relevance, RAkEL.
- Tsoumakas, G., Katakis, I. (2007). Multi-Label Classification: An Overview. IDA. — survey + label-powerset.
- Yang, P., Sun, X., Li, W., Ma, S., Wu, W., Wang, H. (2018). SGM: Sequence Generation Model for Multi-Label Classification. COLING.
- Bogatinovski, J., Todorovski, L., Džeroski, S., Kocev, D. (2022). Comprehensive Comparative Study of Multi-Label Classification Methods. TKDE.
- Lee, J., Lee, Y., Kim, J., Kosiorek, A. R., Choi, S., Teh, Y. W. (2019). Set Transformer. ICML.
- Dembczynski, K., Waegeman, W., Cheng, W., Hüllermeier, E. (2012). On Label Dependence and Loss Minimization in Multi-Label Classification. Machine Learning.
- Wu, J., Xiong, W., Wang, W. Y. (2017). Learning to Learn and Predict: A Meta-Learning Approach for Multi-Label Classification. EMNLP.
- Bouthillier, X. et al. (2021). Accounting for Variance in Machine Learning Benchmarks. MLSys.
Citation count for this file: 8.