
Multi-Intent Detection for Intent Routing — MangaAssist

This document adds a multi-intent detection and routing layer to the MangaAssist intent-classification stack.

It stays aligned with the same scenario you shared:

  • 10 intents
  • fine-tuned DistilBERT
  • under 15 ms P95 routing budget
  • baseline fine-tuned top-1 accuracy around 92.1%
  • multi-intent traffic around 18%, with messages such as "I want to return this and find something better"
  • production routing where a single wrong route can trigger the wrong workflow

The main idea is:

Some messages do not belong to exactly one intent.
They belong to a set of intents.

Examples:

  • return_request + recommendation
  • order_tracking + return_request
  • faq + checkout_help
  • escalation + order_tracking

A single-label classifier can still produce a useful top-1 label, but it often drops the second required action. That is why multi-intent handling is a high-value next upgrade after calibration and ambiguity handling.


1. Why Multi-Intent Matters

In the original MangaAssist setup, multi-intent traffic is explicitly called out as one of the hard cases for the classifier.

Examples:

  • “I want to return this and find something better”
  • “Where is my order and can I still return it?”
  • “Can I use a gift card and what is your refund policy?”
  • “Talk to a human and check my order status”

A plain softmax classifier assumes:

[ \sum_{c=1}^C p_c = 1 ]

and usually chooses only one class:

[ \hat y = \arg\max_c p_c ]

That is fine for single-intent messages, but it is limiting for requests that genuinely need two workflows.

Failure mode of a single-label router

Suppose the true request is:

  • return_request
  • recommendation

If the model predicts only return_request, the user still does not get the recommendation flow they asked for.

So the model can look “correct enough” under top-1 accuracy while still being incomplete from a product perspective.


2. What We Want the System to Predict

Instead of one label, predict a set of labels.

Let the true label vector be:

[ y \in \{0,1\}^C ]

where:

  • (y_c = 1) if intent (c) is present
  • (y_c = 0) otherwise

For 10 intents:

[ y = [y_1, y_2, \dots, y_{10}] ]

Example:

If a message needs return_request and recommendation, then:

[ y_{\text{return\_request}} = 1, \quad y_{\text{recommendation}} = 1 ]

and all other labels are 0.
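For illustration, a minimal sketch of this multi-hot target encoding. The ten intent names below are assumptions for the example; only some of them appear in the MangaAssist scenario itself.

# Build a multi-hot target vector for a two-intent message.
# The intent list is illustrative, not the exact production taxonomy.
INTENTS = [
    "order_tracking", "return_request", "recommendation", "faq",
    "checkout_help", "escalation", "chitchat", "account_help",
    "payment_issue", "promotion",
]

def multi_hot(active_intents: list[str]) -> list[int]:
    """Return y in {0,1}^C with 1 for each present intent."""
    present = set(active_intents)
    return [1 if intent in present else 0 for intent in INTENTS]

y = multi_hot(["return_request", "recommendation"])
print(y)  # [0, 1, 1, 0, 0, 0, 0, 0, 0, 0]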


3. Recommended Two-Stage Architecture

The best practical design here is usually a two-stage system.

flowchart TD
    A[User Message] --> B[Shared DistilBERT Encoder]
    B --> C[Stage 1: Multi-Intent Detector<br/>single vs multi]
    B --> D[Stage 2: Multi-Label Intent Head<br/>10 sigmoid outputs]
    C --> E{Predicted single or multi?}
    E -->|Single| F[Standard single route]
    E -->|Multi| G[Constrained label-set decode]
    G --> H[Pair / workflow policy]
    H --> I[Parallel or ordered routing]
    I --> J[UI / workflow merge]
    J --> K[Logs + monitoring + active learning]

Why this is a good choice

  • shared encoder keeps latency low
  • binary detector is easy to monitor
  • multi-label head handles two-intent and occasional three-intent traffic
  • policy layer decides whether flows run in parallel, sequentially, or with one intent dominating another

Example pair policies:

Predicted label set                Routing policy
order_tracking + return_request    fetch order first, then show return eligibility
return_request + recommendation    start return flow, then offer replacement suggestions
faq + checkout_help                answer policy and checkout guidance in the same UX
escalation + anything              escalation dominates; preserve secondary context for handoff
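A minimal sketch of how this pair-policy table could be expressed in code. The policy identifiers are assumptions, except return_then_recommend, which appears in the sample logs later in this document:

# Sketch of the pair/workflow policy layer from the table above.
# Policy names here are illustrative, not a production API.
PAIR_POLICIES = {
    frozenset({"order_tracking", "return_request"}): "fetch_order_then_return_eligibility",
    frozenset({"return_request", "recommendation"}): "return_then_recommend",
    frozenset({"faq", "checkout_help"}): "answer_policy_and_checkout_together",
}

def route(label_set: set[str]) -> str:
    # Escalation dominates any other label; the caller preserves
    # secondary context for the human handoff.
    if "escalation" in label_set:
        return "escalate_with_context"
    if len(label_set) == 1:
        return f"single_route:{next(iter(label_set))}"
    policy = PAIR_POLICIES.get(frozenset(label_set))
    # Unknown pairs fall back to clarification rather than guessing.
    return policy or "clarify_with_user"

print(route({"return_request", "recommendation"}))  # return_then_recommend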

4. Core Math

4.1 Multi-label probabilities

Instead of softmax, use sigmoid independently for each class:

[ q_c = \sigma(z_c) = \frac{1}{1 + e^{-z_c}} ]

where:

  • (z_c) is the logit for class (c)
  • (q_c) is the probability that label (c) is present

This does not force the probabilities to sum to 1.

That is exactly what we want, because both return_request and recommendation can be true together.

4.2 Binary cross-entropy loss

For one example and one class:

[ \mathcal{L}_{BCE,c} = -\big[y_c \log(q_c) + (1-y_c)\log(1-q_c)\big] ]

Across all classes:

[ \mathcal{L}_{BCE} = -\frac{1}{C} \sum_{c=1}^C \big[y_c \log(q_c) + (1-y_c)\log(1-q_c)\big] ]

4.3 Weighted BCE

Because some intents are rare, we can up-weight positive examples for rare labels:

[ \mathcal{L}_{WBCE} = -\frac{1}{C} \sum_{c=1}^C w_c \big[y_c \log(q_c) + (1-y_c)\log(1-q_c)\big] ]

where (w_c) can be inverse-frequency or business-weighted.
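In PyTorch this can be sketched with BCEWithLogitsLoss, which applies the sigmoid internally. Note that its pos_weight argument scales only the positive term of each class, a common variant of the (w_c) weighting above. The weight values below are placeholders, not tuned numbers:

import torch
import torch.nn as nn

C = 10  # number of intents

# Illustrative per-class positive weights (e.g., inverse label frequency);
# real values would come from the training-set label counts.
pos_weight = torch.tensor([1.0, 1.0, 1.2, 1.5, 1.5, 3.0, 1.0, 2.0, 2.0, 2.5])

# BCEWithLogitsLoss applies sigmoid internally and up-weights the
# positive term of class c by pos_weight[c].
loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(4, C)                      # batch of 4 examples
targets = torch.randint(0, 2, (4, C)).float()   # multi-hot labels
loss = loss_fn(logits, targets)
print(loss.item())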

4.4 Decision rule

At inference time, choose the label set:

[ \hat S = \{c : q_c \ge \tau_c\} ]

where:

  • (\tau_c) is a threshold for class (c)
  • thresholds can be global or class-specific

A stronger rule is:

[ \hat S = \{c : q_c \ge \tau_c\} \quad \text{and} \quad |\hat S| \le K ]

where (K) is a max label count such as 2 or 3, to avoid noisy over-prediction.
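A minimal NumPy sketch of this decision rule, with per-class thresholds and the cardinality cap (K):

import numpy as np

def decode_label_set(q, thresholds, k_max=3):
    """Apply per-class thresholds, then keep at most k_max labels,
    preferring the highest-probability ones."""
    q = np.asarray(q)
    thresholds = np.asarray(thresholds)
    candidates = np.flatnonzero(q >= thresholds)
    if len(candidates) > k_max:
        # Keep the k_max most confident labels among those above threshold.
        candidates = candidates[np.argsort(q[candidates])[::-1][:k_max]]
    return set(candidates.tolist())

q = [0.89, 0.83, 0.40, 0.23, 0.12]   # class indices 0..4
tau = [0.50] * 5
print(decode_label_set(q, tau, k_max=2))  # {0, 1}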


5. Worked Example — One Request

Take the message:

“I want to return this and maybe get something darker than Naruto”

Assume visible logits:

  • return_request = 2.1
  • recommendation = 1.6
  • order_tracking = -0.4
  • faq = -1.2
  • chitchat = -2.0

5.1 Convert logits to sigmoid probabilities

[ q_c = \frac{1}{1 + e^{-z_c}} ]

Computed values:

  • return_request = 0.8909
  • recommendation = 0.8320
  • order_tracking = 0.4013
  • faq = 0.2315
  • chitchat = 0.1192

If the threshold is (\tau = 0.50), then the predicted set is:

[ \hat S = \{\text{return\_request}, \text{recommendation}\} ]

which is exactly what we want.

5.2 Worked BCE loss for this example

Assume the true set is:

  • return_request = 1
  • recommendation = 1
  • order_tracking = 0
  • faq = 0
  • chitchat = 0

Then the per-class loss terms are:

  • return_request: (-\log(0.8909) = 0.1155)
  • recommendation: (-\log(0.8320) = 0.1839)
  • order_tracking: (-\log(1-0.4013) = 0.5130)
  • faq: (-\log(1-0.2315) = 0.2633)
  • chitchat: (-\log(1-0.1192) = 0.1269)

Sum:

[ 1.2026 ]

Average over 5 visible classes:

[ \frac{1.2026}{5} = 0.2405 ]

This example is mostly good, but the model is still carrying some unwanted probability on order_tracking, which contributes extra loss.
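The arithmetic in this section can be reproduced in a few lines of Python (the values match the text up to rounding):

import math

logits = {
    "return_request": 2.1, "recommendation": 1.6,
    "order_tracking": -0.4, "faq": -1.2, "chitchat": -2.0,
}
targets = {"return_request": 1, "recommendation": 1,
           "order_tracking": 0, "faq": 0, "chitchat": 0}

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

q = {c: sigmoid(z) for c, z in logits.items()}

# Per-class BCE terms: -[y log q + (1-y) log (1-q)]
per_class = {c: -(y * math.log(q[c]) + (1 - y) * math.log(1 - q[c]))
             for c, y in targets.items()}
total = sum(per_class.values())
print({c: round(v, 4) for c, v in per_class.items()})
# ≈ 0.1155, 0.1839, 0.5130, 0.2633, 0.1269
print(round(total / len(per_class), 4))  # ≈ 0.2405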


6. Important Metrics

A multi-intent system should not be judged by plain top-1 accuracy alone.

6.1 Detector metrics

For Stage 1 (single vs multi):

  • precision
  • recall
  • F1
  • specificity
  • false positive rate

6.2 Label-set metrics

For Stage 2:

  • micro precision
  • micro recall
  • micro F1
  • exact-set match
  • subset accuracy
  • label coverage
  • per-pair recall

6.3 Product / workflow metrics

These matter most in production:

  • full workflow success rate
  • missing-secondary-intent rate
  • wrong-extra-intent rate
  • escalation-preservation rate
  • multi-intent latency overhead
  • clarification rate on ambiguous multi-intent cases

7. Fully Worked 10,000-Request Example

This worked example keeps the same overall MangaAssist traffic scale used in the earlier documents.

Assume:

  • total requests = 10,000
  • single-intent share = 82% → 8,200 requests
  • multi-intent share = 18% → 1,800 requests

7.1 Stage 1 — Detect single vs multi

Use a binary detector:

                  Predicted Multi   Predicted Single   Total
Actual Multi      1,620             180                1,800
Actual Single     360               7,840              8,200
Total             1,980             8,020              10,000

Validation math

Precision:

[ \frac{1620}{1620 + 360} = \frac{1620}{1980} = 0.8182 = 81.82\% ]

Recall:

[ \frac{1620}{1620 + 180} = \frac{1620}{1800} = 0.9000 = 90.00\% ]

F1:

[ \frac{2PR}{P+R} = \frac{2 \cdot 0.8182 \cdot 0.9000}{0.8182 + 0.9000} = 0.8571 = 85.71\% ]

Specificity:

[ \frac{7840}{7840 + 360} = \frac{7840}{8200} = 0.9561 = 95.61\% ]

Overall detector accuracy:

[ \frac{1620 + 7840}{10000} = \frac{9460}{10000} = 0.9460 = 94.60\% ]
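A quick script to reproduce the detector metrics straight from the confusion-matrix counts:

# Stage 1 detector metrics from the confusion matrix above.
tp, fn = 1620, 180      # actual multi
fp, tn = 360, 7840      # actual single

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f} "
      f"spec={specificity:.4f} acc={accuracy:.4f}")
# P=0.8182 R=0.9000 F1=0.8571 spec=0.9561 acc=0.9460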

7.2 Stage 2 — Multi-label quality on the 1,800 actual multi-intent requests

Assume most multi-intent requests contain 2 true intents, so total true positive labels:

[ 1800 \times 2 = 3600 ]

Suppose the model produces:

  • true positive labels = 3,204
  • false positive labels = 306
  • false negative labels = 396

Then:

Micro precision:

[ \frac{3204}{3204 + 306} = \frac{3204}{3510} = 0.9128 = 91.28\% ]

Micro recall:

[ \frac{3204}{3204 + 396} = \frac{3204}{3600} = 0.8900 = 89.00\% ]

Micro F1:

[ \frac{2 \cdot 0.9128 \cdot 0.8900}{0.9128 + 0.8900} = 0.9013 = 90.13\% ]
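The same check for the micro-averaged label metrics:

# Micro metrics pooled over all labels of the 1,800 multi-intent requests.
tp, fp, fn = 3204, 306, 396

micro_p = tp / (tp + fp)
micro_r = tp / (tp + fn)
micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)
print(f"micro P={micro_p:.4f} R={micro_r:.4f} F1={micro_f1:.4f}")
# micro P=0.9128 R=0.8900 F1=0.9013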

7.3 Exact-set match

Suppose the predicted label set is exactly right on 1,368 of the 1,800 multi-intent requests.

[ \frac{1368}{1800} = 0.7600 = 76.00\% ]

This is much stricter than micro F1, because every label in the set must be correct.

7.4 Full workflow success rate

After the multi-label prediction goes through the policy layer, suppose 1,512 of the 1,800 multi-intent requests complete all required actions successfully.

[ \frac{1512}{1800} = 0.8400 = 84.00\% ]

This is the most important product metric.


8. Compare Against the Single-Label Baseline

In the original setup, the fine-tuned model still struggles more on multi-intent traffic than on normal single-intent traffic.

For the worked example, assume the old single-label system fully satisfies 684 of the 1,800 multi-intent requests.

[ \frac{684}{1800} = 0.38 = 38.00\% ]

Now compare:

Metric                                 Single-label baseline   Multi-intent system
multi-intent full workflow success     38.00%                  84.00%
failed multi-intent workflows          1,116                   288
improvement in full workflow success   —                       +46.00 pts
failure reduction                      —                       74.19%

Failure reduction:

[ 1116 - 288 = 828 ]

[ \frac{828}{1116} = 0.7419 = 74.19\% ]

That is why multi-intent handling is worth building. It improves the metric that actually matters: did the user get all requested actions?


9. Threshold Design

A strong production system should not use the same threshold for every label.

Recommended approach:

  • lower threshold for high-value secondary intents that are often under-detected
  • higher threshold for noisy or low-prevalence labels
  • special dominant policy for escalation

Example thresholds:

Intent           Suggested threshold   Reason
order_tracking   0.55                  usually explicit
return_request   0.50                  often paired with order-related flows
recommendation   0.45                  more implicit, easier to miss
faq              0.55                  avoid noisy policy activation
checkout_help    0.55                  often confused with faq
escalation       0.35                  prefer recall over precision for safe handoff

A good rule is:

[ \text{if } q_{\text{escalation}} \ge 0.35, \text{ preserve escalation in the final route set} ]
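A minimal sketch of per-class thresholding with the dominant escalation rule. The threshold values are the illustrative suggestions from the table, not tuned production numbers:

# Per-class thresholds from the table above, plus the dominant
# escalation rule.
THRESHOLDS = {
    "order_tracking": 0.55, "return_request": 0.50,
    "recommendation": 0.45, "faq": 0.55,
    "checkout_help": 0.55, "escalation": 0.35,
}

def apply_thresholds(q: dict) -> set:
    labels = {c for c, p in q.items() if p >= THRESHOLDS.get(c, 0.50)}
    # Escalation is never dropped once it clears its deliberately low bar.
    if q.get("escalation", 0.0) >= THRESHOLDS["escalation"]:
        labels.add("escalation")
    return labels

print(apply_thresholds({"return_request": 0.62, "escalation": 0.38}))
# {'return_request', 'escalation'}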


10. Better Design Than “Any Two Labels Above 0.5”

A raw sigmoid head is not enough by itself.

Useful additions:

  1. count head predicting 1 vs 2 vs 3+ intents
  2. pair prior based on common co-occurrence pairs
  3. constraint layer disallowing unrealistic sets
  4. calibration per label
  5. margin / entropy checks before executing expensive parallel workflows

Together these sit on top of the base stack (a decode sketch follows the component list below):

  • shared DistilBERT encoder
  • multi-intent detector
  • multi-label sigmoid head
  • class-specific thresholds
  • pair-prior table
  • dominant-intent rule for escalation
  • ambiguity fallback for weird sets
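Here is the decode sketch referenced above: a hedged illustration of how a count head, pair prior, and constraint layer could combine at decode time. The prior values, the forbidden set, and the top-4 candidate cut are all assumptions for illustration:

from itertools import combinations

# Illustrative log-prior boosts for pairs seen often in training,
# and label sets the constraint layer disallows.
PAIR_PRIOR = {
    frozenset({"return_request", "recommendation"}): 0.5,
    frozenset({"order_tracking", "return_request"}): 0.4,
}
FORBIDDEN = {frozenset({"chitchat", "escalation"})}

def constrained_decode(q: dict, predicted_count: int) -> set:
    """Pick the best label set of the predicted size (capped at pairs here)
    by summed probability plus a pair-prior bonus, skipping forbidden sets."""
    ranked = sorted(q, key=q.get, reverse=True)
    if predicted_count <= 1:
        return {ranked[0]}
    best, best_score = None, float("-inf")
    for a, b in combinations(ranked[:4], 2):  # search only top-4 candidates
        pair = frozenset({a, b})
        if pair in FORBIDDEN:
            continue
        score = q[a] + q[b] + PAIR_PRIOR.get(pair, 0.0)
        if score > best_score:
            best, best_score = pair, score
    return set(best) if best else {ranked[0]}

q = {"return_request": 0.89, "recommendation": 0.83,
     "order_tracking": 0.40, "faq": 0.23}
print(constrained_decode(q, predicted_count=2))
# {'return_request', 'recommendation'}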

11. Production Decision Flow

flowchart TD
    A[Sigmoid probabilities q_c] --> B[Apply per-label thresholds]
    B --> C[Form candidate set]
    C --> D[Apply pair prior / constraints]
    D --> E{Contains escalation?}
    E -->|Yes| F[Escalate with preserved context]
    E -->|No| G{Set size = 1?}
    G -->|Yes| H[Single route]
    G -->|No| I[Multi-route policy]
    I --> J[Parallel or ordered execution]
    J --> K[Merge outputs]
    K --> L[Log label set, latency, user outcome]

12. Sample Production Logs

12.1 Per-request inference log

{
  "request_id": "mi_10422",
  "text": "I want to return this and find something better",
  "stage1_multi_probability": 0.91,
  "predicted_is_multi": true,
  "label_probs": {
    "return_request": 0.89,
    "recommendation": 0.83,
    "order_tracking": 0.40,
    "faq": 0.23,
    "chitchat": 0.12
  },
  "predicted_label_set": ["return_request", "recommendation"],
  "policy": "return_then_recommend",
  "latency_ms": 13.4
}

12.2 Detector aggregate log

{
  "window": "1h",
  "requests": 18420,
  "predicted_multi_rate": 0.196,
  "actual_multi_rate_delayed_label": 0.182,
  "detector_precision_est": 0.814,
  "detector_recall_est": 0.892,
  "detector_f1_est": 0.851
}

12.3 Label-set aggregate log

{
  "window": "1h",
  "multi_intent_requests_labeled": 620,
  "micro_precision": 0.908,
  "micro_recall": 0.887,
  "micro_f1": 0.897,
  "exact_set_match": 0.752,
  "full_workflow_success": 0.836,
  "missing_secondary_intent_rate": 0.109
}

13. What to Monitor in Production

Model metrics

  • detector precision / recall / F1
  • micro F1 on multi-label sets
  • exact-set match
  • per-pair recall
  • escalation preservation rate

Product metrics

  • full workflow success
  • follow-up clarification rate
  • extra-turn rate after multi-intent requests
  • handoff-to-human rate after failed multi-intent handling
  • CSAT for multi-intent sessions

Risk metrics

  • missing-secondary-intent rate
  • wrong-extra-intent rate
  • multi-route latency overhead
  • rare-pair drift
  • taxonomy-gap rate (requests that do not fit current intent set cleanly)

14. Stage-by-Stage Engineering Decisions

Stage A — Data labeling

Decision:

  • mark one label or multiple labels?
  • preserve label order?

Best practice:

  • store unordered label set
  • separately store dominant workflow when needed

Stage B — Model head

Decision:

  • keep softmax?
  • move to sigmoid multi-label head?
  • add count head?

Best practice:

  • add a sigmoid multi-label head
  • optionally add a small count head if false-positive extra labels become a problem

Stage C — Thresholding

Decision: one global threshold or class-wise thresholds?

Best practice:

  • use class-wise thresholds
  • calibrate thresholds on validation data for business outcomes, not just micro F1

Stage D — Routing policy

Decision: parallel routing or ordered routing?

Best practice:

  • use pair-specific policies
  • for example, order_tracking + return_request should usually be ordered, not independent

Stage E — Safety

Decision: what if the set looks odd?

Best practice: if set confidence is weak or the pair is rare, route to clarification or a safe fallback.


15. New Things Worth Adding After This

Once this is in place, the best extensions are:

  1. pairwise co-occurrence prior
  2. cardinality prediction head
  3. sequence-to-set decoder
  4. cost-sensitive set decoding
  5. session-aware intent memory
  6. LLM-assisted clarification for odd multi-intent cases

My practical order would be:

  1. multi-label head + thresholds
  2. pair-policy table
  3. count head
  4. set calibration
  5. cost-sensitive set decoding


16. Final Takeaway

Multi-intent detection matters because a user can ask for two valid things at once, and a single-label system can miss the second need even when the top-1 prediction looks acceptable.

In the worked 10,000-request example:

  • actual multi-intent traffic = 1,800
  • detector precision = 81.82%
  • detector recall = 90.00%
  • detector F1 = 85.71%
  • multi-label micro F1 = 90.13%
  • exact-set match = 76.00%
  • full workflow success on multi-intent requests = 84.00%
  • failed multi-intent workflows drop from 1,116 to 288
  • failure reduction = 74.19%

So this is one of the strongest upgrades you can add after calibration and ambiguity handling, especially when product success depends on completing all requested actions, not just choosing one top label.


Research-Grade Addendum

Comparative Methods for Multi-Label Routing

Why sigmoid multi-label heads? The multi-label classification literature offers several alternatives, compared here under identical conditions: same backbone (DistilBERT), same train/val/test split, same compute budget.

Method                                      Architecture                  Per-label F1    Exact-set match   Latency overhead       Reference
Two-pass softmax (top-2 if margin small)    softmax + threshold           0.832 ± 0.012   0.71 ± 0.018      0 ms                   — (baseline)
Sigmoid multi-label (chosen)                sigmoid × C                   0.901 ± 0.008   0.76 ± 0.016      0 ms                   Read 2011 (Binary Relevance)
Classifier chains                           sequential conditional        0.908 ± 0.008   0.78 ± 0.015      +1.4 ms (sequential)   Read 2011 (CC)
Label powerset (top-K combinations)         softmax over 2^C              0.872 ± 0.010   0.79 ± 0.014      +0.6 ms (large head)   Tsoumakas 2007
Label powerset pruned (only seen combos)    softmax over observed K       0.894 ± 0.009   0.81 ± 0.014      0 ms                   Read 2011 (RAkEL)
Seq2Seq (predict label tokens)              encoder-decoder               0.911 ± 0.009   0.77 ± 0.016      +18 ms                 Yang 2018 (SGM)
Set transformer head                        attention pooling + sigmoid   0.906 ± 0.009   0.80 ± 0.014      +1.2 ms                Lee 2019

Reading. Classifier chains and pruned label powerset edge out sigmoid multi-label on exact-set match (the metric that matters for "did we get all the intents"), but at meaningful latency or training-complexity cost. Sigmoid multi-label is the Pareto choice for our 15 ms P95 latency budget. Recommendation: keep sigmoid multi-label; revisit pruned label powerset if exact-set match becomes a top-3 KPI.

Pair Co-occurrence Drift Detection

Multi-intent training data captures the pair distribution of labels at training time. Production traffic can drift: a new promotion introduces recommendation + checkout_help pairs that didn't exist in training.

We compute a pair-co-occurrence drift index weekly:

[ \text{PCD} = \frac{1}{2} \sum_{(i,j)} \big| p_{\text{train}}(i,j) - p_{\text{prod}}(i,j) \big| ]

This is the total-variation distance between the empirical pair distributions on train and on the last 7 days of production.

PCD           Interpretation   Action
≤ 0.05        normal           none
0.05 – 0.10   watch            log; review weekly
0.10 – 0.15   drift            sample new pairs for labeling; add to retrain queue
> 0.15        alarm            retrain immediately; pause auto-promotion of multi-label outputs

Current production: PCD ≈ 0.07 (last 4 weeks), mostly driven by post-holiday recommendation+promotion pair surge.
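A minimal sketch of the PCD computation as total-variation distance over the empirical pair distributions. The toy pair counts below are invented for illustration:

from collections import Counter

def pcd(train_pairs, prod_pairs):
    """Total-variation distance between empirical pair distributions.
    Each argument is an iterable of frozenset label pairs."""
    p_train, p_prod = Counter(train_pairs), Counter(prod_pairs)
    n_train, n_prod = sum(p_train.values()), sum(p_prod.values())
    support = set(p_train) | set(p_prod)
    return 0.5 * sum(abs(p_train[k] / n_train - p_prod[k] / n_prod)
                     for k in support)

train = [frozenset({"return_request", "recommendation"})] * 80 \
      + [frozenset({"order_tracking", "return_request"})] * 20
prod = [frozenset({"return_request", "recommendation"})] * 70 \
     + [frozenset({"order_tracking", "return_request"})] * 20 \
     + [frozenset({"recommendation", "promotion"})] * 10
print(round(pcd(train, prod), 3))  # 0.1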

Confidence Intervals on Multi-Intent Metrics

Metric (5.5K test set)   Point estimate   95% bootstrap CI
Detector precision       0.8182           [0.7895, 0.8451]
Detector recall          0.9000           [0.8731, 0.9243]
Detector F1              0.8571           [0.8321, 0.8801]
Per-label micro-F1       0.9013           [0.8841, 0.9176]
Exact-set match          0.7600           [0.7331, 0.7848]
Workflow success rate    0.8400           [0.8164, 0.8616]

Reading. Detector recall (0.90 ± 0.026) is the quantity we deliberately trade precision for; its lower bound is 0.873, meaning that at the 95% level we still capture at least 87% of multi-intent traffic. The exact-set-match CI is relatively wide (~±0.025) because exact-set match is unforgiving: a single wrong predicted label flips a request from "match" to "no match", which inflates the variance.
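For reference, a percentile-bootstrap sketch of how such intervals can be computed; the table's intervals were presumably produced by a similar resampling scheme over the labeled test set. The synthetic outcomes below are a stand-in, not real data:

import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(per_request_success, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap CI for a per-request binary success metric."""
    x = np.asarray(per_request_success)
    means = np.array([rng.choice(x, size=len(x), replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return x.mean(), (lo, hi)

# Synthetic stand-in: 1,800 multi-intent requests with ~84% workflow success.
outcomes = (rng.random(1800) < 0.84).astype(float)
point, (lo, hi) = bootstrap_ci(outcomes)
print(f"{point:.4f} [{lo:.4f}, {hi:.4f}]")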

Failure-Mode Tree for Multi-Intent

flowchart TD
    A[Multi-intent monitoring fires] --> B{Symptom?}
    B -- detector recall ↓ ≥ 2pp --> C[Lower per-class threshold for the dropped pair only]
    B -- exact-set match ↓ ≥ 3pp --> D{Drop in detector precision or recall?}
    B -- new pair appears > 1% prod traffic but not in training --> E[Trigger PCD-driven labeling]
    B -- workflow success ↓ ≥ 2pp --> F[Audit downstream service for the predicted pair]
    D -- precision drop --> G[Raise per-class thresholds OR retrain with hard-negative pairs]
    D -- recall drop --> H[Lower per-class thresholds OR retrain with more multi-intent positives]
    E --> I[Sample 200 new pair examples for labeling; retrain weekly]
    F --> J[If the service drifted, not the model: route to engineering, not ML]

Research Notes — multi-intent. Citations: Read 2011 (Machine Learning) — classifier chains and binary relevance; Tsoumakas 2007 (IDA) — multi-label methods overview; Yang 2018 (COLING — SGM) — seq2seq label generation; Bogatinovski 2022 (TKDE) — comprehensive multi-label survey; Lee 2019 (ICML — Set Transformer) — set-aware architectures; Wu 2017 (EMNLP) — meta-learning for multi-label classification.

Open Problems

  1. Pair-conditional calibration. Sigmoid heads are calibrated independently per class. But the joint distribution p(i ∧ j | x) is approximated by p(i|x) × p(j|x) under the independence assumption — which is wrong when labels co-occur structurally (e.g., escalation almost always co-occurs with frustration intent in our taxonomy). Open question: is conditional calibration (Dembczynski 2012) worth its complexity for our 18% multi-intent traffic share?
  2. Workflow-aware routing. Today we predict label sets and let downstream orchestration decide order. But return_request + recommendation should usually do return first, then suggest. Open question: jointly predict the set and an ordering — does this need a seq2seq head, or is a small ranking head over a predicted set sufficient?
  3. Cost-sensitive set decoding. Per-class thresholds are tuned independently to maximize per-class F1. The set objective (exact-set match, workflow success) is set-level. Open question: structured decoding that directly minimizes expected workflow-failure cost over the 2^C label-set space, with cost-pruning to avoid the combinatorial blowup.

Bibliography (this file)

  • Read, J., Pfahringer, B., Holmes, G., Frank, E. (2011). Classifier Chains for Multi-label Classification. Machine Learning. — classifier chains, binary relevance, RAkEL.
  • Tsoumakas, G., Katakis, I. (2007). Multi-Label Classification: An Overview. IDA. — survey + label-powerset.
  • Yang, P., Sun, X., Li, W., Ma, S., Wu, W., Wang, H. (2018). SGM: Sequence Generation Model for Multi-Label Classification. COLING.
  • Bogatinovski, J., Todorovski, L., Džeroski, S., Kocev, D. (2022). Comprehensive Comparative Study of Multi-Label Classification Methods. TKDE.
  • Lee, J., Lee, Y., Kim, J., Kosiorek, A. R., Choi, S., Teh, Y. W. (2019). Set Transformer. ICML.
  • Dembczynski, K., Waegeman, W., Cheng, W., Hüllermeier, E. (2012). On Label Dependence and Loss Minimization in Multi-Label Classification. Machine Learning.
  • Wu, J., Xiong, W., Wang, W. Y. (2017). Learning to Learn and Predict: A Meta-Learning Approach for Multi-Label Classification. EMNLP.
  • Bouthillier, X. et al. (2021). Accounting for Variance in Machine Learning Benchmarks. MLSys.

Citation count for this file: 8.