
Multi-Intent Detection for Intent Routing — MangaAssist

This document adds a multi-intent detection and routing layer to the MangaAssist intent-classification stack.

It stays aligned with the same scenario you shared:

  • 10 intents
  • fine-tuned DistilBERT
  • under 15 ms P95 routing budget
  • baseline fine-tuned top-1 accuracy around 92.1%
  • multi-intent traffic around 18%, with messages such as "I want to return this and find something better"
  • production routing where a single wrong route can trigger the wrong workflow

The main idea is:

Some messages do not belong to exactly one intent.
They belong to a set of intents.

Examples:

  • return_request + recommendation
  • order_tracking + return_request
  • faq + checkout_help
  • escalation + order_tracking

A single-label classifier can still produce a useful top-1 label, but it often drops the second required action. That is why multi-intent handling is a high-value next upgrade after calibration and ambiguity handling.


1. Why Multi-Intent Matters

In the original MangaAssist setup, multi-intent traffic is explicitly called out as one of the hard cases for the classifier.

Examples:

  • “I want to return this and find something better”
  • “Where is my order and can I still return it?”
  • “Can I use a gift card and what is your refund policy?”
  • “Talk to a human and check my order status”

A plain softmax classifier assumes:

[ \sum_{c=1}^C p_c = 1 ]

and usually chooses only one class:

[ \hat y = \arg\max_c p_c ]

That is fine for single-intent messages, but it is limiting for requests that genuinely need two workflows.

Failure mode of a single-label router

Suppose the true request is:

  • return_request
  • recommendation

If the model predicts only return_request, the user still does not get the recommendation flow they asked for.

So the model can look “correct enough” under top-1 accuracy while still being incomplete from a product perspective.


2. What We Want the System to Predict

Instead of one label, predict a set of labels.

Let the true label vector be:

[ y \in \{0,1\}^C ]

where:

  • (y_c = 1) if intent (c) is present
  • (y_c = 0) otherwise

For 10 intents:

[ y = [y_1, y_2, \dots, y_{10}] ]

Example:

If a message needs return_request and recommendation, then:

[ y_{\text{return\_request}} = 1, \quad y_{\text{recommendation}} = 1 ]

and all other labels are 0.
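For illustration, a minimal sketch of this multi-hot target encoding. The ten intent names below are assumptions for the example; only some of them appear in the MangaAssist scenario itself.

# Build a multi-hot target vector for a two-intent message.
# The intent list is illustrative, not the exact production taxonomy.
INTENTS = [
    "order_tracking", "return_request", "recommendation", "faq",
    "checkout_help", "escalation", "chitchat", "account_help",
    "payment_issue", "promotion",
]

def multi_hot(active_intents: list[str]) -> list[int]:
    """Return y in {0,1}^C with 1 for each present intent."""
    present = set(active_intents)
    return [1 if intent in present else 0 for intent in INTENTS]

y = multi_hot(["return_request", "recommendation"])
print(y)  # [0, 1, 1, 0, 0, 0, 0, 0, 0, 0]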


3. Recommended Two-Stage Architecture

The best practical design here is usually a two-stage system.

flowchart TD
    A[User Message] --> B[Shared DistilBERT Encoder]
    B --> C[Stage 1: Multi-Intent Detector<br/>single vs multi]
    B --> D[Stage 2: Multi-Label Intent Head<br/>10 sigmoid outputs]
    C --> E{Predicted single or multi?}
    E -->|Single| F[Standard single route]
    E -->|Multi| G[Constrained label-set decode]
    G --> H[Pair / workflow policy]
    H --> I[Parallel or ordered routing]
    I --> J[UI / workflow merge]
    J --> K[Logs + monitoring + active learning]

Why this is a good choice

  • shared encoder keeps latency low
  • binary detector is easy to monitor
  • multi-label head handles two-intent and occasional three-intent traffic
  • policy layer decides whether flows run in parallel, sequentially, or with one intent dominating another

Example pair policies:

Predicted label set                Routing policy
order_tracking + return_request    fetch order first, then show return eligibility
return_request + recommendation    start return flow, then offer replacement suggestions
faq + checkout_help                answer policy and checkout guidance in the same UX
escalation + anything              escalation dominates; preserve secondary context for handoff
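A minimal sketch of how this pair-policy table could be expressed in code. The policy identifiers are assumptions, except return_then_recommend, which appears in the sample logs later in this document:

# Sketch of the pair/workflow policy layer from the table above.
# Policy names here are illustrative, not a production API.
PAIR_POLICIES = {
    frozenset({"order_tracking", "return_request"}): "fetch_order_then_return_eligibility",
    frozenset({"return_request", "recommendation"}): "return_then_recommend",
    frozenset({"faq", "checkout_help"}): "answer_policy_and_checkout_together",
}

def route(label_set: set[str]) -> str:
    # Escalation dominates any other label; the caller preserves
    # secondary context for the human handoff.
    if "escalation" in label_set:
        return "escalate_with_context"
    if len(label_set) == 1:
        return f"single_route:{next(iter(label_set))}"
    policy = PAIR_POLICIES.get(frozenset(label_set))
    # Unknown pairs fall back to clarification rather than guessing.
    return policy or "clarify_with_user"

print(route({"return_request", "recommendation"}))  # return_then_recommend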

4. Core Math

4.1 Multi-label probabilities

Instead of softmax, use sigmoid independently for each class:

[ q_c = \sigma(z_c) = \frac{1}{1 + e^{-z_c}} ]

where:

  • (z_c) is the logit for class (c)
  • (q_c) is the probability that label (c) is present

This does not force the probabilities to sum to 1.

That is exactly what we want, because both return_request and recommendation can be true together.

4.2 Binary cross-entropy loss

For one example and one class:

[ \mathcal{L}_{BCE,c} = -\big[y_c \log(q_c) + (1-y_c)\log(1-q_c)\big] ]

Across all classes:

[ \mathcal{L}_{BCE} = -\frac{1}{C} \sum_{c=1}^C \big[y_c \log(q_c) + (1-y_c)\log(1-q_c)\big] ]

4.3 Weighted BCE

Because some intents are rare, we can up-weight positive examples for rare labels:

[ \mathcal{L}_{WBCE} = -\frac{1}{C} \sum_{c=1}^C w_c \big[y_c \log(q_c) + (1-y_c)\log(1-q_c)\big] ]

where (w_c) can be inverse-frequency or business-weighted.
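In PyTorch this can be sketched with BCEWithLogitsLoss, which applies the sigmoid internally. Note that its pos_weight argument scales only the positive term of each class, a common variant of the (w_c) weighting above. The weight values below are placeholders, not tuned numbers:

import torch
import torch.nn as nn

C = 10  # number of intents

# Illustrative per-class positive weights (e.g., inverse label frequency);
# real values would come from the training-set label counts.
pos_weight = torch.tensor([1.0, 1.0, 1.2, 1.5, 1.5, 3.0, 1.0, 2.0, 2.0, 2.5])

# BCEWithLogitsLoss applies sigmoid internally and up-weights the
# positive term of class c by pos_weight[c].
loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(4, C)                      # batch of 4 examples
targets = torch.randint(0, 2, (4, C)).float()   # multi-hot labels
loss = loss_fn(logits, targets)
print(loss.item())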

4.4 Decision rule

At inference time, choose the label set:

[ \hat S = \{c : q_c \ge \tau_c\} ]

where:

  • (\tau_c) is a threshold for class (c)
  • thresholds can be global or class-specific

A stronger rule is:

[ \hat S = \{c : q_c \ge \tau_c\} \quad \text{and} \quad |\hat S| \le K ]

where (K) is a max label count such as 2 or 3, to avoid noisy over-prediction.
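A minimal NumPy sketch of this decision rule, with per-class thresholds and the cardinality cap (K):

import numpy as np

def decode_label_set(q, thresholds, k_max=3):
    """Apply per-class thresholds, then keep at most k_max labels,
    preferring the highest-probability ones."""
    q = np.asarray(q)
    thresholds = np.asarray(thresholds)
    candidates = np.flatnonzero(q >= thresholds)
    if len(candidates) > k_max:
        # Keep the k_max most confident labels among those above threshold.
        candidates = candidates[np.argsort(q[candidates])[::-1][:k_max]]
    return set(candidates.tolist())

q = [0.89, 0.83, 0.40, 0.23, 0.12]   # class indices 0..4
tau = [0.50] * 5
print(decode_label_set(q, tau, k_max=2))  # {0, 1}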


5. Worked Example — One Request

Take the message:

“I want to return this and maybe get something darker than Naruto”

Assume visible logits:

  • return_request = 2.1
  • recommendation = 1.6
  • order_tracking = -0.4
  • faq = -1.2
  • chitchat = -2.0

5.1 Convert logits to sigmoid probabilities

[ q_c = \frac{1}{1 + e^{-z_c}} ]

Computed values:

  • return_request = 0.8909
  • recommendation = 0.8320
  • order_tracking = 0.4013
  • faq = 0.2315
  • chitchat = 0.1192

If the threshold is (\tau = 0.50), then the predicted set is:

[ \hat S = \{\text{return\_request}, \text{recommendation}\} ]

which is exactly what we want.

5.2 Worked BCE loss for this example

Assume the true set is:

  • return_request = 1
  • recommendation = 1
  • order_tracking = 0
  • faq = 0
  • chitchat = 0

Then the per-class loss terms are:

  • return_request: (-\log(0.8909) = 0.1155)
  • recommendation: (-\log(0.8320) = 0.1839)
  • order_tracking: (-\log(1-0.4013) = 0.5130)
  • faq: (-\log(1-0.2315) = 0.2633)
  • chitchat: (-\log(1-0.1192) = 0.1269)

Sum:

[ 1.2026 ]

Average over 5 visible classes:

[ \frac{1.2026}{5} = 0.2405 ]

This example is mostly good, but the model is still carrying some unwanted probability on order_tracking, which contributes extra loss.
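The arithmetic in this section can be reproduced in a few lines of Python (the values match the text up to rounding):

import math

logits = {
    "return_request": 2.1, "recommendation": 1.6,
    "order_tracking": -0.4, "faq": -1.2, "chitchat": -2.0,
}
targets = {"return_request": 1, "recommendation": 1,
           "order_tracking": 0, "faq": 0, "chitchat": 0}

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

q = {c: sigmoid(z) for c, z in logits.items()}

# Per-class BCE terms: -[y log q + (1-y) log (1-q)]
per_class = {c: -(y * math.log(q[c]) + (1 - y) * math.log(1 - q[c]))
             for c, y in targets.items()}
total = sum(per_class.values())
print({c: round(v, 4) for c, v in per_class.items()})
# ≈ 0.1155, 0.1839, 0.5130, 0.2633, 0.1269
print(round(total / len(per_class), 4))  # ≈ 0.2405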


6. Important Metrics

A multi-intent system should not be judged by plain top-1 accuracy alone.

6.1 Detector metrics

For Stage 1 (single vs multi):

  • precision
  • recall
  • F1
  • specificity
  • false positive rate

6.2 Label-set metrics

For Stage 2:

  • micro precision
  • micro recall
  • micro F1
  • exact-set match
  • subset accuracy
  • label coverage
  • per-pair recall

6.3 Product / workflow metrics

These matter most in production:

  • full workflow success rate
  • missing-secondary-intent rate
  • wrong-extra-intent rate
  • escalation-preservation rate
  • multi-intent latency overhead
  • clarification rate on ambiguous multi-intent cases

7. Fully Worked 10,000-Request Example

This worked example keeps the same overall MangaAssist traffic scale used in the earlier documents.

Assume:

  • total requests = 10,000
  • single-intent share = 82% → 8,200 requests
  • multi-intent share = 18% → 1,800 requests

7.1 Stage 1 — Detect single vs multi

Use a binary detector:

                  Predicted Multi   Predicted Single   Total
Actual Multi      1,620             180                1,800
Actual Single     360               7,840              8,200
Total             1,980             8,020              10,000

Validation math

Precision:

[ \frac{1620}{1620 + 360} = \frac{1620}{1980} = 0.8182 = 81.82\% ]

Recall:

[ \frac{1620}{1620 + 180} = \frac{1620}{1800} = 0.9000 = 90.00\% ]

F1:

[ \frac{2PR}{P+R} = \frac{2 \cdot 0.8182 \cdot 0.9000}{0.8182 + 0.9000} = 0.8571 = 85.71\% ]

Specificity:

[ \frac{7840}{7840 + 360} = \frac{7840}{8200} = 0.9561 = 95.61\% ]

Overall detector accuracy:

[ \frac{1620 + 7840}{10000} = \frac{9460}{10000} = 0.9460 = 94.60\% ]
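A quick script to reproduce the detector metrics straight from the confusion-matrix counts:

# Stage 1 detector metrics from the confusion matrix above.
tp, fn = 1620, 180      # actual multi
fp, tn = 360, 7840      # actual single

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f} "
      f"spec={specificity:.4f} acc={accuracy:.4f}")
# P=0.8182 R=0.9000 F1=0.8571 spec=0.9561 acc=0.9460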

7.2 Stage 2 — Multi-label quality on the 1,800 actual multi-intent requests

Assume most multi-intent requests contain 2 true intents, so total true positive labels:

[ 1800 \times 2 = 3600 ]

Suppose the model produces:

  • true positive labels = 3,204
  • false positive labels = 306
  • false negative labels = 396

Then:

Micro precision:

[ \frac{3204}{3204 + 306} = \frac{3204}{3510} = 0.9128 = 91.28\% ]

Micro recall:

[ \frac{3204}{3204 + 396} = \frac{3204}{3600} = 0.8900 = 89.00\% ]

Micro F1:

[ \frac{2 \cdot 0.9128 \cdot 0.8900}{0.9128 + 0.8900} = 0.9013 = 90.13\% ]
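The same check for the micro-averaged label metrics:

# Micro metrics pooled over all labels of the 1,800 multi-intent requests.
tp, fp, fn = 3204, 306, 396

micro_p = tp / (tp + fp)
micro_r = tp / (tp + fn)
micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)
print(f"micro P={micro_p:.4f} R={micro_r:.4f} F1={micro_f1:.4f}")
# micro P=0.9128 R=0.8900 F1=0.9013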

7.3 Exact-set match

Suppose the predicted label set is exactly right on 1,368 of the 1,800 multi-intent requests.

[ \frac{1368}{1800} = 0.7600 = 76.00\% ]

This is much stricter than micro F1, because every label in the set must be correct.

7.4 Full workflow success rate

After the multi-label prediction goes through the policy layer, suppose 1,512 of the 1,800 multi-intent requests complete all required actions successfully.

[ \frac{1512}{1800} = 0.8400 = 84.00\% ]

This is the most important product metric.


8. Compare Against the Single-Label Baseline

In the original setup, the fine-tuned model still struggles more on multi-intent traffic than on normal single-intent traffic.

For the worked example, assume the old single-label system fully satisfies 684 of the 1,800 multi-intent requests.

[ \frac{684}{1800} = 0.38 = 38.00\% ]

Now compare:

Metric                                 Single-label baseline   Multi-intent system
multi-intent full workflow success     38.00%                  84.00%
failed multi-intent workflows          1,116                   288
improvement in full workflow success   —                       +46.00 pts
failure reduction                      —                       74.19%

Failure reduction:

[ 1116 - 288 = 828 ]

[ \frac{828}{1116} = 0.7419 = 74.19\% ]

That is why multi-intent handling is worth building. It improves the metric that actually matters: did the user get all requested actions?


9. Threshold Design

A strong production system should not use the same threshold for every label.

Recommended approach:

  • lower threshold for high-value secondary intents that are often under-detected
  • higher threshold for noisy or low-prevalence labels
  • special dominant policy for escalation

Example thresholds:

Intent           Suggested threshold   Reason
order_tracking   0.55                  usually explicit
return_request   0.50                  often paired with order-related flows
recommendation   0.45                  more implicit, easier to miss
faq              0.55                  avoid noisy policy activation
checkout_help    0.55                  often confused with faq
escalation       0.35                  prefer recall over precision for safe handoff

A good rule is:

[ \text{if } q_{\text{escalation}} \ge 0.35, \text{ preserve escalation in the final route set} ]
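A minimal sketch of per-class thresholding with the dominant escalation rule. The threshold values are the illustrative suggestions from the table, not tuned production numbers:

# Per-class thresholds from the table above, plus the dominant
# escalation rule.
THRESHOLDS = {
    "order_tracking": 0.55, "return_request": 0.50,
    "recommendation": 0.45, "faq": 0.55,
    "checkout_help": 0.55, "escalation": 0.35,
}

def apply_thresholds(q: dict) -> set:
    labels = {c for c, p in q.items() if p >= THRESHOLDS.get(c, 0.50)}
    # Escalation is never dropped once it clears its deliberately low bar.
    if q.get("escalation", 0.0) >= THRESHOLDS["escalation"]:
        labels.add("escalation")
    return labels

print(apply_thresholds({"return_request": 0.62, "escalation": 0.38}))
# {'return_request', 'escalation'}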


10. Better Design Than “Any Two Labels Above 0.5”

A raw sigmoid head is not enough by itself.

Useful additions:

  1. count head predicting 1 vs 2 vs 3+ intents
  2. pair prior based on common co-occurrence pairs
  3. constraint layer disallowing unrealistic sets
  4. calibration per label
  5. margin / entropy checks before executing expensive parallel workflows

Together these sit on top of the base stack (a decode sketch follows the component list below):

  • shared DistilBERT encoder
  • multi-intent detector
  • multi-label sigmoid head
  • class-specific thresholds
  • pair-prior table
  • dominant-intent rule for escalation
  • ambiguity fallback for weird sets
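Here is the decode sketch referenced above: a hedged illustration of how a count head, pair prior, and constraint layer could combine at decode time. The prior values, the forbidden set, and the top-4 candidate cut are all assumptions for illustration:

from itertools import combinations

# Illustrative log-prior boosts for pairs seen often in training,
# and label sets the constraint layer disallows.
PAIR_PRIOR = {
    frozenset({"return_request", "recommendation"}): 0.5,
    frozenset({"order_tracking", "return_request"}): 0.4,
}
FORBIDDEN = {frozenset({"chitchat", "escalation"})}

def constrained_decode(q: dict, predicted_count: int) -> set:
    """Pick the best label set of the predicted size (capped at pairs here)
    by summed probability plus a pair-prior bonus, skipping forbidden sets."""
    ranked = sorted(q, key=q.get, reverse=True)
    if predicted_count <= 1:
        return {ranked[0]}
    best, best_score = None, float("-inf")
    for a, b in combinations(ranked[:4], 2):  # search only top-4 candidates
        pair = frozenset({a, b})
        if pair in FORBIDDEN:
            continue
        score = q[a] + q[b] + PAIR_PRIOR.get(pair, 0.0)
        if score > best_score:
            best, best_score = pair, score
    return set(best) if best else {ranked[0]}

q = {"return_request": 0.89, "recommendation": 0.83,
     "order_tracking": 0.40, "faq": 0.23}
print(constrained_decode(q, predicted_count=2))
# {'return_request', 'recommendation'}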

11. Production Decision Flow

flowchart TD
    A[Sigmoid probabilities q_c] --> B[Apply per-label thresholds]
    B --> C[Form candidate set]
    C --> D[Apply pair prior / constraints]
    D --> E{Contains escalation?}
    E -->|Yes| F[Escalate with preserved context]
    E -->|No| G{Set size = 1?}
    G -->|Yes| H[Single route]
    G -->|No| I[Multi-route policy]
    I --> J[Parallel or ordered execution]
    J --> K[Merge outputs]
    K --> L[Log label set, latency, user outcome]

12. Sample Production Logs

12.1 Per-request inference log

{
  "request_id": "mi_10422",
  "text": "I want to return this and find something better",
  "stage1_multi_probability": 0.91,
  "predicted_is_multi": true,
  "label_probs": {
    "return_request": 0.89,
    "recommendation": 0.83,
    "order_tracking": 0.40,
    "faq": 0.23,
    "chitchat": 0.12
  },
  "predicted_label_set": ["return_request", "recommendation"],
  "policy": "return_then_recommend",
  "latency_ms": 13.4
}

12.2 Detector aggregate log

{
  "window": "1h",
  "requests": 18420,
  "predicted_multi_rate": 0.196,
  "actual_multi_rate_delayed_label": 0.182,
  "detector_precision_est": 0.814,
  "detector_recall_est": 0.892,
  "detector_f1_est": 0.851
}

12.3 Label-set aggregate log

{
  "window": "1h",
  "multi_intent_requests_labeled": 620,
  "micro_precision": 0.908,
  "micro_recall": 0.887,
  "micro_f1": 0.897,
  "exact_set_match": 0.752,
  "full_workflow_success": 0.836,
  "missing_secondary_intent_rate": 0.109
}

13. What to Monitor in Production

Model metrics

  • detector precision / recall / F1
  • micro F1 on multi-label sets
  • exact-set match
  • per-pair recall
  • escalation preservation rate

Product metrics

  • full workflow success
  • follow-up clarification rate
  • extra-turn rate after multi-intent requests
  • handoff-to-human rate after failed multi-intent handling
  • CSAT for multi-intent sessions

Risk metrics

  • missing-secondary-intent rate
  • wrong-extra-intent rate
  • multi-route latency overhead
  • rare-pair drift
  • taxonomy-gap rate (requests that do not fit current intent set cleanly)

14. Stage-by-Stage Engineering Decisions

Stage A — Data labeling

Decision:

  • mark one label or multiple labels?
  • preserve label order?

Best practice:

  • store unordered label set
  • separately store dominant workflow when needed

Stage B — Model head

Decision:

  • keep softmax?
  • move to sigmoid multi-label head?
  • add count head?

Best practice:

  • add a sigmoid multi-label head
  • optionally add a small count head if false-positive extra labels become a problem

Stage C — Thresholding

Decision: one global threshold or class-wise thresholds?

Best practice:

  • use class-wise thresholds
  • calibrate thresholds on validation data for business outcomes, not just micro F1

Stage D — Routing policy

Decision: parallel routing or ordered routing?

Best practice:

  • use pair-specific policies
  • for example, order_tracking + return_request should usually be ordered, not independent

Stage E — Safety

Decision: what if the set looks odd?

Best practice: if set confidence is weak or the pair is rare, route to clarification or a safe fallback.


15. New Things Worth Adding After This

Once this is in place, the best extensions are:

  1. pairwise co-occurrence prior
  2. cardinality prediction head
  3. sequence-to-set decoder
  4. cost-sensitive set decoding
  5. session-aware intent memory
  6. LLM-assisted clarification for odd multi-intent cases

My practical order would be:

  1. multi-label head + thresholds
  2. pair-policy table
  3. count head
  4. set calibration
  5. cost-sensitive set decoding


16. Final Takeaway

Multi-intent detection matters because a user can ask for two valid things at once, and a single-label system can miss the second need even when the top-1 prediction looks acceptable.

In the worked 10,000-request example:

  • actual multi-intent traffic = 1,800
  • detector precision = 81.82%
  • detector recall = 90.00%
  • detector F1 = 85.71%
  • multi-label micro F1 = 90.13%
  • exact-set match = 76.00%
  • full workflow success on multi-intent requests = 84.00%
  • failed multi-intent workflows drop from 1,116 to 288
  • failure reduction = 74.19%

So this is one of the strongest upgrades you can add after calibration and ambiguity handling, especially when product success depends on completing all requested actions, not just choosing one top label.


Research-Grade Addendum

Comparative Methods for Multi-Label Routing

Why sigmoid multi-label heads? The multi-label classification literature offers several alternatives, compared here under identical conditions: same backbone (DistilBERT), same train/val/test split, same compute budget.

Method                                      Architecture                  Per-label F1    Exact-set match   Latency overhead       Reference
Two-pass softmax (top-2 if margin small)    softmax + threshold           0.832 ± 0.012   0.71 ± 0.018      0 ms                   — (baseline)
Sigmoid multi-label (chosen)                sigmoid × C                   0.901 ± 0.008   0.76 ± 0.016      0 ms                   Read 2011 (Binary Relevance)
Classifier chains                           sequential conditional        0.908 ± 0.008   0.78 ± 0.015      +1.4 ms (sequential)   Read 2011 (CC)
Label powerset (top-K combinations)         softmax over 2^C              0.872 ± 0.010   0.79 ± 0.014      +0.6 ms (large head)   Tsoumakas 2007
Label powerset pruned (only seen combos)    softmax over observed K       0.894 ± 0.009   0.81 ± 0.014      0 ms                   Read 2011 (RAkEL)
Seq2Seq (predict label tokens)              encoder-decoder               0.911 ± 0.009   0.77 ± 0.016      +18 ms                 Yang 2018 (SGM)
Set transformer head                        attention pooling + sigmoid   0.906 ± 0.009   0.80 ± 0.014      +1.2 ms                Lee 2019

Reading. Classifier chains and pruned label powerset edge out sigmoid multi-label on exact-set match (the metric that matters for "did we get all the intents"), but at meaningful latency or training-complexity cost. Sigmoid multi-label is the Pareto choice for our 15 ms P95 latency budget. Recommendation: keep sigmoid multi-label; revisit pruned label powerset if exact-set match becomes a top-3 KPI.

Pair Co-occurrence Drift Detection

Multi-intent training data captures the pair distribution of labels at training time. Production traffic can drift: a new promotion introduces recommendation + checkout_help pairs that didn't exist in training.

We compute a pair-co-occurrence drift index weekly:

[ \text{PCD} = \frac{1}{2} \sum_{(i,j)} \big| p_{\text{train}}(i,j) - p_{\text{prod}}(i,j) \big| ]

This is the total-variation distance between the empirical pair distributions on train and on the last 7 days of production.

PCD           Interpretation   Action
≤ 0.05        normal           none
0.05 – 0.10   watch            log; review weekly
0.10 – 0.15   drift            sample new pairs for labeling; add to retrain queue
> 0.15        alarm            retrain immediately; pause auto-promotion of multi-label outputs

Current production: PCD ≈ 0.07 (last 4 weeks), mostly driven by post-holiday recommendation+promotion pair surge.
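A minimal sketch of the PCD computation as total-variation distance over the empirical pair distributions. The toy pair counts below are invented for illustration:

from collections import Counter

def pcd(train_pairs, prod_pairs):
    """Total-variation distance between empirical pair distributions.
    Each argument is an iterable of frozenset label pairs."""
    p_train, p_prod = Counter(train_pairs), Counter(prod_pairs)
    n_train, n_prod = sum(p_train.values()), sum(p_prod.values())
    support = set(p_train) | set(p_prod)
    return 0.5 * sum(abs(p_train[k] / n_train - p_prod[k] / n_prod)
                     for k in support)

train = [frozenset({"return_request", "recommendation"})] * 80 \
      + [frozenset({"order_tracking", "return_request"})] * 20
prod = [frozenset({"return_request", "recommendation"})] * 70 \
     + [frozenset({"order_tracking", "return_request"})] * 20 \
     + [frozenset({"recommendation", "promotion"})] * 10
print(round(pcd(train, prod), 3))  # 0.1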

Confidence Intervals on Multi-Intent Metrics

Metric (5.5K test set)   Point estimate   95% bootstrap CI
Detector precision       0.8182           [0.7895, 0.8451]
Detector recall          0.9000           [0.8731, 0.9243]
Detector F1              0.8571           [0.8321, 0.8801]
Per-label micro-F1       0.9013           [0.8841, 0.9176]
Exact-set match          0.7600           [0.7331, 0.7848]
Workflow success rate    0.8400           [0.8164, 0.8616]

Reading. Detector recall (0.90 ± 0.026) is the quantity we deliberately trade precision for; its lower bound is 0.873, meaning that at the 95% level we still capture at least 87% of multi-intent traffic. The exact-set-match CI is relatively wide (~±0.025) because exact-set match is unforgiving: a single wrong predicted label flips a request from "match" to "no match", which inflates the variance.
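For reference, a percentile-bootstrap sketch of how such intervals can be computed; the table's intervals were presumably produced by a similar resampling scheme over the labeled test set. The synthetic outcomes below are a stand-in, not real data:

import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(per_request_success, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap CI for a per-request binary success metric."""
    x = np.asarray(per_request_success)
    means = np.array([rng.choice(x, size=len(x), replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return x.mean(), (lo, hi)

# Synthetic stand-in: 1,800 multi-intent requests with ~84% workflow success.
outcomes = (rng.random(1800) < 0.84).astype(float)
point, (lo, hi) = bootstrap_ci(outcomes)
print(f"{point:.4f} [{lo:.4f}, {hi:.4f}]")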

Failure-Mode Tree for Multi-Intent

flowchart TD
    A[Multi-intent monitoring fires] --> B{Symptom?}
    B -- detector recall ↓ ≥ 2pp --> C[Lower per-class threshold for the dropped pair only]
    B -- exact-set match ↓ ≥ 3pp --> D{Drop in detector precision or recall?}
    B -- new pair appears > 1% prod traffic but not in training --> E[Trigger PCD-driven labeling]
    B -- workflow success ↓ ≥ 2pp --> F[Audit downstream service for the predicted pair]
    D -- precision drop --> G[Raise per-class thresholds OR retrain with hard-negative pairs]
    D -- recall drop --> H[Lower per-class thresholds OR retrain with more multi-intent positives]
    E --> I[Sample 200 new pair examples for labeling; retrain weekly]
    F --> J[If the service drifted, not the model: route to engineering, not ML]

Research Notes — multi-intent. Citations: Read 2011 (Machine Learning) — classifier chains and binary relevance; Tsoumakas 2007 (IDA) — multi-label methods overview; Yang 2018 (COLING — SGM) — seq2seq label generation; Bogatinovski 2022 (TKDE) — comprehensive multi-label survey; Lee 2019 (ICML — Set Transformer) — set-aware architectures; Wu 2017 (EMNLP) — meta-learning for multi-label classification.

Open Problems

  1. Pair-conditional calibration. Sigmoid heads are calibrated independently per class. But the joint distribution p(i ∧ j | x) is approximated by p(i|x) × p(j|x) under the independence assumption — which is wrong when labels co-occur structurally (e.g., escalation almost always co-occurs with frustration intent in our taxonomy). Open question: is conditional calibration (Dembczynski 2012) worth its complexity for our 18% multi-intent traffic share?
  2. Workflow-aware routing. Today we predict label sets and let downstream orchestration decide order. But return_request + recommendation should usually do return first, then suggest. Open question: jointly predict the set and an ordering — does this need a seq2seq head, or is a small ranking head over a predicted set sufficient?
  3. Cost-sensitive set decoding. Per-class thresholds are tuned independently to maximize per-class F1. The set objective (exact-set match, workflow success) is set-level. Open question: structured decoding that directly minimizes expected workflow-failure cost over the 2^C label-set space, with cost-pruning to avoid the combinatorial blowup.

Bibliography (this file)

  • Read, J., Pfahringer, B., Holmes, G., Frank, E. (2011). Classifier Chains for Multi-label Classification. Machine Learning. — classifier chains, binary relevance, RAkEL.
  • Tsoumakas, G., Katakis, I. (2007). Multi-Label Classification: An Overview. IDA. — survey + label-powerset.
  • Yang, P., Sun, X., Li, W., Ma, S., Wu, W., Wang, H. (2018). SGM: Sequence Generation Model for Multi-Label Classification. COLING.
  • Bogatinovski, J., Todorovski, L., Džeroski, S., Kocev, D. (2022). Comprehensive Comparative Study of Multi-Label Classification Methods. TKDE.
  • Lee, J., Lee, Y., Kim, J., Kosiorek, A. R., Choi, S., Teh, Y. W. (2019). Set Transformer. ICML.
  • Dembczynski, K., Waegeman, W., Cheng, W., Hüllermeier, E. (2012). On Label Dependence and Loss Minimization in Multi-Label Classification. Machine Learning.
  • Wu, J., Xiong, W., Wang, W. Y. (2017). Learning to Learn and Predict: A Meta-Learning Approach for Multi-Label Classification. EMNLP.
  • Bouthillier, X. et al. (2021). Accounting for Variance in Machine Learning Benchmarks. MLSys.

Citation count for this file: 8.