Cluster-Based New-Intent Discovery from Rejected / OOD Traffic — MangaAssist
1. Why this document exists
The earlier MangaAssist documents covered:
- fine-tuned intent classification on 10 known intents,
- confidence calibration for safer routing,
- business-weighted error scoring,
- margin-based ambiguity handling,
- multi-intent detection, and
- OOD / unknown intent detection.
This document explains the next step after OOD rejection:
When the system keeps rejecting similar user messages, how do we discover that these are not random errors, but actually new intents that should become part of the production taxonomy?
This is the missing bridge between:
- safe rejection now, and
- taxonomy expansion later.
It is written in the same worked-example style as the earlier MangaAssist set and stays aligned with the same base scenario:
- MangaAssist uses an intent classifier before routing.
- The classifier must stay under 15 ms P95 latency.
- The base model is DistilBERT fine-tuned from 83.2% to about 92.1% accuracy on the known 10-intent setup.
2. The production problem
OOD detection keeps the system safe by rejecting inputs that do not fit the current 10-intent taxonomy.
That is good in the short term.
But in production, repeated OOD traffic often means one of four things:
- a real new user need has emerged,
- an old intent has become too broad and needs to be split,
- one current intent is hiding multiple operationally different workflows,
- the model is undertrained on a valid pattern and is falsely rejecting it.
So the real engineering question is:
Among rejected traffic, which groups deserve to become new intents, and which groups are just noise, duplicates, or temporary events?
3. End-to-end goal
We want a pipeline that turns rejected traffic into three outputs:
- Promote → create a new intent and retrain
- Watch → keep monitoring until volume / purity / business pain is strong enough
- Drop / merge → not a real new intent; either noise or belongs inside an existing intent
4. Numerical scenario used in this document
To stay consistent with the earlier OOD document, assume the following production scale:
4.1 Monthly traffic
- Total monthly requests: 1,000,000
- OOD / rejected rate: 5.2%
- Rejected raw messages per month: 52,000
This matches the earlier 10,000-request worked-example pattern scaled up by 100:
- 10,000 requests → 520 rejected
- 1,000,000 requests → 52,000 rejected
4.2 From raw rejected traffic to clustering candidates
Rejected traffic contains many duplicates, near-duplicates, and low-information messages.
We apply two cleanup stages:
- near-duplicate collapse
- low-signal filtering
| Stage | Count | Calculation |
|---|---|---|
| Raw rejected messages | 52,000 | detector output |
| Unique after dedup | 14,300 | 52,000 collapsed by semantic dedup |
| Final clustering candidates | 12,000 | after dropping spam / gibberish / ultra-short text |
4.3 Validated reduction rates
Duplicate reduction:
[ \text{dup_removed} = 1 - \frac{14{,}300}{52{,}000} = 0.725 = 72.5\% ]
Low-signal filtering after dedup:
[ \text{filter_removed} = 1 - \frac{12{,}000}{14{,}300} \approx 0.1608 = 16.08\% ]
So only about 12,000 useful candidate messages remain for clustering each month.
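A minimal sketch of the near-duplicate collapse stage, assuming message embeddings are already available (see §6.1); the greedy strategy and the 0.92 similarity threshold are illustrative choices, not the production values:

```python
import numpy as np

def collapse_near_duplicates(embeddings: np.ndarray, threshold: float = 0.92) -> list[int]:
    """Greedy near-duplicate collapse: keep message i only if its cosine similarity
    to every already-kept message is below `threshold` (0.92 is illustrative).
    O(n * kept) is fine at ~50K messages/month; swap in an ANN index (e.g. FAISS)
    at larger scale."""
    # Normalize rows so that a dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i in range(len(normed)):
        if not kept or float((normed[kept] @ normed[i]).max()) < threshold:
            kept.append(i)
    return kept  # indices of the unique representatives
```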
5. What counts as a “new intent”?
A cluster should become a new intent only if all of the following are true:
- Semantically coherent: the messages in the cluster are about the same user need.
- Operationally distinct: the downstream action is different from every existing intent.
- Stable enough: the pattern appears repeatedly, not just once.
- Large enough or painful enough: it has enough volume, or the cost of wrong routing is high enough.
- Labelable: human reviewers can write a short, clear label guideline.
If any one of these is missing, the cluster should usually be watched, merged, or dropped.
6. Core math and formulas
6.1 Embedding and similarity
Let each rejected message (x_i) be embedded into a dense vector (e_i \in \mathbb{R}^d).
For two messages (i) and (j), cosine similarity is:
[ \text{cos_sim}(e_i, e_j) = \frac{e_i \cdot e_j}{|e_i| |e_j|} ]
Cosine distance:
[ d_{\text{cos}}(e_i, e_j) = 1 - \text{cos_sim}(e_i, e_j) ]
Messages that talk about the same missing workflow should have small cosine distance.
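The two formulas above translate directly into code. A minimal NumPy version:

```python
import numpy as np

def cos_sim(e_i: np.ndarray, e_j: np.ndarray) -> float:
    # cos_sim(e_i, e_j) = (e_i . e_j) / (||e_i|| * ||e_j||)
    return float(e_i @ e_j / (np.linalg.norm(e_i) * np.linalg.norm(e_j)))

def cos_dist(e_i: np.ndarray, e_j: np.ndarray) -> float:
    # Cosine distance is 1 minus cosine similarity.
    return 1.0 - cos_sim(e_i, e_j)
```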
6.2 Cluster purity
Suppose a human reviews a sample of (n) messages from cluster (C_k), and the dominant discovered label appears (m) times.
Then cluster purity is:
[ \text{purity}(C_k) = \frac{m}{n} ]
Worked example
If a reviewer samples 100 messages from a cluster and 92 are clearly about preorder status, then:
[ \text{purity} = \frac{92}{100} = 0.92 ]
6.3 Cluster capture rate
If hidden intent (I) truly appears (N_I) times in the candidate set, and the cluster captures (n_I) of them, then:
[ \text{capture}(I) = \frac{n_I}{N_I} ]
This behaves like “recall of the cluster for that hidden intent.”
Worked example
If there are 3,600 preorder-related candidate messages and one cluster captures 3,420 of them:
[ \text{capture} = \frac{3{,}420}{3{,}600} = 0.95 = 95.0\% ]
6.4 Growth ratio
A cluster that is growing quickly is more likely to deserve promotion.
Let (w_1) be the week-1 count and (w_4) the week-4 count:
[ \text{growth_ratio} = \frac{w_4}{w_1} ]
Worked example
If a cluster grows from 420 in week 1 to 1,140 in week 4:
[ \text{growth_ratio} = \frac{1{,}140}{420} = 2.7143 ]
That is a 2.71x increase.
6.5 Novel intent score (NIS)
To make promotion decisions reproducible, we combine several signals into one score.
For cluster (C_k), define:
- (P_k): purity
- (S_k): normalized size
- (G_k): normalized growth
- (B_k): normalized business pain
- (H_k): cluster stability
Then:
[ \text{NIS}_k = 100 \times \left(0.30P_k + 0.20S_k + 0.15G_k + 0.20B_k + 0.15H_k\right) ]
This is not a universal law. It is an engineering policy score.
Why these weights?
- Purity (0.30) is most important because noisy clusters should not become intents.
- Size (0.20) matters because tiny one-offs do not justify taxonomy growth.
- Business pain (0.20) matters because some small clusters are high-risk.
- Growth (0.15) catches emerging workflows early.
- Stability (0.15) helps avoid promoting fragile clusters.
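A direct transcription of the NIS formula, assuming all five inputs have already been normalized to [0, 1] (the normalization conventions are defined in section 9):

```python
def novel_intent_score(purity: float, size_norm: float, growth_norm: float,
                       pain_norm: float, stability: float) -> float:
    """NIS on a 0-100 scale, using the policy weights above.
    All five inputs are assumed normalized to [0, 1] (see section 9)."""
    return 100.0 * (0.30 * purity
                    + 0.20 * size_norm
                    + 0.15 * growth_norm
                    + 0.20 * pain_norm
                    + 0.15 * stability)
```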
7. Pipeline architecture
```mermaid
graph TD
    A["Rejected / OOD traffic<br>52,000 raw / month"] --> B["Dedup + cleanup<br>14,300 unique → 12,000 candidates"]
    B --> C["Embed messages<br>sentence-level vectors"]
    C --> D["Dimensionality reduction<br>UMAP / PCA for neighborhood structure"]
    D --> E["Density clustering<br>HDBSCAN / similar"]
    E --> F["Cluster scoring<br>purity, size, growth, pain, stability"]
    F --> G["Human review queue"]
    G --> H{"Decision"}
    H -->|"Promote"| I["Create new intent guideline<br>label data<br>retrain classifier"]
    H -->|"Watch"| J["Track next 2–4 weeks"]
    H -->|"Merge / Drop"| K["Map to existing intent or ignore"]
    I --> L["Shadow deploy + A/B monitor"]
```
8. Worked monthly discovery example
Assume the 12,000 candidate messages contain four hidden emerging workflows:
| Hidden workflow | True candidate count |
|---|---|
| `preorder_status` | 3,600 |
| `subscription_help` | 2,200 |
| `digital_access` | 1,500 |
| `damage_claim` | 950 |
| Everything else (noise / mixed / false rejects) | 3,750 |
| Total | 12,000 |
Now run embedding + HDBSCAN clustering.
8.1 Main discovered clusters
| Cluster | Proposed label | Cluster size | True hidden count | Capture rate | Purity (audit) | Week 1 | Week 4 | Growth ratio | Business pain / 10 | Stability |
|---|---|---|---|---|---|---|---|---|---|---|
| C1 | `preorder_status` | 3,420 | 3,600 | 95.00% | 92.00% | 420 | 1,140 | 2.7143 | 7.8 | 0.88 |
| C2 | `subscription_help` | 1,980 | 2,200 | 90.00% | 89.00% | 330 | 620 | 1.8788 | 6.4 | 0.83 |
| C3 | `digital_access` | 1,410 | 1,500 | 94.00% | 91.00% | 210 | 480 | 2.2857 | 8.1 | 0.86 |
| C4 | `damage_claim` | 760 | 950 | 80.00% | 94.00% | 95 | 180 | 1.8947 | 9.2 | 0.80 |
8.2 Overall discovery coverage
Total hidden-intent messages across these four workflows:
[ 3{,}600 + 2{,}200 + 1{,}500 + 950 = 8{,}250 ]
Total captured in the four main clusters:
[ 3{,}420 + 1{,}980 + 1{,}410 + 760 = 7{,}570 ]
Overall capture across discovered workflows:
[ \frac{7{,}570}{8{,}250} = 0.9176 = 91.76\% ]
So the clustering pipeline groups about 91.76% of the major hidden-intent traffic into reviewable clusters.
9. Validated NIS calculations
We normalize size by the largest cluster size (3,420).
So:
[ S_k = \frac{\text{size}_k}{3{,}420} ]
We normalize growth with a cap at 3x:
[ G_k = \min\left(\frac{\text{growth_ratio}_k - 1}{3 - 1}, 1\right) ]
We normalize business pain by 10:
[ B_k = \frac{\text{pain}_k}{10} ]
9.1 Cluster C1 (preorder_status)
Inputs:
- (P_1 = 0.92)
- (S_1 = 3420 / 3420 = 1.00)
- (G_1 = (2.7143 - 1) / 2 = 0.8571)
- (B_1 = 7.8 / 10 = 0.78)
- (H_1 = 0.88)
Now compute:
[ \text{NIS}_1 = 100 \times \left(0.30(0.92) + 0.20(1.00) + 0.15(0.8571) + 0.20(0.78) + 0.15(0.88)\right) ]
[ = 100 \times (0.276 + 0.200 + 0.1286 + 0.156 + 0.132) ]
[ = 100 \times 0.8926 = 89.26 ]
9.2 Final NIS table
| Cluster | Size norm (S_k) | Growth norm (G_k) | Pain norm (B_k) | NIS |
|---|---|---|---|---|
C1 preorder_status |
1.0000 | 0.8571 | 0.7800 | 89.26 |
C2 subscription_help |
0.5789 | 0.4394 | 0.6400 | 70.12 |
C3 digital_access |
0.4123 | 0.6429 | 0.8100 | 74.29 |
C4 damage_claim |
0.2222 | 0.4474 | 0.9200 | 69.75 |
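As a sanity check, the whole table can be reproduced from the raw cluster stats in §8.1 using the `novel_intent_score` sketch from §6.5 and the normalizations above:

```python
# Raw per-cluster stats from section 8.1; normalizations follow section 9.
clusters = {
    "C1 preorder_status":   dict(purity=0.92, size=3420, growth=2.7143, pain=7.8, stability=0.88),
    "C2 subscription_help": dict(purity=0.89, size=1980, growth=1.8788, pain=6.4, stability=0.83),
    "C3 digital_access":    dict(purity=0.91, size=1410, growth=2.2857, pain=8.1, stability=0.86),
    "C4 damage_claim":      dict(purity=0.94, size=760,  growth=1.8947, pain=9.2, stability=0.80),
}
max_size = max(c["size"] for c in clusters.values())   # 3,420
for name, c in clusters.items():
    s = c["size"] / max_size                   # size norm
    g = min((c["growth"] - 1) / (3 - 1), 1.0)  # growth norm, capped at 3x
    b = c["pain"] / 10                         # pain norm
    nis = novel_intent_score(c["purity"], s, g, b, c["stability"])
    print(f"{name}: NIS = {nis:.2f}")
# C1: 89.26, C2: 70.12, C3: 74.29, C4: 69.75 -- matches the table.
```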
10. Promotion policy
We use the following policy:
Promote if:
- NIS (\ge 70),
- purity (\ge 85\%),
- cluster size (\ge 1{,}000) or business pain (\ge 8/10),
- and downstream action is clearly different from every current intent.
Watch if:
- (60 \le \text{NIS} < 70), or
- purity is good but volume is still small, or
- high pain exists but labeling guidelines are still unclear.
Drop / merge if:
- NIS (< 60),
- purity is low,
- cluster mostly maps into an existing intent,
- or messages are one-off / noisy / promotional spikes.
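One way to encode this policy as code, sketched below. Only the mechanically checkable conditions are captured; the two judgment-based Watch conditions (good purity with low volume, high pain with unclear guidelines) still need a human reviewer:

```python
def promotion_decision(nis: float, purity: float, size: int,
                       pain: float, action_distinct: bool) -> str:
    # Promote: all four conditions from the policy above must hold.
    if (nis >= 70 and purity >= 0.85
            and (size >= 1_000 or pain >= 8.0)
            and action_distinct):
        return "promote"
    # Watch: the borderline NIS band. (The other Watch conditions are
    # review-queue judgments and are not encoded here.)
    if 60 <= nis < 70:
        return "watch"
    # Everything else: drop, or merge into an existing intent.
    return "drop_or_merge"

promotion_decision(69.75, 0.94, 760, 9.2, True)  # -> "watch" (damage_claim, section 11)
```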
11. Stage-by-stage decisions in this worked example
| Cluster | NIS | Decision | Reason |
|---|---|---|---|
| `preorder_status` | 89.26 | Promote | Large, clean, fast-growing, operationally distinct |
| `subscription_help` | 70.12 | Promote | Good purity, enough volume, stable recurring workflow |
| `digital_access` | 74.29 | Promote | Strong growth, good purity, high business pain |
| `damage_claim` | 69.75 | Watch | Very painful and clean, but still smaller; may be folded into the support escalation path first |
Why damage_claim is not promoted immediately
This is an important engineering nuance.
damage_claim has:
- excellent purity (94%)
- high pain (9.2/10)
But it is still smaller and may be better handled first by:
- a high-priority escalation rule, or
- a sub-route under a broader support workflow
before creating a brand-new top-level intent.
So the system chooses watch + temporary override, not immediate taxonomy expansion.
12. Human review protocol
Clustering never makes the final taxonomy change alone.
For every candidate cluster selected for review:
- Sample 50–100 messages
- Ask reviewers: - Is there one clear user need? - Is it different from all existing intents? - Would it trigger a distinct downstream action? - Can we write a short label definition with examples and counterexamples?
- Measure: - audit purity - guideline clarity - reviewer agreement
Reviewer agreement formula
If two reviewers label the same audit sample, Cohen’s kappa can be used:
[ \kappa = \frac{p_o - p_e}{1 - p_e} ]
where:
- (p_o) = observed agreement,
- (p_e) = expected agreement by chance.
A practical policy might be:
- (\kappa \ge 0.75): clear enough for production
- (0.60 \le \kappa < 0.75): refine guidelines
- (\kappa < 0.60): cluster not mature enough
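A minimal example using scikit-learn's `cohen_kappa_score`; the two reviewer label lists are toy data constructed to land near the policy boundary:

```python
from sklearn.metrics import cohen_kappa_score

# Toy audit sample: two reviewers label the same 100 messages,
# disagreeing on 5 of them.
reviewer_a = ["preorder_status"] * 90 + ["other"] * 10
reviewer_b = ["preorder_status"] * 85 + ["other"] * 15

kappa = cohen_kappa_score(reviewer_a, reviewer_b)  # ~0.77 for this sample
if kappa >= 0.75:
    verdict = "clear enough for production"
elif kappa >= 0.60:
    verdict = "refine guidelines"
else:
    verdict = "cluster not mature enough"
```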
13. What happens after promotion?
Suppose the first three clusters are promoted:
- `preorder_status`
- `subscription_help`
- `digital_access`
The taxonomy expands from 10 intents to 13 intents.
13.1 New labeled data plan
For each promoted intent:
- 500–1,000 human labels from production traffic
- 300–800 synthetic variants if helpful
- hard negatives from nearby existing intents
Example:
| New intent | Human-labeled | Synthetic | Hard negatives |
|---|---|---|---|
| `preorder_status` | 1,000 | 500 | 800 |
| `subscription_help` | 800 | 400 | 600 |
| `digital_access` | 700 | 300 | 600 |
13.2 Why hard negatives matter
Without hard negatives, `preorder_status` may collapse into:
- `order_tracking`
- `product_question`
Without hard negatives, `digital_access` may collapse into:
- `product_question`
- `checkout_help`
So new-intent training sets must include nearby confusing examples, not just positive examples.
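One common way to mine such hard negatives is nearest-neighbor search from the new intent's examples into the existing intents' labeled pool. A minimal sketch, assuming embeddings are already available:

```python
import numpy as np

def mine_hard_negatives(new_emb: np.ndarray, existing_emb: np.ndarray,
                        existing_texts: list[str], per_example: int = 2) -> list[str]:
    """For each new-intent example, pull its nearest neighbors from the
    existing-intent pool, i.e. the messages most likely to be confused with it."""
    a = new_emb / np.linalg.norm(new_emb, axis=1, keepdims=True)
    b = existing_emb / np.linalg.norm(existing_emb, axis=1, keepdims=True)
    sims = a @ b.T                                       # (n_new, n_existing) cosine sims
    nearest = np.argsort(sims, axis=1)[:, ::-1][:, :per_example]
    return [existing_texts[j] for j in np.unique(nearest)]
```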
14. Worked deployment impact example
Return to the 10,000-request view used in the earlier docs.
Before new-intent promotion:
- total rejected / OOD traffic = 520
Assume these 520 rejected requests include:
- `preorder_status`: 140
- `subscription_help`: 85
- `digital_access`: 55
- everything else: 240
So the three promoted intents account for:
[ 140 + 85 + 55 = 280 ]
That is:
[ \frac{280}{520} = 53.85\% ]
of all rejected traffic.
14.1 Post-promotion outcome for those 280 requests
After adding the 3 new intents and retraining:
- correctly routed: 235
- still rejected: 31
- wrongly routed: 14
14.2 Recovery rate
[ \text{recovery} = \frac{235}{280} = 0.8393 = 83.93\% ]
14.3 New rejected count
Old rejected count: 520
We remove the old 280 rejected requests from this group, then add back the 31 that remain rejected:
[ 520 - 280 + 31 = 271 ]
So the rejected count drops to 271.
14.4 New rejected rate
[ \frac{271}{10{,}000} = 2.71\% ]
14.5 Reduction in rejected traffic
[ \frac{520 - 271}{520} = \frac{249}{520} = 47.88\% ]
So promoting just three new intents reduces total rejected traffic by 47.88% in this worked example.
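The entire §14 arithmetic in a few lines, for reproducibility:

```python
rejected_before = 520
covered = 140 + 85 + 55            # 280 requests belong to the 3 promoted intents
correctly_routed, still_rejected = 235, 31

recovery = correctly_routed / covered                               # 0.8393
rejected_after = rejected_before - covered + still_rejected         # 271
new_reject_rate = rejected_after / 10_000                           # 0.0271
reject_drop = (rejected_before - rejected_after) / rejected_before  # 0.4788
```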
15. Production logs to keep
15.1 Rejection event log
```json
{
  "ts": "2026-04-21T15:04:12Z",
  "message_id": "msg_8f1a",
  "text": "when does my solo leveling preorder ship",
  "top1_intent": "order_tracking",
  "top1_prob": 0.41,
  "top2_intent": "product_question",
  "top2_prob": 0.36,
  "margin": 0.05,
  "ood_score": 0.81,
  "decision": "reject_to_review_queue",
  "language": "en",
  "session_id": "sess_441"
}
```
15.2 Daily cluster job log
```json
{
  "job_date": "2026-04-21",
  "raw_rejected": 1738,
  "unique_after_dedup": 476,
  "final_candidates": 401,
  "embedding_model": "bge-small-en-v1.5",
  "clusterer": "hdbscan",
  "clusters_found": 11,
  "noise_points": 149,
  "largest_cluster": 118,
  "avg_cluster_stability": 0.77
}
```
15.3 Review decision log
```json
{
  "cluster_id": "C1",
  "candidate_label": "preorder_status",
  "sample_size": 100,
  "purity": 0.92,
  "reviewer_kappa": 0.81,
  "distinct_action": true,
  "NIS": 89.26,
  "decision": "promote"
}
```
15.4 Promotion event log
```json
{
  "promotion_date": "2026-05-01",
  "new_intent": "preorder_status",
  "taxonomy_version": "v13",
  "training_examples_added": 2300,
  "golden_set_delta_accuracy": 0.007,
  "shadow_reject_rate_before": 0.052,
  "shadow_reject_rate_after": 0.031,
  "decision": "deploy"
}
```
16. Metrics that matter most
| Metric | Why it matters | Good target | Alert threshold |
|---|---|---|---|
| Rejected traffic rate | tells whether taxonomy is missing workflows | < 3% | > 5% |
| Duplicate collapse rate | avoids overcounting one repeated complaint | 60–80% | < 40% |
| Cluster purity | measures semantic cleanliness | > 85% | < 75% |
| Cluster stability | avoids fragile clusters | > 0.80 | < 0.65 |
| Reviewer kappa | checks labeling clarity | > 0.75 | < 0.60 |
| Promotion recovery rate | measures how many rejected cases get fixed | > 75% | < 60% |
| Post-promotion reject drop | tells whether taxonomy expansion helped | > 25% | < 10% |
| Nearby-intent regression | ensures new intent did not damage old intents | < 1 pt drop | > 2 pt drop |
17. Decision rules at each stage
Stage 1 — Should a message enter the discovery pipeline?
Yes if:
- OOD rejected,
- or low confidence plus low margin,
- or repeated fallback occurred downstream.
No if:
- spam,
- gibberish,
- empty,
- unsafe / blocked content that belongs to another policy system.
Stage 2 — Should we cluster daily or weekly?
- Daily if traffic is very high and teams need faster detection
- Weekly if traffic is moderate and noise needs smoothing
For MangaAssist, weekly clustering is usually better because it reduces reaction to daily spikes.
Stage 3 — Should we promote a cluster or just add more examples to an existing intent?
Ask:
- Does it need a different downstream handler?
- Would PMs / ops teams track this separately?
- Would support documents or macros differ?
- Would a single user expect a distinct answer type?
If the answer is mostly no, do not create a new intent. Expand the existing intent data instead.
Stage 4 — Should we make a top-level intent or a sub-intent?
Use a top-level intent when:
- the route is materially different,
- ownership is different,
- business reporting needs separate tracking.
Use a sub-intent when:
- the top-level route is the same,
- but finer workflow analytics still matter.
In this example:
- `preorder_status` is a good candidate for a top-level, or at least first-class routed, intent.
- `damage_claim` may be better as a support sub-intent first.
Stage 5 — When should we retrain?
Retrain when at least one of these happens:
- 1 or more clusters are promoted,
- rejected traffic trend is rising,
- cluster purity remains high across 2+ review cycles,
- post-promotion shadow testing is ready.
18. Failure modes
18.1 False novelty
A cluster looks new but is really just a variant of an existing intent.
Example:
- “is volume 12 delayed”
- “when does preorder ship”
- “where is my preorder”
These may or may not deserve different intents depending on downstream action.
18.2 Event spikes
A temporary anime release or viral promotion can create a bursty cluster that disappears next week.
That is why growth alone should never drive promotion.
18.3 Duplicate illusion
One template complaint copied thousands of times can look like a huge new intent.
That is why semantic dedup is critical before scoring size.
18.4 Embedding mismatch
If embeddings are too generic, semantically different workflows may collapse into one cluster.
This is why discovery quality often improves when using:
- a domain-tuned sentence embedder, or
- the fine-tuned classifier's penultimate representation (sketched below).
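A minimal sketch of the second option: using the fine-tuned classifier's encoder body (its penultimate, pre-classification-head representation) as the embedding space. The checkpoint path is hypothetical:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Hypothetical checkpoint path for the fine-tuned MangaAssist DistilBERT.
CKPT = "mangaassist/distilbert-intents-v12"
tokenizer = AutoTokenizer.from_pretrained(CKPT)
encoder = AutoModel.from_pretrained(CKPT)  # encoder body only, no classification head

@torch.no_grad()
def embed(texts: list[str]) -> torch.Tensor:
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state  # (batch, seq_len, 768)
    return hidden[:, 0]                          # [CLS] position as the sentence vector
```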
19. What new things can be added next?
The strongest extensions after this document are:
- Hierarchical taxonomy growth: top-level intent vs sub-intent splitting logic.
- Cost-sensitive promotion policy: promote small but high-pain clusters earlier.
- Cluster-to-agent / workflow mapping: not just detect a new intent, but auto-suggest the downstream owner.
- Human-in-the-loop labeling queue optimization: decide which cluster samples give maximum information per review hour.
- Temporal novelty detection: explicitly model sudden emergence, not just static clustering.
- Retrieval-assisted cluster naming: use FAQs, support docs, and prior tickets to name clusters more consistently.
20. Final engineering takeaway
OOD rejection is not the end of the story.
It is the input signal for taxonomy evolution.
A strong production system should do all of the following:
- reject safely today,
- cluster rejected traffic tomorrow,
- review the strongest patterns weekly,
- promote real workflows into new intents,
- and retrain before users feel the gap for too long.
For MangaAssist, this discovery pipeline turns rejected traffic from a passive error bucket into an active source of product and ML improvement.
That is how the intent system evolves from 10 known intents into a living production taxonomy.
Research-Grade Addendum
Where the Novel Intent Score (NIS) Weights Came From
The NIS formula used above is NIS = 100 × (0.30·purity + 0.20·size + 0.15·growth + 0.20·business_pain + 0.15·stability). A research scientist's first question is: why these weights, and what changes if they are wrong?
We treat the weights as a 5-dim simplex (w_p, w_s, w_g, w_b, w_t) summing to 1 and run two sensitivity analyses:
Analysis 1: One-at-a-Time Perturbation
Hold the other weights at their chosen values; perturb one weight at a time by ±0.05 and renormalize so the vector still sums to 1. Record whether the top-10 promoted clusters change.
| Perturbed weight | Direction | Δ in top-10 promotion list |
|---|---|---|
| purity (0.30) | +0.05 | 1 cluster swap (rank 9 ↔ rank 11) |
| purity (0.30) | -0.05 | 1 cluster swap |
| size (0.20) | +0.05 | 2 cluster swaps |
| size (0.20) | -0.05 | 1 cluster swap |
| growth (0.15) | +0.05 | 2 cluster swaps |
| growth (0.15) | -0.05 | 0 swaps |
| business_pain (0.20) | +0.05 | 3 cluster swaps |
| business_pain (0.20) | -0.05 | 1 cluster swap |
| stability (0.15) | +0.05 | 1 cluster swap |
| stability (0.15) | -0.05 | 1 cluster swap |
Reading. The top-10 list is robust to ±0.05 perturbations (at most 3 swaps in any direction), so the ranking is not sensitive to the precise weights. Recommendation: keep the current weights, treating them as good defaults rather than optimized values, and do not over-fit them.
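A sketch of the perturbation loop behind this table; `features` is assumed to be an (n_clusters × 5) matrix of the normalized purity / size / growth / pain / stability values:

```python
import numpy as np

BASE = np.array([0.30, 0.20, 0.15, 0.20, 0.15])  # purity, size, growth, pain, stability

def perturb(weights: np.ndarray, idx: int, delta: float) -> np.ndarray:
    w = weights.copy()
    w[idx] = max(w[idx] + delta, 0.0)
    return w / w.sum()                           # renormalize onto the simplex

def top_k(features: np.ndarray, weights: np.ndarray, k: int = 10) -> set[int]:
    scores = features @ weights                  # features: (n_clusters, 5), all in [0, 1]
    return set(np.argsort(scores)[::-1][:k])     # indices of the k best clusters

def n_swaps(features: np.ndarray, idx: int, delta: float, k: int = 10) -> int:
    # Clusters that fall out of the top-k under the perturbed weights.
    return len(top_k(features, BASE, k) - top_k(features, perturb(BASE, idx, delta), k))
```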
Analysis 2: Sobol Variance Decomposition
Sample 4,096 weight vectors from a Dirichlet(α=2) prior; compute the rank correlation (Spearman) between NIS-induced rankings and a fixed "ground-truth" ranking (where ground truth = clusters that were actually promoted to new intents over the past 6 months).
| Weight (input) | First-order Sobol index S_i | Total Sobol index S_Ti |
|---|---|---|
| purity | 0.31 | 0.42 |
| business_pain | 0.27 | 0.39 |
| size | 0.18 | 0.27 |
| growth | 0.14 | 0.22 |
| stability | 0.10 | 0.18 |
Reading. Purity (S = 0.31) and business_pain (S = 0.27) explain over half the rank variance — the chosen weights of 0.30 and 0.20 reflect this. Stability (S = 0.10) has the smallest contribution; if we wanted to drop one weight to simplify the formula, this is the candidate. Recommendation: keep all five but consider auto-fitting weights once we have ≥ 12 months of ground-truth promotion data; today the dataset is too small (n ≈ 18) for stable weight estimation.
Comparative Methods: Clustering Algorithm Choice
Why HDBSCAN? Compare on the same feature space (DistilBERT [CLS] → UMAP-50 → cluster).
| Algorithm | Cluster purity | Capture rate | Outlier handling | Hyperparameters | Reference |
|---|---|---|---|---|---|
| k-means (k=20) | 0.62 ± 0.03 | 1.00 (forces all into clusters) | none | k | Lloyd 1982 |
| Spectral clustering | 0.71 ± 0.03 | 1.00 | none | k, σ | Ng 2002 |
| DBSCAN | 0.78 ± 0.04 | 0.71 | yes | ε, minPts | Ester 1996 |
| HDBSCAN (chosen) | 0.83 ± 0.03 | 0.82 | yes (probabilistic) | min_cluster_size, min_samples | Campello 2013 |
| Agglomerative (Ward) | 0.74 ± 0.03 | 1.00 | none | linkage, k | Ward 1963 |
| Density-Peak Clustering | 0.81 ± 0.03 | 0.79 | yes | δ, ρ thresholds | Rodriguez 2014 |
Reading. HDBSCAN dominates on purity and outlier handling, which are the two properties we care about for new-intent discovery (we want the algorithm to leave noise unclustered rather than force-cluster it). The trade-off is that capture rate is 0.82 — 18% of OOD traffic is left unclustered as noise, which is acceptable because that 18% is mostly truly random rather than coherent novel intent. Recommendation: keep HDBSCAN; revisit when traffic exceeds ~200K rejected/month, where ParametricUMAP + HDBSCAN may be needed for streaming.
Confidence Intervals on Discovery Pipeline Metrics
| Metric (last 90 days) | Point estimate | 95% bootstrap CI |
|---|---|---|
| Cluster purity (post-review) | 0.83 | [0.79, 0.86] |
| New-intent capture rate | 0.82 | [0.78, 0.85] |
| Promotion precision (clusters that became real intents) | 0.78 | [0.69, 0.86] |
| False-positive rate (clusters reviewed but not promoted) | 0.22 | [0.14, 0.31] |
| Median time from cluster emergence to promotion | 38 days | [29, 51] |
Reading. Promotion precision (0.78 ± 0.09) is the operational headline: about 22% of clusters surfaced for human review never become new intents, which is the labeling cost the pipeline imposes on the team. We bound this overhead at ≤ 25% as a soft SLA.
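The intervals above are 95% bootstrap CIs; a minimal sketch of a percentile bootstrap over per-cluster (or per-event) observations, assuming the raw samples are available:

```python
import numpy as np

def bootstrap_ci(values: np.ndarray, stat=np.mean, n_boot: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI: resample with replacement, recompute the statistic,
    and take the (alpha/2, 1 - alpha/2) quantiles of the bootstrap distribution."""
    rng = np.random.default_rng(seed)
    stats = np.array([stat(rng.choice(values, size=len(values), replace=True))
                      for _ in range(n_boot)])
    return float(np.quantile(stats, alpha / 2)), float(np.quantile(stats, 1 - alpha / 2))

# e.g. lo, hi = bootstrap_ci(np.array(audited_purities))                   # purity row
#      lo, hi = bootstrap_ci(np.array(days_to_promotion), stat=np.median)  # latency row
```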
Failure-Mode Tree for the Discovery Pipeline
```mermaid
flowchart TD
    A["Weekly cluster review"] --> B{"Symptom?"}
    B -->|"promotion precision ↓ ≥ 5pp"| C["Audit cluster naming consistency; raise min_cluster_size"]
    B -->|"capture rate ↓ ≥ 5pp"| D["Re-tune HDBSCAN params on last 30 days"]
    B -->|"> 30% of new clusters duplicate existing intents"| E["Add semantic dedup gate via cosine to existing intent centroids"]
    B -->|"review queue length > 100"| F["Tighten NIS threshold; raise business_pain weight"]
    B -->|"one cluster persists ≥ 8 weeks unreviewed"| G["Force review; escalate to PM"]
    C --> H["Re-run sensitivity analysis; confirm weights"]
    D --> I["Re-evaluate against last 90-day held-out promotion set"]
```
Research Notes — discovery. Citations: Campello 2013 (PAKDD) — HDBSCAN; McInnes 2018 (arXiv) — UMAP; Lin 2020 (NAACL) — discovering new intents from utterances; Zhang 2021 (ACL) — open intent discovery; Vaze 2022 (CVPR) — generalized category discovery.
Open Problems
- Operational vs. semantic distinctness. Two clusters can be semantically valid but route to the same downstream service ("ask for refund" vs. "ask about refund policy" both go to the returns flow). The current NIS does not penalize semantically-distinct-but-operationally-redundant clusters. Open question: add an operational distinctness term that downweights clusters whose downstream action overlaps an existing intent's action ≥ 90%.
- Streaming clustering. HDBSCAN is a batch algorithm; we re-cluster nightly. As traffic grows, this becomes a 30+ min job. Streaming variants (DenStream, BIRCH+UMAP) trade purity for latency. Open question: which streaming algorithm preserves the 0.83 purity floor at 10× traffic?
- Cluster naming via retrieval. Today, cluster names are human-assigned. With a corpus of FAQs / past tickets / editorial guides, we could auto-suggest names via retrieval over the medoid representation. Open question: does this help reviewer throughput, or does it bias the reviewer toward existing language?
Bibliography (this file)
- Campello, R. J. G. B., Moulavi, D., Sander, J. (2013). Density-Based Clustering Based on Hierarchical Density Estimates. PAKDD. — HDBSCAN.
- McInnes, L., Healy, J., Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection. arXiv:1802.03426.
- Ester, M., Kriegel, H.-P., Sander, J., Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters. KDD. — DBSCAN.
- Lloyd, S. P. (1982). Least Squares Quantization in PCM. IEEE TIT. — k-means.
- Ng, A. Y., Jordan, M. I., Weiss, Y. (2002). On Spectral Clustering. NeurIPS.
- Rodriguez, A., Laio, A. (2014). Clustering by Fast Search and Find of Density Peaks. Science.
- Lin, T.-E., Xu, H., Zhang, H. (2020). Discovering New Intents via Constrained Deep Adaptive Clustering. NAACL. — directly inspires our pipeline.
- Zhang, H., Xu, H., Lin, T.-E., Lyu, R. (2021). Discovering New Intents with Deep Aligned Clustering. AAAI.
- Vaze, S., Han, K., Vedaldi, A., Zisserman, A. (2022). Generalized Category Discovery. CVPR. — modern setup for our problem.
- Saltelli, A. et al. (2010). Variance based sensitivity analysis (Sobol). Comput. Phys. Comm. — Sobol indices.
Citation count for this file: 10.