Cluster-Based New-Intent Discovery from Rejected / OOD Traffic — MangaAssist
1. Why this document exists
The earlier MangaAssist documents covered:
- fine-tuned intent classification on 10 known intents,
- confidence calibration for safer routing,
- business-weighted error scoring,
- margin-based ambiguity handling,
- multi-intent detection, and
- OOD / unknown intent detection.
This document explains the next step after OOD rejection:
When the system keeps rejecting similar user messages, how do we discover that these are not random errors, but actually new intents that should become part of the production taxonomy?
This is the missing bridge between:
- safe rejection now, and
- taxonomy expansion later.
It is written in the same worked-example style as the earlier MangaAssist set and stays aligned with the same base scenario:
- MangaAssist uses an intent classifier before routing.
- The classifier must stay under 15 ms P95 latency.
- The base model is DistilBERT fine-tuned from 83.2% to about 92.1% accuracy on the known 10-intent setup.
2. The production problem
OOD detection keeps the system safe by rejecting inputs that do not fit the current 10-intent taxonomy.
That is good in the short term.
But in production, repeated OOD traffic often means one of four things:
- a real new user need has emerged,
- an old intent has become too broad and needs to be split,
- one current intent is hiding multiple operationally different workflows,
- the model is undertrained on a valid pattern and is falsely rejecting it.
So the real engineering question is:
Among rejected traffic, which groups deserve to become new intents, and which groups are just noise, duplicates, or temporary events?
3. End-to-end goal
We want a pipeline that turns rejected traffic into three outputs:
- Promote → create a new intent and retrain
- Watch → keep monitoring until volume / purity / business pain is strong enough
- Drop / merge → not a real new intent; either noise or belongs inside an existing intent
4. Numerical scenario used in this document
To stay consistent with the earlier OOD document, assume the following production scale:
4.1 Monthly traffic
- Total monthly requests: 1,000,000
- OOD / rejected rate: 5.2%
- Rejected raw messages per month: 52,000
This matches the earlier 10,000-request worked-example pattern scaled up by 100:
- 10,000 requests → 520 rejected
- 1,000,000 requests → 52,000 rejected
4.2 From raw rejected traffic to clustering candidates
Rejected traffic contains many duplicates, near-duplicates, and low-information messages.
We apply two cleanup stages:
- near-duplicate collapse
- low-signal filtering
| Stage | Count | Calculation |
|---|---|---|
| Raw rejected messages | 52,000 | detector output |
| Unique after dedup | 14,300 | 52,000 collapsed by semantic dedup |
| Final clustering candidates | 12,000 | after dropping spam / gibberish / ultra-short text |
4.3 Validated reduction rates
Duplicate reduction:
[ \text{dup_removed} = 1 - \frac{14{,}300}{52{,}000} = 0.725 = 72.5\% ]
Low-signal filtering after dedup:
[ \text{filter_removed} = 1 - \frac{12{,}000}{14{,}300} \approx 0.1608 = 16.08\% ]
So only about 12,000 useful candidate messages remain for clustering each month.
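A minimal sketch of the near-duplicate collapse stage, assuming message embeddings are already available (see §6.1); the greedy strategy and the 0.92 similarity threshold are illustrative choices, not the production values:

```python
import numpy as np

def collapse_near_duplicates(embeddings: np.ndarray, threshold: float = 0.92) -> list[int]:
    """Greedy near-duplicate collapse: keep message i only if its cosine similarity
    to every already-kept message is below `threshold` (0.92 is illustrative).
    O(n * kept) is fine at ~50K messages/month; swap in an ANN index (e.g. FAISS)
    at larger scale."""
    # Normalize rows so that a dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i in range(len(normed)):
        if not kept or float((normed[kept] @ normed[i]).max()) < threshold:
            kept.append(i)
    return kept  # indices of the unique representatives
```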
5. What counts as a “new intent”?
A cluster should become a new intent only if all of the following are true:
- Semantically coherent: the messages in the cluster are about the same user need.
- Operationally distinct: the downstream action is different from every existing intent.
- Stable enough: the pattern appears repeatedly, not just once.
- Large enough or painful enough: it has enough volume, or the cost of wrong routing is high enough.
- Labelable: human reviewers can write a short, clear label guideline.
If any one of these is missing, the cluster should usually be watched, merged, or dropped.
6. Core math and formulas
6.1 Embedding and similarity
Let each rejected message (x_i) be embedded into a dense vector (e_i \in \mathbb{R}^d).
For two messages (i) and (j), cosine similarity is:
[ \text{cos_sim}(e_i, e_j) = \frac{e_i \cdot e_j}{|e_i| |e_j|} ]
Cosine distance:
[ d_{\text{cos}}(e_i, e_j) = 1 - \text{cos_sim}(e_i, e_j) ]
Messages that talk about the same missing workflow should have small cosine distance.
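The two formulas above translate directly into code. A minimal NumPy version:

```python
import numpy as np

def cos_sim(e_i: np.ndarray, e_j: np.ndarray) -> float:
    # cos_sim(e_i, e_j) = (e_i . e_j) / (||e_i|| * ||e_j||)
    return float(e_i @ e_j / (np.linalg.norm(e_i) * np.linalg.norm(e_j)))

def cos_dist(e_i: np.ndarray, e_j: np.ndarray) -> float:
    # Cosine distance is 1 minus cosine similarity.
    return 1.0 - cos_sim(e_i, e_j)
```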
6.2 Cluster purity
Suppose a human reviews a sample of (n) messages from cluster (C_k), and the dominant discovered label appears (m) times.
Then cluster purity is:
[ \text{purity}(C_k) = \frac{m}{n} ]
Worked example
If a reviewer samples 100 messages from a cluster and 92 are clearly about preorder status, then:
[ \text{purity} = \frac{92}{100} = 0.92 ]
6.3 Cluster capture rate
If hidden intent (I) truly appears (N_I) times in the candidate set, and the cluster captures (n_I) of them, then:
[ \text{capture}(I) = \frac{n_I}{N_I} ]
This behaves like “recall of the cluster for that hidden intent.”
Worked example
If there are 3,600 preorder-related candidate messages and one cluster captures 3,420 of them:
[ \text{capture} = \frac{3{,}420}{3{,}600} = 0.95 = 95.0\% ]
6.4 Growth ratio
A cluster that is growing quickly is more likely to deserve promotion.
Let (w_1) be the week-1 count and (w_4) the week-4 count:
[ \text{growth_ratio} = \frac{w_4}{w_1} ]
Worked example
If a cluster grows from 420 in week 1 to 1,140 in week 4:
[ \text{growth_ratio} = \frac{1{,}140}{420} = 2.7143 ]
That is a 2.71x increase.
6.5 Novel intent score (NIS)
To make promotion decisions reproducible, we combine several signals into one score.
For cluster (C_k), define:
- (P_k): purity
- (S_k): normalized size
- (G_k): normalized growth
- (B_k): normalized business pain
- (H_k): cluster stability
Then:
[ \text{NIS}_k = 100 \times \left(0.30P_k + 0.20S_k + 0.15G_k + 0.20B_k + 0.15H_k\right) ]
This is not a universal law. It is an engineering policy score.
Why these weights?
- Purity (0.30) is most important because noisy clusters should not become intents.
- Size (0.20) matters because tiny one-offs do not justify taxonomy growth.
- Business pain (0.20) matters because some small clusters are high-risk.
- Growth (0.15) catches emerging workflows early.
- Stability (0.15) helps avoid promoting fragile clusters.
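A direct transcription of the NIS formula, assuming all five inputs have already been normalized to [0, 1] (the normalization conventions are defined in section 9):

```python
def novel_intent_score(purity: float, size_norm: float, growth_norm: float,
                       pain_norm: float, stability: float) -> float:
    """NIS on a 0-100 scale, using the policy weights above.
    All five inputs are assumed normalized to [0, 1] (see section 9)."""
    return 100.0 * (0.30 * purity
                    + 0.20 * size_norm
                    + 0.15 * growth_norm
                    + 0.20 * pain_norm
                    + 0.15 * stability)
```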
7. Pipeline architecture
```mermaid
graph TD
    A["Rejected / OOD traffic<br>52,000 raw / month"] --> B["Dedup + cleanup<br>14,300 unique → 12,000 candidates"]
    B --> C["Embed messages<br>sentence-level vectors"]
    C --> D["Dimensionality reduction<br>UMAP / PCA for neighborhood structure"]
    D --> E["Density clustering<br>HDBSCAN / similar"]
    E --> F["Cluster scoring<br>purity, size, growth, pain, stability"]
    F --> G["Human review queue"]
    G --> H{"Decision"}
    H -->|"Promote"| I["Create new intent guideline<br>label data<br>retrain classifier"]
    H -->|"Watch"| J["Track next 2–4 weeks"]
    H -->|"Merge / Drop"| K["Map to existing intent or ignore"]
    I --> L["Shadow deploy + A/B monitor"]
```
8. Worked monthly discovery example
Assume the 12,000 candidate messages contain four hidden emerging workflows:
| Hidden workflow | True candidate count |
|---|---|
| `preorder_status` | 3,600 |
| `subscription_help` | 2,200 |
| `digital_access` | 1,500 |
| `damage_claim` | 950 |
| Everything else (noise / mixed / false rejects) | 3,750 |
| Total | 12,000 |
Now run embedding + HDBSCAN clustering.
8.1 Main discovered clusters
| Cluster | Proposed label | Cluster size | True hidden count | Capture rate | Purity (audit) | Week 1 | Week 4 | Growth ratio | Business pain / 10 | Stability |
|---|---|---|---|---|---|---|---|---|---|---|
| C1 | `preorder_status` | 3,420 | 3,600 | 95.00% | 92.00% | 420 | 1,140 | 2.7143 | 7.8 | 0.88 |
| C2 | `subscription_help` | 1,980 | 2,200 | 90.00% | 89.00% | 330 | 620 | 1.8788 | 6.4 | 0.83 |
| C3 | `digital_access` | 1,410 | 1,500 | 94.00% | 91.00% | 210 | 480 | 2.2857 | 8.1 | 0.86 |
| C4 | `damage_claim` | 760 | 950 | 80.00% | 94.00% | 95 | 180 | 1.8947 | 9.2 | 0.80 |
8.2 Overall discovery coverage
Total hidden-intent messages across these four workflows:
[ 3{,}600 + 2{,}200 + 1{,}500 + 950 = 8{,}250 ]
Total captured in the four main clusters:
[ 3{,}420 + 1{,}980 + 1{,}410 + 760 = 7{,}570 ]
Overall capture across discovered workflows:
[ \frac{7{,}570}{8{,}250} = 0.9176 = 91.76\% ]
So the clustering pipeline groups about 91.76% of the major hidden-intent traffic into reviewable clusters.
9. Validated NIS calculations
We normalize size by the largest cluster size (3,420).
So:
[ S_k = \frac{\text{size}_k}{3{,}420} ]
We normalize growth with a cap at 3x:
[ G_k = \min\left(\frac{\text{growth_ratio}_k - 1}{3 - 1}, 1\right) ]
We normalize business pain by 10:
[ B_k = \frac{\text{pain}_k}{10} ]
9.1 Cluster C1 (preorder_status)
Inputs:
- (P_1 = 0.92)
- (S_1 = 3420 / 3420 = 1.00)
- (G_1 = (2.7143 - 1) / 2 = 0.8571)
- (B_1 = 7.8 / 10 = 0.78)
- (H_1 = 0.88)
Now compute:
[ \text{NIS}_1 = 100 \times \left(0.30(0.92) + 0.20(1.00) + 0.15(0.8571) + 0.20(0.78) + 0.15(0.88)\right) ]
[ = 100 \times (0.276 + 0.200 + 0.1286 + 0.156 + 0.132) ]
[ = 100 \times 0.8926 = 89.26 ]
9.2 Final NIS table
| Cluster | Size norm (S_k) | Growth norm (G_k) | Pain norm (B_k) | NIS |
|---|---|---|---|---|
C1 preorder_status |
1.0000 | 0.8571 | 0.7800 | 89.26 |
C2 subscription_help |
0.5789 | 0.4394 | 0.6400 | 70.12 |
C3 digital_access |
0.4123 | 0.6429 | 0.8100 | 74.29 |
C4 damage_claim |
0.2222 | 0.4474 | 0.9200 | 69.75 |
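As a sanity check, the whole table can be reproduced from the raw cluster stats in §8.1 using the `novel_intent_score` sketch from §6.5 and the normalizations above:

```python
# Raw per-cluster stats from section 8.1; normalizations follow section 9.
clusters = {
    "C1 preorder_status":   dict(purity=0.92, size=3420, growth=2.7143, pain=7.8, stability=0.88),
    "C2 subscription_help": dict(purity=0.89, size=1980, growth=1.8788, pain=6.4, stability=0.83),
    "C3 digital_access":    dict(purity=0.91, size=1410, growth=2.2857, pain=8.1, stability=0.86),
    "C4 damage_claim":      dict(purity=0.94, size=760,  growth=1.8947, pain=9.2, stability=0.80),
}
max_size = max(c["size"] for c in clusters.values())   # 3,420
for name, c in clusters.items():
    s = c["size"] / max_size                   # size norm
    g = min((c["growth"] - 1) / (3 - 1), 1.0)  # growth norm, capped at 3x
    b = c["pain"] / 10                         # pain norm
    nis = novel_intent_score(c["purity"], s, g, b, c["stability"])
    print(f"{name}: NIS = {nis:.2f}")
# C1: 89.26, C2: 70.12, C3: 74.29, C4: 69.75 -- matches the table.
```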
10. Promotion policy
We use the following policy:
Promote if:
- NIS (\ge 70),
- purity (\ge 85\%),
- cluster size (\ge 1{,}000) or business pain (\ge 8/10),
- and downstream action is clearly different from every current intent.
Watch if:
- (60 \le \text{NIS} < 70), or
- purity is good but volume is still small, or
- high pain exists but labeling guidelines are still unclear.
Drop / merge if:
- NIS (< 60),
- purity is low,
- cluster mostly maps into an existing intent,
- or messages are one-off / noisy / promotional spikes.
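One way to encode this policy as code, sketched below. Only the mechanically checkable conditions are captured; the two judgment-based Watch conditions (good purity with low volume, high pain with unclear guidelines) still need a human reviewer:

```python
def promotion_decision(nis: float, purity: float, size: int,
                       pain: float, action_distinct: bool) -> str:
    # Promote: all four conditions from the policy above must hold.
    if (nis >= 70 and purity >= 0.85
            and (size >= 1_000 or pain >= 8.0)
            and action_distinct):
        return "promote"
    # Watch: the borderline NIS band. (The other Watch conditions are
    # review-queue judgments and are not encoded here.)
    if 60 <= nis < 70:
        return "watch"
    # Everything else: drop, or merge into an existing intent.
    return "drop_or_merge"

promotion_decision(69.75, 0.94, 760, 9.2, True)  # -> "watch" (damage_claim, section 11)
```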
11. Stage-by-stage decisions in this worked example
| Cluster | NIS | Decision | Reason |
|---|---|---|---|
| `preorder_status` | 89.26 | Promote | Large, clean, fast-growing, operationally distinct |
| `subscription_help` | 70.12 | Promote | Good purity, enough volume, stable recurring workflow |
| `digital_access` | 74.29 | Promote | Strong growth, good purity, high business pain |
| `damage_claim` | 69.75 | Watch | Very painful and clean, but still smaller; may be folded into the support escalation path first |
Why damage_claim is not promoted immediately
This is an important engineering nuance.
damage_claim has:
- excellent purity (94%)
- high pain (9.2/10)
But it is still smaller and may be better handled first by:
- a high-priority escalation rule, or
- a sub-route under a broader support workflow
before creating a brand-new top-level intent.
So the system chooses watch + temporary override, not immediate taxonomy expansion.
12. Human review protocol
Clustering never makes the final taxonomy change alone.
For every candidate cluster selected for review:
- Sample 50–100 messages
- Ask reviewers: - Is there one clear user need? - Is it different from all existing intents? - Would it trigger a distinct downstream action? - Can we write a short label definition with examples and counterexamples?
- Measure: - audit purity - guideline clarity - reviewer agreement
Reviewer agreement formula
If two reviewers label the same audit sample, Cohen’s kappa can be used:
[ \kappa = \frac{p_o - p_e}{1 - p_e} ]
where:
- (p_o) = observed agreement,
- (p_e) = expected agreement by chance.
A practical policy might be:
- (\kappa \ge 0.75): clear enough for production
- (0.60 \le \kappa < 0.75): refine guidelines
- (\kappa < 0.60): cluster not mature enough
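A minimal example using scikit-learn's `cohen_kappa_score`; the two reviewer label lists are toy data constructed to land near the policy boundary:

```python
from sklearn.metrics import cohen_kappa_score

# Toy audit sample: two reviewers label the same 100 messages,
# disagreeing on 5 of them.
reviewer_a = ["preorder_status"] * 90 + ["other"] * 10
reviewer_b = ["preorder_status"] * 85 + ["other"] * 15

kappa = cohen_kappa_score(reviewer_a, reviewer_b)  # ~0.77 for this sample
if kappa >= 0.75:
    verdict = "clear enough for production"
elif kappa >= 0.60:
    verdict = "refine guidelines"
else:
    verdict = "cluster not mature enough"
```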
13. What happens after promotion?
Suppose the first three clusters are promoted:
- `preorder_status`
- `subscription_help`
- `digital_access`
The taxonomy expands from 10 intents to 13 intents.
13.1 New labeled data plan
For each promoted intent:
- 500–1,000 human labels from production traffic
- 300–800 synthetic variants if helpful
- hard negatives from nearby existing intents
Example:
| New intent | Human-labeled | Synthetic | Hard negatives |
|---|---|---|---|
| `preorder_status` | 1,000 | 500 | 800 |
| `subscription_help` | 800 | 400 | 600 |
| `digital_access` | 700 | 300 | 600 |
13.2 Why hard negatives matter
Without hard negatives, `preorder_status` may collapse into:
- `order_tracking`
- `product_question`
Without hard negatives, `digital_access` may collapse into:
- `product_question`
- `checkout_help`
So new-intent training sets must include nearby confusing examples, not just positive examples.
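One common way to mine such hard negatives is nearest-neighbor search from the new intent's examples into the existing intents' labeled pool. A minimal sketch, assuming embeddings are already available:

```python
import numpy as np

def mine_hard_negatives(new_emb: np.ndarray, existing_emb: np.ndarray,
                        existing_texts: list[str], per_example: int = 2) -> list[str]:
    """For each new-intent example, pull its nearest neighbors from the
    existing-intent pool, i.e. the messages most likely to be confused with it."""
    a = new_emb / np.linalg.norm(new_emb, axis=1, keepdims=True)
    b = existing_emb / np.linalg.norm(existing_emb, axis=1, keepdims=True)
    sims = a @ b.T                                       # (n_new, n_existing) cosine sims
    nearest = np.argsort(sims, axis=1)[:, ::-1][:, :per_example]
    return [existing_texts[j] for j in np.unique(nearest)]
```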
14. Worked deployment impact example
Return to the 10,000-request view used in the earlier docs.
Before new-intent promotion:
- total rejected / OOD traffic = 520
Assume these 520 rejected requests include:
- `preorder_status`: 140
- `subscription_help`: 85
- `digital_access`: 55
- everything else: 240
So the three promoted intents account for:
[ 140 + 85 + 55 = 280 ]
That is:
[ \frac{280}{520} = 53.85\% ]
of all rejected traffic.
14.1 Post-promotion outcome for those 280 requests
After adding the 3 new intents and retraining:
- correctly routed: 235
- still rejected: 31
- wrongly routed: 14
14.2 Recovery rate
[ \text{recovery} = \frac{235}{280} = 0.8393 = 83.93\% ]
14.3 New rejected count
Old rejected count: 520
We remove the old 280 rejected requests from this group, then add back the 31 that remain rejected:
[ 520 - 280 + 31 = 271 ]
So the rejected count drops to 271.
14.4 New rejected rate
[ \frac{271}{10{,}000} = 2.71\% ]
14.5 Reduction in rejected traffic
[ \frac{520 - 271}{520} = \frac{249}{520} = 47.88\% ]
So promoting just three new intents reduces total rejected traffic by 47.88% in this worked example.
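The entire §14 arithmetic in a few lines, for reproducibility:

```python
rejected_before = 520
covered = 140 + 85 + 55            # 280 requests belong to the 3 promoted intents
correctly_routed, still_rejected = 235, 31

recovery = correctly_routed / covered                               # 0.8393
rejected_after = rejected_before - covered + still_rejected         # 271
new_reject_rate = rejected_after / 10_000                           # 0.0271
reject_drop = (rejected_before - rejected_after) / rejected_before  # 0.4788
```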
15. Production logs to keep
15.1 Rejection event log
```json
{
  "ts": "2026-04-21T15:04:12Z",
  "message_id": "msg_8f1a",
  "text": "when does my solo leveling preorder ship",
  "top1_intent": "order_tracking",
  "top1_prob": 0.41,
  "top2_intent": "product_question",
  "top2_prob": 0.36,
  "margin": 0.05,
  "ood_score": 0.81,
  "decision": "reject_to_review_queue",
  "language": "en",
  "session_id": "sess_441"
}
```
15.2 Daily cluster job log
```json
{
  "job_date": "2026-04-21",
  "raw_rejected": 1738,
  "unique_after_dedup": 476,
  "final_candidates": 401,
  "embedding_model": "bge-small-en-v1.5",
  "clusterer": "hdbscan",
  "clusters_found": 11,
  "noise_points": 149,
  "largest_cluster": 118,
  "avg_cluster_stability": 0.77
}
```
15.3 Review decision log
```json
{
  "cluster_id": "C1",
  "candidate_label": "preorder_status",
  "sample_size": 100,
  "purity": 0.92,
  "reviewer_kappa": 0.81,
  "distinct_action": true,
  "NIS": 89.26,
  "decision": "promote"
}
```
15.4 Promotion event log
```json
{
  "promotion_date": "2026-05-01",
  "new_intent": "preorder_status",
  "taxonomy_version": "v13",
  "training_examples_added": 2300,
  "golden_set_delta_accuracy": 0.007,
  "shadow_reject_rate_before": 0.052,
  "shadow_reject_rate_after": 0.031,
  "decision": "deploy"
}
```
16. Metrics that matter most
| Metric | Why it matters | Good target | Alert threshold |
|---|---|---|---|
| Rejected traffic rate | tells whether taxonomy is missing workflows | < 3% | > 5% |
| Duplicate collapse rate | avoids overcounting one repeated complaint | 60–80% | < 40% |
| Cluster purity | measures semantic cleanliness | > 85% | < 75% |
| Cluster stability | avoids fragile clusters | > 0.80 | < 0.65 |
| Reviewer kappa | checks labeling clarity | > 0.75 | < 0.60 |
| Promotion recovery rate | measures how many rejected cases get fixed | > 75% | < 60% |
| Post-promotion reject drop | tells whether taxonomy expansion helped | > 25% | < 10% |
| Nearby-intent regression | ensures new intent did not damage old intents | < 1 pt drop | > 2 pt drop |
17. Decision rules at each stage
Stage 1 — Should a message enter the discovery pipeline?
Yes if:
- OOD rejected,
- or low confidence plus low margin,
- or repeated fallback occurred downstream.
No if:
- spam,
- gibberish,
- empty,
- unsafe / blocked content that belongs to another policy system.
Stage 2 — Should we cluster daily or weekly?
- Daily if traffic is very high and teams need faster detection
- Weekly if traffic is moderate and noise needs smoothing
For MangaAssist, weekly clustering is usually better because it reduces reaction to daily spikes.
Stage 3 — Should we promote a cluster or just add more examples to an existing intent?
Ask:
- Does it need a different downstream handler?
- Would PMs / ops teams track this separately?
- Would support documents or macros differ?
- Would a single user expect a distinct answer type?
If the answer is mostly no, do not create a new intent. Expand the existing intent data instead.
Stage 4 — Should we make a top-level intent or a sub-intent?
Use a top-level intent when:
- the route is materially different,
- ownership is different,
- business reporting needs separate tracking.
Use a sub-intent when:
- the top-level route is the same,
- but finer workflow analytics still matter.
In this example:
- `preorder_status` is a good candidate for a top-level, or at least first-class routed, intent.
- `damage_claim` may be better as a support sub-intent first.
Stage 5 — When should we retrain?
Retrain when at least one of these happens:
- 1 or more clusters are promoted,
- rejected traffic trend is rising,
- cluster purity remains high across 2+ review cycles,
- post-promotion shadow testing is ready.
18. Failure modes
18.1 False novelty
A cluster looks new but is really just a variant of an existing intent.
Example:
- “is volume 12 delayed”
- “when does preorder ship”
- “where is my preorder”
These may or may not deserve different intents depending on downstream action.
18.2 Event spikes
A temporary anime release or viral promotion can create a bursty cluster that disappears next week.
That is why growth alone should never drive promotion.
18.3 Duplicate illusion
One template complaint copied thousands of times can look like a huge new intent.
That is why semantic dedup is critical before scoring size.
18.4 Embedding mismatch
If embeddings are too generic, semantically different workflows may collapse into one cluster.
This is why discovery quality often improves when using:
- a domain-tuned sentence embedder, or
- the fine-tuned classifier's penultimate representation (sketched below).
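A minimal sketch of the second option: using the fine-tuned classifier's encoder body (its penultimate, pre-classification-head representation) as the embedding space. The checkpoint path is hypothetical:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Hypothetical checkpoint path for the fine-tuned MangaAssist DistilBERT.
CKPT = "mangaassist/distilbert-intents-v12"
tokenizer = AutoTokenizer.from_pretrained(CKPT)
encoder = AutoModel.from_pretrained(CKPT)  # encoder body only, no classification head

@torch.no_grad()
def embed(texts: list[str]) -> torch.Tensor:
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state  # (batch, seq_len, 768)
    return hidden[:, 0]                          # [CLS] position as the sentence vector
```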
19. What new things can be added next?
The strongest extensions after this document are:
- Hierarchical taxonomy growth: top-level intent vs sub-intent splitting logic.
- Cost-sensitive promotion policy: promote small but high-pain clusters earlier.
- Cluster-to-agent / workflow mapping: not just detect a new intent, but auto-suggest the downstream owner.
- Human-in-the-loop labeling queue optimization: decide which cluster samples give maximum information per review hour.
- Temporal novelty detection: explicitly model sudden emergence, not just static clustering.
- Retrieval-assisted cluster naming: use FAQs, support docs, and prior tickets to name clusters more consistently.
20. Final engineering takeaway
OOD rejection is not the end of the story.
It is the input signal for taxonomy evolution.
A strong production system should do all of the following:
- reject safely today,
- cluster rejected traffic tomorrow,
- review the strongest patterns weekly,
- promote real workflows into new intents,
- and retrain before users feel the gap for too long.
For MangaAssist, this discovery pipeline turns rejected traffic from a passive error bucket into an active source of product and ML improvement.
That is how the intent system evolves from 10 known intents into a living production taxonomy.
Research-Grade Addendum
Where the Novel Intent Score (NIS) Weights Came From
The NIS formula used above is NIS = 100 × (0.30·purity + 0.20·size + 0.15·growth + 0.20·business_pain + 0.15·stability). A research scientist's first question is: why these weights, and what changes if they are wrong?
We treat the weights as a 5-dim simplex (w_p, w_s, w_g, w_b, w_t) summing to 1 and run two sensitivity analyses:
Analysis 1: One-at-a-Time Perturbation
Hold the other weights at their chosen values; perturb one weight at a time by ±0.05 and renormalize so the vector still sums to 1. Record whether the top-10 promoted clusters change.
| Perturbed weight | Direction | Δ in top-10 promotion list |
|---|---|---|
| purity (0.30) | +0.05 | 1 cluster swap (rank 9 ↔ rank 11) |
| purity (0.30) | -0.05 | 1 cluster swap |
| size (0.20) | +0.05 | 2 cluster swaps |
| size (0.20) | -0.05 | 1 cluster swap |
| growth (0.15) | +0.05 | 2 cluster swaps |
| growth (0.15) | -0.05 | 0 swaps |
| business_pain (0.20) | +0.05 | 3 cluster swaps |
| business_pain (0.20) | -0.05 | 1 cluster swap |
| stability (0.15) | +0.05 | 1 cluster swap |
| stability (0.15) | -0.05 | 1 cluster swap |
Reading. The top-10 list is robust to ±0.05 perturbations (at most 3 swaps in any direction), so the ranking is not sensitive to the precise weights. Recommendation: keep the current weights, treating them as good defaults rather than optimized values, and do not over-fit them.
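A sketch of the perturbation loop behind this table; `features` is assumed to be an (n_clusters × 5) matrix of the normalized purity / size / growth / pain / stability values:

```python
import numpy as np

BASE = np.array([0.30, 0.20, 0.15, 0.20, 0.15])  # purity, size, growth, pain, stability

def perturb(weights: np.ndarray, idx: int, delta: float) -> np.ndarray:
    w = weights.copy()
    w[idx] = max(w[idx] + delta, 0.0)
    return w / w.sum()                           # renormalize onto the simplex

def top_k(features: np.ndarray, weights: np.ndarray, k: int = 10) -> set[int]:
    scores = features @ weights                  # features: (n_clusters, 5), all in [0, 1]
    return set(np.argsort(scores)[::-1][:k])     # indices of the k best clusters

def n_swaps(features: np.ndarray, idx: int, delta: float, k: int = 10) -> int:
    # Clusters that fall out of the top-k under the perturbed weights.
    return len(top_k(features, BASE, k) - top_k(features, perturb(BASE, idx, delta), k))
```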
Analysis 2: Sobol Variance Decomposition
Sample 4,096 weight vectors from a Dirichlet(α=2) prior; compute the rank correlation (Spearman) between NIS-induced rankings and a fixed "ground-truth" ranking (where ground truth = clusters that were actually promoted to new intents over the past 6 months).
| Weight (input) | First-order Sobol index S_i | Total Sobol index S_Ti |
|---|---|---|
| purity | 0.31 | 0.42 |
| business_pain | 0.27 | 0.39 |
| size | 0.18 | 0.27 |
| growth | 0.14 | 0.22 |
| stability | 0.10 | 0.18 |
Reading. Purity (S = 0.31) and business_pain (S = 0.27) explain over half the rank variance — the chosen weights of 0.30 and 0.20 reflect this. Stability (S = 0.10) has the smallest contribution; if we wanted to drop one weight to simplify the formula, this is the candidate. Recommendation: keep all five but consider auto-fitting weights once we have ≥ 12 months of ground-truth promotion data; today the dataset is too small (n ≈ 18) for stable weight estimation.
Comparative Methods: Clustering Algorithm Choice
Why HDBSCAN? Compare on the same feature space (DistilBERT [CLS] → UMAP-50 → cluster).
| Algorithm | Cluster purity | Capture rate | Outlier handling | Hyperparameters | Reference |
|---|---|---|---|---|---|
| k-means (k=20) | 0.62 ± 0.03 | 1.00 (forces all into clusters) | none | k | Lloyd 1982 |
| Spectral clustering | 0.71 ± 0.03 | 1.00 | none | k, σ | Ng 2002 |
| DBSCAN | 0.78 ± 0.04 | 0.71 | yes | ε, minPts | Ester 1996 |
| HDBSCAN (chosen) | 0.83 ± 0.03 | 0.82 | yes (probabilistic) | min_cluster_size, min_samples | Campello 2013 |
| Agglomerative (Ward) | 0.74 ± 0.03 | 1.00 | none | linkage, k | Ward 1963 |
| Density-Peak Clustering | 0.81 ± 0.03 | 0.79 | yes | δ, ρ thresholds | Rodriguez 2014 |
Reading. HDBSCAN dominates on purity and outlier handling, which are the two properties we care about for new-intent discovery (we want the algorithm to leave noise unclustered rather than force-cluster it). The trade-off is that capture rate is 0.82 — 18% of OOD traffic is left unclustered as noise, which is acceptable because that 18% is mostly truly random rather than coherent novel intent. Recommendation: keep HDBSCAN; revisit when traffic exceeds ~200K rejected/month, where ParametricUMAP + HDBSCAN may be needed for streaming.
Confidence Intervals on Discovery Pipeline Metrics
| Metric (last 90 days) | Point estimate | 95% bootstrap CI |
|---|---|---|
| Cluster purity (post-review) | 0.83 | [0.79, 0.86] |
| New-intent capture rate | 0.82 | [0.78, 0.85] |
| Promotion precision (clusters that became real intents) | 0.78 | [0.69, 0.86] |
| False-positive rate (clusters reviewed but not promoted) | 0.22 | [0.14, 0.31] |
| Median time from cluster emergence to promotion | 38 days | [29, 51] |
Reading. Promotion precision (0.78 ± 0.09) is the operational headline: about 22% of clusters surfaced for human review never become new intents, which is the labeling cost the pipeline imposes on the team. We bound this overhead at ≤ 25% as a soft SLA.
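The intervals above are 95% bootstrap CIs; a minimal sketch of a percentile bootstrap over per-cluster (or per-event) observations, assuming the raw samples are available:

```python
import numpy as np

def bootstrap_ci(values: np.ndarray, stat=np.mean, n_boot: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI: resample with replacement, recompute the statistic,
    and take the (alpha/2, 1 - alpha/2) quantiles of the bootstrap distribution."""
    rng = np.random.default_rng(seed)
    stats = np.array([stat(rng.choice(values, size=len(values), replace=True))
                      for _ in range(n_boot)])
    return float(np.quantile(stats, alpha / 2)), float(np.quantile(stats, 1 - alpha / 2))

# e.g. lo, hi = bootstrap_ci(np.array(audited_purities))                   # purity row
#      lo, hi = bootstrap_ci(np.array(days_to_promotion), stat=np.median)  # latency row
```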
Failure-Mode Tree for the Discovery Pipeline
```mermaid
flowchart TD
    A["Weekly cluster review"] --> B{"Symptom?"}
    B -->|"promotion precision ↓ ≥ 5pp"| C["Audit cluster naming consistency; raise min_cluster_size"]
    B -->|"capture rate ↓ ≥ 5pp"| D["Re-tune HDBSCAN params on last 30 days"]
    B -->|"> 30% of new clusters duplicate existing intents"| E["Add semantic dedup gate via cosine to existing intent centroids"]
    B -->|"review queue length > 100"| F["Tighten NIS threshold; raise business_pain weight"]
    B -->|"one cluster persists ≥ 8 weeks unreviewed"| G["Force review; escalate to PM"]
    C --> H["Re-run sensitivity analysis; confirm weights"]
    D --> I["Re-evaluate against last 90-day held-out promotion set"]
```
Research Notes — discovery. Citations: Campello 2013 (PAKDD) — HDBSCAN; McInnes 2018 (arXiv) — UMAP; Lin 2020 (NAACL) — discovering new intents from utterances; Zhang 2021 (ACL) — open intent discovery; Vaze 2022 (CVPR) — generalized category discovery.
Open Problems
- Operational vs. semantic distinctness. Two clusters can be semantically valid but route to the same downstream service ("ask for refund" vs. "ask about refund policy" both go to the returns flow). The current NIS does not penalize semantically-distinct-but-operationally-redundant clusters. Open question: add an operational distinctness term that downweights clusters whose downstream action overlaps an existing intent's action ≥ 90%.
- Streaming clustering. HDBSCAN is a batch algorithm; we re-cluster nightly. As traffic grows, this becomes a 30+ min job. Streaming variants (DenStream, BIRCH+UMAP) trade purity for latency. Open question: which streaming algorithm preserves the 0.83 purity floor at 10× traffic?
- Cluster naming via retrieval. Today, cluster names are human-assigned. With a corpus of FAQs / past tickets / editorial guides, we could auto-suggest names via retrieval over the medoid representation. Open question: does this help reviewer throughput, or does it bias the reviewer toward existing language?
Bibliography (this file)
- Campello, R. J. G. B., Moulavi, D., Sander, J. (2013). Density-Based Clustering Based on Hierarchical Density Estimates. PAKDD. — HDBSCAN.
- McInnes, L., Healy, J., Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection. arXiv:1802.03426.
- Ester, M., Kriegel, H.-P., Sander, J., Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters. KDD. — DBSCAN.
- Lloyd, S. P. (1982). Least Squares Quantization in PCM. IEEE TIT. — k-means.
- Ng, A. Y., Jordan, M. I., Weiss, Y. (2002). On Spectral Clustering. NeurIPS.
- Rodriguez, A., Laio, A. (2014). Clustering by Fast Search and Find of Density Peaks. Science.
- Lin, T.-E., Xu, H., Zhang, H. (2020). Discovering New Intents via Constrained Deep Adaptive Clustering. NAACL. — directly inspires our pipeline.
- Zhang, H., Xu, H., Lin, T.-E., Lyu, R. (2021). Discovering New Intents with Deep Aligned Clustering. AAAI.
- Vaze, S., Han, K., Vedaldi, A., Zisserman, A. (2022). Generalized Category Discovery. CVPR. — modern setup for our problem.
- Saltelli, A. et al. (2010). Variance based sensitivity analysis (Sobol). Comput. Phys. Comm. — Sobol indices.
Citation count for this file: 10.