AWS AIP-C01 Task 4.2 → Skill 4.2.2: Optimize retrieval mechanisms for FM-augmented applications
System: MangaAssist e-commerce chatbot — Bedrock Claude 3 Sonnet/Haiku, OpenSearch Serverless (HNSW k-NN), DynamoDB, ECS Fargate, ElastiCache Redis
Format: 5 production scenarios, each with Problem → Detection → Root Cause → Resolution → Prevention and decision trees
Skill Mapping
| AWS AIP-C01 Element | Coverage |
| --- | --- |
| Task 4.2 | Optimize application performance for FM workloads |
| Skill 4.2.2 | Optimize retrieval mechanisms to improve FM-augmented application performance |
| This File | 5 operational scenarios covering vector search quality, Japanese tokenization, hybrid scoring, index scaling, and re-ranking latency |
| MangaAssist Context | Production troubleshooting for a RAG pipeline serving 100K+ manga products to Japanese-speaking customers with a < 200ms retrieval target |
Scenario Overview
| # | Scenario | Impact | Severity | Detection Time |
| --- | --- | --- | --- | --- |
| 1 | Vector search returning irrelevant manga | Incorrect recommendations → customer frustration | High | Minutes (quality metric drop) |
| 2 | Japanese query tokenization failure | Missed results for rare manga titles → zero-result pages | High | Hours (long-tail query analysis) |
| 3 | Hybrid search score fusion producing unexpected ranking | Popular titles ranked below obscure ones → CTR drop | Medium | Hours (A/B metric divergence) |
| 4 | Index performance degradation after catalog expansion | Search latency exceeds budget → slow page loads | High | Minutes (latency alarm) |
| 5 | Re-ranking latency exceeding budget | Total retrieval > 200ms → SLA violation | Critical | Seconds (p95 alarm) |
Scenario 1: Vector Search Returning Irrelevant Manga (Embedding Quality Degradation)
Problem Statement
MangaAssist's kNN vector search begins returning semantically irrelevant results. A customer searching for "dark psychological thriller manga" receives results like children's comedy titles and cooking manga. NDCG@5 drops from 0.85 to 0.58 over a two-week period.
Detection
```mermaid
graph TB
subgraph Signals["Detection Signals"]
style Signals fill:#e94560,stroke:#16213e,color:#fff
S1["NDCG@5 weekly eval<br/>drops below 0.75 threshold"]
S2["Customer feedback:<br/>'recommendations are wrong'"]
S3["CTR on search results<br/>drops from 18% to 11%"]
S4["kNN result diversity<br/>collapses — same titles<br/>appear for different queries"]
end
subgraph Metrics["CloudWatch Alarms"]
style Metrics fill:#0f3460,stroke:#16213e,color:#fff
M1["Alarm: ndcg_weekly < 0.75"]
M2["Alarm: search_ctr_7d < 0.14"]
M3["Alarm: zero_result_rate > 3%"]
end
S1 --> M1
S3 --> M2
S1 & S2 & S3 & S4 --> INVESTIGATE["Begin<br/>Investigation"]
```
Key monitoring query (CloudWatch Logs Insights):
```
fields @timestamp, query_text, knn_top1_score, knn_top5_avg_score, ndcg_at_5
| filter knn_top1_score < 0.70
| stats count() as low_quality_searches, avg(ndcg_at_5) as avg_ndcg by bin(1h) as hour
| sort hour desc
```
Root Cause Analysis — Decision Tree
```mermaid
graph TD
START["kNN returning<br/>irrelevant results"] --> CHECK_EMB["Check: Were embeddings<br/>recently re-indexed?"]
CHECK_EMB -->|Yes| EMB_MODEL["Check: Was the embedding<br/>model changed?"]
CHECK_EMB -->|No| CHECK_DATA["Check: Was new data<br/>bulk-loaded recently?"]
EMB_MODEL -->|Yes — model version changed| RC1["ROOT CAUSE 1:<br/>Embedding model mismatch<br/>Old docs: Titan v1<br/>New docs: Titan v2<br/>Vectors are incompatible"]
EMB_MODEL -->|No — same model| CHECK_PROMPT["Check: Was the embedding<br/>input format changed?"]
CHECK_PROMPT -->|Yes| RC2["ROOT CAUSE 2:<br/>Embedding input drift<br/>e.g., title-only → title+description<br/>changes vector space geometry"]
CHECK_PROMPT -->|No| CHECK_INDEX["Check: HNSW index params"]
CHECK_DATA -->|Yes — bulk load happened| CHECK_QUALITY["Check: Quality of<br/>new product data"]
CHECK_DATA -->|No| CHECK_DRIFT["Check: Query distribution<br/>shift over time"]
CHECK_QUALITY -->|Bad data: empty descriptions,<br/>wrong language, duplicates| RC3["ROOT CAUSE 3:<br/>Data quality degradation<br/>Garbage-in → garbage-out<br/>embeddings"]
CHECK_QUALITY -->|Data looks fine| CHECK_INDEX
CHECK_INDEX -->|ef_search too low| RC4["ROOT CAUSE 4:<br/>HNSW recall degradation<br/>ef_search insufficient for<br/>current index size"]
CHECK_INDEX -->|Params normal| CHECK_DRIFT
CHECK_DRIFT -->|Query patterns changed| RC5["ROOT CAUSE 5:<br/>Concept drift<br/>New manga genres/trends<br/>not represented in<br/>embedding training data"]
CHECK_DRIFT -->|Stable| ESCALATE["Escalate: Unknown<br/>root cause — deeper<br/>investigation needed"]
```
Resolution by Root Cause
RC1: Embedding Model Mismatch
| Step | Action | Command / Detail |
| --- | --- | --- |
| 1 | Identify mixed-model documents | Query for documents indexed before vs. after the model switch; check the embedding_model_version metadata field |
| 2 | Halt new indexing | Pause the DynamoDB-to-OpenSearch pipeline Lambda |
| 3 | Re-embed ALL documents with the new model | Run a batch Titan v2 embedding job for all 100K products (~17 min at a sustained 100 TPS; longer with throttling and retries) — see the sketch below |
| 4 | Bulk re-index | Use the _bulk API to replace all vectors in the manga-products index |
| 5 | Validate | Run the NDCG evaluation on the 500-query test set; confirm >= 0.85 |
| 6 | Resume pipeline | Re-enable the DynamoDB stream Lambda |
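A minimal sketch of steps 3-4, assuming the opensearch-py client and the Titan v2 model ID amazon.titan-embed-text-v2:0; the endpoint, auth, and load_products() loader are placeholders, and the field names mirror the scenario tables:

```python
import json
import boto3
from opensearchpy import OpenSearch, helpers

bedrock = boto3.client("bedrock-runtime", region_name="ap-northeast-1")
os_client = OpenSearch(hosts=["https://example-collection.aoss.amazonaws.com"])  # auth omitted

def embed(text: str) -> list[float]:
    """Embed one product text with Titan v2."""
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]

def reembed_actions(products):
    """Yield bulk update actions that overwrite each stale vector."""
    for p in products:
        yield {
            "_op_type": "update",
            "_index": "manga-products",
            "_id": p["manga_id"],
            "doc": {
                "embedding": embed(f'{p["title_ja"]} {p["description"]}'),
                "embedding_model_version": "titan-v2",  # enables the mismatch alarm
            },
        }

# load_products() is a hypothetical DynamoDB export reader
helpers.bulk(os_client, reembed_actions(load_products()), chunk_size=200)
```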
RC3: Data Quality Degradation
| Step | Action |
| --- | --- |
| 1 | Identify bad documents: empty description, non-Japanese text in title_ja, duplicate manga_id |
| 2 | Quarantine bad documents in a DynamoDB quarantine table |
| 3 | Delete bad vectors from the OpenSearch index |
| 4 | Fix the data pipeline: add a validation Lambda between DynamoDB and embedding generation |
| 5 | Re-run embedding for the corrected documents |
| 6 | Validate NDCG recovery |
RC4: HNSW Recall Degradation
| Step | Action |
| --- | --- |
| 1 | Check the current ef_search value (should be 256 for MangaAssist) |
| 2 | Run a recall benchmark: query 100 known-relevant pairs, measure recall@10 |
| 3 | If recall < 0.95, increase ef_search to 512 (at the cost of additional query latency; see the Scenario 4 benchmark table for the trade-off) |
| 4 | If recall is still poor, check whether the index needs a force-merge (fragmented segments) |
| 5 | If the index exceeds 200K docs on the current shard count, add shards (re-index required) |
Prevention
| Measure | Implementation |
| --- | --- |
| Embedding model version tracking | Store model_version as a metadata field on every document; alarm on version mismatch |
| Data quality gate | Validation Lambda checks: description length > 20 chars, language detection matches expected, no duplicate IDs |
| Weekly NDCG evaluation | Automated pipeline runs the 500-query eval set every Sunday; alarms if NDCG@5 < 0.80 |
| Embedding drift detection | Compare the average vector centroid monthly; alarm if centroid shift > 0.1 cosine distance (sketch below) |
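The drift check in the last row reduces to a cosine distance between monthly centroids. A minimal sketch, assuming vectors are sampled from the index into numpy arrays; the sampling and metric-publishing helpers are hypothetical:

```python
import numpy as np

DRIFT_THRESHOLD = 0.1  # cosine distance, per the prevention table

def centroid(vectors: np.ndarray) -> np.ndarray:
    """Mean embedding, L2-normalized so the cosine comparison is well-defined."""
    c = vectors.mean(axis=0)
    return c / np.linalg.norm(c)

def centroid_drift(prev_month: np.ndarray, this_month: np.ndarray) -> float:
    """Cosine distance between the two monthly centroids."""
    return 1.0 - float(np.dot(centroid(prev_month), centroid(this_month)))

# prev_vecs / curr_vecs: (n, 1536) arrays sampled from the index (hypothetical helper)
drift = centroid_drift(prev_vecs, curr_vecs)
if drift > DRIFT_THRESHOLD:
    emit_drift_alarm_metric(drift)  # hypothetical CloudWatch put_metric_data wrapper
```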
Scenario 2: Japanese Query Tokenization Failure for Rare Manga Titles
Problem Statement
Customers searching for rare or new manga titles in Japanese receive zero results or irrelevant results. Example: searching for "呪術廻戦0 東京都立呪術高等専門学校" (Jujutsu Kaisen 0: Tokyo Metropolitan Magic Technical College) returns nothing, while "呪術廻戦" (Jujutsu Kaisen) works fine. The issue affects ~5% of queries — specifically long compound titles and titles with unusual kanji combinations.
Detection
```mermaid
graph TB
subgraph Signals["Detection Signals"]
style Signals fill:#e94560,stroke:#16213e,color:#fff
S1["Zero-result rate spikes<br/>for Japanese queries<br/>from 1.2% to 5.8%"]
S2["Customer support tickets:<br/>'cannot find [specific title]'"]
S3["BM25 component returns 0<br/>while kNN returns some results"]
S4["Query length > 10 chars<br/>has 3x higher zero-result rate"]
end
S1 & S2 & S3 & S4 --> INVESTIGATE["Tokenization<br/>Investigation"]
```
Detection query:
```
fields @timestamp, query_text, bm25_result_count, knn_result_count, query_language
| filter query_language = "ja" and bm25_result_count = 0
| stats count() as zero_bm25 by bin(1d) as day
| sort day desc
```
Root Cause Analysis — Decision Tree
```mermaid
graph TD
START["Japanese query<br/>returns 0 BM25 results"] --> CHECK_EXIST["Check: Does the product<br/>exist in the index?"]
CHECK_EXIST -->|No| RC_MISSING["NOT A TOKENIZATION ISSUE:<br/>Product not yet indexed.<br/>Check DynamoDB → OpenSearch pipeline."]
CHECK_EXIST -->|Yes| CHECK_ANALYZE["Run _analyze API on<br/>the query text"]
CHECK_ANALYZE --> CHECK_TOKENS["Compare query tokens<br/>vs indexed tokens"]
CHECK_TOKENS -->|Tokens don't overlap| CHECK_KUROMOJI["Check: kuromoji<br/>dictionary version"]
CHECK_KUROMOJI -->|Default dict only| RC1["ROOT CAUSE 1:<br/>Missing custom dictionary<br/>Rare manga terms not in<br/>kuromoji default dictionary"]
CHECK_KUROMOJI -->|Custom dict present| CHECK_COMPOUND["Check: Is the title<br/>a compound word?"]
CHECK_COMPOUND -->|Yes — long compound| RC2["ROOT CAUSE 2:<br/>Over-segmentation<br/>kuromoji splits the title<br/>into too many fragments<br/>that don't match indexed form"]
CHECK_COMPOUND -->|No| CHECK_READING["Check: Reading form<br/>normalization"]
CHECK_READING -->|Katakana vs Hiragana<br/>mismatch| RC3["ROOT CAUSE 3:<br/>Script normalization failure<br/>Query in カタカナ but index<br/>stores ひらがな form"]
CHECK_READING -->|Same script| CHECK_FULLWIDTH["Check: Fullwidth vs<br/>halfwidth characters"]
CHECK_FULLWIDTH -->|Mismatch detected| RC4["ROOT CAUSE 4:<br/>Unicode normalization gap<br/>Fullwidth numbers/letters<br/>not normalized to halfwidth"]
CHECK_FULLWIDTH -->|No mismatch| RC5["ROOT CAUSE 5:<br/>Analyzer configuration error<br/>Wrong analyzer assigned to field"]
```
Resolution by Root Cause
RC1: Missing Custom Dictionary Entries
The kuromoji tokenizer's built-in dictionary does not include many manga-specific terms, especially newer titles and genre terms.
| Step | Action | Detail |
| --- | --- | --- |
| 1 | Identify failing terms | Collect all zero-result queries from the last 30 days; run them through the _analyze API |
| 2 | Build a custom user dictionary | Create user_dictionary.txt with manga titles, author names, and genre terms |
| 3 | Update the analyzer config | Attach the user dictionary to the kuromoji_tokenizer (user_dictionary or user_dictionary_rules); see the sketch after the dictionary example |
| 4 | Re-index | A full re-index is required to apply the new analyzer to existing documents |
| 5 | Validate | Re-run the failing queries; confirm non-zero BM25 results |
Custom dictionary format (user_dictionary.txt):
```
呪術廻戦,呪術廻戦,ジュジュツカイセン,カスタム名詞
鬼滅の刃,鬼滅の刃,キメツノヤイバ,カスタム名詞
進撃の巨人,進撃の巨人,シンゲキノキョジン,カスタム名詞
東京都立呪術高等専門学校,東京都立呪術高等専門学校,トウキョウトリツジュジュツコウトウセンモンガッコウ,カスタム名詞
チェンソーマン,チェンソーマン,チェンソーマン,カスタム名詞
```
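One way to wire this dictionary into the index, sketched with opensearch-py. Whether the deployment accepts a packaged file (user_dictionary) or inline entries (user_dictionary_rules) varies, so treat the settings body below as an assumption to adapt, not a verified Serverless configuration:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://example-collection.aoss.amazonaws.com"])  # auth omitted

index_body = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "manga_kuromoji": {
                    "type": "kuromoji_tokenizer",
                    # same CSV format as user_dictionary.txt, inlined
                    "user_dictionary_rules": [
                        "呪術廻戦,呪術廻戦,ジュジュツカイセン,カスタム名詞",
                        "東京都立呪術高等専門学校,東京都立呪術高等専門学校,トウキョウトリツジュジュツコウトウセンモンガッコウ,カスタム名詞",
                    ],
                }
            },
            "analyzer": {
                "manga_ja": {
                    "type": "custom",
                    "char_filter": ["icu_normalizer"],  # NFKC before tokenization (see RC4)
                    "tokenizer": "manga_kuromoji",
                    "filter": ["kuromoji_baseform", "ja_stop", "lowercase"],
                }
            },
        }
    },
    "mappings": {"properties": {"title_ja": {"type": "text", "analyzer": "manga_ja"}}},
}
client.indices.create(index="manga-products-v2", body=index_body)

# Spot-check tokenization before committing to the full re-index
client.indices.analyze(
    index="manga-products-v2",
    body={"analyzer": "manga_ja", "text": "東京都立呪術高等専門学校"},
)
```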
RC2: Over-segmentation of Compound Titles
| Step |
Action |
| 1 |
Test with _analyze API: check how kuromoji tokenizes the compound title |
| 2 |
If over-segmented, add the full compound as a custom dictionary entry (see RC1) |
| 3 |
Additionally, add a shingle filter to create bigrams/trigrams that capture partial compounds |
| 4 |
For the BM25 query, reduce minimum_should_match from 30% to 20% for long queries (> 8 tokens) |
RC4: Unicode Normalization Gap
| Step | Action |
| --- | --- |
| 1 | Add the icu_normalizer char filter to the analyzer chain (before the kuromoji tokenizer) |
| 2 | Ensure NFKC normalization is applied to both queries and indexed text |
| 3 | Update the QueryPreprocessor to apply NFKC normalization before sending queries to OpenSearch (sketch below) |
| 4 | Re-index with the updated analyzer |
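The query-side half (step 3) needs only the standard library. A minimal sketch; QueryPreprocessor is the component named above, and its internals here are an assumption:

```python
import unicodedata

def normalize_query(text: str) -> str:
    """NFKC folds fullwidth digits/letters to halfwidth and normalizes spaces."""
    return unicodedata.normalize("NFKC", text).strip()

# Fullwidth "２３" and the ideographic space both normalize away:
assert normalize_query("鬼滅の刃　２３巻") == "鬼滅の刃 23巻"
```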
Prevention
| Measure | Implementation |
| --- | --- |
| Custom dictionary maintenance | Monthly review of zero-result queries; add new manga titles/terms to the custom dictionary |
| Automated tokenization tests | CI pipeline runs 200 known manga titles through the _analyze API and verifies the expected tokens (sketch below) |
| Fallback to kNN | If BM25 returns 0 results, automatically fall back to kNN-only search (already implemented in the hybrid pipeline) |
| Query normalization | Apply NFKC + script normalization in the QueryPreprocessor before any search operation |
| New title onboarding | When a new manga is added to the catalog, its title is automatically added to the custom dictionary queue |
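A sketch of the automated tokenization test from the table, assuming pytest, the manga_ja analyzer from RC1, and an illustrative expectations list:

```python
import pytest
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://example-collection.aoss.amazonaws.com"])  # auth omitted

# (title, tokens that must survive analysis); illustrative entries
EXPECTED = [
    ("呪術廻戦", ["呪術廻戦"]),
    ("東京都立呪術高等専門学校", ["東京都立呪術高等専門学校"]),
]

@pytest.mark.parametrize("title,must_contain", EXPECTED)
def test_title_tokenization(title, must_contain):
    resp = client.indices.analyze(
        index="manga-products", body={"analyzer": "manga_ja", "text": title}
    )
    tokens = [t["token"] for t in resp["tokens"]]
    for tok in must_contain:
        assert tok in tokens, f"{title!r} lost expected token {tok!r}"
```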
Scenario 3: Hybrid Search Score Fusion Producing Unexpected Ranking
Problem Statement
After switching from weighted linear fusion to Reciprocal Rank Fusion (RRF), search-result quality degrades for buy-intent queries. Customers searching for specific titles (e.g., "鬼滅の刃 23巻") see the exact product at rank 3-4 instead of rank 1. CTR for buy-intent queries drops from 24% to 17%.
Detection
```mermaid
graph TB
subgraph Signals["Detection Signals"]
style Signals fill:#e94560,stroke:#16213e,color:#fff
S1["CTR drops for<br/>buy-intent queries<br/>24% → 17%"]
S2["MRR drops for<br/>exact-title queries<br/>0.92 → 0.71"]
S3["A/B test shows<br/>RRF underperforms Linear<br/>for buy-intent segment"]
S4["BM25 rank-1 results<br/>are demoted to rank 3-4<br/>after RRF fusion"]
end
S1 & S2 & S3 & S4 --> DIAGNOSE["Fusion<br/>Diagnosis"]
```
Root Cause Analysis — Decision Tree
```mermaid
graph TD
START["Exact-title products<br/>ranked lower than expected<br/>after score fusion"] --> CHECK_BM25["Check: BM25 ranking<br/>for the query"]
CHECK_BM25 -->|BM25 rank 1 = correct product| CHECK_KNN["Check: kNN ranking<br/>for the query"]
CHECK_BM25 -->|BM25 rank > 1| RC_BM25["BM25 issue —<br/>not a fusion problem.<br/>Check analyzer/boost config."]
CHECK_KNN -->|kNN rank 1 = different product| CHECK_FUSION["Check: How RRF<br/>combines the ranks"]
CHECK_KNN -->|kNN rank 1 = same product| RC_OTHER["Both agree on rank 1.<br/>Check re-ranking stage."]
CHECK_FUSION --> ANALYZE_RRF["RRF formula:<br/>BM25 rank 1 → 1/(60+1) = 0.0164<br/>kNN rank 15 → 1/(60+15) = 0.0133<br/>Total: 0.0297"]
ANALYZE_RRF --> COMPARE["Compare with competitor:<br/>BM25 rank 5 → 1/(60+5) = 0.0154<br/>kNN rank 1 → 1/(60+1) = 0.0164<br/>Total: 0.0318"]
COMPARE --> RC1["ROOT CAUSE 1:<br/>RRF is rank-equalizing.<br/>A doc ranked #1 by BM25 but #15 by kNN<br/>loses to a doc ranked #5 and #1.<br/>RRF penalizes rank disagreement."]
RC1 --> FIX["FIX: Use intent-aware<br/>fusion strategy.<br/>Buy intent → BM25-heavy linear.<br/>Recommend intent → RRF."]
```
Detailed Explanation
RRF is designed to be democratic across rankers: it weights BM25 and kNN contributions equally. This is ideal for exploratory queries where both signals matter equally. But for buy-intent queries, BM25 rank 1 (exact title match) is a much stronger signal than kNN rank 1 (semantic similarity). RRF does not capture this asymmetry.
Worked example (linear scores shown under the kNN-heavy default, α=0.6 kNN / β=0.4 BM25):
| Document | BM25 Rank | kNN Rank | RRF Score (k=60) | Linear Score (0.4·BM25 + 0.6·kNN) |
| --- | --- | --- | --- | --- |
| Demon Slayer Vol 23 (correct) | 1 | 15 | 1/(61) + 1/(75) = 0.0297 | 0.4(1.0) + 0.6(0.60) = 0.760 |
| Jujutsu Kaisen Vol 25 (wrong) | 5 | 1 | 1/(65) + 1/(61) = 0.0318 | 0.4(0.75) + 0.6(1.0) = 0.900 |
| Demon Slayer Vol 22 | 2 | 20 | 1/(62) + 1/(80) = 0.0286 | 0.4(0.95) + 0.6(0.52) = 0.692 |
Under RRF, Jujutsu Kaisen Vol 25 outranks Demon Slayer Vol 23 because RRF treats both rankers equally: its kNN rank 1 contributes 1/61 while Demon Slayer's kNN rank 15 contributes only 1/75, and Demon Slayer's small BM25 edge (1/61 vs 1/65) cannot compensate. The kNN-heavy linear default fails the same way (0.900 vs 0.760). Re-running the linear fusion with the buy-intent weights (β=0.6 BM25, α=0.4 kNN) gives Demon Slayer 0.6(1.0) + 0.4(0.60) = 0.840 and Jujutsu Kaisen 0.6(0.75) + 0.4(1.0) = 0.850: a near-tie rather than a clean win, because Jujutsu Kaisen's kNN dominance still counts. BM25-heavy weighting is therefore necessary but not always sufficient for buy intent, which is why buy-intent queries also receive Haiku re-ranking (Scenario 5) to promote the exact-title match.
Resolution
| Step | Action | Detail |
| --- | --- | --- |
| 1 | Implement intent-aware fusion selection | A create_fusion(intent) factory returns a different strategy per intent (sketched after the configuration table) |
| 2 | Buy intent → weighted linear (β=0.6 BM25, α=0.4 kNN) | Prioritize exact keyword match for purchase queries |
| 3 | Recommend intent → RRF (k=60) | Democratic fusion for exploratory discovery |
| 4 | Browse intent → RRF (k=60) | Default balanced approach |
| 5 | Research intent → weighted linear (α=0.5, β=0.5) | Balanced but score-aware for comparison queries |
| 6 | A/B test the intent-aware approach | Measure CTR and MRR per intent segment over 2 weeks |
Configuration table:
| Intent | Fusion Method | BM25 Weight | kNN Weight | Rationale |
| --- | --- | --- | --- | --- |
| buy | Weighted linear | 0.60 | 0.40 | Exact match is paramount |
| recommend | RRF (k=60) | Equal (rank-based) | Equal (rank-based) | Diversity matters |
| browse | RRF (k=60) | Equal | Equal | Balanced exploration |
| research | Weighted linear | 0.50 | 0.50 | Both exact and semantic matter |
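A sketch of the create_fusion(intent) factory from step 1. It assumes each retriever returns (doc_id, score) pairs sorted best-first; the min-max normalization choice is an assumption:

```python
def rrf(bm25, knn, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over rankers of 1 / (k + rank_d)."""
    scores = {}
    for results in (bm25, knn):
        for rank, (doc_id, _) in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

def weighted_linear(bm25, knn, bm25_w, knn_w):
    """Min-max normalize each ranker's scores to [0, 1], then combine."""
    def norm(results):
        if not results:
            return {}
        vals = [s for _, s in results]
        lo, span = min(vals), (max(vals) - min(vals)) or 1.0
        return {d: (s - lo) / span for d, s in results}
    b, v = norm(bm25), norm(knn)
    combined = {d: bm25_w * b.get(d, 0.0) + knn_w * v.get(d, 0.0)
                for d in set(b) | set(v)}
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

def create_fusion(intent):
    """Map intent to a fusion strategy per the configuration table."""
    if intent == "buy":
        return lambda b, k: weighted_linear(b, k, bm25_w=0.6, knn_w=0.4)
    if intent == "research":
        return lambda b, k: weighted_linear(b, k, bm25_w=0.5, knn_w=0.5)
    return rrf  # recommend / browse: RRF with k=60
```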
Prevention
| Measure | Implementation |
| --- | --- |
| Per-intent metric tracking | Track NDCG@5, MRR, and CTR separately for each intent category |
| Fusion strategy A/B framework | Always run new fusion configs through A/B testing before full rollout |
| Intent detection accuracy | Monitor intent classifier accuracy; misclassified intents cascade into the wrong fusion strategy |
| Rank disagreement monitoring | Track the average rank disagreement between BM25 and kNN; high disagreement means the fusion method matters more |
Scenario 4: Index Performance Degradation After Catalog Expansion (10K → 100K)
Problem Statement
MangaAssist expands its catalog from 10K to 100K manga products. After the bulk indexing completes, kNN search latency increases from 35ms to 180ms at p95, and BM25 latency increases from 15ms to 65ms. The total retrieval pipeline now takes ~350ms, far exceeding the 200ms budget.
Detection
```mermaid
graph TB
subgraph Signals["Detection Signals"]
style Signals fill:#e94560,stroke:#16213e,color:#fff
S1["CloudWatch alarm:<br/>knn_search_latency_p95 > 100ms"]
S2["CloudWatch alarm:<br/>bm25_search_latency_p95 > 50ms"]
S3["End-to-end retrieval<br/>latency > 200ms (SLA breach)"]
S4["OpenSearch OCU<br/>utilization > 85%"]
end
subgraph Timeline["Event Timeline"]
style Timeline fill:#0f3460,stroke:#16213e,color:#fff
T1["Day 0: Catalog at 10K<br/>kNN: 35ms, BM25: 15ms"]
T2["Day 1: Bulk load starts<br/>90K new products"]
T3["Day 2: Bulk load completes<br/>100K total"]
T4["Day 2+: Latency spikes<br/>kNN: 180ms, BM25: 65ms"]
end
T4 --> S1 & S2 & S3 & S4
```
Root Cause Analysis — Decision Tree
```mermaid
graph TD
START["Search latency spiked<br/>after 10x catalog expansion"] --> CHECK_OCU["Check: OCU utilization"]
CHECK_OCU -->|> 80%| RC1["ROOT CAUSE 1:<br/>Insufficient search OCUs.<br/>2 search OCUs cannot handle<br/>100K docs at target QPS."]
CHECK_OCU -->|< 80%| CHECK_SEGMENTS["Check: Segment count<br/>per shard"]
CHECK_SEGMENTS -->|> 20 segments per shard| RC2["ROOT CAUSE 2:<br/>Segment fragmentation.<br/>Bulk indexing created many<br/>small segments. Merges pending."]
CHECK_SEGMENTS -->|< 20| CHECK_SHARDS["Check: Shard sizing"]
CHECK_SHARDS -->|Single shard > 25GB or<br/>single shard > 50K HNSW docs| RC3["ROOT CAUSE 3:<br/>Shard too large.<br/>HNSW graph on a single shard<br/>is too big for efficient search."]
CHECK_SHARDS -->|Shard size OK| CHECK_HNSW["Check: HNSW parameters"]
CHECK_HNSW -->|ef_search = 256 was fine<br/>for 10K but slow for 100K| RC4["ROOT CAUSE 4:<br/>HNSW ef_search too high<br/>for new index size. Search<br/>is exploring too many nodes."]
CHECK_HNSW -->|ef_search reasonable| CHECK_FILTER["Check: Filter clause<br/>performance"]
CHECK_FILTER -->|Filters on non-keyword fields| RC5["ROOT CAUSE 5:<br/>Post-filter on text fields.<br/>Filtering after kNN search<br/>is slow on 100K docs."]
CHECK_FILTER -->|Filters on keyword fields| ESCALATE["Escalate: Complex interaction<br/>of multiple factors"]
```
Resolution by Root Cause
RC1: Insufficient Search OCUs
| Step | Action | Detail |
| --- | --- | --- |
| 1 | Scale search OCUs from 2 → 6 | OpenSearch Serverless scales in OCU pairs; 6 OCUs support ~1,500 QPS at 100K docs |
| 2 | Monitor latency after scaling | OCU scaling takes 5-10 minutes to take effect |
| 3 | Validate | kNN latency should return to < 50ms at p95 |
| 4 | Set up an auto-scaling policy | Configure a max OCU limit and scaling triggers |
Cost impact: 2 OCUs → 6 OCUs = $1,750/month → $5,250/month (+$3,500/month). Justified by SLA compliance.
RC2: Segment Fragmentation
| Step | Action | Detail |
| --- | --- | --- |
| 1 | Check the segment count | GET /manga-products/_segments — look for > 20 segments per shard |
| 2 | Force merge | POST /manga-products/_forcemerge?max_num_segments=1 |
| 3 | Wait for the merge to complete | A force merge on 100K docs takes ~10-15 minutes |
| 4 | Validate latency | kNN latency should drop 30-50% after the merge (fewer segments to search) |
| 5 | Schedule post-bulk-load merges | Always run a force merge after the nightly catalog sync (automation sketch below) |
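A sketch of the step 5 automation, assuming opensearch-py against an endpoint that exposes the _segments and _forcemerge APIs (verify availability on a Serverless collection before relying on this):

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://search-domain.ap-northeast-1.es.amazonaws.com"])  # auth omitted

def post_bulk_load_merge(index: str = "manga-products", max_segments: int = 20) -> None:
    """After the nightly catalog sync: merge only if segments are fragmented."""
    seg_info = client.indices.segments(index=index)
    n_segments = sum(
        len(copy["segments"])  # segments per shard copy
        for shard_copies in seg_info["indices"][index]["shards"].values()
        for copy in shard_copies
    )
    if n_segments > max_segments:  # threshold from the decision tree
        client.indices.forcemerge(index=index, max_num_segments=1)
```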
RC3: Shard Too Large
| Step | Action | Detail |
| --- | --- | --- |
| 1 | Calculate the target shard count | 100K docs / 25K docs per shard = 4 shards |
| 2 | Create a new index with 4 shards | Use the same mapping with number_of_shards: 4 |
| 3 | Reindex from old to new | POST /_reindex from manga-products-v1 to manga-products-v2 |
| 4 | Alias swap | Update the manga-products alias to point to manga-products-v2 |
| 5 | Delete the old index | After validation, remove manga-products-v1 |
RC4: HNSW ef_search Too High
| Step | Action | Detail |
| --- | --- | --- |
| 1 | Run a recall-latency benchmark at multiple ef_search values | Test ef_search = 64, 128, 256, 512 |
| 2 | Find the knee point | Typically recall@10 plateaus while latency keeps climbing |
| 3 | Select the knee value | For 100K docs, ef_search = 128 achieves recall@10 = 0.95 at ~42ms vs 0.97 at ~85ms for ef_search = 256 (see the benchmark table below) |
| 4 | Update the index setting | PUT /manga-products/_settings { "knn.algo_param.ef_search": 128 } |
| 5 | Validate | Confirm recall@10 >= 0.95 and latency < 50ms |
Benchmark reference table:
| ef_search | recall@10 (100K docs) | Latency p95 | Recommendation |
| --- | --- | --- | --- |
| 64 | 0.89 | 22ms | Too low recall |
| 128 | 0.95 | 42ms | Good balance for 100K |
| 256 | 0.97 | 85ms | Was optimal for 10K; too slow for 100K |
| 512 | 0.98 | 180ms | Way too slow |
Prevention
| Measure | Implementation |
| --- | --- |
| Pre-expansion capacity test | Before any 5x+ catalog growth, run a load test on a staging index with the target document count |
| Shard sizing guidelines | Document: max 25K HNSW docs per shard for 1536-dim vectors |
| Post-bulk-load automation | Lambda triggered after bulk indexing: (1) force merge, (2) run the latency benchmark, (3) alert if over threshold |
| OCU scaling policy | Auto-scale search OCUs on the latency metric, not just utilization |
| ef_search tuning automation | Script that binary-searches for the optimal ef_search given a target recall and latency budget (sketch below) |
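A sketch of the tuning script from the last row. It assumes a benchmark(ef) helper that applies the setting, replays the eval query set, and returns (recall_at_10, p95_latency_ms), and that recall is non-decreasing in ef_search, which is what makes binary search valid:

```python
def tune_ef_search(benchmark, target_recall=0.95, lo=16, hi=512):
    """Binary-search the smallest ef_search that meets the recall target."""
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        recall, p95_ms = benchmark(mid)
        if recall >= target_recall:
            best = {"ef_search": mid, "recall@10": recall, "p95_ms": p95_ms}
            hi = mid - 1  # feasible: try smaller (faster) values
        else:
            lo = mid + 1  # infeasible: widen the graph search
    return best  # None if even the upper bound misses the recall target
```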
Scenario 5: Re-ranking Latency Exceeding Budget, Causing Total Retrieval > 200ms
Problem Statement
MangaAssist enables Claude Haiku re-ranking for all queries (not just buy-intent) to maximize retrieval quality. Haiku re-ranking adds 120-180ms per query, pushing total retrieval latency to 280-350ms at p95. The 200ms SLA is violated. Customer-visible page load times increase from 1.5s to 2.2s.
Detection
```mermaid
graph TB
subgraph Signals["Detection Signals"]
style Signals fill:#e94560,stroke:#16213e,color:#fff
S1["CloudWatch alarm:<br/>retrieval_total_latency_p95 > 200ms"]
S2["rerank_latency_p95<br/>= 155ms (budget: 60ms)"]
S3["Bedrock Haiku invocations<br/>= 100% of searches<br/>(expected: 20% buy-intent only)"]
S4["Monthly Bedrock cost<br/>for reranking: $3,200<br/>(expected: $640)"]
end
S1 & S2 & S3 & S4 --> DIAGNOSE["Latency<br/>Budget Analysis"]
```
Latency budget violation analysis:
| Stage | Budget | Actual | Status |
| --- | --- | --- | --- |
| Query preprocessing | 15ms | 12ms | OK |
| Embedding generation | 25ms | 22ms | OK |
| Hybrid search | 80ms | 72ms | OK |
| Re-ranking | 60ms | 155ms | OVER by 95ms |
| Result assembly | 20ms | 15ms | OK |
| Total | 200ms | 276ms | SLA BREACH |
Root Cause Analysis — Decision Tree
```mermaid
graph TD
START["Re-ranking latency<br/>155ms, budget 60ms"] --> CHECK_METHOD["Check: Which re-ranking<br/>method is active?"]
CHECK_METHOD -->|Claude Haiku| CHECK_SCOPE["Check: Is Haiku called<br/>for ALL queries?"]
CHECK_METHOD -->|Cross-encoder| CHECK_BATCH["Check: Batch size<br/>for cross-encoder"]
CHECK_SCOPE -->|Yes — all queries| RC1["ROOT CAUSE 1:<br/>Haiku re-ranking enabled<br/>globally instead of<br/>buy-intent only.<br/>Config error."]
CHECK_SCOPE -->|No — only buy-intent| CHECK_INPUT["Check: Input size<br/>to Haiku"]
CHECK_INPUT -->|> 20 candidates| RC2["ROOT CAUSE 2:<br/>Too many candidates<br/>sent to Haiku.<br/>Heuristic stage not<br/>filtering to top-20."]
CHECK_INPUT -->|<= 20 candidates| CHECK_HAIKU_LATENCY["Check: Haiku<br/>invocation latency"]
CHECK_HAIKU_LATENCY -->|> 100ms| RC3["ROOT CAUSE 3:<br/>Haiku cold start or<br/>Bedrock throttling.<br/>No provisioned throughput."]
CHECK_HAIKU_LATENCY -->|< 100ms| RC4["ROOT CAUSE 4:<br/>Prompt too large.<br/>Excessive metadata in<br/>re-ranking prompt."]
CHECK_BATCH -->|> 20 docs| RC5["ROOT CAUSE 5:<br/>Cross-encoder batch too<br/>large. Model scoring 50<br/>pairs instead of 20."]
CHECK_BATCH -->|<= 20 docs| ESCALATE["Escalate: Model inference<br/>itself is slow — check<br/>ECS task sizing."]
```
Resolution by Root Cause
RC1: Haiku Re-ranking Enabled Globally (Most Likely)
| Step | Action | Detail |
| --- | --- | --- |
| 1 | Check the reranker configuration | Review the ECS task environment variable RERANK_METHOD |
| 2 | Fix: set intent-conditional logic | Only invoke Haiku for intent == "buy" (20% of queries) |
| 3 | Default to cross-encoder or heuristic-only | For browse/recommend/research: use the cross-encoder (30ms) or heuristic-only (5ms) |
| 4 | Deploy the config change | ECS rolling deployment, no downtime |
| 5 | Validate | Confirm retrieval_total_latency_p95 < 200ms within 15 minutes |
Intent-to-reranker mapping:
| Intent | Re-rank Method | Expected Latency | Quality (NDCG@5) |
| --- | --- | --- | --- |
| buy (20% of queries) | Claude Haiku | 80-150ms | 0.93 |
| recommend (35%) | Cross-encoder | 30-50ms | 0.88 |
| browse (30%) | Heuristic only | 5ms | 0.82 |
| research (15%) | Cross-encoder | 30-50ms | 0.88 |
| Weighted average | — | ~40ms | ~0.87 |
RC2: Too Many Candidates Sent to Haiku
| Step | Action |
| --- | --- |
| 1 | Check the heuristic stage output count — should be capped at 20 |
| 2 | If the heuristic stage is outputting 50 (i.e., passing all hybrid results through), fix the heuristic_cutoff parameter |
| 3 | Set heuristic_cutoff = 20 in the RelevanceReranker constructor |
| 4 | Validate: Haiku input should be ~1.5K tokens (20 candidates) instead of ~4K tokens (50 candidates) |
RC3: Haiku Cold Start / Throttling
| Step | Action |
| --- | --- |
| 1 | Check Bedrock CloudWatch metrics: InvocationLatency, ThrottleCount |
| 2 | If throttled: request a Bedrock quota increase for Haiku in ap-northeast-1 |
| 3 | If cold start: implement a keep-warm strategy (invoke Haiku every 30s with a dummy request) |
| 4 | Consider Bedrock provisioned throughput for Haiku if QPS is sustained and high |
RC4: Prompt Too Large
| Step | Action |
| --- | --- |
| 1 | Audit the re-ranking prompt — how many tokens per candidate? |
| 2 | Reduce candidate metadata to title + genre + rating only (remove description, tags, author bio) |
| 3 | Target: < 100 tokens per candidate; 20 candidates = 2K input tokens |
| 4 | Validate: Haiku response time should drop from 150ms to 80ms |
Tiered Re-ranking Decision Flowchart
```mermaid
graph TD
QUERY["Incoming Query"] --> INTENT["Detect Intent"]
INTENT -->|Buy| BUDGET_CHECK_BUY["Remaining latency<br/>budget > 100ms?"]
INTENT -->|Recommend| BUDGET_CHECK_REC["Remaining latency<br/>budget > 50ms?"]
INTENT -->|Browse| HEURISTIC["Heuristic Only<br/>(5ms)"]
INTENT -->|Research| BUDGET_CHECK_RES["Remaining latency<br/>budget > 50ms?"]
BUDGET_CHECK_BUY -->|Yes| HAIKU["Claude Haiku<br/>Re-rank (80-150ms)"]
BUDGET_CHECK_BUY -->|No| CROSS_ENC_BUY["Cross-Encoder<br/>Re-rank (30-50ms)"]
BUDGET_CHECK_REC -->|Yes| CROSS_ENC_REC["Cross-Encoder<br/>Re-rank (30-50ms)"]
BUDGET_CHECK_REC -->|No| HEURISTIC_REC["Heuristic Only<br/>(5ms)"]
BUDGET_CHECK_RES -->|Yes| CROSS_ENC_RES["Cross-Encoder<br/>Re-rank (30-50ms)"]
BUDGET_CHECK_RES -->|No| HEURISTIC_RES["Heuristic Only<br/>(5ms)"]
HAIKU --> RESULT["Top-5 Results"]
CROSS_ENC_BUY --> RESULT
CROSS_ENC_REC --> RESULT
CROSS_ENC_RES --> RESULT
HEURISTIC --> RESULT
HEURISTIC_REC --> RESULT
HEURISTIC_RES --> RESULT
```
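The flowchart as code: a minimal sketch assuming rerank_haiku, rerank_cross_encoder, and rerank_heuristic callables exist and that elapsed_ms is supplied by the pipeline's stage-timing middleware:

```python
def select_reranker(intent: str, elapsed_ms: float, budget_ms: float = 200.0):
    """Pick the richest re-ranker the remaining latency budget allows."""
    remaining = budget_ms - elapsed_ms
    if intent == "buy":
        return rerank_haiku if remaining > 100 else rerank_cross_encoder  # Haiku: 80-150ms
    if intent in ("recommend", "research"):
        return rerank_cross_encoder if remaining > 50 else rerank_heuristic  # 30-50ms
    return rerank_heuristic  # browse: heuristic-only (~5ms)

# Usage: top5 = select_reranker(intent, timer.elapsed_ms)(query, candidates)[:5]
```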
Prevention
| Measure | Implementation |
| --- | --- |
| Latency budget enforcement | Middleware tracks elapsed time per stage; if cumulative > 140ms before re-ranking, auto-downgrade to heuristic-only |
| Reranker method per intent | Configuration table mapping intent → reranker method, reviewed monthly |
| Bedrock latency monitoring | CloudWatch alarm on InvocationLatency p95 > 100ms for the Haiku model ID |
| Cost alerting | CloudWatch alarm on daily Bedrock re-ranking cost > $25/day (expected: $21/day at current volume) |
| Progressive enhancement | Start all queries with heuristic-only; upgrade to cross-encoder if the latency budget allows; upgrade to Haiku only for buy-intent with budget remaining |
Cross-Scenario Summary
Common Patterns Across All 5 Scenarios
| Pattern | Scenarios | Key Lesson |
| --- | --- | --- |
| Metric degradation precedes user complaints | 1, 2, 3 | Invest in automated quality metrics (NDCG, MRR, CTR) — detect issues before customers notice |
| Config changes cascade unpredictably | 3, 5 | Always A/B test configuration changes; never deploy to 100% without validation |
| Scale changes require re-tuning | 4 | Parameters optimized for 10K docs may be wrong for 100K docs; re-benchmark after growth |
| Japanese text processing needs special attention | 2 | Standard tokenizers fail on manga-specific terms; custom dictionaries are mandatory |
| Latency budgets need per-stage enforcement | 5 | Track latency per stage, not just end-to-end; auto-degrade expensive stages when the budget is tight |
Monitoring Dashboard — Retrieval Health
| Metric | Source | Alarm Threshold | Check Frequency |
| --- | --- | --- | --- |
| retrieval_total_latency_p95 | ECS application metric | > 200ms | Real-time (1-min) |
| knn_search_latency_p95 | OpenSearch + app timer | > 80ms | Real-time (1-min) |
| bm25_search_latency_p95 | OpenSearch + app timer | > 40ms | Real-time (1-min) |
| rerank_latency_p95 | App timer | > 60ms | Real-time (1-min) |
| zero_result_rate | App counter | > 3% | Hourly |
| ndcg_at_5_weekly | Offline eval pipeline | < 0.80 | Weekly |
| mrr_weekly | Offline eval pipeline | < 0.80 | Weekly |
| search_ctr_7d | Click tracking | < 14% | Daily |
| opensearch_ocu_utilization | CloudWatch | > 85% | Real-time (5-min) |
| bedrock_rerank_cost_daily | Cost Explorer | > $30/day | Daily |