
Retrieval Performance Scenarios and Runbooks

AWS AIP-C01 Task 4.2 → Skill 4.2.2: Optimize retrieval mechanisms for FM-augmented applications
System: MangaAssist e-commerce chatbot — Bedrock Claude 3 Sonnet/Haiku, OpenSearch Serverless (HNSW k-NN), DynamoDB, ECS Fargate, ElastiCache Redis
Format: 5 production scenarios, each with Problem → Detection → Root Cause → Resolution → Prevention, plus decision trees


Skill Mapping

| AWS AIP-C01 Element | Coverage |
| --- | --- |
| Task 4.2 | Optimize application performance for FM workloads |
| Skill 4.2.2 | Optimize retrieval mechanisms to improve FM-augmented application performance |
| This File | 5 operational scenarios covering vector search quality, Japanese tokenization, hybrid scoring, index scaling, and re-ranking latency |
| MangaAssist Context | Production troubleshooting for a RAG pipeline serving 100K+ manga products to Japanese-speaking customers with < 200ms retrieval target |

Scenario Overview

| # | Scenario | Impact | Severity | Detection Time |
| --- | --- | --- | --- | --- |
| 1 | Vector search returning irrelevant manga | Incorrect recommendations → customer frustration | High | Minutes (quality metric drop) |
| 2 | Japanese query tokenization failure | Missed results for rare manga titles → zero-result pages | High | Hours (long-tail query analysis) |
| 3 | Hybrid search score fusion producing unexpected ranking | Popular titles ranked below obscure ones → CTR drop | Medium | Hours (A/B metric divergence) |
| 4 | Index performance degradation after catalog expansion | Search latency exceeds budget → slow page loads | High | Minutes (latency alarm) |
| 5 | Re-ranking latency exceeding budget | Total retrieval > 200ms → SLA violation | Critical | Seconds (p95 alarm) |

Scenario 1: Vector Search Returning Irrelevant Manga (Embedding Quality Degradation)

Problem Statement

MangaAssist's kNN vector search begins returning semantically irrelevant results. A customer searching for "dark psychological thriller manga" receives results like children's comedy titles and cooking manga. NDCG@5 drops from 0.85 to 0.58 over a two-week period.

Detection

```mermaid
graph TB
    subgraph Signals["Detection Signals"]
        style Signals fill:#e94560,stroke:#16213e,color:#fff
        S1["NDCG@5 weekly eval<br/>drops below 0.75 threshold"]
        S2["Customer feedback:<br/>'recommendations are wrong'"]
        S3["CTR on search results<br/>drops from 18% to 11%"]
        S4["kNN result diversity<br/>collapses — same titles<br/>appear for different queries"]
    end

    subgraph Metrics["CloudWatch Alarms"]
        style Metrics fill:#0f3460,stroke:#16213e,color:#fff
        M1["Alarm: ndcg_weekly < 0.75"]
        M2["Alarm: search_ctr_7d < 0.14"]
        M3["Alarm: zero_result_rate > 3%"]
    end

    S1 --> M1
    S3 --> M2
    S1 & S2 & S3 & S4 --> INVESTIGATE["Begin<br/>Investigation"]
```

Key monitoring query (CloudWatch Logs Insights):

```
fields @timestamp, query_text, knn_top1_score, knn_top5_avg_score, ndcg_at_5
| filter knn_top1_score < 0.70
| stats count(*) as low_quality_searches, avg(ndcg_at_5) as avg_ndcg by bin(1h) as hour
| sort hour desc
```

Root Cause Analysis — Decision Tree

```mermaid
graph TD
    START["kNN returning<br/>irrelevant results"] --> CHECK_EMB["Check: Were embeddings<br/>recently re-indexed?"]

    CHECK_EMB -->|Yes| EMB_MODEL["Check: Was the embedding<br/>model changed?"]
    CHECK_EMB -->|No| CHECK_DATA["Check: Was new data<br/>bulk-loaded recently?"]

    EMB_MODEL -->|Yes — model version changed| RC1["ROOT CAUSE 1:<br/>Embedding model mismatch<br/>Old docs: Titan v1<br/>New docs: Titan v2<br/>Vectors are incompatible"]
    EMB_MODEL -->|No — same model| CHECK_PROMPT["Check: Was the embedding<br/>input format changed?"]

    CHECK_PROMPT -->|Yes| RC2["ROOT CAUSE 2:<br/>Embedding input drift<br/>e.g., title-only → title+description<br/>changes vector space geometry"]
    CHECK_PROMPT -->|No| CHECK_INDEX["Check: HNSW index params"]

    CHECK_DATA -->|Yes — bulk load happened| CHECK_QUALITY["Check: Quality of<br/>new product data"]
    CHECK_DATA -->|No| CHECK_DRIFT["Check: Query distribution<br/>shift over time"]

    CHECK_QUALITY -->|Bad data: empty descriptions,<br/>wrong language, duplicates| RC3["ROOT CAUSE 3:<br/>Data quality degradation<br/>Garbage-in → garbage-out<br/>embeddings"]
    CHECK_QUALITY -->|Data looks fine| CHECK_INDEX

    CHECK_INDEX -->|ef_search too low| RC4["ROOT CAUSE 4:<br/>HNSW recall degradation<br/>ef_search insufficient for<br/>current index size"]
    CHECK_INDEX -->|Params normal| CHECK_DRIFT

    CHECK_DRIFT -->|Query patterns changed| RC5["ROOT CAUSE 5:<br/>Concept drift<br/>New manga genres/trends<br/>not represented in<br/>embedding training data"]
    CHECK_DRIFT -->|Stable| ESCALATE["Escalate: Unknown<br/>root cause — deeper<br/>investigation needed"]
```

Resolution by Root Cause

RC1: Embedding Model Mismatch

| Step | Action | Command / Detail |
| --- | --- | --- |
| 1 | Identify mixed-model documents | Query for documents indexed before vs. after the model switch; check the embedding_model_version metadata field |
| 2 | Halt new indexing | Pause the DynamoDB-to-OpenSearch pipeline Lambda |
| 3 | Re-embed ALL documents with the new model | Run a batch Titan v2 embedding job for all 100K products (~17 min at a sustained 100 TPS; longer with throttling and retries) |
| 4 | Bulk re-index | Use the _bulk API to replace all vectors in the manga-products index (steps 3-4 sketched below) |
| 5 | Validate | Run the NDCG evaluation on the 500-query test set; confirm >= 0.85 |
| 6 | Resume pipeline | Re-enable the DynamoDB stream Lambda |
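
Steps 3 and 4 can be scripted together. A minimal sketch, assuming Titan Text Embeddings v2 (model ID `amazon.titan-embed-text-v2:0`), the `opensearch-py` client, and a DynamoDB table named `manga-products`; the `embedding` field name, the endpoint, and the auth setup are placeholders rather than confirmed parts of the pipeline:

```python
import json
import boto3
from opensearchpy import OpenSearch, helpers

bedrock = boto3.client("bedrock-runtime", region_name="ap-northeast-1")
os_client = OpenSearch(  # SigV4 auth omitted for brevity; deployment-specific
    hosts=[{"host": "example-collection-endpoint", "port": 443}], use_ssl=True
)

def embed(text: str) -> list[float]:
    """Embed one product with Titan v2 (assumed model ID)."""
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]

def load_products():
    """Stream every product from DynamoDB (table name is illustrative)."""
    table = boto3.resource("dynamodb").Table("manga-products")
    resp = table.scan()
    yield from resp["Items"]
    while "LastEvaluatedKey" in resp:
        resp = table.scan(ExclusiveStartKey=resp["LastEvaluatedKey"])
        yield from resp["Items"]

def reembed_actions():
    """Yield _bulk actions that overwrite each document's vector (step 4)."""
    for p in load_products():
        yield {
            "_op_type": "index",
            "_index": "manga-products",
            "_id": p["manga_id"],
            "_source": {
                **p,
                "embedding": embed(f"{p['title_ja']} {p['description']}"),
                "embedding_model_version": "titan-v2",  # per the prevention table
            },
        }

helpers.bulk(os_client, reembed_actions(), chunk_size=200)
```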

RC3: Data Quality Degradation

| Step | Action |
| --- | --- |
| 1 | Identify bad documents: empty description, non-Japanese text in title_ja, duplicate manga_id |
| 2 | Quarantine bad documents in a DynamoDB quarantine table |
| 3 | Delete bad vectors from the OpenSearch index |
| 4 | Fix the data pipeline: add a validation Lambda between DynamoDB and embedding generation (sketched below) |
| 5 | Re-run embedding for the corrected documents |
| 6 | Validate NDCG recovery |
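
The validation gate in step 4 is small enough to share between the Lambda and offline audits. A sketch using the field names from step 1; the regex is a cheap stand-in for a real language detector, which is an assumption rather than the production check:

```python
import re

# Hiragana, katakana, and CJK ideograph ranges: a rough "contains Japanese" test.
JAPANESE_CHARS = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")

def validate_product(doc: dict, seen_ids: set) -> list[str]:
    """Return quality violations; an empty list means the doc may be embedded."""
    errors = []
    if len(doc.get("description", "")) <= 20:
        errors.append("description too short")
    if not JAPANESE_CHARS.search(doc.get("title_ja", "")):
        errors.append("title_ja contains no Japanese script")
    if doc["manga_id"] in seen_ids:
        errors.append("duplicate manga_id")
    seen_ids.add(doc["manga_id"])
    return errors
```

Documents with a non-empty error list go to the quarantine table (step 2) instead of the embedding job.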

RC4: HNSW Recall Degradation

| Step | Action |
| --- | --- |
| 1 | Check the current ef_search value (should be 256 for MangaAssist) |
| 2 | Run a recall benchmark: query 100 known-relevant pairs, measure recall@10 (see the sketch below) |
| 3 | If recall < 0.95, increase ef_search to 512 (costs ~15ms more latency) |
| 4 | If recall is still poor, check whether the index needs a force merge (fragmented segments) |
| 5 | If the index exceeds 200K docs on the current shard count, add shards (re-index required) |
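
The step 2 benchmark needs only a labeled set of (query vector, relevant document ID) pairs. A sketch with `opensearch-py`; the `embedding` field name is assumed, and `labeled_pairs` stands in for however the 100-pair ground truth is stored:

```python
from opensearchpy import OpenSearch

def recall_at_10(client: OpenSearch, labeled_pairs: list) -> float:
    """labeled_pairs: [(query_vector, relevant_doc_id), ...] per step 2."""
    found = 0
    for vector, relevant_id in labeled_pairs:
        resp = client.search(
            index="manga-products",
            body={"size": 10,
                  "query": {"knn": {"embedding": {"vector": vector, "k": 10}}}},
        )
        if relevant_id in {h["_id"] for h in resp["hits"]["hits"]}:
            found += 1
    return found / len(labeled_pairs)
```

If this returns < 0.95, proceed to step 3 (raise ef_search) and re-measure.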

Prevention

| Measure | Implementation |
| --- | --- |
| Embedding model version tracking | Store model_version as a metadata field on every document; alarm on version mismatch |
| Data quality gate | Validation Lambda checks: description length > 20 chars, language detection matches expected, no duplicate IDs |
| Weekly NDCG evaluation | Automated pipeline runs the 500-query eval set every Sunday; alarms if NDCG@5 < 0.80 |
| Embedding drift detection | Compare the average vector centroid monthly; alarm if centroid shift > 0.1 cosine distance (see the sketch below) |
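
The drift check in the last row reduces to one cosine distance between monthly centroids. A numpy sketch; how the two monthly samples of stored vectors are collected is left open:

```python
import numpy as np

def centroid_drift(prev_month: np.ndarray, curr_month: np.ndarray) -> float:
    """Cosine distance between the mean embeddings of two monthly samples.

    Alarm when the result exceeds 0.1, per the prevention table above.
    """
    a, b = prev_month.mean(axis=0), curr_month.mean(axis=0)
    cos_sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return 1.0 - cos_sim
```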

Scenario 2: Japanese Query Tokenization Failure for Rare Manga Titles

Problem Statement

Customers searching for rare or new manga titles in Japanese receive zero results or irrelevant results. Example: searching for "呪術廻戦0 東京都立呪術高等専門学校" (Jujutsu Kaisen 0: Tokyo Metropolitan Magic Technical College) returns nothing, while "呪術廻戦" (Jujutsu Kaisen) works fine. The issue affects ~5% of queries — specifically long compound titles and titles with unusual kanji combinations.

Detection

```mermaid
graph TB
    subgraph Signals["Detection Signals"]
        style Signals fill:#e94560,stroke:#16213e,color:#fff
        S1["Zero-result rate spikes<br/>for Japanese queries<br/>from 1.2% to 5.8%"]
        S2["Customer support tickets:<br/>'cannot find [specific title]'"]
        S3["BM25 component returns 0<br/>while kNN returns some results"]
        S4["Query length > 10 chars<br/>has 3x higher zero-result rate"]
    end

    S1 & S2 & S3 & S4 --> INVESTIGATE["Tokenization<br/>Investigation"]
```

Detection query:

```
fields @timestamp, query_text, bm25_result_count, knn_result_count, query_language
| filter query_language = "ja" and bm25_result_count = 0
| stats count(*) as zero_bm25 by bin(1d) as day
| sort day desc
```

Root Cause Analysis — Decision Tree

```mermaid
graph TD
    START["Japanese query<br/>returns 0 BM25 results"] --> CHECK_EXIST["Check: Does the product<br/>exist in the index?"]

    CHECK_EXIST -->|No| RC_MISSING["NOT A TOKENIZATION ISSUE:<br/>Product not yet indexed.<br/>Check DynamoDB → OpenSearch pipeline."]
    CHECK_EXIST -->|Yes| CHECK_ANALYZE["Run _analyze API on<br/>the query text"]

    CHECK_ANALYZE --> CHECK_TOKENS["Compare query tokens<br/>vs indexed tokens"]

    CHECK_TOKENS -->|Tokens don't overlap| CHECK_KUROMOJI["Check: kuromoji<br/>dictionary version"]

    CHECK_KUROMOJI -->|Default dict only| RC1["ROOT CAUSE 1:<br/>Missing custom dictionary<br/>Rare manga terms not in<br/>kuromoji default dictionary"]
    CHECK_KUROMOJI -->|Custom dict present| CHECK_COMPOUND["Check: Is the title<br/>a compound word?"]

    CHECK_COMPOUND -->|Yes — long compound| RC2["ROOT CAUSE 2:<br/>Over-segmentation<br/>kuromoji splits the title<br/>into too many fragments<br/>that don't match indexed form"]
    CHECK_COMPOUND -->|No| CHECK_READING["Check: Reading form<br/>normalization"]

    CHECK_READING -->|Katakana vs Hiragana<br/>mismatch| RC3["ROOT CAUSE 3:<br/>Script normalization failure<br/>Query in カタカナ but index<br/>stores ひらがな form"]
    CHECK_READING -->|Same script| CHECK_FULLWIDTH["Check: Fullwidth vs<br/>halfwidth characters"]

    CHECK_FULLWIDTH -->|Mismatch detected| RC4["ROOT CAUSE 4:<br/>Unicode normalization gap<br/>Fullwidth numbers/letters<br/>not normalized to halfwidth"]
    CHECK_FULLWIDTH -->|No mismatch| RC5["ROOT CAUSE 5:<br/>Analyzer configuration error<br/>Wrong analyzer assigned to field"]
```

Resolution by Root Cause

RC1: Missing Custom Dictionary Entries

The kuromoji tokenizer's built-in dictionary does not include many manga-specific terms, especially newer titles and genre terms.

| Step | Action | Detail |
| --- | --- | --- |
| 1 | Identify failing terms | Collect all zero-result queries from the last 30 days; run them through the _analyze API |
| 2 | Build a custom user dictionary | Create user_dictionary.txt with manga titles, author names, and genre terms |
| 3 | Update the analyzer config | Point the kuromoji tokenizer's user_dictionary at the custom file (config sketched below) |
| 4 | Re-index | A full re-index is required to apply the new analyzer to existing documents |
| 5 | Validate | Re-run the failing queries; confirm non-zero BM25 results |

Custom dictionary format (user_dictionary.txt):

```
呪術廻戦,呪術廻戦,ジュジュツカイセン,カスタム名詞
鬼滅の刃,鬼滅の刃,キメツノヤイバ,カスタム名詞
進撃の巨人,進撃の巨人,シンゲキノキョジン,カスタム名詞
東京都立呪術高等専門学校,東京都立呪術高等専門学校,トウキョウトリツジュジュツコウトウセンモンガッコウ,カスタム名詞
チェンソーマン,チェンソーマン,チェンソーマン,カスタム名詞
```
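
Step 3's analyzer wiring, sketched as an index-creation body for `opensearch-py`. Inline `user_dictionary_rules` entries are shown for illustration; a file-based `user_dictionary` behaves the same, and the filter chain and analyzer names are assumptions (dictionary packaging also differs between managed OpenSearch domains and Serverless collections, so verify what your deployment supports):

```python
# Index settings sketch: kuromoji tokenizer with the custom dictionary,
# plus the icu_normalizer char filter that RC4 below also calls for.
manga_ja_index = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "kuromoji_manga": {
                    "type": "kuromoji_tokenizer",
                    "mode": "search",
                    "user_dictionary_rules": [
                        "呪術廻戦,呪術廻戦,ジュジュツカイセン,カスタム名詞",
                        "東京都立呪術高等専門学校,東京都立呪術高等専門学校,トウキョウトリツジュジュツコウトウセンモンガッコウ,カスタム名詞",
                    ],
                }
            },
            "analyzer": {
                "manga_ja": {
                    "type": "custom",
                    "char_filter": ["icu_normalizer"],
                    "tokenizer": "kuromoji_manga",
                    "filter": ["kuromoji_baseform", "ja_stop", "lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {"title_ja": {"type": "text", "analyzer": "manga_ja"}}
    },
}
# os_client.indices.create(index="manga-products-v2", body=manga_ja_index)
```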

RC2: Over-segmentation of Compound Titles

| Step | Action |
| --- | --- |
| 1 | Test with the _analyze API: check how kuromoji tokenizes the compound title |
| 2 | If over-segmented, add the full compound as a custom dictionary entry (see RC1) |
| 3 | Additionally, add a shingle filter to create bigrams/trigrams that capture partial compounds |
| 4 | For the BM25 query, reduce minimum_should_match from 30% to 20% for long queries (> 8 tokens) — see the query sketch below |
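
Step 4 on the query side, as a sketch. The helper name is hypothetical; the 8-token threshold and the 30% → 20% relaxation come from the table above:

```python
def bm25_title_clause(query_text: str, token_count: int) -> dict:
    """Build the BM25 match clause; relax minimum_should_match for long queries."""
    msm = "20%" if token_count > 8 else "30%"
    return {"match": {"title_ja": {"query": query_text,
                                   "minimum_should_match": msm}}}
```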

RC4: Unicode Normalization Gap

| Step | Action |
| --- | --- |
| 1 | Add the icu_normalizer char filter to the analyzer chain (before the kuromoji tokenizer) |
| 2 | Ensure NFKC normalization is applied to both queries and indexed text |
| 3 | Update the QueryPreprocessor to apply NFKC normalization before sending queries to OpenSearch (sketched below) |
| 4 | Re-index with the updated analyzer |
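
Step 3 is a one-liner with Python's stdlib. A sketch of the QueryPreprocessor hook (the function name is illustrative):

```python
import unicodedata

def normalize_query(text: str) -> str:
    """NFKC folds fullwidth digits/letters and the ideographic space to
    their halfwidth forms before the query reaches OpenSearch."""
    return unicodedata.normalize("NFKC", text)

assert normalize_query("鬼滅の刃　２３巻") == "鬼滅の刃 23巻"
```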

Prevention

| Measure | Implementation |
| --- | --- |
| Custom dictionary maintenance | Monthly review of zero-result queries; add new manga titles/terms to the custom dictionary |
| Automated tokenization tests | CI pipeline runs 200 known manga titles through the _analyze API and verifies the expected tokens |
| Fallback to kNN | If BM25 returns 0 results, automatically fall back to kNN-only search (already implemented in the hybrid pipeline) |
| Query normalization | Apply NFKC + script normalization in the QueryPreprocessor before any search operation |
| New title onboarding | When a new manga is added to the catalog, its title is automatically added to the custom dictionary queue |

Scenario 3: Hybrid Search Score Fusion Producing Unexpected Ranking

Problem Statement

After switching from weighted linear fusion to Reciprocal Rank Fusion (RRF), search quality degrades for buy-intent queries. Customers searching for specific titles (e.g., "鬼滅の刃 23巻" — Demon Slayer Vol. 23) see the exact product at rank 3-4 instead of rank 1. CTR for buy-intent queries drops from 24% to 17%.

Detection

```mermaid
graph TB
    subgraph Signals["Detection Signals"]
        style Signals fill:#e94560,stroke:#16213e,color:#fff
        S1["CTR drops for<br/>buy-intent queries<br/>24% → 17%"]
        S2["MRR drops for<br/>exact-title queries<br/>0.92 → 0.71"]
        S3["A/B test shows<br/>RRF underperforms Linear<br/>for buy-intent segment"]
        S4["BM25 rank-1 results<br/>are demoted to rank 3-4<br/>after RRF fusion"]
    end

    S1 & S2 & S3 & S4 --> DIAGNOSE["Fusion<br/>Diagnosis"]
```

Root Cause Analysis — Decision Tree

```mermaid
graph TD
    START["Exact-title products<br/>ranked lower than expected<br/>after score fusion"] --> CHECK_BM25["Check: BM25 ranking<br/>for the query"]

    CHECK_BM25 -->|BM25 rank 1 = correct product| CHECK_KNN["Check: kNN ranking<br/>for the query"]
    CHECK_BM25 -->|BM25 rank > 1| RC_BM25["BM25 issue —<br/>not a fusion problem.<br/>Check analyzer/boost config."]

    CHECK_KNN -->|kNN rank 1 = different product| CHECK_FUSION["Check: How RRF<br/>combines the ranks"]
    CHECK_KNN -->|kNN rank 1 = same product| RC_OTHER["Both agree on rank 1.<br/>Check re-ranking stage."]

    CHECK_FUSION --> ANALYZE_RRF["RRF formula:<br/>BM25 rank 1 → 1/(60+1) = 0.0164<br/>kNN rank 15 → 1/(60+15) = 0.0133<br/>Total: 0.0297"]

    ANALYZE_RRF --> COMPARE["Compare with competitor:<br/>BM25 rank 5 → 1/(60+5) = 0.0154<br/>kNN rank 1 → 1/(60+1) = 0.0164<br/>Total: 0.0318"]

    COMPARE --> RC1["ROOT CAUSE 1:<br/>RRF is rank-equalizing.<br/>A doc ranked #1 by BM25 but #15 by kNN<br/>loses to a doc ranked #5 and #1.<br/>RRF penalizes rank disagreement."]

    RC1 --> FIX["FIX: Use intent-aware<br/>fusion strategy.<br/>Buy intent → BM25-heavy linear.<br/>Recommend intent → RRF."]
```

Detailed Explanation

RRF is designed to be democratic across rankers: it weights BM25 and kNN contributions equally. This is ideal for exploratory queries where both signals matter equally. But for buy-intent queries, BM25 rank 1 (exact title match) is a much stronger signal than kNN rank 1 (semantic similarity). RRF does not capture this asymmetry.

Worked example:

| Document | BM25 Rank | kNN Rank | RRF Score (k=60) | Linear Score (β=0.4 BM25, α=0.6 kNN) |
| --- | --- | --- | --- | --- |
| Demon Slayer Vol 23 (correct) | 1 | 15 | 1/(61) + 1/(75) = 0.0297 | 0.4(1.00) + 0.6(0.60) = 0.760 |
| Jujutsu Kaisen Vol 25 (wrong) | 5 | 1 | 1/(65) + 1/(61) = 0.0318 | 0.4(0.75) + 0.6(1.00) = 0.900 |
| Demon Slayer Vol 22 | 2 | 20 | 1/(62) + 1/(80) = 0.0286 | 0.4(0.95) + 0.6(0.52) = 0.692 |

Under RRF, Jujutsu Kaisen Vol 25 outranks Demon Slayer Vol 23 because RRF treats both rankers as equals, so its kNN rank 1 outweighs the correct product's BM25 rank 1. The kNN-heavy linear fusion shown in the table (α=0.6) fails the same way: 0.900 vs 0.760. Recomputing the linear column with the BM25-heavy buy-intent weights from the resolution below (β=0.6, α=0.4) brings the two to near-parity, 0.6(1.00) + 0.4(0.60) = 0.840 for Demon Slayer Vol 23 against 0.6(0.75) + 0.4(1.00) = 0.850 for Jujutsu Kaisen, which is why buy-intent queries also get the Claude Haiku re-ranking stage described in Scenario 5.
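
The worked example is easy to reproduce. A small sketch that recomputes both columns of the table; ranks and normalized scores are taken directly from it:

```python
def rrf(ranks, k=60):
    """Reciprocal Rank Fusion: sum of 1/(k + rank) over rankers."""
    return sum(1.0 / (k + r) for r in ranks)

def linear(bm25, knn, beta=0.4, alpha=0.6):
    """Weighted linear fusion of normalized scores (beta = BM25, alpha = kNN)."""
    return beta * bm25 + alpha * knn

docs = {
    "Demon Slayer Vol 23": ([1, 15], (1.00, 0.60)),
    "Jujutsu Kaisen Vol 25": ([5, 1], (0.75, 1.00)),
    "Demon Slayer Vol 22": ([2, 20], (0.95, 0.52)),
}
for name, (ranks, scores) in docs.items():
    print(f"{name}: RRF={rrf(ranks):.4f} linear={linear(*scores):.3f}")
# Demon Slayer Vol 23 loses under both fusions shown (0.0297 vs 0.0318, and
# 0.760 vs 0.900); rerun with beta=0.6, alpha=0.4 to see the near-parity case.
```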

Resolution

| Step | Action | Detail |
| --- | --- | --- |
| 1 | Implement intent-aware fusion selection | A create_fusion(intent) factory returns a different strategy per intent (sketched after the configuration table) |
| 2 | Buy intent → weighted linear (β=0.6 BM25, α=0.4 kNN) | Prioritize exact keyword match for purchase queries |
| 3 | Recommend intent → RRF (k=60) | Democratic fusion for exploratory discovery |
| 4 | Browse intent → RRF (k=60) | Default balanced approach |
| 5 | Research intent → weighted linear (β=0.5, α=0.5) | Balanced but score-aware for comparison queries |
| 6 | A/B test the intent-aware approach | Measure CTR and MRR per intent segment over 2 weeks |

Configuration table:

| Intent | Fusion Method | BM25 Weight | kNN Weight | Rationale |
| --- | --- | --- | --- | --- |
| buy | Weighted linear | 0.60 | 0.40 | Exact match is paramount |
| recommend | RRF (k=60) | Equal (rank-based) | Equal (rank-based) | Diversity matters |
| browse | RRF (k=60) | Equal | Equal | Balanced exploration |
| research | Weighted linear | 0.50 | 0.50 | Both exact and semantic matter |
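
A sketch of the create_fusion(intent) factory from step 1, returning one callable per row of the configuration table. The input and output shapes are assumptions about the surrounding pipeline:

```python
from typing import Callable

# Each ranker's output: [(doc_id, raw_score), ...] sorted best-first.
Ranked = list[tuple[str, float]]
FusionFn = Callable[[Ranked, Ranked], dict[str, float]]

def _minmax(results: Ranked) -> dict[str, float]:
    """Min-max normalize raw scores into [0, 1] per ranker."""
    if not results:
        return {}
    scores = [s for _, s in results]
    lo, hi = min(scores), max(scores)
    return {d: (s - lo) / (hi - lo) if hi > lo else 1.0 for d, s in results}

def make_rrf(k: int = 60) -> FusionFn:
    def fuse(bm25: Ranked, knn: Ranked) -> dict[str, float]:
        fused: dict[str, float] = {}
        for ranked in (bm25, knn):
            for rank, (doc_id, _) in enumerate(ranked, start=1):
                fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
        return fused
    return fuse

def make_linear(bm25_weight: float, knn_weight: float) -> FusionFn:
    def fuse(bm25: Ranked, knn: Ranked) -> dict[str, float]:
        b, v = _minmax(bm25), _minmax(knn)
        return {d: bm25_weight * b.get(d, 0.0) + knn_weight * v.get(d, 0.0)
                for d in set(b) | set(v)}
    return fuse

def create_fusion(intent: str) -> FusionFn:
    """buy/research get weighted linear; recommend/browse default to RRF."""
    return {"buy": make_linear(0.60, 0.40),
            "research": make_linear(0.50, 0.50)}.get(intent, make_rrf(k=60))
```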

Prevention

| Measure | Implementation |
| --- | --- |
| Per-intent metric tracking | Track NDCG@5, MRR, and CTR separately for each intent category |
| Fusion strategy A/B framework | Always run new fusion configs through A/B testing before full rollout |
| Intent detection accuracy | Monitor intent classifier accuracy; misclassified intents cascade into the wrong fusion strategy |
| Rank disagreement monitoring | Track the average rank disagreement between BM25 and kNN; high disagreement means the fusion method matters more |

Scenario 4: Index Performance Degradation After Catalog Expansion (10K to 100K Manga)

Problem Statement

MangaAssist expands its catalog from 10K to 100K manga products. After the bulk indexing completes, kNN search latency increases from 35ms to 180ms at p95, and BM25 latency increases from 15ms to 65ms. The total retrieval pipeline now takes ~350ms, far exceeding the 200ms budget.

Detection

```mermaid
graph TB
    subgraph Signals["Detection Signals"]
        style Signals fill:#e94560,stroke:#16213e,color:#fff
        S1["CloudWatch alarm:<br/>knn_search_latency_p95 > 100ms"]
        S2["CloudWatch alarm:<br/>bm25_search_latency_p95 > 50ms"]
        S3["End-to-end retrieval<br/>latency > 200ms (SLA breach)"]
        S4["OpenSearch OCU<br/>utilization > 85%"]
    end

    subgraph Timeline["Event Timeline"]
        style Timeline fill:#0f3460,stroke:#16213e,color:#fff
        T1["Day 0: Catalog at 10K<br/>kNN: 35ms, BM25: 15ms"]
        T2["Day 1: Bulk load starts<br/>90K new products"]
        T3["Day 2: Bulk load completes<br/>100K total"]
        T4["Day 2+: Latency spikes<br/>kNN: 180ms, BM25: 65ms"]
    end

    T4 --> S1 & S2 & S3 & S4
```

Root Cause Analysis — Decision Tree

```mermaid
graph TD
    START["Search latency spiked<br/>after 10x catalog expansion"] --> CHECK_OCU["Check: OCU utilization"]

    CHECK_OCU -->|> 80%| RC1["ROOT CAUSE 1:<br/>Insufficient search OCUs.<br/>2 search OCUs cannot handle<br/>100K docs at target QPS."]
    CHECK_OCU -->|< 80%| CHECK_SEGMENTS["Check: Segment count<br/>per shard"]

    CHECK_SEGMENTS -->|> 20 segments per shard| RC2["ROOT CAUSE 2:<br/>Segment fragmentation.<br/>Bulk indexing created many<br/>small segments. Merges pending."]
    CHECK_SEGMENTS -->|< 20| CHECK_SHARDS["Check: Shard sizing"]

    CHECK_SHARDS -->|Single shard > 25GB or<br/>single shard > 50K HNSW docs| RC3["ROOT CAUSE 3:<br/>Shard too large.<br/>HNSW graph on a single shard<br/>is too big for efficient search."]
    CHECK_SHARDS -->|Shard size OK| CHECK_HNSW["Check: HNSW parameters"]

    CHECK_HNSW -->|ef_search = 256 was fine<br/>for 10K but slow for 100K| RC4["ROOT CAUSE 4:<br/>HNSW ef_search too high<br/>for new index size. Search<br/>is exploring too many nodes."]
    CHECK_HNSW -->|ef_search reasonable| CHECK_FILTER["Check: Filter clause<br/>performance"]

    CHECK_FILTER -->|Filters on non-keyword fields| RC5["ROOT CAUSE 5:<br/>Post-filter on text fields.<br/>Filtering after kNN search<br/>is slow on 100K docs."]
    CHECK_FILTER -->|Filters on keyword fields| ESCALATE["Escalate: Complex interaction<br/>of multiple factors"]
```

Resolution by Root Cause

RC1: Insufficient Search OCUs

| Step | Action | Detail |
| --- | --- | --- |
| 1 | Scale search OCUs from 2 → 6 | OpenSearch Serverless scales in OCU pairs; 6 OCUs support ~1,500 QPS at 100K docs |
| 2 | Monitor latency after scaling | OCU scaling takes 5-10 minutes to take effect |
| 3 | Validate | kNN latency should return to < 50ms at p95 |
| 4 | Set up an auto-scaling policy | Configure a max OCU limit and scaling triggers |

Cost impact: 2 OCUs → 6 OCUs = $1,750/month → $5,250/month (+$3,500/month). Justified by SLA compliance.

RC2: Segment Fragmentation

| Step | Action | Detail |
| --- | --- | --- |
| 1 | Check segment count | GET /manga-products/_segments — look for > 20 segments per shard |
| 2 | Force merge | POST /manga-products/_forcemerge?max_num_segments=1 |
| 3 | Wait for the merge to complete | A force merge on 100K docs takes ~10-15 minutes |
| 4 | Validate latency | kNN latency should drop 30-50% after the merge (fewer segments to search) |
| 5 | Schedule post-bulk-load merges | Always run a force merge after the nightly catalog sync (automation sketched below) |
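
Steps 1, 2, and 5 are scriptable with `opensearch-py`; the per-shard segment threshold of 20 comes from the decision tree above:

```python
from opensearchpy import OpenSearch

def merge_if_fragmented(client: OpenSearch, index: str = "manga-products",
                        max_segments: int = 20) -> None:
    """Force-merge the index when any shard holds too many search segments."""
    stats = client.indices.segments(index=index)
    for shard_copies in stats["indices"][index]["shards"].values():
        for copy in shard_copies:
            if copy["num_search_segments"] > max_segments:
                # Blocking call; ~10-15 min at 100K docs per the table above.
                client.indices.forcemerge(index=index, max_num_segments=1)
                return
```

Hooking this into the nightly catalog sync covers step 5.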

RC3: Shard Too Large

| Step | Action | Detail |
| --- | --- | --- |
| 1 | Calculate the target shard count | 100K docs / 25K docs per shard = 4 shards |
| 2 | Create a new index with 4 shards | Use the same mapping with number_of_shards: 4 |
| 3 | Reindex from old to new | POST /_reindex from manga-products-v1 to manga-products-v2 |
| 4 | Alias swap | Update the manga-products alias to point to manga-products-v2 |
| 5 | Delete the old index | After validation, remove manga-products-v1 |

RC4: HNSW ef_search Too High

| Step | Action | Detail |
| --- | --- | --- |
| 1 | Run a recall-latency benchmark at multiple ef_search values | Test ef_search = 64, 128, 256, 512 |
| 2 | Find the knee point | Typically recall@10 plateaus while latency keeps climbing |
| 3 | Pick ef_search for the new index size | For 100K docs, ef_search = 128 achieves recall@10 = 0.95 at ~42ms vs 0.97 at ~85ms for ef_search = 256 (see the benchmark table below) |
| 4 | Update the index setting | PUT /manga-products/_settings { "knn.algo_param.ef_search": 128 } |
| 5 | Validate | Confirm recall@10 >= 0.95 and latency < 50ms |

Benchmark reference table:

| ef_search | recall@10 (100K docs) | Latency p95 | Recommendation |
| --- | --- | --- | --- |
| 64 | 0.89 | 22ms | Too low recall |
| 128 | 0.95 | 42ms | Good balance for 100K |
| 256 | 0.97 | 85ms | Was optimal for 10K; too slow for 100K |
| 512 | 0.98 | 180ms | Way too slow |

Prevention

| Measure | Implementation |
| --- | --- |
| Pre-expansion capacity test | Before any 5x+ catalog growth, run a load test on a staging index with the target document count |
| Shard sizing guidelines | Document the rule: max 25K HNSW docs per shard for 1536-dim vectors |
| Post-bulk-load automation | Lambda triggered after bulk indexing: (1) force merge, (2) run the latency benchmark, (3) alert if over threshold |
| OCU scaling policy | Auto-scale search OCUs based on the latency metric, not just utilization |
| ef_search tuning automation | Script that binary-searches for the optimal ef_search given a target recall and latency budget (sketched below) |
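
The tuning script in the last row can binary-search ef_search against a recall target, reusing a recall@10 benchmark like the one sketched in Scenario 1 RC4. The `benchmark` callable is assumed, and recall is treated as monotonic in ef_search:

```python
import time

def tune_ef_search(client, benchmark, target_recall: float = 0.95,
                   lo: int = 32, hi: int = 512) -> int:
    """Find the smallest ef_search whose recall@10 meets the target."""
    best = hi
    while lo <= hi:
        mid = (lo + hi) // 2
        client.indices.put_settings(
            index="manga-products",
            body={"index": {"knn.algo_param.ef_search": mid}},
        )
        time.sleep(2)  # let the dynamic setting propagate
        if benchmark(client) >= target_recall:
            best, hi = mid, mid - 1   # target met: try a cheaper setting
        else:
            lo = mid + 1              # recall too low: explore more graph nodes
    return best
```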

Scenario 5: Re-ranking Latency Exceeding Budget, Causing Total Retrieval > 200ms

Problem Statement

MangaAssist enables Claude Haiku re-ranking for all queries (not just buy-intent) to maximize retrieval quality. Haiku re-ranking adds 120-180ms per query, pushing total retrieval latency to 280-350ms at p95. The 200ms SLA is violated. Customer-visible page load times increase from 1.5s to 2.2s.

Detection

```mermaid
graph TB
    subgraph Signals["Detection Signals"]
        style Signals fill:#e94560,stroke:#16213e,color:#fff
        S1["CloudWatch alarm:<br/>retrieval_total_latency_p95 > 200ms"]
        S2["rerank_latency_p95<br/>= 155ms (budget: 60ms)"]
        S3["Bedrock Haiku invocations<br/>= 100% of searches<br/>(expected: 20% buy-intent only)"]
        S4["Monthly Bedrock cost<br/>for reranking: $3,200<br/>(expected: $640)"]
    end

    S1 & S2 & S3 & S4 --> DIAGNOSE["Latency<br/>Budget Analysis"]
```

Latency budget violation analysis:

| Stage | Budget | Actual | Status |
| --- | --- | --- | --- |
| Query preprocessing | 15ms | 12ms | OK |
| Embedding generation | 25ms | 22ms | OK |
| Hybrid search | 80ms | 72ms | OK |
| Re-ranking | 60ms | 155ms | OVER by 95ms |
| Result assembly | 20ms | 15ms | OK |
| Total | 200ms | 276ms | SLA BREACH |

Root Cause Analysis — Decision Tree

```mermaid
graph TD
    START["Re-ranking latency<br/>155ms, budget 60ms"] --> CHECK_METHOD["Check: Which re-ranking<br/>method is active?"]

    CHECK_METHOD -->|Claude Haiku| CHECK_SCOPE["Check: Is Haiku called<br/>for ALL queries?"]
    CHECK_METHOD -->|Cross-encoder| CHECK_BATCH["Check: Batch size<br/>for cross-encoder"]

    CHECK_SCOPE -->|Yes — all queries| RC1["ROOT CAUSE 1:<br/>Haiku re-ranking enabled<br/>globally instead of<br/>buy-intent only.<br/>Config error."]
    CHECK_SCOPE -->|No — only buy-intent| CHECK_INPUT["Check: Input size<br/>to Haiku"]

    CHECK_INPUT -->|> 20 candidates| RC2["ROOT CAUSE 2:<br/>Too many candidates<br/>sent to Haiku.<br/>Heuristic stage not<br/>filtering to top-20."]
    CHECK_INPUT -->|<= 20 candidates| CHECK_HAIKU_LATENCY["Check: Haiku<br/>invocation latency"]

    CHECK_HAIKU_LATENCY -->|> 100ms| RC3["ROOT CAUSE 3:<br/>Haiku cold start or<br/>Bedrock throttling.<br/>No provisioned throughput."]
    CHECK_HAIKU_LATENCY -->|< 100ms| RC4["ROOT CAUSE 4:<br/>Prompt too large.<br/>Excessive metadata in<br/>re-ranking prompt."]

    CHECK_BATCH -->|> 20 docs| RC5["ROOT CAUSE 5:<br/>Cross-encoder batch too<br/>large. Model scoring 50<br/>pairs instead of 20."]
    CHECK_BATCH -->|<= 20 docs| ESCALATE["Escalate: Model inference<br/>itself is slow — check<br/>ECS task sizing."]
```

Resolution by Root Cause

RC1: Haiku Re-ranking Enabled Globally (Most Likely)

| Step | Action | Detail |
| --- | --- | --- |
| 1 | Check the reranker configuration | Review the ECS task environment variable RERANK_METHOD |
| 2 | Fix: set intent-conditional logic | Only invoke Haiku for intent == "buy" (20% of queries) |
| 3 | Default to cross-encoder or heuristic-only | For browse/recommend/research, use the cross-encoder (30ms) or heuristic-only (5ms) |
| 4 | Deploy the config change | ECS rolling deployment, no downtime |
| 5 | Validate | Confirm retrieval_total_latency_p95 < 200ms within 15 minutes |

Intent-to-reranker mapping:

| Intent | Re-rank Method | Expected Latency | Quality (NDCG@5) |
| --- | --- | --- | --- |
| buy (20% of queries) | Claude Haiku | 80-150ms | 0.93 |
| recommend (35%) | Cross-encoder | 30-50ms | 0.88 |
| browse (30%) | Heuristic only | 5ms | 0.82 |
| research (15%) | Cross-encoder | 30-50ms | 0.88 |
| Weighted average | | ~40ms | ~0.87 |

RC2: Too Many Candidates Sent to Haiku

| Step | Action |
| --- | --- |
| 1 | Check the heuristic stage output count — it should be capped at 20 |
| 2 | If the heuristic stage is outputting 50 (i.e., passing all hybrid results through), fix the heuristic_cutoff parameter |
| 3 | Set heuristic_cutoff = 20 in the RelevanceReranker constructor |
| 4 | Validate: Haiku input should be ~1.5K tokens (20 candidates) instead of ~4K tokens (50 candidates) |

RC3: Haiku Cold Start / Throttling

| Step | Action |
| --- | --- |
| 1 | Check Bedrock CloudWatch metrics: InvocationLatency, ThrottleCount |
| 2 | If throttled: request a Bedrock quota increase for Haiku in ap-northeast-1 |
| 3 | If cold start: implement a keep-warm strategy (invoke Haiku every 30s with a dummy request) |
| 4 | Consider Bedrock provisioned throughput for Haiku if QPS is sustained and high |

RC4: Prompt Too Large

| Step | Action |
| --- | --- |
| 1 | Audit the re-ranking prompt — how many tokens per candidate? |
| 2 | Reduce candidate metadata to title + genre + rating only (remove description, tags, author bio) |
| 3 | Target < 100 tokens per candidate: 20 candidates ≈ 2K input tokens |
| 4 | Validate: Haiku response time should drop from ~150ms to ~80ms |

Tiered Re-ranking Decision Flowchart

```mermaid
graph TD
    QUERY["Incoming Query"] --> INTENT["Detect Intent"]

    INTENT -->|Buy| BUDGET_CHECK_BUY["Remaining latency<br/>budget > 100ms?"]
    INTENT -->|Recommend| BUDGET_CHECK_REC["Remaining latency<br/>budget > 50ms?"]
    INTENT -->|Browse| HEURISTIC["Heuristic Only<br/>(5ms)"]
    INTENT -->|Research| BUDGET_CHECK_RES["Remaining latency<br/>budget > 50ms?"]

    BUDGET_CHECK_BUY -->|Yes| HAIKU["Claude Haiku<br/>Re-rank (80-150ms)"]
    BUDGET_CHECK_BUY -->|No| CROSS_ENC_BUY["Cross-Encoder<br/>Re-rank (30-50ms)"]

    BUDGET_CHECK_REC -->|Yes| CROSS_ENC_REC["Cross-Encoder<br/>Re-rank (30-50ms)"]
    BUDGET_CHECK_REC -->|No| HEURISTIC_REC["Heuristic Only<br/>(5ms)"]

    BUDGET_CHECK_RES -->|Yes| CROSS_ENC_RES["Cross-Encoder<br/>Re-rank (30-50ms)"]
    BUDGET_CHECK_RES -->|No| HEURISTIC_RES["Heuristic Only<br/>(5ms)"]

    HAIKU --> RESULT["Top-5 Results"]
    CROSS_ENC_BUY --> RESULT
    CROSS_ENC_REC --> RESULT
    CROSS_ENC_RES --> RESULT
    HEURISTIC --> RESULT
    HEURISTIC_REC --> RESULT
    HEURISTIC_RES --> RESULT
```

Prevention

| Measure | Implementation |
| --- | --- |
| Latency budget enforcement | Middleware tracks elapsed time per stage; if cumulative time exceeds 140ms before re-ranking, auto-downgrade to heuristic-only |
| Reranker method per intent | Configuration table mapping intent → reranker method, reviewed monthly |
| Bedrock latency monitoring | CloudWatch alarm on InvocationLatency p95 > 100ms for the Haiku model ID |
| Cost alerting | CloudWatch alarm on daily Bedrock re-ranking cost > $25/day (expected: ~$21/day at current volume) |
| Progressive enhancement | Start all queries with heuristic-only; upgrade to cross-encoder if the latency budget allows; upgrade to Haiku only for buy intent with budget remaining (selection sketched below) |
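
The first and last rows combine into a single selection function. A sketch; the thresholds mirror the tiered flowchart above, and the returned tags stand in for the real reranker callables:

```python
import time

def pick_reranker(intent: str, started_at: float, budget_ms: float = 200.0) -> str:
    """Budget-aware downgrade: Haiku only for buy intent with >= 100ms left,
    cross-encoder when >= 50ms remains, heuristic-only otherwise."""
    elapsed_ms = (time.monotonic() - started_at) * 1000.0
    remaining = budget_ms - elapsed_ms
    if intent == "buy" and remaining >= 100.0:
        return "haiku"
    if intent in ("buy", "recommend", "research") and remaining >= 50.0:
        return "cross_encoder"
    return "heuristic"  # browse, or any intent that has exhausted its budget
```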

Cross-Scenario Summary

Common Patterns Across All 5 Scenarios

| Pattern | Scenarios | Key Lesson |
| --- | --- | --- |
| Metric degradation precedes user complaints | 1, 2, 3 | Invest in automated quality metrics (NDCG, MRR, CTR) — detect issues before customers notice |
| Config changes cascade unpredictably | 3, 5 | Always A/B test configuration changes; never deploy to 100% without validation |
| Scale changes require re-tuning | 4 | Parameters optimized for 10K docs may be wrong for 100K docs; re-benchmark after growth |
| Japanese text processing needs special attention | 2 | Standard tokenizers fail on manga-specific terms; custom dictionaries are mandatory |
| Latency budgets need per-stage enforcement | 5 | Track latency per stage, not just end-to-end; auto-degrade expensive stages when the budget is tight |

Monitoring Dashboard — Retrieval Health

| Metric | Source | Alarm Threshold | Check Frequency |
| --- | --- | --- | --- |
| retrieval_total_latency_p95 | ECS application metric | > 200ms | Real-time (1-min) |
| knn_search_latency_p95 | OpenSearch + app timer | > 80ms | Real-time (1-min) |
| bm25_search_latency_p95 | OpenSearch + app timer | > 40ms | Real-time (1-min) |
| rerank_latency_p95 | App timer | > 60ms | Real-time (1-min) |
| zero_result_rate | App counter | > 3% | Hourly |
| ndcg_at_5_weekly | Offline eval pipeline | < 0.80 | Weekly |
| mrr_weekly | Offline eval pipeline | < 0.80 | Weekly |
| search_ctr_7d | Click tracking | < 14% | Daily |
| opensearch_ocu_utilization | CloudWatch | > 85% | Real-time (5-min) |
| bedrock_rerank_cost_daily | Cost Explorer | > $30/day | Daily |