AWS AIP-C01 Task 4.2 → Skill 4.2.2: Optimize retrieval mechanisms for FM-augmented applications
System: MangaAssist e-commerce chatbot — Bedrock Claude 3 Sonnet/Haiku, OpenSearch Serverless (HNSW k-NN), DynamoDB, ECS Fargate, ElastiCache Redis
Format: 5 production scenarios, each with Problem → Detection → Root Cause → Resolution → Prevention and decision trees
Skill Mapping
| AWS AIP-C01 Element | Coverage |
| --- | --- |
| Task 4.2 | Optimize application performance for FM workloads |
| Skill 4.2.2 | Optimize retrieval mechanisms to improve FM-augmented application performance |
| This File | 5 operational scenarios covering vector search quality, Japanese tokenization, hybrid scoring, index scaling, and re-ranking latency |
| MangaAssist Context | Production troubleshooting for a RAG pipeline serving 100K+ manga products to Japanese-speaking customers with a < 200ms retrieval target |
Scenario Overview
| # | Scenario | Impact | Severity | Detection Time |
| --- | --- | --- | --- | --- |
| 1 | Vector search returning irrelevant manga | Incorrect recommendations → customer frustration | High | Minutes (quality metric drop) |
| 2 | Japanese query tokenization failure | Missed results for rare manga titles → zero-result pages | High | Hours (long-tail query analysis) |
| 3 | Hybrid search score fusion producing unexpected ranking | Popular titles ranked below obscure ones → CTR drop | Medium | Hours (A/B metric divergence) |
| 4 | Index performance degradation after catalog expansion | Search latency exceeds budget → slow page loads | High | Minutes (latency alarm) |
| 5 | Re-ranking latency exceeding budget | Total retrieval > 200ms → SLA violation | Critical | Seconds (p95 alarm) |
Scenario 1: Vector Search Returning Irrelevant Manga (Embedding Quality Degradation)
Problem Statement
MangaAssist's kNN vector search begins returning semantically irrelevant results. A customer searching for "dark psychological thriller manga" receives results like children's comedy titles and cooking manga. NDCG@5 drops from 0.85 to 0.58 over a two-week period.
Detection
```mermaid
graph TB
subgraph Signals["Detection Signals"]
style Signals fill:#e94560,stroke:#16213e,color:#fff
S1["NDCG@5 weekly eval<br/>drops below 0.75 threshold"]
S2["Customer feedback:<br/>'recommendations are wrong'"]
S3["CTR on search results<br/>drops from 18% to 11%"]
S4["kNN result diversity<br/>collapses — same titles<br/>appear for different queries"]
end
subgraph Metrics["CloudWatch Alarms"]
style Metrics fill:#0f3460,stroke:#16213e,color:#fff
M1["Alarm: ndcg_weekly < 0.75"]
M2["Alarm: search_ctr_7d < 0.14"]
M3["Alarm: zero_result_rate > 3%"]
end
S1 --> M1
S3 --> M2
S1 & S2 & S3 & S4 --> INVESTIGATE["Begin<br/>Investigation"]
```
Key monitoring query (CloudWatch Logs Insights):
```
fields @timestamp, query_text, knn_top1_score, knn_top5_avg_score, ndcg_at_5
| filter knn_top1_score < 0.70
| stats count() as low_quality_searches, avg(ndcg_at_5) as avg_ndcg by bin(1h) as hour
| sort hour desc
```
Root Cause Analysis — Decision Tree
```mermaid
graph TD
START["kNN returning<br/>irrelevant results"] --> CHECK_EMB["Check: Were embeddings<br/>recently re-indexed?"]
CHECK_EMB -->|Yes| EMB_MODEL["Check: Was the embedding<br/>model changed?"]
CHECK_EMB -->|No| CHECK_DATA["Check: Was new data<br/>bulk-loaded recently?"]
EMB_MODEL -->|Yes — model version changed| RC1["ROOT CAUSE 1:<br/>Embedding model mismatch<br/>Old docs: Titan v1<br/>New docs: Titan v2<br/>Vectors are incompatible"]
EMB_MODEL -->|No — same model| CHECK_PROMPT["Check: Was the embedding<br/>input format changed?"]
CHECK_PROMPT -->|Yes| RC2["ROOT CAUSE 2:<br/>Embedding input drift<br/>e.g., title-only → title+description<br/>changes vector space geometry"]
CHECK_PROMPT -->|No| CHECK_INDEX["Check: HNSW index params"]
CHECK_DATA -->|Yes — bulk load happened| CHECK_QUALITY["Check: Quality of<br/>new product data"]
CHECK_DATA -->|No| CHECK_DRIFT["Check: Query distribution<br/>shift over time"]
CHECK_QUALITY -->|Bad data: empty descriptions,<br/>wrong language, duplicates| RC3["ROOT CAUSE 3:<br/>Data quality degradation<br/>Garbage-in → garbage-out<br/>embeddings"]
CHECK_QUALITY -->|Data looks fine| CHECK_INDEX
CHECK_INDEX -->|ef_search too low| RC4["ROOT CAUSE 4:<br/>HNSW recall degradation<br/>ef_search insufficient for<br/>current index size"]
CHECK_INDEX -->|Params normal| CHECK_DRIFT
CHECK_DRIFT -->|Query patterns changed| RC5["ROOT CAUSE 5:<br/>Concept drift<br/>New manga genres/trends<br/>not represented in<br/>embedding training data"]
CHECK_DRIFT -->|Stable| ESCALATE["Escalate: Unknown<br/>root cause — deeper<br/>investigation needed"]
```
Resolution by Root Cause
RC1: Embedding Model Mismatch
| Step | Action | Command / Detail |
| --- | --- | --- |
| 1 | Identify mixed-model documents | Query for documents indexed before vs. after the model switch; check the embedding_model_version metadata field |
| 2 | Halt new indexing | Pause the DynamoDB-to-OpenSearch pipeline Lambda |
| 3 | Re-embed ALL documents with the new model | Run a batch Titan v2 embedding job for all 100K products (~17 min at a sustained 100 TPS; longer with throttling and retries) — see the sketch below |
| 4 | Bulk re-index | Use the _bulk API to replace all vectors in the manga-products index |
| 5 | Validate | Run the NDCG evaluation on the 500-query test set; confirm >= 0.85 |
| 6 | Resume pipeline | Re-enable the DynamoDB stream Lambda |
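A minimal sketch of steps 3-4, assuming the opensearch-py client and the Titan v2 model ID amazon.titan-embed-text-v2:0; the endpoint, auth, and load_products() loader are placeholders, and the field names mirror the scenario tables:

```python
import json
import boto3
from opensearchpy import OpenSearch, helpers

bedrock = boto3.client("bedrock-runtime", region_name="ap-northeast-1")
os_client = OpenSearch(hosts=["https://example-collection.aoss.amazonaws.com"])  # auth omitted

def embed(text: str) -> list[float]:
    """Embed one product text with Titan v2."""
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]

def reembed_actions(products):
    """Yield bulk update actions that overwrite each stale vector."""
    for p in products:
        yield {
            "_op_type": "update",
            "_index": "manga-products",
            "_id": p["manga_id"],
            "doc": {
                "embedding": embed(f'{p["title_ja"]} {p["description"]}'),
                "embedding_model_version": "titan-v2",  # enables the mismatch alarm
            },
        }

# load_products() is a hypothetical DynamoDB export reader
helpers.bulk(os_client, reembed_actions(load_products()), chunk_size=200)
```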
RC3: Data Quality Degradation
| Step | Action |
| --- | --- |
| 1 | Identify bad documents: empty description, non-Japanese text in title_ja, duplicate manga_id |
| 2 | Quarantine bad documents in a DynamoDB quarantine table |
| 3 | Delete bad vectors from the OpenSearch index |
| 4 | Fix the data pipeline: add a validation Lambda between DynamoDB and embedding generation |
| 5 | Re-run embedding for the corrected documents |
| 6 | Validate NDCG recovery |
RC4: HNSW Recall Degradation
| Step | Action |
| --- | --- |
| 1 | Check the current ef_search value (should be 256 for MangaAssist) |
| 2 | Run a recall benchmark: query 100 known-relevant pairs, measure recall@10 |
| 3 | If recall < 0.95, increase ef_search to 512 (at the cost of additional query latency; see the Scenario 4 benchmark table for the trade-off) |
| 4 | If recall is still poor, check whether the index needs a force-merge (fragmented segments) |
| 5 | If the index exceeds 200K docs on the current shard count, add shards (re-index required) |
Prevention
| Measure | Implementation |
| --- | --- |
| Embedding model version tracking | Store model_version as a metadata field on every document; alarm on version mismatch |
| Data quality gate | Validation Lambda checks: description length > 20 chars, language detection matches expected, no duplicate IDs |
| Weekly NDCG evaluation | Automated pipeline runs the 500-query eval set every Sunday; alarms if NDCG@5 < 0.80 |
| Embedding drift detection | Compare the average vector centroid monthly; alarm if centroid shift > 0.1 cosine distance (sketch below) |
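The drift check in the last row reduces to a cosine distance between monthly centroids. A minimal sketch, assuming vectors are sampled from the index into numpy arrays; the sampling and metric-publishing helpers are hypothetical:

```python
import numpy as np

DRIFT_THRESHOLD = 0.1  # cosine distance, per the prevention table

def centroid(vectors: np.ndarray) -> np.ndarray:
    """Mean embedding, L2-normalized so the cosine comparison is well-defined."""
    c = vectors.mean(axis=0)
    return c / np.linalg.norm(c)

def centroid_drift(prev_month: np.ndarray, this_month: np.ndarray) -> float:
    """Cosine distance between the two monthly centroids."""
    return 1.0 - float(np.dot(centroid(prev_month), centroid(this_month)))

# prev_vecs / curr_vecs: (n, 1536) arrays sampled from the index (hypothetical helper)
drift = centroid_drift(prev_vecs, curr_vecs)
if drift > DRIFT_THRESHOLD:
    emit_drift_alarm_metric(drift)  # hypothetical CloudWatch put_metric_data wrapper
```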
Scenario 2: Japanese Query Tokenization Failure for Rare Manga Titles
Problem Statement
Customers searching for rare or new manga titles in Japanese receive zero results or irrelevant results. Example: searching for "呪術廻戦0 東京都立呪術高等専門学校" (Jujutsu Kaisen 0: Tokyo Metropolitan Magic Technical College) returns nothing, while "呪術廻戦" (Jujutsu Kaisen) works fine. The issue affects ~5% of queries — specifically long compound titles and titles with unusual kanji combinations.
Detection
```mermaid
graph TB
subgraph Signals["Detection Signals"]
style Signals fill:#e94560,stroke:#16213e,color:#fff
S1["Zero-result rate spikes<br/>for Japanese queries<br/>from 1.2% to 5.8%"]
S2["Customer support tickets:<br/>'cannot find [specific title]'"]
S3["BM25 component returns 0<br/>while kNN returns some results"]
S4["Query length > 10 chars<br/>has 3x higher zero-result rate"]
end
S1 & S2 & S3 & S4 --> INVESTIGATE["Tokenization<br/>Investigation"]
```
Detection query:
```
fields @timestamp, query_text, bm25_result_count, knn_result_count, query_language
| filter query_language = "ja" and bm25_result_count = 0
| stats count() as zero_bm25 by bin(1d) as day
| sort day desc
```
Root Cause Analysis — Decision Tree
```mermaid
graph TD
START["Japanese query<br/>returns 0 BM25 results"] --> CHECK_EXIST["Check: Does the product<br/>exist in the index?"]
CHECK_EXIST -->|No| RC_MISSING["NOT A TOKENIZATION ISSUE:<br/>Product not yet indexed.<br/>Check DynamoDB → OpenSearch pipeline."]
CHECK_EXIST -->|Yes| CHECK_ANALYZE["Run _analyze API on<br/>the query text"]
CHECK_ANALYZE --> CHECK_TOKENS["Compare query tokens<br/>vs indexed tokens"]
CHECK_TOKENS -->|Tokens don't overlap| CHECK_KUROMOJI["Check: kuromoji<br/>dictionary version"]
CHECK_KUROMOJI -->|Default dict only| RC1["ROOT CAUSE 1:<br/>Missing custom dictionary<br/>Rare manga terms not in<br/>kuromoji default dictionary"]
CHECK_KUROMOJI -->|Custom dict present| CHECK_COMPOUND["Check: Is the title<br/>a compound word?"]
CHECK_COMPOUND -->|Yes — long compound| RC2["ROOT CAUSE 2:<br/>Over-segmentation<br/>kuromoji splits the title<br/>into too many fragments<br/>that don't match indexed form"]
CHECK_COMPOUND -->|No| CHECK_READING["Check: Reading form<br/>normalization"]
CHECK_READING -->|Katakana vs Hiragana<br/>mismatch| RC3["ROOT CAUSE 3:<br/>Script normalization failure<br/>Query in カタカナ but index<br/>stores ひらがな form"]
CHECK_READING -->|Same script| CHECK_FULLWIDTH["Check: Fullwidth vs<br/>halfwidth characters"]
CHECK_FULLWIDTH -->|Mismatch detected| RC4["ROOT CAUSE 4:<br/>Unicode normalization gap<br/>Fullwidth numbers/letters<br/>not normalized to halfwidth"]
CHECK_FULLWIDTH -->|No mismatch| RC5["ROOT CAUSE 5:<br/>Analyzer configuration error<br/>Wrong analyzer assigned to field"]
```
Resolution by Root Cause
RC1: Missing Custom Dictionary Entries
The kuromoji tokenizer's built-in dictionary does not include many manga-specific terms, especially newer titles and genre terms.
| Step | Action | Detail |
| --- | --- | --- |
| 1 | Identify failing terms | Collect all zero-result queries from the last 30 days; run them through the _analyze API |
| 2 | Build a custom user dictionary | Create user_dictionary.txt with manga titles, author names, and genre terms |
| 3 | Update the analyzer config | Attach the user dictionary to the kuromoji_tokenizer (user_dictionary or user_dictionary_rules); see the sketch after the dictionary example |
| 4 | Re-index | A full re-index is required to apply the new analyzer to existing documents |
| 5 | Validate | Re-run the failing queries; confirm non-zero BM25 results |
Custom dictionary format (user_dictionary.txt):
```
呪術廻戦,呪術廻戦,ジュジュツカイセン,カスタム名詞
鬼滅の刃,鬼滅の刃,キメツノヤイバ,カスタム名詞
進撃の巨人,進撃の巨人,シンゲキノキョジン,カスタム名詞
東京都立呪術高等専門学校,東京都立呪術高等専門学校,トウキョウトリツジュジュツコウトウセンモンガッコウ,カスタム名詞
チェンソーマン,チェンソーマン,チェンソーマン,カスタム名詞
```
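One way to wire this dictionary into the index, sketched with opensearch-py. Whether the deployment accepts a packaged file (user_dictionary) or inline entries (user_dictionary_rules) varies, so treat the settings body below as an assumption to adapt, not a verified Serverless configuration:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://example-collection.aoss.amazonaws.com"])  # auth omitted

index_body = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "manga_kuromoji": {
                    "type": "kuromoji_tokenizer",
                    # same CSV format as user_dictionary.txt, inlined
                    "user_dictionary_rules": [
                        "呪術廻戦,呪術廻戦,ジュジュツカイセン,カスタム名詞",
                        "東京都立呪術高等専門学校,東京都立呪術高等専門学校,トウキョウトリツジュジュツコウトウセンモンガッコウ,カスタム名詞",
                    ],
                }
            },
            "analyzer": {
                "manga_ja": {
                    "type": "custom",
                    "char_filter": ["icu_normalizer"],  # NFKC before tokenization (see RC4)
                    "tokenizer": "manga_kuromoji",
                    "filter": ["kuromoji_baseform", "ja_stop", "lowercase"],
                }
            },
        }
    },
    "mappings": {"properties": {"title_ja": {"type": "text", "analyzer": "manga_ja"}}},
}
client.indices.create(index="manga-products-v2", body=index_body)

# Spot-check tokenization before committing to the full re-index
client.indices.analyze(
    index="manga-products-v2",
    body={"analyzer": "manga_ja", "text": "東京都立呪術高等専門学校"},
)
```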
RC2: Over-segmentation of Compound Titles
| Step |
Action |
| 1 |
Test with _analyze API: check how kuromoji tokenizes the compound title |
| 2 |
If over-segmented, add the full compound as a custom dictionary entry (see RC1) |
| 3 |
Additionally, add a shingle filter to create bigrams/trigrams that capture partial compounds |
| 4 |
For the BM25 query, reduce minimum_should_match from 30% to 20% for long queries (> 8 tokens) |
RC4: Unicode Normalization Gap
| Step | Action |
| --- | --- |
| 1 | Add the icu_normalizer char filter to the analyzer chain (before the kuromoji tokenizer) |
| 2 | Ensure NFKC normalization is applied to both queries and indexed text |
| 3 | Update the QueryPreprocessor to apply NFKC normalization before sending queries to OpenSearch (sketch below) |
| 4 | Re-index with the updated analyzer |
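The query-side half (step 3) needs only the standard library. A minimal sketch; QueryPreprocessor is the component named above, and its internals here are an assumption:

```python
import unicodedata

def normalize_query(text: str) -> str:
    """NFKC folds fullwidth digits/letters to halfwidth and normalizes spaces."""
    return unicodedata.normalize("NFKC", text).strip()

# Fullwidth "２３" and the ideographic space both normalize away:
assert normalize_query("鬼滅の刃　２３巻") == "鬼滅の刃 23巻"
```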
Prevention
| Measure | Implementation |
| --- | --- |
| Custom dictionary maintenance | Monthly review of zero-result queries; add new manga titles/terms to the custom dictionary |
| Automated tokenization tests | CI pipeline runs 200 known manga titles through the _analyze API and verifies the expected tokens (sketch below) |
| Fallback to kNN | If BM25 returns 0 results, automatically fall back to kNN-only search (already implemented in the hybrid pipeline) |
| Query normalization | Apply NFKC + script normalization in the QueryPreprocessor before any search operation |
| New title onboarding | When a new manga is added to the catalog, its title is automatically added to the custom dictionary queue |
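A sketch of the automated tokenization test from the table, assuming pytest, the manga_ja analyzer from RC1, and an illustrative expectations list:

```python
import pytest
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://example-collection.aoss.amazonaws.com"])  # auth omitted

# (title, tokens that must survive analysis); illustrative entries
EXPECTED = [
    ("呪術廻戦", ["呪術廻戦"]),
    ("東京都立呪術高等専門学校", ["東京都立呪術高等専門学校"]),
]

@pytest.mark.parametrize("title,must_contain", EXPECTED)
def test_title_tokenization(title, must_contain):
    resp = client.indices.analyze(
        index="manga-products", body={"analyzer": "manga_ja", "text": title}
    )
    tokens = [t["token"] for t in resp["tokens"]]
    for tok in must_contain:
        assert tok in tokens, f"{title!r} lost expected token {tok!r}"
```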
Scenario 3: Hybrid Search Score Fusion Producing Unexpected Ranking
Problem Statement
After switching from weighted linear fusion to Reciprocal Rank Fusion (RRF), search-result quality degrades for buy-intent queries. Customers searching for specific titles (e.g., "鬼滅の刃 23巻") see the exact product at rank 3-4 instead of rank 1. CTR for buy-intent queries drops from 24% to 17%.
Detection
```mermaid
graph TB
subgraph Signals["Detection Signals"]
style Signals fill:#e94560,stroke:#16213e,color:#fff
S1["CTR drops for<br/>buy-intent queries<br/>24% → 17%"]
S2["MRR drops for<br/>exact-title queries<br/>0.92 → 0.71"]
S3["A/B test shows<br/>RRF underperforms Linear<br/>for buy-intent segment"]
S4["BM25 rank-1 results<br/>are demoted to rank 3-4<br/>after RRF fusion"]
end
S1 & S2 & S3 & S4 --> DIAGNOSE["Fusion<br/>Diagnosis"]
```
Root Cause Analysis — Decision Tree
```mermaid
graph TD
START["Exact-title products<br/>ranked lower than expected<br/>after score fusion"] --> CHECK_BM25["Check: BM25 ranking<br/>for the query"]
CHECK_BM25 -->|BM25 rank 1 = correct product| CHECK_KNN["Check: kNN ranking<br/>for the query"]
CHECK_BM25 -->|BM25 rank > 1| RC_BM25["BM25 issue —<br/>not a fusion problem.<br/>Check analyzer/boost config."]
CHECK_KNN -->|kNN rank 1 = different product| CHECK_FUSION["Check: How RRF<br/>combines the ranks"]
CHECK_KNN -->|kNN rank 1 = same product| RC_OTHER["Both agree on rank 1.<br/>Check re-ranking stage."]
CHECK_FUSION --> ANALYZE_RRF["RRF formula:<br/>BM25 rank 1 → 1/(60+1) = 0.0164<br/>kNN rank 15 → 1/(60+15) = 0.0133<br/>Total: 0.0297"]
ANALYZE_RRF --> COMPARE["Compare with competitor:<br/>BM25 rank 5 → 1/(60+5) = 0.0154<br/>kNN rank 1 → 1/(60+1) = 0.0164<br/>Total: 0.0318"]
COMPARE --> RC1["ROOT CAUSE 1:<br/>RRF is rank-equalizing.<br/>A doc ranked #1 by BM25 but #15 by kNN<br/>loses to a doc ranked #5 and #1.<br/>RRF penalizes rank disagreement."]
RC1 --> FIX["FIX: Use intent-aware<br/>fusion strategy.<br/>Buy intent → BM25-heavy linear.<br/>Recommend intent → RRF."]
```
Detailed Explanation
RRF is designed to be democratic across rankers: it weights BM25 and kNN contributions equally. This is ideal for exploratory queries where both signals matter equally. But for buy-intent queries, BM25 rank 1 (exact title match) is a much stronger signal than kNN rank 1 (semantic similarity). RRF does not capture this asymmetry.
Worked example (linear scores shown under the kNN-heavy default, α=0.6 kNN / β=0.4 BM25):
| Document | BM25 Rank | kNN Rank | RRF Score (k=60) | Linear Score (0.4·BM25 + 0.6·kNN) |
| --- | --- | --- | --- | --- |
| Demon Slayer Vol 23 (correct) | 1 | 15 | 1/(61) + 1/(75) = 0.0297 | 0.4(1.0) + 0.6(0.60) = 0.760 |
| Jujutsu Kaisen Vol 25 (wrong) | 5 | 1 | 1/(65) + 1/(61) = 0.0318 | 0.4(0.75) + 0.6(1.0) = 0.900 |
| Demon Slayer Vol 22 | 2 | 20 | 1/(62) + 1/(80) = 0.0286 | 0.4(0.95) + 0.6(0.52) = 0.692 |
Under RRF, Jujutsu Kaisen Vol 25 outranks Demon Slayer Vol 23 because RRF treats both rankers equally: its kNN rank 1 contributes 1/61 while Demon Slayer's kNN rank 15 contributes only 1/75, and Demon Slayer's small BM25 edge (1/61 vs 1/65) cannot compensate. The kNN-heavy linear default fails the same way (0.900 vs 0.760). Re-running the linear fusion with the buy-intent weights (β=0.6 BM25, α=0.4 kNN) gives Demon Slayer 0.6(1.0) + 0.4(0.60) = 0.840 and Jujutsu Kaisen 0.6(0.75) + 0.4(1.0) = 0.850: a near-tie rather than a clean win, because Jujutsu Kaisen's kNN dominance still counts. BM25-heavy weighting is therefore necessary but not always sufficient for buy intent, which is why buy-intent queries also receive Haiku re-ranking (Scenario 5) to promote the exact-title match.
Resolution
| Step | Action | Detail |
| --- | --- | --- |
| 1 | Implement intent-aware fusion selection | A create_fusion(intent) factory returns a different strategy per intent (sketched after the configuration table) |
| 2 | Buy intent → weighted linear (β=0.6 BM25, α=0.4 kNN) | Prioritize exact keyword match for purchase queries |
| 3 | Recommend intent → RRF (k=60) | Democratic fusion for exploratory discovery |
| 4 | Browse intent → RRF (k=60) | Default balanced approach |
| 5 | Research intent → weighted linear (α=0.5, β=0.5) | Balanced but score-aware for comparison queries |
| 6 | A/B test the intent-aware approach | Measure CTR and MRR per intent segment over 2 weeks |
Configuration table:
| Intent | Fusion Method | BM25 Weight | kNN Weight | Rationale |
| --- | --- | --- | --- | --- |
| buy | Weighted linear | 0.60 | 0.40 | Exact match is paramount |
| recommend | RRF (k=60) | Equal (rank-based) | Equal (rank-based) | Diversity matters |
| browse | RRF (k=60) | Equal | Equal | Balanced exploration |
| research | Weighted linear | 0.50 | 0.50 | Both exact and semantic matter |
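A sketch of the create_fusion(intent) factory from step 1. It assumes each retriever returns (doc_id, score) pairs sorted best-first; the min-max normalization choice is an assumption:

```python
def rrf(bm25, knn, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over rankers of 1 / (k + rank_d)."""
    scores = {}
    for results in (bm25, knn):
        for rank, (doc_id, _) in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

def weighted_linear(bm25, knn, bm25_w, knn_w):
    """Min-max normalize each ranker's scores to [0, 1], then combine."""
    def norm(results):
        if not results:
            return {}
        vals = [s for _, s in results]
        lo, span = min(vals), (max(vals) - min(vals)) or 1.0
        return {d: (s - lo) / span for d, s in results}
    b, v = norm(bm25), norm(knn)
    combined = {d: bm25_w * b.get(d, 0.0) + knn_w * v.get(d, 0.0)
                for d in set(b) | set(v)}
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

def create_fusion(intent):
    """Map intent to a fusion strategy per the configuration table."""
    if intent == "buy":
        return lambda b, k: weighted_linear(b, k, bm25_w=0.6, knn_w=0.4)
    if intent == "research":
        return lambda b, k: weighted_linear(b, k, bm25_w=0.5, knn_w=0.5)
    return rrf  # recommend / browse: RRF with k=60
```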
Prevention
| Measure | Implementation |
| --- | --- |
| Per-intent metric tracking | Track NDCG@5, MRR, and CTR separately for each intent category |
| Fusion strategy A/B framework | Always run new fusion configs through A/B testing before full rollout |
| Intent detection accuracy | Monitor intent classifier accuracy; misclassified intents cascade into the wrong fusion strategy |
| Rank disagreement monitoring | Track the average rank disagreement between BM25 and kNN; high disagreement means the fusion method matters more |
Scenario 4: Index Performance Degradation After Catalog Expansion (10K → 100K)
Problem Statement
MangaAssist expands its catalog from 10K to 100K manga products. After the bulk indexing completes, kNN search latency increases from 35ms to 180ms at p95, and BM25 latency increases from 15ms to 65ms. The total retrieval pipeline now takes ~350ms, far exceeding the 200ms budget.
Detection
```mermaid
graph TB
subgraph Signals["Detection Signals"]
style Signals fill:#e94560,stroke:#16213e,color:#fff
S1["CloudWatch alarm:<br/>knn_search_latency_p95 > 100ms"]
S2["CloudWatch alarm:<br/>bm25_search_latency_p95 > 50ms"]
S3["End-to-end retrieval<br/>latency > 200ms (SLA breach)"]
S4["OpenSearch OCU<br/>utilization > 85%"]
end
subgraph Timeline["Event Timeline"]
style Timeline fill:#0f3460,stroke:#16213e,color:#fff
T1["Day 0: Catalog at 10K<br/>kNN: 35ms, BM25: 15ms"]
T2["Day 1: Bulk load starts<br/>90K new products"]
T3["Day 2: Bulk load completes<br/>100K total"]
T4["Day 2+: Latency spikes<br/>kNN: 180ms, BM25: 65ms"]
end
T4 --> S1 & S2 & S3 & S4
```
Root Cause Analysis — Decision Tree
```mermaid
graph TD
START["Search latency spiked<br/>after 10x catalog expansion"] --> CHECK_OCU["Check: OCU utilization"]
CHECK_OCU -->|> 80%| RC1["ROOT CAUSE 1:<br/>Insufficient search OCUs.<br/>2 search OCUs cannot handle<br/>100K docs at target QPS."]
CHECK_OCU -->|< 80%| CHECK_SEGMENTS["Check: Segment count<br/>per shard"]
CHECK_SEGMENTS -->|> 20 segments per shard| RC2["ROOT CAUSE 2:<br/>Segment fragmentation.<br/>Bulk indexing created many<br/>small segments. Merges pending."]
CHECK_SEGMENTS -->|< 20| CHECK_SHARDS["Check: Shard sizing"]
CHECK_SHARDS -->|Single shard > 25GB or<br/>single shard > 50K HNSW docs| RC3["ROOT CAUSE 3:<br/>Shard too large.<br/>HNSW graph on a single shard<br/>is too big for efficient search."]
CHECK_SHARDS -->|Shard size OK| CHECK_HNSW["Check: HNSW parameters"]
CHECK_HNSW -->|ef_search = 256 was fine<br/>for 10K but slow for 100K| RC4["ROOT CAUSE 4:<br/>HNSW ef_search too high<br/>for new index size. Search<br/>is exploring too many nodes."]
CHECK_HNSW -->|ef_search reasonable| CHECK_FILTER["Check: Filter clause<br/>performance"]
CHECK_FILTER -->|Filters on non-keyword fields| RC5["ROOT CAUSE 5:<br/>Post-filter on text fields.<br/>Filtering after kNN search<br/>is slow on 100K docs."]
CHECK_FILTER -->|Filters on keyword fields| ESCALATE["Escalate: Complex interaction<br/>of multiple factors"]
```
Resolution by Root Cause
RC1: Insufficient Search OCUs
| Step | Action | Detail |
| --- | --- | --- |
| 1 | Scale search OCUs from 2 → 6 | OpenSearch Serverless scales in OCU pairs; 6 OCUs support ~1,500 QPS at 100K docs |
| 2 | Monitor latency after scaling | OCU scaling takes 5-10 minutes to take effect |
| 3 | Validate | kNN latency should return to < 50ms at p95 |
| 4 | Set up an auto-scaling policy | Configure a max OCU limit and scaling triggers |
Cost impact: 2 OCUs → 6 OCUs = $1,750/month → $5,250/month (+$3,500/month). Justified by SLA compliance.
RC2: Segment Fragmentation
| Step | Action | Detail |
| --- | --- | --- |
| 1 | Check the segment count | GET /manga-products/_segments — look for > 20 segments per shard |
| 2 | Force merge | POST /manga-products/_forcemerge?max_num_segments=1 |
| 3 | Wait for the merge to complete | A force merge on 100K docs takes ~10-15 minutes |
| 4 | Validate latency | kNN latency should drop 30-50% after the merge (fewer segments to search) |
| 5 | Schedule post-bulk-load merges | Always run a force merge after the nightly catalog sync (automation sketch below) |
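A sketch of the step 5 automation, assuming opensearch-py against an endpoint that exposes the _segments and _forcemerge APIs (verify availability on a Serverless collection before relying on this):

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://search-domain.ap-northeast-1.es.amazonaws.com"])  # auth omitted

def post_bulk_load_merge(index: str = "manga-products", max_segments: int = 20) -> None:
    """After the nightly catalog sync: merge only if segments are fragmented."""
    seg_info = client.indices.segments(index=index)
    n_segments = sum(
        len(copy["segments"])  # segments per shard copy
        for shard_copies in seg_info["indices"][index]["shards"].values()
        for copy in shard_copies
    )
    if n_segments > max_segments:  # threshold from the decision tree
        client.indices.forcemerge(index=index, max_num_segments=1)
```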
RC3: Shard Too Large
| Step | Action | Detail |
| --- | --- | --- |
| 1 | Calculate the target shard count | 100K docs / 25K docs per shard = 4 shards |
| 2 | Create a new index with 4 shards | Use the same mapping with number_of_shards: 4 |
| 3 | Reindex from old to new | POST /_reindex from manga-products-v1 to manga-products-v2 |
| 4 | Alias swap | Update the manga-products alias to point to manga-products-v2 |
| 5 | Delete the old index | After validation, remove manga-products-v1 |
RC4: HNSW ef_search Too High
| Step | Action | Detail |
| --- | --- | --- |
| 1 | Run a recall-latency benchmark at multiple ef_search values | Test ef_search = 64, 128, 256, 512 |
| 2 | Find the knee point | Typically recall@10 plateaus while latency keeps climbing |
| 3 | Select the knee value | For 100K docs, ef_search = 128 achieves recall@10 = 0.95 at ~42ms vs 0.97 at ~85ms for ef_search = 256 (see the benchmark table below) |
| 4 | Update the index setting | PUT /manga-products/_settings { "knn.algo_param.ef_search": 128 } |
| 5 | Validate | Confirm recall@10 >= 0.95 and latency < 50ms |
Benchmark reference table:
| ef_search | recall@10 (100K docs) | Latency p95 | Recommendation |
| --- | --- | --- | --- |
| 64 | 0.89 | 22ms | Too low recall |
| 128 | 0.95 | 42ms | Good balance for 100K |
| 256 | 0.97 | 85ms | Was optimal for 10K; too slow for 100K |
| 512 | 0.98 | 180ms | Way too slow |
Prevention
| Measure | Implementation |
| --- | --- |
| Pre-expansion capacity test | Before any 5x+ catalog growth, run a load test on a staging index with the target document count |
| Shard sizing guidelines | Document: max 25K HNSW docs per shard for 1536-dim vectors |
| Post-bulk-load automation | Lambda triggered after bulk indexing: (1) force merge, (2) run the latency benchmark, (3) alert if over threshold |
| OCU scaling policy | Auto-scale search OCUs on the latency metric, not just utilization |
| ef_search tuning automation | Script that binary-searches for the optimal ef_search given a target recall and latency budget (sketch below) |
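A sketch of the tuning script from the last row. It assumes a benchmark(ef) helper that applies the setting, replays the eval query set, and returns (recall_at_10, p95_latency_ms), and that recall is non-decreasing in ef_search, which is what makes binary search valid:

```python
def tune_ef_search(benchmark, target_recall=0.95, lo=16, hi=512):
    """Binary-search the smallest ef_search that meets the recall target."""
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        recall, p95_ms = benchmark(mid)
        if recall >= target_recall:
            best = {"ef_search": mid, "recall@10": recall, "p95_ms": p95_ms}
            hi = mid - 1  # feasible: try smaller (faster) values
        else:
            lo = mid + 1  # infeasible: widen the graph search
    return best  # None if even the upper bound misses the recall target
```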
Scenario 5: Re-ranking Latency Exceeding Budget, Causing Total Retrieval > 200ms
Problem Statement
MangaAssist enables Claude Haiku re-ranking for all queries (not just buy-intent) to maximize retrieval quality. Haiku re-ranking adds 120-180ms per query, pushing total retrieval latency to 280-350ms at p95. The 200ms SLA is violated. Customer-visible page load times increase from 1.5s to 2.2s.
Detection
```mermaid
graph TB
subgraph Signals["Detection Signals"]
style Signals fill:#e94560,stroke:#16213e,color:#fff
S1["CloudWatch alarm:<br/>retrieval_total_latency_p95 > 200ms"]
S2["rerank_latency_p95<br/>= 155ms (budget: 60ms)"]
S3["Bedrock Haiku invocations<br/>= 100% of searches<br/>(expected: 20% buy-intent only)"]
S4["Monthly Bedrock cost<br/>for reranking: $3,200<br/>(expected: $640)"]
end
S1 & S2 & S3 & S4 --> DIAGNOSE["Latency<br/>Budget Analysis"]
```
Latency budget violation analysis:
| Stage | Budget | Actual | Status |
| --- | --- | --- | --- |
| Query preprocessing | 15ms | 12ms | OK |
| Embedding generation | 25ms | 22ms | OK |
| Hybrid search | 80ms | 72ms | OK |
| Re-ranking | 60ms | 155ms | OVER by 95ms |
| Result assembly | 20ms | 15ms | OK |
| Total | 200ms | 276ms | SLA BREACH |
Root Cause Analysis — Decision Tree
```mermaid
graph TD
START["Re-ranking latency<br/>155ms, budget 60ms"] --> CHECK_METHOD["Check: Which re-ranking<br/>method is active?"]
CHECK_METHOD -->|Claude Haiku| CHECK_SCOPE["Check: Is Haiku called<br/>for ALL queries?"]
CHECK_METHOD -->|Cross-encoder| CHECK_BATCH["Check: Batch size<br/>for cross-encoder"]
CHECK_SCOPE -->|Yes — all queries| RC1["ROOT CAUSE 1:<br/>Haiku re-ranking enabled<br/>globally instead of<br/>buy-intent only.<br/>Config error."]
CHECK_SCOPE -->|No — only buy-intent| CHECK_INPUT["Check: Input size<br/>to Haiku"]
CHECK_INPUT -->|> 20 candidates| RC2["ROOT CAUSE 2:<br/>Too many candidates<br/>sent to Haiku.<br/>Heuristic stage not<br/>filtering to top-20."]
CHECK_INPUT -->|<= 20 candidates| CHECK_HAIKU_LATENCY["Check: Haiku<br/>invocation latency"]
CHECK_HAIKU_LATENCY -->|> 100ms| RC3["ROOT CAUSE 3:<br/>Haiku cold start or<br/>Bedrock throttling.<br/>No provisioned throughput."]
CHECK_HAIKU_LATENCY -->|< 100ms| RC4["ROOT CAUSE 4:<br/>Prompt too large.<br/>Excessive metadata in<br/>re-ranking prompt."]
CHECK_BATCH -->|> 20 docs| RC5["ROOT CAUSE 5:<br/>Cross-encoder batch too<br/>large. Model scoring 50<br/>pairs instead of 20."]
CHECK_BATCH -->|<= 20 docs| ESCALATE["Escalate: Model inference<br/>itself is slow — check<br/>ECS task sizing."]
```
Resolution by Root Cause
RC1: Haiku Re-ranking Enabled Globally (Most Likely)
| Step | Action | Detail |
| --- | --- | --- |
| 1 | Check the reranker configuration | Review the ECS task environment variable RERANK_METHOD |
| 2 | Fix: set intent-conditional logic | Only invoke Haiku for intent == "buy" (20% of queries) |
| 3 | Default to cross-encoder or heuristic-only | For browse/recommend/research: use the cross-encoder (30ms) or heuristic-only (5ms) |
| 4 | Deploy the config change | ECS rolling deployment, no downtime |
| 5 | Validate | Confirm retrieval_total_latency_p95 < 200ms within 15 minutes |
Intent-to-reranker mapping:
| Intent | Re-rank Method | Expected Latency | Quality (NDCG@5) |
| --- | --- | --- | --- |
| buy (20% of queries) | Claude Haiku | 80-150ms | 0.93 |
| recommend (35%) | Cross-encoder | 30-50ms | 0.88 |
| browse (30%) | Heuristic only | 5ms | 0.82 |
| research (15%) | Cross-encoder | 30-50ms | 0.88 |
| Weighted average | — | ~40ms | ~0.87 |
RC2: Too Many Candidates Sent to Haiku
| Step | Action |
| --- | --- |
| 1 | Check the heuristic stage output count — should be capped at 20 |
| 2 | If the heuristic stage is outputting 50 (i.e., passing all hybrid results through), fix the heuristic_cutoff parameter |
| 3 | Set heuristic_cutoff = 20 in the RelevanceReranker constructor |
| 4 | Validate: Haiku input should be ~1.5K tokens (20 candidates) instead of ~4K tokens (50 candidates) |
RC3: Haiku Cold Start / Throttling
| Step | Action |
| --- | --- |
| 1 | Check Bedrock CloudWatch metrics: InvocationLatency, ThrottleCount |
| 2 | If throttled: request a Bedrock quota increase for Haiku in ap-northeast-1 |
| 3 | If cold start: implement a keep-warm strategy (invoke Haiku every 30s with a dummy request) |
| 4 | Consider Bedrock provisioned throughput for Haiku if QPS is sustained and high |
RC4: Prompt Too Large
| Step | Action |
| --- | --- |
| 1 | Audit the re-ranking prompt — how many tokens per candidate? |
| 2 | Reduce candidate metadata to title + genre + rating only (remove description, tags, author bio) |
| 3 | Target: < 100 tokens per candidate; 20 candidates = 2K input tokens |
| 4 | Validate: Haiku response time should drop from 150ms to 80ms |
Tiered Re-ranking Decision Flowchart
```mermaid
graph TD
QUERY["Incoming Query"] --> INTENT["Detect Intent"]
INTENT -->|Buy| BUDGET_CHECK_BUY["Remaining latency<br/>budget > 100ms?"]
INTENT -->|Recommend| BUDGET_CHECK_REC["Remaining latency<br/>budget > 50ms?"]
INTENT -->|Browse| HEURISTIC["Heuristic Only<br/>(5ms)"]
INTENT -->|Research| BUDGET_CHECK_RES["Remaining latency<br/>budget > 50ms?"]
BUDGET_CHECK_BUY -->|Yes| HAIKU["Claude Haiku<br/>Re-rank (80-150ms)"]
BUDGET_CHECK_BUY -->|No| CROSS_ENC_BUY["Cross-Encoder<br/>Re-rank (30-50ms)"]
BUDGET_CHECK_REC -->|Yes| CROSS_ENC_REC["Cross-Encoder<br/>Re-rank (30-50ms)"]
BUDGET_CHECK_REC -->|No| HEURISTIC_REC["Heuristic Only<br/>(5ms)"]
BUDGET_CHECK_RES -->|Yes| CROSS_ENC_RES["Cross-Encoder<br/>Re-rank (30-50ms)"]
BUDGET_CHECK_RES -->|No| HEURISTIC_RES["Heuristic Only<br/>(5ms)"]
HAIKU --> RESULT["Top-5 Results"]
CROSS_ENC_BUY --> RESULT
CROSS_ENC_REC --> RESULT
CROSS_ENC_RES --> RESULT
HEURISTIC --> RESULT
HEURISTIC_REC --> RESULT
HEURISTIC_RES --> RESULT
```
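The flowchart as code: a minimal sketch assuming rerank_haiku, rerank_cross_encoder, and rerank_heuristic callables exist and that elapsed_ms is supplied by the pipeline's stage-timing middleware:

```python
def select_reranker(intent: str, elapsed_ms: float, budget_ms: float = 200.0):
    """Pick the richest re-ranker the remaining latency budget allows."""
    remaining = budget_ms - elapsed_ms
    if intent == "buy":
        return rerank_haiku if remaining > 100 else rerank_cross_encoder  # Haiku: 80-150ms
    if intent in ("recommend", "research"):
        return rerank_cross_encoder if remaining > 50 else rerank_heuristic  # 30-50ms
    return rerank_heuristic  # browse: heuristic-only (~5ms)

# Usage: top5 = select_reranker(intent, timer.elapsed_ms)(query, candidates)[:5]
```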
Prevention
| Measure | Implementation |
| --- | --- |
| Latency budget enforcement | Middleware tracks elapsed time per stage; if cumulative > 140ms before re-ranking, auto-downgrade to heuristic-only |
| Reranker method per intent | Configuration table mapping intent → reranker method, reviewed monthly |
| Bedrock latency monitoring | CloudWatch alarm on InvocationLatency p95 > 100ms for the Haiku model ID |
| Cost alerting | CloudWatch alarm on daily Bedrock re-ranking cost > $25/day (expected: $21/day at current volume) |
| Progressive enhancement | Start all queries with heuristic-only; upgrade to cross-encoder if the latency budget allows; upgrade to Haiku only for buy-intent with budget remaining |
Cross-Scenario Summary
Common Patterns Across All 5 Scenarios
| Pattern | Scenarios | Key Lesson |
| --- | --- | --- |
| Metric degradation precedes user complaints | 1, 2, 3 | Invest in automated quality metrics (NDCG, MRR, CTR) — detect issues before customers notice |
| Config changes cascade unpredictably | 3, 5 | Always A/B test configuration changes; never deploy to 100% without validation |
| Scale changes require re-tuning | 4 | Parameters optimized for 10K docs may be wrong for 100K docs; re-benchmark after growth |
| Japanese text processing needs special attention | 2 | Standard tokenizers fail on manga-specific terms; custom dictionaries are mandatory |
| Latency budgets need per-stage enforcement | 5 | Track latency per stage, not just end-to-end; auto-degrade expensive stages when the budget is tight |
Monitoring Dashboard — Retrieval Health
| Metric | Source | Alarm Threshold | Check Frequency |
| --- | --- | --- | --- |
| retrieval_total_latency_p95 | ECS application metric | > 200ms | Real-time (1-min) |
| knn_search_latency_p95 | OpenSearch + app timer | > 80ms | Real-time (1-min) |
| bm25_search_latency_p95 | OpenSearch + app timer | > 40ms | Real-time (1-min) |
| rerank_latency_p95 | App timer | > 60ms | Real-time (1-min) |
| zero_result_rate | App counter | > 3% | Hourly |
| ndcg_at_5_weekly | Offline eval pipeline | < 0.80 | Weekly |
| mrr_weekly | Offline eval pipeline | < 0.80 | Weekly |
| search_ctr_7d | Click tracking | < 14% | Daily |
| opensearch_ocu_utilization | CloudWatch | > 85% | Real-time (5-min) |
| bedrock_rerank_cost_daily | Cost Explorer | > $30/day | Daily |