06: Retrieval Quality Testing
AIP-C01 Mapping
- Content Domain 5: Testing, Validation, and Troubleshooting
- Task 5.1: Implement evaluation systems for GenAI
- Skill 5.1.6: Test the quality of retrieval mechanisms (for example, with relevance scoring, context matching, and retrieval latency)
User Story
As a MangaAssist search/retrieval engineer, I want to measure and optimize the quality of our OpenSearch Serverless retrieval pipeline — including relevance of retrieved chunks, precision/recall against expected contexts, and retrieval latency under production load — so that the LLM receives the best possible context for every query and we can diagnose whether quality issues originate from retrieval or generation.
Acceptance Criteria
- Retrieval relevance scoring evaluates top-K chunks against query intent
- Context matching measures whether retrieved chunks contain the information needed to answer the query
- Retrieval precision@K and recall@K computed against labeled ground truth per intent
- Retrieval latency P50/P95/P99 tracked per query type and index
- Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (nDCG) computed for ranking quality
- End-to-end retrieval testing includes embedding generation, vector search, and re-ranking stages
- Degradation alerts fire when retrieval quality drops below threshold
- Retrieval A/B testing framework supports comparing embedding models and search configurations
Why Retrieval Quality Is the Foundation
In MangaAssist's RAG architecture, the retrieval pipeline determines the ceiling of response quality. Even the best LLM cannot generate a correct answer from irrelevant context:
Response Quality ≤ min(Retrieval Quality, Generation Quality)
| Retrieval Quality | Generation Quality | User Experience |
|---|---|---|
| High (correct chunks) | High (good synthesis) | Correct, helpful response |
| High (correct chunks) | Low (poor synthesis) | Fixable — improve prompt/model |
| Low (wrong chunks) | High (good synthesis) | Confident hallucination — worst case |
| Low (wrong chunks) | Low (poor synthesis) | Obvious failure — user asks again |
The dangerous quadrant is high generation quality + low retrieval quality — the model confidently presents wrong information because it was given wrong context. This is why retrieval must be tested independently.
MangaAssist Retrieval Pipeline
User Query → Intent Classifier → Query Rewriter → Titan Embeddings v2 →
OpenSearch Serverless (k-NN + BM25 hybrid) → Re-ranker → Top-5 Chunks → LLM
Each stage can fail independently: the query rewriter can lose intent, embeddings can miss semantic nuance, the hybrid search can rank poorly, and the re-ranker can demote the correct chunk.
High-Level Design
Retrieval Quality Testing Architecture
graph TD
subgraph "Test Data"
A[Golden Retrieval Dataset<br>300 queries with labeled chunks] --> B[Test Runner]
end
subgraph "Retrieval Pipeline Under Test"
B --> C[Query Rewriter]
C --> D[Embedding Generator<br>Titan Embeddings v2]
D --> E[OpenSearch Serverless<br>Hybrid k-NN + BM25]
E --> F[Re-ranker]
F --> G[Top-K Retrieved Chunks]
end
subgraph "Quality Metrics Engine"
G --> H[Relevance Scorer<br>Per-chunk relevance]
G --> I[Precision@K / Recall@K<br>Against ground truth]
G --> J[MRR / nDCG@K<br>Ranking quality]
G --> K[Latency Tracker<br>P50, P95, P99]
end
subgraph "Stage-Level Diagnostics"
C --> L[Query Rewrite Quality<br>Intent preservation check]
D --> M[Embedding Quality<br>Cosine similarity distribution]
E --> N[Search Quality<br>k-NN vs BM25 contribution]
F --> O[Re-ranker Quality<br>Position changes analysis]
end
subgraph "Outputs"
H --> P[Retrieval Quality Report]
I --> P
J --> P
K --> P
L --> P
M --> P
N --> P
O --> P
P --> Q[CloudWatch Dashboard]
P --> R[Degradation Alerts]
end
Retrieval A/B Testing Architecture
graph LR
subgraph "Experiment Setup"
A[Same Query Set] --> B{A/B Router}
B --> C[Config A: Titan v2<br>k=5, hybrid 0.7/0.3]
B --> D[Config B: Titan v2<br>k=7, hybrid 0.5/0.5]
end
subgraph "Evaluation"
C --> E[Retrieve & Score]
D --> E
E --> F[Paired Comparison<br>Same query, both configs]
end
subgraph "Decision"
F --> G{Config B better<br>on MRR + Latency?}
G -->|Yes, significant| H[Adopt Config B]
G -->|No| I[Keep Config A]
end
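The paired-comparison step in the diagram can be approximated with a per-query bootstrap over reciprocal ranks. The sketch below is illustrative: it assumes both configurations were scored on the same query set, and the function name and the 0.95 resampling cutoff are assumptions rather than MangaAssist's actual framework.
import random
from statistics import mean

def paired_bootstrap_mrr_delta(rr_config_a: list[float], rr_config_b: list[float],
                               iterations: int = 2000, seed: int = 7) -> dict:
    """Paired bootstrap over per-query reciprocal ranks from two configs.

    rr_config_a[i] and rr_config_b[i] must come from the same query so the
    comparison stays paired. Returns the observed MRR delta and the fraction
    of resamples in which Config B beats Config A (a rough confidence signal).
    """
    assert len(rr_config_a) == len(rr_config_b) and rr_config_a
    rng = random.Random(seed)
    n = len(rr_config_a)
    observed_delta = mean(rr_config_b) - mean(rr_config_a)
    b_wins = 0
    for _ in range(iterations):
        idx = [rng.randrange(n) for _ in range(n)]  # resample queries with replacement
        delta = mean(rr_config_b[i] for i in idx) - mean(rr_config_a[i] for i in idx)
        if delta > 0:
            b_wins += 1
    return {
        "observed_mrr_delta": round(observed_delta, 4),
        "fraction_resamples_b_better": round(b_wins / iterations, 4),
        "adopt_config_b": observed_delta > 0 and b_wins / iterations >= 0.95,
    }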
End-to-End Retrieval Latency Breakdown
sequenceDiagram
participant QR as Query Rewriter
participant EM as Embedding Generator
participant OS as OpenSearch
participant RR as Re-ranker
participant Total as Total Latency
Note over QR,Total: Target: < 500ms total
QR->>QR: Rewrite query<br>~50ms
QR->>EM: Rewritten query
EM->>EM: Generate embedding<br>~30ms (Titan v2)
EM->>OS: 1024-dim vector
OS->>OS: k-NN search + BM25<br>~80ms (warm) / ~300ms (cold)
OS->>RR: Top-20 candidates
RR->>RR: Re-rank to top-5<br>~40ms
RR->>Total: Final chunks
Note over Total: Warm: ~200ms | Cold: ~420ms | P99: ~600ms
Low-Level Design
Retrieval Test Data Model
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
import time
import uuid
@dataclass
class RetrievalTestCase:
"""A test case for retrieval quality evaluation."""
test_id: str = field(default_factory=lambda: str(uuid.uuid4()))
query: str = ""
intent: str = ""
expected_chunk_ids: list[str] = field(default_factory=list) # Ground truth chunks
expected_key_facts: list[str] = field(default_factory=list) # Facts that must appear
min_relevance_threshold: float = 0.6 # Minimum acceptable relevance
difficulty: str = "medium" # easy, medium, hard, adversarial
@dataclass
class RetrievalResult:
"""Result of a single retrieval operation."""
test_id: str = ""
query: str = ""
retrieved_chunks: list[dict] = field(default_factory=list) # {chunk_id, text, score, source}
retrieval_latency_ms: float = 0.0
embedding_latency_ms: float = 0.0
search_latency_ms: float = 0.0
rerank_latency_ms: float = 0.0
total_latency_ms: float = 0.0
@dataclass
class RetrievalQualityMetrics:
"""Aggregated retrieval quality metrics."""
precision_at_k: dict[int, float] = field(default_factory=dict) # k -> precision
recall_at_k: dict[int, float] = field(default_factory=dict) # k -> recall
mrr: float = 0.0 # Mean Reciprocal Rank
ndcg_at_k: dict[int, float] = field(default_factory=dict) # k -> nDCG
avg_relevance_score: float = 0.0
fact_coverage_rate: float = 0.0 # % of expected facts found
latency_p50_ms: float = 0.0
latency_p95_ms: float = 0.0
latency_p99_ms: float = 0.0
total_queries: int = 0
timestamp: float = field(default_factory=time.time)
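For reference, a golden-dataset entry might look like the sketch below; the chunk IDs and key facts are hypothetical placeholders rather than actual MangaAssist index entries.
# Hypothetical golden-dataset entry; chunk IDs and facts are placeholders.
example_case = RetrievalTestCase(
    query="Is One Piece Volume 108 available in hardcover?",
    intent="product_question",
    expected_chunk_ids=["chunk-op-108-formats", "chunk-op-108-availability"],
    expected_key_facts=["hardcover", "Volume 108"],
    min_relevance_threshold=0.6,
    difficulty="medium",
)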
Retrieval Quality Evaluator
import json
import logging
import math
import time
from statistics import median, quantiles
from typing import Optional
import boto3
logger = logging.getLogger(__name__)
class RetrievalQualityEvaluator:
"""Evaluates retrieval quality across multiple dimensions.
Metrics computed:
- Precision@K: fraction of retrieved chunks that are relevant
- Recall@K: fraction of relevant chunks that are retrieved
- MRR: average reciprocal rank of the first relevant chunk
- nDCG@K: ranking quality considering position
- Fact coverage: does the retrieved context contain the key facts?
- Latency breakdown: per-stage timing
"""
def __init__(self, opensearch_endpoint: str = "", embedding_model_id: str = "amazon.titan-embed-text-v2:0"):
self.bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
self.embedding_model_id = embedding_model_id
self.opensearch_endpoint = opensearch_endpoint
self.cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
def evaluate_retrieval_batch(
self,
test_cases: list[RetrievalTestCase],
retrieval_fn=None, # Callable that takes query -> RetrievalResult
k_values: list[int] = None,
) -> RetrievalQualityMetrics:
"""Evaluate retrieval quality on a batch of test cases."""
if k_values is None:
k_values = [1, 3, 5, 10]
all_results: list[tuple[RetrievalTestCase, RetrievalResult]] = []
for tc in test_cases:
if retrieval_fn:
result = retrieval_fn(tc.query)
else:
result = self._default_retrieve(tc.query)
all_results.append((tc, result))
# Compute metrics
precision_at_k = {k: self._compute_precision_at_k(all_results, k) for k in k_values}
recall_at_k = {k: self._compute_recall_at_k(all_results, k) for k in k_values}
mrr = self._compute_mrr(all_results)
ndcg_at_k = {k: self._compute_ndcg_at_k(all_results, k) for k in k_values}
avg_relevance = self._compute_avg_relevance(all_results)
fact_coverage = self._compute_fact_coverage(all_results)
# Latency statistics
latencies = [r.total_latency_ms for _, r in all_results if r.total_latency_ms > 0]
sorted_latencies = sorted(latencies) if latencies else [0]
return RetrievalQualityMetrics(
precision_at_k=precision_at_k,
recall_at_k=recall_at_k,
mrr=mrr,
ndcg_at_k=ndcg_at_k,
avg_relevance_score=avg_relevance,
fact_coverage_rate=fact_coverage,
latency_p50_ms=self._percentile(sorted_latencies, 50),
latency_p95_ms=self._percentile(sorted_latencies, 95),
latency_p99_ms=self._percentile(sorted_latencies, 99),
total_queries=len(all_results),
)
def _compute_precision_at_k(
self, results: list[tuple[RetrievalTestCase, RetrievalResult]], k: int
) -> float:
"""Precision@K: of the top-K retrieved, how many are relevant?"""
precisions = []
for tc, result in results:
top_k = result.retrieved_chunks[:k]
if not top_k:
precisions.append(0.0)
continue
relevant = sum(
1 for chunk in top_k
if chunk.get("chunk_id") in tc.expected_chunk_ids
)
precisions.append(relevant / len(top_k))
return sum(precisions) / len(precisions) if precisions else 0.0
def _compute_recall_at_k(
self, results: list[tuple[RetrievalTestCase, RetrievalResult]], k: int
) -> float:
"""Recall@K: of all relevant chunks, how many are in top-K?"""
recalls = []
for tc, result in results:
if not tc.expected_chunk_ids:
continue
top_k_ids = {
chunk.get("chunk_id") for chunk in result.retrieved_chunks[:k]
}
relevant_retrieved = len(top_k_ids & set(tc.expected_chunk_ids))
recalls.append(relevant_retrieved / len(tc.expected_chunk_ids))
return sum(recalls) / len(recalls) if recalls else 0.0
def _compute_mrr(
self, results: list[tuple[RetrievalTestCase, RetrievalResult]]
) -> float:
"""Mean Reciprocal Rank: average of 1/rank of first relevant chunk."""
reciprocal_ranks = []
for tc, result in results:
rr = 0.0
for i, chunk in enumerate(result.retrieved_chunks):
if chunk.get("chunk_id") in tc.expected_chunk_ids:
rr = 1.0 / (i + 1)
break
reciprocal_ranks.append(rr)
return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0
def _compute_ndcg_at_k(
self, results: list[tuple[RetrievalTestCase, RetrievalResult]], k: int
) -> float:
"""Normalized Discounted Cumulative Gain at K."""
ndcg_scores = []
for tc, result in results:
# Relevance: 1 if chunk is expected, 0 otherwise
relevance = [
1 if chunk.get("chunk_id") in tc.expected_chunk_ids else 0
for chunk in result.retrieved_chunks[:k]
]
# DCG
dcg = sum(
rel / math.log2(i + 2) # i+2 because log2(1) = 0
for i, rel in enumerate(relevance)
)
# Ideal DCG (all relevant chunks ranked first)
ideal_relevance = sorted(relevance, reverse=True)
idcg = sum(
rel / math.log2(i + 2)
for i, rel in enumerate(ideal_relevance)
)
ndcg = dcg / idcg if idcg > 0 else 0.0
ndcg_scores.append(ndcg)
return sum(ndcg_scores) / len(ndcg_scores) if ndcg_scores else 0.0
def _compute_avg_relevance(
self, results: list[tuple[RetrievalTestCase, RetrievalResult]]
) -> float:
"""Average relevance score of top-K retrieved chunks."""
scores = []
for _, result in results:
for chunk in result.retrieved_chunks:
score = chunk.get("score", chunk.get("relevance_score", 0))
scores.append(float(score))
return sum(scores) / len(scores) if scores else 0.0
def _compute_fact_coverage(
self, results: list[tuple[RetrievalTestCase, RetrievalResult]]
) -> float:
"""What percentage of expected key facts appear in retrieved chunks?"""
total_facts = 0
found_facts = 0
for tc, result in results:
if not tc.expected_key_facts:
continue
combined_text = " ".join(
chunk.get("text", "") for chunk in result.retrieved_chunks
).lower()
for fact in tc.expected_key_facts:
total_facts += 1
if fact.lower() in combined_text:
found_facts += 1
return found_facts / total_facts if total_facts > 0 else 0.0
def _percentile(self, sorted_values: list[float], pct: int) -> float:
"""Compute percentile from sorted values."""
if not sorted_values:
return 0.0
idx = int(len(sorted_values) * pct / 100)
return sorted_values[min(idx, len(sorted_values) - 1)]
def _default_retrieve(self, query: str) -> RetrievalResult:
"""Default retrieval using MangaAssist's OpenSearch Serverless pipeline."""
total_start = time.time()
# Stage 1: Generate embedding
embed_start = time.time()
embedding = self._generate_embedding(query)
embed_latency = (time.time() - embed_start) * 1000
# Stage 2: Hybrid search (k-NN + BM25)
search_start = time.time()
raw_results = self._hybrid_search(query, embedding, k=20)
search_latency = (time.time() - search_start) * 1000
# Stage 3: Re-rank
rerank_start = time.time()
reranked = self._rerank(query, raw_results, top_k=5)
rerank_latency = (time.time() - rerank_start) * 1000
total_latency = (time.time() - total_start) * 1000
return RetrievalResult(
query=query,
retrieved_chunks=reranked,
embedding_latency_ms=embed_latency,
search_latency_ms=search_latency,
rerank_latency_ms=rerank_latency,
total_latency_ms=total_latency,
)
def _generate_embedding(self, text: str) -> list[float]:
"""Generate embedding using Amazon Titan Embeddings v2."""
body = json.dumps({"inputText": text})
response = self.bedrock.invoke_model(modelId=self.embedding_model_id, body=body)
return json.loads(response["body"].read())["embedding"]
def _hybrid_search(self, query: str, embedding: list[float], k: int = 20) -> list[dict]:
"""Hybrid k-NN + BM25 search on OpenSearch Serverless."""
# In production, this calls OpenSearch Serverless
# Simplified for illustration
return []
def _rerank(self, query: str, candidates: list[dict], top_k: int = 5) -> list[dict]:
"""Re-rank candidates using cross-encoder or LLM-based re-ranker."""
# In production, this uses a cross-encoder model
return candidates[:top_k]
Stage-Level Diagnostic Evaluator
class RetrievalStageDiagnostics:
"""Diagnoses retrieval quality issues at each pipeline stage.
When overall retrieval quality drops, this class helps isolate
which stage is the bottleneck:
- Query rewriting losing intent?
- Embeddings not capturing semantics?
- Search ranking poorly?
- Re-ranker demoting correct results?
"""
def diagnose_query_rewriter(
self, original_query: str, rewritten_query: str, intent: str
) -> dict:
"""Evaluate whether the query rewriter preserved intent and improved searchability."""
# Check intent preservation: key terms from original should appear in rewritten
original_terms = set(original_query.lower().split())
rewritten_terms = set(rewritten_query.lower().split())
overlap = original_terms & rewritten_terms
preservation_rate = len(overlap) / len(original_terms) if original_terms else 0
# Check expansion: did the rewriter add useful terms?
added_terms = rewritten_terms - original_terms
expansion_rate = len(added_terms) / max(len(rewritten_terms), 1)
return {
"stage": "query_rewriter",
"intent_preservation": preservation_rate,
"query_expansion_rate": expansion_rate,
"original_length": len(original_query),
"rewritten_length": len(rewritten_query),
"added_terms": list(added_terms)[:10],
"healthy": preservation_rate >= 0.5,
}
def diagnose_embedding_quality(
self, query_embedding: list[float], expected_chunk_embeddings: list[list[float]]
) -> dict:
"""Evaluate embedding quality by checking cosine similarity distribution."""
similarities = []
for chunk_emb in expected_chunk_embeddings:
sim = self._cosine_similarity(query_embedding, chunk_emb)
similarities.append(sim)
if not similarities:
return {"stage": "embedding", "healthy": False, "reason": "No embeddings to compare"}
avg_sim = sum(similarities) / len(similarities)
max_sim = max(similarities)
min_sim = min(similarities)
return {
"stage": "embedding",
"avg_similarity": round(avg_sim, 4),
"max_similarity": round(max_sim, 4),
"min_similarity": round(min_sim, 4),
"similarity_spread": round(max_sim - min_sim, 4),
"healthy": avg_sim >= 0.65,
"recommendation": (
"Embedding model captures query-chunk similarity well"
if avg_sim >= 0.65
else "Consider fine-tuning embeddings or adjusting chunk size"
),
}
def diagnose_search_components(
self,
knn_results: list[dict],
bm25_results: list[dict],
hybrid_results: list[dict],
expected_chunk_ids: list[str],
) -> dict:
"""Compare k-NN and BM25 contributions to identify search issues."""
knn_set = {r.get("chunk_id") for r in knn_results[:10]}
bm25_set = {r.get("chunk_id") for r in bm25_results[:10]}
hybrid_set = {r.get("chunk_id") for r in hybrid_results[:10]}
expected_set = set(expected_chunk_ids)
return {
"stage": "search",
"knn_recall@10": len(knn_set & expected_set) / max(len(expected_set), 1),
"bm25_recall@10": len(bm25_set & expected_set) / max(len(expected_set), 1),
"hybrid_recall@10": len(hybrid_set & expected_set) / max(len(expected_set), 1),
"knn_unique": len(knn_set - bm25_set),
"bm25_unique": len(bm25_set - knn_set),
"overlap": len(knn_set & bm25_set),
"recommendation": self._search_recommendation(
knn_set, bm25_set, hybrid_set, expected_set
),
}
def diagnose_reranker(
self,
pre_rerank: list[dict],
post_rerank: list[dict],
expected_chunk_ids: list[str],
) -> dict:
"""Evaluate whether the re-ranker improved or degraded ranking quality."""
expected_set = set(expected_chunk_ids)
# Compare positions of relevant chunks pre and post rerank
pre_positions = {}
for i, chunk in enumerate(pre_rerank):
if chunk.get("chunk_id") in expected_set:
pre_positions[chunk["chunk_id"]] = i
post_positions = {}
for i, chunk in enumerate(post_rerank):
if chunk.get("chunk_id") in expected_set:
post_positions[chunk["chunk_id"]] = i
improvements = 0
degradations = 0
        for cid in expected_set:
            # Use one shared sentinel rank so a chunk missing from both lists is
            # not miscounted as a change (the two lists can differ in length).
            missing_rank = max(len(pre_rerank), len(post_rerank))
            pre = pre_positions.get(cid, missing_rank)
            post = post_positions.get(cid, missing_rank)
            if post < pre:
                improvements += 1
            elif post > pre:
                degradations += 1
return {
"stage": "reranker",
"improvements": improvements,
"degradations": degradations,
"no_change": len(expected_set) - improvements - degradations,
"healthy": improvements >= degradations,
"recommendation": (
"Re-ranker is improving ranking"
if improvements > degradations
else "Re-ranker may be degrading results — review cross-encoder model"
),
}
def _cosine_similarity(self, a: list[float], b: list[float]) -> float:
"""Compute cosine similarity between two vectors."""
dot_product = sum(x * y for x, y in zip(a, b))
norm_a = math.sqrt(sum(x * x for x in a))
norm_b = math.sqrt(sum(x * x for x in b))
return dot_product / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0
def _search_recommendation(self, knn, bm25, hybrid, expected) -> str:
knn_recall = len(knn & expected) / max(len(expected), 1)
bm25_recall = len(bm25 & expected) / max(len(expected), 1)
if knn_recall > bm25_recall + 0.2:
return "k-NN dominates — consider increasing k-NN weight in hybrid"
elif bm25_recall > knn_recall + 0.2:
return "BM25 dominates — queries may be keyword-heavy, check embeddings"
return "Balanced contribution — hybrid search is working well"
MangaAssist Scenarios
Scenario A: Embedding Model Upgrade Breaks Semantic Retrieval
Context: The team upgraded from Titan Embeddings v1 (1536-dim) to v2 (1024-dim by default) for better semantic representation. Retrieval quality was expected to improve.
What Happened:
- Retrieval quality metrics before the upgrade: Precision@5: 0.72, Recall@5: 0.68, MRR: 0.78, nDCG@5: 0.74
- After the v2 deployment: Precision@5: 0.58 (-0.14), Recall@5: 0.52 (-0.16), MRR: 0.63 (-0.15), nDCG@5: 0.59 (-0.15)
How Caught: The retrieval quality evaluator ran its nightly regression suite, and every metric dropped. The stage-level diagnostics showed:
- Embedding quality: average cosine similarity between queries and expected chunks dropped from 0.78 to 0.52
- Search quality: k-NN recall@10 dropped from 0.80 to 0.55
Root Cause: The existing OpenSearch index was built with v1 embeddings (1536-dim). The new v2 query embeddings (1024-dim) were being compared against v1 embeddings in the index. Embeddings from different models and different dimensions are not comparable, so the similarity scores were effectively meaningless.
Fix: Re-indexed the entire knowledge base with v2 embeddings. This required:
1. Generating v2 embeddings for all 150K chunks
2. Creating a new OpenSearch index with a 1024-dim k-NN configuration (sketched below)
3. Performing a blue-green index switch
After re-indexing: Precision@5: 0.79, Recall@5: 0.75, MRR: 0.84.
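A minimal sketch of step 2, assuming opensearch-py and hypothetical index and field names (authentication and the Serverless collection endpoint are omitted); the key point is that the k-NN mapping dimension must match the embedding model used to index.
# Sketch only: hypothetical index/field names; auth and endpoint setup omitted.
from opensearchpy import OpenSearch

def create_v2_index(client: OpenSearch, index_name: str = "manga-chunks-v2") -> None:
    client.indices.create(
        index=index_name,
        body={
            "settings": {"index": {"knn": True}},
            "mappings": {
                "properties": {
                    "chunk_id": {"type": "keyword"},
                    "text": {"type": "text"},
                    # Must match the Titan Embeddings v2 output dimension (1024 by default).
                    "embedding": {"type": "knn_vector", "dimension": 1024},
                }
            },
        },
    )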
Scenario B: Hybrid Search Weight Tuning via Retrieval A/B Test
Context: MangaAssist's hybrid search uses 70% k-NN (semantic) + 30% BM25 (keyword). The team hypothesized that product_question intent might benefit from higher BM25 weight because users often search by exact product names.
What Happened:
- A/B test configurations:
- Config A (control): k-NN=0.7, BM25=0.3, k=5
- Config B (treatment): k-NN=0.5, BM25=0.5, k=5
- Results on 200 product_question test cases:
| Metric | Config A (70/30) | Config B (50/50) | Delta |
|---|---|---|---|
| Precision@5 | 0.68 | 0.76 | +0.08 |
| Recall@5 | 0.64 | 0.72 | +0.08 |
| MRR | 0.72 | 0.81 | +0.09 |
| Latency P95 | 85ms | 82ms | -3ms |
- On the recommendation intent, Config B was worse: Precision@5 dropped from 0.74 to 0.65 (semantic similarity matters more for recommendations than keyword matching)
Decision: Use intent-dependent search weights (a configuration sketch follows this list):
- product_question, order_tracking: 50/50 hybrid
- recommendation, chitchat: 70/30 hybrid
- faq: 60/40 hybrid (balanced)
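A minimal configuration sketch of that decision; the mapping keys mirror the intents above, while the helper name and the 70/30 fallback for unknown intents are assumptions.
# Intent-to-weight mapping reflecting the decision above; the helper and the
# 70/30 fallback for unknown intents are illustrative assumptions.
HYBRID_WEIGHTS = {
    "product_question": {"knn": 0.5, "bm25": 0.5},
    "order_tracking": {"knn": 0.5, "bm25": 0.5},
    "recommendation": {"knn": 0.7, "bm25": 0.3},
    "chitchat": {"knn": 0.7, "bm25": 0.3},
    "faq": {"knn": 0.6, "bm25": 0.4},
}

def hybrid_weights_for(intent: str) -> dict[str, float]:
    return HYBRID_WEIGHTS.get(intent, {"knn": 0.7, "bm25": 0.3})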
Scenario C: Re-ranker Demotes Correct Chunks for Multi-Turn Queries
Context: The cross-encoder re-ranker was trained on single-turn query-chunk pairs. For multi-turn conversations, it was receiving the full conversation context as the "query."
What Happened:
- The stage diagnostics showed:
  - Pre-rerank: correct chunk at position 3
  - Post-rerank: correct chunk dropped to position 8 (outside the top-5)
- This happened for 23% of multi-turn test cases
Root Cause: The re-ranker was confused by long conversation histories. For a query like "What about the hardcover edition?" with conversation history mentioning "One Piece Volume 108," the re-ranker scored chunks about hardcover editions in general higher than the specific One Piece chunk because it weighted the current turn query ("hardcover edition") more than the conversation context.
Fix: Changed the re-ranker input strategy (sketched after the next line):
- Single-turn: pass the query as-is
- Multi-turn: pass a synthesized query that merges the current turn with the resolved reference ("One Piece Volume 108 hardcover edition") — using the query rewriter to resolve coreferences before re-ranking
Post-fix: multi-turn re-ranker accuracy improved from 77% to 91%.
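A small sketch of that input strategy, where resolve_references stands in for the existing query rewriter's coreference resolution (assumed here, not shown):
# Sketch of the re-ranker input strategy; `resolve_references` is an assumed
# stand-in for the query rewriter's coreference resolution.
def build_reranker_query(current_turn: str, conversation_history: list[str],
                         resolve_references) -> str:
    if not conversation_history:
        return current_turn  # single-turn: pass the query as-is
    # Multi-turn: merge the current turn with the resolved reference, e.g.
    # "What about the hardcover edition?" -> "One Piece Volume 108 hardcover edition"
    return resolve_references(current_turn, conversation_history)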
Scenario D: Latency Spike During Product Launch Event
Context: A major manga publisher announced a new series, and 50,000 users simultaneously searched for information about it. Retrieval latency spiked.
What Happened:
- Normal P95 latency: 200ms
- During the spike: P95 latency: 1,800ms; P99 latency: 4,200ms (above the 3-second SLA)
- OpenSearch Serverless auto-scaled, but the scaling lag was 3-4 minutes
How Caught: The retrieval latency monitor fired an alarm when P95 exceeded 500ms for 5 consecutive minutes. The stage-level diagnostics showed:
- Embedding generation: 30ms (normal)
- OpenSearch search: 1,500ms (10x normal)
- Re-ranker: 40ms (normal)
- Bottleneck: OpenSearch Serverless compute scaling lag
Fix:
1. Pre-warmed OpenSearch capacity before known product launch events
2. Added request queuing with a 2-second timeout — if retrieval exceeded 2 seconds, return cached results from ElastiCache Redis for the most common queries
3. Implemented a circuit breaker: if retrieval latency exceeded 1 second for 3 consecutive requests, switch to ElastiCache-only retrieval for 60 seconds (a minimal sketch follows)
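A minimal sketch of the circuit breaker in fix 3, with the thresholds from the text; the search_opensearch and fetch_from_cache callables are assumed placeholders for the real OpenSearch and ElastiCache clients.
import time

class RetrievalCircuitBreaker:
    """Minimal circuit-breaker sketch for fix 3; not the production implementation."""

    def __init__(self, latency_threshold_ms: float = 1000.0,
                 trip_after: int = 3, open_seconds: float = 60.0):
        self.latency_threshold_ms = latency_threshold_ms
        self.trip_after = trip_after
        self.open_seconds = open_seconds
        self.slow_streak = 0
        self.open_until = 0.0

    def retrieve(self, query: str, search_opensearch, fetch_from_cache) -> list[dict]:
        if time.time() < self.open_until:
            return fetch_from_cache(query)  # breaker open: ElastiCache-only retrieval
        start = time.time()
        chunks = search_opensearch(query)
        latency_ms = (time.time() - start) * 1000
        if latency_ms > self.latency_threshold_ms:
            self.slow_streak += 1
            if self.slow_streak >= self.trip_after:
                self.open_until = time.time() + self.open_seconds  # trip for 60 seconds
                self.slow_streak = 0
        else:
            self.slow_streak = 0
        return chunks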
Intuition Gained
Test Retrieval Independently from Generation
When response quality drops, the first question should be: "Is the problem in retrieval or generation?" Without separate retrieval metrics, you cannot distinguish a model that generates poorly from good context from one that generates confidently from bad context. The latter is far more dangerous.
MRR Is the Most Actionable Retrieval Metric
Precision@K and Recall@K tell you about the top-K as a set. MRR tells you about rank order — specifically, where the first correct chunk appears. For MangaAssist, if the correct chunk is at position 1 vs. position 5, the LLM's utilization of that chunk changes dramatically. A 0.05 improvement in MRR has more impact on downstream quality than a 0.05 improvement in Precision@5.
Retrieval Configuration Should Be Intent-Dependent
Different intents have different retrieval characteristics. Product questions are keyword-heavy (exact product names). Recommendations are semantic-heavy (genre similarity, thematic matching). FAQ queries are a mix. Using a single hybrid search weight for all intents is a compromise that serves none of them optimally. Intent-dependent retrieval configuration is a low-effort, high-impact optimization.
References
- MangaAssist Architecture LLD — RAG pipeline implementation details
- RAG Pipeline Cost Optimization — Content chunking and indexing
- Comprehensive Assessment — RAG evaluation perspective
- FM Output Quality Assessment — Context utilization scoring
- Data Integrations — OpenSearch Serverless configuration
- Performance Optimization — Query latency optimization