# RAG Retrieval Pipeline Deep Dive - Shared Infrastructure
## Purpose
Every RAG-based MCP in the chatbot relies on the same core retrieval infrastructure. This document details each shared component end-to-end: embedding service, hybrid retrieval, reranking, context assembly, and the offline pipelines that keep indexes fresh.
## Pipeline Overview

```mermaid
flowchart TD
    subgraph OFFLINE["Offline Indexing Pipeline"]
        DP([Document / Manga / Review<br/>Ingested or Updated]) --> CH[Chunker<br/>Paragraph-aware; 512 tokens]
        CH --> EB2[Batch Embedding<br/>Titan Embed v2; batch=100]
        EB2 --> OS2[(OpenSearch Index<br/>kNN + BM25)]
        EB2 --> DY2[(DynamoDB<br/>Embedding cache)]
    end
    subgraph ONLINE["Online Retrieval Pipeline per Tool Call"]
        QI([Query Input]) --> QP[Query Preprocessor<br/>Normalise, detect language, expand synonyms]
        QP --> QEB[Query Embedding<br/>Titan Embed v2 cached]
        QP --> BM[BM25 Query<br/>Keyword terms]
        QEB --> OS3[(OpenSearch)]
        BM --> OS3
        OS3 --> HY[Hybrid Score Fusion<br/>Reciprocal Rank Fusion]
        HY --> RK[Cross-Encoder Rerank<br/>BGE-reranker-v2-m3]
        RK --> CA[Context Assembler<br/>Structured XML output]
        CA --> TR([Tool Result])
    end
    OS2 -.->|Index| OS3
    style OFFLINE fill:#1a3a5c,color:#fff
    style ONLINE fill:#1a5c2a,color:#fff
    style TR fill:#27AE60,color:#fff
```
## Component 1: Embedding Service

### Architecture

```mermaid
flowchart LR
    MCPs([All 7 MCP Servers]) --> ELB[ALB<br/>embedding-svc.internal]
    ELB --> ECS1[ECS Task 1<br/>Embedding Service]
    ELB --> ECS2[ECS Task 2]
    ELB --> ECS3[ECS Task N]
    ECS1 --> BR[Bedrock Runtime<br/>amazon.titan-embed-text-v2:0]
    ECS2 --> BR
    ECS3 --> BR
    ECS1 --> RC[(ElastiCache Redis<br/>Embedding dedup cache)]
    style MCPs fill:#4A90D9,color:#fff
    style BR fill:#E67E22,color:#fff
```
### Embedding Cache Strategy
```python
import asyncio
import hashlib
import json

# `redis` is an async ElastiCache client and `bedrock` a boto3
# bedrock-runtime client, both initialised at module scope.

async def embed(text: str) -> list[float]:
    cache_key = f"emb:{hashlib.sha256(text.encode()).hexdigest()[:16]}"

    # Check ElastiCache first
    cached = await redis.get(cache_key)
    if cached:
        return json.loads(cached)

    # Call Bedrock Titan Embed v2; boto3 is synchronous, so run it in a
    # worker thread to avoid blocking the event loop
    response = await asyncio.to_thread(
        bedrock.invoke_model,
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text, "dimensions": 1024, "normalize": True}),
    )
    embedding = json.loads(response["body"].read())["embedding"]

    # Cache for 24h - embeddings are deterministic for the same text
    await redis.setex(cache_key, 86400, json.dumps(embedding))
    return embedding
```
Cache hit rate in production: ~68% (popular queries repeat frequently). Saves ~$0.40 per 1M calls.
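The offline pipeline drives the same service at batch=100 (see the indexing diagram above). A minimal sketch of that stage, assuming plain asyncio fan-out: `embed_batch` is a hypothetical helper, and since Titan Embed v2's runtime API takes one `inputText` per invocation, a batch here is a window of concurrent calls.

```python
import asyncio

async def embed_batch(texts: list[str], batch_size: int = 100) -> list[list[float]]:
    # Hypothetical offline-indexing helper: embed documents in windows of
    # `batch_size` concurrent calls through embed(), which also pre-warms
    # the dedup cache for the online path.
    embeddings: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        window = texts[start:start + batch_size]
        embeddings.extend(await asyncio.gather(*(embed(t) for t in window)))
    return embeddings
```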
## Component 2: Hybrid Retrieval

### Reciprocal Rank Fusion (RRF)

```mermaid
flowchart LR
    D[Dense kNN Results<br/>Ranked list 1] --> RRF[RRF Fusion<br/>score uses reciprocal rank sum]
    S[Sparse BM25 Results<br/>Ranked list 2] --> RRF
    RRF --> UN[Unified Ranked List<br/>Top-K candidates]
    style RRF fill:#8E44AD,color:#fff
    style UN fill:#27AE60,color:#fff
```
```python
from dataclasses import dataclass

@dataclass
class DocScore:
    id: str
    score: float

def reciprocal_rank_fusion(
    dense_results: list[DocScore],
    sparse_results: list[DocScore],
    k: int = 60,
    dense_weight: float = 0.7,
    sparse_weight: float = 0.3,
) -> list[DocScore]:
    """Weighted RRF: each list contributes weight / (k + rank), rank from 1."""
    scores: dict[str, float] = {}
    for rank, doc in enumerate(dense_results, start=1):
        scores[doc.id] = scores.get(doc.id, 0.0) + dense_weight / (k + rank)
    for rank, doc in enumerate(sparse_results, start=1):
        scores[doc.id] = scores.get(doc.id, 0.0) + sparse_weight / (k + rank)
    return sorted(
        (DocScore(id=doc_id, score=score) for doc_id, score in scores.items()),
        key=lambda x: x.score,
        reverse=True,
    )
```
Why k=60? The k constant prevents the top-1 result from dominating the fusion. Empirically, k=60 gives the best recall@10 on the MangaAssist golden evaluation dataset.
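For concreteness, take the unweighted contribution of a single list: with k=1, rank 1 scores 1/2 and rank 2 scores 1/3, a 50% gap, so one list's top hit can dominate the fusion; with k=60, the same ranks score 1/61 and 1/62, a gap under 2%, so consistent mid-rank agreement across both retrievers outweighs a single high placement.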
## Component 3: OpenSearch Index Configuration

### Shared Index Settings

```json
{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 128,
      "refresh_interval": "30s",
      "number_of_shards": 5,
      "number_of_replicas": 1
    },
    "analysis": {
      "analyzer": {
        "manga_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding", "manga_synonyms"]
        }
      },
      "filter": {
        "manga_synonyms": {
          "type": "synonym",
          "synonyms": [
            "seinen, adult manga, mature manga",
            "shonen, shounen, boys manga",
            "isekai, reincarnation fantasy",
            "mecha, robot anime manga"
          ]
        }
      }
    }
  }
}
```
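Tying the index to the fusion step, the two retrieval legs could be issued as below. This is a sketch, not the production query path: the opensearch-py client and the `embedding` and `text` field names are assumptions.

```python
from opensearchpy import OpenSearch  # assumed client library

def hybrid_search(client: OpenSearch, index: str, query_text: str,
                  query_vec: list[float], top_k: int = 20) -> list[DocScore]:
    # Dense leg: approximate kNN over the 1024-dim vector field
    dense = client.search(index=index, body={
        "size": top_k,
        "query": {"knn": {"embedding": {"vector": query_vec, "k": top_k}}},
    })
    # Sparse leg: BM25 match; manga_analyzer applies via the field mapping
    sparse = client.search(index=index, body={
        "size": top_k,
        "query": {"match": {"text": query_text}},
    })

    def to_docs(resp: dict) -> list[DocScore]:
        return [DocScore(id=h["_id"], score=h["_score"]) for h in resp["hits"]["hits"]]

    # Fuse the two ranked lists with the RRF function from Component 2
    return reciprocal_rank_fusion(to_docs(dense), to_docs(sparse))
```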
## Component 4: Cross-Encoder Reranker

### Deployment

```mermaid
flowchart LR
    CH([Top-20 Candidates<br/>plus original query]) --> SM[SageMaker Real-Time Endpoint<br/>BGE-reranker-v2-m3<br/>ml.g4dn.xlarge]
    SM --> SC[Ranked scores<br/>0.0 to 1.0 per candidate]
    SC --> TOP[Top-3 or Top-5<br/>per tool configuration]
    style CH fill:#4A90D9,color:#fff
    style TOP fill:#27AE60,color:#fff
```

### Why Cross-Encoder Over Bi-Encoder for Reranking?
| Property | Bi-Encoder (embedding similarity) | Cross-Encoder (reranker) |
|---|---|---|
| Speed | O(1) vector dot product | O(N) inference per pair |
| Accuracy | Good for recall | Better for precision |
| Context | Query and doc embedded separately | Query + doc encoded jointly |
| Use in pipeline | Retrieve top-20 (recall-focused) | Rerank to top-5 (precision-focused) |
The two-stage approach gets the best of both: cheap retrieval over millions of docs, expensive-but-accurate reranking over 20.
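Invoking the endpoint could look like the sketch below. The endpoint name and JSON payload shape are assumptions; they depend on how BGE-reranker-v2-m3 was packaged behind the SageMaker endpoint.

```python
import json
import boto3

smr = boto3.client("sagemaker-runtime")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[tuple[str, float]]:
    # Score each (query, candidate) pair jointly with the cross-encoder,
    # then keep the top_n highest-scoring candidates.
    response = smr.invoke_endpoint(
        EndpointName="bge-reranker-v2-m3",  # illustrative endpoint name
        ContentType="application/json",
        Body=json.dumps({"pairs": [[query, c] for c in candidates]}),
    )
    scores = json.loads(response["Body"].read())  # assumed: one score per pair
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return ranked[:top_n]
```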
## Component 5: Chunking Strategy

Different content types use different chunking:

```mermaid
flowchart TD
    DT{Document Type} --> PD[Policy Docs<br/>Paragraph boundaries<br/>max 512 tokens<br/>64-token overlap]
    DT --> RV[Reviews<br/>One review per chunk<br/>cap at 256 tokens]
    DT --> MG[Manga Metadata<br/>One doc per manga<br/>Composite field embedding]
    DT --> PG[Long-form FAQ<br/>Section-level chunks<br/>H2 heading boundaries]
    PD --> OS4[(OpenSearch)]
    RV --> OS4
    MG --> OS4
    PG --> OS4
    style DT fill:#8E44AD,color:#fff
    style OS4 fill:#E67E22,color:#fff
```
Parent-child chunking for policy docs (a chunker sketch follows the list):

- Parent doc: the full policy document (for context retrieval)
- Child chunks: 512-token paragraphs (for precision retrieval)
- On retrieval: return the child chunk plus the parent doc summary, which avoids losing context at chunk boundaries
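A minimal sketch of the paragraph-aware chunker under the 512-token / 64-token-overlap parameters above. Whitespace tokens stand in for the real tokenizer, which isn't named in this document.

```python
def chunk_paragraph_aware(text: str, max_tokens: int = 512,
                          overlap: int = 64) -> list[str]:
    # Greedy paragraph packing: fill each chunk up to max_tokens, then carry
    # the last `overlap` tokens into the next chunk so no boundary loses
    # context. A single paragraph longer than max_tokens would need a
    # further split, omitted here.
    chunks: list[str] = []
    current: list[str] = []  # token buffer for the chunk being built
    for para in text.split("\n\n"):
        tokens = para.split()
        if current and len(current) + len(tokens) > max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap:]  # 64-token overlap
        current.extend(tokens)
    if current:
        chunks.append(" ".join(current))
    return chunks
```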
## Component 6: Offline Index Refresh Pipelines

```mermaid
flowchart TD
    subgraph TRIGGERS["Refresh Triggers"]
        RT1([New manga published]) --> CIP[Catalog Index Pipeline]
        RT2([Review submitted]) --> RIP[Review Index Pipeline]
        RT3([Policy doc updated to S3]) --> PIP[Policy Index Pipeline]
        RT4([Nightly at 02:00 JST]) --> BRP[Batch Re-embed Pipeline<br/>for drift correction]
    end
    CIP --> OS5[(OpenSearch<br/>Catalog Index)]
    RIP --> OS5
    PIP --> OS5
    BRP --> OS5
    style TRIGGERS fill:#1a3a5c,color:#fff
    style OS5 fill:#E67E22,color:#fff
```
### Batch Re-Embed Pipeline (Drift Correction)

When the embedding model is upgraded (e.g., Titan Embed v1 to v2), every stored vector becomes stale: vectors from different models live in different spaces and are not comparable, so the whole corpus must be re-embedded. The drift-correction pipeline:

```mermaid
sequenceDiagram
    participant EB as EventBridge
    participant SF as Step Functions
    participant S3 as Amazon S3
    participant BR as Bedrock
    participant OS as OpenSearch
    EB->>SF: embedding_model_updated event
    SF->>OS: Export all doc_ids to S3
    SF->>SF: Chunk into batches of 1000
    loop For each batch
        SF->>S3: Read batch doc_ids
        SF->>OS: Fetch raw text for each doc
        SF->>BR: Batch embed 100 at a time
        SF->>OS: Bulk update embedding fields
    end
    SF->>OS: Swap index alias to new index
    SF->>EB: Emit reindex_complete
```
Blue/green re-indexing: New vectors go into a parallel index (manga-catalog-v2) while manga-catalog stays live. Alias swap is atomic - zero downtime.
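The swap itself is one aliases call; both actions apply in a single cluster-state update. A sketch assuming opensearch-py (`manga-catalog-v1` as the old index name is illustrative):

```python
from opensearchpy import OpenSearch

def swap_alias(client: OpenSearch, alias: str, old_index: str, new_index: str) -> None:
    # Remove and add execute atomically, so readers never observe zero or
    # two indexes behind the alias.
    client.indices.update_aliases(body={
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    })

# swap_alias(client, alias="manga-catalog",
#            old_index="manga-catalog-v1", new_index="manga-catalog-v2")
```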
## Component 7: Retrieval Quality Metrics

```mermaid
flowchart LR
    GD[(Golden Dataset<br/>500 labelled query-result pairs)] --> EV[Evaluator Lambda<br/>weekly batch job]
    EV --> R10[Recall at 10<br/>target 0.92 or higher]
    EV --> MRR[MRR at 5<br/>target 0.78 or higher]
    EV --> NDCG[NDCG at 5<br/>target 0.82 or higher]
    R10 --> CW[CloudWatch<br/>Metrics Dashboard]
    MRR --> CW
    NDCG --> CW
    CW --> AL{Alert<br/>if below threshold}
    AL -->|Yes| PD[PagerDuty<br/>ML-on-call]
    style GD fill:#4A90D9,color:#fff
    style AL fill:#C0392B,color:#fff
    style PD fill:#C0392B,color:#fff
```
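For reference, the three metrics in minimal binary-relevance form (the golden dataset's label format isn't specified here, so graded relevance is omitted):

```python
import math

def recall_at_k(relevant: set[str], retrieved: list[str], k: int) -> float:
    # Fraction of the relevant docs that appear in the top-k results
    return len(relevant & set(retrieved[:k])) / len(relevant) if relevant else 0.0

def mrr_at_k(relevant: set[str], retrieved: list[str], k: int) -> float:
    # Reciprocal rank of the first relevant doc in the top-k, else 0
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(relevant: set[str], retrieved: list[str], k: int) -> float:
    # DCG of the actual ranking divided by the DCG of an ideal ranking
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(retrieved[:k], start=1)
              if doc_id in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```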
### Latency Comparison: All Retrieval Stages
| Stage | P50 | P95 | P99 | Owner |
|---|---|---|---|---|
| Query embedding (cache miss) | 35ms | 50ms | 70ms | Bedrock |
| Query embedding (cache hit) | 1ms | 2ms | 3ms | ElastiCache |
| OpenSearch kNN (5M docs) | 80ms | 150ms | 200ms | OpenSearch |
| OpenSearch BM25 | 20ms | 40ms | 60ms | OpenSearch |
| RRF fusion (in-memory) | 1ms | 2ms | 3ms | MCP server |
| Cross-encoder rerank (20 to 5) | 60ms | 90ms | 110ms | SageMaker |
| Context assembly | 1ms | 3ms | 5ms | MCP server |
| Total pipeline | 198ms | 337ms | 451ms | - |
## Interview Grill
Q: How do you choose between RRF and weighted linear combination for score fusion?
A: RRF is preferred because it's rank-based, not score-based. Dense scores and BM25 scores live on incompatible scales (bounded cosine similarity vs an unbounded BM25 relevance score). Normalising them onto one scale requires per-index calibration that drifts over time. RRF bypasses this entirely: it uses only rank position, which is scale-invariant.
Q: What happens if the SageMaker reranker endpoint is cold?
A: The reranker endpoint is configured with min_capacity=1 (always warm). If it times out (circuit breaker at 300ms), the MCP falls back to returning the RRF-fused list without reranking and adds "reranked": false to the tool result. Claude's system prompt says to treat this as a slightly lower-confidence result - not a failure.
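A sketch of that fallback, reusing the rerank() sketch from Component 4; the `texts` lookup (doc id to chunk text) is hypothetical:

```python
import asyncio

async def rerank_with_fallback(query: str, fused: list[DocScore],
                               texts: dict[str, str], top_n: int = 5) -> dict:
    candidates = [texts[d.id] for d in fused]
    try:
        ranked = await asyncio.wait_for(
            asyncio.to_thread(rerank, query, candidates, top_n),
            timeout=0.3,  # circuit breaker at 300ms
        )
        return {"results": ranked, "reranked": True}
    except asyncio.TimeoutError:
        # Fall back to the RRF-fused order and flag it for the model
        return {
            "results": [(texts[d.id], d.score) for d in fused[:top_n]],
            "reranked": False,
        }
```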
Q: How do you avoid re-embedding unchanged documents in the nightly drift pipeline?
A: Each document in the OpenSearch index carries embedding_model_version. The batch pipeline skips any doc where embedding_model_version == current_model_version. Only new docs or docs with stale versions are re-embedded.
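A sketch of the skip condition; the version tag value and field usage are illustrative:

```python
CURRENT_MODEL_VERSION = "amazon.titan-embed-text-v2:0"  # illustrative tag

# OpenSearch query selecting only docs whose stored vectors are stale
stale_docs_query = {
    "query": {
        "bool": {
            "must_not": {
                "term": {"embedding_model_version": CURRENT_MODEL_VERSION}
            }
        }
    }
}

def needs_reembedding(doc: dict) -> bool:
    return doc.get("embedding_model_version") != CURRENT_MODEL_VERSION
```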
Q: How do you measure if RAG quality in production matches offline evaluation?
A: Online-offline correlation: for 1% of production queries, the query, retrieved chunks, and final answer are logged (with PII stripped). Each week, 50 samples are human-labelled for relevance. The delta between online recall@5 and offline recall@5 on the golden dataset is the correlation score; if it exceeds 0.15, we investigate distribution shift.
Q: Why not use a managed RAG service like Bedrock Knowledge Bases instead of building this yourself?
A: Bedrock Knowledge Bases handles a single use case well (document Q&A). We have seven different retrieval patterns - review corpus aggregation, graph-backed similarity, real-time trend scoring, and multi-source fusion - that require custom pipeline stages. Using Knowledge Bases for the Catalog and Policy MCPs is being explored as a future cost reduction, but the cross-MCP shared embedding service and reranker would still be custom.