
RAG Retrieval Pipeline Deep Dive - Shared Infrastructure

Purpose

Every RAG-based MCP in the chatbot relies on the same core retrieval infrastructure. This document details each shared component end-to-end: embedding service, hybrid retrieval, reranking, context assembly, and the offline pipelines that keep indexes fresh.


Pipeline Overview

flowchart TD
    subgraph OFFLINE["Offline Indexing Pipeline"]
        DP([Document / Manga / Review<br/>Ingested or Updated]) --> CH[Chunker<br/>Paragraph-aware; 512 tokens]
        CH --> EB2[Batch Embedding<br/>Titan Embed v2; batch=100]
        EB2 --> OS2[(OpenSearch Index<br/>kNN + BM25)]
        EB2 --> DY2[(DynamoDB<br/>Embedding cache)]
    end

    subgraph ONLINE["Online Retrieval Pipeline per Tool Call"]
        QI([Query Input]) --> QP[Query Preprocessor<br/>Normalise, detect language, expand synonyms]
        QP --> QEB[Query Embedding<br/>Titan Embed v2 cached]
        QP --> BM[BM25 Query<br/>Keyword terms]
        QEB --> OS3[(OpenSearch)]
        BM --> OS3
        OS3 --> HY[Hybrid Score Fusion<br/>Reciprocal Rank Fusion]
        HY --> RK[Cross-Encoder Rerank<br/>BGE-reranker-v2-m3]
        RK --> CA[Context Assembler<br/>Structured XML output]
        CA --> TR([Tool Result])
    end

    OS2 -.->|Index| OS3

    style OFFLINE fill:#1a3a5c,color:#fff
    style ONLINE fill:#1a5c2a,color:#fff
    style TR fill:#27AE60,color:#fff

Component 1: Embedding Service

Architecture

flowchart LR
    MCPs([All 7 MCP Servers]) --> ELB[ALB<br/>embedding-svc.internal]
    ELB --> ECS1[ECS Task 1<br/>Embedding Service]
    ELB --> ECS2[ECS Task 2]
    ELB --> ECS3[ECS Task N]
    ECS1 --> BR[Bedrock Runtime<br/>amazon.titan-embed-text-v2:0]
    ECS2 --> BR
    ECS3 --> BR
    ECS1 --> RC[(ElastiCache Redis<br/>Embedding dedup cache)]

    style MCPs fill:#4A90D9,color:#fff
    style BR fill:#E67E22,color:#fff

Embedding Cache Strategy

import hashlib
import json

# Assumed module-level clients: `redis` is a redis.asyncio.Redis connected to
# ElastiCache, and `bedrock` is a boto3 "bedrock-runtime" client.

async def embed(text: str) -> list[float]:
    # Key on a SHA-256 prefix of the text: identical text maps to the same entry
    cache_key = f"emb:{hashlib.sha256(text.encode()).hexdigest()[:16]}"

    # Check ElastiCache first
    cached = await redis.get(cache_key)
    if cached:
        return json.loads(cached)

    # Cache miss: call Bedrock Titan Embed v2
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text, "dimensions": 1024, "normalize": True}),
    )
    embedding = json.loads(response["body"].read())["embedding"]

    # Cache for 24h - embeddings are deterministic for the same text
    await redis.setex(cache_key, 86400, json.dumps(embedding))
    return embedding

Cache hit rate in production: ~68% (popular queries repeat frequently), saving ~$0.40 per 1M embedding calls.


Component 2: Hybrid Retrieval

Reciprocal Rank Fusion (RRF)

flowchart LR
    D[Dense kNN Results<br/>Ranked list 1] --> RRF[RRF Fusion<br/>score uses reciprocal rank sum]
    S[Sparse BM25 Results<br/>Ranked list 2] --> RRF
    RRF --> UN[Unified Ranked List<br/>Top-K candidates]

    style RRF fill:#8E44AD,color:#fff
    style UN fill:#27AE60,color:#fff

from dataclasses import dataclass

@dataclass
class DocScore:
    id: str
    score: float

def reciprocal_rank_fusion(
    dense_results: list[DocScore],
    sparse_results: list[DocScore],
    k: int = 60,
    dense_weight: float = 0.7,
    sparse_weight: float = 0.3,
) -> list[DocScore]:
    scores: dict[str, float] = {}

    # Each list contributes weight / (rank + k); rank is 0-based here,
    # so the top result in a list contributes weight / k.
    for rank, doc in enumerate(dense_results):
        scores[doc.id] = scores.get(doc.id, 0) + dense_weight / (rank + k)

    for rank, doc in enumerate(sparse_results):
        scores[doc.id] = scores.get(doc.id, 0) + sparse_weight / (rank + k)

    return sorted(
        [DocScore(id=doc_id, score=score) for doc_id, score in scores.items()],
        key=lambda x: x.score,
        reverse=True,
    )

Why k=60? The constant k damps the influence of rank position: without it, the rank-1 result would dominate the fused score, and larger k flattens the gap between adjacent ranks. Empirically, k=60 gives the best recall@10 on the MangaAssist golden evaluation dataset.
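
With k=60, adjacent ranks contribute nearly equal weight, so a doc that appears in both lists can beat a doc that tops only one. A quick arithmetic check using the function above (the three-doc result lists are hypothetical):

dense = [DocScore("a", 0.92), DocScore("b", 0.88), DocScore("c", 0.70)]
sparse = [DocScore("b", 11.2), DocScore("c", 9.8), DocScore("a", 4.1)]

fused = reciprocal_rank_fusion(dense, sparse)
# "a": 0.7/60 + 0.3/62 ≈ 0.01651
# "b": 0.7/61 + 0.3/60 ≈ 0.01648 - a near tie, because both lists matter
# With k=1 instead: "a" = 0.7/1 + 0.3/3 = 0.80 vs "b" = 0.7/2 + 0.3/1 = 0.65,
# i.e. the top dense hit dominates the fusion outright.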


Component 3: OpenSearch Index Configuration

Shared Index Settings

{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 128,
      "refresh_interval": "30s",
      "number_of_shards": 5,
      "number_of_replicas": 1
    },
    "analysis": {
      "analyzer": {
        "manga_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding", "manga_synonyms"]
        }
      },
      "filter": {
        "manga_synonyms": {
          "type": "synonym",
          "synonyms": [
            "seinen, adult manga, mature manga",
            "shonen, shounen, boys manga",
            "isekai, reincarnation fantasy",
            "mecha, robot anime manga"
          ]
        }
      }
    }
  }
}
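
The settings above enable kNN but omit the field mappings. A minimal sketch of the accompanying vector field mapping, sized to Titan Embed v2's 1024 dimensions (field names and HNSW parameters are illustrative, not confirmed by the source):

{
  "mappings": {
    "properties": {
      "embedding": {
        "type": "knn_vector",
        "dimension": 1024,
        "method": {
          "name": "hnsw",
          "space_type": "cosinesimil",
          "engine": "nmslib",
          "parameters": { "ef_construction": 128, "m": 16 }
        }
      },
      "text": { "type": "text", "analyzer": "manga_analyzer" }
    }
  }
}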

Component 4: Cross-Encoder Reranker

Deployment

flowchart LR
    CH([Top-20 Candidates<br/>plus original query]) --> SM[SageMaker Real-Time Endpoint<br/>BGE-reranker-v2-m3<br/>ml.g4dn.xlarge]
    SM --> SC[Ranked scores<br/>0.0 to 1.0 per candidate]
    SC --> TOP[Top-3 or Top-5<br/>per tool configuration]

    style CH fill:#4A90D9,color:#fff
    style TOP fill:#27AE60,color:#fff

Why Cross-Encoder Over Bi-Encoder for Reranking?

| Property | Bi-Encoder (embedding similarity) | Cross-Encoder (reranker) |
| --- | --- | --- |
| Speed | O(1) vector dot product | O(N) inference per pair |
| Accuracy | Good for recall | Better for precision |
| Context | Query and doc embedded separately | Query + doc encoded jointly |
| Use in pipeline | Retrieve top-20 (recall-focused) | Rerank to top-5 (precision-focused) |

The two-stage approach gets the best of both: cheap retrieval over millions of docs, expensive-but-accurate reranking over 20.
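
A minimal sketch of the rerank call, assuming the endpoint name bge-reranker-v2-m3 and a {"query": ..., "passages": [...]} request schema (both hypothetical - the actual contract depends on the inference container):

import json
import boto3

sm = boto3.client("sagemaker-runtime")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[tuple[str, float]]:
    # Hypothetical request/response schema for the BGE reranker container
    response = sm.invoke_endpoint(
        EndpointName="bge-reranker-v2-m3",  # assumed endpoint name
        ContentType="application/json",
        Body=json.dumps({"query": query, "passages": candidates}),
    )
    scores = json.loads(response["Body"].read())["scores"]

    # Pair each candidate with its relevance score and keep the best top_n
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return ranked[:top_n]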


Component 5: Chunking Strategy

Different content types use different chunking:

flowchart TD
    DT{Document Type} --> PD[Policy Docs<br/>Paragraph boundaries<br/>max 512 tokens<br/>64-token overlap]
    DT --> RV[Reviews<br/>One review per chunk<br/>cap at 256 tokens]
    DT --> MG[Manga Metadata<br/>One doc per manga<br/>Composite field embedding]
    DT --> PG[Long-form FAQ<br/>Section-level chunks<br/>H2 heading boundaries]

    PD --> OS4[(OpenSearch)]
    RV --> OS4
    MG --> OS4
    PG --> OS4

    style DT fill:#8E44AD,color:#fff
    style OS4 fill:#E67E22,color:#fff

Parent-child chunking for policy docs (see the sketch below):
- Parent doc: the full policy document (for context retrieval)
- Child chunks: 512-token paragraphs (for precision retrieval)
- On retrieval: return the child chunk plus the parent doc summary, which avoids losing context at chunk boundaries
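
A sketch of the paragraph-aware chunker, assuming a count_tokens helper (hypothetical; in practice it would be a tokenizer matching Titan Embed v2):

def chunk_paragraph_aware(
    text: str,
    max_tokens: int = 512,
    overlap_tokens: int = 64,
) -> list[str]:
    # Pack whole paragraphs into chunks of up to max_tokens, carrying the
    # tail paragraphs of each chunk into the next one as overlap.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0

    for para in paragraphs:
        para_tokens = count_tokens(para)  # hypothetical tokenizer helper
        if current and current_tokens + para_tokens > max_tokens:
            chunks.append("\n\n".join(current))
            # Seed the next chunk with trailing paragraphs up to the overlap budget
            overlap: list[str] = []
            budget = overlap_tokens
            for prev in reversed(current):
                cost = count_tokens(prev)
                if cost > budget:
                    break
                overlap.insert(0, prev)
                budget -= cost
            current = overlap
            current_tokens = sum(count_tokens(p) for p in current)
        current.append(para)
        current_tokens += para_tokens

    if current:
        chunks.append("\n\n".join(current))
    return chunks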


Component 6: Offline Index Refresh Pipelines

flowchart TD
    subgraph TRIGGERS["Refresh Triggers"]
        RT1([New manga published]) --> CIP[Catalog Index Pipeline]
        RT2([Review submitted]) --> RIP[Review Index Pipeline]
        RT3([Policy doc updated to S3]) --> PIP[Policy Index Pipeline]
        RT4([Nightly at 02:00 JST]) --> BRP[Batch Re-embed Pipeline<br/>for drift correction]
    end

    CIP --> OS5[(OpenSearch<br/>Catalog Index)]
    RIP --> OS5
    PIP --> OS5
    BRP --> OS5

    style TRIGGERS fill:#1a3a5c,color:#fff
    style OS5 fill:#E67E22,color:#fff

Batch Re-Embed Pipeline (Drift Correction)

When the embedding model is updated (e.g., Titan Embed v1 to v2), every stored vector becomes stale - vectors produced by different models are not comparable. The drift correction pipeline:

sequenceDiagram
    participant EB as EventBridge
    participant SF as Step Functions
    participant S3 as Amazon S3
    participant BR as Bedrock
    participant OS as OpenSearch

    EB->>SF: embedding_model_updated event
    SF->>OS: Export all doc_ids to S3
    SF->>SF: Chunk into batches of 1000
    loop For each batch
        SF->>S3: Read batch doc_ids
        SF->>OS: Fetch raw text for each doc
        SF->>BR: Batch embed 100 at a time
        SF->>OS: Bulk update embedding fields
    end
    SF->>OS: Swap index alias to new index
    SF->>EB: Emit reindex_complete

Blue/green re-indexing: New vectors go into a parallel index (manga-catalog-v2) while manga-catalog stays live. Alias swap is atomic - zero downtime.
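
A minimal sketch of that atomic swap using opensearch-py (the host and the old index name manga-catalog-v1 are assumed for illustration):

from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://search.internal:9200"])  # assumed host

# Both actions apply in a single atomic _aliases call, so no request
# ever sees zero - or two - indexes behind the alias.
client.indices.update_aliases(body={
    "actions": [
        {"remove": {"index": "manga-catalog-v1", "alias": "manga-catalog"}},
        {"add": {"index": "manga-catalog-v2", "alias": "manga-catalog"}},
    ]
})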


Component 7: Retrieval Quality Metrics

flowchart LR
    GD[(Golden Dataset<br/>500 labelled query-result pairs)] --> EV[Evaluator Lambda<br/>weekly batch job]
    EV --> R10[Recall at 10<br/>target 0.92 or higher]
    EV --> MRR[MRR at 5<br/>target 0.78 or higher]
    EV --> NDCG[NDCG at 5<br/>target 0.82 or higher]
    R10 --> CW[CloudWatch<br/>Metrics Dashboard]
    MRR --> CW
    NDCG --> CW
    CW --> AL{Alert<br/>if below threshold}
    AL -->|Yes| PD[PagerDuty<br/>ML-on-call]

    style GD fill:#4A90D9,color:#fff
    style AL fill:#C0392B,color:#fff
    style PD fill:#C0392B,color:#fff
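
A sketch of the metric computation, assuming the golden dataset provides the set of relevant doc ids per query (recall@10 and MRR@5 as in the diagram; NDCG omitted for brevity):

def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    # Fraction of the labelled relevant docs that appear in the top-k results
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0

def mrr_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    # Reciprocal rank of the first relevant doc within the top-k, else 0
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Each metric is averaged over the 500 golden queries, then published to CloudWatch.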

Latency Comparison: All Retrieval Stages

| Stage | P50 | P95 | P99 | Owner |
| --- | --- | --- | --- | --- |
| Query embedding (cache miss) | 35ms | 50ms | 70ms | Bedrock |
| Query embedding (cache hit) | 1ms | 2ms | 3ms | ElastiCache |
| OpenSearch kNN (5M docs) | 80ms | 150ms | 200ms | OpenSearch |
| OpenSearch BM25 | 20ms | 40ms | 60ms | OpenSearch |
| RRF fusion (in-memory) | 1ms | 2ms | 3ms | MCP server |
| Cross-encoder rerank (20 to 5) | 60ms | 90ms | 110ms | SageMaker |
| Context assembly | 1ms | 3ms | 5ms | MCP server |
| Total pipeline | 198ms | 337ms | 451ms | - |

Interview Grill

Q: How do you choose between RRF and weighted linear combination for score fusion?
A: RRF is preferred because it's rank-based, not score-based. Dense scores and BM25 scores live on incompatible scales (cosine similarity vs TF-IDF score). Normalising them to the same scale requires per-index calibration that drifts over time. RRF bypasses this entirely - it only uses rank position, which is scale-invariant.

Q: What happens if the SageMaker reranker endpoint is cold?
A: The reranker endpoint is configured with min_capacity=1 (always warm). If it times out (circuit breaker at 300ms), the MCP falls back to returning the RRF-fused list without reranking and adds "reranked": false to the tool result. Claude's system prompt says to treat this as a slightly lower-confidence result - not a failure.
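
A sketch of that fallback path, assuming the rerank helper sketched earlier and an asyncio-based MCP server (the texts mapping is illustrative plumbing from doc id to chunk text):

import asyncio

async def rerank_with_fallback(
    query: str, fused: list[DocScore], texts: dict[str, str]
) -> dict:
    candidates = [texts[d.id] for d in fused[:20]]
    try:
        top = await asyncio.wait_for(
            asyncio.to_thread(rerank, query, candidates),  # blocking boto3 call
            timeout=0.3,  # 300ms circuit-breaker budget
        )
        return {"results": top, "reranked": True}
    except asyncio.TimeoutError:
        # Degrade gracefully: serve the RRF order and flag it for Claude
        return {"results": fused[:5], "reranked": False}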

Q: How do you avoid re-embedding unchanged documents in the nightly drift pipeline?
A: Each document in the OpenSearch index carries embedding_model_version. The batch pipeline skips any doc where embedding_model_version == current_model_version. Only new docs or docs with stale versions are re-embedded.
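
A sketch of the staleness filter as an OpenSearch query (the version string is an assumed example):

stale_docs_query = {
    "query": {
        "bool": {
            # Match every doc NOT already embedded with the current model
            "must_not": {"term": {"embedding_model_version": "titan-embed-v2"}}
        }
    }
}
# The Step Functions batch pages through these hits (batches of 1000)
# and re-embeds only them.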

Q: How do you measure if RAG quality in production matches offline evaluation?
A: Online-offline correlation: for 1% of production queries, the query + retrieved chunks + final answer are logged (with PII stripped). Weekly, 50 samples are human-labelled for relevance. The delta between online recall@5 and offline recall@5 on the golden dataset is the correlation score. If it exceeds 0.15, we investigate distribution shift.

Q: Why not use a managed RAG service like Bedrock Knowledge Bases instead of building this yourself?
A: Bedrock Knowledge Bases handles a single use case well (document Q&A). We have seven different retrieval patterns - review corpus aggregation, graph-backed similarity, real-time trend scoring, and multi-source fusion - that require custom pipeline stages. Using Knowledge Bases for the Catalog and Policy MCPs is being explored as a future cost reduction, but the cross-MCP shared embedding service and reranker would still be custom.