
US-05: RAG Retrieval Depth vs Speed vs Cost

User Story

As an ML engineering lead responsible for the RAG pipeline, I want to determine the optimal number of retrieved chunks, whether to rerank, and which embedding model to use, so that the LLM is grounded in relevant data without blowing the latency or token budget.

The Debate

graph TD
    subgraph "Inference Team"
        I["Retrieve 10 chunks.<br/>Use cross-encoder reranking.<br/>Better grounding = fewer<br/>hallucinations = fewer<br/>escalations = lower support cost."]
    end

    subgraph "Performance Team"
        P["10 chunks + reranking<br/>adds 350ms to every request.<br/>That's our entire RAG budget<br/>plus half the LLM budget<br/>(more tokens = slower prefill)."]
    end

    subgraph "Cost Team"
        C["Each chunk adds ~100 tokens<br/>to the prompt. 10 chunks =<br/>1,000 extra input tokens =<br/>$3/1M × 1B = $3,000/day<br/>just in retrieval context.<br/>Plus OpenSearch query costs."]
    end

    I ---|"Latency<br/>tension"| P
    P ---|"Token cost<br/>tension"| C
    C ---|"Quality<br/>tension"| I

    style I fill:#ff6b6b,stroke:#333,color:#000
    style P fill:#4ecdc4,stroke:#333,color:#000
    style C fill:#f9d71c,stroke:#333,color:#000

Acceptance Criteria

  • Hallucination rate on RAG-backed intents is below 4%.
  • RAG retrieval latency (embed + search + rerank) stays under 250ms p95.
  • RAG-contributed tokens in the prompt do not exceed 400 tokens average.
  • Retrieval recall@3 is ≥ 0.85 (the correct chunk is in the top 3 for 85% of queries).
  • Monthly RAG infrastructure cost stays under $15,000.
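
These criteria can be encoded as a simple release gate. A minimal sketch, assuming a `metrics` snapshot pulled from whatever dashboard backs these numbers (the field names here are hypothetical):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RagSlo:
    """Acceptance criteria from this story, expressed as thresholds."""
    max_hallucination_rate: float = 0.04    # RAG-backed intents
    max_retrieval_p95_ms: float = 250.0     # embed + search + rerank
    max_avg_rag_tokens: float = 400.0       # RAG-contributed prompt tokens
    min_recall_at_3: float = 0.85
    max_monthly_cost_usd: float = 15_000.0

def check_slo(metrics: dict, slo: RagSlo = RagSlo()) -> list[str]:
    """Return the list of violated criteria (an empty list means all green)."""
    violations = []
    if metrics["hallucination_rate"] > slo.max_hallucination_rate:
        violations.append("hallucination rate above 4%")
    if metrics["retrieval_p95_ms"] > slo.max_retrieval_p95_ms:
        violations.append("retrieval latency p95 above 250ms")
    if metrics["avg_rag_tokens"] > slo.max_avg_rag_tokens:
        violations.append("average RAG context above 400 tokens")
    if metrics["recall_at_3"] < slo.min_recall_at_3:
        violations.append("recall@3 below 0.85")
    if metrics["monthly_cost_usd"] > slo.max_monthly_cost_usd:
        violations.append("monthly RAG infrastructure cost above $15,000")
    return violations
```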

The RAG Pipeline and Its Cost/Latency Points

graph TD
    A["User Query"] --> B["Embed Query<br/>⏱️ 15ms | 💰 $0.0001"]
    B --> C["KNN Search<br/>(OpenSearch)<br/>⏱️ 30-80ms | 💰 $0.0003"]
    C --> D{"Rerank?"}
    D -->|"Yes"| E["Reranker Model<br/>⏱️ 50-200ms | 💰 $0.001-0.005"]
    D -->|"No"| F["Use KNN order"]
    E --> G["Select Top-K<br/>Chunks"]
    F --> G
    G --> H["Inject into Prompt<br/>⏱️ ~0ms | 💰 $0.003/1K tokens"]
    H --> I["LLM Processes<br/>Longer Prompt<br/>⏱️ +10ms/100 tokens"]

    style B fill:#2d8659,stroke:#333,color:#fff
    style C fill:#fd9644,stroke:#333,color:#000
    style E fill:#eb3b5a,stroke:#333,color:#fff
    style H fill:#f9d71c,stroke:#333,color:#000
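
The same flow in code. This is a minimal orchestration sketch rather than the production implementation: the embed, search, and rerank steps are passed in as callables because this story does not pin specific clients.

```python
from typing import Callable, Optional, Sequence

def retrieve_context(
    query: str,
    embed: Callable[[str], Sequence[float]],                         # ~15ms
    knn_search: Callable[[Sequence[float], int], list[str]],         # 30-80ms (OpenSearch)
    rerank: Optional[Callable[[str, list[str]], list[str]]] = None,  # 50-200ms if used
    k: int = 3,
    candidate_pool: int = 10,
) -> list[str]:
    """Return the top-k chunks to inject into the prompt (~100 tokens each)."""
    q_vec = embed(query)
    # Over-fetch only when a reranker will narrow the candidates back down to k.
    candidates = knn_search(q_vec, candidate_pool if rerank else k)
    if rerank:
        candidates = rerank(query, candidates)
    return candidates[:k]
```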

Cumulative Cost and Latency by Chunk Count

| Chunks Retrieved | KNN Latency | Rerank Latency | Prompt Tokens Added | Total RAG Latency | RAG Token Cost/Req | Hallucination Rate |
|---|---|---|---|---|---|---|
| 1 | 30ms | 0ms | ~100 | 45ms | $0.0003 | 12% |
| 3 | 40ms | 60ms (lightweight) | ~300 | 115ms | $0.0009 | 4% |
| 3 | 40ms | 150ms (cross-encoder) | ~300 | 205ms | $0.0059 | 3% |
| 5 | 50ms | 100ms (lightweight) | ~500 | 165ms | $0.0015 | 2.5% |
| 5 | 50ms | 250ms (cross-encoder) | ~500 | 315ms | $0.0065 | 1.8% |
| 10 | 80ms | 200ms (lightweight) | ~1,000 | 295ms | $0.003 | 1.5% |
| 10 | 80ms | 400ms (cross-encoder) | ~1,000 | 495ms | $0.008 | 1.2% |
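
A back-of-envelope model that reproduces these rows, assuming ~15ms for the embed step, ~100 tokens per chunk, and $3 per million input tokens as stated above. Reranker inference cost (Options B and C below) comes on top of the token cost.

```python
EMBED_MS = 15
TOKENS_PER_CHUNK = 100
PRICE_PER_INPUT_TOKEN = 3.0 / 1_000_000   # $3 / 1M input tokens

def rag_latency_ms(knn_ms: float, rerank_ms: float = 0.0) -> float:
    return EMBED_MS + knn_ms + rerank_ms

def rag_token_cost(chunks: int) -> float:
    return chunks * TOKENS_PER_CHUNK * PRICE_PER_INPUT_TOKEN

# 3 chunks + lightweight reranker: 15 + 40 + 60 = 115ms, 300 tokens ≈ $0.0009
print(rag_latency_ms(knn_ms=40, rerank_ms=60), rag_token_cost(3))
# 10 chunks + cross-encoder: 15 + 80 + 400 = 495ms, 1,000 tokens ≈ $0.003 in token cost
print(rag_latency_ms(knn_ms=80, rerank_ms=400), rag_token_cost(10))
```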

The Diminishing Returns Curve

graph LR
    subgraph "Hallucination Rate vs Chunk Count"
        direction TB
        H1["1 chunk: 12%"] --> H3["3 chunks: 4%<br/>▼ 67% improvement"]
        H3 --> H5["5 chunks: 2.5%<br/>▼ 38% improvement"]
        H5 --> H10["10 chunks: 1.5%<br/>▼ 40% improvement"]
    end

    subgraph "Marginal Gain Per Chunk"
        direction TB
        M1["Chunk 1→3:<br/>-4% hallucination<br/>per chunk added"]
        M3["Chunk 3→5:<br/>-0.75% hallucination<br/>per chunk added"]
        M5["Chunk 5→10:<br/>-0.2% hallucination<br/>per chunk added"]
    end

    style H3 fill:#2d8659,stroke:#333,color:#fff
    style M1 fill:#2d8659,stroke:#333,color:#fff

The sweet spot: 3 chunks. Going from 1→3 cuts the hallucination rate by 67%. Going from 3→5 cuts it by only a further 38% while adding roughly 45% more latency and 67% more tokens. Going from 5→10 shaves off only one more point at double the token cost.
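
The percentages in that comparison come straight from the table above; a quick check:

```python
halluc = {1: 12.0, 3: 4.0, 5: 2.5, 10: 1.5}   # hallucination rate (%), lightweight rows (1 chunk = no rerank)
latency = {1: 45, 3: 115, 5: 165, 10: 295}    # total RAG latency (ms)
tokens = {1: 100, 3: 300, 5: 500, 10: 1000}   # prompt tokens added

drop = lambda a, b: (a - b) / a
print(drop(halluc[1], halluc[3]))              # 0.67  -> 67% reduction going 1->3
print(drop(halluc[3], halluc[5]))              # 0.375 -> ~38% further reduction going 3->5
print((latency[5] - latency[3]) / latency[3])  # ~0.43 -> the ~45% extra latency cited above
print((tokens[5] - tokens[3]) / tokens[3])     # 0.67  -> 67% more tokens
```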


The Reranking Decision

Option A: No Reranking (KNN order only)

graph LR
    A["Query Vector"] --> B["KNN: Top 3<br/>by cosine similarity"]
    B --> C["Use as-is"]

    style C fill:#2d8659,stroke:#333,color:#fff
  • Latency: 40ms
  • Cost: $0.0003/request
  • Recall@3: 0.72 (the correct chunk is in the top 3 for 72% of queries)
  • Problem: Cosine similarity is a rough approximation. Semantically similar but irrelevant chunks pollute the context.

Option B: Lightweight Reranker (Bi-Encoder)

graph LR
    A["Query Vector"] --> B["KNN: Top 10<br/>by cosine similarity"]
    B --> C["Bi-Encoder Reranker<br/>Score each chunk"]
    C --> D["Select Top 3<br/>by reranked score"]

    style C fill:#fd9644,stroke:#333,color:#000
  • Latency: 100ms total
  • Cost: $0.001/request
  • Recall@3: 0.85
  • Tradeoff: 60ms extra latency, 3x cost, but 18% better recall

Option C: Cross-Encoder Reranker

graph LR
    A["Query Vector"] --> B["KNN: Top 10<br/>by cosine similarity"]
    B --> C["Cross-Encoder<br/>Joint query-chunk scoring"]
    C --> D["Select Top 3<br/>by reranked score"]

    style C fill:#eb3b5a,stroke:#333,color:#fff
  • Latency: 205ms total
  • Cost: $0.005/request
  • Recall@3: 0.93
  • Tradeoff vs. no reranking: 165ms extra latency and roughly 16x the cost, for recall 8 points higher than the bi-encoder
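
For reference, here is one way Options B and C are commonly implemented with the sentence-transformers library. The model names are illustrative placeholders, not the models this system is committed to.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_bi_encoder(query: str, chunks: list[str], k: int = 3) -> list[str]:
    # Option B: embed query and chunks independently, compare with cosine similarity.
    q = bi_encoder.encode(query, convert_to_tensor=True)
    c = bi_encoder.encode(chunks, convert_to_tensor=True)
    scores = util.cos_sim(q, c)[0].tolist()
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def rerank_cross_encoder(query: str, chunks: list[str], k: int = 3) -> list[str]:
    # Option C: score every (query, chunk) pair jointly; slower but higher recall.
    scores = cross_encoder.predict([(query, c) for c in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return [chunk for chunk, _ in ranked[:k]]
```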

Decision Matrix

graph TD
    subgraph "Decision"
        D1["FAQ intents → Option B<br/>(Bi-Encoder Reranker)<br/>Recall matters, moderate latency budget"]
        D2["Product questions → Option A<br/>(No Reranker)<br/>Structured data, ASIN-based, fast"]
        D3["Recommendations → Option B<br/>(Bi-Encoder Reranker)<br/>Quality matters, longer latency budget"]
    end

    style D1 fill:#fd9644,stroke:#333,color:#000
    style D2 fill:#2d8659,stroke:#333,color:#fff
    style D3 fill:#fd9644,stroke:#333,color:#000

Why not cross-encoder? The 8-point recall improvement costs 105ms extra (it more than doubles rerank time). For a 250ms RAG budget, the cross-encoder path alone consumes 82% of it. The bi-encoder at 0.85 recall is sufficient when combined with the LLM's ability to ignore irrelevant chunks.


Chunk Quality Optimization (Alternative to More Chunks)

Instead of retrieving more chunks, improve each chunk's quality:

graph TD
    subgraph "Retrieve More Chunks"
        M1["Retrieve 10 instead of 3<br/>+7 chunks, +700 tokens<br/>+160ms latency<br/>Hallucination: 1.5%"]
    end

    subgraph "Improve Chunk Quality"
        Q1["Better chunking strategy<br/>+0 extra chunks<br/>+0ms latency<br/>Hallucination: 3%"]
        Q2["Hypothetical document<br/>embeddings (HyDE)<br/>+50ms latency<br/>Hallucination: 2.5%"]
        Q3["Query expansion<br/>(rephrase + multi-query)<br/>+30ms latency<br/>Hallucination: 2.8%"]
    end

    M1 --> D["Compare:<br/>Cost per hallucination<br/>point reduced"]
    Q1 --> D
    Q2 --> D
    Q3 --> D

    style Q1 fill:#2d8659,stroke:#333,color:#fff
    style M1 fill:#eb3b5a,stroke:#333,color:#fff

Cost Per Hallucination Point Reduced

| Approach | Hallucination Reduction | Extra Cost/Req | Extra Latency | Cost Efficiency |
|---|---|---|---|---|
| 3→10 chunks | 2.5 points | $0.0027 | +180ms | $0.0011/point |
| Better chunking | 1.0 point | $0 | +0ms | $0/point |
| HyDE | 1.5 points | $0.001 | +50ms | $0.0007/point |
| Query expansion | 1.2 points | $0.0005 | +30ms | $0.0004/point |

Lesson: Invest in chunk quality before adding more chunks. Better chunking is free. Query expansion is 2.7x more cost-efficient than adding chunks.
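
A minimal sketch of the two cheapest levers above, HyDE and query expansion. `generate`, `embed`, and `search` stand in for the LLM client and the KNN retriever; the names and prompts are assumptions, not the production interface.

```python
from typing import Callable, Sequence

def hyde_retrieve(query: str, generate: Callable[[str], str],
                  embed: Callable[[str], Sequence[float]],
                  search: Callable[[Sequence[float], int], list[str]],
                  k: int = 3) -> list[str]:
    # HyDE: embed a hypothetical answer instead of the raw query; the answer
    # usually lands closer to the relevant chunks in embedding space.
    hypothetical = generate(f"Write a short, plausible answer to: {query}")
    return search(embed(hypothetical), k)

def multi_query_retrieve(query: str, generate: Callable[[str], str],
                         embed: Callable[[str], Sequence[float]],
                         search: Callable[[Sequence[float], int], list[str]],
                         k: int = 3) -> list[str]:
    # Query expansion: retrieve with the original query plus LLM rephrasings,
    # then merge the result lists while preserving each chunk's best rank.
    rephrasings = generate(f"Rephrase in two different ways:\n{query}").splitlines()
    merged, seen = [], set()
    for q in [query, *rephrasings]:
        for chunk in search(embed(q), k):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged[:k]
```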


Intent-Specific RAG Configuration

| Intent | Chunks | Reranker | Chunk Source | Rationale |
|---|---|---|---|---|
| faq | 3 | Bi-encoder | FAQ + Policy index | Needs accurate policy retrieval |
| product_question | 2 | None | Product description index | ASIN filter makes KNN sufficient |
| recommendation | 3 | Bi-encoder | Editorial + Review index | Quality of context matters for framing |
| return_request | 2 | None | Policy index only | Narrow scope, policy-specific |
| checkout_help | 2 | None | FAQ + Policy index | Narrow scope |
| product_discovery | 3 | Bi-encoder | Editorial + Product index | Broad retrieval |
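
The same configuration expressed as data, so the retrieval layer can look it up per request. A sketch only: the index names mirror the table and would need to be mapped to the real OpenSearch aliases.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class RetrievalConfig:
    chunks: int
    reranker: Optional[str]        # None = use the raw KNN order (Option A)
    indexes: tuple[str, ...]

RAG_CONFIG = {
    "faq":               RetrievalConfig(3, "bi-encoder", ("faq", "policy")),
    "product_question":  RetrievalConfig(2, None,         ("product_description",)),
    "recommendation":    RetrievalConfig(3, "bi-encoder", ("editorial", "review")),
    "return_request":    RetrievalConfig(2, None,         ("policy",)),
    "checkout_help":     RetrievalConfig(2, None,         ("faq", "policy")),
    "product_discovery": RetrievalConfig(3, "bi-encoder", ("editorial", "product")),
}
```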

Monitoring: RAG Quality Metrics

graph TD
    A["RAG Quality Dashboard"] --> B["Retrieval Metrics"]
    A --> C["Grounding Metrics"]
    A --> D["Cost Metrics"]

    B --> B1["Recall@K per intent"]
    B --> B2["Mean reciprocal rank"]
    B --> B3["Chunk relevance score"]

    C --> C1["Hallucination rate"]
    C --> C2["Citation accuracy"]
    C --> C3["% responses using retrieved data"]

    D --> D1["Tokens per RAG context"]
    D --> D2["OpenSearch query cost"]
    D --> D3["Reranker inference cost"]

    style A fill:#54a0ff,stroke:#333,color:#000
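
The two retrieval metrics are straightforward to compute offline against a labeled evaluation set in which each query is annotated with the chunk that should have been retrieved:

```python
def recall_at_k(results: list[list[str]], relevant: list[str], k: int = 3) -> float:
    """Fraction of queries whose relevant chunk appears in the top-k results."""
    hits = sum(1 for ranked, gold in zip(results, relevant) if gold in ranked[:k])
    return hits / len(relevant)

def mean_reciprocal_rank(results: list[list[str]], relevant: list[str]) -> float:
    """Average of 1/rank of the relevant chunk (counts 0 when it is not retrieved)."""
    total = 0.0
    for ranked, gold in zip(results, relevant):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(relevant)

# Example: two queries, relevant chunk found at rank 1 and rank 3.
print(recall_at_k([["a", "b"], ["x", "y", "q2"]], ["a", "q2"], k=3))      # 1.0
print(mean_reciprocal_rank([["a", "b"], ["x", "y", "q2"]], ["a", "q2"]))  # (1 + 1/3) / 2 ≈ 0.67
```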

2026 Update: Use Adaptive RAG Before Increasing Top-K

Treat everything above this section as the baseline RAG architecture. This update keeps that original top-K design visible and describes how to evolve it into a more adaptive, evidence-aware architecture.

Recent retrieval work suggests that many teams should invest in better retrieval selection before spending more on raw chunk count.

  • Make hybrid retrieval the baseline: dense retrieval plus BM25 plus metadata filters. Anthropic's contextual retrieval work showed meaningful gains from contextual embeddings plus contextual BM25, with reranking improving results further.
  • Replace fixed top-K rules with adaptive retrieval. Only retrieve when the query or model uncertainty justifies it, and allow top-K or reranking depth to expand on high-risk requests instead of every request (see the sketch after this list).
  • Improve evidence density before adding more tokens. Contextualized chunks, better chunk boundaries, query rewriting, and compression usually outperform blindly appending more raw passages.
  • Reserve expensive rerankers for high-value or low-confidence intents. Cheap intents should not pay the same rerank tax as high-risk recommendation or policy flows.
  • Measure answerability, citation coverage, and retrieval confidence in addition to recall@K. If the system cannot retrieve enough evidence, the right behavior is often to clarify or abstain.
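
A minimal sketch of that adaptive policy, assuming an upstream intent classifier and some retrieval-confidence signal; the thresholds and the high-risk intent set are illustrative, not tuned values.

```python
HIGH_RISK_INTENTS = {"recommendation", "return_request", "faq"}

def retrieval_plan(intent: str, retrieval_confidence: float,
                   needs_external_knowledge: bool) -> dict:
    if not needs_external_knowledge:
        return {"retrieve": False}                        # skip RAG entirely
    plan = {"retrieve": True, "top_k": 3, "reranker": None}
    if intent in HIGH_RISK_INTENTS or retrieval_confidence < 0.5:
        plan.update(top_k=5, reranker="bi-encoder")       # expand only when justified
    if retrieval_confidence < 0.2:
        plan["action"] = "clarify_or_abstain"             # not enough evidence to answer
    return plan
```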

Recent references: Anthropic contextual retrieval, Self-RAG, Corrective Retrieval-Augmented Generation, LongLLMLingua prompt compression.

Reversal Triggers

| Trigger | Action |
|---|---|
| Hallucination rate on any RAG intent exceeds 5% for 7 days | Increase chunk count by 1 or add reranker |
| RAG latency exceeds 250ms p95 for 3 days | Drop reranker or reduce chunk count |
| A new embedding model offers 2x better recall at same latency | Re-index and re-evaluate chunk count |
| Token cost from RAG context exceeds $10K/month | Investigate chunk compression or shorter chunks |
| Recall@3 drops below 0.80 after an index refresh | Investigate chunk quality; may need re-chunking |

Impact on Trilemma

| Dimension | 1 Chunk, No Rerank | 3 Chunks + Bi-Encoder (Decision) | 10 Chunks + Cross-Encoder |
|---|---|---|---|
| Cost | Cheapest | Moderate | Expensive |
| Performance | Fastest (45ms) | Good (115ms) | Slow (495ms) |
| Inference Quality | Poor (12% hallucination) | Good (4% hallucination) | Best (1.2% hallucination) |
| QACPI | Low (quality kills it) | Highest | Medium (latency + cost drag) |