US-05: RAG Retrieval Depth vs Speed vs Cost
User Story
As an ML engineering lead responsible for the RAG pipeline, I want to determine the optimal number of retrieved chunks, whether to rerank, and which embedding model to use, so that the LLM is grounded in relevant data without blowing the latency or token budget.
The Debate
graph TD
subgraph "Inference Team"
I["Retrieve 10 chunks.<br/>Use cross-encoder reranking.<br/>Better grounding = fewer<br/>hallucinations = fewer<br/>escalations = lower support cost."]
end
subgraph "Performance Team"
P["10 chunks + reranking<br/>adds 350ms to every request.<br/>That's our entire RAG budget<br/>plus half the LLM budget<br/>(more tokens = slower prefill)."]
end
subgraph "Cost Team"
C["Each chunk adds ~100 tokens<br/>to the prompt. 10 chunks =<br/>1,000 extra input tokens =<br/>$3/1M × 1B = $3,000/day<br/>just in retrieval context.<br/>Plus OpenSearch query costs."]
end
I ---|"Latency<br/>tension"| P
P ---|"Token cost<br/>tension"| C
C ---|"Quality<br/>tension"| I
style I fill:#ff6b6b,stroke:#333,color:#000
style P fill:#4ecdc4,stroke:#333,color:#000
style C fill:#f9d71c,stroke:#333,color:#000
Acceptance Criteria
- Hallucination rate on RAG-backed intents is below 4%.
- RAG retrieval latency (embed + search + rerank) stays under 250ms p95.
- RAG-contributed tokens in the prompt do not exceed 400 tokens on average.
- Retrieval recall@3 is ≥ 0.85 (the correct chunk is in the top 3 for 85% of queries).
- Monthly RAG infrastructure cost stays under $15,000.
The RAG Pipeline and Its Cost/Latency Points
graph TD
A["User Query"] --> B["Embed Query<br/>⏱️ 15ms | 💰 $0.0001"]
B --> C["KNN Search<br/>(OpenSearch)<br/>⏱️ 30-80ms | 💰 $0.0003"]
C --> D{"Rerank?"}
D -->|"Yes"| E["Reranker Model<br/>⏱️ 50-200ms | 💰 $0.001-0.005"]
D -->|"No"| F["Use KNN order"]
E --> G["Select Top-K<br/>Chunks"]
F --> G
G --> H["Inject into Prompt<br/>⏱️ ~0ms | 💰 $0.003/1K tokens"]
H --> I["LLM Processes<br/>Longer Prompt<br/>⏱️ +10ms/100 tokens"]
style B fill:#2d8659,stroke:#333,color:#fff
style C fill:#fd9644,stroke:#333,color:#000
style E fill:#eb3b5a,stroke:#333,color:#fff
style H fill:#f9d71c,stroke:#333,color:#000
Cumulative Cost and Latency by Chunk Count
| Chunks Retrieved | KNN Latency | Rerank Latency | Prompt Tokens Added | Total RAG Latency | RAG Token Cost/Req | Hallucination Rate |
|---|---|---|---|---|---|---|
| 1 | 30ms | 0ms | ~100 | 45ms | $0.0003 | 12% |
| 3 | 40ms | 60ms (lightweight) | ~300 | 115ms | $0.0009 | 4% |
| 3 | 40ms | 150ms (cross-encoder) | ~300 | 205ms | $0.0059 | 3% |
| 5 | 50ms | 100ms (lightweight) | ~500 | 165ms | $0.0015 | 2.5% |
| 5 | 50ms | 250ms (cross-encoder) | ~500 | 315ms | $0.0065 | 1.8% |
| 10 | 80ms | 200ms (lightweight) | ~1,000 | 295ms | $0.003 | 1.5% |
| 10 | 80ms | 400ms (cross-encoder) | ~1,000 | 495ms | $0.008 | 1.2% |
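The table's totals follow from a simple additive model: a fixed 15ms query embed, plus KNN latency, plus any rerank latency, with each chunk adding ~100 prompt tokens at $3/1M input tokens. A minimal sketch of that model (constants taken from the pipeline diagram and table above):

```python
# Back-of-the-envelope RAG latency/cost model using the constants above.
EMBED_MS = 15                  # query embedding latency
TOKENS_PER_CHUNK = 100         # ~100 prompt tokens per retrieved chunk
TOKEN_PRICE = 3.0 / 1_000_000  # $3 per 1M input tokens

def rag_latency_ms(knn_ms: float, rerank_ms: float = 0.0) -> float:
    """Total retrieval latency: embed + KNN search + (optional) rerank."""
    return EMBED_MS + knn_ms + rerank_ms

def rag_token_cost(chunks: int) -> float:
    """Cost of the extra prompt tokens contributed by retrieved chunks."""
    return chunks * TOKENS_PER_CHUNK * TOKEN_PRICE

# 3 chunks with a lightweight reranker (second row of the table):
print(rag_latency_ms(knn_ms=40, rerank_ms=60))  # 115
print(round(rag_token_cost(3), 4))              # 0.0009
```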
The Diminishing Returns Curve
graph LR
subgraph "Hallucination Rate vs Chunk Count"
direction TB
H1["1 chunk: 12%"] --> H3["3 chunks: 4%<br/>▼ 67% improvement"]
H3 --> H5["5 chunks: 2.5%<br/>▼ 38% improvement"]
H5 --> H10["10 chunks: 1.5%<br/>▼ 40% improvement"]
end
subgraph "Marginal Gain Per Chunk"
direction TB
M1["Chunk 1→3:<br/>-4% hallucination<br/>per chunk added"]
M3["Chunk 3→5:<br/>-0.75% hallucination<br/>per chunk added"]
M5["Chunk 5→10:<br/>-0.2% hallucination<br/>per chunk added"]
end
style H3 fill:#2d8659,stroke:#333,color:#fff
style M1 fill:#2d8659,stroke:#333,color:#fff
The sweet spot: 3 chunks. Going from 1→3 cuts hallucinations by 67%. Going from 3→5 cuts them by only a further 38%, at ~43% more latency and 67% more tokens. Going from 5→10 buys just one more absolute point at roughly double the token cost.
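The marginal-gain numbers in the diagram fall directly out of the hallucination table. A short sketch of that computation:

```python
# Marginal hallucination reduction per added chunk, from the table above.
halluc = {1: 12.0, 3: 4.0, 5: 2.5, 10: 1.5}  # chunks -> hallucination rate (%)

counts = sorted(halluc)
marginal = {
    (a, b): (halluc[a] - halluc[b]) / (b - a)
    for a, b in zip(counts, counts[1:])
}
for (a, b), pts in marginal.items():
    print(f"{a}->{b} chunks: -{pts:.2f} pts hallucination per chunk added")
```

This reproduces the diagram's figures: -4 points per chunk from 1→3, -0.75 from 3→5, and -0.2 from 5→10.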
The Reranking Decision
Option A: No Reranking (KNN order only)
graph LR
A["Query Vector"] --> B["KNN: Top 3<br/>by cosine similarity"]
B --> C["Use as-is"]
style C fill:#2d8659,stroke:#333,color:#fff
- Latency: 40ms
- Cost: $0.0003/request
- Recall@3: 0.72 (the correct chunk is in the top 3 for 72% of queries)
- Problem: Cosine similarity is a rough approximation. Semantically similar but irrelevant chunks pollute the context.
Option B: Lightweight Reranker (Bi-Encoder)
graph LR
A["Query Vector"] --> B["KNN: Top 10<br/>by cosine similarity"]
B --> C["Bi-Encoder Reranker<br/>Score each chunk"]
C --> D["Select Top 3<br/>by reranked score"]
style C fill:#fd9644,stroke:#333,color:#000
- Latency: 100ms total
- Cost: $0.001/request
- Recall@3: 0.85
- Tradeoff: 60ms extra latency, 3x cost, but 18% better recall
Option C: Cross-Encoder Reranker
graph LR
A["Query Vector"] --> B["KNN: Top 10<br/>by cosine similarity"]
B --> C["Cross-Encoder<br/>Joint query-chunk scoring"]
C --> D["Select Top 3<br/>by reranked score"]
style C fill:#eb3b5a,stroke:#333,color:#fff
- Latency: 205ms total
- Cost: $0.005/request
- Recall@3: 0.93
- Tradeoff: 165ms more latency and 16x the cost of Option A, for 8 points better recall than the bi-encoder
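Options B and C share the same control flow: over-retrieve with KNN, re-score each candidate, keep the top 3. A minimal sketch of that flow, with raw cosine similarity standing in for the reranker (in production the scorer would be a dedicated bi- or cross-encoder model, and the vectors would come from your embedding service):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rerank_top_k(query_vec, candidates, k=3):
    """Re-score KNN candidates and keep the top-k.

    `candidates` is a list of (chunk_id, chunk_vec) pairs returned by the
    KNN stage. Cosine over the same vectors is only a placeholder scorer;
    the point here is the over-retrieve -> re-score -> truncate shape.
    """
    scored = [(cosine(query_vec, vec), cid) for cid, vec in candidates]
    scored.sort(reverse=True)
    return [cid for _, cid in scored[:k]]

hits = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [0.9, 0.1])]
print(rerank_top_k([1.0, 0.0], hits, k=2))  # ['a', 'c']
```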
Decision Matrix
graph TD
subgraph "Decision"
D1["FAQ intents → Option B<br/>(Bi-Encoder Reranker)<br/>Recall matters, moderate latency budget"]
D2["Product questions → Option A<br/>(No Reranker)<br/>Structured data, ASIN-based, fast"]
D3["Recommendations → Option B<br/>(Bi-Encoder Reranker)<br/>Quality matters, longer latency budget"]
end
style D1 fill:#fd9644,stroke:#333,color:#000
style D2 fill:#2d8659,stroke:#333,color:#fff
style D3 fill:#fd9644,stroke:#333,color:#000
Why not cross-encoder? The 8-point recall improvement costs 105ms extra (more than doubling the rerank step). For a 250ms RAG budget, the cross-encoder path consumes 82% of it. The bi-encoder at 0.85 recall is sufficient when combined with the LLM's ability to ignore irrelevant chunks.
Chunk Quality Optimization (Alternative to More Chunks)
Instead of retrieving more chunks, improve each chunk's quality:
graph TD
subgraph "Retrieve More Chunks"
M1["Retrieve 10 instead of 3<br/>+7 chunks, +700 tokens<br/>+160ms latency<br/>Hallucination: 1.5%"]
end
subgraph "Improve Chunk Quality"
Q1["Better chunking strategy<br/>+0 extra chunks<br/>+0ms latency<br/>Hallucination: 3%"]
Q2["Hypothetical document<br/>embeddings (HyDE)<br/>+50ms latency<br/>Hallucination: 2.5%"]
Q3["Query expansion<br/>(rephrase + multi-query)<br/>+30ms latency<br/>Hallucination: 2.8%"]
end
M1 --> D["Compare:<br/>Cost per hallucination<br/>point reduced"]
Q1 --> D
Q2 --> D
Q3 --> D
style Q1 fill:#2d8659,stroke:#333,color:#fff
style M1 fill:#eb3b5a,stroke:#333,color:#fff
Cost Per Hallucination Point Reduced
| Approach | Hallucination Reduction | Extra Cost/Req | Extra Latency | Cost Efficiency |
|---|---|---|---|---|
| 3→10 chunks | 2.5 points | $0.0027 | +180ms | $0.0011/point |
| Better chunking | 1.0 point | $0 | +0ms | $0/point |
| HyDE | 1.5 points | $0.001 | +50ms | $0.0007/point |
| Query expansion | 1.2 points | $0.0005 | +30ms | $0.0004/point |
Lesson: Invest in chunk quality before adding more chunks. Better chunking is free. Query expansion is 2.7x more cost-efficient than adding chunks.
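The efficiency column is just extra cost divided by hallucination points reduced; computing it makes the ranking explicit (figures taken from the table above):

```python
# Cost per hallucination point reduced, computed from the table above.
approaches = {
    "3->10 chunks":    {"points": 2.5, "extra_cost": 0.0027},
    "better chunking": {"points": 1.0, "extra_cost": 0.0},
    "HyDE":            {"points": 1.5, "extra_cost": 0.001},
    "query expansion": {"points": 1.2, "extra_cost": 0.0005},
}
efficiency = {k: v["extra_cost"] / v["points"] for k, v in approaches.items()}

# Cheapest-per-point first: better chunking, then query expansion, HyDE,
# and finally adding chunks.
for name, eff in sorted(efficiency.items(), key=lambda kv: kv[1]):
    print(f"{name}: ${eff:.4f}/point")
```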
Intent-Specific RAG Configuration
| Intent | Chunks | Reranker | Chunk Source | Rationale |
|---|---|---|---|---|
| `faq` | 3 | Bi-encoder | FAQ + Policy index | Needs accurate policy retrieval |
| `product_question` | 2 | None | Product description index | ASIN filter makes KNN sufficient |
| `recommendation` | 3 | Bi-encoder | Editorial + Review index | Quality of context matters for framing |
| `return_request` | 2 | None | Policy index only | Narrow scope, policy-specific |
| `checkout_help` | 2 | None | FAQ + Policy index | Narrow scope |
| `product_discovery` | 3 | Bi-encoder | Editorial + Product index | Broad retrieval |
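In code, this per-intent configuration is naturally a lookup table. A sketch, with intent and index names taken from the table above and a defensive fallback for unknown intents (the fallback choice is an assumption, not part of the original design):

```python
# Per-intent RAG configuration mirroring the table above.
RAG_CONFIG = {
    "faq":               {"chunks": 3, "reranker": "bi-encoder", "indexes": ["faq", "policy"]},
    "product_question":  {"chunks": 2, "reranker": None,         "indexes": ["product_description"]},
    "recommendation":    {"chunks": 3, "reranker": "bi-encoder", "indexes": ["editorial", "review"]},
    "return_request":    {"chunks": 2, "reranker": None,         "indexes": ["policy"]},
    "checkout_help":     {"chunks": 2, "reranker": None,         "indexes": ["faq", "policy"]},
    "product_discovery": {"chunks": 3, "reranker": "bi-encoder", "indexes": ["editorial", "product"]},
}

def rag_config(intent: str) -> dict:
    """Look up the RAG settings for an intent; unknown intents get the
    cheapest configuration rather than the most expensive one."""
    return RAG_CONFIG.get(intent, {"chunks": 2, "reranker": None, "indexes": ["faq"]})
```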
Monitoring: RAG Quality Metrics
graph TD
A["RAG Quality Dashboard"] --> B["Retrieval Metrics"]
A --> C["Grounding Metrics"]
A --> D["Cost Metrics"]
B --> B1["Recall@K per intent"]
B --> B2["Mean reciprocal rank"]
B --> B3["Chunk relevance score"]
C --> C1["Hallucination rate"]
C --> C2["Citation accuracy"]
C --> C3["% responses using retrieved data"]
D --> D1["Tokens per RAG context"]
D --> D2["OpenSearch query cost"]
D --> D3["Reranker inference cost"]
style A fill:#54a0ff,stroke:#333,color:#000
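The retrieval metrics on this dashboard are cheap to compute offline against a labeled query set. A minimal sketch of Recall@K and mean reciprocal rank (the query data here is illustrative):

```python
def recall_at_k(results, relevant, k=3):
    """1.0 if any relevant chunk id appears in the top-k results, else 0.0."""
    return 1.0 if any(r in relevant for r in results[:k]) else 0.0

def mrr(results, relevant):
    """Reciprocal rank of the first relevant chunk (0.0 if none retrieved)."""
    for rank, r in enumerate(results, start=1):
        if r in relevant:
            return 1.0 / rank
    return 0.0

# Averaged over a labeled query set, these feed the Recall@K and MRR panels.
queries = [(["c1", "c7", "c2"], {"c2"}), (["c4", "c5", "c6"], {"c9"})]
avg_recall = sum(recall_at_k(res, rel) for res, rel in queries) / len(queries)
print(avg_recall)  # 0.5
```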
2026 Update: Use Adaptive RAG Before Increasing Top-K
Treat everything above this section as the baseline RAG architecture. This update keeps that fixed top-K design in place and describes how the architecture becomes more adaptive and evidence-aware on top of it.
Recent retrieval work suggests that many teams should invest in better retrieval selection before spending more on raw chunk count.
- Make hybrid retrieval the baseline: dense retrieval plus BM25 plus metadata filters. Anthropic's contextual retrieval work showed meaningful gains from contextual embeddings plus contextual BM25, with reranking improving results further.
- Replace fixed top-K rules with adaptive retrieval. Only retrieve when the query or model uncertainty justifies it, and allow top-K or reranking depth to expand on high-risk requests instead of every request.
- Improve evidence density before adding more tokens. Contextualized chunks, better chunk boundaries, query rewriting, and compression usually outperform blindly appending more raw passages.
- Reserve expensive rerankers for high-value or low-confidence intents. Cheap intents should not pay the same rerank tax as high-risk recommendation or policy flows.
- Measure answerability, citation coverage, and retrieval confidence in addition to recall@K. If the system cannot retrieve enough evidence, the right behavior is often to clarify or abstain.
Recent references: Anthropic contextual retrieval, Self-RAG, Corrective Retrieval-Augmented Generation, LongLLMLingua prompt compression.
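The adaptive-retrieval idea above can be sketched as a per-request routing policy. The thresholds and the `retrieval_confidence` signal here are illustrative assumptions (in practice it might be the top KNN score or a calibrated answerability estimate), not measured values:

```python
def retrieval_plan(intent_risk: str, retrieval_confidence: float) -> dict:
    """Choose retrieval depth per request instead of a fixed top-K.

    `intent_risk` is "low" or "high"; `retrieval_confidence` in [0, 1] is a
    stand-in for how likely the default top-K is to contain the needed
    evidence. All thresholds are illustrative.
    """
    if retrieval_confidence < 0.5:
        # Not enough evidence: clarify or abstain rather than guess.
        return {"top_k": 0, "rerank": False, "action": "clarify_or_abstain"}
    if intent_risk == "low" and retrieval_confidence >= 0.9:
        # High-confidence, low-stakes: take the cheap path.
        return {"top_k": 2, "rerank": False}
    if intent_risk == "high":
        # Expand depth and pay the rerank tax only on high-risk requests.
        return {"top_k": 5, "rerank": True}
    return {"top_k": 3, "rerank": True}  # baseline configuration
```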
Reversal Triggers
| Trigger | Action |
|---|---|
| Hallucination rate on any RAG intent exceeds 5% for 7 days | Increase chunk count by 1 or add reranker |
| RAG latency exceeds 250ms p95 for 3 days | Drop reranker or reduce chunk count |
| A new embedding model offers 2x better recall at same latency | Re-index and re-evaluate chunk count |
| Token cost from RAG context exceeds $10K/month | Investigate chunk compression or shorter chunks |
| Recall@3 drops below 0.80 after an index refresh | Investigate chunk quality; may need re-chunking |
Impact on Trilemma
| Dimension | 1 Chunk, No Rerank | 3 Chunks + Bi-Encoder (Decision) | 10 Chunks + Cross-Encoder |
|---|---|---|---|
| Cost | Cheapest | Moderate | Expensive |
| Performance | Fastest (45ms) | Good (115ms) | Slow (495ms) |
| Inference Quality | Poor (12% hallucination) | Good (4% hallucination) | Best (1.2% hallucination) |
| QACPI | Low (quality kills it) | Highest | Medium (latency + cost drag) |