US-05: RAG Retrieval Depth vs Speed vs Cost
User Story
As an ML engineering lead responsible for the RAG pipeline, I want to determine the optimal number of retrieved chunks, whether to rerank, and which embedding model to use, so that the LLM is grounded in relevant data without blowing the latency or token budget.
The Debate
graph TD
subgraph "Inference Team"
I["Retrieve 10 chunks.<br/>Use cross-encoder reranking.<br/>Better grounding = fewer<br/>hallucinations = fewer<br/>escalations = lower support cost."]
end
subgraph "Performance Team"
P["10 chunks + reranking<br/>adds 350ms to every request.<br/>That's our entire RAG budget<br/>plus half the LLM budget<br/>(more tokens = slower prefill)."]
end
subgraph "Cost Team"
C["Each chunk adds ~100 tokens<br/>to the prompt. 10 chunks =<br/>1,000 extra input tokens =<br/>$3/1M × 1B = $3,000/day<br/>just in retrieval context.<br/>Plus OpenSearch query costs."]
end
I ---|"Latency<br/>tension"| P
P ---|"Token cost<br/>tension"| C
C ---|"Quality<br/>tension"| I
style I fill:#ff6b6b,stroke:#333,color:#000
style P fill:#4ecdc4,stroke:#333,color:#000
style C fill:#f9d71c,stroke:#333,color:#000
Acceptance Criteria
- Hallucination rate on RAG-backed intents is below 4%.
- RAG retrieval latency (embed + search + rerank) stays under 250ms p95.
- RAG-contributed tokens in the prompt do not exceed 400 tokens on average.
- Retrieval recall@3 is ≥ 0.85 (the correct chunk is in the top 3 for 85% of queries).
- Monthly RAG infrastructure cost stays under $15,000.
The RAG Pipeline and Its Cost/Latency Points
graph TD
A["User Query"] --> B["Embed Query<br/>⏱️ 15ms | 💰 $0.0001"]
B --> C["KNN Search<br/>(OpenSearch)<br/>⏱️ 30-80ms | 💰 $0.0003"]
C --> D{"Rerank?"}
D -->|"Yes"| E["Reranker Model<br/>⏱️ 50-200ms | 💰 $0.001-0.005"]
D -->|"No"| F["Use KNN order"]
E --> G["Select Top-K<br/>Chunks"]
F --> G
G --> H["Inject into Prompt<br/>⏱️ ~0ms | 💰 $0.003/1K tokens"]
H --> I["LLM Processes<br/>Longer Prompt<br/>⏱️ +10ms/100 tokens"]
style B fill:#2d8659,stroke:#333,color:#fff
style C fill:#fd9644,stroke:#333,color:#000
style E fill:#eb3b5a,stroke:#333,color:#fff
style H fill:#f9d71c,stroke:#333,color:#000
Cumulative Cost and Latency by Chunk Count
| Chunks Retrieved | KNN Latency | Rerank Latency | Prompt Tokens Added | Total RAG Latency | RAG Token Cost/Req | Hallucination Rate |
|---|---|---|---|---|---|---|
| 1 | 30ms | 0ms | ~100 | 45ms | $0.0003 | 12% |
| 3 | 40ms | 60ms (lightweight) | ~300 | 115ms | $0.0009 | 4% |
| 3 | 40ms | 150ms (cross-encoder) | ~300 | 205ms | $0.0059 | 3% |
| 5 | 50ms | 100ms (lightweight) | ~500 | 165ms | $0.0015 | 2.5% |
| 5 | 50ms | 250ms (cross-encoder) | ~500 | 315ms | $0.0065 | 1.8% |
| 10 | 80ms | 200ms (lightweight) | ~1,000 | 295ms | $0.003 | 1.5% |
| 10 | 80ms | 400ms (cross-encoder) | ~1,000 | 495ms | $0.008 | 1.2% |
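The table's totals follow from a simple additive model: a fixed 15ms query embed, plus KNN latency, plus any rerank latency, with each chunk adding ~100 prompt tokens at $3/1M input tokens. A minimal sketch of that model (constants taken from the pipeline diagram and table above):

```python
# Back-of-the-envelope RAG latency/cost model using the constants above.
EMBED_MS = 15                  # query embedding latency
TOKENS_PER_CHUNK = 100         # ~100 prompt tokens per retrieved chunk
TOKEN_PRICE = 3.0 / 1_000_000  # $3 per 1M input tokens

def rag_latency_ms(knn_ms: float, rerank_ms: float = 0.0) -> float:
    """Total retrieval latency: embed + KNN search + (optional) rerank."""
    return EMBED_MS + knn_ms + rerank_ms

def rag_token_cost(chunks: int) -> float:
    """Cost of the extra prompt tokens contributed by retrieved chunks."""
    return chunks * TOKENS_PER_CHUNK * TOKEN_PRICE

# 3 chunks with a lightweight reranker (second row of the table):
print(rag_latency_ms(knn_ms=40, rerank_ms=60))  # 115
print(round(rag_token_cost(3), 4))              # 0.0009
```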
The Diminishing Returns Curve
graph LR
subgraph "Hallucination Rate vs Chunk Count"
direction TB
H1["1 chunk: 12%"] --> H3["3 chunks: 4%<br/>▼ 67% improvement"]
H3 --> H5["5 chunks: 2.5%<br/>▼ 38% improvement"]
H5 --> H10["10 chunks: 1.5%<br/>▼ 40% improvement"]
end
subgraph "Marginal Gain Per Chunk"
direction TB
M1["Chunk 1→3:<br/>-4% hallucination<br/>per chunk added"]
M3["Chunk 3→5:<br/>-0.75% hallucination<br/>per chunk added"]
M5["Chunk 5→10:<br/>-0.2% hallucination<br/>per chunk added"]
end
style H3 fill:#2d8659,stroke:#333,color:#fff
style M1 fill:#2d8659,stroke:#333,color:#fff
The sweet spot: 3 chunks. Going from 1→3 cuts hallucinations by 67%. Going from 3→5 cuts them by only a further 38%, at ~43% more latency and 67% more tokens. Going from 5→10 buys just one more absolute point at roughly double the token cost.
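The marginal-gain numbers in the diagram fall directly out of the hallucination table. A short sketch of that computation:

```python
# Marginal hallucination reduction per added chunk, from the table above.
halluc = {1: 12.0, 3: 4.0, 5: 2.5, 10: 1.5}  # chunks -> hallucination rate (%)

counts = sorted(halluc)
marginal = {
    (a, b): (halluc[a] - halluc[b]) / (b - a)
    for a, b in zip(counts, counts[1:])
}
for (a, b), pts in marginal.items():
    print(f"{a}->{b} chunks: -{pts:.2f} pts hallucination per chunk added")
```

This reproduces the diagram's figures: -4 points per chunk from 1→3, -0.75 from 3→5, and -0.2 from 5→10.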
The Reranking Decision
Option A: No Reranking (KNN order only)
graph LR
A["Query Vector"] --> B["KNN: Top 3<br/>by cosine similarity"]
B --> C["Use as-is"]
style C fill:#2d8659,stroke:#333,color:#fff
- Latency: 40ms
- Cost: $0.0003/request
- Recall@3: 0.72 (the correct chunk is in the top 3 for 72% of queries)
- Problem: Cosine similarity is a rough approximation. Semantically similar but irrelevant chunks pollute the context.
Option B: Lightweight Reranker (Bi-Encoder)
graph LR
A["Query Vector"] --> B["KNN: Top 10<br/>by cosine similarity"]
B --> C["Bi-Encoder Reranker<br/>Score each chunk"]
C --> D["Select Top 3<br/>by reranked score"]
style C fill:#fd9644,stroke:#333,color:#000
- Latency: 100ms total
- Cost: $0.001/request
- Recall@3: 0.85
- Tradeoff: 60ms extra latency, 3x cost, but 18% better recall
Option C: Cross-Encoder Reranker
graph LR
A["Query Vector"] --> B["KNN: Top 10<br/>by cosine similarity"]
B --> C["Cross-Encoder<br/>Joint query-chunk scoring"]
C --> D["Select Top 3<br/>by reranked score"]
style C fill:#eb3b5a,stroke:#333,color:#fff
- Latency: 205ms total
- Cost: $0.005/request
- Recall@3: 0.93
- Tradeoff: 165ms more latency and 16x the cost of Option A, for 8 points better recall than the bi-encoder
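Options B and C share the same control flow: over-retrieve with KNN, re-score each candidate, keep the top 3. A minimal sketch of that flow, with raw cosine similarity standing in for the reranker (in production the scorer would be a dedicated bi- or cross-encoder model, and the vectors would come from your embedding service):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rerank_top_k(query_vec, candidates, k=3):
    """Re-score KNN candidates and keep the top-k.

    `candidates` is a list of (chunk_id, chunk_vec) pairs returned by the
    KNN stage. Cosine over the same vectors is only a placeholder scorer;
    the point here is the over-retrieve -> re-score -> truncate shape.
    """
    scored = [(cosine(query_vec, vec), cid) for cid, vec in candidates]
    scored.sort(reverse=True)
    return [cid for _, cid in scored[:k]]

hits = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [0.9, 0.1])]
print(rerank_top_k([1.0, 0.0], hits, k=2))  # ['a', 'c']
```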
Decision Matrix
graph TD
subgraph "Decision"
D1["FAQ intents → Option B<br/>(Bi-Encoder Reranker)<br/>Recall matters, moderate latency budget"]
D2["Product questions → Option A<br/>(No Reranker)<br/>Structured data, ASIN-based, fast"]
D3["Recommendations → Option B<br/>(Bi-Encoder Reranker)<br/>Quality matters, longer latency budget"]
end
style D1 fill:#fd9644,stroke:#333,color:#000
style D2 fill:#2d8659,stroke:#333,color:#fff
style D3 fill:#fd9644,stroke:#333,color:#000
Why not cross-encoder? The 8-point recall improvement costs 105ms extra (more than doubling the rerank step). For a 250ms RAG budget, the cross-encoder path consumes 82% of it. The bi-encoder at 0.85 recall is sufficient when combined with the LLM's ability to ignore irrelevant chunks.
Chunk Quality Optimization (Alternative to More Chunks)
Instead of retrieving more chunks, improve each chunk's quality:
graph TD
subgraph "Retrieve More Chunks"
M1["Retrieve 10 instead of 3<br/>+7 chunks, +700 tokens<br/>+160ms latency<br/>Hallucination: 1.5%"]
end
subgraph "Improve Chunk Quality"
Q1["Better chunking strategy<br/>+0 extra chunks<br/>+0ms latency<br/>Hallucination: 3%"]
Q2["Hypothetical document<br/>embeddings (HyDE)<br/>+50ms latency<br/>Hallucination: 2.5%"]
Q3["Query expansion<br/>(rephrase + multi-query)<br/>+30ms latency<br/>Hallucination: 2.8%"]
end
M1 --> D["Compare:<br/>Cost per hallucination<br/>point reduced"]
Q1 --> D
Q2 --> D
Q3 --> D
style Q1 fill:#2d8659,stroke:#333,color:#fff
style M1 fill:#eb3b5a,stroke:#333,color:#fff
Cost Per Hallucination Point Reduced
| Approach | Hallucination Reduction | Extra Cost/Req | Extra Latency | Cost Efficiency |
|---|---|---|---|---|
| 3→10 chunks | 2.5 points | $0.0027 | +180ms | $0.0011/point |
| Better chunking | 1.0 point | $0 | +0ms | $0/point |
| HyDE | 1.5 points | $0.001 | +50ms | $0.0007/point |
| Query expansion | 1.2 points | $0.0005 | +30ms | $0.0004/point |
Lesson: Invest in chunk quality before adding more chunks. Better chunking is free. Query expansion is 2.7x more cost-efficient than adding chunks.
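The efficiency column is just extra cost divided by hallucination points reduced; computing it makes the ranking explicit (figures taken from the table above):

```python
# Cost per hallucination point reduced, computed from the table above.
approaches = {
    "3->10 chunks":    {"points": 2.5, "extra_cost": 0.0027},
    "better chunking": {"points": 1.0, "extra_cost": 0.0},
    "HyDE":            {"points": 1.5, "extra_cost": 0.001},
    "query expansion": {"points": 1.2, "extra_cost": 0.0005},
}
efficiency = {k: v["extra_cost"] / v["points"] for k, v in approaches.items()}

# Cheapest-per-point first: better chunking, then query expansion, HyDE,
# and finally adding chunks.
for name, eff in sorted(efficiency.items(), key=lambda kv: kv[1]):
    print(f"{name}: ${eff:.4f}/point")
```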
Intent-Specific RAG Configuration
| Intent | Chunks | Reranker | Chunk Source | Rationale |
|---|---|---|---|---|
| `faq` | 3 | Bi-encoder | FAQ + Policy index | Needs accurate policy retrieval |
| `product_question` | 2 | None | Product description index | ASIN filter makes KNN sufficient |
| `recommendation` | 3 | Bi-encoder | Editorial + Review index | Quality of context matters for framing |
| `return_request` | 2 | None | Policy index only | Narrow scope, policy-specific |
| `checkout_help` | 2 | None | FAQ + Policy index | Narrow scope |
| `product_discovery` | 3 | Bi-encoder | Editorial + Product index | Broad retrieval |
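In code, this per-intent configuration is naturally a lookup table. A sketch, with intent and index names taken from the table above and a defensive fallback for unknown intents (the fallback choice is an assumption, not part of the original design):

```python
# Per-intent RAG configuration mirroring the table above.
RAG_CONFIG = {
    "faq":               {"chunks": 3, "reranker": "bi-encoder", "indexes": ["faq", "policy"]},
    "product_question":  {"chunks": 2, "reranker": None,         "indexes": ["product_description"]},
    "recommendation":    {"chunks": 3, "reranker": "bi-encoder", "indexes": ["editorial", "review"]},
    "return_request":    {"chunks": 2, "reranker": None,         "indexes": ["policy"]},
    "checkout_help":     {"chunks": 2, "reranker": None,         "indexes": ["faq", "policy"]},
    "product_discovery": {"chunks": 3, "reranker": "bi-encoder", "indexes": ["editorial", "product"]},
}

def rag_config(intent: str) -> dict:
    """Look up the RAG settings for an intent; unknown intents get the
    cheapest configuration rather than the most expensive one."""
    return RAG_CONFIG.get(intent, {"chunks": 2, "reranker": None, "indexes": ["faq"]})
```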
Monitoring: RAG Quality Metrics
graph TD
A["RAG Quality Dashboard"] --> B["Retrieval Metrics"]
A --> C["Grounding Metrics"]
A --> D["Cost Metrics"]
B --> B1["Recall@K per intent"]
B --> B2["Mean reciprocal rank"]
B --> B3["Chunk relevance score"]
C --> C1["Hallucination rate"]
C --> C2["Citation accuracy"]
C --> C3["% responses using retrieved data"]
D --> D1["Tokens per RAG context"]
D --> D2["OpenSearch query cost"]
D --> D3["Reranker inference cost"]
style A fill:#54a0ff,stroke:#333,color:#000
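The retrieval metrics on this dashboard are cheap to compute offline against a labeled query set. A minimal sketch of Recall@K and mean reciprocal rank (the query data here is illustrative):

```python
def recall_at_k(results, relevant, k=3):
    """1.0 if any relevant chunk id appears in the top-k results, else 0.0."""
    return 1.0 if any(r in relevant for r in results[:k]) else 0.0

def mrr(results, relevant):
    """Reciprocal rank of the first relevant chunk (0.0 if none retrieved)."""
    for rank, r in enumerate(results, start=1):
        if r in relevant:
            return 1.0 / rank
    return 0.0

# Averaged over a labeled query set, these feed the Recall@K and MRR panels.
queries = [(["c1", "c7", "c2"], {"c2"}), (["c4", "c5", "c6"], {"c9"})]
avg_recall = sum(recall_at_k(res, rel) for res, rel in queries) / len(queries)
print(avg_recall)  # 0.5
```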
2026 Update: Use Adaptive RAG Before Increasing Top-K
Treat everything above this section as the baseline RAG architecture. This update keeps that fixed top-K design in place and describes how the architecture becomes more adaptive and evidence-aware on top of it.
Recent retrieval work suggests that many teams should invest in better retrieval selection before spending more on raw chunk count.
- Make hybrid retrieval the baseline: dense retrieval plus BM25 plus metadata filters. Anthropic's contextual retrieval work showed meaningful gains from contextual embeddings plus contextual BM25, with reranking improving results further.
- Replace fixed top-K rules with adaptive retrieval. Only retrieve when the query or model uncertainty justifies it, and allow top-K or reranking depth to expand on high-risk requests instead of every request.
- Improve evidence density before adding more tokens. Contextualized chunks, better chunk boundaries, query rewriting, and compression usually outperform blindly appending more raw passages.
- Reserve expensive rerankers for high-value or low-confidence intents. Cheap intents should not pay the same rerank tax as high-risk recommendation or policy flows.
- Measure answerability, citation coverage, and retrieval confidence in addition to recall@K. If the system cannot retrieve enough evidence, the right behavior is often to clarify or abstain.
Recent references: Anthropic contextual retrieval, Self-RAG, Corrective Retrieval-Augmented Generation, LongLLMLingua prompt compression.
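The adaptive-retrieval idea above can be sketched as a per-request routing policy. The thresholds and the `retrieval_confidence` signal here are illustrative assumptions (in practice it might be the top KNN score or a calibrated answerability estimate), not measured values:

```python
def retrieval_plan(intent_risk: str, retrieval_confidence: float) -> dict:
    """Choose retrieval depth per request instead of a fixed top-K.

    `intent_risk` is "low" or "high"; `retrieval_confidence` in [0, 1] is a
    stand-in for how likely the default top-K is to contain the needed
    evidence. All thresholds are illustrative.
    """
    if retrieval_confidence < 0.5:
        # Not enough evidence: clarify or abstain rather than guess.
        return {"top_k": 0, "rerank": False, "action": "clarify_or_abstain"}
    if intent_risk == "low" and retrieval_confidence >= 0.9:
        # High-confidence, low-stakes: take the cheap path.
        return {"top_k": 2, "rerank": False}
    if intent_risk == "high":
        # Expand depth and pay the rerank tax only on high-risk requests.
        return {"top_k": 5, "rerank": True}
    return {"top_k": 3, "rerank": True}  # baseline configuration
```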
Reversal Triggers
| Trigger | Action |
|---|---|
| Hallucination rate on any RAG intent exceeds 5% for 7 days | Increase chunk count by 1 or add reranker |
| RAG latency exceeds 250ms p95 for 3 days | Drop reranker or reduce chunk count |
| A new embedding model offers 2x better recall at same latency | Re-index and re-evaluate chunk count |
| Token cost from RAG context exceeds $10K/month | Investigate chunk compression or shorter chunks |
| Recall@3 drops below 0.80 after an index refresh | Investigate chunk quality; may need re-chunking |
Impact on Trilemma
| Dimension | 1 Chunk, No Rerank | 3 Chunks + Bi-Encoder (Decision) | 10 Chunks + Cross-Encoder |
|---|---|---|---|
| Cost | Cheapest | Moderate | Expensive |
| Performance | Fastest (45ms) | Good (115ms) | Slow (495ms) |
| Inference Quality | Poor (12% hallucination) | Good (4% hallucination) | Best (1.2% hallucination) |
| QACPI | Low (quality kills it) | Highest | Medium (latency + cost drag) |