US-09: Token Budget Allocation — Context Window Partitioning
User Story
As an ML engineering lead, I want to define how to partition the LLM context window between system prompt, conversation history, RAG chunks, and current data, So that the LLM has enough context to produce grounded answers without exceeding cost or latency limits from oversized prompts.
The Debate
graph TD
subgraph "Inference Team"
I["Give the LLM everything:<br/>20 turns of history,<br/>5 RAG chunks,<br/>full product data,<br/>active promotions.<br/>More context = better answers."]
end
subgraph "Performance Team"
P["Every 100 tokens adds<br/>~10ms to prefill latency.<br/>A 4K token prompt takes<br/>400ms just to process.<br/>Keep it under 2K tokens."]
end
subgraph "Cost Team"
C["Input tokens cost $3/1M.<br/>At 4K tokens × 700K LLM calls/day,<br/>that's $8,400/day in input alone.<br/>Cut to 2K tokens = $4,200/day.<br/>Save $126K/month."]
end
I ---|"Latency<br/>concern"| P
P ---|"Cost<br/>impact"| C
C ---|"Quality<br/>risk"| I
style I fill:#ff6b6b,stroke:#333,color:#000
style P fill:#4ecdc4,stroke:#333,color:#000
style C fill:#f9d71c,stroke:#333,color:#000
Acceptance Criteria
- Input token count stays under 2,000 tokens for at least 80% of requests.
- Conversation history is never truncated mid-turn (summarize instead).
- System prompt is constant and does not vary per request (cacheable prompt prefix).
- RAG context injection does not exceed the chunk budget defined in US-05.
- Quality score does not drop below 0.85 on any intent due to context limitations.
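A minimal sketch of how the first criterion could be monitored offline, assuming per-request input token counts are already logged; the function name and sample values are illustrative:

```python
import statistics

TOKEN_BUDGET = 2_000       # per-request input token budget
TARGET_PERCENTILE = 80     # share of requests that must stay under budget

def budget_criterion_met(input_token_counts: list[int]) -> bool:
    """True if at least 80% of requests stayed under the 2,000-token budget."""
    # quantiles(n=100) returns the 1st..99th percentiles; index 79 is p80.
    p80 = statistics.quantiles(input_token_counts, n=100)[TARGET_PERCENTILE - 1]
    return p80 <= TOKEN_BUDGET

# Example: 9 of 10 sampled requests are under budget, so the criterion passes.
print(budget_criterion_met(
    [1_450, 1_500, 1_620, 1_700, 1_750, 1_800, 1_850, 1_900, 1_980, 2_400]))
```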
The Context Window Budget
Total Available vs Allocated
For Claude 3.5 Sonnet with a 200K context window, the theoretical limit is enormous. But practical limits are set by cost and latency, not by the model's maximum:
graph TD
subgraph "Theoretical Max: 200K tokens"
THEORY["Claude 3.5 Sonnet<br/>can process 200K tokens.<br/>But at $3/1M input tokens,<br/>200K = $0.60 per request.<br/>At 700K requests/day = $420K/day.<br/>❌ Budget: destroyed."]
end
subgraph "Practical Budget: 2,000 tokens"
PRACTICAL["2,000 input tokens<br/>300 output tokens<br/>Cost: $0.006 + $0.0045 = $0.0105<br/>Prefill: ~200ms<br/>✅ Within budget and latency."]
end
THEORY --> PRACTICAL
style THEORY fill:#eb3b5a,stroke:#333,color:#fff
style PRACTICAL fill:#2d8659,stroke:#333,color:#fff
The 2,000-Token Budget Allocation
pie title "Token Budget Allocation (2,000 tokens)"
"System Prompt" : 350
"Conversation History" : 400
"RAG Context" : 600
"Current Data (product, promo)" : 400
"User Message" : 150
"Formatting / Separators" : 100
| Section | Token Budget | Fixed/Variable | Rationale |
|---|---|---|---|
| System prompt | 350 tokens | Fixed | Rules, persona, format instructions — same every request |
| Conversation history | 400 tokens | Variable | Summarized history + last 2 full turns |
| RAG context | 600 tokens | Variable | 2-3 chunks at ~100-200 tokens each |
| Current data | 400 tokens | Variable | Product details, promotions, user context |
| User message | 150 tokens | Variable | The current user message |
| Formatting / separators | 100 tokens | Fixed | Section headers, JSON structure, delimiters |
| Total | 2,000 tokens | | |
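One way to encode this allocation so the prompt builder and monitoring share a single source of truth; the class and field names are illustrative, not part of the original spec:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TokenBudget:
    """Baseline 2,000-token allocation from the table above."""
    system_prompt: int = 350          # fixed
    conversation_history: int = 400   # variable
    rag_context: int = 600            # variable
    current_data: int = 400           # variable
    user_message: int = 150           # variable
    formatting: int = 100             # fixed

    @property
    def total(self) -> int:
        return (self.system_prompt + self.conversation_history + self.rag_context
                + self.current_data + self.user_message + self.formatting)

BASELINE = TokenBudget()
assert BASELINE.total == 2_000
```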
Section-by-Section Deep Dive
1. System Prompt (350 tokens) — The Non-Negotiable Core
graph TD
subgraph "System Prompt Contents"
S1["Persona definition<br/>(50 tokens)"]
S2["Core rules<br/>(150 tokens)"]
S3["Output format specs<br/>(80 tokens)"]
S4["Safety instructions<br/>(70 tokens)"]
end
S1 --> TOTAL["350 tokens total"]
S2 --> TOTAL
S3 --> TOTAL
S4 --> TOTAL
style TOTAL fill:#54a0ff,stroke:#333,color:#000
Tradeoff: The Inference Team wants 500+ tokens for detailed instructions. The Cost Team wants 200. At 350, the system prompt is concise but complete. Every word earns its place.
Optimization: System prompt is identical for all requests → use Bedrock's prompt caching to avoid re-processing it every time. This saves ~35ms prefill and reduces effective input cost.
2. Conversation History (400 tokens) — The Hardest Allocation
This is where the most contentious tradeoff lives:
graph TD
subgraph "Option A: Full History (Inference Team)"
A["Keep all 10 turns raw<br/>~1,500 tokens<br/>✅ Perfect context<br/>❌ Blows budget by 1,100 tokens"]
end
subgraph "Option B: Last 2 Turns Only (Cost Team)"
B["Only last user + assistant turn<br/>~200 tokens<br/>✅ Very cheap<br/>❌ Loses multi-turn context<br/>❌ 'What about the second one?'<br/>fails"]
end
subgraph "Option C: Summary + Last 2 (Decision)"
C["Compressed summary of turns 1-8<br/>(~150 tokens)<br/>+<br/>Full last 2 turns (~250 tokens)<br/>= 400 tokens<br/>✅ Preserves key context<br/>✅ Within budget"]
end
style A fill:#eb3b5a,stroke:#333,color:#fff
style B fill:#f9d71c,stroke:#333,color:#000
style C fill:#2d8659,stroke:#333,color:#fff
Conversation Summarization Strategy
sequenceDiagram
participant Memory as Conversation Memory
participant Summarizer as Summarizer (Haiku)
participant Prompt as Prompt Builder
Memory->>Prompt: Retrieve all turns for session
alt Turn count ≤ 4
Prompt->>Prompt: Include all turns verbatim (~400 tokens)
else Turn count > 4
Memory->>Summarizer: Summarize turns 1 through N-2
Summarizer-->>Memory: Summary (150 tokens)
Note over Summarizer: Cost: $0.0002 per summary<br/>(Haiku, ~400 input tokens)
Prompt->>Prompt: Summary + last 2 full turns (~400 tokens)
end
Summarization prompt (to Haiku):
Summarize this conversation in 2-3 sentences, preserving: (1) what the user is looking for, (2) what was recommended or discussed, (3) any unresolved questions. Do not include greetings or small talk.
Cost of summarization: ~$0.0002 per summary (Haiku). At 200K multi-turn sessions/day needing summarization = $40/day = $1,200/month. This is negligible compared to the token savings.
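A sketch of the branch in the sequence diagram above; `summarize_with_haiku` is a hypothetical helper standing in for the Haiku call, and the 4-turn cutoff mirrors the diagram:

```python
SUMMARIZE_PROMPT = (
    "Summarize this conversation in 2-3 sentences, preserving: "
    "(1) what the user is looking for, (2) what was recommended or discussed, "
    "(3) any unresolved questions. Do not include greetings or small talk."
)

def build_history_block(turns: list[str], summarize_with_haiku,
                        max_verbatim_turns: int = 4) -> str:
    """Build the conversation-history section of the prompt (~400-token target)."""
    if len(turns) <= max_verbatim_turns:
        # Short sessions: include every turn verbatim.
        return "\n".join(turns)
    # Longer sessions: compress everything except the last two turns (~150 tokens),
    # then append the last two turns in full (~250 tokens).
    summary = summarize_with_haiku(SUMMARIZE_PROMPT, turns[:-2])
    return "\n".join([f"Summary of earlier turns: {summary}", *turns[-2:]])
```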
3. RAG Context (600 tokens) — Quality-Gated
graph TD
A["3 chunks retrieved"] --> B{"Total chunk<br/>tokens?"}
B -->|"< 600"| C["Include all chunks<br/>verbatim"]
B -->|"> 600"| D["Compress: extract<br/>key sentences only"]
B -->|"Chunks irrelevant<br/>(low similarity)"| E["Include 0 chunks<br/>+ flag: ungrounded response"]
style C fill:#2d8659,stroke:#333,color:#fff
style D fill:#fd9644,stroke:#333,color:#000
style E fill:#eb3b5a,stroke:#333,color:#fff
Chunk compression strategy: When total chunk tokens exceed budget, extract the 2-3 most relevant sentences using extractive summarization (keyword overlap with query). This preserves grounding while fitting the budget.
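A minimal sketch of that extractive compression, assuming a naive sentence split and whitespace tokenization; a production version would use the real tokenizer and stopword handling:

```python
import re

def compress_chunk(chunk: str, query: str, max_sentences: int = 3) -> str:
    """Keep the sentences with the highest keyword overlap with the user query."""
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    query_terms = set(query.lower().split())

    def overlap(sentence: str) -> int:
        return len(query_terms & set(sentence.lower().split()))

    # Rank by overlap with the query, then restore original order for readability.
    top = sorted(sorted(sentences, key=overlap, reverse=True)[:max_sentences],
                 key=sentences.index)
    return " ".join(top)
```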
4. Current Data (400 tokens) — Intent-Dependent
| Intent | Data Included | Typical Token Count |
|---|---|---|
| `recommendation` | Recommended ASINs (title, author, price, format) × 5 | 350 tokens |
| `product_question` | Full product detail for current ASIN | 200 tokens |
| `faq` | None (RAG handles it) | 0 tokens |
| `order_tracking` | Order status summary | 100 tokens |
| `promotion` | Active promotions list | 150 tokens |
| `return_request` | Order detail + return policy summary | 250 tokens |
Dynamic allocation: When current data uses fewer tokens (e.g., FAQ uses 0), the surplus is reallocated to RAG context or conversation history.
Dynamic Budget Reallocation
graph TD
A["Total Budget: 2,000 tokens"] --> B["Fixed: System (350) + Format (100) = 450"]
A --> C["Variable Pool: 1,550 tokens"]
C --> D{"Intent Type"}
D -->|"FAQ"| E["History: 500<br/>RAG: 800<br/>Data: 100<br/>Message: 150"]
D -->|"Recommendation"| F["History: 400<br/>RAG: 400<br/>Data: 600<br/>Message: 150"]
D -->|"Product Question"| G["History: 300<br/>RAG: 500<br/>Data: 600<br/>Message: 150"]
D -->|"Order Tracking"| H["History: 500<br/>RAG: 0<br/>Data: 300<br/>Message: 150<br/>(+600 unallocated → shorter prompt)"]
style E fill:#54a0ff,stroke:#333,color:#000
style F fill:#fd9644,stroke:#333,color:#000
style G fill:#ff6b6b,stroke:#333,color:#000
style H fill:#2d8659,stroke:#333,color:#fff
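The same policy expressed as data; the per-intent split follows the diagram above, and the fallback to the recommendation split is an assumption:

```python
FIXED_TOKENS = {"system_prompt": 350, "formatting": 100}   # 450 tokens, never reallocated
VARIABLE_POOL = 2_000 - sum(FIXED_TOKENS.values())         # 1,550 tokens

# Variable-pool split per intent (history, RAG, current data, user message).
INTENT_ALLOCATIONS = {
    "faq":              {"history": 500, "rag": 800, "data": 100, "message": 150},
    "recommendation":   {"history": 400, "rag": 400, "data": 600, "message": 150},
    "product_question": {"history": 300, "rag": 500, "data": 600, "message": 150},
    "order_tracking":   {"history": 500, "rag": 0,   "data": 300, "message": 150},  # 600 unallocated
}

def allocation_for(intent: str) -> dict[str, int]:
    alloc = INTENT_ALLOCATIONS.get(intent, INTENT_ALLOCATIONS["recommendation"])
    assert sum(alloc.values()) <= VARIABLE_POOL   # shorter prompts are allowed, overruns are not
    return alloc
```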
Priority Order When Budget Is Tight
When the variable pool is under pressure (long conversation + many RAG chunks + rich product data), trim in this order:
graph LR
A["1. Trim conversation<br/>history (summarize<br/>more aggressively)"] --> B["2. Reduce RAG<br/>chunks (from 3→2)"]
B --> C["3. Compress product<br/>data (key fields only)"]
C --> D["4. Never trim<br/>user message"]
D --> E["5. Never trim<br/>system prompt"]
style A fill:#f9d71c,stroke:#333,color:#000
style D fill:#eb3b5a,stroke:#333,color:#fff
style E fill:#eb3b5a,stroke:#333,color:#fff
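A sketch of that trim loop; `count_tokens` and `shrink` are hypothetical hooks into the real tokenizer and the summarizer/compressor described earlier:

```python
# Sections in the order they may be trimmed. The user message and system
# prompt are deliberately absent: they are never touched.
TRIM_ORDER = ["conversation_history", "rag_context", "current_data"]

def fit_to_budget(sections: dict[str, str], budget: int,
                  count_tokens, shrink) -> dict[str, str]:
    """Shrink trimmable sections in priority order until the prompt fits the budget."""
    total = sum(count_tokens(text) for text in sections.values())
    for name in TRIM_ORDER:
        if total <= budget:
            break
        before = count_tokens(sections[name])
        sections[name] = shrink(name, sections[name])   # summarize harder, drop a chunk, etc.
        total -= before - count_tokens(sections[name])
    return sections
```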
Cost Impact of Token Budget Decisions
Comparison: What Different Budgets Cost
| Budget | Avg Input Tokens | Daily Input Cost (700K calls) | Monthly Input Cost | Prefill Latency |
|---|---|---|---|---|
| Minimal (1,000) | 1,000 | $2,100 | $63,000 | ~100ms |
| Standard (2,000) | 1,800 (avg) | $3,780 | $113,400 | ~180ms |
| Rich (4,000) | 3,500 | $7,350 | $220,500 | ~350ms |
| Unlimited (8,000) | 6,000 | $12,600 | $378,000 | ~600ms |
Decision: 2,000-token budget. Going from 2K to 4K roughly doubles cost and prefill latency while improving the quality score only from 0.87 to 0.90. The marginal gain doesn't justify an extra $107K/month.
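The figures in the table follow from simple arithmetic; a sketch that reproduces them using the $3/1M input price, 700K daily calls, and the ~10 ms-per-100-token prefill heuristic used earlier:

```python
INPUT_PRICE_PER_TOKEN = 3 / 1_000_000   # $3 per 1M input tokens
CALLS_PER_DAY = 700_000
PREFILL_MS_PER_100_TOKENS = 10          # latency heuristic from the Performance Team

def budget_row(avg_input_tokens: int) -> dict[str, float]:
    daily_cost = avg_input_tokens * INPUT_PRICE_PER_TOKEN * CALLS_PER_DAY
    return {
        "daily_input_cost": daily_cost,         # 1,800 tokens -> $3,780
        "monthly_input_cost": daily_cost * 30,  # 1,800 tokens -> $113,400
        "prefill_ms": avg_input_tokens / 100 * PREFILL_MS_PER_100_TOKENS,
    }

for avg_tokens in (1_000, 1_800, 3_500, 6_000):
    print(avg_tokens, budget_row(avg_tokens))
```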
The Prompt Caching Optimization
The numbers in this example are illustrative only. For current Bedrock-supported models, explicit prompt caching is usually only effective when the reusable prefix clears the model-specific checkpoint minimum, which is often much larger than the system prompt alone.
graph TD
A["Request 1:<br/>System Prompt (350 tokens)<br/>+ Conversation (400)<br/>+ RAG (600)<br/>+ Data (400)<br/>+ Message (150)"]
B["Request 2 (same session):<br/>System Prompt (CACHED — 0 new tokens)<br/>+ Conversation (400)<br/>+ RAG (600)<br/>+ Data (400)<br/>+ Message (150)"]
A --> C["Bedrock caches the<br/>system prompt prefix<br/>across requests"]
C --> B
style C fill:#2d8659,stroke:#333,color:#fff
Savings from prompt caching:
- 350 tokens × $3/1M = $0.00105 saved per request
- 700K requests/day = $735/day = $22K/month
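A sketch of how the stable system-prompt prefix might be marked for caching with the Bedrock Converse API's cache point blocks; the model ID, inference parameters, and prompt contents are placeholders, and whether the cache actually engages depends on the model-specific minimum prefix size noted above:

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

SYSTEM_PROMPT = "..."       # the stable ~350-token persona / rules / format block
assembled_prompt = "..."    # history + RAG context + current data + user message

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",   # illustrative model ID
    system=[
        {"text": SYSTEM_PROMPT},
        {"cachePoint": {"type": "default"}},   # cache everything before this marker
    ],
    messages=[{"role": "user", "content": [{"text": assembled_prompt}]}],
    inferenceConfig={"maxTokens": 300, "temperature": 0.3},
)
print(response["output"]["message"]["content"][0]["text"])
```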
2026 Update: Token Budgets Should Optimize Information Density
Treat everything above this section as the baseline token-budget architecture. This update keeps that original allocation model intact and explains how the current architecture should optimize for information density and cacheable prefixes.
The best current practice is to treat token budgeting as an information-density problem, not just a fixed partitioning problem.
- Allocate tokens by expected usefulness per intent. A static 400/600/400 split is a useful baseline, but the production policy should be driven by ablations on what actually changes answer quality.
- Treat prompt caching as a prefix-design problem. Do not assume the 350-token system prompt alone will be cacheable; on many current providers, explicit prompt caching only activates once a larger stable prefix is checkpointed.
- Use prompt compression and contextual compression when history or evidence gets long. LongLLMLingua and related work show that compression can improve both latency and answer quality when it removes low-signal tokens.
- Promote high-value evidence to the most visible parts of the prompt. Long-context models still suffer from position and salience effects, so packing and ordering matter.
- Track hidden waste explicitly: unused retrieved chunks, repeated history facts, policy text never cited by the model, and prompts that grow without improving outcomes.
Recent references: AWS Bedrock prompt caching, Anthropic contextual retrieval, LongLLMLingua prompt compression.
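As one concrete example of tracking hidden waste from the last bullet above, a minimal sketch that flags retrieved chunks whose content barely shows up in the final answer; the word-overlap heuristic and the 0.2 threshold are arbitrary illustrations:

```python
def unused_chunks(chunks: list[str], answer: str, min_overlap: float = 0.2) -> list[int]:
    """Return indices of retrieved chunks the answer appears not to have used."""
    answer_words = set(answer.lower().split())
    wasted = []
    for i, chunk in enumerate(chunks):
        chunk_words = set(chunk.lower().split())
        overlap = len(chunk_words & answer_words) / max(len(chunk_words), 1)
        if overlap < min_overlap:
            wasted.append(i)   # tokens paid for but likely not reflected in the answer
    return wasted
```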
Reversal Triggers
| Trigger | Action |
|---|---|
| Quality score drops below 0.82 for multi-turn conversations | Increase history budget to 500 tokens; take from RAG if needed |
| Hallucination rate increases after history summarization deployed | Increase summary detail; allocate more tokens to summary |
| New model supports much larger effective context at same cost | Re-evaluate entire budget; may expand to 4K |
| Prompt caching becomes available for larger prefixes | Cache system prompt + conversation summary together |
| Token cost drops significantly | Expand budget proportionally; re-run quality benchmarks |
Impact on Trilemma
| Dimension | 1K Budget | 2K Budget (Decision) | 4K Budget | 8K Budget |
|---|---|---|---|---|
| Cost (monthly) | $63,000 | $113,400 | $220,500 | $378,000 |
| Performance (prefill) | 100ms | 180ms | 350ms | 600ms |
| Quality | 0.78 | 0.87 | 0.90 | 0.91 |
| QACPI | Medium (quality low) | Highest | Medium (cost drags) | Low (cost + latency) |
The 2K budget hits the QACPI sweet spot. Richer context has sharply diminishing returns while cost grows linearly.