US-09: Token Budget Allocation — Context Window Partitioning
User Story
As an ML engineering lead, I want to define how to partition the LLM context window between system prompt, conversation history, RAG chunks, and current data, So that the LLM has enough context to produce grounded answers without exceeding cost or latency limits from oversized prompts.
The Debate
graph TD
subgraph "Inference Team"
I["Give the LLM everything:<br/>20 turns of history,<br/>5 RAG chunks,<br/>full product data,<br/>active promotions.<br/>More context = better answers."]
end
subgraph "Performance Team"
P["Every 100 tokens adds<br/>~10ms to prefill latency.<br/>A 4K token prompt takes<br/>400ms just to process.<br/>Keep it under 2K tokens."]
end
subgraph "Cost Team"
C["Input tokens cost $3/1M.<br/>At 4K tokens × 700K LLM calls/day,<br/>that's $8,400/day in input alone.<br/>Cut to 2K tokens = $4,200/day.<br/>Save $126K/month."]
end
I ---|"Latency<br/>concern"| P
P ---|"Cost<br/>impact"| C
C ---|"Quality<br/>risk"| I
style I fill:#ff6b6b,stroke:#333,color:#000
style P fill:#4ecdc4,stroke:#333,color:#000
style C fill:#f9d71c,stroke:#333,color:#000
Acceptance Criteria
- Input token count stays under 2,000 tokens for at least 80% of requests.
- Conversation history is never truncated mid-turn (summarize instead).
- System prompt is constant and does not vary per request (cacheable prompt prefix).
- RAG context injection does not exceed the chunk budget defined in US-05.
- Quality score does not drop below 0.85 on any intent due to context limitations.
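A minimal sketch of how the first criterion could be monitored offline, assuming per-request input token counts are already logged; the function name and sample values are illustrative:

```python
import statistics

TOKEN_BUDGET = 2_000       # per-request input token budget
TARGET_PERCENTILE = 80     # share of requests that must stay under budget

def budget_criterion_met(input_token_counts: list[int]) -> bool:
    """True if at least 80% of requests stayed under the 2,000-token budget."""
    # quantiles(n=100) returns the 1st..99th percentiles; index 79 is p80.
    p80 = statistics.quantiles(input_token_counts, n=100)[TARGET_PERCENTILE - 1]
    return p80 <= TOKEN_BUDGET

# Example: 9 of 10 sampled requests are under budget, so the criterion passes.
print(budget_criterion_met(
    [1_450, 1_500, 1_620, 1_700, 1_750, 1_800, 1_850, 1_900, 1_980, 2_400]))
```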
The Context Window Budget
Total Available vs Allocated
For Claude 3.5 Sonnet with a 200K context window, the theoretical limit is enormous. But practical limits are set by cost and latency, not by the model's maximum:
graph TD
subgraph "Theoretical Max: 200K tokens"
THEORY["Claude 3.5 Sonnet<br/>can process 200K tokens.<br/>But at $3/1M input tokens,<br/>200K = $0.60 per request.<br/>At 700K requests/day = $420K/day.<br/>❌ Budget: destroyed."]
end
subgraph "Practical Budget: 2,000 tokens"
PRACTICAL["2,000 input tokens<br/>300 output tokens<br/>Cost: $0.006 + $0.0045 = $0.0105<br/>Prefill: ~200ms<br/>✅ Within budget and latency."]
end
THEORY --> PRACTICAL
style THEORY fill:#eb3b5a,stroke:#333,color:#fff
style PRACTICAL fill:#2d8659,stroke:#333,color:#fff
The 2,000-Token Budget Allocation
pie title "Token Budget Allocation (2,000 tokens)"
"System Prompt" : 350
"Conversation History" : 400
"RAG Context" : 600
"Current Data (product, promo)" : 400
"User Message" : 150
"Formatting / Separators" : 100
| Section | Token Budget | Fixed/Variable | Rationale |
|---|---|---|---|
| System prompt | 350 tokens | Fixed | Rules, persona, format instructions — same every request |
| Conversation history | 400 tokens | Variable | Summarized history + last 2 full turns |
| RAG context | 600 tokens | Variable | 2-3 chunks at ~100-200 tokens each |
| Current data | 400 tokens | Variable | Product details, promotions, user context |
| User message | 150 tokens | Variable | The current user message |
| Formatting / separators | 100 tokens | Fixed | Section headers, JSON structure, delimiters |
| Total | 2,000 tokens | | |
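One way to encode this allocation so the prompt builder and monitoring share a single source of truth; the class and field names are illustrative, not part of the original spec:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TokenBudget:
    """Baseline 2,000-token allocation from the table above."""
    system_prompt: int = 350          # fixed
    conversation_history: int = 400   # variable
    rag_context: int = 600            # variable
    current_data: int = 400           # variable
    user_message: int = 150           # variable
    formatting: int = 100             # fixed

    @property
    def total(self) -> int:
        return (self.system_prompt + self.conversation_history + self.rag_context
                + self.current_data + self.user_message + self.formatting)

BASELINE = TokenBudget()
assert BASELINE.total == 2_000
```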
Section-by-Section Deep Dive
1. System Prompt (350 tokens) — The Non-Negotiable Core
graph TD
subgraph "System Prompt Contents"
S1["Persona definition<br/>(50 tokens)"]
S2["Core rules<br/>(150 tokens)"]
S3["Output format specs<br/>(80 tokens)"]
S4["Safety instructions<br/>(70 tokens)"]
end
S1 --> TOTAL["350 tokens total"]
S2 --> TOTAL
S3 --> TOTAL
S4 --> TOTAL
style TOTAL fill:#54a0ff,stroke:#333,color:#000
Tradeoff: The Inference Team wants 500+ tokens for detailed instructions. The Cost Team wants 200. At 350, the system prompt is concise but complete. Every word earns its place.
Optimization: System prompt is identical for all requests → use Bedrock's prompt caching to avoid re-processing it every time. This saves ~35ms prefill and reduces effective input cost.
2. Conversation History (400 tokens) — The Hardest Allocation
This is where the most contentious tradeoff lives:
graph TD
subgraph "Option A: Full History (Inference Team)"
A["Keep all 10 turns raw<br/>~1,500 tokens<br/>✅ Perfect context<br/>❌ Blows budget by 1,100 tokens"]
end
subgraph "Option B: Last 2 Turns Only (Cost Team)"
B["Only last user + assistant turn<br/>~200 tokens<br/>✅ Very cheap<br/>❌ Loses multi-turn context<br/>❌ 'What about the second one?'<br/>fails"]
end
subgraph "Option C: Summary + Last 2 (Decision)"
C["Compressed summary of turns 1-8<br/>(~150 tokens)<br/>+<br/>Full last 2 turns (~250 tokens)<br/>= 400 tokens<br/>✅ Preserves key context<br/>✅ Within budget"]
end
style A fill:#eb3b5a,stroke:#333,color:#fff
style B fill:#f9d71c,stroke:#333,color:#000
style C fill:#2d8659,stroke:#333,color:#fff
Conversation Summarization Strategy
sequenceDiagram
participant Memory as Conversation Memory
participant Summarizer as Summarizer (Haiku)
participant Prompt as Prompt Builder
Memory->>Prompt: Retrieve all turns for session
alt Turn count ≤ 4
Prompt->>Prompt: Include all turns verbatim (~400 tokens)
else Turn count > 4
Memory->>Summarizer: Summarize turns 1 through N-2
Summarizer-->>Memory: Summary (150 tokens)
Note over Summarizer: Cost: $0.0002 per summary<br/>(Haiku, ~400 input tokens)
Prompt->>Prompt: Summary + last 2 full turns (~400 tokens)
end
Summarization prompt (to Haiku):
Summarize this conversation in 2-3 sentences, preserving: (1) what the user is looking for, (2) what was recommended or discussed, (3) any unresolved questions. Do not include greetings or small talk.
Cost of summarization: ~$0.0002 per summary (Haiku). At 200K multi-turn sessions/day needing summarization = $40/day = $1,200/month. This is negligible compared to the token savings.
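A sketch of the branch in the sequence diagram above; `summarize_with_haiku` is a hypothetical helper standing in for the Haiku call, and the 4-turn cutoff mirrors the diagram:

```python
SUMMARIZE_PROMPT = (
    "Summarize this conversation in 2-3 sentences, preserving: "
    "(1) what the user is looking for, (2) what was recommended or discussed, "
    "(3) any unresolved questions. Do not include greetings or small talk."
)

def build_history_block(turns: list[str], summarize_with_haiku,
                        max_verbatim_turns: int = 4) -> str:
    """Build the conversation-history section of the prompt (~400-token target)."""
    if len(turns) <= max_verbatim_turns:
        # Short sessions: include every turn verbatim.
        return "\n".join(turns)
    # Longer sessions: compress everything except the last two turns (~150 tokens),
    # then append the last two turns in full (~250 tokens).
    summary = summarize_with_haiku(SUMMARIZE_PROMPT, turns[:-2])
    return "\n".join([f"Summary of earlier turns: {summary}", *turns[-2:]])
```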
3. RAG Context (600 tokens) — Quality-Gated
graph TD
A["3 chunks retrieved"] --> B{"Total chunk<br/>tokens?"}
B -->|"< 600"| C["Include all chunks<br/>verbatim"]
B -->|"> 600"| D["Compress: extract<br/>key sentences only"]
B -->|"Chunks irrelevant<br/>(low similarity)"| E["Include 0 chunks<br/>+ flag: ungrounded response"]
style C fill:#2d8659,stroke:#333,color:#fff
style D fill:#fd9644,stroke:#333,color:#000
style E fill:#eb3b5a,stroke:#333,color:#fff
Chunk compression strategy: When total chunk tokens exceed budget, extract the 2-3 most relevant sentences using extractive summarization (keyword overlap with query). This preserves grounding while fitting the budget.
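A minimal sketch of that extractive compression, assuming a naive sentence split and whitespace tokenization; a production version would use the real tokenizer and stopword handling:

```python
import re

def compress_chunk(chunk: str, query: str, max_sentences: int = 3) -> str:
    """Keep the sentences with the highest keyword overlap with the user query."""
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    query_terms = set(query.lower().split())

    def overlap(sentence: str) -> int:
        return len(query_terms & set(sentence.lower().split()))

    # Rank by overlap with the query, then restore original order for readability.
    top = sorted(sorted(sentences, key=overlap, reverse=True)[:max_sentences],
                 key=sentences.index)
    return " ".join(top)
```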
4. Current Data (400 tokens) — Intent-Dependent
| Intent | Data Included | Typical Token Count |
|---|---|---|
| `recommendation` | Recommended ASINs (title, author, price, format) × 5 | 350 tokens |
| `product_question` | Full product detail for current ASIN | 200 tokens |
| `faq` | None (RAG handles it) | 0 tokens |
| `order_tracking` | Order status summary | 100 tokens |
| `promotion` | Active promotions list | 150 tokens |
| `return_request` | Order detail + return policy summary | 250 tokens |
Dynamic allocation: When current data uses fewer tokens (e.g., FAQ uses 0), the surplus is reallocated to RAG context or conversation history.
Dynamic Budget Reallocation
graph TD
A["Total Budget: 2,000 tokens"] --> B["Fixed: System (350) + Format (100) = 450"]
A --> C["Variable Pool: 1,550 tokens"]
C --> D{"Intent Type"}
D -->|"FAQ"| E["History: 500<br/>RAG: 800<br/>Data: 100<br/>Message: 150"]
D -->|"Recommendation"| F["History: 400<br/>RAG: 400<br/>Data: 600<br/>Message: 150"]
D -->|"Product Question"| G["History: 300<br/>RAG: 500<br/>Data: 600<br/>Message: 150"]
D -->|"Order Tracking"| H["History: 500<br/>RAG: 0<br/>Data: 300<br/>Message: 150<br/>(+600 unallocated → shorter prompt)"]
style E fill:#54a0ff,stroke:#333,color:#000
style F fill:#fd9644,stroke:#333,color:#000
style G fill:#ff6b6b,stroke:#333,color:#000
style H fill:#2d8659,stroke:#333,color:#fff
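The same policy expressed as data; the per-intent split follows the diagram above, and the fallback to the recommendation split is an assumption:

```python
FIXED_TOKENS = {"system_prompt": 350, "formatting": 100}   # 450 tokens, never reallocated
VARIABLE_POOL = 2_000 - sum(FIXED_TOKENS.values())         # 1,550 tokens

# Variable-pool split per intent (history, RAG, current data, user message).
INTENT_ALLOCATIONS = {
    "faq":              {"history": 500, "rag": 800, "data": 100, "message": 150},
    "recommendation":   {"history": 400, "rag": 400, "data": 600, "message": 150},
    "product_question": {"history": 300, "rag": 500, "data": 600, "message": 150},
    "order_tracking":   {"history": 500, "rag": 0,   "data": 300, "message": 150},  # 600 unallocated
}

def allocation_for(intent: str) -> dict[str, int]:
    alloc = INTENT_ALLOCATIONS.get(intent, INTENT_ALLOCATIONS["recommendation"])
    assert sum(alloc.values()) <= VARIABLE_POOL   # shorter prompts are allowed, overruns are not
    return alloc
```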
Priority Order When Budget Is Tight
When the variable pool is under pressure (long conversation + many RAG chunks + rich product data), trim in this order:
graph LR
A["1. Trim conversation<br/>history (summarize<br/>more aggressively)"] --> B["2. Reduce RAG<br/>chunks (from 3→2)"]
B --> C["3. Compress product<br/>data (key fields only)"]
C --> D["4. Never trim<br/>user message"]
D --> E["5. Never trim<br/>system prompt"]
style A fill:#f9d71c,stroke:#333,color:#000
style D fill:#eb3b5a,stroke:#333,color:#fff
style E fill:#eb3b5a,stroke:#333,color:#fff
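A sketch of that trim loop; `count_tokens` and `shrink` are hypothetical hooks into the real tokenizer and the summarizer/compressor described earlier:

```python
# Sections in the order they may be trimmed. The user message and system
# prompt are deliberately absent: they are never touched.
TRIM_ORDER = ["conversation_history", "rag_context", "current_data"]

def fit_to_budget(sections: dict[str, str], budget: int,
                  count_tokens, shrink) -> dict[str, str]:
    """Shrink trimmable sections in priority order until the prompt fits the budget."""
    total = sum(count_tokens(text) for text in sections.values())
    for name in TRIM_ORDER:
        if total <= budget:
            break
        before = count_tokens(sections[name])
        sections[name] = shrink(name, sections[name])   # summarize harder, drop a chunk, etc.
        total -= before - count_tokens(sections[name])
    return sections
```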
Cost Impact of Token Budget Decisions
Comparison: What Different Budgets Cost
| Budget | Avg Input Tokens | Daily Input Cost (700K calls) | Monthly Input Cost | Prefill Latency |
|---|---|---|---|---|
| Minimal (1,000) | 1,000 | $2,100 | $63,000 | ~100ms |
| Standard (2,000) | 1,800 (avg) | $3,780 | $113,400 | ~180ms |
| Rich (4,000) | 3,500 | $7,350 | $220,500 | ~350ms |
| Unlimited (8,000) | 6,000 | $12,600 | $378,000 | ~600ms |
Decision: 2,000-token budget. Going from 2K to 4K roughly doubles cost and prefill latency while improving the quality score only from 0.87 to 0.90. The marginal gain doesn't justify an extra $107K/month.
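The figures in the table follow from simple arithmetic; a sketch that reproduces them using the $3/1M input price, 700K daily calls, and the ~10 ms-per-100-token prefill heuristic used earlier:

```python
INPUT_PRICE_PER_TOKEN = 3 / 1_000_000   # $3 per 1M input tokens
CALLS_PER_DAY = 700_000
PREFILL_MS_PER_100_TOKENS = 10          # latency heuristic from the Performance Team

def budget_row(avg_input_tokens: int) -> dict[str, float]:
    daily_cost = avg_input_tokens * INPUT_PRICE_PER_TOKEN * CALLS_PER_DAY
    return {
        "daily_input_cost": daily_cost,         # 1,800 tokens -> $3,780
        "monthly_input_cost": daily_cost * 30,  # 1,800 tokens -> $113,400
        "prefill_ms": avg_input_tokens / 100 * PREFILL_MS_PER_100_TOKENS,
    }

for avg_tokens in (1_000, 1_800, 3_500, 6_000):
    print(avg_tokens, budget_row(avg_tokens))
```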
The Prompt Caching Optimization
The numbers in this example are illustrative only. For current Bedrock-supported models, explicit prompt caching is usually only effective when the reusable prefix clears the model-specific checkpoint minimum, which is often much larger than the system prompt alone.
graph TD
A["Request 1:<br/>System Prompt (350 tokens)<br/>+ Conversation (400)<br/>+ RAG (600)<br/>+ Data (400)<br/>+ Message (150)"]
B["Request 2 (same session):<br/>System Prompt (CACHED — 0 new tokens)<br/>+ Conversation (400)<br/>+ RAG (600)<br/>+ Data (400)<br/>+ Message (150)"]
A --> C["Bedrock caches the<br/>system prompt prefix<br/>across requests"]
C --> B
style C fill:#2d8659,stroke:#333,color:#fff
Savings from prompt caching:
- 350 tokens × $3/1M = $0.00105 saved per request
- 700K requests/day = $735/day = $22K/month
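A sketch of how the stable system-prompt prefix might be marked for caching with the Bedrock Converse API's cache point blocks; the model ID, inference parameters, and prompt contents are placeholders, and whether the cache actually engages depends on the model-specific minimum prefix size noted above:

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

SYSTEM_PROMPT = "..."       # the stable ~350-token persona / rules / format block
assembled_prompt = "..."    # history + RAG context + current data + user message

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",   # illustrative model ID
    system=[
        {"text": SYSTEM_PROMPT},
        {"cachePoint": {"type": "default"}},   # cache everything before this marker
    ],
    messages=[{"role": "user", "content": [{"text": assembled_prompt}]}],
    inferenceConfig={"maxTokens": 300, "temperature": 0.3},
)
print(response["output"]["message"]["content"][0]["text"])
```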
2026 Update: Token Budgets Should Optimize Information Density
Treat everything above this section as the baseline token-budget architecture. This update keeps that original allocation model intact and explains how the current architecture should optimize for information density and cacheable prefixes.
The best current practice is to treat token budgeting as an information-density problem, not just a fixed partitioning problem.
- Allocate tokens by expected usefulness per intent. A static 400/600/400 split is a useful baseline, but the production policy should be driven by ablations on what actually changes answer quality.
- Treat prompt caching as a prefix-design problem. Do not assume the 350-token system prompt alone will be cacheable; on many current providers, explicit prompt caching only activates once a larger stable prefix is checkpointed.
- Use prompt compression and contextual compression when history or evidence gets long. LongLLMLingua and related work show that compression can improve both latency and answer quality when it removes low-signal tokens.
- Promote high-value evidence to the most visible parts of the prompt. Long-context models still suffer from position and salience effects, so packing and ordering matter.
- Track hidden waste explicitly: unused retrieved chunks, repeated history facts, policy text never cited by the model, and prompts that grow without improving outcomes.
Recent references: AWS Bedrock prompt caching, Anthropic contextual retrieval, LongLLMLingua prompt compression.
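As one concrete example of tracking hidden waste from the last bullet above, a minimal sketch that flags retrieved chunks whose content barely shows up in the final answer; the word-overlap heuristic and the 0.2 threshold are arbitrary illustrations:

```python
def unused_chunks(chunks: list[str], answer: str, min_overlap: float = 0.2) -> list[int]:
    """Return indices of retrieved chunks the answer appears not to have used."""
    answer_words = set(answer.lower().split())
    wasted = []
    for i, chunk in enumerate(chunks):
        chunk_words = set(chunk.lower().split())
        overlap = len(chunk_words & answer_words) / max(len(chunk_words), 1)
        if overlap < min_overlap:
            wasted.append(i)   # tokens paid for but likely not reflected in the answer
    return wasted
```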
Reversal Triggers
| Trigger | Action |
|---|---|
| Quality score drops below 0.82 for multi-turn conversations | Increase history budget to 500 tokens; take from RAG if needed |
| Hallucination rate increases after history summarization deployed | Increase summary detail; allocate more tokens to summary |
| New model supports much larger effective context at same cost | Re-evaluate entire budget; may expand to 4K |
| Prompt caching becomes available for larger prefixes | Cache system prompt + conversation summary together |
| Token cost drops significantly | Expand budget proportionally; re-run quality benchmarks |
Impact on Trilemma
| Dimension | 1K Budget | 2K Budget (Decision) | 4K Budget | 8K Budget |
|---|---|---|---|---|
| Cost (monthly) | $63,000 | $113,400 | $220,500 | $378,000 |
| Performance (prefill) | 100ms | 180ms | 350ms | 600ms |
| Quality | 0.78 | 0.87 | 0.90 | 0.91 |
| QACPI | Medium (quality low) | Highest | Medium (cost drags) | Low (cost + latency) |
The 2K budget hits the QACPI sweet spot. Richer context has sharply diminishing returns while cost grows linearly.