US-03: Latency Budget Allocation Across the Pipeline
User Story
As a performance engineering lead,
I want to define explicit latency budgets for every stage of the request pipeline,
So that the end-to-end p95 latency stays under 2 seconds while each team knows exactly how much time they can spend.
The Debate
graph TD
subgraph "Performance Team"
P["We have a 2-second budget.<br/>Every millisecond matters.<br/>Cut RAG to 1 chunk.<br/>Skip reranking."]
end
subgraph "Inference Team"
I["If you cut RAG depth,<br/>hallucination rate triples.<br/>We NEED reranking.<br/>Give us 500ms for retrieval."]
end
subgraph "Cost Team"
C["Provisioned throughput for<br/>low latency costs $50K/month.<br/>Can we shift budget from<br/>compute to latency tolerance?"]
end
P ---|"RAG budget<br/>fight"| I
I ---|"Provisioned<br/>throughput fight"| C
C ---|"SLA<br/>fight"| P
style P fill:#4ecdc4,stroke:#333,color:#000
style I fill:#ff6b6b,stroke:#333,color:#000
style C fill:#f9d71c,stroke:#333,color:#000
Acceptance Criteria
- Every pipeline stage has a defined latency budget with p95 tracking.
- End-to-end p95 latency is under 2 seconds for LLM-backed responses.
- End-to-end p95 latency is under 200ms for template responses.
- First-token latency (streaming) is under 500ms at p95.
- No single stage is allowed to consume more than 50% of the total budget.
The Pipeline and Its Latency Budget
Baseline: Where Time Goes Today (Unoptimized)
gantt
title Request Pipeline Latency Breakdown (Unoptimized) — p95
dateFormat X
axisFormat %Lms
section Edge
TLS + Auth + Rate Limit :a1, 0, 40
section Orchestration
Load Session + Memory :a2, 40, 90
Intent Classification :a3, 90, 140
section Data Retrieval
RAG (Embed + KNN + Rerank) :a4, 140, 490
Service Calls (Catalog, Reco) :a5, 140, 340
section LLM
Prompt Assembly :a6, 490, 520
Bedrock Prefill :a7, 520, 920
First Token Decode :a8, 920, 970
Stream Remaining Tokens :a9, 970, 1800
section Safety
Guardrails :a10, 1800, 1900
section Delivery
Format + WebSocket Push :a11, 1900, 1930
Unoptimized total: ~1,930ms p95 — barely under 2 seconds, with zero headroom.
The Budget Allocation Problem
You have 2,000ms total. Here's what each team wants:
| Stage | Performance Team Wants | Inference Team Wants | Cost Team Wants |
|---|---|---|---|
| Edge (TLS, auth, rate limit) | 30ms | 30ms | 30ms |
| Session + Memory load | 40ms | 40ms | 40ms |
| Intent Classification | 30ms (rule-based only) | 80ms (ML classifier) | 30ms (rule-based) |
| RAG Retrieval | 100ms (1 chunk, no rerank) | 400ms (5 chunks + rerank) | 150ms (3 chunks, no rerank) |
| Service Calls | 150ms | 200ms | 150ms |
| Prompt Assembly | 20ms | 30ms | 20ms |
| LLM Generation | 600ms (Haiku) | 1,200ms (Sonnet) | 600ms (Haiku) |
| Guardrails | 50ms (basic checks) | 150ms (full pipeline) | 50ms (basic) |
| Format + Deliver | 30ms | 30ms | 30ms |
| Total | 1,050ms | 2,160ms | 1,100ms |
The Inference Team exceeds the budget by 160ms. Something has to give.
The Negotiated Budget
graph LR
subgraph "Agreed Budget — 2,000ms total"
A["Edge<br/>40ms"] --> B["Session<br/>50ms"]
B --> C["Intent<br/>50ms"]
C --> D["RAG<br/>250ms"]
C --> E["Services<br/>200ms"]
D --> F["Prompt<br/>30ms"]
E --> F
F --> G["LLM Gen<br/>900ms"]
G --> H["Guardrails<br/>100ms"]
H --> I["Deliver<br/>30ms"]
end
style A fill:#2d8659,stroke:#333,color:#fff
style D fill:#ff6b6b,stroke:#333,color:#fff
style G fill:#eb3b5a,stroke:#333,color:#fff
style H fill:#fd9644,stroke:#333,color:#000
Final Budget Table
| Stage | Budget (p95) | Headroom | Owner | Tradeoff Made |
|---|---|---|---|---|
| Edge (TLS + Auth + Rate Limit) | 40ms | 10ms | Platform | None — already fast |
| Session + Memory Load | 50ms | 10ms | Platform | DynamoDB DAX for hot sessions |
| Intent Classification | 50ms | 20ms | ML | Rule-based first, ML only on miss |
| RAG Retrieval | 250ms | 50ms | ML | 3 chunks + lightweight rerank (not full cross-encoder) |
| Service Calls (parallel) | 200ms | 50ms | Backend | Parallel fan-out, cache-first |
| Prompt Assembly | 30ms | 10ms | ML | Pre-compiled templates, string concatenation |
| LLM Generation (to first token) | 400ms | 100ms | ML/Platform | Provisioned throughput for Sonnet |
| LLM Streaming (remaining) | 500ms | 100ms | ML/Platform | Token-level streaming to client |
| Guardrails | 100ms | 20ms | ML | Async heavy checks; sync only for PII + price |
| Format + WebSocket Push | 30ms | 10ms | Platform | Pre-formatted templates |
| Total | 1,650ms | 350ms buffer (vs the 2,000ms ceiling) | | |
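A minimal sketch of how the negotiated budgets could be carried as configuration the pipeline enforces at startup. The values mirror the table above; the class and field names are illustrative, and the validation reflects the acceptance criteria (total allocation within 2,000ms, no single stage above 50%).

```python
# Sketch: the negotiated budgets as data the pipeline can validate and enforce.
# Stage names and values mirror the final budget table; names are illustrative.
from dataclasses import dataclass

TOTAL_BUDGET_MS = 2_000

@dataclass(frozen=True)
class StageBudget:
    name: str
    budget_ms: int     # p95 budget for the stage
    headroom_ms: int   # soft-alert zone above the budget
    owner: str

STAGE_BUDGETS = [
    StageBudget("edge", 40, 10, "Platform"),
    StageBudget("session_load", 50, 10, "Platform"),
    StageBudget("intent_classification", 50, 20, "ML"),
    StageBudget("rag_retrieval", 250, 50, "ML"),
    StageBudget("service_calls", 200, 50, "Backend"),
    StageBudget("prompt_assembly", 30, 10, "ML"),
    StageBudget("llm_first_token", 400, 100, "ML/Platform"),
    StageBudget("llm_streaming", 500, 100, "ML/Platform"),
    StageBudget("guardrails_sync", 100, 20, "ML"),
    StageBudget("format_and_push", 30, 10, "Platform"),
]

def validate_budgets() -> None:
    allocated = sum(s.budget_ms for s in STAGE_BUDGETS)
    assert allocated <= TOTAL_BUDGET_MS, f"over-allocated: {allocated}ms"
    # Acceptance criterion: no single stage may consume more than 50% of the total.
    worst = max(STAGE_BUDGETS, key=lambda s: s.budget_ms)
    assert worst.budget_ms <= TOTAL_BUDGET_MS * 0.5, f"{worst.name} exceeds 50%"
    print(f"allocated {allocated}ms, buffer {TOTAL_BUDGET_MS - allocated}ms")

if __name__ == "__main__":
    validate_budgets()
```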
The 350ms Buffer: Why It Matters
graph TD
A["350ms Buffer"] --> B{"What consumes it?"}
B --> C["Network jitter<br/>(10-50ms)"]
B --> D["Cold start penalty<br/>(occasional 100-200ms)"]
B --> E["GC pauses<br/>(10-30ms)"]
B --> F["Retry on transient failure<br/>(100-200ms)"]
style A fill:#2d8659,stroke:#333,color:#fff
Without the buffer, any jitter pushes past 2 seconds. The buffer is not waste — it's reliability insurance.
Stage-by-Stage Tradeoff Deep Dives
RAG Retrieval: The Biggest Negotiation
graph TD
subgraph "Option A: 1 chunk, no rerank (100ms)"
A1["✅ Fast<br/>✅ Cheap<br/>❌ Hallucination rate: 12%<br/>❌ Incomplete answers"]
end
subgraph "Option B: 3 chunks + lightweight rerank (250ms)"
B1["✅ Good balance<br/>⚠️ Moderate cost<br/>✅ Hallucination rate: 4%<br/>✅ Usually complete"]
end
subgraph "Option C: 5 chunks + cross-encoder rerank (400ms)"
C1["❌ Slow<br/>❌ Expensive<br/>✅ Hallucination rate: 2%<br/>✅ Almost always complete"]
end
A1 --> D["Decision: Option B"]
B1 --> D
C1 --> D
style B1 fill:#2d8659,stroke:#333,color:#fff
style D fill:#54a0ff,stroke:#333,color:#000
Why Option B wins: Going from 1→3 chunks cuts the hallucination rate from 12% to 4%, an 8-point drop. Going from 3→5 chunks only cuts it from 4% to 2%. That marginal quality gain doesn't justify 150ms of extra latency.
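A minimal sketch of one way the "lightweight rerank" could work, assuming the KNN query returns a wider candidate set with vector scores attached. The fusion weights and field names are illustrative assumptions; the point is that no per-pair model forward pass is needed, which is what keeps the step inside the 250ms stage budget.

```python
# Sketch of Option B's lightweight rerank: fuse the existing vector score with a
# cheap lexical-overlap score over ~10 KNN candidates and keep the top 3.
# The 0.7/0.3 weighting and the candidate dict shape are assumptions.
def lightweight_rerank(query: str, candidates: list[dict], keep: int = 3) -> list[dict]:
    q_terms = set(query.lower().split())

    def fused_score(c: dict) -> float:
        terms = set(c["text"].lower().split())
        overlap = len(q_terms & terms) / max(len(q_terms), 1)
        # Microseconds per chunk, vs a cross-encoder forward pass per pair.
        return 0.7 * c["vector_score"] + 0.3 * overlap

    return sorted(candidates, key=fused_score, reverse=True)[:keep]

# Example:
# top3 = lightweight_rerank("does this blender ship to Hokkaido?", knn_results)
```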
LLM Generation: Provisioned vs On-Demand
| Mode | First Token p95 | Monthly Cost | When |
|---|---|---|---|
| On-demand Bedrock | 500-800ms | Pay per token | Off-peak hours, cost zone |
| Provisioned throughput | 200-400ms | $25K/month base | Peak hours, performance zone |
| Cached prompt prefix | 150-300ms | Standard token cost + cache fee | Repeated system prompts |
Decision: Provisioned during peak (8AM-11PM JST), on-demand during off-peak. This saves ~40% of provisioned cost while maintaining latency during user-facing hours.
graph LR
subgraph "Peak (8AM-11PM JST)"
PEAK["Provisioned Throughput<br/>400ms first token<br/>$25K/month"]
end
subgraph "Off-Peak (11PM-8AM JST)"
OFFPEAK["On-Demand<br/>700ms first token<br/>Pay per token only"]
end
PEAK --> SAVE["Saves ~$10K/month<br/>vs 24/7 provisioned"]
OFFPEAK --> SAVE
style PEAK fill:#eb3b5a,stroke:#333,color:#fff
style OFFPEAK fill:#2d8659,stroke:#333,color:#fff
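A sketch of how the peak/off-peak routing decision above could look in code, using the Bedrock Converse API via boto3. The provisioned-model ARN is a placeholder, the on-demand model ID is an example, and the 8AM-11PM JST window mirrors the decision; retries and fallback handling are omitted.

```python
# Sketch: route to provisioned throughput during JST peak hours, on-demand off-peak.
from datetime import datetime, timezone, timedelta

import boto3

JST = timezone(timedelta(hours=9))
PROVISIONED_ARN = "arn:aws:bedrock:ap-northeast-1:123456789012:provisioned-model/EXAMPLE"  # placeholder
ON_DEMAND_MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # example on-demand ID

def select_model_id(now: datetime | None = None) -> str:
    """Provisioned throughput 8AM-11PM JST, on-demand otherwise."""
    hour = (now or datetime.now(JST)).astimezone(JST).hour
    return PROVISIONED_ARN if 8 <= hour < 23 else ON_DEMAND_MODEL_ID

def generate(prompt: str) -> str:
    client = boto3.client("bedrock-runtime", region_name="ap-northeast-1")
    response = client.converse(
        modelId=select_model_id(),  # Converse accepts a provisioned model ARN here
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 512},
    )
    return response["output"]["message"]["content"][0]["text"]
```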
Guardrails: Sync vs Async Split
graph TD
A["LLM Response"] --> B{"Critical Checks<br/>(Sync — 100ms budget)"}
B -->|pass| C["Deliver to User"]
B -->|fail| D["Block + Fallback"]
C --> E{"Heavy Checks<br/>(Async — no budget)"}
E -->|pass| F["Log clean"]
E -->|fail| G["Flag for review<br/>+ optionally retract"]
subgraph "Sync Checks (must be fast)"
S1["PII detection — 20ms"]
S2["Price validation — 30ms"]
S3["ASIN existence — 30ms"]
S4["Toxicity quick scan — 20ms"]
end
subgraph "Async Checks (run after delivery)"
A1["Full hallucination scoring — 300ms"]
A2["Competitor mention scan — 50ms"]
A3["Scope drift analysis — 200ms"]
A4["Quality scoring — 400ms"]
end
style B fill:#eb3b5a,stroke:#333,color:#fff
style E fill:#f9d71c,stroke:#333,color:#000
The tradeoff: Moving heavy checks to async means a small window where a hallucinated response could be visible before retraction. The Inference Team accepted this because:
1. Sync checks catch the most dangerous failures (PII, wrong prices).
2. Async retraction within 2 seconds covers remaining risks.
3. The alternative (all sync) adds 500ms+ to every response.
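A minimal sketch of the sync/async split, assuming an asyncio-based orchestrator. The check functions are placeholders for the real PII, price, ASIN, and toxicity services; the shape that matters is the 100ms `wait_for` around the blocking checks and the fire-and-forget task for the heavy ones.

```python
# Sketch: block delivery only on the fast checks, run heavy checks after delivery.
import asyncio

SYNC_BUDGET_S = 0.100  # 100ms budget for the blocking checks

# Placeholders: real implementations call the PII / price / ASIN / toxicity services.
async def check_pii(text: str) -> bool: return True
async def check_prices(text: str) -> bool: return True
async def check_asin_exists(text: str) -> bool: return True
async def quick_toxicity_scan(text: str) -> bool: return True
async def heavy_checks(text: str) -> bool: return True  # hallucination, scope drift, quality

async def sync_checks(response: str) -> bool:
    results = await asyncio.gather(
        check_pii(response), check_prices(response),
        check_asin_exists(response), quick_toxicity_scan(response),
    )
    return all(results)

async def handle_llm_response(response: str, deliver, retract) -> None:
    try:
        ok = await asyncio.wait_for(sync_checks(response), timeout=SYNC_BUDGET_S)
    except asyncio.TimeoutError:
        ok = False  # budget blown: fail closed, use the fallback path
    if not ok:
        await deliver("Sorry, let me connect you with a human agent.")
        return
    await deliver(response)  # user sees the answer immediately

    async def heavy_then_retract():
        if not await heavy_checks(response):
            await retract(response)  # flag for review / retract within ~2 seconds
    asyncio.create_task(heavy_then_retract())
```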
Intent-Specific Budgets
Not all intents get the same budget:
graph TD
subgraph "Template Path — 200ms budget"
T["Greeting → 30ms<br/>Order Status → 150ms<br/>Promo List → 120ms"]
end
subgraph "Haiku Path — 1,200ms budget"
H["Simple FAQ → 800ms<br/>Product Lookup → 900ms<br/>Simple Reco → 1,100ms"]
end
subgraph "Sonnet Path — 2,000ms budget"
S["Complex Reco → 1,800ms<br/>Comparison → 1,900ms<br/>Multi-turn → 1,800ms"]
end
style T fill:#2d8659,stroke:#333,color:#fff
style H fill:#fd9644,stroke:#333,color:#000
style S fill:#eb3b5a,stroke:#333,color:#fff
| Intent | Latency Budget | Model Tier | RAG Chunks | Guardrail Mode |
|---|---|---|---|---|
| chitchat | 200ms | Template | 0 | Sync only |
| order_tracking | 200ms | Template | 0 | Sync only |
| promotion | 200ms | Template | 0 | Sync only |
| faq | 1,200ms | Haiku | 3 | Sync + async |
| product_question | 1,200ms | Haiku | 2 | Sync + async |
| recommendation | 2,000ms | Sonnet | 3 + rerank | Full pipeline |
| product_discovery | 1,500ms | Haiku/Sonnet | 3 | Sync + async |
| checkout_help | 1,200ms | Haiku | 2 | Sync + async |
| return_request | 1,500ms | Haiku | 2 | Full pipeline |
| escalation | 500ms | Template + handoff | 0 | Sync only |
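The same table can be expressed as a routing map the orchestrator consults per request. A minimal sketch; intent names mirror the table, and the field names are illustrative.

```python
# Sketch: intent-specific budgets as a routing table. Unknown intents fall back
# to the most conservative (Sonnet) configuration.
INTENT_ROUTES = {
    "chitchat":          {"budget_ms": 200,  "tier": "template",        "rag_chunks": 0, "guardrails": "sync"},
    "order_tracking":    {"budget_ms": 200,  "tier": "template",        "rag_chunks": 0, "guardrails": "sync"},
    "promotion":         {"budget_ms": 200,  "tier": "template",        "rag_chunks": 0, "guardrails": "sync"},
    "faq":               {"budget_ms": 1200, "tier": "haiku",           "rag_chunks": 3, "guardrails": "sync+async"},
    "product_question":  {"budget_ms": 1200, "tier": "haiku",           "rag_chunks": 2, "guardrails": "sync+async"},
    "recommendation":    {"budget_ms": 2000, "tier": "sonnet",          "rag_chunks": 3, "guardrails": "full", "rerank": True},
    "product_discovery": {"budget_ms": 1500, "tier": "haiku_or_sonnet", "rag_chunks": 3, "guardrails": "sync+async"},
    "checkout_help":     {"budget_ms": 1200, "tier": "haiku",           "rag_chunks": 2, "guardrails": "sync+async"},
    "return_request":    {"budget_ms": 1500, "tier": "haiku",           "rag_chunks": 2, "guardrails": "full"},
    "escalation":        {"budget_ms": 500,  "tier": "template_handoff","rag_chunks": 0, "guardrails": "sync"},
}

def route(intent: str) -> dict:
    return INTENT_ROUTES.get(intent, INTENT_ROUTES["recommendation"])
```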
Monitoring and Enforcement
sequenceDiagram
participant Stage as Pipeline Stage
participant Timer as Latency Timer
participant Alert as Alert System
participant Dash as Dashboard
Stage->>Timer: Start stage timer
Note over Timer: Stage running...
Stage->>Timer: End stage timer
alt Within budget
Timer->>Dash: Log (green)
else Within budget + headroom
Timer->>Dash: Log (yellow)
Timer->>Alert: Soft alert (Slack)
else Exceeds budget
Timer->>Dash: Log (red)
Timer->>Alert: Hard alert (PagerDuty)
Note over Alert: If p95 exceeds budget<br/>for 15 min → page on-call
end
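A sketch of the stage-timer behaviour from the diagram. Per-request results are classified green/yellow/red against the budget and headroom; the metric and alert sinks are placeholder prints, and the 15-minute p95 rule would be an alarm on the emitted metric (e.g. in CloudWatch) rather than per-request paging.

```python
# Sketch: time one pipeline stage and classify the result against its budget.
import time
from contextlib import contextmanager

@contextmanager
def stage_timer(name: str, budget_ms: float, headroom_ms: float):
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = (time.perf_counter() - start) * 1000
        if elapsed <= budget_ms:
            emit(name, elapsed, "green")                      # dashboard only
        elif elapsed <= budget_ms + headroom_ms:
            emit(name, elapsed, "yellow"); soft_alert(name)   # Slack
        else:
            emit(name, elapsed, "red"); hard_alert(name)      # feeds the p95-over-15-min paging rule

def emit(name, ms, colour): print(f"[{colour}] {name}: {ms:.0f}ms")
def soft_alert(name): print(f"soft alert -> Slack: {name}")
def hard_alert(name): print(f"hard alert recorded for {name}")

# Usage:
# with stage_timer("rag_retrieval", budget_ms=250, headroom_ms=50):
#     chunks = retrieve_and_rerank(query)
```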
2026 Update: Budget TTFT and ITL Separately
Treat everything above this section as the baseline latency-budget architecture. This update keeps that original budget model visible and explains how the current architecture should split and monitor latency.
A modern latency budget should split "fast enough" into distinct user-visible components rather than treating end-to-end latency as one bucket.
- Track queue time, TTFT, and inter-token latency separately (a measurement sketch follows this list). User perception is dominated by the time to first useful token, not just the final completion time.
- Use prompt caching for stable prefixes when the reusable prefix is large enough to meet provider checkpoint minimums. Bedrock now supports multiple cache checkpoints and both 5-minute and 1-hour TTL options on supported models.
- For self-hosted serving, treat prefill and decode as separate bottlenecks. In vLLM V1, chunked prefill is a default optimization, and disaggregated prefill is useful when TTFT and tail ITL need different tuning.
- Scale and alert on waiting requests and queue time, not CPU alone. Generative traffic often fails the SLO because requests are waiting to start, not because a node looks saturated in generic infrastructure metrics.
- Before paying for blanket over-provisioning, try latency-optimized inference and cross-Region inference on the most user-facing routes.
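To make the first bullet concrete, here is a minimal sketch of measuring queue time, TTFT, and inter-token latency separately from any async token stream; Bedrock ConverseStream, vLLM, or an OpenAI-compatible streaming client could all feed it. Field names and the percentile choice are illustrative.

```python
# Sketch: split one streamed request into queue time, TTFT, ITL, and end-to-end.
import time
from statistics import quantiles

async def measure_stream(token_stream, enqueued_at: float) -> dict:
    started_at = time.perf_counter()          # request actually starts running here
    first_token_at, prev, gaps_ms = None, None, []
    async for _token in token_stream:
        now = time.perf_counter()
        if first_token_at is None:
            first_token_at = now
        else:
            gaps_ms.append((now - prev) * 1000)
        prev = now
    if first_token_at is None:
        raise RuntimeError("stream produced no tokens")
    return {
        "queue_ms": (started_at - enqueued_at) * 1000,     # waiting before work began
        "ttft_ms": (first_token_at - started_at) * 1000,   # time to first token
        "itl_p95_ms": quantiles(gaps_ms, n=20)[-1] if len(gaps_ms) >= 2 else None,
        "e2e_ms": (prev - enqueued_at) * 1000,
    }
```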
Recent references: AWS Bedrock prompt caching, AWS Bedrock latency-optimized inference, vLLM optimization and tuning, vLLM disaggregated prefilling, vLLM metrics.
Reversal Triggers
| Trigger | Action |
|---|---|
| RAG p95 latency exceeds 250ms for 3 consecutive days | Reduce to 2 chunks or drop reranking |
| LLM first-token p95 exceeds 400ms during peak | Increase provisioned throughput or route more to Haiku |
| End-to-end p95 exceeds 2,000ms for any intent | Emergency budget rebalance; identify bottleneck stage |
| Guardrail sync checks exceed 100ms p95 | Move slowest check to async |
| Engagement data shows satisfaction rising with faster responses | Tighten budgets further; justify the added cost to the Cost Team with that engagement data |