US-03: Latency Budget Allocation Across the Pipeline
User Story
As a performance engineering lead,
I want to define explicit latency budgets for every stage of the request pipeline,
So that the end-to-end p95 latency stays under 2 seconds while each team knows exactly how much time they can spend.
The Debate
graph TD
subgraph "Performance Team"
P["We have a 2-second budget.<br/>Every millisecond matters.<br/>Cut RAG to 1 chunk.<br/>Skip reranking."]
end
subgraph "Inference Team"
I["If you cut RAG depth,<br/>hallucination rate triples.<br/>We NEED reranking.<br/>Give us 500ms for retrieval."]
end
subgraph "Cost Team"
C["Provisioned throughput for<br/>low latency costs $50K/month.<br/>Can we shift budget from<br/>compute to latency tolerance?"]
end
P ---|"RAG budget<br/>fight"| I
I ---|"Provisioned<br/>throughput fight"| C
C ---|"SLA<br/>fight"| P
style P fill:#4ecdc4,stroke:#333,color:#000
style I fill:#ff6b6b,stroke:#333,color:#000
style C fill:#f9d71c,stroke:#333,color:#000
Acceptance Criteria
- Every pipeline stage has a defined latency budget with p95 tracking.
- End-to-end p95 latency is under 2 seconds for LLM-backed responses.
- End-to-end p95 latency is under 200ms for template responses.
- First-token latency (streaming) is under 500ms at p95.
- No single stage is allowed to consume more than 50% of the total budget.
The Pipeline and Its Latency Budget
Baseline: Where Time Goes Today (Unoptimized)
gantt
title Request Pipeline Latency Breakdown (Unoptimized) — p95
dateFormat X
axisFormat %Lms
section Edge
TLS + Auth + Rate Limit :a1, 0, 40
section Orchestration
Load Session + Memory :a2, 40, 90
Intent Classification :a3, 90, 140
section Data Retrieval
RAG (Embed + KNN + Rerank) :a4, 140, 490
Service Calls (Catalog, Reco) :a5, 140, 340
section LLM
Prompt Assembly :a6, 490, 520
Bedrock Prefill :a7, 520, 920
First Token Decode :a8, 920, 970
Stream Remaining Tokens :a9, 970, 1800
section Safety
Guardrails :a10, 1800, 1900
section Delivery
Format + WebSocket Push :a11, 1900, 1930
Unoptimized total: ~1,930ms p95 — barely under 2 seconds, with zero headroom.
The Budget Allocation Problem
You have 2,000ms total. Here's what each team wants:
| Stage | Performance Team Wants | Inference Team Wants | Cost Team Wants |
|---|---|---|---|
| Edge (TLS, auth, rate limit) | 30ms | 30ms | 30ms |
| Session + Memory load | 40ms | 40ms | 40ms |
| Intent Classification | 30ms (rule-based only) | 80ms (ML classifier) | 30ms (rule-based) |
| RAG Retrieval | 100ms (1 chunk, no rerank) | 400ms (5 chunks + rerank) | 150ms (3 chunks, no rerank) |
| Service Calls | 150ms | 200ms | 150ms |
| Prompt Assembly | 20ms | 30ms | 20ms |
| LLM Generation | 600ms (Haiku) | 1,200ms (Sonnet) | 600ms (Haiku) |
| Guardrails | 50ms (basic checks) | 150ms (full pipeline) | 50ms (basic) |
| Format + Deliver | 30ms | 30ms | 30ms |
| Total | 1,050ms | 2,160ms | 1,100ms |
The Inference Team exceeds the budget by 160ms. Something has to give.
The Negotiated Budget
graph LR
subgraph "Agreed Budget — 2,000ms total"
A["Edge<br/>40ms"] --> B["Session<br/>50ms"]
B --> C["Intent<br/>50ms"]
C --> D["RAG<br/>250ms"]
C --> E["Services<br/>200ms"]
D --> F["Prompt<br/>30ms"]
E --> F
F --> G["LLM Gen<br/>900ms"]
G --> H["Guardrails<br/>100ms"]
H --> I["Deliver<br/>30ms"]
end
style A fill:#2d8659,stroke:#333,color:#fff
style D fill:#ff6b6b,stroke:#333,color:#fff
style G fill:#eb3b5a,stroke:#333,color:#fff
style H fill:#fd9644,stroke:#333,color:#000
Final Budget Table
| Stage | Budget (p95) | Headroom | Owner | Tradeoff Made |
|---|---|---|---|---|
| Edge (TLS + Auth + Rate Limit) | 40ms | 10ms | Platform | None — already fast |
| Session + Memory Load | 50ms | 10ms | Platform | DynamoDB DAX for hot sessions |
| Intent Classification | 50ms | 20ms | ML | Rule-based first, ML only on miss |
| RAG Retrieval | 250ms | 50ms | ML | 3 chunks + lightweight rerank (not full cross-encoder) |
| Service Calls (parallel) | 200ms | 50ms | Backend | Parallel fan-out, cache-first |
| Prompt Assembly | 30ms | 10ms | ML | Pre-compiled templates, string concatenation |
| LLM Generation (to first token) | 400ms | 100ms | ML/Platform | Provisioned throughput for Sonnet |
| LLM Streaming (remaining) | 500ms | 100ms | ML/Platform | Token-level streaming to client |
| Guardrails | 100ms | 20ms | ML | Async heavy checks; sync only for PII + price |
| Format + WebSocket Push | 30ms | 10ms | Platform | Pre-formatted templates |
| Total | 1,650ms | 350ms buffer (vs the 2,000ms ceiling) | | |
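A minimal sketch of how the negotiated budgets could be carried as configuration the pipeline enforces at startup. The values mirror the table above; the class and field names are illustrative, and the validation reflects the acceptance criteria (total allocation within 2,000ms, no single stage above 50%).

```python
# Sketch: the negotiated budgets as data the pipeline can validate and enforce.
# Stage names and values mirror the final budget table; names are illustrative.
from dataclasses import dataclass

TOTAL_BUDGET_MS = 2_000

@dataclass(frozen=True)
class StageBudget:
    name: str
    budget_ms: int     # p95 budget for the stage
    headroom_ms: int   # soft-alert zone above the budget
    owner: str

STAGE_BUDGETS = [
    StageBudget("edge", 40, 10, "Platform"),
    StageBudget("session_load", 50, 10, "Platform"),
    StageBudget("intent_classification", 50, 20, "ML"),
    StageBudget("rag_retrieval", 250, 50, "ML"),
    StageBudget("service_calls", 200, 50, "Backend"),
    StageBudget("prompt_assembly", 30, 10, "ML"),
    StageBudget("llm_first_token", 400, 100, "ML/Platform"),
    StageBudget("llm_streaming", 500, 100, "ML/Platform"),
    StageBudget("guardrails_sync", 100, 20, "ML"),
    StageBudget("format_and_push", 30, 10, "Platform"),
]

def validate_budgets() -> None:
    allocated = sum(s.budget_ms for s in STAGE_BUDGETS)
    assert allocated <= TOTAL_BUDGET_MS, f"over-allocated: {allocated}ms"
    # Acceptance criterion: no single stage may consume more than 50% of the total.
    worst = max(STAGE_BUDGETS, key=lambda s: s.budget_ms)
    assert worst.budget_ms <= TOTAL_BUDGET_MS * 0.5, f"{worst.name} exceeds 50%"
    print(f"allocated {allocated}ms, buffer {TOTAL_BUDGET_MS - allocated}ms")

if __name__ == "__main__":
    validate_budgets()
```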
The 350ms Buffer: Why It Matters
graph TD
A["350ms Buffer"] --> B{"What consumes it?"}
B --> C["Network jitter<br/>(10-50ms)"]
B --> D["Cold start penalty<br/>(occasional 100-200ms)"]
B --> E["GC pauses<br/>(10-30ms)"]
B --> F["Retry on transient failure<br/>(100-200ms)"]
style A fill:#2d8659,stroke:#333,color:#fff
Without the buffer, any jitter pushes past 2 seconds. The buffer is not waste — it's reliability insurance.
Stage-by-Stage Tradeoff Deep Dives
RAG Retrieval: The Biggest Negotiation
graph TD
subgraph "Option A: 1 chunk, no rerank (100ms)"
A1["✅ Fast<br/>✅ Cheap<br/>❌ Hallucination rate: 12%<br/>❌ Incomplete answers"]
end
subgraph "Option B: 3 chunks + lightweight rerank (250ms)"
B1["✅ Good balance<br/>⚠️ Moderate cost<br/>✅ Hallucination rate: 4%<br/>✅ Usually complete"]
end
subgraph "Option C: 5 chunks + cross-encoder rerank (400ms)"
C1["❌ Slow<br/>❌ Expensive<br/>✅ Hallucination rate: 2%<br/>✅ Almost always complete"]
end
A1 --> D["Decision: Option B"]
B1 --> D
C1 --> D
style B1 fill:#2d8659,stroke:#333,color:#fff
style D fill:#54a0ff,stroke:#333,color:#000
Why Option B wins: Going from 1→3 chunks cuts the hallucination rate from 12% to 4%, an 8-point drop. Going from 3→5 chunks only cuts it from 4% to 2%. That marginal quality gain doesn't justify 150ms of extra latency.
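A minimal sketch of one way the "lightweight rerank" could work, assuming the KNN query returns a wider candidate set with vector scores attached. The fusion weights and field names are illustrative assumptions; the point is that no per-pair model forward pass is needed, which is what keeps the step inside the 250ms stage budget.

```python
# Sketch of Option B's lightweight rerank: fuse the existing vector score with a
# cheap lexical-overlap score over ~10 KNN candidates and keep the top 3.
# The 0.7/0.3 weighting and the candidate dict shape are assumptions.
def lightweight_rerank(query: str, candidates: list[dict], keep: int = 3) -> list[dict]:
    q_terms = set(query.lower().split())

    def fused_score(c: dict) -> float:
        terms = set(c["text"].lower().split())
        overlap = len(q_terms & terms) / max(len(q_terms), 1)
        # Microseconds per chunk, vs a cross-encoder forward pass per pair.
        return 0.7 * c["vector_score"] + 0.3 * overlap

    return sorted(candidates, key=fused_score, reverse=True)[:keep]

# Example:
# top3 = lightweight_rerank("does this blender ship to Hokkaido?", knn_results)
```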
LLM Generation: Provisioned vs On-Demand
| Mode | First Token p95 | Monthly Cost | When |
|---|---|---|---|
| On-demand Bedrock | 500-800ms | Pay per token | Off-peak hours, cost zone |
| Provisioned throughput | 200-400ms | $25K/month base | Peak hours, performance zone |
| Cached prompt prefix | 150-300ms | Standard token cost + cache fee | Repeated system prompts |
Decision: Provisioned during peak (8AM-11PM JST), on-demand during off-peak. This saves ~40% of provisioned cost while maintaining latency during user-facing hours.
graph LR
subgraph "Peak (8AM-11PM JST)"
PEAK["Provisioned Throughput<br/>400ms first token<br/>$25K/month"]
end
subgraph "Off-Peak (11PM-8AM JST)"
OFFPEAK["On-Demand<br/>700ms first token<br/>Pay per token only"]
end
PEAK --> SAVE["Saves ~$10K/month<br/>vs 24/7 provisioned"]
OFFPEAK --> SAVE
style PEAK fill:#eb3b5a,stroke:#333,color:#fff
style OFFPEAK fill:#2d8659,stroke:#333,color:#fff
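A sketch of how the peak/off-peak routing decision above could look in code, using the Bedrock Converse API via boto3. The provisioned-model ARN is a placeholder, the on-demand model ID is an example, and the 8AM-11PM JST window mirrors the decision; retries and fallback handling are omitted.

```python
# Sketch: route to provisioned throughput during JST peak hours, on-demand off-peak.
from datetime import datetime, timezone, timedelta

import boto3

JST = timezone(timedelta(hours=9))
PROVISIONED_ARN = "arn:aws:bedrock:ap-northeast-1:123456789012:provisioned-model/EXAMPLE"  # placeholder
ON_DEMAND_MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # example on-demand ID

def select_model_id(now: datetime | None = None) -> str:
    """Provisioned throughput 8AM-11PM JST, on-demand otherwise."""
    hour = (now or datetime.now(JST)).astimezone(JST).hour
    return PROVISIONED_ARN if 8 <= hour < 23 else ON_DEMAND_MODEL_ID

def generate(prompt: str) -> str:
    client = boto3.client("bedrock-runtime", region_name="ap-northeast-1")
    response = client.converse(
        modelId=select_model_id(),  # Converse accepts a provisioned model ARN here
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 512},
    )
    return response["output"]["message"]["content"][0]["text"]
```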
Guardrails: Sync vs Async Split
graph TD
A["LLM Response"] --> B{"Critical Checks<br/>(Sync — 100ms budget)"}
B -->|pass| C["Deliver to User"]
B -->|fail| D["Block + Fallback"]
C --> E{"Heavy Checks<br/>(Async — no budget)"}
E -->|pass| F["Log clean"]
E -->|fail| G["Flag for review<br/>+ optionally retract"]
subgraph "Sync Checks (must be fast)"
S1["PII detection — 20ms"]
S2["Price validation — 30ms"]
S3["ASIN existence — 30ms"]
S4["Toxicity quick scan — 20ms"]
end
subgraph "Async Checks (run after delivery)"
A1["Full hallucination scoring — 300ms"]
A2["Competitor mention scan — 50ms"]
A3["Scope drift analysis — 200ms"]
A4["Quality scoring — 400ms"]
end
style B fill:#eb3b5a,stroke:#333,color:#fff
style E fill:#f9d71c,stroke:#333,color:#000
The tradeoff: Moving heavy checks to async means a small window where a hallucinated response could be visible before retraction. The Inference Team accepted this because:
1. Sync checks catch the most dangerous failures (PII, wrong prices).
2. Async retraction within 2 seconds covers remaining risks.
3. The alternative (all sync) adds 500ms+ to every response.
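A minimal sketch of the sync/async split, assuming an asyncio-based orchestrator. The check functions are placeholders for the real PII, price, ASIN, and toxicity services; the shape that matters is the 100ms `wait_for` around the blocking checks and the fire-and-forget task for the heavy ones.

```python
# Sketch: block delivery only on the fast checks, run heavy checks after delivery.
import asyncio

SYNC_BUDGET_S = 0.100  # 100ms budget for the blocking checks

# Placeholders: real implementations call the PII / price / ASIN / toxicity services.
async def check_pii(text: str) -> bool: return True
async def check_prices(text: str) -> bool: return True
async def check_asin_exists(text: str) -> bool: return True
async def quick_toxicity_scan(text: str) -> bool: return True
async def heavy_checks(text: str) -> bool: return True  # hallucination, scope drift, quality

async def sync_checks(response: str) -> bool:
    results = await asyncio.gather(
        check_pii(response), check_prices(response),
        check_asin_exists(response), quick_toxicity_scan(response),
    )
    return all(results)

async def handle_llm_response(response: str, deliver, retract) -> None:
    try:
        ok = await asyncio.wait_for(sync_checks(response), timeout=SYNC_BUDGET_S)
    except asyncio.TimeoutError:
        ok = False  # budget blown: fail closed, use the fallback path
    if not ok:
        await deliver("Sorry, let me connect you with a human agent.")
        return
    await deliver(response)  # user sees the answer immediately

    async def heavy_then_retract():
        if not await heavy_checks(response):
            await retract(response)  # flag for review / retract within ~2 seconds
    asyncio.create_task(heavy_then_retract())
```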
Intent-Specific Budgets
Not all intents get the same budget:
graph TD
subgraph "Template Path — 200ms budget"
T["Greeting → 30ms<br/>Order Status → 150ms<br/>Promo List → 120ms"]
end
subgraph "Haiku Path — 1,200ms budget"
H["Simple FAQ → 800ms<br/>Product Lookup → 900ms<br/>Simple Reco → 1,100ms"]
end
subgraph "Sonnet Path — 2,000ms budget"
S["Complex Reco → 1,800ms<br/>Comparison → 1,900ms<br/>Multi-turn → 1,800ms"]
end
style T fill:#2d8659,stroke:#333,color:#fff
style H fill:#fd9644,stroke:#333,color:#000
style S fill:#eb3b5a,stroke:#333,color:#fff
| Intent | Latency Budget | Model Tier | RAG Chunks | Guardrail Mode |
|---|---|---|---|---|
| chitchat | 200ms | Template | 0 | Sync only |
| order_tracking | 200ms | Template | 0 | Sync only |
| promotion | 200ms | Template | 0 | Sync only |
| faq | 1,200ms | Haiku | 3 | Sync + async |
| product_question | 1,200ms | Haiku | 2 | Sync + async |
| recommendation | 2,000ms | Sonnet | 3 + rerank | Full pipeline |
| product_discovery | 1,500ms | Haiku/Sonnet | 3 | Sync + async |
| checkout_help | 1,200ms | Haiku | 2 | Sync + async |
| return_request | 1,500ms | Haiku | 2 | Full pipeline |
| escalation | 500ms | Template + handoff | 0 | Sync only |
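The same table can be expressed as a routing map the orchestrator consults per request. A minimal sketch; intent names mirror the table, and the field names are illustrative.

```python
# Sketch: intent-specific budgets as a routing table. Unknown intents fall back
# to the most conservative (Sonnet) configuration.
INTENT_ROUTES = {
    "chitchat":          {"budget_ms": 200,  "tier": "template",        "rag_chunks": 0, "guardrails": "sync"},
    "order_tracking":    {"budget_ms": 200,  "tier": "template",        "rag_chunks": 0, "guardrails": "sync"},
    "promotion":         {"budget_ms": 200,  "tier": "template",        "rag_chunks": 0, "guardrails": "sync"},
    "faq":               {"budget_ms": 1200, "tier": "haiku",           "rag_chunks": 3, "guardrails": "sync+async"},
    "product_question":  {"budget_ms": 1200, "tier": "haiku",           "rag_chunks": 2, "guardrails": "sync+async"},
    "recommendation":    {"budget_ms": 2000, "tier": "sonnet",          "rag_chunks": 3, "guardrails": "full", "rerank": True},
    "product_discovery": {"budget_ms": 1500, "tier": "haiku_or_sonnet", "rag_chunks": 3, "guardrails": "sync+async"},
    "checkout_help":     {"budget_ms": 1200, "tier": "haiku",           "rag_chunks": 2, "guardrails": "sync+async"},
    "return_request":    {"budget_ms": 1500, "tier": "haiku",           "rag_chunks": 2, "guardrails": "full"},
    "escalation":        {"budget_ms": 500,  "tier": "template_handoff","rag_chunks": 0, "guardrails": "sync"},
}

def route(intent: str) -> dict:
    return INTENT_ROUTES.get(intent, INTENT_ROUTES["recommendation"])
```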
Monitoring and Enforcement
sequenceDiagram
participant Stage as Pipeline Stage
participant Timer as Latency Timer
participant Alert as Alert System
participant Dash as Dashboard
Stage->>Timer: Start stage timer
Note over Timer: Stage running...
Stage->>Timer: End stage timer
alt Within budget
Timer->>Dash: Log (green)
else Within budget + headroom
Timer->>Dash: Log (yellow)
Timer->>Alert: Soft alert (Slack)
else Exceeds budget
Timer->>Dash: Log (red)
Timer->>Alert: Hard alert (PagerDuty)
Note over Alert: If p95 exceeds budget<br/>for 15 min → page on-call
end
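A sketch of the stage-timer behaviour from the diagram. Per-request results are classified green/yellow/red against the budget and headroom; the metric and alert sinks are placeholder prints, and the 15-minute p95 rule would be an alarm on the emitted metric (e.g. in CloudWatch) rather than per-request paging.

```python
# Sketch: time one pipeline stage and classify the result against its budget.
import time
from contextlib import contextmanager

@contextmanager
def stage_timer(name: str, budget_ms: float, headroom_ms: float):
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = (time.perf_counter() - start) * 1000
        if elapsed <= budget_ms:
            emit(name, elapsed, "green")                      # dashboard only
        elif elapsed <= budget_ms + headroom_ms:
            emit(name, elapsed, "yellow"); soft_alert(name)   # Slack
        else:
            emit(name, elapsed, "red"); hard_alert(name)      # feeds the p95-over-15-min paging rule

def emit(name, ms, colour): print(f"[{colour}] {name}: {ms:.0f}ms")
def soft_alert(name): print(f"soft alert -> Slack: {name}")
def hard_alert(name): print(f"hard alert recorded for {name}")

# Usage:
# with stage_timer("rag_retrieval", budget_ms=250, headroom_ms=50):
#     chunks = retrieve_and_rerank(query)
```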
2026 Update: Budget TTFT and ITL Separately
Treat everything above this section as the baseline latency-budget architecture. This update keeps that original budget model visible and explains how the current architecture should split and monitor latency.
A modern latency budget should split "fast enough" into distinct user-visible components rather than treating end-to-end latency as one bucket.
- Track queue time, TTFT, and inter-token latency separately (a measurement sketch follows this list). User perception is dominated by the time to first useful token, not just the final completion time.
- Use prompt caching for stable prefixes when the reusable prefix is large enough to meet provider checkpoint minimums. Bedrock now supports multiple cache checkpoints and both 5-minute and 1-hour TTL options on supported models.
- For self-hosted serving, treat prefill and decode as separate bottlenecks. In vLLM V1, chunked prefill is a default optimization, and disaggregated prefill is useful when TTFT and tail ITL need different tuning.
- Scale and alert on waiting requests and queue time, not CPU alone. Generative traffic often fails the SLO because requests are waiting to start, not because a node looks saturated in generic infrastructure metrics.
- Before paying for blanket over-provisioning, try latency-optimized inference and cross-Region inference on the most user-facing routes.
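To make the first bullet concrete, here is a minimal sketch of measuring queue time, TTFT, and inter-token latency separately from any async token stream; Bedrock ConverseStream, vLLM, or an OpenAI-compatible streaming client could all feed it. Field names and the percentile choice are illustrative.

```python
# Sketch: split one streamed request into queue time, TTFT, ITL, and end-to-end.
import time
from statistics import quantiles

async def measure_stream(token_stream, enqueued_at: float) -> dict:
    started_at = time.perf_counter()          # request actually starts running here
    first_token_at, prev, gaps_ms = None, None, []
    async for _token in token_stream:
        now = time.perf_counter()
        if first_token_at is None:
            first_token_at = now
        else:
            gaps_ms.append((now - prev) * 1000)
        prev = now
    if first_token_at is None:
        raise RuntimeError("stream produced no tokens")
    return {
        "queue_ms": (started_at - enqueued_at) * 1000,     # waiting before work began
        "ttft_ms": (first_token_at - started_at) * 1000,   # time to first token
        "itl_p95_ms": quantiles(gaps_ms, n=20)[-1] if len(gaps_ms) >= 2 else None,
        "e2e_ms": (prev - enqueued_at) * 1000,
    }
```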
Recent references: AWS Bedrock prompt caching, AWS Bedrock latency-optimized inference, vLLM optimization and tuning, vLLM disaggregated prefilling, vLLM metrics.
Reversal Triggers
| Trigger | Action |
|---|---|
| RAG p95 latency exceeds 250ms for 3 consecutive days | Reduce to 2 chunks or drop reranking |
| LLM first-token p95 exceeds 400ms during peak | Increase provisioned throughput or route more to Haiku |
| End-to-end p95 exceeds 2,000ms for any intent | Emergency budget rebalance; identify bottleneck stage |
| Guardrail sync checks exceed 100ms p95 | Move slowest check to async |
| Engagement data shows satisfaction rising with faster responses | Tighten budgets further; justify the added cost to the Cost Team with that engagement data |