
US-03: Latency Budget Allocation Across the Pipeline

User Story

As a performance engineering lead, I want to define explicit latency budgets for every stage of the request pipeline, so that end-to-end p95 latency stays under 2 seconds while each team knows exactly how much time they can spend.

The Debate

graph TD
    subgraph "Performance Team"
        P["We have a 2-second budget.<br/>Every millisecond matters.<br/>Cut RAG to 1 chunk.<br/>Skip reranking."]
    end

    subgraph "Inference Team"
        I["If you cut RAG depth,<br/>hallucination rate triples.<br/>We NEED reranking.<br/>Give us 500ms for retrieval."]
    end

    subgraph "Cost Team"
        C["Provisioned throughput for<br/>low latency costs $50K/month.<br/>Can we shift budget from<br/>compute to latency tolerance?"]
    end

    P ---|"RAG budget<br/>fight"| I
    I ---|"Provisioned<br/>throughput fight"| C
    C ---|"SLA<br/>fight"| P

    style P fill:#4ecdc4,stroke:#333,color:#000
    style I fill:#ff6b6b,stroke:#333,color:#000
    style C fill:#f9d71c,stroke:#333,color:#000

Acceptance Criteria

  • Every pipeline stage has a defined latency budget with p95 tracking.
  • End-to-end p95 latency is under 2 seconds for LLM-backed responses.
  • End-to-end p95 latency is under 200ms for template responses.
  • First-token latency (streaming) is under 500ms at p95.
  • No single stage is allowed to consume more than 50% of the total budget.

The Pipeline and Its Latency Budget

Baseline: Where Time Goes Today (Unoptimized)

gantt
    title Request Pipeline Latency Breakdown (Unoptimized) — p95
    dateFormat X
    axisFormat %Lms

    section Edge
    TLS + Auth + Rate Limit       :a1, 0, 40

    section Orchestration
    Load Session + Memory         :a2, 40, 90
    Intent Classification         :a3, 90, 140

    section Data Retrieval
    RAG (Embed + KNN + Rerank)    :a4, 140, 490
    Service Calls (Catalog, Reco) :a5, 140, 340

    section LLM
    Prompt Assembly               :a6, 490, 520
    Bedrock Prefill               :a7, 520, 920
    First Token Decode            :a8, 920, 970
    Stream Remaining Tokens       :a9, 970, 1800

    section Safety
    Guardrails                    :a10, 1800, 1900

    section Delivery
    Format + WebSocket Push       :a11, 1900, 1930

Unoptimized total: ~1,930ms p95 — barely under 2 seconds, with zero headroom.

The Budget Allocation Problem

You have 2,000ms total. Here's what each team wants:

| Stage | Performance Team Wants | Inference Team Wants | Cost Team Wants |
| --- | --- | --- | --- |
| Edge (TLS, auth, rate limit) | 30ms | 30ms | 30ms |
| Session + Memory load | 40ms | 40ms | 40ms |
| Intent Classification | 30ms (rule-based only) | 80ms (ML classifier) | 30ms (rule-based) |
| RAG Retrieval | 100ms (1 chunk, no rerank) | 400ms (5 chunks + rerank) | 150ms (3 chunks, no rerank) |
| Service Calls | 150ms | 200ms | 150ms |
| Prompt Assembly | 20ms | 30ms | 20ms |
| LLM Generation | 600ms (Haiku) | 1,200ms (Sonnet) | 600ms (Haiku) |
| Guardrails | 50ms (basic checks) | 150ms (full pipeline) | 50ms (basic) |
| Format + Deliver | 30ms | 30ms | 30ms |
| Total | 1,050ms | 2,160ms | 1,100ms |

The Inference Team exceeds the budget by 160ms. Something has to give.


The Negotiated Budget

graph LR
    subgraph "Agreed Budget — 2,000ms total"
        A["Edge<br/>40ms"] --> B["Session<br/>50ms"]
        B --> C["Intent<br/>50ms"]
        C --> D["RAG<br/>250ms"]
        C --> E["Services<br/>200ms"]
        D --> F["Prompt<br/>30ms"]
        E --> F
        F --> G["LLM Gen<br/>900ms"]
        G --> H["Guardrails<br/>100ms"]
        H --> I["Deliver<br/>30ms"]
    end

    style A fill:#2d8659,stroke:#333,color:#fff
    style D fill:#ff6b6b,stroke:#333,color:#fff
    style G fill:#eb3b5a,stroke:#333,color:#fff
    style H fill:#fd9644,stroke:#333,color:#000

Final Budget Table

| Stage | Budget (p95) | Headroom | Owner | Tradeoff Made |
| --- | --- | --- | --- | --- |
| Edge (TLS + Auth + Rate Limit) | 40ms | 10ms | Platform | None — already fast |
| Session + Memory Load | 50ms | 10ms | Platform | DynamoDB DAX for hot sessions |
| Intent Classification | 50ms | 20ms | ML | Rule-based first, ML only on miss |
| RAG Retrieval | 250ms | 50ms | ML | 3 chunks + lightweight rerank (not full cross-encoder) |
| Service Calls (parallel) | 200ms | 50ms | Backend | Parallel fan-out, cache-first |
| Prompt Assembly | 30ms | 10ms | ML | Pre-compiled templates, string concatenation |
| LLM Generation (to first token) | 400ms | 100ms | ML/Platform | Provisioned throughput for Sonnet |
| LLM Streaming (remaining) | 500ms | 100ms | ML/Platform | Token-level streaming to client |
| Guardrails | 100ms | 20ms | ML | Async heavy checks; sync only for PII + price |
| Format + WebSocket Push | 30ms | 10ms | Platform | Pre-formatted templates |
| Total | 1,650ms | 350ms buffer | | |
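
To keep this table from drifting away from the code that enforces it, the per-stage budgets can live in a single config that the pipeline, dashboards, and alerts all read. A minimal sketch in Python; the stage keys and constant names are illustrative, not from an existing codebase:

```python
# Hypothetical budget config mirroring the table above; all values in milliseconds.
STAGE_BUDGETS_MS = {
    "edge":            {"budget": 40,  "headroom": 10,  "owner": "platform"},
    "session_memory":  {"budget": 50,  "headroom": 10,  "owner": "platform"},
    "intent":          {"budget": 50,  "headroom": 20,  "owner": "ml"},
    "rag":             {"budget": 250, "headroom": 50,  "owner": "ml"},
    "service_calls":   {"budget": 200, "headroom": 50,  "owner": "backend"},
    "prompt_assembly": {"budget": 30,  "headroom": 10,  "owner": "ml"},
    "llm_first_token": {"budget": 400, "headroom": 100, "owner": "ml-platform"},
    "llm_streaming":   {"budget": 500, "headroom": 100, "owner": "ml-platform"},
    "guardrails":      {"budget": 100, "headroom": 20,  "owner": "ml"},
    "delivery":        {"budget": 30,  "headroom": 10,  "owner": "platform"},
}

END_TO_END_BUDGET_MS = 2_000
ALLOCATED_MS = sum(s["budget"] for s in STAGE_BUDGETS_MS.values())   # 1,650ms
BUFFER_MS = END_TO_END_BUDGET_MS - ALLOCATED_MS                      # 350ms
```

With the budgets in one place, the 350ms buffer is derived rather than maintained by hand, and any rebalance under the reversal triggers at the end of this document is a one-line diff.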

The 350ms Buffer: Why It Matters

graph TD
    A["350ms Buffer"] --> B{"What consumes it?"}
    B --> C["Network jitter<br/>(10-50ms)"]
    B --> D["Cold start penalty<br/>(occasional 100-200ms)"]
    B --> E["GC pauses<br/>(10-30ms)"]
    B --> F["Retry on transient failure<br/>(100-200ms)"]

    style A fill:#2d8659,stroke:#333,color:#fff

Without the buffer, any jitter pushes past 2 seconds. The buffer is not waste — it's reliability insurance.


Stage-by-Stage Tradeoff Deep Dives

RAG Retrieval: The Biggest Negotiation

graph TD
    subgraph "Option A: 1 chunk, no rerank (100ms)"
        A1["✅ Fast<br/>✅ Cheap<br/>❌ Hallucination rate: 12%<br/>❌ Incomplete answers"]
    end

    subgraph "Option B: 3 chunks + lightweight rerank (250ms)"
        B1["✅ Good balance<br/>⚠️ Moderate cost<br/>✅ Hallucination rate: 4%<br/>✅ Usually complete"]
    end

    subgraph "Option C: 5 chunks + cross-encoder rerank (400ms)"
        C1["❌ Slow<br/>❌ Expensive<br/>✅ Hallucination rate: 2%<br/>✅ Almost always complete"]
    end

    A1 --> D["Decision: Option B"]
    B1 --> D
    C1 --> D

    style B1 fill:#2d8659,stroke:#333,color:#fff
    style D fill:#54a0ff,stroke:#333,color:#000

Why Option B wins: Going from 1→3 chunks cuts hallucination by 67%. Going from 3→5 only cuts another 50%. The marginal quality gain doesn't justify 150ms extra latency.
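
A sketch of what Option B looks like when the 250ms figure is enforced as a deadline rather than an average, assuming hypothetical async helpers (embed, knn_search, lightweight_score); the degrade path falls back to raw KNN order instead of blowing the budget:

```python
import asyncio
import time

RAG_BUDGET_MS = 250    # negotiated budget for the whole retrieval stage
TOP_K_CANDIDATES = 10  # cheap KNN pass
FINAL_CHUNKS = 3       # what actually goes into the prompt

async def retrieve(query: str) -> list[dict]:
    """Option B: 3 chunks + lightweight rerank inside a hard 250ms deadline.

    embed(), knn_search(), and lightweight_score() are hypothetical helpers,
    not calls into a specific library.
    """
    deadline = time.monotonic() + RAG_BUDGET_MS / 1000

    vector = await embed(query)                                  # assumed ~20-30ms
    candidates = await knn_search(vector, k=TOP_K_CANDIDATES)    # assumed ~80-120ms

    remaining = deadline - time.monotonic()
    if remaining <= 0:
        # Budget already gone: degrade to the raw KNN order instead of reranking.
        return candidates[:FINAL_CHUNKS]

    try:
        # Lightweight rerank (a small scoring model), not a full cross-encoder.
        scored = await asyncio.wait_for(
            lightweight_score(query, candidates), timeout=remaining
        )
    except asyncio.TimeoutError:
        return candidates[:FINAL_CHUNKS]

    scored.sort(key=lambda c: c["score"], reverse=True)
    return scored[:FINAL_CHUNKS]
```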

LLM Generation: Provisioned vs On-Demand

| Mode | First Token p95 | Monthly Cost | When |
| --- | --- | --- | --- |
| On-demand Bedrock | 500-800ms | Pay per token | Off-peak hours, cost zone |
| Provisioned throughput | 200-400ms | $25K/month base | Peak hours, performance zone |
| Cached prompt prefix | 150-300ms | Standard token cost + cache fee | Repeated system prompts |

Decision: Provisioned during peak (8AM-11PM JST), on-demand during off-peak. This saves ~40% of provisioned cost while maintaining latency during user-facing hours.

graph LR
    subgraph "Peak (8AM-11PM JST)"
        PEAK["Provisioned Throughput<br/>400ms first token<br/>$25K/month"]
    end

    subgraph "Off-Peak (11PM-8AM JST)"
        OFFPEAK["On-Demand<br/>700ms first token<br/>Pay per token only"]
    end

    PEAK --> SAVE["Saves ~$10K/month<br/>vs 24/7 provisioned"]
    OFFPEAK --> SAVE

    style PEAK fill:#eb3b5a,stroke:#333,color:#fff
    style OFFPEAK fill:#2d8659,stroke:#333,color:#fff
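
A sketch of the peak/off-peak switch, assuming a boto3 bedrock-runtime client; the ARN, model ID, and region are placeholders for this account's real values. Because Bedrock accepts a provisioned-throughput ARN in the modelId field, the switch is a routing decision rather than a separate code path:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

import boto3

# Region chosen for illustration; identifiers below are placeholders.
bedrock = boto3.client("bedrock-runtime", region_name="ap-northeast-1")

PROVISIONED_ARN = "arn:aws:bedrock:ap-northeast-1:123456789012:provisioned-model/EXAMPLE"
ON_DEMAND_MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"

PEAK_START_HOUR = 8   # 8AM JST
PEAK_END_HOUR = 23    # 11PM JST

def pick_model_id(now: datetime | None = None) -> str:
    """Route to provisioned throughput during JST peak, on-demand otherwise."""
    now = now or datetime.now(ZoneInfo("Asia/Tokyo"))
    in_peak = PEAK_START_HOUR <= now.hour < PEAK_END_HOUR
    return PROVISIONED_ARN if in_peak else ON_DEMAND_MODEL_ID

def generate(prompt: str):
    # converse_stream takes either an on-demand model ID or a provisioned
    # throughput ARN as modelId.
    return bedrock.converse_stream(
        modelId=pick_model_id(),
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
```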

Guardrails: Sync vs Async Split

graph TD
    A["LLM Response"] --> B{"Critical Checks<br/>(Sync — 100ms budget)"}
    B -->|pass| C["Deliver to User"]
    B -->|fail| D["Block + Fallback"]

    C --> E{"Heavy Checks<br/>(Async — no budget)"}
    E -->|pass| F["Log clean"]
    E -->|fail| G["Flag for review<br/>+ optionally retract"]

    subgraph "Sync Checks (must be fast)"
        S1["PII detection — 20ms"]
        S2["Price validation — 30ms"]
        S3["ASIN existence — 30ms"]
        S4["Toxicity quick scan — 20ms"]
    end

    subgraph "Async Checks (run after delivery)"
        A1["Full hallucination scoring — 300ms"]
        A2["Competitor mention scan — 50ms"]
        A3["Scope drift analysis — 200ms"]
        A4["Quality scoring — 400ms"]
    end

    style B fill:#eb3b5a,stroke:#333,color:#fff
    style E fill:#f9d71c,stroke:#333,color:#000

The tradeoff: moving heavy checks to async means a small window where a hallucinated response could be visible before retraction. The Inference Team accepted this because:

1. Sync checks catch the most dangerous failures (PII, wrong prices).
2. Async retraction within 2 seconds covers the remaining risks.
3. The alternative (all sync) adds 500ms+ to every response.
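
A sketch of the sync/async split, assuming the individual checks and delivery hooks (check_pii, push_to_websocket, run_async_checks, and so on) exist as async helpers; only the critical checks share the 100ms deadline:

```python
import asyncio

SYNC_GUARDRAIL_BUDGET_S = 0.100  # the 100ms sync budget from the table

async def deliver_with_guardrails(response: str, session) -> None:
    """Sync critical checks before delivery, heavy checks after; helpers are hypothetical."""
    critical = [
        check_pii(response),
        check_prices(response),
        check_asins(response),
        quick_toxicity_scan(response),
    ]
    try:
        results = await asyncio.wait_for(
            asyncio.gather(*critical), timeout=SYNC_GUARDRAIL_BUDGET_S
        )
    except asyncio.TimeoutError:
        await send_fallback(session)            # budget blown: fail safe, not slow
        return
    if not all(results):
        await send_fallback(session)            # block + fallback
        return

    await push_to_websocket(session, response)  # user sees the answer now

    # Heavy checks run after delivery; a failure flags the message and may retract it.
    asyncio.create_task(run_async_checks(response, session))
```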


Intent-Specific Budgets

Not all intents get the same budget:

graph TD
    subgraph "Template Path — 200ms budget"
        T["Greeting → 30ms<br/>Order Status → 150ms<br/>Promo List → 120ms"]
    end

    subgraph "Haiku Path — 1,200ms budget"
        H["Simple FAQ → 800ms<br/>Product Lookup → 900ms<br/>Simple Reco → 1,100ms"]
    end

    subgraph "Sonnet Path — 2,000ms budget"
        S["Complex Reco → 1,800ms<br/>Comparison → 1,900ms<br/>Multi-turn → 1,800ms"]
    end

    style T fill:#2d8659,stroke:#333,color:#fff
    style H fill:#fd9644,stroke:#333,color:#000
    style S fill:#eb3b5a,stroke:#333,color:#fff

| Intent | Latency Budget | Model Tier | RAG Chunks | Guardrail Mode |
| --- | --- | --- | --- | --- |
| chitchat | 200ms | Template | 0 | Sync only |
| order_tracking | 200ms | Template | 0 | Sync only |
| promotion | 200ms | Template | 0 | Sync only |
| faq | 1,200ms | Haiku | 3 | Sync + async |
| product_question | 1,200ms | Haiku | 2 | Sync + async |
| recommendation | 2,000ms | Sonnet | 3 + rerank | Full pipeline |
| product_discovery | 1,500ms | Haiku/Sonnet | 3 | Sync + async |
| checkout_help | 1,200ms | Haiku | 2 | Sync + async |
| return_request | 1,500ms | Haiku | 2 | Full pipeline |
| escalation | 500ms | Template + handoff | 0 | Sync only |
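
The same table expressed as a routing policy the orchestrator can look up per request; a sketch with illustrative field names and tier labels:

```python
# Hypothetical per-intent routing table mirroring the rows above.
INTENT_POLICY = {
    "chitchat":          {"budget_ms": 200,  "tier": "template",          "rag_chunks": 0, "guardrails": "sync"},
    "order_tracking":    {"budget_ms": 200,  "tier": "template",          "rag_chunks": 0, "guardrails": "sync"},
    "promotion":         {"budget_ms": 200,  "tier": "template",          "rag_chunks": 0, "guardrails": "sync"},
    "faq":               {"budget_ms": 1200, "tier": "haiku",             "rag_chunks": 3, "guardrails": "sync+async"},
    "product_question":  {"budget_ms": 1200, "tier": "haiku",             "rag_chunks": 2, "guardrails": "sync+async"},
    "recommendation":    {"budget_ms": 2000, "tier": "sonnet",            "rag_chunks": 3, "guardrails": "full", "rerank": True},
    "product_discovery": {"budget_ms": 1500, "tier": "haiku_or_sonnet",   "rag_chunks": 3, "guardrails": "sync+async"},
    "checkout_help":     {"budget_ms": 1200, "tier": "haiku",             "rag_chunks": 2, "guardrails": "sync+async"},
    "return_request":    {"budget_ms": 1500, "tier": "haiku",             "rag_chunks": 2, "guardrails": "full"},
    "escalation":        {"budget_ms": 500,  "tier": "template_handoff",  "rag_chunks": 0, "guardrails": "sync"},
}

def policy_for(intent: str) -> dict:
    # Unknown intents fall back to the cheapest, fastest path.
    return INTENT_POLICY.get(intent, INTENT_POLICY["chitchat"])
```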

Monitoring and Enforcement

sequenceDiagram
    participant Stage as Pipeline Stage
    participant Timer as Latency Timer
    participant Alert as Alert System
    participant Dash as Dashboard

    Stage->>Timer: Start stage timer
    Note over Timer: Stage running...
    Stage->>Timer: End stage timer

    alt Within budget
        Timer->>Dash: Log (green)
    else Within budget + headroom
        Timer->>Dash: Log (yellow)
        Timer->>Alert: Soft alert (Slack)
    else Exceeds budget + headroom
        Timer->>Dash: Log (red)
        Timer->>Alert: Hard alert (PagerDuty)
        Note over Alert: If p95 exceeds budget<br/>for 15 min → page on-call
    end
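
A sketch of the per-stage timer behind this diagram, reusing the budget config sketched after the final budget table; emit_metric stands in for the real dashboard hook, and the 15-minute p95 paging rule lives in the alerting system rather than in request code:

```python
import time
from contextlib import contextmanager

@contextmanager
def stage_timer(stage: str):
    """Time one pipeline stage and classify it green/yellow/red against its budget."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000
        cfg = STAGE_BUDGETS_MS[stage]               # config sketched earlier
        if elapsed_ms <= cfg["budget"]:
            status = "green"                         # within budget
        elif elapsed_ms <= cfg["budget"] + cfg["headroom"]:
            status = "yellow"                        # eating into headroom, soft alert
        else:
            status = "red"                           # over budget + headroom, hard alert
        emit_metric(stage=stage, latency_ms=elapsed_ms, status=status)  # placeholder hook

# Usage:
# with stage_timer("rag"):
#     chunks = run_rag(query)
```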

2026 Update: Budget TTFT and ITL Separately

Treat everything above this section as the baseline latency-budget architecture. This update keeps that original budget model visible and explains how latency should be split and monitored in the current architecture.

A modern latency budget should split "fast enough" into distinct user-visible components rather than treating end-to-end latency as one bucket.

  • Track queue time, TTFT, and inter-token latency separately. User perception is dominated by the time to first useful token, not just the final completion time (see the measurement sketch after this list).
  • Use prompt caching for stable prefixes when the reusable prefix is large enough to meet provider checkpoint minimums. Bedrock now supports multiple cache checkpoints and both 5-minute and 1-hour TTL options on supported models.
  • For self-hosted serving, treat prefill and decode as separate bottlenecks. In vLLM V1, chunked prefill is a default optimization, and disaggregated prefill is useful when TTFT and tail ITL need different tuning.
  • Scale and alert on waiting requests and queue time, not CPU alone. Generative traffic often fails the SLO because requests are waiting to start, not because a node looks saturated in generic infrastructure metrics.
  • Before paying for blanket over-provisioning, try latency-optimized inference and cross-Region inference on the most user-facing routes.
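
A minimal client-side sketch of splitting TTFT from inter-token latency, written against any token iterator so it makes no provider assumptions; emit_metric is a placeholder. Queue time cannot be observed from the client and should come from the serving layer (for example, vLLM's queue metrics):

```python
import time

def measure_streaming_latency(token_iter):
    """Pass tokens through while recording TTFT and mean inter-token latency (ITL)."""
    start = time.monotonic()
    first_token_at = None
    gaps = []
    last = None

    for token in token_iter:
        now = time.monotonic()
        if first_token_at is None:
            first_token_at = now        # time to first token
        else:
            gaps.append(now - last)     # gap since previous token
        last = now
        yield token                     # caller keeps streaming to the user

    ttft_ms = (first_token_at - start) * 1000 if first_token_at else None
    mean_itl_ms = (sum(gaps) / len(gaps)) * 1000 if gaps else None
    emit_metric(ttft_ms=ttft_ms, mean_itl_ms=mean_itl_ms)  # placeholder metrics hook
```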

Recent references: AWS Bedrock prompt caching, AWS Bedrock latency-optimized inference, vLLM optimization and tuning, vLLM disaggregated prefilling, vLLM metrics.

Reversal Triggers

| Trigger | Action |
| --- | --- |
| RAG p95 latency exceeds 250ms for 3 consecutive days | Reduce to 2 chunks or drop reranking |
| LLM first-token p95 exceeds 400ms during peak | Increase provisioned throughput or route more to Haiku |
| End-to-end p95 exceeds 2,000ms for any intent | Emergency budget rebalance; identify bottleneck stage |
| Guardrail sync checks exceed 100ms p95 | Move slowest check to async |
| User satisfaction correlates with faster responses | Tighten budget further; justify to Cost Team with engagement data |