
Constraint Scenario 04 — Latency SLA Tightened (p95 first-token 1.4s → 600ms)

Trigger: A voice surface is launching for MangaAssist and requires sub-second first-byte latency. The same agent that serves chat must serve voice. Pillars stressed: P1 (graph reshape) + P2 (serving) + P3 (provider/model choice).


TL;DR

A 2.3× latency tightening cannot be hit by "make things faster everywhere." It requires re-architecting the synchronous critical path to do less work, with the parts that can be deferred moved to a parallel-or-async track. The graph (story 01) and the streaming policy (story 07) are the primary levers; the cost is more compute and more complexity, not lower quality.


The change

| Metric | Before (chat) | After (voice + chat) |
|---|---|---|
| p95 first-token latency | 1.4s | 600ms |
| p95 full-response latency | 6s | 2.5s for voice utterances; 6s retained for chat |
| Audio first-byte (voice only) | n/a | 800ms |
| Cost/turn budget | $0.018 | $0.022 (+22% acceptable) |
| Quality bar | unchanged | unchanged |

The cost increases because we'll pay for more parallel work, more pre-warmed capacity, and more aggressive caching. Quality is held constant — that's the gate.


What's in the synchronous critical path today

From the graph (story 01) at chat steady state:

Ingress → SafetyPreCheck → Plan → ToolDispatch → Tools (parallel) → Compose → SafetyPostCheck → Stream
   5ms        80ms          350ms      ~50ms          600ms           700ms         150ms       (first token here)

Fully serialized, first-token latency would be the sum of those stages: ~1935ms p95. In practice it's ~1.4s p95 because stages overlap. Either way, the budget splits roughly:

  • Plan: 25%
  • Tools: 30%
  • Compose: 35%
  • Safety + ingress: 10%

To get to 600ms first-token, the new budget split has to be:

  • Plan: 200ms (was 350)
  • Tools: 200ms (was 600, fan-out limited)
  • Compose: 200ms (was 700, streaming starts mid-compose)
  • Safety + ingress: 50ms (was 230)

Each one is a redesign, not a tweak.
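
A quick back-of-envelope check on those numbers (a minimal sketch in Python; the stage timings are from the tables above, and the overlap factor is inferred from the observed 1.4s p95 versus the 1935ms serialized sum, not measured per stage):

before_ms = {"ingress": 5, "safety_pre": 80, "plan": 350, "dispatch": 50,
             "tools": 600, "compose": 700, "safety_post": 150}
after_ms = {"plan": 200, "tools": 200, "compose": 200, "safety_ingress": 50}

serialized = sum(before_ms.values())   # 1935ms if nothing overlapped
overlap = 1400 / serialized            # ~0.72, inferred from the observed 1.4s p95

# New budgets sum to 650ms serialized; the same overlap factor would land
# around 470ms, which is why 600ms p95 is hittable with margin.
print(serialized, sum(after_ms.values()), round(sum(after_ms.values()) * overlap))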


The graph reshape

flowchart LR
  subgraph BEFORE[Before - chat steady state]
    A1[Ingress] --> B1[Safety pre]
    B1 --> C1[Plan]
    C1 --> D1[Tools]
    D1 --> E1[Compose]
    E1 --> F1[Safety post]
    F1 --> G1[Stream]
  end

  subgraph AFTER[After - voice-grade]
    A2[Ingress] --> B2[Safety pre - cached]
    B2 --> C2[Speculative pre-plan]
    B2 --> D2[Plan]
    C2 -.cached intent.-> D2
    D2 --> E2[Tools - tighter fanout]
    E2 --> F2[Compose - streaming]
    F2 --> G2[Stream first token]
    F2 -.parallel.-> H2[Safety post on draft]
    H2 -.kill on violation.-> G2
  end

What changed

  • Speculative pre-plan — A small classifier runs in parallel with the safety pre-check, predicting the intent. If it agrees with the planner's later decision, we save ~150ms; if it disagrees, we discard the speculation (sketched in code after this list).
  • Streaming starts mid-compose, not after compose finishes. First token goes out as soon as the first sentence is composable.
  • Safety post runs in parallel with stream. If it flags a violation, the stream is cancelled and a refusal is appended. This is the trade-off — sometimes the user sees a partial response that gets cancelled. Eval shows this happens <0.05% of voice turns.
  • Tighter fan-out — voice surface gets fan-out=2 max (vs 5 for chat). Quality measurement: -1.5% on rubric, acceptable for the latency win.
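
A minimal asyncio sketch of the speculative pre-plan pattern. The node names, stub timings, and PlanDecision shape are illustrative stand-ins for the real harness API, not the actual implementation:

import asyncio
from dataclasses import dataclass

@dataclass
class PlanDecision:
    intent: str
    steps: list

async def safety_pre_check(turn: str) -> None:
    await asyncio.sleep(0.08)                  # ~80ms budget

async def classify_intent(turn: str) -> str:   # tiny speculator model
    await asyncio.sleep(0.05)
    return "recommend_manga"

async def plan(turn: str, intent_hint: str | None = None) -> PlanDecision:
    await asyncio.sleep(0.20)                  # ~200ms planner LLM call
    return PlanDecision(intent="recommend_manga", steps=["catalog.search"])

async def plan_with_speculation(turn: str) -> PlanDecision:
    # The speculator runs concurrently with the safety pre-check, so its
    # prediction is ready the moment safety clears the turn.
    safety = asyncio.create_task(safety_pre_check(turn))
    spec = asyncio.create_task(classify_intent(turn))
    await safety
    hint = await spec

    decision = await plan(turn, intent_hint=hint)
    # On agreement, work pre-staged against the hint (tool pre-fetch, prompt
    # assembly) is kept: ~150ms saved. On disagreement it is simply dropped;
    # the only extra cost is the speculator's tokens (~10% of the planner's).
    return decision

print(asyncio.run(plan_with_speculation("what should I read next?")))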

The serving reshape

| Lever | Effect on first-token latency |
|---|---|
| Hot voice fleet — pre-warmed, voice-only containers | -200ms cold-start tail |
| Provisioned throughput on Bedrock for voice path | -100ms model-side queuing |
| Aggressive inference cache on safety pre-check | -60ms p95 |
| Keep-alive connection pool to backing tools (catalog, prefs) | -40ms |
| Co-locate voice fleet with model endpoint (same AZ) | -20ms |

Combined: ~420ms first-token improvement on the long pole. We hit 600ms first-token p95 with margin.


The cost of getting fast

| Source | Extra cost vs chat |
|---|---|
| Provisioned throughput (reserved capacity not always used) | +12% |
| Speculative pre-plan (sometimes wasted) | +5% |
| Hot voice fleet (separate warm floor) | +8% |
| Aggressive caching infra | +2% |
| Total | ~+27% |

Budget allowance is +22%; we're at +27%. The gap closes by:

  • Voice has lower fan-out (Lever 5 from the cost scenario); saves 3%.
  • Voice has shorter average responses (audio brevity); saves 4% in output tokens.

Net: +20%, within budget.
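
The same arithmetic as a one-screen check (numbers straight from the tables above):

extra = {"provisioned_throughput": 0.12, "speculative_preplan": 0.05,
         "hot_voice_fleet": 0.08, "caching_infra": 0.02}
savings = {"lower_fanout": 0.03, "shorter_voice_responses": 0.04}

net = sum(extra.values()) - sum(savings.values())
assert net <= 0.22, "over the +22% budget gate"
print(f"net cost delta: +{net:.0%}")   # +20%, within the +22% allowance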


What the harness contributes

| Harness piece | Role |
|---|---|
| Typed graph (story 01) | The graph reshape is a versioned topology change |
| Streaming + heartbeat (story 07) | The streaming-mid-compose pattern lives here |
| Capability flags (story 02) | Voice surface activates "voice-mode" flags; chat is unaffected |
| Eval (story 04) | Voice has its own rubric (concision, audio-friendliness); per-surface eval baseline |
| Cache (story 06) | Per-cache-tier eligibility for voice |
| Provider adapters (story 02) | Voice uses the provisioned-throughput endpoint; chat uses on-demand |

Without the per-surface flag and per-rubric eval machinery, voice would either drag chat down or fail its own bar. With them, the two surfaces coexist cleanly.


Pitfalls

Pitfall 1 — Cancelled-stream on safety violation feels broken

The model's draft trips the safety post-check after the stream has started. The stream cancels mid-sentence and a refusal is appended. This sometimes feels jarring.

Mitigation:

  • The safety pre-check is highly tuned for voice — most violations are caught before the stream starts.
  • The post-check cancels on catastrophic violations only (PII leak, copyright violation, mature content for non-adults); most "soft" issues are non-cancelling (gate sketched below).
  • Cancellation rate is measured weekly; SLO < 0.1%.
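
One way that gate could look (a sketch; the category names are illustrative, not the production taxonomy):

# Only catastrophic findings cancel an in-flight stream; soft issues are
# logged and handled out-of-band instead of jarring the user.
CANCELLING = {"pii_leak", "copyright_violation", "mature_content_non_adult"}

def should_cancel_stream(findings: list[str]) -> bool:
    return any(f in CANCELLING for f in findings)

should_cancel_stream(["tone_too_casual"])   # False: non-cancelling
should_cancel_stream(["pii_leak"])          # True: kill the stream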

Pitfall 2 — Speculative pre-plan disagrees with plan often

If the speculation hit rate falls below ~70%, the speculation costs more than it saves.

Mitigation:

  • Speculation accuracy is monitored; alarms fire below 75% (a minimal monitor is sketched below).
  • The speculator is itself improvable: train on disagreements, ship updates monthly.
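
A minimal monitor sketch, assuming agreement is recorded per turn (the 75% threshold is from above; the window size and class itself are illustrative):

from collections import deque

class SpeculationMonitor:
    def __init__(self, window: int = 1000, alarm_below: float = 0.75):
        self.outcomes = deque(maxlen=window)   # rolling agree/disagree window
        self.alarm_below = alarm_below

    def record(self, speculated_intent: str, planned_intent: str) -> None:
        self.outcomes.append(speculated_intent == planned_intent)

    def agreement_rate(self) -> float:
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 1.0

    def alarming(self) -> bool:
        # Below 75% agreement, speculation is approaching the ~70% break-even
        # point where it costs more than it saves; page before that.
        return self.agreement_rate() < self.alarm_below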

Pitfall 3 — Voice rubric drifts from chat rubric

Voice gets optimized for concision; eventually voice answers are good but feel curt. Cross-surface comparison shows voice users disengaging earlier.

Mitigation:

  • A "shared baseline" sub-rubric (factual_grounding, intent_match, safety) is identical across surfaces; surface-specific dimensions (concision for voice, depth for chat) are layered on top (see the sketch below).
  • A cross-surface user-satisfaction comparison runs monthly.
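
A sketch of the layering (dimension names mirror this doc; the function itself is illustrative):

SHARED_BASELINE = ["factual_grounding", "intent_match", "safety"]
SURFACE_DIMS = {"voice": ["concision", "audio_friendliness"],
                "chat":  ["depth", "formatting"]}

def rubric_for(surface: str) -> list[str]:
    # Baseline first, so cross-surface comparisons stay apples-to-apples;
    # surface-specific dimensions are layered on top.
    return SHARED_BASELINE + SURFACE_DIMS[surface]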


Q&A drill — opening question

Q: Why not just use a faster model? Haiku is faster than Sonnet — done.

Two reasons:

  1. Latency is dominated by orchestration, not the LLM call. The Plan node is 350ms; the LLM call inside it is ~200ms. Even halving the LLM call (which Haiku can't do for this task without a quality regression) only saves ~100ms.
  2. Tools and compose are ~65% of the budget (30% + 35%). Making the planner faster doesn't help if tools are still slow.

The latency cut requires rearchitecting the path, not swapping a single component.


Grilling — Round 1

Q1. Streaming-mid-compose — how does the model produce streamable output with the right structure?

The compose prompt asks for structured streaming output: a short opening sentence first, then details, then a wrap. The streaming layer is sentence-aware — emits each complete sentence as it lands. This is a prompt-engineering pattern; it doesn't always produce perfectly streamable output, but the eval rubric for voice rewards it.

A small "stream-ability" classifier rates each output. Below threshold, we fall back to the buffer-then-stream path (one extra ~200ms).

Q2. Safety post running in parallel with stream — what's the user experience when it cancels?

The cancellation playback says "Wait, I want to revise that — let me try again," and the system re-issues the turn with a tighter constraint. The user gets ~1-2s of dead air, then a corrected response. Annoying, but rare (<0.1%) and recoverable. Better than holding everyone's first token back by 150ms for a 0.1% case. A sketch of the parallel check-and-cancel flow follows.
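
This reuses should_cancel_stream from the pitfall section above; speak, post_check, and retry are caller-supplied stand-ins, not harness API:

import asyncio

REVISION_LINE = "Wait, I want to revise that — let me try again."

async def _play_all(sentences, speak):
    for s in sentences:
        await speak(s)

async def stream_with_parallel_safety(draft_sentences, speak, post_check, retry):
    # Stream sentences while the post-check scores the full draft in parallel.
    check = asyncio.create_task(post_check())
    streamer = asyncio.create_task(_play_all(draft_sentences, speak))
    done, _ = await asyncio.wait({check, streamer},
                                 return_when=asyncio.FIRST_COMPLETED)

    if check in done and should_cancel_stream(check.result()):
        streamer.cancel()            # kill the in-flight stream mid-sentence
        await speak(REVISION_LINE)   # the user hears ~1-2s of dead air
        await retry()                # re-issue with a tighter constraint
        return
    await streamer
    # Rare tail: a violation lands after playback finished; append instead.
    if should_cancel_stream(await check):
        await speak(REVISION_LINE)
        await retry()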

Q3. What about ASR (speech-to-text) on the way in? That has its own latency.

Yes. The 600ms target is first-token of the agent response, measured from the moment ASR delivers text to the agent. End-to-end mouth-to-ear latency is ~1.4s including ASR (~400ms) and TTS (~250ms to first audio frame). Each segment is its own optimization; this scenario focuses on the agent path.


Grilling — Round 2 (architect-level)

Q4. Speculative pre-plan + plan can both be running. How do you not double-spend on tokens?

The speculator is a tiny model (Haiku or smaller) on a compressed prompt — its cost is ~10% of the planner's. The "wasted" speculation cost when the speculator disagrees is bounded. ROI:

  • Speculator agreement rate: 80%.
  • Latency savings on agree: 150ms.
  • Cost increase on disagree: $0.0004/turn (10% of planner cost).
  • Cost increase amortized: $0.00008/turn.

For a 150ms first-token improvement, paying $0.00008/turn extra is overwhelmingly worth it.
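
The expected-value arithmetic, spelled out:

agree_rate = 0.80
expected_saving_ms = 150 * agree_rate           # 120ms expected first-token win
expected_waste_usd = 0.0004 * (1 - agree_rate)  # $0.00008/turn amortized
print(f"{expected_saving_ms:.0f}ms saved for ${expected_waste_usd:.5f}/turn")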

Q5. How would this scale to 10× voice traffic, or to a third surface (alerts) with its own latency profile?

The graph topology pattern generalizes. Each new surface gets:

  • A capability flag for the surface.
  • A surface-specific eval rubric.
  • A surface-specific budget table for the graph nodes.
  • Surface-specific serving capacity (warm floor, provisioned throughput).

The shared substrate (skills, sub-agents, eval harness, observability) doesn't grow linearly with surfaces. The marginal cost of a new surface is the surface-specific configuration, not new agent code.

The cap on this is when surfaces have fundamentally different reasoning needs — at that point you fork sub-agents, not topologies. Voice and chat share the same agent; "complex multi-day reasoning research" wouldn't.

Q6. The harness now has voice-vs-chat differences in latency budget, fanout, eval rubric, serving, model. How do you avoid configuration sprawl?

We use a surface profile abstraction in config:

surface_profiles:
  chat:
    latency: { first_token_p95_ms: 1400, full_p95_ms: 6000 }
    fanout_cap: 5
    eval_rubric: rubrics/chat-v3.yaml
    serving: { warm_floor_pct: 150, provisioned: false }
    model_routing: { planner: sonnet-4-7, sub_agents: sonnet-4-7 }
  voice:
    latency: { first_token_p95_ms: 600, full_p95_ms: 2500 }
    fanout_cap: 2
    eval_rubric: rubrics/voice-v1.yaml
    serving: { warm_floor_pct: 200, provisioned: true }
    model_routing: { planner: sonnet-4-7, sub_agents: haiku-4-5 }
    speculative_preplan: true
    streaming_mid_compose: true

One YAML per surface; the surface ID is a request attribute; the harness picks the profile and applies it. Adding a new surface = adding a new profile, not editing 14 codepaths.
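
A sketch of profile selection at request time (the file name, request shape, and field names are illustrative; yaml is PyYAML):

import yaml   # PyYAML

def load_profile(path: str, surface: str) -> dict:
    with open(path) as f:
        return yaml.safe_load(f)["surface_profiles"][surface]

def configure_request(request, profiles_path: str = "surface_profiles.yaml"):
    # The surface ID rides in as a request attribute; the harness resolves
    # the profile once and applies it everywhere downstream.
    profile = load_profile(profiles_path, request.surface)
    request.fanout_cap = profile["fanout_cap"]
    request.eval_rubric = profile["eval_rubric"]
    request.model_routing = profile["model_routing"]
    request.speculative_preplan = profile.get("speculative_preplan", False)
    request.streaming_mid_compose = profile.get("streaming_mid_compose", False)
    return request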


Intuition gained

  • Tightening latency is graph reshape, not "use a faster model." The synchronous critical path must do less.
  • Parallelism (speculative pre-plan, streaming-mid-compose, parallel safety) trades cost for latency. Cost goes up; eval-gated.
  • Surface profiles keep configuration sprawl bounded as new surfaces are added.
  • Per-surface eval rubric + shared baseline preserves cross-surface quality comparability.
  • Voice and chat share the same agent code — the difference is config, not implementation.

See also

  • 01-10x-user-surge.md — opposite end (latency relaxes, capacity tight)
  • 03-cost-budget-halved.md — opposite trade-off (cost down, quality protected)
  • User stories 01, 06, 07