Constraint Scenario 04 — Latency SLA Tightened (p95 first-token 1.4s → 600ms)
Trigger: A voice surface is launching for MangaAssist and requires sub-second first-byte latency. The same agent that serves chat must serve voice. Pillars stressed: P1 (graph reshape) + P2 (serving) + P3 (provider/model choice).
TL;DR
A 2.3× latency tightening cannot be hit by "make things faster everywhere." It requires re-architecting the synchronous critical path to do less work, with the parts that can be deferred moved to a parallel-or-async track. The graph (story 01) and the streaming policy (story 07) are the primary levers; the cost is more compute and more complexity, not lower quality.
The change
| Metric | Before (chat) | After (voice + chat) |
|---|---|---|
| p95 first-token latency | 1.4s | 600ms |
| p95 full-response latency | 6s | 2.5s for voice utterances; 6s for chat retained |
| Audio first-byte (voice only) | n/a | 800ms |
| Cost/turn budget | $0.018 | $0.022 (+22% acceptable) |
| Quality bar | unchanged | unchanged |
The cost increases because we'll pay for more parallel work, more pre-warmed capacity, and more aggressive caching. Quality is held constant — that's the gate.
What's in the synchronous critical path today
From the graph (story 01) at chat steady state:
Ingress → SafetyPreCheck → Plan → ToolDispatch → Tools (parallel) → Compose → SafetyPostCheck → Stream

| Node | p95 contribution |
|---|---|
| Ingress | 5ms |
| SafetyPreCheck | 80ms |
| Plan | 350ms |
| ToolDispatch | ~50ms |
| Tools (parallel) | 600ms |
| Compose | 700ms |
| SafetyPostCheck | 150ms |
| Stream | first token at this point |
First-token latency = sum of all of those = ~1935ms p95. Reality is ~1.4s p95 because of overlap. Either way, the budget is split roughly:
- Plan: 25%
- Tools: 30%
- Compose: 35%
- Safety + ingress: 10%
To get to 600ms first-token, the new budget split has to be:
- Plan: 200ms (was 350)
- Tools: 200ms (was 600, fan-out limited)
- Compose: 200ms (was 700, streaming starts mid-compose)
- Safety + ingress: 50ms (was 230)
Each one is a redesign, not a tweak.
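Those per-node budgets can be enforced rather than hoped for. A minimal sketch in Python, assuming an asyncio-based graph runner; `run_with_budget` and the budget table are illustrative, not the harness's real API:

```python
import asyncio

# Illustrative per-node budgets (ms) for the voice path, from the split above.
NODE_BUDGET_MS = {"plan": 200, "tools": 200, "compose": 200, "safety_ingress": 50}

async def run_with_budget(name: str, coro):
    """Run one graph node, failing fast when it blows its latency budget."""
    try:
        return await asyncio.wait_for(coro, timeout=NODE_BUDGET_MS[name] / 1000)
    except asyncio.TimeoutError:
        # A blown budget surfaces as a node error so the graph can degrade
        # (skip remaining fan-out, stream a shorter answer) instead of
        # silently dragging first-token latency past the SLA.
        raise RuntimeError(f"{name} exceeded its {NODE_BUDGET_MS[name]}ms budget")
```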
The graph reshape
```mermaid
flowchart LR
  subgraph BEFORE[Before - chat steady state]
    A1[Ingress] --> B1[Safety pre]
    B1 --> C1[Plan]
    C1 --> D1[Tools]
    D1 --> E1[Compose]
    E1 --> F1[Safety post]
    F1 --> G1[Stream]
  end
  subgraph AFTER[After - voice-grade]
    A2[Ingress] --> B2[Safety pre - cached]
    B2 --> C2[Speculative pre-plan]
    B2 --> D2[Plan]
    C2 -.cached intent.-> D2
    D2 --> E2[Tools - tighter fanout]
    E2 --> F2[Compose - streaming]
    F2 --> G2[Stream first token]
    F2 -.parallel.-> H2[Safety post on draft]
    H2 -.kill on violation.-> G2
  end
```
What changed
- Speculative pre-plan — A small classifier runs in parallel with the safety pre-check, predicting the intent (see the sketch after this list). If it agrees with the planner's later decision, we save ~150ms; if it disagrees, we discard the speculation.
- Streaming starts mid-compose, not after compose finishes. First token goes out as soon as the first sentence is composable.
- Safety post runs in parallel with stream. If it flags a violation, the stream is cancelled and a refusal is appended. This is the trade-off — sometimes the user sees a partial response that gets cancelled. Eval shows this happens <0.05% of voice turns.
- Tighter fan-out — voice surface gets fan-out=2 max (vs 5 for chat). Quality measurement: -1.5% on rubric, acceptable for the latency win.
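A minimal sketch of the speculate-then-confirm pattern, assuming an asyncio runner; `speculate_intent` and `full_plan` are hypothetical stand-ins for the classifier and planner nodes:

```python
import asyncio

async def plan_with_speculation(query: str, speculate_intent, full_plan):
    """Race a cheap intent speculator against the real planner.

    Both start immediately; the speculation only pays off because downstream
    work (tool pre-warming, prompt assembly) can begin on its guess while
    the planner is still running.
    """
    spec_task = asyncio.create_task(speculate_intent(query))  # tiny classifier
    plan_task = asyncio.create_task(full_plan(query))         # real planner
    speculated_intent = await spec_task
    # ... downstream prefetch keyed on speculated_intent would start here ...
    plan = await plan_task
    if plan.intent != speculated_intent:
        # Disagreement: discard anything prefetched for the wrong intent.
        speculated_intent = None
    return plan
```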
The serving reshape
| Lever | Effect on first-token latency |
|---|---|
| Hot voice fleet — pre-warmed, voice-only containers | -200ms cold-start tail |
| Provisioned throughput on Bedrock for voice path | -100ms model-side queuing |
| Inference cache aggressive on safety pre-check | -60ms p95 |
| Connection pool to backing tools (catalog, prefs) — keep-alive | -40ms |
| Co-locate voice fleet with model endpoint (same AZ) | -20ms |
Combined: ~420ms first-token improvement on the long pole. We hit 600ms first-token p95 with margin.
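The connection-pool lever is the simplest to show. A sketch assuming aiohttp; the pool size and timeout are illustrative, not tuned values:

```python
import aiohttp

async def make_tool_session() -> aiohttp.ClientSession:
    """One long-lived keep-alive session per worker for backing tools.

    Reusing warm TCP/TLS connections to catalog and prefs is where the
    ~40ms in the table comes from.
    """
    connector = aiohttp.TCPConnector(limit=64, keepalive_timeout=30)
    return aiohttp.ClientSession(connector=connector)
```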
The cost of getting fast
| Source | Extra cost vs chat |
|---|---|
| Provisioned throughput (reserved capacity not always used) | +12% |
| Speculative pre-plan (sometimes wasted) | +5% |
| Hot voice fleet (separate warm floor) | +8% |
| Aggressive caching infra | +2% |
| Total | ~+27% |
Budget allowance was +22%; we're at +27%. The gap closes by:
- Voice has lower fan-out (Lever 5 from the cost scenario), saving 3%.
- Voice has shorter average responses (audio brevity), saving 4% in output tokens.
Net: +20%, within budget.
What the harness contributes
| Harness piece | Role |
|---|---|
| Typed graph (story 01) | The graph reshape is a versioned topology change |
| Streaming + heartbeat (story 07) | Streaming-mid-compose pattern lives here |
| Capability flags (story 02) | Voice surface activates "voice-mode" flags; chat unaffected |
| Eval (story 04) | Voice has its own rubric (concision, audio-friendliness); per-surface eval baseline |
| Cache (story 06) | Per-cache-tier eligibility for voice |
| Provider adapters (story 02) | Voice uses provisioned throughput endpoint; chat uses on-demand |
Without the per-surface flag and per-rubric eval machinery, voice would either drag chat down or fail its own bar. With them, the two surfaces coexist cleanly.
Pitfalls
Pitfall 1 — Cancelled-stream on safety violation feels broken
The agent's draft contains something the safety post-check flags after the stream has started. The stream cancels mid-sentence and a refusal is appended. It sometimes feels jarring.
Mitigation:
- Safety pre-check is highly tuned for voice — most violations are caught before the stream starts.
- Post-check looks for catastrophic violations only (PII leak, copyright violation, mature content for non-adults). Most "soft" issues are non-cancelling.
- Cancellation rate is measured weekly; SLO < 0.1%.
Pitfall 2 — Speculative pre-plan disagrees with plan often
If the speculator's agreement rate falls below ~70%, the speculation costs more than it saves.
Mitigation:
- Speculation accuracy is monitored; alarms fire below 75%.
- The speculator is itself improvable: train on disagreements, ship updates monthly.
Pitfall 3 — Voice rubric drifts from chat rubric
Voice gets optimized for concision; eventually voice answers are good but feel curt. Cross-surface comparison shows voice users disengaging earlier.
Mitigation:
- A "shared baseline" sub-rubric (factual_grounding, intent_match, safety) is identical across surfaces. Surface-specific dimensions (concision for voice, depth for chat) are layered on top (see the sketch after this list).
- A cross-surface user-satisfaction comparison runs monthly.
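A sketch of how the layered rubric might look on disk, following the YAML config convention used later in this doc; the `extends` key, file names, and weights are illustrative, not the real rubric schema:

```yaml
# rubrics/voice-v1.yaml (layout is illustrative, not the real schema)
extends: rubrics/shared-baseline.yaml   # factual_grounding, intent_match, safety
dimensions:
  concision: { weight: 0.3 }            # voice-specific
  audio_friendliness: { weight: 0.2 }   # voice-specific
```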
Q&A drill — opening question
*Q: Why not just use a faster model? Haiku is faster than Sonnet — done.*
Two reasons:
1. Latency is dominated by the orchestration, not the LLM call. The Plan node is 350ms; the LLM call inside it is ~200ms. Even halving the LLM call (which Haiku can't do for this task without a quality regression) only gets us ~100ms.
2. Tools and compose are ~70% of the latency. Making the planner faster doesn't help if tools are still slow.
The latency cut requires rearchitecting the path, not swapping a single component.
Grilling — Round 1
Q1. Streaming-mid-compose — how does the model produce streamable output with the right structure?
The compose prompt asks for structured streaming output: a short opening sentence first, then details, then a wrap. The streaming layer is sentence-aware — emits each complete sentence as it lands. This is a prompt-engineering pattern; it doesn't always produce perfectly streamable output, but the eval rubric for voice rewards it.
A small "stream-ability" classifier rates each output. Below threshold, we fall back to the buffer-then-stream path (one extra ~200ms).
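A minimal sketch of the sentence-aware emitter, assuming an async token stream; the boundary regex is deliberately naive (it would mis-split abbreviations like "e.g."):

```python
import re

SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

async def stream_sentences(token_stream):
    """Re-chunk a raw token stream into complete sentences.

    The first complete sentence is emitted as soon as it lands, which is
    what lets streaming start mid-compose rather than after it.
    """
    buf = ""
    async for token in token_stream:
        buf += token
        parts = SENTENCE_END.split(buf)
        for sentence in parts[:-1]:  # every complete sentence so far
            yield sentence
        buf = parts[-1]              # keep the trailing fragment
    if buf.strip():
        yield buf                    # flush the remainder at end of stream
```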
Q2. Safety post running in parallel with stream — what's the user experience when it cancels?
The cancellation playback says: "Wait, I want to revise that — let me try again." and the system re-issues with a tighter constraint. The user gets ~1-2s of dead air, then a corrected response. Annoying, but rare (<0.1%) and recoverable. Better than holding everyone's first-token by 150ms for a 0.1% case.
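A sketch of the parallel post-check with kill-on-violation, again assuming asyncio; `safety_check`, `emit`, and `revise` are hypothetical stand-ins for harness components:

```python
import asyncio

async def stream_with_parallel_safety(draft_stream, safety_check, emit, revise):
    """Stream tokens while the safety post-check runs on the draft in parallel."""
    async def pump():
        async for chunk in draft_stream:
            await emit(chunk)

    stream_task = asyncio.create_task(pump())
    verdict = await safety_check()  # runs concurrently with the stream
    if verdict.violation:
        stream_task.cancel()        # kill the stream mid-sentence
        try:
            await stream_task
        except asyncio.CancelledError:
            pass
        await emit("Wait, I want to revise that — let me try again.")
        await revise()              # re-issue with a tighter constraint
    else:
        await stream_task           # clean draft: let the stream finish
```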
Q3. What about ASR (speech-to-text) on the way in? That has its own latency.
Yes. The 600ms target is first-token of the agent response, measured from the moment ASR delivers text to the agent. End-to-end mouth-to-ear latency is ~1.4s including ASR (~400ms) and TTS (~250ms first audio frame). Each segment is its own optimization; this scenario focuses on the agent path.
Grilling — Round 2 (architect-level)
Q4. Speculative pre-plan + plan can both be running. How do you not double-spend on tokens?
The speculator is a tiny model (Haiku or smaller) on a compressed prompt — its cost is ~10% of the planner's. The "wasted" speculation cost when the speculator disagrees is bounded. ROI:
- Speculator agreement rate: 80%.
- Latency saved on agreement: 150ms.
- Cost increase on disagreement: $0.0004/turn (10% of planner cost).
- Amortized cost increase: $0.00008/turn (20% disagreement rate × $0.0004).
For a 150ms first-token improvement, paying $0.00008/turn extra is overwhelmingly worth it.
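The same arithmetic as a quick expected-value check, using only the numbers from the list above:

```python
# Expected-value check for the speculative pre-plan (numbers from this scenario).
agree_rate = 0.80
expected_saving_ms = agree_rate * 150          # 120ms expected first-token saving
amortized_cost = (1 - agree_rate) * 0.0004     # $0.00008/turn of wasted speculation
print(f"{expected_saving_ms:.0f}ms saved for ${amortized_cost:.5f}/turn")
```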
Q5. How would this scale to 10× voice traffic, or to a third surface (alerts) with its own latency profile?
The graph topology pattern generalizes. Each new surface gets:
- A capability flag for the surface.
- A surface-specific eval rubric.
- A surface-specific budget table for the graph nodes.
- Surface-specific serving capacity (warm floor, provisioned throughput).
The shared substrate (skills, sub-agents, eval harness, observability) doesn't grow linearly with surfaces. The marginal cost of a new surface is the surface-specific configuration, not new agent code.
The cap on this is when surfaces have fundamentally different reasoning needs — at that point you fork sub-agents, not topologies. Voice and chat share the same agent; "complex multi-day reasoning research" wouldn't.
Q6. The harness now has voice-vs-chat differences in latency budget, fanout, eval rubric, serving, model. How do you avoid configuration sprawl?
We use a surface profile abstraction in config:
```yaml
surface_profiles:
  chat:
    latency: { first_token_p95_ms: 1400, full_p95_ms: 6000 }
    fanout_cap: 5
    eval_rubric: rubrics/chat-v3.yaml
    serving: { warm_floor_pct: 150, provisioned: false }
    model_routing: { planner: sonnet-4-7, sub_agents: sonnet-4-7 }
  voice:
    latency: { first_token_p95_ms: 600, full_p95_ms: 2500 }
    fanout_cap: 2
    eval_rubric: rubrics/voice-v1.yaml
    serving: { warm_floor_pct: 200, provisioned: true }
    model_routing: { planner: sonnet-4-7, sub_agents: haiku-4-5 }
    speculative_preplan: true
    streaming_mid_compose: true
```
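A sketch of resolving one surface's profile at startup, assuming PyYAML and the layout above; defaulting voice-only flags to off is what keeps chat unaffected:

```python
import yaml  # PyYAML

def load_surface_profile(path: str, surface: str) -> dict:
    """Resolve one surface's profile from the shared config file."""
    with open(path) as f:
        profiles = yaml.safe_load(f)["surface_profiles"]
    profile = profiles[surface]
    # Voice-only behaviours default off, so adding a flag for one surface
    # can never change another surface's behaviour.
    profile.setdefault("speculative_preplan", False)
    profile.setdefault("streaming_mid_compose", False)
    return profile
```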
Intuition gained
- Tightening latency is graph reshape, not "use a faster model." The synchronous critical path must do less.
- Parallelism (speculative pre-plan, streaming-mid-compose, parallel safety) trades cost for latency. Cost goes up; eval-gated.
- Surface profiles keep configuration sprawl bounded as new surfaces are added.
- Per-surface eval rubric + shared baseline preserves cross-surface quality comparability.
- Voice and chat share the same agent code — the difference is config, not implementation.
See also
- 01-10x-user-surge.md — opposite end (latency relaxes, capacity tight)
- 03-cost-budget-halved.md — opposite trade-off (cost down, quality protected)
- User stories 01, 06, 07