User Story 01 — Execution Flow Design
Pillar: P1 (AI Workflow) · Stage unlocked: 1 → 3 · Reading time: ~12 min
TL;DR
At Amazon scale, MangaAssist cannot be a free-form ReAct loop. The execution flow has to be a typed graph with explicit states, where each transition has a budget, a fallback, and an observability handle. The minute you ship a free-form loop to 1.2M concurrent sessions, you discover that 0.3% of them never terminate, and 0.3% of 1.2M is 3,600 zombie sessions per peak minute.
The User Story
As the principal engineer for MangaAssist, I want a typed execution graph that defines every legal state, transition, budget, and fallback for a user turn, so that I can reason about correctness, cost, and latency before the system goes live and respond to incidents during live traffic without re-reading prompts.
Acceptance criteria
- Every node has a max latency budget and a max token budget; exceeding either forces a defined fallback edge.
- Every node emits a structured trace event with `node_id`, `model_id`, `tool_calls`, `tokens_in/out`, `cost_usd`, `latency_ms` (sketched after this list).
- The graph is introspectable at runtime — on-call can ask "where is session X stuck?" and get a node name, not a stack trace.
- The graph is versioned; deploys can canary a graph version against a control graph version.
- No node has implicit fallbacks. If a fallback isn't drawn, the request fails closed with a typed error.
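A minimal sketch of the trace event from the second criterion as a typed record; the field names come from the list above, while the class name and the `emit` helper are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class TraceEvent:
    # Canonical per-node trace record; every node emits exactly one.
    node_id: str
    model_id: str | None              # None for deterministic tool nodes
    tool_calls: list[str] = field(default_factory=list)
    tokens_in: int = 0
    tokens_out: int = 0
    cost_usd: float = 0.0
    latency_ms: float = 0.0
    graph_version: str = "v0"         # ties the event to a canaryable topology
    session_id: str = ""

def emit(event: TraceEvent) -> None:
    # Illustrative sink — in production this goes to the trace store,
    # keyed so on-call can answer "where is session X stuck?" by node name.
    print(event)
```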
Why a graph (not a free-form loop)
| Approach | Strength | Why it fails MangaAssist |
|---|---|---|
| Free-form ReAct (LLM decides next step) | Maximum flexibility, fewer LoC | At p99.9, 1 in 1000 turns loops indefinitely. At 2.4B/day, that's 2.4M zombie loops. |
| Pure DAG (one path through) | Predictable, easy to trace | Manga catalog questions need conditional fan-out (recommendation? policy? order?). A single fixed path can't branch. |
| Typed state machine with bounded fan-out | Branchable, traceable, budgetable | Slightly more upfront design — pays back 100× in incident response |
We pick the third. Every state machine transition is one of the following (typed sketch below):
- Decision (LLM call → next state)
- Tool (deterministic external call)
- Fan-out (parallel sub-agent invocations)
- Join (merge sub-agent results)
- Terminal (response ready, error envelope, or escalate-to-human)
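A hedged sketch of those five transition kinds as types; the enum and dataclass names are assumptions, not a prescribed API.

```python
from dataclasses import dataclass
from enum import Enum, auto

class TransitionKind(Enum):
    DECISION = auto()   # LLM call chooses the next state
    TOOL = auto()       # deterministic external call
    FAN_OUT = auto()    # parallel sub-agent invocations
    JOIN = auto()       # merge sub-agent results
    TERMINAL = auto()   # response ready, error envelope, or escalate-to-human

@dataclass(frozen=True)
class Transition:
    kind: TransitionKind
    source: str                # node_id the edge leaves
    target: str                # node_id the edge enters
    is_fallback: bool = False  # fallback edges are drawn explicitly, never implicit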
The graph for one MangaAssist turn
stateDiagram-v2
[*] --> Ingress
Ingress --> SafetyPreCheck: validate input + rate-limit
SafetyPreCheck --> Plan: input safe
SafetyPreCheck --> Refusal: blocked / abusive
Plan --> ToolDispatch: tools selected
Plan --> DirectAnswer: no tool needed (small-talk)
ToolDispatch --> ParallelTools: 2+ tools
ToolDispatch --> SingleTool: 1 tool
ParallelTools --> Join
SingleTool --> Join
Join --> Compose: aggregate results
Compose --> SafetyPostCheck: draft answer ready
SafetyPostCheck --> Stream: passes guardrails
SafetyPostCheck --> SafeRewrite: minor violation
SafeRewrite --> Stream
SafetyPostCheck --> Refusal: hard block
Stream --> Eval: response delivered
DirectAnswer --> Eval
Refusal --> Eval
Eval --> [*]: log judgment
Every box is a node. Every arrow is a typed transition with a budget.
Node budget table (production values)
| Node | p95 latency | p99 latency | Token budget | Cost cap | Fallback on breach |
|---|---|---|---|---|---|
| Ingress | 5 ms | 15 ms | — | — | hard 429 |
| SafetyPreCheck | 80 ms | 200 ms | 1K | $0.0002 | bypass with audit log |
| Plan | 350 ms | 900 ms | 4K | $0.004 | downgrade to Haiku |
| SingleTool | 200 ms | 800 ms | 2K | $0.001 | cached fallback |
| ParallelTools (longest leg) | 600 ms | 1.5 s | 8K | $0.004 | partial-results path |
| Compose | 700 ms | 1.8 s | 8K | $0.006 | template response |
| SafetyPostCheck | 150 ms | 400 ms | 2K | $0.001 | conservative refusal |
| Stream | 4 s | 6 s | 8K | $0.008 | flush partial + apologize |
| Eval (async, off-path) | n/a | n/a | 2K | $0.0008 | drop sample |
Total p95 budget for the synchronous path: 1.4 s first token, 6.0 s final token. That matches the SLA in the overview.
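One way the budget table could be enforced at runtime, sketched under assumed names (`NodeBudget`, `run_with_budget`): each node runs under a guard that checks latency, tokens, and cost after the call and, on breach, routes to the drawn fallback edge instead of raising.

```python
import time
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class NodeBudget:
    p99_latency_ms: float
    max_tokens: int
    cost_cap_usd: float
    fallback_node: str              # the fallback edge drawn on the graph

@dataclass
class NodeResult:
    next_node: str
    tokens_used: int = 0
    cost_usd: float = 0.0
    degraded: bool = False

def run_with_budget(node_fn: Callable[[], NodeResult], budget: NodeBudget) -> NodeResult:
    # A real runtime would also enforce the latency cap preemptively
    # (e.g. via an async timeout); this sketch only checks after the fact.
    start = time.monotonic()
    result = node_fn()
    elapsed_ms = (time.monotonic() - start) * 1000
    breached = (
        elapsed_ms > budget.p99_latency_ms
        or result.tokens_used > budget.max_tokens
        or result.cost_usd > budget.cost_cap_usd
    )
    if breached:
        # Breach is a typed transition: take the fallback edge, mark the turn degraded.
        return NodeResult(next_node=budget.fallback_node,
                          tokens_used=result.tokens_used,
                          cost_usd=result.cost_usd,
                          degraded=True)
    return result
```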
What "AI-first" looks like in this design
A non-AI-first team writes the prompt first, then asks "where does this live?" An AI-first team writes the graph first, then the prompts are fillable contracts inside each node.
The graph is the artifact reviewed in design docs, the unit tested in CI, the dashboard rendered on the wall. Prompts live inside it.
Production pitfalls and how the graph saves you
Pitfall 1 — The prompt that "knows too much"
A free-form prompt that says "if the user asks about an order, call the order tool, and if the order tool fails, try the cache, and if that fails, escalate" is a state machine encoded as English. It will diverge from reality the moment one branch is added without updating the prompt.
Fix: the graph IS the state machine. The prompt only handles the local decision, never the orchestration.
Pitfall 2 — Silent infinite loops
ReAct agents looping over "thought → action → observation" can loop forever if the LLM keeps proposing the same action. At 2.4B turns/day, even a 0.01% loop rate = 240K stuck turns.
Fix: every transition increments a step_count. Hard cap at 8. Beyond that, force-terminate to Compose with whatever evidence exists.
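A minimal sketch of that guard; the cap of 8 comes from the fix above, the function and state names are illustrative.

```python
MAX_STEPS = 8  # hard cap from the fix above

def next_node(proposed: str, state: dict) -> str:
    # Every transition increments step_count; past the cap, force-terminate
    # to Compose with whatever evidence has been gathered so far.
    state["step_count"] = state.get("step_count", 0) + 1
    if state["step_count"] > MAX_STEPS:
        state["forced_termination"] = True
        return "Compose"
    return proposed
```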
Pitfall 3 — Implicit fallbacks
"If the recommender is down, just answer without it" sounds fine until the eval shows answer quality drops 40% silently because nobody noticed the recommender was down for 3 days.
Fix: every fallback edge emits a degraded_mode event. SLO dashboard tracks degraded_mode_rate. Alarm at >2%.
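A sketch of the degraded-mode accounting: each fallback-edge traversal emits an event, and a rolling rate feeds the >2% alarm. The class and threshold names are assumptions.

```python
from collections import deque

DEGRADED_ALARM_THRESHOLD = 0.02  # alarm at >2%, per the fix above

class DegradedModeTracker:
    def __init__(self, window: int = 10_000):
        self._turns = deque(maxlen=window)   # rolling window of recent turns

    def record_turn(self, took_fallback_edge: bool, node_id: str = "") -> None:
        self._turns.append(took_fallback_edge)
        if took_fallback_edge:
            # In production this is a structured degraded_mode trace event.
            print(f"degraded_mode node={node_id}")

    @property
    def degraded_mode_rate(self) -> float:
        return sum(self._turns) / len(self._turns) if self._turns else 0.0

    def should_alarm(self) -> bool:
        return self.degraded_mode_rate > DEGRADED_ALARM_THRESHOLD
```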
Q&A drill — opening question
Q: We already use LangGraph. Isn't that the same as your typed state machine?
LangGraph gives you the primitive — a graph runtime. It does not give you:
- Per-node latency / token / cost budgets enforced at runtime.
- A canonical trace event schema across all nodes.
- A versioning + canary strategy for graph topology changes.
- Fallback edges that are types, not exception handlers.
LangGraph is the engine. The user story here is the policy layered on top of any engine. We've shipped the same policy on three different runtimes.
Grilling — Round 1
Q1. Why not let the LLM plan the graph dynamically? Modern models can decompose tasks well.
Because dynamic plans cannot be canaried. If 3% of traffic gets a new prompt that produces a new plan shape, you cannot diff it against the control. With a typed graph, you canary the graph version (5% on v17, 95% on v16) and the eval delta is interpretable.
Dynamic plans also break cost forecasting. CFO asks "what does this cost at 10× users?" and the answer "it depends on what the LLM decides each turn" is not an answer.
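A sketch of what canarying the graph version as config might look like; the split mirrors the 5%/95% example above, and the field names plus the rollback-threshold wiring are illustrative.

```python
# Hypothetical canary config: the unit of rollout is the graph version,
# so the eval delta between arms stays interpretable.
GRAPH_CANARY = {
    "control": {"graph_version": "v16", "traffic_pct": 95},
    "canary":  {"graph_version": "v17", "traffic_pct": 5},
    "rollback_on": {"eval_pass_rate_drop_pct": 5},   # auto-rollback criterion
}

def pick_graph_version(session_hash: int) -> str:
    # Deterministic bucketing keeps a session on one arm for its whole lifetime.
    bucket = session_hash % 100
    if bucket < GRAPH_CANARY["canary"]["traffic_pct"]:
        return GRAPH_CANARY["canary"]["graph_version"]
    return GRAPH_CANARY["control"]["graph_version"]
```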
Q2. How do you keep the graph from sprawling? You'll end up with 200 nodes.
Two rules:
- Composition over inflation. A new tool is a registration in the ToolDispatch node, not a new node. The graph stays around 12 nodes; the tool registry grows.
- Sub-graph per domain. Order workflows, recommendation workflows, support workflows each get their own sub-graph mounted under one parent node. The parent stays simple.
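A sketch of the first rule: adding a tool is a registry entry consumed by ToolDispatch, not a new node. The registry shape and the example tool names are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolSpec:
    name: str
    description: str        # what the planner / tool router sees
    timeout_ms: int
    cacheable: bool = False

TOOL_REGISTRY: dict[str, ToolSpec] = {}

def register_tool(spec: ToolSpec) -> None:
    # Product growth lands here; the graph itself stays around 12 nodes.
    TOOL_REGISTRY[spec.name] = spec

register_tool(ToolSpec("order-status", "Look up an order", timeout_ms=800))
register_tool(ToolSpec("user-prefs-mcp", "Fetch reading preferences",
                       timeout_ms=600, cacheable=True))
```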
Q3. What's actually stored at each step? The whole conversation?
No. Each node persists three things only: input_hash, output_hash, decision_id. Full payloads go to the trace store with a TTL of 30 days for debugging, redacted of PII. The state machine itself stores only the minimum to resume.
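A sketch of that per-node checkpoint: hashes and a decision id in the state machine, full payloads only in the TTL'd trace store. Names and the example values are illustrative.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class NodeCheckpoint:
    node_id: str
    input_hash: str       # hash only — payload lives in the trace store (30-day TTL)
    output_hash: str
    decision_id: str      # enough to resume or replay deterministically

def _digest(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()

checkpoint = NodeCheckpoint(
    node_id="Plan",
    input_hash=_digest(b"<redacted user turn>"),
    output_hash=_digest(b'{"selected_tools": ["user-prefs-mcp"]}'),
    decision_id="decision-0001",   # hypothetical id
)
```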
Grilling — Round 2 (architect-level)
Q4. Walk me through the blast radius if the Plan node's prompt has a bug that drops the user's locale.
- Detection: the per-locale eval set's pass rate drops on the canary graph version. Caught in <30 min if eval throughput hits the targeted 50K judgments/hour.
- Containment: canary auto-rollback at 5% pass-rate degradation. Topology versioning means rollback is one config flag, not a redeploy.
- Backfill: re-run impacted sessions through the corrected graph in the offline replay harness; for sessions that already shipped a wrong answer, surface a banner ("we updated our recommendation — reload to see the latest") rather than silently restating, which erodes trust.
- Postmortem hook: the `Plan` node trace contains the input AND the structured output (selected tools list). You can query "give me all canary sessions where `selected_tools` did not include `user-prefs-mcp`" and recover the impact set deterministically (query sketched below).
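A sketch of that impact query over the trace store, assuming each event carries the graph version and the Plan node's selected_tools output; the filtering code itself is illustrative.

```python
def impacted_sessions(trace_events: list[dict], canary_version: str = "v17") -> set[str]:
    # All canary sessions whose Plan decision omitted user-prefs-mcp.
    return {
        ev["session_id"]
        for ev in trace_events
        if ev["node_id"] == "Plan"
        and ev["graph_version"] == canary_version
        and "user-prefs-mcp" not in ev.get("selected_tools", [])
    }
```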
Q5. The graph forces budgets. What if the user's question genuinely needs 12 seconds — say, a complex 5-volume recommendation?
Budget breach is a typed transition, not a hard kill. Three options at the breach point:
1. Downshift the model at the next decision (Haiku instead of Sonnet) — measured at ~40% latency savings for ~5% quality loss.
2. Drop a tool — if the trending-MCP is contributing 2 of the 12 seconds and the user didn't ask about trending, skip it.
3. Flush partial + offer continuation — stream what we have and append "want me to keep digging?"
The choice between these is policy, encoded as fallback edges. Different user tiers get different policies (Prime gets policy 1+3, free tier gets 2).
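A sketch of that policy as data on the fallback edges, matching the tier split above; the enum and dict names are assumptions.

```python
from enum import Enum

class BreachPolicy(Enum):
    DOWNSHIFT_MODEL = 1   # option 1: cheaper, faster model at the next decision
    DROP_TOOL = 2         # option 2: skip a non-essential tool
    FLUSH_PARTIAL = 3     # option 3: stream what we have + offer continuation

# Policy lives on the fallback edges as data, not buried in prompts.
TIER_BREACH_POLICY = {
    "prime": [BreachPolicy.DOWNSHIFT_MODEL, BreachPolicy.FLUSH_PARTIAL],
    "free":  [BreachPolicy.DROP_TOOL],
}
```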
Q6. How does this graph survive when we go from 7 tools to 70 (constraint scenario 6)?
The graph topology does not change. What changes:
- ToolDispatch node grows a tool-router LLM that selects from 70 tools instead of inlining them in the planner prompt.
- ParallelTools adds a fan-out cap (max 5 tools per turn) — measured to be the saturation point for context budget.
- The tool registry becomes the primary surface for additions, not the prompt.
This is exactly the LLD (Pillar 3) moment: the right abstraction at tool-registration time keeps the graph stable through 10× tool growth.
Intuition gained
- The graph is the architecture. The prompt is configuration inside the architecture. AI-first teams design top-down from the graph.
- Budgets are types. Latency, tokens, cost — these are all node-local types that the runtime enforces. They are not nice-to-haves logged after the fact.
- Fallbacks are first-class. If a fallback isn't drawn on the graph, it doesn't exist. There is no implicit "and then we figure something out."
- Versioning the graph is more important than versioning the prompt. Prompt changes are local edits; graph changes are architectural and need canary discipline.
See also
- `02-skill-composition-and-invocation.md` — what lives inside `ToolDispatch`
- `03-sub-agent-orchestration.md` — what `ParallelTools` and `Join` actually do
- `06-checkpointing-and-serving.md` — how the graph survives instance failure mid-turn
- `Changing-Constraints-Scenarios/04-latency-sla-tightened.md` — when the budget table compresses 3×