User Story 01 — Execution Flow Design
Pillar: P1 (AI Workflow) · Stage unlocked: 1 → 3 · Reading time: ~12 min
TL;DR
At Amazon scale, MangaAssist cannot be a free-form ReAct loop. The execution flow has to be a typed graph with explicit states, where each transition has a budget, a fallback, and an observability handle. The minute you ship a free-form loop to 1.2M concurrent sessions, you discover that 0.3% of them never terminate, and 0.3% of 1.2M is 3,600 zombie sessions per peak minute.
The User Story
As the principal engineer for MangaAssist, I want a typed execution graph that defines every legal state, transition, budget, and fallback for a user turn, so that I can reason about correctness, cost, and latency before the system goes live and respond to incidents during live traffic without re-reading prompts.
Acceptance criteria
- Every node has a max latency budget and a max token budget; exceeding either forces a defined fallback edge.
- Every node emits a structured trace event with `node_id`, `model_id`, `tool_calls`, `tokens_in/out`, `cost_usd`, `latency_ms` (sketched after this list).
- The graph is introspectable at runtime — on-call can ask "where is session X stuck?" and get a node name, not a stack trace.
- The graph is versioned; deploys can canary a graph version against a control graph version.
- No node has implicit fallbacks. If a fallback isn't drawn, the request fails closed with a typed error.
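A minimal sketch of the trace event from the second criterion as a typed record; the field names come from the list above, while the class name and the `emit` helper are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class TraceEvent:
    # Canonical per-node trace record; every node emits exactly one.
    node_id: str
    model_id: str | None              # None for deterministic tool nodes
    tool_calls: list[str] = field(default_factory=list)
    tokens_in: int = 0
    tokens_out: int = 0
    cost_usd: float = 0.0
    latency_ms: float = 0.0
    graph_version: str = "v0"         # ties the event to a canaryable topology
    session_id: str = ""

def emit(event: TraceEvent) -> None:
    # Illustrative sink — in production this goes to the trace store,
    # keyed so on-call can answer "where is session X stuck?" by node name.
    print(event)
```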
Why a graph (not a free-form loop)
| Approach | Strength | Why it fails MangaAssist |
|---|---|---|
| Free-form ReAct (LLM decides next step) | Maximum flexibility, fewer LoC | At p99.9, 1 in 1000 turns loops indefinitely. At 2.4B/day, that's 2.4M zombie loops. |
| Pure DAG (one path through) | Predictable, easy to trace | Manga catalog questions need conditional fan-out (recommendation? policy? order?). A single fixed path can't branch. |
| Typed state machine with bounded fan-out | Branchable, traceable, budgetable | Slightly more upfront design — pays back 100× in incident response |
We pick the third. Every state machine transition is one of the following (typed sketch below):
- Decision (LLM call → next state)
- Tool (deterministic external call)
- Fan-out (parallel sub-agent invocations)
- Join (merge sub-agent results)
- Terminal (response ready, error envelope, or escalate-to-human)
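A hedged sketch of those five transition kinds as types; the enum and dataclass names are assumptions, not a prescribed API.

```python
from dataclasses import dataclass
from enum import Enum, auto

class TransitionKind(Enum):
    DECISION = auto()   # LLM call chooses the next state
    TOOL = auto()       # deterministic external call
    FAN_OUT = auto()    # parallel sub-agent invocations
    JOIN = auto()       # merge sub-agent results
    TERMINAL = auto()   # response ready, error envelope, or escalate-to-human

@dataclass(frozen=True)
class Transition:
    kind: TransitionKind
    source: str                # node_id the edge leaves
    target: str                # node_id the edge enters
    is_fallback: bool = False  # fallback edges are drawn explicitly, never implicit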
The graph for one MangaAssist turn
stateDiagram-v2
[*] --> Ingress
Ingress --> SafetyPreCheck: validate input + rate-limit
SafetyPreCheck --> Plan: input safe
SafetyPreCheck --> Refusal: blocked / abusive
Plan --> ToolDispatch: tools selected
Plan --> DirectAnswer: no tool needed (small-talk)
ToolDispatch --> ParallelTools: 2+ tools
ToolDispatch --> SingleTool: 1 tool
ParallelTools --> Join
SingleTool --> Join
Join --> Compose: aggregate results
Compose --> SafetyPostCheck: draft answer ready
SafetyPostCheck --> Stream: passes guardrails
SafetyPostCheck --> SafeRewrite: minor violation
SafeRewrite --> Stream
SafetyPostCheck --> Refusal: hard block
Stream --> Eval: response delivered
DirectAnswer --> Eval
Refusal --> Eval
Eval --> [*]: log judgment
Every box is a node. Every arrow is a typed transition with a budget.
Node budget table (production values)
| Node | p95 latency | p99 latency | Token budget | Cost cap | Fallback on breach |
|---|---|---|---|---|---|
| Ingress | 5 ms | 15 ms | — | — | hard 429 |
| SafetyPreCheck | 80 ms | 200 ms | 1K | $0.0002 | bypass with audit log |
| Plan | 350 ms | 900 ms | 4K | $0.004 | downgrade to Haiku |
| SingleTool | 200 ms | 800 ms | 2K | $0.001 | cached fallback |
| ParallelTools (longest leg) | 600 ms | 1.5 s | 8K | $0.004 | partial-results path |
| Compose | 700 ms | 1.8 s | 8K | $0.006 | template response |
| SafetyPostCheck | 150 ms | 400 ms | 2K | $0.001 | conservative refusal |
| Stream | 4 s | 6 s | 8K | $0.008 | flush partial + apologize |
| Eval (async, off-path) | n/a | n/a | 2K | $0.0008 | drop sample |
Total p95 budget for the synchronous path: 1.4 s first token, 6.0 s final token. That matches the SLA in the overview.
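One way the budget table could be enforced at runtime, sketched under assumed names (`NodeBudget`, `run_with_budget`): each node runs under a guard that checks latency, tokens, and cost after the call and, on breach, routes to the drawn fallback edge instead of raising.

```python
import time
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class NodeBudget:
    p99_latency_ms: float
    max_tokens: int
    cost_cap_usd: float
    fallback_node: str              # the fallback edge drawn on the graph

@dataclass
class NodeResult:
    next_node: str
    tokens_used: int = 0
    cost_usd: float = 0.0
    degraded: bool = False

def run_with_budget(node_fn: Callable[[], NodeResult], budget: NodeBudget) -> NodeResult:
    # A real runtime would also enforce the latency cap preemptively
    # (e.g. via an async timeout); this sketch only checks after the fact.
    start = time.monotonic()
    result = node_fn()
    elapsed_ms = (time.monotonic() - start) * 1000
    breached = (
        elapsed_ms > budget.p99_latency_ms
        or result.tokens_used > budget.max_tokens
        or result.cost_usd > budget.cost_cap_usd
    )
    if breached:
        # Breach is a typed transition: take the fallback edge, mark the turn degraded.
        return NodeResult(next_node=budget.fallback_node,
                          tokens_used=result.tokens_used,
                          cost_usd=result.cost_usd,
                          degraded=True)
    return result
```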
What "AI-first" looks like in this design
A non-AI-first team writes the prompt first, then asks "where does this live?" An AI-first team writes the graph first, then the prompts are fillable contracts inside each node.
The graph is the artifact reviewed in design docs, the unit tested in CI, the dashboard rendered on the wall. Prompts live inside it.
Production pitfalls and how the graph saves you
Pitfall 1 — The prompt that "knows too much"
A free-form prompt that says "if the user asks about an order, call the order tool, and if the order tool fails, try the cache, and if that fails, escalate" is a state machine encoded as English. It will diverge from reality the moment one branch is added without updating the prompt.
Fix: the graph IS the state machine. The prompt only handles the local decision, never the orchestration.
Pitfall 2 — Silent infinite loops
ReAct agents looping over "thought → action → observation" can loop forever if the LLM keeps proposing the same action. At 2.4B turns/day, even a 0.01% loop rate = 240K stuck turns.
Fix: every transition increments a step_count. Hard cap at 8. Beyond that, force-terminate to Compose with whatever evidence exists.
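A minimal sketch of that guard; the cap of 8 comes from the fix above, the function and state names are illustrative.

```python
MAX_STEPS = 8  # hard cap from the fix above

def next_node(proposed: str, state: dict) -> str:
    # Every transition increments step_count; past the cap, force-terminate
    # to Compose with whatever evidence has been gathered so far.
    state["step_count"] = state.get("step_count", 0) + 1
    if state["step_count"] > MAX_STEPS:
        state["forced_termination"] = True
        return "Compose"
    return proposed
```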
Pitfall 3 — Implicit fallbacks
"If the recommender is down, just answer without it" sounds fine until the eval shows answer quality drops 40% silently because nobody noticed the recommender was down for 3 days.
Fix: every fallback edge emits a degraded_mode event. SLO dashboard tracks degraded_mode_rate. Alarm at >2%.
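A sketch of the degraded-mode accounting: each fallback-edge traversal emits an event, and a rolling rate feeds the >2% alarm. The class and threshold names are assumptions.

```python
from collections import deque

DEGRADED_ALARM_THRESHOLD = 0.02  # alarm at >2%, per the fix above

class DegradedModeTracker:
    def __init__(self, window: int = 10_000):
        self._turns = deque(maxlen=window)   # rolling window of recent turns

    def record_turn(self, took_fallback_edge: bool, node_id: str = "") -> None:
        self._turns.append(took_fallback_edge)
        if took_fallback_edge:
            # In production this is a structured degraded_mode trace event.
            print(f"degraded_mode node={node_id}")

    @property
    def degraded_mode_rate(self) -> float:
        return sum(self._turns) / len(self._turns) if self._turns else 0.0

    def should_alarm(self) -> bool:
        return self.degraded_mode_rate > DEGRADED_ALARM_THRESHOLD
```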
Q&A drill — opening question
Q: We already use LangGraph. Isn't that the same as your typed state machine?
LangGraph gives you the primitive — a graph runtime. It does not give you:
- Per-node latency / token / cost budgets enforced at runtime.
- A canonical trace event schema across all nodes.
- A versioning + canary strategy for graph topology changes.
- Fallback edges that are types, not exception handlers.
LangGraph is the engine. The user story here is the policy layered on top of any engine. We've shipped the same policy on three different runtimes.
Grilling — Round 1
Q1. Why not let the LLM plan the graph dynamically? Modern models can decompose tasks well.
Because dynamic plans cannot be canaried. If 3% of traffic gets a new prompt that produces a new plan shape, you cannot diff it against the control. With a typed graph, you canary the graph version (5% on v17, 95% on v16) and the eval delta is interpretable.
Dynamic plans also break cost forecasting. CFO asks "what does this cost at 10× users?" and the answer "it depends on what the LLM decides each turn" is not an answer.
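A sketch of what canarying the graph version as config might look like; the split mirrors the 5%/95% example above, and the field names plus the rollback-threshold wiring are illustrative.

```python
# Hypothetical canary config: the unit of rollout is the graph version,
# so the eval delta between arms stays interpretable.
GRAPH_CANARY = {
    "control": {"graph_version": "v16", "traffic_pct": 95},
    "canary":  {"graph_version": "v17", "traffic_pct": 5},
    "rollback_on": {"eval_pass_rate_drop_pct": 5},   # auto-rollback criterion
}

def pick_graph_version(session_hash: int) -> str:
    # Deterministic bucketing keeps a session on one arm for its whole lifetime.
    bucket = session_hash % 100
    if bucket < GRAPH_CANARY["canary"]["traffic_pct"]:
        return GRAPH_CANARY["canary"]["graph_version"]
    return GRAPH_CANARY["control"]["graph_version"]
```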
Q2. How do you keep the graph from sprawling? You'll end up with 200 nodes.
Two rules:
- Composition over inflation. A new tool is a registration in the ToolDispatch node, not a new node. The graph stays around 12 nodes; the tool registry grows.
- Sub-graph per domain. Order workflows, recommendation workflows, support workflows each get their own sub-graph mounted under one parent node. The parent stays simple.
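A sketch of the first rule: adding a tool is a registry entry consumed by ToolDispatch, not a new node. The registry shape and the example tool names are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolSpec:
    name: str
    description: str        # what the planner / tool router sees
    timeout_ms: int
    cacheable: bool = False

TOOL_REGISTRY: dict[str, ToolSpec] = {}

def register_tool(spec: ToolSpec) -> None:
    # Product growth lands here; the graph itself stays around 12 nodes.
    TOOL_REGISTRY[spec.name] = spec

register_tool(ToolSpec("order-status", "Look up an order", timeout_ms=800))
register_tool(ToolSpec("user-prefs-mcp", "Fetch reading preferences",
                       timeout_ms=600, cacheable=True))
```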
Q3. What's actually stored at each step? The whole conversation?
No. Each node persists three things only: input_hash, output_hash, decision_id. Full payloads go to the trace store with a TTL of 30 days for debugging, redacted of PII. The state machine itself stores only the minimum to resume.
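A sketch of that per-node checkpoint: hashes and a decision id in the state machine, full payloads only in the TTL'd trace store. Names and the example values are illustrative.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class NodeCheckpoint:
    node_id: str
    input_hash: str       # hash only — payload lives in the trace store (30-day TTL)
    output_hash: str
    decision_id: str      # enough to resume or replay deterministically

def _digest(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()

checkpoint = NodeCheckpoint(
    node_id="Plan",
    input_hash=_digest(b"<redacted user turn>"),
    output_hash=_digest(b'{"selected_tools": ["user-prefs-mcp"]}'),
    decision_id="decision-0001",   # hypothetical id
)
```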
Grilling — Round 2 (architect-level)
Q4. Walk me through the blast radius if the Plan node's prompt has a bug that drops the user's locale.
- Detection: the per-locale eval set's pass rate drops on the canary graph version. Caught in <30 min if eval throughput hits the targeted 50K judgments/hour.
- Containment: canary auto-rollback at 5% pass-rate degradation. Topology versioning means rollback is one config flag, not a redeploy.
- Backfill: re-run impacted sessions through the corrected graph in the offline replay harness; for sessions that already shipped a wrong answer, surface a banner ("we updated our recommendation — reload to see the latest") rather than silently restating, which erodes trust.
- Postmortem hook: the `Plan` node trace contains the input AND the structured output (selected tools list). You can query "give me all canary sessions where `selected_tools` did not include `user-prefs-mcp`" and recover the impact set deterministically (query sketched below).
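A sketch of that impact query over the trace store, assuming each event carries the graph version and the Plan node's selected_tools output; the filtering code itself is illustrative.

```python
def impacted_sessions(trace_events: list[dict], canary_version: str = "v17") -> set[str]:
    # All canary sessions whose Plan decision omitted user-prefs-mcp.
    return {
        ev["session_id"]
        for ev in trace_events
        if ev["node_id"] == "Plan"
        and ev["graph_version"] == canary_version
        and "user-prefs-mcp" not in ev.get("selected_tools", [])
    }
```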
Q5. The graph forces budgets. What if the user's question genuinely needs 12 seconds — say, a complex 5-volume recommendation?
Budget breach is a typed transition, not a hard kill. Three options at the breach point:
1. Downshift the model at the next decision (Haiku instead of Sonnet) — measured at ~40% latency savings for ~5% quality loss.
2. Drop a tool — if the trending-MCP is contributing 2 of the 12 seconds and the user didn't ask about trending, skip it.
3. Flush partial + offer continuation — stream what we have and append "want me to keep digging?"
The choice between these is policy, encoded as fallback edges. Different user tiers get different policies (Prime gets policy 1+3, free tier gets 2).
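A sketch of that policy as data on the fallback edges, matching the tier split above; the enum and dict names are assumptions.

```python
from enum import Enum

class BreachPolicy(Enum):
    DOWNSHIFT_MODEL = 1   # option 1: cheaper, faster model at the next decision
    DROP_TOOL = 2         # option 2: skip a non-essential tool
    FLUSH_PARTIAL = 3     # option 3: stream what we have + offer continuation

# Policy lives on the fallback edges as data, not buried in prompts.
TIER_BREACH_POLICY = {
    "prime": [BreachPolicy.DOWNSHIFT_MODEL, BreachPolicy.FLUSH_PARTIAL],
    "free":  [BreachPolicy.DROP_TOOL],
}
```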
Q6. How does this graph survive when we go from 7 tools to 70 (constraint scenario 6)?
The graph topology does not change. What changes:
- ToolDispatch node grows a tool-router LLM that selects from 70 tools instead of inlining them in the planner prompt.
- ParallelTools adds a fan-out cap (max 5 tools per turn) — measured to be the saturation point for context budget.
- The tool registry becomes the primary surface for additions, not the prompt.
This is exactly the LLD (Pillar 3) moment: the right abstraction at tool-registration time keeps the graph stable through 10× tool growth.
Intuition gained
- The graph is the architecture. The prompt is configuration inside the architecture. AI-first teams design top-down from the graph.
- Budgets are types. Latency, tokens, cost — these are all node-local types that the runtime enforces. They are not nice-to-haves logged after the fact.
- Fallbacks are first-class. If a fallback isn't drawn on the graph, it doesn't exist. There is no implicit "and then we figure something out."
- Versioning the graph is more important than versioning the prompt. Prompt changes are local edits; graph changes are architectural and need canary discipline.
See also
- `02-skill-composition-and-invocation.md` — what lives inside `ToolDispatch`
- `03-sub-agent-orchestration.md` — what `ParallelTools` and `Join` actually do
- `06-checkpointing-and-serving.md` — how the graph survives instance failure mid-turn
- `Changing-Constraints-Scenarios/04-latency-sla-tightened.md` — when the budget table compresses 3×