User Story 05 — Pause and Resume Workflows
Pillar: P2 (Harness) · Stage unlocked: 3 → 4 · Reading time: ~12 min
TL;DR
Some MangaAssist workflows take 2 seconds. Some take 90 seconds. Some take 4 hours (waiting on a human-in-the-loop reviewer for a flagged refund). A single process lifetime cannot hold all of these. Pause/resume is the harness primitive that makes long-running, externally-blocking, and recoverable agentic workflows possible. Without it, every container restart is a customer outage.
The User Story
As the platform engineer for MangaAssist, I want every agentic workflow to be pausable, persistable, and resumable at well-defined points, so that workflows survive container restarts, scale-down events, model swaps, and external waits, and customers see continuity instead of "let's start over."
Acceptance criteria
- Every state-machine node has a declared `pausable` property (yes / no / only-on-blocking-call).
- Pausing a workflow persists state to a durable store within 50 ms.
- Resume picks up at the exact node, with the same context, regardless of which container picks it up.
- Resume is idempotent — resuming twice does not double-spend tokens or double-call non-idempotent skills.
- Workflows older than 7 days auto-expire and free their state.
Why pause/resume is non-optional at scale
| Trigger | Frequency at MangaAssist scale | Failure mode without pause/resume |
|---|---|---|
| Container restart (deploy / autoscale) | ~1.5K/hour | 1.5K dropped sessions/hour |
| Spot instance preemption | ~200/hour | Hard customer-visible failure |
| External tool slow (>5s) | 0.4% of turns | Browser/connection times out |
| Human-in-the-loop required | ~3K/hour (refunds, age verification) | Workflow has to restart from scratch |
| Async batch enrichment | continuous | Cannot run at all without resume |
| Cost throttle / quota wait | bursts | Either fail or block a thread |
At 1.2M concurrent sessions, dropping the workflow on container restart is unacceptable. The system needs to be able to land any in-flight workflow on any healthy container.
The pause/resume architecture
flowchart LR
subgraph Compute[Stateless agent fleet]
C1[Container 1]
C2[Container 2]
C3[Container N]
end
subgraph Bus[Workflow message bus]
Q[Resumable workflow queue]
end
subgraph State[Durable workflow state]
DDB[(DynamoDB - workflow rows)]
S3[(S3 - large payloads / context)]
end
subgraph Wait[Wait conditions]
HUM[Human approval queue]
TIMER[Scheduled wakeups]
EXT[External callbacks]
end
C1 -->|checkpoint| DDB
C1 -->|big context blob| S3
C2 -->|claim| Q
C2 -->|read state| DDB
C2 -->|read context| S3
HUM -->|signal ready| Q
TIMER -->|fire| Q
EXT -->|webhook| Q
The agent fleet is stateless. Workflow state lives in DynamoDB (small) + S3 (large blobs). A workflow paused on container 1 can be resumed by container 73 in another AZ — that's the whole point.
Three flavors of pause
Flavor 1 — Implicit (process lifetime)
The workflow gets paused because the container is going down. There's no "wait" semantically — we just need durability across deploys.
- Trigger: SIGTERM, autoscale-down, deploy.
- Latency budget for pause: <50 ms (we have 30 seconds before SIGKILL; pause has to be cheap).
- Resume trigger: another container picks up the orphaned workflow row from the queue.
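A minimal sketch of the Flavor 1 path, assuming a hypothetical in-process registry of in-flight workflows; the class, method names, and the budget check are illustrative, not the production handler:

```python
import signal
import time

class InFlightWorkflow:
    """Illustrative stand-in for a running workflow; not the real engine API."""
    def __init__(self, workflow_id: str):
        self.workflow_id = workflow_id

    def checkpoint(self, pause_reason: str) -> None:
        # Durable write of the small state row; the target is <50 ms.
        ...

    def requeue(self) -> None:
        # Put the workflow back on the resumable queue so any healthy container can claim it.
        ...

IN_FLIGHT: list[InFlightWorkflow] = []

def handle_sigterm(signum, frame):
    """Flavor 1 pause: the container is going down. We have ~30 s before SIGKILL,
    so each pause must stay inside the 50 ms budget."""
    for wf in IN_FLIGHT:
        started = time.monotonic()
        wf.checkpoint(pause_reason="container_terminating")
        wf.requeue()
        elapsed_ms = (time.monotonic() - started) * 1000
        if elapsed_ms > 50:
            print(f"pause budget exceeded for {wf.workflow_id}: {elapsed_ms:.1f} ms")

signal.signal(signal.SIGTERM, handle_sigterm)
```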
Flavor 2 — Wait-for-event (external blocking)
The workflow has to wait for something that's not on the critical path: a webhook, a human approval, a scheduled time.
- Trigger: explicit `await(condition)` in the graph.
- State persisted: full context, the wait condition, the resume node.
- Resume trigger: the event source posts to the bus (webhook, scheduler, human-approval system).
Flavor 3 — Throttle (cost / quota / rate)
The workflow can run, but we don't want it to right now (cost ceiling, provider quota near limit, peak hour).
- Trigger: quota manager returns "wait_until_T".
- Persisted: less than flavor 2; just the resume token + timer.
- Resume: scheduled wakeup at T, then check quota again.
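A sketch of what each flavor actually needs to persist, written as a plain dict builder; the field names are hypothetical and loosely mirror the row schema in the next section:

```python
from datetime import datetime

def build_pause_record(flavor: str, current_node: str,
                       context_s3_key: str | None = None,
                       wait_until: datetime | None = None) -> dict:
    """What each pause flavor persists (illustrative field names only)."""
    if flavor == "wait_for_event":
        # Flavor 2: full context (by S3 reference), the wait condition, the resume node.
        return {
            "pause_reason": "wait_human",
            "current_node": current_node,
            "context_pointer": {"s3_key": context_s3_key},
            "resume_condition": {"type": "human_approval", "payload": {}},
        }
    if flavor == "throttle":
        # Flavor 3: cheaper; just a resume token and a timer.
        return {
            "pause_reason": "quota_throttle",
            "current_node": current_node,
            "resume_condition": {"type": "timer",
                                 "payload": {"fire_at": wait_until.isoformat() if wait_until else None}},
        }
    # Flavor 1 (implicit): whatever the last checkpoint already holds; no wait condition.
    return {"pause_reason": "container_terminating",
            "current_node": current_node,
            "resume_condition": {"type": "none"}}
```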
What's actually persisted
# DynamoDB row schema (logical)
workflow_id: uuid
user_id: redacted_hash
graph_version: v17
current_node: SafetyPostCheck
node_local_state:
  draft_response_s3_key: "s3://mangaassist-context/wf/{wf_id}/draft.json"
  pending_tool_call_ids: []
  retries_remaining: 1
context_pointer:
  s3_key: "s3://mangaassist-context/wf/{wf_id}/context.bin"
  size_bytes: 47218
budgets_consumed:
  tokens: 8421
  cost_usd: 0.041
  wall_ms_so_far: 14802
pause_reason: container_terminating | wait_human | quota_throttle
resume_condition:
  type: human_approval | timer | webhook | none
  payload: { ... }
idempotency_keys_used:
  catalog_search: ["abc123..."]
  order_inventory: ["def456..."]
created_at: ...
last_updated_at: ...
ttl: "2026-05-06T14:00:00Z"   # 7 days from creation
The big stuff (full conversation, retrieved documents, draft responses) goes to S3. The state row is small (<8 KB) so DynamoDB scans, lookups, and TTL cleanup are cheap at scale.
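A hedged sketch of the checkpoint write using boto3; the table and bucket names are placeholders. The essential split: the big blob goes to S3 first, then the small row, carrying only a pointer, goes to DynamoDB:

```python
import json
import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")
TABLE = dynamodb.Table("mangaassist-workflows")   # hypothetical table name
BUCKET = "mangaassist-context"                     # bucket name taken from the schema above

def checkpoint(workflow_id: str, context: dict, row: dict) -> None:
    """Write the large context blob to S3, then the small (<8 KB) row to DynamoDB.
    Note: DynamoDB rejects Python floats, so numeric fields in `row` should be Decimal."""
    key = f"wf/{workflow_id}/context.bin"
    body = json.dumps(context).encode()
    s3.put_object(Bucket=BUCKET, Key=key, Body=body)

    row["workflow_id"] = workflow_id
    row["context_pointer"] = {"s3_key": f"s3://{BUCKET}/{key}", "size_bytes": len(body)}
    TABLE.put_item(Item=row)   # the row stays tiny; everything heavy is a pointer
```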
The resume protocol
sequenceDiagram
participant Container
participant Queue
participant DDB
participant S3
participant Skill
Container->>Queue: poll(resumable_workflows)
Queue-->>Container: claim(wf_id, lease=60s)
Container->>DDB: read(wf_id)
DDB-->>Container: state row
Container->>S3: read(context blob)
S3-->>Container: context bytes
alt resume condition met?
Container->>Container: hydrate state machine at current_node
Container->>Skill: re-issue with idempotency key
Skill-->>Container: result (or 200 cached)
Container->>Container: advance graph
Container->>DDB: checkpoint (or finalize)
else condition not met
Container->>DDB: extend wait
Container->>Queue: release claim
end
Three rules:
- Lease + heartbeat. A claim holds for 60s; the container heartbeats every 15s. If a container dies mid-resume, the lease expires and another container picks up.
- Idempotency on every external call. The skill registry's idempotency keys (story 02) are the lifeline here.
- Last-writer-wins on the row with a version field. Two containers racing → one wins; the loser sees a version conflict and abandons.
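A sketch of the third rule with boto3, assuming a numeric `version` attribute on the row (an assumption; the schema above doesn't show it explicitly). The conditional write turns a race between two containers into one winner and one clean abort:

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("mangaassist-workflows")   # hypothetical table name

def checkpoint_with_version(workflow_id: str, expected_version: int, updates: dict) -> bool:
    """Last-writer-wins guarded by a version field: the write only succeeds if the row
    still has the version we read at claim time. Returns False if another container won."""
    try:
        table.update_item(
            Key={"workflow_id": workflow_id},
            UpdateExpression="SET current_node = :n, node_local_state = :s, version = :v_next",
            ConditionExpression="version = :v_expected",
            ExpressionAttributeValues={
                ":n": updates["current_node"],
                ":s": updates["node_local_state"],
                ":v_next": expected_version + 1,
                ":v_expected": expected_version,
            },
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False   # version conflict: another container already advanced this workflow
        raise
```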
Pitfalls — and how the design avoids them
Pitfall 1 — Token cost double-spend
A workflow paused mid-LLM-call resumes and re-runs the LLM call. That's $0.005 wasted per resume × millions = real money. Worse, the user gets a different non-deterministic answer.
Fix: LLM calls go through a thin invocation cache keyed by (prompt_hash, model_id, sampling_params, idempotency_key). Resumed call hits the cache; same answer, near-zero cost.
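A minimal sketch of that cache key, with an in-memory dict standing in for whatever shared store the real cache uses; the helper names are hypothetical. The load-bearing detail is that the idempotency key is part of the key, so a resumed call maps back to its exact earlier invocation:

```python
import hashlib
import json

_CACHE: dict[str, str] = {}   # illustrative stand-in for a shared invocation cache

def invocation_cache_key(prompt: str, model_id: str, sampling_params: dict, idempotency_key: str) -> str:
    prompt_hash = hashlib.sha256(prompt.encode()).hexdigest()
    params_hash = hashlib.sha256(json.dumps(sampling_params, sort_keys=True).encode()).hexdigest()
    return f"{prompt_hash}:{model_id}:{params_hash}:{idempotency_key}"

def cached_llm_call(prompt, model_id, sampling_params, idempotency_key, call_model) -> str:
    key = invocation_cache_key(prompt, model_id, sampling_params, idempotency_key)
    if key in _CACHE:                     # a resumed workflow lands here: same answer, near-zero cost
        return _CACHE[key]
    response = call_model(prompt, model_id, sampling_params)
    _CACHE[key] = response
    return response
```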
Pitfall 2 — Stale state at resume
A workflow paused at T=0 resumes at T=4h. The user's locale changed. Stock changed. Catalog reindexed. The cached LLM output is stale.
Fix: every workflow has a freshness boundary. Wait flavors (Flavor 2, Flavor 3) declare max_pause_seconds. If exceeded, the workflow doesn't resume — it auto-restarts from a safe earlier checkpoint or escalates ("Hi, this took longer than expected; let me redo this with current information").
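A sketch of the resume-time freshness check; `last_updated_at` and `max_pause_seconds` follow the story's own schema and wait-flavor declarations, while `safe_checkpoint_node` and the 300-second default are assumptions for illustration:

```python
from datetime import datetime, timezone

def resume_action(row: dict, now: datetime | None = None) -> str:
    """Decide whether a paused workflow resumes as-is, restarts from a safe checkpoint,
    or escalates to the user. Field names follow the illustrative row schema."""
    now = now or datetime.now(timezone.utc)
    paused_at = datetime.fromisoformat(row["last_updated_at"].replace("Z", "+00:00"))
    pause_age_s = (now - paused_at).total_seconds()
    max_pause_s = row.get("max_pause_seconds", 300)   # declared per wait flavor; 300 is an assumed default

    if pause_age_s <= max_pause_s:
        return "resume"                               # still inside the freshness boundary
    if row.get("safe_checkpoint_node"):               # hypothetical field
        return "restart_from_safe_checkpoint"         # redo with current locale / stock / catalog
    return "escalate"                                 # "this took longer than expected; let me redo this"
```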
Pitfall 3 — Orphan workflow rows
A workflow row is created but the container crashes before queuing it as resumable. It just sits there forever.
Fix: TTL on every row (7 days). A scavenger job runs hourly; rows older than 30 minutes without a heartbeat AND not claimed → re-queued. Defense in depth.
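A sketch of the scavenger's decision logic; the `last_heartbeat_at` and `claimed_by` fields are assumptions, and the real job would scan via an index rather than iterate rows in memory:

```python
from datetime import datetime, timedelta, timezone

ORPHAN_THRESHOLD = timedelta(minutes=30)

def scavenge(rows, requeue, now=None):
    """Re-queue workflow rows that have no recent heartbeat and no active claim.
    `rows` is any iterable of state rows; `requeue` posts a workflow_id to the resumable queue."""
    now = now or datetime.now(timezone.utc)
    for row in rows:
        heartbeat = datetime.fromisoformat(row["last_heartbeat_at"].replace("Z", "+00:00"))
        stale = (now - heartbeat) > ORPHAN_THRESHOLD
        unclaimed = not row.get("claimed_by")
        if stale and unclaimed:
            requeue(row["workflow_id"])   # defense in depth: the 7-day TTL still backstops this
```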
Pitfall 4 — Webhook arrives before pause is durable
External system completes the human approval and webhooks back at T+10ms; our pause hasn't finished writing to DDB yet.
Fix: the webhook handler idempotently writes a "ready" signal keyed by wf_id. The pause logic, on completing the write, checks for an existing ready signal and immediately re-queues if found. Webhook can arrive any time; the system converges.
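A sketch of the converging handshake, with in-memory collections standing in for the durable ready-signal store and the queue; both paths write idempotently, and whichever side finishes second triggers the re-queue:

```python
# Illustrative in-memory stand-ins for the durable "ready" signal store and the resumable queue.
READY_SIGNALS: set[str] = set()
RESUME_QUEUE: list[str] = []

def on_webhook(workflow_id: str) -> None:
    """May arrive before the pause write has landed; record the signal idempotently either way."""
    READY_SIGNALS.add(workflow_id)
    RESUME_QUEUE.append(workflow_id)      # harmless if the row isn't durable yet; resume re-checks the condition

def on_pause_write_complete(workflow_id: str) -> None:
    """After the pause is durable, check whether the event already happened and converge."""
    if workflow_id in READY_SIGNALS:
        RESUME_QUEUE.append(workflow_id)  # the earlier enqueue may have found nothing to resume; this one will
```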
Q&A drill — opening question
Q: Most chat workflows are sub-2-second. Why pay this complexity tax?
Three reasons:
1. Deploy resilience. Even sub-2-second workflows need to survive deploys at our cadence (3-4 deploys/day). Without pause-on-SIGTERM, deploys drop ~5K sessions each.
2. Long tail. "Sub-2-second" is a median. P99 is 14s. P99.9 is 60s. You don't get to ignore those at our scale.
3. Future workflows. Refund flows, age verification, account merge: all multi-minute. The infra has to be built before those features ship.
The complexity tax is real. The cost of NOT paying it is hidden in customer-experience metrics, where it's hardest to attribute and easiest to ignore.
Grilling — Round 1
Q1. How big does the state row get for a complex workflow with 6 sub-agent invocations and lots of retrieved context?
The DDB row stays <8 KB because heavy stuff goes to S3 by reference. The S3 blob can be 200KB-2MB depending on retrieval volume. Read latency: DDB ~5ms, S3 ~25ms — p95 resume read is <40ms. Acceptable for a workflow that paused for any non-trivial duration.
Q2. Idempotency is great for read-side skills. What about write-side — placing an order?
Write-side skills have idempotency keys generated at the planner level, not at retry time. The planner emits "I'm about to call place-order with idem-key=X". The key is persisted in the workflow row BEFORE the call. On resume, we re-issue with the same key; the order service de-dupes. Worst case: extra round-trip; never a double order.
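A sketch of that ordering with injected helpers (`checkpoint` durably writes the row, `call_skill` invokes the order service; both hypothetical). The invariant is that the key is persisted before the non-idempotent call goes out:

```python
import uuid

def place_order_step(row: dict, checkpoint, call_skill, order_payload: dict) -> dict:
    """Planner-level idempotency, sketched with injected helpers.
    The key is minted once, persisted on the row, and only then used for the call."""
    keys = row.setdefault("idempotency_keys_used", {})
    if "place_order" not in keys:
        keys["place_order"] = [str(uuid.uuid4())]
        checkpoint(row)                   # durable BEFORE the side-effecting call

    idem_key = keys["place_order"][0]
    # The order service de-dupes on the key; a resumed re-issue costs an extra round-trip, never a double order.
    return call_skill("place_order", payload=order_payload, idempotency_key=idem_key)
```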
Q3. What about ordering — does the user see their second message before we resume the first?
Per-session ordering is enforced by session-keyed queue partitioning. All workflows for (user_id, session_id) go to the same queue partition. Resume happens in submission order. If the user sends a second message while the first is paused, the second is queued behind the first or the first's resume condition is auto-promoted (depending on intent).
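A sketch of the partition key, assuming a Kafka-style queue where messages with the same key land on the same partition; the partition count is illustrative:

```python
import hashlib

NUM_PARTITIONS = 256   # illustrative

def session_partition(user_id: str, session_id: str) -> int:
    """All workflows for one (user_id, session_id) hash to the same partition,
    so they are claimed and resumed in submission order."""
    key = f"{user_id}:{session_id}".encode()
    return int(hashlib.sha256(key).hexdigest(), 16) % NUM_PARTITIONS
```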
Grilling — Round 2 (architect-level)
Q4. How does pause/resume interact with the eval framework (story 04)?
Two interactions:
- Online sampled judging is asynchronous: it does not wait for resume. The judge sees the eventual final response, so judgment latency doesn't depend on workflow duration.
- The replay harness never pauses. Replays run in a deterministic mode where wait conditions are mocked: human approval = mocked yes; timer = mocked elapsed; webhook = mocked delivered. Replay completes in seconds for a workflow that originally took hours.
This split is intentional. Replay is for diff-testing the agent logic, not the wait infrastructure.
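A sketch of the replay-mode mocks; the condition types mirror the row schema's `resume_condition.type`, and the returned shapes are illustrative:

```python
def mock_wait_condition(condition: dict) -> dict:
    """Deterministic replay: every wait resolves immediately with a canned outcome."""
    kind = condition.get("type", "none")
    if kind == "human_approval":
        return {"status": "approved"}          # mocked yes
    if kind == "timer":
        return {"status": "elapsed"}           # mocked: the timer already fired
    if kind == "webhook":
        return {"status": "delivered", "payload": condition.get("payload", {})}
    return {"status": "none"}
```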
Q5. A bug ships that corrupts the state row schema. Now resumed workflows crash on read. What's the recovery?
State row has a schema_version. The resume code is schema-version-aware: the current code can read the previous schema with an upgrader. New writes use the new schema; old in-flight rows continue under old schema until they finalize or expire (within 7 days max). Two-deploy rule: schema changes ship as additive in deploy N, then made required in deploy N+1 only after all old rows have aged out.
If we shipped a bug anyway: kill the bad deploy via auto-rollback (the eval framework catches the spike in resume failures), then for the broken rows, the scavenger marks them as failed and the user sees a "let me redo this" message rather than a crash loop. Customer impact: tens of thousands of redos, not millions of crashes.
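A sketch of the schema-version-aware reader described in the answer above; the version numbers and the added field are hypothetical examples of an additive change:

```python
CURRENT_SCHEMA_VERSION = 18   # illustrative

def upgrade_v17_to_v18(row: dict) -> dict:
    """Example upgrader: suppose v18 adds max_pause_seconds; give old rows a safe default."""
    row.setdefault("max_pause_seconds", 300)
    row["schema_version"] = 18
    return row

UPGRADERS = {17: upgrade_v17_to_v18}

def read_state_row(row: dict) -> dict:
    """Upgrade an old in-flight row in memory so current code can resume it."""
    version = row.get("schema_version", 17)
    while version < CURRENT_SCHEMA_VERSION:
        row = UPGRADERS[version](row)
        version = row["schema_version"]
    return row
```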
Q6. How would you handle a multi-region setup where workflows might fail over from us-east-1 to us-west-2?
Three architectural pieces:
- DDB Global Tables for the state row. Cross-region replication ~1 second. Acceptable because we're not racing to resume in <1s after a region failure.
- S3 cross-region replication for context blobs. Same story.
- The queue is regional, with a failover orchestrator that, on region-down detection, re-queues the now-orphaned rows in us-west-2's queue.
Trade-offs: the cost of cross-region replication is significant (we estimate +18% of state-layer cost). For workflows < 60 seconds (which is 95% of traffic), region failover during a workflow lifetime is rare enough that re-running from scratch is cheaper than always replicating. So we apply replication selectively: long-running workflows (Flavor 2 with max_pause_seconds > 300) opt in; short-runners don't.
Intuition gained
- Pause/resume is the harness primitive that turns chat from "stateless function" into "durable workflow." Without it, you cannot survive a single deploy without dropping sessions.
- Three flavors of pause (implicit / wait-for-event / throttle) need different storage and resume strategies but share the same row + S3 schema.
- Idempotency is the load-bearing detail — without it, resume is a cost and correctness disaster.
- Freshness boundaries prevent stale resumes from silently shipping wrong answers.
- The state row stays tiny (<8KB), big stuff goes to S3. This is what makes the harness scale to 1.2M concurrent.
See also
- `01-execution-flow-design.md`: pause/resume operates on the typed graph
- `02-skill-composition-and-invocation.md`: idempotency keys come from skill contracts
- `06-checkpointing-and-serving.md`: checkpoints are the unit of pause granularity
- `Changing-Constraints-Scenarios/01-10x-user-surge.md`: pause/resume behavior under burst