
User Story 05 — Pause and Resume Workflows

Pillar: P2 (Harness) · Stage unlocked: 3 → 4 · Reading time: ~12 min


TL;DR

Some MangaAssist workflows take 2 seconds. Some take 90 seconds. Some take 4 hours (waiting on a human-in-the-loop reviewer for a flagged refund). A single process lifetime cannot hold all of these. Pause/resume is the harness primitive that makes long-running, externally-blocking, and recoverable agentic workflows possible. Without it, every container restart is a customer outage.


The User Story

As the platform engineer for MangaAssist, I want every agentic workflow to be pausable, persistable, and resumable at well-defined points, so that workflows survive container restarts, scale-down events, model swaps, and external waits, and customers see continuity instead of "let's start over."

Acceptance criteria

  1. Every state-machine node has a declared pausable property (yes / no / only-on-blocking-call).
  2. Pausing a workflow persists state to a durable store within 50 ms.
  3. Resume picks up at the exact node, with the same context, regardless of which container picks it up.
  4. Resume is idempotent — resuming twice does not double-spend tokens or double-call non-idempotent skills.
  5. Workflows older than 7 days auto-expire and free their state.
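Criterion 1 can be made concrete as a declared property on each graph node. A minimal sketch, assuming a simple node list; the node names besides SafetyPostCheck are hypothetical:

```python
from dataclasses import dataclass
from enum import Enum

class Pausable(Enum):
    YES = "yes"
    NO = "no"
    ONLY_ON_BLOCKING_CALL = "only-on-blocking-call"

@dataclass(frozen=True)
class Node:
    name: str
    pausable: Pausable

# Hypothetical three-node graph for illustration.
GRAPH = [
    Node("ParseIntent", Pausable.NO),
    Node("CatalogSearch", Pausable.ONLY_ON_BLOCKING_CALL),
    Node("SafetyPostCheck", Pausable.YES),
]

def pause_points(graph):
    """Nodes where a checkpoint may legally be taken."""
    return [n.name for n in graph if n.pausable is not Pausable.NO]
```

Making the property an enum rather than a boolean keeps room for the "only-on-blocking-call" middle ground without a schema change later.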

Why pause/resume is non-optional at scale

| Trigger | Frequency at MangaAssist scale | Failure mode without pause/resume |
|---|---|---|
| Container restart (deploy / autoscale) | ~1.5K/hour | 1.5K dropped sessions/hour |
| Spot instance preemption | ~200/hour | Hard customer-visible failure |
| External tool slow (>5 s) | 0.4% of turns | Browser/connection times out |
| Human-in-the-loop required | ~3K/hour (refunds, age verification) | Workflow has to restart from scratch |
| Async batch enrichment | continuous | Cannot run at all without resume |
| Cost throttle / quota wait | bursts | Either fail or block a thread |

At 1.2M concurrent sessions, dropping the workflow on container restart is unacceptable. The system needs to be able to land any in-flight workflow on any healthy container.


The pause/resume architecture

```mermaid
flowchart LR
  subgraph Compute[Stateless agent fleet]
    C1[Container 1]
    C2[Container 2]
    C3[Container N]
  end

  subgraph Bus[Workflow message bus]
    Q[Resumable workflow queue]
  end

  subgraph State[Durable workflow state]
    DDB[(DynamoDB - workflow rows)]
    S3[(S3 - large payloads / context)]
  end

  subgraph Wait[Wait conditions]
    HUM[Human approval queue]
    TIMER[Scheduled wakeups]
    EXT[External callbacks]
  end

  C1 -->|checkpoint| DDB
  C1 -->|big context blob| S3
  C2 -->|claim| Q
  C2 -->|read state| DDB
  C2 -->|read context| S3

  HUM -->|signal ready| Q
  TIMER -->|fire| Q
  EXT -->|webhook| Q
```

The agent fleet is stateless. Workflow state lives in DynamoDB (small) + S3 (large blobs). A workflow paused on container 1 can be resumed by container 73 in another AZ — that's the whole point.


Three flavors of pause

Flavor 1 — Implicit (process lifetime)

The workflow gets paused because the container is going down. There's no "wait" semantically — we just need durability across deploys.

  • Trigger: SIGTERM, autoscale-down, deploy.
  • Latency budget for pause: <50 ms (we have 30 seconds before SIGKILL; pause has to be cheap).
  • Resume trigger: another container picks up the orphaned workflow row from the queue.
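The SIGTERM path can be sketched as a small handler that flips a "stop claiming work" flag and checkpoints everything in flight. This is a sketch, not the production code: `checkpoint_fn` stands in for the DynamoDB row write, and the heavy context blob is assumed to have already been streamed to S3 at the last node boundary, which is what keeps the pause itself under the 50 ms budget:

```python
import signal
import threading

class GracefulPauser:
    """On SIGTERM: stop claiming new work, checkpoint everything in flight."""

    def __init__(self, checkpoint_fn):
        self.checkpoint_fn = checkpoint_fn  # persists the small workflow row
        self.terminating = threading.Event()
        self.in_flight = set()

    def install(self):
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        self.terminating.set()  # the poll loop checks this before claiming work
        for wf_id in sorted(self.in_flight):
            self.checkpoint_fn(wf_id, pause_reason="container_terminating")
```

The 30-second SIGTERM-to-SIGKILL window is ample as long as each checkpoint is a single small write.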

Flavor 2 — Wait-for-event (external blocking)

The workflow has to wait for something that's not on the critical path: a webhook, a human approval, a scheduled time.

  • Trigger: explicit await(condition) in the graph.
  • State persisted: full context, the wait condition, the resume node.
  • Resume trigger: the event source posts to the bus (webhook, scheduler, human-approval system).

Flavor 3 — Throttle (cost / quota / rate)

The workflow can run, but we don't want it to right now (cost ceiling, provider quota near limit, peak hour).

  • Trigger: quota manager returns "wait_until_T".
  • Persisted: less than flavor 2; just the resume token + timer.
  • Resume: scheduled wakeup at T, then check quota again.
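Flavor 3's minimal persisted footprint can be sketched with a dict standing in for the state store; the field names follow the row schema later in this document, everything else is illustrative:

```python
def throttle_pause(store, wf_id, fire_at):
    """Flavor 3 persists almost nothing: a resume token plus a timer."""
    store[wf_id] = {
        "pause_reason": "quota_throttle",
        "resume_condition": {"type": "timer", "fire_at": fire_at},
    }

def due_for_resume(store, now):
    """Scheduled wakeup: return workflows whose timer has fired."""
    return [wf_id for wf_id, row in store.items()
            if row["resume_condition"]["type"] == "timer"
            and row["resume_condition"]["fire_at"] <= now]
```

On wakeup the quota is checked again; if it is still exhausted, the workflow simply re-throttles with a later `fire_at`.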

What's actually persisted

```yaml
# DynamoDB row schema (logical)
workflow_id: uuid
user_id: redacted_hash
graph_version: v17
current_node: SafetyPostCheck
node_local_state:
  draft_response_s3_key: "s3://mangaassist-context/wf/{wf_id}/draft.json"
  pending_tool_call_ids: []
  retries_remaining: 1
context_pointer:
  s3_key: "s3://mangaassist-context/wf/{wf_id}/context.bin"
  size_bytes: 47218
  ttl: "2026-05-06T14:00:00Z"
budgets_consumed:
  tokens: 8421
  cost_usd: 0.041
  wall_ms_so_far: 14802
pause_reason: container_terminating | wait_human | quota_throttle
resume_condition:
  type: human_approval | timer | webhook | none
  payload: { ... }
idempotency_keys_used:
  catalog_search: ["abc123..."]
  order_inventory: ["def456..."]
created_at: ...
last_updated_at: ...
ttl: 7 days from creation
```

The big stuff (full conversation, retrieved documents, draft responses) goes to S3. The state row is small (<8 KB) so DynamoDB scans, lookups, and TTL cleanup are cheap at scale.
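The row/blob split can be sketched with dicts standing in for DynamoDB and S3; the 8 KB guard and the key layout mirror the schema above, and everything else is an illustrative assumption:

```python
import json

ROW_LIMIT_BYTES = 8 * 1024  # keep the DynamoDB row small

def checkpoint(wf_id, small_state, context_blob, row_store, blob_store):
    """Write big context by reference; only a pointer lands in the row."""
    key = f"wf/{wf_id}/context.bin"
    blob_store[key] = context_blob
    row = dict(small_state)
    row["context_pointer"] = {"s3_key": key, "size_bytes": len(context_blob)}
    if len(json.dumps(row).encode()) >= ROW_LIMIT_BYTES:
        raise ValueError("state row exceeds 8 KB; move more state to the blob")
    row_store[wf_id] = row
```

The guard makes "the row stays small" an enforced invariant rather than a convention that erodes over time.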


The resume protocol

```mermaid
sequenceDiagram
  participant Container
  participant Queue
  participant DDB
  participant S3
  participant Skill

  Container->>Queue: poll(resumable_workflows)
  Queue-->>Container: claim(wf_id, lease=60s)
  Container->>DDB: read(wf_id)
  DDB-->>Container: state row
  Container->>S3: read(context blob)
  S3-->>Container: context bytes

  alt resume condition met?
    Container->>Container: hydrate state machine at current_node
    Container->>Skill: re-issue with idempotency key
    Skill-->>Container: result (or 200 cached)
    Container->>Container: advance graph
    Container->>DDB: checkpoint (or finalize)
  else condition not met
    Container->>DDB: extend wait
    Container->>Queue: release claim
  end
```

Three rules:

  • Lease + heartbeat. A claim holds for 60 s; the container heartbeats every 15 s. If a container dies mid-resume, the lease expires and another container picks up.
  • Idempotency on every external call. The skill registry's idempotency keys (story 02) are the lifeline here.
  • Last-writer-wins on the row, with a version field. When two containers race, one wins; the loser sees a version conflict and abandons.
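The version-field rule is an optimistic-concurrency pattern (DynamoDB expresses it as a conditional write). A minimal sketch with a dict standing in for the table:

```python
class VersionConflict(Exception):
    pass

def conditional_checkpoint(store, wf_id, new_state, expected_version):
    """Last-writer-wins via a version field: the write succeeds only if the
    row still carries the version this writer read. The losing racer sees
    VersionConflict and abandons its resume attempt."""
    current = store.get(wf_id, {"version": 0})
    if current["version"] != expected_version:
        raise VersionConflict(wf_id)
    store[wf_id] = {"version": expected_version + 1, **new_state}
```

Abandoning on conflict is safe precisely because the winner's checkpoint supersedes anything the loser would have written.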


Pitfalls — and how the design avoids them

Pitfall 1 — Token cost double-spend

A workflow paused mid-LLM-call resumes and re-runs the LLM call. That's $0.005 wasted per resume × millions = real money. Worse, the user gets a different non-deterministic answer.

Fix: LLM calls go through a thin invocation cache keyed by (prompt_hash, model_id, sampling_params, idempotency_key). Resumed call hits the cache; same answer, near-zero cost.
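The invocation cache can be sketched as a thin wrapper; `backend` stands in for the real LLM client, and the cache would be a shared store (not in-process memory) in production:

```python
import hashlib
import json

class InvocationCache:
    """A resumed workflow re-reads the cached completion instead of re-paying."""

    def __init__(self, backend):
        self.backend = backend  # any callable: prompt -> completion
        self._cache = {}

    def invoke(self, prompt, model_id, sampling_params, idempotency_key):
        raw = json.dumps([prompt, model_id, sampling_params, idempotency_key],
                         sort_keys=True)
        key = hashlib.sha256(raw.encode()).hexdigest()
        if key not in self._cache:
            self._cache[key] = self.backend(prompt)
        return self._cache[key]
```

Including the idempotency key in the cache key is what distinguishes "resume of the same call" (cache hit, same answer) from "genuinely new call with the same prompt" (fresh completion).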

Pitfall 2 — Stale state at resume

A workflow paused at T=0 resumes at T=4h. The user's locale changed. Stock changed. Catalog reindexed. The cached LLM output is stale.

Fix: every workflow has a freshness boundary. Wait flavors (Flavor 2, Flavor 3) declare max_pause_seconds. If exceeded, the workflow doesn't resume — it auto-restarts from a safe earlier checkpoint or escalates ("Hi, this took longer than expected; let me redo this with current information").
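The freshness-boundary decision itself is small; a sketch, with timestamps as plain seconds:

```python
def resume_action(paused_at_s, now_s, max_pause_seconds):
    """Past the freshness boundary, don't resume: restart from a safe
    checkpoint (or escalate to the user) with current information."""
    if now_s - paused_at_s <= max_pause_seconds:
        return "resume"
    return "restart_from_safe_checkpoint"
```

The important design point is that the check happens at resume time, on the resuming container, so it holds no matter how long the row sat in the queue.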

Pitfall 3 — Orphan workflow rows

A workflow row is created but the container crashes before queuing it as resumable. It just sits there forever.

Fix: TTL on every row (7 days). A scavenger job runs hourly; rows older than 30 minutes without a heartbeat AND not claimed → re-queued. Defense in depth.

Pitfall 4 — Webhook arrives before pause is durable

External system completes the human approval and webhooks back at T+10ms; our pause hasn't finished writing to DDB yet.

Fix: the webhook handler idempotently writes a "ready" signal keyed by wf_id. The pause logic, on completing the write, checks for an existing ready signal and immediately re-queues if found. Webhook can arrive any time; the system converges.
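The convergence argument can be sketched as two idempotent handlers over shared state; dicts and a list stand in for the row store and the queue:

```python
def on_webhook(store, queue, wf_id):
    """Idempotently record the ready signal, whenever it arrives."""
    row = store.setdefault(wf_id, {})
    row["ready"] = True
    if row.get("pause_durable"):
        queue.append(wf_id)

def on_pause_durable(store, queue, wf_id):
    """After the pause write lands, check for an early-arriving signal."""
    row = store.setdefault(wf_id, {})
    row["pause_durable"] = True
    if row.get("ready"):
        queue.append(wf_id)
```

Either ordering ends with the workflow queued; a rare double-enqueue is harmless because resume is idempotent (acceptance criterion 4).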


Q&A drill — opening question

Q: Most chat workflows are sub-2-second. Why pay this complexity tax?

Three reasons:

  1. Deploy resilience. Even sub-2-second workflows need to survive deploys at our cadence (3-4 deploys/day). Without pause-on-SIGTERM, each deploy drops ~5K sessions.
  2. Long tail. "Sub-2-second" is a median. P99 is 14 s. P99.9 is 60 s. You don't get to ignore those at our scale.
  3. Future workflows. Refund flows, age verification, account merge: all multi-minute. The infra has to be built before those features ship.

The complexity tax is real. The cost of NOT paying it is hidden in customer-experience metrics, where it's hardest to attribute and easiest to ignore.


Grilling — Round 1

Q1. How big does the state row get for a complex workflow with 6 sub-agent invocations and lots of retrieved context?

The DDB row stays <8 KB because heavy stuff goes to S3 by reference. The S3 blob can be 200KB-2MB depending on retrieval volume. Read latency: DDB ~5ms, S3 ~25ms — p95 resume read is <40ms. Acceptable for a workflow that paused for any non-trivial duration.

Q2. Idempotency is great for read-side skills. What about write-side — placing an order?

Write-side skills have idempotency keys generated at the planner level, not at retry time. The planner emits "I'm about to call place-order with idem-key=X". The key is persisted in the workflow row BEFORE the call. On resume, we re-issue with the same key; the order service de-dupes. Worst case: extra round-trip; never a double order.
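The persist-before-call discipline can be sketched like this; `order_service` stands in for the real de-duping service, and the durable write is marked where it would happen:

```python
import uuid

def call_place_order(row, order_service):
    """Persist the idempotency key in the workflow row BEFORE calling.
    On resume the same key is re-issued and the order service de-dupes."""
    keys = row.setdefault("idempotency_keys_used", {})
    if "order_place" not in keys:
        keys["order_place"] = str(uuid.uuid4())
        # in production: durably write `row` here, before the call goes out
    return order_service(idempotency_key=keys["order_place"])
```

If the key were generated at retry time instead, a resume after the call but before the response would have no way to know an order was already placed.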

Q3. What about ordering — does the user see their second message before we resume the first?

Per-session ordering is enforced by session-keyed queue partitioning. All workflows for (user_id, session_id) go to the same queue partition. Resume happens in submission order. If the user sends a second message while the first is paused, the second is queued behind the first or the first's resume condition is auto-promoted (depending on intent).
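Session-keyed partitioning reduces to a stable hash of the session identity; a sketch, with a hypothetical partition count:

```python
import hashlib

NUM_PARTITIONS = 32  # hypothetical partition count

def partition_for(user_id, session_id):
    """Every workflow for one (user, session) lands on the same queue
    partition, so resumes within a session happen in submission order."""
    digest = hashlib.sha256(f"{user_id}:{session_id}".encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS
```

Using a cryptographic hash rather than Python's built-in `hash()` matters here: the mapping must be stable across processes and container restarts.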


Grilling — Round 2 (architect-level)

Q4. How does pause/resume interact with the eval framework (story 04)?

Two interactions:

  • Online sampled judging is asynchronous; it does not wait for resume. The judge sees the eventual final response, so judgment latency doesn't depend on workflow duration.
  • The replay harness NEVER pauses. Replays run in a deterministic mode where wait conditions are mocked: human approval is a mocked yes, timers are mocked as elapsed, webhooks are mocked as delivered. A workflow that originally took hours replays in seconds.

This split is intentional. Replay is for diff-testing the agent logic, not the wait infrastructure.

Q5. A bug ships that corrupts the state row schema. Now resumed workflows crash on read. What's the recovery?

State row has a schema_version. The resume code is schema-version-aware: the current code can read the previous schema with an upgrader. New writes use the new schema; old in-flight rows continue under old schema until they finalize or expire (within 7 days max). Two-deploy rule: schema changes ship as additive in deploy N, then made required in deploy N+1 only after all old rows have aged out.
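The upgrader chain can be sketched as a version-indexed table of pure functions; the v1-to-v2 change shown (a `status` field renamed to `pause_reason`) is a hypothetical example, not the actual schema history:

```python
CURRENT_SCHEMA = 2

def upgrade_v1_to_v2(row):
    # hypothetical additive change: v2 renamed `status` to `pause_reason`
    row = dict(row)
    row["pause_reason"] = row.pop("status", "unknown")
    row["schema_version"] = 2
    return row

UPGRADERS = {1: upgrade_v1_to_v2}

def read_state_row(row):
    """Schema-version-aware read: old in-flight rows are upgraded on the
    fly until they match the schema the current code expects."""
    while row.get("schema_version", 1) < CURRENT_SCHEMA:
        row = UPGRADERS[row.get("schema_version", 1)](row)
    return row
```

Because rows age out within 7 days, the upgrader table only ever needs to span the schemas of the last few deploys.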

If we shipped a bug anyway: kill the bad deploy via auto-rollback (the eval framework catches the spike in resume failures), then for the broken rows, the scavenger marks them as failed and the user sees a "let me redo this" message rather than a crash loop. Customer impact: tens of thousands of redos, not millions of crashes.

Q6. How would you handle a multi-region setup where workflows might fail over from us-east-1 to us-west-2?

Three architectural pieces:

  • DDB Global Tables for the state row. Cross-region replication takes ~1 second, which is acceptable because we're not racing to resume within 1 s of a region failure.
  • S3 cross-region replication for context blobs. Same story.
  • The queue is regional, with a failover orchestrator that, on region-down detection, re-queues the now-orphaned rows in us-west-2's queue.

Trade-offs: the cost of cross-region replication is significant (we estimate +18% of state-layer cost). For workflows < 60 seconds (which is 95% of traffic), region failover during a workflow lifetime is rare enough that re-running from scratch is cheaper than always replicating. So we apply replication selectively: long-running workflows (Flavor 2 with max_pause_seconds > 300) opt in; short-runners don't.


Intuition gained

  • Pause/resume is the harness primitive that turns chat from "stateless function" into "durable workflow." Without it, you cannot survive a single deploy without dropping sessions.
  • Three flavors of pause (implicit / wait-for-event / throttle) need different storage and resume strategies but share the same row + S3 schema.
  • Idempotency is the load-bearing detail — without it, resume is a cost and correctness disaster.
  • Freshness boundaries prevent stale resumes from silently shipping wrong answers.
  • The state row stays tiny (<8KB), big stuff goes to S3. This is what makes the harness scale to 1.2M concurrent.

See also

  • 01-execution-flow-design.md — pause/resume operates on the typed graph
  • 02-skill-composition-and-invocation.md — idempotency keys come from skill contracts
  • 06-checkpointing-and-serving.md — checkpoints are the unit of pause granularity
  • Changing-Constraints-Scenarios/01-10x-user-surge.md — pause/resume behavior under burst