
User Story 06 — Checkpointing and Serving

Pillar: P2 (Harness) · Stage unlocked: 3 → 4 · Reading time: ~12 min


TL;DR

Checkpointing is what to save (state granularity, frequency, deltas vs full snapshots). Serving is how to keep the agent fleet healthy under load (warm pools, cold-start avoidance, model loading, model rotation). At MangaAssist scale, getting these two right is the difference between sub-2s p95 with stable cost, and 12s p95 with cost spiking 4× during peak.


The User Story

As the platform engineer responsible for MangaAssist's serving fleet, I want explicit checkpoint policies for each graph node, paired with a serving topology that pre-warms expensive resources and rotates models without dropping inflight workflows, so that the system can deploy 4× per day, scale 5× during anime-release peaks, and rotate foundation models without customer-visible latency or correctness regressions.

Acceptance criteria

  1. Every node declares whether it checkpoints (always / on-success / on-blocking-wait / never).
  2. Checkpoint write p99 < 80 ms; never blocks the user's stream.
  3. The serving fleet keeps a warm floor of containers + model adapters sized to absorb 60-second traffic doubling without cold-start spike.
  4. Model swap (e.g., Sonnet 4.6 → 4.7) rolls out across the fleet over 30 minutes with no in-flight workflow disruption.
  5. Checkpoint storage cost is bounded — not every node, not every turn writes a full snapshot.

Checkpointing — the granularity question

The naive approach: checkpoint after every node. The cost is catastrophic: 2.4B turns/day × ~9 nodes × ~50KB blob ≈ 1 PB of DDB writes per day, which at ~$0.023/GB is roughly $25K/day (see the cost table below).

The right approach: checkpoint at meaningful boundaries.

| Boundary | Why checkpoint here | Cost |
|---|---|---|
| Before any blocking call (Flavor-2 wait) | Necessary for resume | Mandatory |
| After expensive sub-agent invocations | Avoid recompute on resume | Conditional |
| Before write-side skills (place order, send mail) | Idempotency boundary | Mandatory |
| At graph version transitions | For safe rollback | Mandatory |
| After every node | Maximum safety, prohibitive cost | Avoid |
| Periodic time-based (every 30s of wall time) | Catches in-progress long compositions | Conditional |

Rule of thumb: checkpoint where the cost of redo > the cost of the checkpoint.
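Acceptance criterion #1 says every node declares its policy. A minimal sketch of what that declaration could look like, with hypothetical node names (`planner`, `place_order`, etc. are illustrative, not MangaAssist's real graph):

```python
from dataclasses import dataclass
from enum import Enum

class CheckpointPolicy(Enum):
    ALWAYS = "always"
    ON_SUCCESS = "on-success"
    ON_BLOCKING_WAIT = "on-blocking-wait"
    NEVER = "never"

@dataclass
class NodeSpec:
    name: str
    policy: CheckpointPolicy

# Hypothetical graph declaration: every node states its checkpoint policy.
GRAPH = [
    NodeSpec("planner", CheckpointPolicy.ON_SUCCESS),
    NodeSpec("flavor2_wait", CheckpointPolicy.ON_BLOCKING_WAIT),
    NodeSpec("place_order", CheckpointPolicy.ALWAYS),    # write-side skill
    NodeSpec("render_reply", CheckpointPolicy.NEVER),    # cheap to redo
]

def should_checkpoint(node: NodeSpec, succeeded: bool, blocking: bool) -> bool:
    """Evaluate a node's declared policy after it runs."""
    if node.policy is CheckpointPolicy.ALWAYS:
        return True
    if node.policy is CheckpointPolicy.ON_SUCCESS:
        return succeeded
    if node.policy is CheckpointPolicy.ON_BLOCKING_WAIT:
        return blocking
    return False
```

Making the policy an explicit, enumerable field (rather than ad-hoc calls scattered through node code) is what lets you audit criterion #1 mechanically.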


Delta checkpoints

A full snapshot is 50-200KB. A delta is typically 2-10KB. Delta checkpointing alone cuts checkpoint cost ~10×.

```mermaid
flowchart TB
  T0[T0 - workflow start] --> FS0[Full snapshot v0]
  FS0 --> N1[Node 1 finishes]
  N1 --> D1[Delta v0->v1: changed fields only]
  D1 --> N2[Node 2 finishes]
  N2 --> D2[Delta v1->v2]
  D2 --> N3[Node 3: blocking wait]
  N3 --> FS3[Full snapshot v3 - mandatory at boundary]
  FS3 --> N4[Node 4 - resume site]
```

  • Full snapshot at start + at blocking boundaries.
  • Deltas in between.
  • Resume reads (latest_full + all deltas after it) and reconstructs.
  • Rule: if accumulated deltas > 2× the size of a full snapshot, force a new full snapshot.
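The reconstruction and compaction rules above can be sketched in a few lines. This assumes state is a flat dict and deltas carry changed top-level fields only; nested-merge semantics would need a deeper merge:

```python
import copy

def reconstruct(latest_full: dict, deltas: list[dict]) -> dict:
    """Rebuild workflow state: latest full snapshot + all deltas after it."""
    state = copy.deepcopy(latest_full)   # don't mutate the stored snapshot
    for delta in deltas:                 # each delta holds changed fields only
        state.update(delta)
    return state

def needs_full_snapshot(full_size_bytes: int, delta_sizes: list[int]) -> bool:
    """Force a new full snapshot once accumulated deltas exceed 2x a full one."""
    return sum(delta_sizes) > 2 * full_size_bytes
```

The 2× rule bounds resume latency: reads never replay an unbounded delta chain, and storage stays within a constant factor of a single snapshot.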

Serving — the topology

```mermaid
flowchart TB
  subgraph Edge[Edge]
    ALB[Application Load Balancer]
    WS[WebSocket Front-Door]
  end

  subgraph Hot[Hot fleet - warm floor]
    F1[Fargate task 1<br/>model adapters loaded]
    F2[Fargate task 2]
    F3[Fargate task N]
  end

  subgraph Burst[Burst fleet - on-demand]
    B1[Auto-scaled Fargate]
  end

  subgraph Models[Model serving]
    BR[Bedrock - managed FMs]
    SM[SageMaker - in-house FMs]
    Cache[Inference cache - fingerprint to response]
  end

  ALB --> WS
  WS --> F1
  WS --> F2
  WS --> F3
  WS -.overflow.-> B1
  F1 --> BR
  F1 --> SM
  F1 --> Cache
  B1 --> BR
  B1 --> SM
  B1 --> Cache
```

The warm floor

  • Sized to handle 150% of trailing 7-day p95 traffic continuously.
  • Always has model adapters loaded, prompt caches warm, connections pooled.
  • Cost-of-idle: ~14% of compute spend. Worth it because cold-start at peak is a 600-900ms first-token spike.

The burst fleet

  • Auto-scales on concurrent_active_workflows and queue_depth, not raw CPU.
  • Cold start ~22 seconds (container + model adapter load + cache prefill).
  • Pre-warm trigger: when warm-floor utilization exceeds 70%, spin burst capacity before it's needed.
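One way to express the pre-warm trigger as a scaling rule. The sizing formula (bring projected utilization back under the 70% trigger) is an assumption for illustration, not the production autoscaler:

```python
import math

WARM_TRIGGER = 0.70   # pre-warm threshold from the serving policy above

def burst_tasks_to_start(warm_utilization: float, warm_size: int,
                         burst_running: int) -> int:
    """Hypothetical pre-warm sizing: above the 70% trigger, start enough
    burst tasks that projected utilization across the combined fleet drops
    back under the trigger, hiding the ~22s cold start in spare capacity."""
    if warm_utilization <= WARM_TRIGGER:
        return 0
    target_fleet = math.ceil(warm_size * warm_utilization / WARM_TRIGGER)
    return max(target_fleet - warm_size - burst_running, 0)
```

Keying this off concurrent workflows rather than CPU (per the first bullet) matters because agent workloads are I/O-bound on model calls; CPU stays flat while the fleet saturates.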

Inference cache

  • Fingerprint = (prompt_hash, model_id, sampling_params_hash, capability_flags).
  • Hit rate: ~38% on the planner node, ~12% on sub-agents (less repeatable inputs), ~62% on the safety pre-check.
  • Cost saved: ~24% of total LLM spend.
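A minimal sketch of the fingerprint computation. Hashing a canonical JSON form of the sampling params and sorting the capability flags keeps the key stable across dict/list ordering; the exact serialization here is an assumption:

```python
import hashlib
import json

def cache_fingerprint(prompt: str, model_id: str,
                      sampling_params: dict, capability_flags: list[str]) -> str:
    """Cache key = (prompt_hash, model_id, sampling_params_hash, capability_flags)."""
    parts = [
        hashlib.sha256(prompt.encode()).hexdigest(),
        model_id,
        hashlib.sha256(
            json.dumps(sampling_params, sort_keys=True).encode()
        ).hexdigest(),
        ",".join(sorted(capability_flags)),
    ]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()
```

Including model_id in the key is what makes the cache safe during rotations: 4.6 and 4.7 responses never collide, and (per the rotation rules below) a model swap naturally invalidates the old entries.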

Model rotation — the hard part

Bedrock or any provider periodically deprecates / replaces models. Story-level acceptance #4 says: rotate without disruption.

The rotation playbook

```mermaid
sequenceDiagram
  participant Ops
  participant ServingFleet
  participant Eval
  participant InflightWorkflows

  Ops->>ServingFleet: announce rotation Sonnet 4.6 -> 4.7
  ServingFleet->>ServingFleet: dual-load both adapters on warm floor
  Ops->>Eval: start shadow eval on 4.7 against 4.6
  Eval-->>Ops: shadow eval green after 24h
  Ops->>ServingFleet: canary 5pct traffic to 4.7
  ServingFleet->>ServingFleet: route by user_id hash
  Ops->>Eval: monitor canary online evals
  Eval-->>Ops: canary green after 24h
  Ops->>ServingFleet: ramp 25 / 50 / 100pct over 4h
  ServingFleet->>InflightWorkflows: NO change to in-flight - they finish on their started model
  ServingFleet->>ServingFleet: drain 4.6 over 7 days, then unload adapter
```

Key rules:

  • In-flight workflows finish on the model they started with. Switching models mid-workflow breaks reasoning continuity and invalidates the inference cache. The model_id is captured at workflow start and is part of the state row.
  • Dual-loading both adapters during the transition uses ~30% extra GPU memory on the adapter-serving fleet. We provision for this; it's a known budget.
  • The eval framework gates each step. Shadow → canary → ramp, each with a green eval.
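The model-pinning rule reduces to capturing model_id once at workflow creation and never consulting the fleet default again. A sketch, with `ACTIVE_MODEL` and the field names as illustrative assumptions:

```python
import time
from dataclasses import dataclass, field

ACTIVE_MODEL = "sonnet-4.7"   # fleet-level default; updated during a rotation

@dataclass
class WorkflowState:
    user_id: str
    model_id: str               # pinned at start; persisted in the state row
    started_at: float = field(default_factory=time.time)

def start_workflow(user_id: str) -> WorkflowState:
    # Capture the active model exactly once, at workflow creation.
    return WorkflowState(user_id=user_id, model_id=ACTIVE_MODEL)

def model_for_turn(state: WorkflowState) -> str:
    # Every subsequent turn reads the pinned value, never the fleet default,
    # so a mid-workflow rotation cannot change the serving model.
    return state.model_id
```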


Checkpoint cost math (at MangaAssist scale)

| Approach | Writes/day | Bytes/day | $/day on DDB | Notes |
|---|---|---|---|---|
| Every node, full snapshot | 2.4B × 9 = 21.6B | ~1 PB | ~$25K | prohibitive |
| Every node, delta | 21.6B | ~110 TB | ~$2.5K | still wasteful |
| Boundary-only, full+delta hybrid | ~3.2B | ~16 TB | ~$370 | actual production policy |
| Boundary-only, no deltas | ~720M | ~36 TB | ~$830 | resume slower (recompute mid-workflow) |

The hybrid is the sweet spot. Checkpoint where redo is expensive; delta where the change is small; full only at the resume sites.


Pitfalls

Pitfall 1 — Synchronous checkpoints blocking the user stream

A naive implementation awaits write_checkpoint() on the request thread. Now p99 latency includes the p99 DDB write tail.

Fix: checkpoints are async writes with a fire-and-forget pattern, with two reliability properties:

  • The write is acknowledged BEFORE the next critical action (write-side skill, send-to-user).
  • A failed checkpoint downgrades the workflow to "non-resumable for this turn" — the turn completes, but a deploy during this turn would lose it. Acceptable trade-off for keeping latency bounded.
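A minimal asyncio sketch of this pattern. The DDB put is stubbed out, and `barrier()` stands in for the pre-critical-action acknowledgment point; the real write path and error taxonomy are assumptions:

```python
import asyncio

class CheckpointWriter:
    """Fire-and-forget checkpoint writes that never block the token stream,
    but are awaited (acknowledged) before the next critical action."""

    def __init__(self):
        self._pending: asyncio.Task | None = None
        self.resumable = True

    async def _write(self, blob: bytes) -> None:
        await asyncio.sleep(0)   # stand-in for the actual DynamoDB put

    def checkpoint(self, blob: bytes) -> None:
        # Fire-and-forget: the request path keeps streaming to the user.
        self._pending = asyncio.ensure_future(self._write(blob))

    async def barrier(self) -> None:
        # Called before a write-side skill or send-to-user: ack required here.
        if self._pending is None:
            return
        try:
            await self._pending
        except Exception:
            # Failed write: the turn completes, but this turn is not
            # resumable across a deploy. Latency stays bounded.
            self.resumable = False
```

The key design point: latency is paid only at the barrier, and only up to the slowest in-flight write, instead of on every checkpoint.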

Pitfall 2 — Cold model load during peak

Burst container starts during a 5× traffic spike, takes 22 seconds to load adapters, queues all incoming requests behind that. Latency spike.

Fix: the pre-warm trigger — at 70% warm-floor utilization, start burst containers before they're needed. The 22-second window is hidden inside spare capacity, not exposed to users. Combined with a "hot-spare" pool that holds 3-5 fully-loaded containers idle, ready to be promoted instantly.

Pitfall 3 — Inference cache poisoning

A prompt that produces a low-quality output gets cached. Now 38% of users see that low-quality output for a day.

Fix: cache eligibility includes eval signal. If the active eval (story 04) judges the output below the cache-eligibility threshold, it's not cached. Also, cache TTL is 4 hours by default — bounded blast radius even on the rare poisoned entry.

Pitfall 4 — Multi-version model concurrency drift

During a 7-day rotation, requests served by 4.6 and 4.7 produce slightly different outputs. The eval dashboard suddenly shows two distributions.

Fix: the dashboard partitions by model version. You see 4.6 and 4.7 separately, not a confusing mixture. The drift detector also partitions. This is necessary for the eval to remain interpretable during rotations.


Q&A drill — opening question

Q: Couldn't we just rely on Bedrock provisioned throughput and skip the warm-fleet machinery?

Bedrock provisioned throughput handles the model-call cold start. It does not handle:

  • Container cold start (Fargate task spin-up).
  • Application-level prompt cache warming.
  • Connection pool warming for downstream tools.
  • Skill registry hydration.

Each of these adds 100ms-2s on cold start. Bedrock alone removes the model-side cold start, which is necessary but insufficient. Both are needed.


Grilling — Round 1

Q1. Why use Fargate? Wouldn't EC2 with persistent processes be cheaper?

EC2 is cheaper per CPU-hour but more expensive in operational overhead — patching, AMI rotation, capacity planning. Fargate's per-task model maps cleanly onto our "stateless agent fleet" abstraction. The price delta is ~15-20%; we trade that for faster autoscale and zero AMI maintenance. As scale grows past where the delta becomes material (~$2M/year), we re-evaluate. Today's scale doesn't justify it.

Q2. How do you avoid checkpointing PII?

Two layers:

  • Field-level redaction at write time. Conversation state contains user-supplied content; the PII detector runs on the fly; emails/phones/payment hints are tokenized.
  • Encryption at rest with KMS keys scoped per region. State rows older than 7 days auto-delete via TTL.

The redaction layer is itself a skill (story 02) with its own eval suite for false-negative rate.
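To make the write-time tokenization concrete, a deliberately simplified regex sketch. The real detector is a skill with an evaluated false-negative rate; these two patterns are assumptions that catch only the easy cases:

```python
import re

# Illustrative patterns only: production PII detection needs far more
# than two regexes (names, addresses, payment hints, locale variants).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    """Tokenize obvious PII before the state row is written."""
    text = EMAIL.sub("<EMAIL>", text)
    text = PHONE.sub("<PHONE>", text)
    return text
```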

Q3. Inference cache hit rate of 38% on planner sounds high — what's making prompts so repeatable?

Three contributors:

  • Many users ask similar opening questions ("recommend something like X"); the planner's first call is essentially a routing decision over a small input distribution.
  • The prompt prefix is highly stable (system prompt + skill registry summary + few-shot); only the user message varies. Our cache works on the full prompt hash, but the model side caches the prefix automatically (Anthropic prompt caching).
  • Capability-flag combinations are bounded; most users fall into 4-5 distinct flag profiles.

We don't try to cache more aggressively than the eval allows — see Pitfall 3.


Grilling — Round 2 (architect-level)

Q4. Walk me through what happens when a Fargate task crashes mid-stream.

  1. WebSocket front-door detects connection drop (TCP keepalive ~10s).
  2. The user's client auto-reconnects within 1-2s with the session token.
  3. New connection lands on a healthy container (load-balanced).
  4. New container looks up the workflow row by (user_id, session_id). State is the latest checkpoint (boundary or delta).
  5. Workflow resumes from last checkpointed node. If we were mid-stream, the user sees a brief "..." pause and then the response continues from the checkpoint.

Median customer-perceived disruption: ~3 seconds of wait, no lost message. Worst case (we crashed mid-skill-call without checkpoint): the call is re-issued with idempotency key; result is cached or recomputed; ~2-5s extra latency. We measure these as "recovery events" and budget them as ~0.4% of turns.

Q5. How does checkpointing interact with prompt caching at the model provider level?

They're orthogonal, but both matter:

  • The provider prompt cache caches the model's KV-cache for a stable prompt prefix; it saves cost and latency on subsequent calls with the same prefix.
  • Our checkpoint captures the workflow state, including the prompt itself, so resume can re-issue the same prompt and hit the provider cache.

When a workflow resumes after >5 min (Anthropic prompt cache TTL), the provider cache misses; this is fine, the resume itself is rare and the cost of one full-prompt call is bounded. We budget for this in the resume cost model.

Q6. A new graph version v18 has a different state shape than v17. We just rolled out v18. What about in-flight v17 workflows?

The state row has graph_version. Resume code is multi-version-capable for at least one prior version. Three patterns:

  • In-flight v17 workflows finish on v17 logic. No mid-workflow upgrade.
  • New workflows start on v18. Routing happens at workflow creation, not at resume.
  • v17 logic is retained in the codebase until all v17 workflows have aged out (≤7 days).

Code stays around for ~2 weeks per version, two versions in active codepath at most. This is the cost of safe rollouts, and we accept it.
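The resume-side dispatch is a small version table. A sketch, with trivial stand-in handlers for the real v17/v18 graph logic:

```python
# Hypothetical multi-version resume dispatch: the state row's graph_version
# routes to the matching logic; at most two versions live in the codepath.
def run_v17(state: dict) -> str:
    return "v17:" + state["node"]

def run_v18(state: dict) -> str:
    return "v18:" + state["node"]

RESUME_HANDLERS = {17: run_v17, 18: run_v18}

def resume(state_row: dict) -> str:
    version = state_row["graph_version"]
    handler = RESUME_HANDLERS.get(version)
    if handler is None:
        # Version aged out (>7 days): the row should already be TTL-deleted.
        raise RuntimeError(f"graph version {version} aged out; cannot resume")
    return handler(state_row)
```

Dropping a version from `RESUME_HANDLERS` is the explicit, reviewable act of retiring it, which is easier to audit than deleting scattered branches.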


Intuition gained

  • Checkpoints are policy, not implementation. Pick boundaries where redo > checkpoint cost. Delta vs full is an optimization, not a philosophy.
  • Warm floor + burst with pre-warm trigger is what hides the cold-start cost from users.
  • In-flight workflows finish on their starting model. Mid-workflow model rotation is a footgun.
  • Inference cache is a 24% cost lever but needs eval gating to avoid poisoning.
  • Multi-version codepath is the cost of safe rollouts. Budget for it instead of avoiding it.

See also

  • 05-pause-resume-workflows.md — checkpoints are pause's storage primitive
  • 07-sync-async-invocation.md — when serving is sync vs queue-mediated
  • 08-observability-cost-versioning-ratelimits.md — partitioning dashboards by graph/model version
  • Changing-Constraints-Scenarios/02-foundation-model-deprecated.md — rotation under forced deadline