User Story 06 — Checkpointing and Serving
Pillar: P2 (Harness) · Stage unlocked: 3 → 4 · Reading time: ~12 min
TL;DR
Checkpointing is what to save (state granularity, frequency, deltas vs full snapshots). Serving is how to keep the agent fleet healthy under load (warm pools, cold-start avoidance, model loading, model rotation). At MangaAssist scale, getting these two right is the difference between sub-2s p95 with stable cost, and 12s p95 with cost spiking 4× during peak.
The User Story
As the platform engineer responsible for MangaAssist's serving fleet, I want explicit checkpoint policies for each graph node, paired with a serving topology that pre-warms expensive resources and rotates models without dropping inflight workflows, so that the system can deploy 4× per day, scale 5× during anime-release peaks, and rotate foundation models without customer-visible latency or correctness regressions.
Acceptance criteria
- Every node declares whether it checkpoints (always / on-success / on-blocking-wait / never).
- Checkpoint write p99 < 80 ms; never blocks the user's stream.
- The serving fleet keeps a warm floor of containers + model adapters, sized to absorb a traffic doubling within 60 seconds without a cold-start spike.
- Model swap (e.g., Sonnet 4.6 → 4.7) rolls out across the fleet over 30 minutes with no in-flight workflow disruption.
- Checkpoint storage cost is bounded — not every node, not every turn writes a full snapshot.
Checkpointing — the granularity question
The naive approach: checkpoint after every node. Cost: catastrophic. At 2.4B turns/day × ~9 nodes × ~50KB per blob at ~$0.023/GB for DDB writes, that's roughly 1 PB and ~$25K per day (see the cost table below).
The right approach: checkpoint at meaningful boundaries.
| Boundary | Why checkpoint here | Cost |
|---|---|---|
| Before any blocking call (Flavor-2 wait) | Necessary for resume | Mandatory |
| After expensive sub-agent invocations | Avoid recompute on resume | Conditional |
| Before write-side skills (place order, send mail) | Idempotency boundary | Mandatory |
| At graph version transitions | For safe rollback | Mandatory |
| After every node | Maximum safety, prohibitive cost | Avoid |
| Periodic time-based (every 30s of wall time) | Catches in-progress long compositions | Conditional |
Rule of thumb: checkpoint where the cost of redo > the cost of the checkpoint.
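A minimal sketch of acceptance criterion #1, assuming a small per-node registry; `CheckpointPolicy`, `Node`, and `should_checkpoint` are illustrative names, not a real framework API:

```python
# Sketch: per-node checkpoint policy declaration plus the rule of thumb.
from dataclasses import dataclass
from enum import Enum

class CheckpointPolicy(Enum):
    ALWAYS = "always"                      # maximum safety, prohibitive cost
    ON_SUCCESS = "on-success"              # e.g. after expensive sub-agent calls
    ON_BLOCKING_WAIT = "on-blocking-wait"  # mandatory before Flavor-2 waits
    NEVER = "never"                        # cheap, easily recomputed nodes

@dataclass
class Node:
    name: str
    policy: CheckpointPolicy
    redo_cost_ms: float        # estimated cost of recomputing this node
    checkpoint_cost_ms: float  # estimated cost of writing the checkpoint

def should_checkpoint(node: Node, succeeded: bool, entering_blocking_wait: bool) -> bool:
    """Apply the declared policy, plus the redo > checkpoint rule of thumb."""
    if node.policy is CheckpointPolicy.ALWAYS:
        return True
    if node.policy is CheckpointPolicy.ON_BLOCKING_WAIT:
        return entering_blocking_wait
    if node.policy is CheckpointPolicy.ON_SUCCESS:
        return succeeded and node.redo_cost_ms > node.checkpoint_cost_ms
    return False
```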
Delta checkpoints
A full snapshot is 50-200KB. A delta is typically 2-10KB. Delta checkpointing alone cuts checkpoint cost ~10×.
```mermaid
flowchart TB
T0[T0 - workflow start] --> FS0[Full snapshot v0]
FS0 --> N1[Node 1 finishes]
N1 --> D1[Delta v0->v1: changed fields only]
D1 --> N2[Node 2 finishes]
N2 --> D2[Delta v1->v2]
D2 --> N3[Node 3: blocking wait]
N3 --> FS3[Full snapshot v3 - mandatory at boundary]
FS3 --> N4[Node 4 - resume site]
```
- Full snapshot at start + at blocking boundaries.
- Deltas in between.
- Resume reads (latest_full + all deltas after it) and reconstructs.
- Rule: if accumulated deltas > 2× the size of a full snapshot, force a new full snapshot.
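A compact sketch of this scheme, assuming deltas are shallow field overwrites and sizes are approximated with `len(str(...))`; `CheckpointLog` and its helpers are illustrative names, and storage is an in-memory list standing in for DDB rows:

```python
# Sketch: full + delta checkpointing with reconstruction on resume.
import copy

FULL_FORCE_RATIO = 2.0  # force a full snapshot when deltas exceed 2x a full

class CheckpointLog:
    def __init__(self, initial_state: dict):
        # Full snapshot at workflow start.
        self.entries: list[tuple[str, dict]] = [("full", copy.deepcopy(initial_state))]

    def _since_last_full(self) -> tuple[dict, list[dict]]:
        idx = max(i for i, (kind, _) in enumerate(self.entries) if kind == "full")
        return self.entries[idx][1], [p for _, p in self.entries[idx + 1:]]

    def record(self, state: dict, changed_fields: dict, at_boundary: bool = False) -> None:
        full, deltas = self._since_last_full()
        pending = sum(len(str(d)) for d in deltas) + len(str(changed_fields))
        if at_boundary or pending > FULL_FORCE_RATIO * len(str(full)):
            self.entries.append(("full", copy.deepcopy(state)))   # resume site
        else:
            self.entries.append(("delta", dict(changed_fields)))  # changed fields only

    def reconstruct(self) -> dict:
        # Resume reads latest_full + all deltas after it.
        full, deltas = self._since_last_full()
        state = copy.deepcopy(full)
        for delta in deltas:
            state.update(delta)  # deltas are shallow field overwrites in this sketch
        return state
```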
Serving — the topology
```mermaid
flowchart TB
subgraph Edge[Edge]
ALB[Application Load Balancer]
WS[WebSocket Front-Door]
end
subgraph Hot[Hot fleet - warm floor]
F1[Fargate task 1<br/>model adapters loaded]
F2[Fargate task 2]
F3[Fargate task N]
end
subgraph Burst[Burst fleet - on-demand]
B1[Auto-scaled Fargate]
end
subgraph Models[Model serving]
BR[Bedrock - managed FMs]
SM[SageMaker - in-house FMs]
Cache[Inference cache - fingerprint to response]
end
ALB --> WS
WS --> F1
WS --> F2
WS --> F3
WS -.overflow.-> B1
F1 --> BR
F1 --> SM
F1 --> Cache
B1 --> BR
B1 --> SM
B1 --> Cache
```
The warm floor
- Sized to handle 150% of trailing 7-day p95 traffic continuously.
- Always has model adapters loaded, prompt caches warm, connections pooled.
- Cost-of-idle: ~14% of compute spend. Worth it because cold-start at peak is a 600-900ms first-token spike.
The burst fleet
- Auto-scales on `concurrent_active_workflows` and `queue_depth`, not raw CPU.
- Cold start is ~22 seconds (container + model adapter load + cache prefill).
- Pre-warm trigger: when warm-floor utilization exceeds 70%, spin up burst capacity before it's needed (sketched after this list).
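A sketch of the trigger logic under the numbers above; `WORKFLOWS_PER_TASK` and `QUEUE_DEPTH_PER_TASK` are assumed sizing constants, not production values:

```python
# Sketch: the pre-warm trigger. Scales on workflow concurrency and queue
# depth rather than CPU, and asks for burst tasks at 70% warm-floor
# utilization so the ~22s cold start happens before capacity is needed.
PREWARM_UTILIZATION = 0.70
WORKFLOWS_PER_TASK = 20    # assumption: concurrent workflows one task serves
QUEUE_DEPTH_PER_TASK = 4   # assumption: queued workflows one new task absorbs

def desired_burst_tasks(active_workflows: int, queue_depth: int,
                        warm_floor_tasks: int) -> int:
    warm_capacity = warm_floor_tasks * WORKFLOWS_PER_TASK
    utilization = active_workflows / max(warm_capacity, 1)
    if utilization < PREWARM_UTILIZATION and queue_depth == 0:
        return 0  # warm floor absorbs this; burst fleet can drain
    overflow = max(active_workflows - warm_capacity, 0)
    needed = -(-overflow // WORKFLOWS_PER_TASK)       # ceil division
    needed += -(-queue_depth // QUEUE_DEPTH_PER_TASK)
    return max(needed, 1)  # past the trigger, keep at least one task warming
```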
Inference cache
- Fingerprint = `(prompt_hash, model_id, sampling_params_hash, capability_flags)` (a sketch follows this list).
- Hit rate: ~38% on the planner node, ~12% on sub-agents (less repeatable inputs), ~62% on the safety pre-check.
- Cost saved: ~24% of total LLM spend.
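A sketch of how the fingerprint might be computed; only the field list comes from above, the SHA-256-over-canonical-JSON scheme is an assumption:

```python
# Sketch: inference-cache fingerprint over the four fields named above.
import hashlib
import json

def fingerprint(prompt: str, model_id: str, sampling_params: dict,
                capability_flags: list[str]) -> str:
    parts = {
        "prompt_hash": hashlib.sha256(prompt.encode()).hexdigest(),
        "model_id": model_id,
        "sampling_params_hash": hashlib.sha256(
            json.dumps(sampling_params, sort_keys=True).encode()).hexdigest(),
        "capability_flags": sorted(capability_flags),  # order-insensitive
    }
    return hashlib.sha256(json.dumps(parts, sort_keys=True).encode()).hexdigest()
```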
Model rotation — the hard part
Bedrock, like any provider, periodically deprecates and replaces models. Acceptance criterion #4 above says: rotate without disruption.
The rotation playbook
```mermaid
sequenceDiagram
participant Ops
participant ServingFleet
participant Eval
participant InflightWorkflows
Ops->>ServingFleet: announce rotation Sonnet 4.6 -> 4.7
ServingFleet->>ServingFleet: dual-load both adapters on warm floor
Ops->>Eval: start shadow eval on 4.7 against 4.6
Eval-->>Ops: shadow eval green after 24h
Ops->>ServingFleet: canary 5pct traffic to 4.7
ServingFleet->>ServingFleet: route by user_id hash
Ops->>Eval: monitor canary online evals
Eval-->>Ops: canary green after 24h
Ops->>ServingFleet: ramp 25 / 50 / 100pct over 4h
ServingFleet->>InflightWorkflows: NO change to in-flight - they finish on their started model
ServingFleet->>ServingFleet: drain 4.6 over 7 days, then unload adapter
```
Key rules:
- In-flight workflows finish on the model they started with. Switching models mid-workflow breaks reasoning continuity and invalidates the inference cache. The `model_id` is captured at workflow start and is part of the state row (sketched after this list).
- Dual-loading both adapters during the transition uses ~30% extra GPU memory on the adapter-serving fleet. We provision for this; it's a known budget.
- The eval framework gates each step: shadow → canary → ramp, each advancing only on a green eval.
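A sketch of the first rule, pinning `model_id` at workflow start; `ACTIVE_MODEL` and `WorkflowState` are illustrative names:

```python
# Sketch: model_id is captured once at workflow start and stored in the
# state row; every later call, including resume, uses the pinned value.
from dataclasses import dataclass

ACTIVE_MODEL = "sonnet-4.7"  # flipped by the rotation ramp, new workflows only

@dataclass
class WorkflowState:
    user_id: str
    session_id: str
    model_id: str  # part of the state row, never rewritten mid-flight

def start_workflow(user_id: str, session_id: str) -> WorkflowState:
    return WorkflowState(user_id, session_id, model_id=ACTIVE_MODEL)

def model_for(workflow: WorkflowState) -> str:
    # Resume and every node call route here, not to ACTIVE_MODEL.
    return workflow.model_id
```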
Checkpoint cost math (at MangaAssist scale)
| Approach | Writes/day | Bytes/day | $/day on DDB | Notes |
|---|---|---|---|---|
| Every node, full snapshot | 2.4B × 9 = 21.6B | ~1 PB | ~$25K | prohibitive |
| Every node, delta | 21.6B | ~110 TB | ~$2.5K | still wasteful |
| Boundary-only, full+delta hybrid | ~3.2B | ~16 TB | ~$370 | actual production policy |
| Boundary-only, no deltas | ~720M | ~36 TB | ~$830 | resume slower (recompute mid-workflow) |
The hybrid is the sweet spot. Checkpoint where redo is expensive; delta where the change is small; full only at the resume sites.
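The table rows can be reproduced in a few lines, using decimal GB to match the ~$0.023/GB rate above (the ~5 KB average delta size is an assumption consistent with the 2-10KB range given earlier):

```python
# Sketch: reproducing the cost table. 2.4B turns/day x ~9 nodes,
# ~50 KB full snapshots, ~5 KB deltas, ~$0.023/GB as stated above.
PRICE_PER_GB = 0.023
GB = 1e9  # decimal GB, matching the per-GB rate

def dollars_per_day(writes_per_day: float, avg_bytes: float) -> float:
    return writes_per_day * avg_bytes / GB * PRICE_PER_GB

print(dollars_per_day(2.4e9 * 9, 50_000))  # every node, full:    ~$24.8K
print(dollars_per_day(2.4e9 * 9, 5_000))   # every node, delta:   ~$2.5K
print(dollars_per_day(3.2e9, 5_000))       # boundary hybrid:     ~$368
print(dollars_per_day(720e6, 50_000))      # boundary, no deltas: ~$828
```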
Pitfalls
Pitfall 1 — Synchronous checkpoints blocking the user stream
A naive implementation `await`s `write_checkpoint()` on the request thread. Now p99 latency includes the p99 DDB write tail.
Fix: checkpoints are asynchronous writes, fire-and-forget relative to the user stream, with two reliability properties (see the sketch below):
- The write is acknowledged BEFORE the next critical action (write-side skill, send-to-user).
- A failed checkpoint downgrades the workflow to "non-resumable for this turn": the turn completes, but a deploy during this turn would lose it. An acceptable trade-off for keeping latency bounded.
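A sketch of the pattern, assuming asyncio; `write_checkpoint_to_store` and `TurnContext` are illustrative stand-ins for the real DDB write and request state:

```python
# Sketch: async checkpoint writes that never sit on the streaming path,
# with an explicit barrier before critical actions.
import asyncio

async def write_checkpoint_to_store(blob: bytes) -> None:
    await asyncio.sleep(0.02)  # stand-in for the real DDB write

class TurnContext:
    def __init__(self):
        self.pending: asyncio.Task | None = None
        self.resumable = True

    def checkpoint_async(self, blob: bytes) -> None:
        # Fire-and-forget relative to the user stream: the streaming
        # path never awaits this directly.
        self.pending = asyncio.create_task(write_checkpoint_to_store(blob))

    async def barrier(self) -> None:
        # Called BEFORE any write-side skill or send-to-user: the write
        # must be acknowledged before the next critical action.
        if self.pending is None:
            return
        try:
            await self.pending
        except Exception:
            # Failed checkpoint: turn completes, but a deploy during this
            # turn would lose it. Known, accepted trade-off.
            self.resumable = False
```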
Pitfall 2 — Cold model load during peak
A burst container starts during a 5× traffic spike, takes 22 seconds to load adapters, and queues all incoming requests behind that load. Result: a latency spike.
Fix: the pre-warm trigger — at 70% warm-floor utilization, start burst containers before they're needed. The 22-second window is hidden inside spare capacity, not exposed to users. This is combined with a "hot-spare" pool that holds 3-5 fully loaded containers idle, ready to be promoted instantly.
Pitfall 3 — Inference cache poisoning
A prompt that produces a low-quality output gets cached. Now 38% of users see that low-quality output for a day.
Fix: cache eligibility includes eval signal. If the active eval (story 04) judges the output below the cache-eligibility threshold, it's not cached. Also, cache TTL is 4 hours by default — bounded blast radius even on the rare poisoned entry.
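A sketch of eval-gated admission plus the TTL bound; the 0.8 threshold is an assumed value, the real gate is the story-04 active eval:

```python
# Sketch: cache admission gated on the eval signal, with a bounded TTL.
import time

CACHE_TTL_S = 4 * 3600             # 4-hour default TTL from the text
CACHE_ELIGIBILITY_THRESHOLD = 0.8  # assumed threshold

_cache: dict[str, tuple[float, str]] = {}

def cache_put(key: str, response: str, eval_score: float) -> None:
    if eval_score < CACHE_ELIGIBILITY_THRESHOLD:
        return  # below the bar: never cached, poisoning stopped at the door
    _cache[key] = (time.time() + CACHE_TTL_S, response)

def cache_get(key: str) -> str | None:
    entry = _cache.get(key)
    if entry is None or entry[0] < time.time():
        _cache.pop(key, None)  # expired: bounded blast radius
        return None
    return entry[1]
```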
Pitfall 4 — Multi-version model concurrency drift
During a 7-day rotation, requests served by 4.6 and 4.7 produce slightly different outputs. The eval dashboard suddenly shows two distributions.
Fix: the dashboard partitions by model version. You see 4.6 and 4.7 separately, not a confusing mixture. The drift detector also partitions. This is necessary for the eval to remain interpretable during rotations.
Q&A drill — opening question
Q: Couldn't we just rely on Bedrock provisioned throughput and skip the warm-fleet machinery?
Bedrock provisioned throughput handles the model-call cold start. It does not handle:
- Container cold start (Fargate task spin-up).
- Application-level prompt cache warming.
- Connection pool warming for downstream tools.
- Skill registry hydration.
Each of these adds 100ms-2s on cold start. Bedrock alone removes the model-side cold start, which is necessary but insufficient. Both are needed.
Grilling — Round 1
Q1. Why use Fargate? Wouldn't EC2 with persistent processes be cheaper?
EC2 is cheaper per CPU-hour but more expensive in operational overhead — patching, AMI rotation, capacity planning. Fargate's per-task model maps cleanly onto our "stateless agent fleet" abstraction. The price delta is ~15-20%; we trade that for faster autoscale and zero AMI maintenance. As scale grows past where the delta becomes material (~$2M/year), we re-evaluate. Today's scale doesn't justify it.
Q2. How do you avoid checkpointing PII?
Two layers:
- Field-level redaction at write time. Conversation state contains user-supplied content; a PII detector runs inline; emails, phone numbers, and payment hints are tokenized.
- Encryption at rest with KMS keys scoped per region. State rows older than 7 days auto-delete via TTL.
The redaction layer is itself a skill (story 02) with its own eval suite for false-negative rate.
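A sketch of the first layer; the regexes are deliberately simplistic stand-ins for the real detector skill:

```python
# Sketch: field-level redaction applied to state before the checkpoint write.
import re

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)  # tokenize, don't delete
    return text

def redact_state_for_write(state: dict) -> dict:
    # Only string fields can carry user-supplied PII in this sketch;
    # structural fields pass through untouched.
    return {k: redact(v) if isinstance(v, str) else v for k, v in state.items()}
```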
Q3. Inference cache hit rate of 38% on planner sounds high — what's making prompts so repeatable?
Three contributors:
- Many users ask similar opening questions ("recommend something like X"); the planner's first call is essentially a routing decision over a small input distribution.
- The prompt prefix is highly stable (system prompt + skill registry summary + few-shot examples); only the user message varies. Our cache keys on the full prompt hash, and the provider additionally caches the stable prefix (Anthropic prompt caching).
- Capability-flag combinations are bounded; most users fall into 4-5 distinct flag profiles.
We don't try to cache more aggressively than the eval allows — see Pitfall 3.
Grilling — Round 2 (architect-level)
Q4. Walk me through what happens when a Fargate task crashes mid-stream.
- WebSocket front-door detects connection drop (TCP keepalive ~10s).
- The user's client auto-reconnects within 1-2s with the session token.
- New connection lands on a healthy container (load-balanced).
- New container looks up the workflow row by `(user_id, session_id)`; state is the latest checkpoint (boundary or delta).
- Workflow resumes from the last checkpointed node. If we were mid-stream, the user sees a brief "..." pause and then the response continues from the checkpoint.

Median customer-perceived disruption: ~3 seconds of wait, no lost message. Worst case (we crashed mid-skill-call without a checkpoint): the call is re-issued with an idempotency key; the result is cached or recomputed; ~2-5s of extra latency. We measure these as "recovery events" and budget them at ~0.4% of turns (see the sketch below).
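A sketch of that worst-case path, with a dict standing in for the idempotency store; `call_external_skill` is a hypothetical stand-in for the real side-effecting call:

```python
# Sketch: idempotent re-issue after a crash without a checkpoint.
_idempotency_store: dict[str, str] = {}

def call_external_skill(skill: str, args: dict) -> str:
    return f"result-of-{skill}"  # stand-in: the real side-effecting call

def call_skill_idempotent(skill: str, args: dict, idempotency_key: str) -> str:
    # After a crash-and-resume, re-issuing with the same key either returns
    # the stored result (first attempt landed) or recomputes exactly once.
    if idempotency_key not in _idempotency_store:
        _idempotency_store[idempotency_key] = call_external_skill(skill, args)
    return _idempotency_store[idempotency_key]
```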
Q5. How does checkpointing interact with prompt caching at the model provider level?
They're orthogonal, but both matter:
- The provider prompt cache holds the model's KV-cache for a stable prompt prefix; it saves cost and latency on subsequent calls with the same prefix.
- Our checkpoint captures the workflow state, including the prompt itself, so resume can re-issue the same prompt and hit the provider cache.
When a workflow resumes after >5 min (the Anthropic prompt cache TTL), the provider cache misses; this is fine: resumes are rare and the cost of one full-prompt call is bounded. We budget for this in the resume cost model.
Q6. A new graph version v18 has a different state shape than v17. We just rolled out v18. What about in-flight v17 workflows?
The state row has `graph_version`. Resume code is multi-version-capable for at least one prior version. Three patterns:
- In-flight v17 workflows finish on v17 logic. No mid-workflow upgrade.
- New workflows start on v18. Routing happens at workflow creation, not at resume.
- v17 logic is retained in the codebase until all v17 workflows have aged out (≤7 days).
Code stays around for ~2 weeks per version, with at most two versions in the active codepath (sketched below). This is the cost of safe rollouts, and we accept it.
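A sketch of version-dispatched resume; `run_v17`/`run_v18` are illustrative entry points for the retained codepaths:

```python
# Sketch: resume dispatch keyed on the state row's graph_version.
SUPPORTED = {"v17", "v18"}  # current plus at least one prior version

def run_v17(state_row: dict) -> None: ...
def run_v18(state_row: dict) -> None: ...

RESUME_ENTRYPOINTS = {"v17": run_v17, "v18": run_v18}

def start_workflow(state_row: dict) -> None:
    state_row["graph_version"] = "v18"  # routing happens at creation only
    run_v18(state_row)

def resume_workflow(state_row: dict) -> None:
    version = state_row["graph_version"]
    if version not in SUPPORTED:
        raise RuntimeError(f"{version} workflows should have aged out (<=7 days)")
    RESUME_ENTRYPOINTS[version](state_row)  # no mid-workflow upgrade
```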
Intuition gained
- Checkpoints are policy, not implementation. Pick boundaries where redo > checkpoint cost. Delta vs full is an optimization, not a philosophy.
- Warm floor + burst with pre-warm trigger is what hides the cold-start cost from users.
- In-flight workflows finish on their starting model. Mid-workflow model rotation is a footgun.
- Inference cache is a 24% cost lever but needs eval gating to avoid poisoning.
- Multi-version codepath is the cost of safe rollouts. Budget for it instead of avoiding it.
See also
- `05-pause-resume-workflows.md` — checkpoints are pause's storage primitive
- `07-sync-async-invocation.md` — when serving is sync vs queue-mediated
- `08-observability-cost-versioning-ratelimits.md` — partitioning dashboards by graph/model version
- `Changing-Constraints-Scenarios/02-foundation-model-deprecated.md` — rotation under forced deadline