
User Story 06 — Checkpointing and Serving

Pillar: P2 (Harness) · Stage unlocked: 3 → 4 · Reading time: ~12 min


TL;DR

Checkpointing is what to save (state granularity, frequency, deltas vs full snapshots). Serving is how to keep the agent fleet healthy under load (warm pools, cold-start avoidance, model loading, model rotation). At MangaAssist scale, getting these two right is the difference between sub-2s p95 with stable cost, and 12s p95 with cost spiking 4× during peak.


The User Story

As the platform engineer responsible for MangaAssist's serving fleet, I want explicit checkpoint policies for each graph node, paired with a serving topology that pre-warms expensive resources and rotates models without dropping inflight workflows, so that the system can deploy 4× per day, scale 5× during anime-release peaks, and rotate foundation models without customer-visible latency or correctness regressions.

Acceptance criteria

  1. Every node declares whether it checkpoints (always / on-success / on-blocking-wait / never).
  2. Checkpoint write p99 < 80 ms; never blocks the user's stream.
  3. The serving fleet keeps a warm floor of containers + model adapters sized to absorb 60-second traffic doubling without cold-start spike.
  4. Model swap (e.g., Sonnet 4.6 → 4.7) rolls out across the fleet over 30 minutes with no in-flight workflow disruption.
  5. Checkpoint storage cost is bounded — not every node, not every turn writes a full snapshot.

Checkpointing — the granularity question

The naive approach: checkpoint after every node. The cost is catastrophic: 2.4B turns/day × ~9 nodes × ~50KB blob ≈ 1 PB of DDB writes per day, which at ~$0.023/GB is roughly $25K/day (see the cost table below).

The right approach: checkpoint at meaningful boundaries.

| Boundary | Why checkpoint here | Cost |
|---|---|---|
| Before any blocking call (Flavor-2 wait) | Necessary for resume | Mandatory |
| After expensive sub-agent invocations | Avoid recompute on resume | Conditional |
| Before write-side skills (place order, send mail) | Idempotency boundary | Mandatory |
| At graph version transitions | For safe rollback | Mandatory |
| After every node | Maximum safety, prohibitive cost | Avoid |
| Periodic time-based (every 30s of wall time) | Catches in-progress long compositions | Conditional |

Rule of thumb: checkpoint where the cost of redo > the cost of the checkpoint.
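Acceptance criterion #1 says every node declares its policy. A minimal sketch of what that declaration could look like, with hypothetical node names (`planner`, `place_order`, etc. are illustrative, not MangaAssist's real graph):

```python
from dataclasses import dataclass
from enum import Enum

class CheckpointPolicy(Enum):
    ALWAYS = "always"
    ON_SUCCESS = "on-success"
    ON_BLOCKING_WAIT = "on-blocking-wait"
    NEVER = "never"

@dataclass
class NodeSpec:
    name: str
    policy: CheckpointPolicy

# Hypothetical graph declaration: every node states its checkpoint policy.
GRAPH = [
    NodeSpec("planner", CheckpointPolicy.ON_SUCCESS),
    NodeSpec("flavor2_wait", CheckpointPolicy.ON_BLOCKING_WAIT),
    NodeSpec("place_order", CheckpointPolicy.ALWAYS),    # write-side skill
    NodeSpec("render_reply", CheckpointPolicy.NEVER),    # cheap to redo
]

def should_checkpoint(node: NodeSpec, succeeded: bool, blocking: bool) -> bool:
    """Evaluate a node's declared policy after it runs."""
    if node.policy is CheckpointPolicy.ALWAYS:
        return True
    if node.policy is CheckpointPolicy.ON_SUCCESS:
        return succeeded
    if node.policy is CheckpointPolicy.ON_BLOCKING_WAIT:
        return blocking
    return False
```

Making the policy an explicit, enumerable field (rather than ad-hoc calls scattered through node code) is what lets you audit criterion #1 mechanically.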


Delta checkpoints

A full snapshot is 50-200KB. A delta is typically 2-10KB. Delta checkpointing alone cuts checkpoint cost ~10×.

```mermaid
flowchart TB
  T0[T0 - workflow start] --> FS0[Full snapshot v0]
  FS0 --> N1[Node 1 finishes]
  N1 --> D1[Delta v0->v1: changed fields only]
  D1 --> N2[Node 2 finishes]
  N2 --> D2[Delta v1->v2]
  D2 --> N3[Node 3: blocking wait]
  N3 --> FS3[Full snapshot v3 - mandatory at boundary]
  FS3 --> N4[Node 4 - resume site]
```

  • Full snapshot at start + at blocking boundaries.
  • Deltas in between.
  • Resume reads (latest_full + all deltas after it) and reconstructs.
  • Rule: if accumulated deltas > 2× the size of a full snapshot, force a new full snapshot.
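The reconstruction and compaction rules above can be sketched in a few lines. This assumes state is a flat dict and deltas carry changed top-level fields only; nested-merge semantics would need a deeper merge:

```python
import copy

def reconstruct(latest_full: dict, deltas: list[dict]) -> dict:
    """Rebuild workflow state: latest full snapshot + all deltas after it."""
    state = copy.deepcopy(latest_full)   # don't mutate the stored snapshot
    for delta in deltas:                 # each delta holds changed fields only
        state.update(delta)
    return state

def needs_full_snapshot(full_size_bytes: int, delta_sizes: list[int]) -> bool:
    """Force a new full snapshot once accumulated deltas exceed 2x a full one."""
    return sum(delta_sizes) > 2 * full_size_bytes
```

The 2× rule bounds resume latency: reads never replay an unbounded delta chain, and storage stays within a constant factor of a single snapshot.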

Serving — the topology

```mermaid
flowchart TB
  subgraph Edge[Edge]
    ALB[Application Load Balancer]
    WS[WebSocket Front-Door]
  end

  subgraph Hot[Hot fleet - warm floor]
    F1[Fargate task 1<br/>model adapters loaded]
    F2[Fargate task 2]
    F3[Fargate task N]
  end

  subgraph Burst[Burst fleet - on-demand]
    B1[Auto-scaled Fargate]
  end

  subgraph Models[Model serving]
    BR[Bedrock - managed FMs]
    SM[SageMaker - in-house FMs]
    Cache[Inference cache - fingerprint to response]
  end

  ALB --> WS
  WS --> F1
  WS --> F2
  WS --> F3
  WS -.overflow.-> B1
  F1 --> BR
  F1 --> SM
  F1 --> Cache
  B1 --> BR
  B1 --> SM
  B1 --> Cache
```

The warm floor

  • Sized to handle 150% of trailing 7-day p95 traffic continuously.
  • Always has model adapters loaded, prompt caches warm, connections pooled.
  • Cost-of-idle: ~14% of compute spend. Worth it because cold-start at peak is a 600-900ms first-token spike.

The burst fleet

  • Auto-scales on concurrent_active_workflows and queue_depth, not raw CPU.
  • Cold start ~22 seconds (container + model adapter load + cache prefill).
  • Pre-warm trigger: when warm-floor utilization exceeds 70%, spin burst capacity before it's needed.
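One way to express the pre-warm trigger as a scaling rule. The sizing formula (bring projected utilization back under the 70% trigger) is an assumption for illustration, not the production autoscaler:

```python
import math

WARM_TRIGGER = 0.70   # pre-warm threshold from the serving policy above

def burst_tasks_to_start(warm_utilization: float, warm_size: int,
                         burst_running: int) -> int:
    """Hypothetical pre-warm sizing: above the 70% trigger, start enough
    burst tasks that projected utilization across the combined fleet drops
    back under the trigger, hiding the ~22s cold start in spare capacity."""
    if warm_utilization <= WARM_TRIGGER:
        return 0
    target_fleet = math.ceil(warm_size * warm_utilization / WARM_TRIGGER)
    return max(target_fleet - warm_size - burst_running, 0)
```

Keying this off concurrent workflows rather than CPU (per the first bullet) matters because agent workloads are I/O-bound on model calls; CPU stays flat while the fleet saturates.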

Inference cache

  • Fingerprint = (prompt_hash, model_id, sampling_params_hash, capability_flags).
  • Hit rate: ~38% on the planner node, ~12% on sub-agents (less repeatable inputs), ~62% on the safety pre-check.
  • Cost saved: ~24% of total LLM spend.
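A minimal sketch of the fingerprint computation. Hashing a canonical JSON form of the sampling params and sorting the capability flags keeps the key stable across dict/list ordering; the exact serialization here is an assumption:

```python
import hashlib
import json

def cache_fingerprint(prompt: str, model_id: str,
                      sampling_params: dict, capability_flags: list[str]) -> str:
    """Cache key = (prompt_hash, model_id, sampling_params_hash, capability_flags)."""
    parts = [
        hashlib.sha256(prompt.encode()).hexdigest(),
        model_id,
        hashlib.sha256(
            json.dumps(sampling_params, sort_keys=True).encode()
        ).hexdigest(),
        ",".join(sorted(capability_flags)),
    ]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()
```

Including model_id in the key is what makes the cache safe during rotations: 4.6 and 4.7 responses never collide, and (per the rotation rules below) a model swap naturally invalidates the old entries.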

Model rotation — the hard part

Bedrock or any provider periodically deprecates / replaces models. Story-level acceptance #4 says: rotate without disruption.

The rotation playbook

```mermaid
sequenceDiagram
  participant Ops
  participant ServingFleet
  participant Eval
  participant InflightWorkflows

  Ops->>ServingFleet: announce rotation Sonnet 4.6 -> 4.7
  ServingFleet->>ServingFleet: dual-load both adapters on warm floor
  Ops->>Eval: start shadow eval on 4.7 against 4.6
  Eval-->>Ops: shadow eval green after 24h
  Ops->>ServingFleet: canary 5pct traffic to 4.7
  ServingFleet->>ServingFleet: route by user_id hash
  Ops->>Eval: monitor canary online evals
  Eval-->>Ops: canary green after 24h
  Ops->>ServingFleet: ramp 25 / 50 / 100pct over 4h
  ServingFleet->>InflightWorkflows: NO change to in-flight - they finish on their started model
  ServingFleet->>ServingFleet: drain 4.6 over 7 days, then unload adapter
```

Key rules:

  • In-flight workflows finish on the model they started with. Switching models mid-workflow breaks reasoning continuity and invalidates the inference cache. The model_id is captured at workflow start and is part of the state row.
  • Dual-loading both adapters during the transition uses ~30% extra GPU memory on the adapter-serving fleet. We provision for this; it's a known budget.
  • The eval framework gates each step. Shadow → canary → ramp, each with a green eval.
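The model-pinning rule reduces to capturing model_id once at workflow creation and never consulting the fleet default again. A sketch, with `ACTIVE_MODEL` and the field names as illustrative assumptions:

```python
import time
from dataclasses import dataclass, field

ACTIVE_MODEL = "sonnet-4.7"   # fleet-level default; updated during a rotation

@dataclass
class WorkflowState:
    user_id: str
    model_id: str               # pinned at start; persisted in the state row
    started_at: float = field(default_factory=time.time)

def start_workflow(user_id: str) -> WorkflowState:
    # Capture the active model exactly once, at workflow creation.
    return WorkflowState(user_id=user_id, model_id=ACTIVE_MODEL)

def model_for_turn(state: WorkflowState) -> str:
    # Every subsequent turn reads the pinned value, never the fleet default,
    # so a mid-workflow rotation cannot change the serving model.
    return state.model_id
```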


Checkpoint cost math (at MangaAssist scale)

| Approach | Writes/day | Bytes/day | $/day on DDB | Notes |
|---|---|---|---|---|
| Every node, full snapshot | 2.4B × 9 = 21.6B | ~1 PB | ~$25K | prohibitive |
| Every node, delta | 21.6B | ~110 TB | ~$2.5K | still wasteful |
| Boundary-only, full+delta hybrid | ~3.2B | ~16 TB | ~$370 | actual production policy |
| Boundary-only, no deltas | ~720M | ~36 TB | ~$830 | resume slower (recompute mid-workflow) |

The hybrid is the sweet spot. Checkpoint where redo is expensive; delta where the change is small; full only at the resume sites.


Pitfalls

Pitfall 1 — Synchronous checkpoints blocking the user stream

A naive implementation awaits write_checkpoint() on the request thread. Now p99 latency includes the p99 DDB write tail.

Fix: checkpoints are async writes with a fire-and-forget pattern, with two reliability properties:

  • The write is acknowledged BEFORE the next critical action (write-side skill, send-to-user).
  • A failed checkpoint downgrades the workflow to "non-resumable for this turn" — the turn completes, but a deploy during this turn would lose it. Acceptable trade-off for keeping latency bounded.
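A minimal asyncio sketch of this pattern. The DDB put is stubbed out, and `barrier()` stands in for the pre-critical-action acknowledgment point; the real write path and error taxonomy are assumptions:

```python
import asyncio

class CheckpointWriter:
    """Fire-and-forget checkpoint writes that never block the token stream,
    but are awaited (acknowledged) before the next critical action."""

    def __init__(self):
        self._pending: asyncio.Task | None = None
        self.resumable = True

    async def _write(self, blob: bytes) -> None:
        await asyncio.sleep(0)   # stand-in for the actual DynamoDB put

    def checkpoint(self, blob: bytes) -> None:
        # Fire-and-forget: the request path keeps streaming to the user.
        self._pending = asyncio.ensure_future(self._write(blob))

    async def barrier(self) -> None:
        # Called before a write-side skill or send-to-user: ack required here.
        if self._pending is None:
            return
        try:
            await self._pending
        except Exception:
            # Failed write: the turn completes, but this turn is not
            # resumable across a deploy. Latency stays bounded.
            self.resumable = False
```

The key design point: latency is paid only at the barrier, and only up to the slowest in-flight write, instead of on every checkpoint.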

Pitfall 2 — Cold model load during peak

Burst container starts during a 5× traffic spike, takes 22 seconds to load adapters, queues all incoming requests behind that. Latency spike.

Fix: the pre-warm trigger — at 70% warm-floor utilization, start burst containers before they're needed. The 22-second window is hidden inside spare capacity, not exposed to users. Combined with a "hot-spare" pool that holds 3-5 fully-loaded containers idle, ready to be promoted instantly.

Pitfall 3 — Inference cache poisoning

A prompt that produces a low-quality output gets cached. Now 38% of users see that low-quality output for a day.

Fix: cache eligibility includes eval signal. If the active eval (story 04) judges the output below the cache-eligibility threshold, it's not cached. Also, cache TTL is 4 hours by default — bounded blast radius even on the rare poisoned entry.

Pitfall 4 — Multi-version model concurrency drift

During a 7-day rotation, requests served by 4.6 and 4.7 produce slightly different outputs. The eval dashboard suddenly shows two distributions.

Fix: the dashboard partitions by model version. You see 4.6 and 4.7 separately, not a confusing mixture. The drift detector also partitions. This is necessary for the eval to remain interpretable during rotations.


Q&A drill — opening question

Q: Couldn't we just rely on Bedrock provisioned throughput and skip the warm-fleet machinery?

Bedrock provisioned throughput handles the model-call cold start. It does not handle:

  • Container cold start (Fargate task spin-up).
  • Application-level prompt cache warming.
  • Connection pool warming for downstream tools.
  • Skill registry hydration.

Each of these adds 100ms-2s on cold start. Bedrock alone removes the model-side cold start, which is necessary but insufficient. Both are needed.


Grilling — Round 1

Q1. Why use Fargate? Wouldn't EC2 with persistent processes be cheaper?

EC2 is cheaper per CPU-hour but more expensive in operational overhead — patching, AMI rotation, capacity planning. Fargate's per-task model maps cleanly onto our "stateless agent fleet" abstraction. The price delta is ~15-20%; we trade that for faster autoscale and zero AMI maintenance. As scale grows past where the delta becomes material (~$2M/year), we re-evaluate. Today's scale doesn't justify it.

Q2. How do you avoid checkpointing PII?

Two layers:

  • Field-level redaction at write time. Conversation state contains user-supplied content; the PII detector runs on the fly; emails/phones/payment hints are tokenized.
  • Encryption at rest with KMS keys scoped per region. State rows older than 7 days auto-delete via TTL.

The redaction layer is itself a skill (story 02) with its own eval suite for false-negative rate.
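To make the write-time tokenization concrete, a deliberately simplified regex sketch. The real detector is a skill with an evaluated false-negative rate; these two patterns are assumptions that catch only the easy cases:

```python
import re

# Illustrative patterns only: production PII detection needs far more
# than two regexes (names, addresses, payment hints, locale variants).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    """Tokenize obvious PII before the state row is written."""
    text = EMAIL.sub("<EMAIL>", text)
    text = PHONE.sub("<PHONE>", text)
    return text
```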

Q3. Inference cache hit rate of 38% on planner sounds high — what's making prompts so repeatable?

Three contributors:

  • Many users ask similar opening questions ("recommend something like X"); the planner's first call is essentially a routing decision over a small input distribution.
  • The prompt prefix is highly stable (system prompt + skill registry summary + few-shot); only the user message varies. Our cache works on the full prompt hash, but the model side caches the prefix automatically (Anthropic prompt caching).
  • Capability-flag combinations are bounded; most users fall into 4-5 distinct flag profiles.

We don't try to cache more aggressively than the eval allows — see Pitfall 3.


Grilling — Round 2 (architect-level)

Q4. Walk me through what happens when a Fargate task crashes mid-stream.

  1. WebSocket front-door detects connection drop (TCP keepalive ~10s).
  2. The user's client auto-reconnects within 1-2s with the session token.
  3. New connection lands on a healthy container (load-balanced).
  4. New container looks up the workflow row by (user_id, session_id). State is the latest checkpoint (boundary or delta).
  5. Workflow resumes from last checkpointed node. If we were mid-stream, the user sees a brief "..." pause and then the response continues from the checkpoint.

Median customer-perceived disruption: ~3 seconds of wait, no lost message. Worst case (we crashed mid-skill-call without checkpoint): the call is re-issued with idempotency key; result is cached or recomputed; ~2-5s extra latency. We measure these as "recovery events" and budget them as ~0.4% of turns.

Q5. How does checkpointing interact with prompt caching at the model provider level?

They're orthogonal, but both matter:

  • The provider prompt cache caches the model's KV-cache for a stable prompt prefix; it saves cost and latency on subsequent calls with the same prefix.
  • Our checkpoint captures the workflow state, including the prompt itself, so resume can re-issue the same prompt and hit the provider cache.

When a workflow resumes after >5 min (Anthropic prompt cache TTL), the provider cache misses; this is fine, the resume itself is rare and the cost of one full-prompt call is bounded. We budget for this in the resume cost model.

Q6. A new graph version v18 has a different state shape than v17. We just rolled out v18. What about in-flight v17 workflows?

The state row has graph_version. Resume code is multi-version-capable for at least one prior version. Three patterns:

  • In-flight v17 workflows finish on v17 logic. No mid-workflow upgrade.
  • New workflows start on v18. Routing happens at workflow creation, not at resume.
  • v17 logic is retained in the codebase until all v17 workflows have aged out (≤7 days).

Code stays around for ~2 weeks per version, two versions in active codepath at most. This is the cost of safe rollouts, and we accept it.
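The resume-side dispatch is a small version table. A sketch, with trivial stand-in handlers for the real v17/v18 graph logic:

```python
# Hypothetical multi-version resume dispatch: the state row's graph_version
# routes to the matching logic; at most two versions live in the codepath.
def run_v17(state: dict) -> str:
    return "v17:" + state["node"]

def run_v18(state: dict) -> str:
    return "v18:" + state["node"]

RESUME_HANDLERS = {17: run_v17, 18: run_v18}

def resume(state_row: dict) -> str:
    version = state_row["graph_version"]
    handler = RESUME_HANDLERS.get(version)
    if handler is None:
        # Version aged out (>7 days): the row should already be TTL-deleted.
        raise RuntimeError(f"graph version {version} aged out; cannot resume")
    return handler(state_row)
```

Dropping a version from `RESUME_HANDLERS` is the explicit, reviewable act of retiring it, which is easier to audit than deleting scattered branches.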


Intuition gained

  • Checkpoints are policy, not implementation. Pick boundaries where redo > checkpoint cost. Delta vs full is an optimization, not a philosophy.
  • Warm floor + burst with pre-warm trigger is what hides the cold-start cost from users.
  • In-flight workflows finish on their starting model. Mid-workflow model rotation is a footgun.
  • Inference cache is a 24% cost lever but needs eval gating to avoid poisoning.
  • Multi-version codepath is the cost of safe rollouts. Budget for it instead of avoiding it.

See also

  • 05-pause-resume-workflows.md — checkpoints are pause's storage primitive
  • 07-sync-async-invocation.md — when serving is sync vs queue-mediated
  • 08-observability-cost-versioning-ratelimits.md — partitioning dashboards by graph/model version
  • Changing-Constraints-Scenarios/02-foundation-model-deprecated.md — rotation under forced deadline