User Story 07 — Sync vs Async Invocation
Pillar: P2 (Harness) · Stage unlocked: 2 → 3 · Reading time: ~10 min
TL;DR
A user typing in MangaAssist expects an answer in seconds. A nightly enrichment job processing 4M titles through the same agent has no such expectation. The same agent runtime serves both — the difference is invocation mode. Getting the sync ↔ async boundary right is the single largest determinant of whether your harness scales economically.
The User Story
As the architect of MangaAssist's runtime, I want the same agent definition to be invokable in streaming-sync, async-queue, batch-window, and scheduled-trigger modes, so that chat, voice, daily personalization, weekly review summaries, and backfills all reuse the same agent logic with mode-appropriate latency and cost characteristics.
Acceptance criteria
- The agent's behavior is mode-independent — the same graph, same skills, same evals.
- Mode selection is policy at invocation, not a code-level branch.
- Streaming-sync emits first token within p95 1.4s; async-queue holds queue depth + handler concurrency as the SLO; batch hits throughput targets.
- Misuse (long-running task invoked sync) fails fast with a typed error, not by hanging the user's connection.
- Cross-mode metrics are unified — same eval, same observability — partitioned by mode.
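The fail-fast criterion can be made concrete with a small sketch (names are illustrative, not the actual MangaAssist SDK):

```python
class InvocationModeError(Exception):
    """Typed error raised when a task is invoked in an incompatible mode."""
    def __init__(self, task: str, mode: str, limit_ms: int):
        super().__init__(
            f"task {task!r} exceeds the {limit_ms}ms budget for mode {mode!r}; "
            "submit it via the async-queue API instead"
        )
        self.task, self.mode, self.limit_ms = task, mode, limit_ms

SYNC_BUDGET_MS = 6_000  # illustrative hard budget for streaming-sync

def check_mode(task_name: str, estimated_ms: int, mode: str) -> None:
    # Reject before opening a stream, instead of hanging the user's connection.
    if mode == "streaming-sync" and estimated_ms > SYNC_BUDGET_MS:
        raise InvocationModeError(task_name, mode, SYNC_BUDGET_MS)
```

The point is that the rejection happens at invocation time, with a typed error the caller can branch on, rather than as a timeout minutes later.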
The four invocation modes
```mermaid
flowchart LR
subgraph SYNC[Streaming-sync]
A1[User keystroke] --> B1[WebSocket]
B1 --> C1[Agent runtime]
C1 --> D1[Token-by-token stream back]
end
subgraph QUEUE[Async-queue]
A2[Caller] --> B2[Submit job]
B2 --> C2[(Queue)]
C2 --> D2[Agent worker pool]
D2 --> E2[Result store]
E2 --> F2[Webhook or polling]
end
subgraph BATCH[Batch-window]
A3[Scheduler] --> B3[Spawn batch driver]
B3 --> C3[Read shard manifest]
C3 --> D3[Fan out to worker pool]
D3 --> E3[Aggregate results to S3]
end
subgraph SCHED[Scheduled-trigger]
A4[Cron / EventBridge] --> B4[Single agent run]
B4 --> C4[Result to durable target]
end
```
When to use which
| Use case | Mode | Why |
|---|---|---|
| Live chat | streaming-sync | User waits; first-token matters; full-response budget 6s |
| Voice agent | streaming-sync | Same as chat, even tighter budget on first audio chunk |
| New-release email personalization | async-queue | User isn't waiting; can take 30-90s per user |
| Nightly catalog enrichment | batch-window | 4M items; throughput > latency |
| Monthly subscriber retention summary | scheduled-trigger | Per-user; runs once a month at user's local 9am |
| Refund decision waiting on fraud team | async-queue with pause | See story 05 (pause/resume) |
| Search relevance offline rescore | batch-window | Massive throughput, eventual consistency |
What's actually different across modes
The same agent code runs in all four modes. The harness configures these knobs per mode:
| Knob | Streaming-sync | Async-queue | Batch | Scheduled |
|---|---|---|---|---|
| Concurrency model | per-connection | worker pool with backpressure | shard-based | one-shot |
| Latency budget | hard p95 6s | soft, e.g. 90s p95 | n/a (throughput target) | soft |
| Output transport | WebSocket stream | result store + webhook | S3 manifest + Glue catalog | direct write |
| Failure handling | retry once + escape hatch | DLQ + alert | shard-level retry | re-fire next cycle |
| Cost ceiling | per-turn tight | per-job medium | per-job high (parallelism) | per-run high |
| Eval sampling | 1% online | 5% per shard | 100% (offline eval) | 100% |
| Idempotency | session-scoped | job-id-scoped | row-id-scoped | run-id-scoped |
| Pause/resume role | container-restart only | full Flavor-2 wait possible | shard restart | re-fire |
The agent doesn't know which mode it's in. It just runs the graph. The harness provides the surrounding mode-specific behavior.
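The knob table above translates naturally into harness-side config. A minimal sketch (values echo the table; the names and structure are illustrative, not the production config):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ModeConfig:
    """Harness-side knobs; the agent graph never reads the mode name itself."""
    latency_budget_ms: Optional[int]  # None = throughput-driven, no per-item budget
    max_retries: int
    eval_sample_rate: float
    idempotency_scope: str

# Illustrative values mirroring the knob table above.
MODES = {
    "streaming-sync":    ModeConfig(6_000,  1, 0.01, "session"),
    "async-queue":       ModeConfig(90_000, 3, 0.05, "job_id"),
    "batch-window":      ModeConfig(None,   3, 1.00, "row_id"),
    "scheduled-trigger": ModeConfig(None,   0, 1.00, "run_id"),
}

def invoke(agent, payload, mode: str):
    cfg = MODES[mode]
    # The agent receives budgets and limits, never the mode string.
    return agent.run(payload, latency_budget_ms=cfg.latency_budget_ms)
```

Mode selection stays a dictionary lookup at the boundary; the agent's code path is identical in all four rows.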
Streaming — the place sync gets confused
Streaming feels async (tokens trickle in) but is sync from the user's perspective — the connection is open, the user is waiting. Three rules avoid disaster:
Rule 1 — No external I/O inside the streaming loop
Once you start streaming, every token has to come fast. Calling a tool mid-stream because the LLM said it wanted to is a freeze. The plan is finalized before the stream starts.
Rule 2 — Stream a draft, then verify
Some safety/eval checks happen on the full response. If the response can't be released until the eval passes, you can't stream it. Two options:
- Stream the draft, verify in parallel, and kill the stream if the verifier rejects. Higher trust, but a visible cancel.
- Buffer the full response, verify, then stream. Adds 200-400ms of first-token latency but never a visible cancel.
We use option 2 for safety-critical responses (refunds, account changes), option 1 for content (recommendations, summaries).
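Option 2 reduces to a simple shape: collect the draft, run the full-response check, then release tokens. A sketch, where `generate_tokens`, `verify`, and `send` stand in for the real harness hooks:

```python
import asyncio

async def buffered_stream(generate_tokens, verify, send):
    """Buffer the full draft, verify, then stream (no visible cancel)."""
    draft = [tok async for tok in generate_tokens()]
    if not verify("".join(draft)):            # full-response safety/eval check
        await send("[fallback: response withheld]")
        return
    for tok in draft:
        await send(tok)                       # release only after verification
```

Option 1 would instead interleave `send` with generation and issue a cancel frame on rejection; the trade is the 200-400ms of buffering latency against ever showing a retraction.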
Rule 3 — Heartbeat the stream
Long-running compositions (over 4s of compute) without a token risk client timeouts. The harness emits a heartbeat character or a "thinking..." signal every 2s if no token has gone out. Keeps the connection alive.
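The heartbeat rule is a timeout around the token source. A sketch, assuming tokens arrive on an `asyncio.Queue` with `None` as the end-of-stream sentinel (the frame shapes are illustrative):

```python
import asyncio

HEARTBEAT_INTERVAL_S = 2.0  # emit a keepalive if no token arrives within 2s

async def stream_with_heartbeat(tokens: asyncio.Queue, send):
    """Forward tokens to the client, injecting a heartbeat on quiet gaps."""
    while True:
        try:
            tok = await asyncio.wait_for(tokens.get(), timeout=HEARTBEAT_INTERVAL_S)
        except asyncio.TimeoutError:
            await send({"type": "heartbeat"})        # keeps the connection alive
            continue
        if tok is None:                              # sentinel: stream finished
            return
        await send({"type": "token", "text": tok})
```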
Async-queue — the workhorse for non-real-time
```mermaid
sequenceDiagram
participant Caller
participant API
participant Queue
participant WorkerPool
participant ResultStore
participant Caller2 as Caller (poll or webhook)
Caller->>API: submit_job(payload)
API->>API: validate, idempotency check
API->>Queue: enqueue(job_id, payload)
API-->>Caller: 202 Accepted, job_id, ETA
Note over Queue,WorkerPool: workers pull when capacity available
Queue->>WorkerPool: dispatch
WorkerPool->>WorkerPool: run agent
WorkerPool->>ResultStore: write(job_id, result)
WorkerPool->>Caller2: webhook(job_id, result_url)
```
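The submit side of that flow, including the idempotency check, fits in a few lines. A sketch with in-memory stand-ins for the queue and job store (the hashing scheme and response shape are assumptions):

```python
import hashlib
import json

JOBS = {}    # stand-in for the durable job/idempotency store
QUEUE = []   # stand-in for the real queue

def submit_job(payload: dict) -> dict:
    """Validate, dedupe, enqueue; return a 202-style response."""
    # Derive the job_id from the payload so retries of the same request
    # collapse into one job (idempotency check).
    job_id = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()[:16]
    if job_id not in JOBS:
        JOBS[job_id] = {"status": "queued"}
        QUEUE.append((job_id, payload))
    return {"status": 202, "job_id": job_id, "eta_s": 90}
```

Note what is absent: no result in the response. The caller gets a `job_id` and an ETA; results arrive via webhook or polling.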
Why a queue (not just background tasks)
- Backpressure — if workers are saturated, the queue grows; new jobs are accepted-and-queued, not dropped.
- Replay — failed jobs go to DLQ, can be re-driven after fix.
- Ordering when needed — partition by user_id for per-user FIFO.
- Visibility — queue depth is a leading indicator of saturation.
Worker concurrency math
- Worker pool target utilization: 65-75% (leaves headroom for spikes).
- Per-worker concurrency: ~4 jobs in flight (each one is mostly waiting on LLM I/O).
- Per-pool concurrency: depends on the workload's median LLM latency vs $$ budget.
For MangaAssist's nightly personalization (~120M jobs/day, ~6s median per job), we run ~12K worker concurrency to stay within the 8-hour window.
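The sizing logic above can be folded into a back-of-envelope helper. A sketch; the utilization and per-worker concurrency inputs are assumptions to tune, and real sizing adds overhead for retries, stragglers, and spike headroom on top of this floor:

```python
import math

def size_worker_pool(jobs: int, median_latency_s: float,
                     window_s: int, utilization: float,
                     per_worker_concurrency: int) -> int:
    """Minimum worker count to finish `jobs` within `window_s`."""
    # Total job-seconds of work, spread over the usable fraction of the window,
    # gives the required number of jobs in flight at any moment.
    in_flight = math.ceil(jobs * median_latency_s / (window_s * utilization))
    # Each worker carries several in-flight jobs (mostly LLM I/O wait).
    return math.ceil(in_flight / per_worker_concurrency)

# Illustrative: 120M jobs, 6s median, 8h window, 65% utilization, 4 jobs/worker.
workers = size_worker_pool(120_000_000, 6, 8 * 3600, 0.65, 4)
```

This floor comes out under 10K workers for the illustrative inputs; retry overhead and headroom push the operating number higher.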
Batch — when throughput is everything
Catalog enrichment runs the same agent across 4M titles. The agent stays the same; the harness:
- Reads a manifest (shard list).
- Fans out workers across shards (typically 256 shards × 2 days).
- Checkpoints per shard for restartability.
- Aggregates results to S3 in Parquet for downstream Glue queries.
Eval in batch is 100%, offline. Every output is judged because the cost is amortized over a large run and the value of catching a regression before it lands in production is high. This is one of the few places where exhaustive eval is economical.
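The per-shard checkpointing step is the part worth sketching. A simplified version, where `checkpoint_store` stands in for whatever durable store the driver uses (the real manifest format and stores are not shown in this story):

```python
def run_shard(shard_id: str, rows, agent, checkpoint_store):
    """Process one shard, recording progress so a restart resumes mid-shard."""
    start = checkpoint_store.get(shard_id, 0)   # resume point after a crash
    results = []
    for i, row in enumerate(rows):
        if i < start:
            continue                            # already done before the restart
        results.append(agent.run(row))
        checkpoint_store[shard_id] = i + 1      # durable write in the real system
    return results
```

Checkpointing at the shard level (not the run level) is what keeps a 2-day batch restartable without redoing days of work.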
Pitfalls
Pitfall 1 — Sync-shaped task on async path
"User clicks 'recommend me 100 things'" — the engineer makes it sync. p99 hits 25s. Browser closes. User confused.
Fix: request shape determines mode. Heavy multi-result requests are async-queue with a "we'll DM your results" UX. The boundary is at the API layer, declared at endpoint registration, not bolted on per-request.
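"Declared at endpoint registration" can be as literal as a registry entry. A sketch (paths and registry shape are illustrative, not the real API layer):

```python
VALID_MODES = {"streaming-sync", "async-queue", "batch-window", "scheduled-trigger"}
ENDPOINTS = {}

def register_endpoint(path: str, mode: str, handler):
    """Mode is fixed once, at registration — never chosen per request."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown invocation mode: {mode!r}")
    ENDPOINTS[path] = {"mode": mode, "handler": handler}

# Heavy multi-result requests are declared async up front.
register_endpoint("/chat", "streaming-sync", lambda req: ...)
register_endpoint("/recommend-100", "async-queue", lambda req: ...)
```

Because the mode lives in the registry, the dispatcher can route `/recommend-100` straight to the queue, and there is no per-request code path that could accidentally make it sync.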
Pitfall 2 — Async path with sync expectations
A team decides "we'll just await the queue" — they ignore the queue's async-ness, block their thread, and hold connections open for minutes. Worker pool effectively single-threaded.
Fix: the queue API doesn't return a result; it returns a job_id. Result retrieval is via webhook or poll. There is no "await this job" — that pattern doesn't exist in our SDK by design.
Pitfall 3 — Mode-switching mid-deploy
A canary version changes a chat path from sync to async without UX coordination. Users see "your message has been queued" mysteriously.
Fix: mode is part of the API contract, version-bumped like any breaking change. Mode changes are coordinated with frontend.
Pitfall 4 — Batch backfill swamping the model provider quota
Nightly batch starts; consumes 80% of the day's Bedrock quota in 2 hours; live chat is throttled.
Fix: the quota manager (story 08) treats sync chat as P0 and batch as P3. Batch obeys a quota cap that leaves headroom for sync. Batch can run longer; sync cannot wait.
Q&A drill — opening question
Q: Why not just queue everything? It's simpler.
Two reasons:
1. User experience. Live chat that says "we'll get back to you" feels broken. Streaming sync is what makes the product feel alive.
2. Resource shape. Sync is bursty; async is smoothable. Mixing them on the same workers means you size for sync peak (over-provisioned for async) or async average (under-provisioned for sync). Separating them lets each fleet size for its own profile.
The right answer is "queue everything that can wait, stream everything that can't, and make the boundary explicit."
Grilling — Round 1
Q1. Worker concurrency = 4 inside one process — why not 1, simpler?
LLM calls are I/O-bound (network wait for the model). At 1-process-1-job, the worker is idle ~80% of the time. Concurrency 4 amortizes the cost of the worker container. Beyond 4, contention on local context/memory hurts. Empirically 4 is the sweet spot for our workload mix.
Q2. How do you handle a poison job that crashes the worker every time?
Standard DLQ pattern: max 3 attempts, then to DLQ. DLQ has its own alarm. Postmortem: was it bad input (sanitize), bad code (fix), bad model output (eval rule)? DLQ size is an SLO; we keep it below 0.05% of daily job volume.
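The poison-job handling is the textbook retry-then-DLQ loop. A sketch with list stand-ins for the queue and DLQ (the real system would use the queue's redrive policy rather than hand-rolled lists):

```python
MAX_ATTEMPTS = 3  # per the DLQ pattern above

def handle_with_dlq(job: dict, run, queue: list, dlq: list):
    """Run a job; on failure, retry up to MAX_ATTEMPTS, then route to the DLQ."""
    try:
        return run(job["payload"])
    except Exception as exc:
        job["attempts"] = job.get("attempts", 0) + 1
        if job["attempts"] >= MAX_ATTEMPTS:
            dlq.append({**job, "error": repr(exc)})   # alarmed separately
        else:
            queue.append(job)                         # re-drive later
```

A job that fails deterministically lands in the DLQ after three passes, and the worker pool moves on instead of crash-looping on it.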
Q3. Webhook delivery is unreliable. What if the caller misses the webhook?
Two backstops: (a) webhook has retries with exp backoff (1m, 5m, 30m, 4h); (b) callers can poll the result store directly with the job_id. The result store is durable; the webhook is convenience.
Grilling — Round 2 (architect-level)
Q4. How does the same agent code stay mode-agnostic when the modes have such different latency and cost profiles?
Two design choices:
- Budgets are passed as runtime config. The graph (story 01) reads latency_budget_ms and cost_cap_usd from the invocation context. A node that downshifts to Haiku under tight budget gets that signal from config, not from "what mode am I in."
- Output transport is harness-side. The agent emits structured outputs. The harness either streams those over WebSocket (sync), writes them to a result store (async), or appends to a Parquet file (batch). The agent doesn't know.
This separation is what lets us keep one agent and four modes.
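The first design choice, budget-driven model downshifting, looks roughly like this inside a node. The model names and threshold are illustrative, not the production routing:

```python
def pick_model(ctx: dict) -> str:
    """Choose a model from the invocation context's budget, never the mode."""
    budget_ms = ctx.get("latency_budget_ms")   # set by the harness per mode
    if budget_ms is not None and budget_ms < 2_000:
        return "haiku"    # downshift under a tight budget
    return "sonnet"       # default when the budget is loose or absent
```

The node behaves identically whether the tight budget came from a sync chat turn or an unusually constrained async job; "what mode am I in" never appears in agent code.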
Q5. Batch jobs that share infra with sync chat — how do you prevent noisy-neighbor effects?
Three barriers:
- Separate worker pools. Same code, different fleet. Sync gets dedicated capacity; batch gets dedicated capacity.
- Quota partitioning at the model provider. Sync has a reserved fraction of Bedrock TPM; batch consumes from a separate budget.
- Network/DDB isolation is not free but is feasible — we use separate DDB tables for batch state vs sync state, so a batch surge can't throttle sync's writes.
We measured this carefully when we co-located vs separated. Separation costs ~8% more in idle compute; mixing introduced a 3% increase in sync p99 during batch runs. We separate.
Q6. What's the eval story across modes? Different SLAs imply different rubrics?
Same rubric, partitioned reporting. The judging rubric is "is this answer good?" — that doesn't depend on mode. But the expected score baseline does:
- Sync chat baseline: 0.81 mean.
- Async personalization: 0.78 mean (more complex inputs).
- Batch enrichment: 0.86 mean (curated inputs, no live ambiguity).
Drift detection thresholds are mode-specific. A 0.78 score is great in async, alarming in batch. The eval framework knows this.
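Mode-specific drift detection reduces to per-mode baselines under one rubric. A sketch using the baselines quoted above (the tolerance value is an assumption):

```python
# Baselines from the story; the tolerance is an illustrative assumption.
BASELINES = {
    "sync-chat": 0.81,
    "async-personalization": 0.78,
    "batch-enrichment": 0.86,
}
DRIFT_TOLERANCE = 0.03  # alert when the mean falls this far below baseline

def drift_alert(mode: str, observed_mean: float) -> bool:
    """Same rubric everywhere; only the alarm threshold is mode-specific."""
    return observed_mean < BASELINES[mode] - DRIFT_TOLERANCE
```

This is exactly the "0.78 is great in async, alarming in batch" behavior: the same observed score trips the alarm in one partition and not the other.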
Intuition gained
- Mode is invocation policy, not agent code. The same agent runs in four modes; the harness configures the surroundings.
- Streaming is sync from the user's POV. Treat it that way: no external I/O mid-stream, plan-then-stream, heartbeat the connection.
- Queue-first for anything that can wait. It's the simplest way to get backpressure, replay, and ordering for free.
- Quota partitioning across modes prevents batch surges from killing live chat.
- Eval rubric is universal; baselines are mode-specific. Drift detection respects this.
See also
- 05-pause-resume-workflows.md — async mode is where Flavor-2 waits live
- 06-checkpointing-and-serving.md — different fleet shapes per mode
- 08-observability-cost-versioning-ratelimits.md — quota manager details
- Changing-Constraints-Scenarios/01-10x-user-surge.md — sync mode under burst