
User Story 07 — Sync vs Async Invocation

Pillar: P2 (Harness) · Stage unlocked: 2 → 3 · Reading time: ~10 min


TL;DR

A user typing in MangaAssist expects an answer in seconds. A nightly enrichment job processing 4M titles through the same agent has no such expectation. The same agent runtime serves both — the difference is invocation mode. Getting the sync ↔ async boundary right is the single largest determinant of whether your harness scales economically.


The User Story

As the architect of MangaAssist's runtime, I want the same agent definition to be invokable in streaming-sync, async-queue, batch-window, and scheduled-trigger modes, so that chat, voice, daily personalization, weekly review summaries, and backfills all reuse the same agent logic with mode-appropriate latency and cost characteristics.

Acceptance criteria

  1. The agent's behavior is mode-independent — the same graph, same skills, same evals.
  2. Mode selection is policy at invocation, not a code-level branch.
  3. Streaming-sync emits the first token within a 1.4s p95; async-queue holds queue depth and handler concurrency as its SLO; batch hits its throughput targets.
  4. Misuse (long-running task invoked sync) fails fast with a typed error, not by hanging the user's connection.
  5. Cross-mode metrics are unified — same eval, same observability — partitioned by mode.

The four invocation modes

flowchart LR
  subgraph SYNC[Streaming-sync]
    A1[User keystroke] --> B1[WebSocket]
    B1 --> C1[Agent runtime]
    C1 --> D1[Token-by-token stream back]
  end

  subgraph QUEUE[Async-queue]
    A2[Caller] --> B2[Submit job]
    B2 --> C2[(Queue)]
    C2 --> D2[Agent worker pool]
    D2 --> E2[Result store]
    E2 --> F2[Webhook or polling]
  end

  subgraph BATCH[Batch-window]
    A3[Scheduler] --> B3[Spawn batch driver]
    B3 --> C3[Read shard manifest]
    C3 --> D3[Fan out to worker pool]
    D3 --> E3[Aggregate results to S3]
  end

  subgraph SCHED[Scheduled-trigger]
    A4[Cron / EventBridge] --> B4[Single agent run]
    B4 --> C4[Result to durable target]
  end

When to use which

  Use case                               | Mode                    | Why
  Live chat                              | streaming-sync          | User waits; first-token matters; full-response budget 6s
  Voice agent                            | streaming-sync          | Same as chat, even tighter budget on first audio chunk
  New-release email personalization      | async-queue             | User isn't waiting; can take 30-90s per user
  Nightly catalog enrichment             | batch-window            | 4M items; throughput > latency
  Monthly subscriber retention summary   | scheduled-trigger       | Per-user; runs once a month at user's local 9am
  Refund decision waiting on fraud team  | async-queue with pause  | See story 05 (pause/resume)
  Search relevance offline rescore       | batch-window            | Massive throughput, eventual consistency

What's actually different across modes

The same agent code runs in all four modes. The harness configures these knobs per mode:

  Knob               | Streaming-sync             | Async-queue                    | Batch                           | Scheduled
  Concurrency model  | per-connection             | worker pool with backpressure  | shard-based                     | one-shot
  Latency budget     | hard p95 6s                | soft, e.g. 90s p95             | n/a (throughput target)         | soft
  Output transport   | WebSocket stream           | result store + webhook         | S3 manifest + Glue catalog      | direct write
  Failure handling   | retry once + escape hatch  | DLQ + alert                    | shard-level retry               | re-fire next cycle
  Cost ceiling       | per-turn, tight            | per-job, medium                | per-job, high (parallelism)     | per-run, high
  Eval sampling      | 1% online                  | 5%                             | 100% per shard (offline eval)   | 100%
  Idempotency        | session-scoped             | job-id-scoped                  | row-id-scoped                   | run-id-scoped
  Pause/resume role  | container-restart only     | full Flavor-2 wait possible    | shard restart                   | re-fire

The agent doesn't know which mode it's in. It just runs the graph. The harness provides the surrounding mode-specific behavior.
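
To make that concrete, here is a minimal sketch of the split, with illustrative names (run_agent and the transport arguments are stand-ins, not the real harness API): the agent yields structured events, and each mode's wrapper decides what to do with them.

  from typing import Iterator

  def run_agent(payload: dict) -> Iterator[dict]:
      """Mode-agnostic part: runs the graph and yields structured output events."""
      yield {"type": "answer", "text": f"picks for user {payload['user_id']}"}

  def invoke_streaming_sync(payload: dict, websocket_send) -> None:
      for event in run_agent(payload):
          websocket_send(event)                         # streamed over the open connection

  def invoke_async_queue(payload: dict, job_id: str, result_store: dict) -> None:
      result_store[job_id] = list(run_agent(payload))   # durable result; webhook fires later

  def invoke_batch(payload: dict, parquet_writer) -> None:
      for event in run_agent(payload):
          parquet_writer.append(event)                  # aggregated per shard for Glue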


Streaming — the place sync gets confused

Streaming feels async (tokens trickle in) but is sync from the user's perspective — the connection is open, the user is waiting. Three rules avoid disaster:

Rule 1 — No external I/O inside the streaming loop

Once you start streaming, every token has to come fast. Calling a tool mid-stream because the LLM said it wanted to is a freeze. The plan is finalized before the stream starts.
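
A minimal sketch of that ordering, assuming hypothetical llm.plan, tools.run, and llm.generate_stream interfaces; the point is the sequencing, not the API:

  # All external I/O is resolved before the first token leaves; from then on, tokens only.
  async def handle_turn(message, tools, llm, send_token):
      plan = await llm.plan(message)                                        # may request tools; nothing streamed yet
      tool_results = {c.id: await tools.run(c) for c in plan.tool_calls}    # all tool I/O finishes here
      async for token in llm.generate_stream(plan, tool_results):           # streaming loop: no external calls
          send_token(token)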

Rule 2 — Stream a draft, then verify

Some safety/eval checks happen on the full response. If the response can't be released until the eval passes, you can't stream it. Two options:

  1. Stream the draft, verify in parallel, kill the stream if the verifier rejects. Higher trust but visible cancel.
  2. Buffer the full response, verify, then stream. Adds 200-400ms latency to first-token but never visible cancel.

We use option 2 for safety-critical responses (refunds, account changes), option 1 for content (recommendations, summaries).

Rule 3 — Heartbeat the stream

A composition that runs long (over 4s of compute) without emitting a token risks client timeouts. The harness emits a heartbeat character or a "thinking..." signal every 2s if no token has gone out, which keeps the connection alive.
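
A sketch of one way to implement that, assuming the agent's output arrives as an async iterator of tokens; the names are illustrative:

  import asyncio

  HEARTBEAT_INTERVAL_S = 2.0

  async def stream_with_heartbeat(token_source, send):
      """Forward tokens to `send`; if none arrive within 2s, emit a heartbeat instead."""
      queue: asyncio.Queue = asyncio.Queue()
      DONE = object()

      async def pump():
          async for token in token_source:
              await queue.put(token)
          await queue.put(DONE)

      pump_task = asyncio.create_task(pump())
      try:
          while True:
              try:
                  item = await asyncio.wait_for(queue.get(), timeout=HEARTBEAT_INTERVAL_S)
              except asyncio.TimeoutError:
                  await send({"type": "heartbeat"})       # or a "thinking..." signal
                  continue
              if item is DONE:
                  return
              await send({"type": "token", "text": item})
      finally:
          pump_task.cancel()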


Async-queue — the workhorse for non-real-time

sequenceDiagram
  participant Caller
  participant API
  participant Queue
  participant WorkerPool
  participant ResultStore
  participant Caller2 as Caller (poll or webhook)

  Caller->>API: submit_job(payload)
  API->>API: validate, idempotency check
  API->>Queue: enqueue(job_id, payload)
  API-->>Caller: 202 Accepted, job_id, ETA

  Note over Queue,WorkerPool: workers pull when capacity available
  Queue->>WorkerPool: dispatch
  WorkerPool->>WorkerPool: run agent
  WorkerPool->>ResultStore: write(job_id, result)
  WorkerPool->>Caller2: webhook(job_id, result_url)

Why a queue (not just background tasks)

  • Backpressure — if workers are saturated, the queue grows; new jobs are accepted-and-queued, not dropped.
  • Replay — failed jobs go to DLQ, can be re-driven after fix.
  • Ordering when needed — partition by user_id for per-user FIFO.
  • Visibility — queue depth is a leading indicator of saturation.
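
A minimal sketch of the submission side, with a hypothetical queue client and in-memory stores standing in for the real ones; the property that matters is that the endpoint returns a job_id, never the result:

  import uuid

  def submit_job(payload: dict, idempotency_key: str, queue, seen: dict) -> dict:
      """Accept-and-queue: validate, dedupe, enqueue, answer 202-style with a job_id and ETA."""
      if idempotency_key in seen:                        # idempotency check
          return {"status": 202, "job_id": seen[idempotency_key], "deduped": True}
      job_id = str(uuid.uuid4())
      queue.enqueue(job_id=job_id, payload=payload)      # hypothetical queue client
      seen[idempotency_key] = job_id
      return {"status": 202, "job_id": job_id, "eta_s": 90}

  def get_result(job_id: str, result_store: dict) -> dict:
      """Poll fallback: the result store is durable; the webhook is only a convenience."""
      if job_id in result_store:
          return {"status": "done", "result": result_store[job_id]}
      return {"status": "pending"}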

Worker concurrency math

  • Worker pool target utilization: 65-75% (leaves headroom for spikes).
  • Per-worker concurrency: ~4 jobs in flight (each one is mostly waiting on LLM I/O).
  • Per-pool concurrency: depends on the workload's median LLM latency versus the cost budget.

For MangaAssist's nightly personalization (~120M jobs/day, ~6s median per job), we run ~12K worker concurrency to stay within the 8-hour window.
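
A back-of-envelope check of that figure, under the assumption (not stated above) that "worker concurrency" means worker processes with ~4 jobs in flight each:

  jobs_per_night  = 120_000_000
  median_job_s    = 6
  window_s        = 8 * 3600
  target_util     = 0.70                                                  # middle of the 65-75% band

  jobs_in_flight_needed = jobs_per_night * median_job_s / window_s        # = 25,000
  slots_needed          = jobs_in_flight_needed / target_util             # ~35,700
  workers_needed        = slots_needed / 4                                # ~8,900 workers
  # ~12K workers then leaves roughly a third of capacity as margin for retries and stragglers.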


Batch — when throughput is everything

Catalog enrichment runs the same agent across 4M titles. The agent stays the same; the harness:

  • Reads a manifest (shard list).
  • Fans out workers across shards (typically 256 shards × 2 days).
  • Checkpoints per shard for restartability.
  • Aggregates results to S3 in Parquet for downstream Glue queries.
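
A sketch of that driver loop, with the manifest, checkpoint, agent, and writer callables passed in; the names are illustrative, not the real harness API:

  from concurrent.futures import ThreadPoolExecutor
  from functools import partial

  def process_shard(shard, load_checkpoint, save_checkpoint, run_agent, write_parquet):
      done = load_checkpoint(shard["id"])                 # restartable: skip rows already processed
      results = []
      for row in shard["rows"]:
          if row["id"] in done:
              continue
          results.append(run_agent(row))
          done.add(row["id"])
          if len(done) % 1000 == 0:
              save_checkpoint(shard["id"], done)          # per-shard checkpointing
      save_checkpoint(shard["id"], done)
      return write_parquet(shard["id"], results)          # one Parquet object per shard in S3

  def run_batch(shards, parallelism=256, **helpers):
      worker = partial(process_shard, **helpers)
      with ThreadPoolExecutor(max_workers=parallelism) as pool:   # agent work is mostly LLM I/O
          return list(pool.map(worker, shards))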

Eval in batch is 100%, offline. Every output is judged because the cost is amortized over a large run and the value of catching a regression before it lands in production is high. This is one of the few places where exhaustive eval is economical.


Pitfalls

Pitfall 1 — Sync-shaped task on async path

"User clicks 'recommend me 100 things'" — the engineer makes it sync. p99 hits 25s. Browser closes. User confused.

Fix: request shape determines mode. Heavy multi-result requests are async-queue with a "we'll DM your results" UX. The boundary is at the API layer, declared at endpoint registration, not bolted on per-request.
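
One way to express that boundary, sketched as a registration table plus a typed fast-fail; ENDPOINT_MODES, InvocationModeError, and check_mode are illustrative names, not a real framework API:

  ENDPOINT_MODES = {
      "/chat/turn":         "streaming-sync",
      "/recommend/bulk":    "async-queue",        # heavy multi-result: "we'll DM your results"
      "/catalog/enrich":    "batch-window",
      "/retention/summary": "scheduled-trigger",
  }

  class InvocationModeError(TypeError):
      """Typed fast-fail for a long-running task submitted on the sync path."""

  def check_mode(path: str, estimated_ms: int, sync_budget_ms: int = 6_000) -> None:
      if ENDPOINT_MODES[path] == "streaming-sync" and estimated_ms > sync_budget_ms:
          raise InvocationModeError(f"{path}: estimated work exceeds the streaming-sync budget")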

Pitfall 2 — Async path with sync expectations

A team decides "we'll just await the queue" — they ignore the queue's async nature, block their threads, and hold connections open for minutes, so their worker pool ends up effectively single-threaded.

Fix: the queue API doesn't return a result; it returns a job_id. Result retrieval is via webhook or poll. There is no "await this job" — that pattern doesn't exist in our SDK by design.

Pitfall 3 — Mode-switching mid-deploy

A canary version changes a chat path from sync to async without UX coordination. Users see "your message has been queued" mysteriously.

Fix: mode is part of the API contract, version-bumped like any breaking change. Mode changes are coordinated with frontend.

Pitfall 4 — Batch backfill swamping the model provider quota

Nightly batch starts; consumes 80% of the day's Bedrock quota in 2 hours; live chat is throttled.

Fix: the quota manager (story 08) treats sync chat as P0 and batch as P3. Batch obeys a quota cap that leaves headroom for sync. Batch can run longer; sync cannot wait.
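
One possible shape for that policy, written as a hypothetical config for the story-08 quota manager; the fractions are illustrative, not production numbers:

  QUOTA_PARTITION = {
      "streaming-sync":    {"priority": "P0", "reserved_tpm_fraction": 0.50},
      "async-queue":       {"priority": "P2", "reserved_tpm_fraction": 0.25},
      "batch-window":      {"priority": "P3", "max_tpm_fraction": 0.20},    # capped, never reserved
      "scheduled-trigger": {"priority": "P2", "reserved_tpm_fraction": 0.05},
  }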


Q&A drill — opening question

Q: Why not just queue everything? It's simpler.

Two reasons:

  1. User experience. Live chat that says "we'll get back to you" feels broken. Streaming sync is what makes the product feel alive.
  2. Resource shape. Sync is bursty; async is smoothable. Mixing them on the same workers means you size for sync peak (over-provisioned for async) or async average (under-provisioned for sync). Separating them lets each fleet size for its own profile.

The right answer is "queue everything that can wait, stream everything that can't, and make the boundary explicit."


Grilling — Round 1

Q1. Worker concurrency = 4 inside one process — why not 1, simpler?

LLM calls are I/O-bound (network wait for the model). At 1-process-1-job, the worker is idle ~80% of the time. Concurrency 4 amortizes the cost of the worker container. Beyond 4, contention on local context/memory hurts. Empirically 4 is the sweet spot for our workload mix.
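
A sketch of how that cap is enforced in a worker process, assuming an asyncio worker with hypothetical run_agent_async and write_result callables:

  import asyncio

  PER_WORKER_CONCURRENCY = 4     # each job is mostly waiting on LLM I/O

  async def worker_loop(queue, run_agent_async, write_result):
      sem = asyncio.Semaphore(PER_WORKER_CONCURRENCY)

      async def handle(job):
          try:
              result = await run_agent_async(job["payload"])
              await write_result(job["job_id"], result)
          finally:
              sem.release()

      while True:
          await sem.acquire()                   # block before pulling the next job...
          job = await queue.get()               # ...so at most 4 jobs are ever in flight
          asyncio.create_task(handle(job))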

Q2. How do you handle a poison job that crashes the worker every time?

Standard DLQ pattern: max 3 attempts, then to DLQ. DLQ has its own alarm. Postmortem: was it bad input (sanitize), bad code (fix), bad model output (eval rule)? DLQ size is an SLO; we keep it below 0.05% of daily job volume.
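
The attempt-counting part of that pattern, sketched with hypothetical queue objects:

  MAX_ATTEMPTS = 3

  def run_with_dlq(job, run, main_queue, dlq):
      """Poison-job guard: after 3 failed attempts the job is parked in the DLQ."""
      try:
          return run(job["payload"])
      except Exception as exc:
          attempts = job.get("attempts", 0) + 1
          if attempts >= MAX_ATTEMPTS:
              dlq.put({**job, "attempts": attempts, "error": repr(exc)})   # DLQ has its own alarm
          else:
              main_queue.put({**job, "attempts": attempts})                # re-drive later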

Q3. Webhook delivery is unreliable. What if the caller misses the webhook?

Two backstops: (a) the webhook has retries with exponential backoff (1m, 5m, 30m, 4h); (b) callers can poll the result store directly with the job_id. The result store is durable; the webhook is a convenience.
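
The retry schedule, sketched with a hypothetical post helper; the durable result store stays the source of truth:

  import time

  BACKOFF_SCHEDULE_S = [60, 300, 1800, 14400]      # 1m, 5m, 30m, 4h

  def deliver_webhook(post, url, body) -> bool:
      for delay_s in [0] + BACKOFF_SCHEDULE_S:
          time.sleep(delay_s)
          if post(url, body):                      # best-effort delivery
              return True
      return False                                 # caller falls back to polling by job_id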


Grilling — Round 2 (architect-level)

Q4. How does the same agent code stay mode-agnostic when the modes have such different latency and cost profiles?

Two design choices:

  • Budgets are passed as runtime config. The graph (story 01) reads latency_budget_ms and cost_cap_usd from the invocation context. A node that downshifts to Haiku under tight budget gets that signal from config, not from "what mode am I in."
  • Output transport is harness-side. The agent emits structured outputs. The harness either streams those over WebSocket (sync), writes them to a result store (async), or appends to a Parquet file (batch). The agent doesn't know.

This separation is what lets us keep one agent and four modes.
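
A minimal sketch of the first choice, with illustrative model names and thresholds; the node only ever sees the budget, never the mode:

  from dataclasses import dataclass

  @dataclass
  class InvocationContext:
      latency_budget_ms: int
      cost_cap_usd: float

  def pick_model(ctx: InvocationContext) -> str:
      tight = ctx.latency_budget_ms <= 6_000 or ctx.cost_cap_usd <= 0.02    # illustrative thresholds
      return "haiku" if tight else "sonnet"                                 # downshift under a tight budget

  pick_model(InvocationContext(latency_budget_ms=6_000, cost_cap_usd=0.02))    # sync turn -> "haiku"
  pick_model(InvocationContext(latency_budget_ms=90_000, cost_cap_usd=0.10))   # async job -> "sonnet"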

Q5. Batch jobs that share infra with sync chat — how do you prevent noisy-neighbor effects?

Three barriers:

  • Separate worker pools. Same code, different fleet. Sync gets dedicated capacity; batch gets dedicated capacity.
  • Quota partitioning at the model provider. Sync has a reserved fraction of Bedrock TPM; batch consumes from a separate budget.
  • Network/DDB isolation is not free but is feasible — we use separate DDB tables for batch state vs sync state, so a batch surge can't throttle sync's writes.

We measured this carefully when we co-located vs separated. Separation costs ~8% more in idle compute; mixing introduced a 3% increase in sync p99 during batch runs. We separate.

Q6. What's the eval story across modes? Different SLAs imply different rubrics?

Same rubric, partitioned reporting. The judging rubric is "is this answer good?" — that doesn't depend on mode. But the expected score baseline does:

  • Sync chat baseline: 0.81 mean.
  • Async personalization: 0.78 mean (more complex inputs).
  • Batch enrichment: 0.86 mean (curated inputs, no live ambiguity).

Drift detection thresholds are mode-specific. A 0.78 score is great in async, alarming in batch. The eval framework knows this.
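
Sketched as data, using the baselines above; the 0.03 drift margin is illustrative:

  BASELINES = {"streaming-sync": 0.81, "async-queue": 0.78, "batch-window": 0.86}
  DRIFT_MARGIN = 0.03

  def drifted(mode: str, observed_mean: float) -> bool:
      return observed_mean < BASELINES[mode] - DRIFT_MARGIN

  drifted("async-queue", 0.78)     # False: on baseline for async
  drifted("batch-window", 0.78)    # True: the same score is alarming in batch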


Intuition gained

  • Mode is invocation policy, not agent code. The same agent runs in four modes; the harness configures the surroundings.
  • Streaming is sync from the user's POV. Treat it that way: no external I/O mid-stream, plan-then-stream, heartbeat the connection.
  • Queue-first for anything that can wait. It's the simplest way to get backpressure, replay, and ordering for free.
  • Quota partitioning across modes prevents batch surges from killing live chat.
  • Eval rubric is universal; baselines are mode-specific. Drift detection respects this.

See also

  • 05-pause-resume-workflows.md — async mode is where Flavor-2 waits live
  • 06-checkpointing-and-serving.md — different fleet shapes per mode
  • 08-observability-cost-versioning-ratelimits.md — quota manager details
  • Changing-Constraints-Scenarios/01-10x-user-surge.md — sync mode under burst