User Story 03 — Sub-Agent Orchestration
Pillar: P1 (AI Workflow) + P2 (Harness) · Stage unlocked: 2 → 4 · Reading time: ~13 min
TL;DR
A sub-agent is a skill that itself has an LLM in the loop. Sub-agents are powerful (they decompose hard tasks) and dangerous (they fan out cost, latency, and failure modes). At MangaAssist scale, the difference between a system that scales and one that melts is whether the parent agent treats sub-agents as bounded contracts or as "another LLM I can talk to."
The User Story
As the architect of MangaAssist's discovery experience, I want to invoke specialized sub-agents (recommendation explainer, comparison writer, multi-title summarizer) under explicit budgets and isolated context, so that the parent planner stays small and focused, complex tasks delegate cleanly, and a misbehaving sub-agent cannot blow up the parent's cost, context, or latency budget.
Acceptance criteria
- Every sub-agent has a token budget, time budget, and tool whitelist declared at registration.
- Parent → sub-agent context handoff is explicit — only declared fields cross the boundary, not the entire conversation.
- Sub-agent failure (timeout, refusal, error) is a typed return to the parent, not an exception that bubbles.
- Sub-agent traces are stitched into the parent trace with a parent_trace_id link, not orphaned.
- A sub-agent cannot recursively invoke itself or its parent (DAG-only orchestration).
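A minimal sketch of what the typed-return criterion could look like in the parent's harness. The names (SubAgentResult, handle) are illustrative assumptions, not an API from the source:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class SubAgentStatus(Enum):
    OK = "ok"
    TIMEOUT = "timeout"
    REFUSAL = "refusal"
    ERROR = "error"

@dataclass(frozen=True)
class SubAgentResult:
    status: SubAgentStatus
    output: Optional[dict] = None   # present only when status is OK
    detail: str = ""                # diagnostic, never user-facing
    parent_trace_id: str = ""       # stitches this call into the parent trace

def handle(result: SubAgentResult) -> str:
    # The parent branches on a typed status instead of catching a bubbled exception.
    if result.status is SubAgentStatus.OK:
        return "compose"
    if result.status is SubAgentStatus.TIMEOUT:
        return "fallback"
    return "degrade"
```

The point of the frozen dataclass is that every failure mode is a value the parent must handle, not a control-flow surprise.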
When to use a sub-agent (vs a deterministic skill)
| Situation | Sub-agent | Deterministic skill |
|---|---|---|
| Task needs natural-language reasoning over retrieved content | ✅ | ❌ |
| Task is well-described by inputs → outputs without judgment | ❌ | ✅ |
| Need to combine 3+ skills with conditional logic | ✅ | ❌ |
| Latency budget is tight (<300ms) | ❌ (LLM call too slow) | ✅ |
| Output quality matters more than determinism | ✅ | ❌ |
| Output must be byte-for-byte reproducible | ❌ | ✅ |
A sub-agent costs ~5-15× more than a skill in $ and ~3-8× in latency. Use it where the judgment is real.
The orchestration topology
flowchart TB
P[Parent Planner Agent]
P -->|invoke recommend-explainer| SA1[Sub-Agent: Recommendation Explainer]
P -->|invoke compare-titles| SA2[Sub-Agent: Multi-Title Comparator]
P -->|invoke summarize-arc| SA3[Sub-Agent: Story-Arc Summarizer]
SA1 --> S1[catalog-search]
SA1 --> S2[user-prefs-reco]
SA1 --> S3[review-sentiment]
SA2 --> S1
SA2 --> S3
SA2 --> S4[cross-title-link]
SA3 --> S5[support-policy: spoiler rules]
SA3 --> S3
classDef parent fill:#dbeafe,stroke:#1e3a8a
classDef agent fill:#fef3c7,stroke:#92400e
classDef skill fill:#dcfce7,stroke:#166534
class P parent
class SA1,SA2,SA3 agent
class S1,S2,S3,S4,S5 skill
The parent never calls a leaf skill directly if a sub-agent for that domain exists — that keeps the parent prompt clean. Sub-agents have narrow tool whitelists: the recommendation explainer cannot call the support-policy skill, period.
The contract a sub-agent registers
Same shape as a skill (story 02), with extra fields:
name: recommendation-explainer
kind: sub-agent
version: 1.2
backing_model: claude-sonnet-4-6 # the LLM behind the sub-agent
fallback_model: claude-haiku-4-5
input:
user_query: { type: string, max_tokens: 500 }
candidate_title_ids: { type: array, items: string, max_len: 20 }
user_locale: { type: string }
user_prefs_summary: { type: string, max_tokens: 800 } # NOT the full conversation
output:
ranked_explanations:
type: array
items:
title_id: string
explanation: string # max 2 sentences
confidence: float
policy:
token_budget: { input: 4000, output: 1500 }
time_budget_ms: 3500
tool_whitelist: [catalog-search, user-prefs-reco, review-sentiment]
max_tool_calls: 5
max_recursion_depth: 0 # cannot invoke other sub-agents
eval:
shadow_eval: true
judge_model: claude-opus-4-7
rubric: rubrics/explanation-quality-v3.yaml
The fields that distinguish a sub-agent from a skill:
- backing_model + fallback_model — the LLM and its degradation path
- tool_whitelist + max_tool_calls — bound the sub-agent's blast radius
- max_recursion_depth — usually 0 to prevent fan-out chains
- eval — sub-agents almost always need shadow eval because output is judgment-laden
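A sketch of how a harness might enforce the policy block at invocation time. The class and exception names are assumptions for illustration, not MangaAssist's actual API:

```python
class PolicyViolation(Exception):
    pass

class SubAgentPolicy:
    def __init__(self, tool_whitelist, max_tool_calls, time_budget_ms,
                 max_recursion_depth=0):
        self.tool_whitelist = set(tool_whitelist)
        self.max_tool_calls = max_tool_calls
        self.time_budget_ms = time_budget_ms
        self.max_recursion_depth = max_recursion_depth
        self.calls_made = 0

    def authorize_tool_call(self, tool_name, depth=0):
        # Checked before every tool call, so a misbehaving sub-agent
        # fails fast instead of burning budget.
        if depth > self.max_recursion_depth:
            raise PolicyViolation("recursion depth exceeded")
        if tool_name not in self.tool_whitelist:
            raise PolicyViolation(f"{tool_name} not in whitelist")
        if self.calls_made >= self.max_tool_calls:
            raise PolicyViolation("max_tool_calls exhausted")
        self.calls_made += 1
```

With the recommendation-explainer contract above, authorize_tool_call("support-policy") would raise immediately — the whitelist is enforced mechanically, not by prompt instruction.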
Context handoff — the most under-discussed problem
The naive approach: "pass the conversation history to the sub-agent." That's how you blow your token budget and leak PII into a sub-agent's audit trail.
The right approach: structured handoff with declared fields.
sequenceDiagram
participant User
participant Parent
participant Distiller
participant SubAgent
User->>Parent: "rec me dark fantasy like Berserk but newer"
Parent->>Parent: full conversation context (32K tokens)
Parent->>Distiller: extract sub-agent inputs from context
Distiller-->>Parent: { user_query, candidate_ids, prefs_summary } ~1.5K tokens
Parent->>SubAgent: structured input only
SubAgent->>SubAgent: scoped reasoning, narrow context
SubAgent-->>Parent: typed output
Parent->>User: composed response
The distiller is itself a small deterministic step (could be a tiny LLM or rule-based) that extracts only what the sub-agent declared in its input schema. Three benefits:
- Token budget for sub-agent stays predictable.
- PII handling is explicit at the boundary.
- Sub-agent prompt cache hit rate goes way up because input shape is stable.
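A rule-based distiller can be sketched as a filter over the sub-agent's declared input schema. Field names mirror the contract above; the 4-characters-per-token heuristic is an assumption for the sketch, not a real tokenizer:

```python
def distill(context: dict, input_schema: dict) -> dict:
    """Pass only the fields the sub-agent declared; clamp each to its budget."""
    handoff = {}
    for field, spec in input_schema.items():
        if field not in context:
            continue  # undeclared or missing fields never cross the boundary
        value = context[field]
        max_tokens = spec.get("max_tokens")
        if max_tokens is not None and isinstance(value, str):
            value = value[: max_tokens * 4]  # crude ~4 chars/token proxy
        max_len = spec.get("max_len")
        if max_len is not None and isinstance(value, list):
            value = value[:max_len]
        handoff[field] = value
    return handoff
```

Anything the parent holds but the schema does not declare — the full conversation, PII fields — simply never appears in the handoff, which is what makes the boundary auditable.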
The four orchestration patterns
Pattern 1 — Delegate (parent waits)
Parent waits for sub-agent. Simple, used when the sub-agent's output is on the critical path.
Pattern 2 — Race (multiple sub-agents, take first good)
Parent fans out 2-3 sub-agents with the same goal but different prompts/models. Take the first that meets the eval bar. Used when latency matters more than cost.
Pattern 3 — Ensemble (multiple, judge merges)
Parent fans out, then a tiny judge model picks/merges. Used for high-stakes recommendations.
Pattern 4 — Pipeline (sub-agent → sub-agent)
Output of A feeds B. Bounded depth (we use 2). Used for compose-then-critique workflows.
flowchart LR
subgraph P1[Pattern 1 Delegate]
A1[Parent] --> B1[Sub-agent] --> C1[Done]
end
subgraph P2[Pattern 2 Race]
A2[Parent] --> B2a[Sub-A]
A2 --> B2b[Sub-B]
A2 --> B2c[Sub-C]
B2a --> D2[First good]
B2b --> D2
B2c --> D2
end
subgraph P3[Pattern 3 Ensemble]
A3[Parent] --> B3a[Sub-A]
A3 --> B3b[Sub-B]
B3a --> J[Judge merge]
B3b --> J
J --> D3[Done]
end
subgraph P4[Pattern 4 Pipeline]
A4[Parent] --> B4a[Sub-A draft] --> B4b[Sub-B critique] --> D4[Done]
end
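Pattern 2 (race) can be sketched with asyncio. This is a hypothetical harness helper, not the production implementation; attempts are assumed to be zero-argument coroutine functions wrapping sub-agent invocations:

```python
import asyncio

async def race(attempts, eval_bar, budget_s):
    """Fan out parallel attempts; return the first result that clears the bar."""
    tasks = [asyncio.ensure_future(attempt()) for attempt in attempts]
    try:
        for done in asyncio.as_completed(tasks, timeout=budget_s):
            try:
                result = await done
            except Exception:
                continue  # one attempt failing must not sink the race
            if eval_bar(result):
                return result
    except asyncio.TimeoutError:
        pass  # budget exhausted before any attempt cleared the bar
    finally:
        for t in tasks:
            t.cancel()  # stop paying for the losers
    return None
```

Note the eval bar is applied per result: "first good", not merely "first done" — a fast-but-garbage attempt does not win the race.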
Production pitfalls
Pitfall 1 — The sub-agent that re-asks the user
Sub-agent prompt says "ask clarifying questions if unclear." It returns "what year were you born?" to the parent. The parent has no concept of forwarding that to the user. Result: a non-sequitur in the chat.
Fix: sub-agents are forbidden from producing user-facing questions. Their output schema does not have a "question to user" field. If they have low confidence, they return confidence: low and the parent decides whether to ask the user.
Pitfall 2 — Recursive blow-up
Sub-agent A calls sub-agent B which calls sub-agent A. Token costs explode in 3 turns.
Fix: max_recursion_depth: 0 for almost all sub-agents. Orchestration is a DAG, not a graph.
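The DAG-only rule can be enforced by threading the invocation chain through every call. This guard is a sketch with hypothetical names; the chain is the path of sub-agent names from the parent down to the caller:

```python
class RecursionViolation(Exception):
    pass

def check_invocation(chain, callee, max_depth=0):
    """Return the extended chain, or raise if the call would create a cycle
    or exceed the declared recursion depth."""
    if callee in chain:
        raise RecursionViolation("cycle: " + " -> ".join(chain + [callee]))
    if len(chain) > max_depth:
        raise RecursionViolation("max_recursion_depth exceeded")
    return chain + [callee]
```

With max_depth=0 the parent may invoke a sub-agent (empty chain), but that sub-agent cannot invoke anything further — which is exactly the `max_recursion_depth: 0` contract above.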
Pitfall 3 — Lost trace stitching
Parent trace shows "called sub-agent X, got result." Sub-agent's internal trace is in a different store. On-call cannot reconstruct what happened.
Fix: the gateway propagates parent_trace_id and span_id. Sub-agent's spans are children of the parent span in the same trace. One query, one timeline.
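A minimal sketch of the propagation rule: sub-agent spans inherit the parent's trace_id and record the parent span_id rather than minting their own. The in-memory list stands in for a real trace backend:

```python
import uuid

def start_span(store, name, parent_trace_id=None, parent_span_id=None):
    span = {
        "span_id": uuid.uuid4().hex,
        # Only the root mints a trace_id; sub-agents always inherit it.
        "trace_id": parent_trace_id or uuid.uuid4().hex,
        "parent_span_id": parent_span_id,
        "name": name,
    }
    store.append(span)
    return span

def timeline(store, trace_id):
    # "One query, one timeline": every span for a user turn shares a trace_id.
    return [s for s in store if s["trace_id"] == trace_id]
```

The invariant to test in CI is the inheritance: if a sub-agent span ever carries its own fresh trace_id, it is orphaned and on-call loses it.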
Pitfall 4 — Eval skew between parent and sub-agent
Parent eval looks great; user satisfaction is dropping. Root cause: the sub-agent's output is fine on its own rubric but doesn't compose well with the parent's voice.
Fix: end-to-end eval at the parent level is the source of truth. Sub-agent evals are diagnostic, not deciding. Don't ship a sub-agent whose unit eval improves but parent eval regresses.
Q&A drill — opening question
Q: Why have sub-agents at all? Why not one big agent that does everything?
Three reasons, in increasing order of importance:
1. Prompt size. A monolithic agent prompt that handles recommendations, comparisons, summaries, and edge cases is 12K tokens. Sub-agents with narrow scope are 2-3K each.
2. Independent improvement. The recommendation team can ship a new explainer prompt without touching the comparison team's prompt.
3. Blast radius. A bug in the comparison sub-agent doesn't degrade recommendations. Compartmentalization is the same reason microservices exist; the same logic applies to LLM workflows.
The cost is orchestration complexity — and that's exactly the harness work this folder is about.
Grilling — Round 1
Q1. How do you debug a 4-sub-agent fan-out where one returned garbage?
The trace stitching is the answer. One query: trace_id = X. The result is a Gantt-chart-style view: parent span at top, sub-agent spans below with start/end times, costs, models used, tool calls. The garbage sub-agent's input + output are visible. The eval judgment for that sub-agent is attached. You don't grep four log streams.
Q2. What's the cost difference between Pattern 2 (race) and Pattern 1 (delegate) at scale?
Race costs 2-3× in tokens because you pay for all parallel attempts. But latency drops 30-50% on the long tail. At 1.2M concurrent users, a 2s p95 improvement at 2× cost can be a winning trade if user satisfaction lifts conversion. The decision is per-surface — race for chat (latency-sensitive), delegate for batch enrichment (cost-sensitive).
Q3. How do you canary a new sub-agent prompt?
Same machinery as story 01: the sub-agent has a version. The parent's gateway routes 5% of invocations to v2, 95% to v1. The shadow eval judges both. Promotion gate: v2's eval score ≥ v1's, AND v2's parent-level end-to-end eval ≥ v1's. Both gates required.
Grilling — Round 2 (architect-level)
Q4. Suppose a sub-agent's tool-whitelist needs to grow because a new use case demands it. What's the change-management process?
Whitelist growth is a security/blast-radius decision, not a feature decision. Process:
1. The whitelist change is a separate PR from feature work.
2. Eval bar: the new tool's failure modes must be tested in the sub-agent's eval suite (e.g., "what happens when the new tool returns nothing?").
3. Capability flag: the new tool is gated behind a flag for this sub-agent for the first 2 weeks. Default off; canary on.
4. Audit trail: every invocation of the newly allowed tool by this sub-agent is logged at INFO for 30 days.
This sounds heavy. It is — and it correctly imposes friction on what is structurally a privilege escalation.
Q5. A sub-agent times out 5% of the time at p99. What's the right response?
Diagnose first, then choose from a menu:
- If the timeout is upstream (e.g., a backing tool is slow) — fix the tool, not the sub-agent.
- If the timeout is the LLM call itself — reduce input size (tighter distiller), or downshift to a faster model with calibrated quality measurement.
- If the timeout is sub-agent-specific reasoning that takes too long — split into a pipeline (Pattern 4): a fast first-pass sub-agent plus a slower second pass that's only invoked conditionally.
- As a stopgap — race-pattern: invoke a faster fallback sub-agent in parallel; take whichever finishes first within the budget.
The wrong response is "increase the timeout." That hides the regression and increases tail latency.
Q6. How does sub-agent orchestration interact with pause/resume (story 05)?
Sub-agents are suspendable units. When the parent's harness checkpoints, it captures: parent state, the list of in-flight sub-agent invocations, their inputs, and their futures. Three resume strategies:
- Cheap and idempotent — re-invoke from scratch.
- Expensive but idempotent — check the sub-agent's idempotency cache; reuse the result if present.
- Non-idempotent (rare for sub-agents — usually a smell) — store the partial output durably; resume from the partial.
The first is the default; you only invest in the others when the cost-of-redo is high enough to justify the harness work.
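The default-plus-cache resume logic can be sketched in a few lines. call_key and the dict cache are stand-ins for a durable idempotency store:

```python
def resume_invocation(call_key, inputs, cache, invoke):
    """On resume: reuse a cached result if present, else re-invoke from scratch."""
    if call_key in cache:
        return cache[call_key]   # expensive-but-idempotent path: reuse
    result = invoke(inputs)      # cheap-and-idempotent default: redo the work
    cache[call_key] = result
    return result
```

The call_key must be derived from the sub-agent name, version, and distilled input — not from wall-clock time — or resumes will never hit the cache.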
Intuition gained
- Sub-agents are skills with an LLM inside. Same contract layer, plus model + budget + tool-whitelist + recursion-depth.
- Context handoff via a distiller is the under-discussed lever — it dominates token cost and prompt-cache hit rate.
- Orchestration patterns (delegate / race / ensemble / pipeline) are policy choices under cost and latency constraints. Different surfaces pick different patterns.
- Trace stitching is non-negotiable. Without it, on-call cannot debug fan-outs.
- End-to-end eval is the source of truth, not sub-agent local eval.
See also
- 02-skill-composition-and-invocation.md — sub-agents inherit the skill contract layer
- 04-active-passive-evals.md — how shadow eval works for sub-agents
- 06-checkpointing-and-serving.md — durable state for in-flight sub-agent invocations
- Changing-Constraints-Scenarios/03-cost-budget-halved.md — switching delegate ↔ race under cost pressure