User Story 03 — Sub-Agent Orchestration
Pillar: P1 (AI Workflow) + P2 (Harness) · Stage unlocked: 2 → 4 · Reading time: ~13 min
TL;DR
A sub-agent is a skill that itself has an LLM in the loop. Sub-agents are powerful (they decompose hard tasks) and dangerous (they fan out cost, latency, and failure modes). At MangaAssist scale, the difference between a system that scales and one that melts is whether the parent agent treats sub-agents as bounded contracts or as "another LLM I can talk to."
The User Story
As the architect of MangaAssist's discovery experience, I want to invoke specialized sub-agents (recommendation explainer, comparison writer, multi-title summarizer) under explicit budgets and isolated context, so that the parent planner stays small and focused, complex tasks delegate cleanly, and a misbehaving sub-agent cannot blow up the parent's cost, context, or latency budget.
Acceptance criteria
- Every sub-agent has a token budget, time budget, and tool whitelist declared at registration.
- Parent → sub-agent context handoff is explicit — only declared fields cross the boundary, not the entire conversation.
- Sub-agent failure (timeout, refusal, error) is a typed return to the parent, not an exception that bubbles.
- Sub-agent traces are stitched into the parent trace with a parent_trace_id link, not orphaned.
- A sub-agent cannot recursively invoke itself or its parent (DAG-only orchestration).
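A minimal sketch of what the typed-return criterion could look like in the parent's harness. The names (SubAgentResult, handle) are illustrative assumptions, not an API from the source:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class SubAgentStatus(Enum):
    OK = "ok"
    TIMEOUT = "timeout"
    REFUSAL = "refusal"
    ERROR = "error"

@dataclass(frozen=True)
class SubAgentResult:
    status: SubAgentStatus
    output: Optional[dict] = None   # present only when status is OK
    detail: str = ""                # diagnostic, never user-facing
    parent_trace_id: str = ""       # stitches this call into the parent trace

def handle(result: SubAgentResult) -> str:
    # The parent branches on a typed status instead of catching a bubbled exception.
    if result.status is SubAgentStatus.OK:
        return "compose"
    if result.status is SubAgentStatus.TIMEOUT:
        return "fallback"
    return "degrade"
```

The point of the frozen dataclass is that every failure mode is a value the parent must handle, not a control-flow surprise.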
When to use a sub-agent (vs a deterministic skill)
| Situation | Sub-agent | Deterministic skill |
|---|---|---|
| Task needs natural-language reasoning over retrieved content | ✅ | ❌ |
| Task is well-described by inputs → outputs without judgment | ❌ | ✅ |
| Need to combine 3+ skills with conditional logic | ✅ | ❌ |
| Latency budget is tight (<300ms) | ❌ (LLM call too slow) | ✅ |
| Output quality matters more than determinism | ✅ | ❌ |
| Output must be byte-for-byte reproducible | ❌ | ✅ |
A sub-agent costs ~5-15× more than a skill in $ and ~3-8× in latency. Use it where the judgment is real.
The orchestration topology
flowchart TB
P[Parent Planner Agent]
P -->|invoke recommend-explainer| SA1[Sub-Agent: Recommendation Explainer]
P -->|invoke compare-titles| SA2[Sub-Agent: Multi-Title Comparator]
P -->|invoke summarize-arc| SA3[Sub-Agent: Story-Arc Summarizer]
SA1 --> S1[catalog-search]
SA1 --> S2[user-prefs-reco]
SA1 --> S3[review-sentiment]
SA2 --> S1
SA2 --> S3
SA2 --> S4[cross-title-link]
SA3 --> S5[support-policy: spoiler rules]
SA3 --> S3
classDef parent fill:#dbeafe,stroke:#1e3a8a
classDef agent fill:#fef3c7,stroke:#92400e
classDef skill fill:#dcfce7,stroke:#166534
class P parent
class SA1,SA2,SA3 agent
class S1,S2,S3,S4,S5 skill
The parent never calls a leaf skill directly if a sub-agent for that domain exists — that keeps the parent prompt clean. Sub-agents have narrow tool whitelists: the recommendation explainer cannot call the support-policy skill, period.
The contract a sub-agent registers
Same shape as a skill (story 02), with extra fields:
name: recommendation-explainer
kind: sub-agent
version: 1.2
backing_model: claude-sonnet-4-6 # the LLM behind the sub-agent
fallback_model: claude-haiku-4-5
input:
user_query: { type: string, max_tokens: 500 }
candidate_title_ids: { type: array, items: string, max_len: 20 }
user_locale: { type: string }
user_prefs_summary: { type: string, max_tokens: 800 } # NOT the full conversation
output:
ranked_explanations:
type: array
items:
title_id: string
explanation: string # max 2 sentences
confidence: float
policy:
token_budget: { input: 4000, output: 1500 }
time_budget_ms: 3500
tool_whitelist: [catalog-search, user-prefs-reco, review-sentiment]
max_tool_calls: 5
max_recursion_depth: 0 # cannot invoke other sub-agents
eval:
shadow_eval: true
judge_model: claude-opus-4-7
rubric: rubrics/explanation-quality-v3.yaml
The fields that distinguish a sub-agent from a skill:
- backing_model + fallback_model — the LLM and its degradation path
- tool_whitelist + max_tool_calls — bound the sub-agent's blast radius
- max_recursion_depth — usually 0 to prevent fan-out chains
- eval — sub-agents almost always need shadow eval because output is judgment-laden
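A sketch of how a harness might enforce the policy block at invocation time. The class and exception names are assumptions for illustration, not MangaAssist's actual API:

```python
class PolicyViolation(Exception):
    pass

class SubAgentPolicy:
    def __init__(self, tool_whitelist, max_tool_calls, time_budget_ms,
                 max_recursion_depth=0):
        self.tool_whitelist = set(tool_whitelist)
        self.max_tool_calls = max_tool_calls
        self.time_budget_ms = time_budget_ms
        self.max_recursion_depth = max_recursion_depth
        self.calls_made = 0

    def authorize_tool_call(self, tool_name, depth=0):
        # Checked before every tool call, so a misbehaving sub-agent
        # fails fast instead of burning budget.
        if depth > self.max_recursion_depth:
            raise PolicyViolation("recursion depth exceeded")
        if tool_name not in self.tool_whitelist:
            raise PolicyViolation(f"{tool_name} not in whitelist")
        if self.calls_made >= self.max_tool_calls:
            raise PolicyViolation("max_tool_calls exhausted")
        self.calls_made += 1
```

With the recommendation-explainer contract above, authorize_tool_call("support-policy") would raise immediately — the whitelist is enforced mechanically, not by prompt instruction.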
Context handoff — the most under-discussed problem
The naive approach: "pass the conversation history to the sub-agent." That's how you blow your token budget and leak PII into a sub-agent's audit trail.
The right approach: structured handoff with declared fields.
sequenceDiagram
participant User
participant Parent
participant Distiller
participant SubAgent
User->>Parent: "rec me dark fantasy like Berserk but newer"
Parent->>Parent: full conversation context (32K tokens)
Parent->>Distiller: extract sub-agent inputs from context
Distiller-->>Parent: { user_query, candidate_ids, prefs_summary } ~1.5K tokens
Parent->>SubAgent: structured input only
SubAgent->>SubAgent: scoped reasoning, narrow context
SubAgent-->>Parent: typed output
Parent->>User: composed response
The distiller is itself a small deterministic step (could be a tiny LLM or rule-based) that extracts only what the sub-agent declared in its input schema. Three benefits:
- Token budget for sub-agent stays predictable.
- PII handling is explicit at the boundary.
- Sub-agent prompt cache hit rate goes way up because input shape is stable.
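A rule-based distiller can be sketched as a filter over the sub-agent's declared input schema. Field names mirror the contract above; the 4-characters-per-token heuristic is an assumption for the sketch, not a real tokenizer:

```python
def distill(context: dict, input_schema: dict) -> dict:
    """Pass only the fields the sub-agent declared; clamp each to its budget."""
    handoff = {}
    for field, spec in input_schema.items():
        if field not in context:
            continue  # undeclared or missing fields never cross the boundary
        value = context[field]
        max_tokens = spec.get("max_tokens")
        if max_tokens is not None and isinstance(value, str):
            value = value[: max_tokens * 4]  # crude ~4 chars/token proxy
        max_len = spec.get("max_len")
        if max_len is not None and isinstance(value, list):
            value = value[:max_len]
        handoff[field] = value
    return handoff
```

Anything the parent holds but the schema does not declare — the full conversation, PII fields — simply never appears in the handoff, which is what makes the boundary auditable.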
The four orchestration patterns
Pattern 1 — Delegate (parent waits)
Parent waits for sub-agent. Simple, used when the sub-agent's output is on the critical path.
Pattern 2 — Race (multiple sub-agents, take first good)
Parent fans out 2-3 sub-agents with the same goal but different prompts/models. Take the first that meets the eval bar. Used when latency matters more than cost.
Pattern 3 — Ensemble (multiple, judge merges)
Parent fans out, then a tiny judge model picks/merges. Used for high-stakes recommendations.
Pattern 4 — Pipeline (sub-agent → sub-agent)
Output of A feeds B. Bounded depth (we use 2). Used for compose-then-critique workflows.
flowchart LR
subgraph P1[Pattern 1 Delegate]
A1[Parent] --> B1[Sub-agent] --> C1[Done]
end
subgraph P2[Pattern 2 Race]
A2[Parent] --> B2a[Sub-A]
A2 --> B2b[Sub-B]
A2 --> B2c[Sub-C]
B2a --> D2[First good]
B2b --> D2
B2c --> D2
end
subgraph P3[Pattern 3 Ensemble]
A3[Parent] --> B3a[Sub-A]
A3 --> B3b[Sub-B]
B3a --> J[Judge merge]
B3b --> J
J --> D3[Done]
end
subgraph P4[Pattern 4 Pipeline]
A4[Parent] --> B4a[Sub-A draft] --> B4b[Sub-B critique] --> D4[Done]
end
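Pattern 2 (race) can be sketched with asyncio. This is a hypothetical harness helper, not the production implementation; attempts are assumed to be zero-argument coroutine functions wrapping sub-agent invocations:

```python
import asyncio

async def race(attempts, eval_bar, budget_s):
    """Fan out parallel attempts; return the first result that clears the bar."""
    tasks = [asyncio.ensure_future(attempt()) for attempt in attempts]
    try:
        for done in asyncio.as_completed(tasks, timeout=budget_s):
            try:
                result = await done
            except Exception:
                continue  # one attempt failing must not sink the race
            if eval_bar(result):
                return result
    except asyncio.TimeoutError:
        pass  # budget exhausted before any attempt cleared the bar
    finally:
        for t in tasks:
            t.cancel()  # stop paying for the losers
    return None
```

Note the eval bar is applied per result: "first good", not merely "first done" — a fast-but-garbage attempt does not win the race.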
Production pitfalls
Pitfall 1 — The sub-agent that re-asks the user
Sub-agent prompt says "ask clarifying questions if unclear." It returns "what year were you born?" to the parent. The parent has no concept of forwarding that to the user. Result: a non-sequitur in the chat.
Fix: sub-agents are forbidden from producing user-facing questions. Their output schema does not have a "question to user" field. If they have low confidence, they return confidence: low and the parent decides whether to ask the user.
Pitfall 2 — Recursive blow-up
Sub-agent A calls sub-agent B which calls sub-agent A. Token costs explode in 3 turns.
Fix: max_recursion_depth: 0 for almost all sub-agents. Orchestration is a DAG, not a graph.
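The DAG-only rule can be enforced by threading the invocation chain through every call. This guard is a sketch with hypothetical names; the chain is the path of sub-agent names from the parent down to the caller:

```python
class RecursionViolation(Exception):
    pass

def check_invocation(chain, callee, max_depth=0):
    """Return the extended chain, or raise if the call would create a cycle
    or exceed the declared recursion depth."""
    if callee in chain:
        raise RecursionViolation("cycle: " + " -> ".join(chain + [callee]))
    if len(chain) > max_depth:
        raise RecursionViolation("max_recursion_depth exceeded")
    return chain + [callee]
```

With max_depth=0 the parent may invoke a sub-agent (empty chain), but that sub-agent cannot invoke anything further — which is exactly the `max_recursion_depth: 0` contract above.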
Pitfall 3 — Lost trace stitching
Parent trace shows "called sub-agent X, got result." Sub-agent's internal trace is in a different store. On-call cannot reconstruct what happened.
Fix: the gateway propagates parent_trace_id and span_id. Sub-agent's spans are children of the parent span in the same trace. One query, one timeline.
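A minimal sketch of the propagation rule: sub-agent spans inherit the parent's trace_id and record the parent span_id rather than minting their own. The in-memory list stands in for a real trace backend:

```python
import uuid

def start_span(store, name, parent_trace_id=None, parent_span_id=None):
    span = {
        "span_id": uuid.uuid4().hex,
        # Only the root mints a trace_id; sub-agents always inherit it.
        "trace_id": parent_trace_id or uuid.uuid4().hex,
        "parent_span_id": parent_span_id,
        "name": name,
    }
    store.append(span)
    return span

def timeline(store, trace_id):
    # "One query, one timeline": every span for a user turn shares a trace_id.
    return [s for s in store if s["trace_id"] == trace_id]
```

The invariant to test in CI is the inheritance: if a sub-agent span ever carries its own fresh trace_id, it is orphaned and on-call loses it.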
Pitfall 4 — Eval skew between parent and sub-agent
Parent eval looks great; user satisfaction is dropping. Root cause: the sub-agent's output is fine on its own rubric but doesn't compose well with the parent's voice.
Fix: end-to-end eval at the parent level is the source of truth. Sub-agent evals are diagnostic, not deciding. Don't ship a sub-agent whose unit eval improves but parent eval regresses.
Q&A drill — opening question
Q: Why have sub-agents at all? Why not one big agent that does everything?
Three reasons, in increasing order of importance:
1. Prompt size. A monolithic agent prompt that handles recommendations, comparisons, summaries, and edge cases is 12K tokens. Sub-agents with narrow scope are 2-3K each.
2. Independent improvement. The recommendation team can ship a new explainer prompt without touching the comparison team's prompt.
3. Blast radius. A bug in the comparison sub-agent doesn't degrade recommendations. Compartmentalization is the same reason microservices exist; the same logic applies to LLM workflows.
The cost is orchestration complexity — and that's exactly the harness work this folder is about.
Grilling — Round 1
Q1. How do you debug a 4-sub-agent fan-out where one returned garbage?
The trace stitching is the answer. One query: trace_id = X. The result is a Gantt-chart-style view: parent span at top, sub-agent spans below with start/end times, costs, models used, tool calls. The garbage sub-agent's input + output are visible. The eval judgment for that sub-agent is attached. You don't grep four log streams.
Q2. What's the cost difference between Pattern 2 (race) and Pattern 1 (delegate) at scale?
Race costs 2-3× in tokens because you pay for all parallel attempts. But latency drops 30-50% on the long tail. At 1.2M concurrent users, a 2s p95 improvement at 2× cost can be a winning trade if user satisfaction lifts conversion. The decision is per-surface — race for chat (latency-sensitive), delegate for batch enrichment (cost-sensitive).
Q3. How do you canary a new sub-agent prompt?
Same machinery as story 01: the sub-agent has a version. The parent's gateway routes 5% of invocations to v2, 95% to v1. The shadow eval judges both. Promotion gate: v2's eval score ≥ v1's, AND v2's parent-level end-to-end eval ≥ v1's. Both gates required.
Grilling — Round 2 (architect-level)
Q4. Suppose a sub-agent's tool-whitelist needs to grow because a new use case demands it. What's the change-management process?
Whitelist growth is a security/blast-radius decision, not a feature decision. Process:
1. The whitelist change is a separate PR from feature work.
2. Eval bar: the new tool's failure modes must be tested in the sub-agent's eval suite (e.g., "what happens when the new tool returns nothing?").
3. Capability flag: the new tool is gated behind a flag for this sub-agent for the first 2 weeks. Default off; canary on.
4. Audit trail: every invocation of the newly allowed tool by this sub-agent is logged at INFO for 30 days.
This sounds heavy. It is — and it correctly imposes friction on what is structurally a privilege escalation.
Q5. A sub-agent times out 5% of the time at p99. What's the right response?
Diagnose first, then choose from a menu:
- If the timeout is upstream (e.g., a backing tool is slow) — fix the tool, not the sub-agent.
- If the timeout is the LLM call itself — reduce input size (tighter distiller), or downshift to a faster model with calibrated quality measurement.
- If the timeout is sub-agent-specific reasoning that takes too long — split into a pipeline (Pattern 4): a fast first-pass sub-agent plus a slower second pass that's only invoked conditionally.
- As a stopgap — race-pattern: invoke a faster fallback sub-agent in parallel; take whichever finishes first within the budget.
The wrong response is "increase the timeout." That hides the regression and increases tail latency.
Q6. How does sub-agent orchestration interact with pause/resume (story 05)?
Sub-agents are suspendable units. When the parent's harness checkpoints, it captures: parent state, the list of in-flight sub-agent invocations, their inputs, and their futures. Three resume strategies:
- Cheap and idempotent — re-invoke from scratch.
- Expensive but idempotent — check the sub-agent's idempotency cache; reuse the result if present.
- Non-idempotent (rare for sub-agents — usually a smell) — store the partial output durably; resume from the partial.
The first is the default; you only invest in the others when the cost-of-redo is high enough to justify the harness work.
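The default-plus-cache resume logic can be sketched in a few lines. call_key and the dict cache are stand-ins for a durable idempotency store:

```python
def resume_invocation(call_key, inputs, cache, invoke):
    """On resume: reuse a cached result if present, else re-invoke from scratch."""
    if call_key in cache:
        return cache[call_key]   # expensive-but-idempotent path: reuse
    result = invoke(inputs)      # cheap-and-idempotent default: redo the work
    cache[call_key] = result
    return result
```

The call_key must be derived from the sub-agent name, version, and distilled input — not from wall-clock time — or resumes will never hit the cache.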
Intuition gained
- Sub-agents are skills with an LLM inside. Same contract layer, plus model + budget + tool-whitelist + recursion-depth.
- Context handoff via a distiller is the under-discussed lever — it dominates token cost and prompt-cache hit rate.
- Orchestration patterns (delegate / race / ensemble / pipeline) are policy choices under cost and latency constraints. Different surfaces pick different patterns.
- Trace stitching is non-negotiable. Without it, on-call cannot debug fan-outs.
- End-to-end eval is the source of truth, not sub-agent local eval.
See also
- 02-skill-composition-and-invocation.md — sub-agents inherit the skill contract layer
- 04-active-passive-evals.md — how shadow eval works for sub-agents
- 06-checkpointing-and-serving.md — durable state for in-flight sub-agent invocations
- Changing-Constraints-Scenarios/03-cost-budget-halved.md — switching delegate ↔ race under cost pressure