01 — The Orchestrator Agent
The router is not code. The router is Claude reading tool descriptions.
The Orchestrator is the central nervous system of MangaAssist. Every user message passes through it; nothing reaches a sub-agent or MCP server without a routing decision from it. This document is a deep dive into what it actually is, what it does, how it fails, and why it's shaped the way it is.
What it is
A single instance of Claude 3.5 Sonnet running on Amazon Bedrock, invoked from a Lambda handler on each user turn. It is stateful at the conversation level — meaning it carries forward context across turns — but stateless at the process level: each Lambda invocation reconstitutes its working memory from Redis and DynamoDB.
There is no Python class called Orchestrator that contains routing logic. The Orchestrator's behavior is entirely defined by:
- The system prompt loaded at the start of each request
- The tool manifest (seven MCP servers, ~15 tools total) injected as Claude's `tools` parameter
- The conversation history loaded from Redis/DynamoDB
- The inference run itself, which produces tool calls or a final answer
If you removed Claude, you would have nothing. There is no fallback orchestrator written in Python.
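To make "the router is Claude reading tool descriptions" concrete, here is a minimal sketch of the per-turn Bedrock call, assuming the Converse API. The function name, model ID, and inference settings are illustrative stand-ins, not the production values.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

def invoke_orchestrator(system_prompt: str, tool_manifest: list[dict],
                        history: list[dict], user_message: str) -> dict:
    # One turn = one Converse call. The "routing" is whatever tool_use blocks
    # Claude emits in its response; there is no routing code on our side.
    return bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",  # illustrative model ID
        system=[{"text": system_prompt}],
        messages=history + [{"role": "user", "content": [{"text": user_message}]}],
        toolConfig={"tools": tool_manifest},  # toolSpec entries from the 7 MCP servers
        inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
    )
```

The response carries either plain text blocks (a direct answer) or tool-use blocks (the routing decision); everything downstream of that is dispatch plumbing.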
The agentic loop
Initialize → Plan → Act → Observe → Reflect → (loop or terminate)
Each phase is concrete (a sketch of the full loop follows the walkthrough):
Initialize (≈ 50ms)
- Load conversation state from `session:{session_id}:context` in ElastiCache (TTL 30 min)
- Load user profile from DynamoDB (`Users` table) — favorite genres, locale, Prime status
- Compose the request payload: system prompt + tool manifest + history + new message
Plan (≈ 400ms first token)
- Bedrock invokes Claude with the assembled prompt
- Claude streams either:
- A direct response (greeting, chitchat — no tool needed)
- One or more tool calls in structured JSON
- The decision is made inside the model by reading tool descriptions against the user's intent
Act (≈ 200–800ms per tool)
- Tool calls dispatched via `asyncio.gather` for independent ones
- Sequential chain when one tool's output feeds the next
- Each tool call hits an MCP server over HTTP/SSE through ECS Fargate
- Up to 5 tools in flight concurrently
Observe
- Tool results returned and appended to Claude's context as `<tool_result>` XML blocks
- The XML wrapping matters — see 06-tool-dispatch-and-routing.md for why this is a prompt-injection defense
Reflect
- Claude synthesizes the tool results into a final answer
- Post-generation guardrails run: ASIN validity, price sanity, link resolution
- If guardrails fail, regenerate; if regeneration fails, fall back to template
Terminate
- Stream tokens to user via WebSocket
- Persist session state back to DynamoDB
- Emit CloudWatch metrics: latency, tokens used, tools called, cache hit rate
- Close the X-Ray trace
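Put together, the loop is small enough to sketch. This is a rough shape, not the production handler; every helper name below is illustrative, and the stopping conditions are covered in detail later in this document.

```python
async def run_turn(state, user_msg: str) -> str:
    """Rough shape of one user turn; all helpers are illustrative."""
    messages = build_messages(state, user_msg)               # Initialize
    for _ in range(MAX_ITERATIONS):
        reply = await plan(messages)                         # Plan: one Claude inference
        if not reply.tool_calls:                             # No tools needed → answer
            return await reflect_and_guard(reply, messages)  # Reflect + guardrails
        results = await dispatch(reply.tool_calls)           # Act: tool dispatch
        messages += as_tool_result_blocks(results)           # Observe: append <tool_result>
    return best_effort_answer(messages)                      # Iteration cap tripped
```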
The system prompt — anatomy
The system prompt is the single most important artifact in the entire chatbot. Get it wrong and tool selection collapses, hallucination spikes, latency drifts. The structure:
┌─────────────────────────────────────┐
│ ROLE DEFINITION │ "You are MangaAssist, a shopping assistant..."
├─────────────────────────────────────┤
│ HARD CONSTRAINTS │ Never invent prices. Always cite tool sources.
├─────────────────────────────────────┤
│ TOOL USAGE RULES │ For policy → support_mcp. For order → order_mcp.
├─────────────────────────────────────┤
│ TOOL INVENTORY SUMMARY │ 7 MCP servers, ~15 tools, with rich descriptions
├─────────────────────────────────────┤
│ RESPONSE STYLE │ Concise. No marketing language. Cite sources.
├─────────────────────────────────────┤
│ ESCALATION RULES │ When to call escalate_to_agent
├─────────────────────────────────────┤
│ CONTEXT BLOCK (per request) │ Page ASIN, locale, user profile, active promos
├─────────────────────────────────────┤
│ RETRIEVED KNOWLEDGE (per request) │ Top-3 RAG chunks if intent = policy/faq
├─────────────────────────────────────┤
│ CONVERSATION HISTORY │ Last 3-5 turns + summary
└─────────────────────────────────────┘
The first six blocks are static — they are identical across every request. This is intentional: it lets Bedrock's prompt caching kick in. Cache hit rate target: > 85%. The last three blocks are dynamic per request.
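This static/dynamic split is exactly what the prompt cache keys on. Assuming Bedrock's cache-checkpoint blocks in the Converse API, the assembly looks roughly like the sketch below; variable names are placeholders, not the real prompt-assembly code.

```python
# Static blocks (role, constraints, tool rules, inventory, style, escalation)
# go before the cache checkpoint; the per-request blocks go after it.
system = [
    {"text": STATIC_PROMPT_BLOCKS},           # identical on every request
    {"cachePoint": {"type": "default"}},      # Bedrock caches everything above this marker
    {"text": render_context_block(request)},  # page ASIN, locale, profile, promos
    {"text": render_rag_chunks(request)},     # top-3 chunks, only for policy/faq intents
]
```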
The non-negotiable tool usage rules
These live verbatim in the system prompt:
1. For any price question, ALWAYS call get_price. Never state or estimate a price.
2. For any order question, ALWAYS call get_order_status or check_stock.
3. For return/refund questions, ALWAYS call answer_faq or check_refund_eligibility.
4. For recommendations, ALWAYS call get_recommendations or get_similar_titles.
5. For trending content, ALWAYS call get_trending or get_new_releases.
6. ALWAYS cite the tool source when answering policy questions.
7. If a tool returns escalation_suggested=true, ALWAYS call escalate_to_agent next.
These exist because, in early evaluations, Claude would occasionally answer policy questions from training data ("Amazon's return window is typically 30 days...") without calling the tool. That's a hallucination risk. The ALWAYS call X pattern is blunt but works — it makes the tool call mandatory regardless of how confident the model feels.
Tool dispatch mechanics
Parallel dispatch
When a request has multiple independent intents, the Orchestrator emits multiple tool calls in a single inference turn. The Lambda handler then dispatches them concurrently:
import asyncio

# mcp_clients maps server name → MCP client, built once at Lambda cold start.
async def dispatch(tool_calls: list[ToolCall]) -> list[ToolResult]:
    # Fan out independent calls concurrently; return_exceptions keeps one
    # failing tool from sinking the whole batch.
    return await asyncio.gather(
        *[mcp_clients[tc.server].call(tc.tool, tc.input) for tc in tool_calls],
        return_exceptions=True,
    )
`return_exceptions=True` matters: if one tool raises, the others still succeed and the Orchestrator can still answer with partial information.
Up to 5 tools run concurrently. Past that we throttle — historically 6+ concurrent tool calls have hit MCP server rate limits and pushed P99 over budget.
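A minimal way to enforce that cap is a semaphore around the per-tool call, layered under the dispatch function above; the names here are illustrative.

```python
MAX_CONCURRENT_TOOLS = 5
_tool_slots = asyncio.Semaphore(MAX_CONCURRENT_TOOLS)

async def call_with_limit(client, tool: str, payload: dict):
    # The 6th+ call queues here instead of hitting MCP server rate limits.
    async with _tool_slots:
        return await client.call(tool, payload)
```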
Sequential chaining
When tool B needs tool A's output, Claude emits them across separate inference turns:
Turn 1 (model thinks): "I need the ASIN first."
→ catalog_mcp.search_manga("Naruto vol 3")
← returns ASIN B07X1234
Turn 2 (model thinks): "Now I can get sentiment."
→ review_mcp.get_sentiment_summary(asin="B07X1234")
← returns summary
Note this is emergent — there is no `chain_tools()` function in the codebase. The model decides multi-turn dispatch based on the structure of the user's question. This is also why tool descriptions matter so much: the description has to make the dependency visible to the model.
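For example, a toolSpec whose description states the dependency explicitly; the wording below is illustrative, not the production manifest.

```python
# Illustrative toolSpec entry — not the production manifest wording.
SENTIMENT_TOOL = {
    "toolSpec": {
        "name": "get_sentiment_summary",
        "description": (
            "Summarize review sentiment for a single title. Requires an ASIN. "
            "If you only have a title string, call search_manga first to resolve it."
        ),
        "inputSchema": {"json": {
            "type": "object",
            "properties": {"asin": {"type": "string", "description": "Amazon ASIN"}},
            "required": ["asin"],
        }},
    }
}
```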
State management
The Orchestrator manages three layers of state, but only directly owns one:
| Layer | Owner | Store | TTL | Purpose |
|---|---|---|---|---|
| Working memory | Orchestrator (in-process) | Lambda RAM | Per request | Tool results buffered during the agentic loop |
| Conversation state | Orchestrator | ElastiCache Redis | 30 min | Recent turns, active intent, pending tool results |
| Session state | Orchestrator | DynamoDB | 24 hours | Full turn history, summary, extracted entities |
| User profile | UserService (external) | DynamoDB | Permanent | Genres, purchase history, preferences |
The Orchestrator reads the user profile but never writes to it directly — that's the UserService's domain. This separation prevents the chatbot from corrupting profile data on a bad inference.
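A sketch of the Initialize-phase reads, assuming a module-level Redis client and a DynamoDB `Users` table resource. The Redis key and table name come from the tables above; the client names and profile key are placeholders.

```python
import json

async def load_state(session_id: str, user_id: str) -> dict:
    # Conversation state: owned by the Orchestrator, 30-min TTL in ElastiCache.
    raw = await redis_client.get(f"session:{session_id}:context")
    conversation = json.loads(raw) if raw else {"turns": [], "active_intent": None}

    # User profile: read-only here. Writes belong to UserService, so a bad
    # inference can never corrupt profile data.
    profile = users_table.get_item(Key={"user_id": user_id}).get("Item", {})
    return {"conversation": conversation, "profile": profile}
```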
See 08-memory-architecture.md for the full memory deep dive.
Stopping conditions
Without explicit stops, an LLM agent can loop forever. The Orchestrator has three independent kill switches, any of which terminates the loop:
| Switch | Threshold | What happens on trip |
|---|---|---|
| Iteration limit | 10 tool-call rounds per user message | Return best-effort answer with disclaimer |
| Wall-clock timeout | 8 seconds for the full agentic loop | Cancel pending tool calls, return template fallback |
| Token budget | Hard cap on context window growth | Trigger summarization mid-loop, drop oldest tool results |
The 8-second timeout is the user-facing SLA's safety net. The 3-second P99 target is the goal; the 8-second cap is what guarantees we never wedge a websocket waiting for a runaway model.
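In code, the wall-clock switch is the outermost wrapper; a sketch, assuming the `run_turn` shape from the agentic-loop section above.

```python
async def bounded_turn(state, user_msg: str) -> str:
    try:
        # Wall-clock switch: hard 8s cap; expiry cancels pending tool calls.
        return await asyncio.wait_for(run_turn(state, user_msg), timeout=8.0)
    except asyncio.TimeoutError:
        return template_fallback(state)

# Inside run_turn, the iteration switch (the range(MAX_ITERATIONS) loop) and the
# token-budget switch (summarize and drop the oldest tool results when context
# grows past the cap) terminate the loop independently of the wall clock.
```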
Latency budget
End-to-end target: P99 < 3 seconds, first token at ≈ 950ms.
Total budget: 3000 ms
├── API Gateway + Lambda cold path ≈ 100 ms
├── Initialize (Redis/Dynamo loads) ≈ 50 ms
├── First Claude inference (planning) ≈ 400 ms ← first token here-ish
├── Tool dispatch (parallel, 1 round) ≈ 800 ms
├── Second Claude inference (synthesis) ≈ 500 ms
├── Guardrail validation ≈ 50 ms
├── Streaming finish + persist ≈ 100 ms
└── Buffer ≈ 1000 ms
The 1-second buffer is real and intentional. Bedrock latency has occasional spikes; tools occasionally cold-start. Without a buffer you'd see frequent SLA breaches even when nothing is "wrong."
How the Orchestrator fails (and what we do about it)
| Failure | Detection | Recovery |
|---|---|---|
| Bedrock timeout / 5xx | Inference returns error | Single retry with reduced max_tokens; then template |
| Tool returns malformed JSON | JSON parse fails | Drop that tool result, continue with others |
| Tool returns nothing | Result body is empty list | MCP-side fallback strategy already triggered; pass through to Claude |
| Guardrail rejection (bad ASIN) | Post-gen check fails | Regenerate once with stricter system prompt; then template |
| Loop iteration overrun | Counter exceeds 10 | Return best-effort + escalation offer |
| Wall-clock timeout | 8s exceeded | Cancel via asyncio.wait_for, return template |
| Model confidence drop | Claude reports low confidence | Trigger escalation flow |
The pattern: never fail silently. Either return a useful answer or explicitly hand off. The worst outcome is a confidently-wrong answer; the second-worst is a blank screen.
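The first row of the table as a sketch: one retry with a reduced max_tokens, then the template fallback. Exception and helper names here are placeholders, not the real error types.

```python
async def plan_with_retry(messages: list[dict]) -> dict:
    try:
        return await invoke_claude(messages, max_tokens=1024)
    except (BedrockTimeout, BedrockServerError):
        try:
            # One retry with a smaller budget: a shorter completion is faster
            # and keeps the turn inside the wall-clock cap.
            return await invoke_claude(messages, max_tokens=512)
        except (BedrockTimeout, BedrockServerError):
            return template_response(messages)
```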
See 07-failure-handling.md for the full failure taxonomy and circuit-breaker design.
Why this shape — design tradeoffs considered
| Alternative considered | Why we rejected it |
|---|---|
| Hardcoded routing in Python | Couples application logic to LLM behavior; every new tool requires a code deploy |
| Multi-agent debate (autogen-style) | 3-5x latency, far higher cost, marginal quality gain for this use case |
| Single mega-agent (no sub-agents) | Bigger blast radius; harder to scale individual domains |
| LangChain agent framework | Too much abstraction; harder to debug; we wanted full control over the loop |
| ReAct without explicit stops | Loops on ambiguous questions; latency unpredictable |
The chosen design is "supervisor with bounded specialists" because it matches the failure shape we actually see in production: most issues are domain-local (a stale catalog, a flaky review service), and we want them contained.
Validation: Constraint Sanity Check
The numbers above are the targets in the architecture. This section is a brutal pass over them: which ones survive contact with reality, which ones are aggressive-but-doable with engineering effort, and which ones are flatly inconsistent with each other or with the underlying systems. Without this section, the document is aspirational marketing.
Verdict table
| Claimed metric | Verdict | Why |
|---|---|---|
| First token at ~950ms | Aggressive | Claude Sonnet on Bedrock TTFT is typically 400–1500ms warm; with a 4–6K token system prompt, 950ms is plausible only with prompt caching hot. Cold cache: 1.5–2s. |
| End-to-end P99 < 3s | Inconsistent with itself | Component means sum to ~2s, plus 1s buffer = 3s. But P99 of a sum of P99s is not the sum of means. Realistic P99 with current components: 3.5–4.5s. |
| 8s wall-clock vs 10-iteration limit | Mathematically incompatible | Each iteration ≈ Claude inference (500–1000ms) + tool dispatch (300–800ms) ≈ 1.5s. Ten iterations need ~15s. The 10-iteration cap is unreachable before the 8s timeout fires. One of the two is dead code. |
| Tool dispatch 800ms (parallel, 1 round) | Reasonable | Parallel dispatch latency = max of the slowest tool in the batch, not sum. 800ms holds if individual MCP P99 holds. |
| Cache hit rate > 85% | Conditional | Bedrock prompt cache TTL is ~5 min. 85% is achievable during steady traffic. During off-peak / cold start / quiet hours, hit rate collapses to 30–50%. The metric should be tiered by traffic level, not quoted as a single number. |
| Guardrail validation 50ms | Too tight if any check is networked | ASIN existence check via DynamoDB GetItem: 10–30ms P99 — fine. Link validation via HTTP HEAD: 50–500ms variable — blows the budget. Either keep all checks in-memory/cached, or expand the budget to 150ms. |
| Initialize 50ms | Realistic | Redis GET (~2ms) + DynamoDB GetItem (~15ms P99) + payload assembly (~10ms) ≈ 30–40ms. 50ms holds. |
| 5 concurrent tool calls | Fine on orchestrator, watch MCP fan-out | Lambda async handles 5 concurrent calls trivially. Risk is fan-out to a single MCP server (e.g., 5 simultaneous catalog calls); per-MCP rate limits become the bottleneck, not the orchestrator. |
| Synthesis (second inference) at 500ms | Aggressive | Tool results add 1–3K tokens to context. Sonnet TTFT on a longer prompt: 500–900ms. Full streamed response is longer still. 500ms is first-token, not finish. |
| Streaming finish + persist 100ms | Realistic for persist, ambiguous for streaming | DynamoDB PutItem P99 ~50–80ms. But "streaming finish" depends on response length, not infrastructure — a 200-token answer streams in ~1s on its own. The 100ms here only covers post-stream persist. |
The biggest lie: end-to-end P99 < 3s
The latency budget table sums components to 3000ms total. That math works for the mean, not the 99th percentile. A correct calculation:
Component | P50 | P99
------------------|--------|-------
API GW + Lambda | 50ms | 200ms
Initialize | 30ms | 80ms
Plan inference | 600ms | 1500ms
Tool dispatch | 400ms | 1200ms
Synthesis | 700ms | 1800ms
Guardrails | 30ms | 150ms
Persist + stream | 80ms | 300ms
------------------|--------|-------
Sum P50 | 1890ms |
Sum P99 | | 5230ms ← this is what 99% of users will see at worst
The P99 of a sum is not the sum of P99s (that overstates), but it's also not the sum of medians (that understates). The realistic P99 lands somewhere between — call it 3.5–4.5s — which is over budget. The 3-second target is achievable at P95, maybe P97, not P99.
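A quick way to see where the true P99 lands is to simulate the sum with rough per-component distributions. The lognormal parameters below are illustrative, chosen only to roughly match the P50/P99 columns above; the point is the shape of the argument, not the exact numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
# Lognormal stand-ins for per-component latency (ms), median = first argument.
components = [
    rng.lognormal(np.log(50), 0.5, n),    # API GW + Lambda
    rng.lognormal(np.log(30), 0.4, n),    # Initialize
    rng.lognormal(np.log(600), 0.4, n),   # Plan inference
    rng.lognormal(np.log(400), 0.5, n),   # Tool dispatch
    rng.lognormal(np.log(700), 0.4, n),   # Synthesis
    rng.lognormal(np.log(30), 0.6, n),    # Guardrails
    rng.lognormal(np.log(80), 0.5, n),    # Persist + stream
]
total = np.sum(components, axis=0)
print(np.percentile(total, 50), np.percentile(total, 99))
# The P99 of the sum lands well below the 5230ms sum-of-P99s, but above 3000ms.
```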
What to do: either (a) restate the SLA as P95 < 3s, P99 < 5s — the honest version; or (b) invest in latency reduction on the two heaviest components (Plan + Synthesis inferences) via smaller models for routing, more aggressive caching, speculative decoding. The current document promises (a) without doing (b).
The math problem with the loop
The 10-iteration cap and 8-second wall-clock are documented side by side as if they're complementary. They're not.
Per-iteration cost ≈ Claude inference (500–1000ms) + tool call (300–800ms) ≈ 800–1800ms
10 iterations needed → 8–18 seconds
Within the 8-second wall-clock, only 4–5 iterations are reachable. The 10-iteration cap is therefore decorative — the wall-clock kills you first. Either the iteration cap should be lowered to ~4 (matching reality), or the wall-clock raised to ~15s (which breaks the 3s P99 SLA conversation entirely).
The honest design would be: max 4 iterations, 6s wall-clock, with the agent biased toward parallel dispatch (which compresses iterations). Document the bias and the limits will be consistent.
Cache hit rate is a window function, not a number
Quoting "> 85% cache hit rate" as a single number hides the distribution. Bedrock prompt caching has a 5-minute TTL. Behavior:
- Steady high traffic (>1 req/sec sustained): cache stays hot, hit rate > 90%
- Bursty traffic (gaps > 5 min): each burst pays a cache miss; hit rate 40–70%
- Cold start hours (e.g., overnight regional traffic): hit rate < 30%
The right metric is hit rate during business hours (steady) and cost amortization (steady-state cache savings vs. cold-start cost). A flat 85% target is unfalsifiable until you condition on traffic regime.
What this section is not
This is not a takedown of the architecture — the architecture is sound. It is a takedown of the specific numbers without their conditions. Every architecture doc has this problem: targets are written in optimistic mode and never re-validated against measured behavior. Including this section in every agent dive forces the question: is this number from a load test, or from a whiteboard?
If it's from a whiteboard, mark it. If it's from a load test, link the test.
Observability
Every Orchestrator invocation emits:
- CloudWatch metrics — latency P50/P95/P99, tokens in/out, tools called, cache hit rate, guardrail rejections, escalation rate
- X-Ray trace — every tool call is a sub-segment; the full flow is visible in the service map
- Structured logs — request ID, session ID, intent, tools dispatched, final answer hash
The single most useful metric in practice has been cache hit rate. When it drops below 80%, something has changed in the system prompt or context block — it's an early warning that latency and cost are about to spike.
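A sketch of the per-invocation metric emission, with the namespace and metric names as placeholders rather than the production values.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def emit_turn_metrics(latency_ms: float, cache_hit: bool, tools_called: int) -> None:
    # Namespace and metric names are illustrative.
    cloudwatch.put_metric_data(
        Namespace="MangaAssist/Orchestrator",
        MetricData=[
            {"MetricName": "TurnLatency", "Value": latency_ms, "Unit": "Milliseconds"},
            {"MetricName": "PromptCacheHit", "Value": 1.0 if cache_hit else 0.0, "Unit": "Count"},
            {"MetricName": "ToolsCalled", "Value": float(tools_called), "Unit": "Count"},
        ],
    )
```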
Related documents
- 00-the-story.md — Narrative walkthrough
- 06-tool-dispatch-and-routing.md — How Claude selects tools
- 07-failure-handling.md — Circuit breakers and retry policy
- 08-memory-architecture.md — Three-tier memory in detail
- ../agents.md — Canonical reference
- ../RAG-MCP-Integration/08-mcp-orchestration-router.md — Routing-as-prompt deep dive