01 — The Orchestrator Agent

The router is not code. The router is Claude reading tool descriptions.

The Orchestrator is the central nervous system of MangaAssist. Every user message passes through it; nothing reaches a sub-agent or MCP server without the Orchestrator deciding it should. This document is a deep dive into what it actually is, what it does, how it fails, and why it's shaped the way it is.


What it is

A single instance of Claude 3.5 Sonnet running on Amazon Bedrock, invoked from a Lambda handler on each user turn. It is stateful at the conversation level — meaning it carries forward context across turns — but stateless at the process level: each Lambda invocation reconstitutes its working memory from Redis and DynamoDB.

There is no Python class called Orchestrator that contains routing logic. The Orchestrator's behavior is entirely defined by:

  1. The system prompt loaded at the start of each request
  2. The tool manifest (seven MCP servers, ~15 tools total) injected as Claude's tools parameter
  3. The conversation history loaded from Redis/DynamoDB
  4. The inference run itself, which produces tool calls or a final answer

If you removed Claude, you would have nothing. There is no fallback orchestrator written in Python.
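
Concretely, one orchestrator turn is a single Bedrock call carrying those four ingredients. A minimal sketch, assuming boto3's converse API; tool_manifest and history are hypothetical variables standing in for items 2 and 3 above:

import boto3

bedrock = boto3.client("bedrock-runtime")

def run_turn(system_prompt: str, history: list, user_message: str, tool_manifest: list):
    # The "Orchestrator" is nothing more than this call: prompt + tools +
    # history in, tool calls or a final answer out.
    return bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
        system=[{"text": system_prompt}],
        messages=history + [{"role": "user", "content": [{"text": user_message}]}],
        toolConfig={"tools": tool_manifest},  # seven MCP servers, ~15 tools
        inferenceConfig={"maxTokens": 1024, "temperature": 0.0},
    )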


The agentic loop

Initialize → Plan → Act → Observe → Reflect → (loop or terminate)

Each phase is concrete:

Initialize (≈ 50ms)

  • Load conversation state from session:{session_id}:context in ElastiCache (TTL 30 min)
  • Load user profile from DynamoDB (Users table) — favorite genres, locale, Prime status
  • Compose the request payload: system prompt + tool manifest + history + new message
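
A minimal sketch of this step, assuming the key schema and table name above (redis-py and boto3; error handling omitted, endpoint hypothetical):

import json
import boto3
import redis

cache = redis.Redis(host="my-elasticache-host")  # hypothetical endpoint
users = boto3.resource("dynamodb").Table("Users")

def initialize(session_id: str, user_id: str) -> tuple[dict, dict]:
    raw = cache.get(f"session:{session_id}:context")  # TTL 30 min
    context = json.loads(raw) if raw else {"turns": []}
    profile = users.get_item(Key={"user_id": user_id}).get("Item", {})
    return context, profile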

Plan (≈ 400ms first token)

  • Bedrock invokes Claude with the assembled prompt
  • Claude streams either:
      • A direct response (greeting, chitchat — no tool needed)
      • One or more tool calls in structured JSON
  • The decision is made inside the model by reading tool descriptions against the user's intent
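
On the handler side, the two outcomes are distinguished by the stop reason. A minimal parsing sketch, assuming the Bedrock converse response shape:

def parse_plan(response: dict):
    # Either a batch of tool calls or a direct text answer.
    message = response["output"]["message"]
    if response["stopReason"] == "tool_use":
        return [block["toolUse"] for block in message["content"] if "toolUse" in block]
    return "".join(block.get("text", "") for block in message["content"])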

Act (≈ 200–800ms per tool)

  • Independent tool calls are dispatched concurrently via asyncio.gather
  • Dependent calls run as a sequential chain, one tool's output feeding the next
  • Each tool call hits an MCP server over HTTP/SSE through ECS Fargate
  • Up to 5 tools in flight concurrently

Observe

  • Tool results returned and appended to Claude's context as <tool_result> XML blocks
  • The XML wrapping matters — see 06-tool-dispatch-and-routing.md for why this is a prompt-injection defense
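
A sketch of the wrapping, with escaping so tool output is treated as data rather than instructions (function name hypothetical):

import html
import json

def wrap_tool_result(tool_name: str, payload: dict) -> str:
    # Escape the body so nothing inside a tool result can close the tag
    # or smuggle markup into the prompt.
    body = html.escape(json.dumps(payload))
    return f'<tool_result tool="{html.escape(tool_name)}">{body}</tool_result>'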

Reflect

  • Claude synthesizes the tool results into a final answer
  • Post-generation guardrails run: ASIN validity, price sanity, link resolution
  • If guardrails fail, regenerate; if regeneration fails, fall back to template
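
A sketch of what such post-generation checks might look like; the regex and the exact-match policy are illustrative, not the production values:

import re

ASIN_RE = re.compile(r"\bB0[A-Z0-9]{8}\b")
PRICE_RE = re.compile(r"\$(\d+(?:\.\d{2})?)")

def guardrails_pass(answer: str, tool_asins: set[str], tool_prices: set[float]) -> bool:
    # Every ASIN in the answer must have come back from a tool this turn.
    if any(asin not in tool_asins for asin in ASIN_RE.findall(answer)):
        return False
    # Every quoted price must match a tool-returned price.
    return all(float(p) in tool_prices for p in PRICE_RE.findall(answer))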

Terminate

  • Stream tokens to user via WebSocket
  • Persist session state back to DynamoDB
  • Emit CloudWatch metrics: latency, tokens used, tools called, cache hit rate
  • Close the X-Ray trace

The system prompt — anatomy

The system prompt is the single most important artifact in the entire chatbot. Get it wrong and tool selection collapses, hallucination spikes, latency drifts. The structure:

┌─────────────────────────────────────┐
│ ROLE DEFINITION                     │  "You are MangaAssist, a shopping assistant..."
├─────────────────────────────────────┤
│ HARD CONSTRAINTS                    │  Never invent prices. Always cite tool sources.
├─────────────────────────────────────┤
│ TOOL USAGE RULES                    │  For policy → support_mcp. For order → order_mcp.
├─────────────────────────────────────┤
│ TOOL INVENTORY SUMMARY              │  7 MCP servers, ~15 tools, with rich descriptions
├─────────────────────────────────────┤
│ RESPONSE STYLE                      │  Concise. No marketing language. Cite sources.
├─────────────────────────────────────┤
│ ESCALATION RULES                    │  When to call escalate_to_agent
├─────────────────────────────────────┤
│ CONTEXT BLOCK (per request)         │  Page ASIN, locale, user profile, active promos
├─────────────────────────────────────┤
│ RETRIEVED KNOWLEDGE (per request)   │  Top-3 RAG chunks if intent = policy/faq
├─────────────────────────────────────┤
│ CONVERSATION HISTORY                │  Last 3-5 turns + summary
└─────────────────────────────────────┘

The first six blocks are static — they are identical across every request. This is intentional: it lets Bedrock's prompt caching kick in. Cache hit rate target: > 85%. The last three blocks are dynamic per request.
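
A sketch of how the static/dynamic split maps onto Bedrock prompt caching, assuming the converse API's cachePoint content block (the exact field shape is worth verifying against the current SDK):

system = [
    # Blocks 1-6: role, constraints, tool rules, inventory, style, escalation.
    {"text": STATIC_PROMPT_BLOCKS},
    # Everything above this marker is eligible for Bedrock's prompt cache.
    {"cachePoint": {"type": "default"}},
    # Blocks 7-8: per-request context and retrieved knowledge (never cached).
    {"text": context_block},
    {"text": retrieved_chunks},
]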

The non-negotiable tool usage rules

These live verbatim in the system prompt:

1. For any price question, ALWAYS call get_price. Never state or estimate a price.
2. For any order question, ALWAYS call get_order_status or check_stock.
3. For return/refund questions, ALWAYS call answer_faq or check_refund_eligibility.
4. For recommendations, ALWAYS call get_recommendations or get_similar_titles.
5. For trending content, ALWAYS call get_trending or get_new_releases.
6. ALWAYS cite the tool source when answering policy questions.
7. If a tool returns escalation_suggested=true, ALWAYS call escalate_to_agent next.

These exist because, in early evaluations, Claude would occasionally answer policy questions from training data ("Amazon's return window is typically 30 days...") without calling the tool. That's a hallucination risk. The ALWAYS call X pattern is blunt but works — it makes the tool call mandatory regardless of how confident the model feels.


Tool dispatch mechanics

Parallel dispatch

When a request has multiple independent intents, the Orchestrator emits multiple tool calls in a single inference turn. The Lambda handler then dispatches them concurrently:

import asyncio

async def dispatch(tool_calls: list[ToolCall]) -> list[ToolResult | BaseException]:
    # Fan out all calls concurrently; exceptions come back as values.
    return await asyncio.gather(
        *[mcp_clients[tc.server].call(tc.tool, tc.input) for tc in tool_calls],
        return_exceptions=True,
    )

return_exceptions=True matters: if one tool raises, the others still succeed and the Orchestrator can still answer with partial information.

Up to 5 tools run concurrently. Past that we throttle — historically 6+ concurrent tool calls have hit MCP server rate limits and pushed P99 over budget.
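
The cap can be enforced with a semaphore around the dispatch from the snippet above; a sketch (constant name hypothetical):

import asyncio

MAX_IN_FLIGHT = 5  # past this, MCP rate limits bite and P99 drifts
_slots = asyncio.Semaphore(MAX_IN_FLIGHT)

async def throttled_call(client, tool: str, payload: dict):
    # gather() still fires everything at once; the semaphore ensures at
    # most five calls are actually in flight at any moment.
    async with _slots:
        return await client.call(tool, payload)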

Sequential chaining

When tool B needs tool A's output, Claude emits them across separate inference turns:

Turn 1 (model thinks): "I need the ASIN first."
  → catalog_mcp.search_manga("Naruto vol 3")
  ← returns ASIN B07X1234

Turn 2 (model thinks): "Now I can get sentiment."
  → review_mcp.get_sentiment_summary(asin="B07X1234")
  ← returns summary

Note this is emergent — there is no chain_tools() function in the codebase. The model decides multi-turn dispatch based on the structure of the user's question. This is also why tool descriptions matter so much: the description has to make the dependency visible to the model.
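
Mechanically, the "chain" is just the outer loop re-invoking the model until it stops asking for tools. A minimal sketch of that loop (synchronous for clarity; execute_tool_calls is a hypothetical helper):

def agent_loop(bedrock, model_id: str, system: list, messages: list, tool_config: dict):
    while True:
        resp = bedrock.converse(modelId=model_id, system=system,
                                messages=messages, toolConfig=tool_config)
        messages.append(resp["output"]["message"])
        if resp["stopReason"] != "tool_use":
            return resp  # final answer; the chain ended itself
        # Execute the requested tools and hand the results back; the model
        # decides whether another hop (turn 2, 3, ...) is needed.
        results = execute_tool_calls(resp["output"]["message"]["content"])
        messages.append({"role": "user", "content": results})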


State management

The Orchestrator touches four layers of state, but directly owns only three:

Layer              | Owner                     | Store             | TTL         | Purpose
-------------------|---------------------------|-------------------|-------------|---------------------------------------------------
Working memory     | Orchestrator (in-process) | Lambda RAM        | Per request | Tool results buffered during the agentic loop
Conversation state | Orchestrator              | ElastiCache Redis | 30 min      | Recent turns, active intent, pending tool results
Session state      | Orchestrator              | DynamoDB          | 24 hours    | Full turn history, summary, extracted entities
User profile       | UserService (external)    | DynamoDB          | Permanent   | Genres, purchase history, preferences

The Orchestrator reads the user profile but never writes to it directly — that's the UserService's domain. This separation prevents the chatbot from corrupting profile data on a bad inference.

See 08-memory-architecture.md for the full memory deep dive.


Stopping conditions

Without explicit stops, an LLM agent can loop forever. The Orchestrator has three independent kill switches, any of which terminates the loop:

Switch             | Threshold                            | What happens on trip
-------------------|--------------------------------------|----------------------------------------------------------
Iteration limit    | 10 tool-call rounds per user message | Return best-effort answer with disclaimer
Wall-clock timeout | 8 seconds for the full agentic loop  | Cancel pending tool calls, return template fallback
Token budget       | Hard cap on context window growth    | Trigger summarization mid-loop, drop oldest tool results

The 8-second timeout is the user-facing SLA's safety net. The 3-second P99 target is the goal; the 8-second cap is what guarantees we never wedge a websocket waiting for a runaway model.
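
Wiring the first two switches around the loop takes only a few lines; a sketch with hypothetical fallback helpers (the token budget check would live inside run_iteration):

import asyncio

async def bounded_loop(run_iteration, max_iters: int = 10, wall_clock_s: float = 8.0):
    async def run():
        for i in range(max_iters):
            done, answer = await run_iteration(i)
            if done:
                return answer
        return best_effort_answer()          # switch 1: iteration limit
    try:
        # wait_for cancels the in-flight task, which cancels pending tool calls.
        return await asyncio.wait_for(run(), timeout=wall_clock_s)
    except asyncio.TimeoutError:
        return template_fallback()           # switch 2: wall-clock timeout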


Latency budget

End-to-end target: P99 < 3 seconds, first token at ≈ 950ms.

Total budget: 3000 ms
├── API Gateway + Lambda cold path        ≈  100 ms
├── Initialize (Redis/Dynamo loads)       ≈   50 ms
├── First Claude inference (planning)     ≈  400 ms  ← first token here-ish
├── Tool dispatch (parallel, 1 round)     ≈  800 ms
├── Second Claude inference (synthesis)   ≈  500 ms
├── Guardrail validation                  ≈   50 ms
├── Streaming finish + persist            ≈  100 ms
└── Buffer                                ≈ 1000 ms

The 1-second buffer is real and intentional. Bedrock latency has occasional spikes; tools occasionally cold-start. Without a buffer you'd see frequent SLA breaches even when nothing is "wrong."


How the Orchestrator fails (and what we do about it)

Failure                        | Detection                     | Recovery
-------------------------------|-------------------------------|------------------------------------------------------------
Bedrock timeout / 5xx          | Inference returns error       | Single retry with reduced max_tokens; then template
Tool returns malformed JSON    | JSON parse fails              | Drop that tool result, continue with others
Tool returns nothing           | Result body is an empty list  | MCP-side fallback already triggered; pass through to Claude
Guardrail rejection (bad ASIN) | Post-generation check fails   | Regenerate once with stricter system prompt; then template
Loop iteration overrun         | Counter exceeds 10            | Return best-effort answer + escalation offer
Wall-clock timeout             | 8s exceeded                   | Cancel via asyncio.wait_for, return template
Model confidence drop          | Claude reports low confidence | Trigger escalation flow

The pattern: never fail silently. Either return a useful answer or explicitly hand off. The worst outcome is a confidently wrong answer; the second-worst is a blank screen.

See 07-failure-handling.md for the full failure taxonomy and circuit-breaker design.


Why this shape — design tradeoffs considered

Alternative considered             | Why we rejected it
-----------------------------------|----------------------------------------------------------------------------------
Hardcoded routing in Python        | Couples application logic to LLM behavior; every new tool requires a code deploy
Multi-agent debate (autogen-style) | 3–5x latency, far higher cost, marginal quality gain for this use case
Single mega-agent (no sub-agents)  | Bigger blast radius; harder to scale individual domains
LangChain agent framework          | Too much abstraction; harder to debug; we wanted full control over the loop
ReAct without explicit stops       | Loops on ambiguous questions; latency unpredictable

The chosen design is "supervisor with bounded specialists" because it matches the failure shape we actually see in production: most issues are domain-local (a stale catalog, a flaky review service), and we want them contained.


Validation: Constraint Sanity Check

The numbers above are the targets in the architecture. This section is a brutal pass over them: which ones survive contact with reality, which ones are aggressive-but-doable with engineering effort, and which ones are flatly inconsistent with each other or with the underlying systems. Without this section, the document is aspirational marketing.

Verdicts

  • First token at ~950ms (verdict: aggressive). Claude Sonnet TTFT on Bedrock is typically 400–1500ms warm; with a 4–6K token system prompt, 950ms is plausible only with the prompt cache hot. Cold cache: 1.5–2s.
  • End-to-end P99 < 3s (verdict: inconsistent with itself). Component means sum to ~2s; add the 1s buffer and you get exactly 3s. But the P99 of a sum is not the sum of component means. Realistic P99 with the current components: 3.5–4.5s.
  • 8s wall-clock vs 10-iteration limit (verdict: mathematically incompatible). Each iteration ≈ Claude inference (500–1000ms) + tool dispatch (300–800ms) ≈ 1.5s, so ten iterations need ~15s. The 10-iteration cap is unreachable before the 8s timeout fires; one of the two is dead code.
  • Tool dispatch 800ms, parallel, one round (verdict: reasonable). Parallel dispatch latency is the max of the slowest tool in the batch, not the sum. 800ms holds if the individual MCP P99s hold.
  • Cache hit rate > 85% (verdict: conditional). Bedrock's prompt cache TTL is ~5 min, so 85% is achievable during steady traffic. During off-peak hours, cold starts, and quiet hours, the hit rate collapses to 30–50%. The metric should be tiered by traffic level, not quoted as a single number.
  • Guardrail validation 50ms (verdict: too tight if any check is networked). An ASIN existence check via DynamoDB GetItem runs 10–30ms P99, which is fine. Link validation via HTTP HEAD runs 50–500ms and is variable, which blows the budget. Either keep all checks in-memory/cached, or expand the budget to 150ms.
  • Initialize 50ms (verdict: realistic). Redis GET (~2ms) + DynamoDB GetItem (~15ms P99) + payload assembly (~10ms) ≈ 30–40ms. 50ms holds.
  • 5 concurrent tool calls (verdict: fine on the orchestrator; watch MCP fan-out). Lambda handles 5 concurrent async calls trivially. The risk is fan-out to a single MCP server (e.g., 5 simultaneous catalog calls), where per-MCP rate limits become the bottleneck, not the orchestrator.
  • Synthesis (second inference) at 500ms (verdict: aggressive). Tool results add 1–3K tokens to the context; Sonnet TTFT on the longer prompt is 500–900ms, and the full streamed response takes longer still. 500ms covers first token, not finish.
  • Streaming finish + persist 100ms (verdict: realistic for persist, ambiguous for streaming). DynamoDB PutItem P99 is ~50–80ms. But "streaming finish" depends on response length, not infrastructure: a 200-token answer takes ~1s to stream on its own. The 100ms here covers only the post-stream persist.

The biggest lie: end-to-end P99 < 3s

The latency budget table sums components to 3000ms total. That math works for the mean, not the 99th percentile. A correct calculation:

Component         | P50    | P99
------------------|--------|-------
API GW + Lambda   | 50ms   | 200ms
Initialize        | 30ms   | 80ms
Plan inference    | 600ms  | 1500ms
Tool dispatch     | 400ms  | 1200ms
Synthesis         | 700ms  | 1800ms
Guardrails        | 30ms   | 150ms
Persist + stream  | 80ms   | 300ms
------------------|--------|-------
Sum P50           | 1890ms |
Sum P99           |        | 5230ms ← naive worst case: the sum of P99s, which overstates (see below)

The P99 of a sum is not the sum of P99s (that overstates), but it's also not the sum of medians (that understates). The realistic P99 lands somewhere between — call it 3.5–4.5s — which is over budget. The 3-second target is achievable at P95, maybe P97, not P99.
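
The cheapest way to see where the true P99 lands is to simulate it. A toy Monte Carlo, fitting a lognormal to each component's (P50, P99) from the table above:

import math
import random

# (P50 ms, P99 ms) per component, taken from the table above.
COMPONENTS = [(50, 200), (30, 80), (600, 1500), (400, 1200),
              (700, 1800), (30, 150), (80, 300)]
Z99 = 2.326  # standard-normal 99th-percentile z-score

def sample_ms(p50: float, p99: float) -> float:
    # Lognormal with median p50 and 99th percentile p99.
    mu = math.log(p50)
    sigma = (math.log(p99) - math.log(p50)) / Z99
    return random.lognormvariate(mu, sigma)

totals = sorted(sum(sample_ms(*c) for c in COMPONENTS) for _ in range(100_000))
print(f"P50 ~ {totals[50_000]:.0f} ms, P99 ~ {totals[99_000]:.0f} ms")
# Lands between the sum of medians (1890 ms) and the sum of P99s (5230 ms).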

What to do: either (a) restate the SLA as P95 < 3s, P99 < 5s — the honest version; or (b) invest in latency reduction on the two heaviest components (Plan + Synthesis inferences) via smaller models for routing, more aggressive caching, speculative decoding. The current document promises the 3-second P99 while doing neither.

The math problem with the loop

The 10-iteration cap and 8-second wall-clock are documented side by side as if they're complementary. They're not.

Per-iteration cost ≈ Claude inference (500–1000ms) + tool call (300–800ms) ≈ 800–1800ms
10 iterations needed → 8–18 seconds

Within the 8-second wall-clock, only 4–5 iterations are reachable. The 10-iteration cap is therefore decorative — the wall-clock kills you first. Either the iteration cap should be lowered to ~4 (matching reality), or the wall-clock raised to ~15s (which breaks the 3s P99 SLA conversation entirely).

The honest design would be: max 4 iterations, 6s wall-clock, with the agent biased toward parallel dispatch (which compresses iterations). Document the bias and the limits will be consistent.

Cache hit rate is a window function, not a number

Quoting "> 85% cache hit rate" as a single number hides the distribution. Bedrock prompt caching has a 5-minute TTL. Behavior:

  • Steady high traffic (>1 req/sec sustained): cache stays hot, hit rate > 90%
  • Bursty traffic (gaps > 5 min): each burst pays a cache miss; hit rate 40–70%
  • Cold start hours (e.g., overnight regional traffic): hit rate < 30%

The right metric is hit rate during business hours (steady) and cost amortization (steady-state cache savings vs. cold-start cost). A flat 85% target is unfalsifiable until you condition on traffic regime.
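
Even a toy arrival model shows the cliff. The sketch below assumes Poisson arrivals and counts a hit whenever the previous request landed inside the TTL window; real traffic is burstier, so real hit rates sit below these numbers at the same average rate:

import random

def hit_rate(req_per_min: float, ttl_s: float = 300.0, n: int = 50_000) -> float:
    t, last, hits = 0.0, None, 0
    for _ in range(n):
        t += random.expovariate(req_per_min / 60.0)  # Poisson inter-arrivals
        if last is not None and t - last <= ttl_s:
            hits += 1                                # previous request kept the cache warm
        last = t
    return hits / n

for rpm in (60, 0.2, 0.05):  # steady traffic, one-per-TTL, overnight
    print(f"{rpm:>6} req/min -> hit rate ~ {hit_rate(rpm):.0%}")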

What this section is not

This is not a takedown of the architecture — the architecture is sound. It is a takedown of the specific numbers without their conditions. Every architecture doc has this problem: targets are written in optimistic mode and never re-validated against measured behavior. Including this section in every agent dive forces the question: is this number from a load test, or from a whiteboard?

If it's from a whiteboard, mark it. If it's from a load test, link the test.


Observability

Every Orchestrator invocation emits:

  • CloudWatch metrics — latency P50/P95/P99, tokens in/out, tools called, cache hit rate, guardrail rejections, escalation rate
  • X-Ray trace — every tool call is a sub-segment; the full flow is visible in the service map
  • Structured logs — request ID, session ID, intent, tools dispatched, final answer hash
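
Emitting these from the Lambda is one put_metric_data call per invocation; a sketch, with a hypothetical namespace and metric names:

import boto3

cloudwatch = boto3.client("cloudwatch")

def emit_metrics(latency_ms: float, tools_called: int, cache_hit: bool):
    cloudwatch.put_metric_data(
        Namespace="MangaAssist/Orchestrator",  # hypothetical namespace
        MetricData=[
            {"MetricName": "EndToEndLatency", "Value": latency_ms, "Unit": "Milliseconds"},
            {"MetricName": "ToolsCalled", "Value": tools_called, "Unit": "Count"},
            {"MetricName": "PromptCacheHit", "Value": 1.0 if cache_hit else 0.0, "Unit": "Count"},
        ],
    )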

The single most useful metric in practice has been cache hit rate. When it drops below 80%, something has changed in the system prompt or context block — it's an early warning that latency and cost are about to spike.