01 — The Orchestrator Agent

The router is not code. The router is Claude reading tool descriptions.

The Orchestrator is the central nervous system of MangaAssist. Every user message passes through it; nothing reaches a sub-agent or MCP server without the Orchestrator deciding it should. This document is a deep dive into what it actually is, what it does, how it fails, and why it's shaped the way it is.


What it is

A single instance of Claude 3.5 Sonnet running on Amazon Bedrock, invoked from a Lambda handler on each user turn. It is stateful at the conversation level — meaning it carries forward context across turns — but stateless at the process level: each Lambda invocation reconstitutes its working memory from Redis and DynamoDB.

There is no Python class called Orchestrator that contains routing logic. The Orchestrator's behavior is entirely defined by:

  1. The system prompt loaded at the start of each request
  2. The tool manifest (seven MCP servers, ~15 tools total) injected as Claude's tools parameter
  3. The conversation history loaded from Redis/DynamoDB
  4. The inference run itself, which produces tool calls or a final answer

If you removed Claude, you would have nothing. There is no fallback orchestrator written in Python.
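
Concretely, one orchestrator turn is a single Bedrock call carrying those four ingredients. A minimal sketch, assuming boto3's converse API; tool_manifest and history are hypothetical variables standing in for items 2 and 3 above:

import boto3

bedrock = boto3.client("bedrock-runtime")

def run_turn(system_prompt: str, history: list, user_message: str, tool_manifest: list):
    # The "Orchestrator" is nothing more than this call: prompt + tools +
    # history in, tool calls or a final answer out.
    return bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
        system=[{"text": system_prompt}],
        messages=history + [{"role": "user", "content": [{"text": user_message}]}],
        toolConfig={"tools": tool_manifest},  # seven MCP servers, ~15 tools
        inferenceConfig={"maxTokens": 1024, "temperature": 0.0},
    )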


The agentic loop

Initialize → Plan → Act → Observe → Reflect → (loop or terminate)

Each phase is concrete:

Initialize (≈ 50ms)

  • Load conversation state from session:{session_id}:context in ElastiCache (TTL 30 min)
  • Load user profile from DynamoDB (Users table) — favorite genres, locale, Prime status
  • Compose the request payload: system prompt + tool manifest + history + new message
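
A minimal sketch of this step, assuming the key schema and table name above (redis-py and boto3; error handling omitted, endpoint hypothetical):

import json
import boto3
import redis

cache = redis.Redis(host="my-elasticache-host")  # hypothetical endpoint
users = boto3.resource("dynamodb").Table("Users")

def initialize(session_id: str, user_id: str) -> tuple[dict, dict]:
    raw = cache.get(f"session:{session_id}:context")  # TTL 30 min
    context = json.loads(raw) if raw else {"turns": []}
    profile = users.get_item(Key={"user_id": user_id}).get("Item", {})
    return context, profile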

Plan (≈ 400ms first token)

  • Bedrock invokes Claude with the assembled prompt
  • Claude streams either:
      • A direct response (greeting, chitchat — no tool needed)
      • One or more tool calls in structured JSON
  • The decision is made inside the model by reading tool descriptions against the user's intent
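
On the handler side, the two outcomes are distinguished by the stop reason. A minimal parsing sketch, assuming the Bedrock converse response shape:

def parse_plan(response: dict):
    # Either a batch of tool calls or a direct text answer.
    message = response["output"]["message"]
    if response["stopReason"] == "tool_use":
        return [block["toolUse"] for block in message["content"] if "toolUse" in block]
    return "".join(block.get("text", "") for block in message["content"])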

Act (≈ 200–800ms per tool)

  • Independent tool calls are dispatched concurrently via asyncio.gather
  • Dependent calls run as a sequential chain, one tool's output feeding the next
  • Each tool call hits an MCP server over HTTP/SSE through ECS Fargate
  • Up to 5 tools in flight concurrently

Observe

  • Tool results returned and appended to Claude's context as <tool_result> XML blocks
  • The XML wrapping matters — see 06-tool-dispatch-and-routing.md for why this is a prompt-injection defense
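
A sketch of the wrapping, with escaping so tool output is treated as data rather than instructions (function name hypothetical):

import html
import json

def wrap_tool_result(tool_name: str, payload: dict) -> str:
    # Escape the body so nothing inside a tool result can close the tag
    # or smuggle markup into the prompt.
    body = html.escape(json.dumps(payload))
    return f'<tool_result tool="{html.escape(tool_name)}">{body}</tool_result>'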

Reflect

  • Claude synthesizes the tool results into a final answer
  • Post-generation guardrails run: ASIN validity, price sanity, link resolution
  • If guardrails fail, regenerate; if regeneration fails, fall back to template
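
A sketch of what such post-generation checks might look like; the regex and the exact-match policy are illustrative, not the production values:

import re

ASIN_RE = re.compile(r"\bB0[A-Z0-9]{8}\b")
PRICE_RE = re.compile(r"\$(\d+(?:\.\d{2})?)")

def guardrails_pass(answer: str, tool_asins: set[str], tool_prices: set[float]) -> bool:
    # Every ASIN in the answer must have come back from a tool this turn.
    if any(asin not in tool_asins for asin in ASIN_RE.findall(answer)):
        return False
    # Every quoted price must match a tool-returned price.
    return all(float(p) in tool_prices for p in PRICE_RE.findall(answer))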

Terminate

  • Stream tokens to user via WebSocket
  • Persist session state back to DynamoDB
  • Emit CloudWatch metrics: latency, tokens used, tools called, cache hit rate
  • Close the X-Ray trace

The system prompt — anatomy

The system prompt is the single most important artifact in the entire chatbot. Get it wrong and tool selection collapses, hallucination spikes, latency drifts. The structure:

┌─────────────────────────────────────┐
│ ROLE DEFINITION                     │  "You are MangaAssist, a shopping assistant..."
├─────────────────────────────────────┤
│ HARD CONSTRAINTS                    │  Never invent prices. Always cite tool sources.
├─────────────────────────────────────┤
│ TOOL USAGE RULES                    │  For policy → support_mcp. For order → order_mcp.
├─────────────────────────────────────┤
│ TOOL INVENTORY SUMMARY              │  7 MCP servers, ~15 tools, with rich descriptions
├─────────────────────────────────────┤
│ RESPONSE STYLE                      │  Concise. No marketing language. Cite sources.
├─────────────────────────────────────┤
│ ESCALATION RULES                    │  When to call escalate_to_agent
├─────────────────────────────────────┤
│ CONTEXT BLOCK (per request)         │  Page ASIN, locale, user profile, active promos
├─────────────────────────────────────┤
│ RETRIEVED KNOWLEDGE (per request)   │  Top-3 RAG chunks if intent = policy/faq
├─────────────────────────────────────┤
│ CONVERSATION HISTORY                │  Last 3-5 turns + summary
└─────────────────────────────────────┘

The first six blocks are static — they are identical across every request. This is intentional: it lets Bedrock's prompt caching kick in. Cache hit rate target: > 85%. The last three blocks are dynamic per request.
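
A sketch of how the static/dynamic split maps onto Bedrock prompt caching, assuming the converse API's cachePoint content block (the exact field shape is worth verifying against the current SDK):

system = [
    # Blocks 1-6: role, constraints, tool rules, inventory, style, escalation.
    {"text": STATIC_PROMPT_BLOCKS},
    # Everything above this marker is eligible for Bedrock's prompt cache.
    {"cachePoint": {"type": "default"}},
    # Blocks 7-8: per-request context and retrieved knowledge (never cached).
    {"text": context_block},
    {"text": retrieved_chunks},
]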

The non-negotiable tool usage rules

These live verbatim in the system prompt:

1. For any price question, ALWAYS call get_price. Never state or estimate a price.
2. For any order question, ALWAYS call get_order_status or check_stock.
3. For return/refund questions, ALWAYS call answer_faq or check_refund_eligibility.
4. For recommendations, ALWAYS call get_recommendations or get_similar_titles.
5. For trending content, ALWAYS call get_trending or get_new_releases.
6. ALWAYS cite the tool source when answering policy questions.
7. If a tool returns escalation_suggested=true, ALWAYS call escalate_to_agent next.

These exist because, in early evaluations, Claude would occasionally answer policy questions from training data ("Amazon's return window is typically 30 days...") without calling the tool. That's a hallucination risk. The ALWAYS call X pattern is blunt but works — it makes the tool call mandatory regardless of how confident the model feels.


Tool dispatch mechanics

Parallel dispatch

When a request has multiple independent intents, the Orchestrator emits multiple tool calls in a single inference turn. The Lambda handler then dispatches them concurrently:

import asyncio

async def dispatch(tool_calls: list[ToolCall]) -> list[ToolResult | BaseException]:
    # Fan out all calls concurrently; exceptions come back as values.
    return await asyncio.gather(
        *[mcp_clients[tc.server].call(tc.tool, tc.input) for tc in tool_calls],
        return_exceptions=True,
    )

return_exceptions=True matters: if one tool raises, the others still succeed and the Orchestrator can still answer with partial information.

Up to 5 tools run concurrently. Past that we throttle — historically 6+ concurrent tool calls have hit MCP server rate limits and pushed P99 over budget.
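
The cap can be enforced with a semaphore around the dispatch from the snippet above; a sketch (constant name hypothetical):

import asyncio

MAX_IN_FLIGHT = 5  # past this, MCP rate limits bite and P99 drifts
_slots = asyncio.Semaphore(MAX_IN_FLIGHT)

async def throttled_call(client, tool: str, payload: dict):
    # gather() still fires everything at once; the semaphore ensures at
    # most five calls are actually in flight at any moment.
    async with _slots:
        return await client.call(tool, payload)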

Sequential chaining

When tool B needs tool A's output, Claude emits them across separate inference turns:

Turn 1 (model thinks): "I need the ASIN first."
  → catalog_mcp.search_manga("Naruto vol 3")
  ← returns ASIN B07X1234

Turn 2 (model thinks): "Now I can get sentiment."
  → review_mcp.get_sentiment_summary(asin="B07X1234")
  ← returns summary

Note this is emergent — there is no chain_tools() function in the codebase. The model decides multi-turn dispatch based on the structure of the user's question. This is also why tool descriptions matter so much: the description has to make the dependency visible to the model.
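
Mechanically, the "chain" is just the outer loop re-invoking the model until it stops asking for tools. A minimal sketch of that loop (synchronous for clarity; execute_tool_calls is a hypothetical helper):

def agent_loop(bedrock, model_id: str, system: list, messages: list, tool_config: dict):
    while True:
        resp = bedrock.converse(modelId=model_id, system=system,
                                messages=messages, toolConfig=tool_config)
        messages.append(resp["output"]["message"])
        if resp["stopReason"] != "tool_use":
            return resp  # final answer; the chain ended itself
        # Execute the requested tools and hand the results back; the model
        # decides whether another hop (turn 2, 3, ...) is needed.
        results = execute_tool_calls(resp["output"]["message"]["content"])
        messages.append({"role": "user", "content": results})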


State management

The Orchestrator touches four layers of state, but directly owns only three:

Layer              | Owner                     | Store             | TTL         | Purpose
-------------------|---------------------------|-------------------|-------------|---------------------------------------------------
Working memory     | Orchestrator (in-process) | Lambda RAM        | Per request | Tool results buffered during the agentic loop
Conversation state | Orchestrator              | ElastiCache Redis | 30 min      | Recent turns, active intent, pending tool results
Session state      | Orchestrator              | DynamoDB          | 24 hours    | Full turn history, summary, extracted entities
User profile       | UserService (external)    | DynamoDB          | Permanent   | Genres, purchase history, preferences

The Orchestrator reads the user profile but never writes to it directly — that's the UserService's domain. This separation prevents the chatbot from corrupting profile data on a bad inference.

See 08-memory-architecture.md for the full memory deep dive.


Stopping conditions

Without explicit stops, an LLM agent can loop forever. The Orchestrator has three independent kill switches, any of which terminates the loop:

Switch             | Threshold                            | What happens on trip
-------------------|--------------------------------------|----------------------------------------------------------
Iteration limit    | 10 tool-call rounds per user message | Return best-effort answer with disclaimer
Wall-clock timeout | 8 seconds for the full agentic loop  | Cancel pending tool calls, return template fallback
Token budget       | Hard cap on context window growth    | Trigger summarization mid-loop, drop oldest tool results

The 8-second timeout is the user-facing SLA's safety net. The 3-second P99 target is the goal; the 8-second cap is what guarantees we never wedge a websocket waiting for a runaway model.
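
Wiring the first two switches around the loop takes only a few lines; a sketch with hypothetical fallback helpers (the token budget check would live inside run_iteration):

import asyncio

async def bounded_loop(run_iteration, max_iters: int = 10, wall_clock_s: float = 8.0):
    async def run():
        for i in range(max_iters):
            done, answer = await run_iteration(i)
            if done:
                return answer
        return best_effort_answer()          # switch 1: iteration limit
    try:
        # wait_for cancels the in-flight task, which cancels pending tool calls.
        return await asyncio.wait_for(run(), timeout=wall_clock_s)
    except asyncio.TimeoutError:
        return template_fallback()           # switch 2: wall-clock timeout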


Latency budget

End-to-end target: P99 < 3 seconds, first token at ≈ 950ms.

Total budget: 3000 ms
├── API Gateway + Lambda cold path        ≈  100 ms
├── Initialize (Redis/Dynamo loads)       ≈   50 ms
├── First Claude inference (planning)     ≈  400 ms  ← first token here-ish
├── Tool dispatch (parallel, 1 round)     ≈  800 ms
├── Second Claude inference (synthesis)   ≈  500 ms
├── Guardrail validation                  ≈   50 ms
├── Streaming finish + persist            ≈  100 ms
└── Buffer                                ≈ 1000 ms

The 1-second buffer is real and intentional. Bedrock latency has occasional spikes; tools occasionally cold-start. Without a buffer you'd see frequent SLA breaches even when nothing is "wrong."


How the Orchestrator fails (and what we do about it)

Failure                        | Detection                     | Recovery
-------------------------------|-------------------------------|------------------------------------------------------------
Bedrock timeout / 5xx          | Inference returns error       | Single retry with reduced max_tokens; then template
Tool returns malformed JSON    | JSON parse fails              | Drop that tool result, continue with others
Tool returns nothing           | Result body is an empty list  | MCP-side fallback already triggered; pass through to Claude
Guardrail rejection (bad ASIN) | Post-generation check fails   | Regenerate once with stricter system prompt; then template
Loop iteration overrun         | Counter exceeds 10            | Return best-effort answer + escalation offer
Wall-clock timeout             | 8s exceeded                   | Cancel via asyncio.wait_for, return template
Model confidence drop          | Claude reports low confidence | Trigger escalation flow

The pattern: never fail silently. Either return a useful answer or explicitly hand off. The worst outcome is a confidently wrong answer; the second-worst is a blank screen.

See 07-failure-handling.md for the full failure taxonomy and circuit-breaker design.


Why this shape — design tradeoffs considered

Alternative considered             | Why we rejected it
-----------------------------------|----------------------------------------------------------------------------------
Hardcoded routing in Python        | Couples application logic to LLM behavior; every new tool requires a code deploy
Multi-agent debate (autogen-style) | 3–5x latency, far higher cost, marginal quality gain for this use case
Single mega-agent (no sub-agents)  | Bigger blast radius; harder to scale individual domains
LangChain agent framework          | Too much abstraction; harder to debug; we wanted full control over the loop
ReAct without explicit stops       | Loops on ambiguous questions; latency unpredictable

The chosen design is "supervisor with bounded specialists" because it matches the failure shape we actually see in production: most issues are domain-local (a stale catalog, a flaky review service), and we want them contained.


Validation: Constraint Sanity Check

The numbers above are the targets in the architecture. This section is a brutal pass over them: which ones survive contact with reality, which ones are aggressive-but-doable with engineering effort, and which ones are flatly inconsistent with each other or with the underlying systems. Without this section, the document is aspirational marketing.

Verdicts

  • First token at ~950ms (verdict: aggressive). Claude Sonnet TTFT on Bedrock is typically 400–1500ms warm; with a 4–6K token system prompt, 950ms is plausible only with the prompt cache hot. Cold cache: 1.5–2s.
  • End-to-end P99 < 3s (verdict: inconsistent with itself). Component means sum to ~2s; add the 1s buffer and you get exactly 3s. But the P99 of a sum is not the sum of component means. Realistic P99 with the current components: 3.5–4.5s.
  • 8s wall-clock vs 10-iteration limit (verdict: mathematically incompatible). Each iteration ≈ Claude inference (500–1000ms) + tool dispatch (300–800ms) ≈ 1.5s, so ten iterations need ~15s. The 10-iteration cap is unreachable before the 8s timeout fires; one of the two is dead code.
  • Tool dispatch 800ms, parallel, one round (verdict: reasonable). Parallel dispatch latency is the max of the slowest tool in the batch, not the sum. 800ms holds if the individual MCP P99s hold.
  • Cache hit rate > 85% (verdict: conditional). Bedrock's prompt cache TTL is ~5 min, so 85% is achievable during steady traffic. During off-peak hours, cold starts, and quiet hours, the hit rate collapses to 30–50%. The metric should be tiered by traffic level, not quoted as a single number.
  • Guardrail validation 50ms (verdict: too tight if any check is networked). An ASIN existence check via DynamoDB GetItem runs 10–30ms P99, which is fine. Link validation via HTTP HEAD runs 50–500ms and is variable, which blows the budget. Either keep all checks in-memory/cached, or expand the budget to 150ms.
  • Initialize 50ms (verdict: realistic). Redis GET (~2ms) + DynamoDB GetItem (~15ms P99) + payload assembly (~10ms) ≈ 30–40ms. 50ms holds.
  • 5 concurrent tool calls (verdict: fine on the orchestrator; watch MCP fan-out). Lambda handles 5 concurrent async calls trivially. The risk is fan-out to a single MCP server (e.g., 5 simultaneous catalog calls), where per-MCP rate limits become the bottleneck, not the orchestrator.
  • Synthesis (second inference) at 500ms (verdict: aggressive). Tool results add 1–3K tokens to the context; Sonnet TTFT on the longer prompt is 500–900ms, and the full streamed response takes longer still. 500ms covers first token, not finish.
  • Streaming finish + persist 100ms (verdict: realistic for persist, ambiguous for streaming). DynamoDB PutItem P99 is ~50–80ms. But "streaming finish" depends on response length, not infrastructure: a 200-token answer takes ~1s to stream on its own. The 100ms here covers only the post-stream persist.

The biggest lie: end-to-end P99 < 3s

The latency budget table sums components to 3000ms total. That math works for the mean, not the 99th percentile. A correct calculation:

Component         | P50    | P99
------------------|--------|-------
API GW + Lambda   | 50ms   | 200ms
Initialize        | 30ms   | 80ms
Plan inference    | 600ms  | 1500ms
Tool dispatch     | 400ms  | 1200ms
Synthesis         | 700ms  | 1800ms
Guardrails        | 30ms   | 150ms
Persist + stream  | 80ms   | 300ms
------------------|--------|-------
Sum P50           | 1890ms |
Sum P99           |        | 5230ms ← naive worst case: the sum of P99s, which overstates (see below)

The P99 of a sum is not the sum of P99s (that overstates), but it's also not the sum of medians (that understates). The realistic P99 lands somewhere between — call it 3.5–4.5s — which is over budget. The 3-second target is achievable at P95, maybe P97, not P99.
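
The cheapest way to see where the true P99 lands is to simulate it. A toy Monte Carlo, fitting a lognormal to each component's (P50, P99) from the table above:

import math
import random

# (P50 ms, P99 ms) per component, taken from the table above.
COMPONENTS = [(50, 200), (30, 80), (600, 1500), (400, 1200),
              (700, 1800), (30, 150), (80, 300)]
Z99 = 2.326  # standard-normal 99th-percentile z-score

def sample_ms(p50: float, p99: float) -> float:
    # Lognormal with median p50 and 99th percentile p99.
    mu = math.log(p50)
    sigma = (math.log(p99) - math.log(p50)) / Z99
    return random.lognormvariate(mu, sigma)

totals = sorted(sum(sample_ms(*c) for c in COMPONENTS) for _ in range(100_000))
print(f"P50 ~ {totals[50_000]:.0f} ms, P99 ~ {totals[99_000]:.0f} ms")
# Lands between the sum of medians (1890 ms) and the sum of P99s (5230 ms).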

What to do: either (a) restate the SLA as P95 < 3s, P99 < 5s — the honest version; or (b) invest in latency reduction on the two heaviest components (Plan + Synthesis inferences) via smaller models for routing, more aggressive caching, speculative decoding. The current document promises the 3-second P99 while doing neither.

The math problem with the loop

The 10-iteration cap and 8-second wall-clock are documented side by side as if they're complementary. They're not.

Per-iteration cost ≈ Claude inference (500–1000ms) + tool call (300–800ms) ≈ 800–1800ms
10 iterations needed → 8–18 seconds

Within the 8-second wall-clock, only 4–5 iterations are reachable. The 10-iteration cap is therefore decorative — the wall-clock kills you first. Either the iteration cap should be lowered to ~4 (matching reality), or the wall-clock raised to ~15s (which breaks the 3s P99 SLA conversation entirely).

The honest design would be: max 4 iterations, 6s wall-clock, with the agent biased toward parallel dispatch (which compresses iterations). Document the bias and the limits will be consistent.

Cache hit rate is a window function, not a number

Quoting "> 85% cache hit rate" as a single number hides the distribution. Bedrock prompt caching has a 5-minute TTL. Behavior:

  • Steady high traffic (>1 req/sec sustained): cache stays hot, hit rate > 90%
  • Bursty traffic (gaps > 5 min): each burst pays a cache miss; hit rate 40–70%
  • Cold start hours (e.g., overnight regional traffic): hit rate < 30%

The right metric is hit rate during business hours (steady) and cost amortization (steady-state cache savings vs. cold-start cost). A flat 85% target is unfalsifiable until you condition on traffic regime.
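
Even a toy arrival model shows the cliff. The sketch below assumes Poisson arrivals and counts a hit whenever the previous request landed inside the TTL window; real traffic is burstier, so real hit rates sit below these numbers at the same average rate:

import random

def hit_rate(req_per_min: float, ttl_s: float = 300.0, n: int = 50_000) -> float:
    t, last, hits = 0.0, None, 0
    for _ in range(n):
        t += random.expovariate(req_per_min / 60.0)  # Poisson inter-arrivals
        if last is not None and t - last <= ttl_s:
            hits += 1                                # previous request kept the cache warm
        last = t
    return hits / n

for rpm in (60, 0.2, 0.05):  # steady traffic, one-per-TTL, overnight
    print(f"{rpm:>6} req/min -> hit rate ~ {hit_rate(rpm):.0%}")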

What this section is not

This is not a takedown of the architecture — the architecture is sound. It is a takedown of the specific numbers without their conditions. Every architecture doc has this problem: targets are written in optimistic mode and never re-validated against measured behavior. Including this section in every agent dive forces the question: is this number from a load test, or from a whiteboard?

If it's from a whiteboard, mark it. If it's from a load test, link the test.


Observability

Every Orchestrator invocation emits:

  • CloudWatch metrics — latency P50/P95/P99, tokens in/out, tools called, cache hit rate, guardrail rejections, escalation rate
  • X-Ray trace — every tool call is a sub-segment; the full flow is visible in the service map
  • Structured logs — request ID, session ID, intent, tools dispatched, final answer hash
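
Emitting these from the Lambda is one put_metric_data call per invocation; a sketch, with a hypothetical namespace and metric names:

import boto3

cloudwatch = boto3.client("cloudwatch")

def emit_metrics(latency_ms: float, tools_called: int, cache_hit: bool):
    cloudwatch.put_metric_data(
        Namespace="MangaAssist/Orchestrator",  # hypothetical namespace
        MetricData=[
            {"MetricName": "EndToEndLatency", "Value": latency_ms, "Unit": "Milliseconds"},
            {"MetricName": "ToolsCalled", "Value": tools_called, "Unit": "Count"},
            {"MetricName": "PromptCacheHit", "Value": 1.0 if cache_hit else 0.0, "Unit": "Count"},
        ],
    )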

The single most useful metric in practice has been cache hit rate. When it drops below 80%, something has changed in the system prompt or context block — it's an early warning that latency and cost are about to spike.