06 — Tool Dispatch & Routing

The router is not code. The router is Claude reading tool descriptions. This document is about what that means in practice.

This is the most consequential design decision in MangaAssist, and the easiest to get wrong. There is no route(intent) function in the codebase. There is no decision tree, no rules engine, no classifier with branching logic. The routing layer is a paragraph of English per tool, read by an LLM at every inference, and that paragraph is the entire API surface between user intent and downstream system behavior.


The central principle

Tool descriptions are the routing logic. A poorly written tool description is a misrouted request.

The Orchestrator (Claude 3.5 Sonnet on Bedrock) is given a list of tools at inference time. Each tool has:

  - A name (snake_case identifier)
  - A description (free-form English, ~50–150 words)
  - An input_schema (JSON Schema for parameters)

Claude reads all of them at every turn and picks zero, one, or many tools based on the user's message. The picking logic lives entirely in the model's interpretation of the descriptions.
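For concreteness, a single manifest entry might look like the sketch below. The field names follow the tool-use format described here (name, description, input_schema); the parameter names and values are illustrative, not the production definitions.

search_manga_tool = {
    "name": "search_manga",
    "description": (
        "Search the JP manga catalog for titles by name, author, or genre. "
        "Returns up to 3 manga with ASIN, title, edition, language, and price. "
        "Use this for product discovery questions. Do NOT use for general manga "
        "trivia or for order status (use get_order_status)."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Free-text search query"},
            "genre": {"type": "string", "description": "Optional genre filter"},
        },
        "required": ["query"],
    },
}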

This means: changing a tool's description changes routing behavior in production, with no code deploy, no test suite firing, and no review beyond whoever edited the prompt. That's the whole power and the whole danger.


Tool description engineering

What makes a good description

A description has to do four jobs:

  1. Anchor the use case — when should Claude call this?
  2. Disambiguate from siblings — what should Claude not confuse it with?
  3. Set expectations on output — what shape does the result take?
  4. Encode constraints — when is it inappropriate to call?

GOOD: "Search the JP manga catalog for titles by name, author, or genre.
       Returns up to 3 manga with ASIN, title, edition, language, and price.
       Use this for product discovery questions like 'recommend dark fantasy'
       or 'find Berserk volume 42'. Do NOT use for general manga trivia
       (no catalog data is consulted) or for order status (use get_order_status)."

BAD:  "Search products."

The bad version routes randomly because Claude has nothing to disambiguate with. The good version embeds positive examples ("recommend dark fantasy"), negative examples ("general manga trivia"), and pointers to siblings ("use get_order_status").

Specific failure modes from underspecified descriptions

Observed during eval:

Pattern | Symptom | Fix
Two tools with similar verbs | Random selection between find_manga and search_manga | Merge or hard-disambiguate
Tool description starts with the same phrase as another | Claude conflates them | Rewrite openings
Negative case missing | Tool gets called for unrelated intents | Add "Do NOT use for..."
Output shape ambiguous | Claude calls correctly but mis-formats response | Specify return schema in prose
No locale awareness | Tool gets called for queries it can't handle | Add region/locale gate explicitly

Single-tool dispatch

The simplest case. User asks one thing, Claude picks one tool, dispatches, synthesizes, replies.

User: "What's the return policy?"
   ↓
Claude reads tool list
   ↓
Selects: get_return_policy(region="JP", category="manga")
   ↓
Tool returns canonical policy chunk
   ↓
Claude synthesizes answer with citation
   ↓
Streamed to user

The "Claude reads tool list" step costs the equivalent of ~2K–5K tokens of context per turn. That's mostly tool descriptions, and it's why prompt caching is mandatory, not optional: those tokens never change between requests, so they should hit the cache 100% of the time.
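The end-to-end flow above is a simple loop: run one inference with the tool manifest, dispatch whatever tool calls come back, feed the results in as tool results, and repeat until Claude answers directly. A minimal sketch, assuming hypothetical helpers call_claude (one Bedrock inference), assistant_message, and tool_result_message (message packing), a TOOL_MANIFEST constant, and an assumed response shape; it reuses dispatch_parallel from the next section and omits streaming and the wall-clock cap.

async def handle_turn(messages: list[dict]) -> str:
    # Repeat inference until Claude stops requesting tools and produces a final answer.
    while True:
        response = await call_claude(messages, tools=TOOL_MANIFEST)       # one Bedrock round-trip
        tool_calls = [b for b in response.content if b.type == "tool_use"]
        if not tool_calls:
            return response.text                                          # final answer, no tool needed
        results = await dispatch_parallel(tool_calls)
        messages.append(assistant_message(response))                      # the tool_use blocks
        messages.append(tool_result_message(tool_calls, results))         # fed into the next inference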


Multi-tool parallel dispatch

When a user message has multiple independent intents, Claude emits multiple tool calls in a single inference response. The Lambda handler dispatches them with asyncio.gather:

import asyncio

async def dispatch_parallel(tool_calls: list[ToolCall]) -> list[ToolResult]:
    # mcp_clients maps an MCP server name to its client; every call in the batch runs concurrently.
    return await asyncio.gather(
        *[mcp_clients[tc.server].call(tc.tool, tc.input) for tc in tool_calls],
        return_exceptions=True,
    )

Three things matter:

  1. return_exceptions=True — one tool failure doesn't take down the others. The Orchestrator sees a ToolError for the failed one and partial results for the rest.

  2. Up to 5 concurrent tools. Past 5, we throttle (a sketch of the throttle follows this list). Reasons: per-MCP rate limits, Lambda connection pool, Bedrock TPS quotas.

  3. The decision to parallelize is Claude's, not ours. We don't tell the model "use parallel dispatch when you can." We trust the model to emit independent tool calls together when it sees multiple intents.
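The throttle in item 2 can be a plain semaphore around each call. A sketch reusing mcp_clients and the types from dispatch_parallel above; the per-MCP sub-limit flagged later in the validation section would need a second, per-server semaphore.

import asyncio

MAX_CONCURRENT_TOOLS = 5
_tool_slots = asyncio.Semaphore(MAX_CONCURRENT_TOOLS)

async def dispatch_throttled(tool_calls: list[ToolCall]) -> list[ToolResult]:
    async def one(tc: ToolCall) -> ToolResult:
        async with _tool_slots:                       # at most 5 tool calls in flight at once
            return await mcp_clients[tc.server].call(tc.tool, tc.input)

    return await asyncio.gather(*[one(tc) for tc in tool_calls], return_exceptions=True)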

Example trace:

User: "Show me Naruto vol 3, is it in stock, and what do reviews say?"
The naive expectation is 3 tool calls in one response:
  ├── catalog_mcp.search_manga(query="Naruto vol 3")
  ├── order_mcp.check_stock(...)  // Note: needs ASIN
  └── review_mcp.get_sentiment_summary(...)  // Note: needs ASIN

But wait — check_stock and get_sentiment_summary both need an ASIN that comes from search_manga. So they can't run in parallel. Claude handles this correctly by emitting only the search first, then dispatching the dependent ones in turn 2. This isn't programmed; it's emergent from the model understanding the data flow.


Sequential chaining

When tools have dependencies, Claude chains them across multiple inference turns:

Turn 1 inference:
  Claude thinks: "I need the ASIN before I can check stock or reviews."
  Emits: catalog_mcp.search_manga(query="Naruto vol 3")
  ←  Returns: { asin: "B07X1234", title: "Naruto Vol 3", ... }

Turn 2 inference:
  Claude thinks: "Now I have the ASIN, I can fan out."
  Emits in parallel:
    ├── order_mcp.check_stock(asin="B07X1234")
    └── review_mcp.get_sentiment_summary(asin="B07X1234")
  ←  Returns both results

Turn 3 inference:
  Claude synthesizes final answer.

Each "turn" is a separate Bedrock inference call. Three turns = three round-trips. This is why the 8-second wall-clock cap is tight: a multi-turn sequential chain can easily eat 5 seconds of inference time before the user sees a final token.
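Rough arithmetic with illustrative per-step latencies (assumed figures, not measurements) shows how quickly the budget disappears:

3 inference turns   × ~1.3 s ≈ 3.9 s
2 tool dispatches   × ~0.4 s ≈ 0.8 s
                       total ≈ 4.7 s of the 8 s wall-clock cap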


Routing accuracy: how do we know it works?

This is the question every architecture review asks and every team answers vaguely.

The eval set

We maintain a labeled set of ~2,000 user queries paired with the expected tool call(s). Categories:

  - Single-intent (one tool expected)
  - Multi-intent (multiple tools expected)
  - Trick cases (wording suggests one tool but data needs another)
  - Out-of-scope (no tool should be called; Claude should respond directly)

Each tool description change is evaluated against this set. Pass criteria:

  - ≥ 95% accuracy on single-intent
  - ≥ 88% accuracy on multi-intent
  - ≥ 92% on trick cases
  - 100% on out-of-scope (no false tool calls)
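A sketch of what the pass check might look like, assuming each eval record carries a category label, the expected tool name(s), and the tool call(s) the Orchestrator actually emitted; the record fields and harness are hypothetical, the thresholds mirror the criteria above.

THRESHOLDS = {"single": 0.95, "multi": 0.88, "trick": 0.92, "out_of_scope": 1.00}

def category_accuracy(records: list[dict]) -> dict[str, float]:
    # A record counts as correct only if the emitted tool set exactly matches the expected set.
    hits: dict[str, list[bool]] = {}
    for r in records:
        hits.setdefault(r["category"], []).append(
            set(r["expected_tools"]) == set(r["emitted_tools"])
        )
    return {cat: sum(v) / len(v) for cat, v in hits.items()}

def passes(records: list[dict]) -> bool:
    accuracy = category_accuracy(records)
    return all(accuracy.get(cat, 0.0) >= bar for cat, bar in THRESHOLDS.items())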

Drift detection

In production, every tool call is logged with the user message that triggered it. A nightly job samples 100 calls per tool and asks Claude (a separate evaluator instance with the eval prompt) whether the call was appropriate. Disagreements get human-reviewed.

What we can't measure

We don't have a ground truth for "what tool should have been called" in production. We only know what was called and whether it succeeded. Drift in routing quality often shows up indirectly: rising guardrail rejections, rising escalation rate, rising "I don't know" responses.


Why this design

Alternative | Why we rejected it
Hardcoded if intent == X: call Y | New tool requires code deploy; brittle to phrasing variation
Separate intent classifier (BERT) before tool dispatch | Two models means two failure modes; classifier drift independent of tool drift; double maintenance
Function-calling without descriptions (just names) | Names alone are too sparse; routing accuracy collapses
Multi-step planner LLM that emits a tool plan first | 2x latency, marginal accuracy gain on this scope of tools
Embeddings-based tool retrieval (top-K tool descriptions injected) | Useful at 100+ tools; overkill at 15

The choice is right-sized for the current tool count (~15). It would not scale to 500 tools — at that point, embeddings-based tool retrieval becomes necessary.
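For reference, the retrieval alternative is mechanically simple: embed every tool description once, embed the user query per request, and inject only the top-K tool definitions into the prompt. A sketch, assuming a hypothetical embed() helper (for example a Bedrock embedding model) that returns a unit-normalized vector.

import numpy as np

def build_tool_index(tools: list[dict]) -> np.ndarray:
    # One embedding per tool description, computed once and cached alongside the manifest.
    return np.stack([embed(t["description"]) for t in tools])

def retrieve_tools(query: str, tools: list[dict], index: np.ndarray, k: int = 5) -> list[dict]:
    scores = index @ embed(query)              # cosine similarity, since vectors are unit-normalized
    top = np.argsort(scores)[::-1][:k]
    return [tools[i] for i in top]             # only these K descriptions go into the prompt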


Validation: Constraint Sanity Check

Claimed metric | Verdict | Why
Routing accuracy ≥ 95% on single-intent | Plausible, eval-dependent | A 2,000-example eval set is small for 15 tools across many phrasings. Real production routing accuracy depends on input distribution coverage. Eval set should be larger and continuously refreshed from production samples.
Routing accuracy ≥ 88% on multi-intent | Aggressive | Multi-intent routing requires Claude to emit multiple tool calls and get all of them right. Joint accuracy compounds — if single-tool accuracy is 95% and a query has 3 tools, joint accuracy is ≈ 86%. The 88% target is at the edge.
Tool descriptions cached (tokens reused per request) | Mandatory but easy to break | Any change to the tool list mid-day busts the cache. Cache hit rate drops sharply during deploys. Should be measured continuously, not assumed.
Up to 5 concurrent tool calls | Right ceiling, wrong constraint shown | The actual binding constraint is per-MCP rate limit (e.g., 50 QPS for catalog, 20 QPS for graph). 5 concurrent calls to the same MCP is a real risk during traffic bursts. The doc says "5 total" — should be "5 total and ≤ 2 to any single MCP."
8s wall-clock for multi-turn chains | Tight; see Orchestrator dive | Multi-turn chains (3 inferences + 2 tool dispatches) regularly exceed 5s. The 8s cap is the hard kill, but P99 of multi-turn flows is already 4–6s, leaving little margin.
"Decision to parallelize is Claude's" | True but unstable across model versions | Claude 3.5 Sonnet handles parallel dispatch well. Model upgrades (4.x, etc.) can change this behavior — needs regression testing on the eval set after every model change.
Nightly eval samples 100 calls per tool | Insufficient for low-volume tools | A tool called 50 times/day will have its sample dominated by a single user's queries. Sampling rate should be proportional to call volume and phrasing diversity.
Drift detection via "evaluator Claude" | Self-grading risk | Using Claude to evaluate Claude on the same task introduces correlated errors. If the model misroutes systematically, the evaluator may also miss it. A human-graded gold set is the correct ground truth, not a Claude evaluator.
Out-of-scope queries get 100% no-tool-call | Aspirational | "Out of scope" is fuzzy. "What's a manga?" — trivia, no tool. "What's a popular manga?" — should call trending. Boundary cases are exactly where Claude is wrong most often. 100% accuracy here is implausible.

The biggest issue: routing changes have no test gate

Tool descriptions are prose. Anyone can edit them. Editing them changes production routing behavior. There is no gate that says "this description change degraded routing accuracy by 4% on the eval set, do not deploy."

What's needed:

  1. Pre-deploy eval gate — every tool description change triggers a CI run against the eval set; deploy blocked if regression > 1% (a sketch of this check follows below).
  2. Canary deploy — 5% of traffic uses new descriptions; compare guardrail rejection rate vs. control.
  3. Versioned tool descriptions — descriptions are versioned in code, not edited as live config.
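Item 1 reduced to code, assuming the per-category accuracy harness sketched earlier and a stored baseline; the names and the 1-point threshold are illustrative.

MAX_REGRESSION = 0.01   # block the deploy if any category drops by more than 1 point

def description_change_gate(baseline: dict[str, float], candidate: dict[str, float]) -> bool:
    # baseline/candidate map eval category -> accuracy on the labeled set.
    return all(candidate[cat] >= baseline[cat] - MAX_REGRESSION for cat in baseline)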

The current architecture treats tool descriptions like configuration. They're more dangerous than configuration; they're program logic. Nothing in the doc makes this gate explicit.

Prompt cache hit rate is load-bearing for cost, not just latency

Tool descriptions are 2K–5K tokens. Across millions of requests, those tokens are mostly cache hits — but only if descriptions don't change. Common pitfalls:

  - Adding a new tool re-emits the manifest, busting the cache
  - Editing a description for any reason busts the cache
  - A/B-testing two description variants halves the cache hit rate

Each of these has a real cost. The doc treats cache as a "nice-to-have"; the system can't economically run without it.
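Mechanically, caching the manifest is one marker on the last tool definition. A sketch using the Anthropic Messages API convention (the AnthropicBedrock client exposes the same interface); client, MODEL_ID, TOOL_MANIFEST, and messages are assumed to exist, and the exact caching knobs depend on which Bedrock API flavor is in use.

tools = list(TOOL_MANIFEST)
# Mark the end of the static tool manifest as a cache breakpoint: everything up to and
# including this block is reused across requests as long as no description changes.
tools[-1] = {**tools[-1], "cache_control": {"type": "ephemeral"}}

response = client.messages.create(
    model=MODEL_ID,
    max_tokens=1024,
    tools=tools,
    messages=messages,
)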

Eval set size is too small for production drift

2,000 examples works out to ~130 per tool on average. For tools with broad input (catalog search across 5M titles in multiple languages), this is woefully thin. Drift detection should pull from production samples weekly, not rely on a frozen eval set.

Multi-intent compounds errors

The "88% multi-intent accuracy" target is the right metric to call out, but the math is sobering:

3-tool query joint accuracy ≈ p(tool1_correct) × p(tool2_correct) × p(tool3_correct)
                            ≈ 0.95 × 0.95 × 0.95
                            ≈ 0.857

So hitting 88% on a three-tool query requires roughly 96% per-tool accuracy. The numbers as quoted are barely consistent: if single-intent accuracy is really 95%, three-tool joint accuracy lands near 86%, not 88%.

"Self-grading" evaluator Claude is a red flag

Using Claude to grade Claude's routing introduces a critical confound: if both models share the same blind spots (same training, same biases), the evaluator agrees with the production model on its mistakes. The evaluator's score will be inflated relative to a human grader.

The cheap fix: every nightly eval run, sample 20 of the "passes" and have a human re-grade them. Track human-grader agreement rate. If it falls below 90%, the evaluator is unreliable and the routing accuracy numbers are fiction.
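The agreement check itself is a few lines. A sketch, assuming each sampled record carries both the evaluator's verdict and the human re-grade (field names are hypothetical):

MIN_AGREEMENT = 0.90

def evaluator_agreement(sampled: list[dict]) -> float:
    # Fraction of the nightly human-regraded sample where evaluator and human agree.
    agreed = sum(1 for r in sampled if r["evaluator_verdict"] == r["human_verdict"])
    return agreed / len(sampled)

def evaluator_trustworthy(sampled: list[dict]) -> bool:
    return evaluator_agreement(sampled) >= MIN_AGREEMENT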