Constraint Scenario 06 — Tool Count Explodes 7 → 70
Trigger: MangaAssist's roadmap commits to expanding from 7 MCPs to 70 by year-end (audio-version search, manga reading-progress sync, fan-art recommendation, comic-style transfer, voice cloning for character lines, etc.). Same agent, same evals, 10× the registry. Pillars stressed: P3 (LLD/registry) primarily; P1 (planner architecture) secondary.
TL;DR
You cannot stuff 70 tool descriptions into the planner's system prompt. The architecture pivot is two-stage tool selection — a lightweight router retrieves the top-K candidate tools per turn, the planner sees only those K. The skill registry and contract layer (story 02) were built for this; the planner is what changes. Cost goes up modestly; quality holds; the registry becomes the hardest-working artifact in the system.
The change
| Metric | Today (7 tools) | Year-end (70 tools) |
|---|---|---|
| Tools in registry | 7 | 70 |
| Sub-agents in registry | 4 | ~25 |
| Planner system prompt size | 4.5K tokens | 38K tokens if naively included (above context budget) |
| Avg tool calls per turn | 2.1 | 2.4 (slight increase, more nuanced answers) |
| Planner cost/turn | $0.0040 | ~$0.044 if naive; $0.0042 with router |
| Eval set size for tool-coverage | 350 | 3500 |
The 38K-token planner prompt is the showstopper. We solve it with a router.
The two-stage selection architecture
```mermaid
flowchart TB
    Turn[User turn] --> Router[Skill Router]
    Router --> Reg[(Skill registry)]
    Reg --> Embed[Tool embeddings]
    Embed --> TopK[Top-K candidates - K typically 8]
    TopK --> Planner[Planner LLM with K tool contracts only]
    Planner --> Dispatch[Tool dispatch]
    Dispatch --> Tools[Selected tools]
    Tools --> Compose[Compose answer]
    classDef new fill:#fef3c7,stroke:#92400e
    class Router,Embed,TopK new
```
The router
- A tiny model (Haiku, or an embedding-only retriever) that, given the user turn plus recent context, retrieves the top-K most relevant tools.
- Inputs: user message, last-3-turns digest, user_jurisdiction, user_tier.
- Outputs: ranked list of K tool_ids.
- Cost: ~$0.0001 per turn. Latency: ~50ms.
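A minimal sketch of the retrieval step, assuming precomputed embedding vectors per tool; the tool names and vectors here are hypothetical, and production would embed the user message plus the last-3-turns digest:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def route(turn_vec, tool_index, k=8):
    """Rank every registered tool by similarity to the turn embedding
    and return the top-K tool_ids for the planner prompt."""
    ranked = sorted(tool_index, key=lambda t: cosine(turn_vec, tool_index[t]), reverse=True)
    return ranked[:k]
```

The planner prompt then includes only the K returned contracts, so prompt size stays flat as the registry grows.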
Tool embeddings
- Each tool's contract description (from story 02) is embedded.
- Updates to the contract trigger re-embedding via CI.
- Stored in a dedicated retrieval index (we use OpenSearch); refresh < 5 min after a contract change.
The K parameter
- K=8 by default — fits the planner prompt at ~6K tokens.
- K=12 for complex cases (the router is itself capable of saying "I think you need more options").
- K varies per surface: voice gets K=4 (latency budget), chat K=8.
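A small sketch of the per-surface K lookup described above; the surface names mirror the text, and the escalation path to K=12 is the router's "I need more options" signal:

```python
# Per-surface defaults from the text: voice K=4 (latency budget), chat K=8.
SURFACE_K = {"voice": 4, "chat": 8}

def k_for_turn(surface, router_requests_more=False):
    """Resolve K for this turn; the router may escalate to K=12 for complex cases."""
    if router_requests_more:
        return 12
    return SURFACE_K.get(surface, 8)
```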
Escape hatch
- The planner can request additional tools mid-turn ("the K I got don't seem right; show me the registry list of `<intent>` tools"). Used in <2% of turns.
- This avoids the failure mode where the router misclassifies and the planner is stuck.
What the planner now looks like
```yaml
# planner system prompt - structure
- system_role: "You are MangaAssist..."
- behavioral_rules: [safety, tone, locale]
- session_context: {user_tier, jurisdiction, recent_history_digest}
- available_tools: <inserted by router; K=8 contract summaries>
- few_shot_examples: <2-3 per available tool, retrieved>
- escape_hatch: "if you need a tool not listed, emit a 'request_tools' action with criteria"
```
Total: ~6K tokens, regardless of registry size. Stable.
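The assembly step can be sketched as a template render over the K selected contracts; the contract field names (`tool_id`, `summary`, `examples`) are assumptions, not the real schema:

```python
def build_planner_prompt(session, contracts):
    """Assemble the planner system prompt from the K router-selected contracts.
    Prompt size is bounded by K, never by registry size."""
    lines = [
        "system_role: You are MangaAssist...",
        "behavioral_rules: [safety, tone, locale]",
        f"session_context: tier={session['user_tier']} jurisdiction={session['jurisdiction']}",
        "available_tools:",
    ]
    for c in contracts:
        lines.append(f"  - {c['tool_id']}: {c['summary']}")
        for ex in c.get("examples", [])[:3]:  # 2-3 retrieved few-shot examples per tool
            lines.append(f"    example: {ex}")
    lines.append("escape_hatch: if you need a tool not listed, emit a request_tools action with criteria")
    return "\n".join(lines)
```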
The new failure modes
Failure 1 — Router misclassifies, the right tool isn't in the K
- Detection: planner uses escape hatch frequently; or eval shows wrong-tool-used regression.
- Mitigation: per-intent router accuracy tracked; below 92% triggers retraining.
- Backstop: escape hatch keeps user-visible failure rare.
Failure 2 — Two tools have nearly-identical contracts; router can't distinguish
- Detection: per-tool selection rate; if two tools each get ~50% but should be 80/20, that's ambiguity.
- Mitigation: contract review board (story 02 grilling Q6); merge or differentiate.
Failure 3 — A new tool ships but its embedding is stale; never selected
- Detection: tool selection rate = 0 for >24 hours after registration.
- Mitigation: registration includes synthetic test that asks the router "would you select this for X?". Failing that test blocks deploy.
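A hedged sketch of that registration gate, where `router` is any callable returning selected tool_ids and the probe queries come from the tool's contract examples (names here are illustrative):

```python
def registration_gate(router, tool_id, probe_queries, k=8):
    """Synthetic pre-deploy check: the router must select the new tool
    for each canonical probe query; any miss blocks the deploy."""
    misses = [q for q in probe_queries if tool_id not in router(q, k)]
    return (not misses), misses
```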
Failure 4 — Long-tail tools never get traffic, drift in correctness
- Detection: per-tool eval coverage. Tools with <100 calls/day get a synthetic-traffic eval (probe).
- Mitigation: synthetic probes ensure every tool is tested daily even if real traffic doesn't reach it.
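The probe-selection rule above is simple to state in code; threshold and counter names are assumptions:

```python
def long_tail_tools(daily_call_counts, threshold=100):
    """Tools under the real-traffic threshold get a synthetic-traffic probe,
    so every tool is still evaluated daily."""
    return sorted(t for t, n in daily_call_counts.items() if n < threshold)
```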
The contract-as-source-of-truth pattern
With 70 tools, you cannot keep the planner's prompt and the actual tool behaviors in sync by hand. The contract (story 02) becomes the machine-readable spec that drives:
| Artifact | Generated from contract |
|---|---|
| Router embeddings | description + examples |
| Planner few-shot examples | the examples: field of the contract |
| Eval coverage tests | input/output schemas |
| Cost dashboard groupings | feature and domain tags |
| Compliance audit | respects_jurisdiction_policy field |
| Documentation site | full contract rendered as docs |
Editing the contract regenerates all of these. The contract is the spec; everything downstream is derived.
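The fan-out in the table above can be sketched as a single derivation function; the contract field names are hypothetical stand-ins for the story 02 schema:

```python
def derive_artifacts(contract):
    """Regenerate every downstream artifact from the one machine-readable contract."""
    return {
        "router_embedding_text": contract["description"] + " " + " ".join(contract["examples"]),
        "planner_few_shot": contract["examples"][:3],
        "eval_schemas": (contract["input_schema"], contract["output_schema"]),
        "dashboard_tags": (contract["feature"], contract["domain"]),
        "compliance_flag": contract["respects_jurisdiction_policy"],
        "docs_page": f"# {contract['tool_id']}\n\n{contract['description']}",
    }
```

Because everything is derived, a contract edit cannot leave the router, evals, docs, or audit out of sync with each other.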
What the harness contributes
| Harness piece | Role at 70 tools |
|---|---|
| Skill contracts (story 02) | The registry IS the answer to scaling |
| Capability flags (story 02) | Per-tool, per-tier, per-jurisdiction routing |
| Eval framework (story 04) | Per-tool eval scales naturally; synthetic probes for long-tail |
| Observability (story 08) | Per-tool cost / latency / quality dashboards |
| Sub-agent contract (story 03) | Sub-agents scale by the same registry pattern |
| Versioning (story 08) | Per-tool version visible in event log |
The investment in the contract layer at 7 tools paid back at 70.
Q&A drill — opening question
*Q: Why not let the planner LLM scan all 70 contracts? Models have 200K context now.*
Three reasons:
1. Cost. 38K tokens at planner frequency is $0.04/turn extra. That's $96M/year for nothing.
2. Quality. Long contexts dilute attention; the planner's tool-selection accuracy drops measurably as context size grows. The router with K=8 is more accurate than the planner with all 70.
3. Cache invalidation. Any change to any tool's contract invalidates the planner's prompt cache. With the router, only the embedding index changes; the planner cache is stable.
The "long-context model can do anything" argument is technically true and economically wrong.
Grilling — Round 1
Q1. How do you train the router?
The router is mostly retrieval, not training. Initial bootstrap:
- Each tool's contract description + 3-5 example inputs are embedded.
- Top-K is cosine similarity over the user's message + context digest.
Improvements layered on:
- A lightweight LLM re-ranker (Haiku) takes top-20 retrieval and re-ranks to top-K with awareness of user tier / jurisdiction.
- Periodic finetuning of the embedding model on (user_query, correctly_chosen_tools) pairs harvested from production.
We do not train a from-scratch classifier; the retrieval-then-rerank pattern is sufficient.
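The retrieval-then-rerank pattern can be sketched as two stages wired together; `rerank` stands in for the Haiku re-ranker call, which is not shown:

```python
def retrieve_then_rerank(turn_vec, tool_index, rerank, k=8, shortlist=20):
    """Stage 1: cheap similarity retrieval down to a shortlist of ~20.
    Stage 2: an LLM re-ranker (tier/jurisdiction aware) orders the final K."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    ranked = sorted(tool_index, key=lambda t: dot(turn_vec, tool_index[t]), reverse=True)
    return rerank(ranked[:shortlist])[:k]
```

The design point is that the expensive model only ever sees the shortlist, so re-ranker cost is constant regardless of registry size.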
Q2. What about tool dependencies — tool B can only be useful after tool A has run?
Three options:
- Tool composition declared in contract — tool B's contract says "requires output of A as context." The router treats them as a unit if both are relevant.
- Sub-agent encapsulation — the A-then-B pattern becomes a sub-agent, exposed as a single contract.
- Two-step planning — the planner calls A, then re-plans with A's output, then calls B. Latency cost; rare.
Most workflows go through option 2 (sub-agent). Option 1 is rare; option 3 is for genuinely two-phase reasoning.
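Option 1 (declared composition) reduces to a small expansion pass after routing; the `requires` field name is an assumption about the contract schema:

```python
def expand_with_dependencies(selected, contracts):
    """If a selected tool's contract declares requires=[...], pull those
    dependencies into the candidate set so the unit ships to the planner together."""
    out = list(selected)
    for tool_id in selected:
        for dep in contracts[tool_id].get("requires", []):
            if dep not in out:
                out.append(dep)
    return out
```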
Q3. Per-tool eval coverage — how do you keep up at 70 tools?
Each tool owner is responsible for ~50-100 eval examples, so the total eval set grows from 350 to 3,500. We deduplicate where tools share input shapes. CI runs the relevant subset on each PR; the full eval runs nightly. The eval framework (story 04) was built for this scale; the marginal cost per tool is bounded.
Grilling — Round 2 (architect-level)
Q4. What's the cost of contract-as-source-of-truth — i.e., what regenerates and at what cost?
Per-tool change triggers:
- Router embedding update: $0.0001 + index reload (~30 sec).
- Planner few-shot regeneration: free (template rendering).
- Eval coverage test refresh: ~$10 in batch eval cost per tool.
- Documentation site rebuild: free.
- Compliance audit propagation: free (just a metadata update).
Total per-tool change: ~$15 + a few automated steps. With 70 tools and ~3 changes/tool/quarter, ~$3K/quarter in change-management overhead. Trivial.
Q5. The planner's tool selection from K=8 candidates can still be wrong — how do you measure?
We measure two correlated metrics:
- Tool-selection accuracy — the planner's choice vs. the human-judged "right" choice on a labeled set. Today ~94%.
- Outcome quality — the eval rubric on the actual response. This is the bottom line.
A high-accuracy planner with low outcome quality means the right tools were picked but the synthesis was poor (compose-stage issue). A low-accuracy planner with high outcome quality means we got lucky.
We track both because they decouple "did we route well?" from "did we compose well?" — different teams own different fixes.
Q6. Where does this break? At 700 tools?
The two-stage architecture buys a lot of headroom. At 700 tools, the router still gives top-K; cost stays bounded. What stresses:
- Embedding index size (manageable).
- Synthetic probe budget for the long tail (grows linearly).
- Contract review board cadence (becomes a full-time function).
- Tool ambiguity / overlap (more tools, more chances of redundancy).
At 700, we'd likely add a router hierarchy: domain-router selects domain (catalog, social, accounts...), then in-domain router selects K. This is "intent classification → scoped retrieval," same idea, more layers. The contract layer doesn't change.
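The hierarchy can be sketched as two routing levels sharing the same similarity machinery; domain names and centroids here are illustrative:

```python
def hierarchical_route(turn_vec, domain_centroids, per_domain_index, k=8):
    """Two-level routing for ~700 tools: pick the nearest domain centroid,
    then rank only that domain's tools for the top-K."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    domain = max(domain_centroids, key=lambda d: dot(turn_vec, domain_centroids[d]))
    tools = per_domain_index[domain]
    ranked = sorted(tools, key=lambda t: dot(turn_vec, tools[t]), reverse=True)
    return domain, ranked[:k]
```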
The fundamental breakdown happens not at tool count but at agent task complexity — when a single turn naturally requires reasoning over more tools than the planner can effectively compose. That's where you split into multiple specialized agents (each with its own router) and orchestrate them — story 03 territory.
Intuition gained
- At 70+ tools, you cannot stuff the planner. Two-stage selection (router → planner) is the architecture pivot.
- Contract-as-source-of-truth generates everything downstream: embeddings, prompts, eval, docs, audit.
- Synthetic probes cover long-tail tools that real traffic doesn't.
- Capability flags + per-tier filtering keep the registry usable across user tiers.
- The investment at 7 tools (story 02) paid off at 70. This is the LLD payoff scenario.
See also
- 02-foundation-model-deprecated.md — same migration discipline applied to models
- 05-new-compliance-mandate.md — capability flags scale across tools
- User stories 02, 03, 04