
Constraint Scenario 06 — Tool Count Explodes 7 → 70

Trigger: MangaAssist's roadmap commits to expanding from 7 MCPs to 70 by year-end (audio-version search, manga reading-progress sync, fan-art recommendation, comic-style transfer, voice cloning for character lines, etc.). Same agent, same evals, 10× the registry. Pillars stressed: P3 (LLD/registry) primarily; P1 (planner architecture) secondary.


TL;DR

You cannot stuff 70 tool descriptions into the planner's system prompt. The architecture pivot is two-stage tool selection — a lightweight router retrieves the top-K candidate tools per turn, the planner sees only those K. The skill registry and contract layer (story 02) were built for this; the planner is what changes. Cost goes up modestly; quality holds; the registry becomes the hardest-working artifact in the system.


The change

| Metric | Today (7 tools) | Year-end (70 tools) |
|---|---|---|
| Tools in registry | 7 | 70 |
| Sub-agents in registry | 4 | ~25 |
| Planner system prompt size | 4.5K tokens | 38K tokens if naively included (above context budget) |
| Avg tool calls per turn | 2.1 | 2.4 (slight increase, more nuanced answers) |
| Planner cost/turn | $0.0040 | $0.0050 if naive; $0.0042 with router |
| Eval set size for tool coverage | 350 | 3500 |

The 38K-token planner prompt is the showstopper. We solve it with a router.


The two-stage selection architecture

flowchart TB
  Turn[User turn] --> Router[Skill Router]
  Router --> Reg[(Skill registry)]
  Reg --> Embed[Tool embeddings]
  Embed --> TopK[Top-K candidates - K typically 8]
  TopK --> Planner[Planner LLM with K tool contracts only]
  Planner --> Dispatch[Tool dispatch]
  Dispatch --> Tools[Selected tools]
  Tools --> Compose[Compose answer]

  classDef new fill:#fef3c7,stroke:#92400e
  class Router,Embed,TopK new

The router

  • A lightweight retrieval layer (embedding similarity, with an optional Haiku-class re-ranker) that, given the user turn + recent context, returns the top-K most relevant tools; a minimal sketch follows this list.
  • Inputs: user message, last-3-turns digest, user_jurisdiction, user_tier.
  • Outputs: ranked list of K tool_ids.
  • Cost: ~$0.0001 per turn. Latency: ~50ms.
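
A minimal sketch of that retrieval step, assuming the registry has already been loaded into memory as a list of dicts and that `embed()` wraps whatever embedding model the stack uses; the capability-flag filter mirrors story 02's per-tier / per-jurisdiction flags. Names are illustrative, not the production API.

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def route(turn_text, context_digest, user_tier, jurisdiction, registry, embed, k=8):
    """Return the top-K tool_ids for this turn.

    registry entries are derived from each tool's contract (story 02):
    {"tool_id", "embedding", "allowed_tiers", "allowed_jurisdictions"}.
    `embed` maps text -> vector; both are assumed wrappers, not real APIs.
    """
    # Capability-flag filter first: never surface a tool the user can't call.
    candidates = [
        t for t in registry
        if user_tier in t["allowed_tiers"] and jurisdiction in t["allowed_jurisdictions"]
    ]
    query_vec = embed(f"{context_digest}\n{turn_text}")
    scored = sorted(
        ((cosine(query_vec, t["embedding"]), t["tool_id"]) for t in candidates),
        reverse=True,
    )
    return [tool_id for _, tool_id in scored[:k]]
```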

Tool embeddings

  • Each tool's contract description (from story 02) is embedded.
  • Updates to the contract trigger re-embedding via CI (see the sketch after this list).
  • Stored in a dedicated retrieval index (we use OpenSearch); refresh < 5 min after a contract change.
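
A sketch of that CI re-embedding hook, under the assumption that the retrieval store is reached through a thin wrapper; `index.upsert` is a placeholder method name, not the actual OpenSearch client call.

```python
def reembed_contract(contract, embed, index):
    """CI hook: called when a tool contract (story 02) changes.

    contract: parsed contract dict with "tool_id", "description", "examples".
    embed: embedding model wrapper (assumption).
    index: thin wrapper over the retrieval store; `upsert` is a placeholder.
    """
    # Embed the same text the router queries against: description + example inputs.
    text = contract["description"] + "\n" + "\n".join(
        ex["input"] for ex in contract.get("examples", [])
    )
    index.upsert(
        doc_id=contract["tool_id"],
        body={
            "tool_id": contract["tool_id"],
            "embedding": embed(text),
            "allowed_tiers": contract.get("allowed_tiers", ["all"]),
            "allowed_jurisdictions": contract.get("allowed_jurisdictions", ["all"]),
        },
    )
```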

The K parameter

  • K=8 by default — fits the planner prompt at ~6K tokens.
  • K=12 for complex cases (the router is itself capable of saying "I think you need more options").
  • K varies per surface: voice gets K=4 (latency budget), chat K=8; see the config sketch below.
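
The per-surface K selection is just configuration; a sketch, with the surface names and flag shape assumed for illustration:

```python
# Per-surface K, chosen to fit each surface's latency and prompt budget.
K_BY_SURFACE = {"voice": 4, "chat": 8}
K_COMPLEX = 12  # used when the router flags that it needs more options

def pick_k(surface, router_flags):
    if router_flags.get("needs_more_options"):
        return K_COMPLEX
    return K_BY_SURFACE.get(surface, 8)
```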

Escape hatch

  • The planner can request additional tools mid-turn ("the K I got don't seem right; show me the registry list of <intent> tools"); handling is sketched after this list. Used in <2% of turns.
  • This avoids the failure mode where the router misclassifies and the planner is stuck.
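
A sketch of how the dispatcher might handle that escape hatch; the `request_tools` action shape, the `intents` contract field, and `planner.replan` are illustrative names, not the exact production interfaces.

```python
def handle_planner_action(action, registry, planner, k=8):
    """Dispatch-loop fragment for the escape hatch.

    `action` is the planner's structured output; the "request_tools"
    shape shown here is illustrative.
    """
    if action["type"] == "request_tools":
        # The planner thinks the router's K candidates are wrong: look up the
        # registry by declared intent instead of embedding similarity.
        extra = [
            t for t in registry
            if action["intent"] in t.get("intents", [])
        ][:k]
        # Re-invoke the planner with the expanded tool list appended.
        return planner.replan(extra_tools=extra)
    return action  # normal tool call, handled by the regular dispatcher
```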

What the planner now looks like

# planner system prompt - structure
- system_role: "You are MangaAssist..."
- behavioral_rules: [safety, tone, locale]
- session_context: {user_tier, jurisdiction, recent_history_digest}
- available_tools: <inserted by router; K=8 contract summaries>
- few_shot_examples: <2-3 per available tool, retrieved>
- escape_hatch: "if you need a tool not listed, emit a 'request_tools' action with criteria"

Total: ~6K tokens, regardless of registry size. Stable.
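
A sketch of how that prompt gets assembled per turn; the section names match the structure above, while the field names on the contract summaries and few-shots are assumed.

```python
def build_planner_prompt(base_sections, tool_contracts, few_shots):
    """Render the planner system prompt from the router's K selections.

    base_sections: pre-rendered static blocks (role, rules, session context).
    tool_contracts: the K contract summaries returned by the router.
    few_shots: 2-3 retrieved examples per selected tool.
    The result stays ~6K tokens no matter how large the registry grows,
    because only the K selected contracts are ever rendered.
    """
    tool_block = "\n\n".join(c["summary"] for c in tool_contracts)
    example_block = "\n\n".join(few_shots)
    return (
        f"{base_sections}\n\n"
        f"## Available tools\n{tool_block}\n\n"
        f"## Examples\n{example_block}\n\n"
        "If you need a tool not listed, emit a 'request_tools' action with criteria."
    )
```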


The new failure modes

Failure 1 — Router misclassifies, the right tool isn't in the K

  • Detection: planner uses escape hatch frequently; or eval shows wrong-tool-used regression.
  • Mitigation: per-intent router accuracy tracked; below 92% triggers retraining.
  • Backstop: escape hatch keeps user-visible failure rare.

Failure 2 — Two tools have nearly-identical contracts; router can't distinguish

  • Detection: per-tool selection rate; if two tools each get ~50% but should be 80/20, that's ambiguity.
  • Mitigation: contract review board (story 02 grilling Q6); merge or differentiate.

Failure 3 — A new tool ships but its embedding is stale; never selected

  • Detection: tool selection rate = 0 for >24 hours after registration.
  • Mitigation: registration includes a synthetic test that asks the router "would you select this for X?"; failing that test blocks the deploy (see the sketch below).
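
A sketch of that registration gate, assuming a `route(query, k)` callable closed over the live registry and a contract that carries its own example inputs:

```python
def registration_smoke_test(contract, route, k=8):
    """Deploy gate for a newly registered tool (Failure 3).

    For each of the contract's own example inputs, asks the router
    "would you select this tool for X?"; if the tool never appears in
    the top-K, its embedding is missing/stale or its description is too
    weak to ship.
    """
    misses = [
        ex["input"] for ex in contract["examples"]
        if contract["tool_id"] not in route(ex["input"], k=k)
    ]
    assert not misses, f"router never selects {contract['tool_id']} for: {misses}"
```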

Failure 4 — Long-tail tools never get traffic, drift in correctness

  • Detection: per-tool eval coverage. Tools with <100 calls/day get a synthetic-traffic eval (probe).
  • Mitigation: synthetic probes ensure every tool is tested daily even if real traffic doesn't reach it (see the sketch below).
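
A sketch of the nightly probe loop; the traffic counter, dispatcher, and rubric scorer are placeholders for the observability and eval pieces (stories 04 and 08):

```python
def nightly_longtail_probes(registry, call_tool, score, min_daily_calls=100):
    """Synthetic-traffic probe for long-tail tools (Failure 4).

    Tools below the real-traffic threshold get exercised with their own
    contract examples so correctness drift is caught even with zero users.
    `call_tool` and `score` stand in for the dispatcher and eval rubric.
    """
    for tool in registry:
        if tool["calls_last_24h"] >= min_daily_calls:
            continue  # real traffic already covers this tool
        for ex in tool["contract"]["examples"]:
            output = call_tool(tool["tool_id"], ex["input"])
            assert score(output, ex["expected"]) >= tool["quality_floor"]
```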

The contract-as-source-of-truth pattern

With 70 tools, you cannot keep the planner's prompt and the actual tool behaviors in sync by hand. The contract (story 02) becomes the machine-readable spec that drives:

| Artifact | Generated from contract |
|---|---|
| Router embeddings | description + examples |
| Planner few-shot examples | the examples: field of the contract |
| Eval coverage tests | input/output schemas |
| Cost dashboard groupings | feature and domain tags |
| Compliance audit | respects_jurisdiction_policy field |
| Documentation site | full contract rendered as docs |

Editing the contract regenerates all of these. The contract is the spec; everything downstream is derived.
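
A sketch of the contract as that single source, shown as a Python dataclass; the field names mirror the derivation table above, but the real schema is story 02's and is richer than this subset.

```python
from dataclasses import dataclass, field

@dataclass
class ToolContract:
    """Machine-readable spec that every downstream artifact is derived from."""
    tool_id: str
    description: str                                     # -> router embeddings
    examples: list = field(default_factory=list)         # -> few-shots + eval tests
    input_schema: dict = field(default_factory=dict)     # -> eval coverage
    output_schema: dict = field(default_factory=dict)
    feature_tags: list = field(default_factory=list)     # -> cost dashboards
    respects_jurisdiction_policy: bool = True            # -> compliance audit
    allowed_tiers: list = field(default_factory=lambda: ["all"])

def regenerate_downstream(contract: ToolContract, generators: dict):
    # One contract edit fans out to every derived artifact:
    # embeddings, few-shots, eval tests, docs, compliance metadata.
    for name, gen in generators.items():
        gen(contract)
```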


What the harness contributes

| Harness piece | Role at 70 tools |
|---|---|
| Skill contracts (story 02) | The registry IS the answer to scaling |
| Capability flags (story 02) | Per-tool, per-tier, per-jurisdiction routing |
| Eval framework (story 04) | Per-tool eval scales naturally; synthetic probes for long-tail |
| Observability (story 08) | Per-tool cost / latency / quality dashboards |
| Sub-agent contract (story 03) | Sub-agents scale by the same registry pattern |
| Versioning (story 08) | Per-tool version visible in event log |

The investment in the contract layer at 7 tools paid back at 70.


Q&A drill — opening question

Q: Why not let the planner LLM scan all 70 contracts? Models have 200K context now.

Three reasons:

  1. Cost. 38K tokens at planner frequency is $0.04/turn extra. That's $96M/year for nothing.
  2. Quality. Long contexts dilute attention; the planner's tool-selection accuracy drops measurably as context size grows. The router with K=8 is more accurate than the planner with all 70.
  3. Cache invalidation. Any change to any tool's contract invalidates the planner's prompt cache. With the router, only the embedding index changes; the planner cache stays stable.

The "long-context model can do anything" argument is technically true and economically wrong.


Grilling — Round 1

Q1. How do you train the router?

The router is mostly retrieval, not training. Initial bootstrap:

  • Each tool's contract description + 3-5 example inputs are embedded.
  • Top-K is cosine similarity over the user's message + context digest.

Improvements layered on:

  • A lightweight LLM re-ranker (Haiku) takes the top-20 retrieval and re-ranks to top-K with awareness of user tier / jurisdiction.
  • Periodic finetuning of the embedding model on (user_query, correctly_chosen_tools) pairs harvested from production.

We do not train a from-scratch classifier; the retrieval-then-rerank pattern is sufficient.
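
A sketch of the re-rank stage layered on top of retrieval; the `llm` wrapper (Haiku-class, returning a ranked list of tool_ids) and the prompt shape are assumptions for illustration.

```python
def rerank(query, user_tier, jurisdiction, retrieved, llm, k=8):
    """Second stage of the router: top-20 retrieval -> top-K re-rank.

    `retrieved` is the cosine-similarity top-20; `llm` is an assumed
    small-model wrapper that returns a ranked list of tool_ids.
    """
    prompt = (
        f"User ({user_tier}, {jurisdiction}) asked: {query}\n"
        "Candidate tools:\n"
        + "\n".join(f"- {t['tool_id']}: {t['description']}" for t in retrieved)
        + f"\nReturn the {k} most relevant tool_ids, most relevant first."
    )
    ranked = llm(prompt)  # list[str] of tool_ids (assumption)
    valid = {t["tool_id"] for t in retrieved}
    # Keep only ids the retrieval stage actually proposed, in re-ranked order.
    return [tid for tid in ranked if tid in valid][:k]
```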

Q2. What about tool dependencies — tool B can only be useful after tool A has run?

Three options:

  • Tool composition declared in contract — tool B's contract says "requires output of A as context." The router treats them as a unit if both are relevant.
  • Sub-agent encapsulation — the A-then-B pattern becomes a sub-agent, exposed as a single contract.
  • Two-step planning — planner calls A, then re-plans with A's output, then calls B. Latency cost; rare.

Most workflows go through option 2 (sub-agent). Option 1 is rare; option 3 is for genuinely two-phase reasoning.
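
When option 1 does come up, the expansion is a small registry-side step; `requires` is an illustrative contract field name, not necessarily the real one.

```python
def expand_with_dependencies(selected_ids, registry_by_id):
    """Honor 'requires' declarations in the contract (option 1 above).

    If tool B made the top-K and its contract says it needs tool A's
    output, A is pulled in as well so the planner sees them as a unit.
    """
    expanded = list(selected_ids)
    for tool_id in selected_ids:
        for dep in registry_by_id[tool_id].get("requires", []):
            if dep not in expanded:
                expanded.append(dep)
    return expanded
```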

Q3. Per-tool eval coverage — how do you keep up at 70 tools?

Each tool owner is responsible for ~50-100 eval examples. The total eval set grows from 350 → 3500. We deduplicate where tools share input shapes. CI runs the relevant subset on each PR; the full eval runs nightly. The eval framework (story 04) was built for this scale; the marginal cost per tool is bounded.


Grilling — Round 2 (architect-level)

Q4. What's the cost of contract-as-source-of-truth — i.e., what regenerates and at what cost?

Per-tool change triggers:

  • Router embedding update: $0.0001 + index reload (~30 sec).
  • Planner few-shot regeneration: free (template rendering).
  • Eval coverage test refresh: ~$10 in batch eval cost per tool.
  • Documentation site rebuild: free.
  • Compliance audit propagation: free (just metadata update).

Total per-tool change: ~$15 + a few automated steps. With 70 tools and ~3 changes/tool/quarter, ~$3K/quarter in change-management overhead. Trivial.

Q5. The planner's tool selection from K=8 candidates can still be wrong — how do you measure?

We measure two correlated metrics:

  • Tool-selection accuracy — the planner's choice vs. the human-judged "right" choice on a labeled set. Today ~94%.
  • Outcome quality — the eval rubric on the actual response. This is the bottom line.

A high-accuracy planner with low outcome quality means the right tools were picked but the synthesis was poor (compose-stage issue). A low-accuracy planner with high outcome quality means we got lucky.

We track both because they decouple "did we route well?" from "did we compose well?" — different teams own different fixes.
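
A sketch of the tool-selection accuracy metric on the labeled set; the label format and the `plan()` callable are assumed shapes, and outcome quality is scored separately by the eval rubric (story 04).

```python
def tool_selection_accuracy(labeled_turns, plan):
    """Planner routing metric from Q5, computed on a human-labeled set.

    labeled_turns: [{"turn": ..., "candidates": [...], "right_tools": [...]}]
    plan: callable returning the set of tool_ids the planner would call
          when shown those candidates.
    """
    hits = sum(
        1 for ex in labeled_turns
        if plan(ex["turn"], ex["candidates"]) == set(ex["right_tools"])
    )
    return hits / len(labeled_turns)  # today ~0.94
```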

Q6. Where does this break? At 700 tools?

The two-stage architecture buys a lot of headroom. At 700 tools, the router still gives top-K; cost stays bounded. What stresses:

  • Embedding index size (manageable).
  • Synthetic probe budget for long-tail (grows linearly).
  • Contract review board cadence (becomes a full-time function).
  • Tool ambiguity / overlap (more tools, more chances of redundancy).

At 700, we'd likely add a router hierarchy: domain-router selects domain (catalog, social, accounts...), then in-domain router selects K. This is "intent classification → scoped retrieval," same idea, more layers. The contract layer doesn't change.
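
A sketch of that hierarchy; both routers are placeholders, and the in-domain router is the same two-stage router as today, just scoped to a slice of the registry.

```python
def hierarchical_route(turn_text, domain_router, in_domain_routers, k=8):
    """Intent classification -> scoped retrieval, for ~700 tools.

    domain_router picks the domain (catalog, social, accounts, ...);
    in_domain_routers maps each domain to a router over that domain's tools.
    """
    domain = domain_router(turn_text)
    return in_domain_routers[domain](turn_text, k=k)
```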

The fundamental breakdown happens not at tool count but at agent task complexity — when a single turn naturally requires reasoning over more tools than the planner can effectively compose. That's where you split into multiple specialized agents (each with its own router) and orchestrate them — story 03 territory.


Intuition gained

  • At 70+ tools, you cannot stuff the planner. Two-stage selection (router → planner) is the architecture pivot.
  • Contract-as-source-of-truth generates everything downstream: embeddings, prompts, eval, docs, audit.
  • Synthetic probes cover long-tail tools that real traffic doesn't.
  • Capability flags + per-tier filtering keep the registry usable across user tiers.
  • The investment at 7 tools (story 02) paid off at 70. This is the LLD payoff scenario.

See also

  • 02-foundation-model-deprecated.md — same migration discipline applied to models
  • 05-new-compliance-mandate.md — capability flags scale across tools
  • User stories 02, 03, 04