User Story 02 — Skill Composition and Invocation
Pillar: P1 (AI Workflow) + P3 (LLD) · Stage unlocked: 2 → 4 · Reading time: ~12 min
TL;DR
"Skills" (a.k.a. tools, MCPs, capabilities) are the vocabulary the agent uses to act. At Amazon scale you cannot afford to register them as ad-hoc Python functions glued to the planner prompt. You need a typed registry with contracts, capability flags, idempotency keys, and discovery — because the same skill is invoked from 4 different graphs (chat, voice, alerts, batch) and re-implementing it 4 times is how you get 4 different answers to the same question.
The User Story
As an Applied ML Engineer adding the 8th MCP to MangaAssist (cross-title-link MCP), I want to register the new skill once, declare its contract, capability requirements, and SLOs, so that every existing surface (chat agent, voice agent, alert daemon, nightly enrichment job) can invoke it safely without me hand-wiring it into each surface's prompt or codepath.
Acceptance criteria
- New skill is registered in <1 day, ships behind a capability flag, and is discoverable by the planner without a planner-prompt change.
- Every skill declares: input schema, output schema, error envelope, idempotency requirement, p95/p99 latency, $/call.
- Calls to the skill go through one gateway that handles auth, rate limit, retry policy, circuit breaker, observability — not 4 copies of that logic.
- The same skill version can be canaried (5% v2, 95% v1) per surface independently.
- Misuse (wrong input shape, missing capability) fails at registration / type-check time, not at invocation time in production.
What "skill composition" actually means
A skill is not a function. A skill is a typed contract with three layers:
```mermaid
flowchart LR
    subgraph Layer1[Contract Layer - what the planner sees]
        NAME[Name + 1-line purpose]
        IN[Input schema JSON]
        OUT[Output schema JSON]
        ERR[Error envelope]
        EX[Few-shot usage examples]
    end
    subgraph Layer2[Policy Layer - what the gateway enforces]
        RL[Rate limit per user / tenant]
        RB[Retry / backoff policy]
        CB[Circuit breaker]
        IDEM[Idempotency key]
        AUTH[Capability flag check]
        BUD[Latency / cost budget]
    end
    subgraph Layer3[Implementation Layer - what runs]
        IMPL[Backing service handler]
        DEPS[Datastore / model deps]
        OBS[Trace + metric + log emission]
    end
    Layer1 --> Layer2 --> Layer3
```
The planner only sees Layer 1. The gateway enforces Layer 2. Layer 3 is the team-owned backing service. Each layer has its own owner, SLA, and version.
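The three-layer split can be sketched as plain Python dataclasses. This is a minimal illustration, not the registry's real API — the class and field names are ours, chosen to mirror the layers above and the registry table below.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ContractLayer:
    """Layer 1 — the only part the planner ever sees."""
    name: str
    purpose: str
    input_schema: dict
    output_schema: dict
    error_envelope: list
    examples: list = field(default_factory=list)

@dataclass(frozen=True)
class PolicyLayer:
    """Layer 2 — enforced by the gateway on every call."""
    rate_limit_per_min: int
    retry_attempts: int
    circuit_breaker_threshold: float
    idempotent: bool
    capability_flag: str
    p95_ms: int
    cost_usd: float

@dataclass(frozen=True)
class ImplementationLayer:
    """Layer 3 — the team-owned backing service."""
    endpoint: str
    health_check: str

@dataclass(frozen=True)
class Skill:
    contract: ContractLayer
    policy: PolicyLayer
    impl: ImplementationLayer
    version: str

# Illustrative registration for the skill from the user story
cross_title = Skill(
    contract=ContractLayer(
        name="cross-title-link",
        purpose="Given a manga title, return related titles",
        input_schema={"title_id": "string", "locale": "string", "k": "integer"},
        output_schema={"recommendations": "array"},
        error_envelope=["TITLE_NOT_FOUND", "GRAPH_TIMEOUT", "LOCALE_NOT_SUPPORTED"],
    ),
    policy=PolicyLayer(
        rate_limit_per_min=60, retry_attempts=1, circuit_breaker_threshold=0.05,
        idempotent=True, capability_flag="cross_title_link_enabled",
        p95_ms=510, cost_usd=0.0014,
    ),
    impl=ImplementationLayer(endpoint="cross-title.mcp.internal:8443",
                             health_check="/healthz"),
    version="0.9",
)
```

Because each layer is its own object, each can carry its own owner and version, matching the ownership split described above.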
The skill registry — at MangaAssist scale
| Skill | Surface(s) | Contract version | p95 | $/call | Idempotent? |
|---|---|---|---|---|---|
| catalog-search | chat, voice, batch | v3.2 | 220 ms | $0.0004 | yes (by query+filter hash) |
| user-prefs-reco | chat, alerts | v2.1 | 380 ms | $0.0009 | no (time-dependent) |
| order-inventory | chat, voice | v4.0 | 180 ms | $0.0002 | yes (by order_id) |
| review-sentiment | chat, batch | v1.5 | 600 ms | $0.0011 | yes (by title+window) |
| support-policy | chat | v2.3 | 250 ms | $0.0003 | yes |
| trending-discovery | chat, alerts | v1.0 | 290 ms | $0.0005 | no (live aggregation) |
| cross-title-link | chat | v0.9 (canary) | 510 ms | $0.0014 | yes |
Every column here is enforced at registration time; none of it is aspirational documentation.
The composition patterns
Skills compose in four patterns. The planner picks the pattern; the gateway executes it.
Pattern A — Sequential (output of A → input of B)
```mermaid
sequenceDiagram
    participant Planner
    participant CatalogSearch
    participant ReviewSentiment
    Planner->>CatalogSearch: query: "berserk english"
    CatalogSearch-->>Planner: title_id=BRSK_001
    Planner->>ReviewSentiment: title_id=BRSK_001
    ReviewSentiment-->>Planner: sentiment summary
```
Pattern B — Parallel fan-out (independent)
```mermaid
sequenceDiagram
    participant Planner
    participant Catalog
    participant Trending
    participant CrossTitle
    par fan-out
        Planner->>Catalog: search
    and
        Planner->>Trending: top this week
    and
        Planner->>CrossTitle: similar
    end
    Catalog-->>Planner: results
    Trending-->>Planner: results
    CrossTitle-->>Planner: results
    Planner->>Planner: join + compose
```
Pattern C — Conditional (decide-then-call)
Planner first calls a cheap classifier skill (intent), then dispatches to the right downstream skill. Saves cost when 60% of traffic doesn't need the expensive skill.
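The cost arithmetic behind Pattern C can be sketched in a few lines. Everything here is a stand-in — the classifier, the skill names, and the per-call costs (taken from the registry table) are illustrative:

```python
# Pattern C sketch: a cheap intent-classifier skill gates the expensive one.
CLASSIFIER_COST = 0.0001   # hypothetical cheap intent skill
RECO_COST       = 0.0014   # cross-title-link, per the registry table
LOOKUP_COST     = 0.0002   # order-inventory, per the registry table

def classify_intent(message: str) -> str:
    # stand-in for the real classifier skill call
    return "discovery" if "similar" in message else "order"

def dispatch(message: str) -> tuple[str, float]:
    intent = classify_intent(message)
    if intent == "discovery":
        return "cross-title-link", CLASSIFIER_COST + RECO_COST
    # order traffic never pays the $0.0014 recommendation call
    return "order-inventory", CLASSIFIER_COST + LOOKUP_COST

skill, cost = dispatch("where is my order?")
```

With 60% of traffic on the cheap path, the blended cost per request drops well below always calling the expensive skill, at the price of one extra classifier hop.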
Pattern D — Speculative (call early, cancel if not needed)
Kick off the trending skill while the planner is still deciding intent. If the final intent is order-related, cancel the trending call and ignore. Trades wasted cost for tail latency.
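Pattern D maps naturally onto task cancellation. A minimal asyncio sketch, with fake skill stubs and made-up latencies standing in for the real calls:

```python
import asyncio

async def trending_skill() -> list[str]:
    await asyncio.sleep(0.2)          # simulated 200 ms skill latency
    return ["BRSK_001", "ONEP_002"]

async def resolve_intent() -> str:
    await asyncio.sleep(0.05)         # intent resolves before trending finishes
    return "order"                    # order-related -> trending not needed

async def plan() -> str:
    # Kick off trending speculatively, before intent is known.
    speculative = asyncio.create_task(trending_skill())
    intent = await resolve_intent()
    if intent == "order":
        speculative.cancel()          # wasted cost, but no tail-latency hit
        return "order-path"
    return f"discovery-path: {await speculative}"

result = asyncio.run(plan())
```

When intent turns out to be discovery, the trending result is already 50 ms into flight, which is exactly the tail-latency win the pattern buys.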
The choice between B and D is policy under a constraint — see Changing-Constraints-Scenarios/04-latency-sla-tightened.md.
Why this is the LLD that matters
Three failure modes hit teams that skip this layer:
Failure 1 — "Hidden coupling" — same skill, two implementations
The voice surface team copy-pasted the catalog-search call because the original was Python and they were on Node. Six months later, the Node version doesn't filter mature content; the Python one does. Compliance ticket.
Fix: the gateway is language-agnostic (HTTP/gRPC), every surface calls it. There is no "second implementation."
Failure 2 — "Prompt-only registration"
Team adds a tool by editing the planner prompt: "You have access to catalog_search(query, locale, limit)." Six weeks later, three teams have edited that prompt, two of them broke the schema, and prod silently calls the tool with wrong args.
Fix: the planner prompt is generated from the registry. Editing the prompt directly is a code-review block.
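A sketch of that generation step. The registry entries and the rendered prompt format are illustrative; the point is that the prompt is a pure function of the registry, so a schema change in the registry propagates automatically:

```python
# Hypothetical registry entries (contract-layer fields only).
registry = [
    {"name": "catalog-search",
     "purpose": "Search the manga catalog",
     "input": {"query": "string", "locale": "string", "limit": "integer"}},
    {"name": "cross-title-link",
     "purpose": "Return related titles",
     "input": {"title_id": "string", "locale": "string", "k": "integer"}},
]

def render_tool_prompt(registry: list[dict]) -> str:
    """Render the planner's tool section from registry entries."""
    lines = ["You have access to the following tools:"]
    for skill in registry:
        args = ", ".join(f"{k}: {t}" for k, t in skill["input"].items())
        lines.append(f"- {skill['name']}({args}) - {skill['purpose']}")
    return "\n".join(lines)

prompt = render_tool_prompt(registry)
```

The code-review block then only needs to reject diffs that touch the rendered prompt file directly, rather than policing free-form prompt edits.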
Failure 3 — "No idempotency, retry storm"
Order-inventory skill is not declared idempotent. Network blip → planner retries → user gets 2 holds on inventory → CS escalation.
Fix: idempotency is a registration-time field. Non-idempotent skills get an idempotency-key generated by the gateway and de-duped at the backing service.
What goes in the skill contract — concrete example
```yaml
# skills/cross-title-link.yaml
name: cross-title-link
version: 0.9
description: |
  Given a manga title, return semantically and behaviorally related titles
  using a Neptune graph + OpenSearch hybrid retrieval.
owner: team-discovery@amazon.com
input:
  title_id: { type: string, required: true, regex: "^[A-Z]{4}_[0-9]{3,6}$" }
  locale: { type: string, required: true, enum: [en-US, ja-JP, ko-KR, ...] }
  k: { type: integer, default: 10, min: 1, max: 50 }
output:
  recommendations:
    type: array
    items:
      title_id: string
      similarity: float   # [0, 1]
      reason: string      # human-readable, NOT model-generated
errors:
  - code: TITLE_NOT_FOUND       # 4xx, do not retry
  - code: GRAPH_TIMEOUT         # 5xx, retry once
  - code: LOCALE_NOT_SUPPORTED  # 4xx, fallback to en-US
policy:
  idempotent: true
  idempotency_key: "{title_id}:{locale}:{k}"
  rate_limit: { per_user_per_min: 60, global_per_sec: 5000 }
  timeout_ms: 800
  retry: { attempts: 1, backoff: exponential }
  circuit_breaker: { error_rate_threshold: 0.05, window_sec: 30 }
  budget: { p95_ms: 510, cost_usd: 0.0014 }
capabilities_required:
  - feature_flag: cross_title_link_enabled
  - tier: any   # available to all user tiers
backing_service:
  endpoint: cross-title.mcp.internal:8443
  health_check: /healthz
```
Every field is enforced, not merely documented. Registration fails if the backing service doesn't pass schema validation against a synthetic input; deploy fails if canary p95 exceeds the declared budget.
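The registration-time input check can be sketched as a small validator run against a synthetic payload. The schema mirrors the `input:` section of the YAML above (with the enum truncated to the listed locales); the validator itself is a hypothetical stand-in for whatever the real registry runs:

```python
import re

input_schema = {
    "title_id": {"type": str, "required": True, "regex": r"^[A-Z]{4}_[0-9]{3,6}$"},
    "locale":   {"type": str, "required": True, "enum": ["en-US", "ja-JP", "ko-KR"]},
    "k":        {"type": int, "required": False, "min": 1, "max": 50},
}

def validate(payload: dict, schema: dict) -> list[str]:
    """Return a list of violations; empty list means the payload passes."""
    errors = []
    for name, rules in schema.items():
        if name not in payload:
            if rules.get("required"):
                errors.append(f"{name}: missing")
            continue
        value = payload[name]
        if not isinstance(value, rules["type"]):
            errors.append(f"{name}: wrong type")
        elif "regex" in rules and not re.match(rules["regex"], value):
            errors.append(f"{name}: regex mismatch")
        elif "enum" in rules and value not in rules["enum"]:
            errors.append(f"{name}: not in enum")
        elif "min" in rules and not (rules["min"] <= value <= rules["max"]):
            errors.append(f"{name}: out of range")
    return errors

# Registration passes the synthetic input, fails the malformed one.
ok  = validate({"title_id": "BRSK_001", "locale": "en-US", "k": 10}, input_schema)
bad = validate({"title_id": "brsk-1", "locale": "en-US"}, input_schema)
```

The key property is that `bad` fails here, at registration, rather than as a malformed call in production.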
Q&A drill — opening question
Q: This looks like over-engineering for an internal tool registry. Why not just use Python decorators?
It looks like over-engineering when you have 7 skills. At 70 it looks like the only thing that saved you — and you will have 70 by year-end (constraint scenario 6). The actual cost of the contract layer is one YAML file per skill. The cost of NOT having it is reproducing the gateway logic 70 times across surfaces, with 70 different bug tails.
A more honest framing: this is the same argument as why services have OpenAPI specs instead of "just call the endpoint." The agent loop is the consumer; the skill is the API.
Grilling — Round 1
Q1. How does the planner know which skill to call? Do you stuff all 70 contracts into the system prompt?
No. The planner gets a two-stage retrieval:
1. Skill router — a small embedding model retrieves the top-K (K=8) candidate skills based on the user message + conversation context.
2. Planner — gets only those K contracts inlined, plus a "request more if needed" escape hatch.
This keeps the planner prompt under 6K tokens regardless of registry size. The skill router itself is a skill (recursive composition) — see story 03 for sub-agent patterns.
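Stage 1 is ordinary nearest-neighbor retrieval over skill descriptions. A toy sketch with hand-made 3-dimensional vectors standing in for real embeddings, and K=2 instead of the production K=8:

```python
import math

# Hypothetical pre-computed embeddings of each skill's description.
skill_vectors = {
    "catalog-search":   [0.9, 0.1, 0.0],
    "order-inventory":  [0.0, 0.9, 0.1],
    "cross-title-link": [0.8, 0.0, 0.6],
}

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def route(message_vec: list[float], k: int = 2) -> list[str]:
    """Return the top-k skill names; only their contracts reach the planner."""
    ranked = sorted(skill_vectors,
                    key=lambda s: cosine(message_vec, skill_vectors[s]),
                    reverse=True)
    return ranked[:k]

# A "find related titles" style message lands near the discovery skills.
candidates = route([0.85, 0.05, 0.3])
```

Because the planner prompt inlines only `len(candidates)` contracts, prompt size is bounded by K rather than by registry size.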
Q2. What about cross-skill state? E.g., "the order I asked about earlier"?
That's the conversation-state layer, not the skill layer. Skills are stateless contracts. Conversation state is held by the harness (see story 06) and rendered into the planner's input. A skill never reads conversation state directly — that's a layering violation that bites you when you try to invoke the skill from a non-conversational surface (alerts, batch).
Q3. Idempotency keys are great until two requests with the same key have different upstream context. How do you handle that?
The idempotency key includes a scope field — usually (user_id, session_id) for chat, (user_id, day) for alerts. Same key in different scopes does not collide. The key is composed by the gateway, not the skill, so the skill author can't get it wrong.
Grilling — Round 2 (architect-level)
Q4. A new compliance rule says some skills cannot be invoked for users under 18. How does that flow through this design?
Capability flags. The skill contract declares capabilities_required.tier: adult (new field). The gateway, on every call, checks the calling identity against the capabilities. If missing, the gateway returns CAPABILITY_DENIED — which the planner sees as a typed error that maps to a polite redirect ("I can't recommend that title for your account; here's a related all-ages title").
Two architectural points:
- The compliance check lives in one place (the gateway), not in 70 skills.
- The planner doesn't need a prompt change; the typed error is enough for the existing fallback edges to handle.
Q5. Walk me through skill versioning when the schema changes incompatibly. Cross-title-link wants to start returning a graph instead of a flat list.
Three options, ranked by safety:
1. Side-by-side versions (preferred). v0.9 returns the flat list, v1.0 returns the graph. Both are alive in the registry. Surfaces opt in; old surfaces keep working until migrated. Sunset v0.9 on a quarterly cadence.
2. Versioned output with upgrader (acceptable). v1.0 returns the graph; the gateway has an adapter that flattens it to the v0.9 shape for old callers. Adapter logic is owned by the skill team, not callers.
3. Big-bang upgrade (avoid). All callers update together. This is what kills 6-team release trains.
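Option 2's adapter is worth seeing concretely. A sketch under assumed payload shapes — the v1.0 graph format (`root`, `nodes`, `edges`) is hypothetical, and the v0.9 shape follows the contract YAML earlier in this story:

```python
def flatten_v1_to_v09(graph_response: dict) -> dict:
    """Gateway-side adapter: v1.0 graph -> v0.9 flat recommendations list."""
    recs = [
        {"title_id": node["title_id"],
         "similarity": node["similarity"],
         "reason": node["reason"]}
        for node in graph_response["nodes"]
        if node["title_id"] != graph_response["root"]   # root is the query title
    ]
    recs.sort(key=lambda r: r["similarity"], reverse=True)
    return {"recommendations": recs}

# Illustrative v1.0 response
v1 = {
    "root": "BRSK_001",
    "nodes": [
        {"title_id": "BRSK_001", "similarity": 1.0,  "reason": "query title"},
        {"title_id": "VNLD_014", "similarity": 0.91, "reason": "same author cluster"},
        {"title_id": "CLYM_003", "similarity": 0.87, "reason": "co-read graph"},
    ],
    "edges": [("BRSK_001", "VNLD_014"), ("BRSK_001", "CLYM_003")],
}
v09 = flatten_v1_to_v09(v1)
```

Because the adapter lives in the gateway and is owned by the skill team, un-migrated callers never learn the graph shape exists.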
Q6. How do you prevent skill bloat — every team adds their own skill, registry grows to 300, planner gets confused?
Three controls, in order of severity:
- Quarterly registry review — skills that are owner-less or see <100 calls/day get deprecated.
- Overlap audit — embedding similarity between skill descriptions; pairs above 0.85 are flagged for merge.
- Skill review board — net-new skills require sign-off; "is this a new capability or a parameter on an existing skill?" is the gate question.
The router (Q1) also self-protects: if two skills are too semantically close, retrieval becomes ambiguous and call-success metrics drop, naturally surfacing the merge candidate.
Intuition gained
- A skill is a contract, not a function. The contract has three layers and they have separate owners.
- The planner reads from the registry, not from a prompt. This is what scales from 7 to 70 tools.
- Capability flags + idempotency keys are LLD primitives that pay off enormously in incident response.
- Composition patterns (seq / parallel / conditional / speculative) are policy choices, not implementation details. You'll switch between them depending on the latency budget.
See also
- 01-execution-flow-design.md — how the graph dispatches into the registry
- 03-sub-agent-orchestration.md — when one skill is itself an agent
- 08-observability-cost-versioning-ratelimits.md — what the gateway emits
- Changing-Constraints-Scenarios/06-tool-count-explodes.md — registry under 10× growth