User Story 02 — Skill Composition and Invocation
Pillar: P1 (AI Workflow) + P3 (LLD) · Stage unlocked: 2 → 4 · Reading time: ~12 min
TL;DR
"Skills" (a.k.a. tools, MCPs, capabilities) are the vocabulary the agent uses to act. At Amazon scale you cannot afford to register them as ad-hoc Python functions glued to the planner prompt. You need a typed registry with contracts, capability flags, idempotency keys, and discovery — because the same skill is invoked from 4 different graphs (chat, voice, alerts, batch) and re-implementing it 4 times is how you get 4 different answers to the same question.
The User Story
As an Applied ML Engineer adding the 8th MCP to MangaAssist (cross-title-link MCP), I want to register the new skill once, declare its contract, capability requirements, and SLOs, so that every existing surface (chat agent, voice agent, alert daemon, nightly enrichment job) can invoke it safely without me hand-wiring it into each surface's prompt or codepath.
Acceptance criteria
- New skill is registered in <1 day, ships behind a capability flag, and is discoverable by the planner without a planner-prompt change.
- Every skill declares: input schema, output schema, error envelope, idempotency requirement, p95/p99 latency, $/call.
- Calls to the skill go through one gateway that handles auth, rate limit, retry policy, circuit breaker, observability — not 4 copies of that logic.
- The same skill version can be canaried (5% v2, 95% v1) per surface independently.
- Misuse (wrong input shape, missing capability) fails at registration / type-check time, not at invocation time in production.
What "skill composition" actually means
A skill is not a function. A skill is a typed contract with three layers:
```mermaid
flowchart LR
    subgraph Layer1[Contract Layer - what the planner sees]
        NAME[Name + 1-line purpose]
        IN[Input schema JSON]
        OUT[Output schema JSON]
        ERR[Error envelope]
        EX[Few-shot usage examples]
    end
    subgraph Layer2[Policy Layer - what the gateway enforces]
        RL[Rate limit per user / tenant]
        RB[Retry / backoff policy]
        CB[Circuit breaker]
        IDEM[Idempotency key]
        AUTH[Capability flag check]
        BUD[Latency / cost budget]
    end
    subgraph Layer3[Implementation Layer - what runs]
        IMPL[Backing service handler]
        DEPS[Datastore / model deps]
        OBS[Trace + metric + log emission]
    end
    Layer1 --> Layer2 --> Layer3
```
The planner only sees Layer 1. The gateway enforces Layer 2. Layer 3 is the team-owned backing service. Each layer has its own owner, SLA, and version.
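The three-layer split can be sketched as plain Python dataclasses. This is a minimal illustration, not the registry's real API — the class and field names are ours, chosen to mirror the layers above and the registry table below.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ContractLayer:
    """Layer 1 — the only part the planner ever sees."""
    name: str
    purpose: str
    input_schema: dict
    output_schema: dict
    error_envelope: list
    examples: list = field(default_factory=list)

@dataclass(frozen=True)
class PolicyLayer:
    """Layer 2 — enforced by the gateway on every call."""
    rate_limit_per_min: int
    retry_attempts: int
    circuit_breaker_threshold: float
    idempotent: bool
    capability_flag: str
    p95_ms: int
    cost_usd: float

@dataclass(frozen=True)
class ImplementationLayer:
    """Layer 3 — the team-owned backing service."""
    endpoint: str
    health_check: str

@dataclass(frozen=True)
class Skill:
    contract: ContractLayer
    policy: PolicyLayer
    impl: ImplementationLayer
    version: str

# Illustrative registration for the skill from the user story
cross_title = Skill(
    contract=ContractLayer(
        name="cross-title-link",
        purpose="Given a manga title, return related titles",
        input_schema={"title_id": "string", "locale": "string", "k": "integer"},
        output_schema={"recommendations": "array"},
        error_envelope=["TITLE_NOT_FOUND", "GRAPH_TIMEOUT", "LOCALE_NOT_SUPPORTED"],
    ),
    policy=PolicyLayer(
        rate_limit_per_min=60, retry_attempts=1, circuit_breaker_threshold=0.05,
        idempotent=True, capability_flag="cross_title_link_enabled",
        p95_ms=510, cost_usd=0.0014,
    ),
    impl=ImplementationLayer(endpoint="cross-title.mcp.internal:8443",
                             health_check="/healthz"),
    version="0.9",
)
```

Because each layer is its own object, each can carry its own owner and version, matching the ownership split described above.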
The skill registry — at MangaAssist scale
| Skill | Surface(s) | Contract version | p95 | $/call | Idempotent? |
|---|---|---|---|---|---|
| catalog-search | chat, voice, batch | v3.2 | 220 ms | $0.0004 | yes (by query+filter hash) |
| user-prefs-reco | chat, alerts | v2.1 | 380 ms | $0.0009 | no (time-dependent) |
| order-inventory | chat, voice | v4.0 | 180 ms | $0.0002 | yes (by order_id) |
| review-sentiment | chat, batch | v1.5 | 600 ms | $0.0011 | yes (by title+window) |
| support-policy | chat | v2.3 | 250 ms | $0.0003 | yes |
| trending-discovery | chat, alerts | v1.0 | 290 ms | $0.0005 | no (live aggregation) |
| cross-title-link | chat | v0.9 (canary) | 510 ms | $0.0014 | yes |
Every column here is enforced at registration time; none of it is aspirational documentation.
The composition patterns
Skills compose in four patterns. The planner picks the pattern; the gateway executes it.
Pattern A — Sequential (output of A → input of B)
```mermaid
sequenceDiagram
    participant Planner
    participant CatalogSearch
    participant ReviewSentiment
    Planner->>CatalogSearch: query: "berserk english"
    CatalogSearch-->>Planner: title_id=BRSK_001
    Planner->>ReviewSentiment: title_id=BRSK_001
    ReviewSentiment-->>Planner: sentiment summary
```
Pattern B — Parallel fan-out (independent)
```mermaid
sequenceDiagram
    participant Planner
    participant Catalog
    participant Trending
    participant CrossTitle
    par fan-out
        Planner->>Catalog: search
    and
        Planner->>Trending: top this week
    and
        Planner->>CrossTitle: similar
    end
    Catalog-->>Planner: results
    Trending-->>Planner: results
    CrossTitle-->>Planner: results
    Planner->>Planner: join + compose
```
Pattern C — Conditional (decide-then-call)
Planner first calls a cheap classifier skill (intent), then dispatches to the right downstream skill. Saves cost when 60% of traffic doesn't need the expensive skill.
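The cost arithmetic behind Pattern C can be sketched in a few lines. Everything here is a stand-in — the classifier, the skill names, and the per-call costs (taken from the registry table) are illustrative:

```python
# Pattern C sketch: a cheap intent-classifier skill gates the expensive one.
CLASSIFIER_COST = 0.0001   # hypothetical cheap intent skill
RECO_COST       = 0.0014   # cross-title-link, per the registry table
LOOKUP_COST     = 0.0002   # order-inventory, per the registry table

def classify_intent(message: str) -> str:
    # stand-in for the real classifier skill call
    return "discovery" if "similar" in message else "order"

def dispatch(message: str) -> tuple[str, float]:
    intent = classify_intent(message)
    if intent == "discovery":
        return "cross-title-link", CLASSIFIER_COST + RECO_COST
    # order traffic never pays the $0.0014 recommendation call
    return "order-inventory", CLASSIFIER_COST + LOOKUP_COST

skill, cost = dispatch("where is my order?")
```

With 60% of traffic on the cheap path, the blended cost per request drops well below always calling the expensive skill, at the price of one extra classifier hop.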
Pattern D — Speculative (call early, cancel if not needed)
Kick off the trending skill while the planner is still deciding intent. If the final intent is order-related, cancel the trending call and ignore. Trades wasted cost for tail latency.
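Pattern D maps naturally onto task cancellation. A minimal asyncio sketch, with fake skill stubs and made-up latencies standing in for the real calls:

```python
import asyncio

async def trending_skill() -> list[str]:
    await asyncio.sleep(0.2)          # simulated 200 ms skill latency
    return ["BRSK_001", "ONEP_002"]

async def resolve_intent() -> str:
    await asyncio.sleep(0.05)         # intent resolves before trending finishes
    return "order"                    # order-related -> trending not needed

async def plan() -> str:
    # Kick off trending speculatively, before intent is known.
    speculative = asyncio.create_task(trending_skill())
    intent = await resolve_intent()
    if intent == "order":
        speculative.cancel()          # wasted cost, but no tail-latency hit
        return "order-path"
    return f"discovery-path: {await speculative}"

result = asyncio.run(plan())
```

When intent turns out to be discovery, the trending result is already 50 ms into flight, which is exactly the tail-latency win the pattern buys.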
The choice between B and D is policy under a constraint — see Changing-Constraints-Scenarios/04-latency-sla-tightened.md.
Why this is the LLD that matters
Three failure modes hit teams that skip this layer:
Failure 1 — "Hidden coupling" — same skill, two implementations
The voice surface team copy-pasted the catalog-search call because the original was Python and they were on Node. Six months later, the Node version doesn't filter mature content; the Python one does. Compliance ticket.
Fix: the gateway is language-agnostic (HTTP/gRPC), every surface calls it. There is no "second implementation."
Failure 2 — "Prompt-only registration"
Team adds a tool by editing the planner prompt: "You have access to catalog_search(query, locale, limit)." Six weeks later, three teams have edited that prompt, two of them broke the schema, and prod silently calls the tool with wrong args.
Fix: the planner prompt is generated from the registry. Editing the prompt directly is a code-review block.
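A sketch of that generation step. The registry entries and the rendered prompt format are illustrative; the point is that the prompt is a pure function of the registry, so a schema change in the registry propagates automatically:

```python
# Hypothetical registry entries (contract-layer fields only).
registry = [
    {"name": "catalog-search",
     "purpose": "Search the manga catalog",
     "input": {"query": "string", "locale": "string", "limit": "integer"}},
    {"name": "cross-title-link",
     "purpose": "Return related titles",
     "input": {"title_id": "string", "locale": "string", "k": "integer"}},
]

def render_tool_prompt(registry: list[dict]) -> str:
    """Render the planner's tool section from registry entries."""
    lines = ["You have access to the following tools:"]
    for skill in registry:
        args = ", ".join(f"{k}: {t}" for k, t in skill["input"].items())
        lines.append(f"- {skill['name']}({args}) - {skill['purpose']}")
    return "\n".join(lines)

prompt = render_tool_prompt(registry)
```

The code-review block then only needs to reject diffs that touch the rendered prompt file directly, rather than policing free-form prompt edits.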
Failure 3 — "No idempotency, retry storm"
Order-inventory skill is not declared idempotent. Network blip → planner retries → user gets 2 holds on inventory → CS escalation.
Fix: idempotency is a registration-time field. Non-idempotent skills get an idempotency-key generated by the gateway and de-duped at the backing service.
What goes in the skill contract — concrete example
```yaml
# skills/cross-title-link.yaml
name: cross-title-link
version: 0.9
description: |
  Given a manga title, return semantically and behaviorally related titles
  using a Neptune graph + OpenSearch hybrid retrieval.
owner: team-discovery@amazon.com
input:
  title_id: { type: string, required: true, regex: "^[A-Z]{4}_[0-9]{3,6}$" }
  locale: { type: string, required: true, enum: [en-US, ja-JP, ko-KR, ...] }
  k: { type: integer, default: 10, min: 1, max: 50 }
output:
  recommendations:
    type: array
    items:
      title_id: string
      similarity: float   # [0, 1]
      reason: string      # human-readable, NOT model-generated
errors:
  - code: TITLE_NOT_FOUND       # 4xx, do not retry
  - code: GRAPH_TIMEOUT         # 5xx, retry once
  - code: LOCALE_NOT_SUPPORTED  # 4xx, fallback to en-US
policy:
  idempotent: true
  idempotency_key: "{title_id}:{locale}:{k}"
  rate_limit: { per_user_per_min: 60, global_per_sec: 5000 }
  timeout_ms: 800
  retry: { attempts: 1, backoff: exponential }
  circuit_breaker: { error_rate_threshold: 0.05, window_sec: 30 }
  budget: { p95_ms: 510, cost_usd: 0.0014 }
capabilities_required:
  - feature_flag: cross_title_link_enabled
  - tier: any   # available to all user tiers
backing_service:
  endpoint: cross-title.mcp.internal:8443
  health_check: /healthz
```
Every field is enforced, not merely documented. Registration fails if the backing service doesn't pass schema validation against a synthetic input; deploy fails if canary p95 exceeds the declared budget.
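The registration-time input check can be sketched as a small validator run against a synthetic payload. The schema mirrors the `input:` section of the YAML above (with the enum truncated to the listed locales); the validator itself is a hypothetical stand-in for whatever the real registry runs:

```python
import re

input_schema = {
    "title_id": {"type": str, "required": True, "regex": r"^[A-Z]{4}_[0-9]{3,6}$"},
    "locale":   {"type": str, "required": True, "enum": ["en-US", "ja-JP", "ko-KR"]},
    "k":        {"type": int, "required": False, "min": 1, "max": 50},
}

def validate(payload: dict, schema: dict) -> list[str]:
    """Return a list of violations; empty list means the payload passes."""
    errors = []
    for name, rules in schema.items():
        if name not in payload:
            if rules.get("required"):
                errors.append(f"{name}: missing")
            continue
        value = payload[name]
        if not isinstance(value, rules["type"]):
            errors.append(f"{name}: wrong type")
        elif "regex" in rules and not re.match(rules["regex"], value):
            errors.append(f"{name}: regex mismatch")
        elif "enum" in rules and value not in rules["enum"]:
            errors.append(f"{name}: not in enum")
        elif "min" in rules and not (rules["min"] <= value <= rules["max"]):
            errors.append(f"{name}: out of range")
    return errors

# Registration passes the synthetic input, fails the malformed one.
ok  = validate({"title_id": "BRSK_001", "locale": "en-US", "k": 10}, input_schema)
bad = validate({"title_id": "brsk-1", "locale": "en-US"}, input_schema)
```

The key property is that `bad` fails here, at registration, rather than as a malformed call in production.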
Q&A drill — opening question
Q: This looks like over-engineering for an internal tool registry. Why not just use Python decorators?
It looks like over-engineering when you have 7 skills. At 70 it looks like the only thing that saved you — and you will have 70 by year-end (constraint scenario 6). The actual cost of the contract layer is one YAML file per skill. The cost of NOT having it is reproducing the gateway logic 70 times across surfaces, with 70 different bug tails.
A more honest framing: this is the same argument as why services have OpenAPI specs instead of "just call the endpoint." The agent loop is the consumer; the skill is the API.
Grilling — Round 1
Q1. How does the planner know which skill to call? Do you stuff all 70 contracts into the system prompt?
No. The planner gets a two-stage retrieval:
1. Skill router — a small embedding model retrieves the top-K (K=8) candidate skills based on the user message + conversation context.
2. Planner — gets only those K contracts inlined, plus a "request more if needed" escape hatch.
This keeps the planner prompt under 6K tokens regardless of registry size. The skill router itself is a skill (recursive composition) — see story 03 for sub-agent patterns.
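Stage 1 is ordinary nearest-neighbor retrieval over skill descriptions. A toy sketch with hand-made 3-dimensional vectors standing in for real embeddings, and K=2 instead of the production K=8:

```python
import math

# Hypothetical pre-computed embeddings of each skill's description.
skill_vectors = {
    "catalog-search":   [0.9, 0.1, 0.0],
    "order-inventory":  [0.0, 0.9, 0.1],
    "cross-title-link": [0.8, 0.0, 0.6],
}

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def route(message_vec: list[float], k: int = 2) -> list[str]:
    """Return the top-k skill names; only their contracts reach the planner."""
    ranked = sorted(skill_vectors,
                    key=lambda s: cosine(message_vec, skill_vectors[s]),
                    reverse=True)
    return ranked[:k]

# A "find related titles" style message lands near the discovery skills.
candidates = route([0.85, 0.05, 0.3])
```

Because the planner prompt inlines only `len(candidates)` contracts, prompt size is bounded by K rather than by registry size.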
Q2. What about cross-skill state? E.g., "the order I asked about earlier"?
That's the conversation-state layer, not the skill layer. Skills are stateless contracts. Conversation state is held by the harness (see story 06) and rendered into the planner's input. A skill never reads conversation state directly — that's a layering violation that bites you when you try to invoke the skill from a non-conversational surface (alerts, batch).
Q3. Idempotency keys are great until two requests with the same key have different upstream context. How do you handle that?
The idempotency key includes a scope field — usually (user_id, session_id) for chat, (user_id, day) for alerts. Same key in different scopes does not collide. The key is composed by the gateway, not the skill, so the skill author can't get it wrong.
Grilling — Round 2 (architect-level)
Q4. A new compliance rule says some skills cannot be invoked for users under 18. How does that flow through this design?
Capability flags. The skill contract declares capabilities_required.tier: adult (new field). The gateway, on every call, checks the calling identity against the capabilities. If missing, the gateway returns CAPABILITY_DENIED — which the planner sees as a typed error that maps to a polite redirect ("I can't recommend that title for your account; here's a related all-ages title").
Two architectural points:
- The compliance check lives in one place (the gateway), not in 70 skills.
- The planner doesn't need a prompt change; the typed error is enough for the existing fallback edges to handle.
Q5. Walk me through skill versioning when the schema changes incompatibly. Cross-title-link wants to start returning a graph instead of a flat list.
Three options, ranked by safety:
1. Side-by-side versions (preferred). v0.9 returns the flat list, v1.0 returns the graph. Both are alive in the registry. Surfaces opt in; old surfaces keep working until migrated. Sunset v0.9 on a quarterly cadence.
2. Versioned output with upgrader (acceptable). v1.0 returns the graph; the gateway has an adapter that flattens it to the v0.9 shape for old callers. Adapter logic is owned by the skill team, not callers.
3. Big-bang upgrade (avoid). All callers update together. This is what kills 6-team release trains.
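Option 2's adapter is worth seeing concretely. A sketch under assumed payload shapes — the v1.0 graph format (`root`, `nodes`, `edges`) is hypothetical, and the v0.9 shape follows the contract YAML earlier in this story:

```python
def flatten_v1_to_v09(graph_response: dict) -> dict:
    """Gateway-side adapter: v1.0 graph -> v0.9 flat recommendations list."""
    recs = [
        {"title_id": node["title_id"],
         "similarity": node["similarity"],
         "reason": node["reason"]}
        for node in graph_response["nodes"]
        if node["title_id"] != graph_response["root"]   # root is the query title
    ]
    recs.sort(key=lambda r: r["similarity"], reverse=True)
    return {"recommendations": recs}

# Illustrative v1.0 response
v1 = {
    "root": "BRSK_001",
    "nodes": [
        {"title_id": "BRSK_001", "similarity": 1.0,  "reason": "query title"},
        {"title_id": "VNLD_014", "similarity": 0.91, "reason": "same author cluster"},
        {"title_id": "CLYM_003", "similarity": 0.87, "reason": "co-read graph"},
    ],
    "edges": [("BRSK_001", "VNLD_014"), ("BRSK_001", "CLYM_003")],
}
v09 = flatten_v1_to_v09(v1)
```

Because the adapter lives in the gateway and is owned by the skill team, un-migrated callers never learn the graph shape exists.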
Q6. How do you prevent skill bloat — every team adds their own skill, registry grows to 300, planner gets confused?
Three controls, in order of severity:
- Quarterly registry review — skills that are owner-less or see <100 calls/day get deprecated.
- Overlap audit — embedding similarity between skill descriptions; pairs above 0.85 are flagged for merge.
- Skill review board — net-new skills require sign-off; "is this a new capability or a parameter on an existing skill?" is the gate question.
The router (Q1) also self-protects: if two skills are too semantically close, retrieval becomes ambiguous and call-success metrics drop, naturally surfacing the merge candidate.
Intuition gained
- A skill is a contract, not a function. The contract has three layers and they have separate owners.
- The planner reads from the registry, not from a prompt. This is what scales from 7 to 70 tools.
- Capability flags + idempotency keys are LLD primitives that pay off enormously in incident response.
- Composition patterns (seq / parallel / conditional / speculative) are policy choices, not implementation details. You'll switch between them depending on the latency budget.
See also
- 01-execution-flow-design.md — how the graph dispatches into the registry
- 03-sub-agent-orchestration.md — when one skill is itself an agent
- 08-observability-cost-versioning-ratelimits.md — what the gateway emits
- Changing-Constraints-Scenarios/06-tool-count-explodes.md — registry under 10× growth