
Master Synthesis — The Mental Model

Read this after the user stories and constraint scenarios. It exists to crystallize that intuition into a few reusable principles you can carry into any future AI-first system design.


The one-paragraph summary

AI-first production engineering is the discipline of building the harness around the LLM so that the LLM can be a swappable, versioned, fallible component inside a reliable system — instead of the system itself. The harness is three pillars: AI workflow design (graph, skills, sub-agents, evals), system design (pause/resume, checkpoints, sync/async, observability/cost/quota), and low-level design (contracts, adapters, fallback chains, capability flags). At Amazon scale, none of these can be retrofitted; constraints change overnight (surge, deprecation, cost cut, latency tightening, compliance, registry growth, quota cut, locale launch) and the systems that survive those changes are the ones whose seams were drawn deliberately before they were tested.


The map of everything

```mermaid
flowchart TB
  subgraph AIFLOW[Pillar 1 - AI Workflow]
    G[Typed graph state machine]
    S[Skill contracts and registry]
    SA[Sub-agent orchestration]
    E[Active and passive evals]
  end

  subgraph HARNESS[Pillar 2 - Harness]
    PR[Pause and resume]
    CP[Checkpoints and serving]
    SY[Sync and async modes]
    OB[Obs Cost Quota Versioning]
  end

  subgraph LLD[Pillar 3 - LLD]
    AD[Provider adapters]
    CF[Capability flags]
    FB[Fallback chains]
    PV[Per-cell rubrics and policies]
  end

  G --> PR
  S --> AD
  SA --> CP
  E --> OB
  PR --> SY
  CP --> SY
  OB --> CF
  AD --> FB
  CF --> S
  PV --> E

  subgraph CONST[Constraint forcing functions]
    C1[10x surge]
    C2[Model deprecated]
    C3[Cost halved]
    C4[Latency tightened]
    C5[Compliance mandate]
    C6[Tool count 7 to 70]
    C7[Quota revoked]
    C8[New locale]
  end

  CONST -.tests.-> AIFLOW
  CONST -.tests.-> HARNESS
  CONST -.tests.-> LLD
```

Every pillar box maps to a user story; every constraint box maps to a scenario. The solid arrows are dependencies; the dotted arrows are stress tests.


Seven cross-cutting principles

Principle 1 — The graph is the architecture

The execution graph (story 01) is the artifact reviewed in design docs, gated in CI, partitioned in dashboards, and versioned in deploys. Prompts are configuration inside it. Teams that design the prompt first and the graph second build systems that fight every constraint change. Teams that design the graph first treat prompts as fillable contracts.
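A minimal sketch of what "the graph is the architecture" means in code. The node names, budgets, and edges here are illustrative assumptions, not the real graph from story 01; the point is that transitions are typed data that can be reviewed, gated, and versioned.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical node names; the real graph would come from the design doc.
class Node(Enum):
    PLAN = "plan"
    TOOL = "tool"
    REFLECT = "reflect"
    DONE = "done"
    FALLBACK = "fallback"

@dataclass(frozen=True)
class Edge:
    src: Node
    dst: Node
    max_traversals: int  # budget: how many times this edge may fire per run

# The graph is data, reviewable in a design doc and gateable in CI.
EDGES = [
    Edge(Node.PLAN, Node.TOOL, 5),
    Edge(Node.TOOL, Node.REFLECT, 5),
    Edge(Node.REFLECT, Node.PLAN, 3),   # bounded retry loop
    Edge(Node.REFLECT, Node.DONE, 1),
    Edge(Node.TOOL, Node.FALLBACK, 1),  # typed failure edge, not a bare exception
]

def next_node(current: Node, desired: Node, traversed: dict) -> Node:
    """Enforce per-edge budgets; route to FALLBACK when a budget is exhausted."""
    for e in EDGES:
        if e.src is current and e.dst is desired:
            key = (e.src, e.dst)
            if traversed.get(key, 0) < e.max_traversals:
                traversed[key] = traversed.get(key, 0) + 1
                return desired
            return Node.FALLBACK
    raise ValueError(f"untyped transition {current} -> {desired}")
```

Prompts slot into nodes as configuration; changing a prompt never changes the shape of this structure.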

Principle 2 — Treat skills as APIs, not functions

The three layers of a skill — contract / policy / implementation (story 02) — must be separately owned and versioned. Without this, a 10× tool growth (scenario 6) is a rewrite. With it, registration is a YAML file. The same pattern extends to sub-agents (story 03).
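The three-layer split can be sketched as data structures with distinct owners. Field names and the example skill are hypothetical; in practice the registration record would be the YAML file the principle mentions.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SkillContract:
    name: str
    version: str
    input_schema: dict   # what callers may send
    output_schema: dict  # what callers may rely on

@dataclass
class SkillPolicy:
    timeout_ms: int
    max_retries: int
    allowed_tiers: tuple  # which caller tiers may invoke this skill

@dataclass
class Skill:
    contract: SkillContract  # owned by the API surface, versioned independently
    policy: SkillPolicy      # owned by the operating team
    impl: Callable           # owned by the implementing team, swappable

REGISTRY: dict[str, Skill] = {}

def register(skill: Skill) -> None:
    """Registration is a data operation -- the moral equivalent of adding a YAML file."""
    REGISTRY[f"{skill.contract.name}@{skill.contract.version}"] = skill

register(Skill(
    contract=SkillContract("lookup_order", "1.2.0",
                           {"order_id": "str"}, {"status": "str"}),
    policy=SkillPolicy(timeout_ms=800, max_retries=2, allowed_tiers=("internal",)),
    impl=lambda order_id: {"status": "shipped"},  # stub implementation
))
```

Because the planner and router only ever see contracts, adding the 64th skill touches the registry, not the agent.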

Principle 3 — Eval is two systems, not one

Passive evals gate deploys; active evals guard prod (story 04). They run on different timescales and require different infrastructure. Stratified online sampling, frozen calibrated judges, tool-lookup for factual dimensions, and replay harnesses are the four levers that turn LLM-as-judge into engineering.
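The first of those levers, stratified online sampling, can be sketched in a few lines. The strata and rates below are invented for illustration; real strata would be cells like (surface, locale, tier).

```python
import random

# Hypothetical strata and per-stratum judge rates.
SAMPLE_RATES = {"checkout": 0.20, "search": 0.05, "chitchat": 0.01}

def should_judge(stratum: str, rng: random.Random) -> bool:
    """Stratified sampling: rare, high-stakes strata get judged far more
    often than bulk traffic, so small cells still accumulate signal."""
    return rng.random() < SAMPLE_RATES.get(stratum, 0.02)

rng = random.Random(0)
judged = sum(should_judge("checkout", rng) for _ in range(10_000))
```

Uniform sampling at the blended average rate would starve the checkout stratum of judged examples; stratification is what makes the online judge's per-cell numbers statistically usable.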

Principle 4 — Pause/resume + idempotency = durability

Long-running and external-blocking workflows (story 05) need durable state. The state row stays small (<8 KB); large payloads go to S3 and are referenced by key. Idempotency keys (story 02) on every external call are what make resume safe. Without these, deploys drop sessions and retries double-spend.
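The idempotency half of the equation fits in a short sketch. The in-memory dicts stand in for the real stores (a DDB state row and a dedupe table); the charge payload is a made-up example.

```python
import hashlib
import json

COMPLETED_CALLS: dict[str, dict] = {}  # dedupe store, keyed by idempotency key

def idempotency_key(workflow_id: str, step: str, payload: dict) -> str:
    """Deterministic key: the same workflow step with the same payload always
    hashes the same, so a resumed run reuses the recorded result."""
    raw = json.dumps([workflow_id, step, payload], sort_keys=True)
    return hashlib.sha256(raw.encode()).hexdigest()

def call_external(workflow_id: str, step: str, payload: dict) -> dict:
    key = idempotency_key(workflow_id, step, payload)
    if key in COMPLETED_CALLS:          # resume path: replay, don't re-charge
        return COMPLETED_CALLS[key]
    result = {"charged": payload["amount"]}  # stand-in for the real side effect
    COMPLETED_CALLS[key] = result
    return result
```

A workflow that crashes after the charge and resumes from its checkpoint hits the replay path and returns the recorded result instead of charging twice.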

Principle 5 — Mode is invocation policy, not agent code

The same agent runs in streaming-sync, async-queue, batch-window, and scheduled-trigger modes (story 07). Mode is configured at invocation; the agent doesn't know. Quota partitioning across modes (story 08) prevents batch surges from killing live chat.
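"Mode is invocation policy" can be made concrete with a sketch. The mode names, deadlines, and quota pools below are assumptions for illustration; note that the agent body never inspects the mode.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mode:
    name: str
    stream: bool
    deadline_ms: int
    quota_pool: str  # partitioned pools keep batch surges away from live chat

MODES = {
    "live_chat": Mode("live_chat", stream=True,  deadline_ms=3_000,   quota_pool="interactive"),
    "batch":     Mode("batch",     stream=False, deadline_ms=600_000, quota_pool="batch"),
}

def agent(prompt: str) -> str:
    """Mode-agnostic: the same code runs under every invocation policy."""
    return f"answer to: {prompt}"

def invoke(mode_name: str, prompt: str) -> tuple[str, Mode]:
    mode = MODES[mode_name]  # policy is resolved at the call site, not inside the agent
    return agent(prompt), mode
```

Adding a fifth mode is a new row in `MODES`, not a change to `agent`.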

Principle 6 — One unified telemetry schema

Five concerns (observability, cost, versioning, rate limiting, quota) share one event schema (story 08). Every event carries every artifact version. The reserve-before-invoke quota model prevents user-visible 429s. The chokepoint principle: one rate limiter, one quota manager, and all entry points flow through them.
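The reserve-before-invoke model can be sketched as a small pool. This is an illustrative simplification, not the story 08 implementation: overload is shed at admission, and unused reservation is returned after the call completes.

```python
class QuotaPool:
    """Reserve-before-invoke: capacity is claimed before the LLM call, so
    overload is shed at admission instead of surfacing as a mid-call 429."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.reserved = 0

    def reserve(self, tokens: int) -> bool:
        if self.reserved + tokens > self.capacity:
            return False  # shed here, before any provider call is made
        self.reserved += tokens
        return True

    def reconcile(self, reserved: int, actual: int) -> None:
        # Return the unused portion of the reservation after the call completes.
        self.reserved -= (reserved - actual)

pool = QuotaPool(capacity=1000)
```

A caller that can't reserve gets a graceful degraded path (queue, fallback provider, or shed) rather than a provider error halfway through a stream.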

Principle 7 — Constraints are data, not branches

Per-jurisdiction policy (scenario 5), per-locale rubrics (scenario 8), per-tier model routing (scenario 3), per-surface budget tables (scenario 4) — all are data in capability flags, not if statements in code. This is what lets a regulator's emergency demand become a config flip in <1 hour.
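"Constraints are data" reduces to a lookup over a policy table. The jurisdictions and flag names below are hypothetical; in production the table would be versioned YAML behind a capability-flag service, not a Python dict.

```python
# Hypothetical per-jurisdiction policy table, modeled as default + overrides.
POLICY_FLAGS = {
    "default": {"pii_redaction": False, "retention_days": 90},
    "EU":      {"pii_redaction": True,  "retention_days": 30},
}

def policy_for(jurisdiction: str) -> dict:
    """The agent reads policy as data; a regulator's demand is a table edit,
    not a new code branch."""
    return {**POLICY_FLAGS["default"], **POLICY_FLAGS.get(jurisdiction, {})}
```

The <1 hour turnaround comes from the shape of the change: an emergency mandate edits one row and ships as config, with no deploy of agent code.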


The seven recurring patterns

| Pattern | Where it appears |
| --- | --- |
| Typed transitions with budgets | Stories 01, 03, 06; scenarios 1, 4 |
| Reservation-based quota | Story 08; scenarios 1, 7 |
| Provider adapters with config-flip fallback | Story 02; scenarios 1, 2, 7 |
| Replay harness for migration | Story 04; scenarios 2, 3 |
| Capability flags for per-segment policy | Story 02; scenarios 3, 5, 8 |
| Idempotency keys + durable state | Stories 02, 05, 06 |
| Per-cell rubrics with shared baseline | Story 04; scenarios 4, 8 |

If you see a problem that fits one of these patterns, the harness already has the answer. If it doesn't fit, you're either looking at a genuinely new problem or you haven't decomposed it yet.
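One of the recurring patterns above, provider adapters with config-flip fallback, is small enough to sketch end to end. Provider names and the simulated failure are stand-ins; the key property is that the chain order is configuration, so a quota revocation (scenario 7) is a config flip, not a code change.

```python
class ProviderError(Exception):
    pass

def provider_a(prompt: str) -> str:
    raise ProviderError("quota revoked")  # simulate scenario 7

def provider_b(prompt: str) -> str:
    return f"b:{prompt}"

# Adapters present one interface; the chain itself is a config value.
ADAPTERS = {"a": provider_a, "b": provider_b}
FALLBACK_CHAIN = ["a", "b"]

def invoke_with_fallback(prompt: str) -> str:
    for name in FALLBACK_CHAIN:
        try:
            return ADAPTERS[name](prompt)
        except ProviderError:
            continue  # typed failure: fall through to the next provider
    raise ProviderError("all providers exhausted")
```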


The "AI-pilled" gradient revisited

```mermaid
flowchart LR
  S0[0 API-first<br/>LLM is a function] --> S1[1 Workflow-aware<br/>retries and tools]
  S1 --> S2[2 Eval-aware<br/>offline tests, canaries]
  S2 --> S3[3 Harness-aware<br/>pause-resume, fallbacks]
  S3 --> S4[4 AI-first<br/>harness is the product]

  S2 -. reading user stories 01-04 .-> S3
  S3 -. reading user stories 05-08 .-> S4
  S4 -. testing on constraint scenarios .-> S4
```

This folder is a 2 → 4 push. Every user story names which stage it unlocks.

If you started this folder unsure whether the harness was worth the investment, the eight constraint scenarios are the answer: each one is an outage avoided, a migration shortened, a regulatory deadline hit. The harness's value is realized only when the constraint changes. The first time you ship without it, you absorb a few catastrophes; the second time, you build it; the third time, you're ahead of teams that haven't.


What makes a system AI-first vs not — the tell-tale signs

| Tell | Not AI-first | AI-first |
| --- | --- | --- |
| Where do prompts live? | In Python strings near the call site | Versioned, pulled from a registry, partitioned in dashboards |
| What happens when a tool fails? | Generic exception handler | Typed fallback edge in the graph; SLO-tracked degraded-mode rate |
| How do you know quality is OK? | No formal eval; ship, watch tickets | Stratified online judging; drift detection; auto-rollback hooks |
| What's the rollback story for a prompt change? | Revert PR, hope the cache flushes | Capability flag flip; canary rollback to N-1 in <1 min |
| How do new locales/jurisdictions work? | Hardcoded `if region == 'EU'` branches | Capability flags in YAML; agent reads policy as data |
| What's the cost-per-turn? | "We pay our LLM bill monthly" | Decomposed across LLM, infra, eval, judge, storage; per-feature dashboards |
| What about a quota cut? | "We'd go down" | Reserve-before-invoke; spill to fallback provider; priority shedding |
| How do you migrate to a new model? | Rewrite prompts; pray | Replay → calibrate → canary → ramp; in-flight workflows finish on starting model |
| What about pause/resume? | "Container restart drops users" | State row in DDB; <8 KB; idempotency on every external call |
| How does the planner know about a new tool? | Edit the prompt | Register in YAML; router retrieves; planner sees top-K |

Every "AI-first" cell on the right corresponds to a user story or scenario in this folder. None are theoretical; all are forced by Amazon-scale realities.


Where this lands in the broader project

This folder is the production-engineering counterpart to the architecture and POC-to-Production folders already in the repo:

```mermaid
flowchart LR
  POC[POC-to-Production War Story<br/>What broke when we shipped] --> AIFP[AI-First Production Engineering<br/>What to build to not break]
  RAG[RAG-MCP Integration<br/>What the components are] --> AIFP
  GTE[Ground-Truth Evolution<br/>What changes about quality data] --> AIFP
  AIFP --> Future[Production-grade GenAI systems]
```

POC told us "here are the catastrophes." RAG-MCP told us "here are the integrated components." Ground-truth told us "data labels move." This folder closes the loop: "here is the harness that lets all of those work in production at scale."


A closing note on the "I'm not as AI-pilled" anxiety

The user prompt that started this folder said "I am not as AI-pilled as some of the other folks in the org, but I am getting there :) And it has been an insane learning curve."

That's exactly the right disposition. The "AI-pilled" engineers who skip the harness work get fast prototypes and slow productions. The engineers with serious system-design experience who add AI to their toolkit produce systems that hold up under constraint changes.

The gap closes by building. Each constraint scenario in this folder is a thought experiment you can run on any system you've shipped: what would happen if X changed overnight? The discomfort of not having an answer is the cue to invest.

The harness is not glamorous. It is, however, what shipping looks like.


Index of all files

Pre-reading

  • ../00-overview-ai-first-manifesto.md
  • ../01-framework-three-pillars.md

User stories (Pillars 1 + 2 + 3)

  1. ../User-Stories/01-execution-flow-design.md
  2. ../User-Stories/02-skill-composition-and-invocation.md
  3. ../User-Stories/03-sub-agent-orchestration.md
  4. ../User-Stories/04-active-passive-evals.md
  5. ../User-Stories/05-pause-resume-workflows.md
  6. ../User-Stories/06-checkpointing-and-serving.md
  7. ../User-Stories/07-sync-async-invocation.md
  8. ../User-Stories/08-observability-cost-versioning-ratelimits.md

Changing-constraints scenarios

  1. ../Changing-Constraints-Scenarios/01-10x-user-surge-overnight.md
  2. ../Changing-Constraints-Scenarios/02-foundation-model-deprecated.md
  3. ../Changing-Constraints-Scenarios/03-cost-budget-halved.md
  4. ../Changing-Constraints-Scenarios/04-latency-sla-tightened.md
  5. ../Changing-Constraints-Scenarios/05-new-compliance-mandate.md
  6. ../Changing-Constraints-Scenarios/06-tool-count-explodes.md
  7. ../Changing-Constraints-Scenarios/07-provider-quota-revoked.md
  8. ../Changing-Constraints-Scenarios/08-new-locale-launch.md

Synthesis

  • 99-master-synthesis.md (this file)