# AI-First Production Engineering — Overview & Manifesto
Context: MangaAssist on Amazon — a conversational shopping/discovery agent serving manga, manhwa, and manhua across 18 locales, 200M+ Prime users, and a long-tail catalog of 4M+ titles. Every concept here is anchored in this real production setting.
## Why this folder exists
The traditional "build the API, then bolt on the LLM call" mindset breaks down at production scale. AI systems have a genuinely different workflow — execution flow, skill composition, sub-agent orchestration, evals, checkpoints, fallbacks, cost tracking, quota management — and all of it has to be designed in from line one, not retrofitted.
This folder captures user stories for each of those concepts, then layers a set of "changing-constraints" scenarios on top — the moments when one knob turns and the entire system has to re-architect itself overnight.
## The three-sentence thesis
- AI-first means you design the harness before you design the prompt. Skills, agents, evals, checkpoints, and observability are first-class architecture, not afterthoughts.
- System design + LLD become more important, not less. Long-running, async, multi-step, externally-dependent workflows demand more rigor than a stateless REST service ever did.
- Constraints change suddenly. Model deprecated, quota cut, latency SLA tightened, locale launched, cost ceiling halved — the systems that survive are the ones whose seams were drawn deliberately.
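The "harness before prompt" idea can be sketched in a few lines: the execution graph, fallbacks, budgets, and checkpoints are declared first, and the prompt arrives last as plain data plugged into one step. Everything here (`Step`, `Harness`, the lambda bodies) is a hypothetical illustration, not code from this repo.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical sketch: the harness (steps, fallbacks, budgets, checkpoints)
# is designed first; the prompt is just data filled into a step later.
@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]            # the actual work (LLM call, tool, eval)
    fallback: Optional["Step"] = None      # invoked if `run` raises
    max_cost_usd: float = 0.01             # budget is part of the contract

@dataclass
class Harness:
    steps: list[Step] = field(default_factory=list)
    checkpoints: dict[str, dict] = field(default_factory=dict)

    def execute(self, state: dict) -> dict:
        for step in self.steps:
            try:
                state = step.run(state)
            except Exception:
                if step.fallback is None:
                    raise
                state = step.fallback.run(state)
            # Checkpoint after every step so a crash resumes here, not at zero.
            self.checkpoints[step.name] = dict(state)
        return state

# The prompt shows up last, as data inside an already-designed graph.
harness = Harness(steps=[
    Step("plan", run=lambda s: {**s, "plan": f"planner({s['query']})"}),
    Step("answer", run=lambda s: {**s, "answer": f"llm({s['plan']})"}),
])
result = harness.execute({"query": "recommend a manhwa like Solo Leveling"})
```

The point of the sketch: swapping the prompt or the model changes one field, while the seams — checkpoints, fallbacks, budgets — stay where they were deliberately drawn.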
## Folder map
```mermaid
flowchart LR
Root[AI-First-Production-Engineering]
Root --> O[00 Overview manifesto]
Root --> F[01 Three-pillars framework]
Root --> US[User-Stories/]
Root --> CC[Changing-Constraints-Scenarios/]
Root --> SY[Synthesis/]
US --> US1[01 Execution-flow design]
US --> US2["02 Skill composition & invocation"]
US --> US3[03 Sub-agent orchestration]
US --> US4[04 Active vs passive evals]
US --> US5[05 Pause / resume workflows]
US --> US6["06 Checkpointing & serving"]
US --> US7[07 Sync vs async invocation]
US --> US8[08 Observability cost versioning rate-limits]
CC --> CC1[01 10x user surge overnight]
CC --> CC2[02 Foundation model deprecated]
CC --> CC3[03 Cost budget halved]
CC --> CC4["04 Latency SLA tightened p99 8s -> 2s"]
CC --> CC5[05 New compliance mandate mid-flight]
CC --> CC6["06 Tool count explodes 7 -> 70"]
CC --> CC7[07 Provider quota revoked]
CC --> CC8[08 New locale launched in 2 weeks]
SY --> S1["99 Master synthesis & intuition"]
```
## The five behaviors that distinguish AI-first engineers
| # | Behavior | Anti-pattern it replaces |
|---|---|---|
| 1 | Designs the execution graph before writing prompts | Writes the prompt, then asks "where does this run?" |
| 2 | Treats skills/tools as versioned APIs with contracts and SLOs | Treats tools as Python functions glued together |
| 3 | Builds active eval loops alongside passive monitoring | Ships, then waits for support tickets |
| 4 | Plans for pause/resume + checkpoints from day one | Assumes every workflow finishes in one process lifetime |
| 5 | Models cost, quota, and latency as first-class state | Discovers cost in the AWS bill, not the trace |
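Behavior 5 can be made concrete with a small sketch: a per-session ledger that every LLM and tool call must pass through, so cost and token spend show up in the trace instead of surfacing later in the bill. The class name, the per-session cost slice, and the event shape are illustrative assumptions; only the 32K token budget echoes the scale anchors table below.

```python
from dataclasses import dataclass, field

@dataclass
class SessionLedger:
    # Illustrative ceilings; the token budget echoes the 32K p95 anchor.
    max_cost_usd: float = 0.006      # hypothetical per-session cost slice
    max_tokens: int = 32_000         # p95 token budget per session
    cost_usd: float = 0.0
    tokens: int = 0
    events: list = field(default_factory=list)

    def charge(self, step: str, tokens: int, usd: float) -> None:
        """Record a call; refuse it *before* the budget is blown, not after."""
        if self.cost_usd + usd > self.max_cost_usd:
            raise RuntimeError(f"cost ceiling hit at step {step!r}")
        if self.tokens + tokens > self.max_tokens:
            raise RuntimeError(f"token budget hit at step {step!r}")
        self.cost_usd += usd
        self.tokens += tokens
        self.events.append((step, tokens, usd))   # this tuple lands in the trace

ledger = SessionLedger()
ledger.charge("planner", tokens=1_200, usd=0.0009)
ledger.charge("tool:catalog_search", tokens=800, usd=0.0004)
```

Because every call is charged through one object, cost is queryable mid-session — the trace, not the monthly invoice, is where the overrun is discovered.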
## How to read this folder
- `User-Stories/` — Each file is one concept, structured as: persona + story + acceptance criteria + deep-dive solution + Amazon-scale numbers + Mermaid architecture + grilling Q&A.
- `Changing-Constraints-Scenarios/` — Each file picks one constraint that changes overnight, walks the cascade of consequences, and shows the architecture pivot. Same structure: TL;DR, the change, the cascade, the pivot, trade-offs, grilling Q&A.
- `Synthesis/` — The master mental model that ties all 16 stories together.
## Recurring scale anchors (used across all files)
| Metric | Value |
|---|---|
| Prime users with manga affinity | 200M+ |
| Active concurrent chat sessions (peak) | 1.2M |
| Daily LLM invocations (planner + tools + eval) | 2.4B |
| Catalog size (titles × volumes) | 4M titles, 80M SKUs |
| Locales | 18 |
| MCP tools in production | 7 today, 70 projected by year-end |
| Token budget per session (p95) | 32K |
| Cost ceiling per active user / month | $0.18 |
| Latency SLA (first token, p95) | 1.4s |
| Latency SLA (full response, p95) | 6.0s |
| Eval throughput required | 50K judgments / hour |
These numbers are referenced repeatedly so the trade-offs feel concrete instead of abstract.
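A quick back-of-envelope shows how tight these anchors are together. Assuming (purely for illustration) that all 200M+ users are billed as active and a 30-day month, the cost ceiling and invocation volume imply a per-invocation budget of roughly half a tenth of a cent:

```python
# Back-of-envelope from the scale anchors above (illustrative: treats all
# 200M users as active and uses a 30-day month).
users = 200_000_000
ceiling_per_user_month = 0.18                      # USD per active user / month
invocations_per_day = 2_400_000_000

monthly_budget = users * ceiling_per_user_month    # 36,000,000 USD
monthly_invocations = invocations_per_day * 30     # 72,000,000,000
budget_per_invocation = monthly_budget / monthly_invocations

print(f"${budget_per_invocation:.6f} per invocation")   # ≈ $0.000500
```

Half a millicent per planner/tool/eval call is the envelope every architecture decision in the constraint scenarios has to fit inside.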
## Reading order
- Start with `01-framework-three-pillars.md` — the mental model.
- Read user stories `01 → 08` in order; each builds on the previous.
- Then read constraint scenarios in any order — they're independent stress tests.
- Finish with `Synthesis/99-master-synthesis.md` for the intuition pass.