# AI-First Production Engineering — Overview & Manifesto
Context: MangaAssist on Amazon — a conversational shopping/discovery agent serving manga, manhwa, and manhua across 18 locales, 200M+ Prime users, and a long-tail catalog of 4M+ titles. Every concept here is anchored in this real production setting.
## Why this folder exists
The traditional "build the API, then bolt on the LLM call" mindset breaks down at production scale. AI systems have a genuinely different workflow — execution flow, skill composition, sub-agent orchestration, evals, checkpoints, fallbacks, cost tracking, quota management — and all of it has to be designed in from line one, not retrofitted.
This folder captures user stories for each of those concepts, then layers a set of "changing-constraints" scenarios on top — the moments when one knob turns and the entire system has to re-architect itself overnight.
## The three-sentence thesis
- AI-first means you design the harness before you design the prompt. Skills, agents, evals, checkpoints, and observability are first-class architecture, not afterthoughts.
- System design + LLD become more important, not less. Long-running, async, multi-step, externally-dependent workflows demand more rigor than a stateless REST service ever did.
- Constraints change suddenly. Model deprecated, quota cut, latency SLA tightened, locale launched, cost ceiling halved — the systems that survive are the ones whose seams were drawn deliberately.
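The "harness before prompt" idea can be sketched in a few lines: the execution graph, fallbacks, budgets, and checkpoints are declared first, and the prompt arrives last as plain data plugged into one step. Everything here (`Step`, `Harness`, the lambda bodies) is a hypothetical illustration, not code from this repo.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical sketch: the harness (steps, fallbacks, budgets, checkpoints)
# is designed first; the prompt is just data filled into a step later.
@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]            # the actual work (LLM call, tool, eval)
    fallback: Optional["Step"] = None      # invoked if `run` raises
    max_cost_usd: float = 0.01             # budget is part of the contract

@dataclass
class Harness:
    steps: list[Step] = field(default_factory=list)
    checkpoints: dict[str, dict] = field(default_factory=dict)

    def execute(self, state: dict) -> dict:
        for step in self.steps:
            try:
                state = step.run(state)
            except Exception:
                if step.fallback is None:
                    raise
                state = step.fallback.run(state)
            # Checkpoint after every step so a crash resumes here, not at zero.
            self.checkpoints[step.name] = dict(state)
        return state

# The prompt shows up last, as data inside an already-designed graph.
harness = Harness(steps=[
    Step("plan", run=lambda s: {**s, "plan": f"planner({s['query']})"}),
    Step("answer", run=lambda s: {**s, "answer": f"llm({s['plan']})"}),
])
result = harness.execute({"query": "recommend a manhwa like Solo Leveling"})
```

The point of the sketch: swapping the prompt or the model changes one field, while the seams — checkpoints, fallbacks, budgets — stay where they were deliberately drawn.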
## Folder map
```mermaid
flowchart LR
Root[AI-First-Production-Engineering]
Root --> O[00 Overview manifesto]
Root --> F[01 Three-pillars framework]
Root --> US[User-Stories/]
Root --> CC[Changing-Constraints-Scenarios/]
Root --> SY[Synthesis/]
US --> US1[01 Execution-flow design]
US --> US2["02 Skill composition & invocation"]
US --> US3[03 Sub-agent orchestration]
US --> US4[04 Active vs passive evals]
US --> US5[05 Pause / resume workflows]
US --> US6["06 Checkpointing & serving"]
US --> US7[07 Sync vs async invocation]
US --> US8[08 Observability cost versioning rate-limits]
CC --> CC1[01 10x user surge overnight]
CC --> CC2[02 Foundation model deprecated]
CC --> CC3[03 Cost budget halved]
CC --> CC4["04 Latency SLA tightened p99 8s -> 2s"]
CC --> CC5[05 New compliance mandate mid-flight]
CC --> CC6["06 Tool count explodes 7 -> 70"]
CC --> CC7[07 Provider quota revoked]
CC --> CC8[08 New locale launched in 2 weeks]
SY --> S1["99 Master synthesis & intuition"]
```
## The five behaviors that distinguish AI-first engineers
| # | Behavior | Anti-pattern it replaces |
|---|---|---|
| 1 | Designs the execution graph before writing prompts | Writes the prompt, then asks "where does this run?" |
| 2 | Treats skills/tools as versioned APIs with contracts and SLOs | Treats tools as Python functions glued together |
| 3 | Builds active eval loops alongside passive monitoring | Ships, then waits for support tickets |
| 4 | Plans for pause/resume + checkpoints from day one | Assumes every workflow finishes in one process lifetime |
| 5 | Models cost, quota, and latency as first-class state | Discovers cost in the AWS bill, not the trace |
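Behavior 5 can be made concrete with a small sketch: a per-session ledger that every LLM and tool call must pass through, so cost and token spend show up in the trace instead of surfacing later in the bill. The class name, the per-session cost slice, and the event shape are illustrative assumptions; only the 32K token budget echoes the scale anchors table below.

```python
from dataclasses import dataclass, field

@dataclass
class SessionLedger:
    # Illustrative ceilings; the token budget echoes the 32K p95 anchor.
    max_cost_usd: float = 0.006      # hypothetical per-session cost slice
    max_tokens: int = 32_000         # p95 token budget per session
    cost_usd: float = 0.0
    tokens: int = 0
    events: list = field(default_factory=list)

    def charge(self, step: str, tokens: int, usd: float) -> None:
        """Record a call; refuse it *before* the budget is blown, not after."""
        if self.cost_usd + usd > self.max_cost_usd:
            raise RuntimeError(f"cost ceiling hit at step {step!r}")
        if self.tokens + tokens > self.max_tokens:
            raise RuntimeError(f"token budget hit at step {step!r}")
        self.cost_usd += usd
        self.tokens += tokens
        self.events.append((step, tokens, usd))   # this tuple lands in the trace

ledger = SessionLedger()
ledger.charge("planner", tokens=1_200, usd=0.0009)
ledger.charge("tool:catalog_search", tokens=800, usd=0.0004)
```

Because every call is charged through one object, cost is queryable mid-session — the trace, not the monthly invoice, is where the overrun is discovered.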
## How to read this folder
- `User-Stories/` — Each file is one concept, structured as: persona + story + acceptance criteria + deep-dive solution + Amazon-scale numbers + Mermaid architecture + grilling Q&A.
- `Changing-Constraints-Scenarios/` — Each file picks one constraint that changes overnight, walks the cascade of consequences, and shows the architecture pivot. Same structure: TL;DR, the change, the cascade, the pivot, trade-offs, grilling Q&A.
- `Synthesis/` — The master mental model that ties all 16 stories together.
## Recurring scale anchors (used across all files)
| Metric | Value |
|---|---|
| Prime users with manga affinity | 200M+ |
| Active concurrent chat sessions (peak) | 1.2M |
| Daily LLM invocations (planner + tools + eval) | 2.4B |
| Catalog size (titles × volumes) | 4M titles, 80M SKUs |
| Locales | 18 |
| MCP tools in production | 7 today, 70 projected by year-end |
| Token budget per session (p95) | 32K |
| Cost ceiling per active user / month | $0.18 |
| Latency SLA (first token, p95) | 1.4s |
| Latency SLA (full response, p95) | 6.0s |
| Eval throughput required | 50K judgments / hour |
These numbers are referenced repeatedly so the trade-offs feel concrete instead of abstract.
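A quick back-of-envelope shows how tight these anchors are together. Assuming (purely for illustration) that all 200M+ users are billed as active and a 30-day month, the cost ceiling and invocation volume imply a per-invocation budget of roughly half a tenth of a cent:

```python
# Back-of-envelope from the scale anchors above (illustrative: treats all
# 200M users as active and uses a 30-day month).
users = 200_000_000
ceiling_per_user_month = 0.18                      # USD per active user / month
invocations_per_day = 2_400_000_000

monthly_budget = users * ceiling_per_user_month    # 36,000,000 USD
monthly_invocations = invocations_per_day * 30     # 72,000,000,000
budget_per_invocation = monthly_budget / monthly_invocations

print(f"${budget_per_invocation:.6f} per invocation")   # ≈ $0.000500
```

Half a millicent per planner/tool/eval call is the envelope every architecture decision in the constraint scenarios has to fit inside.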
## Reading order
- Start with `01-framework-three-pillars.md` — the mental model.
- Read user stories `01 → 08` in order; each builds on the previous.
- Then read constraint scenarios in any order — they're independent stress tests.
- Finish with `Synthesis/99-master-synthesis.md` for the intuition pass.