Applied ML Engineer User Stories — MangaAssist Chatbot (Amazon-Scale)
Overview
This directory is the Applied ML Engineer / Product Engineer for ML lens on the MangaAssist chatbot. Where the sibling ML-Engineer-User-Stories/ folder owns the platform lens (training pipelines, drift hub, model registry, retrain cadence, embedding re-index), this folder owns the product lens — translating a customer pain into an ML/AI hypothesis, designing the experiment with statistical discipline, integrating the model into the chatbot turn pipeline, and defending it under business-metric guardrails.
The role distinction matters because the two roles fail in different ways:
- ML Engineer failure mode: drift detection lags, retraining is brittle, the registry is wrong, online serving p99 spikes. Symptoms are infrastructural.
- Applied ML Engineer failure mode: shipped a model that won offline by 5% NDCG and lost 1.2% retention online; ran a 7-day A/B on an effect that needed 21 days to detect; promoted on aggregate CTR while JP cohort regressed by 8%; declared victory before novelty effect washed out. Symptoms are judgment failures, not infra failures.
This folder is the playbook for not making those judgment failures.
The user's framing for this artifact: "Applied ML is product engineer for ML projects where I am taking a business use case and then building a solution with ML and AI for that" — combined with experiment-selection rigor (data-science master's discipline) and Amazon product-engineering DNA (Working Backwards, customer obsession, six-pager culture).
What's In Here
| File | Purpose |
|---|---|
| README.md | Story roster, dependency graph, KPI rollup, role contrast (this file) |
| 00-foundations-and-primitives-for-applied-ml-engineering.md | Seven primitives every Applied ML Engineer carries: Working-Backwards framing, experiment portfolio thinking, hypothesis & study design, online/offline correlation, business-metric guardrails, cohort fairness, incident triage |
| 01-deep-dive-per-applied-ml-story.md | Eight stories: customer pain → hypothesis → experiment design → architecture wiring → rollout plan → metrics → real-incident vignette |
| 02-applied-ml-engineer-grill-chains.md | Interview-style multi-round Q&A drill per story — opening + 4 escalating rounds + 3 architect-level + intuition gained + red-flag/strong-answer markers |
Stories live inside 01-deep-dive-per-applied-ml-story.md (not as separate BDD files). The pattern mirrors Cost-Optimization-Offline-Testing/04-scenario-deep-dives-per-cost-story.md: one section per story, full BDD framing inside the section.
Story Roster
| # | Title | Anchored In | Headline Question | Amazon LP |
|---|---|---|---|---|
| AML-01 | Customer-pain → ML-problem translation | New-reader retention drop in JP cohort | Is this even an ML problem? When is a heuristic enough? | Customer Obsession, Are Right A Lot |
| AML-02 | Experiment portfolio prioritization | 12 candidate ML wins per quarter | Which 3 of 12 do we ship next quarter? Why these? | Bias for Action, Frugality |
| AML-03 | Hypothesis design & sample-size discipline | US-MLE-02 reranker change, US-MLE-06 recsys A/B | What's the MDE, holdout, runtime, stop rule? | Dive Deep, Insist on Highest Standards |
| AML-04 | Online/offline metric decoupling | RAG-MCP-09 (Recall@10 vs CTR), Ground-Truth-Evolution ML-03 | Offline says +5% NDCG, online says flat. Why? | Learn and Be Curious |
| AML-05 | Business-KPI guardrails for promotion | CSAT/GMV/retention as veto signals | When do you NOT ship a model that "looks better"? | Customer Obsession, Ownership |
| AML-06 | Cohort fairness & locale stratification | EN/JP/mixed, new-vs-returning, US-MLE-08 cover-art | Aggregate wins, JP cohort regresses by 8%. Promote? | Earn Trust, Insist on Highest Standards |
| AML-07 | Production integration & latency budgets | RAG-MCP-08 orchestration, US-MLE-02 reranker SLA | Where does the model live in the 800ms turn budget? | Deliver Results, Frugality |
| AML-08 | Incident triage: 'the model got worse' | POC-Production catastrophe #2 (RAG recall collapse) | Reranker quality dropped this morning. Where do you look first? | Dive Deep, Ownership |
The eight stories cover the full lifecycle of a product-applied ML decision: should we build it (AML-01, 02) → how do we test it (AML-03, 04) → when do we ship it (AML-05, 06) → how does it run (AML-07) → what do we do when it breaks (AML-08).
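The "should we build it" gate (AML-02) can be made concrete as an expected-impact score per candidate experiment. A minimal sketch, assuming hypothetical candidate names, priors, and reachable populations; the scoring idea is detectable effect × prior belief it wins × users it can reach:

```python
def expected_impact(candidate):
    """Expected impact of one candidate experiment:
    MDE (detectable effect) x prior (belief it wins) x reachable population."""
    return candidate["mde"] * candidate["prior"] * candidate["population"]

# Hypothetical candidates -- names, priors, and populations are illustrative.
candidates = [
    {"name": "reranker-v2",     "mde": 0.005, "prior": 0.4, "population": 2_000_000},
    {"name": "jp-cover-art",    "mde": 0.010, "prior": 0.2, "population": 400_000},
    {"name": "cold-start-recs", "mde": 0.008, "prior": 0.5, "population": 900_000},
]

# Rank the portfolio, then take the top slice the team can actually run.
ranked = sorted(candidates, key=expected_impact, reverse=True)
top_picks = [c["name"] for c in ranked[:3]]
```

The point of the sketch is not the arithmetic but the discipline: every candidate gets scored on the same three inputs before anyone argues from enthusiasm.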
Dependency & Sequencing Graph
The eight Applied-ML stories sit on top of the platform-ML stories. AML stories do not own training, registries, or drift hubs — they consume them. The dependency graph below shows AML stories (top) anchored to US-MLE platform stories (bottom).
```mermaid
graph TB
    subgraph Applied[Applied ML Engineer Lens — Product Decisions]
        AML01[AML-01<br/>Customer pain<br/>→ ML problem]
        AML02[AML-02<br/>Experiment<br/>portfolio]
        AML03[AML-03<br/>Hypothesis<br/>& sample size]
        AML04[AML-04<br/>Online/offline<br/>decoupling]
        AML05[AML-05<br/>Business-KPI<br/>guardrails]
        AML06[AML-06<br/>Cohort fairness]
        AML07[AML-07<br/>Production<br/>integration]
        AML08[AML-08<br/>Incident triage]
    end
    subgraph Platform[ML Engineer Lens — Platform Foundations]
        MLE01[US-MLE-01 Intent]
        MLE02[US-MLE-02 Reranker]
        MLE05[US-MLE-05 Embedding]
        MLE06[US-MLE-06 Recsys]
        MLE08[US-MLE-08 Cover-art]
        DRIFTHUB[Drift Hub<br/>+ Model Registry]
    end
    AML01 -->|frames the problem for| AML02
    AML02 -->|picks experiments for| AML03
    AML03 -->|sample-size feeds into| AML07
    AML04 -->|correlation gate before| AML05
    AML05 -->|veto signal for| AML07
    AML06 -->|cohort holdout for| AML03
    AML07 -->|telemetry feeds| AML08
    AML08 -->|root-cause feeds back into| AML01
    AML03 -.consumes.-> MLE02
    AML03 -.consumes.-> MLE06
    AML04 -.consumes.-> MLE05
    AML06 -.consumes.-> MLE08
    AML07 -.consumes.-> MLE01
    AML08 -.consumes.-> DRIFTHUB
    classDef appl fill:#9cf,stroke:#333,stroke-width:2px
    classDef plat fill:#fd2,stroke:#333
    class AML01,AML02,AML03,AML04,AML05,AML06,AML07,AML08 appl
    class MLE01,MLE02,MLE05,MLE06,MLE08,DRIFTHUB plat
```
Reading paths:
- First time through — read `00-foundations-and-primitives-for-applied-ml-engineering.md` end-to-end. Then pick a scenario in `01-deep-dive-per-applied-ml-story.md` that matches a real product decision you face. Then drill yourself with the matching grill chain in `02-applied-ml-engineer-grill-chains.md`.
- Interview prep (Amazon Applied Scientist / Applied ML Engineer loop) — go through grill chains AML-01 → AML-08 in order, scoring yourself against the red-flag and strong-answer markers. Most candidates fail on AML-04 (online/offline decoupling) and AML-05 (guardrails) — those are the two stories with the highest signal-to-noise for senior-level evaluation.
- Working a real launch — start with AML-02 (portfolio prioritization) to defend that this experiment is the right one to run; then AML-03 (hypothesis & sample size); then AML-05 (guardrails); then AML-07 (integration). The other stories are diagnostic, not prescriptive.
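The AML-04 correlation gate (trust offline metrics as proxies only while they still predict online movement) is just a Pearson correlation over the history of past experiment pairs. A minimal stdlib sketch; the experiment history below is hypothetical:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between paired offline and online metric deltas."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical history: per-experiment offline ΔNDCG@10 vs online ΔCTR.
offline_ndcg = [0.05, 0.02, -0.01, 0.03, 0.00, 0.04]
online_ctr   = [0.012, 0.004, -0.003, 0.006, 0.001, 0.009]

r = pearson(offline_ndcg, online_ctr)
gate_open = r >= 0.6   # below 0.6, stop trusting offline wins (AML-04 alarm)
```

When `gate_open` flips to false, the right response is not to ship anyway; it is to re-derive why the offline metric stopped predicting the online one.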
Owner Mapping
| Story | Suggested Owner Role | Partners |
|---|---|---|
| AML-01 | Applied ML Engineer + Product Manager | Customer-Insights Researcher, Business Analyst |
| AML-02 | Applied ML Engineer (the role's defining accountability) | Engineering Manager, PM, Data Scientist |
| AML-03 | Applied ML Engineer + Data Scientist | Statistician (light review), Experiment-Platform team |
| AML-04 | Applied ML Engineer + Data Scientist | RAG/Retrieval ML Engineer (US-MLE-05 owner) |
| AML-05 | Applied ML Engineer + PM + Business stakeholder | Finance / FinOps Lead (for GMV guardrails) |
| AML-06 | Applied ML Engineer + Trust & Safety + Localisation | Per-locale PM (JP / EN) |
| AML-07 | Applied ML Engineer + Backend Eng (chatbot turn pipeline) | SRE, MCP Owner |
| AML-08 | Applied ML Engineer (on-call rotation primary) | ML Engineer platform owner, SRE |
The Applied ML Engineer holds the cross-cutting product-decision contract: every model promotion ultimately requires their sign-off, and the artefacts in this folder are the evidence base for that sign-off.
Unified KPI Rollup
A senior leader (Eng Manager, Director, GM) should be able to scan this table and know what the role is accountable for on each story.
| # | Story | Headline Product Metric | Bridging ML Metric | Guardrail Veto |
|---|---|---|---|---|
| AML-01 | ML-problem framing | Customer-reported pain ≥ stable | (none — ML not chosen yet) | Heuristic-vs-ML break-even |
| AML-02 | Portfolio prioritization | Quarterly experiment-win rate ≥ 40% | Σ(MDE × prior × reachable population) | Ship velocity / opportunity cost |
| AML-03 | Hypothesis & sample size | A/B power ≥ 80% at MDE | Sample-size compliance, peeking discipline | Sequential-test α inflation < 5% |
| AML-04 | Online/offline correlation | Online ΔCTR vs offline ΔNDCG correlation ≥ 0.6 | Offline NDCG@10, online CTR delta | Correlation collapse alarm |
| AML-05 | Business-KPI guardrails | CSAT / retention / GMV non-regression | Model quality (per-system) | Guardrail breach blocks promotion |
| AML-06 | Cohort fairness | Worst-cohort metric ≥ 95% of aggregate | Stratified per-cohort eval | JP cohort or new-user cohort regression > 3% |
| AML-07 | Production integration | p95 turn latency ≤ 800ms | Per-stage latency budget | Fallback engagement < 1% / min |
| AML-08 | Incident triage | Time-to-detect ≤ 15 min, time-to-rollback ≤ 30 min | (diagnostic) | Customer-reported incidents per quarter |
These are the metrics the Applied ML Engineer is on the hook for. They are deliberately not the same metrics the ML Engineer (US-MLE-XX) is on the hook for; the two roles share systems but own different metric surfaces.
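The "power ≥ 80% at MDE" row for AML-03 corresponds to a standard two-proportion sample-size calculation. A minimal sketch using the normal approximation and only the standard library; the baseline CTR and MDE values are illustrative:

```python
import math
from statistics import NormalDist

def samples_per_arm(p_baseline, mde_abs, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sided two-proportion z-test
    (normal approximation; pre-register this before the A/B starts)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96
    z_beta = NormalDist().inv_cdf(power)            # ~0.84
    p1, p2 = p_baseline, p_baseline + mde_abs
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)

# Illustrative: 5% baseline CTR, 0.5pp absolute MDE -> roughly 31k users
# per arm, which is what turns "run it for 7 days" into "this needs 21 days".
n = samples_per_arm(p_baseline=0.05, mde_abs=0.005)
```

Halving the MDE roughly quadruples the required sample; that quadratic relationship is why the AML-03 runtime conversation happens before the experiment starts, not during it.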
Cross-Cutting Concerns Inherited by Every AML Story
| Concern | Why required | Applies to |
|---|---|---|
| Working-Backwards customer letter (1 page) before ML scoping | Forces problem-statement clarity; many "ML projects" dissolve when written in customer-language | AML-01, AML-02 |
| Pre-registered hypothesis & MDE (versioned in experiment platform) | Prevents post-hoc story-fitting and HARKing (hypothesizing after results known) | AML-03, AML-05 |
| Cohort-stratified eval as default (locale, tenure, device) | Aggregate metrics hide cohort regression; veto must be per-cohort | AML-04, AML-05, AML-06 |
| Pre-declared guardrail metrics with thresholds | Without pre-declaration, guardrails are negotiated post-hoc by whoever has political weight | AML-05, AML-07 |
| Sequential-test α-spending plan | Peeking inflates Type-I error from the nominal 5% to roughly 14% with 5 looks, and past 20% with ten or more; an undisciplined PM will peek | AML-03, AML-05 |
| Latency budget contract per chatbot turn (request_id propagation) | Without per-stage budgets, every team blames every other team for p99 regressions | AML-07, AML-08 |
| Incident-triage runbook with named root-cause categories | Random root-causing wastes hours; named categories cut MTTR by 3-5× | AML-08 |
| Six-pager (or PR/FAQ) for any ML feature that ships | Amazon writing-culture default; forces architectural and product clarity | AML-01, AML-02, AML-05 |
These are non-negotiable defaults for every Applied ML Engineer artefact in this folder. Deviations require explicit Engineering Manager + PM sign-off documented in the experiment record.
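The peeking concern above is easy to verify empirically: run many A/A tests (no true effect), peek at interim batches, and count how often at least one interim z-statistic crosses 1.96. A minimal stdlib Monte Carlo sketch; the look count and batch size are illustrative:

```python
import math
import random

random.seed(7)

def aa_test_with_peeking(n_per_look=200, looks=5):
    """One A/A test under the null: an undisciplined analyst 'rejects'
    if ANY of the interim looks shows |z| > 1.96."""
    total, cum = 0, 0.0
    for _ in range(looks):
        for _ in range(n_per_look):
            cum += random.gauss(0, 1)   # null: the treatment effect is zero
            total += 1
        if abs(cum / math.sqrt(total)) > 1.96:
            return True                 # false positive
    return False

trials = 2000
false_positive_rate = sum(aa_test_with_peeking() for _ in range(trials)) / trials
# Well above the nominal 5% -- this is why the α-spending plan is pre-declared.
```

An α-spending plan (e.g. O'Brien-Fleming-style boundaries) keeps the overall Type-I error at 5% while still allowing the interim looks the PM wants.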
Amazon Leadership Principle Map
The Applied ML Engineer role lives at the intersection of Customer Obsession, Dive Deep, and Insist on the Highest Standards. Each story illustrates a specific LP application:
| LP | Story Where It Bites Hardest | Why |
|---|---|---|
| Customer Obsession | AML-01, AML-05 | Working Backwards from customer pain; vetoing model wins on retention regression |
| Ownership | AML-08 | Incident triage; you own the model in production end-to-end |
| Invent and Simplify | AML-01, AML-02 | Choosing a heuristic over ML when ML is overkill |
| Are Right A Lot | AML-02, AML-04 | Portfolio judgment; correlation-collapse calls |
| Learn and Be Curious | AML-04, AML-08 | Why offline ≠ online; what failure-mode is hiding |
| Hire and Develop the Best | (cross-cutting) | Grill chains in 02-* are the artefact for raising the bar |
| Insist on the Highest Standards | AML-03, AML-06 | Statistical discipline; cohort fairness even at velocity cost |
| Think Big | AML-02, AML-07 | Portfolio sized to platform impact, not story impact |
| Bias for Action | AML-02, AML-07 | Velocity vs rigor balance; ship-decision discipline |
| Frugality | AML-02, AML-07 | Smallest experiment that proves the hypothesis; tightest latency budget that meets SLO |
| Earn Trust | AML-05, AML-06 | Veto enforcement, fairness disclosure |
| Dive Deep | AML-03, AML-04, AML-08 | Statistical rigor, correlation diagnostics, root-cause depth |
| Have Backbone; Disagree and Commit | AML-05 | Vetoing a model promotion that "looks like a win" |
| Deliver Results | AML-07 | The model has to actually work in production |
| Strive to be Earth's Best Employer | (out of scope) | — |
| Success and Scale Bring Broad Responsibility | AML-06 | Fairness across locales is a scale-bound responsibility |
This mapping is the lens the Applied ML Engineer uses when defending decisions in a six-pager review or operational meeting (OP1/OP2).
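The veto stance behind Earn Trust and Have Backbone (AML-05, AML-06) reduces to a small, pre-declared gate: an aggregate win cannot override a cohort regression beyond threshold. A minimal sketch with hypothetical cohort names and deltas:

```python
def promotion_decision(aggregate_delta, cohort_deltas, max_regression=0.03):
    """Promote only if the aggregate improves AND no pre-declared cohort
    regresses past the threshold; any such cohort triggers a veto."""
    regressed = {c: d for c, d in cohort_deltas.items() if d < -max_regression}
    if regressed:
        return "veto", regressed    # guardrail wins regardless of aggregate
    return ("promote" if aggregate_delta > 0 else "hold"), {}

# The AML-06 headline scenario: aggregate wins, JP cohort regresses by 8%.
decision, offenders = promotion_decision(
    aggregate_delta=0.02,
    cohort_deltas={"EN": 0.031, "JP": -0.08, "mixed": 0.005},
)
```

Because the threshold is pre-declared, the veto is mechanical rather than negotiated, which is what makes the Have Backbone conversation defensible in a six-pager review.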
Relationship to Other Folders
This folder is part of an interconnected set:
- `ML-Engineer-User-Stories/` — platform lens (owns the systems this folder makes product decisions about)
- `Cost-Optimization-User-Stories/` — FinOps lens (cost-side decisions on the same systems)
- `Cost-Optimization-Offline-Testing/` — offline-eval lens (the format this folder borrows for deep-dives + grill chains)
- `Ground-Truth-Evolution/` — drift lens (why ground truth moves; AML-04 and AML-08 anchor here)
- `POC-to-Production-War-Story/` — failure-narrative lens (AML-08 incident vignettes anchor here)
- `RAG-MCP-Integration/` — architecture lens (where the models live in the chatbot turn pipeline)
- `Domain1-FM-Integration-Data-Compliance/` — architectural-design lens (foundational design-space discussions)
When working on a real product decision: pull the platform owner from ML-Engineer-User-Stories/, the FinOps owner from Cost-Optimization-User-Stories/, and the Applied ML Engineer (you) from this folder. The three artefacts together are the launch readiness package.
How These Stories Were Built
Each story is grounded in a real architectural surface from the project memory: the manga catalog, the JP/EN bilingual traffic, the OpenSearch + Titan + reranker retrieval stack, the WebSocket streaming chatbot turn, the seven production catastrophes from the POC-to-Production War Story. None of the scenarios are abstract textbook examples; every one names the specific component, the specific metric, and the specific failure mode that an Applied ML Engineer at Amazon Japan running MangaAssist would actually face.
The framing — master's-degree data-science depth + Amazon product-engineering DNA — surfaces in two callouts inside every deep-dive section:
- Master's-DS Depth Callout — the statistical or methodological subtlety a non-DS engineer would miss (e.g., why a t-test is wrong here, why CUPED reduces variance, why pre-experiment covariates matter)
- Amazon Product-Lens Callout — the LP framing or six-pager-style reasoning a non-Amazon practitioner would miss (e.g., the customer letter, the input/output metric distinction, the OP1 narrative)
Together they encode the role: rigour without ivory tower, product instinct without hand-waving.