Applied ML Engineer User Stories — MangaAssist Chatbot (Amazon-Scale)
Overview
This directory is the Applied ML Engineer / Product Engineer for ML lens on the MangaAssist chatbot. Where the sibling ML-Engineer-User-Stories/ folder owns the platform lens (training pipelines, drift hub, model registry, retrain cadence, embedding re-index), this folder owns the product lens — translating a customer pain into an ML/AI hypothesis, designing the experiment with statistical discipline, integrating the model into the chatbot turn pipeline, and defending it under business-metric guardrails.
The role distinction matters because the two roles fail in different ways:
- ML Engineer failure mode: drift detection lags, retraining is brittle, the registry is wrong, online serving p99 spikes. Symptoms are infrastructural.
- Applied ML Engineer failure mode: shipped a model that won offline by 5% NDCG and lost 1.2% retention online; ran a 7-day A/B on an effect that needed 21 days to detect; promoted on aggregate CTR while JP cohort regressed by 8%; declared victory before novelty effect washed out. Symptoms are judgment failures, not infra failures.
This folder is the playbook for not making those judgment failures.
The user's framing for this artifact: "Applied ML is product engineer for ML projects where I am taking a business use case and then building a solution with ML and AI for that" — combined with experiment-selection rigor (data-science master's discipline) and Amazon product-engineering DNA (Working Backwards, customer obsession, six-pager culture).
What's In Here
| File | Purpose |
|---|---|
| README.md | Story roster, dependency graph, KPI rollup, role contrast (this file) |
| 00-foundations-and-primitives-for-applied-ml-engineering.md | Seven primitives every Applied ML Engineer carries: Working-Backwards framing, experiment portfolio thinking, hypothesis & study design, online/offline correlation, business-metric guardrails, cohort fairness, incident triage |
| 01-deep-dive-per-applied-ml-story.md | Eight stories: customer pain → hypothesis → experiment design → architecture wiring → rollout plan → metrics → real-incident vignette |
| 02-applied-ml-engineer-grill-chains.md | Interview-style multi-round Q&A drill per story — opening + 4 escalating rounds + 3 architect-level + intuition gained + red-flag/strong-answer markers |
Stories live inside 01-deep-dive-per-applied-ml-story.md (not as separate BDD files). The pattern mirrors Cost-Optimization-Offline-Testing/04-scenario-deep-dives-per-cost-story.md: one section per story, full BDD framing inside the section.
Story Roster
| # | Title | Anchored In | Headline Question | Amazon LP |
|---|---|---|---|---|
| AML-01 | Customer-pain → ML-problem translation | New-reader retention drop in JP cohort | Is this even an ML problem? When is a heuristic enough? | Customer Obsession, Are Right A Lot |
| AML-02 | Experiment portfolio prioritization | 12 candidate ML wins per quarter | Which 3 of 12 do we ship next quarter? Why these? | Bias for Action, Frugality |
| AML-03 | Hypothesis design & sample-size discipline | US-MLE-02 reranker change, US-MLE-06 recsys A/B | What's the MDE, holdout, runtime, stop rule? | Dive Deep, Insist on Highest Standards |
| AML-04 | Online/offline metric decoupling | RAG-MCP-09 (Recall@10 vs CTR), Ground-Truth-Evolution ML-03 | Offline says +5% NDCG, online says flat. Why? | Learn and Be Curious |
| AML-05 | Business-KPI guardrails for promotion | CSAT/GMV/retention as veto signals | When do you NOT ship a model that "looks better"? | Customer Obsession, Ownership |
| AML-06 | Cohort fairness & locale stratification | EN/JP/mixed, new-vs-returning, US-MLE-08 cover-art | Aggregate wins, JP cohort regresses by 8%. Promote? | Earn Trust, Insist on Highest Standards |
| AML-07 | Production integration & latency budgets | RAG-MCP-08 orchestration, US-MLE-02 reranker SLA | Where does the model live in the 800ms turn budget? | Deliver Results, Frugality |
| AML-08 | Incident triage: 'the model got worse' | POC-Production catastrophe #2 (RAG recall collapse) | Reranker quality dropped this morning. Where do you look first? | Dive Deep, Ownership |
The eight stories cover the full lifecycle of a product-applied ML decision: should we build it (AML-01, 02) → how do we test it (AML-03, 04) → when do we ship it (AML-05, 06) → how does it run (AML-07) → what do we do when it breaks (AML-08).
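The "should we build it" gate (AML-02) can be made concrete as an expected-impact score per candidate experiment. A minimal sketch, assuming hypothetical candidate names, priors, and reachable populations; the scoring idea is detectable effect × prior belief it wins × users it can reach:

```python
def expected_impact(candidate):
    """Expected impact of one candidate experiment:
    MDE (detectable effect) x prior (belief it wins) x reachable population."""
    return candidate["mde"] * candidate["prior"] * candidate["population"]

# Hypothetical candidates -- names, priors, and populations are illustrative.
candidates = [
    {"name": "reranker-v2",     "mde": 0.005, "prior": 0.4, "population": 2_000_000},
    {"name": "jp-cover-art",    "mde": 0.010, "prior": 0.2, "population": 400_000},
    {"name": "cold-start-recs", "mde": 0.008, "prior": 0.5, "population": 900_000},
]

# Rank the portfolio, then take the top slice the team can actually run.
ranked = sorted(candidates, key=expected_impact, reverse=True)
top_picks = [c["name"] for c in ranked[:3]]
```

The point of the sketch is not the arithmetic but the discipline: every candidate gets scored on the same three inputs before anyone argues from enthusiasm.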
Dependency & Sequencing Graph
The eight Applied-ML stories sit on top of the platform-ML stories. AML stories do not own training, registries, or drift hubs — they consume them. The dependency graph below shows AML stories (top) anchored to US-MLE platform stories (bottom).
```mermaid
graph TB
    subgraph Applied[Applied ML Engineer Lens — Product Decisions]
        AML01[AML-01<br/>Customer pain<br/>→ ML problem]
        AML02[AML-02<br/>Experiment<br/>portfolio]
        AML03[AML-03<br/>Hypothesis<br/>& sample size]
        AML04[AML-04<br/>Online/offline<br/>decoupling]
        AML05[AML-05<br/>Business-KPI<br/>guardrails]
        AML06[AML-06<br/>Cohort fairness]
        AML07[AML-07<br/>Production<br/>integration]
        AML08[AML-08<br/>Incident triage]
    end
    subgraph Platform[ML Engineer Lens — Platform Foundations]
        MLE01[US-MLE-01 Intent]
        MLE02[US-MLE-02 Reranker]
        MLE05[US-MLE-05 Embedding]
        MLE06[US-MLE-06 Recsys]
        MLE08[US-MLE-08 Cover-art]
        DRIFTHUB[Drift Hub<br/>+ Model Registry]
    end
    AML01 -->|frames the problem for| AML02
    AML02 -->|picks experiments for| AML03
    AML03 -->|sample-size feeds into| AML07
    AML04 -->|correlation gate before| AML05
    AML05 -->|veto signal for| AML07
    AML06 -->|cohort holdout for| AML03
    AML07 -->|telemetry feeds| AML08
    AML08 -->|root-cause feeds back into| AML01
    AML03 -.consumes.-> MLE02
    AML03 -.consumes.-> MLE06
    AML04 -.consumes.-> MLE05
    AML06 -.consumes.-> MLE08
    AML07 -.consumes.-> MLE01
    AML08 -.consumes.-> DRIFTHUB
    classDef appl fill:#9cf,stroke:#333,stroke-width:2px
    classDef plat fill:#fd2,stroke:#333
    class AML01,AML02,AML03,AML04,AML05,AML06,AML07,AML08 appl
    class MLE01,MLE02,MLE05,MLE06,MLE08,DRIFTHUB plat
```
Reading paths:
- First time through — read `00-foundations-and-primitives-for-applied-ml-engineering.md` end-to-end. Then pick a scenario in `01-deep-dive-per-applied-ml-story.md` that matches a real product decision you face. Then drill yourself with the matching grill chain in `02-applied-ml-engineer-grill-chains.md`.
- Interview prep (Amazon Applied Scientist / Applied ML Engineer loop) — go through grill chains AML-01 → AML-08 in order, scoring yourself against the red-flag and strong-answer markers. Most candidates fail on AML-04 (online/offline decoupling) and AML-05 (guardrails) — those are the two stories with the highest signal-to-noise for senior-level evaluation.
- Working a real launch — start with AML-02 (portfolio prioritization) to defend that this experiment is the right one to run; then AML-03 (hypothesis & sample size); then AML-05 (guardrails); then AML-07 (integration). The other stories are diagnostic, not prescriptive.
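The AML-04 correlation gate (trust offline metrics as proxies only while they still predict online movement) is just a Pearson correlation over the history of past experiment pairs. A minimal stdlib sketch; the experiment history below is hypothetical:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between paired offline and online metric deltas."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical history: per-experiment offline ΔNDCG@10 vs online ΔCTR.
offline_ndcg = [0.05, 0.02, -0.01, 0.03, 0.00, 0.04]
online_ctr   = [0.012, 0.004, -0.003, 0.006, 0.001, 0.009]

r = pearson(offline_ndcg, online_ctr)
gate_open = r >= 0.6   # below 0.6, stop trusting offline wins (AML-04 alarm)
```

When `gate_open` flips to false, the right response is not to ship anyway; it is to re-derive why the offline metric stopped predicting the online one.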
Owner Mapping
| Story | Suggested Owner Role | Partners |
|---|---|---|
| AML-01 | Applied ML Engineer + Product Manager | Customer-Insights Researcher, Business Analyst |
| AML-02 | Applied ML Engineer (the role's defining accountability) | Engineering Manager, PM, Data Scientist |
| AML-03 | Applied ML Engineer + Data Scientist | Statistician (light review), Experiment-Platform team |
| AML-04 | Applied ML Engineer + Data Scientist | RAG/Retrieval ML Engineer (US-MLE-05 owner) |
| AML-05 | Applied ML Engineer + PM + Business stakeholder | Finance / FinOps Lead (for GMV guardrails) |
| AML-06 | Applied ML Engineer + Trust & Safety + Localisation | Per-locale PM (JP / EN) |
| AML-07 | Applied ML Engineer + Backend Eng (chatbot turn pipeline) | SRE, MCP Owner |
| AML-08 | Applied ML Engineer (on-call rotation primary) | ML Engineer platform owner, SRE |
The Applied ML Engineer holds the cross-cutting product-decision contract: every model promotion ultimately requires their sign-off, and the artefacts in this folder are the evidence base for that sign-off.
Unified KPI Rollup
A senior leader (Eng Manager, Director, GM) should be able to scan this table and know what the role is accountable for on each story.
| # | Story | Headline Product Metric | Bridging ML Metric | Guardrail Veto |
|---|---|---|---|---|
| AML-01 | ML-problem framing | Customer-reported pain ≥ stable | (none — ML not chosen yet) | Heuristic-vs-ML break-even |
| AML-02 | Portfolio prioritization | Quarterly experiment-win rate ≥ 40% | Σ(MDE × prior × reachable population) | Ship velocity / opportunity cost |
| AML-03 | Hypothesis & sample size | A/B power ≥ 80% at MDE | Sample-size compliance, peeking discipline | Sequential-test α inflation < 5% |
| AML-04 | Online/offline correlation | Online ΔCTR vs offline ΔNDCG correlation ≥ 0.6 | Offline NDCG@10, online CTR delta | Correlation collapse alarm |
| AML-05 | Business-KPI guardrails | CSAT / retention / GMV non-regression | Model quality (per-system) | Guardrail breach blocks promotion |
| AML-06 | Cohort fairness | Worst-cohort metric ≥ 95% of aggregate | Stratified per-cohort eval | JP cohort or new-user cohort regression > 3% |
| AML-07 | Production integration | p95 turn latency ≤ 800ms | Per-stage latency budget | Fallback engagement < 1% / min |
| AML-08 | Incident triage | Time-to-detect ≤ 15 min, time-to-rollback ≤ 30 min | (diagnostic) | Customer-reported incidents per quarter |
These are the metrics the Applied ML Engineer is on the hook for. They are deliberately not the same metrics the ML Engineer (US-MLE-XX) is on the hook for; the two roles share systems but own different metric surfaces.
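The "power ≥ 80% at MDE" row for AML-03 corresponds to a standard two-proportion sample-size calculation. A minimal sketch using the normal approximation and only the standard library; the baseline CTR and MDE values are illustrative:

```python
import math
from statistics import NormalDist

def samples_per_arm(p_baseline, mde_abs, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sided two-proportion z-test
    (normal approximation; pre-register this before the A/B starts)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96
    z_beta = NormalDist().inv_cdf(power)            # ~0.84
    p1, p2 = p_baseline, p_baseline + mde_abs
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)

# Illustrative: 5% baseline CTR, 0.5pp absolute MDE -> roughly 31k users
# per arm, which is what turns "run it for 7 days" into "this needs 21 days".
n = samples_per_arm(p_baseline=0.05, mde_abs=0.005)
```

Halving the MDE roughly quadruples the required sample; that quadratic relationship is why the AML-03 runtime conversation happens before the experiment starts, not during it.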
Cross-Cutting Concerns Inherited by Every AML Story
| Concern | Why required | Applies to |
|---|---|---|
| Working-Backwards customer letter (1 page) before ML scoping | Forces problem-statement clarity; many "ML projects" dissolve when written in customer-language | AML-01, AML-02 |
| Pre-registered hypothesis & MDE (versioned in experiment platform) | Prevents post-hoc story-fitting and HARKing (hypothesizing after results known) | AML-03, AML-05 |
| Cohort-stratified eval as default (locale, tenure, device) | Aggregate metrics hide cohort regression; veto must be per-cohort | AML-04, AML-05, AML-06 |
| Pre-declared guardrail metrics with thresholds | Without pre-declaration, guardrails are negotiated post-hoc by whoever has political weight | AML-05, AML-07 |
| Sequential-test α-spending plan | Peeking inflates Type-I error from the nominal 5% to roughly 14% with 5 looks, and past 20% with ten or more; an undisciplined PM will peek | AML-03, AML-05 |
| Latency budget contract per chatbot turn (request_id propagation) | Without per-stage budgets, every team blames every other team for p99 regressions | AML-07, AML-08 |
| Incident-triage runbook with named root-cause categories | Random root-causing wastes hours; named categories cut MTTR by 3-5× | AML-08 |
| Six-pager (or PR/FAQ) for any ML feature that ships | Amazon writing-culture default; forces architectural and product clarity | AML-01, AML-02, AML-05 |
These are non-negotiable defaults for every Applied ML Engineer artefact in this folder. Deviations require explicit Engineering Manager + PM sign-off documented in the experiment record.
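The peeking concern above is easy to verify empirically: run many A/A tests (no true effect), peek at interim batches, and count how often at least one interim z-statistic crosses 1.96. A minimal stdlib Monte Carlo sketch; the look count and batch size are illustrative:

```python
import math
import random

random.seed(7)

def aa_test_with_peeking(n_per_look=200, looks=5):
    """One A/A test under the null: an undisciplined analyst 'rejects'
    if ANY of the interim looks shows |z| > 1.96."""
    total, cum = 0, 0.0
    for _ in range(looks):
        for _ in range(n_per_look):
            cum += random.gauss(0, 1)   # null: the treatment effect is zero
            total += 1
        if abs(cum / math.sqrt(total)) > 1.96:
            return True                 # false positive
    return False

trials = 2000
false_positive_rate = sum(aa_test_with_peeking() for _ in range(trials)) / trials
# Well above the nominal 5% -- this is why the α-spending plan is pre-declared.
```

An α-spending plan (e.g. O'Brien-Fleming-style boundaries) keeps the overall Type-I error at 5% while still allowing the interim looks the PM wants.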
Amazon Leadership Principle Map
The Applied ML Engineer role lives at the intersection of Customer Obsession, Dive Deep, and Insist on the Highest Standards. Each story illustrates a specific LP application:
| LP | Story Where It Bites Hardest | Why |
|---|---|---|
| Customer Obsession | AML-01, AML-05 | Working Backwards from customer pain; vetoing model wins on retention regression |
| Ownership | AML-08 | Incident triage; you own the model in production end-to-end |
| Invent and Simplify | AML-01, AML-02 | Choosing a heuristic over ML when ML is overkill |
| Are Right A Lot | AML-02, AML-04 | Portfolio judgment; correlation-collapse calls |
| Learn and Be Curious | AML-04, AML-08 | Why offline ≠ online; what failure-mode is hiding |
| Hire and Develop the Best | (cross-cutting) | Grill chains in 02-* are the artefact for raising the bar |
| Insist on the Highest Standards | AML-03, AML-06 | Statistical discipline; cohort fairness even at velocity cost |
| Think Big | AML-02, AML-07 | Portfolio sized to platform impact, not story impact |
| Bias for Action | AML-02, AML-07 | Velocity vs rigor balance; ship-decision discipline |
| Frugality | AML-02, AML-07 | Smallest experiment that proves the hypothesis; tightest latency budget that meets SLO |
| Earn Trust | AML-05, AML-06 | Veto enforcement, fairness disclosure |
| Dive Deep | AML-03, AML-04, AML-08 | Statistical rigor, correlation diagnostics, root-cause depth |
| Have Backbone; Disagree and Commit | AML-05 | Vetoing a model promotion that "looks like a win" |
| Deliver Results | AML-07 | The model has to actually work in production |
| Strive to be Earth's Best Employer | (out of scope) | — |
| Success and Scale Bring Broad Responsibility | AML-06 | Fairness across locales is a scale-bound responsibility |
This mapping is the lens the Applied ML Engineer uses when defending decisions in a six-pager review or operational meeting (OP1/OP2).
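The veto stance behind Earn Trust and Have Backbone (AML-05, AML-06) reduces to a small, pre-declared gate: an aggregate win cannot override a cohort regression beyond threshold. A minimal sketch with hypothetical cohort names and deltas:

```python
def promotion_decision(aggregate_delta, cohort_deltas, max_regression=0.03):
    """Promote only if the aggregate improves AND no pre-declared cohort
    regresses past the threshold; any such cohort triggers a veto."""
    regressed = {c: d for c, d in cohort_deltas.items() if d < -max_regression}
    if regressed:
        return "veto", regressed    # guardrail wins regardless of aggregate
    return ("promote" if aggregate_delta > 0 else "hold"), {}

# The AML-06 headline scenario: aggregate wins, JP cohort regresses by 8%.
decision, offenders = promotion_decision(
    aggregate_delta=0.02,
    cohort_deltas={"EN": 0.031, "JP": -0.08, "mixed": 0.005},
)
```

Because the threshold is pre-declared, the veto is mechanical rather than negotiated, which is what makes the Have Backbone conversation defensible in a six-pager review.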
Relationship to Other Folders
This folder is part of an interconnected set:
- `ML-Engineer-User-Stories/` — platform lens (owns the systems this folder makes product decisions about)
- `Cost-Optimization-User-Stories/` — FinOps lens (cost-side decisions on the same systems)
- `Cost-Optimization-Offline-Testing/` — offline-eval lens (the format this folder borrows for deep-dives + grill chains)
- `Ground-Truth-Evolution/` — drift lens (why ground truth moves; AML-04 and AML-08 anchor here)
- `POC-to-Production-War-Story/` — failure-narrative lens (AML-08 incident vignettes anchor here)
- `RAG-MCP-Integration/` — architecture lens (where the models live in the chatbot turn pipeline)
- `Domain1-FM-Integration-Data-Compliance/` — architectural-design lens (foundational design-space discussions)
When working on a real product decision: pull the platform owner from ML-Engineer-User-Stories/, the FinOps owner from Cost-Optimization-User-Stories/, and the Applied ML Engineer (you) from this folder. The three artefacts together are the launch readiness package.
How These Stories Were Built
Each story is grounded in a real architectural surface from the project memory: the manga catalog, the JP/EN bilingual traffic, the OpenSearch + Titan + reranker retrieval stack, the WebSocket streaming chatbot turn, the seven production catastrophes from the POC-to-Production War Story. None of the scenarios are abstract textbook examples; every one names the specific component, the specific metric, and the specific failure mode that an Applied ML Engineer at Amazon Japan running MangaAssist would actually face.
The framing — master's-degree data-science depth + Amazon product-engineering DNA — surfaces in two callouts inside every deep-dive section:
- Master's-DS Depth Callout — the statistical or methodological subtlety a non-DS engineer would miss (e.g., why a t-test is wrong here, why CUPED reduces variance, why pre-experiment covariates matter)
- Amazon Product-Lens Callout — the LP framing or six-pager-style reasoning a non-Amazon practitioner would miss (e.g., the customer letter, the input/output metric distinction, the OP1 narrative)
Together they encode the role: rigour without ivory tower, product instinct without hand-waving.