Applied ML Engineer User Stories — MangaAssist Chatbot (Amazon-Scale)

Overview

This directory is the Applied ML Engineer / Product Engineer for ML lens on the MangaAssist chatbot. Where the sibling ML-Engineer-User-Stories/ folder owns the platform lens (training pipelines, drift hub, model registry, retrain cadence, embedding re-index), this folder owns the product lens — translating a customer pain into an ML/AI hypothesis, designing the experiment with statistical discipline, integrating the model into the chatbot turn pipeline, and defending it under business-metric guardrails.

The role distinction matters because the two roles fail in different ways:

  • ML Engineer failure mode: drift detection lags, retraining is brittle, the registry is wrong, online serving p99 spikes. Symptoms are infrastructural.
  • Applied ML Engineer failure mode: shipped a model that won offline by 5% NDCG and lost 1.2% retention online; ran a 7-day A/B on an effect that needed 21 days to detect; promoted on aggregate CTR while JP cohort regressed by 8%; declared victory before novelty effect washed out. Symptoms are judgment failures, not infra failures.
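
That second failure mode (stopping a 21-day experiment at day 7) is avoidable with one back-of-envelope calculation before launch. A minimal sketch, assuming a two-proportion retention test; the traffic and effect sizes are hypothetical, not taken from the project:

```python
# Sample-size sanity check for a two-proportion A/B test.
# All numbers are hypothetical, not taken from the MangaAssist project.
from scipy.stats import norm

def required_days(baseline=0.30, mde=0.005, alpha=0.05, power=0.80,
                  daily_users_per_arm=6_000):
    """Days of traffic needed to detect an absolute lift `mde` over `baseline`."""
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for a two-sided 5% test
    z_beta = norm.ppf(power)            # 0.84 for 80% power
    p_bar = baseline + mde / 2          # average rate across the two arms
    n_per_arm = (2 * (z_alpha + z_beta) ** 2
                 * p_bar * (1 - p_bar) / mde ** 2)
    return n_per_arm / daily_users_per_arm

print(f"required runtime ≈ {required_days():.0f} days")
# ≈ 22 days at these assumptions -- a 7-day run cannot detect this effect.
```

At these assumptions the experiment needs about three weeks of traffic; a 7-day stop is not cautious, it is uninformative.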

This folder is the playbook for not making those judgment failures.

This artefact combines the user's framing — "Applied ML is product engineer for ML projects where I am taking a business use case and then building a solution with ML and AI for that" — with experiment-selection rigor (data-science master's discipline) and Amazon product-engineering DNA (Working Backwards, customer obsession, six-pager culture).

What's In Here

| File | Purpose |
| --- | --- |
| README.md | Story roster, dependency graph, KPI rollup, role contrast (this file) |
| 00-foundations-and-primitives-for-applied-ml-engineering.md | Seven primitives every Applied ML Engineer carries: Working-Backwards framing, experiment portfolio thinking, hypothesis & study design, online/offline correlation, business-metric guardrails, cohort fairness, incident triage |
| 01-deep-dive-per-applied-ml-story.md | Eight stories: customer pain → hypothesis → experiment design → architecture wiring → rollout plan → metrics → real-incident vignette |
| 02-applied-ml-engineer-grill-chains.md | Interview-style multi-round Q&A drill per story — opening + 4 escalating rounds + 3 architect-level + intuition gained + red-flag/strong-answer markers |

Stories live inside 01-deep-dive-per-applied-ml-story.md (not as separate BDD files). The pattern mirrors Cost-Optimization-Offline-Testing/04-scenario-deep-dives-per-cost-story.md: one section per story, full BDD framing inside the section.

Story Roster

| # | Title | Anchored In | Headline Question | Amazon LP |
| --- | --- | --- | --- | --- |
| AML-01 | Customer-pain → ML-problem translation | New-reader retention drop in JP cohort | Is this even an ML problem? When is a heuristic enough? | Customer Obsession, Are Right A Lot |
| AML-02 | Experiment portfolio prioritization | 12 candidate ML wins per quarter | Which 3 of 12 do we ship next quarter? Why these? | Bias for Action, Frugality |
| AML-03 | Hypothesis design & sample-size discipline | US-MLE-02 reranker change, US-MLE-06 recsys A/B | What's the MDE, holdout, runtime, stop rule? | Dive Deep, Insist on the Highest Standards |
| AML-04 | Online/offline metric decoupling | RAG-MCP-09 (Recall@10 vs CTR), Ground-Truth-Evolution ML-03 | Offline says +5% NDCG, online says flat. Why? | Learn and Be Curious |
| AML-05 | Business-KPI guardrails for promotion | CSAT/GMV/retention as veto signals | When do you NOT ship a model that "looks better"? | Customer Obsession, Ownership |
| AML-06 | Cohort fairness & locale stratification | EN/JP/mixed, new-vs-returning, US-MLE-08 cover-art | Aggregate wins, JP cohort regresses by 8%. Promote? | Earn Trust, Insist on the Highest Standards |
| AML-07 | Production integration & latency budgets | RAG-MCP-08 orchestration, US-MLE-02 reranker SLA | Where does the model live in the 800ms turn budget? | Deliver Results, Frugality |
| AML-08 | Incident triage: "the model got worse" | POC-Production catastrophe #2 (RAG recall collapse) | Reranker quality dropped this morning. Where do you look first? | Dive Deep, Ownership |

The eight stories cover the full lifecycle of a product-applied ML decision: should we build it (AML-01, 02) → how do we test it (AML-03, 04) → when do we ship it (AML-05, 06) → how does it run (AML-07) → what do we do when it breaks (AML-08).
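
AML-02's "which 3 of 12" question is ultimately a ranking problem. A toy sketch of one way to score it, using the Σ(MDE × prior × reachable population) shape from the KPI rollup below divided by runtime; every candidate name and number here is invented:

```python
# Toy portfolio-prioritization score for AML-02: rank candidate experiments
# by prior-weighted reachable impact per experiment-day. Candidates and
# numbers are invented; the score shape mirrors the KPI-rollup formula.
candidates = [
    # (name, mde, prior_win_prob, reachable_users, est_runtime_days)
    ("reranker-v3",       0.006, 0.5, 400_000, 21),
    ("jp-cover-art-recs", 0.010, 0.3, 120_000, 14),
    ("intent-heuristic",  0.004, 0.7, 800_000,  7),
    ("recsys-diversity",  0.008, 0.2, 500_000, 28),
]

def score(mde, prior, reach, days):
    return mde * prior * reach / days   # expected impact per experiment-day

ranked = sorted(candidates, key=lambda c: score(*c[1:]), reverse=True)
for name, *params in ranked[:3]:        # the 3 we ship next quarter
    print(f"{name}: score={score(*params):,.1f}")
```

The division by runtime is the Frugality lever: a cheap, high-prior heuristic test can out-rank a flashier model change.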

Dependency & Sequencing Graph

The eight Applied-ML stories sit on top of the platform-ML stories. AML stories do not own training, registries, or drift hubs — they consume them. The dependency graph below shows AML stories (top) anchored to US-MLE platform stories (bottom).

graph TB
    subgraph Applied[Applied ML Engineer Lens — Product Decisions]
        AML01[AML-01<br/>Customer pain<br/>→ ML problem]
        AML02[AML-02<br/>Experiment<br/>portfolio]
        AML03[AML-03<br/>Hypothesis<br/>& sample size]
        AML04[AML-04<br/>Online/offline<br/>decoupling]
        AML05[AML-05<br/>Business-KPI<br/>guardrails]
        AML06[AML-06<br/>Cohort fairness]
        AML07[AML-07<br/>Production<br/>integration]
        AML08[AML-08<br/>Incident triage]
    end

    subgraph Platform[ML Engineer Lens — Platform Foundations]
        MLE01[US-MLE-01 Intent]
        MLE02[US-MLE-02 Reranker]
        MLE05[US-MLE-05 Embedding]
        MLE06[US-MLE-06 Recsys]
        MLE08[US-MLE-08 Cover-art]
        DRIFTHUB[Drift Hub<br/>+ Model Registry]
    end

    AML01 -->|frames the problem for| AML02
    AML02 -->|picks experiments for| AML03
    AML03 -->|sample-size feeds into| AML07
    AML04 -->|correlation gate before| AML05
    AML05 -->|veto signal for| AML07
    AML06 -->|cohort holdout for| AML03
    AML07 -->|telemetry feeds| AML08
    AML08 -->|root-cause feeds back into| AML01

    AML03 -.consumes.-> MLE02
    AML03 -.consumes.-> MLE06
    AML04 -.consumes.-> MLE05
    AML06 -.consumes.-> MLE08
    AML07 -.consumes.-> MLE01
    AML08 -.consumes.-> DRIFTHUB

    classDef appl fill:#9cf,stroke:#333,stroke-width:2px
    classDef plat fill:#fd2,stroke:#333
    class AML01,AML02,AML03,AML04,AML05,AML06,AML07,AML08 appl
    class MLE01,MLE02,MLE05,MLE06,MLE08,DRIFTHUB plat

Reading paths:

  1. First time through — read 00-foundations-and-primitives-for-applied-ml-engineering.md end-to-end. Then pick a scenario in 01-deep-dive-per-applied-ml-story.md that matches a real product decision you face. Then drill yourself with the matching grill chain in 02-applied-ml-engineer-grill-chains.md.
  2. Interview prep (Amazon Applied Scientist / Applied ML Engineer loop) — go through grill chains AML-01 → AML-08 in order, scoring yourself against the red-flag and strong-answer markers. Most candidates fail on AML-04 (online/offline decoupling) and AML-05 (guardrails) — those are the two stories with the highest signal-to-noise for senior-level evaluation.
  3. Working a real launch — start with AML-02 (portfolio prioritization) to defend that this experiment is the right one to run; then AML-03 (hypothesis & sample size); then AML-05 (guardrails); then AML-07 (integration). The other stories are diagnostic, not prescriptive.

Owner Mapping

| Story | Suggested Owner Role | Partners |
| --- | --- | --- |
| AML-01 | Applied ML Engineer + Product Manager | Customer-Insights Researcher, Business Analyst |
| AML-02 | Applied ML Engineer (the role's defining accountability) | Engineering Manager, PM, Data Scientist |
| AML-03 | Applied ML Engineer + Data Scientist | Statistician (light review), Experiment-Platform team |
| AML-04 | Applied ML Engineer + Data Scientist | RAG/Retrieval ML Engineer (US-MLE-05 owner) |
| AML-05 | Applied ML Engineer + PM + Business stakeholder | Finance / FinOps Lead (for GMV guardrails) |
| AML-06 | Applied ML Engineer + Trust & Safety + Localisation | Per-locale PM (JP / EN) |
| AML-07 | Applied ML Engineer + Backend Eng (chatbot turn pipeline) | SRE, MCP Owner |
| AML-08 | Applied ML Engineer (on-call rotation primary) | ML Engineer platform owner, SRE |

The Applied ML Engineer holds the cross-cutting product-decision contract: every model promotion ultimately requires their sign-off, and the artefacts in this folder are the evidence base for that sign-off.

Unified KPI Rollup

A senior leader (Eng Manager, Director, GM) should be able to scan this table and know what the role is accountable for on each story.

| # | Story | Headline Product Metric | Bridging ML Metric | Guardrail Veto |
| --- | --- | --- | --- | --- |
| AML-01 | ML-problem framing | Customer-reported pain ≥ stable | (none — ML not chosen yet) | Heuristic-vs-ML break-even |
| AML-02 | Portfolio prioritization | Quarterly experiment-win rate ≥ 40% | Σ(MDE × prior × reachable population) | Ship velocity / opportunity cost |
| AML-03 | Hypothesis & sample size | A/B power ≥ 80% at MDE | Sample-size compliance, peeking discipline | Sequential-test α inflation < 5% |
| AML-04 | Online/offline correlation | Online ΔCTR vs offline ΔNDCG correlation ≥ 0.6 | Offline NDCG@10, online CTR delta | Correlation-collapse alarm |
| AML-05 | Business-KPI guardrails | CSAT / retention / GMV non-regression | Model quality (per-system) | Guardrail breach blocks promotion |
| AML-06 | Cohort fairness | Worst-cohort metric ≥ 95% of aggregate | Stratified per-cohort eval | JP-cohort or new-user-cohort regression > 3% |
| AML-07 | Production integration | p95 turn latency ≤ 800ms | Per-stage latency budget | Fallback engagement < 1% / min |
| AML-08 | Incident triage | Time-to-detect ≤ 15 min, time-to-rollback ≤ 30 min | (diagnostic) | Customer-reported incidents per quarter |
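
The AML-04 gate is mechanically simple, but it has to run over the accumulated history of launches rather than a single experiment. A minimal sketch of it, where the 0.6 threshold comes from the table and the experiment pairs and the choice of rank correlation are assumptions:

```python
# Minimal sketch of the AML-04 offline/online correlation gate. The 0.6
# threshold is from the KPI rollup; the data pairs are hypothetical.
from scipy.stats import spearmanr

# (offline ΔNDCG@10, online ΔCTR) pairs from past reranker experiments
history = [(+0.050, +0.011), (+0.021, +0.004), (-0.010, -0.006),
           (+0.034, +0.000), (+0.008, +0.003), (+0.042, +0.009)]

offline_deltas, online_deltas = zip(*history)
rho, p_value = spearmanr(offline_deltas, online_deltas)

if rho < 0.6:
    print(f"ALARM: offline/online correlation collapsed (ρ={rho:.2f}) -- "
          "offline NDCG wins no longer predict CTR; audit the eval set.")
else:
    print(f"Correlation gate holds (ρ={rho:.2f}); offline eval stays predictive.")
```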

The metrics in the rollup above are the ones the Applied ML Engineer is on the hook for. They are deliberately not the same metrics the ML Engineer (US-MLE-XX) is on the hook for; the two roles share systems but own different metric surfaces.
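
The AML-07 per-stage latency budget only bites if it is written down as an executable contract rather than a slide. A sketch of what that contract could look like, where the stage names and millisecond splits are illustrative assumptions and only the 800 ms turn budget comes from the stories above:

```python
# A per-stage latency budget as an executable contract for the 800 ms
# chatbot turn (AML-07). Stage names and millisecond splits are
# illustrative assumptions, not the project's actual budget.
TURN_BUDGET_MS = 800  # p95 budget for one end-to-end chatbot turn

STAGE_BUDGETS_MS = {          # each stage owner answers for its own slice
    "intent_classification": 60,
    "retrieval": 250,
    "reranker": 120,
    "generation": 300,
    "postprocess_stream": 70,
}
assert sum(STAGE_BUDGETS_MS.values()) == TURN_BUDGET_MS

def over_budget(stage_timings_ms: dict[str, float]) -> list[str]:
    """Stages that blew their slice for one request_id-traced turn."""
    return [stage for stage, spent in stage_timings_ms.items()
            if spent > STAGE_BUDGETS_MS[stage]]

# One traced turn: retrieval ate into the budget -> only retrieval is blamed.
print(over_budget({"intent_classification": 45, "retrieval": 310,
                   "reranker": 95, "generation": 280,
                   "postprocess_stream": 50}))   # -> ['retrieval']
```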

Cross-Cutting Concerns Inherited by Every AML Story

| Concern | Why required | Applies to |
| --- | --- | --- |
| Working-Backwards customer letter (1 page) before ML scoping | Forces problem-statement clarity; many "ML projects" dissolve when written in customer language | AML-01, AML-02 |
| Pre-registered hypothesis & MDE (versioned in the experiment platform) | Prevents post-hoc story-fitting and HARKing (hypothesizing after the results are known) | AML-03, AML-05 |
| Cohort-stratified eval as default (locale, tenure, device) | Aggregate metrics hide cohort regression; the veto must be per-cohort | AML-04, AML-05, AML-06 |
| Pre-declared guardrail metrics with thresholds | Without pre-declaration, guardrails are negotiated post hoc by whoever has political weight | AML-05, AML-07 |
| Sequential-test α-spending plan | Unplanned peeking inflates Type-I error from a nominal 5% to roughly 14% over five looks, and past 20% with more frequent looks; an undisciplined PM will peek | AML-03, AML-05 |
| Latency budget contract per chatbot turn (request_id propagation) | Without per-stage budgets, every team blames every other team for p99 regressions | AML-07, AML-08 |
| Incident-triage runbook with named root-cause categories | Random root-causing wastes hours; named categories cut MTTR by 3-5× | AML-08 |
| Six-pager (or PR/FAQ) for any ML feature that ships | Amazon writing-culture default; forces architectural and product clarity | AML-01, AML-02, AML-05 |

These are non-negotiable defaults for every Applied ML Engineer artefact in this folder. Deviations require explicit Engineering Manager + PM sign-off documented in the experiment record.
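
The α-spending row is the one most often challenged, so it helps to carry the simulation that justifies it. A small A/A sketch, assuming a z-test applied at five equally spaced interim looks (all sizes arbitrary):

```python
# A/A simulation of why peeking needs an α-spending plan: testing at a
# nominal α = 0.05 on each of several interim looks inflates the overall
# false-positive rate well past 5%. Pure-numpy sketch; sizes are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n_sims, n_per_arm, looks = 2_000, 10_000, 5
z_crit = 1.96  # two-sided threshold at nominal alpha = 0.05

false_positives = 0
for _ in range(n_sims):
    a = rng.normal(size=n_per_arm)   # A/A: both arms from the same distribution
    b = rng.normal(size=n_per_arm)
    for k in range(1, looks + 1):    # peek at 5 equally spaced interim points
        n = k * n_per_arm // looks
        z = (a[:n].mean() - b[:n].mean()) / np.sqrt(2 / n)
        if abs(z) > z_crit:          # "significant" at this peek -> declare a win
            false_positives += 1
            break

print(f"Type-I error with {looks} peeks: {false_positives / n_sims:.1%}")
# Typically lands near 14% vs the nominal 5%; more frequent peeking exceeds 20%.
```

Because both arms are identical, every declared "win" is a false positive; a pre-declared α-spending plan (e.g., Pocock or O'Brien-Fleming boundaries) is what keeps the nominal 5% honest.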

Amazon Leadership Principle Map

The Applied ML Engineer role lives at the intersection of Customer Obsession, Dive Deep, and Insist on the Highest Standards. Each story illustrates a specific LP application:

| LP | Story Where It Bites Hardest | Why |
| --- | --- | --- |
| Customer Obsession | AML-01, AML-05 | Working Backwards from customer pain; vetoing model wins on retention regression |
| Ownership | AML-08 | Incident triage; you own the model in production end-to-end |
| Invent and Simplify | AML-01, AML-02 | Choosing a heuristic over ML when ML is overkill |
| Are Right A Lot | AML-02, AML-04 | Portfolio judgment; correlation-collapse calls |
| Learn and Be Curious | AML-04, AML-08 | Why offline ≠ online; what failure mode is hiding |
| Hire and Develop the Best | (cross-cutting) | Grill chains in 02-* are the artefact for raising the bar |
| Insist on the Highest Standards | AML-03, AML-06 | Statistical discipline; cohort fairness even at velocity cost |
| Think Big | AML-02, AML-07 | Portfolio sized to platform impact, not story impact |
| Bias for Action | AML-02, AML-07 | Velocity-vs-rigor balance; ship-decision discipline |
| Frugality | AML-02, AML-07 | Smallest experiment that proves the hypothesis; tightest latency budget that meets SLO |
| Earn Trust | AML-05, AML-06 | Veto enforcement, fairness disclosure |
| Dive Deep | AML-03, AML-04, AML-08 | Statistical rigor, correlation diagnostics, root-cause depth |
| Have Backbone; Disagree and Commit | AML-05 | Vetoing a model promotion that "looks like a win" |
| Deliver Results | AML-07 | The model has to actually work in production |
| Strive to be Earth's Best Employer | (out of scope) | |
| Success and Scale Bring Broad Responsibility | AML-06 | Fairness across locales is a scale-bound responsibility |

This mapping is the lens the Applied ML Engineer uses when defending decisions in a six-pager review or an OP1/OP2 operational-planning narrative.

Relationship to Other Folders

This folder is part of an interconnected set of role-lens folders.

When working on a real product decision: pull the platform owner from ML-Engineer-User-Stories/, the FinOps owner from Cost-Optimization-User-Stories/, and the Applied ML Engineer (you) from this folder. The three artefacts together are the launch readiness package.

How These Stories Were Built

Each story is grounded in a real architectural surface from the project memory: the manga catalog, the JP/EN bilingual traffic, the OpenSearch + Titan + reranker retrieval stack, the WebSocket streaming chatbot turn, the seven production catastrophes from the POC-to-Production War Story. None of the scenarios are abstract textbook examples; every one names the specific component, the specific metric, and the specific failure mode that an Applied ML Engineer at Amazon Japan running MangaAssist would actually face.

The framing — master's-degree data-science depth + Amazon product-engineering DNA — surfaces in two callouts inside every deep-dive section:

  • Master's-DS Depth Callout — the statistical or methodological subtlety a non-DS engineer would miss (e.g., why a t-test is wrong here, why CUPED reduces variance, why pre-experiment covariates matter); a CUPED sketch follows this list
  • Amazon Product-Lens Callout — the LP framing or six-pager-style reasoning a non-Amazon practitioner would miss (e.g., the customer letter, the input/output metric distinction, the OP1 narrative)
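
As an example of the first callout, here is a minimal CUPED sketch on synthetic data; the covariate choice and all numbers are illustrative assumptions:

```python
# Minimal CUPED sketch: use a pre-experiment covariate X (e.g., a user's
# pre-period session count) to shrink the variance of the experiment
# metric Y without shifting its mean. Data here is synthetic.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(10, 3, size=50_000)           # pre-experiment covariate
y = 0.8 * x + rng.normal(0, 2, size=50_000)  # in-experiment metric, correlated with x

theta = np.cov(x, y)[0, 1] / np.var(x)       # covariance over covariate variance
y_cuped = y - theta * (x - x.mean())         # same mean as y, smaller variance

print(f"var(y) = {y.var():.2f}, var(y_cuped) = {y_cuped.var():.2f}")
```

The variance drops by roughly corr(X, Y)², which is exactly why pre-experiment covariates matter: the same MDE becomes detectable with fewer users or fewer days of runtime.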

Together they encode the role: rigour without ivory tower, product instinct without hand-waving.