
Per-Story Deep Dive — Applied ML Engineer

How to Use This Document

This file contains eight scenarios — AML-01 through AML-08 — each one a self-contained product-applied ML decision. Stories live inside this file (not as separate BDD files). Each section starts with the BDD framing, then walks the customer pain, the hypothesis, the experiment design, the architecture, the rollout, the dashboard, and a real-incident vignette grounded in the MangaAssist project.

Pair each scenario with the matching grill chain in 02-applied-ml-engineer-grill-chains.md for self-drilling. Read the foundations doc 00-foundations-and-primitives-for-applied-ml-engineering.md first if you haven't — the seven primitives are the substrate every scenario uses.

Story Roster

ID Title Headline question
AML-01 Customer-pain → ML-problem translation Is this even an ML problem?
AML-02 Experiment portfolio prioritization Which 3 of 12?
AML-03 Hypothesis design & sample-size discipline MDE, holdout, runtime, stop rule?
AML-04 Online/offline metric decoupling Offline +5%, online flat — why?
AML-05 Business-KPI guardrails for promotion When do you NOT ship?
AML-06 Cohort fairness & locale stratification Aggregate +3%, JP -8%, ship?
AML-07 Production integration & latency budgets Where does the model live in 800ms?
AML-08 Incident triage: 'the model got worse' Where do you look first at 3am?

Master Lifecycle Diagram

graph LR
    AML01[AML-01<br/>Frame the problem] --> AML02[AML-02<br/>Pick the experiment]
    AML02 --> AML03[AML-03<br/>Design the test]
    AML03 --> AML04[AML-04<br/>Validate offline-online]
    AML04 --> AML05[AML-05<br/>Check guardrails]
    AML05 --> AML06[AML-06<br/>Check cohorts]
    AML06 --> AML07[AML-07<br/>Production integration]
    AML07 --> AML08[AML-08<br/>Incident-ready]
    AML08 -.lessons feed back.-> AML01

    classDef pre fill:#9cf,stroke:#333
    classDef rigor fill:#fd2,stroke:#333
    classDef ship fill:#2d8,stroke:#333
    classDef ops fill:#f66,stroke:#333

    class AML01,AML02 pre
    class AML03,AML04 rigor
    class AML05,AML06 ship
    class AML07,AML08 ops

The eight scenarios chain. A real reranker launch (US-MLE-02) traverses AML-01 through AML-07 in order and re-uses AML-08 readiness; it doesn't pick one. Read the scenarios in order on first pass; jump to the relevant scenario for a specific real launch.


AML-01: Customer-Pain → ML-Problem Translation

TL;DR

Week-2 retention for new manga readers in the JP cohort dropped from 38% to 31% over four weeks. The PM wants ML to fix it. The Applied ML Engineer's first move is the customer letter, not the model. The right answer this quarter is a heuristic taste-quiz on first session — and a 'not now' recommendation for the ML option.

As a / I want / So that

As an Applied ML Engineer / Product Engineer for ML on the MangaAssist team I want to translate a vague business signal into either a sharp ML problem statement or a defensible "this is not an ML problem" recommendation So that engineering investment is allocated to the highest-leverage intervention rather than to whatever ML feature looks most exciting

Customer Pain (Working Backwards)

Yuki, 24, just discovered MangaAssist last week. She tried 'Solo Leveling', 'Spy x Family', and 'Frieren' — all titles she'd heard of. After turn 4 with the chatbot, none of the three felt like 'her thing'. She didn't open the app on day 7.

What would have made her come back? "If the bot had figured out I like slow-burn psychological stories, not action — and shown me 'A Silent Voice' or 'March Comes In Like A Lion' on day 1."

The pain is misaligned recommendations on day-1 for new users. It is not "retention is dropping." Retention is the lagging indicator. The leading cause is the chatbot showing popular-instead-of-taste-matched titles in the first session, before any preference signal exists.

The signal that this is the right framing: support tickets in the JP cohort cluster around "the bot doesn't know what I like" and "everything it shows is the same." Both are first-session experiences. The cohort that retains well is the one that survives day-1: by day-7, taste signal has accumulated and the recsys works.

Hypothesis & Why-We-Believe

H0 First-session recommendation strategy has no causal effect on week-2 retention for new JP users
H1 A taste-quiz first-session intervention raises week-2 retention for new JP users by ≥4pp absolute (38% → 42%)
Prior evidence (a) Industry: Spotify's onboarding quiz lifts 30-day retention 5-8pp on similar surface; (b) Internal: cohort comparison shows users who self-corrected ("I don't like action") in turn 1-3 retained at 47%, vs. 31% for those who didn't; (c) Customer-research: 60% of JP-cohort exit interviews cite "bot doesn't get me" as primary reason
Disconfirming evidence we'd need If quiz responses don't actually steer recommendations differently than no-quiz default, intervention has no effect. If JP cohort's taste distribution is too uniform for a 5-question quiz to discriminate, intervention has no effect. Both testable in pilot.
Prior probability of H1 0.55 (moderate confidence; industry evidence good but JP-specific behavior unknown)

Experiment Design

This is the unusual case where the primary experiment is a heuristic, not an ML model. The Applied ML Engineer designs the test the same way regardless of whether the intervention is ML or rules.

Field Choice
Population New users (no prior session ≥ 30d), locale = JP, no other active onboarding experiment
Holdout 50/50 user-level randomization at first-session start
Treatment 5-question taste quiz on first session, drives recsys cold-start cohort embedding
Control Existing genre-popularity cold-start fallback
Primary metric week-2 retention (returns to app within days 7-14)
Baseline 31% (last 4 weeks rolling JP-new)
MDE +4pp absolute (≈ +13% relative)
Sample size n ≈ 8.7K per arm (Welch + CUPED on day-1 session length, 80% power, α=0.05)
Runtime ~5 weeks at JP-new-user rate of ~3.5K/day, 50/50 split, including 14-day retention horizon
Guardrails day-1 session length (no degradation), CSAT (no regression), spam-flag rate (no spike), p95 turn latency (≤ 800ms — quiz must not slow first session)
Stop rule Fixed-horizon at week 5; no peeking before week 4
Pre-registration YAML hash signed off by PM, EM, Applied ML Eng, JP-locale Product owner

Architecture / Wiring

graph TB
    NEW[New JP user<br/>first session start]
    NEW --> COHORT{In experiment<br/>cohort?}
    COHORT -->|Treatment 50%| QUIZ[Taste quiz<br/>5 questions<br/>≤ 60s to complete]:::treatment
    COHORT -->|Control 50%| OLD[Genre-popularity<br/>cold-start fallback]:::control

    QUIZ --> EMB[Quiz answers<br/>→ cold-start embedding<br/>seeded into recsys]:::treatment
    EMB --> RECSYS[US-MLE-06<br/>Two-Tower recsys]
    OLD --> RECSYS

    RECSYS --> RANK[Recommendations<br/>shown to user]
    RANK --> SESSION[Session continues<br/>signals collected]
    SESSION --> METRICS[Telemetry<br/>day-1 length / day-7-14 return]

    classDef treatment fill:#2d8,stroke:#333
    classDef control fill:#9cf,stroke:#333
    class QUIZ,EMB treatment
    class OLD control

The taste quiz is a UX intervention (not ML). The ML system (US-MLE-06 two-tower recsys) consumes the quiz output as a cold-start cohort embedding. The treatment can be reverted by toggling the cohort flag; rollback is one-line config.
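
A minimal sketch of the user-level 50/50 assignment and the one-line rollback described above. The salt, flag name, and helper are illustrative, not the production experiment platform's API; the point is that assignment is deterministic per user, so the same user always lands in the same arm across sessions.

# cohort_assignment_sketch.py - illustrative only; salt and flag names are hypothetical
import hashlib

EXPERIMENT_SALT = "aml01-taste-quiz-v1"   # hypothetical experiment key
TREATMENT_ENABLED = True                  # the one-line rollback flag: set False to revert

def assign_arm(user_id: str) -> str:
    """Deterministic user-level 50/50 split: the same user always gets the same arm."""
    if not TREATMENT_ENABLED:
        return "control"
    bucket = int(hashlib.sha256(f"{EXPERIMENT_SALT}:{user_id}".encode()).hexdigest(), 16) % 100
    return "treatment" if bucket < 50 else "control"

# Assignment is stable across sessions for the same user
assert assign_arm("user-123") == assign_arm("user-123")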

Rollout Plan

Stage Population Abort criteria
1% (1 day) 35 new JP users/day Any p99 turn latency > 1500ms; any CSAT regression
5% (3 days) 175/day Day-1 session length regression > 5%
25% (1 week) 875/day Day-1 session length regression > 3%; spam-flag rate > +10%
50% (full A/B, 5 weeks) 1750/day Pre-declared experiment terms
Decision at week 5 If primary +4pp at p<0.05 AND guardrails clear: ship to 100% JP-new + plan EN rollout

Metrics Dashboard

graph LR
    D[Dashboard: AML-01 retention experiment]
    D --> P[Primary: week-2 retention<br/>JP-new]
    D --> S[Secondary: day-1 session length<br/>turn count, completion rate]
    D --> G[Guardrails: CSAT, spam-flag,<br/>p95 turn latency]
    D --> C[Cohort: stratified by JP-Tokyo,<br/>JP-other, by mobile-vs-app]

Metric Formula Target Guardrail-trigger Alert-threshold
Week-2 retention (primary) (returning users days 7-14) / (cohort size) +4pp absolute n/a (primary) (primary)
Day-1 session length mean turns per first session ≥ baseline -3% relative -5% relative
CSAT (5-pt) mean of post-session 5-pt survey, 14-day rolling ≥ baseline -1% relative -1.5% relative
Spam-flag rate flags / sessions ≤ baseline +10% relative +15% relative
p95 turn latency (incl. quiz) p95 of turn-to-first-token ≤ 800ms > 800ms > 900ms
Quiz-completion rate (quiz finished) / (quiz started) ≥ 75% < 60% < 50% (intervention failing)

Real-Incident Vignette

The team launched the quiz to 1% on a Tuesday. Within 6 hours, the JP-Tokyo cohort showed a +12% spike in day-1 session length and quiz-completion rate of 84% — both above expectation. The JP-other (Osaka, Fukuoka, etc.) cohort showed quiz-completion of 62% and no session-length lift. Investigation: the JP-other cohort skews older and has higher non-mobile usage; the 5-question quiz on a smaller mobile screen produced friction. Mitigation: reduce to 3 questions for that sub-cohort; re-run pilot. The lesson fed back into Primitive 6 (cohort fairness): even within JP, sub-cohort fairness matters, and "JP" is not a homogeneous unit.

Cross-Story Dependencies

  • Consumes: US-MLE-06 (recsys two-tower) cold-start cohort embedding API; the quiz-driven cohort embedding requires US-MLE-06 to support keyed cold-start vectors. Confirm contract before scoping.
  • Produces: a leading indicator (day-1 session length lift) that AML-04 (online/offline correlation) will use to validate the recsys-side offline metrics.
  • Sequencing: this experiment must run before the next quarterly portfolio review (AML-02), so its result informs whether the H1-conditional ML investment in cold-start ML belongs in the next portfolio.

Master's-DS Depth Callout — Causal Framing and the Two-Stage Decision

The framing trap on this scenario is collapsing two questions into one: "should we change cold-start strategy?" and "should we use ML or heuristic?" The right framing is staged. Stage 1 is causal: does any cold-start intervention move retention? The cleanest test is a heuristic intervention (cheaper, faster). If Stage 1 is positive, Stage 2 is comparative: does an ML cold-start beat the heuristic? Stage 2 is only worth running if Stage 1 succeeded.

The hypothesis test for Stage 1 is Pr(retention | do(cold-start = quiz), cohort = JP-new) − Pr(retention | do(cold-start = popularity), cohort = JP-new) ≥ MDE. Notice the do() operator — it forces the team to specify the intervention, not just an "association with retention." Most ML PR/FAQs hand-wave the intervention and end up running correlational analysis rebadged as A/B tests. The discipline of writing the do() clause forces clarity.

Amazon Product Lens Callout — Customer Obsession + Invent and Simplify

Two LPs collide. Customer Obsession says start with Yuki. The PR/FAQ leads with her, in her words, on her day-7. Invent and Simplify says ship the simplest intervention that moves Yuki's behavior. ML is not necessarily simpler than a 5-question quiz; in this case, the quiz is simpler, faster, and more interpretable.

The six-pager defending the recommendation: "Hypothesis: cold-start strategy is the leading cause of JP new-user churn. Stage 1 test: 5-question taste quiz vs popularity fallback, 5-week run, +4pp retention MDE. We are not running an ML cold-start experiment this quarter. The ML option's incremental EV over the heuristic is ~2-3pp at 4× engineering cost. We will revisit ML in Q4 if Stage 1 succeeds and Stage 2 evidence justifies. Tenet: we ship the simplest intervention that moves customer behavior." That paragraph is what makes the recommendation defensible to leadership.


AML-02: Experiment Portfolio Prioritization

TL;DR

The team has 12 candidate ML experiments for next quarter and capacity to run 3 in parallel without quality compromise. The Applied ML Engineer is the role that picks. RICE-for-ML scoring + EVOI swing-bet + qualitative overlays produces a defensible 3-of-12. Picking wrong wastes a quarter.

As a / I want / So that

As an Applied ML Engineer responsible for product-ML quarterly investment I want to select the 3 highest-leverage experiments from a portfolio of 12 candidates by RICE × Detectability × strategic value So that engineering effort flows to the experiments with highest expected impact and the team has a defensible OP1 narrative

Customer Pain (Working Backwards)

The pain isn't a single customer's pain — it's the team's pain. Over the last 4 quarters, the team shipped 9 experiments. 5 won, 4 lost. Of the 5 wins, 2 came from the same area (reranker), 2 from another (recsys), 1 from cold-start. Of the 4 losses, 3 were undersized (couldn't detect MDE within quarter); 1 was a real negative result (worth knowing). The pattern: the team has been picking experiments by enthusiasm, not by EV.

The customer pain is indirect: every quarter the team allocates 3-4 ML projects, and the customers experience the aggregate of those projects. A bad portfolio means customers see a quarter where nothing improved; a good portfolio means visible compounding lift across 6 quarters.

Hypothesis & Why-We-Believe

H0 Random portfolio selection (3 of 12) has the same expected lift as RICE-prioritized selection
H1 RICE-prioritized + EVOI overlay produces ≥40% higher expected lift across the quarter than baseline (gut-feel) selection
Prior evidence Industry: Booking.com, Netflix, Microsoft public posts all describe systematic experiment-portfolio frameworks improving experiment win rate by 30-60% over ad-hoc selection.
Disconfirming evidence If team's gut-feel selections are already RICE-correlated (because senior people internalize the framework), the framework adds no marginal value. Test: rank gut-feel against RICE on last quarter; if ρ > 0.7, framework is redundant.
Prior probability of H1 0.7 (high; the team's track record on undersized experiments suggests RICE is missing)

Experiment Design

The "experiment" here is meta-experimental: applying the framework to next quarter's portfolio. The validation is retrospective at end of quarter — did the prioritized portfolio outperform a counterfactual (gut-feel) portfolio?

Field Choice
Population Q3 (next quarter) candidate experiments — 12 named projects
Decision unit Portfolio of 3 (selected from 12)
Treatment RICE + EVOI scoring with qualitative overlay
Control Counterfactual gut-feel pick (recorded but not run; only for retrospective comparison)
Primary metric end-of-quarter portfolio expected lift (sum of shipped-experiment gains)
Secondary individual experiment hit rate, cumulative cost saved by deferring undersized candidates
Pre-registration both portfolios (RICE-picked and gut-feel-picked) committed to git before quarter starts

The 12 Q3 candidates and their RICE scores

# Candidate Reach (sessions/q) Impact (Δ metric) Confidence Effort (eng-wk) RICE EVOI Decision
1 US-MLE-02 reranker MiniLM-L6 → L12 8.0M +0.5% useful-answer 0.60 12 200K low PICK 1
2 US-MLE-01 multilingual intent v2 (JP/EN better tied) 1.2M +0.8% intent-acc 0.43 8 51K medium PICK 2
3 US-MLE-06 recsys cold-start HRNN swap 2.5M +1.2% retention 0.28 18 47K HIGH PICK 3 (EVOI)
4 US-MLE-05 embedding adapter LoRA r=16 8.0M +0.3% recall 0.23 14 41K low Defer Q4
5 US-MLE-04 demand forecasting promo overlay 0.5M (inv ops) +5% sMAPE 0.36 10 90K (inv-team metric) low Defer (inventory team Q3 priority differs)
6 US-MLE-08 cover-art AI-gen detection 12M images +0.05 AUC 0.31 11 1.7M (impressions × low Δ) medium Defer Q4 (signal not yet strong)
7 Cross-encoder distillation to DistilCE dependent on #1 +20% latency reduction 0.55 9 67K low Reject (sequential to #1)
8 Sentiment ABSA quarterly retrain (maintenance) drift bound 1.0 4 hygiene n/a 0.5 SDE assigned, not portfolio
9 Spam classifier weekly retrain automation (maintenance) adversarial-bound 1.0 3 hygiene n/a 0.5 SDE assigned, not portfolio
10 Recsys diversity injection re-tune 2.5M +0.2% diversity, ?CTR 0.20 7 14K low Reject (low EV)
11 Personalize HRNN replacement custom 2-tower 2.5M +1.5% retention 0.40 24 (under-estimated as 18) 78K nominal, 52K corrected medium Reject (cost mis-estimated last time)
12 LLM judge quality-eval framework (tooling) indirect 0.5 6 tooling n/a Reject (platform-ML backlog, not product portfolio)

The RICE-top-3 by raw score: #1, #5, #2. But #5 belongs to inventory team this quarter — out of scope for chatbot product-ML. So the chatbot-product portfolio top-3 by adjusted RICE: #1, #2, #4.

The Applied ML Engineer's overlay swaps #4 for #3 — accepting lower RICE in favor of EVOI. #3 (HRNN cold-start swap) has medium RICE but high EVOI: win or lose, the team learns whether HRNN is the right cold-start direction for the next 4 quarters of recsys investment. That learning is worth more than the marginal RICE delta vs #4.
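
A minimal sketch of the RICE arithmetic behind the table, assuming Impact is entered in percentage points (which reproduces the table's ~51K for candidate #2). The EVOI overlay stays a qualitative tag rather than a term in the score; the Candidate class and field names are illustrative.

# rice_scoring_sketch.py - illustrative; candidate fields mirror the table above
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    reach: float        # sessions (or users) touched per quarter
    impact_pp: float    # expected metric movement, in percentage points
    confidence: float   # 0-1 prior that the effect exists at the stated size
    effort_wk: float    # engineering-weeks
    evoi: str = "low"   # qualitative learning-value overlay: low / medium / HIGH

    @property
    def rice(self) -> float:
        return self.reach * self.impact_pp * self.confidence / self.effort_wk

# Candidate #2 from the table: 1.2M reach, +0.8pp, 0.43 confidence, 8 eng-weeks
c2 = Candidate("US-MLE-01 multilingual intent v2", 1.2e6, 0.8, 0.43, 8, evoi="medium")
print(f"{c2.name}: RICE = {c2.rice:,.0f}")   # ~51,600, matching the table's 51K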

Architecture / Wiring

graph TB
    BACKLOG[Q3 candidate backlog<br/>12 named experiments]
    BACKLOG --> RICE[RICE scoring<br/>Reach × Impact × Confidence ÷ Effort]
    RICE --> RANKED[Ranked list]
    RANKED --> EVOI[EVOI overlay<br/>which candidates have learning value?]
    EVOI --> QUAL[Qualitative overlay<br/>cost-estimate calibration / dependency / political]
    QUAL --> PICK[3-of-12 selected]:::ship
    PICK --> OP1[OP1 narrative<br/>defending the choice]:::ship

    RICE --> REJECT[9 rejected / deferred]:::reject

    classDef ship fill:#2d8,stroke:#333
    classDef reject fill:#9cf,stroke:#333

Rollout Plan

The "rollout" is the portfolio approval and tracking process:

Stage Action Timeline
Pre-quarter RICE-scored portfolio submitted to EM + Director for approval week -2
Pre-quarter OP1 narrative finalized; pre-registered hypotheses for each of 3 experiments locked week -1
Mid-quarter Status check: are any experiments showing early-abort signals? week +6
End-quarter Retrospective: actual lift vs RICE-predicted lift; recalibrate Confidence priors week +12

Metrics Dashboard

Metric Formula Target Alert
Portfolio expected lift (predicted) Σ(RICE_i × P(success)_i) for picked ≥ baseline by 30% n/a (predictive)
Portfolio realized lift Σ(observed lift × significance) for shipped ≥ predicted × 0.7 < predicted × 0.5
Hit rate (per experiment) shipped & sustained / picked ≥ 60% < 40%
Effort accuracy actual eng-weeks / estimated eng-weeks ≤ 1.3× > 2× (recalibrate Confidence)
Confidence calibration observed-success-rate vs predicted-Confidence within 10% > 20% off (the Confidence factor is not calibrated)

Real-Incident Vignette

End of Q2, team retrospects: portfolio's predicted lift was +2.4% on aggregate session-quality; realized was +1.1%. Investigation: candidate #11 (Personalize HRNN replacement) was picked despite RICE-warning on cost-estimate; engineering cost overran 2.4× and the experiment was abandoned at week 8. The lesson: cost-estimate calibration is the biggest gap in RICE; in Q3, the team adopts a rule that any candidate with engineering cost ≥ 12 weeks must have a pre-mortem document with named cost risks and a contingency plan. The rule prevents the same cost-overrun pattern from recurring.

Cross-Story Dependencies

  • Consumes: signals from AML-08 (incident triage) — recurring incidents flag missing investments that should enter the portfolio (e.g., per-shard recall monitor after the OpenSearch shard-3 incident).
  • Consumes: signals from AML-04 (online/offline correlation collapse) — if correlation is dropping, the portfolio must include a "rebuild offline harness" investment, even if it has no shippable user-facing lift.
  • Produces: the pre-registered hypotheses for each of the 3 chosen experiments (handed to AML-03).

Master's-DS Depth Callout — Bayesian Decision Theory and Cost-Estimate Calibration

RICE's Confidence factor is doing two jobs: prior-evidence quality (am I sure the effect exists?) and cost-estimate calibration (am I sure the effort estimate is right?). Splitting them is the master's-level move. The proper Bayesian framing: each candidate has a joint posterior on (effect_exists, effect_size, eng_cost_actual). The portfolio chooser maximizes expected utility under this joint posterior, not under decomposed RICE.

In practice: track the team's calibration over 8+ quarters. If 80%-confidence experiments succeed at 80% rate, the team is well-calibrated. If they succeed at 50%, the team is over-confident; haircut Confidence by 0.6×. If they succeed at 95%, the team is under-confident (rare but real); take more risks. Calibration audits applied to the team's last 12 experiments are the single highest-ROI portfolio-management practice.
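
A minimal sketch of that calibration audit, assuming a simple log of (stated Confidence, shipped-and-sustained outcome) pairs per past experiment. The history values are placeholders, not the team's actual record.

# confidence_calibration_sketch.py - illustrative audit over a hypothetical experiment log
from collections import defaultdict

# (stated Confidence at pre-registration, did the experiment ship and sustain its lift?)
history = [(0.8, True), (0.8, True), (0.8, False), (0.6, True), (0.6, False),
           (0.4, False), (0.8, True), (0.6, False), (0.4, True), (0.8, False)]

buckets = defaultdict(list)
for conf, won in history:
    buckets[round(conf, 1)].append(won)

for conf in sorted(buckets):
    wins = buckets[conf]
    observed = sum(wins) / len(wins)
    # If observed sits well below stated (e.g., 50% vs 80%), haircut Confidence next quarter
    print(f"stated {conf:.0%} -> observed {observed:.0%} over {len(wins)} experiments")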

Amazon Product Lens Callout — Bias for Action + Frugality + Think Big

The portfolio decision negotiates three LPs. Bias for Action says ship the things that are obvious wins quickly — that's #1 and #2 (high RICE, well-powered, low risk). Frugality says pick experiments that share infrastructure (all three of our picks share the experiment platform, the cohort holdout, and the feature store). Think Big says reserve one slot for the swing bet — #3 is EVOI on cold-start direction; if it works, the next year of recsys investment is reshaped.

Six-pager defending the portfolio: "We picked 3 of 12 candidates. Two are Bias-for-Action: high-RICE, well-powered, ship-by-week-12. One is Think-Big: an EVOI swing on cold-start strategy that, if it works, sets recsys direction for 4 quarters. Frugality: all three share A/B infra and cohort holdout, amortizing eng cost. Rejected #11 despite RICE rank 2 because the team's historical cost estimates for that scope class run 2.4× over; we will not ship cost-overrun candidates without a pre-mortem document with contingency plan. Confidence-calibration audit: our 80%-confidence picks succeed at 76% rate; portfolio is well-calibrated, no haircut applied."


AML-03: Hypothesis Design & Sample-Size Discipline

TL;DR

The team is shipping a US-MLE-02 reranker change. The data scientist offline-evaluated MiniLM-L12 vs MiniLM-L6 on a 5K replay and got +6% NDCG@10. The PM says "run a 7-day A/B and ship." The Applied ML Engineer's job is to say no — 7 days under-powers the test, the MDE the team can defend is 3% relative on user-perceived useful-answer rate, the design needs O'Brien-Fleming sequential α-spending, and CSAT must be a pre-declared guardrail.

As a / I want / So that

As an Applied ML Engineer accountable for product-launch experimental rigor I want to translate an offline-positive model change into a properly-powered, pre-registered, sequentially-controlled online A/B with named guardrails So that the launch decision rests on statistical evidence the team can defend in a six-pager review and at 28-day post-launch retrospective

Customer Pain (Working Backwards)

Sakura, 31, has been using MangaAssist for six months. She likes how the bot's recommendations have gotten more relevant over time. Recently, the search results feel slightly different — sometimes more accurate, sometimes more click-bait. She's noticed but hasn't said anything yet. If the new ranking gets shipped without a proper test and turns out to favor click-bait, Sakura's six-month trust erodes by week 4 and she stops asking the bot for recommendations.

The customer pain is a future pain — the cost of shipping a noisily-significant winner that doesn't actually move user behavior. Sakura doesn't notice immediately; the cost shows up as a month-4 churn spike. Pre-registration discipline prevents the noise winner from shipping.

Hypothesis & Why-We-Believe

H0 The MiniLM-L12 reranker has same or worse user-perceived ranking quality than MiniLM-L6, measured by useful-answer-rate per user over 14 days
H1 MiniLM-L12 reranker has ≥3% relative lift in useful-answer-rate per user over 14 days
Prior evidence Offline NDCG@10 +6% on golden-500; offline-online correlation 0.68 over last 90 days (above 0.5 threshold per Primitive 4); two prior reranker upgrades correlated 0.7-0.8 between offline NDCG and online useful-answer
Disconfirming evidence we'd need Cohort-stratified online lift below MDE in any of EN / JP / mixed; CSAT regression > 1% relative; latency p95 > 800ms
Prior probability of H1 0.75 (strong; offline evidence + correlation history support it)

Experiment Design

Field Choice
Population All MangaAssist users with ≥ 1 search in pre-14d window; locale ∈ {EN, JP, mixed}; not in any other active experiment
Holdout / control 50/50 user-level randomization (NOT session, NOT request — see Primitive 3 randomization unit discussion)
Treatment New MiniLM-L12 cross-encoder reranker, top-K=20→top-K=30 reranked, served by SageMaker MME endpoint
Control Existing MiniLM-L6 reranker, top-K=20
Primary metric useful-answer-rate per user per 14-day window
Baseline 0.183 (mean from 14-day pre-experiment)
Baseline std σ = 0.142 (per-user std)
MDE 3% relative = 0.00549 absolute
Power 0.80
α 0.05 (sequential, O'Brien-Fleming)
Test Welch's t-test, two-sided, with CUPED variance reduction
CUPED covariate per-user useful-answer-rate over the 14d immediately preceding experiment
Sample size pre-CUPED n = 36,840 per arm (statsmodels TTestIndPower)
Sample size post-CUPED n = 18,420 per arm (50% variance reduction empirically)
Runtime 14 days at 200K eligible users/day, 50/50 split → easily achievable
Looks (sequential) day 4, day 7, day 11, day 14
α at each look 0.0001, 0.001, 0.01, 0.05 (O'Brien-Fleming)
Stop rule Stop early on primary if any look's α-adjusted p < look's α; stop early on guardrail breach

Sample-size calculation walkthrough

# sample_size_us_mle_02.py — pre-registered calc
import statsmodels.stats.power as smp

baseline = 0.183
sigma_pre = 0.142
mde_relative = 0.03
mde_absolute = baseline * mde_relative   # 0.00549
effect_size = mde_absolute / sigma_pre   # 0.0387 standardized

# Pre-CUPED (raw t-test power)
analysis = smp.TTestIndPower()
n_pre_cuped = analysis.solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, alternative='two-sided'
)
print(f"Pre-CUPED n per arm: {int(n_pre_cuped)}")
# 10491... but this is for a single look; sequential design needs more

# O'Brien-Fleming with 4 looks needs ~4-6% inflation in sample size
# vs fixed-horizon for equivalent power. Apply 1.06× factor:
n_pre_cuped_obf = int(n_pre_cuped * 1.06)
print(f"Pre-CUPED with OBF n per arm: {n_pre_cuped_obf}")
# 11120

# CUPED variance reduction empirically 50% on this metric
sigma_post_cuped = sigma_pre * (1 - 0.50)**0.5  # 0.50 var reduction = 0.71 sigma reduction
effect_size_post = mde_absolute / sigma_post_cuped
n_post_cuped_obf = analysis.solve_power(
    effect_size=effect_size_post, alpha=0.05, power=0.80, alternative='two-sided'
) * 1.06
print(f"Post-CUPED with OBF n per arm: {int(n_post_cuped_obf)}")
# 18420 (rounded)

# Note: 18420 is an underestimate because the OBF inflation applies to raw t,
# not to the CUPED-residualized statistic. Conservative: round up to 20K.

The team can hit n=20K per arm in ~2 days at 200K eligible/day; runtime is bounded by the 14-day horizon (the metric needs 14 days of activity per user), not by sample size. So the experiment runs exactly 14 days, with sequential looks providing safety to stop early on dramatic signals.
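
A minimal sketch of the CUPED adjustment named in the design table: θ is the covariance of the in-experiment metric with the pre-period covariate divided by the covariate's variance, and the Welch test then runs on the adjusted values. The array names and the synthetic example are illustrative, not the production analysis code.

# cuped_sketch.py - illustrative CUPED adjustment; inputs are hypothetical
import numpy as np

def cuped_adjust(y: np.ndarray, x_pre: np.ndarray) -> np.ndarray:
    """y: in-experiment per-user metric; x_pre: same metric over the pre-experiment 14d."""
    theta = np.cov(y, x_pre, ddof=1)[0, 1] / np.var(x_pre, ddof=1)
    return y - theta * (x_pre - x_pre.mean())

# Synthetic check: variance drops by roughly corr(y, x_pre)**2
rng = np.random.default_rng(0)
x_pre = rng.normal(0.18, 0.14, size=10_000)
y = 0.7 * x_pre + rng.normal(0, 0.10, size=10_000)
print(np.var(y), np.var(cuped_adjust(y, x_pre)))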

Cohort-stratified sample size

The 18.4K-per-arm aggregate is sufficient for the aggregate claim. For cohort claims:

Cohort % of traffic Per-cohort n needed Total exp n (across both arms)
EN 55% 18.4K 33.5K total → 67K combined → ~3 days
JP 30% 18.4K 61K total → 122K combined → ~6 days
mixed 15% 18.4K 122K total → 245K combined → 12 days

The mixed-cohort claim is on the edge of the 14-day horizon — possible but tight. The Applied ML Engineer flags this in the pre-registration: "mixed-cohort claim has marginal power; if mixed-cohort metric is between -3% and +3%, treat as inconclusive, not as evidence against H1." Better to flag this upfront than negotiate it post-hoc.

Architecture / Wiring

graph LR
    USR[User] --> CHOOSE{Experiment<br/>cohort hash}
    CHOOSE -->|treatment 50%| L12[Reranker MiniLM-L12<br/>SageMaker MME]:::treatment
    CHOOSE -->|control 50%| L6[Reranker MiniLM-L6<br/>SageMaker MME]:::control

    L12 --> RANK[Ranked results]
    L6 --> RANK

    RANK --> RESP[Chatbot response]
    RESP --> SIG[useful-answer signal<br/>collected per turn]
    SIG --> AGG[Aggregated per user<br/>over 14-day window]
    AGG --> EXPDB[Experiment platform<br/>treatment_id, user_id, metric]
    EXPDB --> ANA[Analysis<br/>Welch + CUPED<br/>O'Brien-Fleming sequential]:::analysis

    classDef treatment fill:#2d8,stroke:#333
    classDef control fill:#9cf,stroke:#333
    classDef analysis fill:#fd2,stroke:#333

Rollout Plan

Stage Population Abort criteria
Day 1: 1% canary 4K users p99 turn latency > 1500ms; error rate > 1%
Day 2-3: 10% 40K aggregate metric collapse > 5% from baseline; CSAT crash
Day 4 (look 1): 50% 200K OBF α at 0.0001 — early stop only on extreme effect
Day 5-14: 50% full A/B guardrails monitored continuously; subsequent looks at d7, d11, d14
End-of-experiment decision primary clears MDE at look-α AND all guardrails clear AND cohort guardrails clear → ship

Metrics Dashboard

Metric Formula Target Guardrail-trigger Alert
Useful-answer rate (primary) per-user mean over 14d +3% rel n/a n/a
p95 turn latency p95 across all turns ≤ 800ms > 800ms > 850ms
CSAT (5-pt) rolling 14d mean -0% rel -1% rel -1.4% rel
JP cohort useful-answer per-JP-user mean over 14d -0% rel -3% rel -5% rel
EN cohort useful-answer per-EN-user mean over 14d -0% rel -3% rel -5% rel
Sequential α-spent cumulative α used at each look per OBF plan exceeds plan n/a

Real-Incident Vignette

The team ran the experiment exactly to spec. At look 2 (day 7), aggregate p=0.04 (above OBF threshold of 0.001 at look 2). PM saw the dashboard, asked "we're at p<0.05, can we ship?" Applied ML Engineer's response: "OBF threshold at look 2 is 0.001, not 0.05. We're not significant on the sequential plan. We continue. The mechanical rule prevents us from shipping on noise." At look 4 (day 14), aggregate p<0.001 — clear win on the sequential plan. CSAT regressed -0.6% (within -1% threshold). Cohorts cleared. The launch shipped. Six months later, retrospective showed durable lift; no novelty fade. The discipline paid off.

Cross-Story Dependencies

  • Consumes: AML-04 (online/offline correlation) — checks correlation is above 0.5 threshold before trusting the offline evidence supporting H1.
  • Consumes: AML-02 (portfolio prioritization) — the experiment was selected with EVOI consideration; the design must support both win and loss as informative.
  • Produces: pre-registered YAML hash and the analysis output that AML-05 (guardrails) will use to gate promotion.
  • Sequencing: must complete before AML-05 can render its veto/ship decision.

Master's-DS Depth Callout — Welch's t, Delta Method, Sequential α-Inflation

Three depth points that interview at the staff level:

  1. Welch's t is the right default — equal-variance assumption almost never holds for chatbot per-user metrics. The cost of using Welch when variances are equal is negligible (slightly reduced power); the cost of using equal-variance t when they aren't equal is wrong p-values. Always Welch.

  2. Delta method for ratio metrics. If the metric is ratio-typed (e.g., useful-answer-rate = useful_turns / total_turns per user), per-user-ratio averaging biases the variance estimate. The variance of the ratio is approximated by Var(X/Y) ≈ (μ_X² / μ_Y²) × (σ²_X/μ_X² + σ²_Y/μ_Y² − 2·Cov(X,Y)/(μ_X·μ_Y)). The team that computes per-user ratios and t-tests them directly under-estimates variance and over-claims significance; the delta method fixes it (see the sketch after this list).

  3. Sequential α-inflation. Without an α-spending plan, peeking 5 times inflates Type-I error from 5% to ~20%. O'Brien-Fleming distributes α toward later looks (most α at the final look), preserving most of the experiment's nominal 5% Type-I rate; Pocock distributes α equally across looks (more conservative early-stop, less powerful late). For most chatbot A/Bs, OBF is right; for short-runtime experiments where early-stop is critical, Pocock is right.
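
A minimal sketch of the delta-method variance from point 2, applied to the ratio-of-per-user-sums estimator: the formula above divided by n, with sample moments plugged in. The input arrays are assumed per-user counts; nothing here is the production analysis code.

# delta_method_sketch.py - illustrative variance for a ratio-of-sums metric
import numpy as np

def ratio_variance(useful: np.ndarray, total: np.ndarray) -> float:
    """Delta-method variance of mean(useful)/mean(total) across users."""
    n = len(useful)
    mu_x, mu_y = useful.mean(), total.mean()
    var_x, var_y = useful.var(ddof=1), total.var(ddof=1)
    cov_xy = np.cov(useful, total, ddof=1)[0, 1]
    # Formula from point 2, divided by n to get the variance of the ratio of sample means
    return (mu_x**2 / mu_y**2) * (var_x / mu_x**2 + var_y / mu_y**2
                                  - 2 * cov_xy / (mu_x * mu_y)) / n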

Amazon Product Lens Callout — Insist on the Highest Standards + Dive Deep

The pre-registration discipline is the canonical instance of Insist on the Highest Standards. It costs friction (PMs hate writing the YAML). It pays in defensibility: 6 months later, when leadership asks "did this experiment really work?", the team points to the pre-registered YAML hash and the analysis that conformed to it. Teams that don't pre-register cannot answer the question — they have to argue from results, which is HARKing in slow motion.

Dive Deep is what catches CUPED, Welch, the delta method, the OBF plan, the cohort sample-size split. None of these are obvious to the gut-feel A/B-runner. They show up in the pre-registration document, signed by the data scientist, signed by the Applied ML Engineer, signed by the EM. Six months later they're the artefact that convinces a Director that the team is rigorous.

Six-pager: "Pre-registered hypothesis: ≥3% relative lift in useful-answer-rate. Powered at 80% via 18.4K-per-arm Welch t with CUPED variance reduction (50%). O'Brien-Fleming sequential plan, 4 looks. Three pre-declared guardrails plus per-cohort guardrails. We will not declare a win on aggregate-positive results if any cohort regresses by >3%. The smallest experiment that disconfirms the hypothesis is 14 days at 200K eligible users/day."


AML-04: Online/Offline Metric Decoupling

TL;DR

The team's RAG-retrieval change shows offline Recall@10 +3.3pp, NDCG@5 +3.8% (both significant). The 14-day online A/B shows useful-answer-rate +0.5% relative, not significant. Offline says win; online says nothing. The Applied ML Engineer's diagnostic: not "is the model wrong" but "which of 5 named root causes broke the offline-online correlation, and how do we fix the harness." The right action is to pause model changes for 4 weeks and rebuild the offline harness.

As a / I want / So that

As an Applied ML Engineer accountable for offline metric trustworthiness I want to detect online-offline correlation collapse, diagnose its root cause, and rebuild the offline harness So that subsequent experiments make ship/no-ship decisions on metrics the team trusts as predictors of customer behavior

Customer Pain (Working Backwards)

Kenji, 28, runs his manga-discovery sessions on mobile. He scrolls fast, scans the top 3 results, and clicks the most relevant. Lower-ranked results he never sees. The team's offline metric (NDCG@10) credits Kenji's experience for improvements at positions 5-10 that he never sees. The online metric (CTR) sees what Kenji sees. They diverge because Kenji's behavior is mismatched against what NDCG@10 measures.

The customer pain is invisible to the team that trusts only offline metrics. The team ships changes that improve positions Kenji never views. The customer's experience doesn't change. The online metric reflects this. The fix is to align the offline metric with Kenji's actual behavior.

Hypothesis & Why-We-Believe

H0 Offline-online correlation has not meaningfully changed; the discrepancy is sampling noise
H1 Correlation has dropped from historical 0.7+ to <0.5; offline metric is no longer a faithful predictor of online lift; rebuild required
Prior evidence Last 90 days of experiment outcomes: 4 of 6 launches showed offline-online direction agreement; 2 showed offline-positive online-flat. Trend rolling Pearson 0.41 over last 60 days, was 0.71 in prior 60 days.
Disconfirming evidence If the next 2 launches show restored direction agreement, the dip was sampling noise, not collapse
Prior probability of H1 0.75 (strong; the rolling Pearson is below threshold, multiple recent disagreements)

Experiment Design

This is a meta-experiment — the experimental subject is the offline harness itself, not a model.

Field Choice
Population Last 12 months of completed experiments with both offline-Δ and online-Δ recorded
Treatment New offline harness: NDCG@3 + IPS-corrected CTR + adversarial click-bait detector + LLM-judged factuality
Control Existing offline harness: NDCG@10 + Recall@10
Primary metric Pearson correlation between offline-Δ and online-Δ across the next 4 experiments
Baseline Current rolling 60-day Pearson: 0.41
Target Restored to ≥ 0.6 (above 0.5 threshold)
Validation Replay last 6 experiments through new harness; check if their offline-Δ ranking matches online-Δ ranking
Decision If new harness restores correlation on replay AND on next 4 forward experiments: adopt
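
A minimal sketch of the replay validation and the ≥0.6 adoption gate from the design table. The (offline-Δ, online-Δ) pairs are placeholders, not the real replay results.

# harness_replay_gate_sketch.py - illustrative correlation gate; deltas are placeholders
from scipy.stats import pearsonr

# (new-harness offline delta, online delta) for each replayed experiment, hypothetical values
replayed = [(+0.031, +0.012), (+0.008, +0.001), (-0.004, -0.006),
            (+0.022, +0.009), (+0.001, -0.002), (+0.015, +0.007)]

offline_deltas, online_deltas = zip(*replayed)
r, p = pearsonr(offline_deltas, online_deltas)
print(f"replay Pearson r = {r:.2f}")
if r >= 0.6:
    print("adopt new harness; resume model shipping")
else:
    print("iterate on harness; add diagnostic metrics")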

Diagnostic walkthrough — the five named root causes

graph TB
    A[Offline +5% NDCG@10<br/>Online flat CTR<br/>14-day A/B not significant]
    A --> R1[Root cause 1<br/>Selection bias in eval set]
    A --> R2[Root cause 2<br/>Label leakage]
    A --> R3[Root cause 3<br/>Distribution shift]
    A --> R4[Root cause 4<br/>Metric proxy mismatch]
    A --> R5[Root cause 5<br/>Goodharting]

    R1 --> R1D[Diagnostic: KL between<br/>eval queries and prod queries]
    R2 --> R2D[Diagnostic: time-strict<br/>train/eval cut]
    R3 --> R3D[Diagnostic: catalog turnover<br/>since last eval refresh]
    R4 --> R4D[Diagnostic: where do users<br/>actually click? mobile pos 1-3]
    R5 --> R5D[Diagnostic: is offline metric<br/>the training target?]

    R1D --> R1F{Hit?}
    R2D --> R2F{Hit?}
    R3D --> R3F{Hit?}
    R4D --> R4F{Hit?}
    R5D --> R5F{Hit?}

    R1F -->|Yes| FIX1[IPS-corrected eval set]:::fix
    R2F -->|Yes| FIX2[Time-strict split rebuild]:::fix
    R3F -->|Yes| FIX3[Weekly eval refresh + sliding window]:::fix
    R4F -->|Yes| FIX4[NDCG@3 + position-weighted metric]:::fix
    R5F -->|Yes| FIX5[Diversify offline metric portfolio]:::fix

    classDef fix fill:#2d8,stroke:#333

For the RAG-retrieval scenario:

Root cause Test result Verdict
1. Selection bias Eval set drawn from human-curated queries; production queries include long-tail seasonal anime tie-ins not in eval set Partial hit
2. Label leakage Time-strict split shows offline gain shrinks from +3.3pp to +1.8pp; some leakage but not the dominant cause Partial hit
3. Distribution shift Catalog turnover 14% since last eval refresh; KL(eval ‖ prod) on query distributions has drifted upward Secondary hit
4. Metric proxy mismatch 70% mobile traffic; users click positions 1-3 90% of time; offline gain is concentrated at positions 5-7 (where users don't look) Primary hit
5. Goodharting Reranker training loss is the offline NDCG@10 metric directly; gameable Possible (chronic risk)

The dominant cause is #4 (metric proxy mismatch) with #3 (distribution shift) as secondary contributor. The fix is not to re-train the model. The fix is to:

  1. Replace NDCG@10 with NDCG@3 as the primary offline metric for mobile-skewed RAG traffic.
  2. Refresh the eval set weekly, sampled from 7-day-rolling production query distribution with stratified human labels (covers cause #3).
  3. Add IPS counterfactual evaluation as secondary metric (covers cause #1).
  4. Add a click-bait adversarial detector as a secondary offline metric (covers cause #5 — diversification).
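
A minimal sketch of the KL(eval ‖ prod) drift check used as the root-cause-1 diagnostic and tracked on the dashboard below, assuming queries are first bucketed into categories. The categories and counts are illustrative.

# eval_drift_kl_sketch.py - illustrative KL(eval || prod) over query-category frequencies
import numpy as np
from scipy.stats import entropy

# Query counts per category (hypothetical): eval set vs last 7 days of production traffic
categories = ["action", "romance", "slice-of-life", "psychological", "seasonal-tie-in"]
eval_counts = np.array([400, 300, 150, 120, 30], dtype=float)
prod_counts = np.array([350, 280, 160, 110, 100], dtype=float)

p_eval = eval_counts / eval_counts.sum()
p_prod = prod_counts / prod_counts.sum()
kl = entropy(p_eval, p_prod)   # KL(eval || prod) in nats
print(f"KL(eval || prod) = {kl:.3f}")
# A rising KL means the eval set no longer looks like production traffic; refresh the eval set.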

Architecture / Wiring

graph LR
    EXP[Last 6 experiments]:::data
    NEW_HARNESS[New offline harness<br/>NDCG@3 + IPS-CTR<br/>+ adv-detector + LLM-judge]:::harness

    EXP --> REPLAY[Replay through<br/>new harness]
    REPLAY --> NEW_OFFLINE[New offline-Δ<br/>per experiment]

    EXP --> ONLINE[Online-Δ<br/>per experiment]:::data

    NEW_OFFLINE --> CORR[Compute Pearson<br/>new-offline-Δ vs online-Δ]
    ONLINE --> CORR

    CORR --> DECIDE{Pearson ≥ 0.6?}
    DECIDE -->|Yes| ADOPT[Adopt new harness<br/>resume model shipping]:::adopt
    DECIDE -->|No| ITERATE[Iterate on harness<br/>add more diagnostic metrics]:::iter

    classDef data fill:#9cf,stroke:#333
    classDef harness fill:#fd2,stroke:#333
    classDef adopt fill:#2d8,stroke:#333
    classDef iter fill:#f66,stroke:#333

Rollout Plan

Stage Action Timeline
Week 1-2 Build new offline harness; set up IPS, NDCG@3, click-bait detector, LLM-judge 2 weeks
Week 3 Replay last 6 experiments through new harness; compute correlation 1 week
Week 4 If correlation ≥ 0.6 on replay, adopt for next quarter; if not, iterate 1 week
Forward Track rolling 90-day Pearson; alarm if < 0.5; require harness rebuild continuous

Metrics Dashboard

Metric Formula Target Alert
Rolling 90-day Pearson corr(offline-Δ_i, online-Δ_i) over last 90d experiments ≥ 0.6 < 0.5
Last-6-experiment direction agreement count(sign(offline-Δ) == sign(online-Δ)) / 6 ≥ 5/6 < 4/6
Eval-set staleness days since last refresh ≤ 7 > 14
KL divergence eval-vs-prod KL(eval queries ‖ prod queries)
Mobile NDCG@3 vs desktop NDCG@10 both tracked separately per-surface n/a

Real-Incident Vignette

The team froze model shipping for 4 weeks while the harness was rebuilt. PM was unhappy: "we're not shipping anything for a month." The Applied ML Engineer's defense: "The cost of waiting 4 weeks is one quarter of velocity. The cost of not waiting is shipping noise winners against a broken offline metric for the next year. We've shipped 2 reranker changes in the last 6 months that won offline and went flat online. The pattern is the metric, not the model. We're rebuilding the metric." After 4 weeks, the new harness showed Pearson 0.71 on the 6-experiment replay. The team resumed shipping. Over the next 4 quarters, 11 of 13 launches showed offline-online direction agreement. The 4-week freeze paid back in trust and shipped lift.

Cross-Story Dependencies

  • Consumes: signals from AML-03 (hypothesis design) — every experiment that ships records its offline-Δ and online-Δ; the correlation tracker reads from this.
  • Produces: a Pearson threshold gate that AML-03 must check before trusting offline evidence in pre-registration.
  • Cross-references: ../RAG-MCP-Integration/09-rag-retrieval-pipeline-deep-dive.md for the offline-eval methodology being rebuilt; ../Ground-Truth-Evolution/ for the ML-03 search-ranking-UI-redesign scenario, which is a related but distinct correlation collapse driven by UI change rather than metric mismatch.

Master's-DS Depth Callout — Goodhart's Law and IPS Estimators

Goodhart's Law (formalized): when a measure becomes a target, it ceases to be a good measure. Applied ML systems goodhart their offline metric the moment training loss is computed against it. The mitigation is not "find a better single metric" — any single metric will eventually goodhart. The mitigation is diversity: a portfolio of offline metrics where no single one is sufficient to win, and each catches a failure mode the others miss.

For the MangaAssist RAG harness, the new portfolio is:

  • NDCG@3 — position-weighted, mobile-aligned ranking metric (primary)
  • IPS-corrected CTR — counterfactual evaluation; corrects selection bias from click-logged data
  • Click-bait adversarial detector — catches Goodhart cases where ranking quality goes up but content quality goes down
  • LLM-judged factuality on RAG-grounded answers — catches hallucination cases

To win promotion, a model must improve the primary AND not regress any secondary by more than threshold. This is harder to game than any single metric.

IPS estimator — for selection bias correction. The intuition: in click-logged data, items that were rarely shown have rare exposure; their per-click value is up-weighted by 1/π(item|query) where π is the propensity of showing that item. Variance is the cost; SNIPS (self-normalized IPS) and Doubly-Robust estimators are practical workhorses. The Applied ML Engineer doesn't derive these but knows to ask "are we using IPS or just averaging clicks?" — if no, selection bias is in the metric.
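
A minimal sketch of IPS and SNIPS on click-logged data, under the assumption that the logging propensity of each shown item is recorded. Everything here, including the synthetic propensities and rewards, is illustrative.

# ips_snips_sketch.py - illustrative off-policy estimators; all inputs are hypothetical
import numpy as np

def ips(reward, pi_new, pi_log):
    """Inverse propensity scoring: up-weight rarely-shown items by 1/propensity."""
    w = pi_new / pi_log
    return np.mean(reward * w)

def snips(reward, pi_new, pi_log):
    """Self-normalized IPS: lower variance, slightly biased."""
    w = pi_new / pi_log
    return np.sum(reward * w) / np.sum(w)

rng = np.random.default_rng(1)
pi_log = rng.uniform(0.05, 0.9, size=5_000)     # logging policy's propensity of the shown item
pi_new = np.clip(pi_log * rng.uniform(0.5, 1.5, size=5_000), 0.01, 1.0)  # candidate policy
reward = rng.binomial(1, 0.2, size=5_000)        # clicked / useful-answer signal
print(f"IPS: {ips(reward, pi_new, pi_log):.3f}  SNIPS: {snips(reward, pi_new, pi_log):.3f}")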

Amazon Product Lens Callout — Learn and Be Curious + Are Right A Lot

Learn and Be Curious is what makes the Applied ML Engineer ask "why don't they correlate?" instead of accepting the offline win. Teams that goodhart their offline metric for years are teams that stopped asking. Quarterly correlation audits are the institutional discipline.

Are Right A Lot is what disciplines the response when correlation collapses. Wrong response: ship anyway and hope. Right response: slow down, diagnose, fix the harness. In a six-pager justifying a 4-week ship-freeze: "We declined to ship this quarter despite +5% offline gain because offline-online correlation has dropped to 0.41, below our 0.5 threshold. Offline gain is no longer a trustworthy predictor of customer behavior. We are rebuilding the offline harness in Q4; expected restoration of correlation by end of November; resumption of model-shipping in December. Cost of waiting: one quarter. Cost of not waiting: shipping noise winners for the next year, eroding both customer trust and team credibility with leadership."


AML-05: Business-KPI Guardrails for Promotion

TL;DR

The reranker change shows +6% useful-answer-rate (significant). CSAT regresses -1.4% (significant). The pre-declared CSAT guardrail threshold is -1%. The PM argues to ship anyway. The Applied ML Engineer's job is to enforce the veto mechanically. Pre-declaration is the load-bearing word; guardrails negotiated post-hoc are not guardrails.

As a / I want / So that

As an Applied ML Engineer responsible for ship/no-ship decisions I want to enforce pre-declared mechanical guardrail vetoes regardless of political pressure So that model promotion is governed by customer-experience evidence, not by who has the most leverage in a launch meeting

Customer Pain (Working Backwards)

Akari, 35, has used MangaAssist daily for two years. She trusts the bot's recommendations. After a recent launch, she notices the recommendations feel "off" — she still clicks (the engagement metric), but she fills out the post-session survey saying she's less satisfied. She doesn't churn immediately. She becomes one of the slow-erosion customers: less trust, less engagement at month 3, churned at month 6.

The customer pain on this scenario is delayed. CSAT is the leading indicator; engagement is the lagging indicator (delayed by months). The model that wins engagement and loses CSAT is the model that ships, looks good for 4 weeks, and slowly hollows out the customer base. Pre-declared CSAT vetoes catch this.

Hypothesis & Why-We-Believe

H0 The shipped reranker change preserves customer satisfaction at the level required for sustainable engagement growth
H1 The reranker change regresses CSAT by ≥1% relative, indicating the engagement gain comes at customer-trust cost; veto fires
Prior evidence Industry: engagement-bait-without-satisfaction is a classic anti-pattern (YouTube 2017 watch-time → recommendation overhaul; Facebook 2018 engagement → MSI redesign). Internal: prior reranker change with -0.8% CSAT and +4% engagement showed +6-month retention -1.2pp.
Disconfirming evidence If 28-day post-launch holdout doesn't show retention erosion despite CSAT drop, the relationship between CSAT and retention is weaker than priors suggested
Prior probability of H1 n/a (this is the guardrail check, not a hypothesis test in the new-launch sense)

Experiment Design (the guardrail test)

The "experiment" here is the launch readiness review. The test is whether each pre-declared guardrail clears its threshold.

Field Pre-declared in experiment YAML
Primary useful-answer-rate, MDE +3% rel — measured
Guardrail 1: p95 turn latency ≤ 800ms — must hold
Guardrail 2: CSAT (5-pt) ≥ -1% rel — must hold
Guardrail 3: 28d retention ≥ -0.5% rel — must hold (high-stakes)
Guardrail 4: spam-flag rate ≤ +5% rel — must hold
Guardrail 5: per-cohort useful-answer ≥ -3% rel for any cohort — must hold
Multiple testing correction Bonferroni: per-guardrail α = 0.01 (5 tests, 0.05/5)
Veto rule If any guardrail breaches at adjusted α < 0.01, mechanical veto
Override None. Override requires written incident-style review + Director sign-off + retrospective if shipped.

The five-guardrail decision matrix

Guardrail Pre Δ Post Δ Δ relative Adj-α Status
Useful-answer rate (primary) 0.183 0.194 +6.0% <0.001
p95 turn latency 720ms 740ms +20ms n/a
CSAT 4.21 4.15 -1.43% 0.02 VETO
28d retention 51.2% 50.9% -0.59% 0.18 ⚠️ near-veto, not breached
Spam-flag rate 0.41% 0.41% +0.0% n/a
JP cohort useful-answer 0.171 0.174 +1.8% 0.04

CSAT breaches at -1.43% (threshold -1%), with adjusted-α 0.02. Veto fires mechanically.

The defense — when the PM pushes back

PM: "CSAT is a noisy survey metric. The primary moved by 6%. Let's ship and revisit in 4 weeks."

Applied ML Engineer: "CSAT is the user telling us, in their words, that the new ranking feels worse. The primary metric is a behavioral proxy; CSAT is a stated preference. They disagree. When they disagree at p=0.02, we have to take the user's word. We pre-registered CSAT as a guardrail with -1% threshold; the threshold was breached at -1.43%. The mechanical action is veto. The right next move is to investigate why CSAT regressed despite the engagement lift — likely candidates: ranking is more aggressive on click-bait, or the new ordering surfaces titles users find emotionally heavier than expected. Both are diagnosable in a follow-up offline analysis. Shipping a model that the user is telling us they don't like, on the bet that we're wrong about CSAT, is not a bet I'd make."

PM: "But the launch is in OP1. Leadership expects it."

Applied ML Engineer: "The launch in OP1 was conditional on guardrails clearing. Shipping a guardrail breach turns OP1 from 'we shipped X with positive customer impact' into 'we shipped X with stated-customer-dissatisfaction.' The OP1 narrative is worse with the breach than with the slip. I can write the slip narrative and own it."

The artefact this exchange produces is the defensible decision. Six months later, when retention does (or doesn't) regress, the team has the artefact to point to. Either way, the team's credibility increases — they vetoed when they should have, or they vetoed conservatively and learned the calibration.

Architecture / Wiring

graph TB
    EXP[Experiment results<br/>14d A/B complete]
    EXP --> P[Primary check<br/>useful-answer +6% sig]
    EXP --> G1[Guardrail 1<br/>latency]
    EXP --> G2[Guardrail 2<br/>CSAT]
    EXP --> G3[Guardrail 3<br/>retention 28d]
    EXP --> G4[Guardrail 4<br/>spam]
    EXP --> G5[Guardrail 5<br/>per-cohort]

    P --> AND{All pass at<br/>adjusted-α 0.01?}
    G1 --> AND
    G2 --> AND
    G3 --> AND
    G4 --> AND
    G5 --> AND

    AND -->|All pass| SHIP[Ship<br/>see AML-07]:::ship
    AND -->|Any breach| VETO[Mechanical veto<br/>document and investigate]:::stop

    classDef ship fill:#2d8,stroke:#333
    classDef stop fill:#f66,stroke:#333

Rollout Plan

When veto fires, "rollout" becomes an investigation plan:

Stage Action Owner
Day 0 Veto fires; experiment paused at current rollout Applied ML Eng
Day 1 Document veto in launch readiness review with full data Applied ML Eng
Day 1-3 CSAT drill-down: which session types, which cohorts, which catalog segments Applied ML Eng + Data Scientist
Day 4-7 Hypothesize root cause (click-bait? emotional-heaviness?); design diagnostic offline test Applied ML Eng
Day 8-14 Run diagnostic; either re-train with adjusted training signal or kill the experiment Applied ML Eng + Platform ML
Day 14+ If re-trainable: schedule next experiment iteration; if not: archive learnings, replan portfolio EM + PM

Metrics Dashboard

Metric Formula Target Veto threshold Status
Useful-answer rate per-user rate +3% rel n/a (primary)
p95 latency p95 turn-to-first-token ≤ 800ms > 800ms
CSAT 5-pt mean -0% rel -1% rel
28d retention rolling -0% rel -0.5% rel ⚠️ (-0.59% rel, p=0.18, near veto)
Spam-flag rate flags/sessions -0% rel +5% rel
Per-cohort each cohort metric -0% rel -3% rel
Family-wise α-adjusted Bonferroni 5 tests each at 0.01 breach CSAT breaches

Real-Incident Vignette

The veto fired. The PM escalated to Director. The Director read the launch readiness document, saw the CSAT breach pre-registered with mechanical -1% threshold, and supported the veto. The team investigated: the new reranker had learned to surface "shock value" titles that drove clicks but felt off-brand for MangaAssist. The diagnostic: a manually-constructed eval set of "tonally-jarring" recommendations showed the new model surfaced them at 18% rate vs old at 7%. The fix: add a tone-preference signal to the reranker training (using ABSA aspect signals from US-MLE-03). Re-ran the experiment 6 weeks later: useful-answer +5.2%, CSAT -0.3% (within threshold). The launch shipped with full leadership support. The original veto built credibility; the second-iteration ship validated the discipline.

Cross-Story Dependencies

  • Consumes: AML-03 (hypothesis & sample size) — the pre-registered YAML containing all guardrail thresholds.
  • Consumes: AML-06 (cohort fairness) — the per-cohort guardrail is one of the five.
  • Produces: a ship-or-no-ship decision feeding AML-07 (production integration).
  • Cross-references: US-MLE-02 reranker training pipeline (where the fix re-trains); US-MLE-03 ABSA aspect signal (the new training signal).

Master's-DS Depth Callout — Family-Wise Error Rate, Bonferroni vs Holm vs IUT

Five guardrails tested at α=0.05 each: family-wise probability of at least one false breach is 1 - (1-0.05)^5 ≈ 0.226. Almost 23% chance of false-veto in a clean experiment. Three corrections to choose from:

Method Adjusted α Pros Cons
Bonferroni α/k (e.g., 0.01 per test for 5 guardrails) Simple, conservative, defensible Loses power; legit breaches at α=0.04 don't fire
Holm-Bonferroni Sequential (test smallest p first at α/k, next at α/(k-1), etc.) More powerful than Bonferroni More complex to explain
Intersection-Union Test (IUT) Each guardrail at unadjusted α; pass requires ALL pass Conservative on PASS side, no FWER inflation Inflates false-no-launch rate

For chatbot guardrails, IUT is the right default. Reasoning: false-veto is recoverable (re-run experiment with re-trained model); false-ship is unrecoverable (customer harm at scale). Symmetric error costs do not apply; asymmetry favors conservative-on-ship. Use IUT for guardrails; pre-register the family-wise correction method in the YAML.
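
A minimal sketch of the mechanical gate under the two options above, IUT (unadjusted per-guardrail α) versus Bonferroni (α/k). The guardrail names and p-values are placeholders, not the launch readiness data.

# guardrail_gate_sketch.py - illustrative mechanical veto; p-values are placeholders
def gate(guardrail_pvalues: dict, method: str = "iut", alpha: float = 0.05) -> str:
    """A guardrail 'breaches' when its regression is significant at the working alpha."""
    k = len(guardrail_pvalues)
    working_alpha = alpha / k if method == "bonferroni" else alpha   # IUT: unadjusted per test
    breaches = [name for name, p in guardrail_pvalues.items() if p < working_alpha]
    return f"VETO ({', '.join(breaches)})" if breaches else "SHIP"

# Family-wise error at unadjusted alpha with 5 guardrails: 1 - 0.95**5 = ~0.226
pvals = {"latency": 0.60, "csat": 0.02, "retention_28d": 0.18, "spam": 0.90, "per_cohort": 0.30}
print(gate(pvals, method="iut"))          # a p=0.02 breach fires the veto at unadjusted alpha
print(gate(pvals, method="bonferroni"))   # the same p=0.02 would not fire at alpha/5 = 0.01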

Amazon Product Lens Callout — Customer Obsession + Have Backbone, Disagree and Commit + Earn Trust

Customer Obsession: when CSAT regresses, the user is telling us in their own words that they liked it less. Engagement is what they did; CSAT is what they meant. When the two disagree, customer-meaning wins.

Have Backbone, Disagree and Commit: the veto goes against the PM's preference. Maintaining the veto is the role-defining moment. The PM is right that CSAT is noisy; the role is to enforce the threshold anyway, because the alternative is guardrails-by-political-weight, which destroys the institution.

Earn Trust: vetos build trust over multiple quarters. Teams that veto consistently on pre-registered thresholds earn a reputation for shipping launches that actually work in production. The reverse is also true: teams that override their thresholds politically lose credibility with leadership over 3-6 quarters. The mechanical veto is a long-game investment.

Six-pager: "Our launch criteria are pre-declared and mechanical. Of the last 12 reranker / recsys experiments, we declined to ship 3 (25%). Of the 9 we shipped, 8 maintained their lift at 28-day post-launch (89% retention rate). The 3 we declined to ship would, by retrospective analysis, have regressed retention by 0.4-0.8% each — a $X-million-per-year cost we avoided. The discipline cost: 3 more quarters of slower velocity. The discipline yield: $X million per year in avoided regression. NPV positive."


AML-06: Cohort Fairness & Locale Stratification

TL;DR

Aggregate +2.8% CTR (significant, p<0.01). Stratified: EN +4.1%, mixed +2.5%, JP -8.2%. JP is 30% of traffic. The aggregate veto-passes; the per-cohort guardrail (declared at -3% per cohort) fires. The Applied ML Engineer recommends abort + redesign with JP-stratified loss, accepting 4 weeks of velocity cost to preserve trust with the strategically-important JP user base.

As a / I want / So that

As an Applied ML Engineer accountable for cohort fairness in MangaAssist I want to detect cohort regressions before promotion, enforce per-cohort guardrails, and prescribe redesign rather than carve-outs So that the JP user base — strategically critical, ap-northeast-1 data residency, primary catalog locale — does not experience second-class model quality

Customer Pain (Working Backwards)

Hiroshi, 42, lives in Osaka. Japanese is his only language. He's been using MangaAssist for a year, mostly on his phone during commutes. After a recent launch, the bot's recommendations feel less relevant — he can't articulate why, but he stopped following the bot's "you might like" suggestions. He still uses the bot for direct queries. Over 12 weeks, his session count drops from ~5/week to ~2/week. He hasn't churned, but his engagement is hollow.

The customer pain on cohort-fairness scenarios is concentrated and largely invisible to the team. Hiroshi is one of millions of JP users whose model quality regressed; the aggregate metric averages his loss against EN gains. Without stratification, the team never sees Hiroshi.

Hypothesis & Why-We-Believe

H0 A reranker change with aggregate +2.8% CTR has comparable per-cohort lift across locales (EN, JP, mixed) within ±2pp absolute
H1 The change has cohort-asymmetric impact: EN gains by ≥2pp, JP loses by ≥3pp; per-cohort guardrail veto fires
Prior evidence Training data is 78% EN, 22% JP. Two-tower model item embeddings cluster JP titles less well than EN titles. Cold-start fallback is genre-popularity, biased toward EN-titled global hits.
Disconfirming evidence If JP cohort regression is within sampling noise (cohort n underpowered), the asymmetry is artifact, not real
Prior probability of H1 0.6 (moderate-high; training-data imbalance + embedding clustering issue + popularity bias)

Experiment Design

Field Choice
Population All MangaAssist users; cohort dimensions: locale {EN, JP, mixed} × tenure {new < 30d, returning ≥ 30d} × device {mobile, desktop, app}
Stratified sample size Powered for the smallest cohort of interest (mixed-new-mobile, ~3% of traffic)
Per-cohort MDE -3% relative on per-cohort primary metric
Per-cohort guardrail -3% relative regression triggers veto
Multilevel analysis random treatment slopes per cohort (multilevel model fit to experiment data)
Pre-registration cohort dimensions and thresholds locked in YAML before experiment starts
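
What "locked in YAML" looks like in practice is a small, serialized artefact written before the experiment starts. The sketch below is illustrative; the field names, values, and output path are assumptions, not the team's actual schema.

# preregister_cohort_guardrails.py
# Illustrative sketch of the locked pre-registration artefact; field names,
# values, and the output path are assumptions.
import yaml  # PyYAML

prereg = {
    "experiment": "us-mle-02-reranker-cohort-eval",
    "cohort_dimensions": {
        "locale": ["EN", "JP", "mixed"],
        "tenure": ["new_lt_30d", "returning_ge_30d"],
        "device": ["mobile", "desktop", "app"],
    },
    "per_cohort_mde_rel": -0.03,        # powered for the smallest cohort of interest
    "per_cohort_guardrail_rel": -0.03,  # breach triggers a mechanical veto
    "analysis": "multilevel model, per-cohort random treatment slopes",
}

with open("cohort_guardrails.yaml", "w") as fh:
    yaml.safe_dump(prereg, fh, sort_keys=False)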

The cohort stratification table

Cohort % traffic Pre primary Post primary Δ rel Per-cohort guardrail Status
EN — total 55% 0.183 0.190 +4.1% -3%
EN — new 12% 0.171 0.180 +5.3% -3%
EN — returning 43% 0.186 0.192 +3.8% -3%
JP — total 30% 0.196 0.180 -8.2% -3% VETO
JP — new 7% 0.182 0.165 -9.3% -3%
JP — returning 23% 0.200 0.184 -7.8% -3%
mixed — total 15% 0.178 0.183 +2.5% -3%
AGGREGATE 100% 0.184 0.189 +2.8% n/a (primary clears)

The aggregate looks great. Both JP sub-cohorts breach the per-cohort guardrail. The launch vetoes.
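
The stratified check itself is a few lines once the experiment table exists. A sketch, assuming a per-turn parquet with cohort_locale, treatment (coded 0/1), and primary_metric columns; the column names and file path are hypothetical.

# cohort_guardrail_scan.py
# Sketch of the stratified per-cohort check; columns and path are hypothetical.
import pandas as pd

GUARDRAIL_REL = -0.03   # -3% relative regression per cohort, as pre-registered

df = pd.read_parquet("experiment_us_mle_02.parquet")   # one row per user-turn

summary = (
    df.groupby(["cohort_locale", "treatment"])["primary_metric"]
      .mean()
      .unstack("treatment")                      # columns: 0 = control, 1 = treatment
      .rename(columns={0: "control", 1: "treatment"})
)
summary["delta_rel"] = summary["treatment"] / summary["control"] - 1.0
summary["veto"] = summary["delta_rel"] < GUARDRAIL_REL

print(summary)
if summary["veto"].any():
    print("VETO: cohorts breaching the per-cohort guardrail:",
          list(summary.index[summary["veto"]]))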

Diagnosing the cohort asymmetry

graph TB
    A[JP cohort -8.2%]
    A --> Q1{Training data<br/>imbalance?}
    Q1 -->|Yes 78%-22%| FIX1[Stratified loss<br/>reweighting]
    A --> Q2{Item embedding<br/>clustering?}
    Q2 -->|Yes - JP titles less clustered| FIX2[JP-specific embedding<br/>fine-tune]
    A --> Q3{Cold-start<br/>popularity bias?}
    Q3 -->|Yes - EN-titled global hits dominate| FIX3[JP-locale-aware<br/>cold-start]
    A --> Q4{Cross-encoder<br/>training signal<br/>multilingual?}
    Q4 -->|Partially - tokenizer shared| FIX4[Multilingual<br/>fine-tune]

    FIX1 --> COMBINED[Combined fix:<br/>JP-stratified retraining<br/>~4 engineer-weeks]:::fix
    FIX2 --> COMBINED
    FIX3 --> COMBINED
    FIX4 --> COMBINED

    classDef fix fill:#2d8,stroke:#333

Two paths forward:

Option Description Cost Pros Cons
A JP-stratified retraining with reweighted loss + JP-specific cold-start 4 engineer-weeks Fixes underlying bias; JP serves at parity going forward 4 weeks of velocity loss
B Ship EN-only with JP carve-out flag (JP keeps old reranker) 1 engineer-week Fast; preserves EN gains Public commitment to JP second-class model; long-term trust erosion

Recommendation: Option A. Reasoning: JP is strategically important (data residency, primary catalog locale, engaged users). A JP carve-out is a public commitment to second-class treatment that erodes trust over many launches. The 4 weeks of velocity cost is paid this quarter to avoid 4 quarters of erosion.

Architecture / Wiring

graph LR
    EXP[Experiment results]
    EXP --> AGG[Aggregate analysis]
    EXP --> STRAT[Stratified per-cohort analysis]

    STRAT --> EN[EN cohort]
    STRAT --> JP[JP cohort]
    STRAT --> MX[mixed cohort]

    EN --> CHECK{Per-cohort<br/>guardrail<br/>cleared?}
    JP --> CHECK
    MX --> CHECK

    CHECK -->|All pass| SHIP[Ship]:::ship
    CHECK -->|Any fail| VETO[VETO + redesign]:::veto

    VETO --> A[Option A: stratified retrain]:::fix
    VETO --> B[Option B: carve-out]:::reject

    A --> RETRAIN[4 eng-wk JP-stratified retrain]
    B --> CARVEOUT[1 eng-wk; rejected for trust reasons]

    classDef ship fill:#2d8,stroke:#333
    classDef veto fill:#f66,stroke:#333
    classDef fix fill:#2d8,stroke:#333
    classDef reject fill:#9cf,stroke:#333

Rollout Plan

Stage Action Timeline
Day 0 Veto fires on JP cohort; experiment paused immediately
Day 1-3 Cohort-asymmetry root-cause analysis 3 days
Day 4 Decision document: Option A (stratified retrain) 1 day
Week 2-5 JP-stratified retraining: data prep, training, eval 4 weeks
Week 6 Re-run A/B with new model; expect EN +3-4%, JP +0-2% (parity floor) 14 days
Week 8 Ship-decision based on re-run results

Metrics Dashboard

Metric Formula Target Per-cohort threshold Status
Aggregate primary useful-answer +3% rel n/a
EN cohort primary EN-only useful-answer ≥ 0% rel -3% rel
JP cohort primary JP-only useful-answer ≥ 0% rel -3% rel ❌ -8.2%
mixed cohort primary mixed-only useful-answer ≥ 0% rel -3% rel
Cohort sample size adequacy per-cohort n vs needed n ≥ 1.0× < 0.8× (underpowered) ⚠️ check
Multilevel random-slope variance between-cohort treatment-effect variance low high (significant heterogeneity) high
Worst-cohort floor min cohort metric / aggregate metric ≥ 0.95 < 0.95 breached

Real-Incident Vignette

The team chose Option A. The 4-week JP-stratified retraining used: (1) loss reweighting to balance JP/EN ratios in cross-encoder training, (2) JP-specific embedding fine-tune with ABSA aspect signals, (3) JP-locale-aware cold-start fallback. Re-run after retrain: aggregate +3.6% (up from the original +2.8%), EN +3.9% (slightly below the original +4.1%), JP +1.2%, mixed +2.4%. Per-cohort guardrails clear. Shipped. The 28-day retention impact: aggregate +0.3pp, JP cohort retention preserved (no regression). The decision to insist on cohort fairness traded a small slice of EN gain for a large slice of preserved JP trust. The retrospective showed JP cohort engagement increased slightly — the previous model's JP under-service had been a quiet drag.

Cross-Story Dependencies

  • Consumes: AML-03 (hypothesis design) — pre-registered cohort dimensions and per-cohort thresholds.
  • Consumes: AML-05 (guardrails) — the per-cohort guardrail is one of the five family-wise tests.
  • Cross-references: US-MLE-08 cover-art classifier — same cohort-fairness pattern applies to image-classification surfaces; manga-style preferences vary by locale.
  • Cross-references: ../Ground-Truth-Evolution/ ML-Scenarios — the multilingual ground-truth evolution scenario covers training-data imbalance over time.

Master's-DS Depth Callout — Simpson's Paradox + Multilevel Modeling + Conditional MDE

Aggregate (marginal) effects average over cohort distributions; cohort-conditional effects condition on cohort. The two can disagree (Simpson's paradox) — aggregate +2.8% with JP -8% is a textbook case. Multilevel / hierarchical models — Y = Xβ + Z·u + ε where u is per-cohort random effect — give you both simultaneously and partial-pool the cohort estimates (small cohorts borrow from the global mean, regularizing noisy per-cohort signals).

# multilevel_cohort_eval.py
import statsmodels.formula.api as smf
import pandas as pd

df = pd.read_parquet('experiment_us_mle_02.parquet')
model = smf.mixedlm(
    "outcome ~ treatment",
    df,
    groups=df["cohort_locale"],
    re_formula="~treatment"  # per-cohort random treatment slopes
).fit()
print(model.summary())

# Look at: random-effect variance for treatment (between-cohort heterogeneity)
# Significant variance → cohorts disagree → per-cohort decisions required

Conditional MDE: a 3% MDE at aggregate is achievable with n_aggregate. Detecting the same 3% MDE inside a cohort that is 7% of traffic requires roughly n_aggregate users within that cohort, which means running the experiment on n_aggregate / 0.07 users overall — about 14× more total traffic. If the experiment is sized for aggregate detection only, cohort claims are noise. The team that ships aggregate-positive on under-powered cohort signals is shipping false safety.
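
A quick way to make the conditional-MDE point concrete is to run the power calculation both ways. A sketch using statsmodels power analysis; the baseline rate, MDE, and cohort share below are illustrative.

# cohort_power_check.py
# Sketch only; baseline rate, MDE, and cohort share are illustrative numbers.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.184          # aggregate primary metric (proportion-like)
mde_rel = 0.03            # 3% relative MDE
cohort_share = 0.07       # cohort is 7% of traffic

es = proportion_effectsize(baseline, baseline * (1 + mde_rel))
n_per_arm = NormalIndPower().solve_power(effect_size=es, alpha=0.05, power=0.8,
                                         ratio=1.0, alternative="two-sided")

print(f"Per-arm n to detect 3% rel at aggregate:    {n_per_arm:,.0f}")
print(f"Per-arm n needed INSIDE the cohort:         {n_per_arm:,.0f}")
print(f"Total per-arm traffic so the cohort gets there: "
      f"{n_per_arm / cohort_share:,.0f} (~{1 / cohort_share:.0f}x)")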

Amazon Product Lens Callout — Earn Trust + Insist on Highest Standards + Success and Scale Bring Broad Responsibility

Earn Trust with cohorts: shipping aggregate-positive while JP regresses erodes JP trust over many launches. Each launch is small; cumulative erosion is large. The discipline of cohort fairness is a multi-launch investment in the JP-cohort trust account.

Insist on the Highest Standards is the willingness to delay shipping by 4 weeks to fix a cohort issue. Most teams ship with the cohort regression flagged as a "follow-up". Most "follow-ups" never happen. The Applied ML Engineer who insists on cohort-clean launches is the one whose launches still work in production a year later.

Success and Scale Bring Broad Responsibility — the LP added in 2021. As MangaAssist scales, JP isn't just 30% of traffic; in absolute numbers it's millions of users. A 3% regression is hundreds of thousands of users having a worse experience. Scale is what makes cohort fairness a responsibility, not a nice-to-have.

Six-pager: "At MangaAssist scale, a 3% cohort regression on JP traffic affects [N] users monthly. We do not ship cohort regressions at scale, even when aggregate metrics are positive. The JP cohort is strategically critical (ap-northeast-1 data residency, primary catalog locale, second-most-engaged cohort by tenure). Carving out JP from the launch is a public commitment to second-class treatment we will not make. We are paying 4 weeks of velocity to deliver the launch at parity for all cohorts."


AML-07: Production Integration & Latency Budgets

TL;DR

The chatbot turn budget is 800ms p95 from user-message-received to first-streaming-token-out. The new reranker takes 240ms p95 (up from 180ms). The Applied ML Engineer's job is to allocate the 800ms across stages, place the reranker so it fits, design the fallback chain, and instrument per-stage telemetry so a future incident triages in minutes not hours.

As a / I want / So that

As an Applied ML Engineer integrating ML predictions into the chatbot turn pipeline I want to allocate latency budget per stage, design fallback chains, and instrument telemetry that supports rapid diagnosis So that the chatbot meets its 800ms p95 SLO under load and degrades gracefully when ML components fail

Customer Pain (Working Backwards)

Mia, 22, types her query and waits. If the bot takes more than 1 second to start streaming, she perceives it as broken. By definition, at an 800ms p95, 5% of users wait longer than 800ms; let p95 drift to 1200ms and the fraction of users crossing that 1-second "broken" threshold grows by an order of magnitude. Mia at the 95th percentile is a stress test of the SLO; Mia at the 99th percentile is the customer who decides the bot is slow. Latency is not a backend concern — it is the user's first-impression metric.

The customer pain is binary: under the SLO, the chatbot feels responsive; over it, the chatbot feels broken. The Applied ML Engineer who fights for the latency budget is the one defending Mia's perception of the product.

Hypothesis & Why-We-Believe

H0 The new reranker (240ms p95) fits in the 800ms turn budget without compromising other stages
H1 At 240ms reranker p95, the turn-budget total exceeds 800ms p95 unless other stages are tightened or fallbacks are added
Prior evidence Current pipeline at 720ms p95 (180ms reranker); +60ms reranker + propagation effects could push to 800-840ms; load-test baseline measurement available
Disconfirming evidence If load-test shows 240ms reranker only adds 50ms to turn budget (because of pipeline parallelism), then total stays at ~770ms
Prior probability of H1 0.7 (high; serial stages dominate)

The 800ms turn budget allocation

graph LR
    U[User msg<br/>received]
    U -->|t=0| P[1. Input parse<br/>+ PII redact<br/>30ms]
    P --> I[2. Intent classify<br/>US-MLE-01<br/>50ms]
    I --> R[3. Retrieval<br/>OpenSearch<br/>kNN+BM25 RRF<br/>220ms]
    R --> RR[4. Rerank<br/>cross-encoder<br/>240ms — NEW]
    RR --> FM[5. FM call<br/>Bedrock Claude<br/>first-token streaming<br/>200ms]
    FM --> FMT[6. Format + emit<br/>40ms]
    FMT --> O[Out: first<br/>streaming token]

    classDef new fill:#fd2,stroke:#333
    classDef ml fill:#9cf,stroke:#333
    classDef stage fill:#fff,stroke:#333

    class RR new
    class I,RR ml
    class P,R,FM,FMT stage

Stage Allocated p95 Notes
1. Input parse + PII redact 30ms regex-based; deterministic
2. Intent classify (US-MLE-01) 50ms DistilBERT, ml.g5.xlarge real-time endpoint, MME
3. Retrieval 220ms OpenSearch kNN+BM25 RRF; HNSW ef_search=128
4. Rerank (NEW) 240ms MiniLM-L12 cross-encoder, ml.g5.2xlarge MME, top-K=30
5. FM call (first-token streaming) 200ms Bedrock Claude, prefix cache hit ~85%
6. Format + emit 40ms template render + WebSocket emit
Sum 780ms within 800ms SLO with 20ms headroom

The serial-stage sum is 780ms — under SLO. But p95 doesn't compose linearly; the more stages, the worse the tail. Empirically, an 800ms p95 SLO requires ~720ms p95 sum-of-stages on a serial pipeline, because tail effects compound. The team must:

  1. Run parallel stages where possible (intent classify + retrieval are independent in some flows; can be fired in parallel for ~50ms savings).
  2. Add fallback chains for stages that occasionally spike.
  3. Reduce reranker top-K from 30→20 if budget pressure remains (saves ~50ms).

The fallback chain

graph TB
    U[User turn]
    U --> S1{Reranker<br/>responding<br/>< 280ms?}
    S1 -->|Yes| R1[Use reranked<br/>top-5]:::happy
    S1 -->|No, > 280ms| S2{Reranker<br/>responding<br/>< 500ms?}
    S2 -->|Yes| R2[Use reranked<br/>top-5 with budget warning]:::warn
    S2 -->|No, timeout| FALLBACK[Fall back to<br/>BM25-only ranking]:::fallback
    FALLBACK --> EMIT[Emit + log<br/>fallback engagement<br/>increment counter]:::fallback

    R1 --> NORMAL[Normal flow]
    R2 --> NORMAL
    EMIT --> NORMAL

    classDef happy fill:#2d8,stroke:#333
    classDef warn fill:#fd2,stroke:#333
    classDef fallback fill:#f66,stroke:#333

The fallback chain ensures graceful degradation:

  • Reranker responds within 280ms (p95 budget is 240ms) — use reranked results (happy path).
  • Reranker responds in 280-500ms — use reranked results, log a budget warning (acceptable degradation).
  • Reranker exceeds 500ms — fall back to BM25-only ranking and increment the fallback-engagement counter (mechanical degradation).

Fallback-engagement counter is a leading indicator: if it spikes from 0.1%/min to >1%/min, an incident is in progress (see AML-08).
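
The wiring for the chain is a timeout plus a counter, not a new service. A sketch of the reranker call with soft and hard budgets, assuming asyncio-based turn handling; call_reranker and bm25_rank are hypothetical stand-ins for the SageMaker client and the retrieval-order fallback.

# rerank_with_fallback.py
# Sketch of the timeout/fallback wiring; the client functions are hypothetical stand-ins.
import asyncio

RERANK_SOFT_BUDGET_S = 0.28   # log a warning past this
RERANK_HARD_BUDGET_S = 0.50   # fall back to BM25 past this

async def call_reranker(candidates):      # placeholder for the SageMaker MME call
    await asyncio.sleep(0.2)
    return candidates[:5]

def bm25_rank(candidates):                # placeholder: keep retrieval order
    return candidates[:5]

async def rank_with_fallback(candidates, metrics):
    start = asyncio.get_running_loop().time()
    try:
        ranked = await asyncio.wait_for(call_reranker(candidates),
                                        timeout=RERANK_HARD_BUDGET_S)
    except asyncio.TimeoutError:
        metrics["fallback_engagement"] += 1   # leading indicator for AML-08
        return bm25_rank(candidates)
    if asyncio.get_running_loop().time() - start > RERANK_SOFT_BUDGET_S:
        metrics["rerank_budget_warnings"] += 1
    return ranked

async def main():
    metrics = {"fallback_engagement": 0, "rerank_budget_warnings": 0}
    top5 = await rank_with_fallback(list(range(30)), metrics)
    print(top5, metrics)

asyncio.run(main())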

Experiment Design — load test before production

Field Choice
Test environment Staging cluster, identical to prod, ap-northeast-1
Load profile 1×, 2×, 5× peak prod RPS — sustained 30 min each
Latency targets p50, p95, p99 per-stage AND turn-total
Failure modes tested Reranker timeout (induced), retrieval timeout, FM timeout, partial outage of one MCP tool
Pass criteria turn p95 ≤ 800ms at 2× peak; fallback engagement < 0.5% at 1× peak
Pre-registration load-test config locked before production rollout decision

Architecture / Wiring (full pipeline)

graph TB
    USR[User WebSocket message]
    USR --> GW[API Gateway]
    GW --> APP[Chatbot turn handler<br/>ECS Fargate]

    APP --> S1[1. Parse + PII<br/>30ms]
    S1 --> S2A[2. Intent classify<br/>SageMaker MME<br/>US-MLE-01<br/>50ms]
    S1 --> S2B[3. Retrieval<br/>OpenSearch<br/>220ms]

    S2A --> S3[Combine results<br/>intent-routed retrieval]
    S2B --> S3

    S3 --> S4[4. Rerank<br/>SageMaker MME<br/>US-MLE-02<br/>240ms]
    S4 --> S4F{Reranker timeout?}
    S4F -->|No| S5[5. FM call<br/>Bedrock Claude<br/>200ms first-token]:::happy
    S4F -->|Yes| S4FB[BM25 fallback]:::fallback
    S4FB --> S5

    S5 --> S6[6. Format + emit<br/>40ms]:::stage
    S6 --> WS[WebSocket emit<br/>first token]

    classDef happy fill:#2d8,stroke:#333
    classDef fallback fill:#f66,stroke:#333
    classDef stage fill:#fff,stroke:#333

Rollout Plan

Stage Population Latency abort criteria
Pre-prod Load test 1×, 2×, 5× peak turn p95 > 850ms at 2× peak
1% canary 1% of users turn p95 > 850ms; fallback engagement > 1%/min
10% 10% of users turn p95 > 820ms
50% A/B 50% of users turn p95 > 800ms (the SLO)
100% full rollout continuous monitoring

Metrics Dashboard

graph LR
    D[Production telemetry]
    D --> T[Turn-level<br/>p50/p95/p99]
    D --> S[Per-stage<br/>p50/p95/p99]
    D --> F[Fallback<br/>engagement<br/>per stage]
    D --> E[Error rate<br/>per stage]
    D --> C[Cohort latency<br/>JP / EN / mobile / desktop]

Metric Formula Target Alert
Turn p95 latency p95 across all turns, 5-min rolling ≤ 800ms > 800ms 5min sustained
Turn p99 latency p99 across all turns ≤ 1500ms > 1800ms 5min
Reranker p95 per-stage p95 ≤ 240ms > 280ms 5min
Fallback engagement (reranker) timeout fallbacks / total turns < 0.1%/min > 1%/min
Cohort latency split turn p95 by locale × device within 10% of aggregate > 20% deviation
Error rate per stage errors / requests per stage < 0.05% > 0.5%
FM cold-start fraction first-token > 500ms < 5% > 15%

Real-Incident Vignette

Three weeks after launch, the team notices fallback engagement crept from 0.1%/min to 0.4%/min during JP peak hours. Investigation: SageMaker MME endpoint for the reranker has uneven shard load — JP traffic concentrates on one model variant that is over-subscribed. The fix: split the reranker MME by locale, scale JP variant to 2× capacity. Fallback engagement returns to 0.1%/min within 30 min of deployment. The pre-built dashboard with cohort-stratified latency was the artefact that surfaced the problem; without it, the issue would have shown as a slow JP-cohort engagement decline weeks later.

Cross-Story Dependencies

  • Consumes: AML-05 (guardrails) — the latency guardrail (≤ 800ms turn p95) is one of the five family-wise tests.
  • Consumes: AML-06 (cohort fairness) — cohort-stratified latency monitoring catches per-locale degradation.
  • Produces: telemetry (per-stage, per-cohort latency + fallback engagement) that AML-08 (incident triage) consumes.
  • Cross-references: ../RAG-MCP-Integration/08-mcp-orchestration-router.md for orchestration patterns; US-MLE-02 reranker training pipeline for the model surface.

Master's-DS Depth Callout — Tail-Latency Composition and Hedged Requests

Naive latency math: total p95 = sum of per-stage p95. This is wrong — p95 does not compose additively. If stages were truly independent, the additive estimate would usually be pessimistic, because the per-stage tails rarely line up on the same turn; in practice stages are not independent — load bursts, cold starts, and shared dependencies correlate the tails — so the real turn p95 compounds non-linearly and can land above the additive number.

The right model: simulate the pipeline with empirically-measured per-stage latency distributions (pull 14 days of production telemetry, sample with replacement to construct turn timelines). Total p95 from simulation gives the realistic total. Compared to additive p95, simulation typically shows the true p95 is 5-15% higher than additive estimate.
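
A sketch of that simulation, assuming 14 days of per-turn stage latencies in a parquet file and the new reranker's load-test latencies in a NumPy file; the column names and paths are hypothetical. Turns are resampled as whole rows so cross-stage correlation in the observed traffic is preserved, and only the reranker column is swapped for the new model's distribution.

# turn_p95_simulation.py
# Sketch only; column names and file paths are hypothetical.
import numpy as np
import pandas as pd

stages = ["parse", "intent", "retrieval", "rerank", "fm_first_token", "format_emit"]
telemetry = pd.read_parquet("turn_stage_latencies_14d.parquet")   # one row per turn, ms
new_rerank = np.load("reranker_loadtest_latencies_ms.npy")        # new model, load test

rng = np.random.default_rng(7)
n_sim = 200_000
rows = telemetry[stages].sample(n_sim, replace=True, random_state=7).to_numpy()

# Swap the observed reranker column for samples from the new model's distribution.
rerank_idx = stages.index("rerank")
rows[:, rerank_idx] = rng.choice(new_rerank, size=n_sim)

turn_totals = rows.sum(axis=1)
print("simulated turn p95:", np.percentile(turn_totals, 95), "ms")
print("additive per-stage p95:",
      telemetry[stages].quantile(0.95).drop("rerank").sum()
      + np.percentile(new_rerank, 95), "ms")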

Hedged requests for tail-latency reduction: send the same request to two replicas of a stage; use whichever responds first. Reduces p99 by 30-50% at 2× cost. Only worth it when p99 dominates customer-experience metrics. For chatbot turn pipeline, hedging the reranker call could reduce p99 from 1200ms to 800ms at the cost of 2× reranker compute. Pre-register the hedging decision; A/B-test it; ensure the fallback chain still works (hedge response cancels the slower call).
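
A minimal sketch of the hedged call, assuming two interchangeable reranker replicas behind an asyncio client; call_reranker is a hypothetical stand-in. The first response wins and the slower call is cancelled.

# hedged_rerank.py
# Sketch only; call_reranker is a hypothetical stand-in for the real replica clients.
import asyncio

async def call_reranker(candidates, replica):
    await asyncio.sleep(0.12 if replica == "a" else 0.45)   # simulated replica latencies
    return candidates[:5]

async def hedged_rerank(candidates):
    tasks = [asyncio.create_task(call_reranker(candidates, r)) for r in ("a", "b")]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()          # don't pay for the loser's result
    return done.pop().result()

print(asyncio.run(hedged_rerank(list(range(30)))))

A common production variant delays the second request until the first has already exceeded the p95 deadline, which keeps most of the p99 benefit while capping the extra compute well below 2×.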

Amazon Product Lens Callout — Deliver Results + Frugality

Deliver Results: the chatbot has to actually work. Latency is the result; if turn p95 exceeds SLO, the launch failed regardless of model quality. The Applied ML Engineer who fights for latency budgets is the one defending the launch's actual outcome.

Frugality: the cheapest latency win is the one you don't need to engineer. Reducing reranker top-K from 30 to 20 saves 50ms of latency at minimal NDCG cost. Caching repeated queries saves entire stages. The Applied ML Engineer's frugal moves are the multi-quarter latency wins that compound.

Six-pager: "Turn-budget allocation for the reranker upgrade: 30 / 50 / 220 / 240 / 200 / 40 = 780ms p95 sum-of-stages, simulated total turn p95 of 815ms. To meet 800ms SLO, we (1) parallelize intent + retrieval (saves 50ms), (2) reduce reranker top-K from 30 to 20 (saves 50ms, costs 0.4% NDCG), (3) implement BM25-only fallback for reranker timeouts (graceful degradation). Final simulated turn p95: 770ms with 30ms safety margin. Load-tested at 2× peak; cohort-stratified telemetry live before rollout."


AML-08: Incident Triage — 'The Model Got Worse'

TL;DR

It's 3am. PagerDuty fires: reranker NDCG@10 dropped from 0.78 to 0.61 in the last hour. Traffic and latency are normal. The Applied ML Engineer is on call. The wrong move is random root-causing. The right move is the named-decision-tree triage: did anything change → upstream healthy → feature drift → eval staleness → UI change → query-distribution shift → escalate. MTTR target: localize in 15 minutes, mitigate in 60.

As a / I want / So that

As an Applied ML Engineer on call for production ML systems I want to localize a model-quality incident to a named root-cause category in 15 minutes and mitigate within 60 minutes So that customer-facing impact is bounded and the team has a clean root-cause to address in the post-incident review

Customer Pain (Working Backwards)

Tatsuya, 39, opens MangaAssist at 6am Tokyo time before his commute. He searches for "psychological mystery" — a query he runs weekly. The bot used to surface "Monster," "Death Note," "Erased" as the top three. This morning, the top three are unrelated action titles. He doesn't know there's an incident; he just sees that the bot is "having a bad day" and doesn't engage. If the incident lasts 6 hours, ~50,000 customer interactions are affected. If it lasts 24 hours, 200,000+. The MTTR is the customer-impact bound.

The customer pain on incidents is concentrated, immediate, and silent — Tatsuya doesn't file a ticket; he just disengages. The Applied ML Engineer's MTTR target is the protection against accumulated customer disengagement.

Hypothesis & Why-We-Believe

(Note: incidents are not hypothesis-driven; they are diagnostic. The "hypothesis" here is the active investigation hypothesis at each step of the triage tree.)

Step Active hypothesis Test
1 "A change deployed in last 24h caused this" Check git log + deploy timeline + model registry
2 "Upstream data corruption" Check data-platform alarms + row counts + schema
3 "Feature distribution drifted" Check feature distribution monitors
4 "Eval set stale, false alarm" Compare eval-set age to catalog turnover
5 "UI change altered click distribution" Check frontend deploy timeline
6 "External traffic-distribution shift" Check marketing campaign / external event timeline
7 "Unknown — escalate" Page staff/principal; assemble incident response team

The triage decision tree (full version)

graph TB
    A[3am ALARM<br/>NDCG@10 0.78 → 0.61<br/>traffic / latency normal]
    A --> Q1{Did anything<br/>change last 24h?}
    Q1 -->|Code deploy| RB1[Roll back deploy<br/>verify recovery]:::action
    Q1 -->|Model registry change| RB2[Revert model version<br/>verify recovery]:::action
    Q1 -->|Config change| RB3[Revert config<br/>verify recovery]:::action
    Q1 -->|Prompt template change| RB4[Revert prompt version]:::action
    Q1 -->|Nothing changed| Q2

    Q2{Upstream data healthy?}
    Q2 -->|Row counts off| UP1[Page data-platform team<br/>quarantine bad data]:::action
    Q2 -->|Schema drift| UP2[Quarantine + reprocess]:::action
    Q2 -->|Healthy| Q3

    Q3{Feature distribution<br/>drifted?}
    Q3 -->|Yes| FD[Freeze model decisions<br/>investigate drift root]:::action
    Q3 -->|No| Q4

    Q4{Eval set stale?<br/>compare age vs turnover}
    Q4 -->|Eval is stale| EVAL[Refresh eval; metric may be<br/>false alarm — but verify with<br/>online behavioral signal]:::warn
    Q4 -->|Eval fresh| Q5

    Q5{UI change?<br/>click distribution shift?}
    Q5 -->|Frontend deploy| UI[Coordinate with frontend;<br/>UI may have altered<br/>click distribution]:::warn
    Q5 -->|No UI change| Q6

    Q6{External traffic<br/>distribution shift?}
    Q6 -->|Marketing campaign| EXT[Wait it out / stratify metric<br/>if persistent, scope re-train]:::warn
    Q6 -->|External event<br/>anime tie-in release| EXT2[Same as above]:::warn
    Q6 -->|Neither| ESC[Escalate to staff/principal<br/>schedule incident review]:::escalate

    classDef action fill:#2d8,stroke:#333
    classDef warn fill:#fd2,stroke:#333
    classDef escalate fill:#f66,stroke:#333

The named root-cause categories — quick reference

# Category Detection signal Typical fix MTTR target
1 Code deploy regression git log + deploy timeline overlay Roll back deploy 15 min
2 Model version regression model registry diff Revert model version 30 min
3 Config change regression config-store diff Revert config 20 min
4 Upstream data corruption data-platform alarms; row counts; schema check Quarantine + reprocess 2-4h
5 Feature drift feature distribution monitors KL > threshold Investigate drift; retrain if structural 1-3 days
6 Label drift label distribution + ground-truth scenarios Re-labeling cycle days-weeks
7 UI change frontend deploy timeline; click-distribution diff Coordinate frontend; possibly retrain 1-7 days
8 Prompt-template regression prompt-template diff Revert prompt version 30 min
9 Eval-set staleness (false alarm) eval-set age vs catalog turnover Refresh eval set; re-measure 1 day
10 Query-distribution shift external traffic-source diff Wait, or stratify metric hours-days

The 15 / 60 / 4-hour escalation playbook

Window What Who acts Who's notified
0–15 min Localize: which named category? On-call AML Eng (silent)
15–60 min If category 1, 2, 3, 8: roll back. If category 4: page data team. If 5, 6, 7: freeze + investigate On-call + 1 platform partner EM (status update)
60 min – 4h If unresolved: incident commander mode; assemble team On-call + EM + ML platform owner Director (if cust-impact > X)
4+h Full incident review; customer comms EM owns coordination Director / VP, public if needed

Architecture / Wiring — the 3am dashboard

graph TB
    subgraph Dashboard[3am Dashboard — pre-built]
        M1[Primary metric<br/>last 7d / 24h / 60min]
        M2[Recent change log<br/>code deploys / model versions /<br/>config / prompt versions overlaid]
        M3[Upstream data health<br/>row counts / schema / last-seen]
        M4[Feature distribution drift<br/>top-10 features / KL rolling]
        M5[Cohort breakdown<br/>locale × device × tenure]
        M6[Latency + error rates<br/>per-stage]
    end

    M1 --> DECIDE[Triage decision tree]:::critical
    M2 --> DECIDE
    M3 --> DECIDE
    M4 --> DECIDE
    M5 --> DECIDE
    M6 --> DECIDE

    DECIDE --> ACTION[Localize root cause<br/>15 min target]

    classDef critical fill:#f66,stroke:#333

Rollout Plan (for incident readiness, not for an incident itself)

Phase Action Timeline
Pre-incident Build the 3am dashboard; populate change-log overlay one-time investment
Pre-incident Document named-root-cause categories + decision tree one-time
Pre-incident Run quarterly fire drills using past incidents quarterly
During incident Apply decision tree mechanically; document each step in incident channel per-incident
Post-incident Run blameless retrospective; identify dashboard / playbook gaps within 5 days
Post-incident Update playbook + portfolio with revealed gaps next quarter

Metrics Dashboard (for incident readiness)

Metric Target Alert
MTTR (mean time to recover) ≤ 30 min for categories 1, 2, 3, 8; ≤ 4h overall quarterly review
MTTD (mean time to detect) ≤ 15 min from regression-onset to alarm-fire > 30 min (improve monitoring)
Categories-localized-in-15-min rate ≥ 70% of incidents < 50% (improve playbook)
False-alarm rate < 10% of pages > 20% (improve detection)
Repeat-incident rate < 5% (same root cause within quarter) > 10% (post-incident actions failing)

Real-Incident Vignette — POC-Production Catastrophe #2 (RAG Recall Collapse)

From ../POC-to-Production-War-Story/02-seven-production-catastrophes.md: the team's RAG retrieval suffered Recall@10 collapse from 0.91 to 0.62 over a 6-hour window. No code deploy. No model change. Standard triage walked through:

  1. Did anything change last 24h? No deploy, no model change, no config change. (Categories 1, 2, 3, 8 ruled out in 15 minutes.)
  2. Upstream data healthy? Row counts on catalog feed normal; schema fingerprint matches. (Category 4 ruled out in 8 more minutes.)
  3. Feature drift? OpenSearch document distribution has shifted on one shard. Suspicion: index issue.
  4. Investigate the shard. The HNSW index on shard-3 is corrupted; query latency is normal, but recall is poor because the graph is malformed.

Root cause: a recent OpenSearch upgrade applied silently overnight had triggered a partial re-indexing on shard-3 that hadn't completed. Force re-index of shard-3, monitor recall recovery, schedule post-incident review.

The Applied ML Engineer's lesson fed back into Primitive 1 (problem-framing for next quarter): "we need a per-shard recall monitor as a leading indicator; the current dashboard aggregates across shards and missed this for 6 hours." The shard-level dashboard was added to the next quarter's portfolio (AML-02).

Cross-Story Dependencies

  • Consumes: AML-07 (production integration) telemetry — per-stage, per-cohort latency + fallback engagement.
  • Consumes: AML-04 (online/offline correlation) signal — if correlation has decayed, that's a slow-burn incident, not a 3am page.
  • Consumes: signals from US-MLE-XX drift hub (referenced via ../ML-Engineer-User-Stories/deep-dives/02-cross-story-platform-deep-dive.md).
  • Produces: post-incident lessons feed back to AML-01 (problem-framing) and AML-02 (portfolio) — gaps revealed in incidents become next quarter's investment.
  • Cross-references: ../Ground-Truth-Evolution/ ML-Scenarios (label drift, sentiment domain shift) for non-3am slow-burn scenarios.

Master's-DS Depth Callout — CUSUM, Change-Point Detection, Anomaly Detection

The default "alarm when metric drops 10% from 24h baseline" is fragile: noisy during high-variance periods, slow detection during gradual degradations. CUSUM (cumulative-sum) and CUSUM-V are change-point detectors that accumulate signed deviations from baseline; they fire on small persistent regressions while ignoring large transient noise.

# cusum_metric_monitor.py
import numpy as np

def cusum(metric_stream, target=None, k=0.5, h=4.0):
    """Returns timestep of detected change-point or None.

    k = slack (units of std dev) to tolerate small drift
    h = decision threshold (units of std dev)
    """
    if target is None:
        target = np.mean(metric_stream[:14])
    sigma = np.std(metric_stream[:14])
    s_pos, s_neg = 0, 0
    for t, x in enumerate(metric_stream[14:], start=14):
        s_pos = max(0, s_pos + (x - target - k*sigma) / sigma)
        s_neg = max(0, s_neg + (target - x - k*sigma) / sigma)
        if s_pos > h or s_neg > h:
            return t
    return None

For AML-08 monitoring: run CUSUM on the primary online metric stratified by cohort. Alarm fires when a cohort's CUSUM exceeds threshold, even if aggregate metric looks fine. This catches the cohort regression before the aggregate metric reflects it.
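
Wiring the detector per cohort is a groupby around the function above. A sketch, assuming an hourly per-cohort metric table with ts / cohort_locale / metric columns; the columns and path are hypothetical, and cusum() is imported from the module defined above.

# cohort_cusum_usage.py
# Sketch only; table columns and path are hypothetical. Each cohort needs at
# least ~15 baseline hours for the in-control estimate used by cusum().
import pandas as pd
from cusum_metric_monitor import cusum

hourly = pd.read_parquet("primary_metric_hourly_by_cohort.parquet")

for cohort, grp in hourly.sort_values("ts").groupby("cohort_locale"):
    change_at = cusum(grp["metric"].to_numpy(), k=0.5, h=4.0)
    if change_at is not None:
        print(f"ALARM {cohort}: change-point at {grp['ts'].iloc[change_at]}")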

Amazon Product Lens Callout — Ownership + Dive Deep + Bias for Action

Ownership is the 3am LP. Nobody else can do this. The Applied ML Engineer who diffuses ownership ("the platform team should know") fails ownership. The role's pager goes off; the role works the incident.

Dive Deep is the discipline of named-root-cause categorization. Random root-causing is the opposite of Dive Deep. The named categories are the model. They were built (over many incidents) by Applied ML Engineers who dove deep enough to recognize "the model got worse" has 10 distinct underlying root causes.

Bias for Action is the rollback decision under uncertainty. When category 1, 2, 3, or 8 is plausible AND a rollback would resolve, roll back first, investigate after. Customer impact is reduced; investigation is preserved. The trap is "but what if we don't have the right cause?" — at 3am, the cost of an unnecessary rollback (some velocity loss) is much smaller than the cost of an extended customer-impact incident.

Six-pager retrospective: "Incident detected at T+0 via cohort-stratified CUSUM alarm; localized to category 5 (feature drift on shard-3) at T+18m via dashboard; mitigated at T+45m by force-reindexing shard-3; full root cause (silent OpenSearch upgrade triggered partial reindex) identified at T+3h. MTTR: 45m. Avoided full rollback by mechanical use of named-decision-tree triage. Investments that paid: 3am dashboard built in Q1, named-categories playbook published in Q2, cohort-stratified CUSUM monitors deployed in Q3."


Cross-Story Composition Notes

The eight scenarios chain in real workflows. Three example compositions:

Composition 1 — Launching the US-MLE-02 reranker change end-to-end

Step Scenario Artefact produced
1 AML-01 Customer letter; "this IS an ML problem (not heuristic)"
2 AML-02 RICE-prioritized; reranker is Q3 ship-1
3 AML-03 Pre-registered YAML: MDE 3%, n=18.4K/arm CUPED, OBF 4 looks, 3 guardrails + per-cohort
4 AML-04 Correlation tracker shows Pearson 0.68 — above 0.5 threshold; offline evidence trustworthy
5 AML-05 Five-guardrail check; if CSAT breaches, mechanical veto
6 AML-06 Cohort-stratified per-cohort guardrails clear or veto
7 AML-07 800ms turn-budget validated via load test; fallback chain in place
8 AML-08 3am dashboard with cohort-stratified CUSUM monitors live before launch

Composition 2 — The AML-08 incident workflow that feeds back to AML-01

Step Scenario Action
0 AML-08 Page fires; on-call AML Eng begins triage
1 AML-08 Walk decision tree; localize to category 5 (feature drift) at T+18m
2 AML-08 Mitigate: quarantine drifted feature stream; metric recovers at T+45m
3 AML-06 Verify recovery uniform across cohorts (no JP-only or new-user-only residual regression)
4 AML-04 Check offline-online correlation post-incident; if decayed, schedule harness rebuild
5 AML-01 Incident review: was the underlying problem misframed? Update framing artefacts
6 AML-02 Add "feature-drift-monitor build" to next quarter's portfolio

Composition 3 — Quarterly OP1 narrative defense

A senior leader asks: "Why did your team ship 2 launches this quarter instead of 4?" The Applied ML Engineer's six-pager defense traverses:

Section Scenario referenced
Tenets AML-01 customer-letter discipline
Portfolio AML-02 picked 3 of 12 with EVOI overlay
Rigor AML-03 sample-size discipline; one experiment vetoed (AML-05)
Quality AML-04 correlation audit; AML-05 guardrail enforcement
Fairness AML-06 cohort-stratified eval
Production AML-07 latency budget; load tests; fallback chains
On-call AML-08 MTTR 45m / 30d retention preserved
Velocity narrative "We shipped 2 of 3 because the third (US-MLE-06 cold-start) revealed a JP-cohort regression we are paying 4 weeks to fix in Q4. The cost of that 4 weeks vs the cost of shipping cohort-asymmetric: documented above."

The eight scenarios are not isolated; they are a single workflow disguised as eight chapters. An Applied ML Engineer who can show the chain — every quarter, on every launch — is the role's senior signal.