Per-Story Deep Dive — Applied ML Engineer
How to Use This Document
This file contains eight scenarios — AML-01 through AML-08 — each one a self-contained product-applied ML decision. Stories live inside this file (not as separate BDD files). Each section starts with the BDD framing, then walks the customer pain, the hypothesis, the experiment design, the architecture, the rollout, the dashboard, and a real-incident vignette grounded in the MangaAssist project.
Pair each scenario with the matching grill chain in 02-applied-ml-engineer-grill-chains.md for self-drilling. Read the foundations doc 00-foundations-and-primitives-for-applied-ml-engineering.md first if you haven't — the seven primitives are the substrate every scenario uses.
Story Roster
| ID | Title | Headline question |
|---|---|---|
| AML-01 | Customer-pain → ML-problem translation | Is this even an ML problem? |
| AML-02 | Experiment portfolio prioritization | Which 3 of 12? |
| AML-03 | Hypothesis design & sample-size discipline | MDE, holdout, runtime, stop rule? |
| AML-04 | Online/offline metric decoupling | Offline +5%, online flat — why? |
| AML-05 | Business-KPI guardrails for promotion | When do you NOT ship? |
| AML-06 | Cohort fairness & locale stratification | Aggregate +3%, JP -8%, ship? |
| AML-07 | Production integration & latency budgets | Where does the model live in 800ms? |
| AML-08 | Incident triage: 'the model got worse' | Where do you look first at 3am? |
Master Lifecycle Diagram
graph LR
AML01[AML-01<br/>Frame the problem] --> AML02[AML-02<br/>Pick the experiment]
AML02 --> AML03[AML-03<br/>Design the test]
AML03 --> AML04[AML-04<br/>Validate offline-online]
AML04 --> AML05[AML-05<br/>Check guardrails]
AML05 --> AML06[AML-06<br/>Check cohorts]
AML06 --> AML07[AML-07<br/>Production integration]
AML07 --> AML08[AML-08<br/>Incident-ready]
AML08 -.lessons feed back.-> AML01
classDef pre fill:#9cf,stroke:#333
classDef rigor fill:#fd2,stroke:#333
classDef ship fill:#2d8,stroke:#333
classDef ops fill:#f66,stroke:#333
class AML01,AML02 pre
class AML03,AML04 rigor
class AML05,AML06 ship
class AML07,AML08 ops
The eight scenarios chain. A real reranker launch (US-MLE-02) traverses AML-01 through AML-07 in order and re-uses AML-08 readiness; it doesn't pick one. Read the scenarios in order on first pass; jump to the relevant scenario for a specific real launch.
AML-01: Customer-Pain → ML-Problem Translation
TL;DR
Week-2 retention for new manga readers in the JP cohort dropped from 38% to 31% over four weeks. The PM wants ML to fix it. The Applied ML Engineer's first move is the customer letter, not the model. The right answer this quarter is a heuristic taste-quiz on first session — and a "not now" recommendation for the ML option.
As a / I want / So that
As an Applied ML Engineer / Product Engineer for ML on the MangaAssist team I want to translate a vague business signal into either a sharp ML problem statement or a defensible "this is not an ML problem" recommendation So that engineering investment is allocated to the highest-leverage intervention rather than to whatever ML feature looks most exciting
Customer Pain (Working Backwards)
Yuki, 24, just discovered MangaAssist last week. She tried 'Solo Leveling', 'Spy x Family', and 'Frieren' — all titles she'd heard of. After turn 4 with the chatbot, none of the three felt like 'her thing'. She didn't open the app on day 7.
What would have made her come back? "If the bot had figured out I like slow-burn psychological stories, not action — and shown me 'A Silent Voice' or 'March Comes In Like A Lion' on day 1."
The pain is misaligned recommendations on day-1 for new users. It is not "retention is dropping." Retention is the lagging indicator. The leading cause is the chatbot showing popular-instead-of-taste-matched titles in the first session, before any preference signal exists.
The signal that this is the right framing: support tickets in the JP cohort cluster around "the bot doesn't know what I like" and "everything it shows is the same." Both are first-session experiences. The cohort that retains well is the one that survives day-1: by day-7, taste signal has accumulated and the recsys works.
Hypothesis & Why-We-Believe
| H0 | First-session recommendation strategy has no causal effect on week-2 retention for new JP users |
| H1 | A taste-quiz first-session intervention raises week-2 retention for new JP users by ≥4pp absolute (38% → 42%) |
| Prior evidence | (a) Industry: Spotify's onboarding quiz lifts 30-day retention 5-8pp on similar surface; (b) Internal: cohort comparison shows users who self-corrected ("I don't like action") in turn 1-3 retained at 47%, vs. 31% for those who didn't; (c) Customer-research: 60% of JP-cohort exit interviews cite "bot doesn't get me" as primary reason |
| Disconfirming evidence we'd need | If quiz responses don't actually steer recommendations differently than no-quiz default, intervention has no effect. If JP cohort's taste distribution is too uniform for a 5-question quiz to discriminate, intervention has no effect. Both testable in pilot. |
| Prior probability of H1 | 0.55 (moderate confidence; industry evidence good but JP-specific behavior unknown) |
Experiment Design
This is the unusual case where the primary experiment is a heuristic, not an ML model. The Applied ML Engineer designs the test the same way regardless of whether the intervention is ML or rules.
| Field | Choice |
|---|---|
| Population | New users (no prior session ≥ 30d), locale = JP, no other active onboarding experiment |
| Holdout | 50/50 user-level randomization at first-session start |
| Treatment | 5-question taste quiz on first session, drives recsys cold-start cohort embedding |
| Control | Existing genre-popularity cold-start fallback |
| Primary metric | week-2 retention (returns to app within days 7-14) |
| Baseline | 31% (last 4 weeks rolling JP-new) |
| MDE | +4pp absolute (≈ +13% relative) |
| Sample size | n ≈ 8.7K per arm (Welch + CUPED on day-1 session length, 80% power, α=0.05) |
| Runtime | ~5 weeks at JP-new-user rate of ~3.5K/day, 50/50 split, including 14-day retention horizon |
| Guardrails | day-1 session length (no degradation), CSAT (no regression), spam-flag rate (no spike), p95 turn latency (≤ 800ms — quiz must not slow first session) |
| Stop rule | Fixed-horizon at week 5; no peeking before week 4 |
| Pre-registration | YAML hash signed off by PM, EM, AppliedML Eng, JP-locale Product owner |
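The 50/50 user-level randomization above has to be deterministic — the same user must land in the same arm on every session, or the 14-day retention metric is contaminated. A minimal sketch of hash-based bucketing (function and experiment names are illustrative, not the team's actual platform API):

```python
import hashlib

def assign_arm(user_id: str, experiment: str = "aml01-taste-quiz",
               treatment_share: float = 0.5) -> str:
    """Deterministic user-level bucketing: hash (experiment, user) so the
    same user always lands in the same arm across sessions."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return "treatment" if bucket < treatment_share else "control"

# Stable across calls — the property that makes user-level randomization valid:
assert assign_arm("user-123") == assign_arm("user-123")
```

Salting the hash with the experiment name keeps assignments independent across concurrent experiments, which is why the eligibility clause "no other active onboarding experiment" can be enforced separately.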
Architecture / Wiring
graph TB
NEW[New JP user<br/>first session start]
NEW --> COHORT{In experiment<br/>cohort?}
COHORT -->|Treatment 50%| QUIZ[Taste quiz<br/>5 questions<br/>≤ 60s to complete]:::treatment
COHORT -->|Control 50%| OLD[Genre-popularity<br/>cold-start fallback]:::control
QUIZ --> EMB[Quiz answers<br/>→ cold-start embedding<br/>seeded into recsys]:::treatment
EMB --> RECSYS[US-MLE-06<br/>Two-Tower recsys]
OLD --> RECSYS
RECSYS --> RANK[Recommendations<br/>shown to user]
RANK --> SESSION[Session continues<br/>signals collected]
SESSION --> METRICS[Telemetry<br/>day-1 length / day-7-14 return]
classDef treatment fill:#2d8,stroke:#333
classDef control fill:#9cf,stroke:#333
class QUIZ,EMB treatment
class OLD control
The taste quiz is a UX intervention (not ML). The ML system (US-MLE-06 two-tower recsys) consumes the quiz output as a cold-start cohort embedding. The treatment can be reverted by toggling the cohort flag; rollback is one-line config.
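One plausible shape for the quiz-to-embedding seeding step — the actual US-MLE-06 cold-start contract is not specified here, and the taste vectors, dimension, and function names below are invented for illustration:

```python
import numpy as np

# Hypothetical taste vectors in the US-MLE-06 item-embedding space
# (dimension and values invented for illustration):
TASTE_VECS = {
    "slow-burn": np.array([0.9, 0.1, 0.0, 0.2]),
    "psychological": np.array([0.8, 0.0, 0.1, 0.5]),
    "action": np.array([0.0, 0.9, 0.3, 0.1]),
}

def cold_start_embedding(quiz_answers: list[str]) -> np.ndarray:
    """Seed a cold-start user vector as the normalized mean of the taste
    vectors the user picked in the 5-question quiz."""
    vecs = [TASTE_VECS[a] for a in quiz_answers if a in TASTE_VECS]
    if not vecs:  # quiz abandoned -> fall back to the control cold-start path
        return np.zeros(4)
    mean = np.mean(vecs, axis=0)
    return mean / np.linalg.norm(mean)
```

The zero-vector fallback is what makes the one-line rollback safe: a toggled-off cohort flag and an abandoned quiz both degrade to the existing genre-popularity path.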
Rollout Plan
| Stage | Population | Abort criteria |
|---|---|---|
| 1% (1 day) | 35 new JP users/day | Any p99 turn latency > 1500ms; any CSAT regression |
| 5% (3 days) | 175/day | Day-1 session length regression > 5% |
| 25% (1 week) | 875/day | Day-1 session length regression > 3%; spam-flag rate > +10% |
| 50% (full A/B, 5 weeks) | 1750/day | Pre-declared experiment terms |
| Decision at week 5 | — | If primary +4pp at p<0.05 AND guardrails clear: ship to 100% JP-new + plan EN rollout |
Metrics Dashboard
graph LR
D[Dashboard: AML-01 retention experiment]
D --> P[Primary: week-2 retention<br/>JP-new]
D --> S[Secondary: day-1 session length<br/>turn count, completion rate]
D --> G[Guardrails: CSAT, spam-flag,<br/>p95 turn latency]
D --> C[Cohort: stratified by JP-Tokyo,<br/>JP-other, by mobile-vs-app]
| Metric | Formula | Target | Guardrail-trigger | Alert-threshold |
|---|---|---|---|---|
| Week-2 retention (primary) | (returning users days 7-14) / (cohort size) | +4pp absolute | n/a (primary) | n/a (primary) |
| Day-1 session length | mean turns per first session | ≥ baseline | -3% relative | -5% relative |
| CSAT (5-pt) | mean of post-session 5-pt survey, 14-day rolling | ≥ baseline | -1% relative | -1.5% relative |
| Spam-flag rate | flags / sessions | ≤ baseline | +10% relative | +15% relative |
| p95 turn latency (incl. quiz) | p95 of turn-to-first-token | ≤ 800ms | > 800ms | > 900ms |
| Quiz-completion rate | (quiz finished) / (quiz started) | ≥ 75% | < 60% | < 50% (intervention failing) |
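The guardrail-trigger column above can be evaluated mechanically. A minimal sketch, assuming relative-delta thresholds as listed in the table (the dict keys and function name are illustrative):

```python
# Guardrail-trigger thresholds from the dashboard table above, expressed as
# the worst tolerable relative delta vs baseline (negative = regression).
GUARDRAILS = {
    "day1_session_length": -0.03,
    "csat": -0.01,
    "spam_flag_rate": +0.10,  # spike guardrail: a positive delta is bad here
}

def breached(metric: str, treatment: float, baseline: float) -> bool:
    """True if the treatment value crosses the guardrail-trigger threshold."""
    delta = (treatment - baseline) / baseline
    limit = GUARDRAILS[metric]
    return delta < limit if limit < 0 else delta > limit

breached("csat", 4.10, 4.20)  # -2.4% relative -> True, guardrail trips
```

Encoding the thresholds as data rather than dashboard annotations is what lets the rollout stages abort automatically instead of waiting for a human to read a chart.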
Real-Incident Vignette
The team launched the quiz to 1% on a Tuesday. Within 6 hours, the JP-Tokyo cohort showed a +12% spike in day-1 session length and quiz-completion rate of 84% — both above expectation. The JP-other (Osaka, Fukuoka, etc.) cohort showed quiz-completion of 62% and no session-length lift. Investigation: the JP-other cohort skews older and has higher non-mobile usage; the 5-question quiz on a smaller mobile screen produced friction. Mitigation: reduce to 3 questions for that sub-cohort; re-run pilot. The lesson fed back into Primitive 6 (cohort fairness): even within JP, sub-cohort fairness matters, and "JP" is not a homogeneous unit.
Cross-Story Dependencies
- Consumes: US-MLE-06 (recsys two-tower) cold-start cohort embedding API; the quiz-driven cohort embedding requires US-MLE-06 to support keyed cold-start vectors. Confirm contract before scoping.
- Produces: a leading indicator (day-1 session length lift) that AML-04 (online/offline correlation) will use to validate the recsys-side offline metrics.
- Sequencing: this experiment must run before the next quarterly portfolio review (AML-02), so its result informs whether the H1-conditional ML investment in cold-start ML belongs in the next portfolio.
Master's-DS Depth Callout — Causal Framing and the Two-Stage Decision
The framing trap on this scenario is collapsing two questions into one: "should we change cold-start strategy?" and "should we use ML or heuristic?" The right framing is staged. Stage 1 is causal: does any cold-start intervention move retention? The cleanest test is a heuristic intervention (cheaper, faster). If Stage 1 is positive, Stage 2 is comparative: does an ML cold-start beat the heuristic? Stage 2 is only worth running if Stage 1 succeeded.
The hypothesis test for Stage 1 is Pr(retention | do(cold-start = quiz), cohort = JP-new) − Pr(retention | do(cold-start = popularity), cohort = JP-new) ≥ MDE. Notice the do() operator — it forces the team to specify the intervention, not just an "association with retention." Most ML PR/FAQs hand-wave the intervention and end up running correlational analysis rebadged as A/B tests. The discipline of writing the do() clause forces clarity.
Amazon Product Lens Callout — Customer Obsession + Invent and Simplify
Two LPs collide. Customer Obsession says start with Yuki. The PR/FAQ leads with her, in her words, on her day-7. Invent and Simplify says ship the simplest intervention that moves Yuki's behavior. ML is not necessarily simpler than a 5-question quiz; in this case, the quiz is simpler, faster, and more interpretable.
The six-pager defending the recommendation: "Hypothesis: cold-start strategy is the leading cause of JP new-user churn. Stage 1 test: 5-question taste quiz vs popularity fallback, 5-week run, +4pp retention MDE. We are not running an ML cold-start experiment this quarter. The ML option's incremental EV over the heuristic is ~2-3pp at 4× engineering cost. We will revisit ML in Q4 if Stage 1 succeeds and Stage 2 evidence justifies. Tenet: we ship the simplest intervention that moves customer behavior." That paragraph is what makes the recommendation defensible to leadership.
AML-02: Experiment Portfolio Prioritization
TL;DR
The team has 12 candidate ML experiments for next quarter and capacity to run 3 in parallel without quality compromise. The Applied ML Engineer is the role that picks. RICE-for-ML scoring + EVOI swing-bet + qualitative overlays produces a defensible 3-of-12. Picking wrong wastes a quarter.
As a / I want / So that
As an Applied ML Engineer responsible for product-ML quarterly investment I want to select the 3 highest-leverage experiments from a portfolio of 12 candidates by RICE × Detectability × strategic value So that engineering effort flows to the experiments with highest expected impact and the team has a defensible OP1 narrative
Customer Pain (Working Backwards)
The pain isn't a single customer's pain — it's the team's pain. Over the last 4 quarters, the team shipped 9 experiments. 5 won, 4 lost. Of the 5 wins, 2 came from the same area (reranker), 2 from another (recsys), 1 from cold-start. Of the 4 losses, 3 were undersized (couldn't detect MDE within quarter); 1 was a real negative result (worth knowing). The pattern: the team has been picking experiments by enthusiasm, not by EV.
The customer pain is indirect: every quarter the team allocates 3-4 ML projects, and the customers experience the aggregate of those projects. A bad portfolio means customers see a quarter where nothing improved; a good portfolio means visible compounding lift across 6 quarters.
Hypothesis & Why-We-Believe
| H0 | Random portfolio selection (3 of 12) has the same expected lift as RICE-prioritized selection |
| H1 | RICE-prioritized + EVOI overlay produces ≥40% higher expected lift across the quarter than baseline (gut-feel) selection |
| Prior evidence | Industry: Booking.com, Netflix, Microsoft public posts all describe systematic experiment-portfolio frameworks improving experiment win rate by 30-60% over ad-hoc selection. |
| Disconfirming evidence | If team's gut-feel selections are already RICE-correlated (because senior people internalize the framework), the framework adds no marginal value. Test: rank gut-feel against RICE on last quarter; if ρ > 0.7, framework is redundant. |
| Prior probability of H1 | 0.7 (high; the team's track record on undersized experiments suggests RICE is missing) |
Experiment Design
The "experiment" here is meta-experimental: applying the framework to next quarter's portfolio. The validation is retrospective at end of quarter — did the prioritized portfolio outperform a counterfactual (gut-feel) portfolio?
| Field | Choice |
|---|---|
| Population | Q3 (next quarter) candidate experiments — 12 named projects |
| Decision unit | Portfolio of 3 (selected from 12) |
| Treatment | RICE + EVOI scoring with qualitative overlay |
| Control | Counterfactual gut-feel pick (recorded but not run; only for retrospective comparison) |
| Primary metric | end-of-quarter portfolio expected lift (sum of shipped-experiment gains) |
| Secondary | individual experiment hit rate, cumulative cost saved by deferring undersized candidates |
| Pre-registration | both portfolios (RICE-picked and gut-feel-picked) committed to git before quarter starts |
The 12 Q3 candidates and their RICE scores
| # | Candidate | Reach (sessions/q) | Impact (Δ metric) | Confidence | Effort (eng-wk) | RICE | EVOI | Decision |
|---|---|---|---|---|---|---|---|---|
| 1 | US-MLE-02 reranker MiniLM-L6 → L12 | 8.0M | +0.5% useful-answer | 0.60 | 12 | 200K | low | PICK 1 |
| 2 | US-MLE-01 multilingual intent v2 (JP/EN better tied) | 1.2M | +0.8% intent-acc | 0.43 | 8 | 51K | medium | PICK 2 |
| 3 | US-MLE-06 recsys cold-start HRNN swap | 2.5M | +1.2% retention | 0.28 | 18 | 47K | HIGH | PICK 3 (EVOI) |
| 4 | US-MLE-05 embedding adapter LoRA r=16 | 8.0M | +0.3% recall | 0.23 | 14 | 41K | low | Defer Q4 |
| 5 | US-MLE-04 demand forecasting promo overlay | 0.5M (inv ops) | +5% sMAPE | 0.36 | 10 | 90K (inv-team metric) | low | Defer (inventory team Q3 priority differs) |
| 6 | US-MLE-08 cover-art AI-gen detection | 12M images | +0.05 AUC | 0.31 | 11 | 1.7M (impressions × low Δ) | medium | Defer Q4 (signal not yet strong) |
| 7 | Cross-encoder distillation to DistilCE | dependent on #1 | +20% latency reduction | 0.55 | 9 | 67K | low | Reject (sequential to #1) |
| 8 | Sentiment ABSA quarterly retrain | (maintenance) | drift bound | 1.0 | 4 | hygiene | n/a | 0.5 SDE assigned, not portfolio |
| 9 | Spam classifier weekly retrain automation | (maintenance) | adversarial-bound | 1.0 | 3 | hygiene | n/a | 0.5 SDE assigned, not portfolio |
| 10 | Recsys diversity injection re-tune | 2.5M | +0.2% diversity, ?CTR | 0.20 | 7 | 14K | low | Reject (low EV) |
| 11 | Personalize HRNN replacement custom 2-tower | 2.5M | +1.5% retention | 0.40 | 24 (under-est'd to 18) | 78K nominal, 52K corrected | medium | Reject (cost mis-estimated last time) |
| 12 | LLM judge quality-eval framework | (tooling) | indirect | 0.5 | 6 | tooling | n/a | Reject (platform-ML backlog, not product portfolio) |
By raw RICE score the top-3 would be #6, #1, and #5. But #6's 1.7M is inflated by counting impressions against a tiny Δ, and #5 belongs to the inventory team this quarter — out of scope for chatbot product-ML. With #7 rejected as sequential to #1 and #11 rejected on its 2.4× cost-overrun history, the in-scope ranking is #1, #2, then a near-tie between #3 (47K) and #4 (41K).
The Applied ML Engineer's overlay settles the near-tie with EVOI, picking #3 over #4. #3 (HRNN cold-start swap) has middling RICE but high EVOI: win or lose, the team learns whether HRNN is the right cold-start direction for the next 4 quarters of recsys investment. That learning is worth more than the marginal RICE delta vs #4.
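The table's scores can be reproduced mechanically. A minimal sketch of the RICE arithmetic, with Impact expressed in percentage points as the table does (the dataclass and field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    reach: float        # sessions (or impressions) per quarter
    impact_pp: float    # expected delta on the target metric, in percentage points
    confidence: float   # 0-1 prior that the effect materializes
    effort_wk: float    # engineering-weeks

    @property
    def rice(self) -> float:
        return self.reach * self.impact_pp * self.confidence / self.effort_wk

reranker = Candidate("US-MLE-02 reranker", 8_000_000, 0.5, 0.60, 12)
print(f"{reranker.rice:,.0f}")  # 200,000 — matching row #1 above
```

Row #2 checks out the same way: 1.2M × 0.8 × 0.43 ÷ 8 ≈ 51.6K, the table's 51K.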
Architecture / Wiring
graph TB
BACKLOG[Q3 candidate backlog<br/>12 named experiments]
BACKLOG --> RICE[RICE scoring<br/>Reach × Impact × Confidence ÷ Effort]
RICE --> RANKED[Ranked list]
RANKED --> EVOI[EVOI overlay<br/>which candidates have learning value?]
EVOI --> QUAL[Qualitative overlay<br/>cost-estimate calibration / dependency / political]
QUAL --> PICK[3-of-12 selected]:::ship
PICK --> OP1[OP1 narrative<br/>defending the choice]:::ship
RICE --> REJECT[9 rejected / deferred]:::reject
classDef ship fill:#2d8,stroke:#333
classDef reject fill:#9cf,stroke:#333
Rollout Plan
The "rollout" is the portfolio approval and tracking process:
| Stage | Action | Timeline |
|---|---|---|
| Pre-quarter | RICE-scored portfolio submitted to EM + Director for approval | week -2 |
| Pre-quarter | OP1 narrative finalized; pre-registered hypotheses for each of 3 experiments locked | week -1 |
| Mid-quarter | Status check: are any experiments showing early-abort signals? | week +6 |
| End-quarter | Retrospective: actual lift vs RICE-predicted lift; recalibrate Confidence priors | week +12 |
Metrics Dashboard
| Metric | Formula | Target | Alert |
|---|---|---|---|
| Portfolio expected lift (predicted) | Σ(RICE_i × P(success)_i) for picked | ≥ baseline by 30% | n/a (predictive) |
| Portfolio realized lift | Σ(observed lift × significance) for shipped | ≥ predicted × 0.7 | < predicted × 0.5 |
| Hit rate (per experiment) | shipped & sustained / picked | ≥ 60% | < 40% |
| Effort accuracy | actual eng-weeks / estimated eng-weeks | ≤ 1.3× | > 2× (recalibrate Confidence) |
| Confidence calibration | observed-success-rate vs predicted-Confidence | within 10% | > 20% off (the Confidence factor is not calibrated) |
Real-Incident Vignette
End of Q2, team retrospects: portfolio's predicted lift was +2.4% on aggregate session-quality; realized was +1.1%. Investigation: candidate #11 (Personalize HRNN replacement) was picked despite RICE-warning on cost-estimate; engineering cost overran 2.4× and the experiment was abandoned at week 8. The lesson: cost-estimate calibration is the biggest gap in RICE; in Q3, the team adopts a rule that any candidate with engineering cost ≥ 12 weeks must have a pre-mortem document with named cost risks and a contingency plan. The rule prevents the same cost-overrun pattern from recurring.
Cross-Story Dependencies
- Consumes: signals from AML-08 (incident triage) — recurring incidents flag missing investments that should enter the portfolio (e.g., per-shard recall monitor after the OpenSearch shard-3 incident).
- Consumes: signals from AML-04 (online/offline correlation collapse) — if correlation is dropping, the portfolio must include a "rebuild offline harness" investment, even if it has no shippable user-facing lift.
- Produces: the pre-registered hypotheses for each of the 3 chosen experiments (handed to AML-03).
Master's-DS Depth Callout — Bayesian Decision Theory and Cost-Estimate Calibration
RICE's Confidence factor is doing two jobs: prior-evidence quality (am I sure the effect exists?) and cost-estimate calibration (am I sure the effort estimate is right?). Splitting them is the master's-level move. The proper Bayesian framing: each candidate has a joint posterior on (effect_exists, effect_size, eng_cost_actual). The portfolio chooser maximizes expected utility under this joint posterior, not under decomposed RICE.
In practice: track the team's calibration over 8+ quarters. If 80%-confidence experiments succeed at 80% rate, the team is well-calibrated. If they succeed at 50%, the team is over-confident; haircut Confidence by 0.6×. If they succeed at 95%, the team is under-confident (rare but real); take more risks. Calibration audits applied to the team's last 12 experiments are the single highest-ROI portfolio-management practice.
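The haircut rule in that paragraph reduces to a one-liner. A minimal sketch, assuming success is recorded as a boolean per past experiment (function name invented here):

```python
def confidence_haircut(stated_confidence: list[float], succeeded: list[bool]) -> float:
    """Compare the team's stated Confidence against its realized success rate
    over past experiments; return a multiplier for future Confidence scores."""
    observed = sum(succeeded) / len(succeeded)
    stated = sum(stated_confidence) / len(stated_confidence)
    # Well-calibrated or under-confident: no haircut (optionally take more risk).
    return min(1.0, observed / stated)

# 80%-stated picks that succeeded half the time -> haircut 0.5/0.8 ≈ 0.6x,
# the figure quoted above:
confidence_haircut([0.8] * 12, [True] * 6 + [False] * 6)  # -> 0.625
```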
Amazon Product Lens Callout — Bias for Action + Frugality + Think Big
The portfolio decision negotiates three LPs. Bias for Action says ship the things that are obvious wins quickly — that's #1 and #2 (high RICE, well-powered, low risk). Frugality says pick experiments that share infrastructure (all three of our picks share the experiment platform, the cohort holdout, and the feature store). Think Big says reserve one slot for the swing bet — #3 is EVOI on cold-start direction; if it works, the next year of recsys investment is reshaped.
Six-pager defending the portfolio: "We picked 3 of 12 candidates. Two are Bias-for-Action: high-RICE, well-powered, ship-by-week-12. One is Think-Big: an EVOI swing on cold-start strategy that, if it works, sets recsys direction for 4 quarters. Frugality: all three share A/B infra and cohort holdout, amortizing eng cost. Rejected #11 despite RICE rank 2 because the team's historical cost estimates for that scope class run 2.4× over; we will not ship cost-overrun candidates without a pre-mortem document with contingency plan. Confidence-calibration audit: our 80%-confidence picks succeed at 76% rate; portfolio is well-calibrated, no haircut applied."
AML-03: Hypothesis Design & Sample-Size Discipline
TL;DR
The team is shipping a US-MLE-02 reranker change. The data scientist offline-evaluated MiniLM-L12 vs MiniLM-L6 on a 5K replay and got +6% NDCG@10. The PM says "run a 7-day A/B and ship." The Applied ML Engineer's job is to say no — 7 days under-powers the test, the MDE the team can defend is 3% relative on user-perceived useful-answer rate, the design needs O'Brien-Fleming sequential α-spending, and CSAT must be a pre-declared guardrail.
As a / I want / So that
As an Applied ML Engineer accountable for product-launch experimental rigor I want to translate an offline-positive model change into a properly-powered, pre-registered, sequentially-controlled online A/B with named guardrails So that the launch decision rests on statistical evidence the team can defend in a six-pager review and at 28-day post-launch retrospective
Customer Pain (Working Backwards)
Sakura, 31, has been using MangaAssist for six months. She likes how the bot's recommendations have gotten more relevant over time. Recently, the search results feel slightly different — sometimes more accurate, sometimes more click-bait. She's noticed but hasn't said anything yet. If the new ranking gets shipped without a proper test and turns out to favor click-bait, Sakura's six-month trust erodes by week 4 and she stops asking the bot for recommendations.
The customer pain is a future pain — the cost of shipping a noisily-significant winner that doesn't actually move user behavior. Sakura doesn't notice immediately; the cost shows up as a month-4 churn spike. Pre-registration discipline prevents the noise winner from shipping.
Hypothesis & Why-We-Believe
| H0 | The MiniLM-L12 reranker has same or worse user-perceived ranking quality than MiniLM-L6, measured by useful-answer-rate per user over 14 days |
| H1 | MiniLM-L12 reranker has ≥3% relative lift in useful-answer-rate per user over 14 days |
| Prior evidence | Offline NDCG@10 +6% on golden-500; offline-online correlation 0.68 over last 90 days (above 0.5 threshold per Primitive 4); two prior reranker upgrades correlated 0.7-0.8 between offline NDCG and online useful-answer |
| Disconfirming evidence we'd need | Cohort-stratified online lift below MDE in any of EN / JP / mixed; CSAT regression > 1% relative; latency p95 > 800ms |
| Prior probability of H1 | 0.75 (strong; offline evidence + correlation history support it) |
Experiment Design
| Field | Choice |
|---|---|
| Population | All MangaAssist users with ≥ 1 search in pre-14d window; locale ∈ {EN, JP, mixed}; not in any other active experiment |
| Holdout / control | 50/50 user-level randomization (NOT session, NOT request — see Primitive 3 randomization unit discussion) |
| Treatment | New MiniLM-L12 cross-encoder reranker, top-K=20→top-K=30 reranked, served by SageMaker MME endpoint |
| Control | Existing MiniLM-L6 reranker, top-K=20 |
| Primary metric | useful-answer-rate per user per 14-day window |
| Baseline | 0.183 (mean from 14-day pre-experiment) |
| Baseline std | σ = 0.266 (per-user std; consistent with the pre-CUPED n below) |
| MDE | 3% relative = 0.00549 absolute |
| Power | 0.80 |
| α | 0.05 (sequential, O'Brien-Fleming) |
| Test | Welch's t-test, two-sided, with CUPED variance reduction |
| CUPED covariate | per-user useful-answer-rate over the 14d immediately preceding experiment |
| Sample size pre-CUPED | n = 36,840 per arm (statsmodels TTestIndPower) |
| Sample size post-CUPED | n = 18,420 per arm (50% variance reduction empirically) |
| Runtime | 14 days at 200K eligible users/day, 50/50 split → easily achievable |
| Looks (sequential) | day 4, day 7, day 11, day 14 |
| α at each look | 0.0001, 0.001, 0.01, 0.05 (O'Brien-Fleming) |
| Stop rule | Stop early on primary if any look's α-adjusted p < look's α; stop early on guardrail breach |
Sample-size calculation walkthrough
# sample_size_us_mle_02.py — pre-registered calc
import statsmodels.stats.power as smp

baseline = 0.183
sigma_pre = 0.266  # per-user std, consistent with the pre-registered n ≈ 36.8K
mde_relative = 0.03
mde_absolute = baseline * mde_relative  # 0.00549
effect_size = mde_absolute / sigma_pre  # ≈ 0.0206 standardized (Cohen's d)

# Pre-CUPED (raw t-test power)
analysis = smp.TTestIndPower()
n_pre_cuped = analysis.solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, alternative='two-sided'
)
print(f"Pre-CUPED n per arm: {int(n_pre_cuped)}")
# ≈ 36.8K (pre-registered as 36,840)... but this is for a single look;
# a sequential design needs more. O'Brien-Fleming with 4 looks needs ~4-6%
# inflation in sample size vs fixed-horizon for equivalent power. Apply 1.06x:
n_pre_cuped_obf = int(n_pre_cuped * 1.06)
print(f"Pre-CUPED with OBF n per arm: {n_pre_cuped_obf}")
# ≈ 39.1K

# CUPED variance reduction empirically 50% on this metric
sigma_post_cuped = sigma_pre * (1 - 0.50) ** 0.5  # 50% variance cut -> sigma shrinks to ~0.71x
effect_size_post = mde_absolute / sigma_post_cuped
n_post_cuped_obf = analysis.solve_power(
    effect_size=effect_size_post, alpha=0.05, power=0.80, alternative='two-sided'
) * 1.06
print(f"Post-CUPED with OBF n per arm: {int(n_post_cuped_obf)}")
# ≈ 19.5K
# Note: the 1.06x OBF inflation is calibrated for the raw t statistic, not the
# CUPED-residualized one, so this is still an approximation.
# Conservative: round up to 20K per arm.
The team can hit n=20K per arm in ~2 days at 200K eligible/day; runtime is bounded by the 14-day horizon (the metric needs 14 days of activity per user), not by sample size. So the experiment runs exactly 14 days, with sequential looks providing safety to stop early on dramatic signals.
Cohort-stratified sample size
The 18.4K-per-arm aggregate is sufficient for the aggregate claim. For cohort claims:
| Cohort | % of traffic | Per-cohort n needed | Total exp n (across both arms) |
|---|---|---|---|
| EN | 55% | 18.4K | 33.5K total → 67K combined → ~3 days |
| JP | 30% | 18.4K | 61K total → 122K combined → ~6 days |
| mixed | 15% | 18.4K | 122K total → 245K combined → 12 days |
The mixed-cohort claim is on the edge of the 14-day horizon — possible but tight. The Applied ML Engineer flags this in the pre-registration: "mixed-cohort claim has marginal power; if mixed-cohort metric is between -3% and +3%, treat as inconclusive, not as evidence against H1." Better to flag this upfront than negotiate it post-hoc.
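The per-cohort runtimes in the table follow from one formula. A minimal sketch, using the ~20K/day enrollment rate the table implies (function name invented here):

```python
def cohort_runtime_days(n_per_arm: float, traffic_share: float,
                        enrolled_per_day: float) -> float:
    """Days to accumulate n_per_arm users per arm in a cohort, given the
    cohort's share of traffic and the overall enrollment rate
    (both arms enroll from the same stream)."""
    combined = 2 * n_per_arm / traffic_share  # both arms, scaled to total enrollment
    return combined / enrolled_per_day

# mixed cohort: 2 x 18,420 / 0.15 ≈ 245K combined; at ~20K enrolled/day -> ~12 days
round(cohort_runtime_days(18_420, 0.15, 20_000), 1)  # ≈ 12.3
```

Inverting the formula is how the pre-registration flags marginal cohorts before launch: any cohort whose runtime exceeds the experiment horizon gets an explicit "inconclusive band" declared up front.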
Architecture / Wiring
graph LR
USR[User] --> CHOOSE{Experiment<br/>cohort hash}
CHOOSE -->|treatment 50%| L12[Reranker MiniLM-L12<br/>SageMaker MME]:::treatment
CHOOSE -->|control 50%| L6[Reranker MiniLM-L6<br/>SageMaker MME]:::control
L12 --> RANK[Ranked results]
L6 --> RANK
RANK --> RESP[Chatbot response]
RESP --> SIG[useful-answer signal<br/>collected per turn]
SIG --> AGG[Aggregated per user<br/>over 14-day window]
AGG --> EXPDB[Experiment platform<br/>treatment_id, user_id, metric]
EXPDB --> ANA[Analysis<br/>Welch + CUPED<br/>O'Brien-Fleming sequential]:::analysis
classDef treatment fill:#2d8,stroke:#333
classDef control fill:#9cf,stroke:#333
classDef analysis fill:#fd2,stroke:#333
Rollout Plan
| Stage | Population | Abort criteria |
|---|---|---|
| Day 1: 1% canary | 4K users | p99 turn latency > 1500ms; error rate > 1% |
| Day 2-3: 10% | 40K | aggregate metric collapse > 5% from baseline; CSAT crash |
| Day 4 (look 1): 50% | 200K | OBF α at 0.0001 — early stop only on extreme effect |
| Day 5-14: 50% | full A/B | guardrails monitored continuously; subsequent looks at d7, d11, d14 |
| End-of-experiment decision | — | primary clears MDE at look-α AND all guardrails clear AND cohort guardrails clear → ship |
Metrics Dashboard
| Metric | Formula | Target | Guardrail-trigger | Alert |
|---|---|---|---|---|
| Useful-answer rate (primary) | per-user mean over 14d | +3% rel | n/a | n/a |
| p95 turn latency | p95 across all turns | ≤ 800ms | > 800ms | > 850ms |
| CSAT (5-pt) | rolling 14d mean | -0% rel | -1% rel | -1.4% rel |
| JP cohort useful-answer | per-JP-user mean over 14d | -0% rel | -3% rel | -5% rel |
| EN cohort useful-answer | per-EN-user mean over 14d | -0% rel | -3% rel | -5% rel |
| Sequential α-spent | cumulative α used at each look | per OBF plan | exceeds plan | n/a |
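The sequential stop rule is deliberately mechanical: each look is compared against that look's pre-registered α, never against the nominal 0.05. A minimal sketch of the rule (the dict and function names are illustrative):

```python
# Pre-registered O'Brien-Fleming look schedule from the design table above.
OBF_ALPHA = {1: 0.0001, 2: 0.001, 3: 0.01, 4: 0.05}

def sequential_decision(look: int, p_value: float) -> str:
    """Mechanical stop rule: compare this look's p-value against this look's
    pre-registered alpha, never against the nominal 0.05."""
    return "stop-early" if p_value < OBF_ALPHA[look] else "continue"

sequential_decision(2, 0.04)  # -> "continue": p=0.04 clears 0.05 but not look 2's 0.001
```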
Real-Incident Vignette
The team ran the experiment exactly to spec. At look 2 (day 7), aggregate p=0.04 (above OBF threshold of 0.001 at look 2). PM saw the dashboard, asked "we're at p<0.05, can we ship?" Applied ML Engineer's response: "OBF threshold at look 2 is 0.001, not 0.05. We're not significant on the sequential plan. We continue. The mechanical rule prevents us from shipping on noise." At look 4 (day 14), aggregate p<0.001 — clear win on the sequential plan. CSAT regressed -0.6% (within -1% threshold). Cohorts cleared. The launch shipped. Six months later, retrospective showed durable lift; no novelty fade. The discipline paid off.
Cross-Story Dependencies
- Consumes: AML-04 (online/offline correlation) — checks correlation is above 0.5 threshold before trusting the offline evidence supporting H1.
- Consumes: AML-02 (portfolio prioritization) — the experiment was selected with EVOI consideration; the design must support both win and loss as informative.
- Produces: pre-registered YAML hash and the analysis output that AML-05 (guardrails) will use to gate promotion.
- Sequencing: must complete before AML-05 can render its veto/ship decision.
Master's-DS Depth Callout — Welch's t, Delta Method, Sequential α-Inflation
Three depth points that interview at the staff level:
- Welch's t is the right default — the equal-variance assumption almost never holds for chatbot per-user metrics. The cost of using Welch when variances are equal is negligible (slightly reduced power); the cost of using the equal-variance t when they aren't is wrong p-values. Always Welch.
- Delta method for ratio metrics. If the metric is ratio-typed (e.g., useful-answer-rate = useful_turns / total_turns per user), per-user-ratio averaging biases the variance estimate. The variance of X/Y is approximated by Var(X/Y) ≈ (μ_X² / μ_Y²) × (σ²_X/μ²_X + σ²_Y/μ²_Y − 2·Cov(X,Y)/(μ_X·μ_Y)). The team that computes per-user ratios and t-tests them directly under-estimates variance and over-claims significance. The delta method fixes it.
- Sequential α-inflation. Without an α-spending plan, peeking 5 times inflates Type-I error from 5% to ~20%. O'Brien-Fleming distributes α toward later looks (most α at the final look), preserving most of the experiment's nominal 5% Type-I rate; Pocock distributes α equally across looks (more conservative early-stop, less powerful late). For most chatbot A/Bs, OBF is right; for short-runtime experiments where early stopping is critical, Pocock is right.
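A minimal sketch of the delta-method variance for a ratio-of-means metric, taking per-user sums as inputs; `numpy` and the synthetic data are illustrative, not MangaAssist data:

```python
# Delta-method variance for the ratio-of-means metric mean(x)/mean(y)
# (e.g., useful-answer rate = useful_turns / total_turns across users).
import numpy as np

def ratio_delta_var(x: np.ndarray, y: np.ndarray) -> float:
    """Delta-method variance of mean(x)/mean(y) from per-user sums x, y."""
    n = len(x)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(ddof=1), y.var(ddof=1)
    cov_xy = np.cov(x, y, ddof=1)[0, 1]
    # (mu_x^2/mu_y^2) * (var_x/mu_x^2 + var_y/mu_y^2 - 2*cov/(mu_x*mu_y)),
    # divided by n because we want the variance of the estimator, not of one user.
    return (mu_x**2 / mu_y**2) * (
        var_x / mu_x**2 + var_y / mu_y**2 - 2 * cov_xy / (mu_x * mu_y)
    ) / n

rng = np.random.default_rng(0)
total = rng.poisson(20, size=10_000) + 1   # total turns per user (>= 1)
useful = rng.binomial(total, 0.18)         # useful turns per user
se = ratio_delta_var(useful, total) ** 0.5
print(f"useful-answer rate = {useful.sum() / total.sum():.4f}, delta-method SE = {se:.5f}")
```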
Amazon Product Lens Callout — Insist on the Highest Standards + Dive Deep
The pre-registration discipline is the canonical instance of Insist on the Highest Standards. It costs friction (PMs hate writing the YAML). It pays in defensibility: 6 months later, when leadership asks "did this experiment really work?", the team points to the pre-registered YAML hash and the analysis that conformed to it. Teams that don't pre-register cannot answer the question — they have to argue from results, which is HARKing in slow motion.
Dive Deep is what catches CUPED, Welch, the delta method, the OBF plan, the cohort sample-size split. None of these are obvious to the gut-feel A/B-runner. They show up in the pre-registration document, signed by the data scientist, signed by the Applied ML Engineer, signed by the EM. Six months later they're the artefact that convinces a Director that the team is rigorous.
Six-pager: "Pre-registered hypothesis: ≥3% relative lift in useful-answer-rate. Powered at 80% via 18.4K-per-arm Welch t with CUPED variance reduction (50%). O'Brien-Fleming sequential plan, 4 looks. Three pre-declared guardrails plus per-cohort guardrails. We will not declare a win on aggregate-positive results if any cohort regresses by >3%. The smallest experiment that disconfirms the hypothesis is 14 days at 200K eligible users/day."
AML-04: Online/Offline Metric Decoupling
TL;DR
The team's RAG-retrieval change shows offline Recall@10 +3.3pp, NDCG@5 +3.8% (both significant). The 14-day online A/B shows useful-answer-rate +0.5% relative, not significant. Offline says win; online says nothing. The Applied ML Engineer's diagnostic: not "is the model wrong" but "which of 5 named root causes broke the offline-online correlation, and how do we fix the harness." The right action is to pause model changes for 4 weeks and rebuild the offline harness.
As a / I want / So that
As an Applied ML Engineer accountable for offline metric trustworthiness I want to detect online-offline correlation collapse, diagnose its root cause, and rebuild the offline harness So that subsequent experiments make ship/no-ship decisions on metrics the team trusts as predictors of customer behavior
Customer Pain (Working Backwards)
Kenji, 28, runs his manga-discovery sessions on mobile. He scrolls fast, scans the top 3 results, and clicks the most relevant one. Results ranked lower he never sees. The team's offline metric (NDCG@10) credits Kenji's experience with improvements at positions 5-10 that he never views. The online metric (CTR) sees what Kenji sees. They diverge because NDCG@10 measures positions Kenji's behavior never reaches.
The customer pain is invisible to the team that trusts only offline metrics. The team ships changes that improve positions Kenji never views. The customer's experience doesn't change. The online metric reflects this. The fix is to align the offline metric with Kenji's actual behavior.
Hypothesis & Why-We-Believe
| H0 | Offline-online correlation has not meaningfully changed; the discrepancy is sampling noise |
| H1 | Correlation has dropped from historical 0.7+ to <0.5; offline metric is no longer a faithful predictor of online lift; rebuild required |
| Prior evidence | Last 90 days of experiment outcomes: 4 of 6 launches showed offline-online direction agreement; 2 showed offline-positive online-flat. Rolling Pearson is 0.41 over the last 60 days, down from 0.71 in the prior 60 days. |
| Disconfirming evidence | If the next 2 launches show restored direction agreement, the dip was sampling noise, not collapse |
| Prior probability of H1 | 0.75 (strong; the rolling Pearson is below threshold, multiple recent disagreements) |
Experiment Design
This is a meta-experiment — the experimental subject is the offline harness itself, not a model.
| Field | Choice |
|---|---|
| Population | Last 12 months of completed experiments with both offline-Δ and online-Δ recorded |
| Treatment | New offline harness: NDCG@3 + IPS-corrected CTR + adversarial click-bait detector + LLM-judged factuality |
| Control | Existing offline harness: NDCG@10 + Recall@10 |
| Primary metric | Pearson correlation between offline-Δ and online-Δ across the next 4 experiments |
| Baseline | Current rolling 60-day Pearson: 0.41 |
| Target | Restored to ≥ 0.6 (above 0.5 threshold) |
| Validation | Replay last 6 experiments through new harness; check if their offline-Δ ranking matches online-Δ ranking |
| Decision | If new harness restores correlation on replay AND on next 4 forward experiments: adopt |
Diagnostic walkthrough — the five named root causes
graph TB
A[Offline +5% NDCG@10<br/>Online flat CTR<br/>14-day A/B not significant]
A --> R1[Root cause 1<br/>Selection bias in eval set]
A --> R2[Root cause 2<br/>Label leakage]
A --> R3[Root cause 3<br/>Distribution shift]
A --> R4[Root cause 4<br/>Metric proxy mismatch]
A --> R5[Root cause 5<br/>Goodharting]
R1 --> R1D[Diagnostic: KL between<br/>eval queries and prod queries]
R2 --> R2D[Diagnostic: time-strict<br/>train/eval cut]
R3 --> R3D[Diagnostic: catalog turnover<br/>since last eval refresh]
R4 --> R4D[Diagnostic: where do users<br/>actually click? mobile pos 1-3]
R5 --> R5D[Diagnostic: is offline metric<br/>the training target?]
R1D --> R1F{Hit?}
R2D --> R2F{Hit?}
R3D --> R3F{Hit?}
R4D --> R4F{Hit?}
R5D --> R5F{Hit?}
R1F -->|Yes| FIX1[IPS-corrected eval set]:::fix
R2F -->|Yes| FIX2[Time-strict split rebuild]:::fix
R3F -->|Yes| FIX3[Weekly eval refresh + sliding window]:::fix
R4F -->|Yes| FIX4[NDCG@3 + position-weighted metric]:::fix
R5F -->|Yes| FIX5[Diversify offline metric portfolio]:::fix
classDef fix fill:#2d8,stroke:#333
For the RAG-retrieval scenario:
| Root cause | Test result | Verdict |
|---|---|---|
| 1. Selection bias | Eval set drawn from human-curated queries; production queries include long-tail seasonal anime tie-ins not in eval set | Partial hit |
| 2. Label leakage | Time-strict split shows offline gain shrinks from +3.3pp to +1.8pp; some leakage but not the dominant cause | Partial hit |
| 3. Distribution shift | Catalog turnover 14% since last eval refresh; KL(eval ∥ prod) shows query-distribution drift | Secondary hit |
| 4. Metric proxy mismatch | 70% mobile traffic; users click positions 1-3 90% of time; offline gain is concentrated at positions 5-7 (where users don't look) | Primary hit |
| 5. Goodharting | Reranker training loss is the offline NDCG@10 metric directly; gameable | Possible (chronic risk) |
The dominant cause is #4 (metric proxy mismatch) with #3 (distribution shift) as secondary contributor. The fix is not to re-train the model. The fix is to:
- Replace NDCG@10 with NDCG@3 as the primary offline metric for mobile-skewed RAG traffic.
- Refresh the eval set weekly, sampled from 7-day-rolling production query distribution with stratified human labels (covers cause #3).
- Add IPS counterfactual evaluation as secondary metric (covers cause #1).
- Add a click-bait adversarial detector as a secondary offline metric (covers cause #5 — diversification).
Architecture / Wiring
graph LR
EXP[Last 6 experiments]:::data
NEW_HARNESS[New offline harness<br/>NDCG@3 + IPS-CTR<br/>+ adv-detector + LLM-judge]:::harness
EXP --> REPLAY[Replay through<br/>new harness]
REPLAY --> NEW_OFFLINE[New offline-Δ<br/>per experiment]
EXP --> ONLINE[Online-Δ<br/>per experiment]:::data
NEW_OFFLINE --> CORR[Compute Pearson<br/>new-offline-Δ vs online-Δ]
ONLINE --> CORR
CORR --> DECIDE{Pearson ≥ 0.6?}
DECIDE -->|Yes| ADOPT[Adopt new harness<br/>resume model shipping]:::adopt
DECIDE -->|No| ITERATE[Iterate on harness<br/>add more diagnostic metrics]:::iter
classDef data fill:#9cf,stroke:#333
classDef harness fill:#fd2,stroke:#333
classDef adopt fill:#2d8,stroke:#333
classDef iter fill:#f66,stroke:#333
Rollout Plan
| Stage | Action | Timeline |
|---|---|---|
| Week 1-2 | Build new offline harness; set up IPS, NDCG@3, click-bait detector, LLM-judge | 2 weeks |
| Week 3 | Replay last 6 experiments through new harness; compute correlation | 1 week |
| Week 4 | If correlation ≥ 0.6 on replay, adopt for next quarter; if not, iterate | 1 week |
| Forward | Track rolling 90-day Pearson; alarm if < 0.5; require harness rebuild | continuous |
Metrics Dashboard
| Metric | Formula | Target | Alert |
|---|---|---|---|
| Rolling 90-day Pearson | corr(offline-Δ_i, online-Δ_i) over last 90d experiments | ≥ 0.6 | < 0.5 |
| Last-6-experiment direction agreement | count(sign(offline-Δ) == sign(online-Δ)) / 6 | ≥ 5/6 | < 4/6 |
| Eval-set staleness | days since last refresh | ≤ 7 | > 14 |
| KL divergence eval-vs-prod | KL(eval queries ∥ prod queries) | | |
| Mobile NDCG@3 vs desktop NDCG@10 | both tracked separately | per-surface | n/a |
Real-Incident Vignette
The team froze model shipping for 4 weeks while the harness was rebuilt. PM was unhappy: "we're not shipping anything for a month." The Applied ML Engineer's defense: "The cost of waiting 4 weeks is one quarter of velocity. The cost of not waiting is shipping noise winners against a broken offline metric for the next year. We've shipped 2 reranker changes in the last 6 months that won offline and went flat online. The pattern is the metric, not the model. We're rebuilding the metric." After 4 weeks, the new harness showed Pearson 0.71 on the 6-experiment replay. The team resumed shipping. Over the next 4 quarters, 11 of 13 launches showed offline-online direction agreement. The 4-week freeze paid back in trust and shipped lift.
Cross-Story Dependencies
- Consumes: signals from AML-03 (hypothesis design) — every experiment that ships records its offline-Δ and online-Δ; the correlation tracker reads from this.
- Produces: a Pearson threshold gate that AML-03 must check before trusting offline evidence in pre-registration.
- Cross-references: ../RAG-MCP-Integration/09-rag-retrieval-pipeline-deep-dive.md for the offline-eval methodology being rebuilt; ../Ground-Truth-Evolution/ for the ML-03 search-ranking-UI-redesign scenario, which is a related but distinct correlation collapse driven by UI change rather than metric mismatch.
Master's-DS Depth Callout — Goodhart's Law and IPS Estimators
Goodhart's Law (formalized): when a measure becomes a target, it ceases to be a good measure. Applied ML systems goodhart their offline metric the moment training loss is computed against it. The mitigation is not "find a better single metric" — any single metric will eventually goodhart. The mitigation is diversity: a portfolio of offline metrics where no single one is sufficient to win, and each catches a failure mode the others miss.
For the MangaAssist RAG harness, the new portfolio is:
- NDCG@3 — position-weighted, mobile-aligned ranking metric (primary)
- IPS-corrected CTR — counterfactual evaluation; corrects selection bias in click-logged data
- Click-bait adversarial detector — catches Goodhart cases where ranking quality goes up but content quality goes down
- LLM-judged factuality on RAG-grounded answers — catches hallucination cases
To win promotion, a model must improve the primary AND not regress any secondary by more than threshold. This is harder to game than any single metric.
IPS estimator — for selection bias correction. The intuition: in click-logged data, items that were rarely shown have rare exposure; their per-click value is up-weighted by 1/π(item|query) where π is the propensity of showing that item. Variance is the cost; SNIPS (self-normalized IPS) and Doubly-Robust estimators are practical workhorses. The Applied ML Engineer doesn't derive these but knows to ask "are we using IPS or just averaging clicks?" — if no, selection bias is in the metric.
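A toy sketch of IPS and SNIPS CTR estimates from click-logged data, under the 1/π reweighting described above; propensities and clicks are simulated, not from any real logging policy:

```python
# Toy IPS vs SNIPS estimate of a target policy's CTR from click logs gathered
# under a logging policy with known propensities pi (all numbers simulated).
import numpy as np

def ips_ctr(clicks, shown_by_target, pi):
    """IPS: reweight logged clicks by 1/pi wherever the target policy agrees."""
    w = shown_by_target / pi
    return float(np.mean(w * clicks))

def snips_ctr(clicks, shown_by_target, pi):
    """SNIPS: self-normalized IPS; lower variance at the cost of small bias."""
    w = shown_by_target / pi
    return float(np.sum(w * clicks) / np.sum(w))

rng = np.random.default_rng(1)
n = 50_000
pi = rng.uniform(0.05, 0.9, size=n)                  # logging propensities
clicks = rng.binomial(1, 0.10 + 0.20 * pi)           # clicks, correlated with exposure
shown_by_target = rng.binomial(1, pi).astype(float)  # does the target policy show it?
print("naive mean:", clicks.mean())
print("IPS  CTR :", ips_ctr(clicks, shown_by_target, pi))
print("SNIPS CTR:", snips_ctr(clicks, shown_by_target, pi))
```

SNIPS stays inside [0, 1] by construction, which is why it is the practical workhorse when propensities get small.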
Amazon Product Lens Callout — Learn and Be Curious + Are Right A Lot
Learn and Be Curious is what makes the Applied ML Engineer ask "why don't they correlate?" instead of accepting the offline win. Teams that goodhart their offline metric for years are teams that stopped asking. Quarterly correlation audits are the institutional discipline.
Are Right A Lot is what disciplines the response when correlation collapses. Wrong response: ship anyway and hope. Right response: slow down, diagnose, fix the harness. In a six-pager justifying a 4-week ship-freeze: "We declined to ship this quarter despite +5% offline gain because offline-online correlation has dropped to 0.41, below our 0.5 threshold. Offline gain is no longer a trustworthy predictor of customer behavior. We are rebuilding the offline harness in Q4; expected restoration of correlation by end of November; resumption of model-shipping in December. Cost of waiting: one quarter. Cost of not waiting: shipping noise winners for the next year, eroding both customer trust and team credibility with leadership."
AML-05: Business-KPI Guardrails for Promotion
TL;DR
The reranker change shows +6% useful-answer-rate (significant). CSAT regresses -1.4% (significant). The pre-declared CSAT guardrail threshold is -1%. The PM argues to ship anyway. The Applied ML Engineer's job is to enforce the veto mechanically. Pre-declaration is the load-bearing word; guardrails negotiated post-hoc are not guardrails.
As a / I want / So that
As an Applied ML Engineer responsible for ship/no-ship decisions I want to enforce pre-declared mechanical guardrail vetoes regardless of political pressure So that model promotion is governed by customer-experience evidence, not by who has the most leverage in a launch meeting
Customer Pain (Working Backwards)
Akari, 35, has used MangaAssist daily for two years. She trusts the bot's recommendations. After a recent launch, she notices the recommendations feel "off" — she still clicks (the engagement metric), but she fills out the post-session survey saying she's less satisfied. She doesn't churn immediately. She becomes one of the slow-erosion customers: less trust, less engagement at month 3, churned at month 6.
The customer pain in this scenario is delayed. CSAT is the leading indicator; engagement erosion is the lagging indicator, arriving months later. The model that wins engagement and loses CSAT is the model that ships, looks good for 4 weeks, and slowly hollows out the customer base. A pre-declared CSAT veto catches this.
Hypothesis & Why-We-Believe
| H0 | The shipped reranker change preserves customer satisfaction at the level required for sustainable engagement growth |
| H1 | The reranker change regresses CSAT by ≥1% relative, indicating the engagement gain comes at customer-trust cost; veto fires |
| Prior evidence | Industry: engagement-bait-without-satisfaction is a classic anti-pattern (YouTube 2017 watch-time → recommendation overhaul; Facebook 2018 engagement → MSI redesign). Internal: prior reranker change with -0.8% CSAT and +4% engagement showed +6-month retention -1.2pp. |
| Disconfirming evidence | If 28-day post-launch holdout doesn't show retention erosion despite CSAT drop, the relationship between CSAT and retention is weaker than priors suggested |
| Prior probability of H1 | n/a (this is the guardrail check, not a hypothesis test in the new-launch sense) |
Experiment Design (the guardrail test)
The "experiment" here is the launch readiness review. The test is whether each pre-declared guardrail clears its threshold.
| Field | Pre-declared in experiment YAML |
|---|---|
| Primary | useful-answer-rate, MDE +3% rel — measured |
| Guardrail 1: p95 turn latency | ≤ 800ms — must hold |
| Guardrail 2: CSAT (5-pt) | ≥ -1% rel — must hold |
| Guardrail 3: 28d retention | ≥ -0.5% rel — must hold (high-stakes) |
| Guardrail 4: spam-flag rate | ≤ +5% rel — must hold |
| Guardrail 5: per-cohort useful-answer | ≥ -3% rel for any cohort — must hold |
| Multiple testing correction | Bonferroni: per-guardrail α = 0.01 (5 tests, 0.05/5) |
| Veto rule | If any guardrail breaches at adjusted α < 0.01, mechanical veto |
| Override | None. Override requires written incident-style review + Director sign-off + retrospective if shipped. |
The five-guardrail decision matrix
| Guardrail | Pre | Post | Δ relative | Adjusted p | Status |
|---|---|---|---|---|---|
| Useful-answer rate (primary) | 0.183 | 0.194 | +6.0% | <0.001 | ✅ |
| p95 turn latency | 720ms | 740ms | +20ms | n/a | ✅ |
| CSAT | 4.21 | 4.15 | -1.43% | 0.02 | ❌ VETO |
| 28d retention | 51.2% | 50.9% | -0.59% | 0.18 | ⚠️ near-veto, not breached |
| Spam-flag rate | 0.41% | 0.41% | +0.0% | n/a | ✅ |
| JP cohort useful-answer | 0.171 | 0.174 | +1.8% | 0.04 | ✅ |
CSAT breaches at -1.43% (threshold -1%), with an adjusted p of 0.02 — significant at the family-wise level. The veto fires mechanically.
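The mechanical veto is simple enough to be code. A sketch, shown IUT-style with each guardrail tested at an unadjusted per-guardrail α (one of the correction choices discussed in this scenario's depth callout); the field names and numbers are illustrative, not the team's actual YAML schema:

```python
# Mechanical guardrail veto: a breach = regression past the pre-declared
# threshold AND statistically significant. IUT-style: each guardrail tested
# at its own per-guardrail alpha.
def veto_decision(guardrails, alpha=0.05):
    breaches = [
        g["name"]
        for g in guardrails
        if g["delta_rel"] < g["threshold_rel"] and g["p_value"] < alpha
    ]
    return len(breaches) > 0, breaches

guardrails = [
    {"name": "csat", "delta_rel": -0.0143, "threshold_rel": -0.01, "p_value": 0.02},
    {"name": "retention_28d", "delta_rel": -0.0059, "threshold_rel": -0.005, "p_value": 0.18},
]
vetoed, breaches = veto_decision(guardrails)
print(f"vetoed={vetoed}, breaches={breaches}")  # CSAT breaches; retention is not significant
```

The point of encoding it: there is no argument a launch meeting can have with a two-line predicate.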
The defense — when the PM pushes back
PM: "CSAT is a noisy survey metric. The primary moved by 6%. Let's ship and revisit in 4 weeks."
Applied ML Engineer: "CSAT is the user telling us, in their words, that the new ranking feels worse. The primary metric is a behavioral proxy; CSAT is a stated preference. They disagree. When they disagree at p=0.02, we have to take the user's word. We pre-registered CSAT as a guardrail with -1% threshold; the threshold was breached at -1.43%. The mechanical action is veto. The right next move is to investigate why CSAT regressed despite the engagement lift — likely candidates: ranking is more aggressive on click-bait, or the new ordering surfaces titles users find emotionally heavier than expected. Both are diagnosable in a follow-up offline analysis. Shipping a model that the user is telling us they don't like, on the bet that we're wrong about CSAT, is not a bet I'd make."
PM: "But the launch is in OP1. Leadership expects it."
Applied ML Engineer: "The launch in OP1 was conditional on guardrails clearing. Shipping a guardrail breach turns OP1 from 'we shipped X with positive customer impact' into 'we shipped X with stated-customer-dissatisfaction.' The OP1 narrative is worse with the breach than with the slip. I can write the slip narrative and own it."
The artefact this exchange produces is the defensible decision. Six months later, when retention does (or doesn't) regress, the team has the artefact to point to. Either way, the team's credibility increases — they vetoed when they should have, or they vetoed conservatively and learned the calibration.
Architecture / Wiring
graph TB
EXP[Experiment results<br/>14d A/B complete]
EXP --> P[Primary check<br/>useful-answer +6% sig]
EXP --> G1[Guardrail 1<br/>latency]
EXP --> G2[Guardrail 2<br/>CSAT]
EXP --> G3[Guardrail 3<br/>retention 28d]
EXP --> G4[Guardrail 4<br/>spam]
EXP --> G5[Guardrail 5<br/>per-cohort]
P --> AND{All pass at<br/>adjusted-α 0.01?}
G1 --> AND
G2 --> AND
G3 --> AND
G4 --> AND
G5 --> AND
AND -->|All pass| SHIP[Ship<br/>see AML-07]:::ship
AND -->|Any breach| VETO[Mechanical veto<br/>document and investigate]:::stop
classDef ship fill:#2d8,stroke:#333
classDef stop fill:#f66,stroke:#333
Rollout Plan
When veto fires, "rollout" becomes an investigation plan:
| Stage | Action | Owner |
|---|---|---|
| Day 0 | Veto fires; experiment paused at current rollout | Applied ML Eng |
| Day 1 | Document veto in launch readiness review with full data | Applied ML Eng |
| Day 1-3 | CSAT drill-down: which session types, which cohorts, which catalog segments | Applied ML Eng + Data Scientist |
| Day 4-7 | Hypothesize root cause (click-bait? emotional-heaviness?); design diagnostic offline test | Applied ML Eng |
| Day 8-14 | Run diagnostic; either re-train with adjusted training signal or kill the experiment | Applied ML Eng + Platform ML |
| Day 14+ | If re-trainable: schedule next experiment iteration; if not: archive learnings, replan portfolio | EM + PM |
Metrics Dashboard
| Metric | Formula | Target | Veto threshold | Status |
|---|---|---|---|---|
| Useful-answer rate | per-user rate | +3% rel | n/a (primary) | ✅ |
| p95 latency | p95 turn-to-first-token | ≤ 800ms | > 800ms | ✅ |
| CSAT | 5-pt mean | -0% rel | -1% rel | ❌ |
| 28d retention | rolling | -0% rel | -0.5% rel | ⚠️ (-0.59% rel, p=0.18, near veto) |
| Spam-flag rate | flags/sessions | -0% rel | +5% rel | ✅ |
| Per-cohort | each cohort metric | -0% rel | -3% rel | ✅ |
| Family-wise α-adjusted | Bonferroni 5 tests | each at 0.01 | breach | CSAT breaches |
Real-Incident Vignette
The veto fired. The PM escalated to Director. The Director read the launch readiness document, saw the CSAT breach pre-registered with mechanical -1% threshold, and supported the veto. The team investigated: the new reranker had learned to surface "shock value" titles that drove clicks but felt off-brand for MangaAssist. The diagnostic: a manually-constructed eval set of "tonally-jarring" recommendations showed the new model surfaced them at 18% rate vs old at 7%. The fix: add a tone-preference signal to the reranker training (using ABSA aspect signals from US-MLE-03). Re-ran the experiment 6 weeks later: useful-answer +5.2%, CSAT -0.3% (within threshold). The launch shipped with full leadership support. The original veto built credibility; the second-iteration ship validated the discipline.
Cross-Story Dependencies
- Consumes: AML-03 (hypothesis & sample size) — the pre-registered YAML containing all guardrail thresholds.
- Consumes: AML-06 (cohort fairness) — the per-cohort guardrail is one of the five.
- Produces: a ship-or-no-ship decision feeding AML-07 (production integration).
- Cross-references: US-MLE-02 reranker training pipeline (where the fix re-trains); US-MLE-03 ABSA aspect signal (the new training signal).
Master's-DS Depth Callout — Family-Wise Error Rate, Bonferroni vs Holm vs IUT
Five guardrails tested at α=0.05 each: family-wise probability of at least one false breach is 1 - (1-0.05)^5 ≈ 0.226. Almost 23% chance of false-veto in a clean experiment. Three corrections to choose from:
| Method | Adjusted α | Pros | Cons |
|---|---|---|---|
| Bonferroni | α/k (e.g., 0.01 per test for 5 guardrails) | Simple, conservative, defensible | Loses power; legit breaches at α=0.04 don't fire |
| Holm-Bonferroni | Sequential (test smallest p first at α/k, next at α/(k-1), etc.) | More powerful than Bonferroni | More complex to explain |
| Intersection-Union Test (IUT) | Each guardrail at unadjusted α; pass requires ALL pass | Conservative on PASS side, no FWER inflation | Inflates false-no-launch rate |
For chatbot guardrails, IUT is the right default. Reasoning: false-veto is recoverable (re-run experiment with re-trained model); false-ship is unrecoverable (customer harm at scale). Symmetric error costs do not apply; asymmetry favors conservative-on-ship. Use IUT for guardrails; pre-register the family-wise correction method in the YAML.
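The FWER arithmetic and the Holm step-down from the table can be sketched directly; the p-values in the demo call are illustrative:

```python
# FWER for 5 uncorrected tests, plus a Holm-Bonferroni step-down.
k, alpha = 5, 0.05
fwer = 1 - (1 - alpha) ** k
print(f"uncorrected family-wise error over {k} guardrails: {fwer:.3f}")

def holm_reject(p_values, alpha=0.05):
    """Holm step-down: smallest p tested at alpha/k, next at alpha/(k-1), ..."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    reject = [False] * len(p_values)
    for step, i in enumerate(order):
        if p_values[i] <= alpha / (len(p_values) - step):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

print(holm_reject([0.001, 0.04, 0.2, 0.3, 0.5]))
```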
Amazon Product Lens Callout — Customer Obsession + Have Backbone, Disagree and Commit + Earn Trust
Customer Obsession: when CSAT regresses, the user is telling us in their own words that they liked it less. Engagement is what they did; CSAT is what they meant. When the two disagree, customer-meaning wins.
Have Backbone, Disagree and Commit: the veto goes against the PM's preference. Maintaining the veto is the role-defining moment. The PM is right that CSAT is noisy; the role is to enforce the threshold anyway, because the alternative is guardrails-by-political-weight, which destroys the institution.
Earn Trust: vetos build trust over multiple quarters. Teams that veto consistently on pre-registered thresholds earn a reputation for shipping launches that actually work in production. The reverse is also true: teams that override their thresholds politically lose credibility with leadership over 3-6 quarters. The mechanical veto is a long-game investment.
Six-pager: "Our launch criteria are pre-declared and mechanical. Of the last 12 reranker / recsys experiments, we declined to ship 3 (25%). Of the 9 we shipped, 8 maintained their lift at 28-day post-launch (89% retention rate). The 3 we declined to ship would, by retrospective analysis, have regressed retention by 0.4-0.8% each — a $X-million-per-year cost we avoided. The discipline cost: 3 more quarters of slower velocity. The discipline yield: $X million per year in avoided regression. NPV positive."
AML-06: Cohort Fairness & Locale Stratification
TL;DR
Aggregate +2.8% CTR (significant, p<0.01). Stratified: EN +4.1%, mixed +2.5%, JP -8.2%. JP is 30% of traffic. The aggregate clears its bar; the per-cohort guardrail (declared at -3% per cohort) fires. The Applied ML Engineer recommends abort + redesign with a JP-stratified loss, accepting 4 weeks of velocity cost to preserve trust with the strategically important JP user base.
As a / I want / So that
As an Applied ML Engineer accountable for cohort fairness in MangaAssist I want to detect cohort regressions before promotion, enforce per-cohort guardrails, and prescribe redesign rather than carve-outs So that the JP user base — strategically critical, ap-northeast-1 data residency, primary catalog locale — does not experience second-class model quality
Customer Pain (Working Backwards)
Hiroshi, 42, lives in Osaka. Japanese is his only language. He's been using MangaAssist for a year, mostly on his phone during commutes. After a recent launch, the bot's recommendations feel less relevant — he can't articulate why, but he stopped following the bot's "you might like" suggestions. He still uses the bot for direct queries. Over 12 weeks, his session count drops from ~5/week to ~2/week. He hasn't churned, but his engagement is hollow.
The customer pain on cohort-fairness scenarios is concentrated and largely invisible to the team. Hiroshi is one of millions of JP users whose model quality regressed; the aggregate metric averages his loss against EN gains. Without stratification, the team never sees Hiroshi.
Hypothesis & Why-We-Believe
| H0 | A reranker change with aggregate +2.8% CTR has comparable per-cohort lift across locales (EN, JP, mixed) within ±2pp absolute |
| H1 | The change has cohort-asymmetric impact: EN gains by ≥2pp, JP loses by ≥3pp; per-cohort guardrail veto fires |
| Prior evidence | Training data is 78% EN, 22% JP. Two-tower model item embeddings cluster JP titles less well than EN titles. Cold-start fallback is genre-popularity, biased toward EN-titled global hits. |
| Disconfirming evidence | If JP cohort regression is within sampling noise (cohort n underpowered), the asymmetry is artifact, not real |
| Prior probability of H1 | 0.6 (moderate-high; training-data imbalance + embedding clustering issue + popularity bias) |
Experiment Design
| Field | Choice |
|---|---|
| Population | All MangaAssist users; cohort dimensions: locale {EN, JP, mixed} × tenure {new < 30d, returning ≥ 30d} × device {mobile, desktop, app} |
| Stratified sample size | Powered for the smallest cohort of interest (mixed-new-mobile, ~3% of traffic) |
| Per-cohort MDE | -3% relative on per-cohort primary metric |
| Per-cohort guardrail | -3% relative regression triggers veto |
| Multilevel analysis | random treatment slopes per cohort (multilevel model fit to experiment data) |
| Pre-registration | cohort dimensions and thresholds locked in YAML before experiment starts |
The cohort stratification table
| Cohort | % traffic | Pre primary | Post primary | Δ rel | Per-cohort guardrail | Status |
|---|---|---|---|---|---|---|
| EN — total | 55% | 0.183 | 0.190 | +4.1% | -3% | ✅ |
| EN — new | 12% | 0.171 | 0.180 | +5.3% | -3% | ✅ |
| EN — returning | 43% | 0.186 | 0.192 | +3.8% | -3% | ✅ |
| JP — total | 30% | 0.196 | 0.180 | -8.2% | -3% | ❌ VETO |
| JP — new | 7% | 0.182 | 0.165 | -9.3% | -3% | ❌ |
| JP — returning | 23% | 0.200 | 0.184 | -7.8% | -3% | ❌ |
| mixed — total | 15% | 0.178 | 0.183 | +2.5% | -3% | ✅ |
| AGGREGATE | 100% | 0.184 | 0.189 | +2.8% | n/a | (primary clears) |
The aggregate looks great. Both JP sub-cohorts breach the per-cohort guardrail. The launch vetoes.
Diagnosing the cohort asymmetry
graph TB
A[JP cohort -8.2%]
A --> Q1{Training data<br/>imbalance?}
Q1 -->|Yes 78%-22%| FIX1[Stratified loss<br/>reweighting]
A --> Q2{Item embedding<br/>clustering?}
Q2 -->|Yes - JP titles less clustered| FIX2[JP-specific embedding<br/>fine-tune]
A --> Q3{Cold-start<br/>popularity bias?}
Q3 -->|Yes - EN-titled global hits dominate| FIX3[JP-locale-aware<br/>cold-start]
A --> Q4{Cross-encoder<br/>training signal<br/>multilingual?}
Q4 -->|Partially - tokenizer shared| FIX4[Multilingual<br/>fine-tune]
FIX1 --> COMBINED[Combined fix:<br/>JP-stratified retraining<br/>~4 engineer-weeks]:::fix
FIX2 --> COMBINED
FIX3 --> COMBINED
FIX4 --> COMBINED
classDef fix fill:#2d8,stroke:#333
Two paths forward:
| Option | Description | Cost | Pros | Cons |
|---|---|---|---|---|
| A | JP-stratified retraining with reweighted loss + JP-specific cold-start | 4 engineer-weeks | Fixes underlying bias; JP serves at parity going forward | 4 weeks of velocity loss |
| B | Ship EN-only with JP carve-out flag (JP keeps old reranker) | 1 engineer-week | Fast; preserves EN gains | Public commitment to JP second-class model; long-term trust erosion |
Recommendation: Option A. Reasoning: JP is strategically important (data residency, primary catalog locale, engaged users). A JP carve-out is a public commitment to second-class treatment that erodes trust over many launches. The 4 weeks of velocity is paid in this quarter to avoid 4 quarters of erosion.
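Option A's "reweighted loss" reduces, in its simplest form, to inverse-frequency sample weights per locale. A minimal sketch under that assumption; this is not the team's actual training code:

```python
# Inverse-frequency locale weights: each locale weighted inversely to its
# share of training data, normalized so the mean per-example weight is 1.
from collections import Counter

def locale_weights(locales):
    counts = Counter(locales)
    n, k = len(locales), len(counts)
    return {loc: n / (k * c) for loc, c in counts.items()}

# 78% EN / 22% JP training mix from the scenario
locales = ["EN"] * 78 + ["JP"] * 22
w = locale_weights(locales)
print(w)  # JP examples up-weighted, EN examples down-weighted
```

In cross-encoder training these would feed a per-sample weight in the loss, so each locale contributes equally to the gradient regardless of its data share.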
Architecture / Wiring
graph LR
EXP[Experiment results]
EXP --> AGG[Aggregate analysis]
EXP --> STRAT[Stratified per-cohort analysis]
STRAT --> EN[EN cohort]
STRAT --> JP[JP cohort]
STRAT --> MX[mixed cohort]
EN --> CHECK{Per-cohort<br/>guardrail<br/>cleared?}
JP --> CHECK
MX --> CHECK
CHECK -->|All pass| SHIP[Ship]:::ship
CHECK -->|Any fail| VETO[VETO + redesign]:::veto
VETO --> A[Option A: stratified retrain]:::fix
VETO --> B[Option B: carve-out]:::reject
A --> RETRAIN[4 eng-wk JP-stratified retrain]
B --> CARVEOUT[1 eng-wk; rejected for trust reasons]
classDef ship fill:#2d8,stroke:#333
classDef veto fill:#f66,stroke:#333
classDef fix fill:#2d8,stroke:#333
classDef reject fill:#9cf,stroke:#333
Rollout Plan
| Stage | Action | Timeline |
|---|---|---|
| Day 0 | Veto fires on JP cohort; experiment paused | immediately |
| Day 1-3 | Cohort-asymmetry root-cause analysis | 3 days |
| Day 4 | Decision document: Option A (stratified retrain) | 1 day |
| Week 2-5 | JP-stratified retraining: data prep, training, eval | 4 weeks |
| Week 6 | Re-run A/B with new model; expect EN +3-4%, JP +0-2% (parity floor) | 14 days |
| Week 8 | Ship-decision based on re-run results | — |
Metrics Dashboard
| Metric | Formula | Target | Per-cohort threshold | Status |
|---|---|---|---|---|
| Aggregate primary | useful-answer | +3% rel | n/a | ✅ |
| EN cohort primary | EN-only useful-answer | -0% rel | -3% rel | ✅ |
| JP cohort primary | JP-only useful-answer | -0% rel | -3% rel | ❌ -8.2% |
| mixed cohort primary | mixed-only useful-answer | -0% rel | -3% rel | ✅ |
| Cohort sample size adequacy | per-cohort n vs needed n | ≥ 1.0× | < 0.8× (underpowered) | ⚠️ check |
| Multilevel random-slope variance | between-cohort treatment-effect variance | low | high (significant heterogeneity) | high |
| Worst-cohort floor | min cohort metric / aggregate metric | ≥ 0.95 | < 0.95 | breached |
Real-Incident Vignette
The team chose Option A. The 4-week JP-stratified retraining used: (1) loss reweighting to balance JP/EN ratios in cross-encoder training, (2) JP-specific embedding fine-tune with ABSA aspect signals, (3) JP-locale-aware cold-start fallback. Re-run after retrain: aggregate +3.6% (up from the original +2.8%, now that JP no longer drags the average), EN +3.9% (modestly below the original +4.1%), JP +1.2%, mixed +2.4%. Per-cohort guardrails clear. Shipped. The 28-day retention impact: aggregate +0.3pp, JP cohort retention preserved (no regression). The decision to insist on cohort fairness lost a small slice of EN gain to gain a large slice of preserved JP trust. The retrospective showed JP cohort engagement increased slightly — the previous model's JP under-service had been a quiet drag.
Cross-Story Dependencies
- Consumes: AML-03 (hypothesis design) — pre-registered cohort dimensions and per-cohort thresholds.
- Consumes: AML-05 (guardrails) — the per-cohort guardrail is one of the five family-wise tests.
- Cross-references: US-MLE-08 cover-art classifier — same cohort-fairness pattern applies to image-classification surfaces; manga-style preferences vary by locale.
- Cross-references: ../Ground-Truth-Evolution/ML-Scenarios — the multilingual ground-truth evolution scenario covers training-data imbalance over time.
Master's-DS Depth Callout — Simpson's Paradox + Multilevel Modeling + Conditional MDE
Aggregate (marginal) effects average over cohort distributions; cohort-conditional effects condition on cohort. The two can disagree (Simpson's paradox) — aggregate +2.8% with JP -8% is a textbook case. Multilevel / hierarchical models — Y = Xβ + Z·u + ε where u is per-cohort random effect — give you both simultaneously and partial-pool the cohort estimates (small cohorts borrow from the global mean, regularizing noisy per-cohort signals).
# multilevel_cohort_eval.py
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_parquet('experiment_us_mle_02.parquet')
model = smf.mixedlm(
    "outcome ~ treatment",
    df,
    groups=df["cohort_locale"],
    re_formula="~treatment",  # per-cohort random treatment slopes
).fit()
print(model.summary())
# Look at: random-effect variance for treatment (between-cohort heterogeneity)
# Significant variance → cohorts disagree → per-cohort decisions required
Conditional MDE: a 3% MDE at aggregate is achievable with n_aggregate per arm. Detecting the same 3% MDE within a cohort that is 7% of traffic requires roughly n_aggregate users from that cohort — i.e. n_aggregate / 0.07 total traffic, about 14× the experiment. If the experiment is sized for aggregate detection only, cohort claims are noise. The team that ships aggregate-positive on under-powered cohort signals is shipping false safety.
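The arithmetic above can be made concrete with the standard two-proportion z-test sizing formula; the 0.40 baseline useful-answer rate and the helper name `n_per_arm` are assumptions for illustration, not figures from the experiment.

```python
# cohort_mde_sizing.py: sketch of the conditional-MDE arithmetic.
from scipy.stats import norm

def n_per_arm(p0, mde_rel, alpha=0.05, power=0.8):
    """Per-arm sample size for a two-proportion z-test (normal approximation)."""
    p1 = p0 * (1 + mde_rel)
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    p_bar = (p0 + p1) / 2
    numer = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
             + z_b * (p0 * (1 - p0) + p1 * (1 - p1)) ** 0.5) ** 2
    return numer / (p1 - p0) ** 2

n_agg = n_per_arm(0.40, 0.03)        # aggregate sizing for a 3% relative MDE
jp_share = 0.07                      # JP cohort's share of traffic
n_total_for_jp = n_agg / jp_share    # total traffic so the JP cohort reaches n_agg
print(f"aggregate n/arm: {n_agg:,.0f}; "
      f"JP-powered run needs {n_total_for_jp:,.0f}/arm ({1 / jp_share:.0f}x)")
```

Note this is sizing without CUPED; variance reduction shrinks both numbers, but the ~14× ratio between aggregate and cohort-powered runs is unchanged.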
Amazon Product Lens Callout — Earn Trust + Insist on Highest Standards + Success and Scale Bring Broad Responsibility
Earn Trust with cohorts: shipping aggregate-positive while JP regresses erodes JP trust over many launches. Each launch is small; cumulative erosion is large. The discipline of cohort fairness is a multi-launch investment in the JP-cohort trust account.
Insist on the Highest Standards is the willingness to delay shipping by 4 weeks to fix a cohort issue. Most teams ship with the cohort regression flagged as a "follow-up". Most "follow-ups" never happen. The Applied ML Engineer who insists on cohort-clean launches is the one whose launches still work in production a year later.
Success and Scale Bring Broad Responsibility — the LP added in 2021. As MangaAssist scales, JP isn't 30% of traffic; in absolute numbers it's millions of users. A 3% regression is hundreds of thousands of users having a worse experience. Scale is what makes cohort fairness a responsibility, not a nice-to-have.
Six-pager: "At MangaAssist scale, a 3% cohort regression on JP traffic affects [N] users monthly. We do not ship cohort regressions at scale, even when aggregate metrics are positive. The JP cohort is strategically critical (ap-northeast-1 data residency, primary catalog locale, second-most-engaged cohort by tenure). Carving out JP from the launch is a public commitment to second-class treatment we will not make. We are paying 4 weeks of velocity to deliver the launch at parity for all cohorts."
AML-07: Production Integration & Latency Budgets
TL;DR
The chatbot turn budget is 800ms p95 from user-message-received to first-streaming-token-out. The new reranker takes 240ms p95 (up from 180ms). The Applied ML Engineer's job is to allocate the 800ms across stages, place the reranker so it fits, design the fallback chain, and instrument per-stage telemetry so a future incident triages in minutes not hours.
As a / I want / So that
As an Applied ML Engineer integrating ML predictions into the chatbot turn pipeline I want to allocate latency budget per stage, design fallback chains, and instrument telemetry that supports rapid diagnosis So that the chatbot meets its 800ms p95 SLO under load and degrades gracefully when ML components fail
Customer Pain (Working Backwards)
Mia, 22, types her query and waits. If the bot takes more than 1 second to start streaming, she perceives it as broken. The actual data: at an 800ms p95, only 5% of turns exceed 800ms and few cross her 1-second threshold; if p95 degrades to 1200ms, the fraction of turns crossing that threshold grows by an order of magnitude. Mia at the 95th percentile is a stress test of the SLO; Mia at the 99th percentile is the customer who decides the bot is slow. Latency is not a backend concern — it is the user's first-impression metric.
The customer pain is binary: under the SLO, the chatbot feels responsive; over it, the chatbot feels broken. The Applied ML Engineer who fights for the latency budget is the one defending Mia's perception of the product.
Hypothesis & Why-We-Believe
| H0 | The new reranker (240ms p95) fits in the 800ms turn budget without compromising other stages |
| H1 | At 240ms reranker p95, the turn-budget total exceeds 800ms p95 unless other stages are tightened or fallbacks are added |
| Prior evidence | Current pipeline at 720ms p95 (180ms reranker); +60ms reranker + propagation effects could push to 800-840ms; load-test baseline measurement available |
| Disconfirming evidence | If load-test shows 240ms reranker only adds 50ms to turn budget (because of pipeline parallelism), then total stays at ~770ms |
| Prior probability of H1 | 0.7 (high; serial stages dominate) |
The 800ms turn budget allocation
graph LR
U[User msg<br/>received]
U -->|t=0| P[1. Input parse<br/>+ PII redact<br/>30ms]
P --> I[2. Intent classify<br/>US-MLE-01<br/>50ms]
I --> R[3. Retrieval<br/>OpenSearch<br/>kNN+BM25 RRF<br/>220ms]
R --> RR[4. Rerank<br/>cross-encoder<br/>240ms — NEW]
RR --> FM[5. FM call<br/>Bedrock Claude<br/>first-token streaming<br/>200ms]
FM --> FMT[6. Format + emit<br/>40ms]
FMT --> O[Out: first<br/>streaming token]
classDef new fill:#fd2,stroke:#333
classDef ml fill:#9cf,stroke:#333
classDef stage fill:#fff,stroke:#333
class RR new
class I,RR ml
class P,R,FM,FMT stage
| Stage | Allocated p95 | Notes |
|---|---|---|
| 1. Input parse + PII redact | 30ms | regex-based; deterministic |
| 2. Intent classify (US-MLE-01) | 50ms | DistilBERT, ml.g5.xlarge real-time endpoint, MME |
| 3. Retrieval | 220ms | OpenSearch kNN+BM25 RRF; HNSW ef_search=128 |
| 4. Rerank (NEW) | 240ms | MiniLM-L12 cross-encoder, ml.g5.2xlarge MME, top-K=30 |
| 5. FM call (first-token streaming) | 200ms | Bedrock Claude, prefix cache hit ~85% |
| 6. Format + emit | 40ms | template render + WebSocket emit |
| Sum | 780ms | within 800ms SLO with 20ms headroom |
The serial-stage sum is 780ms — under SLO. But p95 doesn't compose linearly; the more stages, the worse the tail. Empirically, an 800ms p95 SLO requires ~720ms p95 sum-of-stages on a serial pipeline, because tail effects compound. The team must:
- Run parallel stages where possible (intent classify + retrieval are independent in some flows; can be fired in parallel for ~50ms savings).
- Add fallback chains for stages that occasionally spike.
- Reduce reranker top-K from 30→20 if budget pressure remains (saves ~50ms).
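The first lever, parallelizing independent stages, can be sketched with asyncio; the coroutine names and sleep-based latencies below are illustrative stand-ins, not the real SageMaker or OpenSearch clients.

```python
# parallel_stages.py: sketch of firing intent-classify and retrieval concurrently.
import asyncio
import time

async def classify_intent(msg):
    await asyncio.sleep(0.05)            # stands in for the ~50ms intent stage
    return "recommendation"

async def retrieve(msg):
    await asyncio.sleep(0.22)            # stands in for the ~220ms retrieval stage
    return ["doc-1", "doc-2"]

async def turn(msg):
    # Serial cost would be ~270ms; concurrent cost is max(50, 220) ≈ 220ms.
    return await asyncio.gather(classify_intent(msg), retrieve(msg))

start = time.perf_counter()
intent, docs = asyncio.run(turn("psychological mystery"))
elapsed_ms = (time.perf_counter() - start) * 1000
```

`asyncio.gather` returns results in task order, so the turn handler still receives `(intent, docs)` deterministically before combining them for intent-routed retrieval.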
The fallback chain
graph TB
U[User turn]
U --> S1{Reranker<br/>responding<br/>< 280ms?}
S1 -->|Yes| R1[Use reranked<br/>top-5]:::happy
S1 -->|No, > 280ms| S2{Reranker<br/>responding<br/>< 500ms?}
S2 -->|Yes| R2[Use reranked<br/>top-5 with budget warning]:::warn
S2 -->|No, timeout| FALLBACK[Fall back to<br/>BM25-only ranking]:::fallback
FALLBACK --> EMIT[Emit + log<br/>fallback engagement<br/>increment counter]:::fallback
R1 --> NORMAL[Normal flow]
R2 --> NORMAL
EMIT --> NORMAL
classDef happy fill:#2d8,stroke:#333
classDef warn fill:#fd2,stroke:#333
classDef fallback fill:#f66,stroke:#333
The fallback chain ensures graceful degradation:
- Reranker responds in < 280ms — use reranked results (happy path).
- Reranker responds in 280-500ms — use reranked results, log a budget warning (acceptable degradation).
- Reranker exceeds 500ms — fall back to BM25-only ranking, increment the fallback-engagement counter (mechanical degradation).
Fallback-engagement counter is a leading indicator: if it spikes from 0.1%/min to >1%/min, an incident is in progress (see AML-08).
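A minimal sketch of the timeout tier, assuming placeholder `rerank()` and `bm25_rank()` functions in place of the real stage clients:

```python
# reranker_fallback.py: sketch of the hard-timeout tier of the fallback chain.
import asyncio

FALLBACK_COUNTER = {"reranker": 0}        # feeds the fallback-engagement metric

async def rerank(candidates, latency_s):
    await asyncio.sleep(latency_s)        # simulated cross-encoder latency
    return sorted(candidates)

def bm25_rank(candidates):
    return list(candidates)               # lexical order, always available

async def ranked_top5(candidates, latency_s):
    try:
        # Hard 500ms ceiling: past it, degrade to BM25-only and count it.
        reranked = await asyncio.wait_for(rerank(candidates, latency_s),
                                          timeout=0.5)
        return reranked[:5]
    except asyncio.TimeoutError:
        FALLBACK_COUNTER["reranker"] += 1  # the AML-08 leading indicator
        return bm25_rank(candidates)[:5]

fast = asyncio.run(ranked_top5(["b", "a", "c"], latency_s=0.05))  # happy path
slow = asyncio.run(ranked_top5(["b", "a", "c"], latency_s=0.9))   # falls back
```

The 280-500ms warning tier is a log line on the happy path rather than a second `await`; only the 500ms ceiling changes the ranking source.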
Experiment Design — load test before production
| Field | Choice |
|---|---|
| Test environment | Staging cluster, identical to prod, ap-northeast-1 |
| Load profile | 1×, 2×, 5× peak prod RPS — sustained 30 min each |
| Latency targets | p50, p95, p99 per-stage AND turn-total |
| Failure modes tested | Reranker timeout (induced), retrieval timeout, FM timeout, partial outage of one MCP tool |
| Pass criteria | turn p95 ≤ 800ms at 2× peak; fallback engagement < 0.5% at 1× peak |
| Pre-registration | load-test config locked before production rollout decision |
Architecture / Wiring (full pipeline)
graph TB
USR[User WebSocket message]
USR --> GW[API Gateway]
GW --> APP[Chatbot turn handler<br/>ECS Fargate]
APP --> S1[1. Parse + PII<br/>30ms]
S1 --> S2A[2. Intent classify<br/>SageMaker MME<br/>US-MLE-01<br/>50ms]
S1 --> S2B[3. Retrieval<br/>OpenSearch<br/>220ms]
S2A --> S3[Combine results<br/>intent-routed retrieval]
S2B --> S3
S3 --> S4[4. Rerank<br/>SageMaker MME<br/>US-MLE-02<br/>240ms]
S4 --> S4F{Reranker timeout?}
S4F -->|No| S5[5. FM call<br/>Bedrock Claude<br/>200ms first-token]:::happy
S4F -->|Yes| S4FB[BM25 fallback]:::fallback
S4FB --> S5
S5 --> S6[6. Format + emit<br/>40ms]:::stage
S6 --> WS[WebSocket emit<br/>first token]
classDef happy fill:#2d8,stroke:#333
classDef fallback fill:#f66,stroke:#333
classDef stage fill:#fff,stroke:#333
Rollout Plan
| Stage | Population | Latency abort criteria |
|---|---|---|
| Pre-prod | Load test 1×, 2×, 5× peak | turn p95 > 850ms at 2× peak |
| 1% canary | 1% of users | turn p95 > 850ms; fallback engagement > 1%/min |
| 10% | 10% of users | turn p95 > 820ms |
| 50% A/B | 50% of users | turn p95 > 800ms (the SLO) |
| 100% | full rollout | continuous monitoring |
Metrics Dashboard
graph LR
D[Production telemetry]
D --> T[Turn-level<br/>p50/p95/p99]
D --> S[Per-stage<br/>p50/p95/p99]
D --> F[Fallback<br/>engagement<br/>per stage]
D --> E[Error rate<br/>per stage]
D --> C[Cohort latency<br/>JP / EN / mobile / desktop]
| Metric | Formula | Target | Alert |
|---|---|---|---|
| Turn p95 latency | p95 across all turns, 5-min rolling | ≤ 800ms | > 800ms 5min sustained |
| Turn p99 latency | p99 across all turns | ≤ 1500ms | > 1800ms 5min |
| Reranker p95 | per-stage p95 | ≤ 240ms | > 280ms 5min |
| Fallback engagement (reranker) | timeout fallbacks / total turns | < 0.1%/min | > 1%/min |
| Cohort latency split | turn p95 by locale × device | within 10% of aggregate | > 20% deviation |
| Error rate per stage | errors / requests per stage | < 0.05% | > 0.5% |
| FM cold-start fraction | first-token > 500ms | < 5% | > 15% |
Real-Incident Vignette
Three weeks after launch, the team notices fallback engagement crept from 0.1%/min to 0.4%/min during JP peak hours. Investigation: SageMaker MME endpoint for the reranker has uneven shard load — JP traffic concentrates on one model variant that is over-subscribed. The fix: split the reranker MME by locale, scale JP variant to 2× capacity. Fallback engagement returns to 0.1%/min within 30 min of deployment. The pre-built dashboard with cohort-stratified latency was the artefact that surfaced the problem; without it, the issue would have shown as a slow JP-cohort engagement decline weeks later.
Cross-Story Dependencies
- Consumes: AML-05 (guardrails) — the latency guardrail (≤ 800ms turn p95) is one of the five family-wise tests.
- Consumes: AML-06 (cohort fairness) — cohort-stratified latency monitoring catches per-locale degradation.
- Produces: telemetry (per-stage, per-cohort latency + fallback engagement) that AML-08 (incident triage) consumes.
- Cross-references: ../RAG-MCP-Integration/08-mcp-orchestration-router.md for orchestration patterns; US-MLE-02 reranker training pipeline for the model surface.
Master's-DS Depth Callout — Tail-Latency Composition and Hedged Requests
Naive latency math: total p95 = sum of per-stage p95. This is wrong — p95 does not compose additively. For independent stages, the true total is closer to the sum of (mean + 2×std), but tail effects compound non-linearly when stages are bursty.
The right model: simulate the pipeline with empirically measured per-stage latency distributions (pull 14 days of production telemetry; sample with replacement to construct synthetic turn timelines). The simulated total p95 is the realistic figure; for bursty pipelines it typically lands 5-15% above the additive estimate.
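A sketch of that simulation, with synthetic bursty per-stage distributions (a 4% spike probability, assumed) standing in for the 14 days of real telemetry:

```python
# tail_latency_sim.py: bootstrap-style pipeline simulation on synthetic stages.
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
stage_means = np.array([30, 50, 220, 240, 200, 40], dtype=float)  # budget table

stages = []
for m in stage_means:
    base = rng.normal(m, 0.1 * m, size=n)          # normal-load latency
    spike = (rng.random(n) < 0.04) * (2.0 * m)     # occasional 3x-latency bursts
    stages.append(base + spike)

turn_total = np.sum(stages, axis=0)                # serial composition
additive_p95 = sum(np.percentile(s, 95) for s in stages)
simulated_p95 = np.percentile(turn_total, 95)
print(f"additive p95 estimate: {additive_p95:.0f}ms")
print(f"simulated turn p95:    {simulated_p95:.0f}ms")
# With bursty stages, the simulated p95 exceeds the additive estimate: each
# stage's spikes hide below its own p95 (4% < 5%), but across six stages the
# chance that at least one stage spikes per turn is ~20%, so spikes dominate
# the turn-level tail.
```

Swapping the synthetic draws for `np.random.choice` over logged per-stage latencies gives the production version of the same estimate.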
Hedged requests for tail-latency reduction: send the same request to two replicas of a stage; use whichever responds first. Reduces p99 by 30-50% at 2× cost. Only worth it when p99 dominates customer-experience metrics. For chatbot turn pipeline, hedging the reranker call could reduce p99 from 1200ms to 800ms at the cost of 2× reranker compute. Pre-register the hedging decision; A/B-test it; ensure the fallback chain still works (hedge response cancels the slower call).
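A sketch of the hedging pattern, with simulated replica latencies in place of real endpoints; the function names are illustrative.

```python
# hedged_rerank.py: fire two replicas, keep the first response, cancel the rest.
import asyncio

async def rerank_replica(candidates, latency_s):
    await asyncio.sleep(latency_s)        # simulated replica latency
    return sorted(candidates)

async def hedged_rerank(candidates, latencies=(0.4, 0.05)):
    tasks = [asyncio.create_task(rerank_replica(candidates, l))
             for l in latencies]
    done, pending = await asyncio.wait(tasks,
                                       return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()                     # the hedge cancels the slower call
    return next(iter(done)).result()

# One replica is in its slow tail (400ms); the hedge returns at ~50ms.
ranked = asyncio.run(hedged_rerank(["b", "a", "c"]))
```

A turn is slow only when both replicas are slow, so a per-replica tail probability p becomes roughly p² for the pair; the BM25 fallback chain still applies when both time out.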
Amazon Product Lens Callout — Deliver Results + Frugality
Deliver Results: the chatbot has to actually work. Latency is the result; if turn p95 exceeds SLO, the launch failed regardless of model quality. The Applied ML Engineer who fights for latency budgets is the one defending the launch's actual outcome.
Frugality: the cheapest latency win is the one you don't need to engineer. Reducing reranker top-K from 30 to 20 saves 50ms of latency at minimal NDCG cost. Caching repeated queries saves entire stages. The Applied ML Engineer's frugal moves are the multi-quarter latency wins that compound.
Six-pager: "Turn-budget allocation for the reranker upgrade: 30 / 50 / 220 / 240 / 200 / 40 = 780ms p95 sum-of-stages, simulated total turn p95 of 815ms. To meet 800ms SLO, we (1) parallelize intent + retrieval (saves 50ms), (2) reduce reranker top-K from 30 to 20 (saves 50ms, costs 0.4% NDCG), (3) implement BM25-only fallback for reranker timeouts (graceful degradation). Final simulated turn p95: 770ms with 30ms safety margin. Load-tested at 2× peak; cohort-stratified telemetry live before rollout."
AML-08: Incident Triage — 'The Model Got Worse'
TL;DR
It's 3am. PagerDuty fires: reranker NDCG@10 dropped from 0.78 to 0.61 in the last hour. Traffic and latency are normal. The Applied ML Engineer is on call. The wrong move is random root-causing. The right move is the named-decision-tree triage: did anything change → upstream healthy → feature drift → eval staleness → UI change → query-distribution shift → escalate. MTTR target: localize in 15 minutes, mitigate in 60.
As a / I want / So that
As an Applied ML Engineer on call for production ML systems I want to localize a model-quality incident to a named root-cause category in 15 minutes and mitigate within 60 minutes So that customer-facing impact is bounded and the team has a clean root-cause to address in the post-incident review
Customer Pain (Working Backwards)
Tatsuya, 39, opens MangaAssist at 6am Tokyo time before his commute. He searches for "psychological mystery" — a query he runs weekly. The bot used to surface "Monster," "Death Note," "Erased" as the top three. This morning, the top three are unrelated action titles. He doesn't know there's an incident; he just sees that the bot is "having a bad day" and doesn't engage. If the incident lasts 6 hours, ~50,000 customer interactions are affected. If it lasts 24 hours, 200,000+. The MTTR is the customer-impact bound.
The customer pain on incidents is concentrated, immediate, and silent — Tatsuya doesn't file a ticket; he just disengages. The Applied ML Engineer's MTTR target is the protection against accumulated customer disengagement.
Hypothesis & Why-We-Believe
(Note: incidents are not hypothesis-driven; they are diagnostic. The "hypothesis" here is the active investigation hypothesis at each step of the triage tree.)
| Step | Active hypothesis | Test |
|---|---|---|
| 1 | "A change deployed in last 24h caused this" | Check git log + deploy timeline + model registry |
| 2 | "Upstream data corruption" | Check data-platform alarms + row counts + schema |
| 3 | "Feature distribution drifted" | Check feature distribution monitors |
| 4 | "Eval set stale, false alarm" | Compare eval-set age to catalog turnover |
| 5 | "UI change altered click distribution" | Check frontend deploy timeline |
| 6 | "External traffic-distribution shift" | Check marketing campaign / external event timeline |
| 7 | "Unknown — escalate" | Page staff/principal; assemble incident response team |
The triage decision tree (full version)
graph TB
A[3am ALARM<br/>NDCG@10 0.78 → 0.61<br/>traffic / latency normal]
A --> Q1{Did anything<br/>change last 24h?}
Q1 -->|Code deploy| RB1[Roll back deploy<br/>verify recovery]:::action
Q1 -->|Model registry change| RB2[Revert model version<br/>verify recovery]:::action
Q1 -->|Config change| RB3[Revert config<br/>verify recovery]:::action
Q1 -->|Prompt template change| RB4[Revert prompt version]:::action
Q1 -->|Nothing changed| Q2
Q2{Upstream data healthy?}
Q2 -->|Row counts off| UP1[Page data-platform team<br/>quarantine bad data]:::action
Q2 -->|Schema drift| UP2[Quarantine + reprocess]:::action
Q2 -->|Healthy| Q3
Q3{Feature distribution<br/>drifted?}
Q3 -->|Yes| FD[Freeze model decisions<br/>investigate drift root]:::action
Q3 -->|No| Q4
Q4{Eval set stale?<br/>compare age vs turnover}
Q4 -->|Eval is stale| EVAL[Refresh eval; metric may be<br/>false alarm — but verify with<br/>online behavioral signal]:::warn
Q4 -->|Eval fresh| Q5
Q5{UI change?<br/>click distribution shift?}
Q5 -->|Frontend deploy| UI[Coordinate with frontend;<br/>UI may have altered<br/>click distribution]:::warn
Q5 -->|No UI change| Q6
Q6{External traffic<br/>distribution shift?}
Q6 -->|Marketing campaign| EXT[Wait it out / stratify metric<br/>if persistent, scope re-train]:::warn
Q6 -->|External event<br/>anime tie-in release| EXT2[Same as above]:::warn
Q6 -->|Neither| ESC[Escalate to staff/principal<br/>schedule incident review]:::escalate
classDef action fill:#2d8,stroke:#333
classDef warn fill:#fd2,stroke:#333
classDef escalate fill:#f66,stroke:#333
The named root-cause categories — quick reference
| # | Category | Detection signal | Typical fix | MTTR target |
|---|---|---|---|---|
| 1 | Code deploy regression | git log + deploy timeline overlay | Roll back deploy | 15 min |
| 2 | Model version regression | model registry diff | Revert model version | 30 min |
| 3 | Config change regression | config-store diff | Revert config | 20 min |
| 4 | Upstream data corruption | data-platform alarms; row counts; schema check | Quarantine + reprocess | 2-4h |
| 5 | Feature drift | feature distribution monitors KL > threshold | Investigate drift; retrain if structural | 1-3 days |
| 6 | Label drift | label distribution + ground-truth scenarios | Re-labeling cycle | days-weeks |
| 7 | UI change | frontend deploy timeline; click-distribution diff | Coordinate frontend; possibly retrain | 1-7 days |
| 8 | Prompt-template regression | prompt-template diff | Revert prompt version | 30 min |
| 9 | Eval-set staleness (false alarm) | eval-set age vs catalog turnover | Refresh eval set; re-measure | 1 day |
| 10 | Query-distribution shift | external traffic-source diff | Wait, or stratify metric | hours-days |
The 15 / 60 / 4-hour escalation playbook
| Window | What | Who acts | Who's notified |
|---|---|---|---|
| 0–15 min | Localize: which named category? | On-call AML Eng | (silent) |
| 15–60 min | If category 1, 2, 3, 8: roll back. If category 4: page data team. If 5, 6, 7: freeze + investigate | On-call + 1 platform partner | EM (status update) |
| 60 min – 4h | If unresolved: incident commander mode; assemble team | On-call + EM + ML platform owner | Director (if cust-impact > X) |
| 4+h | Full incident review; customer comms | EM owns coordination | Director / VP, public if needed |
Architecture / Wiring — the 3am dashboard
graph TB
subgraph Dashboard[3am Dashboard — pre-built]
M1[Primary metric<br/>last 7d / 24h / 60min]
M2[Recent change log<br/>code deploys / model versions /<br/>config / prompt versions overlaid]
M3[Upstream data health<br/>row counts / schema / last-seen]
M4[Feature distribution drift<br/>top-10 features / KL rolling]
M5[Cohort breakdown<br/>locale × device × tenure]
M6[Latency + error rates<br/>per-stage]
end
M1 --> DECIDE[Triage decision tree]:::critical
M2 --> DECIDE
M3 --> DECIDE
M4 --> DECIDE
M5 --> DECIDE
M6 --> DECIDE
DECIDE --> ACTION[Localize root cause<br/>15 min target]
classDef critical fill:#f66,stroke:#333
Rollout Plan (for incident readiness, not for an incident itself)
| Phase | Action | Timeline |
|---|---|---|
| Pre-incident | Build the 3am dashboard; populate change-log overlay | one-time investment |
| Pre-incident | Document named-root-cause categories + decision tree | one-time |
| Pre-incident | Run quarterly fire drills using past incidents | quarterly |
| During incident | Apply decision tree mechanically; document each step in incident channel | per-incident |
| Post-incident | Run blameless retrospective; identify dashboard / playbook gaps | within 5 days |
| Post-incident | Update playbook + portfolio with revealed gaps | next quarter |
Metrics Dashboard (for incident readiness)
| Metric | Target | Alert |
|---|---|---|
| MTTR (mean time to recover) | ≤ 30 min for categories 1, 2, 3, 8; ≤ 4h overall | quarterly review |
| MTTD (mean time to detect) | ≤ 15 min from regression-onset to alarm-fire | > 30 min (improve monitoring) |
| Categories-localized-in-15-min rate | ≥ 70% of incidents | < 50% (improve playbook) |
| False-alarm rate | < 10% of pages | > 20% (improve detection) |
| Repeat-incident rate | < 5% (same root cause within quarter) | > 10% (post-incident actions failing) |
Real-Incident Vignette — POC-Production Catastrophe #2 (RAG Recall Collapse)
From ../POC-to-Production-War-Story/02-seven-production-catastrophes.md: the team's RAG retrieval suffered Recall@10 collapse from 0.91 to 0.62 over a 6-hour window. No code deploy. No model change. Standard triage walked through:
- Did anything change last 24h? No deploy, no model change, no config change. (Categories 1, 2, 3, 8 ruled out in 15 minutes.)
- Upstream data healthy? Row counts on catalog feed normal; schema fingerprint matches. (Category 4 ruled out in 8 more minutes.)
- Feature drift? OpenSearch document distribution has shifted on one shard. Suspicion: index issue.
- Investigate the shard. The HNSW index on shard-3 has become corrupted; query latency is normal, but recall is poor because the graph is malformed.
Root cause: a recent OpenSearch upgrade applied silently overnight had triggered a partial re-indexing on shard-3 that hadn't completed. Force re-index of shard-3, monitor recall recovery, schedule post-incident review.
The Applied ML Engineer's lesson fed back into Primitive 1 (problem-framing for next quarter): "we need a per-shard recall monitor as a leading indicator; the current dashboard aggregates across shards and missed this for 6 hours." The shard-level dashboard was added to the next quarter's portfolio (AML-02).
Cross-Story Dependencies
- Consumes: AML-07 (production integration) telemetry — per-stage, per-cohort latency + fallback engagement.
- Consumes: AML-04 (online/offline correlation) signal — if correlation has decayed, that's a slow-burn incident, not a 3am page.
- Consumes: signals from US-MLE-XX drift hub (referenced via ../ML-Engineer-User-Stories/deep-dives/02-cross-story-platform-deep-dive.md).
- Produces: post-incident lessons feed back to AML-01 (problem-framing) and AML-02 (portfolio) — gaps revealed in incidents become next quarter's investment.
- Cross-references: ../Ground-Truth-Evolution/ML-Scenarios (label drift, sentiment domain shift) for non-3am slow-burn scenarios.
Master's-DS Depth Callout — CUSUM, Change-Point Detection, Anomaly Detection
The default "alarm when the metric drops 10% from the 24h baseline" is fragile: noisy during high-variance periods, slow to detect gradual degradations. CUSUM (cumulative-sum) change-point detectors accumulate signed deviations from a baseline; they fire on small persistent regressions while ignoring large transient noise.
# cusum_metric_monitor.py
import numpy as np

def cusum(metric_stream, target=None, k=0.5, h=4.0, warmup=14):
    """Returns timestep of detected change-point, or None.
    k = slack (units of std dev) to tolerate small drift
    h = decision threshold (units of std dev)
    warmup = leading samples used to estimate the baseline
    """
    if target is None:
        target = np.mean(metric_stream[:warmup])
    sigma = np.std(metric_stream[:warmup]) or 1e-9  # guard a flat warmup window
    s_pos, s_neg = 0.0, 0.0
    for t, x in enumerate(metric_stream[warmup:], start=warmup):
        s_pos = max(0.0, s_pos + (x - target - k * sigma) / sigma)
        s_neg = max(0.0, s_neg + (target - x - k * sigma) / sigma)
        if s_pos > h or s_neg > h:
            return t
    return None
For AML-08 monitoring: run CUSUM on the primary online metric stratified by cohort. Alarm fires when a cohort's CUSUM exceeds threshold, even if aggregate metric looks fine. This catches the cohort regression before the aggregate metric reflects it.
Amazon Product Lens Callout — Ownership + Dive Deep + Bias for Action
Ownership is the 3am LP. Nobody else can do this. The Applied ML Engineer who diffuses ownership ("the platform team should know") fails ownership. The role's pager goes off; the role works the incident.
Dive Deep is the discipline of named-root-cause categorization. Random root-causing is the opposite of Dive Deep. The named categories are the model. They were built (over many incidents) by Applied ML Engineers who dove deep enough to recognize "the model got worse" has 10 distinct underlying root causes.
Bias for Action is the rollback decision under uncertainty. When category 1, 2, 3, or 8 is plausible AND a rollback would resolve, roll back first, investigate after. Customer impact is reduced; investigation is preserved. The trap is "but what if we don't have the right cause?" — at 3am, the cost of an unnecessary rollback (some velocity loss) is much smaller than the cost of an extended customer-impact incident.
Six-pager retrospective: "Incident detected at T+0 via cohort-stratified CUSUM alarm; localized to category 5 (feature drift on shard-3) at T+18m via dashboard; mitigated at T+45m by force-reindexing shard-3; full root cause (silent OpenSearch upgrade triggered partial reindex) identified at T+3h. MTTR: 45m. Avoided full rollback by mechanical use of named-decision-tree triage. Investments that paid: 3am dashboard built in Q1, named-categories playbook published in Q2, cohort-stratified CUSUM monitors deployed in Q3."
Cross-Story Composition Notes
The eight scenarios chain in real workflows. Three example compositions:
Composition 1 — Launching the US-MLE-02 reranker change end-to-end
| Step | Scenario | Artefact produced |
|---|---|---|
| 1 | AML-01 | Customer letter; "this IS an ML problem (not heuristic)" |
| 2 | AML-02 | RICE-prioritized; reranker is Q3 ship-1 |
| 3 | AML-03 | Pre-registered YAML: MDE 3%, n=18.4K/arm CUPED, OBF 4 looks, 3 guardrails + per-cohort |
| 4 | AML-04 | Correlation tracker shows Pearson 0.68 — above 0.5 threshold; offline evidence trustworthy |
| 5 | AML-05 | Five-guardrail check; if CSAT breaches, mechanical veto |
| 6 | AML-06 | Cohort-stratified per-cohort guardrails clear or veto |
| 7 | AML-07 | 800ms turn-budget validated via load test; fallback chain in place |
| 8 | AML-08 | 3am dashboard with cohort-stratified CUSUM monitors live before launch |
Composition 2 — The AML-08 incident workflow that feeds back to AML-01
| Step | Scenario | Action |
|---|---|---|
| 0 | AML-08 | Page fires; on-call AML Eng begins triage |
| 1 | AML-08 | Walk decision tree; localize to category 5 (feature drift) at T+18m |
| 2 | AML-08 | Mitigate: quarantine drifted feature stream; metric recovers at T+45m |
| 3 | AML-06 | Verify recovery uniform across cohorts (no JP-only or new-user-only residual regression) |
| 4 | AML-04 | Check offline-online correlation post-incident; if decayed, schedule harness rebuild |
| 5 | AML-01 | Incident review: was the underlying problem misframed? Update framing artefacts |
| 6 | AML-02 | Add "feature-drift-monitor build" to next quarter's portfolio |
Composition 3 — Quarterly OP1 narrative defense
A senior leader asks: "Why did your team ship 2 launches this quarter instead of 4?" The Applied ML Engineer's six-pager defense traverses:
| Section | Scenario referenced |
|---|---|
| Tenets | AML-01 customer-letter discipline |
| Portfolio | AML-02 picked 3 of 12 with EVOI overlay |
| Rigor | AML-03 sample-size discipline; one experiment vetoed (AML-05) |
| Quality | AML-04 correlation audit; AML-05 guardrail enforcement |
| Fairness | AML-06 cohort-stratified eval |
| Production | AML-07 latency budget; load tests; fallback chains |
| On-call | AML-08 MTTR 45m / 30d retention preserved |
| Velocity narrative | "We shipped 2 of 3 because the third (US-MLE-06 cold-start) revealed a JP-cohort regression we are paying 4 weeks to fix in Q4. The cost of that 4 weeks vs the cost of shipping cohort-asymmetric: documented above." |
The eight scenarios are not isolated; they are a single workflow disguised as eight chapters. An Applied ML Engineer who can show the chain — every quarter, on every launch — is the role's senior signal.