
Foundations and Primitives for Applied ML Engineering

Why This Document Exists

The Applied ML Engineer / Product Engineer for ML role fails differently than the platform ML Engineer role. Platform-side failures are infrastructural: drift detection lags, retraining is brittle, the registry is wrong, online serving p99 spikes. They produce alarms. They show up in dashboards. They have runbooks. The platform ML Engineer fixes them by fixing the system.

Applied-side failures are judgment failures. The team shipped a model that won offline NDCG by 5% and lost retention by 1.2% online. The team ran a 7-day A/B on an effect that needed 21 days to detect, declared significance, and shipped — only to watch the lift wash out 6 weeks later when novelty effects faded. The team promoted on aggregate CTR while the JP cohort regressed by 8%. The team picked the wrong experiment from a portfolio of 12 candidates because nobody computed Expected Value of Information. The team vetoed a guardrail violation politically rather than mechanically. None of these show up as alarms. None have runbooks. They show up six months later as "the chatbot used to feel better" customer complaints and a flat business metric.

The seven primitives in this document are the antibodies. Each one names a class of judgment failure, gives the mental model that prevents it, and provides the concrete artefact (template, formula, decision tree, dashboard) the Applied ML Engineer should reach for. They are deliberately separate from the platform-ML primitives in ../ML-Engineer-User-Stories/deep-dives/00-foundations-and-primitives-for-ml-engineering.md, which cover label engineering, feature engineering, training infra, drift detection, retraining cadence — the platform substrate this folder consumes but does not own.

How to Read This Document

| Reading path | What to do |
| --- | --- |
| Interview prep (Amazon Applied Scientist / Applied ML Engineer) | Read all 7 primitives end-to-end. Drill yourself on the worked examples. Score yourself against the 02-applied-ml-engineer-grill-chains.md red-flag and strong-answer markers — most candidates fail on Primitives 4 (online/offline) and 5 (guardrails). |
| Live launch readiness review | Skim the seven primitives table; jump to the relevant primitive(s) for your launch (e.g., a reranker A/B → Primitives 3, 4, 5, 6); use the per-primitive Master's-DS callout to anticipate the question the most-rigorous reviewer will ask. |
| On-call ML incident | Read Primitive 7 first. Then read the AML-08 grill chain. Then call your platform ML Engineer counterpart. |
| Coaching a junior Applied ML Engineer | Walk through Primitives 1, 2, 5 in order. The "ML when heuristic suffices" anti-pattern alone saves quarters of misallocated effort. |

The Seven Primitives at a Glance

graph TD
    P1[Primitive 1<br/>Customer Obsession<br/>→ ML Problem Framing]
    P2[Primitive 2<br/>Experiment Portfolio<br/>Thinking]
    P3[Primitive 3<br/>Hypothesis & Study Design]
    P4[Primitive 4<br/>Online/Offline Correlation]
    P5[Primitive 5<br/>Business-KPI Guardrails]
    P6[Primitive 6<br/>Cohort Fairness & Stratification]
    P7[Primitive 7<br/>Incident Triage Discipline]

    P1 --> P2 --> P3 --> P4 --> P5 --> P6 --> P7
    P7 -.feeds back into.-> P1

    classDef framing fill:#9cf,stroke:#333,stroke-width:2px
    classDef rigor fill:#fd2,stroke:#333
    classDef ship fill:#2d8,stroke:#333
    classDef ops fill:#f66,stroke:#333

    class P1,P2 framing
    class P3,P4 rigor
    class P5,P6 ship
    class P7 ops

| # | Primitive | Headline concept | When it bites hardest |
| --- | --- | --- | --- |
| 1 | Customer Obsession → ML Problem Framing | Working Backwards from a customer outcome to an ML-or-not decision | Vague "use ML for retention" requests; PM-driven scoping without customer letter |
| 2 | Experiment Portfolio Thinking | Choose 3 of 12 candidates by EV × probability ÷ engineering cost, not by enthusiasm | Quarter planning; OP1 narrative writing; quarterly OKR setting |
| 3 | Hypothesis & Study Design | Pre-registered H0/H1, MDE, sample size, runtime, stop rule | "Run a 7-day A/B and ship" requests; tight quarterly deadlines |
| 4 | Online/Offline Correlation | Offline gain ≠ online gain; diagnose collapse with named root causes | Reranker / recsys / search-ranking changes where offline metric ≠ user behavior |
| 5 | Business-KPI Guardrails | Pre-declared product-KPI vetoes that override model-quality wins | Model wins offline + online quality but loses on retention/CSAT/GMV |
| 6 | Cohort Fairness & Stratification | Per-cohort eval (locale, tenure, device); aggregate hides regressions | Bilingual JP/EN traffic; new-vs-returning splits; mobile-vs-desktop |
| 7 | Incident Triage Discipline | Named root-cause categories + 5-Why ladder vs random root-causing | 3am pages; "the model got worse" without an obvious infra cause |

The primitives chain in a logical product-ML lifecycle: frame the problem (1) → choose what to work on (2) → design the test rigorously (3) → validate offline-online consistency (4) → defend product KPIs (5) → check cohort fairness (6) → operate in production (7). Primitive 7's root-cause findings feed back into Primitive 1 — incidents reveal misframed problems.


Primitive 1: Customer Obsession → ML Problem Framing

Concept

The Applied ML Engineer's first move is not to scope a model. It is to write the customer letter. Working Backwards from a specific customer outcome — not a metric, not a system, not a "we should use embeddings here" — disciplines the rest of the lifecycle. Most ML projects that ship and underperform were misframed at this step: the team scoped to a metric (CTR), the metric moved, the customer outcome didn't.

The discipline is binary in form (build with ML or don't) but probabilistic in practice: write the customer letter, then ask "is ML the right tool?". Sometimes ML is the wrong answer. Sometimes a heuristic, a UX change, a content-team workflow change, or a no-op is the right answer. The Applied ML Engineer who can credibly say "this should not be an ML project" is more valuable than the one who can ship five models per quarter.

The framing flow

graph TD
    A[Customer signal<br/>e.g. retention drop, CSAT complaint,<br/>support ticket cluster] --> B[Working Backwards<br/>customer letter / PR-FAQ]
    B --> C{Is the customer outcome<br/>causally tractable<br/>with ML?}
    C -->|No — UX or content| D[Hand off to Design /<br/>Content team]
    C -->|Yes, but heuristic<br/>suffices| E[Heuristic with monitoring]
    C -->|Yes, ML required| F[ML problem statement<br/>+ data audit + scope]
    F --> G[Hypothesis & portfolio<br/>see Primitive 2]
    E --> H[Revisit in 6 weeks<br/>if heuristic plateaus]
    H --> C

    classDef good fill:#2d8,stroke:#333
    classDef neutral fill:#fd2,stroke:#333
    classDef stop fill:#f66,stroke:#333
    class D,E neutral
    class F good

The PR/FAQ-for-ML-Features template

Every ML feature in MangaAssist starts with a one-page PR/FAQ before any model code is written. The template is short on purpose — the constraint forces clarity.

# pr_faq_template.yaml — ML feature scoping
title: "<Product-facing feature name, customer language>"
launch_date: "<target ISO date; pre-committed, slips force re-scoping>"

press_release:
  customer_persona: "<name, locale, manga-reading context; concrete, not abstract>"
  customer_quote: "<a sentence the customer would actually say after the feature ships>"
  problem: "<the customer's pain in their words, NOT 'CTR is too low'>"
  solution: "<one sentence: what the customer experiences after we ship>"
  business_outcome: "<the customer behavior we predict will change; measured how?>"

faq:
  - q: "Why now?"
    a: "<what's changed in the catalog / traffic / customer behavior that makes this urgent>"
  - q: "Is this an ML problem?"
    a: "<the heuristic-vs-ML break-even reasoning; show the math>"
  - q: "What's the smallest test that would convince us?"
    a: "<the cheapest experiment that disconfirms our hypothesis>"
  - q: "What would make us NOT ship?"
    a: "<the guardrail that vetoes; pre-declared, see Primitive 5>"
  - q: "Who else is affected by this, and not in a good way?"
    a: "<cohort fairness pre-check; see Primitive 6>"

The PR/FAQ is reviewed by the PM, the Engineering Manager, and the customer-insights researcher before model design starts. If any of those three cannot answer "would the customer actually feel this?" yes, the feature does not enter the experiment portfolio. This single gate catches roughly 30–40% of proposed ML projects in their first quarter at organizations that adopt it.

When ML is the wrong answer — the heuristic-vs-ML break-even

The most underrated skill of an Applied ML Engineer is recommending against ML. The break-even framework is mechanical:

| Factor | Heuristic favored when | ML favored when |
| --- | --- | --- |
| Decision boundary | Crisp, expressible in 5–10 rules | Fuzzy, multi-dimensional, depends on 50+ signals |
| Data volume | Sparse — fewer than ~10K labeled examples | Dense — 100K+ examples available, growing |
| Stakes | Low (small CTR change) or very high (regulatory) | Medium (revenue-impacting, reversible) |
| Adversary | Low (no incentive to game) | Medium (gameable but observable) |
| Iteration cadence | Slow (rules change quarterly) | Fast (weekly feedback signal available) |
| Interpretability requirement | High (must explain decisions to user/regulator) | Medium (post-hoc explanation acceptable) |
| Engineering cost | 2–6 engineer-weeks | 6+ engineer-months including infra |

If you find yourself arguing for ML where >3 of these factors favor heuristic, you are probably wrong. The MangaAssist heuristic-vs-ML break-even came up repeatedly during scoping — for instance, the rule-based intent pre-filter that bypasses 65% of SageMaker calls (see ../Cost-Optimization-User-Stories/US-02-intent-classifier-cost-optimization.md) is a heuristic that an ML-first team would have replaced with a model and lost both cost and clarity.
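
The tally can be made mechanical. A minimal sketch, assuming a simple vote over the seven factors in the table above and the >3 cutoff from the previous paragraph; the factor names and the example assessment are illustrative, not a calibrated decision model:

# heuristic_vs_ml_tally.py: illustrative tally over the seven factors, not a calibrated model
FACTORS = [
    "decision_boundary", "data_volume", "stakes", "adversary",
    "iteration_cadence", "interpretability", "engineering_cost",
]

def recommend(favors_heuristic: dict) -> str:
    """favors_heuristic maps each factor to True when the 'heuristic favored' column applies."""
    heuristic_votes = sum(bool(favors_heuristic[f]) for f in FACTORS)
    return "heuristic first" if heuristic_votes > 3 else "ML is defensible on this tally"

# Hypothetical assessment: 3 of 7 factors favor a heuristic
print(recommend({
    "decision_boundary": False, "data_volume": True, "stakes": False, "adversary": True,
    "iteration_cadence": False, "interpretability": False, "engineering_cost": True,
}))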

Worked example: "new-reader retention is dropping"

The PM brings the signal: new manga readers in the JP cohort dropped from 38% → 31% week-2 retention over 4 weeks. The instinct is "use ML to fix it." The Applied ML Engineer's first move is the customer letter:

Yuki, 24, just discovered MangaAssist last week. She tried 'Solo Leveling', 'Spy x Family', and 'Frieren' — all titles she'd heard of. After turn 4 with the chatbot, none of the three felt like 'her thing'. She didn't open the app on day 7.

What would have made her come back? "If the bot had figured out I like slow-burn psychological stories, not action — and shown me 'A Silent Voice' or 'March Comes In Like A Lion' on day 1."

Now the framing is sharp. The hypothesis is not "use ML for retention." The hypothesis is "the recommendation system surfaces popular titles to new users instead of taste-matched titles." The next question is: is that an ML problem?

  • The current recommender (US-MLE-06) uses a two-tower model with cold-start fallback to genre-popularity. The fallback is the suspect.
  • Could a heuristic fix it? Yes — replace the fallback with a 5-question taste-quiz on first session (UX, not ML). Cost: 2 engineer-weeks. Expected lift: maybe 4–6pts retention based on 2018 industry benchmarks.
  • Could an ML fix do better? Yes — use Amazon Personalize HRNN-Coldstart with explicit cold-start cohort embedding. Cost: 8 engineer-weeks. Expected lift: maybe 6–9pts retention.

The recommendation is ship the heuristic first, measure for 6 weeks, then evaluate ML. The expected-value calculation: heuristic captures 60–70% of the lift at 25% of the cost; the ML option's incremental value (over heuristic) is small enough that it should compete in the next quarterly portfolio review against other candidates. Shipping both simultaneously means the ML team optimizes against a moving baseline.
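
Putting rough numbers on that expected-value comparison, using the midpoints of the hedged lift estimates in the bullets above (illustrative arithmetic, not a forecast):

# ev_sketch.py: back-of-envelope lift per engineer-week, midpoints of the estimates above
heuristic_lift_pts, heuristic_weeks = 5.0, 2    # ~4-6pt retention lift, 2 engineer-weeks
ml_lift_pts, ml_weeks = 7.5, 8                  # ~6-9pt retention lift, 8 engineer-weeks

print(heuristic_lift_pts / heuristic_weeks)             # 2.5 retention-pts per engineer-week
print(ml_lift_pts / ml_weeks)                           # ~0.9 retention-pts per engineer-week
print((ml_lift_pts - heuristic_lift_pts) / ml_weeks)    # ~0.3 incremental pts/week if ML follows the heuristic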

This is the Applied ML Engineer's value: making not building ML a defensible recommendation.

Master's-DS Depth Callout — Causal vs Correlational Framing

The trap in Customer Pain framing is target leakage: framing the problem in terms of an outcome that is partly caused by the proposed intervention. "Predict which users will retain" leaks the intervention you're proposing into the target. The correct framing is causal: "what intervention, applied to which user-segment, changes retention by how much?"

A statistically rigorous PR/FAQ specifies the intervention, the target population, the counterfactual, and the expected effect size with confidence interval. This is essentially the structural causal model written in plain language. When the PR/FAQ skips this — when it says "we'll use ML to improve retention" without specifying intervention/population/counterfactual — the team will run an experiment that confirms what they already believed, which is HARKing in slow motion.

The mechanical test: can you write the experiment's hypothesis as Pr(retention | do(intervention=A), cohort=C) − Pr(retention | do(intervention=B), cohort=C) ≥ MDE? If yes, framing is rigorous. If you have to hand-wave the do() operator, the framing is correlational and the experiment design will be confused.

Amazon Product Lens Callout — Customer Obsession + Invent and Simplify

Two LPs collide here productively. Customer Obsession says start with the customer outcome — the Yuki letter, in her words, with her preferences, on her day-7. Invent and Simplify says the simplest intervention that moves the customer outcome wins, regardless of how exciting the model is. Most Applied ML Engineers fail Invent and Simplify because the role is incentivised to ship models, not solve problems.

In a six-pager, this primitive shows up as the Tenets and Risks section: tenets state "we ship the simplest intervention that meets the customer bar" and "we measure intervention impact on customer behavior, not on proxy metrics"; risks state "the team's instinct will be to over-scope to ML; this tenet is the brake." Writing those tenets explicitly is how you signal to leadership that you have backbone in the room.


Primitive 2: Experiment Portfolio Thinking

Concept

You have 12 candidate ML experiments and capacity for 3 next quarter. The choice is not which one is best in isolation. The choice is which 3 maximize portfolio Expected Value (EV) under engineering capacity, statistical reachability, and opportunity cost. The Applied ML Engineer who optimizes per-experiment instead of per-portfolio reliably picks the wrong 3.

The mental model is RICE adapted for ML — Reach, Impact, Confidence, Effort — with one critical addition: Detectability. An experiment with 4% MDE on 5K daily eligible sessions cannot be powered in a single quarter; it has zero EV regardless of theoretical Impact, because you'll never know if it worked.

The portfolio matrix

graph TD
    subgraph High[High Upside]
        H1[High Upside / Low Risk<br/>SHIP — these are the obvious wins]:::ship
        H2[High Upside / High Risk<br/>1-2 of these — they are the swing<br/>experiments that justify Big Bets]:::risk
    end
    subgraph Low[Low Upside]
        L1[Low Upside / Low Risk<br/>Maintenance / hygiene<br/>cap at 1 per quarter]:::low
        L2[Low Upside / High Risk<br/>NEVER ship — opportunity cost vampires]:::stop
    end

    classDef ship fill:#2d8,stroke:#333,stroke-width:2px
    classDef risk fill:#fd2,stroke:#333
    classDef low fill:#9cf,stroke:#333
    classDef stop fill:#f66,stroke:#333,stroke-width:2px

A 3-experiment quarter typically allocates: 2 high-upside-low-risk wins + 1 high-upside-high-risk swing. Adding a low-upside-low-risk hygiene project is acceptable only when the team has slack capacity. Spending a slot on low-upside-high-risk is the most expensive class of mistake an Applied ML Engineer makes — it consumes velocity, fails noisily, and demoralizes the team.

RICE-for-ML scoring formula

# rice_for_ml.py — portfolio scoring helper
from dataclasses import dataclass

@dataclass
class MLExperiment:
    name: str
    reach: int                  # eligible users/sessions per quarter
    impact_per_user: float      # expected delta in primary metric, in metric units
    prior_evidence_quality: float  # 0.1 (gut feel) to 0.9 (replicated lit + offline)
    detectability_prob: float   # P(detectable at MDE given reachable population)
    effort_engineer_weeks: float

    def rice(self) -> float:
        confidence = self.prior_evidence_quality * self.detectability_prob
        return (self.reach * self.impact_per_user * confidence) / self.effort_engineer_weeks

# Example portfolio scoring for a hypothetical Q3 backlog
candidates = [
    MLExperiment("US-MLE-02 reranker MiniLM-L6 → MiniLM-L12", 8e6, 0.005, 0.7, 0.85, 12),
    MLExperiment("US-MLE-06 recsys cold-start HRNN swap", 2.5e6, 0.012, 0.5, 0.55, 18),
    MLExperiment("Multilingual intent fine-tune (US-MLE-01 v2)", 1.2e6, 0.008, 0.6, 0.72, 8),
    MLExperiment("Embedding adapter LoRA r=16 (US-MLE-05)", 8e6, 0.003, 0.45, 0.5, 14),
    # ... 8 more candidates ...
]
ranked = sorted(candidates, key=lambda e: e.rice(), reverse=True)
for e in ranked[:3]:
    print(f"PICK: {e.name}  RICE={e.rice():.0f}")

The output is a ranking, not a verdict. A portfolio of the top 3 by RICE is the starting recommendation; the Applied ML Engineer then layers in qualitative checks: cross-experiment dependencies (does swap A invalidate test B's baseline?), team-skill matching, learning value (Expected Value of Information — see below), and political constraints (is one of these already over-promised in OP1?).

Expected Value of Information (EVOI) — the experiment-as-learning case

Sometimes the best experiment to run is the one whose primary value is not winning but learning. EVOI is the value of resolving uncertainty: an experiment that has only 30% chance of winning but, win or lose, changes the team's strategy for the next 4 quarters, can be worth more than a 70%-win-probability experiment whose result confirms what you already know.

The decision rule: prefer EVOI experiments early in a research direction, prefer high-RICE experiments late. A new modelling direction (e.g., "should we replace the cross-encoder reranker with a dense bi-encoder rerank?") deserves an EVOI-framed experiment first; once the team knows the direction works, subsequent experiments optimize within the direction.
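
A hedged way to put a number on EVOI: weight each possible experiment outcome by its probability, value the best decision you would make under that outcome, and subtract the value of the default plan. A sketch with hypothetical numbers:

# evoi_sketch.py: Expected Value of Information, illustrative numbers only
default_value = 100.0   # value of sticking with the current plan (e.g., projected annual metric value)

# Possible experiment outcomes: (probability, value of the best decision we'd make given that outcome)
outcomes = [
    (0.30, 180.0),   # the new direction works: strategy pivots, large upside
    (0.70, 100.0),   # it doesn't: we keep the default plan, losing only the experiment's run cost
]

evoi = sum(p * max(v, default_value) for p, v in outcomes) - default_value
print(evoi)   # 24.0: worth running at a 30% win probability whenever run cost < 24

The run cost of the experiment is then compared against the EVOI, not against its win probability.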

Worked example — Q3 portfolio for the chatbot

Twelve candidates from a real-feeling Q3 backlog:

| # | Candidate | RICE rank | Decision | Reason |
| --- | --- | --- | --- | --- |
| 1 | US-MLE-02 reranker MiniLM-L6 → MiniLM-L12 | 1 | Ship | High reach (8M sessions/q), strong prior, well-powered |
| 2 | US-MLE-01 multilingual intent fine-tune v2 | 3 | Ship | Solid lift, low effort, JP cohort badly under-served today |
| 3 | US-MLE-06 recsys cold-start swap (HRNN) | 4 | Ship as EVOI | Lower RICE but resolves "cold-start direction" uncertainty for 4 quarters |
| 4 | US-MLE-05 embedding adapter LoRA r=16 | 6 | Defer to Q4 | Effort high; needs Q3's reranker change baseline |
| 5 | US-MLE-04 demand forecasting promo overlay | 7 | Defer | Inventory team's Q3 priority is different |
| 6 | US-MLE-08 cover-art AI-gen detection | 8 | Defer | Adversarial signal not yet strong enough; revisit Q4 |
| 7 | Cross-encoder distillation to DistilCE | 5 | Reject | Lower RICE than MiniLM-L12 swap; dependent on it; serial not parallel |
| 8 | Sentiment ABSA quarterly retrain (US-MLE-03) | 9 | Maintenance — assign 0.5 SDE | Hygiene; not a portfolio slot |
| 9 | Spam classifier weekly retrain automation | 10 | Maintenance — assign 0.5 SDE | Hygiene; cost of not doing it is high |
| 10 | Recsys diversity injection re-tune | 11 | Reject | Low upside, cohort-fairness work better in Q4 (after AML-06 cohort holdout infra) |
| 11 | Personalize HRNN replacement custom 2-tower | 2 | Reject | High RICE on paper but engineering cost mis-estimated; team rebuilt this last year |
| 12 | LLM judge quality eval framework | 12 | Reject | Tooling, not product — belongs in platform-ML backlog |

The chosen 3 are #1, #2, #3. The reasoning: #1 and #2 are high-RICE high-confidence; #3 is the EVOI swing. Rejected #11 despite RICE rank 2 because the engineering cost was systematically underestimated by 2.5× last time the team built this — a Confidence-modifier the formula doesn't capture but the Applied ML Engineer must.

Master's-DS Depth Callout — Bayesian Decision Theory

RICE is a heuristic. The proper Bayesian framing of portfolio choice is: choose the subset S of experiments maximizing E[Σ value(e) · P(success | e, prior) · 1(detectable | e, sample_size)] subject to engineering-capacity constraints. This is a knapsack problem; for portfolios of <20 candidates, brute force (binary subset enumeration) runs in milliseconds and gives the optimal answer.

The piece RICE misses that proper Bayesian decision theory captures: opportunity cost interactions. If experiments A and B both target the reranker, running both in the same quarter splits the test population, doubling each's required runtime — they're not additive, they're competitive. RICE scoring per-experiment ignores this; portfolio Bayesian scoring (with constraints) catches it.
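
A sketch of the brute-force subset search, reusing the MLExperiment candidates from the RICE helper above. The quarterly capacity limit and the shared-surface conflict rule are illustrative assumptions; the point is that interactions and constraints live in the portfolio objective, not in per-experiment scores:

# portfolio_bruteforce.py: exhaustive subset search over the candidates scored above
from itertools import combinations

CAPACITY_WEEKS = 40   # assumed quarterly engineering capacity
SHARED_SURFACE = {    # assumed to compete for the same reranker test population
    "US-MLE-02 reranker MiniLM-L6 → MiniLM-L12",
    "Embedding adapter LoRA r=16 (US-MLE-05)",
}

def portfolio_value(subset) -> float:
    value = sum(e.rice() for e in subset)
    # crude interaction penalty: competing experiments split the population and double runtimes
    if sum(e.name in SHARED_SURFACE for e in subset) > 1:
        value *= 0.5
    return value

feasible = (
    s for r in range(1, 4) for s in combinations(candidates, r)
    if sum(e.effort_engineer_weeks for e in s) <= CAPACITY_WEEKS
)
best = max(feasible, key=portfolio_value)
print([e.name for e in best])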

A second subtlety: the prior in RICE's Confidence factor should be calibrated against the team's historical hit rate, not the analyst's gut feel. If the team's last 12 ML experiments ran at 35% positive-result rate, that's the prior baseline; an analyst claiming 70% confidence on a new candidate has to defend why this candidate is in the top quartile of historical bets.

Amazon Product Lens Callout — Bias for Action + Frugality + Think Big

The portfolio-thinking primitive is where three LPs negotiate. Bias for Action pushes for shipping more, faster — fewer hand-wringing reviews per experiment. Frugality pushes for doing more with less — picking experiments that share infra, share population, share telemetry. Think Big pushes for the swing-bet experiment that, if it works, changes the platform.

A senior Applied ML Engineer can defend a portfolio in front of leadership in a six-pager by structuring it as: "We picked 3 of 12. Two are Bias for Action — shippable in <12 weeks each, RICE-top-quartile. One is Think Big — the EVOI swing on cold-start direction, which if it works changes our recsys strategy for the next year. Frugality: all three share the same A/B infra and the same cohort holdout; engineering cost is amortised across 3." That paragraph alone is the core of a defensible OP1 narrative for a quarter's ML investment.


Primitive 3: Hypothesis & Study Design

Concept

A pre-registered, well-powered, properly-randomized A/B test is the single most defensible artefact an Applied ML Engineer produces. Most ML A/Bs that ship and underperform suffered at this step: the team ran a 7-day test on an effect that needed 21 days to detect, peeked daily, declared significance the first time α dipped below 0.05, and shipped a noise winner.

The mechanical discipline: write the hypothesis, MDE, sample size, runtime, randomization unit, primary metric, guardrails, and stop rule before the experiment starts. Version-control the document. When the experiment ends, you reference the pre-registered document; you do not re-write it after seeing the results.

The experiment design lifecycle

graph TD
    H[Hypothesis<br/>H0: no effect on primary metric<br/>H1: lift ≥ MDE in direction D]
    H --> M[Pick MDE<br/>Smallest effect we'd want to detect<br/>given product impact and effort]
    M --> S[Sample size n per arm<br/>n = 16σ²/Δ² (rough)<br/>n via statsmodels (precise)]
    S --> R[Randomization unit<br/>user / session / request]
    R --> P[Primary metric + guardrails<br/>Pre-declared, signed off]
    P --> ST[Stop rule<br/>Fixed-horizon vs sequential<br/>α-spending plan if sequential]
    ST --> RUN[Run experiment<br/>NO PEEKING outside plan]
    RUN --> ANA[Analyze<br/>Pre-registered analysis plan only]
    ANA --> GO{Primary +<br/>guardrails clear?}
    GO -->|Yes| SHIP[Ship — see Primitive 5]
    GO -->|No| DOC[Document negative result<br/>feed back to Primitive 1/2]

    classDef gate fill:#fd2,stroke:#333
    classDef ship fill:#2d8,stroke:#333
    classDef stop fill:#f66,stroke:#333

    class GO gate
    class SHIP ship
    class DOC stop

Sample size — the formula and the pitfall

For a two-sample comparison of means (the most common Applied ML A/B):

n per arm ≈ 16 · σ² / Δ²   (rule of thumb, 80% power, two-sided α=0.05)

This is a useful back-of-envelope. The precise calculation uses Welch's t-test power:

# sample_size.py — proper power calc for an A/B
import statsmodels.stats.power as smp

# Example: ranking-quality A/B
# σ (std of NDCG@10 across users) measured from 14-day baseline = 0.18
# MDE: detect 0.02 absolute NDCG lift (about 2.5% relative)
# Power 0.8, alpha 0.05, two-sided

effect_size = 0.02 / 0.18  # standardized effect = 0.111 (small, Cohen's d)
analysis = smp.TTestIndPower()
n_per_arm = analysis.solve_power(
    effect_size=effect_size,
    alpha=0.05,
    power=0.80,
    alternative='two-sided'
)
print(f"n per arm: {int(n_per_arm)}")
# Output: n per arm: 1273
# At 200K eligible sessions/day, n=1273 per arm needs <1 day. Cost is run length, not pop.

The pitfall most teams hit: they compute n for the aggregate metric, then slice by cohort and don't have power to detect cohort-conditional effects. If the smallest cohort of interest is 8% of traffic (e.g., new-JP users), then n_total = n_per_arm × 2 / 0.08 for the cohort claim. A 1273-per-arm design is fine for aggregate; the cohort-stratified claim needs ~2.5K sessions from that cohort across both arms, which means ~32K total eligible sessions overall before that cohort accumulates them. Most teams miss this and ship aggregate-positive launches that regress on cohorts they can't detect — see Primitive 6.
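
The cohort-scaling arithmetic, continuing from the power calculation above:

# cohort_power.py: scale the total traffic requirement to the smallest cohort of interest
n_per_arm = 1273         # from the aggregate power calc above
cohort_fraction = 0.08   # e.g., new-JP users as a share of eligible traffic

n_cohort_total = 2 * n_per_arm                        # sessions needed inside the cohort, both arms
n_overall_total = n_cohort_total / cohort_fraction    # eligible sessions overall to accumulate them
print(n_cohort_total, round(n_overall_total))         # 2546, 31825 (≈ 32K)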

Sequential testing and α-spending — when fixed-horizon won't work

If the experiment runtime is constrained (must ship by quarter-end) and the team will inevitably peek (PMs always peek), use a sequential testing plan with α-spending. Without it, peeking inflates Type-I error from 5% to ~20% in 5 looks. Three viable plans:

| Plan | When to use | α at each look | Pros | Cons |
| --- | --- | --- | --- | --- |
| Pocock | Equal-α-per-look; team peeks weekly | α/k per look (e.g., 0.01 per week for 5 weeks) | Simple, transparent | Conservative; underpowered late |
| O'Brien-Fleming | Increasing α over time; allows early-stop only on huge effects | Tiny α early, full α at end | Powerful at end, conservative early | Late-stage is essentially fixed-horizon |
| Always-Valid (mSPRT, GAV-α) | Continuous monitoring, e.g. internal dashboards | Stream-safe at every step | True peek-immunity | Requires Always-Valid library; less interpretable |

For most chatbot A/Bs, O'Brien-Fleming is the right default — it gives you the ability to stop early on a clear win without sacrificing power. For dashboards visible to the org, Always-Valid is the right answer because the act of someone reading the dashboard is itself a peek that uncontrolled fixed-horizon testing cannot survive.
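
The inflation claim is easy to verify by simulation: run A/A experiments (no true effect), peek at several interim looks, and count how often any look crosses p < 0.05. A sketch; the look count and sample sizes are arbitrary:

# peeking_inflation_sim.py: Type-I error under naive repeated looks (A/A simulation)
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_per_arm, looks = 2000, 5000, 5
false_positives = 0
for _ in range(n_sims):
    a = rng.normal(0, 1, n_per_arm)
    b = rng.normal(0, 1, n_per_arm)   # same distribution: H0 is true by construction
    checkpoints = np.linspace(n_per_arm / looks, n_per_arm, looks, dtype=int)
    # declare a "win" if ANY interim look crosses the nominal 0.05 threshold
    false_positives += any(
        stats.ttest_ind(a[:k], b[:k], equal_var=False).pvalue < 0.05 for k in checkpoints
    )
print(false_positives / n_sims)   # well above the nominal 0.05; more frequent peeking inflates it further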

Randomization unit — the under-discussed choice

| Unit | When right | Variance implication | Spillover risk |
| --- | --- | --- | --- |
| User | Long-term metrics (retention, LTV) | Low (per-user paths are independent) | Low |
| Session | Short-term metrics (CTR, completion) | Medium | Medium (carry-over effects within day) |
| Request | Sub-session decisions (per-turn rerank), low-stakes | High (high autocorrelation across requests in session) | High (SUTVA violations within session) |

Randomizing at the request level inside a chatbot turn is almost always wrong: the user sees both treatments interleaved and the SUTVA assumption (Stable Unit Treatment Value) breaks. A user who sees a great recsys turn at minute 1 and a poor one at minute 2 is not the same as two different users — their behavior on turn 2 is contaminated by turn 1. Default to user-level randomization unless you have a specific reason otherwise.

CUPED — variance reduction with pre-experiment covariate

CUPED (Controlled-experiment Using Pre-Existing Data) reduces metric variance by adjusting for a pre-experiment user-level covariate. The math is straightforward:

Y_adjusted = Y - θ · (X_pre - mean(X_pre))
   where θ = Cov(Y, X_pre) / Var(X_pre)

For chatbot A/Bs where the metric is per-user CTR over 14 days, using the previous 14 days of per-user CTR as X_pre typically reduces metric variance by 30–60%. The same MDE can be detected at 30–60% smaller sample size, which can collapse a 3-week experiment into 12 days. CUPED costs nothing once implemented and should be the default for any Applied ML A/B where users have history.
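
A minimal CUPED sketch over a per-user data frame; the column names are assumptions chosen to match the pre-registration YAML below, and θ is estimated on the pooled data from both arms:

# cuped_adjust.py: variance reduction with a pre-experiment covariate
import pandas as pd

def cuped_adjust(df: pd.DataFrame,
                 y: str = "useful_answer_rate_per_user_14d",
                 x_pre: str = "useful_answer_rate_per_user_pre14d") -> pd.Series:
    # theta is estimated on pooled data from both arms; X_pre is fixed before randomization,
    # so the adjustment cannot bias the treatment effect, only shrink its variance
    theta = df[[y, x_pre]].cov().loc[y, x_pre] / df[x_pre].var()
    return df[y] - theta * (df[x_pre] - df[x_pre].mean())

# variance_reduction = 1 - cuped_adjust(df).var() / df[y].var()   # typically 0.3-0.6 when users have history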

Worked example — the US-MLE-02 reranker change

The hypothesis: replacing the MiniLM-L6 cross-encoder with MiniLM-L12 lifts NDCG@10 by ≥3% (relative) on user-perceived ranking quality, measured by a binary "useful answer" engagement signal.

# experiment_us_mle_02_v3.yaml — pre-registered
experiment_id: us_mle_02_reranker_l6_to_l12_2026q3
hypothesis:
  h0: "MiniLM-L12 reranker has same or worse user-perceived ranking quality vs MiniLM-L6"
  h1: "MiniLM-L12 reranker has ≥3% relative lift in NDCG@10-equivalent engagement signal"

primary_metric:
  name: useful_answer_rate_per_user_14d
  baseline_estimate: 0.183  # measured from 14-day pre-experiment
  baseline_std: 0.142

mde_relative: 0.03  # 3% relative lift
mde_absolute: 0.00549  # 0.183 × 0.03

power: 0.80
alpha: 0.05
test: welch_t_two_sided
cuped:
  enabled: true
  covariate: useful_answer_rate_per_user_pre14d

sample_size_per_arm_post_cuped: 18420  # statsmodels calc, 50% variance reduction
runtime_days: 14   # 200K eligible users/day, 50/50 split

randomization_unit: user_id
eligibility:
  - has_at_least_one_search_in_pre14d
  - locale in [JP, EN, MIXED]
  - is_not_in_other_active_experiment

guardrails:
  - name: p95_chatbot_turn_latency_ms
    threshold: 800
    direction: lower_is_better
  - name: chat_csat_proxy_per_user_14d
    threshold_relative: -0.005  # cannot regress > 0.5% relative
    direction: higher_is_better
  - name: jp_cohort_useful_answer_rate
    threshold_relative: -0.03  # JP cohort cannot regress > 3% — see Primitive 6
    direction: higher_is_better

stop_rule:
  type: obrien_fleming
  looks: [day_4, day_7, day_11, day_14]
  early_stop_alpha_at_look: [0.0001, 0.001, 0.01, 0.05]
  abort_on_guardrail_breach: true

pre_registration:
  document_hash: sha256:7e3a...
  registered_at: 2026-04-15T09:00:00Z
  signed_off_by: [pm, applied_ml_eng, ds_partner, eng_manager]

Master's-DS Depth Callout — Welch's t vs Mann-Whitney; Delta Method for Ratio Metrics

The default scipy.stats.ttest_ind assumes equal variance — almost never true in real chatbot A/Bs. Welch's t-test (equal_var=False) is the correct default; it doesn't assume variance equality and the only cost is slightly reduced power when variances actually are equal (which they aren't). Always Welch.

For non-Gaussian metrics — heavily right-skewed metrics like dollars-per-user, time-to-event — the t-test is poorly calibrated even with Welch. Mann-Whitney U (rank test) handles this; alternatively, log-transform the metric if it's strictly positive. On binary metrics (CTR, conversion), use the two-proportion z-test or chi-squared.

For ratio metrics (e.g., conversion-per-impression where users have variable impression counts), the t-test on per-user ratios is biased. Use the delta method to estimate the variance of the ratio:

Var(X/Y) ≈ (μ_X² / μ_Y²) · ( σ²_X/μ²_X + σ²_Y/μ²_Y - 2·Cov(X,Y)/(μ_X·μ_Y) )

This catches the variance contribution from the denominator that per-user-ratio averaging misses. The bug pattern: team computes per-user CTR, averages, t-tests; misses that variance of the ratio depends on impression-count distribution; over-claims significance. The delta method fixes it.
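
A sketch of the delta-method variance for a ratio-of-means metric per arm (clicks over impressions), following the formula above; inputs are per-user totals:

# delta_method_ratio.py: variance of a per-arm ratio metric (e.g., clicks / impressions)
import numpy as np

def ratio_mean_variance(x: np.ndarray, y: np.ndarray) -> float:
    """Delta-method variance of mean(x)/mean(y); x = per-user clicks, y = per-user impressions."""
    n = len(x)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(ddof=1), y.var(ddof=1)
    cov = np.cov(x, y, ddof=1)[0, 1]
    # Var(mean_x/mean_y) ~ (mx^2/my^2) * (vx/mx^2 + vy/my^2 - 2*cov/(mx*my)) / n
    return (mx**2 / my**2) * (vx / mx**2 + vy / my**2 - 2 * cov / (mx * my)) / n

The A/B z-statistic is then (ratio_T - ratio_C) divided by the square root of the two per-arm variances summed.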

Amazon Product Lens Callout — Insist on the Highest Standards + Dive Deep

The pre-registration discipline is the most concrete instance of Insist on the Highest Standards in the Applied ML Engineer's job. The cost is friction (PMs hate writing the YAML). The benefit is that 6 months later, when leadership asks "did this experiment really work?", you can answer "yes, and here's the document we signed before we started — primary metric cleared, no guardrails breached, no unplanned analysis."

Dive Deep is what catches the variance-reduction opportunity. The team that ships an experiment without considering CUPED is competing with the team that ships more experiments per quarter at the same variance budget. CUPED is not a fancy technique; it is table stakes for an Applied ML Engineer at scale.

A six-pager paragraph defending experiment design: "Pre-registered hypothesis: ≥3% relative NDCG@10 lift via L12 reranker. Powered at 80% via 18.4K-per-arm Welch t with CUPED variance reduction (50%). O'Brien-Fleming sequential plan, 4 looks. Three pre-declared guardrails: p95 latency, CSAT proxy, JP cohort. We will not declare a win on cohort-aggregated lift if any cohort regresses by >3%. The smallest experiment that disconfirms the hypothesis is 14 days at 200K eligible users/day." Every clause defends a specific design decision. Leadership reviews don't ask "did you do power analysis"; they ask "show me the YAML." If you have it, you sail through.


Primitive 4: Online/Offline Correlation

Concept

Offline NDCG@10 says +5%. Online CTR doesn't move. The PM asks "is the model wrong, or is the metric wrong?" The answer is almost always: the model is right and the offline eval is mismeasuring user behavior. The Applied ML Engineer who can diagnose offline-online correlation collapse — naming the root cause and prescribing the fix — is the role's most senior signal.

The mental model: offline metrics are proxies for the online behavior we care about. A proxy that worked yesterday can stop working today, for many reasons. Tracking the correlation between offline gain and online gain — not just the offline metric in isolation — is the discipline that catches the collapse before you ship a noise winner.

The correlation pipeline

graph LR
    O1[Offline NDCG@10<br/>Recall@10<br/>MRR@5] --> CORR[Correlation tracker<br/>rolling 90-day Pearson<br/>between offline-Δ and online-Δ]
    O2[Online CTR<br/>useful-answer rate<br/>session length] --> CORR
    CORR --> ALARM{Correlation<br/>< 0.5?}
    ALARM -->|No, ≥ 0.5| GO[Ship as-usual<br/>offline win is trustworthy]
    ALARM -->|Yes, collapsed| DIAG[Diagnose root cause<br/>5 candidates]
    DIAG --> R1[1. Selection bias<br/>in offline eval set]
    DIAG --> R2[2. Label leakage]
    DIAG --> R3[3. Distribution shift<br/>eval vs prod]
    DIAG --> R4[4. Metric proxy<br/>mismatch]
    DIAG --> R5[5. Optimizer<br/>Goodharting]
    R1 --> FIX[Redesign offline harness<br/>NOT the model]
    R2 --> FIX
    R3 --> FIX
    R4 --> FIX
    R5 --> FIX

    classDef warn fill:#f66,stroke:#333
    classDef good fill:#2d8,stroke:#333

    class ALARM warn
    class GO good

The five named root causes of correlation collapse

Each one is concrete, diagnosable, and has a different fix.

| # | Root cause | Symptom | Diagnostic | Fix |
| --- | --- | --- | --- | --- |
| 1 | Selection bias | Offline eval set drawn from clicked-only impressions; model optimizes for clicks-given-shown; but production sees never-shown items too | Compare distribution of (query, item) pairs in eval set vs prod log; if eval is missing the lower-rank impressions, this is it | Re-sample eval set from full impression log including never-clicked rows; or use IPS counterfactual eval |
| 2 | Label leakage | Eval label is something a near-future signal can predict; model "wins" by exploiting leakage that prod can't see | Time-strict split: train ≤ T-1, eval = T+1; if offline gap shrinks dramatically, you had leakage | Time-hold-out: eval set is entirely after training cut-off; remove any feature derived from data after training cut-off |
| 3 | Distribution shift | Eval set is 6 weeks old; catalog has rotated; user mix has shifted | KL-divergence between eval-set query distribution and last-7-days production query distribution | Refresh eval set weekly; build sliding-window evaluation; alert when KL exceeds threshold |
| 4 | Metric proxy mismatch | NDCG@10 weights position; users only ever look at positions 1-3 on mobile; model optimizes positions 4-10 (cheaper to win there) | User-behavior log: where do users click in production? If 90% click position 1-3, NDCG@10 is mismeasuring | Redesign offline metric: use NDCG@3 or weighted-by-impression-position metric; check that offline-online correlation recovers |
| 5 | Optimizer Goodharting | Model has learned to game the offline metric while moving real behavior in the wrong direction (e.g., higher click-bait scores) | A/B online with a small population; if offline-Δ is +5% and online behavioral Δ is negative, this is it | Add adversarial offline metrics (e.g., "would a human rate this as click-bait?"); treat the gameable metric as a guardrail for the next-gen metric |

The "redesign the eval, not the model" principle

When correlation collapses, the temptation is to re-train the model differently. This is wrong. The model is doing exactly what you trained it to do: maximize the offline metric. The offline metric is no longer a faithful proxy for the customer behavior you care about. The Applied ML Engineer who keeps re-training models against a broken metric is racing in the wrong direction faster.

The discipline: when offline-online correlation drops below 0.5, freeze model changes for 4 weeks and rebuild the offline harness. Cost is real (4 weeks of velocity); benefit is permanent (every subsequent experiment runs on a metric the team trusts). Most teams skip this step because it doesn't ship visible wins. The teams that don't skip it ship faster across multi-quarter horizons.
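
A hedged sketch of the correlation tracker from the pipeline diagram above: one row per shipped experiment, the offline delta it showed pre-launch, and the online delta its A/B measured. The data and column names are hypothetical; the 0.5 threshold is the one stated above:

# offline_online_corr.py: rolling correlation between offline deltas and online deltas
import pandas as pd

# One row per shipped experiment (hypothetical data; in practice the window holds dozens of rows)
experiments = pd.DataFrame({
    "ship_date": pd.to_datetime(["2026-01-10", "2026-02-02", "2026-03-15", "2026-04-20"]),
    "offline_delta": [0.031, 0.048, 0.022, 0.038],   # e.g., relative NDCG lift measured pre-launch
    "online_delta":  [0.018, 0.025, 0.001, 0.004],   # e.g., relative useful-answer-rate lift from the A/B
}).set_index("ship_date").sort_index()

rolling_corr = experiments["offline_delta"].rolling("90D").corr(experiments["online_delta"])
if rolling_corr.iloc[-1] < 0.5:
    print("Correlation collapsed: freeze model changes and rebuild the offline harness")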

Worked example — RAG retrieval offline NDCG vs online CTR

The system: MangaAssist's RAG retrieval pipeline (see ../RAG-MCP-Integration/09-rag-retrieval-pipeline-deep-dive.md). The change: RRF k tuned from 60 to 30, plus cross-encoder reranker top-K from 20→30.

| Metric | Pre | Post | Δ | Sig |
| --- | --- | --- | --- | --- |
| Offline Recall@10 (golden 500) | 0.91 | 0.94 | +3.3% rel | yes (p<0.001) |
| Offline NDCG@5 (golden 500) | 0.78 | 0.81 | +3.8% rel | yes (p<0.001) |
| Online CTR (14-day A/B, 50/50) | 11.21% | 11.27% | +0.5% rel | NS (p=0.31) |
| Online useful-answer-rate | 0.183 | 0.184 | +0.5% rel | NS |

Offline says win. Online says nothing. Diagnostic walk:

  • Root cause #4 (proxy mismatch)? Mobile traffic is 70% of MangaAssist; on mobile the user sees positions 1-3. NDCG@10 weights all top-10 positions; the gain is concentrated at positions 5-7. Likely contributor.
  • Root cause #3 (distribution shift)? Golden eval set was built 8 weeks ago; catalog turned over 14% since (new spring season). Compare KL — confirms moderate shift.
  • Root cause #1 (selection bias)? Golden set is human-curated against last quarter's queries; production query distribution has shifted toward seasonal interests.

The fix is not to re-train. The fix is to:

  1. Rebuild the offline metric as NDCG@3 (matching mobile UX).
  2. Refresh the golden set weekly, sampled from production queries with stratified human labeling.
  3. Add an IPS-corrected counterfactual offline eval as a secondary metric.
  4. Re-run the same model change against the new harness; if NDCG@3 is +0.5% (matching online), correlation is restored and the team trusts future experiments.

Master's-DS Depth Callout — Goodhart's Law and IPS Counterfactual Eval

Goodhart's Law (formalized by Manheim & Garrabrant 2018): when a measure becomes a target, it ceases to be a good measure. Applied ML systems goodhart their offline metric the moment training loss is computed against it. The mitigation isn't to find a "better" single metric (any single metric will eventually be goodharted); it's to maintain a diverse portfolio of offline metrics where no single one is sufficient to win, and each catches a failure mode the others miss. NDCG@3 + Recall@10 + adversarial click-bait detector + LLM-judged factuality is harder to game than any one of those alone.

Inverse Propensity Scoring (IPS) is the principled fix for selection bias in click-logged eval sets. The intuition: items that were rarely shown in production have rare exposure; their per-click value is up-weighted by the inverse of their show-probability. The variance penalty is real (IPS estimators have notorious variance); SNIPS (self-normalized IPS) and Doubly-Robust estimators are the practical workhorses. The Applied ML Engineer doesn't need to derive these; they need to know to ask the data scientist "are we using IPS or a bias-corrected counterfactual estimator on this eval set?" If the answer is no, selection bias is in the metric.
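
A sketch of the plain-IPS and SNIPS estimators for evaluating a new ranking policy against click logs; the propensity arrays are assumed to be logged by the production policy:

# ips_snips_eval.py: counterfactual offline eval of a new ranking policy on logged impressions
import numpy as np

def ips(rewards, logging_propensity, new_policy_propensity):
    """Plain IPS: unbiased when the logging policy has full support, but high variance."""
    w = new_policy_propensity / logging_propensity
    return float(np.mean(w * rewards))

def snips(rewards, logging_propensity, new_policy_propensity):
    """Self-normalized IPS: slightly biased, much lower variance; the practical default."""
    w = new_policy_propensity / logging_propensity
    return float(np.sum(w * rewards) / np.sum(w))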

Amazon Product Lens Callout — Learn and Be Curious + Are Right A Lot

Learn and Be Curious is what makes the Applied ML Engineer ask "why don't they correlate?" instead of accepting the offline win. The teams that goodhart their offline metric and ship for years are the teams that stopped asking. The discipline: every quarter, run a correlation audit on the last 6 months of offline-vs-online experiment outcomes; if the correlation has decayed, that is the next quarter's investment, not another model.

Are Right A Lot is what disciplines the response when correlation collapses. The wrong response is to ship anyway and hope ("we trust the offline metric, our intuition says it'll work"). The right response is to slow down, diagnose, fix the harness. In a six-pager review of "why we didn't ship this quarter despite +5% NDCG", a senior Applied ML Engineer writes: "We declined to ship because offline-online correlation has dropped to 0.41 (90-day rolling), below our 0.5 threshold. Offline gain is no longer a trustworthy predictor of customer behavior. We are rebuilding the offline harness in Q4; expected restoration of correlation by end of November; resumption of model-shipping cadence in December. The cost of waiting is one quarter; the cost of not waiting is shipping noise winners for the next year." That paragraph buys credibility the team will draw on for years.


Primitive 5: Business-KPI Guardrails for Promotion

Concept

A model can win on every quality metric — NDCG, Recall, CTR — and still hurt the business. The Applied ML Engineer is the role responsible for refusing to promote in this case. Guardrail metrics are the institutional veto: pre-declared product KPIs (CSAT, retention, GMV, support-ticket volume, latency) that override model-quality wins. Pre-declaration is the load-bearing word; guardrails negotiated post-hoc are guardrails in name only.

The promotion gate

graph TD
    EXP[Experiment ends] --> PM{Primary metric<br/>cleared MDE,<br/>p-adjusted < α?}
    PM -->|No| STOP1[Document negative<br/>result, do not ship]:::stop
    PM -->|Yes| G1{Guardrail 1<br/>cleared?}
    G1 -->|No| VETO[Guardrail veto<br/>Do NOT ship<br/>investigate root cause]:::stop
    G1 -->|Yes| G2{Guardrail 2<br/>cleared?}
    G2 -->|No| VETO
    G2 -->|Yes| GN{...all guardrails<br/>cleared?}
    GN -->|No| VETO
    GN -->|Yes| C{Cohort-stratified<br/>per-cohort guardrails<br/>cleared?}
    C -->|No| VETO
    C -->|Yes| SHIP[Ship<br/>see Primitive 7 for production prep]:::ship

    classDef stop fill:#f66,stroke:#333
    classDef ship fill:#2d8,stroke:#333

The "win the metric, lose the customer" anti-pattern catalogue

Six named anti-patterns the Applied ML Engineer watches for:

  1. Engagement-bait wins — model optimizes CTR by surfacing controversial / sensational items; CSAT regresses, retention regresses 4 weeks later. Guardrail: 14-day-lagged CSAT and 28-day retention.
  2. Latency-trade wins — model adds 80ms p95 to win 3% NDCG; users abandon at the slower turn; net session length drops. Guardrail: p95 chatbot turn latency.
  3. Diversity collapse wins — model aggressively narrows recommendations to a few popular titles; short-term CTR up, catalog discovery and long-tail GMV down. Guardrail: catalog-coverage entropy and long-tail GMV %.
  4. Cohort-trade wins — aggregate +3% CTR, JP cohort -8% CTR. Aggregate veto-passes; cohort-stratified veto fires. (See Primitive 6.) Guardrail: per-cohort primary metric.
  5. Spam-amplification wins — recsys learns spam-clicks correlate with engagement; surfaces lower-quality content; spam reports up, organic reviews down. Guardrail: spam-report rate, low-quality-flag rate.
  6. Hallucination wins — chatbot generates fluent but factually wrong answers; LLM-judge agreement up (judge has same blind spot); customer-correction rate up. Guardrail: customer-correction rate, citation-accuracy on RAG-grounded answers.

Pre-declaration discipline

Guardrails declared after results are known are negotiated by political weight. The Applied ML Engineer prevents this with three mechanical practices:

  1. Guardrail YAML versioned in the experiment-platform repo before randomization starts. If you can't git log the guardrail's commit hash from before the experiment ran, it's not pre-declared.
  2. Sign-off chain: PM + Applied ML Eng + EM + (for high-stakes launches) Director. All four sign on the YAML hash.
  3. Veto authority is mechanical: when a guardrail breaches threshold with sufficient power, the experiment platform auto-pauses the rollout. No human override without an incident-style review.

Threshold setting — how to choose -1% retention vs -0.5% CSAT

Guardrail thresholds are negotiated by cost of regression, not by gut feel. The framework:

| Guardrail | Threshold logic | Example threshold |
| --- | --- | --- |
| Retention (28d) | Customer-lifetime-value impact; 1pp retention loss = $X annual revenue at scale | -0.5% relative |
| CSAT | Survey-response correlation with NPS; CSAT noise level | -1% relative if base ≥ 4.0/5; -0.5% if base < 4.0 |
| GMV per active user | Revenue impact directly | -1% relative |
| p95 latency | UX research threshold for perceived slowness | +50ms absolute, or +5% relative |
| Support-ticket volume | Operational cost; staff capacity ceiling | +3% relative |
| Spam-report rate | Trust impact; long-tail revenue impact | +5% relative |

The Applied ML Engineer doesn't pick thresholds in a vacuum. They negotiate them with PM (business-side cost), Trust & Safety (spam/abuse cost), Localisation (cohort-fairness cost), and Eng Manager (ops cost). The negotiation is the work; once thresholds are signed, the veto is mechanical.

Worked example — the +6% reranker that regressed CSAT by 1.4%

The experiment: US-MLE-02 cross-encoder reranker change. Results after 14 days:

| Metric | Pre | Post | Δ | Sig | Guardrail | Status |
| --- | --- | --- | --- | --- | --- | --- |
| Primary: useful-answer-rate | 0.183 | 0.194 | +6.0% rel | p<0.001 | (primary) | ✅ cleared |
| CSAT (5-pt) | 4.21 | 4.15 | -1.43% rel | p=0.02 | -1.0% rel | ❌ VETO |
| p95 turn latency (ms) | 720 | 740 | +20ms | n/a | +50ms | ✅ cleared |
| Retention 28d | 51.2% | 50.9% | -0.6% rel | p=0.18 | -0.5% rel | ⚠️ near-veto |
| JP cohort useful-answer | 0.171 | 0.174 | +1.8% rel | p=0.04 | -3.0% | ✅ cleared |

The veto fires on CSAT. The PM argues "CSAT is a noisy survey metric; the primary moved by 6%; let's ship and revisit in 4 weeks." This is the test of the role.

The Applied ML Engineer's response: "CSAT is the user telling us, in their words, that the new ranking feels worse. The primary metric is a behavioral proxy; CSAT is a stated preference. They disagree. When they disagree at p=0.02, we have to take the user's word. We pre-registered CSAT as a guardrail with a -1% threshold; the threshold was breached at -1.43%. The mechanical action is veto. The right next move is to investigate why CSAT regressed despite the engagement lift — likely candidates: ranking is more aggressive on click-bait, or the new ordering surfaces titles users find emotionally heavier than expected. Both are diagnosable in a follow-up offline analysis. Shipping a model that the user is telling us they don't like, on the bet that we're wrong about CSAT, is not a bet I'd make."

That paragraph is the artefact. It demonstrates Customer Obsession (the user said they don't like it), Have Backbone Disagree and Commit (vetoed against PM pressure), and Earn Trust (mechanical veto on a pre-registered threshold, no political maneuvering).

Master's-DS Depth Callout — Multiple Testing Across Guardrails (Family-Wise Error Rate)

If you have 5 guardrails, each tested at α=0.05, the family-wise probability of at least one false-veto is 1 - (1-0.05)^5 ≈ 0.226. Almost a 1-in-4 chance of a false guardrail breach in a clean experiment. This causes two pathologies:

  • False vetoes: legit launches blocked because one of 5 guardrails crossed by chance.
  • Re-running: team re-runs the experiment hoping the false breach doesn't recur.

The right adjustment: Bonferroni-correct the guardrail tests. With 5 guardrails, test each at α=0.01 instead of 0.05. The veto fires only when a guardrail breaches with adjusted significance. False-veto rate drops to ~5%; legit-veto detection drops slightly (still well above the false-veto rate when the true regression is meaningful).

For high-stakes launches (multiple guardrails, large rollout), use intersection-union testing (IUT): the launch passes only if ALL guardrails individually pass at the unadjusted α. This is conservative — but symmetric: it inflates false-no-launch rate, which is a recoverable cost (re-run experiment), while shipping a guardrail breach is unrecoverable (customer harm). The Applied ML Engineer's calibration: false-veto cost ≪ false-ship cost when guardrails matter; default to conservative.
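
The arithmetic and the mechanical veto check, as a sketch; the guardrail names and p-values are hypothetical:

# guardrail_fwer.py: family-wise error across guardrails and a Bonferroni-adjusted veto
guardrail_pvalues = {   # p-values for "this guardrail regressed"; numbers are hypothetical
    "csat": 0.004, "p95_latency": 0.60, "retention_28d": 0.18,
    "spam_report_rate": 0.45, "jp_cohort_primary": 0.70,
}
alpha, k = 0.05, len(guardrail_pvalues)

print(1 - (1 - alpha) ** k)   # ~0.226: unadjusted family-wise false-veto risk with 5 guardrails
vetoes = [g for g, p in guardrail_pvalues.items() if p < alpha / k]   # Bonferroni: test each at alpha/k
print(vetoes or "no guardrail breach at the adjusted threshold")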

Amazon Product Lens Callout — Customer Obsession + Have Backbone, Disagree and Commit + Earn Trust

The guardrail-veto is the Applied ML Engineer's most direct exercise of Have Backbone, Disagree and Commit. The PM wants to ship. The team wants to ship. Leadership wants the OP1 win. The veto says no, mechanically, on a pre-registered threshold. The role is unambiguous: when CSAT is breached, the launch doesn't ship, regardless of who's pushing.

Earn Trust is the medium-term consequence. Teams that veto consistently on pre-registered thresholds earn a reputation for shipping launches that actually work in production. The reverse is also true: teams that override their own thresholds politically lose credibility with leadership over 3-6 quarters. The mechanical guardrail is a long-game investment in the team's standing.

In a six-pager: "Our launch criteria are pre-declared and mechanical. Of the last 12 reranker / recsys experiments, we declined to ship 3 (25%). Of the 9 we shipped, 8 maintained their lift at 28-day post-launch (89% retention rate). The 3 we declined to ship would, by retrospective analysis, have regressed retention by 0.4-0.8% each — a $X-million-per-year cost we avoided. The discipline cost: 3 more quarters of slower velocity. The discipline yield: $X million per year in avoided regression. NPV of the guardrail discipline is positive." That paragraph is what makes leadership trust the team.


Primitive 6: Cohort Fairness & Stratification

Concept

Aggregate metrics hide cohort regressions. A launch that wins +3% on aggregate CTR can lose -8% on the JP cohort, +6% on EN, +4% on mixed. The aggregate metric is the average across cohorts weighted by population — and a population-weighted average can mask any cohort-level harm if the harmed cohort is small enough.

The Applied ML Engineer's discipline: every primary metric is reported aggregate AND stratified by predeclared cohort dimensions. Cohort dimensions for MangaAssist are minimum: locale (EN / JP / mixed), tenure (new < 30 days / returning ≥ 30 days), device (mobile / desktop / app). Additional dimensions for high-stakes launches: age band, content-preference cluster.

The cohort stratification pipeline

graph TB
    AGG[Aggregate eval<br/>e.g., +3% CTR]:::neutral
    AGG --> SPLIT[Stratify by predeclared dims]
    SPLIT --> L[Locale<br/>EN / JP / mixed]
    SPLIT --> T[Tenure<br/>new / returning]
    SPLIT --> D[Device<br/>mobile / desktop / app]

    L --> EN_R[EN: +4.1%]:::good
    L --> JP_R[JP: -8.2%]:::bad
    L --> MX_R[mixed: +2.5%]:::good

    T --> NEW[new: -3.1%]:::bad
    T --> RET[returning: +5.8%]:::good

    D --> MOB[mobile: +2.4%]:::good
    D --> DESK[desktop: +6.2%]:::good

    EN_R --> DECISION{Aggregate +3%<br/>but JP -8%, new -3%<br/>SHIP / ABORT / REDESIGN?}
    JP_R --> DECISION
    NEW --> DECISION

    DECISION -->|Pre-declared<br/>cohort guardrail<br/>fires| ABORT[VETO — fix cohort regression<br/>before promoting]:::bad

    classDef good fill:#2d8,stroke:#333
    classDef bad fill:#f66,stroke:#333
    classDef neutral fill:#fd2,stroke:#333

Stratified holdout design

A stratified A/B is more demanding than an aggregate A/B. The sample-size calculation must satisfy the cohort with the highest variance and lowest population fraction:

| Cohort | % traffic | Per-cohort sample size needed for ±3% MDE | Total sessions for cohort claim |
| --- | --- | --- | --- |
| EN | 55% | 22.5K per arm | ~80K total |
| JP | 30% | 22.5K per arm | ~150K total (cohort fraction smaller) |
| mixed | 15% | 22.5K per arm | ~300K total (cohort fraction smaller still) |
| New users (any locale) | 22% | 22.5K per arm | ~205K total |

If the experiment is sized for aggregate detection only (~45K total), the JP, mixed, and new-user cohort claims are underpowered. The team will see noisy numbers in those cohorts and make false decisions. The fix: size for the smallest cohort of interest at the start, not the aggregate; or accept that you're shipping with an explicit cohort-blindness risk and document it.

Cohort-conditional MDE and the Simpson's paradox trap

Simpson's paradox: the aggregate effect can have the opposite sign from the per-cohort effects, or, as in this toy example, a small aggregate gain can mask a large harm to one cohort:

| Cohort | n | Pre CTR | Post CTR | Δ |
| --- | --- | --- | --- | --- |
| EN | 70K | 12% | 13% | +1pp |
| JP | 30K | 8% | 6% | -2pp |
| Aggregate | 100K | 10.8% | 10.9% | +0.1pp |

Aggregate goes up by 0.1pp, but the two cohorts tell very different stories (EN benefits, JP is harmed). If you only look at the aggregate, you ship a launch that quietly hurts JP users. If JP is a strategic market (which it is for MangaAssist — ap-northeast-1 data residency, JP catalog, JP customer obsession), shipping aggregate-positive but cohort-negative is unacceptable.

The mechanical defense: per-cohort guardrails with thresholds. JP cohort cannot regress >3% relative on the primary metric; if it does, the launch vetoes regardless of aggregate.
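
The mechanical check is a few lines; the cohort deltas below are the ones from the stratification diagram above, and the -3% floor is the pre-declared threshold:

# cohort_guardrail.py: per-cohort veto on the primary metric
COHORT_REGRESSION_FLOOR = -0.03   # pre-declared: no cohort may regress more than 3% relative

cohort_deltas = {"EN": 0.041, "JP": -0.082, "mixed": 0.025}   # relative deltas from the experiment
breaches = {c: d for c, d in cohort_deltas.items() if d < COHORT_REGRESSION_FLOOR}
if breaches:
    print(f"VETO: cohort regression beyond floor: {breaches}")   # here: {'JP': -0.082}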

"Fairness when 'fair' isn't well-defined"

Cohort fairness is operational, not philosophical. The Applied ML Engineer does not need to solve the philosophical fairness question; they need to operationalize a defensible threshold:

  • Minimum-cohort-performance floor: worst-cohort metric ≥ 95% of aggregate metric.
  • No-cohort-large-regression: no predeclared cohort regresses >3% relative.
  • Pareto-efficiency check: aggregate gains do not come from one cohort gaining at the expense of another.

These are not philosophical answers. They are operational thresholds that, when violated, block a launch. Different products will choose different thresholds. The Applied ML Engineer's job is to choose them, defend them, and enforce them — not to solve fairness in the general case.

Worked example — aggregate +3% CTR, JP cohort -8%

The setup: a recsys change shows aggregate +3% CTR (significant), with EN +4.1%, mixed +2.5%, JP -8.2%. JP is 30% of traffic. The pre-declared cohort guardrail is "no cohort regresses >3% relative."

The decision tree:

JP cohort regression: -8.2% (breaches -3% guardrail) → VETO
  ↓
Diagnose: why does JP regress?
  - Catalog mix in training data is 78% EN, 22% JP — JP under-represented
  - Two-tower model item embeddings cluster JP titles less well
  - Cold-start fallback genre-popularity is biased toward EN (English titles dominate global popularity)
  ↓
Two paths forward:
  A) Retrain with JP-stratified loss reweighting (cost: 4 weeks)
  B) Ship EN-only initially with JP carve-out flag (cost: 1 week, accepts JP-blind launch)
  ↓
Decision: A. Why? JP is strategically important; shipping A/B to "EN-only" while excluding JP is a public commitment
  to second-class treatment of JP users. The 4 weeks is paid in velocity now to avoid 4 quarters of erosion.

The decision documented in the launch readiness review: "We are vetoing on JP cohort regression. We will not ship with a JP carve-out. We will retrain with JP-stratified loss reweighting. The cost is 4 weeks of velocity. The benefit is that the launched model serves JP users at parity with EN — consistent with our ap-northeast-1 data residency and Japan-strategic positioning. The launch is on hold; portfolio replanning is reflected in the updated Q3 narrative."

Master's-DS Depth Callout — Conditional vs Marginal Effects, Multilevel Modeling

Aggregate (marginal) effects average over cohort distributions; cohort-conditional effects condition on cohort. The two can disagree (Simpson's paradox). Multilevel / hierarchical models — Y = Xβ + Z·u + ε where u is a per-cohort random effect — give you both simultaneously and partial-pool the cohort estimates (small cohorts borrow from the global mean, regularizing noisy per-cohort signals).

The practical Applied ML Engineer move: fit a multilevel model on the experiment data, report aggregate effect AND per-cohort partial-pooled estimates. If a cohort estimate is far from aggregate but the partial-pooling has shrunk it most of the way back, you can argue with statistical justification that the cohort regression is statistical noise. If the cohort estimate is far from aggregate AND partial-pooling has not shrunk it, the cohort effect is real — and pre-registered guardrails must veto.

# multilevel_cohort_eval.py — using statsmodels MixedLM
import statsmodels.formula.api as smf
import pandas as pd

df = pd.read_parquet('experiment_us_mle_02_v3.parquet')
# columns: user_id, cohort_locale, treatment, outcome (per-user useful-answer-rate)

model = smf.mixedlm(
    "outcome ~ treatment",
    df,
    groups=df["cohort_locale"],
    re_formula="~treatment"  # per-cohort random treatment slopes
).fit()
print(model.summary())
# Look at: random-effect variance for treatment (between-cohort heterogeneity)
# If non-trivial, per-cohort effects differ — investigate cohort fairness.
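
To report the partial-pooled per-cohort estimates the callout asks for, pull them out of the fitted result. fe_params and random_effects are standard MixedLMResults attributes; the name of the random treatment slope inside each cohort's Series can vary across statsmodels versions, so inspect the index if the lookup comes back empty. A sketch:

# Per-cohort partial-pooled treatment effect = fixed treatment effect + that cohort's random slope.
fixed_slope = model.fe_params["treatment"]
for cohort, re_vals in model.random_effects.items():
    # re_vals is a pandas Series of this cohort's random effects; check re_vals.index
    # if the "treatment" key is named differently in your statsmodels version.
    cohort_effect = fixed_slope + re_vals.get("treatment", 0.0)
    print(f"{cohort}: partial-pooled treatment effect {cohort_effect:+.4f}")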

Amazon Product Lens Callout — Earn Trust + Insist on Highest Standards + Success and Scale Bring Broad Responsibility

Earn Trust with cohorts: shipping aggregate-positive while JP regresses erodes JP user trust over many launches. Each launch is small; the cumulative erosion is large. The discipline of cohort fairness is a multi-launch investment in the JP-cohort trust account.

Insist on the Highest Standards is the willingness to delay shipping by 4 weeks to fix a cohort issue. Most teams ship with the cohort regression flagged as a "follow-up". Most "follow-ups" never happen. The Applied ML Engineer who insists on cohort-clean launches is the one whose launches still work in production a year later.

Success and Scale Bring Broad Responsibility — the LP added in 2021, often forgotten. As MangaAssist scales, the JP cohort isn't 30% of traffic; in absolute numbers it's millions of users. A 3% regression is hundreds of thousands of users having a worse experience. The scale is what makes cohort fairness a responsibility, not a nice-to-have. In a six-pager: "At scale, a 3% cohort regression on JP traffic affects [number] users monthly. We do not ship cohort regressions at scale, even when aggregate metrics are positive."


Primitive 7: Incident Triage Discipline

Concept

It's 3am. PagerDuty fires: NDCG@10 dropped from 0.78 to 0.61 in the last hour, while traffic and latency are normal. The Applied ML Engineer is on call. The wrong move is random root-causing — checking dashboards in the order they come to mind, hoping. The right move is the named-root-cause decision tree, applied mechanically, with a 15-minute / 60-minute / 4-hour escalation playbook.

The discipline is uncomfortable: when the metric is dropping, the instinct is to do something. The discipline says: first, localize. Random fixes during incidents make things worse — rolling back the wrong change, restarting the wrong service, refreshing the wrong cache. The named decision tree forces localization before action.

The triage decision tree

graph TD
    A[ALARM<br/>Model metric dropped] --> Q1{Did anything change<br/>in the last 24h?}
    Q1 -->|Code deploy| ROLLBACK[Roll back deploy<br/>verify metric recovery]:::action
    Q1 -->|Model registry change| REVERT[Revert model version<br/>verify recovery]:::action
    Q1 -->|Config change| REVERTCFG[Revert config<br/>verify recovery]:::action
    Q1 -->|Nothing changed our side| Q2{Is upstream data<br/>healthy?}

    Q2 -->|No — data corruption| UPSTR[Page data-platform team<br/>quarantine bad data]:::action
    Q2 -->|Yes — upstream fine| Q3{Is feature distribution<br/>drifted?}

    Q3 -->|Yes — drift detected| FREEZE[Freeze model decisions<br/>investigate drift root cause<br/>see Ground-Truth-Evolution]:::action
    Q3 -->|No drift| Q4{Is the prompt template<br/>or eval set stale?}

    Q4 -->|Stale eval| EVAL[Eval-set staleness<br/>NOT a model regression<br/>see Primitive 4]:::warn
    Q4 -->|Stale prompt| PROMPT[Prompt-template change<br/>roll back prompt version]:::action
    Q4 -->|Both fresh| Q5{Did the UI change<br/>or query distribution shift?}

    Q5 -->|UI change| UI[Coordinate with frontend<br/>UI may have changed<br/>user-click distribution]:::warn
    Q5 -->|Query shift| QUERY[Investigate traffic<br/>Did marketing run a campaign?<br/>External event?]:::warn
    Q5 -->|Neither| ESCALATE[Escalate to staff/principal<br/>this is rare class<br/>schedule incident review]:::warn

    classDef action fill:#2d8,stroke:#333
    classDef warn fill:#fd2,stroke:#333

The named root-cause categories

Every "the model got worse" incident is one of these. Naming them shortens triage from hours to minutes:

# Category Detection signal Typical fix Avg MTTR
1 Code deploy regression git log + deploy timeline Roll back deploy 15 min
2 Model version regression model registry diff Revert model version 30 min
3 Config change regression config-store diff Revert config 20 min
4 Upstream data corruption data-platform alarms; row counts; schema check Quarantine + reprocess 2-4 hours
5 Feature drift (input distribution) feature distribution monitors Investigate; retrain if structural 1-3 days
6 Label drift (target distribution) label distribution + ground-truth evolution Re-labeling cycle, see ../Ground-Truth-Evolution/ days-weeks
7 UI change frontend deploy timeline; click-distribution diff Coordinate with frontend; possibly retrain 1-7 days
8 Prompt-template regression prompt-template diff Revert prompt version 30 min
9 Eval-set staleness (false alarm) golden-set age vs catalog turnover Refresh eval set; re-measure 1 day
10 Query-distribution shift external traffic source diff (marketing campaign, anime release tie-in) Wait it out, or stratify metric hours-days
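
The first question in the tree and the detection signal for categories 1, 2, 3, and 8 ("did anything change in the last 24h?") is mechanical enough to script. A sketch, assuming a unified change-log table; the column names are illustrative:

# recent_changes_triage.py: illustrative; the change-log source and column names are assumptions
import pandas as pd

def changes_before_alarm(change_log, alarm_time, lookback_hours=24):
    """change_log: DataFrame with columns [change_type, version, deployed_at],
    where change_type is one of {code, model, config, prompt}.
    Returns the changes inside the lookback window, newest first; reading this list
    rules categories 1, 2, 3, and 8 in or out in the first 15 minutes."""
    window_start = alarm_time - pd.Timedelta(hours=lookback_hours)
    in_window = change_log[(change_log["deployed_at"] >= window_start) &
                           (change_log["deployed_at"] <= alarm_time)]
    return in_window.sort_values("deployed_at", ascending=False)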

The 15 / 60 / 4-hour escalation playbook

Window What Who acts Who's notified
0–15 min Localize: which named category? On-call Applied ML Eng (silent)
15–60 min If category 1, 2, 3, 8: roll back. If category 4: page data team. If 5, 6, 7: freeze + investigate On-call + 1 platform partner Eng Manager (status update)
60 min – 4 hr If unresolved: incident commander mode; assemble team; consider customer-impact mitigation On-call + Eng Manager + ML platform owner Director (if cust-impact > X)
4+ hr Full incident review; customer comms; multi-team escalation Eng Manager owns coordination Director / VP, public if needed

"What is the diagnostic dashboard you wish you had at 3am?"

The Applied ML Engineer's pre-incident investment is the dashboard. A good 3am dashboard surfaces, in priority order:

  1. Primary metric with last 7 days, last 24 hours, last 60 minutes (3 timescales)
  2. Recent change log — code deploys, model versions, config changes, prompt versions in last 24h, with timestamps overlaid on metric chart
  3. Upstream data health — row counts, schema fingerprints, last-seen timestamps for upstream feeds
  4. Feature distribution drift — KL divergence on top-10 features over rolling windows
  5. Cohort breakdown — primary metric stratified by locale / device / tenure (is this incident hitting one cohort or all?)
  6. Latency and error rates — to rule out infra-layer issues masquerading as model issues

The dashboard is built before the incident. Building it during the incident is too late. The Applied ML Engineer who has invested in this dashboard cuts MTTR by 3-5× compared to the engineer who hasn't.
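
A sketch of dashboard item 4 as a concrete check: KL divergence between the current window of a feature and a trailing reference window. The bin count, window lengths, and alarm threshold are illustrative and should be tuned per feature.

# feature_drift_kl.py: illustrative; bins, windows, and the alarm threshold are assumptions
import numpy as np

def kl_divergence(reference, current, bins=20, eps=1e-9):
    """KL(current || reference) over a shared histogram; larger means more drift."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    q, _ = np.histogram(reference, bins=edges)
    p, _ = np.histogram(current, bins=edges)
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

# Example: page the on-call when the last hour of a top feature drifts past a
# pre-declared threshold relative to the trailing 7 days.
# if kl_divergence(last_7_days_values, last_hour_values) > 0.1: fire_alarm()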

Worked example — POC-Production catastrophe #2 (RAG recall collapse)

From ../POC-to-Production-War-Story/02-seven-production-catastrophes.md: the team's RAG retrieval suffered a sudden Recall@10 collapse from 0.91 to 0.62 over a 6-hour window. No code deploy. No model change. The standard triage tree, walked in order:

  1. Did anything change? No deploy, no model change, no config change. (Categories 1, 2, 3, 8 ruled out in 15 minutes.)
  2. Upstream data healthy? Row counts on the catalog feed are normal; schema fingerprint matches. (Category 4 ruled out.)
  3. Feature drift? OpenSearch document distribution has shifted — but only on one shard. Suspicion: index issue.
  4. Investigate the shard. The HNSW index on shard-3 is corrupted: query latency on that shard is fine, but recall is poor because the graph is malformed.

Root cause: a recent OpenSearch upgrade applied silently overnight had triggered a partial re-indexing on shard-3 that hadn't completed cleanly. The fix: force re-index of shard-3, monitor recall recovery, schedule a post-incident review on monitoring of partial-reindex states.

The Applied ML Engineer's lessons fed back into Primitive 1 (problem-framing for the next quarter): "we need a per-shard recall monitor as a leading indicator; the current dashboard aggregates across shards and missed this for 6 hours."
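
What that per-shard monitor might look like: a sketch that stratifies Recall@10 by the shard holding each query's relevant document. The eval-pair structure and the doc-to-shard mapping are assumptions; in practice the mapping would come from the index's routing metadata.

# per_shard_recall_monitor.py: illustrative sketch of a per-shard Recall@10 leading indicator
from collections import defaultdict

def per_shard_recall_at_10(eval_pairs, doc_to_shard):
    """eval_pairs:   iterable of (relevant_doc_id, retrieved_top10_ids) tuples
    doc_to_shard: dict mapping doc_id -> shard name
    Returns {shard: Recall@10 over queries whose relevant doc lives on that shard}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for relevant_id, retrieved in eval_pairs:
        shard = doc_to_shard[relevant_id]
        totals[shard] += 1
        hits[shard] += int(relevant_id in retrieved[:10])
    return {shard: hits[shard] / totals[shard] for shard in totals}

Alarming on the minimum over shards, rather than on the aggregate, is what would have surfaced the shard-3 corruption well before the 6-hour mark.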

Master's-DS Depth Callout — CUSUM and Change-Point Detection

The default "alarm when metric drops 10% from 24h baseline" is fragile: lots of false alarms during high-variance periods, slow detection during gradual degradations. CUSUM (cumulative sum) and CUSUM-V are change-point detectors that accumulate signed deviations from baseline; they fire on small but persistent regressions while ignoring large but transient noise.

# cusum_metric_monitor.py
import numpy as np

def cusum(metric_stream, target=None, k=0.5, h=4.0):
    """Returns the index of the detected change-point, or None.

    k = slack (units of std dev) to tolerate small drift
    h = decision threshold (units of std dev)
    """
    metric_stream = np.asarray(metric_stream, dtype=float)
    if target is None:
        target = np.mean(metric_stream[:14])  # warm-up window: first 14 points set the baseline
    sigma = max(np.std(metric_stream[:14]), 1e-12)  # guard against a flat warm-up window
    s_pos, s_neg = 0.0, 0.0
    for t, x in enumerate(metric_stream[14:], start=14):
        s_pos = max(0.0, s_pos + (x - target - k * sigma) / sigma)
        s_neg = max(0.0, s_neg + (target - x - k * sigma) / sigma)
        if s_pos > h or s_neg > h:
            return t
    return None

For Applied ML monitoring: run CUSUM on the primary online metric stratified by cohort. The alarm fires when any cohort's CUSUM exceeds its threshold, even if the aggregate metric looks fine; this catches a cohort regression before the aggregate reflects it.
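
A minimal per-cohort sweep, reusing the cusum function above; the parquet path and column names (hour, cohort_locale, metric_value) are illustrative assumptions.

# per_cohort_cusum_sweep.py: illustrative; the path and columns are assumptions
import pandas as pd

hourly = pd.read_parquet('hourly_primary_metric_by_cohort.parquet')
# columns: hour, cohort_locale, metric_value (one row per cohort per hour)

for cohort, grp in hourly.sort_values('hour').groupby('cohort_locale'):
    change_point = cusum(grp['metric_value'].to_numpy())
    if change_point is not None:
        print(f"ALARM: {cohort} CUSUM change-point at {grp['hour'].iloc[change_point]}")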

Amazon Product Lens Callout — Ownership + Dive Deep + Bias for Action

Ownership is the 3am LP. Nobody else can do this. The Applied ML Engineer who diffuses ownership ("the platform team should know") fails it. The role's pager goes off; the role works the incident.

Dive Deep is the discipline of named-root-cause categorization. Random root-causing is the opposite of Dive Deep — it's surface-level pattern-matching with no model of how the system fails. The named categories are the model. They were built (over many incidents) by Applied ML Engineers who dove deep enough to recognize that "the model got worse" has 10 distinct underlying root causes, each with its own diagnostic and fix.

Bias for Action is the rollback decision under uncertainty. When category 1, 2, 3, or 8 is plausible and a rollback would resolve it, roll back first and investigate after. Customer impact is reduced; the investigation is preserved. The trap is "but what if we don't have the right cause?" — at 3am, the cost of an unnecessary rollback (some velocity loss) is much smaller than the cost of an extended customer-impact incident.

In a six-pager retrospective: "Incident detected at T+0; localized to category 5 (feature drift) at T+18m via dashboard; mitigated at T+45m by quarantining drifted feature stream; full root cause (upstream schema partial change) identified at T+3h. MTTR: 45m. Avoided full rollback by mechanical use of named-root-cause decision tree. Investment that paid: 3am dashboard built in Q1, named-categories playbook published in Q2." That paragraph is what makes leadership understand the value of slowing down to write playbooks.


How the Seven Primitives Compose

The seven primitives are not independent. They chain in a logical product-ML lifecycle, and they cross-reference each other in production work.

graph LR
    P1[1. Customer Pain<br/>→ ML Framing] --> P2[2. Portfolio]
    P2 --> P3[3. Hypothesis<br/>& Sample Size]
    P3 --> P4[4. Online/Offline<br/>Correlation]
    P4 --> P5[5. Business-KPI<br/>Guardrails]
    P5 --> P6[6. Cohort<br/>Fairness]
    P6 --> P7[7. Incident<br/>Triage]
    P7 -.lessons feed.-> P1

    classDef pre fill:#9cf,stroke:#333
    classDef during fill:#fd2,stroke:#333
    classDef post fill:#2d8,stroke:#333
    classDef ops fill:#f66,stroke:#333

    class P1,P2 pre
    class P3,P4 during
    class P5,P6 post
    class P7 ops

Composition example — launching the US-MLE-02 reranker change end-to-end

Step Primitive Artefact
Frame the problem 1 PR/FAQ: "Yuki gets click-bait answers; she wants slow-burn taste matches"
Choose the experiment 2 RICE: reranker MiniLM-L12 ranks #1 of 12; chosen as Q3 ship-1
Design the test 3 Pre-registered YAML: MDE 3%, n=18.4K/arm CUPED, O'Brien-Fleming 4 looks, 3 guardrails
Validate offline-online 4 Correlation tracker: offline NDCG vs online useful-answer-rate, 90-day Pearson 0.68 (above 0.5 threshold)
Check guardrails 5 CSAT regression -1.4% breaches -1% threshold → veto fires
Cohort check 6 Aggregate +6%, JP +1.8%, EN +7.2% (cohort guardrails clear, but veto still fires from primary guardrail)
Production prep 7 Per-shard recall dashboard, change log overlay, CUSUM cohort monitors live before launch

Veto fires at step 5; the launch doesn't ship. The team investigates the CSAT regression: the new reranker's higher-NDCG@10 ordering surfaces titles users find emotionally heavier. The fix: re-train with a "tone preference" signal in the cross-encoder feature set. Re-run the experiment. This iteration loop is the Applied ML Engineer's quarterly rhythm.

Composition example — the AML-08 incident workflow

Step Primitive Action
0 7 Page fires; on-call Applied ML Eng begins triage
1 7 Walk decision tree; localize to category 5 (feature drift) at T+18m
2 7 Mitigate: quarantine drifted feature stream; metric recovers at T+45m
3 6 Verify recovery is uniform across cohorts (no JP-only or new-user-only residual regression)
4 4 Check offline-online correlation post-incident; if correlation has decayed, schedule harness rebuild
5 1 Incident review: was the underlying problem misframed in the original PR/FAQ? Update framing artefacts
6 2 Add "feature-drift-monitor build" to next quarter's portfolio if not already present

Each incident teaches the team something. The discipline is feeding the lesson back to Primitive 1 (problem framing) so that the next quarter's portfolio investments include the gaps the incident revealed.


Anti-Pattern Catalogue

The Applied ML Engineer carries these named anti-patterns as a mental checklist. Most of them appear in launch readiness reviews; recognizing them by name shortens the discussion from hours to minutes.

# Anti-pattern What it looks like Why it's wrong Mitigation primitive
1 HARKing (Hypothesizing After Results Known) "Looking at the data, our hypothesis was..." The hypothesis is fitted to the data post hoc; nothing is learned Pre-register, Primitive 3
2 Peeking + Stopping When Significant Daily dashboard checks; ship the first day p<0.05 α-inflation; high false-positive rate Sequential α-spending, Primitive 3
3 Promote on Aggregate, Ignore Cohorts "Aggregate +3%, ship!" Cohort regressions hide; trust erodes Stratified eval, Primitive 6
4 Offline Win Without Online Validation "Offline +5% NDCG, no need for A/B" Offline goodharting; no online validation Online/offline correlation, Primitive 4
5 Veto-by-Politics "CSAT is noisy, let's still ship" Guardrail negotiated post-hoc by power Pre-declared mechanical guardrails, Primitive 5
6 ML When Heuristic Suffices "Use ML for this 5-rule decision" Cost overrun; under-delivers Heuristic-vs-ML break-even, Primitive 1
7 Random Root-Causing At 3am, check 10 dashboards in random order High MTTR; sometimes makes things worse Named-root-cause tree, Primitive 7
8 Pre-Registered, Then Quietly Re-Registered Hypothesis YAML edited mid-experiment Defeats pre-registration; HARKing in disguise Versioned YAML, signed hash, Primitive 3
9 Novelty-Effect Cliff 7-day A/B, ship, lift fades by week 6 Novelty effects, not durable lift Long-runtime A/B + post-launch holdback, Primitive 3
10 Goodharting the Proxy Metric NDCG climbs while user behavior worsens Proxy is gameable; user is not Diverse offline metric portfolio, Primitive 4
11 Cohort Holdout Too Small to Power Aggregate 50K, JP cohort 7K underpowered Cohort claim is noise; false safety Power for smallest cohort, Primitive 6
12 Latency-Win Without Quality Floor "We saved 60ms" — at the cost of NDCG One axis optimized; other regressed Multi-axis pre-declared budget, Primitive 5 + 7

Glossary

Term Definition
MDE (Minimum Detectable Effect) Smallest effect size the experiment is powered to detect at given α and power. Below MDE, you can't tell signal from noise.
Power P(rejecting H0 | H1 is true). Standard target 0.80. Equals 1 − P(false negative).
α (alpha) P(rejecting H0 | H0 is true). Standard target 0.05. Equals false-positive rate.
α-spending Plan for distributing α budget across multiple looks in sequential testing (Pocock equal; O'Brien-Fleming late-skewed).
CUPED Controlled-experiment Using Pre-Experiment Data. Variance-reduction technique using pre-experiment user covariates; can cut required sample size by 30–60%.
IPS (Inverse Propensity Scoring) Counterfactual estimator that re-weights observations by inverse of their selection probability; corrects for selection bias in click-logged eval.
SNIPS Self-Normalized IPS. Practical variance-tamed variant of IPS.
Goodhart's Law "When a measure becomes a target, it ceases to be a good measure." Formalized by Manheim & Garrabrant 2018; the foundational reason offline metrics degrade.
Simpson's Paradox Aggregate effect can have opposite sign from per-cohort effects. Catches teams that don't stratify.
EVOI (Expected Value of Information) Value of resolving uncertainty. The right framing for "experiments-as-learning" portfolios.
RICE Reach × Impact × Confidence ÷ Effort. Heuristic prioritization framework adapted from product management to ML.
PR/FAQ Press-Release / Frequently-Asked-Questions document. Amazon's Working-Backwards artefact written before any product is built.
Six-pager Six-page narrative document used for senior reviews at Amazon. No bullets, no slides. Forces structural argument.
OP1 / OP2 Operating Plan 1 / 2. Amazon's annual / mid-year planning cycles where teams commit to outcomes.
SUTVA Stable Unit Treatment Value Assumption. Causal-inference assumption that a unit's outcome doesn't depend on others' treatments. Violated when randomizing at request-level inside a session.
Welch's t Two-sample t-test that doesn't assume equal variance. Default for chatbot A/Bs; safer than equal-variance t.
CUSUM Cumulative-sum change-point detector. Detects small persistent regressions; ignores large transient noise.
Bonferroni correction Multiple-testing correction: divide α by number of tests. Conservative but simple; the right default for guardrail testing.
HARKing Hypothesizing After Results Known. The cardinal sin of experimental discipline; defeated only by pre-registration.
Pre-registration Writing hypothesis, MDE, sample size, primary metric, guardrails, and stop rule in a versioned, signed document before the experiment starts.

Cross-References to Sibling Documents

Document When to consult
README.md Folder overview, role boundary, story roster
01-deep-dive-per-applied-ml-story.md Per-story deep dives — eight scenarios applying these primitives end-to-end
02-applied-ml-engineer-grill-chains.md Grill chains for self-drilling and interview prep
../ML-Engineer-User-Stories/README.md Sibling platform-ML stories the AML stories consume
../ML-Engineer-User-Stories/deep-dives/00-foundations-and-primitives-for-ml-engineering.md Platform ML primitives (label engineering, training, drift) — distinct from this document
../RAG-MCP-Integration/09-rag-retrieval-pipeline-deep-dive.md Anchor for online/offline correlation primitive
../Ground-Truth-Evolution/ Anchor for incident-triage label-drift category
../POC-to-Production-War-Story/02-seven-production-catastrophes.md Anchor for AML-08 incident vignettes
../Cost-Optimization-User-Stories/ FinOps lens; relevant for portfolio (Primitive 2) cost-side trade-offs
../API-Design-and-Testing/04-offline-testing-quality-strategies.md Offline-eval methodology referenced by Primitive 4

The seven primitives are the Applied ML Engineer's mental kit. The deep-dive doc applies them story-by-story; the grill-chain doc drills judgment under pressure. Together, the three artefacts encode the role.