ML Scenario 01 — Recommendation Label Decay
TL;DR
The personalized-recommendation reranker (a learning-to-rank model that reorders Amazon Personalize candidates into the chatbot's "for you" carousel) is trained quarterly on implicit feedback — clicks, add-to-list, buys, completed chapters. Inside a quarter, what counts as a "good" recommendation drifts: a click on an isekai title in March is a strong positive label; the same click in April, after the user has moved on to mystery, is a weak or wrong label. The training set is a one-pot mix of three months of implicit feedback with no decay weighting, so the model learns a "what users clicked on average" function rather than "what they'd click now." NDCG@10 holds up on the held-out set (which has the same biases) while production CTR drops 3% and add-to-list drops 6% per quarter. The fix shape is decay-weighted training, recency-stratified eval slices, counterfactual replay against random-policy exposures, and a serving-time confidence layer that gates how aggressively the ranker leans on stale labels.
Context & Trigger
- Axis of change: Time (the binding axis). Implicit-feedback labels age out continuously and the training set is a flat sum across the staleness window.
- Subsystem affected: Reranker downstream of RAG-MCP-Integration/02-user-preferences-recommendation-mcp.md. Pairs closely with the GenAI scenario ../GenAI-Scenarios/03-user-preference-concept-drift.md, but this one is about the ranker model itself (training data, training cadence) rather than per-user state.
- Trigger event: Q4 review notices a 9% YoY drop in add-to-list rate from "for you" carousels despite three quarterly retrains. Held-out NDCG@10 has been ~stable for 18 months. Cohort slicing reveals long-tenured users are the ones decaying; new users are stable.
The Old Ground Truth
The original training pipeline:
- Implicit-feedback labels. Click = 0.3, add-to-list = 0.6, buy = 1.0, completed chapter = 0.5. Negatives: shown but not clicked within 30 days.
- Training window: trailing 90 days of interactions.
- Train/test split: time-based (last 7 days held out as test).
- Eval metric: NDCG@10 on the test set, with a CTR sanity check on production canary.
- Retrain cadence: quarterly, kicked off automatically.
- Reasonable assumptions:
- 90 days of recent data approximates "current taste."
- Time-based held-out simulates "tomorrow's behavior."
- NDCG@10 + canary CTR is enough signal to promote.
What this gets wrong: 90 days of flat-weighted data is heavily skewed by tenure (long-tenured users have more events and dominate the loss); the held-out 7 days comes from the same biased distribution as training, so a model that overfits the bias scores well; and the canary CTR window (1–2 weeks) is too short to detect behavior changes that propagate over months.
The New Reality
- Labels age non-uniformly. A click 90 days ago has roughly the same training weight as a click 3 days ago — but their predictive value for next week's behavior is very different. The training treats them as equivalent positives.
- Negatives go stale fastest. "Shown and not clicked" 60 days ago carries almost no information about today's preferences — the user may not even remember being shown the item.
- Selection bias compounds with age. What users were shown 90 days ago was determined by an older version of the same ranker. Training on that history bakes in the older ranker's preferences.
- Tenure imbalance grows. Long-tenured users contribute more events and increasingly dominate the loss — and their tastes drift the most. Over time the model gets better at predicting last quarter's behavior for precisely the users who have moved on.
- The held-out test is biased the same way as the training set. Time-based split doesn't fix bias if the bias is across users, not across time.
- CTR on canary lags actual user satisfaction. A 1–2 week canary captures click behavior but not long-term engagement; the engagement metric that moves over months is unreachable in a canary window.
Why Naive Approaches Fail
- "Retrain monthly instead of quarterly." Helpful but doesn't fix label decay within the training window or the selection-bias compounding.
- "Just use a shorter window (30 days)." Reduces stale-label drag but starves the model on users with sparse activity.
- "Use a bigger model." Bigger models fit biased data more aggressively; this often worsens the problem.
- "Use deep learning." Same critique; the issue is data, not capacity.
- "Add explicit feedback." Sparse — < 2% of impressions get explicit feedback. Useful as a corrective signal, not as primary truth.
- "Trust the held-out NDCG." The held-out distribution shares the bias of training. NDCG can rise while real CTR falls.
- "Use the LLM-as-judge." That's the GenAI playbook; for a ranker, ground truth is engagement, not generative quality.
Detection — How You Notice the Shift
Online signals.
- Cohort-sliced CTR. Per signup-tenure bucket. If long-tenured cohorts decay while new ones are stable, the model is leaning on stale labels.
- Add-to-list and buy rates. These are stronger preference signals than clicks; their decay is the leading indicator.
- Completion rate per recommendation. Did the user actually read past chapter 1? Lower CTR + lower completion = real degradation; lower CTR + same completion = users are just clicking less often, not enjoying less.
- Diversity-injection acceptance rate. If users accept diverse picks more than personalized picks, your "personalized" picks are stale.
Offline signals.
- Counterfactual replay against a random-policy hold-out. A small (5%) random-policy slice in production gives you unbiased exposure data; replay your model against this slice for an honest off-policy evaluation.
- Per-recency-bucket NDCG. Bucket the test set by interaction age (0–7d, 7–30d, 30–90d). NDCG@10 should be similar across buckets if the model is decay-weighted; a model that scores well on stale buckets but drops on the freshest bucket is anchored to stale labels (a sketch follows this list).
- Per-tenure-cohort eval. Evaluate the model separately on users by signup tenure. If long-tenured users' NDCG is dropping, it's the dominant cohort failing.
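A minimal sketch of the per-recency NDCG slice described above, assuming one ranked slate per test row with true relevance, model scores, and the age of the labeling interaction (column names are illustrative):

```python
import numpy as np
from sklearn.metrics import ndcg_score

AGE_BUCKETS = [(0, 7), (7, 30), (30, 90)]  # days

def ndcg_by_recency(test_df, k=10):
    """NDCG@k per interaction-age bucket. A stale-anchored model scores
    well on old buckets and poorly on the freshest one."""
    out = {}
    for lo, hi in AGE_BUCKETS:
        rows = test_df[(test_df.label_age_days >= lo) & (test_df.label_age_days < hi)]
        out[f"{lo}-{hi}d"] = float(np.mean([
            ndcg_score([row.relevance], [row.model_scores], k=k)
            for row in rows.itertuples()
        ]))
    return out
```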
Distribution signals.
- Drift between served impressions and acted-upon items. If the genres of served impressions don't match the genres of acted-upon items, the ranker is misjudging what users want.
- Action-recency histogram. For each user, the age of their most recent positive action. If the median is stable but the variance is widening, something about activity patterns has changed.
Architecture / Implementation Deep Dive
```mermaid
flowchart TB
subgraph Signals["Implicit feedback (raw)"]
CLK["Clicks"]
ADD["Add to list"]
BUY["Buys"]
COMP["Completed chapters"]
EXP["Explicit feedback (sparse)"]
RAND["Random-policy slice (5%)"]
end
subgraph Labels["Label construction (NEW)"]
DECAY["Per-event decay weighting<br/>by event_age and event_type"]
IPS["IPS reweighting<br/>using exposure propensities"]
STRAT["Stratified sample by<br/>(tenure, recency, action_type)"]
end
subgraph Train["Training"]
TRAIN["Reranker training<br/>(LightGBM / GBDT)"]
EVAL_REC["Per-recency NDCG eval"]
EVAL_COH["Per-cohort eval"]
OPE["Off-policy eval<br/>vs random slice"]
end
subgraph Serve["Serving"]
RANK["Reranker"]
CONF["Per-user confidence signal"]
BLEND["Confidence-aware blend with<br/>popular-in-cohort fallback"]
end
CLK --> DECAY
ADD --> DECAY
BUY --> DECAY
COMP --> DECAY
EXP -.->|gold| DECAY
RAND --> IPS
DECAY --> STRAT
IPS --> STRAT
STRAT --> TRAIN
TRAIN --> EVAL_REC
TRAIN --> EVAL_COH
TRAIN --> OPE
EVAL_REC -->|gate| RANK
EVAL_COH -->|gate| RANK
OPE -->|gate| RANK
RANK --> BLEND
CONF --> BLEND
style DECAY fill:#fde68a,stroke:#92400e,color:#111
style IPS fill:#dbeafe,stroke:#1e40af,color:#111
style RAND fill:#fee2e2,stroke:#991b1b,color:#111
style OPE fill:#dcfce7,stroke:#166534,color:#111
```
1. Data layer — decay-weighted, IPS-reweighted, stratified
Three transforms from raw events to training labels:
Decay weighting. Each event's training weight = base_weight(event_type) × decay(event_age, event_type).
| event_type | base_weight | half-life | rationale |
|---|---|---|---|
| click | 0.3 | 14d | weak signal, decays fast |
| add_to_list | 0.6 | 30d | medium signal |
| buy | 1.0 | 60d | strong signal but still ages |
| completed_chapter | 0.5 | 30d | engagement signal |
| explicit_thumbs_up | 1.5 | 365d | gold; durable |
| explicit_thumbs_down | -1.5 | 180d | strong negative; durable |
| shown_not_clicked | -0.1 | 7d | weak negative; ages fast |
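The table translates directly into a weight function. A minimal sketch, assuming exponential decay (the functional form is a choice; any monotone decay parameterized by these half-lives fits the design):

```python
BASE_WEIGHT = {
    "click": 0.3, "add_to_list": 0.6, "buy": 1.0, "completed_chapter": 0.5,
    "explicit_thumbs_up": 1.5, "explicit_thumbs_down": -1.5,
    "shown_not_clicked": -0.1,
}
HALF_LIFE_DAYS = {
    "click": 14, "add_to_list": 30, "buy": 60, "completed_chapter": 30,
    "explicit_thumbs_up": 365, "explicit_thumbs_down": 180,
    "shown_not_clicked": 7,
}

def event_weight(event_type: str, age_days: float) -> float:
    # Weight halves every half-life: a 28-day-old click is worth 0.3 * 0.25.
    return BASE_WEIGHT[event_type] * 0.5 ** (age_days / HALF_LIFE_DAYS[event_type])
```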
IPS reweighting. Every served impression is logged with the policy's per-item probability. When training, each event's loss is multiplied by 1 / propensity(item|policy_at_serve). This corrects for the selection bias that "we showed it because we predicted they'd like it."
Stratified sampling. The training set is sampled to balance:
- Tenure cohorts (signup age buckets) — prevents long-tenured-user dominance.
- Recency buckets (event age) — ensures recent events are well-represented even when overall volume is dominated by older events.
- Action-type — ensures buys and explicit feedback are seen often enough.
```python
def build_training_set():
    raw = load_events(window_days=90)  # one row per (user, item, event)
    # Decay-weighted label: base weight by event type, halved per half-life.
    raw["weight"] = raw.apply(
        lambda r: base_weight(r.type) * decay(r.age, r.type), axis=1
    )
    # IPS correction: divide by the propensity logged at serve time,
    # floored to bound variance (see Production Pitfalls).
    raw["weight"] /= raw["propensity_at_serve"].clip(lower=0.001)
    return stratified_sample(raw, by=["tenure_bucket", "age_bucket", "type"])
```
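The stratified_sample helper above carries the cohort/recency/action balancing. One possible shape, assuming pandas (cell sizes are illustrative — see the minimum-cell-size pitfall below):

```python
import pandas as pd

def stratified_sample(df, by, n_per_cell=2000, min_cell=200):
    # Cap each (tenure, recency, action_type) cell so no stratum dominates;
    # drop cells too thin to estimate from.
    kept = [
        cell.sample(min(len(cell), n_per_cell), random_state=0)
        for _, cell in df.groupby(by, observed=True)
        if len(cell) >= min_cell
    ]
    return pd.concat(kept, ignore_index=True)
```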
2. Pipeline layer — eval slicing and the random-policy hold-out
Three eval lenses applied to every candidate model:
- Per-recency NDCG. Bucketed by event age. A model that has uniformly good NDCG across buckets is decay-aware; one that's good on stale buckets and bad on fresh buckets is what we have today.
- Per-cohort NDCG. Bucketed by user tenure. The dominant cohort's metric is the one that drives production behavior.
- Off-policy evaluation. The 5% random-policy hold-out gives unbiased exposure. Counterfactual replay computes what the candidate ranker would have served vs. what was served, then reweights accordingly; the resulting metric is an unbiased CTR-equivalent (sketched below).
The promotion gate requires all three lenses to pass, with thresholds calibrated against the prior production model. None alone is sufficient.
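A sketch of the replay estimator against the random slice — under a uniform logging policy, counting only the impressions where the candidate model would also have surfaced the shown item yields an unbiased CTR-equivalent (names are illustrative):

```python
def replay_ctr(random_slice_logs, candidate_model, k=10):
    """Rejection-sampling replay: unbiased because the random policy gave
    every candidate-pool item equal exposure probability."""
    matched, clicks = 0, 0
    for log in random_slice_logs:
        top_k = candidate_model.rank(log.user_id, log.candidate_pool)[:k]
        if log.shown_item in top_k:
            matched += 1
            clicks += log.clicked
    return clicks / max(matched, 1)
```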
3. Serving layer — confidence-aware reranker
The serving stack mirrors the GenAI scenario 03 confidence pattern but at the model level:
```python
def serve_recommendations(user_id, candidates, K=10):
    user_conf = preference_confidence(user_id)  # from scenario 03's signal
    ranked = ranker.predict(candidates, user_features(user_id))
    if user_conf > 0.7:
        # Strong, fresh preference signal: trust the ranker outright.
        return ranked[:K]
    elif user_conf > 0.4:
        # Middling signal: blend with a popular-in-cohort fallback.
        return blend(ranked, popular_in_cohort(user_id),
                     weights=[user_conf, 1 - user_conf])[:K]
    else:
        # Thin or stale signal: actively explore.
        return blend(ranked, popular_in_cohort(user_id), exploration_pool(user_id),
                     weights=[0.3, 0.4, 0.3])[:K]
```
The reranker is trusted only as much as the user's preference signal is trustable. When confidence is low, the system actively explores rather than confidently mis-ranking.
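One possible shape for the blend helper — a weighted interleave that draws each slot's source proportionally to its weight, deduplicating as it goes (illustrative):

```python
import random

def blend(*ranked_lists, weights):
    """Weighted interleave of ranked lists, deduplicated."""
    iters = [iter(lst) for lst in ranked_lists]
    weights, out, seen = list(weights), [], set()
    while iters:
        i = random.choices(range(len(iters)), weights=weights)[0]
        for item in iters[i]:
            if item not in seen:
                seen.add(item)
                out.append(item)
                break
        else:  # this source is exhausted; drop it and its weight
            del iters[i], weights[i]
    return out
```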
4. Governance — model versioning + rollback
Each promoted model carries:
- model_version_sha (artifact hash)
- training_data_window (start/end timestamps)
- random_policy_slice_id (which random slice was used for OPE)
- per_cohort_eval_results (table)
Rollback is automatic if production CTR on the dominant cohort drops > 1% from baseline within 7 days. The previous model stays warm for at least the canary window length × 2.
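A minimal shape for the promotion record, assuming a Python dataclass (field names mirror the list above; the baseline field is an assumption tied to the rollback rule):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PromotedModel:
    model_version_sha: str         # artifact hash
    training_data_window: tuple    # (start_ts, end_ts)
    random_policy_slice_id: str    # slice used for off-policy eval
    per_cohort_eval_results: dict  # cohort -> {ndcg_at_10, ope_ctr, ...}
    baseline_ctr: float            # dominant-cohort CTR the rollback trigger compares against
```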
Trade-offs & Alternatives Considered
| Approach | Adapts to drift | Bias correction | Cost | Verdict |
|---|---|---|---|---|
| Quarterly flat-weighted retrain | No | None | Low | Original — drifts |
| Monthly retrain, same data | Partial | None | Higher | Helps a bit, doesn't address bias |
| Decay-weighted + IPS + stratified, monthly retrain | Yes | IPS handles selection bias | Medium | Chosen |
| Continuous online learning | Yes | Variable | Highest | Operationally fragile for a ranker; reserve for very small surfaces |
| Bandit-only (no learned model) | Yes (slowly) | n/a | Medium | Loses the value of a global ranker |
| Re-train on random-policy slice only | Yes (unbiased) | n/a | Data-starved | Slice too small for primary training |
The combination — decay weighting (fixes time bias), IPS (fixes selection bias), stratified sampling (fixes cohort bias), monthly cadence (catches drift), random-policy slice (provides honest eval) — is the standard for production rankers. The lift is operational, not algorithmic.
Production Pitfalls
- Random-policy slices are politically hard. "Why are we serving random recs to 5% of users?" Frame as the cost of having an honest eval signal — without it, every reranker change is unverifiable. The CSAT cost of 5% random recs is small (these users still see candidate-pool items, not garbage); the cost of not having the signal is much larger.
- IPS variance explodes on items with very low propensity. If propensity is 0.001, the inverse weight is 1000 and a single event swamps the loss. Cap propensity at a floor (e.g., 0.01) and clip outlier weights. Document the bias this introduces.
- Decay half-lives are themselves model parameters. They drift too. Re-evaluate the half-lives quarterly using held-out tail-cohort data; don't treat them as constants.
- Stratified sampling can reduce effective volume. If you stratify too aggressively, the model sees too few examples per cohort to learn well. Balance stratification against minimum-cell size.
- Cold-start users are not in any cohort. Build a separate first-N-recommendations model (or a popularity baseline) for them; don't try to feed cold-start users into the same training stratification.
- Selection bias in the training-time policy ≠ selection bias in the serve-time policy. When the serving policy changes, the IPS correction needs the historical propensity (what the policy assigned at serve time), not the current policy's propensity. Log propensities at serve, not at train (a sketch follows this list).
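A minimal sketch of serve-time propensity logging (the sink and field names are illustrative):

```python
import time

def log_impression(sink, user_id, item_id, slate, policy):
    # The propensity must be the serving policy's probability at this moment;
    # recomputing it later under a newer policy silently corrupts IPS.
    sink.write({
        "user_id": user_id,
        "item_id": item_id,
        "policy_version": policy.version,
        "propensity": policy.propensity(item_id, slate),
        "served_at": time.time(),
    })
```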
Interview Q&A Drill
Opening question
Your recommendation reranker is retrained quarterly on 90 days of implicit feedback. NDCG@10 on held-out is stable across retrains, but production CTR has dropped 3% per quarter for two quarters. What's wrong?
Model answer.
NDCG@10 on the held-out set is stable because the held-out set inherits the same biases as training — selection bias in what was shown, time-flat weighting, tenure imbalance. The held-out set is not a window into the future; it's a mirror of the past. Production CTR is the honest signal, and it's telling you the model has drifted.
Three structural fixes. (1) Decay-weighted training — events lose weight by age, with per-event-type half-lives (negatives decay fastest, explicit positives slowest). (2) IPS reweighting — every event is weighted by 1 / propensity_at_serve so the loss compensates for what the previous policy chose to show. (3) Random-policy slice — 5% of traffic gets uniformly-sampled candidates from the candidate pool; off-policy eval against this slice is the unbiased test. The promotion gate becomes "per-recency NDCG, per-tenure-cohort NDCG, and off-policy eval all pass."
Move retrain cadence from quarterly to monthly. Add a confidence-aware serving layer that softens personalization when the user's preference signal is unreliable. The conceptual move: ground truth in implicit-feedback ranking is decay-aware and bias-aware; treating it as a flat sum is the bug, regardless of model architecture.
Follow-up grill 1
IPS variance explodes on low-propensity items. Walk me through what you do.
Three protections. (1) Floor the propensity — clip below a minimum (e.g., 0.01). The bias this introduces is small for tail items and worth the variance reduction. Document the floor and the bias it introduces explicitly. (2) Clip the inverse weight — even after the floor, clip the per-event weight at a max (e.g., 100). Single events shouldn't dominate. (3) Use doubly-robust estimation as an alternative — a model-based estimate of the reward combined with IPS gives a lower-variance estimator. More complex but worth it when off-policy eval is the gate.
The failure mode IPS cannot fix is zero propensity: if the historical policy never gave an item any exposure, IPS literally can't see it. The defense is the random-policy slice — items get non-zero exposure even if the main policy never picked them, so off-policy eval has data on the tail.
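A sketch of the doubly-robust estimator from (3), assuming a fitted reward model and a deterministic candidate policy (all names are illustrative):

```python
import numpy as np

def doubly_robust_value(logged, reward_model, policy):
    """DR = model-based estimate + propensity-weighted residual correction.
    logged rows: (features, action, reward, propensity) under the old policy."""
    vals = []
    for x, a, r, p in logged:
        a_new = policy.choose(x)
        dr = reward_model.predict(x, a_new)  # low-variance model term
        if a == a_new:
            dr += (r - reward_model.predict(x, a)) / p  # keeps the estimate unbiased
        vals.append(dr)
    return float(np.mean(vals))
```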
Follow-up grill 2
You have decay-weighted training, but a user who's been quiet for 60 days clicks one isekai today. The decay weight on that single click is high (recent), the weights on their old clicks are low. Don't you over-fit to one event?
Per-event weights interact with per-user volume. Two protections. (1) Per-user normalization: each user's events are weighted such that the total user weight is bounded — a single recent event doesn't drown out historical patterns for a thin-data user (sketched below). (2) Confidence-aware serving, covered in detail in scenario 03: when a user has thin recent data, the serving layer doesn't trust the ranker as much; it blends with popular-in-cohort and exploration. Even if the ranker overfits to one event for that user, the serving blend dilutes the impact.
The deeper trade-off: a thin-data user's "ground truth" is itself uncertain, and the architecture should reflect that. Pretending the ranker knows what to do for someone with one recent click is the original mistake.
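A minimal sketch of the per-user cap from (1), assuming pandas (the cap value is illustrative):

```python
import numpy as np

def cap_user_weight(df, cap=5.0):
    # Scale each user's weights so their total (absolute) training weight
    # never exceeds `cap`; a thin-data user keeps their one fresh click,
    # but it can't outvote everyone else's history.
    totals = df.groupby("user_id")["weight"].transform(lambda s: s.abs().sum())
    df["weight"] *= np.minimum(1.0, cap / totals.clip(lower=1e-9))
    return df
```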
Follow-up grill 3
Your random-policy slice is 5%. Leadership is uncomfortable with the perceived CSAT cost. Defend or shrink.
Defend, mostly. The cost of the random slice is bounded — these users still see candidate-pool items, which means popular and high-quality content. The CTR delta against the personalized slice is real but small, and the hit is spread thin: each user experiences random-policy picks on only a fraction of their carousels, not a persistently degraded experience.
The alternative — no random slice — means we have no honest eval. Every reranker change becomes a leap of faith. The cost of that is much larger over time: bad models stay in production longer, good models get promoted on noisy signals, and there's no defense when leadership asks "is the model improving?"
The compromise I'd consider: shrink to 2–3% if 5% is genuinely painful, with the understanding that the off-policy eval gets noisier and we'd need a longer measurement window. Below 2%, OPE loses statistical power — you're not really evaluating anymore; you're hoping.
Follow-up grill 4
Per-recency NDCG is stable across buckets but per-cohort NDCG fails on long-tenured users. Which gate should win?
Per-cohort. Long-tenured users are the cohort whose taste drifts most, whose retention/CSAT economics matter most, and whose dissatisfaction is hardest to undo. Per-recency might be stable because the short-tenure cohort dominates the recency buckets — the recency signal averages across cohorts.
The gate hierarchy I'd defend: per-cohort > per-recency > aggregate. Aggregate is advisory. Per-recency is structural. Per-cohort is operational — it's the slice closest to who actually walks away if the model is wrong.
If the team disagrees and argues "per-recency is more rigorous," I'd add per-recency-and-per-cohort: a 2-D slice. The cell that fails (e.g., long-tenured + recent events) is the actionable one. 1-D slicing leaves cells unmeasured.
Architect-level escalation 1
Six months out, the company adds a "live" mode where new releases drop and the recommender must surface them within an hour. Your monthly retrain cadence can't hit that. Online learning?
Selectively, yes. Three architectural pieces.
(1) Two-tier ranker. The monthly-trained "main" ranker handles the broad personalization; a short-window incremental model layers on top to capture freshness signals. The incremental model trains daily or hourly on the most recent positive interactions and adjusts a small set of parameters (item-level adjustments, recent-popularity boosts). Online learning isn't retraining the whole reranker; it's an incremental layer.
(2) Cold-item handling, separately. New items have no interaction history. A "cold-item" sub-system uses content features (genre, art style, alias matches) and embedding similarity to comparable items that do have interaction history to produce a starting score. The cold-item sub-system is rule-based or shallow-ML, not online-learned, because the data isn't there to learn from.
(3) Bandits on the new-item slice. For genuinely new items, allocate a small bandit budget — show new items to a sampled set of users predicted to like them, observe response, update item-level scores. Bandits are excellent for cold-item exploration; they're poor for primary ranking.
The structural commitment: don't try to make the main ranker fast. Make the layered system fast. Different cadences for different parts of the problem. The main ranker is monthly because that's where the most data sits; the incremental layer is hourly because that's what speed needs.
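A minimal cold-item bandit sketch for (3) — Thompson sampling over per-item Beta posteriors on click-through (names and budget are illustrative):

```python
import numpy as np

def thompson_pick(new_items, posteriors, k=3):
    """posteriors: item_id -> (clicks + 1, impressions - clicks + 1).
    Draw from each Beta posterior; surface the top-k draws."""
    draws = {item: np.random.beta(a, b)
             for item, (a, b) in posteriors.items() if item in new_items}
    return sorted(draws, key=draws.get, reverse=True)[:k]
```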
Architect-level escalation 2
Six months from now, regulators require explanations for personalized recommendations: "show users why an item was recommended to them." How does your decay-weighted, IPS-corrected, multi-tier system surface explanations?
Explanations have to come from the parts of the system that produce explainable signals.
(1) Feature-attribution from the main ranker. GBDT-family rankers (LightGBM) have native feature importance and per-prediction SHAP values. The user-facing explanation can be derived from the top features driving the score: "recommended because you've been reading more shounen recently and this title scores high on shounen relevance," etc. SHAP values per prediction are computed at serve time (small cost) and surfaced in the explanation.
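A minimal attribution sketch: LightGBM's pred_contrib=True returns per-prediction SHAP-style contributions natively (the booster and feature names here are assumptions):

```python
import numpy as np

# contribs has one column per feature plus a final bias column.
contribs = booster.predict(item_features, pred_contrib=True)
top = np.argsort(-np.abs(contribs[0, :-1]))[:3]  # top-3 drivers for this item
# Map the top-contributing features to explanation templates, e.g.
# "recent_shounen_clicks" -> "because you've been reading more shounen lately".
```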
(2) Decay-weighted history makes explanations self-aware. "Recommended because of your recent buys" is honest if the decay weights actually emphasized recent. If the model trained on 90-day flat-weighted data, the explanation is stretching the truth — the model did rely on a 60-day-old signal more than the explanation suggests. Decay-weighting matches the model's actual reliance to the user-facing story.
(3) Confidence-aware explanations. When the serving layer is in low-confidence mode (blending with popular-in-cohort), the explanation reflects that: "we're still learning your taste — this is popular among readers of similar content." Honest about uncertainty.
(4) Random-policy items get a different explanation. "We surface a few items from outside your usual taste so we can learn what you might like." Don't pretend a random-policy pick was personalized.
The architecture commitment: explanations are not bolted on; they're a function of the model + serving layer's design choices. A reranker built without decay weighting can't honestly explain why it surfaced a recent item; it didn't actually weight recent more. Building explainability requires building the reasoning into the model design, not painting on top.
The audit angle: store SHAP values per served recommendation for the regulatory window. Storage cost is non-trivial but bounded; the feature space is fixed. When regulators ask "why did the bot recommend X," you can reconstruct the per-feature contribution.
Red-flag answers
- "Retrain monthly instead of quarterly." (Helps but doesn't address bias.)
- "Use a bigger model." (Fits biased data more aggressively.)
- "Trust held-out NDCG." (Mirrors training bias.)
- "Use explicit feedback only." (Sparse; primary truth must include implicit.)
- "Use the LLM as judge." (Wrong tool for engagement-based ranking.)
Strong-answer indicators
- Names the held-out set as sharing the training set's bias.
- Knows decay weighting + IPS + stratified sampling + random-policy slice as the canonical fix set.
- Has cohort-stratified evaluation as primary, aggregate as advisory.
- Treats confidence-aware serving as the runtime safety net.
- Has a serious answer for cold-item / new-content surfacing (layered system, not online-learn-the-whole-ranker).
- Connects explainability to model-design choices, not bolted-on.