# GenAI Scenario 03 — User Preference Concept Drift

## TL;DR
The User-Preferences MCP rolls a per-user taste embedding (an EMA over recent click/purchase/dwell signals) and uses it to rerank LLM-generated recommendations. The "good recommendation" label was implicitly defined as "the user clicked / bought / kept reading." That label decays continuously: a user who loved shounen six months ago may have rotated to seinen this quarter, the gold-standard "click" signal lags behavior changes by weeks, and what counts as a good rec for that user today has no fixed reference. The fix shape is decay-aware labeling with explicit half-lives, rolling-window precision metrics, and a "preference confidence" signal that gates how aggressively we personalize when the truth is stale.
## Context & Trigger
- Axis of change: Time (the dominant axis), with secondary pressure from Scale (the cohort of long-tenured users grew, and their taste vectors are aging fastest).
- Subsystem affected: `RAG-MCP-Integration/02-user-preferences-recommendation-mcp.md` — Personalize fusion, taste embedding EMA, diversity injection.
- Trigger event: no single event; this is the slow-burn case. CSAT for "find me something to read" decays 0.4–0.6%/month for ~9 months before the team notices in a quarterly review. Aggregate click-through on rec carousels is flat. Cohort-sliced CTR shows long-tenured users (signed up > 12 months ago) have dropped from 18% → 11% on personalized recs while new users are stable.
## The Old Ground Truth
The original design:
- Per-user `taste_vector` (a 768-dim embedding) maintained as an EMA of `embed(item)`, weighted by interaction strength: `0.3 × click + 0.6 × add-to-list + 1.0 × buy + 0.5 × completed_chapter`.
- "Good recommendation" = an item the user acted on within 30 days of being shown it. Implicit positive label.
- "Bad recommendation" = shown, not acted on within 30 days. Implicit negative label.
- EMA half-life: 90 days (a 90-day-old click contributes 0.5×; a 180-day-old click contributes 0.25×) — see the sketch after this list.
- Cold-start: handled with a separate flow (genre quiz, popular-in-genre fallback). See `02-user-preferences-recommendation-mcp.md`.
- Reasonable assumptions:
  - Users' tastes drift smoothly enough that a 90-day half-life captures "current taste."
  - Implicit-feedback labels are unbiased (a click means the user liked it).
  - The EMA is the most recent thing we know, and is therefore "ground truth" for the user now.
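A minimal sketch of that single-EMA update, assuming unit-norm embeddings; the function and helper names are illustrative, not the production API:

```python
import numpy as np

def ema_update(taste_vec: np.ndarray, item_emb: np.ndarray,
               strength: float, dt_days: float,
               half_life_days: float = 90.0) -> np.ndarray:
    # Decay the stored vector by elapsed time, then add the new interaction.
    # strength: 0.3 click, 0.6 add-to-list, 1.0 buy, 0.5 completed_chapter.
    decay = 0.5 ** (dt_days / half_life_days)  # 90d old => 0.5x, 180d => 0.25x
    updated = decay * taste_vec + strength * item_emb
    return updated / np.linalg.norm(updated)   # re-normalize for cosine scoring
```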
That design is fine for stationary tastes. Tastes are not stationary.
## The New Reality
- Taste rotates faster than the half-life. Users move between genres on a 2–4 month cycle (school → action → mystery → romance), driven by external triggers (a popular anime, a friend's recommendation, life events). The 90-day EMA averages across two phases of taste and gives a vector that is "the user's recent average," which is a good description of nobody.
- Implicit feedback is biased. A click doesn't mean "I liked it"; it can mean "I was curious." A buy doesn't mean "I'd buy more like it"; it can mean "I bought a gift." Labels are noisy and the noise is non-random.
- The "non-action" label is even more biased. A user who didn't click on a rec might not have seen it (carousel position 12), or might have seen it after they'd already found something else. Treating non-action as a negative label conflates "uninterested" with "not exposed."
- Long-tenured users have unstable embeddings. Their EMA is a smoothed average over years of taste rotations. New users have crisp embeddings; long-tenured users have muddy ones, which is the opposite of what you'd expect.
- There is no "right answer." Unlike scenario 02 (policy), nobody can write down what the correct recommendation for user X is right now. The truth is itself probabilistic and unstable.
The schema didn't change. The meaning of the label drifted, and the silent assumption "EMA is current taste" stopped being true.
## Why Naive Approaches Fail
- "Shorten the half-life to 30 days." Reduces smoothing but amplifies noise — users with low recent activity get whiplashed by single events. A user clicks one isekai out of curiosity and now their entire feed is isekai for the next two weeks.
- "Just retrain the personalization model." Retraining on biased implicit feedback bakes in the biases. The model gets better at predicting clicks, not at predicting good recs.
- "Use explicit feedback (rate this, was this useful)." Explicit feedback rates are < 2% in production. The signal is too sparse to be the primary truth.
- "Trust the most recent click." A single recent click is too noisy. It conflates exploration with preference.
- "Reset the embedding every quarter." Loses long-term anchors that do persist (genre preference, content-pacing preference, art-style preference).
- "Add more interaction signals." More signals doesn't fix the bias; it just gives you more biased signals.
## Detection — How You Notice the Shift

**Online signals.**
- Cohort-sliced CTR on personalized recs. Plot CTR per signup-tenure bucket (0–3 mo, 3–12 mo, 12+ mo). If the long-tenured cohort drops while the short-tenured cohort is stable, taste rotation > half-life is the likely cause.
- Diversity-injection acceptance rate. The Personalize stack already injects 10–15% diverse picks into every carousel. If users accept those more than the personalized picks, your "personalized" picks are not actually personalized — you're discovering taste through diversity injection.
- Carousel-skip pattern. Users scrolling past the personalized rail to the "trending" or "new releases" rail — a passive signal that personalization is missing the mark.
**Offline signals.**
- Rolling-window precision@10. Compute it per week: of the items recommended to user U at time t, what fraction were acted on by t+30d? A monotone decline over weeks is the signature (see the sketch after this list).
- Embedding stability per user. Cosine similarity between `user_emb(today)` and `user_emb(30d ago)`. High stability plus declining precision means the embedding has stopped tracking the user.
- Counterfactual replay. For users who eventually acted on item X, did the recommender ever surface X to them? Latency-to-discover is a useful metric.
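A sketch of the weekly computation, assuming exposure and action logs shaped as below (names and shapes are illustrative):

```python
from collections import defaultdict
from datetime import timedelta

def rolling_precision_at_10(recs, actions, horizon_days=30):
    # recs: iterable of (user_id, item_id, shown_at) for rank <= 10 exposures.
    # actions: dict mapping (user_id, item_id) -> first action timestamp.
    hits, shown = defaultdict(int), defaultdict(int)
    for user_id, item_id, shown_at in recs:
        week = shown_at.isocalendar()[:2]  # (year, ISO week) bucket
        shown[week] += 1
        acted_at = actions.get((user_id, item_id))
        if acted_at is not None and acted_at <= shown_at + timedelta(days=horizon_days):
            hits[week] += 1
    # A monotone decline across weeks is the drift signature.
    return {week: hits[week] / shown[week] for week in sorted(shown)}
```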
**Distribution signals.**
- Genre-mix divergence. For each user cohort, compare the genre mix in their recommendations vs the genre mix in their actions. Growing KL divergence = recs lag actions (see the sketch after this list).
- Embedding age distribution. What's the median age of the interactions that constitute the user's EMA? If it's > the half-life, you're under-weighting recent behavior even though the EMA "should" handle it.
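A sketch of the divergence check, assuming genre counts are already aggregated per cohort and window (names are illustrative):

```python
import numpy as np
from scipy.stats import entropy

def genre_mix_divergence(action_counts: dict, rec_counts: dict,
                         genres: list, eps: float = 1e-9) -> float:
    # KL(actions || recs) over the genre distribution for one cohort.
    # Growing values over successive windows = recs lagging actions.
    p = np.array([action_counts.get(g, 0) for g in genres], float) + eps
    q = np.array([rec_counts.get(g, 0) for g in genres], float) + eps
    return float(entropy(p / p.sum(), q / q.sum()))  # scipy: entropy(p, q) is KL
```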
## Architecture / Implementation Deep Dive

```mermaid
flowchart TB
subgraph Signals["Per-user signals"]
CLK["Clicks"]
BUY["Buys"]
COMP["Completed chapters"]
DWELL["Dwell time"]
SKIP["Skip / scroll-past"]
EXPL["Explicit feedback<br/>(thumbs, rate)"]
end
subgraph TasteState["Taste state (NEW: multi-horizon)"]
SHORT["Short-horizon vec<br/>14-day half-life<br/>captures current phase"]
LONG["Long-horizon vec<br/>365-day half-life<br/>captures durable taste"]
CONF["Preference confidence<br/>= f(signal density,<br/>vector stability,<br/>action-mix entropy)"]
end
subgraph Eval["Decay-aware eval"]
ROLL["Rolling precision@10<br/>per cohort, per week"]
CTR["CTR diff vs<br/>diverse baseline"]
REPLAY["Counterfactual replay<br/>+ latency-to-discover"]
end
subgraph Serve["Reranker"]
BLEND["Blend short + long<br/>weighted by confidence"]
DIV["Diversity injection<br/>(adaptive, larger when<br/>confidence is low)"]
FALL["Fallback to popular-in-cohort<br/>when confidence < threshold"]
end
CLK --> SHORT
BUY --> SHORT
COMP --> SHORT
DWELL --> SHORT
EXPL --> SHORT
EXPL -.->|gold signal| LONG
BUY --> LONG
COMP --> LONG
SKIP --> CONF
SHORT --> CONF
LONG --> CONF
SHORT --> BLEND
LONG --> BLEND
CONF --> BLEND
CONF --> DIV
CONF --> FALL
BLEND --> ROLL
BLEND --> CTR
style CONF fill:#fde68a,stroke:#92400e,color:#111
style SHORT fill:#dbeafe,stroke:#1e40af,color:#111
style LONG fill:#dbeafe,stroke:#1e40af,color:#111
style FALL fill:#fee2e2,stroke:#991b1b,color:#111
```

### 1. Data layer — multi-horizon taste, not a single EMA
Replace the single EMA with two:
- Short-horizon vector (14-day half-life): captures the current phase of taste. Sensitive, noisy.
- Long-horizon vector (365-day half-life): captures durable preferences (overall genre lean, art style, pacing). Stable, slow.
The two are stored separately. The reranker blends them per request based on the confidence signal:
```python
import numpy as np

def serve_taste_vector(user_id: str) -> np.ndarray:
    # Convex blend: confidence gates how much we trust the noisy
    # short-horizon vector over the stable long-horizon one.
    short = read_short_emb(user_id)
    long_ = read_long_emb(user_id)
    confidence = compute_preference_confidence(user_id)  # in [0, 1]
    # High confidence → trust short. Low confidence → fall back toward long + popular.
    return confidence * short + (1 - confidence) * long_
```
`compute_preference_confidence` is the new, central signal:
```python
import numpy as np

def compute_preference_confidence(user_id: str) -> float:
    interactions_30d = count_interactions(user_id, window="30d")
    short, long_ = read_short_emb(user_id), read_long_emb(user_id)
    cos_sim = float(np.dot(short, long_) /
                    (np.linalg.norm(short) * np.linalg.norm(long_)))
    short_long_drift = 1 - cos_sim  # high drift = phase change
    # entropy_over_genres is assumed normalized to [0, 1]
    action_entropy = entropy_over_genres(actions_30d(user_id))  # high = exploring
    signal_density = min(interactions_30d / 20, 1.0)
    # Heuristic: confident when signals are dense AND recent behavior is consistent
    return signal_density * (1 - short_long_drift) * (1 - 0.5 * action_entropy)
```
This is the most important architectural addition: a confidence signal about our own ground truth. The system knows when it doesn't know.
### 2. Pipeline layer — decay-aware labels

Each interaction is stored with `(timestamp, signal_type, item_id, exposure_position)`. Labels are computed lazily at eval time, not at write time.
| Label kind | Window | Decay |
|---|---|---|
| `acted_on_within_7d` | 7d | half-life 14d |
| `acted_on_within_30d` | 30d | half-life 30d |
| `negative_exposure` | 30d | half-life 7d (decays fast — non-action ages out) |
| `explicit_thumbs_up` | ∞ | half-life 365d (gold) |
| `explicit_thumbs_down` | ∞ | half-life 365d (gold) |
Negatives decay fastest because "we showed it, they didn't click" is the label that goes stale soonest — the signal degrades within weeks.
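A sketch of the lazy, eval-time weighting using the half-lives from the table above (constant and function names are illustrative):

```python
from datetime import datetime

# Half-lives from the table above.
LABEL_HALF_LIFE_DAYS = {
    "acted_on_within_7d": 14,
    "acted_on_within_30d": 30,
    "negative_exposure": 7,       # non-action ages out fastest
    "explicit_thumbs_up": 365,    # gold
    "explicit_thumbs_down": 365,  # gold
}

def label_weight(label_kind: str, labeled_at: datetime, now: datetime) -> float:
    # Stored events never change; only their influence decays at read time.
    age_days = (now - labeled_at).total_seconds() / 86_400
    return 0.5 ** (age_days / LABEL_HALF_LIFE_DAYS[label_kind])
```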
### 3. Serving layer — confidence-gated personalization

```python
def blend_weights(confidence: float) -> dict[str, float]:
    # Confidence gates how aggressively we personalize.
    if confidence > 0.7:
        return {"personalized": 0.85, "diversity": 0.15}
    elif confidence > 0.4:
        return {"personalized": 0.6, "diversity": 0.3, "popular_in_cohort": 0.1}
    else:
        # Exploration: items chosen to *learn* taste, not satisfy it
        return {"personalized": 0.3, "popular_in_cohort": 0.4, "exploration": 0.3}
```
When confidence is low, the system actively explores — like a multi-armed bandit. This is also when explicit-feedback prompts ("was this useful?") are most worth showing the user, because their answers carry maximum signal value.
### 4. Governance — guardrails on personalization

Three governance pieces:
- No bandit on first-time users. Cold-start path is unchanged; exploration only kicks in once confidence has been computed and is below threshold for a known user.
- Minimum diversity floor. Even at confidence = 1.0, ≥ 10% of recs are diverse (cross-genre). This protects against echo chambers and provides an ongoing signal for "did taste shift" detection (see the sketch after this list).
- Audit on extreme-personalization decisions. If the reranker is choosing items entirely from one genre for one user, log it. Manual review flags when a single-genre carousel was served to a user who later complained.
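A minimal sketch of the diversity-floor guardrail; the `is_cross_genre` attribute and list shapes are assumptions for illustration:

```python
def enforce_diversity_floor(slate: list, diverse_pool: list,
                            floor: float = 0.10) -> list:
    # Guarantee >= floor of the final slate is cross-genre, even at
    # confidence = 1.0, by swapping out the tail of the personalized slate.
    need = max(1, round(len(slate) * floor))
    have = sum(1 for item in slate if item.is_cross_genre)
    missing = need - have
    if missing > 0:
        slate = slate[:len(slate) - missing] + diverse_pool[:missing]
    return slate
```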
## Trade-offs & Alternatives Considered

| Approach | Adapts to drift | Stability | Cold-start | Compute | Verdict |
|---|---|---|---|---|---|
| Single EMA, 90d half-life | Slow | High | OK | Low | Original — under-adapts |
| Single EMA, 30d half-life | Fast | Low (whiplash) | OK | Low | Trades one bug for another |
| Multi-horizon (short+long) + confidence | Yes, gracefully | Yes | OK | Medium | Chosen |
| Per-user retrained ranker | Yes (fully) | Variable | Bad | Very high | Cost-prohibitive |
| Bandits over rerankers | Yes | Yes | Bad | Medium | Pieces folded in via exploration |
| Pure popularity (no personalization) | n/a | High | Best | Lowest | Loses the value of personalization entirely |
The multi-horizon + confidence pattern is the one architectural lift; everything else (bandit-style exploration, diversity floors) amounts to small additions on top.
## Production Pitfalls

- The confidence formula is itself a model. It will drift. Track per-cohort calibration: among users with `confidence > 0.7`, precision should actually be higher than among users with `confidence < 0.4` (see the sketch after this list). If the gap is small or inverted, your confidence signal is broken — patch and redeploy without waiting.
- Long-horizon vectors corrupt slowly with negatives. If you bake exposure-but-not-clicked into the long EMA at all, after 18 months the long vector becomes a "things I was shown" vector instead of a "things I like" vector. Restrict the long EMA to positive signals only (buy, completed_chapter, explicit thumbs_up). Negatives are short-horizon only.
- "Confidence < threshold → fallback" reveals personalization is off. Users notice when their feed suddenly becomes generic. Soften the transition: a confidence change should move blend weights smoothly over a couple of sessions, not flip on the first low-confidence request.
- Exploration items must be safe. A bandit that surfaces edgy content to a user who's been reading kids' manga is a CSAT bomb. Constrain exploration to genres adjacent to long-horizon taste, never wildly outside.
- A/B testing personalization changes is hard. A change to the EMA half-life affects users after weeks of behavior, but A/B tests run for 1–2 weeks. Plan for long readout windows and include offline replay (counterfactual rerankers) as the primary evidence.
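A sketch of the calibration check from the first pitfall, assuming per-user `(confidence, precision@10)` pairs from the rolling eval:

```python
def confidence_calibration_gap(pairs) -> float:
    # pairs: iterable of (confidence, precision_at_10) per user.
    hi = [p for c, p in pairs if c > 0.7]
    lo = [p for c, p in pairs if c < 0.4]
    if not hi or not lo:
        raise ValueError("need users in both confidence bands")
    gap = sum(hi) / len(hi) - sum(lo) / len(lo)
    return gap  # small or negative gap => the confidence signal is broken
```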
## Interview Q&A Drill

### Opening question

Your recommender uses a per-user EMA over click and purchase signals to drive personalized recommendations. CSAT on recs is decaying ~0.5%/month. Aggregate CTR looks fine. What's going on, and what do you change?

**Model answer.**
The aggregate CTR is a lagging, blended metric that hides cohort-level decay. Two diagnoses to confirm: (a) cohort-sliced CTR by signup tenure should show long-tenured users dropping while new ones are stable; (b) rolling precision@10 should be falling. If both confirm, the cause is that taste drifts faster than the EMA half-life. The 90-day EMA averages over two phases of taste and produces a "current average" that's no one's current taste.
The fix is multi-horizon. Maintain a short vector (14-day half-life) for current phase and a long vector (365-day half-life, positive signals only) for durable taste. Blend them per request based on a preference confidence signal — function of signal density, short-vs-long drift, and action-genre entropy. When confidence is high, lean on short. When confidence is low (e.g., user is exploring or has been quiet), lean on long + popular-in-cohort + a small exploration budget. The conceptual shift: instead of pretending we always know the user's current taste, encode whether we know.
Also: reduce reliance on biased implicit-feedback negatives. "Shown but not clicked" is a fast-decaying signal — give it a 7-day half-life — because it conflates "uninterested" with "not exposed."
### Follow-up grill 1
You added a "confidence" signal. Won't users notice when the system flips from personalized to "generic" picks because their confidence dropped?
Yes, and that's a real failure mode. Two mitigations. First, smooth the transition: confidence is computed continuously, blend weights move gradually across 1–3 sessions, not abruptly. Second, fail soft, not silent: low-confidence sessions still feature personalization, just blended with more popular-in-cohort and an exploration budget — the carousel doesn't suddenly become "Top 10 of all time," it becomes "things that are working for users like you" plus "want to try X?". The user-visible cost is a slightly less specific carousel; the user-invisible benefit is that we stop confidently mis-personalizing. Users tolerate "we're learning what you like right now" better than they tolerate "your feed is wrong."
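A minimal sketch of that smoothing, assuming the served confidence is persisted across sessions (the blend rate is illustrative):

```python
def smoothed_confidence(prev_served: float, raw: float,
                        alpha: float = 0.3) -> float:
    # Move the served value a fraction of the way toward the raw signal
    # each session, so blend weights drift over ~1-3 sessions instead of
    # flipping on a single low-confidence request.
    return (1 - alpha) * prev_served + alpha * raw
```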
### Follow-up grill 2
You constrain exploration to "adjacent" genres. How do you decide adjacency? It seems like the very thing you're trying to learn.
Right — adjacency is itself a model. Three ways to source it. (1) Co-engagement graph. Across all users, which genres co-occur in the same user's history within a short window? That's a population-level adjacency that doesn't require per-user labels. (2) Embedding space proximity. In the long-horizon item-embedding space, which genre clusters sit near each other? Closer clusters are safer to explore into. (3) Editorial whitelist. For high-risk transitions (kids' content → adult), explicit safety rules override population-level adjacency. Exploration is constrained by all three: pick the narrowest adjacency from the three sources for any given exposure decision. The cost: exploration is conservative and may miss genuinely surprising taste. The benefit: a kid doesn't see seinen-with-violence because their mom let them try a school-romance once.
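A sketch of sourcing (1) and combining all three, where the embedding-proximity and editorial sets are assumed to come from the pipelines described above:

```python
from collections import Counter
from itertools import combinations

def coengagement_adjacency(user_histories, min_pairs: int = 50) -> dict:
    # Population-level adjacency: genres that co-occur in the same user's
    # recent history. user_histories: iterable of per-user genre sets.
    pair_counts = Counter()
    for genres in user_histories:
        for a, b in combinations(sorted(genres), 2):
            pair_counts[(a, b)] += 1
    adj: dict = {}
    for (a, b), n in pair_counts.items():
        if n >= min_pairs:  # ignore rare, possibly identifying pairs
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    return adj

def allowed_exploration_genres(user_genres: set, coengagement: dict,
                               embedding_nearby: set, editorial: set) -> set:
    # The narrowest adjacency wins: all three sources must agree.
    from_graph = set().union(*(coengagement.get(g, set()) for g in user_genres))
    return from_graph & embedding_nearby & editorial
```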
### Follow-up grill 3
Long-horizon vector decays at 365 days but still drifts. What stops it from becoming a smeared average over five years of changing taste?
Two protections. (1) Positive-only ingestion — only buys, completed chapters, and explicit thumbs_up enter the long vector. Exposure-but-not-clicked never does. That keeps the signal honest. (2) Phase-coherent weighting — when ingesting a positive signal, weight by how aligned the action is with the long vector's recent history. A signal that lies very far from the long vector is dampened by 50%; that prevents a single phase change from corrupting the long-term anchor while still letting it shift gradually. Over 5+ years, the long vector becomes "the durable lean" — overall genre family, art-style preference, content-pacing preference. It does not become "what I read last quarter." If the user's actual long-term taste has changed (rare but real — life events, becoming a parent), the system catches it via short-vector divergence persistently exceeding the long vector for 6+ months, and at that point we can decide to re-anchor the long vector. That re-anchor is a deliberate, logged event, not a continuous drift.
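A sketch of the phase-coherent dampening; the 0.5 factor and the similarity threshold are illustrative choices:

```python
import numpy as np

def phase_coherent_strength(signal_emb: np.ndarray, long_vec: np.ndarray,
                            base_strength: float, dampen: float = 0.5,
                            threshold: float = 0.2) -> float:
    # Dampen a positive signal that lies far from the long vector so a
    # single phase change cannot yank the long-term anchor.
    sim = float(np.dot(signal_emb, long_vec) /
                (np.linalg.norm(signal_emb) * np.linalg.norm(long_vec)))
    return base_strength * dampen if sim < threshold else base_strength
```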
### Follow-up grill 4
Counterfactual replay sounds clean in theory. In practice you have selection bias — the recs you served influenced what users acted on. How do you actually do this?
Two techniques. (1) Inverse propensity weighting. Log the exposure_position for every served rec and the probability that the rec was selected by the policy at serve time (the propensity). When replaying, weight each historical action by 1/propensity. This corrects for the fact that some items had higher exposure odds. (2) Off-policy evaluation with hold-outs. For a small fraction (~5%) of traffic, run a random policy (uniformly sampled recs from a candidate pool). Counterfactual replay against the random hold-out is unbiased because there's no selection. The combination — IPS on the main traffic plus a clean random hold-out for ground-truth checking — gives you a usable replay system. The hard part is convincing leadership that 5% random recs is worth the CSAT cost. Frame it as the price of having a real ground-truth signal — without it, you're flying blind on whether your reranker is improving.
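A sketch of the self-normalized IPS estimator over logged exposures; the weight cap is a common variance-reduction choice, not part of the original design:

```python
import numpy as np

def snips_estimate(rewards, propensities, cap: float = 100.0) -> float:
    # rewards: 1.0 if the user acted on the served item, else 0.0.
    # propensities: probability the serving policy picked that item,
    # logged at serve time. Capping trades a little bias for a lot of
    # variance reduction on rarely-served items.
    w = np.minimum(1.0 / np.asarray(propensities, float), cap)
    r = np.asarray(rewards, float)
    return float((w * r).sum() / w.sum())
```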
### Architect-level escalation 1
Privacy regulation forces a "right to be forgotten" on user histories. A user's history can be deleted at any time. How does that interact with your decay-aware labeling and confidence signal?
Two structural decisions need to be baked in. (1) Per-user state must be re-derivable from the audit-safe events. Don't snapshot taste vectors only — keep the source events that produced them, with retention according to the regulatory window. When a deletion request comes in, you delete the events; the vectors rebuild on the next training/aggregation pass with that user's signal absent. (2) Cohort-level aggregates (population genre-adjacency, popular-in-cohort) should already be aggregated such that no individual user is identifiable; deletion of a single user's history doesn't disturb them.
The harder edge case: what if the user is the only example of a particular taste pattern? Their deletion changes the population-level adjacency in a measurable way. That's a privacy concern (re-identification through aggregate diff) — the protection is k-anonymity in the cohorts (only emit cohort-level signals when k ≥ 50, say) plus differential-privacy-style noise on the rare-event aggregates. The architecture cost: more ceremony in the aggregation pipeline. The architectural benefit: deletion is a real operation, not a paper one.
For the confidence signal specifically: it should never require features that are non-deletable. Confidence is recomputed from the user's surviving events; if too few events remain (fresh deletion), confidence is low and the system treats them like a returning user, which is the right default.
### Architect-level escalation 2
Six months from now, the company wants to add a "for you" generative carousel where the LLM writes a personalized blurb for each rec. The blurb itself is a recommendation artifact and now ground truth includes "did the user like this blurb" not just "did they click." How does your evaluation evolve?
The label space expands from (user, item) → action to (user, item, blurb) → action. Three changes. (1) Per-blurb label. Track blurb_id and blurb_template_version on every exposure. The same item with two different blurbs is two different exposures. (2) Decompose attribution. When a user clicks, was it the rec or the blurb? A/B blurb-only experiments (same items, different blurbs) and rec-only experiments (different items, generic blurb) let you decompose. (3) Judge for blurb quality. Blurb correctness is now part of ground truth — a blurb that misrepresents the item is wrong even if the user clicked. Run an LLM-as-judge offline that scores blurb factuality vs the item description, and gate releases on blurb-factuality regression.
Crucially, the blurb itself drifts. Same template, same item, six months later, the FM (which writes the blurbs) has been upgraded — the same template now produces a slightly different blurb. That's a hidden ground-truth shift in the artifact of recommendation, not just the choice of item. Pin blurb-template versions, version-stamp every exposure, and when re-evaluating historical user response to blurbs, re-render at the original template version (not the current one) to preserve counterfactual fidelity.
The conceptual move is: ground truth used to be "did the user act," now it's "did the user act and was the artifact we showed them defensibly correct." Two labels per exposure. Two metrics to monitor. Same architecture pattern as scenario 02 — the artifact has a version, the version goes with the answer.
### Red-flag answers
- "Just shorten the half-life." (Trades one bug for another — whiplash on noisy users.)
- "Trust the most recent click." (Single-event noise; conflates curiosity with preference.)
- "Add explicit feedback." (Necessary but not primary — too sparse.)
- "Reset embeddings periodically." (Loses durable preferences.)
- "Aggregate CTR is fine, ship it." (Cohort decay hidden inside aggregates.)
### Strong-answer indicators
- Distinguishes "average over a window" from "current taste."
- Names confidence-on-personalization as a first-class signal.
- Treats negatives as fast-decaying, positives as durable.
- Smoothly degrades personalization rather than flipping on/off.
- Knows IPS / random hold-outs are needed for honest replay.
- Bakes the regulatory constraint (deletion) into derivability, not snapshotting.