GenAI Scenario 06 — Trending / Temporal Decay
TL;DR
The Trending-Discovery MCP exposes "what's trending right now" answers driven by a Kinesis Data Analytics tumbling-window pipeline. "Right now" has a half-life of roughly 60 minutes; a fresh anime tie-in or a viral chapter can push a title's trending score from rank 200 to rank 3 within an hour and back again by morning. The original eval treated trending as if it were a normal retrieval task — a frozen golden set of "recently trending titles," snapshotted at label time and scored against later production answers. Within a quarter, the eval was uniformly stale: by the time the labels were collected, reviewed, and committed, the correct answer had already changed, and the eval was effectively scoring the model on yesterday's truth. The fix shape is time-anchored ground truth, where every label is (timestamp, trending_set) and evaluation runs against the truth that held at prediction time rather than now-truth, plus a continuous shadow eval that scores live answers against the live truth window.
Context & Trigger
- Axis of change: Time (dominant; the half-life of a label is shorter than the eval's review cycle).
- Subsystem affected: RAG-MCP-Integration/06-trending-discovery-mcp.md — Kinesis Data Analytics tumbling windows, trend-score formula, anime tie-in detection.
- Trigger event: A flagship anime adaptation aired Sunday at 7 pm. By Monday morning's standup, the team's "trending" eval was failing — but failing on the previous week's answers, while production was already serving Sunday-night spikes correctly. Nobody could tell whether the eval failure indicated a real regression or just label staleness, and the team rolled back a perfectly good change.
The Old Ground Truth
The original setup:
- Trending golden set (~ 150 prompts × snapshot) — "what's trending in shounen this week," "what's the hottest new release," etc.
- Labels collected weekly — analyst pulls top-N from the production system on Wednesday, writes labels, submits PR, lands by Friday.
- Eval runs nightly against the latest committed labels.
- Reasonable assumptions:
- Trending changes "weekly enough" that weekly labels are fresh.
- The labels collected from production are themselves a reasonable proxy for ground truth (since production is what we'd compare against).
- A "right answer" for "what's trending" is unambiguous given a trending list.
What this gets wrong: trending changes hourly during news/anime/release events; labels collected from production are just "what production thought yesterday," which means the eval can't tell if production is wrong; and the "right answer" is conditioned on a time window that's not in the schema.
The New Reality
- The label has a freshness half-life of an hour or two. A "trending" label collected at 9 am Monday is wrong by 9 pm Monday if a chapter dropped or a tie-in aired.
- Ground truth depends on the time the user asked, not the time the eval runs. If the user asked at 2 pm and the bot said "Chainsaw Man chapter X," and the eval at midnight says "the actual top-3 was different at midnight," the eval is wrong about the question. The right comparison is "what was trending at 2 pm."
- Production-derived labels are circular. Using yesterday's production output as today's ground truth means you can never detect that production has been wrong for weeks — the eval drifts with the system.
- Tie-in events break the pattern. Anime episode airings, awards, a mangaka's death, Comic-Con — these create discontinuities the steady-state pipeline isn't built for. The Kinesis tumbling-window architecture is correctly responsive to these spikes; the eval is not.
- There is no "right answer" frozen anywhere. "Trending" is a derived statistic, not a fact. It's the output of a formula on engagement signals — the truth is whatever the formula says it is at the time of asking.
The schema isn't (query → trending_set). It's (query, timestamp_window) → trending_set_at_window. Without the time key, the eval is undefined.
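A minimal sketch of the time-keyed label schema, assuming Python dataclasses; field names are illustrative, not the production schema:

from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class TrendingLabel:
    # The question as asked ("what's trending in shounen?").
    query: str
    # The window the answer is about — without this key the label is undefined.
    window_start: datetime
    window_end: datetime
    # The trending set that held during that window, in rank order.
    trending_set: tuple[str, ...]

Lookups are keyed by (query, window), never by query alone.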
Why Naive Approaches Fail
- "Just relabel weekly." Already what they're doing. Stale by Monday afternoon.
- "Relabel hourly." Cost and operational overhead make this prohibitive for human labels. And it doesn't help: by the time you relabel and run eval, the answer has changed again.
- "Use production as ground truth." Circular. You can't detect production drift.
- "Just trust the trend-score formula." The formula is the system; using it as ground truth means you're testing the formula against itself. Useful for consistency checks, useless for correctness checks.
- "Eval less often." Doesn't help — the problem isn't eval frequency, it's that the eval target moves faster than the eval can be set up.
Detection — How You Notice the Shift
Online signals.
- User-reported staleness. "The bot says X is trending but I checked and Y is everywhere now." Especially loud during anime airing windows.
- Click-through gap on trending answers. If "what's trending" answers get clicked at 8% but "newest releases" gets 14%, users are bouncing past trending — likely because it's stale.
- Refresh-rate-vs-engagement curve. What's the latency between a trend-score change in the upstream pipeline and the bot's first answer reflecting it? If > 5 minutes for a high-velocity trend, the bot is structurally lagging.
Offline signals.
- Snapshot comparison. Did the eval label at time T match the production output at time T (not at midnight)? Re-run with time-aligned labels. If the new eval passes when the old eval failed, your eval was the problem.
- Time-since-label. Plot eval pass rate vs label age. If pass rate drops monotonically with age (steeper than real model regression would predict), eval staleness is dominant; a binning sketch follows this list.
- Inter-rater agreement on trending labels. Even on a freshly-labeled set, do two analysts agree on "what's trending"? If agreement is < 0.6, the question itself is under-specified — fix the rubric before chasing better data.
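A minimal binning sketch for the time-since-label signal, assuming each eval record carries the label timestamp and a pass/fail outcome; the record shape is illustrative:

from collections import defaultdict

def pass_rate_by_label_age(records, bucket_hours=6):
    """records: iterable of dicts with 'eval_time', 'label_time' (datetime) and 'passed' (bool)."""
    buckets = defaultdict(lambda: [0, 0])  # bucket index -> [passes, total]
    for r in records:
        age_h = (r["eval_time"] - r["label_time"]).total_seconds() / 3600
        b = int(age_h // bucket_hours)
        buckets[b][1] += 1
        buckets[b][0] += int(r["passed"])
    # If pass rate falls steadily as label age grows, staleness dominates real regression.
    return {b * bucket_hours: p / t for b, (p, t) in sorted(buckets.items())}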
Distribution signals.
- Latency distribution from upstream signal to bot answer. Histogram. p50 should be in the seconds-to-minutes range; p99 needs an explicit ceiling.
- Trending-set churn over time. How fast does the top-10 turn over? If 30% of the top-10 changes hour-over-hour during a high-event window, eval cadence must match that rate or be designed differently.
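A small sketch of hour-over-hour top-10 churn, assuming an ordered list of (timestamp, top_10) snapshots; names are illustrative:

def top10_churn(snapshots):
    """snapshots: time-ordered list of (timestamp, top_10 list). Returns per-step churn fractions."""
    churn = []
    for (_, prev), (t, curr) in zip(snapshots, snapshots[1:]):
        replaced = len(set(prev) - set(curr))  # titles that fell out of the top-10
        churn.append((t, replaced / max(len(prev), 1)))
    return churn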
Architecture / Implementation Deep Dive
flowchart TB
subgraph Stream["Real-time pipeline (existing)"]
KIN["Kinesis stream<br/>events"]
KDA["Kinesis Data Analytics<br/>tumbling 5-min windows"]
SCO["Trend score per title<br/>over rolling window"]
TOP["Top-N cache<br/>(refreshed every minute)"]
end
subgraph GT["Time-anchored ground truth (NEW)"]
SNAP["Continuous snapshot<br/>of upstream signal +<br/>computed trend set<br/>every 5 min"]
REPLAY["Replay store<br/>= materialized<br/>(timestamp, trend_set)"]
end
subgraph Bot["Serving"]
ANS["Bot answer<br/>(timestamped at serve)"]
end
subgraph Eval["Time-aware eval"]
SHADOW["Shadow eval:<br/>compare answer.timestamp to<br/>replay[answer.timestamp]"]
DELTA["Continuous metric:<br/>top-K precision in rolling 1h"]
RUBRIC["Rubric snapshot per event<br/>(anime aired, chapter dropped)"]
end
KIN --> KDA --> SCO --> TOP --> ANS
SCO --> SNAP --> REPLAY
ANS --> SHADOW
REPLAY --> SHADOW --> DELTA
RUBRIC --> SHADOW
style SNAP fill:#fde68a,stroke:#92400e,color:#111
style SHADOW fill:#dbeafe,stroke:#1e40af,color:#111
style DELTA fill:#dcfce7,stroke:#166534,color:#111
1. Data layer — replay store, not frozen golden set
Instead of a hand-labeled golden set, maintain a replay store: every 5 minutes, snapshot the upstream signal stream and the computed trend set into S3 (or a time-series store). Each snapshot is (timestamp, signal_state, computed_top_N).
The replay store is the system's own output, but stored in a way that lets you go back and ask "what was the top-N at any time in the past N days." It is not the ground truth — it's the system's claim about ground truth at that time.
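A minimal replay-store sketch, assuming snapshots land as timestamped records and a lookup returns the latest snapshot at or before a given time; the class and field names are illustrative, not the production storage layer:

import bisect
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Snapshot:
    timestamp: datetime
    signal_state: dict          # raw engagement counters per title
    computed_top_N: list[str]   # the formula's output at this timestamp

class ReplayStore:
    def __init__(self, snapshots):
        # Kept sorted by timestamp so lookups are a binary search.
        self._snaps = sorted(snapshots, key=lambda s: s.timestamp)
        self._times = [s.timestamp for s in self._snaps]

    def get(self, t: datetime) -> Snapshot:
        """Latest snapshot taken at or before t — the system's claim of truth at t."""
        i = bisect.bisect_right(self._times, t)
        if i == 0:
            raise LookupError("no snapshot at or before requested time")
        return self._snaps[i - 1]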
The actual ground truth comes from a sparser, slower process:
- Event-anchored labels. Each "interesting" event (anime aired, chapter dropped, awards announced) becomes a labeled event with the time window during which the trending answer should reflect it. A label looks like (event_id, expected_in_top_5_after=2026-04-25T19:00, expected_decay_by=2026-04-26T07:00). These are written by analysts watching events; they're sparse but high-signal. A minimal sketch of the label shape and lookup follows this list.
- Disagreement triage. When the bot's answer at time T disagrees with the replay store at time T, that's a serving-side issue, not a correctness finding. When the bot's answer at time T disagrees with an event-anchored label, that's the diagnostic signal.
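A minimal event-label sketch matching the shape above, with the covering() lookup the shadow eval below relies on; names are hypothetical, and the ticket fields in the governance section map onto the same object:

from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class EventLabel:
    event_id: str
    expected_titles: tuple[str, ...]      # titles that should appear in the top-K
    expected_in_top_5_after: datetime     # window start
    expected_decay_by: datetime           # window end

class EventLabels:
    def __init__(self, labels):
        self._labels = list(labels)

    def covering(self, t: datetime) -> list[EventLabel]:
        """All labels whose expected window covers time t."""
        return [l for l in self._labels
                if l.expected_in_top_5_after <= t <= l.expected_decay_by]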
2. Pipeline layer — shadow eval, continuous
Every bot answer that touched the trending tool is logged with a timestamp. A continuous shadow eval (cron every 15 min) joins:
- The bot answer at t_serve.
- The replay store at t_serve (what the bot would have said if it ran "now" but with the data from t_serve).
- Any event-anchored labels whose window covers t_serve.
It computes:
- Self-consistency: did the bot's actual answer match what the replay would have produced? Drift here means a serving-side bug (caching, routing, prompt issue).
- Event-anchored correctness: during a labeled event window, did the bot include the event-trending title in the top-K?
def shadow_eval_round():
    # Runs every 15 minutes. recent_answers, top5, top10, and log are assumed
    # helpers; replay_store and event_labels follow the sketches above.
    answers = recent_answers(last_minutes=15, tool="trending")
    for a in answers:
        # The system's own claim about trending at the moment the answer was served.
        replay_state = replay_store.get(a.timestamp)
        # Self-consistency: the bot's top-5 citations should sit inside the
        # top-10 the pipeline had computed at serve time.
        self_consistency_ok = (set(top5(a.cited_titles))
                               <= set(top10(replay_state.computed_top_N)))
        # Event-anchored correctness: every expected title for events whose
        # window covers t_serve must appear in the cited titles.
        events = event_labels.covering(a.timestamp)
        event_ok = all(
            t in a.cited_titles for ev in events for t in ev.expected_titles
        ) if events else None
        log(a, replay_state, self_consistency_ok, event_ok)
The eval window is small (15 min) and the data is rolling. There's no quarterly "freeze" — the metric is always "rolling-1-hour event-anchored precision."
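A sketch of the rolling metric over the logged shadow-eval rows, assuming each row carries the serve timestamp and the event_ok verdict (None when no event window covered it); names are illustrative:

from datetime import datetime, timedelta

def rolling_event_precision(rows, now: datetime, window=timedelta(hours=1)):
    """rows: iterable of dicts with 'timestamp' (datetime) and 'event_ok' (True/False/None)."""
    scored = [r for r in rows
              if now - window <= r["timestamp"] <= now and r["event_ok"] is not None]
    if not scored:
        return None  # no event-covered answers in the window: report "no signal", not 100%
    return sum(r["event_ok"] for r in scored) / len(scored)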
3. Serving layer — freshness budget and version stamp
The bot answer carries a freshness stamp:
"Top trending in shounen right now (as of 2026-04-25 19:32):
1. Chainsaw Man (anime ep 12 just aired)
2. ..."
The "as of" string lets the user calibrate, lets the eval pin the comparison, and forces the system to commit to a freshness moment. Internally, the answer object carries:
{
"answer_text": "...",
"trend_set": ["title-1", "title-2", "..."],
"as_of": "2026-04-25T19:32:00Z",
"upstream_window_end": "2026-04-25T19:30:00Z",
"data_age_seconds": 120
}
data_age_seconds is a serving-side SLO: if it grows above 600 (10 min) the trending tool routes to a "trending data delayed, here's the most recent steady-state list" fallback rather than confidently serving stale data.
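A sketch of the freshness-budget check at serve time, assuming the answer object above and a precomputed steady-state fallback list; the threshold constant and helper names are illustrative:

from datetime import datetime, timezone

FRESHNESS_BUDGET_SECONDS = 600  # 10 minutes, per the serving SLO above

def route_trending_answer(trend_set, upstream_window_end: datetime, steady_state_list):
    # upstream_window_end is assumed to be timezone-aware (UTC).
    now = datetime.now(timezone.utc)
    data_age = (now - upstream_window_end).total_seconds()
    if data_age > FRESHNESS_BUDGET_SECONDS:
        # Don't serve stale data confidently: say it's delayed, fall back to steady state.
        return {
            "mode": "fallback",
            "answer_intro": "Trending data is delayed; here are recent steady-state favourites.",
            "trend_set": steady_state_list,
            "data_age_seconds": int(data_age),
        }
    return {
        "mode": "fresh",
        "as_of": now.isoformat(),
        "trend_set": trend_set,
        "data_age_seconds": int(data_age),
    }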
4. Governance — events, not labels
Analysts manage event windows, not flat label lists. The workflow:
- An event happens (anime aired, awards, mangaka health news).
- An analyst opens an "event ticket" within minutes — a small structured object with (event_id, expected_titles, expected_window_start, expected_window_end, evidence_url); an example payload follows below.
- The shadow eval picks up the event from the next round and starts comparing.
- After the window closes, the event becomes part of the historical eval; cumulative event-precision is the headline metric.
Event tickets are cheaper than a 150-prompt golden set refresh because they're sparse — maybe 5–15 events per week, each touching only a handful of expected titles. They're also higher-signal because they correspond to real moments where the system needs to be right.
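A hypothetical event-ticket payload with the fields from the workflow above; values are illustrative:

event_ticket = {
    "event_id": "anime-ep12-airing-2026-04-25",
    "expected_titles": ["Chainsaw Man"],
    "expected_window_start": "2026-04-25T19:00:00Z",
    "expected_window_end": "2026-04-26T07:00:00Z",
    "evidence_url": "https://example.com/airing-schedule",  # placeholder source link
}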
Trade-offs & Alternatives Considered
| Approach | Freshness | Operational cost | Auditable | Verdict |
|---|---|---|---|---|
| Weekly hand-labeled golden set | Stale (days) | Medium | Yes | Original — wrong cadence |
| Hourly hand-labeled | Fresh | Very high | Yes | Cost-prohibitive |
| Replay store + event-anchored labels + continuous shadow | Fresh | Medium | Yes | Chosen |
| Self-consistency only (replay vs serve) | Fresh | Low | Detects bugs, not correctness | Necessary but not sufficient |
| Trust the formula | n/a | None | No correctness signal | Tests system against itself |
| Crowdsource trending labels | Fresh | High variance | Hard to audit | Useful for pre-launch only |
The chosen approach trades the certainty of a hand-labeled set for freshness, which is the binding constraint for an hourly-half-life truth. Event tickets give back the audit + correctness signal at a fraction of the cost.
Production Pitfalls
- Replay store size grows fast. A year of 5-minute snapshots is ~100K entries × payload size. Use TTL: keep 30 days at full fidelity, downsample older to hourly, drop after 90 days. Audit obligations on this data are typically short.
- Event tickets are easy to skip. Analysts get busy. Shipping events without tickets means the metric loses signal silently. Track "events expected vs events ticketed" as a meta-metric.
- Self-consistency can pass while serving is broken. If a caching bug serves a stale list and the replay store reads the same stale cache, both agree. Replay store must be derived from the upstream signal, not from the cached top-N. This is a subtle architectural choice and must be defended.
- Multiple time zones confuse "as of." Always serve "as of" in the user's locale or UTC consistently — never in the cluster's local time, which has caused incident-class confusion in similar systems.
- Event windows overlap. A new anime airs at 7 pm, a chapter drops at 9 pm — both events are live at 9:30 pm. The eval needs to handle multi-event coverage; "expected titles" for a window is a union across events.
- Trending fallback during data delay must not confidently mislead. If data_age_seconds > 600, do not serve the bot answer with full confidence; explicitly tell the user the data is delayed and offer steady-state recommendations instead.
Interview Q&A Drill
Opening question
Your "what's trending" feature has a labeled eval set that's refreshed weekly. The eval has been failing every Monday morning, but customer escalations don't match — users seem happy. What's wrong with the eval?
Model answer.
The eval and the system are working at incompatible time scales. Trending has an hour-scale half-life; the eval has a week-scale label cadence. Every Sunday-night anime airing or release event makes the labeled "what was trending last week" stale by the time the eval runs Monday morning, but production — driven by a real-time pipeline — has already moved on correctly. The eval is grading yesterday's exam.
The fix is to anchor labels in time, not in a frozen list. Three pieces. (1) A replay store that snapshots the upstream signal and computed trending set every 5 minutes — that's the system's claim about ground truth at that moment. (2) Event-anchored labels: for each "interesting" event (anime aired, chapter dropped), an analyst writes a tiny structured ticket — (event, expected titles, expected window). These are sparse, fast, and high-signal. (3) A continuous shadow eval that joins bot answers, the replay state at serve time, and any event windows covering that time. The metric is rolling, not weekly. Self-consistency catches serving bugs, event-anchored correctness catches model failures.
The conceptual move: ground truth for trending is (timestamp, trending_set), not just trending_set. Every label needs a time key. The eval grades the system at the time it answered, using the truth that existed at that time.
Follow-up grill 1
"The replay store is the system's own output." Isn't using it as ground truth circular?
It is, for correctness — and that's why event-anchored labels exist. The replay store gives you self-consistency (did the bot's answer match what the system thought at that moment), which catches serving bugs, caching bugs, prompt regressions. Event-anchored labels give you correctness (during the anime event window, was the airing title in the top-5), which catches model and pipeline regressions. The two are complementary: replay alone tests the system against itself; event labels alone don't have the volume to catch subtle drift. Together they cover the failure space at a feasible cost.
The architecture is honest about this: the replay store is not ground truth. It's a fast-iteration anchor for self-consistency. The slow but trustable correctness signal is the event-ticket workflow. Calling them by their right names matters.
Follow-up grill 2
Event tickets are written by humans. What stops them from being slow, partial, or biased toward the events the analysts noticed?
Three protections. (1) Coverage tracking — for each known event source (anime release calendars, chapter release calendars, awards calendars), the system auto-detects events from the source and creates a "ticket needed" reminder. Tickets that aren't filled within 2 hours auto-flag. The metric "events expected vs events ticketed" gets dashboarded and tracked as the upstream signal of label quality. (2) Sampling. Event tickets are sparse, but for a window with no events, a small daily sample of "random-time disagreement triage" surfaces drift not tied to any particular event. (3) Cross-region coverage. Different regions have different events; the analyst rotation includes JP, US, BR, EU coverage so events outside one region's awareness still get ticketed.
The bias is real: events that don't have a calendar (a viral fan post, an unexpected death) are harder to ticket. The mitigation is the random-sample triage — it's slower-signal but covers what the calendar-driven workflow misses.
Follow-up grill 3
Your "as of" stamp on bot answers leaks system internals. Is that good?
It's a deliberate trade-off and yes, it's good. Three reasons. (1) Calibrates user trust. "As of 2 minutes ago" tells users the answer is fresh; "as of 2 hours ago" tells users to take it lightly. Without it, all answers feel equally authoritative, which is wrong. (2) Disambiguates user complaints. When a user says "the bot was wrong," you can look up (answer.timestamp, replay_store[timestamp]) and check immediately. The audit cost falls dramatically. (3) Forces the serving system to commit. If the bot can't compute a fresh as_of, the data path has a bug — make it visible.
The risk: users see "as of 30 seconds ago" and assume the answer is true, not just fresh. The cure is honest copy: "Top trending right now in shounen (data updated 30 seconds ago — popularity changes hourly)." The freshness stamp is about the data, not the correctness.
Follow-up grill 4
Your continuous shadow eval runs every 15 minutes. The team wakes up to a metric drop — by the time someone investigates, two more rounds have run. How do you not drown in noise?
Three layers. (1) Latched alerts. Don't page on a single 15-min window dropping; page on a sustained drop of N consecutive windows or on a rapid drop of magnitude X. The signal-to-noise ratio is much higher on persistent regressions. (2) Event-window context. During known event windows (anime night, chapter-drop time), the eval is naturally more volatile because the system is actively responding to a spike. Calibrate thresholds per time-of-day and per known-event-class; a Sunday 7 pm dip is different from a Tuesday 3 am dip. (3) Diagnose-before-page. The first level of automation isn't "page a human" — it's "auto-classify the regression": is it self-consistency (likely a serving bug) or event-anchored (likely a pipeline issue)? The classification routes to the right team and is annotated when a human eventually looks. Without classification, every alert is "the trending feature is broken," which is operationally unsustainable.
Architect-level escalation 1
Six months out, the company adds a "personalized trending" feature: trends are filtered by what's likely to interest the individual user. Now ground truth is (user, timestamp, personalized_trending_set). How does your eval architecture extend?
The state explodes. Two principles to keep the explosion bounded.
(1) Decompose the eval into "is the candidate set right" + "did personalization re-rank correctly." The candidate set (e.g., top-50 trending overall) can be evaluated with the existing time-anchored architecture — it's the same problem, same scale. The personalization layer over the candidate set is a separate eval problem with its own ground truth (is item X relevant to user U today?) that overlaps scenario 03 — implicit-feedback labels, decay-aware metrics, confidence signal. Keep them separate. The combined metric ("correct personalized trending answer") is a product of two independent evaluations, not a giant joint label.
(2) Don't try to label per-user trending. It's intractable and noisy. Instead, cohort-anchored personalization eval — for users in cohort C (e.g., "shounen-leaning, JP-region"), the personalized trending should differ from the overall trending in directionally-predictable ways. Define cohort-level expectations (cohort C should over-index on JP-language anime tie-ins) and check those. It's coarser than per-user but it's grounded.
The deeper move: ground truth doesn't have to be at the same granularity as the prediction. A per-user prediction can be evaluated with cohort-level ground truth as long as the cohort definition is honest about what's being checked. Acknowledging that some personalization decisions can't be eval'd in isolation is more honest than fabricating per-user labels.
Architect-level escalation 2
A regulator audits trending claims: "show me every time the bot said X was trending and prove it actually was." How does your architecture answer?
The replay store + event tickets + per-answer logging carry the audit. The chain is:
- Answer log (conversation_id, turn_id, answer_text, trend_set, as_of, upstream_window_end).
- Replay store snapshot at upstream_window_end — the system's claim of trending at that time.
- Upstream signal stream at the same window, retained at full fidelity for the audit window — the raw engagement data the formula was applied to.
For any answer, the auditor can reconstruct: "at time T, the system received signals S, the formula computed top-N as [...], the bot said top-K subset of that." That chain is verifiable without re-running the model. Whether the formula itself is reasonable is a separate audit question (formula = methodology); that's documented and versioned in the platform code.
The harder regulator question: "are there cases where the system was wrong but no event ticket existed?" The honest answer is yes — event ticketing is sparse. The protections are (1) random-sample triage covers the no-event case statistically; (2) calendared events have ticket-coverage SLAs; (3) the upstream raw stream is retained, so a forensic audit on a specific answer is always possible even without an existing ticket. Regulators usually want demonstrable process and reproducibility, not perfect coverage; the architecture provides both.
The architectural commitment that survives: trending is provably-derived from logged signals. The truth may be momentary, but the trace is permanent.
Red-flag answers
- "Refresh the labels more often." (Doesn't catch the structural mismatch.)
- "Use production as ground truth." (Circular.)
- "Trust the trend-score formula." (Tests system against itself.)
- "Eval less often so labels stay fresh." (Doesn't help — labels still age.)
- "Add a smoothing layer to make it less twitchy." (Hides the problem; users want fresh.)
Strong-answer indicators
- Names ground truth as (timestamp, trending_set) — time is a primary key.
- Distinguishes self-consistency (replay) from correctness (event tickets).
- Has continuous-eval cadence matched to label half-life, not calendar.
- Logs as_of on user-facing answers, with honest copy.
- Recognizes the regulator question is about traceability, not perfection.
- Resists per-user trending labels in favor of cohort-anchored eval.