GenAI Scenario 08 — RAG Corpus Staleness at 10× Scale
TL;DR
The shared retrieval pipeline serves seven MCP servers off a single embedding service + OpenSearch fleet. Eighteen months in, the corpus is 10× its launch size (driven mostly by Catalog growth, Review ingestion, and Support-Policy version retention), and the embedding model has been upgraded once. Recall@10 on the original golden set looks fine; CSAT and deflection have been quietly slipping. Two failure modes have compounded: (a) a non-trivial fraction of indexed chunks are stale — the underlying source has changed but the embedding wasn't refreshed — and (b) at this scale, the ANN index is paying a price in tail-latency variance that subtly disfavors fresh chunks (newer shards are less optimized than older ones). Aggregate metrics report fine because old chunks dominate retrievals; new chunks fail to win even when they should. The fix shape is chunk-level freshness as ground-truth metadata, scale-aware retrieval evaluation that explicitly tests the long tail, blue/green re-index with cohort-aware promotion, and a freshness SLO independent of recall.
Context & Trigger
- Axis of change: Scale (corpus size and index complexity grew an order of magnitude) + Time (chunks indexed long ago no longer reflect their sources).
- Subsystem affected: RAG-MCP-Integration/09-rag-retrieval-pipeline-deep-dive.md — the shared embedding service, OpenSearch HNSW index, RRF fusion, cross-encoder reranker, blue/green re-index. Knock-on effects in RAG-MCP-Integration/01-catalog-search-mcp.md, 04-review-sentiment-mcp.md, 05-support-policy-mcp.md.
- Trigger event: A monthly metric review notices that "stale chunk wins" — winning retrieved chunks whose indexed_at is older than the source's last_modified_at — has crept from 2% to 11% over six months. CSAT on tasks that depend on freshness (latest volume, current price, current policy) has dropped 1.5–2% over the same period. No single incident triggered the investigation.
The Old Ground Truth
The shared retrieval pipeline was built with care:
- Single embedding model, multilingual, fine-tuned on early manga corpus.
- OpenSearch with HNSW kNN + BM25 hybrid, RRF fusion, cross-encoder reranker top-50 → top-10.
- Blue/green re-index on a 7-day cadence (already in the architecture).
- Recall@10 monitoring against a stable golden set.
- Reasonable assumptions:
- Re-indexing weekly catches most freshness issues.
- The HNSW graph is uniformly performant across corpus age and shard age.
- "Recall@10 on the golden set" approximates "production retrieval quality."
What this misses, at scale: HNSW indexes don't behave uniformly as they grow; weekly re-indexing is too slow for some content types; the golden set under-represents the long tail that growth creates; and "stale chunk" is a category of failure the metrics don't see.
The New Reality
- Scale changes the index's behavior. At 10× corpus size, HNSW build time grew, recall-vs-latency frontiers shifted, and parameter choices that were good at 1× are suboptimal now. The index is correct, but its shape has drifted.
- Older shards win more often than they should. When the pipeline re-indexed quarterly through several model versions, older chunks accumulated quality (hand-tuned analyzer rules, alias enrichment) that newer chunks don't have. Newer chunks compete on a less-favorable footing — and since the ranking is per-chunk, the system silently favors older content.
- "Stale" is a real label and the system doesn't carry it. A chunk indexed in January for a series whose volume count changed in March is technically retrievable, technically scored well, and technically the wrong answer. The metric schema doesn't have "stale" as a concept.
- Long-tail recall is invisible. Aggregate recall@10 is dominated by the head distribution. The long tail (rare titles, recently added content, niche cohorts) recalls poorly but doesn't move the aggregate.
- Re-indexing has scale-driven side effects. A full re-index now takes 14 hours instead of 2; partial re-indexes risk inconsistency between shards; the blue/green swap window grew.
- The cross-encoder reranker is calibrated against an outdated distribution. The top-50 it sees today are different from the top-50 it was trained on. It still picks reasonable winners, but its score distribution has shifted in ways that affect downstream confidence thresholds.
Why Naive Approaches Fail
- "Re-index more often." Doesn't help if the re-index pipeline itself is the bottleneck. And without a freshness metric, you can't tell if more frequent re-indexing actually fixes the problem.
- "Increase HNSW ef_construction/M." May help recall but raises latency and storage. Without scale-aware testing, you're guessing.
- "Add more eval queries." Without explicit long-tail stratification, you just add more head-distribution queries that already pass.
- "Refresh the embedding model." Solves nothing on its own; if the pipeline + eval are the problem, a new model just inherits both.
- "Trust the golden set." The golden set was sampled when the corpus was 1×; it under-represents the long tail at 10×.
- "Migrate to a different vector DB." Tempting but premature — the issues are operational and architectural, not a particular DB's fault.
Detection — How You Notice the Shift
Online signals.
- Stale-chunk-wins rate. Fraction of winning retrievals where chunk.indexed_at < source.last_modified_at. The headline metric this scenario is built around. Was 2%, now 11%.
- CSAT on freshness-sensitive intents. Volume count, latest release, current price, current policy. These are the intents most punished by stale chunks.
- p95/p99 retrieval latency by shard. Older shards may be faster (better-cached) but serve stale data; newer shards may be slower. Both extremes are signals.
- "Did not find" rate on tail cohorts. Specifically queries for newly-added titles, niche genres, low-volume locales.
Offline signals.
- Tail-cohort recall@10. Stratify the golden set into head, body, tail by query frequency or title age, and measure each separately.
- Drift between source and chunk. Periodically diff source.content_hash against chunk.content_hash; mismatches are stale indexed content.
- Cross-encoder score distribution drift. Reranker score histograms over time. Significant shifts (mean, variance, mode) suggest the reranker is operating off-distribution.
- Re-index regression cohorts. After a re-index, score-vs-baseline per cohort. Long-tail cohorts often regress in ways head cohorts don't.
Distribution signals.
- Mean age of winning chunk. If mean now - chunk.indexed_at increases over time, retrieval is favoring older content.
- Per-shard recall variance. If recall is uniform across shards, the index is healthy; if some shards systematically lose, they're under-tuned.
- Corpus growth rate vs re-index rate. If corpus doubles in 6 months and re-index cadence stays weekly, freshness-per-unit-content is halving.
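The distribution signals above reduce to simple aggregations over the serving log. A minimal sketch of the first one — mean age of winning chunks — assuming a hypothetical log record with an ISO-8601 `indexed_at` field:

```python
from datetime import datetime, timezone

def mean_winning_chunk_age_days(wins, now=None):
    """Mean age of winning chunks in days.

    A rising trend means retrieval is favoring older content.
    `wins` is a list of dicts with an ISO-8601 `indexed_at` field
    (hypothetical log schema, not from the scenario itself).
    """
    now = now or datetime.now(timezone.utc)
    ages = [
        (now - datetime.fromisoformat(w["indexed_at"])).total_seconds() / 86400
        for w in wins
    ]
    return sum(ages) / len(ages)

wins = [
    {"indexed_at": "2026-01-10T00:00:00+00:00"},  # 100 days old
    {"indexed_at": "2026-04-10T00:00:00+00:00"},  # 10 days old
]
now = datetime(2026, 4, 20, tzinfo=timezone.utc)
assert mean_winning_chunk_age_days(wins, now=now) == 55.0
```

The same fold over the log, stratified per shard or per cohort, gives the per-shard recall-variance and growth-vs-cadence views.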
Architecture / Implementation Deep Dive
flowchart TB
subgraph Source["Source data"]
DS["Catalog · Reviews · Policies ·<br/>Trending · Cross-Title"]
end
subgraph Ingest["Ingest pipeline"]
DELTA["Source-delta watcher<br/>(content_hash diff)"]
EMB["Embedding service"]
IDX["Index writer<br/>(blue/green + per-cohort)"]
end
subgraph Index["Retrieval index"]
OS["OpenSearch HNSW + BM25<br/>per-cohort shard layout"]
META["Per-chunk metadata:<br/>indexed_at, source_hash,<br/>cohort, freshness_class"]
end
subgraph GT["Ground-truth layer"]
TAIL["Tail-stratified golden set<br/>(head · body · tail × intent)"]
FRESH["Freshness SLO per cohort<br/>(content type → max_staleness)"]
REPLAY["Stale-chunk-wins rate<br/>continuous metric"]
end
subgraph Serve["Serving"]
Q["Query"]
FUSE["BM25 + kNN + RRF"]
RR["Reranker<br/>(score-distribution monitored)"]
OUT["Result + chunk metadata<br/>(age, source_hash)"]
end
DS --> DELTA --> EMB --> IDX --> OS
DELTA --> META --> OS
OS --> FUSE --> RR --> OUT
OUT --> REPLAY
OUT --> FRESH
TAIL -.->|eval| FUSE
FRESH -->|gate| IDX
style META fill:#fde68a,stroke:#92400e,color:#111
style TAIL fill:#dbeafe,stroke:#1e40af,color:#111
style FRESH fill:#fee2e2,stroke:#991b1b,color:#111
style REPLAY fill:#dcfce7,stroke:#166534,color:#111
1. Data layer — chunk-level freshness as first-class metadata
Every chunk carries:
{
"chunk_id": "...",
"source_id": "...",
"source_hash": "sha256:...", // hash of the source's last_modified content
"indexed_at": "2026-04-15T...",
"model_version": "embed-v3.2",
"cohort": "catalog | review | policy | trending | crosslink",
"freshness_class": "static | quarterly | weekly | daily | hourly"
}
The freshness_class is set by content type. Catalog facts are "weekly," reviews are "quarterly" (their meaning rarely changes), policies are "static within version," trending is "hourly." Each class has a max-staleness SLO.
Stale = now() - chunk.indexed_at > max_staleness(chunk.freshness_class). This is a deterministic metric and a useful pre-eval gate.
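The staleness predicate is deterministic enough to sketch directly. The `MAX_STALENESS` values below are illustrative, not the scenario's real SLOs — those come from content owners per freshness class:

```python
from datetime import datetime, timedelta, timezone

# Illustrative per-class max-staleness SLOs (assumed values).
MAX_STALENESS = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(days=7),
    "quarterly": timedelta(days=90),
    "static": timedelta.max,  # never stale by age alone; versioned separately
}

def is_stale(chunk, now=None):
    """Stale = now() - chunk.indexed_at > max_staleness(chunk.freshness_class)."""
    now = now or datetime.now(timezone.utc)
    age = now - datetime.fromisoformat(chunk["indexed_at"])
    return age > MAX_STALENESS[chunk["freshness_class"]]

now = datetime(2026, 4, 20, tzinfo=timezone.utc)
fresh = {"indexed_at": "2026-04-18T00:00:00+00:00", "freshness_class": "weekly"}
stale = {"indexed_at": "2026-01-10T00:00:00+00:00", "freshness_class": "weekly"}
assert not is_stale(fresh, now=now)
assert is_stale(stale, now=now)
```

Because it's pure metadata arithmetic, the same predicate runs cheaply as a pre-eval gate and as a continuous scan over the index.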
2. Pipeline layer — source-delta-driven re-index
Move from "re-index everything weekly" to "re-index what changed":
def delta_reindex():
sources_changed = source_delta_watcher.poll() # content_hash differs from chunk
for src in sources_changed:
new_chunks = chunk_and_embed(src)
index_write(new_chunks, mode="blue-green-per-cohort")
# Background full re-index runs monthly to catch anything the delta pipeline missed
if monthly_due():
full_reindex()
Per-cohort blue/green: catalog cohort can re-index independently of policy cohort. A change in catalog doesn't trigger a full pipeline rebuild.
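The per-cohort independence can be modeled as each cohort's live alias pointing at exactly one physical index, with promotion swapping only that cohort's pointer. In OpenSearch this would map to an atomic `_aliases` actions request; the sketch below is an in-memory stand-in, not the real API:

```python
class CohortAliases:
    """Minimal model of per-cohort blue/green promotion.

    Each cohort's live alias points at one physical index; promoting a
    cohort swaps only that cohort's alias and keeps the previous index
    on standby for rollback. Index names here are made up.
    """
    def __init__(self):
        self.live = {}  # cohort -> physical index name

    def promote(self, cohort, new_index):
        previous = self.live.get(cohort)
        self.live[cohort] = new_index  # atomic per cohort
        return previous  # standby for rollback

aliases = CohortAliases()
aliases.promote("catalog", "catalog-2026-04-20")
aliases.promote("policy", "policy-2026-04-01")

old = aliases.promote("catalog", "catalog-2026-04-27")
assert old == "catalog-2026-04-20"                      # rollback target retained
assert aliases.live["policy"] == "policy-2026-04-01"    # untouched by catalog swap
```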
The full monthly re-index catches:
- Embedding model upgrades (new model needs the whole corpus re-embedded).
- Drift between the delta watcher and reality (rare but happens).
- Index parameter retunes (ef_construction, M).
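The delta watcher's core is a hash diff between current source content and the hash recorded at index time. A sketch of what `source_delta_watcher.poll()` could compute, with made-up data shapes (`sources` and `indexed_hashes` maps):

```python
import hashlib

def content_hash(text: str) -> str:
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()

def changed_sources(sources, indexed_hashes):
    """Return source_ids whose current content hash differs from the
    hash recorded on their chunks at index time.

    `sources` maps source_id -> current content; `indexed_hashes` maps
    source_id -> recorded source_hash (hypothetical shapes). Sources
    missing from the index count as changed: they need first indexing.
    """
    return [
        sid for sid, text in sources.items()
        if indexed_hashes.get(sid) != content_hash(text)
    ]

sources = {"s1": "vol count: 12", "s2": "policy v3 text"}
indexed = {"s1": content_hash("vol count: 11"),  # source updated since indexing
           "s2": content_hash("policy v3 text")}
assert changed_sources(sources, indexed) == ["s1"]
```

Hashing only detects changes when the watcher actually reads the source — which is why the monthly full re-index stays as the safety net.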
3. Serving layer — freshness-aware fusion
The reranker fuses signals as before, but with a freshness-aware tiebreaker:
def rerank(query, candidates, top_k=10):
    # normalized_staleness(c): 0.0 for fresh, 1.0 for fully-stale-by-class
    scored = [(c, cross_encoder.score(query, c), normalized_staleness(c)) for c in candidates]
    # Penalty multiplier: 1.0 for fresh, down to 0.85 for fully stale
    penalized = [(c, s * (1 - 0.15 * st)) for c, s, st in scored]
    penalized.sort(key=lambda x: -x[1])
    return [c for c, _ in penalized[:top_k]]
Stale chunks aren't blocked — they may still be the best match — but they have to be measurably better to win against a fresh chunk. The penalty is small (15%) because correctness still matters more than freshness; the penalty exists to break ties in favor of fresher content.
The serving response includes chunk metadata for downstream observability:
{
"answer": "...",
"winning_chunks": [
{"id": "c1", "indexed_at": "2026-04-22", "source_hash": "...", "freshness_status": "fresh"},
{"id": "c2", "indexed_at": "2026-01-10", "source_hash": "...", "freshness_status": "stale-by-class"}
]
}
This metadata is logged for the stale-chunk-wins continuous metric.
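Given that logged response shape, the continuous metric is a single fold over served answers. A sketch, assuming the `winning_chunks` / `freshness_status` fields shown above:

```python
def stale_chunk_wins_rate(served_answers):
    """Fraction of winning chunks whose freshness_status is not 'fresh'.

    `served_answers` follow the logged response shape from the serving
    layer (field names as in the example payload above).
    """
    wins = [c for a in served_answers for c in a["winning_chunks"]]
    stale = [c for c in wins if c["freshness_status"] != "fresh"]
    return len(stale) / len(wins) if wins else 0.0

log = [
    {"winning_chunks": [{"id": "c1", "freshness_status": "fresh"},
                        {"id": "c2", "freshness_status": "stale-by-class"}]},
    {"winning_chunks": [{"id": "c3", "freshness_status": "fresh"},
                        {"id": "c4", "freshness_status": "fresh"}]},
]
assert stale_chunk_wins_rate(log) == 0.25
```

Computed daily per cohort, this is the series that went 2% → 11% in the trigger event.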
4. Governance — re-index promotion gates
Blue/green promotion requires:
- Cohort-stratified recall checks pass — head, body, tail per cohort.
- Freshness SLO holds — no cohort has > 5% of chunks beyond its freshness class's max-staleness.
- Reranker score distribution check — distribution overlap with the previous index is > 0.85 (KS test); large divergence is a signal of an index parameter regression.
- p95/p99 latency held within envelope.
Failed promotions roll back automatically. The previous green index remains the live one until a clean promotion passes.
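The reranker score-distribution gate can be sketched with a stdlib two-sample KS statistic. Reading "distribution overlap" as 1 minus the KS statistic is one reasonable interpretation — the text doesn't pin down the overlap measure, so treat that mapping as an assumption:

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past ties so identical samples give d == 0.
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def reranker_gate(prev_scores, new_scores, min_overlap=0.85):
    """Pass promotion only if score distributions overlap enough.

    Overlap := 1 - KS statistic (assumed mapping, see lead-in)."""
    return (1.0 - ks_statistic(prev_scores, new_scores)) >= min_overlap

same = [0.1, 0.4, 0.5, 0.8, 0.9]
shifted = [s + 0.5 for s in same]   # a large shift: index parameter regression signal
assert reranker_gate(same, list(same))
assert not reranker_gate(same, shifted)
```

In production the two samples would be reranker scores on the same query set against the old and candidate indexes.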
Trade-offs & Alternatives Considered
| Approach | Freshness | Latency | Long-tail recall | Verdict |
|---|---|---|---|---|
| Weekly full re-index | Stale | Stable | Mediocre | Original — falls behind at scale |
| Daily full re-index | Fresher | Index-build cost | Same | Cost grows linearly without addressing long tail |
| Source-delta re-index + monthly full + cohort-aware blue/green | Fresh | Stable | Improving | Chosen |
| Real-time index updates (per-event) | Freshest | Tail-latency risk | Same | Over-engineered for most content |
| Larger HNSW (ef, M) | n/a | Worse | Better | Useful but only with eval to verify |
| Add a freshness side-index | Fresh | + retrieval cost | Same | Complexity not worth it |
| Migrate vector DB | Variable | Variable | Variable | Premature; no DB swap fixes the eval problem |
The source-delta + monthly full pattern is the standard at large-scale RAG; the contribution here is the cohort layer and the freshness SLO as a gate.
Production Pitfalls
- Source-delta watcher misses changes. If your source DB doesn't expose change-data-capture, the watcher relies on hashing — and hashing is only as good as the periodicity. Keep the monthly full re-index as a safety net regardless.
- Per-cohort shard layout breaks query routing. If a query's intended cohort is mis-classified, the right chunks are in a shard the query doesn't search. Fail-safe: query unioning across cohorts when classification confidence is low.
- Reranker score-distribution drift is silent. A reranker calibrated on top-50 from the old index will produce scores that look reasonable on the new index's top-50 even when ranking is degrading. Calibrate the reranker periodically against held-out tail-cohort labels.
- Stale-chunk-wins includes false positives. A chunk indexed before a source's last_modified_at may still be correct (the source's modification was unrelated to the chunk). Refine the metric: stale-and-incorrect-wins, computed against a sampled set where humans confirm correctness.
- Tail-stratified eval needs stable cohort definitions. "Tail" today is "body" in six months. The cohort thresholds (e.g., query frequency percentile) need to be stable enough to compare quarter-over-quarter — pin them in absolute counts when possible, or document the cohort definition versions.
- Embedding model upgrade is a separate ground-truth shift. A new embedding model changes the geometry of the index — relevance is no longer comparable to before. Do not roll the embedding upgrade into a routine re-index. Treat it as a major event with its own A/B and golden-set re-evaluation, much like the FM upgrade in 04-fm-upgrade-judge-recalibration.md.
Interview Q&A Drill
Opening question
Your shared RAG pipeline has a corpus 10× its launch size. Recall@10 on the golden set is unchanged but CSAT on freshness-sensitive tasks is slipping. What's happening and what do you change?
Model answer.
The aggregate recall@10 is being dominated by head-distribution queries that retrieve old, stable chunks correctly. The slipping CSAT is in the long tail and on freshness-sensitive intents. Two compounding issues. (1) Stale chunks win. A chunk indexed before its source was updated still scores well against the original query but serves outdated content. There's no metric in the system that flags this — "stale" isn't a label the schema carries. (2) Long-tail under-sampling. The golden set was built when the corpus was 1×; at 10× the long tail is much larger, and aggregate metrics blur it.
Three architectural changes. (1) Chunk-level freshness as first-class metadata: every chunk carries indexed_at, source_hash, freshness_class (catalog: weekly, policy: static-per-version, trending: hourly). Define a per-class max-staleness SLO and gate re-index promotion on it. (2) Source-delta-driven re-index: don't re-index everything weekly; re-index what changed, with a monthly full re-index as a safety net. Per-cohort blue/green. (3) Tail-stratified eval: stratify the golden set into head/body/tail per cohort and measure each separately. Aggregate becomes advisory; per-cohort is blocking.
Add a freshness-aware tiebreaker in the reranker — small penalty (~15%) on stale chunks so they have to be meaningfully better to win against fresh ones. And track stale-chunk-wins rate as a continuous metric: this is the headline number that decoupled from recall.
Follow-up grill 1
"Stale chunk wins" includes cases where the source changed but the chunk is still correct. How do you make the metric meaningful?
Right — naïve chunk.indexed_at < source.last_modified_at over-counts. Two refinements. (1) Hash-level diff. Compute source_hash at chunk time. A chunk is stale only if source_hash no longer matches the current source's hash on the content section the chunk represents. Chunk-level hashing distinguishes "the source changed somewhere" from "the chunk's section changed." (2) Sampled correctness check. For a small subset of stale-chunk-wins events (e.g., 100/day), have a human (or a calibrated judge) verify whether the served answer was actually wrong. The metric becomes "stale-and-incorrect-wins" — false positives drop, the metric correlates with CSAT.
The pattern is: staleness is a structural signal (cheap, automatable, often noisy). Stale-and-incorrect is the actionable signal (expensive but precise). Track both, alert on the precise one.
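The hash-level refinement hinges on hashing the section a chunk represents rather than the whole source. A minimal sketch, with hypothetical shapes (`section_id` on the chunk, a section-keyed source map):

```python
import hashlib

def section_hash(text: str) -> str:
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()

def chunk_is_stale(chunk, current_source_sections):
    """Stale only if the section this chunk represents changed.

    Distinguishes 'the source changed somewhere' from 'the chunk's
    section changed'. `chunk["section_id"]` and the section map are
    hypothetical shapes for illustration.
    """
    current = current_source_sections.get(chunk["section_id"])
    return current is None or section_hash(current) != chunk["source_hash"]

source = {"synopsis": "A pirate story.", "volumes": "12 volumes"}
chunk = {"section_id": "synopsis", "source_hash": section_hash("A pirate story.")}

# Source changed elsewhere (volume count), but the chunk's section didn't:
source["volumes"] = "13 volumes"
assert not chunk_is_stale(chunk, source)

# Now the chunk's own section changed:
source["synopsis"] = "A pirate story. Now an anime."
assert chunk_is_stale(chunk, source)
```

The structural signal stays cheap; the sampled human/judge check then prunes it down to stale-and-incorrect.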
Follow-up grill 2
Per-cohort blue/green. What stops one cohort's bad re-index from poisoning the others?
Several mechanisms. (1) Independent green slots per cohort. Each cohort has its own blue and green indexes, not shared. (2) Per-cohort promotion gates. Catalog promotes when catalog gates pass; policy promotes independently. A failure in one cohort doesn't block another. (3) Cross-cohort fusion happens at query time. Queries that span cohorts (rare but real — "what's trending and well-reviewed") fuse from each cohort's current-live index. As long as each cohort is internally consistent, cross-cohort queries get the latest of each. (4) Atomic-per-cohort swaps. The blue/green swap is per-cohort and atomic; an ongoing promotion in cohort A doesn't introduce inconsistency for cohort B.
The risk that remains: a schema change that affects multiple cohorts (e.g., adding a new metadata field) needs to ship to all cohorts in lockstep or the fusion breaks. That's a versioned-rollout concern, handled with a manifest much like the policy version manifest in scenario 02.
Follow-up grill 3
Your freshness penalty is 15%. Why not 50%? Why not 0%?
15% is a defensible compromise. The reasoning. (1) Freshness is not correctness. A stale chunk can still be the right answer if its content didn't change. A heavy penalty (50%) would systematically hurt correctness on freshness-irrelevant queries. (2) Freshness matters most as a tiebreaker. When two chunks score within ~10% of each other (which is most close-call retrievals), nudging toward fresh is the right move. A 15% penalty achieves that without dominating the score. (3) Empirically tunable. The actual number should come from offline experiments — measure CSAT lift on freshness-sensitive intents at penalty values {0, 5, 10, 15, 20, 30}, plot the curve, pick the inflection. 15% is a reasonable starting point but not sacred.
A 0% penalty leaves the original failure mode in place; a 50% penalty creates new ones. The architecture commits to "small + tunable + measured."
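One crude way to "pick the inflection" from the sweep: take the smallest penalty after which marginal CSAT lift flattens. The numbers below are made up, and a real pick would also weigh confidence intervals per point:

```python
def pick_penalty(results, min_lift=0.001):
    """Choose the smallest penalty whose next step adds < min_lift CSAT.

    `results` maps penalty value -> measured CSAT on freshness-sensitive
    intents (illustrative numbers; real data comes from the offline sweep).
    """
    points = sorted(results.items())
    best = points[0][0]
    for (p0, c0), (p1, c1) in zip(points, points[1:]):
        if c1 - c0 < min_lift:   # lift has flattened (or reversed): stop here
            return p0
        best = p1
    return best

sweep = {0.00: 0.820, 0.05: 0.828, 0.10: 0.834,
         0.15: 0.838, 0.20: 0.8385, 0.30: 0.836}
assert pick_penalty(sweep) == 0.15
```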
Follow-up grill 4
You mentioned the embedding upgrade is a "major event." Walk me through what that looks like operationally.
Embedding upgrades shift the geometry of the index. Distances aren't comparable across models, so:
(1) Pre-upgrade: (a) Build a separate "shadow" index with the new model. (b) Re-run the entire golden set through the shadow index and compare per-cohort recall. (c) Re-calibrate the cross-encoder reranker against the new score distribution. (d) Re-anchor any judge-style metrics against a fresh calibration set.
(2) Migration: (a) Run blue/green at the embedding-model level — the entire shadow becomes the new live, while the old model's index stays on standby for rollback. (b) Switch is per-cohort, not all-at-once. (c) Reranker score thresholds (used in confidence cutoffs) are updated to the new distribution.
(3) Post-upgrade: (a) Watch tail-cohort recall closely for 2–4 weeks; embedding upgrades often help head and hurt tail or vice versa. (b) Keep the old model warm for rollback for at least one full re-index cycle.
Critically, do not bundle the embedding upgrade with other changes (new chunk strategy, new cohort, new freshness-class). One knob at a time. If both move, you can't attribute regressions. This is the same discipline as scenario 04 around FM upgrades — measure one variable at a time or you're flying blind.
Architect-level escalation 1
The corpus has now grown to 100×. The source-delta watcher is the bottleneck (it can't keep up with the change rate), the OpenSearch HNSW build is consuming a 32-node cluster around the clock, and you're being asked to keep latency p99 under 800 ms. What's your structural redesign?
This is the "scale changes the architecture" moment. Three structural moves.
(1) Tier the corpus by velocity. Most chunks are stable for months (catalog backstory, character lists, old reviews). A small fraction (recent releases, trending, reviews on hot titles) churns weekly or faster. Split the index into tiers: a cold tier optimized for size/cost (lower-dim embeddings, larger shards, infrequent re-index), a warm tier with full embeddings and weekly re-index, and a hot tier with hourly delta-update and real-time-ish freshness. Queries fan out to all three but the cost-balance of the cluster matches the value.
(2) Move freshness signal out of the main retrieval path. The freshness-aware penalty was a small in-line cost when the corpus was 10×; at 100× even a small lookup per candidate matters. Pre-compute a freshness_score at index time and store it as a numeric field; the reranker reads it without recomputing. Latency cost falls.
(3) Introduce learned routing. At 100× with multiple tiers, naive fan-out to all tiers is wasteful. A small classifier (intent → tier-set) routes queries to the relevant tiers. Most "what's the synopsis" queries hit cold only; "latest volume" hits warm + hot. The classifier is tested against held-out queries; mis-routing has a fallback (run on warm) so failures degrade gracefully.
The structural commitment: at 100× the index is no longer a single homogeneous thing. Tiering, learned routing, and pre-computed metadata are how you keep latency flat while corpus grows. Re-indexing is no longer a single pipeline — it's a tier-specific operation with tier-specific cadence.
The cost lever that justifies the complexity: the cold tier holds 90% of bytes at 30% of compute and re-index cost; the hot tier holds 1% of bytes at 30% of compute (because it's hot). The total cost is far below "everything in the warm tier."
Architect-level escalation 2
Ground truth in this scenario is "what was the source's content at chunk time vs now." Three years from now, regulators ask you to prove that on a specific date, the bot's answer was based on a chunk whose content matched the source at that time. Can you?
The metadata pattern from this scenario already has the bones. Per-chunk: source_id, source_hash, indexed_at, model_version. Per-served-answer: the winning chunks plus their metadata at serve time.
The gap to close for an audit is source-content reconstructability: given a chunk's source_hash and indexed_at, can you reconstruct what the source said at that time? Two patterns:
(1) Source-history retention. The source systems (catalog DB, policy repo, review datastore) retain history (CDC log, Git history, review-version table). Reconstructing source content at time T means querying source-history at T. This is the cheapest pattern operationally but requires source-history retention to outlive the audit window (e.g., 7 years for some regulatory cases).
(2) Chunk-content snapshot. The chunk itself stores its content body, not just its metadata. Storage cost is higher but it's self-sufficient; you don't depend on source-history.
In practice, a hybrid: chunk-content for the high-stakes cohorts (policy, support) where audit is most likely; source-history dependence for the rest.
The architectural commitment: every served answer must be reproducible from logged metadata + retained sources. If the chain breaks anywhere, you cannot answer the audit. Test the chain periodically (an "audit-readiness drill" on synthetic answers) — if reconstruction fails on the synthetic, fix it before a real audit.
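The drill itself is mechanical once source history exists: find the source version in effect at `indexed_at` and compare its hash to the chunk's recorded `source_hash`. A sketch under the source-history pattern, with a made-up `(effective_from, content)` version list:

```python
import hashlib

def content_hash(text: str) -> str:
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()

def audit_check(served_chunk, source_history):
    """Can we prove the served chunk matched its source at index time?

    `source_history` is a hypothetical list of (effective_from, content)
    versions in chronological order; ISO date strings compare lexically.
    """
    at = served_chunk["indexed_at"]
    in_effect = None
    for effective_from, content in source_history:
        if effective_from <= at:
            in_effect = content
    if in_effect is None:
        return False  # history doesn't reach back far enough: chain broken
    return content_hash(in_effect) == served_chunk["source_hash"]

history = [("2026-01-01", "policy v2"), ("2026-03-01", "policy v3")]
chunk = {"indexed_at": "2026-02-10", "source_hash": content_hash("policy v2")}
assert audit_check(chunk, history)

late_chunk = {"indexed_at": "2026-03-10", "source_hash": content_hash("policy v2")}
assert not audit_check(late_chunk, history)  # chunk was stale at index time
```

Run on synthetic answers, a `False` here is the drill failing — fix retention before a real audit asks.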
Red-flag answers
- "Re-index more often." (Without a freshness metric, you can't tell if it's working.)
- "Bigger HNSW parameters." (Helps recall, not freshness; doesn't address tail.)
- "Trust the existing recall@10." (Aggregate is a lying metric here.)
- "Migrate to a different vector DB." (Premature.)
- "Upgrade the embedding model." (Solves nothing on its own.)
Strong-answer indicators
- Names freshness as a first-class signal, separate from relevance.
- Distinguishes stale (cheap signal) from stale-and-incorrect (actionable signal).
- Has per-cohort + per-tail eval, not just aggregate.
- Knows source-delta + monthly-full is the standard pattern.
- Treats embedding upgrade as a major event with its own discipline.
- Anticipates that 10× → 100× forces tiering + learned routing.
- Has an audit-trail story that uses logged chunk metadata + retained source history.