# GenAI Scenario 01 — Catalog Evolution / New Manga Releases
## TL;DR
Every Wednesday roughly 1,200 new manga chapters and ~40 new series enter the catalog. The Catalog-Search MCP's golden retrieval set was sampled in Q1 against a 100K-title catalog; today the catalog is 320K and growing 2.8% per week. Old gold-relevance labels for queries like "latest volume of Chainsaw Man" point at chunks that are now stale or wrong, new series have no labeled queries at all, and the eval harness keeps reporting recall@10 above its bar while users are increasingly missing the new releases they came to find. The fix shape is catalog-as-source-of-truth + derivative golden sets with a freshness SLA, plus stratified sampling that mirrors current catalog distribution rather than launch-day distribution.
## Context & Trigger
- Axis of change: Requirements + Scale.
- Subsystem affected: `RAG-MCP-Integration/01-catalog-search-mcp.md`, shared embedding service in `RAG-MCP-Integration/09-rag-retrieval-pipeline-deep-dive.md`.
- Trigger event: Weekly catalog drop (the "Wednesday refresh" — publishers push releases on a coordinated cadence) compounded with a quarterly editorial expansion (manhwa/manhua admitted into the same index in Q3). Two things changed: the volume of titles (Scale) and the kinds of titles (Requirements).
This is the most common ground-truth failure in any catalog-driven RAG system. It is also the one most often dismissed as "just a re-index problem" — which is the wrong frame.
## The Old Ground Truth
The Q1 golden set looked like this:
- 2,400 (query, intended_title) pairs sampled by stratifying on top-1000 most-searched queries over the prior 30 days, plus 200 hand-curated edge cases (typos, romanizations, "vol 12" syntax).
- 800 (query, relevant_chunk_ids) pairs for retrieval evaluation, where a "relevant chunk" was a synopsis chunk, a volume-list chunk, or a character-list chunk for the intended title.
- Reasonable assumptions at the time:
  - Catalog is roughly stable week to week (true at the time — releases were ~300/week).
  - Top-1000 queries cover ~78% of search volume (true at the time).
  - Synopsis-style chunks are the right relevance unit (true while series were mostly complete or well-established).
Nothing about that design was foolish. It was a reasonable v1.
## The New Reality
Three months later, the world has moved:
- The intended title for a query may not have existed at sampling time. The query "vol 1 of Sakamoto Days" was not in Q1's golden set because Sakamoto Days was not in the catalog yet. The label "this query has no answer" implicit in its absence is now actively wrong.
- Some labeled chunks are stale. A query like "how many volumes of Chainsaw Man are out" was labeled in Q1 with a chunk that says "11 volumes." The current truth is 17. The retriever may correctly return the labeled chunk and score recall@1 = 1.0 while serving an outdated answer.
- The top-1000 query distribution moved. Queries about new series (manhwa) now make up 14% of volume but are entirely absent from the golden set. Aggregate recall@10 looks fine; recall@10 on queries about post-Q1 titles is ~0.41.
- Relevance unit drifted. For ongoing series with frequent volume drops, users want the volume-release-list chunk, not the synopsis chunk. The Q1 labels favor synopsis. The retriever has been silently penalized for returning the actually-useful chunk.
In short: the schema is the same (query → relevant chunks), but the meaning under the schema has moved.
## Why Naive Approaches Fail
- "Just re-run the eval more often." Re-running a stale eval daily produces stale numbers daily. It tells you nothing about the new surface area.
- "Just expand the golden set with new queries." Without re-stratifying to the current distribution and without retiring stale labels, you get a Frankenstein set that overweights old behavior. Aggregate metrics become uninterpretable.
- "Just re-embed and re-index more often." Re-embedding fixes coverage at the retriever level; it does nothing at the evaluation level. You can have perfect coverage and still fail to measure it.
- "Auto-label new queries with LLM-as-judge." Cheap, but the judge inherits the same blind spots — it has the same FM with the same training cutoff, and for very new series it doesn't know either. Hallucinated judge labels are worse than no labels.
- "Just monitor recall@10 and alert on drops." Recall@10 is computed against the golden set. If the golden set is missing the new surface, the metric is incapable of dropping for the right reason. The alert would have fired if your eval set were the world; it isn't.
## Detection — How You Notice the Shift
**Online signals (lead indicators).**
- "Did not find" / fallback rate per intent — rises sharply for catalog-search queries 2–3 weeks after a publisher refresh.
- Click-through rate on the first returned card — drops on queries containing post-Q1 series tokens.
- Manual escalation tickets that say "the bot says X is the latest, but Y came out two weeks ago."
- Query-rewrite frequency: users typing the same query 2× or 3× with variations (a sign the answer is wrong).
**Offline signals.**
- A parallel synthetic eval generated weekly from the live catalog (not the frozen golden set) shows recall@10 falling on new-series queries.
- LLM-as-judge agreement with humans drops on queries containing entities that postdate the FM's training cutoff.
- Distribution diff: percent of production queries containing tokens not in the golden set rises monotonically.
**Distribution signals.**
- KL-divergence on intent mix between production traffic and golden-set composition climbs above 0.15, a soft alert threshold (see the sketch after this list).
- Long-tail genre share (manhwa, manhua, vertical scroll) rising in production but ~0% in the golden set.
- Mean age of a "winning" retrieved chunk (`now − chunk.indexed_at`) drifts up — indicates retrieval is favoring stale chunks because that's where labels live.
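To make the first distribution signal concrete, here is a minimal sketch of the KL-divergence alert. The 0.15 threshold comes from the list above; the classifier hook, the smoothing constant, and all function names are illustrative assumptions.

```python
import math
from collections import Counter

KL_ALERT_THRESHOLD = 0.15  # soft alert threshold from the list above


def intent_mix(queries, classify):
    """Normalize classified query intents into a probability distribution."""
    counts = Counter(classify(q) for q in queries)
    total = sum(counts.values())
    return {intent: n / total for intent, n in counts.items()}


def kl_divergence(prod, golden, eps=1e-6):
    """KL(prod || golden): how surprised the golden set is by production traffic.

    eps-smoothing keeps intents that are absent from the golden set
    (e.g., manhwa lookups) from producing infinities; instead they
    dominate the score, which is exactly the alert we want.
    """
    intents = set(prod) | set(golden)
    return sum(
        prod.get(i, eps) * math.log(prod.get(i, eps) / golden.get(i, eps))
        for i in intents
    )


def should_alert(prod_queries, golden_queries, classify):
    # A golden set with ~0% manhwa queries while production runs 14%
    # pushes the divergence well past the threshold.
    return kl_divergence(
        intent_mix(prod_queries, classify),
        intent_mix(golden_queries, classify),
    ) > KL_ALERT_THRESHOLD
```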
## Architecture / Implementation Deep Dive
```mermaid
flowchart TB
subgraph Source["Source of truth (live)"]
CAT["Catalog Postgres<br/>+ Series Metadata<br/>(authoritative)"]
REL["Release Calendar<br/>(publisher feeds)"]
end
subgraph Index["Retrieval index"]
EMB["Embedding Service<br/>blue/green re-index"]
OS["OpenSearch<br/>BM25 + kNN"]
end
subgraph GT["Ground-Truth Layer (NEW)"]
DGS["Derivative Golden Set<br/>auto-generated weekly<br/>from catalog deltas"]
SGS["Stable Golden Set<br/>hand-curated, refreshed quarterly"]
DECAY["Label-decay tracker<br/>last_confirmed_valid_at"]
end
subgraph Eval["Evaluation"]
OFF["Offline eval<br/>recall@k, MRR<br/>per-cohort"]
SHADOW["Shadow eval on prod<br/>LLM-judge + spot-check"]
end
subgraph Online["Online"]
SERVE["Catalog-Search MCP"]
FB["Feedback collector<br/>CTR, escalation,<br/>did-not-find"]
end
CAT -->|nightly delta| EMB --> OS
REL --> CAT
CAT -->|delta + new entities| DGS
DGS --> OFF
SGS --> OFF
DECAY --> SGS
OS --> SERVE
SERVE --> FB
FB --> SHADOW
FB -->|flagged labels| DECAY
SHADOW -->|disagreements| DGS
style CAT fill:#fde68a,stroke:#92400e,color:#111
style DGS fill:#dbeafe,stroke:#1e40af,color:#111
style SGS fill:#dbeafe,stroke:#1e40af,color:#111
style DECAY fill:#fee2e2,stroke:#991b1b,color:#111
style FB fill:#dcfce7,stroke:#166534,color:#111
```
### 1. Data layer — golden set becomes a derivative of the catalog
Two golden sets, not one:
- Stable golden set (≈ 2,400 pairs). Hand-curated. Refreshed quarterly. Carries timestamps and a `last_confirmed_valid_at` field. Used for trend tracking and regression detection. Owned by the retrieval team.
- Derivative golden set (≈ 8,000 pairs, regenerated weekly). Auto-generated from the catalog delta + production query logs. Stratified by:
  - Title age bucket (released < 30 days, 30–180, 180–730, > 730).
  - Genre / format (shounen, shoujo, seinen, manhwa, manhua, webtoon-vertical, doujin).
  - Query intent class (entity lookup, volume count, latest-release, recommendation-anchor, character).
  - Language (en, jp-romaji, jp-native, es).
Generation rule: for each new title in the weekly delta, sample 4–8 production queries that mention the title or its aliases, deduplicate by minhash, then label by:
- Hard label: `intended_title_id` from the catalog (deterministic).
- Soft label: `relevant_chunk_ids` proposed by the retriever, then verified by an LLM-as-judge constrained to citing the catalog row (not the FM's parametric memory). The judge prompt explicitly says: "If the catalog row does not appear in the candidate chunks, return UNKNOWN — do not infer."
```python
# Pseudocode for the weekly derivative golden-set builder.
# sample_prod_queries, retriever, judge, GoldenPair, stratify, and now()
# are collaborators defined elsewhere in the pipeline.
def build_derivative_golden_set(catalog_delta, prod_query_logs, retriever, judge):
    pairs = []
    for title in catalog_delta:
        # Pull real user queries that mention the new title or its aliases.
        candidate_queries = sample_prod_queries(
            prod_query_logs,
            tokens=title.aliases + [title.canonical_title],
            min_count=4,
            max_count=8,
            dedupe_minhash=0.85,  # near-duplicate queries collapse to one
        )
        for q in candidate_queries:
            chunks = retriever.search(q, top_k=20)
            judgement = judge.assess(
                query=q,
                candidate_chunks=chunks,
                catalog_row=title,
                refuse_if_not_present=True,  # critical: no catalog row, no label
            )
            if judgement.status == "OK":
                pairs.append(GoldenPair(
                    query=q,
                    intended_title_id=title.id,  # hard label, deterministic
                    relevant_chunk_ids=judgement.relevant_chunk_ids,  # soft label
                    cohort=stratify(title, q),
                    last_confirmed_valid_at=now(),
                ))
    return pairs
```
The catalog row is the authoritative anchor. Without it, the judge degenerates to "the FM thinks this is right" — which is exactly the failure mode we're trying to avoid for new titles.
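A hypothetical weekly invocation, to show how the builder slots into the pipeline; `catalog.delta_since`, `query_log_store.window`, and `golden_store.write_version` are illustrative names, not real APIs.

```python
# Hypothetical weekly job; all store/helper names are assumptions.
delta = catalog.delta_since(last_successful_run_at)    # new + changed titles
recent_queries = query_log_store.window(days=7)        # production traffic
pairs = build_derivative_golden_set(delta, recent_queries, retriever, judge)
golden_store.write_version(pairs, tag="derivative-2026-W14")  # versioned, auditable
```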
### 2. Pipeline layer — label decay and rotation
Each label carries `(created_at, last_confirmed_valid_at, decay_class)`:

| decay_class | Half-life | Refresh trigger |
|---|---|---|
| `static_fact` (e.g., publication year) | ∞ | Catalog delete |
| `volume_count` | 30 days | Volume release event |
| `synopsis_relevance` | 180 days | Major edition/edit |
| `latest_release_pointer` | 7 days | Release calendar event |
| `recommendation_anchor` | 90 days | User-feedback signal |
Labels expire automatically; expired labels are excluded from aggregate metrics but kept for trend analysis. A nightly Lambda walks the stable golden set and flips any pair whose underlying catalog row has changed since `last_confirmed_valid_at` to `needs_review`.
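A minimal sketch of that nightly sweep, assuming the half-lives from the table above and an `updated_at` timestamp on catalog rows; everything else (field names, the `expired` status string) is illustrative.

```python
from datetime import datetime, timedelta, timezone

# Half-lives from the decay-class table; None means no time-based expiry.
HALF_LIFE = {
    "static_fact": None,
    "volume_count": timedelta(days=30),
    "synopsis_relevance": timedelta(days=180),
    "latest_release_pointer": timedelta(days=7),
    "recommendation_anchor": timedelta(days=90),
}


def sweep_golden_set(pairs, catalog, now=None):
    """Nightly sweep: expire labels past their half-life, and flag pairs
    whose underlying catalog row changed after the label was confirmed."""
    now = now or datetime.now(timezone.utc)
    for pair in pairs:
        half_life = HALF_LIFE[pair.decay_class]
        if half_life is not None and now - pair.last_confirmed_valid_at > half_life:
            # Excluded from aggregate metrics, kept for trend analysis.
            pair.status = "expired"
            continue
        row = catalog.get(pair.intended_title_id)
        if row is None or row.updated_at > pair.last_confirmed_valid_at:
            pair.status = "needs_review"  # the catalog moved under the label
    return pairs
```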
### 3. Serving layer — re-index strategy
Blue/green re-index on a 7-day cadence (already documented in `09-rag-retrieval-pipeline-deep-dive.md`). The new piece: a per-cohort canary. After a re-index, traffic is shadowed for the new-title cohort first, because that's where regressions hide. An aggregate canary on the top-1000 queries can pass while the new-title cohort is broken.
```yaml
canary:
  cohorts:
    - name: post-q1-titles
      filter: "title.released_at > '2026-01-01'"
      threshold:
        recall_at_10: 0.85
        mrr_at_10: 0.60
      block_promotion_on_regression: true
    - name: manhwa
      filter: "title.format = 'manhwa'"
      threshold:
        recall_at_10: 0.80
      block_promotion_on_regression: true
    - name: aggregate
      filter: "*"
      threshold:
        recall_at_10: 0.88
      block_promotion_on_regression: false  # advisory only
```
Aggregate is advisory. Cohorts are blocking. This inverts the default and is the operational center of gravity.
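A sketch of the promotion gate that enforces this inversion, assuming the YAML above has been parsed (e.g., `yaml.safe_load(...)["canary"]`) and per-cohort metrics are already computed; function and variable names are illustrative.

```python
def evaluate_canary(cohort_results, canary_config):
    """Return True only if the re-indexed (green) index may be promoted.

    cohort_results: {"post-q1-titles": {"recall_at_10": 0.87, ...}, ...}
    canary_config:  the dict under the `canary:` key of the YAML above.
    """
    promotable = True
    for cohort in canary_config["cohorts"]:
        observed = cohort_results.get(cohort["name"], {})
        for metric, floor in cohort["threshold"].items():
            value = observed.get(metric)
            # Fail closed: a missing metric counts as a regression,
            # which also guards against the "canaries fail open" pitfall.
            if value is None or value < floor:
                if cohort["block_promotion_on_regression"]:
                    promotable = False
                else:
                    print(f"advisory: {cohort['name']} {metric}={value} < {floor}")
    return promotable
```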
### 4. Governance — who signs off on a moved label
A label change is a small but real audit trail event. Three roles:
- Catalog Ops owns title-level facts (volume counts, releases). They cannot edit golden labels directly; they edit catalog rows and labels rebuild from those.
- Retrieval team owns the eval set definitions, the stratification scheme, and the derivative-set pipeline.
- Judge owner (often retrieval team + safety) owns the judge prompt and re-anchors it (see scenario 04).
Label changes are versioned in S3 (`golden-set/v{N}/manifest.json`), and changes that move aggregate recall@10 by more than 0.02 require a human review before the next eval run uses them. This is a guardrail against silent mass-relabeling that masks real model regressions.
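A minimal sketch of that guardrail: score one frozen retriever against both golden-set versions, so any metric shift is attributable to the relabeling alone. The 0.02 threshold is from the text; the eval helper is assumed.

```python
MAX_SILENT_RECALL_SHIFT = 0.02  # governance threshold from the text


def gate_golden_set_version(old_pairs, new_pairs, frozen_retriever, eval_recall_at_k):
    """Hold the retriever constant and vary only the labels.

    If the label change by itself moves aggregate recall@10 by more than
    the threshold, a human reviews it before the next eval run uses it."""
    old_recall = eval_recall_at_k(frozen_retriever, old_pairs, k=10)
    new_recall = eval_recall_at_k(frozen_retriever, new_pairs, k=10)
    if abs(new_recall - old_recall) > MAX_SILENT_RECALL_SHIFT:
        return "needs_human_review"
    return "auto_approved"
```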
## Trade-offs & Alternatives Considered
| Approach | Latency | Cost | Label quality | Ops complexity | Verdict |
|---|---|---|---|---|---|
| Single hand-curated golden set, refreshed quarterly | n/a | $$ | High (small) | Low | What we had — unsalvageable for new surface |
| Single hand-curated, refreshed weekly | n/a | $$$$$ | High | High | Cost-prohibitive at this scale |
| Pure LLM-judge auto-label, no catalog anchor | n/a | $$ | Drifts with FM | Low | Hallucinated labels for new titles |
| Stable + derivative split, catalog-anchored | n/a | $$$ | High where it matters | Medium | Chosen |
| Crowd-source labels (MTurk) per delta | Slow | $$$$ | Variable | High | Reserve for edge cases (Cyrillic, ambiguous typos) |
| Embedding-only auto-label (cosine to catalog row) | Fast | $ | Low for synonyms | Low | Useful as a candidate generator, not a verdict |
The stable+derivative split is the pattern: hand-curated for trend, machine-generated for coverage, and the catalog itself as the truth-maker for new entities.
## Production Pitfalls
- The judge silently drifts when the FM is upgraded. If you go from Sonnet 4.5 → 4.6 mid-quarter, the same judge prompt with the same temperature can produce different verdicts on the same pair. See `04-fm-upgrade-judge-recalibration.md`. Pin the judge model independently of the application model.
- Aliases / romanization eat 20% of recall. "Sakamoto Days" vs "サカモトデイズ" vs "Sakamoto De" (typo'd auto-suggest). The derivative builder needs a populated alias table or it will under-cover the new title from day one. Don't trust the catalog alias field — it's incomplete.
- Quarterly refresh of the stable set is when the bug enters. People rush, accept too many machine suggestions, the stable set quietly becomes a worse copy of the derivative set. Defend the stable set's hand-curation budget — it is the only thing telling you whether the derivative pipeline itself is healthy.
- Cohort canaries fail open. If a cohort has < 50 queries, statistical noise dominates and regressions go undetected. Either raise the threshold for small cohorts or batch them into a "long tail" cohort with enough volume.
- "Stale chunk wins" is invisible until you log chunk ages. Add
chunk.indexed_atto your retrieval response logs. Look at the histogram of "winning chunk age" weekly. If it tilts older over time, your re-index isn't winning enough.
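A sketch of that weekly histogram, assuming retrieval logs carry the top-ranked chunk with a timezone-aware `indexed_at` timestamp; the bucket edges are illustrative.

```python
from collections import Counter
from datetime import datetime, timezone

AGE_BUCKETS_DAYS = (7, 30, 90, 365)  # illustrative bucket edges


def winning_chunk_age_histogram(retrieval_logs, now=None):
    """Bucket the age of the top-ranked ("winning") chunk per query.

    If the histogram tilts into the older buckets week over week,
    the re-index is not winning enough."""
    now = now or datetime.now(timezone.utc)
    hist = Counter()
    for entry in retrieval_logs:
        age_days = (now - entry["winning_chunk"]["indexed_at"]).days
        bucket = next((f"<={b}d" for b in AGE_BUCKETS_DAYS if age_days <= b), ">365d")
        hist[bucket] += 1
    return hist
```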
## Interview Q&A Drill
### Opening question
Your manga catalog grows by ~1,200 chapters and 40 new series every Wednesday. Your retrieval golden set was built three months ago. Aggregate recall@10 is healthy, but customer support is escalating "the bot doesn't know about new releases." What's your diagnosis and what's your fix?
### Model answer (the shape of a strong response)
The diagnosis: aggregate recall@10 is healthy precisely because the golden set encodes old expectations of correctness. Two distinct failure modes are blended:
- Coverage gap. Post-launch series have no labeled queries, so eval is incapable of detecting failure on them.
- Label staleness. Labels for ongoing series (e.g., volume counts) point at chunks whose content is now wrong, so even a "successful" retrieval serves an outdated answer.
The fix has three pieces:
- Treat the catalog as the source of truth and the golden set as a derivative. Generate a weekly derivative golden set from the catalog delta and production query logs, anchored by the canonical catalog row, verified by an LLM-as-judge constrained to refuse if the row isn't in candidates.
- Stratify per-cohort (title age, format, language, intent) and block re-index promotion on cohort regressions, not on aggregate. Aggregate is advisory.
- Add label-decay metadata (`last_confirmed_valid_at`, `decay_class`) so labels expire by class — `latest_release_pointer` half-lives in 7 days, `synopsis_relevance` in 180.
The change in mindset is from "the golden set is the source of truth" to "the world is the source of truth, the golden set is cached data with a TTL."
### Follow-up grill 1
Your derivative golden set is built by an LLM-as-judge. Why not just use the same judge to score model outputs directly and skip the golden set entirely?
Two reasons. First, judge calls are stochastic and expensive — you cannot afford to call the judge on every production query, and replaying historical traffic against a judge gives you a per-run number with no reproducibility across model versions. Second, the judge has the same training cutoff and parametric blind spots as the FM you're testing; without the catalog anchor, "the judge thinks the answer is right" reduces to "the FM and judge agree" — which is exactly correlated error. The golden set's value is that the catalog row is the anchor, the judge is just a cheap labeler, and the labels become reusable, versioned, comparable across model versions, and auditable.
### Follow-up grill 2
You said "block on cohort regressions, not aggregate." Walk me through the noise problem — what stops cohort regressions from being false alarms when a cohort has only 50 queries?
Three layered defenses. (1) Set the threshold differently per cohort size — for a small cohort, require the regression to be both larger in magnitude and persistent over two consecutive runs before blocking. (2) Use bootstrapped confidence intervals on cohort-level metrics; only block if the CI on the regression doesn't include zero. (3) For cohorts that consistently fall under 50, batch them into a "long-tail" cohort. The hard part is naming the threshold — I'd anchor on the smallest movement that has historically corresponded to a real customer-visible regression, which is empirical and you discover by replaying past incidents through the canary system.
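A sketch of defense (2), a paired bootstrap on the recall@10 delta for one cohort; `n_boot` and `alpha` are conventional defaults, not values from the text.

```python
import random


def bootstrap_regression_ci(old_hits, new_hits, n_boot=10_000, alpha=0.05):
    """Paired bootstrap CI on the recall delta for one cohort.

    old_hits / new_hits: per-query 0/1 indicators (1 = a relevant chunk
    appeared in the top 10) for the same queries under the old and new
    index. Block promotion only if the whole CI sits below zero."""
    n = len(old_hits)
    deltas = []
    for _ in range(n_boot):
        idx = [random.randrange(n) for _ in range(n)]  # resample query IDs
        old = sum(old_hits[i] for i in idx) / n
        new = sum(new_hits[i] for i in idx) / n
        deltas.append(new - old)
    deltas.sort()
    lo = deltas[int(alpha / 2 * n_boot)]
    hi = deltas[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi  # e.g., block only when hi < 0
```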
### Follow-up grill 3
Your judge is constrained to refuse if the catalog row isn't in the candidates. How do you make sure the retriever doesn't just learn to surface the catalog row to please the judge?
This is a Goodhart's-law concern and it's real. Three protections. (1) The judge sees only the candidate chunks plus the catalog row identity — it does not see retriever scores or rankings, so it can't reward a "stuffed" candidate set. (2) The relevance label is on chunks, but the retriever's training signal is from a different source (production click-through, human-curated pairs). The judge labels feed eval, not training. (3) Hold out a "trust-but-verify" set where the judge labels a query and a separate human spot-check confirms — if humans disagree with judge > 5%, you re-anchor the judge before trusting more of its labels.
### Follow-up grill 4
A new format (manhwa, vertical-scroll) is being added next quarter. They look different — vertical chapters, no volumes, episode-numbered, translated from Korean. Your derivative pipeline assumes "title + volume + synopsis." What changes?
The schema-shift case. The fix isn't to bolt manhwa into the existing pipeline; it's to treat manhwa as a new entity class with its own decay classes, its own relevance unit, its own intent enum, and its own cohort. Concretely: add format ∈ {manga, manhwa, manhua} as a stratification key; introduce an episode decay class (latest_episode_pointer, half-life 3 days); add an episode-list chunk type and treat synopsis-relevance as secondary not primary for ongoing webtoons. Critically, run a two-phase rollout: first, ingest manhwa into the index but route manhwa queries to a "we're learning this" deflection; second, after the cohort eval is built and passing, route to the model. Skipping phase one is how you ship a confidently-wrong answer on a brand-new surface.
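A sketch of how those answer points could land in the pipeline's registries; all names here are illustrative, mirroring the stratification keys and decay classes defined earlier.

```python
# Illustrative registry updates for the manhwa rollout; names are assumptions.
STRATIFICATION_KEYS = ("title_age_bucket", "format", "intent_class", "language")
FORMATS = ("manga", "manhwa", "manhua")  # format becomes a first-class cohort key

DECAY_CLASS_ADDITIONS = {
    # Vertical-scroll titles drop episodes, not volumes.
    "latest_episode_pointer": {"half_life_days": 3, "refresh_trigger": "episode release"},
}

# Relevance-unit priority per format: episode lists outrank synopsis
# for ongoing webtoons, inverting the manga default.
RELEVANCE_UNIT_PRIORITY = {
    "manga":  ("volume_list", "synopsis", "character_list"),
    "manhwa": ("episode_list", "synopsis", "character_list"),
}
```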
### Architect-level escalation 1
Imagine the company decides catalog refresh moves from Wednesdays to continuous (sub-hourly) as part of a partnership with a publisher. Latency budget for catalog freshness drops from 24 h to 30 min. How does your eval architecture survive?
The weekly derivative-set builder is too coarse. Two changes. (1) Decompose the derivative set into a "catalog-event-driven" subset that's regenerated within 30 minutes of a catalog event for the affected entities only, plus a weekly background refresh for cohort-level stratification. (2) Move from "block re-index promotion on cohort regression" to "block publish of an updated index on entity-level regression for the entities in this delta." The unit of evaluation gets finer-grained: instead of "did this re-index pass," it's "did the chunks for this title pass." Cost goes up because the judge is called per delta instead of per week — control with caching, lazy evaluation (only label queries that production actually asks), and tier-aware judge models (cheap judge for routine deltas, expensive judge for high-risk titles like new series). The hard part operationally is that "blocking publish" can't be a 10-minute human review anymore — the gate has to be automated, which raises the bar on judge calibration and on having a fallback to "serve the previous index for this title only."
### Architect-level escalation 2
Six months out, you're paying $40K/month on judge calls for the derivative golden set, and your CFO wants it cut in half. Where do you cut without rebuilding the failure?
Three axes of cost reduction, in order of safety. (1) Tier the judge. Use a small cheap model (Haiku) for the 80% of cases where retrieval candidates contain the catalog row obviously (high cosine, exact alias match) and only escalate to Sonnet/Opus for ambiguous cases. Estimate: 50% cost reduction, near-zero quality loss. (2) Cache by query embedding. A query "vol 12 of CSM" today and tomorrow should not require two judge calls if the catalog state for CSM hasn't changed. Cache key = (query_embedding_bucket, title_id, catalog_row_version). Estimate: another 20% off. (3) Sample, don't enumerate. For long-tail titles with low query volume, label 1 query/title not 4–8. Estimate: another 10–15%. Where I would not cut: anchor verification — if you let the judge label without the catalog row in the prompt, you save tokens but you've turned the eval into "FM agrees with FM" which is the original failure mode. The cost-quality knob with a hard stop is "never remove the anchor."
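A sketch of cost cut (2), the judge cache keyed on `(query_embedding_bucket, title_id, catalog_row_version)` as described above; the quantization grid and the `row_version` attribute are assumptions.

```python
import hashlib

_judge_cache = {}


def embedding_bucket(embedding, grid=0.05):
    """Coarse-quantize the query embedding so near-duplicate phrasings
    ("vol 12 of CSM" today vs. tomorrow) collapse to one cache key."""
    quantized = ",".join(str(round(x / grid)) for x in embedding)
    return hashlib.sha1(quantized.encode()).hexdigest()


def judge_with_cache(query, embedding, title, candidates, judge):
    """Reuse a verdict when the same query shape was already labeled
    against the same catalog state; any catalog edit bumps row_version
    and forces a fresh judge call."""
    key = (embedding_bucket(embedding), title.id, title.row_version)
    if key not in _judge_cache:
        _judge_cache[key] = judge.assess(
            query=query,
            candidate_chunks=candidates,
            catalog_row=title,
            refuse_if_not_present=True,  # never cut the anchor
        )
    return _judge_cache[key]
```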
### Red-flag answers (weak)
- "Just retrain the embedding model more often." (Doesn't address eval-set staleness.)
- "Add more queries to the golden set." (Without retiring stale ones and without stratification, makes the metric more misleading.)
- "Use LLM-as-judge for everything." (Inherits FM blind spots; without anchor, hallucinates.)
- "Aggregate recall@10 is good, ship it." (The whole point: aggregate is the liar's metric here.)
- "Re-label the entire historical golden set." (Cost-prohibitive and unnecessary — most old labels are still right.)
### Strong-answer indicators
- Distinguishes coverage gap from label staleness as separate failure modes.
- Treats the catalog as the source of truth, not the golden set.
- Names per-cohort canary as the operational gate, with aggregate as advisory.
- Knows that judge labels need an external anchor or they collapse to FM agreement.
- Talks about decay classes, not just "refresh more often."
- Has an opinion on which cuts are safe under cost pressure (judge tier, cache, sample) and which are not (drop the anchor).