
Ground-Truth Evolution — Overview

TL;DR. Ground truth is not a fixed reference. In a real production manga chatbot, what counts as "the correct answer," "the relevant document," "the right recommendation," "the safe response," or "the spammy review" mutates constantly — driven by changing requirements, tightening constraints, growing scale, the passage of time, and adversarial pressure. Every evaluation harness, retraining cadence, golden set, judge model, and safety guardrail silently depends on a stable definition of "correct." The instant the world changes faster than the labels do, your metrics keep going up while your users get worse answers. This folder makes that dependence explicit, scenario by scenario, separately for GenAI and classical ML systems.


Why a separate folder for this

The MangaAssist repo already has rich coverage of:

  • Golden sets and offline test design — Evaluation-Systems-GenAI/06-retrieval-quality-testing.md, API-Design-and-Testing/04-offline-testing-quality-strategies.md
  • Retrieval quality — RAG-MCP-Integration/09-rag-retrieval-pipeline-deep-dive.md
  • Configuration / prompt-template / metric-naming drift — Skill-1.1.3-Standardized-Technical-Components/scenarios/
  • POC-to-production failure narratives — POC-to-Production-War-Story/

What is missing — and what this folder fills — is ground-truth drift itself: the slower, quieter, more dangerous failure mode where the labels you trained on, evaluated against, monitored with, and guardrailed by simply stop describing the world. None of the existing folders treats this directly. A retrieval evaluator can be perfectly designed and still measure the wrong thing if the gold relevance labels were collected in 2024 and the catalog has tripled since.

This folder is the missing layer between static evaluation (which we already have) and operational drift (which we already cover for configuration). It is the layer where the definition of correctness itself moves.


What "ground truth" means in this stack

Different subsystems anchor on different forms of ground truth. They do not drift the same way and they cannot be re-labeled the same way.

| Subsystem | What "ground truth" means concretely | Refresh cadence | Who owns the labels |
|---|---|---|---|
| Catalog-Search MCP | (query, intended title) pairs; (query, relevant chunk) pairs | Weekly catalog sync | Catalog ops + retrieval team |
| User-Preferences MCP | (user, "good recommendation") pairs derived from clicks/buys | Continuous, decays | Personalization team |
| Order-Inventory MCP | (order_id, current state) — authoritative live API | Real-time | Aurora is source of truth |
| Review-Sentiment MCP | (review_text, sentiment) and (review_text, aspect-spans) | Quarterly re-label | Annotation vendor + ABSA team |
| Support-Policy MCP | (question, policy_clause, region, version) tuples | On policy change | Legal + support ops |
| Trending-Discovery MCP | (timestamp, trending_set) — minute-level windows | Continuous | Real-time pipeline |
| Cross-Title-Link MCP | (title_a, title_b, relation) edges | Editorial + co-purchase | Editorial + graph team |
| Guardrails | (prompt, must_refuse?) and (prompt, attack_class) | Continuous threat-intel | Safety team + red team |
| Cover-art classifier | (image, art_style_label) | Quarterly | CV team + annotators |
| Demand forecaster | (sku, day, units_sold) — actuals | Daily | Inventory data warehouse |
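
One way to keep these cadences and owners from living only in a table is to encode them in a registry that monitoring can read. Below is a minimal sketch of that idea — every name here (`GroundTruthSource`, `max_staleness_days`, the entries themselves) is hypothetical, not an existing module in this repo:

```python
from dataclasses import dataclass
from enum import Enum

class Cadence(Enum):
    REAL_TIME = "real-time"
    CONTINUOUS = "continuous"
    DAILY = "daily"
    WEEKLY = "weekly"
    QUARTERLY = "quarterly"
    ON_EVENT = "on-event"   # policy change, editorial update, catalog sync

@dataclass(frozen=True)
class GroundTruthSource:
    subsystem: str
    label_shape: str            # what one label looks like
    cadence: Cadence
    owner: str
    max_staleness_days: float   # alert once labels outlive this budget

REGISTRY = [
    GroundTruthSource("catalog-search-mcp", "(query, relevant_chunk)",
                      Cadence.WEEKLY, "catalog-ops + retrieval", 14),
    GroundTruthSource("support-policy-mcp", "(question, policy_clause, region, version)",
                      Cadence.ON_EVENT, "legal + support-ops", 90),
    GroundTruthSource("review-sentiment-mcp", "(review_text, sentiment)",
                      Cadence.QUARTERLY, "annotation-vendor + absa", 120),
]

def stale_sources(label_age_days: dict[str, float]) -> list[str]:
    """Subsystems whose newest labels have outlived their staleness budget.

    Missing subsystems are treated as infinitely stale, i.e. flagged."""
    return [s.subsystem for s in REGISTRY
            if label_age_days.get(s.subsystem, float("inf")) > s.max_staleness_days]
```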

Five things to notice:

  1. Some ground truths are facts (order state, units sold). They cannot drift — but the distribution of what we ask about them can.
  2. Some are policy (support answers). They drift on a legal/PM clock, not a data clock.
  3. Some are taste (recommendations). They decay continuously and re-labeling is infeasible at scale.
  4. Some are linguistic (sentiment, intent). They drift with culture, slang, and new product categories.
  5. Some are adversarial (guardrails, spam). They drift because somebody is actively pushing them.

Treating all of these the same is the most common architectural mistake in the space. Scenario files in this folder are organized to make those differences visible.
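
To make that distinction concrete, the five kinds can be written down as an enum with a kind-specific refresh operation. This is an illustrative sketch of the taxonomy above, not repo code; the enum and the strategy strings are assumptions:

```python
from enum import Enum

class TruthKind(Enum):
    FACT = "fact"                # order state, units sold
    POLICY = "policy"            # support answers, legal clauses
    TASTE = "taste"              # recommendation relevance
    LINGUISTIC = "linguistic"    # sentiment, intent, aspect spans
    ADVERSARIAL = "adversarial"  # guardrail labels, spam labels

# Each kind implies a different refresh operation; treating them
# uniformly is exactly the mistake described above.
REFRESH_STRATEGY = {
    TruthKind.FACT: "never re-label; re-sample which facts the eval set asks about",
    TruthKind.POLICY: "invalidate labels on policy-version bump, not on a timer",
    TruthKind.TASTE: "apply recency decay to implicit-feedback labels",
    TruthKind.LINGUISTIC: "periodic stratified human re-annotation",
    TruthKind.ADVERSARIAL: "continuous refresh from red-team and live attack traffic",
}
```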


Five axes of change (the framework)

Every scenario in this folder is tagged with one or more of these axes; they are described in detail in 01-framework-axes-of-change.md. In brief:

```mermaid
flowchart LR
    GT["Ground Truth<br/>(labels, golden sets,<br/>judge rubrics, edges)"]

    REQ["Requirements change<br/>new region · new SKU · new policy"] --> GT
    CON["Constraints change<br/>latency · cost · model size"] --> GT
    SCA["Scale change<br/>10x corpus · 100x users · long-tail"] --> GT
    TIM["Time / decay<br/>taste · slang · trends · staleness"] --> GT
    ADV["Adversary<br/>prompt injection · spam · fraud"] --> GT

    GT --> EVAL["Eval harness<br/>says model is fine"]
    GT --> TRAIN["Retraining<br/>uses old labels"]
    GT --> GUARD["Guardrails<br/>miss new attacks"]
    GT --> MON["Monitors<br/>watch wrong metric"]

    style GT fill:#fde68a,stroke:#92400e,color:#111
    style REQ fill:#dbeafe,stroke:#1e40af,color:#111
    style CON fill:#dbeafe,stroke:#1e40af,color:#111
    style SCA fill:#dbeafe,stroke:#1e40af,color:#111
    style TIM fill:#dbeafe,stroke:#1e40af,color:#111
    style ADV fill:#fee2e2,stroke:#991b1b,color:#111
    style EVAL fill:#dcfce7,stroke:#166534,color:#111
    style TRAIN fill:#dcfce7,stroke:#166534,color:#111
    style GUARD fill:#dcfce7,stroke:#166534,color:#111
    style MON fill:#dcfce7,stroke:#166534,color:#111
```

  • Requirements change. Product/legal/PM expands the surface area: a new region, a new manga format (manhwa, manhua, webtoons), a new policy clause, a new entity type. The ground truth that existed before this expansion does not cover the new space — and worse, some old labels are now actively wrong.
  • Constraints change. Cost, latency, model size, context budget, or compliance posture tighten. An answer that was correct under the old constraint is unacceptable under the new one. The "correctness" rubric has moved without anyone re-labeling.
  • Scale change. 10× the corpus, 100× the queries, 1000× the long tail. The labeled distribution is no longer representative. Tail behavior — which the golden set under-samples — now dominates user experience.
  • Time / decay. Taste decays, slang shifts, trends turn over hourly, RAG corpora go stale, fine-tuning drifts. Yesterday's label is wrong even if nothing about the schema changed.
  • Adversary. Somebody is actively trying to make your label wrong. Every defense decision creates a counter-move. Ground truth is a moving target by design.
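
For the scale and time axes above, the classical detector (named in the GenAI/ML comparison below) is PSI. A minimal sketch, assuming NumPy; the `psi` helper is illustrative, and the 0.1 / 0.25 thresholds are the conventional rule of thumb rather than anything this repo prescribes:

```python
import numpy as np

def psi(reference: np.ndarray, live: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between the distribution the golden set
    was drawn from and current traffic (query lengths, title IDs,
    embedding-cluster assignments, ...). Rule of thumb: < 0.1 stable,
    0.1-0.25 drifting, > 0.25 the labeled set no longer describes
    production; re-sample and re-label."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    p, _ = np.histogram(reference, bins=edges)
    q, _ = np.histogram(live, bins=edges)
    # Laplace smoothing so empty bins do not blow up the log term
    p = (p + 1) / (p.sum() + bins)
    q = (q + 1) / (q.sum() + bins)
    return float(np.sum((q - p) * np.log(q / p)))
```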

A scenario can carry more than one axis (most do). The folder is balanced so each axis appears in at least two scenarios across GenAI + ML.


How GenAI and ML differ (and why we split them)

Both branches share the framework, but the failure modes and the fix shapes are not the same.

| Dimension | GenAI side | Classical ML side |
|---|---|---|
| Primary "label" | Whether a free-text answer is correct, grounded, safe, on-policy | Discrete class / score / ranked list |
| How labels are produced | Hand-curated golden sets + LLM-as-judge + human spot-check | Bulk-labeled training data + implicit feedback |
| Drift detection | Judge disagreement, citation-grounding rate, refusal rate | PSI/KL on features, accuracy decay, AUC drop |
| Re-labeling cost | High (judge models cost $, humans slow) | Very high at scale (millions of rows) |
| Re-training cost | Low if just prompt; high if fine-tune | Always high |
| Time to fix | Hours (prompt) to weeks (eval set rebuild) | Weeks to months |
| Failure mode at runtime | Hallucination, refusal, off-policy answer | Wrong score → wrong action |
| Adversarial surface | Prompt-shaped, language-shaped | Feature-shaped, behavior-shaped |
| "Truth" tends to be | Negotiated (judge rubric, policy doc) | Empirical (events, sales, clicks) |

The split matters because the operational playbook is different: re-anchoring an LLM judge is not the same operation as re-labeling 5M reviews, and the people, tools, and time horizons are not interchangeable. Mixing them in a single scenario file would obscure the actual decisions an architect needs to make.
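
On the GenAI side, "re-anchoring a judge" often reduces to one measurement: how often a candidate judge flips the incumbent's verdicts on the frozen golden set. A hypothetical sketch, assuming the judges are callables returning a verdict string and the golden set is a list of (question, answer) pairs:

```python
from typing import Callable, Sequence

Verdict = str  # e.g. "pass" / "fail", whatever the rubric emits

def judge_disagreement(
    golden_set: Sequence[tuple[str, str]],        # (question, answer) pairs
    old_judge: Callable[[str, str], Verdict],
    new_judge: Callable[[str, str], Verdict],
) -> float:
    """Fraction of frozen golden-set items where the candidate judge flips
    the incumbent's verdict. A high rate means the rubric itself moved with
    the model swap, and the set needs human re-adjudication before the new
    judge is allowed to gate releases."""
    flips = sum(old_judge(q, a) != new_judge(q, a) for q, a in golden_set)
    return flips / len(golden_set)
```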


Folder map

```
Ground-Truth-Evolution/
├── 00-overview-ground-truth-evolution.md       (this file)
├── 01-framework-axes-of-change.md              (taxonomy + decision rubric)
├── GenAI-Scenarios/
│   ├── 00-genai-scenarios-index.md
│   ├── 01-catalog-evolution-new-releases.md            (Requirements + Scale)
│   ├── 02-policy-document-versioning.md                (Requirements)
│   ├── 03-user-preference-concept-drift.md             (Time)
│   ├── 04-fm-upgrade-judge-recalibration.md            (Constraints)
│   ├── 05-localization-multilingual-truth.md           (Requirements + Scale)
│   ├── 06-trending-temporal-decay.md                   (Time)
│   ├── 07-adversarial-prompt-injection-evolution.md    (Adversary)
│   └── 08-rag-corpus-staleness-and-scale.md            (Scale + Time)
└── ML-Scenarios/
    ├── 00-ml-scenarios-index.md
    ├── 01-recommendation-label-decay.md                (Time)
    ├── 02-sentiment-classifier-domain-shift.md         (Time + Requirements)
    ├── 03-search-ranking-ui-redesign.md                (Requirements)
    ├── 04-spam-review-adversarial-evolution.md         (Adversary)
    ├── 05-demand-forecasting-promo-distortion.md       (Requirements + Time)
    ├── 06-absa-aspect-emergence.md                     (Requirements)
    ├── 07-embedding-category-expansion.md              (Scale + Requirements)
    └── 08-cover-art-style-drift.md                     (Time + Adversary)
```

Each scenario file is a self-contained deep dive — TL;DR, context, old ground truth, new reality, why naive approaches fail, detection, architecture (with Mermaid), trade-offs, production pitfalls, and a 5–7 question interview Q&A drill ending in architect-level escalation.


How to use this folder

For architects prepping for interviews: pick any scenario, read it cold, and you should be able to answer the opening question plus at least two follow-up grills without referring back. The Q&A drill at the bottom of each file is calibrated to a BMS / Amazon / staff-level bar.

For operators: the architecture + detection sections are designed to lift directly into a runbook or a design doc. The "Production Pitfalls" sections are the things you only learn the hard way.

For PMs / leads: the trade-off tables make the cost of "do nothing" visible — they are the easiest sections to lean on when arguing for staffing a labeling pipeline or a judge re-anchor.