The Framework — Five Axes of Ground-Truth Change
Purpose. This file defines the shared vocabulary used by every scenario file in this folder. Each scenario tags itself with one or more axes of change — the force that made the old ground truth wrong. The axes are not mutually exclusive; the most dangerous failures combine two or three of them and look like one.
The five axes
```mermaid
flowchart TB
subgraph Pull["Pulls on Ground Truth"]
REQ["Requirements<br/>(product · legal · PM)"]
CON["Constraints<br/>(latency · cost · model)"]
SCA["Scale<br/>(corpus · users · tail)"]
TIM["Time<br/>(decay · staleness · trend)"]
ADV["Adversary<br/>(injection · spam · fraud)"]
end
GT["Ground-Truth Layer<br/>golden sets · judge rubrics<br/>labels · edges · facts"]
REQ --> GT
CON --> GT
SCA --> GT
TIM --> GT
ADV --> GT
GT --> M["Models / Retrievers / Guardrails"]
GT --> E["Eval / CI / Canary"]
GT --> O["Online Monitors"]
style GT fill:#fde68a,stroke:#92400e,color:#111
style REQ fill:#dbeafe,stroke:#1e40af,color:#111
style CON fill:#dbeafe,stroke:#1e40af,color:#111
style SCA fill:#dbeafe,stroke:#1e40af,color:#111
style TIM fill:#dbeafe,stroke:#1e40af,color:#111
style ADV fill:#fee2e2,stroke:#991b1b,color:#111
```
Axis 1 — Requirements change
Definition. The product surface area expanded or shifted. New region, new policy, new SKU type, new entity, new user intent, new compliance regime. The ground-truth schema either grew (new columns/labels) or the meaning of existing rows shifted.
Tell. Someone outside the ML/GenAI team announces a change ("we're launching JP", "legal updated the return policy", "we're adding manhwa"). PRs that change the catalog schema, the intent enum, or the policy document are the loudest signals. Quietest signal: a stakeholder asks "does the bot handle X?" — if X is new, your golden set almost certainly does not cover it.
Why it's hard. Re-labeling against a moved spec means re-reading the new spec, re-training annotators, and re-deciding which old labels are now wrong (not just which ones are missing). The "old labels are now wrong" subset is what gets skipped.
Failure shape. Coverage gap → blind spots → silent failures on the new surface area, while metrics on the old surface area keep looking fine. The eval harness can't see what it doesn't ask about.
Decision rubric.
- Did the schema change, the meaning change, or both?
- Are the new entities a strict superset (additive) or a partial overwrite (some old labels are now wrong)?
- Is the cost of re-labeling higher than the cost of running the system blind on the new surface for N weeks?
- Can we treat the new surface as "out of distribution → defer to safe path" until labeled, instead of pretending the old model handles it?
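That last rubric point is cheap to operationalize: a routing check in front of the model that sends anything outside the labeled surface to a conservative fallback. A minimal sketch, assuming a request dict with `intent` and `region` fields — the intent names, region codes, and the `safe_fallback` handler are illustrative placeholders, not any particular system's API:

```python
# Sketch: defer out-of-distribution (unlabeled) surface to a safe path
# instead of letting the old model answer it silently.
KNOWN_INTENTS = {"track_order", "return_item", "product_question"}  # covered by the golden set
KNOWN_REGIONS = {"US", "DE", "FR"}                                  # labeled regions

def route(request: dict) -> str:
    """Decide which handler serves this request."""
    in_distribution = (
        request["intent"] in KNOWN_INTENTS
        and request["region"] in KNOWN_REGIONS
    )
    if in_distribution:
        return "model"          # old model + old golden set still apply
    # New surface (e.g. a JP launch or a new SKU type): defer until labeled.
    return "safe_fallback"      # canned answer, strict template, or human handoff

print(route({"intent": "return_item", "region": "JP"}))  # -> safe_fallback
```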
Axis 2 — Constraints change
Definition. Latency, cost, context-window budget, model size, compliance posture, or carbon budget tightens (or loosens). A response that scored as "correct" under the old constraint is unacceptable under the new one. The truth didn't change — the rubric did.
Tell. A finance/PM/SRE-driven change: "p95 must be ≤ 1.2 s", "drop Opus to Haiku", "GDPR audit means no PII echoes", "8K context only". The eval framework didn't change, but suddenly fewer answers pass.
Why it's hard. Most teams don't realize the eval rubric needs to move with the constraint. They re-run the old golden set against the new model and see a 40% regression that is half "model worse" and half "rubric got stricter." Disentangling those two requires re-anchoring the judge.
Failure shape. Either (a) you accept the regression because "the old eval said it's fine" and ship something users hate, or (b) you reject a perfectly good cheap model because the judge wasn't recalibrated for the cost-quality trade-off you actually wanted.
Decision rubric.
- Which axis of "correct" moved — factuality, latency, cost, safety, length, citation, or tone?
- Does the judge prompt explicitly encode the new constraint, or is it implicit in human reviewers' heads?
- Are golden answers still reachable under the new constraint? If not, some of them must be re-written, not just re-scored.
- Is there a tier (e.g., easy intents) where the old constraint still holds and only the rubric for hard intents moved?
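One way to make the moved rubric explicit is to gate quality and constraint in the same check, so an answer that is factually fine but blows the new budget fails visibly instead of passing on the old rubric. A minimal sketch — the 1.2 s and 350-token budgets and the 0.8 quality floor are illustrative assumptions, not recommendations:

```python
# Sketch: a pass/fail gate that encodes the new constraint next to the quality score,
# so "correct under the old rubric" can no longer hide a constraint violation.
MAX_LATENCY_S = 1.2   # new p95 budget (assumed)
MAX_TOKENS = 350      # new length budget (assumed)
MIN_QUALITY = 0.8     # judge score floor (assumed)

def passes(sample: dict) -> tuple[bool, list[str]]:
    """sample: {'quality': 0-1 judge score, 'latency_s': float, 'tokens': int}"""
    reasons = []
    if sample["quality"] < MIN_QUALITY:
        reasons.append("below quality floor")
    if sample["latency_s"] > MAX_LATENCY_S:
        reasons.append("over latency budget")
    if sample["tokens"] > MAX_TOKENS:
        reasons.append("over length budget")
    return (not reasons, reasons)

print(passes({"quality": 0.92, "latency_s": 2.4, "tokens": 120}))
# -> (False, ['over latency budget'])  — a "good" answer the new rubric must reject
```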
Axis 3 — Scale change
Definition. The data volume, query volume, user count, or long-tail surface grew (or shrank) by an order of magnitude. The labeled distribution is no longer representative of production. Tail behavior dominates in a way the golden set under-samples.
Tell. You launched a new market. You opened the bot to a wider cohort. The catalog 3×'d. p99 latency moves but p50 doesn't. Long-tail intents — which had 2% share — now have 12% share, and they are exactly the ones you don't have golden examples for.
Why it's hard. The golden set was sampled with assumptions about distribution that no longer hold. Stratifying the golden set by intent/entity/language has to be redone, and the labels for the long tail are precisely the labels you didn't have time to collect last quarter.
Failure shape. Aggregate metrics (recall@10, CSAT) drop slowly while specific cohorts (long-tail genres, new users, edge intents) experience much sharper drops that aggregate metrics smooth over.
Decision rubric.
- Which slice grew the most, and is it represented in the golden set?
- Is the eval set re-stratified to match production now, not production at training time?
- Are tail-cohort metrics tracked separately, or are they hidden inside aggregates?
- Can we sample new golden examples from production logs (with privacy filters) instead of synthesizing?
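Re-stratifying is mostly arithmetic once you have current production shares per slice. A minimal sketch with made-up intent names and shares — it sizes each eval slice to today's traffic and surfaces which slices need new labels:

```python
# Sketch: size each golden-set slice to current production share, then
# diff against what the golden set actually contains.
production_share = {            # measured from recent logs (illustrative numbers)
    "track_order": 0.40,
    "return_item": 0.30,
    "niche_genre_question": 0.12,   # the long tail that grew from 2% to 12%
    "other": 0.18,
}
EVAL_SET_SIZE = 500

target = {intent: round(share * EVAL_SET_SIZE) for intent, share in production_share.items()}
current = {"track_order": 260, "return_item": 190, "niche_genre_question": 10, "other": 40}

gaps = {i: target[i] - current.get(i, 0) for i in target if target[i] > current.get(i, 0)}
print(gaps)   # slices that are under-sampled relative to production today
```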
Axis 4 — Time / decay
Definition. Without anyone changing the schema, the constraint, or the volume, the ground truth simply ages. Taste changes, slang shifts, trends turn over, RAG documents go stale, fine-tuning drifts further from current data, embedded snapshots become wrong.
Tell. Online metrics decay slowly — 0.5% CSAT/month, 1% deflection/month — while offline metrics on the (frozen) golden set look stable. Nothing "broke." Every metric is within tolerance. Users are just quietly less happy.
Why it's hard. No event triggers attention. There's no PR, no incident, no announcement. Continuous re-labeling is expensive and feels low-priority because the system "still works."
Failure shape. A slow boil. By the time it's visible, six months of model behavior is calibrated against truth that no longer exists. Rebuilding takes another quarter.
Decision rubric.
- What is the half-life of correctness for this label? (Fresh trends: 1 hour. Taste: months. Sentiment: quarters. Order facts: never decay.)
- Does the golden set carry timestamps and a "last confirmed valid" date?
- Is there a continuous trickle of fresh labels (e.g., from production feedback / HITL) or only batched re-labeling campaigns?
- Are decay-aware metrics (rolling-window precision, time-bucketed CSAT) tracked?
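Timestamps plus a per-label half-life turn "is the golden set stale?" into a query instead of a feeling. A minimal sketch — the half-life values echo the rubric above and are assumptions, as are the field names:

```python
# Sketch: flag golden-set entries whose "last confirmed valid" date is older
# than the assumed half-life of correctness for their label type.
from datetime import datetime, timedelta

HALF_LIFE = {
    "trend": timedelta(hours=1),
    "taste": timedelta(days=90),
    "sentiment": timedelta(days=180),
    "order_fact": None,          # never decays
}

def stale_ids(golden_set: list[dict], now: datetime) -> list[str]:
    out = []
    for entry in golden_set:
        ttl = HALF_LIFE.get(entry["label_type"])
        if ttl is not None and now - entry["last_confirmed"] > ttl:
            out.append(entry["id"])
    return out

golden = [
    {"id": "g1", "label_type": "taste", "last_confirmed": datetime(2024, 1, 5)},
    {"id": "g2", "label_type": "order_fact", "last_confirmed": datetime(2022, 3, 1)},
]
print(stale_ids(golden, now=datetime(2024, 6, 1)))   # -> ['g1']
```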
Axis 5 — Adversary
Definition. Somebody is actively trying to make the label wrong. Prompt-injection authors, spam-review farms, fraudsters, scraper bots. Every defensive change provokes a counter-move. The label "this prompt is an attack" or "this review is spam" has an adversarial half-life measured in days, not months.
Tell. Detector recall drops sharply on a specific class while overall accuracy looks fine. New patterns appear in support tickets ("the bot leaked the system prompt"). External threat-intel feeds light up. Red-team exercises catch new attacks the production guardrails miss.
Why it's hard. Re-labeling alone isn't enough — you need to treat defense as a process: continuous red-teaming, fast-cycle label updates, and a guardrails layer that can be patched independently of the model. Static golden sets cannot keep up.
Failure shape. The rate-of-change of attacks exceeds the rate-of-change of defenses. Detection drops → an incident hits production → emergency patches go in → over-block / over-refusal → user complaints → relax → next attack.
Decision rubric.
- Is there a versioned attack catalog with a freshness SLA?
- Is the safety golden set additive (we never delete attacks) or rolling (we drop old ones)?
- Is the adversarial signal coming from real production, red team, or threat intel — and are all three feeding labels?
- Can defenses ship faster than the model retrains? (They must.)
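The attack-catalog and freshness-SLA points can be enforced mechanically in CI. A minimal sketch — the attack-class names, the append-only catalog shape, and the 14-day SLA are placeholders, not a standard:

```python
# Sketch: fail the build if any attack class has not received a fresh
# example (from production, red team, or threat intel) within its SLA.
from datetime import datetime, timedelta

FRESHNESS_SLA = timedelta(days=14)   # assumed SLA per attack class

catalog = [   # append-only, versioned entries (illustrative)
    {"attack_class": "prompt_injection",   "added": datetime(2024, 5, 28), "source": "red_team"},
    {"attack_class": "system_prompt_leak", "added": datetime(2024, 3, 2),  "source": "production"},
]

def stale_classes(entries: list[dict], now: datetime) -> list[str]:
    newest: dict[str, datetime] = {}
    for e in entries:
        prev = newest.get(e["attack_class"])
        if prev is None or e["added"] > prev:
            newest[e["attack_class"]] = e["added"]
    return [cls for cls, added in newest.items() if now - added > FRESHNESS_SLA]

print(stale_classes(catalog, now=datetime(2024, 6, 1)))  # -> ['system_prompt_leak']
```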
Combinations are the dangerous case
A real failure usually mixes axes, and one axis hides another:
| Combination | What it looks like | Why it's worse than the parts |
|---|---|---|
| Requirements + Scale | Adding manhwa just as catalog 3×'d | New surface arrives in a tail you can't see in the eval |
| Time + Adversary | Spam tactics evolve over months | You assume it's "natural" decay and miss the active attacker |
| Constraints + Time | You moved to Haiku and 4 months passed | Regression looks like a bad model but is half rubric-shift, half drift |
| Requirements + Adversary | New region opens with new attacker base | Defenders have neither labels nor threat intel for the new surface |
| Scale + Time | 10× users, 6 months later | Golden set is stale and mis-sampled — the worst combination |
The scenario files in this folder were chosen to show realistic combinations, not just clean single-axis cases.
Decision tree — which axis am I really hitting?
```mermaid
flowchart TD
A["Metric moved or<br/>complaint arrived"] --> B{"Did anything change<br/>upstream?"}
B -- "Yes: schema / policy /<br/>region / SKU" --> C["Requirements"]
B -- "Yes: latency / cost /<br/>model / context" --> D["Constraints"]
B -- "Yes: traffic / corpus /<br/>cohort growth" --> E["Scale"]
B -- "Nothing announced" --> F{"Is somebody<br/>actively pushing?"}
F -- "Yes (red team /<br/>support tickets / TI)" --> G["Adversary"]
F -- "No, slow drift" --> H["Time / decay"]
style C fill:#dbeafe,stroke:#1e40af,color:#111
style D fill:#dbeafe,stroke:#1e40af,color:#111
style E fill:#dbeafe,stroke:#1e40af,color:#111
style G fill:#fee2e2,stroke:#991b1b,color:#111
style H fill:#fef3c7,stroke:#92400e,color:#111
```
Use this when you walk into an incident or a regression and you're not sure where to look first. The answer dictates who you call (PM/legal vs SRE vs data vs safety) and what you fix (eval set vs judge prompt vs label refresh vs defense layer).
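The same triage can sit in an incident runbook as code. A minimal sketch of the tree above — the question keys and the return strings are placeholders for whatever your runbook actually names:

```python
# Sketch: the decision tree as an incident-triage helper.
def triage(upstream_change: str | None = None, active_adversary: bool = False) -> str:
    """upstream_change: one of 'schema', 'policy', 'region', 'sku', 'latency', 'cost',
    'model', 'context', 'traffic', 'corpus', 'cohort', or None if nothing was announced."""
    if upstream_change in {"schema", "policy", "region", "sku"}:
        return "Requirements — call PM/legal; extend the golden set to the new surface"
    if upstream_change in {"latency", "cost", "model", "context"}:
        return "Constraints — re-anchor the judge rubric before trusting the regression"
    if upstream_change in {"traffic", "corpus", "cohort"}:
        return "Scale — re-stratify the eval set; track tail cohorts separately"
    if active_adversary:
        return "Adversary — refresh the attack catalog; patch the guardrail layer"
    return "Time/decay — check golden-set timestamps; schedule a label refresh"

print(triage(upstream_change="model"))   # -> the Constraints branch
```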
How to read a scenario file in this folder
Every scenario file is structured the same way:
| Section | What you'll find | Why it's there |
|---|---|---|
| TL;DR | One paragraph: what changed, why old GT is wrong, fix shape | So you can read 30 scenarios in 20 minutes |
| Context & Trigger | The axis tag, the subsystem, the event | Anchors the failure to a real subsystem |
| The Old Ground Truth | What we used to call correct, and why that was reasonable | Honors that the old design wasn't dumb |
| The New Reality | What "correct" means now | The exact thing the eval/training is missing |
| Why Naive Approaches Fail | Why "just retrain" / "just re-label" / "just add a monitor" don't fix it | Pre-empts the obvious answer in interviews |
| Detection | Online + offline + distribution signals | Operational immediately |
| Architecture / Implementation Deep Dive | Mermaid + per-layer breakdown + code/config | The actual fix |
| Trade-offs & Alternatives | Latency × cost × label-quality × ops complexity | What you'd say in a design review |
| Production Pitfalls | What bites in week 3 | The lessons-only-by-experience layer |
| Interview Q&A Drill | Opening + 3–4 grills + 2 architect-level | Self-test, BMS / staff bar |
The Q&A drill is the most important section if you are using this folder for interview prep. Read the question, write your answer (out loud or in a doc), then compare to the model answer. The grills are calibrated to push you past the easy first answer.
Anti-patterns this folder pushes back on
- "Our golden set is the source of truth." No — the world is the source of truth. The golden set is a stale, sampled, possibly-mislabeled approximation. Treat it like cached data with a TTL, not like a database.
- "We re-evaluate quarterly so we're fine." Quarters are not a unit of time that adversaries respect. Trends decay in hours.
- "We have monitors." Monitors watch the metrics they were designed for. If the metric became wrong (constraint shift, scale shift), the monitor is now lying to you with high confidence.
- "Re-label everything." Re-labeling at scale is so expensive that this is rarely the answer. Smarter: stratify, sample, prioritize the changed slice, accept that some old labels stay outdated until production behavior says otherwise.
- "The judge model is correct." The judge is a model. It has its own drift. It needs its own golden set (humans on a calibration set). When you upgrade FMs, the judge often needs upgrading too — and its judge needs anchoring.
These anti-patterns are exactly what the scenario files are calibrated against.