The Framework — Five Axes of Ground-Truth Change
Purpose. This file defines the shared vocabulary used by every scenario file in this folder. Each scenario tags itself with one or more axes of change — the force that made the old ground truth wrong. The axes are not mutually exclusive; the most dangerous failures combine two or three of them and look like one.
The five axes
```mermaid
flowchart TB
subgraph Pull["Pulls on Ground Truth"]
REQ["Requirements<br/>(product · legal · PM)"]
CON["Constraints<br/>(latency · cost · model)"]
SCA["Scale<br/>(corpus · users · tail)"]
TIM["Time<br/>(decay · staleness · trend)"]
ADV["Adversary<br/>(injection · spam · fraud)"]
end
GT["Ground-Truth Layer<br/>golden sets · judge rubrics<br/>labels · edges · facts"]
REQ --> GT
CON --> GT
SCA --> GT
TIM --> GT
ADV --> GT
GT --> M["Models / Retrievers / Guardrails"]
GT --> E["Eval / CI / Canary"]
GT --> O["Online Monitors"]
style GT fill:#fde68a,stroke:#92400e,color:#111
style REQ fill:#dbeafe,stroke:#1e40af,color:#111
style CON fill:#dbeafe,stroke:#1e40af,color:#111
style SCA fill:#dbeafe,stroke:#1e40af,color:#111
style TIM fill:#dbeafe,stroke:#1e40af,color:#111
style ADV fill:#fee2e2,stroke:#991b1b,color:#111
```
Axis 1 — Requirements change
Definition. The product surface area expanded or shifted. New region, new policy, new SKU type, new entity, new user intent, new compliance regime. The ground-truth schema either grew (new columns/labels) or the meaning of existing rows shifted.
Tell. Someone outside the ML/GenAI team announces a change ("we're launching JP", "legal updated the return policy", "we're adding manhwa"). PRs that change the catalog schema, the intent enum, or the policy document are the loudest signals. Quietest signal: a stakeholder asks "does the bot handle X?" — if X is new, your golden set almost certainly does not cover it.
Why it's hard. Re-labeling against a moved spec means re-reading the new spec, re-training annotators, and re-deciding which old labels are now wrong (not just which ones are missing). The "old labels are now wrong" subset is what gets skipped.
Failure shape. Coverage gap → blind spots → silent failures on the new surface area, while metrics on the old surface area keep looking fine. The eval harness can't see what it doesn't ask about.
Decision rubric.
- Did the schema change, the meaning change, or both?
- Are the new entities a strict superset (additive) or a partial overwrite (some old labels are now wrong)?
- Is the cost of re-labeling higher than the cost of running the system blind on the new surface for N weeks?
- Can we treat the new surface as "out of distribution → defer to safe path" until labeled, instead of pretending the old model handles it?
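That last rubric point is cheap to operationalize: a routing check in front of the model that sends anything outside the labeled surface to a conservative fallback. A minimal sketch, assuming a request dict with `intent` and `region` fields — the intent names, region codes, and the `safe_fallback` handler are illustrative placeholders, not any particular system's API:

```python
# Sketch: defer out-of-distribution (unlabeled) surface to a safe path
# instead of letting the old model answer it silently.
KNOWN_INTENTS = {"track_order", "return_item", "product_question"}  # covered by the golden set
KNOWN_REGIONS = {"US", "DE", "FR"}                                  # labeled regions

def route(request: dict) -> str:
    """Decide which handler serves this request."""
    in_distribution = (
        request["intent"] in KNOWN_INTENTS
        and request["region"] in KNOWN_REGIONS
    )
    if in_distribution:
        return "model"          # old model + old golden set still apply
    # New surface (e.g. a JP launch or a new SKU type): defer until labeled.
    return "safe_fallback"      # canned answer, strict template, or human handoff

print(route({"intent": "return_item", "region": "JP"}))  # -> safe_fallback
```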
Axis 2 — Constraints change
Definition. Latency, cost, context-window budget, model size, compliance posture, or carbon budget tightens (or loosens). A response that scored as "correct" under the old constraint is unacceptable under the new one. The truth didn't change — the rubric did.
Tell. A finance/PM/SRE-driven change: "p95 must be ≤ 1.2 s", "drop Opus to Haiku", "GDPR audit means no PII echoes", "8K context only". The eval framework didn't change, but suddenly fewer answers pass.
Why it's hard. Most teams don't realize the eval rubric needs to move with the constraint. They re-run the old golden set against the new model and see a 40% regression that is half "model worse" and half "rubric got stricter." Disentangling those two requires re-anchoring the judge.
Failure shape. Either (a) you accept the regression because "the old eval said it's fine" and ship something users hate, or (b) you reject a perfectly good cheap model because the judge wasn't recalibrated for the cost-quality trade-off you actually wanted.
Decision rubric.
- Which axis of "correct" moved — factuality, latency, cost, safety, length, citation, or tone?
- Does the judge prompt explicitly encode the new constraint, or is it implicit in human reviewers' heads?
- Are golden answers still reachable under the new constraint? If not, some of them must be re-written, not just re-scored.
- Is there a tier (e.g., easy intents) where the old constraint still holds and only the rubric for hard intents moved?
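One way to make the moved rubric explicit is to gate quality and constraint in the same check, so an answer that is factually fine but blows the new budget fails visibly instead of passing on the old rubric. A minimal sketch — the 1.2 s and 350-token budgets and the 0.8 quality floor are illustrative assumptions, not recommendations:

```python
# Sketch: a pass/fail gate that encodes the new constraint next to the quality score,
# so "correct under the old rubric" can no longer hide a constraint violation.
MAX_LATENCY_S = 1.2   # new p95 budget (assumed)
MAX_TOKENS = 350      # new length budget (assumed)
MIN_QUALITY = 0.8     # judge score floor (assumed)

def passes(sample: dict) -> tuple[bool, list[str]]:
    """sample: {'quality': 0-1 judge score, 'latency_s': float, 'tokens': int}"""
    reasons = []
    if sample["quality"] < MIN_QUALITY:
        reasons.append("below quality floor")
    if sample["latency_s"] > MAX_LATENCY_S:
        reasons.append("over latency budget")
    if sample["tokens"] > MAX_TOKENS:
        reasons.append("over length budget")
    return (not reasons, reasons)

print(passes({"quality": 0.92, "latency_s": 2.4, "tokens": 120}))
# -> (False, ['over latency budget'])  — a "good" answer the new rubric must reject
```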
Axis 3 — Scale change
Definition. The data volume, query volume, user count, or long-tail surface grew (or shrank) by an order of magnitude. The labeled distribution is no longer representative of production. Tail behavior dominates in a way the golden set under-samples.
Tell. You launched a new market. You opened the bot to a wider cohort. The catalog 3×'d. p99 latency moves but p50 doesn't. Long-tail intents — which had 2% share — now have 12% share, and they are exactly the ones you don't have golden examples for.
Why it's hard. The golden set was sampled with assumptions about distribution that no longer hold. Stratifying the golden set by intent/entity/language has to be redone, and the labels for the long tail are precisely the labels you didn't have time to collect last quarter.
Failure shape. Aggregate metrics (recall@10, CSAT) drop slowly while specific cohorts (long-tail genres, new users, edge intents) experience much sharper drops that aggregate metrics smooth over.
Decision rubric.
- Which slice grew the most, and is it represented in the golden set?
- Is the eval set re-stratified to match production now, not production at training time?
- Are tail-cohort metrics tracked separately, or are they hidden inside aggregates?
- Can we sample new golden examples from production logs (with privacy filters) instead of synthesizing?
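Re-stratifying is mostly arithmetic once you have current production shares per slice. A minimal sketch with made-up intent names and shares — it sizes each eval slice to today's traffic and surfaces which slices need new labels:

```python
# Sketch: size each golden-set slice to current production share, then
# diff against what the golden set actually contains.
production_share = {            # measured from recent logs (illustrative numbers)
    "track_order": 0.40,
    "return_item": 0.30,
    "niche_genre_question": 0.12,   # the long tail that grew from 2% to 12%
    "other": 0.18,
}
EVAL_SET_SIZE = 500

target = {intent: round(share * EVAL_SET_SIZE) for intent, share in production_share.items()}
current = {"track_order": 260, "return_item": 190, "niche_genre_question": 10, "other": 40}

gaps = {i: target[i] - current.get(i, 0) for i in target if target[i] > current.get(i, 0)}
print(gaps)   # slices that are under-sampled relative to production today
```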
Axis 4 — Time / decay
Definition. Without anyone changing the schema, the constraint, or the volume, the ground truth simply ages. Taste changes, slang shifts, trends turn over, RAG documents go stale, fine-tuning drifts further from current data, embedded snapshots become wrong.
Tell. Online metrics decay slowly — 0.5% CSAT/month, 1% deflection/month — while offline metrics on the (frozen) golden set look stable. Nothing "broke." Every metric is within tolerance. Users are just quietly less happy.
Why it's hard. No event triggers attention. There's no PR, no incident, no announcement. Continuous re-labeling is expensive and feels low-priority because the system "still works."
Failure shape. A slow boil. By the time it's visible, six months of model behavior is calibrated against truth that no longer exists. Rebuilding takes another quarter.
Decision rubric.
- What is the half-life of correctness for this label? (Fresh trends: 1 hour. Taste: months. Sentiment: quarters. Order facts: never decay.)
- Does the golden set carry timestamps and a "last confirmed valid" date?
- Is there a continuous trickle of fresh labels (e.g., from production feedback / HITL) or only batched re-labeling campaigns?
- Are decay-aware metrics (rolling-window precision, time-bucketed CSAT) tracked?
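Timestamps plus a per-label half-life turn "is the golden set stale?" into a query instead of a feeling. A minimal sketch — the half-life values echo the rubric above and are assumptions, as are the field names:

```python
# Sketch: flag golden-set entries whose "last confirmed valid" date is older
# than the assumed half-life of correctness for their label type.
from datetime import datetime, timedelta

HALF_LIFE = {
    "trend": timedelta(hours=1),
    "taste": timedelta(days=90),
    "sentiment": timedelta(days=180),
    "order_fact": None,          # never decays
}

def stale_ids(golden_set: list[dict], now: datetime) -> list[str]:
    out = []
    for entry in golden_set:
        ttl = HALF_LIFE.get(entry["label_type"])
        if ttl is not None and now - entry["last_confirmed"] > ttl:
            out.append(entry["id"])
    return out

golden = [
    {"id": "g1", "label_type": "taste", "last_confirmed": datetime(2024, 1, 5)},
    {"id": "g2", "label_type": "order_fact", "last_confirmed": datetime(2022, 3, 1)},
]
print(stale_ids(golden, now=datetime(2024, 6, 1)))   # -> ['g1']
```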
Axis 5 — Adversary
Definition. Somebody is actively trying to make the label wrong. Prompt-injection authors, spam-review farms, fraudsters, scraper bots. Every defensive change provokes a counter-move. The label "this prompt is an attack" or "this review is spam" has an adversarial half-life measured in days, not months.
Tell. Detector recall drops sharply on a specific class while overall accuracy looks fine. New patterns appear in support tickets ("the bot leaked the system prompt"). External threat-intel feeds light up. Red-team exercises catch new attacks the production guardrails miss.
Why it's hard. Re-labeling alone isn't enough — you need to treat defense as a process: continuous red-teaming, fast-cycle label updates, and a guardrails layer that can be patched independently of the model. Static golden sets cannot keep up.
Failure shape. The rate-of-change of attacks exceeds the rate-of-change of defenses. Detection drops → an incident hits production → emergency patches go in → over-block / over-refusal → user complaints → relax → next attack.
Decision rubric.
- Is there a versioned attack catalog with a freshness SLA?
- Is the safety golden set additive (we never delete attacks) or rolling (we drop old ones)?
- Is the adversarial signal coming from real production, red team, or threat intel — and are all three feeding labels?
- Can defenses ship faster than the model retrains? (They must.)
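The attack-catalog and freshness-SLA points can be enforced mechanically in CI. A minimal sketch — the attack-class names, the append-only catalog shape, and the 14-day SLA are placeholders, not a standard:

```python
# Sketch: fail the build if any attack class has not received a fresh
# example (from production, red team, or threat intel) within its SLA.
from datetime import datetime, timedelta

FRESHNESS_SLA = timedelta(days=14)   # assumed SLA per attack class

catalog = [   # append-only, versioned entries (illustrative)
    {"attack_class": "prompt_injection",   "added": datetime(2024, 5, 28), "source": "red_team"},
    {"attack_class": "system_prompt_leak", "added": datetime(2024, 3, 2),  "source": "production"},
]

def stale_classes(entries: list[dict], now: datetime) -> list[str]:
    newest: dict[str, datetime] = {}
    for e in entries:
        prev = newest.get(e["attack_class"])
        if prev is None or e["added"] > prev:
            newest[e["attack_class"]] = e["added"]
    return [cls for cls, added in newest.items() if now - added > FRESHNESS_SLA]

print(stale_classes(catalog, now=datetime(2024, 6, 1)))  # -> ['system_prompt_leak']
```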
Combinations are the dangerous case
A real failure usually mixes axes, and one axis hides another:
| Combination | What it looks like | Why it's worse than the parts |
|---|---|---|
| Requirements + Scale | Adding manhwa just as catalog 3×'d | New surface arrives in a tail you can't see in the eval |
| Time + Adversary | Spam tactics evolve over months | You assume it's "natural" decay and miss the active attacker |
| Constraints + Time | You moved to Haiku and 4 months passed | Regression looks like a bad model but is half rubric-shift, half drift |
| Requirements + Adversary | New region opens with new attacker base | Defenders have neither labels nor threat intel for the new surface |
| Scale + Time | 10× users, 6 months later | Golden set is stale and mis-sampled — the worst combination |
The scenario files in this folder were chosen to show realistic combinations, not just clean single-axis cases.
Decision tree — which axis am I really hitting?
```mermaid
flowchart TD
A["Metric moved or<br/>complaint arrived"] --> B{"Did anything change<br/>upstream?"}
B -- "Yes: schema / policy /<br/>region / SKU" --> C["Requirements"]
B -- "Yes: latency / cost /<br/>model / context" --> D["Constraints"]
B -- "Yes: traffic / corpus /<br/>cohort growth" --> E["Scale"]
B -- "Nothing announced" --> F{"Is somebody<br/>actively pushing?"}
F -- "Yes (red team /<br/>support tickets / TI)" --> G["Adversary"]
F -- "No, slow drift" --> H["Time / decay"]
style C fill:#dbeafe,stroke:#1e40af,color:#111
style D fill:#dbeafe,stroke:#1e40af,color:#111
style E fill:#dbeafe,stroke:#1e40af,color:#111
style G fill:#fee2e2,stroke:#991b1b,color:#111
style H fill:#fef3c7,stroke:#92400e,color:#111
```
Use this when you walk into an incident or a regression and you're not sure where to look first. The answer dictates who you call (PM/legal vs SRE vs data vs safety) and what you fix (eval set vs judge prompt vs label refresh vs defense layer).
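The same triage can sit in an incident runbook as code. A minimal sketch of the tree above — the question keys and the return strings are placeholders for whatever your runbook actually names:

```python
# Sketch: the decision tree as an incident-triage helper.
def triage(upstream_change: str | None = None, active_adversary: bool = False) -> str:
    """upstream_change: one of 'schema', 'policy', 'region', 'sku', 'latency', 'cost',
    'model', 'context', 'traffic', 'corpus', 'cohort', or None if nothing was announced."""
    if upstream_change in {"schema", "policy", "region", "sku"}:
        return "Requirements — call PM/legal; extend the golden set to the new surface"
    if upstream_change in {"latency", "cost", "model", "context"}:
        return "Constraints — re-anchor the judge rubric before trusting the regression"
    if upstream_change in {"traffic", "corpus", "cohort"}:
        return "Scale — re-stratify the eval set; track tail cohorts separately"
    if active_adversary:
        return "Adversary — refresh the attack catalog; patch the guardrail layer"
    return "Time/decay — check golden-set timestamps; schedule a label refresh"

print(triage(upstream_change="model"))   # -> the Constraints branch
```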
How to read a scenario file in this folder
Every scenario file is structured the same way:
| Section | What you'll find | Why it's there |
|---|---|---|
| TL;DR | One paragraph: what changed, why old GT is wrong, fix shape | So you can read 30 scenarios in 20 minutes |
| Context & Trigger | The axis tag, the subsystem, the event | Anchors the failure to a real subsystem |
| The Old Ground Truth | What we used to call correct, and why that was reasonable | Honors that the old design wasn't dumb |
| The New Reality | What "correct" means now | The exact thing the eval/training is missing |
| Why Naive Approaches Fail | Why "just retrain" / "just re-label" / "just add a monitor" don't fix it | Pre-empts the obvious answer in interviews |
| Detection | Online + offline + distribution signals | Operational immediately |
| Architecture / Implementation Deep Dive | Mermaid + per-layer breakdown + code/config | The actual fix |
| Trade-offs & Alternatives | Latency × cost × label-quality × ops complexity | What you'd say in a design review |
| Production Pitfalls | What bites in week 3 | The lessons-only-by-experience layer |
| Interview Q&A Drill | Opening + 3–4 grills + 2 architect-level | Self-test, BMS / staff bar |
The Q&A drill is the most important section if you are using this folder for interview prep. Read the question, write your answer (out loud or in a doc), then compare to the model answer. The grills are calibrated to push you past the easy first answer.
Anti-patterns this folder pushes back on
- "Our golden set is the source of truth." No — the world is the source of truth. The golden set is a stale, sampled, possibly-mislabeled approximation. Treat it like cached data with a TTL, not like a database.
- "We re-evaluate quarterly so we're fine." Quarters are not a unit of time that adversaries respect. Trends decay in hours.
- "We have monitors." Monitors watch the metrics they were designed for. If the metric became wrong (constraint shift, scale shift), the monitor is now lying to you with high confidence.
- "Re-label everything." Re-labeling at scale is so expensive that this is rarely the answer. Smarter: stratify, sample, prioritize the changed slice, accept that some old labels stay outdated until production behavior says otherwise.
- "The judge model is correct." The judge is a model. It has its own drift. It needs its own golden set (humans on a calibration set). When you upgrade FMs, the judge often needs upgrading too — and its judge needs anchoring.
These anti-patterns are exactly what the scenario files are calibrated against.