GenAI Scenario 04 — FM Upgrade and Judge Recalibration
TL;DR
The platform is migrating the answering FM from Claude Sonnet 4.5 → 4.6 (and, in pilot, 4.7), and to control cost the answering tier for low-risk intents drops Sonnet → Haiku for ~60% of traffic. The eval framework re-runs the same golden set with the same LLM-as-judge prompt against the new model and reports a 12% factuality regression. Half of that is real model behavior change; the other half is the judge — which itself is a model — disagreeing with itself across the upgrade because the rubric encodes implicit constraints (verbosity, citation style, refusal threshold) that the new model interprets differently. Without re-anchoring the judge to a stable human-labeled calibration set, every model-comparison number you produce is a blend of "model changed" and "judge changed," and you can't tell which is which. The fix shape is judge re-anchoring as a first-class operation: its own calibration set, its own version, its own promotion gate, run before any application-model upgrade is evaluated.
Context & Trigger
- Axis of change: Constraints (a cost/latency-driven model substitution forces the rubric of "correct" to be re-examined; the truth-content didn't move but the threshold did).
- Subsystem affected: Eval framework, LLM-as-judge harness, model-promotion CI gate, and the tiered-model policy described across API-Design-and-Testing/02-api-testing-strategy.md, Model-Inference/06-model-evaluation-framework.md, and Optimization-Tradeoffs-User-Stories/US-02-llm-model-tiering-tradeoffs.md.
- Trigger event: Anthropic ships Claude 4.6, and the platform team plans an upgrade. The same week, finance asks for a 30% cost reduction on the answering tier. The natural response is "upgrade to 4.6 + drop low-risk intents to Haiku, run the golden set, ship if it passes." The first run shows a 12% factuality regression. Nobody can tell whether the model got worse or the rubric got harder.
The Old Ground Truth
The original eval setup:
- 500-prompt golden set (per 02-api-testing-strategy.md), with hand-curated reference answers and grading rubrics.
- LLM-as-judge prompt scoring on factuality, citation correctness, completeness, and tone — with the judge model pinned to Sonnet 4.5 (same model as the answering tier).
- A model promotion gate: judge.factuality ≥ 0.92 and judge.citation ≥ 0.95 for the new model on the golden set.
- Reasonable assumptions:
- The judge prompt is stable across model versions because it's "just instructions."
- When the answering model changes, only the answering model's behavior changes.
- The 500-prompt set captures the breadth of correctness.
What this design gets right: the structure (golden set + judge + gate). What it misses: the judge is a model, and changing anything (answering model, judge model, rubric, prompt template) changes the measurement.
The New Reality
Three things shifted simultaneously:
- Judge model itself updated. Bedrock auto-pushed a new minor version of Sonnet 4.5 mid-quarter — in the past, this kind of update was labeled "drop-in compatible," but for an LLM-as-judge it means the measurement instrument changed. The same judge prompt now scores differently.
- Tier policy changed. Low-risk intents now run on Haiku. Haiku is shorter, more concise, less hedged. The judge — calibrated against Sonnet-style answers — penalizes Haiku's terseness as "incomplete" even when factuality is identical.
- Application FM upgraded. Claude 4.6 produces longer, more cautious answers with more nuanced refusals. The judge — calibrated against 4.5-style refusals — sometimes scores 4.6's "I can confirm X but not Y" as "incomplete refusal" when it's actually a more correct partial answer.
The "12% factuality regression" decomposes (after investigation) into roughly: 4% real (4.6 hallucinated more on a specific cohort of catalog questions), 3% Haiku terseness penalty (factually correct but judge marked incomplete), 3% judge-version drift (the same prompt grading the same answer differently across judge versions), and 2% rubric implicit constraints (the rubric implied a verbosity floor that nobody wrote down).
You cannot ship the model change without disentangling these. You also cannot wait — finance is pressuring on cost.
Why Naive Approaches Fail
- "Re-run with the same judge." That's exactly the failure case. The judge is part of the measurement and it's drifted.
- "Use a different judge model." Trades one judge for another with a different bias. Without a calibration anchor, you don't know if the new judge is more or less correct.
- "Trust human review." Humans are slow and disagree with each other. Without a calibrated set of human labels, "human review" is no more anchored than the judge.
- "Just lower the gate threshold." Hides the regression instead of resolving it. You no longer know whether subsequent changes are improvements or further degradation.
- "Pin the judge model forever." Bedrock doesn't guarantee model versions are pinnable indefinitely; eventually the pinned version is deprecated. You need a re-anchoring discipline, not eternal pinning.
Detection — How You Notice the Shift
Online signals.
- CSAT compared between the two model variants (running side-by-side via tier shadow) doesn't track the offline judge delta. If the judge says the new model is 12% worse but CSAT is flat or up, the judge is the broken instrument.
- Customer complaints citing specific failure modes ("the bot didn't answer my question") — useful as true ground truth, but slow.
Offline signals (these are where the real diagnosis happens).
- Judge-vs-judge agreement — run the same answer through two judge model versions and measure the disagreement rate. If > 5%, the judge layer is unstable and you must re-anchor before trusting the metric.
- Judge-vs-human agreement on a calibration set — a fixed 200-prompt human-labeled set that you periodically rerun the judge against. If judge-human agreement falls below 0.90, the judge has drifted.
- Per-failure-class breakdown — when judge marks an answer as "incomplete," is the answer actually incomplete or just terse? When it marks as "incorrect," is the citation wrong or just stylistically different? Manual triage of 30 disagreement cases per regression event is the most diagnostic activity.
Distribution signals.
- Length-distribution diff between old model answers and new model answers. If new answers are systematically shorter (Haiku) or longer (4.6 hedging), and the judge's incompleteness rate moves in proportion, the judge is keying on length not content.
- Refusal-rate diff: 4.6 refuses with more nuance ("I can confirm X but not Y") which the judge sometimes scores differently than 4.5's full refusals.
Architecture / Implementation Deep Dive
```mermaid
flowchart TB
subgraph Calib["Calibration layer (NEW)"]
HUM["Human calibration set<br/>200 (prompt, ideal_answer, label)<br/>updated quarterly"]
ANCHOR["Anchor scores per (judge_model, judge_prompt)<br/>vs human labels"]
end
subgraph Judge["Judge tier"]
JV["Judge model<br/>pinned version + sha"]
JP["Judge prompt<br/>versioned"]
JE["Judge ensemble<br/>(2-3 judges + tiebreak)"]
end
subgraph App["Application models"]
OLD["Sonnet 4.5"]
NEW["Sonnet 4.6 / Haiku"]
EXP["4.7 pilot"]
end
subgraph Gate["Promotion CI"]
DIFF["Side-by-side eval<br/>old vs new × old vs new judge"]
DECOMP["Decomposed regression<br/>(model · rubric · judge · style)"]
PROMOTE["Block if model_real_delta < threshold<br/>OR if judge_drift > threshold"]
end
HUM --> ANCHOR --> JV
HUM --> ANCHOR --> JP
JV --> JE
JP --> JE
OLD --> DIFF
NEW --> DIFF
EXP --> DIFF
JE --> DIFF
DIFF --> DECOMP --> PROMOTE
style HUM fill:#fde68a,stroke:#92400e,color:#111
style ANCHOR fill:#dbeafe,stroke:#1e40af,color:#111
style DECOMP fill:#fee2e2,stroke:#991b1b,color:#111
style PROMOTE fill:#dcfce7,stroke:#166534,color:#111
```
1. Data layer — calibration set is not the golden set
Two distinct datasets:
- Golden set (~ 500–800 prompts): What the application model is evaluated against. Optimized for coverage of intents, edge cases, multilingual surface, etc.
- Calibration set (~ 200 prompts): What the judge is evaluated against. A subset of representative prompts, each with:
- 1 reference answer that is "ideally correct."
- 3 alternative answers labeled by humans on the rubric (factuality, citation, completeness, tone) — a "good," a "borderline," a "bad."
- Updated quarterly. The work is small but disciplined.
The judge is calibrated by computing its score distribution on the 200 × 4 = 800 (prompt, answer, human_label) tuples. The agreement metric is per-rubric-axis Spearman correlation between judge score and human ordinal label. The judge is "calibrated" if all axes are above 0.85.
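A minimal sketch of that agreement check, assuming calibration records are stored as (prompt, answer, human_labels) tuples and the judge is a callable returning per-axis scores in [0, 1] — the names and storage shape are illustrative, not the platform's actual schema:

```python
# Sketch: per-axis judge calibration against the human-labeled set.
# Each record is (prompt, answer, {axis: human_ordinal_label}); judge(prompt, answer)
# returns {axis: score in [0, 1]}. Names are illustrative assumptions.
from scipy.stats import spearmanr

AXES = ["factuality", "citation", "completeness", "tone"]
AGREEMENT_FLOOR = 0.85  # per-axis Spearman threshold from the text

def calibration_agreement(judge, calibration_records):
    """Per-axis Spearman correlation between judge scores and human ordinal labels."""
    judge_scores = {axis: [] for axis in AXES}
    human_labels = {axis: [] for axis in AXES}
    for prompt, answer, labels in calibration_records:
        scores = judge(prompt, answer)  # one judge call per (prompt, answer) tuple
        for axis in AXES:
            judge_scores[axis].append(scores[axis])
            human_labels[axis].append(labels[axis])
    return {axis: spearmanr(judge_scores[axis], human_labels[axis]).correlation
            for axis in AXES}

def is_calibrated(agreement):
    # "Calibrated" per the text: every axis clears the floor, not just the average.
    return all(corr >= AGREEMENT_FLOOR for corr in agreement.values())
```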
2. Pipeline layer — re-anchor before re-evaluate
Order of operations on any FM upgrade or judge upgrade:
1. Pin candidate judge_model + judge_prompt versions
2. Run judge against the calibration set
3. If agreement(judge, human) < threshold → DO NOT PROCEED. Tune judge prompt or substitute model.
4. Once judge is anchored, run application model against golden set with this judge
5. Compare to baseline application model with the SAME (newly-anchored) judge
6. Decompose regressions (see below)
7. Gate the promotion
Decomposition of a measured regression:
```python
# Naive delta as the eval run reports it: new answers scored by the new judge,
# baseline answers scored by the old judge.
total_delta = judge_new.score(new_answers) - judge_old.score(old_answers)

# Run judge_old against old_model_answers (cached)
# Run judge_new against old_model_answers
judge_drift = judge_new.score(old_answers) - judge_old.score(old_answers)

# Run judge_new against new_model_answers
real_model_delta = judge_new.score(new_answers) - judge_new.score(old_answers)

# total_delta = judge_drift + real_model_delta
```
This decomposition is the central operational discipline. Without it, every "12% regression" is unattributable.
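Sketched as code, under the assumption that per-prompt scores from each (judge, answer-set) pairing are cached as plain dicts (the storage shape is illustrative):

```python
# Sketch: attribute a measured delta to judge drift vs. real model change.
# Each argument is {prompt_id: factuality_score} produced by re-judging a
# cached answer set; the dict format is an assumption for illustration.
def mean(scores):
    return sum(scores.values()) / len(scores)

def decompose(judge_old_on_old, judge_new_on_old, judge_new_on_new):
    judge_drift = mean(judge_new_on_old) - mean(judge_old_on_old)       # instrument moved
    real_model_delta = mean(judge_new_on_new) - mean(judge_new_on_old)  # model moved
    total_delta = judge_drift + real_model_delta  # = new-judge-on-new minus old-judge-on-old
    return {"judge_drift": judge_drift,
            "real_model_delta": real_model_delta,
            "total_delta": total_delta}
```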
3. Serving layer — judge ensemble for ambiguous cases
Single judges are fragile. For ambiguous cases (where judge confidence is low), use a 3-judge ensemble:
- Primary: a calibrated mid-cost judge (e.g., Sonnet 4.6 with the anchored prompt).
- Secondary: a different family or different prompt template.
- Tiebreaker: a stricter, more expensive judge invoked only when primary and secondary disagree by > 0.2 on any axis.
The ensemble agreement metric is logged and tracked; growing disagreement is itself a signal that calibration is drifting.
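A sketch of the escalation logic; the 0.2 threshold comes from the list above, while the judge call signatures and the per-axis median tiebreak are assumptions:

```python
# Sketch: 2-judge ensemble with an expensive tiebreaker invoked on disagreement.
# Each judge returns {axis: score in [0, 1]}; call signatures are illustrative.
DISAGREE_THRESHOLD = 0.2  # per-axis gap that triggers the tiebreaker

def ensemble_score(primary, secondary, tiebreaker, prompt, answer):
    a = primary(prompt, answer)
    b = secondary(prompt, answer)
    disagreement = {axis: abs(a[axis] - b[axis]) for axis in a}
    if max(disagreement.values()) <= DISAGREE_THRESHOLD:
        verdict = {axis: (a[axis] + b[axis]) / 2 for axis in a}
        return {"verdict": verdict, "disagreement": disagreement, "escalated": False}
    # Stricter, more expensive judge only for the contested cases.
    c = tiebreaker(prompt, answer)
    verdict = {axis: sorted([a[axis], b[axis], c[axis]])[1] for axis in a}  # per-axis median
    return {"verdict": verdict, "disagreement": disagreement, "escalated": True}
```

Logging the per-prompt disagreement vector is what feeds the "growing disagreement" drift signal above.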
4. Governance — judge versions and rollback
Every eval run is logged with:
```json
{
  "run_id": "...",
  "application_model": {"name": "claude-4-6", "version_sha": "..."},
  "judge_model": {"name": "claude-4-6", "version_sha": "..."},
  "judge_prompt_version": "v17",
  "calibration_set_sha": "...",
  "calibration_agreement_at_run_time": 0.91,
  "golden_set_sha": "...",
  "results": {...}
}
```
If a regression is detected later that was due to judge drift, the team can roll back to the previous judge anchor (reproducible because everything is sha-pinned) and re-evaluate. Rollback at the judge layer is independent of rollback at the application layer — they are decoupled by design.
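A sketch of how the promotion gate might consume that run record — the decomposition field and the threshold values are assumptions layered on top of the logged schema above, not the team's actual numbers:

```python
# Sketch: promotion gate over a decomposed eval run. Field names mirror the run
# record above plus an assumed "decomposition" field; thresholds are placeholders.
MIN_CALIBRATION_AGREEMENT = 0.90
MAX_JUDGE_DRIFT = 0.02        # absolute drift measured on the cached old answers
MAX_REAL_REGRESSION = 0.01    # tolerated real factuality drop for the new model

def promotion_decision(run):
    reasons = []
    if run["calibration_agreement_at_run_time"] < MIN_CALIBRATION_AGREEMENT:
        reasons.append("judge not anchored: re-calibrate before evaluating")
    if abs(run["decomposition"]["judge_drift"]) > MAX_JUDGE_DRIFT:
        reasons.append("judge drift exceeds ceiling: scores are not comparable")
    if run["decomposition"]["real_model_delta"] < -MAX_REAL_REGRESSION:
        reasons.append("real model regression exceeds tolerance")
    return {"promote": not reasons, "reasons": reasons}
```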
Trade-offs & Alternatives Considered
| Approach | Disentangles judge from model | Cost | Complexity | Verdict |
|---|---|---|---|---|
| Pin judge model forever | Yes (until deprecated) | Low | Low | Brittle, defers the problem |
| Re-run golden set with same judge | No | Low | Low | Original — confounded |
| Use different judge per upgrade | No (no anchor) | Medium | Medium | Trades one bias for another |
| Calibration set + decomposition + ensemble | Yes | Medium-High | Medium-High | Chosen |
| Pure human eval | Yes | Very high | Low (operationally) | Doesn't scale past ~100 prompts/week |
| Pairwise model preference (no rubric) | Partial | Medium | Low | Loses axis-level diagnosis (factuality vs tone) |
The calibration discipline is the lift. Once in place, every future FM upgrade is a routine event instead of a forensic investigation.
Production Pitfalls
- Calibration agreement falls below threshold mid-quarter. Bedrock pushed a minor judge update. The fix is to re-tune the judge prompt or substitute model — don't ship application-model changes during this period because measurements are uninterpretable.
- The calibration set itself drifts. Hand-labeled four quarters ago, the human labels reflect old correctness norms. Refresh the human labels at least yearly; the calibration set is not a frozen artifact.
- Decomposition assumes cached old answers. If the answering pipeline (prompt template, retrieval) also changed, you can't isolate model from pipeline. Pin the whole stack during a model A/B; only one knob moves at a time.
- Tier mixing breaks per-prompt comparison. If 60% of prompts now run on Haiku (cost tier) and the comparison is "old all-Sonnet vs new mixed-tier," you're comparing two architectures, not two models. Run the same tier policy on both sides for clean comparison.
- Length-of-answer is a confound. Judge prompts that don't explicitly normalize for length end up rewarding verbose hedging. Add an explicit "do not penalize concise correct answers" instruction and validate it on the calibration set (Haiku-style terse answers should not score systematically lower than Sonnet-style on factually identical content); a sketch of that validation follows this list.
- The judge prompt itself is sacred and breaks easily. A small wording tweak ("rate the answer's accuracy" → "rate the answer's correctness") changes judge behavior nontrivially. Treat the judge prompt as code: version-controlled, PR-reviewed, calibration-tested before merge.
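One way to run the length-bias validation referenced above, assuming the calibration set carries pairs of terse and verbose answers that humans labeled as factually equivalent (the pairing and names are illustrative):

```python
# Sketch: detect a judge that keys on answer length rather than content.
# equivalent_pairs: list of (prompt, terse_answer, verbose_answer) where humans
# scored both answers identically; the structure is an assumption.
def length_bias_check(judge, equivalent_pairs, max_gap=0.05):
    gaps = []
    for prompt, terse, verbose in equivalent_pairs:
        terse_score = judge(prompt, terse)["completeness"]
        verbose_score = judge(prompt, verbose)["completeness"]
        gaps.append(verbose_score - terse_score)
    mean_gap = sum(gaps) / len(gaps)
    # A systematic positive gap means the judge rewards verbosity, not content.
    return {"mean_verbosity_bonus": mean_gap, "passes": mean_gap <= max_gap}
```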
Interview Q&A Drill
Opening question
You upgrade the answering model from Sonnet 4.5 to 4.6 and shift low-risk intents to Haiku for cost reasons. The eval framework reports a 12% factuality regression. Finance is pressuring you to ship cost savings. What do you do?
Model answer.
Don't ship and don't accept the regression at face value. The 12% is unattributable until decomposed. Step one is to suspect the judge, because two things changed simultaneously (application model + tier policy) and a third thing may have changed silently (judge model auto-update). The judge is a model, and a model that's measuring two other models that both changed is an unstable instrument.
Concretely: (1) Pin the judge model + judge prompt to a known sha. (2) Run the judge against a 200-prompt human-labeled calibration set; if judge-vs-human agreement is below 0.85 on any rubric axis, the judge is broken — fix it before doing anything else. (3) Decompose the 12% regression into judge-drift + model-real-delta + tier-style-effect by running the calibrated judge over old model answers and new model answers separately. (4) Ship only if model_real_delta is acceptable; the cost-saving move via Haiku is independently evaluable.
The cultural shift is treating the judge as a measurement instrument that requires its own calibration discipline, separate from the model-under-test. Without that, every regression number is a confounded quantity and you cannot make defensible promotion decisions.
Follow-up grill 1
Your "calibration set" has 200 prompts with human labels. A new judge prompt version comes through. How do you know it's actually better than the old one — beyond just "agreement went up"?
Three checks. (1) Per-axis agreement — agreement should improve or hold on every rubric axis (factuality, citation, completeness, tone), not just on average. A judge prompt that improves factuality agreement at the cost of completeness agreement may be a regression on the cases that matter most. (2) Disagreement diagnosis — for the cases where the old judge and new judge disagree, manually triage 30 of them to confirm the new judge's verdict is more aligned with the human label and not just "new judge changed its mind." If the new judge's wins are mostly cases where the old judge had a known bias (e.g., length penalty), that's signal. (3) Calibration on out-of-distribution prompts — keep a held-out 50-prompt set that the judge prompt was not tuned on. Agreement on the held-out set should also be above threshold; if the new prompt overfits to the calibration set, held-out agreement will be lower than calibration-set agreement.
Follow-up grill 2
You said "Bedrock auto-pushed a minor judge update." How do you actually pin the judge in a way that survives that?
Two layers. (1) Pin to an explicit versioned model ID where the platform supports it; in Bedrock that's a deliberate model ID with a version suffix when offered. (2) Where the platform doesn't expose that fidelity, the architectural protection is a judge-version-detection canary: a small tripwire test that runs every hour against a fixed prompt with a known expected score. If the score moves outside an envelope, the judge has changed under you and the eval CI flips into "do not trust judge" mode automatically. The canary doesn't prevent the change; it makes the change visible fast. The discipline is to halt model promotions during periods of unverified judge behavior.
In an ideal world, the platform would expose deterministic model versions for the judge tier indefinitely. In practice, periods of judge instability are real and the only protection is fast detection plus a strict "do not promote during instability" rule.
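A sketch of that tripwire, with the probe content elided and the envelope value as a placeholder:

```python
# Sketch: judge-drift canary, intended to run on a schedule (e.g. hourly).
# A fixed (prompt, answer) probe with a score anchored at calibration time;
# probe content, anchor, and envelope here are placeholders, not real values.
CANARY = {
    "prompt": "<fixed probe prompt>",
    "answer": "<fixed probe answer with a known anchored score>",
    "anchored_factuality": 0.96,
    "envelope": 0.03,  # tolerated absolute movement before we stop trusting the judge
}

def judge_canary(judge):
    score = judge(CANARY["prompt"], CANARY["answer"])["factuality"]
    if abs(score - CANARY["anchored_factuality"]) > CANARY["envelope"]:
        # Flip CI into "do not trust judge" mode; halt model promotions.
        return {"status": "JUDGE_DRIFT_SUSPECTED", "score": score}
    return {"status": "OK", "score": score}
```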
Follow-up grill 3
Your decomposition assumes you can re-run the old judge against the new model. What if the old judge is deprecated and you can't?
That's the scenario where a calibrated human label on the calibration set is your only fixed reference. If the old judge is gone, you cannot compute judge_old.score(new_answers) and the decomposition breaks. The fallback: anchor everything to human labels, not to any specific judge.
Practically: (1) Make sure the calibration set carries human labels, not just human pairwise preferences — the labels survive judge deprecation. (2) Express both the old judge and the new judge as a "distance to human label" on the calibration set. The new judge wins if its distance is smaller. The model upgrade is then evaluated on the golden set with the new judge, with the real-model delta expressed as "judge says new model is X% closer to ideal" — and ideal is defined by human labels on the calibration set transitively. (3) If the calibration set itself ages too much (humans changed their minds about ideal), refresh it. The calibration set is the slowest-moving anchor in the system; it's the one piece you cannot afford to lose.
Follow-up grill 4
Your judge ensemble has 3 judges. Two say the answer is good, one says it's bad. What's the verdict?
Don't auto-tiebreak. The disagreement itself is the diagnostic signal. The verdict is uncertain, and the per-prompt logging captures the disagreement vector. At gate-time, prompts where the ensemble disagrees by more than a threshold are excluded from the aggregate metric and flagged for human review. The aggregate is computed only over the high-agreement subset.
Why exclude rather than majority-vote? Because the disagreement carries information about where the rubric is fuzzy. If 30% of golden-set prompts produce judge disagreement, the rubric itself is under-specified for that intent class — that's the work to fix, not a tiebreak to compute. If only 2% disagree, those are the genuine edge cases and humans can review them in a tractable batch.
There's a natural objection: "but the gate needs a number." Right — the number is "agreement-eligible aggregate" and the gate is "agreement-eligible aggregate ≥ threshold AND disagreement rate ≤ ceiling." Two thresholds. Disagreement-rate creep is itself a metric you watch.
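Those two thresholds, sketched over per-prompt ensemble outputs (the cutoff values are placeholders; the record shape matches the ensemble sketch earlier):

```python
# Sketch: gate on the agreement-eligible aggregate plus a disagreement ceiling.
# per_prompt: list of {"verdict": {axis: score}, "disagreement": {axis: gap}}.
AGREEMENT_CUTOFF = 0.2       # max per-axis judge disagreement for a prompt to count
FACTUALITY_FLOOR = 0.92      # gate on the eligible aggregate (original gate value)
DISAGREEMENT_CEILING = 0.05  # max share of prompts the ensemble may disagree on

def gate(per_prompt):
    eligible = [p for p in per_prompt
                if max(p["disagreement"].values()) <= AGREEMENT_CUTOFF]
    disagreement_rate = 1 - len(eligible) / len(per_prompt)
    eligible_factuality = (sum(p["verdict"]["factuality"] for p in eligible)
                           / len(eligible)) if eligible else 0.0
    return {"pass": (eligible_factuality >= FACTUALITY_FLOOR
                     and disagreement_rate <= DISAGREEMENT_CEILING),
            "eligible_factuality": eligible_factuality,
            "disagreement_rate": disagreement_rate}
```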
Architect-level escalation 1
The platform is moving to a multi-model serving topology where Haiku handles 60% of intents, Sonnet handles 30%, and Opus handles 10%. Each tier produces stylistically distinct answers. Your single golden set with a single rubric was built for "one answer style fits all." How do you redesign?
Tier the rubric, not just the model. (1) Per-tier rubric variants. Haiku gets a rubric that explicitly says "concise correct answers are equally valid as longer ones; do not penalize for brevity," and Opus gets one that allows latitude for nuanced multi-part answers. The factuality and citation axes are constant; the completeness/tone axes are tier-aware. (2) Per-tier reference answers. The golden set's reference answers are written in tier-appropriate style. The judge's "ideal" answer for a factuality test on Haiku is short and direct; on Opus, it's allowed to be more thorough. The schema becomes (prompt, intent, tier, reference_answer, rubric_variant). (3) Cross-tier comparison stays valid for factuality, not for style. You can compare Haiku factuality to Sonnet factuality apples-to-apples; you cannot compare their tone scores apples-to-apples because the rubric differs. The aggregate "quality" metric is then either (a) per-tier and never aggregated, or (b) aggregated only on tier-invariant axes (factuality, citation).
The deeper architectural commitment: ground truth has tiers too. There is no single "correct answer" for "what's the return policy" — there's a Haiku-correct one (short, citation, single sentence) and an Opus-correct one (thorough, multi-clause, contextual). Treating them as the same answer space is the original mistake.
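The schema shift, sketched as a record type — the dataclass and field names are an illustrative storage choice, not the platform's actual golden-set format:

```python
# Sketch: tier-aware golden-set record. The (prompt, intent, tier,
# reference_answer, rubric_variant) shape comes from the text; everything
# else is an assumed representation.
from dataclasses import dataclass

TIER_INVARIANT_AXES = {"factuality", "citation"}   # comparable across tiers
TIER_AWARE_AXES = {"completeness", "tone"}         # only comparable within a tier

@dataclass
class GoldenRecord:
    prompt: str
    intent: str
    tier: str              # e.g. "haiku" | "sonnet" | "opus"
    reference_answer: str  # written in tier-appropriate style
    rubric_variant: str    # e.g. "concise-ok" for the cost tier

def cross_tier_aggregate(scored_records):
    """Aggregate only the axes that remain valid to compare across tiers."""
    return {axis: sum(r["scores"][axis] for r in scored_records) / len(scored_records)
            for axis in TIER_INVARIANT_AXES}
```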
Architect-level escalation 2
Your judge calibration discipline costs about $30K/quarter in human labeling and judge-recall API spend. Leadership wants to know if the cost is worth it. Make the case.
Two parts: cost of having calibration vs cost of not.
Cost of having. $30K/quarter ≈ $120K/year, plus 0.5 FTE oversight. The calibration set is 200 prompts × 4 answer variants × ~5 minutes of human time at vendor rates ≈ $4K/refresh × 4/year = $16K. The rest is judge-recall (running the calibration regularly to detect drift) and ensemble cost.
Cost of not. The 12% regression event in the question, if shipped without decomposition, would have either: (a) shipped a real regression of 4% factuality across millions of conversations — a measurable CSAT hit, escalation surge, and brand cost likely in the high six figures or low seven; (b) been caught in production by escalations, not in CI, with a multi-week delay before rollback could be designed; or (c) blocked a legitimate cost-saving Haiku tier shift because the team couldn't confidently disentangle its effects, costing the predicted $1–2M/year in inference savings.
The dollar case is straightforward: $120K/year of calibration buys the ability to make confident promotion decisions on FM upgrades that themselves move millions of dollars and millions of conversations. The calibration set is the cheapest part of the eval system; it is also the only part that turns judge numbers into decisions. Cutting it to save $120K is the textbook penny-wise-pound-foolish move, and I'd push back hard. The defensible cost reduction is at the golden set + judge ensemble layer (cache, sample, tier the judge), not at the calibration anchor.
Red-flag answers
- "Re-run the eval and ship if it passes." (Doesn't address judge drift.)
- "Pin the judge to Sonnet 4.5 forever." (Brittle; deprecation kills it.)
- "Lower the gate threshold." (Hides regressions.)
- "Use a stronger judge (Opus) and trust it." (No anchor; opaque cost.)
- "Trust customer escalations." (Slow, biased to the loud minority.)
Strong-answer indicators
- Names the judge as a model that itself drifts.
- Distinguishes calibration set from golden set.
- Decomposes regression into judge-drift + real-model-delta + tier-style.
- Knows pinning is partial protection; canary detection of judge drift is the operational lift.
- Treats the judge prompt as code (versioned, reviewed, calibration-tested).
- Has an opinion on cost: cuts come from judge tiering and caching, not from calibration anchor.