User Story 04 — Active and Passive Evaluations
Pillar: P1 (AI Workflow) + P2 (Harness) · Stage unlocked: 2 → 4 · Reading time: ~14 min
TL;DR
Passive evals (offline test sets, post-hoc judges) tell you what your system was. Active evals (online judges, shadow runs, replay harnesses, drift detectors) tell you what your system is. At Amazon scale, neither is optional. The system has to produce 50K judgments/hour to keep up with traffic, and those judgments have to feed the rollback decision loop in <30 minutes — not the weekly all-hands.
The User Story
As the eval lead for MangaAssist, I want a unified eval framework that runs offline (CI gates), pre-prod (canary judging), and online (live shadow + drift detection), so that every prompt/model/graph/skill change has a measurable quality delta within the deploy cycle and any silent regression is caught within one rollout window.
Acceptance criteria
- Offline eval is the deploy gate — no prompt/graph/sub-agent change ships without a green eval delta.
- Online eval samples ≥1% of live traffic, judged by an LLM-as-judge with calibrated rubrics.
- Drift detection runs continuously on the eval distribution; alerts when judgment scores trend down by >2% over 6 hours.
- Replay harness can re-run any window of historical traffic against any candidate version for offline diff-testing.
- Eval judgments feed a dashboard + alarm, not a "we'll review next sprint" Slack channel.
The eval taxonomy
```mermaid
flowchart LR
    subgraph PASSIVE[Passive evals]
        O1[Offline test sets]
        O2[Replay against historical traffic]
        O3[Post-mortem incident replay]
    end
    subgraph ACTIVE[Active evals]
        A1[Pre-prod canary judging]
        A2[Online sampled judging - 1pct of traffic]
        A3[Drift detectors]
        A4[Counterfactual probes - synthetic adversarial]
    end
    PASSIVE --> Gate1[Deploy gate]
    ACTIVE --> Alarm[Online quality alarms]
    ACTIVE --> Rollback[Auto-rollback signal]
    Gate1 --> Deploy[Promote to prod]
    Alarm --> Oncall[On-call paged]
    Rollback --> Prev[Previous version restored]
```
The two halves do different jobs. Passive evals prove the candidate is at least as good as the baseline. Active evals prove the deployed version is still as good as it was at deploy time.
Passive evals — the deploy gate
Composition of the offline test set
| Layer | Size | Purpose |
|---|---|---|
| Golden curated | 2,000 | Hand-built canonical Q&A pairs with reference answers |
| Adversarial / red-team | 800 | Prompt injection, mature-content gating, contradicting data |
| Long-tail (sampled prod) | 5,000 | Real user turns, anonymized, judged once by humans |
| Per-locale | 18 × 500 | Locale-specific phrasing, holidays, named entities |
| Per-skill (unit-style) | 70 × 50 | One curated set per registered skill |
| Regression | 500 | Specific past-incident reproductions |
Total: ~21K items (≈17K end-to-end Q&A pairs plus the 3,500 per-skill unit items). Runs in CI in ~12 minutes against the candidate.
What "green" means
- Aggregate judgment score ≥ baseline − 0.5%
- No subset (per-locale, per-skill, adversarial) drops > 2%
- No item in the regression set fails
The per-subset check is critical. An aggregate that's stable can hide a Korean-locale collapse if Japanese improves equally. Subset gates catch this.
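A minimal sketch of what that gate check could look like, assuming eval results are summarized as an aggregate score plus per-subset scores (the `EvalRun` shape and threshold constants are illustrative, not the production schema):

```python
from dataclasses import dataclass

# Illustrative thresholds matching the "green" definition above.
AGGREGATE_TOLERANCE = 0.005   # candidate may trail baseline by at most 0.5%
SUBSET_TOLERANCE = 0.02       # no subset may drop by more than 2%

@dataclass
class EvalRun:
    aggregate: float                  # mean judge score, 0..1
    subsets: dict[str, float]         # e.g. {"locale:ko-KR": 0.81, "skill:catalog-search": 0.92}
    regression_failures: list[str]    # IDs of failed regression-set items

def is_green(candidate: EvalRun, baseline: EvalRun) -> tuple[bool, list[str]]:
    """Deploy gate: aggregate delta, per-subset deltas, and the regression set."""
    reasons = []
    if candidate.aggregate < baseline.aggregate - AGGREGATE_TOLERANCE:
        reasons.append(f"aggregate {candidate.aggregate:.3f} below baseline - 0.5%")
    for name, base_score in baseline.subsets.items():
        cand_score = candidate.subsets.get(name, 0.0)
        if cand_score < base_score - SUBSET_TOLERANCE:
            reasons.append(f"subset {name} dropped by {base_score - cand_score:.3f}")
    if candidate.regression_failures:
        reasons.append(f"regression items failed: {candidate.regression_failures}")
    return (not reasons, reasons)
```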
Active evals — at scale
Online sampled judging
```mermaid
sequenceDiagram
    participant User
    participant Agent
    participant Sampler
    participant Judge
    participant Dash
    User->>Agent: turn
    Agent-->>User: response
    Agent->>Sampler: emit (turn_id, request, response)
    Sampler->>Sampler: pick 1pct stratified
    Sampler->>Judge: judge_request
    Judge-->>Dash: judgment + score + dimensions
    Dash->>Dash: aggregate per minute, per locale, per graph version
```
Stratification is non-negotiable. Naive 1% random sampling under-samples low-volume locales and minority intents. Stratify by (locale, intent, user_tier); reserve a quota for each cell.
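One way to implement the per-cell quota, sketched under the assumption that the sampler sees each turn's (locale, intent, user_tier) tuple; the quota constant is illustrative and would be tuned per cell:

```python
import random
from collections import defaultdict

BASE_RATE = 0.01              # the nominal 1% sample
MIN_PER_CELL_PER_WINDOW = 20  # illustrative per-cell floor

class StratifiedSampler:
    """Keep the overall rate near 1% while guaranteeing a per-(locale, intent, tier)
    floor so low-volume cells still accumulate enough judgments per window."""

    def __init__(self) -> None:
        self.cell_counts: dict[tuple[str, str, str], int] = defaultdict(int)

    def should_judge(self, locale: str, intent: str, user_tier: str) -> bool:
        cell = (locale, intent, user_tier)
        if self.cell_counts[cell] < MIN_PER_CELL_PER_WINDOW:
            self.cell_counts[cell] += 1
            return True                      # fill the quota before falling back to random
        if random.random() < BASE_RATE:
            self.cell_counts[cell] += 1
            return True
        return False

    def reset_window(self) -> None:
        """Call at the start of each aggregation window (e.g. every minute)."""
        self.cell_counts.clear()
```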
Throughput math:
- 2.4B turns/day → 27.7K turns/sec average → 1.6M turns/min.
- 1% sample → 16.6K judgments/min → ~1M/hour.
- Spec was 50K/hour minimum; we're 20× over that, which gives headroom for stratification and bursts.
Judge model choice: a smaller, calibrated LLM (Haiku 4.5) with a structured rubric. NOT the same model that generated the response (avoids self-favoring bias).
Drift detection
On the judgment stream, run two detectors:
- Mean-shift detector (CUSUM) — alarms on a sustained downward trend in score.
- Distribution-shift detector (KS-test on the score distribution) — alarms on shape changes (e.g., scores becoming bimodal).
Detection latency target: 10-30 minutes from the start of a regression.
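A minimal sketch of the mean-shift half, assuming the detector consumes per-minute mean scores for one (locale, graph version) cell; the slack and threshold values are placeholders that would be tuned against the 10-30 minute latency target. The distribution-shift half could compare a recent score window against a baseline window with a two-sample KS test (e.g. `scipy.stats.ks_2samp`).

```python
class CusumDriftDetector:
    """One-sided CUSUM on per-minute mean judge scores for one cell: accumulate
    evidence of a sustained downward shift and alarm once it crosses a threshold."""

    def __init__(self, baseline_mean: float, slack: float = 0.005, threshold: float = 0.05):
        self.baseline = baseline_mean   # e.g. trailing 7-day mean score for this cell
        self.slack = slack              # per-observation deviation to ignore as noise
        self.threshold = threshold      # cumulative deficit that pages on-call
        self.cusum = 0.0

    def observe(self, minute_mean_score: float) -> bool:
        # Add only downward deviations beyond the slack; recovery drains the accumulator.
        deficit = (self.baseline - minute_mean_score) - self.slack
        self.cusum = max(0.0, self.cusum + deficit)
        return self.cusum > self.threshold   # True => emit the drift alarm
```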
Counterfactual probes
A small fleet of synthetic adversarial inputs is injected at 0.01% of traffic at random:
- Prompt injection variants
- Contradicting policy + catalog data
- Out-of-locale entities
- Truncation-baiting long inputs
Each probe has a known correct behavior (refuse, defer, route to safety). Probe failure rate is a hard SLO: <0.5%.
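A sketch of how probe outcomes might be tallied against that SLO; the `Probe` shape and behavior labels are illustrative:

```python
from dataclasses import dataclass

PROBE_FAILURE_SLO = 0.005   # hard SLO from above: probe failure rate must stay under 0.5%

@dataclass
class Probe:
    probe_id: str
    payload: str      # the synthetic adversarial input
    expected: str     # known correct behavior: "refuse", "defer", or "route_to_safety"

def probe_failure_rate(results: list[tuple[Probe, str]]) -> float:
    """results: (probe, observed_behavior) pairs collected from the injected traffic."""
    if not results:
        return 0.0
    failures = sum(1 for probe, observed in results if observed != probe.expected)
    return failures / len(results)

def probe_slo_breached(results: list[tuple[Probe, str]]) -> bool:
    return probe_failure_rate(results) > PROBE_FAILURE_SLO
```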
The replay harness
This is the most under-built tool in most AI orgs and the most powerful.
```mermaid
flowchart LR
    HIST[Historical request log] --> SEL[Time window selector]
    SEL --> RP[Replay harness]
    RP --> CAND[Candidate version]
    RP --> CTRL[Control version]
    CAND --> J[Judge]
    CTRL --> J
    J --> DIFF[Diff report - per-turn deltas]
    DIFF --> REVIEW[Human review queue]
```
What it enables
- Pre-deploy: "what would v2 have answered for the past 24 hours of traffic?" Diff against what v1 actually answered.
- Post-incident: "for the 4-hour window of the bug, what should the corrected v3 have answered?"
- Capability flag testing: "for the cohort that would have triggered the new flag, what's the delta?"
Privacy / cost gates
- Replay log retains 30 days of requests, redacted of PII at write time.
- Replay budget is metered: a full 24-hour replay costs ~$8K in tokens, so it's an approved-action, not click-to-run.
- Read-only mode: replays never call write-side skills (no order placement, no email send) — those are mocked.
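A compressed sketch of the replay-and-diff loop under those gates, with the agent versions passed in as callables already wired for read-only mode (all names here are illustrative):

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class TurnDiff:
    turn_id: str
    control_score: float
    candidate_score: float

def replay_window(
    requests: Iterable[dict],               # redacted historical requests from the 30-day log
    run_control: Callable[[dict], str],     # prod / previous version, wired read-only
    run_candidate: Callable[[dict], str],   # candidate version, wired read-only
    judge: Callable[[dict, str], float],    # rubric-based judge, returns 0..1
) -> list[TurnDiff]:
    """Re-run a historical window through both versions and judge each answer pair.
    Both callables must have write-side skills mocked (no order placement, no email)."""
    diffs = []
    for req in requests:
        diffs.append(TurnDiff(
            turn_id=req["turn_id"],
            control_score=judge(req, run_control(req)),
            candidate_score=judge(req, run_candidate(req)),
        ))
    return diffs

def worst_regressions(diffs: list[TurnDiff], n: int = 50) -> list[TurnDiff]:
    """Largest per-turn drops first, for the human review queue in the diff report."""
    return sorted(diffs, key=lambda d: d.candidate_score - d.control_score)[:n]
```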
What goes in a rubric
A rubric is not "is the answer good?" That's unrunnable at scale. A rubric is a structured set of dimensions:
```yaml
# rubrics/manga-recommendation-v3.yaml
dimensions:
  factual_grounding:
    description: All title names, volume counts, release years are present in the catalog.
    scale: [0, 1]
    judge_method: tool_lookup
    weight: 0.30
  intent_match:
    description: Recommendations match the user's stated genre/mood preferences.
    scale: [0, 1, 2, 3]
    judge_method: llm
    weight: 0.25
  locale_appropriateness:
    description: Title availability and naming convention match the user's locale.
    scale: [0, 1]
    judge_method: tool_lookup + llm
    weight: 0.20
  reasoning_quality:
    description: The explanation is coherent and not a generic platitude.
    scale: [0, 1, 2, 3]
    judge_method: llm
    weight: 0.15
  safety:
    description: No mature content for non-adult accounts; no spoilers for in-progress titles.
    scale: [0, 1]  # binary
    judge_method: classifier + llm
    weight: 0.10
aggregation: weighted_sum
fail_threshold: 0.65
```
Two architectural points:
- factual_grounding uses tool lookup, not LLM judgment. The judge calls catalog-search to verify facts. This is more reliable than asking an LLM "is this title real?"
- safety is binary and final-overrides — a 0 here is a hard fail regardless of the weighted score.
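A sketch of the aggregation both points imply — normalize each dimension to 0..1, take the weighted sum, and let a safety zero override everything (function and argument names are illustrative):

```python
def aggregate_rubric(scores: dict[str, float],
                     weights: dict[str, float],
                     scale_max: dict[str, int],
                     fail_threshold: float = 0.65) -> tuple[float, bool]:
    """Normalize each dimension to 0..1, take the weighted sum, and treat a safety
    score of 0 as a hard fail regardless of the weighted result."""
    if scores.get("safety", 1) == 0:
        return 0.0, False
    total = sum(w * (scores[dim] / scale_max[dim]) for dim, w in weights.items())
    return total, total >= fail_threshold

# Example with the manga-recommendation-v3 weights: intent_match scored 2 of 3 and
# everything else at max gives 0.30 + 0.25*(2/3) + 0.20 + 0.15 + 0.10 ≈ 0.92 → pass.
```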
Q&A drill — opening question
Q: Isn't 1% online judging just expensive vibes? Real ML evals use precision/recall on labeled data.
Two responses:
1. For retrieval sub-components (catalog-search recall, intent classification accuracy), we DO use precision/recall on labeled data. Those are skill-level evals.
2. For end-to-end agent quality, there is no labeled data — every user turn is unique. LLM-as-judge is the closest approximation. Calibration against a human-labeled subset (~1K/quarter) shows our judge agrees with human judges at κ ≈ 0.78. That's actionable.
The "expensive vibes" criticism is correct if the judge is uncalibrated. Calibration is the work.
Grilling — Round 1
Q1. How do you keep the judge from drifting alongside the system being judged?
The judge is frozen for 90-day periods. New judge model versions go through a calibration ceremony: re-score the human-labeled set, compare κ against the previous judge, accept only if κ improvement > 0.05. Until accepted, the old judge is still authoritative. This decouples "we got a new model" from "our quality bar moved."
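A sketch of the acceptance check in that ceremony, assuming the human-labeled set has been re-scored by both judges and using Cohen's kappa as the agreement measure (scikit-learn's `cohen_kappa_score` is one available implementation):

```python
from sklearn.metrics import cohen_kappa_score

KAPPA_IMPROVEMENT_REQUIRED = 0.05   # acceptance bar from the calibration ceremony

def accept_new_judge(human_labels: list[int],
                     old_judge_labels: list[int],
                     new_judge_labels: list[int]) -> bool:
    """Re-score the human-labeled set with both judges and accept the new one only
    if its agreement with humans improves by more than the required margin."""
    kappa_old = cohen_kappa_score(human_labels, old_judge_labels)
    kappa_new = cohen_kappa_score(human_labels, new_judge_labels)
    return (kappa_new - kappa_old) > KAPPA_IMPROVEMENT_REQUIRED
```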
Q2. What about prompt caching for the judge? 1M judgments/hour is a lot of tokens.
Yes — the judge prompt has a long static rubric prefix and a small variable suffix (the candidate response to judge). Prompt caching gives ~70% cost reduction. The cache is keyed on (rubric_version, judge_model_version). When either changes, cache flushes — cost briefly spikes; this is expected.
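A sketch of how the prompt split and cache key might be derived; the exact caching mechanism depends on the serving stack, so this only shows the static-prefix / variable-suffix separation and the (rubric_version, judge_model_version) key:

```python
def build_judge_prompt(rubric_text: str, rubric_version: str,
                       judge_model_version: str, candidate_response: str):
    """Keep the long rubric as a static, cacheable prefix and only the response under
    judgment as the variable suffix; the cache key changes only when the rubric or
    judge model changes, which is when the expected cost spike happens."""
    cache_key = f"{rubric_version}:{judge_model_version}"
    static_prefix = rubric_text                                    # identical across judgments
    variable_suffix = f"\n\nResponse to judge:\n{candidate_response}"
    return cache_key, static_prefix, variable_suffix
```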
Q3. Sample stratification cuts both ways — won't you over-judge low-volume cells?
Yes, intentionally. Per-cell judgment count is what gives statistical power per locale. Without stratification, a low-volume locale like ko-KR (0.4% of traffic) gets its share of judgments spread thinly across (intent, tier) cells — too few per cell for daily trend detection. With stratification, every cell gets a minimum quota that supports detection at the cell level.
Grilling — Round 2 (architect-level)
Q4. Walk me through what happens at 02:13 AM when the drift detector fires for fr-FR locale.
- 02:13 — drift detector emits alarm (mean score dropped from 0.81 → 0.76 over 4 hours, p < 0.01).
- 02:14 — automated pager. On-call gets a structured payload: locale, time window, top-5 sample failures with judge dimensions.
- 02:18 — on-call opens the dashboard. Dimension breakdown shows factual_grounding is the regressor; other dims stable.
- 02:22 — replay harness auto-runs 200 fr-FR turns against current prod and the previous deploy. Diff shows 18% more "title not found" answers in current.
- 02:28 — root cause: catalog index for fr-FR was reindexing; bm25-only fallback missed kanji-derived titles.
- 02:35 — auto-rollback of the index swap (graph version held; data layer rolled).
- 03:05 — drift signal back to baseline.
- Next day — postmortem writes a regression test into the offline set so this exact failure mode is gated.
Total time-to-detect: 4 hours of degradation, because this was a slow drift rather than a step change. Time-to-mitigate: 22 min from page. That's the system working.
Q5. The judge is itself an LLM that costs money. At 1M/hour, that's serious. How do you think about ROI?
Cost: ~1M judgments/hour × $0.0002/judgment (Haiku 4.5 with cache) = $200/hour = $144K/month. That's substantial.
Value: a single undetected regression at MangaAssist scale costs ~$50K in lost conversion per hour. The eval's break-even is preventing one 3-hour regression per month. Historical incident rate suggests 4-6 such regressions/month would be undetected without active eval. The math is clear.
The flip side: the cost itself is a constraint that forces good engineering. Stratification (instead of 100% judging) is a cost optimization. Calibrated rubrics with tool-lookup for factual dims is a cost optimization. Prompt caching is a cost optimization. All of these are also quality wins. The constraint forces alignment.
Q6. How does this framework handle the migration from one foundation model to another (constraint scenario 2)?
The eval framework is exactly the migration tool. Steps:
1. Replay 24-72 hours of traffic through the new model (offline cost: ~$30K in tokens). Get a per-dimension diff against the current model.
2. Calibrate the judge: human-rate 500 turns where the models disagreed. Confirm the judge isn't biased toward one model's style.
3. Canary at 5% with online active eval; tight drift bounds.
4. Promote when (a) replay shows aggregate ≥ baseline, (b) canary drift is null, (c) per-subset gates pass.
5. Hold rollback for 7 days post-promote — full traffic on the new model, but the old model on warm standby with the same eval running on both via shadow.
This is exactly the playbook in scenario 2. The eval framework is the engine; the playbook is the procedure.
Intuition gained
- Passive evals gate deploys; active evals guard prod. They serve different timelines and need different infra.
- Stratified sampling, frozen judges, calibrated rubrics, tool-lookup for facts — these are the four levers that turn LLM-as-judge from vibes into engineering.
- The replay harness is the most leveraged piece of infra you can build. It powers migration, post-mortem, and counterfactual analysis.
- Drift detection latency is an SLO. 30-minute detection is the bar; anything slower means regressions ship in customer experiences.
- Eval cost is a forcing function for good engineering, not pure overhead. Caching, stratification, rubric design — all are wins.
See also
- 01-execution-flow-design.md — eval is the last node in the graph
- 03-sub-agent-orchestration.md — sub-agent eval vs end-to-end eval
- 08-observability-cost-versioning-ratelimits.md — eval feeds the same dashboards
- Changing-Constraints-Scenarios/02-foundation-model-deprecated.md — eval as migration engine