
06. Model Evaluation Framework — End-to-End Quality Gates

"A model that isn't evaluated rigorously is a liability, not an asset. I built a 4-layer evaluation framework: golden datasets for offline regression, shadow mode for safe transitions, canary deployments for controlled rollouts, and continuous monitoring for long-term health. No model change reached 100% of users without passing all four layers."


Evaluation Architecture Overview

```mermaid
graph TD
    A[Model or Prompt Change] --> B[Layer 1: Offline Evaluation<br>Golden Dataset - 500 queries]
    B -->|Pass| C[Layer 2: Shadow Mode<br>1 week parallel run]
    B -->|Fail| X[Block Deployment<br>Alert DS + Engineering]
    C -->|Pass| D[Layer 3: Canary Deployment<br>1% traffic, 24 hours]
    C -->|Fail| X
    D -->|Pass| E[Layer 4: Gradual Rollout<br>1% → 10% → 50% → 100%]
    D -->|Fail| X
    E --> F[Continuous Monitoring<br>Real-time + weekly reviews]
    F -->|Regression| G[Automatic Rollback<br>+ Incident Review]

    style X fill:#FF6B6B
    style G fill:#FF6B6B
```

Evaluation Layer Comparison

| Layer | What It Catches | Cost | Duration | Confidence Level | Can Auto-Rollback? |
|---|---|---|---|---|---|
| 1. Golden Dataset | Quality regressions, format violations, intent accuracy drop | ~$15 (500 LLM calls) | 25 minutes | Medium (offline only) | N/A (blocks PR merge) |
| 2. Shadow Mode | Behavioral changes, response drift, style shifts | ~$31.5K/week (doubles LLM cost) | 3-7 days | High (real traffic, no user impact) | No (manual decision) |
| 3. Canary | User-facing regressions (escalation, thumbs down) | Minimal (1% traffic overhead) | 24-48 hours | High (real users, real metrics) | Yes (auto-rollback) |
| 4. Continuous Monitoring | Slow drift, seasonal shifts, model decay | Dashboard cost only | Ongoing | Highest (long-term trends) | Yes (auto-rollback on hard thresholds) |

When to Use Which Layer

| Change Type | Minimum Required Layers | Skip Allowed? |
|---|---|---|
| Typo fix in prompt | Layer 1 (golden dataset) only | Layers 2-3 can be skipped |
| Prompt rewrite | Layers 1 + 2 (shadow) | Layer 3 if shadow shows no regression |
| New Claude model version | All 4 layers | Never skip — caught emoji issue, length inflation |
| Intent classifier retrain | Layers 1 + 2 + 3 | Shadow is critical for routing changes |
| New guardrail rule | Layers 1 + 2 | Focus on false positive rate in shadow |
| Infrastructure change (e.g., instance type) | Layer 3 (canary) only | Layers 1-2 not needed for infra-only changes |

Layer 1: Golden Dataset Evaluation (Offline)

What Is the Golden Dataset?

A curated set of 500+ query-response pairs that represent the full spectrum of MangaAssist interactions. Each pair includes:

```json
{
  "query_id": "GD-042",
  "query": "What dark fantasy manga would you recommend for someone who loved Berserk?",
  "intent": "recommendation",
  "context": {
    "user_profile": {"prime": true, "locale": "en-US"},
    "browsing_history": ["B00GX...", "B01HN..."],
    "page_context": {"page": "manga_store_home"}
  },
  "expected_intent": "recommendation",
  "reference_response": "Based on your love for Berserk, I'd recommend Vinland Saga for its dark historical themes and intense action, Claymore for its dark fantasy setting with powerful warrior protagonists, and Vagabond for its masterful art and mature storytelling.",
  "required_elements": ["at least 2 product recommendations", "genre reasoning", "no fabricated titles"],
  "prohibited_elements": ["competitor mentions", "prices from memory", "non-manga products"],
  "quality_rubric": {
    "factual_correctness": "All recommended titles must exist and match the genre",
    "completeness": "At least 2 recommendations with reasoning",
    "helpfulness": "Recommendations should be genuinely relevant to Berserk fans",
    "format": "Natural language with product names, not just ASINs"
  },
  "tags": ["recommendation", "genre-specific", "dark-fantasy", "medium-complexity"]
}
```

Dataset Composition

| Category | Count | % | Examples |
|---|---|---|---|
| Recommendations (by genre, mood, author) | 120 | 24% | "Suggest manga like One Piece", "Something dark and psychological" |
| Product questions (price, format, edition) | 100 | 20% | "Is Berserk available in hardcover?", "What editions of Naruto exist?" |
| FAQ / Policy | 80 | 16% | "What's the return policy?", "Do you ship to Hawaii?" |
| Order tracking | 60 | 12% | "Where is my order?", "When will it arrive?" |
| Multi-turn scenarios | 50 | 10% | 3-5 turn conversation sequences with co-references |
| Edge cases / adversarial | 40 | 8% | Prompt injection attempts, nonsensical queries, out-of-scope |
| Return / refund requests | 30 | 6% | "I want to return this damaged manga" |
| Chitchat / greetings | 20 | 4% | "Hi!", "Thanks!", "What can you do?" |

How We Built It

  1. Initial seed (Week 1): I sampled 300 production queries stratified by intent (sketched after this list). The DS team wrote reference responses.
  2. Edge case augmentation (Week 2-3): I analyzed the production error log — every escalation, thumbs-down, and guardrail block became a candidate golden query. Added 100 edge cases.
  3. Adversarial additions (Week 4): Security team contributed 40 prompt injection and adversarial queries.
  4. Multi-turn scenarios (Week 5): I created 50 multi-turn conversation flows from production conversation logs (anonymized).
  5. Quarterly refresh: Every quarter, we removed 50 stale queries (about discontinued products, outdated policies) and added 50 new ones based on recent production issues.
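
For step 1, a minimal sketch of what the intent-stratified sampling could look like. The field names, the proportional-allocation rule, and the fixed seed are all illustrative; the actual sampling job isn't shown in this writeup.

```python
# Hypothetical stratified sampler for the golden-dataset seed: sample
# production queries in proportion to each intent's share of traffic.
import random
from collections import defaultdict

def stratified_sample(queries: list[dict], n_total: int = 300, seed: int = 42) -> list[dict]:
    """queries: production records, each with an 'intent' key (illustrative shape)."""
    rng = random.Random(seed)
    by_intent: dict[str, list[dict]] = defaultdict(list)
    for q in queries:
        by_intent[q["intent"]].append(q)
    sample = []
    for intent, group in by_intent.items():
        # Proportional allocation; rounding means the total may be off by a few.
        n = round(n_total * len(group) / len(queries))
        sample.extend(rng.sample(group, min(n, len(group))))
    return sample
```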

Automated Evaluation Pipeline

```mermaid
graph LR
    A[Trigger: PR with prompt/model change] --> B[CI Pipeline<br>pulls golden dataset]
    B --> C[Run 500 queries<br>through pipeline]
    C --> D[Automated Scoring]
    D --> E{All gates pass?}
    E -->|Yes| F[Mark PR as<br>evaluation-passed]
    E -->|No| G[Block PR merge<br>Post failure report]

    D --> D1[Intent accuracy ≥ 90%]
    D --> D2[BERTScore ≥ 0.80]
    D --> D3[ROUGE-L drop ≤ 10%]
    D --> D4[Format compliance ≥ 95%]
    D --> D5[Guardrail pass rate ≥ 95%]
    D --> D6[Avg response length within ±30% baseline]
    D --> D7[Zero prohibited elements detected]
```

Evaluation Gates (Must-Pass Criteria)

| Gate | Metric | Threshold | What It Catches |
|---|---|---|---|
| Intent accuracy | Classification accuracy on golden set | ≥ 90% | Classifier regression |
| BERTScore | Semantic similarity to reference responses | ≥ 0.80 avg | Quality degradation |
| ROUGE-L regression | Delta from previous version | ≤ 10% drop | Structural change in responses |
| Format compliance | % of responses with valid JSON output | ≥ 95% | Prompt format instruction failure |
| Guardrail pass rate | % of responses passing all guardrails | ≥ 95% | Hallucination/safety regression |
| Response length | Average output token count | Within ±30% of baseline | Response inflation/deflation |
| Prohibited element check | Competitor mentions, fabricated ASINs | 0 failures | Hard constraint violation |
| Per-class F1 | F1 score per intent class | ≥ 0.85 all classes | Class-specific regression |
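
A minimal sketch of how these gates could be enforced in the CI step, assuming the scoring job emits a `results` dict and the previous version's `baseline` dict. The field names and shapes are hypothetical; the thresholds mirror the table above.

```python
# Hypothetical gate checker: every gate must pass or the PR merge is blocked.
from dataclasses import dataclass

@dataclass
class Gate:
    name: str
    passed: bool

def check_gates(results: dict, baseline: dict) -> list[Gate]:
    length_ok = (0.7 * baseline["avg_tokens"]
                 <= results["avg_tokens"]
                 <= 1.3 * baseline["avg_tokens"])
    return [
        Gate("intent_accuracy >= 90%", results["intent_accuracy"] >= 0.90),
        Gate("bertscore_avg >= 0.80", results["bertscore_avg"] >= 0.80),
        # ROUGE-L may drop at most 10% relative to the previous version.
        Gate("rouge_l drop <= 10%", results["rouge_l"] >= 0.90 * baseline["rouge_l"]),
        Gate("format_compliance >= 95%", results["format_compliance"] >= 0.95),
        Gate("guardrail_pass_rate >= 95%", results["guardrail_pass_rate"] >= 0.95),
        Gate("avg length within +/-30% of baseline", length_ok),
        Gate("zero prohibited elements", results["prohibited_count"] == 0),
        Gate("per-class F1 >= 0.85", min(results["per_class_f1"].values()) >= 0.85),
    ]

def ci_passes(results: dict, baseline: dict) -> bool:
    return all(g.passed for g in check_gates(results, baseline))
```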

Pipeline runtime: ~25 minutes (500 queries at ~3 seconds each through the full pipeline, run sequentially). Fast enough to run on every PR.


Layer 2: Shadow Mode Evaluation

What Is Shadow Mode?

Both the old (serving) and new (candidate) versions process every real production request simultaneously. Only the old version's response reaches the user. The new version's response is logged and compared offline.

```mermaid
graph TD
    A[User Request] --> B[Load Balancer]
    B --> C[Old Model<br>Response served to user]
    B --> D[New Model<br>Response logged only]
    C --> E[User sees old response]
    D --> F[Comparison Pipeline<br>Async analysis]
    F --> G[Shadow Mode<br>Report]
```

When I Used Shadow Mode

| Scenario | Shadow Duration | What I Compared |
|---|---|---|
| Bedrock Claude version update | 1 week | Response quality, format, length, guardrail pass rate, BERTScore delta |
| Major prompt rewrite | 3-5 days | Same metrics + hallucination rate comparison |
| Intent classifier retraining | 3 days | Intent accuracy, escalation rate, routing differences |
| New guardrail rule | 2 days | Block rate change, false positive analysis |

Shadow Mode Comparison Metrics

| Metric | Comparison Method | Decision Criteria |
|---|---|---|
| BERTScore | Per-query delta | New model's BERTScore should not drop >5% vs old model on any intent category |
| Response length | Distribution comparison | New model's avg length within ±20% of old model |
| Guardrail pass rate | Rate comparison | New model's pass rate ≥ old model's − 1% |
| Format compliance | Rate comparison | New model ≥ old model |
| Hallucination score | Avg score comparison | New model's avg hallucination score ≤ old model's + 0.02 |
| Intent routing changes | Confusion matrix of routing differences | < 5% of requests routed differently |
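
To make the BERTScore row concrete, here is a sketch of the per-intent delta check against the >5% drop criterion. The record shape is illustrative; in practice the per-query scores would come from the async comparison pipeline.

```python
# Group per-query BERTScore for old (serving) and new (candidate) models by
# intent, and flag any category where the candidate drops more than 5%.
from collections import defaultdict

def shadow_bertscore_report(records: list[dict]) -> dict:
    """records: e.g. {"intent": "recommendation", "old": 0.84, "new": 0.81}"""
    by_intent = defaultdict(lambda: {"old": [], "new": []})
    for r in records:
        by_intent[r["intent"]]["old"].append(r["old"])
        by_intent[r["intent"]]["new"].append(r["new"])

    report = {}
    for intent, s in by_intent.items():
        old_avg = sum(s["old"]) / len(s["old"])
        new_avg = sum(s["new"]) / len(s["new"])
        relative_drop = (old_avg - new_avg) / old_avg
        report[intent] = {
            "old_avg": round(old_avg, 4),
            "new_avg": round(new_avg, 4),
            "blocks_promotion": relative_drop > 0.05,  # >5% drop on any category
        }
    return report
```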

Shadow Mode Catches — Real Examples

Example 1: Claude 3.5 Emoji Issue

Shadow mode revealed that Claude 3.5 Sonnet added emoji to 12% of responses (vs. 0% for Claude 3). This violated the Amazon style guide. I caught it before any user saw it and added a prompt instruction: "Do not use emoji in responses."

Example 2: Response Length Inflation

Shadow mode showed Claude 3.5 produced responses averaging 195 tokens (vs. 120 for Claude 3). This would have increased output token cost by 63% and added ~400ms of generation time. I added explicit length constraints before promoting the new model.

Example 3: Intent Classifier V3

Shadow mode showed the new classifier routed 8% of recommendation queries to product_question — a regression on implicit recommendations like "What should I read next?" The classifier was retrained with more implicit recommendation examples before deployment.

Shadow Mode Cost

Shadow mode doubles LLM inference cost during the test period. At $4,500/day for LLM inference, a 1-week shadow test cost ~$31,500 in additional compute. This was justified because a bad model change at full traffic could cost far more in user trust and support escalations.


Layer 3: Canary Deployment

What Is Canary Deployment?

After passing shadow mode, the new model serves 1% of real traffic for 24 hours. Key metrics are monitored in real-time.

```mermaid
graph LR
    A[Production Traffic] --> B[Traffic Splitter]
    B -->|99%| C[Old Model<br>Baseline]
    B -->|1%| D[New Model<br>Canary]
    C --> E[Baseline Metrics]
    D --> F[Canary Metrics]
    E --> G[Comparator<br>Statistical Significance]
    F --> G
    G -->|Pass| H[Promote to 10% → 50% → 100%]
    G -->|Fail| I[Automatic Rollback]
```

Canary Metrics & Thresholds

| Metric | Baseline Comparison | Auto-Rollback If | Rationale |
|---|---|---|---|
| Escalation rate | Canary vs baseline | Canary > baseline + 2% | More escalations = worse user experience |
| Thumbs down rate | Canary vs baseline | Canary > baseline + 3% | Users actively disliking responses |
| Error rate | Canary vs baseline | Canary > baseline + 0.5% | Infrastructure issues with new model |
| P99 latency | Canary vs baseline | Canary > baseline + 500ms | Performance regression |
| Guardrail block rate | Canary vs baseline | Canary > baseline + 2% | More responses being blocked |

Statistical Significance

With 1% of traffic (~5,000 messages in 24 hours), reaching statistical significance required deliberate planning, even for relatively common events like escalations (~12% of messages):

For escalation rate at 1% traffic:
- Baseline: 12% of ~495,000 messages = ~59,400 escalations
- Canary: 12% of ~5,000 messages = ~600 escalations
- To detect a 2% increase (12% → 14%) at 95% confidence:
  Need ~2,400 canary messages (achievable in ~12 hours)

For thumbs down rate (8% baseline), significance was achievable within 6 hours at 1% traffic.
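
A back-of-envelope version of that sample-size calculation. Since the 99% arm makes the baseline rate effectively known, a one-sample proportion test is a reasonable approximation; the exact figure shifts with the power and sidedness you assume, which is why this lands near (not exactly on) the ~2,400 quoted above.

```python
# Canary messages needed to detect a shift from baseline rate p0 to rate p1:
# one-sample normal approximation, two-sided 95% confidence, ~80% power.
from math import sqrt

def canary_sample_size(p0: float, p1: float,
                       z_alpha: float = 1.96, z_power: float = 0.84) -> float:
    numerator = (z_alpha * sqrt(p0 * (1 - p0))
                 + z_power * sqrt(p1 * (1 - p1)))
    return (numerator / (p1 - p0)) ** 2

print(round(canary_sample_size(0.12, 0.14)))  # ~2,150 messages for escalation rate
```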

Canary Promotion Schedule

| Stage | Traffic % | Duration | Monitoring |
|---|---|---|---|
| Stage 1 | 1% | 24 hours | All canary metrics + manual review |
| Stage 2 | 10% | 12 hours | Automated metric comparison |
| Stage 3 | 50% | 6 hours | Automated + cost verification |
| Stage 4 | 100% | Ongoing | Full production monitoring |

Total time from canary start to 100%: ~42 hours (under normal conditions). Any regression at any stage triggers automatic rollback to the previous stage.


Layer 4: Continuous Monitoring

Real-Time Monitoring (CloudWatch Dashboards)

After full deployment, continuous monitoring detected slow degradations that point-in-time evaluations might miss.

```
Real-Time Metrics (1-minute granularity):
├── TTFT P50, P99
├── Error rate
├── Guardrail block rate
├── ASIN validation failures
└── Bedrock throttling events

Daily Aggregations:
├── Hallucination score (async pipeline)
├── Cost per session
├── Escalation rate
├── Thumbs up/down trend
└── Response length distribution

Weekly Evaluations:
├── Intent accuracy (500-sample test)
├── RAG Recall@3 (200-query eval)
├── BERTScore on golden dataset
├── Confusion matrix analysis
└── Human audit of 100 responses
```

Drift Detection Alerts

| Signal | Detection Method | Alert Threshold | Response Time |
|---|---|---|---|
| Intent distribution shift | KL divergence of intent proportions | KL > 0.05 week-over-week | Review within 24 hours |
| Classification confidence drop | Average softmax probability trend | Avg < 0.82 | Investigate within 4 hours |
| Hallucination rate increase | 7-day rolling average of grounding score | Avg < 0.85 | Investigate within 4 hours |
| Response length inflation | 7-day rolling average output tokens | > 150% of baseline | Prompt review within 24 hours |
| Embedding drift | Cosine similarity distribution shift | Mean shift > 0.05 | DS review within 1 week |
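
A sketch of the first row: KL divergence between this week's and last week's intent proportions, alerting above 0.05. The intent mix below is illustrative.

```python
# KL(P || Q) over aligned categorical distributions, used as a drift signal.
from math import log

def kl_divergence(p: dict, q: dict, eps: float = 1e-9) -> float:
    return sum(pv * log(pv / max(q.get(k, 0.0), eps))
               for k, pv in p.items() if pv > 0)

last_week = {"recommendation": 0.30, "product_question": 0.25, "faq": 0.20,
             "order_tracking": 0.15, "other": 0.10}
this_week = {"recommendation": 0.18, "product_question": 0.37, "faq": 0.20,
             "order_tracking": 0.15, "other": 0.10}

kl = kl_divergence(this_week, last_week)
if kl > 0.05:  # week-over-week threshold from the table
    print(f"ALERT: intent distribution shift, KL={kl:.3f} (review within 24h)")
```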

Automatic Rollback Triggers

| Condition | Action | Notification |
|---|---|---|
| Error rate > 2% for 5 minutes | Rollback to previous model version | PagerDuty P1 alert |
| P99 TTFT > 3s for 10 minutes | Switch to Haiku for all intents | PagerDuty P2 alert |
| ASIN validation failure rate > 1% | Block LLM responses, serve templates only | PagerDuty P1 alert |
| Bedrock throttling > 5% of requests | Enable request queuing with priority | PagerDuty P2 alert |
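
One plausible wiring for the first trigger, expressed as a CloudWatch alarm whose action (e.g., SNS to a rollback Lambda) performs the version swap. The namespace, metric name, and topic ARN below are hypothetical.

```python
# Alarm: custom error-rate metric above 2% for 5 consecutive 1-minute periods.
import boto3

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="mangaassist-error-rate-auto-rollback",   # hypothetical name
    Namespace="MangaAssist/Pipeline",                   # hypothetical namespace
    MetricName="ErrorRatePercent",                      # hypothetical metric
    Statistic="Average",
    Period=60,                  # 1-minute granularity, matching the dashboards
    EvaluationPeriods=5,        # sustained for 5 minutes...
    DatapointsToAlarm=5,        # ...with all 5 datapoints breaching
    Threshold=2.0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:rollback"],  # hypothetical
)
```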

Human Evaluation Workflows

Weekly Audit (100 Responses)

Every week, I sampled 100 responses from production for human evaluation:

Sampling strategy (stratified, not random):

- 30 from recommendation intent (highest user impact)
- 20 from product_question (most hallucination-prone)
- 15 from faq (policy accuracy matters)
- 10 from order_tracking (should be templated — checking for anomalies)
- 10 from multi-turn conversations (coherence check)
- 10 from thumbs-down responses (understand failures)
- 5 from guardrail-blocked responses (false positive audit)

Human evaluation rubric:

| Dimension | Score | Definition |
|---|---|---|
| Factual Correctness | 1-5 | 5 = all facts correct; 1 = major factual errors |
| Completeness | 1-5 | 5 = fully addresses query; 1 = misses key info |
| Helpfulness | 1-5 | 5 = would definitely help the user; 1 = confusing or wrong |
| Tone & Style | 1-5 | 5 = natural, friendly, Amazon-appropriate; 1 = robotic or off-brand |
| Safety | Pass/Fail | Fail = contains PII, toxicity, competitor mention, or harmful content |

Evaluators: Internal team (2 evaluators per response, scores averaged). Inter-rater agreement: Cohen's κ = 0.78 (substantial agreement).
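
For reference, the agreement statistic is a one-liner with scikit-learn. The ratings below are illustrative; for ordinal 1-5 scores, a weighted kappa (`weights="quadratic"`) is arguably the better choice.

```python
# Inter-rater agreement between the two evaluators on the same responses.
from sklearn.metrics import cohen_kappa_score

rater_a = [5, 4, 4, 3, 5, 2, 4, 5, 3, 4]  # illustrative 1-5 ratings
rater_b = [5, 4, 3, 3, 5, 2, 4, 4, 3, 4]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")  # ~0.72 here; 0.61-0.80 = substantial
```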

Quarterly Deep Dive (500 Responses)

Once per quarter, we evaluated the full golden dataset against current production. This was the most expensive evaluation (~2 person-days of human evaluation) but provided the most comprehensive quality picture.

Quarterly deep dive additions:

- Trend analysis: quality scores over the last 4 quarters.
- Failure mode analysis: categorize all responses scoring <3 on any dimension.
- Golden dataset freshness: identify stale queries to retire and edge cases to add.
- Cross-comparison: current model vs. best offline model (are we deploying the best option?).

Labeling Infrastructure for Retraining

Separate from evaluation, human labeling supported classifier retraining:

```mermaid
graph LR
    A[Production<br>Low-confidence<br>Predictions] --> B[Sampling Pipeline<br>200/week]
    B --> C[Internal Labeling Queue<br>2 labelers per sample]
    C --> D{Agreement?}
    D -->|Yes| E[Add to Training Set]
    D -->|No| F[Adjudication<br>Senior labeler decides]
    F --> E
    E --> G[Monthly Retraining<br>Dataset]
```

Labeling quality: Inter-annotator agreement on intent labels: Cohen's κ = 0.85 (almost perfect agreement on the standard Landis-Koch scale). When annotators disagreed, the most common confusion was recommendation vs product_question — the same confusion the classifier struggled with.


How Offline and Online Metrics Correlated

One of the most important lessons was understanding which offline metrics actually predicted online success.

Strong Correlations (Offline → Online)

| Offline Metric | Online Metric | Correlation (r) | Implication |
|---|---|---|---|
| Intent accuracy (offline) | Escalation rate | −0.68 | Higher offline accuracy → fewer escalations |
| RAG Recall@3 (offline) | Thumbs up rate | +0.55 | Better retrieval → users like responses more |
| BERTScore (offline) | Resolution rate | +0.61 | Higher semantic quality → more issues resolved |
| Per-class F1 (offline) | Intent-specific escalation | −0.72 | Strong per-class metric → strong per-class online quality |

Weak Correlations (Offline → Online)

| Offline Metric | Online Metric | Correlation (r) | Implication |
|---|---|---|---|
| BLEU-4 (offline) | Thumbs up rate | +0.15 | BLEU poorly predicts user satisfaction |
| ROUGE-2 (offline) | CSAT | +0.22 | Bigram overlap doesn't drive satisfaction |
| Overall accuracy (offline) | Conversion rate | +0.18 | Accuracy is necessary but not sufficient for conversion |

Surprising Findings

  1. BERTScore was the best single predictor of user satisfaction (r=0.61 with resolution rate). BLEU was nearly useless (r=0.15). This validated our decision to use BERTScore as the primary evaluation metric.

  2. RAG Recall@3 had higher impact on thumbs-up than intent accuracy. Better source material (retrieval) mattered more than correct routing (intent). This shifted our investment from classifier accuracy toward retrieval quality.

  3. Format compliance rate had zero correlation with user satisfaction — users didn't care if the JSON was slightly malformed because the frontend handled edge cases gracefully. But format compliance was critical for product card rendering, so it remained a gate.

  4. Response length had a non-linear relationship with satisfaction: very short (<30 tokens) and very long (>250 tokens) responses both had lower satisfaction. The sweet spot was 80-150 tokens.


Evaluation Framework Evolution Over Time

| Period | Evaluation Maturity | What Changed |
|---|---|---|
| MVP (Month 1-2) | Manual review of 50 responses after each change | No automation; relied on gut feeling and manual spot-checks |
| V2 (Month 3-4) | Golden dataset of 200 queries + automated BLEU/ROUGE | Added CI pipeline that blocked PRs with quality regression |
| V3 (Month 5-6) | 500-query golden set + BERTScore + shadow mode | Replaced BLEU with BERTScore; added shadow mode for model transitions |
| V4 (Month 7+) | Full 4-layer framework + continuous monitoring + auto-rollback | Canary deployments, drift detection, automatic rollback triggers |

Key lesson: Don't try to build the full framework from day one. Start with manual review, add automation as pain points emerge, and build toward full automation as the system matures.


Evaluation Framework ROI

"Was all this evaluation infrastructure worth building?"

| Investment | Cost | What It Prevented | Estimated Savings |
|---|---|---|---|
| Golden dataset (500 queries, quarterly refresh) | ~2 person-days/quarter | Blocked 4 prompt regressions that would have degraded responses for 500K users/day | Incalculable trust value; conservatively $50K+ in avoided escalations |
| Shadow mode infrastructure | ~$31.5K per shadow test (doubled LLM cost) | Caught Claude 3.5 emoji issue, response inflation (+63% cost), intent routing regression | $200K+/month in avoided cost inflation and user experience damage |
| Canary deployment pipeline | ~1 week engineering effort to build | Detected 2 escalation rate increases before full rollout | ~$25K/month in avoided unnecessary human agent costs |
| Continuous monitoring + auto-rollback | ~2 weeks engineering effort to build | Auto-rolled back 3 incidents in first 6 months | Avoided 3 P0/P1 incidents at $10K-50K each in engineering time + user impact |
| Total annual investment | ~$200K (shadow tests + maintenance) | | ~$1M+ in prevented costs and quality failures |

Common Evaluation Pitfalls

| Pitfall | Why It Happens | How We Avoided It |
|---|---|---|
| Stale golden dataset | Queries about discontinued products and old policies give false confidence | Quarterly refresh: remove 50 stale, add 50 new based on recent production issues |
| Over-reliance on BLEU | Teams use BLEU because it's familiar, but it punishes valid paraphrasing | Replaced with BERTScore as primary metric; BLEU retained only for structural regression |
| No statistical significance on canary | Drawing conclusions from 1% traffic before enough data accumulates | Calculated minimum sample size per metric; waited 12-24 hours before decisions |
| Shadow mode without baseline comparison | Logging new model output without the old model for comparison | Always run both models in parallel so every request has a direct comparison |
| Manual rollback only | Waiting for an on-call engineer to notice and act during an incident | Auto-rollback on hard thresholds (error rate > 2%, ASIN validation < 99%) |
| Evaluating in isolation | Testing model changes without the full pipeline (guardrails, formatting) | Golden dataset runs the complete pipeline, not just the model |
| Ignoring offline-online correlation | Optimizing offline metrics that don't predict user satisfaction | Measured correlations: BERTScore (r=0.61) was 4x better than BLEU (r=0.15) |

Evaluation Maturity Checklist

Use this to assess where your evaluation framework stands:

  • Level 1: Manual review of responses after changes (ad-hoc, no automation)
  • Level 2: Golden dataset with automated quality metrics (CI pipeline blocks regressions)
  • Level 3: Shadow mode for safe model transitions (both versions compared on real traffic)
  • Level 4: Canary deployments with auto-rollback (real users, statistical significance)
  • Level 5: Continuous monitoring with drift detection (weekly evals, alerting, trend analysis)
  • Level 6: Offline-online correlation analysis (know which offline metrics predict online success)

MangaAssist reached Level 6 by Month 7. Start at Level 1 and add layers as pain points emerge.


Key Takeaways for Interviews

  1. "4 layers, not 1" — Offline evaluation → shadow mode → canary → continuous monitoring. Each layer catches different types of regressions. Offline catches quality issues; shadow catches behavioral changes; canary catches user-facing impacts; continuous monitoring catches drift.

  2. "The golden dataset is a living document" — Quarterly refresh to add new edge cases and remove stale queries. A stale golden dataset gives false confidence.

  3. "Shadow mode costs double but saves millions" — Doubling LLM cost for a week (~$31K) is trivial compared to the cost of a bad model change hitting 500K users/day. Shadow mode caught the emoji issue, response inflation, and intent routing regression before any user was affected.

  4. "Canary with statistical significance" — 1% traffic for 24 hours gives enough data for most metrics. I calculated the minimum sample size needed for each metric's significance threshold.

  5. "BERTScore > BLEU for LLM evaluation" — This was a data-driven decision. BERTScore correlated 4x better with user satisfaction (r=0.61 vs r=0.15). This shows I don't just follow conventions — I validated which metrics actually predict outcomes.

  6. "Offline-online correlation analysis" — I measured which offline metrics actually predicted online success. This informed what to include (BERTScore, Recall@3) and what to demote (BLEU, ROUGE-2) in the evaluation pipeline.

  7. "Automatic rollback on hard constraints" — Error rate >2% → auto-rollback. No human in the loop for safety-critical decisions. This shows production maturity.


Deep-Dive Interview Grilling — Evaluation Framework

Grill Set 1: Shadow Mode

Q: You ran shadow mode for a week at $31.5K. How do you justify that cost to a CTO who wants faster deployments?

The ROI calculation is straightforward. Shadow mode prevented three production incidents in its first 6 months: the emoji style violation, the 63% response length inflation ($200K+/month cost increase if shipped to production), and the intent routing regression. Any single one of those would have cost more than $31.5K to remediate — plus the user experience damage.

But the stronger argument is: shadow mode buys time to make a decision. Without it, you're forced to choose between "don't deploy" (slower velocity) and "deploy blind" (production risk). Shadow mode lets you deploy to production-equivalent traffic, measure real impact, and make an informed decision in 3-7 days. It replaces gut-feel with data.


Grill Follow-Up 1a: "What if the CTO says: 'Use canary instead of shadow — it's cheaper and still uses real traffic.'"

Shadow mode and canary serve different purposes and are not interchangeable. Shadow mode is pre-production validation: the new model never serves real users, so zero user impact risk. It's appropriate when you're uncertain whether the new model is acceptable at all.

Canary is production rollout control: the new model serves 1% of real users with auto-rollback. It assumes the model is likely good and you want controlled exposure.

The difference: a model that fails shadow mode would have caused user-facing harm during canary (e.g., on the order of 100 escalations among the 1% canary users before the rollback triggered). Shadow mode prevents that. For model version changes (which are black-box from our perspective), skipping shadow in favor of canary means real users absorb the risk of unknown model behavior changes.

My rule: shadow mode is mandatory for model version changes and major prompt rewrites. Canary-only is acceptable for minor prompt tweaks that passed shadow on a prior similar change.


Grill Follow-Up 1b: "Shadow mode revealed the emoji issue. But what if the new model was better on 95% of queries but worse on 5%? Would shadow mode catch a net-positive model with a specific failure mode?"

Shadow mode doesn't tell you whether to ship — it tells you where the new model differs. I'd catch the 5% failure mode through the per-category BERTScore breakdown and the intent-specific block rate analysis. Whether to ship despite the 5% regression is a product decision, not a testing decision.

In practice: if the 5% regression was on a low-stakes intent (chitchat) and the 95% improvement was on a high-value intent (recommendations), the answer might be "ship with monitoring on the regression category and a prompt fix queued." If the 5% regression was on order_tracking (safety-critical), I'd block the promotion and investigate.

Shadow mode surfaces the tradeoff. I make the call with full information, not blind optimism.


Grill Set 2: Offline-Online Correlation

Q: You measured offline-online correlations. But both metrics changed over 24 weeks because you were actively improving the system. How do you know the correlation is real and not just two metrics improving together?

This is the temporal autocorrelation problem. Two time series that both trend upward will show spurious correlation even if they're causally unrelated. My mitigation: I first-differenced both series (computed week-over-week changes) before calculating Pearson correlation. First-differencing removes the common trend and leaves only the relationship between week-to-week movements. The correlation on first differences was r=0.47 (still significant, p < 0.05), down from r=0.61 on raw values. I reported 0.61 in summaries but used 0.47 for decision-making.
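
A sketch of that de-trending step, with illustrative weekly series:

```python
# Correlate week-over-week changes rather than raw weekly values, so a shared
# upward trend cannot masquerade as a relationship between the two metrics.
import numpy as np
from scipy.stats import pearsonr

bertscore_weekly = np.array([0.78, 0.79, 0.79, 0.81, 0.80, 0.82, 0.83, 0.83])
resolution_weekly = np.array([0.61, 0.63, 0.62, 0.66, 0.64, 0.67, 0.69, 0.68])

r_raw, _ = pearsonr(bertscore_weekly, resolution_weekly)
r_diff, p_diff = pearsonr(np.diff(bertscore_weekly), np.diff(resolution_weekly))
print(f"raw r={r_raw:.2f}, first-differenced r={r_diff:.2f} (p={p_diff:.3f})")
```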


Grill Follow-Up 2a: "r=0.47 between BERTScore and resolution rate means 53% of variance is unexplained. What explains the other 53%?"

The other 53% reflects factors that affect resolution rate independent of response quality: (1) system latency — faster responses had higher resolution rates independent of quality (r=-0.34 with latency), (2) intent routing accuracy — correctly routing to the right service matters even when the LLM response quality is high, (3) external factors: resolution rate is higher on weekends (less time pressure, more browsing intent) and lower during Prime Day (high-pressure shopping, users want instant answers). BERTScore doesn't capture any of these. The r=0.47 for BERTScore is good — it means BERTScore alone explains about 22% of variance in resolution rate (r²=0.22), which is meaningful for a single metric in a complex system.


Grill Follow-Up 2b (Architect level): "You found high RAG Precision@3 correlated with LOWER add-to-cart for recommendation queries — more relevant chunks led to narrower, less diverse recommendations. How did you resolve the tension between retrieval quality and recommendation diversity without breaking your offline evaluation metrics?"

This required rethinking what "quality" meant for the recommendation intent specifically. The insight: for recommendation queries, diversity is a quality dimension, not a tradeoff against quality. A response with 3 highly-relevant but similar recommendations ("Here are 3 Naruto-like shonen manga") is less valuable to a user than 3 diverse recommendations across the relevant spectrum (shonen + isekai + darker seinen).

My resolution: I added a Maximal Marginal Relevance (MMR) retrieval strategy specifically for the recommendation intent. MMR retrieves K=5 candidates by relevance, then re-selects K=3 to maximize both relevance AND diversity (measured by semantic distance between selected chunks). This maintained MRR (the most relevant chunk still ranks first) while increasing inter-chunk diversity.
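
A compact sketch of the MMR re-selection (the λ = 0.7 weighting of relevance vs. diversity below is illustrative, not necessarily the production value):

```python
# Retrieve 5 candidates by relevance, then greedily pick 3 that balance
# relevance against similarity to chunks already selected.
import numpy as np

def mmr_select(relevance: np.ndarray, sim: np.ndarray,
               k: int = 3, lam: float = 0.7) -> list[int]:
    """relevance: (n,) query-chunk scores; sim: (n, n) chunk-chunk similarity."""
    selected: list[int] = []
    candidates = list(range(len(relevance)))
    while candidates and len(selected) < k:
        def mmr_score(i: int) -> float:
            penalty = max(sim[i][j] for j in selected) if selected else 0.0
            return lam * relevance[i] - (1 - lam) * penalty
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```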

For the evaluation metric: I added an "intra-list diversity" metric to the golden dataset for recommendation queries — measuring the average pairwise semantic distance between recommended titles. Target: average cosine distance ≥ 0.35 (titles should not all be from the same sub-genre). This became a gate metric for recommendation quality, separate from precision and recall.


Grill Set 3: Canary Statistical Significance

Q: You use 1% canary traffic. At 5,000 messages in 24 hours, is that enough for statistical significance on escalation rate?

For escalation rate at 12% baseline: to detect a 2 percentage point increase (12% → 14%) at 95% confidence and 80% power, I need approximately 2,400 canary messages (calculated using a two-proportion z-test). At 5,000 messages in 24 hours, I exceed that threshold in about 11-12 hours. So yes, 24 hours is sufficient for escalation rate.

For thumbs-down rate (8% baseline), detecting a 3 percentage point increase requires ~1,800 samples — achievable in about 9 hours at 1% traffic.

For rarer metrics (like ASIN validation failure at 0.3%), you need ~50,000 canary messages to detect a 0.1% increase — which at 1% traffic takes 10+ days. I handled these differently: rare-metric canaries ran at 5% traffic for 48 hours, which gave sufficient samples.


Grill Follow-Up 3a: "What if the 1% canary users are systematically different from the 99%? Maybe the router picks specific user segments for canary that happen to behave differently."

Good point. Traffic splitting needs to be statistically representative, not segment-biased. I used hash-based routing (hash(session_id) % 100 < 1 for canary) — this distributes users across the canary group uniformly based on session ID, not any behavioral attribute. Since session IDs are random UUIDs, the canary group inherits the same intent distribution, Prime vs. non-Prime mix, and locale distribution as production.
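
Sketched below, with one practical note: Python's built-in hash() is salted per process, so a deterministic digest is what actually gives consistent assignment across servers.

```python
# Stable session-based bucketing: the same session always lands in the same
# bucket, and random UUID session IDs spread users uniformly over 100 buckets.
import hashlib

def in_canary(session_id: str, canary_pct: int = 1) -> bool:
    digest = hashlib.md5(session_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < canary_pct
```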

I verified this: after the first 1,000 canary messages, I checked that the intent distribution in the canary group matched the production distribution within ±2 percentage points. If they diverged (e.g., canary was over-represented in recommendation intent), I'd suspect a routing bug. In 6 months of canary deployments, the distribution was always within ±2 percentage points.


Grill Follow-Up 3b (Architect level): "Auto-rollback triggers on escalation rate > baseline + 2%. But what if a new model legitimately reduces 'bad' escalations (from chatbot failures) while increasing 'good' escalations (users intentionally requesting a human after getting good info)? Your metric would incorrectly roll back a better model."

This is a real measurement problem: escalation rate is a proxy, not a ground truth. "Escalation rate increased" can mean: (a) the model is worse and users are frustrated, or (b) the model is better at surface-routing complex issues to humans appropriately.

My solution: I split escalation into forced escalation (chatbot gave up or error occurred) and intentional escalation (user explicitly said "talk to a human" or clicked the human handoff button). The auto-rollback trigger used only forced escalation rate. Intentional escalation rate increase was flagged for human review, not auto-rollback.

Additionally, I added a downstream quality signal: among escalated sessions, what was the resolution rate with the human agent? If the new model's escalations resolved faster with the human agent (because the chatbot provided better context before escalating), that was evidence of quality improvement, not regression. This was a 1-day lag metric — too slow for auto-rollback but useful for post-canary review before promoting to 10%.