07. Model Evaluation Scenarios — Deep Dive Q&A
"Every evaluation decision is a tradeoff. You trade cost for confidence, speed for rigor, offline safety for production realism. This document captures 11 concrete evaluation scenarios from MangaAssist production, the metric choices made, why those metrics beat the alternatives, and how to answer every level of follow-up from an interviewer."
How to Use This Document
Each scenario follows the same structure:
1. Scenario description — what triggered the evaluation
2. Metrics chosen — the exact numbers we tracked
3. Metric tradeoff reasoning — why these beat the alternatives
4. Interview Q&A: Easy → Medium → Hard → Very Hard
Scenario Index
| # | Scenario | Core Metrics | Tradeoff Theme |
|---|---|---|---|
| 1 | Prompt change regression | BERTScore, ROUGE-L delta, intent accuracy | BERTScore vs BLEU |
| 2 | New LLM version rollout (Claude 3 → 3.5) | Shadow mode, response length distribution, guardrail pass rate | Shadow cost vs deployment risk |
| 3 | Intent classifier retraining | Macro F1, per-class AUC-PR, confusion matrix | Accuracy vs recall vs F1 |
| 4 | RAG pipeline evaluation | Recall@3, MRR, Precision@3, Reranking lift | Recall vs Precision tradeoff in retrieval |
| 5 | Canary deployment — escalation rate spike | Escalation delta, thumbs-down rate, statistical significance | 1% traffic vs. sample size |
| 6 | Guardrail calibration — false positive crisis | Block rate, false positive rate, recall on adversarial set | FP vs FN asymmetry |
| 7 | Hallucination regression — product catalog | Factual grounding score, ASIN validation rate, price accuracy | Async scoring vs real-time gate |
| 8 | Multi-turn conversation coherence | Multi-turn coherence score, topic drift rate, turns-to-resolution | Proxy metrics vs human eval |
| 9 | Offline-to-online metric correlation audit | r(offline, online) for 6 metric pairs | Which offline metrics actually predict production |
| 10 | Response length regression (cost + quality) | Output token distribution, cost per session, satisfaction by length | Cost vs UX optimization |
| 11 | Model tiering decision — Haiku vs Sonnet routing | Quality-adjusted latency score, per-intent routing accuracy | Compound metrics vs single-axis evaluation |
Scenario 1 — Prompt Change Regression
What Happened
The DS team rewrote the recommendation intent system prompt to improve specificity. The new prompt was more explicit about format, added a "do not recommend adult content" rule, and changed the recommendation justification instruction. Before merging to production, the CI pipeline ran the 500-query golden dataset evaluation.
Metrics Used and Why
| Metric | Threshold | Why This Metric |
|---|---|---|
| BERTScore F1 | ≥ 0.80 avg | Measures semantic similarity; captures paraphrase quality that BLEU misses |
| ROUGE-L delta | ≤ 10% drop | Detects structural regression in response format without word-level penalization |
| Intent accuracy | ≥ 90% | Validates routing hasn't broken — prompts can inadvertently confuse downstream intent |
| Format compliance | ≥ 95% | New prompts sometimes break JSON output structure |
| Prohibited element check | 0 violations | Hard gate — any competitor mention or fabricated ASIN is an immediate block |
Metric Tradeoff: BERTScore vs BLEU
BLEU measures n-gram overlap against a reference string. It penalizes valid paraphrases identically to wrong answers. For example:
- Reference: "Berserk is a must-read for fans of dark fantasy manga."
- Candidate A: "Fans of dark fantasy should absolutely read Berserk." (paraphrase, equally correct)
- Candidate B: "Berserk is available in digital and print editions." (factually correct but wrong intent)
BLEU scores Candidate A lower than Candidate B because B shares more n-grams with the reference. BERTScore uses contextualized embeddings — it scores A at 0.87 and B at 0.51, correctly identifying the paraphrase as higher quality.
Production evidence: On the recommendation intent, BLEU scored valid reformulations at 0.18 while BERTScore correctly scored them at 0.82. Using BLEU would have blocked a valid prompt improvement.
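A minimal sketch of this comparison, assuming the `sacrebleu` and `bert-score` packages are available; the exact scores it prints will differ from the illustrative numbers above.

```python
# Sketch: how BLEU and BERTScore rank a valid paraphrase vs. an off-intent response.
from sacrebleu import sentence_bleu
from bert_score import score as bert_score

reference = "Berserk is a must-read for fans of dark fantasy manga."
candidate_a = "Fans of dark fantasy should absolutely read Berserk."   # paraphrase, correct intent
candidate_b = "Berserk is available in digital and print editions."    # factually fine, wrong intent

for name, cand in [("A (paraphrase)", candidate_a), ("B (off-intent)", candidate_b)]:
    bleu = sentence_bleu(cand, [reference]).score          # n-gram overlap, 0-100
    _, _, f1 = bert_score([cand], [reference], lang="en")  # contextual embedding similarity
    print(f"{name}: BLEU={bleu:.1f}  BERTScore-F1={f1.item():.2f}")
```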
Tradeoff: ROUGE-L vs ROUGE-2
ROUGE-2 measures bigram overlap — it is sensitive to word-order changes and synonym substitutions. ROUGE-L measures the longest common subsequence, which tolerates word insertions and reorderings while preserving structural integrity. For format-heavy outputs like our JSON recommendation cards, ROUGE-L better captures whether the structure is preserved without falsely penalizing synonym-level variations.
Q&A — Scenario 1: Prompt Change Regression
Easy Questions
Q1: Why do you run offline evaluation before deploying a prompt change?
A: A prompt change modifies the core instruction to the LLM, which can produce cascading changes: different response length, different format, different tone, different fact inclusion. Offline evaluation against the golden dataset catches regressions in 25 minutes for $15 — far cheaper than catching the same regression in production at 500K requests/day. The CI pipeline gate also creates a forcing function where no prompt change can reach users without proving it doesn't regress key metrics.
Q2: What is the golden dataset?
A: A curated set of 500 query-response pairs covering all 8 intent classes in production-realistic proportion. Each entry has a query, reference response, expected intent, and tags. It's built from stratified sampling of production traffic plus intentional edge cases added from observed production failures. It's refreshed quarterly — we retire stale queries about discontinued products and add 50 new queries based on recent production issues. The size (500) was chosen to cover all intent classes adequately while keeping CI runtime under 30 minutes.
Q3: What threshold triggers a PR block?
A: Any of the following: BERTScore drops below 0.80 average, ROUGE-L drops more than 10% from the previous baseline, intent accuracy drops below 90%, format compliance drops below 95%, or any prohibited element check returns a non-zero count. The prohibited check (no fabricated ASINs, no competitor mentions) is a hard zero-tolerance gate. All other thresholds have a 10% buffer below baseline to avoid over-alerting on random noise.
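A sketch of this gate logic as it might appear in the CI check; the metric names and dictionary structure are assumptions, not the actual pipeline code.

```python
# Sketch: PR gate over the offline evaluation aggregates for a candidate prompt.
def should_block_pr(candidate: dict, baseline: dict) -> list[str]:
    """Return the list of violated gates; an empty list means the PR may merge."""
    violations = []
    if candidate["bertscore_f1"] < 0.80:
        violations.append("BERTScore F1 below 0.80")
    if candidate["rouge_l"] < baseline["rouge_l"] * 0.90:        # >10% drop vs. previous baseline
        violations.append("ROUGE-L dropped more than 10%")
    if candidate["intent_accuracy"] < 0.90:
        violations.append("Intent accuracy below 90%")
    if candidate["format_compliance"] < 0.95:
        violations.append("Format compliance below 95%")
    if candidate["prohibited_element_count"] > 0:                # zero-tolerance hard gate
        violations.append("Prohibited element detected")
    return violations
```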
Medium Questions
Q4: How do you handle dataset imbalance in the golden set? If "recommendation" is 24% but production traffic is 35%, won't your aggregate BERTScore be misleading?
A: This is a real tension. Aggregate BERTScore can be misleading when the dataset proportions don't match traffic. We address this two ways. First, we report per-class BERTScore alongside the aggregate — a prompt change that improves recommendation quality while degrading FAQ quality would still be approved by aggregate BERTScore, but blocked by the per-class gate. Second, the golden dataset composition (24% recommendation, 20% product question) was intentionally designed to underrepresent high-frequency simple intents (like order tracking at 12%) because those are templated and LLM quality variation there is minimal. The added evaluation cycles go toward high-variance intents where quality actually differs.
Q5: You said golden dataset evaluation costs ~$15. How did you calculate that, and how does cost scale with dataset size?
A: 500 queries × avg 3 seconds per full pipeline run × Claude API cost. Each query incurs one LLM call with approximately 2,500 input tokens (system prompt + RAG context + conversation history + query) and 150 output tokens. At Claude 3.5 Sonnet pricing (~$3/1M input, ~$15/1M output), that's approximately (500 × 2500 × $0.000003) + (500 × 150 × $0.000015) ≈ $3.75 + $1.13 ≈ ~$5 in API cost, plus compute for the BERTScore evaluation pipeline (~$3). The $15 estimate includes overhead and occasional retries. Cost scales linearly with dataset size — doubling to 1,000 queries doubles cost to ~$30, which is still trivially cheap. The bottleneck at scale is evaluation latency (time to run NLP scoring), not API cost.
Q6: What's a false positive in this context — a PR that's incorrectly blocked?
A: Yes. A false positive block happens when a legitimate prompt improvement is rejected because the golden dataset doesn't represent the improvement well. Example: a prompt change that makes recommendations more conversational would score lower on ROUGE-L (structural change) and potentially lower on BERTScore if the reference responses are formal. We handle this by (a) quarterly freshening of references to reflect current style preferences, (b) requiring human review when a PR is blocked but the author believes the block is a false positive — two reviewers can override with documented justification, and (c) tracking the block-override rate as a meta-metric for dataset staleness.
Hard Questions
Q7: BERTScore uses BERT embeddings to measure similarity. What are the failure modes of BERTScore for your specific use case (product recommendations)?
A: Three concrete failure modes we observed:
First, domain vocabulary mismatch: BERTScore uses base BERT or RoBERTa trained on general text. For manga-specific terms ("shōnen," "seinen," "isekai"), the embeddings may be poor, causing BERTScore to under-score responses that use domain-correct terminology. We partially mitigated this by fine-tuning the evaluation embeddings on a manga product corpus — this improved BERTScore discriminability by about 8%.
Second, factual correctness is not measured: BERTScore measures semantic similarity, not factual accuracy. A response that says "Berserk is a 10-volume series" (wrong — it's 41 volumes) can score 0.89 BERTScore against a reference about Berserk because the semantic similarity is high. This is why BERTScore is paired with ASIN validation rate and factual grounding score — BERTScore alone is insufficient.
Third, length asymmetry: A very short response ("Yes, Berserk is available in hardcover.") scores poorly against a longer reference even if it's factually complete, because the LCS and embedding coverage penalize brevity. We adjusted for this by setting a minimum response length gate and evaluating BERTScore only on responses within 50% of the reference length. Responses shorter than the gate are flagged separately.
Q8: How do you prevent the golden dataset from becoming a benchmark it's trained to pass — i.e., prompt engineers optimizing prompts specifically to ace the golden set?
A: This is a real contamination risk. We addressed it in four ways:
- Holdout partition: 20% of the golden dataset (100 queries) is held out from the development team. CI reports aggregate scores; held-out queries are evaluated only during quarterly deep dives. Prompt engineers never see which specific queries are in the holdout.
- Blind reference updates: Reference responses for new queries are written by a separate QA reviewer, not the team writing the prompts. This prevents writing reference responses that are artificially easy to match.
- Distribution audit: When a new prompt submission has an unusually high BERTScore (>0.92) on a class that previously scored 0.82, this triggers a contamination review — someone manually checks whether the prompt text was "learned from" the golden queries.
- Shadow mode as the ground truth: The ultimate validation is shadow mode on real traffic. If a prompt aced the golden set but behaves differently on live traffic, shadow mode catches it. We've had one case where a prompt scored 0.91 BERTScore offline but showed a 6% worse escalation rate in shadow mode — indicating the golden set was stale and hadn't captured a shift in user query distribution.
Very Hard Questions
Q9: Suppose you're moving from Claude 3 to a hypothetical new model that generates syntactically different but semantically equivalent responses — for example, it prefers bulleted lists over prose. BERTScore would score this lower because embedding similarity is lower across structural format changes, and ROUGE-L would flatly fail. How would you redesign your evaluation pipeline to handle format-divergent but quality-equivalent responses?
A: This is a genuine open problem in LLM evaluation. My approach would be a three-part redesign:
Step 1 — Decouple structural evaluation from semantic evaluation. Parse responses into structured semantic units (claims, recommendations, caveats) rather than comparing raw strings. For a recommendation response, extract: (1) the recommended title, (2) the justification, (3) the format/availability statement. Compare each unit independently using BERTScore at the claim level rather than the document level. Structural formatting (bullets vs. prose) becomes irrelevant.
Step 2 — Add a format-normalized BERTScore variant. Strip markdown formatting (bullets, bold, headers) from both the reference and the candidate before embedding. This eliminates structural penalty while preserving semantic comparison. We would maintain the raw BERTScore check for format-intentional comparisons (e.g., did the model respect a format instruction?) and use the normalized variant for cross-format quality comparison.
Step 3 — Introduce LLM-as-judge for format transitions. For a major model version upgrade where format divergence is expected, supplement automated metrics with a GPT-4o judge prompt that evaluates: "Given the user query and these two responses — one in prose and one in bullets — which provides a better answer?" This gives a relative quality signal that is format-agnostic. Calibrate the judge on a set of 50 human-evaluated pairs to verify its scores match human judgment before relying on it.
The deeper issue is that our entire evaluation framework (BERTScore, ROUGE-L) was designed for format-consistent comparisons. A model that shifts response style represents a distribution shift in the reference space. The correct response is to update the golden dataset references to the new style after manually confirming through shadow mode that the new style is indeed higher quality. Treat the golden set references as a contract you renegotiate when the product style changes.
Q10: Design an evaluation framework from scratch for a hypothetical new intent class: "manga subscription management" (users managing their Kindle Unlimited manga subscriptions — add, remove, pause). What golden dataset composition, metrics, and thresholds would you set, and why?
A: This requires thinking through what makes this intent distinct:
Distinct properties: It's action-oriented (user wants to take a state-changing action, not just information). It involves account-specific data (subscription status, active titles). Errors have real consequences (user unsubscribes from wrong content). The LLM cannot actually perform the action — it must generate correct instructions or hand off to the subscription API.
Golden dataset composition (100 queries for an intent this specific):
- 30 queries: "How do I add a title to subscription" (direct, unambiguous)
- 20 queries: "Remove X from my subscription" (requires ASIN resolution)
- 15 queries: "Pause my subscription until next month" (multi-step action)
- 15 queries: Edge cases — already subscribed ("I'm already subscribed to this"), can't subscribe ("This title doesn't support subscription"), wrong plan ("Your current plan doesn't include this")
- 10 queries: Multi-turn (user doesn't know the exact title name, asks clarifying questions)
- 10 queries: Adversarial (user tries to manipulate subscription state through prompt injection, cancel-all-subscriptions-type attacks)
Metrics and thresholds:
| Metric | Threshold | Justification |
|---|---|---|
| Intent routing accuracy | ≥ 95% | Higher than avg — mis-routing a subscription action to FAQ is a trust failure |
| Action extraction accuracy | ≥ 98% | Can the system extract (action=add, ASIN=B00X12345, account=user) correctly? |
| Instruction completeness | ≥ 90% | Does the response give all required steps? Missing step = user can't complete action |
| Hallucination rate (action steps) | ≤ 0.5% | Making up UI steps that don't exist is catastrophic — sets higher bar than recommendation |
| Escalation rate (canary) | ≤ baseline − 1% | Subscription actions should reduce escalation vs. pre-chatbot; the bar is set higher |
| False action rate | 0% | LLM must never claim to have performed an action it didn't do |
The key difference from recommendation evaluation: false action rate is a hard zero-tolerance gate, not a threshold. A recommendation that's wrong is disappointing. A response that claims "I've cancelled your subscription" without actually doing so is a trust-breaking defect.
Scenario 2 — New LLM Version Rollout (Claude 3 → Claude 3.5)
What Happened
AWS announced Claude 3.5 Sonnet on Bedrock with 2× context window, improved reasoning, and better instruction following. The DS team wanted to upgrade. I required all 4 evaluation layers before this could go to production.
Metrics Used and Why
Shadow mode ran for 1 week with every production request going to both Claude 3 (live) and Claude 3.5 (shadow). These metrics were compared per-query:
| Metric | Comparison | Threshold |
|---|---|---|
| BERTScore delta | Per-intent class | New model not to drop >5% on any class |
| Response length distribution | Distribution comparison | New model avg within ±20% of old model |
| Guardrail pass rate | Rate comparison | New model ≥ old model − 1% |
| Emoji occurrence rate | New metric added mid-shadow | Zero tolerance — violates Amazon style guide |
| Hallucination score | Avg score comparison | New model ≤ old model + 0.02 |
| Intent routing change rate | Confusion matrix | <5% of requests routed differently |
What Shadow Mode Found
Finding 1 — Emoji in 12% of responses: Claude 3.5 added emoji by default. Claude 3 never did. This was invisible to offline evaluation because the golden dataset didn't check for emoji presence. Shadow mode processing real traffic revealed the pattern. Fix: added "Do not use emoji in responses" to the system prompt.
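A minimal sketch of how an emoji occurrence check could be added to the shadow-mode scorer (the metric listed in the table above). The Unicode ranges shown cover common emoji blocks and are illustrative, not exhaustive.

```python
# Sketch: zero-tolerance emoji occurrence metric over shadow-mode responses.
import re

EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]"
)

def emoji_occurrence_rate(responses: list[str]) -> float:
    """Fraction of responses containing at least one emoji."""
    flagged = sum(1 for r in responses if EMOJI_RE.search(r))
    return flagged / len(responses) if responses else 0.0
```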
Finding 2 — Response length inflation: Claude 3.5 averaged 195 tokens (vs. 120 for Claude 3), a 63% increase. Cost implication: +$2,800/day in output token cost. Fix: added explicit token budget instruction ("Respond in 120 tokens or fewer for simple queries; 200 tokens maximum for recommendations").
Finding 3 — Intent routing regression caught by shadow: Not caught in this upgrade, but in the classifier retraining. Documented here for completeness.
Metric Tradeoff: Why Shadow Mode vs. Just Running the Golden Dataset
The golden dataset is static. It cannot catch emergent behavior on query patterns that don't exist in the curated set. Shadow mode processed ~500,000 real queries over the test week — 1,000× more queries than the golden dataset and drawn from the actual live distribution, not a curated sample. The emoji issue was found in shadow mode precisely because real users make the kind of requests (casual chitchat, "what's a good manga for kids?") where Claude 3.5 was most likely to add emoji. Curated golden dataset queries were more formal.
The tradeoff: shadow mode costs ~$31,500 per week (doubles LLM cost). For a major model version change, this cost was mandatory. For a minor prompt tweak, it's disproportionate — golden dataset evaluation is sufficient.
Q&A — Scenario 2: New LLM Version Rollout
Easy Questions
Q1: What is shadow mode and why does it not affect users?
A: Shadow mode is a parallel evaluation technique where both the old model (serving) and the new model (candidate) process every real incoming request simultaneously. Only the old model's response is returned to the user. The new model's response is logged to an evaluation store but never shown. This means users experience zero change — they interact only with the trusted production model. Shadow mode gives us real traffic coverage (real query distributions, real contexts) without any risk to user experience. It's conceptually similar to a dark launch in feature flagging.
Q2: What were the two key regressions shadow mode caught in the Claude 3 → 3.5 upgrade?
A: First, Claude 3.5 added emoji to 12% of responses, violating Amazon's style guide. Zero such instances appeared with Claude 3. Second, Claude 3.5 inflated average response length from 120 to 195 tokens — a 63% increase that would have added roughly $2,800/day in output token costs and ~400ms of extra generation latency. Both issues were caught before a single user saw the new model's output, and both were fixed via prompt adjustments before promotion.
Medium Questions
Q3: You said shadow mode costs ~$31,500 per week. How do you justify this to a product team that's asking why you need to spend $31K just to test a new model version?
A: I frame it as insurance math, not infrastructure cost. The Claude 3.5 upgrade would have introduced a 63% token cost increase — at $4,500/day baseline LLM cost, that's +$2,800/day, or +$84,000/month. Finding this in shadow mode cost $31,500 once. Finding it in production would cost $84,000 every month until someone noticed and fixed it. Break-even is less than 2 weeks — after that, every month we're in net profit. The emoji issue adds to this: an Amazon product emitting emoji in responses to customers is a brand violation. Rolling that back after users saw it carries trust cost you can't put a number on, but is clearly worth more than $31K.
Q4: How do you decide when shadow mode results are "good enough" to promote to canary?
A: We use a structured decision checklist, not a single number. All of the following must be true: (1) Per-intent BERTScore delta no worse than −5% on every intent class (i.e., no class drops more than 5%). (2) Response length distribution is within ±20% of the old model's distribution. (3) Guardrail pass rate ≥ old model −1%. (4) Hallucination score ≤ old model +0.02. (5) Routing change rate <5% (for classifier-adjacent changes). If any threshold is violated, the candidate model is blocked and the DS team must either modify the model configuration or adjust the prompt to resolve the gap. Only after all five are green does the candidate advance to canary. The shadow period minimum is 3 days (to cover weekday/weekend traffic variation) and maximum 1 week.
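A sketch of that checklist expressed as code, under the thresholds above; the field names are assumptions about how shadow-mode results might be aggregated, not the actual pipeline schema.

```python
# Sketch: shadow-to-canary promotion decision over aggregated shadow results.
def ready_for_canary(shadow: dict) -> bool:
    checks = [
        # (1) no intent class drops BERTScore by more than 5 points
        all(delta > -0.05 for delta in shadow["per_intent_bertscore_delta"].values()),
        # (2) average response length within ±20% of the old model
        abs(shadow["new_avg_len"] - shadow["old_avg_len"]) <= 0.20 * shadow["old_avg_len"],
        # (3) guardrail pass rate within 1 point of the old model
        shadow["new_guardrail_pass_rate"] >= shadow["old_guardrail_pass_rate"] - 0.01,
        # (4) hallucination score not more than 0.02 worse
        shadow["new_hallucination_score"] <= shadow["old_hallucination_score"] + 0.02,
        # (5) routing change rate under 5%
        shadow["routing_change_rate"] < 0.05,
    ]
    return all(checks)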
Hard Questions
Q5: Shadow mode assumes the old model and new model receive identical inputs. What are the sources of non-identicalness in your production setup, and how do you control for them?
A: Four sources of divergence and how we handled each:
1. RAG context divergence: If the RAG pipeline retrieves different chunks between the two model calls (due to caching expiry, index updates, or non-determinism in HNSW approximate nearest neighbor search), the two models receive different context, making per-query comparison invalid. Fix: We snapshot the assembled prompt at the load balancer and replay the identical prompt to both models. RAG is called once, and the assembled context is forked to both branches. Both models see byte-for-byte identical input.
2. Timing-based context changes: In a live system, product prices change between request A (to old model) and request A (to new model) if there's any delay. Fix: Both models are called in parallel from the same assembled context snapshot. The latency window between the two calls is less than 100ms, and product price context is assembled once and shared.
3. Model non-determinism (temperature): Both models are called with temperature=0 for shadow mode evaluation. This removes stochastic variation as a confounder, at the cost of not representing the temperature used in production. For the final shadow report, we'd run a subset (10% of traffic) at production temperature to check that conclusions hold at higher temperature.
4. Request-level ordering effects: For multi-turn conversations, the conversation history entering each model diverges as soon as one turn is processed differently. We address this by evaluating shadow mode only on single-turn requests within multi-turn sessions — never passing shadow-generated history back into the new model's context. Full multi-turn shadow evaluation requires a separate test harness with simulated conversation replays.
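A minimal sketch of the shared-snapshot pattern behind points 1–3: RAG and context assembly run exactly once, and both models receive the identical prompt in parallel at temperature 0. `call_model` and `assemble_context` are placeholders for illustration, not real SDK calls.

```python
# Sketch: fork one frozen prompt snapshot to both the live and shadow models.
import asyncio

async def shadow_compare(query: str, assemble_context, call_model) -> dict:
    prompt_snapshot = assemble_context(query)            # RAG + history assembled once
    live_task = call_model("claude-3", prompt_snapshot, temperature=0)
    shadow_task = call_model("claude-3-5", prompt_snapshot, temperature=0)
    live_resp, shadow_resp = await asyncio.gather(live_task, shadow_task)
    return {"prompt": prompt_snapshot, "live": live_resp, "shadow": shadow_resp}
```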
Q6: The emoji finding suggests your golden dataset had a coverage gap — it didn't test for emoji presence. How do you systematically find and close coverage gaps in the golden dataset?
A: Golden dataset gaps emerge from two failure modes: (a) known unknowns — categories we forgot to test, and (b) unknown unknowns — behavior the model exhibits on real traffic that no curated test captures.
For known unknowns, I maintain a "coverage matrix" that maps the golden dataset against a taxonomy of outputs: format attributes (emoji, markdown, JSON compliance), content attributes (ASIN presence, price mention, competitor mention), behavioral attributes (response_length, tone, refusal rate). After the emoji finding, "emoji occurrence rate" was added to this matrix and 10 casual/chitchat queries were added to the golden set that historically triggered emoji in general models.
For unknown unknowns, the process is: shadow mode findings → golden dataset additions. Every regression found in shadow mode that wasn't caught by the golden dataset becomes a mandatory addition. Within one sprint of the emoji finding, 10 new golden queries specifically designed to elicit emoji were added. If the original golden dataset had been run against Claude 3.5 before shadow mode, it would have missed the emoji behavior; after the addition, it catches it. The golden dataset learns from every production failure.
Very Hard Questions
Q7: You observed a 63% response length inflation with Claude 3.5. You added a token budget instruction to the prompt. How do you verify that the token budget instruction is causing the length reduction — rather than simply suppressing a quality improvement that Claude 3.5 was trying to make? In other words, how do you distinguish "the new model is better, it's just more verbose" from "the new model is worse, it's padding responses"?
A: This is exactly the right question to ask, and it's one I wrestled with before accepting the token budget instruction as the right fix.
The test I ran: for a stratified sample of 200 responses from shadow mode (100 long from Claude 3.5, 100 comparable from Claude 3), I ran human evaluation with the following rubric: "Ignoring length, is the additional content in the longer response (a) meaningfully more helpful, (b) neutral filler, or (c) actively unhelpful noise?" Two evaluators rated each pair.
Results: 68% of the additional content in Claude 3.5's longer responses was rated "neutral filler" (repeating information already stated, generic caveats, unnecessary caveats about the LLM being an AI). 22% was "meaningfully more helpful" (additional recommendation context, format details the user would benefit from). 10% was irrelevant.
This data supported the token budget instruction as the right fix: we were trimming 78% filler while retaining 22% quality. But to preserve the quality, I made the token budget instruction specific, not arbitrary: "For recommendations, limit to 150 tokens but include all relevant format and edition details. For FAQ responses, limit to 80 tokens." Rather than a hard ceiling that discards quality, it was an intent-aware budget that forces compression of filler while requiring retention of the informative content.
Post-deployment verification: we checked BERTScore for the length-constrained Claude 3.5 vs. unconstrained Claude 3. Constrained Claude 3.5 scored 0.85 BERTScore vs. 0.82 for Claude 3 — meaning the length constraint improved quality compared to the old model, not degraded it. The extra length in Claude 3.5 was not quality.
Scenario 3 — Intent Classifier Retraining
What Happened
After 6 months in production, the intent classifier's production accuracy dropped from 87% to 82% — a 5-point drift. The DS team retrained classifier V3 and needed to validate it before deploying.
Metrics Used and Why
| Metric | Threshold | Why This Instead Of |
|---|---|---|
| Macro F1 | ≥ 0.88 | Treats all 8 classes equally. Accuracy at 90% can hide a class at 0.50 F1 |
| Per-class AUC-PR | ≥ 0.85 all classes | AUC-ROC inflates for rare classes — the escalation class (5% of traffic) had 0.97 AUC-ROC but only 0.82 AUC-PR; AUC-PR told the truth |
| Confusion matrix analysis | Qualitative review of top-3 confusion pairs | Tells you what is misclassified, not just how much |
| Classification confidence (log loss) | ≤ 0.25 | Models with good accuracy but poor calibration trigger false confidence, suppressing fallback when fallback should fire |
| Escalation rate (canary) | ≤ baseline + 1% | Offline accuracy gains mean nothing if production escalation rises |
Metric Tradeoff: Macro F1 vs Accuracy vs Weighted F1
Accuracy aggregates correct classifications into a single percentage. For our 8 classes with different frequencies (recommendation 35%, order_tracking 18%, ..., escalation 5%), a degenerate classifier that predicts "recommendation" for every query already achieves 35% accuracy. More insidiously, a classifier that is perfect on the frequent classes making up 95% of traffic and random on the rare 5% still reports roughly 95% accuracy while failing on the classes where mistakes hurt most. Accuracy is the wrong metric for production classifiers on imbalanced classes.
Weighted F1 weights each class's F1 by its frequency. This means a heavily weighted improvement on recommendation (35%) overwhelms a degradation on escalation (5%). In production, misclassifying an escalation query has higher consequences than misclassifying a recommendation query — a missed escalation means an angry customer who needed human help got an LLM response instead. Weighted F1 buries this risk.
Macro F1 treats all classes equally regardless of frequency. A drop in escalation F1 from 0.88 to 0.72 is as visible as a drop in recommendation F1. This is the correct metric for a system where all class failures have material consequences, even if they have different frequencies.
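To make the gap concrete, here is a small sketch (assuming scikit-learn) contrasting accuracy, weighted F1, and macro F1 on a toy imbalanced split — illustrative data, not our production traffic.

```python
# Sketch: a classifier that is perfect on a frequent class but misses a rare one.
from sklearn.metrics import accuracy_score, f1_score

# 95 frequent-class queries classified correctly, 5 rare-class queries all missed.
y_true = ["recommendation"] * 95 + ["escalation"] * 5
y_pred = ["recommendation"] * 95 + ["recommendation"] * 5

print(accuracy_score(y_true, y_pred))                   # 0.95 — looks fine
print(f1_score(y_true, y_pred, average="weighted"))     # ~0.93 — still looks fine
print(f1_score(y_true, y_pred, average="macro"))        # ~0.49 — exposes the rare-class failure
```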
Q&A — Scenario 3: Intent Classifier Retraining
Easy Questions
Q1: Why is accuracy a poor metric for your intent classifier?
A: The 8 intent classes have very different traffic frequencies — recommendation accounts for 35% of traffic while escalation is only 5%. A classifier that correctly classifies the four most common intents and randomly handles the four rarest would achieve over 80% accuracy despite failing on important minority classes. Accuracy gives you a single number that hides class-level failures. In our case, escalation at 5% traffic needs to be classified correctly because a missed escalation means a customer who needed human help gets an LLM response — a trust-breaking failure. Macro F1, which treats all classes equally, immediately surfaces any per-class regression.
Q2: What is the confusion matrix and what did it tell you about the classifier?
A: The confusion matrix is a grid where rows represent true classes and columns represent predicted classes. Each cell shows the count of times a true class was predicted as a different class. In our case, the classifier's most common confusion pair was recommendation (true) → product_question (predicted). This told us the classifier struggled with implicit recommendation queries like "What should I read next?" — these look like product questions because they're open-ended, but the correct handler is the recommendation intent. The confusion matrix revealed the specific failure mode, which guided the DS team to add more implicit recommendation examples to the retraining set. Without the confusion matrix, we would have known accuracy was low on recommendation but not known why.
Medium Questions
Q3: Explain why AUC-ROC gave you a misleadingly optimistic picture of the escalation intent, while AUC-PR gave you the correct signal.
A: AUC-ROC measures the area under the Receiver Operating Characteristic curve, which plots true positive rate against false positive rate at all classification thresholds. For a rare class like escalation (5% of traffic), the true negative count is enormous. A classifier that predicts "not escalation" for nearly everything still has a low false positive rate, because there are so many true negatives that even moderate false positives represent a small fraction. This inflates AUC-ROC.
AUC-PR measures the area under the Precision-Recall curve — precision (of all the times I predicted escalation, how often was I right?) against recall (of all true escalations, how many did I catch?). When the positive class is rare, any false positive significantly drops precision, and any false negative drops recall. AUC-PR cannot hide behind a mountain of true negatives. For our escalation class, AUC-ROC was 0.97 — looks excellent. AUC-PR was 0.82 — reveals that we were missing 18% of true escalations and/or had significant false positives. The right metric depends on class prevalence and the cost of each error type.
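A small illustrative sketch of the same effect with synthetic scores (scikit-learn): on a 5%-prevalence class, AUC-ROC looks strong while average precision (area under the PR curve) reveals the weakness. The distributions are made up for illustration.

```python
# Sketch: AUC-ROC vs. AUC-PR on a rare positive class.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
y_true = np.array([1] * 50 + [0] * 950)                 # 5% positives (escalation)
scores = np.concatenate([
    rng.normal(0.75, 0.15, 50),                         # positives score fairly high...
    rng.normal(0.40, 0.15, 950),                        # ...but overlap with many negatives
])

print("AUC-ROC:", roc_auc_score(y_true, scores))            # inflated by the huge TN count
print("AUC-PR :", average_precision_score(y_true, scores))  # penalized by every false positive
```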
Q4: How did you validate that the offline accuracy improvement in V3 (87% → 92% macro F1) translated to production improvement?
A: Offline validation tells you the model improved on the test set. Production tells you whether it improved on real traffic. Three-step validation:
First, shadow mode for 3 days: V3 ran on all production requests in parallel with V2. We compared the intent predictions and flagged all disagreements. A human reviewer spot-checked 200 disagreements to determine which classifier was correct — V3 was correct in 78% of disagreements, validating the offline improvement held on live traffic.
Second, canary deployment at 1% for 48 hours: we tracked escalation rate, which is a downstream signal of misclassification. If the classifier sends a customer's escalation request to the FAQ handler, the customer gets an unhelpful response and is more likely to escalate to human support. V3 at 1% traffic showed escalation rate at 10.8% vs. baseline 12% — a real improvement.
Third, confidence calibration check via log loss: V3's log loss was 0.21 vs. V2's 0.28. Lower log loss means the model is more confident when correct and less falsely confident when wrong. This matters because the fallback mechanism uses classification confidence — miscalibrated confidence suppresses fallback on borderline queries.
Hard Questions
Q5: Your classifier outputs 8-class probabilities. How does the system use these probabilities after the argmax class is selected, and what would go wrong if the probabilities were well-calibrated on training data but poorly calibrated on production traffic?
A: The system uses the classification probability in two ways beyond just taking the argmax:
- Fallback threshold gating: If the max probability is below 0.65, the request bypasses the intent handler and goes directly to a generalist "I'm not sure what you need" flow, which prompts clarification. This prevents confident misclassification from producing wrong responses.
- Model routing in the tiering system: For queries with probabilities between 0.65 and 0.80 on a complex intent (recommendation, multi-turn), we route to Claude Sonnet. Above 0.80, we can route simpler intents to Haiku. Calibration affects model selection and cost.
If probabilities are well-calibrated in training but poorly calibrated on production traffic, several failures emerge:
- Overconfident misclassification: The model predicts product_question at 0.92 probability for a query that is actually an escalation. High confidence suppresses the fallback gate — the user gets a product question handler response when they needed human escalation. This is a silent failure — no log entry for "fallback suppressed," just a wrong response delivered confidently.
- Systematic routing drift: If production queries have shifted toward a distribution where the model consistently under-predicts probability for recommendation queries (e.g., more casual phrasing post-holiday season), those queries will fall into the generalist fallback more often, increasing resolution time without triggering any accuracy alert.
We detected calibration drift by tracking the average confidence score as a weekly trend metric. When average confidence for any class dropped below 0.82 for two consecutive weeks (down from 0.88 baseline), it triggered a retraining investigation. This is why log loss is in the monitoring dashboard, not just initial evaluation.
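A minimal sketch of that weekly confidence-trend check; the data layout and field names are assumptions, not the production monitoring schema.

```python
# Sketch: per-class mean-confidence trend with a two-consecutive-week trigger.
def calibration_drift_alerts(weekly_confidence: dict[str, list[float]],
                             floor: float = 0.82) -> list[str]:
    """weekly_confidence maps intent -> mean max-probability per week, oldest first."""
    alerts = []
    for intent, series in weekly_confidence.items():
        if len(series) >= 2 and series[-1] < floor and series[-2] < floor:
            alerts.append(f"{intent}: mean confidence below {floor} for 2 consecutive weeks")
    return alerts
```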
Q6: Shadow mode for the classifier comparison showed V3 routing 8% of requests differently than V2. How do you decide whether this routing change is an improvement or a regression?
A: A routing change of 8% is neither good nor bad by itself — it's a signal that requires structured investigation. Four-step process:
Step 1 — Automatic categorization of routing changes: Split the 8% into three buckets: (a) queries where V3 matches human-labeled ground truth and V2 doesn't (improvements), (b) queries where V2 matches ground truth and V3 doesn't (regressions), (c) queries with no ground truth label (ambiguous). For this classifier retraining, 63% of the 8% routing changes were in bucket (a), 22% in bucket (b), 15% ambiguous.
Step 2 — Consequence analysis for each regression bucket: For the 22% that were regressions, what intent pair was wrong? recommendation → product_question routing errors have low consequence (the product question handler will still give relevant information). order_tracking → faq routing errors have medium consequence. escalation → recommendation errors have high consequence. We weighted regressions by consequence and found V3's regressions were concentrated in medium-consequence routing pairs, not high-consequence ones.
Step 3 — Downstream metric projection: For each regression routing pair, what's the expected escalation rate and thumbs-down rate change? Using historical data on how each routing pair affects downstream metrics, V3's projected escalation rate impact was −0.8% net (improvements outweighed regressions).
Step 4 — Human review of the 15% ambiguous: For queries with no ground truth, two reviewers labeled the correct intent. V3 was preferred in 71% of ambiguous cases. This provided confidence to deploy V3.
Very Hard Questions
Q7: Design a multi-armed bandit evaluation system that replaces static A/B testing for classifier versions. When would multi-armed bandit evaluation outperform or underperform fixed canary splits for this specific problem?
A: Multi-armed bandit (MAB) evaluation dynamically allocates more traffic to better-performing variants in real time, reducing regret (the cost of exposing users to an inferior variant during the test period). For classifier comparison:
Implementation design: Use Thompson Sampling with a Beta distribution over escalation rate (lower is better, so model as 1 - escalation_rate). Start with 50/50 split. After each batch of 100 requests, update the Beta parameters for each classifier: alpha += correct_classifications, beta += incorrect_classifications (approximated by downstream escalation rate as proxy). Sample from each distribution; route next batch proportionally.
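A minimal sketch of that Thompson Sampling allocation, with Beta posteriors and "no downstream escalation" as the per-request success proxy. This is illustrative; the real system would batch updates and handle delayed rewards.

```python
# Sketch: Thompson Sampling over two classifier versions with Beta posteriors.
import numpy as np

class ThompsonRouter:
    def __init__(self, arms=("classifier_v2", "classifier_v3")):
        self.alpha = {a: 1.0 for a in arms}   # successes + 1 (uniform prior)
        self.beta = {a: 1.0 for a in arms}    # failures + 1

    def choose_arm(self) -> str:
        samples = {a: np.random.beta(self.alpha[a], self.beta[a]) for a in self.alpha}
        return max(samples, key=samples.get)

    def update(self, arm: str, escalated: bool) -> None:
        if escalated:
            self.beta[arm] += 1.0             # escalation counts as a failure
        else:
            self.alpha[arm] += 1.0
```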
When MAB outperforms fixed canary:
- When the performance gap is large and quickly detectable: MAB quickly shifts to the better model, reducing total user exposure to the worse variant
- When testing in a high-traffic period where a bad model has high per-hour impact cost
- When the metric is fast to observe (thumbs-down is available in seconds; CSAT takes 24 hours)
When MAB underperforms fixed canary:
- Delayed outcomes: Escalation rate can be observed within minutes, but resolution rate may not be known for 48 hours (customer issue resolution). MAB with delayed reward signals produces noisy allocation decisions that may favor the currently-observed-better model for the wrong reasons
- Non-stationarity: If traffic patterns shift (e.g., a Prime Day spike starts), the MAB's learned allocation may be based on a distribution that no longer matches the current traffic. Fixed canary with a pre-planned duration is more robust to distribution shifts during the test window
- Multi-metric objectives: We need to optimize for escalation rate AND thumbs-down AND cost simultaneously. MAB with a single reward signal optimizes for one dimension; scalarizing multiple metrics into a single reward introduces arbitrary weighting decisions
- Statistical auditing requirements: Fixed canary produces a clean pre/post comparison that passes statistical significance checks for reporting. MAB produces shifting allocations that are harder to audit retroactively if a regression is found post-deployment
Conclusion for this specific use case: Fixed canary is superior because (a) our key metrics have delayed feedback (escalation takes hours), (b) we compare on multiple axes simultaneously, and (c) traffic is non-stationary (recommendations spike on weekends). MAB would be appropriate for a single fast-signal metric in a stable traffic environment.
Scenario 4 — RAG Pipeline Evaluation
What Happened
The embedding model was upgraded from amazon.titan-embed-text-v1 to amazon.titan-embed-text-v2:0. This required a full re-evaluation of retrieval quality before migrating the production vector index.
Metrics Used and Why
| Metric | Threshold | Role |
|---|---|---|
| Recall@3 | ≥ 85% | Primary coverage gate — did we retrieve at least one ground-truth document in top 3? |
| Precision@3 | ≥ 60% | Noise gate — of the 3 retrieved chunks, how many were actually relevant? |
| MRR (Mean Reciprocal Rank) | ≥ 0.75 | Did the best document come first? LLM priming is stronger for top-ranked chunks |
| NDCG@3 | ≥ 0.82 | Graded relevance — rewards highly relevant docs ranked first over partially relevant |
| Reranking lift | ≥ +0.05 on NDCG@3 | Does the cross-encoder reranker add value over embedding-only retrieval? |
Metric Tradeoff: Why Recall@3 and Precision@3 Together
Recall@K alone tells you "did you find the answer?" but says nothing about noise injection. A system that retrieves 3 chunks of which 2 are irrelevant still achieves 100% Recall@3 as long as the relevant document is among them. But those 2 irrelevant chunks go into the LLM prompt, adding ~1,000 wasted tokens, confusing the LLM, and increasing hallucination risk.
Precision@K alone tells you "are retrieved docs relevant?" but optimizing for it alone produces over-conservative retrieval — a retriever that only returns results above a very high confidence threshold achieves perfect precision but misses relevant documents that needed to be included.
The correct signal is effective quality = Recall@K × Precision@K. A system with Recall@3 = 0.90 and Precision@3 = 0.40 has effective quality = 0.36 — worse than a system with Recall@3 = 0.85 and Precision@3 = 0.70 (effective quality = 0.595). Tracking both forces optimization of both axes simultaneously.
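A small sketch of how Recall@3, Precision@3, and the composite effective-quality number could be computed per query against a set of ground-truth relevant chunk IDs; function names are illustrative.

```python
# Sketch: retrieval coverage, noise, and the composite effective-quality score.
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 3) -> float:
    return 1.0 if any(doc in relevant for doc in retrieved[:k]) else 0.0

def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 3) -> float:
    top = retrieved[:k]
    return sum(doc in relevant for doc in top) / len(top) if top else 0.0

def effective_quality(queries: list[tuple[list[str], set[str]]], k: int = 3) -> float:
    r = sum(recall_at_k(ret, rel, k) for ret, rel in queries) / len(queries)
    p = sum(precision_at_k(ret, rel, k) for ret, rel in queries) / len(queries)
    return r * p   # e.g., 0.90 * 0.40 = 0.36 vs. 0.85 * 0.70 = 0.595
```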
Metric Tradeoff: MRR vs NDCG
MRR cares only about the position of the first relevant document. It's binary at the document level (relevant or not). It's the right metric if document relevance is binary and the LLM benefits most from the first retrieved chunk.
NDCG@K uses graded relevance — a document can be highly relevant (score=3), partially relevant (score=2), somewhat relevant (score=1), or irrelevant (score=0). NDCG rewards placing the most relevant document at position 1, the second-most relevant at position 2, etc. For our use case, RAG chunks can be "directly answers the question" vs. "tangentially related to the topic" — graded relevance is more realistic.
In practice: we used MRR as a quick gate ("did a relevant doc make top 3?") and NDCG@3 as the optimization target ("is the ranking order maximizing information value to the LLM?").
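For reference, a minimal sketch of the two ranking metrics as used here: MRR over binary relevance and NDCG@3 over graded relevance. The IDCG is computed from the same retrieved list, a common simplification.

```python
# Sketch: MRR (position of first relevant hit) and NDCG@3 (graded, log-discounted).
import math

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    for i, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(graded_relevance: list[int], k: int = 3) -> float:
    """graded_relevance: relevance grade (0-3) of each retrieved chunk, in rank order."""
    dcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(graded_relevance[:k], start=1))
    ideal = sorted(graded_relevance, reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 1) for i, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```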
Q&A — Scenario 4: RAG Pipeline Evaluation
Easy Questions
Q1: What is Recall@3 measuring and why is 3 the K value you care about?
A: Recall@3 measures what fraction of queries had at least one ground-truth relevant document appear in the top 3 retrieved chunks. The K=3 comes from our production setting: we inject exactly 3 chunks into the LLM prompt. K=3 is the only K that matters for our retrieval system because the 4th, 5th, or 10th chunk is never seen by the LLM. Optimizing Recall@10 would be meaningless — even if we achieve 96% Recall@10, the 4th through 10th chunks never influence the response. Measure and optimize for the K you actually use.
Q2: What is the reranking step in the RAG pipeline and how did you measure whether it was worth the latency cost?
A: After the vector embedding search returns the top 10 candidate chunks, a cross-encoder reranker (a smaller BERT-based model) re-scores each candidate using the full query-chunk pair (rather than independent query and chunk embeddings). The reranker is slower — it adds ~50ms — but it considers the query and chunk jointly. We measured reranking lift as the improvement in NDCG@3 when the reranker's re-sorted top 3 is used compared to the embedding-only top 3. Our lift was +0.07 on NDCG@3 — 7 percentage points of ranking quality improvement for 50ms of latency. For recommendation and product question intents where the top-ranked chunk strongly influences LLM response quality, this was clearly worth it. For order tracking (templated responses), we disabled the reranker because the response is deterministic regardless of RAG ranking — saving 50ms per request on 18% of traffic.
Medium Questions
Q3: You described "effective quality = Recall@K × Precision@K." How did the two embedding models compare on this composite score?
A: Titan v1 had Recall@3 = 0.88 and Precision@3 = 0.52, giving effective quality = 0.458. Titan v2 had Recall@3 = 0.91 and Precision@3 = 0.73, giving effective quality = 0.664. The v2 model was a 45% improvement in effective quality. The Precision improvement was the bigger driver — v1 was retrieving relevant documents but "diluting" them with off-topic chunks. The practical impact: with v2, the LLM received 2.2 relevant chunks on average (vs 1.6 with v1), reducing hallucination rate by approximately 18% on product question intents.
Q4: How do you create ground truth for RAG evaluation? Who decides which document is "relevant" to which query?
A: Ground truth creation was a joint effort between the DS team and the product catalog team:
Step 1: We sampled 200 production queries stratified by intent (higher weight on product_question and recommendation where retrieval quality matters most). For each query, we manually ran retrieval with a generous top-20 setting and got all candidate chunks.
Step 2: Two annotators independently labeled each chunk as highly relevant (3), partially relevant (2), tangentially relevant (1), or irrelevant (0). Inter-annotator agreement was κ=0.79 (substantial). Disagreements were resolved by a third reviewer.
Step 3: For ongoing weekly RAG evaluation, we use the LLM itself as a partial annotator — we prompt Claude to rate whether a retrieved chunk "contains information necessary to answer the user query." This LLM-as-judge approach matches human annotation at κ=0.76, which is close enough for weekly monitoring at lower cost. Human annotation is reserved for quarterly deep dives and embedding model transitions.
Hard Questions
Q5: Your embedding-based retrieval uses approximate nearest neighbor search (HNSW in OpenSearch Serverless). HNSW trades recall for speed — it sometimes misses vectors that would have been in the exact top-K. How do you measure and bound the error introduced by approximate search, and at what point would the approximation error be unacceptable?
A: HNSW's recall error is parameterized by ef_search (the exploration factor during query time). Higher ef_search = more exhaustive search = higher recall but higher latency. We set ef_search=128 as baseline.
Measuring approximation error: Monthly, we run 1,000 random queries against both the HNSW index (approximate) and a brute-force exact k-NN search on a read replica, and compare Recall@3 between the two. HNSW runs at roughly one-fifth the search latency of exact k-NN; the recall gap is quantified below.
Quantifying the gap: HNSW Recall@3 = 0.91. Exact Recall@3 = 0.94. The 3-point gap means 3% of queries where the correct chunk was in the true top 3 but HNSW's approximation missed it. For 500K daily requests, that's ~15,000 queries where the LLM had a worse starting context.
Acceptable threshold: The threshold is defined by the hallucination rate correlation. We determined that a 1% drop in Recall@3 corresponds to approximately a 0.8% increase in hallucination rate on product_question intent. If HNSW approximation error grew to a 10-point gap from exact (e.g., due to index fragmentation after many updates), that would correspond to ~8% hallucination rate increase — unacceptable. Our trigger: if HNSW recall drops more than 5 points below exact recall on the monthly benchmark, we either re-tune ef_search upward or trigger an index rebuild.
Q6: The reranker adds 50ms. At 500K requests/day with an average TTFT SLA of 200ms, what is the cost-benefit calculation for keeping the reranker, and how would you selectively enable/disable it?
A: At 500K requests/day, the reranker adds 50ms × 500,000 = 25,000 seconds of total latency per day across all users. It's not 50ms added to every user's experience in a serial way — these are parallel requests — but it does add 50ms to the TTFT of every request that uses it.
Cases where reranker is clearly not worth it:
- order_tracking (18% of traffic): response is templated, LLM output is nearly identical regardless of which chunk ranks first. Disabling saves 50ms × 90K daily requests.
- faq_policy (16% of traffic): policy answers are exact document retrieval, not semantic generation. The LLM is effectively quoting the top chunk. NDCG ranking matters less than pure Recall. Disable for 80K daily requests.
Cases where reranker is clearly worth it:
- recommendation (35% of traffic): the top-ranked editorial chunk strongly influences recommendation framing. NDCG@3 lift = +0.07 is meaningful here.
- product_question (20%): hallucination risk is highest, and ranking affects which product facts are in position 1 of the prompt. Most important use of the reranker.
Net result of selective enabling: keep reranker active for 55% of traffic (recommendation + product_question = 275K requests/day), disable for 45% (270K requests/day). Latency savings: 50ms × 270K = 3.75 hours of aggregate latency saved per day. Infrastructure cost savings: the reranker was running on an Inferentia instance at $800/month — disabling for 45% of traffic reduces utilization enough to downgrade to a smaller instance saving ~$300/month.
Very Hard Questions
Q7: Your NDCG@3 evaluation uses annotated relevance labels that are themselves uncertain (annotator disagreement κ=0.79). How does inter-annotator uncertainty propagate through the NDCG@3 score, and how should you report NDCG in the context of this uncertainty?
A: This is a measurement uncertainty problem that most ML teams ignore. κ=0.79 means annotators disagree on approximately 21% of judgments (roughly). These disagreements are not uniformly distributed — they concentrate on the "highly relevant" vs. "partially relevant" boundary (graded relevance distinctions are harder than binary relevance judgments).
Uncertainty propagation through NDCG: NDCG@3 = DCG@3 / IDCG@3. DCG@3 = Σ(rel_i / log2(i+1)) for i=1,2,3. If the relevance label for position 1 (the highest-impact position due to the log discount) is uncertain — annotator A says 3 and annotator B says 2 — the NDCG changes by approximately 0.06 on a 0-1 scale (1/log2(2) × (3-2) / max_DCG). For a system with NDCG@3 = 0.82, the annotator uncertainty alone introduces ±0.03-0.05 noise in the reported score.
Correct reporting approach: Report NDCG@3 with a confidence interval derived from the annotation uncertainty. Using the disagreement rate and its distribution across positions:
- Point estimate: 0.82
- 95% CI from annotation uncertainty: [0.78, 0.86]
This CI means that System A (NDCG 0.84) and System B (NDCG 0.82) are statistically indistinguishable at the raw annotation level — any claim that A beats B needs to account for both systems' intervals, not just the point estimates.
Practical implication for our system: We don't report NDCG@3 as a single number except for quarterly deep dives where annotation count (n > 200) is sufficient for narrow CIs. For weekly monitoring, we track NDCG trends directionally — a move from 0.82 to 0.75 over 4 weeks is a real regression signal regardless of annotation uncertainty. A move from 0.82 to 0.80 is within annotation noise and should not trigger a retraining response without corroborating signals (e.g., hallucination rate increase).
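One way to attach an interval for the quarterly report is a bootstrap over per-query NDCG scores — a minimal sketch below. Note this captures query-sampling variance; reflecting annotation uncertainty directly would require an additional step that perturbs disputed labels before rescoring.

```python
# Sketch: percentile bootstrap CI over per-query NDCG@3 scores.
import numpy as np

def ndcg_bootstrap_ci(per_query_ndcg: list[float], n_boot: int = 2000, alpha: float = 0.05):
    scores = np.asarray(per_query_ndcg)
    rng = np.random.default_rng(42)
    boots = [rng.choice(scores, size=len(scores), replace=True).mean() for _ in range(n_boot)]
    lo, hi = np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return scores.mean(), (lo, hi)
```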
Scenario 5 — Canary Deployment: Escalation Rate Spike
What Happened
During Stage 1 of a canary deployment for intent classifier V3, the auto-rollback system fired after 18 hours — escalation rate on the canary bucket was 14.3% vs. baseline 12.0%, a 2.3-point increase that crossed the +2% rollback threshold. But the DS team believed the signal was statistical noise because the canary had only processed 4,200 requests at that point.
Metrics Used and Why
| Metric | Canary Value | Baseline | Delta | Rollback Threshold |
|---|---|---|---|---|
| Escalation rate | 14.3% | 12.0% | +2.3% | +2.0% (auto-rollback) |
| Thumbs-down rate | 8.5% | 8.1% | +0.4% | +3.0% (not triggered) |
| Error rate | 0.3% | 0.3% | 0.0% | +0.5% (not triggered) |
| P99 TTFT | 485ms | 490ms | −5ms | (not triggered) |
The Statistical Significance Dispute
The team's argument: escalation rate at 1% traffic with 4,200 requests. Baseline is 12%, observed is 14.3%.
H0: canary escalation rate = 12% (same as baseline)
Ha: canary escalation rate > 12%
n_canary = 4,200
observed escalations = 4200 × 0.143 = 601
expected under H0 = 4200 × 0.12 = 504
z = (601 - 504) / sqrt(4200 × 0.12 × 0.88) = 97 / 21.1 ≈ 4.6
p-value ≈ 0.000002 (highly significant)
The signal was not noise. The rollback was correct.
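For reference, the same one-sided two-proportion test as a small Python sketch (normal approximation under H0), so the identical check can run inside the rollback pipeline.

```python
# Sketch: is the canary escalation rate significantly above baseline?
import math
from statistics import NormalDist

def escalation_z_test(n_canary: int, observed_rate: float, baseline_rate: float):
    observed = n_canary * observed_rate
    expected = n_canary * baseline_rate
    se = math.sqrt(n_canary * baseline_rate * (1 - baseline_rate))  # std. error under H0
    z = (observed - expected) / se
    p_value = 1 - NormalDist().cdf(z)        # one-sided: canary rate > baseline
    return z, p_value

print(escalation_z_test(4200, 0.143, 0.12))  # z ≈ 4.6, p ≈ 2e-6
```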
Post-mortem revealed V3 had a regression specifically on the return_request intent when combined with multi-turn context — an edge case not covered in shadow mode because shadow mode had not run full multi-turn evaluation for V3.
Q&A — Scenario 5: Canary Deployment
Easy Questions
Q1: Why run a canary at 1% first instead of just deploying to 100%?
A: If a model version has a regression that passed offline evaluation and shadow mode, the canary ensures only 1% of users experience the regression while 99% continue to receive the known-good model. At 500K requests/day, 1% canary exposes 5,000 users daily to the new model — enough to detect real regressions with statistical significance for most daily-frequency metrics. A direct 100% deployment would expose all 500K users to a bad model for however long it takes to detect and roll back — potentially 20-60 minutes with alert latency. At that scale, a 2% escalation rate increase would generate ~1,000 incremental unnecessary escalations before rollback — each escalation costing engineering time and user trust.
Q2: What is auto-rollback and what triggers it?
A: Auto-rollback is a mechanism that automatically reverts the canary to 0% (restoring the previous model at 100%) when a monitored metric crosses a defined threshold. The rollback triggers without requiring a human engineer to be on-call and react. In Scenario 5, the escalation rate threshold was set at +2% above baseline — if the canary's 1-hour rolling escalation rate exceeds the 24-hour baseline by 2 or more percentage points, the traffic splitter immediately routes all traffic back to the previous model and fires a PagerDuty alert. Auto-rollback is essential for operating at scale: a 2% escalation regression at 500K requests/day generates 10,000 unnecessary escalations per day if handled only by on-call reaction times.
Medium Questions
Q3: The DS team argued that 4,200 requests was too small a sample for statistical significance. You calculated that the z-score was 4.6 (p ≈ 0.000002). Walk through why they were wrong and what the minimum sample size should be for your canary.
A: The DS team's intuition that small samples produce noisy results is generally correct, but they failed to apply the math. For detecting a 2% absolute increase in escalation rate (from 12% to 14%):
Minimum detectable effect at 80% power, 95% confidence:
n = (z_α/2 + z_β)² × (p₁(1-p₁) + p₂(1-p₂)) / (p₁ - p₂)²
n = (1.96 + 0.84)² × (0.12×0.88 + 0.14×0.86) / (0.02)²
n = 7.84 × (0.1056 + 0.1204) / 0.0004
n ≈ 7.84 × 0.226 / 0.0004 ≈ 4,430
We needed 4,430 samples, and had 4,200. Very close — technically slightly underpowered, but the observed z-score of 4.6 suggests the true effect size was larger than the minimum 2% we were powered to detect. The observed 2.3% increase at 4,200 samples is statistically highly significant.
The correct application of "minimum sample size" logic: we should have pre-committed to not making a rollback decision until we had 4,430 samples (12 hours at 1% traffic). This is a procedural fix — the rollback trigger should require both exceeding the threshold AND reaching minimum sample size. We updated the rollback logic after this incident to require n ≥ 5,000 before auto-rollback fires on escalation rate.
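A minimal sketch of the same power calculation, so the minimum-n gate can be recomputed for any baseline/target rate pair; it uses unrounded critical values, so the result is slightly above the hand calculation's 4,430.

```python
# Sketch: minimum sample size to detect a change from p1 to p2 in a proportion.
import math
from statistics import NormalDist

def min_sample_size(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ≈ 1.96 for alpha = 0.05 (two-sided)
    z_beta = NormalDist().inv_cdf(power)            # ≈ 0.84 for 80% power
    numerator = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))
    return math.ceil(numerator / (p1 - p2) ** 2)

print(min_sample_size(0.12, 0.14))   # ~4,435 (the text's 4,430 uses rounded z values)
```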
Q4: Thumbs-down rate showed only +0.4% increase (well below the +3% threshold). Escalation rate showed +2.3% (above the +2% threshold). Why did these two user-satisfaction signals diverge, and which do you trust more?
A: The divergence reveals a behavioral difference between thumbs-down and escalation. Thumbs-down is an active negative signal — users must deliberately click the thumbs-down button. Escalation is a passive derived signal — users who escalate didn't explicitly rate the response; we infer dissatisfaction from the action of contacting human support.
The divergence happens because not all escalations are response-quality failures. Some users escalate because the issue is genuinely complex (e.g., multi-volume return request beyond the LLM's capability), not because the response was bad. The 2.3% escalation increase from V3 turned out to be specific to return_request + multi-turn — a routing edge case. These users didn't leave a thumbs-down because the response wasn't bad per se — it just routed incorrectly and the follow-up LLM response failed to resolve the issue.
Which to trust more? Escalation rate for action-quality regressions (routing failures, instruction failures, process failures) and thumbs-down for content-quality regressions (bad recommendations, wrong information, unhelpful tone). In this specific incident, escalation was the more sensitive detector. For scenarios involving recommendation quality degradation, thumbs-down is more sensitive. The right answer depends on the nature of the hypothesized regression.
Hard Questions
Q5: Your auto-rollback logic fires when the 1-hour rolling escalation rate exceeds baseline + 2%. What are the false-positive failure modes of this auto-rollback design, and what guardrails would you add?
A: Four false-positive failure modes:
1. Traffic composition shifts: If the canary's 1% slice receives an atypically high fraction of return_request or complex multi-turn queries (which naturally escalate more), the escalation rate would be elevated for reasons unrelated to model quality. Fix: Stratify the canary slice by intent distribution and verify it closely matches baseline. If intent distribution diverges by > 5% (KL divergence), pause the rollback decision until the sample normalizes.
2. External stochastic events: A Prime Day traffic surge where all the surge traffic is in categories that escalate more (people urgently trying to track orders, fix address errors) would elevate canary escalation rate even if the model is identical. Fix: Suppress auto-rollback triggers during flagged high-volatility windows (Prime Day ±24 hours, known traffic anomalies) and require human approval for rollback during those windows.
3. Canary cohort selection bias: If the traffic splitter assigns the canary bucket by user ID hash and that hash bucket skews toward a user segment with naturally higher escalation rates (e.g., new users who escalate more), the comparison is invalid. Fix: Periodically validate that canary bucket escalation rate on the previous (control) model matches baseline. If the canary bucket's escalation rate on the old model is already 14%, the baseline for rollback comparison should be 14%, not 12%.
4. Threshold set too tight for metric noise floor: The escalation rate has inherent hour-to-hour variance. If the 1-hour rolling window variance is ±1.5% naturally, a +2% threshold generates false alarms roughly every 12 hours. Fix: Use a CUSUM (Cumulative Sum) control chart instead of a simple threshold comparison — CUSUM detects sustained shifts above the noise floor rather than reacting to individual threshold crossings.
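A minimal sketch of the CUSUM idea, assuming hourly escalation rates as input; the allowance and decision-interval values below are illustrative, not calibrated production numbers:

```python
def cusum_alarm(hourly_rates, baseline, allowance=0.01, decision_interval=0.03):
    """One-sided CUSUM: accumulate only excursions above baseline + allowance,
    and alarm when the cumulative excess indicates a sustained upward shift."""
    s = 0.0
    for hour, rate in enumerate(hourly_rates):
        s = max(0.0, s + (rate - baseline - allowance))
        if s > decision_interval:
            return hour  # index of the hour at which the sustained shift is signalled
    return None  # no sustained shift detected

# Noise around a 12% baseline does not alarm; a sustained move to ~14% does.
print(cusum_alarm([0.125, 0.115, 0.130, 0.110], baseline=0.12))  # None
print(cusum_alarm([0.140, 0.145, 0.140, 0.150], baseline=0.12))  # 2
```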
Very Hard Questions
Q6: Design a multi-metric rollback decision function that combines escalation rate, thumbs-down rate, and error rate into a single rollback decision. How do you weight the metrics, handle their different confidence levels at early canary stages, and avoid both false rollbacks and false non-rollbacks?
A: A multi-metric rollback decision function needs to solve three problems: (1) different metrics have different noise levels and different sample-size requirements for significance, (2) different metrics detect different types of regressions, (3) combining them requires value judgment about which errors matter more.
Proposed framework: Weighted Evidence Accumulation
Step 1 — Normalize each metric to a "regression score" using Cohen's h for proportions:
h_escalation = 2 × arcsin(sqrt(p_canary)) - 2 × arcsin(sqrt(p_baseline))
h_thumbs_down = similar formula
h_error_rate = similar formula
Step 2 — Assign confidence weights based on sample size:
w_i = min(1.0, n_canary / n_min_i) × (1 / σ_i)
Step 3 — Compute a combined regression Evidence Score:
E = Σ(w_i × h_i × severity_i)
Step 4 — Decision thresholds:
- E < 0.3: Continue canary, no action
- 0.3 ≤ E < 0.6: Pause canary progression, alert on-call for review, continue monitoring
- E ≥ 0.6: Auto-rollback + PagerDuty P1
Why this beats independent thresholds: Independent thresholds can all be individually below rollback threshold while the composite signal is clearly regression. Example: escalation +1.8% (just below 2%), thumbs-down +2.5% (just below 3%), error rate +0.4% (just below 0.5%). Each individually looks fine. The combined evidence score at 0.58 would trigger a pause and human review, catching the compound regression that independent gates miss.
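A minimal sketch of the evidence-accumulation steps above; the severity weights, minimum sample sizes, and noise estimates passed in are placeholders that would need the same calibration discussed in the answer:

```python
import math

def cohens_h(p_canary, p_baseline):
    """Effect size for the difference between two proportions (Cohen's h)."""
    return 2 * math.asin(math.sqrt(p_canary)) - 2 * math.asin(math.sqrt(p_baseline))

def evidence_score(metrics):
    """metrics: per-metric dicts with canary/baseline proportions, canary sample size,
    minimum sample size, noise level sigma, and a severity weight."""
    e = 0.0
    for m in metrics:
        h = max(cohens_h(m["p_canary"], m["p_baseline"]), 0.0)  # count regressions only
        w = min(1.0, m["n_canary"] / m["n_min"]) * (1.0 / m["sigma"])
        e += w * h * m["severity"]
    return e

def rollback_action(e):
    if e >= 0.6:
        return "auto-rollback + PagerDuty P1"
    if e >= 0.3:
        return "pause canary, page on-call for review"
    return "continue canary"
```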
Scenario 6 — Guardrail Calibration
What Happened
After adding a new guardrail rule (block responses mentioning specific competing services by name), the block rate jumped from 3.2% to 7.4%. Customer support reported increased "I can't help with that" response complaints. The guardrail was triggering on legitimate Amazon queries that happened to mention competitor names in the user's question.
Metrics Used and Why
| Metric | Before New Rule | After New Rule | Target |
|---|---|---|---|
| Total block rate | 3.2% | 7.4% | < 5% |
| False positive rate (legitimate queries blocked) | 0.8% | 4.1% | < 1% |
| Recall on adversarial set | 94% | 97% | ≥ 90% |
| "Can't help" complaint rate | 0.1% | 0.6% | < 0.2% |
Metric Tradeoff: False Positive vs False Negative Asymmetry
Guardrails have asymmetric error costs that depend on the type of guardrail:
| Guardrail Type | False Positive Cost | False Negative Cost | Error Priority |
|---|---|---|---|
| PII leak prevention | Low (user gets "I can't help") | Critical (real PII exposure) | Minimize false negatives |
| Toxicity/harm filter | Low-medium (blocked benign query) | High (harmful content to user) | Minimize false negatives |
| Competitor mention block | Medium (user with legitimate query gets blocked) | Low (Amazon chatbot mentions Kindle Unlimited competitor name) | Balance. False positives cause real UX damage |
| Hallucinated ASIN block | Low (response withheld, template used instead) | High (user gets wrong product link) | Minimize false negatives |
For the competitor mention guardrail specifically: the cost of a false positive (blocking a user who said "I use Comixology's competitor...") is higher than the cost of a false negative (the chatbot mentions a competitor name once). The guardrail was mis-calibrated toward false negative minimization when this type warranted balancing.
Q&A — Scenario 6: Guardrail Calibration
Easy Questions
Q1: What is a guardrail block rate and why should it be neither too high nor too low?
A: The guardrail block rate is the percentage of LLM responses that are intercepted and replaced (either by a fallback template or an "I can't help" message) because they triggered a safety or quality rule. It should not be too high or too low for opposite reasons: too high (above ~5%) means the guardrails are over-triggering on legitimate queries — users are getting unhelpful "I can't help" responses when they asked reasonable questions, damaging trust and increasing escalations. Too low means the guardrails are under-functioning — the LLM is generating responses that include hallucinated ASINs, PII leaks, toxic content, or competitor mentions that should have been caught. The correct calibration depends on the type of guardrail — PII and toxicity are zero-tolerance (prefer false positives), while content-style guardrails like "don't mention competitors" need careful balance.
Q2: How did you diagnose that the new rule was causing false positives rather than correctly blocking problematic responses?
A: Two signals immediately pointed to false positives: First, the "can't help" complaint rate spiked from 0.1% to 0.6% — users were explicitly complaining that the bot refused to help with what they considered legitimate requests. Second, a manual audit of 50 randomly sampled blocked responses showed that 42 of 50 were triggered by the user's own message mentioning a competitor in a neutral context ("I used Amazon after switching from [competitor]" or "Is this different from the [competitor] version?") — not by the LLM generating competitor-favorable content. The rule was pattern-matching on input text, not just output text.
Medium Questions
Q3: How do you build an adversarial test set for guardrails, and how do you use it to calibrate guardrail thresholds?
A: An adversarial test set for guardrails has two components: adversarial positives (queries that should be blocked) and hard negatives (queries that look like they should be blocked but shouldn't).
For the competitor mention guardrail:
- Adversarial positives (should block): 50 synthetic responses where the LLM says things like "For this type of content, [competitor] actually has better prices" or "[Competitor service] is where you'll find that." These represent the actual harm the guardrail is designed to prevent.
- Hard negatives (should NOT block): 50 queries where users mention competitors contextually: "I'm moving from [competitor] to Amazon, where do I find my purchases?", "Is this the same version as on [competitor site]?", "I noticed [competitor] had a different edition." These test whether the guardrail over-triggers on benign mentions.
Threshold calibration: run both sets through the guardrail at different sensitivity thresholds. Plot the precision-recall tradeoff curve. For this guardrail type, I target: recall on adversarial positives ≥ 90% while false positive rate on hard negatives ≤ 1%. The calibration that achieved this was narrowing the rule from "any mention of [competitor names]" to "LLM output recommending [competitor] over Amazon products."
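A minimal sketch of the threshold sweep, assuming a hypothetical guardrail_score(text) that returns a violation likelihood in [0, 1]:

```python
def calibrate_threshold(guardrail_score, adversarial_positives, hard_negatives,
                        min_recall=0.90, max_fpr=0.01):
    """Sweep sensitivity thresholds; return those meeting both budgets:
    recall on adversarial positives >= min_recall, FPR on hard negatives <= max_fpr."""
    pos_scores = [guardrail_score(t) for t in adversarial_positives]
    neg_scores = [guardrail_score(t) for t in hard_negatives]
    viable = []
    for threshold in [i / 100 for i in range(1, 100)]:
        recall = sum(s >= threshold for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= threshold for s in neg_scores) / len(neg_scores)
        if recall >= min_recall and fpr <= max_fpr:
            viable.append((threshold, recall, fpr))
    return viable  # empty list means no threshold satisfies both budgets
```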
Q4: Should you apply the same severity standard to all guardrail types? Walk through your classification of guardrail types by false-positive vs. false-negative cost.
A: No — different guardrails have fundamentally different error asymmetries. The classification:
Zero false negative tolerance (false negatives are critical, false positives are acceptable): PII leak prevention (leaking a customer's order history or contact information to them or another user), toxicity/harm (generating content that could harm a user), prompt injection execution (the LLM following injected malicious instructions). For these, we accept that some legitimate queries will get blocked. The cost of one slip is catastrophic.
False positives cheap enough to block aggressively (false negatives are the costly error): ASIN validation (blocking the response when a product link might be wrong). Here, a false positive means the user gets a template instead of a personalized response — mildly degrading. A false negative (serving a fabricated ASIN) means the user clicks a bad link, potentially affecting their purchase decision. ASIN validation also has a low false negative rate by construction (we validate ASINs against the catalog), so we err on the conservative side.
Balanced (both errors matter): Competitor mention (as discussed), response length limits, format guardrails. Here, set specific false positive and false negative budgets, run them against the adversarial/hard-negative test set, and find the threshold that stays within both budgets.
The key insight: don't use the same "block rate < 5%" target for all guardrails. PII leak guardrails should have near-zero false negative rate even if block rate is high. Competitor mention guardrails should have near-zero false positive rate, even if it means accepting some false negatives.
Hard Questions
Q5: Your guardrail evaluation uses a static adversarial test set of 50+50 examples. What are the coverage gaps in a 100-example static set, and how would you build a self-improving adversarial evaluation pipeline?
A: Coverage gaps in a 100-example static adversarial set:
1. Distribution mismatch: The 50 adversarial positives are written by the security team in 2 weeks. Adversarial users in production have months to craft novel attacks. Jailbreak patterns evolve — a static set from Month 1 won't cover Month 7 adversarial patterns (e.g., multilingual bypasses, encoded instructions, indirect injection via product descriptions).
2. Systematic blind spots: The people writing the test set think like security professionals, not like the full range of users. Real false positive triggers come from query patterns that were never anticipated (e.g., users quoting competitor reviews they read, users asking to compare products across platforms).
3. Scale insufficiency: 50 adversarial positives cover maybe 5-10 distinct attack patterns. Production adversarial traffic in a popular chatbot can exhibit 50+ distinct patterns within months.
Self-improving pipeline design:
Source 1 — Production adversarial mining: Every blocked response is a candidate adversarial example. Weekly, sample 100 blocked responses and classify them: justified block (true positive) vs. over-trigger (false positive). True positives that represent novel attack patterns are added to the test set. False positives that represent novel hard negatives are added to the hard negative set.
Source 2 — Red team fuzzing: Run an automated red team using a separate LLM (GPT-4o) prompted to generate variations of known attacks: "Generate 20 variations of this injection attempt with different wording." This continuously expands the adversarial positive set.
Source 3 — User feedback signals: "I can't help" complaints are candidates for false positive additions to the hard negative set. Escalations from users who felt the bot was unhelpful (not harmful) often indicate guardrail over-triggering.
Governance: Every test set addition requires human review before inclusion. A fully automated test set would inherit production biases without correction.
Scenario 7 — Hallucination Regression: Product Catalog
What Happened
After a batch product catalog update (ASINs shifted, some series were re-numbered, new editions added), the ASIN validation failure rate spiked from 0.2% to 1.8% — the auto-rollback threshold of 1% was crossed, triggering a switch to template-only responses for the product_question intent for 4 hours while the root cause was investigated.
Metrics Used and Why
| Metric | Normal | During Incident | Threshold |
|---|---|---|---|
| ASIN validation rate | 99.8% | 98.2% | ≥ 99% (auto-block) |
| Price accuracy rate | 99.9% | 99.9% | No change (not affected) |
| Factual grounding score | 0.91 | 0.73 | ≥ 0.85 |
| RAG index freshness | < 1 hour lag | 6-hour lag | < 2 hours |
Metric Tradeoff: Async Factual Grounding vs Real-Time ASIN Validation
These two metrics serve different roles in the hallucination defense:
ASIN Validation Rate is a synchronous, deterministic gate that runs before every response delivery. The LLM generates a response; before it reaches the user, an extraction layer pulls every 10-character ASIN from the response and validates it against the live product catalog API. Invalid ASINs cause the response to be replaced with a template. This catches product hallucinations at the point of delivery with ~100ms latency overhead.
Factual Grounding Score is an asynchronous quality signal computed in a batch pipeline 30-60 minutes after responses are delivered. It uses a natural language inference model to check whether claims in the response are entailed by the retrieved RAG context. It provides a richer signal (detects content-level hallucinations, not just ASIN format errors) but cannot be in the real-time path because NLI inference at 500K requests/day with 200ms latency budget is infeasible.
The tradeoff: ASIN validation blocks bad product links immediately but misses semantic hallucinations (e.g., wrong volume count, wrong author attribution). Factual grounding catches semantic hallucinations but 30-60 minutes after the user saw them. The product catalog incident required both: ASIN validation caught the hard failure (wrong ASINs delivered), and factual grounding's 6-hour lag in the RAG index caused the semantic factual errors that ASIN validation couldn't see.
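A minimal sketch of the synchronous gate, with a simplified ASIN pattern and a hypothetical catalog_lookup client standing in for the live product catalog API:

```python
import re

# Simplified pattern: 10-character ASINs ("B0" + 8 alphanumerics) or 10-digit ISBN-style IDs
ASIN_PATTERN = re.compile(r"\b(?:B0[A-Z0-9]{8}|\d{10})\b")

def gate_response(draft_response, catalog_lookup, fallback_template):
    """Extract every ASIN from the drafted response and validate it against the
    catalog before delivery; serve a template if any ASIN fails validation."""
    asins = ASIN_PATTERN.findall(draft_response)
    if all(catalog_lookup(asin) for asin in asins):
        return draft_response
    return fallback_template
```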
Q&A — Scenario 7: Hallucination
Easy Questions
Q1: What is hallucination in the context of a product recommendation chatbot, and why is it especially harmful here compared to a general-purpose chatbot?
A: In a general-purpose chatbot, hallucination is when the LLM generates plausible-sounding but factually incorrect information. In a product recommendation chatbot, hallucination has direct commercial consequences: if the chatbot recommends "Volume 15 of Berserk (ASIN: B00X12345)" and that ASIN is fabricated or corresponds to a different product, the user clicks through, sees an incorrect product, and either buys the wrong thing or loses trust in the chatbot entirely. Wrong price information could lead to purchase decisions based on false data. Fabricated product editions that don't exist waste the user's time researching products that cannot be ordered. We're not in the domain of "interesting but slightly wrong facts" — we're in the domain of "bad information that affects a commercial transaction."
Q2: Why is a 1% ASIN validation failure rate the auto-rollback threshold? Why not 0.1% or 5%?
A: At 500K daily requests and roughly 40% of responses containing at least one ASIN link, that's 200K ASIN-bearing responses per day. At 1% failure rate, that's 2,000 bad product links per day delivered to customers. Each bad link is a potential wrong purchase, a trust violation, or a customer service contact. Amazon's internal data shows each bad link generates approximately 1-in-50 support contacts — at 2,000 bad links, that's ~40 additional daily human support tickets, which at $15/contact is $600/day in direct cost, plus trust erosion. 0.1% was explored but would cause excessive false alarms from catalog API latency spikes generating validation timeouts (not actual bad ASINs). The 1% threshold was set by measuring the false alarm rate at multiple thresholds during a week of manual monitoring; 1% was the level where genuine catalog quality issues exceeded false alarms from infrastructure noise.
Medium Questions
Q3: You mentioned factual grounding score uses NLI (natural language inference). What exactly is the model checking, and what are the limitations of NLI-based grounding evaluation?
A: NLI determines whether a "hypothesis" is entailed by, contradicted by, or neutral with respect to a "premise." In our grounding pipeline: the premise is the concatenated RAG context chunks retrieved for the query, and the hypothesis is each extractable factual claim from the LLM response.
For example:
- Context: "Berserk Vol. 1 by Kentaro Miura, paperback, $14.99, ISBN 978-1-59307-020-5"
- Claim: "Berserk Volume 1 is available in paperback for $14.99" → NLI: ENTAILED → grounding = 1.0
- Claim: "Berserk Volume 1 is currently on sale for $10.99" → NLI: CONTRADICTED → grounding = 0.0
- Claim: "Berserk is a classic dark fantasy manga" → NLI: NEUTRAL (not in context) → grounding = 0.5
The factual grounding score is the average entailment score across all extracted claims.
Limitations:
1. Claim extraction quality: We extract claims using a heuristic phrase extractor. Complex nested claims ("Berserk, which was published by Dark Horse and later by Viz, is available in 41 volumes") might be extracted poorly, leading to grounding evaluation on partial claims.
2. NLI model domain gap: NLI models trained on general text (SNLI, MultiNLI) may not understand product-specific language well. Numeric facts especially ("$14.99" vs. "$14.99 MSRP" vs. "$14.99 list price") can confuse NLI models.
3. Absence from context ≠ hallucination: The LLM might correctly state "This manga series has won multiple awards" based on its training data, even though this fact is not in the retrieved RAG context. NLI would score this as NEUTRAL (not grounded), but it may be factually correct. Grounding score measures context-fidelity, not absolute factual accuracy.
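A minimal sketch of the claim-level scoring, assuming an off-the-shelf MNLI checkpoint (roberta-large-mnli is used purely as an example; the production model choice isn't specified above):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"  # label order: contradiction, neutral, entailment
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def grounding_score(context, claims):
    """Average per-claim grounding: entailed=1.0, neutral=0.5, contradicted=0.0."""
    scores = []
    for claim in claims:
        inputs = tokenizer(context, claim, return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
        label = int(torch.argmax(probs))
        scores.append({0: 0.0, 1: 0.5, 2: 1.0}[label])
    return sum(scores) / len(scores) if scores else None
```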
Q4: The RAG index had a 6-hour lag during the catalog update. How do you design the RAG indexing pipeline to minimize freshness lag, and what are the tradeoffs of faster indexing?
A: The lag during the catalog update occurred because the indexing pipeline was a batch process triggered once per hour, but the catalog update was large enough (~40K changed ASINs) that the batch took 6+ hours, blocking the next batch from starting. The root cause was that batch size was fixed regardless of update volume.
Solution: Streaming incremental updates with a priority queue: When catalog changes propagate through the SNS topic, a Lambda function listens, identifies changed documents (ASINs with price/metadata changes), and adds them to a priority update queue. An indexing service drains the queue continuously, updating changed vectors in OpenSearch in near real-time (< 5 minute lag for individual changes). For bulk updates (catalog restructuring like the incident), the priority queue processes time-critical items first (popular ASINs, currently-browsed products) while lower-priority items process in the background.
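A minimal sketch of the drain loop behind that design, with a hypothetical upsert_embedding(asin) standing in for the re-embed-and-write step into the search index:

```python
import heapq
import itertools

class PriorityIndexQueue:
    """Catalog-change events are drained highest-priority first
    (e.g., 0 = currently-browsed, 1 = popular ASINs, 2 = bulk backfill)."""

    def __init__(self):
        self._heap = []
        self._tiebreak = itertools.count()  # keeps ordering stable for equal priorities

    def enqueue(self, asin, priority):
        heapq.heappush(self._heap, (priority, next(self._tiebreak), asin))

    def drain(self, upsert_embedding, batch_size=100):
        while self._heap:
            for _ in range(min(batch_size, len(self._heap))):
                _, _, asin = heapq.heappop(self._heap)
                upsert_embedding(asin)  # re-embed the product document and write it
```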
Tradeoffs of faster indexing:
- Cost: Streaming indexing requires continuously running infrastructure vs. a batch job that runs once per hour. OpenSearch write costs increase with update frequency.
- Consistency: If the catalog update is a bulk operation (hundreds of thousands of changes atomically), streaming incremental updates could serve partially-updated content during the transition window — some queries see old prices, some see new prices. The batch approach guaranteed atomic consistency but caused the 6-hour lag. We resolved this with a "soft-lock" mechanism: during a catalog bulk update, responses include a staleness disclaimer ("prices may update shortly") until the index is fully refreshed.
- Index fragmentation: Frequent write operations fragment the HNSW index, degrading search quality over time. We run daily automated index optimization that reorganizes fragmented segments during low-traffic hours (3-5 AM).
Scenario 8 — Multi-Turn Conversation Coherence
What Happened
During the weekly human audit of 100 conversations (a stratified sample including 10 multi-turn sessions), reviewers noticed that by turn 5-7 in long recommendation conversations, the chatbot was occasionally repeating recommendations it had already made in the same session. Context window utilization was fine — the conversation history was present. The issue was that the model was not adequately cross-referencing earlier recommendations.
Metrics Used and Why
| Metric | Before Fix | After Fix |
|---|---|---|
| Repetition rate in multi-turn | 12% of 5+ turn sessions had at least 1 repeat | 3% |
| Multi-turn coherence score | 0.71 (human eval) | 0.86 |
| Topic drift rate | 8% of sessions drifted to off-topic by turn 6 | 4% |
| Turns to resolution | 4.2 avg for recommendation | 3.1 avg |
Why "Turns to Resolution" Was Added Late
Turns to resolution measures how many conversation turns a user needed to get their question fully answered. It was not in the original evaluation framework because single-turn quality was the initial focus. When the audit revealed the repetition issue, turns-to-resolution was added as a compound metric that captures multi-turn efficiency: high turns-to-resolution (needing 5+ turns where 2 would suffice) often signals the model is being unhelpful or repetitive. This metric was added at Month 4. In retrospect, tracking it from Month 1 would have caught the repetition issue earlier.
Q&A — Scenario 8: Multi-Turn Coherence
Easy Questions
Q1: What is topic drift in a multi-turn conversation and why does it matter for a shopping chatbot?
A: Topic drift occurs when a conversation progressively moves away from the user's original intent. In a manga recommendation conversation, drift might look like: User asks for dark fantasy recommendations → 3 turns of recommendations → User asks a clarifying question about one title → Chatbot responds to the clarification but then continues the next response with a completely different genre of recommendations, losing track of the original "dark fantasy" constraint. Topic drift matters for a shopping chatbot because user intent is often refined over multiple turns — "show me more like this" and "but without violence" are refinements, not new conversations. A model that drifts loses the accumulated refinements and forces the user to re-state their preferences, increasing turns-to-resolution and frustrating users.
Medium Questions
Q2: How do you measure multi-turn coherence programmatically without human evaluation for every conversation?
A: We use a three-signal proxy that correlates with human coherence ratings (r=0.72):
Signal 1 — Semantic continuity: Compute embedding cosine similarity between consecutive turns. A high-coherence conversation has adjacent turns with cosine > 0.70. A turn that suddenly drops to cosine < 0.40 with the previous turn is a potential drift or non-sequitur signal.
Signal 2 — Reference resolution accuracy: For multi-turn conversations, users frequently use pronouns and co-references ("that one," "the author of the second one you mentioned"). We extract co-references and check whether the model's response correctly resolves them to the right entity from conversation history. Automated co-reference resolution accuracy (tested against a labeled set of 100 multi-turn conversations) was used as a proxy.
Signal 3 — Repetition rate: For each bot response, we check whether recommended items appear in prior bot turns. This is fully automated — a simple set intersection check on extracted ASINs and titles.
Weekly automated coherence scoring uses these three signals. When the composite score drops below threshold, it triggers a human audit of 20 sampled multi-turn conversations from that week. Human coherence scores are collected, and if human scores confirm the automated signal, an investigation begins.
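Minimal sketches of Signals 1 and 3 — the cosine-continuity check over per-turn embeddings and the set-intersection repetition check over extracted ASINs/titles. Both assume the embedding and extraction steps happen upstream; the function names are illustrative:

```python
import numpy as np

def continuity_scores(turn_embeddings):
    """Cosine similarity between consecutive turns (numpy vectors);
    a drop below ~0.40 flags potential topic drift or a non-sequitur."""
    sims = []
    for prev, curr in zip(turn_embeddings, turn_embeddings[1:]):
        sims.append(float(np.dot(prev, curr) /
                          (np.linalg.norm(prev) * np.linalg.norm(curr))))
    return sims

def repetition_flags(bot_turn_items):
    """bot_turn_items: one set of extracted ASINs/titles per bot turn.
    Flags turns that repeat an item already recommended earlier in the session."""
    seen, flags = set(), []
    for items in bot_turn_items:
        flags.append(bool(items & seen))
        seen |= items
    return flags
```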
Hard Questions
Q3: Your conversation history is compressed when it exceeds the ~1,200 token budget, using the LLM to summarize older turns. What are the failure modes of LLM-based history compression, and how do you evaluate whether the summarization is preserving the right information?
A: LLM-based history compression tells the model to summarize older conversation turns into a compact representation. For the recommendation use case, a 10-turn conversation history might be summarized as: "User is looking for dark fantasy manga, has read Berserk and Vagabond, prefers physical collected editions. Previously recommended: Vinland Saga, Dungeon Meshi. User liked the Dungeon Meshi recommendation."
Failure modes:
1. Personalization loss: The compressor might summarize "User prefers long-running series" but drop specific negative preferences stated earlier ("not interested in anything over 20 volumes") — the dropped constraint causes future recommendations to violate a user preference.
2. Recency bias in summarization: LLMs tend to weight recent information more heavily in summarization. A preference stated at turn 1 ("I only want English translations") is more likely to be dropped in compression than a preference stated at turn 8. Fix: use structured history compression that explicitly preserves a "user_preferences" list distinct from the conversational narrative.
3. Hallucination during summarization: The compressor LLM might "helpfully" infer preferences that the user never stated. "User seems to enjoy shorter series" based on the two recommendations they engaged with, when the user never stated length preferences. These hallucinated preferences then constrain future recommendations incorrectly.
Evaluation of compression quality: We built a 100-conversation compression evaluation set where each conversation has gold-standard compression labels (what information must be preserved, what is safe to drop). Automated evaluation checks: (a) preservation rate of explicit user preferences (must preserve ≥ 95%), (b) preservation of explicit exclusions (must preserve 100% — if user says "not Naruto," that must survive all compression), (c) hallucination rate in compression (added facts not in original history must be ≤ 0.5%). Human reviewers validate a weekly sample of 20 compression outputs against these criteria.
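A minimal sketch of checks (a) and (b) against the gold labels, using naive substring matching for illustration (a production checker would need normalization and paraphrase matching):

```python
def compression_preservation(compressed_text, gold):
    """gold: {'preferences': [...], 'exclusions': [...]} — phrases that must survive
    compression. Targets from the criteria above: preferences >= 95%, exclusions == 100%."""
    text = compressed_text.lower()

    def rate(items):
        return 1.0 if not items else sum(p.lower() in text for p in items) / len(items)

    return {
        "preference_preservation": rate(gold["preferences"]),
        "exclusion_preservation": rate(gold["exclusions"]),
    }
```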
Scenario 9 — Offline-Online Metric Correlation Audit
The Annual Audit
Once per year, the team ran a systematic analysis: for each offline metric tracked in CI evaluation, what is its correlation with the online production metrics we care about? This audit directly shaped metric investment and deprecation decisions.
Correlation Results
| Offline Metric | Online Metric | Measured r | Decision |
|---|---|---|---|
| BERTScore | Resolution rate | r = +0.61 | Keep as primary; highest predictor |
| RAG Recall@3 | Thumbs-up rate | r = +0.55 | Keep; invest more in retrieval quality |
| Per-class F1 | Intent-specific escalation rate | r = −0.72 | Keep; strongest predictor in study |
| BLEU-4 | Thumbs-up rate | r = +0.15 | Demote; retained only for structural regression baseline |
| ROUGE-2 | CSAT | r = +0.22 | Demote; superseded by BERTScore |
| Format compliance | User satisfaction | r = +0.03 | Keep as hard gate (backend requirement), not quality signal |
| Response length (80-150 tokens) | Satisfaction vs all response lengths | Inverted-U relationship | Add optimal range gate |
The Key Finding: RAG Recall > Intent Accuracy for User Satisfaction
One counterintuitive finding: RAG Recall@3 (r=+0.55) was a stronger predictor of thumbs-up rate than intent classification accuracy (r=+0.38). This suggested that getting the right source material matters more than getting the routing exactly right for the user's perception of response quality. A well-routed query with poor retrieval produces a hallucinated or generic response. A slightly misrouted query with excellent retrieval at least gives the user useful information.
Investment implication: shifted 40% of planned classifier improvement sprint capacity to embedding model and reranker improvement instead.
Q&A — Scenario 9: Metric Correlation
Easy Questions
Q1: Why did you measure offline-online correlations rather than just trusting that better offline metrics mean better production quality?
A: Offline metrics measure performance on a curated test set under controlled conditions. Production metrics measure what users actually experience. These can diverge significantly if the test set doesn't represent production distribution, if users interact differently from the curated test scenarios, or if the metric captures something that doesn't actually drive user behavior. BLEU is the canonical example: it measures n-gram overlap, which seems like a proxy for response quality, but our data showed r=0.15 between BLEU and user satisfaction — practically no correlation. Optimizing BLEU would have been wasted engineering effort. The correlation audit revealed which offline metrics are actually diagnostic of production quality and which are measurement theater.
Medium Questions
Q2: The per-class F1 correlation with intent-specific escalation (r=−0.72) was the strongest correlation in your study. What does −0.72 mean practically, and why is the correlation negative?
A: r=−0.72 means that as per-class F1 increases (classifier improves), the escalation rate for that intent class decreases — a strong inverse relationship. It's negative because escalation is a failure signal (higher is worse) while F1 is a quality signal (higher is better). r=−0.72 means approximately 52% of the variance in intent-specific escalation rate is explained by the classifier's F1 score for that intent (r²=0.52). This is the strongest single predictor in our study.
Practically: if recommendation intent F1 drops from 0.91 to 0.82 (a fairly large regression), we can project the impact with the standardized regression slope: Δescalation ≈ r × (ΔF1 / σ_F1) × σ_escalation. Plugging in the historical standard deviations of per-class F1 and intent-level escalation rate, a 0.09 F1 drop projects to an escalation rate increase of approximately 1.5 percentage points on recommendation-intent conversations. This is the value of the correlation — it lets us translate offline F1 changes into projected production metric impacts during code review.
Q3: How do you calculate and validate these correlations? What sample size did you need?
A: Correlation calculation process:
For BERTScore vs. Resolution Rate (r=+0.61): We used weekly aggregates over 24 weeks (roughly six months). Each data point was: (weekly average BERTScore on golden dataset) paired with (weekly resolution rate from production). 24 data points gives a 95% CI of approximately ±0.35 at r=0.61 — fairly wide, but the correlation is strong enough to survive it.
Validation: (1) Temporal holdout — calculated correlation on first 16 weeks, predicted last 8 weeks using the correlation model. RMSE of prediction vs. actual within the expected noise floor. (2) Cross-intent validation — the correlation held across all 8 intent classes separately, not just in aggregate. (3) Direction verification — when we deliberately ran a regression experiment (lower BERTScore prompt) for 2 weeks on a low-traffic intent class, resolution rate moved in the predicted direction. This gave us directional validation beyond statistical correlation.
Sample size limitation: 24 weekly data points is not large. At r=0.61, the 95% CI covers [0.23, 0.83] — wide, but the direction and magnitude are consistent with our mental model and directional experiments. We use the correlations as decision-guiding evidence, not precision quantities.
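A minimal sketch of the correlation-plus-CI computation on the weekly aggregates, using a Fisher z-transform for the interval:

```python
import math
from scipy.stats import pearsonr

def correlation_with_ci(offline_weekly, online_weekly, z_crit=1.96):
    """Pearson r between weekly offline and online aggregates, with an approximate
    95% CI from the Fisher z-transform (wide by construction at n=24)."""
    r, p_value = pearsonr(offline_weekly, online_weekly)
    n = len(offline_weekly)
    z, se = math.atanh(r), 1.0 / math.sqrt(n - 3)
    ci = (math.tanh(z - z_crit * se), math.tanh(z + z_crit * se))
    return r, p_value, ci
```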
Hard Questions
Q4: Format compliance had r=+0.03 with user satisfaction — essentially zero correlation. Yet you kept it as a "hard gate" in evaluation. Isn't that inconsistent? If it doesn't affect user satisfaction, why block deployments for it?
A: This is the right tension to name. The answer is that format compliance serves two distinct purposes, only one of which is user satisfaction:
Purpose 1 — User experience: Format compliance (valid JSON output, correct product card structure) enables the frontend to render product cards, clickable ASIN links, and price displays. A response with invalid JSON outputs raw text instead of a formatted product card — users see a wall of text instead of a clean product display. The correlation between format compliance and user satisfaction is r=+0.03 because our frontend has robust error handling that gracefully degrades invalid responses to plain text. Users don't notice the absence of a product card as often as you'd expect — they still get the information.
Purpose 2 — System correctness: The backend systems that depend on the chatbot output (order management, analytics tracking, A/B test attribution) parse the structured JSON. A malformed JSON response that the frontend handles gracefully can still break downstream analytics, make the conversation un-loggable, or produce gaps in user session attribution. These are invisible to user satisfaction metrics but are real system integrity failures.
This is why format compliance is a hard gate, not a soft quality signal: it has near-zero impact on user-observable quality (hence low r), but breaks backend integration integrity. Hard gates exist for correctness requirements that don't show up in UX metrics. Soft signals guide quality optimization. The two serve different purposes in the evaluation framework.
Scenario 10 — Response Length Regression
What Happened
After upgrading to Claude 3.5, response length inflated from an average of 120 tokens to 195 tokens. After adding explicit token budget instructions, length dropped to 135 tokens. The question: did we recover quality or over-constrain?
The Non-Linear Satisfaction Relationship
| Response Length | Thumbs-Up Rate | n | Notes |
|---|---|---|---|
| < 30 tokens | 41% | 12K sessions | Too brief — misses key info |
| 30-79 tokens | 67% | 48K sessions | OK for simple FAQs |
| 80-150 tokens | 78% | 180K sessions | Sweet spot — covers all use cases |
| 151-250 tokens | 71% | 95K sessions | Slightly long but acceptable |
| > 250 tokens | 52% | 25K sessions | Too long — user satisfaction drops |
Metric Tradeoff: Token Count vs Satisfaction Correlation
Token count is an operational metric (directly impacts cost). Satisfaction is a quality metric. Neither alone tells the full story:
- Optimizing only for short responses (minimize tokens) would push the system toward the < 30 token bucket where thumbs-up is only 41%.
- Optimizing only for thumbs-up would allow token inflation (195 token avg) at 63% higher cost.
- The correct optimization is the satisfaction-weighted token cost: maximize thumbs_up_rate / cost_per_response. At 135 tokens avg, this ratio is 0.76 / $0.0090 = 84.4. At 120 tokens avg: 0.74 / $0.0082 = 90.2. At 195 tokens avg: 0.73 / $0.0131 = 55.7.
This analysis showed that the pre-Claude-3.5 baseline (120 tokens) was actually near-optimal. The constraint we added after the upgrade (target 135 tokens) was slightly suboptimal but within the acceptable range. Reducing from 135 to 120 through tighter prompting would recover a small amount of cost with negligible quality impact.
Q&A — Scenario 10: Response Length
Easy Questions
Q1: Why does response length affect user satisfaction in a non-linear way?
A: Too short means incomplete. A 20-token response to "What are some dark fantasy manga similar to Berserk?" might say "Try Vinland Saga." This is technically an answer but doesn't explain why, doesn't mention format options, and doesn't acknowledge the user's specific reference point of Berserk. Users feel the response was unhelpful. Too long means buried in filler. A 300-token response might recommend one title correctly but surround the recommendation with generic caveats, disclaimers, unnecessary backstory, and restatements of what the user said. Users stop reading at ~200 tokens and miss the relevant content. The sweet spot (80-150 tokens) is long enough to include justification and key details, short enough that users read the entire response.
Medium Questions
Q2: You identified the "satisfaction-weighted token cost" as the optimization target. Walk through exactly how you would use this metric to decide whether to tighten token budget instructions further.
A: The satisfaction-weighted token cost is thumbs_up_rate / cost_per_response. Higher is better — it means more user satisfaction per dollar spent.
Current state: 135-token average, thumbs-up 76%, cost = $0.0090/response → ratio = 84.4. Target state if prompting reduces to 120 tokens: thumbs-up (estimated from historical data at 120 tokens) 74%, cost = $0.0082/response → ratio = 90.2.
Decision: the ratio improves from 84.4 → 90.2 (7% improvement) by reducing from 135 to 120 tokens. This suggests tighter prompting is the right direction. But we need to validate that the 120-token constraint doesn't push quality below the 74% estimate. The approach: A/B test with 10% traffic at more constrained prompting for 2 weeks. Measure actual thumbs-up rate. If thumbs-up at 120 tokens is ≥ 72% (within the CI of our estimate), implement the tighter constraint. If thumbs-up drops to 65%, the constraint is over-aggressive and we stay at 135 tokens.
The metric also separates intent types: recommendation responses have a higher optimal length than FAQ responses. We should set intent-specific token budgets rather than a single global constraint, which would allow each intent to hit its individual optimal.
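A minimal sketch of the ratio comparison using the numbers above:

```python
def satisfaction_weighted_token_cost(thumbs_up_rate, cost_per_response):
    """Higher is better: user satisfaction delivered per dollar spent on generation."""
    return thumbs_up_rate / cost_per_response

configs = {
    "120 tokens (pre-3.5 baseline)":   (0.74, 0.0082),
    "135 tokens (current constraint)": (0.76, 0.0090),
    "195 tokens (unconstrained 3.5)":  (0.73, 0.0131),
}
for name, (rate, cost) in configs.items():
    print(f"{name}: {satisfaction_weighted_token_cost(rate, cost):.1f}")
# 90.2, 84.4, 55.7 — the 120-token configuration wins on satisfaction per dollar
```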
Scenario 11 — Model Tiering Decision: Haiku vs Sonnet
What Happened
After measuring quality vs. latency vs. cost for three deployment options, the team implemented a two-tier routing system: Haiku for simple/structured intents, Sonnet for complex/creative intents. Evaluating this decision required a compound metric.
Decision Metric: Quality-Adjusted Latency
Quality-Adjusted Latency Score = (Response Quality × 20) − (P99 Latency in seconds × 10)
| Option | Quality (0-5) | P99 Latency | Cost/month | Score |
|---|---|---|---|---|
| Sonnet for all | 4.2 | 500ms | $143K | 4.2×20 − 0.5×10 = 79 |
| Haiku for simple, Sonnet for complex | 3.8 / 4.2 blended ≈ 3.9 | 100ms / 500ms blended | $95K | 3.9×20 − 0.2×10 = 76 |
| Haiku for all | 3.6 | 100ms | $48K | 3.6×20 − 0.1×10 = 71 |
Conclusion: The tiered approach (Score=76) is only 4% below Sonnet-for-all (Score=79) while saving $48K/month. This was the data behind the model tiering decision.
Why a Compound Metric Was Necessary
Single-axis decisions fail here:
- Choosing only by quality → Sonnet for all (ignores $48K/month savings)
- Choosing only by cost → Haiku for all (drops quality to 3.6/5.0 on complex intents)
- Choosing only by latency → Haiku for all (100ms vs 500ms, but 3.6 quality on recommendation is unacceptable)
The compound metric captures the tradeoff space. It explicitly encodes the team's value judgments: quality is worth 20 points per unit improvement; 1 second of latency costs 10 points. These weights were calibrated against user research showing that a 1-second latency increase reduced satisfaction by approximately the same amount as a 0.5-point quality score reduction on our rubric. The compound metric is only as valid as the weight calibration.
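A minimal sketch of the score as defined above, reproducing the three options from the table:

```python
def quality_adjusted_latency(quality, p99_latency_s,
                             quality_weight=20.0, latency_weight=10.0):
    """Quality is worth quality_weight points per rubric unit; each second of
    P99 latency costs latency_weight points (weights from the calibration above)."""
    return quality * quality_weight - p99_latency_s * latency_weight

options = {
    "Sonnet for all":                       (4.2, 0.5),
    "Haiku for simple, Sonnet for complex": (3.9, 0.2),
    "Haiku for all":                        (3.6, 0.1),
}
for name, (quality, latency) in options.items():
    print(name, quality_adjusted_latency(quality, latency))  # 79.0, 76.0, 71.0
```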
Q&A — Scenario 11: Model Tiering
Easy Questions
Q1: What is the model tiering system and which intents use which model?
A: Model tiering routes different intent types to different LLMs based on the complexity and stakes of the response. Simple, structured intents where responses are mostly templated or require minimal creativity (order tracking, policy FAQ) are routed to Claude 3 Haiku, which responds in ~100ms and costs approximately one-third of Sonnet. Complex intents where response quality heavily depends on reasoning and creativity (recommendations, product comparisons, multi-turn conversations) are routed to Claude 3.5 Sonnet, which has higher quality (4.2/5.0 vs 3.6/5.0 for complex queries) at ~500ms latency. The routing decision is made by the intent classifier before the LLM call. Simple fast-path rules (single-word greetings, chitchat) bypass both models and return templated responses in < 10ms.
Medium Questions
Q2: The quality weights in your compound metric (quality × 20, latency × 10) encode a value judgment about the relative importance of quality vs latency. How did you calibrate these weights?
A: The weights were calibrated from two sources:
First, user research data: an internal user study (n=200) presented users with pairs of responses — one at Haiku latency (100ms) with Haiku quality, one at Sonnet latency (500ms) with Sonnet quality — and asked which they preferred. The preference function showed that users were willing to wait up to 400ms more for a quality improvement of 0.5 points on our 5-point rubric. This translates to: 400ms latency ≈ 0.5 quality points → 1 quality point ≈ 800ms latency.
Second, mapping to the score formula: Latency coefficient = 10 points per second. Quality coefficient = 20 points per unit. So 1 quality unit = 20 points = 2 seconds of latency = 2,000ms. But user research said 1 quality unit ≈ 800ms of latency. This means our formula is conservatively latency-tolerant — it is willing to trade more latency for quality than users actually would. Recalibrating so that 1 quality unit equals 800ms while keeping the latency coefficient at 10 points per second gives a quality coefficient of 8 (0.8 × 10). Under the recalibrated formula (quality × 8 − latency × 10), the tiered option (score 29.2) still beats Haiku-for-all (27.8) and edges slightly ahead of Sonnet-for-all (28.6). The decision holds at either calibration.
Q3: How do you handle the case where a query at 1% canary has both a worse quality score AND a better latency? How do you determine if the canary version is better or worse overall?
A: Use the compound metric at the canary level. Compute the quality-adjusted latency score for the canary model vs. the baseline using canary-period measurements. If the canary's compound score is within 3% of the baseline, treat it as a wash (no rollback or promotion signal from the compound metric alone; let other metrics decide). If the canary compound score is >5% below baseline, flag for review. If >10% below, auto-rollback.
The 3% tolerance band exists because quality scores at 1% traffic are estimated from human evaluation of a small sample (typically 50-100 canary responses) — the measurement noise is too high for smaller differences to be meaningful. Latency is more precisely measured and contributes more reliably to the compound score at small sample sizes. In practice, for canary decisions we use compound score as a tiebreaker when single-axis metrics disagree — if escalation rate and thumbs-down are both below rollback threshold but compound score is 8% below baseline, that's a signal worth a manual human review before proceeding to Stage 2.
Hard Questions
Q4: Your quality-adjusted latency metric treats quality as a single number (e.g., 4.2/5.0). But quality varies by intent — Haiku scores 3.8 on simple intents and 3.2 on complex intents. How does this per-intent quality variation affect the tiering decision, and what is the correct way to compute the compound metric for a tiered system?
A: A single blended quality score loses the information that matters most about the tiering decision. The correct computation is traffic-weighted per-intent quality:
Effective_Quality_Tiered = Σ(traffic_share_i × quality_tiered_i)
Where quality_tiered_i is the quality of the selected model for intent i.
With actual numbers:
| Intent | Traffic % | Haiku Q | Sonnet Q | Tiered Model | Tiered Q |
|---|---|---|---|---|---|
| recommendation | 35% | 3.2 | 4.2 | Sonnet | 4.2 |
| product_question | 20% | 3.5 | 4.1 | Sonnet | 4.1 |
| order_tracking | 18% | 4.0 | 4.0 | Haiku | 4.0 |
| faq_policy | 16% | 3.9 | 4.0 | Haiku | 3.9 |
| return_request | 6% | 3.3 | 4.1 | Sonnet | 4.1 |
| other | 5% | 3.7 | 3.8 | Haiku | 3.7 |
Weighted Tiered Quality = (0.35×4.2) + (0.20×4.1) + (0.18×4.0) + (0.16×3.9) + (0.06×4.1) + (0.05×3.7) = 1.47 + 0.82 + 0.72 + 0.62 + 0.25 + 0.19 = 4.07
Compare to Sonnet-for-all: (0.35×4.2) + (0.20×4.1) + (0.18×4.0) + (0.16×4.0) + (0.06×4.1) + (0.05×3.8) = 1.47 + 0.82 + 0.72 + 0.64 + 0.25 + 0.19 ≈ 4.09 — essentially identical to the tiered 4.07.
This reveals a key insight: sending order_tracking to Haiku does not degrade its quality (4.0 either way), and order_tracking is 18% of traffic. By routing the high-frequency, quality-neutral intents to Haiku, the tiered system achieves the same effective quality as Sonnet-for-all at $48K lower monthly cost. The compound metric justification becomes: same quality, lower cost, better latency on 40% of queries — a Pareto improvement for Haiku-eligible intents.
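A minimal sketch of the traffic-weighted computation from the table above:

```python
# Traffic shares and per-intent rubric scores from the table above
intents = {
    "recommendation":   {"share": 0.35, "haiku": 3.2, "sonnet": 4.2, "routed": "sonnet"},
    "product_question": {"share": 0.20, "haiku": 3.5, "sonnet": 4.1, "routed": "sonnet"},
    "order_tracking":   {"share": 0.18, "haiku": 4.0, "sonnet": 4.0, "routed": "haiku"},
    "faq_policy":       {"share": 0.16, "haiku": 3.9, "sonnet": 4.0, "routed": "haiku"},
    "return_request":   {"share": 0.06, "haiku": 3.3, "sonnet": 4.1, "routed": "sonnet"},
    "other":            {"share": 0.05, "haiku": 3.7, "sonnet": 3.8, "routed": "haiku"},
}

tiered_quality = sum(v["share"] * v[v["routed"]] for v in intents.values())
sonnet_for_all = sum(v["share"] * v["sonnet"] for v in intents.values())
print(tiered_quality, sonnet_for_all)  # ~4.07 vs ~4.09 — essentially identical
```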
Summary: Metric Decision Principles
When to Use Each Metric
| Decision | Primary Metric | Why |
|---|---|---|
| Prompt quality change | BERTScore | Captures paraphrase quality; BLEU falsely penalizes valid rewording |
| Structural format regression | ROUGE-L delta | Detects format changes without word-level penalization |
| Model version comparison | Shadow mode BERTScore delta + length distribution | Real traffic, no user impact |
| Classifier comparison | Macro F1 + per-class AUC-PR | Treats all classes fairly; AUC-PR honest about rare classes |
| Retrieval quality | Recall@3 × Precision@3 (effective quality) | Both coverage and noise matter |
| RAG ranking optimization | MRR for speed; NDCG@K for quality optimization | Different purposes |
| Live deployment safety | Escalation rate + thumbs-down with statistical significance gates | User-facing signals with sample size controls |
| Guardrail calibration | False positive rate + recall on adversarial set | Asymmetric cost function by guardrail type |
| Hallucination prevention | ASIN validation (real-time gate) + factual grounding (async signal) | Complementary coverage |
| Multi-turn quality | Turns-to-resolution + coherence score | Proxy signals with weekly human audit validation |
| Cost-quality tradeoff | Satisfaction-weighted token cost | Compound metric that encodes value tradeoffs |
| Model routing decision | Traffic-weighted quality-adjusted latency | Intent-specific quality drives routing value |
The Three Meta-Principles
1. Measure correlations, not conventions: Before investing in improving an offline metric, verify it correlates with an online metric you care about (r > 0.4 is the floor for actionability). BLEU is r=0.15 with satisfaction — not actionable. Hours spent optimizing BLEU are wasted.
2. Error types have asymmetric costs: Different metrics and guardrails have different costs for false positives vs. false negatives. Calibrate thresholds to the asymmetry, not to a single accuracy number. A PII guardrail and a response-length guardrail need different calibration philosophies.
3. The evaluation framework must grow with the system: Don't build the full 4-layer framework on Day 1. Start with manual review. Add automation as pain points emerge. Add shadow mode when model changes become frequent. Add canary when model changes carry real risk. The framework should be as mature as the system demands, not more.
Related Documents: - 06-model-evaluation-framework.md — The 4-layer evaluation architecture - 04-ml-metrics-taxonomy.md — Full classification and retrieval metrics reference - 05-llm-metrics-taxonomy.md — Full LLM metrics taxonomy - 03-tradeoffs-decisions.md — Decision frameworks for inference tradeoffs - Challenges/real-world-challenges.md — Production challenges context