05. ML/AI Engineer Grill Chains — Cost-Optimization Offline Testing
This file contains 8 grill chains, one per cost-optimization user story, framed for an Amazon ML/AI Engineer loop. The lens here is on model behavior, dataset design, statistical rigor, calibration, distribution shift, and the cost-quality tradeoff curve. The MLOps lens (telemetry plumbing, deployment infra, runbooks) lives in 06-mlops-engineer-grill-chains.md.
How To Use This File
Each scenario follows the format established in Domain1-FM-Integration-Data-Compliance/Skill-1.1.1-Comprehensive-Architectural-Design/scenarios/DEEP-DIVE-GRILLING-SESSION.md:
1. Opening question + answer
2. Round 1 — Surface (the easy follow-up)
3. Round 2 — Push Harder (force a tradeoff)
4. Round 3 — Squeeze (force quantitative rigor)
5. Round 4 — Corner (find the failure mode)
6. Architect Escalation A1, A2, A3 (staff/principal-level system thinking)
7. Intuition Gained (the mental model)
Work through each as a live interview. Read the scenario deep-dive in 04-scenario-deep-dives-per-cost-story.md first if you want context.
Scenario US-01 (ML/AI Engineer Lens) — LLM Token Cost Optimization
Opening Question
Q: You're shipping a four-technique LLM-cost optimization (template router, semantic cache, model tiering, prompt compression). Before any of it touches production, what does the offline evaluation look like, and what's the one thing you'd refuse to ship without?
Round 1 Answer: Offline evaluation is a paired cost-and-quality counterfactual on a 5K replay slice plus a 1.5K labeled decision-equivalence set. The one thing I refuse to ship without is per-intent slicing on the quality side. Aggregate CSAT or LLM-judge agreement is meaningless here — the optimizations are gates that fire on specific intents, and the failure mode is "cheap wrong answer on a low-volume but high-value intent." If I can't slice quality by intent and confidence band, I'm shipping blind.
Round 1 — Surface
Follow-up: You said "LLM-as-judge agreement." Which judge model, and how do you trust it?
The judge is a held-out, larger model (e.g., Claude Opus or a fine-tuned evaluator) running with a constrained rubric — not a freeform "is this good?" prompt. Trust is established by:
- Calibrating the judge against 200 human-labeled samples first (target: judge-human Cohen's κ ≥ 0.7).
- Running blind: the judge sees responses A and B without knowing which came from baseline vs. optimized.
- Position-randomizing every comparison (judges have a documented order bias).
- Re-calibrating quarterly because the judge model itself drifts.
Without those four steps, "judge agreement" is a metric that looks rigorous and isn't.
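A minimal sketch of the two trust mechanisms — κ calibration against the human-labeled set and blind, position-randomized pairwise judging. The judge_call hook is a hypothetical wrapper around the constrained-rubric judge prompt, not a real API.

```python
import random
from sklearn.metrics import cohen_kappa_score

def judge_is_calibrated(judge_labels, human_labels, kappa_floor=0.7):
    """Gate: judge-vs-human agreement on the ~200 human-labeled samples
    must clear the kappa floor before the judge is trusted."""
    return cohen_kappa_score(judge_labels, human_labels) >= kappa_floor

def judge_pair(judge_call, query, resp_baseline, resp_optimized):
    """Blind, position-randomized pairwise judgment.

    judge_call(query, resp_a, resp_b) -> "A", "B", or "tie" is a hypothetical
    hook; it never learns which arm produced which response.
    """
    flipped = random.random() < 0.5          # randomize position to cancel order bias
    a, b = (resp_optimized, resp_baseline) if flipped else (resp_baseline, resp_optimized)
    verdict = judge_call(query, a, b)
    if verdict == "tie":
        return "tie"
    chose_a = verdict == "A"
    if flipped:
        return "optimized" if chose_a else "baseline"
    return "baseline" if chose_a else "optimized"
```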
Round 2 — Push Harder
Follow-up: Your prompt compressor cuts input tokens by 40%. The judge says quality is the same. The PM is happy. What's the failure mode you're still worried about?
Three:
- Compression-induced hallucination on long-context queries. Compression strips older turns. If the user references something from turn 1 in turn 8, compression may have removed the antecedent. The judge — comparing single-turn responses — won't catch this. I need a multi-turn coherence slice, not just per-turn judging.
- Compression interacting with retrieval. If RAG chunks are deduped aggressively, the compressed prompt may lose the chunk that contained the actually-cited fact. The judge sees a fluent answer; the citation is wrong. I need a factuality-against-source check on RAG-grounded responses.
- Compression changing model behavior on edge formatting. Stripping whitespace and structure from the prompt sometimes nudges the model toward less-structured outputs (markdown drops, list formatting changes). User-facing UI breaks. I need a format-compliance check downstream of compression.
The judge catches semantic regression. It misses hallucination, citation accuracy, and format. Three orthogonal checks, three different test sets.
Round 3 — Squeeze
Follow-up: You have a 5K replay slice. You measure semantic-cache wrong-answer rate at 0.3% — under your 0.5% ceiling. That's 15 wrong answers in 5,000. Production is 1M messages/day. What's the production wrong-answer count, and how do you reason about whether it's acceptable?
15/5,000 = 0.3%. Extrapolated naively: 0.3% × 1M = 3,000 wrong answers/day. Three reasons that's the wrong number to extrapolate:
- Cache hit rate is 15-25%, not 100%. Wrong-answer rate is conditional on a cache hit. Production wrong-answer count = 1M × cache_hit_rate × wrong_answer_given_hit ≈ 1M × 0.20 × 0.003 = 600 wrong answers/day.
- Distribution mismatch. The 5K replay slice is stratified; production is not. If the wrong answers cluster on a single high-volume intent (say, FAQ), the production count on that intent is much worse. I'd estimate per-intent.
- Severity distribution. "Wrong answer" is binary in the metric, but consequences aren't. A wrong recommendation is annoying; a wrong return policy is a refund-eligibility dispute. I'd weight wrong-answer rate by intent severity before deciding acceptability.
So the right framing is: 600 wrong answers/day, weighted by severity. If 80% of those are FAQ wrong-but-not-harmful, acceptable. If 5% are return-policy wrong, not acceptable — even though the headline rate cleared the 0.5% ceiling. Aggregate rates are decision-hostile; severity-weighted per-intent rates are decision-friendly.
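A back-of-envelope version of that severity-weighted framing. The intent shares and per-wrong-answer dollar costs below are illustrative assumptions, not measured values.

```python
# Illustrative severity-weighted wrong-answer estimate.
DAILY_MESSAGES = 1_000_000
CACHE_HIT_RATE = 0.20          # wrong answers only occur on cache hits
WRONG_GIVEN_HIT = 0.003        # 0.3% measured on the replay slice

wrong_per_day = DAILY_MESSAGES * CACHE_HIT_RATE * WRONG_GIVEN_HIT   # ~600/day

intent_profile = {
    # intent: (assumed share of wrong answers, assumed cost per wrong answer, USD)
    "faq":            (0.80, 0.05),
    "recommendation": (0.15, 0.10),
    "return_request": (0.05, 50.00),
}
expected_daily_loss = sum(
    wrong_per_day * share * cost for share, cost in intent_profile.values()
)
print(f"{wrong_per_day:.0f} wrong answers/day, ~${expected_daily_loss:,.0f}/day severity-weighted loss")
```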
Round 4 — Corner
Follow-up: Confidence-band slicing on the template router — you said agreement was 71% in the 0.6–0.8 band. The ML team says "raise the confidence floor to 0.85 and you're safe." What's wrong with that recommendation?
Three things:
- Raising the floor changes the bypass rate. If 30% of traffic was being bypassed at confidence ≥ 0.6, and 12% of that was in the 0.6–0.8 band, raising to 0.85 drops bypass rate to ~22%. The cost lever is now under-firing. I need to re-validate the cost target after the threshold change.
- Confidence calibration drifts. The 0.6–0.8 band today might be the 0.5–0.7 band in three months as the upstream classifier (US-02) is retrained. The 0.85 threshold isn't a fixed safety boundary; it's a calibrated point on a moving curve. I need a periodic recalibration test, not a one-time threshold pick.
- The fix doesn't address the root cause. The 71% agreement in the 0.6–0.8 band tells us the classifier is uncertain on those queries — not that the template-router is wrong. The right fix may be improving the classifier's calibration in that band (US-02 work), not raising the template threshold (US-01 work). Cost-savings cross-story handoff: this might be a US-02 problem masquerading as a US-01 threshold tune.
So the answer is "raise the threshold and file a calibration ticket against US-02 and schedule a quarterly recalibration." Single-knob fixes are usually wrong.
Architect-Level Escalation
A1: Design an evaluation framework that scales to 8 cost knobs (one per US story), where each knob can be flipped independently. What does the offline harness look like as a system?
The framework is a factorial cost-quality test grid, not 8 independent test suites. Properties:
- Single shared replay dataset (5K stratified sessions) with all upstream context recorded once. Each test run replays the same dataset; this is the experimental control.
- Pipeline composition is configurable: each cost knob is a feature flag in the pipeline definition. The harness can run any combination (knob A only, A+B, all 8) by reading a config matrix.
- Each run emits the same telemetry schema: a (cost_metric, value, intent_slice, lever_engagement) tuple per query, written to a single Parquet output. Every test run goes into the same lake.
- Decision gates are SQL queries over the output table, not bespoke Python per scenario. "Did US-01 cost go down ≥ 30%? Did US-01 quality stay within ±2pts on every intent?" — both are SELECT statements.
- A factorial cross-check: run combinations to detect lever interaction. If knob-A alone saves 30%, knob-B alone saves 20%, knob-A+B should save ~44% (multiplicative on what each affects). If A+B only saves 35%, there's interference — they're competing on the same fraction of traffic.
The systems insight: 8 cost optimizations are not 8 independent experiments; they are a shared experimental infrastructure with 8 hypothesis tests on top. Building 8 test harnesses creates 8 sources of inconsistency.
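A sketch of what the shared harness could look like: one replay loop over the knob configuration matrix, one shared results table, and decision gates as plain SQL. The knob names, the run_replay hook, and the results-table columns are assumptions for illustration.

```python
import itertools

# Hypothetical knob names; in practice each maps to one US-story feature flag.
KNOBS = ["template_router", "semantic_cache", "model_tiering", "prompt_compression"]

def run_matrix(run_replay):
    """Replay the shared 5K dataset under every knob combination.

    run_replay(flags) is an assumed hook that executes the pipeline with the
    given flags and writes one row per query into the shared results table:
    (run_id, knob_config, intent_slice, cost_usd, quality_score, lever_engaged).
    """
    for bits in itertools.product([False, True], repeat=len(KNOBS)):
        run_replay(dict(zip(KNOBS, bits)))

# Decision gates are plain SQL over the shared output table, e.g.:
DECISION_GATE_SQL = """
SELECT intent_slice,
       AVG(cost_usd)      AS avg_cost,
       AVG(quality_score) AS avg_quality
FROM   replay_results
WHERE  knob_config = :candidate_config
GROUP  BY intent_slice
"""
```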
A2: What's the minimum sample size for the cost-quality A/B such that you can claim a 30% cost reduction with 95% confidence and detect a 2-point quality regression with 80% power?
Two power calculations, two different sample sizes:
For cost (continuous metric, $/session):
- Effect size of 30% relative is large. Even with high session-to-session cost variance (CoV ~0.8 on chatbot traffic), n ≈ 200 gets you a 95% CI within ±10% relative.
- With n = 5,000, the cost claim is very tight (CI within ±2%).
For quality (binary metric, agreement rate, baseline 95%):
- Detecting a drop from 95% to 93% (2pt) at 80% power, two-proportion test: n ≈ 1,800 per arm minimum.
- Per-intent slicing: if the smallest intent slice we care about (e.g., return_request) is 8% of traffic, that slice needs 1,800/0.08 = 22,500 total samples to have 1,800 in the slice.
Conclusion: the quality side drives the sample-size requirement, not the cost side. The 5K replay is fine for aggregate cost claims but undersized for per-intent quality claims on tail intents. Either I sample tail intents up (oversample by 5x in the dataset) or I expand to 22K.
This is the most-skipped step in cost-optimization evaluation: people size for the cost claim and assume the quality side comes for free.
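A minimal sketch of the quality-side power calculation and the tail-slice adjustment, using the standard pooled-variance two-proportion formula; the 8% slice share is the assumption carried over from above.

```python
from scipy.stats import norm

def n_per_arm_two_prop(p_base, p_alt, alpha=0.05, power=0.80, two_sided=False):
    """Standard two-proportion z-test sample size (pooled-variance approximation)."""
    z_a = norm.ppf(1 - alpha / (2 if two_sided else 1))
    z_b = norm.ppf(power)
    p_bar = (p_base + p_alt) / 2
    num = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
           + z_b * (p_base * (1 - p_base) + p_alt * (1 - p_alt)) ** 0.5) ** 2
    return num / (p_base - p_alt) ** 2

n_quality = n_per_arm_two_prop(0.95, 0.93)         # ~1.7-1.8K per arm, one-sided
SMALLEST_SLICE_SHARE = 0.08                          # return_request share of traffic
total_without_oversampling = n_quality / SMALLEST_SLICE_SHARE   # ~22K
print(f"{n_quality:.0f} per arm in-slice -> {total_without_oversampling:.0f} total")
```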
A3: When do you stop optimizing? You've shipped four techniques in US-01 and saved 50%. Someone proposes a fifth (Bedrock native prompt caching). How do you decide whether to add it?
The right framing is marginal cost saved per marginal engineering risk. The first techniques are easy because they target obvious inefficiencies. The fifth technique competes for the same fraction of traffic the first four already optimized:
- Native prompt caching saves on the input-token side, but compression already cut input tokens 40% and template + cache eliminated 45% of calls outright. The remaining traffic is the hard 55% that still goes to LLM. Native caching saves ~20% on that 55%, ~11pts on the original total.
- 11pts of savings is meaningful — but each new technique has compounding interaction risk. Native caching's TTL semantics interact with the semantic cache's TTL; the cache-hit accounting now has two layers; failure modes are harder to reason about.
- I stop when the expected marginal savings × probability of clean ship falls below the engineering cost × cost of new failure mode.
For US-01 specifically: I'd ship native prompt caching, but only after the four current techniques have run for 6 months and I have production data on their real interaction. Layering five new things at once is how you get the "expensive and same" failure mode (file 03) — they all canceled out and you don't know why.
The honest answer to "when do you stop" is "when the next optimization would change the failure-debugging time more than it changes the cost." That's the real ROI break-even.
Intuition Gained — US-01
The core insight: Token cost is not one number; it is a distribution sliced by intent, by confidence, by user tier, by query shape. Every optimization moves part of the distribution; an aggregate metric will hide the part it broke.
Mental model to carry forward:
"Every cost lever has a head class where it works and a tail class where it fails. Find the tail class. Slice your eval by it. Don't ship until that slice is bounded."
The hidden failure mode: Confidence-band slicing is the most-overlooked dimension. Routing decisions made in the 0.6–0.8 confidence range are where most "cheap and wrong" failures hide. Most teams slice by intent class but not by confidence band; this misses the most common failure surface.
One-line rule: Pair every cost metric with a quality metric and slice both by the dimension where the optimization changes behavior. If you can't name that dimension, you haven't designed the eval.
Scenario US-02 (ML/AI Engineer Lens) — Intent Classifier Cost Optimization
Opening Question
Q: You're replacing 65% of SageMaker intent-classification calls with an Aho-Corasick rule pre-filter. From an ML/AI perspective, what's the offline evaluation, and what's the non-obvious failure mode you're hunting?
Round 1 Answer: The eval is a per-rule decision-equivalence test against the SageMaker model on ≥1K stratified samples per rule, with promotion only at ≥95% agreement. The non-obvious failure mode is rule pre-filter changes the distribution the ML model sees in production. Once rules catch all the easy cases, SageMaker only sees the hard residual. Its accuracy on that residual is not the same as its accuracy on the full distribution — usually it's worse, because the easy cases were what the model was confident on. The remaining model now looks degraded even though it didn't change. I need to baseline the residual-distribution accuracy explicitly, not extrapolate from the all-traffic accuracy.
Round 1 — Surface
Follow-up: How do you build the per-rule labeled dataset?
Three sources, weighted:
- Production query log sample (60%) — stratified by intent, locale, message length. These are the queries the rule will actually see.
- Hard-negative miner (25%) — queries that look like they should match the rule but shouldn't. For a rule targeting order_tracking keyed on "order," I mine queries containing "order" that the SageMaker model classified as something else. These are the rule's enemies.
- Adversarial DS-curated (15%) — queries the data scientist wrote to test specific edge cases (multilingual, slang, intentional ambiguity).
For each, the ground truth label is the SageMaker prediction (not human) — because the question we're asking is "does the rule agree with the model?" Not "is the rule absolutely right?" If the model is wrong, that's a different problem (model retraining), and conflating the two corrupts the rule-promotion signal.
The dataset is per-rule, refreshed weekly, and 1K samples is a floor — for high-traffic rules I want 5K.
Round 2 — Push Harder
Follow-up: Your rule for order_tracking agrees with SageMaker at 96% — clears the 95% promotion floor. But you mentioned "tail intent false positives." Walk me through the calculation.
96% agreement = 4% disagreement. On 1K samples, that's 40 disagreements. The question is: where do those 40 land?
If the 40 disagreements are evenly distributed across the other 7 intent classes, each class loses ~5–6 samples. That's noise.
If 35 of 40 are concentrated in return_request (because "I want to order a return" looks like an order_* rule match), that's a structural failure. The rule has a known systematic bias against a tail intent that is downstream-cost-sensitive.
So 96% headline agreement isn't enough. I need:
- Per-class confusion matrix on the disagreements.
- Per-class false-positive rate from the rule.
- Hard floor: zero false positives on safety-critical classes (return_request, escalation).
If the rule has even one false positive on escalation in 1K samples, it doesn't ship — I'd rather pay the $0.0001 of SageMaker inference than miss a frustrated user. Cost optimization stops at safety boundaries.
Round 3 — Squeeze
Follow-up: You need to scale-to-zero off-peak. SageMaker serverless cold-start is 30–60s. Your SLO is 3s. What does the offline test look like?
The 30–60s number is for an empty serverless endpoint receiving its first request after hours. The 3s SLO is what the user will tolerate. The bridge is provisioned concurrency floor of 1, kept warm during predictable peak windows (8am–11am JST and 6pm–10pm JST).
The offline test:
- Cold-start baseline measurement: shut down the endpoint, wait 1 hour, send a single request, measure end-to-end latency. Repeat 30 times across days. Get a distribution.
- Warm-up curve: with provisioned concurrency = 1, send a single request, measure. Then 10 RPS, 50 RPS. The curve tells you how many warm instances you need for what RPS.
- Burst stress test: synthetic 0 → 50 RPS in 30 seconds, with provisioned = 1. Measure p50, p95, p99 first-response latency. Does the serverless fallback kick in fast enough? p99 should be ≤ 3s.
- Cost sanity check: provisioned concurrency = 1 for 16 hours/day costs ~$50/mo. Compare against the inference savings. If saving $300/mo on inference and spending $50/mo on warm capacity, net is +$250 — good. If saving $80/mo on inference and spending $50/mo, marginal — reconsider.
The offline test is not "run it cold once and see." It's a distribution + an explicit cost-of-warmth tradeoff calculation.
Round 4 — Corner
Follow-up: After 3 months, monitoring shows rule classification rate dropped from 65% to 51%. Cost savings are eroding. What happened, and what does your offline harness do about it?
Three plausible causes; I'd rule them out in order:
- Distribution shift in user queries. The rules were tuned to query patterns from 3 months ago. New manga releases (Chainsaw Man arc), new promotions, new features all introduce queries the rules don't match. The fix is rule freshening from the current query log, on a quarterly cadence.
- Upstream change in input pipeline. Maybe a frontend UI change rephrased "Track my order" to "Where's my package?" — the rule pattern was on "track my order." The rules became stale because the inputs changed, not because the model did.
- The rule was never as good as it looked. A new evaluation slice (e.g., a Spanish-language locale that grew) is being misrouted. The original 65% was on en-US; the residual was always different.
The offline harness needs:
- Drift detection on rule fire rate by intent and by locale. If the order_tracking rule fire rate drops 10% in en-US but is stable in ja-JP, the root cause is in the en-US input distribution.
- Quarterly rule re-validation against fresh production samples. Same eval as initial promotion; same ≥95% agreement floor; same per-tail-class FP=0 floor. Rules expire and must be re-promoted.
- A "candidate rules" pipeline that mines new patterns from recent queries the model classified with ≥0.95 confidence, proposing them as new rule candidates for DS review.
Cost optimization decay is normal. Building a re-validation cadence into the offline harness is what makes the optimization durable.
Architect-Level Escalation
A1: Imagine you have 50 rules, each with its own decision-equivalence test. How do you maintain rule-level statistics over time without a manual review of each one?
Each rule emits a daily metric tuple: (rule_id, fire_rate, agreement_with_ml, false_positive_per_class, sample_volume). These go into a time-series store. The system runs three automated checks nightly:
- Trend-break detection (CUSUM or similar) on each rule's agreement_with_ml. If it crosses below 95% for 3 consecutive days, the rule auto-disables and a ticket is filed.
- Coverage drift detection: compare today's fire_rate against the 90-day rolling median. A ±20% delta files a ticket.
- New failure mode detection: any rule that newly produces a false positive on a safety-critical class (return_request, escalation) auto-disables immediately, regardless of statistics.
The principle: rules are configuration, not code, and configuration deserves observability. A 50-rule system where rules are managed manually is a system with 50 latent bugs waiting to surface.
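A sketch of the nightly check over the per-rule metric tuples. The column names and the exact thresholds follow the description above; treat them as assumptions rather than a fixed schema.

```python
import pandas as pd

AGREEMENT_FLOOR = 0.95
SAFETY_CLASSES = {"return_request", "escalation"}

def nightly_rule_checks(metrics: pd.DataFrame):
    """metrics: one row per (rule_id, date) with columns agreement_with_ml,
    fire_rate, fp_by_class (dict per class), sample_volume.
    Returns (rules to auto-disable, rules to ticket)."""
    disable, ticket = set(), set()
    for rule_id, g in metrics.sort_values("date").groupby("rule_id"):
        # 1. Trend break: agreement below the floor for 3 consecutive days.
        last3 = g["agreement_with_ml"].tail(3)
        if len(last3) == 3 and (last3 < AGREEMENT_FLOOR).all():
            disable.add(rule_id)
        # 2. Coverage drift: today's fire rate vs the 90-day rolling median.
        median_90d = g["fire_rate"].tail(90).median()
        today = g["fire_rate"].iloc[-1]
        if median_90d and abs(today - median_90d) / median_90d > 0.20:
            ticket.add(rule_id)
        # 3. Safety floor: any new FP on a safety-critical class disables immediately.
        if any(g["fp_by_class"].iloc[-1].get(c, 0) > 0 for c in SAFETY_CLASSES):
            disable.add(rule_id)
    return disable, ticket
```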
A2: A new model architecture (say, a smaller distilled BERT) becomes available that's cheaper than SageMaker today. How do you decide whether to migrate?
Three-axis evaluation:
- Cost trajectory. Distilled BERT serving cost vs current SageMaker cost — both at current traffic and projected 12-month traffic. Include the cost of fine-tuning, the data labeling, and the eval pipeline.
- Quality on the production-residual distribution (not the full distribution — see Round 1 answer). If rules already catch 65%, the new model only sees the hard 35%. Its accuracy on that 35% is what matters.
- Operational risk. SageMaker is a managed surface; a custom-distilled model is a self-managed surface (model artifact storage, version pinning, A/B promotion infra, monitoring). Add ~$X/mo of engineering time to the cost.
Decision rule: migrate if (cost savings) > (engineering ops cost) AND (quality on residual ≥ current model on residual) AND (the operational risk fits the team's runbook capacity).
The trap: people compare distilled BERT against SageMaker on the full traffic distribution and get great numbers, then deploy and discover the residual-distribution accuracy is much worse. The eval has to match production reality.
A3: What's the long-term failure mode of optimizing the intent classifier this aggressively?
The long-term failure mode is data starvation of the SageMaker model. With 65% of traffic going through rules, the SageMaker model is now trained on a sample biased toward the residual hard cases. Each retraining cycle, this gets worse — the training distribution narrows. After 4 retraining cycles, the model has forgotten how to classify the easy cases (because it never sees them) and the rules are now load-bearing. If the rule pipeline ever fails (config bug, deploy regression), the model alone can't carry the system.
Mitigations:
- Mix easy cases back into training data: the rule-bypassed queries are still labeled (rule label = ground-truth proxy). Sample them back into the SageMaker training set so the model retains coverage.
- Periodic "shadow run": every quarter, route 5% of rule-classified traffic through SageMaker too (paying the cost) to keep the model honest and catch divergence.
- Architectural commitment: the rule pre-filter is a cost optimization, not a replacement. The model must remain the source of truth.
The deeper insight: cost optimizations that reduce a model's training signal compound over time. This is the silent debt of "ML as a fallback" architectures.
Intuition Gained — US-02
The core insight: Replacing model calls with rules is an architectural shift, not a quick win. The model now sees a different distribution; the rules need their own ML lifecycle (training data, validation, drift detection, retirement).
Mental model to carry forward:
"A rule is a model with a manually-engineered decision boundary. It needs the same lifecycle: dataset, eval, monitoring, retraining."
The hidden failure mode: Model accuracy on the residual (post-rule-filter) distribution is what matters, not on the full distribution. People evaluate the model on full traffic and miss that production reality has shifted.
One-line rule: Rules are not free; they shift cost from inference to ML-engineering operations. The savings are real only if the operational cost is accounted for.
Scenario US-03 (ML/AI Engineer Lens) — Caching Strategy
Opening Question
Q: Semantic response cache at 0.95 cosine threshold. From an ML/AI standpoint, what determines whether this is a safe lever or a quality time-bomb?
Round 1 Answer: It's the embedding space's monotonicity property: does cosine similarity 0.95 in this embedding space actually mean "these two queries have the same correct answer"? That's an empirical question about the embedding model, not a global property of "semantic similarity." For a chatbot that fetches from a moving catalog, even queries semantically identical can have different correct answers if the catalog updated between them. So the offline test isn't "are these embeddings close" — it's "do queries close in embedding space have answers that remain correct over the cache TTL window."
Round 1 — Surface
Follow-up: How do you measure "do queries close in embedding space have correct same answers"?
Build a cache-collision audit dataset: pairs of queries that the embedding model says are ≥0.95 similar but came from different sessions. For each pair:
- Fetch the current correct answer for each query (live, not cached).
- Compare: are they the same answer? Are they semantically equivalent? Are they factually inconsistent?
The metric is pair_collision_correctness — what fraction of high-similarity pairs have answers we'd be willing to swap in production. This is not "embedding quality" — it's "embedding-quality-conditional-on-our-answer-distribution," which is the only metric that matters for cache safety.
If the rate is below 99%, the threshold of 0.95 is too low. Either raise the threshold or shorten the TTL.
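A minimal sketch of the collision audit over those pairs. fetch_live_answer and answers_equivalent are assumed hooks around the live pipeline and an equivalence judge; the 99% floor is the value quoted above.

```python
def pair_collision_correctness(pairs, fetch_live_answer, answers_equivalent,
                               floor=0.99):
    """pairs: (query_a, query_b) tuples with cosine >= 0.95 drawn from different
    sessions. Returns (rate, passes_floor): the fraction of high-similarity pairs
    whose answers we'd be willing to swap in production."""
    swappable = sum(
        1 for q_a, q_b in pairs
        if answers_equivalent(fetch_live_answer(q_a), fetch_live_answer(q_b))
    )
    rate = swappable / len(pairs)
    # Below the floor: raise the 0.95 similarity threshold or shorten the TTL.
    return rate, rate >= floor
```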
Round 2 — Push Harder
Follow-up: You're measuring this on a static snapshot of the catalog. Production catalog changes. How do you account for staleness?
Two-part approach:
- TTL-versioned correctness measurement. For each cache entry, the audit doesn't ask "is this answer correct now?" — it asks "would this answer have been correct at any point during its TTL window?" If the cached answer was correct when written and remains correct for 24h, it's safe. If the catalog flips during the TTL, the cache is stale-but-not-yet-invalidated.
- Catalog-change overlay on the offline test. I take 1 week of replay traffic and overlay 1 week of actual catalog change events. For each query at time t, I compute (a) the answer that would have been served from cache (if there was a hit), and (b) the answer that would have been live-fetched. Discrepancies are stale-cache events. Their rate is stale_read_rate.
If stale_read_rate is 0.4% on average but spikes to 12% during the 6 hours after a major price update, the average is misleading. I need the spike behavior, not just the average. Cache safety is about worst-case, not average-case.
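A sketch of the overlay replay that reports the rate per hour rather than as one average, so the post-catalog-update spike is visible. cache_lookup and catalog_answer_at are assumed hooks over the cache snapshot and the catalog-change history.

```python
from collections import defaultdict

def hourly_stale_read_rate(replay_events, cache_lookup, catalog_answer_at):
    """replay_events: (timestamp, query) tuples from one week of traffic.
    cache_lookup(query, ts) returns the answer the cache would have served
    (or None on a miss); catalog_answer_at(query, ts) returns the live answer."""
    hits, stale = defaultdict(int), defaultdict(int)
    for ts, query in replay_events:
        cached = cache_lookup(query, ts)
        if cached is None:
            continue                                  # miss -> no staleness risk
        hour = ts.replace(minute=0, second=0, microsecond=0)
        hits[hour] += 1
        if cached != catalog_answer_at(query, ts):
            stale[hour] += 1
    return {hour: stale[hour] / hits[hour] for hour in hits}
```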
Round 3 — Squeeze
Follow-up: You've added embedding caching, semantic-response caching, AND L1 in-process caching for the same query. How do you reason about the compound failure surface?
Three layers, three failure modes, three masking interactions:
| Layer | Wrong-answer cause | Masks the layer below it? |
|---|---|---|
| L1 (in-process LRU, 30s TTL, top-50 ASINs) | Stale by ≤ 30s | Yes — L1 hit means L2 is never queried |
| L2 (Redis, semantic cache, intent-specific TTL) | Wrong key collision OR stale within TTL | Yes — L2 hit means LLM is never called |
| Embedding cache (1h TTL on query embeddings) | Stale embedding (rare; embedding model is stable) | No — independent layer |
The compound failure surface:
- L1 wrong → L2 silently bypassed. A bug in L1 means L2's invalidation never matters. I need an offline test that validates L1 hits against L2 hits on the same key periodically.
- Embedding cache returns a stale embedding → L2 lookup uses the wrong vector → L2 cache miss when it should have hit, or L2 cache hit when it shouldn't. This is the most insidious — a wrong embedding can route to a wrong cache key entirely.
Test design:
- Per-layer wrong-answer audit. Each layer must have its own audit dataset and its own ceiling. Can't aggregate.
- Cross-layer consistency check. Sample 100 queries; query them against all three layers in parallel; assert agreement ≥ 99.9%. Disagreement is a red flag for one of the layers.
- Embedding cache TTL alignment. Embedding cache TTL must be ≤ L2 cache TTL on the same intent. Otherwise an old embedding could find an evicted cache key.
The architect-level point: multi-layer caching is not three caches stacked; it's one cache with three failure modes. The audit must reason about the system, not the layers in isolation.
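A minimal sketch of the cross-layer consistency check described above. l1_get, l2_get, and live_fetch are assumed hooks into the three layers; answers are assumed comparable by equality.

```python
import random

def cross_layer_consistency(queries, l1_get, l2_get, live_fetch,
                            sample_size=100, floor=0.999):
    """Sample queries, read each through every layer, flag any disagreement."""
    disagreements = []
    for q in random.sample(queries, sample_size):
        answers = {"l1": l1_get(q), "l2": l2_get(q), "live": live_fetch(q)}
        served = {layer: a for layer, a in answers.items() if a is not None}
        if len(set(served.values())) > 1:             # any two layers disagree
            disagreements.append((q, served))
    agreement = 1 - len(disagreements) / sample_size
    return agreement >= floor, disagreements
```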
Round 4 — Corner
Follow-up: A user reports they got the wrong manga price. You suspect cache. How do you reproduce and root-cause offline?
Reproduction is the hard part because cache state is ephemeral. The runbook:
- Capture the user's session ID and timestamp. Pull the request from access logs.
- Reconstruct the cache state at that moment. This requires the cache layers to log every put/get with key + timestamp + value-hash. If you don't have this, you can't root-cause cache bugs — full stop.
- Replay the request against historical cache state. Offline, load the cache snapshot from t-1 minute, replay the user's request, observe the response.
- Check three hypotheses in order:
  - L1 hit returned a wrong value (possible TTL bug, possible eviction race).
  - L2 hit returned a wrong value (semantic cache key collision; check the embedding for the user's query and the embedding for the cached query, measure their cosine, see if they should have been classed as the same).
  - Live fetch returned a wrong value (catalog inconsistency; not a cache bug).
The offline test that prevents this: cache audit logs are not optional. Every cache layer must log enough state for offline post-hoc reproduction. If the on-call can't reconstruct cache state from logs, the cache observability is broken.
The ML insight: most "wrong answer from cache" tickets are actually embedding-collision bugs that would have been caught by the cache-collision audit dataset described in Round 1. Investing in that dataset upfront is the single highest-leverage QA investment for semantic caching.
Architect-Level Escalation
A1: How do you handle embedding model drift over the lifetime of the cache?
The cache is keyed on embedding-derived hashes. Change the embedding model, and the keys collide differently — old cache entries become unreachable (no big deal; they expire) but also, new lookups find old entries that meant something different (huge deal — wrong-answer risk).
Two-pattern fix:
- Embedding-model-versioned cache keys. Every cache entry includes the embedding model version in its key prefix. New version = new key namespace = no collision with old entries. Old entries expire naturally.
- Migration mode: when promoting a new embedding model, run a dual-key write for one TTL window (write under both old and new model keys). This warms the new namespace without a cold-start cliff. After one TTL window, drop dual-write.
The pattern: every cache backed by a model is implicitly versioned by the model. If the model version isn't in the key, the cache assumes the model is eternal, and that assumption breaks the day you upgrade.
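A sketch of the two patterns: the model version lives in the key prefix, and migration dual-writes under both namespaces for one TTL window. The version tag and key format are illustrative assumptions.

```python
import hashlib

EMBEDDING_MODEL_VERSION = "embed-v2"    # hypothetical version tag

def cache_key(embedding_bytes: bytes, model_version: str = EMBEDDING_MODEL_VERSION) -> str:
    """Key namespace includes the embedding-model version, so entries written
    under an old model can never be returned for lookups made with a new one."""
    digest = hashlib.sha256(embedding_bytes).hexdigest()[:32]
    return f"semcache:{model_version}:{digest}"

def dual_write_during_migration(redis_client, value, ttl_s,
                                emb_old, old_version, emb_new, new_version):
    """Migration mode: write under both namespaces for one TTL window so the
    new namespace is warm before dual-writing is dropped."""
    redis_client.setex(cache_key(emb_old, old_version), ttl_s, value)
    redis_client.setex(cache_key(emb_new, new_version), ttl_s, value)
```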
A2: The cost-quality tradeoff of cache TTL is not monotonic. Walk me through it.
Naive view: longer TTL = higher hit rate = more savings. So pick the longest TTL the staleness contract allows.
Real view: longer TTL = higher hit rate = more savings and more wrong-answer events when staleness happens. The expected cost of a wrong answer (escalation cost, refund cost, churn) can dominate the savings on a long TTL.
The optimization is:
maximize: hit_rate * call_cost_saved
subject to: stale_event_rate * stale_event_cost <= acceptable_loss_per_day
This is intent-specific. For faq queries about return policy, stale-event-cost is a refund dispute (~$50). For recommendation, stale-event-cost is a click-through that didn't happen (~$0.10). Same TTL is wrong for both; the TTL should be set per-intent based on the stale-event cost.
The current US-03 design does this implicitly (24h for faq, 30min for recommendation); the offline test should make the math explicit so future PRs can't silently break the assumption.
A3: When cache is at 70% combined hit rate, what's the next ML-leverage cost optimization?
Three candidates:
- Pre-compute the long-tail. Top 500 ASINs are warmed; the next 5,000 ASINs hit cache cold. Pre-warm the next tier based on Redshift trending data — small storage cost, big hit-rate gain.
- Predictive pre-fetch on session start. When a session starts on manga_pdp for ASIN X, pre-fetch ASIN X's reviews, recommendations, and Q&A — the user is statistically likely to request these next. Asynchronous, latency-hidden.
- Hierarchical caching for partial results. Many queries reuse intermediate computations (e.g., the same RAG chunks across different downstream prompts). Cache the intermediate representations, not just the final answer.
All three are downstream of US-03's design. They're "cache 2.0" — not in the current 8 user stories but the natural next step. The offline test for any of them is the same shape: counterfactual replay with paired hit-rate-vs-wrong-answer-rate metrics.
Intuition Gained — US-03
The core insight: A cache backed by an embedding model is a system with two ML components (the embedder and the answer model) and the cache safety depends on the calibration of both. Treat it as ML infrastructure, not just KV storage.
Mental model to carry forward:
"Cache wrong-answer rate is the product of (collision rate at threshold) × (answer-divergence rate given collision). Both need their own dataset and their own audit."
The hidden failure mode: TTL is the visible knob; embedding-model version is the invisible knob. Upgrade the embedder without versioning cache keys and the next day's cache hits return last week's answers from a different question.
One-line rule: Cache safety is the wrong-answer rate during the TTL window under realistic upstream change rates, not the wrong-answer rate at t=0.
Scenario US-04 (ML/AI Engineer Lens) — Compute Cost Optimization
Opening Question
Q: Right-sizing ECS Fargate from 2 vCPU/4GB to 1 vCPU/2GB for the Orchestrator. From an ML/AI engineer's perspective — not an SRE's — what's the test that matters?
Round 1 Answer: The ML-relevant test is whether right-sizing changes the distribution of latencies seen by the model-calling code path. The Orchestrator is I/O-bound on average, but during personalization scoring or RAG re-ranking it's briefly CPU-bound. If right-sizing causes those bursts to queue, the LLM call's "time to first token" effective latency from the user's perspective changes — even though the LLM call itself didn't change. From an ML perspective, the user-experienced latency distribution is what matters, not the per-component latency.
Round 1 — Surface
Follow-up: How do you measure the right thing offline?
Three latency metrics, all distributions (not averages):
- Pre-LLM latency: from request received to LLM call initiated. This is the part right-sizing affects.
- LLM call latency: from call initiated to first token. This is Bedrock; right-sizing doesn't change it.
- End-to-end latency: from request received to response complete. This is what the user feels.
Right-sizing is safe if (1) shifts up by ≤ X ms p99 and (3) shifts up by ≤ same X ms p99 (i.e., the shift in (1) flowed through unattenuated to (3), no compound interaction). Right-sizing is unsafe if (1) shifts up by 50ms p99 but (3) shifts up by 200ms p99 — that means the right-sizing caused queueing or backpressure that amplified.
The offline test runs 1 week of replay traffic against both pipeline configurations and compares the distributions, not the means. I'm specifically looking for p99 and p99.9 — averages will hide the failure mode.
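A sketch of the distribution comparison, checking both the absolute p99 shift and the amplification signature (end-to-end shift growing faster than the pre-LLM shift). The 50ms budget and the 1.2x bound are illustrative assumptions, not SLO values from the source.

```python
import numpy as np

def right_sizing_latency_check(baseline, candidate, max_p99_shift_ms=50):
    """baseline / candidate: dicts of latency arrays (ms) keyed by
    'pre_llm' and 'end_to_end', measured over the same 1-week replay."""
    def shift(metric, q):
        return float(np.percentile(candidate[metric], q) - np.percentile(baseline[metric], q))

    pre_p99 = shift("pre_llm", 99)
    e2e_p99 = shift("end_to_end", 99)
    safe = (pre_p99 <= max_p99_shift_ms
            and e2e_p99 <= max_p99_shift_ms
            and e2e_p99 <= max(pre_p99, 0) * 1.2 + 5)   # small slack for noise; no amplification
    return safe, {"pre_llm_p99_shift_ms": pre_p99,
                  "e2e_p99_shift_ms": e2e_p99,
                  "e2e_p999_shift_ms": shift("end_to_end", 99.9)}
```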
Round 2 — Push Harder
Follow-up: Spot interruption + mid-WebSocket-stream loss. From an ML perspective, what does the offline test need that an SRE test wouldn't think to include?
The SRE thinks about connection state. The ML/AI engineer thinks about conversation state. When a Spot task is reclaimed mid-WebSocket-stream:
- The user's question was already sent to Bedrock.
- The first 60% of the streamed response was sent to the user.
- The Spot reclaim aborts the stream.
- The user reconnects on a different task, sees a partial response, and re-asks.
The ML failure modes the SRE test misses:
- Conversation state coherence on reconnection. The new task picks up the conversation; does it know about the partial response that was streamed? If not, the model is operating on stale context. The user re-asks; the model now has the same question twice in history — its responses to the second can be "I just answered that" or worse, an inconsistent variant.
- LLM call cost double-counted. The first call happened and was billed. The reconnection triggers a second call, also billed. Cost telemetry mis-attributes this: it looks like one user asking one question, but we paid for two LLM calls. Aggregate cost-per-session metrics inflate "average cost" and the cost-savings claim becomes wrong.
So the offline test for Spot resilience must include:
- Reconnection conversation-coherence audit. Drop a connection mid-stream, reconnect, send the user's likely re-ask, and measure whether the model maintains coherent state.
- Cost double-count detection. When a stream is interrupted and re-issued, the cost telemetry must mark the first call as "interrupted" so it's excluded from per-session cost averages.
These are ML/conversation-design concerns, not SRE concerns. The SRE owns "the connection drains gracefully"; the ML/AI engineer owns "the conversation survives the drop."
Round 3 — Squeeze
Follow-up: You're using Graviton ARM64 for WebSocket handlers — 20% cheaper than x86. The offline test passes. What ML-specific behavior could differ between architectures and how do you catch it?
ARM64 vs x86 floating-point behavior is almost identical for the operations a chatbot uses, but there are three landmines:
- Embedding similarity scores can differ at the 4th–5th decimal place due to FMA (fused multiply-add) ordering. This doesn't matter for individual scores but can flip the ordering of near-tied results in top-k lists. If the cache key is derived from embedding bytes (not normalized values), keys generated on ARM and x86 are different — same query, different cache key, no hits.
- Tokenizer output is identical (deterministic string operations) but prompt-token-count assertions in tests can fail if the test was written assuming a specific token count and ARM is computing it via a different code path. Rare but seen.
- Native dependencies (numpy, scipy, redis client) sometimes have ARM-specific bugs that x86 doesn't trigger. The test catches these only if it runs the full workload, not just unit tests.
The offline test for ARM migration:
- Workload-shape diff test on 5K samples — run the same input through ARM and x86 builds, diff every output byte. Acceptable diffs: ordering of equal-scored items, formatting whitespace. Unacceptable: any answer-content diff.
- Cache-key generation parity check — generate cache keys for 1K queries on both architectures, assert they are byte-identical.
- Dependency scan for known ARM-incompatible packages before the build even runs.
The savings (20%) are real and the migration is usually safe, but "usually" is doing the work — the offline test is what makes it certainly safe.
Round 4 — Corner
Follow-up: You right-sized successfully. 6 weeks later, the recommendation team adds a new personalization feature that does in-process matrix multiplication for user-taste re-ranking. CPU usage on Orchestrator tasks jumps from 55% to 88% average. What does your offline harness do?
The harness should have caught this at PR time, not 6 weeks later. The PR adding the recommendation feature should have run the cost-aware golden dataset (file 03 Primitive C) against the right-sized task definition, and seen:
- ecs_avg_cpu jumped from 55% to 88% on the same replay traffic.
- p99_response_latency for personalization-heavy queries jumped from 800ms to 2.4s.
Both of these would have failed the cost-aware golden's regression gate (the gate's "no individual query exceeds cost-band ceiling by 15%" rule). The PR would have been blocked.
What this requires:
- The cost-aware golden runs on every PR, not just cost-optimization PRs. The "ratchet" pattern (file 03).
- The golden dataset must include personalization-heavy queries in its sample (otherwise the ratchet doesn't see them).
- The PR author must understand that "my feature passed unit tests" is not "my feature passed the cost-aware golden" — these are different checks.
If the harness wasn't run on the PR, the right reaction is both (a) revert the right-sizing temporarily to absorb the headroom, AND (b) require a follow-up PR that either optimizes the personalization code or accepts a larger task definition. Don't let the right-sizing be the casualty of a feature PR that didn't account for it.
The systems insight: cost-optimization gates are infrastructure, not project-specific tests. Once shipped, they apply to every change forever.
Architect-Level Escalation
A1: When you compress the orchestrator footprint by right-sizing, what other architectural signals should you re-baseline?
Five signals shift when CPU/memory headroom shrinks:
- Auto-scaling thresholds: scale-out at 70% CPU on a 1vCPU task is much sooner than 70% CPU on a 2vCPU task. The absolute CPU-seconds-of-headroom changes. Retune.
- Garbage collection behavior (if managed runtime): smaller heap = more frequent GC pauses. Latency p99 sensitive.
- Request concurrency limits: each task can hold fewer concurrent requests. Total cluster concurrency may need more tasks even at lower per-task cost.
- Cache warm-up time: in-process L1 cache rebuilds faster on a smaller task (less memory to fill) but evicts more aggressively. Hit rate on L1 may drop.
- Dependency connection pool sizes (Redis, RDS, OpenSearch) — sized per-task, often configured assuming a larger task. May over-allocate.
A right-sizing PR should adjust all five concurrently. Right-sizing CPU/memory in isolation creates 5 silent inefficiencies.
A2: The org wants to standardize on a 1vCPU/2GB Fargate task definition for all chatbot services. What ML/AI risks does this create?
Three:
- Heterogeneous workloads forced into a one-size profile. The Orchestrator is I/O-bound (good fit). The personalization re-ranker is CPU-bound on bursts (bad fit). The image-thumbnail service is memory-bound (bad fit). Standardization saves on management overhead but loses workload-fit.
- Concurrency assumptions baked into code. If the personalization service was written assuming "I have 2vCPU and can parallelize 4 scorers," 1vCPU breaks the parallelism assumption silently — you get sequential execution and a latency cliff.
- ML model loading footprint. A larger model loads into a 2GB task fine on its own; into a 2GB task that's also serving requests, it's marginal. If 1.5GB goes to the model and 0.5GB to everything else, request handlers OOM on the second concurrent request.
Recommendation: standardization is fine for stateless I/O-bound services. Stateful or compute-bound services need an explicit sizing exception. Make the exception a documented architectural decision, not a silent override.
A3: How do you set up cost-vs-quality regression detection for compute changes that's automatic — no human in the loop?
Cost-aware golden runs on every PR (Primitive C). Its config:
- 500-query golden dataset, with expected_cpu_seconds, expected_memory_peak_mb, expected_p99_latency per query.
- PR runs the dataset against the new task definition; emits actuals.
- Fails the PR if actual > expected * 1.15 on any single query.
- Fails the PR if actual_p99 > expected_p99 * 1.15 aggregate.
Plus a weekly "cost re-baseline" job:
- Re-runs the golden dataset against current main.
- Updates expected_* to the new median + IQR.
- Files an alert if the baseline drift exceeds 10% week-over-week (means the baseline is moving, which is itself a signal).
The combination: PR-level guard against single-PR regression + weekly drift detection against gradual creep. Both automated. Human-in-loop only when the alert fires.
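A minimal sketch of the PR-level gate over the golden dataset. The metric names and JSON layout are illustrative; the 1.15 multiplier is the 15% band from the gate config above.

```python
import json

def cost_golden_gate(golden_path, actuals, tolerance=1.15):
    """golden_path: JSON keyed by query_id with expected cpu_seconds,
    memory_peak_mb, p99_latency_ms. actuals: same shape, measured on the PR's
    task definition. Returns (passed, failures)."""
    with open(golden_path) as f:
        golden = json.load(f)
    failures = []
    for query_id, expected in golden.items():
        for metric, exp_val in expected.items():
            got = actuals[query_id][metric]
            if got > exp_val * tolerance:
                failures.append((query_id, metric, exp_val, got))
    return len(failures) == 0, failures
```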
Intuition Gained — US-04
The core insight: Compute cost optimizations affect more than the cost line — they shift latency distributions, change concurrency, alter cache behavior, and stress dependency pools. The eval must measure all of these.
Mental model to carry forward:
"Right-sizing changes 5 things: cost, latency-p99, concurrency, GC behavior, and dependency pool sizing. The PR that ships only changes the cost. The PR that ships safely re-baselines all 5."
The hidden failure mode: Spot interruption mid-WebSocket-stream is the canonical "the SRE test passed but the ML behavior failed" — connection state is fine, conversation state is broken.
One-line rule: Compute changes are ML changes through the latency channel. Treat them with the same eval rigor as a model change.
Scenario US-05 (ML/AI Engineer Lens) — DynamoDB Cost Optimization
Opening Question
Q: Aggressive TTLs on conversation memory (TURN: 24h, SUMMARY: 72h). From an ML perspective, when does this break?
Round 1 Answer: It breaks when the distribution of conversation length and gap-time doesn't match what the TTL was designed for. A multi-turn return-flow conversation can pause for 26+ hours mid-stream; a returning user resumes a recommendation thread the next morning. If the offline test only used same-day sessions, the TTL appears safe. The failure surface is the long tail of resumed sessions, and the offline replay must include that tail explicitly.
Round 1 — Surface
Follow-up: What's your sampling strategy to ensure the long tail is in your offline dataset?
Stratified sampling on session-resume-gap distribution. Pull 1 month of production session metadata, compute the distribution of (last-message-time → next-message-time) gaps within each session. The distribution is heavy-tailed: 80% within 5 minutes, 15% within 30 minutes, 4% within 24 hours, 1% > 24 hours.
For the offline replay set:
- 800 sessions with same-day continuity (matches 80% of production)
- 150 sessions with 5–30 minute gaps
- 40 sessions with 30 minute – 24 hour gaps
- 20 sessions with > 24 hour gaps — over-sampled 20x relative to production frequency
That last bucket is critical because the failure mode is concentrated there. Production frequency would give me 1 sample of the bucket I need to test. Stratified over-sampling gives me 20.
The metric to measure on each bucket: context_loss_rate = fraction of resumed sessions where the bot can't reconstruct context from what survived TTL.
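A sketch of the stratified sampler over the resume-gap buckets above. The max_gap column name and the bucket boundaries are assumptions matching the stratification just described.

```python
import pandas as pd

# Target counts per resume-gap bucket; the >24h bucket is deliberately
# over-sampled ~20x its production frequency.
BUCKETS = [
    ("same_day",  pd.Timedelta(0),          pd.Timedelta(minutes=5),  800),
    ("short_gap", pd.Timedelta(minutes=5),  pd.Timedelta(minutes=30), 150),
    ("long_gap",  pd.Timedelta(minutes=30), pd.Timedelta(hours=24),    40),
    ("resumed",   pd.Timedelta(hours=24),   pd.Timedelta.max,          20),
]

def stratified_resume_sample(sessions: pd.DataFrame, seed=7) -> pd.DataFrame:
    """sessions: one row per session with a max_gap Timedelta column
    (largest gap between consecutive messages)."""
    parts = []
    for name, lo, hi, n in BUCKETS:
        pool = sessions[(sessions["max_gap"] >= lo) & (sessions["max_gap"] < hi)]
        parts.append(pool.sample(n=min(n, len(pool)), random_state=seed).assign(bucket=name))
    return pd.concat(parts, ignore_index=True)
```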
Round 2 — Push Harder
Follow-up: You measured context_loss_rate at 4.1% on the > 24h bucket. The team says "that's only 1% of total traffic so we're at 0.04% overall — ship it." What's wrong?
Three problems:
- Severity weighting. The 1% of traffic that resumes >24h later is heavily weighted toward return_request and escalation — long-running issues, not casual queries. Weighted by intent severity (refund disputes are high-cost), the 0.04% headline understates the impact 5–10x.
- User-perceived consequence. A user resuming a return after 26 hours and getting "I have no record of this conversation" doesn't churn on average — they escalate angrily. The cost isn't a lost user; it's escalation count + agent time + churn-on-tail.
- Self-fulfilling distribution. If we ship aggressive TTL, the 1% of resumed-session users learn the bot doesn't remember, and they switch to email or phone for those flows. The 1% drops not because we fixed the system but because users abandoned it. The metric improves; the customer experience worsens.
The right framing: don't accept aggregate rates that hide a structural failure on a high-stakes intent. Either fix the failure (per-intent TTL: keep return_request TURN items for 168h) or don't ship.
Round 3 — Squeeze
Follow-up: Per-intent TTL — return_request gets 168h, everything else 24h. How do you tune the 168h number? Is it an SLO, an empirical fit, or a guess?
Empirical fit, with an SLO sanity check:
- Empirical: pull 1 month of return-flow sessions, measure the resume-gap distribution within return flows. If 95% of return-flow resumes happen within 96 hours, set TTL at 168h to cover 99%.
- SLO sanity: Amazon's return policy is 30 days. A user resuming a return at day 14 is within the policy window. At 168h (7 days), we cover the first week — most resumes happen here. Beyond 7 days, the user typically restarts the flow with fresh context anyway.
- Cost cross-check: 168h vs 24h on return_request items multiplies storage for that class by 7. If return_request is 4% of all turns, the additional cost is 4% × 6 = a 24% bump on the overall TURN-storage line — acceptable given the criticality.
Then commit to quarterly re-tuning because the distribution shifts (new return policies, new product categories). The TTL is a calibrated parameter, not a constant.
Round 4 — Corner
Follow-up: Sparse GSI — only META items project to the GSI. A future PR adds a customer_id field to TURN items "for analytics." What's your offline guard?
This is exactly the cost-aware golden (Primitive C) failure case. The PR adding customer_id to TURN items would:
- Cause TURN items to project to the sparse GSI
- Double the GSI write capacity overnight
- Inflate dollars_per_1k_turns by ~80%
The cost-aware golden's expected_ddb_cost_per_turn band would be exceeded. The PR fails CI. The author has to either (a) not add the attribute, (b) add it as a non-projected attribute (explicit ProjectionExpression carve-out in the GSI definition), or (c) explicitly opt the PR out of the cost gate with a documented justification.
The deeper protection: the GSI definition should be code-reviewed when any model-affecting attribute is added. The PR template should ask: "Does this PR add any new attribute to TURN, SUMMARY, or META items? If yes, has the GSI ProjectionExpression been verified?" That's a process control, not a test, and it's the cheapest possible safeguard.
Architect-Level Escalation
A1: How do you reason about the cost-quality tradeoff curve for conversation memory across the full DDB stack?
The curve has three regions:
| Memory regime | Cost | Quality (multi-turn coherence) | When chosen |
|---|---|---|---|
| Aggressive (TTL 6h, sparse everything) | ~50% of baseline | -8% on multi-turn (re-ask rate up) | Cost-constrained mode (US-08 DEGRADED) |
| Balanced (TTL 24h TURN / 72h SUMMARY, sparse GSI) | ~60% of baseline | -1% (current US-05 design) | Normal operation |
| Conservative (TTL 168h, all attributes projected) | 100% baseline | reference | Pre-optimization baseline |
The curve is non-monotonic in user satisfaction: aggressive mode saves cost but increases re-ask + escalation rate, which generates more downstream cost. The aggressive cell's true cost is DDB savings - escalation cost increase, which can be net-negative.
So the architectural choice is: balanced mode in steady state, with aggressive mode reserved for cost-circuit-breaker scenarios where you've already accepted the quality tradeoff. The offline test for switching modes is a counterfactual replay measuring escalation rate, not just DDB cost.
A2: What ML signals would convince you the balanced TTL is wrong (too long or too short) without waiting for an incident?
Five drift signals worth monitoring:
- Re-ask rate per intent — if return_request re-ask rate climbs from 8% to 12% over a quarter, TTL is too short for that intent.
- Multi-turn coherence score (LLM-as-judge on follow-up resolution) — if dropping, memory is dropping things the model needs.
- TTL-deletion-to-resume gap distribution — track the time between TTL-evicting an item and the user resuming the conversation. If 5% of resumes are within 10 minutes of a TTL eviction, the TTL is just-too-aggressive.
- Storage cost per intent — if the return_request storage line goes up 40% over a quarter while other intents are flat, return-flow conversations are getting longer (maybe a UX change), and the TTL-cost interaction has changed.
- Per-intent context-rebuild count — if rebuild count is rising (we're reconstructing context from the S3 archive more often), TTL is forcing more cold reads, which has its own cost.
These are ongoing monitoring, not one-time eval. The offline test sets the initial TTL; the monitoring tells you when to retune.
A3: TransactWriteItems for atomic TURN+META updates. The cost is 2x WCU. Is the consistency guarantee worth it?
Yes — and the right way to defend the choice is to enumerate the failure modes without TransactWrite:
| Failure mode without TransactWrite | Probability | Cost when it happens |
|---|---|---|
| TURN write succeeds, META write fails | ~1% per pair (network/throttle) | Conversation state inconsistent; bot misroutes next turn; user notices |
| META write succeeds, TURN write fails | ~1% per pair | Conversation history missing the latest turn; bot loses context |
| Both succeed but in different order, observed by reader | ~5% per pair (concurrent reader) | Bot sees inconsistent state briefly; usually invisible but possible bad route |
At 1M turns/day, 1% partial-failure = 10K user-visible inconsistencies per day. Each one risks a re-ask or escalation. The cost of those (~$0.10-$0.30 per escalation) at 10K/day = ~$1K-$3K/day downstream cost.
TransactWrite cost: 2x WCU on each TURN+META pair adds roughly $200/mo to the DDB write line. It avoids an estimated $30K-$90K/mo in downstream re-ask/escalation cost.
ROI is overwhelming. The 2x WCU cost is well worth it. But the defense needs the math — without it, "atomic writes" sounds like over-engineering. With it, it's the cheapest insurance you'll buy.
Intuition Gained — US-05
The core insight: TTL is a calibrated ML parameter, not a fixed config. The right TTL depends on the conversation-resume distribution, which is intent-specific and shifts over time.
Mental model to carry forward:
"TTL aggressiveness saves storage and risks context loss. The right point depends on the intent's stale-event cost. One TTL for all intents is wrong on at least one intent."
The hidden failure mode: Long-tail resumed sessions cluster on return_request and escalation — the highest-stakes intents. Aggregate TTL metrics hide a structural failure on the intents that matter most.
One-line rule: Per-intent TTL is the cost-quality optimum. Calibrate it from the resume-gap distribution, re-tune quarterly, monitor re-ask and escalation rates as drift signals.
Scenario US-06 (ML/AI Engineer Lens) — RAG Pipeline Cost Optimization
Opening Question
Q: RAG-bypass for promotion, order_tracking, escalation, chitchat — 40% of traffic. From an ML perspective, what's the safety net?
Round 1 Answer: The safety net is per-intent decision-equivalence on a labeled "requires_retrieval" dataset. The bypass is intent-keyed but the failure mode is sub-intent — within promotion, some queries genuinely need retrieval ("any deals on Berserk Deluxe?"). The eval can't be intent-level; it has to be per-query, with a requires_retrieval ground-truth label, and the bypass agreement floor must be met on the queries within each intent that genuinely need retrieval, not just on the intent overall.
Round 1 — Surface
Follow-up: How do you generate the requires_retrieval ground truth?
Three sources:
- DS-curated: data scientist labels 1K queries per intent with binary "would this query benefit from RAG?" — based on whether retrieved chunks would change the response. ~$2K of labeling effort.
- Sonnet-generated with audit: prompt Sonnet with the query + system prompt + a flag for "would you have wanted retrieval for this?" Audit on 200 samples for human-Sonnet agreement.
- Behavioral signal from prod: queries where the user immediately re-asked or escalated after a no-RAG response are signal that RAG was needed.
I'd combine all three with weights. The DS-curated set is the gold; Sonnet-generated is the bulk; behavioral is the signal-from-real-users. The combined dataset has ~5K labeled queries per intent.
Round 2 — Push Harder
Follow-up: Reranker skip when top kNN score > 0.9. You're conditioning on a calibrated score. What invalidates the calibration?
Four things:
- Embedding model upgrade. Score distribution shifts. 0.9 today might mean 0.7 in calibration units after the upgrade. (See file 04 Scenario US-06 Incident B.)
- Index size growth. With more documents, the kNN top-1 score generally drops (more competing candidates). The 0.9 threshold over-fires (skip rate climbs) and top-1 quality drops.
- Query distribution shift. New types of queries (e.g., a new category of manga added) may produce systematically lower scores because the index has fewer matching documents. Scores below 0.9 on those queries get reranked correctly; the issue is the headline skip rate metric stops being meaningful for those queries.
- Document re-indexing with new chunking. Smaller chunks → kNN scores depend on chunk vs query alignment, which differs from larger chunks.
Each of these requires a calibration re-run. The right architecture: every time any of (embedder, index, chunking, document corpus) changes, the reranker-skip-threshold-vs-correctness curve must be re-computed and the threshold re-tuned. This is a hard CI gate, not a runbook step.
Round 3 — Squeeze
Follow-up: Embedding cache hit rate target is 20%. You measure 24% in shadow. The PM is happy. What's the next question you ask?
"What does the 24% comprise?"
Two scenarios:
A) 24% of all queries hit cache, distributed roughly uniformly across query types. Cost savings track linearly.
B) 24% of all queries hit cache, but they're concentrated on a few high-volume queries ("recommend me something like X" for top-50 X values). The savings come from the head; the long tail still has 0% hit rate.
Scenario B has hidden risk: if the embedding model is updated, those high-volume queries' cached embeddings become stale, and we lose ~all of the 24% in one event. Scenario A degrades gracefully because the cache hit distribution is broad.
So the next questions:
- What's the entropy of the cache hit distribution? (high entropy = scenario A; low entropy = scenario B)
- What's the staleness behavior on a synthetic embedding model upgrade? (does cache hit rate fall to 0% instantly or decay over the 1h TTL?)
If scenario B and a model upgrade is on the roadmap in 3 months, I'd want a "warm-shadow-cache" plan: pre-populate the new-model embeddings for the top-N queries before the model swap, so hit rate doesn't cliff.
The headline number is a starting point, not a conclusion.
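One concrete way to answer the entropy question: a sketch assuming you have the list of cache-hit query IDs from the shadow window. The interpretation thresholds mentioned after it are illustrative, not standards:

```python
import math
from collections import Counter

def cache_hit_entropy(hit_query_ids: list[str]) -> float:
    """Normalized Shannon entropy of the cache-hit distribution.

    Near 1.0: hits spread broadly across queries (scenario A, graceful decay).
    Near 0.0: hits concentrated on a few head queries (scenario B, cliff risk).
    """
    counts = Counter(hit_query_ids)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    max_entropy = math.log2(len(counts)) if len(counts) > 1 else 1.0
    return entropy / max_entropy
```

Run it on a week of shadow traffic: values near 1.0 look like scenario A; values well below roughly 0.4 (an illustrative cutoff) look like scenario B and justify the warm-shadow-cache plan.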
Round 4 — Corner
Follow-up: A new manga release (Chainsaw Man Part 2) drops. Production traffic for queries about it spikes 50x. Your offline tests didn't include it. What's at risk?
Multiple things at risk simultaneously — RAG corner cases compound:
- Index freshness, not the bypass gate, is the problem. The new title triggers recommendation intent (correctly), so RAG isn't bypassed — but the index doesn't have Chainsaw Man Part 2 documents yet (they're in the next 2am batching window). RAG returns weak results; the LLM hallucinates volume counts.
- Embedding cache misses 100% on these queries. Spike → cold cache → spike of LLM calls and OCU reads. Cost briefly inflates by the spike multiplier.
- Reranker skip threshold doesn't fire on the new title. The new-title queries have low kNN scores (sparse index coverage), so they don't trigger the skip — safe in itself — but the quality of the reranked results is poor because the documents aren't there yet.
The offline harness mitigation:
- Synthetic "new-title" stress test. Inject a fake new-title scenario monthly: 100 queries about a title not in the index, measure quality and cost behavior.
- Index freshness SLO. Documents about new titles must be indexed within X hours of release. Not an offline test exactly — an upstream contract from content-ingest to RAG.
- Bypass gate freshness check. When a new intent or sub-intent emerges, the bypass labels must be re-validated. Quarterly cadence at minimum.
The architect insight: cost optimizations on RAG assume index freshness and stable distribution. Both can break in events the offline test didn't anticipate. Build the harness to detect those breakings, not just to validate the steady state.
Architect-Level Escalation
A1: Build me an evaluation framework where every embedding model upgrade auto-runs the calibration re-tune and blocks deployment if the threshold-correctness curve degrades.
Embedding model upgrade as a CI pipeline:
- Trigger: PR changes the embedding_model_version config.
- Calibration set: 5K queries with top1_correct ground truth, frozen.
- Pipeline (sketched in code below):
  a. Build the new embedder.
  b. Embed all 5K queries.
  c. For each query: kNN search against the current index, get the score for the current top-1.
  d. Sweep the threshold from 0.85 to 0.99 in 0.01 steps; for each threshold, compute (skip_rate, top1_correctness).
  e. Choose threshold T such that (skip_rate ≥ 30%) AND (top1_correctness ≥ 95%).
  f. If no such T exists, fail the PR.
- Automatic threshold update: PR auto-amends the config with the new T value.
- Promotion gate: PR can't merge until the sweep + auto-update + integration tests pass.
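A sketch of the sweep-and-gate steps (d-f above), assuming the per-query top-1 scores under the new embedder and the frozen top1_correct labels are already computed; the function and argument names are illustrative:

```python
def tune_skip_threshold(
    top1_scores: list[float],   # new-embedder kNN top-1 score per calibration query
    top1_correct: list[bool],   # frozen ground truth: is the current top-1 the right doc?
    min_skip_rate: float = 0.30,
    min_correctness: float = 0.95,
) -> float:
    """Sweep 0.85..0.99; return the lowest threshold meeting both gates, else fail the PR.

    The lowest qualifying threshold maximizes the skip rate (cost savings) while
    still clearing the correctness floor on the skipped slice.
    """
    n = len(top1_scores)
    candidates = [round(0.85 + 0.01 * i, 2) for i in range(15)]
    for t in candidates:
        skipped = [i for i, s in enumerate(top1_scores) if s > t]
        skip_rate = len(skipped) / n
        correctness = (sum(top1_correct[i] for i in skipped) / len(skipped)) if skipped else 1.0
        if skip_rate >= min_skip_rate and correctness >= min_correctness:
            return t
    raise SystemExit("no threshold satisfies both gates; block the PR")
```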
The principle: calibration parameters must be regenerated when their underlying signals change. Hand-tuning works once; the second model upgrade after the first hand-tune is when production breaks. Automate the tuning into the PR pipeline.
A2: Cost-quality tradeoff curve for OpenSearch Serverless OCU floor. Walk me through it.
OCU floor = 4 (per US-06). $691/mo.
| OCU floor | Monthly cost | Search latency p99 | Capacity for spike |
|---|---|---|---|
| 2 | $345 | 250ms (slow) | None — rejected on spike |
| 4 (current) | $691 | 180ms | 1.5x normal |
| 6 | $1,037 | 140ms | 2.5x normal |
| 8 | $1,382 | 130ms (diminishing returns) | 3.5x normal |
The curve for cost is linear; for latency, logarithmic; for spike capacity, sub-linear. The 4 OCU choice is the inflection point — below 4, latency cliff; above 4, diminishing returns.
The decision is whether to accept the latency cliff in exchange for cost. For a chatbot where user-perceived latency matters and the LLM call already takes 1-3s, the marginal 70ms from OCU=4 vs OCU=2 is hard to value for users. But it matters for the cost circuit breaker because slow searches accumulate concurrent OCU usage and can themselves spike cost.
So 4 is the right floor for steady state, with the option to scale up to 6 during predictable peak windows (Black Friday, manga release events). The offline test for the auto-scale logic is "sustain 2x normal for 30 minutes; observe OCU consumption stays below ceiling."
A3: When does RAG-bypass cost-optimization become RAG-elimination?
When a sufficiently large fraction of traffic doesn't need RAG, and the maintenance cost of the RAG pipeline exceeds the value it delivers on the residual.
The math: RAG pipeline cost (OCU + embedding + index + ops) is $X/mo. RAG-residual queries (after bypass) are Y queries/mo. Quality benefit on those Y queries (vs no-RAG) is Z (improved CSAT, reduced hallucination).
If $X / Y > $0.50 per residual query AND Z is small (< 5pt CSAT improvement), RAG is uneconomic — replace with a smaller architecture (e.g., fine-tuned LLM on policy text + structured catalog lookup).
For MangaAssist today: bypass at 40% means residual is 60% of traffic, ~600K queries/day = ~18M/month. RAG cost is ~$400/mo per US-06 target. Per-query cost is $0.000022 — far below any reasonable threshold. RAG stays.
The day RAG-bypass exceeds 80%, RAG architecture itself becomes a candidate for elimination. The offline test that detects this: track rag_value_per_query (CSAT lift on RAG-served vs RAG-bypassed responses) over time. When it falls below the operational cost per query, time to revisit the architecture.
Intuition Gained — US-06
The core insight: RAG cost optimizations are calibrated systems. Every upstream change (embedder, index size, chunking, corpus) invalidates calibration. The offline harness must enforce re-calibration as a CI gate.
Mental model to carry forward:
"Score thresholds are not constants; they are functions of the index state and the embedding distribution. Treat them as parameters of a model that must be re-fit when the model changes."
The hidden failure mode: Bypass gates set at intent-level mask sub-intent failures. Sub-intent slicing (queries containing product names, queries with specific entities) is where the gate's failure mode hides.
One-line rule: Every cost-optimization on RAG must be re-validated on every embedding model change, every index re-build, every chunking change. Three triggers, one re-validation pipeline.
Scenario US-07 (ML/AI Engineer Lens) — Analytics Pipeline Cost Optimization
Opening Question
Q: Event batching in Kinesis (50 events / 5s → 1 PUT). From an ML perspective, what's the risk to your downstream analytics?
Round 1 Answer: The risk is sampling bias on low-frequency event types. Batching is statistically fine for high-volume events (LLM calls, RAG queries). It's catastrophic for low-frequency events (cost-breaker triggers, escalations, errors) because they may not arrive during a batch window — or worse, if the batch overflows under load (Prime Day), they get dropped silently. The downstream ML signals that depend on those events (training data for cost prediction, anomaly detection on errors) get systematically biased.
Round 1 — Surface
Follow-up: How does this interact with downstream models you might train on this data?
Two model-training paths affected:
- Cost prediction model. If we want to predict per-session cost from session features (intent, user tier, query complexity), we train on session-level event aggregates. If escalation events are systematically under-counted on Prime Day, the model learns "cost is lower on Prime Day" — which is wrong, and the cost-circuit-breaker depends on accurate forward-looking cost prediction.
- Anomaly detection. If we train an anomaly detector on event volumes per minute, the batching artifact creates "anomalies" at batch boundaries (zero events for 4.9s, then a spike). The detector either learns to ignore those (and misses real anomalies) or fires on them (alert fatigue).
Both training paths need batch-aware data preparation: when we read from S3 Parquet, we need the original event timestamps (not the batch arrival time), and we need to handle batched events as a group with a shared write time, not as 50 events at the same instant.
The offline test for the analytics pipeline must include "are downstream models still trainable on this data?" — not just "are the dashboards correct."
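A sketch of the batch-aware read, assuming each batch record carries a shared write_time and a list of events with their original event_time; the field names are illustrative, not the actual event schema:

```python
def explode_batch(batch_record: dict) -> list[dict]:
    """Unpack a batched Kinesis record into per-event rows for model training.

    Keeps each event's original event_time for feature windows, and carries the
    shared batch write_time separately so batch-boundary artifacts can be
    excluded or modeled rather than mistaken for 50 simultaneous events.
    """
    write_time = batch_record["write_time"]
    rows = []
    for event in batch_record["events"]:
        rows.append({
            **event,
            "event_time": event["event_time"],  # when the event actually happened
            "batch_write_time": write_time,     # when the batch landed in the stream
        })
    return rows
```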
Round 2 — Push Harder
Follow-up: The materialized view for daily cost is refreshed every 15 minutes. The cost circuit breaker reads from it. What's the ML risk?
The breaker is making a control decision based on a 15-min-stale signal. Three failure modes:
- Lag during a spike. Spend climbs from 70% to 100% of budget in 8 minutes. The breaker reads the view as of the last refresh (still 70%) and doesn't engage at 80%. By the time the view refreshes, the breach has already happened.
- Refresh window blackout. During the materialized view rebuild (60 seconds), the view returns NULL or stale. If the breaker code treats NULL as 0, it doesn't engage when needed.
- Refresh failure. If the refresh job fails (Redshift overloaded), the view is more than 15 minutes stale. Breaker has no idea.
The ML/control-engineering insight: the breaker is a feedback controller. The feedback signal must be timely, robust to refresh blackouts, and have clear semantics for missing data. Three fixes:
- Projected spend, not current spend. Linear extrapolation over the past 5 minutes. Reduces lag.
- NULL handling. NULL is "I don't know," not "0." Breaker treats NULL as "use prior known value + safety buffer."
- Heartbeat metric on the view. If the view's last_refresh_time is more than 30 minutes ago, raise an alert — the breaker is operating on stale data.
These are control-system patterns, not analytics patterns. The cost-tracking signal needs the rigor of a control-system feedback loop.
Round 3 — Squeeze
Follow-up: Redshift RA3 hot/cold tiering — last 30 days hot, 31+ days unloaded to S3. What's the ML risk for time-series analyses crossing the boundary?
Spectrum queries (over S3-unloaded data) have higher latency and different semantics than hot-table queries (over RA3 SSD). For ML analyses crossing the boundary:
- Query latency cliff. A query "give me cost trend over the last 60 days" hits both hot (last 30 days, fast) and cold (days 31-60, slow). p99 latency is dominated by the cold half. Dashboards feel slow.
- Schema drift across the boundary. If we evolve the event schema (add a new field) and the migration only applies to the hot table, queries crossing the boundary now have to handle two schemas. Spectrum queries may NULL the new field for old data; ML model training data is silently inconsistent.
- Cost of cold queries. Spectrum charges per GB scanned. A query for "every cost event in the last 90 days" scans 60 days of Parquet from S3. If a cost-monitoring dashboard refreshes hourly, this becomes a real Spectrum bill.
Mitigations:
- Aggregate-and-store. For time-series ML training, pre-compute daily aggregates as a separate table that lives in RA3 hot indefinitely. The 30-day boundary doesn't apply to aggregates.
- Schema migration policy. Schema changes apply to both hot and cold (via S3 Parquet schema evolution). Test this offline.
- Spectrum query budget. Per-dashboard daily cap on Spectrum scan volume. Alerts when dashboards are over-scanning.
The pattern: tiering optimizes storage cost, but query patterns must be designed for the tiering. A query that wasn't designed for the boundary becomes the cost regression.
Round 4 — Corner
Follow-up: A new analytics requirement comes in: "show real-time conversion funnel by user tier." Your batched events have 5-second granularity. Can you serve this requirement?
Not at sub-5-second granularity. The 5-second batching is a hard floor on event timeliness. Three responses to the requirement:
- Define "real-time" as 30 seconds. Most "real-time funnel" requirements are actually fine at 30-second freshness. The funnel itself is computed from aggregated events; aggregating per-30-seconds is fine.
- Carve out a separate fast path. Critical events (purchase complete, conversion milestone) bypass batching and write directly. Costs more on Kinesis PUT, but the volume is small (conversion events are rare). The fast-path is unbatched; the slow-path is batched.
- Push back. "Real-time" is often an expensive reflex — ask what decision the dashboard supports. If the decision can wait 5 minutes, build it on the batched path.
The ML/analytics lesson: batching trades latency for cost. Requirements that need sub-batch latency are by definition not on the cost-optimized path. The architecture should expose two paths and let requirements declare which they need.
The offline test for adding a fast-path event class: counterfactual "with fast-path vs without" — measure dashboard freshness and PUT-cost delta, decide whether the freshness gain is worth the cost.
Architect-Level Escalation
A1: Materialized views for cost dashboards. The cost circuit breaker reads from them. From an ML/control-systems perspective, design the ideal interface.
The ideal interface treats the materialized view as a control signal, not just a query result:
GET /cost/daily_spend_so_far
Response:
{
"value": 4250.00,
"as_of_timestamp": "2026-04-27T10:45:00Z",
"freshness_seconds": 180,
"confidence": "fresh" | "stale" | "unknown",
"projected_5min_value": 4360.00,
"projection_method": "linear_regression_over_15min"
}
The breaker's decision logic:
- If confidence == "fresh", decide on projected_5min_value.
- If confidence == "stale", decide on value + safety_buffer.
- If confidence == "unknown", fail-safe (assume worst case; degrade to WARNING).
The view doesn't just return a number — it returns a number with its own metadata about its trustworthiness. The breaker doesn't have to guess. This is the cost-engineering equivalent of returning HTTP status codes alongside response bodies.
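A sketch of how the breaker might consume that interface. The 80% and 100% thresholds follow the scenario; the 5% safety buffer and the treatment of "unknown" as a conservative Haiku-only degradation are illustrative assumptions:

```python
def breaker_decision(signal: dict, daily_budget: float) -> str:
    """Map the cost signal (per the interface above) to a degradation level.

    80% of budget -> Haiku-only, 100% -> template-only, per the scenario.
    The 5% safety buffer and the "unknown" handling are illustrative assumptions.
    """
    safety_buffer = 0.05 * daily_budget

    if signal["confidence"] == "fresh":
        spend = signal["projected_5min_value"]   # act on the projection, not the lagging value
    elif signal["confidence"] == "stale":
        spend = signal["value"] + safety_buffer  # stale reading plus a buffer
    else:
        return "HAIKU_ONLY"                      # unknown: fail safe, degrade conservatively

    ratio = spend / daily_budget
    if ratio >= 1.0:
        return "TEMPLATE_ONLY"
    if ratio >= 0.8:
        return "HAIKU_ONLY"
    return "NORMAL"
```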
A2: How do you handle schema evolution in batched events without breaking historical aggregations?
Three principles:
- Additive-only schema changes. New fields can be added; old fields cannot be removed or renamed. If a field becomes obsolete, mark it deprecated in the schema registry but keep it.
- Schema registry with versioning. Every batched event includes its schema version. Consumers (dashboards, models) can handle multiple versions.
- Backfill discipline. When a new field is added that you want historical data to have, run an explicit backfill job that updates Parquet files in S3 with the new field populated (NULL or imputed value). Don't pretend retroactive presence; document the backfill.
The pattern: batched analytics is a write-once-read-many architecture. Historical data is immutable; schema evolution must be additive and versioned.
A3: When does the analytics pipeline itself become the cost bottleneck?
When the meta-cost of running analytics exceeds the savings analytics enables. Signs:
- Analytics monthly cost > 20% of Bedrock monthly cost. (Currently $300 vs $315K — 0.1%. Healthy.)
- Materialized view refresh times exceed acceptable freshness for the breaker.
- Dashboard queries scan more data than the events they aggregate (inverted ratio).
- Engineering time spent on analytics pipeline maintenance exceeds time spent on the chatbot's core ML.
If any of these flips, analytics needs its own cost-optimization scrutiny — perhaps moving from Redshift to a smaller engine (Athena-only, no Redshift) for the cold tier, or from Kinesis to a simpler queue.
The macro principle: observability and cost-tracking are infrastructure, and infrastructure has cost. The cost of measurement must stay materially smaller than the cost being measured.
Intuition Gained — US-07
The core insight: Analytics pipeline cost optimizations interact with downstream ML and control systems. Sampling bias from batching, lag in materialized views, and tier boundaries all break things that depend on the data — not just the data itself.
Mental model to carry forward:
"An analytics pipeline that feeds a control system is a feedback signal. Treat its freshness, completeness, and confidence as first-class properties, not as 'eventual consistency.'"
The hidden failure mode: Low-frequency events get systematically underrepresented in batched paths. They are exactly the events safety-critical systems depend on.
One-line rule: Cost-optimize the high-volume events; preserve the low-volume events. They have different value-per-event by orders of magnitude.
Scenario US-08 (ML/AI Engineer Lens) — Traffic-Based Cost Optimization
Opening Question
Q: Cost circuit breaker degrades to Haiku-only at 80% budget, template-only at 100%. From an ML perspective, what's the eval that matters?
Round 1 Answer: The eval is per-tier quality regression during degradation — specifically, do Prime users see a quality regression when the breaker forces the system into Haiku-only mode? The breaker's job is to bound cost; the quality contract is that the bound shouldn't surprise revenue-critical users. Offline test: simulate a synthetic spend-pressure scenario, measure CSAT proxy and re-ask rate per tier (Prime, Auth, Guest) during each degradation level. Prime should see ≤ 0.3pt CSAT delta; Guest can absorb more.
Round 1 — Surface
Follow-up: How do you measure "CSAT proxy" offline?
CSAT directly is a production signal. Offline proxies:
- LLM-as-judge satisfaction score — judge model rates each response 1-5 on a manga-chatbot-specific rubric (helpfulness, accuracy, tone, completeness).
- Re-ask rate proxy — feed the response back to the user-simulator (another LLM impersonating the user) and measure whether the simulator would have re-asked. Crude but cheap.
- Forbidden-element rate — does the response contain hallucinations, wrong prices, off-topic content? Higher in Haiku-only mode? That's a quality regression.
- Format-compliance rate — Haiku sometimes produces less-structured outputs. UI-breaking format issues are a quality regression.
For each, measure baseline (Sonnet, full pipeline) vs degraded (Haiku, template-only). Per-tier slicing. The Prime delta is what gates the breaker design.
Round 2 — Push Harder
Follow-up: Synthetic budget-pressure scenario. Walk me through how you build it.
Three patterns to simulate, each tests different breaker behavior:
- Linear ramp: spend climbs 0 → 100% over 10 minutes. Tests engagement timing (does breaker engage at 80% within 2 min of crossing?).
- Sudden jump: spend climbs 0 → 90% in 30 seconds (e.g., a bot attack or a misconfigured downstream that 10x'd LLM calls). Tests whether breaker reacts to rate-of-change, not just absolute level.
- Sustained pressure: spend hovers at 75-78% for 4 hours. Tests whether breaker holds stable without oscillating into and out of WARNING.
For each pattern, simulated traffic flows through the system; the breaker reads from a mock cost view that returns the synthetic curve; quality and engagement metrics are recorded.
The third pattern is the most often-missed: oscillation under sustained near-threshold load. If the breaker engages at 80% and recovers at 80%, sustained 78-82% load makes it bounce. Hysteresis (engage at 80%, recover at 70%) is the fix, and the offline test must specifically construct this scenario to validate hysteresis.
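A sketch of the hysteresis itself (engage at 80%, recover at 70%, as above); the class shape and names are illustrative:

```python
class BudgetBreaker:
    """Hysteresis: engage at 80% of budget, recover only below 70%.

    The asymmetric thresholds stop the breaker from oscillating when spend
    hovers in the 78-82% band (the third synthetic pattern above).
    """
    def __init__(self, engage_at: float = 0.80, recover_at: float = 0.70):
        self.engage_at = engage_at
        self.recover_at = recover_at
        self.engaged = False

    def update(self, spend_ratio: float) -> bool:
        if not self.engaged and spend_ratio >= self.engage_at:
            self.engaged = True
        elif self.engaged and spend_ratio < self.recover_at:
            self.engaged = False
        return self.engaged
```

In the offline test: feed the sustained 75-78% curve through update() tick by tick and assert the engaged flag never flips; feed a 78-82% oscillating curve and assert it transitions at most once.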
Round 3 — Squeeze
Follow-up: The graceful degradation ladder has 5 levels (Normal → Pressure → High Load → Overload → Emergency). How do you validate that each level engages on the right trigger and that quality degrades gracefully (not cliff-shaped)?
Per-level tests:
| Level | Trigger condition | Quality contract |
|---|---|---|
| Normal | Capacity < 50% | Reference quality |
| Pressure | 50-65% | -1pt CSAT, no rate-limit on Prime |
| High Load | 65-80% | -2pt CSAT, Guest tier rate-limited first |
| Overload | 80-95% | -3pt CSAT, Auth template-first; Haiku-floor on Auth |
| Emergency | >95% or breaker-engaged | -5pt CSAT, all tiers template-first; LLM only on cart-confirmed flows |
For each level:
- Simulate the trigger condition. Does the level engage within X seconds?
- Hold for 5 minutes. Measure quality per tier.
- Reduce pressure. Does the level recover monotonically?
The cliff-shaped degradation failure mode: a level that engages a sweeping change (e.g., Overload → "all reranking off") creates a 4pt quality drop in one step. Better to interpolate: Pressure → "skip reranker on top-1 score > 0.95" (small impact); High Load → "skip on > 0.9" (medium); Overload → "skip on > 0.85" (large). The aggregate effect is similar; the per-level user experience is smoother.
The eval slices this: for each level, measure quality delta vs the previous level. Each delta should be ≤ 1.5pts. If any single transition is > 2pts, it's a cliff — redesign.
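A sketch of the cliff check, assuming the per-level offline CSAT proxies have already been measured. The 1.5pt step gate is the scenario's target; the function shape is illustrative:

```python
def assert_no_cliffs(csat_by_level: dict[str, float], max_step: float = 1.5) -> None:
    """Check that each degradation transition loses at most max_step CSAT points.

    csat_by_level maps level name -> offline CSAT proxy. Any transition over the
    gate is a cliff and should trigger a redesign of that level's levers.
    """
    order = ["Normal", "Pressure", "High Load", "Overload", "Emergency"]
    for prev, curr in zip(order, order[1:]):
        delta = csat_by_level[prev] - csat_by_level[curr]
        assert delta <= max_step, f"cliff: {prev} -> {curr} drops {delta:.1f}pts (> {max_step})"
```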
Round 4 — Corner
Follow-up: A bot-attack pattern: 10K guest requests in 60 seconds from 1K IPs. Your offline test needs to validate that rate-limiting plus WAF protect the system. What does the test look like?
Multi-layer test:
- WAF layer first: replay the bot pattern against the WAF rules dataset. Measure block rate. Should catch ~70-80% of the pattern (known bot signatures, abnormal request rates per IP).
- Rate limiter second: the 20-30% that pass WAF hit the rate limiter. Guest tier limit is 10 msg/min. Per IP the pattern is only ~10 msg/min (10K requests / 1K IPs), right at the limit, so per-IP limiting catches little on its own; the shared guest-tier counter (the same one flagged as a hot-key risk below) is what rejects ~95% of the surviving volume.
- What survives: roughly 100-150 requests in 60s (the 20-30% that pass WAF, times the ~5% the limiter lets through); budget conservatively for ~500 if the limiter underperforms. These look legitimate. The cost circuit breaker engages only if total daily spend approaches its threshold; a few hundred requests won't trigger it.
The test validates each layer's contribution and the interaction:
- WAF block rate ≥ 70% on the pattern.
- Rate-limiter block rate ≥ 95% on WAF-survivors at guest tier.
- Total cost impact of the attack: ≤ $50 (≤500 LLM calls × ~$0.10 max).
What can go wrong:
- WAF false positive on real users. Audit the bot-pattern dataset against a real-user sample; require a false-positive rate ≤ 0.1%.
- Rate-limiter Redis hot-key. All guest IPs share a global rate-limit counter; under attack, every guest request in the fleet plus the attack burst lands on one Redis key, which becomes a hot key. Redis CPU spikes; legitimate Auth users see latency. Test: shard the counter (e.g., consistent-hash into 16 sub-counters) to avoid the hot-key.
- Breaker over-engages on attack. If the attack fits within the daily budget but spikes hourly spend, the breaker should react to the rate-of-change (a warning is appropriate) but not treat it as an absolute breach, because the daily total is fine. Test: the synthetic attack should not trigger the Emergency level — that would over-degrade real users.
Three layers, three independent failure modes, three test scenarios.
Architect-Level Escalation
A1: Build me a degradation policy that maximizes user-perceived quality given a hard cost ceiling.
The objective:
maximize: Σ (user_satisfaction_per_response × response_count) per user tier
subject to: total_daily_cost <= cost_ceiling
per_tier_minimum_quality >= floor
Solution shape: per-tier-priority degradation, not uniform degradation.
- Prime users get full pipeline as long as cost allows. They are not degraded until last.
- Auth users get degraded second.
- Guest users get degraded first (they're already on lite pipeline; further degradation has small marginal user impact).
Within each tier, intent priority:
- escalation and return_request are never degraded (safety contract).
- recommendation and product_question degrade to Haiku.
- chitchat and promotion degrade to template-only.
The math: if cost ceiling is $X/day and we have N requests per tier per day with marginal cost C_full (Sonnet) and C_lite (Haiku), the linear program tells us how many requests of each tier-intent combination get full vs lite to maximize satisfaction.
In practice this is solved offline once and codified as a routing matrix (tier × intent → model_tier). Production reads the matrix; doesn't re-solve.
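The LP can be handed to any solver; a greedy sketch captures the shape of the offline solve under the assumption that cost and satisfaction are additive per (tier, intent) cell. All field names and the greedy simplification are illustrative, not the production solver:

```python
def build_routing_matrix(cells: list[dict], daily_budget: float) -> dict[tuple[str, str], str]:
    """Greedy approximation of the tier x intent routing LP.

    Each cell: {"tier", "intent", "count", "sat_full", "sat_lite",
    "cost_full", "cost_lite", "never_degrade"}. Cells with never_degrade=True
    (e.g., escalation, return_request per the safety contract) are pinned to the
    full pipeline; the rest start lite and are upgraded in order of satisfaction
    gained per extra dollar until the budget is exhausted.
    """
    matrix, spend = {}, 0.0
    for c in cells:
        if c["never_degrade"]:
            matrix[(c["tier"], c["intent"])] = "full"
            spend += c["count"] * c["cost_full"]

    remaining = [c for c in cells if not c["never_degrade"]]
    for c in remaining:
        matrix[(c["tier"], c["intent"])] = "lite"
        spend += c["count"] * c["cost_lite"]

    def gain_per_dollar(c):
        extra_cost = c["count"] * (c["cost_full"] - c["cost_lite"])
        extra_sat = c["count"] * (c["sat_full"] - c["sat_lite"])
        return extra_sat / extra_cost if extra_cost > 0 else float("inf")

    for c in sorted(remaining, key=gain_per_dollar, reverse=True):
        extra = c["count"] * (c["cost_full"] - c["cost_lite"])
        if spend + extra <= daily_budget:
            matrix[(c["tier"], c["intent"])] = "full"
            spend += extra
    return matrix
```

The output is exactly the routing matrix the paragraph describes: production reads it per (tier, intent) and never re-solves online.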
A2: When does the cost circuit breaker itself become a quality risk?
When it engages too aggressively (false-positive engagement) or too late (false-negative engagement). Both are quality risks:
- False positive: spend was actually fine; breaker engaged based on a glitchy view; users got degraded experience for an hour. Quality regression with no cost benefit.
- False negative: spend exceeded budget; breaker didn't engage; cost overrun is finance-visible incident. Cost regression with no quality benefit.
The metric to track is breaker precision and recall against historical incidents:
- Replay 90 days of cost data; replay the breaker logic; compute false-positive rate (engaged when spend was actually fine retrospectively) and false-negative rate (didn't engage when spend went over).
- Acceptable: ≤ 1 false positive per quarter, 0 false negatives.
If false positives climb (to say 1/month), the breaker is too sensitive — relax thresholds. If false negatives ever happen, the breaker is broken — fix immediately.
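A sketch of the replay computation. It assumes per-day booleans for "breaker engaged" and "spend actually overran", a simplification of the retrospective labeling described above:

```python
def breaker_replay(engaged: list[bool], overran: list[bool]) -> dict:
    """Replay the breaker over historical days and score it as a classifier.

    engaged[i]: the replayed breaker logic engaged on day i.
    overran[i]: spend actually exceeded budget on day i (retrospective ground truth).
    Acceptance gate from the scenario: <= 1 false positive per quarter, 0 false negatives.
    """
    fp = sum(e and not o for e, o in zip(engaged, overran))
    fn = sum(o and not e for e, o in zip(engaged, overran))
    tp = sum(e and o for e, o in zip(engaged, overran))
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return {"false_positives": fp, "false_negatives": fn,
            "precision": precision, "recall": recall}
```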
A3: How do you reason about whether to ship the rate limiter as 60/30/10 (Prime/Auth/Guest) versus 100/50/15 versus other splits?
Two-step reasoning:
- What rejection rate is acceptable per tier? Guest: 8% (these users have low conversion anyway; rate-limiting is OK). Auth: 1% (rejecting auth users hurts retention). Prime: 0.01% (Prime rejection is a revenue incident).
- What request rate per user, per tier, generates the acceptable rejection rate? From production data, the 95th percentile of requests-per-minute is ~6 for Prime, ~3 for Auth, ~2 for Guest. Setting limits at 60/30/10 means most users never approach the limit; only abusive patterns (bots, runaway scripts) hit it.
The 60/30/10 numbers sit roughly 5-10x above the p95: generous enough that legitimate users almost never hit them, while still capping what misuse can cost. The 100/50/15 numbers sit roughly 8-17x above, with even fewer false-positive rejections but more room for misuse to run up cost.
The decision: pick the split based on which protection the rate limiter is actually responsible for. For MangaAssist, where revenue is Prime + conversion, false-positive rejection of Prime is the worst outcome; 60/30/10 is already far above what real Prime users send, so it doesn't create that risk. Go with 60/30/10 and reserve cost protection for the cost circuit breaker rather than the rate limiter.
The pattern: rate limits protect against abuse, not cost. Cost protection belongs to the breaker. Let each control mechanism do one job well.
Intuition Gained — US-08
The core insight: Traffic-based cost optimization is control engineering, not analytics. The breaker, rate limiter, and degradation ladder are feedback controllers operating on cost signals. They need control-engineering rigor: hysteresis, projection, fail-safe modes.
Mental model to carry forward:
"Every cost-control mechanism is a controller with a sensor (cost signal), an actuator (degradation lever), and a setpoint (cost ceiling). Treat the eval like a control-system eval: stability, latency, false-positive rate, false-negative rate."
The hidden failure mode: Oscillation under sustained near-threshold load. Without hysteresis, the controller fights itself, churning routing and cache state in ways that increase cost.
One-line rule: The cost-circuit-breaker decision-time matters as much as the threshold. A correct threshold reached too late is a missed protection.
Cross-Scenario Wrap-Up for ML/AI Engineer Loop
After working through 8 scenarios, the recurring themes:
- Slice everything. Aggregate metrics hide structural failures. Slice by intent, confidence band, user tier, query shape, session duration. The dimension where the optimization changes behavior is the slice.
- Calibration is not a one-time event. Every score-threshold, every rule, every TTL is a calibrated parameter. Re-calibrate when upstream changes — and bake the re-calibration into CI.
- The "expected and same" failure is as common as the "cheap and wrong" failure. Verify the lever fires before celebrating it didn't break quality.
- Pair every cost metric with a quality metric. Gate on both. This is the single most important discipline.
- Severity-weight your rates. Wrong-answer rate of 0.3% is meaningless without knowing whether the wrong answers cluster on safety-critical intents.
Continue to 06-mlops-engineer-grill-chains.md for the same 8 scenarios from the MLOps lens — telemetry, deployment, infra, observability, and operational discipline.