
04. Scenario Deep-Dives — Offline Testing for Each Cost-Optimization User Story

This file applies the four primitives from 03-foundations-and-primitives-for-cost-optimization-testing.md to each of the 8 cost-optimization user stories in ../Cost-Optimization-User-Stories/. Every scenario follows the same structure, so you can scan one scenario or compare across all of them.

1. Scenario recap (2-3 sentences from the source US doc)
2. The cost lever (what we're saving)
3. The quality contract (what we must not break)
4. Offline test design (primitive, dataset, paired metrics, decision gate)
5. Pipeline diagram (mermaid)
6. Two real-incident sketches the offline test would have caught
7. Threshold reconciliation against the source US doc

The grill chains for each scenario, broken out by ML/AI Engineer lens and MLOps Engineer lens, live in 05-ml-ai-engineer-grill-chains.md and 06-mlops-engineer-grill-chains.md.


Scenario US-01 — LLM Token Cost Optimization

Recap

US-01 layers four token-saving techniques in front of Bedrock: (1) a template-first router that bypasses the LLM for ~30% of high-volume intents, (2) a semantic response cache, (3) a Haiku/Sonnet model-tier selector, (4) a prompt compressor that trims input tokens by ~40%. The combined target is a 40–60% reduction in monthly Bedrock spend (~$315K/mo baseline → ~$120–180K/mo).

The Cost Lever

Bedrock input + output tokens. Every technique either eliminates a call (template, cache), shrinks the call (compression), or downgrades the call to a cheaper model (tiering).

The Quality Contract

The four techniques each break differently if quality regresses:

  • Template router: must not respond to a complex query with a stock string. Quality contract: intent_route_correctness ≥ 98% and template_misroute_rate ≤ 2%.
  • Semantic cache: must not serve a stale or wrong-intent answer. Quality contract: cache_wrong_answer_rate ≤ 0.5% measured by audited LLM-as-judge.
  • Tier selector: Haiku must produce comparable quality on the prompts it actually sees. Quality contract: |csat_haiku - csat_sonnet| ≤ 0.15 on a 5-point scale, sliced by intent.
  • Prompt compressor: must not strip context the LLM needed. Quality contract: compression_induced_hallucination_rate ≤ 1%.

Offline Test Design

Primitive | Why it applies | Dataset
A: Counterfactual Replay | Need to measure paired cost↔quality on real production traffic across all four techniques | 5K replay sessions stratified by intent
B: Decision-Equivalence | Template bypass and tier selector are both gates; we need to verify each gate decision against ground truth | 1.5K labeled queries (template-eligible, tier, and cache-eligible labels)
C: Cost-Aware Golden | Prompt compression can silently inflate tokens elsewhere (system prompt edits, RAG chunk packing); need regression coverage | 500 golden queries with expected_cost_band populated

Paired Metrics Table

Cost (must improve) | Threshold | Quality (must not regress) | Threshold | Sourced from US-01
template_bypass_rate | ≥ 30% | template_decision_agreement_at_conf_band_0.6_to_0.8 | ≥ 90% | line 622, augmented by Incident-1 in file 03
semantic_cache_hit_rate | ≥ 15% | cache_wrong_answer_rate | ≤ 0.5% | lines 622, 718
haiku_routing_rate | ≥ 35% of LLM calls | tier_assignment_agreement on complexity-labeled set | ≥ 92% | lines 622, 717
avg_input_tokens_per_call | ≤ 1,200 | compression_induced_hallucination_rate | ≤ 1% | lines 624, 719
dollars_per_session | down ≥ 30% on replay | llm_judge_agreement_rate | ≥ 92% | derived from US-01 macro target

Decision Gate

PROMOTE TO SHADOW IF:
   - all four cost metrics meet floor on full replay set
   - paired quality metric for each technique within threshold
   - per-intent slice: no intent shows quality regression > 2x ceiling
   - confidence-band slicing on template router: agreement ≥ 90% in 0.6-0.8 band
HOLD IF:
   - cost target met but a single intent class regresses > 2x quality ceiling
   - template misroute rate is fine on average but tail intent (return_request, escalation) > 1%
ABORT IF:
   - any cost lever misses floor by > 10pts (the lever is broken, not the threshold)
   - any safety-critical intent (return_request, escalation) below quality ceiling
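The gate text above can be made executable. The following is a minimal sketch of it as a pure function; the metric names come from the paired-metrics table, and the convention that per-intent regression is expressed as a multiple of that intent's quality ceiling is an assumption of this sketch, not something the source doc specifies.

```python
# Hypothetical sketch of the US-01 gate. Metric names follow the paired-metrics
# table; per-intent regression is expressed as a multiple of that intent's
# quality ceiling (1.0 == exactly at ceiling).

COST_FLOORS = {
    "template_bypass_rate": 0.30,
    "semantic_cache_hit_rate": 0.15,
    "haiku_routing_rate": 0.35,
    "dollars_per_session_reduction": 0.30,
}
QUALITY_CEILINGS = {
    "cache_wrong_answer_rate": 0.005,
    "compression_induced_hallucination_rate": 0.01,
}
SAFETY_INTENTS = {"return_request", "escalation"}

def decision_gate(cost, quality, per_intent_regression, band_agreement):
    """Return 'PROMOTE', 'HOLD', or 'ABORT' for the optimized pipeline."""
    # ABORT: a cost lever missing its floor by > 10 points means the lever
    # itself is broken, not the threshold.
    for metric, floor in COST_FLOORS.items():
        if cost[metric] < floor - 0.10:
            return "ABORT"
    # ABORT: any safety-critical intent over its quality ceiling.
    for intent, mult in per_intent_regression.items():
        if intent in SAFETY_INTENTS and mult > 1.0:
            return "ABORT"
    # HOLD: any single intent regressing > 2x its ceiling, or template-router
    # agreement under 90% in the 0.6-0.8 confidence band.
    if any(m > 2.0 for m in per_intent_regression.values()):
        return "HOLD"
    if band_agreement < 0.90:
        return "HOLD"
    # PROMOTE only when every cost floor and quality ceiling is met.
    if all(cost[m] >= f for m, f in COST_FLOORS.items()) and \
       all(quality[m] <= c for m, c in QUALITY_CEILINGS.items()):
        return "PROMOTE"
    return "HOLD"
```

Encoding the gate as a function keeps the promote/hold/abort decision reviewable in CI rather than argued in a rollout meeting.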

Pipeline Diagram

graph TD
    A[5K Production Replay<br/>stratified by intent + confidence] --> B[Test Harness]
    G[1.5K Labeled Decision Set<br/>template/tier/cache labels] --> B
    H[500 Golden + Cost Bands] --> B

    B --> C[Pipeline Baseline<br/>all calls -> Sonnet, no cache, no compress]
    B --> D[Pipeline Optimized<br/>template + cache + tier + compress]

    C --> E1[Cost: tokens, $/session]
    C --> F1[Quality: response, route, judge]

    D --> E2[Cost: tokens, $/session]
    D --> F2[Quality: response, route, judge]
    D --> J[Lever Engagement<br/>which technique fired?]

    E1 --> K[Paired Diff Engine]
    E2 --> K
    F1 --> K
    F2 --> K
    J --> K

    K --> L[Per-intent slicing]
    L --> M{Decision Gate}

    M -->|all clear| N[PROMOTE]
    M -->|cost only| O[ABORT - quality regression]
    M -->|tail-class fail| P[HOLD - tune confidence floor]

    style D fill:#fd2,stroke:#333
    style M fill:#2d8,stroke:#333

Two Real-Incident Sketches the Offline Test Would Have Caught

Incident A (template misroute on confidence band): The template router classified "hey what's something good" as chitchat with confidence 0.71 and served the welcome template. The query was actually a recommendation request from a guest user, who dropped off. Over 6 weeks this pattern produced a measurable conversion drop. The offline harness, run with confidence-band slicing in the decision-equivalence test, would have shown that template-vs-LLM agreement in the 0.6–0.8 band was only 71% — well under the 90% floor — and the rollout would have held until the confidence floor was raised to 0.85.
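Band slicing of the kind that catches this incident is a few lines of harness code. A sketch, assuming the labeled decision set reduces to (confidence, template_decision, ground_truth_decision) tuples:

```python
from collections import defaultdict

# Hypothetical sketch: slice template-vs-LLM decision agreement by the
# router's confidence band, as in the decision-equivalence test above.
BANDS = [(0.0, 0.6), (0.6, 0.8), (0.8, 1.0)]

def agreement_by_band(records):
    """records: iterable of (confidence, template_decision, ground_truth)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for conf, decided, truth in records:
        for lo, hi in BANDS:
            # Put conf == 1.0 in the top band rather than dropping it.
            if lo <= conf < hi or (hi == 1.0 and conf == 1.0):
                totals[(lo, hi)] += 1
                hits[(lo, hi)] += int(decided == truth)
                break
    return {band: hits[band] / totals[band] for band in totals}
```

An aggregate agreement number hides exactly the 0.6–0.8 band where the router fails; per-band output makes the 71% visible.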

Incident B (cache wrong answer after a policy update): The semantic cache served a 22-hour-old FAQ answer about return windows after a policy change. Three escalations resulted. The counterfactual replay would have flagged this only if the dataset had included a "policy update event" stress overlay — running the replay with policy_revision_timestamp injected mid-stream would have shown cache_stale_answer_rate jumping above 0.5% in the post-update window. Without that overlay, the cache looks fine on static replay.

Threshold Reconciliation

All offline thresholds match US-01 source thresholds (lines 619–626 of US-01-llm-token-cost-optimization.md). The added thresholds — confidence-band agreement floor and cache-stale-rate during upstream-change overlays — are augmentations, not contradictions. They tighten the offline gate beyond what the source US doc specifies and should be folded back into the source doc's "Quality Regression Criteria" section in the next revision.


Scenario US-02 — Intent Classifier Cost Optimization

Recap

US-02 reduces SageMaker inference cost via (1) an Aho-Corasick rule-based pre-filter that catches 65% of traffic with confidence ≥ 0.92, (2) session-based intent caching for continuation patterns within 3 turns / 90 seconds (additional 10% bypass), and (3) SageMaker scale-to-zero auto-scaling with serverless fallback. Target: 50–70% inference cost reduction.

The Cost Lever

SageMaker real-time endpoint hours, ML inference call count, and idle-instance cost.

The Quality Contract

  • per_class_F1 ≥ 0.92 on head classes; tail classes (return_request, escalation) must not regress because they are downstream-cost-sensitive (a missed return → user stuck; missed escalation → angry customer with no human in loop).
  • rule_classifier_disagreement_with_ml ≤ 5% per rule before that rule promotes to live.
  • serverless_cold_start_p99 ≤ 3s for off-peak first request — the user-facing latency contract for scale-to-zero.

Offline Test Design

Primitive | Why it applies | Dataset
B: Decision-Equivalence | Rules and the session cache are gates deciding whether to call SageMaker; agreement with the ML model is the truth signal | 10K labeled intent samples per rule, stratified by intent + locale + length
D: Stress/Saturation | Scale-to-zero only matters when traffic returns; need to test the cold-start path | Synthetic burst generator at 4am JST conditions
A: Counterfactual Replay | The session intent cache only fires on multi-turn flows; need full session replay to measure it | 2K full multi-turn session replays

Paired Metrics Table

Cost (must improve) | Threshold | Quality (must not regress) | Threshold | Sourced from US-02
rule_based_classification_rate | ≥ 65% | rule_vs_ml_agreement_per_rule | ≥ 95% on ≥ 1K samples per rule | line 90 of US-02
session_cache_hit_rate | ≥ 10% | session_continuation_label_correctness (a false continuation re-uses an outdated intent) | ≥ 97% | new — augmentation
off_peak_active_instances | 0 | cold_start_first_response_p99 | ≤ 3s | line 90 of US-02
monthly_inference_cost | ≤ $70 | tail_class_F1 (return_request, escalation, escalation_handoff) | ≥ baseline F1 (no regression) | line 90, augmented

Decision Gate

PROMOTE A NEW RULE TO LIVE IF:
   - agreement_with_ml >= 95% on >=1K samples
   - false-positive rate on each tail class = 0
   - rule fires on >= 0.5% of traffic (otherwise not worth the operational cost)
PROMOTE SCALE-TO-ZERO IF:
   - cold-start p99 with serverless fallback <= 3s
   - no 5xx during burst test
   - cost trajectory shows >=$200/mo savings on observed traffic shape
HOLD IF:
   - head-class F1 fine but a single tail class regresses on the rule pre-filter
ABORT IF:
   - any rule shows non-zero false positive on return_request or escalation
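The rule-promotion branch of this gate can be expressed as a single audit function. A sketch, assuming each labeled sample where the rule fired reduces to a (rule_intent, ml_intent, true_intent) tuple; the rejection messages are illustrative:

```python
# Hypothetical sketch of the per-rule promotion audit from the gate above.
TAIL_CLASSES = {"return_request", "escalation"}

def audit_rule(samples, traffic_share):
    """samples: list of (rule_intent, ml_intent, true_intent) where the rule
    fired. traffic_share: fraction of total traffic the rule catches.
    Returns 'PROMOTE' or the reason the rule is rejected."""
    if len(samples) < 1000:
        return "REJECT: need >= 1K samples where the rule fires"
    if traffic_share < 0.005:
        return "REJECT: rule fires on < 0.5% of traffic"
    agree = sum(r == m for r, m, _ in samples) / len(samples)
    # Tail-class false positive: the true label was a tail class the rule
    # would have hidden from the ML model. Zero tolerance per the gate.
    tail_fp = sum(1 for r, _, t in samples if t in TAIL_CLASSES and r != t)
    if tail_fp > 0:
        return "ABORT: non-zero false positive on a tail class"
    if agree < 0.95:
        return "REJECT: agreement_with_ml below 95%"
    return "PROMOTE"
```

Note the ordering: the tail-class check runs before the aggregate agreement check, so a rule with 99.9% agreement still aborts on a single return_request false positive.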

Pipeline Diagram

graph TD
    A[10K Labeled Intent Samples<br/>per rule, stratified] --> B[Rule Decision-Equivalence Test]
    C[2K Multi-Turn Sessions] --> D[Counterfactual Replay]
    E[Burst Generator<br/>0 -> 50 RPS in 30s] --> F[Cold-Start Stress Test]

    B --> G[Per-Rule Agreement Matrix]
    D --> H[Cost: $/inference, idle hrs<br/>Quality: per-class F1]
    F --> I[Cold-start p99, error rate]

    G --> J{Decision Gate}
    H --> J
    I --> J

    J -->|all clear| K[PROMOTE rule / scale-to-zero]
    J -->|tail-class FP| L[ABORT - rule too greedy]
    J -->|cold-start slow| M[HOLD - tune serverless concurrency]

    style B fill:#fd2,stroke:#333
    style J fill:#2d8,stroke:#333

Two Real-Incident Sketches the Offline Test Would Have Caught

Incident A (rule fires on tail intent): A rule pattern "order" was added to catch order_tracking. It fired on the message "I'd like to order a return for this damaged volume" — which is a return_request. SageMaker would have classified this correctly. The rule routed it to order_tracking, the order-tracking handler couldn't find an order ID, and the user got "I couldn't find that order" — abandoning the return flow. The offline decision-equivalence test, sliced per-tail-class, would have shown rule-vs-ML disagreement of 14% on return_request queries containing the substring "order", flagging the rule as too greedy.

Incident B (cold-start latency cascade on Monday morning): Scale-to-zero was set to 0 instances off-peak. Monday 8:30am JST, 200 simultaneous users hit the endpoint. Serverless fallback took 28 seconds for the first response, far over the 3s SLO. Users dropped or retried, causing more cold starts. The offline burst stress test, configured to simulate "0 → 50 RPS in 30s with no warm capacity," would have shown the serverless fallback p99 was 28s, not the 3s the design assumed. The fix (provisioned concurrency floor of 1 instance during the 8–9am JST window) would have been cheap and obvious.
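The burst test that catches Incident B can be approximated with a toy model. This sketch assumes a crude two-regime latency model (the first warm_after requests pay the cold-start penalty, the rest are served warm) rather than a real load generator; the 28s figure comes from the incident, everything else is illustrative:

```python
import math

# Toy sketch of the cold-start burst test. A real harness would replay the
# "0 -> 50 RPS in 30s" shape against a live endpoint; this only models the
# p99 arithmetic that makes the SLO breach obvious.
def simulate_burst_p99(cold_start_s, warm_latency_s, warm_after, n_requests=1500):
    """First `warm_after` requests pay the cold start (no warm capacity yet);
    the rest are warm. Returns p99 first-response latency across the burst."""
    latencies = [cold_start_s if i < warm_after else warm_latency_s
                 for i in range(n_requests)]
    latencies.sort()
    return latencies[math.ceil(0.99 * n_requests) - 1]
```

With 200 of 1,500 requests hitting the cold path, p99 is the cold-start latency itself — which is why the 28s fallback latency blows the 3s SLO, and why a provisioned-concurrency floor (shrinking the cold cohort below the p99 cutoff) fixes it.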

Threshold Reconciliation

US-02 source doc sets cold_start ~30–60s as the serverless-fallback expectation. This file tightens the offline gate to ≤ 3s p99 first-response — the gap reflects the production SLO required for user experience, not the unaided serverless behavior. The reconciliation: provisioned concurrency floor of 1 instance during predictable peak windows is required to bridge the gap. This should be added as an explicit acceptance criterion to US-02.


Scenario US-03 — Caching Strategy for Cost Reduction

Recap

US-03 maximizes ElastiCache Redis hit rate via (1) multi-layer cache (L1 in-process LRU 30s TTL for top-50 ASINs; L2 Redis with keyspace-specific TTLs), (2) cache warming Lambda at 8:30am JST pre-populating top 500 ASINs, (3) event-driven invalidation on catalog changes (delete within 5s). Target: combined L1+L2 hit rate ≥ 70% and 30–50% reduction in downstream API calls.

The Cost Lever

Downstream API calls (Product Catalog, Recommendations, Promotions, Reviews), and the per-request cost of those calls.

The Quality Contract

  • stale_read_rate ≤ 0.5% — a stale read serving outdated price or stock is a customer-facing incident.
  • cache_wrong_answer_rate ≤ 0.5% — wrong cache hit (key collision, semantic mismatch) serves wrong product data.
  • invalidation_latency_p99 ≤ 5s from event publish to cache delete.

Offline Test Design

Primitive | Why it applies | Dataset
A: Counterfactual Replay | Cache hit rate and wrong-answer rate must be measured on the real query distribution | 5K session replay across all keyspaces
D: Stress/Saturation | Invalidation under load, cache-stampede scenarios, and warming-Lambda-failure scenarios all need engineered conditions | Synthetic catalog-change burst + warming-failure injection + stampede load
C: Cost-Aware Golden | Detect cost regression if cache hit rate silently drops in a future release | Golden 500 with an expected_cache_hit flag per query

Paired Metrics Table

Cost (must improve) | Threshold | Quality (must not regress) | Threshold | Sourced from US-03
L1_hit_rate | ≥ 30% | L1_stale_read_rate (30s TTL window) | ≤ 0.1% | line 122
L2_redis_hit_rate | ≥ 50% | L2_wrong_answer_rate (key-collision audit) | ≤ 0.5% | line 122
combined_hit_rate | ≥ 70% | invalidation_latency_p99 (catalog change → cache delete) | ≤ 5s | line 122
redis_eviction_rate | < 1% / hour | redis_memory_utilization | between 40% and 70% | line 122
warming_success_rate | 100% daily | warming_data_freshness_at_8:35am_JST | ≤ 5 min stale | line 122

Decision Gate

PROMOTE A KEYSPACE TO LIVE IF:
   - hit rate target met
   - wrong-answer audit on 1K samples shows <= 0.5%
   - invalidation stress test shows p99 <= 5s under 100 catalog changes/min
   - warming Lambda failure injection: cache rebuilds within 1 hour with degraded hit rate >= 40%
HOLD IF:
   - hit rate met but wrong-answer rate is in 0.5-1% band (raise similarity threshold or shorten TTL)
ABORT IF:
   - invalidation latency exceeds 30s on any keyspace (data staleness guarantees broken)
   - cache stampede causes downstream service CPU > 90% during a single popular-key invalidation

Pipeline Diagram

graph TD
    A[5K Session Replay] --> B[Counterfactual: L1+L2 vs No-Cache]
    C[Catalog-Change Burst<br/>100 events/min] --> D[Invalidation Stress Test]
    E[Warming Lambda Killer] --> F[Recovery Test]
    G[Stampede Load<br/>1K reqs/sec on single popular key] --> H[Stampede Resilience]

    B --> I[Cost: hits, downstream calls saved<br/>Quality: stale rate, wrong-answer rate]
    D --> J[Inv. p99, downstream CPU during invalidate]
    F --> K[Time to recover hit rate]
    H --> L[Origin service CPU spike, dedup behavior]

    I --> M{Decision Gate}
    J --> M
    K --> M
    L --> M

    M -->|all clear| N[PROMOTE keyspace]
    M -->|stale detected| O[ABORT - invalidation broken]
    M -->|stampede crashes| P[HOLD - add request coalescing]

    style B fill:#fd2,stroke:#333
    style M fill:#2d8,stroke:#333

Two Real-Incident Sketches the Offline Test Would Have Caught

Incident A (warming Lambda silently misconfigured): The warming Lambda's environment variable REDIS_CLUSTER_ENDPOINT pointed at the test cluster, so the production cache was never pre-populated. Hit rate stayed at 7% instead of the 30% target. The 8:30am JST warming Lambda showed success: true in logs (it wrote to a Redis cluster, just not the production one), so dashboards looked fine. The offline test "kill the warming Lambda, observe hit rate over the next hour" would have measured the pre-warming hit rate on production traffic and noticed the baseline was 7%, not 30% — flagging that warming wasn't actually working.

Incident B (cache stampede during popular-product invalidation): Catalog team published a price update for a top-10 ASIN. Cache invalidation deleted the key. The next 1,800 requests for that ASIN all missed cache and hit the Product Catalog service simultaneously. Catalog service CPU spiked to 100%, then 5xx errors for 90 seconds. The offline stampede test (synthetic 1K req/s on a single key during invalidation) would have shown that the cache layer needed a request-coalescing wrapper (e.g., singleflight) to protect downstream services from thundering-herd on invalidation.
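The request-coalescing fix named above is the classic singleflight pattern: on a cache miss, only one caller per key fetches from the origin while concurrent callers for the same key wait and share the result. A minimal thread-based sketch (the origin fetch callable is supplied by the caller; a production version would also need error propagation):

```python
import threading

class SingleFlight:
    """Coalesce concurrent cache-miss fetches for the same key: exactly one
    in-flight origin fetch per key; followers wait and share its result."""
    def __init__(self):
        self._lock = threading.Lock()
        self._calls = {}  # key -> (done_event, result_box)

    def do(self, key, fn):
        with self._lock:
            entry = self._calls.get(key)
            if entry is None:
                entry = (threading.Event(), {})
                self._calls[key] = entry
                leader = True
            else:
                leader = False
        event, box = entry
        if leader:
            try:
                box["value"] = fn()  # single origin fetch for this key
            finally:
                with self._lock:
                    del self._calls[key]
                event.set()
        else:
            event.wait()  # follower: share the leader's result
        return box["value"]
```

In the Incident B scenario, the 1,800 simultaneous misses on one ASIN collapse to a single Product Catalog call; the origin sees one request per invalidation, not a thundering herd.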

Threshold Reconciliation

US-03 source specifies the hit-rate floor and invalidation timing. This file adds the stampede resilience criterion (singleflight or equivalent must protect downstream during high-traffic-key invalidation) and the warming-failure recovery criterion (degraded hit rate ≥ 40% within 1 hour of warming-Lambda failure). Both should be added to US-03's "Quality Regression Criteria" section.


Scenario US-04 — Compute Cost Optimization (ECS Fargate + Lambda)

Recap

US-04 right-sizes ECS Fargate tasks (2 vCPU/4GB → 1 vCPU/2GB for I/O-bound Orchestrator), uses Fargate Spot at 70% off-peak with on-demand for WebSocket handlers, runs WebSocket handlers on Graviton ARM64 (20% cheaper), and applies scheduled auto-scaling. Target: 35–55% compute cost reduction.

The Cost Lever

Per-task-hour Fargate price (Spot is ~70% cheaper than on-demand; ARM64 is 20% cheaper than x86), plus elimination of over-provisioned vCPU/memory.

The Quality Contract

  • p99_response_latency no regression > 15% after right-sizing.
  • error_rate_during_spot_interrupt ≤ 1% — Spot interruptions must not surface as 5xx to users.
  • websocket_disconnect_rate_during_drain ≤ 0.5% of in-flight connections.
  • ARM64_compatibility: zero non-deterministic behavior differences vs x86 on the same workload.

Offline Test Design

Primitive | Why it applies | Dataset
D: Stress/Saturation | Right-sizing only matters under realistic CPU/memory pressure; Spot interruption only tests if interruption is simulated; ARM64 needs a full workload pass | Synthetic load + chaos primitives (SIGTERM injection, CPU saturation)
A: Counterfactual Replay | Compare the 1 vCPU/2 GB vs 2 vCPU/4 GB latency distribution on the same traffic | 1 week of replayed orchestrator traffic
C: Cost-Aware Golden | Catch latency regression on golden queries when right-sizing lands | Golden 500 with expected_p99_latency per query class

Paired Metrics Table

Cost (must improve) | Threshold | Quality (must not regress) | Threshold | Sourced from US-04
ecs_avg_cpu | 50–65% (right-sized) | p99_response_latency | no regression > 15% | US-04 metrics section
ecs_avg_memory | 50–70% | task_oom_kill_rate | 0 events / week under replay | US-04 metrics section
spot_off_peak_share | ≥ 70% | error_rate_during_spot_interrupt | ≤ 1% | US-04 metrics section
spot_interruption_rate | ≤ 5% (production-measured, not offline) | websocket_disconnect_rate_during_drain | ≤ 0.5% of in-flight | augmentation from Incident-3 in file 03
lambda_cold_start_rate_peak | ≤ 5% | lambda_cold_start_p99_latency | ≤ 1.5s | US-04

Decision Gate

PROMOTE RIGHT-SIZE IF:
   - 1 week shadow CPU stays in 50-65% band on production-shaped load
   - p99 latency regression <= 15%
   - no OOM kills
   - replay test on 1K sessions shows zero quality regression
PROMOTE SPOT IF:
   - chaos test: 8 simultaneous SIGTERM signals during peak hour, all in-flight WebSocket sessions complete
   - drain logic respects WebSocket frame boundaries (no mid-frame disconnect)
   - error rate during simulated interrupt <= 1%
PROMOTE ARM64 IF:
   - full integration test pass on ARM64 build
   - workload-shape replay shows zero behavior delta on 5K samples
   - no native-dependency surprises (logged via dependency scan)

Pipeline Diagram

graph TD
    A[1 Week Replay Traffic] --> B[Right-Size Counterfactual<br/>1vCPU/2GB vs 2vCPU/4GB]
    C[SIGTERM Injector<br/>2-min drain on active WS] --> D[Spot Resilience Stress Test]
    E[Synthetic Burst<br/>0->100 RPS in 60s] --> F[Auto-scaling Lag Test]
    G[ARM64 Build] --> H[Workload-Shape Diff Test<br/>5K samples]

    B --> I[Cost: $/task-hr<br/>Quality: p99 latency, OOM, error rate]
    D --> J[Time-to-drain, in-flight loss rate, error rate during interrupt]
    F --> K[Auto-scale lag, p99 during burst]
    H --> L[Behavior-delta vs x86 baseline]

    I --> M{Decision Gate}
    J --> M
    K --> M
    L --> M

    M -->|all clear| N[PROMOTE]
    M -->|in-flight loss > 0.5%| O[HOLD - fix WS drain logic]
    M -->|ARM behavior delta| P[ABORT - native dep mismatch]

    style D fill:#f66,stroke:#333
    style M fill:#2d8,stroke:#333

Two Real-Incident Sketches the Offline Test Would Have Caught

Incident A (Spot interruption during WS stream — file 03 Incident 3): As described in file 03, 8 simultaneous Spot reclaims dropped 312 WebSocket streams mid-response. The drain handler was correct for REST traffic but had no awareness of in-flight stream frames. The offline chaos test, with the SIGTERM injector specifically targeting tasks with active WebSocket connections, would have measured websocket_disconnect_rate_during_drain directly and caught the missing frame-boundary logic before the 4am incident.
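The frame-boundary logic the drain handler was missing amounts to: on SIGTERM, send a close frame to every in-flight connection, then wait for each close-frame ACK before exiting, up to the drain budget. A toy sketch, assuming connection objects expose send_close_frame() and close_acked() (both names are illustrative):

```python
import time

# Hypothetical sketch of a WebSocket-aware drain loop. The 120s default
# reflects the 2-min Spot reclaim notice; the connection API is assumed.
def drain(connections, budget_s=120.0, poll_s=0.5,
          now=time.monotonic, sleep=time.sleep):
    """Wait until every in-flight WebSocket has ACKed its close frame, or the
    drain budget expires. Returns the connections that were cut mid-stream."""
    deadline = now() + budget_s
    for conn in connections:
        conn.send_close_frame()  # ask each client to finish cleanly
    while now() < deadline:
        pending = [c for c in connections if not c.close_acked()]
        if not pending:
            return []  # clean drain: zero mid-frame disconnects
        sleep(poll_s)
    return [c for c in connections if not c.close_acked()]
```

The offline chaos test then asserts that the returned list is empty under simulated SIGTERM, i.e. websocket_disconnect_rate_during_drain == 0 within the reclaim window.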

Incident B (right-size hit OOM under recommendation burst): Right-sizing landed at 1 vCPU / 2 GB. Average peak memory was 1.6 GB, but the recommendation handler — when responding to a complex user with deep history — briefly held 8K product objects in memory during personalization scoring, hitting 2.1 GB and getting OOM-killed. The affected user got "service unavailable." An offline counterfactual replay using worst-case sessions (deep-history users with personalization-heavy queries), not just average sessions, would have measured peak memory at 2.1 GB and either flagged the right-size as too aggressive or required a memory-optimization PR before the right-size could ship.

Threshold Reconciliation

US-04 source specifies CPU/memory bands and Spot interruption rate. This file adds the WebSocket-frame-boundary drain criterion (specifically: drain handler must wait for WebSocket close-frame ACK, not just stop accepting new connections) and the worst-case memory criterion (replay must include the top 5% deep-history sessions, not just stratified-by-intent samples). Both belong in US-04's "Quality Regression Criteria."


Scenario US-05 — DynamoDB Cost Optimization

Recap

US-05 reduces DynamoDB costs via (1) on-demand capacity mode (vs provisioned; spiky chatbot traffic wins under ~18% sustained utilization), (2) aggressive TTL (TURN: 24h, SUMMARY: 72h, META: 24h), (3) sparse GSI projection (only META items get the customer_id attribute, naturally excluding TURN/SUMMARY from the GSI), (4) coalesced TransactWrites for atomic TURN+META updates. Target: 40–60% DDB cost reduction (~$570/mo → ~$230–340/mo).

The Cost Lever

DynamoDB write/read request units, table + GSI storage, and capacity-mode over-provisioning waste.

The Quality Contract

  • context_loss_rate ≤ 1% — multi-turn conversations must retain context within their TTL window.
  • multi_turn_coherence_score no regression on multi-turn replay.
  • transactwrite_failure_rate ≤ 0.1% — partial-write failure modes must not surface as inconsistent state.

Offline Test Design

Primitive | Why it applies | Dataset
A: Counterfactual Replay | TTL aggressiveness changes which sessions retain context; need full multi-turn replay | 2K full multi-turn sessions, including some spanning > 24h gaps
D: Stress/Saturation | Sparse-GSI behavior under write spikes; on-demand auto-scaling on a 10x burst; TransactWrite under partial-failure injection | Burst generator + chaos primitives
C: Cost-Aware Golden | Baseline regression: a future change might add an attribute that silently bloats the GSI | Golden 500 with expected_ddb_cost_per_turn

Paired Metrics Table

Cost (must improve) | Threshold | Quality (must not regress) | Threshold | Sourced from US-05
dollars_per_1k_turns | down ≥ 40% | context_loss_rate_within_TTL_window | ≤ 1% | new — augmentation
read_latency_p99 | < 10ms | multi_turn_coherence_score (LLM judge on follow-up resolution) | no regression > 2pts | US-05 metrics + augmentation
write_latency_p99 | < 20ms | transactwrite_atomicity (no partial state in fault test) | 100% | US-05 metrics
ttl_deletion_rate | ≥ 5M items / day (TTL is working) | lost_context_user_reports (proxy: re-ask rate) | ≤ baseline + 5% | US-05
gsi_size | only META items projected | gsi_query_consistency | 100% match with a table scan over META | new

Decision Gate

PROMOTE TTL TIGHTENING IF:
   - on-demand 2K-session replay shows context_loss_rate <= 1%
   - re-ask rate proxy on multi-turn session tail (sessions with > 22h gap) doesn't exceed +5% baseline
   - cost trajectory shows >= claimed savings
PROMOTE SPARSE GSI IF:
   - parallel-GSI test (new sparse vs old projected) shows 100% query equivalence on 1 week of access patterns
PROMOTE TRANSACTWRITE IF:
   - chaos: kill the write halfway through TransactWrite; assert no partial state in either TURN or META
HOLD IF:
   - context_loss_rate is in 1-2% band (TTL is too aggressive for some user shape)
ABORT IF:
   - any test shows partial-write state (transaction atomicity broken)

Pipeline Diagram

graph TD
    A[2K Multi-Turn Sessions<br/>incl. > 24h gap sessions] --> B[TTL Counterfactual<br/>48h vs 36h vs 24h]
    C[Sparse GSI Test<br/>parallel old + new GSI] --> D[Query Equivalence Test]
    E[10x Write Burst Generator] --> F[On-Demand Autoscale Test]
    G[Chaos: Kill mid-TransactWrite] --> H[Atomicity Test]

    B --> I[Cost: $/1K turns<br/>Quality: context_loss_rate, re-ask rate]
    D --> J[100% query equiv. assertion]
    F --> K[Throttle rate, latency p99 during burst]
    H --> L[Partial-state detection]

    I --> M{Decision Gate}
    J --> M
    K --> M
    L --> M

    M -->|all clear| N[PROMOTE]
    M -->|context_loss 1-2%| O[HOLD - relax TTL]
    M -->|partial state| P[ABORT - revert TransactWrite]

    style B fill:#fd2,stroke:#333
    style M fill:#2d8,stroke:#333

Two Real-Incident Sketches the Offline Test Would Have Caught

Incident A (TTL drops mid-return-flow context): A user opened a return conversation at 2pm Monday. They walked away, came back at 4pm Tuesday — 26 hours later, past the 24h TURN TTL. The TURN items were gone. The bot couldn't reconstruct the return reason and asked the user to start over. The user escalated. The offline counterfactual replay, with the 2K-session dataset including sessions that span > 22h gaps, would have measured context_loss_rate at 4.1% on those tail sessions — showing 24h TTL was too aggressive for return-flow conversations. The fix: per-intent TTL (return_request keeps TURN for 168h; everything else 24h).
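The per-intent TTL fix can be validated with a small counterfactual over session gaps. A sketch, assuming each replayed session reduces to (intent, max_inter_turn_gap_hours); the candidate TTL grid and the 1% ceiling follow the gate above:

```python
# Hypothetical sketch of the TTL counterfactual from Incident A.
def context_loss_rate(sessions, ttl_hours):
    """sessions: list of (intent, max_gap_hours) for multi-turn sessions.
    A session loses context if any inter-turn gap exceeds the TURN TTL."""
    lost = sum(1 for _, gap in sessions if gap > ttl_hours)
    return lost / len(sessions)

def pick_ttl_per_intent(sessions, candidates=(24, 72, 168), ceiling=0.01):
    """Per-intent TTL: the smallest candidate keeping context loss <= ceiling
    for that intent; fall back to the largest candidate otherwise."""
    plan = {}
    for intent in {i for i, _ in sessions}:
        subset = [s for s in sessions if s[0] == intent]
        plan[intent] = next((t for t in candidates
                             if context_loss_rate(subset, t) <= ceiling),
                            max(candidates))
    return plan
```

Run against the 2K-session replay, this yields exactly the fix described: return_request sessions with multi-day gaps force a longer TTL, while short-gap intents keep the aggressive 24h.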

Incident B (sparse GSI silently included TURN items after attribute leak): A future PR added customer_id to TURN items "for analytics" without realizing this caused TURN items to project to the sparse GSI, doubling GSI write capacity overnight. The cost-aware golden dataset with expected_ddb_cost_per_turn would have caught the cost regression in CI on the PR — dollars_per_1k_turns would have jumped 80% on the golden replay, blocking the PR before it shipped.
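The CI gate in Incident B is just a per-query cost-band assertion over the golden set. A sketch; expected_ddb_cost_per_turn comes from the golden dataset above, while the 10% tolerance is an assumed band width:

```python
# Hypothetical sketch of the cost-aware golden gate from Incident B.
def golden_cost_check(golden, observed, tolerance=0.10):
    """golden: {query_id: expected_ddb_cost_per_turn};
    observed: {query_id: measured cost on this PR's golden replay}.
    Returns the queries that drifted above their band; empty dict == pass."""
    return {qid: (exp, observed[qid])
            for qid, exp in golden.items()
            if observed[qid] > exp * (1 + tolerance)}
```

An 80% jump in dollars_per_1k_turns, as in the attribute-leak PR, fails every golden query at once, so the block is loud and the offending diff is obvious.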

Threshold Reconciliation

US-05 source defines the metrics but doesn't specify context_loss_rate as a quality contract. This file adds that explicit contract (≤ 1% globally, ≤ 1% on tail sessions with > 22h gaps) and the per-intent TTL recommendation. Both should be folded back into US-05.


Scenario US-06 — RAG Pipeline Cost Optimization

Recap

US-06 reduces RAG cost via (1) RAG-bypass gate skipping retrieval for non-retrieval intents (order_tracking, escalation, chitchat, promotion — ~40% of traffic), (2) embedding cache (1h TTL, 20% hit rate target), (3) OpenSearch Serverless OCU batching (index 2am–4am JST off-peak), (4) conditional reranker skip when top kNN score > 0.9 (~30% of queries). Target: 30–50% RAG cost reduction.

The Cost Lever

OpenSearch Serverless OCU consumption (4 OCU floor = $691/mo), Titan Embeddings call count, and reranker (cross-encoder) invocations.

The Quality Contract

  • recall_at_3_per_intent no regression > 2pts on any intent.
  • mrr_per_intent no regression on any intent that uses RAG.
  • bypass_decision_correctness ≥ 95% per intent (RAG-bypass-eligible intents must not include sub-cases that genuinely need retrieval).
  • reranker_skip_top1_correctness ≥ 95% (when we skip reranker, the kNN top-1 must be the correct top-1).

Offline Test Design

Primitive | Why it applies | Dataset
B: Decision-Equivalence | Bypass and conditional rerank are gates; need to verify they don't drop genuinely needed retrieval | 2K queries labeled with requires_retrieval and top1_is_correct ground truth
A: Counterfactual Replay | Embedding-cache hit rate and answer parity need the real query distribution | 5K replay queries
D: Stress/Saturation | OCU batching only matters when ingest pressure is high; need to verify the off-peak window is respected | Synthetic ingest burst at 10am JST (peak hours) — verify it is deferred

Paired Metrics Table

Cost (must improve) | Threshold | Quality (must not regress) | Threshold | Sourced from US-06
rag_bypass_rate | ≥ 40% | bypass_decision_correctness_per_intent | ≥ 95% | US-06 + augmentation
embedding_cache_hit_rate | ≥ 20% | embedding_cache_wrong_answer_rate | ≤ 0.1% | US-06
reranker_skip_rate | ≥ 30% | reranker_skip_top1_correctness | ≥ 95% | US-06 + augmentation
opensearch_search_latency_p99 | < 200ms | recall_at_3_overall | no regression > 2pts | US-06 metrics
monthly_rag_cost | ≤ $400 | mrr_per_intent | no regression on any intent | US-06

Decision Gate

PROMOTE BYPASS GATE FOR AN INTENT IF:
   - parallel run for 1 week shows answer equivalence (LLM judge) >= 90% on that intent
   - no flagged sub-case where retrieval was genuinely needed (audited by DS)
PROMOTE RERANKER SKIP IF:
   - on labeled top1 set, top1_correctness when skipping >= 95% at score threshold 0.9
   - calibration check: distribution of kNN scores hasn't shifted vs baseline embedding model
PROMOTE EMBEDDING CACHE IF:
   - hit rate >= 20% on observed query distribution
   - shadow mode: every cache hit triggers parallel cache-miss fetch; assert byte-equality
HOLD IF:
   - bypass correctness in 90-95% band on a single intent (tune intent-specific bypass logic)
ABORT IF:
   - reranker skip causes top1 swap on > 5% of safety-critical intents (return_request RAG, faq RAG)

Pipeline Diagram

graph TD
    A[2K Labeled Queries<br/>requires_retrieval + top1] --> B[Bypass Decision-Equivalence]
    C[5K Replay Queries] --> D[Embedding Cache Counterfactual]
    E[Synthetic Ingest Burst at 10am JST] --> F[OCU Batching Window Test]
    G[Reranker-Skip Score Calibration Set] --> H[Top1-Correctness Test]

    B --> I[Bypass correctness per intent]
    D --> J[Cache hit rate, byte-equality on hits]
    F --> K[Was ingest deferred to 2am window?]
    H --> L[top1 swap rate at threshold 0.9]

    I --> M{Decision Gate}
    J --> M
    K --> M
    L --> M

    M -->|all clear| N[PROMOTE]
    M -->|bypass < 95% on tail intent| O[HOLD - tighten gate]
    M -->|top1 swap| P[ABORT - raise score threshold]

    style B fill:#fd2,stroke:#333
    style M fill:#2d8,stroke:#333

Two Real-Incident Sketches the Offline Test Would Have Caught

Incident A (bypass on promotion-with-product-detail): RAG-bypass marked promotion as bypass-eligible. A user asked "any deals on Berserk Deluxe?" — both promotion data AND product retrieval were needed. Bypass returned "We have these promotions: ..." with no Berserk-specific information. User asked again. Cost saved on the OCU read; user experience worse. The decision-equivalence test, sliced per intent + per query-shape (queries containing a product name should not bypass even if intent is promotion), would have measured bypass correctness at 73% on promotion queries containing a product name — flagging the gate as too coarse.

Incident B (embedding model upgrade silently broke reranker skip): Embedding model was upgraded from titan-v1 to titan-v2 for higher recall. The kNN score distribution shifted: scores above 0.9 (previously meaning "obviously top result") now meant "kind of similar." Reranker skip rate jumped from 30% to 65%, and top-1 correctness dropped to 78%. The offline calibration check — "after embedding model change, recompute the score-threshold-vs-top1-correctness curve and re-tune the threshold" — would have caught this immediately and required a threshold re-tune (probably to 0.94) before the new model could ship.
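The calibration check reduces to sweeping a threshold grid over the labeled top-1 set after every embedding model change. A sketch, assuming the set reduces to (knn_top1_score, top1_is_correct) pairs; the grid bounds are illustrative:

```python
# Hypothetical sketch of the reranker-skip threshold recalibration.
def retune_skip_threshold(labeled, floor=0.95, grid=None):
    """Find the lowest kNN-score threshold at which skipping the reranker
    still yields top-1 correctness >= floor on the queries that would skip.
    labeled: list of (knn_top1_score, top1_is_correct) pairs."""
    if grid is None:
        grid = [round(0.80 + 0.01 * i, 2) for i in range(20)]  # 0.80 .. 0.99
    for thr in grid:
        skipped = [ok for score, ok in labeled if score > thr]
        if skipped and sum(skipped) / len(skipped) >= floor:
            return thr
    return None  # no safe threshold: never skip until recalibrated
```

Returning None rather than a best-effort threshold is deliberate: if no point on the curve clears the 95% floor, the skip path should stay disabled until the new embedding model is recalibrated.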

Threshold Reconciliation

US-06 source specifies bypass rate, cache hit rate, reranker skip rate. This file adds the per-intent decision-correctness floors and the embedding-model-change calibration requirement. The latter is critical — every embedding model upgrade must trigger a reranker-threshold re-tune; this should be a hard CI gate, not a runbook step.


Scenario US-07 — Analytics Pipeline Cost Optimization

Recap

US-07 optimizes Kinesis + Redshift cost via (1) event batching with gzip compression (50 events / 5s → 1 PUT, ~40x reduction), (2) Kinesis on-demand mode, (3) Firehose to Parquet on S3 (70% compression), (4) Redshift RA3 hot/cold tiering (last 30 days hot; 31+ days unloaded to S3 + Spectrum), (5) materialized views for daily/hourly summaries. Target: 40–60% analytics cost reduction.

The Cost Lever

Kinesis PUT units, Redshift compute + managed storage, S3 storage, and per-query Spectrum scan cost.

The Quality Contract

  • event_loss_rate ≤ 0.1% — analytics drives the cost circuit breaker (US-08); event loss creates blind-spots in cost telemetry.
  • dashboard_parity_with_unbatched ≤ 0.5% drift on key metrics.
  • materialized_view_freshness ≤ 15min lag on the cost-tracking views (US-08 reads these).
  • spectrum_query_correctness 100% match with hot-table query on the same date range.

Offline Test Design

| Primitive | Why it applies | Dataset |
| --- | --- | --- |
| A: Counterfactual Replay | Batched vs unbatched event paths must produce dashboard parity; needs the real event stream | 1 week of replayed event traffic in a shadow stream |
| D: Stress/Saturation | Batch fills too fast on Prime Day; Firehose backpressure under 10x volume; RA3 query slowdown when hot data exceeds local SSD | Synthetic event burst + RA3 capacity stress |
| C: Cost-Aware Golden | Catch dashboard regression: a future PR could silently break a materialized view definition | Golden dashboard query set with expected_result_hash |

Paired Metrics Table

| Cost (must improve) | Threshold | Quality (must not regress) | Threshold | Sourced from |
| --- | --- | --- | --- | --- |
| event_compression_ratio | ≥ 60% | event_loss_rate | ≤ 0.1% | US-07 |
| kinesis_lag | < 5s | dashboard_parity_with_unbatched | ≤ 0.5% drift | US-07 + augmentation |
| redshift_hot_table_size | < 10GB | materialized_view_refresh_time | ≤ 2min | US-07 |
| s3_archive_cost | < $5/month | spectrum_query_correctness | 100% match with hot | US-07 + augmentation |
| monthly_analytics_cost | ≤ $300 | cost_metric_lag_for_us08 (US-08 reads daily spend from US-07) | ≤ 15min | US-07 + cross-story dependency |

Decision Gate

PROMOTE BATCHING IF:
   - shadow vs unbatched on 1 week of events: dashboard parity within 0.5% on every key metric
   - low-frequency event types (errors, escalations, breaker triggers) preserved 100%
   - event_loss_rate <= 0.1%
PROMOTE RA3 MIGRATION IF:
   - parallel cluster: dashboards run on both for 2 weeks; Spectrum query correctness 100% on date ranges that span the hot/cold boundary
PROMOTE MATERIALIZED VIEWS IF:
   - per-view test: source-table query result equals materialized-view result on golden dashboard set
   - refresh time on a 100M-row source <= 2min
HOLD IF:
   - dashboard parity 0.5-1% drift (investigate which event class is shedding precision)
ABORT IF:
   - low-frequency event class loses any events under burst conditions (US-08 cost breaker depends on these)

Pipeline Diagram

graph TD
    A[1 Week Event Replay] --> B[Batched vs Unbatched Counterfactual]
    C[10x Event Burst<br/>Prime Day shape] --> D[Backpressure Stress Test]
    E[RA3 Parallel Cluster] --> F[Spectrum Correctness Test<br/>across hot/cold boundary]
    G[MV Golden Dashboard Set] --> H[MV Parity Test]

    B --> I[Cost: $/GB ingested<br/>Quality: dashboard parity, low-freq event preservation]
    D --> J[Loss rate during burst, lag]
    F --> K[100% query equivalence assertion]
    H --> L[Per-view parity assertion]

    I --> M{Decision Gate}
    J --> M
    K --> M
    L --> M

    M -->|all clear| N[PROMOTE]
    M -->|low-freq event loss| O[ABORT - US-08 will go blind]
    M -->|MV drift| P[HOLD - rebuild MV definition]

    style B fill:#fd2,stroke:#333
    style M fill:#2d8,stroke:#333

Two Real-Incident Sketches the Offline Test Would Have Caught

Incident A (batching swallowed escalation events on Prime Day): Event batching at 50 events/5s worked fine on weekday traffic. On Prime Day the batch filled in 0.4s; Firehose's buffer hit its limit; events were dropped. Crucially, the dropped events included cost_circuit_breaker_warning_engaged events. US-08's breaker engaged correctly, but US-07 didn't see it, so the cost dashboard showed normal spend for the next 3 hours — ops only noticed when the AWS billing dashboard caught up at hour 4. The offline burst test, simulating the Prime Day traffic shape with synthetic events and specifically tracking low-frequency event types (errors, breakers, escalations) through the batched path, would have measured 18% loss on those event classes and forced the batch-overflow handling design.
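The per-class preservation check can be sketched as follows. The event-type names match the incident; the counter plumbing and gate shape are assumptions, but the core idea is that low-frequency classes get their own 100%-preservation gate, separate from the aggregate loss-rate metric:

```python
from collections import Counter

# Event classes other systems depend on for safety logic (US-08 breaker etc.).
LOW_FREQ_CLASSES = {"error", "escalation", "cost_circuit_breaker_warning_engaged"}

def per_class_loss(sent, received):
    """Loss rate per event type, comparing injected vs landed events."""
    sent_c = Counter(e["type"] for e in sent)
    recv_c = Counter(e["type"] for e in received)
    return {t: 1 - recv_c.get(t, 0) / n for t, n in sent_c.items()}

def burst_gate(sent, received, aggregate_limit=0.001):
    """ABORT if any low-frequency class loses a single event under burst;
    separately check the aggregate 0.1% loss-rate threshold."""
    loss = per_class_loss(sent, received)
    failures = [t for t in LOW_FREQ_CLASSES if loss.get(t, 0) > 0]
    aggregate = 1 - len(received) / len(sent)
    return {"aggregate_ok": aggregate <= aggregate_limit,
            "low_freq_failures": failures}
```

An aggregate-only metric would have reported ~0.02% loss on a Prime Day replay and passed; the per-class view shows the breaker events losing 18%.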

Incident B (materialized view returned stale cost number to US-08 breaker): The 15-min cost-tracking materialized view was rebuilt nightly with a schema change. During the 4am rebuild window, the view returned NULL for "today's spend so far." US-08's circuit breaker read NULL, treated it as 0, and didn't engage when spend was actually at 95% of budget. The offline parity test for materialized views — which must include "during the refresh window, what does the view return?" — would have caught the NULL read and required the breaker code to treat NULL as "stale; use the prior value plus a safety buffer."
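A minimal sketch of that NULL-handling rule, under the assumption that the breaker caches its last good reading (the 10% safety buffer is illustrative, not from the source doc):

```python
def read_spend_for_breaker(view_value, last_good_spend, safety_buffer=0.10):
    """Return the spend figure the breaker should act on.
    view_value: today's spend from the US-07 materialized view, or None
    if the view is mid-refresh. NULL must never be treated as $0."""
    if view_value is None:
        # Stale read: assume spend kept climbing and pad the last good value.
        return last_good_spend * (1 + safety_buffer)
    return view_value
```

The asymmetry is deliberate: a breaker that over-estimates stale spend degrades service slightly early, while one that reads NULL as zero goes blind at exactly the moment it matters.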

Threshold Reconciliation

US-07 source specifies dashboard parity but doesn't call out low-frequency event preservation as a separate criterion. This file makes it explicit: event types that occur < 1/min must have a separate per-class loss-rate metric and 100% preservation requirement under burst (because they are the events that other systems depend on for safety logic). This addition belongs in US-07's "Quality Regression Criteria."


Scenario US-08 — Traffic-Based Cost Optimization

Recap

US-08 controls cost via (1) tiered rate limiter (Prime 60 msg/min, Auth 30, Guest 10; global 50K msg/min), (2) request prioritizer (HIGH: auth+cart → full pipeline; LOW: guest → template + cache + Haiku), (3) cost circuit breaker (reads daily Bedrock spend from US-07; 80% → WARNING (Haiku-only); 100% → DEGRADED (template-only)), (4) graceful degradation ladder (5 levels), (5) WAF bot detection. Target: 20–35% total infrastructure cost reduction + bounded worst-case spend.

The Cost Lever

LLM call count, RAG call count, compute via guest-lite pipeline, and bounded cost risk via the breaker.

The Quality Contract

  • prime_user_p99_latency no regression during normal operation.
  • auth_user_csat no regression > 0.3pts even during WARNING degradation.
  • cost_breaker_false_positive_rate ≤ 1/quarter — breaker must not engage when spend is normal.
  • degradation_recovery_time ≤ 5min after pressure subsides.

Offline Test Design

| Primitive | Why it applies | Dataset |
| --- | --- | --- |
| D: Stress/Saturation | Every component of US-08 only matters under pressure; the entire story is about correct behavior on the bad day | Synthetic budget pressure + traffic spike + bot attack patterns |
| A: Counterfactual Replay | Guest-lite pipeline: compare quality on real guest traffic | 1 week of guest session replay |
| B: Decision-Equivalence | Rate limiter and degradation ladder make routing decisions; need to verify they fire on the right inputs | Synthetic load shaped by tier, with labeled "should be limited / should be degraded" ground truth |

Paired Metrics Table

| Cost (must improve) | Threshold | Quality (must not regress) | Threshold | Sourced from |
| --- | --- | --- | --- | --- |
| rate_limit_rejection_rate | 3–8% | prime_tier_rejection_rate | 0% (Prime never rate-limited under design load) | US-08 metrics |
| guest_template_only_rate | ≥ 50% | guest_to_auth_conversion_rate | no regression > baseline | US-08 + augmentation |
| cost_breaker_engagement_time (synthetic test) | ≤ 2min from threshold | breaker_false_positive_rate | ≤ 1/quarter (audit on historical spend traces) | augmentation |
| degradation_level_during_load | ≤ 2 (Pressure tier) on 3x normal load | auth_user_csat_proxy_during_degradation | no regression > 0.3pts | augmentation |
| bot_block_rate | 3–5% | false_positive_block_rate_on_real_users | ≤ 0.1% (audit) | US-08 |

Decision Gate

PROMOTE RATE LIMITER IF:
   - synthetic 3x load: Guest tier rejected first; Prime tier never rejected
   - response code at limit = 429 with valid Retry-After
   - global limit triggers correctly when sum exceeds 50K msg/min
PROMOTE COST BREAKER IF:
   - synthetic spend curve: WARNING tier engages within 2min of crossing 80%
   - spend curve flattens after engagement (Haiku-floor enforced)
   - on historical 90-day spend trace, breaker would have engaged on (and only on) the actual incidents
PROMOTE DEGRADATION LADDER IF:
   - synthetic load through all 5 levels: each level engages on correct trigger; recovery is monotonic
   - no oscillation between levels under sustained pressure
HOLD IF:
   - breaker engagement time 2-5min (still works but slower than design; tune polling)
ABORT IF:
   - any false positive on Prime tier rejection (revenue-critical breach)
   - breaker fails to engage within 5min on synthetic 100% threshold cross
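The historical-trace audit in the breaker gate can be sketched like this. The trace format (per-day spend samples as fractions of budget) and labels are assumptions; the check is that the breaker would have engaged on, and only on, the labeled incident days:

```python
def breaker_audit(trace, incident_days, engage_fraction=0.80):
    """trace: {day: [spend_fraction samples through the day]}.
    incident_days: days on which the breaker SHOULD have engaged (labeled).
    Returns (missed_incidents, false_positives); both must be empty to pass."""
    engaged = {d for d, samples in trace.items()
               if max(samples) >= engage_fraction}
    missed = sorted(set(incident_days) - engaged)
    false_positives = sorted(engaged - set(incident_days))
    return missed, false_positives
```

A non-empty second element against the 90-day trace is the offline form of the ≤ 1/quarter false-positive budget: the audit catches it before a too-twitchy threshold ships.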

Pipeline Diagram

graph TD
    A[Synthetic Load Generator<br/>shape: 3x normal, mixed tiers] --> B[Rate Limiter Stress Test]
    C[Synthetic Spend Curve<br/>0 -> 100% over 10min] --> D[Cost Breaker Engagement Test]
    E[Multi-Hour Sustained Load] --> F[Degradation Ladder Test]
    G[1 Week Guest Session Replay] --> H[Guest-Lite Counterfactual]
    I[Bot Pattern Replay<br/>WAF rules dataset] --> J[Bot Detection Accuracy Test]

    B --> K[Tier-respect: Prime 0%, Guest highest]
    D --> L[Engagement time, spend flattening, false positive on historical]
    F --> M[Per-level activation, recovery slope, oscillation check]
    H --> N[Guest conversion, CSAT under lite pipeline]
    J --> O[True positive / false positive on real users]

    K --> P{Decision Gate}
    L --> P
    M --> P
    N --> P
    O --> P

    P -->|all clear| Q[PROMOTE]
    P -->|Prime rejection| R[ABORT - revenue breach]
    P -->|breaker too slow| S[HOLD - tune polling cadence]

    style C fill:#f66,stroke:#333
    style P fill:#2d8,stroke:#333

Two Real-Incident Sketches the Offline Test Would Have Caught

Incident A (cost breaker engaged on Prime Day too slowly): US-08 breaker polled US-07 cost view every 60s. On Prime Day, spend climbed from 70% to 105% of budget in 8 minutes. Breaker engaged at 105% (after the breach) instead of at 80%. The offline synthetic test, with a spend curve climbing 0 → 120% over 10 minutes, would have measured engagement time at 9 minutes (because of polling lag + view refresh lag) — far over the 2-min target. The fix: polling cadence dropped to 15s + breaker uses projected 5-min-forward spend (linear extrapolation), not just current spend.
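The projected-spend fix can be sketched as a two-point linear extrapolation; the sample format and the horizon are assumptions, while the 80%/100% tiers come from US-08:

```python
def projected_spend(samples, horizon_s=300):
    """samples: [(t_seconds, spend_fraction_of_budget), ...] sorted by time,
    at least two entries. Extrapolate the last two samples 5 min forward."""
    (t0, s0), (t1, s1) = samples[-2], samples[-1]
    slope = (s1 - s0) / (t1 - t0)
    return s1 + slope * horizon_s

def breaker_state(samples, warn=0.80, degrade=1.00):
    """Engage on the projection, not the raw reading."""
    p = projected_spend(samples)
    if p >= degrade:
        return "DEGRADED"
    if p >= warn:
        return "WARNING"
    return "NORMAL"

# On a Prime Day shape (70% and climbing ~4.4pts/min), the projection crosses
# the WARNING tier minutes before the raw reading would.
state = breaker_state([(0, 0.70), (60, 0.744)])  # projection ~0.96 -> "WARNING"
```

Flat spend projects to itself, so the extrapolation adds no false positives on quiet days; it only buys lead time when spend is actually climbing.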

Incident B (degradation ladder oscillation): Under sustained pressure, the ladder bounced between Pressure (level 1) and High Load (level 2) every 90 seconds because the trigger and the recovery threshold were the same number. Each oscillation caused cache flushes and routing churn — making the situation worse. The offline ladder test, run for 30 minutes of sustained load, would have shown the oscillation pattern in the engagement-time series and required hysteresis (engage at 60% capacity, recover at 50%) before the ladder could ship.
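A hysteretic ladder controller can be sketched as follows; the level table and thresholds are illustrative (only two of the five levels shown), with each level engaging at one capacity threshold and recovering only below a strictly lower one:

```python
# (level, engage_at_capacity, recover_below_capacity); engage > recover
# by 10 points per level, so a steady 60% load cannot oscillate.
LEVELS = [
    (1, 0.60, 0.50),  # Pressure
    (2, 0.75, 0.65),  # High Load
]

def next_level(current_level, capacity_util):
    """One control tick: escalate to the highest engaged level, but step
    down only when utilization drops below the current level's recover mark."""
    for level, engage, _ in reversed(LEVELS):
        if capacity_util >= engage and level > current_level:
            return level
    if current_level > 0:
        _, _, recover = LEVELS[current_level - 1]
        if capacity_util < recover:
            return current_level - 1  # recover one level at a time
    return current_level
```

With the incident's failure mode (trigger == recovery threshold), a load hovering at the boundary flips levels every tick; with the band above, it parks at the engaged level until pressure genuinely subsides.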

Threshold Reconciliation

US-08 source specifies the engagement levels but doesn't specify hysteresis requirements or the projected-spend approach. This file adds both: hysteresis between every adjacent level (engage threshold ≥ recover threshold + 10pts) and projected-spend computation for the breaker (linear extrapolation over 5min, not raw spend). Both belong in US-08's design.


Cross-Scenario Patterns

After laying out 8 scenarios side-by-side, the patterns harden:

  1. Every scenario uses 2–3 primitives. No scenario is well-tested by a single primitive. Counterfactual gives you the average; decision-equivalence gives you the routing correctness; stress gives you the worst-case behavior.
  2. Every scenario has a "head class" and "tail class" failure mode. The optimization works on average, breaks on a specific subset (a confidence band, an intent class, a session duration, a query shape). The offline harness must slice by the dimension where the failure hides.
  3. Every scenario has an upstream dependency. Cache invalidation depends on event publishing; cost breaker depends on analytics; reranker skip depends on embedding model. Whenever any upstream changes, the cost optimization downstream must be re-tested. This is the cross-story coupling the next file (07) drills into.
  4. Every scenario adds at least one criterion to its source US doc. Across all 8, the offline-testing exercise surfaces 14 augmentations that should be folded back into the source docs. These are not contradictions — they are specifications the source docs left implicit.

The next files contain the interview grill chains for these scenarios: