03. Foundations and Primitives for Cost-Optimization Offline Testing
"Quality testing asks 'is this answer correct?' Cost-optimization testing asks something harder: 'is this answer correct AND did we route it through the cheapest path that still keeps it correct?' One number tells you nothing — every cost lever in MangaAssist is paired with a quality contract, and the offline harness has to enforce both at once."
This file is the foundation layer for the per-scenario deep-dives in 04-scenario-deep-dives-per-cost-story.md and the interview grill chains in files 05 and 06. It assumes you have already read 01-offline-testing-strategy.md and 02-offline-testing-scenarios-with-answers.md — those cover the general offline-test ladder for the chatbot. This file covers what changes when the thing under test is a cost-optimization mechanism rather than a quality change.
Why Cost-Optimization Offline Testing Is Structurally Different
| Concern | General Quality Testing (files 01–02) | Cost-Optimization Testing (this folder, files 03+) |
|---|---|---|
| Question being answered | Does the new code produce a correct answer? | Does the new code produce the same correct answer at lower cost? |
| Pass criterion | Single threshold on a quality metric | Paired threshold: cost-lever effect AND quality non-regression, both must clear |
| Failure mode of a passing test | False positives (test says good, prod says bad) | "Cheap and wrong" — cost target met but quality degraded; OR "expensive and same" — quality preserved but cost lever didn't fire |
| Counterfactual needed? | Sometimes (A/B) | Always. You can't measure savings without a baseline pipeline running on the same input |
| Dataset shape | Queries + expected answers | Queries + expected answers + expected cost band + expected routing decision |
| Risk asymmetry | A wrong answer hurts one user | A wrong cost knob hurts revenue, blows budget, and creates a finance-visible incident |
| Time horizon | Pass/fail at PR time | Pass/fail at PR time + drift detection over weeks (a cost lever can silently regress as traffic shifts) |
| What can't be tested offline | Tone, helpfulness on real users | Real Spot-interruption rate, real cache-hit pattern, real OCU autoscale behavior, real DDB billing-mode breakeven |
The existing 8-scenario file (02-offline-testing-scenarios-with-answers.md) covers scenarios 1–8 from the perspective of changes to the chatbot (prompt edit, retriever change, classifier retraining, guardrails, memory, semantic caching, FAQ bypass, when-to-pay). The 8 scenarios in this folder map to the 8 cost-optimization user stories in Cost-Optimization-User-Stories/ — they are about validating infrastructure and architectural cost levers, not chatbot logic changes.
The Three Failure Modes Cost-Opt Offline Testing Must Catch
Every cost optimization in US-01 through US-08 can fail in three structurally different ways. The offline harness must catch all three before shadow mode begins, because shadow mode is itself expensive — running both pipelines for two weeks doubles the relevant cost line, and we want to enter shadow with high confidence the optimization isn't broken.
Failure Mode 1: "Cheap and Wrong"
The optimization fires as designed but produces a degraded answer. Examples drawn from the 8 user stories:
- US-01: Template router catches an `order_tracking` intent for a guest user with no order ID resolved → it falls through correctly. But for a `chitchat` query that is actually a recommendation cloaked as small talk ("hey what's something good to read"), the template router responds with "Welcome to JP Manga Store!" — cost saved, user lost.
- US-06: RAG bypass gate fires for a `promotion` intent (correctly listed as bypass-eligible). The user's actual query was "are there any deals on Berserk Deluxe?" — which needs both promotion data AND product retrieval. Bypass saved an OCU read but the answer is now incomplete.
- US-05: Aggressive 24h TTL on TURN items deletes a 26-hour-old conversation context where the user is mid-multi-turn negotiation about a return. They come back, ask "so will you process that refund?" and the bot has no memory.
These are the silent-killer failures. The cost dashboards look healthy — savings are real — and the quality dashboards look healthy because the affected slice is small. The compounded business impact only surfaces in escalation-rate or conversion drift weeks later.
Failure Mode 2: "Expensive and Same"
The optimization is wired in but doesn't actually fire because of a bug, a misconfiguration, or a feature flag in the wrong state. The system pays the operational cost of running the optimization (extra Lambda invocations to check a cache, extra Redis lookups, extra rule evaluations) without the savings.
- US-02: The Aho-Corasick pattern matcher is deployed but its rules file points at an empty bucket due to a stale environment variable. Every message falls through to SageMaker. Inference cost is unchanged; cold-start logging now shows new "rule classifier checked" lines that look like progress.
- US-03: Cache warming Lambda runs at 8:30am JST but writes to the wrong Redis cluster (test instead of prod). Hit rate target says ≥30% but observed is 4% because nothing pre-populates. The cache is "deployed" but useless.
- US-04: Fargate Spot is enabled but the task-definition still requests 2 vCPU / 4 GB; the right-sizing PR was reverted but the Spot PR landed alone. We have Spot interruptions on oversized tasks and are paying both costs.
Failure Mode 3: "Cheap Now, Expensive Later"
The optimization works on the test slice but breaks under conditions the offline harness didn't simulate — usually a distribution shift, a traffic spike, or an upstream change.
- US-08: The cost circuit breaker is tested against a synthetic 80% budget load. In production, a Black Friday spike pushes spend past 100% in 8 minutes — faster than the breaker's 1-minute polling window can keep up with. By the time it engages, 12 minutes of spend have already accrued inside US-07's event buffers, invisible to the breaker.
- US-07: Event batching tested at 50 events/5s on a representative weekday. On Prime Day the batch fills in 0.4s and Firehose buffers max out, dropping events. Dashboards are now under-counting LLM calls — the very metric US-08's breaker keys on.
- US-06: Conditional reranker skip is tested with a stable embedding model. The next embedding model upgrade shifts the kNN score distribution; the 0.9 threshold no longer corresponds to "obviously top result" and reranker skip rate jumps from 30% to 65%, surfacing low-quality top-1 results to the LLM.
The Four Offline-Testing Primitives
Across all 8 scenarios in file 04, four primitive test patterns recur. Every per-scenario test design uses one or more of these as its backbone. Naming them up front lets the per-scenario sections stay short.
Primitive A: Counterfactual Replay
The shape: Run the same input through two pipelines side-by-side — pipeline_baseline (current production) and pipeline_optimized (new cost lever enabled) — and diff both cost and quality outcomes.
What it proves: That the optimization actually saves money on this input distribution AND doesn't change the quality contract.
Dataset needed: A traffic replay slice (real production sessions, anonymized, with full input context — not synthetic). Minimum 5,000 sessions stratified by intent. The dataset must include the full upstream context (page state, browsing history, conversation turns, page ASIN) because cost levers are sensitive to context shape.
Metrics emitted (paired):
| Cost side | Quality side |
|---|---|
| delta_dollars_per_session | llm_judge_agreement_rate |
| delta_input_tokens | intent_route_correctness |
| delta_output_tokens | bertscore_vs_baseline |
| delta_downstream_calls | forbidden_element_rate |
Decision gate template:
```
PROMOTE IF:
    cost_savings_p50 >= claimed_savings_floor
    AND quality_regression_p95 <= regression_ceiling
    AND no_paired_metric_breach_in_any_intent_slice
HOLD IF:
    cost target met but quality regression in any single intent > 2x ceiling
ABORT IF:
    cost lever doesn't fire (delta_dollars_per_session < 5% of target)
    OR quality regression detected even with cost target met
```
When this primitive is the right choice: Whenever the optimization changes what gets sent to (or skipped at) an expensive component. US-01, US-03, US-05, US-06, US-07 all use this as their backbone.
```mermaid
graph TD
A[Replay Dataset<br/>5K sessions stratified by intent] --> B[Test Harness]
B --> C[Pipeline Baseline<br/>current production code path]
B --> D[Pipeline Optimized<br/>new cost lever enabled]
C --> E[Cost Sink<br/>tokens, downstream calls, $/session]
C --> F[Quality Sink<br/>response, intent route, embedding]
D --> G[Cost Sink<br/>tokens, downstream calls, $/session]
D --> H[Quality Sink<br/>response, intent route, embedding]
E --> I[Paired Diff Engine]
F --> I
G --> I
H --> I
I --> J{Decision Gate}
J -->|both clear| K[PROMOTE to shadow mode]
J -->|cost only| L[ABORT - quality regression]
J -->|quality only| M[ABORT - lever not firing]
style D fill:#fd2,stroke:#333
style I fill:#2d8,stroke:#333
style L fill:#f66,stroke:#333
style M fill:#f66,stroke:#333
```
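To make the pairing concrete, here is a minimal Python sketch of the replay-and-diff loop. The pipeline callables (`run_baseline`, `run_optimized`), the `RunResult` fields, and the threshold values are illustrative assumptions, not the production harness.

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class RunResult:
    dollars: float           # estimated $ for this session
    input_tokens: int
    downstream_calls: int
    quality_score: float     # e.g. LLM-judge agreement in [0, 1]

def replay_pair(sessions, run_baseline, run_optimized):
    """Run every replayed session through both pipelines and collect paired diffs."""
    diffs = []
    for session in sessions:
        base = run_baseline(session)    # current production code path
        opt = run_optimized(session)    # cost lever enabled
        diffs.append({
            "delta_dollars": base.dollars - opt.dollars,
            "delta_input_tokens": base.input_tokens - opt.input_tokens,
            "delta_downstream_calls": base.downstream_calls - opt.downstream_calls,
            "quality_regression": base.quality_score - opt.quality_score,
        })
    return diffs

def decision_gate(diffs, savings_floor=0.002, regression_ceiling=0.02):
    """Paired gate: savings must clear the floor AND quality must not regress."""
    savings_p50 = median(d["delta_dollars"] for d in diffs)
    regressions = sorted(d["quality_regression"] for d in diffs)
    regression_p95 = regressions[int(0.95 * (len(regressions) - 1))]
    if savings_p50 < 0.05 * savings_floor:
        return "ABORT: lever not firing"
    if regression_p95 > regression_ceiling:
        return "ABORT: quality regression"
    if savings_p50 >= savings_floor:
        return "PROMOTE: enter shadow mode"
    return "HOLD: investigate divergence"
```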
Primitive B: Decision-Equivalence Test
The shape: Don't compare outputs token-by-token. Compare the routing decision the optimized pipeline makes against the routing decision the baseline pipeline would have made. Disagreement is the signal.
What it proves: That the cost lever doesn't silently change which downstream branch a request takes — which is the most common cause of "cheap and wrong" failures.
Dataset needed: Labeled queries (intent, complexity tier, RAG-vs-no-RAG ground truth, model-tier ground truth). 1,000–3,000 queries stratified across decision boundaries.
Metrics emitted:
| Decision under test | Equivalence metric | Threshold |
|---|---|---|
| Template-bypass vs LLM (US-01) | template_decision_agreement | ≥ 98% on labeled set |
| Rule-classifier vs SageMaker intent (US-02) | rule_vs_ml_label_agreement | ≥ 95% on ≥1K samples per rule |
| RAG-bypass vs RAG (US-06) | bypass_decision_agreement | ≥ 95% per intent |
| Haiku vs Sonnet tier (US-01) | tier_assignment_agreement | ≥ 92% on complexity-labeled set |
| Reranker-skip vs rerank (US-06) | top1_id_match_rate | ≥ 95% |
Decision gate template:
```
PROMOTE IF:
    per_class_agreement_rate >= class_floor_table[class]
    AND no_tail_class_below_global_floor
HOLD IF:
    head-class agreement is high but a single tail class drops > 5pts
ABORT IF:
    any safety-critical class (return_request, escalation) below 0.90
```
Why decision-equivalence beats output-equivalence: Two LLM calls on the same prompt produce different exact strings — checking string match would create permanent test flake. Two routing decisions on the same input are deterministic (or near-deterministic with calibrated confidence) — checking that layer is reliable and high-signal.
When this primitive is the right choice: Whenever the optimization is a gate or a router deciding whether to invoke an expensive component. US-01 (template + tier), US-02 (rule pre-filter), US-06 (RAG bypass + reranker skip) all use this.
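A minimal sketch of the per-class agreement computation, assuming each labeled record carries the decision both pipelines made; the record field names and the class floors are placeholders, not the real floor table.

```python
from collections import defaultdict

# Safety-critical classes get their own floor; everything else uses the default.
CLASS_FLOORS = {"return_request": 0.90, "escalation": 0.90, "default": 0.95}

def per_class_agreement(records):
    """records: dicts with 'label', 'baseline_decision', 'optimized_decision'."""
    agree, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["label"]] += 1
        agree[r["label"]] += int(r["baseline_decision"] == r["optimized_decision"])
    return {label: agree[label] / total[label] for label in total}

def equivalence_gate(agreement):
    breaches = {label: rate for label, rate in agreement.items()
                if rate < CLASS_FLOORS.get(label, CLASS_FLOORS["default"])}
    return "PROMOTE" if not breaches else f"HOLD/ABORT: per-class floors breached: {breaches}"
```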
Primitive C: Cost-Aware Golden Dataset
The shape: Extend the existing golden dataset (described in file 02) so each query carries an expected cost band alongside the expected answer, expected intent, and expected route. The dataset becomes a regression suite for cost in addition to quality.
What it proves: That a code change anywhere in the system hasn't accidentally inflated per-query cost — even when no cost optimization was the subject of the PR. This is the cost-equivalent of "did we accidentally drop test coverage."
Schema extension to the 500-query golden dataset:
```json
{
  "query_id": "GD-187",
  "query": "What dark psychological manga should I read if I loved Death Note?",
  "intent": "recommendation",
  "expected_intent": "recommendation",
  "expected_route": "llm_full_pipeline",
  "expected_cost_band": {
    "input_tokens_p50": [800, 1400],
    "output_tokens_p50": [200, 350],
    "rag_chunks_retrieved_p50": [3, 5],
    "model_tier_expected": "sonnet",
    "dollars_per_call_p95": 0.012,
    "downstream_calls_expected": ["catalog_search", "personalize", "rag_retrieve"]
  },
  "cost_regression_alert_threshold": "+15% on any band"
}
```
Metrics emitted:
- Per-query: `cost_band_violation_count`, `route_change_count`
- Aggregate: `dataset_cost_p50`, `dataset_cost_p95`, `dataset_cost_band_violation_rate`
Decision gate template:
```
PASS IF:
    dataset_cost_p50 within +/- 5% of last release baseline
    AND no individual query exceeds cost_regression_alert_threshold
    AND no query changes route to a more expensive route without an explicit override comment
WARN IF:
    p95 inflated by 5-15% but p50 stable (long-tail regression — investigate)
FAIL IF:
    any single query crosses route into more-expensive lane unexpectedly
```
Critical insight from the existing 02 file (Scenario 1, prompt-edit case): A prompt edit that adds 200 tokens to the system preamble passes every quality test but inflates per-query cost across all 500 golden queries. Without cost-aware golden, that edit ships and the next monthly Bedrock bill is the first signal. With cost-aware golden, the PR is blocked at CI.
When this primitive is the right choice: Always. This is the baseline regression suite that runs on every PR, regardless of whether the PR is a cost optimization. It's the "ratchet" that prevents silent cost regression.
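A sketch of the CI-side check, assuming the harness has recorded per-query telemetry keyed by `query_id`; the field names mirror the schema above, and the +15% alert multiplier is hardcoded here purely for illustration.

```python
ALERT_MULTIPLIER = 1.15   # the "+15% on any band" threshold from the schema above

def check_cost_bands(golden_entries, recorded_runs):
    """golden_entries: dicts in the golden schema; recorded_runs: query_id -> telemetry dict."""
    violations = []
    for entry in golden_entries:
        run = recorded_runs[entry["query_id"]]
        band = entry["expected_cost_band"]
        lo, hi = band["input_tokens_p50"]
        if not lo <= run["input_tokens"] <= hi * ALERT_MULTIPLIER:
            violations.append((entry["query_id"], "input_tokens", run["input_tokens"]))
        if run["dollars_per_call"] > band["dollars_per_call_p95"] * ALERT_MULTIPLIER:
            violations.append((entry["query_id"], "dollars_per_call", run["dollars_per_call"]))
        if run["route"] != entry["expected_route"]:
            violations.append((entry["query_id"], "route_change", run["route"]))
    return violations   # any violation fails (or at least warns on) the PR gate
```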
Primitive D: Stress / Saturation Simulation
The shape: Run the optimization under conditions deliberately outside its normal operating range — traffic spike, budget pressure, Spot interruption, dependency outage — and verify it degrades gracefully rather than failing catastrophically.
What it proves: That the cost lever holds up under the conditions where its absence would matter most. A circuit breaker that doesn't trip during a real spike is worse than no circuit breaker at all (because it falsely advertises safety).
Dataset needed: Synthetic traffic generator + chaos primitives. Not replay; this is engineered load.
Test patterns:
| Story | Stress condition simulated | Pass criterion |
|---|---|---|
| US-04 | Spot task receives 2-min SIGTERM mid-WebSocket-stream | All in-flight sessions complete; new sessions route to on-demand within 30s |
| US-08 | Synthetic spend climbs to 80% of daily budget over 5 min | WARNING tier engages within 2 min; Haiku-floor enforced; spend curve flattens |
| US-08 | 3x normal request rate for 10 min | Rate limiter rejects guest tier first; Prime tier latency P99 stays < 3s |
| US-07 | Kinesis stream gets 10x normal event volume | On-demand mode autoscales; lag stays < 30s; no event loss |
| US-06 | OpenSearch Serverless OCU floor exceeded by 2x batch backfill | Batched indexing window respected; query latency P99 stays < 200ms |
| US-02 | SageMaker endpoint scaled-to-zero gets surprise traffic burst | Serverless fallback engages; first-response latency < 3s; no 5xx |
Metrics emitted: Time-to-engage, time-to-stable, breach magnitude, recovery slope.
Decision gate template:
```
PROMOTE IF:
    time_to_engage <= SLO
    AND breach_magnitude <= acceptable_overshoot
    AND no quality regression during engagement
    AND recovery is monotonic (no oscillation)
ABORT IF:
    breach magnitude exceeds 2x acceptable (the lever protects too late)
    OR oscillation detected (the lever fights itself)
```
When this primitive is the right choice: Any optimization whose value is "contains a worst-case scenario" rather than "improves the average case." US-04 (Spot resilience), US-08 (circuit breaker, rate limiter, degradation ladder), US-07 (Kinesis burst), US-02 (scale-to-zero cold-start). These are the optimizations whose offline test must specifically create the bad day.
```mermaid
graph LR
A[Synthetic Load Generator] --> B[Production-Shape Pipeline]
C[Chaos Injector<br/>spot kill, dep failure,<br/>budget spike] --> B
B --> D[Lever Engagement Telemetry]
B --> E[Quality Telemetry<br/>during engagement]
D --> F{Pass / Fail}
E --> F
style C fill:#f66,stroke:#333
style F fill:#2d8,stroke:#333
```
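As one concrete instance, here is a sketch of the US-08 budget-spike simulation. It assumes a hypothetical `breaker` object exposing an `observe()` method and a `state` attribute; the spend ramp, the 80% threshold, and the 2-minute SLO are synthetic fixture values, not production configuration.

```python
def simulate_budget_spike(breaker, daily_budget=4200.0, tick_seconds=10):
    """Ramp synthetic spend past 100% of the daily budget in ~5 minutes and record
    when the 80% threshold is crossed and when the breaker first engages."""
    spend, t = 0.0, 0
    crossed_at = engaged_at = None
    while spend < 1.2 * daily_budget:
        spend += daily_budget / 30               # 30 ticks to burn the full budget
        if crossed_at is None and spend >= 0.8 * daily_budget:
            crossed_at = t
        breaker.observe(total_spend=spend)       # assumed polling interface
        if engaged_at is None and breaker.state != "CLOSED":
            engaged_at = t
        t += tick_seconds
    return crossed_at, engaged_at

def test_breaker_engages_within_slo(breaker):
    crossed_at, engaged_at = simulate_budget_spike(breaker)
    assert engaged_at is not None, "breaker never engaged (protects too late)"
    assert engaged_at - crossed_at <= 120, "time_to_engage exceeds the 2-minute SLO"
```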
The Paired-Metric Pattern
Every cost lever has at least one cost metric and at least one quality metric. The offline harness must assert both thresholds, treating either breach as a failure. This is the most important pattern in the entire folder.
| Story | Cost metric (must improve) | Quality metric (must not regress) |
|---|---|---|
| US-01 LLM tokens | dollars_per_session (down ≥ 30%) | llm_judge_agreement_rate ≥ 92%; intent_route_correctness ≥ 98% |
| US-02 Intent classifier | dollars_per_inference (down ≥ 50%); endpoint_idle_hours_per_day ≥ 12 | per_class_F1 ≥ tail-class table; head_class_F1 no regression |
| US-03 Caching | combined_hit_rate ≥ 70%; downstream_calls_per_session (down ≥ 30%) | stale_read_rate ≤ 0.5%; cache_wrong_answer_rate ≤ 0.5% |
| US-04 Compute | dollars_per_task_hour (down ≥ 35%); spot_interruption_rate ≤ 5% | p99_response_latency (no regression > 15%); error_rate_during_interrupt ≤ 1% |
| US-05 DynamoDB | dollars_per_1k_turns (down ≥ 40%); ttl_deletion_rate ≥ 5M items/day | context_loss_rate ≤ 1%; multi_turn_coherence_score no regression |
| US-06 RAG | dollars_per_rag_query (down ≥ 30%); rag_bypass_rate ≥ 40% | recall_at_3_per_intent no regression > 2pts; mrr_per_intent no regression |
| US-07 Analytics | dollars_per_gb_ingested (down ≥ 40%); event_compression_ratio ≥ 60% | event_loss_rate ≤ 0.1%; dashboard_parity_with_unbatched ≤ 0.5% drift |
| US-08 Traffic | daily_bedrock_spend ≤ $4,200; cost_breaker_trigger_count = 0 | prime_user_p99_latency no regression; auth_user_csat no regression > 0.3pts |
The fundamental rule: If a PR claims to optimize cost, its CI gate must compute both halves and fail loudly if either side breaches. Cost-only PRs that ship without paired-metric enforcement are how the "cheap and wrong" failure mode reaches production.
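A minimal sketch of what "compute both halves and fail loudly" can look like in CI. The threshold table encodes only the US-01 row above, and the `measured` dict is assumed to come from the offline harness run, not from this file.

```python
PAIRED_THRESHOLDS = {
    "US-01": {
        "cost":    {"dollars_per_session_reduction": 0.30},    # must improve by >= 30%
        "quality": {"llm_judge_agreement_rate": 0.92,           # must stay >= 92%
                    "intent_route_correctness": 0.98},          # must stay >= 98%
    },
}

def paired_gate(story, measured):
    spec = PAIRED_THRESHOLDS[story]
    cost_ok = all(measured[m] >= floor for m, floor in spec["cost"].items())
    quality_ok = all(measured[m] >= floor for m, floor in spec["quality"].items())
    if cost_ok and quality_ok:
        return "PASS"
    if cost_ok:
        return "FAIL: cheap and wrong (quality breach)"
    if quality_ok:
        return "FAIL: expensive and same (cost lever not firing)"
    return "FAIL: both halves breached"
```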
The Unified Test-Pipeline Shape (Reused Across All 8 Scenarios)
Every per-scenario test design in file 04 is a specialization of this template. The boxes change per scenario; the shape doesn't.
```mermaid
graph TD
subgraph "Inputs"
A1[Cost-Aware Golden Dataset<br/>500 queries + cost bands]
A2[Production Replay Slice<br/>5K sessions stratified]
A3[Synthetic Stress Generator]
end
subgraph "Pipelines Under Test"
B1[Pipeline Baseline<br/>current prod]
B2[Pipeline Optimized<br/>cost lever enabled]
end
subgraph "Telemetry Sinks"
C1[Cost Telemetry<br/>tokens, $, downstream calls]
C2[Quality Telemetry<br/>route decision, output, judge score]
C3[Lever Engagement<br/>did the lever fire? when? how often?]
end
subgraph "Decision Gate"
D1{Paired Threshold Check}
D2[Per-intent slicing]
D3[Cost-band regression]
end
subgraph "Outcomes"
E1[PROMOTE to shadow mode]
E2[HOLD - investigate divergence]
E3[ABORT - lever broken]
end
A1 --> B1
A1 --> B2
A2 --> B1
A2 --> B2
A3 --> B2
B1 --> C1
B1 --> C2
B2 --> C1
B2 --> C2
B2 --> C3
C1 --> D1
C2 --> D1
C3 --> D1
D1 --> D2
D1 --> D3
D2 --> E1
D2 --> E2
D2 --> E3
D3 --> E2
style B2 fill:#fd2,stroke:#333
style D1 fill:#2d8,stroke:#333
style E3 fill:#f66,stroke:#333
```
In file 04, each of the 8 scenarios specifies:
- Which input branches it uses (golden / replay / stress, often two of three)
- Which primitive (A, B, C, D) drives its decision gate
- The exact paired-metric thresholds for that scenario
- The mermaid diagram of its specialized pipeline
- 2 "real incident" sketches of what the offline test would have caught
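One way to keep those specializations declarative is to carry them as data the shared harness consumes. The dataclass and the US-04 values below are an illustrative sketch, not the schema file 04 mandates.

```python
from dataclasses import dataclass

@dataclass
class ScenarioSpec:
    story: str                  # e.g. "US-04"
    inputs: list                # subset of ["golden", "replay", "stress"]
    primitive: str              # "A", "B", "C", or "D" (the backbone primitive)
    cost_thresholds: dict       # cost metric -> must-improve floor
    quality_thresholds: dict    # quality metric -> must-not-regress ceiling

# Illustrative values only; the real thresholds live in file 04's US-04 section.
US_04 = ScenarioSpec(
    story="US-04",
    inputs=["replay", "stress"],
    primitive="D",
    cost_thresholds={"dollars_per_task_hour_reduction": 0.35},
    quality_thresholds={"p99_latency_regression_max": 0.15},
)
```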
Three Real Incidents That Motivated This Foundation
Three concrete failures from cross-functional MangaAssist work shape the design above. Each one reveals why a single primitive isn't enough.
Incident 1: The Template Router That Cost Us Conversion
What happened (US-01 territory): Template-first router was deployed with the rule "intent == chitchat → respond with welcome template." Quality dashboards stayed green. CSAT stayed green. After three weeks, a cohort analysis showed guest-to-authenticated conversion was down 4.2 percentage points for users whose first message was classified as chitchat with confidence 0.6–0.8.
Root cause: Borderline-confidence chitchat messages were often real shopping inquiries phrased casually ("hey what's something good"). The template killed the conversation before any product appeared.
What offline testing would have caught it: Decision-equivalence (Primitive B) sliced by confidence band — not just by intent class. The offline harness was checking intent == chitchat agreement at 96%, but at confidence 0.6–0.8 the agreement with "what would Sonnet have done" was only 71%. We had the data; we weren't slicing by it.
Lesson baked into file 04: Every decision-equivalence test in file 04 specifies confidence-band slicing, not just label agreement.
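A sketch of the slicing the harness was missing, assuming each replay record carries the classifier confidence alongside the template and LLM decisions; the band edges and field names are illustrative.

```python
BANDS = [(0.0, 0.6), (0.6, 0.8), (0.8, 1.0)]

def band_of(confidence):
    for lo, hi in BANDS:
        if lo <= confidence <= hi:
            return (lo, hi)
    raise ValueError(f"confidence out of range: {confidence}")

def agreement_by_confidence_band(records):
    """records: dicts with 'confidence', 'template_decision', 'llm_decision'."""
    stats = {band: [0, 0] for band in BANDS}       # band -> [agreements, total]
    for r in records:
        band = band_of(r["confidence"])
        stats[band][1] += 1
        stats[band][0] += int(r["template_decision"] == r["llm_decision"])
    return {band: agree / total for band, (agree, total) in stats.items() if total}
```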
Incident 2: The Cache That Saved Money And Then Lost Money
What happened (US-03 territory): Semantic response cache shipped with 0.95 cosine threshold and 24h TTL on faq intent. Hit rate hit 22%, savings tracked at $43K/mo, dashboard green for 6 weeks. Then a return-policy update rolled out. Customers were seeing the old policy from cache for the next 24 hours. Three escalations, one PR-visible angry tweet.
Root cause: Event-driven invalidation was specified for product/promo updates but not for policy updates. The cache had no invalidation hook on the policy publishing pipeline.
What offline testing would have caught it: Stress/saturation simulation (Primitive D) specifically for upstream-content-change scenarios. The test "publish a new policy doc; assert cached responses on related queries are invalidated within 60s" was missing.
Lesson baked into file 04: Every cache scenario in file 04 has an upstream-change stress test, not just a hit-rate counterfactual.
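A sketch of the missing test, written against hypothetical test doubles (`cache`, `publish_policy_update`, `answer`) standing in for the real cache client and publishing pipeline; the 60-second SLO is the assumption named above.

```python
import time

def test_policy_update_invalidates_cached_answers(cache, publish_policy_update, answer):
    query = "What is your return policy for damaged volumes?"
    answer(query)                                  # first answer populates the semantic cache
    assert cache.lookup(query) is not None

    publish_policy_update("returns-policy-v2")     # simulate the upstream content change

    deadline = time.time() + 60                    # invalidation SLO assumed: within 60s
    while time.time() < deadline and cache.lookup(query) is not None:
        time.sleep(1)
    assert cache.lookup(query) is None, "stale policy answer still cached after 60s"
```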
Incident 3: The Spot Interruption That Killed In-Flight Streams
What happened (US-04 territory): Fargate Spot rolled out at 70% off-peak. Spot interruption rate measured 3% (well under 5% target). One Sunday at 4am JST, AWS reclaimed 8 Spot tasks simultaneously during a brief regional reshuffle. The 2-minute SIGTERM was respected, but the SIGTERM handler only drained REST traffic — WebSocket streams were dropped mid-stream. 312 users saw their conversation stop mid-response.
Root cause: The drain logic was tested with synthetic REST load. WebSocket connections were never tested under SIGTERM in CI.
What offline testing would have caught it: Stress/saturation (Primitive D) with the chaos injector configured for SIGTERM mid-WebSocket-frame. The dataset for this test needs active connections, not request/response payloads.
Lesson baked into file 04: Every compute/Spot scenario in file 04 has a chaos test that specifically targets the connection-state-mid-operation failure surface, not just the request-completion path.
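A sketch of the missing chaos case, assuming harness helpers that can open a streaming session and deliver the SIGTERM to its Fargate task; the helper names and the graceful close-code assertion are illustrative.

```python
def test_sigterm_drains_active_websocket_streams(start_streaming_session, send_sigterm_to_task):
    session = start_streaming_session()        # open a WebSocket and begin a streamed reply
    assert session.is_streaming()

    send_sigterm_to_task(session.task_id)      # simulate the 2-minute Spot reclaim notice

    result = session.wait_for_completion(timeout_seconds=120)
    assert result.completed, "stream dropped mid-response during drain"
    assert result.close_code == 1000, "WebSocket closed abnormally instead of draining gracefully"
```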
The Five Things This Folder Will Not Solve
Cost-optimization offline testing is necessary but not sufficient. Be honest about the seam.
- Real OCU autoscale behavior — OpenSearch Serverless minimum capacity, autoscale cadence, and noisy-neighbor effects in a shared region cannot be simulated; only measured in production with controlled traffic.
- Real Spot interruption rate over 90 days — 30-day Spot interruption distribution varies by region, instance type, and AWS capacity pressure. The offline test proves the handler works; the rate is a production metric.
- Real DDB on-demand vs provisioned breakeven — depends on traffic shape over weeks. Offline can simulate; the breakeven proof needs a billing cycle.
- Real semantic-cache wrong-answer rate at scale — 0.5% wrong-answer rate on 5K offline samples extrapolates to 5K wrong answers per million. The actual production wrong-answer rate is what shadow mode measures.
- Real circuit-breaker engagement rate over a quarter — designed to be "0 trigger / month." Whether it stays at 0 depends on actual cost trajectory; the test proves the breaker works when needed, not how often it's needed.
The handoff from offline to shadow to canary to production is a relay race. This folder is about running the offline leg cleanly. The runbooks for shadow, canary, and rollback live in each US-XX file's "Rollback & Experimentation" section.
How To Use This Folder
| You are... | Read this... |
|---|---|
| New to the cost-opt stories | First read Cost-Optimization-User-Stories/README.md, then file 01 in this folder, then this file (03), then file 04 |
| Designing the offline test for a specific story (e.g., US-04) | Skip to file 04, find the US-04 section; refer back to this file for the primitive being used |
| Preparing for an Amazon ML/AI Engineer loop | Read this file (03) for vocabulary, then 05-ml-ai-engineer-grill-chains.md for the per-scenario interview prep |
| Preparing for an Amazon MLOps Engineer loop | Read this file (03) for vocabulary, then 06-mlops-engineer-grill-chains.md |
| Doing a system-design interview where cost-at-scale will be probed | Read this file plus 07-cross-cutting-system-grill.md |
Bottom Line
Cost-optimization offline testing rests on three commitments:
- Pair every cost metric with a quality metric, and gate on both. Cost-only gates ship "cheap and wrong" code.
- Use the right primitive for the failure mode. Counterfactual replay catches average-case regressions; decision-equivalence catches routing drift; cost-aware golden catches silent inflation; stress simulation catches catastrophic-condition holes.
- Be honest about what offline can't prove. Five conditions in the list above only prove themselves in production. The offline harness must be confident enough to enter shadow mode with low risk, not confident enough to skip shadow mode.
The next file (04-scenario-deep-dives-per-cost-story.md) applies these primitives one US story at a time.