07. Cross-Cutting System Grill — Cost Optimization at Amazon-Loop Depth
This file contains 6 system-level questions that span multiple cost-optimization scenarios. These are the questions where a candidate is expected to think across the 8 user stories simultaneously — the kind of problem an Amazon principal/staff loop probes to separate strong system thinkers from people who can answer one scenario well.
Each question is tagged [ML/AI], [MLOps], or [Both] based on which engineering lens it primarily exercises. Some are deliberately [Both] because the answer requires synthesizing both perspectives.
Format mirrors files 05 and 06: Opening + 4 grill rounds + 3 architect-level + Intuition Gained.
Cross-Cutting Q1: Compounding Savings vs. Compounding Risk [Both]
Opening Question
Q: You've shipped all 8 cost optimizations. Each one ships clean — its own offline tests, its own canary, its own SLO. The aggregate savings target is 40-70% of monthly spend. From a system-thinking perspective, what's the failure mode that scares you the most?
Round 1 Answer: Correlated failure across stories. Each story is tested in isolation; production runs them together. The scariest scenario: a single upstream perturbation (e.g., a Bedrock outage, a region-wide network event, a viral manga release) triggers degraded behavior in 3-4 stories simultaneously. Each individual mechanism behaves correctly per its design; their combined behavior creates a state nobody designed for. The offline tests don't cover the combination; production discovers it.
Round 1 — Surface
Follow-up: Give me a concrete example of correlated failure.
A viral release event:
- US-06 (RAG): index doesn't have the new title → low Recall@3 → reranker skip threshold (calibrated for the old distribution) over-fires → low-quality top-1 → LLM hallucinates volume counts.
- US-01 (LLM): prompt compression happens to be aggressive on recommendation intent → less context → LLM has even less ground to stand on → hallucination rate climbs.
- US-03 (Cache): cache wasn't warm for the new title → 100% miss rate on those queries → downstream services spike → catalog service becomes slow.
- US-02 (Intent): rules don't match the new release's query phrasing yet → SageMaker handles 100% → SageMaker auto-scaled-down for off-peak doesn't keep up → cold starts.
- US-08 (Breaker): cost spike triggers cost circuit breaker → degrades to Haiku-only → Haiku makes the hallucination problem worse.
Five stories, all "working correctly per their design," producing a much worse user experience than any single failure would. The breaker even amplifies the problem because Haiku is worse at edge cases.
Round 2 — Push Harder
Follow-up: How do you offline-test for compounding failures across 8 stories?
The naive approach is exhaustive combination testing — every on/off configuration of the 8 stories is 2^8 = 256 combinations. Untenable.
The pragmatic approach: scenario-based combination testing. Identify 5-10 high-risk scenarios that exercise multiple stories and test those combinations explicitly:
| Scenario | Stories activated together | What it tests |
|---|---|---|
| Viral new release | US-01, US-02, US-03, US-06 | Cold-cache + un-indexed content + new query patterns |
| Black Friday spike | US-03, US-04, US-07, US-08 | Cache pressure + compute scaling + analytics burst + cost ceiling |
| Multi-day outage of upstream catalog | US-03, US-06 | Stale cache + stale RAG index |
| Embedder model upgrade | US-03, US-06 | Cache + RAG calibration both affected |
| AWS region degraded | US-04, US-08 | Spot interruptions + degradation ladder both engaged |
Each scenario gets a counterfactual replay (file 03 Primitive A) with all relevant stories enabled. Measure aggregate quality and aggregate cost.
The scenarios are not exhaustive but cover the modes where stories interact. The offline harness can run them weekly; expensive but tractable.
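To make the scenario harness concrete, here is a minimal sketch of how the scenario matrix could be encoded and replayed. The `Scenario` fields, the `harness.replay` call, and the metric names are illustrative assumptions, not an existing API.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    name: str
    stories_enabled: list[str]   # e.g. ["US-01", "US-03", "US-06"]
    replay_dataset: str          # captured traffic for this scenario

SCENARIOS = [
    Scenario("viral_new_release", ["US-01", "US-02", "US-03", "US-06"], "replays/viral.jsonl"),
    Scenario("black_friday_spike", ["US-03", "US-04", "US-07", "US-08"], "replays/black_friday.jsonl"),
    Scenario("catalog_outage", ["US-03", "US-06"], "replays/catalog_outage.jsonl"),
]

def run_scenario(scenario: Scenario, harness) -> dict:
    """Replay one high-risk scenario with all relevant stories enabled together."""
    flags = {story: True for story in scenario.stories_enabled}
    # Counterfactual replay (file 03 Primitive A) against the scenario's traffic.
    result = harness.replay(scenario.replay_dataset, flags=flags)
    return {
        "scenario": scenario.name,
        "aggregate_cost_per_session": result.cost_per_session,
        "aggregate_quality": result.quality_score,   # paired quality metric
    }
```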
Round 3 — Squeeze
Follow-up: Your scenario tests show that the cost-circuit-breaker engaging during a viral release makes things worse. What's the architectural fix?
Two-part fix:
- Quality-aware cost decisions. The breaker shouldn't engage purely on cost; it should engage on cost trajectory AND quality trajectory. If quality is already degraded (Recall@3 dropping, hallucination climbing), engaging Haiku-only compounds the quality problem. Better to absorb the cost overrun temporarily and emit a high-priority alert for human review than to make the user experience worse.
- Per-intent quality-aware degradation. When degrading, don't degrade intents that are already showing quality issues. The breaker's degradation matrix becomes:
For each (tier, intent) cell:
if intent's current quality is already below SLO floor:
do NOT degrade further
else:
apply degradation per the policy
The system insight: cost protection should not amplify a quality problem. The breaker becomes context-aware: it acts on cost only when the system is otherwise healthy.
This is the kind of design decision that emerges from cross-scenario thinking; you can't design it from US-08 alone.
Round 4 — Corner
Follow-up: You can't test all combinations. What signal in production tells you a compounding failure is happening?
A cross-story coherence signal: a per-user user-experience score, computed from request-level telemetry. Inputs:
- Re-ask within session (proxy for misroute or wrong answer).
- Escalation rate.
- Response latency p99.
- Hallucination flags from the post-generation auditor.
- Format compliance.
These are aggregated per-user-per-session. A compounding failure shows up as a distribution shift: not "more users have one type of failure" but "average user has multiple types of failure simultaneously." The signal is a multi-dimensional anomaly, not a single-dimensional alarm.
Detection: anomaly model (or simple statistical test) on the user-session-quality vector. If the cross-correlation of bad signals climbs above baseline, alert with "potential compound failure" diagnosis.
The pattern: single-dimensional alarms catch single failures; multi-dimensional anomaly detection catches compound failures. The latter is what an experienced system would want when 8 cost optimizations are active simultaneously.
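A minimal sketch of that compound-failure signal, assuming each session is reduced to a binary vector of the five bad-signal flags above; the field names and the 4-sigma threshold are illustrative.

```python
import numpy as np

# Columns of the per-session bad-signal vector (0/1 flags).
QUALITY_FLAGS = ["re_ask", "escalated", "latency_p99_breach", "hallucination_flag", "format_violation"]

def compound_failure_score(session_vectors: np.ndarray) -> float:
    """session_vectors: shape (n_sessions, len(QUALITY_FLAGS)), binary flags per session."""
    bad_dims_per_session = session_vectors.sum(axis=1)
    # Compound signature: the *same* session showing 2+ bad dimensions at once.
    return float((bad_dims_per_session >= 2).mean())

def is_compound_failure(current_window: np.ndarray, baseline_scores: np.ndarray, n_sigma: float = 4.0) -> bool:
    """baseline_scores: compound_failure_score computed per window over the last ~30 days."""
    mu, sd = baseline_scores.mean(), baseline_scores.std()
    return compound_failure_score(current_window) > mu + n_sigma * sd
```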
Architect-Level Escalation
A1 [Both]: Build me a cross-story regression dashboard. What does it show?
A 3-section dashboard:
SECTION 1 — Per-Story Health (8 cells, one per US)
Each cell: cost-savings %, quality-regression %, lever-engagement %, alarm count
Color: green / yellow / red
SECTION 2 — Cross-Story Correlation (heatmap)
Rows: stories. Columns: stories.
Cell value: Pearson correlation of "story X showed regression" with "story Y showed regression" over the past 30 days.
High off-diagonal correlation = compound failure mode.
SECTION 3 — User Experience Composite (time series)
Y-axis: composite user-experience score.
X-axis: time, last 30 days.
Overlays: incident markers, deploy markers.
The dashboard is what an SRE on-call sees during a quarterly cost-system review. Section 1 is daily reading; Section 2 is monthly analysis; Section 3 is the leading indicator.
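A sketch of the Section 2 computation, assuming a daily 0/1 regression flag per story is already available; the DataFrame layout is an assumption.

```python
import pandas as pd

def cross_story_correlation(daily_regressions: pd.DataFrame) -> pd.DataFrame:
    """daily_regressions: one row per day (last 30 days), one 0/1 column per story
    (1 = that story showed a cost or quality regression that day)."""
    # Pairwise Pearson correlation; high off-diagonal values point at compound failure modes.
    return daily_regressions.corr(method="pearson")

# Example input shape: index = last 30 dates, columns = ["US-01", ..., "US-08"], values in {0, 1}.
```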
A2 [ML/AI]: How does the system learn about compound failures over time so the response improves?
A feedback loop: every incident gets tagged with "which stories were in degraded states at incident time?" Over months, this creates a dataset of (state vector, incident severity) pairs.
A simple model (logistic regression or gradient-boosted tree) on this dataset learns which state-vector patterns predict incidents. The model becomes a predictive risk indicator: "given current per-story states, P(incident in next 60 min) = X." When P > 0.5, pre-emptively engage less aggressive degradation, alert SRE for review.
This is ML in service of SRE — the system learns its own failure patterns. The cost: a small dataset, a small model, monthly retraining. Nothing exotic.
A3 [MLOps]: The org wants quarterly "cost-system reliability reports." What's in them?
Five sections:
- Cost performance: actual savings vs. target per story, with explanation of variances.
- Quality performance: paired-quality SLO performance per story, with breaches noted.
- Incident retrospective: every incident in the quarter, root-caused, with stories involved.
- Compound-failure analysis: cross-story correlation heatmap, leading indicators.
- Forward plan: changes proposed for next quarter (new optimizations, retired optimizations, threshold adjustments).
The audience: leadership, finance, engineering. The cadence: quarterly, ~30 pages, presented in a 60-min review.
The principle: cost optimization is a long-running program, not a project. It needs program-style reporting, not project-style.
Intuition Gained — Cross-Cutting Q1
The core insight: 8 cost optimizations operating together is not 8 separate systems; it's one system with 8 levers. The failure modes that matter are the cross-lever ones.
Mental model:
"Each story tests its own design. The system tests how the designs interact under perturbation."
Cross-Cutting Q2: Offline-Online Correlation Calibration [ML/AI]
Opening Question
Q: Your offline cost predictions don't match production cost. Per-session offline estimate is $0.008; production measures $0.011. From an ML perspective, how do you reason about this gap?
Round 1 Answer: The gap is a distribution mismatch between offline replay and production traffic + measurement-time offset. The replay dataset is stratified by intent but probably under-represents the long tail (multi-turn deep-history sessions, complex recommendation queries). Production cost is dominated by the long tail. The 38% gap is the long-tail under-representation. Fix: oversample the long tail in the replay set, plus run the offline-online calibration explicitly each quarter.
Round 1 — Surface
Follow-up: How do you measure offline-online correlation rigorously?
Monthly procedure:
- Sample 2K production sessions randomly (not stratified — random).
- Re-run those sessions through the offline harness with the same configuration.
- Per-session, compare: offline cost estimate vs. actual production cost.
- Compute the per-session ratio `production_cost / offline_estimate`.
- Aggregate: mean, median, 90th percentile of the ratio.
If the median ratio is 1.0 ± 0.1, calibration is healthy. If 1.3 (production is 30% more), offline is under-estimating; investigate per-intent. If 0.7 (production is 30% less), offline is over-estimating; less common but possible.
Then per-intent slicing: which intents have the largest ratio drift? Those are where offline doesn't match production.
The correlation r (Pearson, between offline-estimate and production-cost) should be > 0.85. If it's 0.6, offline isn't predictive at all — the harness is broken.
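A sketch of the monthly calibration computation, assuming the 2K sampled sessions have been joined into one table with `offline_estimate`, `production_cost`, and `intent` columns (the column names are illustrative).

```python
import pandas as pd

def calibration_report(df: pd.DataFrame) -> dict:
    """df: one row per replayed production session."""
    ratio = df["production_cost"] / df["offline_estimate"]
    return {
        "ratio_mean": ratio.mean(),
        "ratio_median": ratio.median(),          # healthy: 1.0 +/- 0.1
        "ratio_p90": ratio.quantile(0.90),
        # Predictiveness of the harness: expect r > 0.85.
        "pearson_r": df["offline_estimate"].corr(df["production_cost"]),
        # Per-intent slicing: where does offline diverge most from production?
        "ratio_by_intent": ratio.groupby(df["intent"]).median().sort_values(ascending=False),
    }
```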
Round 2 — Push Harder
Follow-up: Per-intent ratio shows `recommendation` is 1.6x off (offline says $0.005, production says $0.008). What's the next investigation?
Three hypotheses, in order of likelihood:
- Offline replay doesn't include real RAG calls. Offline mocks the RAG service; production calls it for real. RAG calls in production are slower and trigger more LLM thinking time. Fix: include real RAG (or higher-fidelity RAG simulation) in the offline harness.
- Conversation context length differs. Offline replay sessions are shorter (we sampled them) than production sessions (which include all turns up to that point). Longer context = more input tokens. Fix: replay full sessions, not just samples.
- Production has retries we don't model. Bedrock occasionally returns 5xx; production retries; that's 2x cost on the retry. Offline doesn't simulate. Fix: include retry overhead in the cost model (e.g., +5% buffer).
For each hypothesis, run a controlled offline experiment to confirm. The one with the largest contribution gets the fix.
The deeper insight: offline cost modeling is itself a model. It has assumptions, biases, and gaps. Validate it against production regularly.
Round 3 — Squeeze
Follow-up: After fixing, ratio is 1.05 — close enough. But three months later, it's 1.4 again. What changed?
Distribution shift in production traffic without a corresponding shift in the offline dataset. Possible causes:
- New user segment (e.g., a marketing campaign brought more guest users; guests have different query distributions).
- Product mix shift (e.g., a new manga release shifted intent distribution toward `recommendation`).
- Seasonal pattern (school holidays, year-end shopping).
- Pricing change in Bedrock that we updated in production but not in the offline cost model.
Mitigations:
- Quarterly re-stratification of the offline replay dataset from a fresh production sample.
- Pricing model in offline harness is sourced from a single config that's updated when AWS pricing changes (don't hardcode prices in the test code).
- Drift alarm: monthly automated calibration check; alarm if ratio shifts > 20% month-over-month.
The pattern: offline and online are in contract; the contract is calibration. Without active calibration, they drift apart silently.
Round 4 — Corner
Follow-up: Offline says US-01 will save 50% of LLM cost. After production rollout, you measure 32%. What's the post-mortem?
Four candidate explanations:
- Lever didn't fire as expected. Template bypass rate measured 22% in prod vs 30% offline. Why? Investigate per-intent template eligibility.
- Quality contract forced the lever to back off. Maybe in production we found more borderline cases and raised the confidence floor on the template router, reducing bypass rate.
- Compounding with another change. Maybe a feature shipped between the offline test and the production rollout that changed the traffic the optimization touches.
- Offline overestimated savings. Maybe the 30% template bypass was on a sample where templates were unusually applicable; production has lower applicability.
Investigation order: compare offline replay results against production lever-engagement metrics. Whichever differs most is the explanation.
The expected-delivery vs. actual-delivery analysis is the post-mortem of the cost optimization itself. It's normal for actual to be 70-80% of projected; if it's < 50%, something material was wrong with the offline projection.
Architect-Level Escalation
A1: How do you set realistic cost-savings targets in PRDs?
Two patterns:
- Range, not point: "40-60% savings target." Captures the offline-projection uncertainty.
- Confidence-weighted: "30% savings (high-confidence) + 20% savings (medium-confidence)." Each component has a calibration backing.
Avoid: single-point targets backed by single-replay results. The variance is too high to commit to a point.
A2: When does the offline harness need a rebuild vs. a tuneup?
Tuneup signs: ratio drift < 20%, fixable with re-sampling and re-calibration.
Rebuild signs:
- Ratio drift > 50% on a single intent.
- Multiple stories diverge simultaneously (suggests an architectural change in production that isn't modeled).
- The harness can't be augmented to include a new dimension (e.g., multi-modal queries).
A rebuild is a multi-month project. The signal is when tuning is no longer producing improvement.
A3: How do you organizationally make sure the offline-online calibration happens?
Single owner: the cost-engineering team (per file 06's cross-cutting Q3). Calibration is a recurring task on their backlog, monthly cadence, with a written handoff if owners change.
If no cost-engineering team exists, calibration falls on whichever team owns the most affected stories — usually rotating ownership leads to skipped quarters.
The principle: calibration without an owner doesn't happen. Assign it explicitly.
Intuition Gained — Cross-Cutting Q2
The core insight: Offline cost prediction is a model of production. Like any model, it has bias and variance. Calibration is the active practice that keeps the model useful.
Mental model:
"The offline-online ratio is the metric that the cost program lives by. If you're not measuring it monthly, the program is operating on faith."
Cross-Cutting Q3: Cost SLO Breach vs. Quality SLO Breach [Both]
Opening Question
Q: It's 3am. Cost SLO is at 105% of daily budget. Quality SLO is at 98% (one point under floor). Which do you fix first, and why?
Round 1 Answer: Quality first, always. Cost overrun is a finance problem with a billing-cycle horizon. Quality regression is a customer problem with an immediate experience-degrading impact and downstream churn risk. The order is: stop the quality bleeding (engage degradation that doesn't worsen quality, e.g., disable an aggressive optimization that's causing the quality drop), then assess cost. If cost is still over after quality is restored, that's the next conversation.
Round 1 — Surface
Follow-up: But the cost circuit breaker is designed to engage when cost breaches budget. So it engaged, and now quality is dropping further. What now?
The breaker engaged because cost was 105%; engaging the breaker (Haiku-only) reduced quality. Now quality is 95% — under floor. The system entered a state where cost is being protected at the expense of quality.
The right action: disengage the breaker temporarily. Accept the cost overrun for the next 4-8 hours while you:
- Diagnose the cost driver (was it a real spike, or a metric glitch?).
- Restore quality to floor.
- Re-engage the breaker only when both signals are healthy.
This requires the breaker to have a manual override that's used during incidents. Without override, the system auto-protects cost at all costs, including unhealthy quality.
The principle: safety systems need a manual disengage path. Auto-engagement is good; auto-engagement-without-override is brittle.
Round 2 — Push Harder
Follow-up: Manual override sounds like asking for trouble. How do you build it without making it a back door?
Three properties:
- Logged and time-bounded: every override records an `engineer`, `reason`, and `expires_at` (default 4 hours). Auto-reverts on expiry.
- Limited blast radius: an override can degrade less, but cannot fully disable the cost protection. There's a hard ceiling (e.g., spend cannot exceed 130% of budget regardless of override).
- Audit trail: every override is reviewed in the next on-call handoff and the next weekly cost meeting.
The override is for temporary unblocking during diagnosis, not for "we don't trust the breaker." If overrides happen monthly, the breaker design is wrong — fix the design, don't normalize the override.
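A minimal sketch of the override record, assuming overrides are stored as explicit objects with expiry baked in; the class and field names are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

HARD_SPEND_CEILING_PCT = 130  # absolute ceiling; an override can never raise this

@dataclass
class BreakerOverride:
    engineer: str
    reason: str
    created_at: datetime
    expires_at: datetime

    @classmethod
    def create(cls, engineer: str, reason: str, ttl_hours: int = 4) -> "BreakerOverride":
        now = datetime.now(timezone.utc)
        return cls(engineer, reason, now, now + timedelta(hours=ttl_hours))

    def is_active(self) -> bool:
        # Auto-revert: once expired, the breaker resumes normal auto-engagement.
        return datetime.now(timezone.utc) < self.expires_at
```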
Round 3 — Squeeze
Follow-up: How do you make the cost-vs-quality tradeoff explicit in the system design?
Each story's design includes a tradeoff document that states:
- The cost reduction the story delivers.
- The maximum quality regression the story accepts.
- The conditions under which the story should be disabled (cost too high, quality too low).
- The owner who decides on conflicts.
Together, the 8 tradeoff documents form a cost-quality contract for the system. Conflicts get resolved per the documented owners; new optimizations must be justified against the contract.
The pattern: tradeoffs must be documented to be debatable. Implicit tradeoffs become legacy decisions nobody remembers making.
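An illustrative shape for one such tradeoff document, expressed as data a gate or reviewer could read; every value is a placeholder, not a real story's contract.

```python
# Placeholder structure of a per-story tradeoff document.
EXAMPLE_TRADEOFF_DOC = {
    "story": "US-0X <optimization name>",
    "cost_reduction_delivered": "X% of the relevant spend line",
    "max_accepted_quality_regression": "paired quality metric floor, e.g. -0.5 pt CSAT",
    "disable_conditions": [
        "quality metric below floor for more than 1 hour",
        "realized savings below threshold for a full week",
    ],
    "conflict_owner": "cost-engineering program owner",
}
```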
Round 4 — Corner
Follow-up: Finance says "the cost overrun is unacceptable; we need stricter cost protection." Engineering says "the quality regression from stricter protection is unacceptable; we need looser cost protection." Both are correct. How do you resolve?
This is a real-world tension that doesn't have a technical solution alone. The resolution is:
- Quantify the tradeoff curve: for the system, plot cost-protection-strength on x-axis and customer-perceived-quality on y-axis. Show finance the curve.
- Escalate the decision to the level that owns both the cost line and the customer experience (often a director or VP).
- Document the chosen point on the curve as a written contract: "We will protect cost up to X, accepting Y quality regression. Beyond that, we accept cost overrun."
- Build the system to honor the contract: the breaker engages at X; if cost continues climbing past X, the breaker holds (doesn't degrade further) and emits an alert for human review.
The principle: cost-vs-quality is a business decision, not a technical one. The technical role is to make the curve visible and execute the chosen point. The business role is to choose the point.
Architect-Level Escalation
A1 [Both]: Design a system that adaptively shifts the cost-quality balance based on time-of-day and traffic source.
Two-axis policy:
Time-of-day:
Off-peak (low traffic, low revenue): tighter cost protection (engage at 70%).
Peak (high traffic, high revenue): looser cost protection (engage at 90%).
Traffic source:
Prime users: minimal degradation; quality > cost.
Auth users: moderate degradation acceptable.
Guest users: aggressive degradation acceptable (already on lite).
Combined policy:
for each (tier, time-of-day) cell:
cost_protection_strength: function of revenue value
This is the operating policy. Each cell has its own cost-quality contract.
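A sketch of the combined policy as a lookup table; the per-cell thresholds are illustrative extrapolations of the numbers above, not a committed policy.

```python
# (user_tier, period) -> fraction of daily budget at which the breaker engages.
COST_PROTECTION_POLICY = {
    ("prime", "peak"):     0.95,
    ("prime", "off_peak"): 0.90,
    ("auth",  "peak"):     0.90,
    ("auth",  "off_peak"): 0.80,
    ("guest", "peak"):     0.80,
    ("guest", "off_peak"): 0.70,
}

def engagement_threshold(user_tier: str, period: str) -> float:
    # Default to the most conservative cell if a combination is missing.
    return COST_PROTECTION_POLICY.get((user_tier, period), 0.70)
```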
A2 [MLOps]: How do you avoid escalation fatigue when the cost-quality system is sensitive?
Three patterns:
- Tier alarms by severity. Most cost issues are warning-level (Slack); only true incidents page on-call.
- Suppress correlated alarms. If 5 stories alarm because of one upstream issue, send one incident alert with all 5 listed, not 5 separate alerts.
- Quarterly alarm review. Audit which alarms fired vs. which were actionable; tune ratios. Aim for ≥ 80% actionable.
Without these, the on-call drowns in noise and learns to ignore real signals.
A3 [ML/AI]: How would you build an ML model that predicts cost-quality conflicts before they happen?
Inputs: per-story state (lever engagement, recent quality metrics, recent cost trajectory), upstream signals (Bedrock latency, RAG availability), traffic shape (RPS, intent distribution).
Output: P(cost-quality conflict in next hour).
Training data: historical incidents tagged with state vectors at incident time + control sample of normal states.
Model: gradient-boosted tree (interpretable, simple). Features ranked by importance show what to monitor.
Deployment: model runs every 10 min on current state; if P > 0.5, pre-emptively reduce cost-protection strength to give quality more headroom.
This is "ML for SRE" — small, interpretable, drives action. Not exotic.
Intuition Gained — Cross-Cutting Q3
The core insight: Cost-quality tradeoffs are business decisions encoded into operational systems. The technical role is to make them visible, executable, and reversible.
Mental model:
"Cost protection that worsens quality is not protection; it's collateral damage. Engineer for the case where cost and quality aren't fungible."
Cross-Cutting Q4: Designing the Cost-Regression CI Gate [MLOps]
Opening Question
Q: Every PR runs the cost-aware golden (Primitive C). What does the gate look like in practice — what blocks a PR, what warns, what passes silently?
Round 1 Answer: The gate has three levels: fail (max single-query cost regression > 25%), warn (aggregate p95 cost regression > 10%), pass (everything within ±10% of baseline). Failures block merge. Warnings post a comment on the PR with details but don't block; the author decides. Passes are silent (no comment). The discipline: failures must be fixed or explicitly opted out (with a documented reason). Warnings accumulate as soft signals; if 5 PRs in a row warn, the cumulative drift becomes a fail and someone investigates.
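A sketch of the gate decision itself, using the thresholds above; the two regression inputs are assumed to come from the cost-aware golden run.

```python
from enum import Enum

class GateResult(Enum):
    FAIL = "fail"   # blocks merge
    WARN = "warn"   # PR comment with details, author decides
    PASS = "pass"   # silent

def evaluate_gate(max_single_query_regression: float, p95_aggregate_regression: float) -> GateResult:
    """Regressions are fractions vs. baseline, e.g. 0.25 == +25% cost."""
    if max_single_query_regression > 0.25:
        return GateResult.FAIL
    if p95_aggregate_regression > 0.10:
        return GateResult.WARN
    return GateResult.PASS
```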
Round 1 — Surface
Follow-up: How do you handle the false-positive rate on the gate?
False positives erode trust. Two mitigations:
- Calibrate the gate against historical PRs. Replay the gate against last 100 merged PRs. Aim for false-positive rate ≤ 5%. If higher, raise thresholds.
- Provide per-query detail in the comment. If the gate fails, the author sees exactly which queries regressed and by how much. Often the regression is "test query GD-187 went from 1,200 tokens to 1,440 tokens — investigate" — much more actionable than "your PR regressed cost."
If the false-positive rate climbs, root-cause: was it dataset staleness (most common), genuine cost climbing (track it), or threshold too tight (re-tune)?
Round 2 — Push Harder
Follow-up: How do you keep the cost-aware golden dataset itself from rotting?
Three practices:
- Quarterly re-baseline: re-run the dataset against current main, update `expected_cost_band` to the new median + IQR. Drift > 10% is normal; drift > 50% needs investigation.
- Quarterly query refresh: replace 10% of dataset queries with fresh production samples. Retire the oldest 10%. Keep the dataset evolving with traffic.
- Annual schema review: do the `expected_*` fields still capture what we care about? Add fields for new dimensions (e.g., a new model tier added to the system).
Without quarterly maintenance, the dataset becomes a museum exhibit: nostalgic but irrelevant.
Round 3 — Squeeze
Follow-up: A PR adds a new feature that legitimately increases cost (it's a quality improvement). How does the author bypass the gate?
Documented opt-out:
- PR description includes a `cost-regression-justification:` field with explanation, expected cost delta, and quality delta.
- The CI gate parses this field: if present, the gate fails with a "manual review required" status, not "blocked."
- A senior engineer (cost-engineering owner) reviews and approves the justification. Sign-off is recorded.
- The new baseline is re-set for affected queries. The `expected_cost_band` is updated.
The opt-out is intentionally friction-bearing: not impossible, but visible. This prevents the gate from being routinely bypassed, while allowing legitimate exceptions.
The principle: CI gates without an opt-out are brittle; opt-outs without friction are useless. Calibrate the friction.
Round 4 — Corner
Follow-up: An emergency production fix needs to ship in 2 hours. The cost gate fails. What's the override path?
Emergency override:
- An `emergency-override` PR label can be applied by a senior engineer.
- The override skips the cost gate but still runs all other tests.
- An automatic JIRA ticket is created for post-incident cost-regression review within 7 days.
- The next PR touching the same code area must address the cost regression or document why it's accepted.
Emergency overrides are tracked: if a team uses them more than 1/month, it's a process smell — investigate why emergencies are bypassing cost discipline.
The pattern: escape hatches exist; using them creates obligations. Override → ticket → review → resolution. The cycle keeps the override from becoming a permanent bypass.
Architect-Level Escalation
A1 [MLOps]: Build a CI gate that doesn't require running real LLM calls.
Mock Bedrock at the test level:
class MockBedrockClient:
    # count_tokens, synthesize_output_tokens, and SYNTHETIC_RESPONSE_BY_INTENT are
    # the harness helpers described below.
    def invoke(self, model_id, prompt, intent_label, **kwargs):
        # Don't actually call Bedrock
        input_tokens = count_tokens(prompt)  # tiktoken or similar, per-model tokenizer
        # Return a deterministic synthetic response for this intent
        output_tokens = synthesize_output_tokens(model_id, intent_label)
        return {
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "response_text": SYNTHETIC_RESPONSE_BY_INTENT[intent_label],
        }
The mock counts tokens accurately (using tiktoken with the right tokenizer per model) and returns a synthetic response. This lets the cost gate measure prompt tokens precisely without paying for actual generation.
Limitation: the mock doesn't measure actual output tokens. Output tokens are ~30% of total cost for the chatbot; the gate measures the ~70% that is input tokens precisely and estimates the rest. For a more precise gate, periodically run the dataset with real Bedrock and update the per-intent output-token estimates.
Cost of the gate: $0 per PR run (mock). Cost of the periodic real-Bedrock run: ~$10/quarter. Acceptable.
A2 [MLOps]: How do you prevent the cost-regression CI gate from becoming a development bottleneck?
Five patterns:
- Gate runs in parallel with other CI, not sequentially. Total CI time isn't dominated by it.
- Gate runs in 2 minutes max by mocking expensive operations and parallelizing across queries.
- Failures include actionable details: which queries regressed, what changed, suggested fix.
- Authors can run the gate locally before pushing (`make cost-check`). Catches issues before CI rather than during.
- Common false-positive patterns are documented: when authors hit one, they don't waste time investigating; they look up the known pattern.
If authors complain about the gate, listen — the friction is real, and the gate's value depends on developer trust.
A3 [Both]: At what point should the cost-regression gate fail an entire PR vs. just one commit?
Granularity matters:
- Single PR: gate runs on the PR's net diff vs. main. If the diff regresses cost, PR fails.
- Batched PRs (cherry-pick releases): gate runs on the batch's net diff. Individual PRs may have offsetting effects.
- Long-running feature branches: gate runs on the branch's net diff vs. main. Catches divergence before merge.
For a feature branch active for weeks, the gate should run on every PR-into-the-feature-branch (catching the local regression) AND on the feature-branch-into-main (catching the cumulative regression). Both checks are valuable.
Intuition Gained — Cross-Cutting Q4
The core insight: Cost-regression CI gates are a cultural mechanism as much as a technical one. They establish "cost is a tested property, not an aspirational goal."
Mental model:
"If cost isn't gated at PR time, it regresses silently. The golden dataset is the floor; everything below it is regression."
Cross-Cutting Q5: Auditability — Proving Savings to Leadership [Both]
Opening Question
Q: Six months after rolling out all 8 stories, leadership asks: "How much did we actually save?" What's your answer methodology?
Round 1 Answer: The honest answer is counterfactual estimation, not "compare current bill to old bill." Traffic grew, prices changed, features shipped — direct year-over-year comparison conflates cost optimization with everything else. The methodology: at any point in time, compute the synthetic baseline cost = "what would we have spent at today's traffic with the pre-optimization architecture?" Compare to actual cost. The delta is the optimization saving. This requires keeping the cost model of the pre-optimization architecture current (just for this calculation), and re-running it against today's traffic shape.
Round 1 — Surface
Follow-up: That sounds expensive. How do you maintain the pre-optimization cost model?
It's a model, not running code. Inputs:
- Pre-optimization per-request token cost (pre-compression, pre-tiering, pre-cache).
- Today's request count by intent.
- Today's per-intent average tokens.
- Pre-optimization downstream-call rate (cache hit rate = 0%, full RAG always, full Sonnet always).
- AWS pricing today.
The model computes: "if we had today's traffic with the pre-optimization architecture, the cost would be $X." That's the baseline.
The model is ~200 lines of code. Updating it for new pricing or new intents is straightforward. Maintenance: ~1 day/quarter.
The output: a quarterly chart showing actual cost vs. counterfactual cost. The gap is the cost-program contribution.
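A miniature of that model, assuming per-intent pre-optimization token profiles and a shared pricing config; all field names are illustrative.

```python
def counterfactual_monthly_cost(traffic_by_intent: dict, pre_opt: dict, pricing: dict) -> float:
    """traffic_by_intent: intent -> request count this month.
    pre_opt: intent -> {"avg_input_tokens", "avg_output_tokens"} under the
             pre-optimization architecture (0% cache hits, full RAG, full Sonnet).
    pricing: today's per-1K-token prices for the pre-optimization model tier,
             read from the shared pricing config (never hardcoded)."""
    total = 0.0
    for intent, requests in traffic_by_intent.items():
        tokens = pre_opt[intent]
        per_request = (
            tokens["avg_input_tokens"] / 1000 * pricing["input_per_1k"]
            + tokens["avg_output_tokens"] / 1000 * pricing["output_per_1k"]
        )
        # Pre-optimization: every request pays full price (no cache, no tiering).
        total += requests * per_request
    return total
```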
Round 2 — Push Harder
Follow-up: How do you handle the question "but X% of the savings would have happened anyway from natural improvements"?
Decompose the saving into named contributors:
| Source | Estimated savings | How estimated |
|---|---|---|
| Template-first router (US-01) | 12% | Fraction of traffic bypassed × saved cost per bypass |
| Semantic cache (US-01) | 8% | Cache hit rate × saved cost per hit |
| Model tiering (US-01) | 10% | Haiku traffic × cost delta vs Sonnet |
| Caching (US-03) | 15% | Downstream calls saved × cost per call |
| RAG bypass (US-06) | 6% | RAG bypass rate × OCU cost saved |
| ... etc | ... | ... |
| AWS pricing reductions | 4% | Bedrock pricing changes since baseline |
| Natural traffic shifts | 2% | Intent distribution changes that happen to favor cheap paths |
| Total measured | 57% | |
| Total observed | 55% | (small reconciliation — model error) |
Each row is sourced from per-component telemetry. Leadership sees not just "we saved 55%" but "here's what each lever contributed." If finance says "I don't trust this number," they can audit any row.
The "natural" sources are honest: not everything is a credit to the cost program. Acknowledging them maintains credibility.
Round 3 — Squeeze
Follow-up: What's the strongest objection to your methodology and how do you respond?
Strongest objection: "Your counterfactual is itself a model. Your contribution attribution is itself a model. You're saying 'the optimization saved 50%' but you can't prove it because the baseline is hypothetical."
Response:
- All financial measurement involves models. GAAP accounting is a model. Tax calculation is a model. The question is whether the model is reasonable, transparent, and consistently applied.
- Triangulation: if multiple methods (counterfactual, contribution attribution, AWS Cost Explorer trend break) all give answers within ±10%, the savings are real. If they diverge widely, the methods are broken or the savings are smaller than claimed.
- Acknowledge uncertainty: "Best estimate is 55%, with a 95% confidence interval of 45-65%." Better to commit to a range than a false-precision point estimate.
The principle: financial credibility comes from transparent methodology and consistent application, not from claimed precision. Leadership respects honesty about uncertainty more than confident-sounding-but-fragile numbers.
Round 4 — Corner
Follow-up: Audit time. Finance pulls a single day's AWS bill, picks one specific cost item, and asks "show me where this comes from in your model." What's your answer?
Per-request cost attribution (file 06 US-01 architect A1) is what enables this. Methodology:
- Pull all cost records for that day, filtered by the cost item (e.g., Bedrock invocations).
- Reconcile against the AWS bill: aggregate the cost records' `dollars` field. Should match the AWS bill ± 5% (some lag and AWS-side adjustments). If it doesn't match, the cost telemetry is broken — that's an immediate fix.
- Drill down: show finance a few sample requests with full attribution (request_id, intent, model_tier, tokens, calculated cost, contribution to daily total).
If you can't drill down to per-request attribution, you can't answer audit questions. The cost telemetry is the audit trail.
The principle: savings credibility requires per-request attribution. Aggregate dashboards aren't auditable; per-request records are.
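A sketch of the daily reconciliation step, assuming the day's cost records are loaded into a DataFrame with a `dollars` column; the 5% tolerance mirrors the text above.

```python
import pandas as pd

def reconcile(cost_records: pd.DataFrame, aws_bill_amount: float, tolerance: float = 0.05) -> float:
    """Returns the relative gap between telemetry and the AWS bill line item."""
    telemetry_total = cost_records["dollars"].sum()
    gap = abs(telemetry_total - aws_bill_amount) / aws_bill_amount
    if gap > tolerance:
        # Telemetry and bill disagree by more than 5%: the audit trail is broken.
        raise RuntimeError(f"Cost telemetry off by {gap:.1%} vs AWS bill; fix before answering audits")
    return gap
```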
Architect-Level Escalation
A1 [Both]: Build a quarterly cost report that finance, engineering, and leadership all read.
Five-section report:
- Executive summary (1 page): headline savings, comparison to target, notable changes.
- Per-story performance (8 pages): one page per story, with cost trajectory + quality trajectory + incidents.
- Cost attribution detail (3 pages): waterfall chart of total spend → per-component → per-team.
- Forward-looking: planned changes, risks, capacity needs.
- Methodology (2 pages): how numbers were computed, what assumptions were made.
Audience-tuned: leadership reads (1) and (4); finance reads (3) and (5); engineering reads (2). Total ~15 pages. Quarterly.
A2 [MLOps]: How do you build cost-attribution for shared infrastructure (e.g., Redis serves multiple stories)?
Three approaches, in order of complexity:
- Pro-rata: split Redis cost by usage share. Each story's share = (story's request count) / (total request count). Simple, defensible, slightly coarse.
- Marginal: estimate "if this story didn't exist, how much less Redis would cost?" More accurate but requires modeling.
- Tagged: every Redis operation tagged with originating story; per-story cost computed from operation counts. Most accurate but requires instrumentation.
For most chatbot infrastructure, (1) is sufficient. Use (3) only when stories are genuinely separate (different organizations, different budgets).
A3 [Both]: When does the audit overhead exceed the value of detailed attribution?
When attribution machinery costs > 5% of the savings being attributed. At that point, you're spending savings to count savings.
Signal: if the cost-attribution pipeline (US-07-style telemetry, Athena queries, dashboard generation) costs more than 5% of monthly savings, simplify. Drop per-tag attribution; use pro-rata. Drop per-request records; use sampling.
The principle: measurement infrastructure has cost. Optimize it relative to what it measures.
Intuition Gained — Cross-Cutting Q5
The core insight: Savings claims require methodology, not just numbers. The methodology is auditable, transparent, and consistently applied. That's the difference between credible cost engineering and aspirational marketing.
Mental model:
"Counterfactual + contribution attribution + per-request audit trail. Three pieces. Without all three, the savings number is a story, not a number."
Cross-Cutting Q6: The Ratchet Problem [Both]
Opening Question
Q: You shipped a 50% cost reduction. Six months later, cost is back up by 15%. Nobody disabled the optimization; nothing visibly changed. What happened?
Round 1 Answer: Silent regression accumulation. Each PR shipped over 6 months added a few tokens to a system prompt, a new analytic event class, a new cache key pattern. Each was within the cost-aware golden's per-PR threshold. The cumulative effect is 15%. The CI gate is per-PR, not cumulative — and the per-PR threshold was set assuming a baseline that is itself drifting. The ratchet broke. The fix: add a cumulative drift metric and re-baseline regularly.
Round 1 — Surface
Follow-up: What does a cumulative drift metric look like?
Weekly automated job:
- Re-run the cost-aware golden against current main.
- Compute aggregate cost delta vs. last release tag.
- Compute aggregate cost delta vs. baseline (the version the savings claim was made against).
- Plot both as time series.
Alarms:
- Week-over-week delta > 5%: investigate which PRs contributed.
- Baseline delta > 15%: file a high-priority ticket; the savings claim is no longer valid.
The deltas are observable. The drift is named. It can no longer accumulate silently.
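A sketch of the weekly job, assuming a `run_cost_golden` entry point and the `.cost-baseline.yaml` artifact described in the next round; thresholds mirror the alarms above.

```python
import yaml

def weekly_drift_check(run_cost_golden, last_week_cost: float, baseline_path: str = ".cost-baseline.yaml") -> dict:
    """run_cost_golden: callable that replays the cost-aware golden against a ref
    and returns aggregate $/session. last_week_cost: result of last week's run."""
    current_cost = run_cost_golden("main")
    with open(baseline_path) as f:
        baseline = yaml.safe_load(f)["baseline_metrics"]["dollars_per_session_p50"]

    vs_last_week = (current_cost - last_week_cost) / last_week_cost
    vs_baseline = (current_cost - baseline) / baseline

    alarms = []
    if vs_last_week > 0.05:
        alarms.append("week-over-week drift > 5%: bisect the week's PRs")
    if vs_baseline > 0.15:
        alarms.append("drift vs. savings baseline > 15%: savings claim no longer valid")
    return {"current": current_cost, "vs_last_week": vs_last_week, "vs_baseline": vs_baseline, "alarms": alarms}
```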
Round 2 — Push Harder
Follow-up: How do you make the baseline itself a tracked artifact?
The baseline is in version control:
.cost-baseline.yaml:
baseline_date: 2025-12-01
baseline_commit: abc123
baseline_dataset_version: v17
baseline_metrics:
dollars_per_session_p50: 0.0080
template_bypass_rate: 0.30
cache_hit_rate: 0.20
haiku_routing_rate: 0.35
avg_input_tokens: 1180
re-baseline_due: 2026-03-01 # quarterly
Updates require a PR. The PR includes the data behind the new baseline and the rationale for re-baselining (e.g., "after 3 months of organic growth, baseline is no longer realistic").
The principle: baselines are configuration, not constants. Treat them with the same versioning discipline as code.
Round 3 — Squeeze
Follow-up: A new feature gets shipped. The team asks for an exception to the cost gate, citing "this is necessary functionality." How do you respond?
The right response: "Necessary functionality has a cost. Tell me how much, and we'll update the baseline."
Not: "Sure, we'll skip the gate for this PR."
The exception flow:
1. PR includes `cost-regression-justification:` (per Cross-Cutting Q4).
2. Justification includes the new expected baseline metrics.
3. Cost-engineering reviews and approves.
4. Baseline is updated as part of the PR.
The PR doesn't ship "ignoring the gate." It ships "with an updated gate that reflects the new reality." The gate continues to enforce cost discipline against the new baseline.
The deeper insight: exceptions that don't update the contract are bugs. Either the contract changes (formally, with sign-off) or it holds (and the PR doesn't ship).
Round 4 — Corner
Follow-up: Two years from now, the cost optimization team has long disbanded. The original architects are gone. What artifacts ensure the cost discipline survives?
Three artifacts:
- Documented cost-quality contracts (per Cross-Cutting Q3 A1) — every story's tradeoffs, rationale, owner. Future engineers can read and understand without asking.
- Automated CI gates — cost-aware golden runs on every PR. Without thinking about it, every change is checked. The discipline is in the system, not in the team.
- Recurring re-baselines — quarterly cadence baked into the team's calendar. The baseline updates happen even without owners proposing them.
What doesn't survive: tribal knowledge. If the discipline depends on "Alice remembers the design," it dies when Alice moves teams. The artifacts must encode the discipline so the system enforces it.
The principle: engineering practices that depend on memory die. Practices that depend on automation persist. Build for the persistence.
Architect-Level Escalation
A1 [Both]: How do you organizationally make sure the cost discipline doesn't decay?
Three levers:
- A named owner for the cost program, not just the optimizations. The owner monitors the meta-metrics (drift, gate effectiveness, baseline freshness).
- Quarterly cost program reviews: forcing function for re-baselining and policy review.
- Cost goals in OKRs, with attribution to specific teams. If team X owns story Y, team X has an OKR like "maintain cost reduction from story Y at ≥ 80% of original target."
Without these, cost decays into "engineering's problem to occasionally fix."
A2 [MLOps]: When does the ratchet system itself need a re-architect?
Signs:
- Baselines need re-setting more than quarterly (the system is changing faster than the baseline can keep up).
- The per-PR gate alarms more than 20% of the time (the false-positive rate is unsustainable).
- Cumulative drift outpaces the re-baseline cadence (the ratchet is leaking).
When these happen: simplify. Drop per-query baselines; use per-intent. Drop per-PR gates; use weekly. Trade precision for sustainability.
A3 [ML/AI]: Compounded over years, how does cost-program drift compare to model-drift?
Both are silent regression patterns:
- Model drift: model accuracy degrades as production data distribution shifts. Detected with monitoring; fixed with retraining.
- Cost drift: cost climbs as code accumulates inefficiencies. Detected with weekly re-baseline; fixed with optimization PRs or accepted as new normal.
Both require: monitoring, an explicit baseline, a remediation cadence. The ML/AI engineering practices for model drift translate directly to cost drift. Adopt them.
Intuition Gained — Cross-Cutting Q6
The core insight: Cost optimization is a ratchet that leaks without active maintenance. The maintenance is itself a system: baselines as artifacts, drift as a metric, exceptions as contract updates.
Mental model:
"Savings shipped today are not savings preserved tomorrow. The optimization is the easy part; preserving it for years is the engineering."
Closing — ML/AI Engineer vs MLOps Engineer Scoring Rubric
When evaluating candidate answers across files 05, 06, and 07, distinct lenses produce distinct strong/weak signals.
ML/AI Engineer Strong vs Weak
| Strong signal | Weak signal |
|---|---|
| Slices metrics by intent, confidence band, user tier — names the dimension where the failure hides | Quotes aggregate metrics; doesn't slice |
| Frames cost optimization as a calibrated parameter that drifts with upstream changes | Treats cost optimization as a one-time configuration |
| Distinguishes "cheap and wrong" vs "expensive and same" failure modes; designs for both | Tests only that "the optimization works" |
| Uses statistical reasoning: sample size, confidence intervals, severity weighting | Single-point estimates without confidence |
| Connects cost optimization to model-quality observability | Treats them as separate concerns |
| Recognizes that ML metrics need ML rigor (calibration, drift, recalibration) | Manages cost optimizations as static config |
MLOps Engineer Strong vs Weak
| Strong signal | Weak signal |
|---|---|
| Designs telemetry before optimization; per-request attribution | Ships optimization first, instruments later |
| Treats kill switches as practiced flip paths, not just configuration | Has flags but never tests the flip |
| Builds CI gates and automated re-baselining | Relies on humans to notice cost regression |
| Practices chaos drills on safety systems quarterly | Trusts that the breaker will work when needed |
| Documents per-component SLOs and runbooks | Operates on tribal knowledge |
| Treats cost optimization as a long-running program with maintenance cadence | Treats it as a project that ships and ends |
Where the Lenses Should Diverge
- Threshold tuning: ML/AI engineer reasons about calibration of the threshold; MLOps engineer reasons about how to update the threshold safely without downtime.
- Failure investigation: ML/AI engineer asks "why did the model output this?"; MLOps engineer asks "what telemetry is missing to root-cause this?"
- Trade-offs: ML/AI engineer frames quality regression in CSAT/F1 terms; MLOps engineer frames it in SLO/error-budget terms.
- Long-term: ML/AI engineer worries about distribution shift and model drift; MLOps engineer worries about operational complexity and on-call burden.
Where The Lenses Should Converge
Both lenses should agree on:
- Pair every cost metric with a quality metric and gate on both. This is the single most important discipline — both lenses must own it.
- Cost optimization is not a one-shot. It requires continuous calibration (ML/AI lens) and continuous monitoring (MLOps lens).
- Compound failures are the system's hardest test. Both lenses must think across stories.
- Auditability matters. Both lenses must produce work that's reproducible and defensible to leadership.
When a candidate answer demonstrates one lens but not the other, that's a signal: they're a strong specialist but may need a partner to round out the work. When they synthesize both, that's principal-level system thinking.
End of Cross-Cutting Grill
This file completes the cost-optimization offline-testing deep-dive. Together with files 03, 04, 05, and 06, it provides:
- Conceptual foundations for cost-optimization testing (file 03)
- Per-scenario test design across all 8 user stories (file 04)
- ML/AI Engineer interview prep (file 05)
- MLOps Engineer interview prep (file 06)
- System-level cross-cutting discussion (this file)
For the source user stories themselves, see ../Cost-Optimization-User-Stories/. For the general MangaAssist offline-testing framework that this folder builds upon, see 01-offline-testing-strategy.md and 02-offline-testing-scenarios-with-answers.md.