06. MLOps Engineer Grill Chains — Cost-Optimization Offline Testing
This file contains 8 grill chains, one per cost-optimization user story, framed for an Amazon MLOps Engineer loop. The lens here is on telemetry, deployment infrastructure, observability, runbooks, CI gates, kill switches, on-call burden, and the operational discipline that keeps cost optimizations durable in production.
The ML/AI lens for the same scenarios (model behavior, calibration, dataset design, statistical rigor) is in 05-ml-ai-engineer-grill-chains.md.
How To Use This File
Format mirrors file 05 (Opening + 4 grill rounds + 3 architect-level + Intuition Gained). The questions are different: where file 05 asks "is the model right?" this file asks "is the system right?"
Scenario US-01 (MLOps Engineer Lens) — LLM Token Cost Optimization
Opening Question
Q: You're shipping four LLM-cost optimizations (template, cache, tier, compression). From an MLOps perspective, what's the single most important infrastructure investment to make before the optimizations ship?
Round 1 Answer: Per-request cost attribution telemetry. Every Bedrock call must emit (request_id, intent, model_tier, template_bypassed, cache_hit, input_tokens, output_tokens, dollars, lever_engagement_flags) to a structured log. Without this, you can't tell whether savings came from template or cache, you can't slice cost by intent, and you can't debug "why did spend spike yesterday." The optimizations themselves are a feature flag away; the telemetry to reason about them is six weeks of plumbing. Build the telemetry first.
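A minimal sketch of that emission, assuming a structured-JSON-to-stdout path picked up by Fluent Bit / CloudWatch Logs (field names follow the record above; everything else is illustrative):
import json, sys, time
from dataclasses import dataclass, asdict

@dataclass
class CostRecord:
    # One record per Bedrock call; fields mirror the attribution schema above.
    request_id: str
    intent: str
    model_tier: str
    template_bypassed: bool
    cache_hit: bool
    input_tokens: int
    output_tokens: int
    dollars: float
    lever_engagement_flags: dict

def emit_cost_record(record: CostRecord) -> None:
    # Structured JSON line on stdout; the log agent forwards it downstream.
    line = {"type": "cost_record", "ts": time.time(), **asdict(record)}
    sys.stdout.write(json.dumps(line) + "\n")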
Round 1 — Surface
Follow-up: Where does that telemetry live, and how does it not become its own cost problem?
The data path:
- Emission: each Lambda/Fargate task emits the cost record to CloudWatch Logs (or stdout, picked up by Fluent Bit).
- Buffering: Kinesis Firehose ingests, buffers, batches with gzip (per US-07's pattern).
- Storage: S3 Parquet, partitioned by date + intent.
- Query layer: Athena for ad-hoc, Redshift materialized views for dashboards (per US-07).
It doesn't become its own cost problem because:
- Per-record overhead is ~200 bytes; 1M requests/day = 200 MB/day = ~$5/mo of S3.
- Firehose buffering: $0.029 per GB ingested = ~$6/mo.
- Athena queries: per-scan billing. With Parquet + partition pruning, a typical query scans ~50 MB. Even 1,000 queries/day at ~$0.0001 each is trivial.
Total cost telemetry overhead: ~$15/mo for ~$315K/mo of Bedrock cost. Telemetry budget should always be < 0.01% of the cost being measured. If it grows beyond that, the telemetry is over-engineered.
Round 2 — Push Harder
Follow-up: How do you wire the kill switch for llm_cost_optimization_enabled? What's the SLO on a flag flip?
Kill switch wiring:
- Flag stored in AWS AppConfig (preferred) or SSM Parameter Store with a 30-second client cache.
- Application reads the flag on every request, but uses the cached value, not a fresh API call (see the sketch below).
- AppConfig has built-in deployment validation (canary % rollouts of the flag itself).
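A minimal sketch of the cached flag read, assuming the AppConfig Data API; the application, environment, and profile identifiers are hypothetical:
import json, time
import boto3

_client = boto3.client("appconfigdata")
_session = _client.start_configuration_session(
    ApplicationIdentifier="mangaassist",            # hypothetical names
    EnvironmentIdentifier="prod",
    ConfigurationProfileIdentifier="cost-optimization-flags",
    RequiredMinimumPollIntervalInSeconds=30,
)
_token = _session["InitialConfigurationToken"]
_cached_flags, _cached_at = {}, 0.0

def llm_cost_optimization_enabled() -> bool:
    # Returns the cached flag; refreshes from AppConfig at most every 30 seconds.
    global _token, _cached_flags, _cached_at
    if time.time() - _cached_at >= 30:
        resp = _client.get_latest_configuration(ConfigurationToken=_token)
        _token = resp["NextPollConfigurationToken"]
        payload = resp["Configuration"].read()      # empty bytes if config unchanged
        if payload:
            _cached_flags = json.loads(payload)
        _cached_at = time.time()
    return bool(_cached_flags.get("llm_cost_optimization_enabled", False))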
SLO on a flag flip:
- Detection to flip: 5 minutes (someone needs to decide and execute).
- Flip propagation: ≤ 60 seconds (30s cache + 30s safety margin).
- Pipeline behavior change: ≤ 60 seconds after flip propagation.
- Cost trajectory change visible in dashboards: ≤ 5 minutes (Kinesis lag + materialized view refresh).
Total: ~10 minutes from "we need to roll back" to "rollback verified in dashboard." That's the MLOps SLO. If it takes longer, the kill switch is theoretical, not operational.
The trap: people set up the flag but don't test the flip path under fire-drill conditions. The first real rollback discovers that the flag didn't propagate to one ECS service that pulls config on-startup-only. The offline test is "flip the flag in staging; assert behavior change within 60s on every consumer service." Run it as a chaos drill quarterly.
Round 3 — Squeeze
Follow-up: Cost-regression CI gate. A PR adds 200 tokens to the system prompt. How does CI catch it before merge?
The CI gate workflow:
- PR triggers a cost-aware golden run (Primitive C from file 03). Golden dataset of 500 queries with an expected_cost_band per query.
- Test harness runs both pipelines: pipeline_main (current main branch) and pipeline_pr (PR branch). Runs locally with mock Bedrock (which counts tokens but doesn't actually call), so the CI gate is itself zero-cost.
- Diff engine: per-query, computes tokens_pr - tokens_main. Aggregates p50, p95, and max delta.
- Gate logic:
  - If max single-query delta > 15% AND p50 delta > 10%: block PR.
  - If max single-query delta > 25% (any single query): block PR regardless of aggregate.
  - If aggregate p95 > 5% but p50 stable: warn (long-tail regression; investigator decides).
- PR comment: bot posts a table of the top-10 regressed queries with token counts and intent labels, so the author can see exactly what changed.
Total CI run time: ~2 minutes (no actual LLM calls; just prompt assembly + token counting).
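A sketch of the gate decision, assuming the harness has already produced per-query token counts for both branches (function name and thresholds mirror the gate logic above; everything else is illustrative):
import statistics

def evaluate_cost_gate(tokens_main: dict, tokens_pr: dict) -> str:
    """tokens_main / tokens_pr: query_id -> total prompt tokens per branch."""
    deltas = [
        (tokens_pr[q] - tokens_main[q]) / tokens_main[q]
        for q in tokens_main if q in tokens_pr and tokens_main[q] > 0
    ]
    if not deltas:
        return "pass"
    deltas.sort()
    p50 = statistics.median(deltas)
    p95 = deltas[int(0.95 * (len(deltas) - 1))]
    if max(deltas) > 0.25:                  # any single query regresses > 25%
        return "block"
    if max(deltas) > 0.15 and p50 > 0.10:   # broad regression
        return "block"
    if p95 > 0.05:                          # long-tail regression; human decides
        return "warn"
    return "pass"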
The trap is forgetting that the cost-aware golden must be maintained. Quarterly, rebuild expected_cost_band from current main's actuals so the baseline doesn't drift to forever-old assumptions. If the golden's expectations are stale, the gate fires on "regressions" that are actually unrelated changes from 6 months ago.
Round 4 — Corner
Follow-up: A canary rollout of US-01 is at 25%. CloudWatch shows P99 latency is fine but cost is higher than baseline for the canary group. What do you investigate?
Step-by-step:
- Confirm the data: pull dollars_per_session for the canary cohort vs. the control cohort over the same hour. Confirm the delta is statistically meaningful (n large, Cohen's d > 0.2).
- Slice by lever engagement: are the template/cache/tier/compression flags actually firing in the canary cohort?
  - If template_bypass_rate is 5% in canary vs 30% baseline → the template router isn't engaging. Likely a flag misread or a feature-flag-default-off bug.
  - If cache_hit_rate is 0% in canary → cache writes aren't happening (the writer's flag isn't on, or writes are being rejected).
  - If haiku_routing_rate is 35% but average input tokens are up → compression isn't engaging; tier engaged but compression didn't.
- Slice by intent: maybe one specific intent is over-firing on Sonnet (e.g., the complexity score is mis-calibrated and routing all recommendation queries to Sonnet that should have gone to Haiku).
- Check for new failure modes: did the canary expose a code path with retries? An optimization that triggers a retry loop costs more, not less.
The most likely cause: partial deployment. One of the four techniques is enabled in the canary but not its dependency. E.g., compression is on, but the prompt-builder change to actually invoke it is gated behind a different flag that's still off. Net effect: code does extra work to prepare a "compressed" prompt that's actually the original prompt, then sends to Bedrock — same cost or higher.
The runbook for this scenario: revert canary, dump telemetry to S3, run a per-lever-engagement audit. The cost regression is a debugging clue, not a quality problem; treat it as such.
Architect-Level Escalation
A1: Design a cost-attribution model that lets finance answer "what did each customer cost us last month?" without doubling our infrastructure cost.
The model: per-request cost record + offline aggregation, no real-time finance dashboard.
Schema:
cost_record {
request_id, customer_id, session_id, timestamp,
intent, model_tier,
bedrock_cost_dollars,
rag_cost_dollars,
ddb_cost_dollars (estimated from RCU/WCU),
ecs_cost_seconds (estimated from request duration / task share),
total_cost_estimate
}
Aggregation: nightly Athena query → Parquet aggregates per-customer per-day. Stored in a small Redshift table customer_cost_daily. Finance queries this table; doesn't query the raw events.
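A sketch of the nightly aggregation trigger, assuming the raw records are queryable in Athena and the timestamp field is epoch seconds (database, table, column, and bucket names are hypothetical):
import boto3

athena = boto3.client("athena")

NIGHTLY_SQL = """
SELECT customer_id,
       date(from_unixtime("timestamp")) AS day,
       sum(total_cost_estimate)         AS total_cost,
       count(*)                         AS requests
FROM cost_records
WHERE date(from_unixtime("timestamp")) = current_date - interval '1' day
GROUP BY 1, 2
"""

def run_nightly_rollup() -> str:
    # Fires the daily roll-up; a downstream job loads the result into customer_cost_daily.
    resp = athena.start_query_execution(
        QueryString=NIGHTLY_SQL,
        QueryExecutionContext={"Database": "cost_telemetry"},
        ResultConfiguration={"OutputLocation": "s3://cost-telemetry-aggregates/daily/"},
    )
    return resp["QueryExecutionId"]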
Cost of cost-attribution itself: ~$30/mo for a $300K/mo Bedrock bill (under 0.01%). Acceptable.
The trap: trying to do real-time customer-level cost. That requires high-cardinality streaming aggregation, which is expensive. Daily aggregation is sufficient for finance and avoids the cost spiral.
A2: How do you safely roll back the four US-01 techniques independently? They share the LLM call path.
Four feature flags, not one:
feature_flags {
template_router_enabled: bool,
semantic_cache_enabled: bool,
model_tiering_enabled: bool,
prompt_compression_enabled: bool
}
Each flag can be flipped independently. The pipeline logic:
if template_router_enabled and try_template():
return template_response
if semantic_cache_enabled:
cached = check_cache()
if cached:
return cached
prompt = build_prompt()
if prompt_compression_enabled:
prompt = compress(prompt)
model = select_tier(intent) if model_tiering_enabled else SONNET
response = bedrock(model, prompt)
if semantic_cache_enabled:
cache(response)
return response
The benefits:
- If model_tiering_enabled = false, all calls go to Sonnet — known-good baseline.
- If prompt_compression_enabled = false, full prompt sent — known-good prompt.
- Independent rollback per technique.
The downside: 16 flag combinations. Most never run in production; tested combinations are baseline (all off), all-on, and one-off-each (debugging combinations). Limit canary to known-good combinations.
The MLOps insight: independent flags require disciplined integration testing. Every combination that ever runs in production must have a corresponding offline test. Otherwise the flags are configuration-as-bug.
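One way to encode "every production flag combination has an offline test" is a parametrized test over the allowed combinations. A sketch; the golden_queries and run_pipeline fixtures are assumptions about the offline harness:
import pytest

# Only combinations that actually run in production are tested, per the note above.
KNOWN_GOOD_COMBOS = [
    {"template_router_enabled": False, "semantic_cache_enabled": False,
     "model_tiering_enabled": False, "prompt_compression_enabled": False},  # baseline
    {"template_router_enabled": True, "semantic_cache_enabled": True,
     "model_tiering_enabled": True, "prompt_compression_enabled": True},    # all on
    # ...plus the four one-off-each debugging combinations
]

@pytest.mark.parametrize("flags", KNOWN_GOOD_COMBOS)
def test_pipeline_under_flag_combo(flags, golden_queries, run_pipeline):
    for query in golden_queries:
        result = run_pipeline(query, feature_flags=flags)
        assert result.response_text       # pipeline still produces an answer
        assert result.cost_estimate >= 0  # cost attribution still populated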
A3: How do you migrate from this 4-technique design to a 5-technique design (adding native Bedrock prompt caching) without a maintenance window?
Step-by-step migration:
- Schema-add the native caching telemetry: add a bedrock_cache_hit field to the cost record. Default false.
- Deploy the native caching code path behind a feature flag, native_prompt_cache_enabled = false. The code emits the new telemetry, but the flag controls whether it actually uses native caching.
- Validate the telemetry plumbing: confirm bedrock_cache_hit is being written correctly when the flag is on in staging.
- Canary with the flag on at 1% of production. Monitor bedrock_cache_hit_rate and verify it matches expectations.
- Gradual ramp: 1% → 5% → 25% → 100% over 4 weeks.
- Update cost-aware golden with a new expected baseline once at 100% — the system has fundamentally lower per-call cost now; old expectations are wrong.
No maintenance window because every step is non-blocking. The flag default is off; the code path is dormant until enabled. The golden update is a one-time PR after full rollout.
Intuition Gained — US-01 (MLOps)
The core insight: LLM cost optimization is a telemetry problem first, an algorithm problem second. Without per-request cost attribution and per-lever engagement flags, no rollout is debuggable.
Mental model to carry forward:
"Every cost-optimization deployment starts with the question: 'when this breaks, what data do I have to root-cause it?' If the answer is 'aggregate cost dashboards,' the deployment isn't ready."
The hidden failure mode: Partial deployment of a multi-technique optimization. Code paths that 'prepare' for a technique that doesn't actually fire — extra work, no savings.
One-line rule: Telemetry + kill switches + cost-regression CI gate. If any of the three is missing, the optimization isn't production-ready, regardless of how good the algorithm is.
Scenario US-02 (MLOps Engineer Lens) — Intent Classifier Cost Optimization
Opening Question
Q: SageMaker scale-to-zero off-peak. From an MLOps perspective, what's the SLO contract you're committing to and how do you instrument it?
Round 1 Answer: SLO contract: cold-start first-response latency p99 ≤ 3 seconds. Instrumentation: emit (timestamp, request_id, was_cold_start, time_to_first_response, fallback_used, instance_count_at_request) for every inference call. SLO breach = any 5-minute window where cold-start p99 exceeds 3s. Alert wires to PagerDuty for on-call. Quarterly SLO review against the actual achieved p99 — adjust either the SLO or the provisioned-concurrency floor based on what's realistic.
Round 1 — Surface
Follow-up: How do rule deployments propagate to production without a code deploy?
Rules are configuration, not code. Deployment pipeline:
- Author: data scientist proposes a new rule (regex pattern + intent label + confidence threshold).
- Validation pipeline: rule runs through the offline decision-equivalence test (file 04) on the rule's labeled dataset. ≥95% agreement with SageMaker required, plus zero false-positives on safety-critical classes.
- Storage: validated rules are committed to a YAML/JSON file in S3, versioned by S3 versioning.
- Distribution: production tasks read the rules file on startup AND watch S3 for changes (S3 event → SNS → in-memory rule reload). New rules propagate within ~30 seconds.
- Rollback: revert S3 to previous version; SNS event triggers reload; rollback in ~30 seconds.
The principle: rules are config; config has its own lifecycle. Code deploys are slow (Fargate task replacement); config changes should be near-instant. Don't conflate them.
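A sketch of the in-memory reload path (bucket and key names are hypothetical; assumes the rules file is YAML and each task receives the S3-change notification via SNS):
import boto3
import yaml  # assumes PyYAML

s3 = boto3.client("s3")
RULES_BUCKET, RULES_KEY = "intent-rules-prod", "rules.yaml"  # hypothetical names

_active_rules = []

def load_rules() -> None:
    # Called on startup, and again whenever a change notification arrives.
    global _active_rules
    obj = s3.get_object(Bucket=RULES_BUCKET, Key=RULES_KEY)
    rules = yaml.safe_load(obj["Body"].read())
    _active_rules = [r for r in rules if not r.get("disabled", False)]

def on_rules_changed(sns_message: dict) -> None:
    # S3 event -> SNS -> this handler; re-read and swap in memory (~30s propagation).
    load_rules()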
Round 2 — Push Harder
Follow-up: A rule starts misbehaving in production — false-positives on return_request. What's your detection lag?
Detection mechanisms, in order of speed:
- Per-rule false-positive metric on safety classes: emitted on every classification, aggregated to CloudWatch every 1 minute. Alarm fires if any rule has >0 FPs on return_request in any 5-minute window. Detection lag: ~6 minutes.
- Alarm action: auto-disables the offending rule (writes a flag rules.<rule_id>.disabled = true to the rules file). Tasks reload within 30s.
- Notification: PagerDuty alert + Slack message to on-call. On-call confirms the auto-disable was correct or rolls back.
- Backstop: customer-reported incidents come hours/days later if the metric path failed. Should never be the primary detection.
Total: 6-7 minutes from first FP to rule disabled. Worst-case impact: ~50 misclassified queries during the detection window. If the rule fires on 5% of traffic (~300K queries/day), a 5-minute window covers ~1,000 rule-handled queries — and only the measured misclassification fraction of those is actually wrong. Acceptable.
What can break this:
- The metric path itself (CloudWatch ingestion lag, alarm misconfiguration). Test the alarm path quarterly.
- The auto-disable path (S3 write failure, SNS delivery failure). Test this too.
- The reload path (a task that doesn't watch S3 properly). Integration tests must verify reload behavior.
Round 3 — Squeeze
Follow-up: Provisioned concurrency on SageMaker for warm capacity. How do you decide the schedule?
Empirical schedule from 4 weeks of traffic data:
- Plot RPS by hour-of-day, by day-of-week. Identify peak windows: 8-11am JST, 6-10pm JST on weekdays; 10am-10pm JST on weekends.
- Compute warm capacity needed to keep p99 cold-start ≤ 3s during each window. Usually 1-2 provisioned instances cover up to 50 RPS.
- Schedule via CloudWatch Events + Lambda: aws sagemaker update-endpoint-weights-and-capacities to set provisioned capacity to 1 at 8am and back to 0 at 11pm (sketch below).
- Sanity check the cost: 1 provisioned instance × 16h/day × 30 days × hourly rate ≈ $X. Compare to inference savings. Net-positive savings is the deciding factor.
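A sketch of the Lambda behind that schedule, using the boto3 equivalent of the CLI call above; the endpoint and variant names are hypothetical, and the desired count comes from the EventBridge rule's payload:
import boto3

sagemaker = boto3.client("sagemaker")
ENDPOINT = "intent-classifier-prod"  # hypothetical endpoint name

def handler(event, context):
    # The schedule rule passes the target count for its time window.
    desired = int(event.get("desired_instance_count", 1))
    sagemaker.update_endpoint_weights_and_capacities(
        EndpointName=ENDPOINT,
        DesiredWeightsAndCapacities=[
            {"VariantName": "AllTraffic", "DesiredInstanceCount": desired}
        ],
    )
    return {"endpoint": ENDPOINT, "desired_instance_count": desired}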
The trap: people pick a schedule based on the dashboard's first day, not on multi-week patterns. Friday traffic looks different from Monday; school holidays shift JST patterns; manga release events spike unpredictably. Use 4 weeks minimum; review quarterly.
The deeper MLOps practice: codify the schedule as Terraform/CDK config, not as click-ops in the console. Schedule lives in version control; review through PR; rollback through revert.
Round 4 — Corner
Follow-up: SageMaker endpoint version rollout. New model version performs better offline but worse on a specific tail intent. How do you ship safely?
Multi-version strategy:
- Production multi-variant endpoint: SageMaker supports multiple model versions on one endpoint with weighted traffic split.
- Initial split: 95% old version, 5% new version. The 5% goes through the decision-equivalence test in shadow (logged but not shown to user) for additional confidence.
- Per-intent monitoring: per-class F1 on the 5% slice tracked daily. If new version's tail-intent F1 drops vs old, hold ramp.
- Gradual ramp: 5% → 25% → 50% → 100% over 2-4 weeks, gated by per-tail-class quality.
- Rollback path: shift traffic weight back to 100% old version. Latency: ~2 minutes (endpoint config update).
The new model version doesn't ship as a single binary swap. It ramps with quality gates at each stage. The MLOps maturity is having the per-tail-class quality gates as code, not as "let me check the dashboard."
The trap: shipping with only aggregate quality monitoring. The new version has higher overall accuracy but lower tail-intent F1. Aggregate looks fine; tail intents (safety-critical) regress; users escalate; only then does someone slice by intent.
Architect-Level Escalation
A1: How do you instrument the system to detect when a "cost optimization" is silently regressing?
The cost-regression CI gate (file 03 Primitive C) catches PR-time regressions. For runtime regressions, three layered detectors:
- Daily automated cost-aware golden run. Every night, replay the golden against current main; compare cost metrics to last release's baseline. Alert if drift > 5% week-over-week.
- Continuous lever-engagement monitoring. CloudWatch metric for each lever's engagement rate (template_bypass_rate, cache_hit_rate, etc.). Alarm fires if any drops below 80% of its design rate for 6+ hours.
- Per-intent cost tracking. CloudWatch metric dollars_per_request_by_intent. Alarm fires if any intent's per-request cost climbs > 20% over a 7-day rolling baseline.
Three detectors, three timescales: PR-time (CI), nightly (golden replay), continuous (CloudWatch). Each catches different regression patterns.
A2: Build me an MLOps capability: "automatically detect when a rule's distribution has shifted enough to need re-validation."
Per-rule monitoring metrics, daily aggregated:
rule_health {
rule_id,
fire_rate_today, fire_rate_7d_median,
agreement_with_ml_today (sampled), agreement_with_ml_7d_median,
per_class_FP_today
}
Detection logic (runs nightly):
- Coverage drift: if |fire_rate_today - fire_rate_7d_median| / fire_rate_7d_median > 0.20, flag for re-validation.
- Quality drift: if agreement_with_ml_today < 0.95 OR agreement_with_ml_today < agreement_with_ml_7d_median - 0.02, flag.
- New FP: if per_class_FP_today > 0 on any safety class and the 7-day median was 0, flag immediately.
Flags create JIRA tickets assigned to the data scientist who owns the rule. Re-validation pipeline auto-runs on the rule's current state. If it now fails the promotion threshold, auto-disable.
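The nightly check can be a pure function over one rule_health record. A sketch, assuming per_class_FP_today is a per-class dict and omitting the "7-day median was 0" history check for brevity; thresholds follow the list above:
def check_rule_drift(h: dict) -> list:
    """h: one rule_health record (see the schema above). Returns drift flags."""
    flags = []
    # Coverage drift: fire rate moved >20% vs the 7-day median.
    if h["fire_rate_7d_median"] > 0:
        rel = abs(h["fire_rate_today"] - h["fire_rate_7d_median"]) / h["fire_rate_7d_median"]
        if rel > 0.20:
            flags.append("coverage_drift")
    # Quality drift: agreement dropped below the absolute or relative floor.
    if (h["agreement_with_ml_today"] < 0.95
            or h["agreement_with_ml_today"] < h["agreement_with_ml_7d_median"] - 0.02):
        flags.append("quality_drift")
    # New false positive on any safety class today.
    if any(v > 0 for v in h["per_class_FP_today"].values()):
        flags.append("new_safety_fp")
    return flags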
The systems insight: rules drift just like models. Treat them as ML artifacts in the MLOps lifecycle. Without this, rules become silent technical debt.
A3: How do you handle the operational burden when you have 50 rules and 8 cost optimizations and 3 model versions and 5 caches all in flight?
The operational burden compounds. Mitigations:
- Standardize the lifecycle. Every rule, every model version, every cache layer follows the same pipeline: shadow → canary → ramp → monitor → expire. Different artifacts, same workflow.
- Single observability plane. One dashboard per "fleet" (rules, models, caches). Each artifact is a row; columns are health metrics. On-call scans the dashboards; doesn't have 50 individual dashboards.
- Auto-disable defaults. Any artifact that breaches its quality contract auto-disables. Notification, not human-decision-required-to-prevent-damage.
- Quarterly housekeeping. Every quarter, audit: which rules haven't fired in 30 days (delete)? Which canaries are stalled (commit or roll back)? Which model versions are deprecated (decommission)?
- Owner-per-artifact. Every rule, every cache, every model has a named owner. No artifact is "everyone's." Ownership is in the artifact's metadata.
Without these, 50 rules become 50 sources of mystery and the on-call becomes the de-facto owner of everything. With them, the on-call is "incidents" only — ownership is distributed.
Intuition Gained — US-02 (MLOps)
The core insight: Configuration-as-code, dynamic flag flipping, and per-artifact health metrics turn rules from static config into managed ML artifacts.
Mental model to carry forward:
"Rules are models with hand-engineered decision boundaries. They need an MLOps lifecycle (deployment pipeline, monitoring, drift detection, retirement) just like models do."
The hidden failure mode: A rule that's been silent for 6 months and is now subtly wrong, with no monitoring telling anyone. The MLOps capability that prevents this is per-rule daily health checks.
One-line rule: Cost optimizations are operational commitments, not one-time deployments. Build the lifecycle infra before you ship the optimization.
Scenario US-03 (MLOps Engineer Lens) — Caching Strategy
Opening Question
Q: Multi-layer cache (L1 in-process + L2 Redis + warming Lambda + event-driven invalidation). From an MLOps perspective, what's the observability surface and what breaks first under load?
Round 1 Answer: Observability surface: per-layer hit rate, per-key staleness, invalidation latency, eviction rate, memory utilization, and downstream-call rate. Under load, invalidation latency breaks first. A burst of catalog updates means the SNS → Lambda → Redis DELETE path queues up. If invalidation lag exceeds TTL on a hot-key, you serve stale data while you think the cache is invalidating correctly. The first instrument I'd build: end-to-end invalidation latency p99, alerted at >5s.
Round 1 — Surface
Follow-up: How do you measure end-to-end invalidation latency in production?
The measurement path:
- Mark every catalog-change event with a published_at timestamp when it's emitted by the catalog service.
- The cache invalidation Lambda emits (event_id, published_at, deleted_at) when it processes the event.
- End-to-end latency = deleted_at - published_at, computed per event, aggregated as a CloudWatch histogram (sketch below).
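A sketch of the invalidation Lambda emitting that latency; the namespace, metric name, and cache-key scheme are illustrative:
import time
import boto3

cloudwatch = boto3.client("cloudwatch")

def handle_catalog_change(event: dict, redis_client) -> None:
    # event carries published_at (epoch seconds) stamped by the catalog service.
    redis_client.delete(f"prod:{event['asin']}")     # the actual invalidation (key scheme assumed)
    deleted_at = time.time()
    cloudwatch.put_metric_data(
        Namespace="MangaAssist/Cache",               # hypothetical namespace
        MetricData=[{
            "MetricName": "InvalidationLatency",
            "Value": deleted_at - event["published_at"],
            "Unit": "Seconds",
        }],
    )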
Alarms:
- p99 > 5s for 2 consecutive 5-min windows: page on-call.
- Any single event > 30s: high-severity alert (something is fundamentally broken).
- Throughput drop (events processed per minute < expected): the Lambda may be dead-lettered or stuck.
The instrument tells you not just "is invalidation working" but "how much of a stale-window are users actually exposed to." That number, not the dashboard's hit-rate, is the cache-safety SLO.
Round 2 — Push Harder
Follow-up: Cache stampede scenario — popular ASIN gets invalidated, 1,800 simultaneous requests miss cache, all hit Product Catalog. What's the production safeguard?
Three layers of defense:
- In-process singleflight — within a single task, only one in-flight fetch per cache key. Other concurrent requests for the same key wait for the first one. Reduces 1,800 concurrent fetches to ~50 (the number of tasks).
- Redis-level singleflight (advisory lock pattern) — the first task to acquire a lock:{key} Redis key fetches; others poll the cache for the result. Reduces ~50 concurrent fetches to ~1.
- Stale-while-revalidate fallback — if the lock-holder takes too long, other tasks return the (just-evicted) stale value as a degraded answer rather than waiting indefinitely. Quality < freshness, but better than a timeout. (A sketch of the Redis-level lock plus the stale fallback follows below.)
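A minimal sketch, assuming a redis-py client and a hypothetical stale:{key} copy written before eviction; fetch is whatever calls Product Catalog:
import time

def get_with_singleflight(redis_client, key: str, fetch, ttl: int = 300):
    value = redis_client.get(key)
    if value is not None:
        return value
    # Advisory lock: only one task per key fetches; NX + EX gives a 10s lock lease.
    if redis_client.set(f"lock:{key}", "1", nx=True, ex=10):
        value = fetch()
        redis_client.set(key, value, ex=ttl)
        redis_client.delete(f"lock:{key}")
        return value
    # Another task holds the lock: poll briefly, then fall back to a stale value.
    for _ in range(20):                       # ~2 s of polling
        time.sleep(0.1)
        value = redis_client.get(key)
        if value is not None:
            return value
    stale = redis_client.get(f"stale:{key}")  # assumed stale copy from the eviction path
    return stale if stale is not None else fetch()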
Plus the upstream protection:
- Product Catalog itself has rate limiting per consumer. Even if the cache fails, downstream services don't melt.
- Circuit breaker on the Product Catalog client — if Product Catalog returns 5xx for 5+ requests, the client opens the circuit and serves stale cache for 30s.
The MLOps insight: defense in depth on the cache miss path. Cache hit is the happy path; cache miss under load is where systems die. Engineer for the miss.
Round 3 — Squeeze
Follow-up: Warming Lambda runs at 8:30am JST. How do you know it actually warmed correctly and didn't write to /dev/null?
Two-part instrumentation:
- Lambda emits a structured success record: {run_id, started_at, completed_at, asins_warmed, redis_cluster_endpoint, sample_check_pass}. The sample_check_pass is critical: after warming, the Lambda re-reads 10 random keys and asserts they exist with non-empty values. Logs the result.
- Validation alarm: a separate CloudWatch event triggers at 8:35am and runs a Lambda that reads the morning's success record and checks asins_warmed >= 450 AND sample_check_pass == true. If either fails, page on-call.
Without the sample_check, the Lambda can return success: true after writing to a misconfigured cluster. The check is the only thing that catches misconfiguration silently passing tests.
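The sample check itself is a few lines. A sketch; redis_client and warmed_keys come from the warming run:
import random

def sample_check(redis_client, warmed_keys: list, sample_size: int = 10) -> bool:
    # Re-read a random sample of the keys just warmed; any empty read fails the check.
    for key in random.sample(warmed_keys, min(sample_size, len(warmed_keys))):
        value = redis_client.get(key)
        if not value:
            return False
    return True

# At the end of the warming Lambda:
# record["sample_check_pass"] = sample_check(redis_client, warmed_keys)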
Plus the daily metric: the 8:30am hit rate vs. noon hit rate ratio. If warming worked, the ratio should be > 0.5 (8:30am benefits from warming; noon is lukewarm). If the ratio is < 0.3, warming didn't have an effect — likely it went to the wrong cluster.
The pattern: infrastructure-level success isn't application-level success. The Lambda completing isn't the same as the cache being warm. Verify the application-level outcome explicitly.
Round 4 — Corner
Follow-up: Redis cluster is at 78% memory utilization. The threshold is 80%. What's the runbook?
Pre-incident actions (before hitting 80%):
- Investigate the trend: is utilization climbing linearly (organic growth) or with steps (deployment added new keys)?
- Check eviction rate: if eviction is already > 1%/hr, we're already losing usable cache. The 80% threshold is past-due.
- Identify the keyspace consuming memory: redis-cli --bigkeys or a production analog. Usually one keyspace dominates; understand which.
Mitigation options (in order of preference):
- Reduce TTL on the dominant keyspace. Shorter TTL = faster eviction = lower memory. Trade-off: lower hit rate. Acceptable as a temporary measure.
- Tighten the cache eligibility filter. If recommendation queries are caching too aggressively, raise the eligibility threshold. Less stuff cached = lower memory.
- Scale the cluster up. Larger Redis = more capacity but higher cost. Last resort.
- Dedicated cluster per high-volume keyspace. If llmresp: is dominating, give it its own cluster. Architectural change; not for an incident.
The runbook should be a tiered playbook: option 1 takes 5 minutes (TTL config update), option 2 takes 30 minutes (deploy), option 3 takes 1 hour (cluster resize), option 4 takes a sprint. Match the urgency to the option.
The MLOps maturity: the runbook is documented, practiced, and automated where possible. Runbooks that exist only in someone's head are runbooks that fail at 3am.
Architect-Level Escalation
A1: Multi-region Redis. From an MLOps perspective, when do you need it and what's the consistency model?
You need multi-region Redis when:
1. Production traffic spans multiple AWS regions (e.g., us-east-1 + ap-northeast-1).
2. Cross-region cache hit rate falls below 50% because keys are populated in only one region.
3. A single-region Redis failure would cause a > 30-min outage.
Consistency model is the hard part:
- Eventual consistency (Redis cross-region replication): writes propagate within seconds. Acceptable for cache (cache is already eventually consistent vs. the source).
- Active-active with conflict resolution: complex, usually not worth it for cache.
- Per-region partitioning: each region has its own cache; no replication. Higher cost (each region warms independently), simpler to reason about. Often the right call for chatbot caching.
For MangaAssist, the chatbot is single-region (per the architecture). Multi-region cache is over-engineering until single-region fails twice. Don't pre-build it.
A2: How do you A/B test cache configurations (e.g., TTL, similarity threshold) without breaking production?
Per-keyspace, per-config A/B:
- Cache key includes the config version: llmresp:v1:... and llmresp:v2:... for two different configs. Both can coexist.
- Read path: chooses a config based on a session hash → 50/50 split.
- Write path: writes to the config the read came from.
- Telemetry: emit (config_version, hit, response_correctness) per request.
- Comparison: after 1 week, compare hit rate and wrong-answer rate per config.
- Promote winning config, evict losing config's keys via prefix delete.
Cost: 2x cache storage during A/B. Acceptable for a 1-week experiment.
The pattern: cache configurations are first-class experiments, not config tweaks. Every TTL change should be A/B tested with telemetry, not eyeballed.
A3: When does the cache become the system's bottleneck instead of an optimization?
When cache operations dominate latency or cost:
- Per-request cache check + put adds > 5ms p99 to the request path.
- Redis cluster cost > 20% of the downstream-call cost it's supposed to save.
- Cache invalidation events outnumber cache reads (a storm of catalog updates breaks the cache).
- Cache hit rate falls below 20% structurally (not transiently) — meaning most requests still pay the downstream cost AND the cache check overhead.
Mitigations: in-process L1 only (skip Redis), or tighter eligibility (only cache the top-N highest-value keys), or per-tier caching (only Prime users get cache).
The deeper insight: caches optimize the average; they cost the cache check on every request, hit or miss. When the miss path is unavoidable, the cache check is overhead. Measure the cost of the cache itself, not just the cost it saves.
Intuition Gained — US-03 (MLOps)
The core insight: Cache observability is more important than cache hit rate. Knowing why the cache works (or doesn't) is what makes it operable; the headline number is just the outcome.
Mental model to carry forward:
"Cache safety is the invalidation path's correctness, not the hit rate. Invalidation is the most likely silent failure."
The hidden failure mode: Warming Lambda writes to the wrong cluster; success metric goes green; hit rate stays low; nobody investigates because dashboards say "warming succeeded."
One-line rule: Verify the application-level outcome of every infrastructure operation. Infrastructure success != application success.
Scenario US-04 (MLOps Engineer Lens) — Compute Cost Optimization
Opening Question
Q: Fargate Spot at 70% off-peak. From an MLOps perspective, what's the operational discipline that makes this safe vs. dangerous?
Round 1 Answer: The discipline is graceful drain mechanics validated under chaos. Spot interruptions arrive as a 2-minute SIGTERM. The drain handler must (a) stop accepting new connections, (b) finish in-flight work — including WebSocket frames, not just REST requests, (c) report drained state, (d) terminate cleanly. The validation: chaos-test the drain path on every deployment. Without this, Spot is a cost optimization that periodically drops user sessions on the floor.
Round 1 — Surface
Follow-up: How do you detect Spot interruption rate trending up before it hits SLO?
CloudWatch metric: spot_interruption_rate_pct = spot_interruptions_per_hour / spot_tasks_per_hour. Aggregated per hour, baseline 7-day rolling median.
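A sketch of the hourly metric emission; the counts would come from ECS task-stopped events for the Spot capacity provider, and the namespace and metric names are illustrative:
import boto3

cloudwatch = boto3.client("cloudwatch")

def emit_spot_interruption_rate(interruptions_last_hour: int, spot_tasks_last_hour: int) -> None:
    # Rate, not count: the count grows with traffic, the rate is the signal.
    rate_pct = 100.0 * interruptions_last_hour / max(spot_tasks_last_hour, 1)
    cloudwatch.put_metric_data(
        Namespace="MangaAssist/Compute",          # hypothetical namespace
        MetricData=[{
            "MetricName": "SpotInterruptionRatePct",
            "Value": rate_pct,
            "Unit": "Percent",
        }],
    )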
Alarms:
- 2x baseline for 1 hour: warning (probably normal noise).
- 3x baseline for 2 hours: page on-call (investigate region/instance type).
- 5x baseline: emergency, drain the Spot share to 30% temporarily.
Plus a daily summary: spot_interruptions_this_week vs last_week vs trend. On-call reviews this in the weekly rotation handoff.
The deeper instrumentation: tag each interruption with (task_id, instance_type, AZ, time_of_interruption, queue_state_at_interruption). Patterns emerge: "interruptions cluster on a single instance type → switch type" or "interruptions cluster in one AZ → diversify."
The trap: people watch the count of interruptions, not the rate. Count grows linearly with traffic; rate is the actual signal.
Round 2 — Push Harder
Follow-up: Graviton ARM64 migration. CI strategy?
Multi-architecture build pipeline:
- Dual-arch Docker images: build both linux/amd64 and linux/arm64 in CI. Use buildx or similar.
- Test on both: unit tests + integration tests run on both arches in CI. Parallel jobs.
- Native dependency scan: scan requirements.txt (or equivalent) for packages with C extensions. Each gets a manual ARM compatibility check (most are fine; some lag).
- Workload-shape diff test: a 5K-sample test where the same input runs on both arches; outputs are byte-compared (per the file 05 ML lens). Acceptable diffs are documented; unacceptable diffs fail CI.
- Per-arch deployment: ARM ECS service is a separate deployment. Initially 10% traffic; ramp to 100% over 2 weeks.
The principle: architecture migration is a deployment, not a build flag. Treat it like a model upgrade — gradual rollout, per-stage gates, rollback path.
Round 3 — Squeeze
Follow-up: Right-sized to 1vCPU/2GB. Auto-scaling settings — what's the policy?
For an I/O-bound 1vCPU/2GB Orchestrator task:
- Target tracking on CPU at 60%: scale out when avg CPU exceeds 60%, scale in when below 40%.
- Cooldown periods: scale-out cooldown 60s (fast response to spikes); scale-in cooldown 300s (avoid thrashing).
- Min capacity: 2 tasks (HA + handles startup load).
- Max capacity: 50 tasks (caps the cost ceiling; alerts if hit).
- Step scaling for emergencies: if CPU > 85% for 1 min, add 5 tasks (not 1) — preempt the latency cliff.
Right-sizing makes auto-scaling more important because each task has less headroom. Auto-scaling lag matters more.
Test:
- Synthetic load 0 → 80 RPS in 60 seconds. Measure auto-scale lag (time from CPU exceeding 60% to a new task accepting traffic). Should be < 90s.
- If lag > 2 minutes, consider scheduled scale-up before predictable peaks (8:45am JST pre-warm).
The MLOps lever: scheduled scaling complements reactive scaling. Reactive handles surprises; scheduled handles expected peaks. Both should be configured.
Round 4 — Corner
Follow-up: WebSocket handler on Spot — wait, you said WS on on-demand. So Spot is only for REST? Walk me through the architecture.
Yes — segregated by service:
WebSocket Service (long-lived connections):
- On-demand only
- Graviton ARM64 (20% cheaper)
- Right-sized to 0.5vCPU/1GB (handlers are very lightweight)
REST/Orchestrator Service (request/response):
- 70% Fargate Spot off-peak, 30% on-demand
- x86 (some native deps lock us in)
- Right-sized to 1vCPU/2GB
Lambda burst workers (intermittent):
- On-demand always (Lambda doesn't have Spot)
- Provisioned concurrency on peak hours only
- x86 (faster cold start than ARM today)
Three services, three distinct cost profiles. The MLOps complexity: three deployment pipelines, three monitoring dashboards, three on-call runbooks. Simplification (one service definition for all) is tempting but loses ~30% of the savings.
The decision: architectural complexity vs operational simplicity is a real cost. For MangaAssist scale, the complexity pays. For a smaller system, single-config might win.
Architect-Level Escalation
A1: Spot interruption rate exceeded 10% for one week. What's the post-mortem and fix?
Post-mortem:
- Identify the cause: AWS Spot capacity in the region for that instance type was tight. Check AWS Spot Advisor for historical interruption rate by type.
- Quantify impact: how many user-facing failures resulted? Calculate from connection-state-loss telemetry.
- Immediate fix: shift Spot allocation to a more stable instance type (m6g.large vs c6g.large, or larger instance to reduce competition).
- Permanent fix: spot-fleet diversification — request capacity across 3-5 instance types instead of one. Spot fleet auto-selects whichever has capacity.
Long-term: subscribe to AWS Spot Capacity Insights, use it as a signal for proactive instance-type rotation.
A2: How do you maintain the cost-optimized compute setup as the application evolves?
Per-PR cost-aware golden run (file 03 Primitive C). Plus a quarterly cost re-baseline:
- Re-run the cost-aware golden against current main.
- Compare to last quarter's baseline.
- If cost crept up > 5% Q-over-Q, investigate which PRs contributed.
- File tickets to right-size new resource consumers; or update the baseline with a documented justification.
Plus a yearly architecture review: are the right-sizing assumptions still valid? Is the Spot/on-demand split still optimal given new AWS pricing? Has any service grown enough to deserve a separate task definition?
The principle: cost-optimized state is a moving target. Static optimization decays as the application grows. Schedule the maintenance.
A3: When does the operational complexity of compute optimization exceed the savings?
The signal: cost optimization tickets exceed feature tickets in the SRE backlog for two consecutive quarters. At that point, you're spending engineer-time to save dollar-time at a bad rate.
The fix: simplify. Move from segregated services to a unified larger task. Accept some Spot risk and over-provision. Trade ~10% of cost savings for ~50% of operational simplification.
The framing: engineering time has a cost too. Cost optimization that consumes engineering time is only worth it if the engineering cost is less than the cost saved. Track both.
Intuition Gained — US-04 (MLOps)
The core insight: Compute cost optimization adds operational complexity (multiple service definitions, Spot drain logic, ARM compatibility, scheduled scaling). The MLOps capability needed is "running this complexity sustainably," not just "saving the money."
Mental model to carry forward:
"Spot is free money on the cost side and engineering work on the operational side. Account for both."
The hidden failure mode: WebSocket connection state loss during Spot interruption. SRE tests cover REST drain; WS drain is different and easy to miss.
One-line rule: Chaos-test the drain path on every deployment. Spot resilience is a behavior, not a configuration.
Scenario US-05 (MLOps Engineer Lens) — DynamoDB Cost Optimization
Opening Question
Q: TTL deletion + sparse GSI + on-demand capacity. From an MLOps perspective, what's the most important observable signal and what's the most overlooked failure mode?
Round 1 Answer: Most important signal: ttl_deletion_rate (items per day). Most overlooked failure mode: TTL processor falls behind. DDB's TTL deletion is best-effort; under load the deletion can lag by hours or days. Storage doesn't shrink. Scan operations get slower. Cost climbs. The signal looks fine (TTL is "configured"); the implementation isn't keeping up. Instrument the deletion lag, not just the deletion rate.
Round 1 — Surface
Follow-up: How do you measure TTL deletion lag?
DDB provides a TimeToLive configuration but no direct lag metric. Indirect measurement:
- Tag each item's ttl field with created_at plus a delta.
- A periodic scan (1% sample) computes (current_time - ttl) for items still present. This is the lag — items that should have been deleted but haven't been. (Sketch below.)
- Aggregate: median, p95, p99 of lag across the sample.
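A sketch of the sampled lag scan, assuming items carry a numeric ttl attribute; one segment of a 100-segment parallel scan approximates the 1% sample, and pagination is omitted:
import time
import boto3

dynamodb = boto3.client("dynamodb")

def sample_ttl_lag(table_name: str, segment: int, total_segments: int = 100) -> list:
    """Scans one of 100 parallel segments; returns lag seconds for items whose
    ttl has already passed but which are still present."""
    now = int(time.time())
    resp = dynamodb.scan(
        TableName=table_name,
        Segment=segment,
        TotalSegments=total_segments,
        FilterExpression="#t < :now",
        ExpressionAttributeNames={"#t": "ttl"},
        ExpressionAttributeValues={":now": {"N": str(now)}},
    )
    # Aggregate median / p95 / p99 across all segments upstream.
    return [now - int(item["ttl"]["N"]) for item in resp["Items"]]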
Alarm if p95 lag > 6 hours (TTL is configured but eviction is sluggish — pre-warning of storage issues).
Alternatively: measure storage size growth rate vs ingestion rate. If ingestion is X GB/day and TTL should evict X GB/day (steady-state), but storage is growing, deletion is lagging.
Round 2 — Push Harder
Follow-up: On-demand vs provisioned capacity decision. How do you monitor the breakeven?
DDB on-demand pricing: ~$1.25 per million write request units + ~$0.25 per million read request units. Provisioned: ~$0.00065 per WCU-hour and ~$0.00013 per RCU-hour, sustained.
Breakeven at ~18% utilization. If the table runs above 18% utilization on average, provisioned is cheaper; below, on-demand wins.
Monitoring:
- Current capacity mode (config).
- Actual usage: total WCU consumed per day, total RCU per day.
- Implied utilization (if provisioned): consumed / (provisioned × 24 × 3600).
- Cost-mode comparison metric: monthly_cost_actual vs monthly_cost_if_other_mode. Both numbers. Visible to humans.
Quarterly review: if the comparison flips (other mode would be 20%+ cheaper), file a ticket to switch.
The deeper MLOps practice: capacity mode isn't a one-time choice. Traffic patterns shift; mode optimality shifts. Monitor and revisit.
Round 3 — Squeeze
Follow-up: Sparse GSI — TURN items don't project. A future PR adds a customer_id field to TURN items "for analytics." What's the operational guard?
Two guards:
- CI gate (cost-aware golden, Primitive C): the PR's changes are run through the golden replay; a dollars_per_1k_turns regression is caught and fails the PR.
- Runtime monitoring: per-table-and-index storage size, daily. Alarm if any table/index storage grows > 20% week-over-week unexpectedly. Catches issues that bypassed CI (e.g., a config change made manually, not via PR).
Plus a PR review checklist: any PR touching DDB items must answer "Does this change which attributes are in which items? If yes, has the GSI projection been verified?" Not a test; a process control. Cheapest possible safeguard.
The pattern: CI catches code changes; runtime catches config changes; review catches both. Three layers of defense for a high-impact silent regression.
Round 4 — Corner
Follow-up: TransactWriteItems failure rate is 0.5% — over the 0.1% threshold. What do you investigate?
Failure modes for TransactWriteItems:
- Conditional check failures: another concurrent write modified the item between the read and the conditional write. Symptom: TransactionCanceledException with reason ConditionalCheckFailed. Fix: design conditions to be commutative (e.g., versioned writes), or accept retries.
- Throttling: total WCU on the operation exceeded available capacity. Symptom: ProvisionedThroughputExceededException (or the capacity-mode equivalent). Fix: increase capacity, batch differently, or throttle the upstream.
- Validation errors: malformed write. Symptom: ValidationException. Fix: bug in code; investigate.
- Internal: AWS-side failure. Symptom: InternalServerError. Fix: retry with exponential backoff.
Root-cause analysis:
- Pull the last 1000 failure records; group by failure mode.
- If 90% are conditional check failures: too much concurrency on the same items. Architectural fix needed (sharding, version vectors, eventual consistency).
- If 90% are throttling: capacity is wrong; switch mode or scale up.
- If 90% are validation: there's a bug in the new code path; revert and fix.
The MLOps-first response: have the failure breakdown automatically computed and visible without manual log queries. Without it, every incident starts with "let me grep CloudWatch Logs."
Architect-Level Escalation
A1: Build a backup/PITR cost optimization. Current cost is $X/month for full Point-In-Time Recovery on all tables. How do you reduce it without compromising recovery?
Per-table evaluation:
| Table | PITR needed? | Recovery point objective |
|---|---|---|
| Conversation memory (TURN) | Yes — but only 24h | 1h |
| Customer META (long-lived) | Yes | 24h |
| Cache mirror (if any) | No — can rebuild | N/A |
| Analytics aggregates | Yes — but daily snapshot is enough | 24h |
PITR is on/off. The optimization isn't "reduce PITR cost"; it's "use PITR only on tables that need < 1h RPO; use daily backups for everything else."
For tables with daily-backup-sufficient: AWS Backup with daily frequency, 7-day retention. ~10% of PITR cost.
The principle: recovery requirements are per-table, not per-database. Group tables by RPO; apply the right backup mechanism per group.
A2: How do you safely roll out a TTL change in production?
A multi-phase rollout:
- Shadow mode: deploy the new TTL value as a secondary attribute that doesn't drive eviction. Observe what would have been deleted.
- Audit: count "would have lost" sessions. If > 1% of sessions would lose context, hold rollout — TTL is too aggressive.
- Gradual rollout: enable new TTL for 10% of new items, ramp to 50% then 100% over 2 weeks. Existing items keep old TTL until they expire.
- Monitoring: re-ask rate per intent, escalation rate per intent. If either climbs > 5%, roll back.
The deeper principle: TTL changes are silent until they bite. Items keep working until they don't, and the failure is "data was here yesterday." Multi-phase rollout converts the silent failure into observable signals.
A3: How do you handle DDB cost optimization when a new feature requires schema change?
Schema changes intersect cost optimization:
- Adding a new attribute: cost-aware golden gate. If the attribute bloats GSI (sparse GSI failure) or inflates per-item size, fails CI.
- Removing an attribute: backfill plan. Old items still have it; reads need to handle both with-attribute and without-attribute. After 24h TTL window, all items have the new schema. Then code can simplify.
- Renaming an attribute: never. Add the new name; deprecate the old. Both coexist for one TTL cycle.
- Changing a key: can't, in DDB. Create a new table; migrate. This is a multi-week project, not a feature PR.
The pattern: schema is forever; design with TTL-cycle migrations in mind. Without TTL aggressiveness, schema migration is a multi-month project. With 24h TTL, schema converges in 24h.
Intuition Gained — US-05 (MLOps)
The core insight: DDB cost optimization centers on TTL discipline + GSI projection discipline. Both can fail silently and both need automated guards (CI + runtime).
Mental model to carry forward:
"Storage that doesn't shrink is storage that's lying about TTL. Measure deletion lag, not just deletion rate."
The hidden failure mode: A schema change adds an attribute that's accidentally projected to a sparse GSI, doubling write costs overnight without anyone noticing.
One-line rule: Cost-aware CI gate + runtime storage monitoring + PR-review checklist. Three layers because schema bugs are silent.
Scenario US-06 (MLOps Engineer Lens) — RAG Pipeline Cost Optimization
Opening Question
Q: OpenSearch Serverless OCU consumption + embedding caching + conditional reranking. From an MLOps perspective, what's the operational nightmare scenario?
Round 1 Answer: Embedding model upgrade with un-versioned cache keys. Old embeddings in cache; new embeddings being computed. Cache lookups silently mismatch. Reranker thresholds are now uncalibrated. Quality drops; cost looks fine. Without strict cache-key versioning + a re-calibration CI gate, the embedding upgrade is a multi-system regression that's hard to root-cause. The operational nightmare is: this can happen on a Friday, the on-call sees "quality alerts" and doesn't link them to the embedding deployment three days ago.
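The cheapest guard against that nightmare is making the embedder version part of every cache key. A sketch; the version string and key prefix are illustrative:
import hashlib

EMBEDDER_VERSION = "embed-v2"   # bumped on every embedder upgrade (assumption)

def embedding_cache_key(query: str) -> str:
    # The embedder version is part of the key, so an upgrade can never read
    # vectors produced by the previous model.
    digest = hashlib.sha256(query.strip().lower().encode()).hexdigest()[:32]
    return f"emb:{EMBEDDER_VERSION}:{digest}"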
Round 1 — Surface
Follow-up: How do you operationalize the index re-build (blue/green for an OpenSearch index)?
Blue/green index workflow:
- Build green index: parallel to current (blue) index, with new chunking/embedding/schema. Write to S3, build OpenSearch index from S3 in 2-4 hour batch.
- Validation: query green with a 1K validation set; compare top-K against blue. Acceptable divergence is documented (e.g., 5% top-1 swap is OK; 30% is not).
- Read traffic split: 5% reads from green, 95% from blue. Measure quality (Recall@3, MRR) in production.
- Ramp: 5% → 25% → 50% → 100% over 2-4 weeks.
- Decommission blue after green is at 100% for 7 days.
The infra: index aliases in OpenSearch let read traffic switch by alias-update without code change.
The MLOps wrinkle: writes during the migration. New documents need to land in both indexes (dual-write) until cutover. Otherwise green is missing recent writes.
Round 2 — Push Harder
Follow-up: Embedding cache eviction policy. How do you decide?
Default Redis: LRU. For embedding cache, that's usually wrong because:
- Old high-volume queries get evicted by recent low-volume queries.
- Cache hit rate becomes recency-biased rather than frequency-biased.
Better: LFU (Least Frequently Used) for embedding cache. Redis 4+ supports maxmemory-policy allkeys-lfu.
Sizing: embedding cache should hold the top-N most-frequent queries. From production, N is usually 10K-50K. Per-entry size is ~6KB (1024-dim float32 vector + key + metadata). Total: ~60-300 MB. Modest.
The trap: people set up an embedding cache, observe a low hit rate, conclude "it doesn't work," then disable it. The actual issue is usually:
- Cache size too small (LRU evicts before a second hit can occur).
- LRU instead of LFU.
- TTL too short (1h might be too aggressive; embeddings rarely change for the same query).
Tune the cache, then conclude.
Round 3 — Squeeze
Follow-up: OCU autoscale ramps. How do you know they're working correctly?
OpenSearch Serverless OCU is opaque — you don't see instances, just OCU consumption. Monitoring:
- OCU consumption per minute: the OpenSearch Serverless OCU CloudWatch metrics (search OCU and indexing OCU consumption).
- Search latency p99: if OCU is keeping up with demand, latency is stable.
- Throttling errors: OpenSearch returns 429s if OCU can't scale fast enough.
Test the ramp:
- Send a 10x burst (e.g., 100 RPS for 5 min after a quiet period).
- Measure: was OCU consumption proportional? Was latency stable? Were there 429s?
- Acceptable: a brief latency spike (10-30s) during the ramp, no 429s.
- Unacceptable: sustained 429s = the OCU ramp didn't keep up.
If the ramp is too slow, the workaround is a higher OCU floor (more baseline capacity = less ramp needed). Cost vs. burst-resilience tradeoff.
The deeper MLOps practice: synthetic burst testing on a quarterly cadence. Production traffic doesn't always provide the burst you need to test the ramp; create it on a schedule.
Round 4 — Corner
Follow-up: Reranker model gets an upgrade. How do you ship it?
Cross-encoder reranker is a separate model from the embedder. Upgrade workflow:
- Calibration: per file 05's ML lens, the score-threshold-vs-correctness curve must be re-computed. CI gate runs the calibration on every reranker version PR. If no threshold T satisfies (skip ≥ 30%, top1 correct ≥ 95%), PR fails.
- Decision-equivalence: shadow the new reranker against the current one on 5K queries. Top-K agreement should be ≥ 90%.
- Canary: 5% traffic uses new reranker; measure RAG-grounded quality (Recall@3, citation correctness) on this slice.
- Ramp: 5% → 25% → 100% over 2 weeks if quality holds.
The infra: reranker is a separate SageMaker endpoint or Bedrock model. Multi-version routing is handled at the calling code level (not at endpoint level), which gives finer control.
The trap: reranker quality changes interact with embedder quality. Upgrading both in the same release makes blame ambiguous. Schedule them sequentially, with 2 weeks between.
Architect-Level Escalation
A1: Build me an SLO contract for the RAG pipeline that ties cost to quality.
SLO #1 (Quality): Recall@3 >= 82% per intent, weekly p50.
SLO #2 (Cost): RAG monthly cost <= $400.
SLO #3 (Latency): Search p99 <= 200ms.
SLO #4 (Freshness): New documents indexed within 4 hours of publish.
If quality and cost SLOs conflict (quality regression but within cost), quality wins — adjust cost levers to recover quality. Document this as the contract.
The error budget: quality SLO has a 1% miss budget per quarter; cost SLO has a 0% miss budget (cost is a hard limit). If we miss quality SLO, we have data to negotiate (lower cost target, better infra). If we miss cost SLO, finance escalates immediately.
A2: How do you operationally handle a "RAG bypass" gate that needs intent-aware decisions?
Bypass logic lives in the routing layer, not the RAG layer:
def should_bypass_rag(intent: str, query: str, confidence: float) -> bool:
bypass_intents = config["rag_bypass_intents"]
if intent not in bypass_intents:
return False
if confidence < 0.85:
return False
if contains_product_name(query):
return False # sub-intent override
return True
The configuration rag_bypass_intents is in AppConfig (per US-02 pattern). Sub-intent overrides (contains_product_name) are code that has its own unit tests. The combination is auditable.
Per-intent bypass rate is logged. If bypass_rate_for_promotion drops because the override fires on most promotion queries, that's a signal that the bypass is the wrong design for that intent.
A3: When does the RAG pipeline's ops complexity exceed its cost?
Signs:
- Quarterly engineering time on RAG infra (index rebuilds, embedder upgrades, reranker tuning) > $X.
- Number of feature flags governing RAG behavior > 8.
- New-engineer ramp time on RAG > 1 week.
When these happen, simplify: drop conditional reranking, pin embedder version (longer cycles), use a single index keyspace instead of per-source. Trade some quality and cost for operational simplicity.
The macro: complex pipelines have complex maintenance costs. Optimization that adds complexity should be measured against the complexity it adds.
Intuition Gained — US-06 (MLOps)
The core insight: RAG cost optimization is a constellation of calibrated parameters (bypass rules, score thresholds, cache TTLs, OCU floor). The MLOps maturity is treating each as a versioned, monitored, re-calibratable artifact.
Mental model to carry forward:
"Every RAG cost lever is a parameter on a curve. The curve moves when upstream changes. Re-calibration is recurring work, not one-time tuning."
The hidden failure mode: Embedder upgrade without cache-key versioning corrupts cache silently for one TTL window.
One-line rule: Cache key includes embedder version. Index name includes schema version. Reranker has its own calibration CI gate. Every artifact knows its dependencies.
Scenario US-07 (MLOps Engineer Lens) — Analytics Pipeline Cost Optimization
Opening Question
Q: Kinesis batching + RA3 hot/cold tiering + materialized views. From an MLOps perspective, what's the most important reliability concern?
Round 1 Answer: End-to-end event delivery guarantee. Cost optimizations on the analytics pipeline (batching, compression, tiering) all introduce latency or potential loss. The cost circuit breaker (US-08) reads from this pipeline; if events are lost, the breaker is blind. The reliability concern is: every event class — especially low-frequency ones — has an end-to-end delivery SLO and instrumentation that proves it. Without that, the cost-optimization is over-fitting to the average case and breaking on the spike.
Round 1 — Surface
Follow-up: Kinesis lag SLO. How do you monitor it?
Kinesis-native metric IteratorAge (per shard, max across shards) — milliseconds since the last record processed.
Alarms:
- IteratorAge > 5s for 2 min: warning.
- > 30s: page on-call.
- > 5 min: emergency, downstream consumers are stale.
Dashboard: per-stream IteratorAge time-series, side-by-side with consumer health metrics.
For batched events (50 records / 5 s), additional monitoring:
- Batch fill ratio: how full are batches? If consistently 100% (always full), the batch size is too small for the load — adjust batch size or trigger time.
- Per-event-class delivery rate: count of each event class delivered vs. count emitted at the source. The delta = loss.
The deeper MLOps insight: Kinesis is eventual delivery, not guaranteed. The lag is your tolerance for "eventual."
Round 2 — Push Harder
Follow-up: Firehose backpressure on Prime Day. What's the runbook?
Prime Day is predictable; pre-scale before the event:
- One week before: increase Kinesis on-demand capacity headroom (Kinesis on-demand auto-scales but has a multi-minute ramp; pre-warm).
- One day before: review Firehose buffer settings. The default BufferingHints may be too small for the surge. Increase the buffer interval to 600s and the buffer size to 128 MB.
- During: monitor Firehose IncomingBytes, IncomingRecords, and DataReadFromKinesisStream.Bytes. If bytes-in > bytes-out, backpressure is building.
- Mitigation if stuck: spawn a parallel Firehose stream (e.g., analytics-firehose-priority) for high-priority events; route critical events there. Bypasses the congested main stream.
The runbook is documented and practiced. MLOps maturity: every predictable spike has a pre-scale checklist; on-call doesn't improvise during the event.
Round 3 — Squeeze
Follow-up: Materialized view refresh failure. The cost circuit breaker reads from it. What's the failure-mode handling?
Failure-mode hierarchy on the breaker side:
- View returns fresh data (as_of_timestamp within 15 min): use the value.
- View returns stale data (as_of_timestamp 15-60 min old): use the value + a safety buffer (multiply by 1.1).
- View returns NULL (refresh failed): fail-safe — assume the worst case; engage the WARNING tier.
- View query times out: fail-safe — assume the worst case; engage the WARNING tier; alert. (Sketch below.)
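A sketch of the breaker-side read implementing that hierarchy; query_view is an assumed interface, and the over-60-minute branch is an added assumption consistent with the fail-safe posture:
from datetime import datetime, timedelta, timezone

def read_spend_for_breaker(query_view) -> tuple:
    """Returns (spend_estimate, tier_hint). query_view() reads the materialized view
    and returns (spend, as_of_timestamp) or raises on timeout."""
    try:
        spend, as_of = query_view()
    except Exception:
        return (float("inf"), "WARNING")           # timeout or error: fail-safe
    if spend is None or as_of is None:
        return (float("inf"), "WARNING")           # refresh failed: fail-safe
    age = datetime.now(timezone.utc) - as_of
    if age <= timedelta(minutes=15):
        return (spend, None)                       # fresh: use as-is
    if age <= timedelta(minutes=60):
        return (spend * 1.1, None)                 # stale: add the safety buffer
    return (float("inf"), "WARNING")               # too old: treat like a failure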
On the view side:
- Daily monitoring: view_refresh_success_rate per day.
- Alarm if any single refresh fails.
- Auto-retry (Redshift JOB) on first failure.
- If retry also fails: page on-call.
The principle: the breaker is a safety system; the data feeding it must be safety-rated. That means explicit failure modes, fail-safes, and observability on all of them.
Round 4 — Corner
Follow-up: Schema evolution on batched events. A field is renamed at the producer. Consumers (dashboards, breakers) all break. What's the mitigation?
Schema evolution discipline (file 05 ML lens A2): additive-only changes. Renames are forbidden.
If a rename is unavoidable:
- Add the new field alongside the old. Both are written for one TTL cycle.
- Update consumers to read the new field with fallback to the old.
- Verify all consumers are updated.
- Stop writing the old field.
- After another TTL cycle (when no historical query reads the old field), drop the old field.
This is a five-step migration spread across TTL cycles, not a single PR. The discipline: schema changes go through a schema-review process, not a code-review-only process.
For analytics pipelines, the pattern: producer + Glue Catalog + consumers form a contract. Any change to the producer requires updating the Glue schema first; otherwise consumers can't even read the new data.
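A minimal sketch of the dual-read step (read the new field, fall back to the old); field names and event shape are illustrative:

```python
def read_event_field(event: dict, new_key: str, old_key: str):
    """During a rename migration, prefer the new field name and fall back to the old one.

    Hypothetical helper: the old field is still written during the overlap TTL cycle,
    so either key may be present depending on when the event was produced.
    """
    if event.get(new_key) is not None:
        return event[new_key]
    return event.get(old_key)
```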
Architect-Level Escalation
A1: Cost-tracking pipeline as a critical system. SLO?
SLO #1: Cost data freshness <= 15 min p99 (cost circuit breaker depends on this).
SLO #2: Event loss rate <= 0.1% on any event class.
SLO #3: Materialized view refresh success rate >= 99.9% per month.
SLO #4: Cost dashboard p95 query latency <= 5s (operator usability).
Each SLO has alarms tied to it. Each alarm has a runbook. Each runbook is practiced quarterly. This is critical-system rigor applied to analytics — usually under-applied because analytics "feels" non-critical.
A2: When do you migrate from Redshift to a cheaper warehouse?
Triggers:
- Redshift cost > 20% of total system cost.
- Redshift maintenance overhead (vacuum, analyze, capacity tuning) > 1 engineer-day/week.
- Query patterns are fully aggregatable (no need for joins or random access).
Alternatives:
- Athena over S3 Parquet for ad-hoc and scheduled queries: ~10% of Redshift cost, higher per-query latency.
- Aurora if join workloads dominate.
- Single-tenant Redshift Serverless for predictable workloads.
The migration is multi-month. Don't jump to it on a small cost concern; cross the trigger thresholds first.
A3: How do you organizationally manage the analytics pipeline as both a cost-optimization target AND a critical observability dependency?
Two-team model:
- Data Engineering owns the pipeline infrastructure, schema, and performance.
- SRE consumes the pipeline for observability (cost dashboards, breaker signals).
SLOs are negotiated jointly. SRE's needs (freshness, reliability) constrain Data Engineering's optimization choices (can't batch more aggressively because it breaks freshness SLO).
Quarterly: Data Engineering proposes optimization PRs; SRE reviews for SLO impact; both sign off. No unilateral changes to the contract.
The principle: observability infrastructure is shared; ownership is shared. Cost optimizers can't optimize away the things observability needs.
Intuition Gained — US-07 (MLOps)
The core insight: Analytics is critical infrastructure that feeds safety systems. Cost optimization on it must respect SLOs that other systems depend on.
Mental model to carry forward:
"Cost-optimize the pipeline; don't compromise its role as a feedback signal for other systems."
The hidden failure mode: Batching swallows low-frequency events under burst, blinding the cost circuit breaker on the very day cost protection matters most.
One-line rule: Per-event-class delivery SLO, end-to-end. Loss-rate alarm per class. Critical events have their own preserved-delivery contract.
Scenario US-08 (MLOps Engineer Lens) — Traffic-Based Cost Optimization
Opening Question
Q: Cost circuit breaker + rate limiter + degradation ladder. From an MLOps perspective, what's the operational discipline that keeps these working as designed?
Round 1 Answer: Quarterly chaos drills. Each component (breaker, rate limiter, degradation ladder) gets a synthetic stress test in production-like staging on a recurring cadence. Without practice, these systems atrophy — the on-call doesn't trust them, the configuration drifts, the alarms get noisy and ignored. The discipline: practice the bad day until everyone is bored, so the real bad day feels routine.
Round 1 — Surface
Follow-up: Rate-limit Redis hot-key risk. Specifically, how does it surface and how do you mitigate?
The rate limiter is implemented as a Redis counter per user per minute (rate:{user_id}:min:{T}), plus a global counter (rate:global:min:{T}).
The hot-key risk is the global counter. Every request increments it. At 50K msg/min, 50K writes/min hit the same Redis key. That's 833 ops/sec on one key, on one Redis cluster slot.
Symptom: Redis CPU spikes on the node owning that slot. Other operations on that node (other counters, other keyspaces) get queued. Latency for unrelated services climbs.
Mitigation:
1. Shard the global counter: 16 sub-counters (rate:global:min:{T}:{shard_idx}), where shard_idx = hash(request_id) % 16. Sum of 16 counters = global rate. Each sub-counter sees 1/16 of the traffic = 52 ops/sec. No more hot-key.
2. Lua script for atomic increment + check: reduces round-trip count.
3. Cluster-aware client: ensures sub-counters distribute across slots.
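A minimal sketch of the sharded counter, assuming redis-py against a single endpoint (in cluster mode the multi-key read needs a cluster-aware client or per-key reads) and using a pipeline where a Lua script would make the increment-and-expire a single atomic round trip:

```python
import hashlib
import time

import redis  # assumes redis-py is installed

NUM_SHARDS = 16
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def record_request(request_id: str) -> None:
    """Increment one of 16 sub-counters for the current minute window."""
    minute = int(time.time() // 60)
    # md5 instead of hash() so the shard choice is stable across processes.
    shard = int(hashlib.md5(request_id.encode()).hexdigest(), 16) % NUM_SHARDS
    key = f"rate:global:min:{minute}:{shard}"
    pipe = r.pipeline()
    pipe.incr(key)
    pipe.expire(key, 120)  # keep two minute-windows, then let Redis evict
    pipe.execute()

def global_rate_this_minute() -> int:
    """Sum the sub-counters to recover the global per-minute rate."""
    minute = int(time.time() // 60)
    keys = [f"rate:global:min:{minute}:{s}" for s in range(NUM_SHARDS)]
    values = r.mget(keys)  # in cluster mode, read per-key or via a cluster-aware client
    return sum(int(v) for v in values if v is not None)
```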
The MLOps observation: counters are not a free pattern at scale. Anything that touches the same key on every request is a hot-spot waiting to happen.
Round 2 — Push Harder
Follow-up: Cost circuit breaker state replication. The breaker engages on one task; how does the rest of the fleet know?
Two architecture options:
A) Shared-state breaker: the engagement state (current_degradation_level) is in Redis or DynamoDB, read by every task on every request. Pros: single source of truth. Cons: every request reads from a shared store; that store is a hot path.
B) Distributed breaker: each task computes the breaker state independently from CloudWatch metrics or the cost view. Pros: no shared hot path. Cons: tasks may briefly disagree on the state; brief inconsistency window.
For MangaAssist, option (A) with client-side caching (a 5-second cache of the breaker state on each task) is the right balance:
- Tasks read the breaker state from Redis once per 5 seconds.
- The cached state drives request handling.
- A 5-second inconsistency window during state changes is acceptable.
Operational implication: the breaker state Redis key is itself critical infrastructure. Need monitoring on the read latency and staleness of that key.
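A minimal sketch of the per-task cache, assuming redis-py and a hypothetical key name; the fail-safe on Redis errors is an added assumption:

```python
import time

import redis  # assumes redis-py is installed

BREAKER_KEY = "breaker:current_degradation_level"  # illustrative key name
CACHE_TTL_SECONDS = 5

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
_cached_level = 0
_cached_at = 0.0

def current_degradation_level() -> int:
    """Return the breaker state, refreshed from Redis at most once per 5 seconds.

    If Redis is unreachable, keep serving the last cached value rather than
    silently dropping back to level 0 and disabling the breaker.
    """
    global _cached_level, _cached_at
    now = time.time()
    if now - _cached_at >= CACHE_TTL_SECONDS:
        try:
            value = r.get(BREAKER_KEY)
            if value is not None:
                _cached_level = int(value)
        except redis.RedisError:
            pass  # keep the stale value; staleness should be alarmed on separately
        _cached_at = now  # back off for another cache window either way
    return _cached_level
```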
Round 3 — Squeeze
Follow-up: Degradation ladder fan-out — different services need different responses to "we're at level 3." How do you coordinate?
Centralized state, decentralized response:
- Central: the degradation controller writes current_level to a known location (Redis key or AppConfig parameter).
- Decentralized: each service reads current_level and applies its own response policy. Examples:
  - Orchestrator: level >= 2 → force Haiku tier.
  - RAG service: level >= 3 → bypass reranker for all queries.
  - Cache: level >= 3 → extend TTLs by 2x.
  - WebSocket service: level >= 4 → reject new connections.
The contract: each service has a documented per-level response. Documented in code (with comments) and in a cross-service runbook.
Test:
- For each level, simulate the state change; each service reports its response. Verify all responses match the documented policy.
- Run quarterly.
The systems insight: fan-out coordination is a contract, not an event broadcast. Services read centralized state on their own cadence; the controller doesn't push events.
Round 4 — Corner
Follow-up: WAF rules update. False-positive rate on real users climbs from 0.1% to 1.5% — 15x. Runbook?
Immediate response:
- Confirm: pull WAF block logs. Sample 100 blocked requests; classify as bot (true positive) vs real user (false positive). Confirm the 1.5% rate.
- Identify the rule: diff the update against the previous rule set to see which rules were added or changed, then identify the specific rule(s) firing on real users.
- Disable the offending rule: WAF rules can be disabled individually without redeploying. Latency: ~30 seconds.
- Post-mortem: was the rule tested against a real-user sample before deployment? If not, fix the deployment process.
Long-term: every WAF rule update goes through:
- Offline test against a real-user sample (≥10K queries): measure FP rate.
- Canary on 5% of real traffic for 48 hours: measure FP rate in production conditions.
- Full rollout only after both gates clear.
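A minimal sketch of the offline FP-rate gate, with hypothetical names; the 0.1% threshold mirrors the baseline rate above:

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class SampledRequest:
    request: dict       # normalized request fields (path, headers, body, ...)
    is_real_user: bool  # ground-truth label from the curated real-user sample

def false_positive_rate(rules: Callable[[dict], bool],
                        sample: Iterable[SampledRequest]) -> float:
    """rules(request) -> True means 'block'. FP rate = blocked real users / real users."""
    real_users = blocked_real_users = 0
    for item in sample:
        if item.is_real_user:
            real_users += 1
            if rules(item.request):
                blocked_real_users += 1
    return blocked_real_users / max(real_users, 1)

def waf_offline_gate(rules, sample, max_fp_rate: float = 0.001) -> None:
    """Fail the deployment gate if the candidate rule set regresses past the FP budget."""
    fp = false_positive_rate(rules, sample)
    assert fp <= max_fp_rate, f"WAF FP rate {fp:.4%} exceeds gate {max_fp_rate:.4%}"
```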
The MLOps maturity: WAF rules are configuration, but they are also models (with a hand-written decision boundary). They need ML-style validation, not just code review.
Architect-Level Escalation
A1: Cost circuit breaker as a critical safety system. What's the SRE-level treatment?
The breaker gets the same rigor as a production database:
- Documented architecture and runbook.
- SLO: zero false negatives per quarter (must engage on real overruns); ≤1 false positive per quarter.
- Quarterly chaos drill: synthetic spend curve in staging; verify breaker engages correctly.
- Quarterly post-mortem review: any breaker incidents reviewed by SRE leads.
- On-call training: every new on-call engineer runs through the breaker runbook.
- Monitoring on the monitoring: alarm if the breaker hasn't been triggered in 90 days (suspicious — either nothing approached the budget, or the breaker is broken). Test it explicitly.
The principle: safety systems atrophy without exercise. The breaker that's never tested is the breaker that doesn't engage when needed.
A2: How do you handle the cross-cutting nature of US-08 — it touches every service?
Per-service contract documents:
Per-Service Degradation Contract (Service: Orchestrator)
- Reads: current_degradation_level from AppConfig, cached 5s.
- Level 0 (Normal): full pipeline.
- Level 1 (Pressure): reranker score threshold 0.95 (was 0.9).
- Level 2 (High Load): force Haiku tier on guest; Sonnet for auth+.
- Level 3 (Overload): force Haiku for all; template-first more aggressive.
- Level 4 (Emergency): template-only.
- Level 5 (Stop): reject all guest, queue auth.
- Recovery: levels recover monotonically; 5-min hysteresis on each transition.
Each service has its own contract. The cross-service runbook lists all contracts. PR review for any service change requires updating the contract if behavior changes.
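One way to make a contract enforceable in PR review is to encode it as data and assert the live policy against it; a hedged sketch with illustrative structure (the levels and actions mirror the orchestrator contract above, the names are hypothetical):

```python
# Illustrative: the orchestrator's degradation contract as data, so a unit test can
# assert the running policy matches the documented contract.
ORCHESTRATOR_DEGRADATION_CONTRACT = {
    0: {"pipeline": "full"},
    1: {"reranker_score_threshold": 0.95},
    2: {"tier": {"guest": "haiku", "auth": "sonnet"}},
    3: {"tier": {"all": "haiku"}, "template_first": "aggressive"},
    4: {"mode": "template_only"},
    5: {"guest": "reject", "auth": "queue"},
}

def test_policy_matches_contract(policy_for_level):
    """policy_for_level(level) is the service's live policy function under test."""
    for level, expected in ORCHESTRATOR_DEGRADATION_CONTRACT.items():
        assert policy_for_level(level) == expected, f"level {level} drifted from contract"
```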
A3: When do you build a separate cost-engineering team?
Triggers:
- Cost-optimization tickets > 30% of all engineering tickets quarter-over-quarter.
- Multiple cost-optimization initiatives need cross-team coordination (these 8 user stories are an example).
- Finance leadership is asking for cost forecasts that engineering can't easily produce.
A cost-engineering team owns:
- Cost telemetry (US-07 plus pipelines).
- The cost-optimization roadmap (the 8 user stories).
- Per-team cost dashboards and budgets.
- Cost-regression CI gates.
The principle: once cost is a strategic concern, it deserves dedicated engineering ownership. Embedded cost work in feature teams produces ad-hoc results.
Intuition Gained — US-08 (MLOps)
The core insight: Traffic-based cost optimization is a fleet of safety systems (breaker, limiter, degrader). Each needs critical-system treatment: SLO, chaos drills, runbooks, on-call training.
Mental model to carry forward:
"Safety systems are infrastructure that prevents incidents from being incidents. They're invisible when working and catastrophic when they don't. Practice the bad day."
The hidden failure mode: A breaker that's never triggered — either nothing tested it or it's broken. Either way, it's not a safety system; it's a placebo.
One-line rule: Quarterly chaos drills + per-component SLO + documented runbook + on-call training. If any of those are missing, the system isn't operationally ready.
Cross-Scenario Wrap-Up for MLOps Engineer Loop
After working through 8 scenarios from the MLOps lens:
- Telemetry first. Every cost optimization needs per-request attribution and per-lever engagement signals before it ships. Without these, the optimization is undebuggable.
- Kill switches with verified flip paths. Flags exist; the flip path is what matters. Practice the flip in chaos drills.
- CI gates for cost regression. Cost-aware golden runs on every PR. Without it, prompt edits and feature additions silently inflate cost.
- Per-component SLO + alarm + runbook. Every cost-optimization mechanism is a managed system, not a one-time deployment.
- Quarterly chaos drills. Practice the bad day. Safety systems atrophy without exercise.
- Cost is a continuous discipline. Optimization decays as code and traffic evolve. Build the maintenance cadence into the team's rhythm.
Continue to 07-cross-cutting-system-grill.md for system-level questions that span multiple scenarios — these are where principal-level system thinking is tested.