
Interview Q&A — Resilient AI Systems

Skill 1.2.3 | Task 1.2 — Select and Configure FMs | Domain 1


Scenario 1: No Circuit Breaker — ThrottlingException Retry Storm

Opening Question

Q: MangaAssist's Bedrock integration has no circuit breaker. During a traffic spike, Bedrock starts returning ThrottlingException. Every service retries with exponential backoff, amplifying the request load 3×. Within 2 minutes, 100% of chatbot requests are failing. Walk me through the root cause, the immediate response, and how you design a proper circuit breaker that prevents this.

Model Answer

The root cause is the absence of a circuit breaker pattern combined with naive retry logic. When Bedrock throttles, each service retries independently; at 3 retry attempts per request, the effective request load triples. Bedrock's throttle condition worsens under the increased load, creating a positive feedback loop that drives failures to 100%. The immediate incident response: reduce the retry count to 0 (disable retries) to stop the amplification, then deploy a Haiku fallback path as the emergency graceful degradation. The circuit breaker design: a Redis-backed BedrockCircuitBreaker with three states: CLOSED (normal operation), OPEN (block all requests to Bedrock and route to the fallback), and HALF_OPEN (allow a single probe request to test recovery). Parameters: FAILURE_THRESHOLD=5 (open after 5 consecutive failures), RESET_TIMEOUT=30 (seconds in OPEN before attempting a HALF_OPEN probe), HALF_OPEN_MAX_CALLS=1. Redis backing means all ECS tasks share circuit state: if one task detects Bedrock is throttling, every task stops hammering Bedrock immediately, not just the task that detected it. The fallback in the OPEN state: route to Claude 3 Haiku (which draws on a separate, typically less-saturated throttling quota) or return a cached/static recommendation set.
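
A minimal sketch of the breaker described above, assuming a redis-py client; class, method, and key names are illustrative rather than MangaAssist's actual code:

    import time
    import redis

    FAILURE_THRESHOLD = 5   # consecutive failures before opening
    RESET_TIMEOUT = 30      # seconds in OPEN before a HALF_OPEN probe

    class BedrockCircuitBreaker:
        def __init__(self, r: redis.Redis, name: str = "bedrock"):
            self.r = r
            self.state_key = f"circuit:{name}:state"
            self.failures_key = f"circuit:{name}:failures"
            self.opened_at_key = f"circuit:{name}:opened_at"

        def state(self) -> str:
            return (self.r.get(self.state_key) or b"CLOSED").decode()

        def allow_request(self) -> bool:
            s = self.state()
            if s == "CLOSED":
                return True
            if s == "OPEN":
                opened_at = float(self.r.get(self.opened_at_key) or 0)
                # Eligible for a probe once the reset timeout elapses; the
                # atomic HALF_OPEN transition is shown in Follow-up 2.
                return time.time() - opened_at >= RESET_TIMEOUT
            return False  # HALF_OPEN: another task's probe is in flight

        def record_failure(self) -> None:
            # Shared counter: any task's failure advances the cluster count.
            if self.r.incr(self.failures_key) >= FAILURE_THRESHOLD:
                pipe = self.r.pipeline()
                pipe.set(self.state_key, "OPEN")
                pipe.set(self.opened_at_key, time.time())
                pipe.set(self.failures_key, 0)
                pipe.execute()

        def record_success(self) -> None:
            pipe = self.r.pipeline()
            pipe.set(self.state_key, "CLOSED")
            pipe.set(self.failures_key, 0)
            pipe.execute()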

Follow-up 1: Redis-backed vs. local in-memory circuit breaker

Q: Why Redis instead of an in-memory circuit breaker per ECS task? Is the Redis overhead worth it? A: With a local in-memory circuit breaker, each of the 50 ECS tasks has its own breaker. If Task 1 detects 5 failures and opens its circuit, Tasks 2–50 keep sending requests to Bedrock (which is still throttling). Each of Tasks 2–50 must independently observe 5 failures before opening; that is up to 245 more failed requests (49 tasks × 5 failures) before the cluster converges to OPEN. This reproduces the retry storm: the cluster as a whole keeps amplifying load even after individual tasks have tripped their local breakers. With a Redis-backed circuit breaker, Task 1's failures increment a shared counter in Redis. When the counter reaches 5, the circuit state key is set to OPEN. All 50 tasks check the same Redis key before sending requests, so within the next request cycle (< 10ms) every task sees OPEN and stops sending. The Redis GET/SET overhead is 0.3–0.8ms per request, negligible against a 200–1,800ms Bedrock invocation. The distributed coordination benefit outweighs the Redis overhead by orders of magnitude.

Follow-up 2: State machine transitions and HALF_OPEN probe

Q: Describe the exact state transition logic for all three circuit breaker states. A: Transition rules: CLOSED → OPEN: when failure_count in Redis reaches FAILURE_THRESHOLD. Set Redis key circuit:bedrock:state = OPEN, set circuit:bedrock:opened_at = unix_timestamp, reset failure_count = 0. OPEN → HALF_OPEN: when current_time - opened_at >= RESET_TIMEOUT. The first request to check the circuit after RESET_TIMEOUT elapsed: atomically set circuit:bedrock:state = HALF_OPEN and proceed as the probe request. HALF_OPEN → CLOSED (success): the probe request succeeded. Set circuit:bedrock:state = CLOSED, reset failure_count = 0, reset opened_at. HALF_OPEN → OPEN (failure): the probe request failed. Set circuit:bedrock:state = OPEN, set circuit:bedrock:opened_at = current_time (restart the RESET_TIMEOUT). Using atomic Redis operations (SETNX, GETSET, or Lua scripts) for the HALF_OPEN transition ensures only one ECS task sends the probe request, not all 50 simultaneously (which would re-create the thundering herd during recovery).
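
A sketch of the atomic OPEN → HALF_OPEN transition as a Redis Lua script, so exactly one task wins the right to send the probe; key names follow the breaker sketch above and RESET_TIMEOUT is passed as an argument:

    import time
    import redis

    TRY_HALF_OPEN_LUA = """
    if redis.call('GET', KEYS[1]) == 'OPEN' and
       (tonumber(ARGV[1]) - tonumber(redis.call('GET', KEYS[2]) or '0'))
         >= tonumber(ARGV[2]) then
      redis.call('SET', KEYS[1], 'HALF_OPEN')
      return 1
    end
    return 0
    """

    def try_become_probe(r: redis.Redis) -> bool:
        # Redis runs the script atomically, so exactly one caller sees 1
        # once RESET_TIMEOUT has elapsed; everyone else keeps falling back.
        script = r.register_script(TRY_HALF_OPEN_LUA)
        return script(
            keys=["circuit:bedrock:state", "circuit:bedrock:opened_at"],
            args=[time.time(), 30],  # 30 = RESET_TIMEOUT
        ) == 1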

Follow-up 3: What is the fallback when the circuit is OPEN?

Q: The circuit is OPEN — all Bedrock Sonnet invocations are blocked. What does the user see and what does the system do? A: Tiered fallback: (1) Haiku fallback: if the ThrottlingException is specific to the Sonnet quota (not a global Bedrock outage), try Claude 3 Haiku which has a separate throttle limit. Log bedrock_fallback_used=haiku as a metric dimension. Quality is reduced but the chatbot remains functional. (2) Cached recommendations: for the recommendation use-case, Redis holds the last 100 most popular manga recommendations. If both Sonnet and Haiku are OPEN, return the cached recommendations with a user-visible message: "Here are some popular recommendations while we tune our suggestions for you." TTL on cached recommendations: 3,600 seconds (1 hour) — not 72 hours. (3) Static fallback response: for intent classification and safety checks, a predefined conservative response ("I'm having trouble understanding your request right now, please try again shortly"). The fallback decisions are encoded in the circuit breaker's execute_with_fallback(primary_fn, fallback_fn) interface — each call site registers both a primary and a fallback.
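
A sketch of the execute_with_fallback interface described above, assuming the breaker sketch from the model answer; it also encodes the error-type distinction from Grill 2 (only capacity errors trip the circuit):

    from typing import Callable, TypeVar
    from botocore.exceptions import ClientError

    T = TypeVar("T")
    CAPACITY_ERRORS = {"ThrottlingException", "ServiceUnavailableException"}

    def execute_with_fallback(breaker, primary_fn: Callable[[], T],
                              fallback_fn: Callable[[], T]) -> T:
        if not breaker.allow_request():
            return fallback_fn()  # circuit OPEN: skip Bedrock entirely
        try:
            result = primary_fn()
            breaker.record_success()
            return result
        except ClientError as err:
            if err.response["Error"]["Code"] in CAPACITY_ERRORS:
                breaker.record_failure()  # capacity errors trip the circuit
                return fallback_fn()
            raise  # ValidationException etc. surface to the caller

A call site then registers, for example, a Sonnet invocation as primary_fn and a Haiku invocation (or a cache read) as fallback_fn.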

Follow-up 4: Testing the circuit breaker without a real Bedrock outage

Q: How do you test the circuit breaker in CI and staging without inducing a real Bedrock failure? A: Three testing strategies: (1) Unit tests with mocked Bedrock: mock bedrock.invoke_model() to raise ThrottlingException on demand. Assert that after 5 failures, get_circuit_state() == OPEN. Assert that subsequent calls return fallback output without calling Bedrock. Assert that after RESET_TIMEOUT seconds, the state transitions to HALF_OPEN. Assert that one successful probe → CLOSED. (2) Integration tests with test Redis: use a real Redis instance in the test environment. Run circuit breaker state transitions end-to-end with actual Redis key reads/writes. Verify that two simulated ECS tasks share circuit state (Task A's failures affect Task B's routing). (3) Chaos engineering in staging: use a Lambda layer or Envoy proxy rule to return ThrottlingException for 10% of Bedrock calls in staging. Assert that the circuit opens within 5 failures and the fallback serves all subsequent requests. Monitor that no real ThrottlingExceptions appear in Bedrock's CloudWatch metrics (confirming the circuit truly stopped sending). Schedule this chaos test monthly using AWS Fault Injection Simulator (FIS).
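
A sketch of the unit tests in (1) and (2), using the fakeredis library to stand in for the shared store; it assumes the BedrockCircuitBreaker sketch from the model answer:

    import time
    from unittest.mock import patch
    import fakeredis

    def test_circuit_opens_after_five_failures():
        breaker = BedrockCircuitBreaker(fakeredis.FakeRedis())
        for _ in range(5):
            breaker.record_failure()
        assert breaker.state() == "OPEN"
        assert breaker.allow_request() is False

    def test_two_tasks_share_circuit_state():
        r = fakeredis.FakeRedis()
        task_a, task_b = BedrockCircuitBreaker(r), BedrockCircuitBreaker(r)
        for _ in range(5):
            task_a.record_failure()      # Task A's failures...
        assert task_b.state() == "OPEN"  # ...open Task B's circuit too

    def test_probe_allowed_after_reset_timeout():
        breaker = BedrockCircuitBreaker(fakeredis.FakeRedis())
        for _ in range(5):
            breaker.record_failure()
        # Jump past RESET_TIMEOUT instead of sleeping in CI.
        with patch("time.time", return_value=time.time() + 31):
            assert breaker.allow_request() is True  # eligible for the probe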

Grill 1: "The circuit breaker adds latency — every request now does a Redis GET"

Q: A performance engineer argues: "Every Bedrock call now does a Redis GET to check circuit state, adding 1ms latency. At 1M messages/day, that's non-trivial." How do you respond? A: Quantify first: 1M messages/day = ~11.5 requests/second. At 0.5ms Redis GET latency (same-AZ ElastiCache), the total per-day overhead is 1,000,000 × 0.5ms = 500 seconds of compute time, spread across the fleet. Per-user latency impact: 0.5ms out of a 1,800ms Bedrock call = 0.03% overhead — imperceptible. But the real argument is the asymmetry: without the circuit breaker, a single Bedrock throttle event can cause 100% of requests to fail for 2+ minutes. With the circuit breaker, the same event causes 5 failures (< 0.5 seconds at normal traffic), then all requests serve from a working fallback. The protection value of the circuit breaker against a 2-minute 100% failure window vastly outweighs 0.5ms of latency on every request. This is a classic reliability vs. micro-performance tradeoff — choose reliability. The engineer optimizing for 0.5ms while shipping without a circuit breaker is optimizing the wrong thing.

Grill 2: The circuit opens too aggressively — 5 failures in 2 seconds during a burst

Q: During a normal traffic burst, 5 transient failures occur in 2 seconds due to brief Bedrock latency spikes (not throttling). The circuit opens and 10,000 users hit the fallback. Was the threshold too sensitive? A: Yes — a simple failure count without a time window is too coarse. Correct the threshold to a failure rate within a time window: "5 failures within a 30-second window" rather than "5 cumulative failures." Redis implementation: use a sorted set keyed by circuit:bedrock:failures with score = unix_timestamp. Count entries with score > current_time - 30. If count >= 5, open the circuit. Also expire old entries (ZREMRANGEBYSCORE older than 30s). Second refinement: distinguish error types. ThrottlingException and ServiceUnavailableException are indicators of Bedrock overload → count toward opening the circuit. ValidationException (malformed request) and AccessDeniedException (IAM) are not Bedrock capacity issues → do not count. A ValidationException storm would otherwise incorrectly open the circuit, blocking valid requests.
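
A sketch of the windowed count using a Redis sorted set, replacing the cumulative counter; the key name matches the prose, and member values are made unique so concurrent failures never collapse into one entry:

    import time
    import uuid
    import redis

    WINDOW_SECONDS = 30
    FAILURE_THRESHOLD = 5
    FAILURES_ZSET = "circuit:bedrock:failures"

    def record_windowed_failure(r: redis.Redis) -> bool:
        """Record one failure; return True if the circuit should open."""
        now = time.time()
        pipe = r.pipeline()
        # Unique member per failure; score is the failure timestamp.
        pipe.zadd(FAILURES_ZSET, {f"{now}:{uuid.uuid4().hex}": now})
        # Expire entries older than the window (ZREMRANGEBYSCORE).
        pipe.zremrangebyscore(FAILURES_ZSET, 0, now - WINDOW_SECONDS)
        pipe.zcard(FAILURES_ZSET)  # count failures inside the window
        _, _, in_window = pipe.execute()
        return in_window >= FAILURE_THRESHOLD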

Red Flags — Weak Answer Indicators

  • Proposing a local in-memory circuit breaker per-task without recognizing the cluster-wide coordination requirement
  • Not distinguishing error types that should trigger the circuit vs. client errors that should not
  • Missing the HALF_OPEN state — proposing binary OPEN/CLOSED only
  • No fallback hierarchy — circuit OPEN → 100% failure rather than degraded service

Strong Answer Indicators

  • Designs Redis-backed cluster-wide circuit breaker with all three states
  • Uses atomic Redis operations for the single-probe HALF_OPEN transition
  • Distinguishes ThrottlingException/ServiceUnavailableException from ValidationException/AccessDeniedException for error counting
  • Uses time-window failure rate (not cumulative count) for the opening threshold
  • Designs a tiered fallback: Haiku → cached recommendations → static response

Scenario 2: us-east-1 Outage — No Cross-Region Fallback

Opening Question

Q: Bedrock in us-east-1 experiences a 45-minute service disruption. MangaAssist is completely down for the entire duration. No requests to Bedrock succeed. What architectural capability was missing, how do you implement multi-region resilience, and what is the SLA target for RTO?

Model Answer

The missing capability is a cross-region fallback mechanism. MangaAssist was designed single-region: all Bedrock calls routed exclusively to us-east-1, so a full us-east-1 Bedrock outage left zero fallback path. The outage lasted exactly as long as AWS took to restore the service; MangaAssist had no agency to reduce its own recovery time. The correct architecture: AWS Bedrock Cross-Region Inference Profiles, which are ARN-based model identifiers that AWS routes internally to multiple regional endpoints. The MangaAssist MultiRegionBedrockClient maintains a list of regions in priority order: ["us-east-1", "us-west-2", "eu-west-1"]. Each region has an _unhealthy_until: Dict[str, float] timestamp. When a region returns ServiceUnavailableException or ThrottlingException, its _unhealthy_until is set to current_time + REGION_COOLDOWN_SECONDS, and all subsequent requests skip that region until the cooldown expires. RTO target: < 5 minutes (time to detect the failure, propagate _unhealthy_until to all ECS tasks, and begin routing to the fallback region).
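
A sketch of the priority-ordered region selection with per-region cooldown, as described above; this is the in-process version (Follow-up 2 moves the cooldown state into Redis so it is cluster-wide), and all names are illustrative:

    import time
    from typing import Dict, List, Optional

    REGION_COOLDOWN_SECONDS = 300

    class MultiRegionBedrockClient:
        def __init__(self, regions: List[str]):
            self.regions = regions                       # priority order
            self._unhealthy_until: Dict[str, float] = {}

        def pick_region(self) -> Optional[str]:
            now = time.time()
            for region in self.regions:
                if self._unhealthy_until.get(region, 0) <= now:
                    return region        # first healthy region by priority
            return None                  # all regions cooling down

        def mark_unhealthy(self, region: str) -> None:
            self._unhealthy_until[region] = time.time() + REGION_COOLDOWN_SECONDS

    # client = MultiRegionBedrockClient(["us-east-1", "us-west-2", "eu-west-1"])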

Follow-up 1: Cross-Region Inference Profile vs. direct regional endpoint

Q: What is a Cross-Region Inference Profile and how does it differ from manually calling a different regional Bedrock endpoint? A: A Cross-Region Inference Profile is a Bedrock-native resource. Instead of the foundation-model ID anthropic.claude-3-sonnet-20240229-v1:0 (ARN arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0), you invoke the geography-prefixed profile ID us.anthropic.claude-3-sonnet-20240229-v1:0 (profile ARNs include the account ID: arn:aws:bedrock:us-east-1:<account-id>:inference-profile/us.anthropic.claude-3-sonnet-20240229-v1:0). When you invoke a Cross-Region Inference Profile, AWS Bedrock handles regional routing internally: if us-east-1 is degraded, Bedrock automatically tries us-west-2. You do not write failover logic; you get it from the service. The MultiRegionBedrockClient approach (maintaining your own regional endpoint list plus health checks) is the alternative for cases where you need explicit control over routing decisions, want visibility into which region served a request, or need latency-aware routing (e.g., always prefer ap-northeast-1 for JP users). Trade-off: the Cross-Region Inference Profile is simpler, zero custom code, with AWS handling failover timing; the manual multi-region client gives more control, an explicit model_id per region, and custom routing logic, but is more code to maintain and test. For MangaAssist: use the Cross-Region Inference Profile as the first layer, with MultiRegionBedrockClient as an additional application-level fallback for latency-aware routing.
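
A minimal sketch of the invocation switch, assuming boto3 and the Anthropic messages body format; the exact profile IDs available vary by account and region, so treat the IDs below as illustrative:

    import json
    import boto3

    bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": "Recommend a manga."}],
    })

    # Direct regional model ID (single-region, what MangaAssist had):
    #   modelId="anthropic.claude-3-sonnet-20240229-v1:0"
    # Cross-region inference profile ID: the "us." prefix tells Bedrock to
    # route across the US member regions on the caller's behalf.
    response = bedrock.invoke_model(
        modelId="us.anthropic.claude-3-sonnet-20240229-v1:0",
        body=body,
    )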

Follow-up 2: Propagating regional health state across ECS tasks

Q: The _unhealthy_until dict is in-process memory per ECS task. If Task A detects us-east-1 is down, the other tasks keep sending requests to us-east-1. How do you fix this? A: Same answer as the circuit breaker: Redis as the shared state store. Replace the in-process _unhealthy_until dict with Redis keys: region_health:us-east-1 → unhealthy_until_unix_ts. When Task A marks us-east-1 unhealthy, it writes to Redis; the other tasks read Redis before routing, so within the next request (< 10ms) all tasks route away from us-east-1. Set the Redis TTL on the health key to REGION_COOLDOWN_SECONDS (e.g., 300 seconds) so the key expires automatically and no explicit cleanup is needed. Also emit a BedrockRegionalHealthCheck CloudWatch metric from the MultiRegionBedrockClient each time a region is marked unhealthy, with an alarm: if us-east-1 is unhealthy for > 5 minutes, page on-call. The on-call can then verify via the AWS Health Dashboard and start the incident response process, rather than discovering the outage through CSAT degradation.
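
A sketch of the Redis-backed replacement, where the key's TTL doubles as the cooldown; key names are illustrative:

    import redis
    from typing import List, Optional

    REGION_COOLDOWN_SECONDS = 300

    def mark_region_unhealthy(r: redis.Redis, region: str) -> None:
        # The key's existence is the unhealthy flag; Redis expires it
        # automatically after the cooldown, so no cleanup pass is needed.
        r.setex(f"region_health:{region}", REGION_COOLDOWN_SECONDS, "1")

    def pick_region(r: redis.Redis, regions: List[str]) -> Optional[str]:
        for region in regions:  # priority order
            if not r.exists(f"region_health:{region}"):
                return region   # first region with no unhealthy flag
        return None             # all regions flagged: trigger local fallback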

Follow-up 3: Latency implications of cross-region Bedrock calls

Q: Routing to us-west-2 adds 70ms cross-region RTT. Routing to eu-west-1 adds 200ms. How do you ensure the 3-second SLA is still met during cross-region fallback? A: Three approaches: (1) AP user preference: MangaAssist users in Japan should prefer ap-northeast-1 as a fallback over us-west-2 (RTT from Tokyo to ap-northeast-1 is ~10ms vs. 150ms to us-west-2). Maintain a user-to-preferred-region map based on detected timezone or browser locale. (2) SLA adjustment during fallback: during regional fallback, widen the SLA window from 3s to 4.5s and communicate this transparently via a UI banner: "We're experiencing a brief service disruption — responses may be slightly slower." This is acceptable during a regional outage. (3) Token budget reduction during fallback: reduce max_tokens from 1,024 to 512 during cross-region fallback to reduce inference latency in the fallback region, keeping the total response time within 3.5–4s. All three approaches require detecting "currently in fallback mode" — emit a BedrockFallbackRegion CloudWatch metric to confirm when the system is operating in fallback mode so these compensating controls can be conditionally activated.

Follow-up 4: Testing multi-region failover without a real regional outage

Q: How do you validate the cross-region failover path works before a real outage tests it? A: Scheduled quarterly failover drills using AWS Fault Injection Simulator (FIS): Create an FIS experiment that blocks outbound HTTPS traffic to the us-east-1 Bedrock endpoint (via VPC endpoint policy or security group rule modification) for 5 minutes in staging. Assert: (1) within 30 seconds, all ECS tasks have stopped sending requests to us-east-1 (CloudWatch: BedrockInvocations_by_region{region=us-east-1} = 0); (2) within 30 seconds, us-west-2 Bedrock invocations increase (BedrockInvocations_by_region{region=us-west-2} > 0); (3) end-to-end chatbot response rate remains > 95% during the drill; (4) p99 latency < 4.5s in fallback mode. If any of these assertions fail, the failover mechanism is broken — fix before the next quarterly drill. Also run a monthly "canary in fallback" test: deliberately route 1% of production traffic to us-west-2 and confirm latency and quality metrics are within acceptable bounds, so the fallback region is always warm and validated.

Grill 1: "Cross-region failover means data leaves the primary region — compliance concern"

Q: The legal team raises a concern: "Routing to eu-west-1 means user data might be processed in the EU, which could violate Japanese data residency requirements." How do you respond? A: This is a valid compliance concern that must be respected, not engineered around. The fallback region list must be filtered to regions that are compliant with MangaAssist's data residency requirements. For Japanese users, the compliant cross-region options are: us-east-1 (primary), us-west-2 (secondary — US processing, acceptable if ToS and privacy policy disclose it), ap-northeast-1 (Tokyo — preferred secondary for data residency). Never route to eu-west-1 if EU data processing is restricted by privacy policy or regulatory requirement. The MultiRegionBedrockClient region list is not a generic list of Bedrock regions — it is reviewed by legal and compliance before deployment, stored in AppConfig under a compliance_approved_regions key, and requires legal sign-off to modify. Additionally: user messages that are PII-adjacent (contain names, addresses) may need special handling — consider anonymization before cross-region routing for sensitive fields. This is a compliance-architecture junction that requires both engineering and legal input.

Grill 2: The fallback region is also throttling during the outage

Q: us-east-1 is down AND us-west-2 is throttling because everyone else is also failing over there. Both regions are degraded. What now? A: This is the "thundering herd on the failover region" scenario — predictable when a major US region experiences a widespread AWS incident. Three mitigations: (1) Three-region spread: always maintain three fallback regions (us-east-1, us-west-2, ap-northeast-1). During a US regional event, ap-northeast-1 may be unaffected. Route MangaAssist's Japanese users (majority of user base) to ap-northeast-1, which is geographically local and likely unaffected by a US event. (2) Bedrock Cross-Region Inference Profile: AWS manages the routing across the profile's member regions — they are better positioned to handle thundering herd via internal load balancing than application-layer retry storms. (3) Local cache + graceful degradation: if all three regions are degraded, serve from the recommendation cache (Redis, 1-hour TTL) for recommendation requests and return static fallback responses for intent classification. The chatbot remains partially functional using pre-computed outputs. Document the "all-regions-degraded" playbook so the on-call knows the expected behavior and can communicate it to customer support.

Red Flags — Weak Answer Indicators

  • Not mentioning Bedrock Cross-Region Inference Profiles as the AWS-native solution
  • Local in-process _unhealthy_until dict without Redis propagation to the cluster
  • No RTO target — treating failover recovery time as "however long it takes"
  • Not addressing data residency compliance constraints for region selection

Strong Answer Indicators

  • Correctly describes Cross-Region Inference Profile ARN format and AWS-internal routing behavior
  • Designs Redis-backed regional health state for cluster-wide propagation
  • Addresses the "all regions degraded" scenario with a three-region spread + local cache fallback
  • Filters fallback region list by compliance-approved regions, with legal sign-off required for changes

Scenario 3: Step Functions Circuit Breaker Never Resets — Stuck in OPEN State

Opening Question

Q: MangaAssist's reliability team implemented a circuit breaker using AWS Step Functions — a state machine that tracks Bedrock failure counts. After a brief Bedrock hiccup, the circuit opens and never recovers. Investigation reveals reset_timer_seconds=3000 is set in the CDK Stack instead of the intended 30 seconds. Manual restart is the only recovery path. How did this happen, how do you fix it, and how do you prevent it?

Model Answer

The immediate cause is a typo or unit confusion: 3000 seconds (50 minutes) instead of 30 seconds. Because Step Functions circuit breakers use a Wait state with a duration calculated from the config value, the machine genuinely waited 50 minutes before attempting a probe. The state machine didn't malfunction — it did exactly what it was told. The deeper cause: no validation of the reset_timer_seconds value. Any integer was accepted by the CDK stack parameter. Prevention requires three layers: (1) CDK synth validation: a validate_circuit_config() function that asserts 10 <= reset_timer_seconds <= 300 (10 seconds minimum for meaningful protection, 300 seconds maximum to ensure recovery within 5 minutes). If assertion fails, CDK synth fails — blocking deployment. (2) CloudWatch alarm on OPEN duration: if the Step Functions circuit has been in the OPEN state for > 5 minutes, page on-call immediately. The alarm fires whether the timer is correct or not — any unexpected long OPEN duration is an anomaly. (3) Recovery runbook: document the stop_execution + manual state reset + start_execution procedure so the MTTR for a stuck circuit is measured in minutes, not the time it takes an engineer to figure out Step Functions console navigation.

Follow-up 1: Why Step Functions for a circuit breaker vs. Redis

Q: Why was Step Functions chosen for a circuit breaker instead of Redis? And is it a good choice? A: Step Functions circuit breaker makes sense for asynchronous workflows — particularly for cases where the circuit state needs to span multiple Lambda functions or ECS tasks in a coordinated orchestration pattern (e.g., a multi-step async job that calls Bedrock as one step). Step Functions provides: durable state persistence (state survives Lambda restarts), built-in wait states (the OPEN→HALF_OPEN timer is a Wait state — native to the service), and visual debugging in the AWS console. The limitation: Step Functions circuit breakers are workflow-level, not real-time request-path. Checking Step Functions state on every synchronous chatbot request would require a describe_execution API call — ~50ms overhead and subject to Step Functions API throttle limits. The correct split: Redis-backed circuit breaker for the synchronous request path (< 1ms state check, handles 1M checks/day easily). Step Functions circuit breaker for asynchronous batch workflows (model retraining pipelines, async recommendation generation). MangaAssist should use both, each for the appropriate workflow type.

Follow-up 2: Adding the validation function to CDK synth

Q: Show how the CDK synth validation for reset_timer_seconds would work. A: In the CDK Stack Python file, before creating the Step Functions state machine: define def validate_circuit_config(reset_timer_seconds: int) -> None. Inside the function: assert 10 <= reset_timer_seconds <= 300, f"reset_timer_seconds must be between 10 and 300, got {reset_timer_seconds}". Call validate_circuit_config(config["reset_timer_seconds"]) before passing to the state machine definition. Since this code runs at cdk synth time (not runtime), a failing assertion raises during synthesis and blocks the CloudFormation template generation. The CI/CD pipeline's cdk synth step fails with the validation message. The engineer sees: AssertionError: reset_timer_seconds must be between 10 and 300, got 3000 in the CI output and fixes the value before the PR can be merged. Additionally: add a unit test that calls validate_circuit_config(3000) and asserts it raises AssertionError — prevents the validator itself from being accidentally removed. The constraint range [10, 300] is documented in a comment explaining the reasoning.
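
A sketch of the validator and its guard test, matching the prose above; the config plumbing is illustrative:

    def validate_circuit_config(reset_timer_seconds: int) -> None:
        # [10, 300]: >= 10s so the breaker gives meaningful protection,
        # <= 300s so a stuck-OPEN circuit recovers within 5 minutes.
        assert 10 <= reset_timer_seconds <= 300, (
            f"reset_timer_seconds must be between 10 and 300, "
            f"got {reset_timer_seconds}"
        )

    # In the CDK stack, before building the state machine definition:
    #   validate_circuit_config(config["reset_timer_seconds"])
    # A failing assert aborts `cdk synth`, so no template is produced.

    def test_validator_rejects_incident_value():
        import pytest
        with pytest.raises(AssertionError):
            validate_circuit_config(3000)  # the incident value must fail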

Follow-up 3: Designing the OPEN-duration alarm

Q: Walk me through the CloudWatch alarm for the circuit being stuck OPEN. A: Step Functions emits execution-level metrics, but there is no built-in "time spent in a named state" metric, so the alarm needs a small amount of glue. One straightforward implementation: (1) when the circuit transitions to OPEN, emit a custom CloudWatch metric BedrockCircuitOpenEvent with a count of 1; (2) run a 1-minute scheduled watcher (an EventBridge schedule invoking a small Lambda) that inspects the running execution's most recent state via get_execution_history and emits a CircuitIsOpen gauge: 1 while the execution sits in the OPEN wait state, 0 otherwise; (3) configure the CircuitOpenTooLong alarm to fire when CircuitIsOpen is 1 for 5 consecutive 1-minute datapoints. An EventBridge rule on execution status changes can supplement this for faster detection, but CloudWatch alarms have no start/reset timer primitive, so the polled gauge maps cleanly onto alarm evaluation periods. In the alarm description, include the remediation link: "Run: aws stepfunctions stop-execution --execution-arn ...." The runbook link in the alarm keeps the MTTR predictably short.
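
A sketch of the 1-minute watcher Lambda; the state machine ARN, the OPEN wait-state name "CircuitOpenWait", and the metric names are illustrative assumptions:

    import boto3

    sfn = boto3.client("stepfunctions")
    cw = boto3.client("cloudwatch")
    SM_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:circuit-breaker"

    def handler(event, context):
        running = sfn.list_executions(
            stateMachineArn=SM_ARN, statusFilter="RUNNING", maxResults=1
        )["executions"]
        is_open = 0
        if running:
            # Newest events first: find the most recently entered state.
            history = sfn.get_execution_history(
                executionArn=running[0]["executionArn"],
                reverseOrder=True, maxResults=50,
            )
            for ev in history["events"]:
                entered = ev.get("stateEnteredEventDetails")
                if entered:
                    is_open = int(entered["name"] == "CircuitOpenWait")
                    break
        cw.put_metric_data(
            Namespace="MangaAssist/CircuitBreaker",
            MetricData=[{"MetricName": "CircuitIsOpen", "Value": is_open}],
        )
        # Alarm: CircuitIsOpen == 1 for 5 consecutive 1-minute datapoints
        # fires CircuitOpenTooLong and pages on-call with the runbook link.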

Follow-up 4: What does the manual recovery procedure look like?

Q: The alert fires — the circuit has been OPEN for 10 minutes and the timer won't expire for 40 more minutes. What is the exact recovery procedure? A: Documented in the runbook as a 5-step procedure: (1) Identify the stuck execution ARN: aws stepfunctions list-executions --state-machine-arn <circuit-breaker-sm-arn> --status-filter RUNNING. (2) Verify the failure: confirm Bedrock is actually healthy now (check AWS Health Dashboard + aws bedrock-runtime invoke-model smoke test). Only proceed if Bedrock is healthy. (3) Stop the stuck execution: aws stepfunctions stop-execution --execution-arn <arn> --error "ManualRecovery" --cause "CircuitStuck-ManualReset". (4) Start a new execution in HALF_OPEN state: aws stepfunctions start-execution --state-machine-arn <arn> --input '{"state": "HALF_OPEN", "failure_count": 0}' — the state machine accepts an initial state, allowing it to skip directly to the probe. (5) Monitor the HALF_OPEN probe: watch X-Ray for the first Bedrock call post-reset. If it succeeds, the circuit moves to CLOSED. If it fails again, re-evaluate Bedrock status. Total procedure time with a practiced engineer: 3–5 minutes. Without a runbook: 20–45 minutes of console exploration.

Grill 1: "The circuit breaker should auto-heal — why do we need a runbook?"

Q: The reliability team says: "A properly configured circuit breaker doesn't need a runbook; if the timer is correct, it self-heals. The runbook is covering for a configuration bug." How do you respond? A: The team is correct that a properly configured circuit breaker self-heals, and the validation function prevents the 3000 vs. 30 misconfiguration. But the runbook remains necessary for three reasons: (1) Unknown unknowns: a Step Functions execution can get stuck for reasons beyond timer misconfiguration, such as AWS service disruptions, Lambda timeouts during the probe, or IAM permission revocations. No validation function covers every runtime failure mode. (2) Recovery knowledge decay: the on-call engineer at 3 AM during a production incident should not have to reason from first principles about Step Functions execution management. The runbook encodes the knowledge of the engineer who designed the system. (3) Mean Time to Recovery (MTTR) is an SLA metric: the runbook's job is to reduce MTTR. "Self-healing systems don't need runbooks" is an argument that only holds when the system never fails in unexpected ways, and reliability engineering exists precisely because systems do fail in unexpected ways. The runbook is not a workaround; it is a resilience artifact.

Grill 2: What if the Step Functions circuit state machine itself fails to start?

Q: The Step Functions circuit breaker state machine is in error and can't be started. The circuit is now neither OPEN nor CLOSED; its state is unknown. How does the application behave? A: The application must apply a deliberate default when circuit state is unknown: fail open. In the request path, when the circuit breaker check returns an error (Step Functions unreachable or the state machine in ERROR), the application treats the circuit as CLOSED (pass-through to Bedrock) and emits an alarm flagging the unknown circuit state. Reasoning: if the circuit state machine itself is down, the application is already in a degraded state. Blocking all Bedrock traffic (treating unknown as OPEN) causes unnecessary user impact on top of the circuit breaker failure, while passing traffic to Bedrock (treating unknown as CLOSED) lets the application function while the circuit breaker is investigated. Mandatory safeguard: emit a CircuitBreakerUnavailable CloudWatch metric and alert on it separately from the circuit breaker's normal operation. This ensures the on-call knows the circuit breaker is non-functional and Bedrock is unprotected, without causing unnecessary user impact from a double failure.

Red Flags — Weak Answer Indicators

  • Treating the 3000 vs. 30 issue as a "simple typo" without proposing systemic prevention
  • No CDK synth-time validation — relying on code review to catch numeric values
  • No OPEN-duration alarm — only noticing stuck circuits via user complaints
  • Not addressing the Step Functions vs. Redis choice for synchronous vs. asynchronous paths

Strong Answer Indicators

  • Implements CDK synth assertion with specific valid range [10, 300] and unit test for the validator
  • Creates an EventBridge-based OPEN-duration alarm with a 5-minute threshold
  • Distinguishes Step Functions (async orchestration) from Redis (synchronous request path) as complementary tools
  • Writes a 5-step recovery runbook with specific CLI commands for the stuck-execution scenario

Scenario 4: Cross-Region Routing Adds 800ms for Japanese Users

Opening Question

Q: MangaAssist routes all Japanese users to us-east-1 Bedrock in a failover scenario. The RTT from ap-northeast-1 to us-east-1 adds 800ms to each Bedrock call, pushing total latency to 3.4 seconds and exceeding the 3-second SLA. Your fallback solved availability but broke the latency SLA. How do you design a latency-aware cross-region routing strategy?

Model Answer

The root cause is a static failover order designed purely for availability, ignoring the latency impact on geographically distant users. Japanese users have ~10ms RTT to ap-northeast-1 and ~150ms RTT to us-east-1; the 800ms figure is not pure network RTT but the total added latency of the cross-region path (RTT plus slower, more heavily loaded inference in the failover region). Solution: a LatencyAwareBedrockRouter that probes each available region periodically and routes requests on measured latency rather than a static preference list. Each probe result is cached in Redis for PROBE_CACHE_TTL_SECONDS=30 seconds. The router selects the region with the lowest recent latency that is also within the INFERENCE_LATENCY_BUDGET_MS=1800 millisecond threshold. For Japanese users: ap-northeast-1 might show ~800ms (inference) + ~10ms (RTT) = ~810ms, while us-east-1 shows ~600ms (inference) + ~150ms (RTT) = ~750ms; the router selects the lower measured latency, regardless of geography. If all regions exceed the budget, route to the one with the lowest latency (best-of-worst fallback) and alert.
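
A sketch of the selection logic reading the Redis probe cache; constants and key names mirror the prose:

    import redis
    from typing import List, Optional, Tuple

    INFERENCE_LATENCY_BUDGET_MS = 1800
    STATIC_PREFERENCE = ["ap-northeast-1", "us-east-1"]

    def choose_region(r: redis.Redis,
                      regions: List[str] = STATIC_PREFERENCE
                      ) -> Tuple[Optional[str], bool]:
        """Return (region, degraded_mode)."""
        measured = []
        for region in regions:
            raw = r.get(f"probe_latency:{region}")
            if raw is not None:
                measured.append((float(raw), region))
        if not measured:
            # No probe data (cold start, Redis flush): static preference.
            return regions[0], False
        latency_ms, best = min(measured)
        if latency_ms <= INFERENCE_LATENCY_BUDGET_MS:
            return best, False
        # All probed regions over budget: best-of-worst, flag degraded mode.
        return best, True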

Follow-up 1: How the latency probe works without continuous overhead

Q: The router probes regions every 30 seconds. Describe the probe mechanism and its overhead. A: Probe mechanism: a background task (or a Lambda on a 30-second EventBridge schedule) sends a minimal synthetic Bedrock request to each candidate region (a short "Hello" prompt, ~20 tokens) and measures the time from request to first token received. It records the probe latency in Redis: probe_latency:us-east-1 = 620ms, probe_latency:ap-northeast-1 = 790ms, with an expiry of PROBE_CACHE_TTL_SECONDS=30. Overhead: 2 regions × 1 probe/30 seconds = 4 calls/minute; at $0.25 per 1M input tokens, that is 20 tokens × 4 calls/minute × 1,440 minutes/day ≈ 115,000 tokens/day ≈ $0.03/day, essentially free. The main risk is the probe adding to Bedrock's throttle count. Mitigate by running probes under a dedicated bedrock-probe-role IAM identity with a strict client-side rate cap; note that Bedrock service quotas are set per account per region, not per IAM role, so hard quota isolation requires either a separate probing account or disciplined client-side throttling of the probe scheduler. At low traffic, a 30-second probe expiry is fine; at very high traffic, reduce to 15 seconds for fresher data. If Redis has no probe data (first startup, Redis flush), default to the static preference order until probes populate the cache.
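
A sketch of the probe body; measuring time-to-first-token requires the streaming API, so for brevity this measures the full invoke round-trip. IDs and key names are illustrative:

    import json
    import time
    import boto3
    import redis

    PROBE_CACHE_TTL_SECONDS = 30
    PROBE_BODY = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 5,
        "messages": [{"role": "user", "content": "Hello"}],
    })

    def probe_region(r: redis.Redis, region: str) -> None:
        client = boto3.client("bedrock-runtime", region_name=region)
        start = time.monotonic()
        client.invoke_model(
            modelId="anthropic.claude-3-sonnet-20240229-v1:0",
            body=PROBE_BODY,
        )
        elapsed_ms = (time.monotonic() - start) * 1000
        # The 30-second expiry keeps routing decisions based on fresh data.
        r.setex(f"probe_latency:{region}", PROBE_CACHE_TTL_SECONDS,
                f"{elapsed_ms:.0f}")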

Follow-up 2: What happens when both regions exceed the latency budget?

Q: Both ap-northeast-1 and us-east-1 probes show > 1800ms. What does the router do? A: Best-of-worst routing plus degradation mode: (1) Route to the region with lower measured latency (even if both exceed the budget). Add a degraded_mode flag to the Bedrock call context. (2) When degraded_mode = true, the system activates two compensating controls: reduce max_tokens from 1024 to 512 (halves inference time, reduces total latency by ~400ms in most cases), and skip non-critical RAG injection (only include the top-1 retrieved chunk instead of top-3, saving assembly time). (3) Emit BedrockAllRegionsOverBudget metric with count=1. Alarm immediately — this is a production reliability incident. (4) Surface the degraded-mode indicator in the API response headers (X-Bedrock-Mode: degraded) so the client can optionally display a slower-than-normal indicator to users. The goal is to serve a slightly reduced-quality but timely response rather than hold the connection open for 3.4 seconds.

Follow-up 3: How do you validate the latency-aware router in testing?

Q: The latency-aware router makes routing decisions based on Redis probe cache values. How do you test it without real cross-region latency? A: In tests, inject mock Redis probe cache values: redis.set("probe_latency:us-east-1", "620"), redis.set("probe_latency:ap-northeast-1", "2100"). Assert the router selects us-east-1. Then inject: redis.set("probe_latency:us-east-1", "2200"), redis.set("probe_latency:ap-northeast-1", "1700"). Assert the router selects ap-northeast-1. Test the "all over budget" case: set both values > 1800. Assert the router emits BedrockAllRegionsOverBudget, activates degraded_mode, and routes to the lower-latency region. Test the "no probe data" case: flush Redis. Assert the router falls back to static preference order ["ap-northeast-1", "us-east-1"] and logs a warning. These unit tests cover the four distinct routing outcomes without any network calls. Add an integration test in staging that deliberately delays the ap-northeast-1 probe endpoint using a network proxy and asserts the router dynamically shifts to us-east-1 within one probe cycle (30 seconds).
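
A sketch of the routing-outcome tests described above, using fakeredis and the choose_region sketch from the model answer:

    import fakeredis

    def test_picks_lower_latency_region():
        r = fakeredis.FakeRedis()
        r.set("probe_latency:us-east-1", "620")
        r.set("probe_latency:ap-northeast-1", "2100")
        assert choose_region(r) == ("us-east-1", False)

    def test_prefers_tokyo_when_it_is_faster():
        r = fakeredis.FakeRedis()
        r.set("probe_latency:us-east-1", "2200")
        r.set("probe_latency:ap-northeast-1", "1700")
        assert choose_region(r) == ("ap-northeast-1", False)

    def test_all_over_budget_sets_degraded_mode():
        r = fakeredis.FakeRedis()
        r.set("probe_latency:us-east-1", "2200")
        r.set("probe_latency:ap-northeast-1", "1900")
        assert choose_region(r) == ("ap-northeast-1", True)

    def test_no_probe_data_uses_static_order():
        r = fakeredis.FakeRedis()
        assert choose_region(r) == ("ap-northeast-1", False)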

Follow-up 4: How does the 1800ms budget relate to the 3-second end-to-end SLA?

Q: Why 1800ms as the inference budget specifically? How does that relate to the overall SLA? A: Time budget decomposition for a 3-second SLA: (1) WebSocket + API Gateway overhead: ~100ms. (2) Intent classification (Haiku, fast path): ~300ms. (3) RAG retrieval from OpenSearch Serverless: ~200ms. (4) Bedrock inference (Sonnet): budget = 1800ms. (5) Response serialization + WebSocket push: ~100ms. Total: 100 + 300 + 200 + 1800 + 100 = 2,500ms with 500ms buffer. The 1800ms is the largest single component and the most variable — it depends on model, region, token count, and Bedrock load. The 500ms buffer absorbs occasional slow responses without SLA breaches. Monitor InferenceDuration_p99 separately from end-to-end ResponseDuration_p99 — if the inference component is approaching 1800ms, alarm before it breaks the SLA, not after. The INFERENCE_LATENCY_BUDGET_MS constant should be reviewed quarterly: if Bedrock's regional inference speeds improve (faster hardware generations), the budget can be reallocated to richer prompts or more RAG chunks.

Grill 1: "Just use ap-northeast-1 as the primary and never fall back to us-east-1 for Japanese users"

Q: A product engineer proposes: "Japanese users always go to ap-northeast-1, no fallback to us-east-1. This avoids latency entirely." Why is this insufficient? A: It eliminates the latency risk by giving up availability protection entirely. If ap-northeast-1 Bedrock experiences an outage, and AWS regional incidents do happen, Japanese users have zero fallback and the chatbot is completely down. That is the exact scenario we are trying to avoid. The correct architecture is not "never use us-east-1"; it is "prefer ap-northeast-1 when it is healthy and meets the latency budget, and fall back to us-east-1 with latency-aware monitoring and SLA adjustment when ap-northeast-1 is unhealthy." The latency of the fallback path is known and within manageable bounds (3.4s vs. the 3.0s SLA, a marginal breach that can be mitigated with token reduction). A complete outage for Japanese users is not manageable. The choice is: an occasional 400ms SLA breach under fallback conditions vs. a complete outage whenever ap-northeast-1 has any issue. The former is correct for a production system with a reliability SLA.

Grill 2: The latency probe adds Bedrock invocations to the quota — could it trigger throttling?

Q: At high traffic, the probe's Bedrock calls compete with real user requests for Bedrock quota. Could the probe itself cause user throttling? A: This is a valid concern, addressed by isolating probe traffic from user traffic. Bedrock service quotas are set per account per region (via Service Quotas), not per IAM role, so the isolation takes two practical forms: (1) a strict client-side rate cap on the probe scheduler (10 requests/minute is ample for probes), enforced under a dedicated bedrock-probe-role so probe volume is bounded and auditable separately from bedrock-production-role traffic; (2) for hard isolation, run probes from a separate AWS account, which draws on its own account-level quota. The production role is never assumed by probe logic and vice versa. This segregation is standard practice for synthetic monitoring that exercises production APIs. Add a guardrail in the probe scheduler: if the probe itself receives a ThrottlingException, log it, emit a ProbeFailed metric, and use the last cached probe value; never retry the probe with the production role as a fallback.

Red Flags — Weak Answer Indicators

  • Static failover order with no latency measurement
  • Missing the INFERENCE_LATENCY_BUDGET_MS concept — treating cross-region routing as binary (healthy/unhealthy) rather than latency-aware
  • No handling of the "both regions over budget" scenario
  • Not isolating probe traffic from the production quota

Strong Answer Indicators

  • Designs LatencyAwareBedrockRouter with Redis-cached probe results and a 30-second TTL
  • Sets INFERENCE_LATENCY_BUDGET_MS=1800 derived from a complete end-to-end time budget decomposition
  • Activates degraded_mode (reduced tokens, fewer RAG chunks) when both regions exceed the budget
  • Isolates probe traffic from production traffic (dedicated probe role with a client-side rate cap, or a separate account) to prevent probe-induced throttling

Scenario 5: Stale Degradation Cache — 72-Hour TTL

Opening Question

Q: MangaAssist serves cached recommendations from Redis as a graceful degradation fallback when Bedrock is unavailable. The TTL was set to 72 hours. During a 2-hour planned Bedrock maintenance window, users receive manga recommendations that are 3 days old — some titles are out of stock, some promotions expired. How does this happen, what is the correct TTL strategy, and how do you embed freshness metadata?

Model Answer

The 72-hour TTL was set with the assumption that "stale is better than nothing" without quantifying what "staleness" means for MangaAssist's inventory. At 72 hours, the cache holds data that is up to 3 days old. For a manga store with daily inventory updates, high-volume title promotions (24–48 hour campaigns), and stock-level changes, 3-day-old recommendations can contain out-of-stock items and expired deals — actively harmful recommendations rather than merely "slightly outdated" ones. The correct TTL is DEGRADATION_CACHE_TTL_SECONDS=3600 (1 hour). Business reasoning: MangaAssist's product team updates promotions no more frequently than once per hour. Inventory levels change continuously but the cache is serving recommendations, not real-time stock data — the UI should always do a real-time stock check before displaying a "buy now" button. The cache provides a recommendation baseline that is at most 1 hour old — acceptable for a maintenance window. Additionally, embed cached_at_utc in every cache entry so the retrieval code can check freshness at read time: warn at > 2 hours, hard-reject at > 4 hours (return empty rather than arbitrarily stale data).

Follow-up 1: Embedding freshness metadata in cache entries

Q: How do you embed freshness metadata and what does the read-time freshness check look like? A: Cache entry structure: store the full recommendation list as a JSON object with a top-level cached_at_utc field: { "cached_at_utc": "2026-04-01T10:00:00Z", "recommendations": [...] }. At read time in get_degradation_cache(user_id, session_id): parse the object, compute age_seconds = (datetime.utcnow() - cached_at).total_seconds(). Three conditions: (1) age_seconds <= WARN_AGE_SECONDS (7200, 2 hours): return recommendations normally, no warning. (2) WARN_AGE_SECONDS < age_seconds <= MAX_AGE_SECONDS (14400, 4 hours): return recommendations with a stale_data_warning log at WARNING level and emit CacheAgeExceededWarnThreshold CloudWatch metric. Also inject a user-visible note into the recommendation list: "These are our recent popular picks." (3) age_seconds > MAX_AGE_SECONDS: do not return the cache entry. Return empty, triggering the next fallback in the chain (static bestseller list). Emit CacheAgeExceededMaxThreshold metric — this should almost never fire if the TTL is set correctly.
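
A sketch of the write and read paths with the embedded freshness metadata; thresholds match the prose, and key names are illustrative:

    import json
    from datetime import datetime, timezone
    import redis

    TTL_SECONDS = 3600          # DEGRADATION_CACHE_TTL_SECONDS
    WARN_AGE_SECONDS = 7200     # 2 hours
    MAX_AGE_SECONDS = 14400     # 4 hours

    def write_degradation_cache(r: redis.Redis, user_id: str,
                                recommendations: list) -> None:
        entry = {
            "cached_at_utc": datetime.now(timezone.utc).isoformat(),
            "recommendations": recommendations,
        }
        r.setex(f"degradation_cache:{user_id}", TTL_SECONDS,
                json.dumps(entry))

    def get_degradation_cache(r: redis.Redis, user_id: str) -> list:
        raw = r.get(f"degradation_cache:{user_id}")
        if raw is None:
            return []                   # next fallback in the chain
        entry = json.loads(raw)
        cached_at = datetime.fromisoformat(entry["cached_at_utc"])
        age = (datetime.now(timezone.utc) - cached_at).total_seconds()
        if age > MAX_AGE_SECONDS:
            return []                   # hard-reject: emit max-age metric
        if age > WARN_AGE_SECONDS:
            pass                        # log WARNING + emit warn metric here
        return entry["recommendations"]

The read-time age checks look redundant next to the 1-hour Redis TTL, but they matter whenever the TTL is extended, as in the maintenance-mode case in Grill 2.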

Follow-up 2: Cache population strategy — when and how are cache entries written?

Q: When are degradation cache entries written: on every successful recommendation, or proactively? A: Two population strategies, and both should be used: (1) Lazy population (write-through): every successful Bedrock recommendation response also writes to the degradation cache with the current UTC timestamp. This requires zero additional infrastructure; the cache populates naturally as users receive recommendations, and entries reflect actual user-specific recommendations rather than generic ones. (2) Proactive warm-up: a separate scheduled Lambda runs every 50 minutes (just ahead of the 1-hour TTL for popular user segments) and generates fresh recommendations for the top 1,000 user IDs by activity, writing them to the cache with a fresh timestamp. This ensures the most active users always have a recent cache entry available during a degradation event. In combination: active users always have entries from the proactive warm-up, while new or infrequent users fall back to the lazy-write entries or, if unavailable, to the static bestseller fallback. The 50-minute proactive refresh interval against a 60-minute TTL leaves a constant 10-minute freshness margin.

Follow-up 3: How does the recommendation cache interact with inventory real-time checks?

Q: If the cache returns a recommendation for a manga that is currently out of stock, how does the user experience degrade? A: The recommendation cache stores title IDs, not stock data. The UI layer is responsible for real-time stock validation on any recommendation before displaying a "buy" button. The cache serves the recommendation decision; the product catalog API (DynamoDB) serves the real-time stock status. During Bedrock degradation mode: the recommendation cache returns title IDs → the UI calls GET /products/{id}/availability for each recommended title → renders only in-stock items with a "buy" button. If all recommended titles are out of stock, the UI falls back to "Popular in your genres" (static category-based list from a DynamoDB query). The recommendation cache failure mode is thus: "shows slightly dated title suggestions but never shows a buyable out-of-stock item." This requires the UI/frontend team to implement the catalog check — the backend serves recommendations and the frontend enforces stock integrity. Document this contract in the API specification.

Follow-up 4: Monitoring cache age in production

Q: How do you confirm that the degradation cache is always "fresh enough" before a degradation event occurs? A: Proactive freshness monitoring: (1) A CloudWatch metric DegradationCacheAge_p95 emitted hourly by the cache warm-up Lambda. At each warm-up run, calculate the age of the oldest cache entry in the top-1,000 active user set. Alert if p95_age > 55 minutes (approaching the 60-minute TTL threshold). (2) CacheEntryCount_in_cache_vs_active_users: the warm-up Lambda knows it processed 1,000 users — confirm that 1,000 entries exist in Redis with age < 60 minutes. If the count is significantly lower (e.g., 600), entries are being evicted by memory pressure or the warm-up Lambda is failing. Alert. (3) WarmUpLambdaSuccess — if the proactive warm-up Lambda fails (exception, timeout, IAM error), treat this as a pre-degradation alert: the cache is now degrading toward staleness and will be inadequately populated if Bedrock degrades before the next run. Page on-call to investigate the warm-up Lambda failure.

Grill 1: "72 hours was set because longer TTL reduces cache misses — 1 hour increases cost"

Q: The original engineer argues: "72-hour TTL means almost every fallback request hits the cache. 1-hour TTL means many users won't have a fresh cache entry — they'll hit the static fallback instead, which is lower quality." How do you balance this? A: The argument has merit: a longer TTL does increase cache hit rate during degradation. But the quality threshold must be evaluated: is a 72-hour-old recommendation better or worse than a static bestseller list? For MangaAssist, the static bestseller list reflects this week's top-selling titles — typically more representative of current availability than a 3-day-old personalized recommendation. The quality comparison is not "personalized but stale" vs. "generic and fresh" — it is "potentially harmful stale (out-of-stock titles) vs. reliable and current (bestsellers)." The right answer is: invest in the proactive cache warm-up to maintain high cache hit rates at shorter TTLs. A 50-minute proactive refresh for the top 1,000 users ensures those users always have a fresh 1-hour-TTL hit. For less active users, accepting the static bestseller fallback is correct — a rarely-seen user's preferences from 3 days ago are not materially better than current bestsellers. Higher engineering investment in warm-up strategy, not longer TTL, is the maintainable solution.

Grill 2: The 1-hour TTL flushes during a long maintenance window

Q: Bedrock maintenance runs for 3 hours. After 1 hour, the TTL expires and the cache is empty. Users get the static fallback for 2 hours. Is this acceptable? A: Two mitigations: (1) Pre-maintenance cache seeding: when a planned maintenance window is scheduled (announced via the AWS Health Dashboard or the engineering team's own maintenance calendar), run a forced cache warm-up for all active users (not just the top 1,000) immediately before the window begins, and write the seed entries with an extended Redis TTL that covers the window plus a margin (e.g., window duration + 1 hour). This is a manual operational step in the maintenance runbook. (2) Relax the age limit during active maintenance: an operational flag in AppConfig, maintenance_mode_active: true, switches the cache read path to the MAX_AGE_SECONDS=14400 (4 hours) limit instead of the normal freshness thresholds. When maintenance_mode_active: false is set after the window, fresh cache entries are already being repopulated by live traffic and the proactive warm-up resumes normally.

Red Flags — Weak Answer Indicators

  • Accepting "stale is better than nothing" without quantifying what staleness means for the business
  • No freshness metadata in cache entries — relying purely on Redis TTL for age tracking
  • Missing the proactive warm-up strategy — only lazy write-through resulting in gaps for rare users
  • No maintenance_mode_active consideration for planned long-duration degradation windows

Strong Answer Indicators

  • Sets DEGRADATION_CACHE_TTL_SECONDS=3600 with explicit business reasoning (inventory update frequency)
  • Embeds cached_at_utc in every cache entry with read-time warn at 2h, reject at 4h logic
  • Designs 50-minute proactive warm-up Lambda for top-1,000 active users
  • Handles planned long maintenance windows via pre-maintenance forced seed + maintenance_mode_active AppConfig flag