
Scenarios and Runbooks: Resilient FM Systems

MangaAssist context: JP Manga store chatbot on AWS — Bedrock Claude 3 (Sonnet at $3/$15 per 1M tokens input/output, Haiku at $0.25/$1.25), OpenSearch Serverless (vector store), DynamoDB (sessions/products), ECS Fargate (orchestrator), API Gateway WebSocket, ElastiCache Redis. Target: useful answer in under 3 seconds, 1M messages/day scale.


Skill Mapping

Attribute | Value
Certification | AWS Certified AI Practitioner (AIP-C01)
Domain | 2 — Implementation and Integration of Foundation Models
Task | 2.4 — Design resilient and scalable FM-based applications
Skill | 2.4.3 — Create resilient FM systems to ensure reliable operations
Focus Areas | Exponential backoff storms, rate limiter tuning, stale cache risks, X-Ray sampling gaps, graceful degradation UX

Scenario Format

Each scenario follows a standard structure:

SCENARIO: [Title]
├── SITUATION: What happened and why
├── WHAT WENT WRONG: Root cause analysis
├── DETECTION: How the issue was (or should have been) detected
├── IMPACT: Business and user impact
├── RESPONSE RUNBOOK: Step-by-step mitigation
├── RESOLUTION: How to fix permanently
├── PREVENTION: How to prevent recurrence
├── EXAM ANGLE: What the AIP-C01 exam tests from this scenario
└── KEY TAKEAWAY: One-sentence lesson

Scenario 1: Exponential Backoff Storm from Correlated Retries Across Fleet

Situation

Date: Thursday, 7:22 PM JST (peak manga browsing hour)
Trigger: Bedrock experienced a 45-second regional capacity constraint in ap-northeast-1

At 7:22 PM, Bedrock's Claude 3 Sonnet endpoint in ap-northeast-1 began returning ThrottlingException at an elevated rate. The MangaAssist ECS Fargate fleet (running 24 tasks) detected these errors and all tasks simultaneously began their exponential backoff retry loops.

Timeline:
  7:22:00 PM  Bedrock starts throttling (~30% of requests)
  7:22:01 PM  All 24 ECS tasks detect ThrottlingException
  7:22:01 PM  All 24 tasks begin retry attempt 1 (100ms delay)
  7:22:02 PM  All 24 tasks fire retry attempt 1 simultaneously
  7:22:02 PM  Bedrock throttle rate increases to 60% (overloaded by correlated retries)
  7:22:03 PM  All 24 tasks begin retry attempt 2 (200ms delay)
  7:22:04 PM  Retry storm: 24 tasks x 200ms = synchronized wave
  7:22:06 PM  All 24 tasks begin retry attempt 3 (400ms delay)
  7:22:10 PM  Retry attempt 4 (800ms delay) — Bedrock now throttling 90%
  7:22:18 PM  Retry attempt 5 (max 10s delay) — all retries exhausted
  7:22:18 PM  All 24 tasks switch to Haiku fallback simultaneously
  7:22:19 PM  Haiku now overwhelmed by 24 tasks' worth of redirected traffic
  7:22:25 PM  Haiku also starts throttling
  7:22:30 PM  Circuit breakers open across fleet — cache-only mode
  7:23:05 PM  Bedrock recovers, but circuit breakers remain open for 30s
  7:23:35 PM  Circuit breakers transition to HALF_OPEN
  7:23:36 PM  Probe requests succeed — circuit breakers close
  7:23:40 PM  Normal operation resumes

What Went Wrong

The root cause was correlated retries — a thundering herd problem. When all 24 ECS tasks encounter the same error at the same time and use the same backoff algorithm, their retries synchronize:

Without Jitter (the problem):
  Task 1:  |==|    |====|        |========|
  Task 2:  |==|    |====|        |========|
  Task 3:  |==|    |====|        |========|
  ...
  Task 24: |==|    |====|        |========|
           ↑ All fire at same instant ↑

With Decorrelated Jitter (the fix):
  Task 1:  |=|      |=======|         |==========|
  Task 2:  |===|  |====|        |===============|
  Task 3:  |==|       |=====|     |============|
  ...
  Task 24: |====|        |===|           |========|
           ↑ Spread across time window ↑

Contributing factors:

  1. All ECS tasks used JitterStrategy.NONE (exponential backoff without jitter)
  2. No retry budget was enforced across the fleet
  3. The fallback cascade triggered a secondary thundering herd on Haiku
  4. The circuit breaker recovery timeout (30s) kept the breakers open for a further 30 seconds after Bedrock had already recovered, needlessly extending the degraded window
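A minimal sketch of the decorrelated-jitter backoff pictured above, following the commonly cited formula sleep = min(cap, random(base, 3 * previous)); the function name and constants are illustrative, not MangaAssist code:

```python
import random
import time

BASE_MS = 100      # initial delay, matches the timeline's first retry
CAP_MS = 10_000    # maximum delay (the 10s cap in the timeline)

def call_with_decorrelated_jitter(attempt_fn, max_attempts=5):
    """Retry attempt_fn so that tasks across the fleet do not retry in lockstep."""
    sleep_ms = BASE_MS
    for attempt in range(max_attempts):
        try:
            return attempt_fn()
        except Exception:  # in practice, catch the SDK's ThrottlingException only
            if attempt == max_attempts - 1:
                raise
            # Decorrelated jitter: the next delay is drawn at random between the
            # base and 3x the previous delay, capped. Each task draws independently,
            # so retries spread out instead of firing in synchronized waves.
            sleep_ms = min(CAP_MS, random.uniform(BASE_MS, sleep_ms * 3))
            time.sleep(sleep_ms / 1000.0)
```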

Detection

Signal | Source | Alert | Time to Detect
Bedrock ThrottlingException spike | CloudWatch ErrorCount metric | Warning alarm at 5% error rate | 2 minutes
Retry count spike | Custom metric RetryCount | None configured | Not detected automatically
Correlated retry pattern | X-Ray trace analysis (post-incident) | None | Manual analysis after incident
Circuit breaker open | Custom metric CircuitBreakerState | Critical alarm | 30 seconds
Haiku fallback spike | Custom metric TierCount.Haiku | Warning at >20% degradation | 5 minutes

Detection gap: There was no alarm for correlated retry patterns. The retry count metric existed but no alarm was set to detect the synchronized spike pattern.

Impact

Metric | Normal | During Incident | Recovery
User-visible error rate | 0.01% | 15% | 0.5%
p99 latency | 2.8s | 12.5s (timeout) | 3.5s
Sonnet availability | 99.5% | 10% | 85%
Haiku availability | 99.5% | 40% (secondary storm) | 95%
Cache serve rate | 4% | 65% | 15%
Messages affected | 0 | ~2,400 messages | ~600
Duration | N/A | 78 seconds | 35 seconds

Business impact: During the 78-second incident, approximately 2,400 manga customer messages received degraded responses (cached or graceful fallback). 15% of users during the window saw error messages. Customer satisfaction scores for the affected window dropped from 4.2/5 to 2.8/5.

Response Runbook

RUNBOOK: Exponential Backoff Storm Mitigation
SEVERITY: P2 (High — service degradation, not outage)
ON-CALL TEAM: MangaAssist Platform Engineering

STEP 1: CONFIRM THE INCIDENT (0-2 minutes)
  ├── Check CloudWatch dashboard: MangaAssist-FM-Health
  │   └── Look for: ErrorRate spike + TierCount shift away from Sonnet
  ├── Check X-Ray service map for error segments
  │   └── Filter: fault = true AND service("Bedrock Runtime")
  ├── Check Bedrock Service Health Dashboard
  │   └── https://health.aws.amazon.com/health/status
  └── Confirm: Is this a Bedrock regional issue or MangaAssist-specific?

STEP 2: IMMEDIATE MITIGATION (2-5 minutes)
  ├── If Bedrock is the source:
  │   ├── Enable "storm brake" — temporarily disable retries fleet-wide
  │   │   └── Set environment variable: RETRY_MAX_ATTEMPTS=0
  │   │   └── ECS will pick up on next health check (30s)
  │   ├── Force all traffic to Haiku (skip Sonnet entirely)
  │   │   └── Set environment variable: PRIMARY_MODEL=haiku
  │   └── If Haiku also affected: enable cache-only mode
  │       └── Set environment variable: FM_MODE=cache_only
  ├── If MangaAssist-specific (our retry logic):
  │   ├── Scale ECS tasks to 0 temporarily, then back to 12 (half fleet)
  │   │   └── aws ecs update-service --desired-count 0
  │   │   └── Wait 10 seconds
  │   │   └── aws ecs update-service --desired-count 12
  │   └── This breaks the synchronized retry pattern
  └── Monitor: Watch ErrorRate metric for improvement

STEP 3: VERIFY RECOVERY (5-10 minutes)
  ├── Confirm Sonnet invocations returning to normal
  ├── Check circuit breaker states: all should be CLOSED
  ├── Verify cache serve rate dropping back to baseline (~4%)
  ├── Check user-facing error rate back below 0.1%
  └── Confirm p99 latency below 3s SLA

STEP 4: POST-INCIDENT (within 24 hours)
  ├── Collect X-Ray traces from the incident window
  ├── Analyze retry patterns: Were they correlated?
  ├── Document: Timeline, impact, mitigation steps
  └── Schedule: Post-mortem meeting

Resolution

  1. Enable decorrelated jitter — Switch from JitterStrategy.NONE to JitterStrategy.DECORRELATED across all ECS tasks
  2. Add fleet-wide retry budget — Use a shared Redis counter to cap total retries across the fleet at 100/minute
  3. Stagger fallback cascade — When falling back to Haiku, add a random 0-2 second delay to prevent secondary thundering herd
  4. Reduce circuit breaker recovery timeout — From 30s to 15s, since most Bedrock transient issues resolve within 10s
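A minimal sketch of the fleet-wide retry budget from item 2 above, assuming the ElastiCache Redis endpoint is reachable from every ECS task; the key naming is an assumption, while the 100/minute cap comes from the resolution:

```python
import time
import redis

r = redis.Redis(host="manga-cache.example.internal", port=6379)  # placeholder endpoint

RETRY_BUDGET_PER_MINUTE = 100  # fleet-wide cap from the resolution above

def may_retry() -> bool:
    """Return True if this task may retry; False once the fleet's minute budget is spent."""
    window = int(time.time() // 60)            # one shared counter per minute
    key = f"retry_budget:{window}"
    count = r.incr(key)                        # atomic across all 24 ECS tasks
    if count == 1:
        r.expire(key, 120)                     # old windows age out on their own
    return count <= RETRY_BUDGET_PER_MINUTE
```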

Prevention

Prevention Measure | Implementation | Priority
Decorrelated jitter by default | Update BackoffConfig default to DECORRELATED | P0
Fleet retry budget (Redis) | Add RetryBudgetManager with 100/min cap | P0
Retry correlation alarm | CloudWatch alarm on retry spike variance | P1
Staggered fallback | Random delay before Haiku fallback | P1
Chaos testing | Inject Bedrock throttle in staging weekly | P2
Documentation | Update runbook with correlated retry procedure | P2

Exam Angle

What the AIP-C01 exam tests here:
  - Understanding of exponential backoff and why jitter is essential in distributed systems
  - Knowledge of the thundering herd problem and how it manifests in FM API calls
  - AWS SDK retry modes (standard, adaptive, legacy) and their jitter behavior
  - Circuit breaker pattern and its role in preventing cascading failures

Key concept: The AWS SDK's adaptive retry mode includes a client-side rate limiter (token bucket) that automatically adjusts based on throttle responses. This is superior to standard exponential backoff because it prevents the client from sending requests it knows will be throttled.
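For reference, this is how the adaptive retry mode is enabled in boto3; the region matches the scenario, while the use of a dedicated bedrock-runtime client in the orchestrator is an assumption:

```python
import boto3
from botocore.config import Config

bedrock = boto3.client(
    "bedrock-runtime",
    region_name="ap-northeast-1",
    config=Config(
        retries={
            "mode": "adaptive",   # client-side token bucket slows sends after throttles
            "max_attempts": 5,
        }
    ),
)
```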

Key Takeaway

Exponential backoff without jitter in a multi-task fleet creates synchronized retry storms that amplify the original overload — always use decorrelated jitter and enforce a fleet-wide retry budget.


Scenario 2: Rate Limiter Blocking Legitimate Traffic During Organic Spike

Situation

Date: Wednesday, 6:00 PM JST
Trigger: Popular manga series "Kaiju No. 8" announced surprise new volume release on social media

At 5:58 PM, a viral tweet from the official Kaiju No. 8 account announced an exclusive limited-edition volume available only through MangaAssist. Within 90 seconds, organic traffic spiked from the baseline of ~800 rps to 4,200 rps — exceeding the global rate limit of 2,000 rps (burst 4,000).

Timeline:
  5:58:00 PM  Viral tweet posted (6.2M followers)
  5:58:30 PM  Traffic begins rising: 800 → 1,200 rps
  5:59:00 PM  Traffic: 1,800 rps (within limits)
  5:59:30 PM  Traffic: 3,200 rps (within burst, consuming tokens fast)
  5:59:45 PM  Burst tokens exhausted: 4,000 tokens consumed
  5:59:46 PM  Rate limiter begins returning 429 responses
  5:59:46 PM  Legitimate customers see: "You're sending messages too fast"
  6:00:00 PM  Traffic: 4,200 rps — 2,200 rps being throttled (52%)
  6:00:30 PM  Customer complaints spike on social media
  6:01:00 PM  Operations team alerted via CloudWatch alarm
  6:02:00 PM  Manual rate limit increase: 2,000 → 5,000 rps
  6:02:15 PM  Throttle rate drops to 0%
  6:05:00 PM  Traffic stabilizes at 3,800 rps
  6:15:00 PM  Traffic returns to 1,500 rps (above normal but manageable)
  6:30:00 PM  Normal operations

What Went Wrong

The rate limiter worked exactly as designed — it protected Bedrock from overload. The problem was that the rate limits were too conservative for organic traffic spikes:

  1. Static rate limits — The 2,000 rps limit was based on average load, not peak capacity
  2. No distinction between organic and abusive traffic — All requests treated equally
  3. No auto-scaling of rate limits — Required manual intervention to increase
  4. Per-user limits too aggressive — Returning "too fast" message to first-time visitors who sent just 1 message
  5. No priority queue — Premium customers throttled alongside anonymous browsers

The real problem: The error message "You're sending messages too fast" was misleading. These were first-time visitors asking about the new manga volume — they had sent exactly one message and were told they were too fast. The actual throttle was the global rate limit being hit, but the error message was designed for per-user throttling.

Detection

Signal | Source | Alert | Time to Detect
429 response rate spike | API Gateway metrics | Warning alarm at >1% 429 rate | 45 seconds
Global rate limit utilization >95% | Custom RateLimitUtilization metric | Info alarm at 80% | 30 seconds
Customer complaints | Social media monitoring | Manual detection | 2 minutes
Traffic spike | API Gateway RequestCount | Anomaly detection alarm | 1 minute

What worked: The CloudWatch anomaly detection alarm caught the traffic spike quickly. What failed was the automatic response — there was no auto-scaling mechanism for rate limits.

Impact

Metric | Normal | During Incident | After Fix
429 error rate | 0.01% | 52% | 0%
Legitimate users throttled | ~0 | ~45,000 users (2.5 min window) | 0
Customer complaints | ~2/hr | ~340 in 10 minutes | Subsided
Revenue impact | Normal | Est. $12,000 lost sales (limited edition) | Recovered partially
Social media sentiment | Positive | Negative (#MangaAssistDown trending) | Neutral
Bedrock utilization | 40% | Capped at rate limit | 75%

Business impact: An estimated 45,000 users were blocked from purchasing the limited-edition Kaiju No. 8 volume during the first 2.5 minutes. The negative social media attention (#MangaAssistDown trending briefly) caused reputational damage. Estimated revenue loss: $12,000 in the first 10 minutes.

Response Runbook

RUNBOOK: Rate Limiter Blocking Legitimate Traffic
SEVERITY: P1 (Critical — revenue-impacting, customer-facing)
ON-CALL TEAM: MangaAssist Platform Engineering + Customer Success

STEP 1: CONFIRM ORGANIC SPIKE vs. DDoS (0-2 minutes)
  ├── Check WAF metrics: Is traffic from diverse IPs or concentrated?
  │   └── Diverse IPs = organic | Few IPs = potential DDoS
  ├── Check geographic distribution: Mostly Japan? (Expected for manga)
  ├── Check request patterns: Normal chat queries or repetitive patterns?
  ├── Check social media: Is there a viral event driving traffic?
  └── Decision: If organic → increase limits | If DDoS → keep limits, enable WAF rules

STEP 2: INCREASE RATE LIMITS (2-5 minutes)
  ├── Global rate limit:
  │   └── aws apigateway update-usage-plan --usage-plan-id UP_ID \
  │       --patch-operations op=replace,path=/throttle/rateLimit,value=5000
  ├── Burst capacity:
  │   └── aws apigateway update-usage-plan --usage-plan-id UP_ID \
  │       --patch-operations op=replace,path=/throttle/burstLimit,value=8000
  ├── Scale ECS tasks:
  │   └── aws ecs update-service --desired-count 48 (double fleet)
  └── Verify: Monitor 429 rate dropping to 0%

STEP 3: PROTECT BEDROCK (concurrent with Step 2)
  ├── Check Bedrock service quotas: Are we near account limits?
  │   └── aws service-quotas get-service-quota --service-code bedrock ...
  ├── If near Bedrock limits:
  │   ├── Route 50% of traffic to Haiku (cheaper, faster)
  │   ├── Enable aggressive caching (extend TTLs temporarily)
  │   └── Consider: Bedrock on-demand throughput increase request
  └── Monitor: Bedrock throttle rate should remain below 5%

STEP 4: FIX USER-FACING MESSAGES (5-10 minutes)
  ├── Update throttle message for global rate limit:
  │   └── FROM: "You're sending messages too fast"
  │   └── TO: "We're experiencing high demand — you're in a queue!
  │            Estimated wait: ~30 seconds. While you wait, browse
  │            our Kaiju No. 8 collection directly."
  ├── Enable queue mode: Accept and queue requests rather than reject
  └── Add banner to UI: "High demand for Kaiju No. 8 — please be patient"

STEP 5: POST-SPIKE NORMALIZATION (30-60 minutes)
  ├── Monitor traffic returning to baseline
  ├── Gradually reduce rate limits back to normal
  ├── Scale ECS tasks back to standard count
  ├── Review: Were any customers permanently lost?
  └── Document: What threshold triggered the issue?

Resolution

  1. Implement dynamic rate limiting — Auto-scale rate limits based on traffic pattern analysis (organic vs. abusive)
  2. Add priority queueing — Premium users bypass throttle; registered users get priority over anonymous
  3. Fix error messages — Distinguish between "you personally are too fast" and "system is busy"
  4. Pre-event scaling — Integrate with marketing team's event calendar to pre-scale before known spikes
  5. Request queueing — Instead of rejecting with 429, queue requests and serve them as capacity becomes available
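A sketch of what resolution item 1 could look like: a Lambda handler, triggered by the anomaly-detection alarm, that raises the usage-plan throttle. The usage-plan ID and the 5,000/8,000 limits mirror the runbook; the organic-spike confirmation step is only hinted at in a comment:

```python
import boto3

apigw = boto3.client("apigateway")

USAGE_PLAN_ID = "UP_ID"        # placeholder from the runbook
SPIKE_RATE_LIMIT = 5000        # rps, per the runbook's manual mitigation
SPIKE_BURST_LIMIT = 8000

def handler(event, context):
    # A real implementation would first confirm the spike is organic
    # (diverse source IPs, normal query patterns) before loosening limits.
    apigw.update_usage_plan(
        usagePlanId=USAGE_PLAN_ID,
        patchOperations=[
            {"op": "replace", "path": "/throttle/rateLimit", "value": str(SPIKE_RATE_LIMIT)},
            {"op": "replace", "path": "/throttle/burstLimit", "value": str(SPIKE_BURST_LIMIT)},
        ],
    )
    return {"rateLimit": SPIKE_RATE_LIMIT, "burstLimit": SPIKE_BURST_LIMIT}
```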

Prevention

Prevention Measure | Implementation | Priority
Dynamic rate limit auto-scaling | Lambda function that adjusts limits based on CloudWatch Anomaly Detection | P0
Priority queue for premium users | Redis-based priority queue in ECS orchestrator | P0
Marketing integration | Webhook from marketing calendar to auto-scale 2 hours before events | P1
Request queueing (SQS) | Queue overflow requests instead of rejecting | P1
Differentiated error messages | Context-aware throttle messages (global vs. per-user) | P1
Capacity planning review | Monthly review of rate limits vs. traffic growth | P2
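One possible shape for the Redis-based priority queue listed above, using a sorted set so premium users drain first; the tier offsets and key names are assumptions for illustration:

```python
import json
import time
from typing import Optional

import redis

r = redis.Redis(host="manga-cache.example.internal", port=6379)  # placeholder endpoint

QUEUE_KEY = "fm_request_queue"
# Offsets are far larger than any Unix timestamp, so tier always dominates arrival time.
TIER_OFFSET = {"premium": 0, "registered": 10_000_000_000, "anonymous": 20_000_000_000}

def enqueue(request_id: str, payload: dict, tier: str = "anonymous") -> None:
    score = TIER_OFFSET.get(tier, TIER_OFFSET["anonymous"]) + time.time()
    r.zadd(QUEUE_KEY, {request_id: score})          # lower score = served sooner
    r.set(f"req:{request_id}", json.dumps(payload))

def dequeue() -> Optional[str]:
    popped = r.zpopmin(QUEUE_KEY, count=1)          # highest-priority, oldest request
    return popped[0][0].decode() if popped else None
```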

Exam Angle

What the AIP-C01 exam tests here:
  - Understanding of API Gateway throttling mechanisms (account, stage, method level)
  - Usage plans and their role in rate limiting different user tiers
  - The relationship between rate limiting and user experience
  - How to design for traffic spikes in FM-based applications
  - Token bucket algorithm behavior (steady rate vs. burst capacity)

Key concept: API Gateway's token bucket algorithm allows bursts up to the burst limit, but once burst tokens are consumed, traffic is limited to the steady-state rate. For applications with predictable spikes (like content releases), pre-scaling the burst capacity is essential.
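A toy token bucket illustrating that behavior (not API Gateway's internal implementation): the burst capacity absorbs a short spike, and once the bucket drains, throughput falls back to the steady refill rate:

```python
import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec      # steady-state refill, e.g. 2,000 rps
        self.capacity = burst         # burst limit, e.g. 4,000
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                  # the caller would return a 429 here

bucket = TokenBucket(rate_per_sec=2000, burst=4000)   # the pre-incident limits
```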

Key Takeaway

Rate limiters protect the system but must distinguish between organic traffic spikes and abuse — a rate limiter that blocks legitimate customers during a viral moment causes more damage than the overload it prevents.


Scenario 3: Fallback to Cached Response Serving Stale Manga Prices

Situation

Date: Monday, 10:15 AM JST
Trigger: Weekend sale ended at midnight but cache entries with sale prices had 4-hour TTL

MangaAssist ran a weekend flash sale on popular manga series (30% off). The sale ended at midnight Sunday. However, the response cache in Redis still contained responses generated during the sale period, complete with discounted prices. When Bedrock experienced a brief 3-minute throttling event Monday morning, the fallback system served these cached responses to customers.

Timeline:
  Sat 9:00 AM   Sale begins — all prices show 30% discount
  Sat-Sun        Cache populated with sale-price responses (TTL: 4 hours)
  Sun 11:59 PM   Sale ends — DynamoDB prices updated to normal
  Mon 12:00 AM   Cache contains: mix of stale (sale) and fresh (normal) prices
  Mon 3:00 AM    Most sale-price cache entries expired (4hr TTL)
  Mon 3:30 AM    Cache warm-up job runs — repopulates with correct prices
  Mon 10:15 AM   Bedrock throttling event begins (3 minutes)
  Mon 10:15 AM   Fallback cascade: Sonnet fail → Haiku fail → Cache
  Mon 10:15 AM   PROBLEM: Some cache entries had been refreshed at 9:00 AM
                  with sale prices, because the sale had accidentally been
                  re-enabled in one region (see "What Went Wrong" below)
  Mon 10:15 AM   ~180 users receive responses with incorrect (sale) prices
  Mon 10:18 AM   Bedrock recovers — fresh responses resume
  Mon 10:45 AM   Customer support reports: "Users saying prices differ
                  from what chatbot told them"
  Mon 11:00 AM   Investigation confirms stale cache served sale prices
  Mon 11:15 AM   Cache manually flushed for all pricing-related entries

What Went Wrong

Multiple failures compounded:

  1. Cache TTL too long for pricing data — 4 hours is appropriate for recommendations but not for price-sensitive content
  2. No cache invalidation on price change — When the sale ended and DynamoDB prices were updated, the Redis cache was not invalidated
  3. No content-type awareness in cache — All responses cached with the same TTL regardless of whether they contained pricing data
  4. Manual re-entry error — An operations team member accidentally re-triggered the sale in one region, causing fresh cache entries with sale prices
  5. No staleness warning for pricing — When serving cached price data, no disclaimer was shown to users

Cache Entry Problem:

  User asks: "How much is One Piece Volume 105?"

  Fresh response (correct):
    "One Piece Volume 105 is ¥528 (regular price)."

  Stale cached response (incorrect):
    "One Piece Volume 105 is ¥369 (30% off — flash sale!)."
    ← This response was cached during the weekend sale
    ← Served during the Monday Bedrock throttle event

Detection

Signal | Source | Alert | Time to Detect
Fallback to cache tier | Custom TierCount metric | Info alarm at >10% cache serves | 1 minute
Stale cache serves | Custom StaleCacheServes metric | Not configured for pricing category | Not detected
Customer complaints | Support ticket system | Manual detection | 30 minutes
Price discrepancy | No automated check | None | Manual investigation

Detection gap: There was no automated check comparing cached prices against the source of truth (DynamoDB). The staleness metric existed but was not configured to alarm specifically on pricing-category cache entries.

Impact

Metric | Normal | During Incident
Users served stale prices | 0 | ~180 users
Incorrect price quotes | 0 | ~85 unique products
Customer support tickets | ~5/hr | ~45 in 2 hours
Potential revenue loss | $0 | ~$2,300 (if honored)
Trust impact | Baseline | Moderate (price credibility)

Business impact: 180 customers were quoted sale prices that no longer applied. The legal and customer experience teams decided to honor the incorrect prices for affected customers, costing ~$2,300. The incident also raised questions about the chatbot's reliability for transactional queries.

Response Runbook

RUNBOOK: Stale Cache Serving Incorrect Data
SEVERITY: P2 (High — financial impact, customer trust)
ON-CALL TEAM: MangaAssist Platform Engineering + Customer Success

STEP 1: CONFIRM STALE DATA (0-5 minutes)
  ├── Identify affected cache entries:
  │   └── redis-cli --scan --pattern "manga:cache:*" | head -20   (sample ~20 keys)
  │   └── Check created_at timestamps — are any from before the sale ended?
  ├── Compare cached prices to DynamoDB source of truth:
  │   └── For each cached product_id, query DynamoDB for current price
  │   └── Flag entries where cached price != DynamoDB price
  ├── Estimate scope: How many cache entries are stale?
  └── Check: Is the issue ongoing or has the Bedrock throttle resolved?

STEP 2: FLUSH AFFECTED CACHE ENTRIES (5-10 minutes)
  ├── Option A: Flush ALL pricing-category cache entries
  │   └── redis-cli EVAL "local keys = redis.call('keys', 'manga:cache:*')
  │       for i, key in ipairs(keys) do
  │         local entry = redis.call('get', key)
  │         if string.find(entry, '\"category\":\"pricing\"') then
  │           redis.call('del', key)
  │         end
  │       end
  │       return 'done'" 0
  ├── Option B: Flush ALL cache entries (nuclear option)
  │   └── redis-cli FLUSHDB
  │   └── WARNING: This increases Bedrock load temporarily
  └── Verify: Spot-check that pricing queries now return fresh Bedrock responses

STEP 3: IDENTIFY AFFECTED CUSTOMERS (10-30 minutes)
  ├── Query CloudWatch Logs for cache-served pricing responses:
  │   └── Filter: tier="cached" AND category="pricing" AND
  │       timestamp BETWEEN "2024-XX-XX 10:15" AND "2024-XX-XX 10:18"
  ├── Extract user_ids from affected traces
  ├── For each user: What price were they quoted? What is the correct price?
  ├── Generate report: user_id, product, quoted_price, actual_price, delta
  └── Send report to Customer Success team for remediation

STEP 4: CUSTOMER REMEDIATION (same day)
  ├── Decision: Honor incorrect prices or issue apology + coupon?
  │   └── If delta < $10 per customer: Honor the quoted price
  │   └── If delta > $10: Contact customer with apology + 15% coupon
  ├── Send proactive email to affected customers:
  │   └── "We noticed a brief pricing display issue on [date]..."
  └── Update FAQ: Add note about the incident for customer support agents

STEP 5: PREVENT RECURRENCE (within 1 week)
  ├── Implement content-aware TTLs (pricing: 15 min max)
  ├── Add event-driven cache invalidation on DynamoDB price updates
  ├── Add staleness warning for cached pricing responses
  └── Add automated price-accuracy check for cached entries

Resolution

  1. Content-aware TTLs — Pricing: 15 minutes, Availability: 5 minutes, Recommendations: 4 hours, FAQ: 7 days
  2. DynamoDB Streams cache invalidation — When prices update in DynamoDB, a Lambda function invalidates all cache entries referencing those product IDs
  3. Staleness disclaimer — When serving cached pricing data, append: "Prices shown may not reflect the latest updates. Please verify at checkout."
  4. Cache validation job — Hourly Lambda that samples 100 cached pricing entries and validates against DynamoDB
  5. Sale end cache flush — Automated cache flush for all pricing entries when a sale period ends
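A minimal sketch of the content-aware TTLs in item 1, assuming the orchestrator tags each response with a category before caching it; the key pattern follows the manga:cache:* convention used in the runbook:

```python
import json
import redis

r = redis.Redis(host="manga-cache.example.internal", port=6379)  # placeholder endpoint

CATEGORY_TTL = {                     # seconds, mirroring resolution item 1
    "pricing": 15 * 60,
    "availability": 5 * 60,
    "recommendation": 4 * 60 * 60,
    "faq": 7 * 24 * 60 * 60,
}

def cache_response(query_hash: str, response_text: str, category: str) -> None:
    ttl = CATEGORY_TTL.get(category, 60 * 60)          # conservative default: 1 hour
    entry = json.dumps({"category": category, "response": response_text})
    r.setex(f"manga:cache:{query_hash}", ttl, entry)   # key pattern from the runbook
```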

Prevention

Prevention Measure | Implementation | Priority
Content-aware TTLs | Modify cache layer to use CATEGORY_TTL map | P0
DynamoDB Streams invalidation | Lambda triggered by DynamoDB Streams on price changes | P0
Staleness disclaimer on prices | Append warning when serving cached pricing data | P0
Sale-end cache flush automation | EventBridge rule triggers cache flush at sale end time | P1
Hourly price accuracy validation | Lambda samples and validates cached pricing entries | P1
Cache entry product_id tracking | Store product_ids in cache entry metadata for targeted invalidation | P2
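And a sketch of the DynamoDB Streams invalidation path (resolution item 2): a Lambda on the products table's stream that deletes cached responses referencing the changed product. The per-product index set and the product_id key name are assumptions about how cache keys are tracked:

```python
import redis

r = redis.Redis(host="manga-cache.example.internal", port=6379)  # placeholder endpoint

def handler(event, context):
    """Triggered by DynamoDB Streams on the products table."""
    for record in event.get("Records", []):
        if record.get("eventName") not in ("MODIFY", "REMOVE"):
            continue
        product_id = record["dynamodb"]["Keys"]["product_id"]["S"]
        # Assumes the cache layer records which cache keys mention each product:
        # SADD product_cache_index:<product_id> <cache_key> on every cache write.
        index_key = f"product_cache_index:{product_id}"
        stale_keys = r.smembers(index_key)
        if stale_keys:
            r.delete(*stale_keys)      # invalidate every cached response for this product
        r.delete(index_key)
```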

Exam Angle

What the AIP-C01 exam tests here:
  - Understanding of caching strategies and their trade-offs in FM applications
  - The concept of graceful degradation and when cached responses are appropriate vs. dangerous
  - Event-driven architectures (DynamoDB Streams) for cache invalidation
  - The importance of content-type awareness in FM response caching

Key concept: Not all FM responses are equal. A cached recommendation from 4 hours ago is still useful, but a cached price from 4 hours ago could be financially harmful. Content-aware caching ensures that time-sensitive data (prices, availability) has much shorter TTLs than stable data (FAQs, general information).

Key Takeaway

Cache-based fallback must be content-aware — a stale manga recommendation is harmless, but a stale price can cost real money and erode customer trust.


Scenario 4: X-Ray Sampling Missing Critical Error Traces

Situation

Date: Tuesday, 2:45 PM JST
Trigger: Bedrock introduced a new error response format that the MangaAssist error handler did not recognize

At 2:45 PM, Bedrock began returning a new error format for a specific edge case (malformed multi-modal content in the request). The MangaAssist error handler treated this as a generic ClientError rather than a retryable ValidationException, causing it to fail fast without retry and return a raw error message to users.

The problem: X-Ray was configured with a 5% sampling rate. The error affected only 0.3% of requests (those containing certain Unicode characters in manga title searches). With 5% sampling applied to that 0.3% error traffic, captured error traces amounted to only 0.015% of all requests, roughly 6 error traces per hour (1-2 in any 15-minute window), far too few to reveal the pattern.

Timeline:
  2:45 PM    New Bedrock error response begins appearing
  2:45 PM    Error rate increases from 0.1% to 0.4%
  2:45 PM    X-Ray captures 5% of all traces (including errors)
  2:45 PM    Of 0.3% error traffic, only 0.015% captured in X-Ray
  3:00 PM    CloudWatch alarm triggers: ErrorRate > 0.3%
  3:05 PM    On-call engineer checks X-Ray: finds 2 error traces
  3:05 PM    Engineer: "Looks like a rare transient — only 2 traces"
  3:05 PM    Engineer dismisses as transient (wrong conclusion)
  3:30 PM    Error rate persists at 0.4%
  4:00 PM    More customer complaints: "Error when searching manga with kanji"
  4:15 PM    Second investigation: CloudWatch Logs searched directly
  4:15 PM    FOUND: 4,300 error events in 1.5 hours (not 2!)
  4:20 PM    Root cause identified: New Bedrock error format for Unicode
  4:30 PM    Fix deployed: Updated error handler to recognize new format
  4:35 PM    Error rate returns to baseline

What Went Wrong

  1. Uniform sampling rate — 5% sampling applied equally to successful and failed requests
  2. Low-frequency errors undersampled — 0.3% error rate x 5% sampling = almost invisible in X-Ray
  3. No error-specific sampling rule — X-Ray should have captured 100% of errors
  4. Wrong investigation tool — Engineer relied on X-Ray (sampled) instead of CloudWatch Logs (complete)
  5. Misleading X-Ray data — Seeing "only 2 error traces" led to incorrect dismissal

The Sampling Math Problem:

  Total traffic:  11.6 rps (1M messages/day)
  Error rate:     0.3% = 0.035 errors/second = ~125 errors/hour
  X-Ray sampling: 5%
  Captured errors: 0.05 x 125 = ~6 error traces/hour

  But: X-Ray groups samples by trace, not by error.
  The engineer saw only 2 error traces in a 15-minute window.
  Conclusion: "This is rare and transient" (WRONG)
  Reality: 125 errors/hour x 1.5 hours = ~188 errors before detection

  With 100% error sampling:
  The engineer would have seen 31 error traces in the same 15-minute window.
  Conclusion: "This is a consistent pattern affecting Unicode queries" (CORRECT)

Detection

Signal | Source | Alert | Time to Detect
Error rate increase | CloudWatch ErrorCount | Warning at >0.3% | 15 minutes
X-Ray error traces | X-Ray Groups (MangaAssist-Errors) | Too few samples to trigger insight | Failed
Customer complaints | Support tickets | Manual | 1.5 hours
CloudWatch Logs errors | Log Insights query | Manual investigation | 1.5 hours

Detection gap: X-Ray was configured as the primary error investigation tool, but its sampling rate was too low to capture enough error traces for pattern analysis. CloudWatch Logs had complete data but was not the first tool checked.

Impact

Metric | Normal | During Incident
Error rate | 0.1% | 0.4% (+0.3%)
Affected queries | 0 | ~4,300 over 1.75 hours
Time to detect | N/A | 15 minutes (alarm), 1.5 hours (root cause)
Time to resolve | N/A | 1 hour 45 minutes total
Users affected | 0 | ~2,800 unique users
Queries with raw error | 0 | ~4,300 (users saw error JSON)

Business impact: 2,800 users who searched for manga using Japanese kanji characters received raw error messages instead of helpful responses. The 1.5-hour delay in root cause identification (due to misleading X-Ray data) extended the impact window unnecessarily.

Response Runbook

RUNBOOK: X-Ray Sampling Missing Critical Errors
SEVERITY: P2 (High — errors reaching users, investigation delayed)
ON-CALL TEAM: MangaAssist Platform Engineering

STEP 1: DO NOT RELY SOLELY ON X-RAY FOR ERROR INVESTIGATION
  ├── ALWAYS cross-reference X-Ray with CloudWatch Logs
  ├── CloudWatch Logs: Complete record (no sampling)
  │   └── Log Insights query:
  │       fields @timestamp, error_code, error_message, user_query
  │       | filter level = "ERROR"
  │       | stats count(*) by error_code
  │       | sort count desc
  ├── X-Ray: Use for distributed trace context AFTER identifying the pattern
  └── Rule: If X-Ray shows few errors but CloudWatch alarm fired,
            trust CloudWatch — X-Ray may be undersampled

STEP 2: IDENTIFY ERROR PATTERN (0-10 minutes)
  ├── Run Log Insights query for error distribution:
  │   └── fields @timestamp, error_code, @message
  │       | filter level = "ERROR"
  │       | stats count(*) as error_count by bin(5m)
  │       | sort @timestamp desc
  ├── Look for: New error codes, changed error formats, specific patterns
  ├── Correlate with user input: Is there a common pattern in failing queries?
  └── Check: Did AWS release any changes to Bedrock API recently?

STEP 3: INCREASE X-RAY ERROR SAMPLING (immediate)
  ├── Create high-priority error sampling rule:
  │   └── aws xray create-sampling-rule --cli-input-json '{
  │         "SamplingRule": {
  │           "RuleName": "MangaAssist-AllErrors",
  │           "Priority": 1,
  │           "FixedRate": 1.0,
  │           "ReservoirSize": 100,
  │           "ServiceName": "MangaAssist",
  │           "ServiceType": "*",
  │           "Host": "*",
  │           "HTTPMethod": "*",
  │           "URLPath": "*",
  │           "ResourceARN": "*",
  │           "Version": 1
  │         }
  │       }'
  └── This captures 100% of traces going forward (useful for ongoing investigation)

STEP 4: FIX THE ERROR (once root cause identified)
  ├── Update error handler to recognize new error format
  ├── Add the new error code to the retryable set (if retryable)
  ├── Deploy fix to ECS tasks (rolling update, no downtime)
  └── Verify: Error rate returns to baseline

STEP 5: REVERT X-RAY SAMPLING (after incident resolved)
  ├── Remove the temporary 100% sampling rule
  ├── Ensure permanent error sampling rule exists:
  │   └── Priority: 50, FixedRate: 1.0, ReservoirSize: 10
  │   └── Filter: Errors only (via annotation or HTTP status)
  └── Verify: Normal sampling rate for success, 100% for errors

Resolution

  1. Separate error sampling rule — Create a permanent X-Ray sampling rule that captures 100% of error traces (Priority 50, FixedRate 1.0)
  2. Error-first investigation protocol — Update runbook: always check CloudWatch Logs first, use X-Ray for trace context second
  3. Error format handler update — Add defensive parsing for new Bedrock error response formats
  4. Anomaly detection on error patterns — CloudWatch anomaly detection on error codes, not just error rate
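A sketch of the defensive error parsing in item 3, assuming a boto3 bedrock-runtime client; the retryable set reflects this incident's conclusions and is illustrative, not the definitive Bedrock error taxonomy:

```python
from botocore.exceptions import ClientError

RETRYABLE_CODES = {
    "ThrottlingException",
    "ServiceUnavailableException",
    "ModelTimeoutException",
    "ValidationException",     # added to the retryable set after this incident
}

def classify_bedrock_error(err: ClientError) -> str:
    """Never raise on an unfamiliar format; unknown errors route to graceful degradation."""
    code = err.response.get("Error", {}).get("Code", "")
    if code in RETRYABLE_CODES:
        return "retry"
    # New or unrecognized error format: log the full response for later analysis,
    # then hand the query to the graceful-degradation path instead of surfacing raw JSON.
    return "graceful_degradation"

# Usage inside the orchestrator's invoke path (sketch):
#   try:
#       response = bedrock.invoke_model(modelId=MODEL_ID, body=payload)
#   except ClientError as err:
#       action = classify_bedrock_error(err)
```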

Prevention

Prevention Measure | Implementation | Priority
100% error sampling rule (permanent) | X-Ray sampling rule: Priority 50, Rate 1.0 for error traces | P0
CloudWatch Logs-first investigation | Update on-call runbook and training | P0
Defensive error parsing | Try/catch with fallback for unknown Bedrock error formats | P1
Error code anomaly detection | CloudWatch anomaly detection per error_code dimension | P1
Weekly Bedrock API change review | Subscribe to AWS service announcements, review weekly | P2
X-Ray sampling adequacy test | Monthly: Inject known errors, verify X-Ray captures them | P2

Exam Angle

What the AIP-C01 exam tests here:
  - Understanding of X-Ray sampling strategies and their trade-offs
  - The relationship between X-Ray sampling rate and error visibility
  - When to use X-Ray vs. CloudWatch Logs for investigation
  - X-Ray filter expressions and groups for isolating errors
  - The importance of sampling rules that prioritize errors over successful requests

Key concept: X-Ray sampling rules have priorities. A rule with Priority 1 and FixedRate 1.0 for error paths ensures 100% capture of errors regardless of the default sampling rate. The reservoir ensures at least N traces per second are captured even if the fixed rate would capture fewer.

Key Takeaway

X-Ray sampling at 5% is cost-effective for normal traffic but blind to low-frequency errors — always create a separate high-priority sampling rule that captures 100% of error traces.


Scenario 5: Graceful Degradation Message Confusing Users About Service Status

Situation

Date: Friday, 8:30 PM JST (peak evening traffic)
Trigger: Complete Bedrock regional outage in ap-northeast-1 lasting 12 minutes

At 8:30 PM, Bedrock experienced a complete regional outage. All model invocations failed. The MangaAssist fallback cascade executed correctly: Sonnet failed, Haiku failed, cache served where possible, and for cache misses, the graceful degradation message was displayed.

The problem was not technical — the fallback worked. The problem was the user experience of the graceful degradation message.

Graceful Degradation Message (version 1 — problematic):

  "I'm having a bit of trouble right now, but I'm still here!
   Could you try asking again in a moment? In the meantime,
   you can browse our manga catalog directly."

Problems with this message:
  1. "Try again in a moment" — Users spam retry, increasing load
  2. "I'm still here" — Implies the bot is working, just slow
  3. No estimated recovery time — Users don't know if it's 1 min or 1 hour
  4. "Browse our manga catalog directly" — Link not provided
  5. No acknowledgment that there IS a problem
  6. Same message for every query — "Are you ignoring my specific question?"

Timeline:
  8:30:00 PM  Bedrock outage begins
  8:30:05 PM  Circuit breakers open across ECS fleet
  8:30:05 PM  Cache serves 60% of queries (cache hit)
  8:30:05 PM  40% of queries receive graceful degradation message
  8:30:30 PM  Users retry repeatedly (message says "try again in a moment")
  8:31:00 PM  Retry traffic: +40% above normal (users following the message)
  8:31:00 PM  Social media: "MangaAssist is broken but pretending it's fine"
  8:32:00 PM  Support tickets: "The bot keeps saying try again but nothing works"
  8:33:00 PM  Some users interpret "I'm still here" as the chatbot claiming sentience
  8:35:00 PM  Operations team adds maintenance banner to website
  8:37:00 PM  Updated degradation message deployed (version 2)
  8:42:00 PM  Bedrock recovers
  8:42:30 PM  Normal service resumes
  8:45:00 PM  Post-incident: 2,100 support tickets about "confusing bot messages"

What Went Wrong

The graceful degradation message was designed by engineers focused on the technical implementation, not on user experience. Multiple UX failures:

  1. "Try again in a moment" — Encouraged retry storms. Users interpreted "a moment" as "right now" and hammered the retry button. This increased load by 40% during the outage, worsening the situation.

  2. "I'm still here" — Created false expectation. Users thought the bot was working but just slow, so they kept waiting and retrying rather than seeking alternatives.

  3. No context acknowledgment — The message was generic. A user asking "Where is my order?" received the same message as one asking "Recommend a manga." Users felt their specific question was being ignored.

  4. No service status information — Users could not tell if this was a 30-second hiccup or a major outage. Without this information, they could not make informed decisions (wait vs. call support vs. use the website directly).

  5. No action links — "Browse our manga catalog directly" provided no clickable link. Users would have to navigate away from the chat manually.

  6. Uncanny valley — The first-person "I'm still here" language made some users uncomfortable, as it sounded like the AI was claiming sentience or emotional status during what was clearly a system failure.

User Experience Comparison:

  BAD (version 1):
  ┌─────────────────────────────────────────────────┐
  │ User: Where is my order #12345?                 │
  │                                                 │
  │ Bot: I'm having a bit of trouble right now,     │
  │ but I'm still here! Could you try asking again  │
  │ in a moment? In the meantime, you can browse    │
  │ our manga catalog directly.                     │
  │                                                 │
  │ User: I asked about my ORDER, not the catalog!  │
  │                                                 │
  │ Bot: I'm having a bit of trouble right now...   │
  │ [same message repeated]                         │
  └─────────────────────────────────────────────────┘

  GOOD (version 2):
  ┌─────────────────────────────────────────────────┐
  │ User: Where is my order #12345?                 │
  │                                                 │
  │ Bot: I've received your question about order    │
  │ #12345. Our systems are currently experiencing  │
  │ temporary difficulties and I'm unable to look   │
  │ up order details right now.                     │
  │                                                 │
  │ Here's what you can do:                         │
  │ • Track your order: mangaassist.jp/orders/12345 │
  │ • Contact support: support@mangaassist.jp       │
  │ • Call us: 0120-XXX-XXXX (9AM-9PM JST)         │
  │                                                 │
  │ I'll be fully back within a few minutes and     │
  │ will be happy to help then. Our team is         │
  │ actively working on this.                       │
  │                                                 │
  │ [Service Status: Temporary disruption]          │
  │ [Auto-retry in: 2 minutes]                      │
  └─────────────────────────────────────────────────┘

Detection

Signal | Source | Alert | Time to Detect
Graceful degradation tier spike | Custom TierCount.Graceful metric | Warning alarm (>5% graceful serves) | 30 seconds
User retry spike | API Gateway RequestCount anomaly | Anomaly detection alarm | 2 minutes
Support ticket spike | Zendesk ticket count | Manual detection | 5 minutes
Social media complaints | Social monitoring tool | Manual detection | 5 minutes
User satisfaction drop | Post-chat survey scores | Delayed (batched daily) | Next day

What was detected quickly: The Bedrock outage and fallback tier shift were detected within 30 seconds. What was not detected: The UX impact of the degradation message. There was no metric for "user confusion" or "degradation message effectiveness."

Impact

Metric | Normal | During Incident
Users receiving degradation message | <0.1% | 40% of queries
User retry rate | 5% | 45% (9x increase)
Support tickets (12 minutes) | ~2 | 2,100
Social media complaints | ~1/hr | 340 in 30 minutes
User satisfaction score | 4.2/5 | 1.8/5 (for degraded users)
Users who left chat permanently | 0.5% | 8%
Additional load from retries | Baseline | +40% above normal

Business impact: The degradation message itself caused more damage than the outage. The retry storms increased infrastructure load by 40%, the support ticket flood overwhelmed the team (2,100 tickets in 12 minutes), and 8% of users who received the message never returned to the chatbot. The misleading "try again" language directly caused the retry storm.

Response Runbook

RUNBOOK: Graceful Degradation UX Issues
SEVERITY: P2 (High — customer experience, not technical failure)
ON-CALL TEAM: MangaAssist Platform Engineering + UX Team

STEP 1: DEPLOY IMPROVED DEGRADATION MESSAGE (0-5 minutes)
  ├── Switch to context-aware degradation message:
  │   └── Update ECS environment variable: DEGRADATION_MESSAGE_VERSION=v2
  │   └── V2 messages include:
  │       - Acknowledgment of the user's specific question type
  │       - Explicit service status ("Our AI assistant is temporarily unavailable")
  │       - Actionable alternatives with links
  │       - Estimated recovery time (if known)
  │       - NO "try again" language (prevents retry storms)
  ├── Add service status banner to chat UI:
  │   └── "Service Notice: AI assistant is in limited mode.
  │        You can still browse products and track orders."
  └── Disable auto-retry in client WebSocket (prevents automated retry loops)

STEP 2: REDUCE RETRY STORM (concurrent with Step 1)
  ├── Client-side: Update WebSocket handler to not auto-retry on degradation
  ├── Server-side: Add rate limit specifically for repeated identical queries
  │   └── Same user, same query hash within 60s → return cached degradation
  │   └── Do not re-process through the full fallback cascade
  └── Monitor: Watch retry rate metric for reduction

STEP 3: CUSTOMER COMMUNICATION (5-15 minutes)
  ├── Post service status on mangaassist.jp/status
  ├── Tweet from @MangaAssist: "We're aware of temporary AI assistant issues.
  │    You can still shop and track orders normally. We'll be fully back soon!"
  ├── Update in-app banner with current status
  └── Notify customer support team with pre-written response templates

STEP 4: EVALUATE MESSAGE EFFECTIVENESS (post-incident)
  ├── Survey users who received the degradation message
  ├── Analyze: Did users understand what was happening?
  ├── Analyze: Did users find the alternative actions helpful?
  ├── Compare retry rates between v1 and v2 messages
  └── A/B test improved message variants for future incidents

Resolution

  1. Context-aware degradation messages — Different messages for order queries, recommendations, general questions, and pricing queries. Each acknowledges the user's intent and provides relevant alternatives.

  2. Service status integration — The degradation message includes a real-time service status indicator and estimated recovery time (based on historical outage durations).

  3. Anti-retry design — Replace "try again" with "I'll notify you when I'm back" (if WebSocket is open) or a specific time ("check back in 5 minutes").

  4. Actionable alternatives with links — Every degradation message includes clickable deep links to the specific feature the user was asking about (order tracking, catalog, support).

  5. Degradation message A/B testing — Monthly chaos testing in staging where different degradation message variants are tested with internal users. Measure comprehension, satisfaction, and retry behavior.
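A minimal sketch of the context-aware selection in item 1; the intent keywords, URLs, and message copy are placeholders, not the production templates:

```python
DEGRADATION_TEMPLATES = {
    "order_status": (
        "I've received your question about your order. Our AI assistant is "
        "temporarily unavailable, so I can't look up the details right now.\n"
        "- Track it directly: https://mangaassist.jp/orders\n"
        "- Contact support: support@mangaassist.jp"
    ),
    "pricing": (
        "Our AI assistant is temporarily unavailable. Current prices are always "
        "shown on each product page: https://mangaassist.jp/catalog"
    ),
    "default": (
        "Our AI assistant is temporarily unavailable. You can still browse the "
        "catalog and track orders normally. I'll notify you here when I'm back."
    ),
}

def degradation_message(user_query: str) -> str:
    q = user_query.lower()
    if "order" in q or "#" in q:
        return DEGRADATION_TEMPLATES["order_status"]
    if any(w in q for w in ("price", "how much", "cost")):
        return DEGRADATION_TEMPLATES["pricing"]
    return DEGRADATION_TEMPLATES["default"]   # note: no "try again" language
```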

Prevention

Prevention Measure | Implementation | Priority
Context-aware degradation messages | Extract user intent before returning degradation message | P0
Anti-retry language | Remove all "try again" phrasing, add auto-notify | P0
Actionable deep links | Include relevant product/order/support URLs in message | P0
Service status page | Real-time status at mangaassist.jp/status | P1
Degradation message A/B testing | Monthly chaos test with UX evaluation | P1
Client-side retry suppression | WebSocket handler: exponential backoff on degradation | P1
Degradation UX metrics | Track "message comprehension" via post-chat survey | P2
UX review of all system messages | Quarterly UX audit of all bot error/fallback messages | P2

Exam Angle

What the AIP-C01 exam tests here:
  - Understanding of graceful degradation as a user experience concern, not just a technical one
  - The relationship between degradation messaging and system load (retry storms)
  - Designing fallback systems that maintain user trust during outages
  - The importance of actionable alternatives in degradation responses
  - How poorly designed resilience can amplify the impact of an outage

Key concept: Graceful degradation is not just about keeping the system running — it is about keeping the user informed and productive. A degradation message that causes a retry storm is worse than showing an error page, because it actively makes the outage worse while simultaneously frustrating users.

Key Takeaway

Graceful degradation messages must inform, not mislead — "try again in a moment" causes retry storms, while "here is what you can do instead" preserves user trust and reduces system load.


Cross-Scenario Summary

Common Patterns Across All 5 Scenarios

Pattern | Scenarios | Lesson
Cascading failure | 1, 2 | A resilience mechanism that triggers another failure mode (retries causing storms, rate limiting causing retry storms)
Detection gap | 1, 3, 4 | Metrics existed but alarms were not configured for the specific failure pattern
Investigation misdirection | 4, 5 | Using the wrong tool or metric for investigation delays root cause identification
UX as resilience | 2, 5 | User-facing messages are part of the resilience architecture — bad messages amplify impact
Content-awareness | 3 | Not all data is equal — pricing data needs different handling than recommendation data

Resilience Anti-Patterns Identified

Anti-Pattern 1: Exponential Backoff Without Jitter
  Scenario: 1
  Effect: Thundering herd across fleet
  Fix: Decorrelated jitter + fleet retry budget

Anti-Pattern 2: Static Rate Limits Without Auto-Scale
  Scenario: 2
  Effect: Blocks legitimate organic traffic
  Fix: Dynamic limits with anomaly detection

Anti-Pattern 3: Uniform Cache TTL for All Content Types
  Scenario: 3
  Effect: Stale prices served from cache
  Fix: Content-aware TTLs + event-driven invalidation

Anti-Pattern 4: Uniform X-Ray Sampling for All Traffic
  Scenario: 4
  Effect: Errors invisible in traces
  Fix: 100% error sampling rule

Anti-Pattern 5: Engineer-Designed Degradation Messages
  Scenario: 5
  Effect: Retry storms + user confusion
  Fix: UX-designed, context-aware messages with alternatives

Exam-Ready Quick Reference

Topic: Exponential Backoff (Scenario 1)
  - AWS SDK retry modes: standard, adaptive, legacy
  - Jitter strategies: full, equal, decorrelated
  - Circuit breaker: closed → open → half-open → closed
  - Fleet-wide retry budget prevents thundering herd

Topic: Rate Limiting (Scenario 2)
  - API Gateway: account, stage, method throttling
  - Token bucket: rate (steady) + burst (peak) + quota (period)
  - Usage plans: per-tier access control
  - Dynamic scaling for organic traffic spikes

Topic: Fallback/Degradation (Scenarios 3, 5)
  - Fallback cascade: primary → secondary → cached → static → graceful
  - Content-aware caching: different TTLs for different data types
  - Cache invalidation: event-driven (DynamoDB Streams)
  - Graceful degradation: UX design, not just technical implementation

Topic: Observability (Scenario 4)
  - X-Ray: distributed tracing, service maps, annotations
  - Sampling: reservoir (guaranteed) + fixed rate (probabilistic)
  - Error sampling: separate rule, 100% capture
  - CloudWatch: metrics, logs, alarms, dashboards, anomaly detection