Scenarios and Runbooks: Resilient FM Systems
MangaAssist context: JP Manga store chatbot on AWS — Bedrock Claude 3 (Sonnet at $3/$15 per 1M tokens input/output, Haiku at $0.25/$1.25), OpenSearch Serverless (vector store), DynamoDB (sessions/products), ECS Fargate (orchestrator), API Gateway WebSocket, ElastiCache Redis. Target: useful answer in under 3 seconds, 1M messages/day scale.
Skill Mapping
| Attribute | Value |
|---|---|
| Certification | AWS Certified AI Practitioner (AIP-C01) |
| Domain | 2 — Implementation and Integration of Foundation Models |
| Task | 2.4 — Design resilient and scalable FM-based applications |
| Skill | 2.4.3 — Create resilient FM systems to ensure reliable operations |
| Focus Areas | Exponential backoff storms, rate limiter tuning, stale cache risks, X-Ray sampling gaps, graceful degradation UX |
Scenario Format
Each scenario follows a standard structure:
SCENARIO: [Title]
├── SITUATION: What happened and why
├── WHAT WENT WRONG: Root cause analysis
├── DETECTION: How the issue was (or should have been) detected
├── IMPACT: Business and user impact
├── RESPONSE RUNBOOK: Step-by-step mitigation
├── RESOLUTION: How to fix permanently
├── PREVENTION: How to prevent recurrence
├── EXAM ANGLE: What the AIP-C01 exam tests from this scenario
└── KEY TAKEAWAY: One-sentence lesson
Scenario 1: Exponential Backoff Storm from Correlated Retries Across Fleet
Situation
Date: Thursday, 7:22 PM JST (peak manga browsing hour)
Trigger: Bedrock experienced a 45-second regional capacity constraint in ap-northeast-1
At 7:22 PM, Bedrock's Claude 3 Sonnet endpoint in ap-northeast-1 began returning ThrottlingException at an elevated rate. The MangaAssist ECS Fargate fleet (running 24 tasks) detected these errors and all tasks simultaneously began their exponential backoff retry loops.
Timeline:
7:22:00 PM Bedrock starts throttling (~30% of requests)
7:22:01 PM All 24 ECS tasks detect ThrottlingException
7:22:01 PM All 24 tasks begin retry attempt 1 (100ms delay)
7:22:02 PM All 24 tasks fire retry attempt 1 simultaneously
7:22:02 PM Bedrock throttle rate increases to 60% (overloaded by correlated retries)
7:22:03 PM All 24 tasks begin retry attempt 2 (200ms delay)
7:22:04 PM Retry storm: 24 tasks x 200ms = synchronized wave
7:22:06 PM All 24 tasks begin retry attempt 3 (400ms delay)
7:22:10 PM Retry attempt 4 (800ms delay) — Bedrock now throttling 90%
7:22:18 PM Retry attempt 5 (max 10s delay) — all retries exhausted
7:22:18 PM All 24 tasks switch to Haiku fallback simultaneously
7:22:19 PM Haiku now overwhelmed by 24 tasks' worth of redirected traffic
7:22:25 PM Haiku also starts throttling
7:22:30 PM Circuit breakers open across fleet — cache-only mode
7:23:05 PM Bedrock recovers, but circuit breakers remain open for 30s
7:23:35 PM Circuit breakers transition to HALF_OPEN
7:23:36 PM Probe requests succeed — circuit breakers close
7:23:40 PM Normal operation resumes
What Went Wrong
The root cause was correlated retries — a thundering herd problem. When all 24 ECS tasks encounter the same error at the same time and use the same backoff algorithm, their retries synchronize:
Without Jitter (the problem):
Task 1: |==| |====| |========|
Task 2: |==| |====| |========|
Task 3: |==| |====| |========|
...
Task 24: |==| |====| |========|
↑ All fire at same instant ↑
With Decorrelated Jitter (the fix):
Task 1: |=| |=======| |==========|
Task 2: |===| |====| |===============|
Task 3: |==| |=====| |============|
...
Task 24: |====| |===| |========|
↑ Spread across time window ↑
Contributing factors:
1. All ECS tasks used JitterStrategy.NONE (exponential backoff without jitter)
2. No retry budget was enforced across the fleet
3. The fallback cascade triggered a secondary thundering herd on Haiku
4. Circuit breaker recovery timeout (30s) kept the fleet in cache-only mode for a full 30 seconds after Bedrock had already recovered, extending the degradation past the outage itself
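The fix described above can be sketched as follows. This is a minimal illustration of decorrelated jitter using the widely cited formula sleep = min(cap, random(base, previous_sleep * 3)); the constants mirror the 100ms initial and 10s maximum delays from the timeline, and the function name is hypothetical:

```python
import random

BASE_MS = 100      # initial delay, matching retry attempt 1 in the timeline
CAP_MS = 10_000    # maximum delay, matching the fleet's 10s cap

def decorrelated_jitter_delays(attempts, seed=None):
    """Generate per-attempt retry delays (ms) using decorrelated jitter.

    Each delay is drawn uniformly from [BASE_MS, previous_delay * 3] and
    capped, so tasks that fail at the same instant still retry at
    different moments instead of firing in synchronized waves.
    """
    rng = random.Random(seed)
    delays, sleep = [], BASE_MS
    for _ in range(attempts):
        sleep = min(CAP_MS, rng.uniform(BASE_MS, sleep * 3))
        delays.append(sleep)
    return delays
```

Two ECS tasks seeded differently (in practice, by their own entropy) produce different retry schedules, which is exactly what breaks the correlation shown in the diagrams above.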
Detection
| Signal | Source | Alert | Time to Detect |
|---|---|---|---|
| Bedrock ThrottlingException spike | CloudWatch ErrorCount metric | Warning alarm at 5% error rate | 2 minutes |
| Retry count spike | Custom metric RetryCount | None configured | Not detected automatically |
| Correlated retry pattern | X-Ray trace analysis (post-incident) | None | Manual analysis after incident |
| Circuit breaker open | Custom metric CircuitBreakerState | Critical alarm | 30 seconds |
| Haiku fallback spike | Custom metric TierCount.Haiku | Warning at >20% degradation | 5 minutes |
Detection gap: There was no alarm for correlated retry patterns. The retry count metric existed but no alarm was set to detect the synchronized spike pattern.
Impact
| Metric | Normal | During Incident | Recovery |
|---|---|---|---|
| User-visible error rate | 0.01% | 15% | 0.5% |
| p99 latency | 2.8s | 12.5s (timeout) | 3.5s |
| Sonnet availability | 99.5% | 10% | 85% |
| Haiku availability | 99.5% | 40% (secondary storm) | 95% |
| Cache serve rate | 4% | 65% | 15% |
| Messages affected | 0 | ~2,400 messages | ~600 |
| Duration | N/A | 78 seconds | 35 seconds |
Business impact: During the 78-second incident, approximately 2,400 manga customer messages received degraded responses (cached or graceful fallback). 15% of users during the window saw error messages. Customer satisfaction scores for the affected window dropped from 4.2/5 to 2.8/5.
Response Runbook
RUNBOOK: Exponential Backoff Storm Mitigation
SEVERITY: P2 (High — service degradation, not outage)
ON-CALL TEAM: MangaAssist Platform Engineering
STEP 1: CONFIRM THE INCIDENT (0-2 minutes)
├── Check CloudWatch dashboard: MangaAssist-FM-Health
│ └── Look for: ErrorRate spike + TierCount shift away from Sonnet
├── Check X-Ray service map for error segments
│ └── Filter: fault = true AND service("Bedrock Runtime")
├── Check Bedrock Service Health Dashboard
│ └── https://health.aws.amazon.com/health/status
└── Confirm: Is this a Bedrock regional issue or MangaAssist-specific?
STEP 2: IMMEDIATE MITIGATION (2-5 minutes)
├── If Bedrock is the source:
│ ├── Enable "storm brake" — temporarily disable retries fleet-wide
│ │ └── Set environment variable: RETRY_MAX_ATTEMPTS=0
│ │ └── ECS will pick up on next health check (30s)
│ ├── Force all traffic to Haiku (skip Sonnet entirely)
│ │ └── Set environment variable: PRIMARY_MODEL=haiku
│ └── If Haiku also affected: enable cache-only mode
│ └── Set environment variable: FM_MODE=cache_only
├── If MangaAssist-specific (our retry logic):
│ ├── Scale ECS tasks to 0 temporarily, then back to 12 (half fleet)
│ │ └── aws ecs update-service --desired-count 0
│ │ └── Wait 10 seconds
│ │ └── aws ecs update-service --desired-count 12
│ └── This breaks the synchronized retry pattern
└── Monitor: Watch ErrorRate metric for improvement
STEP 3: VERIFY RECOVERY (5-10 minutes)
├── Confirm Sonnet invocations returning to normal
├── Check circuit breaker states: all should be CLOSED
├── Verify cache serve rate dropping back to baseline (~4%)
├── Check user-facing error rate back below 0.1%
└── Confirm p99 latency below 3s SLA
STEP 4: POST-INCIDENT (within 24 hours)
├── Collect X-Ray traces from the incident window
├── Analyze retry patterns: Were they correlated?
├── Document: Timeline, impact, mitigation steps
└── Schedule: Post-mortem meeting
Resolution
- Enable decorrelated jitter — Switch from JitterStrategy.NONE to JitterStrategy.DECORRELATED across all ECS tasks
- Add fleet-wide retry budget — Use a shared Redis counter to cap total retries across the fleet at 100/minute
- Stagger fallback cascade — When falling back to Haiku, add a random 0-2 second delay to prevent a secondary thundering herd
- Reduce circuit breaker recovery timeout — From 30s to 15s, since most Bedrock transient issues resolve within 10s
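The fleet-wide retry budget above can be sketched as a shared counter keyed by time window. The class and key names here are hypothetical; in production the client would be a shared redis.Redis connection so every ECS task draws from the same counter, rather than the in-memory stand-in shown:

```python
import time

class RetryBudget:
    """Fleet-wide retry budget: allow at most `limit` retries per window.

    `client` needs Redis-style incr(key) and expire(key, seconds).
    Because the key is derived from the current time window, every task
    that shares the client shares the budget.
    """
    def __init__(self, client, limit=100, window_s=60):
        self.client = client
        self.limit = limit
        self.window_s = window_s

    def try_acquire(self, now=None):
        now = time.time() if now is None else now
        key = "retry-budget:%d" % (now // self.window_s)
        count = self.client.incr(key)
        if count == 1:                      # first caller in the window sets the TTL
            self.client.expire(key, self.window_s * 2)
        return count <= self.limit          # False -> skip the retry, fail over instead

class FakeRedis:
    """In-memory stand-in for redis.Redis, for local experimentation only."""
    def __init__(self):
        self.store = {}
    def incr(self, key):
        self.store[key] = self.store.get(key, 0) + 1
        return self.store[key]
    def expire(self, key, seconds):
        pass
```

A task calls try_acquire() before each retry; once the fleet's budget for the current minute is spent, further retries are skipped and the request moves straight to the fallback tier.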
Prevention
| Prevention Measure | Implementation | Priority |
|---|---|---|
| Decorrelated jitter by default | Update BackoffConfig default to DECORRELATED | P0 |
| Fleet retry budget (Redis) | Add RetryBudgetManager with 100/min cap | P0 |
| Retry correlation alarm | CloudWatch alarm on retry spike variance | P1 |
| Staggered fallback | Random delay before Haiku fallback | P1 |
| Chaos testing | Inject Bedrock throttle in staging weekly | P2 |
| Documentation | Update runbook with correlated retry procedure | P2 |
Exam Angle
What the AIP-C01 exam tests here:
- Understanding of exponential backoff and why jitter is essential in distributed systems
- Knowledge of the thundering herd problem and how it manifests in FM API calls
- AWS SDK retry modes (standard, adaptive, legacy) and their jitter behavior
- Circuit breaker pattern and its role in preventing cascading failures
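The circuit breaker pattern tested here can be sketched minimally as a three-state machine; the thresholds and class name below are illustrative, not MangaAssist's production configuration:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED.

    After `failure_threshold` consecutive failures the breaker opens and
    rejects calls; after `recovery_timeout` seconds it lets one probe
    through (HALF_OPEN) and closes again only if the probe succeeds.
    """
    def __init__(self, failure_threshold=5, recovery_timeout=15.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def allow_request(self, now=None):
        now = time.monotonic() if now is None else now
        if self.state == "OPEN":
            if now - self.opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"    # admit a single probe request
                return True
            return False
        return True

    def record_success(self):
        self.failures = 0
        self.state = "CLOSED"

    def record_failure(self, now=None):
        now = time.monotonic() if now is None else now
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = now
```

In the Scenario 1 timeline, the breakers opening at 7:22:30 and transitioning to HALF_OPEN at 7:23:35 correspond exactly to the OPEN duration and probe step above.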
Key concept: The AWS SDK's adaptive retry mode includes a client-side rate limiter (token bucket) that automatically adjusts based on throttle responses. This is superior to standard exponential backoff because it prevents the client from sending requests it knows will be throttled.
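Enabling adaptive mode is a one-line botocore configuration. The snippet below is a minimal sketch of wiring it to a Bedrock runtime client; the timeout and attempt values are illustrative choices, not prescribed defaults:

```python
import boto3
from botocore.config import Config

# "adaptive" turns on the SDK's client-side token-bucket rate limiter,
# which slows its own request rate after throttle responses instead of
# blindly retrying at full speed.
adaptive_config = Config(
    retries={"mode": "adaptive", "max_attempts": 5},
    read_timeout=30,
)

bedrock = boto3.client(
    "bedrock-runtime",
    region_name="ap-northeast-1",
    config=adaptive_config,
)
```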
Key Takeaway
Exponential backoff without jitter in a multi-task fleet creates synchronized retry storms that amplify the original overload — always use decorrelated jitter and enforce a fleet-wide retry budget.
Scenario 2: Rate Limiter Blocking Legitimate Traffic During Organic Spike
Situation
Date: Wednesday, 6:00 PM JST
Trigger: Popular manga series "Kaiju No. 8" announced surprise new volume release on social media
At 5:58 PM, a viral tweet from the official Kaiju No. 8 account announced an exclusive limited-edition volume available only through MangaAssist. Within 90 seconds, organic traffic spiked from the baseline of ~800 rps to 4,200 rps — exceeding the global rate limit of 2,000 rps (burst 4,000).
Timeline:
5:58:00 PM Viral tweet posted (6.2M followers)
5:58:30 PM Traffic begins rising: 800 → 1,200 rps
5:59:00 PM Traffic: 1,800 rps (within limits)
5:59:30 PM Traffic: 3,200 rps (within burst, consuming tokens fast)
5:59:45 PM Burst tokens exhausted: 4,000 tokens consumed
5:59:46 PM Rate limiter begins returning 429 responses
5:59:46 PM Legitimate customers see: "You're sending messages too fast"
6:00:00 PM Traffic: 4,200 rps — 2,200 rps being throttled (52%)
6:00:30 PM Customer complaints spike on social media
6:01:00 PM Operations team alerted via CloudWatch alarm
6:02:00 PM Manual rate limit increase: 2,000 → 5,000 rps
6:02:15 PM Throttle rate drops to 0%
6:05:00 PM Traffic stabilizes at 3,800 rps
6:15:00 PM Traffic returns to 1,500 rps (above normal but manageable)
6:30:00 PM Normal operations
What Went Wrong
The rate limiter worked exactly as designed — it protected Bedrock from overload. The problem was that the rate limits were too conservative for organic traffic spikes:
- Static rate limits — The 2,000 rps limit was based on average load, not peak capacity
- No distinction between organic and abusive traffic — All requests treated equally
- No auto-scaling of rate limits — Required manual intervention to increase
- Per-user limits too aggressive — Returning "too fast" message to first-time visitors who sent just 1 message
- No priority queue — Premium customers throttled alongside anonymous browsers
The real problem: The error message "You're sending messages too fast" was misleading. These were first-time visitors asking about the new manga volume — they had sent exactly one message and were told they were too fast. The actual throttle was the global rate limit being hit, but the error message was designed for per-user throttling.
Detection
| Signal | Source | Alert | Time to Detect |
|---|---|---|---|
| 429 response rate spike | API Gateway metrics | Warning alarm at >1% 429 rate | 45 seconds |
| Global rate limit utilization >95% | Custom RateLimitUtilization metric | Info alarm at 80% | 30 seconds |
| Customer complaints | Social media monitoring | Manual detection | 2 minutes |
| Traffic spike | API Gateway RequestCount | Anomaly detection alarm | 1 minute |
What worked: The CloudWatch anomaly detection alarm caught the traffic spike quickly. What failed was the automatic response — there was no auto-scaling mechanism for rate limits.
Impact
| Metric | Normal | During Incident | After Fix |
|---|---|---|---|
| 429 error rate | 0.01% | 52% | 0% |
| Legitimate users throttled | ~0 | ~45,000 users (2.5 min window) | 0 |
| Customer complaints | ~2/hr | ~340 in 10 minutes | Subsided |
| Revenue impact | Normal | Est. $12,000 lost sales (limited edition) | Recovered partially |
| Social media sentiment | Positive | Negative (#MangaAssistDown trending) | Neutral |
| Bedrock utilization | 40% | Capped at rate limit | 75% |
Business impact: An estimated 45,000 users were blocked from purchasing the limited-edition Kaiju No. 8 volume during the first 2.5 minutes. The negative social media attention (#MangaAssistDown trending briefly) caused reputational damage. Estimated revenue loss: $12,000 in the first 10 minutes.
Response Runbook
RUNBOOK: Rate Limiter Blocking Legitimate Traffic
SEVERITY: P1 (Critical — revenue-impacting, customer-facing)
ON-CALL TEAM: MangaAssist Platform Engineering + Customer Success
STEP 1: CONFIRM ORGANIC SPIKE vs. DDoS (0-2 minutes)
├── Check WAF metrics: Is traffic from diverse IPs or concentrated?
│ └── Diverse IPs = organic | Few IPs = potential DDoS
├── Check geographic distribution: Mostly Japan? (Expected for manga)
├── Check request patterns: Normal chat queries or repetitive patterns?
├── Check social media: Is there a viral event driving traffic?
└── Decision: If organic → increase limits | If DDoS → keep limits, enable WAF rules
STEP 2: INCREASE RATE LIMITS (2-5 minutes)
├── Global rate limit:
│ └── aws apigateway update-usage-plan --usage-plan-id UP_ID \
│ --patch-operations op=replace,path=/throttle/rateLimit,value=5000
├── Burst capacity:
│ └── aws apigateway update-usage-plan --usage-plan-id UP_ID \
│ --patch-operations op=replace,path=/throttle/burstLimit,value=8000
├── Scale ECS tasks:
│ └── aws ecs update-service --desired-count 48 (double fleet)
└── Verify: Monitor 429 rate dropping to 0%
STEP 3: PROTECT BEDROCK (concurrent with Step 2)
├── Check Bedrock service quotas: Are we near account limits?
│ └── aws service-quotas get-service-quota --service-code bedrock ...
├── If near Bedrock limits:
│ ├── Route 50% of traffic to Haiku (cheaper, faster)
│ ├── Enable aggressive caching (extend TTLs temporarily)
│ └── Consider: Bedrock on-demand throughput increase request
└── Monitor: Bedrock throttle rate should remain below 5%
STEP 4: FIX USER-FACING MESSAGES (5-10 minutes)
├── Update throttle message for global rate limit:
│ └── FROM: "You're sending messages too fast"
│ └── TO: "We're experiencing high demand — you're in a queue!
│ Estimated wait: ~30 seconds. While you wait, browse
│ our Kaiju No. 8 collection directly."
├── Enable queue mode: Accept and queue requests rather than reject
└── Add banner to UI: "High demand for Kaiju No. 8 — please be patient"
STEP 5: POST-SPIKE NORMALIZATION (30-60 minutes)
├── Monitor traffic returning to baseline
├── Gradually reduce rate limits back to normal
├── Scale ECS tasks back to standard count
├── Review: Were any customers permanently lost?
└── Document: What threshold triggered the issue?
Resolution
- Implement dynamic rate limiting — Auto-scale rate limits based on traffic pattern analysis (organic vs. abusive)
- Add priority queueing — Premium users bypass throttle; registered users get priority over anonymous
- Fix error messages — Distinguish between "you personally are too fast" and "system is busy"
- Pre-event scaling — Integrate with marketing team's event calendar to pre-scale before known spikes
- Request queueing — Instead of rejecting with 429, queue requests and serve them as capacity becomes available
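The priority-queueing resolution above can be sketched with an in-process heap. The tier names and scores are assumptions; a production version would use a shared store (e.g. a Redis sorted set) so all ECS tasks drain the same queue:

```python
import heapq
import itertools

# Lower score = served first. Tier names are illustrative.
TIER_PRIORITY = {"premium": 0, "registered": 1, "anonymous": 2}

class PriorityRequestQueue:
    """Queue overflow requests instead of rejecting with 429, and serve
    them premium-first as capacity frees up."""
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()   # FIFO tie-break within a tier

    def enqueue(self, request_id, tier):
        priority = TIER_PRIORITY.get(tier, TIER_PRIORITY["anonymous"])
        heapq.heappush(self._heap, (priority, next(self._seq), request_id))

    def dequeue(self):
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

During a spike, the orchestrator enqueues requests it cannot serve immediately and dequeues as Bedrock capacity allows, so premium customers are never stuck behind anonymous browsers.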
Prevention
| Prevention Measure | Implementation | Priority |
|---|---|---|
| Dynamic rate limit auto-scaling | Lambda function that adjusts limits based on CloudWatch Anomaly Detection | P0 |
| Priority queue for premium users | Redis-based priority queue in ECS orchestrator | P0 |
| Marketing integration | Webhook from marketing calendar to auto-scale 2 hours before events | P1 |
| Request queueing (SQS) | Queue overflow requests instead of rejecting | P1 |
| Differentiated error messages | Context-aware throttle messages (global vs. per-user) | P1 |
| Capacity planning review | Monthly review of rate limits vs. traffic growth | P2 |
Exam Angle
What the AIP-C01 exam tests here:
- Understanding of API Gateway throttling mechanisms (account, stage, method level)
- Usage plans and their role in rate limiting different user tiers
- The relationship between rate limiting and user experience
- How to design for traffic spikes in FM-based applications
- Token bucket algorithm behavior (steady rate vs. burst capacity)
Key concept: API Gateway's token bucket algorithm allows bursts up to the burst limit, but once burst tokens are consumed, traffic is limited to the steady-state rate. For applications with predictable spikes (like content releases), pre-scaling the burst capacity is essential.
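A minimal token bucket matching this description — tokens refill at the steady rate up to the burst limit, and each request consumes one:

```python
class TokenBucket:
    """Sketch of API Gateway-style throttling: `rate` tokens/second refill,
    capped at `burst`; a request is allowed only if a token is available."""
    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = burst   # bucket starts full
        self.last = 0.0

    def allow(self, now):
        # Refill based on elapsed time, never exceeding the burst capacity.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False          # -> 429 in the API Gateway analogy
```

This makes the Scenario 2 failure mode concrete: traffic above the steady rate is fine only while burst tokens last; once the bucket is empty, sustained traffic is clamped to the refill rate and everything beyond it is rejected.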
Key Takeaway
Rate limiters protect the system but must distinguish between organic traffic spikes and abuse — a rate limiter that blocks legitimate customers during a viral moment causes more damage than the overload it prevents.
Scenario 3: Fallback to Cached Response Serving Stale Manga Prices
Situation
Date: Monday, 10:15 AM JST
Trigger: Weekend sale ended at midnight but cache entries with sale prices had 4-hour TTL
MangaAssist ran a weekend flash sale on popular manga series (30% off). The sale ended at midnight Sunday. However, the response cache in Redis still contained responses generated during the sale period, complete with discounted prices. When Bedrock experienced a brief 3-minute throttling event Monday morning, the fallback system served these cached responses to customers.
Timeline:
Sat 9:00 AM Sale begins — all prices show 30% discount
Sat-Sun Cache populated with sale-price responses (TTL: 4 hours)
Sun 11:59 PM Sale ends — DynamoDB prices updated to normal
Mon 12:00 AM Cache contains: mix of stale (sale) and fresh (normal) prices
Mon 3:00 AM Most sale-price cache entries expired (4hr TTL)
Mon 3:30 AM Cache warm-up job runs — repopulates with correct prices
Mon 10:15 AM Bedrock throttling event begins (3 minutes)
Mon 10:15 AM Fallback cascade: Sonnet fail → Haiku fail → Cache
Mon 10:15 AM PROBLEM: Some cache entries were refreshed at 9:00 AM
during the sale's "lingering" in one region, containing
sale prices that were manually re-added in error
Mon 10:15 AM ~180 users receive responses with incorrect (sale) prices
Mon 10:18 AM Bedrock recovers — fresh responses resume
Mon 10:45 AM Customer support reports: "Users saying prices differ
from what chatbot told them"
Mon 11:00 AM Investigation confirms stale cache served sale prices
Mon 11:15 AM Cache manually flushed for all pricing-related entries
What Went Wrong
Multiple failures compounded:
- Cache TTL too long for pricing data — 4 hours is appropriate for recommendations but not for price-sensitive content
- No cache invalidation on price change — When the sale ended and DynamoDB prices were updated, the Redis cache was not invalidated
- No content-type awareness in cache — All responses cached with the same TTL regardless of whether they contained pricing data
- Manual re-entry error — An operations team member accidentally re-triggered the sale in one region, causing fresh cache entries with sale prices
- No staleness warning for pricing — When serving cached price data, no disclaimer was shown to users
Cache Entry Problem:
User asks: "How much is One Piece Volume 105?"
Fresh response (correct):
"One Piece Volume 105 is ¥528 (regular price)."
Stale cached response (incorrect):
"One Piece Volume 105 is ¥369 (30% off — flash sale!)."
← This response was cached during the weekend sale
← Served during the Monday Bedrock throttle event
Detection
| Signal | Source | Alert | Time to Detect |
|---|---|---|---|
| Fallback to cache tier | Custom TierCount metric | Info alarm at >10% cache serves | 1 minute |
| Stale cache serves | Custom StaleCacheServes metric | Not configured for pricing category | Not detected |
| Customer complaints | Support ticket system | Manual detection | 30 minutes |
| Price discrepancy | No automated check | None | Manual investigation |
Detection gap: There was no automated check comparing cached prices against the source of truth (DynamoDB). The staleness metric existed but was not configured to alarm specifically on pricing-category cache entries.
Impact
| Metric | Normal | During Incident |
|---|---|---|
| Users served stale prices | 0 | ~180 users |
| Incorrect price quotes | 0 | ~85 unique products |
| Customer support tickets | ~5/hr | ~45 in 2 hours |
| Potential revenue loss | $0 | ~$2,300 (if honored) |
| Trust impact | Baseline | Moderate (price credibility) |
Business impact: 180 customers were quoted sale prices that no longer applied. The legal and customer experience teams decided to honor the incorrect prices for affected customers, costing ~$2,300. The incident also raised questions about the chatbot's reliability for transactional queries.
Response Runbook
RUNBOOK: Stale Cache Serving Incorrect Data
SEVERITY: P2 (High — financial impact, customer trust)
ON-CALL TEAM: MangaAssist Platform Engineering + Customer Success
STEP 1: CONFIRM STALE DATA (0-5 minutes)
├── Identify affected cache entries:
│   └── redis-cli --scan --pattern "manga:cache:*" | head -20 (SCAN, not KEYS, to avoid blocking Redis)
│ └── Check created_at timestamps — are any from before the sale ended?
├── Compare cached prices to DynamoDB source of truth:
│ └── For each cached product_id, query DynamoDB for current price
│ └── Flag entries where cached price != DynamoDB price
├── Estimate scope: How many cache entries are stale?
└── Check: Is the issue ongoing or has the Bedrock throttle resolved?
STEP 2: FLUSH AFFECTED CACHE ENTRIES (5-10 minutes)
├── Option A: Flush ALL pricing-category cache entries
│ └── redis-cli EVAL "local keys = redis.call('keys', 'manga:cache:*')
│ for i, key in ipairs(keys) do
│ local entry = redis.call('get', key)
│ if string.find(entry, '\"category\":\"pricing\"') then
│ redis.call('del', key)
│ end
│ end
│ return 'done'" 0
├── Option B: Flush ALL cache entries (nuclear option)
│ └── redis-cli FLUSHDB
│ └── WARNING: This increases Bedrock load temporarily
└── Verify: Spot-check that pricing queries now return fresh Bedrock responses
STEP 3: IDENTIFY AFFECTED CUSTOMERS (10-30 minutes)
├── Query CloudWatch Logs for cache-served pricing responses:
│ └── Filter: tier="cached" AND category="pricing" AND
│ timestamp BETWEEN "2024-XX-XX 10:15" AND "2024-XX-XX 10:18"
├── Extract user_ids from affected traces
├── For each user: What price were they quoted? What is the correct price?
├── Generate report: user_id, product, quoted_price, actual_price, delta
└── Send report to Customer Success team for remediation
STEP 4: CUSTOMER REMEDIATION (same day)
├── Decision: Honor incorrect prices or issue apology + coupon?
│ └── If delta < $10 per customer: Honor the quoted price
│ └── If delta > $10: Contact customer with apology + 15% coupon
├── Send proactive email to affected customers:
│ └── "We noticed a brief pricing display issue on [date]..."
└── Update FAQ: Add note about the incident for customer support agents
STEP 5: PREVENT RECURRENCE (within 1 week)
├── Implement content-aware TTLs (pricing: 15 min max)
├── Add event-driven cache invalidation on DynamoDB price updates
├── Add staleness warning for cached pricing responses
└── Add automated price-accuracy check for cached entries
Resolution
- Content-aware TTLs — Pricing: 15 minutes, Availability: 5 minutes, Recommendations: 4 hours, FAQ: 7 days
- DynamoDB Streams cache invalidation — When prices update in DynamoDB, a Lambda function invalidates all cache entries referencing those product IDs
- Staleness disclaimer — When serving cached pricing data, append: "Prices shown may not reflect the latest updates. Please verify at checkout."
- Cache validation job — Hourly Lambda that samples 100 cached pricing entries and validates against DynamoDB
- Sale end cache flush — Automated cache flush for all pricing entries when a sale period ends
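The content-aware TTLs above can be sketched as a category-to-TTL map consulted at cache-write time. The category names, helper, and key layout are assumptions rather than MangaAssist's actual schema:

```python
# TTLs from the resolution above, in seconds.
CATEGORY_TTL = {
    "pricing": 15 * 60,          # 15 minutes
    "availability": 5 * 60,      # 5 minutes
    "recommendation": 4 * 3600,  # 4 hours
    "faq": 7 * 86400,            # 7 days
}
DEFAULT_TTL = 3600               # fallback for uncategorized responses

def cache_ttl_for(category):
    """Pick a TTL based on the response's content category."""
    return CATEGORY_TTL.get(category, DEFAULT_TTL)

def cache_response(redis_client, key, payload, category):
    # SETEX stores the value with a per-category expiry, so stale prices
    # age out in minutes while FAQs persist for days.
    redis_client.setex(key, cache_ttl_for(category), payload)
```

The categorization itself (is this response about pricing?) would come from the orchestrator's intent classification, which already routes queries to product lookups.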
Prevention
| Prevention Measure | Implementation | Priority |
|---|---|---|
| Content-aware TTLs | Modify cache layer to use CATEGORY_TTL map | P0 |
| DynamoDB Streams invalidation | Lambda triggered by DynamoDB Streams on price changes | P0 |
| Staleness disclaimer on prices | Append warning when serving cached pricing data | P0 |
| Sale-end cache flush automation | EventBridge rule triggers cache flush at sale end time | P1 |
| Hourly price accuracy validation | Lambda samples and validates cached pricing entries | P1 |
| Cache entry product_id tracking | Store product_ids in cache entry metadata for targeted invalidation | P2 |
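The Streams-triggered invalidation can be sketched as a Lambda handler like the following. The cache key pattern, attribute names, and price field are assumptions about the schema; in production the Redis connection would be created outside the handler and reused across invocations:

```python
def invalidate_on_price_change(event, redis_client):
    """Delete pricing cache entries for every product whose price changed.

    `event` is a DynamoDB Streams event: MODIFY records carry OldImage
    and NewImage attribute maps in DynamoDB's typed JSON format.
    Returns the list of invalidated product_ids.
    """
    invalidated = []
    for record in event.get("Records", []):
        if record.get("eventName") != "MODIFY":
            continue
        ddb = record["dynamodb"]
        old_price = ddb.get("OldImage", {}).get("price", {}).get("N")
        new_price = ddb.get("NewImage", {}).get("price", {}).get("N")
        if old_price == new_price:
            continue  # some other attribute changed; cache is still valid
        product_id = ddb["Keys"]["product_id"]["S"]
        key = "manga:cache:pricing:%s" % product_id  # hypothetical key pattern
        redis_client.delete(key)
        invalidated.append(product_id)
    return invalidated
```

This is the event-driven alternative to TTL expiry: the moment DynamoDB records a new price, the stale cached answer is gone, instead of lingering up to a full TTL window.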
Exam Angle
What the AIP-C01 exam tests here:
- Understanding of caching strategies and their trade-offs in FM applications
- The concept of graceful degradation and when cached responses are appropriate vs. dangerous
- Event-driven architectures (DynamoDB Streams) for cache invalidation
- The importance of content-type awareness in FM response caching
Key concept: Not all FM responses are equal. A cached recommendation from 4 hours ago is still useful, but a cached price from 4 hours ago could be financially harmful. Content-aware caching ensures that time-sensitive data (prices, availability) has much shorter TTLs than stable data (FAQs, general information).
Key Takeaway
Cache-based fallback must be content-aware — a stale manga recommendation is harmless, but a stale price can cost real money and erode customer trust.
Scenario 4: X-Ray Sampling Missing Critical Error Traces
Situation
Date: Tuesday, 2:45 PM JST
Trigger: Bedrock introduced a new error response format that the MangaAssist error handler did not recognize
At 2:45 PM, Bedrock began returning a new error format for a specific edge case (malformed multi-modal content in the request). The MangaAssist error handler treated this as a generic ClientError rather than a retryable ValidationException, causing it to fail fast without retry and return a raw error message to users.
The problem: X-Ray was configured with a 5% sampling rate. The error affected only 0.3% of requests (those containing certain Unicode characters in manga title searches). Sampling 5% of that 0.3% slice meant captured error traces amounted to just 0.015% of total traffic — roughly six traces per hour, insufficient to identify the pattern.
Timeline:
2:45 PM New Bedrock error response begins appearing
2:45 PM Error rate increases from 0.1% to 0.4%
2:45 PM X-Ray captures 5% of all traces (including errors)
2:45 PM    Of the 0.3% error traffic, only 5% captured in X-Ray (0.015% of all requests)
3:00 PM CloudWatch alarm triggers: ErrorRate > 0.3%
3:05 PM On-call engineer checks X-Ray: finds 2 error traces
3:05 PM Engineer: "Looks like a rare transient — only 2 traces"
3:05 PM Engineer dismisses as transient (wrong conclusion)
3:30 PM Error rate persists at 0.4%
4:00 PM More customer complaints: "Error when searching manga with kanji"
4:15 PM Second investigation: CloudWatch Logs searched directly
4:15 PM FOUND: 4,300 error events in 1.5 hours (not 2!)
4:20 PM Root cause identified: New Bedrock error format for Unicode
4:30 PM Fix deployed: Updated error handler to recognize new format
4:35 PM Error rate returns to baseline
What Went Wrong
- Uniform sampling rate — 5% sampling applied equally to successful and failed requests
- Low-frequency errors undersampled — 0.3% error rate x 5% sampling = almost invisible in X-Ray
- No error-specific sampling rule — X-Ray should have captured 100% of errors
- Wrong investigation tool — Engineer relied on X-Ray (sampled) instead of CloudWatch Logs (complete)
- Misleading X-Ray data — Seeing "only 2 error traces" led to incorrect dismissal
The Sampling Math Problem:
Total traffic: 11.6 rps (1M messages/day)
Error rate: 0.3% = 0.035 errors/second = ~125 errors/hour
X-Ray sampling: 5%
Captured errors: 0.05 x 125 = ~6 error traces/hour
But: X-Ray groups samples by trace, not by error.
The engineer saw only 2 error traces in a 15-minute window.
Conclusion: "This is rare and transient" (WRONG)
Reality: 125 errors/hour x 1.5 hours = ~188 errors before detection
With 100% error sampling:
The engineer would have seen 31 error traces in the same 15-minute window.
Conclusion: "This is a consistent pattern affecting Unicode queries" (CORRECT)
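The sampling arithmetic above, reproduced as a quick calculation:

```python
# Traffic and sampling figures from the scenario.
total_rps = 1_000_000 / 86_400               # 1M messages/day ~= 11.6 rps
errors_per_hour = total_rps * 0.003 * 3600   # 0.3% error rate -> ~125/hour

captured_per_hour = errors_per_hour * 0.05   # 5% sampling -> ~6 traces/hour
captured_15min = captured_per_hour / 4       # ~1.6 traces per 15-minute window

full_capture_15min = errors_per_hour / 4     # 100% error sampling -> ~31 traces
```

The 15-minute numbers are the heart of the misdiagnosis: ~1-2 sampled traces look like a rare transient, while the ~31 traces a 100% error-sampling rule would have captured look like exactly what they were — a consistent pattern.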
Detection
| Signal | Source | Alert | Time to Detect |
|---|---|---|---|
| Error rate increase | CloudWatch ErrorCount | Warning at >0.3% | 15 minutes |
| X-Ray error traces | X-Ray Groups (MangaAssist-Errors) | Too few samples to trigger insight | Failed |
| Customer complaints | Support tickets | Manual | 1.5 hours |
| CloudWatch Logs errors | Log Insights query | Manual investigation | 1.5 hours |
Detection gap: X-Ray was configured as the primary error investigation tool, but its sampling rate was too low to capture enough error traces for pattern analysis. CloudWatch Logs had complete data but was not the first tool checked.
Impact
| Metric | Normal | During Incident |
|---|---|---|
| Error rate | 0.1% | 0.4% (+0.3%) |
| Affected queries | 0 | ~4,300 over 1.75 hours |
| Time to detect | N/A | 15 minutes (alarm), 1.5 hours (root cause) |
| Time to resolve | N/A | 1 hour 45 minutes total |
| Users affected | 0 | ~2,800 unique users |
| Queries with raw error | 0 | ~4,300 (users saw error JSON) |
Business impact: 2,800 users who searched for manga using Japanese kanji characters received raw error messages instead of helpful responses. The 1.5-hour delay in root cause identification (due to misleading X-Ray data) extended the impact window unnecessarily.
Response Runbook
RUNBOOK: X-Ray Sampling Missing Critical Errors
SEVERITY: P2 (High — errors reaching users, investigation delayed)
ON-CALL TEAM: MangaAssist Platform Engineering
STEP 1: DO NOT RELY SOLELY ON X-RAY FOR ERROR INVESTIGATION
├── ALWAYS cross-reference X-Ray with CloudWatch Logs
├── CloudWatch Logs: Complete record (no sampling)
│ └── Log Insights query:
│ fields @timestamp, error_code, error_message, user_query
│ | filter level = "ERROR"
│ | stats count(*) by error_code
│ | sort count desc
├── X-Ray: Use for distributed trace context AFTER identifying the pattern
└── Rule: If X-Ray shows few errors but CloudWatch alarm fired,
trust CloudWatch — X-Ray may be undersampled
STEP 2: IDENTIFY ERROR PATTERN (0-10 minutes)
├── Run Log Insights query for error distribution:
│ └── fields @timestamp, error_code, @message
│ | filter level = "ERROR"
│ | stats count(*) as error_count by bin(5m)
│ | sort @timestamp desc
├── Look for: New error codes, changed error formats, specific patterns
├── Correlate with user input: Is there a common pattern in failing queries?
└── Check: Did AWS release any changes to Bedrock API recently?
STEP 3: INCREASE X-RAY ERROR SAMPLING (immediate)
├── Create high-priority error sampling rule:
│ └── aws xray create-sampling-rule --cli-input-json '{
│ "SamplingRule": {
│ "RuleName": "MangaAssist-AllErrors",
│ "Priority": 1,
│ "FixedRate": 1.0,
│ "ReservoirSize": 100,
│ "ServiceName": "MangaAssist",
│ "ServiceType": "*",
│ "Host": "*",
│ "HTTPMethod": "*",
│ "URLPath": "*",
│ "ResourceARN": "*",
│ "Version": 1
│ }
│ }'
└── This captures 100% of traces going forward (useful for ongoing investigation)
STEP 4: FIX THE ERROR (once root cause identified)
├── Update error handler to recognize new error format
├── Add the new error code to the retryable set (if retryable)
├── Deploy fix to ECS tasks (rolling update, no downtime)
└── Verify: Error rate returns to baseline
STEP 5: REVERT X-RAY SAMPLING (after incident resolved)
├── Remove the temporary 100% sampling rule
├── Ensure permanent error sampling rule exists:
│ └── Priority: 50, FixedRate: 1.0, ReservoirSize: 10
│   └── Scope: the error-prone request paths — note that X-Ray sampling
│       decisions are made when a request starts, so guaranteed error
│       capture also needs application-side forced sampling once a
│       handler detects an error
└── Verify: Normal sampling rate for success, 100% for errors
Resolution
- Separate error sampling rule — Create a permanent X-Ray sampling rule that captures 100% of error traces (Priority 50, FixedRate 1.0)
- Error-first investigation protocol — Update runbook: always check CloudWatch Logs first, use X-Ray for trace context second
- Error format handler update — Add defensive parsing for new Bedrock error response formats
- Anomaly detection on error patterns — CloudWatch anomaly detection on error codes, not just error rate
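The defensive-parsing item above can be sketched as follows. This is a minimal sketch, not the production handler: the key locations probed and the `ServiceUnavailableException` placeholder (beyond the scenario's `ThrottlingException`) are illustrative assumptions.

```python
import json

# Codes we treat as retryable. ThrottlingException comes from the
# scenario; ServiceUnavailableException is an illustrative placeholder.
RETRYABLE_CODES = {"ThrottlingException", "ServiceUnavailableException"}

def classify_error(raw_body: str) -> tuple:
    """Return (error_code, retryable), tolerating unknown formats."""
    try:
        body = json.loads(raw_body)
    except (json.JSONDecodeError, TypeError):
        return ("UnparseableError", False)
    if not isinstance(body, dict):
        return ("UnparseableError", False)
    # The code may appear under different keys across response formats;
    # probe each known location before falling back to a safe default.
    err = body.get("error")
    code = (body.get("__type")
            or body.get("code")
            or (err.get("code") if isinstance(err, dict) else None)
            or "UnknownError")
    code = str(code).split("#")[-1]  # strip namespace prefixes like "com.amazonaws...#"
    return (code, code in RETRYABLE_CODES)
```

An unknown format classifies as non-retryable rather than crashing the handler, which is the failure mode this scenario's fix targets.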
Prevention
| Prevention Measure | Implementation | Priority |
|---|---|---|
| 100% error sampling rule (permanent) | X-Ray sampling rule: Priority 50, Rate 1.0 for error traces | P0 |
| CloudWatch Logs-first investigation | Update on-call runbook and training | P0 |
| Defensive error parsing | Try/catch with fallback for unknown Bedrock error formats | P1 |
| Error code anomaly detection | CloudWatch anomaly detection per error_code dimension | P1 |
| Weekly Bedrock API change review | Subscribe to AWS service announcements, review weekly | P2 |
| X-Ray sampling adequacy test | Monthly: Inject known errors, verify X-Ray captures them | P2 |
Exam Angle
What the AIP-C01 exam tests here:
- Understanding of X-Ray sampling strategies and their trade-offs
- The relationship between X-Ray sampling rate and error visibility
- When to use X-Ray vs. CloudWatch Logs for investigation
- X-Ray filter expressions and groups for isolating errors
- The importance of sampling rules that prioritize errors over successful requests
Key concept: X-Ray sampling rules have priorities. A rule with Priority 1 and FixedRate 1.0 for error paths ensures 100% capture of errors regardless of the default sampling rate. The reservoir ensures at least N traces per second are captured even if the fixed rate would capture fewer.
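The reservoir-plus-fixed-rate semantics can be illustrated with a toy simulation. This models the documented behavior only, not the actual X-Ray SDK internals:

```python
import random

class SamplingRule:
    """Toy model of X-Ray sampling: take up to `reservoir` traces per
    second unconditionally, then sample the remainder at `fixed_rate`."""
    def __init__(self, reservoir: int, fixed_rate: float):
        self.reservoir = reservoir
        self.fixed_rate = fixed_rate
        self._second = -1
        self._taken = 0

    def should_sample(self, now_s: int) -> bool:
        if now_s != self._second:          # new second: refill the reservoir
            self._second, self._taken = now_s, 0
        if self._taken < self.reservoir:   # guaranteed quota first
            self._taken += 1
            return True
        return random.random() < self.fixed_rate  # probabilistic tail
```

A rule like the permanent error rule (ReservoirSize 10, FixedRate 1.0) samples everything on its path, while a default 5% rule misses most low-volume errors.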
Key Takeaway
X-Ray sampling at 5% is cost-effective for normal traffic but blind to low-frequency errors — always create a separate high-priority sampling rule that captures 100% of error traces.
Scenario 5: Graceful Degradation Message Confusing Users About Service Status
Situation
Date: Friday, 8:30 PM JST (peak evening traffic) Trigger: Complete Bedrock regional outage in ap-northeast-1 lasting 12 minutes
At 8:30 PM, Bedrock experienced a complete regional outage. All model invocations failed. The MangaAssist fallback cascade executed correctly: Sonnet failed, Haiku failed, cache served where possible, and for cache misses, the graceful degradation message was displayed.
The problem was not technical — the fallback worked. The problem was the user experience of the graceful degradation message.
Graceful Degradation Message (version 1 — problematic):
"I'm having a bit of trouble right now, but I'm still here!
Could you try asking again in a moment? In the meantime,
you can browse our manga catalog directly."
Problems with this message:
1. "Try again in a moment" — Users spam retry, increasing load
2. "I'm still here" — Implies the bot is working, just slow
3. No estimated recovery time — Users don't know if it's 1 min or 1 hour
4. "Browse our manga catalog directly" — Link not provided
5. No acknowledgment that there IS a problem
6. Same message for every query — "Are you ignoring my specific question?"
Timeline:
8:30:00 PM Bedrock outage begins
8:30:05 PM Circuit breakers open across ECS fleet
8:30:05 PM Cache serves 60% of queries (cache hit)
8:30:05 PM 40% of queries receive graceful degradation message
8:30:30 PM Users retry repeatedly (message says "try again in a moment")
8:31:00 PM Retry traffic: +40% above normal (users following the message)
8:31:00 PM Social media: "MangaAssist is broken but pretending it's fine"
8:32:00 PM Support tickets: "The bot keeps saying try again but nothing works"
8:33:00 PM Some users interpret "I'm still here" as the chatbot being sentient
8:35:00 PM Operations team adds maintenance banner to website
8:37:00 PM Updated degradation message deployed (version 2)
8:42:00 PM Bedrock recovers
8:42:30 PM Normal service resumes
8:45:00 PM Post-incident: 2,100 support tickets about "confusing bot messages"
What Went Wrong
The graceful degradation message was designed by engineers focused on the technical implementation, not on user experience. Multiple UX failures:
- "Try again in a moment" — Encouraged retry storms. Users interpreted "a moment" as "right now" and hammered the retry button. This increased load by 40% during the outage, worsening the situation.
- "I'm still here" — Created false expectation. Users thought the bot was working but just slow, so they kept waiting and retrying rather than seeking alternatives.
- No context acknowledgment — The message was generic. A user asking "Where is my order?" received the same message as one asking "Recommend a manga." Users felt their specific question was being ignored.
- No service status information — Users could not tell if this was a 30-second hiccup or a major outage. Without this information, they could not make informed decisions (wait vs. call support vs. use the website directly).
- No action links — "Browse our manga catalog directly" provided no clickable link. Users would have to navigate away from the chat manually.
- Uncanny valley — The first-person "I'm still here" language made some users uncomfortable, as it sounded like the AI was claiming sentience or emotional state during what was clearly a system failure.
User Experience Comparison:
BAD (version 1):
┌─────────────────────────────────────────────────┐
│ User: Where is my order #12345? │
│ │
│ Bot: I'm having a bit of trouble right now, │
│ but I'm still here! Could you try asking again │
│ in a moment? In the meantime, you can browse │
│ our manga catalog directly. │
│ │
│ User: I asked about my ORDER, not the catalog! │
│ │
│ Bot: I'm having a bit of trouble right now... │
│ [same message repeated] │
└─────────────────────────────────────────────────┘
GOOD (version 2):
┌─────────────────────────────────────────────────┐
│ User: Where is my order #12345? │
│ │
│ Bot: I've received your question about order │
│ #12345. Our systems are currently experiencing │
│ temporary difficulties and I'm unable to look │
│ up order details right now. │
│ │
│ Here's what you can do: │
│ • Track your order: mangaassist.jp/orders/12345 │
│ • Contact support: support@mangaassist.jp │
│ • Call us: 0120-XXX-XXXX (9AM-9PM JST) │
│ │
│ I'll be fully back within a few minutes and │
│ will be happy to help then. Our team is │
│ actively working on this. │
│ │
│ [Service Status: Temporary disruption] │
│ [Auto-retry in: 2 minutes] │
└─────────────────────────────────────────────────┘
Detection
| Signal | Source | Alert | Time to Detect |
|---|---|---|---|
| Graceful degradation tier spike | Custom TierCount.Graceful metric | Warning alarm (>5% graceful serves) | 30 seconds |
| User retry spike | API Gateway RequestCount anomaly | Anomaly detection alarm | 2 minutes |
| Support ticket spike | Zendesk ticket count | Manual detection | 5 minutes |
| Social media complaints | Social monitoring tool | Manual detection | 5 minutes |
| User satisfaction drop | Post-chat survey scores | Delayed (batched daily) | Next day |
What was detected quickly: The Bedrock outage and fallback tier shift were detected within 30 seconds. What was not detected: The UX impact of the degradation message. There was no metric for "user confusion" or "degradation message effectiveness."
Impact
| Metric | Normal | During Incident |
|---|---|---|
| Users receiving degradation message | <0.1% | 40% of queries |
| User retry rate | 5% | 45% (9x increase) |
| Support tickets (12 minutes) | ~2 | 2,100 |
| Social media complaints | ~1/hr | 340 in 30 minutes |
| User satisfaction score | 4.2/5 | 1.8/5 (for degraded users) |
| Users who left chat permanently | 0.5% | 8% |
| Additional load from retries | Baseline | +40% above normal |
Business impact: The degradation message itself caused more damage than the outage. The retry storms increased infrastructure load by 40%, the support ticket flood overwhelmed the team (2,100 tickets in 12 minutes), and 8% of users who received the message never returned to the chatbot. The misleading "try again" language directly caused the retry storm.
Response Runbook
RUNBOOK: Graceful Degradation UX Issues
SEVERITY: P2 (High — customer experience, not technical failure)
ON-CALL TEAM: MangaAssist Platform Engineering + UX Team
STEP 1: DEPLOY IMPROVED DEGRADATION MESSAGE (0-5 minutes)
├── Switch to context-aware degradation message:
│ └── Update ECS environment variable: DEGRADATION_MESSAGE_VERSION=v2
│ └── V2 messages include:
│ - Acknowledgment of the user's specific question type
│ - Explicit service status ("Our AI assistant is temporarily unavailable")
│ - Actionable alternatives with links
│ - Estimated recovery time (if known)
│ - NO "try again" language (prevents retry storms)
├── Add service status banner to chat UI:
│ └── "Service Notice: AI assistant is in limited mode.
│ You can still browse products and track orders."
└── Disable auto-retry in client WebSocket (prevents automated retry loops)
STEP 2: REDUCE RETRY STORM (concurrent with Step 1)
├── Client-side: Update WebSocket handler to not auto-retry on degradation
├── Server-side: Add rate limit specifically for repeated identical queries
│ └── Same user, same query hash within 60s → return cached degradation
│ └── Do not re-process through the full fallback cascade
└── Monitor: Watch retry rate metric for reduction
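The repeated-query guard in Step 2 can be sketched in memory as below; in production the same check would run against ElastiCache Redis (e.g., `SET key 1 NX EX 60`), and the key scheme here is an assumption.

```python
import hashlib
import time

_seen = {}            # "{user}:{query-hash}" -> expiry timestamp
DEDUP_WINDOW_S = 60   # matches the 60-second window in the runbook step

def is_duplicate(user_id: str, query: str, now=None) -> bool:
    """True if this user sent the same query within the window, meaning
    the cached degradation response can be returned without re-running
    the full fallback cascade."""
    now = time.time() if now is None else now
    key = f"{user_id}:{hashlib.sha256(query.encode()).hexdigest()}"
    expiry = _seen.get(key)
    if expiry is not None and expiry > now:
        return True
    _seen[key] = now + DEDUP_WINDOW_S
    return False
```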
STEP 3: CUSTOMER COMMUNICATION (5-15 minutes)
├── Post service status on mangaassist.jp/status
├── Tweet from @MangaAssist: "We're aware of temporary AI assistant issues.
│ You can still shop and track orders normally. We'll be fully back soon!"
├── Update in-app banner with current status
└── Notify customer support team with pre-written response templates
STEP 4: EVALUATE MESSAGE EFFECTIVENESS (post-incident)
├── Survey users who received the degradation message
├── Analyze: Did users understand what was happening?
├── Analyze: Did users find the alternative actions helpful?
├── Compare retry rates between v1 and v2 messages
└── A/B test improved message variants for future incidents
Resolution
- Context-aware degradation messages — Different messages for order queries, recommendations, general questions, and pricing queries. Each acknowledges the user's intent and provides relevant alternatives.
- Service status integration — The degradation message includes a real-time service status indicator and estimated recovery time (based on historical outage durations).
- Anti-retry design — Replace "try again" with "I'll notify you when I'm back" (if the WebSocket is open) or a specific time ("check back in 5 minutes").
- Actionable alternatives with links — Every degradation message includes clickable deep links to the specific feature the user was asking about (order tracking, catalog, support).
- Degradation message A/B testing — Monthly chaos testing in staging where different degradation message variants are tested with internal users. Measure comprehension, satisfaction, and retry behavior.
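The context-aware selection described above can be sketched as a small intent-to-message mapping. The keyword-based intent detection, wording, and URL paths are illustrative stand-ins for the production intent classifier and templates:

```python
# Illustrative mapping from coarse user intent to a degradation message
# that acknowledges the question and offers a working alternative.
FALLBACK_ACTIONS = {
    "order":   "Track your order: mangaassist.jp/orders",
    "browse":  "Browse the catalog: mangaassist.jp",
    "support": "Contact support: support@mangaassist.jp",
}

def degradation_message(query: str) -> str:
    lowered = query.lower()
    if "order" in lowered:
        intent, action = "your order", FALLBACK_ACTIONS["order"]
    elif any(w in lowered for w in ("recommend", "suggest")):
        intent, action = "a recommendation", FALLBACK_ACTIONS["browse"]
    else:
        intent, action = "your question", FALLBACK_ACTIONS["support"]
    # Deliberately no "try again" phrasing: that wording caused the retry storm.
    return (
        f"I've received your question about {intent}. Our systems are "
        f"temporarily unavailable and I can't answer it right now.\n"
        f"Here's what you can do: {action}\n"
        f"I'll notify you here as soon as I'm back."
    )
```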
Prevention
| Prevention Measure | Implementation | Priority |
|---|---|---|
| Context-aware degradation messages | Extract user intent before returning degradation message | P0 |
| Anti-retry language | Remove all "try again" phrasing, add auto-notify | P0 |
| Actionable deep links | Include relevant product/order/support URLs in message | P0 |
| Service status page | Real-time status at mangaassist.jp/status | P1 |
| Degradation message A/B testing | Monthly chaos test with UX evaluation | P1 |
| Client-side retry suppression | WebSocket handler: exponential backoff on degradation | P1 |
| Degradation UX metrics | Track "message comprehension" via post-chat survey | P2 |
| UX review of all system messages | Quarterly UX audit of all bot error/fallback messages | P2 |
Exam Angle
What the AIP-C01 exam tests here:
- Understanding of graceful degradation as a user experience concern, not just a technical one
- The relationship between degradation messaging and system load (retry storms)
- Designing fallback systems that maintain user trust during outages
- The importance of actionable alternatives in degradation responses
- How poorly designed resilience can amplify the impact of an outage
Key concept: Graceful degradation is not just about keeping the system running — it is about keeping the user informed and productive. A degradation message that causes a retry storm is worse than showing an error page, because it actively makes the outage worse while simultaneously frustrating users.
Key Takeaway
Graceful degradation messages must inform, not mislead — "try again in a moment" causes retry storms, while "here is what you can do instead" preserves user trust and reduces system load.
Cross-Scenario Summary
Common Patterns Across All 5 Scenarios
| Pattern | Scenarios | Lesson |
|---|---|---|
| Cascading failure | 1, 2 | A resilience mechanism that triggers another failure mode (retries causing storms, rate limiting causing retry storms) |
| Detection gap | 1, 3, 4 | Metrics existed but alarms were not configured for the specific failure pattern |
| Investigation misdirection | 4, 5 | Using the wrong tool or metric for investigation delays root cause identification |
| UX as resilience | 2, 5 | User-facing messages are part of the resilience architecture — bad messages amplify impact |
| Content-awareness | 3 | Not all data is equal — pricing data needs different handling than recommendation data |
Resilience Anti-Patterns Identified
Anti-Pattern 1: Exponential Backoff Without Jitter
Scenario: 1
Effect: Thundering herd across fleet
Fix: Decorrelated jitter + fleet retry budget
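The decorrelated-jitter fix can be sketched as below. The cap constant is illustrative; the 100 ms base matches the scenario's first-retry delay.

```python
import random

BASE_MS = 100    # first-retry delay from the scenario timeline
CAP_MS = 20_000  # illustrative ceiling so delays don't grow unbounded

def decorrelated_jitter(previous_ms: float) -> float:
    """Next retry delay: uniform in [base, 3 * previous], capped.
    Each task draws independently, so a fleet that fails at the same
    instant spreads its retries instead of thundering in lockstep."""
    return min(CAP_MS, random.uniform(BASE_MS, previous_ms * 3))

# Typical use: delay = BASE_MS, then delay = decorrelated_jitter(delay)
# after each ThrottlingException.
```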
Anti-Pattern 2: Static Rate Limits Without Auto-Scale
Scenario: 2
Effect: Blocks legitimate organic traffic
Fix: Dynamic limits with anomaly detection
Anti-Pattern 3: Uniform Cache TTL for All Content Types
Scenario: 3
Effect: Stale prices served from cache
Fix: Content-aware TTLs + event-driven invalidation
Anti-Pattern 4: Uniform X-Ray Sampling for All Traffic
Scenario: 4
Effect: Errors invisible in traces
Fix: 100% error sampling rule
Anti-Pattern 5: Engineer-Designed Degradation Messages
Scenario: 5
Effect: Retry storms + user confusion
Fix: UX-designed, context-aware messages with alternatives
Exam-Ready Quick Reference
Topic: Exponential Backoff (Scenario 1)
- AWS SDK retry modes: standard, adaptive, legacy
- Jitter strategies: full, equal, decorrelated
- Circuit breaker: closed → open → half-open → closed
- Fleet-wide retry budget prevents thundering herd
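The closed → open → half-open → closed cycle can be sketched as a minimal state machine (the thresholds are illustrative, not the fleet's tuned values):

```python
import time

class CircuitBreaker:
    """Minimal closed -> open -> half-open -> closed state machine."""
    def __init__(self, failure_threshold=5, open_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self, now=None) -> bool:
        now = time.time() if now is None else now
        if self.state == "open":
            if now - self.opened_at >= self.open_seconds:
                self.state = "half-open"   # let one probe request through
                return True
            return False                   # fail fast while open
        return True

    def record_success(self):
        self.state, self.failures = "closed", 0

    def record_failure(self, now=None):
        now = time.time() if now is None else now
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state, self.opened_at = "open", now
```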
Topic: Rate Limiting (Scenario 2)
- API Gateway: account, stage, method throttling
- Token bucket: rate (steady) + burst (peak) + quota (period)
- Usage plans: per-tier access control
- Dynamic scaling for organic traffic spikes
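The token-bucket semantics above (steady `rate` plus `burst` capacity) can be sketched as a toy model of the throttle behavior, not API Gateway's actual implementation:

```python
class TokenBucket:
    """Toy token bucket: `rate` tokens refill per second (steady state),
    up to `burst` capacity (peak)."""
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens = burst   # start full, so bursts are served immediately
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill according to elapsed time, capped at burst capacity.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```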
Topic: Fallback/Degradation (Scenarios 3, 5)
- Fallback cascade: primary → secondary → cached → static → graceful
- Content-aware caching: different TTLs for different data types
- Cache invalidation: event-driven (DynamoDB Streams)
- Graceful degradation: UX design, not just technical implementation
Topic: Observability (Scenario 4)
- X-Ray: distributed tracing, service maps, annotations
- Sampling: reservoir (guaranteed) + fixed rate (probabilistic)
- Error sampling: separate rule, 100% capture
- CloudWatch: metrics, logs, alarms, dashboards, anomaly detection