
Scenarios and Runbooks: Resilient FM Systems

MangaAssist context: JP Manga store chatbot on AWS — Bedrock Claude 3 (Sonnet at $3/$15 per 1M tokens input/output, Haiku at $0.25/$1.25), OpenSearch Serverless (vector store), DynamoDB (sessions/products), ECS Fargate (orchestrator), API Gateway WebSocket, ElastiCache Redis. Target: useful answer in under 3 seconds, 1M messages/day scale.


Skill Mapping

Attribute | Value
Certification | AWS Certified AI Practitioner (AIP-C01)
Domain | 2 — Implementation and Integration of Foundation Models
Task | 2.4 — Design resilient and scalable FM-based applications
Skill | 2.4.3 — Create resilient FM systems to ensure reliable operations
Focus Areas | Exponential backoff storms, rate limiter tuning, stale cache risks, X-Ray sampling gaps, graceful degradation UX

Scenario Format

Each scenario follows a standard structure:

SCENARIO: [Title]
├── SITUATION: What happened and why
├── WHAT WENT WRONG: Root cause analysis
├── DETECTION: How the issue was (or should have been) detected
├── IMPACT: Business and user impact
├── RESPONSE RUNBOOK: Step-by-step mitigation
├── RESOLUTION: How to fix permanently
├── PREVENTION: How to prevent recurrence
├── EXAM ANGLE: What the AIP-C01 exam tests from this scenario
└── KEY TAKEAWAY: One-sentence lesson

Scenario 1: Exponential Backoff Storm from Correlated Retries Across Fleet

Situation

Date: Thursday, 7:22 PM JST (peak manga browsing hour)
Trigger: Bedrock experienced a 45-second regional capacity constraint in ap-northeast-1

At 7:22 PM, Bedrock's Claude 3 Sonnet endpoint in ap-northeast-1 began returning ThrottlingException at an elevated rate. The MangaAssist ECS Fargate fleet (running 24 tasks) detected these errors and all tasks simultaneously began their exponential backoff retry loops.

Timeline:
  7:22:00 PM  Bedrock starts throttling (~30% of requests)
  7:22:01 PM  All 24 ECS tasks detect ThrottlingException
  7:22:01 PM  All 24 tasks begin retry attempt 1 (100ms delay)
  7:22:02 PM  All 24 tasks fire retry attempt 1 simultaneously
  7:22:02 PM  Bedrock throttle rate increases to 60% (overloaded by correlated retries)
  7:22:03 PM  All 24 tasks begin retry attempt 2 (200ms delay)
  7:22:04 PM  Retry storm: 24 tasks x 200ms = synchronized wave
  7:22:06 PM  All 24 tasks begin retry attempt 3 (400ms delay)
  7:22:10 PM  Retry attempt 4 (800ms delay) — Bedrock now throttling 90%
  7:22:18 PM  Retry attempt 5 (max 10s delay) — all retries exhausted
  7:22:18 PM  All 24 tasks switch to Haiku fallback simultaneously
  7:22:19 PM  Haiku now overwhelmed by 24 tasks' worth of redirected traffic
  7:22:25 PM  Haiku also starts throttling
  7:22:30 PM  Circuit breakers open across fleet — cache-only mode
  7:23:05 PM  Bedrock recovers, but circuit breakers remain open for 30s
  7:23:35 PM  Circuit breakers transition to HALF_OPEN
  7:23:36 PM  Probe requests succeed — circuit breakers close
  7:23:40 PM  Normal operation resumes

What Went Wrong

The root cause was correlated retries — a thundering herd problem. When all 24 ECS tasks encounter the same error at the same time and use the same backoff algorithm, their retries synchronize:

Without Jitter (the problem):
  Task 1:  |==|    |====|        |========|
  Task 2:  |==|    |====|        |========|
  Task 3:  |==|    |====|        |========|
  ...
  Task 24: |==|    |====|        |========|
           ↑ All fire at same instant ↑

With Decorrelated Jitter (the fix):
  Task 1:  |=|      |=======|         |==========|
  Task 2:  |===|  |====|        |===============|
  Task 3:  |==|       |=====|     |============|
  ...
  Task 24: |====|        |===|           |========|
           ↑ Spread across time window ↑

Contributing factors:

  1. All ECS tasks used JitterStrategy.NONE (exponential backoff without jitter)
  2. No retry budget was enforced across the fleet
  3. The fallback cascade triggered a secondary thundering herd on Haiku
  4. The circuit breaker recovery timeout (30s) kept the breakers open for a further 30 seconds after Bedrock had already recovered, needlessly extending the degraded window
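A minimal sketch of the decorrelated-jitter backoff pictured above, following the commonly cited formula sleep = min(cap, random(base, 3 * previous)); the function name and constants are illustrative, not MangaAssist code:

```python
import random
import time

BASE_MS = 100      # initial delay, matches the timeline's first retry
CAP_MS = 10_000    # maximum delay (the 10s cap in the timeline)

def call_with_decorrelated_jitter(attempt_fn, max_attempts=5):
    """Retry attempt_fn so that tasks across the fleet do not retry in lockstep."""
    sleep_ms = BASE_MS
    for attempt in range(max_attempts):
        try:
            return attempt_fn()
        except Exception:  # in practice, catch the SDK's ThrottlingException only
            if attempt == max_attempts - 1:
                raise
            # Decorrelated jitter: the next delay is drawn at random between the
            # base and 3x the previous delay, capped. Each task draws independently,
            # so retries spread out instead of firing in synchronized waves.
            sleep_ms = min(CAP_MS, random.uniform(BASE_MS, sleep_ms * 3))
            time.sleep(sleep_ms / 1000.0)
```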

Detection

Signal | Source | Alert | Time to Detect
Bedrock ThrottlingException spike | CloudWatch ErrorCount metric | Warning alarm at 5% error rate | 2 minutes
Retry count spike | Custom metric RetryCount | None configured | Not detected automatically
Correlated retry pattern | X-Ray trace analysis (post-incident) | None | Manual analysis after incident
Circuit breaker open | Custom metric CircuitBreakerState | Critical alarm | 30 seconds
Haiku fallback spike | Custom metric TierCount.Haiku | Warning at >20% degradation | 5 minutes

Detection gap: There was no alarm for correlated retry patterns. The retry count metric existed but no alarm was set to detect the synchronized spike pattern.

Impact

Metric | Normal | During Incident | Recovery
User-visible error rate | 0.01% | 15% | 0.5%
p99 latency | 2.8s | 12.5s (timeout) | 3.5s
Sonnet availability | 99.5% | 10% | 85%
Haiku availability | 99.5% | 40% (secondary storm) | 95%
Cache serve rate | 4% | 65% | 15%
Messages affected | 0 | ~2,400 messages | ~600
Duration | N/A | 78 seconds | 35 seconds

Business impact: During the 78-second incident, approximately 2,400 manga customer messages received degraded responses (cached or graceful fallback). 15% of users during the window saw error messages. Customer satisfaction scores for the affected window dropped from 4.2/5 to 2.8/5.

Response Runbook

RUNBOOK: Exponential Backoff Storm Mitigation
SEVERITY: P2 (High — service degradation, not outage)
ON-CALL TEAM: MangaAssist Platform Engineering

STEP 1: CONFIRM THE INCIDENT (0-2 minutes)
  ├── Check CloudWatch dashboard: MangaAssist-FM-Health
  │   └── Look for: ErrorRate spike + TierCount shift away from Sonnet
  ├── Check X-Ray service map for error segments
  │   └── Filter: fault = true AND service("Bedrock Runtime")
  ├── Check Bedrock Service Health Dashboard
  │   └── https://health.aws.amazon.com/health/status
  └── Confirm: Is this a Bedrock regional issue or MangaAssist-specific?

STEP 2: IMMEDIATE MITIGATION (2-5 minutes)
  ├── If Bedrock is the source:
  │   ├── Enable "storm brake" — temporarily disable retries fleet-wide
  │   │   └── Set environment variable: RETRY_MAX_ATTEMPTS=0
  │   │   └── ECS will pick up on next health check (30s)
  │   ├── Force all traffic to Haiku (skip Sonnet entirely)
  │   │   └── Set environment variable: PRIMARY_MODEL=haiku
  │   └── If Haiku also affected: enable cache-only mode
  │       └── Set environment variable: FM_MODE=cache_only
  ├── If MangaAssist-specific (our retry logic):
  │   ├── Scale ECS tasks to 0 temporarily, then back to 12 (half fleet)
  │   │   └── aws ecs update-service --desired-count 0
  │   │   └── Wait 10 seconds
  │   │   └── aws ecs update-service --desired-count 12
  │   └── This breaks the synchronized retry pattern
  └── Monitor: Watch ErrorRate metric for improvement

STEP 3: VERIFY RECOVERY (5-10 minutes)
  ├── Confirm Sonnet invocations returning to normal
  ├── Check circuit breaker states: all should be CLOSED
  ├── Verify cache serve rate dropping back to baseline (~4%)
  ├── Check user-facing error rate back below 0.1%
  └── Confirm p99 latency below 3s SLA

STEP 4: POST-INCIDENT (within 24 hours)
  ├── Collect X-Ray traces from the incident window
  ├── Analyze retry patterns: Were they correlated?
  ├── Document: Timeline, impact, mitigation steps
  └── Schedule: Post-mortem meeting

Resolution

  1. Enable decorrelated jitter — Switch from JitterStrategy.NONE to JitterStrategy.DECORRELATED across all ECS tasks
  2. Add fleet-wide retry budget — Use a shared Redis counter to cap total retries across the fleet at 100/minute
  3. Stagger fallback cascade — When falling back to Haiku, add a random 0-2 second delay to prevent secondary thundering herd
  4. Reduce circuit breaker recovery timeout — From 30s to 15s, since most Bedrock transient issues resolve within 10s
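A minimal sketch of the fleet-wide retry budget from item 2 above, assuming the ElastiCache Redis endpoint is reachable from every ECS task; the key naming is an assumption, while the 100/minute cap comes from the resolution:

```python
import time
import redis

r = redis.Redis(host="manga-cache.example.internal", port=6379)  # placeholder endpoint

RETRY_BUDGET_PER_MINUTE = 100  # fleet-wide cap from the resolution above

def may_retry() -> bool:
    """Return True if this task may retry; False once the fleet's minute budget is spent."""
    window = int(time.time() // 60)            # one shared counter per minute
    key = f"retry_budget:{window}"
    count = r.incr(key)                        # atomic across all 24 ECS tasks
    if count == 1:
        r.expire(key, 120)                     # old windows age out on their own
    return count <= RETRY_BUDGET_PER_MINUTE
```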

Prevention

Prevention Measure | Implementation | Priority
Decorrelated jitter by default | Update BackoffConfig default to DECORRELATED | P0
Fleet retry budget (Redis) | Add RetryBudgetManager with 100/min cap | P0
Retry correlation alarm | CloudWatch alarm on retry spike variance | P1
Staggered fallback | Random delay before Haiku fallback | P1
Chaos testing | Inject Bedrock throttle in staging weekly | P2
Documentation | Update runbook with correlated retry procedure | P2

Exam Angle

What the AIP-C01 exam tests here:
  - Understanding of exponential backoff and why jitter is essential in distributed systems
  - Knowledge of the thundering herd problem and how it manifests in FM API calls
  - AWS SDK retry modes (standard, adaptive, legacy) and their jitter behavior
  - Circuit breaker pattern and its role in preventing cascading failures

Key concept: The AWS SDK's adaptive retry mode includes a client-side rate limiter (token bucket) that automatically adjusts based on throttle responses. This is superior to standard exponential backoff because it prevents the client from sending requests it knows will be throttled.
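For reference, this is how the adaptive retry mode is enabled in boto3; the region matches the scenario, while the use of a dedicated bedrock-runtime client in the orchestrator is an assumption:

```python
import boto3
from botocore.config import Config

bedrock = boto3.client(
    "bedrock-runtime",
    region_name="ap-northeast-1",
    config=Config(
        retries={
            "mode": "adaptive",   # client-side token bucket slows sends after throttles
            "max_attempts": 5,
        }
    ),
)
```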

Key Takeaway

Exponential backoff without jitter in a multi-task fleet creates synchronized retry storms that amplify the original overload — always use decorrelated jitter and enforce a fleet-wide retry budget.


Scenario 2: Rate Limiter Blocking Legitimate Traffic During Organic Spike

Situation

Date: Wednesday, 6:00 PM JST
Trigger: Popular manga series "Kaiju No. 8" announced surprise new volume release on social media

At 5:58 PM, a viral tweet from the official Kaiju No. 8 account announced an exclusive limited-edition volume available only through MangaAssist. Within 90 seconds, organic traffic spiked from the baseline of ~800 rps to 4,200 rps — exceeding the global rate limit of 2,000 rps (burst 4,000).

Timeline:
  5:58:00 PM  Viral tweet posted (6.2M followers)
  5:58:30 PM  Traffic begins rising: 800 → 1,200 rps
  5:59:00 PM  Traffic: 1,800 rps (within limits)
  5:59:30 PM  Traffic: 3,200 rps (within burst, consuming tokens fast)
  5:59:45 PM  Burst tokens exhausted: 4,000 tokens consumed
  5:59:46 PM  Rate limiter begins returning 429 responses
  5:59:46 PM  Legitimate customers see: "You're sending messages too fast"
  6:00:00 PM  Traffic: 4,200 rps — 2,200 rps being throttled (52%)
  6:00:30 PM  Customer complaints spike on social media
  6:01:00 PM  Operations team alerted via CloudWatch alarm
  6:02:00 PM  Manual rate limit increase: 2,000 → 5,000 rps
  6:02:15 PM  Throttle rate drops to 0%
  6:05:00 PM  Traffic stabilizes at 3,800 rps
  6:15:00 PM  Traffic returns to 1,500 rps (above normal but manageable)
  6:30:00 PM  Normal operations

What Went Wrong

The rate limiter worked exactly as designed — it protected Bedrock from overload. The problem was that the rate limits were too conservative for organic traffic spikes:

  1. Static rate limits — The 2,000 rps limit was based on average load, not peak capacity
  2. No distinction between organic and abusive traffic — All requests treated equally
  3. No auto-scaling of rate limits — Required manual intervention to increase
  4. Per-user limits too aggressive — Returning "too fast" message to first-time visitors who sent just 1 message
  5. No priority queue — Premium customers throttled alongside anonymous browsers

The real problem: The error message "You're sending messages too fast" was misleading. These were first-time visitors asking about the new manga volume — they had sent exactly one message and were told they were too fast. The actual throttle was the global rate limit being hit, but the error message was designed for per-user throttling.

Detection

Signal | Source | Alert | Time to Detect
429 response rate spike | API Gateway metrics | Warning alarm at >1% 429 rate | 45 seconds
Global rate limit utilization >95% | Custom RateLimitUtilization metric | Info alarm at 80% | 30 seconds
Customer complaints | Social media monitoring | Manual detection | 2 minutes
Traffic spike | API Gateway RequestCount | Anomaly detection alarm | 1 minute

What worked: The CloudWatch anomaly detection alarm caught the traffic spike quickly. What failed was the automatic response — there was no auto-scaling mechanism for rate limits.

Impact

Metric | Normal | During Incident | After Fix
429 error rate | 0.01% | 52% | 0%
Legitimate users throttled | ~0 | ~45,000 users (2.5 min window) | 0
Customer complaints | ~2/hr | ~340 in 10 minutes | Subsided
Revenue impact | Normal | Est. $12,000 lost sales (limited edition) | Recovered partially
Social media sentiment | Positive | Negative (#MangaAssistDown trending) | Neutral
Bedrock utilization | 40% | Capped at rate limit | 75%

Business impact: An estimated 45,000 users were blocked from purchasing the limited-edition Kaiju No. 8 volume during the first 2.5 minutes. The negative social media attention (#MangaAssistDown trending briefly) caused reputational damage. Estimated revenue loss: $12,000 in the first 10 minutes.

Response Runbook

RUNBOOK: Rate Limiter Blocking Legitimate Traffic
SEVERITY: P1 (Critical — revenue-impacting, customer-facing)
ON-CALL TEAM: MangaAssist Platform Engineering + Customer Success

STEP 1: CONFIRM ORGANIC SPIKE vs. DDoS (0-2 minutes)
  ├── Check WAF metrics: Is traffic from diverse IPs or concentrated?
  │   └── Diverse IPs = organic | Few IPs = potential DDoS
  ├── Check geographic distribution: Mostly Japan? (Expected for manga)
  ├── Check request patterns: Normal chat queries or repetitive patterns?
  ├── Check social media: Is there a viral event driving traffic?
  └── Decision: If organic → increase limits | If DDoS → keep limits, enable WAF rules

STEP 2: INCREASE RATE LIMITS (2-5 minutes)
  ├── Global rate limit:
  │   └── aws apigateway update-usage-plan --usage-plan-id UP_ID \
  │       --patch-operations op=replace,path=/throttle/rateLimit,value=5000
  ├── Burst capacity:
  │   └── aws apigateway update-usage-plan --usage-plan-id UP_ID \
  │       --patch-operations op=replace,path=/throttle/burstLimit,value=8000
  ├── Scale ECS tasks:
  │   └── aws ecs update-service --desired-count 48 (double fleet)
  └── Verify: Monitor 429 rate dropping to 0%

STEP 3: PROTECT BEDROCK (concurrent with Step 2)
  ├── Check Bedrock service quotas: Are we near account limits?
  │   └── aws service-quotas get-service-quota --service-code bedrock ...
  ├── If near Bedrock limits:
  │   ├── Route 50% of traffic to Haiku (cheaper, faster)
  │   ├── Enable aggressive caching (extend TTLs temporarily)
  │   └── Consider: Bedrock on-demand throughput increase request
  └── Monitor: Bedrock throttle rate should remain below 5%

STEP 4: FIX USER-FACING MESSAGES (5-10 minutes)
  ├── Update throttle message for global rate limit:
  │   └── FROM: "You're sending messages too fast"
  │   └── TO: "We're experiencing high demand — you're in a queue!
  │            Estimated wait: ~30 seconds. While you wait, browse
  │            our Kaiju No. 8 collection directly."
  ├── Enable queue mode: Accept and queue requests rather than reject
  └── Add banner to UI: "High demand for Kaiju No. 8 — please be patient"

STEP 5: POST-SPIKE NORMALIZATION (30-60 minutes)
  ├── Monitor traffic returning to baseline
  ├── Gradually reduce rate limits back to normal
  ├── Scale ECS tasks back to standard count
  ├── Review: Were any customers permanently lost?
  └── Document: What threshold triggered the issue?

Resolution

  1. Implement dynamic rate limiting — Auto-scale rate limits based on traffic pattern analysis (organic vs. abusive)
  2. Add priority queueing — Premium users bypass throttle; registered users get priority over anonymous
  3. Fix error messages — Distinguish between "you personally are too fast" and "system is busy"
  4. Pre-event scaling — Integrate with marketing team's event calendar to pre-scale before known spikes
  5. Request queueing — Instead of rejecting with 429, queue requests and serve them as capacity becomes available
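A sketch of what resolution item 1 could look like: a Lambda handler, triggered by the anomaly-detection alarm, that raises the usage-plan throttle. The usage-plan ID and the 5,000/8,000 limits mirror the runbook; the organic-spike confirmation step is only hinted at in a comment:

```python
import boto3

apigw = boto3.client("apigateway")

USAGE_PLAN_ID = "UP_ID"        # placeholder from the runbook
SPIKE_RATE_LIMIT = 5000        # rps, per the runbook's manual mitigation
SPIKE_BURST_LIMIT = 8000

def handler(event, context):
    # A real implementation would first confirm the spike is organic
    # (diverse source IPs, normal query patterns) before loosening limits.
    apigw.update_usage_plan(
        usagePlanId=USAGE_PLAN_ID,
        patchOperations=[
            {"op": "replace", "path": "/throttle/rateLimit", "value": str(SPIKE_RATE_LIMIT)},
            {"op": "replace", "path": "/throttle/burstLimit", "value": str(SPIKE_BURST_LIMIT)},
        ],
    )
    return {"rateLimit": SPIKE_RATE_LIMIT, "burstLimit": SPIKE_BURST_LIMIT}
```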

Prevention

Prevention Measure | Implementation | Priority
Dynamic rate limit auto-scaling | Lambda function that adjusts limits based on CloudWatch Anomaly Detection | P0
Priority queue for premium users | Redis-based priority queue in ECS orchestrator | P0
Marketing integration | Webhook from marketing calendar to auto-scale 2 hours before events | P1
Request queueing (SQS) | Queue overflow requests instead of rejecting | P1
Differentiated error messages | Context-aware throttle messages (global vs. per-user) | P1
Capacity planning review | Monthly review of rate limits vs. traffic growth | P2
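One possible shape for the Redis-based priority queue listed above, using a sorted set so premium users drain first; the tier offsets and key names are assumptions for illustration:

```python
import json
import time
from typing import Optional

import redis

r = redis.Redis(host="manga-cache.example.internal", port=6379)  # placeholder endpoint

QUEUE_KEY = "fm_request_queue"
# Offsets are far larger than any Unix timestamp, so tier always dominates arrival time.
TIER_OFFSET = {"premium": 0, "registered": 10_000_000_000, "anonymous": 20_000_000_000}

def enqueue(request_id: str, payload: dict, tier: str = "anonymous") -> None:
    score = TIER_OFFSET.get(tier, TIER_OFFSET["anonymous"]) + time.time()
    r.zadd(QUEUE_KEY, {request_id: score})          # lower score = served sooner
    r.set(f"req:{request_id}", json.dumps(payload))

def dequeue() -> Optional[str]:
    popped = r.zpopmin(QUEUE_KEY, count=1)          # highest-priority, oldest request
    return popped[0][0].decode() if popped else None
```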

Exam Angle

What the AIP-C01 exam tests here:
  - Understanding of API Gateway throttling mechanisms (account, stage, method level)
  - Usage plans and their role in rate limiting different user tiers
  - The relationship between rate limiting and user experience
  - How to design for traffic spikes in FM-based applications
  - Token bucket algorithm behavior (steady rate vs. burst capacity)

Key concept: API Gateway's token bucket algorithm allows bursts up to the burst limit, but once burst tokens are consumed, traffic is limited to the steady-state rate. For applications with predictable spikes (like content releases), pre-scaling the burst capacity is essential.
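A toy token bucket illustrating that behavior (not API Gateway's internal implementation): the burst capacity absorbs a short spike, and once the bucket drains, throughput falls back to the steady refill rate:

```python
import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec      # steady-state refill, e.g. 2,000 rps
        self.capacity = burst         # burst limit, e.g. 4,000
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                  # the caller would return a 429 here

bucket = TokenBucket(rate_per_sec=2000, burst=4000)   # the pre-incident limits
```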

Key Takeaway

Rate limiters protect the system but must distinguish between organic traffic spikes and abuse — a rate limiter that blocks legitimate customers during a viral moment causes more damage than the overload it prevents.


Scenario 3: Fallback to Cached Response Serving Stale Manga Prices

Situation

Date: Monday, 10:15 AM JST
Trigger: Weekend sale ended at midnight but cache entries with sale prices had 4-hour TTL

MangaAssist ran a weekend flash sale on popular manga series (30% off). The sale ended at midnight Sunday. However, the response cache in Redis still contained responses generated during the sale period, complete with discounted prices. When Bedrock experienced a brief 3-minute throttling event Monday morning, the fallback system served these cached responses to customers.

Timeline:
  Sat 9:00 AM   Sale begins — all prices show 30% discount
  Sat-Sun        Cache populated with sale-price responses (TTL: 4 hours)
  Sun 11:59 PM   Sale ends — DynamoDB prices updated to normal
  Mon 12:00 AM   Cache contains: mix of stale (sale) and fresh (normal) prices
  Mon 3:00 AM    Most sale-price cache entries expired (4hr TTL)
  Mon 3:30 AM    Cache warm-up job runs — repopulates with correct prices
  Mon 10:15 AM   Bedrock throttling event begins (3 minutes)
  Mon 10:15 AM   Fallback cascade: Sonnet fail → Haiku fail → Cache
  Mon 10:15 AM   PROBLEM: Some cache entries had been refreshed at 9:00 AM
                  with sale prices, because the sale had accidentally been
                  re-enabled in one region (see "What Went Wrong" below)
  Mon 10:15 AM   ~180 users receive responses with incorrect (sale) prices
  Mon 10:18 AM   Bedrock recovers — fresh responses resume
  Mon 10:45 AM   Customer support reports: "Users saying prices differ
                  from what chatbot told them"
  Mon 11:00 AM   Investigation confirms stale cache served sale prices
  Mon 11:15 AM   Cache manually flushed for all pricing-related entries

What Went Wrong

Multiple failures compounded:

  1. Cache TTL too long for pricing data — 4 hours is appropriate for recommendations but not for price-sensitive content
  2. No cache invalidation on price change — When the sale ended and DynamoDB prices were updated, the Redis cache was not invalidated
  3. No content-type awareness in cache — All responses cached with the same TTL regardless of whether they contained pricing data
  4. Manual re-entry error — An operations team member accidentally re-triggered the sale in one region, causing fresh cache entries with sale prices
  5. No staleness warning for pricing — When serving cached price data, no disclaimer was shown to users

Cache Entry Problem:

  User asks: "How much is One Piece Volume 105?"

  Fresh response (correct):
    "One Piece Volume 105 is ¥528 (regular price)."

  Stale cached response (incorrect):
    "One Piece Volume 105 is ¥369 (30% off — flash sale!)."
    ← This response was cached during the weekend sale
    ← Served during the Monday Bedrock throttle event

Detection

Signal | Source | Alert | Time to Detect
Fallback to cache tier | Custom TierCount metric | Info alarm at >10% cache serves | 1 minute
Stale cache serves | Custom StaleCacheServes metric | Not configured for pricing category | Not detected
Customer complaints | Support ticket system | Manual detection | 30 minutes
Price discrepancy | No automated check | None | Manual investigation

Detection gap: There was no automated check comparing cached prices against the source of truth (DynamoDB). The staleness metric existed but was not configured to alarm specifically on pricing-category cache entries.

Impact

Metric | Normal | During Incident
Users served stale prices | 0 | ~180 users
Incorrect price quotes | 0 | ~85 unique products
Customer support tickets | ~5/hr | ~45 in 2 hours
Potential revenue loss | $0 | ~$2,300 (if honored)
Trust impact | Baseline | Moderate (price credibility)

Business impact: 180 customers were quoted sale prices that no longer applied. The legal and customer experience teams decided to honor the incorrect prices for affected customers, costing ~$2,300. The incident also raised questions about the chatbot's reliability for transactional queries.

Response Runbook

RUNBOOK: Stale Cache Serving Incorrect Data
SEVERITY: P2 (High — financial impact, customer trust)
ON-CALL TEAM: MangaAssist Platform Engineering + Customer Success

STEP 1: CONFIRM STALE DATA (0-5 minutes)
  ├── Identify affected cache entries:
  │   └── redis-cli --scan --pattern "manga:cache:*" | head -20   (sample ~20 keys)
  │   └── Check created_at timestamps — are any from before the sale ended?
  ├── Compare cached prices to DynamoDB source of truth:
  │   └── For each cached product_id, query DynamoDB for current price
  │   └── Flag entries where cached price != DynamoDB price
  ├── Estimate scope: How many cache entries are stale?
  └── Check: Is the issue ongoing or has the Bedrock throttle resolved?

STEP 2: FLUSH AFFECTED CACHE ENTRIES (5-10 minutes)
  ├── Option A: Flush ALL pricing-category cache entries
  │   └── redis-cli EVAL "local keys = redis.call('keys', 'manga:cache:*')
  │       for i, key in ipairs(keys) do
  │         local entry = redis.call('get', key)
  │         if string.find(entry, '\"category\":\"pricing\"') then
  │           redis.call('del', key)
  │         end
  │       end
  │       return 'done'" 0
  ├── Option B: Flush ALL cache entries (nuclear option)
  │   └── redis-cli FLUSHDB
  │   └── WARNING: This increases Bedrock load temporarily
  └── Verify: Spot-check that pricing queries now return fresh Bedrock responses

STEP 3: IDENTIFY AFFECTED CUSTOMERS (10-30 minutes)
  ├── Query CloudWatch Logs for cache-served pricing responses:
  │   └── Filter: tier="cached" AND category="pricing" AND
  │       timestamp BETWEEN "2024-XX-XX 10:15" AND "2024-XX-XX 10:18"
  ├── Extract user_ids from affected traces
  ├── For each user: What price were they quoted? What is the correct price?
  ├── Generate report: user_id, product, quoted_price, actual_price, delta
  └── Send report to Customer Success team for remediation

STEP 4: CUSTOMER REMEDIATION (same day)
  ├── Decision: Honor incorrect prices or issue apology + coupon?
  │   └── If delta < $10 per customer: Honor the quoted price
  │   └── If delta > $10: Contact customer with apology + 15% coupon
  ├── Send proactive email to affected customers:
  │   └── "We noticed a brief pricing display issue on [date]..."
  └── Update FAQ: Add note about the incident for customer support agents

STEP 5: PREVENT RECURRENCE (within 1 week)
  ├── Implement content-aware TTLs (pricing: 15 min max)
  ├── Add event-driven cache invalidation on DynamoDB price updates
  ├── Add staleness warning for cached pricing responses
  └── Add automated price-accuracy check for cached entries

Resolution

  1. Content-aware TTLs — Pricing: 15 minutes, Availability: 5 minutes, Recommendations: 4 hours, FAQ: 7 days
  2. DynamoDB Streams cache invalidation — When prices update in DynamoDB, a Lambda function invalidates all cache entries referencing those product IDs
  3. Staleness disclaimer — When serving cached pricing data, append: "Prices shown may not reflect the latest updates. Please verify at checkout."
  4. Cache validation job — Hourly Lambda that samples 100 cached pricing entries and validates against DynamoDB
  5. Sale end cache flush — Automated cache flush for all pricing entries when a sale period ends
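A minimal sketch of the content-aware TTLs in item 1, assuming the orchestrator tags each response with a category before caching it; the key pattern follows the manga:cache:* convention used in the runbook:

```python
import json
import redis

r = redis.Redis(host="manga-cache.example.internal", port=6379)  # placeholder endpoint

CATEGORY_TTL = {                     # seconds, mirroring resolution item 1
    "pricing": 15 * 60,
    "availability": 5 * 60,
    "recommendation": 4 * 60 * 60,
    "faq": 7 * 24 * 60 * 60,
}

def cache_response(query_hash: str, response_text: str, category: str) -> None:
    ttl = CATEGORY_TTL.get(category, 60 * 60)          # conservative default: 1 hour
    entry = json.dumps({"category": category, "response": response_text})
    r.setex(f"manga:cache:{query_hash}", ttl, entry)   # key pattern from the runbook
```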

Prevention

Prevention Measure | Implementation | Priority
Content-aware TTLs | Modify cache layer to use CATEGORY_TTL map | P0
DynamoDB Streams invalidation | Lambda triggered by DynamoDB Streams on price changes | P0
Staleness disclaimer on prices | Append warning when serving cached pricing data | P0
Sale-end cache flush automation | EventBridge rule triggers cache flush at sale end time | P1
Hourly price accuracy validation | Lambda samples and validates cached pricing entries | P1
Cache entry product_id tracking | Store product_ids in cache entry metadata for targeted invalidation | P2
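And a sketch of the DynamoDB Streams invalidation path (resolution item 2): a Lambda on the products table's stream that deletes cached responses referencing the changed product. The per-product index set and the product_id key name are assumptions about how cache keys are tracked:

```python
import redis

r = redis.Redis(host="manga-cache.example.internal", port=6379)  # placeholder endpoint

def handler(event, context):
    """Triggered by DynamoDB Streams on the products table."""
    for record in event.get("Records", []):
        if record.get("eventName") not in ("MODIFY", "REMOVE"):
            continue
        product_id = record["dynamodb"]["Keys"]["product_id"]["S"]
        # Assumes the cache layer records which cache keys mention each product:
        # SADD product_cache_index:<product_id> <cache_key> on every cache write.
        index_key = f"product_cache_index:{product_id}"
        stale_keys = r.smembers(index_key)
        if stale_keys:
            r.delete(*stale_keys)      # invalidate every cached response for this product
        r.delete(index_key)
```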

Exam Angle

What the AIP-C01 exam tests here:
  - Understanding of caching strategies and their trade-offs in FM applications
  - The concept of graceful degradation and when cached responses are appropriate vs. dangerous
  - Event-driven architectures (DynamoDB Streams) for cache invalidation
  - The importance of content-type awareness in FM response caching

Key concept: Not all FM responses are equal. A cached recommendation from 4 hours ago is still useful, but a cached price from 4 hours ago could be financially harmful. Content-aware caching ensures that time-sensitive data (prices, availability) has much shorter TTLs than stable data (FAQs, general information).

Key Takeaway

Cache-based fallback must be content-aware — a stale manga recommendation is harmless, but a stale price can cost real money and erode customer trust.


Scenario 4: X-Ray Sampling Missing Critical Error Traces

Situation

Date: Tuesday, 2:45 PM JST
Trigger: Bedrock introduced a new error response format that the MangaAssist error handler did not recognize

At 2:45 PM, Bedrock began returning a new error format for a specific edge case (malformed multi-modal content in the request). The MangaAssist error handler treated this as a generic ClientError rather than a retryable ValidationException, causing it to fail fast without retry and return a raw error message to users.

The problem: X-Ray was configured with a 5% sampling rate. The error affected only 0.3% of requests (those containing certain Unicode characters in manga title searches). With 5% sampling applied to that 0.3% error traffic, captured error traces amounted to only 0.015% of all requests, roughly 6 error traces per hour (1-2 in any 15-minute window), far too few to reveal the pattern.

Timeline:
  2:45 PM    New Bedrock error response begins appearing
  2:45 PM    Error rate increases from 0.1% to 0.4%
  2:45 PM    X-Ray captures 5% of all traces (including errors)
  2:45 PM    Of 0.3% error traffic, only 0.015% captured in X-Ray
  3:00 PM    CloudWatch alarm triggers: ErrorRate > 0.3%
  3:05 PM    On-call engineer checks X-Ray: finds 2 error traces
  3:05 PM    Engineer: "Looks like a rare transient — only 2 traces"
  3:05 PM    Engineer dismisses as transient (wrong conclusion)
  3:30 PM    Error rate persists at 0.4%
  4:00 PM    More customer complaints: "Error when searching manga with kanji"
  4:15 PM    Second investigation: CloudWatch Logs searched directly
  4:15 PM    FOUND: 4,300 error events in 1.5 hours (not 2!)
  4:20 PM    Root cause identified: New Bedrock error format for Unicode
  4:30 PM    Fix deployed: Updated error handler to recognize new format
  4:35 PM    Error rate returns to baseline

What Went Wrong

  1. Uniform sampling rate — 5% sampling applied equally to successful and failed requests
  2. Low-frequency errors undersampled — 0.3% error rate x 5% sampling = almost invisible in X-Ray
  3. No error-specific sampling rule — X-Ray should have captured 100% of errors
  4. Wrong investigation tool — Engineer relied on X-Ray (sampled) instead of CloudWatch Logs (complete)
  5. Misleading X-Ray data — Seeing "only 2 error traces" led to incorrect dismissal

The Sampling Math Problem:

  Total traffic:  11.6 rps (1M messages/day)
  Error rate:     0.3% = 0.035 errors/second = ~125 errors/hour
  X-Ray sampling: 5%
  Captured errors: 0.05 x 125 = ~6 error traces/hour

  But: X-Ray groups samples by trace, not by error.
  The engineer saw only 2 error traces in a 15-minute window.
  Conclusion: "This is rare and transient" (WRONG)
  Reality: 125 errors/hour x 1.5 hours = ~188 errors before detection

  With 100% error sampling:
  The engineer would have seen 31 error traces in the same 15-minute window.
  Conclusion: "This is a consistent pattern affecting Unicode queries" (CORRECT)

Detection

Signal | Source | Alert | Time to Detect
Error rate increase | CloudWatch ErrorCount | Warning at >0.3% | 15 minutes
X-Ray error traces | X-Ray Groups (MangaAssist-Errors) | Too few samples to trigger insight | Failed
Customer complaints | Support tickets | Manual | 1.5 hours
CloudWatch Logs errors | Log Insights query | Manual investigation | 1.5 hours

Detection gap: X-Ray was configured as the primary error investigation tool, but its sampling rate was too low to capture enough error traces for pattern analysis. CloudWatch Logs had complete data but was not the first tool checked.

Impact

Metric | Normal | During Incident
Error rate | 0.1% | 0.4% (+0.3%)
Affected queries | 0 | ~4,300 over 1.75 hours
Time to detect | N/A | 15 minutes (alarm), 1.5 hours (root cause)
Time to resolve | N/A | 1 hour 45 minutes total
Users affected | 0 | ~2,800 unique users
Queries with raw error | 0 | ~4,300 (users saw error JSON)

Business impact: 2,800 users who searched for manga using Japanese kanji characters received raw error messages instead of helpful responses. The 1.5-hour delay in root cause identification (due to misleading X-Ray data) extended the impact window unnecessarily.

Response Runbook

RUNBOOK: X-Ray Sampling Missing Critical Errors
SEVERITY: P2 (High — errors reaching users, investigation delayed)
ON-CALL TEAM: MangaAssist Platform Engineering

STEP 1: DO NOT RELY SOLELY ON X-RAY FOR ERROR INVESTIGATION
  ├── ALWAYS cross-reference X-Ray with CloudWatch Logs
  ├── CloudWatch Logs: Complete record (no sampling)
  │   └── Log Insights query:
  │       fields @timestamp, error_code, error_message, user_query
  │       | filter level = "ERROR"
  │       | stats count(*) by error_code
  │       | sort count desc
  ├── X-Ray: Use for distributed trace context AFTER identifying the pattern
  └── Rule: If X-Ray shows few errors but CloudWatch alarm fired,
            trust CloudWatch — X-Ray may be undersampled

STEP 2: IDENTIFY ERROR PATTERN (0-10 minutes)
  ├── Run Log Insights query for error distribution:
  │   └── fields @timestamp, error_code, @message
  │       | filter level = "ERROR"
  │       | stats count(*) as error_count by bin(5m)
  │       | sort @timestamp desc
  ├── Look for: New error codes, changed error formats, specific patterns
  ├── Correlate with user input: Is there a common pattern in failing queries?
  └── Check: Did AWS release any changes to Bedrock API recently?

STEP 3: INCREASE X-RAY ERROR SAMPLING (immediate)
  ├── Create high-priority error sampling rule:
  │   └── aws xray create-sampling-rule --cli-input-json '{
  │         "SamplingRule": {
  │           "RuleName": "MangaAssist-AllErrors",
  │           "Priority": 1,
  │           "FixedRate": 1.0,
  │           "ReservoirSize": 100,
  │           "ServiceName": "MangaAssist",
  │           "ServiceType": "*",
  │           "Host": "*",
  │           "HTTPMethod": "*",
  │           "URLPath": "*",
  │           "ResourceARN": "*",
  │           "Version": 1
  │         }
  │       }'
  └── This captures 100% of traces going forward (useful for ongoing investigation)

STEP 4: FIX THE ERROR (once root cause identified)
  ├── Update error handler to recognize new error format
  ├── Add the new error code to the retryable set (if retryable)
  ├── Deploy fix to ECS tasks (rolling update, no downtime)
  └── Verify: Error rate returns to baseline

STEP 5: REVERT X-RAY SAMPLING (after incident resolved)
  ├── Remove the temporary 100% sampling rule
  ├── Ensure permanent error sampling rule exists:
  │   └── Priority: 50, FixedRate: 1.0, ReservoirSize: 10
  │   └── Filter: Errors only (via annotation or HTTP status)
  └── Verify: Normal sampling rate for success, 100% for errors

Resolution

  1. Separate error sampling rule — Create a permanent X-Ray sampling rule that captures 100% of error traces (Priority 50, FixedRate 1.0)
  2. Error-first investigation protocol — Update runbook: always check CloudWatch Logs first, use X-Ray for trace context second
  3. Error format handler update — Add defensive parsing for new Bedrock error response formats
  4. Anomaly detection on error patterns — CloudWatch anomaly detection on error codes, not just error rate
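A sketch of the defensive error parsing in item 3, assuming a boto3 bedrock-runtime client; the retryable set reflects this incident's conclusions and is illustrative, not the definitive Bedrock error taxonomy:

```python
from botocore.exceptions import ClientError

RETRYABLE_CODES = {
    "ThrottlingException",
    "ServiceUnavailableException",
    "ModelTimeoutException",
    "ValidationException",     # added to the retryable set after this incident
}

def classify_bedrock_error(err: ClientError) -> str:
    """Never raise on an unfamiliar format; unknown errors route to graceful degradation."""
    code = err.response.get("Error", {}).get("Code", "")
    if code in RETRYABLE_CODES:
        return "retry"
    # New or unrecognized error format: log the full response for later analysis,
    # then hand the query to the graceful-degradation path instead of surfacing raw JSON.
    return "graceful_degradation"

# Usage inside the orchestrator's invoke path (sketch):
#   try:
#       response = bedrock.invoke_model(modelId=MODEL_ID, body=payload)
#   except ClientError as err:
#       action = classify_bedrock_error(err)
```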

Prevention

Prevention Measure | Implementation | Priority
100% error sampling rule (permanent) | X-Ray sampling rule: Priority 50, Rate 1.0 for error traces | P0
CloudWatch Logs-first investigation | Update on-call runbook and training | P0
Defensive error parsing | Try/catch with fallback for unknown Bedrock error formats | P1
Error code anomaly detection | CloudWatch anomaly detection per error_code dimension | P1
Weekly Bedrock API change review | Subscribe to AWS service announcements, review weekly | P2
X-Ray sampling adequacy test | Monthly: Inject known errors, verify X-Ray captures them | P2

Exam Angle

What the AIP-C01 exam tests here:
  - Understanding of X-Ray sampling strategies and their trade-offs
  - The relationship between X-Ray sampling rate and error visibility
  - When to use X-Ray vs. CloudWatch Logs for investigation
  - X-Ray filter expressions and groups for isolating errors
  - The importance of sampling rules that prioritize errors over successful requests

Key concept: X-Ray sampling rules have priorities. A rule with Priority 1 and FixedRate 1.0 for error paths ensures 100% capture of errors regardless of the default sampling rate. The reservoir ensures at least N traces per second are captured even if the fixed rate would capture fewer.

Key Takeaway

X-Ray sampling at 5% is cost-effective for normal traffic but blind to low-frequency errors — always create a separate high-priority sampling rule that captures 100% of error traces.


Scenario 5: Graceful Degradation Message Confusing Users About Service Status

Situation

Date: Friday, 8:30 PM JST (peak evening traffic)
Trigger: Complete Bedrock regional outage in ap-northeast-1 lasting 12 minutes

At 8:30 PM, Bedrock experienced a complete regional outage. All model invocations failed. The MangaAssist fallback cascade executed correctly: Sonnet failed, Haiku failed, cache served where possible, and for cache misses, the graceful degradation message was displayed.

The problem was not technical — the fallback worked. The problem was the user experience of the graceful degradation message.

Graceful Degradation Message (version 1 — problematic):

  "I'm having a bit of trouble right now, but I'm still here!
   Could you try asking again in a moment? In the meantime,
   you can browse our manga catalog directly."

Problems with this message:
  1. "Try again in a moment" — Users spam retry, increasing load
  2. "I'm still here" — Implies the bot is working, just slow
  3. No estimated recovery time — Users don't know if it's 1 min or 1 hour
  4. "Browse our manga catalog directly" — Link not provided
  5. No acknowledgment that there IS a problem
  6. Same message for every query — "Are you ignoring my specific question?"

Timeline:
  8:30:00 PM  Bedrock outage begins
  8:30:05 PM  Circuit breakers open across ECS fleet
  8:30:05 PM  Cache serves 60% of queries (cache hit)
  8:30:05 PM  40% of queries receive graceful degradation message
  8:30:30 PM  Users retry repeatedly (message says "try again in a moment")
  8:31:00 PM  Retry traffic: +40% above normal (users following the message)
  8:31:00 PM  Social media: "MangaAssist is broken but pretending it's fine"
  8:32:00 PM  Support tickets: "The bot keeps saying try again but nothing works"
  8:33:00 PM  Some users interpret "I'm still here" as the chatbot claiming sentience
  8:35:00 PM  Operations team adds maintenance banner to website
  8:37:00 PM  Updated degradation message deployed (version 2)
  8:42:00 PM  Bedrock recovers
  8:42:30 PM  Normal service resumes
  8:45:00 PM  Post-incident: 2,100 support tickets about "confusing bot messages"

What Went Wrong

The graceful degradation message was designed by engineers focused on the technical implementation, not on user experience. Multiple UX failures:

  1. "Try again in a moment" — Encouraged retry storms. Users interpreted "a moment" as "right now" and hammered the retry button. This increased load by 40% during the outage, worsening the situation.

  2. "I'm still here" — Created false expectation. Users thought the bot was working but just slow, so they kept waiting and retrying rather than seeking alternatives.

  3. No context acknowledgment — The message was generic. A user asking "Where is my order?" received the same message as one asking "Recommend a manga." Users felt their specific question was being ignored.

  4. No service status information — Users could not tell if this was a 30-second hiccup or a major outage. Without this information, they could not make informed decisions (wait vs. call support vs. use the website directly).

  5. No action links — "Browse our manga catalog directly" provided no clickable link. Users would have to navigate away from the chat manually.

  6. Uncanny valley — The first-person "I'm still here" language made some users uncomfortable, as it sounded like the AI was claiming sentience or emotional status during what was clearly a system failure.

User Experience Comparison:

  BAD (version 1):
  ┌─────────────────────────────────────────────────┐
  │ User: Where is my order #12345?                 │
  │                                                 │
  │ Bot: I'm having a bit of trouble right now,     │
  │ but I'm still here! Could you try asking again  │
  │ in a moment? In the meantime, you can browse    │
  │ our manga catalog directly.                     │
  │                                                 │
  │ User: I asked about my ORDER, not the catalog!  │
  │                                                 │
  │ Bot: I'm having a bit of trouble right now...   │
  │ [same message repeated]                         │
  └─────────────────────────────────────────────────┘

  GOOD (version 2):
  ┌─────────────────────────────────────────────────┐
  │ User: Where is my order #12345?                 │
  │                                                 │
  │ Bot: I've received your question about order    │
  │ #12345. Our systems are currently experiencing  │
  │ temporary difficulties and I'm unable to look   │
  │ up order details right now.                     │
  │                                                 │
  │ Here's what you can do:                         │
  │ • Track your order: mangaassist.jp/orders/12345 │
  │ • Contact support: support@mangaassist.jp       │
  │ • Call us: 0120-XXX-XXXX (9AM-9PM JST)         │
  │                                                 │
  │ I'll be fully back within a few minutes and     │
  │ will be happy to help then. Our team is         │
  │ actively working on this.                       │
  │                                                 │
  │ [Service Status: Temporary disruption]          │
  │ [Auto-retry in: 2 minutes]                      │
  └─────────────────────────────────────────────────┘

Detection

Signal | Source | Alert | Time to Detect
Graceful degradation tier spike | Custom TierCount.Graceful metric | Warning alarm (>5% graceful serves) | 30 seconds
User retry spike | API Gateway RequestCount anomaly | Anomaly detection alarm | 2 minutes
Support ticket spike | Zendesk ticket count | Manual detection | 5 minutes
Social media complaints | Social monitoring tool | Manual detection | 5 minutes
User satisfaction drop | Post-chat survey scores | Delayed (batched daily) | Next day

What was detected quickly: The Bedrock outage and fallback tier shift were detected within 30 seconds. What was not detected: The UX impact of the degradation message. There was no metric for "user confusion" or "degradation message effectiveness."

Impact

Metric | Normal | During Incident
Users receiving degradation message | <0.1% | 40% of queries
User retry rate | 5% | 45% (9x increase)
Support tickets (12 minutes) | ~2 | 2,100
Social media complaints | ~1/hr | 340 in 30 minutes
User satisfaction score | 4.2/5 | 1.8/5 (for degraded users)
Users who left chat permanently | 0.5% | 8%
Additional load from retries | Baseline | +40% above normal

Business impact: The degradation message itself caused more damage than the outage. The retry storms increased infrastructure load by 40%, the support ticket flood overwhelmed the team (2,100 tickets in 12 minutes), and 8% of users who received the message never returned to the chatbot. The misleading "try again" language directly caused the retry storm.

Response Runbook

RUNBOOK: Graceful Degradation UX Issues
SEVERITY: P2 (High — customer experience, not technical failure)
ON-CALL TEAM: MangaAssist Platform Engineering + UX Team

STEP 1: DEPLOY IMPROVED DEGRADATION MESSAGE (0-5 minutes)
  ├── Switch to context-aware degradation message:
  │   └── Update ECS environment variable: DEGRADATION_MESSAGE_VERSION=v2
  │   └── V2 messages include:
  │       - Acknowledgment of the user's specific question type
  │       - Explicit service status ("Our AI assistant is temporarily unavailable")
  │       - Actionable alternatives with links
  │       - Estimated recovery time (if known)
  │       - NO "try again" language (prevents retry storms)
  ├── Add service status banner to chat UI:
  │   └── "Service Notice: AI assistant is in limited mode.
  │        You can still browse products and track orders."
  └── Disable auto-retry in client WebSocket (prevents automated retry loops)

STEP 2: REDUCE RETRY STORM (concurrent with Step 1)
  ├── Client-side: Update WebSocket handler to not auto-retry on degradation
  ├── Server-side: Add rate limit specifically for repeated identical queries
  │   └── Same user, same query hash within 60s → return cached degradation
  │   └── Do not re-process through the full fallback cascade
  └── Monitor: Watch retry rate metric for reduction

STEP 3: CUSTOMER COMMUNICATION (5-15 minutes)
  ├── Post service status on mangaassist.jp/status
  ├── Tweet from @MangaAssist: "We're aware of temporary AI assistant issues.
  │    You can still shop and track orders normally. We'll be fully back soon!"
  ├── Update in-app banner with current status
  └── Notify customer support team with pre-written response templates

STEP 4: EVALUATE MESSAGE EFFECTIVENESS (post-incident)
  ├── Survey users who received the degradation message
  ├── Analyze: Did users understand what was happening?
  ├── Analyze: Did users find the alternative actions helpful?
  ├── Compare retry rates between v1 and v2 messages
  └── A/B test improved message variants for future incidents

Resolution

  1. Context-aware degradation messages — Different messages for order queries, recommendations, general questions, and pricing queries. Each acknowledges the user's intent and provides relevant alternatives.

  2. Service status integration — The degradation message includes a real-time service status indicator and estimated recovery time (based on historical outage durations).

  3. Anti-retry design — Replace "try again" with "I'll notify you when I'm back" (if WebSocket is open) or a specific time ("check back in 5 minutes").

  4. Actionable alternatives with links — Every degradation message includes clickable deep links to the specific feature the user was asking about (order tracking, catalog, support).

  5. Degradation message A/B testing — Monthly chaos testing in staging where different degradation message variants are tested with internal users. Measure comprehension, satisfaction, and retry behavior.
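A minimal sketch of the context-aware selection in item 1; the intent keywords, URLs, and message copy are placeholders, not the production templates:

```python
DEGRADATION_TEMPLATES = {
    "order_status": (
        "I've received your question about your order. Our AI assistant is "
        "temporarily unavailable, so I can't look up the details right now.\n"
        "- Track it directly: https://mangaassist.jp/orders\n"
        "- Contact support: support@mangaassist.jp"
    ),
    "pricing": (
        "Our AI assistant is temporarily unavailable. Current prices are always "
        "shown on each product page: https://mangaassist.jp/catalog"
    ),
    "default": (
        "Our AI assistant is temporarily unavailable. You can still browse the "
        "catalog and track orders normally. I'll notify you here when I'm back."
    ),
}

def degradation_message(user_query: str) -> str:
    q = user_query.lower()
    if "order" in q or "#" in q:
        return DEGRADATION_TEMPLATES["order_status"]
    if any(w in q for w in ("price", "how much", "cost")):
        return DEGRADATION_TEMPLATES["pricing"]
    return DEGRADATION_TEMPLATES["default"]   # note: no "try again" language
```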

Prevention

Prevention Measure | Implementation | Priority
Context-aware degradation messages | Extract user intent before returning degradation message | P0
Anti-retry language | Remove all "try again" phrasing, add auto-notify | P0
Actionable deep links | Include relevant product/order/support URLs in message | P0
Service status page | Real-time status at mangaassist.jp/status | P1
Degradation message A/B testing | Monthly chaos test with UX evaluation | P1
Client-side retry suppression | WebSocket handler: exponential backoff on degradation | P1
Degradation UX metrics | Track "message comprehension" via post-chat survey | P2
UX review of all system messages | Quarterly UX audit of all bot error/fallback messages | P2

Exam Angle

What the AIP-C01 exam tests here:
  - Understanding of graceful degradation as a user experience concern, not just a technical one
  - The relationship between degradation messaging and system load (retry storms)
  - Designing fallback systems that maintain user trust during outages
  - The importance of actionable alternatives in degradation responses
  - How poorly designed resilience can amplify the impact of an outage

Key concept: Graceful degradation is not just about keeping the system running — it is about keeping the user informed and productive. A degradation message that causes a retry storm is worse than showing an error page, because it actively makes the outage worse while simultaneously frustrating users.

Key Takeaway

Graceful degradation messages must inform, not mislead — "try again in a moment" causes retry storms, while "here is what you can do instead" preserves user trust and reduces system load.


Cross-Scenario Summary

Common Patterns Across All 5 Scenarios

Pattern | Scenarios | Lesson
Cascading failure | 1, 2 | A resilience mechanism that triggers another failure mode (retries causing storms, rate limiting causing retry storms)
Detection gap | 1, 3, 4 | Metrics existed but alarms were not configured for the specific failure pattern
Investigation misdirection | 4, 5 | Using the wrong tool or metric for investigation delays root cause identification
UX as resilience | 2, 5 | User-facing messages are part of the resilience architecture — bad messages amplify impact
Content-awareness | 3 | Not all data is equal — pricing data needs different handling than recommendation data

Resilience Anti-Patterns Identified

Anti-Pattern 1: Exponential Backoff Without Jitter
  Scenario: 1
  Effect: Thundering herd across fleet
  Fix: Decorrelated jitter + fleet retry budget

Anti-Pattern 2: Static Rate Limits Without Auto-Scale
  Scenario: 2
  Effect: Blocks legitimate organic traffic
  Fix: Dynamic limits with anomaly detection

Anti-Pattern 3: Uniform Cache TTL for All Content Types
  Scenario: 3
  Effect: Stale prices served from cache
  Fix: Content-aware TTLs + event-driven invalidation

Anti-Pattern 4: Uniform X-Ray Sampling for All Traffic
  Scenario: 4
  Effect: Errors invisible in traces
  Fix: 100% error sampling rule

Anti-Pattern 5: Engineer-Designed Degradation Messages
  Scenario: 5
  Effect: Retry storms + user confusion
  Fix: UX-designed, context-aware messages with alternatives

Exam-Ready Quick Reference

Topic: Exponential Backoff (Scenario 1)
  - AWS SDK retry modes: standard, adaptive, legacy
  - Jitter strategies: full, equal, decorrelated
  - Circuit breaker: closed → open → half-open → closed
  - Fleet-wide retry budget prevents thundering herd

Topic: Rate Limiting (Scenario 2)
  - API Gateway: account, stage, method throttling
  - Token bucket: rate (steady) + burst (peak) + quota (period)
  - Usage plans: per-tier access control
  - Dynamic scaling for organic traffic spikes

Topic: Fallback/Degradation (Scenarios 3, 5)
  - Fallback cascade: primary → secondary → cached → static → graceful
  - Content-aware caching: different TTLs for different data types
  - Cache invalidation: event-driven (DynamoDB Streams)
  - Graceful degradation: UX design, not just technical implementation

Topic: Observability (Scenario 4)
  - X-Ray: distributed tracing, service maps, annotations
  - Sampling: reservoir (guaranteed) + fixed rate (probabilistic)
  - Error sampling: separate rule, 100% capture
  - CloudWatch: metrics, logs, alarms, dashboards, anomaly detection