FM Throughput Optimization — Scenarios and Runbooks
AWS AIP-C01 Task 4.2 — Skill 4.2.3: Optimize FM throughput — operational scenarios and runbooks
Context: MangaAssist e-commerce chatbot — Bedrock Claude 3 (Sonnet/Haiku), OpenSearch Serverless, DynamoDB, ECS Fargate, API Gateway WebSocket. 1M messages/day, peak 20K concurrent users. Target: sustain 1,000+ requests/minute to Bedrock.
Skill Mapping
| Certification | Domain | Task | Skill |
|---|---|---|---|
| AWS AIP-C01 | Domain 4 — Operational Efficiency & Optimization | Task 4.2 — Optimize FM performance | Skill 4.2.3 — Optimize FM throughput — operational scenarios and runbooks |
Skill scope: Five production scenarios covering throughput degradation patterns in MangaAssist, each analyzed with a structured Problem, Detection, Root Cause, Resolution, and Prevention breakdown plus a decision tree.
Scenario 1 — Bedrock Throttling During Manga Sale Event
Problem
MangaAssist is running a flash sale for a popular manga series (One Piece limited edition volumes). Within 10 minutes of the sale starting, CloudWatch alarms fire for ThrottleRate > 5%. The chatbot begins returning delayed responses. Some users see "We're experiencing high demand" fallback messages. Customer complaints spike in the support channel.
Detection
CloudWatch Alarm: MangaAssist/Throughput/ThrottleRate > 0.05 for 3 datapoints within 5 minutes
CloudWatch Alarm: MangaAssist/Throughput/CircuitState = "OPEN"
CloudWatch Alarm: MangaAssist/Throughput/DegradationRate > 0.03
X-Ray: Bedrock invoke_model segments showing 429 ThrottlingException
ECS Logs: "Circuit OPEN for anthropic.claude-3-sonnet: throttle_rate=0.12 >= 0.10"
Decision Tree
flowchart TD
A[ThrottleRate alarm fires] --> B{Check Bedrock<br/>Service Health}
B -->|Service degraded| C[AWS Service Issue<br/>Wait + fallback to Haiku]
B -->|Service healthy| D{Check account<br/>quota usage}
D -->|Near account limit| E{Multiple services<br/>sharing quota?}
D -->|Well below limit| F{Check request<br/>pattern}
E -->|Yes| G[Other services consuming<br/>Bedrock quota — throttle them]
E -->|No| H[Request quota increase<br/>via AWS Support]
F -->|Burst spike| I{Circuit breaker<br/>activating?}
F -->|Sustained high| J[Concurrency limit<br/>too high — reduce]
I -->|Yes| K[Circuit breaker working<br/>Monitor recovery]
I -->|No| L[Adaptive concurrency<br/>not reducing fast enough<br/>Lower backoff_factor]
style A fill:#d13212,color:#fff
style C fill:#f39c12,color:#000
style G fill:#ff9900,color:#000
style H fill:#1a73e8,color:#fff
style K fill:#2ecc71,color:#000
Root Cause
The flash sale drove 3x normal traffic (3,600 requests/minute vs 1,200 normal peak). The adaptive concurrency controller was configured with max_concurrency=50 for Sonnet, but the account's on-demand throughput quota allowed only 40 concurrent invocations across all callers. A separate analytics pipeline was consuming 8 slots for real-time dashboard updates, leaving only 32 for MangaAssist chat.
Resolution
Immediate (0-5 minutes):
1. Pause the analytics pipeline's Bedrock calls (they can use cached data during the sale).
2. Verify circuit breaker is activating — check CircuitState metric.
3. Confirm graceful degradation chain is working: Sonnet throttle -> Haiku fallback -> cached response -> template.
Short-term (5-30 minutes):
1. Reduce max_concurrency for Sonnet from 50 to 35 to stay within available quota.
2. Route simple queries (FAQ, store hours) to Haiku to free Sonnet capacity for complex queries.
3. Enable micro-batching for the most common sale-related queries ("Is One Piece vol 108 in stock?").
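The Haiku routing in step 2 could be as simple as an intent-based model selector. A minimal sketch; the intent labels and function name are hypothetical, and the model IDs are the public Claude 3 identifiers.
# Hypothetical intent-based router: simple, high-volume intents go to Haiku,
# freeing Sonnet capacity for complex conversational queries.
SIMPLE_INTENTS = {"faq", "store_hours", "order_status", "stock_check"}

SONNET = "anthropic.claude-3-sonnet-20240229-v1:0"
HAIKU = "anthropic.claude-3-haiku-20240307-v1:0"

def select_model(intent: str) -> str:
    return HAIKU if intent in SIMPLE_INTENTS else SONNET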
Medium-term (post-incident):
1. Request Bedrock provisioned throughput for anticipated sale events.
2. Implement Bedrock quota isolation: separate concurrency pools per workload with hard caps (see the sketch after this list).
3. Pre-warm the recommendation cache before sale events to reduce Bedrock dependency.
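One way to express the quota isolation in step 2 is a hard cap per workload, sized so the caps sum to no more than the account quota. This is a minimal sketch assuming an asyncio-based service; the pool names and sizes are illustrative.
import asyncio

# Illustrative hard caps per workload; 32 + 5 + 3 = 40 matches the account's
# on-demand concurrency quota from this incident.
WORKLOAD_POOLS = {
    "chat": asyncio.Semaphore(32),       # real-time chatbot traffic
    "analytics": asyncio.Semaphore(5),   # dashboard enrichment
    "batch": asyncio.Semaphore(3),       # catalog/batch jobs
}

async def invoke_with_pool(workload: str, invoke_fn):
    # A workload can never exceed its own cap, so a spike in one pool
    # cannot starve the others of Bedrock concurrency.
    async with WORKLOAD_POOLS[workload]:
        return await invoke_fn()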
Prevention
| Control | Implementation | Prevents |
|---|---|---|
| Pre-sale capacity planning | Request provisioned throughput 48 hours before events | Quota exhaustion |
| Workload isolation | Separate concurrency pools with hard caps per service | Cross-service contention |
| Pre-event cache warming | Generate and cache top-100 FAQ answers before sale | Unnecessary Bedrock calls |
| Load test before events | Run synthetic load at 3x expected peak | Surprise throttling |
| Automatic analytics throttle | EventBridge rule to pause analytics during sale events | Background quota consumption |
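The last control in the table (the automatic analytics throttle) could be wired up with an EventBridge rule along these lines; the event source, detail-type, and Lambda function are hypothetical.
import json
import boto3

events = boto3.client("events")

# Hypothetical rule: when the commerce platform emits a SaleStarted event,
# invoke a Lambda that drops the analytics workload's Bedrock concurrency cap to 0.
events.put_rule(
    Name="mangaassist-pause-analytics-on-sale",
    EventPattern=json.dumps({
        "source": ["mangaassist.commerce"],
        "detail-type": ["SaleStarted"],
    }),
    State="ENABLED",
)
events.put_targets(
    Rule="mangaassist-pause-analytics-on-sale",
    Targets=[{
        "Id": "pause-analytics",
        "Arn": "arn:aws:lambda:ap-northeast-1:123456789012:function:pause-analytics-bedrock",
    }],
)
# The Lambda also needs a resource-based permission allowing events.amazonaws.com to invoke it.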
Scenario 2 — Batch Inference Queue Growing Unbounded During Sustained Peak
Problem
After a manga publisher releases 200 new titles simultaneously, the catalog enrichment service enqueues 200 batch jobs to generate product descriptions. Normally these process within 30 minutes. But the queue has been growing for 2 hours. The BatchQueueAge metric shows the oldest message is 7,200 seconds old (2 hours). New arrivals are still being added, and the queue depth has reached 450 messages.
Detection
CloudWatch Alarm: MangaAssist/Batch/QueueDepth > 200 for 15 minutes
CloudWatch Alarm: MangaAssist/Batch/QueueAge > 3600 seconds
CloudWatch Metric: Batch/ProcessedPerMinute dropped from 15 to 3
ECS Logs: "Worker 2: circuit open, returning job msg-xxx to queue"
SQS Console: ApproximateNumberOfMessagesVisible = 450, ApproximateAgeOfOldestMessage = 7200
Decision Tree
flowchart TD
A[Batch queue depth alarm] --> B{Queue growing<br/>or stable?}
B -->|Growing| C{Inflow rate ><br/>Processing rate?}
B -->|Stable but large| D[Workers healthy but<br/>processing slowly —<br/>check concurrency limit]
C -->|Yes| E{Workers healthy?}
C -->|No, processing stopped| F{Check worker<br/>ECS tasks}
E -->|Yes, but throttled| G[Batch competing with<br/>real-time for Bedrock capacity]
E -->|Workers crashing| H[Check ECS logs<br/>for OOM / crash loop]
F -->|Tasks running| I[Workers blocked on<br/>concurrency acquire —<br/>circuit breaker open]
F -->|Tasks not running| J[ECS scaling issue —<br/>check desired count]
G --> K[Reduce batch concurrency<br/>OR pause batch during peak]
I --> L[Check real-time throttle<br/>rate — batch circuit inherits<br/>from shared controller]
style A fill:#d13212,color:#fff
style G fill:#ff9900,color:#000
style K fill:#1a73e8,color:#fff
style L fill:#f39c12,color:#000
Root Cause
The batch processor shares the Haiku concurrency pool with real-time order status queries. During the 2-hour peak following the title release, real-time traffic consumed 45 of the 50 Haiku concurrency slots. The batch processor's acquire() calls mostly blocked or were rejected by the circuit breaker, dropping batch throughput from 15 jobs/minute to 3 jobs/minute. Meanwhile, the catalog service continued enqueuing at the same rate.
Resolution
Immediate (0-10 minutes):
1. Verify real-time traffic is being served correctly — batch degradation is acceptable, real-time degradation is not.
2. Check SQS visibility timeout: if timeout < processing time, messages reappear and get double-processed. Extend to 600 seconds if needed.
3. Stop the catalog service from enqueuing new jobs temporarily (the 200 titles can wait).
Short-term (10-60 minutes):
1. Separate the batch concurrency pool from the real-time pool. Batch gets a dedicated 10-slot Haiku pool that does not compete with real-time.
2. Increase batch worker ECS desired count from 3 to 5 to drain the backlog faster once capacity frees up.
3. Verify the DLQ is not filling: messages that fail and return to the queue inflate depth artificially.
Medium-term (post-incident):
1. Implement queue inflow rate limiting: cap the catalog enrichment enqueue rate at 50 jobs per batch with a 5-minute cooldown.
2. Add an SQS queue depth check to the batch producer: stop enqueuing if depth > 300 (see the sketch after this list).
3. Pre-schedule large catalog imports for off-peak hours (2 AM - 6 AM JST).
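A sketch of the producer-side depth check from step 2, assuming the catalog service enqueues via boto3; the threshold of 300 comes from the runbook, while the function name is hypothetical.
import json
import boto3

sqs = boto3.client("sqs")
MAX_QUEUE_DEPTH = 300  # from step 2 above

def enqueue_enrichment_job(queue_url: str, job: dict) -> bool:
    # ApproximateNumberOfMessages is the visible-message count for the queue.
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )["Attributes"]
    if int(attrs["ApproximateNumberOfMessages"]) > MAX_QUEUE_DEPTH:
        # Back off: let the batch workers drain the backlog before adding more.
        return False
    sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(job))
    return True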
Prevention
| Control | Implementation | Prevents |
|---|---|---|
| Dedicated batch concurrency pool | Separate semaphore and Bedrock quota for batch | Real-time/batch contention |
| Inflow rate limiting | Producer checks queue depth before enqueue | Unbounded queue growth |
| Off-peak scheduling | EventBridge scheduled rule for large imports | Peak-hour batch competition |
| Backlog alarm with auto-scaling | ECS auto-scaling on SQS queue depth | Insufficient batch workers |
| Queue depth circuit breaker | Stop enqueue when depth > threshold | Runaway queue |
Scenario 3 — Concurrency Limiter Set Too Conservatively
Problem
MangaAssist has been running for two weeks after a throttling incident. The operations team manually set max_concurrency=15 for Sonnet and max_concurrency=25 for Haiku as a "safety measure." CloudWatch shows the effective throughput is 400 requests/minute — well below the 1,000 target. The throttle rate is 0.0% (zero throttles). P99 latency is 600ms for Sonnet and 250ms for Haiku — both far below their targets. Bedrock capacity is being wasted.
Detection
CloudWatch: MangaAssist/Throughput/InvocationsPerMinute = 400 (target: 1000+)
CloudWatch: MangaAssist/Throughput/ThrottleRate = 0.000 (suspiciously zero)
CloudWatch: MangaAssist/Throughput/EffectiveConcurrency = 14/15 (93% — at limit)
CloudWatch: MangaAssist/Throughput/P99Latency_Sonnet = 600ms (target: 1200ms)
CloudWatch: MangaAssist/Throughput/QueueDepth increasing during peak hours
User reports: "Chatbot is slow during lunchtime"
Decision Tree
flowchart TD
A[InvocationsPerMinute below target] --> B{ThrottleRate?}
B -->|> 0| C[Actually throttled —<br/>different problem]
B -->|= 0| D{EffectiveConcurrency<br/>vs limit?}
D -->|> 90% of limit| E[At concurrency ceiling<br/>with zero throttles —<br/>limit is too low]
D -->|< 50% of limit| F[Low demand or<br/>requests not reaching Bedrock]
E --> G{P99 latency<br/>vs target?}
G -->|Well below target| H[Significant headroom<br/>Increase max_concurrency]
G -->|Near target| I[At optimal point<br/>Increase cautiously]
G -->|Above target| J[Latency issue, not<br/>concurrency issue]
H --> K{Adaptive controller<br/>enabled?}
K -->|Yes, but capped| L[Raise max_concurrency<br/>Let controller adapt]
K -->|No, static| M[Enable adaptive<br/>concurrency controller]
style A fill:#d13212,color:#fff
style E fill:#ff9900,color:#000
style H fill:#2ecc71,color:#000
style L fill:#1a73e8,color:#fff
Root Cause
After the throttling incident two weeks ago, the on-call engineer manually overrode max_concurrency to low values as a stopgap. The adaptive concurrency controller was still running but its max_concurrency ceiling prevented it from growing beyond 15 (Sonnet) and 25 (Haiku). The controller repeatedly logged "p99=600ms < target=1200ms, growing 15 -> 15" — it wanted to grow but could not exceed the cap.
Meanwhile, the Bedrock account quota had been increased from 40 to 80 concurrent invocations (the quota increase requested after the incident was approved), but nobody updated the MangaAssist concurrency limits to take advantage of the new capacity.
Resolution
Immediate (0-5 minutes):
1. Review the adaptive controller logs to confirm it is hitting the ceiling:
grep "growing .* -> " /var/log/mangaassist/concurrency.log | tail -20
2. Increase max_concurrency for Sonnet from 15 to 50 and Haiku from 25 to 80.
3. The adaptive controller will automatically ramp up over the next 2-4 minutes.
Short-term (5-30 minutes):
1. Monitor EffectiveConcurrency as it ramps: expect 15 -> 20 -> 25 -> 30 -> ... over successive evaluation windows.
2. Verify ThrottleRate stays at 0% as concurrency grows.
3. Watch InvocationsPerMinute climb toward the 1,000+ target.
Medium-term (post-incident):
1. Remove manual concurrency overrides. The adaptive controller should manage limits autonomously.
2. Add an alarm for "underutilization": EffectiveConcurrency > 90% of the limit AND ThrottleRate == 0 AND P99 < 50% of target — this detects overly conservative limits (see the sketch after this list).
3. Document the Bedrock quota in a shared config that all services reference.
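The underutilization alarm from step 2 can be built as a single CloudWatch metric-math alarm. A sketch, assuming the custom metrics shown in the Detection section, a current Sonnet limit of 50, and the 1,200 ms P99 target; in practice the limit and target would come from the shared config mentioned in step 3.
import boto3

cloudwatch = boto3.client("cloudwatch")

def metric(metric_id, name, stat):
    # Helper to build a MetricDataQuery against the MangaAssist namespace.
    return {
        "Id": metric_id,
        "MetricStat": {
            "Metric": {"Namespace": "MangaAssist/Throughput", "MetricName": name},
            "Period": 300,
            "Stat": stat,
        },
        "ReturnData": False,
    }

cloudwatch.put_metric_alarm(
    AlarmName="MangaAssist-Concurrency-Underutilization",
    EvaluationPeriods=12,      # twelve 5-minute periods = sustained for ~1 hour
    DatapointsToAlarm=12,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    Metrics=[
        metric("conc", "EffectiveConcurrency", "Average"),
        metric("throttle", "ThrottleRate", "Average"),
        metric("p99", "P99Latency_Sonnet", "Maximum"),
        {
            "Id": "underutilized",
            # 50 = current Sonnet max_concurrency, 600 = 50% of the 1,200 ms target (assumptions)
            "Expression": "IF((conc / 50) > 0.9 AND throttle == 0 AND p99 < 600, 1, 0)",
            "ReturnData": True,
        },
    ],
)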
Prevention
| Control | Implementation | Prevents |
|---|---|---|
| Underutilization alarm | Alert when concurrency at ceiling with zero throttles | Wasted Bedrock capacity |
| No manual concurrency overrides | Adaptive controller manages limits; manual caps forbidden in runbook | Stale manual settings |
| Quota change propagation | When Bedrock quota changes, update all service configs via AppConfig | Config drift |
| Weekly capacity review | Automated report comparing used vs available concurrency | Unnoticed underutilization |
| Adaptive controller health check | Alert if controller is running but limit unchanged for 24+ hours | Stuck controller |
Scenario 4 — Priority Inversion: Batch Jobs Consuming Real-Time Capacity
Problem
MangaAssist's weekly recommendation pre-computation batch runs every Sunday at 10 AM JST — unfortunately overlapping with peak browsing hours. During this window, real-time chat responses slow from P99=1.2s to P99=4.8s. Users browsing manga on Sunday morning experience degraded chatbot performance. Order-related queries (P0) are still fast, but recommendation and browsing queries (P1, P2) are delayed because the batch recommendation job consumes the capacity they need.
Detection
CloudWatch: MangaAssist/Throughput/P99Latency_Sonnet spiked from 1200ms to 4800ms at 10:00 JST
CloudWatch: MangaAssist/Batch/ActiveConcurrency_Sonnet = 10 (batch pool)
CloudWatch: MangaAssist/RealTime/ActiveConcurrency_Sonnet = 28/30 (near limit)
CloudWatch: MangaAssist/Priority/QueueDepth_P1 = 85, QueueDepth_P2 = 230
X-Ray: Sonnet invocation latency increased 4x during batch window
ECS Logs: "Batch recommendation job started: 50,000 user profiles to process"
Decision Tree
flowchart TD
A[Real-time P99 spike<br/>during batch window] --> B{Batch and real-time<br/>share concurrency pool?}
B -->|Yes — shared| C[Priority inversion:<br/>batch starving real-time]
B -->|No — separate pools| D{Check Bedrock<br/>account-level throttle}
C --> E{Can batch be<br/>paused immediately?}
E -->|Yes| F[Pause batch<br/>Resume off-peak]
E -->|No, deadline| G{Can batch use<br/>different model?}
G -->|Yes| H[Downgrade batch<br/>from Sonnet to Haiku]
G -->|No, needs Sonnet| I[Reduce batch<br/>concurrency to 3<br/>Yield to real-time]
D -->|Throttled| J[Account quota shared<br/>across pools — still<br/>priority inversion at<br/>quota level]
D -->|Not throttled| K[Latency issue in<br/>Bedrock itself — not<br/>a concurrency problem]
style A fill:#d13212,color:#fff
style C fill:#ff9900,color:#000
style F fill:#2ecc71,color:#000
style H fill:#1a73e8,color:#fff
Root Cause
The recommendation batch job was configured to use Sonnet (for quality) with a dedicated batch pool of 10 concurrent slots. However, at the Bedrock account level, both the real-time pool (30 slots) and batch pool (10 slots) drew from the same 50-slot on-demand quota. During peak hours, the total demand (30 real-time + 10 batch = 40 active) approached the account limit, causing Bedrock to slow down all requests. Even though the pools were logically separate in MangaAssist, they competed at the Bedrock service level.
Additionally, the batch job was scheduled at 10 AM JST — one of the highest-traffic hours — because "that's when the data team wanted results."
Resolution
Immediate (0-5 minutes):
1. Pause the batch recommendation job: set batch concurrency to 0 or stop the ECS task.
2. Monitor real-time P99 latency recovering to normal within 1-2 minutes.
3. Notify the data team that the batch will resume during off-peak.
Short-term (same day):
1. Reschedule the recommendation batch to 2 AM - 6 AM JST (the lowest-traffic window).
2. Downgrade the batch job from Sonnet to Haiku — recommendation pre-computation does not need Sonnet quality since results are later filtered by a ranking algorithm.
3. Add a "batch pause during peak" circuit: if real-time P99 exceeds 2x target, automatically reduce batch concurrency to 0 (see the sketch after the schedule below).
Medium-term (post-incident):
1. Use Bedrock provisioned throughput for real-time traffic, isolating it from on-demand batch traffic at the Bedrock level.
2. Implement time-of-day batch scheduling: batch concurrency auto-adjusts based on the real-time traffic profile.
# Time-of-day batch concurrency schedule (JST)
BATCH_CONCURRENCY_SCHEDULE = {
# hour_jst: max_batch_concurrency
0: 20, # Midnight — full batch capacity
1: 20,
2: 20, # Off-peak: batch runs at full speed
3: 20,
4: 20,
5: 20,
6: 15, # Early morning: starting to ramp down
7: 10,
8: 5, # Morning peak approaching
9: 3,
10: 0, # Peak hours: NO batch processing
11: 0,
12: 0, # Lunch peak
13: 0,
14: 3, # Afternoon: minimal batch
15: 3,
16: 5,
17: 5,
18: 5, # Evening: moderate batch
19: 10,
20: 10,
21: 15, # Late evening: batch ramps up
22: 15,
23: 20, # Night: full batch capacity
}
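Applying the schedule at runtime, combined with the real-time priority gate from the short-term steps, might look like the following sketch; the function name, the latency input, and the 1,200 ms target carried over from the scenario are assumptions.
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9))
SONNET_P99_TARGET_MS = 1200  # real-time P99 target from the scenario

def effective_batch_concurrency(realtime_p99_ms: float) -> int:
    # Priority gate: if real-time latency exceeds 2x target, batch yields entirely.
    if realtime_p99_ms > 2 * SONNET_P99_TARGET_MS:
        return 0
    # Otherwise follow the time-of-day schedule defined above.
    return BATCH_CONCURRENCY_SCHEDULE[datetime.now(JST).hour]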
Prevention
| Control | Implementation | Prevents |
|---|---|---|
| Off-peak batch scheduling | EventBridge cron at 2 AM JST + time-of-day concurrency map | Batch/real-time contention |
| Real-time priority gate | Auto-pause batch when real-time P99 > 2x target | Priority inversion |
| Provisioned throughput for real-time | Bedrock provisioned throughput for Sonnet real-time pool | Account-level quota contention |
| Batch model downgrade | Use Haiku for batch where Sonnet quality is unnecessary | Wasted Sonnet capacity |
| Data team SLA alignment | Agree that batch results are delivered by 8 AM, not 10 AM | Schedule conflicts |
Scenario 5 — Dead-Letter Queue Filling with Retryable Errors
Problem
The operations dashboard shows the DLQ depth climbing steadily: 0 -> 12 -> 45 -> 120 messages over 2 hours. The DLQ alarm fired at depth > 0. Upon inspection, all DLQ messages contain ThrottlingException — which should be retryable. The primary queue is also growing because messages that should be retried are instead being discarded to the DLQ.
Detection
CloudWatch Alarm: MangaAssist/DLQ/Depth > 0
CloudWatch: MangaAssist/DLQ/Depth climbing: 0 -> 12 -> 45 -> 120 over 2 hours
SQS Console: DLQ messages all show error_type = "ThrottlingException"
SQS Console: Primary queue ApproximateReceiveCount on DLQ'd messages = 3 (MaxReceiveCount is 3)
CloudWatch: MangaAssist/Batch/ThrottleRate = 0.08 (8% — elevated but not extreme)
Decision Tree
flowchart TD
A[DLQ depth alarm] --> B{Error types in<br/>DLQ messages?}
B -->|All retryable<br/>ThrottlingException| C{Why are retryable<br/>errors in DLQ?}
B -->|Non-retryable<br/>ValidationException| D[Prompt/payload error<br/>Fix and redeploy]
B -->|Mixed| E[Handle each<br/>type separately]
C --> F{Check SQS<br/>MaxReceiveCount}
F -->|MaxReceiveCount = 3| G{Check visibility<br/>timeout vs<br/>backoff duration}
G -->|Timeout < Backoff| H[Root Cause: Message<br/>becomes visible before<br/>backoff completes —<br/>retry count inflated]
G -->|Timeout >= Backoff| I{Check actual<br/>retry logic}
I -->|Retries not implemented| J[Root Cause: No retry<br/>logic — SQS receive<br/>count = retry count]
I -->|Retries implemented| K{Backoff too short?}
K -->|Yes| L[Increase base_delay<br/>and max_delay]
K -->|No| M[Sustained throttle<br/>beyond retry budget]
H --> N[Increase visibility<br/>timeout to cover<br/>max backoff + processing]
style A fill:#d13212,color:#fff
style H fill:#ff9900,color:#000
style J fill:#e74c3c,color:#fff
style N fill:#2ecc71,color:#000
Root Cause
The SQS queue was configured with MaxReceiveCount=3 and VisibilityTimeout=30 seconds. The batch processor implemented exponential backoff: attempt 1 waits 1s, attempt 2 waits 2s, attempt 3 waits 4s. However, the backoff was happening within a single receive (the processor retried 3 times within the 30-second visibility window).
The problem: each SQS ReceiveMessage counts as one receive. If the processor receives the message, retries 3 times with backoff (total ~7s), and then fails, it deletes nothing. The message becomes visible again. The next ReceiveMessage is receive count 2. Three receives = MaxReceiveCount reached = DLQ.
The actual number of Bedrock attempts was therefore 3 in-process retries x 3 receives = 9 calls, not the 3 retries the operator expected. If those attempts were independent, an 8% throttle rate would make all nine failing vanishingly unlikely (0.08^9 ≈ 1e-10), yet 120 messages still reached the DLQ. The attempts were not independent: all nine landed within a couple of minutes of each other, inside the same sustained throttling episode. Because the retry budget was effectively the SQS receive count rather than an application-level counter with real backoff between attempts, messages exhausted that budget before the throttle cleared and were routed to the DLQ.
Resolution
Immediate (0-15 minutes):
1. Run the DLQ processor to re-enqueue all messages back to the primary queue (all are ThrottlingException = retryable).
2. Verify primary queue is processing normally after re-enqueue.
Short-term (15-60 minutes):
1. Fix the retry logic: track application-level retry count in the message body, not SQS ApproximateReceiveCount.
2. Increase MaxReceiveCount from 3 to 5 as additional safety margin.
3. Increase VisibilityTimeout from 30s to 300s to prevent premature message re-appearance during backoff.
Corrected retry tracking:
import json
import time
def process_message_with_correct_retry(sqs_client, queue_url, dlq_url, message, bedrock_client):
    """
    Correct retry logic: track retries in message body, not SQS receive count.
    """
    body = json.loads(message["Body"])
    # Track retries in the message body, not the SQS receive count
    app_retry_count = body.get("_retry_count", 0)
    max_app_retries = 5
    if app_retry_count >= max_app_retries:
        # Genuine exhaustion — send to DLQ
        sqs_client.send_message(
            QueueUrl=dlq_url,
            MessageBody=json.dumps({
                "original_request": body,
                "error_type": "ThrottlingException",
                "retry_count": app_retry_count,
                "exhausted_at": time.time(),
            }),
        )
        sqs_client.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"],
        )
        return
    try:
        # Attempt Bedrock invocation (single attempt, no inner retry loop)
        response = bedrock_client.invoke_model(
            modelId="anthropic.claude-3-haiku-20240307-v1:0",
            body=json.dumps(body["payload"]),
        )
        # Success — delete message
        sqs_client.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"],
        )
    except bedrock_client.exceptions.ThrottlingException:
        # Increment retry count and re-enqueue with delay
        body["_retry_count"] = app_retry_count + 1
        delay_seconds = min(900, int(2 ** app_retry_count))  # 1, 2, 4, 8, 16... max 900
        sqs_client.send_message(
            QueueUrl=queue_url,
            MessageBody=json.dumps(body),
            DelaySeconds=delay_seconds,
        )
        # Delete the original message (we re-enqueued with updated retry count)
        sqs_client.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"],
        )
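For completeness, a hypothetical polling loop that drives the handler above; the long-poll settings and the 300-second visibility timeout mirror the short-term fixes, and the queue URLs are assumed to come from configuration.
import boto3

def run_batch_worker(queue_url: str, dlq_url: str) -> None:
    sqs = boto3.client("sqs")
    bedrock = boto3.client("bedrock-runtime")
    while True:
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,       # long polling to reduce empty receives
            VisibilityTimeout=300,    # covers processing plus the re-enqueue with delay
        )
        for message in resp.get("Messages", []):
            process_message_with_correct_retry(sqs, queue_url, dlq_url, message, bedrock)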
Medium-term (post-incident):
1. Add a CloudWatch custom metric for ApplicationRetryCount vs SQSReceiveCount discrepancy.
2. Implement the DLQ processor Lambda that auto-classifies and re-enqueues retryable errors.
3. Add a dashboard panel showing DLQ message composition (retryable vs non-retryable).
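A sketch of the DLQ auto-processor from step 2, written as a scheduled Lambda handler; the queue URLs, the set of error types treated as retryable, and the message shape (which mirrors the DLQ payload produced by the handler above) are assumptions.
import json
import os
import boto3

sqs = boto3.client("sqs")
PRIMARY_QUEUE_URL = os.environ["PRIMARY_QUEUE_URL"]
DLQ_URL = os.environ["DLQ_URL"]
RETRYABLE_ERRORS = {"ThrottlingException"}  # extend with other transient error types as needed

def handler(event, context):
    # Drain the DLQ in pages of 10; intended to run every 15 minutes on an EventBridge schedule.
    while True:
        resp = sqs.receive_message(QueueUrl=DLQ_URL, MaxNumberOfMessages=10, WaitTimeSeconds=1)
        messages = resp.get("Messages", [])
        if not messages:
            return
        for msg in messages:
            body = json.loads(msg["Body"])
            if body.get("error_type") in RETRYABLE_ERRORS:
                # Reset the application retry counter and return the job to the primary queue.
                original = body["original_request"]
                original["_retry_count"] = 0
                sqs.send_message(QueueUrl=PRIMARY_QUEUE_URL, MessageBody=json.dumps(original))
                sqs.delete_message(QueueUrl=DLQ_URL, ReceiptHandle=msg["ReceiptHandle"])
            # Non-retryable messages are left in the DLQ for manual review.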
Prevention
| Control | Implementation | Prevents |
|---|---|---|
| Application-level retry tracking | Track retries in message body, not SQS receive count | Premature DLQ routing |
| Visibility timeout >> max backoff | Set visibility timeout to 5x max backoff duration | Premature message re-appearance |
| DLQ composition monitoring | CloudWatch metric for retryable vs non-retryable DLQ messages | Silent retryable message loss |
| DLQ auto-processor | Lambda processes DLQ every 15 minutes, re-enqueues retryable | Manual DLQ draining |
| SQS configuration review | Quarterly review of MaxReceiveCount, VisibilityTimeout, DelaySeconds | Configuration drift |
Cross-Scenario Summary
| Scenario | Root Cause Pattern | Primary Metric | Key Fix |
|---|---|---|---|
| 1 — Throttling during sale | Shared quota + traffic spike | ThrottleRate > 5% | Pre-event provisioned throughput + quota isolation |
| 2 — Unbounded batch queue | Batch/real-time concurrency contention | QueueDepth growing + QueueAge > 1hr | Dedicated batch concurrency pool |
| 3 — Conservative concurrency | Manual override not reverted | EffectiveConcurrency at ceiling, ThrottleRate = 0% | Enable adaptive controller, remove manual caps |
| 4 — Priority inversion | Batch scheduled during peak, shared Bedrock quota | Real-time P99 spike during batch window | Off-peak scheduling + provisioned throughput for real-time |
| 5 — DLQ filling with retryable errors | SQS receive count confused with app retry count | DLQ depth climbing with retryable errors | Application-level retry tracking |
Common Themes
- Isolation is the foundation — Every scenario involves some form of resource sharing that causes contention. Separate concurrency pools, provisioned throughput, and time-of-day scheduling are all forms of isolation.
- Adaptive beats static — Manual concurrency settings become stale within days. Adaptive controllers that respond to real-time signals consistently outperform static configuration.
- Monitor what you expect to be zero — A DLQ depth of zero is the expected state. A throttle rate of zero during high traffic is suspicious. Alarms on "expected zero" metrics catch problems that "expected non-zero" metrics miss.
- Backpressure must be intentional — When capacity is constrained, something must give. The choice of what gives (batch pauses, model downgrades, template responses) must be an explicit design decision, not an accident.
- Retry logic is harder than it looks — The gap between "SQS retries" and "application retries" is a common source of production incidents. Track retries explicitly in message payloads, not implicitly via infrastructure counters.