
FM Throughput Optimization — Scenarios and Runbooks

AWS AIP-C01 Task 4.2 — Skill 4.2.3: Optimize FM throughput — operational scenarios and runbooks

Context: MangaAssist e-commerce chatbot — Bedrock Claude 3 (Sonnet/Haiku), OpenSearch Serverless, DynamoDB, ECS Fargate, API Gateway WebSocket. 1M messages/day, peak 20K concurrent users. Target: sustain 1,000+ requests/minute to Bedrock.


Skill Mapping

| Certification | Domain | Task | Skill |
| --- | --- | --- | --- |
| AWS AIP-C01 | Domain 4 — Operational Efficiency & Optimization | Task 4.2 — Optimize FM performance | Skill 4.2.3 — Optimize FM throughput — operational scenarios and runbooks |

Skill scope: Five production scenarios covering throughput degradation patterns in MangaAssist, each with structured Problem, Detection, Root Cause, Resolution, and Prevention analysis with decision trees.


Scenario 1 — Bedrock Throttling During Manga Sale Event

Problem

MangaAssist is running a flash sale for a popular manga series (One Piece limited edition volumes). Within 10 minutes of the sale starting, CloudWatch alarms fire for ThrottleRate > 5%. The chatbot begins returning delayed responses. Some users see "We're experiencing high demand" fallback messages. Customer complaints spike in the support channel.

Detection

- CloudWatch Alarm: MangaAssist/Throughput/ThrottleRate > 0.05 for 3 datapoints within 5 minutes
- CloudWatch Alarm: MangaAssist/Throughput/CircuitState = "OPEN"
- CloudWatch Alarm: MangaAssist/Throughput/DegradationRate > 0.03
- X-Ray: Bedrock invoke_model segments showing 429 ThrottlingException
- ECS Logs: "Circuit OPEN for anthropic.claude-3-sonnet: throttle_rate=0.12 >= 0.10"
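
The first alarm above is an M-of-N condition: 3 breaching datapoints within a 5-minute window. A minimal sketch of that evaluation logic, assuming 1-minute ThrottleRate datapoints (the function name is illustrative, not part of the production stack):

```python
def throttle_alarm_state(datapoints, threshold=0.05, datapoints_to_alarm=3, window=5):
    """Evaluate an M-of-N alarm: ALARM when at least `datapoints_to_alarm`
    of the last `window` ThrottleRate datapoints exceed `threshold`."""
    recent = datapoints[-window:]
    breaches = sum(1 for d in recent if d > threshold)
    return "ALARM" if breaches >= datapoints_to_alarm else "OK"
```

CloudWatch implements the same semantics natively via DatapointsToAlarm and EvaluationPeriods; the sketch just makes the trigger condition concrete.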

Decision Tree

```mermaid
flowchart TD
    A[ThrottleRate alarm fires] --> B{Check Bedrock<br/>Service Health}
    B -->|Service degraded| C[AWS Service Issue<br/>Wait + fallback to Haiku]
    B -->|Service healthy| D{Check account<br/>quota usage}

    D -->|Near account limit| E{Multiple services<br/>sharing quota?}
    D -->|Well below limit| F{Check request<br/>pattern}

    E -->|Yes| G[Other services consuming<br/>Bedrock quota — throttle them]
    E -->|No| H[Request quota increase<br/>via AWS Support]

    F -->|Burst spike| I{Circuit breaker<br/>activating?}
    F -->|Sustained high| J[Concurrency limit<br/>too high — reduce]

    I -->|Yes| K[Circuit breaker working<br/>Monitor recovery]
    I -->|No| L[Adaptive concurrency<br/>not reducing fast enough<br/>Lower backoff_factor]

    style A fill:#d13212,color:#fff
    style C fill:#f39c12,color:#000
    style G fill:#ff9900,color:#000
    style H fill:#1a73e8,color:#fff
    style K fill:#2ecc71,color:#000
```

Root Cause

The flash sale drove 3x normal traffic (3,600 requests/minute vs 1,200 normal peak). The adaptive concurrency controller was configured with max_concurrency=50 for Sonnet, but the account's on-demand throughput quota allowed only 40 concurrent invocations across all callers. A separate analytics pipeline was consuming 8 slots for real-time dashboard updates, leaving only 32 for MangaAssist chat.

Resolution

Immediate (0-5 minutes):

1. Pause the analytics pipeline's Bedrock calls (they can use cached data during the sale).
2. Verify circuit breaker is activating — check the CircuitState metric.
3. Confirm graceful degradation chain is working: Sonnet throttle -> Haiku fallback -> cached response -> template.
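
The degradation chain in step 3 can be sketched as an ordered fallback with the model invokers and cache injected. All names here are illustrative, not the production interface; ThrottlingError stands in for Bedrock's ThrottlingException:

```python
class ThrottlingError(Exception):
    """Stand-in for Bedrock's ThrottlingException."""

FALLBACK_TEMPLATE = "We're experiencing high demand. Please try again shortly."

def respond_with_degradation(query, invoke_sonnet, invoke_haiku, cache):
    """Try Sonnet, then Haiku; on throttle, fall back to a cached answer,
    and finally to a static template so the user always gets a response."""
    for invoke in (invoke_sonnet, invoke_haiku):
        try:
            return invoke(query)
        except ThrottlingError:
            continue
    cached = cache.get(query)
    return cached if cached is not None else FALLBACK_TEMPLATE
```

The ordering encodes the business priority: quality degrades step by step, but availability never does.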

Short-term (5-30 minutes):

1. Reduce max_concurrency for Sonnet from 50 to 35 to stay within available quota.
2. Route simple queries (FAQ, store hours) to Haiku to free Sonnet capacity for complex queries.
3. Enable micro-batching for the most common sale-related queries ("Is One Piece vol 108 in stock?").
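
Step 2's routing can be as simple as a lookup from query intent to model ID. The intent labels below are hypothetical; the model IDs are the standard Bedrock identifiers:

```python
# Hypothetical intent labels produced by an upstream classifier.
SIMPLE_INTENTS = {"faq", "store_hours", "stock_check", "order_status"}

HAIKU = "anthropic.claude-3-haiku-20240307-v1:0"
SONNET = "anthropic.claude-3-sonnet-20240229-v1:0"

def pick_model(intent: str) -> str:
    """Send simple intents to Haiku so Sonnet slots stay free for complex queries."""
    return HAIKU if intent in SIMPLE_INTENTS else SONNET
```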

Medium-term (post-incident):

1. Request Bedrock provisioned throughput for anticipated sale events.
2. Implement Bedrock quota isolation: separate concurrency pools per workload with hard caps.
3. Pre-warm the recommendation cache before sales events to reduce Bedrock dependency.

Prevention

| Control | Implementation | Prevents |
| --- | --- | --- |
| Pre-sale capacity planning | Request provisioned throughput 48 hours before events | Quota exhaustion |
| Workload isolation | Separate concurrency pools with hard caps per service | Cross-service contention |
| Pre-event cache warming | Generate and cache top-100 FAQ answers before sale | Unnecessary Bedrock calls |
| Load test before events | Run synthetic load at 3x expected peak | Surprise throttling |
| Automatic analytics throttle | EventBridge rule to pause analytics during sale events | Background quota consumption |

Scenario 2 — Batch Inference Queue Growing Unbounded During Sustained Peak

Problem

After a manga publisher releases 200 new titles simultaneously, the catalog enrichment service enqueues 200 batch jobs to generate product descriptions. Normally these process within 30 minutes. But the queue has been growing for 2 hours. The BatchQueueAge metric shows the oldest message is 7,200 seconds old (2 hours). New arrivals are still being added, and the queue depth has reached 450 messages.

Detection

- CloudWatch Alarm: MangaAssist/Batch/QueueDepth > 200 for 15 minutes
- CloudWatch Alarm: MangaAssist/Batch/QueueAge > 3600 seconds
- CloudWatch Metric: Batch/ProcessedPerMinute dropped from 15 to 3
- ECS Logs: "Worker 2: circuit open, returning job msg-xxx to queue"
- SQS Console: ApproximateNumberOfMessagesVisible = 450, ApproximateAgeOfOldestMessage = 7200

Decision Tree

```mermaid
flowchart TD
    A[Batch queue depth alarm] --> B{Queue growing<br/>or stable?}
    B -->|Growing| C{Inflow rate ><br/>Processing rate?}
    B -->|Stable but large| D[Workers healthy but<br/>processing slowly —<br/>check concurrency limit]

    C -->|Yes| E{Workers healthy?}
    C -->|No, processing stopped| F{Check worker<br/>ECS tasks}

    E -->|Yes, but throttled| G[Batch competing with<br/>real-time for Bedrock capacity]
    E -->|Workers crashing| H[Check ECS logs<br/>for OOM / crash loop]

    F -->|Tasks running| I[Workers blocked on<br/>concurrency acquire —<br/>circuit breaker open]
    F -->|Tasks not running| J[ECS scaling issue —<br/>check desired count]

    G --> K[Reduce batch concurrency<br/>OR pause batch during peak]
    I --> L[Check real-time throttle<br/>rate — batch circuit inherits<br/>from shared controller]

    style A fill:#d13212,color:#fff
    style G fill:#ff9900,color:#000
    style K fill:#1a73e8,color:#fff
    style L fill:#f39c12,color:#000
```

Root Cause

The batch processor shares the Haiku concurrency pool with real-time order status queries. During the 2-hour peak following the title release, real-time traffic consumed 45 of the 50 Haiku concurrency slots. The batch processor's acquire() calls mostly blocked or were rejected by the circuit breaker, dropping batch throughput from 15 jobs/minute to 3 jobs/minute. Meanwhile, the catalog service continued enqueuing at the same rate.

Resolution

Immediate (0-10 minutes):

1. Verify real-time traffic is being served correctly — batch degradation is acceptable, real-time degradation is not.
2. Check SQS visibility timeout: if timeout < processing time, messages reappear and get double-processed. Extend to 600 seconds if needed.
3. Stop the catalog service from enqueuing new jobs temporarily (the 200 titles can wait).

Short-term (10-60 minutes):

1. Separate the batch concurrency pool from the real-time pool. Batch gets a dedicated 10-slot Haiku pool that does not compete with real-time.
2. Increase batch worker ECS desired count from 3 to 5 to drain the backlog faster once capacity frees up.
3. Verify DLQ is not filling: messages that fail and return to the queue inflate depth artificially.
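
Step 1's separation can be sketched with one bounded semaphore per workload, so batch saturation can never consume real-time slots. Pool names and sizes here are illustrative:

```python
import threading

# One pool per workload; exhausting the batch pool cannot touch real-time.
POOLS = {
    "realtime_haiku": threading.BoundedSemaphore(40),
    "batch_haiku": threading.BoundedSemaphore(10),
}

def acquire_slot(pool: str, timeout: float = 5.0) -> bool:
    """Try to take a slot from a dedicated pool; False means the caller
    should back off rather than starve another workload."""
    return POOLS[pool].acquire(timeout=timeout)

def release_slot(pool: str) -> None:
    POOLS[pool].release()
```

Note this isolates workloads inside one process; as Scenario 4 shows, pools must also be backed by separate quota at the Bedrock level to isolate fully.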

Medium-term (post-incident):

1. Implement queue inflow rate limiting: cap catalog enrichment enqueue rate at 50 jobs/batch with 5-minute cooldown.
2. Add SQS queue depth metric to the batch producer: stop enqueuing if depth > 300.
3. Pre-schedule large catalog imports for off-peak hours (2 AM - 6 AM JST).
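
Items 1 and 2 combine into a single producer-side gate, sketched here with the runbook's thresholds (the class name is illustrative):

```python
import time

class InflowGate:
    """Refuse enqueues when the queue is deep, cap batch size, and enforce
    a cooldown between batches (depth > 300 stop, 50 jobs/batch, 5 min)."""

    def __init__(self, max_depth=300, batch_cap=50, cooldown_s=300.0):
        self.max_depth = max_depth
        self.batch_cap = batch_cap
        self.cooldown_s = cooldown_s
        self._last_batch_at = float("-inf")

    def allow_batch(self, queue_depth, batch_size, now=None):
        """Return True and start the cooldown clock if this batch may be enqueued."""
        now = time.monotonic() if now is None else now
        if queue_depth > self.max_depth or batch_size > self.batch_cap:
            return False
        if now - self._last_batch_at < self.cooldown_s:
            return False
        self._last_batch_at = now
        return True
```

Before each enqueue the producer would read ApproximateNumberOfMessagesVisible and pass it in as queue_depth.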

Prevention

| Control | Implementation | Prevents |
| --- | --- | --- |
| Dedicated batch concurrency pool | Separate semaphore and Bedrock quota for batch | Real-time/batch contention |
| Inflow rate limiting | Producer checks queue depth before enqueue | Unbounded queue growth |
| Off-peak scheduling | EventBridge scheduled rule for large imports | Peak-hour batch competition |
| Backlog alarm with auto-scaling | ECS auto-scaling on SQS queue depth | Insufficient batch workers |
| Queue depth circuit breaker | Stop enqueue when depth > threshold | Runaway queue |

Scenario 3 — Concurrency Limiter Set Too Conservatively

Problem

MangaAssist has been running for two weeks after a throttling incident. The operations team manually set max_concurrency=15 for Sonnet and max_concurrency=25 for Haiku as a "safety measure." CloudWatch shows the effective throughput is 400 requests/minute — well below the 1,000 target. The throttle rate is 0.0% (zero throttles). P99 latency is 600ms for Sonnet and 250ms for Haiku — both far below their targets. Bedrock capacity is being wasted.

Detection

- CloudWatch: MangaAssist/Throughput/InvocationsPerMinute = 400 (target: 1000+)
- CloudWatch: MangaAssist/Throughput/ThrottleRate = 0.000 (suspiciously zero)
- CloudWatch: MangaAssist/Throughput/EffectiveConcurrency = 14/15 (93% — at limit)
- CloudWatch: MangaAssist/Throughput/P99Latency_Sonnet = 600ms (target: 1200ms)
- CloudWatch: MangaAssist/Throughput/QueueDepth increasing during peak hours
- User reports: "Chatbot is slow during lunchtime"

Decision Tree

```mermaid
flowchart TD
    A[InvocationsPerMinute below target] --> B{ThrottleRate?}
    B -->|> 0| C[Actually throttled —<br/>different problem]
    B -->|= 0| D{EffectiveConcurrency<br/>vs limit?}

    D -->|> 90% of limit| E[At concurrency ceiling<br/>with zero throttles —<br/>limit is too low]
    D -->|< 50% of limit| F[Low demand or<br/>requests not reaching Bedrock]

    E --> G{P99 latency<br/>vs target?}
    G -->|Well below target| H[Significant headroom<br/>Increase max_concurrency]
    G -->|Near target| I[At optimal point<br/>Increase cautiously]
    G -->|Above target| J[Latency issue, not<br/>concurrency issue]

    H --> K{Adaptive controller<br/>enabled?}
    K -->|Yes, but capped| L[Raise max_concurrency<br/>Let controller adapt]
    K -->|No, static| M[Enable adaptive<br/>concurrency controller]

    style A fill:#d13212,color:#fff
    style E fill:#ff9900,color:#000
    style H fill:#2ecc71,color:#000
    style L fill:#1a73e8,color:#fff
```

Root Cause

After the throttling incident two weeks ago, the on-call engineer manually overrode max_concurrency to low values as a stopgap. The adaptive concurrency controller was still running but its max_concurrency ceiling prevented it from growing beyond 15 (Sonnet) and 25 (Haiku). The controller repeatedly logged "p99=600ms < target=1200ms, growing 15 -> 15" — it wanted to grow but could not exceed the cap.

Meanwhile, the Bedrock account quota had been increased from 40 to 80 concurrent invocations (the quota increase requested after the incident was approved), but nobody updated the MangaAssist concurrency limits to take advantage of the new capacity.

Resolution

Immediate (0-5 minutes):

1. Review the adaptive controller logs to confirm it is hitting the ceiling:

   ```shell
   grep "growing .* -> " /var/log/mangaassist/concurrency.log | tail -20
   ```

2. Increase max_concurrency for Sonnet from 15 to 50 and Haiku from 25 to 80.
3. The adaptive controller will automatically ramp up over the next 2-4 minutes.

Short-term (5-30 minutes):

1. Monitor EffectiveConcurrency as it ramps: expect 15 -> 20 -> 25 -> 30 -> ... over successive evaluation windows.
2. Verify ThrottleRate stays at 0% as concurrency grows.
3. Watch InvocationsPerMinute climb toward the 1,000+ target.

Medium-term (post-incident):

1. Remove manual concurrency overrides. The adaptive controller should manage limits autonomously.
2. Add an alarm for "underutilization": EffectiveConcurrency > 90% AND ThrottleRate == 0 AND P99 < 50% of target — this detects overly conservative limits.
3. Document the Bedrock quota in a shared config that all services reference.
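
The underutilization condition in item 2 is a compound predicate; a minimal sketch (function name illustrative, thresholds from the runbook):

```python
def is_underutilized(effective_concurrency, limit, throttle_rate, p99_ms, p99_target_ms):
    """True when the service sits at its concurrency ceiling with zero
    throttles and large latency headroom — i.e. the limit is too low."""
    return (
        effective_concurrency > 0.9 * limit
        and throttle_rate == 0
        and p99_ms < 0.5 * p99_target_ms
    )
```

In CloudWatch this maps to a metric-math alarm combining the three metrics; the sketch just makes the three-way AND explicit.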

Prevention

| Control | Implementation | Prevents |
| --- | --- | --- |
| Underutilization alarm | Alert when concurrency at ceiling with zero throttles | Wasted Bedrock capacity |
| No manual concurrency overrides | Adaptive controller manages limits; manual caps forbidden in runbook | Stale manual settings |
| Quota change propagation | When Bedrock quota changes, update all service configs via AppConfig | Config drift |
| Weekly capacity review | Automated report comparing used vs available concurrency | Unnoticed underutilization |
| Adaptive controller health check | Alert if controller is running but limit unchanged for 24+ hours | Stuck controller |

Scenario 4 — Priority Inversion: Batch Jobs Consuming Real-Time Capacity

Problem

MangaAssist's weekly recommendation pre-computation batch runs every Sunday at 10 AM JST — unfortunately overlapping with peak browsing hours. During this window, real-time chat responses slow from P99=1.2s to P99=4.8s. Users browsing manga on Sunday morning experience degraded chatbot performance. Order-related queries (P0) are still fast, but recommendation and browsing queries (P1, P2) are delayed because the batch recommendation job consumes the capacity they need.

Detection

- CloudWatch: MangaAssist/Throughput/P99Latency_Sonnet spiked from 1200ms to 4800ms at 10:00 JST
- CloudWatch: MangaAssist/Batch/ActiveConcurrency_Sonnet = 10 (batch pool)
- CloudWatch: MangaAssist/RealTime/ActiveConcurrency_Sonnet = 28/30 (near limit)
- CloudWatch: MangaAssist/Priority/QueueDepth_P1 = 85, QueueDepth_P2 = 230
- X-Ray: Sonnet invocation latency increased 4x during batch window
- ECS Logs: "Batch recommendation job started: 50,000 user profiles to process"

Decision Tree

```mermaid
flowchart TD
    A[Real-time P99 spike<br/>during batch window] --> B{Batch and real-time<br/>share concurrency pool?}
    B -->|Yes — shared| C[Priority inversion:<br/>batch starving real-time]
    B -->|No — separate pools| D{Check Bedrock<br/>account-level throttle}

    C --> E{Can batch be<br/>paused immediately?}
    E -->|Yes| F[Pause batch<br/>Resume off-peak]
    E -->|No, deadline| G{Can batch use<br/>different model?}

    G -->|Yes| H[Downgrade batch<br/>from Sonnet to Haiku]
    G -->|No, needs Sonnet| I[Reduce batch<br/>concurrency to 3<br/>Yield to real-time]

    D -->|Throttled| J[Account quota shared<br/>across pools — still<br/>priority inversion at<br/>quota level]
    D -->|Not throttled| K[Latency issue in<br/>Bedrock itself — not<br/>a concurrency problem]

    style A fill:#d13212,color:#fff
    style C fill:#ff9900,color:#000
    style F fill:#2ecc71,color:#000
    style H fill:#1a73e8,color:#fff
```

Root Cause

The recommendation batch job was configured to use Sonnet (for quality) with a dedicated batch pool of 10 concurrent slots. However, at the Bedrock account level, both the real-time pool (30 slots) and batch pool (10 slots) drew from the same 50-slot on-demand quota. During peak hours, the total demand (30 real-time + 10 batch = 40 active) approached the account limit, causing Bedrock to slow down all requests. Even though the pools were logically separate in MangaAssist, they competed at the Bedrock service level.

Additionally, the batch job was scheduled at 10 AM JST — one of the highest-traffic hours — because "that's when the data team wanted results."

Resolution

Immediate (0-5 minutes):

1. Pause the batch recommendation job: set batch concurrency to 0 or stop the ECS task.
2. Monitor real-time P99 latency recovering to normal within 1-2 minutes.
3. Notify the data team that the batch will resume during off-peak.

Short-term (same day):

1. Reschedule the recommendation batch to 2 AM - 6 AM JST (lowest traffic window).
2. Downgrade the batch job from Sonnet to Haiku — recommendation pre-computation does not need Sonnet quality since results are later filtered by a ranking algorithm.
3. Add a "batch pause during peak" circuit: if real-time P99 exceeds 2x target, automatically reduce batch concurrency to 0.
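
Step 3's pause circuit can be sketched as a function of real-time P99. The pause-at-2x rule is from the runbook; the intermediate "halve above target" tier is an added illustration, not from the incident:

```python
def batch_concurrency_for(p99_ms, p99_target_ms, normal_limit):
    """Priority gate: stop batch when real-time P99 breaches 2x target,
    throttle it when above target, otherwise run at the normal limit."""
    if p99_ms > 2 * p99_target_ms:
        return 0  # real-time is suffering: batch yields completely
    if p99_ms > p99_target_ms:
        return max(1, normal_limit // 2)  # illustrative soft-throttle tier
    return normal_limit
```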

Medium-term (post-incident):

1. Use Bedrock provisioned throughput for real-time traffic, isolating it from on-demand batch traffic at the Bedrock level.
2. Implement time-of-day batch scheduling: batch concurrency auto-adjusts based on the real-time traffic profile.

```python
# Time-of-day batch concurrency schedule (JST)
BATCH_CONCURRENCY_SCHEDULE = {
    # hour_jst: max_batch_concurrency
    0: 20,   # Midnight — full batch capacity
    1: 20,
    2: 20,   # Off-peak: batch runs at full speed
    3: 20,
    4: 20,
    5: 20,
    6: 15,   # Early morning: starting to ramp down
    7: 10,
    8: 5,    # Morning peak approaching
    9: 3,
    10: 0,   # Peak hours: NO batch processing
    11: 0,
    12: 0,   # Lunch peak
    13: 0,
    14: 3,   # Afternoon: minimal batch
    15: 3,
    16: 5,
    17: 5,
    18: 5,   # Evening: moderate batch
    19: 10,
    20: 10,
    21: 15,  # Late evening: batch ramps up
    22: 15,
    23: 20,  # Night: full batch capacity
}
```
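
A schedule like this can be applied with a small JST-aware lookup; the helper name is illustrative:

```python
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9))

def current_batch_limit(schedule, now_utc):
    """Return the batch concurrency cap for the current hour in JST."""
    return schedule[now_utc.astimezone(JST).hour]
```

For example, 01:00 UTC is 10:00 JST — a peak hour — so the schedule above would return 0 and the batch workers would idle.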

Prevention

| Control | Implementation | Prevents |
| --- | --- | --- |
| Off-peak batch scheduling | EventBridge cron at 2 AM JST + time-of-day concurrency map | Batch/real-time contention |
| Real-time priority gate | Auto-pause batch when real-time P99 > 2x target | Priority inversion |
| Provisioned throughput for real-time | Bedrock provisioned throughput for Sonnet real-time pool | Account-level quota contention |
| Batch model downgrade | Use Haiku for batch where Sonnet quality is unnecessary | Wasted Sonnet capacity |
| Data team SLA alignment | Agree that batch results are delivered by 8 AM, not 10 AM | Schedule conflicts |

Scenario 5 — Dead-Letter Queue Filling with Retryable Errors

Problem

The operations dashboard shows the DLQ depth climbing steadily: 0 -> 12 -> 45 -> 120 messages over 2 hours. The DLQ alarm fired at depth > 0. Upon inspection, all DLQ messages contain ThrottlingException — which should be retryable. The primary queue is also growing because messages that should be retried are instead being discarded to the DLQ.

Detection

- CloudWatch Alarm: MangaAssist/DLQ/Depth > 0
- CloudWatch: MangaAssist/DLQ/Depth climbing: 0 -> 12 -> 45 -> 120 over 2 hours
- SQS Console: DLQ messages all show error_type = "ThrottlingException"
- SQS Console: Primary queue ApproximateReceiveCount on DLQ'd messages = 3 (MaxReceiveCount is 3)
- CloudWatch: MangaAssist/Batch/ThrottleRate = 0.08 (8% — elevated but not extreme)

Decision Tree

```mermaid
flowchart TD
    A[DLQ depth alarm] --> B{Error types in<br/>DLQ messages?}
    B -->|All retryable<br/>ThrottlingException| C{Why are retryable<br/>errors in DLQ?}
    B -->|Non-retryable<br/>ValidationException| D[Prompt/payload error<br/>Fix and redeploy]
    B -->|Mixed| E[Handle each<br/>type separately]

    C --> F{Check SQS<br/>MaxReceiveCount}
    F -->|MaxReceiveCount = 3| G{Check visibility<br/>timeout vs<br/>backoff duration}

    G -->|Timeout < Backoff| H[Root Cause: Message<br/>becomes visible before<br/>backoff completes —<br/>retry count inflated]
    G -->|Timeout >= Backoff| I{Check actual<br/>retry logic}

    I -->|Retries not implemented| J[Root Cause: No retry<br/>logic — SQS receive<br/>count = retry count]
    I -->|Retries implemented| K{Backoff too short?}

    K -->|Yes| L[Increase base_delay<br/>and max_delay]
    K -->|No| M[Sustained throttle<br/>beyond retry budget]

    H --> N[Increase visibility<br/>timeout to cover<br/>max backoff + processing]

    style A fill:#d13212,color:#fff
    style H fill:#ff9900,color:#000
    style J fill:#e74c3c,color:#fff
    style N fill:#2ecc71,color:#000
```

Root Cause

The SQS queue was configured with MaxReceiveCount=3 and VisibilityTimeout=30 seconds. The batch processor implemented exponential backoff: attempt 1 waits 1s, attempt 2 waits 2s, attempt 3 waits 4s. However, the backoff was happening within a single receive (the processor retried 3 times within the 30-second visibility window).

The problem: each SQS ReceiveMessage counts as one receive. If the processor receives the message, retries 3 times with backoff (total ~7s), and then fails, it deletes nothing. The message becomes visible again. The next ReceiveMessage is receive count 2. Three receives = MaxReceiveCount reached = DLQ.

But the actual retry attempts were 3 retries x 3 receives = 9 Bedrock calls, not the 3 retries the operator expected. With an 8% throttle rate, the probability of all 9 independent attempts failing is 0.08^9 ≈ 1.3 x 10^-10 — essentially zero. The real issue was that the retry counter was tracking SQS receive count, not application-level retries.

Resolution

Immediate (0-15 minutes):

1. Run the DLQ processor to re-enqueue all messages back to the primary queue (all are ThrottlingException = retryable).
2. Verify primary queue is processing normally after re-enqueue.

Short-term (15-60 minutes):

1. Fix the retry logic: track application-level retry count in the message body, not SQS ApproximateReceiveCount.
2. Increase MaxReceiveCount from 3 to 5 as additional safety margin.
3. Increase VisibilityTimeout from 30s to 300s to prevent premature message re-appearance during backoff.

Corrected retry tracking:

```python
import json
import time

def process_message_with_correct_retry(sqs_client, queue_url, dlq_url, message, bedrock_client):
    """
    Correct retry logic: track retries in message body, not SQS receive count.
    """
    body = json.loads(message["Body"])

    # Track retries in message attributes, not SQS receive count
    app_retry_count = body.get("_retry_count", 0)
    max_app_retries = 5

    if app_retry_count >= max_app_retries:
        # Genuine exhaustion — send to DLQ
        sqs_client.send_message(
            QueueUrl=dlq_url,
            MessageBody=json.dumps({
                "original_request": body,
                "error_type": "ThrottlingException",
                "retry_count": app_retry_count,
                "exhausted_at": time.time(),
            }),
        )
        sqs_client.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"],
        )
        return

    try:
        # Attempt Bedrock invocation (single attempt, no inner retry loop)
        response = bedrock_client.invoke_model(
            modelId="anthropic.claude-3-haiku-20240307-v1:0",
            body=json.dumps(body["payload"]),
        )
        # Success — delete message
        sqs_client.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"],
        )

    except bedrock_client.exceptions.ThrottlingException:
        # Increment retry count and re-enqueue with delay
        body["_retry_count"] = app_retry_count + 1
        delay_seconds = min(900, int(2 ** app_retry_count))  # 1, 2, 4, 8, 16... max 900

        sqs_client.send_message(
            QueueUrl=queue_url,
            MessageBody=json.dumps(body),
            DelaySeconds=delay_seconds,
        )
        # Delete the original message (we re-enqueued with updated retry count)
        sqs_client.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"],
        )
```

Medium-term (post-incident):

1. Add a CloudWatch custom metric for ApplicationRetryCount vs SQSReceiveCount discrepancy.
2. Implement the DLQ processor Lambda that auto-classifies and re-enqueues retryable errors.
3. Add a dashboard panel showing DLQ message composition (retryable vs non-retryable).
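
The auto-processor in step 2 hinges on classification. A sketch, assuming the error_type field written into the message body at DLQ time; the set of retryable error names is an assumption (only ThrottlingException appears in this incident):

```python
import json

# Assumed-retryable error names; ValidationException and the like are terminal.
RETRYABLE_ERRORS = {"ThrottlingException", "ServiceUnavailableException"}

def classify_dlq_messages(messages):
    """Split raw SQS messages into (retryable, terminal) by the error_type
    recorded in the body when the message was routed to the DLQ."""
    retryable, terminal = [], []
    for msg in messages:
        body = json.loads(msg["Body"])
        bucket = retryable if body.get("error_type") in RETRYABLE_ERRORS else terminal
        bucket.append(msg)
    return retryable, terminal
```

The Lambda would then re-enqueue the retryable bucket to the primary queue and leave the terminal bucket for manual review.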

Prevention

| Control | Implementation | Prevents |
| --- | --- | --- |
| Application-level retry tracking | Track retries in message body, not SQS receive count | Premature DLQ routing |
| Visibility timeout >> max backoff | Set visibility timeout to 5x max backoff duration | Premature message re-appearance |
| DLQ composition monitoring | CloudWatch metric for retryable vs non-retryable DLQ messages | Silent retryable message loss |
| DLQ auto-processor | Lambda processes DLQ every 15 minutes, re-enqueues retryable | Manual DLQ draining |
| SQS configuration review | Quarterly review of MaxReceiveCount, VisibilityTimeout, DelaySeconds | Configuration drift |

Cross-Scenario Summary

| Scenario | Root Cause Pattern | Primary Metric | Key Fix |
| --- | --- | --- | --- |
| 1 — Throttling during sale | Shared quota + traffic spike | ThrottleRate > 5% | Pre-event provisioned throughput + quota isolation |
| 2 — Unbounded batch queue | Batch/real-time concurrency contention | QueueDepth growing + QueueAge > 1hr | Dedicated batch concurrency pool |
| 3 — Conservative concurrency | Manual override not reverted | EffectiveConcurrency at ceiling, ThrottleRate = 0% | Enable adaptive controller, remove manual caps |
| 4 — Priority inversion | Batch scheduled during peak, shared Bedrock quota | Real-time P99 spike during batch window | Off-peak scheduling + provisioned throughput for real-time |
| 5 — DLQ filling with retryable errors | SQS receive count confused with app retry count | DLQ depth climbing with retryable errors | Application-level retry tracking |

Common Themes

  1. Isolation is the foundation — Every scenario involves some form of resource sharing that causes contention. Separate concurrency pools, provisioned throughput, and time-of-day scheduling are all forms of isolation.

  2. Adaptive beats static — Manual concurrency settings become stale within days. Adaptive controllers that respond to real-time signals consistently outperform static configuration.

  3. Monitor what you expect to be zero — A DLQ depth of zero is the expected state. A throttle rate of zero during high traffic is suspicious. Alarms on "expected zero" metrics catch problems that "expected non-zero" metrics miss.

  4. Backpressure must be intentional — When capacity is constrained, something must give. The choice of what gives (batch pauses, model downgrades, template responses) must be an explicit design decision, not an accident.

  5. Retry logic is harder than it looks — The gap between "SQS retries" and "application retries" is a common source of production incidents. Track retries explicitly in message payloads, not implicitly via infrastructure counters.