Skill 2.3.2 --- Scenarios and Runbooks: Integrated AI Capabilities Troubleshooting
MangaAssist context: JP Manga store chatbot on AWS — Bedrock Claude 3 (Sonnet at $3/$15 per 1M tokens input/output, Haiku at $0.25/$1.25), OpenSearch Serverless (vector store), DynamoDB (sessions/products), ECS Fargate (orchestrator), API Gateway WebSocket, ElastiCache Redis. Target: useful answer in under 3 seconds, 1M messages/day scale.
Skill Mapping
| Field | Value |
|---|---|
| Certification | AWS Certified AI Practitioner (AIP-C01) |
| Domain | 2 --- Development and Implementation of GenAI Applications |
| Task | 2.3 --- Describe methods to integrate foundation models into applications |
| Skill | 2.3.2 --- Develop integrated AI capabilities to enhance existing applications with GenAI functionality (API Gateway for microservice integrations, Lambda for webhook handlers, EventBridge for event-driven integrations) |
Scenario 1: API Gateway 29-Second Timeout on Long FM Responses
Situation
The MangaAssist team deploys a new "deep analysis" feature that lets users request detailed manga series comparisons. The feature uses Claude 3 Sonnet for multi-paragraph analysis with 4096 max_tokens. During load testing, 15% of requests to POST /v1/analyze return HTTP 504 Gateway Timeout errors. CloudWatch logs show the Lambda function is completing successfully at 31-35 seconds, but API Gateway drops the connection at 29 seconds.
Symptoms
CloudWatch API Gateway Logs:
"status": 504
"integrationLatency": 29000
"errorMessage": "Endpoint request timed out"
"requestId": "abc123-..."
"path": "/v1/analyze"
CloudWatch Lambda Logs:
[INFO] Analysis generated successfully in 33,412ms
[INFO] Response size: 12,847 bytes
(No errors in Lambda — it finishes AFTER API Gateway has already returned 504)
CloudWatch Metrics:
API Gateway IntegrationLatency p99: 28,700ms
API Gateway 5XXError count: 847 (in 1 hour)
Lambda Duration p99: 33,200ms
Root Cause Analysis
Root Cause: API Gateway REST API has a HARD 29-second integration timeout.
This is a service limit that CANNOT be increased. The Sonnet model invocation
for complex analysis takes 25-35 seconds, which frequently exceeds this limit.
Why 29 seconds?
- API Gateway REST API: 29s max (hard limit, not adjustable)
- API Gateway HTTP API: 30s max (hard limit, not adjustable)
- Neither can be raised via support ticket
Why does Sonnet take so long?
- 4096 max_tokens with complex reasoning prompt
- Multi-step comparison requires deep inference
- Sonnet at $3/$15/1M tokens is more capable but slower than Haiku
- Cold start adds 1-2s on first invocation per Lambda container
Timeline of a failing request:
t=0.0s Client sends POST /v1/analyze
t=0.2s API Gateway routes to Lambda
t=0.5s Lambda starts, loads session from DynamoDB
t=1.0s Lambda sends InvokeModel to Bedrock (Sonnet)
t=29.0s API Gateway timeout — returns 504 to client
t=29.0s (Lambda is STILL waiting for Bedrock response)
t=33.0s Bedrock returns response to Lambda
t=33.5s Lambda completes — but API Gateway already disconnected
t=33.5s Lambda response is DISCARDED (nobody is listening)
Incorrect Approaches
WRONG: "Increase the API Gateway timeout to 60 seconds"
-> API Gateway REST/HTTP API timeout is a hard limit (29s/30s).
It cannot be increased. This is not a configurable setting.
WRONG: "Switch to Haiku to make it faster"
-> Haiku is faster but less capable for deep analysis tasks.
The analysis quality would degrade significantly.
WRONG: "Increase Lambda memory to speed up Bedrock calls"
-> Lambda memory affects Lambda compute, not Bedrock inference time.
The bottleneck is Bedrock model inference, not Lambda execution.
WRONG: "Add a longer timeout in the CDK Lambda configuration"
-> Lambda timeout can be up to 15 minutes, but API Gateway
will still cut the connection at 29 seconds regardless.
Correct Solution
SOLUTION: Implement asynchronous pattern with polling or WebSocket streaming.
Option A: Async with Polling (simplest)
========================================
1. Client sends POST /v1/analyze
2. Lambda immediately returns 202 Accepted with a job_id
3. Lambda publishes to SQS queue
4. Background Lambda picks up from SQS, invokes Sonnet (no API GW timeout)
5. Background Lambda stores result in DynamoDB
6. Client polls GET /v1/analyze/{job_id} until status = "completed"
Flow:
Client -> API GW -> Lambda (returns 202 + job_id in <1s)
                      |
                      +-> SQS -> Background Lambda -> Bedrock Sonnet (33s, no timeout)
                                                        |
                                                        +-> DynamoDB (store result)

Client -> API GW -> Lambda (GET /v1/analyze/{job_id} -> reads DynamoDB)
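A minimal sketch of Option A's two handlers, assuming hypothetical names (an ANALYZE_QUEUE_URL environment variable and an AnalysisJobs DynamoDB table); validation and error handling omitted:
import json
import os
import uuid

import boto3

sqs = boto3.client("sqs")
jobs = boto3.resource("dynamodb").Table("AnalysisJobs")  # hypothetical table name

def submit_handler(event, context):
    """POST /v1/analyze: record the job, enqueue it, return 202 in <1s."""
    job_id = str(uuid.uuid4())
    jobs.put_item(Item={"job_id": job_id, "status": "pending"})
    sqs.send_message(
        QueueUrl=os.environ["ANALYZE_QUEUE_URL"],  # hypothetical env var
        MessageBody=json.dumps({"job_id": job_id, "request": json.loads(event["body"])}),
    )
    return {"statusCode": 202, "body": json.dumps({"job_id": job_id})}

def status_handler(event, context):
    """GET /v1/analyze/{job_id}: client polls until status == "completed"."""
    job_id = event["pathParameters"]["job_id"]
    item = jobs.get_item(Key={"job_id": job_id}).get("Item", {"status": "not_found"})
    return {"statusCode": 200, "body": json.dumps(item)}
The background consumer (not shown) invokes Sonnet with no API Gateway in the path, then updates the item to status "completed" with the result.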
Option B: WebSocket Streaming (best UX)
=========================================
1. Client opens WebSocket connection via API Gateway WebSocket API
2. Client sends { "action": "analyze", "data": {...} }
3. Lambda routes to ECS Fargate orchestrator
4. Fargate invokes Bedrock with response streaming
5. Tokens stream back to client in real-time via WebSocket
6. No timeout issue — WebSocket connections persist for 2 hours (idle 10min)
Flow:
Client <-> WS API GW <-> Lambda <-> Fargate <-> Bedrock (streaming)
(tokens flow back incrementally, first token in ~2s)
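A sketch of the Fargate-side streaming loop; the WebSocket callback endpoint URL is a placeholder, connection bookkeeping is omitted, and chunk parsing follows the Claude 3 Messages streaming format on Bedrock:
import json

import boto3

bedrock = boto3.client("bedrock-runtime")
# Callback endpoint: https://{api-id}.execute-api.{region}.amazonaws.com/{stage}
apigw = boto3.client(
    "apigatewaymanagementapi",
    endpoint_url="https://example.execute-api.ap-northeast-1.amazonaws.com/prod")

def stream_analysis(connection_id: str, prompt: str) -> None:
    """Relay Sonnet tokens to one WebSocket client as they arrive."""
    response = bedrock.invoke_model_with_response_stream(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 4096,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )
    for event in response["body"]:
        chunk = json.loads(event["chunk"]["bytes"])
        # Incremental text arrives in content_block_delta events
        if chunk.get("type") == "content_block_delta":
            apigw.post_to_connection(
                ConnectionId=connection_id,
                Data=chunk["delta"].get("text", "").encode("utf-8"),
            )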
Option C: Chunked Processing (if analysis is decomposable)
==========================================================
1. Split analysis into sub-tasks (each under 15s)
2. Use Step Functions to orchestrate sub-tasks
3. Each step invokes Haiku for a sub-component
4. Final step assembles the full analysis
5. Total time may be longer, but each API call is within timeout
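A sketch of Option C wired up with boto3 Step Functions; the state machine name, Lambda ARNs, and execution role are placeholders, and the two-branch decomposition is illustrative:
import json

import boto3

sfn = boto3.client("stepfunctions")

# Two Haiku-backed sub-analyses run in parallel, then a final assembly step
definition = {
    "StartAt": "SubAnalyses",
    "States": {
        "SubAnalyses": {
            "Type": "Parallel",
            "Next": "Assemble",
            "Branches": [
                {"StartAt": "PlotCompare", "States": {"PlotCompare": {
                    "Type": "Task",
                    "Resource": "arn:aws:lambda:ap-northeast-1:123456789012:function:plot-compare",
                    "End": True}}},
                {"StartAt": "ArtCompare", "States": {"ArtCompare": {
                    "Type": "Task",
                    "Resource": "arn:aws:lambda:ap-northeast-1:123456789012:function:art-compare",
                    "End": True}}},
            ],
        },
        "Assemble": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:ap-northeast-1:123456789012:function:assemble-analysis",
            "End": True,
        },
    },
}

sfn.create_state_machine(
    name="MangaAssist-ChunkedAnalysis",  # hypothetical name
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/MangaAssist-SfnRole",  # placeholder
)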
Implementation for MangaAssist:
-> Option B (WebSocket streaming) for the /analyze endpoint
-> Already have WebSocket API Gateway for chat
-> Reuse existing infrastructure, add "analyze" route
Prevention
1. Set Lambda timeout to 25s for API Gateway-backed functions (4s buffer)
2. Add CloudWatch alarm on API Gateway 5XXError > 1% for 5 minutes
3. Implement client-side timeout handling with retry and fallback UI
4. Use Bedrock response streaming for any endpoint that may exceed 10s
5. Monitor Bedrock InvokeModel latency per model — alert if p95 > 20s
Scenario 2: Webhook Handler Lambda Concurrency Exhausted
Situation
MangaAssist runs a "Golden Week" promotion. A manga publisher sends a catalog update webhook with 5,000 new items. Simultaneously, Stripe sends a surge of payment webhooks (300+ per minute from promotional purchases). The webhook handler Lambda functions hit their reserved concurrency limit of 50, causing Stripe to receive 429 errors and begin aggressive retries. Within 10 minutes, the retry amplification creates a thundering herd that exhausts the entire account's unreserved concurrency pool.
Symptoms
CloudWatch Lambda Metrics:
Throttles count: 4,287 (in 15 minutes)
ConcurrentExecutions: 50/50 (at reserved limit)
Errors: 0 (no code errors — pure throttling)
CloudWatch Logs (visible for the few that got through):
[INFO] Processing catalog webhook: batch_size=5000
[INFO] Publishing 10 events to EventBridge (batch 1 of 500)
...
[INFO] Publishing 10 events to EventBridge (batch 47 of 500)
REPORT Duration: 28,400ms Memory: 512MB
Stripe Dashboard:
Webhook delivery failures: 312
Webhook retry attempts: 1,248 (4x amplification)
Status: "Endpoint temporarily disabled by Stripe"
Account-Level Impact:
Other Lambda functions (chat, search) also throttled
Unreserved concurrency pool depleted
MangaAssist chatbot returning 500 errors to all users
Root Cause Analysis
Root Cause: Multiple compounding failures in concurrency management.
Problem 1: Catalog webhook processes 5,000 items synchronously
- Single Lambda invocation runs for 28s processing all items
- Holds a concurrency slot for the entire duration
- Should batch into smaller chunks via SQS
Problem 2: Reserved concurrency too low for burst
- Reserved concurrency = 50 for all webhook handlers
- Normal load: 5-10 concurrent, so 50 seemed generous
- Promotional burst: 300+ webhook/min = 50+ concurrent easily
Problem 3: No separation between webhook sources
- All sources (stripe, catalog, reviews) share the same 50 slots
- Catalog webhook hogs slots with long-running processing
- Stripe webhooks starved of concurrency
Problem 4: Stripe retry amplification
- Stripe retries failed webhooks: 1, 2, 4, 8, 16, 32, 64 minutes
- During burst, retries compound on top of new webhooks
- Creates exponential load increase
Problem 5: No account-level concurrency protection
- Reserved concurrency caps the webhook handler at 50; it cannot spill
  beyond that, so the real leak is downstream
- EventBridge events published by the handler fan out to consumer Lambdas
  with NO reserved concurrency, which drain the shared unreserved pool
- Blast radius extends to the entire MangaAssist service (chat and search
  starved of concurrency)
Concurrency math during the incident:
Catalog webhooks: 10 invocations running ~28s each = 10 slots held continuously
Stripe webhooks: 300/min = 5/sec, each ~2s = 10 concurrent
Stripe retries: accumulating to 15/sec = 30 concurrent
Total demand: ~50 concurrent (hitting reserved limit)
Overflow: demand past the cap is throttled (429s to Stripe), while the
EventBridge fan-out to unreserved consumers drains the shared pool
Incorrect Approaches
WRONG: "Just increase reserved concurrency to 500"
-> Addresses symptom not cause. Catalog webhook still processes
5,000 items in a single invocation. During larger promotions,
even 500 would be insufficient. Also increases blast radius.
WRONG: "Remove reserved concurrency limits entirely"
-> Without reserved concurrency, webhook functions can consume
the entire account concurrency pool (default 1,000), leaving
nothing for production chat/search functions.
WRONG: "Add provisioned concurrency for webhook handlers"
-> Provisioned concurrency ensures warm starts but does NOT
increase the concurrency limit. Still capped at reserved count.
Also expensive ($0.000004646/GB-sec * 24/7 for 50 instances).
WRONG: "Process catalog items in parallel using threading in Lambda"
-> Lambda has limited CPU (proportional to memory). Threading
doesn't solve the EventBridge PutEvents bottleneck (10 entries/call).
Still a single long-running invocation holding concurrency.
Correct Solution
SOLUTION: Decouple webhook receipt from processing using SQS buffering.
Architecture Change:
====================
BEFORE (broken):
Stripe/Catalog -> Lambda URL -> [Process + EventBridge] -> Done
(single Lambda does everything, holds concurrency for full duration)
AFTER (fixed):
Stripe/Catalog -> Lambda URL -> [Validate + SQS] -> Return 200 (<1s)
                                     |
                                     v
                                SQS Queue(s)
                                     |
                                     v
                      Lambda Consumer(s) -> EventBridge
                      (separate concurrency pool)
Step-by-step implementation:
1. SEPARATE FUNCTIONS PER SOURCE
- stripe-webhook-receiver (Lambda URL, reserved concurrency = 20)
- catalog-webhook-receiver (Lambda URL, reserved concurrency = 10)
- review-webhook-receiver (Lambda URL, reserved concurrency = 10)
Each does ONLY: validate signature -> SQS enqueue -> return 200
(a receiver sketch follows the Result block below)
2. SQS QUEUES PER SOURCE
- MangaAssist-Stripe-Webhooks (standard queue)
- MangaAssist-Catalog-Webhooks (standard queue)
- MangaAssist-Review-Webhooks (standard queue)
Each with DLQ for failed processing
3. CONSUMER LAMBDAS (separate from receivers)
- stripe-processor (reserved concurrency = 30)
- catalog-processor (reserved concurrency = 20, batch size = 10)
- review-processor (reserved concurrency = 20)
SQS Lambda event source mapping with maxConcurrency
4. CATALOG WEBHOOK: DECOMPOSE BATCH
- Receiver: accept 5,000-item batch, split into SQS messages of 10 each
- Consumer: process 10 items per invocation (< 5s each)
- 5,000 items = 500 SQS messages, processed in parallel by 20 consumers
5. ACCOUNT-LEVEL PROTECTION
- Set account reserved concurrency for non-critical Lambdas
- Production functions (chat, search) get reserved concurrency = 200 each
- Total reserved: 200 + 200 + 20 + 10 + 10 + 30 + 20 + 20 = 510
- Remaining unreserved: 490 (buffer for other functions)
Result:
- Webhook receipt: <1s (validate + SQS), concurrency = 1 slot briefly
- Processing: async via SQS, controlled concurrency per consumer
- No blast radius: each function has isolated reserved concurrency
- Catalog batch: 500 parallel messages, each processed in 5s
- Stripe retries: unnecessary (receiver always returns 200 quickly)
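A sketch of the step-1 receiver for Stripe, assuming the official stripe Python library for signature verification and hypothetical environment variable names; the consumer cap at the bottom uses the real ScalingConfig.MaximumConcurrency setting on the SQS event source mapping:
import json
import os

import boto3
import stripe  # used only for webhook signature verification

sqs = boto3.client("sqs")

def stripe_receiver(event, context):
    """Do the minimum: verify the signature, enqueue, return 200 fast."""
    payload = event["body"]
    signature = event["headers"].get("stripe-signature", "")
    try:
        stripe.Webhook.construct_event(
            payload, signature, os.environ["STRIPE_ENDPOINT_SECRET"])
    except (ValueError, stripe.error.SignatureVerificationError):
        return {"statusCode": 400, "body": "invalid signature"}
    sqs.send_message(QueueUrl=os.environ["STRIPE_QUEUE_URL"], MessageBody=payload)
    return {"statusCode": 200, "body": json.dumps({"received": True})}

# Cap the paired consumer via the event source mapping (placeholder ARN)
boto3.client("lambda").create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:ap-northeast-1:123456789012:MangaAssist-Stripe-Webhooks",
    FunctionName="stripe-processor",
    BatchSize=10,
    ScalingConfig={"MaximumConcurrency": 30},
)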
Prevention
1. Never process large batches synchronously in a webhook receiver
2. Always separate "receipt" (fast, return 200) from "processing" (async)
3. Use per-source Lambda functions with isolated reserved concurrency
4. Set SQS maxConcurrency on Lambda event source mapping
5. Monitor account-level ConcurrentExecutions with alarm at 80% of limit
6. Configure Stripe webhook retry behavior (reduce max retries to 3)
7. Load test promotional scenarios before launch
Scenario 3: EventBridge FM Trigger Creating Infinite Event Loop
Situation
A developer adds a new EventBridge rule to automatically generate AI-powered response suggestions for flagged customer reviews. The rule triggers on ModerationFlagged events and invokes a Lambda that generates a response suggestion using Claude 3 Haiku. However, the Lambda also publishes a SuggestionGenerated event, and a catch-all analytics rule routes all mangaassist.ai events back to the same processing pipeline. Within 3 minutes, the event loop amplifies to 10,000+ events per second, exhausting both the EventBridge PutEvents quota and the Lambda concurrency pool.
Symptoms
CloudWatch EventBridge Metrics:
PutEvents invocations: 847,000 (in 5 minutes)
FailedInvocations: 312,000
ThrottledRules: 45
CloudWatch Lambda Metrics:
Invocations: 523,000 (in 5 minutes)
ConcurrentExecutions: 1,000/1,000 (account limit)
Throttles: 189,000
Duration avg: 2,100ms (Haiku invocations)
CloudWatch Bedrock Metrics:
InvocationCount: 312,000 (in 5 minutes)
ThrottlingExceptions: 89,000
EstimatedCost: $78.00 (5 minutes of Haiku at $0.25/$1.25/1M)
EventBridge DLQ (SQS):
ApproximateNumberOfMessages: 156,000 (and growing)
Alarm:
ALARM: MangaAssist-UnexpectedCost - Cost exceeded $50 threshold
ALARM: MangaAssist-ConcurrencyExhausted - 100% utilization
Root Cause Analysis
Root Cause: Circular event chain between three EventBridge rules.
The event loop path:
1. ReviewSubmitted event arrives (legitimate, from webhook)
-> Rule: MangaAssist-ReviewSubmitted
-> Target: moderation Lambda
-> Lambda publishes: ModerationFlagged (if unsafe)
2. ModerationFlagged event arrives
-> Rule: MangaAssist-ModerationFlagged
-> Target: suggestion Lambda (NEW rule, the bug)
-> Lambda invokes Haiku, generates suggestion
-> Lambda publishes: SuggestionGenerated (source: mangaassist.ai)
3. SuggestionGenerated event arrives
-> Rule: MangaAssist-AllEvents-Analytics (catch-all)
-> Target: analytics Lambda
-> Analytics Lambda publishes: AnalyticsProcessed (source: mangaassist.ai)
4. AnalyticsProcessed event arrives
-> Rule: MangaAssist-AllEvents-Analytics (catch-all matches AGAIN)
-> Back to step 3. INFINITE LOOP.
It is worse still: the analytics Lambda also triggers the suggestion Lambda,
because the catch-all rule fans out to multiple targets, one of which
re-processes events and publishes new ones.
Amplification factor per cycle:
1 event -> 2 targets -> 2 events -> 4 targets -> exponential growth
After 10 cycles: 1,024 events from 1 original review
Time per cycle: ~2 seconds (Lambda + Bedrock + EventBridge)
In 3 minutes: ~90 cycles possible
Theoretical events: 2^90 (but throttled long before that)
Incorrect Approaches
WRONG: "Add a duplicate check in each Lambda to prevent re-processing"
-> Each event has a unique event ID, so duplicate detection
doesn't work — every event in the loop is technically unique.
WRONG: "Reduce the Lambda concurrency to slow down the loop"
-> Slows the loop but doesn't stop it. Events queue up in
EventBridge retry/DLQ and resume when concurrency frees up.
Also impacts legitimate event processing.
WRONG: "Delete the catch-all analytics rule"
-> Removes analytics capability entirely. The real fix is to
prevent the loop while keeping analytics.
WRONG: "Add a time-based filter to skip events older than 5 seconds"
-> EventBridge events are delivered quickly (<1s typically).
Time-based filtering wouldn't catch loop events because
each new event has a fresh timestamp.
Correct Solution
SOLUTION: Multi-layer loop prevention with source differentiation and chain depth.
Layer 1: Source Differentiation (prevent cross-contamination)
=============================================================
Rule: MangaAssist-AllEvents-Analytics
BEFORE (broken):
EventPattern:
source: [{ "prefix": "mangaassist." }] # Matches EVERYTHING
AFTER (fixed):
EventPattern:
source: ["mangaassist.chat", "mangaassist.payments",
"mangaassist.reviews", "mangaassist.catalog"]
# EXPLICIT list — does NOT include "mangaassist.ai"
# AI-generated events are excluded from the catch-all
For analytics on AI events, create a SEPARATE non-recursive rule:
Rule: MangaAssist-AI-Analytics-Terminal
EventPattern:
source: ["mangaassist.ai"]
Target: analytics Lambda configured to NOT publish any events
(terminal consumer — writes to Firehose only, no EventBridge)
Layer 2: Processing Stage Guard (belt-and-suspenders)
=====================================================
Every event published by an FM consumer MUST include:
"processing_stage": "processed"
Rules that trigger FM invocation filter:
EventPattern:
detail:
processing_stage: ["raw"] # Only process raw events
Events from webhooks (raw):
{ "processing_stage": "raw", ... }
Events from FM consumers (processed):
{ "processing_stage": "processed", ... }
FM-triggering rules never match "processed" events -> no loop.
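Layers 1 and 2 combined into one concrete rule definition, sketched with boto3 using the rule and bus names from this scenario:
import json

import boto3

events = boto3.client("events")

# Explicit source list (no prefix catch-all) plus the processing_stage guard:
# this rule can never match events published by FM consumers
events.put_rule(
    Name="MangaAssist-ModerationFlagged",
    EventBusName="MangaAssist-AI-Events",
    EventPattern=json.dumps({
        "source": ["mangaassist.reviews"],
        "detail-type": ["ModerationFlagged"],
        "detail": {"processing_stage": ["raw"]},
    }),
    State="ENABLED",
)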
Layer 3: Chain Depth Counter (emergency stop)
=============================================
Every event carries a _chain_depth integer:
- Webhook events start at depth 0
- Each FM consumer increments depth by 1
- Any Lambda rejects events with depth >= 3
Implementation in every FM-triggering Lambda:
import logging

logger = logging.getLogger(__name__)
MAX_CHAIN_DEPTH = 3

def guard_chain_depth(event, outbound_detail):
    depth = event["detail"].get("_chain_depth", 0)
    if depth >= MAX_CHAIN_DEPTH:
        logger.error("Chain depth %d exceeded max %d, dropping event", depth, MAX_CHAIN_DEPTH)
        return {"status": "rejected", "reason": "max_chain_depth"}
    # Include incremented depth in any outbound events
    outbound_detail["_chain_depth"] = depth + 1
Layer 4: Cost Circuit Breaker (last resort)
============================================
CloudWatch alarm on Bedrock InvocationCount:
- Threshold: > 10,000 invocations in 5 minutes
- Action: SNS -> Lambda -> disable EventBridge rules via API
# Emergency rule disabler Lambda
import boto3
eventbridge = boto3.client("events")
bus = "MangaAssist-AI-Events"  # the custom event bus from this scenario
eventbridge.disable_rule(Name="MangaAssist-ModerationFlagged", EventBusName=bus)
eventbridge.disable_rule(Name="MangaAssist-ReviewSubmitted", EventBusName=bus)
Alarm also sends PagerDuty alert for immediate human review.
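The alarm half of the circuit breaker as a boto3 sketch; the SNS topic ARN is a placeholder, and because AWS/Bedrock publishes invocation metrics per ModelId, a per-model alarm is shown:
import boto3

cloudwatch = boto3.client("cloudwatch")

# >10,000 Haiku invocations in 5 minutes -> notify the topic wired to the
# rule-disabler Lambda above (and to PagerDuty)
cloudwatch.put_metric_alarm(
    AlarmName="MangaAssist-Bedrock-InvocationSpike",
    Namespace="AWS/Bedrock",
    MetricName="Invocations",
    Dimensions=[{"Name": "ModelId",
                 "Value": "anthropic.claude-3-haiku-20240307-v1:0"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=10000,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:ap-northeast-1:123456789012:MangaAssist-CircuitBreaker"],
)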
Immediate Remediation Steps (During Incident)
1. DISABLE the catch-all analytics rule (stops the loop immediately):
aws events disable-rule --name MangaAssist-AllEvents-Analytics \
--event-bus-name MangaAssist-AI-Events
2. DISABLE the new suggestion rule (root cause):
aws events disable-rule --name MangaAssist-ModerationSuggestion \
--event-bus-name MangaAssist-AI-Events
3. PURGE the DLQ to prevent replay of loop events:
aws sqs purge-queue --queue-url https://sqs.ap-northeast-1.amazonaws.com/.../EventBridge-DLQ
4. VERIFY Lambda concurrency has recovered:
aws lambda get-account-settings
# Check UnreservedConcurrentExecutions > 500
5. RE-ENABLE non-recursive rules with fixes applied
6. MONITOR for 30 minutes before declaring incident resolved
Prevention
1. NEVER use prefix-based catch-all rules that match FM consumer output sources
2. Require processing_stage field in all event schemas (enforce via schema registry)
3. Include _chain_depth in all events; reject if depth >= 3
4. Code review checklist: "Does this Lambda publish events that could match existing rules?"
5. Deploy new EventBridge rules to staging first with CloudWatch cost monitoring
6. Set per-source EventBridge PutEvents rate alarms
7. Implement cost circuit breaker that auto-disables rules above threshold
Scenario 4: Microservice Sidecar Adding Unacceptable Latency
Situation
The MangaAssist team implements the Envoy sidecar pattern via AWS App Mesh to add circuit breaking and mTLS to the Fargate orchestrator's Bedrock calls. After deployment, the chat endpoint p50 latency increases from 1.2s to 2.8s, and the p99 latency increases from 2.5s to 6.1s. The 3-second SLA is now missed on 40% of requests. Users report "slow chatbot" and engagement metrics drop.
Symptoms
CloudWatch Metrics (BEFORE sidecar):
/v1/chat p50 latency: 1,200ms
/v1/chat p99 latency: 2,500ms
SLA compliance (< 3s): 97%
CloudWatch Metrics (AFTER sidecar):
/v1/chat p50 latency: 2,800ms (+1,600ms)
/v1/chat p99 latency: 6,100ms (+3,600ms)
SLA compliance (< 3s): 60% (was 97%)
X-Ray Trace Analysis (single request breakdown):
Total: 2,800ms
|-- API Gateway: 50ms
|-- Lambda: 100ms
|-- Fargate app container: 200ms
|-- Envoy sidecar: 1,400ms <-- NEW BOTTLENECK
| |-- TLS handshake: 400ms
| |-- DNS resolution: 300ms
| |-- Connection setup: 200ms
| |-- Proxy overhead: 500ms
|-- Bedrock inference: 1,050ms
Envoy Admin Stats (port 9901, the App Mesh Envoy admin interface):
upstream_cx_total: 12,847
upstream_cx_active: 3
upstream_cx_connect_fail: 0
upstream_cx_pool_overflow: 247
upstream_rq_timeout: 0
upstream_rq_retry: 1,203
cluster.bedrock.outlier_detection.ejections_active: 0
Root Cause Analysis
Root Cause: Three compounding sidecar configuration issues.
Problem 1: TLS Handshake on Every Request (no connection reuse)
- Envoy configured with max_connections: 1 (default for new mesh)
- Each Bedrock request creates a new TLS connection
- TLS 1.3 handshake to Bedrock endpoint: 200-400ms
- Connection closed after each request (no keep-alive)
Problem 2: DNS Resolution on Every Request
- Envoy DNS refresh rate: 5 minutes (default)
- But connection pool drops connections, forcing new DNS lookup
- DNS resolution to Bedrock endpoint: 100-300ms
- ap-northeast-1 to Bedrock endpoint varies by AZ
Problem 3: Envoy Proxy Processing Overhead
- Access logging to stdout: synchronous, blocking
- Header manipulation rules: 12 rules applied per request
- mTLS certificate validation: per-request check
- Trace context injection: X-Ray segment creation
Problem 4: Connection Pool Overflow
- max_pending_requests: 1 (default)
- During burst, requests queue behind the single connection
- 247 pool overflows = 247 failed connection reuse attempts
- Each overflow forces a new connection (back to Problem 1)
Incorrect Approaches
WRONG: "Remove the sidecar entirely"
-> Loses circuit breaking, mTLS, and observability.
The sidecar provides real operational value; it just
needs proper configuration.
WRONG: "Increase Fargate task CPU/memory"
-> Envoy sidecar overhead is I/O bound (TLS, DNS, connections),
not CPU bound. More CPU won't help connection setup latency.
WRONG: "Switch from App Mesh to direct Bedrock SDK calls"
-> Direct SDK calls don't provide circuit breaking or mTLS.
The team chose the sidecar pattern for good reasons.
Fix the configuration, not the architecture.
WRONG: "Add Envoy caching for Bedrock responses"
-> Bedrock responses are non-cacheable by Envoy (POST requests,
non-deterministic output). Application-level caching via
Redis is already implemented in MicroserviceFMProxy.
Correct Solution
SOLUTION: Optimize Envoy sidecar connection pooling, DNS, and TLS configuration.
Fix 1: Connection Pool Tuning
==============================
BEFORE:
max_connections: 1
max_pending_requests: 1
max_requests: 1
AFTER:
max_connections: 100 # Reuse connections to Bedrock
max_pending_requests: 50 # Queue requests during burst
max_requests: 200 # Allow concurrent requests
max_requests_per_connection: 1000 # Keep-alive for 1000 requests
Impact: Eliminates per-request TLS handshake and connection setup; connections are reused.
Savings: ~600ms per request (handshake + setup skipped on pooled connections)
Fix 2: DNS Configuration
==========================
BEFORE:
dns_refresh_rate: 300s (5 minutes)
dns_lookup_family: V4_ONLY
(using default DNS resolver)
AFTER:
dns_refresh_rate: 30s
dns_lookup_family: V4_PREFERRED
dns_cache_config:
max_hosts: 100
dns_ttl: 60s
use_tcp_for_dns_lookups: false
Additionally: Configure Envoy to use the VPC DNS resolver
directly (169.254.169.253) for faster resolution.
Impact: DNS results cached, eliminating per-request lookups.
Savings: -300ms per request (DNS lookup eliminated)
Fix 3: Access Logging Optimization
====================================
BEFORE:
access_log:
- name: envoy.access_loggers.file
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
    path: "/dev/stdout"  # Synchronous stdout write, verbose default format
AFTER:
access_log:
- name: envoy.access_loggers.file
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
    path: "/dev/stdout"
    log_format:
      json_format:
        # Minimal fields only
        duration: "%DURATION%"
        status: "%RESPONSE_CODE%"
        bytes: "%BYTES_RECEIVED%"
# Also: configure async buffered logging in Envoy bootstrap
Impact: Reduced per-request logging overhead.
Savings: -100ms per request
Fix 4: Reduce Header Manipulation
===================================
BEFORE: 12 header rules applied per request
AFTER: 3 essential header rules only
- X-Amzn-Trace-Id propagation
- Authorization header passthrough
- Content-Type enforcement
Remove: decorative headers, debug headers, redundant CORS headers
(CORS is handled at API Gateway level, not sidecar)
Impact: Less CPU work per request.
Savings: -50ms per request
Total Improvement:
BEFORE: 1,400ms sidecar overhead
AFTER: ~350ms sidecar overhead (warm connection pool)
SAVINGS: ~1,050ms per request at p50; the p99 improves further, because
the retries (upstream_rq_retry: 1,203) and pool overflows that inflated
the tail disappear once connections are pooled
New latency profile:
/v1/chat p50: ~1,750ms (within SLA)
/v1/chat p99: ~2,800ms (within SLA)
SLA compliance: ~96%
Prevention
1. Load test sidecar configuration BEFORE production deployment
2. Always configure connection pooling for high-throughput upstreams
3. Monitor Envoy admin stats (/stats endpoint) for pool overflow and connection metrics (a polling sketch follows this list)
4. Use X-Ray traces to identify per-component latency breakdown
5. Set CloudWatch alarm on p99 latency regression > 20% after any deployment
6. Document baseline latency for each endpoint; compare after infrastructure changes
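The polling sketch referenced in prevention item 3: scrape the sidecar admin /stats endpoint (port 9901 on App Mesh Envoy) and forward the pool-overflow counter to CloudWatch; the namespace and metric name are hypothetical:
import urllib.request

import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_envoy_pool_stats() -> None:
    """Push upstream_cx_pool_overflow counters to CloudWatch for alarming."""
    with urllib.request.urlopen("http://localhost:9901/stats") as resp:
        stats = resp.read().decode()
    for line in stats.splitlines():
        # Stats lines look like "cluster.bedrock.upstream_cx_pool_overflow: 247"
        if "upstream_cx_pool_overflow" in line:
            _, _, value = line.rpartition(":")
            cloudwatch.put_metric_data(
                Namespace="MangaAssist/Envoy",
                MetricData=[{"MetricName": "PoolOverflow",
                             "Value": float(value.strip())}],
            )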
Scenario 5: API Gateway Payload Size Limit Blocking Large FM Responses
Situation
MangaAssist introduces a "full catalog search" endpoint that returns AI-generated summaries for multiple manga series matching a search query. The endpoint calls Bedrock Sonnet with RAG context from OpenSearch (5 documents, each 2000 tokens) and requests a comprehensive summary. Some responses exceed 10 MB when combined with metadata, causing API Gateway to return HTTP 413 Payload Too Large errors. The Lambda function completes successfully, but API Gateway rejects the response before delivering it to the client.
Symptoms
CloudWatch API Gateway Logs:
"status": 413
"errorMessage": "Response payload size exceeded maximum allowed payload size"
"requestId": "xyz789-..."
"path": "/v1/catalog-search"
CloudWatch Lambda Logs:
[INFO] Search completed: 12 results, response_size=11,247,891 bytes
[INFO] Response generated successfully in 8,234ms
(Lambda succeeds but API Gateway rejects the response)
API Gateway Metrics:
4XXError: 234 (in 1 hour, all 413s)
IntegrationLatency p50: 8,100ms
Client Error:
HTTP 413: {"message": "Request Too Long"}
Root Cause Analysis
Root Cause: API Gateway payload size limits exceeded by FM response + metadata.
API Gateway Payload Limits:
REST API: 10 MB max response payload
HTTP API: 10 MB max response payload
WebSocket: 128 KB per message, 32 KB per frame
(Note: with Lambda proxy integration, Lambda's own 6 MB synchronous
response limit is typically hit first; either way, oversized FM
responses fail in transit.)
The catalog search response structure:
{
"query": "...", ~100 bytes
"results": [ 12 results
{
"product_id": "...", ~50 bytes
"title": "...", ~200 bytes
"ai_summary": "...", ~2,000 bytes (per result)
"rag_context": "...", ~8,000 bytes (per result)
"embedding_vector": [0.1, 0.2, ...], ~6,000 bytes (768-dim float32)
"metadata": { ... } ~500 bytes
}
],
"ai_analysis": "...", ~5,000 bytes (overall summary)
"debug_info": { ~10 MB (!!)
"opensearch_raw_results": [ ... ], Full OpenSearch response
"bedrock_raw_response": "...", Full Bedrock response
"prompt_template": "...", Full prompt with all context
}
}
The problem: debug_info contains raw responses from OpenSearch and Bedrock.
In development, this was useful for debugging. In production with large
result sets, it pushes the payload beyond the 10 MB limit.
Secondary issue: embedding_vector is included in the response.
Clients don't need the raw embedding vectors (768 float32 values per result).
This adds ~72 KB for 12 results unnecessarily.
Incorrect Approaches
WRONG: "Increase the API Gateway payload limit"
-> The 10 MB limit is a hard service limit. It cannot be increased
via configuration or support ticket.
WRONG: "Compress the response with gzip"
-> API Gateway REST API does not support response compression
natively for Lambda proxy integration responses. The Lambda
would need to return a base64-encoded gzip body, and the client
would need to handle decompression. Even compressed, debug_info
could still exceed 10 MB before compression.
WRONG: "Switch to HTTP API (it has higher limits)"
-> HTTP API has the same 10 MB payload limit as REST API.
No advantage for this specific problem.
WRONG: "Use Lambda response streaming"
-> Lambda response streaming (via Function URL) can stream up to
20 MB. But it's not compatible with API Gateway. Would require
architectural change to use Lambda Function URL directly.
Correct Solution
SOLUTION: Response filtering + pagination + pre-signed S3 for large payloads.
Fix 1: Remove Debug Info from Production Responses
====================================================
The debug_info field accounts for ~10 MB of the response. It should
NEVER be included in production API responses.
Implementation:
# In Lambda handler
params = event.get("queryStringParameters") or {}  # can be null in proxy events
include_debug = params.get("debug") == "true"
stage = event.get("requestContext", {}).get("stage", "prod")
response_body = {
    "query": query,
    "results": formatted_results,
    "ai_analysis": analysis_text,
}
# Only include debug info in dev/staging AND when explicitly requested
if include_debug and stage != "prod":
    response_body["debug_info"] = debug_data
Impact: Reduces typical response from ~11 MB to ~200 KB.
Fix 2: Strip Unnecessary Fields from Results
==============================================
Remove embedding vectors and raw RAG context from client responses.
Clients need summaries, not internal ML artifacts.
BEFORE (per result):
{
"product_id": "...",
"title": "...",
"ai_summary": "...",
"rag_context": "...", # 8 KB - remove
"embedding_vector": [...], # 6 KB - remove
"metadata": { ... }
}
AFTER (per result):
{
"product_id": "...",
"title": "...",
"ai_summary": "...",
"relevance_score": 0.92, # Computed from embedding, not raw vector
"metadata": { "author": "...", "genre": "..." } # Curated subset
}
Impact: Per-result size drops from ~17 KB to ~3 KB.
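A minimal allowlist helper for Fix 2, mirroring the AFTER shape above:
CLIENT_FIELDS = {"product_id", "title", "ai_summary", "relevance_score", "metadata"}

def to_client_result(result: dict) -> dict:
    """Drop internal ML artifacts (rag_context, embedding_vector) before serialization."""
    return {k: v for k, v in result.items() if k in CLIENT_FIELDS}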
Fix 3: Implement Pagination
=============================
Limit results per page and provide cursor-based pagination.
GET /v1/catalog-search?q=shonen&limit=5&cursor=eyJ...
Response:
{
"query": "shonen",
"results": [ ... 5 results ... ],
"ai_analysis": "Based on your search for shonen manga...",
"pagination": {
"total_results": 47,
"returned": 5,
"next_cursor": "eyJwYWdlIjoyLCJvZmZzZXQiOjV9",
"has_more": true
}
}
Impact: Guaranteed response size under 50 KB per page.
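A sketch of the opaque cursor: it is just base64-encoded JSON page state, as decoding the sample value above shows:
import base64
import json

def encode_cursor(page: int, offset: int) -> str:
    state = json.dumps({"page": page, "offset": offset}, separators=(",", ":"))
    return base64.urlsafe_b64encode(state.encode()).decode()

def decode_cursor(cursor: str) -> dict:
    return json.loads(base64.urlsafe_b64decode(cursor.encode()))

print(decode_cursor("eyJwYWdlIjoyLCJvZmZzZXQiOjV9"))  # {'page': 2, 'offset': 5}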
Fix 4: Pre-Signed S3 URL for Truly Large Responses
====================================================
For export/download use cases where the client needs ALL results:
1. Lambda generates full response and writes to S3
2. Lambda generates a pre-signed URL (valid for 15 minutes)
3. API Gateway returns a small response with the download URL
POST /v1/catalog-search/export
Response (< 1 KB):
{
"export_id": "exp-12345",
"status": "ready",
"download_url": "https://s3.ap-northeast-1.amazonaws.com/...",
"expires_in_seconds": 900,
"size_bytes": 11247891,
"format": "application/json"
}
Impact: API Gateway only returns a small JSON pointer.
Client downloads large payload directly from S3 (no size limit).
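A sketch of the export flow under an assumed bucket name; put_object and generate_presigned_url are the standard boto3 calls:
import json
import uuid

import boto3

s3 = boto3.client("s3")
BUCKET = "mangaassist-exports"  # hypothetical bucket

def export_results(full_response: dict) -> dict:
    """Write the oversized payload to S3 and return a small pointer instead."""
    export_id = f"exp-{uuid.uuid4().hex[:8]}"
    key = f"exports/{export_id}.json"
    body = json.dumps(full_response).encode()
    s3.put_object(Bucket=BUCKET, Key=key, Body=body, ContentType="application/json")
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": BUCKET, "Key": key},
        ExpiresIn=900,  # matches expires_in_seconds above
    )
    return {
        "export_id": export_id,
        "status": "ready",
        "download_url": url,
        "expires_in_seconds": 900,
        "size_bytes": len(body),
        "format": "application/json",
    }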
Combined Impact:
Normal search: ~30 KB response (5 results, no debug, no vectors)
Full export: small pointer + S3 pre-signed URL
API Gateway 413 errors: 0
Prevention
1. NEVER include debug/raw data in production API responses
2. Use environment-based response filtering (stage != "prod")
3. Set maximum results per page (default 10, max 50)
4. Strip internal fields (embeddings, raw context) before response serialization
5. Add response size monitoring: CloudWatch alarm if avg response > 1 MB
6. For any endpoint that could return > 5 MB, use S3 pre-signed URL pattern
7. Load test with realistic data volumes (not just toy examples)
8. Add Lambda middleware that measures and logs response size before return
Cross-Scenario Summary: Integration Pattern Failure Modes
| Scenario | Service | Hard Limit | Root Pattern | Fix Category |
|---|---|---|---|---|
| 1. API GW 29s timeout | API Gateway REST | 29 seconds | Synchronous FM call exceeds timeout | Async pattern (SQS polling or WebSocket streaming) |
| 2. Lambda concurrency exhaustion | Lambda | Account concurrency (1,000 default) | Long-running webhook + retry amplification | SQS buffering + per-source isolation |
| 3. EventBridge infinite loop | EventBridge + Lambda | PutEvents quota + concurrency | Catch-all rule matches FM consumer output | Source differentiation + chain depth counter |
| 4. Sidecar latency overhead | App Mesh / Envoy | N/A (configuration issue) | Default connection pool + per-request TLS | Connection pool tuning + DNS caching |
| 5. API GW payload size | API Gateway | 10 MB response | Debug data + embedding vectors in response | Response filtering + pagination + S3 pre-signed URLs |
Key Takeaways for the AIP-C01 Exam
| Concept | What to Remember |
|---|---|
| API GW REST timeout | 29 seconds --- hard limit, not adjustable. Use async or WebSocket for long FM calls. |
| API GW payload limit | 10 MB response --- hard limit. Use pagination or S3 pre-signed URLs for large FM outputs. |
| Lambda concurrency | Default 1,000 per account. Reserved concurrency isolates functions. Use SQS to buffer bursts. |
| Webhook best practice | Validate + enqueue (fast return 200) then process async. Never do heavy work in the receiver. |
| EventBridge loop prevention | Use explicit source lists (not prefix catch-all), processing_stage guards, and chain depth counters. |
| Sidecar connection pooling | Default Envoy settings create new TLS connections per request. Always configure max_connections and keep-alive. |
| Cost circuit breaker | Set CloudWatch alarms on Bedrock invocation count. Auto-disable EventBridge rules if threshold exceeded. |
| Response size management | Strip debug info, embedding vectors, and raw context from production responses. Paginate results. |
| Retry amplification | External webhook retries compound during outages. Fast-acknowledge (SQS buffer) prevents amplification. |
| Async FM pattern | POST returns 202 + job_id. Background worker invokes FM (no timeout). Client polls for result. |