Skill 2.3.2 --- Scenarios and Runbooks: Integrated AI Capabilities Troubleshooting
MangaAssist context: JP Manga store chatbot on AWS — Bedrock Claude 3 (Sonnet at $3/$15 per 1M tokens input/output, Haiku at $0.25/$1.25), OpenSearch Serverless (vector store), DynamoDB (sessions/products), ECS Fargate (orchestrator), API Gateway WebSocket, ElastiCache Redis. Target: useful answer in under 3 seconds, 1M messages/day scale.
Skill Mapping
| Field | Value |
|---|---|
| Certification | AWS Certified AI Practitioner (AIP-C01) |
| Domain | 2 --- Development and Implementation of GenAI Applications |
| Task | 2.3 --- Describe methods to integrate foundation models into applications |
| Skill | 2.3.2 --- Develop integrated AI capabilities to enhance existing applications with GenAI functionality (API Gateway for microservice integrations, Lambda for webhook handlers, EventBridge for event-driven integrations) |
Scenario 1: API Gateway 29-Second Timeout on Long FM Responses
Situation
The MangaAssist team deploys a new "deep analysis" feature that lets users request detailed manga series comparisons. The feature uses Claude 3 Sonnet for multi-paragraph analysis with 4096 max_tokens. During load testing, 15% of requests to POST /v1/analyze return HTTP 504 Gateway Timeout errors. CloudWatch logs show the Lambda function is completing successfully at 31-35 seconds, but API Gateway drops the connection at 29 seconds.
Symptoms
CloudWatch API Gateway Logs:
"status": 504
"integrationLatency": 29000
"errorMessage": "Endpoint request timed out"
"requestId": "abc123-..."
"path": "/v1/analyze"
CloudWatch Lambda Logs:
[INFO] Analysis generated successfully in 33,412ms
[INFO] Response size: 12,847 bytes
(No errors in Lambda — it finishes AFTER API Gateway has already returned 504)
CloudWatch Metrics:
API Gateway IntegrationLatency p99: 28,700ms
API Gateway 5XXError count: 847 (in 1 hour)
Lambda Duration p99: 33,200ms
Root Cause Analysis
Root Cause: API Gateway REST API has a HARD 29-second integration timeout.
This is a service limit that CANNOT be increased. The Sonnet model invocation
for complex analysis takes 25-35 seconds, which frequently exceeds this limit.
Why 29 seconds?
- API Gateway REST API: 29s max (hard limit, not adjustable)
- API Gateway HTTP API: 30s max (hard limit, not adjustable)
- Neither can be raised via support ticket
Why does Sonnet take so long?
- 4096 max_tokens with complex reasoning prompt
- Multi-step comparison requires deep inference
- Sonnet at $3/$15/1M tokens is more capable but slower than Haiku
- Cold start adds 1-2s on first invocation per Lambda container
Timeline of a failing request:
t=0.0s Client sends POST /v1/analyze
t=0.2s API Gateway routes to Lambda
t=0.5s Lambda starts, loads session from DynamoDB
t=1.0s Lambda sends InvokeModel to Bedrock (Sonnet)
t=29.0s API Gateway timeout — returns 504 to client
t=29.0s (Lambda is STILL waiting for Bedrock response)
t=33.0s Bedrock returns response to Lambda
t=33.5s Lambda completes — but API Gateway already disconnected
t=33.5s Lambda response is DISCARDED (nobody is listening)
Incorrect Approaches
WRONG: "Increase the API Gateway timeout to 60 seconds"
-> API Gateway REST/HTTP API timeout is a hard limit (29s/30s).
It cannot be increased. This is not a configurable setting.
WRONG: "Switch to Haiku to make it faster"
-> Haiku is faster but less capable for deep analysis tasks.
The analysis quality would degrade significantly.
WRONG: "Increase Lambda memory to speed up Bedrock calls"
-> Lambda memory affects Lambda compute, not Bedrock inference time.
The bottleneck is Bedrock model inference, not Lambda execution.
WRONG: "Add a longer timeout in the CDK Lambda configuration"
-> Lambda timeout can be up to 15 minutes, but API Gateway
will still cut the connection at 29 seconds regardless.
Correct Solution
SOLUTION: Implement asynchronous pattern with polling or WebSocket streaming.
Option A: Async with Polling (simplest)
========================================
1. Client sends POST /v1/analyze
2. Lambda immediately returns 202 Accepted with a job_id
3. Lambda publishes to SQS queue
4. Background Lambda picks up from SQS, invokes Sonnet (no API GW timeout)
5. Background Lambda stores result in DynamoDB
6. Client polls GET /v1/analyze/{job_id} until status = "completed"
Flow:
Client -> API GW -> Lambda (returns 202 + job_id in <1s)
                      |
                      +-> SQS -> Background Lambda -> Bedrock Sonnet (33s, no timeout)
                                                        |
                                                        +-> DynamoDB (store result)

Client -> API GW -> Lambda (GET /v1/analyze/{job_id} -> reads DynamoDB)
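A minimal sketch of Option A's two handlers, assuming hypothetical names (an ANALYZE_QUEUE_URL environment variable and an AnalysisJobs DynamoDB table); validation and error handling omitted:
import json
import os
import uuid

import boto3

sqs = boto3.client("sqs")
jobs = boto3.resource("dynamodb").Table("AnalysisJobs")  # hypothetical table name

def submit_handler(event, context):
    """POST /v1/analyze: record the job, enqueue it, return 202 in <1s."""
    job_id = str(uuid.uuid4())
    jobs.put_item(Item={"job_id": job_id, "status": "pending"})
    sqs.send_message(
        QueueUrl=os.environ["ANALYZE_QUEUE_URL"],  # hypothetical env var
        MessageBody=json.dumps({"job_id": job_id, "request": json.loads(event["body"])}),
    )
    return {"statusCode": 202, "body": json.dumps({"job_id": job_id})}

def status_handler(event, context):
    """GET /v1/analyze/{job_id}: client polls until status == "completed"."""
    job_id = event["pathParameters"]["job_id"]
    item = jobs.get_item(Key={"job_id": job_id}).get("Item", {"status": "not_found"})
    return {"statusCode": 200, "body": json.dumps(item)}
The background consumer (not shown) invokes Sonnet with no API Gateway in the path, then updates the item to status "completed" with the result.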
Option B: WebSocket Streaming (best UX)
=========================================
1. Client opens WebSocket connection via API Gateway WebSocket API
2. Client sends { "action": "analyze", "data": {...} }
3. Lambda routes to ECS Fargate orchestrator
4. Fargate invokes Bedrock with response streaming
5. Tokens stream back to client in real-time via WebSocket
6. No timeout issue — WebSocket connections persist for 2 hours (idle 10min)
Flow:
Client <-> WS API GW <-> Lambda <-> Fargate <-> Bedrock (streaming)
(tokens flow back incrementally, first token in ~2s)
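A sketch of the Fargate-side streaming loop; the WebSocket callback endpoint URL is a placeholder, connection bookkeeping is omitted, and chunk parsing follows the Claude 3 Messages streaming format on Bedrock:
import json

import boto3

bedrock = boto3.client("bedrock-runtime")
# Callback endpoint: https://{api-id}.execute-api.{region}.amazonaws.com/{stage}
apigw = boto3.client(
    "apigatewaymanagementapi",
    endpoint_url="https://example.execute-api.ap-northeast-1.amazonaws.com/prod")

def stream_analysis(connection_id: str, prompt: str) -> None:
    """Relay Sonnet tokens to one WebSocket client as they arrive."""
    response = bedrock.invoke_model_with_response_stream(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 4096,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )
    for event in response["body"]:
        chunk = json.loads(event["chunk"]["bytes"])
        # Incremental text arrives in content_block_delta events
        if chunk.get("type") == "content_block_delta":
            apigw.post_to_connection(
                ConnectionId=connection_id,
                Data=chunk["delta"].get("text", "").encode("utf-8"),
            )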
Option C: Chunked Processing (if analysis is decomposable)
==========================================================
1. Split analysis into sub-tasks (each under 15s)
2. Use Step Functions to orchestrate sub-tasks
3. Each step invokes Haiku for a sub-component
4. Final step assembles the full analysis
5. Total time may be longer, but each API call is within timeout
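A sketch of Option C wired up with boto3 Step Functions; the state machine name, Lambda ARNs, and execution role are placeholders, and the two-branch decomposition is illustrative:
import json

import boto3

sfn = boto3.client("stepfunctions")

# Two Haiku-backed sub-analyses run in parallel, then a final assembly step
definition = {
    "StartAt": "SubAnalyses",
    "States": {
        "SubAnalyses": {
            "Type": "Parallel",
            "Next": "Assemble",
            "Branches": [
                {"StartAt": "PlotCompare", "States": {"PlotCompare": {
                    "Type": "Task",
                    "Resource": "arn:aws:lambda:ap-northeast-1:123456789012:function:plot-compare",
                    "End": True}}},
                {"StartAt": "ArtCompare", "States": {"ArtCompare": {
                    "Type": "Task",
                    "Resource": "arn:aws:lambda:ap-northeast-1:123456789012:function:art-compare",
                    "End": True}}},
            ],
        },
        "Assemble": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:ap-northeast-1:123456789012:function:assemble-analysis",
            "End": True,
        },
    },
}

sfn.create_state_machine(
    name="MangaAssist-ChunkedAnalysis",  # hypothetical name
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/MangaAssist-SfnRole",  # placeholder
)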
Implementation for MangaAssist:
-> Option B (WebSocket streaming) for the /analyze endpoint
-> Already have WebSocket API Gateway for chat
-> Reuse existing infrastructure, add "analyze" route
Prevention
1. Set Lambda timeout to 25s for API Gateway-backed functions (4s buffer)
2. Add CloudWatch alarm on API Gateway 5XXError > 1% for 5 minutes
3. Implement client-side timeout handling with retry and fallback UI
4. Use Bedrock response streaming for any endpoint that may exceed 10s
5. Monitor Bedrock InvokeModel latency per model — alert if p95 > 20s
Scenario 2: Webhook Handler Lambda Concurrency Exhausted
Situation
MangaAssist runs a "Golden Week" promotion. A manga publisher sends a catalog update webhook with 5,000 new items. Simultaneously, Stripe sends a surge of payment webhooks (300+ per minute from promotional purchases). The webhook handler Lambda functions hit their reserved concurrency limit of 50, causing Stripe to receive 429 errors and begin aggressive retries. Within 10 minutes, the retry amplification creates a thundering herd that exhausts the entire account's unreserved concurrency pool.
Symptoms
CloudWatch Lambda Metrics:
Throttles count: 4,287 (in 15 minutes)
ConcurrentExecutions: 50/50 (at reserved limit)
Errors: 0 (no code errors — pure throttling)
CloudWatch Logs (visible for the few that got through):
[INFO] Processing catalog webhook: batch_size=5000
[INFO] Publishing 10 events to EventBridge (batch 1 of 500)
...
[INFO] Publishing 10 events to EventBridge (batch 47 of 500)
REPORT Duration: 28,400ms Memory: 512MB
Stripe Dashboard:
Webhook delivery failures: 312
Webhook retry attempts: 1,248 (4x amplification)
Status: "Endpoint temporarily disabled by Stripe"
Account-Level Impact:
Other Lambda functions (chat, search) also throttled
Unreserved concurrency pool depleted
MangaAssist chatbot returning 500 errors to all users
Root Cause Analysis
Root Cause: Multiple compounding failures in concurrency management.
Problem 1: Catalog webhook processes 5,000 items synchronously
- Single Lambda invocation runs for 28s processing all items
- Holds a concurrency slot for the entire duration
- Should batch into smaller chunks via SQS
Problem 2: Reserved concurrency too low for burst
- Reserved concurrency = 50 for all webhook handlers
- Normal load: 5-10 concurrent, so 50 seemed generous
- Promotional burst: 300+ webhook/min = 50+ concurrent easily
Problem 3: No separation between webhook sources
- All sources (stripe, catalog, reviews) share the same 50 slots
- Catalog webhook hogs slots with long-running processing
- Stripe webhooks starved of concurrency
Problem 4: Stripe retry amplification
- Stripe retries failed webhooks: 1, 2, 4, 8, 16, 32, 64 minutes
- During burst, retries compound on top of new webhooks
- Creates exponential load increase
Problem 5: No account-level concurrency protection
- Reserved concurrency caps the webhook handler at 50; it cannot spill
  beyond that, so the real leak is downstream
- EventBridge events published by the handler fan out to consumer Lambdas
  with NO reserved concurrency, which drain the shared unreserved pool
- Blast radius extends to the entire MangaAssist service (chat and search
  starved of concurrency)
Concurrency math during the incident:
Catalog webhooks: 10 invocations running ~28s each = 10 slots held continuously
Stripe webhooks: 300/min = 5/sec, each ~2s = 10 concurrent
Stripe retries: accumulating to 15/sec = 30 concurrent
Total demand: ~50 concurrent (hitting reserved limit)
Overflow: demand past the cap is throttled (429s to Stripe), while the
EventBridge fan-out to unreserved consumers drains the shared pool
Incorrect Approaches
WRONG: "Just increase reserved concurrency to 500"
-> Addresses symptom not cause. Catalog webhook still processes
5,000 items in a single invocation. During larger promotions,
even 500 would be insufficient. Also increases blast radius.
WRONG: "Remove reserved concurrency limits entirely"
-> Without reserved concurrency, webhook functions can consume
the entire account concurrency pool (default 1,000), leaving
nothing for production chat/search functions.
WRONG: "Add provisioned concurrency for webhook handlers"
-> Provisioned concurrency ensures warm starts but does NOT
increase the concurrency limit. Still capped at reserved count.
Also expensive ($0.000004646/GB-sec * 24/7 for 50 instances).
WRONG: "Process catalog items in parallel using threading in Lambda"
-> Lambda has limited CPU (proportional to memory). Threading
doesn't solve the EventBridge PutEvents bottleneck (10 entries/call).
Still a single long-running invocation holding concurrency.
Correct Solution
SOLUTION: Decouple webhook receipt from processing using SQS buffering.
Architecture Change:
====================
BEFORE (broken):
Stripe/Catalog -> Lambda URL -> [Process + EventBridge] -> Done
(single Lambda does everything, holds concurrency for full duration)
AFTER (fixed):
Stripe/Catalog -> Lambda URL -> [Validate + SQS] -> Return 200 (<1s)
                                     |
                                     v
                                SQS Queue(s)
                                     |
                                     v
                      Lambda Consumer(s) -> EventBridge
                      (separate concurrency pool)
Step-by-step implementation:
1. SEPARATE FUNCTIONS PER SOURCE
- stripe-webhook-receiver (Lambda URL, reserved concurrency = 20)
- catalog-webhook-receiver (Lambda URL, reserved concurrency = 10)
- review-webhook-receiver (Lambda URL, reserved concurrency = 10)
Each does ONLY: validate signature -> SQS enqueue -> return 200
(a receiver sketch follows the Result block below)
2. SQS QUEUES PER SOURCE
- MangaAssist-Stripe-Webhooks (standard queue)
- MangaAssist-Catalog-Webhooks (standard queue)
- MangaAssist-Review-Webhooks (standard queue)
Each with DLQ for failed processing
3. CONSUMER LAMBDAS (separate from receivers)
- stripe-processor (reserved concurrency = 30)
- catalog-processor (reserved concurrency = 20, batch size = 10)
- review-processor (reserved concurrency = 20)
SQS Lambda event source mapping with maxConcurrency
4. CATALOG WEBHOOK: DECOMPOSE BATCH
- Receiver: accept 5,000-item batch, split into SQS messages of 10 each
- Consumer: process 10 items per invocation (< 5s each)
- 5,000 items = 500 SQS messages, processed in parallel by 20 consumers
5. ACCOUNT-LEVEL PROTECTION
- Set account reserved concurrency for non-critical Lambdas
- Production functions (chat, search) get reserved concurrency = 200 each
- Total reserved: 200 + 200 + 20 + 10 + 10 + 30 + 20 + 20 = 510
- Remaining unreserved: 490 (buffer for other functions)
Result:
- Webhook receipt: <1s (validate + SQS), concurrency = 1 slot briefly
- Processing: async via SQS, controlled concurrency per consumer
- No blast radius: each function has isolated reserved concurrency
- Catalog batch: 500 parallel messages, each processed in 5s
- Stripe retries: unnecessary (receiver always returns 200 quickly)
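A sketch of the step-1 receiver for Stripe, assuming the official stripe Python library for signature verification and hypothetical environment variable names; the consumer cap at the bottom uses the real ScalingConfig.MaximumConcurrency setting on the SQS event source mapping:
import json
import os

import boto3
import stripe  # used only for webhook signature verification

sqs = boto3.client("sqs")

def stripe_receiver(event, context):
    """Do the minimum: verify the signature, enqueue, return 200 fast."""
    payload = event["body"]
    signature = event["headers"].get("stripe-signature", "")
    try:
        stripe.Webhook.construct_event(
            payload, signature, os.environ["STRIPE_ENDPOINT_SECRET"])
    except (ValueError, stripe.error.SignatureVerificationError):
        return {"statusCode": 400, "body": "invalid signature"}
    sqs.send_message(QueueUrl=os.environ["STRIPE_QUEUE_URL"], MessageBody=payload)
    return {"statusCode": 200, "body": json.dumps({"received": True})}

# Cap the paired consumer via the event source mapping (placeholder ARN)
boto3.client("lambda").create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:ap-northeast-1:123456789012:MangaAssist-Stripe-Webhooks",
    FunctionName="stripe-processor",
    BatchSize=10,
    ScalingConfig={"MaximumConcurrency": 30},
)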
Prevention
1. Never process large batches synchronously in a webhook receiver
2. Always separate "receipt" (fast, return 200) from "processing" (async)
3. Use per-source Lambda functions with isolated reserved concurrency
4. Set SQS maxConcurrency on Lambda event source mapping
5. Monitor account-level ConcurrentExecutions with alarm at 80% of limit
6. Configure Stripe webhook retry behavior (reduce max retries to 3)
7. Load test promotional scenarios before launch
Scenario 3: EventBridge FM Trigger Creating Infinite Event Loop
Situation
A developer adds a new EventBridge rule to automatically generate AI-powered response suggestions for flagged customer reviews. The rule triggers on ModerationFlagged events and invokes a Lambda that generates a response suggestion using Claude 3 Haiku. However, the Lambda also publishes a SuggestionGenerated event, and a catch-all analytics rule routes all mangaassist.ai events back to the same processing pipeline. Within 3 minutes, the event loop amplifies to 10,000+ events per second, exhausting both the EventBridge PutEvents quota and the Lambda concurrency pool.
Symptoms
CloudWatch EventBridge Metrics:
PutEvents invocations: 847,000 (in 5 minutes)
FailedInvocations: 312,000
ThrottledRules: 45
CloudWatch Lambda Metrics:
Invocations: 523,000 (in 5 minutes)
ConcurrentExecutions: 1,000/1,000 (account limit)
Throttles: 189,000
Duration avg: 2,100ms (Haiku invocations)
CloudWatch Bedrock Metrics:
InvocationCount: 312,000 (in 5 minutes)
ThrottlingExceptions: 89,000
EstimatedCost: $78.00 (5 minutes of Haiku at $0.25/$1.25/1M)
EventBridge DLQ (SQS):
ApproximateNumberOfMessages: 156,000 (and growing)
Alarm:
ALARM: MangaAssist-UnexpectedCost - Cost exceeded $50 threshold
ALARM: MangaAssist-ConcurrencyExhausted - 100% utilization
Root Cause Analysis
Root Cause: Circular event chain between three EventBridge rules.
The event loop path:
1. ReviewSubmitted event arrives (legitimate, from webhook)
-> Rule: MangaAssist-ReviewSubmitted
-> Target: moderation Lambda
-> Lambda publishes: ModerationFlagged (if unsafe)
2. ModerationFlagged event arrives
-> Rule: MangaAssist-ModerationFlagged
-> Target: suggestion Lambda (NEW rule, the bug)
-> Lambda invokes Haiku, generates suggestion
-> Lambda publishes: SuggestionGenerated (source: mangaassist.ai)
3. SuggestionGenerated event arrives
-> Rule: MangaAssist-AllEvents-Analytics (catch-all)
-> Target: analytics Lambda
-> Analytics Lambda publishes: AnalyticsProcessed (source: mangaassist.ai)
4. AnalyticsProcessed event arrives
-> Rule: MangaAssist-AllEvents-Analytics (catch-all matches AGAIN)
-> Back to step 3. INFINITE LOOP.
It is worse still: the analytics Lambda also triggers the suggestion Lambda,
because the catch-all rule fans out to multiple targets, one of which
re-processes events and publishes new ones.
Amplification factor per cycle:
1 event -> 2 targets -> 2 events -> 4 targets -> exponential growth
After 10 cycles: 1,024 events from 1 original review
Time per cycle: ~2 seconds (Lambda + Bedrock + EventBridge)
In 3 minutes: ~90 cycles possible
Theoretical events: 2^90 (but throttled long before that)
Incorrect Approaches
WRONG: "Add a duplicate check in each Lambda to prevent re-processing"
-> Each event has a unique event ID, so duplicate detection
doesn't work — every event in the loop is technically unique.
WRONG: "Reduce the Lambda concurrency to slow down the loop"
-> Slows the loop but doesn't stop it. Events queue up in
EventBridge retry/DLQ and resume when concurrency frees up.
Also impacts legitimate event processing.
WRONG: "Delete the catch-all analytics rule"
-> Removes analytics capability entirely. The real fix is to
prevent the loop while keeping analytics.
WRONG: "Add a time-based filter to skip events older than 5 seconds"
-> EventBridge events are delivered quickly (<1s typically).
Time-based filtering wouldn't catch loop events because
each new event has a fresh timestamp.
Correct Solution
SOLUTION: Multi-layer loop prevention with source differentiation and chain depth.
Layer 1: Source Differentiation (prevent cross-contamination)
=============================================================
Rule: MangaAssist-AllEvents-Analytics
BEFORE (broken):
EventPattern:
source: [{ "prefix": "mangaassist." }] # Matches EVERYTHING
AFTER (fixed):
EventPattern:
source: ["mangaassist.chat", "mangaassist.payments",
"mangaassist.reviews", "mangaassist.catalog"]
# EXPLICIT list — does NOT include "mangaassist.ai"
# AI-generated events are excluded from the catch-all
For analytics on AI events, create a SEPARATE non-recursive rule:
Rule: MangaAssist-AI-Analytics-Terminal
EventPattern:
source: ["mangaassist.ai"]
Target: analytics Lambda configured to NOT publish any events
(terminal consumer — writes to Firehose only, no EventBridge)
Layer 2: Processing Stage Guard (belt-and-suspenders)
=====================================================
Every event published by an FM consumer MUST include:
"processing_stage": "processed"
Rules that trigger FM invocation filter:
EventPattern:
detail:
processing_stage: ["raw"] # Only process raw events
Events from webhooks (raw):
{ "processing_stage": "raw", ... }
Events from FM consumers (processed):
{ "processing_stage": "processed", ... }
FM-triggering rules never match "processed" events -> no loop.
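Layers 1 and 2 combined into one concrete rule definition, sketched with boto3 using the rule and bus names from this scenario:
import json

import boto3

events = boto3.client("events")

# Explicit source list (no prefix catch-all) plus the processing_stage guard:
# this rule can never match events published by FM consumers
events.put_rule(
    Name="MangaAssist-ModerationFlagged",
    EventBusName="MangaAssist-AI-Events",
    EventPattern=json.dumps({
        "source": ["mangaassist.reviews"],
        "detail-type": ["ModerationFlagged"],
        "detail": {"processing_stage": ["raw"]},
    }),
    State="ENABLED",
)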
Layer 3: Chain Depth Counter (emergency stop)
=============================================
Every event carries a _chain_depth integer:
- Webhook events start at depth 0
- Each FM consumer increments depth by 1
- Any Lambda rejects events with depth >= 3
Implementation in every FM-triggering Lambda:
import logging

logger = logging.getLogger(__name__)
MAX_CHAIN_DEPTH = 3

def guard_chain_depth(event, outbound_detail):
    depth = event["detail"].get("_chain_depth", 0)
    if depth >= MAX_CHAIN_DEPTH:
        logger.error("Chain depth %d exceeded max %d, dropping event", depth, MAX_CHAIN_DEPTH)
        return {"status": "rejected", "reason": "max_chain_depth"}
    # Include incremented depth in any outbound events
    outbound_detail["_chain_depth"] = depth + 1
Layer 4: Cost Circuit Breaker (last resort)
============================================
CloudWatch alarm on Bedrock InvocationCount:
- Threshold: > 10,000 invocations in 5 minutes
- Action: SNS -> Lambda -> disable EventBridge rules via API
# Emergency rule disabler Lambda
import boto3
eventbridge = boto3.client("events")
bus = "MangaAssist-AI-Events"  # the custom event bus from this scenario
eventbridge.disable_rule(Name="MangaAssist-ModerationFlagged", EventBusName=bus)
eventbridge.disable_rule(Name="MangaAssist-ReviewSubmitted", EventBusName=bus)
Alarm also sends PagerDuty alert for immediate human review.
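The alarm half of the circuit breaker as a boto3 sketch; the SNS topic ARN is a placeholder, and because AWS/Bedrock publishes invocation metrics per ModelId, a per-model alarm is shown:
import boto3

cloudwatch = boto3.client("cloudwatch")

# >10,000 Haiku invocations in 5 minutes -> notify the topic wired to the
# rule-disabler Lambda above (and to PagerDuty)
cloudwatch.put_metric_alarm(
    AlarmName="MangaAssist-Bedrock-InvocationSpike",
    Namespace="AWS/Bedrock",
    MetricName="Invocations",
    Dimensions=[{"Name": "ModelId",
                 "Value": "anthropic.claude-3-haiku-20240307-v1:0"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=10000,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:ap-northeast-1:123456789012:MangaAssist-CircuitBreaker"],
)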
Immediate Remediation Steps (During Incident)
1. DISABLE the catch-all analytics rule (stops the loop immediately):
aws events disable-rule --name MangaAssist-AllEvents-Analytics \
--event-bus-name MangaAssist-AI-Events
2. DISABLE the new suggestion rule (root cause):
aws events disable-rule --name MangaAssist-ModerationSuggestion \
--event-bus-name MangaAssist-AI-Events
3. PURGE the DLQ to prevent replay of loop events:
aws sqs purge-queue --queue-url https://sqs.ap-northeast-1.amazonaws.com/.../EventBridge-DLQ
4. VERIFY Lambda concurrency has recovered:
aws lambda get-account-settings
# Check UnreservedConcurrentExecutions > 500
5. RE-ENABLE non-recursive rules with fixes applied
6. MONITOR for 30 minutes before declaring incident resolved
Prevention
1. NEVER use prefix-based catch-all rules that match FM consumer output sources
2. Require processing_stage field in all event schemas (enforce via schema registry)
3. Include _chain_depth in all events; reject if depth >= 3
4. Code review checklist: "Does this Lambda publish events that could match existing rules?"
5. Deploy new EventBridge rules to staging first with CloudWatch cost monitoring
6. Set per-source EventBridge PutEvents rate alarms
7. Implement cost circuit breaker that auto-disables rules above threshold
Scenario 4: Microservice Sidecar Adding Unacceptable Latency
Situation
The MangaAssist team implements the Envoy sidecar pattern via AWS App Mesh to add circuit breaking and mTLS to the Fargate orchestrator's Bedrock calls. After deployment, the chat endpoint p50 latency increases from 1.2s to 2.8s, and the p99 latency increases from 2.5s to 6.1s. The 3-second SLA is now missed on 40% of requests. Users report "slow chatbot" and engagement metrics drop.
Symptoms
CloudWatch Metrics (BEFORE sidecar):
/v1/chat p50 latency: 1,200ms
/v1/chat p99 latency: 2,500ms
SLA compliance (< 3s): 97%
CloudWatch Metrics (AFTER sidecar):
/v1/chat p50 latency: 2,800ms (+1,600ms)
/v1/chat p99 latency: 6,100ms (+3,600ms)
SLA compliance (< 3s): 60% (was 97%)
X-Ray Trace Analysis (single request breakdown):
Total: 2,800ms
|-- API Gateway: 50ms
|-- Lambda: 100ms
|-- Fargate app container: 200ms
|-- Envoy sidecar: 1,400ms <-- NEW BOTTLENECK
| |-- TLS handshake: 400ms
| |-- DNS resolution: 300ms
| |-- Connection setup: 200ms
| |-- Proxy overhead: 500ms
|-- Bedrock inference: 1,050ms
Envoy Admin Stats (port 9901, the App Mesh Envoy admin interface):
upstream_cx_total: 12,847
upstream_cx_active: 3
upstream_cx_connect_fail: 0
upstream_cx_pool_overflow: 247
upstream_rq_timeout: 0
upstream_rq_retry: 1,203
cluster.bedrock.outlier_detection.ejections_active: 0
Root Cause Analysis
Root Cause: Three compounding sidecar configuration issues.
Problem 1: TLS Handshake on Every Request (no connection reuse)
- Envoy configured with max_connections: 1 (default for new mesh)
- Each Bedrock request creates a new TLS connection
- TLS 1.3 handshake to Bedrock endpoint: 200-400ms
- Connection closed after each request (no keep-alive)
Problem 2: DNS Resolution on Every Request
- Envoy DNS refresh rate: 5 minutes (default)
- But connection pool drops connections, forcing new DNS lookup
- DNS resolution to Bedrock endpoint: 100-300ms
- ap-northeast-1 to Bedrock endpoint varies by AZ
Problem 3: Envoy Proxy Processing Overhead
- Access logging to stdout: synchronous, blocking
- Header manipulation rules: 12 rules applied per request
- mTLS certificate validation: per-request check
- Trace context injection: X-Ray segment creation
Problem 4: Connection Pool Overflow
- max_pending_requests: 1 (default)
- During burst, requests queue behind the single connection
- 247 pool overflows = 247 failed connection reuse attempts
- Each overflow forces a new connection (back to Problem 1)
Incorrect Approaches
WRONG: "Remove the sidecar entirely"
-> Loses circuit breaking, mTLS, and observability.
The sidecar provides real operational value; it just
needs proper configuration.
WRONG: "Increase Fargate task CPU/memory"
-> Envoy sidecar overhead is I/O bound (TLS, DNS, connections),
not CPU bound. More CPU won't help connection setup latency.
WRONG: "Switch from App Mesh to direct Bedrock SDK calls"
-> Direct SDK calls don't provide circuit breaking or mTLS.
The team chose the sidecar pattern for good reasons.
Fix the configuration, not the architecture.
WRONG: "Add Envoy caching for Bedrock responses"
-> Bedrock responses are non-cacheable by Envoy (POST requests,
non-deterministic output). Application-level caching via
Redis is already implemented in MicroserviceFMProxy.
Correct Solution
SOLUTION: Optimize Envoy sidecar connection pooling, DNS, and TLS configuration.
Fix 1: Connection Pool Tuning
==============================
BEFORE:
max_connections: 1
max_pending_requests: 1
max_requests: 1
AFTER:
max_connections: 100 # Reuse connections to Bedrock
max_pending_requests: 50 # Queue requests during burst
max_requests: 200 # Allow concurrent requests
max_requests_per_connection: 1000 # Keep-alive for 1000 requests
Impact: Eliminates per-request TLS handshake and connection setup; connections are reused.
Savings: ~600ms per request (handshake + setup skipped on pooled connections)
Fix 2: DNS Configuration
==========================
BEFORE:
dns_refresh_rate: 300s (5 minutes)
dns_lookup_family: V4_ONLY
(using default DNS resolver)
AFTER:
dns_refresh_rate: 30s
dns_lookup_family: V4_PREFERRED
dns_cache_config:
max_hosts: 100
dns_ttl: 60s
use_tcp_for_dns_lookups: false
Additionally: Configure Envoy to use the VPC DNS resolver
directly (169.254.169.253) for faster resolution.
Impact: DNS results cached, eliminating per-request lookups.
Savings: -300ms per request (DNS lookup eliminated)
Fix 3: Access Logging Optimization
====================================
BEFORE:
access_log:
- name: envoy.access_loggers.file
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
    path: "/dev/stdout"  # Synchronous stdout write, verbose default format
AFTER:
access_log:
- name: envoy.access_loggers.file
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
    path: "/dev/stdout"
    log_format:
      json_format:
        # Minimal fields only
        duration: "%DURATION%"
        status: "%RESPONSE_CODE%"
        bytes: "%BYTES_RECEIVED%"
# Also: configure async buffered logging in Envoy bootstrap
Impact: Reduced per-request logging overhead.
Savings: -100ms per request
Fix 4: Reduce Header Manipulation
===================================
BEFORE: 12 header rules applied per request
AFTER: 3 essential header rules only
- X-Amzn-Trace-Id propagation
- Authorization header passthrough
- Content-Type enforcement
Remove: decorative headers, debug headers, redundant CORS headers
(CORS is handled at API Gateway level, not sidecar)
Impact: Less CPU work per request.
Savings: -50ms per request
Total Improvement:
BEFORE: 1,400ms sidecar overhead
AFTER: ~350ms sidecar overhead (warm connection pool)
SAVINGS: ~1,050ms per request at p50; the p99 improves further, because
the retries (upstream_rq_retry: 1,203) and pool overflows that inflated
the tail disappear once connections are pooled
New latency profile:
/v1/chat p50: ~1,750ms (within SLA)
/v1/chat p99: ~2,800ms (within SLA)
SLA compliance: ~96%
Prevention
1. Load test sidecar configuration BEFORE production deployment
2. Always configure connection pooling for high-throughput upstreams
3. Monitor Envoy admin stats (/stats endpoint) for pool overflow and connection metrics (a polling sketch follows this list)
4. Use X-Ray traces to identify per-component latency breakdown
5. Set CloudWatch alarm on p99 latency regression > 20% after any deployment
6. Document baseline latency for each endpoint; compare after infrastructure changes
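The polling sketch referenced in prevention item 3: scrape the sidecar admin /stats endpoint (port 9901 on App Mesh Envoy) and forward the pool-overflow counter to CloudWatch; the namespace and metric name are hypothetical:
import urllib.request

import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_envoy_pool_stats() -> None:
    """Push upstream_cx_pool_overflow counters to CloudWatch for alarming."""
    with urllib.request.urlopen("http://localhost:9901/stats") as resp:
        stats = resp.read().decode()
    for line in stats.splitlines():
        # Stats lines look like "cluster.bedrock.upstream_cx_pool_overflow: 247"
        if "upstream_cx_pool_overflow" in line:
            _, _, value = line.rpartition(":")
            cloudwatch.put_metric_data(
                Namespace="MangaAssist/Envoy",
                MetricData=[{"MetricName": "PoolOverflow",
                             "Value": float(value.strip())}],
            )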
Scenario 5: API Gateway Payload Size Limit Blocking Large FM Responses
Situation
MangaAssist introduces a "full catalog search" endpoint that returns AI-generated summaries for multiple manga series matching a search query. The endpoint calls Bedrock Sonnet with RAG context from OpenSearch (5 documents, each 2000 tokens) and requests a comprehensive summary. Some responses exceed 10 MB when combined with metadata, causing API Gateway to return HTTP 413 Payload Too Large errors. The Lambda function completes successfully, but API Gateway rejects the response before delivering it to the client.
Symptoms
CloudWatch API Gateway Logs:
"status": 413
"errorMessage": "Response payload size exceeded maximum allowed payload size"
"requestId": "xyz789-..."
"path": "/v1/catalog-search"
CloudWatch Lambda Logs:
[INFO] Search completed: 12 results, response_size=11,247,891 bytes
[INFO] Response generated successfully in 8,234ms
(Lambda succeeds but API Gateway rejects the response)
API Gateway Metrics:
4XXError: 234 (in 1 hour, all 413s)
IntegrationLatency p50: 8,100ms
Client Error:
HTTP 413: {"message": "Request Too Long"}
Root Cause Analysis
Root Cause: API Gateway payload size limits exceeded by FM response + metadata.
API Gateway Payload Limits:
REST API: 10 MB max response payload
HTTP API: 10 MB max response payload
WebSocket: 128 KB per message, 32 KB per frame
(Note: with Lambda proxy integration, Lambda's own 6 MB synchronous
response limit is typically hit first; either way, oversized FM
responses fail in transit.)
The catalog search response structure:
{
"query": "...", ~100 bytes
"results": [ 12 results
{
"product_id": "...", ~50 bytes
"title": "...", ~200 bytes
"ai_summary": "...", ~2,000 bytes (per result)
"rag_context": "...", ~8,000 bytes (per result)
"embedding_vector": [0.1, 0.2, ...], ~6,000 bytes (768-dim float32)
"metadata": { ... } ~500 bytes
}
],
"ai_analysis": "...", ~5,000 bytes (overall summary)
"debug_info": { ~10 MB (!!)
"opensearch_raw_results": [ ... ], Full OpenSearch response
"bedrock_raw_response": "...", Full Bedrock response
"prompt_template": "...", Full prompt with all context
}
}
The problem: debug_info contains raw responses from OpenSearch and Bedrock.
In development, this was useful for debugging. In production with large
result sets, it pushes the payload beyond the 10 MB limit.
Secondary issue: embedding_vector is included in the response.
Clients don't need the raw embedding vectors (768 float32 values per result).
This adds ~72 KB for 12 results unnecessarily.
Incorrect Approaches
WRONG: "Increase the API Gateway payload limit"
-> The 10 MB limit is a hard service limit. It cannot be increased
via configuration or support ticket.
WRONG: "Compress the response with gzip"
-> API Gateway REST API does not support response compression
natively for Lambda proxy integration responses. The Lambda
would need to return a base64-encoded gzip body, and the client
would need to handle decompression. Even compressed, debug_info
could still exceed 10 MB before compression.
WRONG: "Switch to HTTP API (it has higher limits)"
-> HTTP API has the same 10 MB payload limit as REST API.
No advantage for this specific problem.
WRONG: "Use Lambda response streaming"
-> Lambda response streaming (via Function URL) can stream up to
20 MB. But it's not compatible with API Gateway. Would require
architectural change to use Lambda Function URL directly.
Correct Solution
SOLUTION: Response filtering + pagination + pre-signed S3 for large payloads.
Fix 1: Remove Debug Info from Production Responses
====================================================
The debug_info field accounts for ~10 MB of the response. It should
NEVER be included in production API responses.
Implementation:
# In Lambda handler
params = event.get("queryStringParameters") or {}  # can be null in proxy events
include_debug = params.get("debug") == "true"
stage = event.get("requestContext", {}).get("stage", "prod")
response_body = {
    "query": query,
    "results": formatted_results,
    "ai_analysis": analysis_text,
}
# Only include debug info in dev/staging AND when explicitly requested
if include_debug and stage != "prod":
    response_body["debug_info"] = debug_data
Impact: Reduces typical response from ~11 MB to ~200 KB.
Fix 2: Strip Unnecessary Fields from Results
==============================================
Remove embedding vectors and raw RAG context from client responses.
Clients need summaries, not internal ML artifacts.
BEFORE (per result):
{
"product_id": "...",
"title": "...",
"ai_summary": "...",
"rag_context": "...", # 8 KB - remove
"embedding_vector": [...], # 6 KB - remove
"metadata": { ... }
}
AFTER (per result):
{
"product_id": "...",
"title": "...",
"ai_summary": "...",
"relevance_score": 0.92, # Computed from embedding, not raw vector
"metadata": { "author": "...", "genre": "..." } # Curated subset
}
Impact: Per-result size drops from ~17 KB to ~3 KB.
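A minimal allowlist helper for Fix 2, mirroring the AFTER shape above:
CLIENT_FIELDS = {"product_id", "title", "ai_summary", "relevance_score", "metadata"}

def to_client_result(result: dict) -> dict:
    """Drop internal ML artifacts (rag_context, embedding_vector) before serialization."""
    return {k: v for k, v in result.items() if k in CLIENT_FIELDS}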
Fix 3: Implement Pagination
=============================
Limit results per page and provide cursor-based pagination.
GET /v1/catalog-search?q=shonen&limit=5&cursor=eyJ...
Response:
{
"query": "shonen",
"results": [ ... 5 results ... ],
"ai_analysis": "Based on your search for shonen manga...",
"pagination": {
"total_results": 47,
"returned": 5,
"next_cursor": "eyJwYWdlIjoyLCJvZmZzZXQiOjV9",
"has_more": true
}
}
Impact: Guaranteed response size under 50 KB per page.
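A sketch of the opaque cursor: it is just base64-encoded JSON page state, as decoding the sample value above shows:
import base64
import json

def encode_cursor(page: int, offset: int) -> str:
    state = json.dumps({"page": page, "offset": offset}, separators=(",", ":"))
    return base64.urlsafe_b64encode(state.encode()).decode()

def decode_cursor(cursor: str) -> dict:
    return json.loads(base64.urlsafe_b64decode(cursor.encode()))

print(decode_cursor("eyJwYWdlIjoyLCJvZmZzZXQiOjV9"))  # {'page': 2, 'offset': 5}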
Fix 4: Pre-Signed S3 URL for Truly Large Responses
====================================================
For export/download use cases where the client needs ALL results:
1. Lambda generates full response and writes to S3
2. Lambda generates a pre-signed URL (valid for 15 minutes)
3. API Gateway returns a small response with the download URL
POST /v1/catalog-search/export
Response (< 1 KB):
{
"export_id": "exp-12345",
"status": "ready",
"download_url": "https://s3.ap-northeast-1.amazonaws.com/...",
"expires_in_seconds": 900,
"size_bytes": 11247891,
"format": "application/json"
}
Impact: API Gateway only returns a small JSON pointer.
Client downloads large payload directly from S3 (no size limit).
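A sketch of the export flow under an assumed bucket name; put_object and generate_presigned_url are the standard boto3 calls:
import json
import uuid

import boto3

s3 = boto3.client("s3")
BUCKET = "mangaassist-exports"  # hypothetical bucket

def export_results(full_response: dict) -> dict:
    """Write the oversized payload to S3 and return a small pointer instead."""
    export_id = f"exp-{uuid.uuid4().hex[:8]}"
    key = f"exports/{export_id}.json"
    body = json.dumps(full_response).encode()
    s3.put_object(Bucket=BUCKET, Key=key, Body=body, ContentType="application/json")
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": BUCKET, "Key": key},
        ExpiresIn=900,  # matches expires_in_seconds above
    )
    return {
        "export_id": export_id,
        "status": "ready",
        "download_url": url,
        "expires_in_seconds": 900,
        "size_bytes": len(body),
        "format": "application/json",
    }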
Combined Impact:
Normal search: ~30 KB response (5 results, no debug, no vectors)
Full export: small pointer + S3 pre-signed URL
API Gateway 413 errors: 0
Prevention
1. NEVER include debug/raw data in production API responses
2. Use environment-based response filtering (stage != "prod")
3. Set maximum results per page (default 10, max 50)
4. Strip internal fields (embeddings, raw context) before response serialization
5. Add response size monitoring: CloudWatch alarm if avg response > 1 MB
6. For any endpoint that could return > 5 MB, use S3 pre-signed URL pattern
7. Load test with realistic data volumes (not just toy examples)
8. Add Lambda middleware that measures and logs response size before return
Cross-Scenario Summary: Integration Pattern Failure Modes
| Scenario | Service | Hard Limit | Root Pattern | Fix Category |
|---|---|---|---|---|
| 1. API GW 29s timeout | API Gateway REST | 29 seconds | Synchronous FM call exceeds timeout | Async pattern (SQS polling or WebSocket streaming) |
| 2. Lambda concurrency exhaustion | Lambda | Account concurrency (1,000 default) | Long-running webhook + retry amplification | SQS buffering + per-source isolation |
| 3. EventBridge infinite loop | EventBridge + Lambda | PutEvents quota + concurrency | Catch-all rule matches FM consumer output | Source differentiation + chain depth counter |
| 4. Sidecar latency overhead | App Mesh / Envoy | N/A (configuration issue) | Default connection pool + per-request TLS | Connection pool tuning + DNS caching |
| 5. API GW payload size | API Gateway | 10 MB response | Debug data + embedding vectors in response | Response filtering + pagination + S3 pre-signed URLs |
Key Takeaways for the AIP-C01 Exam
| Concept | What to Remember |
|---|---|
| API GW REST timeout | 29 seconds --- hard limit, not adjustable. Use async or WebSocket for long FM calls. |
| API GW payload limit | 10 MB response --- hard limit. Use pagination or S3 pre-signed URLs for large FM outputs. |
| Lambda concurrency | Default 1,000 per account. Reserved concurrency isolates functions. Use SQS to buffer bursts. |
| Webhook best practice | Validate + enqueue (fast return 200) then process async. Never do heavy work in the receiver. |
| EventBridge loop prevention | Use explicit source lists (not prefix catch-all), processing_stage guards, and chain depth counters. |
| Sidecar connection pooling | Default Envoy settings create new TLS connections per request. Always configure max_connections and keep-alive. |
| Cost circuit breaker | Set CloudWatch alarms on Bedrock invocation count. Auto-disable EventBridge rules if threshold exceeded. |
| Response size management | Strip debug info, embedding vectors, and raw context from production responses. Paginate results. |
| Retry amplification | External webhook retries compound during outages. Fast-acknowledge (SQS buffer) prevents amplification. |
| Async FM pattern | POST returns 202 + job_id. Background worker invokes FM (no timeout). Client polls for result. |