FM System Performance — Scenarios and Runbooks
MangaAssist is a JP Manga store chatbot running on AWS. The stack includes Bedrock Claude 3 (Sonnet for complex queries, Haiku for simple), OpenSearch Serverless for vector retrieval, DynamoDB for sessions/products/orders, ECS Fargate for orchestration, API Gateway WebSocket for real-time communication, and ElastiCache Redis for semantic caching. The p95 end-to-end latency target is < 2 seconds.
Skill Mapping
| AWS AIP-C01 Skill | Sub-Skill | This File Covers |
|---|---|---|
| 4.2.6 FM System Performance | Scenario-Based Troubleshooting | 5 real-world MangaAssist performance scenarios |
| 4.2.6 FM System Performance | Runbook Design | Decision trees for detection, root cause, resolution, prevention |
| 4.2.6 FM System Performance | Operational Excellence | Systematic approaches to latency, networking, and resource issues |
Scenario Overview
| # | Scenario | Root Cause Category | Severity | Time to Detect | Time to Resolve |
|---|---|---|---|---|---|
| 1 | p95 latency spike from DNS resolution timeout | Network / DNS | P2 | 3-5 min | 15-30 min |
| 2 | Connection pool exhaustion under sustained load | Resource / Capacity | P1 | 1-2 min | 10-20 min |
| 3 | NAT gateway bottleneck during VPC endpoint migration | Network / Infrastructure | P2 | 5-10 min | 30-60 min |
| 4 | Unexpected latency from Guardrails check revealed by X-Ray | Performance / Configuration | P3 | 10-15 min | 20-40 min |
| 5 | Cache write latency impacting response time | Architecture / Sync Pattern | P3 | 5-10 min | 30-60 min |
Scenario 1: p95 Latency Spike Due to DNS Resolution Timeout to Bedrock Endpoint
Problem
At 14:22 JST on a Tuesday, MangaAssist p95 latency spikes from 850ms to 2,400ms — breaching the 2-second SLO. The spike affects ~15% of requests. Customers experience noticeable delays when asking for manga recommendations.
The on-call engineer sees no change in traffic volume, no deployment in the last 6 hours, and no Bedrock service health issues.
Detection
graph TB
A["CloudWatch Alarm fires:<br/>p95 latency > 2000ms<br/>for 3 consecutive minutes"]
B["On-call receives PagerDuty alert"]
C{"Check X-Ray traces<br/>for high-latency requests"}
D["X-Ray shows: bedrock-invoke<br/>segment has 1,800ms spikes<br/>(normally 350-800ms)"]
E{"Drill into bedrock-invoke<br/>subsegments"}
F["network-transit subsegment<br/>shows 1,200ms<br/>(normally 5-15ms)"]
G{"Check DNS resolution<br/>timing in subsegment metadata"}
H["DNS resolution for<br/>bedrock-runtime.ap-northeast-1.amazonaws.com<br/>taking 1,000-1,200ms<br/>(normally < 5ms)"]
A --> B --> C --> D --> E --> F --> G --> H
style A fill:#e74c3c,color:#fff
style H fill:#e74c3c,color:#fff
Key metrics that fired:
- MangaAssist/Performance/RequestLatency p95 > 2,000ms
- MangaAssist/Performance/SegmentLatency_bedrock-invoke p95 > 1,500ms
- MangaAssist/VPCEndpoints/DNSResolutionLatency for bedrock-runtime > 500ms
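The MangaAssist/VPCEndpoints/DNSResolutionLatency metric above implies a lightweight probe that times resolution and publishes the result per AZ. A hedged sketch of what such a probe could look like follows; the namespace and metric name come from the alarm above, while the probe function and its dimensions are assumptions for illustration:

```python
import socket
import time

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")

def probe_dns_latency(hostname: str, az: str) -> float:
    """Time one DNS resolution and publish it as a custom CloudWatch metric."""
    start = time.monotonic()
    socket.getaddrinfo(hostname, 443)
    elapsed_ms = (time.monotonic() - start) * 1000
    cloudwatch.put_metric_data(
        Namespace="MangaAssist/VPCEndpoints",
        MetricData=[{
            "MetricName": "DNSResolutionLatency",
            "Dimensions": [
                {"Name": "Hostname", "Value": hostname},
                {"Name": "AvailabilityZone", "Value": az},
            ],
            "Value": elapsed_ms,
            "Unit": "Milliseconds",
        }],
    )
    return elapsed_ms

probe_dns_latency("bedrock-runtime.ap-northeast-1.amazonaws.com", "ap-northeast-1a")
```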
Root Cause
The VPC DNS resolver (Route 53 Resolver) in the ap-northeast-1a availability zone experienced intermittent resolution failures for the Bedrock VPC endpoint's private hosted zone. This caused:
- DNS queries for `bedrock-runtime.ap-northeast-1.amazonaws.com` to fall back to the public DNS resolver
- The public resolver returned a public IP instead of the VPC endpoint's private IP
- Traffic routed through the NAT gateway (adding 15ms) and experienced DNS retry timeouts (adding 1,000ms on the failing attempts)
Only tasks in AZ ap-northeast-1a were affected. Tasks in ap-northeast-1c resolved DNS normally via their local resolver.
Resolution
Immediate (15 minutes):
- Confirm the AZ-specific impact by filtering X-Ray traces on the `ecs.availability_zone` annotation
- Scale down ECS tasks in `ap-northeast-1a` and scale up in `ap-northeast-1c` to shift traffic away from the affected AZ:
# Shift traffic away from affected AZ
aws ecs update-service \
--cluster manga-assist-prod \
--service manga-assist-orchestrator \
--placement-constraints "type=memberOf,expression=attribute:ecs.availability-zone != ap-northeast-1a" \
--desired-count 8
- Monitor p95 latency — it should drop back to < 1,000ms within 2-3 minutes as traffic drains from `ap-northeast-1a`
Longer-term (30 minutes):
- Verify Route 53 Resolver health in `ap-northeast-1a` — check resolver endpoint status via the AWS console
- If the resolver is unhealthy, open an AWS support case (Severity: Business-critical)
- Once resolved, gradually reintroduce `ap-northeast-1a` capacity
Prevention
graph TB
P1["Enable DNS resolution<br/>monitoring per AZ"]
P2["Add application-level<br/>DNS caching (60s TTL)"]
P3["Configure DNS failover:<br/>if private zone fails,<br/>use cached result<br/>not public resolver"]
P4["Deploy ECS tasks<br/>across 3 AZs<br/>(add ap-northeast-1d)"]
P5["Add X-Ray annotation<br/>for AZ on every trace"]
P6["Create CloudWatch alarm:<br/>DNS resolution > 50ms<br/>per AZ per service"]
P1 --> P6
P2 --> P3
P4 --> P5
style P1 fill:#2ecc71,color:#000
style P2 fill:#2ecc71,color:#000
style P3 fill:#2ecc71,color:#000
style P4 fill:#2ecc71,color:#000
style P5 fill:#2ecc71,color:#000
style P6 fill:#2ecc71,color:#000
Prevention checklist:
- [ ] Application-level DNS cache with 60-second TTL (survives resolver blips; see the sketch below)
- [ ] Multi-AZ deployment across 3 AZs (no single-AZ dependency)
- [ ] Per-AZ DNS resolution CloudWatch alarm (detect before customers notice)
- [ ] X-Ray AZ annotation on all traces (instant AZ-level troubleshooting)
- [ ] Runbook documented for the "DNS resolution degradation" scenario
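The first checklist item is the only one that needs application code. A minimal sketch of an application-level DNS cache, assuming the orchestrator is Python and that patching `socket.getaddrinfo` at startup is acceptable; the 60-second TTL mirrors the checklist, everything else is illustrative rather than the MangaAssist implementation:

```python
import socket
import time

_original_getaddrinfo = socket.getaddrinfo
_dns_cache = {}            # (host, port) -> (expiry, cached result)
_DNS_TTL_SECONDS = 60

def cached_getaddrinfo(host, port, *args, **kwargs):
    key = (host, port)
    now = time.monotonic()
    entry = _dns_cache.get(key)
    if entry and entry[0] > now:
        return entry[1]                                   # Serve the cached answer
    try:
        result = _original_getaddrinfo(host, port, *args, **kwargs)
        _dns_cache[key] = (now + _DNS_TTL_SECONDS, result)
        return result
    except socket.gaierror:
        if entry:
            return entry[1]                               # Resolver blip: reuse the stale answer
        raise

socket.getaddrinfo = cached_getaddrinfo                   # Patch once at process start-up
```

Serving the stale answer on a resolver error is what keeps an AZ-local resolver blip from pushing traffic back to the public resolver.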
Scenario 2: Connection Pool Exhaustion Under Sustained Load
Problem
During a manga publisher's flash sale event (new "One Piece" volume release), MangaAssist traffic spikes from 500 req/s to 2,200 req/s over 10 minutes. The system initially handles the spike, but after 8 minutes, error rates jump from 0.1% to 12%. Customers see "Service temporarily unavailable" messages.
The ECS tasks are not CPU or memory constrained. Bedrock is not throttling. The issue is inside the application.
Detection
graph TB
A["CloudWatch Alarm fires:<br/>Error rate > 5%<br/>for 2 consecutive minutes"]
B["Check ECS task metrics:<br/>CPU 45%, Memory 60%<br/>— NOT resource-constrained"]
C["Check Bedrock throttling:<br/>ThrottlingException count = 0<br/>— NOT Bedrock throttle"]
D{"Check application logs<br/>for error patterns"}
E["Logs show:<br/>ConnectionPoolExhausted<br/>'Unable to acquire connection<br/>to bedrock-runtime<br/>within 5000ms timeout'"]
F{"Check connection pool metrics"}
G["Pool size: 10 per task<br/>Active: 10/10 (100%)<br/>Queue depth: 45 waiting<br/>Avg checkout wait: 4,800ms"]
A --> B --> C --> D --> E --> F --> G
style A fill:#e74c3c,color:#fff
style E fill:#e74c3c,color:#fff
style G fill:#e74c3c,color:#fff
Key metrics that fired:
- MangaAssist/Errors/Rate > 5%
- MangaAssist/ConnectionPool/bedrock-runtime/ActiveConnections = max (10/10)
- MangaAssist/ConnectionPool/bedrock-runtime/QueueDepth > 20
Root Cause
MangaAssist configured 10 Bedrock connections per ECS task, assuming average Bedrock response time of 400ms. At 400ms per request, 10 connections can handle 25 req/s per task. With 8 tasks, the cluster handles 200 req/s of Bedrock calls.
The flash sale changed the query pattern:
- Normal: 40% cache hit, 30% Haiku (fast), 30% Sonnet — effective Bedrock load: ~60% of requests
- Flash sale: 5% cache hit (new product, not cached), 10% Haiku, 85% Sonnet (complex comparison queries) — effective Bedrock load: ~95% of requests
At 2,200 req/s with 95% needing Bedrock: 2,090 Bedrock calls/s needed. With 8 tasks x 10 connections x (1000ms / 500ms avg Sonnet time) = 160 effective Bedrock calls/s capacity. Demand exceeded capacity by 13x.
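The prevention section below formalizes this as connections needed = peak_rps x bedrock_miss_rate x avg_latency / 1000. Plugging in the flash-sale numbers above (illustrative arithmetic only) reproduces the same ~13x gap in terms of concurrent connections:

```python
# Back-of-the-envelope pool sizing with the flash-sale numbers above.
peak_rps = 2_200           # requests/s during the sale
bedrock_miss_rate = 0.95   # fraction of requests that reach Bedrock
avg_latency_ms = 500       # Sonnet-heavy average latency during the sale
tasks = 8
pool_per_task = 10

connections_needed = peak_rps * bedrock_miss_rate * avg_latency_ms / 1000
connections_available = tasks * pool_per_task

print(f"Concurrent connections needed:    {connections_needed:.0f}")  # ~1045
print(f"Concurrent connections available: {connections_available}")   # 80
print(f"Shortfall factor: {connections_needed / connections_available:.0f}x")  # ~13x
```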
Resolution
Immediate (10 minutes):
- Scale ECS tasks — increase from 8 to 40 tasks:
aws ecs update-service \
--cluster manga-assist-prod \
--service manga-assist-orchestrator \
--desired-count 40
- Increase connection pool per task — deploy a config change to set pool size from 10 to 25:
# Update SSM parameter (config source for ECS tasks)
aws ssm put-parameter \
--name "/manga-assist/prod/bedrock-pool-size" \
--value "25" \
--type String \
--overwrite
# Force new deployment to pick up the parameter
aws ecs update-service \
--cluster manga-assist-prod \
--service manga-assist-orchestrator \
--force-new-deployment
- Enable more aggressive caching — lower the semantic cache similarity threshold to increase the hit rate:
aws ssm put-parameter \
--name "/manga-assist/prod/cache-similarity-threshold" \
--value "0.85" \
--type String \
--overwrite
After stabilization (20 minutes):
- Verify error rate drops below 1%
- Monitor Bedrock throttling — the increased connection count may trigger Bedrock service limits
- If Bedrock throttles, request a quota increase or switch overflow traffic to Haiku
Prevention
graph TB
subgraph "Capacity Planning"
CP1["Model connection pool sizing<br/>for peak, not average"]
CP2["Calculate: peak_rps x<br/>bedrock_miss_rate x<br/>avg_latency / 1000 =<br/>connections needed"]
CP3["Add 50% headroom to<br/>calculated pool size"]
end
subgraph "Auto-Scaling"
AS1["Scale on connection pool<br/>utilization, not just CPU"]
AS2["Target: pool utilization<br/>< 70% triggers scale-out"]
AS3["Pre-scale for known events<br/>(publisher releases, sales)"]
end
subgraph "Circuit Breaking"
CB1["Bedrock circuit breaker<br/>with Haiku fallback"]
CB2["If Sonnet pool exhausted,<br/>route to Haiku pool<br/>(separate pool, faster)"]
CB3["Queue with timeout<br/>+ graceful degradation"]
end
CP1 --> CP2 --> CP3
AS1 --> AS2 --> AS3
CB1 --> CB2 --> CB3
style CP1 fill:#2ecc71,color:#000
style AS1 fill:#3498db,color:#fff
style CB1 fill:#9b59b6,color:#fff
Prevention checklist:
- [ ] Size connection pools for peak traffic (2x normal during sale events)
- [ ] Auto-scale on ConnectionPoolUtilization > 70% (not just CPU)
- [ ] Separate connection pools for Sonnet and Haiku (Haiku pool is a fallback)
- [ ] Pre-scale 2 hours before known events (new volume releases, sales)
- [ ] Connection pool queue depth alarm at > 10 waiting requests
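The circuit-breaking branch above (separate Sonnet and Haiku pools, with Haiku as the fallback when the Sonnet pool is exhausted) could look roughly like the following. The pool objects, the 500ms timeout, the model IDs, and the `bedrock_invoke(model_id, prompt)` helper are all assumptions for this sketch, not the production code:

```python
import asyncio

# Assumed model identifiers; the orchestrator would load these from config.
SONNET_ID = "anthropic.claude-3-sonnet-20240229-v1:0"
HAIKU_ID = "anthropic.claude-3-haiku-20240307-v1:0"

async def invoke_with_fallback(sonnet_pool: asyncio.Semaphore,
                               haiku_pool: asyncio.Semaphore,
                               prompt: str) -> str:
    """Try the Sonnet pool briefly; degrade to the Haiku pool instead of queueing."""
    try:
        # Wait at most 500ms for a Sonnet connection rather than joining a deep queue.
        await asyncio.wait_for(sonnet_pool.acquire(), timeout=0.5)
    except asyncio.TimeoutError:
        # Sonnet pool exhausted: route this request to the faster Haiku pool.
        async with haiku_pool:
            return await bedrock_invoke(HAIKU_ID, prompt)  # assumed helper
    try:
        return await bedrock_invoke(SONNET_ID, prompt)     # assumed helper
    finally:
        sonnet_pool.release()
```

Routing overflow to Haiku keeps the error rate flat during a spike at the cost of somewhat simpler answers, which matches the graceful-degradation node in the diagram.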
Scenario 3: NAT Gateway Bottleneck Discovered During VPC Endpoint Migration
Problem
The infrastructure team begins migrating MangaAssist from NAT gateway routing to VPC endpoints. They deploy the Bedrock VPC endpoint on Monday, the DynamoDB gateway endpoint on Tuesday, and plan OpenSearch for Wednesday.
On Tuesday evening, after deploying the DynamoDB endpoint, p50 latency increases by 25ms (from 420ms to 445ms). No one notices immediately because the p95 stays within SLO. On Wednesday morning, a sharp-eyed SRE spots the drift in the latency dashboard.
Detection
graph TB
A["SRE reviews daily<br/>latency trend dashboard"]
B["p50 latency increased<br/>25ms since Tuesday deploy"]
C{"Correlate with deployment<br/>timeline"}
D["Tuesday: DynamoDB VPC<br/>gateway endpoint deployed"]
E{"Check per-segment<br/>latency trends"}
F["dynamo-session segment:<br/>p50 dropped from 12ms to 6ms<br/>(good — VPC endpoint working)"]
G["vector-search segment:<br/>p50 increased from 60ms to 85ms<br/>(bad — 25ms regression!)"]
H{"Why would DynamoDB<br/>endpoint affect OpenSearch?"}
I["NAT gateway was handling<br/>both DynamoDB and OpenSearch.<br/>DynamoDB traffic moved off NAT,<br/>but NAT config was changed<br/>during migration, affecting<br/>OpenSearch routing."]
A --> B --> C --> D --> E
E --> F
E --> G
G --> H --> I
style B fill:#f39c12,color:#000
style G fill:#e74c3c,color:#fff
style I fill:#e74c3c,color:#fff
Key metrics:
- MangaAssist/Performance/SegmentLatency_vector-search p50 increased from 60ms to 85ms
- MangaAssist/Performance/SegmentLatency_dynamo-session p50 decreased from 12ms to 6ms
- NAT Gateway BytesProcessed decreased (DynamoDB traffic moved off)
- NAT Gateway ConnectionCount remained same (OpenSearch still routing through it)
Root Cause
The infrastructure team's Terraform change for the DynamoDB gateway endpoint also modified the VPC route table. The route table update:
- Added a route for the DynamoDB gateway endpoint prefix list (correct)
- Inadvertently changed the default route's NAT gateway from the high-performance NAT in `ap-northeast-1a` to a lower-capacity NAT in `ap-northeast-1c` (incorrect — a Terraform state merge conflict)
OpenSearch traffic, still routing through NAT, now went through the ap-northeast-1c NAT gateway. That gateway sat in a different AZ from most ECS tasks, adding cross-AZ latency (~15ms), and it had a lower bandwidth allocation.
Resolution
Immediate (30 minutes):
- Identify the route table change:
# Compare current route table with previous version
aws ec2 describe-route-tables \
--route-table-ids rtb-xxxxx \
--query 'RouteTables[].Routes[]' \
--output table
- Revert the default route to the correct NAT gateway:
aws ec2 replace-route \
--route-table-id rtb-xxxxx \
--destination-cidr-block 0.0.0.0/0 \
--nat-gateway-id nat-aaaa1111 # original NAT in ap-northeast-1a
- Verify OpenSearch latency recovers — check that the `vector-search` segment p50 returns to ~60ms
Complete migration (60 minutes):
- Deploy the OpenSearch VPC interface endpoint (originally planned for Wednesday)
- Once the OpenSearch VPC endpoint is active, OpenSearch traffic bypasses NAT entirely
- Validate all segments are using VPC endpoints via the `VPCEndpointHealthChecker`
Prevention
graph TB
subgraph "Migration Safety"
M1["Deploy VPC endpoints<br/>one at a time with<br/>24-hour soak between"]
M2["Run VPCEndpointHealthChecker<br/>after every infra change"]
M3["Terraform plan review:<br/>check for unintended<br/>route table changes"]
end
subgraph "Observability"
O1["Per-segment latency<br/>trend alerting<br/>(not just p95 breach)"]
O2["Anomaly detection on<br/>p50 latency per segment"]
O3["NAT gateway traffic<br/>baseline comparison<br/>(detect unexpected routing)"]
end
subgraph "Rollback"
R1["Terraform state snapshots<br/>before each migration step"]
R2["Automated rollback trigger:<br/>if any segment p50 increases<br/>> 10ms post-deploy"]
R3["Canary deployment for<br/>infrastructure changes"]
end
M1 --> M2 --> M3
O1 --> O2 --> O3
R1 --> R2 --> R3
style M1 fill:#2ecc71,color:#000
style O1 fill:#3498db,color:#fff
style R1 fill:#9b59b6,color:#fff
Prevention checklist:
- [ ] Terraform plan diff review specifically checks route table changes
- [ ] Automated p50 regression detection (10ms threshold per segment)
- [ ] VPC endpoint health check runs in the CI/CD pipeline after every infra deploy (see the sketch below)
- [ ] Migration runbook requires VPCEndpointHealthChecker validation after each step
- [ ] NAT gateway traffic dashboard shows expected vs. actual traffic volume
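The runbook refers to a VPCEndpointHealthChecker without showing it. A hedged sketch of the kind of post-deploy check it might run is below: it simply confirms that every expected service has an available VPC endpoint in the VPC. Service names and the VPC ID are placeholders:

```python
import boto3

# Services that must have a VPC endpoint after the migration (placeholders).
EXPECTED_SERVICES = [
    "com.amazonaws.ap-northeast-1.bedrock-runtime",
    "com.amazonaws.ap-northeast-1.dynamodb",
]

def missing_vpc_endpoints(vpc_id: str) -> list[str]:
    """Return the expected services that do not have an available endpoint."""
    ec2 = boto3.client("ec2", region_name="ap-northeast-1")
    resp = ec2.describe_vpc_endpoints(
        Filters=[{"Name": "vpc-id", "Values": [vpc_id]}]
    )
    available = {
        ep["ServiceName"]
        for ep in resp["VpcEndpoints"]
        if ep["State"].lower() == "available"
    }
    return [svc for svc in EXPECTED_SERVICES if svc not in available]

missing = missing_vpc_endpoints("vpc-xxxxx")  # placeholder VPC ID
if missing:
    raise SystemExit(f"VPC endpoint check failed: {missing}")
```

Note that an endpoint-existence check alone would not have caught the route-table regression in this scenario; the per-segment latency comparison from the prevention checklist is what closes that gap.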
Scenario 4: X-Ray Profiling Reveals Unexpected Latency from Guardrails Check
Problem
A new Bedrock Guardrails policy is deployed for MangaAssist to filter adult content more aggressively (required for the JP market). After deployment, overall p95 latency increases from 850ms to 1,350ms. The team expects some increase from Guardrails but not 500ms.
The Guardrails check is synchronous — it runs on the complete response before streaming begins. This blocks response delivery.
Detection
graph TB
A["CloudWatch Alarm:<br/>p95 latency increased<br/>from 850ms to 1,350ms"]
B["Check recent deployments:<br/>Guardrails policy update<br/>deployed 2 hours ago"]
C{"X-Ray trace analysis<br/>for high-latency requests"}
D["guardrails-check segment:<br/>p95 = 480ms<br/>(was 40ms before update)"]
E{"Drill into guardrails<br/>subsegment metadata"}
F["New policy has 12 content<br/>filters (was 4).<br/>Each filter runs sequentially.<br/>Long responses (>500 tokens)<br/>take 400-500ms to evaluate."]
A --> B --> C --> D --> E --> F
style A fill:#e74c3c,color:#fff
style D fill:#e74c3c,color:#fff
style F fill:#e74c3c,color:#fff
Key metrics:
- MangaAssist/Performance/SegmentLatency_guardrails-check p95: 40ms -> 480ms
- MangaAssist/Guardrails/EvaluationDuration p95: 480ms
- MangaAssist/Guardrails/FilterCount: 4 -> 12
- MangaAssist/Guardrails/ResponseTokensEvaluated p95: 450 tokens
Root Cause
The new Guardrails policy tripled the number of content filters from 4 to 12, and each filter evaluates the complete response text sequentially. For long responses (500+ tokens, common in manga recommendations that include descriptions), the evaluation time scales linearly with both filter count and response length:
Old: 4 filters x 10ms/filter = 40ms
New: 12 filters x 10ms/filter = 120ms (short responses)
New: 12 filters x 40ms/filter = 480ms (long responses, 500+ tokens)
The 40ms/filter cost for long responses comes from the complex filters (regex patterns for adult content in Japanese), which must scan the full response text. Short responses (< 100 tokens) evaluate quickly; long responses pay that full-text regex cost on every one of the 12 filters.
Resolution
Immediate (20 minutes):
- Move Guardrails off the critical path — change from synchronous (blocking) to asynchronous (non-blocking):
# BEFORE: Synchronous guardrails (blocks response delivery)
response = await bedrock_invoke(prompt)
guardrails_result = await guardrails_check(response) # 480ms blocking!
if guardrails_result.is_safe:
await stream_to_client(response)
# AFTER: Async guardrails (stream immediately, check in background)
response = await bedrock_invoke(prompt)
await stream_to_client(response) # Stream starts immediately
# Check guardrails asynchronously
asyncio.create_task(
guardrails_check_and_remediate(session_id, response)
)
- Add post-hoc remediation — if Guardrails flags a response that was already streamed:
async def guardrails_check_and_remediate(session_id, response):
result = await guardrails_check(response)
if not result.is_safe:
# Send a follow-up message to the user
await send_websocket_message(
session_id,
"I need to revise my previous response. Let me provide "
"updated information..."
)
# Log for review
logger.warning("Guardrails flagged response post-stream: %s",
result.violations)
# Emit metric for monitoring
emit_metric("GuardrailsPostStreamViolation", 1)
Longer-term (40 minutes):
- Optimize the Guardrails policy:
  - Consolidate overlapping filters (12 -> 7 distinct filters)
  - Use tiered evaluation: fast filters first, expensive regex filters only if the fast filters pass (see the sketch after the chunked-evaluation example below)
  - Set `max_tokens_to_evaluate` to limit evaluation to the first 200 tokens (most violations appear early)
- Implement chunked evaluation — for streaming responses, evaluate each chunk as it arrives rather than waiting for the complete response:
async def chunked_guardrails(stream):
    buffer = ""
    async for chunk in stream:      # stream is the async iterator of response chunks
        buffer += chunk
        yield chunk                 # Stream to client immediately
        if len(buffer) > 100:       # Evaluate roughly every 100 characters of buffered text
            result = await guardrails_check(buffer)
            if not result.is_safe:
                yield "[Content revised for safety]"
                break
            buffer = ""             # Reset the window for the next evaluation
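For the tiered-evaluation bullet above, a self-contained illustration of the idea follows. This is not the Bedrock Guardrails API; it only shows the ordering: cheap substring checks run first over a capped slice of the response, and the expensive regex pass runs only when they come back clean. The patterns, keyword list, and 200-token cap are assumptions:

```python
import re

FAST_KEYWORDS = ["example-blocked-term"]             # Tier 1: cheap substring checks (placeholder)
EXPENSIVE_PATTERNS = [re.compile(r"(成人|18禁)向け")]  # Tier 2: placeholder regex filters

def tiered_content_check(text: str, max_tokens: int = 200) -> bool:
    """Return True if the text passes both tiers, False if either tier flags it."""
    head = " ".join(text.split()[:max_tokens])       # rough cap on evaluated tokens
    # Tier 1: fast filters; most responses are decided here.
    if any(term in head for term in FAST_KEYWORDS):
        return False
    # Tier 2: expensive regex filters, only reached when tier 1 passes.
    return not any(pattern.search(head) for pattern in EXPENSIVE_PATTERNS)
```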
Prevention
graph TB
subgraph "Policy Management"
PM1["Load test every Guardrails<br/>policy change in staging"]
PM2["Benchmark: measure p95<br/>evaluation time at<br/>100, 300, 500, 1000 tokens"]
PM3["Set latency budget<br/>for Guardrails: max 50ms"]
end
subgraph "Architecture"
AR1["Default: async Guardrails<br/>(stream first, check later)"]
AR2["Sync only for high-risk<br/>intents (order placement,<br/>payment info)"]
AR3["Chunked evaluation<br/>for streaming responses"]
end
subgraph "Monitoring"
MO1["Alert on Guardrails<br/>evaluation time > 100ms"]
MO2["Track filter count<br/>and complexity metrics"]
MO3["A/B test: sync vs async<br/>Guardrails impact on UX"]
end
PM1 --> PM2 --> PM3
AR1 --> AR2 --> AR3
MO1 --> MO2 --> MO3
style PM1 fill:#2ecc71,color:#000
style AR1 fill:#3498db,color:#fff
style MO1 fill:#9b59b6,color:#fff
Prevention checklist:
- [ ] Load test Guardrails policy changes before production deployment
- [ ] Default to async Guardrails with post-hoc remediation
- [ ] Set CloudWatch alarm on Guardrails evaluation time > 100ms
- [ ] Guardrails latency included in latency budget dashboard
- [ ] Tiered filter evaluation: fast first, expensive only if needed
Scenario 5: Cache Write Latency Impacting Response Time
Problem
MangaAssist's response latency has been gradually increasing over 3 weeks — from p95 of 850ms to p95 of 920ms. No single event caused the increase. Traffic has grown 20% in the same period (new marketing campaign).
The team initially attributes the increase to higher load, but after scaling ECS tasks to handle the traffic, the latency doesn't improve. Something in the request pipeline has gotten slower.
Detection
graph TB
A["Weekly latency review:<br/>p95 trending up 70ms<br/>over 3 weeks"]
B["ECS scaled from 8 to 12<br/>tasks — latency unchanged"]
C{"X-Ray trace comparison:<br/>this week vs 3 weeks ago"}
D["bedrock-invoke: unchanged (800ms)<br/>vector-search: unchanged (85ms)<br/>cache-lookup: unchanged (2ms)"]
E["cache-write: 3ms -> 18ms<br/>(6x increase!)"]
F{"Investigate cache-write<br/>subsegment"}
G["Redis SET operation for<br/>caching responses now takes<br/>15-20ms instead of 2-4ms"]
H{"Check Redis metrics"}
I["Redis memory usage: 92%<br/>Eviction rate: 1,200/min<br/>Connected clients: 580<br/>— Redis under pressure"]
J{"Why is cache-write<br/>on the critical path?"}
K["Code review reveals:<br/>cache write is SYNCHRONOUS<br/>— await redis.set() is called<br/>BEFORE response is returned"]
A --> B --> C --> D --> E --> F --> G --> H --> I --> J --> K
style A fill:#f39c12,color:#000
style E fill:#e74c3c,color:#fff
style I fill:#e74c3c,color:#fff
style K fill:#e74c3c,color:#fff
Key metrics (gradual trend):
- MangaAssist/Performance/SegmentLatency_cache-write p95: Week 1: 4ms, Week 2: 8ms, Week 3: 18ms
- ElastiCache/BytesUsedForCache: 92% of max memory
- ElastiCache/Evictions: 1,200/min (high — Redis is actively evicting keys)
- ElastiCache/CurrConnections: 580 (well below the default maxclients of 65,000, but per-connection overhead is growing)
Root Cause
Two issues combined:
Issue 1: Synchronous cache write on the critical path
The original code awaits the Redis SET before returning the response:
# The problematic code path
async def handle_query(query):
response = await bedrock_invoke(prompt)
await redis.set(cache_key, response, ex=3600) # BLOCKS response!
return response # Response delayed by cache write
When Redis was fast (2-3ms), this was invisible. As Redis got slower, the impact became noticeable.
Issue 2: Redis memory pressure
The 20% traffic increase brought more unique queries, which means more cache entries. The Redis instance (cache.r6g.large, 13.07 GB) filled to 92%. Redis started active eviction (allkeys-lru), which creates:
- Memory fragmentation (slows all operations)
- Eviction processing overhead (Redis is single-threaded — eviction blocks other operations)
- Increased latency on SET operations (must evict before writing)
Resolution
Immediate (30 minutes):
- Make cache writes asynchronous — fire-and-forget pattern:
# FIXED: Async cache write (does not block response)
async def handle_query(query):
response = await bedrock_invoke(prompt)
# Fire-and-forget: schedule cache write, return immediately
asyncio.create_task(
safe_cache_write(cache_key, response)
)
return response # Response returns immediately
async def safe_cache_write(key, value):
"""Cache write with error handling (never fails the request)."""
try:
await redis.set(key, value, ex=3600)
except Exception as e:
logger.warning("Cache write failed (non-critical): %s", e)
emit_metric("CacheWriteFailure", 1)
- Reduce Redis memory pressure — decrease cache TTL from 3600s to 1800s:
aws ssm put-parameter \
--name "/manga-assist/prod/cache-ttl-seconds" \
--value "1800" \
--type String \
--overwrite
Longer-term (60 minutes):
- Scale Redis — upgrade from `cache.r6g.large` (13 GB) to `cache.r6g.xlarge` (26 GB):
aws elasticache modify-replication-group \
--replication-group-id manga-assist-cache \
--cache-node-type cache.r6g.xlarge \
--apply-immediately
- Implement cache size management:
  - Set `maxmemory-policy allkeys-lfu` (least frequently used) instead of `allkeys-lru`
  - Monitor `BytesUsedForCache` with an alarm at 75% (not 90%)
  - Implement cache entry size limits (max 10 KB per entry)
- Add cache write latency monitoring:
# Emit cache write timing as a separate metric
async def timed_cache_write(key, value):
start = time.monotonic()
try:
await redis.set(key, value, ex=1800)
elapsed_ms = (time.monotonic() - start) * 1000
emit_metric("CacheWriteLatency", elapsed_ms)
if elapsed_ms > 10:
logger.warning("Slow cache write: %.1fms for key %s", elapsed_ms, key)
except Exception as e:
logger.warning("Cache write failed: %s", e)
emit_metric("CacheWriteFailure", 1)
Prevention
graph TB
subgraph "Architecture Principle"
AP1["ALL cache writes must<br/>be async/fire-and-forget"]
AP2["Cache is an optimization,<br/>not a dependency"]
AP3["Code review checklist:<br/>no await on cache SET<br/>in request hot path"]
end
subgraph "Capacity Planning"
CP1["Redis memory alarm<br/>at 75% (not 90%)"]
CP2["Auto-scale Redis based<br/>on memory utilization"]
CP3["Project cache growth:<br/>new_queries/day x avg_size<br/>x TTL = memory needed"]
end
subgraph "Monitoring"
MO1["Track cache write latency<br/>as a separate metric"]
MO2["Alert if cache write p95<br/>> 5ms (early warning)"]
MO3["Dashboard: cache memory<br/>utilization trend over 30 days"]
end
AP1 --> AP2 --> AP3
CP1 --> CP2 --> CP3
MO1 --> MO2 --> MO3
style AP1 fill:#2ecc71,color:#000
style CP1 fill:#3498db,color:#fff
style MO1 fill:#9b59b6,color:#fff
Prevention checklist:
- [ ] Enforce async cache writes in code review (linting rule for await redis.set in request handlers)
- [ ] Redis memory alarm at 75% — auto-scale or reduce TTL
- [ ] Cache write latency metric with p95 > 5ms alarm
- [ ] Monthly capacity planning review: cache growth rate vs. available memory
- [ ] Load test cache behavior at 90% memory utilization
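The 75% memory alarm from the checklist above can be defined in code as well as in the console. A hedged sketch using the ElastiCache DatabaseMemoryUsagePercentage metric follows; the alarm name, cluster ID, and SNS topic are placeholders, not the production values:

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")

cloudwatch.put_metric_alarm(
    AlarmName="manga-assist-redis-memory-75pct",           # placeholder alarm name
    Namespace="AWS/ElastiCache",
    MetricName="DatabaseMemoryUsagePercentage",
    Dimensions=[{"Name": "CacheClusterId", "Value": "manga-assist-cache-001"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=5,                                    # 5 consecutive minutes above threshold
    Threshold=75.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:ap-northeast-1:111122223333:manga-assist-alerts"],  # placeholder topic
)
```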
Decision Tree — Latency Spike Troubleshooting
Use this decision tree when MangaAssist p95 latency exceeds the 2-second SLO.
graph TB
START["p95 Latency > 2000ms"]
Q1{"X-Ray: Which segment<br/>is slowest?"}
Q1 -->|"bedrock-invoke"| BQ1{"Bedrock throttling?<br/>ThrottlingException > 0?"}
BQ1 -->|"Yes"| BR1["Scale down concurrency<br/>or request quota increase"]
BQ1 -->|"No"| BQ2{"Network transit<br/>> 50ms?"}
BQ2 -->|"Yes"| BR2["Check VPC endpoint<br/>+ DNS resolution<br/>(Scenario 1)"]
BQ2 -->|"No"| BQ3{"Large prompt<br/>> 4000 tokens?"}
BQ3 -->|"Yes"| BR3["Optimize prompt<br/>(Skill 4.1.1)"]
BQ3 -->|"No"| BR4["Check Bedrock<br/>service health dashboard"]
Q1 -->|"vector-search"| OQ1{"Query has filters?"}
OQ1 -->|"No"| OR1["Add pre-filters<br/>(category, language, stock)"]
OQ1 -->|"Yes"| OQ2{"Returning all fields?"}
OQ2 -->|"Yes"| OR2["Add _source selection<br/>(return only 6 fields)"]
OQ2 -->|"No"| OR3["Check OpenSearch OCU<br/>scaling + cold start"]
Q1 -->|"cache-write/session-update"| CQ1{"Write is synchronous?"}
CQ1 -->|"Yes"| CR1["Make async<br/>(Scenario 5)"]
CQ1 -->|"No"| CQ2{"Redis memory > 80%?"}
CQ2 -->|"Yes"| CR2["Scale Redis<br/>or reduce TTL"]
CQ2 -->|"No"| CR3["Check network path<br/>to Redis endpoint"]
Q1 -->|"guardrails-check"| GQ1{"Evaluation time<br/>> 100ms?"}
GQ1 -->|"Yes"| GR1["Move to async<br/>(Scenario 4)"]
GQ1 -->|"No"| GR2["Check filter count<br/>and response size"]
Q1 -->|"All segments normal<br/>but total is high"| AQ1{"Connection pool<br/>utilization > 80%?"}
AQ1 -->|"Yes"| AR1["Pool exhaustion<br/>(Scenario 2)"]
AQ1 -->|"No"| AQ2{"DNS resolution<br/>> 10ms?"}
AQ2 -->|"Yes"| AR2["DNS issue<br/>(Scenario 1)"]
AQ2 -->|"No"| AR3["Check for new<br/>inter-service hops<br/>or cold ECS tasks"]
style START fill:#e74c3c,color:#fff
style BR1 fill:#2ecc71,color:#000
style BR2 fill:#2ecc71,color:#000
style BR3 fill:#2ecc71,color:#000
style BR4 fill:#f39c12,color:#000
style OR1 fill:#2ecc71,color:#000
style OR2 fill:#2ecc71,color:#000
style OR3 fill:#f39c12,color:#000
style CR1 fill:#2ecc71,color:#000
style CR2 fill:#2ecc71,color:#000
style CR3 fill:#f39c12,color:#000
style GR1 fill:#2ecc71,color:#000
style GR2 fill:#f39c12,color:#000
style AR1 fill:#2ecc71,color:#000
style AR2 fill:#2ecc71,color:#000
style AR3 fill:#f39c12,color:#000
Cross-Scenario Patterns
| Pattern | Scenarios | Lesson |
|---|---|---|
| Async by default | 4, 5 | Non-critical operations (cache writes, Guardrails, analytics) must never block response delivery |
| VPC endpoint verification | 1, 3 | Every infra change needs automated VPC endpoint health validation |
| Per-segment monitoring | 1, 3, 4, 5 | Aggregate latency metrics hide segment-level regressions; always monitor per-segment |
| Capacity math matters | 2 | Connection pool sizing requires peak traffic modeling, not average |
| Gradual degradation is hard to detect | 5 | Weekly latency trend reviews catch slow drifts that alarms miss |
| X-Ray is the single source of truth | All | Every troubleshooting path starts with X-Ray trace analysis |
Key Takeaways
- Every scenario started with X-Ray traces — investing in comprehensive request profiling pays for itself the first time you debug a latency issue. Without per-segment timing, you are guessing.
- Network issues are the most common and most subtle — DNS resolution, VPC endpoint routing, NAT gateway misconfigurations. These are invisible in application logs but clearly visible in network timing metrics.
- Synchronous operations accumulate silently — a 3ms cache write seems harmless, but when Redis slows to 18ms under memory pressure it quietly consumes almost 1% of the 2-second latency budget. Make everything non-critical async.
- Capacity planning for peak, not average — MangaAssist's connection pool was sized for average load, and the flash sale exposed this immediately. Always size for 2x the expected peak, with pre-scaling for known events.
- Infrastructure changes need performance validation gates — the VPC endpoint migration introduced a regression because there was no automated latency check after the Terraform apply. Every infra change should run the VPCEndpointHealthChecker and a per-segment latency comparison.