FM System Performance — Scenarios and Runbooks
MangaAssist is a JP Manga store chatbot running on AWS. The stack includes Bedrock Claude 3 (Sonnet for complex queries, Haiku for simple), OpenSearch Serverless for vector retrieval, DynamoDB for sessions/products/orders, ECS Fargate for orchestration, API Gateway WebSocket for real-time communication, and ElastiCache Redis for semantic caching. The p95 end-to-end latency target is < 2 seconds.
Skill Mapping
| AWS AIP-C01 Skill | Sub-Skill | This File Covers |
|---|---|---|
| 4.2.6 FM System Performance | Scenario-Based Troubleshooting | 5 real-world MangaAssist performance scenarios |
| 4.2.6 FM System Performance | Runbook Design | Decision trees for detection, root cause, resolution, prevention |
| 4.2.6 FM System Performance | Operational Excellence | Systematic approaches to latency, networking, and resource issues |
Scenario Overview
| # | Scenario | Root Cause Category | Severity | Time to Detect | Time to Resolve |
|---|---|---|---|---|---|
| 1 | p95 latency spike from DNS resolution timeout | Network / DNS | P2 | 3-5 min | 15-30 min |
| 2 | Connection pool exhaustion under sustained load | Resource / Capacity | P1 | 1-2 min | 10-20 min |
| 3 | NAT gateway bottleneck during VPC endpoint migration | Network / Infrastructure | P2 | 5-10 min | 30-60 min |
| 4 | Unexpected latency from Guardrails check revealed by X-Ray | Performance / Configuration | P3 | 10-15 min | 20-40 min |
| 5 | Cache write latency impacting response time | Architecture / Sync Pattern | P3 | 5-10 min | 30-60 min |
Scenario 1: p95 Latency Spike Due to DNS Resolution Timeout to Bedrock Endpoint
Problem
At 14:22 JST on a Tuesday, MangaAssist p95 latency spikes from 850ms to 2,400ms — breaching the 2-second SLO. The spike affects ~15% of requests. Customers experience noticeable delays when asking for manga recommendations.
The on-call engineer sees no change in traffic volume, no deployment in the last 6 hours, and no Bedrock service health issues.
Detection
graph TB
A["CloudWatch Alarm fires:<br/>p95 latency > 2000ms<br/>for 3 consecutive minutes"]
B["On-call receives PagerDuty alert"]
C{"Check X-Ray traces<br/>for high-latency requests"}
D["X-Ray shows: bedrock-invoke<br/>segment has 1,800ms spikes<br/>(normally 350-800ms)"]
E{"Drill into bedrock-invoke<br/>subsegments"}
F["network-transit subsegment<br/>shows 1,200ms<br/>(normally 5-15ms)"]
G{"Check DNS resolution<br/>timing in subsegment metadata"}
H["DNS resolution for<br/>bedrock-runtime.ap-northeast-1.amazonaws.com<br/>taking 1,000-1,200ms<br/>(normally < 5ms)"]
A --> B --> C --> D --> E --> F --> G --> H
style A fill:#e74c3c,color:#fff
style H fill:#e74c3c,color:#fff
Key metrics that fired:
- MangaAssist/Performance/RequestLatency p95 > 2,000ms
- MangaAssist/Performance/SegmentLatency_bedrock-invoke p95 > 1,500ms
- MangaAssist/VPCEndpoints/DNSResolutionLatency for bedrock-runtime > 500ms
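The MangaAssist/VPCEndpoints/DNSResolutionLatency metric above implies a lightweight probe that times resolution and publishes the result per AZ. A hedged sketch of what such a probe could look like follows; the namespace and metric name come from the alarm above, while the probe function and its dimensions are assumptions for illustration:

```python
import socket
import time

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")

def probe_dns_latency(hostname: str, az: str) -> float:
    """Time one DNS resolution and publish it as a custom CloudWatch metric."""
    start = time.monotonic()
    socket.getaddrinfo(hostname, 443)
    elapsed_ms = (time.monotonic() - start) * 1000
    cloudwatch.put_metric_data(
        Namespace="MangaAssist/VPCEndpoints",
        MetricData=[{
            "MetricName": "DNSResolutionLatency",
            "Dimensions": [
                {"Name": "Hostname", "Value": hostname},
                {"Name": "AvailabilityZone", "Value": az},
            ],
            "Value": elapsed_ms,
            "Unit": "Milliseconds",
        }],
    )
    return elapsed_ms

probe_dns_latency("bedrock-runtime.ap-northeast-1.amazonaws.com", "ap-northeast-1a")
```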
Root Cause
The VPC DNS resolver (Route 53 Resolver) in the ap-northeast-1a availability zone experienced intermittent resolution failures for the Bedrock VPC endpoint's private hosted zone. This caused:
- DNS queries for `bedrock-runtime.ap-northeast-1.amazonaws.com` to fall back to the public DNS resolver
- The public resolver returned a public IP instead of the VPC endpoint's private IP
- Traffic routed through the NAT gateway (adding 15ms) and experienced DNS retry timeouts (adding 1,000ms on the failing attempts)
Only tasks in AZ ap-northeast-1a were affected. Tasks in ap-northeast-1c resolved DNS normally via their local resolver.
Resolution
Immediate (15 minutes):
- Confirm the AZ-specific impact by filtering X-Ray traces on the `ecs.availability_zone` annotation
- Scale down ECS tasks in `ap-northeast-1a` and scale up in `ap-northeast-1c` to shift traffic away from the affected AZ:
# Shift traffic away from affected AZ
aws ecs update-service \
--cluster manga-assist-prod \
--service manga-assist-orchestrator \
--placement-constraints "type=memberOf,expression=attribute:ecs.availability-zone != ap-northeast-1a" \
--desired-count 8
- Monitor p95 latency — it should drop back to < 1,000ms within 2-3 minutes as traffic drains from `ap-northeast-1a`
Longer-term (30 minutes):
- Verify Route 53 Resolver health in `ap-northeast-1a` — check resolver endpoint status via the AWS console
- If the resolver is unhealthy, open an AWS support case (Severity: Business-critical)
- Once resolved, gradually reintroduce `ap-northeast-1a` capacity
Prevention
graph TB
P1["Enable DNS resolution<br/>monitoring per AZ"]
P2["Add application-level<br/>DNS caching (60s TTL)"]
P3["Configure DNS failover:<br/>if private zone fails,<br/>use cached result<br/>not public resolver"]
P4["Deploy ECS tasks<br/>across 3 AZs<br/>(add ap-northeast-1d)"]
P5["Add X-Ray annotation<br/>for AZ on every trace"]
P6["Create CloudWatch alarm:<br/>DNS resolution > 50ms<br/>per AZ per service"]
P1 --> P6
P2 --> P3
P4 --> P5
style P1 fill:#2ecc71,color:#000
style P2 fill:#2ecc71,color:#000
style P3 fill:#2ecc71,color:#000
style P4 fill:#2ecc71,color:#000
style P5 fill:#2ecc71,color:#000
style P6 fill:#2ecc71,color:#000
Prevention checklist:
- [ ] Application-level DNS cache with 60-second TTL (survives resolver blips; see the sketch below)
- [ ] Multi-AZ deployment across 3 AZs (no single-AZ dependency)
- [ ] Per-AZ DNS resolution CloudWatch alarm (detect before customers notice)
- [ ] X-Ray AZ annotation on all traces (instant AZ-level troubleshooting)
- [ ] Runbook documented for the "DNS resolution degradation" scenario
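The first checklist item is the only one that needs application code. A minimal sketch of an application-level DNS cache, assuming the orchestrator is Python and that patching `socket.getaddrinfo` at startup is acceptable; the 60-second TTL mirrors the checklist, everything else is illustrative rather than the MangaAssist implementation:

```python
import socket
import time

_original_getaddrinfo = socket.getaddrinfo
_dns_cache = {}            # (host, port) -> (expiry, cached result)
_DNS_TTL_SECONDS = 60

def cached_getaddrinfo(host, port, *args, **kwargs):
    key = (host, port)
    now = time.monotonic()
    entry = _dns_cache.get(key)
    if entry and entry[0] > now:
        return entry[1]                                   # Serve the cached answer
    try:
        result = _original_getaddrinfo(host, port, *args, **kwargs)
        _dns_cache[key] = (now + _DNS_TTL_SECONDS, result)
        return result
    except socket.gaierror:
        if entry:
            return entry[1]                               # Resolver blip: reuse the stale answer
        raise

socket.getaddrinfo = cached_getaddrinfo                   # Patch once at process start-up
```

Serving the stale answer on a resolver error is what keeps an AZ-local resolver blip from pushing traffic back to the public resolver.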
Scenario 2: Connection Pool Exhaustion Under Sustained Load
Problem
During a manga publisher's flash sale event (new "One Piece" volume release), MangaAssist traffic spikes from 500 req/s to 2,200 req/s over 10 minutes. The system initially handles the spike, but after 8 minutes, error rates jump from 0.1% to 12%. Customers see "Service temporarily unavailable" messages.
The ECS tasks are not CPU or memory constrained. Bedrock is not throttling. The issue is inside the application.
Detection
graph TB
A["CloudWatch Alarm fires:<br/>Error rate > 5%<br/>for 2 consecutive minutes"]
B["Check ECS task metrics:<br/>CPU 45%, Memory 60%<br/>— NOT resource-constrained"]
C["Check Bedrock throttling:<br/>ThrottlingException count = 0<br/>— NOT Bedrock throttle"]
D{"Check application logs<br/>for error patterns"}
E["Logs show:<br/>ConnectionPoolExhausted<br/>'Unable to acquire connection<br/>to bedrock-runtime<br/>within 5000ms timeout'"]
F{"Check connection pool metrics"}
G["Pool size: 10 per task<br/>Active: 10/10 (100%)<br/>Queue depth: 45 waiting<br/>Avg checkout wait: 4,800ms"]
A --> B --> C --> D --> E --> F --> G
style A fill:#e74c3c,color:#fff
style E fill:#e74c3c,color:#fff
style G fill:#e74c3c,color:#fff
Key metrics that fired:
- MangaAssist/Errors/Rate > 5%
- MangaAssist/ConnectionPool/bedrock-runtime/ActiveConnections = max (10/10)
- MangaAssist/ConnectionPool/bedrock-runtime/QueueDepth > 20
Root Cause
MangaAssist configured 10 Bedrock connections per ECS task, assuming average Bedrock response time of 400ms. At 400ms per request, 10 connections can handle 25 req/s per task. With 8 tasks, the cluster handles 200 req/s of Bedrock calls.
The flash sale changed the query pattern:
- Normal: 40% cache hit, 30% Haiku (fast), 30% Sonnet — effective Bedrock load: ~60% of requests
- Flash sale: 5% cache hit (new product, not cached), 10% Haiku, 85% Sonnet (complex comparison queries) — effective Bedrock load: ~95% of requests
At 2,200 req/s with 95% needing Bedrock: 2,090 Bedrock calls/s needed. With 8 tasks x 10 connections x (1000ms / 500ms avg Sonnet time) = 160 effective Bedrock calls/s capacity. Demand exceeded capacity by 13x.
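The prevention section below formalizes this as connections needed = peak_rps x bedrock_miss_rate x avg_latency / 1000. Plugging in the flash-sale numbers above (illustrative arithmetic only) reproduces the same ~13x gap in terms of concurrent connections:

```python
# Back-of-the-envelope pool sizing with the flash-sale numbers above.
peak_rps = 2_200           # requests/s during the sale
bedrock_miss_rate = 0.95   # fraction of requests that reach Bedrock
avg_latency_ms = 500       # Sonnet-heavy average latency during the sale
tasks = 8
pool_per_task = 10

connections_needed = peak_rps * bedrock_miss_rate * avg_latency_ms / 1000
connections_available = tasks * pool_per_task

print(f"Concurrent connections needed:    {connections_needed:.0f}")  # ~1045
print(f"Concurrent connections available: {connections_available}")   # 80
print(f"Shortfall factor: {connections_needed / connections_available:.0f}x")  # ~13x
```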
Resolution
Immediate (10 minutes):
- Scale ECS tasks — increase from 8 to 40 tasks:
aws ecs update-service \
--cluster manga-assist-prod \
--service manga-assist-orchestrator \
--desired-count 40
- Increase connection pool per task — deploy a config change to set pool size from 10 to 25:
# Update SSM parameter (config source for ECS tasks)
aws ssm put-parameter \
--name "/manga-assist/prod/bedrock-pool-size" \
--value "25" \
--type String \
--overwrite
# Force new deployment to pick up the parameter
aws ecs update-service \
--cluster manga-assist-prod \
--service manga-assist-orchestrator \
--force-new-deployment
- Enable more aggressive caching — lower the semantic cache similarity threshold to increase the hit rate:
aws ssm put-parameter \
--name "/manga-assist/prod/cache-similarity-threshold" \
--value "0.85" \
--type String \
--overwrite
After stabilization (20 minutes):
- Verify error rate drops below 1%
- Monitor Bedrock throttling — the increased connection count may trigger Bedrock service limits
- If Bedrock throttles, request a quota increase or switch overflow traffic to Haiku
Prevention
graph TB
subgraph "Capacity Planning"
CP1["Model connection pool sizing<br/>for peak, not average"]
CP2["Calculate: peak_rps x<br/>bedrock_miss_rate x<br/>avg_latency / 1000 =<br/>connections needed"]
CP3["Add 50% headroom to<br/>calculated pool size"]
end
subgraph "Auto-Scaling"
AS1["Scale on connection pool<br/>utilization, not just CPU"]
AS2["Target: pool utilization<br/>< 70% triggers scale-out"]
AS3["Pre-scale for known events<br/>(publisher releases, sales)"]
end
subgraph "Circuit Breaking"
CB1["Bedrock circuit breaker<br/>with Haiku fallback"]
CB2["If Sonnet pool exhausted,<br/>route to Haiku pool<br/>(separate pool, faster)"]
CB3["Queue with timeout<br/>+ graceful degradation"]
end
CP1 --> CP2 --> CP3
AS1 --> AS2 --> AS3
CB1 --> CB2 --> CB3
style CP1 fill:#2ecc71,color:#000
style AS1 fill:#3498db,color:#fff
style CB1 fill:#9b59b6,color:#fff
Prevention checklist:
- [ ] Size connection pools for peak traffic (2x normal during sale events)
- [ ] Auto-scale on ConnectionPoolUtilization > 70% (not just CPU)
- [ ] Separate connection pools for Sonnet and Haiku (Haiku pool is a fallback)
- [ ] Pre-scale 2 hours before known events (new volume releases, sales)
- [ ] Connection pool queue depth alarm at > 10 waiting requests
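The circuit-breaking branch above (separate Sonnet and Haiku pools, with Haiku as the fallback when the Sonnet pool is exhausted) could look roughly like the following. The pool objects, the 500ms timeout, the model IDs, and the `bedrock_invoke(model_id, prompt)` helper are all assumptions for this sketch, not the production code:

```python
import asyncio

# Assumed model identifiers; the orchestrator would load these from config.
SONNET_ID = "anthropic.claude-3-sonnet-20240229-v1:0"
HAIKU_ID = "anthropic.claude-3-haiku-20240307-v1:0"

async def invoke_with_fallback(sonnet_pool: asyncio.Semaphore,
                               haiku_pool: asyncio.Semaphore,
                               prompt: str) -> str:
    """Try the Sonnet pool briefly; degrade to the Haiku pool instead of queueing."""
    try:
        # Wait at most 500ms for a Sonnet connection rather than joining a deep queue.
        await asyncio.wait_for(sonnet_pool.acquire(), timeout=0.5)
    except asyncio.TimeoutError:
        # Sonnet pool exhausted: route this request to the faster Haiku pool.
        async with haiku_pool:
            return await bedrock_invoke(HAIKU_ID, prompt)  # assumed helper
    try:
        return await bedrock_invoke(SONNET_ID, prompt)     # assumed helper
    finally:
        sonnet_pool.release()
```

Routing overflow to Haiku keeps the error rate flat during a spike at the cost of somewhat simpler answers, which matches the graceful-degradation node in the diagram.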
Scenario 3: NAT Gateway Bottleneck Discovered During VPC Endpoint Migration
Problem
The infrastructure team begins migrating MangaAssist from NAT gateway routing to VPC endpoints. They deploy the Bedrock VPC endpoint on Monday, the DynamoDB gateway endpoint on Tuesday, and plan OpenSearch for Wednesday.
On Tuesday evening, after deploying the DynamoDB endpoint, p50 latency increases by 25ms (from 420ms to 445ms). No one notices immediately because the p95 stays within SLO. On Wednesday morning, a sharp-eyed SRE spots the drift in the latency dashboard.
Detection
graph TB
A["SRE reviews daily<br/>latency trend dashboard"]
B["p50 latency increased<br/>25ms since Tuesday deploy"]
C{"Correlate with deployment<br/>timeline"}
D["Tuesday: DynamoDB VPC<br/>gateway endpoint deployed"]
E{"Check per-segment<br/>latency trends"}
F["dynamo-session segment:<br/>p50 dropped from 12ms to 6ms<br/>(good — VPC endpoint working)"]
G["vector-search segment:<br/>p50 increased from 60ms to 85ms<br/>(bad — 25ms regression!)"]
H{"Why would DynamoDB<br/>endpoint affect OpenSearch?"}
I["NAT gateway was handling<br/>both DynamoDB and OpenSearch.<br/>DynamoDB traffic moved off NAT,<br/>but NAT config was changed<br/>during migration, affecting<br/>OpenSearch routing."]
A --> B --> C --> D --> E
E --> F
E --> G
G --> H --> I
style B fill:#f39c12,color:#000
style G fill:#e74c3c,color:#fff
style I fill:#e74c3c,color:#fff
Key metrics:
- MangaAssist/Performance/SegmentLatency_vector-search p50 increased from 60ms to 85ms
- MangaAssist/Performance/SegmentLatency_dynamo-session p50 decreased from 12ms to 6ms
- NAT Gateway BytesProcessed decreased (DynamoDB traffic moved off)
- NAT Gateway ConnectionCount remained same (OpenSearch still routing through it)
Root Cause
The infrastructure team's Terraform change for the DynamoDB gateway endpoint also modified the VPC route table. The route table update:
- Added a route for the DynamoDB gateway endpoint prefix list (correct)
- Inadvertently changed the default route's NAT gateway from the high-performance NAT in `ap-northeast-1a` to a lower-capacity NAT in `ap-northeast-1c` (incorrect — a Terraform state merge conflict)
OpenSearch traffic, still routing through NAT, now went through the ap-northeast-1c NAT gateway. That gateway sat in a different AZ from most ECS tasks, adding cross-AZ latency (~15ms), and it had a lower bandwidth allocation.
Resolution
Immediate (30 minutes):
- Identify the route table change:
# Compare current route table with previous version
aws ec2 describe-route-tables \
--route-table-ids rtb-xxxxx \
--query 'RouteTables[].Routes[]' \
--output table
- Revert the default route to the correct NAT gateway:
aws ec2 replace-route \
--route-table-id rtb-xxxxx \
--destination-cidr-block 0.0.0.0/0 \
--nat-gateway-id nat-aaaa1111 # original NAT in ap-northeast-1a
- Verify OpenSearch latency recovers — check that the `vector-search` segment p50 returns to ~60ms
Complete migration (60 minutes):
- Deploy the OpenSearch VPC interface endpoint (originally planned for Wednesday)
- Once the OpenSearch VPC endpoint is active, OpenSearch traffic bypasses NAT entirely
- Validate all segments are using VPC endpoints via the `VPCEndpointHealthChecker`
Prevention
graph TB
subgraph "Migration Safety"
M1["Deploy VPC endpoints<br/>one at a time with<br/>24-hour soak between"]
M2["Run VPCEndpointHealthChecker<br/>after every infra change"]
M3["Terraform plan review:<br/>check for unintended<br/>route table changes"]
end
subgraph "Observability"
O1["Per-segment latency<br/>trend alerting<br/>(not just p95 breach)"]
O2["Anomaly detection on<br/>p50 latency per segment"]
O3["NAT gateway traffic<br/>baseline comparison<br/>(detect unexpected routing)"]
end
subgraph "Rollback"
R1["Terraform state snapshots<br/>before each migration step"]
R2["Automated rollback trigger:<br/>if any segment p50 increases<br/>> 10ms post-deploy"]
R3["Canary deployment for<br/>infrastructure changes"]
end
M1 --> M2 --> M3
O1 --> O2 --> O3
R1 --> R2 --> R3
style M1 fill:#2ecc71,color:#000
style O1 fill:#3498db,color:#fff
style R1 fill:#9b59b6,color:#fff
Prevention checklist:
- [ ] Terraform plan diff review specifically checks route table changes
- [ ] Automated p50 regression detection (10ms threshold per segment)
- [ ] VPC endpoint health check runs in the CI/CD pipeline after every infra deploy (see the sketch below)
- [ ] Migration runbook requires VPCEndpointHealthChecker validation after each step
- [ ] NAT gateway traffic dashboard shows expected vs. actual traffic volume
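The runbook refers to a VPCEndpointHealthChecker without showing it. A hedged sketch of the kind of post-deploy check it might run is below: it simply confirms that every expected service has an available VPC endpoint in the VPC. Service names and the VPC ID are placeholders:

```python
import boto3

# Services that must have a VPC endpoint after the migration (placeholders).
EXPECTED_SERVICES = [
    "com.amazonaws.ap-northeast-1.bedrock-runtime",
    "com.amazonaws.ap-northeast-1.dynamodb",
]

def missing_vpc_endpoints(vpc_id: str) -> list[str]:
    """Return the expected services that do not have an available endpoint."""
    ec2 = boto3.client("ec2", region_name="ap-northeast-1")
    resp = ec2.describe_vpc_endpoints(
        Filters=[{"Name": "vpc-id", "Values": [vpc_id]}]
    )
    available = {
        ep["ServiceName"]
        for ep in resp["VpcEndpoints"]
        if ep["State"].lower() == "available"
    }
    return [svc for svc in EXPECTED_SERVICES if svc not in available]

missing = missing_vpc_endpoints("vpc-xxxxx")  # placeholder VPC ID
if missing:
    raise SystemExit(f"VPC endpoint check failed: {missing}")
```

Note that an endpoint-existence check alone would not have caught the route-table regression in this scenario; the per-segment latency comparison from the prevention checklist is what closes that gap.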
Scenario 4: X-Ray Profiling Reveals Unexpected Latency from Guardrails Check
Problem
A new Bedrock Guardrails policy is deployed for MangaAssist to filter adult content more aggressively (required for the JP market). After deployment, overall p95 latency increases from 850ms to 1,350ms. The team expects some increase from Guardrails but not 500ms.
The Guardrails check is synchronous — it runs on the complete response before streaming begins. This blocks response delivery.
Detection
graph TB
A["CloudWatch Alarm:<br/>p95 latency increased<br/>from 850ms to 1,350ms"]
B["Check recent deployments:<br/>Guardrails policy update<br/>deployed 2 hours ago"]
C{"X-Ray trace analysis<br/>for high-latency requests"}
D["guardrails-check segment:<br/>p95 = 480ms<br/>(was 40ms before update)"]
E{"Drill into guardrails<br/>subsegment metadata"}
F["New policy has 12 content<br/>filters (was 4).<br/>Each filter runs sequentially.<br/>Long responses (>500 tokens)<br/>take 400-500ms to evaluate."]
A --> B --> C --> D --> E --> F
style A fill:#e74c3c,color:#fff
style D fill:#e74c3c,color:#fff
style F fill:#e74c3c,color:#fff
Key metrics:
- MangaAssist/Performance/SegmentLatency_guardrails-check p95: 40ms -> 480ms
- MangaAssist/Guardrails/EvaluationDuration p95: 480ms
- MangaAssist/Guardrails/FilterCount: 4 -> 12
- MangaAssist/Guardrails/ResponseTokensEvaluated p95: 450 tokens
Root Cause
The new Guardrails policy tripled the number of content filters from 4 to 12, and each filter evaluates the complete response text sequentially. For long responses (500+ tokens, common in manga recommendations that include descriptions), the evaluation time scales linearly with both filter count and response length:
Old: 4 filters x 10ms/filter = 40ms
New: 12 filters x 10ms/filter = 120ms (short responses)
New: 12 filters x 40ms/filter = 480ms (long responses, 500+ tokens)
The 40ms/filter cost for long responses comes from the complex filters (regex patterns for adult content in Japanese), which must scan the full response text. Short responses (< 100 tokens) evaluate quickly; long responses pay that full-text regex cost on every one of the 12 filters.
Resolution
Immediate (20 minutes):
- Move Guardrails off the critical path — change from synchronous (blocking) to asynchronous (non-blocking):
# BEFORE: Synchronous guardrails (blocks response delivery)
response = await bedrock_invoke(prompt)
guardrails_result = await guardrails_check(response) # 480ms blocking!
if guardrails_result.is_safe:
await stream_to_client(response)
# AFTER: Async guardrails (stream immediately, check in background)
response = await bedrock_invoke(prompt)
await stream_to_client(response) # Stream starts immediately
# Check guardrails asynchronously
asyncio.create_task(
guardrails_check_and_remediate(session_id, response)
)
- Add post-hoc remediation — if Guardrails flags a response that was already streamed:
async def guardrails_check_and_remediate(session_id, response):
result = await guardrails_check(response)
if not result.is_safe:
# Send a follow-up message to the user
await send_websocket_message(
session_id,
"I need to revise my previous response. Let me provide "
"updated information..."
)
# Log for review
logger.warning("Guardrails flagged response post-stream: %s",
result.violations)
# Emit metric for monitoring
emit_metric("GuardrailsPostStreamViolation", 1)
Longer-term (40 minutes):
- Optimize the Guardrails policy:
  - Consolidate overlapping filters (12 -> 7 distinct filters)
  - Use tiered evaluation: fast filters first, expensive regex filters only if the fast filters pass (see the sketch after the chunked-evaluation example below)
  - Set `max_tokens_to_evaluate` to limit evaluation to the first 200 tokens (most violations appear early)
- Implement chunked evaluation — for streaming responses, evaluate each chunk as it arrives rather than waiting for the complete response:
async def chunked_guardrails(stream):
    buffer = ""
    async for chunk in stream:      # stream is the async iterator of response chunks
        buffer += chunk
        yield chunk                 # Stream to client immediately
        if len(buffer) > 100:       # Evaluate roughly every 100 characters of buffered text
            result = await guardrails_check(buffer)
            if not result.is_safe:
                yield "[Content revised for safety]"
                break
            buffer = ""             # Reset the window for the next evaluation
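For the tiered-evaluation bullet above, a self-contained illustration of the idea follows. This is not the Bedrock Guardrails API; it only shows the ordering: cheap substring checks run first over a capped slice of the response, and the expensive regex pass runs only when they come back clean. The patterns, keyword list, and 200-token cap are assumptions:

```python
import re

FAST_KEYWORDS = ["example-blocked-term"]             # Tier 1: cheap substring checks (placeholder)
EXPENSIVE_PATTERNS = [re.compile(r"(成人|18禁)向け")]  # Tier 2: placeholder regex filters

def tiered_content_check(text: str, max_tokens: int = 200) -> bool:
    """Return True if the text passes both tiers, False if either tier flags it."""
    head = " ".join(text.split()[:max_tokens])       # rough cap on evaluated tokens
    # Tier 1: fast filters; most responses are decided here.
    if any(term in head for term in FAST_KEYWORDS):
        return False
    # Tier 2: expensive regex filters, only reached when tier 1 passes.
    return not any(pattern.search(head) for pattern in EXPENSIVE_PATTERNS)
```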
Prevention
graph TB
subgraph "Policy Management"
PM1["Load test every Guardrails<br/>policy change in staging"]
PM2["Benchmark: measure p95<br/>evaluation time at<br/>100, 300, 500, 1000 tokens"]
PM3["Set latency budget<br/>for Guardrails: max 50ms"]
end
subgraph "Architecture"
AR1["Default: async Guardrails<br/>(stream first, check later)"]
AR2["Sync only for high-risk<br/>intents (order placement,<br/>payment info)"]
AR3["Chunked evaluation<br/>for streaming responses"]
end
subgraph "Monitoring"
MO1["Alert on Guardrails<br/>evaluation time > 100ms"]
MO2["Track filter count<br/>and complexity metrics"]
MO3["A/B test: sync vs async<br/>Guardrails impact on UX"]
end
PM1 --> PM2 --> PM3
AR1 --> AR2 --> AR3
MO1 --> MO2 --> MO3
style PM1 fill:#2ecc71,color:#000
style AR1 fill:#3498db,color:#fff
style MO1 fill:#9b59b6,color:#fff
Prevention checklist:
- [ ] Load test Guardrails policy changes before production deployment
- [ ] Default to async Guardrails with post-hoc remediation
- [ ] Set CloudWatch alarm on Guardrails evaluation time > 100ms
- [ ] Guardrails latency included in latency budget dashboard
- [ ] Tiered filter evaluation: fast first, expensive only if needed
Scenario 5: Cache Write Latency Impacting Response Time
Problem
MangaAssist's response latency has been gradually increasing over 3 weeks — from p95 of 850ms to p95 of 920ms. No single event caused the increase. Traffic has grown 20% in the same period (new marketing campaign).
The team initially attributes the increase to higher load, but after scaling ECS tasks to handle the traffic, the latency doesn't improve. Something in the request pipeline has gotten slower.
Detection
graph TB
A["Weekly latency review:<br/>p95 trending up 70ms<br/>over 3 weeks"]
B["ECS scaled from 8 to 12<br/>tasks — latency unchanged"]
C{"X-Ray trace comparison:<br/>this week vs 3 weeks ago"}
D["bedrock-invoke: unchanged (800ms)<br/>vector-search: unchanged (85ms)<br/>cache-lookup: unchanged (2ms)"]
E["cache-write: 3ms -> 18ms<br/>(6x increase!)"]
F{"Investigate cache-write<br/>subsegment"}
G["Redis SET operation for<br/>caching responses now takes<br/>15-20ms instead of 2-4ms"]
H{"Check Redis metrics"}
I["Redis memory usage: 92%<br/>Eviction rate: 1,200/min<br/>Connected clients: 580<br/>— Redis under pressure"]
J{"Why is cache-write<br/>on the critical path?"}
K["Code review reveals:<br/>cache write is SYNCHRONOUS<br/>— await redis.set() is called<br/>BEFORE response is returned"]
A --> B --> C --> D --> E --> F --> G --> H --> I --> J --> K
style A fill:#f39c12,color:#000
style E fill:#e74c3c,color:#fff
style I fill:#e74c3c,color:#fff
style K fill:#e74c3c,color:#fff
Key metrics (gradual trend):
- MangaAssist/Performance/SegmentLatency_cache-write p95: Week 1: 4ms, Week 2: 8ms, Week 3: 18ms
- ElastiCache/BytesUsedForCache: 92% of max memory
- ElastiCache/Evictions: 1,200/min (high — Redis is actively evicting keys)
- ElastiCache/CurrConnections: 580 (well below the default maxclients of 65,000, but per-connection overhead is growing)
Root Cause
Two issues combined:
Issue 1: Synchronous cache write on the critical path
The original code awaits the Redis SET before returning the response:
# The problematic code path
async def handle_query(query):
response = await bedrock_invoke(prompt)
await redis.set(cache_key, response, ex=3600) # BLOCKS response!
return response # Response delayed by cache write
When Redis was fast (2-3ms), this was invisible. As Redis got slower, the impact became noticeable.
Issue 2: Redis memory pressure
The 20% traffic increase brought more unique queries, which means more cache entries. The Redis instance (cache.r6g.large, 13.07 GB) filled to 92%. Redis started active eviction (allkeys-lru), which creates:
- Memory fragmentation (slows all operations)
- Eviction processing overhead (Redis is single-threaded — eviction blocks other operations)
- Increased latency on SET operations (must evict before writing)
Resolution
Immediate (30 minutes):
- Make cache writes asynchronous — fire-and-forget pattern:
# FIXED: Async cache write (does not block response)
async def handle_query(query):
response = await bedrock_invoke(prompt)
# Fire-and-forget: schedule cache write, return immediately
asyncio.create_task(
safe_cache_write(cache_key, response)
)
return response # Response returns immediately
async def safe_cache_write(key, value):
"""Cache write with error handling (never fails the request)."""
try:
await redis.set(key, value, ex=3600)
except Exception as e:
logger.warning("Cache write failed (non-critical): %s", e)
emit_metric("CacheWriteFailure", 1)
- Reduce Redis memory pressure — decrease cache TTL from 3600s to 1800s:
aws ssm put-parameter \
--name "/manga-assist/prod/cache-ttl-seconds" \
--value "1800" \
--type String \
--overwrite
Longer-term (60 minutes):
- Scale Redis — upgrade from `cache.r6g.large` (13 GB) to `cache.r6g.xlarge` (26 GB):
aws elasticache modify-replication-group \
--replication-group-id manga-assist-cache \
--cache-node-type cache.r6g.xlarge \
--apply-immediately
- Implement cache size management:
  - Set `maxmemory-policy allkeys-lfu` (least frequently used) instead of `allkeys-lru`
  - Monitor `BytesUsedForCache` with an alarm at 75% (not 90%)
  - Implement cache entry size limits (max 10 KB per entry)
- Add cache write latency monitoring:
# Emit cache write timing as a separate metric
async def timed_cache_write(key, value):
start = time.monotonic()
try:
await redis.set(key, value, ex=1800)
elapsed_ms = (time.monotonic() - start) * 1000
emit_metric("CacheWriteLatency", elapsed_ms)
if elapsed_ms > 10:
logger.warning("Slow cache write: %.1fms for key %s", elapsed_ms, key)
except Exception as e:
logger.warning("Cache write failed: %s", e)
emit_metric("CacheWriteFailure", 1)
Prevention
graph TB
subgraph "Architecture Principle"
AP1["ALL cache writes must<br/>be async/fire-and-forget"]
AP2["Cache is an optimization,<br/>not a dependency"]
AP3["Code review checklist:<br/>no await on cache SET<br/>in request hot path"]
end
subgraph "Capacity Planning"
CP1["Redis memory alarm<br/>at 75% (not 90%)"]
CP2["Auto-scale Redis based<br/>on memory utilization"]
CP3["Project cache growth:<br/>new_queries/day x avg_size<br/>x TTL = memory needed"]
end
subgraph "Monitoring"
MO1["Track cache write latency<br/>as a separate metric"]
MO2["Alert if cache write p95<br/>> 5ms (early warning)"]
MO3["Dashboard: cache memory<br/>utilization trend over 30 days"]
end
AP1 --> AP2 --> AP3
CP1 --> CP2 --> CP3
MO1 --> MO2 --> MO3
style AP1 fill:#2ecc71,color:#000
style CP1 fill:#3498db,color:#fff
style MO1 fill:#9b59b6,color:#fff
Prevention checklist:
- [ ] Enforce async cache writes in code review (linting rule for await redis.set in request handlers)
- [ ] Redis memory alarm at 75% — auto-scale or reduce TTL
- [ ] Cache write latency metric with p95 > 5ms alarm
- [ ] Monthly capacity planning review: cache growth rate vs. available memory
- [ ] Load test cache behavior at 90% memory utilization
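The 75% memory alarm from the checklist above can be defined in code as well as in the console. A hedged sketch using the ElastiCache DatabaseMemoryUsagePercentage metric follows; the alarm name, cluster ID, and SNS topic are placeholders, not the production values:

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")

cloudwatch.put_metric_alarm(
    AlarmName="manga-assist-redis-memory-75pct",           # placeholder alarm name
    Namespace="AWS/ElastiCache",
    MetricName="DatabaseMemoryUsagePercentage",
    Dimensions=[{"Name": "CacheClusterId", "Value": "manga-assist-cache-001"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=5,                                    # 5 consecutive minutes above threshold
    Threshold=75.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:ap-northeast-1:111122223333:manga-assist-alerts"],  # placeholder topic
)
```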
Decision Tree — Latency Spike Troubleshooting
Use this decision tree when MangaAssist p95 latency exceeds the 2-second SLO.
graph TB
START["p95 Latency > 2000ms"]
Q1{"X-Ray: Which segment<br/>is slowest?"}
Q1 -->|"bedrock-invoke"| BQ1{"Bedrock throttling?<br/>ThrottlingException > 0?"}
BQ1 -->|"Yes"| BR1["Scale down concurrency<br/>or request quota increase"]
BQ1 -->|"No"| BQ2{"Network transit<br/>> 50ms?"}
BQ2 -->|"Yes"| BR2["Check VPC endpoint<br/>+ DNS resolution<br/>(Scenario 1)"]
BQ2 -->|"No"| BQ3{"Large prompt<br/>> 4000 tokens?"}
BQ3 -->|"Yes"| BR3["Optimize prompt<br/>(Skill 4.1.1)"]
BQ3 -->|"No"| BR4["Check Bedrock<br/>service health dashboard"]
Q1 -->|"vector-search"| OQ1{"Query has filters?"}
OQ1 -->|"No"| OR1["Add pre-filters<br/>(category, language, stock)"]
OQ1 -->|"Yes"| OQ2{"Returning all fields?"}
OQ2 -->|"Yes"| OR2["Add _source selection<br/>(return only 6 fields)"]
OQ2 -->|"No"| OR3["Check OpenSearch OCU<br/>scaling + cold start"]
Q1 -->|"cache-write/session-update"| CQ1{"Write is synchronous?"}
CQ1 -->|"Yes"| CR1["Make async<br/>(Scenario 5)"]
CQ1 -->|"No"| CQ2{"Redis memory > 80%?"}
CQ2 -->|"Yes"| CR2["Scale Redis<br/>or reduce TTL"]
CQ2 -->|"No"| CR3["Check network path<br/>to Redis endpoint"]
Q1 -->|"guardrails-check"| GQ1{"Evaluation time<br/>> 100ms?"}
GQ1 -->|"Yes"| GR1["Move to async<br/>(Scenario 4)"]
GQ1 -->|"No"| GR2["Check filter count<br/>and response size"]
Q1 -->|"All segments normal<br/>but total is high"| AQ1{"Connection pool<br/>utilization > 80%?"}
AQ1 -->|"Yes"| AR1["Pool exhaustion<br/>(Scenario 2)"]
AQ1 -->|"No"| AQ2{"DNS resolution<br/>> 10ms?"}
AQ2 -->|"Yes"| AR2["DNS issue<br/>(Scenario 1)"]
AQ2 -->|"No"| AR3["Check for new<br/>inter-service hops<br/>or cold ECS tasks"]
style START fill:#e74c3c,color:#fff
style BR1 fill:#2ecc71,color:#000
style BR2 fill:#2ecc71,color:#000
style BR3 fill:#2ecc71,color:#000
style BR4 fill:#f39c12,color:#000
style OR1 fill:#2ecc71,color:#000
style OR2 fill:#2ecc71,color:#000
style OR3 fill:#f39c12,color:#000
style CR1 fill:#2ecc71,color:#000
style CR2 fill:#2ecc71,color:#000
style CR3 fill:#f39c12,color:#000
style GR1 fill:#2ecc71,color:#000
style GR2 fill:#f39c12,color:#000
style AR1 fill:#2ecc71,color:#000
style AR2 fill:#2ecc71,color:#000
style AR3 fill:#f39c12,color:#000
Cross-Scenario Patterns
| Pattern | Scenarios | Lesson |
|---|---|---|
| Async by default | 4, 5 | Non-critical operations (cache writes, Guardrails, analytics) must never block response delivery |
| VPC endpoint verification | 1, 3 | Every infra change needs automated VPC endpoint health validation |
| Per-segment monitoring | 1, 3, 4, 5 | Aggregate latency metrics hide segment-level regressions; always monitor per-segment |
| Capacity math matters | 2 | Connection pool sizing requires peak traffic modeling, not average |
| Gradual degradation is hard to detect | 5 | Weekly latency trend reviews catch slow drifts that alarms miss |
| X-Ray is the single source of truth | All | Every troubleshooting path starts with X-Ray trace analysis |
Key Takeaways
- Every scenario started with X-Ray traces — investing in comprehensive request profiling pays for itself the first time you debug a latency issue. Without per-segment timing, you are guessing.
- Network issues are the most common and most subtle — DNS resolution, VPC endpoint routing, NAT gateway misconfigurations. These are invisible in application logs but clearly visible in network timing metrics.
- Synchronous operations accumulate silently — a 3ms cache write seems harmless, but when Redis slows to 18ms under memory pressure it quietly consumes almost 1% of the 2-second latency budget. Make everything non-critical async.
- Capacity planning for peak, not average — MangaAssist's connection pool was sized for average load, and the flash sale exposed this immediately. Always size for 2x the expected peak, with pre-scaling for known events.
- Infrastructure changes need performance validation gates — the VPC endpoint migration introduced a regression because there was no automated latency check after the Terraform apply. Every infra change should run the VPCEndpointHealthChecker and a per-segment latency comparison.