Scenarios and Runbooks — Responsive AI Systems
AWS AIP-C01 Task 4.2 — Skill 4.2.1: Operational scenarios for responsive AI system failures and their resolution
Context: MangaAssist e-commerce chatbot — Bedrock Claude 3 Sonnet/Haiku, OpenSearch Serverless, DynamoDB, ECS Fargate, API Gateway WebSocket, ElastiCache Redis
Format: Each scenario follows Problem → Detection → Root Cause → Resolution → Prevention, with decision trees
Skill Mapping
| Certification | Domain | Task | Skill |
|---|---|---|---|
| AWS AIP-C01 | Domain 4 — Operational Efficiency | Task 4.2 — Optimize FM application performance | Skill 4.2.1 — Identify techniques to create responsive AI systems (e.g., pre-computation, latency-optimized model selection, parallel request processing, response streaming, performance benchmarking) |
File scope: Five production scenarios that test responsiveness engineering — covering first-token latency spikes, parallel request timeouts, streaming failures, stale pre-computation, and latency regressions. Each includes a decision tree runbook for on-call engineers.
Scenario Index
| # | Scenario | Primary Technique Tested | Severity |
|---|---|---|---|
| 1 | First-token latency exceeds 2s during peak evening hours | Streaming + Model Selection + Connection Pooling | SEV-2 |
| 2 | Parallel request timeout: RAG hangs while DynamoDB returns | Parallel Orchestration + Timeout Strategy | SEV-2 |
| 3 | WebSocket connection drops during streaming response | Streaming + Connection Management | SEV-3 |
| 4 | Pre-computation generates stale recommendations for out-of-stock manga | Pre-Computation + Cache Invalidation | SEV-3 |
| 5 | Latency regression after prompt template update | Benchmarking + Regression Detection | SEV-2 |
Scenario 1 — First-Token Latency Exceeds 2 Seconds During Peak Evening Hours
Problem Statement
MangaAssist's CloudWatch alarm fires at 20:15 JST on a Tuesday evening. The first_token_latency_ms p95 metric has crossed the 2000ms threshold for the past 10 minutes. Users in Japan are experiencing a visible delay — the typing indicator shows for 2+ seconds before any text appears. Customer satisfaction scores for the 20:00-21:00 hour are dropping. The target for first-token latency is < 400ms for Haiku queries and < 1100ms for Sonnet queries.
Detection
```mermaid
flowchart TD
    ALARM[CloudWatch Alarm<br/>first_token_p95 > 2000ms<br/>for 10 min] --> DASH[Check Grafana Dashboard]
    DASH --> METRICS{Which intents<br/>are affected?}
    METRICS -->|All intents| INFRA[Infrastructure issue<br/>→ Go to Root Cause A]
    METRICS -->|Only Sonnet intents| MODEL[Model-specific issue<br/>→ Go to Root Cause B]
    METRICS -->|Only specific intents| ROUTING[Routing issue<br/>→ Go to Root Cause C]
```
Key metrics to check immediately:
| Metric | Location | What to Look For |
|---|---|---|
| `first_token_latency_ms` by model | CloudWatch `MangaAssist/Streaming` | Which model is slow? |
| `parallel_rag_latency_ms` | CloudWatch `MangaAssist/Orchestrator` | Is RAG retrieval slow? |
| Bedrock `InvocationLatency` | CloudWatch `AWS/Bedrock` | Is Bedrock itself slow? |
| ECS CPU/Memory | CloudWatch `AWS/ECS` | Is the orchestrator overloaded? |
| API Gateway `IntegrationLatency` | CloudWatch `AWS/ApiGateway` | Is the gateway slow? |
| Active WebSocket connections | CloudWatch `AWS/ApiGateway` | Traffic spike? |
Root Cause Analysis
Root Cause A — Bedrock Cold Starts Under Load
Diagnosis: Bedrock InvocationLatency p95 spikes from 350ms to 1800ms. The connection pool warm-up pings have been failing silently because the keep-alive Lambda was throttled during the traffic spike. Connections went cold and each new streaming invocation pays the full TLS + auth + cold-model penalty.
```mermaid
flowchart TD
    CHECK_BEDROCK{Bedrock InvocationLatency<br/>p95 > 1500ms?}
    CHECK_BEDROCK -->|Yes| CHECK_POOL{Connection pool<br/>health pings<br/>succeeding?}
    CHECK_POOL -->|No - pings failing| ROOT_A[Root Cause A confirmed:<br/>Cold connections to Bedrock]
    CHECK_POOL -->|Yes - pings OK| CHECK_THROTTLE{Bedrock<br/>ThrottlingException<br/>count > 0?}
    CHECK_THROTTLE -->|Yes| ROOT_A2[Bedrock throttling —<br/>request quota exceeded]
    CHECK_THROTTLE -->|No| ROOT_A3[Bedrock service degradation —<br/>check AWS Health Dashboard]
    CHECK_BEDROCK -->|No| CHECK_ECS[Check ECS metrics →]
```
Resolution:
- Immediate (5 min): Manually trigger connection warm-up across all ECS tasks:

  ```bash
  # Force warm-up on all running tasks
  aws ecs list-tasks --cluster mangaassist-prod --service-name orchestrator \
    | jq -r '.taskArns[]' \
    | xargs -I {} aws ecs execute-command --cluster mangaassist-prod \
        --task {} --command "/app/scripts/warm_connections.sh" --interactive
  ```

- Short-term (15 min): Increase the connection pool size from 5 to 10 per ECS task to handle burst load:

  ```
  # Update ECS task definition environment variables
  # BEDROCK_POOL_SIZE: "5" → "10"
  # BEDROCK_KEEPALIVE_INTERVAL: "30" → "15"
  ```

- Medium-term (1 hour): Scale out ECS tasks to distribute load and reduce per-task connection contention:

  ```bash
  aws ecs update-service --cluster mangaassist-prod \
    --service orchestrator --desired-count 8   # was 4
  ```
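The warm-up and keep-alive behavior these steps restore can be sketched as a small pool abstraction. This is a minimal illustration under stated assumptions, not the production orchestrator's code: `WarmPool`, its `connect` factory, and the `ping()` health check are hypothetical names.

```python
import asyncio
import time

class WarmPool:
    """Keep-alive connection pool sketch (hypothetical; the real service
    is configured via BEDROCK_POOL_SIZE / BEDROCK_KEEPALIVE_INTERVAL and
    a warm_connections.sh script, none of which is shown here)."""

    def __init__(self, connect, size=5, keepalive_interval=30.0):
        self._connect = connect              # async factory for one connection
        self._size = size
        self._interval = keepalive_interval  # seconds between health pings
        self._pool = []
        self.last_ping = None

    async def warm(self):
        # Open every connection up front so the first user request skips
        # the TLS + auth + cold-model penalty described in Root Cause A.
        self._pool = list(await asyncio.gather(
            *[self._connect() for _ in range(self._size)]
        ))

    async def keepalive_once(self):
        # One ping cycle; production would run this on a timer and alarm
        # when the ping failure rate exceeds 5% (see Prevention table).
        await asyncio.gather(*[conn.ping() for conn in self._pool])
        self.last_ping = time.monotonic()
```

Running `warm()` at ECS task start (as the Prevention table suggests) means the pool is hot before the health check passes, so no user request ever pays the cold-connection cost.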
Root Cause B — Sonnet Model Capacity Saturation
Diagnosis: Only Sonnet-routed intents (recommendations, comparisons) show elevated latency. Haiku intents remain within target. Bedrock ThrottlingException count is elevated. The peak evening traffic has exceeded the provisioned Sonnet throughput.
Resolution:
- Immediate: Temporarily route medium-complexity queries to Haiku to reduce Sonnet pressure:

  ```
  # Adjust model router threshold
  # complexity_threshold_for_sonnet: 0.3 → 0.6
  # This routes ~40% of current Sonnet traffic to Haiku
  ```

- Short-term: Request a Bedrock quota increase for the Sonnet model in ap-northeast-1.
- Medium-term: Implement adaptive model routing that automatically shifts traffic to Haiku when Sonnet latency exceeds a threshold.
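The adaptive routing idea can be sketched as a router that tracks recent Sonnet latencies and sheds traffic to Haiku when the observed p95 crosses the 1500ms limit from the Prevention table. The class and its wiring are illustrative assumptions, not the production router; the Bedrock model IDs are the standard Claude 3 identifiers.

```python
from collections import deque

SONNET = "anthropic.claude-3-sonnet-20240229-v1:0"
HAIKU = "anthropic.claude-3-haiku-20240307-v1:0"

class AdaptiveRouter:
    """Latency-feedback model selection sketch (hypothetical class)."""

    def __init__(self, sonnet_p95_limit_ms=1500, window=100):
        self.limit = sonnet_p95_limit_ms
        self.samples = deque(maxlen=window)  # recent Sonnet latencies (ms)

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def _p95(self):
        if not self.samples:
            return 0.0
        ordered = sorted(self.samples)
        return ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]

    def choose(self, complexity):
        # Complex queries normally go to Sonnet; when Sonnet's observed
        # p95 exceeds the limit, shed that traffic to Haiku instead.
        if complexity >= 0.6 and self._p95() <= self.limit:
            return SONNET
        return HAIKU
```

The complexity cutoff of 0.6 mirrors the adjusted `complexity_threshold_for_sonnet` from the immediate fix; a sliding window keeps the router responsive to recovery as well as degradation.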
Root Cause C — Misclassified Intents Hitting Wrong Model
Diagnosis: The intent classifier is routing simple FAQ queries to Sonnet instead of Haiku. A recent classifier model update introduced a regression in confidence thresholds.
Resolution:
- Immediate: Roll back intent classifier to previous version.
- Short-term: Add classifier output validation — if a "faq" or "greeting" intent gets routed to Sonnet, flag it as a misclassification.
- Medium-term: Add automated classifier accuracy benchmarks to the deployment pipeline.
Prevention
| Prevention Measure | Owner | Implementation |
|---|---|---|
| Connection pool health monitoring with alerting | SRE | CloudWatch alarm on pool health ping failure rate > 5% |
| Provisioned throughput for Sonnet during peak hours | Platform | Bedrock provisioned throughput scheduled 19:00-23:00 JST |
| Adaptive model routing with latency feedback | ML Eng | Route to Haiku when Sonnet p95 > 1500ms |
| Load testing at 2x peak traffic monthly | QA | Automated load test via EventBridge + ECS benchmark task |
| Connection pool warm-up on ECS task start | Platform | Add warm-up to ECS task health check |
Scenario 2 — Parallel Request Timeout: RAG Retrieval Hangs While DynamoDB Returns Instantly
Problem Statement
MangaAssist users querying manga recommendations see a 5-second timeout error. CloudWatch logs show that the ParallelOrchestrator is hitting the overall timeout because OpenSearch RAG retrieval is hanging at 4.8 seconds, even though DynamoDB session load and user profile fetch complete in < 200ms. The asyncio.gather call is waiting for the slowest task (RAG) to complete before proceeding to Bedrock invocation. Before the overall timeout fires, the parallel execution phase balloons from the expected 650ms to more than 5000ms.
Detection
```mermaid
flowchart TD
    ALARM[CloudWatch Alarm<br/>parallel_rag_latency_ms p95 > 3000ms] --> LOGS[Check orchestrator logs]
    LOGS --> PATTERN{Which parallel<br/>task is slow?}
    PATTERN -->|RAG only| RAG_ISSUE[OpenSearch issue<br/>→ Go to Root Cause A]
    PATTERN -->|Multiple tasks| NETWORK[Network / ECS issue<br/>→ Go to Root Cause B]
    PATTERN -->|DynamoDB slow too| INFRA[General infra issue<br/>→ Check ECS + VPC]
```
Key log pattern to search for:
```
WARN | Parallel task 'rag' failed | error=Task 'rag' exceeded 2.0s timeout
INFO | Parallel execution complete | timings={"rag": 2.001, "session": 0.18, "profile": 0.14, "cache": 0.04}
```
Root Cause Analysis
Root Cause A — OpenSearch Serverless OCU Throttling
Diagnosis: OpenSearch Serverless has auto-scaled down to minimum OCUs (2) during the low-traffic afternoon, and the evening traffic spike arrives before OCUs scale back up. KNN vector search queries queue behind each other, causing latency to climb from 600ms to 4800ms.
```mermaid
flowchart TD
    CHECK_OS{OpenSearch<br/>SearchLatency p95<br/>> 2000ms?}
    CHECK_OS -->|Yes| CHECK_OCU{OpenSearch OCU<br/>utilization > 80%?}
    CHECK_OCU -->|Yes| ROOT_A[Root Cause A confirmed:<br/>OCU capacity insufficient<br/>for traffic spike]
    CHECK_OCU -->|No| CHECK_INDEX{Index health<br/>status?}
    CHECK_INDEX -->|Yellow/Red| ROOT_A2[Index shard issue —<br/>rebalancing needed]
    CHECK_INDEX -->|Green| ROOT_A3[Query complexity issue —<br/>check KNN parameters]
    CHECK_OS -->|No| CHECK_NET[Check VPC / ENI →]
```
Resolution:
- Immediate (5 min): The `ParallelOrchestrator` should already be returning partial results (session + profile) without RAG context. Verify the degraded-mode path is working:

  ```
  # Verify graceful degradation in logs:
  # "Proceeding without RAG context due to timeout"
  # Bedrock still generates a response using session history only
  ```

- Short-term (30 min): Increase OpenSearch Serverless minimum OCUs:

  ```bash
  aws opensearchserverless update-collection \
    --id manga-products \
    --description "Increase min OCUs for peak"
  # Update capacity policy: min search OCUs 2 → 6
  ```

- Medium-term: Implement pre-warming for OpenSearch — run a set of representative queries at 18:00 JST (before peak) to trigger OCU scale-up.
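The degraded-mode path above depends on per-task deadlines inside the fan-out, so a slow task yields a partial result instead of stalling the whole `asyncio.gather`. A minimal sketch of that pattern follows; the helper names and timeout budgets are illustrative, and the real `ParallelOrchestrator._with_timeout()` is only referenced by name in this runbook.

```python
import asyncio

async def with_timeout(name, coro, timeout_s):
    """Run one parallel task with its own deadline; a timeout yields None
    instead of blocking the other tasks in the gather."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        print(f"WARN | Parallel task '{name}' exceeded {timeout_s}s timeout")
        return None

async def fan_out(rag, session, profile, rag_budget=2.0, aux_budget=1.0):
    # Each dependency gets its own budget; a hanging RAG call degrades to
    # "no retrieved context" rather than blocking the Bedrock invocation.
    rag_ctx, sess, prof = await asyncio.gather(
        with_timeout("rag", rag(), rag_budget),
        with_timeout("session", session(), aux_budget),
        with_timeout("profile", profile(), aux_budget),
    )
    return {"rag": rag_ctx, "session": sess, "profile": prof}
```

The caller then checks which values are `None` and builds the Bedrock prompt from whatever context survived, matching the "Proceeding without RAG context due to timeout" log line.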
Root Cause B — VPC DNS Resolution Failure
Diagnosis: The ECS tasks in private subnets are intermittently failing to resolve the OpenSearch Serverless endpoint. DNS queries to the VPC DNS resolver are timing out, causing the HTTP connection to OpenSearch to hang until the TCP timeout.
Resolution:
- Immediate: Restart affected ECS tasks to refresh DNS cache.
- Short-term: Add DNS caching at the application level with a 60-second TTL.
- Medium-term: Configure Route 53 Resolver with DNS firewall rules and monitoring.
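The application-level DNS cache from the short-term fix can be sketched as a TTL-stamped lookup table. The class and its defaults are illustrative assumptions; `socket.gethostbyname` stands in for whatever resolver the HTTP client actually uses.

```python
import socket
import time

class DnsCache:
    """60-second application-level DNS cache sketch (hypothetical class;
    the TTL value comes from the runbook's short-term fix)."""

    def __init__(self, ttl_s=60.0, resolver=socket.gethostbyname):
        self._ttl = ttl_s
        self._resolver = resolver
        self._entries = {}  # host -> (address, resolved_at)

    def resolve(self, host):
        hit = self._entries.get(host)
        if hit and time.monotonic() - hit[1] < self._ttl:
            return hit[0]            # fresh cached answer, no resolver hit
        addr = self._resolver(host)  # fall back to the real resolver
        self._entries[host] = (addr, time.monotonic())
        return addr
```

With this in place, an intermittently failing VPC resolver only affects at most one request per host per TTL window instead of every OpenSearch call.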
Decision Tree — On-Call Runbook
```mermaid
flowchart TD
    START[RAG timeout alert fires] --> CHECK1{Other parallel tasks<br/>also slow?}
    CHECK1 -->|No, only RAG| CHECK2{OpenSearch<br/>SearchLatency > 2s?}
    CHECK1 -->|Yes, multiple| CHECK_ECS{ECS task<br/>CPU > 90%?}
    CHECK2 -->|Yes| CHECK3{OCU utilization<br/>> 80%?}
    CHECK2 -->|No| CHECK_DNS{DNS resolution<br/>errors in logs?}
    CHECK3 -->|Yes| ACTION1[Increase min OCUs<br/>+ trigger pre-warm queries]
    CHECK3 -->|No| CHECK4{KNN query<br/>ef_search changed?}
    CHECK4 -->|Yes| ACTION2[Revert ef_search<br/>to previous value]
    CHECK4 -->|No| ACTION3[Check index health<br/>+ shard distribution]
    CHECK_DNS -->|Yes| ACTION4[Restart ECS tasks<br/>+ add DNS cache]
    CHECK_DNS -->|No| ACTION5[Check VPC endpoints<br/>+ security groups]
    CHECK_ECS -->|Yes| ACTION6[Scale out ECS<br/>+ check for memory leak]
    CHECK_ECS -->|No| ACTION7[Check VPC networking<br/>+ NAT gateway throughput]
    style ACTION1 fill:#2ecc71,color:#000
    style ACTION2 fill:#2ecc71,color:#000
    style ACTION4 fill:#2ecc71,color:#000
    style ACTION6 fill:#f39c12,color:#000
```
Prevention
| Prevention Measure | Owner | Implementation |
|---|---|---|
| OpenSearch pre-warming queries at 18:00 JST | Platform | EventBridge → Lambda runs 20 representative KNN queries |
| Per-task timeout (not just overall timeout) | Backend | Already implemented in ParallelOrchestrator._with_timeout() |
| Graceful degradation without RAG | Backend | Generate response with session context only when RAG times out |
| OpenSearch OCU minimum set to peak-hour baseline | Platform | Capacity policy: min 6 OCUs during 18:00-24:00 JST |
| DNS resolution monitoring | SRE | CloudWatch alarm on Route 53 Resolver NXDOMAIN/SERVFAIL rate |
Scenario 3 — WebSocket Connection Drops During Streaming Response
Problem Statement
Multiple MangaAssist users report seeing partial responses — the chatbot starts answering a manga recommendation query, streams 2-3 sentences, then the response abruptly stops. The client shows "Connection lost. Reconnecting..." and when the connection re-establishes, the partial response is gone. The issue affects ~5% of streaming responses during a 30-minute window on a Saturday evening.
Detection
```mermaid
flowchart TD
    ALARM[CloudWatch Alarm<br/>stream_cancellations > 50/hr] --> CHECK_CLIENT{Client-side error<br/>reports available?}
    CHECK_CLIENT -->|Yes| ANALYZE[Analyze error patterns]
    CHECK_CLIENT -->|No| SERVER_LOGS[Check ECS + API GW logs]
    ANALYZE --> PATTERN{Error pattern?}
    PATTERN -->|WebSocket close code 1006| ABNORMAL[Abnormal closure<br/>→ Network or gateway issue]
    PATTERN -->|WebSocket close code 1001| GOING_AWAY[Going away<br/>→ Server-side shutdown]
    PATTERN -->|No close frame received| TIMEOUT[Connection timeout<br/>→ Idle timeout or NAT issue]
```
Key metrics to check:
| Metric | Location | Normal | Current |
|---|---|---|---|
| `stream_cancellations` | CloudWatch `MangaAssist/Streaming` | < 10/hr | 67/hr |
| API Gateway `ClientError` (4xx) | CloudWatch `AWS/ApiGateway` | < 5/hr | 42/hr |
| API Gateway `MessageCount` | CloudWatch `AWS/ApiGateway` | ~50K/hr | 48K/hr (normal) |
| ECS task restarts | CloudWatch `AWS/ECS` | 0 | 0 |
| `GoneException` count in logs | CloudWatch Logs | < 5/hr | 55/hr |
Root Cause Analysis
Root Cause A — API Gateway WebSocket Idle Timeout
Diagnosis: The Bedrock streaming response for complex recommendation queries takes 2.5-3 seconds. During this time, the chunk batching (batch size = 3) means the first WebSocket frame is not sent until ~3 chunks accumulate (~900ms after first token). If the total time between the initial status message and the first chunk frame exceeds the API Gateway idle connection timeout (10 minutes by default — but the client-side proxy has a 30-second idle timeout), the connection is deemed idle.
The actual issue: a corporate proxy between some JP users and CloudFront has a 15-second idle timeout. The parallel fan-out phase (650ms) + Bedrock first-token wait (350ms) = 1000ms with no WebSocket frame sent. This is within normal bounds, but the corporate proxy is aggressively closing connections it considers idle.
```mermaid
flowchart TD
    CHECK_PATTERN{Affected users<br/>share network<br/>characteristics?}
    CHECK_PATTERN -->|Yes - same ISP/corp| PROXY[Corporate proxy<br/>idle timeout]
    CHECK_PATTERN -->|No - random| CHECK_GW{API Gateway<br/>Connection Duration<br/>metric available?}
    CHECK_GW -->|Connections dying at<br/>consistent duration| TIMEOUT_CFG[Timeout configuration<br/>issue]
    CHECK_GW -->|Random durations| CHECK_ECS_HEALTH{ECS tasks<br/>healthy?}
    CHECK_ECS_HEALTH -->|Yes| NETWORK[Intermittent network<br/>issue — check VPC NAT]
    CHECK_ECS_HEALTH -->|No — restarts| DEPLOYMENT[Rolling deployment<br/>draining connections]
```
Resolution:
- Immediate (10 min): Reduce chunk batch size from 3 to 1 so the first chunk is sent immediately after the first token:

  ```
  # Environment variable update on ECS service
  # CHUNK_BATCH_SIZE: "3" → "1"
  # Trade-off: more WebSocket frames, but no idle gaps
  ```

- Short-term (1 hour): Add periodic heartbeat frames during the parallel fan-out phase:

  ```python
  async def stream_with_heartbeat(self, connection_id, ...):
      """Send heartbeat pings during long processing phases."""
      heartbeat_task = asyncio.create_task(
          self._heartbeat_loop(connection_id, interval=5.0)
      )
      try:
          result = await self.stream_query(...)
      finally:
          heartbeat_task.cancel()

  async def _heartbeat_loop(self, connection_id, interval):
      """Send invisible heartbeat to keep connection alive."""
      while True:
          await asyncio.sleep(interval)
          await self.ws_send(connection_id, {
              "type": "heartbeat",
              "timestamp": time.time(),
          })
  ```

- Medium-term: Implement client-side reconnection with response resumption:

  ```
  Client sends: {"action":"resume","session_id":"abc","last_chunk_idx":7}
  Server sends: Remaining chunks from idx 8 onward (buffered in Redis)
  ```
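The server side of this resume protocol can be sketched with a per-session chunk buffer. Here an in-memory dict stands in for the ElastiCache Redis buffer described in Prevention (last 60s of chunks, TTL 120s); the class and method names are illustrative.

```python
class ChunkBuffer:
    """Per-session chunk buffer sketch for reconnect-with-resume
    (hypothetical; production would back this with Redis and a TTL)."""

    def __init__(self):
        self._store = {}  # session_id -> list of chunks in send order

    def append(self, session_id, chunk):
        # Buffer each chunk as it is streamed; return the index the
        # client should acknowledge as last_chunk_idx.
        chunks = self._store.setdefault(session_id, [])
        chunks.append(chunk)
        return len(chunks) - 1

    def resume_from(self, session_id, last_chunk_idx):
        # Client reconnects with {"action":"resume","last_chunk_idx":N};
        # replay everything after that index.
        return self._store.get(session_id, [])[last_chunk_idx + 1:]
```

Because chunks are indexed in send order, a client that saw chunks 0-7 before the drop asks for `last_chunk_idx: 7` and receives exactly the tail it missed, so no partial response is lost.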
Prevention
| Prevention Measure | Owner | Implementation |
|---|---|---|
| WebSocket heartbeat every 5 seconds during processing | Backend | Heartbeat task runs parallel to stream processing |
| Reduce chunk batch size to 1 for first 3 chunks | Backend | Prioritize first-frame delivery, then batch normally |
| Client-side reconnection with resume | Frontend | Store partial response + last chunk index, request resume |
| Server-side response buffering in Redis | Backend | Buffer last 60s of chunks per connection_id, TTL 120s |
| Monitor WebSocket close code distribution | SRE | CloudWatch alarm on close code 1006 rate > 1% |
Scenario 4 — Pre-Computation Generates Stale Recommendations for Out-of-Stock Manga
Problem Statement
A customer asks "What shonen manga should I read?" and MangaAssist responds with a recommendation list that includes "Demon Slayer Volume 23 — Limited Edition" as the top recommendation. The customer clicks through to buy it, but the product page shows "Out of Stock." The recommendation was pre-computed 6 hours ago by the nightly batch job, and the limited edition sold out 2 hours ago during a flash sale. The pre-computed response in Redis has a 24-hour TTL and does not reflect the inventory change.
Detection
```mermaid
flowchart TD
    SIGNAL1[Customer complaint:<br/>"Recommended manga is out of stock"] --> INVESTIGATE
    SIGNAL2[CloudWatch: cart_abandonment_rate<br/>spike after recommendation click] --> INVESTIGATE
    SIGNAL3[Business metric: recommendation_to_purchase<br/>conversion rate dropped 40%] --> INVESTIGATE
    INVESTIGATE --> CHECK{Is the recommended<br/>product in stock?}
    CHECK -->|No| STALE[Stale pre-computation<br/>→ Investigate cache freshness]
    CHECK -->|Yes - different issue| OTHER[Product page issue<br/>→ Check catalog service]
```
Key data points:
| Data Point | Expected | Actual |
|---|---|---|
| Pre-computation last run | < 24 hours ago | 6 hours ago (on schedule) |
| Redis cache TTL for genre recommendations | 24 hours | 24 hours (as configured) |
| Demon Slayer Vol 23 LE stock status | In stock (at pre-compute time) | Out of stock (sold out 2 hours ago) |
| DynamoDB inventory update timestamp | N/A | 4 hours ago (flash sale depleted stock) |
| Cache invalidation trigger for inventory changes | Should exist | Does not exist |
Root Cause Analysis
```mermaid
flowchart TD
    ROOT[Pre-computed recommendation<br/>is stale] --> WHY1{Does inventory change<br/>trigger cache invalidation?}
    WHY1 -->|No| MISSING[Missing invalidation path:<br/>DynamoDB Stream → Lambda → Redis]
    WHY1 -->|Yes but failed| CHECK_STREAM{DynamoDB Stream<br/>events flowing?}
    MISSING --> WHY2{Why was this not<br/>built?}
    WHY2 --> GAP[Pre-computation pipeline was designed<br/>for catalog data, not inventory state.<br/>Inventory was assumed stable between runs.]
    CHECK_STREAM -->|Events present| CHECK_LAMBDA{Invalidation Lambda<br/>errors?}
    CHECK_STREAM -->|No events| STREAM_DISABLED[DynamoDB Stream<br/>not enabled on inventory table]
    CHECK_LAMBDA -->|Errors| LAMBDA_BUG[Lambda bug —<br/>check error logs]
    CHECK_LAMBDA -->|No errors| CACHE_KEY[Cache key mismatch —<br/>Lambda invalidating wrong key]
```
Root cause confirmed: The pre-computation pipeline invalidates cached responses when catalog data changes (new titles, updated descriptions), but it does not invalidate when inventory status changes. The pipeline was designed assuming that stock levels change slowly, but flash sales create rapid inventory depletion that the 24-hour TTL cannot accommodate.
Resolution:
- Immediate (15 min): Manually invalidate the stale genre recommendation caches:

  ```bash
  # Connect to ElastiCache Redis and delete stale keys
  redis-cli -h mangaassist-cache.xxxxx.ng.0001.apne1.cache.amazonaws.com
  > KEYS "precompute:genre:shonen:*"
  > DEL "precompute:genre:shonen:recommendations"
  > DEL "precompute:genre:shonen:top_picks"
  ```

- Short-term (2 hours): Add inventory-aware validation at response time — before serving a pre-computed recommendation, check stock status for included products:

  ```python
  async def validate_precomputed_response(
      self, cached_response: dict, product_ids: list[str]
  ) -> dict:
      """
      Validate pre-computed recommendations against live inventory.
      Replace out-of-stock items with alternatives.
      """
      # Batch check inventory status (single DynamoDB BatchGetItem)
      stock_status = await self._batch_check_inventory(product_ids)
      out_of_stock = [
          pid for pid, status in stock_status.items()
          if status["quantity"] == 0
      ]
      if not out_of_stock:
          return cached_response  # All items in stock, serve as-is

      # Filter out unavailable items and note the gap
      filtered = {
          "recommendations": [
              rec for rec in cached_response["recommendations"]
              if rec["product_id"] not in out_of_stock
          ],
          "stale_items_removed": len(out_of_stock),
      }
      # If too many items were removed, fall through to live generation
      if len(filtered["recommendations"]) < 2:
          return None  # Trigger fresh LLM generation
      return filtered
  ```

- Medium-term (1 week): Add DynamoDB Streams-based cache invalidation for inventory changes:

  ```mermaid
  graph LR
      DDB[DynamoDB<br/>manga-inventory] -->|Stream| LAMBDA[Lambda:<br/>Inventory Invalidator]
      LAMBDA -->|Check| REDIS_KEYS[Find cached responses<br/>containing this product]
      REDIS_KEYS -->|Invalidate| REDIS[ElastiCache Redis<br/>Delete affected keys]
      LAMBDA -->|Log| CW[CloudWatch<br/>invalidation_count metric]
  ```
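The Streams-to-Redis pipeline can be reduced to a small handler. This is an illustrative sketch, not the production Lambda: the record shape follows the standard DynamoDB Streams event format, but the cache key naming scheme and the `delete_keys` callable (standing in for a Redis `DEL`) are assumptions.

```python
def handle_inventory_stream(event, delete_keys):
    """Invalidate pre-computed recommendation caches when a product's
    stock hits zero. `delete_keys` abstracts the Redis DEL call so the
    logic stays testable without a live cache."""
    invalidated = 0
    for record in event.get("Records", []):
        if record.get("eventName") != "MODIFY":
            continue  # only stock updates matter here
        new_image = record["dynamodb"].get("NewImage", {})
        if new_image.get("quantity", {}).get("N") != "0":
            continue  # still in stock, nothing to invalidate
        genre = new_image.get("genre", {}).get("S", "unknown")
        # Delete every pre-computed response that can include this product
        # (assumed key scheme, matching the manual redis-cli cleanup above).
        delete_keys([
            f"precompute:genre:{genre}:recommendations",
            f"precompute:genre:{genre}:top_picks",
        ])
        invalidated += 1
    return invalidated
```

Emitting `invalidated` as the CloudWatch `invalidation_count` metric (per the diagram) makes a silent failure of this path observable, which is exactly the gap that caused this scenario.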
Prevention
| Prevention Measure | Owner | Implementation |
|---|---|---|
| Inventory-aware validation on pre-computed responses | Backend | BatchGetItem stock check before serving cached recommendations |
| DynamoDB Streams → Lambda → Redis invalidation | Platform | Stream from inventory table triggers selective cache invalidation |
| Reduce TTL for recommendation caches to 4 hours | Platform | Redis TTL 24h → 4h for genre/category recommendations |
| Flash sale mode: disable pre-computed recommendations | Business | Feature flag disables pre-compute cache during flash sale events |
| Monitor recommendation-to-purchase conversion rate | Analytics | CloudWatch alarm when conversion drops > 20% from 7-day average |
Scenario 5 — Latency Regression After Prompt Template Update
Problem Statement
On Wednesday at 14:00 JST, the ML engineering team deploys a new system prompt template for the "recommendation" intent. The updated prompt adds detailed instructions for multi-criteria scoring (art style, story complexity, target demographic) to improve recommendation quality. The new prompt is 340 tokens longer than the previous version (from 180 tokens to 520 tokens). By 15:00 JST, the automated benchmark framework detects a statistically significant latency regression: recommendation intent p95 has increased from 2400ms to 3200ms (33% increase), crossing the 3000ms SLA target.
Detection
```mermaid
flowchart TD
    BENCHMARK[Automated Benchmark Run<br/>Every 15 min] --> STATS{Mann-Whitney U Test<br/>p-value < 0.05?}
    STATS -->|p = 0.003| SIGNIFICANT[Statistically significant<br/>regression detected]
    SIGNIFICANT --> SEVERITY{Regression<br/>magnitude?}
    SEVERITY -->|33% increase<br/>> 25% threshold| SEV2[SEV-2 Alert<br/>PagerDuty notification]
    SEV2 --> CORRELATE[Correlate with<br/>recent changes]
    CORRELATE --> DEPLOY_LOG{Any deployments<br/>in last 2 hours?}
    DEPLOY_LOG -->|Yes: Prompt template<br/>update at 14:00| SUSPECT[Primary suspect:<br/>prompt template change]
```
Benchmark comparison data:
| Metric | Before (13:45 run) | After (15:00 run) | Change | Significance |
|---|---|---|---|---|
| Recommendation p50 | 1180ms | 1620ms | +37% | p < 0.001 |
| Recommendation p95 | 2410ms | 3210ms | +33% | p = 0.003 |
| Recommendation p99 | 2890ms | 3780ms | +31% | p = 0.008 |
| First-token p95 | 1050ms | 1380ms | +31% | p = 0.005 |
| Recommendation tokens/response | 285 avg | 410 avg | +44% | p < 0.001 |
| Other intents p95 | All within target | All within target | No change | N/A |
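The regression check itself can be sketched from scratch using the normal approximation to the Mann-Whitney U statistic, with midranks for ties. The function name and the one-sided alpha default are illustrative; the production benchmark framework is assumed, not shown.

```python
import math

def latency_regressed(baseline, candidate, alpha=0.05):
    """One-sided Mann-Whitney U test (normal approximation): returns True
    when candidate latencies are statistically significantly larger than
    baseline latencies."""
    n1, n2 = len(baseline), len(candidate)
    pooled = sorted([(v, 0) for v in baseline] + [(v, 1) for v in candidate])
    # Assign midranks so tied values share the average of their ranks.
    ranks = [0.0] * (n1 + n2)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        mid = (i + j + 1) / 2.0  # average of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = mid
        i = j
    # Rank sum and U statistic for the candidate sample.
    r2 = sum(r for r, (_, grp) in zip(ranks, pooled) if grp == 1)
    u2 = r2 - n2 * (n2 + 1) / 2.0
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u2 - mu) / sigma
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail p-value
    return p < alpha
```

A rank-based test like this is the right choice for latency samples because latency distributions are heavy-tailed; comparing means (or a raw p95 threshold) would either miss real shifts or fire on noise.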
Root Cause Analysis
```mermaid
flowchart TD
    SUSPECT[Prompt template update<br/>suspected] --> CHECK1{Only recommendation<br/>intent affected?}
    CHECK1 -->|Yes| CHECK2{Prompt token count<br/>increased?}
    CHECK2 -->|Yes: 180 → 520 tokens| CHECK3{Response tokens<br/>also increased?}
    CHECK3 -->|Yes: 285 → 410 avg| ROOT[Root cause confirmed:<br/>Longer prompt + longer outputs]
    ROOT --> IMPACT[Two latency impacts:]
    IMPACT --> IMPACT1["1. Input processing: +340 tokens<br/>→ +120ms first-token latency"]
    IMPACT --> IMPACT2["2. Output generation: +125 tokens avg<br/>→ +500ms streaming duration"]
    CHECK1 -->|No, all intents| DIFFERENT[Different root cause —<br/>check infrastructure]
    CHECK2 -->|No, same size| DIFFERENT2[Not prompt-related —<br/>check model routing]
    CHECK3 -->|No, same output| PROMPT_QUALITY[Prompt is slower to process<br/>→ Check prompt complexity]
```
Root cause breakdown:
| Factor | Before | After | Latency Impact |
|---|---|---|---|
| System prompt tokens | 180 | 520 | +120ms (input processing) |
| Avg output tokens | 285 | 410 | +500ms (generation time) |
| Prompt complexity (nested instructions) | Low | High (multi-criteria) | +60ms (reasoning overhead) |
| Total latency increase | — | — | +680ms |
The longer system prompt causes two latency effects:
1. More input tokens = more time for the model to process the prompt before generating the first token
2. More detailed instructions = the model generates longer, more structured responses, increasing streaming duration
Resolution
- Immediate — Assess rollback need (15 min):

  ```mermaid
  flowchart TD
      ASSESS{Is the quality improvement<br/>worth the latency cost?}
      ASSESS -->|No — quality gain marginal| ROLLBACK[Roll back to previous<br/>prompt version immediately]
      ASSESS -->|Yes — quality significantly better| OPTIMIZE[Keep new prompt,<br/>optimize for latency]
      ASSESS -->|Unsure — need data| ABTEST[Run A/B test:<br/>old prompt vs new prompt]
  ```
- If rolling back (immediate fix):

  ```bash
  # Revert prompt version in DynamoDB: deactivate v2.1, reactivate v2.0
  aws dynamodb update-item \
    --table-name manga-prompt-templates \
    --key '{"intent":{"S":"recommendation"},"version":{"S":"v2.1"}}' \
    --update-expression "SET active = :false" \
    --expression-attribute-values '{":false":{"BOOL":false}}'

  aws dynamodb update-item \
    --table-name manga-prompt-templates \
    --key '{"intent":{"S":"recommendation"},"version":{"S":"v2.0"}}' \
    --update-expression "SET active = :true" \
    --expression-attribute-values '{":true":{"BOOL":true}}'
  ```
- If optimizing the new prompt (1-2 days):

| Optimization | Technique | Token Reduction | Quality Impact |
|---|---|---|---|
| Compress scoring instructions | Use structured format instead of prose | -120 tokens | Minimal |
| Add `max_tokens: 300` limit | Cap output length | -110 tokens avg | Responses slightly shorter |
| Use Haiku for scoring, Sonnet for prose | Two-stage: Haiku scores, Sonnet narrates | -200ms first-token | Minimal (scoring is mechanical) |
| Move static criteria to pre-computed context | Load scoring criteria from cache, not prompt | -180 tokens from prompt | None |
| Combined | — | -300 tokens, -180ms | Negligible |
- Optimized prompt structure:

  ```
  Before (520 tokens — prose):
  "When recommending manga, evaluate each title across multiple dimensions.
  First, consider the art style and how it compares to what the customer
  enjoys. Then evaluate story complexity..."

  After (320 tokens — structured):
  "Score each recommendation on: art_style(1-5), story_complexity(1-5),
  demographic_match(1-5). Format: Title | Scores | 2-sentence reason.
  Max 4 recommendations. Prioritize by total score."
  ```
Prevention
| Prevention Measure | Owner | Implementation |
|---|---|---|
| Prompt change requires latency impact assessment | ML Eng | Pre-deployment: run benchmark with new prompt in staging |
| Token budget per intent (prompt + response) | ML Eng | recommendation intent: max 520 input + 350 output tokens |
| Automated benchmark gate in CI/CD | Platform | Deployment blocked if p95 regression > 15% |
| Prompt compression review checklist | ML Eng | Every prompt update reviewed for token efficiency |
| A/B testing framework for prompt changes | ML Eng | Shadow-test new prompts against production baseline for 24h |
| Auto-rollback on benchmark regression | SRE | If SEV-1 regression detected within 1 hour of deploy, auto-revert |
Prompt Change Deployment Safeguard Flow
```mermaid
flowchart TD
    DEV[ML Engineer writes<br/>new prompt template] --> STAGE_TEST[Deploy to staging<br/>+ run benchmark suite]
    STAGE_TEST --> GATE{p95 within<br/>intent budget?}
    GATE -->|Yes| SHADOW[Shadow deployment:<br/>run new prompt on 5% traffic]
    GATE -->|No| OPTIMIZE[Optimize prompt<br/>before proceeding]
    OPTIMIZE --> STAGE_TEST
    SHADOW --> COMPARE{Quality improved?<br/>Latency acceptable?}
    COMPARE -->|Both pass| CANARY[Canary: 25% traffic<br/>for 2 hours]
    COMPARE -->|Quality pass,<br/>latency fail| OPTIMIZE
    COMPARE -->|Quality fail| REJECT[Reject prompt change]
    CANARY --> FULL{All metrics<br/>within targets?}
    FULL -->|Yes| DEPLOY[Full production<br/>deployment]
    FULL -->|No| ROLLBACK[Auto-rollback<br/>to previous version]
    style DEPLOY fill:#2ecc71,color:#000
    style ROLLBACK fill:#e74c3c,color:#fff
    style REJECT fill:#e74c3c,color:#fff
```
Cross-Scenario Summary
Common Patterns Across All 5 Scenarios
| Pattern | Scenarios | Key Lesson |
|---|---|---|
| Graceful degradation is essential | 1, 2 | System must produce a response even with partial data (no RAG, no cache, wrong model) |
| Pre-computation needs invalidation | 4 | Every pre-computed cache must have an invalidation path tied to data freshness |
| Latency budgets must be per-intent | 1, 5 | A global "< 3s" target is insufficient — each intent has different acceptable latency |
| Statistical testing prevents false alarms | 5 | Use Mann-Whitney U test (not simple threshold comparison) for regression detection |
| Connection management is critical for streaming | 1, 3 | Warm pools, heartbeats, and reconnection with resume handle network volatility |
Scenario-to-Technique Mapping
| Scenario | Pre-Computation | Model Selection | Parallel Processing | Streaming | Benchmarking |
|---|---|---|---|---|---|
| 1 — First-token spike | | Primary fix (route to Haiku) | | Root cause (cold connections) | Detection method |
| 2 — RAG timeout | | | Primary fix (per-task timeout + degradation) | | |
| 3 — WebSocket drop | | | | Primary fix (heartbeat + resume) | |
| 4 — Stale recommendations | Primary fix (invalidation pipeline) | | | | |
| 5 — Prompt regression | | | | | Primary fix (automated gate + rollback) |
Key Exam Takeaways
- Responsive AI is not just fast models — it spans connection management, parallel orchestration, streaming delivery, pre-computation freshness, and continuous benchmarking
- Every pre-computation strategy needs an invalidation strategy — without it, you trade latency for stale or incorrect responses
- Parallel orchestration must handle partial failures gracefully — use `asyncio.gather` with per-task timeouts and degraded-mode fallbacks
- Streaming connections need active keep-alive — WebSocket idle timeouts from proxies and gateways can kill streaming responses silently
- Prompt changes are latency changes — longer prompts mean more input processing time AND often longer outputs; gate prompt deployments with automated latency benchmarks