
Scenarios and Runbooks — Responsive AI Systems

AWS AIP-C01 Task 4.2 — Skill 4.2.1: Operational scenarios for responsive AI system failures and their resolution
Context: MangaAssist e-commerce chatbot — Bedrock Claude 3 Sonnet/Haiku, OpenSearch Serverless, DynamoDB, ECS Fargate, API Gateway WebSocket, ElastiCache Redis
Format: Each scenario follows Problem → Detection → Root Cause → Resolution → Prevention with decision trees


Skill Mapping

Certification Domain Task Skill
AWS AIP-C01 Domain 4 — Operational Efficiency Task 4.2 — Optimize FM application performance Skill 4.2.1 — Identify techniques to create responsive AI systems (e.g., pre-computation, latency-optimized model selection, parallel request processing, response streaming, performance benchmarking)

File scope: Five production scenarios that test responsiveness engineering — covering first-token latency spikes, parallel request timeouts, streaming failures, stale pre-computation, and latency regressions. Each includes a decision tree runbook for on-call engineers.


Scenario Index

# Scenario Primary Technique Tested Severity
1 First-token latency exceeds 2s during peak evening hours Streaming + Model Selection + Connection Pooling SEV-2
2 Parallel request timeout: RAG hangs while DynamoDB returns Parallel Orchestration + Timeout Strategy SEV-2
3 WebSocket connection drops during streaming response Streaming + Connection Management SEV-3
4 Pre-computation generates stale recommendations for out-of-stock manga Pre-Computation + Cache Invalidation SEV-3
5 Latency regression after prompt template update Benchmarking + Regression Detection SEV-2

Scenario 1 — First-Token Latency Exceeds 2 Seconds During Peak Evening Hours

Problem Statement

MangaAssist's CloudWatch alarm fires at 20:15 JST on a Tuesday evening. The first_token_latency_ms p95 metric has crossed the 2000ms threshold for the past 10 minutes. Users in Japan are experiencing a visible delay — the typing indicator shows for 2+ seconds before any text appears. Customer satisfaction scores for the 20:00-21:00 hour are dropping. The target for first-token latency is < 400ms for Haiku queries and < 1100ms for Sonnet queries.

Detection

flowchart TD
    ALARM[CloudWatch Alarm<br/>first_token_p95 > 2000ms<br/>for 10 min] --> DASH[Check Grafana Dashboard]
    DASH --> METRICS{Which intents<br/>are affected?}

    METRICS -->|All intents| INFRA[Infrastructure issue<br/>→ Go to Root Cause A]
    METRICS -->|Only Sonnet intents| MODEL[Model-specific issue<br/>→ Go to Root Cause B]
    METRICS -->|Only specific intents| ROUTING[Routing issue<br/>→ Go to Root Cause C]

Key metrics to check immediately:

Metric Location What to Look For
first_token_latency_ms by model CloudWatch MangaAssist/Streaming Which model is slow?
parallel_rag_latency_ms CloudWatch MangaAssist/Orchestrator Is RAG retrieval slow?
Bedrock InvocationLatency CloudWatch AWS/Bedrock Is Bedrock itself slow?
ECS CPU/Memory CloudWatch AWS/ECS Is the orchestrator overloaded?
API Gateway IntegrationLatency CloudWatch AWS/ApiGateway Is the gateway slow?
Active WebSocket connections CloudWatch AWS/ApiGateway Traffic spike?

Root Cause Analysis

Root Cause A — Bedrock Cold Starts Under Load

Diagnosis: Bedrock InvocationLatency p95 spikes from 350ms to 1800ms. The connection pool warm-up pings have been failing silently because the keep-alive Lambda was throttled during the traffic spike. Connections went cold and each new streaming invocation pays the full TLS + auth + cold-model penalty.

flowchart TD
    CHECK_BEDROCK{Bedrock InvocationLatency<br/>p95 > 1500ms?}
    CHECK_BEDROCK -->|Yes| CHECK_POOL{Connection pool<br/>health pings<br/>succeeding?}
    CHECK_POOL -->|No - pings failing| ROOT_A[Root Cause A confirmed:<br/>Cold connections to Bedrock]
    CHECK_POOL -->|Yes - pings OK| CHECK_THROTTLE{Bedrock<br/>ThrottlingException<br/>count > 0?}
    CHECK_THROTTLE -->|Yes| ROOT_A2[Bedrock throttling —<br/>request quota exceeded]
    CHECK_THROTTLE -->|No| ROOT_A3[Bedrock service degradation —<br/>check AWS Health Dashboard]
    CHECK_BEDROCK -->|No| CHECK_ECS[Check ECS metrics →]

Resolution:

  1. Immediate (5 min): Manually trigger connection warm-up across all ECS tasks (a keep-alive sketch of what the warm-up does follows this list):

    # Force warm-up on all running tasks
    aws ecs list-tasks --cluster mangaassist-prod --service-name orchestrator \
      | jq -r '.taskArns[]' \
      | xargs -I {} aws ecs execute-command --cluster mangaassist-prod \
        --task {} --command "/app/scripts/warm_connections.sh" --interactive
    

  2. Short-term (15 min): Increase connection pool size from 5 to 10 per ECS task to handle burst load:

    # Update ECS task definition environment variable
    # BEDROCK_POOL_SIZE: "5" → "10"
    # BEDROCK_KEEPALIVE_INTERVAL: "30" → "15"
    

  3. Medium-term (1 hour): Scale out ECS tasks to distribute load and reduce per-task connection contention:

    aws ecs update-service --cluster mangaassist-prod \
      --service orchestrator --desired-count 8  # was 4
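
A minimal sketch of what the warm-up/keep-alive behavior behind steps 1-2 could look like. It assumes one boto3 bedrock-runtime client per pooled connection and a tiny 1-token Haiku invocation as the "ping"; the function names and loop structure are illustrative (this is not the actual content of warm_connections.sh), while BEDROCK_POOL_SIZE and BEDROCK_KEEPALIVE_INTERVAL are the environment variables referenced in step 2:

    import asyncio
    import json
    import os

    import boto3

    POOL_SIZE = int(os.environ.get("BEDROCK_POOL_SIZE", "10"))
    KEEPALIVE_INTERVAL_S = float(os.environ.get("BEDROCK_KEEPALIVE_INTERVAL", "15"))

    # One client per pooled connection; the underlying HTTPS connection stays warm
    # only as long as it keeps being exercised before the idle timeout.
    _clients = [boto3.client("bedrock-runtime") for _ in range(POOL_SIZE)]

    def _ping(client) -> None:
        """Issue a minimal 1-token Haiku invocation to keep TLS + auth + model path warm."""
        client.invoke_model(
            modelId="anthropic.claude-3-haiku-20240307-v1:0",
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 1,
                "messages": [{"role": "user", "content": "ping"}],
            }),
        )

    async def keepalive_loop() -> None:
        """Background task started at ECS task boot; pings every pooled client on an interval."""
        while True:
            await asyncio.gather(
                *(asyncio.to_thread(_ping, c) for c in _clients),
                return_exceptions=True,  # a failed ping must never kill the loop
            )
            await asyncio.sleep(KEEPALIVE_INTERVAL_S)

The trade-off is a small ongoing token cost per ping, typically negligible next to the cold-connection penalty it avoids.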
    

Root Cause B — Sonnet Model Capacity Saturation

Diagnosis: Only Sonnet-routed intents (recommendations, comparisons) show elevated latency. Haiku intents remain within target. Bedrock ThrottlingException count is elevated. The peak evening traffic has exceeded the provisioned Sonnet throughput.

Resolution:

  1. Immediate: Temporarily route medium-complexity queries to Haiku to reduce Sonnet pressure:

    # Adjust model router threshold
    # complexity_threshold_for_sonnet: 0.3 → 0.6
    # This routes ~40% of current Sonnet traffic to Haiku
    

  2. Short-term: Request Bedrock quota increase for Sonnet model in ap-northeast-1.

  3. Medium-term: Implement adaptive model routing that automatically shifts traffic to Haiku when Sonnet latency exceeds threshold.
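
A sketch of what the adaptive routing in step 3 could look like, assuming the orchestrator records per-request Sonnet latency in memory; the class and method names are illustrative, and the 1500ms ceiling mirrors the prevention table below:

    from collections import deque

    SONNET_P95_CEILING_MS = 1500  # fail over to Haiku when recent Sonnet p95 exceeds this

    class AdaptiveModelRouter:
        """Route to Sonnet only while its recent latency stays under the ceiling."""

        def __init__(self, window: int = 200):
            self._sonnet_latencies_ms: deque[float] = deque(maxlen=window)

        def record_sonnet_latency(self, latency_ms: float) -> None:
            self._sonnet_latencies_ms.append(latency_ms)

        def _sonnet_p95(self) -> float:
            if not self._sonnet_latencies_ms:
                return 0.0
            ordered = sorted(self._sonnet_latencies_ms)
            return ordered[int(0.95 * (len(ordered) - 1))]

        def choose_model(self, complexity: float, sonnet_threshold: float = 0.6) -> str:
            # Simple queries always go to Haiku; complex queries go to Sonnet
            # unless Sonnet is currently degraded, in which case they fail over to Haiku.
            if complexity < sonnet_threshold or self._sonnet_p95() > SONNET_P95_CEILING_MS:
                return "anthropic.claude-3-haiku-20240307-v1:0"
            return "anthropic.claude-3-sonnet-20240229-v1:0"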

Root Cause C — Misclassified Intents Hitting Wrong Model

Diagnosis: The intent classifier is routing simple FAQ queries to Sonnet instead of Haiku. A recent classifier model update introduced a regression in confidence thresholds.

Resolution:

  1. Immediate: Roll back intent classifier to previous version.
  2. Short-term: Add classifier output validation — if a "faq" or "greeting" intent gets routed to Sonnet, flag it as a misclassification (see the guard sketch after this list).
  3. Medium-term: Add automated classifier accuracy benchmarks to the deployment pipeline.
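
A sketch of the routing guard from step 2 (referenced above); the intent labels come from this scenario, while the function name, logging, and fallback model are assumptions:

    import logging

    logger = logging.getLogger("mangaassist.router")

    # Intents that should never need Sonnet; routing them there is treated as a misclassification.
    HAIKU_ONLY_INTENTS = {"faq", "greeting"}

    def validate_routing(intent: str, chosen_model: str) -> str:
        """Override suspicious classifier output and emit a signal for monitoring."""
        if intent in HAIKU_ONLY_INTENTS and "sonnet" in chosen_model:
            logger.warning("Suspected misclassification: intent=%s routed to %s", intent, chosen_model)
            return "anthropic.claude-3-haiku-20240307-v1:0"
        return chosen_model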

Prevention

Prevention Measure Owner Implementation
Connection pool health monitoring with alerting SRE CloudWatch alarm on pool health ping failure rate > 5%
Provisioned throughput for Sonnet during peak hours Platform Bedrock provisioned throughput scheduled 19:00-23:00 JST
Adaptive model routing with latency feedback ML Eng Route to Haiku when Sonnet p95 > 1500ms
Load testing at 2x peak traffic monthly QA Automated load test via EventBridge + ECS benchmark task
Connection pool warm-up on ECS task start Platform Add warm-up to ECS task health check

Scenario 2 — Parallel Request Timeout: RAG Retrieval Hangs While DynamoDB Returns Instantly

Problem Statement

MangaAssist users querying manga recommendations see a 5-second timeout error. CloudWatch logs show that the ParallelOrchestrator is hitting the overall timeout because OpenSearch RAG retrieval is hanging at 4.8 seconds, even though DynamoDB session load and user profile fetch complete in < 200ms. The asyncio.gather call is waiting for the slowest task (RAG) to complete before proceeding to Bedrock invocation, so the parallel fan-out phase stretches from the expected ~650ms to more than 5000ms before the overall timeout fires.

Detection

flowchart TD
    ALARM[CloudWatch Alarm<br/>parallel_rag_latency_ms p95 > 3000ms] --> LOGS[Check orchestrator logs]
    LOGS --> PATTERN{Which parallel<br/>task is slow?}

    PATTERN -->|RAG only| RAG_ISSUE[OpenSearch issue<br/>→ Go to Root Cause A]
    PATTERN -->|Multiple tasks| NETWORK[Network / ECS issue<br/>→ Go to Root Cause B]
    PATTERN -->|DynamoDB slow too| INFRA[General infra issue<br/>→ Check ECS + VPC]

Key log pattern to search for:

WARN  | Parallel task 'rag' failed | error=Task 'rag' exceeded 2.0s timeout
INFO  | Parallel execution complete | timings={"rag": 2.001, "session": 0.18, "profile": 0.14, "cache": 0.04}

Root Cause Analysis

Root Cause A — OpenSearch Serverless OCU Throttling

Diagnosis: OpenSearch Serverless has auto-scaled down to minimum OCUs (2) during the low-traffic afternoon, and the evening traffic spike arrives before OCUs scale back up. KNN vector search queries queue behind each other, causing latency to climb from 600ms to 4800ms.

flowchart TD
    CHECK_OS{OpenSearch<br/>SearchLatency p95<br/>> 2000ms?}
    CHECK_OS -->|Yes| CHECK_OCU{OpenSearch OCU<br/>utilization > 80%?}
    CHECK_OCU -->|Yes| ROOT_A[Root Cause A confirmed:<br/>OCU capacity insufficient<br/>for traffic spike]
    CHECK_OCU -->|No| CHECK_INDEX{Index health<br/>status?}
    CHECK_INDEX -->|Yellow/Red| ROOT_A2[Index shard issue —<br/>rebalancing needed]
    CHECK_INDEX -->|Green| ROOT_A3[Query complexity issue —<br/>check KNN parameters]
    CHECK_OS -->|No| CHECK_NET[Check VPC / ENI →]

Resolution:

  1. Immediate (5 min): The ParallelOrchestrator should already be returning partial results (session + profile) without RAG context. Verify the degraded-mode path is working (a sketch of the per-task timeout wrapper follows this list):

    # Verify graceful degradation in logs:
    # "Proceeding without RAG context due to timeout"
    # Bedrock still generates response using session history only
    

  2. Short-term (30 min): Increase OpenSearch Serverless minimum OCUs:

    aws opensearchserverless update-collection \
      --id manga-products \
      --description "Increase min OCUs for peak" \
      # Update capacity policy: min search OCUs 2 → 6
    

  3. Medium-term: Implement pre-warming for OpenSearch — run a set of representative queries at 18:00 JST (before peak) to trigger OCU scale-up.
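
The per-task timeout wrapper and degraded-mode path referenced in step 1 (and as ParallelOrchestrator._with_timeout() in the prevention table below) might look roughly like this; the 2.0s RAG cap matches the log example above, while the other timeouts, coroutine names, and return shape are assumptions:

    # Inside ParallelOrchestrator (asyncio imported and a module-level logger assumed)
    async def _with_timeout(self, name: str, coro, timeout_s: float):
        """Cap each parallel task individually so one slow dependency cannot stall the fan-out."""
        try:
            return await asyncio.wait_for(coro, timeout=timeout_s)
        except asyncio.TimeoutError:
            logger.warning("Parallel task '%s' failed | error=Task '%s' exceeded %.1fs timeout",
                           name, name, timeout_s)
            return None  # degraded mode: caller proceeds without this piece of context

    async def gather_context(self, query: str, session_id: str) -> dict:
        """Fan out to RAG, session, and profile concurrently; tolerate a missing RAG result."""
        rag, session, profile = await asyncio.gather(
            self._with_timeout("rag", self.retrieve_rag_context(query), timeout_s=2.0),
            self._with_timeout("session", self.load_session(session_id), timeout_s=1.0),
            self._with_timeout("profile", self.load_user_profile(session_id), timeout_s=1.0),
        )
        if rag is None:
            logger.info("Proceeding without RAG context due to timeout")
        return {"rag": rag, "session": session, "profile": profile}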

Root Cause B — VPC DNS Resolution Failure

Diagnosis: The ECS tasks in private subnets are intermittently failing to resolve the OpenSearch Serverless endpoint. DNS queries to the VPC DNS resolver are timing out, causing the HTTP connection to OpenSearch to hang until the TCP timeout.

Resolution:

  1. Immediate: Restart affected ECS tasks to refresh DNS cache.
  2. Short-term: Add DNS caching at the application level with a 60-second TTL (see the sketch after this list).
  3. Medium-term: Configure Route 53 Resolver with DNS firewall rules and monitoring.
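
A minimal sketch of the 60-second application-level DNS cache from step 2, assuming the orchestrator resolves the OpenSearch endpoint itself before opening connections; the function name and module-level cache are illustrative:

    import socket
    import time

    _DNS_CACHE: dict[str, tuple[float, list[str]]] = {}
    _DNS_TTL_S = 60.0

    def resolve_cached(hostname: str, port: int = 443) -> list[str]:
        """Resolve a hostname, reusing the cached answer for up to 60 seconds."""
        now = time.monotonic()
        cached = _DNS_CACHE.get(hostname)
        if cached and now - cached[0] < _DNS_TTL_S:
            return cached[1]
        infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
        addresses = [info[4][0] for info in infos]
        _DNS_CACHE[hostname] = (now, addresses)
        return addresses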

Decision Tree — On-Call Runbook

flowchart TD
    START[RAG timeout alert fires] --> CHECK1{Other parallel tasks<br/>also slow?}

    CHECK1 -->|No, only RAG| CHECK2{OpenSearch<br/>SearchLatency > 2s?}
    CHECK1 -->|Yes, multiple| CHECK_ECS{ECS task<br/>CPU > 90%?}

    CHECK2 -->|Yes| CHECK3{OCU utilization<br/>> 80%?}
    CHECK2 -->|No| CHECK_DNS{DNS resolution<br/>errors in logs?}

    CHECK3 -->|Yes| ACTION1[Increase min OCUs<br/>+ trigger pre-warm queries]
    CHECK3 -->|No| CHECK4{KNN query<br/>ef_search changed?}

    CHECK4 -->|Yes| ACTION2[Revert ef_search<br/>to previous value]
    CHECK4 -->|No| ACTION3[Check index health<br/>+ shard distribution]

    CHECK_DNS -->|Yes| ACTION4[Restart ECS tasks<br/>+ add DNS cache]
    CHECK_DNS -->|No| ACTION5[Check VPC endpoints<br/>+ security groups]

    CHECK_ECS -->|Yes| ACTION6[Scale out ECS<br/>+ check for memory leak]
    CHECK_ECS -->|No| ACTION7[Check VPC networking<br/>+ NAT gateway throughput]

    style ACTION1 fill:#2ecc71,color:#000
    style ACTION2 fill:#2ecc71,color:#000
    style ACTION4 fill:#2ecc71,color:#000
    style ACTION6 fill:#f39c12,color:#000

Prevention

Prevention Measure Owner Implementation
OpenSearch pre-warming queries at 18:00 JST Platform EventBridge → Lambda runs 20 representative KNN queries
Per-task timeout (not just overall timeout) Backend Already implemented in ParallelOrchestrator._with_timeout()
Graceful degradation without RAG Backend Generate response with session context only when RAG times out
OpenSearch OCU minimum set to peak-hour baseline Platform Capacity policy: min 6 OCUs during 18:00-24:00 JST
DNS resolution monitoring SRE CloudWatch alarm on Route 53 Resolver NXDOMAIN/SERVFAIL rate

Scenario 3 — WebSocket Connection Drops During Streaming Response

Problem Statement

Multiple MangaAssist users report seeing partial responses — the chatbot starts answering a manga recommendation query, streams 2-3 sentences, then the response abruptly stops. The client shows "Connection lost. Reconnecting..." and when the connection re-establishes, the partial response is gone. The issue affects ~5% of streaming responses during a 30-minute window on a Saturday evening.

Detection

flowchart TD
    ALARM[CloudWatch Alarm<br/>stream_cancellations > 50/hr] --> CHECK_CLIENT{Client-side error<br/>reports available?}

    CHECK_CLIENT -->|Yes| ANALYZE[Analyze error patterns]
    CHECK_CLIENT -->|No| SERVER_LOGS[Check ECS + API GW logs]

    ANALYZE --> PATTERN{Error pattern?}
    PATTERN -->|WebSocket close code 1006| ABNORMAL[Abnormal closure<br/>→ Network or gateway issue]
    PATTERN -->|WebSocket close code 1001| GOING_AWAY[Going away<br/>→ Server-side shutdown]
    PATTERN -->|No close frame received| TIMEOUT[Connection timeout<br/>→ Idle timeout or NAT issue]

Key metrics to check:

Metric Location Normal Current
stream_cancellations CloudWatch MangaAssist/Streaming < 10/hr 67/hr
API Gateway ClientError (4xx) CloudWatch AWS/ApiGateway < 5/hr 42/hr
API Gateway MessageCount CloudWatch AWS/ApiGateway ~50K/hr 48K/hr (normal)
ECS task restarts CloudWatch AWS/ECS 0 0
GoneException count in logs CloudWatch Logs < 5/hr 55/hr

Root Cause Analysis

Root Cause A — API Gateway WebSocket Idle Timeout

Diagnosis: The Bedrock streaming response for complex recommendation queries takes 2.5-3 seconds. During this time, chunk batching (batch size = 3) means the first WebSocket frame is not sent until ~3 chunks accumulate (~900ms after the first token). API Gateway's own idle connection timeout is 10 minutes by default, so the gateway itself is not closing these connections.

The actual issue: a corporate proxy between some JP users and CloudFront has a 15-second idle timeout and applies it aggressively. The parallel fan-out phase (650ms) plus the Bedrock first-token wait (350ms) leaves ~1000ms with no WebSocket frame sent, and chunk batching adds roughly another 900ms before the first chunk frame reaches the client. These gaps are well within API Gateway's limits, but the proxy is closing connections it considers idle.

flowchart TD
    CHECK_PATTERN{Affected users<br/>share network<br/>characteristics?}
    CHECK_PATTERN -->|Yes - same ISP/corp| PROXY[Corporate proxy<br/>idle timeout]
    CHECK_PATTERN -->|No - random| CHECK_GW{API Gateway<br/>Connection Duration<br/>metric available?}
    CHECK_GW -->|Connections dying at<br/>consistent duration| TIMEOUT_CFG[Timeout configuration<br/>issue]
    CHECK_GW -->|Random durations| CHECK_ECS_HEALTH{ECS tasks<br/>healthy?}
    CHECK_ECS_HEALTH -->|Yes| NETWORK[Intermittent network<br/>issue — check VPC NAT]
    CHECK_ECS_HEALTH -->|No — restarts| DEPLOYMENT[Rolling deployment<br/>draining connections]

Resolution:

  1. Immediate (10 min): Reduce chunk batch size from 3 to 1 so the first chunk is sent immediately after the first token:

    # Environment variable update on ECS service
    # CHUNK_BATCH_SIZE: "3" → "1"
    # Trade: more WebSocket frames, but no idle gaps
    

  2. Short-term (1 hour): Add periodic heartbeat frames during the parallel fan-out phase:

    async def stream_with_heartbeat(self, connection_id, ...):
        """Send heartbeat pings during long processing phases."""
        heartbeat_task = asyncio.create_task(
            self._heartbeat_loop(connection_id, interval=5.0)
        )
        try:
            # Return the streamed result; the heartbeat task is cancelled whether streaming succeeds or fails
            return await self.stream_query(...)
        finally:
            heartbeat_task.cancel()
    
    async def _heartbeat_loop(self, connection_id, interval):
        """Send invisible heartbeat to keep connection alive."""
        while True:
            await asyncio.sleep(interval)
            await self.ws_send(connection_id, {
                "type": "heartbeat",
                "timestamp": time.time(),
            })
    

  3. Medium-term: Implement client-side reconnection with response resumption:

    Client sends: {"action":"resume","session_id":"abc","last_chunk_idx":7}
    Server sends: Remaining chunks from idx 8 onward (buffered in Redis)
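
A sketch of how the resume path and the Redis chunk buffer from the prevention table below could fit together, assuming redis-py's asyncio client and JSON-serialized chunks keyed by session_id; the key names are illustrative, and the 120-second TTL mirrors the prevention table:

    import json

    import redis.asyncio as redis

    r = redis.Redis(host="mangaassist-cache.xxxxx.ng.0001.apne1.cache.amazonaws.com", port=6379)

    async def buffer_chunk(session_id: str, chunk_idx: int, text: str) -> None:
        """Append each streamed chunk to a per-session Redis list so a reconnect can resume."""
        key = f"stream_buffer:{session_id}"
        await r.rpush(key, json.dumps({"idx": chunk_idx, "text": text}))
        await r.expire(key, 120)  # keep roughly the last couple of minutes of chunks

    async def handle_resume(session_id: str, last_chunk_idx: int) -> list[dict]:
        """Serve the resume action: return only the chunks the client has not yet seen."""
        raw = await r.lrange(f"stream_buffer:{session_id}", 0, -1)
        chunks = [json.loads(item) for item in raw]
        return [c for c in chunks if c["idx"] > last_chunk_idx]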
    

Prevention

Prevention Measure Owner Implementation
WebSocket heartbeat every 5 seconds during processing Backend Heartbeat task runs parallel to stream processing
Reduce chunk batch size to 1 for first 3 chunks Backend Prioritize first-frame delivery, then batch normally
Client-side reconnection with resume Frontend Store partial response + last chunk index, request resume
Server-side response buffering in Redis Backend Buffer last 60s of chunks per connection_id, TTL 120s
Monitor WebSocket close code distribution SRE CloudWatch alarm on close code 1006 rate > 1%

Scenario 4 — Pre-Computation Generates Stale Recommendations for Out-of-Stock Manga

Problem Statement

A customer asks "What shonen manga should I read?" and MangaAssist responds with a recommendation list that includes "Demon Slayer Volume 23 — Limited Edition" as the top recommendation. The customer clicks through to buy it, but the product page shows "Out of Stock." The recommendation was pre-computed 6 hours ago by the nightly batch job, and the limited edition sold out 2 hours ago during a flash sale. The pre-computed response in Redis has a 24-hour TTL and does not reflect the inventory change.

Detection

flowchart TD
    SIGNAL1[Customer complaint:<br/>"Recommended manga is out of stock"] --> INVESTIGATE
    SIGNAL2[CloudWatch: cart_abandonment_rate<br/>spike after recommendation click] --> INVESTIGATE
    SIGNAL3[Business metric: recommendation_to_purchase<br/>conversion rate dropped 40%] --> INVESTIGATE

    INVESTIGATE --> CHECK{Is the recommended<br/>product in stock?}
    CHECK -->|No| STALE[Stale pre-computation<br/>→ Investigate cache freshness]
    CHECK -->|Yes - different issue| OTHER[Product page issue<br/>→ Check catalog service]

Key data points:

Data Point Expected Actual
Pre-computation last run < 24 hours ago 6 hours ago (on schedule)
Redis cache TTL for genre recommendations 24 hours 24 hours (as configured)
Demon Slayer Vol 23 LE stock status In stock (at pre-compute time) Out of stock (sold out 2 hours ago)
DynamoDB inventory update timestamp N/A 2 hours ago (flash sale depleted stock)
Cache invalidation trigger for inventory changes Should exist Does not exist

Root Cause Analysis

flowchart TD
    ROOT[Pre-computed recommendation<br/>is stale] --> WHY1{Does inventory change<br/>trigger cache invalidation?}

    WHY1 -->|No| MISSING[Missing invalidation path:<br/>DynamoDB Stream → Lambda → Redis]
    WHY1 -->|Yes but failed| CHECK_STREAM{DynamoDB Stream<br/>events flowing?}

    MISSING --> WHY2{Why was this not<br/>built?}
    WHY2 --> GAP[Pre-computation pipeline was designed<br/>for catalog data, not inventory state.<br/>Inventory was assumed stable between runs.]

    CHECK_STREAM -->|Events present| CHECK_LAMBDA{Invalidation Lambda<br/>errors?}
    CHECK_STREAM -->|No events| STREAM_DISABLED[DynamoDB Stream<br/>not enabled on inventory table]

    CHECK_LAMBDA -->|Errors| LAMBDA_BUG[Lambda bug —<br/>check error logs]
    CHECK_LAMBDA -->|No errors| CACHE_KEY[Cache key mismatch —<br/>Lambda invalidating wrong key]

Root cause confirmed: The pre-computation pipeline invalidates cached responses when catalog data changes (new titles, updated descriptions), but it does not invalidate when inventory status changes. The pipeline was designed assuming that stock levels change slowly, but flash sales create rapid inventory depletion that the 24-hour TTL cannot accommodate.

Resolution:

  1. Immediate (15 min): Manually invalidate the stale genre recommendation caches:

    # Connect to ElastiCache Redis and delete the stale keys
    # (SCAN is preferred over KEYS here; KEYS is O(N) and blocks Redis on large keyspaces)
    redis-cli -h mangaassist-cache.xxxxx.ng.0001.apne1.cache.amazonaws.com
    > SCAN 0 MATCH "precompute:genre:shonen:*" COUNT 100
    > DEL "precompute:genre:shonen:recommendations"
    > DEL "precompute:genre:shonen:top_picks"
    

  2. Short-term (2 hours): Add inventory-aware validation at response time — before serving a pre-computed recommendation, check stock status for included products:

    async def validate_precomputed_response(
        self, cached_response: dict, product_ids: list[str]
    ) -> dict | None:
        """
        Validate pre-computed recommendations against live inventory.
        Replace out-of-stock items with alternatives.
        """
        # Batch check inventory status (single DynamoDB BatchGetItem)
        stock_status = await self._batch_check_inventory(product_ids)
    
        out_of_stock = [
            pid for pid, status in stock_status.items()
            if status["quantity"] == 0
        ]
    
        if not out_of_stock:
            return cached_response  # All items in stock, serve as-is
    
        # Filter out unavailable items and note the gap
        filtered = {
            "recommendations": [
                rec for rec in cached_response["recommendations"]
                if rec["product_id"] not in out_of_stock
            ],
            "stale_items_removed": len(out_of_stock),
        }
    
        # If too many items removed, fall through to live generation
        if len(filtered["recommendations"]) < 2:
            return None  # Trigger fresh LLM generation
    
        return filtered
    

  3. Medium-term (1 week): Add DynamoDB Streams-based cache invalidation for inventory changes:

    graph LR
        DDB[DynamoDB<br/>manga-inventory] -->|Stream| LAMBDA[Lambda:<br/>Inventory Invalidator]
        LAMBDA -->|Check| REDIS_KEYS[Find cached responses<br/>containing this product]
        REDIS_KEYS -->|Invalidate| REDIS[ElastiCache Redis<br/>Delete affected keys]
        LAMBDA -->|Log| CW[CloudWatch<br/>invalidation_count metric]
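
A sketch of the invalidation Lambda in the diagram above, assuming the inventory table's stream is configured with NEW_AND_OLD_IMAGES, that items carry quantity and product_id attributes (as in the validation code earlier), and that a Redis set maps each product to the cache keys that reference it; that reverse index and the key names are assumptions:

    import os

    import redis

    r = redis.Redis(host=os.environ["REDIS_HOST"], port=6379)

    def handler(event, context):
        """Triggered by the DynamoDB Stream on the inventory table; invalidates caches for items that just sold out."""
        invalidated = 0
        for record in event["Records"]:
            if record["eventName"] != "MODIFY":
                continue
            new_image = record["dynamodb"]["NewImage"]
            old_image = record["dynamodb"]["OldImage"]
            # React only when stock crosses from available to zero.
            if int(new_image["quantity"]["N"]) == 0 and int(old_image["quantity"]["N"]) > 0:
                product_id = new_image["product_id"]["S"]
                # Reverse index: which pre-computed responses mention this product?
                cache_keys = r.smembers(f"product_to_cache_keys:{product_id}")
                if cache_keys:
                    r.delete(*cache_keys)
                    invalidated += len(cache_keys)
        print(f"invalidation_count={invalidated}")  # surfaced as a CloudWatch metric via a metric filter
        return {"invalidated": invalidated}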
    

Prevention

Prevention Measure Owner Implementation
Inventory-aware validation on pre-computed responses Backend BatchGetItem stock check before serving cached recommendations
DynamoDB Streams → Lambda → Redis invalidation Platform Stream from inventory table triggers selective cache invalidation
Reduce TTL for recommendation caches to 4 hours Platform Redis TTL 24h → 4h for genre/category recommendations
Flash sale mode: disable pre-computed recommendations Business Feature flag disables pre-compute cache during flash sale events
Monitor recommendation-to-purchase conversion rate Analytics CloudWatch alarm when conversion drops > 20% from 7-day average

Scenario 5 — Latency Regression After Prompt Template Update

Problem Statement

On Wednesday at 14:00 JST, the ML engineering team deploys a new system prompt template for the "recommendation" intent. The updated prompt adds detailed instructions for multi-criteria scoring (art style, story complexity, target demographic) to improve recommendation quality. The new prompt is 340 tokens longer than the previous version (from 180 tokens to 520 tokens). By 15:00 JST, the automated benchmark framework detects a statistically significant latency regression: recommendation intent p95 has increased from 2400ms to 3200ms (33% increase), crossing the 3000ms SLA target.

Detection

flowchart TD
    BENCHMARK[Automated Benchmark Run<br/>Every 15 min] --> STATS{Mann-Whitney U Test<br/>p-value < 0.05?}
    STATS -->|p = 0.003| SIGNIFICANT[Statistically significant<br/>regression detected]
    SIGNIFICANT --> SEVERITY{Regression<br/>magnitude?}
    SEVERITY -->|33% increase<br/>> 25% threshold| SEV2[SEV-2 Alert<br/>PagerDuty notification]

    SEV2 --> CORRELATE[Correlate with<br/>recent changes]
    CORRELATE --> DEPLOY_LOG{Any deployments<br/>in last 2 hours?}
    DEPLOY_LOG -->|Yes: Prompt template<br/>update at 14:00| SUSPECT[Primary suspect:<br/>prompt template change]

Benchmark comparison data:

Metric Before (13:45 run) After (15:00 run) Change Significance
Recommendation p50 1180ms 1620ms +37% p < 0.001
Recommendation p95 2410ms 3210ms +33% p = 0.003
Recommendation p99 2890ms 3780ms +31% p = 0.008
First-token p95 1050ms 1380ms +31% p = 0.005
Recommendation tokens/response 285 avg 410 avg +44% p < 0.001
Other intents p95 All within target All within target No change N/A
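
A minimal sketch of the significance check behind this alert, assuming raw per-request latency samples are retained for both benchmark runs and SciPy is available; the function name and return shape are illustrative, while the thresholds mirror the logic above (p < 0.05 and a > 25% p95 increase):

    import numpy as np
    from scipy.stats import mannwhitneyu

    def detect_regression(baseline_ms: list[float], candidate_ms: list[float]) -> dict:
        """Flag a regression only if the latency shift is statistically significant and large enough to matter."""
        _, p_value = mannwhitneyu(candidate_ms, baseline_ms, alternative="greater")
        baseline_p95 = float(np.percentile(baseline_ms, 95))
        candidate_p95 = float(np.percentile(candidate_ms, 95))
        p95_increase_pct = (candidate_p95 - baseline_p95) / baseline_p95 * 100
        return {
            "significant": bool(p_value < 0.05),
            "p_value": float(p_value),
            "p95_increase_pct": p95_increase_pct,
            "sev2": bool(p_value < 0.05 and p95_increase_pct > 25),  # page on-call only for meaningful regressions
        }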

Root Cause Analysis

flowchart TD
    SUSPECT[Prompt template update<br/>suspected] --> CHECK1{Only recommendation<br/>intent affected?}
    CHECK1 -->|Yes| CHECK2{Prompt token count<br/>increased?}
    CHECK2 -->|Yes: 180 → 520 tokens| CHECK3{Response tokens<br/>also increased?}
    CHECK3 -->|Yes: 285 → 410 avg| ROOT[Root cause confirmed:<br/>Longer prompt + longer outputs]

    ROOT --> IMPACT[Two latency impacts:]
    IMPACT --> IMPACT1["1. Input processing: +340 tokens<br/>→ +120ms first-token latency"]
    IMPACT --> IMPACT2["2. Output generation: +125 tokens avg<br/>→ +500ms streaming duration"]

    CHECK1 -->|No, all intents| DIFFERENT[Different root cause —<br/>check infrastructure]
    CHECK2 -->|No, same size| DIFFERENT2[Not prompt-related —<br/>check model routing]
    CHECK3 -->|No, same output| PROMPT_QUALITY[Prompt is slower to process<br/>→ Check prompt complexity]

Root cause breakdown:

Factor Before After Latency Impact
System prompt tokens 180 520 +120ms (input processing)
Avg output tokens 285 410 +500ms (generation time)
Prompt complexity (nested instructions) Low High (multi-criteria) +60ms (reasoning overhead)
Total latency increase +680ms

The longer system prompt causes two latency effects:

  1. More input tokens = more time for the model to process the prompt before generating the first token
  2. More detailed instructions = the model generates longer, more structured responses, increasing streaming duration

Resolution

  1. Immediate — Assess rollback need (15 min):
flowchart TD
    ASSESS{Is the quality improvement<br/>worth the latency cost?}
    ASSESS -->|No — quality gain marginal| ROLLBACK[Roll back to previous<br/>prompt version immediately]
    ASSESS -->|Yes — quality significantly better| OPTIMIZE[Keep new prompt,<br/>optimize for latency]
    ASSESS -->|Unsure — need data| ABTEST[Run A/B test:<br/>old prompt vs new prompt]
  2. If rolling back (immediate fix):

    # Revert prompt version in DynamoDB
    aws dynamodb update-item \
      --table-name manga-prompt-templates \
      --key '{"intent":{"S":"recommendation"},"version":{"S":"v2.1"}}' \
      --update-expression "SET active = :false" \
      --expression-attribute-values '{":false":{"BOOL":false}}'
    
    aws dynamodb update-item \
      --table-name manga-prompt-templates \
      --key '{"intent":{"S":"recommendation"},"version":{"S":"v2.0"}}' \
      --update-expression "SET active = :true" \
      --expression-attribute-values '{":true":{"BOOL":true}}'
    

  3. If optimizing the new prompt (1-2 days):

Optimization Approach Reduction Quality Impact
Compress scoring instructions Use structured format instead of prose -120 tokens Minimal
Add max_tokens: 300 limit Cap output length -110 tokens avg Responses slightly shorter
Use Haiku for scoring, Sonnet for prose Two-stage: Haiku scores, Sonnet narrates -200ms first-token Minimal (scoring is mechanical)
Move static criteria to pre-computed context Load scoring criteria from cache, not prompt -180 tokens from prompt None
Combined -300 tokens, -180ms Negligible
  4. Optimized prompt structure:
    Before (520 tokens — prose):
    "When recommending manga, evaluate each title across multiple
    dimensions. First, consider the art style and how it compares
    to what the customer enjoys. Then evaluate story complexity..."
    
    After (320 tokens — structured):
    "Score each recommendation on: art_style(1-5), story_complexity(1-5),
    demographic_match(1-5). Format: Title | Scores | 2-sentence reason.
    Max 4 recommendations. Prioritize by total score."
    

Prevention

Prevention Measure Owner Implementation
Prompt change requires latency impact assessment ML Eng Pre-deployment: run benchmark with new prompt in staging
Token budget per intent (prompt + response) ML Eng recommendation intent: max 520 input + 350 output tokens
Automated benchmark gate in CI/CD Platform Deployment blocked if p95 regression > 15%
Prompt compression review checklist ML Eng Every prompt update reviewed for token efficiency
A/B testing framework for prompt changes ML Eng Shadow-test new prompts against production baseline for 24h
Auto-rollback on benchmark regression SRE If SEV-1 regression detected within 1 hour of deploy, auto-revert

Prompt Change Deployment Safeguard Flow

flowchart TD
    DEV[ML Engineer writes<br/>new prompt template] --> STAGE_TEST[Deploy to staging<br/>+ run benchmark suite]
    STAGE_TEST --> GATE{p95 within<br/>intent budget?}

    GATE -->|Yes| SHADOW[Shadow deployment:<br/>run new prompt on 5% traffic]
    GATE -->|No| OPTIMIZE[Optimize prompt<br/>before proceeding]
    OPTIMIZE --> STAGE_TEST

    SHADOW --> COMPARE{Quality improved?<br/>Latency acceptable?}
    COMPARE -->|Both pass| CANARY[Canary: 25% traffic<br/>for 2 hours]
    COMPARE -->|Quality pass,<br/>latency fail| OPTIMIZE
    COMPARE -->|Quality fail| REJECT[Reject prompt change]

    CANARY --> FULL{All metrics<br/>within targets?}
    FULL -->|Yes| DEPLOY[Full production<br/>deployment]
    FULL -->|No| ROLLBACK[Auto-rollback<br/>to previous version]

    style DEPLOY fill:#2ecc71,color:#000
    style ROLLBACK fill:#e74c3c,color:#fff
    style REJECT fill:#e74c3c,color:#fff

Cross-Scenario Summary

Common Patterns Across All 5 Scenarios

Pattern Scenarios Key Lesson
Graceful degradation is essential 1, 2 System must produce a response even with partial data (no RAG, no cache, wrong model)
Pre-computation needs invalidation 4 Every pre-computed cache must have an invalidation path tied to data freshness
Latency budgets must be per-intent 1, 5 A global "< 3s" target is insufficient — each intent has different acceptable latency
Statistical testing prevents false alarms 5 Use Mann-Whitney U test (not simple threshold comparison) for regression detection
Connection management is critical for streaming 1, 3 Warm pools, heartbeats, and reconnection with resume handle network volatility

Scenario-to-Technique Mapping

Scenario Technique → Role
1 — First-token spike Model Selection: primary fix (route to Haiku) · Streaming / connection management: root cause (cold connections) · Benchmarking: detection method
2 — RAG timeout Parallel Processing: primary fix (per-task timeout + degradation)
3 — WebSocket drop Streaming: primary fix (heartbeat + resume)
4 — Stale recommendations Pre-Computation: primary fix (invalidation pipeline)
5 — Prompt regression Benchmarking: primary fix (automated gate + rollback)

Key Exam Takeaways

  1. Responsive AI is not just fast models — it spans connection management, parallel orchestration, streaming delivery, pre-computation freshness, and continuous benchmarking
  2. Every pre-computation strategy needs an invalidation strategy — without it, you trade latency for stale/incorrect responses
  3. Parallel orchestration must handle partial failures gracefully: asyncio.gather with per-task timeouts and degraded-mode fallbacks
  4. Streaming connections need active keep-alive — WebSocket idle timeouts from proxies and gateways can kill streaming responses silently
  5. Prompt changes are latency changes — longer prompts mean more input processing time AND often longer outputs; gate prompt deployments with automated latency benchmarks