Scenarios and Runbooks — Responsive AI Systems
AWS AIP-C01 Task 4.2 — Skill 4.2.1: Operational scenarios for responsive AI system failures and their resolution
Context: MangaAssist e-commerce chatbot — Bedrock Claude 3 Sonnet/Haiku, OpenSearch Serverless, DynamoDB, ECS Fargate, API Gateway WebSocket, ElastiCache Redis
Format: Each scenario follows Problem → Detection → Root Cause → Resolution → Prevention, with decision trees
Skill Mapping
| Certification | Domain | Task | Skill |
|---|---|---|---|
| AWS AIP-C01 | Domain 4 — Operational Efficiency | Task 4.2 — Optimize FM application performance | Skill 4.2.1 — Identify techniques to create responsive AI systems (e.g., pre-computation, latency-optimized model selection, parallel request processing, response streaming, performance benchmarking) |
File scope: Five production scenarios that test responsiveness engineering — covering first-token latency spikes, parallel request timeouts, streaming failures, stale pre-computation, and latency regressions. Each includes a decision tree runbook for on-call engineers.
Scenario Index
| # | Scenario | Primary Technique Tested | Severity |
|---|---|---|---|
| 1 | First-token latency exceeds 2s during peak evening hours | Streaming + Model Selection + Connection Pooling | SEV-2 |
| 2 | Parallel request timeout: RAG hangs while DynamoDB returns | Parallel Orchestration + Timeout Strategy | SEV-2 |
| 3 | WebSocket connection drops during streaming response | Streaming + Connection Management | SEV-3 |
| 4 | Pre-computation generates stale recommendations for out-of-stock manga | Pre-Computation + Cache Invalidation | SEV-3 |
| 5 | Latency regression after prompt template update | Benchmarking + Regression Detection | SEV-2 |
Scenario 1 — First-Token Latency Exceeds 2 Seconds During Peak Evening Hours
Problem Statement
MangaAssist's CloudWatch alarm fires at 20:15 JST on a Tuesday evening. The first_token_latency_ms p95 metric has crossed the 2000ms threshold for the past 10 minutes. Users in Japan are experiencing a visible delay — the typing indicator shows for 2+ seconds before any text appears. Customer satisfaction scores for the 20:00-21:00 hour are dropping. The target for first-token latency is < 400ms for Haiku queries and < 1100ms for Sonnet queries.
Detection
```mermaid
flowchart TD
    ALARM[CloudWatch Alarm<br/>first_token_p95 > 2000ms<br/>for 10 min] --> DASH[Check Grafana Dashboard]
    DASH --> METRICS{Which intents<br/>are affected?}
    METRICS -->|All intents| INFRA[Infrastructure issue<br/>→ Go to Root Cause A]
    METRICS -->|Only Sonnet intents| MODEL[Model-specific issue<br/>→ Go to Root Cause B]
    METRICS -->|Only specific intents| ROUTING[Routing issue<br/>→ Go to Root Cause C]
```
Key metrics to check immediately:
| Metric | Location | What to Look For |
|---|---|---|
| `first_token_latency_ms` by model | CloudWatch `MangaAssist/Streaming` | Which model is slow? |
| `parallel_rag_latency_ms` | CloudWatch `MangaAssist/Orchestrator` | Is RAG retrieval slow? |
| Bedrock `InvocationLatency` | CloudWatch `AWS/Bedrock` | Is Bedrock itself slow? |
| ECS CPU/Memory | CloudWatch `AWS/ECS` | Is the orchestrator overloaded? |
| API Gateway `IntegrationLatency` | CloudWatch `AWS/ApiGateway` | Is the gateway slow? |
| Active WebSocket connections | CloudWatch `AWS/ApiGateway` | Traffic spike? |
Root Cause Analysis
Root Cause A — Bedrock Cold Starts Under Load
Diagnosis: Bedrock InvocationLatency p95 spikes from 350ms to 1800ms. The connection pool warm-up pings have been failing silently because the keep-alive Lambda was throttled during the traffic spike. Connections went cold and each new streaming invocation pays the full TLS + auth + cold-model penalty.
```mermaid
flowchart TD
    CHECK_BEDROCK{Bedrock InvocationLatency<br/>p95 > 1500ms?}
    CHECK_BEDROCK -->|Yes| CHECK_POOL{Connection pool<br/>health pings<br/>succeeding?}
    CHECK_POOL -->|No - pings failing| ROOT_A[Root Cause A confirmed:<br/>Cold connections to Bedrock]
    CHECK_POOL -->|Yes - pings OK| CHECK_THROTTLE{Bedrock<br/>ThrottlingException<br/>count > 0?}
    CHECK_THROTTLE -->|Yes| ROOT_A2[Bedrock throttling —<br/>request quota exceeded]
    CHECK_THROTTLE -->|No| ROOT_A3[Bedrock service degradation —<br/>check AWS Health Dashboard]
    CHECK_BEDROCK -->|No| CHECK_ECS[Check ECS metrics →]
```
Resolution:
- Immediate (5 min): Manually trigger connection warm-up across all ECS tasks:

  ```bash
  # Force warm-up on all running tasks
  aws ecs list-tasks --cluster mangaassist-prod --service-name orchestrator \
    | jq -r '.taskArns[]' \
    | xargs -I {} aws ecs execute-command --cluster mangaassist-prod \
        --task {} --command "/app/scripts/warm_connections.sh" --interactive
  ```

- Short-term (15 min): Increase the connection pool size from 5 to 10 per ECS task to handle burst load:

  ```
  # Update ECS task definition environment variables
  # BEDROCK_POOL_SIZE: "5" → "10"
  # BEDROCK_KEEPALIVE_INTERVAL: "30" → "15"
  ```

- Medium-term (1 hour): Scale out ECS tasks to distribute load and reduce per-task connection contention:

  ```bash
  aws ecs update-service --cluster mangaassist-prod \
    --service orchestrator --desired-count 8   # was 4
  ```
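The warm-up and keep-alive behavior these steps restore can be sketched as a small pool abstraction. This is a minimal illustration under stated assumptions, not the production orchestrator's code: `WarmPool`, its `connect` factory, and the `ping()` health check are hypothetical names.

```python
import asyncio
import time

class WarmPool:
    """Keep-alive connection pool sketch (hypothetical; the real service
    is configured via BEDROCK_POOL_SIZE / BEDROCK_KEEPALIVE_INTERVAL and
    a warm_connections.sh script, none of which is shown here)."""

    def __init__(self, connect, size=5, keepalive_interval=30.0):
        self._connect = connect              # async factory for one connection
        self._size = size
        self._interval = keepalive_interval  # seconds between health pings
        self._pool = []
        self.last_ping = None

    async def warm(self):
        # Open every connection up front so the first user request skips
        # the TLS + auth + cold-model penalty described in Root Cause A.
        self._pool = list(await asyncio.gather(
            *[self._connect() for _ in range(self._size)]
        ))

    async def keepalive_once(self):
        # One ping cycle; production would run this on a timer and alarm
        # when the ping failure rate exceeds 5% (see Prevention table).
        await asyncio.gather(*[conn.ping() for conn in self._pool])
        self.last_ping = time.monotonic()
```

Running `warm()` at ECS task start (as the Prevention table suggests) means the pool is hot before the health check passes, so no user request ever pays the cold-connection cost.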
Root Cause B — Sonnet Model Capacity Saturation
Diagnosis: Only Sonnet-routed intents (recommendations, comparisons) show elevated latency. Haiku intents remain within target. Bedrock ThrottlingException count is elevated. The peak evening traffic has exceeded the provisioned Sonnet throughput.
Resolution:
- Immediate: Temporarily route medium-complexity queries to Haiku to reduce Sonnet pressure:

  ```
  # Adjust model router threshold
  # complexity_threshold_for_sonnet: 0.3 → 0.6
  # This routes ~40% of current Sonnet traffic to Haiku
  ```

- Short-term: Request a Bedrock quota increase for the Sonnet model in ap-northeast-1.
- Medium-term: Implement adaptive model routing that automatically shifts traffic to Haiku when Sonnet latency exceeds a threshold.
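The adaptive routing idea can be sketched as a router that tracks recent Sonnet latencies and sheds traffic to Haiku when the observed p95 crosses the 1500ms limit from the Prevention table. The class and its wiring are illustrative assumptions, not the production router; the Bedrock model IDs are the standard Claude 3 identifiers.

```python
from collections import deque

SONNET = "anthropic.claude-3-sonnet-20240229-v1:0"
HAIKU = "anthropic.claude-3-haiku-20240307-v1:0"

class AdaptiveRouter:
    """Latency-feedback model selection sketch (hypothetical class)."""

    def __init__(self, sonnet_p95_limit_ms=1500, window=100):
        self.limit = sonnet_p95_limit_ms
        self.samples = deque(maxlen=window)  # recent Sonnet latencies (ms)

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def _p95(self):
        if not self.samples:
            return 0.0
        ordered = sorted(self.samples)
        return ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]

    def choose(self, complexity):
        # Complex queries normally go to Sonnet; when Sonnet's observed
        # p95 exceeds the limit, shed that traffic to Haiku instead.
        if complexity >= 0.6 and self._p95() <= self.limit:
            return SONNET
        return HAIKU
```

The complexity cutoff of 0.6 mirrors the adjusted `complexity_threshold_for_sonnet` from the immediate fix; a sliding window keeps the router responsive to recovery as well as degradation.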
Root Cause C — Misclassified Intents Hitting Wrong Model
Diagnosis: The intent classifier is routing simple FAQ queries to Sonnet instead of Haiku. A recent classifier model update introduced a regression in confidence thresholds.
Resolution:
- Immediate: Roll back intent classifier to previous version.
- Short-term: Add classifier output validation — if a "faq" or "greeting" intent gets routed to Sonnet, flag it as a misclassification.
- Medium-term: Add automated classifier accuracy benchmarks to the deployment pipeline.
Prevention
| Prevention Measure | Owner | Implementation |
|---|---|---|
| Connection pool health monitoring with alerting | SRE | CloudWatch alarm on pool health ping failure rate > 5% |
| Provisioned throughput for Sonnet during peak hours | Platform | Bedrock provisioned throughput scheduled 19:00-23:00 JST |
| Adaptive model routing with latency feedback | ML Eng | Route to Haiku when Sonnet p95 > 1500ms |
| Load testing at 2x peak traffic monthly | QA | Automated load test via EventBridge + ECS benchmark task |
| Connection pool warm-up on ECS task start | Platform | Add warm-up to ECS task health check |
Scenario 2 — Parallel Request Timeout: RAG Retrieval Hangs While DynamoDB Returns Instantly
Problem Statement
MangaAssist users querying manga recommendations see a 5-second timeout error. CloudWatch logs show that the ParallelOrchestrator is hitting the overall timeout because OpenSearch RAG retrieval is hanging at 4.8 seconds, even though DynamoDB session load and user profile fetch complete in < 200ms. The asyncio.gather call is waiting for the slowest task (RAG) to complete before proceeding to Bedrock invocation. Before the overall timeout fires, the parallel execution phase balloons from the expected 650ms to more than 5000ms.
Detection
```mermaid
flowchart TD
    ALARM[CloudWatch Alarm<br/>parallel_rag_latency_ms p95 > 3000ms] --> LOGS[Check orchestrator logs]
    LOGS --> PATTERN{Which parallel<br/>task is slow?}
    PATTERN -->|RAG only| RAG_ISSUE[OpenSearch issue<br/>→ Go to Root Cause A]
    PATTERN -->|Multiple tasks| NETWORK[Network / ECS issue<br/>→ Go to Root Cause B]
    PATTERN -->|DynamoDB slow too| INFRA[General infra issue<br/>→ Check ECS + VPC]
```
Key log pattern to search for:
```
WARN | Parallel task 'rag' failed | error=Task 'rag' exceeded 2.0s timeout
INFO | Parallel execution complete | timings={"rag": 2.001, "session": 0.18, "profile": 0.14, "cache": 0.04}
```
Root Cause Analysis
Root Cause A — OpenSearch Serverless OCU Throttling
Diagnosis: OpenSearch Serverless has auto-scaled down to minimum OCUs (2) during the low-traffic afternoon, and the evening traffic spike arrives before OCUs scale back up. KNN vector search queries queue behind each other, causing latency to climb from 600ms to 4800ms.
```mermaid
flowchart TD
    CHECK_OS{OpenSearch<br/>SearchLatency p95<br/>> 2000ms?}
    CHECK_OS -->|Yes| CHECK_OCU{OpenSearch OCU<br/>utilization > 80%?}
    CHECK_OCU -->|Yes| ROOT_A[Root Cause A confirmed:<br/>OCU capacity insufficient<br/>for traffic spike]
    CHECK_OCU -->|No| CHECK_INDEX{Index health<br/>status?}
    CHECK_INDEX -->|Yellow/Red| ROOT_A2[Index shard issue —<br/>rebalancing needed]
    CHECK_INDEX -->|Green| ROOT_A3[Query complexity issue —<br/>check KNN parameters]
    CHECK_OS -->|No| CHECK_NET[Check VPC / ENI →]
```
Resolution:
- Immediate (5 min): The `ParallelOrchestrator` should already be returning partial results (session + profile) without RAG context. Verify the degraded-mode path is working:

  ```
  # Verify graceful degradation in logs:
  # "Proceeding without RAG context due to timeout"
  # Bedrock still generates a response using session history only
  ```

- Short-term (30 min): Increase OpenSearch Serverless minimum OCUs:

  ```bash
  aws opensearchserverless update-collection \
    --id manga-products \
    --description "Increase min OCUs for peak"
  # Update capacity policy: min search OCUs 2 → 6
  ```

- Medium-term: Implement pre-warming for OpenSearch — run a set of representative queries at 18:00 JST (before peak) to trigger OCU scale-up.
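The degraded-mode path above depends on per-task deadlines inside the fan-out, so a slow task yields a partial result instead of stalling the whole `asyncio.gather`. A minimal sketch of that pattern follows; the helper names and timeout budgets are illustrative, and the real `ParallelOrchestrator._with_timeout()` is only referenced by name in this runbook.

```python
import asyncio

async def with_timeout(name, coro, timeout_s):
    """Run one parallel task with its own deadline; a timeout yields None
    instead of blocking the other tasks in the gather."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        print(f"WARN | Parallel task '{name}' exceeded {timeout_s}s timeout")
        return None

async def fan_out(rag, session, profile, rag_budget=2.0, aux_budget=1.0):
    # Each dependency gets its own budget; a hanging RAG call degrades to
    # "no retrieved context" rather than blocking the Bedrock invocation.
    rag_ctx, sess, prof = await asyncio.gather(
        with_timeout("rag", rag(), rag_budget),
        with_timeout("session", session(), aux_budget),
        with_timeout("profile", profile(), aux_budget),
    )
    return {"rag": rag_ctx, "session": sess, "profile": prof}
```

The caller then checks which values are `None` and builds the Bedrock prompt from whatever context survived, matching the "Proceeding without RAG context due to timeout" log line.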
Root Cause B — VPC DNS Resolution Failure
Diagnosis: The ECS tasks in private subnets are intermittently failing to resolve the OpenSearch Serverless endpoint. DNS queries to the VPC DNS resolver are timing out, causing the HTTP connection to OpenSearch to hang until the TCP timeout.
Resolution:
- Immediate: Restart affected ECS tasks to refresh DNS cache.
- Short-term: Add DNS caching at the application level with a 60-second TTL.
- Medium-term: Configure Route 53 Resolver with DNS firewall rules and monitoring.
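The application-level DNS cache from the short-term fix can be sketched as a TTL-stamped lookup table. The class and its defaults are illustrative assumptions; `socket.gethostbyname` stands in for whatever resolver the HTTP client actually uses.

```python
import socket
import time

class DnsCache:
    """60-second application-level DNS cache sketch (hypothetical class;
    the TTL value comes from the runbook's short-term fix)."""

    def __init__(self, ttl_s=60.0, resolver=socket.gethostbyname):
        self._ttl = ttl_s
        self._resolver = resolver
        self._entries = {}  # host -> (address, resolved_at)

    def resolve(self, host):
        hit = self._entries.get(host)
        if hit and time.monotonic() - hit[1] < self._ttl:
            return hit[0]            # fresh cached answer, no resolver hit
        addr = self._resolver(host)  # fall back to the real resolver
        self._entries[host] = (addr, time.monotonic())
        return addr
```

With this in place, an intermittently failing VPC resolver only affects at most one request per host per TTL window instead of every OpenSearch call.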
Decision Tree — On-Call Runbook
```mermaid
flowchart TD
    START[RAG timeout alert fires] --> CHECK1{Other parallel tasks<br/>also slow?}
    CHECK1 -->|No, only RAG| CHECK2{OpenSearch<br/>SearchLatency > 2s?}
    CHECK1 -->|Yes, multiple| CHECK_ECS{ECS task<br/>CPU > 90%?}
    CHECK2 -->|Yes| CHECK3{OCU utilization<br/>> 80%?}
    CHECK2 -->|No| CHECK_DNS{DNS resolution<br/>errors in logs?}
    CHECK3 -->|Yes| ACTION1[Increase min OCUs<br/>+ trigger pre-warm queries]
    CHECK3 -->|No| CHECK4{KNN query<br/>ef_search changed?}
    CHECK4 -->|Yes| ACTION2[Revert ef_search<br/>to previous value]
    CHECK4 -->|No| ACTION3[Check index health<br/>+ shard distribution]
    CHECK_DNS -->|Yes| ACTION4[Restart ECS tasks<br/>+ add DNS cache]
    CHECK_DNS -->|No| ACTION5[Check VPC endpoints<br/>+ security groups]
    CHECK_ECS -->|Yes| ACTION6[Scale out ECS<br/>+ check for memory leak]
    CHECK_ECS -->|No| ACTION7[Check VPC networking<br/>+ NAT gateway throughput]
    style ACTION1 fill:#2ecc71,color:#000
    style ACTION2 fill:#2ecc71,color:#000
    style ACTION4 fill:#2ecc71,color:#000
    style ACTION6 fill:#f39c12,color:#000
```
Prevention
| Prevention Measure | Owner | Implementation |
|---|---|---|
| OpenSearch pre-warming queries at 18:00 JST | Platform | EventBridge → Lambda runs 20 representative KNN queries |
| Per-task timeout (not just overall timeout) | Backend | Already implemented in ParallelOrchestrator._with_timeout() |
| Graceful degradation without RAG | Backend | Generate response with session context only when RAG times out |
| OpenSearch OCU minimum set to peak-hour baseline | Platform | Capacity policy: min 6 OCUs during 18:00-24:00 JST |
| DNS resolution monitoring | SRE | CloudWatch alarm on Route 53 Resolver NXDOMAIN/SERVFAIL rate |
Scenario 3 — WebSocket Connection Drops During Streaming Response
Problem Statement
Multiple MangaAssist users report seeing partial responses — the chatbot starts answering a manga recommendation query, streams 2-3 sentences, then the response abruptly stops. The client shows "Connection lost. Reconnecting..." and when the connection re-establishes, the partial response is gone. The issue affects ~5% of streaming responses during a 30-minute window on a Saturday evening.
Detection
```mermaid
flowchart TD
    ALARM[CloudWatch Alarm<br/>stream_cancellations > 50/hr] --> CHECK_CLIENT{Client-side error<br/>reports available?}
    CHECK_CLIENT -->|Yes| ANALYZE[Analyze error patterns]
    CHECK_CLIENT -->|No| SERVER_LOGS[Check ECS + API GW logs]
    ANALYZE --> PATTERN{Error pattern?}
    PATTERN -->|WebSocket close code 1006| ABNORMAL[Abnormal closure<br/>→ Network or gateway issue]
    PATTERN -->|WebSocket close code 1001| GOING_AWAY[Going away<br/>→ Server-side shutdown]
    PATTERN -->|No close frame received| TIMEOUT[Connection timeout<br/>→ Idle timeout or NAT issue]
```
Key metrics to check:
| Metric | Location | Normal | Current |
|---|---|---|---|
| `stream_cancellations` | CloudWatch `MangaAssist/Streaming` | < 10/hr | 67/hr |
| API Gateway `ClientError` (4xx) | CloudWatch `AWS/ApiGateway` | < 5/hr | 42/hr |
| API Gateway `MessageCount` | CloudWatch `AWS/ApiGateway` | ~50K/hr | 48K/hr (normal) |
| ECS task restarts | CloudWatch `AWS/ECS` | 0 | 0 |
| `GoneException` count in logs | CloudWatch Logs | < 5/hr | 55/hr |
Root Cause Analysis
Root Cause A — API Gateway WebSocket Idle Timeout
Diagnosis: The Bedrock streaming response for complex recommendation queries takes 2.5-3 seconds. During this time, the chunk batching (batch size = 3) means the first WebSocket frame is not sent until ~3 chunks accumulate (~900ms after first token). If the total time between the initial status message and the first chunk frame exceeds the API Gateway idle connection timeout (10 minutes by default — but the client-side proxy has a 30-second idle timeout), the connection is deemed idle.
The actual issue: a corporate proxy between some JP users and CloudFront has a 15-second idle timeout. The parallel fan-out phase (650ms) + Bedrock first-token wait (350ms) = 1000ms with no WebSocket frame sent. This is within normal bounds, but the corporate proxy is aggressively closing connections it considers idle.
```mermaid
flowchart TD
    CHECK_PATTERN{Affected users<br/>share network<br/>characteristics?}
    CHECK_PATTERN -->|Yes - same ISP/corp| PROXY[Corporate proxy<br/>idle timeout]
    CHECK_PATTERN -->|No - random| CHECK_GW{API Gateway<br/>Connection Duration<br/>metric available?}
    CHECK_GW -->|Connections dying at<br/>consistent duration| TIMEOUT_CFG[Timeout configuration<br/>issue]
    CHECK_GW -->|Random durations| CHECK_ECS_HEALTH{ECS tasks<br/>healthy?}
    CHECK_ECS_HEALTH -->|Yes| NETWORK[Intermittent network<br/>issue — check VPC NAT]
    CHECK_ECS_HEALTH -->|No — restarts| DEPLOYMENT[Rolling deployment<br/>draining connections]
```
Resolution:
- Immediate (10 min): Reduce chunk batch size from 3 to 1 so the first chunk is sent immediately after the first token:

  ```
  # Environment variable update on ECS service
  # CHUNK_BATCH_SIZE: "3" → "1"
  # Trade-off: more WebSocket frames, but no idle gaps
  ```

- Short-term (1 hour): Add periodic heartbeat frames during the parallel fan-out phase:

  ```python
  async def stream_with_heartbeat(self, connection_id, ...):
      """Send heartbeat pings during long processing phases."""
      heartbeat_task = asyncio.create_task(
          self._heartbeat_loop(connection_id, interval=5.0)
      )
      try:
          result = await self.stream_query(...)
      finally:
          heartbeat_task.cancel()

  async def _heartbeat_loop(self, connection_id, interval):
      """Send invisible heartbeat to keep connection alive."""
      while True:
          await asyncio.sleep(interval)
          await self.ws_send(connection_id, {
              "type": "heartbeat",
              "timestamp": time.time(),
          })
  ```

- Medium-term: Implement client-side reconnection with response resumption:

  ```
  Client sends: {"action":"resume","session_id":"abc","last_chunk_idx":7}
  Server sends: Remaining chunks from idx 8 onward (buffered in Redis)
  ```
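The server side of this resume protocol can be sketched with a per-session chunk buffer. Here an in-memory dict stands in for the ElastiCache Redis buffer described in Prevention (last 60s of chunks, TTL 120s); the class and method names are illustrative.

```python
class ChunkBuffer:
    """Per-session chunk buffer sketch for reconnect-with-resume
    (hypothetical; production would back this with Redis and a TTL)."""

    def __init__(self):
        self._store = {}  # session_id -> list of chunks in send order

    def append(self, session_id, chunk):
        # Buffer each chunk as it is streamed; return the index the
        # client should acknowledge as last_chunk_idx.
        chunks = self._store.setdefault(session_id, [])
        chunks.append(chunk)
        return len(chunks) - 1

    def resume_from(self, session_id, last_chunk_idx):
        # Client reconnects with {"action":"resume","last_chunk_idx":N};
        # replay everything after that index.
        return self._store.get(session_id, [])[last_chunk_idx + 1:]
```

Because chunks are indexed in send order, a client that saw chunks 0-7 before the drop asks for `last_chunk_idx: 7` and receives exactly the tail it missed, so no partial response is lost.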
Prevention
| Prevention Measure | Owner | Implementation |
|---|---|---|
| WebSocket heartbeat every 5 seconds during processing | Backend | Heartbeat task runs parallel to stream processing |
| Reduce chunk batch size to 1 for first 3 chunks | Backend | Prioritize first-frame delivery, then batch normally |
| Client-side reconnection with resume | Frontend | Store partial response + last chunk index, request resume |
| Server-side response buffering in Redis | Backend | Buffer last 60s of chunks per connection_id, TTL 120s |
| Monitor WebSocket close code distribution | SRE | CloudWatch alarm on close code 1006 rate > 1% |
Scenario 4 — Pre-Computation Generates Stale Recommendations for Out-of-Stock Manga
Problem Statement
A customer asks "What shonen manga should I read?" and MangaAssist responds with a recommendation list that includes "Demon Slayer Volume 23 — Limited Edition" as the top recommendation. The customer clicks through to buy it, but the product page shows "Out of Stock." The recommendation was pre-computed 6 hours ago by the nightly batch job, and the limited edition sold out 2 hours ago during a flash sale. The pre-computed response in Redis has a 24-hour TTL and does not reflect the inventory change.
Detection
```mermaid
flowchart TD
    SIGNAL1[Customer complaint:<br/>"Recommended manga is out of stock"] --> INVESTIGATE
    SIGNAL2[CloudWatch: cart_abandonment_rate<br/>spike after recommendation click] --> INVESTIGATE
    SIGNAL3[Business metric: recommendation_to_purchase<br/>conversion rate dropped 40%] --> INVESTIGATE
    INVESTIGATE --> CHECK{Is the recommended<br/>product in stock?}
    CHECK -->|No| STALE[Stale pre-computation<br/>→ Investigate cache freshness]
    CHECK -->|Yes - different issue| OTHER[Product page issue<br/>→ Check catalog service]
```
Key data points:
| Data Point | Expected | Actual |
|---|---|---|
| Pre-computation last run | < 24 hours ago | 6 hours ago (on schedule) |
| Redis cache TTL for genre recommendations | 24 hours | 24 hours (as configured) |
| Demon Slayer Vol 23 LE stock status | In stock (at pre-compute time) | Out of stock (sold out 2 hours ago) |
| DynamoDB inventory update timestamp | N/A | 4 hours ago (flash sale depleted stock) |
| Cache invalidation trigger for inventory changes | Should exist | Does not exist |
Root Cause Analysis
```mermaid
flowchart TD
    ROOT[Pre-computed recommendation<br/>is stale] --> WHY1{Does inventory change<br/>trigger cache invalidation?}
    WHY1 -->|No| MISSING[Missing invalidation path:<br/>DynamoDB Stream → Lambda → Redis]
    WHY1 -->|Yes but failed| CHECK_STREAM{DynamoDB Stream<br/>events flowing?}
    MISSING --> WHY2{Why was this not<br/>built?}
    WHY2 --> GAP[Pre-computation pipeline was designed<br/>for catalog data, not inventory state.<br/>Inventory was assumed stable between runs.]
    CHECK_STREAM -->|Events present| CHECK_LAMBDA{Invalidation Lambda<br/>errors?}
    CHECK_STREAM -->|No events| STREAM_DISABLED[DynamoDB Stream<br/>not enabled on inventory table]
    CHECK_LAMBDA -->|Errors| LAMBDA_BUG[Lambda bug —<br/>check error logs]
    CHECK_LAMBDA -->|No errors| CACHE_KEY[Cache key mismatch —<br/>Lambda invalidating wrong key]
```
Root cause confirmed: The pre-computation pipeline invalidates cached responses when catalog data changes (new titles, updated descriptions), but it does not invalidate when inventory status changes. The pipeline was designed assuming that stock levels change slowly, but flash sales create rapid inventory depletion that the 24-hour TTL cannot accommodate.
Resolution:
- Immediate (15 min): Manually invalidate the stale genre recommendation caches:

  ```bash
  # Connect to ElastiCache Redis and delete stale keys
  redis-cli -h mangaassist-cache.xxxxx.ng.0001.apne1.cache.amazonaws.com
  > KEYS "precompute:genre:shonen:*"
  > DEL "precompute:genre:shonen:recommendations"
  > DEL "precompute:genre:shonen:top_picks"
  ```

- Short-term (2 hours): Add inventory-aware validation at response time — before serving a pre-computed recommendation, check stock status for included products:

  ```python
  async def validate_precomputed_response(
      self, cached_response: dict, product_ids: list[str]
  ) -> dict:
      """
      Validate pre-computed recommendations against live inventory.
      Replace out-of-stock items with alternatives.
      """
      # Batch check inventory status (single DynamoDB BatchGetItem)
      stock_status = await self._batch_check_inventory(product_ids)
      out_of_stock = [
          pid for pid, status in stock_status.items()
          if status["quantity"] == 0
      ]
      if not out_of_stock:
          return cached_response  # All items in stock, serve as-is

      # Filter out unavailable items and note the gap
      filtered = {
          "recommendations": [
              rec for rec in cached_response["recommendations"]
              if rec["product_id"] not in out_of_stock
          ],
          "stale_items_removed": len(out_of_stock),
      }
      # If too many items were removed, fall through to live generation
      if len(filtered["recommendations"]) < 2:
          return None  # Trigger fresh LLM generation
      return filtered
  ```

- Medium-term (1 week): Add DynamoDB Streams-based cache invalidation for inventory changes:

  ```mermaid
  graph LR
      DDB[DynamoDB<br/>manga-inventory] -->|Stream| LAMBDA[Lambda:<br/>Inventory Invalidator]
      LAMBDA -->|Check| REDIS_KEYS[Find cached responses<br/>containing this product]
      REDIS_KEYS -->|Invalidate| REDIS[ElastiCache Redis<br/>Delete affected keys]
      LAMBDA -->|Log| CW[CloudWatch<br/>invalidation_count metric]
  ```
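The Streams-to-Redis pipeline can be reduced to a small handler. This is an illustrative sketch, not the production Lambda: the record shape follows the standard DynamoDB Streams event format, but the cache key naming scheme and the `delete_keys` callable (standing in for a Redis `DEL`) are assumptions.

```python
def handle_inventory_stream(event, delete_keys):
    """Invalidate pre-computed recommendation caches when a product's
    stock hits zero. `delete_keys` abstracts the Redis DEL call so the
    logic stays testable without a live cache."""
    invalidated = 0
    for record in event.get("Records", []):
        if record.get("eventName") != "MODIFY":
            continue  # only stock updates matter here
        new_image = record["dynamodb"].get("NewImage", {})
        if new_image.get("quantity", {}).get("N") != "0":
            continue  # still in stock, nothing to invalidate
        genre = new_image.get("genre", {}).get("S", "unknown")
        # Delete every pre-computed response that can include this product
        # (assumed key scheme, matching the manual redis-cli cleanup above).
        delete_keys([
            f"precompute:genre:{genre}:recommendations",
            f"precompute:genre:{genre}:top_picks",
        ])
        invalidated += 1
    return invalidated
```

Emitting `invalidated` as the CloudWatch `invalidation_count` metric (per the diagram) makes a silent failure of this path observable, which is exactly the gap that caused this scenario.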
Prevention
| Prevention Measure | Owner | Implementation |
|---|---|---|
| Inventory-aware validation on pre-computed responses | Backend | BatchGetItem stock check before serving cached recommendations |
| DynamoDB Streams → Lambda → Redis invalidation | Platform | Stream from inventory table triggers selective cache invalidation |
| Reduce TTL for recommendation caches to 4 hours | Platform | Redis TTL 24h → 4h for genre/category recommendations |
| Flash sale mode: disable pre-computed recommendations | Business | Feature flag disables pre-compute cache during flash sale events |
| Monitor recommendation-to-purchase conversion rate | Analytics | CloudWatch alarm when conversion drops > 20% from 7-day average |
Scenario 5 — Latency Regression After Prompt Template Update
Problem Statement
On Wednesday at 14:00 JST, the ML engineering team deploys a new system prompt template for the "recommendation" intent. The updated prompt adds detailed instructions for multi-criteria scoring (art style, story complexity, target demographic) to improve recommendation quality. The new prompt is 340 tokens longer than the previous version (from 180 tokens to 520 tokens). By 15:00 JST, the automated benchmark framework detects a statistically significant latency regression: recommendation intent p95 has increased from 2400ms to 3200ms (33% increase), crossing the 3000ms SLA target.
Detection
```mermaid
flowchart TD
    BENCHMARK[Automated Benchmark Run<br/>Every 15 min] --> STATS{Mann-Whitney U Test<br/>p-value < 0.05?}
    STATS -->|p = 0.003| SIGNIFICANT[Statistically significant<br/>regression detected]
    SIGNIFICANT --> SEVERITY{Regression<br/>magnitude?}
    SEVERITY -->|33% increase<br/>> 25% threshold| SEV2[SEV-2 Alert<br/>PagerDuty notification]
    SEV2 --> CORRELATE[Correlate with<br/>recent changes]
    CORRELATE --> DEPLOY_LOG{Any deployments<br/>in last 2 hours?}
    DEPLOY_LOG -->|Yes: Prompt template<br/>update at 14:00| SUSPECT[Primary suspect:<br/>prompt template change]
```
Benchmark comparison data:
| Metric | Before (13:45 run) | After (15:00 run) | Change | Significance |
|---|---|---|---|---|
| Recommendation p50 | 1180ms | 1620ms | +37% | p < 0.001 |
| Recommendation p95 | 2410ms | 3210ms | +33% | p = 0.003 |
| Recommendation p99 | 2890ms | 3780ms | +31% | p = 0.008 |
| First-token p95 | 1050ms | 1380ms | +31% | p = 0.005 |
| Recommendation tokens/response | 285 avg | 410 avg | +44% | p < 0.001 |
| Other intents p95 | All within target | All within target | No change | N/A |
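The regression check itself can be sketched from scratch using the normal approximation to the Mann-Whitney U statistic, with midranks for ties. The function name and the one-sided alpha default are illustrative; the production benchmark framework is assumed, not shown.

```python
import math

def latency_regressed(baseline, candidate, alpha=0.05):
    """One-sided Mann-Whitney U test (normal approximation): returns True
    when candidate latencies are statistically significantly larger than
    baseline latencies."""
    n1, n2 = len(baseline), len(candidate)
    pooled = sorted([(v, 0) for v in baseline] + [(v, 1) for v in candidate])
    # Assign midranks so tied values share the average of their ranks.
    ranks = [0.0] * (n1 + n2)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        mid = (i + j + 1) / 2.0  # average of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = mid
        i = j
    # Rank sum and U statistic for the candidate sample.
    r2 = sum(r for r, (_, grp) in zip(ranks, pooled) if grp == 1)
    u2 = r2 - n2 * (n2 + 1) / 2.0
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u2 - mu) / sigma
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail p-value
    return p < alpha
```

A rank-based test like this is the right choice for latency samples because latency distributions are heavy-tailed; comparing means (or a raw p95 threshold) would either miss real shifts or fire on noise.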
Root Cause Analysis
```mermaid
flowchart TD
    SUSPECT[Prompt template update<br/>suspected] --> CHECK1{Only recommendation<br/>intent affected?}
    CHECK1 -->|Yes| CHECK2{Prompt token count<br/>increased?}
    CHECK2 -->|Yes: 180 → 520 tokens| CHECK3{Response tokens<br/>also increased?}
    CHECK3 -->|Yes: 285 → 410 avg| ROOT[Root cause confirmed:<br/>Longer prompt + longer outputs]
    ROOT --> IMPACT[Two latency impacts:]
    IMPACT --> IMPACT1["1. Input processing: +340 tokens<br/>→ +120ms first-token latency"]
    IMPACT --> IMPACT2["2. Output generation: +125 tokens avg<br/>→ +500ms streaming duration"]
    CHECK1 -->|No, all intents| DIFFERENT[Different root cause —<br/>check infrastructure]
    CHECK2 -->|No, same size| DIFFERENT2[Not prompt-related —<br/>check model routing]
    CHECK3 -->|No, same output| PROMPT_QUALITY[Prompt is slower to process<br/>→ Check prompt complexity]
```
Root cause breakdown:
| Factor | Before | After | Latency Impact |
|---|---|---|---|
| System prompt tokens | 180 | 520 | +120ms (input processing) |
| Avg output tokens | 285 | 410 | +500ms (generation time) |
| Prompt complexity (nested instructions) | Low | High (multi-criteria) | +60ms (reasoning overhead) |
| Total latency increase | — | — | +680ms |
The longer system prompt causes two latency effects:
1. More input tokens = more time for the model to process the prompt before generating the first token
2. More detailed instructions = the model generates longer, more structured responses, increasing streaming duration
Resolution
- Immediate — Assess rollback need (15 min):

  ```mermaid
  flowchart TD
      ASSESS{Is the quality improvement<br/>worth the latency cost?}
      ASSESS -->|No — quality gain marginal| ROLLBACK[Roll back to previous<br/>prompt version immediately]
      ASSESS -->|Yes — quality significantly better| OPTIMIZE[Keep new prompt,<br/>optimize for latency]
      ASSESS -->|Unsure — need data| ABTEST[Run A/B test:<br/>old prompt vs new prompt]
  ```
- If rolling back (immediate fix):

  ```bash
  # Revert prompt version in DynamoDB: deactivate v2.1, reactivate v2.0
  aws dynamodb update-item \
    --table-name manga-prompt-templates \
    --key '{"intent":{"S":"recommendation"},"version":{"S":"v2.1"}}' \
    --update-expression "SET active = :false" \
    --expression-attribute-values '{":false":{"BOOL":false}}'

  aws dynamodb update-item \
    --table-name manga-prompt-templates \
    --key '{"intent":{"S":"recommendation"},"version":{"S":"v2.0"}}' \
    --update-expression "SET active = :true" \
    --expression-attribute-values '{":true":{"BOOL":true}}'
  ```
- If optimizing the new prompt (1-2 days):

| Optimization | Technique | Token Reduction | Quality Impact |
|---|---|---|---|
| Compress scoring instructions | Use structured format instead of prose | -120 tokens | Minimal |
| Add `max_tokens: 300` limit | Cap output length | -110 tokens avg | Responses slightly shorter |
| Use Haiku for scoring, Sonnet for prose | Two-stage: Haiku scores, Sonnet narrates | -200ms first-token | Minimal (scoring is mechanical) |
| Move static criteria to pre-computed context | Load scoring criteria from cache, not prompt | -180 tokens from prompt | None |
| Combined | — | -300 tokens, -180ms | Negligible |
- Optimized prompt structure:

  ```
  Before (520 tokens — prose):
  "When recommending manga, evaluate each title across multiple dimensions.
  First, consider the art style and how it compares to what the customer
  enjoys. Then evaluate story complexity..."

  After (320 tokens — structured):
  "Score each recommendation on: art_style(1-5), story_complexity(1-5),
  demographic_match(1-5). Format: Title | Scores | 2-sentence reason.
  Max 4 recommendations. Prioritize by total score."
  ```
Prevention
| Prevention Measure | Owner | Implementation |
|---|---|---|
| Prompt change requires latency impact assessment | ML Eng | Pre-deployment: run benchmark with new prompt in staging |
| Token budget per intent (prompt + response) | ML Eng | recommendation intent: max 520 input + 350 output tokens |
| Automated benchmark gate in CI/CD | Platform | Deployment blocked if p95 regression > 15% |
| Prompt compression review checklist | ML Eng | Every prompt update reviewed for token efficiency |
| A/B testing framework for prompt changes | ML Eng | Shadow-test new prompts against production baseline for 24h |
| Auto-rollback on benchmark regression | SRE | If SEV-1 regression detected within 1 hour of deploy, auto-revert |
Prompt Change Deployment Safeguard Flow
```mermaid
flowchart TD
    DEV[ML Engineer writes<br/>new prompt template] --> STAGE_TEST[Deploy to staging<br/>+ run benchmark suite]
    STAGE_TEST --> GATE{p95 within<br/>intent budget?}
    GATE -->|Yes| SHADOW[Shadow deployment:<br/>run new prompt on 5% traffic]
    GATE -->|No| OPTIMIZE[Optimize prompt<br/>before proceeding]
    OPTIMIZE --> STAGE_TEST
    SHADOW --> COMPARE{Quality improved?<br/>Latency acceptable?}
    COMPARE -->|Both pass| CANARY[Canary: 25% traffic<br/>for 2 hours]
    COMPARE -->|Quality pass,<br/>latency fail| OPTIMIZE
    COMPARE -->|Quality fail| REJECT[Reject prompt change]
    CANARY --> FULL{All metrics<br/>within targets?}
    FULL -->|Yes| DEPLOY[Full production<br/>deployment]
    FULL -->|No| ROLLBACK[Auto-rollback<br/>to previous version]
    style DEPLOY fill:#2ecc71,color:#000
    style ROLLBACK fill:#e74c3c,color:#fff
    style REJECT fill:#e74c3c,color:#fff
```
Cross-Scenario Summary
Common Patterns Across All 5 Scenarios
| Pattern | Scenarios | Key Lesson |
|---|---|---|
| Graceful degradation is essential | 1, 2 | System must produce a response even with partial data (no RAG, no cache, wrong model) |
| Pre-computation needs invalidation | 4 | Every pre-computed cache must have an invalidation path tied to data freshness |
| Latency budgets must be per-intent | 1, 5 | A global "< 3s" target is insufficient — each intent has different acceptable latency |
| Statistical testing prevents false alarms | 5 | Use Mann-Whitney U test (not simple threshold comparison) for regression detection |
| Connection management is critical for streaming | 1, 3 | Warm pools, heartbeats, and reconnection with resume handle network volatility |
Scenario-to-Technique Mapping
| Scenario | Pre-Computation | Model Selection | Parallel Processing | Streaming | Benchmarking |
|---|---|---|---|---|---|
| 1 — First-token spike | | Primary fix (route to Haiku) | | Root cause (cold connections) | Detection method |
| 2 — RAG timeout | | | Primary fix (per-task timeout + degradation) | | |
| 3 — WebSocket drop | | | | Primary fix (heartbeat + resume) | |
| 4 — Stale recommendations | Primary fix (invalidation pipeline) | | | | |
| 5 — Prompt regression | | | | | Primary fix (automated gate + rollback) |
Key Exam Takeaways
- Responsive AI is not just fast models — it spans connection management, parallel orchestration, streaming delivery, pre-computation freshness, and continuous benchmarking
- Every pre-computation strategy needs an invalidation strategy — without it, you trade latency for stale or incorrect responses
- Parallel orchestration must handle partial failures gracefully — use `asyncio.gather` with per-task timeouts and degraded-mode fallbacks
- Streaming connections need active keep-alive — WebSocket idle timeouts from proxies and gateways can kill streaming responses silently
- Prompt changes are latency changes — longer prompts mean more input processing time AND often longer outputs; gate prompt deployments with automated latency benchmarks