Scenarios and Runbooks — Intelligent Model Routing
MangaAssist context: JP Manga store chatbot on AWS — Bedrock Claude 3 (Sonnet at $3/$15 per 1M tokens input/output, Haiku at $0.25/$1.25), OpenSearch Serverless (vector store), DynamoDB (sessions/products), ECS Fargate (orchestrator), API Gateway WebSocket, ElastiCache Redis. Target: useful answer in under 3 seconds, 1M messages/day scale.
Skill Mapping
| Field | Value |
|---|---|
| Certification | AWS AI Practitioner (AIP-C01) |
| Domain | 2 — Implementation and Integration of Foundation Models |
| Task | 2.4 — Design model deployment and inference strategies |
| Skill | 2.4.4 — Develop intelligent model routing systems to optimize model selection |
| Focus | Scenario-based troubleshooting for static routing, dynamic routing, metric-based selection, API Gateway transformation, and shadow routing |
Scenario Format Guide
Each scenario follows this structure:
| Section | Purpose |
|---|---|
| Situation | What went wrong and how it was detected |
| Symptom Timeline | Chronological sequence of observable events |
| Root Cause Analysis | Deep technical explanation of the failure |
| Architecture Impact Diagram | Mermaid diagram showing affected components |
| Blast Radius | What else broke or degraded as a consequence |
| Resolution Runbook | Step-by-step fix with commands and code |
| Verification | How to confirm the fix worked |
| Prevention | Long-term measures to prevent recurrence |
| Key Takeaway | One-sentence lesson for the exam/interview |
Scenario 1: Dynamic Routing Sending All Complex Queries to Sonnet — Cost Spike
Situation
The MangaAssist operations team receives a PagerDuty alert at 14:22 JST: daily Bedrock spend has crossed the 150% budget threshold with 8 hours remaining in the billing day. Investigation reveals that the DynamicModelRouter is routing 45% of all queries to Claude 3 Sonnet instead of the expected 10-15%. The complexity scorer threshold was inadvertently lowered during a configuration update, causing moderate-complexity queries (product comparisons, simple recommendations) to exceed the Sonnet threshold.
Symptom Timeline
14:00 JST — Daily cost tracker shows $2,847 spend (expected: $1,900 by this hour)
14:05 JST — CloudWatch alarm: Sonnet invocations/minute = 312 (baseline: 70)
14:10 JST — Haiku invocations/minute dropped from 620 to 380
14:15 JST — Cost anomaly detector fires (150% of projected daily budget)
14:22 JST — PagerDuty alert to on-call engineer
14:25 JST — Investigation begins
14:35 JST — Root cause identified: complexity threshold = 4.0 (should be 6.5)
14:40 JST — Threshold corrected, cache invalidated
14:50 JST — Sonnet routing rate returns to normal (11%)
15:00 JST — Cost burn rate normalized
Root Cause Analysis
The ComplexityScorer uses a configurable sonnet_threshold parameter stored in the DynamoDB routing configuration table. During a routine configuration update at 09:00 JST, an engineer updated the threshold from 6.5 to 4.0 intending to test a new complexity algorithm in a staging environment. The update was accidentally applied to the production DynamoDB table instead of the staging table because both tables have similar names (MangaAssist-RoutingConfig vs MangaAssist-RoutingConfig-staging).
With the threshold at 4.0, any query scoring above "simple" complexity was routed to Sonnet. Product comparison queries ("Which is better, One Piece or Naruto?") that score around 5.0-6.0 were now hitting Sonnet at $3/$15 per 1M tokens instead of Haiku at $0.25/$1.25 — a 12x cost increase per query.
Contributing factors:
- No environment prefix validation on DynamoDB table names
- No approval gate for threshold changes below 5.0
- Configuration cache TTL of 5 minutes meant all ECS tasks picked up the change within minutes
- No automated rollback trigger on cost anomaly detection
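The failure mode is easy to see in code. Below is a minimal sketch of the threshold comparison at the heart of the incident; the function and constant names are illustrative, not MangaAssist's actual implementation:

```python
SONNET = "anthropic.claude-3-sonnet-20240229-v1:0"
HAIKU = "anthropic.claude-3-haiku-20240307-v1:0"

def select_model(complexity_score: float, sonnet_threshold: float) -> str:
    """Route to Sonnet only when the query's complexity score clears the threshold."""
    return SONNET if complexity_score >= sonnet_threshold else HAIKU

# A moderate product-comparison query scoring 5.5:
assert select_model(5.5, sonnet_threshold=6.5) == HAIKU   # correct config
assert select_model(5.5, sonnet_threshold=4.0) == SONNET  # misconfigured: 12x cost
```

A single scalar flips every query in the 4.0-6.5 band from the cheap model to the expensive one, which is why the threshold deserves guard rails.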
Architecture Impact Diagram
flowchart TB
subgraph Problem["Problem Chain"]
CONFIG[DynamoDB Config Update<br/>threshold: 6.5 → 4.0] -->|5-min cache TTL| CACHE[Redis Cache Refresh]
CACHE --> SCORER[ComplexityScorer<br/>Now: score >= 4.0 → Sonnet]
SCORER -->|Moderate queries routed up| SONNET[Sonnet Invocations<br/>70/min → 312/min]
SONNET --> COST[Cost Spike<br/>$1,900 → $4,200 projected]
end
subgraph Cascade["Cascade Effects"]
SONNET -->|Higher latency| LATENCY[P95 Latency<br/>1.2s → 2.8s]
SONNET -->|Throttling risk| THROTTLE[Bedrock Throttle Rate<br/>0.1% → 3.2%]
LATENCY -->|Timeout increase| UX[User Experience<br/>Slower responses]
end
style CONFIG fill:#f44336,color:#fff
style SONNET fill:#FF9800,color:#fff
style COST fill:#f44336,color:#fff
style LATENCY fill:#FF9800,color:#fff
Blast Radius
| Component | Impact | Severity |
|---|---|---|
| Daily budget | 150% overspend projected | HIGH |
| Sonnet latency | P95 rose from 1.2s to 2.8s due to increased load | MEDIUM |
| Bedrock throttling | Sonnet throttle rate spiked to 3.2% | MEDIUM |
| User experience | Slower responses for moderate queries (no quality gain) | LOW |
| Haiku utilization | Under-utilized — wasted provisioned throughput | LOW |
Resolution Runbook
Step 1: Immediate — Restore correct threshold (< 5 minutes)
# Verify current production threshold value
aws dynamodb get-item \
--table-name MangaAssist-RoutingConfig \
--key '{"PK": {"S": "GLOBAL#config"}, "SK": {"S": "THRESHOLD#complexity"}}' \
--region ap-northeast-1 \
--query 'Item.sonnet_threshold.N'
# Restore correct threshold
aws dynamodb update-item \
--table-name MangaAssist-RoutingConfig \
--key '{"PK": {"S": "GLOBAL#config"}, "SK": {"S": "THRESHOLD#complexity"}}' \
--update-expression "SET sonnet_threshold = :val, updated_at = :ts, updated_by = :who" \
--expression-attribute-values '{
":val": {"N": "6.5"},
":ts": {"S": "2026-03-31T14:35:00Z"},
":who": {"S": "oncall-engineer@mangaassist.jp"}
}' \
--region ap-northeast-1
Step 2: Force cache invalidation across all ECS tasks
# Invalidate Redis cache for the threshold config
redis-cli -h mangaassist-redis.xxxxx.apne1.cache.amazonaws.com \
DEL "config:complexity_threshold"
# Publish cache invalidation event to all ECS tasks
redis-cli -h mangaassist-redis.xxxxx.apne1.cache.amazonaws.com \
PUBLISH "routing:cache_invalidation" '{"key": "complexity_threshold", "action": "refresh"}'
Step 3: Verify routing distribution has normalized
# Check Sonnet vs Haiku invocation rates (last 10 minutes)
aws cloudwatch get-metric-statistics \
--namespace MangaAssist/Routing \
--metric-name InvocationCount \
--dimensions Name=ModelId,Value=anthropic.claude-3-sonnet-20240229-v1:0 \
--start-time $(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 60 \
--statistics Sum \
--region ap-northeast-1
Step 4: Review cost impact and document incident
# Get today's Bedrock cost breakdown
aws ce get-cost-and-usage \
--time-period Start=$(date -u +%Y-%m-%d),End=$(date -u -d 'tomorrow' +%Y-%m-%d) \
--granularity DAILY \
--filter '{"Dimensions": {"Key": "SERVICE", "Values": ["Amazon Bedrock"]}}' \
--metrics BlendedCost
Verification
- Sonnet invocations/minute returns to baseline range (60-80)
- Haiku invocations/minute returns to baseline range (580-650)
- P95 latency drops below 1.5s within 10 minutes
- Cost burn rate returns to projected daily total
- No new throttling events on Sonnet endpoint
Prevention
- Environment-prefixed table names: Enforce the naming convention `MangaAssist-{env}-RoutingConfig` with IAM policies restricting production writes to specific roles
- Threshold guard rails: DynamoDB Streams trigger that rejects `sonnet_threshold` values below 5.0 without an approval token
- Cost circuit breaker: Automated threshold rollback when cost anomaly exceeds 130% of daily projection
- Configuration change approval: Require two-person approval for production routing config changes via a Step Functions approval workflow
- Staging environment isolation: Separate AWS accounts for staging vs production
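The threshold guard rail above can be sketched as a DynamoDB Streams consumer. The record shape follows the standard Streams event format; the `approval_token` attribute name and the revert/paging behavior are assumptions for illustration:

```python
MIN_SAFE_THRESHOLD = 5.0

def threshold_change_allowed(new_image: dict) -> bool:
    """Reject sonnet_threshold values below the floor unless an approval token is attached."""
    threshold_attr = new_image.get("sonnet_threshold", {})
    if "N" not in threshold_attr:
        return True  # not a threshold change
    if float(threshold_attr["N"]) >= MIN_SAFE_THRESHOLD:
        return True
    # Below the floor: require an explicit approval token (attribute name is illustrative)
    return "approval_token" in new_image

def handler(event, context):
    """Lambda entry point for the DynamoDB Stream on the routing config table."""
    for record in event.get("Records", []):
        if record.get("eventName") not in ("MODIFY", "INSERT"):
            continue
        new_image = record["dynamodb"].get("NewImage", {})
        if not threshold_change_allowed(new_image):
            # In production: revert the item and page on-call here
            print(f"BLOCKED: unsafe threshold change: {new_image}")
```

Keeping the decision in a pure function makes the guard unit-testable without mocking DynamoDB.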
Key Takeaway
Dynamic routing thresholds are high-leverage configuration points — a single number change can shift millions of dollars in monthly FM spend. Guard them with validation, approval gates, and automated cost-based circuit breakers.
Scenario 2: Static Routing Table Stale After New Model Deployment
Situation
The MangaAssist team deploys Claude 3.5 Sonnet (anthropic.claude-3-5-sonnet-20241022-v2:0) as an upgrade path for complex queries. The new model is enabled in Bedrock and tested in isolation. However, the static routing table in DynamoDB still references the old Claude 3 Sonnet model ID (anthropic.claude-3-sonnet-20240229-v1:0). The new model never receives production traffic, and the team only discovers this three days later when reviewing an A/B test that shows zero queries reaching the new model.
Symptom Timeline
Day 0, 10:00 JST — Claude 3.5 Sonnet enabled in Bedrock account
Day 0, 10:30 JST — Integration tests pass against new model (direct invocation)
Day 0, 11:00 JST — Deployment marked as "complete" in Jira ticket
Day 0-3 — All production traffic continues to old model (no alerts)
Day 3, 15:00 JST — A/B test report shows 0 queries to new model
Day 3, 15:30 JST — Investigation reveals routing table not updated
Day 3, 16:00 JST — 23 DynamoDB items identified as stale
Day 3, 17:00 JST — Bulk update applied, canary testing started
Root Cause Analysis
The deployment checklist for new model activation covered:
- Enable model access in Bedrock console
- Run integration test suite against new model endpoint
- Update CloudWatch dashboards with new model metrics
- [MISSING] Update DynamoDB routing table entries
- [MISSING] Invalidate Redis routing cache
- [MISSING] Verify production traffic hitting new model
The static routing table has 23 entries that reference the old Sonnet model ID. Since the old model remained active in Bedrock (AWS does not automatically deactivate old model versions), all queries continued to work — just not on the new model. There were no errors, no latency changes, and no quality regressions that would trigger alerts.
Contributing factors:
- No deployment automation linking model enablement to routing table updates
- No "model version freshness" check in the routing pipeline
- No alert on "expected model not receiving traffic"
- Deployment checklist was a manual wiki page, not an automated runbook
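The missing "expected model not receiving traffic" alert reduces to a small scheduled check. Given per-model invocation counts pulled from CloudWatch, flag any routed model with zero traffic; the metric-fetching side is omitted and the counts below are made up for illustration:

```python
def models_without_traffic(expected_models: list[str], invocation_counts: dict[str, int]) -> list[str]:
    """Return routed models that received zero invocations in the observation window."""
    return [m for m in expected_models if invocation_counts.get(m, 0) == 0]

# Example: the new model is routed but never invoked
counts = {
    "anthropic.claude-3-sonnet-20240229-v1:0": 4180,
    "anthropic.claude-3-haiku-20240307-v1:0": 37200,
}
expected = list(counts) + ["anthropic.claude-3-5-sonnet-20241022-v2:0"]
assert models_without_traffic(expected, counts) == ["anthropic.claude-3-5-sonnet-20241022-v2:0"]
```

Run every 15 minutes during business hours, this check would have caught the stale routing table on Day 0 instead of Day 3.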
Architecture Impact Diagram
flowchart TB
subgraph Deployment["Deployment Actions (Completed)"]
ENABLE[Bedrock Model Access<br/>Claude 3.5 Sonnet Enabled]
TEST[Integration Tests<br/>Passed]
DASH[Dashboard Updated]
end
subgraph Missed["Deployment Actions (MISSED)"]
DDB_UPDATE[DynamoDB Route Update<br/>NOT DONE]
CACHE_INV[Redis Cache Invalidation<br/>NOT DONE]
VERIFY[Traffic Verification<br/>NOT DONE]
end
subgraph Production["Production State"]
DDB[DynamoDB Routing Table<br/>Still: claude-3-sonnet-v1:0]
REDIS[Redis Cache<br/>Still: claude-3-sonnet-v1:0]
ROUTER[StaticRouter<br/>Routes to OLD model]
OLD[Claude 3 Sonnet<br/>RECEIVING ALL TRAFFIC]
NEW[Claude 3.5 Sonnet<br/>ZERO TRAFFIC]
end
ENABLE -.->|Should trigger| DDB_UPDATE
DDB_UPDATE -.->|Should trigger| CACHE_INV
CACHE_INV -.->|Should trigger| VERIFY
DDB --> REDIS
REDIS --> ROUTER
ROUTER --> OLD
style DDB_UPDATE fill:#f44336,color:#fff
style CACHE_INV fill:#f44336,color:#fff
style VERIFY fill:#f44336,color:#fff
style OLD fill:#FF9800,color:#fff
style NEW fill:#9E9E9E,color:#fff
Blast Radius
| Component | Impact | Severity |
|---|---|---|
| New model utilization | Zero production traffic for 3 days | HIGH |
| Quality improvement | Users did not benefit from model upgrade | MEDIUM |
| A/B test validity | 3-day A/B test data is useless (no treatment traffic) | MEDIUM |
| Cost | No cost impact (old model pricing unchanged) | NONE |
| User experience | No degradation (old model still functional) | NONE |
Resolution Runbook
Step 1: Identify all stale routing entries
# Scan routing table for entries referencing old model
aws dynamodb scan \
--table-name MangaAssist-RoutingConfig \
--filter-expression "contains(model_id, :old_model)" \
--expression-attribute-values '{":old_model": {"S": "claude-3-sonnet-20240229"}}' \
--projection-expression "PK, SK, model_id, intent_category, sub_intent" \
--region ap-northeast-1
Step 2: Generate bulk update script
"""
Bulk update routing table entries from old model to new model.
Run with: python update_model_routes.py --dry-run first, then without --dry-run.
"""
import sys
import boto3
from datetime import datetime
OLD_MODEL = "anthropic.claude-3-sonnet-20240229-v1:0"
NEW_MODEL = "anthropic.claude-3-5-sonnet-20241022-v2:0"
TABLE_NAME = "MangaAssist-RoutingConfig"
REGION = "ap-northeast-1"
dry_run = "--dry-run" in sys.argv
dynamodb = boto3.resource("dynamodb", region_name=REGION)
table = dynamodb.Table(TABLE_NAME)
# Scan for stale entries (paginated — a single Scan call returns at most 1 MB)
items = []
scan_kwargs = {
    "FilterExpression": "model_id = :old",
    "ExpressionAttributeValues": {":old": OLD_MODEL},
}
while True:
    response = table.scan(**scan_kwargs)
    items.extend(response.get("Items", []))
    if "LastEvaluatedKey" not in response:
        break
    scan_kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]

print(f"Found {len(items)} entries to update")

for item in items:
    pk = item["PK"]
    sk = item["SK"]
    intent = item.get("intent_category", "unknown")
    sub = item.get("sub_intent", "default")
    print(f"  {'[DRY RUN] ' if dry_run else ''}Updating {intent}:{sub} → {NEW_MODEL}")
    if not dry_run:
        table.update_item(
            Key={"PK": pk, "SK": sk},
            UpdateExpression="SET model_id = :new, updated_at = :ts, updated_by = :who",
            ExpressionAttributeValues={
                ":new": NEW_MODEL,
                ":ts": datetime.utcnow().isoformat(),
                ":who": "model-migration-script",
            },
        )

print(f"{'[DRY RUN] ' if dry_run else ''}Updated {len(items)} entries")
Step 3: Invalidate all routing caches
# Flush all route cache keys (KEYS is O(N) and blocks Redis; acceptable for a
# one-off migration, but prefer SCAN-based deletion during normal operations)
redis-cli -h mangaassist-redis.xxxxx.apne1.cache.amazonaws.com \
EVAL "local keys = redis.call('keys', 'route:*'); for i,k in ipairs(keys) do redis.call('del', k) end; return #keys" 0
# Publish global cache invalidation
redis-cli -h mangaassist-redis.xxxxx.apne1.cache.amazonaws.com \
PUBLISH "routing:cache_invalidation" '{"action": "flush_all", "reason": "model_migration"}'
Step 4: Start canary testing before full rollout
# Use RoutingTableManager to stage the new routes with 5% canary traffic
# Monitor for 30 minutes before promoting
python - <<'EOF'
from routing_table_manager import RoutingTableManager

mgr = RoutingTableManager()
for route in mgr.get_active_routes():
    if "claude-3-5-sonnet" in route.model_id:
        mgr.start_test(route.route_id, traffic_pct=5.0, test_duration_minutes=30)
EOF
Verification
- New model appears in CloudWatch metrics with non-zero invocation counts within 10 minutes
- Routing table scan shows zero entries referencing old model ID
- Redis cache contains updated model IDs
- A/B test (if restarted) shows traffic split matching configuration
Prevention
- Model deployment automation: CDK/Terraform pipeline that atomically updates Bedrock model access + DynamoDB routing entries + Redis cache in a single deployment
- Model version health check: Lambda cron (every 15 minutes) that verifies all models in the routing table are receiving traffic. Alert if any model has zero invocations for > 30 minutes during business hours
- Routing table freshness metric: CloudWatch custom metric tracking the age of the oldest routing entry. Alert if any entry is older than the latest model deployment timestamp
- Deployment gate: CI/CD pipeline stage that queries the routing table post-deploy and fails the pipeline if expected model IDs are not present
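The deployment gate in the last bullet can be a short post-deploy script: scan the routing table and fail the pipeline if the expected model ID is absent. A sketch under the table and attribute names used in this document; splitting the scan from the check keeps the decision testable without AWS access:

```python
def routes_reference_model(items: list[dict], expected_model_id: str) -> bool:
    """True if at least one routing entry points at the expected model."""
    return any(item.get("model_id") == expected_model_id for item in items)

def gate(expected_model_id: str) -> int:
    """CI/CD gate: return nonzero when the routing table lacks the new model."""
    import boto3  # deferred so the pure check above stays dependency-free
    table = boto3.resource("dynamodb", region_name="ap-northeast-1").Table("MangaAssist-RoutingConfig")
    items, kwargs = [], {}
    while True:
        resp = table.scan(ProjectionExpression="model_id", **kwargs)
        items.extend(resp.get("Items", []))
        if "LastEvaluatedKey" not in resp:
            return 0 if routes_reference_model(items, expected_model_id) else 1
        kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]

# In the pipeline: sys.exit(gate("anthropic.claude-3-5-sonnet-20241022-v2:0"))
```

Wiring `gate()`'s return value to the pipeline's exit code makes "routing table not updated" a hard deploy failure instead of a three-day silent gap.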
Key Takeaway
A model deployment is not complete until production traffic actually reaches the new model. Static routing tables create a decoupling risk — always include routing table updates and traffic verification in the deployment checklist, ideally automated in the same pipeline.
Scenario 3: Metric Collection Lag Causing Suboptimal Routing Decisions
Situation
The MetricBasedSelector ranks models using real-time metrics from ElastiCache Redis. During a morning traffic spike (09:00-10:00 JST), the Kinesis Data Stream backing the metric pipeline becomes throttled due to insufficient shard capacity. Metric aggregation falls behind by 8-12 minutes. The selector continues routing based on stale metrics that show Sonnet with low latency (from overnight low-traffic period), even though Sonnet's actual P95 latency has risen to 4.2 seconds due to the traffic surge. Users experience timeouts and degraded response quality.
Symptom Timeline
08:55 JST — Morning traffic ramp begins (Japan business hours)
09:05 JST — Kinesis shard iterator age rises above 30 seconds
09:10 JST — Metric aggregation Lambda falling behind by 2 minutes
09:15 JST — Redis metrics still showing overnight P95 latency (800ms for Sonnet)
09:20 JST — Actual Sonnet P95 latency at 3.1s (MetricBasedSelector unaware)
09:25 JST — User timeout rate crosses 2% (3-second WebSocket timeout)
09:30 JST — Customer complaints in support queue: "chatbot is slow"
09:35 JST — Kinesis iterator age at 8 minutes behind real-time
09:40 JST — On-call engineer paged via CloudWatch alarm on timeout rate
09:45 JST — Root cause identified: stale metrics driving bad routing
09:50 JST — Emergency: force all routing to Haiku via static override
09:55 JST — Timeout rate drops to 0.3%
10:15 JST — Kinesis resharding completes (2 → 4 shards)
10:30 JST — Metric pipeline catches up, dynamic routing re-enabled
Root Cause Analysis
The metric collection pipeline has a critical dependency on Kinesis Data Streams for buffering raw metric data points before aggregation. The Kinesis stream was provisioned with 2 shards, sufficient for average traffic (approximately 700 messages/minute) but insufficient for morning peak traffic (approximately 2,500 messages/minute).
When the stream throttled, the Lambda aggregator could not consume records fast enough. The aggregated metrics in Redis became stale — still reflecting overnight low-traffic latency numbers. The MetricBasedSelector trusted these stale metrics and continued ranking Sonnet highly for quality-sensitive queries, even though Sonnet's actual latency had degraded due to regional demand pressure.
The 3-second WebSocket timeout in MangaAssist meant that Sonnet responses taking 4+ seconds were dropped, resulting in user-visible errors.
Contributing factors:
- Kinesis shard capacity not auto-scaling (on-demand mode not enabled)
- No "metric staleness" check in MetricBasedSelector
- No fallback to latency-safe model when metrics are stale
- Metric pipeline single point of failure (no direct Redis writes as backup)
Architecture Impact Diagram
flowchart TB
subgraph Trigger["Traffic Spike"]
SPIKE[Morning Ramp<br/>700 → 2,500 msg/min]
end
subgraph Pipeline["Metric Pipeline (Broken)"]
KDS[Kinesis 2 Shards<br/>THROTTLED]
LAMBDA[Lambda Aggregator<br/>8 min BEHIND]
REDIS_STALE[Redis Metrics<br/>STALE: P95=800ms]
end
subgraph Routing["Routing (Bad Decisions)"]
SELECTOR[MetricBasedSelector<br/>Trusts Stale Metrics]
SONNET[Routes to Sonnet<br/>Actual P95=4.2s]
end
subgraph User["User Impact"]
TIMEOUT[WebSocket Timeout<br/>3s limit exceeded]
ERROR[User Error Rate<br/>0.1% → 2%+]
end
SPIKE --> KDS
KDS -->|Throttled| LAMBDA
LAMBDA -->|Can't update| REDIS_STALE
REDIS_STALE --> SELECTOR
SELECTOR --> SONNET
SONNET --> TIMEOUT
TIMEOUT --> ERROR
style KDS fill:#f44336,color:#fff
style REDIS_STALE fill:#FF9800,color:#fff
style TIMEOUT fill:#f44336,color:#fff
style ERROR fill:#f44336,color:#fff
Blast Radius
| Component | Impact | Severity |
|---|---|---|
| User timeout rate | 0.1% → 2%+ (20x increase) | CRITICAL |
| Response quality | Timeouts = zero response quality | CRITICAL |
| Metric accuracy | 8-12 minute lag, decisions based on stale data | HIGH |
| Routing optimality | Sonnet chosen when Haiku would have been within latency budget | HIGH |
| Customer satisfaction | Support tickets increased 5x during the incident | HIGH |
| Cost | Sonnet invocations wasted (timed out before response delivered) | MEDIUM |
Resolution Runbook
Step 1: Immediate — Force static routing override (< 2 minutes)
# Set emergency routing override: all traffic to Haiku
aws dynamodb put-item \
--table-name MangaAssist-RoutingConfig \
--item '{
"PK": {"S": "GLOBAL#override"},
"SK": {"S": "EMERGENCY#static_only"},
"override_type": {"S": "force_haiku"},
"enabled": {"BOOL": true},
"reason": {"S": "Metric pipeline lag causing timeout spike"},
"activated_by": {"S": "oncall-engineer"},
"activated_at": {"S": "2026-03-31T09:50:00+09:00"},
"ttl": {"N": "1711872600"}
}' \
--region ap-northeast-1
# Invalidate metric-based selector cache
redis-cli -h mangaassist-redis.xxxxx.apne1.cache.amazonaws.com \
SET "routing:override" '{"mode": "static_only", "model": "haiku"}' EX 3600
# Publish override event
redis-cli -h mangaassist-redis.xxxxx.apne1.cache.amazonaws.com \
PUBLISH "routing:emergency" '{"action": "force_haiku", "reason": "metric_lag"}'
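On the orchestrator side, the router must check for this override before running any metric-based selection, or the emergency switch does nothing. A minimal sketch; the Redis key and payload match the commands above, while the selector interface and function names are illustrative:

```python
import json

HAIKU = "anthropic.claude-3-haiku-20240307-v1:0"

def resolve_model(redis_get, dynamic_selector) -> str:
    """Return the model to invoke, honoring an emergency static override if present."""
    raw = redis_get("routing:override")
    if raw:
        override = json.loads(raw)
        if override.get("mode") == "static_only" and override.get("model") == "haiku":
            return HAIKU
    return dynamic_selector()

# With the override set, the dynamic selector is never consulted:
fake_redis = {"routing:override": '{"mode": "static_only", "model": "haiku"}'}
assert resolve_model(fake_redis.get, lambda: "sonnet") == HAIKU
assert resolve_model({}.get, lambda: "sonnet") == "sonnet"
```

Passing the Redis getter and selector as callables keeps the override path trivially testable, as shown with the in-memory dict.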
Step 2: Fix Kinesis capacity
# Check current shard count and iterator age
aws kinesis describe-stream-summary \
--stream-name MangaAssist-Metrics \
--region ap-northeast-1
# Option A: Enable on-demand mode (recommended for variable traffic)
aws kinesis update-stream-mode \
--stream-arn arn:aws:kinesis:ap-northeast-1:ACCOUNT:stream/MangaAssist-Metrics \
--stream-mode-details StreamMode=ON_DEMAND \
--region ap-northeast-1
# Option B: Manual reshard to 4 shards
aws kinesis update-shard-count \
--stream-name MangaAssist-Metrics \
--target-shard-count 4 \
--scaling-type UNIFORM_SCALING \
--region ap-northeast-1
Step 3: Add metric staleness detection to the selector
# Add this check to MetricBasedSelector.get_model_metrics()
def _is_metric_stale(self, metrics: ModelMetrics, max_age_seconds: int = 120) -> bool:
    """Check if metrics are too old to trust."""
    if not metrics.last_updated:
        return True
    try:
        from datetime import datetime, timezone

        updated = datetime.fromisoformat(metrics.last_updated)
        if updated.tzinfo is None:
            # Timestamps written without an offset are assumed to be UTC
            updated = updated.replace(tzinfo=timezone.utc)
        age = (datetime.now(timezone.utc) - updated).total_seconds()
        return age > max_age_seconds
    except (ValueError, TypeError):
        return True

# In rank_models(), add staleness handling:
def rank_models(self, query_type="general", override_weights=None):
    # ... existing code ...
    for model_id, metrics in all_metrics.items():
        if self._is_metric_stale(metrics):
            logger.warning("Stale metrics for %s — using conservative scores", model_id)
            # Penalize latency score for stale metrics (assume worst case)
            latency_score = 0.3  # Conservative — don't trust stale latency
        # ... rest of scoring uses conservative defaults
Step 4: Re-enable dynamic routing after pipeline catches up
# Verify Kinesis iterator age is back to near-real-time
aws cloudwatch get-metric-statistics \
--namespace AWS/Kinesis \
--metric-name GetRecords.IteratorAgeMilliseconds \
--dimensions Name=StreamName,Value=MangaAssist-Metrics \
--start-time $(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 60 \
--statistics Maximum \
--region ap-northeast-1
# Remove emergency override once metrics are fresh
redis-cli -h mangaassist-redis.xxxxx.apne1.cache.amazonaws.com \
DEL "routing:override"
aws dynamodb delete-item \
--table-name MangaAssist-RoutingConfig \
--key '{"PK": {"S": "GLOBAL#override"}, "SK": {"S": "EMERGENCY#static_only"}}' \
--region ap-northeast-1
Verification
- Kinesis iterator age below 5 seconds (real-time)
- Redis metric timestamps within 60 seconds of current time
- Timeout rate below 0.5%
- MetricBasedSelector logs show "fresh" metric reads (no staleness warnings)
- Dynamic routing re-enabled with normal Sonnet/Haiku distribution
Prevention
- Kinesis on-demand mode: Eliminates shard capacity planning entirely; auto-scales to traffic
- Metric staleness circuit breaker: If metric age exceeds 2 minutes, `MetricBasedSelector` automatically falls back to conservative (latency-safe) model selection
- Direct Redis writes: Add a secondary metric path that writes critical metrics (latency, errors) directly to Redis from the ECS task, bypassing Kinesis entirely. Use Kinesis only for durable archival
- Kinesis iterator age alarm: CloudWatch alarm when `GetRecords.IteratorAgeMilliseconds` exceeds 30,000ms (30 seconds)
- Pre-spike capacity: Schedule Kinesis shard scaling 15 minutes before known traffic ramps (e.g., 08:45 JST for Japan morning)
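The direct-Redis-writes item above can be sketched as a small helper in the ECS task: build a timestamped metric record and SET it with a short TTL so stale data expires on its own. The key layout and field names are assumptions; any client object with a redis-py-style `set(key, value, ex=...)` method works:

```python
import json
from datetime import datetime, timezone

def build_metric_record(model_id: str, p95_latency_ms: float, error_rate: float) -> tuple[str, str]:
    """Return (redis_key, json_payload) for a per-model metric snapshot."""
    key = f"metrics:{model_id}"
    payload = json.dumps({
        "p95_latency_ms": p95_latency_ms,
        "error_rate": error_rate,
        # Timestamp lets readers (e.g. _is_metric_stale) detect staleness
        "last_updated": datetime.now(timezone.utc).isoformat(),
    })
    return key, payload

def publish_metrics(redis_client, model_id: str, p95_latency_ms: float, error_rate: float) -> None:
    """Write directly to Redis, bypassing Kinesis; a 120s TTL forces stale data to expire."""
    key, payload = build_metric_record(model_id, p95_latency_ms, error_rate)
    redis_client.set(key, payload, ex=120)
```

Because the TTL matches the 2-minute staleness budget, a stalled writer leaves an empty key rather than a misleadingly fresh-looking value.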
Key Takeaway
Metric-based routing is only as good as the freshness of its metrics. Always implement staleness detection and conservative fallback behavior — stale metrics that look "good" are more dangerous than having no metrics at all.
Scenario 4: API Gateway Transformation Error Dropping Routing Headers
Situation
After a routine API Gateway deployment, the VTL (Velocity Template Language) mapping template for the WebSocket $connect route has a syntax error that silently drops the X-Route-Target and X-User-Tier headers from the integration request. The ECS Fargate orchestrator receives requests without routing context and falls through to the default route (full dynamic analysis) for every single request — including trivial greetings and FAQ queries that should be fast-pathed to Haiku.
This causes a 50% increase in average response latency (greetings now take 1.5s instead of 0.3s) and unnecessary compute spending on complexity analysis for queries that should never reach the dynamic router.
Symptom Timeline
11:00 JST — API Gateway deployment via CloudFormation stack update
11:05 JST — VTL template deployed with missing closing parenthesis
11:10 JST — First requests hit ECS without routing headers
11:15 JST — Dynamic router processing 100% of traffic (expected: 40%)
11:20 JST — Average response latency rises from 0.8s to 1.2s
11:30 JST — P95 latency rises from 1.5s to 2.4s
11:45 JST — CloudWatch alarm: "AverageLatency > 1.0s for 15 minutes"
11:50 JST — Investigation begins
12:00 JST — Root cause: VTL template dropping headers
12:10 JST — Rollback API Gateway deployment to previous stage
12:15 JST — Latency returns to normal
Root Cause Analysis
The VTL mapping template translates incoming WebSocket messages into backend integration requests with routing headers. The updated template had a syntax error on the $util.escapeJavaScript() call — a missing closing parenthesis caused the VTL engine to silently fail and produce an empty mapping for the affected headers.
## Broken VTL (line 14)
#set($route = $util.escapeJavaScript($input.path('$.route'))   ## <-- missing closing paren

## Correct VTL (line 14)
#set($route = $util.escapeJavaScript($input.path('$.route')))
API Gateway VTL template errors are not surfaced as 5xx errors. Instead, the template produces partial output — the request body is passed through, but the computed headers are empty strings. The ECS task receives the request with X-Route-Target: "" and X-User-Tier: "", which the APIGatewayRouteTransformer interprets as "no routing context" and falls back to the default full dynamic analysis path.
Contributing factors:
- No VTL template validation in the CI/CD pipeline
- API Gateway does not fail loudly on VTL syntax errors
- No integration test that verifies header presence after deployment
- No monitoring on "percentage of requests with routing headers present"
Architecture Impact Diagram
flowchart TB
subgraph Deploy["API Gateway Deployment"]
CFN[CloudFormation Update]
VTL[VTL Template<br/>SYNTAX ERROR]
end
subgraph Gateway["API Gateway (Broken)"]
WS[WebSocket $connect]
TRANSFORM[Request Transform<br/>Headers: EMPTY]
end
subgraph ECS["ECS Fargate"]
RECEIVE[Receive Request<br/>No routing headers]
FALLBACK[Default: Full Dynamic Analysis<br/>100% of traffic]
end
subgraph Impact["Impact"]
LATENCY[Latency Increase<br/>Greetings: 0.3s → 1.5s]
COMPUTE[Unnecessary Compute<br/>ComplexityScorer on trivial queries]
COST[Wasted Analysis Cost<br/>Step Functions invocations up 150%]
end
CFN --> VTL
VTL --> WS
WS --> TRANSFORM
TRANSFORM -->|Empty headers| RECEIVE
RECEIVE --> FALLBACK
FALLBACK --> LATENCY
FALLBACK --> COMPUTE
FALLBACK --> COST
style VTL fill:#f44336,color:#fff
style TRANSFORM fill:#FF9800,color:#fff
style FALLBACK fill:#FF9800,color:#fff
Blast Radius
| Component | Impact | Severity |
|---|---|---|
| Average latency | 0.8s → 1.2s (50% increase) | HIGH |
| Greeting latency | 0.3s → 1.5s (5x increase) | HIGH |
| Dynamic router load | 100% of traffic (expected: 40%) | MEDIUM |
| Step Functions cost | 150% increase in state machine executions | MEDIUM |
| User experience | Noticeable slowdown on simple queries | MEDIUM |
| Error rate | Unchanged (requests still succeed, just slower) | NONE |
Resolution Runbook
Step 1: Identify the broken VTL template
# Export current API Gateway stage configuration
aws apigatewayv2 get-integration \
--api-id YOUR_API_ID \
--integration-id YOUR_INTEGRATION_ID \
--region ap-northeast-1
# Check recent deployments
aws apigatewayv2 get-deployments \
--api-id YOUR_API_ID \
--region ap-northeast-1 \
--max-results 5
Step 2: Rollback to previous API Gateway stage
# List stages to find previous deployment
aws apigatewayv2 get-stages \
--api-id YOUR_API_ID \
--region ap-northeast-1
# Update stage to point to previous (working) deployment
aws apigatewayv2 update-stage \
--api-id YOUR_API_ID \
--stage-name production \
--deployment-id PREVIOUS_DEPLOYMENT_ID \
--region ap-northeast-1
Step 3: Fix the VTL template and redeploy
# Corrected VTL template (fix the missing parenthesis)
# Then redeploy via CloudFormation with the fixed template
aws cloudformation update-stack \
--stack-name MangaAssist-APIGateway \
--template-body file://corrected-template.yaml \
--region ap-northeast-1
Step 4: Add header presence validation to ECS
# Add to the ECS route controller: validate routing headers
def validate_routing_headers(self, headers: Dict[str, str]) -> bool:
    """Check that API Gateway sent valid routing headers."""
    required = ["X-Route-Target", "X-User-Tier"]
    for header in required:
        value = headers.get(header, "")
        if not value or value.strip() == "":
            logger.warning(
                "Missing routing header: %s — API Gateway transform may be broken",
                header,
            )
            # Emit metric for monitoring
            self._emit_metric("missing_routing_header", 1, {"header": header})
            return False
    return True
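With validation in place, the request path can degrade explicitly instead of silently. A sketch of the dispatch decision; the route-path strings and function name are illustrative:

```python
def choose_route_path(headers: dict) -> str:
    """Fast-path when API Gateway supplied a route target; otherwise full dynamic analysis."""
    target = (headers.get("X-Route-Target") or "").strip()
    return f"static:{target}" if target else "dynamic:full_analysis"

assert choose_route_path({"X-Route-Target": "haiku_fastpath"}) == "static:haiku_fastpath"
# Empty header (the broken-VTL case) falls through to dynamic analysis:
assert choose_route_path({"X-Route-Target": ""}) == "dynamic:full_analysis"
```

The point is that the fallback is a logged, metered decision rather than an invisible default, so a transform regression shows up in dashboards within minutes.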
Verification
- `X-Route-Target` header present on 100% of requests reaching ECS
- Static router handling 60%+ of traffic (greetings, FAQ, order status)
- Dynamic router handling approximately 40% of traffic (complex queries)
- Average latency returns to 0.8s baseline
- Greeting response latency returns to 0.3s
Prevention
- VTL template validation in CI/CD: Use `aws apigateway test-invoke-method` or a custom VTL parser to validate templates before deployment
- Integration tests post-deployment: Automated test that sends a known query through API Gateway and verifies all expected headers are present on the ECS side
- Header presence monitoring: CloudWatch custom metric tracking percentage of requests with valid routing headers. Alert if below 95%
- Canary deployment: Deploy API Gateway changes to a canary stage first, monitor for 10 minutes, then promote to production
- VTL template versioning: Store VTL templates as versioned files in the code repository with automated diff review
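A custom VTL check does not need a full parser to catch the class of error in this scenario; a per-line parenthesis balance check would have flagged the broken line before deployment. A naive sketch (it ignores parentheses inside quoted strings, so treat it as a CI smoke test, not a validator):

```python
def unbalanced_paren_lines(template: str) -> list[int]:
    """Return 1-based line numbers of VTL directive lines with unbalanced parentheses."""
    bad = []
    for lineno, line in enumerate(template.splitlines(), start=1):
        code = line.split("##")[0]  # strip VTL line comments
        if code.lstrip().startswith("#") and code.count("(") != code.count(")"):
            bad.append(lineno)
    return bad

broken = "#set($route = $util.escapeJavaScript($input.path('$.route'))"
fixed = "#set($route = $util.escapeJavaScript($input.path('$.route')))"
assert unbalanced_paren_lines(broken) == [1]
assert unbalanced_paren_lines(fixed) == []
```

Run against every template file in the repository as a pipeline step, this turns a silent production degradation into a failing build.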
Key Takeaway
API Gateway VTL template errors fail silently — they produce partial output instead of errors. Always validate VTL templates before deployment and monitor for header presence on the backend, not just HTTP status codes.
Scenario 5: Shadow Routing Doubling FM Costs Without Intended A/B Split
Situation
A data science team member configures shadow routing to compare Claude 3 Sonnet against Claude 3 Haiku for product recommendation queries. The intention was to shadow 10% of traffic. However, due to a misconfiguration, the ShadowRouter is instantiated with shadow_traffic_pct=100 and the shadow model is set to Sonnet. This means every single query to the system now invokes both Haiku (primary, serving the user) and Sonnet (shadow, discarded) — effectively doubling FM costs for the entire platform. The shadow invocations also compete for Bedrock throughput, contributing to increased primary model latency.
Symptom Timeline
Day 1, 09:00 JST — Shadow routing config deployed via feature flag
Day 1, 09:05 JST — ShadowRouter starts with shadow_traffic_pct=100 (intended: 10)
Day 1, 09:30 JST — Bedrock invocation count doubles (Haiku + Sonnet per query)
Day 1, 12:00 JST — Midday cost check shows $1,600 spend (expected: $800)
Day 1, 14:00 JST — AWS Budgets alert: 200% of daily Bedrock forecast
Day 1, 14:15 JST — Primary (Haiku) P95 latency increases 30% due to shared throughput
Day 1, 14:30 JST — Investigation begins
Day 1, 14:45 JST — Root cause: shadow_traffic_pct=100, shadow_model=Sonnet
Day 1, 14:50 JST — Shadow routing disabled via feature flag
Day 1, 15:00 JST — Costs and latency return to baseline
Day 1 end — Total overspend: ~$1,800 (shadow active ~6 hours at a $7,500/day Sonnet run rate)
Root Cause Analysis
The ShadowRouter configuration was set via a feature flag in DynamoDB:
{
"shadow_enabled": true,
"shadow_model_id": "anthropic.claude-3-sonnet-20240229-v1:0",
"shadow_traffic_pct": 100,
"shadow_max_tokens": 1024
}
The data scientist intended shadow_traffic_pct: 10 but entered 100. There was no validation guard preventing a 100% shadow rate. Additionally, the shadow model was set to Sonnet ($3/$15 per 1M tokens) — the most expensive available model — which is the worst-case scenario for cost.
At 1M messages/day with an average of 500 input tokens and 400 output tokens per query:
- Normal Haiku cost: (500 * $0.25 + 400 * $1.25) / 1M * 1M queries = $625/day
- Shadow Sonnet cost at 100%: (500 * $3.00 + 400 * $15.00) / 1M * 1M queries = $7,500/day
- Total with shadow: $8,125/day (13x normal cost)
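The arithmetic above can be reproduced directly (token counts and per-1M-token prices are the ones from this scenario):

```python
def daily_cost(queries: int, in_tok: int, out_tok: int,
               price_in: float, price_out: float) -> float:
    """Daily FM cost in USD; prices are per 1M tokens."""
    return queries * (in_tok * price_in + out_tok * price_out) / 1_000_000

haiku = daily_cost(1_000_000, 500, 400, 0.25, 1.25)    # primary traffic
sonnet = daily_cost(1_000_000, 500, 400, 3.00, 15.00)  # 100% Sonnet shadow
print(haiku, sonnet, haiku + sonnet)  # 625.0 7500.0 8125.0
```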
The shadow invocations were fire-and-forget (responses discarded), so there was no user-visible quality impact. However, the shadow invocations consumed Bedrock account-level throughput, causing the primary Haiku invocations to experience increased latency from contention.
Contributing factors:
- No upper bound validation on shadow_traffic_pct (should be capped well below 100% — e.g. at 25%)
- No cost impact estimation before enabling shadow routing
- Feature flag change did not require approval for cost-impacting parameters
- No alert on "shadow invocation count > expected threshold"
- Shadow uses the same Bedrock account/region, sharing throughput limits
Architecture Impact Diagram
flowchart TB
subgraph Config["Misconfiguration"]
FLAG[Feature Flag<br/>shadow_traffic_pct: 100<br/>shadow_model: Sonnet]
end
subgraph Normal["Normal Path (Still Working)"]
QUERY[Every Query<br/>1M/day]
HAIKU[Haiku Primary<br/>$625/day]
RESPONSE[User Response<br/>From Haiku]
end
subgraph Shadow["Shadow Path (Unintended 100%)"]
SHADOW[ShadowRouter<br/>100% duplication]
SONNET[Sonnet Shadow<br/>$7,500/day<br/>RESPONSES DISCARDED]
end
subgraph Impact["Cost + Performance Impact"]
COST[Daily Cost<br/>$625 → $8,125<br/>13x increase]
THROUGHPUT[Bedrock Throughput<br/>Shared contention]
LATENCY[Haiku P95 Latency<br/>+30% from contention]
end
FLAG --> SHADOW
QUERY --> HAIKU
QUERY --> SHADOW
HAIKU --> RESPONSE
SHADOW --> SONNET
SONNET -->|Discarded| WASTE[Wasted Responses]
SONNET --> COST
SONNET --> THROUGHPUT
THROUGHPUT --> LATENCY
style FLAG fill:#f44336,color:#fff
style SONNET fill:#f44336,color:#fff
style COST fill:#f44336,color:#fff
style WASTE fill:#9E9E9E,color:#fff
Blast Radius
| Component | Impact | Severity |
|---|---|---|
| Daily FM cost | $625 → $8,125 (13x increase) | CRITICAL |
| Bedrock throughput | Consumed by shadow invocations, starving primary | HIGH |
| Primary latency | Haiku P95 up 30% from throughput contention | MEDIUM |
| Shadow data value | 100% shadow defeats purpose (no A/B comparison possible) | MEDIUM |
| Monthly budget | Single day overspend = $7,500 additional | HIGH |
| User experience | No quality degradation, slight latency increase | LOW |
Resolution Runbook
Step 1: Immediately disable shadow routing (< 2 minutes)
# Disable shadow via feature flag
aws dynamodb update-item \
--table-name MangaAssist-FeatureFlags \
--key '{"PK": {"S": "FLAG#shadow_routing"}, "SK": {"S": "CONFIG"}}' \
--update-expression "SET shadow_enabled = :disabled, disabled_by = :who, disabled_at = :ts, disable_reason = :reason" \
--expression-attribute-values '{
":disabled": {"BOOL": false},
":who": {"S": "oncall-engineer"},
":ts": {"S": "2026-03-31T14:50:00+09:00"},
":reason": {"S": "Cost spike - shadow_traffic_pct was 100 instead of 10"}
}' \
--region ap-northeast-1
# Force immediate effect via Redis
redis-cli -h mangaassist-redis.xxxxx.apne1.cache.amazonaws.com \
SET "feature:shadow_routing:enabled" "false" EX 3600
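The Redis SET takes effect immediately only because the orchestrator consults the cache before falling back to the DynamoDB flag. A sketch of that read order (key names are taken from the runbook; the client objects and value encoding are assumptions):

```python
def shadow_enabled(redis_client, ddb_table) -> bool:
    """Redis override wins; the DynamoDB feature flag is the durable fallback."""
    override = redis_client.get("feature:shadow_routing:enabled")
    if override is not None:
        # Redis stores the flag as the string "true"/"false"
        val = override.decode() if isinstance(override, bytes) else override
        return val == "true"
    item = ddb_table.get_item(
        Key={"PK": "FLAG#shadow_routing", "SK": "CONFIG"}
    ).get("Item", {})
    return bool(item.get("shadow_enabled", False))
```

Because the Redis key carries a TTL (`EX 3600`), the override expires on its own and routing falls back to the durable DynamoDB flag — which Step 1 also set to false.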
Step 2: Verify shadow invocations have stopped
# Check Sonnet invocation rate (should drop to near-zero for shadow)
aws cloudwatch get-metric-statistics \
--namespace MangaAssist/Routing \
--metric-name ShadowInvocationCount \
--start-time $(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 60 \
--statistics Sum \
--region ap-northeast-1
Step 3: Calculate actual cost impact
# Get today's Bedrock cost
aws ce get-cost-and-usage \
--time-period Start=$(date -u +%Y-%m-%d),End=$(date -u -d 'tomorrow' +%Y-%m-%d) \
--granularity DAILY \
--filter '{"Dimensions": {"Key": "SERVICE", "Values": ["Amazon Bedrock"]}}' \
--metrics BlendedCost UnblendedCost
# Get shadow-specific cost from internal metrics
redis-cli -h mangaassist-redis.xxxxx.apne1.cache.amazonaws.com \
GET "shadow:total_shadow_cost"
Step 4: Add validation guards before re-enabling
# Add to ShadowRouter — validation guards
import logging

logger = logging.getLogger(__name__)

class ShadowRouter:
    MAX_SHADOW_TRAFFIC_PCT = 25.0        # Hard cap
    MAX_SHADOW_DAILY_COST_USD = 500.0    # Daily cost circuit breaker

    # USD per 1M tokens (Bedrock on-demand pricing for these models)
    COSTS = {
        "anthropic.claude-3-sonnet-20240229-v1:0": {"input": 3.00, "output": 15.00},
        "anthropic.claude-3-haiku-20240307-v1:0": {"input": 0.25, "output": 1.25},
    }

    def __init__(self, config: ShadowConfig, ...):
        # Validate shadow traffic percentage
        if config.shadow_traffic_pct > self.MAX_SHADOW_TRAFFIC_PCT:
            logger.error(
                "Shadow traffic %.1f%% exceeds maximum %.1f%% — capping",
                config.shadow_traffic_pct,
                self.MAX_SHADOW_TRAFFIC_PCT,
            )
            config.shadow_traffic_pct = self.MAX_SHADOW_TRAFFIC_PCT

        # Estimate daily cost and disable if over the circuit-breaker limit
        estimated_daily_cost = self._estimate_daily_shadow_cost(config)
        if estimated_daily_cost > self.MAX_SHADOW_DAILY_COST_USD:
            logger.warning(
                "Shadow routing estimated cost: $%.2f/day (limit: $%.2f) — DISABLING",
                estimated_daily_cost,
                self.MAX_SHADOW_DAILY_COST_USD,
            )
            config.enabled = False

    def _estimate_daily_shadow_cost(self, config: ShadowConfig) -> float:
        """Estimate daily shadow cost based on traffic and model pricing."""
        daily_queries = 1_000_000  # MangaAssist scale
        shadow_queries = daily_queries * (config.shadow_traffic_pct / 100)
        avg_input_tokens = 500
        avg_output_tokens = 400
        costs = self.COSTS.get(config.shadow_model_id, {"input": 0, "output": 0})
        return (
            shadow_queries
            * (avg_input_tokens * costs["input"] + avg_output_tokens * costs["output"])
            / 1_000_000
        )
Step 5: Re-enable shadow routing with correct configuration
# Re-enable with validated parameters
aws dynamodb update-item \
--table-name MangaAssist-FeatureFlags \
--key '{"PK": {"S": "FLAG#shadow_routing"}, "SK": {"S": "CONFIG"}}' \
--update-expression "SET shadow_enabled = :enabled, shadow_traffic_pct = :pct, shadow_model_id = :model, updated_by = :who" \
--expression-attribute-values '{
":enabled": {"BOOL": true},
":pct": {"N": "10"},
":model": {"S": "anthropic.claude-3-sonnet-20240229-v1:0"},
":who": {"S": "data-science-lead"}
}' \
--region ap-northeast-1
Verification
- Shadow invocation rate matches expected 10% of total traffic
- Daily Bedrock cost projection within normal range ($1,900-$2,200)
- Primary model (Haiku) P95 latency back to baseline (< 1.2s)
- Shadow cost tracking metric shows reasonable daily projection
- No Bedrock throttling events
Prevention
- Shadow traffic hard cap: `ShadowRouter` rejects configurations above 25% traffic without an explicit override flag
- Cost estimation gate: Before enabling shadow routing, calculate and display estimated daily cost. Require explicit acknowledgment for costs exceeding $200/day
- Shadow cost circuit breaker: Real-time tracking of shadow invocation costs. Auto-disable shadow routing if cumulative daily shadow cost exceeds threshold
- Feature flag approval workflow: Cost-impacting feature flag changes require team lead approval via Slack bot or Step Functions approval workflow
- Separate throughput allocation: Use Bedrock Provisioned Throughput for shadow invocations to isolate from primary traffic (or use a separate AWS account)
- Shadow budget alarm: CloudWatch alarm on cumulative shadow cost metric, alerting at 50% and 80% of daily shadow budget
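The shadow cost circuit breaker in the list above can be sketched as a running counter checked on every shadow invocation — a minimal sketch, assuming the Redis key from Step 3 of the runbook and the $500/day budget from the validation guards; the counter would need a daily reset (e.g. a TTL aligned to the billing day):

```python
DAILY_SHADOW_BUDGET_USD = 500.0

def record_and_check(redis_client, invocation_cost_usd: float) -> bool:
    """Add this invocation's cost to the daily running total; return True if
    shadow routing may continue, False once the daily budget is exhausted."""
    total = redis_client.incrbyfloat("shadow:total_shadow_cost", invocation_cost_usd)
    if float(total) >= DAILY_SHADOW_BUDGET_USD:
        # Trip the breaker via the same flag the runbook flips manually
        redis_client.set("feature:shadow_routing:enabled", "false")
        return False
    return True
```

Because the breaker writes the same feature-flag key the on-call engineer uses, tripping it disables shadow invocations on the very next flag read — no deployment required.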
Key Takeaway
Shadow routing trades money for information — and at 100% shadow rate with an expensive model, the cost can exceed the primary workload by an order of magnitude. Always enforce hard caps on shadow traffic percentage and implement cost circuit breakers that auto-disable shadow routing when spend exceeds thresholds.
Scenario Comparison Matrix
| Dimension | Scenario 1: Threshold Misconfig | Scenario 2: Stale Model IDs | Scenario 3: Metric Lag | Scenario 4: VTL Header Drop | Scenario 5: Shadow Cost Spike |
|---|---|---|---|---|---|
| Root Cause Category | Configuration error | Deployment gap | Capacity planning | Template syntax | Validation gap |
| Detection Time | 5 hours | 3 days | 45 minutes | 45 minutes | 5.5 hours |
| User Impact | None (better model, just costly) | None (old model still works) | Timeouts (2%+ error rate) | Slower responses | Slight latency increase |
| Cost Impact | +$2,300/day | None | Wasted Sonnet spend | +Step Functions cost | +$7,500/day |
| Detection Method | Cost anomaly alert | Manual A/B review | Timeout rate alarm | Latency alarm | Budget alert |
| Fix Time | 15 minutes | 2 hours | 30 minutes | 15 minutes | 5 minutes |
| Could Have Been Prevented | Threshold guard rail | Deployment automation | On-demand Kinesis | VTL validation in CI | Traffic cap validation |
Cross-Scenario Prevention Framework
flowchart TB
subgraph Preventive["Preventive Controls"]
VAL[Input Validation<br/>Thresholds, percentages, model IDs]
AUTO[Deployment Automation<br/>Routing table in same pipeline]
CAP[Capacity Planning<br/>Kinesis on-demand mode]
CI[CI/CD Validation<br/>VTL parsing, integration tests]
end
subgraph Detective["Detective Controls"]
COST_MON[Cost Monitoring<br/>Budget alerts at 130%, 150%, 200%]
METRIC_FRESH[Metric Freshness<br/>Staleness > 2min alarm]
HEADER_MON[Header Presence<br/>Missing header rate alarm]
TRAFFIC_MON[Traffic Distribution<br/>Model split deviation alarm]
end
subgraph Corrective["Corrective Controls"]
CIRCUIT[Cost Circuit Breaker<br/>Auto-disable at threshold]
FALLBACK[Stale Metric Fallback<br/>Conservative routing]
ROLLBACK[Auto-Rollback<br/>Deploy fails → revert]
OVERRIDE[Emergency Override<br/>Force static routing]
end
VAL -->|Prevents| S1[Scenario 1 + 5]
AUTO -->|Prevents| S2[Scenario 2]
CAP -->|Prevents| S3[Scenario 3]
CI -->|Prevents| S4[Scenario 4]
COST_MON -->|Detects| S15[Scenario 1 + 5]
METRIC_FRESH -->|Detects| S3B[Scenario 3]
HEADER_MON -->|Detects| S4B[Scenario 4]
TRAFFIC_MON -->|Detects| S12[Scenario 1 + 2]
CIRCUIT -->|Corrects| S15C[Scenario 1 + 5]
FALLBACK -->|Corrects| S3C[Scenario 3]
ROLLBACK -->|Corrects| S4C[Scenario 4]
OVERRIDE -->|Corrects| SALL[All Scenarios]
style VAL fill:#4CAF50,color:#fff
style AUTO fill:#4CAF50,color:#fff
style CAP fill:#4CAF50,color:#fff
style CI fill:#4CAF50,color:#fff
style COST_MON fill:#2196F3,color:#fff
style METRIC_FRESH fill:#2196F3,color:#fff
style HEADER_MON fill:#2196F3,color:#fff
style TRAFFIC_MON fill:#2196F3,color:#fff
style CIRCUIT fill:#FF9800,color:#fff
style FALLBACK fill:#FF9800,color:#fff
style ROLLBACK fill:#FF9800,color:#fff
style OVERRIDE fill:#FF9800,color:#fff
Emergency Quick Reference
| Emergency | Immediate Action | Command |
|---|---|---|
| Cost spike from routing | Force all traffic to Haiku | redis-cli SET "routing:override" '{"mode":"force_haiku"}' EX 3600 |
| Timeout spike | Force static routing (skip dynamic analysis) | redis-cli SET "routing:override" '{"mode":"static_only"}' EX 3600 |
| Shadow cost runaway | Disable shadow routing | redis-cli SET "feature:shadow_routing:enabled" "false" EX 3600 |
| Stale metrics | Switch to static-only routing | redis-cli SET "routing:override" '{"mode":"static_only"}' EX 3600 |
| API Gateway broken | Rollback to previous deployment | aws apigatewayv2 update-stage --api-id API_ID --stage-name prod --deployment-id PREV_ID |
| All else fails | Emergency maintenance page | Update API Gateway to return static maintenance JSON |
References
| Resource | Link |
|---|---|
| AWS Well-Architected — Operational Excellence | https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/welcome.html |
| Amazon Bedrock Quotas | https://docs.aws.amazon.com/bedrock/latest/userguide/quotas.html |
| Kinesis Data Streams On-Demand | https://docs.aws.amazon.com/streams/latest/dev/how-do-i-size-a-stream.html |
| API Gateway VTL Reference | https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-mapping-template-reference.html |
| AWS Budgets Actions | https://docs.aws.amazon.com/cost-management/latest/userguide/budgets-controls.html |
| CloudWatch Anomaly Detection | https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html |