
Scenarios and Runbooks — Intelligent Model Routing

MangaAssist context: JP Manga store chatbot on AWS — Bedrock Claude 3 (Sonnet at $3/$15 per 1M tokens input/output, Haiku at $0.25/$1.25), OpenSearch Serverless (vector store), DynamoDB (sessions/products), ECS Fargate (orchestrator), API Gateway WebSocket, ElastiCache Redis. Target: useful answer in under 3 seconds, 1M messages/day scale.


Skill Mapping

| Field | Value |
|---|---|
| Certification | AWS AI Practitioner (AIP-C01) |
| Domain | 2 — Implementation and Integration of Foundation Models |
| Task | 2.4 — Design model deployment and inference strategies |
| Skill | 2.4.4 — Develop intelligent model routing systems to optimize model selection |
| Focus | Scenario-based troubleshooting for static routing, dynamic routing, metric-based selection, API Gateway transformation, and shadow routing |

Scenario Format Guide

Each scenario follows this structure:

| Section | Purpose |
|---|---|
| Situation | What went wrong and how it was detected |
| Symptom Timeline | Chronological sequence of observable events |
| Root Cause Analysis | Deep technical explanation of the failure |
| Architecture Impact Diagram | Mermaid diagram showing affected components |
| Blast Radius | What else broke or degraded as a consequence |
| Resolution Runbook | Step-by-step fix with commands and code |
| Verification | How to confirm the fix worked |
| Prevention | Long-term measures to prevent recurrence |
| Key Takeaway | One-sentence lesson for the exam/interview |

Scenario 1: Dynamic Routing Sending All Complex Queries to Sonnet — Cost Spike

Situation

The MangaAssist operations team receives a PagerDuty alert at 14:22 JST: daily Bedrock spend has crossed the 150% budget threshold with 8 hours remaining in the billing day. Investigation reveals that the DynamicModelRouter is routing 45% of all queries to Claude 3 Sonnet instead of the expected 10-15%. The complexity scorer threshold was inadvertently lowered during a configuration update, causing moderate-complexity queries (product comparisons, simple recommendations) to exceed the Sonnet threshold.

Symptom Timeline

14:00 JST — Daily cost tracker shows $2,847 spend (expected: $1,900 by this hour)
14:05 JST — CloudWatch alarm: Sonnet invocations/minute = 312 (baseline: 70)
14:10 JST — Haiku invocations/minute dropped from 620 to 380
14:15 JST — Cost anomaly detector fires (150% of projected daily budget)
14:22 JST — PagerDuty alert to on-call engineer
14:25 JST — Investigation begins
14:35 JST — Root cause identified: complexity threshold = 4.0 (should be 6.5)
14:40 JST — Threshold corrected, cache invalidated
14:50 JST — Sonnet routing rate returns to normal (11%)
15:00 JST — Cost burn rate normalized

Root Cause Analysis

The ComplexityScorer uses a configurable sonnet_threshold parameter stored in the DynamoDB routing configuration table. During a routine configuration update at 09:00 JST, an engineer updated the threshold from 6.5 to 4.0 intending to test a new complexity algorithm in a staging environment. The update was accidentally applied to the production DynamoDB table instead of the staging table because both tables have similar names (MangaAssist-RoutingConfig vs MangaAssist-RoutingConfig-staging).

With the threshold at 4.0, any query scoring above "simple" complexity was routed to Sonnet. Product comparison queries ("Which is better, One Piece or Naruto?") that score around 5.0-6.0 were now hitting Sonnet at $3/$15 per 1M tokens instead of Haiku at $0.25/$1.25 — a 12x cost increase per query.
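The failure mode is easiest to see in a reduced form of the router's decision rule. The sketch below is illustrative, not the actual ComplexityScorer code; the model IDs are real Bedrock identifiers, but the function and score values are assumptions for this example:

```python
# Illustrative reduction of a complexity-threshold router: a query's
# score is compared against a configurable threshold, and everything
# at or above it is routed to the more expensive model.

HAIKU = "anthropic.claude-3-haiku-20240307-v1:0"
SONNET = "anthropic.claude-3-sonnet-20240229-v1:0"

def route(score: float, sonnet_threshold: float) -> str:
    """Return the model ID for a query with the given complexity score."""
    return SONNET if score >= sonnet_threshold else HAIKU

# A moderate product-comparison query scoring 5.5:
# the correct threshold (6.5) keeps it on Haiku...
assert route(5.5, sonnet_threshold=6.5) == HAIKU
# ...while the misapplied threshold (4.0) sends it to Sonnet at ~12x cost.
assert route(5.5, sonnet_threshold=4.0) == SONNET
```

The 12x figure follows from the pricing in the context header: $3/$15 versus $0.25/$1.25 per 1M tokens.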

Contributing factors:

  • No environment prefix validation on DynamoDB table names
  • No approval gate for threshold changes below 5.0
  • Configuration cache TTL of 5 minutes meant all ECS tasks picked up the change within minutes
  • No automated rollback trigger on cost anomaly detection

Architecture Impact Diagram

flowchart TB
    subgraph Problem["Problem Chain"]
        CONFIG[DynamoDB Config Update<br/>threshold: 6.5 → 4.0] -->|5-min cache TTL| CACHE[Redis Cache Refresh]
        CACHE --> SCORER[ComplexityScorer<br/>Now: score >= 4.0 → Sonnet]
        SCORER -->|Moderate queries routed up| SONNET[Sonnet Invocations<br/>70/min → 312/min]
        SONNET --> COST[Cost Spike<br/>$1,900 → $4,200 projected]
    end

    subgraph Cascade["Cascade Effects"]
        SONNET -->|Higher latency| LATENCY[P95 Latency<br/>1.2s → 2.8s]
        SONNET -->|Throttling risk| THROTTLE[Bedrock Throttle Rate<br/>0.1% → 3.2%]
        LATENCY -->|Timeout increase| UX[User Experience<br/>Slower responses]
    end

    style CONFIG fill:#f44336,color:#fff
    style SONNET fill:#FF9800,color:#fff
    style COST fill:#f44336,color:#fff
    style LATENCY fill:#FF9800,color:#fff

Blast Radius

| Component | Impact | Severity |
|---|---|---|
| Daily budget | Projected spend at 150% of daily budget | HIGH |
| Sonnet latency | P95 rose from 1.2s to 2.8s due to increased load | MEDIUM |
| Bedrock throttling | Sonnet throttle rate spiked to 3.2% | MEDIUM |
| User experience | Slower responses for moderate queries (no quality gain) | LOW |
| Haiku utilization | Under-utilized — wasted provisioned throughput | LOW |

Resolution Runbook

Step 1: Immediate — Restore correct threshold (< 5 minutes)

# Verify current production threshold value
aws dynamodb get-item \
  --table-name MangaAssist-RoutingConfig \
  --key '{"PK": {"S": "GLOBAL#config"}, "SK": {"S": "THRESHOLD#complexity"}}' \
  --region ap-northeast-1 \
  --query 'Item.sonnet_threshold.N'

# Restore correct threshold
aws dynamodb update-item \
  --table-name MangaAssist-RoutingConfig \
  --key '{"PK": {"S": "GLOBAL#config"}, "SK": {"S": "THRESHOLD#complexity"}}' \
  --update-expression "SET sonnet_threshold = :val, updated_at = :ts, updated_by = :who" \
  --expression-attribute-values '{
    ":val": {"N": "6.5"},
    ":ts": {"S": "2026-03-31T14:40:00+09:00"},
    ":who": {"S": "oncall-engineer@mangaassist.jp"}
  }' \
  --region ap-northeast-1

Step 2: Force cache invalidation across all ECS tasks

# Invalidate Redis cache for the threshold config
redis-cli -h mangaassist-redis.xxxxx.apne1.cache.amazonaws.com \
  DEL "config:complexity_threshold"

# Publish cache invalidation event to all ECS tasks
redis-cli -h mangaassist-redis.xxxxx.apne1.cache.amazonaws.com \
  PUBLISH "routing:cache_invalidation" '{"key": "complexity_threshold", "action": "refresh"}'
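For the PUBLISH above to have any effect, each ECS task must run a subscriber that reacts to the invalidation event by dropping its local copy of the config. A minimal sketch of the handler side, with the Redis client wiring omitted and the in-process cache reduced to a plain dict for illustration (the payload format matches the PUBLISH command above):

```python
import json

# In-process config cache held by each ECS task; in production this is
# backed by Redis with a short TTL, reduced here to a dict.
local_cache = {"complexity_threshold": 4.0}

def handle_invalidation(message: str) -> None:
    """Process a routing:cache_invalidation payload.

    Dropping the key forces the next read to go back to DynamoDB,
    which now holds the corrected threshold.
    """
    event = json.loads(message)
    if event.get("action") == "refresh":
        local_cache.pop(event["key"], None)

handle_invalidation('{"key": "complexity_threshold", "action": "refresh"}')
assert "complexity_threshold" not in local_cache
```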

Step 3: Verify routing distribution has normalized

# Check Sonnet vs Haiku invocation rates (last 10 minutes)
aws cloudwatch get-metric-statistics \
  --namespace MangaAssist/Routing \
  --metric-name InvocationCount \
  --dimensions Name=ModelId,Value=anthropic.claude-3-sonnet-20240229-v1:0 \
  --start-time $(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 60 \
  --statistics Sum \
  --region ap-northeast-1

Step 4: Review cost impact and document incident

# Get today's Bedrock cost breakdown
aws ce get-cost-and-usage \
  --time-period Start=$(date -u +%Y-%m-%d),End=$(date -u -d 'tomorrow' +%Y-%m-%d) \
  --granularity DAILY \
  --filter '{"Dimensions": {"Key": "SERVICE", "Values": ["Amazon Bedrock"]}}' \
  --metrics BlendedCost

Verification

  • Sonnet invocations/minute returns to baseline range (60-80)
  • Haiku invocations/minute returns to baseline range (580-650)
  • P95 latency drops below 1.5s within 10 minutes
  • Cost burn rate returns to projected daily total
  • No new throttling events on Sonnet endpoint

Prevention

  1. Environment-prefixed table names: Enforce naming convention MangaAssist-{env}-RoutingConfig with IAM policies restricting production writes to specific roles
  2. Threshold guard rails: DynamoDB Streams trigger that rejects sonnet_threshold values below 5.0 without an approval token
  3. Cost circuit breaker: Automated threshold rollback when cost anomaly exceeds 130% of daily projection
  4. Configuration change approval: Require two-person approval for production routing config changes via a Step Functions approval workflow
  5. Staging environment isolation: Separate AWS accounts for staging vs production

Key Takeaway

Dynamic routing thresholds are high-leverage configuration points: a single number change can multiply your monthly FM spend. Guard them with validation, approval gates, and automated cost-based circuit breakers.


Scenario 2: Static Routing Table Stale After New Model Deployment

Situation

The MangaAssist team deploys Claude 3.5 Sonnet (anthropic.claude-3-5-sonnet-20241022-v2:0) as an upgrade path for complex queries. The new model is enabled in Bedrock and tested in isolation. However, the static routing table in DynamoDB still references the old Claude 3 Sonnet model ID (anthropic.claude-3-sonnet-20240229-v1:0). The new model never receives production traffic, and the team only discovers this three days later when reviewing an A/B test that shows zero queries reaching the new model.

Symptom Timeline

Day 0, 10:00 JST — Claude 3.5 Sonnet enabled in Bedrock account
Day 0, 10:30 JST — Integration tests pass against new model (direct invocation)
Day 0, 11:00 JST — Deployment marked as "complete" in Jira ticket
Day 0-3          — All production traffic continues to old model (no alerts)
Day 3, 15:00 JST — A/B test report shows 0 queries to new model
Day 3, 15:30 JST — Investigation reveals routing table not updated
Day 3, 16:00 JST — 23 DynamoDB items identified as stale
Day 3, 17:00 JST — Bulk update applied, canary testing started

Root Cause Analysis

The deployment checklist for new model activation covered:

  1. Enable model access in Bedrock console
  2. Run integration test suite against new model endpoint
  3. Update CloudWatch dashboards with new model metrics
  4. [MISSING] Update DynamoDB routing table entries
  5. [MISSING] Invalidate Redis routing cache
  6. [MISSING] Verify production traffic hitting new model

The static routing table has 23 entries that reference the old Sonnet model ID. Since the old model remained active in Bedrock (AWS does not automatically deactivate old model versions), all queries continued to work — just not on the new model. There were no errors, no latency changes, and no quality regressions that would trigger alerts.

Contributing factors:

  • No deployment automation linking model enablement to routing table updates
  • No "model version freshness" check in the routing pipeline
  • No alert on "expected model not receiving traffic"
  • Deployment checklist was a manual wiki page, not an automated runbook

Architecture Impact Diagram

flowchart TB
    subgraph Deployment["Deployment Actions (Completed)"]
        ENABLE[Bedrock Model Access<br/>Claude 3.5 Sonnet Enabled]
        TEST[Integration Tests<br/>Passed]
        DASH[Dashboard Updated]
    end

    subgraph Missed["Deployment Actions (MISSED)"]
        DDB_UPDATE[DynamoDB Route Update<br/>NOT DONE]
        CACHE_INV[Redis Cache Invalidation<br/>NOT DONE]
        VERIFY[Traffic Verification<br/>NOT DONE]
    end

    subgraph Production["Production State"]
        DDB[DynamoDB Routing Table<br/>Still: claude-3-sonnet-v1:0]
        REDIS[Redis Cache<br/>Still: claude-3-sonnet-v1:0]
        ROUTER[StaticRouter<br/>Routes to OLD model]
        OLD[Claude 3 Sonnet<br/>RECEIVING ALL TRAFFIC]
        NEW[Claude 3.5 Sonnet<br/>ZERO TRAFFIC]
    end

    ENABLE -.->|Should trigger| DDB_UPDATE
    DDB_UPDATE -.->|Should trigger| CACHE_INV
    CACHE_INV -.->|Should trigger| VERIFY

    DDB --> REDIS
    REDIS --> ROUTER
    ROUTER --> OLD

    style DDB_UPDATE fill:#f44336,color:#fff
    style CACHE_INV fill:#f44336,color:#fff
    style VERIFY fill:#f44336,color:#fff
    style OLD fill:#FF9800,color:#fff
    style NEW fill:#9E9E9E,color:#fff

Blast Radius

| Component | Impact | Severity |
|---|---|---|
| New model utilization | Zero production traffic for 3 days | HIGH |
| Quality improvement | Users did not benefit from model upgrade | MEDIUM |
| A/B test validity | 3-day A/B test data is useless (no treatment traffic) | MEDIUM |
| Cost | No cost impact (old model pricing unchanged) | NONE |
| User experience | No degradation (old model still functional) | NONE |

Resolution Runbook

Step 1: Identify all stale routing entries

# Scan routing table for entries referencing old model
aws dynamodb scan \
  --table-name MangaAssist-RoutingConfig \
  --filter-expression "contains(model_id, :old_model)" \
  --expression-attribute-values '{":old_model": {"S": "claude-3-sonnet-20240229"}}' \
  --projection-expression "PK, SK, model_id, intent_category, sub_intent" \
  --region ap-northeast-1

Step 2: Generate bulk update script

"""
Bulk update routing table entries from old model to new model.
Run with: python update_model_routes.py --dry-run first, then without --dry-run.
"""

import sys
import boto3
from datetime import datetime

OLD_MODEL = "anthropic.claude-3-sonnet-20240229-v1:0"
NEW_MODEL = "anthropic.claude-3-5-sonnet-20241022-v2:0"
TABLE_NAME = "MangaAssist-RoutingConfig"
REGION = "ap-northeast-1"

dry_run = "--dry-run" in sys.argv

dynamodb = boto3.resource("dynamodb", region_name=REGION)
table = dynamodb.Table(TABLE_NAME)

# Scan for stale entries (paginate: a single Scan call returns at most 1 MB)
items = []
scan_kwargs = {
    "FilterExpression": "model_id = :old",
    "ExpressionAttributeValues": {":old": OLD_MODEL},
}
while True:
    response = table.scan(**scan_kwargs)
    items.extend(response.get("Items", []))
    if "LastEvaluatedKey" not in response:
        break
    scan_kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]

print(f"Found {len(items)} entries to update")

for item in items:
    pk = item["PK"]
    sk = item["SK"]
    intent = item.get("intent_category", "unknown")
    sub = item.get("sub_intent", "default")

    print(f"  {'[DRY RUN] ' if dry_run else ''}Updating {intent}:{sub} → {NEW_MODEL}")

    if not dry_run:
        table.update_item(
            Key={"PK": pk, "SK": sk},
            UpdateExpression="SET model_id = :new, updated_at = :ts, updated_by = :who",
            ExpressionAttributeValues={
                ":new": NEW_MODEL,
                ":ts": datetime.utcnow().isoformat(),
                ":who": "model-migration-script",
            },
        )

print(f"{'[DRY RUN] ' if dry_run else ''}Updated {len(items)} entries")

Step 3: Invalidate all routing caches

# Flush all route cache keys
redis-cli -h mangaassist-redis.xxxxx.apne1.cache.amazonaws.com \
  EVAL "local keys = redis.call('keys', 'route:*'); for i,k in ipairs(keys) do redis.call('del', k) end; return #keys" 0

# Publish global cache invalidation
redis-cli -h mangaassist-redis.xxxxx.apne1.cache.amazonaws.com \
  PUBLISH "routing:cache_invalidation" '{"action": "flush_all", "reason": "model_migration"}'

Step 4: Start canary testing before full rollout

# Use RoutingTableManager to stage the new routes with 5% canary traffic
# Monitor for 30 minutes before promoting
python -c "
from routing_table_manager import RoutingTableManager
mgr = RoutingTableManager()
for route in mgr.get_active_routes():
    if 'claude-3-5-sonnet' in route.model_id:
        mgr.start_test(route.route_id, traffic_pct=5.0, test_duration_minutes=30)
"

Verification

  • New model appears in CloudWatch metrics with non-zero invocation counts within 10 minutes
  • Routing table scan shows zero entries referencing old model ID
  • Redis cache contains updated model IDs
  • A/B test (if restarted) shows traffic split matching configuration

Prevention

  1. Model deployment automation: CDK/Terraform pipeline that atomically updates Bedrock model access + DynamoDB routing entries + Redis cache in a single deployment
  2. Model version health check: Lambda cron (every 15 minutes) that verifies all models in the routing table are receiving traffic. Alert if any model has zero invocations for > 30 minutes during business hours
  3. Routing table freshness metric: CloudWatch custom metric tracking the age of the oldest routing entry. Alert if any entry is older than the latest model deployment timestamp
  4. Deployment gate: CI/CD pipeline stage that queries the routing table post-deploy and fails the pipeline if expected model IDs are not present
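Prevention item 2 above reduces to a comparison between the set of models a deployment expects to serve traffic and the models actually seen in CloudWatch. A sketch of the check the cron Lambda would run; metric retrieval is stubbed out here (in practice the counts come from GetMetricStatistics over the lookback window), and the function name is an assumption:

```python
def models_missing_traffic(expected_models: set[str],
                           invocation_counts: dict[str, int]) -> set[str]:
    """Return models that are expected to serve traffic but recorded
    zero invocations in the lookback window."""
    return {m for m in expected_models if invocation_counts.get(m, 0) == 0}

# The 3-day blind spot in this scenario: the stale routing table never
# sends traffic to the new model, so it never appears in the metrics.
expected = {
    "anthropic.claude-3-sonnet-20240229-v1:0",
    "anthropic.claude-3-5-sonnet-20241022-v2:0",  # newly deployed
}
observed = {"anthropic.claude-3-sonnet-20240229-v1:0": 4120}

assert models_missing_traffic(expected, observed) == {
    "anthropic.claude-3-5-sonnet-20241022-v2:0"
}
```

A non-empty result from this check on Day 0 would have caught the stale routing table within one cron cycle instead of three days later.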

Key Takeaway

A model deployment is not complete until production traffic actually reaches the new model. Static routing tables create a decoupling risk — always include routing table updates and traffic verification in the deployment checklist, ideally automated in the same pipeline.


Scenario 3: Metric Collection Lag Causing Suboptimal Routing Decisions

Situation

The MetricBasedSelector ranks models using real-time metrics from ElastiCache Redis. During a morning traffic spike (09:00-10:00 JST), the Kinesis Data Stream backing the metric pipeline becomes throttled due to insufficient shard capacity. Metric aggregation falls behind by 8-12 minutes. The selector continues routing based on stale metrics that show Sonnet with low latency (from overnight low-traffic period), even though Sonnet's actual P95 latency has risen to 4.2 seconds due to the traffic surge. Users experience timeouts and degraded response quality.

Symptom Timeline

08:55 JST — Morning traffic ramp begins (Japan business hours)
09:05 JST — Kinesis shard iterator age rises above 30 seconds
09:10 JST — Metric aggregation Lambda falling behind by 2 minutes
09:15 JST — Redis metrics still showing overnight P95 latency (800ms for Sonnet)
09:20 JST — Actual Sonnet P95 latency at 3.1s (MetricBasedSelector unaware)
09:25 JST — User timeout rate crosses 2% (3-second WebSocket timeout)
09:30 JST — Customer complaints in support queue: "chatbot is slow"
09:35 JST — Kinesis iterator age at 8 minutes behind real-time
09:40 JST — On-call engineer paged via CloudWatch alarm on timeout rate
09:45 JST — Root cause identified: stale metrics driving bad routing
09:50 JST — Emergency: force all routing to Haiku via static override
09:55 JST — Timeout rate drops to 0.3%
10:15 JST — Kinesis resharding completes (2 → 4 shards)
10:30 JST — Metric pipeline catches up, dynamic routing re-enabled

Root Cause Analysis

The metric collection pipeline has a critical dependency on Kinesis Data Streams for buffering raw metric data points before aggregation. The Kinesis stream was provisioned with 2 shards, sufficient for average traffic (approximately 700 messages/minute) but insufficient for morning peak traffic (approximately 2,500 messages/minute).

When the stream throttled, the Lambda aggregator could not consume records fast enough. The aggregated metrics in Redis became stale — still reflecting overnight low-traffic latency numbers. The MetricBasedSelector trusted these stale metrics and continued ranking Sonnet highly for quality-sensitive queries, even though Sonnet's actual latency had degraded due to regional demand pressure.

The 3-second WebSocket timeout in MangaAssist meant that Sonnet responses taking 4+ seconds were dropped, resulting in user-visible errors.
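The 3-second budget also suggests a defensive check the selector could apply regardless of metric freshness: never pick a model whose believed P95 latency leaves no headroom under the WebSocket timeout. A sketch under assumed numbers (the 80% headroom factor is an illustrative choice, not an existing MangaAssist setting):

```python
TIMEOUT_S = 3.0   # WebSocket timeout from the MangaAssist target
HEADROOM = 0.8    # assumption: spend at most 80% of the budget on the model

def fits_budget(p95_latency_s: float) -> bool:
    """True if a model's believed P95 latency leaves room under the timeout."""
    return p95_latency_s <= TIMEOUT_S * HEADROOM

# The stale overnight figure (0.8s) passes the check, which is exactly
# what let Sonnet through; the real morning figure (4.2s) is excluded.
assert fits_budget(0.8) is True
assert fits_budget(4.2) is False
```

This check cannot fix stale inputs by itself, which is why the staleness detection added later in this runbook is still required.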

Contributing factors:

  • Kinesis shard capacity not auto-scaling (on-demand mode not enabled)
  • No "metric staleness" check in MetricBasedSelector
  • No fallback to latency-safe model when metrics are stale
  • Metric pipeline single point of failure (no direct Redis writes as backup)

Architecture Impact Diagram

flowchart TB
    subgraph Trigger["Traffic Spike"]
        SPIKE[Morning Ramp<br/>700 → 2,500 msg/min]
    end

    subgraph Pipeline["Metric Pipeline (Broken)"]
        KDS[Kinesis 2 Shards<br/>THROTTLED]
        LAMBDA[Lambda Aggregator<br/>8 min BEHIND]
        REDIS_STALE[Redis Metrics<br/>STALE: P95=800ms]
    end

    subgraph Routing["Routing (Bad Decisions)"]
        SELECTOR[MetricBasedSelector<br/>Trusts Stale Metrics]
        SONNET[Routes to Sonnet<br/>Actual P95=4.2s]
    end

    subgraph User["User Impact"]
        TIMEOUT[WebSocket Timeout<br/>3s limit exceeded]
        ERROR[User Error Rate<br/>0.1% → 2%+]
    end

    SPIKE --> KDS
    KDS -->|Throttled| LAMBDA
    LAMBDA -->|Can't update| REDIS_STALE
    REDIS_STALE --> SELECTOR
    SELECTOR --> SONNET
    SONNET --> TIMEOUT
    TIMEOUT --> ERROR

    style KDS fill:#f44336,color:#fff
    style REDIS_STALE fill:#FF9800,color:#fff
    style TIMEOUT fill:#f44336,color:#fff
    style ERROR fill:#f44336,color:#fff

Blast Radius

| Component | Impact | Severity |
|---|---|---|
| User timeout rate | 0.1% → 2%+ (20x increase) | CRITICAL |
| Response quality | Timeouts = zero response quality | CRITICAL |
| Metric accuracy | 8-12 minute lag, decisions based on stale data | HIGH |
| Routing optimality | Sonnet chosen when Haiku would have been within latency budget | HIGH |
| Customer satisfaction | Support tickets increased 5x during the incident | HIGH |
| Cost | Sonnet invocations wasted (timed out before response delivered) | MEDIUM |

Resolution Runbook

Step 1: Immediate — Force static routing override (< 2 minutes)

# Set emergency routing override: all traffic to Haiku
aws dynamodb put-item \
  --table-name MangaAssist-RoutingConfig \
  --item '{
    "PK": {"S": "GLOBAL#override"},
    "SK": {"S": "EMERGENCY#static_only"},
    "override_type": {"S": "force_haiku"},
    "enabled": {"BOOL": true},
    "reason": {"S": "Metric pipeline lag causing timeout spike"},
    "activated_by": {"S": "oncall-engineer"},
    "activated_at": {"S": "2026-03-31T09:50:00+09:00"},
    "ttl": {"N": "1774921800"}
  }' \
  --region ap-northeast-1

# Invalidate metric-based selector cache
redis-cli -h mangaassist-redis.xxxxx.apne1.cache.amazonaws.com \
  SET "routing:override" '{"mode": "static_only", "model": "haiku"}' EX 3600

# Publish override event
redis-cli -h mangaassist-redis.xxxxx.apne1.cache.amazonaws.com \
  PUBLISH "routing:emergency" '{"action": "force_haiku", "reason": "metric_lag"}'

Step 2: Fix Kinesis capacity

# Check current shard count and iterator age
aws kinesis describe-stream-summary \
  --stream-name MangaAssist-Metrics \
  --region ap-northeast-1

# Option A: Enable on-demand mode (recommended for variable traffic)
aws kinesis update-stream-mode \
  --stream-arn arn:aws:kinesis:ap-northeast-1:ACCOUNT:stream/MangaAssist-Metrics \
  --stream-mode-details StreamMode=ON_DEMAND \
  --region ap-northeast-1

# Option B: Manual reshard to 4 shards
aws kinesis update-shard-count \
  --stream-name MangaAssist-Metrics \
  --target-shard-count 4 \
  --scaling-type UNIFORM_SCALING \
  --region ap-northeast-1

Step 3: Add metric staleness detection to the selector

# Add this check to MetricBasedSelector.get_model_metrics()
def _is_metric_stale(self, metrics: ModelMetrics, max_age_seconds: int = 120) -> bool:
    """Check if metrics are too old to trust."""
    if not metrics.last_updated:
        return True
    try:
        from datetime import datetime
        updated = datetime.fromisoformat(metrics.last_updated)
        age = (datetime.utcnow() - updated).total_seconds()
        return age > max_age_seconds
    except (ValueError, TypeError):
        return True

# In rank_models(), add staleness handling:
def rank_models(self, query_type="general", override_weights=None):
    # ... existing code ...
    for model_id, metrics in all_metrics.items():
        if self._is_metric_stale(metrics):
            logger.warning("Stale metrics for %s — using conservative scores", model_id)
            # Penalize latency score for stale metrics (assume worst case)
            latency_score = 0.3  # Conservative — don't trust stale latency
            # ... rest of scoring uses conservative defaults

Step 4: Re-enable dynamic routing after pipeline catches up

# Verify Kinesis iterator age is back to near-real-time
aws cloudwatch get-metric-statistics \
  --namespace AWS/Kinesis \
  --metric-name GetRecords.IteratorAgeMilliseconds \
  --dimensions Name=StreamName,Value=MangaAssist-Metrics \
  --start-time $(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 60 \
  --statistics Maximum \
  --region ap-northeast-1

# Remove emergency override once metrics are fresh
redis-cli -h mangaassist-redis.xxxxx.apne1.cache.amazonaws.com \
  DEL "routing:override"

aws dynamodb delete-item \
  --table-name MangaAssist-RoutingConfig \
  --key '{"PK": {"S": "GLOBAL#override"}, "SK": {"S": "EMERGENCY#static_only"}}' \
  --region ap-northeast-1

Verification

  • Kinesis iterator age below 5 seconds (real-time)
  • Redis metric timestamps within 60 seconds of current time
  • Timeout rate below 0.5%
  • MetricBasedSelector logs show "fresh" metric reads (no staleness warnings)
  • Dynamic routing re-enabled with normal Sonnet/Haiku distribution

Prevention

  1. Kinesis on-demand mode: Eliminates shard capacity planning entirely; auto-scales to traffic
  2. Metric staleness circuit breaker: If metric age exceeds 2 minutes, MetricBasedSelector automatically falls back to conservative (latency-safe) model selection
  3. Direct Redis writes: Add a secondary metric path that writes critical metrics (latency, errors) directly to Redis from the ECS task, bypassing Kinesis entirely. Use Kinesis only for durable archival
  4. Kinesis iterator age alarm: CloudWatch alarm when GetRecords.IteratorAgeMilliseconds exceeds 30,000ms (30 seconds)
  5. Pre-spike capacity: Schedule Kinesis shard scaling 15 minutes before known traffic ramps (e.g., 08:45 JST for Japan morning)

Key Takeaway

Metric-based routing is only as good as the freshness of its metrics. Always implement staleness detection and conservative fallback behavior — stale metrics that look "good" are more dangerous than having no metrics at all.


Scenario 4: API Gateway Transformation Error Dropping Routing Headers

Situation

After a routine API Gateway deployment, the VTL (Velocity Template Language) mapping template for the WebSocket $connect route has a syntax error that silently drops the X-Route-Target and X-User-Tier headers from the integration request. The ECS Fargate orchestrator receives requests without routing context and falls through to the default route (full dynamic analysis) for every single request — including trivial greetings and FAQ queries that should be fast-pathed to Haiku.

This causes a 50% increase in average response latency (greetings now take 1.5s instead of 0.3s) and unnecessary compute spend on complexity analysis for queries that should never reach the dynamic router.

Symptom Timeline

11:00 JST — API Gateway deployment via CloudFormation stack update
11:05 JST — VTL template deployed with missing closing parenthesis
11:10 JST — First requests hit ECS without routing headers
11:15 JST — Dynamic router processing 100% of traffic (expected: 40%)
11:20 JST — Average response latency rises from 0.8s to 1.2s
11:30 JST — P95 latency rises from 1.5s to 2.4s
11:45 JST — CloudWatch alarm: "AverageLatency > 1.0s for 15 minutes"
11:50 JST — Investigation begins
12:00 JST — Root cause: VTL template dropping headers
12:10 JST — Rollback API Gateway deployment to previous stage
12:15 JST — Latency returns to normal

Root Cause Analysis

The VTL mapping template translates incoming WebSocket messages into backend integration requests with routing headers. The updated template had a syntax error on the $util.escapeJavaScript() call — a missing closing parenthesis caused the VTL engine to silently fail and produce an empty mapping for the affected headers.

## Broken VTL (line 14)
#set($route = $util.escapeJavaScript($input.path('$.route'))    <-- missing closing paren

## Correct VTL (line 14)
#set($route = $util.escapeJavaScript($input.path('$.route')))

API Gateway VTL template errors are not surfaced as 5xx errors. Instead, the template produces partial output — the request body is passed through, but the computed headers are empty strings. The ECS task receives the request with X-Route-Target: "" and X-User-Tier: "", which the APIGatewayRouteTransformer interprets as "no routing context" and falls back to the default full dynamic analysis path.

Contributing factors:

  • No VTL template validation in the CI/CD pipeline
  • API Gateway does not fail loudly on VTL syntax errors
  • No integration test that verifies header presence after deployment
  • No monitoring on "percentage of requests with routing headers present"

Architecture Impact Diagram

flowchart TB
    subgraph Deploy["API Gateway Deployment"]
        CFN[CloudFormation Update]
        VTL[VTL Template<br/>SYNTAX ERROR]
    end

    subgraph Gateway["API Gateway (Broken)"]
        WS[WebSocket $connect]
        TRANSFORM[Request Transform<br/>Headers: EMPTY]
    end

    subgraph ECS["ECS Fargate"]
        RECEIVE[Receive Request<br/>No routing headers]
        FALLBACK[Default: Full Dynamic Analysis<br/>100% of traffic]
    end

    subgraph Impact["Impact"]
        LATENCY[Latency Increase<br/>Greetings: 0.3s → 1.5s]
        COMPUTE[Unnecessary Compute<br/>ComplexityScorer on trivial queries]
        COST[Wasted Analysis Cost<br/>Step Functions invocations up 150%]
    end

    CFN --> VTL
    VTL --> WS
    WS --> TRANSFORM
    TRANSFORM -->|Empty headers| RECEIVE
    RECEIVE --> FALLBACK
    FALLBACK --> LATENCY
    FALLBACK --> COMPUTE
    FALLBACK --> COST

    style VTL fill:#f44336,color:#fff
    style TRANSFORM fill:#FF9800,color:#fff
    style FALLBACK fill:#FF9800,color:#fff

Blast Radius

| Component | Impact | Severity |
|---|---|---|
| Average latency | 0.8s → 1.2s (50% increase) | HIGH |
| Greeting latency | 0.3s → 1.5s (5x increase) | HIGH |
| Dynamic router load | 100% of traffic (expected: 40%) | MEDIUM |
| Step Functions cost | 150% increase in state machine executions | MEDIUM |
| User experience | Noticeable slowdown on simple queries | MEDIUM |
| Error rate | Unchanged (requests still succeed, just slower) | NONE |

Resolution Runbook

Step 1: Identify the broken VTL template

# Export current API Gateway stage configuration
aws apigatewayv2 get-integration \
  --api-id YOUR_API_ID \
  --integration-id YOUR_INTEGRATION_ID \
  --region ap-northeast-1

# Check recent deployments
aws apigatewayv2 get-deployments \
  --api-id YOUR_API_ID \
  --region ap-northeast-1 \
  --max-results 5

Step 2: Rollback to previous API Gateway stage

# List stages to find previous deployment
aws apigatewayv2 get-stages \
  --api-id YOUR_API_ID \
  --region ap-northeast-1

# Update stage to point to previous (working) deployment
aws apigatewayv2 update-stage \
  --api-id YOUR_API_ID \
  --stage-name production \
  --deployment-id PREVIOUS_DEPLOYMENT_ID \
  --region ap-northeast-1

Step 3: Fix the VTL template and redeploy

# Corrected VTL template (fix the missing parenthesis)
# Then redeploy via CloudFormation with the fixed template
aws cloudformation update-stack \
  --stack-name MangaAssist-APIGateway \
  --template-body file://corrected-template.yaml \
  --region ap-northeast-1

Step 4: Add header presence validation to ECS

# Add to the ECS route controller: validate routing headers
def validate_routing_headers(self, headers: dict[str, str]) -> bool:
    """Check that API Gateway sent valid routing headers."""
    required = ["X-Route-Target", "X-User-Tier"]
    for header in required:
        value = headers.get(header, "")
        if not value or value.strip() == "":
            logger.warning(
                "Missing routing header: %s — API Gateway transform may be broken",
                header,
            )
            # Emit metric for monitoring
            self._emit_metric("missing_routing_header", 1, {"header": header})
            return False
    return True

Verification

  • X-Route-Target header present on 100% of requests reaching ECS
  • Static router handling 60%+ of traffic (greetings, FAQ, order status)
  • Dynamic router handling approximately 40% of traffic (complex queries)
  • Average latency returns to 0.8s baseline
  • Greeting response latency returns to 0.3s

Prevention

  1. VTL template validation in CI/CD: Use aws apigateway test-invoke-method or a custom VTL parser to validate templates before deployment
  2. Integration tests post-deployment: Automated test that sends a known query through API Gateway and verifies all expected headers are present on the ECS side
  3. Header presence monitoring: CloudWatch custom metric tracking percentage of requests with valid routing headers. Alert if below 95%
  4. Canary deployment: Deploy API Gateway changes to a canary stage first, monitor for 10 minutes, then promote to production
  5. VTL template versioning: Store VTL templates as versioned files in the code repository with automated diff review
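Prevention item 1's "custom VTL parser" does not need to be a full parser to catch this class of defect: a simple unbalanced-parenthesis check on each #set line would have flagged the broken template in CI. A deliberately small sketch (a lint pass, not a VTL interpreter; it assumes one directive per line):

```python
def unbalanced_set_lines(template: str) -> list[int]:
    """Return 1-based line numbers of #set directives whose parentheses
    do not balance — the exact defect in this incident."""
    bad = []
    for lineno, line in enumerate(template.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#set") and \
                stripped.count("(") != stripped.count(")"):
            bad.append(lineno)
    return bad

# The two templates from the root cause analysis:
broken = "#set($route = $util.escapeJavaScript($input.path('$.route'))"
fixed = "#set($route = $util.escapeJavaScript($input.path('$.route')))"
assert unbalanced_set_lines(broken) == [1]
assert unbalanced_set_lines(fixed) == []
```

Wired into the pipeline as a pre-deploy step, a non-empty result fails the build before API Gateway ever sees the template.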

Key Takeaway

API Gateway VTL template errors fail silently — they produce partial output instead of errors. Always validate VTL templates before deployment and monitor for header presence on the backend, not just HTTP status codes.


Scenario 5: Shadow Routing Doubling FM Costs Without Intended A/B Split

Situation

A data science team member configures shadow routing to compare Claude 3 Sonnet against Claude 3 Haiku for product recommendation queries. The intention was to shadow 10% of traffic. However, due to a misconfiguration, the ShadowRouter is instantiated with shadow_traffic_pct=100 and the shadow model is set to Sonnet. This means every single query to the system now invokes both Haiku (primary, serving the user) and Sonnet (shadow, discarded) — effectively doubling FM costs for the entire platform. The shadow invocations also compete for Bedrock throughput, contributing to increased primary model latency.

Symptom Timeline

Day 1, 09:00 JST — Shadow routing config deployed via feature flag
Day 1, 09:05 JST — ShadowRouter starts with shadow_traffic_pct=100 (intended: 10)
Day 1, 09:30 JST — Bedrock invocation count doubles (Haiku + Sonnet per query)
Day 1, 12:00 JST — Midday cost check shows $1,600 spend (expected: $800)
Day 1, 14:00 JST — AWS Budgets alert: 200% of daily Bedrock forecast
Day 1, 14:15 JST — Primary (Haiku) P95 latency increases 30% due to shared throughput
Day 1, 14:30 JST — Investigation begins
Day 1, 14:45 JST — Root cause: shadow_traffic_pct=100, shadow_model=Sonnet
Day 1, 14:50 JST — Shadow routing disabled via feature flag
Day 1, 15:00 JST — Costs and latency return to baseline
Day 1 end       — Total overspend: ~$1,800 (≈5.75 hours of 100% Sonnet shadow at $7,500/day)

Root Cause Analysis

The ShadowRouter configuration was set via a feature flag in DynamoDB:

{
  "shadow_enabled": true,
  "shadow_model_id": "anthropic.claude-3-sonnet-20240229-v1:0",
  "shadow_traffic_pct": 100,
  "shadow_max_tokens": 1024
}

The data scientist intended shadow_traffic_pct: 10 but entered 100. There was no validation guard preventing a 100% shadow rate. Additionally, the shadow model was set to Sonnet ($3/$15 per 1M tokens) — the most expensive available model — which is the worst-case scenario for cost.

At 1M messages/day with an average of 500 input and 400 output tokens per query:

  • Normal Haiku cost: 1M queries * (500 * $0.25 + 400 * $1.25) / 1M = $625/day
  • Shadow Sonnet cost at 100%: 1M queries * (500 * $3.00 + 400 * $15.00) / 1M = $7,500/day
  • Total with shadow: $8,125/day (13x normal cost)
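The arithmetic is worth encoding once so that cost gates and dashboards agree — a minimal sketch using the token averages and per-1M-token prices stated in this runbook:

```python
def daily_model_cost(queries: int, in_tokens: int, out_tokens: int,
                     in_price_per_m: float, out_price_per_m: float) -> float:
    """Daily spend in USD: per-query token usage times per-1M-token price."""
    per_query = (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1_000_000
    return queries * per_query

QUERIES_PER_DAY = 1_000_000
haiku = daily_model_cost(QUERIES_PER_DAY, 500, 400, 0.25, 1.25)          # $625/day
sonnet_shadow = daily_model_cost(QUERIES_PER_DAY, 500, 400, 3.00, 15.00)  # $7,500/day
total = haiku + sonnet_shadow                                             # $8,125/day
```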

The shadow invocations were fire-and-forget (responses discarded), so there was no user-visible quality impact. However, the shadow invocations consumed Bedrock account-level throughput, causing the primary Haiku invocations to experience increased latency from contention.

Contributing factors:

  • No upper-bound validation on shadow_traffic_pct (should cap at 25%)
  • No cost impact estimation before enabling shadow routing
  • Feature flag changes did not require approval for cost-impacting parameters
  • No alert on shadow invocation count exceeding its expected threshold
  • Shadow invocations use the same Bedrock account/region, sharing throughput limits
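Several of these gaps can be closed at flag-write time, before the change ever reaches the router. A hedged sketch of such a validator — the field names match the DynamoDB flag payload above, but the function name, caps, and token averages are this runbook's assumptions, not an existing API:

```python
MAX_SHADOW_TRAFFIC_PCT = 25.0       # hard cap proposed in this runbook
MAX_SHADOW_DAILY_COST_USD = 1000.0  # admits the intended 10% Sonnet shadow (~$750/day)

# USD per 1M tokens (Bedrock Claude 3 pricing used throughout this runbook)
PRICES_PER_M = {
    "anthropic.claude-3-haiku-20240307-v1:0": (0.25, 1.25),
    "anthropic.claude-3-sonnet-20240229-v1:0": (3.00, 15.00),
}


def validate_shadow_flag(flag: dict) -> list:
    """Return human-readable problems; an empty list means OK to persist."""
    problems = []
    pct = flag.get("shadow_traffic_pct", 0)
    if not 0 <= pct <= MAX_SHADOW_TRAFFIC_PCT:
        problems.append(f"shadow_traffic_pct={pct} outside [0, {MAX_SHADOW_TRAFFIC_PCT}]")
    model = flag.get("shadow_model_id", "")
    if model not in PRICES_PER_M:
        problems.append(f"unknown shadow_model_id: {model!r}")
    else:
        in_p, out_p = PRICES_PER_M[model]
        # 1M queries/day, 500 input / 400 output tokens on average
        daily = 1_000_000 * (pct / 100) * (500 * in_p + 400 * out_p) / 1_000_000
        if daily > MAX_SHADOW_DAILY_COST_USD:
            problems.append(f"estimated shadow cost ${daily:.0f}/day exceeds cap")
    return problems
```

Running this as a precondition of the DynamoDB update-item call would have rejected shadow_traffic_pct=100 on two grounds (traffic bound and cost bound).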

Architecture Impact Diagram

flowchart TB
    subgraph Config["Misconfiguration"]
        FLAG[Feature Flag<br/>shadow_traffic_pct: 100<br/>shadow_model: Sonnet]
    end

    subgraph Normal["Normal Path (Still Working)"]
        QUERY[Every Query<br/>1M/day]
        HAIKU[Haiku Primary<br/>$625/day]
        RESPONSE[User Response<br/>From Haiku]
    end

    subgraph Shadow["Shadow Path (Unintended 100%)"]
        SHADOW[ShadowRouter<br/>100% duplication]
        SONNET[Sonnet Shadow<br/>$7,500/day<br/>RESPONSES DISCARDED]
    end

    subgraph Impact["Cost + Performance Impact"]
        COST[Daily Cost<br/>$625 → $8,125<br/>13x increase]
        THROUGHPUT[Bedrock Throughput<br/>Shared contention]
        LATENCY[Haiku P95 Latency<br/>+30% from contention]
    end

    FLAG --> SHADOW
    QUERY --> HAIKU
    QUERY --> SHADOW
    HAIKU --> RESPONSE
    SHADOW --> SONNET
    SONNET -->|Discarded| WASTE[Wasted Responses]

    SONNET --> COST
    SONNET --> THROUGHPUT
    THROUGHPUT --> LATENCY

    style FLAG fill:#f44336,color:#fff
    style SONNET fill:#f44336,color:#fff
    style COST fill:#f44336,color:#fff
    style WASTE fill:#9E9E9E,color:#fff

Blast Radius

Component Impact Severity
Daily FM cost $625 → $8,125 (13x increase) CRITICAL
Bedrock throughput Consumed by shadow invocations, starving primary HIGH
Primary latency Haiku P95 up 30% from throughput contention MEDIUM
Shadow data value 100% shadow defeats purpose (no A/B comparison possible) MEDIUM
Monthly budget A full day at the shadow rate would add $7,500 HIGH
User experience No quality degradation, slight latency increase LOW

Resolution Runbook

Step 1: Immediately disable shadow routing (< 2 minutes)

# Disable shadow via feature flag
aws dynamodb update-item \
  --table-name MangaAssist-FeatureFlags \
  --key '{"PK": {"S": "FLAG#shadow_routing"}, "SK": {"S": "CONFIG"}}' \
  --update-expression "SET shadow_enabled = :disabled, disabled_by = :who, disabled_at = :ts, disable_reason = :reason" \
  --expression-attribute-values '{
    ":disabled": {"BOOL": false},
    ":who": {"S": "oncall-engineer"},
    ":ts": {"S": "2026-03-31T14:50:00+09:00"},
    ":reason": {"S": "Cost spike - shadow_traffic_pct was 100 instead of 10"}
  }' \
  --region ap-northeast-1

# Force immediate effect via Redis
redis-cli -h mangaassist-redis.xxxxx.apne1.cache.amazonaws.com \
  SET "feature:shadow_routing:enabled" "false" EX 3600
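The Redis SET takes immediate effect only if the orchestrator consults Redis before the DynamoDB flag. A minimal sketch of that precedence, with the key name from the command above; the getter is injected so the logic is testable without a live cluster:

```python
def shadow_enabled(redis_get, dynamo_flag: dict) -> bool:
    """Decide whether shadow routing is on.

    The Redis override wins because it propagates in seconds; the DynamoDB
    item remains the durable source of truth once the override key expires.
    redis_get: callable returning the cached string for a key, or None.
    dynamo_flag: the flag item as read from MangaAssist-FeatureFlags.
    """
    override = redis_get("feature:shadow_routing:enabled")
    if override is not None:
        return override == "true"  # the runbook sets the literal strings "true"/"false"
    return bool(dynamo_flag.get("shadow_enabled", False))
```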

Step 2: Verify shadow invocations have stopped

# Check Sonnet invocation rate (should drop to near-zero for shadow)
aws cloudwatch get-metric-statistics \
  --namespace MangaAssist/Routing \
  --metric-name ShadowInvocationCount \
  --start-time $(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 60 \
  --statistics Sum \
  --region ap-northeast-1

Step 3: Calculate actual cost impact

# Get today's Bedrock cost
aws ce get-cost-and-usage \
  --time-period Start=$(date -u +%Y-%m-%d),End=$(date -u -d 'tomorrow' +%Y-%m-%d) \
  --granularity DAILY \
  --filter '{"Dimensions": {"Key": "SERVICE", "Values": ["Amazon Bedrock"]}}' \
  --metrics BlendedCost UnblendedCost

# Get shadow-specific cost from internal metrics
redis-cli -h mangaassist-redis.xxxxx.apne1.cache.amazonaws.com \
  GET "shadow:total_shadow_cost"

Step 4: Add validation guards before re-enabling

# Add to ShadowRouter.__init__():
class ShadowRouter:
    MAX_SHADOW_TRAFFIC_PCT = 25.0  # Hard cap
    MAX_SHADOW_DAILY_COST_USD = 1000.0  # Daily cost circuit breaker; must admit the intended 10% Sonnet shadow (~$750/day estimated)

    # USD per 1M tokens — Bedrock Claude 3 pricing used throughout this runbook
    COSTS = {
        "anthropic.claude-3-haiku-20240307-v1:0": {"input": 0.25, "output": 1.25},
        "anthropic.claude-3-sonnet-20240229-v1:0": {"input": 3.00, "output": 15.00},
    }

    def __init__(self, config: ShadowConfig, **kwargs):  # other collaborators elided
        # Validate shadow traffic percentage
        if config.shadow_traffic_pct > self.MAX_SHADOW_TRAFFIC_PCT:
            logger.error(
                "Shadow traffic %.1f%% exceeds maximum %.1f%% — capping",
                config.shadow_traffic_pct,
                self.MAX_SHADOW_TRAFFIC_PCT,
            )
            config.shadow_traffic_pct = self.MAX_SHADOW_TRAFFIC_PCT

        # Estimate daily cost and warn
        estimated_daily_cost = self._estimate_daily_shadow_cost(config)
        if estimated_daily_cost > self.MAX_SHADOW_DAILY_COST_USD:
            logger.warning(
                "Shadow routing estimated cost: $%.2f/day (limit: $%.2f) — DISABLING",
                estimated_daily_cost,
                self.MAX_SHADOW_DAILY_COST_USD,
            )
            config.enabled = False

    def _estimate_daily_shadow_cost(self, config: ShadowConfig) -> float:
        """Estimate daily shadow cost based on traffic and model pricing."""
        daily_queries = 1_000_000  # MangaAssist scale
        shadow_queries = daily_queries * (config.shadow_traffic_pct / 100)
        avg_input_tokens = 500
        avg_output_tokens = 400
        costs = self.COSTS.get(config.shadow_model_id, {"input": 0, "output": 0})
        return (
            shadow_queries
            * (avg_input_tokens * costs["input"] + avg_output_tokens * costs["output"])
            / 1_000_000
        )
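The MAX_SHADOW_DAILY_COST_USD check above runs only at initialization; the runtime counterpart (Prevention item 3 below) can be sketched as a small accumulator the router consults before each shadow invocation. In production the counter would live in Redis with a daily expiry so all ECS tasks share it; it is in-process here for illustration:

```python
class ShadowCostBreaker:
    """Auto-disables shadow routing once cumulative daily spend crosses a limit."""

    def __init__(self, daily_limit_usd: float):
        self.daily_limit_usd = daily_limit_usd
        self.spent_today_usd = 0.0  # in production: a shared Redis counter
        self.tripped = False

    def record(self, cost_usd: float) -> None:
        """Add the cost of one shadow invocation; trip the breaker at the limit."""
        self.spent_today_usd += cost_usd
        if self.spent_today_usd >= self.daily_limit_usd:
            self.tripped = True  # router must stop issuing shadow invocations

    def allow_shadow(self) -> bool:
        return not self.tripped


# Example: one Sonnet shadow query costs (500 * $3 + 400 * $15) / 1M = $0.0075
breaker = ShadowCostBreaker(daily_limit_usd=1000.0)
breaker.record(0.0075)
```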

Step 5: Re-enable shadow routing with correct configuration

# Re-enable with validated parameters
aws dynamodb update-item \
  --table-name MangaAssist-FeatureFlags \
  --key '{"PK": {"S": "FLAG#shadow_routing"}, "SK": {"S": "CONFIG"}}' \
  --update-expression "SET shadow_enabled = :enabled, shadow_traffic_pct = :pct, shadow_model_id = :model, updated_by = :who" \
  --expression-attribute-values '{
    ":enabled": {"BOOL": true},
    ":pct": {"N": "10"},
    ":model": {"S": "anthropic.claude-3-sonnet-20240229-v1:0"},
    ":who": {"S": "data-science-lead"}
  }' \
  --region ap-northeast-1

Verification

  • Shadow invocation rate matches expected 10% of total traffic
  • Daily Bedrock cost projection within normal range ($1,900-$2,200)
  • Primary model (Haiku) P95 latency back to baseline (< 1.2s)
  • Shadow cost tracking metric shows reasonable daily projection
  • No Bedrock throttling events

Prevention

  1. Shadow traffic hard cap: ShadowRouter rejects configurations above 25% traffic without an explicit override flag
  2. Cost estimation gate: Before enabling shadow routing, calculate and display estimated daily cost. Require explicit acknowledgment for costs exceeding $200/day
  3. Shadow cost circuit breaker: Real-time tracking of shadow invocation costs. Auto-disable shadow routing if cumulative daily shadow cost exceeds threshold
  4. Feature flag approval workflow: Cost-impacting feature flag changes require team lead approval via Slack bot or Step Functions approval workflow
  5. Separate throughput allocation: Use Bedrock Provisioned Throughput for shadow invocations to isolate from primary traffic (or use a separate AWS account)
  6. Shadow budget alarm: CloudWatch alarm on cumulative shadow cost metric, alerting at 50% and 80% of daily shadow budget

Key Takeaway

Shadow routing trades money for information — and at 100% shadow rate with an expensive model, the cost can exceed the primary workload by an order of magnitude. Always enforce hard caps on shadow traffic percentage and implement cost circuit breakers that auto-disable shadow routing when spend exceeds thresholds.


Scenario Comparison Matrix

Dimension Scenario 1: Threshold Misconfig Scenario 2: Stale Model IDs Scenario 3: Metric Lag Scenario 4: VTL Header Drop Scenario 5: Shadow Cost Spike
Root Cause Category Configuration error Deployment gap Capacity planning Template syntax Validation gap
Detection Time 5 hours 3 days 45 minutes 45 minutes 5.5 hours
User Impact None (better model, just costly) None (old model still works) Timeouts (2%+ error rate) Slower responses Slight latency increase
Cost Impact +$2,300/day None Wasted Sonnet spend +Step Functions cost +$7,500/day
Detection Method Cost anomaly alert Manual A/B review Timeout rate alarm Latency alarm Budget alert
Fix Time 15 minutes 2 hours 30 minutes 15 minutes 5 minutes
Could Have Been Prevented Threshold guard rail Deployment automation On-demand Kinesis VTL validation in CI Traffic cap validation

Cross-Scenario Prevention Framework

flowchart TB
    subgraph Preventive["Preventive Controls"]
        VAL[Input Validation<br/>Thresholds, percentages, model IDs]
        AUTO[Deployment Automation<br/>Routing table in same pipeline]
        CAP[Capacity Planning<br/>Kinesis on-demand mode]
        CI[CI/CD Validation<br/>VTL parsing, integration tests]
    end

    subgraph Detective["Detective Controls"]
        COST_MON[Cost Monitoring<br/>Budget alerts at 130%, 150%, 200%]
        METRIC_FRESH[Metric Freshness<br/>Staleness > 2min alarm]
        HEADER_MON[Header Presence<br/>Missing header rate alarm]
        TRAFFIC_MON[Traffic Distribution<br/>Model split deviation alarm]
    end

    subgraph Corrective["Corrective Controls"]
        CIRCUIT[Cost Circuit Breaker<br/>Auto-disable at threshold]
        FALLBACK[Stale Metric Fallback<br/>Conservative routing]
        ROLLBACK[Auto-Rollback<br/>Deploy fails → revert]
        OVERRIDE[Emergency Override<br/>Force static routing]
    end

    VAL -->|Prevents| S1[Scenario 1 + 5]
    AUTO -->|Prevents| S2[Scenario 2]
    CAP -->|Prevents| S3[Scenario 3]
    CI -->|Prevents| S4[Scenario 4]

    COST_MON -->|Detects| S15[Scenario 1 + 5]
    METRIC_FRESH -->|Detects| S3B[Scenario 3]
    HEADER_MON -->|Detects| S4B[Scenario 4]
    TRAFFIC_MON -->|Detects| S12[Scenario 1 + 2]

    CIRCUIT -->|Corrects| S15C[Scenario 1 + 5]
    FALLBACK -->|Corrects| S3C[Scenario 3]
    ROLLBACK -->|Corrects| S4C[Scenario 4]
    OVERRIDE -->|Corrects| SALL[All Scenarios]

    style VAL fill:#4CAF50,color:#fff
    style AUTO fill:#4CAF50,color:#fff
    style CAP fill:#4CAF50,color:#fff
    style CI fill:#4CAF50,color:#fff
    style COST_MON fill:#2196F3,color:#fff
    style METRIC_FRESH fill:#2196F3,color:#fff
    style HEADER_MON fill:#2196F3,color:#fff
    style TRAFFIC_MON fill:#2196F3,color:#fff
    style CIRCUIT fill:#FF9800,color:#fff
    style FALLBACK fill:#FF9800,color:#fff
    style ROLLBACK fill:#FF9800,color:#fff
    style OVERRIDE fill:#FF9800,color:#fff

Emergency Quick Reference

Emergency Immediate Action Command
Cost spike from routing Force all traffic to Haiku redis-cli SET "routing:override" '{"mode":"force_haiku"}' EX 3600
Timeout spike Force static routing (skip dynamic analysis) redis-cli SET "routing:override" '{"mode":"static_only"}' EX 3600
Shadow cost runaway Disable shadow routing redis-cli SET "feature:shadow_routing:enabled" "false" EX 3600
Stale metrics Switch to static-only routing redis-cli SET "routing:override" '{"mode":"static_only"}' EX 3600
API Gateway broken Rollback to previous deployment aws apigatewayv2 update-stage --api-id API_ID --stage-name prod --deployment-id PREV_ID
All else fails Emergency maintenance page Update API Gateway to return static maintenance JSON
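These overrides only help if the route controller honours them — a hedged sketch of the consuming side (the key name and payload shapes match the table; parsing fails open to normal routing rather than crashing on a malformed override):

```python
import json

VALID_MODES = ("force_haiku", "static_only")


def effective_routing_mode(redis_get) -> str:
    """Map the routing:override Redis key to a routing mode.

    Returns "normal" when no override is set, or when the payload is
    malformed — a bad emergency override must never take the router down.
    redis_get: callable returning the cached string for a key, or None.
    """
    raw = redis_get("routing:override")
    if not raw:
        return "normal"
    try:
        mode = json.loads(raw).get("mode", "normal")
    except (ValueError, AttributeError):
        return "normal"
    return mode if mode in VALID_MODES else "normal"
```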

References

Resource Link
AWS Well-Architected — Operational Excellence https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/welcome.html
Amazon Bedrock Quotas https://docs.aws.amazon.com/bedrock/latest/userguide/quotas.html
Kinesis Data Streams On-Demand https://docs.aws.amazon.com/streams/latest/dev/how-do-i-size-a-stream.html
API Gateway VTL Reference https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-mapping-template-reference.html
AWS Budgets Actions https://docs.aws.amazon.com/cost-management/latest/userguide/budgets-controls.html
CloudWatch Anomaly Detection https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html