Cost-Effective Model Selection — Scenarios and Runbooks
MangaAssist Context: JP Manga store chatbot running on AWS. Bedrock Claude 3 Sonnet ($3/$15 per 1M input/output tokens) handles complex queries; Haiku ($0.25/$1.25 per 1M input/output tokens) handles simple ones. 1M messages/day across product search, order status, manga recommendations, and Q&A. Infrastructure: OpenSearch Serverless, DynamoDB, ECS Fargate, API Gateway WebSocket, ElastiCache Redis.
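A back-of-the-envelope check of what these prices mean per message (the token counts below are illustrative assumptions, not figures from this runbook):

```python
# Illustrative per-message cost from the published Bedrock prices above.
# The 2,000-input / 300-output token mix is an assumption for illustration.
SONNET = {"in": 3.00 / 1_000_000, "out": 15.00 / 1_000_000}  # $ per token
HAIKU = {"in": 0.25 / 1_000_000, "out": 1.25 / 1_000_000}

def cost_per_message(price: dict, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one chat turn at the given per-token prices."""
    return input_tokens * price["in"] + output_tokens * price["out"]

sonnet_cost = cost_per_message(SONNET, 2_000, 300)  # 0.0105
haiku_cost = cost_per_message(HAIKU, 2_000, 300)    # 0.000875

print(f"Sonnet: ${sonnet_cost:.6f}/msg, Haiku: ${haiku_cost:.6f}/msg")
```

At this mix, Haiku is 12x cheaper per message, which is why routing even a modest share of traffic to the wrong tier moves the daily bill materially.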
Skill Mapping
| AWS AIP-C01 Domain | Task | Skill | This File Covers |
|---|---|---|---|
| Domain 4 Operational Efficiency | Task 4.1 Cost Optimization | 4.1.2 Cost-Effective Model Selection | 5 production scenarios — wrong model, wasteful routing, budget overrun, quality regression, unmapped intent |
Scenario Overview
| # | Scenario | Core Problem | Severity |
|---|---|---|---|
| 1 | Haiku producing poor manga recommendations | Wrong model for complex task | HIGH |
| 2 | Sonnet routing for simple "where's my order" queries | Wasteful over-routing | MEDIUM |
| 3 | Budget overrun during manga sale event | Traffic spike exhausts daily LLM budget | CRITICAL |
| 4 | Quality regression after model tier rebalancing | Routing change degraded user experience | HIGH |
| 5 | New intent manga_review defaults to expensive Sonnet | Unmapped intent hits most expensive model | MEDIUM |
Scenario 1: Haiku Producing Poor Manga Recommendations
Problem Statement
After a cost optimization initiative, the recommendation intent was experimentally routed to Haiku to reduce spend. Users began reporting irrelevant recommendations — "I asked for manga like Berserk and got generic shounen suggestions." Satisfaction for recommendations dropped from 94% to 61%.
Decision Tree
```mermaid
flowchart TD
    ALERT["ALERT: Recommendation<br/>satisfaction dropped to 61%<br/>(threshold: 80%)"] --> CHECK_MODEL{"Which model is<br/>serving recommendations?"}
    CHECK_MODEL -->|Haiku| ROOT["ROOT CAUSE:<br/>Haiku lacks reasoning depth<br/>for nuanced recommendations"]
    CHECK_MODEL -->|Sonnet| OTHER["Investigate other causes:<br/>prompt drift, data quality, guardrails"]
    ROOT --> FIX1["IMMEDIATE: Revert<br/>recommendation → Sonnet"]
    FIX1 --> VERIFY{"Satisfaction<br/>recovering?"}
    VERIFY -->|Yes, back to ~90%| RESOLVED["RESOLVED<br/>Document: recommendation<br/>requires Sonnet-tier reasoning"]
    VERIFY -->|No| INVESTIGATE["Investigate:<br/>- Prompt template changed?<br/>- OpenSearch index stale?<br/>- Guardrails blocking content?"]
    INVESTIGATE --> FIX2["Fix secondary issue<br/>+ keep Sonnet routing"]
    ROOT --> PREVENT["PREVENTION:<br/>1. Add recommendation to<br/> 'Sonnet-required' locked list<br/>2. A/B test before downgrade<br/>3. Quality gate: auto-rollback<br/> if satisfaction < 80%"]
    style ALERT fill:#e74c3c,color:#fff
    style ROOT fill:#f39c12,color:#fff
    style FIX1 fill:#2ecc71,color:#fff
    style RESOLVED fill:#2ecc71,color:#fff
    style PREVENT fill:#3498db,color:#fff
```
Runbook
Detection
- CloudWatch alarm: `recommendation_satisfaction_rate < 0.80` for 15 minutes.
- Metric source: User thumbs-up/down feedback aggregated per intent per model.
- Dashboard query:

  ```sql
  SELECT AVG(satisfaction)
  FROM mangaassist.user_feedback
  WHERE intent = 'recommendation'
    AND timestamp > NOW() - INTERVAL 1 HOUR
  GROUP BY model_tier;
  ```
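The alarm above can be provisioned programmatically; a boto3 sketch, where the namespace, metric name, and SNS topic ARN are placeholder assumptions to adapt to your own metric publishing:

```python
# Alarm definition for the satisfaction threshold: 3 consecutive 5-minute
# datapoints below 0.80 equals "15 minutes below threshold".
# Namespace, metric name, and topic ARN are illustrative placeholders.
ALARM_PARAMS = {
    "AlarmName": "mangaassist-recommendation-satisfaction-low",
    "Namespace": "MangaAssist/Quality",
    "MetricName": "recommendation_satisfaction_rate",
    "Statistic": "Average",
    "Period": 300,            # 5-minute datapoints
    "EvaluationPeriods": 3,   # 3 x 5 min = 15 minutes
    "Threshold": 0.80,
    "ComparisonOperator": "LessThanThreshold",
    "TreatMissingData": "notBreaching",
    "AlarmActions": ["arn:aws:sns:ap-northeast-1:123456789012:mangaassist-oncall"],
}

def create_alarm(params: dict, client=None) -> None:
    """Create or update the CloudWatch alarm from the parameter dict."""
    if client is None:
        import boto3  # deferred so the module imports without AWS dependencies
        client = boto3.client("cloudwatch")
    client.put_metric_alarm(**params)
```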
Root Cause Analysis
| Factor | Expected | Actual | Verdict |
|---|---|---|---|
| Model tier for `recommendation` | Sonnet | Haiku | Incorrect routing |
| Haiku quality score for recommendations | N/A (untested) | 6.2/10 | Below threshold |
| Prompt template | v3 (unchanged) | v3 | Not the cause |
| OpenSearch index freshness | < 1 hour | 45 min | Not the cause |
Root cause: Haiku cannot perform the multi-step reasoning required for personalized manga recommendations. Queries like "manga like Berserk but less dark and more hopeful" require understanding thematic nuance, tone comparison, and reader preference modeling — capabilities where Sonnet scores 9.4 and Haiku scores 6.2.
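The tier decision implied by these quality scores can be sketched as a small helper; the quality bar and tier ordering are illustrative assumptions, not the production router:

```python
# Pick the cheapest model whose measured quality clears the bar.
# QUALITY_BAR is an assumed threshold; Template is excluded because
# recommendations cannot be served from static templates.
QUALITY_BAR = 8.0
TIERS = ["haiku", "sonnet"]  # ordered cheapest first

def required_tier(scores: dict) -> str:
    """Return the cheapest tier meeting the bar, else the strongest model."""
    for tier in TIERS:
        if scores.get(tier, 0.0) >= QUALITY_BAR:
            return tier
    return TIERS[-1]

# Scenario 1 numbers: Haiku 6.2 vs Sonnet 9.4 means recommendation needs Sonnet.
```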
Resolution Steps
- Immediate (< 5 min): Update routing config in DynamoDB:

  ```python
  # Revert recommendation intent to Sonnet
  dynamodb.update_item(
      TableName="mangaassist-routing-config",
      Key={"intent": {"S": "recommendation"}},
      UpdateExpression="SET model_tier = :tier, updated_by = :by, updated_at = :ts",
      ExpressionAttributeValues={
          ":tier": {"S": "sonnet"},
          ":by": {"S": "oncall-engineer"},
          ":ts": {"S": "2026-03-31T14:30:00Z"},
      },
  )
  ```

- Verify (< 30 min): Monitor `recommendation_satisfaction_rate` — should recover to > 85% within 30 minutes.
- Post-mortem: Document that the `recommendation` intent is in the "Sonnet-locked" category.
Prevention
- Lock list: Maintain a `sonnet_required_intents` set in config that cannot be overridden by automated cost optimization.
- A/B testing gate: Any routing change must go through a 10% A/B test with n >= 2,000 before full rollout.
- Auto-rollback: If any intent's satisfaction drops > 15% within 1 hour of a routing change, automatically revert.
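The auto-rollback rule above can be sketched as two small checks; the function and field names are hypothetical, not taken from the production codebase:

```python
# Sketch of the auto-rollback rule: revert a routing change if the affected
# intent's satisfaction drops more than 15% within 1 hour of the change.
ROLLBACK_DROP = 0.15     # relative satisfaction drop that triggers a revert
WATCH_WINDOW_S = 3600    # seconds after a routing change to keep watching

def should_rollback(baseline: float, current: float) -> bool:
    """True if satisfaction fell more than ROLLBACK_DROP relative to baseline."""
    if baseline <= 0:
        return False
    return (baseline - current) / baseline > ROLLBACK_DROP

def check_recent_change(change: dict, metrics: dict, now: float) -> bool:
    """Evaluate one routing change: revert only while inside the watch window
    and the monitored intent has degraded past the threshold."""
    in_window = now - change["applied_at"] <= WATCH_WINDOW_S
    return in_window and should_rollback(
        metrics["baseline_satisfaction"], metrics["current_satisfaction"]
    )
```

With the Scenario 1 numbers (94% baseline, 61% after the downgrade), the drop is ~35%, so the change would have been reverted automatically within the hour.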
Scenario 2: Sonnet Routing for Simple "Where's My Order" Queries
Problem Statement
Cost analysis reveals that 12% of Sonnet invocations are for order_status queries like "Where is my order #98765?" These should be handled by the Template tier (zero cost, pure DynamoDB lookup) but are being misclassified as complex queries due to a bug in the intent classifier.
Decision Tree
```mermaid
flowchart TD
    ALERT["ALERT: Sonnet cost 18% above<br/>projection for 3 consecutive days"] --> ANALYZE{"Analyze Sonnet<br/>invocations by intent"}
    ANALYZE --> FIND["FINDING: 12% of Sonnet calls<br/>are order_status queries<br/>($200/day wasted)"]
    FIND --> WHY{"Why are order_status<br/>queries hitting Sonnet?"}
    WHY -->|Intent classifier bug| BUG_FIX["FIX: Patch intent classifier<br/>order_status regex missed<br/>Japanese order number formats"]
    WHY -->|Complexity scorer too high| SCORER_FIX["FIX: Tune complexity scorer<br/>threshold for order_status"]
    WHY -->|Missing template pattern| PATTERN_FIX["FIX: Add missing patterns<br/>to template fast path"]
    BUG_FIX --> DEPLOY["Deploy fix to ECS"]
    SCORER_FIX --> DEPLOY
    PATTERN_FIX --> DEPLOY
    DEPLOY --> VERIFY{"Order_status queries<br/>now hitting Template?"}
    VERIFY -->|Yes| SAVINGS["RESOLVED<br/>Savings: ~$6,000/month"]
    VERIFY -->|Partially| ITERATE["Add more patterns<br/>Retrain classifier"]
    SAVINGS --> PREVENT["PREVENTION:<br/>1. Weekly intent-model audit<br/>2. Alert on expensive-model<br/> usage for template intents<br/>3. Cost anomaly detection"]
    style ALERT fill:#f39c12,color:#fff
    style FIND fill:#e74c3c,color:#fff
    style SAVINGS fill:#2ecc71,color:#fff
    style PREVENT fill:#3498db,color:#fff
```
Runbook
Detection
- CloudWatch alarm: `sonnet_cost_daily > projected_daily * 1.15` for 3 consecutive days.
- Investigation query:

  ```sql
  SELECT intent, COUNT(*) AS count, SUM(cost) AS total_cost
  FROM mangaassist.inference_log
  WHERE model = 'sonnet' AND date = CURRENT_DATE
  GROUP BY intent
  ORDER BY total_cost DESC;
  ```

- Red flag: `order_status` appearing in the Sonnet invocation log at all.
Root Cause Analysis
| Factor | Expected | Actual | Verdict |
|---|---|---|---|
| `order_status` model tier | Template | Sonnet | Misrouted |
| Intent classifier accuracy | > 95% | 88% for order_status | Bug |
| Missed pattern | N/A | Japanese order formats (注文番号12345) | Missing regex |
| Complexity score for "注文番号12345はどこ?" | < 0.2 | 0.65 | Inflated by Japanese chars |
Root cause: The complexity classifier's japanese_char_ratio feature adds +0.05 to the score for any query with Japanese characters. Since all order status queries from Japanese users contain Japanese, they get inflated complexity scores. Additionally, the template fast path was missing Japanese-language order patterns.
Resolution Steps
- Immediate: Add Japanese order status patterns to the template fast path:

  ```python
  # Add to ComplexityClassifier.TEMPLATE_PATTERNS
  re.compile(r"(?i)(注文|オーダー)\s*(番号|#|#)?\s*\d+"),
  re.compile(r"(?i)(配送|配達|届|delivery).*(状況|ステータス|status)"),
  ```

- Short-term: Adjust the complexity scorer to not penalize queries that match a known template intent:

  ```python
  def classify(self, query: str, detected_intent: str = None) -> float:
      # If the intent is already classified as template-eligible, cap the score
      if detected_intent in ("order_status", "escalation", "chitchat"):
          return min(self._raw_score(query), 0.15)
      return self._raw_score(query)
  ```

- Verify: Confirm zero Sonnet invocations for `order_status` in the next 24 hours.
Prevention
- Weekly audit: Automated report of model utilization by intent — flag any template-eligible intent appearing in Sonnet logs.
- Cost anomaly alert: CloudWatch anomaly detection on per-intent Sonnet spend.
- Test coverage: Add Japanese-language order queries to the classifier test suite.
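The test-coverage item above can start from a minimal regression check that reproduces the patterns from the resolution step; the helper and the query lists are illustrative:

```python
import re

# Regression check for the Japanese order-status fast-path patterns.
# Keep TEMPLATE_PATTERNS in sync with the ComplexityClassifier source.
TEMPLATE_PATTERNS = [
    re.compile(r"(?i)(注文|オーダー)\s*(番号|#|#)?\s*\d+"),
    re.compile(r"(?i)(配送|配達|届|delivery).*(状況|ステータス|status)"),
]

def matches_template(query: str) -> bool:
    """True if any template fast-path pattern matches the query."""
    return any(p.search(query) for p in TEMPLATE_PATTERNS)

# Queries that must take the zero-cost template path.
TEMPLATE_QUERIES = [
    "注文番号12345はどこ?",
    "オーダー #98765 の状況を教えて",
    "配送ステータスを確認したい",
]

# Queries that must NOT be short-circuited to the template path.
NON_TEMPLATE_QUERIES = [
    "ベルセルクみたいなマンガのおすすめは?",
]
```

Running these lists through the classifier in CI catches the Scenario 2 regression class: Japanese-language order queries silently escaping the template path.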
Scenario 3: Budget Overrun During Manga Sale Event
Problem Statement
During the annual "Manga Matsuri" sale event, traffic spikes to 3.2x normal (3.2M messages/day). The daily FM budget of $2,500 is exhausted by 3:00 PM JST. The Budget Guardian enters EMERGENCY mode, routing all queries to Haiku/Template. Recommendation quality plummets for the remaining 9 hours of peak sale traffic.
Decision Tree
```mermaid
flowchart TD
    ALERT["CRITICAL ALERT: Budget 95%<br/>consumed at 15:00 JST<br/>9 hours of peak traffic remaining"] --> ASSESS{"Is this a known<br/>traffic event?"}
    ASSESS -->|"Yes (Manga Matsuri)"| PLANNED["Should have been planned.<br/>Increase daily budget<br/>for event duration."]
    ASSESS -->|"No (unexpected spike)"| UNEXPECTED["Investigate traffic source:<br/>- Organic growth?<br/>- Bot traffic?<br/>- Marketing campaign?"]
    PLANNED --> INCREASE["IMMEDIATE: Increase daily<br/>budget from $2,500 to $6,000<br/>for event duration (3 days)"]
    INCREASE --> VERIFY_B{"Budget Guardian<br/>exits EMERGENCY?"}
    VERIFY_B -->|Yes| QUALITY_CHECK["Monitor recommendation<br/>quality recovery"]
    VERIFY_B -->|No| REDIS_FIX["Check Redis config —<br/>budget key may be cached"]
    QUALITY_CHECK --> RESOLVED["RESOLVED:<br/>Quality restored during sale"]
    UNEXPECTED --> BOT{"Is it bot<br/>traffic?"}
    BOT -->|Yes| BLOCK["Block bots at WAF layer<br/>Do NOT increase budget"]
    BOT -->|No| SCALE["Increase budget +<br/>enable auto-scale provisions"]
    RESOLVED --> PREVENT["PREVENTION:<br/>1. Pre-scale budget for known events<br/>2. Event calendar integration<br/>3. Gradual budget ramp (not cliff)<br/>4. Separate event budget pool"]
    style ALERT fill:#e74c3c,color:#fff
    style INCREASE fill:#2ecc71,color:#fff
    style RESOLVED fill:#2ecc71,color:#fff
    style BLOCK fill:#e74c3c,color:#fff
    style PREVENT fill:#3498db,color:#fff
```
Runbook
Detection
- PagerDuty alert: `budget_utilization_pct > 95%` triggers a page.
- Early warning: `budget_utilization_pct > 60% AND hour < 12` (budget draining faster than expected).
- Dashboard: Real-time budget burn rate vs projected end-of-day.
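The early-warning signal amounts to projecting the burn rate to end of day; a minimal sketch, assuming a 24-hour JST budget day and linear extrapolation (both assumptions for illustration):

```python
# Burn-rate projection behind the early-warning alert.
def projected_eod_spend(spent_so_far: float, hours_elapsed: float) -> float:
    """Linearly extrapolate spend so far to the end of the 24h budget day."""
    if hours_elapsed <= 0:
        return 0.0
    return spent_so_far * (24.0 / hours_elapsed)

def early_warning(spent: float, budget: float, hour: int) -> bool:
    """Page early when more than 60% of the budget is gone before noon JST."""
    return hour < 12 and spent / budget > 0.60

# Manga Matsuri numbers: $2,375 spent by 15:00 projects to $3,800 by midnight,
# well past the $2,500 daily budget.
```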
Root Cause Analysis
| Factor | Expected | Actual | Verdict |
|---|---|---|---|
| Daily traffic | 1M messages | 3.2M messages | 3.2x spike |
| Daily budget | $2,500 | $2,500 (unchanged) | Budget not pre-scaled |
| Budget exhaustion time | 23:59 JST | 15:00 JST | 9 hours of degraded service |
| Emergency mode duration | 0 hours | 9 hours | Unacceptable quality loss |
| Recommendation satisfaction (15:00-24:00) | 94% | 58% | Severe degradation |
Root cause: The operations team did not pre-scale the daily budget for the Manga Matsuri sale event. The Budget Guardian correctly activated emergency mode to protect spend, but the budget was too low for the traffic level, causing extended quality degradation during peak sale hours — the worst possible time for poor recommendations.
Resolution Steps
- Immediate (during incident):

  ```python
  # Increase daily budget via Redis override
  redis_client.set("budget:override:2026-03-31", "6000.00")
  redis_client.expire("budget:override:2026-03-31", 86400)
  # Reset today's budget mode tracking —
  # BudgetGuardian will re-evaluate on the next request
  ```

- Same day: Update the Budget Guardian to check for override values:

  ```python
  def _get_daily_budget(self) -> float:
      override = self.redis.get(f"budget:override:{time.strftime('%Y-%m-%d')}")
      if override:
          return float(override)
      return self.DAILY_BUDGET
  ```

- Post-event: Create event budget calendar integration.
Prevention
- Event calendar: Maintain a `budget_events` DynamoDB table with known traffic events:

  ```json
  {
      "event_id": "manga-matsuri-2026",
      "start_date": "2026-03-29",
      "end_date": "2026-04-01",
      "expected_traffic_multiplier": 3.5,
      "budget_multiplier": 3.0,
      "approved_by": "platform-lead"
  }
  ```

- Gradual budget ramp: Instead of a cliff at 95%, implement proportional quality reduction:

  ```python
  quality_factor = min(1.0, remaining_budget / projected_remaining_cost)
  ```

- Separate event budget pool: Allocate a separate pool for planned events so the normal-day budget is unaffected.
- Pre-event checklist:
  - [ ] Budget multiplier configured for event dates
  - [ ] Traffic projections reviewed
  - [ ] On-call staffing increased
  - [ ] Fallback templates updated for event context
  - [ ] CloudWatch alarms tuned for event thresholds
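The gradual-ramp idea from the prevention list can be sketched by scaling the Sonnet-eligible traffic share with the remaining-budget factor; the function names and base share are illustrative assumptions:

```python
# Proportional quality reduction: instead of a hard cutover to EMERGENCY at
# 95% utilization, shrink Sonnet-eligible traffic as headroom disappears.
def quality_factor(remaining_budget: float, projected_remaining_cost: float) -> float:
    """1.0 means full quality; approaches 0 as spend outpaces the budget."""
    if projected_remaining_cost <= 0:
        return 1.0
    return min(1.0, remaining_budget / projected_remaining_cost)

def sonnet_share(base_share: float, factor: float) -> float:
    """Scale the share of traffic eligible for Sonnet by the budget factor."""
    return base_share * factor

# Example: $500 left vs $1,000 projected cost gives factor 0.5, so a 15%
# baseline Sonnet share degrades gracefully to 7.5% instead of dropping to 0.
```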
Scenario 4: Quality Regression After Model Tier Rebalancing
Problem Statement
To reduce costs, the platform team changed the routing split from 15% Sonnet / 50% Haiku / 35% Template to 8% Sonnet / 52% Haiku / 40% Template. The change shipped globally at once (no A/B test). Within 24 hours, the overall CSAT score dropped from 4.2 to 3.6 (out of 5). The team cannot pinpoint which intent degraded because the change affected multiple intents simultaneously.
Decision Tree
```mermaid
flowchart TD
    ALERT["ALERT: CSAT dropped<br/>from 4.2 to 3.6 in 24 hours<br/>(post-routing change)"] --> CORRELATE{"Does timing correlate<br/>with routing change?"}
    CORRELATE -->|Yes| ROLLBACK_Q{"Can we rollback<br/>the routing change?"}
    CORRELATE -->|No| OTHER_CAUSES["Investigate:<br/>- Prompt template change?<br/>- Data pipeline issue?<br/>- Guardrails update?"]
    ROLLBACK_Q -->|Yes| ROLLBACK["IMMEDIATE: Rollback<br/>routing to previous config<br/>(15/50/35 split)"]
    ROLLBACK_Q -->|No, config lost| MANUAL["Manually reconstruct<br/>previous routing config"]
    ROLLBACK --> RECOVERING{"CSAT recovering<br/>within 4 hours?"}
    RECOVERING -->|Yes| ROOT_CAUSE["Confirmed: routing change<br/>caused quality regression"]
    RECOVERING -->|No| COMPOUND["Compound issue —<br/>investigate other changes"]
    ROOT_CAUSE --> ANALYZE["Analyze per-intent impact:<br/>Which intents degraded most<br/>when moved to cheaper tier?"]
    ANALYZE --> INTENT1["manga_qa: Sonnet→Haiku<br/>Quality: 9.1→6.5 (-28.6%)"]
    ANALYZE --> INTENT2["product_search (complex):<br/>Sonnet→Haiku<br/>Quality: 8.9→7.8 (-12.4%)"]
    ANALYZE --> INTENT3["shipping_info: unchanged<br/>Quality: stable"]
    INTENT1 & INTENT2 & INTENT3 --> PLAN["PLAN: Phased rebalancing<br/>with A/B tests per intent"]
    PLAN --> PREVENT["PREVENTION:<br/>1. Never change multiple intents at once<br/>2. Mandatory A/B test per intent<br/>3. Canary rollout (1%→10%→50%→100%)<br/>4. Automated rollback on CSAT drop"]
    style ALERT fill:#e74c3c,color:#fff
    style ROLLBACK fill:#f39c12,color:#fff
    style ROOT_CAUSE fill:#2ecc71,color:#fff
    style PREVENT fill:#3498db,color:#fff
```
Runbook
Detection
- CloudWatch alarm: `csat_score_24h < 3.8` (baseline 4.2, threshold: 10% drop).
- Supporting signals: Per-intent satisfaction rates, model-tier distribution shift, user complaint volume in support tickets.
Root Cause Analysis
| Intent | Before (Model / Quality) | After (Model / Quality) | Quality Delta | User Impact |
|---|---|---|---|---|
| `manga_qa` | Sonnet / 9.1 | Haiku / 6.5 | -28.6% | Very noticeable |
| `product_search` (complex) | Sonnet / 8.9 | Haiku / 7.8 | -12.4% | Noticeable |
| `recommendation` | Sonnet / 9.4 | Sonnet / 9.4 | 0% | None |
| `product_search` (simple) | Haiku / 8.1 | Haiku / 8.1 | 0% | None |
| `shipping_info` | Haiku / 8.8 | Haiku / 8.8 | 0% | None |
| `order_status` | Template / 9.5 | Template / 9.5 | 0% | None |
Root cause: The blanket reduction from 15% to 8% Sonnet moved manga_qa from Sonnet to Haiku. This single change accounts for ~70% of the CSAT drop. The lack of A/B testing meant the impact was not caught before full rollout.
Resolution Steps
- Immediate (< 10 min): Roll back the routing config to the 15/50/35 split.
- Verify (< 4 hours): Confirm the CSAT recovery trend.
- Plan (next sprint):
  - A/B test `manga_qa` on Haiku with 10% traffic.
  - A/B test `product_search` (complex subset) on Haiku with 10% traffic.
  - Only proceed with each change if the quality drop is < 10%.
Prevention
- Change management policy: Routing changes must follow the staged rollout process:

  Config change → A/B test (10%, 48 hours, n >= 2,000) → Canary (25%, 24 hours) → Full rollout

- Per-intent quality gates:

  ```python
  # Automated quality gate check before promoting a routing change
  def can_promote_routing_change(intent: str, ab_test_results: dict) -> bool:
      control = ab_test_results["control"]
      treatment = ab_test_results["treatment"]
      quality_drop = (control["quality"] - treatment["quality"]) / control["quality"]
      satisfaction_drop = (
          control["satisfaction"] - treatment["satisfaction"]
      ) / control["satisfaction"]
      return (
          quality_drop < 0.10                    # less than 10% quality drop
          and satisfaction_drop < 0.05           # less than 5% satisfaction drop
          and treatment["sample_count"] >= 2000  # enough samples
      )
  ```

- One-intent-at-a-time rule: Never change routing for more than one intent in a single deployment.
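The staged rollout policy can be encoded as data plus two small checks; a sketch, with helper names that are illustrative rather than from the production deployment tooling:

```python
# Stages mirror the rollout policy: A/B test (10%, 48h, n >= 2,000),
# then canary (25%, 24h), then full rollout.
STAGES = [
    {"name": "ab_test", "traffic_pct": 10, "min_hours": 48, "min_samples": 2000},
    {"name": "canary", "traffic_pct": 25, "min_hours": 24, "min_samples": 0},
    {"name": "full", "traffic_pct": 100, "min_hours": 0, "min_samples": 0},
]

def next_stage(current: str):
    """Return the stage after `current`, or None if already fully rolled out."""
    names = [s["name"] for s in STAGES]
    i = names.index(current)
    return STAGES[i + 1] if i + 1 < len(STAGES) else None

def can_advance(stage: dict, hours_elapsed: float, samples: int, gate_passed: bool) -> bool:
    """A stage may promote only after its dwell time, sample count, and
    quality gate (e.g. can_promote_routing_change) are all satisfied."""
    return (
        gate_passed
        and hours_elapsed >= stage["min_hours"]
        and samples >= stage["min_samples"]
    )
```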
Scenario 5: New Intent manga_review Defaults to Expensive Sonnet
Problem Statement
The product team launches a new "write a manga review" feature. The intent classifier correctly detects manga_review as a new intent, but the Model Router has no explicit mapping for it. The fallback logic defaults unknown intents to Sonnet (the safest but most expensive option). Within a week, manga_review becomes 8% of traffic, adding $1,200/day in unnecessary Sonnet spend. Analysis shows Haiku handles review generation at 85% quality — more than adequate for user-generated content.
Decision Tree
```mermaid
flowchart TD
    ALERT["ALERT: New intent 'manga_review'<br/>detected in Sonnet invocation logs<br/>8% of traffic, $1,200/day"] --> CHECK{"Is manga_review<br/>in the routing map?"}
    CHECK -->|No — using default| DEFAULT_Q{"What is the default<br/>model for unmapped intents?"}
    CHECK -->|Yes| VERIFY_MAP["Verify correct model<br/>is assigned"]
    DEFAULT_Q -->|Sonnet (current)| PROBLEM["PROBLEM: Unmapped intents<br/>default to most expensive model"]
    PROBLEM --> EVALUATE{"What model does<br/>manga_review need?"}
    EVALUATE --> TEST["Run quick evaluation:<br/>- 100 sample reviews on Sonnet<br/>- 100 sample reviews on Haiku<br/>- LLM-as-judge scoring"]
    TEST --> RESULT["Results:<br/>Sonnet: 9.0 quality, $0.0111/req<br/>Haiku: 8.5 quality, $0.0003/req<br/>Quality gap: 5.6% (acceptable)"]
    RESULT --> ASSIGN["ASSIGN: manga_review → Haiku<br/>Update routing config"]
    ASSIGN --> VERIFY{"Cost reduced?<br/>Quality acceptable?"}
    VERIFY -->|Yes| RESOLVED["RESOLVED<br/>Savings: $1,176/day ($35,280/month)"]
    RESOLVED --> PREVENT["PREVENTION:<br/>1. Change default for unknown intents<br/> from Sonnet to Haiku<br/>2. Alert on any new unmapped intent<br/>3. Require model assignment in<br/> feature launch checklist<br/>4. Weekly unmapped intent audit"]
    style ALERT fill:#f39c12,color:#fff
    style PROBLEM fill:#e74c3c,color:#fff
    style RESULT fill:#3498db,color:#fff
    style RESOLVED fill:#2ecc71,color:#fff
    style PREVENT fill:#3498db,color:#fff
```
Runbook
Detection
- Weekly audit report: Automated check for intents appearing in inference logs that are not in the routing map.

  ```python
  # Audit query: find unmapped intents
  unmapped = set(inference_log_intents) - set(INTENT_MODEL_MAP.keys())
  if unmapped:
      alert(f"Unmapped intents detected: {unmapped}")
  ```

- Cost anomaly: Unexpected Sonnet spend increase not correlated with traffic growth.
Root Cause Analysis
| Factor | Expected | Actual | Verdict |
|---|---|---|---|
| `manga_review` in routing map | Yes | No | Missing mapping |
| Default for unmapped intents | Haiku (safe default) | Sonnet | Overly conservative default |
| Feature launch checklist | Includes model assignment | Skipped | Process gap |
| Weekly intent audit | Running | Not configured | Missing automation |
Root cause: Two process failures compounded: (1) the feature launch checklist does not require a model assignment for new intents, and (2) the default model for unmapped intents is Sonnet instead of Haiku.
Resolution Steps
- Immediate: Add `manga_review` to the routing config:

  ```python
  INTENT_MODEL_MAP[Intent.MANGA_REVIEW] = ModelTier.HAIKU
  ```

- Same day: Change the default model for unmapped intents from Sonnet to Haiku:

  ```python
  # In ModelRouter.route()
  default_tier = INTENT_MODEL_MAP.get(intent, ModelTier.HAIKU)  # changed from SONNET
  ```

- This sprint: Add weekly unmapped-intent audit automation.
Prevention
- Safe default: Change the system-wide default from Sonnet to Haiku for unmapped intents. Rationale: Haiku provides acceptable quality (7.4+) for most intents, and the 24.5x cost savings far outweigh occasional slight quality dips while the team evaluates the correct model tier.
- Feature launch checklist update:

  ```markdown
  ## New Feature Launch — Model Routing Checklist
  - [ ] New intent name defined and added to Intent enum
  - [ ] Complexity evaluation performed (100 sample queries scored)
  - [ ] Model tier assigned based on evaluation
  - [ ] Routing config updated in DynamoDB
  - [ ] A/B test configured for first 2 weeks
  - [ ] Cost projection added to monthly forecast
  - [ ] On-call team notified of new intent
  ```

- Automated guardrail:

  ```python
  # Run daily: detect and alert on unmapped intents hitting Sonnet
  def audit_unmapped_intents(inference_logs: list, routing_map: dict) -> list:
      unmapped = []
      for log in inference_logs:
          if log["intent"] not in routing_map and log["model"] == "sonnet":
              unmapped.append({
                  "intent": log["intent"],
                  "count": log["count"],
                  "daily_cost": log["total_cost"],
                  "first_seen": log["first_seen"],
              })
      return sorted(unmapped, key=lambda x: x["daily_cost"], reverse=True)
  ```
Cross-Scenario Summary
Common Patterns Across All 5 Scenarios
| Pattern | Scenarios | Mitigation |
|---|---|---|
| Missing or incorrect model assignment | 1, 2, 5 | Intent-to-model map with locked tiers + default to Haiku |
| No A/B testing before routing changes | 1, 4 | Mandatory A/B test gate (10%, n >= 2,000) |
| Budget not scaled for known events | 3 | Event calendar with auto-budget adjustment |
| No automated detection of misrouting | 2, 5 | Weekly intent-model audit + cost anomaly alerts |
| Rollout too fast (no canary) | 4 | Staged rollout: 1% → 10% → 50% → 100% |
Prevention Priority Matrix
```mermaid
quadrantChart
    title Prevention Priority — Impact vs Effort
    x-axis Low Effort --> High Effort
    y-axis Low Impact --> High Impact
    quadrant-1 Do First
    quadrant-2 Plan Carefully
    quadrant-3 Quick Wins
    quadrant-4 Deprioritize
    "Change default to Haiku": [0.2, 0.7]
    "Weekly intent audit": [0.3, 0.6]
    "Event budget calendar": [0.5, 0.9]
    "A/B test gate": [0.6, 0.85]
    "Canary rollout system": [0.7, 0.8]
    "Per-intent quality gates": [0.8, 0.75]
    "Feature launch checklist": [0.15, 0.5]
    "Auto-rollback system": [0.9, 0.9]
```
Monitoring Checklist
| Metric | Alarm Threshold | Scenario It Catches |
|---|---|---|
| `intent_satisfaction_rate` | < 80% for any intent | Scenarios 1, 4 |
| `sonnet_daily_cost` | > 115% of projection | Scenarios 2, 5 |
| `budget_utilization_pct` | > 60% before noon JST | Scenario 3 |
| `csat_score_24h` | < 3.8 (baseline 4.2) | Scenario 4 |
| `unmapped_intent_count` | > 0 | Scenario 5 |
| `model_tier_distribution` | Sonnet > 20% of traffic | Scenario 2 |
| `budget_mode` | != "normal" for > 2 hours | Scenario 3 |
Incident Response Quick Reference
```mermaid
flowchart LR
    INCIDENT[Model Routing<br/>Incident] --> TYPE{Incident Type?}
    TYPE -->|Quality Drop| QD["1. Check which intent dropped<br/>2. Check model tier for that intent<br/>3. Revert if tier changed recently<br/>4. A/B test before re-applying"]
    TYPE -->|Cost Spike| CS["1. Check per-intent Sonnet usage<br/>2. Look for unmapped intents<br/>3. Look for misclassified queries<br/>4. Check if traffic event is known"]
    TYPE -->|Budget Exhaustion| BE["1. Is it a planned event?<br/> → Increase budget<br/>2. Is it bot traffic?<br/> → Block at WAF<br/>3. Is it organic growth?<br/> → Adjust daily budget"]
    style QD fill:#e74c3c,color:#fff
    style CS fill:#f39c12,color:#fff
    style BE fill:#e74c3c,color:#fff
```
Previous: 02-inference-cost-optimization.md -- Budget guardian, A/B testing, fallback chains, cost projections.
Back to: 01-model-selection-framework.md -- Model selection architecture, complexity classifier, routing maps.