# Inference Cost Optimization

**MangaAssist Context:** JP manga store chatbot running on AWS. Bedrock Claude 3 Sonnet ($3/$15 per 1M input/output tokens) handles complex queries; Haiku ($0.25/$1.25 per 1M input/output tokens) handles simple ones. 1M messages/day across product search, order status, manga recommendations, and Q&A. Infrastructure: OpenSearch Serverless, DynamoDB, ECS Fargate, API Gateway WebSocket, ElastiCache Redis.
## Skill Mapping

| AWS AIP-C01 Domain | Task | Skill | This File Covers |
| --- | --- | --- | --- |
| Domain 4: Operational Efficiency | Task 4.1: Cost Optimization | 4.1.2 Cost-Effective Model Selection | Inference cost patterns, budget guardian, traffic analysis, A/B testing model assignments, fallback chains, cost projections |
## 1. Inference Cost Patterns Mind Map

```mermaid
mindmap
  root((Inference Cost<br/>Optimization))
    Traffic Analysis
      Volume by Intent
      Cost Distribution
      Peak vs Off-Peak
      Seasonal Patterns
    Budget Guardian
      Daily Budget Tracking
      Spend Rate Monitoring
      Dynamic Downgrade
      Emergency Mode
    A/B Testing
      Quality Impact Measurement
      Statistical Significance
      Routing Change Rollout
      Rollback Triggers
    Fallback Chains
      Sonnet to Haiku
      Haiku to Template
      Cascading Failures
      Graceful Degradation
    Cross-Region Pricing
      ap-northeast-1 vs us-east-1
      Latency-Cost Tradeoff
      Regional Availability
    Cost Projections
      Monthly Forecasting
      Growth Modeling
      Scenario Planning
```
## 2. MangaAssist Traffic Analysis

### 2.1 Daily Traffic Distribution by Intent and Tier

```mermaid
pie title Daily Message Distribution by Tier (1M messages/day)
    "Tier 0 — Template (35%)" : 350000
    "Tier 1 — Haiku (50%)" : 500000
    "Tier 2 — Sonnet (15%)" : 150000
```
**Detailed Breakdown**

| Intent | Tier | % Traffic | Daily Volume | Input Tokens/Req | Output Tokens/Req | Cost/Req | Daily Cost |
| --- | --- | --- | --- | --- | --- | --- | --- |
| product_search | Haiku | 30% | 300,000 | 400 | 120 | $0.000250 | $75.00 |
| order_status | Template | 20% | 200,000 | 0 | 0 | $0.000000 | $0.00 |
| chitchat | Template | 15% | 150,000 | 0 | 0 | $0.000000 | $0.00 |
| shipping_info | Haiku | 10% | 100,000 | 300 | 80 | $0.000175 | $17.50 |
| escalation | Template | 5% | 50,000 | 0 | 0 | $0.000000 | $0.00 |
| recommendation | Sonnet | 10% | 100,000 | 1,200 | 500 | $0.011100 | $1,110.00 |
| manga_qa | Sonnet | 5% | 50,000 | 900 | 400 | $0.008700 | $435.00 |
| product_search (complex) | Sonnet | 3% | 30,000 | 600 | 250 | $0.005550 | $166.50 |
| shipping_info (complex) | Haiku | 2% | 20,000 | 300 | 80 | $0.000175 | $3.50 |
| **Totals** | | 100% | 1,000,000 | | | | **$1,807.50** |
### 2.2 Cost Distribution — Where the Money Goes

```mermaid
pie title Daily FM Cost Distribution ($1,807.50/day)
    "recommendation (Sonnet) — $1,110" : 1110
    "manga_qa (Sonnet) — $435" : 435
    "product_search complex (Sonnet) — $167" : 167
    "product_search (Haiku) — $75" : 75
    "shipping_info (Haiku) — $21" : 21
    "Template intents — $0" : 1
```
**Key insight:** Sonnet calls represent only 15% of traffic but account for 94.7% of daily FM cost ($1,711.50 of $1,807.50). Every 1% of traffic shifted from Sonnet to Haiku saves approximately $110/day ($3,300/month).
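That savings estimate can be verified in a few lines, assuming the shifted traffic carries the recommendation token profile — the costliest Sonnet intent in the table above:

```python
# Per-request costs from the traffic table above.
SONNET_RECO_COST = 0.011100  # recommendation on Sonnet, $/request
HAIKU_COST = 0.000250        # product_search on Haiku, $/request

shifted = 10_000  # 1% of 1M daily messages
daily_savings = shifted * (SONNET_RECO_COST - HAIKU_COST)
print(f"~${daily_savings:,.0f}/day, ~${daily_savings * 30:,.0f}/month")
```

This yields roughly $108/day (~$3,255/month), which rounds to the $110/day and $3,300/month quoted above.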
### 2.3 Hourly Traffic Pattern (JST)

```mermaid
xychart-beta
    title "MangaAssist Hourly Traffic (JST) — Messages per Hour"
    x-axis ["0","1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23"]
    y-axis "Messages (thousands)" 0 --> 90
    bar [8,5,3,2,2,5,15,30,45,55,60,65,70,55,50,48,52,58,72,85,80,65,40,20]
```
| Time Block (JST) | % of Daily Traffic | Hourly Avg | Peak Consideration |
| --- | --- | --- | --- |
| 00:00-06:00 | 4% | 6,667 | Lowest — aggressive cost savings |
| 06:00-12:00 | 27% | 45,000 | Morning ramp — normal routing |
| 12:00-18:00 | 31% | 51,667 | Afternoon steady — normal routing |
| 18:00-00:00 | 38% | 63,333 | Evening peak — budget monitoring critical |
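This distribution also enables intraday spend pacing: if actual spend runs ahead of the share of traffic expected by a given hour, a downgrade can trigger before an absolute threshold is hit. A minimal sketch, assuming spend tracks traffic volume (the pacing function itself is illustrative, not part of the Budget Guardian code in Section 3):

```python
# Cumulative share of daily traffic by the END of each JST block,
# taken from the table above. The pacing rule (spend should track
# traffic volume) is an illustrative assumption.
BLOCK_SHARE = {6: 0.04, 12: 0.31, 18: 0.62, 24: 1.00}

def expected_spend_by_hour(hour: int, daily_budget: float = 2500.0) -> float:
    """Expected cumulative spend at a given hour, interpolated within a block."""
    prev_hour, prev_share = 0, 0.0
    for end_hour, share in BLOCK_SHARE.items():
        if hour <= end_hour:
            frac = (hour - prev_hour) / (end_hour - prev_hour)
            return daily_budget * (prev_share + frac * (share - prev_share))
        prev_hour, prev_share = end_hour, share
    return daily_budget

print(f"Expected by 18:00 JST: ${expected_spend_by_hour(18):,.2f}")  # $1,550.00
```

Spend above the expected curve at, say, 12:00 JST is a much earlier warning than waiting for the 80% absolute threshold in the evening peak.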
## 3. Dynamic Model Downgrade — Budget Guardian

### 3.1 Architecture

```mermaid
flowchart TD
    subgraph "Every Request"
        REQ[Incoming Request] --> ROUTER[Model Router]
        ROUTER --> BG[Budget Guardian]
        BG --> CHECK{Check Daily<br/>Spend vs Budget}
    end

    CHECK -->|"< 60% spent"| NORMAL["NORMAL MODE<br/>Standard intent-to-model routing"]
    CHECK -->|"60-80% spent"| CAUTIOUS["CAUTIOUS MODE<br/>Downgrade manga_qa: Sonnet→Haiku<br/>Est. savings: ~20% on Sonnet spend"]
    CHECK -->|"80-95% spent"| AGGRESSIVE["AGGRESSIVE MODE<br/>Only recommendation stays Sonnet<br/>All else: Haiku or Template<br/>Est. savings: ~60% on Sonnet spend"]
    CHECK -->|"> 95% spent"| EMERGENCY["EMERGENCY MODE<br/>No Sonnet calls<br/>recommendation→Haiku, rest→Template<br/>Queue critical requests for next day"]

    NORMAL --> SERVE[Serve Response]
    CAUTIOUS --> SERVE
    AGGRESSIVE --> SERVE
    EMERGENCY --> SERVE

    BG --> LOG[Log to CloudWatch<br/>budget_mode metric]
    BG --> ALERT{Threshold<br/>Breach?}
    ALERT -->|80%| SNS1["SNS Alert:<br/>Budget Warning"]
    ALERT -->|95%| SNS2["PagerDuty Alert:<br/>Budget Critical"]

    style NORMAL fill:#2ecc71,color:#fff
    style CAUTIOUS fill:#f39c12,color:#fff
    style AGGRESSIVE fill:#e67e22,color:#fff
    style EMERGENCY fill:#e74c3c,color:#fff
```
### 3.2 Budget Mode Impact on Quality and Cost

| Budget Mode | Trigger | Sonnet % | Haiku % | Template % | Quality Impact | Cost Reduction |
| --- | --- | --- | --- | --- | --- | --- |
| Normal | < 60% budget | 15% | 50% | 35% | Baseline (8.5) | 0% |
| Cautious | 60-80% | 10% | 55% | 35% | -0.3 (8.2) | ~15% |
| Aggressive | 80-95% | 5% | 50% | 45% | -0.8 (7.7) | ~45% |
| Emergency | > 95% | 0% | 35% | 65% | -1.5 (7.0) | ~85% |
### 3.3 Budget Guardian Sequence Diagram

```mermaid
sequenceDiagram
    participant User as Customer
    participant Router as Model Router
    participant BG as Budget Guardian
    participant Redis as ElastiCache Redis
    participant CW as CloudWatch
    participant SNS as SNS / PagerDuty

    User->>Router: "Recommend manga like Attack on Titan"
    Router->>BG: check_budget(intent=recommendation)
    BG->>Redis: GET budget:daily:2026-03-31
    Redis-->>BG: $1,850.00 spent (74% of $2,500)
    BG-->>Router: mode=CAUTIOUS (recommendation stays Sonnet)
    Router->>Router: Route to Sonnet

    Note over BG,CW: Later that evening — manga sale event spikes traffic
    User->>Router: "What themes in Chainsaw Man?"
    Router->>BG: check_budget(intent=manga_qa)
    BG->>Redis: GET budget:daily:2026-03-31
    Redis-->>BG: $2,150.00 spent (86% of $2,500)
    BG-->>Router: mode=AGGRESSIVE (manga_qa downgraded to Haiku)
    BG->>CW: Emit budget_mode=aggressive metric
    BG->>SNS: Budget Warning Alert (86%)
    Router->>Router: Route to Haiku (downgraded)

    Note over BG,SNS: Budget hits 96%
    User->>Router: "Suggest something similar to Vinland Saga"
    Router->>BG: check_budget(intent=recommendation)
    BG->>Redis: GET budget:daily:2026-03-31
    Redis-->>BG: $2,400.00 spent (96% of $2,500)
    BG-->>Router: mode=EMERGENCY (recommendation→Haiku)
    BG->>SNS: PagerDuty CRITICAL Alert (96%)
    Router->>Router: Route to Haiku (emergency downgrade)
```
### 3.4 BudgetGuardian — Full Production Implementation

```python
"""
MangaAssist Budget Guardian — Dynamic Model Downgrade System
Monitors daily FM spend and dynamically adjusts routing tiers.
"""
import time
import logging
from dataclasses import dataclass
from enum import Enum
from typing import Optional

import boto3
import redis

logger = logging.getLogger("mangaassist.budget_guardian")


class BudgetMode(Enum):
    NORMAL = "normal"
    CAUTIOUS = "cautious"
    AGGRESSIVE = "aggressive"
    EMERGENCY = "emergency"


@dataclass
class BudgetStatus:
    """Current budget state with full context."""
    mode: BudgetMode
    daily_budget: float
    spent_today: float
    remaining: float
    utilization_pct: float
    projected_eod_spend: float  # projected end-of-day spend
    projected_eod_pct: float
    minutes_until_budget_exhausted: Optional[float]


class BudgetGuardian:
    """
    Tracks daily FM spend in Redis and returns the current budget mode.
    Provides cost projection, alerting, and automatic downgrade triggers.
    """

    DAILY_BUDGET = 2_500.00  # $2,500/day baseline
    THRESHOLDS = {
        BudgetMode.NORMAL: 0.00,
        BudgetMode.CAUTIOUS: 0.60,
        BudgetMode.AGGRESSIVE: 0.80,
        BudgetMode.EMERGENCY: 0.95,
    }

    # Which intents get downgraded at each mode level
    DOWNGRADE_MAP = {
        BudgetMode.CAUTIOUS: {
            "manga_qa": "haiku",  # Sonnet -> Haiku
        },
        BudgetMode.AGGRESSIVE: {
            "manga_qa": "haiku",
            "product_search": "template",  # Haiku -> Template (simple searches)
        },
        BudgetMode.EMERGENCY: {
            "recommendation": "haiku",  # Even recommendation gets downgraded
            "manga_qa": "haiku",
            "product_search": "template",
            "shipping_info": "template",
        },
    }

    def __init__(
        self,
        redis_client: redis.Redis,
        cloudwatch_client=None,
        sns_client=None,
        daily_budget: Optional[float] = None,
    ):
        self.redis = redis_client
        self.cw = cloudwatch_client or boto3.client("cloudwatch", region_name="ap-northeast-1")
        self.sns = sns_client or boto3.client("sns", region_name="ap-northeast-1")
        self.daily_budget = daily_budget or self.DAILY_BUDGET
        self._alert_topic_arn = "arn:aws:sns:ap-northeast-1:123456789012:mangaassist-budget-alerts"
        self._critical_topic_arn = "arn:aws:sns:ap-northeast-1:123456789012:mangaassist-budget-critical"

    # ----- Core Budget Check -----

    def get_budget_status(self) -> BudgetStatus:
        """Full budget status with projections."""
        today = time.strftime("%Y-%m-%d")
        spent = self._get_daily_spend(today)
        remaining = max(self.daily_budget - spent, 0)
        utilization = spent / self.daily_budget
        # Project end-of-day spend based on current burn rate
        projected_eod, minutes_left = self._project_spend(today, spent)
        mode = self._determine_mode(utilization)
        return BudgetStatus(
            mode=mode,
            daily_budget=self.daily_budget,
            spent_today=spent,
            remaining=remaining,
            utilization_pct=utilization * 100,
            projected_eod_spend=projected_eod,
            projected_eod_pct=(projected_eod / self.daily_budget) * 100,
            minutes_until_budget_exhausted=minutes_left,
        )

    def get_budget_mode(self) -> BudgetMode:
        """Quick mode check for hot-path routing."""
        today = time.strftime("%Y-%m-%d")
        spent = self._get_daily_spend(today)
        utilization = spent / self.daily_budget
        return self._determine_mode(utilization)

    def get_model_override(self, intent: str, current_model: str) -> Optional[str]:
        """
        Check if the current budget mode requires a model downgrade.
        Returns None if no override needed, or the downgraded model name.
        """
        mode = self.get_budget_mode()
        if mode == BudgetMode.NORMAL:
            return None
        downgrades = self.DOWNGRADE_MAP.get(mode, {})
        override = downgrades.get(intent)
        if override and self._tier_rank(override) < self._tier_rank(current_model):
            logger.warning(
                "Budget %s: downgrading %s from %s to %s",
                mode.value, intent, current_model, override,
            )
            return override
        return None

    # ----- Cost Recording -----

    def record_cost(self, cost: float, intent: str, model: str) -> None:
        """Record an inference cost and emit metrics."""
        today = time.strftime("%Y-%m-%d")
        pipe = self.redis.pipeline()
        # Total daily spend
        total_key = f"budget:daily:{today}"
        pipe.incrbyfloat(total_key, cost)
        pipe.expire(total_key, 86400 * 2)
        # Per-intent spend
        intent_key = f"budget:daily:{today}:intent:{intent}"
        pipe.incrbyfloat(intent_key, cost)
        pipe.expire(intent_key, 86400 * 2)
        # Per-model spend
        model_key = f"budget:daily:{today}:model:{model}"
        pipe.incrbyfloat(model_key, cost)
        pipe.expire(model_key, 86400 * 2)
        # Request count for burn-rate projection
        count_key = f"budget:daily:{today}:count"
        pipe.incr(count_key)
        pipe.expire(count_key, 86400 * 2)
        # Timestamp of first request today (for projection)
        first_key = f"budget:daily:{today}:first_ts"
        pipe.setnx(first_key, str(time.time()))
        pipe.expire(first_key, 86400 * 2)
        pipe.execute()
        # Emit CloudWatch metrics
        self._emit_metrics(cost, intent, model)
        # Check for alert thresholds
        self._check_alerts(today)

    # ----- Spend Projection -----

    def _project_spend(self, today: str, current_spend: float) -> tuple[float, Optional[float]]:
        """
        Project end-of-day spend based on current burn rate.
        Returns (projected_eod_spend, minutes_until_exhaustion).
        """
        first_ts = self.redis.get(f"budget:daily:{today}:first_ts")
        count = self.redis.get(f"budget:daily:{today}:count")
        if not first_ts or not count:
            return current_spend, None
        elapsed_seconds = time.time() - float(first_ts)
        if elapsed_seconds < 60:
            return current_spend, None
        # Calculate burn rate ($/second)
        burn_rate = current_spend / elapsed_seconds
        # Seconds remaining in the day
        now = time.localtime()
        seconds_remaining = (
            (23 - now.tm_hour) * 3600
            + (59 - now.tm_min) * 60
            + (60 - now.tm_sec)
        )
        projected_eod = current_spend + (burn_rate * seconds_remaining)
        # Minutes until budget exhaustion
        remaining_budget = self.daily_budget - current_spend
        if burn_rate > 0 and remaining_budget > 0:
            minutes_left = (remaining_budget / burn_rate) / 60
        else:
            minutes_left = None
        return projected_eod, minutes_left

    # ----- Alerting -----

    def _check_alerts(self, today: str) -> None:
        """Send alerts when budget thresholds are crossed."""
        spent = self._get_daily_spend(today)
        pct = spent / self.daily_budget
        # Use Redis to track which alerts have been sent today
        if pct >= 0.80:
            alert_key = f"budget:alert:{today}:80"
            if not self.redis.exists(alert_key):
                self.redis.setex(alert_key, 86400, "sent")
                self._send_alert(
                    topic_arn=self._alert_topic_arn,
                    subject="MangaAssist Budget Warning (80%)",
                    message=(
                        f"Daily FM budget is {pct:.1%} consumed.\n"
                        f"Spent: ${spent:.2f} / ${self.daily_budget:.2f}\n"
                        f"Mode: AGGRESSIVE — non-critical Sonnet calls downgraded."
                    ),
                )
        if pct >= 0.95:
            alert_key = f"budget:alert:{today}:95"
            if not self.redis.exists(alert_key):
                self.redis.setex(alert_key, 86400, "sent")
                self._send_alert(
                    topic_arn=self._critical_topic_arn,
                    subject="MangaAssist Budget CRITICAL (95%)",
                    message=(
                        f"Daily FM budget is {pct:.1%} consumed.\n"
                        f"Spent: ${spent:.2f} / ${self.daily_budget:.2f}\n"
                        f"Mode: EMERGENCY — all queries on Haiku/Template.\n"
                        f"ACTION REQUIRED: Review traffic spike and adjust budget."
                    ),
                )

    def _send_alert(self, topic_arn: str, subject: str, message: str) -> None:
        try:
            self.sns.publish(
                TopicArn=topic_arn,
                Subject=subject,
                Message=message,
            )
            logger.info("Sent alert: %s", subject)
        except Exception as e:
            logger.error("Failed to send alert: %s", e)

    # ----- CloudWatch Metrics -----

    def _emit_metrics(self, cost: float, intent: str, model: str) -> None:
        """Emit cost metrics to CloudWatch for dashboarding."""
        try:
            self.cw.put_metric_data(
                Namespace="MangaAssist/FMCost",
                MetricData=[
                    {
                        "MetricName": "InferenceCost",
                        "Value": cost,
                        "Unit": "None",
                        "Dimensions": [
                            {"Name": "Intent", "Value": intent},
                            {"Name": "Model", "Value": model},
                        ],
                    },
                    {
                        "MetricName": "DailySpend",
                        "Value": self._get_daily_spend(time.strftime("%Y-%m-%d")),
                        "Unit": "None",
                    },
                ],
            )
        except Exception as e:
            logger.error("Failed to emit CloudWatch metrics: %s", e)

    # ----- Helpers -----

    def _get_daily_spend(self, today: str) -> float:
        val = self.redis.get(f"budget:daily:{today}")
        return float(val) if val else 0.0

    def _determine_mode(self, utilization: float) -> BudgetMode:
        if utilization >= 0.95:
            return BudgetMode.EMERGENCY
        elif utilization >= 0.80:
            return BudgetMode.AGGRESSIVE
        elif utilization >= 0.60:
            return BudgetMode.CAUTIOUS
        else:
            return BudgetMode.NORMAL

    @staticmethod
    def _tier_rank(model: str) -> int:
        return {"template": 0, "haiku": 1, "sonnet": 2}.get(model, 1)


# ---------------------------------------------------------------------------
# Cost Projection Calculator
# ---------------------------------------------------------------------------

class CostProjectionCalculator:
    """
    Projects monthly FM costs under different traffic and routing scenarios.
    Used for capacity planning and budget requests.
    """

    PRICING = {
        "sonnet": {"input": 3.00, "output": 15.00},
        "haiku": {"input": 0.25, "output": 1.25},
        "template": {"input": 0.00, "output": 0.00},
    }

    DEFAULT_TOKENS = {
        "sonnet": {"input": 900, "output": 400},
        "haiku": {"input": 450, "output": 130},
        "template": {"input": 0, "output": 0},
    }

    def project_monthly_cost(
        self,
        daily_messages: int,
        sonnet_pct: float,
        haiku_pct: float,
        template_pct: float,
        days: int = 30,
    ) -> dict:
        """
        Project monthly cost for a given traffic mix.
        Percentages should sum to 1.0.
        """
        results = {}
        for model, pct in [("sonnet", sonnet_pct), ("haiku", haiku_pct), ("template", template_pct)]:
            daily_vol = daily_messages * pct
            tokens = self.DEFAULT_TOKENS[model]
            pricing = self.PRICING[model]
            cost_per_req = (
                (tokens["input"] / 1_000_000) * pricing["input"]
                + (tokens["output"] / 1_000_000) * pricing["output"]
            )
            daily_cost = daily_vol * cost_per_req
            monthly_cost = daily_cost * days
            results[model] = {
                "daily_volume": int(daily_vol),
                "cost_per_request": cost_per_req,
                "daily_cost": daily_cost,
                "monthly_cost": monthly_cost,
            }
        total_daily = sum(r["daily_cost"] for r in results.values())
        total_monthly = sum(r["monthly_cost"] for r in results.values())
        results["total"] = {
            "daily_cost": total_daily,
            "monthly_cost": total_monthly,
            "avg_cost_per_message": total_daily / daily_messages if daily_messages > 0 else 0,
        }
        return results

    def compare_scenarios(self, daily_messages: int = 1_000_000) -> str:
        """Compare multiple routing scenarios."""
        scenarios = {
            "All Sonnet": (1.00, 0.00, 0.00),
            "All Haiku": (0.00, 1.00, 0.00),
            "Current Tiered (15/50/35)": (0.15, 0.50, 0.35),
            "Optimized Tiered (10/55/35)": (0.10, 0.55, 0.35),
            "Aggressive Savings (5/45/50)": (0.05, 0.45, 0.50),
            "Emergency (0/35/65)": (0.00, 0.35, 0.65),
        }
        lines = [
            "=" * 80,
            "MANGAASSIST COST PROJECTION — SCENARIO COMPARISON",
            f"Base: {daily_messages:,} messages/day, 30-day month",
            "=" * 80,
            "",
            f"{'Scenario':<35} {'Daily':>12} {'Monthly':>14} {'Savings':>10}",
            "-" * 80,
        ]
        baseline = None
        for name, (s, h, t) in scenarios.items():
            result = self.project_monthly_cost(daily_messages, s, h, t)
            monthly = result["total"]["monthly_cost"]
            daily = result["total"]["daily_cost"]
            if baseline is None:
                baseline = monthly
                savings = "Baseline"
            else:
                savings_pct = (1 - monthly / baseline) * 100
                savings = f"{savings_pct:.1f}%"
            lines.append(
                f"{name:<35} ${daily:>10,.2f} ${monthly:>12,.2f} {savings:>10}"
            )
        lines.append("-" * 80)
        return "\n".join(lines)


if __name__ == "__main__":
    calc = CostProjectionCalculator()
    print(calc.compare_scenarios())
```
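As a quick usage check, the projection arithmetic for the current 15/50/35 mix can be reproduced inline. Note that the calculator's blended per-model token averages (an assumption baked into `DEFAULT_TOKENS`) are coarser than the per-intent profiles in Section 2.1, so it yields a lower figure than the $1,807.50/day table baseline:

```python
# Reproduces project_monthly_cost for the 15/50/35 mix using the
# calculator's blended averages (Sonnet 900/400, Haiku 450/130 tokens).
PRICING = {"sonnet": (3.00, 15.00), "haiku": (0.25, 1.25)}  # $/1M in, $/1M out
TOKENS = {"sonnet": (900, 400), "haiku": (450, 130)}        # avg in/out per request
MIX = {"sonnet": 0.15, "haiku": 0.50}                       # template tier is free

daily = 0.0
for model, pct in MIX.items():
    (p_in, p_out), (t_in, t_out) = PRICING[model], TOKENS[model]
    cost_per_req = t_in / 1e6 * p_in + t_out / 1e6 * p_out
    daily += 1_000_000 * pct * cost_per_req

print(f"${daily:,.2f}/day, ${daily * 30:,.2f}/month")  # $1,442.50/day, $43,275.00/month
```

When comparing calculator output against the Section 2.1 table, expect this gap; the calculator is for relative scenario comparison, not exact billing.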
## 4. A/B Testing Model Assignments

### 4.1 A/B Test Architecture

```mermaid
flowchart TD
    REQ[Incoming Request] --> SPLIT{A/B Split<br/>by session_id hash}
    SPLIT -->|Control: 90%| CONTROL["Control Group<br/>Standard routing<br/>(manga_qa → Sonnet)"]
    SPLIT -->|Treatment: 10%| TREATMENT["Treatment Group<br/>Experimental routing<br/>(manga_qa → Haiku)"]

    CONTROL --> BEDROCK_S[Bedrock Sonnet]
    TREATMENT --> BEDROCK_H[Bedrock Haiku]
    BEDROCK_S --> RESPONSE_C[Response + Metadata]
    BEDROCK_H --> RESPONSE_T[Response + Metadata]
    RESPONSE_C --> METRICS[Metrics Collector]
    RESPONSE_T --> METRICS

    METRICS --> QUALITY["Quality Metrics<br/>- LLM-as-judge score<br/>- User satisfaction<br/>- Response relevance"]
    METRICS --> COST_M["Cost Metrics<br/>- Tokens used<br/>- Cost per request<br/>- Budget impact"]
    METRICS --> LATENCY["Latency Metrics<br/>- TTFT<br/>- Total latency<br/>- Streaming time"]

    QUALITY & COST_M & LATENCY --> ANALYSIS["Statistical Analysis<br/>- t-test for means<br/>- Chi-square for satisfaction<br/>- Required: p < 0.05, n > 1,000"]
    ANALYSIS -->|"Quality drop < 10%"| PROMOTE["Promote Treatment<br/>Roll out to 100%"]
    ANALYSIS -->|"Quality drop >= 10%"| ROLLBACK["Rollback Treatment<br/>Keep Control routing"]

    style CONTROL fill:#3498db,color:#fff
    style TREATMENT fill:#f39c12,color:#fff
    style PROMOTE fill:#2ecc71,color:#fff
    style ROLLBACK fill:#e74c3c,color:#fff
```
### 4.2 A/B Test Configuration

```python
"""
MangaAssist A/B Testing — Model Assignment Experiments
"""
import hashlib
import time
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("mangaassist.ab_test")


@dataclass
class ABTestConfig:
    """Configuration for a model routing A/B test."""
    test_id: str
    intent: str
    control_model: str    # e.g., "sonnet"
    treatment_model: str  # e.g., "haiku"
    treatment_pct: float  # 0.0 - 1.0 (e.g., 0.10 = 10%)
    start_time: float
    end_time: Optional[float] = None
    min_samples: int = 1_000
    max_quality_drop_pct: float = 10.0  # auto-rollback if quality drops > 10%


@dataclass
class ABTestResult:
    """Metrics for one group in an A/B test."""
    group: str  # "control" or "treatment"
    sample_count: int
    avg_quality_score: float
    avg_cost: float
    avg_latency_ms: float
    satisfaction_rate: float


class ABTestRouter:
    """
    Deterministic A/B test routing based on session_id hash.
    Ensures the same user always sees the same variant within a test.
    """

    def __init__(self, active_tests: list[ABTestConfig]):
        self.tests = {t.intent: t for t in active_tests}

    def get_variant(self, session_id: str, intent: str) -> Optional[str]:
        """
        Return the model to use, or None if no active test for this intent.
        Uses session_id hash for deterministic, repeatable assignment.
        """
        test = self.tests.get(intent)
        if not test:
            return None
        # Check if test is still active
        now = time.time()
        if now < test.start_time:
            return None
        if test.end_time and now > test.end_time:
            return None
        # Deterministic hash-based split
        hash_input = f"{test.test_id}:{session_id}"
        hash_value = int(hashlib.sha256(hash_input.encode()).hexdigest()[:8], 16)
        bucket = (hash_value % 10000) / 10000.0  # 0.0000 to 0.9999
        if bucket < test.treatment_pct:
            return test.treatment_model
        else:
            return test.control_model

    def should_auto_rollback(
        self, intent: str, control: ABTestResult, treatment: ABTestResult
    ) -> bool:
        """
        Check if the treatment should be auto-rolled back.
        Triggers if quality drops more than the configured threshold
        AND we have enough samples for statistical significance.
        """
        test = self.tests.get(intent)
        if not test:
            return False
        if treatment.sample_count < test.min_samples:
            return False  # Not enough data yet
        quality_drop_pct = (
            (control.avg_quality_score - treatment.avg_quality_score)
            / control.avg_quality_score
            * 100
        )
        if quality_drop_pct > test.max_quality_drop_pct:
            logger.warning(
                "A/B test %s: quality drop %.1f%% exceeds threshold %.1f%%. "
                "Auto-rolling back.",
                test.test_id, quality_drop_pct, test.max_quality_drop_pct,
            )
            return True
        return False


# Example: Testing manga_qa on Haiku instead of Sonnet
ACTIVE_TESTS = [
    ABTestConfig(
        test_id="exp-2026-03-manga-qa-haiku",
        intent="manga_qa",
        control_model="sonnet",
        treatment_model="haiku",
        treatment_pct=0.10,  # 10% of manga_qa traffic
        start_time=time.time(),
        min_samples=2_000,
        max_quality_drop_pct=15.0,  # Allow up to 15% quality drop
    ),
]
```
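To see the deterministic split without any AWS dependencies, the hash-bucket logic from `get_variant` can be exercised standalone (the `bucket` helper below re-implements that logic purely for illustration):

```python
import hashlib

# Re-implementation of the hash split inside ABTestRouter.get_variant,
# pulled out as a standalone helper for demonstration.
def bucket(test_id: str, session_id: str) -> float:
    h = int(hashlib.sha256(f"{test_id}:{session_id}".encode()).hexdigest()[:8], 16)
    return (h % 10_000) / 10_000.0

TEST_ID, TREATMENT_PCT = "exp-2026-03-manga-qa-haiku", 0.10

# Deterministic: the same session always lands in the same bucket.
assert bucket(TEST_ID, "sess-42") == bucket(TEST_ID, "sess-42")

# Over many sessions, the treatment share converges on the configured 10%.
n = 100_000
treated = sum(bucket(TEST_ID, f"sess-{i}") < TREATMENT_PCT for i in range(n))
print(f"treatment share: {treated / n:.3f}")  # close to 0.100
```

Salting the hash with `test_id` means a user's bucket in one experiment is independent of their bucket in any other, which avoids correlated cohorts across concurrent tests.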
### 4.3 A/B Test Decision Criteria

| Metric | Acceptable Range | Auto-Rollback Trigger | Data Required |
| --- | --- | --- | --- |
| Quality score (LLM-as-judge) | < 10% drop | > 15% drop | n >= 1,000 |
| User satisfaction | < 5% drop | > 10% drop | n >= 2,000 |
| Cost reduction | > 50% savings | N/A (always cheaper) | N/A |
| Latency improvement | Any improvement | > 200ms increase | n >= 500 |
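The significance requirement from the architecture diagram (p < 0.05) can be checked with a large-sample z-test — a standard simplification of the t-test that is reasonable at n >= 1,000. The score means and standard deviations below are illustrative assumptions, not measured values:

```python
import math
from statistics import NormalDist

def two_sample_z_p_value(mean_a: float, sd_a: float, n_a: int,
                         mean_b: float, sd_b: float, n_b: int) -> float:
    """Two-sided p-value for a difference in means (large-sample z-test)."""
    se = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    z = (mean_a - mean_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Illustrative: control (Sonnet) judge scores vs treatment (Haiku), n=2,000 each
p = two_sample_z_p_value(8.5, 1.2, 2_000, 8.2, 1.3, 2_000)
print(f"p = {p:.4g}; significant at 0.05: {p < 0.05}")
```

A significant p-value alone does not trigger rollback — the measured drop must also exceed the percentage threshold in the table above.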
## 5. Cross-Region Pricing Optimization

| Region | Sonnet Input/1M | Sonnet Output/1M | Haiku Input/1M | Haiku Output/1M | Latency from Tokyo |
| --- | --- | --- | --- | --- | --- |
| ap-northeast-1 (Tokyo) | $3.00 | $15.00 | $0.25 | $1.25 | 0 ms |
| us-east-1 (Virginia) | $3.00 | $15.00 | $0.25 | $1.25 | ~160 ms |
| us-west-2 (Oregon) | $3.00 | $15.00 | $0.25 | $1.25 | ~120 ms |
| eu-west-1 (Ireland) | $3.00 | $15.00 | $0.25 | $1.25 | ~230 ms |

**MangaAssist decision:** Bedrock pricing is uniform across regions for Claude 3. Since our users are in Japan, we use ap-northeast-1 exclusively to minimize latency. Cross-region arbitrage is not beneficial for this workload — the latency cost far exceeds any potential savings.
## 6. Fallback Chains

### 6.1 Cascading Fallback Architecture

```mermaid
flowchart TD
    QUERY[Customer Query<br/>Intent: recommendation] --> PRIMARY["PRIMARY: Sonnet<br/>Attempt invocation"]
    PRIMARY -->|Success| RESPONSE[Return Response]
    PRIMARY -->|"Failure (throttle/timeout/5xx)"| FALLBACK1["FALLBACK 1: Haiku<br/>Attempt with simplified prompt"]
    FALLBACK1 -->|Success| RESPONSE
    FALLBACK1 -->|Failure| FALLBACK2["FALLBACK 2: Template<br/>Generic response + apology"]
    FALLBACK2 --> RESPONSE
    PRIMARY -->|"Latency > 5s"| TIMEOUT["Parallel: Start Haiku<br/>Hedged request — first response wins"]
    TIMEOUT --> RESPONSE

    style PRIMARY fill:#e74c3c,color:#fff
    style FALLBACK1 fill:#3498db,color:#fff
    style FALLBACK2 fill:#2ecc71,color:#fff
    style TIMEOUT fill:#f39c12,color:#fff
```
### 6.2 Fallback Chain Implementation

```python
"""
MangaAssist Fallback Chain — Graceful Degradation on FM Failures
"""
import asyncio
import json
import logging
import time
from dataclasses import dataclass
from typing import Optional

import boto3

logger = logging.getLogger("mangaassist.fallback")


@dataclass
class FallbackResult:
    """Result from fallback chain invocation."""
    response_text: str
    model_used: str
    tier_used: str  # "primary", "fallback_1", "fallback_2"
    latency_ms: float
    was_degraded: bool
    degradation_reason: Optional[str] = None


class FallbackChain:
    """
    Executes a cascade of model invocations with graceful degradation.
    Primary → Fallback 1 → Fallback 2 (template).
    """

    FALLBACK_MAP = {
        "sonnet": ["haiku", "template"],
        "haiku": ["template"],
        "template": [],
    }

    TEMPLATE_RESPONSES = {
        "recommendation": (
            "I apologize, but I'm currently unable to provide personalized manga "
            "recommendations. Please try again in a moment, or browse our curated "
            "collections at manga.example.com/collections."
        ),
        "manga_qa": (
            "I'm sorry, I'm temporarily unable to answer detailed manga questions. "
            "You can find manga information on our wiki at manga.example.com/wiki."
        ),
        "product_search": (
            "I'm having trouble searching right now. Please try using the search bar "
            "on our website, or contact support for assistance."
        ),
        "default": (
            "I apologize for the inconvenience. I'm experiencing temporary difficulties. "
            "Please try again shortly or contact our support team."
        ),
    }

    def __init__(self, bedrock_client=None, timeout_ms: int = 5_000):
        self.bedrock = bedrock_client or boto3.client(
            "bedrock-runtime", region_name="ap-northeast-1"
        )
        self.timeout_s = timeout_ms / 1000.0

    async def invoke_with_fallback(
        self,
        prompt: str,
        primary_model: str,
        intent: str,
    ) -> FallbackResult:
        """
        Invoke the primary model, falling back through the chain on failure.
        """
        chain = [primary_model] + self.FALLBACK_MAP.get(primary_model, [])
        tier_names = ["primary", "fallback_1", "fallback_2"]
        for i, model in enumerate(chain):
            tier = tier_names[min(i, len(tier_names) - 1)]
            if model == "template":
                return FallbackResult(
                    response_text=self.TEMPLATE_RESPONSES.get(
                        intent, self.TEMPLATE_RESPONSES["default"]
                    ),
                    model_used="template",
                    tier_used=tier,
                    latency_ms=1.0,
                    was_degraded=(i > 0),
                    degradation_reason=(
                        f"All FM models failed; serving template for {intent}"
                        if i > 0 else None
                    ),
                )
            try:
                start = time.monotonic()
                response = await asyncio.wait_for(
                    self._invoke_model(model, prompt),
                    timeout=self.timeout_s,
                )
                latency = (time.monotonic() - start) * 1000
                return FallbackResult(
                    response_text=response,
                    model_used=model,
                    tier_used=tier,
                    latency_ms=latency,
                    was_degraded=(i > 0),
                    degradation_reason=(
                        f"Degraded from {primary_model} to {model}" if i > 0 else None
                    ),
                )
            except asyncio.TimeoutError:
                logger.warning("Model %s timed out after %.1fs", model, self.timeout_s)
                continue
            except Exception as e:
                logger.error("Model %s failed: %s", model, str(e))
                continue
        # All models failed — should not reach here because template is always last
        return FallbackResult(
            response_text=self.TEMPLATE_RESPONSES["default"],
            model_used="template",
            tier_used="fallback_2",
            latency_ms=1.0,
            was_degraded=True,
            degradation_reason="All models and templates exhausted",
        )

    async def _invoke_model(self, model: str, prompt: str) -> str:
        """Invoke a Bedrock model. In production, use an async Bedrock client."""
        model_ids = {
            "sonnet": "anthropic.claude-3-sonnet-20240229-v1:0",
            "haiku": "anthropic.claude-3-haiku-20240307-v1:0",
        }
        model_id = model_ids.get(model)
        if not model_id:
            raise ValueError(f"Unknown model: {model}")
        # json.dumps handles prompt escaping; building the body with an f-string
        # would break on quotes or newlines in the prompt.
        body = json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": prompt}],
        })
        # In production: use aioboto3 or run_in_executor
        loop = asyncio.get_running_loop()
        response = await loop.run_in_executor(
            None,
            lambda: self.bedrock.invoke_model(
                modelId=model_id,
                contentType="application/json",
                accept="application/json",
                body=body,
            ),
        )
        payload = json.loads(response["body"].read())
        return payload["content"][0]["text"]
```
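A condensed, dependency-free sketch of the same cascade logic, with a stub invoker standing in for Bedrock (the stub and its failure behavior are illustrative assumptions):

```python
import asyncio

async def invoke_with_fallback(chain, invoke, timeout_s: float = 5.0):
    """Try each model in order; fall through on timeout or error."""
    for model in chain:
        if model == "template":
            return "template", "canned response"
        try:
            text = await asyncio.wait_for(invoke(model), timeout=timeout_s)
            return model, text
        except (asyncio.TimeoutError, RuntimeError):
            continue  # degrade to the next tier
    raise RuntimeError("chain exhausted")

async def flaky_invoke(model: str) -> str:
    # Stand-in for Bedrock: Sonnet is throttled, Haiku succeeds.
    if model == "sonnet":
        raise RuntimeError("ThrottlingException")
    return f"{model} answer"

model, text = asyncio.run(invoke_with_fallback(["sonnet", "haiku", "template"], flaky_invoke))
print(model, "->", text)  # haiku -> haiku answer
```

Because the template tier never raises, the chain always terminates with a response — the property the behavior table below depends on.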
### 6.3 Fallback Chain — Expected Behavior

| Scenario | Primary | Fallback 1 | Fallback 2 | User Impact |
| --- | --- | --- | --- | --- |
| Normal operation | Sonnet succeeds | N/A | N/A | Full quality |
| Sonnet throttled | Sonnet 429 | Haiku succeeds | N/A | Slight quality drop |
| Sonnet + Haiku throttled | Sonnet 429 | Haiku 429 | Template | Canned response |
| Sonnet timeout (> 5s) | Timeout | Haiku succeeds | N/A | Faster, lower quality |
| All Bedrock down | Sonnet fail | Haiku fail | Template | Canned response |
## 7. Cost Projection — Monthly Forecast

### 7.1 Current Baseline

```mermaid
pie title Monthly FM Cost Projection — Current Routing ($54,225/mo)
    "Sonnet (recommendation)" : 33300
    "Sonnet (manga_qa)" : 13050
    "Sonnet (complex product_search)" : 4995
    "Haiku (product_search)" : 2250
    "Haiku (shipping_info)" : 630
    "Template (all)" : 0
```
### 7.2 Scenario Projections

| Scenario | Daily Messages | Sonnet % | Haiku % | Template % | Monthly Cost | vs Baseline |
| --- | --- | --- | --- | --- | --- | --- |
| All Sonnet | 1M | 100% | 0% | 0% | $229,500 | +323% |
| All Haiku | 1M | 0% | 100% | 0% | $5,063 | -91% |
| Current Tiered | 1M | 15% | 50% | 35% | $54,225 | Baseline |
| Optimized (after A/B tests) | 1M | 10% | 55% | 35% | $38,475 | -29% |
| 2x Growth | 2M | 15% | 50% | 35% | $108,450 | +100% |
| Sale Event (3x spike) | 3M | 15% | 50% | 35% | $162,675 | +200% |
| Sale + Budget Guardian | 3M | 5% | 50% | 45% | $56,250 | +4% |
**Key takeaway:** During a 3x traffic spike (sale event), the Budget Guardian keeps monthly costs nearly flat by dynamically shifting traffic from Sonnet to Haiku/Template. Without the guardian, costs would triple.
### 7.3 Annual Budget Planning

| Quarter | Expected Daily Volume | Routing Strategy | Monthly Budget | Quarterly Budget |
| --- | --- | --- | --- | --- |
| Q1 2026 | 1.0M | Current Tiered | $54,225 | $162,675 |
| Q2 2026 | 1.2M (spring sale) | Optimized + Guardian | $46,170 | $138,510 |
| Q3 2026 | 1.5M (growth) | Optimized + Guardian | $57,713 | $173,138 |
| Q4 2026 | 2.0M (holiday) | Aggressive + Guardian | $62,500 | $187,500 |
| **Annual Total** | | | | **$661,823** |
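A quick check of the table's arithmetic, summing the quarterly column as printed (the quarterly figures carry rounding from the monthly numbers):

```python
# Quarterly budgets as printed in the table above.
quarterly = {"Q1": 162_675, "Q2": 138_510, "Q3": 173_138, "Q4": 187_500}
annual = sum(quarterly.values())
print(f"Annual total: ${annual:,}")  # Annual total: $661,823
```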
## 8. Key Takeaways

- 94.7% of FM cost comes from 15% of traffic (Sonnet calls). Optimizing Sonnet routing has the highest ROI.
- Budget Guardian prevents runaway costs during traffic spikes — keeping a 3x spike nearly cost-neutral.
- A/B testing before routing changes prevents quality regressions — always test with statistical rigor (n >= 1,000, p < 0.05).
- Fallback chains provide graceful degradation — users always get a response, even during Bedrock outages.
- Cross-region arbitrage does not apply for MangaAssist (uniform Bedrock pricing) — minimize latency by staying in ap-northeast-1.
- Projected annual cost of $661,823 for 2026 is 76% less than an all-Sonnet approach ($2.75M).
**Previous:** 01-model-selection-framework.md — Model selection architecture, complexity classifier, routing maps.
**Next:** 03-scenarios-and-runbooks.md — Five MangaAssist production scenarios with decision trees.