Cost-Effective Model Selection Framework
MangaAssist Context: JP Manga store chatbot running on AWS. Bedrock Claude 3 Sonnet ($3/$15 per 1M input/output tokens) handles complex queries; Haiku ($0.25/$1.25 per 1M input/output tokens) handles simple ones. 1M messages/day across product search, order status, manga recommendations, and Q&A. Infrastructure: OpenSearch Serverless, DynamoDB, ECS Fargate, API Gateway WebSocket, ElastiCache Redis.
Skill Mapping
| AWS AIF-C01 Domain | Task | Skill | This File Covers |
|---|---|---|---|
| Domain 4 Operational Efficiency | Task 4.1 Cost Optimization | 4.1.2 Cost-Effective Model Selection | Cost-capability matrix, tiered FM routing, complexity classification, intent-to-model mapping, price-to-performance measurement |
1. Model Selection Dimensions Mind Map
```mermaid
mindmap
  root((Model Selection<br/>Framework))
    Cost-Capability Tradeoff
      Quality Dimensions
        Accuracy
        Reasoning Depth
        Multilingual / Japanese
        Creativity
        Speed / Latency
      Cost Dimensions
        Input Token Price
        Output Token Price
        Tokens per Request
        Requests per Day
    Tiered FM Usage
      Tier 0 — Template
        Zero LLM Cost
        Deterministic Responses
        Order Status / Shipping
      Tier 1 — Haiku
        Low Cost
        Simple Queries
        Product Search / FAQ
      Tier 2 — Sonnet
        High Quality
        Complex Reasoning
        Recommendations / Manga QA
    Complexity Classification
      Rule-Based Fast Path
        Keyword Matching
        Intent Detection
        Pattern Recognition
      ML Classifier
        Complexity Score 0-1
        Feature Extraction
        Threshold Tuning
    Intent-to-Model Routing
      product_search to Haiku
      order_status to Template
      recommendation to Sonnet
      manga_qa to Sonnet
      chitchat to Template
      shipping_info to Haiku
      escalation to Template
    Price-to-Performance
      Tokens per Dollar
      Quality per Dollar
      Satisfaction per Dollar
    Efficient Inference
      Prompt Reuse
      Response Recycling
      Batch Grouping
    Budget Controls
      Daily Budget Limits
      Dynamic Downgrade
      Cost-per-Quality-Point
```
2. Cost-Capability Tradeoff Matrix
Claude 3 Sonnet vs Haiku — Quality Dimension Comparison
| Quality Dimension | Claude 3 Sonnet | Claude 3 Haiku | Winner for MangaAssist | Notes |
|---|---|---|---|---|
| Accuracy | 9.2/10 | 7.8/10 | Sonnet | Critical for manga recommendations, plot summaries |
| Reasoning Depth | 9.5/10 | 6.5/10 | Sonnet | Multi-step reasoning needed for "manga like X but with Y" |
| Multilingual / Japanese | 9.0/10 | 7.5/10 | Sonnet | Japanese title parsing, honorifics, genre terminology |
| Creativity | 9.3/10 | 6.8/10 | Sonnet | Personalized recommendation narratives |
| Speed (Latency) | 6.5/10 | 9.5/10 | Haiku | 3x faster TTFT; ideal for simple lookups |
| Cost Efficiency | 3.0/10 | 9.5/10 | Haiku | Haiku is 12x cheaper on both input and output tokens |
Cost per Request Comparison (MangaAssist Averages)
| Metric | Sonnet | Haiku | Ratio |
|---|---|---|---|
| Avg input tokens | 800 | 500 | 1.6x |
| Avg output tokens | 350 | 150 | 2.3x |
| Input cost per request | $0.0024 | $0.000125 | 19.2x |
| Output cost per request | $0.00525 | $0.0001875 | 28.0x |
| Total cost per request | $0.00765 | $0.0003125 | 24.5x |
| Avg latency (TTFT) | 1.2s | 0.4s | 3.0x |
| Quality score (weighted) | 9.2 | 7.4 | 1.24x |
Key insight: Sonnet is 24.5x more expensive but only 1.24x better in quality. For simple queries, Haiku delivers 95%+ acceptable quality at 4% of the cost.
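The per-request figures above follow directly from the token averages and list prices. A quick sketch to reproduce them (numbers taken from the tables in this section):

```python
def cost_per_request(input_tokens: int, output_tokens: int,
                     input_price_per_1m: float, output_price_per_1m: float) -> float:
    """Per-request dollar cost from average token counts and per-1M-token prices."""
    return (input_tokens / 1_000_000) * input_price_per_1m + \
           (output_tokens / 1_000_000) * output_price_per_1m

sonnet = cost_per_request(800, 350, 3.00, 15.00)   # $0.00765
haiku = cost_per_request(500, 150, 0.25, 1.25)     # $0.0003125
print(f"Sonnet ${sonnet:.5f} vs Haiku ${haiku:.7f}: {sonnet / haiku:.1f}x")
```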
3. Tiered FM Usage — Three-Tier Routing System
```mermaid
flowchart TD
    QUERY[Incoming Customer Query] --> CLASSIFY[Complexity Classifier]
    CLASSIFY -->|Score 0.0 - 0.2| TIER0[Tier 0: Template Engine<br/>Cost: $0.00/request]
    CLASSIFY -->|Score 0.2 - 0.6| TIER1[Tier 1: Claude 3 Haiku<br/>Cost: ~$0.0003/request]
    CLASSIFY -->|Score 0.6 - 1.0| TIER2[Tier 2: Claude 3 Sonnet<br/>Cost: ~$0.0077/request]
    TIER0 --> T0_EXAMPLES["Examples:<br/>- Where is my order #12345?<br/>- What are your shipping rates?<br/>- I want to speak to a human<br/>- What are your store hours?"]
    TIER1 --> T1_EXAMPLES["Examples:<br/>- Find manga by Eiichiro Oda<br/>- Is One Piece volume 104 in stock?<br/>- What genres do you carry?<br/>- How do I return a book?"]
    TIER2 --> T2_EXAMPLES["Examples:<br/>- Recommend manga like Berserk but less dark<br/>- Explain the themes in Chainsaw Man<br/>- Compare Seinen vs Shounen for a new reader<br/>- I liked Vinland Saga, what next?"]
    TIER0 --> RESPONSE[Response to Customer]
    TIER1 --> RESPONSE
    TIER2 --> RESPONSE
    style TIER0 fill:#2ecc71,color:#fff
    style TIER1 fill:#3498db,color:#fff
    style TIER2 fill:#e74c3c,color:#fff
```
Tier Distribution (MangaAssist Production Target)
| Tier | Model | % of Traffic | Daily Requests | Cost/Request | Daily Cost | Monthly Cost |
|---|---|---|---|---|---|---|
| Tier 0 | Template | 25% | 250,000 | $0.0000 | $0.00 | $0.00 |
| Tier 1 | Haiku | 50% | 500,000 | $0.0003125 | $156.25 | $4,687.50 |
| Tier 2 | Sonnet | 25% | 250,000 | $0.00765 | $1,912.50 | $57,375.00 |
| Total | Mixed | 100% | 1,000,000 | $0.002069 | $2,068.75 | $62,062.50 |
Comparison: All-Sonnet vs Tiered Routing
| Approach | Monthly Cost | Quality (avg) | Cost Savings |
|---|---|---|---|
| All Sonnet | $229,500 | 9.2/10 | Baseline |
| All Haiku | $9,375 | 7.4/10 | 95.9% |
| Tiered (25/50/25) | $62,063 | 8.5/10 | 73.0% |
Tiered routing saves $167,437/month while maintaining an average quality score of 8.5/10 (only 0.7 points below all-Sonnet).
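The blended monthly figures can be reproduced by weighting each tier's per-request cost by its traffic share (a sketch; per-request costs from the tier table above):

```python
def monthly_cost(daily_requests: int, traffic_mix: dict[str, float],
                 cost_per_request: dict[str, float], days: int = 30) -> float:
    """Monthly spend for a traffic mix: sum of share-weighted per-request costs."""
    daily = sum(daily_requests * share * cost_per_request[tier]
                for tier, share in traffic_mix.items())
    return daily * days

COSTS = {"template": 0.0, "haiku": 0.0003125, "sonnet": 0.00765}
tiered = monthly_cost(1_000_000, {"template": 0.25, "haiku": 0.50, "sonnet": 0.25}, COSTS)
all_sonnet = monthly_cost(1_000_000, {"sonnet": 1.0}, COSTS)
print(f"tiered ${tiered:,.2f} vs all-Sonnet ${all_sonnet:,.2f} "
      f"(savings {1 - tiered / all_sonnet:.1%})")
```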
4. Complexity Classifier
The classifier combines a rule-based fast path for obvious cases with a lightweight ML scorer for ambiguous queries.
```mermaid
flowchart TD
    INPUT[Customer Message] --> RULE[Rule-Based Fast Path]
    RULE -->|Matched Template Pattern| SCORE_0["Score: 0.0<br/>Route: Template"]
    RULE -->|Matched Simple Pattern| SCORE_LOW["Score: 0.15<br/>Route: Haiku"]
    RULE -->|No Pattern Match| ML[ML Complexity Scorer]
    ML --> FEATURES[Feature Extraction]
    FEATURES --> F1["Token count"]
    FEATURES --> F2["Question depth<br/>(nested clauses)"]
    FEATURES --> F3["Named entities"]
    FEATURES --> F4["Japanese char ratio"]
    FEATURES --> F5["Comparison keywords"]
    FEATURES --> F6["Subjective language"]
    F1 & F2 & F3 & F4 & F5 & F6 --> MODEL[Lightweight Classifier<br/>Logistic Regression / Small NN]
    MODEL --> SCORE["Complexity Score<br/>0.0 - 1.0"]
    SCORE -->|0.0 - 0.2| TEMPLATE[Template Tier]
    SCORE -->|0.2 - 0.6| HAIKU[Haiku Tier]
    SCORE -->|0.6 - 1.0| SONNET[Sonnet Tier]
    style SCORE_0 fill:#2ecc71,color:#fff
    style SCORE_LOW fill:#2ecc71,color:#fff
    style TEMPLATE fill:#2ecc71,color:#fff
    style HAIKU fill:#3498db,color:#fff
    style SONNET fill:#e74c3c,color:#fff
```
5. Intent-to-Model Routing Map
| Intent | Routed Model | Avg Input Tokens | Avg Output Tokens | Cost/Request | Quality Score | Rationale |
|---|---|---|---|---|---|---|
| product_search | Haiku | 400 | 120 | $0.000250 | 8.1/10 | Structured lookup; Haiku handles keyword extraction well |
| order_status | Template | 0 (DynamoDB) | 0 | $0.000000 | 9.5/10 | Pure database lookup, deterministic response |
| recommendation | Sonnet | 1,200 | 500 | $0.011100 | 9.4/10 | Requires deep reasoning about taste, themes, similar titles |
| manga_qa | Sonnet | 900 | 400 | $0.008700 | 9.1/10 | Needs plot understanding, cultural context, Japanese nuance |
| chitchat | Template | 0 | 0 | $0.000000 | 7.5/10 | Canned friendly responses; no LLM needed |
| shipping_info | Haiku | 300 | 80 | $0.000175 | 8.8/10 | Simple policy lookup with slight personalization |
| escalation | Template | 0 | 0 | $0.000000 | 9.0/10 | Hand-off to human agent; deterministic workflow |
Daily Cost Breakdown by Intent (at 1M messages/day)
```mermaid
pie title Daily FM Cost by Intent
    "recommendation (Sonnet)" : 832
    "manga_qa (Sonnet)" : 652
    "product_search (Haiku)" : 75
    "shipping_info (Haiku)" : 17
    "order_status (Template)" : 0
    "chitchat (Template)" : 0
    "escalation (Template)" : 0
```
| Intent | % of Traffic | Daily Volume | Daily Cost | Monthly Cost |
|---|---|---|---|---|
| product_search | 30% | 300,000 | $75.00 | $2,250 |
| order_status | 20% | 200,000 | $0.00 | $0 |
| recommendation | 7.5% | 75,000 | $832.50 | $24,975 |
| manga_qa | 7.5% | 75,000 | $652.50 | $19,575 |
| chitchat | 15% | 150,000 | $0.00 | $0 |
| shipping_info | 10% | 100,000 | $17.50 | $525 |
| escalation | 10% | 100,000 | $0.00 | $0 |
| Total | 100% | 1,000,000 | $1,577.50 | $47,325 |
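The totals can be sanity-checked by multiplying each intent's daily volume by its per-request cost (figures from the routing map above):

```python
# Daily FM spend per intent = daily volume x per-request cost
# (volumes and per-request costs from the intent routing map above).
INTENT_COSTS = {
    "product_search": (300_000, 0.000250),
    "order_status":   (200_000, 0.000000),
    "recommendation": (75_000,  0.011100),
    "manga_qa":       (75_000,  0.008700),
    "chitchat":       (150_000, 0.000000),
    "shipping_info":  (100_000, 0.000175),
    "escalation":     (100_000, 0.000000),
}
daily_total = sum(volume * cost for volume, cost in INTENT_COSTS.values())
print(f"daily ${daily_total:,.2f}  monthly ${daily_total * 30:,.2f}")
```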
6. Inference Cost Balancing
Cost-per-Quality-Point Metric
This metric normalizes cost against quality output, enabling apples-to-apples comparison across models and intents.
Cost per Quality Point = (cost_per_request / quality_score) * 1000
| Model | Cost/Request | Quality Score | Cost per Quality Point |
|---|---|---|---|
| Sonnet (recommendation) | $0.01110 | 9.4 | $1.181 |
| Sonnet (manga_qa) | $0.00870 | 9.1 | $0.956 |
| Haiku (product_search) | $0.00025 | 8.1 | $0.031 |
| Haiku (shipping_info) | $0.000175 | 8.8 | $0.020 |
| Template (order_status) | $0.00000 | 9.5 | $0.000 |
Decision rule: if a cheaper model reaches at least 85% of the best available quality score, route to the cheaper model (the same 15%-gap rule used by the CostCapabilityMatrix evaluator in Section 10.2); escalate to Sonnet only when the quality gap is larger.
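The rule reduces to two one-liners, shown here with the 15%-gap threshold used by the CostCapabilityMatrix evaluator in Section 10.2 (the exact cutoff is a tuning choice):

```python
def cost_per_quality_point(cost_per_request: float, quality_score: float) -> float:
    """Dollars (scaled x1000) spent per point of quality delivered."""
    return (cost_per_request / quality_score) * 1000

def prefer_cheaper(cheaper_quality: float, best_quality: float,
                   threshold: float = 0.85) -> bool:
    """True when the cheaper model clears the quality bar and should be routed to."""
    return cheaper_quality >= best_quality * threshold

print(cost_per_quality_point(0.01110, 9.4))  # Sonnet recommendation, approx. 1.181
print(prefer_cheaper(8.1, 8.9))   # product_search: True, route to Haiku
print(prefer_cheaper(6.2, 9.4))   # recommendation: False, escalate to Sonnet
```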
Dynamic Routing Under Budget Pressure
```mermaid
flowchart TD
    BUDGET[Check Daily Budget] --> PERCENT{Budget<br/>Consumed %}
    PERCENT -->|"< 60%"| NORMAL["Normal Routing<br/>Follow intent-to-model map"]
    PERCENT -->|"60-80%"| CAUTIOUS["Cautious Mode<br/>Downgrade low-priority Sonnet to Haiku"]
    PERCENT -->|"80-95%"| AGGRESSIVE["Aggressive Savings<br/>All queries to Haiku except recommendation"]
    PERCENT -->|"> 95%"| EMERGENCY["Emergency Mode<br/>Template + Haiku only<br/>Queue Sonnet requests"]
    NORMAL --> SERVE[Serve Response]
    CAUTIOUS --> SERVE
    AGGRESSIVE --> SERVE
    EMERGENCY --> SERVE
    style NORMAL fill:#2ecc71,color:#fff
    style CAUTIOUS fill:#f39c12,color:#fff
    style AGGRESSIVE fill:#e67e22,color:#fff
    style EMERGENCY fill:#e74c3c,color:#fff
```
7. Price-to-Performance Measurement
Three Core Metrics
7.1 Tokens per Dollar
Tokens per Dollar = 1,000,000 / price_per_1M_tokens
| Model | Direction | Price / 1M Tokens | Tokens per Dollar |
|---|---|---|---|
| Sonnet | Input | $3.00 | 333,333 |
| Sonnet | Output | $15.00 | 66,667 |
| Haiku | Input | $0.25 | 4,000,000 |
| Haiku | Output | $1.25 | 800,000 |
Haiku delivers 12x more input tokens and 12x more output tokens per dollar than Sonnet.
7.2 Quality per Dollar
Quality per Dollar = (quality_score / cost_per_request) * 0.001
| Use Case | Model | Quality | Cost | Quality per Dollar |
|---|---|---|---|---|
| Recommendation | Sonnet | 9.4 | $0.011100 | 0.847 |
| Recommendation | Haiku | 6.2 | $0.000450 | 13.778 |
| Product Search | Haiku | 8.1 | $0.000250 | 32.400 |
| Product Search | Sonnet | 8.9 | $0.005550 | 1.604 |
For product search, Haiku delivers roughly 20x more quality per dollar than Sonnet (32.4 vs 1.6), with only a 0.8-point quality gap.
7.3 Satisfaction per Dollar
Based on user satisfaction surveys (thumbs up/down) mapped to model routing:
| Model + Intent | Satisfaction Rate | Cost/Request | Satisfaction per Dollar |
|---|---|---|---|
| Sonnet + recommendation | 94% | $0.01110 | 84.7 |
| Haiku + product_search | 88% | $0.00025 | 3,520.0 |
| Haiku + shipping_info | 92% | $0.000175 | 5,257.1 |
| Template + order_status | 96% | $0.00000 | Infinite |
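All three metrics reduce to one-liners. The example below plugs in the Haiku + product_search figures, with satisfaction expressed as a fraction so the table's values fall out directly:

```python
def tokens_per_dollar(price_per_1m: float) -> float:
    return 1_000_000 / price_per_1m

def quality_per_dollar(quality_score: float, cost_per_request: float) -> float:
    return (quality_score / cost_per_request) * 0.001

def satisfaction_per_dollar(satisfaction_rate: float, cost_per_request: float) -> float:
    return satisfaction_rate / cost_per_request

# Haiku + product_search figures from the tables above
print(tokens_per_dollar(0.25))                            # 4000000.0
print(round(quality_per_dollar(8.1, 0.00025), 1))         # 32.4
print(round(satisfaction_per_dollar(0.88, 0.00025), 1))   # 3520.0
```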
8. Efficient Inference Patterns
8.1 Prompt Reuse
Reuse system prompts and common context blocks across requests to leverage Bedrock's prompt caching.
```mermaid
sequenceDiagram
    participant User as Customer
    participant Router as Model Router
    participant Cache as Prompt Cache<br/>(ElastiCache)
    participant Bedrock as Bedrock API
    User->>Router: "Recommend manga like Berserk"
    Router->>Cache: Check cached system prompt hash
    Cache-->>Router: Cache HIT (system prompt ID: sp_manga_rec_v3)
    Router->>Bedrock: Invoke with cached prompt reference + user query
    Bedrock-->>Router: Response (only user-specific tokens billed)
    Router-->>User: Personalized recommendation
    Note over Cache,Bedrock: System prompt (~400 tokens) reused across<br/>all recommendation queries = 75,000 reuses/day
```
8.2 Response Recycling
For near-identical queries, recycle previous responses instead of invoking the FM again.
```mermaid
flowchart LR
    Q1["Query: Best shounen manga?"] --> HASH[Semantic Hash]
    HASH --> CHECK{Similar Response<br/>in Cache?}
    CHECK -->|Yes, similarity > 0.95| RECYCLE["Recycle cached response<br/>Cost: $0.00"]
    CHECK -->|No| INVOKE["Invoke FM<br/>Cache the response"]
    style RECYCLE fill:#2ecc71,color:#fff
    style INVOKE fill:#3498db,color:#fff
```
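A minimal, dependency-free sketch of the recycling idea. In production this would sit on ElastiCache and use embedding similarity (> 0.95) rather than exact hashing of the normalized query; both substitutions here are simplifying assumptions, not the MangaAssist implementation.

```python
import hashlib

class ResponseCache:
    """Recycle responses for repeat queries. A dict stands in for ElastiCache,
    and a hash of the normalized query stands in for semantic similarity;
    both are simplifying assumptions, not the production design."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.hits = 0

    @staticmethod
    def _key(query: str) -> str:
        # Normalize whitespace and case before hashing.
        return hashlib.sha256(" ".join(query.lower().split()).encode()).hexdigest()

    def get_or_invoke(self, query: str, invoke) -> str:
        key = self._key(query)
        if key in self._store:
            self.hits += 1              # recycled: $0.00, no FM invocation
            return self._store[key]
        response = invoke(query)        # miss: pay for one FM call, then cache
        self._store[key] = response
        return response

cache = ResponseCache()
cache.get_or_invoke("Best shounen manga?", lambda q: "Try One Piece.")
print(cache.get_or_invoke("best  shounen manga?", lambda q: "never called"))  # Try One Piece.
```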
9. MangaAssist Model Router — Architecture
```mermaid
graph TB
    subgraph "Client Layer"
        USER[Customer<br/>Browser/App]
        WS[API Gateway<br/>WebSocket]
    end
    subgraph "Orchestration Layer (ECS Fargate)"
        INTENT[Intent Classifier]
        COMPLEX[Complexity Scorer<br/>Rule + ML]
        ROUTER[Model Router<br/>Intent + Complexity + Budget]
        BUDGET[Budget Guardian<br/>Daily Spend Tracker]
        COST_LOG[Cost Logger<br/>CloudWatch Metrics]
    end
    subgraph "Response Tier 0 — Templates"
        TMPL[Template Engine<br/>Jinja2 Templates]
        DDB_T[DynamoDB<br/>Order/Shipping Data]
    end
    subgraph "Response Tier 1 — Haiku"
        HAIKU[Bedrock Claude 3 Haiku<br/>$0.25 / $1.25 per 1M tokens]
        OS_H[OpenSearch<br/>Product Index]
    end
    subgraph "Response Tier 2 — Sonnet"
        SONNET[Bedrock Claude 3 Sonnet<br/>$3.00 / $15.00 per 1M tokens]
        OS_S[OpenSearch<br/>Full Catalog + Reviews]
    end
    subgraph "Shared Infrastructure"
        REDIS[ElastiCache Redis<br/>Prompt Cache + Response Cache]
        GUARD[Bedrock Guardrails]
        CW[CloudWatch<br/>Cost + Quality Metrics]
    end
    USER --> WS --> INTENT
    INTENT --> COMPLEX --> ROUTER
    ROUTER --> BUDGET
    BUDGET --> ROUTER
    ROUTER -->|Tier 0| TMPL
    TMPL --> DDB_T
    ROUTER -->|Tier 1| HAIKU
    HAIKU --> OS_H
    ROUTER -->|Tier 2| SONNET
    SONNET --> OS_S
    TMPL --> GUARD --> WS
    HAIKU --> GUARD
    SONNET --> GUARD
    ROUTER --> COST_LOG --> CW
    REDIS -.->|Cache Check| ROUTER
    style SONNET fill:#e74c3c,color:#fff
    style HAIKU fill:#3498db,color:#fff
    style TMPL fill:#2ecc71,color:#fff
    style ROUTER fill:#ff9900,color:#000
    style BUDGET fill:#f39c12,color:#fff
```
10. Production Python Code
10.1 ModelRouter Class — Complexity Scoring and Cost-Aware Routing
"""
MangaAssist Model Router — Cost-Effective FM Selection
Routes queries to Template / Haiku / Sonnet based on intent, complexity, and budget.
"""
import re
import time
import logging
from enum import Enum
from dataclasses import dataclass

import redis
logger = logging.getLogger("mangaassist.model_router")
# ---------------------------------------------------------------------------
# Enums & Data Classes
# ---------------------------------------------------------------------------
class ModelTier(Enum):
TEMPLATE = "template"
HAIKU = "haiku"
SONNET = "sonnet"
class Intent(Enum):
PRODUCT_SEARCH = "product_search"
ORDER_STATUS = "order_status"
RECOMMENDATION = "recommendation"
MANGA_QA = "manga_qa"
CHITCHAT = "chitchat"
SHIPPING_INFO = "shipping_info"
ESCALATION = "escalation"
@dataclass
class RoutingDecision:
"""Encapsulates the routing result with full traceability."""
model_tier: ModelTier
model_id: str
intent: Intent
complexity_score: float
estimated_cost: float
budget_mode: str # normal, cautious, aggressive, emergency
reasoning: str
@dataclass
class ModelConfig:
"""Configuration for a Bedrock model."""
model_id: str
input_price_per_1m: float # $ per 1M input tokens
output_price_per_1m: float # $ per 1M output tokens
avg_input_tokens: int
avg_output_tokens: int
quality_score: float # 0-10 scale
max_tokens: int = 4096
@property
def cost_per_request(self) -> float:
input_cost = (self.avg_input_tokens / 1_000_000) * self.input_price_per_1m
output_cost = (self.avg_output_tokens / 1_000_000) * self.output_price_per_1m
return input_cost + output_cost
# ---------------------------------------------------------------------------
# Model Configurations
# ---------------------------------------------------------------------------
MODELS = {
ModelTier.SONNET: ModelConfig(
model_id="anthropic.claude-3-sonnet-20240229-v1:0",
input_price_per_1m=3.00,
output_price_per_1m=15.00,
avg_input_tokens=800,
avg_output_tokens=350,
quality_score=9.2,
),
ModelTier.HAIKU: ModelConfig(
model_id="anthropic.claude-3-haiku-20240307-v1:0",
input_price_per_1m=0.25,
output_price_per_1m=1.25,
avg_input_tokens=500,
avg_output_tokens=150,
quality_score=7.4,
),
}
# ---------------------------------------------------------------------------
# Intent-to-Model Default Routing Map
# ---------------------------------------------------------------------------
INTENT_MODEL_MAP: dict[Intent, ModelTier] = {
Intent.PRODUCT_SEARCH: ModelTier.HAIKU,
Intent.ORDER_STATUS: ModelTier.TEMPLATE,
Intent.RECOMMENDATION: ModelTier.SONNET,
Intent.MANGA_QA: ModelTier.SONNET,
Intent.CHITCHAT: ModelTier.TEMPLATE,
Intent.SHIPPING_INFO: ModelTier.HAIKU,
Intent.ESCALATION: ModelTier.TEMPLATE,
}
# ---------------------------------------------------------------------------
# Complexity Classifier
# ---------------------------------------------------------------------------
class ComplexityClassifier:
"""
Two-stage classifier:
Stage 1 — Rule-based fast path for obvious patterns.
Stage 2 — Feature-based ML scorer for ambiguous queries.
Returns a score from 0.0 (trivial) to 1.0 (highly complex).
"""
# Rule-based patterns (compiled once)
TEMPLATE_PATTERNS = [
re.compile(r"(?i)(where|track)\s.*(order|package|shipment)\s*#?\d+"),
re.compile(r"(?i)(speak|talk|transfer).*(human|agent|representative)"),
re.compile(r"(?i)(store\s*hours|opening\s*hours|when.*(open|close))"),
re.compile(r"(?i)(shipping\s*(rate|cost|fee|price))"),
]
SIMPLE_PATTERNS = [
re.compile(r"(?i)(do you have|in stock|is .* available)\b"),
re.compile(r"(?i)(find|search|look for)\s+(manga|book|volume)\b"),
re.compile(r"(?i)(how (do|can) I)\s+(return|exchange|cancel)\b"),
re.compile(r"(?i)^(what genres|which categories)"),
]
COMPLEX_INDICATORS = [
re.compile(r"(?i)(recommend|suggest|similar to|like .* but)"),
re.compile(r"(?i)(compare|difference between|versus|vs)"),
re.compile(r"(?i)(explain|analyze|theme|meaning|symbolism)"),
re.compile(r"(?i)(best .* for .* who)"),
]
def classify(self, query: str) -> float:
"""Return complexity score 0.0 - 1.0."""
# Stage 1: Rule-based fast path
for pattern in self.TEMPLATE_PATTERNS:
if pattern.search(query):
return 0.0
for pattern in self.SIMPLE_PATTERNS:
if pattern.search(query):
return 0.15
# Stage 2: Feature-based scoring
return self._ml_score(query)
def _ml_score(self, query: str) -> float:
"""
Lightweight feature-based complexity scoring.
In production, this would be a trained logistic regression or small NN.
Here we use interpretable heuristic features that mirror the trained model.
"""
score = 0.3 # base score for unmatched queries
features = self._extract_features(query)
# Token length contributes to complexity
if features["token_count"] > 20:
score += 0.1
if features["token_count"] > 40:
score += 0.1
# Complex indicator keywords
for pattern in self.COMPLEX_INDICATORS:
if pattern.search(query):
score += 0.15
# Question depth (subordinate clauses)
score += min(features["clause_count"] * 0.05, 0.15)
# Japanese character ratio (mixed-language = harder)
if features["japanese_char_ratio"] > 0.1:
score += 0.05
# Subjective language
if features["has_subjective_language"]:
score += 0.1
# Named entity count (more entities = more complex comparison)
if features["named_entity_count"] >= 2:
score += 0.1
return min(score, 1.0)
def _extract_features(self, query: str) -> dict:
"""Extract features for the ML scorer."""
tokens = query.split()
# Count Japanese characters (Hiragana, Katakana, CJK)
jp_chars = sum(
1 for c in query
if ("\u3040" <= c <= "\u309f") # Hiragana
or ("\u30a0" <= c <= "\u30ff") # Katakana
or ("\u4e00" <= c <= "\u9fff") # CJK
)
subjective_words = {
"best", "worst", "favorite", "love", "hate", "amazing",
"boring", "exciting", "beautiful", "dark", "deep",
}
return {
"token_count": len(tokens),
"clause_count": query.count(",") + query.count(" and ") + query.count(" but "),
"japanese_char_ratio": jp_chars / max(len(query), 1),
"has_subjective_language": bool(
set(t.lower() for t in tokens) & subjective_words
),
"named_entity_count": self._count_title_case_sequences(tokens),
}
@staticmethod
def _count_title_case_sequences(tokens: list[str]) -> int:
"""Count sequences of title-case words as named entity proxies."""
count = 0
in_entity = False
for token in tokens:
if token[0:1].isupper() and token.isalpha():
if not in_entity:
count += 1
in_entity = True
else:
in_entity = False
return count
# ---------------------------------------------------------------------------
# Budget Guardian (simplified — full version in 02-inference-cost-optimization.md)
# ---------------------------------------------------------------------------
class BudgetGuardian:
"""Tracks daily FM spend and returns the current budget mode."""
DAILY_BUDGET = 2_500.00 # $2,500/day target
def __init__(self, redis_client: redis.Redis):
self.redis = redis_client
def get_budget_mode(self) -> str:
"""Return current budget mode based on daily spend."""
spent = self._get_daily_spend()
pct = spent / self.DAILY_BUDGET
if pct < 0.60:
return "normal"
elif pct < 0.80:
return "cautious"
elif pct < 0.95:
return "aggressive"
else:
return "emergency"
def record_cost(self, cost: float) -> None:
"""Record an inference cost against today's budget."""
key = f"budget:daily:{time.strftime('%Y-%m-%d')}"
self.redis.incrbyfloat(key, cost)
self.redis.expire(key, 86400 * 2) # TTL 2 days
def _get_daily_spend(self) -> float:
key = f"budget:daily:{time.strftime('%Y-%m-%d')}"
val = self.redis.get(key)
return float(val) if val else 0.0
# ---------------------------------------------------------------------------
# Model Router
# ---------------------------------------------------------------------------
class ModelRouter:
"""
Cost-aware model router for MangaAssist.
Combines intent detection, complexity scoring, and budget constraints
to select the optimal model tier for each query.
"""
# Budget mode overrides: which intents get downgraded
DOWNGRADE_RULES = {
"cautious": {
# In cautious mode, manga_qa drops from Sonnet to Haiku
Intent.MANGA_QA: ModelTier.HAIKU,
},
"aggressive": {
# In aggressive mode, everything except recommendation goes to Haiku
Intent.MANGA_QA: ModelTier.HAIKU,
Intent.PRODUCT_SEARCH: ModelTier.TEMPLATE,
},
"emergency": {
# In emergency mode, only recommendation stays on Haiku, rest Template
Intent.RECOMMENDATION: ModelTier.HAIKU,
Intent.MANGA_QA: ModelTier.HAIKU,
Intent.PRODUCT_SEARCH: ModelTier.TEMPLATE,
Intent.SHIPPING_INFO: ModelTier.TEMPLATE,
},
}
def __init__(
self,
classifier: ComplexityClassifier,
budget_guardian: BudgetGuardian,
):
self.classifier = classifier
self.budget = budget_guardian
def route(self, query: str, intent: Intent) -> RoutingDecision:
"""
Determine the optimal model for a given query and intent.
Steps:
1. Look up default model from intent-to-model map.
2. Score query complexity.
3. Check budget mode and apply downgrades if needed.
4. Return RoutingDecision with full traceability.
"""
# Step 1: Default model from intent map
default_tier = INTENT_MODEL_MAP[intent]
# Step 2: Complexity scoring
complexity = self.classifier.classify(query)
# Step 3: Complexity-based override (upgrade simple→template or downgrade)
tier = self._apply_complexity_override(default_tier, complexity)
# Step 4: Budget-aware adjustment
budget_mode = self.budget.get_budget_mode()
tier, reasoning = self._apply_budget_override(tier, intent, budget_mode)
# Step 5: Resolve model ID and cost
if tier == ModelTier.TEMPLATE:
model_id = "template_engine"
estimated_cost = 0.0
else:
config = MODELS[tier]
model_id = config.model_id
estimated_cost = config.cost_per_request
# Record cost
self.budget.record_cost(estimated_cost)
return RoutingDecision(
model_tier=tier,
model_id=model_id,
intent=intent,
complexity_score=complexity,
estimated_cost=estimated_cost,
budget_mode=budget_mode,
reasoning=reasoning,
)
def _apply_complexity_override(
self, default_tier: ModelTier, complexity: float
) -> ModelTier:
"""
Override model tier based on complexity score.
- If default is Haiku but complexity > 0.7, upgrade to Sonnet.
- If default is Sonnet but complexity < 0.3, downgrade to Haiku.
"""
if default_tier == ModelTier.HAIKU and complexity > 0.7:
logger.info("Upgrading Haiku -> Sonnet (complexity=%.2f)", complexity)
return ModelTier.SONNET
if default_tier == ModelTier.SONNET and complexity < 0.3:
logger.info("Downgrading Sonnet -> Haiku (complexity=%.2f)", complexity)
return ModelTier.HAIKU
return default_tier
def _apply_budget_override(
self, tier: ModelTier, intent: Intent, budget_mode: str
) -> tuple[ModelTier, str]:
"""Apply budget-mode downgrades."""
if budget_mode == "normal":
return tier, f"Normal routing: {intent.value} -> {tier.value}"
overrides = self.DOWNGRADE_RULES.get(budget_mode, {})
if intent in overrides:
new_tier = overrides[intent]
if self._tier_rank(new_tier) < self._tier_rank(tier):
reason = (
f"Budget {budget_mode}: downgraded {intent.value} "
f"from {tier.value} to {new_tier.value}"
)
logger.warning(reason)
return new_tier, reason
return tier, f"{budget_mode} mode: {intent.value} -> {tier.value} (no change)"
@staticmethod
def _tier_rank(tier: ModelTier) -> int:
return {ModelTier.TEMPLATE: 0, ModelTier.HAIKU: 1, ModelTier.SONNET: 2}[tier]
# ---------------------------------------------------------------------------
# Usage Example
# ---------------------------------------------------------------------------
def demo():
"""Demonstrate the ModelRouter with sample MangaAssist queries."""
redis_client = redis.Redis(host="mangaassist-cache.xxxxx.ng.0001.apne1.cache.amazonaws.com")
classifier = ComplexityClassifier()
guardian = BudgetGuardian(redis_client)
router = ModelRouter(classifier, guardian)
test_cases = [
("Where is my order #98765?", Intent.ORDER_STATUS),
("Do you have One Piece volume 104?", Intent.PRODUCT_SEARCH),
("Recommend manga like Berserk but less dark and more hopeful", Intent.RECOMMENDATION),
("What themes does Chainsaw Man explore?", Intent.MANGA_QA),
("Hello!", Intent.CHITCHAT),
("What are your shipping rates to Osaka?", Intent.SHIPPING_INFO),
]
for query, intent in test_cases:
decision = router.route(query, intent)
print(
f" [{decision.model_tier.value:>8}] "
f"complexity={decision.complexity_score:.2f} "
f"cost=${decision.estimated_cost:.6f} "
f"mode={decision.budget_mode:>10} | "
f"{query[:60]}"
)
if __name__ == "__main__":
demo()
10.2 CostCapabilityMatrix Evaluator
"""
MangaAssist Cost-Capability Matrix Evaluator
Evaluates and compares model cost-effectiveness across intents.
"""
from dataclasses import dataclass
from typing import Optional
import json
@dataclass
class EvaluationResult:
"""Single model evaluation for a specific intent."""
intent: str
model: str
cost_per_request: float
quality_score: float
satisfaction_rate: float
tokens_per_dollar: float
quality_per_dollar: float
satisfaction_per_dollar: float
cost_per_quality_point: float
class CostCapabilityMatrix:
"""
Evaluates models across cost and capability dimensions,
producing actionable routing recommendations.
"""
# Quality scores by (model, intent) — from production evaluation
QUALITY_SCORES: dict[tuple[str, str], float] = {
("sonnet", "product_search"): 8.9,
("sonnet", "recommendation"): 9.4,
("sonnet", "manga_qa"): 9.1,
("sonnet", "shipping_info"): 9.0,
("haiku", "product_search"): 8.1,
("haiku", "recommendation"): 6.2,
("haiku", "manga_qa"): 6.5,
("haiku", "shipping_info"): 8.8,
("template", "order_status"): 9.5,
("template", "chitchat"): 7.5,
("template", "escalation"): 9.0,
}
# Satisfaction rates by (model, intent) — from user feedback
SATISFACTION_RATES: dict[tuple[str, str], float] = {
("sonnet", "product_search"): 0.93,
("sonnet", "recommendation"): 0.94,
("sonnet", "manga_qa"): 0.91,
("sonnet", "shipping_info"): 0.92,
("haiku", "product_search"): 0.88,
("haiku", "recommendation"): 0.62,
("haiku", "manga_qa"): 0.64,
("haiku", "shipping_info"): 0.92,
("template", "order_status"): 0.96,
("template", "chitchat"): 0.78,
("template", "escalation"): 0.90,
}
# Average tokens per request by (model, intent)
AVG_TOKENS: dict[tuple[str, str], tuple[int, int]] = {
("sonnet", "product_search"): (600, 250),
("sonnet", "recommendation"): (1200, 500),
("sonnet", "manga_qa"): (900, 400),
("sonnet", "shipping_info"): (400, 150),
("haiku", "product_search"): (400, 120),
("haiku", "recommendation"): (800, 200),
("haiku", "manga_qa"): (600, 180),
("haiku", "shipping_info"): (300, 80),
}
# Pricing
PRICING = {
"sonnet": {"input": 3.00, "output": 15.00}, # per 1M tokens
"haiku": {"input": 0.25, "output": 1.25},
"template": {"input": 0.00, "output": 0.00},
}
def evaluate(self, model: str, intent: str) -> Optional[EvaluationResult]:
"""Evaluate a specific model-intent combination."""
quality = self.QUALITY_SCORES.get((model, intent))
satisfaction = self.SATISFACTION_RATES.get((model, intent))
if quality is None or satisfaction is None:
return None
cost = self._calculate_cost(model, intent)
# Tokens per dollar (using input tokens as reference)
pricing = self.PRICING[model]
tokens_per_dollar = (
1_000_000 / pricing["input"] if pricing["input"] > 0 else float("inf")
)
# Quality per dollar
quality_per_dollar = (
(quality / cost) * 0.001 if cost > 0 else float("inf")
)
        # Satisfaction per dollar (rate kept as a fraction, matching Section 7.3)
        satisfaction_per_dollar = (
            (satisfaction / cost) if cost > 0 else float("inf")
        )
# Cost per quality point
cost_per_quality_point = (
(cost / quality) * 1000 if quality > 0 else float("inf")
)
return EvaluationResult(
intent=intent,
model=model,
cost_per_request=cost,
quality_score=quality,
satisfaction_rate=satisfaction,
tokens_per_dollar=tokens_per_dollar,
quality_per_dollar=quality_per_dollar,
satisfaction_per_dollar=satisfaction_per_dollar,
cost_per_quality_point=cost_per_quality_point,
)
def compare_models(self, intent: str) -> dict:
"""
Compare all available models for a given intent.
Returns recommendation with reasoning.
"""
results = {}
for model in ["template", "haiku", "sonnet"]:
result = self.evaluate(model, intent)
if result:
results[model] = result
if not results:
return {"error": f"No evaluations available for intent: {intent}"}
# Find the model with best quality-per-dollar
best_value = max(results.values(), key=lambda r: r.quality_per_dollar)
# Find the model with best raw quality
best_quality = max(results.values(), key=lambda r: r.quality_score)
# Recommendation logic:
# Use the cheaper model if its quality is within 15% of the best quality
quality_threshold = best_quality.quality_score * 0.85
cost_effective_options = [
r for r in results.values() if r.quality_score >= quality_threshold
]
recommended = min(cost_effective_options, key=lambda r: r.cost_per_request)
return {
"intent": intent,
"evaluations": {k: self._to_dict(v) for k, v in results.items()},
"recommended_model": recommended.model,
"reasoning": (
f"For '{intent}': {recommended.model} achieves "
f"{recommended.quality_score}/10 quality at "
f"${recommended.cost_per_request:.6f}/request. "
f"Best quality ({best_quality.model}) scores "
f"{best_quality.quality_score}/10 at "
f"${best_quality.cost_per_request:.6f}/request — "
f"{best_quality.cost_per_request / max(recommended.cost_per_request, 0.000001):.1f}x more expensive."
),
}
def full_matrix_report(self) -> str:
"""Generate a full cost-capability matrix report."""
intents = [
"product_search", "order_status", "recommendation",
"manga_qa", "chitchat", "shipping_info", "escalation",
]
lines = [
"=" * 90,
"MANGAASSIST COST-CAPABILITY MATRIX REPORT",
"=" * 90,
"",
]
for intent in intents:
comparison = self.compare_models(intent)
if "error" in comparison:
continue
lines.append(f"--- {intent.upper()} ---")
lines.append(f" Recommended: {comparison['recommended_model']}")
lines.append(f" {comparison['reasoning']}")
for model, data in comparison["evaluations"].items():
lines.append(
f" {model:>8}: quality={data['quality_score']:.1f} "
f"cost=${data['cost_per_request']:.6f} "
f"satisfaction={data['satisfaction_rate']:.0%} "
f"cost/qp=${data['cost_per_quality_point']:.4f}"
)
lines.append("")
return "\n".join(lines)
def _calculate_cost(self, model: str, intent: str) -> float:
"""Calculate cost per request for a model-intent pair."""
if model == "template":
return 0.0
tokens = self.AVG_TOKENS.get((model, intent))
if not tokens:
return 0.0
input_tokens, output_tokens = tokens
pricing = self.PRICING[model]
input_cost = (input_tokens / 1_000_000) * pricing["input"]
output_cost = (output_tokens / 1_000_000) * pricing["output"]
return input_cost + output_cost
@staticmethod
def _to_dict(result: EvaluationResult) -> dict:
return {
"quality_score": result.quality_score,
"cost_per_request": result.cost_per_request,
"satisfaction_rate": result.satisfaction_rate,
"tokens_per_dollar": result.tokens_per_dollar,
"quality_per_dollar": result.quality_per_dollar,
"satisfaction_per_dollar": result.satisfaction_per_dollar,
"cost_per_quality_point": result.cost_per_quality_point,
}
# ---------------------------------------------------------------------------
# Usage
# ---------------------------------------------------------------------------
if __name__ == "__main__":
matrix = CostCapabilityMatrix()
print(matrix.full_matrix_report())
print("\n\nDetailed comparison for 'recommendation':")
result = matrix.compare_models("recommendation")
print(json.dumps(result, indent=2, default=str))
11. Complete Intent x Model Cost Matrix
| Intent | Model | Avg Input Tokens | Avg Output Tokens | Cost / Request | Quality Score | Satisfaction | Cost/Quality Point |
|---|---|---|---|---|---|---|---|
| product_search | Haiku | 400 | 120 | $0.000250 | 8.1 | 88% | $0.031 |
| product_search | Sonnet | 600 | 250 | $0.005550 | 8.9 | 93% | $0.624 |
| order_status | Template | 0 | 0 | $0.000000 | 9.5 | 96% | $0.000 |
| recommendation | Sonnet | 1,200 | 500 | $0.011100 | 9.4 | 94% | $1.181 |
| recommendation | Haiku | 800 | 200 | $0.000450 | 6.2 | 62% | $0.073 |
| manga_qa | Sonnet | 900 | 400 | $0.008700 | 9.1 | 91% | $0.956 |
| manga_qa | Haiku | 600 | 180 | $0.000375 | 6.5 | 64% | $0.058 |
| chitchat | Template | 0 | 0 | $0.000000 | 7.5 | 78% | $0.000 |
| shipping_info | Haiku | 300 | 80 | $0.000175 | 8.8 | 92% | $0.020 |
| shipping_info | Sonnet | 400 | 150 | $0.003450 | 9.0 | 92% | $0.383 |
| escalation | Template | 0 | 0 | $0.000000 | 9.0 | 90% | $0.000 |
Reading the table: for `product_search`, Haiku costs $0.000250 vs Sonnet's $0.005550 (22x cheaper) while scoring 8.1 vs 8.9 (only 9% lower quality); Haiku is the clear winner. For `recommendation`, Sonnet's 9.4 quality vs Haiku's 6.2 (a 34% gap) justifies the 24.7x cost premium.
12. Key Takeaways
- 73% cost reduction is achievable through tiered routing without meaningful quality loss.
- Template tier serves order_status, chitchat, and escalation at zero LLM cost (45% of daily traffic by intent).
- Complexity scoring prevents the two worst outcomes: overspending on simple queries and underserving complex ones.
- Budget guardian provides automatic cost protection during traffic spikes and sale events.
- Cost-per-quality-point is the single most useful metric for routing decisions — not raw cost or raw quality alone.
Next: 02-inference-cost-optimization.md — Deep-dive into inference cost patterns, budget guardian implementation, and A/B testing model assignments.