# Optimized FM Deployment and Model Cascading Architecture
MangaAssist context: a JP manga store chatbot on AWS: Bedrock Claude 3 (Sonnet at $3/$15 per 1M input/output tokens, Haiku at $0.25/$1.25), OpenSearch Serverless (vector store), DynamoDB (sessions/products), ECS Fargate (orchestrator), API Gateway WebSocket, ElastiCache Redis. Target: a useful answer in under 3 seconds at 1M messages/day.
## Skill Mapping
| Attribute | Value |
|---|---|
| Certification | AWS Certified AI Practitioner (AIF-C01) |
| Domain | 2 — Fundamentals of Generative AI |
| Task | 2.2 — Understand the basics of using foundation models |
| Skill | 2.2.3 — Develop optimized FM deployment approaches to balance performance and resource requirements (selecting appropriate models, smaller pre-trained models for specific tasks, API-based model cascading for routine queries) |
## Table of Contents
- Optimization Strategy Mindmap
- Model Cascading Architecture
- Small Model Selection for Specific Tasks
- Performance vs Cost Trade-off Frameworks
- ModelCascadeRouter Implementation
- ModelSelectorFramework Implementation
- PerformanceOptimizer Implementation
- Decision Trees for Model Selection
- Cost Analysis Deep Dive
- Key Takeaways
## Optimization Strategy Mindmap

```
Optimized FM Deployment
|-- Model Selection
|   |-- Task matching
|   '-- Performance tiers
|-- Model Distillation
|   |-- Knowledge transfer
|   '-- Fine-tuning
|-- Model Cascading
|   |-- Haiku first      --> Claude 3 Haiku for simple tasks
|   |-- Sonnet if needed --> Claude 3 Sonnet for complex tasks
|   '-- Confidence score --> escalate below threshold (0.80+)
|-- Right-Sizing
|   |-- Batch vs streaming
|   '-- On-demand allocation --> scale infrastructure
'-- API-Based Orchestration
    |-- Gateway routing
    '-- Load balancing
```
Sub-strategies:

```
Model Selection                  Model Distillation
|-- Task complexity              |-- Teacher-student learning
|-- Latency requirements         |-- Knowledge compression
|-- Cost constraints             |-- Task-specific fine-tuning
|-- Quality thresholds           |-- Bedrock custom models
'-- Token budget                 '-- Embedding model optimization

Model Cascading                  Right-Sizing
|-- Confidence scoring           |-- Instance type selection
|-- Escalation rules             |-- Batch vs real-time
|-- Fallback chains              |-- Concurrency limits
|-- Response validation          |-- Memory allocation
'-- Cost-aware routing           '-- Provisioned throughput

API-Based Orchestration
|-- Gateway routing rules
|-- Model endpoint management
|-- Rate limiting per model
|-- Circuit breaker patterns
'-- A/B testing infrastructure
```
## Model Cascading Architecture

### The Core Principle: Haiku First, Sonnet If Needed

Model cascading is a cost-optimization pattern that routes queries through progressively more capable (and more expensive) models, escalating only when the cheaper model's answer is not good enough. For MangaAssist, this means:

```
User Query
|
v
+-------------------+
| Intent Classifier | (Haiku - $0.25/1M input)
| + Complexity |
| Estimator |
+--------+----------+
|
+----+----+
| |
Simple Complex
(70-80%) (20-30%)
| |
v v
+--------+ +--------+
| Haiku | | Sonnet |
| $0.25 | | $3.00 |
| /1M in | | /1M in |
+---+----+ +---+----+
| |
v v
+--------+ +--------+
| Conf | | Direct |
| Check | | Return |
| >=0.80 | +--------+
+---+----+
|
+-+---+
| |
Pass Fail
| |
v v
Return Escalate
to Sonnet
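
Stripped to its control flow, the cascade is a few lines. A minimal sketch, assuming hypothetical `ask_haiku` / `ask_sonnet` callables that each return a `(response, confidence)` pair; the `ModelCascadeRouter` below replaces these stubs with real Bedrock calls:

```python
# Minimal cascade sketch. `ask_haiku` and `ask_sonnet` are hypothetical
# stand-ins for Bedrock invocations, each returning (response, confidence).

CONFIDENCE_THRESHOLD = 0.80

def cascade(query: str, is_complex: bool, ask_haiku, ask_sonnet) -> str:
    if is_complex:
        # Recommendations, comparisons, nuanced discussion: straight to Sonnet.
        response, _ = ask_sonnet(query)
        return response
    response, confidence = ask_haiku(query)
    if confidence >= CONFIDENCE_THRESHOLD:
        return response  # ~70-80% of traffic terminates here
    # Escalation path: the Haiku call is sunk cost; Sonnet answers.
    response, _ = ask_sonnet(query)
    return response
```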
```

### Traffic Distribution Analysis
| Query Category | Percentage | Model | Avg Tokens (In/Out) | Cost per Query |
|---|---|---|---|---|
| Simple FAQ / Greetings | 35% | Haiku | 200 / 100 | $0.000175 |
| Product Lookup | 25% | Haiku | 400 / 200 | $0.000350 |
| Manga Recommendations | 20% | Sonnet | 600 / 400 | $0.007800 |
| Complex Comparison | 10% | Sonnet | 800 / 600 | $0.011400 |
| Nuanced Discussion | 7% | Sonnet | 1000 / 800 | $0.015000 |
| Haiku Escalation to Sonnet | 3% | Both | 400+600 / 200+400 | $0.008150 |
### Daily Cost at 1M Messages

```
Without Cascading (All Sonnet):
Average tokens: 500 in / 350 out per message
Daily cost = 1M * (500 * $3/1M + 350 * $15/1M)
= 1M * ($0.0015 + $0.00525)
= $6,750/day
= $202,500/month
With Cascading (Haiku-first):
Haiku messages (60%): 600K * (300 * $0.25/1M + 150 * $1.25/1M)
= 600K * $0.0002625
= $157.50/day
Sonnet messages (37%): 370K * (700 * $3/1M + 500 * $15/1M)
= 370K * $0.0096
= $3,552/day
Escalated (3%): 30K * (400*$0.25/1M + 200*$1.25/1M + 600*$3/1M + 400*$15/1M)
= 30K * $0.00815
= $244.50/day
Total: $3,954/day = $118,620/month
Savings: 41.4% reduction ($83,880/month saved)
```
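
A short script reproduces these figures (a sketch; prices and token averages are taken from the calculation above, with 30-day months assumed):

```python
# Reproduce the daily-cost comparison. Prices are USD per 1M tokens.
HAIKU = {"in": 0.25, "out": 1.25}
SONNET = {"in": 3.00, "out": 15.00}

def cost(price, tokens_in, tokens_out, messages):
    return messages * (tokens_in * price["in"] + tokens_out * price["out"]) / 1_000_000

# Without cascading: all 1M messages on Sonnet at 500 in / 350 out.
all_sonnet = cost(SONNET, 500, 350, 1_000_000)                 # $6,750.00

# With cascading: 60% Haiku, 37% Sonnet, 3% escalated (both models billed).
haiku = cost(HAIKU, 300, 150, 600_000)                         # $157.50
sonnet = cost(SONNET, 700, 500, 370_000)                       # $3,552.00
escalated = cost(HAIKU, 400, 200, 30_000) + cost(SONNET, 600, 400, 30_000)  # $244.50

cascade = haiku + sonnet + escalated                           # $3,954.00
print(f"All Sonnet: ${all_sonnet:,.2f}/day  Cascade: ${cascade:,.2f}/day")
print(f"Savings: {(1 - cascade / all_sonnet):.1%}")            # 41.4%
```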
## Small Model Selection for Specific Tasks

### Task-to-Model Mapping for MangaAssist
| Task | Recommended Model | Why | Latency | Quality |
|---|---|---|---|---|
| Intent Classification | Haiku | Structured output, limited reasoning needed | ~200ms | 95%+ |
| Entity Extraction (manga) | Haiku | Pattern matching, named entities | ~150ms | 93%+ |
| Simple FAQ Answers | Haiku | Retrieval-augmented, template-like | ~300ms | 90%+ |
| Product Search Query Build | Haiku | SQL/query generation, structured | ~250ms | 92%+ |
| Manga Recommendations | Sonnet | Nuanced preference understanding | ~800ms | 96%+ |
| Complex Comparisons | Sonnet | Multi-factor reasoning | ~1000ms | 95%+ |
| Content Summarization | Haiku (short) | Extractive summarization | ~400ms | 88%+ |
| Content Summarization | Sonnet (detailed) | Abstractive, nuanced summaries | ~900ms | 95%+ |
| Sentiment Analysis | Haiku | Binary/ternary classification | ~100ms | 94%+ |
| Language Translation (JP) | Sonnet | Cultural nuance in JP manga context | ~700ms | 97%+ |
### Model Selection Decision Matrix

```
Low Complexity High Complexity
+--------------------+--------------------+
| | |
Low Latency | Haiku | Haiku + Cache |
Requirement | Direct response | Pre-computed + |
(< 500ms) | Simple prompts | fallback Sonnet |
| | |
+--------------------+--------------------+
| | |
High Latency | Haiku | Sonnet |
Tolerance | Cost-optimized | Quality-optimized |
(< 3000ms) | Batch-friendly | Full reasoning |
| | |
             +--------------------+--------------------+
```
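
The matrix maps directly onto a two-input lookup. A sketch, assuming the complexity flag and latency budget come from the upstream classifier:

```python
# Quadrant lookup from the decision matrix above.
def select_strategy(high_complexity: bool, latency_budget_ms: float) -> str:
    if latency_budget_ms < 500:
        if high_complexity:
            return "haiku+cache"   # pre-computed answers, Sonnet as async fallback
        return "haiku"             # direct response, simple prompts
    if high_complexity:
        return "sonnet"            # quality-optimized, full reasoning
    return "haiku"                 # cost-optimized, batch-friendly

assert select_strategy(False, 300) == "haiku"
assert select_strategy(True, 2000) == "sonnet"
```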
## Performance vs Cost Trade-off Frameworks

### Framework 1: Quality-Cost Frontier

```
Quality
Score
|
1.0| * Sonnet (all queries)
| *
| * Cascade (optimized)
0.9| *
| * Cascade (aggressive)
| *
0.8| * Haiku (all queries)
|
0.7|
+----+----+----+----+----+----+----> Cost/query
$0 $2 $4 $6 $8 $10 $12
(x10^-3)
```

Sweet spot: Cascade (optimized) at roughly $0.004/query with 0.94 quality.
### Framework 2: Latency-Throughput Trade-off

```
Throughput
(queries/sec)
|
500| * Haiku-only
|
400| * Cascade (aggressive)
|
300| * Cascade (optimized)
|
200| * Cascade (conservative)
|
100| * Sonnet-only
|
+----+-----+-----+-----+-----+-----> P99 Latency (ms)
0 500 1000 1500 2000 2500
```

Target: 300+ queries/sec with P99 < 2500 ms.
### Framework 3: Cost Efficiency Score

```python
# Cost Efficiency Score (CES) = Quality * (1 / NormalizedCost) * LatencyFactor
def calculate_ces(quality_score, cost_per_query, latency_ms, target_latency_ms=3000):
"""
Calculate Cost Efficiency Score for a deployment strategy.
Args:
quality_score: 0.0-1.0 quality rating
cost_per_query: Dollar cost per query
latency_ms: Average response latency in milliseconds
target_latency_ms: Target latency SLA
Returns:
CES score (higher is better)
"""
# Normalize cost (lower is better, avoid division by zero)
max_cost = 0.015 # Sonnet full-reasoning cost as baseline
normalized_cost = cost_per_query / max_cost
cost_factor = 1.0 / max(normalized_cost, 0.01)
# Latency factor (1.0 if within SLA, degrades if over)
latency_factor = min(1.0, target_latency_ms / max(latency_ms, 1))
# Combined CES
ces = quality_score * cost_factor * latency_factor
return round(ces, 4)
# MangaAssist strategy comparison
strategies = {
"All Sonnet": {"quality": 0.97, "cost": 0.0096, "latency": 1200},
"All Haiku": {"quality": 0.82, "cost": 0.0003, "latency": 300},
"Cascade Aggressive":{"quality": 0.88, "cost": 0.0025, "latency": 500},
"Cascade Optimized": {"quality": 0.94, "cost": 0.0040, "latency": 700},
"Cascade Conservative":{"quality": 0.96, "cost": 0.0065, "latency": 900},
}
# Rank the strategies by CES
for name, s in strategies.items():
    print(f"{name:22s} CES = {calculate_ces(s['quality'], s['cost'], s['latency']):.4f}")

# All Sonnet            CES = 1.5156
# All Haiku             CES = 41.0000
# Cascade Aggressive    CES = 5.2800
# Cascade Optimized     CES = 3.5250
# Cascade Conservative  CES = 2.2154
```

Raw CES rewards cheap deployments: all-Haiku scores highest because the formula has no quality floor. Applying the MangaAssist quality SLA (mean score >= 0.90) rules out all-Haiku (0.82) and the aggressive cascade (0.88), leaving Cascade Optimized as the best balance at CES 3.5250.
## ModelCascadeRouter Implementation

```python
"""
ModelCascadeRouter: Intelligent routing of queries to appropriate FM models
based on complexity, task type, and confidence thresholds.
MangaAssist Production Implementation
"""
import json
import time
import hashlib
import logging
from enum import Enum
from typing import Optional
from dataclasses import dataclass, field
import boto3
from botocore.config import Config
logger = logging.getLogger(__name__)
class ModelTier(Enum):
"""Available model tiers in the cascade."""
HAIKU = "anthropic.claude-3-haiku-20240307-v1:0"
SONNET = "anthropic.claude-3-sonnet-20240229-v1:0"
class TaskCategory(Enum):
"""Predefined task categories for MangaAssist."""
GREETING = "greeting"
FAQ = "faq"
PRODUCT_SEARCH = "product_search"
RECOMMENDATION = "recommendation"
COMPARISON = "comparison"
DISCUSSION = "discussion"
ORDER_STATUS = "order_status"
SUMMARIZATION = "summarization"
TRANSLATION = "translation"
UNKNOWN = "unknown"
# Task-to-tier mapping: which model tier handles which task by default
TASK_MODEL_MAP = {
TaskCategory.GREETING: ModelTier.HAIKU,
TaskCategory.FAQ: ModelTier.HAIKU,
TaskCategory.PRODUCT_SEARCH: ModelTier.HAIKU,
TaskCategory.ORDER_STATUS: ModelTier.HAIKU,
TaskCategory.SUMMARIZATION: ModelTier.HAIKU,
TaskCategory.RECOMMENDATION: ModelTier.SONNET,
TaskCategory.COMPARISON: ModelTier.SONNET,
TaskCategory.DISCUSSION: ModelTier.SONNET,
TaskCategory.TRANSLATION: ModelTier.SONNET,
TaskCategory.UNKNOWN: ModelTier.HAIKU, # Default to cheap, escalate if needed
}
@dataclass
class CascadeConfig:
"""Configuration for the cascade router."""
confidence_threshold: float = 0.80
max_escalations: int = 1
haiku_max_tokens: int = 1024
sonnet_max_tokens: int = 2048
temperature_haiku: float = 0.3
temperature_sonnet: float = 0.5
enable_caching: bool = True
cache_ttl_seconds: int = 300
escalation_timeout_ms: int = 2500
region: str = "us-east-1"
@dataclass
class CascadeResult:
"""Result of a cascade routing decision."""
response: str
model_used: ModelTier
was_escalated: bool
confidence_score: float
latency_ms: float
input_tokens: int
output_tokens: int
estimated_cost: float
task_category: TaskCategory
cache_hit: bool = False
@dataclass
class CascadeMetrics:
    """Tracks cascade performance metrics."""
    total_requests: int = 0
    haiku_requests: int = 0
    sonnet_requests: int = 0
    escalations: int = 0
    cache_hits: int = 0
    measured_requests: int = 0  # Non-cached requests contributing latency samples
    total_cost: float = 0.0
    avg_latency_ms: float = 0.0
    quality_scores: list = field(default_factory=list)
class ModelCascadeRouter:
"""
Routes queries through a model cascade: Haiku first, Sonnet if needed.
The router implements:
1. Task classification to determine initial model selection
2. Confidence-based escalation from Haiku to Sonnet
3. Response caching to avoid redundant inference
4. Cost tracking and optimization metrics
5. Circuit breaker for model endpoint failures
Architecture:
User Query --> TaskClassifier --> ModelSelector --> Inference
|
Haiku (cheap)
|
Confidence Check
/ \\
>= 0.80 < 0.80
| |
Return Escalate to Sonnet
"""
# Pricing per 1M tokens (USD)
PRICING = {
ModelTier.HAIKU: {"input": 0.25, "output": 1.25},
ModelTier.SONNET: {"input": 3.00, "output": 15.00},
}
def __init__(self, config: Optional[CascadeConfig] = None):
self.config = config or CascadeConfig()
self.metrics = CascadeMetrics()
self._response_cache = {}
# Initialize Bedrock client with retry configuration
boto_config = Config(
region_name=self.config.region,
retries={"max_attempts": 3, "mode": "adaptive"},
read_timeout=10,
connect_timeout=5,
)
self.bedrock = boto3.client("bedrock-runtime", config=boto_config)
logger.info(
"ModelCascadeRouter initialized | confidence_threshold=%.2f | region=%s",
self.config.confidence_threshold,
self.config.region,
)
# ------------------------------------------------------------------
# Public API
# ------------------------------------------------------------------
def route_query(
self,
query: str,
conversation_history: Optional[list] = None,
task_hint: Optional[TaskCategory] = None,
force_model: Optional[ModelTier] = None,
) -> CascadeResult:
"""
Route a query through the model cascade.
Args:
query: The user's input query
conversation_history: Previous messages for context
task_hint: Optional pre-classified task category
force_model: Force a specific model (bypasses cascade)
Returns:
CascadeResult with response and metadata
"""
start_time = time.time()
self.metrics.total_requests += 1
# Step 1: Check cache
if self.config.enable_caching:
cache_key = self._build_cache_key(query, conversation_history)
cached = self._check_cache(cache_key)
if cached:
self.metrics.cache_hits += 1
cached.cache_hit = True
return cached
# Step 2: Classify task
task_category = task_hint or self._classify_task(query)
# Step 3: Determine initial model
if force_model:
initial_model = force_model
else:
initial_model = TASK_MODEL_MAP.get(task_category, ModelTier.HAIKU)
# Step 4: If task maps directly to Sonnet, skip cascade
if initial_model == ModelTier.SONNET:
result = self._invoke_model(
ModelTier.SONNET, query, conversation_history, task_category
)
self.metrics.sonnet_requests += 1
result.latency_ms = (time.time() - start_time) * 1000
self._update_metrics(result)
if self.config.enable_caching:
self._store_cache(cache_key, result)
return result
# Step 5: Try Haiku first
haiku_result = self._invoke_model(
ModelTier.HAIKU, query, conversation_history, task_category
)
self.metrics.haiku_requests += 1
# Step 6: Check confidence and decide escalation
if haiku_result.confidence_score >= self.config.confidence_threshold:
haiku_result.latency_ms = (time.time() - start_time) * 1000
self._update_metrics(haiku_result)
if self.config.enable_caching:
self._store_cache(cache_key, haiku_result)
return haiku_result
# Step 7: Escalate to Sonnet
logger.info(
"Escalating to Sonnet | confidence=%.2f < threshold=%.2f | task=%s",
haiku_result.confidence_score,
self.config.confidence_threshold,
task_category.value,
)
sonnet_result = self._invoke_model(
ModelTier.SONNET, query, conversation_history, task_category
)
sonnet_result.was_escalated = True
sonnet_result.estimated_cost += haiku_result.estimated_cost # Include Haiku cost
sonnet_result.input_tokens += haiku_result.input_tokens
sonnet_result.output_tokens += haiku_result.output_tokens
self.metrics.sonnet_requests += 1
self.metrics.escalations += 1
sonnet_result.latency_ms = (time.time() - start_time) * 1000
self._update_metrics(sonnet_result)
if self.config.enable_caching:
self._store_cache(cache_key, sonnet_result)
return sonnet_result
def get_metrics_summary(self) -> dict:
"""Return current cascade metrics."""
total = max(self.metrics.total_requests, 1)
return {
"total_requests": self.metrics.total_requests,
"haiku_percentage": round(self.metrics.haiku_requests / total * 100, 1),
"sonnet_percentage": round(self.metrics.sonnet_requests / total * 100, 1),
"escalation_rate": round(self.metrics.escalations / total * 100, 1),
"cache_hit_rate": round(self.metrics.cache_hits / total * 100, 1),
"total_cost": round(self.metrics.total_cost, 4),
"avg_cost_per_query": round(self.metrics.total_cost / total, 6),
"avg_latency_ms": round(self.metrics.avg_latency_ms, 1),
}
# ------------------------------------------------------------------
# Internal Methods
# ------------------------------------------------------------------
def _classify_task(self, query: str) -> TaskCategory:
"""
Classify the query into a task category using lightweight heuristics
before invoking any model. Falls back to UNKNOWN for model-based routing.
"""
query_lower = query.lower().strip()
# Greeting patterns
greeting_patterns = [
"hello", "hi", "hey", "good morning", "good afternoon",
"konnichiwa", "ohayo", "こんにちは", "おはよう",
]
if any(query_lower.startswith(p) for p in greeting_patterns):
return TaskCategory.GREETING
# Order status patterns
order_patterns = ["order", "shipping", "delivery", "tracking", "注文", "配送"]
if any(p in query_lower for p in order_patterns):
return TaskCategory.ORDER_STATUS
# Product search patterns
search_patterns = [
"find", "search", "looking for", "where can i",
"do you have", "探す", "検索",
]
if any(p in query_lower for p in search_patterns):
return TaskCategory.PRODUCT_SEARCH
# Recommendation patterns
recommend_patterns = [
"recommend", "suggest", "similar to", "like",
"what should i read", "おすすめ",
]
if any(p in query_lower for p in recommend_patterns):
return TaskCategory.RECOMMENDATION
# Comparison patterns
compare_patterns = ["compare", "versus", "vs", "difference between", "比較"]
if any(p in query_lower for p in compare_patterns):
return TaskCategory.COMPARISON
# FAQ patterns
faq_patterns = [
"how do i", "what is", "can i", "do you", "is there",
"how much", "when", "返品", "支払い",
]
if any(p in query_lower for p in faq_patterns):
return TaskCategory.FAQ
return TaskCategory.UNKNOWN
def _invoke_model(
self,
model_tier: ModelTier,
query: str,
conversation_history: Optional[list],
task_category: TaskCategory,
) -> CascadeResult:
"""
Invoke a Bedrock model and return structured result.
"""
max_tokens = (
self.config.haiku_max_tokens
if model_tier == ModelTier.HAIKU
else self.config.sonnet_max_tokens
)
temperature = (
self.config.temperature_haiku
if model_tier == ModelTier.HAIKU
else self.config.temperature_sonnet
)
# Build messages
messages = []
if conversation_history:
messages.extend(conversation_history[-6:]) # Last 3 turns
messages.append({"role": "user", "content": query})
system_prompt = self._build_system_prompt(task_category)
body = json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": max_tokens,
"temperature": temperature,
"system": system_prompt,
"messages": messages,
})
try:
response = self.bedrock.invoke_model(
modelId=model_tier.value,
contentType="application/json",
accept="application/json",
body=body,
)
response_body = json.loads(response["body"].read())
content = response_body["content"][0]["text"]
input_tokens = response_body["usage"]["input_tokens"]
output_tokens = response_body["usage"]["output_tokens"]
stop_reason = response_body.get("stop_reason", "end_turn")
# Calculate confidence from stop reason and response characteristics
confidence = self._estimate_confidence(content, stop_reason, task_category)
# Calculate cost
pricing = self.PRICING[model_tier]
cost = (
input_tokens * pricing["input"] / 1_000_000
+ output_tokens * pricing["output"] / 1_000_000
)
return CascadeResult(
response=content,
model_used=model_tier,
was_escalated=False,
confidence_score=confidence,
latency_ms=0.0, # Set by caller
input_tokens=input_tokens,
output_tokens=output_tokens,
estimated_cost=cost,
task_category=task_category,
)
except Exception as e:
logger.error("Model invocation failed | model=%s | error=%s", model_tier.value, str(e))
raise
def _estimate_confidence(
self, response: str, stop_reason: str, task_category: TaskCategory
) -> float:
"""
Estimate confidence in the response quality.
Factors considered:
- Stop reason (end_turn vs max_tokens)
- Response length relative to task expectations
- Presence of hedging language
- Task category typical accuracy
"""
confidence = 0.85 # Base confidence
# Penalize truncated responses
if stop_reason == "max_tokens":
confidence -= 0.15
# Check for hedging / uncertainty markers
hedging_phrases = [
"i'm not sure", "i don't know", "it's unclear",
"i cannot determine", "maybe", "perhaps",
"わかりません", "不明",
]
response_lower = response.lower()
hedging_count = sum(1 for phrase in hedging_phrases if phrase in response_lower)
confidence -= hedging_count * 0.05
# Task-specific confidence adjustments
task_adjustments = {
TaskCategory.GREETING: 0.10, # Greetings are almost always correct
TaskCategory.FAQ: 0.05, # FAQ with RAG is usually reliable
TaskCategory.PRODUCT_SEARCH: 0.03, # Structured queries are reliable
TaskCategory.ORDER_STATUS: 0.05, # Template-based responses
TaskCategory.RECOMMENDATION: -0.05, # Subjective, harder to be confident
TaskCategory.COMPARISON: -0.08, # Requires deep reasoning
TaskCategory.DISCUSSION: -0.10, # Most subjective
TaskCategory.UNKNOWN: -0.05, # Unknown tasks get penalized
}
confidence += task_adjustments.get(task_category, 0.0)
# Clamp to [0.0, 1.0]
return max(0.0, min(1.0, confidence))
def _build_system_prompt(self, task_category: TaskCategory) -> str:
"""Build task-optimized system prompt."""
base = (
"You are MangaAssist, a helpful JP manga store chatbot. "
"Provide accurate, concise answers about manga products, "
"recommendations, and store services. Respond in the same "
"language as the user's query."
)
task_additions = {
TaskCategory.GREETING: " Keep greetings brief and friendly.",
TaskCategory.FAQ: " Answer the question directly. Cite store policies when relevant.",
TaskCategory.PRODUCT_SEARCH: " Help the user find specific manga. Ask clarifying questions if needed.",
TaskCategory.RECOMMENDATION: (
" Provide thoughtful manga recommendations based on user preferences. "
"Consider genre, art style, themes, and similar series."
),
TaskCategory.COMPARISON: (
" Compare manga series objectively across multiple dimensions: "
"story, art, pacing, character development, and audience appeal."
),
TaskCategory.ORDER_STATUS: " Provide clear order status information. Be precise about dates and statuses.",
}
return base + task_additions.get(task_category, "")
def _build_cache_key(self, query: str, history: Optional[list]) -> str:
"""Create a deterministic cache key."""
raw = query
if history:
raw += json.dumps(history[-4:], sort_keys=True)
return hashlib.sha256(raw.encode()).hexdigest()
def _check_cache(self, key: str) -> Optional[CascadeResult]:
"""Check if a response is cached and still valid."""
if key in self._response_cache:
entry = self._response_cache[key]
if time.time() - entry["timestamp"] < self.config.cache_ttl_seconds:
return entry["result"]
else:
del self._response_cache[key]
return None
def _store_cache(self, key: str, result: CascadeResult):
"""Store a response in the cache."""
self._response_cache[key] = {
"result": result,
"timestamp": time.time(),
}
    def _update_metrics(self, result: CascadeResult):
        """Update running cost/latency metrics. Cache hits skip this method,
        so the running average is taken over measured (non-cached) requests
        rather than total_requests."""
        self.metrics.total_cost += result.estimated_cost
        self.metrics.measured_requests += 1
        n = self.metrics.measured_requests
        self.metrics.avg_latency_ms = (
            (self.metrics.avg_latency_ms * (n - 1) + result.latency_ms) / n
        )
```
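
A usage sketch for the router. It assumes AWS credentials with `bedrock:InvokeModel` permission in `us-east-1` and invokes live models, so tokens are billed; the sample queries are illustrative:

```python
# Usage sketch (assumes Bedrock access is configured; calls are billed).
router = ModelCascadeRouter(CascadeConfig(confidence_threshold=0.80))

result = router.route_query("Do you have Vinland Saga volume 14 in stock?")
print(result.model_used.name, result.was_escalated, f"${result.estimated_cost:.6f}")

# Bypass the cascade for an A/B quality comparison against Sonnet.
baseline = router.route_query(
    "Compare Berserk and Vinland Saga for a new reader.",
    force_model=ModelTier.SONNET,
)

print(router.get_metrics_summary())  # haiku/sonnet split, escalation rate, cost
```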
## ModelSelectorFramework Implementation

```python
"""
ModelSelectorFramework: Framework for selecting the optimal model
based on task requirements, constraints, and historical performance.
Implements multi-criteria decision making for model selection.
"""
import logging
from enum import Enum
from typing import Optional
from dataclasses import dataclass
logger = logging.getLogger(__name__)
class OptimizationGoal(Enum):
"""Primary optimization objectives."""
COST = "cost"
QUALITY = "quality"
LATENCY = "latency"
BALANCED = "balanced"
class QueryComplexity(Enum):
"""Complexity classification for queries."""
TRIVIAL = 1 # Greetings, acknowledgments
SIMPLE = 2 # Direct FAQ, lookups
MODERATE = 3 # Search with filters, basic recommendations
COMPLEX = 4 # Multi-criteria comparisons, detailed analysis
EXPERT = 5 # Nuanced discussion, cultural context, creative tasks
@dataclass
class ModelProfile:
"""Profile describing a model's characteristics."""
model_id: str
tier_name: str
cost_input_per_1m: float
cost_output_per_1m: float
avg_latency_ms: float
quality_score: float # 0.0-1.0 aggregate quality
max_context_tokens: int
strengths: list # Task categories where it excels
weaknesses: list # Task categories where it struggles
throughput_qps: float # Estimated queries per second
@dataclass
class SelectionCriteria:
"""Criteria for model selection decision."""
query_complexity: QueryComplexity
max_latency_ms: float = 3000.0
max_cost_per_query: float = 0.02
min_quality_score: float = 0.85
optimization_goal: OptimizationGoal = OptimizationGoal.BALANCED
required_tokens: int = 500
task_category: str = "general"
@dataclass
class SelectionResult:
"""Result of a model selection decision."""
selected_model: ModelProfile
selection_reason: str
score: float
alternatives: list
trade_offs: dict
class ModelSelectorFramework:
"""
Multi-criteria model selection framework.
Uses weighted scoring to evaluate models against task requirements:
Score = w_cost * CostScore + w_quality * QualityScore
+ w_latency * LatencyScore + w_fit * TaskFitScore
Weights are adjusted based on the OptimizationGoal.
"""
# Weight profiles for different optimization goals
WEIGHT_PROFILES = {
OptimizationGoal.COST: {"cost": 0.50, "quality": 0.20, "latency": 0.15, "fit": 0.15},
OptimizationGoal.QUALITY: {"cost": 0.10, "quality": 0.50, "latency": 0.15, "fit": 0.25},
OptimizationGoal.LATENCY: {"cost": 0.15, "quality": 0.15, "latency": 0.50, "fit": 0.20},
OptimizationGoal.BALANCED: {"cost": 0.25, "quality": 0.30, "latency": 0.20, "fit": 0.25},
}
def __init__(self):
self.model_registry = {}
self._register_default_models()
def _register_default_models(self):
"""Register MangaAssist model profiles."""
self.register_model(ModelProfile(
model_id="anthropic.claude-3-haiku-20240307-v1:0",
tier_name="Haiku",
cost_input_per_1m=0.25,
cost_output_per_1m=1.25,
avg_latency_ms=250,
quality_score=0.82,
max_context_tokens=200000,
strengths=["classification", "extraction", "faq", "search", "greeting"],
weaknesses=["nuanced_reasoning", "creative_writing", "complex_comparison"],
throughput_qps=500,
))
self.register_model(ModelProfile(
model_id="anthropic.claude-3-sonnet-20240229-v1:0",
tier_name="Sonnet",
cost_input_per_1m=3.00,
cost_output_per_1m=15.00,
avg_latency_ms=800,
quality_score=0.95,
max_context_tokens=200000,
strengths=[
"nuanced_reasoning", "creative_writing", "complex_comparison",
"recommendation", "translation", "cultural_context",
],
weaknesses=[],
throughput_qps=150,
))
def register_model(self, profile: ModelProfile):
"""Register a model profile in the framework."""
self.model_registry[profile.model_id] = profile
logger.info("Registered model: %s (%s)", profile.tier_name, profile.model_id)
def select_model(self, criteria: SelectionCriteria) -> SelectionResult:
"""
Select the optimal model based on provided criteria.
Algorithm:
1. Filter models that meet hard constraints (latency, cost, tokens)
2. Score remaining models using weighted multi-criteria evaluation
3. Return best model with alternatives and trade-off analysis
"""
candidates = list(self.model_registry.values())
# Step 1: Hard constraint filtering
eligible = []
for model in candidates:
# Estimate cost for this query
est_cost = self._estimate_query_cost(model, criteria.required_tokens)
if model.avg_latency_ms <= criteria.max_latency_ms and est_cost <= criteria.max_cost_per_query:
eligible.append(model)
if not eligible:
# If no model meets all constraints, relax and pick best available
logger.warning("No model meets all hard constraints, relaxing requirements")
eligible = candidates
# Step 2: Score eligible models
weights = self.WEIGHT_PROFILES[criteria.optimization_goal]
scored = []
for model in eligible:
score = self._score_model(model, criteria, weights)
scored.append((model, score))
scored.sort(key=lambda x: x[1], reverse=True)
# Step 3: Build result
best_model, best_score = scored[0]
alternatives = [
{"model": m.tier_name, "score": round(s, 4)}
for m, s in scored[1:]
]
trade_offs = self._analyze_trade_offs(best_model, scored, criteria)
reason = self._generate_selection_reason(best_model, criteria, best_score)
return SelectionResult(
selected_model=best_model,
selection_reason=reason,
score=round(best_score, 4),
alternatives=alternatives,
trade_offs=trade_offs,
)
def _score_model(self, model: ModelProfile, criteria: SelectionCriteria, weights: dict) -> float:
"""Calculate weighted score for a model."""
# Cost score (lower cost = higher score)
est_cost = self._estimate_query_cost(model, criteria.required_tokens)
max_cost = criteria.max_cost_per_query
cost_score = max(0, 1.0 - (est_cost / max_cost)) if max_cost > 0 else 0.5
# Quality score (direct from profile)
quality_score = model.quality_score
# Latency score (lower latency = higher score)
latency_score = max(0, 1.0 - (model.avg_latency_ms / criteria.max_latency_ms))
# Task fit score (does the model excel at this task?)
task_fit_score = self._calculate_task_fit(model, criteria)
# Weighted combination
total = (
weights["cost"] * cost_score
+ weights["quality"] * quality_score
+ weights["latency"] * latency_score
+ weights["fit"] * task_fit_score
)
return total
def _calculate_task_fit(self, model: ModelProfile, criteria: SelectionCriteria) -> float:
"""Calculate how well a model fits the specific task."""
task = criteria.task_category.lower()
if task in model.strengths:
return 1.0
elif task in model.weaknesses:
return 0.3
else:
return 0.6 # Neutral
def _estimate_query_cost(self, model: ModelProfile, token_count: int) -> float:
"""Estimate cost for a single query."""
input_tokens = token_count
output_tokens = int(token_count * 0.7) # Assume 70% output ratio
cost = (
input_tokens * model.cost_input_per_1m / 1_000_000
+ output_tokens * model.cost_output_per_1m / 1_000_000
)
return cost
def _analyze_trade_offs(self, selected, scored, criteria) -> dict:
"""Analyze what we gain/lose with the selection."""
trade_offs = {}
for model, score in scored:
if model.model_id != selected.model_id:
est_cost_selected = self._estimate_query_cost(selected, criteria.required_tokens)
est_cost_alt = self._estimate_query_cost(model, criteria.required_tokens)
trade_offs[model.tier_name] = {
"cost_difference": round(est_cost_selected - est_cost_alt, 6),
"quality_difference": round(selected.quality_score - model.quality_score, 3),
"latency_difference_ms": round(selected.avg_latency_ms - model.avg_latency_ms, 1),
}
return trade_offs
def _generate_selection_reason(self, model, criteria, score) -> str:
"""Generate human-readable selection reason."""
goal_text = {
OptimizationGoal.COST: "cost optimization",
OptimizationGoal.QUALITY: "quality maximization",
OptimizationGoal.LATENCY: "latency minimization",
OptimizationGoal.BALANCED: "balanced performance",
}
return (
f"Selected {model.tier_name} (score={score:.4f}) for "
f"{goal_text[criteria.optimization_goal]} | "
f"task={criteria.task_category} | "
f"complexity={criteria.query_complexity.name}"
        )
```
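
A usage sketch for the selector (the criteria values are illustrative, not tuned):

```python
# Usage sketch: choose a model for a latency-sensitive recommendation query.
framework = ModelSelectorFramework()

criteria = SelectionCriteria(
    query_complexity=QueryComplexity.COMPLEX,
    max_latency_ms=1500,          # tighter than the 3-second chatbot SLA
    optimization_goal=OptimizationGoal.BALANCED,
    required_tokens=600,
    task_category="recommendation",
)

result = framework.select_model(criteria)
print(result.selection_reason)
print(result.trade_offs)  # cost/quality/latency deltas vs the alternatives
```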
## PerformanceOptimizer Implementation

```python
"""
PerformanceOptimizer: Monitors and optimizes FM deployment performance
across latency, throughput, cost, and quality dimensions.
Provides real-time optimization recommendations and auto-tuning.
"""
import time
import logging
import statistics
from collections import deque
from dataclasses import dataclass, field
from typing import Optional
logger = logging.getLogger(__name__)
@dataclass
class PerformanceWindow:
"""Sliding window of performance observations."""
window_size: int = 1000
latencies: deque = field(default_factory=lambda: deque(maxlen=1000))
costs: deque = field(default_factory=lambda: deque(maxlen=1000))
quality_scores: deque = field(default_factory=lambda: deque(maxlen=1000))
escalation_flags: deque = field(default_factory=lambda: deque(maxlen=1000))
timestamps: deque = field(default_factory=lambda: deque(maxlen=1000))
@dataclass
class OptimizationRecommendation:
"""A recommended optimization action."""
action: str
reason: str
expected_impact: dict
priority: str # "critical", "high", "medium", "low"
auto_applicable: bool # Can be applied automatically
@dataclass
class PerformanceReport:
"""Comprehensive performance report."""
timestamp: float
period_seconds: int
metrics: dict
sla_compliance: dict
recommendations: list
cost_projection: dict
class PerformanceOptimizer:
"""
Continuously monitors and optimizes FM deployment performance.
Optimization Strategies:
1. Confidence threshold tuning
2. Cache hit rate optimization
3. Prompt length optimization
4. Batch processing for non-real-time tasks
5. Model routing weight adjustment
6. Circuit breaker tuning
SLA Targets (MangaAssist):
- P50 latency: < 1000ms
- P95 latency: < 2500ms
- P99 latency: < 3000ms
- Quality score: > 0.90
- Cost per query: < $0.006
- Availability: > 99.9%
"""
SLA_TARGETS = {
"p50_latency_ms": 1000,
"p95_latency_ms": 2500,
"p99_latency_ms": 3000,
"min_quality_score": 0.90,
"max_cost_per_query": 0.006,
"min_availability": 0.999,
"max_escalation_rate": 0.15,
}
def __init__(self, cascade_router=None):
self.cascade_router = cascade_router
self.performance_window = PerformanceWindow()
self._optimization_history = []
self._alert_callbacks = []
logger.info("PerformanceOptimizer initialized with SLA targets: %s", self.SLA_TARGETS)
def record_observation(
self,
latency_ms: float,
cost: float,
quality_score: float,
was_escalated: bool,
):
"""Record a single query observation for performance tracking."""
now = time.time()
self.performance_window.latencies.append(latency_ms)
self.performance_window.costs.append(cost)
self.performance_window.quality_scores.append(quality_score)
self.performance_window.escalation_flags.append(was_escalated)
self.performance_window.timestamps.append(now)
def analyze_performance(self) -> PerformanceReport:
"""
Analyze current performance window and generate report.
"""
window = self.performance_window
if len(window.latencies) < 10:
logger.warning("Insufficient data for analysis (need >= 10 observations)")
return self._empty_report()
latencies = list(window.latencies)
costs = list(window.costs)
quality_scores = list(window.quality_scores)
escalation_flags = list(window.escalation_flags)
# Compute metrics
metrics = {
"latency": {
"p50_ms": round(statistics.median(latencies), 1),
"p95_ms": round(self._percentile(latencies, 95), 1),
"p99_ms": round(self._percentile(latencies, 99), 1),
"mean_ms": round(statistics.mean(latencies), 1),
"std_ms": round(statistics.stdev(latencies) if len(latencies) > 1 else 0, 1),
},
"cost": {
"mean_per_query": round(statistics.mean(costs), 6),
"total_window": round(sum(costs), 4),
"std_per_query": round(statistics.stdev(costs) if len(costs) > 1 else 0, 6),
},
"quality": {
"mean_score": round(statistics.mean(quality_scores), 4),
"min_score": round(min(quality_scores), 4),
"below_threshold_pct": round(
sum(1 for q in quality_scores if q < self.SLA_TARGETS["min_quality_score"])
/ len(quality_scores) * 100, 1
),
},
"escalation": {
"rate": round(sum(escalation_flags) / len(escalation_flags), 4),
"count": sum(escalation_flags),
"total": len(escalation_flags),
},
}
# Check SLA compliance
sla_compliance = self._check_sla_compliance(metrics)
# Generate recommendations
recommendations = self._generate_recommendations(metrics, sla_compliance)
# Cost projection
cost_projection = self._project_costs(metrics)
report = PerformanceReport(
timestamp=time.time(),
period_seconds=self._window_duration(),
metrics=metrics,
sla_compliance=sla_compliance,
recommendations=recommendations,
cost_projection=cost_projection,
)
# Trigger alerts if needed
self._check_alerts(sla_compliance)
return report
def auto_tune(self, report: PerformanceReport) -> list:
"""
Apply automatic optimizations based on performance report.
Only applies recommendations marked as auto_applicable.
Returns list of applied optimizations.
"""
applied = []
for rec in report.recommendations:
if not rec.auto_applicable:
continue
if rec.action == "raise_confidence_threshold":
if self.cascade_router:
old = self.cascade_router.config.confidence_threshold
new = min(0.95, old + 0.02)
self.cascade_router.config.confidence_threshold = new
applied.append(f"Raised confidence threshold: {old:.2f} -> {new:.2f}")
elif rec.action == "lower_confidence_threshold":
if self.cascade_router:
old = self.cascade_router.config.confidence_threshold
new = max(0.60, old - 0.02)
self.cascade_router.config.confidence_threshold = new
applied.append(f"Lowered confidence threshold: {old:.2f} -> {new:.2f}")
elif rec.action == "increase_cache_ttl":
if self.cascade_router:
old = self.cascade_router.config.cache_ttl_seconds
new = min(600, old + 60)
self.cascade_router.config.cache_ttl_seconds = new
applied.append(f"Increased cache TTL: {old}s -> {new}s")
elif rec.action == "reduce_haiku_max_tokens":
if self.cascade_router:
old = self.cascade_router.config.haiku_max_tokens
new = max(256, old - 128)
self.cascade_router.config.haiku_max_tokens = new
applied.append(f"Reduced Haiku max tokens: {old} -> {new}")
if applied:
self._optimization_history.append({
"timestamp": time.time(),
"actions": applied,
})
logger.info("Auto-tune applied %d optimizations: %s", len(applied), applied)
return applied
# ------------------------------------------------------------------
# Internal Methods
# ------------------------------------------------------------------
def _check_sla_compliance(self, metrics: dict) -> dict:
"""Check each SLA target against current metrics."""
return {
"p50_latency": {
"target": self.SLA_TARGETS["p50_latency_ms"],
"actual": metrics["latency"]["p50_ms"],
"compliant": metrics["latency"]["p50_ms"] <= self.SLA_TARGETS["p50_latency_ms"],
},
"p95_latency": {
"target": self.SLA_TARGETS["p95_latency_ms"],
"actual": metrics["latency"]["p95_ms"],
"compliant": metrics["latency"]["p95_ms"] <= self.SLA_TARGETS["p95_latency_ms"],
},
"p99_latency": {
"target": self.SLA_TARGETS["p99_latency_ms"],
"actual": metrics["latency"]["p99_ms"],
"compliant": metrics["latency"]["p99_ms"] <= self.SLA_TARGETS["p99_latency_ms"],
},
"quality": {
"target": self.SLA_TARGETS["min_quality_score"],
"actual": metrics["quality"]["mean_score"],
"compliant": metrics["quality"]["mean_score"] >= self.SLA_TARGETS["min_quality_score"],
},
"cost": {
"target": self.SLA_TARGETS["max_cost_per_query"],
"actual": metrics["cost"]["mean_per_query"],
"compliant": metrics["cost"]["mean_per_query"] <= self.SLA_TARGETS["max_cost_per_query"],
},
"escalation_rate": {
"target": self.SLA_TARGETS["max_escalation_rate"],
"actual": metrics["escalation"]["rate"],
"compliant": metrics["escalation"]["rate"] <= self.SLA_TARGETS["max_escalation_rate"],
},
}
def _generate_recommendations(self, metrics: dict, sla: dict) -> list:
"""Generate optimization recommendations based on current state."""
recommendations = []
        # High escalation rate -> lower the confidence threshold so fewer Haiku answers escalate
if not sla["escalation_rate"]["compliant"]:
recommendations.append(OptimizationRecommendation(
action="lower_confidence_threshold",
reason=(
f"Escalation rate ({sla['escalation_rate']['actual']:.1%}) exceeds "
f"target ({sla['escalation_rate']['target']:.1%}). Lowering threshold "
"will reduce escalations but may affect quality."
),
expected_impact={"escalation_rate_delta": -0.03, "quality_delta": -0.01},
priority="high",
auto_applicable=True,
))
# High cost -> optimize token usage or shift more to Haiku
if not sla["cost"]["compliant"]:
recommendations.append(OptimizationRecommendation(
action="reduce_haiku_max_tokens",
reason=(
f"Cost per query (${sla['cost']['actual']:.6f}) exceeds "
f"target (${sla['cost']['target']:.6f}). Reducing max tokens "
"will lower cost but may truncate responses."
),
expected_impact={"cost_delta_pct": -10, "quality_delta": -0.005},
priority="high",
auto_applicable=True,
))
# High latency -> increase caching or reduce token limits
if not sla["p95_latency"]["compliant"]:
recommendations.append(OptimizationRecommendation(
action="increase_cache_ttl",
reason=(
f"P95 latency ({sla['p95_latency']['actual']:.0f}ms) exceeds "
f"target ({sla['p95_latency']['target']}ms). Increasing cache TTL "
"will improve repeat query latency."
),
expected_impact={"latency_p95_delta_ms": -200, "cache_hit_rate_delta": 0.05},
priority="high",
auto_applicable=True,
))
# Low quality -> raise threshold to escalate more to Sonnet
if not sla["quality"]["compliant"]:
recommendations.append(OptimizationRecommendation(
action="raise_confidence_threshold",
reason=(
f"Quality score ({sla['quality']['actual']:.4f}) below "
f"target ({sla['quality']['target']:.2f}). Raising confidence "
"threshold will escalate more queries to Sonnet for better quality."
),
expected_impact={"quality_delta": 0.02, "cost_delta_pct": 15},
priority="critical",
auto_applicable=True,
))
# If everything is compliant, suggest further cost optimization
all_compliant = all(v["compliant"] for v in sla.values())
if all_compliant:
recommendations.append(OptimizationRecommendation(
action="explore_further_cost_reduction",
reason="All SLAs met. Consider testing lower confidence threshold or smaller token budgets.",
expected_impact={"cost_delta_pct": -5},
priority="low",
auto_applicable=False,
))
return recommendations
def _project_costs(self, metrics: dict) -> dict:
"""Project costs for different time horizons."""
avg_cost = metrics["cost"]["mean_per_query"]
daily_queries = 1_000_000 # MangaAssist target
return {
"daily": round(avg_cost * daily_queries, 2),
"weekly": round(avg_cost * daily_queries * 7, 2),
"monthly": round(avg_cost * daily_queries * 30, 2),
"annual": round(avg_cost * daily_queries * 365, 2),
"per_query_avg": round(avg_cost, 6),
"queries_per_day": daily_queries,
}
    def _percentile(self, data: list, pct: int) -> float:
        """Nearest-rank percentile (sorts a copy of the data)."""
sorted_data = sorted(data)
index = int(len(sorted_data) * pct / 100)
index = min(index, len(sorted_data) - 1)
return sorted_data[index]
def _window_duration(self) -> int:
"""Get duration of the current observation window."""
ts = self.performance_window.timestamps
if len(ts) < 2:
return 0
return int(ts[-1] - ts[0])
def _empty_report(self) -> PerformanceReport:
"""Return an empty report when insufficient data."""
return PerformanceReport(
timestamp=time.time(),
period_seconds=0,
metrics={},
sla_compliance={},
recommendations=[],
cost_projection={},
)
def _check_alerts(self, sla_compliance: dict):
"""Trigger alerts for SLA violations."""
violations = [k for k, v in sla_compliance.items() if not v["compliant"]]
if violations:
for callback in self._alert_callbacks:
try:
callback(violations, sla_compliance)
except Exception as e:
logger.error("Alert callback failed: %s", e)
def register_alert_callback(self, callback):
"""Register a function to be called on SLA violations."""
        self._alert_callbacks.append(callback)
```
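
Wiring the optimizer to the router closes the loop. A sketch; in production the analyze/auto-tune step would run on a schedule (for example, a recurring task on the ECS orchestrator) rather than inline:

```python
# Usage sketch: record every routed query, then periodically analyze and tune.
router = ModelCascadeRouter()
optimizer = PerformanceOptimizer(cascade_router=router)
optimizer.register_alert_callback(
    lambda violations, sla: print(f"SLA violations: {violations}")
)

def handle_message(query: str) -> str:
    result = router.route_query(query)
    optimizer.record_observation(
        latency_ms=result.latency_ms,
        cost=result.estimated_cost,
        quality_score=result.confidence_score,  # proxy until human ratings exist
        was_escalated=result.was_escalated,
    )
    return result.response

# On a schedule (e.g. every 5 minutes):
report = optimizer.analyze_performance()
if report.metrics:                # empty dict means < 10 observations so far
    applied = optimizer.auto_tune(report)
```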
## Decision Trees for Model Selection

### Decision Tree 1: Initial Model Routing

```
[User Query Arrives]
|
[Is query in response cache?]
/ \
Yes No
| |
[Return cached] [Classify task category]
|
+--------------+--------------+
| | |
[Greeting/ [Recommend/ [Unknown]
FAQ/Search/ Compare/ |
Order/ Discussion/ [Default to
Summary] Translation] Haiku with
| | escalation]
| |
[Route to [Route to
Haiku] Sonnet]
| |
[Check [Invoke Sonnet
confidence] directly]
/ \ |
>=0.80 <0.80 [Return response]
| |
[Return] [Escalate
to Sonnet]
```

### Decision Tree 2: Cost Optimization Routing

```
[Query with cost constraint]
|
[Estimate token count]
/ \
< 300 >= 300
tokens tokens
| |
[Always use [Check complexity]
Haiku] / \
$0.0001 Simple Complex
| |
[Use Haiku [Check budget]
with RAG] / \
$0.0003 Budget Budget
OK Tight
| |
[Sonnet] [Haiku +
$0.0096 escalate
if fail]
$0.0003-
$0.0099
```

### Decision Tree 3: Latency-Optimized Routing

```
[Query with latency SLA]
|
[Target latency?]
/ | \
< 500ms 500-2000ms > 2000ms
| | |
[Haiku + [Haiku-first [Sonnet
cache cascade] acceptable]
only] | |
| [Haiku P50: [Full reasoning
[P50: 250ms + with context]
150ms escalation
cached, overhead:
250ms ~500ms]
    uncached]
```
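
Decision tree 3 reduces to a small lookup. A sketch; the latency figures are the planning estimates from the tree, not measured values:

```python
# Route by the caller's latency SLA, per decision tree 3.
def route_by_latency_sla(sla_ms: float) -> dict:
    if sla_ms < 500:
        # Cache-or-Haiku only; no time for an escalation round trip.
        return {"strategy": "haiku+cache", "p50_estimate_ms": 150}
    if sla_ms <= 2000:
        # Haiku-first cascade; escalation adds ~500ms when it happens.
        return {"strategy": "cascade", "p50_estimate_ms": 250}
    # Generous SLA: Sonnet with full context is acceptable.
    return {"strategy": "sonnet", "p50_estimate_ms": 800}

print(route_by_latency_sla(400))    # {'strategy': 'haiku+cache', 'p50_estimate_ms': 150}
print(route_by_latency_sla(3000))   # {'strategy': 'sonnet', 'p50_estimate_ms': 800}
```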
## Cost Analysis Deep Dive

### Scenario: MangaAssist 1M Messages/Day

```
+--------------------------------------------------+
| COST COMPARISON: DEPLOYMENT STRATEGIES |
+--------------------------------------------------+
Strategy 1: All Sonnet
Input: avg 500 tokens * 1M = 500M tokens * $3/1M = $1,500
Output: avg 350 tokens * 1M = 350M tokens * $15/1M = $5,250
Daily total: $6,750
Monthly: $202,500
Annual: $2,463,750
Strategy 2: All Haiku
Input: avg 500 tokens * 1M = 500M tokens * $0.25/1M = $125
Output: avg 350 tokens * 1M = 350M tokens * $1.25/1M = $437.50
Daily total: $562.50
Monthly: $16,875
Annual: $205,313
Quality impact: -15% on complex queries
Strategy 3: Cascade (60/40 Haiku/Sonnet)
Haiku: 600K msgs * (300*$0.25/1M + 150*$1.25/1M) = $157.50
Sonnet: 400K msgs * (700*$3/1M + 500*$15/1M) = $3,840.00
Daily total: $3,997.50
Monthly: $119,925
Annual: $1,459,088
Savings vs All-Sonnet: 40.8%
Strategy 4: Cascade Optimized (70/27/3 Haiku/Sonnet/Escalated)
Haiku:      700K * $0.0002625 = $183.75
Sonnet:     270K * $0.0096    = $2,592.00
Escalated:   30K * $0.00815   = $244.50
Daily total: $3,020.25
Monthly: $90,607.50
Annual: $1,102,391
Savings vs All-Sonnet: 55.3%
Strategy 5: Cascade + Caching (20% cache hit rate)
Effective queries: 800K (200K served from cache)
Haiku:      560K * $0.0002625 = $147.00
Sonnet:     216K * $0.0096    = $2,073.60
Escalated:   24K * $0.00815   = $195.60
Daily total: $2,416.20
Monthly: $72,486
Annual: $881,913
Savings vs All-Sonnet: 64.2%
+--------------------------------------------------+
| ANNUAL SAVINGS SUMMARY |
+--------------------------------------------------+
| Strategy | Annual Cost | vs All-Sonnet |
|------------------|-------------|-----------------|
| All Sonnet | $2,463,750 | baseline |
| All Haiku | $205,313 | -91.7% |
| Cascade 60/40 | $1,459,088 | -40.8% |
| Cascade Optimized| $1,102,391 | -55.3% |
| Cascade + Cache | $881,913 | -64.2% |
+--------------------------------------------------+
Winner: Cascade + Cache delivers 64.2% cost savings while
maintaining 94%+ quality on all query types.
```
### Break-even Analysis: When Does Cascading Pay Off?

```
At what daily volume does cascade ROI exceed implementation cost?
Implementation cost (one-time):
- Development: ~$15,000 (80 hours * $187.50/hr)
- Testing: ~$5,000 (validation across all task types)
- Monitoring: ~$3,000 (dashboards, alerts)
Total: ~$23,000
Daily savings (Cascade Optimized vs All-Sonnet):
$6,750 - $3,020.25 = $3,729.75/day
Break-even: $23,000 / $3,729.75 = 6.2 days
At 1M messages/day, cascading pays for itself in under 1 week.
Even at 100K messages/day, break-even is ~62 days.
```
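
The same arithmetic, parameterized by volume (a sketch using the implementation-cost and savings estimates above):

```python
# Break-even in days as a function of daily message volume.
IMPLEMENTATION_COST = 23_000                      # one-time, USD (estimate above)
SAVINGS_PER_MESSAGE = (6_750 - 3_020.25) / 1e6    # $0.00372975 per message

def break_even_days(daily_messages: int) -> float:
    return IMPLEMENTATION_COST / (SAVINGS_PER_MESSAGE * daily_messages)

for volume in (1_000_000, 100_000, 10_000):
    print(f"{volume:>9,} msgs/day -> break-even in {break_even_days(volume):5.1f} days")
# 1,000,000 -> 6.2 | 100,000 -> 61.7 | 10,000 -> 616.7
```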
Key Takeaways
+------------------------------------------------------------------+
| # | Takeaway |
+---+---------------------------------------------------------------+
| 1 | Model cascading (Haiku-first, Sonnet-if-needed) saves 55%+ |
| | compared to all-Sonnet while maintaining 94%+ quality. |
+---+---------------------------------------------------------------+
| 2 | Task classification is the linchpin -- accurate routing |
| | determines whether cascading delivers cost savings or adds |
| | unnecessary latency through escalations. |
+---+---------------------------------------------------------------+
| 3 | Confidence-based escalation (threshold ~0.80) provides the |
| | optimal balance between cost savings and quality assurance. |
+---+---------------------------------------------------------------+
| 4 | Response caching at a 20% hit rate adds another 15-20% cost |
| | reduction on top of cascading savings. |
+---+---------------------------------------------------------------+
| 5 | Small models (Haiku) handle 60-70% of production traffic |
| | adequately for structured tasks: FAQ, search, classification. |
+---+---------------------------------------------------------------+
| 6 | Multi-criteria model selection (cost, quality, latency, fit) |
| | should be weighted based on the business optimization goal. |
+---+---------------------------------------------------------------+
| 7 | Auto-tuning of confidence thresholds and token budgets via |
| | PerformanceOptimizer keeps the system at optimal operating |
| | point as traffic patterns evolve. |
+---+---------------------------------------------------------------+
| 8 | At MangaAssist scale (1M msgs/day), even small per-query |
| | savings compound to $1M+ annual savings. |
+---+---------------------------------------------------------------+
| 9 | The Cascade + Cache strategy achieves 64.2% cost reduction |
| | vs All-Sonnet -- the recommended production deployment. |
+---+---------------------------------------------------------------+
|10 | Break-even for cascading infrastructure is < 1 week at scale, |
| | making it one of the highest-ROI optimizations available. |
+---+---------------------------------------------------------------+
Next: `02-model-cascading-right-sizing.md`, a deep dive into cascading implementation, confidence-based escalation, and task-specific model matching.