
Optimized FM Deployment and Model Cascading Architecture

MangaAssist context: JP Manga store chatbot on AWS — Bedrock Claude 3 (Sonnet at $3/$15 per 1M tokens input/output, Haiku at $0.25/$1.25), OpenSearch Serverless (vector store), DynamoDB (sessions/products), ECS Fargate (orchestrator), API Gateway WebSocket, ElastiCache Redis. Target: useful answer in under 3 seconds, 1M messages/day scale.


Skill Mapping

Attribute      Value
Certification  AWS Certified AI Practitioner (AIF-C01)
Domain         2 — Fundamentals of Generative AI
Task           2.2 — Understand the basics of using foundation models
Skill          2.2.3 — Develop optimized FM deployment approaches to balance performance and resource requirements (selecting appropriate models, smaller pre-trained models for specific tasks, API-based model cascading for routine queries)

Table of Contents

  1. Optimization Strategy Mindmap
  2. Model Cascading Architecture
  3. Small Model Selection for Specific Tasks
  4. Performance vs Cost Trade-off Frameworks
  5. ModelCascadeRouter Implementation
  6. ModelSelectorFramework Implementation
  7. PerformanceOptimizer Implementation
  8. Decision Trees for Model Selection
  9. Cost Analysis Deep Dive
  10. Key Takeaways

Optimization Strategy Mindmap

                        Optimized FM Deployment
                                 |
        +-----------+-----------+-----------+-----------+
        |           |           |           |           |
     Model        Model       Model      Right-      API-Based
   Selection  Distillation  Cascading    Sizing    Orchestration
        |           |           |           |           |
   Task match   Knowledge   Haiku       Batch vs    Gateway
   Perf tier    transfer    first       streaming   routing
        |       Fine-tune      |        On-demand   Load
        |                      v        allocation  balancing
        |                 Sonnet if        |
        |                 needed        Scale infra
        v                      |
  Claude 3 Haiku for           v
  simple tasks;           Escalate when
  Claude 3 Sonnet for     confidence below
  complex tasks           threshold (0.80+)

Sub-strategies:

Model Selection               Model Distillation
  |-- Task complexity            |-- Teacher-student learning
  |-- Latency requirements       |-- Knowledge compression
  |-- Cost constraints           |-- Task-specific fine-tuning
  |-- Quality thresholds         |-- Bedrock custom models
  '-- Token budget               '-- Embedding model optimization

Model Cascading                Right-Sizing
  |-- Confidence scoring         |-- Instance type selection
  |-- Escalation rules           |-- Batch vs real-time
  |-- Fallback chains            |-- Concurrency limits
  |-- Response validation        |-- Memory allocation
  '-- Cost-aware routing         '-- Provisioned throughput

API-Based Orchestration
  |-- Gateway routing rules
  |-- Model endpoint management
  |-- Rate limiting per model
  |-- Circuit breaker patterns
  '-- A/B testing infrastructure

Model Cascading Architecture

The Core Principle: Haiku First, Sonnet If Needed

Model cascading is a cost-optimization pattern that sends each query to the cheapest model likely to succeed and escalates to a more capable (and more expensive) model only when the cheap attempt falls short. For MangaAssist, this means:

User Query
    |
    v
+-------------------+
| Intent Classifier |  (Haiku - $0.25/1M input)
| + Complexity      |
| Estimator         |
+--------+----------+
         |
    +----+----+
    |         |
  Simple    Complex
  (70-80%)  (20-30%)
    |         |
    v         v
+--------+ +--------+
| Haiku  | | Sonnet |
| $0.25  | | $3.00  |
| /1M in | | /1M in |
+---+----+ +---+----+
    |          |
    v          v
+--------+ +--------+
| Conf   | | Direct |
| Check  | | Return |
| >=0.80 | +--------+
+---+----+
    |
  +-+---+
  |     |
 Pass  Fail
  |     |
  v     v
Return Escalate
       to Sonnet
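
A condensed sketch of the flow above, with the classifier, model client, and confidence scorer passed in as callables (the full ModelCascadeRouter later in this section adds caching, metrics, and error handling):

def cascade(query, classify, invoke, confidence, threshold=0.80):
    """Haiku-first cascade: cheap draft first, escalate on low confidence."""
    if classify(query) == "complex":
        return invoke("sonnet", query)   # complex tasks go straight to Sonnet
    draft = invoke("haiku", query)       # cheap first pass (~70-80% of traffic)
    if confidence(draft) >= threshold:
        return draft                     # confident Haiku answer: return it
    return invoke("sonnet", query)       # escalate the remainder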

Traffic Distribution Analysis

Query Category               Percentage   Model    Avg Tokens (In/Out)   Cost per Query
Simple FAQ / Greetings       35%          Haiku    200 / 100             $0.000175
Product Lookup               25%          Haiku    400 / 200             $0.000350
Manga Recommendations        20%          Sonnet   600 / 400             $0.007800
Complex Comparison           10%          Sonnet   800 / 600             $0.011400
Nuanced Discussion           7%           Sonnet   1000 / 800            $0.015000
Haiku Escalation to Sonnet   3%           Both     400+600 / 200+400     $0.008150

Daily Cost at 1M Messages

Without Cascading (All Sonnet):
  Average tokens: 500 in / 350 out per message
  Daily cost = 1M * (500 * $3/1M + 350 * $15/1M)
             = 1M * ($0.0015 + $0.00525)
             = $6,750/day
             = $202,500/month

With Cascading (Haiku-first):
  Haiku messages (60%): 600K * (300 * $0.25/1M + 150 * $1.25/1M)
                       = 600K * $0.0002625
                       = $157.50/day

  Sonnet messages (37%): 370K * (700 * $3/1M + 500 * $15/1M)
                        = 370K * $0.0096
                        = $3,552/day

  Escalated (3%): 30K * (400*$0.25/1M + 200*$1.25/1M + 600*$3/1M + 400*$15/1M)
                = 30K * $0.00815
                = $244.50/day

  Total: $3,954/day = $118,620/month
  Savings: 41.4% reduction ($83,880/month saved)
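
The same arithmetic as a small helper, using the token mixes assumed above (a sketch for sanity-checking the figures, not production code):

def daily_cost(msgs: int, tok_in: int, tok_out: int,
               price_in: float, price_out: float) -> float:
    """Daily USD cost for msgs messages at per-1M-token prices."""
    return msgs * (tok_in * price_in + tok_out * price_out) / 1_000_000

all_sonnet = daily_cost(1_000_000, 500, 350, 3.00, 15.00)   # $6,750.00
cascade = (
    daily_cost(600_000, 300, 150, 0.25, 1.25)               # Haiku:     $157.50
    + daily_cost(370_000, 700, 500, 3.00, 15.00)            # Sonnet:    $3,552.00
    + daily_cost(30_000, 400, 200, 0.25, 1.25)              # Escalated, Haiku leg:  $10.50
    + daily_cost(30_000, 600, 400, 3.00, 15.00)             # Escalated, Sonnet leg: $234.00
)
print(f"Savings: {(all_sonnet - cascade) / all_sonnet:.1%}")  # 41.4%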

Small Model Selection for Specific Tasks

Task-to-Model Mapping for MangaAssist

Task                              Recommended Model   Why                                     Latency    Quality
Intent Classification             Haiku               Structured output, limited reasoning    ~200ms     95%+
Entity Extraction (manga)         Haiku               Pattern matching, named entities        ~150ms     93%+
Simple FAQ Answers                Haiku               Retrieval-augmented, template-like      ~300ms     90%+
Product Search Query Build        Haiku               SQL/query generation, structured        ~250ms     92%+
Manga Recommendations             Sonnet              Nuanced preference understanding        ~800ms     96%+
Complex Comparisons               Sonnet              Multi-factor reasoning                  ~1000ms    95%+
Content Summarization (short)     Haiku               Extractive summarization                ~400ms     88%+
Content Summarization (detailed)  Sonnet              Abstractive, nuanced summaries          ~900ms     95%+
Sentiment Analysis                Haiku               Binary/ternary classification           ~100ms     94%+
Language Translation (JP)         Sonnet              Cultural nuance in JP manga context     ~700ms     97%+

Model Selection Decision Matrix

                    Low Complexity          High Complexity
                 +--------------------+--------------------+
                 |                    |                    |
  Low Latency   |  Haiku             |  Haiku + Cache     |
  Requirement   |  Direct response   |  Pre-computed +    |
  (< 500ms)     |  Simple prompts    |  fallback Sonnet   |
                 |                    |                    |
                 +--------------------+--------------------+
                 |                    |                    |
  High Latency  |  Haiku             |  Sonnet            |
  Tolerance     |  Cost-optimized    |  Quality-optimized |
  (< 3000ms)    |  Batch-friendly    |  Full reasoning    |
                 |                    |                    |
                 +--------------------+--------------------+
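
The matrix as a routing function (a minimal sketch; the string labels are illustrative and independent of the router classes below):

def select_by_matrix(complexity: str, latency_budget_ms: float) -> str:
    """Map the 2x2 matrix above to a deployment choice."""
    if latency_budget_ms < 500:                  # tight-SLA row
        if complexity == "low":
            return "haiku-direct"                # simple prompts, direct response
        return "cache-first + sonnet-fallback"   # pre-computed, Sonnet on miss
    if complexity == "low":                      # relaxed-SLA row
        return "haiku-batch"                     # cost-optimized, batch-friendly
    return "sonnet-full"                         # quality-optimized, full reasoning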

Performance vs Cost Trade-off Frameworks

Framework 1: Quality-Cost Frontier

Quality
Score
  |
1.0|                                    * Sonnet (all queries)
   |                              *
   |                        * Cascade (optimized)
0.9|                  *
   |            * Cascade (aggressive)
   |      *
0.8| * Haiku (all queries)
   |
0.7|
   +----+----+----+----+----+----+----> Cost/query
   $0  $2   $4   $6   $8  $10  $12
                                    (x10^-3)

Sweet Spot: Cascade (optimized) at ~$4x10^-3/query with 0.94 quality

Framework 2: Latency-Throughput Trade-off

Throughput
(queries/sec)
   |
500|  * Haiku-only
   |
400|     * Cascade (aggressive)
   |
300|        * Cascade (optimized)
   |
200|              * Cascade (conservative)
   |
100|                        * Sonnet-only
   |
   +----+-----+-----+-----+-----+-----> P99 Latency (ms)
   0   500  1000  1500  2000  2500

Target: 300+ queries/sec with P99 < 2500ms

Framework 3: Cost Efficiency Score

# Cost Efficiency Score (CES) = Quality * (1 / NormalizedCost) * LatencyFactor

def calculate_ces(quality_score, cost_per_query, latency_ms, target_latency_ms=3000):
    """
    Calculate Cost Efficiency Score for a deployment strategy.

    Args:
        quality_score: 0.0-1.0 quality rating
        cost_per_query: Dollar cost per query
        latency_ms: Average response latency in milliseconds
        target_latency_ms: Target latency SLA

    Returns:
        CES score (higher is better)
    """
    # Normalize cost (lower is better, avoid division by zero)
    max_cost = 0.015  # Sonnet full-reasoning cost as baseline
    normalized_cost = cost_per_query / max_cost
    cost_factor = 1.0 / max(normalized_cost, 0.01)

    # Latency factor (1.0 if within SLA, degrades if over)
    latency_factor = min(1.0, target_latency_ms / max(latency_ms, 1))

    # Combined CES
    ces = quality_score * cost_factor * latency_factor
    return round(ces, 4)


# MangaAssist strategy comparison
strategies = {
    "All Sonnet":        {"quality": 0.97, "cost": 0.0096, "latency": 1200},
    "All Haiku":         {"quality": 0.82, "cost": 0.0003, "latency": 300},
    "Cascade Aggressive":{"quality": 0.88, "cost": 0.0025, "latency": 500},
    "Cascade Optimized": {"quality": 0.94, "cost": 0.0040, "latency": 700},
    "Cascade Conservative":{"quality": 0.96, "cost": 0.0065, "latency": 900},
}
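
# Score each strategy (produces the results listed below)
for name, s in strategies.items():
    print(f"{name}: CES = {calculate_ces(s['quality'], s['cost'], s['latency']):.4f}")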

# Results:
# All Sonnet:           CES = 1.5156
# All Haiku:            CES = 41.0000  (highest raw CES, but 0.82 quality misses the 0.90 SLA)
# Cascade Aggressive:   CES = 5.2800   (0.88 quality also misses the SLA)
# Cascade Optimized:    CES = 3.5250   <-- Best balance: highest CES among SLA-compliant strategies
# Cascade Conservative: CES = 2.2154

ModelCascadeRouter Implementation

"""
ModelCascadeRouter: Intelligent routing of queries to appropriate FM models
based on complexity, task type, and confidence thresholds.

MangaAssist Production Implementation
"""

import json
import time
import hashlib
import logging
from enum import Enum
from typing import Optional
from dataclasses import dataclass, field

import boto3
from botocore.config import Config

logger = logging.getLogger(__name__)


class ModelTier(Enum):
    """Available model tiers in the cascade."""
    HAIKU = "anthropic.claude-3-haiku-20240307-v1:0"
    SONNET = "anthropic.claude-3-sonnet-20240229-v1:0"


class TaskCategory(Enum):
    """Predefined task categories for MangaAssist."""
    GREETING = "greeting"
    FAQ = "faq"
    PRODUCT_SEARCH = "product_search"
    RECOMMENDATION = "recommendation"
    COMPARISON = "comparison"
    DISCUSSION = "discussion"
    ORDER_STATUS = "order_status"
    SUMMARIZATION = "summarization"
    TRANSLATION = "translation"
    UNKNOWN = "unknown"


# Task-to-tier mapping: which model tier handles which task by default
TASK_MODEL_MAP = {
    TaskCategory.GREETING:       ModelTier.HAIKU,
    TaskCategory.FAQ:            ModelTier.HAIKU,
    TaskCategory.PRODUCT_SEARCH: ModelTier.HAIKU,
    TaskCategory.ORDER_STATUS:   ModelTier.HAIKU,
    TaskCategory.SUMMARIZATION:  ModelTier.HAIKU,
    TaskCategory.RECOMMENDATION: ModelTier.SONNET,
    TaskCategory.COMPARISON:     ModelTier.SONNET,
    TaskCategory.DISCUSSION:     ModelTier.SONNET,
    TaskCategory.TRANSLATION:    ModelTier.SONNET,
    TaskCategory.UNKNOWN:        ModelTier.HAIKU,  # Default to cheap, escalate if needed
}


@dataclass
class CascadeConfig:
    """Configuration for the cascade router."""
    confidence_threshold: float = 0.80
    max_escalations: int = 1
    haiku_max_tokens: int = 1024
    sonnet_max_tokens: int = 2048
    temperature_haiku: float = 0.3
    temperature_sonnet: float = 0.5
    enable_caching: bool = True
    cache_ttl_seconds: int = 300
    escalation_timeout_ms: int = 2500
    region: str = "us-east-1"


@dataclass
class CascadeResult:
    """Result of a cascade routing decision."""
    response: str
    model_used: ModelTier
    was_escalated: bool
    confidence_score: float
    latency_ms: float
    input_tokens: int
    output_tokens: int
    estimated_cost: float
    task_category: TaskCategory
    cache_hit: bool = False


@dataclass
class CascadeMetrics:
    """Tracks cascade performance metrics."""
    total_requests: int = 0
    haiku_requests: int = 0
    sonnet_requests: int = 0
    escalations: int = 0
    cache_hits: int = 0
    total_cost: float = 0.0
    avg_latency_ms: float = 0.0
    quality_scores: list = field(default_factory=list)


class ModelCascadeRouter:
    """
    Routes queries through a model cascade: Haiku first, Sonnet if needed.

    The router implements:
    1. Task classification to determine initial model selection
    2. Confidence-based escalation from Haiku to Sonnet
    3. Response caching to avoid redundant inference
    4. Cost tracking and optimization metrics
    5. Adaptive retries on model endpoint failures (via botocore retry config)

    Architecture:
        User Query --> TaskClassifier --> ModelSelector --> Inference
                                              |
                                         Haiku (cheap)
                                              |
                                        Confidence Check
                                           /       \\
                                       >= 0.80    < 0.80
                                         |           |
                                       Return    Escalate to Sonnet
    """

    # Pricing per 1M tokens (USD)
    PRICING = {
        ModelTier.HAIKU:  {"input": 0.25, "output": 1.25},
        ModelTier.SONNET: {"input": 3.00, "output": 15.00},
    }

    def __init__(self, config: Optional[CascadeConfig] = None):
        self.config = config or CascadeConfig()
        self.metrics = CascadeMetrics()
        self._response_cache = {}

        # Initialize Bedrock client with retry configuration
        boto_config = Config(
            region_name=self.config.region,
            retries={"max_attempts": 3, "mode": "adaptive"},
            read_timeout=10,
            connect_timeout=5,
        )
        self.bedrock = boto3.client("bedrock-runtime", config=boto_config)

        logger.info(
            "ModelCascadeRouter initialized | confidence_threshold=%.2f | region=%s",
            self.config.confidence_threshold,
            self.config.region,
        )

    # ------------------------------------------------------------------
    # Public API
    # ------------------------------------------------------------------

    def route_query(
        self,
        query: str,
        conversation_history: Optional[list] = None,
        task_hint: Optional[TaskCategory] = None,
        force_model: Optional[ModelTier] = None,
    ) -> CascadeResult:
        """
        Route a query through the model cascade.

        Args:
            query: The user's input query
            conversation_history: Previous messages for context
            task_hint: Optional pre-classified task category
            force_model: Force a specific model (bypasses cascade)

        Returns:
            CascadeResult with response and metadata
        """
        start_time = time.time()
        self.metrics.total_requests += 1

        # Step 1: Check cache
        if self.config.enable_caching:
            cache_key = self._build_cache_key(query, conversation_history)
            cached = self._check_cache(cache_key)
            if cached:
                self.metrics.cache_hits += 1
                cached.cache_hit = True
                return cached

        # Step 2: Classify task
        task_category = task_hint or self._classify_task(query)

        # Step 3: Determine initial model
        if force_model:
            initial_model = force_model
        else:
            initial_model = TASK_MODEL_MAP.get(task_category, ModelTier.HAIKU)

        # Step 4: If task maps directly to Sonnet, skip cascade
        if initial_model == ModelTier.SONNET:
            result = self._invoke_model(
                ModelTier.SONNET, query, conversation_history, task_category
            )
            self.metrics.sonnet_requests += 1
            result.latency_ms = (time.time() - start_time) * 1000
            self._update_metrics(result)
            if self.config.enable_caching:
                self._store_cache(cache_key, result)
            return result

        # Step 5: Try Haiku first
        haiku_result = self._invoke_model(
            ModelTier.HAIKU, query, conversation_history, task_category
        )
        self.metrics.haiku_requests += 1

        # Step 6: Check confidence and decide escalation
        if haiku_result.confidence_score >= self.config.confidence_threshold:
            haiku_result.latency_ms = (time.time() - start_time) * 1000
            self._update_metrics(haiku_result)
            if self.config.enable_caching:
                self._store_cache(cache_key, haiku_result)
            return haiku_result

        # Step 7: Escalate to Sonnet
        logger.info(
            "Escalating to Sonnet | confidence=%.2f < threshold=%.2f | task=%s",
            haiku_result.confidence_score,
            self.config.confidence_threshold,
            task_category.value,
        )
        sonnet_result = self._invoke_model(
            ModelTier.SONNET, query, conversation_history, task_category
        )
        sonnet_result.was_escalated = True
        sonnet_result.estimated_cost += haiku_result.estimated_cost  # Include Haiku cost
        sonnet_result.input_tokens += haiku_result.input_tokens
        sonnet_result.output_tokens += haiku_result.output_tokens
        self.metrics.sonnet_requests += 1
        self.metrics.escalations += 1

        sonnet_result.latency_ms = (time.time() - start_time) * 1000
        self._update_metrics(sonnet_result)
        if self.config.enable_caching:
            self._store_cache(cache_key, sonnet_result)
        return sonnet_result

    def get_metrics_summary(self) -> dict:
        """Return current cascade metrics."""
        total = max(self.metrics.total_requests, 1)
        return {
            "total_requests": self.metrics.total_requests,
            "haiku_percentage": round(self.metrics.haiku_requests / total * 100, 1),
            "sonnet_percentage": round(self.metrics.sonnet_requests / total * 100, 1),
            "escalation_rate": round(self.metrics.escalations / total * 100, 1),
            "cache_hit_rate": round(self.metrics.cache_hits / total * 100, 1),
            "total_cost": round(self.metrics.total_cost, 4),
            "avg_cost_per_query": round(self.metrics.total_cost / total, 6),
            "avg_latency_ms": round(self.metrics.avg_latency_ms, 1),
        }

    # ------------------------------------------------------------------
    # Internal Methods
    # ------------------------------------------------------------------

    def _classify_task(self, query: str) -> TaskCategory:
        """
        Classify the query into a task category using lightweight heuristics
        before invoking any model. Falls back to UNKNOWN for model-based routing.
        """
        query_lower = query.lower().strip()

        # Greeting patterns
        greeting_patterns = [
            "hello", "hi", "hey", "good morning", "good afternoon",
            "konnichiwa", "ohayo", "こんにちは", "おはよう",
        ]
        if any(query_lower.startswith(p) for p in greeting_patterns):
            return TaskCategory.GREETING

        # Order status patterns
        order_patterns = ["order", "shipping", "delivery", "tracking", "注文", "配送"]
        if any(p in query_lower for p in order_patterns):
            return TaskCategory.ORDER_STATUS

        # Product search patterns
        search_patterns = [
            "find", "search", "looking for", "where can i",
            "do you have", "探す", "検索",
        ]
        if any(p in query_lower for p in search_patterns):
            return TaskCategory.PRODUCT_SEARCH

        # Recommendation patterns
        recommend_patterns = [
            "recommend", "suggest", "similar to", "like",
            "what should i read", "おすすめ",
        ]
        if any(p in query_lower for p in recommend_patterns):
            return TaskCategory.RECOMMENDATION

        # Comparison patterns
        compare_patterns = ["compare", "versus", "vs", "difference between", "比較"]
        if any(p in query_lower for p in compare_patterns):
            return TaskCategory.COMPARISON

        # FAQ patterns
        faq_patterns = [
            "how do i", "what is", "can i", "do you", "is there",
            "how much", "when", "返品", "支払い",
        ]
        if any(p in query_lower for p in faq_patterns):
            return TaskCategory.FAQ

        return TaskCategory.UNKNOWN

    def _invoke_model(
        self,
        model_tier: ModelTier,
        query: str,
        conversation_history: Optional[list],
        task_category: TaskCategory,
    ) -> CascadeResult:
        """
        Invoke a Bedrock model and return structured result.
        """
        max_tokens = (
            self.config.haiku_max_tokens
            if model_tier == ModelTier.HAIKU
            else self.config.sonnet_max_tokens
        )
        temperature = (
            self.config.temperature_haiku
            if model_tier == ModelTier.HAIKU
            else self.config.temperature_sonnet
        )

        # Build messages
        messages = []
        if conversation_history:
            messages.extend(conversation_history[-6:])  # Last 3 turns
        messages.append({"role": "user", "content": query})

        system_prompt = self._build_system_prompt(task_category)

        body = json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "temperature": temperature,
            "system": system_prompt,
            "messages": messages,
        })

        try:
            response = self.bedrock.invoke_model(
                modelId=model_tier.value,
                contentType="application/json",
                accept="application/json",
                body=body,
            )
            response_body = json.loads(response["body"].read())

            content = response_body["content"][0]["text"]
            input_tokens = response_body["usage"]["input_tokens"]
            output_tokens = response_body["usage"]["output_tokens"]
            stop_reason = response_body.get("stop_reason", "end_turn")

            # Calculate confidence from stop reason and response characteristics
            confidence = self._estimate_confidence(content, stop_reason, task_category)

            # Calculate cost
            pricing = self.PRICING[model_tier]
            cost = (
                input_tokens * pricing["input"] / 1_000_000
                + output_tokens * pricing["output"] / 1_000_000
            )

            return CascadeResult(
                response=content,
                model_used=model_tier,
                was_escalated=False,
                confidence_score=confidence,
                latency_ms=0.0,  # Set by caller
                input_tokens=input_tokens,
                output_tokens=output_tokens,
                estimated_cost=cost,
                task_category=task_category,
            )

        except Exception as e:
            logger.error("Model invocation failed | model=%s | error=%s", model_tier.value, str(e))
            raise

    def _estimate_confidence(
        self, response: str, stop_reason: str, task_category: TaskCategory
    ) -> float:
        """
        Estimate confidence in the response quality.

        Factors considered:
        - Stop reason (end_turn vs max_tokens)
        - Response length relative to task expectations
        - Presence of hedging language
        - Task category typical accuracy
        """
        confidence = 0.85  # Base confidence

        # Penalize truncated responses
        if stop_reason == "max_tokens":
            confidence -= 0.15

        # Check for hedging / uncertainty markers
        hedging_phrases = [
            "i'm not sure", "i don't know", "it's unclear",
            "i cannot determine", "maybe", "perhaps",
            "わかりません", "不明",
        ]
        response_lower = response.lower()
        hedging_count = sum(1 for phrase in hedging_phrases if phrase in response_lower)
        confidence -= hedging_count * 0.05

        # Task-specific confidence adjustments
        task_adjustments = {
            TaskCategory.GREETING: 0.10,        # Greetings are almost always correct
            TaskCategory.FAQ: 0.05,             # FAQ with RAG is usually reliable
            TaskCategory.PRODUCT_SEARCH: 0.03,  # Structured queries are reliable
            TaskCategory.ORDER_STATUS: 0.05,    # Template-based responses
            TaskCategory.RECOMMENDATION: -0.05, # Subjective, harder to be confident
            TaskCategory.COMPARISON: -0.08,     # Requires deep reasoning
            TaskCategory.DISCUSSION: -0.10,     # Most subjective
            TaskCategory.UNKNOWN: -0.05,        # Unknown tasks get penalized
        }
        confidence += task_adjustments.get(task_category, 0.0)

        # Clamp to [0.0, 1.0]
        return max(0.0, min(1.0, confidence))

    def _build_system_prompt(self, task_category: TaskCategory) -> str:
        """Build task-optimized system prompt."""
        base = (
            "You are MangaAssist, a helpful JP manga store chatbot. "
            "Provide accurate, concise answers about manga products, "
            "recommendations, and store services. Respond in the same "
            "language as the user's query."
        )

        task_additions = {
            TaskCategory.GREETING: " Keep greetings brief and friendly.",
            TaskCategory.FAQ: " Answer the question directly. Cite store policies when relevant.",
            TaskCategory.PRODUCT_SEARCH: " Help the user find specific manga. Ask clarifying questions if needed.",
            TaskCategory.RECOMMENDATION: (
                " Provide thoughtful manga recommendations based on user preferences. "
                "Consider genre, art style, themes, and similar series."
            ),
            TaskCategory.COMPARISON: (
                " Compare manga series objectively across multiple dimensions: "
                "story, art, pacing, character development, and audience appeal."
            ),
            TaskCategory.ORDER_STATUS: " Provide clear order status information. Be precise about dates and statuses.",
        }

        return base + task_additions.get(task_category, "")

    def _build_cache_key(self, query: str, history: Optional[list]) -> str:
        """Create a deterministic cache key."""
        raw = query
        if history:
            raw += json.dumps(history[-4:], sort_keys=True)
        return hashlib.sha256(raw.encode()).hexdigest()

    def _check_cache(self, key: str) -> Optional[CascadeResult]:
        """Check if a response is cached and still valid."""
        if key in self._response_cache:
            entry = self._response_cache[key]
            if time.time() - entry["timestamp"] < self.config.cache_ttl_seconds:
                return entry["result"]
            else:
                del self._response_cache[key]
        return None

    def _store_cache(self, key: str, result: CascadeResult):
        """Store a response in the cache."""
        self._response_cache[key] = {
            "result": result,
            "timestamp": time.time(),
        }

    def _update_metrics(self, result: CascadeResult):
        """Update running metrics. Cache hits return before reaching this
        method, so the running-average denominator excludes them."""
        self.metrics.total_cost += result.estimated_cost
        n = max(self.metrics.total_requests - self.metrics.cache_hits, 1)
        self.metrics.avg_latency_ms = (
            (self.metrics.avg_latency_ms * (n - 1) + result.latency_ms) / n
        )
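
A minimal usage sketch (assumes AWS credentials with Bedrock model access in us-east-1; the queries and printed fields are illustrative):

router = ModelCascadeRouter(CascadeConfig(confidence_threshold=0.80))

result = router.route_query("Do you have Berserk Deluxe Edition vol. 1 in stock?")
print(result.model_used.name, result.was_escalated, f"${result.estimated_cost:.6f}")

# Bypass the cascade for an A/B comparison against Sonnet
forced = router.route_query(
    "Compare Berserk and Vinland Saga for a new reader",
    force_model=ModelTier.SONNET,
)
print(forced.model_used.name)

print(router.get_metrics_summary())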

ModelSelectorFramework Implementation

"""
ModelSelectorFramework: Framework for selecting the optimal model
based on task requirements, constraints, and historical performance.

Implements multi-criteria decision making for model selection.
"""

import logging
from enum import Enum
from typing import Optional
from dataclasses import dataclass

logger = logging.getLogger(__name__)


class OptimizationGoal(Enum):
    """Primary optimization objectives."""
    COST = "cost"
    QUALITY = "quality"
    LATENCY = "latency"
    BALANCED = "balanced"


class QueryComplexity(Enum):
    """Complexity classification for queries."""
    TRIVIAL = 1     # Greetings, acknowledgments
    SIMPLE = 2      # Direct FAQ, lookups
    MODERATE = 3    # Search with filters, basic recommendations
    COMPLEX = 4     # Multi-criteria comparisons, detailed analysis
    EXPERT = 5      # Nuanced discussion, cultural context, creative tasks


@dataclass
class ModelProfile:
    """Profile describing a model's characteristics."""
    model_id: str
    tier_name: str
    cost_input_per_1m: float
    cost_output_per_1m: float
    avg_latency_ms: float
    quality_score: float         # 0.0-1.0 aggregate quality
    max_context_tokens: int
    strengths: list              # Task categories where it excels
    weaknesses: list             # Task categories where it struggles
    throughput_qps: float        # Estimated queries per second


@dataclass
class SelectionCriteria:
    """Criteria for model selection decision."""
    query_complexity: QueryComplexity
    max_latency_ms: float = 3000.0
    max_cost_per_query: float = 0.02
    min_quality_score: float = 0.85
    optimization_goal: OptimizationGoal = OptimizationGoal.BALANCED
    required_tokens: int = 500
    task_category: str = "general"


@dataclass
class SelectionResult:
    """Result of a model selection decision."""
    selected_model: ModelProfile
    selection_reason: str
    score: float
    alternatives: list
    trade_offs: dict


class ModelSelectorFramework:
    """
    Multi-criteria model selection framework.

    Uses weighted scoring to evaluate models against task requirements:

    Score = w_cost * CostScore + w_quality * QualityScore
            + w_latency * LatencyScore + w_fit * TaskFitScore

    Weights are adjusted based on the OptimizationGoal.
    """

    # Weight profiles for different optimization goals
    WEIGHT_PROFILES = {
        OptimizationGoal.COST:     {"cost": 0.50, "quality": 0.20, "latency": 0.15, "fit": 0.15},
        OptimizationGoal.QUALITY:  {"cost": 0.10, "quality": 0.50, "latency": 0.15, "fit": 0.25},
        OptimizationGoal.LATENCY:  {"cost": 0.15, "quality": 0.15, "latency": 0.50, "fit": 0.20},
        OptimizationGoal.BALANCED: {"cost": 0.25, "quality": 0.30, "latency": 0.20, "fit": 0.25},
    }

    def __init__(self):
        self.model_registry = {}
        self._register_default_models()

    def _register_default_models(self):
        """Register MangaAssist model profiles."""
        self.register_model(ModelProfile(
            model_id="anthropic.claude-3-haiku-20240307-v1:0",
            tier_name="Haiku",
            cost_input_per_1m=0.25,
            cost_output_per_1m=1.25,
            avg_latency_ms=250,
            quality_score=0.82,
            max_context_tokens=200000,
            strengths=["classification", "extraction", "faq", "search", "greeting"],
            weaknesses=["nuanced_reasoning", "creative_writing", "complex_comparison"],
            throughput_qps=500,
        ))
        self.register_model(ModelProfile(
            model_id="anthropic.claude-3-sonnet-20240229-v1:0",
            tier_name="Sonnet",
            cost_input_per_1m=3.00,
            cost_output_per_1m=15.00,
            avg_latency_ms=800,
            quality_score=0.95,
            max_context_tokens=200000,
            strengths=[
                "nuanced_reasoning", "creative_writing", "complex_comparison",
                "recommendation", "translation", "cultural_context",
            ],
            weaknesses=[],
            throughput_qps=150,
        ))

    def register_model(self, profile: ModelProfile):
        """Register a model profile in the framework."""
        self.model_registry[profile.model_id] = profile
        logger.info("Registered model: %s (%s)", profile.tier_name, profile.model_id)

    def select_model(self, criteria: SelectionCriteria) -> SelectionResult:
        """
        Select the optimal model based on provided criteria.

        Algorithm:
        1. Filter models that meet hard constraints (latency, cost, tokens)
        2. Score remaining models using weighted multi-criteria evaluation
        3. Return best model with alternatives and trade-off analysis
        """
        candidates = list(self.model_registry.values())

        # Step 1: Hard constraint filtering
        eligible = []
        for model in candidates:
            # Estimate cost for this query
            est_cost = self._estimate_query_cost(model, criteria.required_tokens)

            if model.avg_latency_ms <= criteria.max_latency_ms and est_cost <= criteria.max_cost_per_query:
                eligible.append(model)

        if not eligible:
            # If no model meets all constraints, relax and pick best available
            logger.warning("No model meets all hard constraints, relaxing requirements")
            eligible = candidates

        # Step 2: Score eligible models
        weights = self.WEIGHT_PROFILES[criteria.optimization_goal]
        scored = []
        for model in eligible:
            score = self._score_model(model, criteria, weights)
            scored.append((model, score))

        scored.sort(key=lambda x: x[1], reverse=True)

        # Step 3: Build result
        best_model, best_score = scored[0]
        alternatives = [
            {"model": m.tier_name, "score": round(s, 4)}
            for m, s in scored[1:]
        ]

        trade_offs = self._analyze_trade_offs(best_model, scored, criteria)
        reason = self._generate_selection_reason(best_model, criteria, best_score)

        return SelectionResult(
            selected_model=best_model,
            selection_reason=reason,
            score=round(best_score, 4),
            alternatives=alternatives,
            trade_offs=trade_offs,
        )

    def _score_model(self, model: ModelProfile, criteria: SelectionCriteria, weights: dict) -> float:
        """Calculate weighted score for a model."""
        # Cost score (lower cost = higher score)
        est_cost = self._estimate_query_cost(model, criteria.required_tokens)
        max_cost = criteria.max_cost_per_query
        cost_score = max(0, 1.0 - (est_cost / max_cost)) if max_cost > 0 else 0.5

        # Quality score (direct from profile)
        quality_score = model.quality_score

        # Latency score (lower latency = higher score)
        latency_score = max(0, 1.0 - (model.avg_latency_ms / criteria.max_latency_ms))

        # Task fit score (does the model excel at this task?)
        task_fit_score = self._calculate_task_fit(model, criteria)

        # Weighted combination
        total = (
            weights["cost"] * cost_score
            + weights["quality"] * quality_score
            + weights["latency"] * latency_score
            + weights["fit"] * task_fit_score
        )
        return total

    def _calculate_task_fit(self, model: ModelProfile, criteria: SelectionCriteria) -> float:
        """Calculate how well a model fits the specific task."""
        task = criteria.task_category.lower()

        if task in model.strengths:
            return 1.0
        elif task in model.weaknesses:
            return 0.3
        else:
            return 0.6  # Neutral

    def _estimate_query_cost(self, model: ModelProfile, token_count: int) -> float:
        """Estimate cost for a single query."""
        input_tokens = token_count
        output_tokens = int(token_count * 0.7)  # Assume 70% output ratio
        cost = (
            input_tokens * model.cost_input_per_1m / 1_000_000
            + output_tokens * model.cost_output_per_1m / 1_000_000
        )
        return cost

    def _analyze_trade_offs(self, selected, scored, criteria) -> dict:
        """Analyze what we gain/lose with the selection."""
        trade_offs = {}
        for model, score in scored:
            if model.model_id != selected.model_id:
                est_cost_selected = self._estimate_query_cost(selected, criteria.required_tokens)
                est_cost_alt = self._estimate_query_cost(model, criteria.required_tokens)
                trade_offs[model.tier_name] = {
                    "cost_difference": round(est_cost_selected - est_cost_alt, 6),
                    "quality_difference": round(selected.quality_score - model.quality_score, 3),
                    "latency_difference_ms": round(selected.avg_latency_ms - model.avg_latency_ms, 1),
                }
        return trade_offs

    def _generate_selection_reason(self, model, criteria, score) -> str:
        """Generate human-readable selection reason."""
        goal_text = {
            OptimizationGoal.COST: "cost optimization",
            OptimizationGoal.QUALITY: "quality maximization",
            OptimizationGoal.LATENCY: "latency minimization",
            OptimizationGoal.BALANCED: "balanced performance",
        }
        return (
            f"Selected {model.tier_name} (score={score:.4f}) for "
            f"{goal_text[criteria.optimization_goal]} | "
            f"task={criteria.task_category} | "
            f"complexity={criteria.query_complexity.name}"
        )
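
A minimal usage sketch of the framework (criteria values are illustrative):

selector = ModelSelectorFramework()

criteria = SelectionCriteria(
    query_complexity=QueryComplexity.MODERATE,
    max_latency_ms=1000.0,
    max_cost_per_query=0.005,
    optimization_goal=OptimizationGoal.BALANCED,
    required_tokens=400,
    task_category="recommendation",
)

result = selector.select_model(criteria)
print(result.selection_reason)
print("Alternatives:", result.alternatives)
print("Trade-offs:", result.trade_offs)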

PerformanceOptimizer Implementation

"""
PerformanceOptimizer: Monitors and optimizes FM deployment performance
across latency, throughput, cost, and quality dimensions.

Provides real-time optimization recommendations and auto-tuning.
"""

import math
import time
import logging
import statistics
from collections import deque
from dataclasses import dataclass, field
from typing import Optional

logger = logging.getLogger(__name__)


@dataclass
class PerformanceWindow:
    """Sliding window of performance observations."""
    window_size: int = 1000
    latencies: deque = field(default_factory=deque)
    costs: deque = field(default_factory=deque)
    quality_scores: deque = field(default_factory=deque)
    escalation_flags: deque = field(default_factory=deque)
    timestamps: deque = field(default_factory=deque)

    def __post_init__(self):
        # Re-wrap each deque so its capacity actually tracks window_size
        for name in ("latencies", "costs", "quality_scores",
                     "escalation_flags", "timestamps"):
            setattr(self, name, deque(getattr(self, name), maxlen=self.window_size))


@dataclass
class OptimizationRecommendation:
    """A recommended optimization action."""
    action: str
    reason: str
    expected_impact: dict
    priority: str           # "critical", "high", "medium", "low"
    auto_applicable: bool   # Can be applied automatically


@dataclass
class PerformanceReport:
    """Comprehensive performance report."""
    timestamp: float
    period_seconds: int
    metrics: dict
    sla_compliance: dict
    recommendations: list
    cost_projection: dict


class PerformanceOptimizer:
    """
    Continuously monitors and optimizes FM deployment performance.

    Optimization Strategies:
    1. Confidence threshold tuning
    2. Cache hit rate optimization
    3. Prompt length optimization
    4. Batch processing for non-real-time tasks
    5. Model routing weight adjustment
    6. Circuit breaker tuning

    SLA Targets (MangaAssist):
    - P50 latency: < 1000ms
    - P95 latency: < 2500ms
    - P99 latency: < 3000ms
    - Quality score: > 0.90
    - Cost per query: < $0.006
    - Availability: > 99.9%
    """

    SLA_TARGETS = {
        "p50_latency_ms": 1000,
        "p95_latency_ms": 2500,
        "p99_latency_ms": 3000,
        "min_quality_score": 0.90,
        "max_cost_per_query": 0.006,
        "min_availability": 0.999,
        "max_escalation_rate": 0.15,
    }

    def __init__(self, cascade_router=None):
        self.cascade_router = cascade_router
        self.performance_window = PerformanceWindow()
        self._optimization_history = []
        self._alert_callbacks = []

        logger.info("PerformanceOptimizer initialized with SLA targets: %s", self.SLA_TARGETS)

    def record_observation(
        self,
        latency_ms: float,
        cost: float,
        quality_score: float,
        was_escalated: bool,
    ):
        """Record a single query observation for performance tracking."""
        now = time.time()
        self.performance_window.latencies.append(latency_ms)
        self.performance_window.costs.append(cost)
        self.performance_window.quality_scores.append(quality_score)
        self.performance_window.escalation_flags.append(was_escalated)
        self.performance_window.timestamps.append(now)

    def analyze_performance(self) -> PerformanceReport:
        """
        Analyze current performance window and generate report.
        """
        window = self.performance_window
        if len(window.latencies) < 10:
            logger.warning("Insufficient data for analysis (need >= 10 observations)")
            return self._empty_report()

        latencies = list(window.latencies)
        costs = list(window.costs)
        quality_scores = list(window.quality_scores)
        escalation_flags = list(window.escalation_flags)

        # Compute metrics
        metrics = {
            "latency": {
                "p50_ms": round(statistics.median(latencies), 1),
                "p95_ms": round(self._percentile(latencies, 95), 1),
                "p99_ms": round(self._percentile(latencies, 99), 1),
                "mean_ms": round(statistics.mean(latencies), 1),
                "std_ms": round(statistics.stdev(latencies) if len(latencies) > 1 else 0, 1),
            },
            "cost": {
                "mean_per_query": round(statistics.mean(costs), 6),
                "total_window": round(sum(costs), 4),
                "std_per_query": round(statistics.stdev(costs) if len(costs) > 1 else 0, 6),
            },
            "quality": {
                "mean_score": round(statistics.mean(quality_scores), 4),
                "min_score": round(min(quality_scores), 4),
                "below_threshold_pct": round(
                    sum(1 for q in quality_scores if q < self.SLA_TARGETS["min_quality_score"])
                    / len(quality_scores) * 100, 1
                ),
            },
            "escalation": {
                "rate": round(sum(escalation_flags) / len(escalation_flags), 4),
                "count": sum(escalation_flags),
                "total": len(escalation_flags),
            },
        }

        # Check SLA compliance
        sla_compliance = self._check_sla_compliance(metrics)

        # Generate recommendations
        recommendations = self._generate_recommendations(metrics, sla_compliance)

        # Cost projection
        cost_projection = self._project_costs(metrics)

        report = PerformanceReport(
            timestamp=time.time(),
            period_seconds=self._window_duration(),
            metrics=metrics,
            sla_compliance=sla_compliance,
            recommendations=recommendations,
            cost_projection=cost_projection,
        )

        # Trigger alerts if needed
        self._check_alerts(sla_compliance)

        return report

    def auto_tune(self, report: PerformanceReport) -> list:
        """
        Apply automatic optimizations based on performance report.
        Only applies recommendations marked as auto_applicable.

        Returns list of applied optimizations.
        """
        applied = []
        for rec in report.recommendations:
            if not rec.auto_applicable:
                continue

            if rec.action == "raise_confidence_threshold":
                if self.cascade_router:
                    old = self.cascade_router.config.confidence_threshold
                    new = min(0.95, old + 0.02)
                    self.cascade_router.config.confidence_threshold = new
                    applied.append(f"Raised confidence threshold: {old:.2f} -> {new:.2f}")

            elif rec.action == "lower_confidence_threshold":
                if self.cascade_router:
                    old = self.cascade_router.config.confidence_threshold
                    new = max(0.60, old - 0.02)
                    self.cascade_router.config.confidence_threshold = new
                    applied.append(f"Lowered confidence threshold: {old:.2f} -> {new:.2f}")

            elif rec.action == "increase_cache_ttl":
                if self.cascade_router:
                    old = self.cascade_router.config.cache_ttl_seconds
                    new = min(600, old + 60)
                    self.cascade_router.config.cache_ttl_seconds = new
                    applied.append(f"Increased cache TTL: {old}s -> {new}s")

            elif rec.action == "reduce_haiku_max_tokens":
                if self.cascade_router:
                    old = self.cascade_router.config.haiku_max_tokens
                    new = max(256, old - 128)
                    self.cascade_router.config.haiku_max_tokens = new
                    applied.append(f"Reduced Haiku max tokens: {old} -> {new}")

        if applied:
            self._optimization_history.append({
                "timestamp": time.time(),
                "actions": applied,
            })
            logger.info("Auto-tune applied %d optimizations: %s", len(applied), applied)

        return applied

    # ------------------------------------------------------------------
    # Internal Methods
    # ------------------------------------------------------------------

    def _check_sla_compliance(self, metrics: dict) -> dict:
        """Check each SLA target against current metrics."""
        return {
            "p50_latency": {
                "target": self.SLA_TARGETS["p50_latency_ms"],
                "actual": metrics["latency"]["p50_ms"],
                "compliant": metrics["latency"]["p50_ms"] <= self.SLA_TARGETS["p50_latency_ms"],
            },
            "p95_latency": {
                "target": self.SLA_TARGETS["p95_latency_ms"],
                "actual": metrics["latency"]["p95_ms"],
                "compliant": metrics["latency"]["p95_ms"] <= self.SLA_TARGETS["p95_latency_ms"],
            },
            "p99_latency": {
                "target": self.SLA_TARGETS["p99_latency_ms"],
                "actual": metrics["latency"]["p99_ms"],
                "compliant": metrics["latency"]["p99_ms"] <= self.SLA_TARGETS["p99_latency_ms"],
            },
            "quality": {
                "target": self.SLA_TARGETS["min_quality_score"],
                "actual": metrics["quality"]["mean_score"],
                "compliant": metrics["quality"]["mean_score"] >= self.SLA_TARGETS["min_quality_score"],
            },
            "cost": {
                "target": self.SLA_TARGETS["max_cost_per_query"],
                "actual": metrics["cost"]["mean_per_query"],
                "compliant": metrics["cost"]["mean_per_query"] <= self.SLA_TARGETS["max_cost_per_query"],
            },
            "escalation_rate": {
                "target": self.SLA_TARGETS["max_escalation_rate"],
                "actual": metrics["escalation"]["rate"],
                "compliant": metrics["escalation"]["rate"] <= self.SLA_TARGETS["max_escalation_rate"],
            },
        }

    def _generate_recommendations(self, metrics: dict, sla: dict) -> list:
        """Generate optimization recommendations based on current state."""
        recommendations = []

        # High escalation rate -> raise confidence threshold or improve Haiku prompts
        if not sla["escalation_rate"]["compliant"]:
            recommendations.append(OptimizationRecommendation(
                action="lower_confidence_threshold",
                reason=(
                    f"Escalation rate ({sla['escalation_rate']['actual']:.1%}) exceeds "
                    f"target ({sla['escalation_rate']['target']:.1%}). Lowering threshold "
                    "will reduce escalations but may affect quality."
                ),
                expected_impact={"escalation_rate_delta": -0.03, "quality_delta": -0.01},
                priority="high",
                auto_applicable=True,
            ))

        # High cost -> optimize token usage or shift more to Haiku
        if not sla["cost"]["compliant"]:
            recommendations.append(OptimizationRecommendation(
                action="reduce_haiku_max_tokens",
                reason=(
                    f"Cost per query (${sla['cost']['actual']:.6f}) exceeds "
                    f"target (${sla['cost']['target']:.6f}). Reducing max tokens "
                    "will lower cost but may truncate responses."
                ),
                expected_impact={"cost_delta_pct": -10, "quality_delta": -0.005},
                priority="high",
                auto_applicable=True,
            ))

        # High latency -> increase caching or reduce token limits
        if not sla["p95_latency"]["compliant"]:
            recommendations.append(OptimizationRecommendation(
                action="increase_cache_ttl",
                reason=(
                    f"P95 latency ({sla['p95_latency']['actual']:.0f}ms) exceeds "
                    f"target ({sla['p95_latency']['target']}ms). Increasing cache TTL "
                    "will improve repeat query latency."
                ),
                expected_impact={"latency_p95_delta_ms": -200, "cache_hit_rate_delta": 0.05},
                priority="high",
                auto_applicable=True,
            ))

        # Low quality -> raise threshold to escalate more to Sonnet
        if not sla["quality"]["compliant"]:
            recommendations.append(OptimizationRecommendation(
                action="raise_confidence_threshold",
                reason=(
                    f"Quality score ({sla['quality']['actual']:.4f}) below "
                    f"target ({sla['quality']['target']:.2f}). Raising confidence "
                    "threshold will escalate more queries to Sonnet for better quality."
                ),
                expected_impact={"quality_delta": 0.02, "cost_delta_pct": 15},
                priority="critical",
                auto_applicable=True,
            ))

        # If everything is compliant, suggest further cost optimization
        all_compliant = all(v["compliant"] for v in sla.values())
        if all_compliant:
            recommendations.append(OptimizationRecommendation(
                action="explore_further_cost_reduction",
                reason="All SLAs met. Consider testing lower confidence threshold or smaller token budgets.",
                expected_impact={"cost_delta_pct": -5},
                priority="low",
                auto_applicable=False,
            ))

        return recommendations

    def _project_costs(self, metrics: dict) -> dict:
        """Project costs for different time horizons."""
        avg_cost = metrics["cost"]["mean_per_query"]
        daily_queries = 1_000_000  # MangaAssist target

        return {
            "daily": round(avg_cost * daily_queries, 2),
            "weekly": round(avg_cost * daily_queries * 7, 2),
            "monthly": round(avg_cost * daily_queries * 30, 2),
            "annual": round(avg_cost * daily_queries * 365, 2),
            "per_query_avg": round(avg_cost, 6),
            "queries_per_day": daily_queries,
        }

    def _percentile(self, data: list, pct: int) -> float:
        """Nearest-rank percentile: value at rank ceil(n * pct / 100) in sorted order."""
        sorted_data = sorted(data)
        index = max(0, math.ceil(len(sorted_data) * pct / 100) - 1)
        return sorted_data[index]

    def _window_duration(self) -> int:
        """Get duration of the current observation window."""
        ts = self.performance_window.timestamps
        if len(ts) < 2:
            return 0
        return int(ts[-1] - ts[0])

    def _empty_report(self) -> PerformanceReport:
        """Return an empty report when insufficient data."""
        return PerformanceReport(
            timestamp=time.time(),
            period_seconds=0,
            metrics={},
            sla_compliance={},
            recommendations=[],
            cost_projection={},
        )

    def _check_alerts(self, sla_compliance: dict):
        """Trigger alerts for SLA violations."""
        violations = [k for k, v in sla_compliance.items() if not v["compliant"]]
        if violations:
            for callback in self._alert_callbacks:
                try:
                    callback(violations, sla_compliance)
                except Exception as e:
                    logger.error("Alert callback failed: %s", e)

    def register_alert_callback(self, callback):
        """Register a function to be called on SLA violations."""
        self._alert_callbacks.append(callback)
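
A minimal usage sketch wiring the optimizer to the cascade router (the observation values are synthetic; in production they would come from CascadeResult fields):

router = ModelCascadeRouter()
optimizer = PerformanceOptimizer(cascade_router=router)
optimizer.register_alert_callback(
    lambda violations, sla: logger.warning("SLA violations: %s", violations)
)

# Feed observations (latency_ms, cost, quality, was_escalated); 30 samples
# clears the 10-observation minimum required by analyze_performance()
for latency, cost, quality, escalated in [
    (420.0, 0.0003, 0.93, False),
    (780.0, 0.0041, 0.95, False),
    (1650.0, 0.0082, 0.91, True),
] * 10:
    optimizer.record_observation(latency, cost, quality, escalated)

report = optimizer.analyze_performance()
applied = optimizer.auto_tune(report)
print(report.sla_compliance["p95_latency"], applied)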

Decision Trees for Model Selection

Decision Tree 1: Initial Model Routing

                         [User Query Arrives]
                                |
                    [Is query in response cache?]
                          /            \
                        Yes             No
                         |               |
                   [Return cached]  [Classify task category]
                                         |
                          +--------------+--------------+
                          |              |              |
                      [Greeting/     [Recommend/    [Unknown]
                       FAQ/Search/   Compare/        |
                       Order/        Discussion/    [Default to
                       Summary]      Translation]    Haiku with
                          |              |           escalation]
                          |              |
                    [Route to       [Route to
                     Haiku]         Sonnet]
                          |              |
                    [Check          [Invoke Sonnet
                     confidence]     directly]
                       /    \            |
                   >=0.80  <0.80    [Return response]
                     |        |
                [Return]  [Escalate
                          to Sonnet]

Decision Tree 2: Cost Optimization Routing

                    [Query with cost constraint]
                              |
                    [Estimate token count]
                         /         \
                     < 300        >= 300
                     tokens       tokens
                       |              |
                  [Always use     [Check complexity]
                   Haiku]            /        \
                   $0.0001       Simple     Complex
                                   |           |
                             [Use Haiku   [Check budget]
                              with RAG]      /      \
                              $0.0003    Budget    Budget
                                         OK        Tight
                                          |          |
                                     [Sonnet]   [Haiku +
                                     $0.0096    escalate
                                                 if fail]
                                                 $0.0003-
                                                 $0.0099

Decision Tree 3: Latency-Optimized Routing

                    [Query with latency SLA]
                              |
                    [Target latency?]
                    /         |         \
                < 500ms   500-2000ms   > 2000ms
                   |          |            |
              [Haiku +    [Haiku-first  [Sonnet
               cache       cascade]      acceptable]
               only]         |            |
                |        [Haiku P50:   [Full reasoning
              [P50:       250ms +       with context]
               150ms     escalation
               cached,   overhead:
               250ms     ~500ms]
               uncached]

Cost Analysis Deep Dive

Scenario: MangaAssist 1M Messages/Day

+--------------------------------------------------+
|        COST COMPARISON: DEPLOYMENT STRATEGIES     |
+--------------------------------------------------+

Strategy 1: All Sonnet
  Input:  avg 500 tokens * 1M = 500M tokens * $3/1M  = $1,500
  Output: avg 350 tokens * 1M = 350M tokens * $15/1M = $5,250
  Daily total: $6,750
  Monthly: $202,500
  Annual: $2,463,750

Strategy 2: All Haiku
  Input:  avg 500 tokens * 1M = 500M tokens * $0.25/1M = $125
  Output: avg 350 tokens * 1M = 350M tokens * $1.25/1M = $437.50
  Daily total: $562.50
  Monthly: $16,875
  Annual: $205,313
  Quality impact: -15% on complex queries

Strategy 3: Cascade (60/40 Haiku/Sonnet)
  Haiku:   600K msgs * (300*$0.25/1M + 150*$1.25/1M) = $157.50
  Sonnet:  400K msgs * (700*$3/1M + 500*$15/1M)      = $3,840.00
  Daily total: $3,997.50
  Monthly: $119,925
  Annual: $1,459,088
  Savings vs All-Sonnet: 40.8%

Strategy 4: Cascade Optimized (70/27/3 Haiku/Sonnet/Escalated)
  Haiku:     700K * $0.0002625 = $183.75
  Sonnet:    270K * $0.009600  = $2,592.00
  Escalated: 30K  * $0.008150  = $244.50
  Daily total: $3,020.25
  Monthly: $90,607.50
  Annual: $1,102,391
  Savings vs All-Sonnet: 55.3%

Strategy 5: Cascade + Caching (20% cache hit rate)
  Effective queries: 800K (200K served from cache)
  Haiku:     560K * $0.0002625 = $147.00
  Sonnet:    216K * $0.009600  = $2,073.60
  Escalated: 24K  * $0.008150  = $195.60
  Daily total: $2,416.20
  Monthly: $72,486
  Annual: $881,913
  Savings vs All-Sonnet: 64.2%

+--------------------------------------------------+
|                 ANNUAL SAVINGS SUMMARY            |
+--------------------------------------------------+
| Strategy          | Annual Cost | vs All-Sonnet |
|-------------------|-------------|---------------|
| All Sonnet        | $2,463,750  | baseline      |
| All Haiku         | $205,313    | -91.7%        |
| Cascade 60/40     | $1,459,088  | -40.8%        |
| Cascade Optimized | $1,102,391  | -55.3%        |
| Cascade + Cache   | $881,913    | -64.2%        |
+-------------------+-------------+---------------+

Winner: Cascade + Cache delivers 64.2% cost savings while
maintaining 94%+ quality on all query types.

Break-even Analysis: When Does Cascading Pay Off?

At what daily volume does cascade ROI exceed implementation cost?

Implementation cost (one-time):
  - Development: ~$15,000 (80 hours * $187.50/hr)
  - Testing: ~$5,000 (validation across all task types)
  - Monitoring: ~$3,000 (dashboards, alerts)
  Total: ~$23,000

Daily savings (Cascade Optimized vs All-Sonnet):
  $6,750 - $3,020.25 = $3,729.75/day

Break-even: $23,000 / $3,729.75 = 6.2 days

At 1M messages/day, cascading pays for itself in under 1 week.
Even at 100K messages/day, break-even is ~62 days.
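
The break-even arithmetic, parameterized by daily volume (per-message savings of $0.00372975 comes from the Cascade Optimized vs All-Sonnet daily figures above: $3,729.75 / 1M messages):

def break_even_days(impl_cost: float, msgs_per_day: int,
                    savings_per_msg: float = 0.00372975) -> float:
    """Days until cumulative savings cover the one-time implementation cost."""
    return impl_cost / (msgs_per_day * savings_per_msg)

print(round(break_even_days(23_000, 1_000_000), 1))  # 6.2 days
print(round(break_even_days(23_000, 100_000), 1))    # 61.7 days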

Key Takeaways

+---+---------------------------------------------------------------+
| # | Takeaway                                                      |
+---+---------------------------------------------------------------+
| 1 | Model cascading (Haiku-first, Sonnet-if-needed) saves 55%+   |
|   | compared to all-Sonnet while maintaining 94%+ quality.        |
+---+---------------------------------------------------------------+
| 2 | Task classification is the linchpin -- accurate routing       |
|   | determines whether cascading delivers cost savings or adds    |
|   | unnecessary latency through escalations.                      |
+---+---------------------------------------------------------------+
| 3 | Confidence-based escalation (threshold ~0.80) provides the    |
|   | optimal balance between cost savings and quality assurance.   |
+---+---------------------------------------------------------------+
| 4 | Response caching at a 20% hit rate adds another 15-20% cost  |
|   | reduction on top of cascading savings.                        |
+---+---------------------------------------------------------------+
| 5 | Small models (Haiku) handle 60-70% of production traffic      |
|   | adequately for structured tasks: FAQ, search, classification. |
+---+---------------------------------------------------------------+
| 6 | Multi-criteria model selection (cost, quality, latency, fit)  |
|   | should be weighted based on the business optimization goal.   |
+---+---------------------------------------------------------------+
| 7 | Auto-tuning of confidence thresholds and token budgets via    |
|   | PerformanceOptimizer keeps the system at optimal operating    |
|   | point as traffic patterns evolve.                             |
+---+---------------------------------------------------------------+
| 8 | At MangaAssist scale (1M msgs/day), even small per-query     |
|   | savings compound to $1M+ annual savings.                     |
+---+---------------------------------------------------------------+
| 9 | The Cascade + Cache strategy achieves 64.2% cost reduction    |
|   | vs All-Sonnet -- the recommended production deployment.      |
+---+---------------------------------------------------------------+
|10 | Break-even for cascading infrastructure is < 1 week at scale, |
|   | making it one of the highest-ROI optimizations available.    |
+---+---------------------------------------------------------------+

Next: 02-model-cascading-right-sizing.md -- Deep-dive into cascading implementation, confidence-based escalation, and task-specific model matching.