
Cost-Effective Model Selection Framework

MangaAssist Context: a customer chatbot for a Japanese manga store, running on AWS. Bedrock Claude 3 Sonnet ($3/$15 per 1M input/output tokens) handles complex queries; Claude 3 Haiku ($0.25/$1.25 per 1M input/output tokens) handles simple ones. Traffic is 1M messages/day across product search, order status, manga recommendations, and Q&A. Infrastructure: OpenSearch Serverless, DynamoDB, ECS Fargate, API Gateway WebSocket, ElastiCache Redis.


Skill Mapping

| AWS AIP-C01 Domain | Task | Skill | This File Covers |
|---|---|---|---|
| Domain 4: Operational Efficiency | Task 4.1: Cost Optimization | 4.1.2 Cost-Effective Model Selection | Cost-capability matrix, tiered FM routing, complexity classification, intent-to-model mapping, price-to-performance measurement |

1. Model Selection Dimensions Mind Map

```mermaid
mindmap
  root((Model Selection<br/>Framework))
    Cost-Capability Tradeoff
      Quality Dimensions
        Accuracy
        Reasoning Depth
        Multilingual / Japanese
        Creativity
        Speed / Latency
      Cost Dimensions
        Input Token Price
        Output Token Price
        Tokens per Request
        Requests per Day
    Tiered FM Usage
      Tier 0 — Template
        Zero LLM Cost
        Deterministic Responses
        Order Status / Shipping
      Tier 1 — Haiku
        Low Cost
        Simple Queries
        Product Search / FAQ
      Tier 2 — Sonnet
        High Quality
        Complex Reasoning
        Recommendations / Manga QA
    Complexity Classification
      Rule-Based Fast Path
        Keyword Matching
        Intent Detection
        Pattern Recognition
      ML Classifier
        Complexity Score 0-1
        Feature Extraction
        Threshold Tuning
    Intent-to-Model Routing
      product_search to Haiku
      order_status to Template
      recommendation to Sonnet
      manga_qa to Sonnet
      chitchat to Template
      shipping_info to Haiku
      escalation to Template
    Price-to-Performance
      Tokens per Dollar
      Quality per Dollar
      Satisfaction per Dollar
    Efficient Inference
      Prompt Reuse
      Response Recycling
      Batch Grouping
    Budget Controls
      Daily Budget Limits
      Dynamic Downgrade
      Cost-per-Quality-Point
```
2. Cost-Capability Tradeoff Matrix

Claude 3 Sonnet vs Haiku — Quality Dimension Comparison

| Quality Dimension | Claude 3 Sonnet | Claude 3 Haiku | Winner for MangaAssist | Notes |
|---|---|---|---|---|
| Accuracy | 9.2/10 | 7.8/10 | Sonnet | Critical for manga recommendations, plot summaries |
| Reasoning Depth | 9.5/10 | 6.5/10 | Sonnet | Multi-step reasoning needed for "manga like X but with Y" |
| Multilingual / Japanese | 9.0/10 | 7.5/10 | Sonnet | Japanese title parsing, honorifics, genre terminology |
| Creativity | 9.3/10 | 6.8/10 | Sonnet | Personalized recommendation narratives |
| Speed (Latency) | 6.5/10 | 9.5/10 | Haiku | 3x faster TTFT; ideal for simple lookups |
| Cost Efficiency | 3.0/10 | 9.5/10 | Haiku | Haiku is 12x cheaper on both input and output tokens |

Cost per Request Comparison (MangaAssist Averages)

| Metric | Sonnet | Haiku | Ratio (Sonnet : Haiku) |
|---|---|---|---|
| Avg input tokens | 800 | 500 | 1.6x |
| Avg output tokens | 350 | 150 | 2.3x |
| Input cost per request | $0.0024 | $0.000125 | 19.2x |
| Output cost per request | $0.00525 | $0.0001875 | 28.0x |
| Total cost per request | $0.00765 | $0.0003125 | 24.5x |
| Avg latency (TTFT) | 1.2s | 0.4s | 3.0x |
| Quality score (weighted) | 9.2 | 7.4 | 1.24x |

Key insight: Sonnet is 24.5x more expensive but only 1.24x better in quality. For simple queries, Haiku delivers 95%+ acceptable quality at 4% of the cost.
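The per-request figures above follow directly from the token averages and list prices. A minimal sketch of the arithmetic, using the numbers from the two tables in this section:

```python
def cost_per_request(input_tokens: int, output_tokens: int,
                     input_price_per_1m: float, output_price_per_1m: float) -> float:
    """Blend input and output token costs into a per-request dollar figure."""
    return ((input_tokens / 1_000_000) * input_price_per_1m
            + (output_tokens / 1_000_000) * output_price_per_1m)

# Averages from the table above
sonnet = cost_per_request(800, 350, 3.00, 15.00)   # $0.00765
haiku = cost_per_request(500, 150, 0.25, 1.25)     # $0.0003125

print(f"Sonnet: ${sonnet:.5f}  Haiku: ${haiku:.7f}  ratio: {sonnet / haiku:.1f}x")
```

This is the same formula the `ModelConfig.cost_per_request` property in section 10.1 computes.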


3. Tiered FM Usage — Three-Tier Routing System

```mermaid
flowchart TD
    QUERY[Incoming Customer Query] --> CLASSIFY[Complexity Classifier]

    CLASSIFY -->|Score 0.0 - 0.2| TIER0[Tier 0: Template Engine<br/>Cost: $0.00/request]
    CLASSIFY -->|Score 0.2 - 0.6| TIER1[Tier 1: Claude 3 Haiku<br/>Cost: ~$0.0003/request]
    CLASSIFY -->|Score 0.6 - 1.0| TIER2[Tier 2: Claude 3 Sonnet<br/>Cost: ~$0.0077/request]

    TIER0 --> T0_EXAMPLES["Examples:<br/>- Where is my order #12345?<br/>- What are your shipping rates?<br/>- I want to speak to a human<br/>- What are your store hours?"]

    TIER1 --> T1_EXAMPLES["Examples:<br/>- Find manga by Eiichiro Oda<br/>- Is One Piece volume 104 in stock?<br/>- What genres do you carry?<br/>- How do I return a book?"]

    TIER2 --> T2_EXAMPLES["Examples:<br/>- Recommend manga like Berserk but less dark<br/>- Explain the themes in Chainsaw Man<br/>- Compare Seinen vs Shounen for a new reader<br/>- I liked Vinland Saga, what next?"]

    TIER0 --> RESPONSE[Response to Customer]
    TIER1 --> RESPONSE
    TIER2 --> RESPONSE

    style TIER0 fill:#2ecc71,color:#fff
    style TIER1 fill:#3498db,color:#fff
    style TIER2 fill:#e74c3c,color:#fff
```

Tier Distribution (MangaAssist Production Target)

| Tier | Model | % of Traffic | Daily Requests | Cost/Request | Daily Cost | Monthly Cost |
|---|---|---|---|---|---|---|
| Tier 0 | Template | 25% | 250,000 | $0.0000 | $0.00 | $0.00 |
| Tier 1 | Haiku | 50% | 500,000 | $0.0003125 | $156.25 | $4,687.50 |
| Tier 2 | Sonnet | 25% | 250,000 | $0.00765 | $1,912.50 | $57,375.00 |
| Total | Mixed | 100% | 1,000,000 | $0.002069 | $2,068.75 | $62,062.50 |

Comparison: All-Sonnet vs Tiered Routing

| Approach | Monthly Cost | Quality (avg) | Cost Savings |
|---|---|---|---|
| All Sonnet | $229,500 | 9.2/10 | Baseline |
| All Haiku | $9,375 | 7.4/10 | 95.9% |
| Tiered (25/50/25) | $62,063 | 8.5/10 | 73.0% |

Tiered routing saves $167,437/month while maintaining an average quality score of 8.5/10 (only 0.7 points below all-Sonnet).
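The blended figures above can be reproduced from the tier mix and the per-request costs; a small sketch:

```python
DAILY_REQUESTS = 1_000_000
COST = {"template": 0.0, "haiku": 0.0003125, "sonnet": 0.00765}  # $ per request

def monthly_cost(mix: dict[str, float], days: int = 30) -> float:
    """Monthly spend for a traffic mix, e.g. {'template': 0.25, 'haiku': 0.50, ...}."""
    daily = sum(DAILY_REQUESTS * share * COST[tier] for tier, share in mix.items())
    return daily * days

tiered = monthly_cost({"template": 0.25, "haiku": 0.50, "sonnet": 0.25})
all_sonnet = monthly_cost({"sonnet": 1.0})
print(f"tiered: ${tiered:,.2f}  all-Sonnet: ${all_sonnet:,.2f}  "
      f"savings: {1 - tiered / all_sonnet:.1%}")
```

Changing the mix shares lets you model what happens if, say, the Tier 0 template hit rate improves.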


4. Complexity Classifier

The classifier combines a rule-based fast path for obvious cases with a lightweight ML scorer for ambiguous queries.

```mermaid
flowchart TD
    INPUT[Customer Message] --> RULE[Rule-Based Fast Path]

    RULE -->|Matched Template Pattern| SCORE_0["Score: 0.0<br/>Route: Template"]
    RULE -->|Matched Simple Pattern| SCORE_LOW["Score: 0.15<br/>Route: Haiku"]
    RULE -->|No Pattern Match| ML[ML Complexity Scorer]

    ML --> FEATURES[Feature Extraction]
    FEATURES --> F1["Token count"]
    FEATURES --> F2["Question depth<br/>(nested clauses)"]
    FEATURES --> F3["Named entities"]
    FEATURES --> F4["Japanese char ratio"]
    FEATURES --> F5["Comparison keywords"]
    FEATURES --> F6["Subjective language"]

    F1 & F2 & F3 & F4 & F5 & F6 --> MODEL[Lightweight Classifier<br/>Logistic Regression / Small NN]

    MODEL --> SCORE["Complexity Score<br/>0.0 - 1.0"]

    SCORE -->|0.0 - 0.2| TEMPLATE[Template Tier]
    SCORE -->|0.2 - 0.6| HAIKU[Haiku Tier]
    SCORE -->|0.6 - 1.0| SONNET[Sonnet Tier]

    style SCORE_0 fill:#2ecc71,color:#fff
    style SCORE_LOW fill:#2ecc71,color:#fff
    style TEMPLATE fill:#2ecc71,color:#fff
    style HAIKU fill:#3498db,color:#fff
    style SONNET fill:#e74c3c,color:#fff
```

5. Intent-to-Model Routing Map

| Intent | Routed Model | Avg Input Tokens | Avg Output Tokens | Cost/Request | Quality Score | Rationale |
|---|---|---|---|---|---|---|
| product_search | Haiku | 400 | 120 | $0.000250 | 8.1/10 | Structured lookup; Haiku handles keyword extraction well |
| order_status | Template | 0 (DynamoDB) | 0 | $0.000000 | 9.5/10 | Pure database lookup, deterministic response |
| recommendation | Sonnet | 1,200 | 500 | $0.011100 | 9.4/10 | Requires deep reasoning about taste, themes, similar titles |
| manga_qa | Sonnet | 900 | 400 | $0.008700 | 9.1/10 | Needs plot understanding, cultural context, Japanese nuance |
| chitchat | Template | 0 | 0 | $0.000000 | 7.5/10 | Canned friendly responses; no LLM needed |
| shipping_info | Haiku | 300 | 80 | $0.000175 | 8.8/10 | Simple policy lookup with slight personalization |
| escalation | Template | 0 | 0 | $0.000000 | 9.0/10 | Hand-off to human agent; deterministic workflow |

Daily Cost Breakdown by Intent (at 1M messages/day)

```mermaid
pie title Daily FM Cost by Intent
    "recommendation (Sonnet)" : 832
    "manga_qa (Sonnet)" : 652
    "product_search (Haiku)" : 75
    "shipping_info (Haiku)" : 17
    "order_status (Template)" : 0
    "chitchat (Template)" : 0
    "escalation (Template)" : 0
```

| Intent | % of Traffic | Daily Volume | Daily Cost | Monthly Cost |
|---|---|---|---|---|
| product_search | 30% | 300,000 | $75.00 | $2,250 |
| order_status | 20% | 200,000 | $0.00 | $0 |
| recommendation | 7.5% | 75,000 | $832.50 | $24,975 |
| manga_qa | 7.5% | 75,000 | $652.50 | $19,575 |
| chitchat | 15% | 150,000 | $0.00 | $0 |
| shipping_info | 10% | 100,000 | $17.50 | $525 |
| escalation | 10% | 100,000 | $0.00 | $0 |
| Total | 100% | 1,000,000 | $1,577.50 | $47,325 |

6. Inference Cost Balancing

Cost-per-Quality-Point Metric

This metric normalizes cost against quality output, enabling apples-to-apples comparison across models and intents.

```
Cost per Quality Point = (cost_per_request / quality_score) * 1000
```

| Model | Cost/Request | Quality Score | Cost per Quality Point |
|---|---|---|---|
| Sonnet (recommendation) | $0.01110 | 9.4 | $1.181 |
| Sonnet (manga_qa) | $0.00870 | 9.1 | $0.956 |
| Haiku (product_search) | $0.00025 | 8.1 | $0.031 |
| Haiku (shipping_info) | $0.000175 | 8.8 | $0.020 |
| Template (order_status) | $0.00000 | 9.5 | $0.000 |

Decision rule: if a cheaper model achieves more than 80% of the best model's quality score for an intent, route to the cheaper model. Escalate to Sonnet only when the quality gap is large enough to justify its higher cost per quality point.
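The decision rule can be sketched as a small helper. Quality scores and costs below come from the tables in this file; the 80% floor is the one stated in the rule:

```python
def pick_model(candidates: list[dict], quality_floor: float = 0.80) -> str:
    """Pick the cheapest model whose quality is within quality_floor of the best."""
    best_quality = max(c["quality"] for c in candidates)
    eligible = [c for c in candidates if c["quality"] >= quality_floor * best_quality]
    return min(eligible, key=lambda c: c["cost"])["name"]

# product_search: Haiku's 8.1 clears 80% of Sonnet's 8.9, so the cheap model wins
product_search = [
    {"name": "sonnet", "quality": 8.9, "cost": 0.00765},
    {"name": "haiku", "quality": 8.1, "cost": 0.00025},
]
print(pick_model(product_search))  # haiku

# recommendation: Haiku's 6.2 falls below 80% of Sonnet's 9.4, so escalate
recommendation = [
    {"name": "sonnet", "quality": 9.4, "cost": 0.0111},
    {"name": "haiku", "quality": 6.2, "cost": 0.00031},
]
print(pick_model(recommendation))  # sonnet
```

Note that the `CostCapabilityMatrix` evaluator in section 10.2 applies the same idea with a slightly stricter 85% floor.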

Dynamic Routing Under Budget Pressure

```mermaid
flowchart TD
    BUDGET[Check Daily Budget] --> PERCENT{Budget<br/>Consumed %}

    PERCENT -->|"< 60%"| NORMAL["Normal Routing<br/>Follow intent-to-model map"]
    PERCENT -->|"60-80%"| CAUTIOUS["Cautious Mode<br/>Downgrade low-priority Sonnet to Haiku"]
    PERCENT -->|"80-95%"| AGGRESSIVE["Aggressive Savings<br/>All queries to Haiku except recommendation"]
    PERCENT -->|"> 95%"| EMERGENCY["Emergency Mode<br/>Template + Haiku only<br/>Queue Sonnet requests"]

    NORMAL --> SERVE[Serve Response]
    CAUTIOUS --> SERVE
    AGGRESSIVE --> SERVE
    EMERGENCY --> SERVE

    style NORMAL fill:#2ecc71,color:#fff
    style CAUTIOUS fill:#f39c12,color:#fff
    style AGGRESSIVE fill:#e67e22,color:#fff
    style EMERGENCY fill:#e74c3c,color:#fff
```

7. Price-to-Performance Measurement

Three Core Metrics

7.1 Tokens per Dollar

```
Tokens per Dollar = 1,000,000 / price_per_1M_tokens
```

| Model | Direction | Price / 1M Tokens | Tokens per Dollar |
|---|---|---|---|
| Sonnet | Input | $3.00 | 333,333 |
| Sonnet | Output | $15.00 | 66,667 |
| Haiku | Input | $0.25 | 4,000,000 |
| Haiku | Output | $1.25 | 800,000 |

Haiku delivers 12x more input tokens and 12x more output tokens per dollar than Sonnet.

7.2 Quality per Dollar

```
Quality per Dollar = (quality_score / cost_per_request) * 0.001
```

| Use Case | Model | Quality | Cost | Quality per Dollar |
|---|---|---|---|---|
| Recommendation | Sonnet | 9.4 | $0.01110 | 0.847 |
| Recommendation | Haiku | 6.2 | $0.00031 | 20.000 |
| Product Search | Haiku | 8.1 | $0.00025 | 32.400 |
| Product Search | Sonnet | 8.9 | $0.00765 | 1.163 |

For product search, Haiku delivers 27.9x more quality per dollar than Sonnet, with only a 0.8-point quality gap.

7.3 Satisfaction per Dollar

Based on user satisfaction surveys (thumbs up/down) mapped to model routing:

| Model + Intent | Satisfaction Rate | Cost/Request | Satisfaction per Dollar |
|---|---|---|---|
| Sonnet + recommendation | 94% | $0.01110 | 84.7 |
| Haiku + product_search | 88% | $0.00025 | 3,520.0 |
| Haiku + shipping_info | 92% | $0.000175 | 5,257.1 |
| Template + order_status | 96% | $0.00000 | Infinite |
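All three price-to-performance metrics reduce to simple ratios; a sketch computing them for one routing pair, with figures taken from the tables above (satisfaction per dollar uses the satisfaction rate as a fraction, matching the table):

```python
def price_performance(quality: float, satisfaction: float,
                      cost: float, input_price_per_1m: float) -> dict:
    """Compute the three Section 7 metrics; zero-cost tiers come out as infinity."""
    inf = float("inf")
    return {
        "tokens_per_dollar": 1_000_000 / input_price_per_1m if input_price_per_1m else inf,
        "quality_per_dollar": quality / cost * 0.001 if cost else inf,
        "satisfaction_per_dollar": satisfaction / cost if cost else inf,
    }

# Haiku + product_search, from the tables above
m = price_performance(quality=8.1, satisfaction=0.88,
                      cost=0.00025, input_price_per_1m=0.25)
# tokens_per_dollar ~ 4,000,000; quality_per_dollar ~ 32.4; satisfaction_per_dollar ~ 3,520
```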

8. Efficient Inference Patterns

8.1 Prompt Reuse

Reuse system prompts and common context blocks across requests to leverage Bedrock's prompt caching.

```mermaid
sequenceDiagram
    participant User as Customer
    participant Router as Model Router
    participant Cache as Prompt Cache<br/>(ElastiCache)
    participant Bedrock as Bedrock API

    User->>Router: "Recommend manga like Berserk"
    Router->>Cache: Check cached system prompt hash
    Cache-->>Router: Cache HIT (system prompt ID: sp_manga_rec_v3)
    Router->>Bedrock: Invoke with cached prompt reference + user query
    Bedrock-->>Router: Response (only user-specific tokens billed)
    Router-->>User: Personalized recommendation

    Note over Cache,Bedrock: System prompt (~400 tokens) reused across<br/>all recommendation queries = 75,000 reuses/day
```
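On Bedrock itself, prompt reuse maps to prompt caching on the Converse API: a `cachePoint` block marks everything before it as cacheable, so the shared system prompt is billed at the reduced cache-read rate on subsequent calls. A hedged sketch of the request payload (model support for caching varies, so verify it for the model you use; the system prompt text here is illustrative):

```python
SYSTEM_PROMPT = "You are MangaAssist, a helpful manga store assistant."  # shared ~400 tokens

def build_converse_request(model_id: str, user_query: str) -> dict:
    """Build a Converse request that marks the shared system prompt as cacheable."""
    return {
        "modelId": model_id,
        "system": [
            {"text": SYSTEM_PROMPT},
            {"cachePoint": {"type": "default"}},  # cache everything up to this point
        ],
        "messages": [
            {"role": "user", "content": [{"text": user_query}]},
        ],
    }

# Invocation (requires AWS credentials):
# import boto3
# bedrock = boto3.client("bedrock-runtime")
# bedrock.converse(**build_converse_request(
#     "anthropic.claude-3-sonnet-20240229-v1:0", "Recommend manga like Berserk"))
```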

8.2 Response Recycling

For near-identical queries, recycle previous responses instead of invoking the FM again.

```mermaid
flowchart LR
    Q1["Query: Best shounen manga?"] --> HASH[Semantic Hash]
    HASH --> CHECK{Similar Response<br/>in Cache?}
    CHECK -->|Yes, similarity > 0.95| RECYCLE["Recycle cached response<br/>Cost: $0.00"]
    CHECK -->|No| INVOKE["Invoke FM<br/>Cache the response"]

    style RECYCLE fill:#2ecc71,color:#fff
    style INVOKE fill:#3498db,color:#fff
```
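A minimal in-process sketch of the recycling check. It uses a normalized-query hash as a stand-in for the semantic hash in the diagram (a production system would embed the query and compare cosine similarity against the 0.95 threshold); `invoke_fm` is a hypothetical stand-in for the Bedrock call:

```python
import hashlib

_response_cache: dict[str, str] = {}

def _semantic_key(query: str) -> str:
    """Normalize whitespace/case and hash; a real system would use embeddings."""
    normalized = " ".join(query.lower().split()).rstrip("?!.")
    return hashlib.sha256(normalized.encode()).hexdigest()

def answer(query: str, invoke_fm) -> tuple[str, bool]:
    """Return (response, was_recycled); invoke_fm is only called on a cache miss."""
    key = _semantic_key(query)
    if key in _response_cache:
        return _response_cache[key], True   # recycled: $0.00, no FM call
    response = invoke_fm(query)             # paid FM invocation
    _response_cache[key] = response
    return response, False

first, recycled1 = answer("Best shounen manga?", lambda q: "Try One Piece!")
again, recycled2 = answer("best  shounen manga", lambda q: "Try One Piece!")
# first call misses; the second normalizes to the same key and is recycled
```

In MangaAssist the cache would live in ElastiCache Redis with a TTL rather than an in-process dict, so recycled responses are shared across ECS tasks.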

9. MangaAssist Model Router — Architecture

```mermaid
graph TB
    subgraph "Client Layer"
        USER[Customer<br/>Browser/App]
        WS[API Gateway<br/>WebSocket]
    end

    subgraph "Orchestration Layer (ECS Fargate)"
        INTENT[Intent Classifier]
        COMPLEX[Complexity Scorer<br/>Rule + ML]
        ROUTER[Model Router<br/>Intent + Complexity + Budget]
        BUDGET[Budget Guardian<br/>Daily Spend Tracker]
        COST_LOG[Cost Logger<br/>CloudWatch Metrics]
    end

    subgraph "Response Tier 0 — Templates"
        TMPL[Template Engine<br/>Jinja2 Templates]
        DDB_T[DynamoDB<br/>Order/Shipping Data]
    end

    subgraph "Response Tier 1 — Haiku"
        HAIKU[Bedrock Claude 3 Haiku<br/>$0.25 / $1.25 per 1M tokens]
        OS_H[OpenSearch<br/>Product Index]
    end

    subgraph "Response Tier 2 — Sonnet"
        SONNET[Bedrock Claude 3 Sonnet<br/>$3.00 / $15.00 per 1M tokens]
        OS_S[OpenSearch<br/>Full Catalog + Reviews]
    end

    subgraph "Shared Infrastructure"
        REDIS[ElastiCache Redis<br/>Prompt Cache + Response Cache]
        GUARD[Bedrock Guardrails]
        CW[CloudWatch<br/>Cost + Quality Metrics]
    end

    USER --> WS --> INTENT
    INTENT --> COMPLEX --> ROUTER
    ROUTER --> BUDGET
    BUDGET --> ROUTER

    ROUTER -->|Tier 0| TMPL
    TMPL --> DDB_T

    ROUTER -->|Tier 1| HAIKU
    HAIKU --> OS_H

    ROUTER -->|Tier 2| SONNET
    SONNET --> OS_S

    TMPL --> GUARD --> WS
    HAIKU --> GUARD
    SONNET --> GUARD

    ROUTER --> COST_LOG --> CW
    REDIS -.->|Cache Check| ROUTER

    style SONNET fill:#e74c3c,color:#fff
    style HAIKU fill:#3498db,color:#fff
    style TMPL fill:#2ecc71,color:#fff
    style ROUTER fill:#ff9900,color:#000
    style BUDGET fill:#f39c12,color:#fff
```

10. Production Python Code

10.1 ModelRouter Class — Complexity Scoring and Cost-Aware Routing

"""
MangaAssist Model Router — Cost-Effective FM Selection
Routes queries to Template / Haiku / Sonnet based on intent, complexity, and budget.
"""

import re
import time
import logging
from enum import Enum
from dataclasses import dataclass

import redis

logger = logging.getLogger("mangaassist.model_router")


# ---------------------------------------------------------------------------
# Enums & Data Classes
# ---------------------------------------------------------------------------

class ModelTier(Enum):
    TEMPLATE = "template"
    HAIKU = "haiku"
    SONNET = "sonnet"


class Intent(Enum):
    PRODUCT_SEARCH = "product_search"
    ORDER_STATUS = "order_status"
    RECOMMENDATION = "recommendation"
    MANGA_QA = "manga_qa"
    CHITCHAT = "chitchat"
    SHIPPING_INFO = "shipping_info"
    ESCALATION = "escalation"


@dataclass
class RoutingDecision:
    """Encapsulates the routing result with full traceability."""
    model_tier: ModelTier
    model_id: str
    intent: Intent
    complexity_score: float
    estimated_cost: float
    budget_mode: str  # normal, cautious, aggressive, emergency
    reasoning: str


@dataclass
class ModelConfig:
    """Configuration for a Bedrock model."""
    model_id: str
    input_price_per_1m: float   # $ per 1M input tokens
    output_price_per_1m: float  # $ per 1M output tokens
    avg_input_tokens: int
    avg_output_tokens: int
    quality_score: float        # 0-10 scale
    max_tokens: int = 4096

    @property
    def cost_per_request(self) -> float:
        input_cost = (self.avg_input_tokens / 1_000_000) * self.input_price_per_1m
        output_cost = (self.avg_output_tokens / 1_000_000) * self.output_price_per_1m
        return input_cost + output_cost


# ---------------------------------------------------------------------------
# Model Configurations
# ---------------------------------------------------------------------------

MODELS = {
    ModelTier.SONNET: ModelConfig(
        model_id="anthropic.claude-3-sonnet-20240229-v1:0",
        input_price_per_1m=3.00,
        output_price_per_1m=15.00,
        avg_input_tokens=800,
        avg_output_tokens=350,
        quality_score=9.2,
    ),
    ModelTier.HAIKU: ModelConfig(
        model_id="anthropic.claude-3-haiku-20240307-v1:0",
        input_price_per_1m=0.25,
        output_price_per_1m=1.25,
        avg_input_tokens=500,
        avg_output_tokens=150,
        quality_score=7.4,
    ),
}

# ---------------------------------------------------------------------------
# Intent-to-Model Default Routing Map
# ---------------------------------------------------------------------------

INTENT_MODEL_MAP: dict[Intent, ModelTier] = {
    Intent.PRODUCT_SEARCH: ModelTier.HAIKU,
    Intent.ORDER_STATUS: ModelTier.TEMPLATE,
    Intent.RECOMMENDATION: ModelTier.SONNET,
    Intent.MANGA_QA: ModelTier.SONNET,
    Intent.CHITCHAT: ModelTier.TEMPLATE,
    Intent.SHIPPING_INFO: ModelTier.HAIKU,
    Intent.ESCALATION: ModelTier.TEMPLATE,
}


# ---------------------------------------------------------------------------
# Complexity Classifier
# ---------------------------------------------------------------------------

class ComplexityClassifier:
    """
    Two-stage classifier:
      Stage 1 — Rule-based fast path for obvious patterns.
      Stage 2 — Feature-based ML scorer for ambiguous queries.
    Returns a score from 0.0 (trivial) to 1.0 (highly complex).
    """

    # Rule-based patterns (compiled once)
    TEMPLATE_PATTERNS = [
        re.compile(r"(?i)(where|track)\s.*(order|package|shipment)\s*#?\d+"),
        re.compile(r"(?i)(speak|talk|transfer).*(human|agent|representative)"),
        re.compile(r"(?i)(store\s*hours|opening\s*hours|when.*(open|close))"),
        re.compile(r"(?i)(shipping\s*(rate|cost|fee|price))"),
    ]

    SIMPLE_PATTERNS = [
        re.compile(r"(?i)(do you have|in stock|is .* available)\b"),
        re.compile(r"(?i)(find|search|look for)\s+(manga|book|volume)\b"),
        re.compile(r"(?i)(how (do|can) I)\s+(return|exchange|cancel)\b"),
        re.compile(r"(?i)^(what genres|which categories)"),
    ]

    COMPLEX_INDICATORS = [
        re.compile(r"(?i)(recommend|suggest|similar to|like .* but)"),
        re.compile(r"(?i)(compare|difference between|versus|vs)"),
        re.compile(r"(?i)(explain|analyze|theme|meaning|symbolism)"),
        re.compile(r"(?i)(best .* for .* who)"),
    ]

    def classify(self, query: str) -> float:
        """Return complexity score 0.0 - 1.0."""

        # Stage 1: Rule-based fast path
        for pattern in self.TEMPLATE_PATTERNS:
            if pattern.search(query):
                return 0.0

        for pattern in self.SIMPLE_PATTERNS:
            if pattern.search(query):
                return 0.15

        # Stage 2: Feature-based scoring
        return self._ml_score(query)

    def _ml_score(self, query: str) -> float:
        """
        Lightweight feature-based complexity scoring.
        In production, this would be a trained logistic regression or small NN.
        Here we use interpretable heuristic features that mirror the trained model.
        """
        score = 0.3  # base score for unmatched queries

        features = self._extract_features(query)

        # Token length contributes to complexity
        if features["token_count"] > 20:
            score += 0.1
        if features["token_count"] > 40:
            score += 0.1

        # Complex indicator keywords
        for pattern in self.COMPLEX_INDICATORS:
            if pattern.search(query):
                score += 0.15

        # Question depth (subordinate clauses)
        score += min(features["clause_count"] * 0.05, 0.15)

        # Japanese character ratio (mixed-language = harder)
        if features["japanese_char_ratio"] > 0.1:
            score += 0.05

        # Subjective language
        if features["has_subjective_language"]:
            score += 0.1

        # Named entity count (more entities = more complex comparison)
        if features["named_entity_count"] >= 2:
            score += 0.1

        return min(score, 1.0)

    def _extract_features(self, query: str) -> dict:
        """Extract features for the ML scorer."""
        tokens = query.split()

        # Count Japanese characters (Hiragana, Katakana, CJK)
        jp_chars = sum(
            1 for c in query
            if ("\u3040" <= c <= "\u309f")    # Hiragana
            or ("\u30a0" <= c <= "\u30ff")     # Katakana
            or ("\u4e00" <= c <= "\u9fff")     # CJK
        )

        subjective_words = {
            "best", "worst", "favorite", "love", "hate", "amazing",
            "boring", "exciting", "beautiful", "dark", "deep",
        }

        return {
            "token_count": len(tokens),
            "clause_count": query.count(",") + query.count(" and ") + query.count(" but "),
            "japanese_char_ratio": jp_chars / max(len(query), 1),
            "has_subjective_language": bool(
                set(t.lower() for t in tokens) & subjective_words
            ),
            "named_entity_count": self._count_title_case_sequences(tokens),
        }

    @staticmethod
    def _count_title_case_sequences(tokens: list[str]) -> int:
        """Count sequences of title-case words as named entity proxies."""
        count = 0
        in_entity = False
        for token in tokens:
            if token[0:1].isupper() and token.isalpha():
                if not in_entity:
                    count += 1
                    in_entity = True
            else:
                in_entity = False
        return count


# ---------------------------------------------------------------------------
# Budget Guardian (simplified — full version in 02-inference-cost-optimization.md)
# ---------------------------------------------------------------------------

class BudgetGuardian:
    """Tracks daily FM spend and returns the current budget mode."""

    DAILY_BUDGET = 2_500.00  # $2,500/day target

    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client

    def get_budget_mode(self) -> str:
        """Return current budget mode based on daily spend."""
        spent = self._get_daily_spend()
        pct = spent / self.DAILY_BUDGET

        if pct < 0.60:
            return "normal"
        elif pct < 0.80:
            return "cautious"
        elif pct < 0.95:
            return "aggressive"
        else:
            return "emergency"

    def record_cost(self, cost: float) -> None:
        """Record an inference cost against today's budget."""
        key = f"budget:daily:{time.strftime('%Y-%m-%d')}"
        self.redis.incrbyfloat(key, cost)
        self.redis.expire(key, 86400 * 2)  # TTL 2 days

    def _get_daily_spend(self) -> float:
        key = f"budget:daily:{time.strftime('%Y-%m-%d')}"
        val = self.redis.get(key)
        return float(val) if val else 0.0


# ---------------------------------------------------------------------------
# Model Router
# ---------------------------------------------------------------------------

class ModelRouter:
    """
    Cost-aware model router for MangaAssist.
    Combines intent detection, complexity scoring, and budget constraints
    to select the optimal model tier for each query.
    """

    # Budget mode overrides: which intents get downgraded
    DOWNGRADE_RULES = {
        "cautious": {
            # Cautious: manga_qa drops from Sonnet to Haiku
            Intent.MANGA_QA: ModelTier.HAIKU,
        },
        "aggressive": {
            # Aggressive: manga_qa drops to Haiku; product_search drops to Template
            Intent.MANGA_QA: ModelTier.HAIKU,
            Intent.PRODUCT_SEARCH: ModelTier.TEMPLATE,
        },
        "emergency": {
            # Emergency: recommendation and manga_qa drop to Haiku;
            # product_search and shipping_info drop to Template
            Intent.RECOMMENDATION: ModelTier.HAIKU,
            Intent.MANGA_QA: ModelTier.HAIKU,
            Intent.PRODUCT_SEARCH: ModelTier.TEMPLATE,
            Intent.SHIPPING_INFO: ModelTier.TEMPLATE,
        },
    }

    def __init__(
        self,
        classifier: ComplexityClassifier,
        budget_guardian: BudgetGuardian,
    ):
        self.classifier = classifier
        self.budget = budget_guardian

    def route(self, query: str, intent: Intent) -> RoutingDecision:
        """
        Determine the optimal model for a given query and intent.

        Steps:
          1. Look up default model from intent-to-model map.
          2. Score query complexity.
          3. Check budget mode and apply downgrades if needed.
          4. Return RoutingDecision with full traceability.
        """
        # Step 1: Default model from intent map
        default_tier = INTENT_MODEL_MAP[intent]

        # Step 2: Complexity scoring
        complexity = self.classifier.classify(query)

        # Step 3: Complexity-based override (upgrade Haiku -> Sonnet or downgrade Sonnet -> Haiku)
        tier = self._apply_complexity_override(default_tier, complexity)

        # Step 4: Budget-aware adjustment
        budget_mode = self.budget.get_budget_mode()
        tier, reasoning = self._apply_budget_override(tier, intent, budget_mode)

        # Step 5: Resolve model ID and cost
        if tier == ModelTier.TEMPLATE:
            model_id = "template_engine"
            estimated_cost = 0.0
        else:
            config = MODELS[tier]
            model_id = config.model_id
            estimated_cost = config.cost_per_request

        # Record the estimated cost against today's budget
        self.budget.record_cost(estimated_cost)

        return RoutingDecision(
            model_tier=tier,
            model_id=model_id,
            intent=intent,
            complexity_score=complexity,
            estimated_cost=estimated_cost,
            budget_mode=budget_mode,
            reasoning=reasoning,
        )

    def _apply_complexity_override(
        self, default_tier: ModelTier, complexity: float
    ) -> ModelTier:
        """
        Override model tier based on complexity score.
        - If default is Haiku but complexity > 0.7, upgrade to Sonnet.
        - If default is Sonnet but complexity < 0.3, downgrade to Haiku.
        """
        if default_tier == ModelTier.HAIKU and complexity > 0.7:
            logger.info("Upgrading Haiku -> Sonnet (complexity=%.2f)", complexity)
            return ModelTier.SONNET
        if default_tier == ModelTier.SONNET and complexity < 0.3:
            logger.info("Downgrading Sonnet -> Haiku (complexity=%.2f)", complexity)
            return ModelTier.HAIKU
        return default_tier

    def _apply_budget_override(
        self, tier: ModelTier, intent: Intent, budget_mode: str
    ) -> tuple[ModelTier, str]:
        """Apply budget-mode downgrades."""
        if budget_mode == "normal":
            return tier, f"Normal routing: {intent.value} -> {tier.value}"

        overrides = self.DOWNGRADE_RULES.get(budget_mode, {})
        if intent in overrides:
            new_tier = overrides[intent]
            if self._tier_rank(new_tier) < self._tier_rank(tier):
                reason = (
                    f"Budget {budget_mode}: downgraded {intent.value} "
                    f"from {tier.value} to {new_tier.value}"
                )
                logger.warning(reason)
                return new_tier, reason

        return tier, f"{budget_mode} mode: {intent.value} -> {tier.value} (no change)"

    @staticmethod
    def _tier_rank(tier: ModelTier) -> int:
        return {ModelTier.TEMPLATE: 0, ModelTier.HAIKU: 1, ModelTier.SONNET: 2}[tier]


# ---------------------------------------------------------------------------
# Usage Example
# ---------------------------------------------------------------------------

def demo():
    """Demonstrate the ModelRouter with sample MangaAssist queries."""
    redis_client = redis.Redis(host="mangaassist-cache.xxxxx.ng.0001.apne1.cache.amazonaws.com")
    classifier = ComplexityClassifier()
    guardian = BudgetGuardian(redis_client)
    router = ModelRouter(classifier, guardian)

    test_cases = [
        ("Where is my order #98765?", Intent.ORDER_STATUS),
        ("Do you have One Piece volume 104?", Intent.PRODUCT_SEARCH),
        ("Recommend manga like Berserk but less dark and more hopeful", Intent.RECOMMENDATION),
        ("What themes does Chainsaw Man explore?", Intent.MANGA_QA),
        ("Hello!", Intent.CHITCHAT),
        ("What are your shipping rates to Osaka?", Intent.SHIPPING_INFO),
    ]

    for query, intent in test_cases:
        decision = router.route(query, intent)
        print(
            f"  [{decision.model_tier.value:>8}] "
            f"complexity={decision.complexity_score:.2f} "
            f"cost=${decision.estimated_cost:.6f} "
            f"mode={decision.budget_mode:>10} | "
            f"{query[:60]}"
        )


if __name__ == "__main__":
    demo()

10.2 CostCapabilityMatrix Evaluator

"""
MangaAssist Cost-Capability Matrix Evaluator
Evaluates and compares model cost-effectiveness across intents.
"""

from dataclasses import dataclass
from typing import Optional
import json


@dataclass
class EvaluationResult:
    """Single model evaluation for a specific intent."""
    intent: str
    model: str
    cost_per_request: float
    quality_score: float
    satisfaction_rate: float
    tokens_per_dollar: float
    quality_per_dollar: float
    satisfaction_per_dollar: float
    cost_per_quality_point: float


class CostCapabilityMatrix:
    """
    Evaluates models across cost and capability dimensions,
    producing actionable routing recommendations.
    """

    # Quality scores by (model, intent) — from production evaluation
    QUALITY_SCORES: dict[tuple[str, str], float] = {
        ("sonnet", "product_search"): 8.9,
        ("sonnet", "recommendation"): 9.4,
        ("sonnet", "manga_qa"): 9.1,
        ("sonnet", "shipping_info"): 9.0,
        ("haiku", "product_search"): 8.1,
        ("haiku", "recommendation"): 6.2,
        ("haiku", "manga_qa"): 6.5,
        ("haiku", "shipping_info"): 8.8,
        ("template", "order_status"): 9.5,
        ("template", "chitchat"): 7.5,
        ("template", "escalation"): 9.0,
    }

    # Satisfaction rates by (model, intent) — from user feedback
    SATISFACTION_RATES: dict[tuple[str, str], float] = {
        ("sonnet", "product_search"): 0.93,
        ("sonnet", "recommendation"): 0.94,
        ("sonnet", "manga_qa"): 0.91,
        ("sonnet", "shipping_info"): 0.92,
        ("haiku", "product_search"): 0.88,
        ("haiku", "recommendation"): 0.62,
        ("haiku", "manga_qa"): 0.64,
        ("haiku", "shipping_info"): 0.92,
        ("template", "order_status"): 0.96,
        ("template", "chitchat"): 0.78,
        ("template", "escalation"): 0.90,
    }

    # Average tokens per request by (model, intent)
    AVG_TOKENS: dict[tuple[str, str], tuple[int, int]] = {
        ("sonnet", "product_search"): (600, 250),
        ("sonnet", "recommendation"): (1200, 500),
        ("sonnet", "manga_qa"): (900, 400),
        ("sonnet", "shipping_info"): (400, 150),
        ("haiku", "product_search"): (400, 120),
        ("haiku", "recommendation"): (800, 200),
        ("haiku", "manga_qa"): (600, 180),
        ("haiku", "shipping_info"): (300, 80),
    }

    # Pricing
    PRICING = {
        "sonnet": {"input": 3.00, "output": 15.00},   # per 1M tokens
        "haiku": {"input": 0.25, "output": 1.25},
        "template": {"input": 0.00, "output": 0.00},
    }

    def evaluate(self, model: str, intent: str) -> Optional[EvaluationResult]:
        """Evaluate a specific model-intent combination."""
        quality = self.QUALITY_SCORES.get((model, intent))
        satisfaction = self.SATISFACTION_RATES.get((model, intent))

        if quality is None or satisfaction is None:
            return None

        cost = self._calculate_cost(model, intent)

        # Tokens per dollar (using input tokens as reference)
        pricing = self.PRICING[model]
        tokens_per_dollar = (
            1_000_000 / pricing["input"] if pricing["input"] > 0 else float("inf")
        )

        # Quality per dollar (scaled by 0.001 so typical values stay near 1)
        quality_per_dollar = (
            (quality / cost) * 0.001 if cost > 0 else float("inf")
        )

        # Satisfaction per dollar (satisfaction expressed as a percentage)
        satisfaction_per_dollar = (
            (satisfaction * 100 / cost) if cost > 0 else float("inf")
        )

        # Cost per quality point, reported in dollars per 1,000 requests
        cost_per_quality_point = (
            (cost / quality) * 1000 if quality > 0 else float("inf")
        )

        return EvaluationResult(
            intent=intent,
            model=model,
            cost_per_request=cost,
            quality_score=quality,
            satisfaction_rate=satisfaction,
            tokens_per_dollar=tokens_per_dollar,
            quality_per_dollar=quality_per_dollar,
            satisfaction_per_dollar=satisfaction_per_dollar,
            cost_per_quality_point=cost_per_quality_point,
        )

    def compare_models(self, intent: str) -> dict:
        """
        Compare all available models for a given intent.
        Returns recommendation with reasoning.
        """
        results = {}
        for model in ["template", "haiku", "sonnet"]:
            result = self.evaluate(model, intent)
            if result:
                results[model] = result

        if not results:
            return {"error": f"No evaluations available for intent: {intent}"}

        # Find the model with best quality-per-dollar
        best_value = max(results.values(), key=lambda r: r.quality_per_dollar)

        # Find the model with best raw quality
        best_quality = max(results.values(), key=lambda r: r.quality_score)

        # Recommendation logic:
        # Use the cheaper model if its quality is within 15% of the best quality
        quality_threshold = best_quality.quality_score * 0.85
        cost_effective_options = [
            r for r in results.values() if r.quality_score >= quality_threshold
        ]
        recommended = min(cost_effective_options, key=lambda r: r.cost_per_request)

        return {
            "intent": intent,
            "evaluations": {k: self._to_dict(v) for k, v in results.items()},
            "best_value_model": best_value.model,
            "recommended_model": recommended.model,
            "reasoning": (
                f"For '{intent}': {recommended.model} achieves "
                f"{recommended.quality_score}/10 quality at "
                f"${recommended.cost_per_request:.6f}/request. "
                f"Best quality ({best_quality.model}) scores "
                f"{best_quality.quality_score}/10 at "
                f"${best_quality.cost_per_request:.6f}/request — "
                f"{best_quality.cost_per_request / max(recommended.cost_per_request, 0.000001):.1f}x more expensive."
            ),
        }

    def full_matrix_report(self) -> str:
        """Generate a full cost-capability matrix report."""
        intents = [
            "product_search", "order_status", "recommendation",
            "manga_qa", "chitchat", "shipping_info", "escalation",
        ]

        lines = [
            "=" * 90,
            "MANGAASSIST COST-CAPABILITY MATRIX REPORT",
            "=" * 90,
            "",
        ]

        for intent in intents:
            comparison = self.compare_models(intent)
            if "error" in comparison:
                continue

            lines.append(f"--- {intent.upper()} ---")
            lines.append(f"  Recommended: {comparison['recommended_model']}")
            lines.append(f"  {comparison['reasoning']}")

            for model, data in comparison["evaluations"].items():
                lines.append(
                    f"    {model:>8}: quality={data['quality_score']:.1f}  "
                    f"cost=${data['cost_per_request']:.6f}  "
                    f"satisfaction={data['satisfaction_rate']:.0%}  "
                    f"cost/qp=${data['cost_per_quality_point']:.4f}"
                )
            lines.append("")

        return "\n".join(lines)

    def _calculate_cost(self, model: str, intent: str) -> float:
        """Calculate cost per request for a model-intent pair."""
        if model == "template":
            return 0.0

        tokens = self.AVG_TOKENS.get((model, intent))
        if not tokens:
            return 0.0

        input_tokens, output_tokens = tokens
        pricing = self.PRICING[model]

        input_cost = (input_tokens / 1_000_000) * pricing["input"]
        output_cost = (output_tokens / 1_000_000) * pricing["output"]
        return input_cost + output_cost

    @staticmethod
    def _to_dict(result: EvaluationResult) -> dict:
        return {
            "quality_score": result.quality_score,
            "cost_per_request": result.cost_per_request,
            "satisfaction_rate": result.satisfaction_rate,
            "tokens_per_dollar": result.tokens_per_dollar,
            "quality_per_dollar": result.quality_per_dollar,
            "satisfaction_per_dollar": result.satisfaction_per_dollar,
            "cost_per_quality_point": result.cost_per_quality_point,
        }


# ---------------------------------------------------------------------------
# Usage
# ---------------------------------------------------------------------------

if __name__ == "__main__":
    matrix = CostCapabilityMatrix()
    print(matrix.full_matrix_report())

    print("\n\nDetailed comparison for 'recommendation':")
    result = matrix.compare_models("recommendation")
    print(json.dumps(result, indent=2, default=str))
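The within-15%-of-best selection rule inside `compare_models` can be isolated to a few lines. `pick_model` below is a hypothetical standalone helper, not part of the class, exercised with the product_search and recommendation figures from the matrix:

```python
# Distilled version of the compare_models() selection rule: keep every model
# whose quality is within 15% of the best, then pick the cheapest survivor.

def pick_model(candidates: dict[str, tuple[float, float]]) -> str:
    """candidates maps model name -> (quality_score, cost_per_request)."""
    best_quality = max(q for q, _ in candidates.values())
    threshold = best_quality * 0.85  # within 15% of the best quality
    eligible = {m: c for m, (q, c) in candidates.items() if q >= threshold}
    return min(eligible, key=eligible.get)

# Haiku (8.1) is within 15% of Sonnet (8.9), so the cheaper model wins.
print(pick_model({"haiku": (8.1, 0.00025), "sonnet": (8.9, 0.00555)}))  # haiku

# Haiku (6.2) falls below 9.4 * 0.85 = 7.99, so Sonnet is the only survivor.
print(pick_model({"haiku": (6.2, 0.00045), "sonnet": (9.4, 0.01110)}))  # sonnet
```

The rule biases toward the cheap model by default and only pays for Sonnet when the quality gap is too large to ignore, which is exactly the behavior the matrix report recommends per intent.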

11. Complete Intent × Model Cost Matrix

| Intent | Model | Avg Input Tokens | Avg Output Tokens | Cost / Request | Quality Score | Satisfaction | Cost / Quality Point |
|---|---|---|---|---|---|---|---|
| product_search | Haiku | 400 | 120 | $0.000250 | 8.1 | 88% | $0.031 |
| product_search | Sonnet | 600 | 250 | $0.005550 | 8.9 | 93% | $0.624 |
| order_status | Template | 0 | 0 | $0.000000 | 9.5 | 96% | $0.000 |
| recommendation | Sonnet | 1,200 | 500 | $0.011100 | 9.4 | 94% | $1.181 |
| recommendation | Haiku | 800 | 200 | $0.000450 | 6.2 | 62% | $0.073 |
| manga_qa | Sonnet | 900 | 400 | $0.008700 | 9.1 | 91% | $0.956 |
| manga_qa | Haiku | 600 | 180 | $0.000375 | 6.5 | 64% | $0.058 |
| chitchat | Template | 0 | 0 | $0.000000 | 7.5 | 78% | $0.000 |
| shipping_info | Haiku | 300 | 80 | $0.000175 | 8.8 | 92% | $0.020 |
| shipping_info | Sonnet | 400 | 150 | $0.003450 | 9.0 | 92% | $0.383 |
| escalation | Template | 0 | 0 | $0.000000 | 9.0 | 90% | $0.000 |

**Reading the table:** For product_search, Haiku costs $0.000250 vs Sonnet's $0.005550 (22x cheaper) while scoring 8.1 vs 8.9 (only 9% lower quality). Haiku is the clear winner. For recommendation, Sonnet's 9.4 quality vs Haiku's 6.2 (34% gap) justifies the 24.7x cost premium.
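Those ratios follow directly from the token-price arithmetic in `_calculate_cost`. The sketch below is a standalone mirror of the class's pricing table; `cost_per_request` is a hypothetical helper, not the class method:

```python
# Spot-check of the table's cost arithmetic, using the same Bedrock prices
# as CostCapabilityMatrix (per 1M tokens: Sonnet $3/$15, Haiku $0.25/$1.25).

PRICING = {"sonnet": (3.00, 15.00), "haiku": (0.25, 1.25)}  # (input, output)

def cost_per_request(model: str, input_tokens: int, output_tokens: int) -> float:
    """Blend input- and output-token prices into a per-request dollar cost."""
    in_price, out_price = PRICING[model]
    return (input_tokens / 1_000_000) * in_price + (output_tokens / 1_000_000) * out_price

haiku_search = cost_per_request("haiku", 400, 120)    # $0.000250
sonnet_search = cost_per_request("sonnet", 600, 250)  # $0.005550
print(f"product_search cost ratio: {sonnet_search / haiku_search:.1f}x")  # 22.2x

haiku_rec = cost_per_request("haiku", 800, 200)       # $0.000450
sonnet_rec = cost_per_request("sonnet", 1200, 500)    # $0.011100
print(f"recommendation cost ratio: {sonnet_rec / haiku_rec:.1f}x")        # 24.7x
```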


12. Key Takeaways

  1. 73% cost reduction is achievable through tiered routing without meaningful quality loss.
  2. Template tier eliminates 35% of LLM costs entirely (order_status, chitchat, escalation).
  3. Complexity scoring prevents the two worst outcomes: overspending on simple queries and underserving complex ones.
  4. Budget guardian provides automatic cost protection during traffic spikes and sale events.
  5. Cost-per-quality-point is the single most useful metric for routing decisions — not raw cost or raw quality alone.
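As a rough illustration of takeaways 1 and 2, the sketch below blends the per-request costs from the matrix over a traffic mix. The mix, and the Sonnet baseline cost assumed for the three template-served intents, are illustrative assumptions, so the printed savings figure will differ from the production 73%:

```python
# Hypothetical traffic mix; tiered per-request costs come from the matrix.
# The all-Sonnet baseline for template-served intents (order_status,
# chitchat, escalation) is an ASSUMED $0.003450/request, mirroring
# shipping_info's Sonnet profile; it is not a measured figure.
ROUTES = {  # intent: (traffic share, tiered $/request, all-Sonnet $/request)
    "product_search": (0.33, 0.000250, 0.005550),
    "order_status":   (0.22, 0.000000, 0.003450),
    "recommendation": (0.15, 0.011100, 0.011100),
    "manga_qa":       (0.10, 0.008700, 0.008700),
    "chitchat":       (0.10, 0.000000, 0.003450),
    "shipping_info":  (0.07, 0.000175, 0.003450),
    "escalation":     (0.03, 0.000000, 0.003450),
}

tiered = sum(share * cost for share, cost, _ in ROUTES.values())
baseline = sum(share * cost for share, _, cost in ROUTES.values())
savings = 1 - tiered / baseline

print(f"blended tiered cost:  ${tiered:.6f}/request")
print(f"all-Sonnet baseline:  ${baseline:.6f}/request")
print(f"savings vs. baseline: {savings:.0%}")
```

The structure of the calculation is the point here: savings scale with how much traffic lands in the template and Haiku tiers, which is why the real percentage depends on the production intent mix and token profiles.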

Next: 02-inference-cost-optimization.md — Deep-dive into inference cost patterns, budget guardian implementation, and A/B testing model assignments.