# Inference Cost Optimization

**MangaAssist Context:** JP manga store chatbot running on AWS. Bedrock Claude 3 Sonnet ($3/$15 per 1M input/output tokens) handles complex queries; Haiku ($0.25/$1.25 per 1M input/output tokens) handles simple ones. 1M messages/day across product search, order status, manga recommendations, and Q&A. Infrastructure: OpenSearch Serverless, DynamoDB, ECS Fargate, API Gateway WebSocket, ElastiCache Redis.
## Skill Mapping

| AWS AIP-C01 Domain | Task | Skill | This File Covers |
| --- | --- | --- | --- |
| Domain 4: Operational Efficiency | Task 4.1: Cost Optimization | 4.1.2 Cost-Effective Model Selection | Inference cost patterns, budget guardian, traffic analysis, A/B testing model assignments, fallback chains, cost projections |
## 1. Inference Cost Patterns Mind Map

```mermaid
mindmap
  root((Inference Cost<br/>Optimization))
    Traffic Analysis
      Volume by Intent
      Cost Distribution
      Peak vs Off-Peak
      Seasonal Patterns
    Budget Guardian
      Daily Budget Tracking
      Spend Rate Monitoring
      Dynamic Downgrade
      Emergency Mode
    A/B Testing
      Quality Impact Measurement
      Statistical Significance
      Routing Change Rollout
      Rollback Triggers
    Fallback Chains
      Sonnet to Haiku
      Haiku to Template
      Cascading Failures
      Graceful Degradation
    Cross-Region Pricing
      ap-northeast-1 vs us-east-1
      Latency-Cost Tradeoff
      Regional Availability
    Cost Projections
      Monthly Forecasting
      Growth Modeling
      Scenario Planning
```
## 2. MangaAssist Traffic Analysis

### 2.1 Daily Traffic Distribution by Intent and Tier

```mermaid
pie title Daily Message Distribution by Tier (1M messages/day)
    "Tier 0 — Template (35%)" : 350000
    "Tier 1 — Haiku (50%)" : 500000
    "Tier 2 — Sonnet (15%)" : 150000
```
**Detailed Breakdown**

| Intent | Tier | % Traffic | Daily Volume | Input Tokens/Req | Output Tokens/Req | Cost/Req | Daily Cost |
| --- | --- | --- | --- | --- | --- | --- | --- |
| product_search | Haiku | 30% | 300,000 | 400 | 120 | $0.000250 | $75.00 |
| order_status | Template | 20% | 200,000 | 0 | 0 | $0.000000 | $0.00 |
| chitchat | Template | 15% | 150,000 | 0 | 0 | $0.000000 | $0.00 |
| shipping_info | Haiku | 10% | 100,000 | 300 | 80 | $0.000175 | $17.50 |
| escalation | Template | 5% | 50,000 | 0 | 0 | $0.000000 | $0.00 |
| recommendation | Sonnet | 10% | 100,000 | 1,200 | 500 | $0.011100 | $1,110.00 |
| manga_qa | Sonnet | 5% | 50,000 | 900 | 400 | $0.008700 | $435.00 |
| product_search (complex) | Sonnet | 3% | 30,000 | 600 | 250 | $0.005550 | $166.50 |
| shipping_info (complex) | Haiku | 2% | 20,000 | 300 | 80 | $0.000175 | $3.50 |
| **Totals** | | 100% | 1,000,000 | | | | **$1,807.50** |
### 2.2 Cost Distribution — Where the Money Goes

```mermaid
pie title Daily FM Cost Distribution ($1,807.50/day)
    "recommendation (Sonnet) — $1,110" : 1110
    "manga_qa (Sonnet) — $435" : 435
    "product_search complex (Sonnet) — $167" : 167
    "product_search (Haiku) — $75" : 75
    "shipping_info (Haiku) — $21" : 21
    "Template intents — $0" : 1
```
**Key insight:** Sonnet calls represent only 15% of traffic but account for 94.7% of daily FM cost ($1,711.50 of $1,807.50). Every 1% of traffic shifted from Sonnet to Haiku saves approximately $110/day ($3,300/month).
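That savings estimate can be verified in a few lines, assuming the shifted traffic carries the recommendation token profile — the costliest Sonnet intent in the table above:

```python
# Per-request costs from the traffic table above.
SONNET_RECO_COST = 0.011100  # recommendation on Sonnet, $/request
HAIKU_COST = 0.000250        # product_search on Haiku, $/request

shifted = 10_000  # 1% of 1M daily messages
daily_savings = shifted * (SONNET_RECO_COST - HAIKU_COST)
print(f"~${daily_savings:,.0f}/day, ~${daily_savings * 30:,.0f}/month")
```

This yields roughly $108/day (~$3,255/month), which rounds to the $110/day and $3,300/month quoted above.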
### 2.3 Hourly Traffic Pattern (JST)

```mermaid
xychart-beta
    title "MangaAssist Hourly Traffic (JST) — Messages per Hour"
    x-axis ["0","1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23"]
    y-axis "Messages (thousands)" 0 --> 90
    bar [8,5,3,2,2,5,15,30,45,55,60,65,70,55,50,48,52,58,72,85,80,65,40,20]
```
| Time Block (JST) | % of Daily Traffic | Hourly Avg | Peak Consideration |
| --- | --- | --- | --- |
| 00:00-06:00 | 4% | 6,667 | Lowest — aggressive cost savings |
| 06:00-12:00 | 27% | 45,000 | Morning ramp — normal routing |
| 12:00-18:00 | 31% | 51,667 | Afternoon steady — normal routing |
| 18:00-00:00 | 38% | 63,333 | Evening peak — budget monitoring critical |
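This distribution also enables intraday spend pacing: if actual spend runs ahead of the share of traffic expected by a given hour, a downgrade can trigger before an absolute threshold is hit. A minimal sketch, assuming spend tracks traffic volume (the pacing function itself is illustrative, not part of the Budget Guardian code in Section 3):

```python
# Cumulative share of daily traffic by the END of each JST block,
# taken from the table above. The pacing rule (spend should track
# traffic volume) is an illustrative assumption.
BLOCK_SHARE = {6: 0.04, 12: 0.31, 18: 0.62, 24: 1.00}

def expected_spend_by_hour(hour: int, daily_budget: float = 2500.0) -> float:
    """Expected cumulative spend at a given hour, interpolated within a block."""
    prev_hour, prev_share = 0, 0.0
    for end_hour, share in BLOCK_SHARE.items():
        if hour <= end_hour:
            frac = (hour - prev_hour) / (end_hour - prev_hour)
            return daily_budget * (prev_share + frac * (share - prev_share))
        prev_hour, prev_share = end_hour, share
    return daily_budget

print(f"Expected by 18:00 JST: ${expected_spend_by_hour(18):,.2f}")  # $1,550.00
```

Spend above the expected curve at, say, 12:00 JST is a much earlier warning than waiting for the 80% absolute threshold in the evening peak.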
## 3. Dynamic Model Downgrade — Budget Guardian

### 3.1 Architecture

```mermaid
flowchart TD
    subgraph "Every Request"
        REQ[Incoming Request] --> ROUTER[Model Router]
        ROUTER --> BG[Budget Guardian]
        BG --> CHECK{Check Daily<br/>Spend vs Budget}
    end

    CHECK -->|"< 60% spent"| NORMAL["NORMAL MODE<br/>Standard intent-to-model routing"]
    CHECK -->|"60-80% spent"| CAUTIOUS["CAUTIOUS MODE<br/>Downgrade manga_qa: Sonnet→Haiku<br/>Est. savings: ~20% on Sonnet spend"]
    CHECK -->|"80-95% spent"| AGGRESSIVE["AGGRESSIVE MODE<br/>Only recommendation stays Sonnet<br/>All else: Haiku or Template<br/>Est. savings: ~60% on Sonnet spend"]
    CHECK -->|"> 95% spent"| EMERGENCY["EMERGENCY MODE<br/>No Sonnet calls<br/>recommendation→Haiku, rest→Template<br/>Queue critical requests for next day"]

    NORMAL --> SERVE[Serve Response]
    CAUTIOUS --> SERVE
    AGGRESSIVE --> SERVE
    EMERGENCY --> SERVE

    BG --> LOG[Log to CloudWatch<br/>budget_mode metric]
    BG --> ALERT{Threshold<br/>Breach?}
    ALERT -->|80%| SNS1["SNS Alert:<br/>Budget Warning"]
    ALERT -->|95%| SNS2["PagerDuty Alert:<br/>Budget Critical"]

    style NORMAL fill:#2ecc71,color:#fff
    style CAUTIOUS fill:#f39c12,color:#fff
    style AGGRESSIVE fill:#e67e22,color:#fff
    style EMERGENCY fill:#e74c3c,color:#fff
```
### 3.2 Budget Mode Impact on Quality and Cost

| Budget Mode | Trigger | Sonnet % | Haiku % | Template % | Quality Impact | Cost Reduction |
| --- | --- | --- | --- | --- | --- | --- |
| Normal | < 60% budget | 15% | 50% | 35% | Baseline (8.5) | 0% |
| Cautious | 60-80% | 10% | 55% | 35% | -0.3 (8.2) | ~15% |
| Aggressive | 80-95% | 5% | 50% | 45% | -0.8 (7.7) | ~45% |
| Emergency | > 95% | 0% | 35% | 65% | -1.5 (7.0) | ~85% |
### 3.3 Budget Guardian Sequence Diagram

```mermaid
sequenceDiagram
    participant User as Customer
    participant Router as Model Router
    participant BG as Budget Guardian
    participant Redis as ElastiCache Redis
    participant CW as CloudWatch
    participant SNS as SNS / PagerDuty

    User->>Router: "Recommend manga like Attack on Titan"
    Router->>BG: check_budget(intent=recommendation)
    BG->>Redis: GET budget:daily:2026-03-31
    Redis-->>BG: $1,850.00 spent (74% of $2,500)
    BG-->>Router: mode=CAUTIOUS (recommendation stays Sonnet)
    Router->>Router: Route to Sonnet

    Note over BG,CW: Later that evening — manga sale event spikes traffic
    User->>Router: "What themes in Chainsaw Man?"
    Router->>BG: check_budget(intent=manga_qa)
    BG->>Redis: GET budget:daily:2026-03-31
    Redis-->>BG: $2,150.00 spent (86% of $2,500)
    BG-->>Router: mode=AGGRESSIVE (manga_qa downgraded to Haiku)
    BG->>CW: Emit budget_mode=aggressive metric
    BG->>SNS: Budget Warning Alert (86%)
    Router->>Router: Route to Haiku (downgraded)

    Note over BG,SNS: Budget hits 96%
    User->>Router: "Suggest something similar to Vinland Saga"
    Router->>BG: check_budget(intent=recommendation)
    BG->>Redis: GET budget:daily:2026-03-31
    Redis-->>BG: $2,400.00 spent (96% of $2,500)
    BG-->>Router: mode=EMERGENCY (recommendation→Haiku)
    BG->>SNS: PagerDuty CRITICAL Alert (96%)
    Router->>Router: Route to Haiku (emergency downgrade)
```
### 3.4 BudgetGuardian — Full Production Implementation

```python
"""
MangaAssist Budget Guardian — Dynamic Model Downgrade System
Monitors daily FM spend and dynamically adjusts routing tiers.
"""
import time
import logging
from dataclasses import dataclass
from enum import Enum
from typing import Optional

import boto3
import redis

logger = logging.getLogger("mangaassist.budget_guardian")


class BudgetMode(Enum):
    NORMAL = "normal"
    CAUTIOUS = "cautious"
    AGGRESSIVE = "aggressive"
    EMERGENCY = "emergency"


@dataclass
class BudgetStatus:
    """Current budget state with full context."""
    mode: BudgetMode
    daily_budget: float
    spent_today: float
    remaining: float
    utilization_pct: float
    projected_eod_spend: float  # projected end-of-day spend
    projected_eod_pct: float
    minutes_until_budget_exhausted: Optional[float]


class BudgetGuardian:
    """
    Tracks daily FM spend in Redis and returns the current budget mode.
    Provides cost projection, alerting, and automatic downgrade triggers.
    """

    DAILY_BUDGET = 2_500.00  # $2,500/day baseline
    THRESHOLDS = {
        BudgetMode.NORMAL: 0.00,
        BudgetMode.CAUTIOUS: 0.60,
        BudgetMode.AGGRESSIVE: 0.80,
        BudgetMode.EMERGENCY: 0.95,
    }

    # Which intents get downgraded at each mode level
    DOWNGRADE_MAP = {
        BudgetMode.CAUTIOUS: {
            "manga_qa": "haiku",  # Sonnet -> Haiku
        },
        BudgetMode.AGGRESSIVE: {
            "manga_qa": "haiku",
            "product_search": "template",  # Haiku -> Template (simple searches)
        },
        BudgetMode.EMERGENCY: {
            "recommendation": "haiku",  # Even recommendation gets downgraded
            "manga_qa": "haiku",
            "product_search": "template",
            "shipping_info": "template",
        },
    }

    def __init__(
        self,
        redis_client: redis.Redis,
        cloudwatch_client=None,
        sns_client=None,
        daily_budget: Optional[float] = None,
    ):
        self.redis = redis_client
        self.cw = cloudwatch_client or boto3.client("cloudwatch", region_name="ap-northeast-1")
        self.sns = sns_client or boto3.client("sns", region_name="ap-northeast-1")
        self.daily_budget = daily_budget or self.DAILY_BUDGET
        self._alert_topic_arn = "arn:aws:sns:ap-northeast-1:123456789012:mangaassist-budget-alerts"
        self._critical_topic_arn = "arn:aws:sns:ap-northeast-1:123456789012:mangaassist-budget-critical"

    # ----- Core Budget Check -----

    def get_budget_status(self) -> BudgetStatus:
        """Full budget status with projections."""
        today = time.strftime("%Y-%m-%d")
        spent = self._get_daily_spend(today)
        remaining = max(self.daily_budget - spent, 0)
        utilization = spent / self.daily_budget
        # Project end-of-day spend based on current burn rate
        projected_eod, minutes_left = self._project_spend(today, spent)
        mode = self._determine_mode(utilization)
        return BudgetStatus(
            mode=mode,
            daily_budget=self.daily_budget,
            spent_today=spent,
            remaining=remaining,
            utilization_pct=utilization * 100,
            projected_eod_spend=projected_eod,
            projected_eod_pct=(projected_eod / self.daily_budget) * 100,
            minutes_until_budget_exhausted=minutes_left,
        )

    def get_budget_mode(self) -> BudgetMode:
        """Quick mode check for hot-path routing."""
        today = time.strftime("%Y-%m-%d")
        spent = self._get_daily_spend(today)
        utilization = spent / self.daily_budget
        return self._determine_mode(utilization)

    def get_model_override(self, intent: str, current_model: str) -> Optional[str]:
        """
        Check if the current budget mode requires a model downgrade.
        Returns None if no override needed, or the downgraded model name.
        """
        mode = self.get_budget_mode()
        if mode == BudgetMode.NORMAL:
            return None
        downgrades = self.DOWNGRADE_MAP.get(mode, {})
        override = downgrades.get(intent)
        if override and self._tier_rank(override) < self._tier_rank(current_model):
            logger.warning(
                "Budget %s: downgrading %s from %s to %s",
                mode.value, intent, current_model, override,
            )
            return override
        return None

    # ----- Cost Recording -----

    def record_cost(self, cost: float, intent: str, model: str) -> None:
        """Record an inference cost and emit metrics."""
        today = time.strftime("%Y-%m-%d")
        pipe = self.redis.pipeline()
        # Total daily spend
        total_key = f"budget:daily:{today}"
        pipe.incrbyfloat(total_key, cost)
        pipe.expire(total_key, 86400 * 2)
        # Per-intent spend
        intent_key = f"budget:daily:{today}:intent:{intent}"
        pipe.incrbyfloat(intent_key, cost)
        pipe.expire(intent_key, 86400 * 2)
        # Per-model spend
        model_key = f"budget:daily:{today}:model:{model}"
        pipe.incrbyfloat(model_key, cost)
        pipe.expire(model_key, 86400 * 2)
        # Request count for burn-rate projection
        count_key = f"budget:daily:{today}:count"
        pipe.incr(count_key)
        pipe.expire(count_key, 86400 * 2)
        # Timestamp of first request today (for projection)
        first_key = f"budget:daily:{today}:first_ts"
        pipe.setnx(first_key, str(time.time()))
        pipe.expire(first_key, 86400 * 2)
        pipe.execute()
        # Emit CloudWatch metrics
        self._emit_metrics(cost, intent, model)
        # Check for alert thresholds
        self._check_alerts(today)

    # ----- Spend Projection -----

    def _project_spend(self, today: str, current_spend: float) -> tuple[float, Optional[float]]:
        """
        Project end-of-day spend based on current burn rate.
        Returns (projected_eod_spend, minutes_until_exhaustion).
        """
        first_ts = self.redis.get(f"budget:daily:{today}:first_ts")
        count = self.redis.get(f"budget:daily:{today}:count")
        if not first_ts or not count:
            return current_spend, None
        elapsed_seconds = time.time() - float(first_ts)
        if elapsed_seconds < 60:
            return current_spend, None
        # Calculate burn rate ($/second)
        burn_rate = current_spend / elapsed_seconds
        # Seconds remaining in the day
        now = time.localtime()
        seconds_remaining = (
            (23 - now.tm_hour) * 3600
            + (59 - now.tm_min) * 60
            + (60 - now.tm_sec)
        )
        projected_eod = current_spend + (burn_rate * seconds_remaining)
        # Minutes until budget exhaustion
        remaining_budget = self.daily_budget - current_spend
        if burn_rate > 0 and remaining_budget > 0:
            minutes_left = (remaining_budget / burn_rate) / 60
        else:
            minutes_left = None
        return projected_eod, minutes_left

    # ----- Alerting -----

    def _check_alerts(self, today: str) -> None:
        """Send alerts when budget thresholds are crossed."""
        spent = self._get_daily_spend(today)
        pct = spent / self.daily_budget
        # Use Redis to track which alerts have been sent today
        if pct >= 0.80:
            alert_key = f"budget:alert:{today}:80"
            if not self.redis.exists(alert_key):
                self.redis.setex(alert_key, 86400, "sent")
                self._send_alert(
                    topic_arn=self._alert_topic_arn,
                    subject="MangaAssist Budget Warning (80%)",
                    message=(
                        f"Daily FM budget is {pct:.1%} consumed.\n"
                        f"Spent: ${spent:.2f} / ${self.daily_budget:.2f}\n"
                        f"Mode: AGGRESSIVE — non-critical Sonnet calls downgraded."
                    ),
                )
        if pct >= 0.95:
            alert_key = f"budget:alert:{today}:95"
            if not self.redis.exists(alert_key):
                self.redis.setex(alert_key, 86400, "sent")
                self._send_alert(
                    topic_arn=self._critical_topic_arn,
                    subject="MangaAssist Budget CRITICAL (95%)",
                    message=(
                        f"Daily FM budget is {pct:.1%} consumed.\n"
                        f"Spent: ${spent:.2f} / ${self.daily_budget:.2f}\n"
                        f"Mode: EMERGENCY — all queries on Haiku/Template.\n"
                        f"ACTION REQUIRED: Review traffic spike and adjust budget."
                    ),
                )

    def _send_alert(self, topic_arn: str, subject: str, message: str) -> None:
        try:
            self.sns.publish(
                TopicArn=topic_arn,
                Subject=subject,
                Message=message,
            )
            logger.info("Sent alert: %s", subject)
        except Exception as e:
            logger.error("Failed to send alert: %s", e)

    # ----- CloudWatch Metrics -----

    def _emit_metrics(self, cost: float, intent: str, model: str) -> None:
        """Emit cost metrics to CloudWatch for dashboarding."""
        try:
            self.cw.put_metric_data(
                Namespace="MangaAssist/FMCost",
                MetricData=[
                    {
                        "MetricName": "InferenceCost",
                        "Value": cost,
                        "Unit": "None",
                        "Dimensions": [
                            {"Name": "Intent", "Value": intent},
                            {"Name": "Model", "Value": model},
                        ],
                    },
                    {
                        "MetricName": "DailySpend",
                        "Value": self._get_daily_spend(time.strftime("%Y-%m-%d")),
                        "Unit": "None",
                    },
                ],
            )
        except Exception as e:
            logger.error("Failed to emit CloudWatch metrics: %s", e)

    # ----- Helpers -----

    def _get_daily_spend(self, today: str) -> float:
        val = self.redis.get(f"budget:daily:{today}")
        return float(val) if val else 0.0

    def _determine_mode(self, utilization: float) -> BudgetMode:
        if utilization >= 0.95:
            return BudgetMode.EMERGENCY
        elif utilization >= 0.80:
            return BudgetMode.AGGRESSIVE
        elif utilization >= 0.60:
            return BudgetMode.CAUTIOUS
        else:
            return BudgetMode.NORMAL

    @staticmethod
    def _tier_rank(model: str) -> int:
        return {"template": 0, "haiku": 1, "sonnet": 2}.get(model, 1)


# ---------------------------------------------------------------------------
# Cost Projection Calculator
# ---------------------------------------------------------------------------

class CostProjectionCalculator:
    """
    Projects monthly FM costs under different traffic and routing scenarios.
    Used for capacity planning and budget requests.
    """

    PRICING = {
        "sonnet": {"input": 3.00, "output": 15.00},
        "haiku": {"input": 0.25, "output": 1.25},
        "template": {"input": 0.00, "output": 0.00},
    }

    DEFAULT_TOKENS = {
        "sonnet": {"input": 900, "output": 400},
        "haiku": {"input": 450, "output": 130},
        "template": {"input": 0, "output": 0},
    }

    def project_monthly_cost(
        self,
        daily_messages: int,
        sonnet_pct: float,
        haiku_pct: float,
        template_pct: float,
        days: int = 30,
    ) -> dict:
        """
        Project monthly cost for a given traffic mix.
        Percentages should sum to 1.0.
        """
        results = {}
        for model, pct in [("sonnet", sonnet_pct), ("haiku", haiku_pct), ("template", template_pct)]:
            daily_vol = daily_messages * pct
            tokens = self.DEFAULT_TOKENS[model]
            pricing = self.PRICING[model]
            cost_per_req = (
                (tokens["input"] / 1_000_000) * pricing["input"]
                + (tokens["output"] / 1_000_000) * pricing["output"]
            )
            daily_cost = daily_vol * cost_per_req
            monthly_cost = daily_cost * days
            results[model] = {
                "daily_volume": int(daily_vol),
                "cost_per_request": cost_per_req,
                "daily_cost": daily_cost,
                "monthly_cost": monthly_cost,
            }
        total_daily = sum(r["daily_cost"] for r in results.values())
        total_monthly = sum(r["monthly_cost"] for r in results.values())
        results["total"] = {
            "daily_cost": total_daily,
            "monthly_cost": total_monthly,
            "avg_cost_per_message": total_daily / daily_messages if daily_messages > 0 else 0,
        }
        return results

    def compare_scenarios(self, daily_messages: int = 1_000_000) -> str:
        """Compare multiple routing scenarios."""
        scenarios = {
            "All Sonnet": (1.00, 0.00, 0.00),
            "All Haiku": (0.00, 1.00, 0.00),
            "Current Tiered (15/50/35)": (0.15, 0.50, 0.35),
            "Optimized Tiered (10/55/35)": (0.10, 0.55, 0.35),
            "Aggressive Savings (5/45/50)": (0.05, 0.45, 0.50),
            "Emergency (0/35/65)": (0.00, 0.35, 0.65),
        }
        lines = [
            "=" * 80,
            "MANGAASSIST COST PROJECTION — SCENARIO COMPARISON",
            f"Base: {daily_messages:,} messages/day, 30-day month",
            "=" * 80,
            "",
            f"{'Scenario':<35} {'Daily':>12} {'Monthly':>14} {'Savings':>10}",
            "-" * 80,
        ]
        baseline = None
        for name, (s, h, t) in scenarios.items():
            result = self.project_monthly_cost(daily_messages, s, h, t)
            monthly = result["total"]["monthly_cost"]
            daily = result["total"]["daily_cost"]
            if baseline is None:
                baseline = monthly
                savings = "Baseline"
            else:
                savings_pct = (1 - monthly / baseline) * 100
                savings = f"{savings_pct:.1f}%"
            lines.append(
                f"{name:<35} ${daily:>10,.2f} ${monthly:>12,.2f} {savings:>10}"
            )
        lines.append("-" * 80)
        return "\n".join(lines)


if __name__ == "__main__":
    calc = CostProjectionCalculator()
    print(calc.compare_scenarios())
```
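As a quick usage check, the projection arithmetic for the current 15/50/35 mix can be reproduced inline. Note that the calculator's blended per-model token averages (an assumption baked into `DEFAULT_TOKENS`) are coarser than the per-intent profiles in Section 2.1, so it yields a lower figure than the $1,807.50/day table baseline:

```python
# Reproduces project_monthly_cost for the 15/50/35 mix using the
# calculator's blended averages (Sonnet 900/400, Haiku 450/130 tokens).
PRICING = {"sonnet": (3.00, 15.00), "haiku": (0.25, 1.25)}  # $/1M in, $/1M out
TOKENS = {"sonnet": (900, 400), "haiku": (450, 130)}        # avg in/out per request
MIX = {"sonnet": 0.15, "haiku": 0.50}                       # template tier is free

daily = 0.0
for model, pct in MIX.items():
    (p_in, p_out), (t_in, t_out) = PRICING[model], TOKENS[model]
    cost_per_req = t_in / 1e6 * p_in + t_out / 1e6 * p_out
    daily += 1_000_000 * pct * cost_per_req

print(f"${daily:,.2f}/day, ${daily * 30:,.2f}/month")  # $1,442.50/day, $43,275.00/month
```

When comparing calculator output against the Section 2.1 table, expect this gap; the calculator is for relative scenario comparison, not exact billing.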
## 4. A/B Testing Model Assignments

### 4.1 A/B Test Architecture

```mermaid
flowchart TD
    REQ[Incoming Request] --> SPLIT{A/B Split<br/>by session_id hash}
    SPLIT -->|Control: 90%| CONTROL["Control Group<br/>Standard routing<br/>(manga_qa → Sonnet)"]
    SPLIT -->|Treatment: 10%| TREATMENT["Treatment Group<br/>Experimental routing<br/>(manga_qa → Haiku)"]

    CONTROL --> BEDROCK_S[Bedrock Sonnet]
    TREATMENT --> BEDROCK_H[Bedrock Haiku]
    BEDROCK_S --> RESPONSE_C[Response + Metadata]
    BEDROCK_H --> RESPONSE_T[Response + Metadata]
    RESPONSE_C --> METRICS[Metrics Collector]
    RESPONSE_T --> METRICS

    METRICS --> QUALITY["Quality Metrics<br/>- LLM-as-judge score<br/>- User satisfaction<br/>- Response relevance"]
    METRICS --> COST_M["Cost Metrics<br/>- Tokens used<br/>- Cost per request<br/>- Budget impact"]
    METRICS --> LATENCY["Latency Metrics<br/>- TTFT<br/>- Total latency<br/>- Streaming time"]

    QUALITY & COST_M & LATENCY --> ANALYSIS["Statistical Analysis<br/>- t-test for means<br/>- Chi-square for satisfaction<br/>- Required: p < 0.05, n > 1,000"]
    ANALYSIS -->|"Quality drop < 10%"| PROMOTE["Promote Treatment<br/>Roll out to 100%"]
    ANALYSIS -->|"Quality drop >= 10%"| ROLLBACK["Rollback Treatment<br/>Keep Control routing"]

    style CONTROL fill:#3498db,color:#fff
    style TREATMENT fill:#f39c12,color:#fff
    style PROMOTE fill:#2ecc71,color:#fff
    style ROLLBACK fill:#e74c3c,color:#fff
```
### 4.2 A/B Test Configuration

```python
"""
MangaAssist A/B Testing — Model Assignment Experiments
"""
import hashlib
import time
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("mangaassist.ab_test")


@dataclass
class ABTestConfig:
    """Configuration for a model routing A/B test."""
    test_id: str
    intent: str
    control_model: str    # e.g., "sonnet"
    treatment_model: str  # e.g., "haiku"
    treatment_pct: float  # 0.0 - 1.0 (e.g., 0.10 = 10%)
    start_time: float
    end_time: Optional[float] = None
    min_samples: int = 1_000
    max_quality_drop_pct: float = 10.0  # auto-rollback if quality drops > 10%


@dataclass
class ABTestResult:
    """Metrics for one group in an A/B test."""
    group: str  # "control" or "treatment"
    sample_count: int
    avg_quality_score: float
    avg_cost: float
    avg_latency_ms: float
    satisfaction_rate: float


class ABTestRouter:
    """
    Deterministic A/B test routing based on session_id hash.
    Ensures the same user always sees the same variant within a test.
    """

    def __init__(self, active_tests: list[ABTestConfig]):
        self.tests = {t.intent: t for t in active_tests}

    def get_variant(self, session_id: str, intent: str) -> Optional[str]:
        """
        Return the model to use, or None if no active test for this intent.
        Uses session_id hash for deterministic, repeatable assignment.
        """
        test = self.tests.get(intent)
        if not test:
            return None
        # Check if test is still active
        now = time.time()
        if now < test.start_time:
            return None
        if test.end_time and now > test.end_time:
            return None
        # Deterministic hash-based split
        hash_input = f"{test.test_id}:{session_id}"
        hash_value = int(hashlib.sha256(hash_input.encode()).hexdigest()[:8], 16)
        bucket = (hash_value % 10000) / 10000.0  # 0.0000 to 0.9999
        if bucket < test.treatment_pct:
            return test.treatment_model
        else:
            return test.control_model

    def should_auto_rollback(
        self, intent: str, control: ABTestResult, treatment: ABTestResult
    ) -> bool:
        """
        Check if the treatment should be auto-rolled back.
        Triggers if quality drops more than the configured threshold
        AND we have enough samples for statistical significance.
        """
        test = self.tests.get(intent)
        if not test:
            return False
        if treatment.sample_count < test.min_samples:
            return False  # Not enough data yet
        quality_drop_pct = (
            (control.avg_quality_score - treatment.avg_quality_score)
            / control.avg_quality_score
            * 100
        )
        if quality_drop_pct > test.max_quality_drop_pct:
            logger.warning(
                "A/B test %s: quality drop %.1f%% exceeds threshold %.1f%%. "
                "Auto-rolling back.",
                test.test_id, quality_drop_pct, test.max_quality_drop_pct,
            )
            return True
        return False


# Example: Testing manga_qa on Haiku instead of Sonnet
ACTIVE_TESTS = [
    ABTestConfig(
        test_id="exp-2026-03-manga-qa-haiku",
        intent="manga_qa",
        control_model="sonnet",
        treatment_model="haiku",
        treatment_pct=0.10,  # 10% of manga_qa traffic
        start_time=time.time(),
        min_samples=2_000,
        max_quality_drop_pct=15.0,  # Allow up to 15% quality drop
    ),
]
```
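To see the deterministic split without any AWS dependencies, the hash-bucket logic from `get_variant` can be exercised standalone (the `bucket` helper below re-implements that logic purely for illustration):

```python
import hashlib

# Re-implementation of the hash split inside ABTestRouter.get_variant,
# pulled out as a standalone helper for demonstration.
def bucket(test_id: str, session_id: str) -> float:
    h = int(hashlib.sha256(f"{test_id}:{session_id}".encode()).hexdigest()[:8], 16)
    return (h % 10_000) / 10_000.0

TEST_ID, TREATMENT_PCT = "exp-2026-03-manga-qa-haiku", 0.10

# Deterministic: the same session always lands in the same bucket.
assert bucket(TEST_ID, "sess-42") == bucket(TEST_ID, "sess-42")

# Over many sessions, the treatment share converges on the configured 10%.
n = 100_000
treated = sum(bucket(TEST_ID, f"sess-{i}") < TREATMENT_PCT for i in range(n))
print(f"treatment share: {treated / n:.3f}")  # close to 0.100
```

Salting the hash with `test_id` means a user's bucket in one experiment is independent of their bucket in any other, which avoids correlated cohorts across concurrent tests.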
### 4.3 A/B Test Decision Criteria

| Metric | Acceptable Range | Auto-Rollback Trigger | Data Required |
| --- | --- | --- | --- |
| Quality score (LLM-as-judge) | < 10% drop | > 15% drop | n >= 1,000 |
| User satisfaction | < 5% drop | > 10% drop | n >= 2,000 |
| Cost reduction | > 50% savings | N/A (always cheaper) | N/A |
| Latency improvement | Any improvement | > 200ms increase | n >= 500 |
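The significance requirement from the architecture diagram (p < 0.05) can be checked with a large-sample z-test — a standard simplification of the t-test that is reasonable at n >= 1,000. The score means and standard deviations below are illustrative assumptions, not measured values:

```python
import math
from statistics import NormalDist

def two_sample_z_p_value(mean_a: float, sd_a: float, n_a: int,
                         mean_b: float, sd_b: float, n_b: int) -> float:
    """Two-sided p-value for a difference in means (large-sample z-test)."""
    se = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    z = (mean_a - mean_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Illustrative: control (Sonnet) judge scores vs treatment (Haiku), n=2,000 each
p = two_sample_z_p_value(8.5, 1.2, 2_000, 8.2, 1.3, 2_000)
print(f"p = {p:.4g}; significant at 0.05: {p < 0.05}")
```

A significant p-value alone does not trigger rollback — the measured drop must also exceed the percentage threshold in the table above.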
## 5. Cross-Region Pricing Optimization

| Region | Sonnet Input/1M | Sonnet Output/1M | Haiku Input/1M | Haiku Output/1M | Latency from Tokyo |
| --- | --- | --- | --- | --- | --- |
| ap-northeast-1 (Tokyo) | $3.00 | $15.00 | $0.25 | $1.25 | 0 ms |
| us-east-1 (Virginia) | $3.00 | $15.00 | $0.25 | $1.25 | ~160 ms |
| us-west-2 (Oregon) | $3.00 | $15.00 | $0.25 | $1.25 | ~120 ms |
| eu-west-1 (Ireland) | $3.00 | $15.00 | $0.25 | $1.25 | ~230 ms |

**MangaAssist decision:** Bedrock pricing is uniform across regions for Claude 3. Since our users are in Japan, we use ap-northeast-1 exclusively to minimize latency. Cross-region arbitrage is not beneficial for this workload — the latency cost far exceeds any potential savings.
## 6. Fallback Chains

### 6.1 Cascading Fallback Architecture

```mermaid
flowchart TD
    QUERY[Customer Query<br/>Intent: recommendation] --> PRIMARY["PRIMARY: Sonnet<br/>Attempt invocation"]
    PRIMARY -->|Success| RESPONSE[Return Response]
    PRIMARY -->|"Failure (throttle/timeout/5xx)"| FALLBACK1["FALLBACK 1: Haiku<br/>Attempt with simplified prompt"]
    FALLBACK1 -->|Success| RESPONSE
    FALLBACK1 -->|Failure| FALLBACK2["FALLBACK 2: Template<br/>Generic response + apology"]
    FALLBACK2 --> RESPONSE
    PRIMARY -->|"Latency > 5s"| TIMEOUT["Parallel: Start Haiku<br/>Hedged request — first response wins"]
    TIMEOUT --> RESPONSE

    style PRIMARY fill:#e74c3c,color:#fff
    style FALLBACK1 fill:#3498db,color:#fff
    style FALLBACK2 fill:#2ecc71,color:#fff
    style TIMEOUT fill:#f39c12,color:#fff
```
### 6.2 Fallback Chain Implementation

```python
"""
MangaAssist Fallback Chain — Graceful Degradation on FM Failures
"""
import asyncio
import json
import logging
import time
from dataclasses import dataclass
from typing import Optional

import boto3

logger = logging.getLogger("mangaassist.fallback")


@dataclass
class FallbackResult:
    """Result from fallback chain invocation."""
    response_text: str
    model_used: str
    tier_used: str  # "primary", "fallback_1", "fallback_2"
    latency_ms: float
    was_degraded: bool
    degradation_reason: Optional[str] = None


class FallbackChain:
    """
    Executes a cascade of model invocations with graceful degradation.
    Primary → Fallback 1 → Fallback 2 (template).
    """

    FALLBACK_MAP = {
        "sonnet": ["haiku", "template"],
        "haiku": ["template"],
        "template": [],
    }

    TEMPLATE_RESPONSES = {
        "recommendation": (
            "I apologize, but I'm currently unable to provide personalized manga "
            "recommendations. Please try again in a moment, or browse our curated "
            "collections at manga.example.com/collections."
        ),
        "manga_qa": (
            "I'm sorry, I'm temporarily unable to answer detailed manga questions. "
            "You can find manga information on our wiki at manga.example.com/wiki."
        ),
        "product_search": (
            "I'm having trouble searching right now. Please try using the search bar "
            "on our website, or contact support for assistance."
        ),
        "default": (
            "I apologize for the inconvenience. I'm experiencing temporary difficulties. "
            "Please try again shortly or contact our support team."
        ),
    }

    def __init__(self, bedrock_client=None, timeout_ms: int = 5_000):
        self.bedrock = bedrock_client or boto3.client(
            "bedrock-runtime", region_name="ap-northeast-1"
        )
        self.timeout_s = timeout_ms / 1000.0

    async def invoke_with_fallback(
        self,
        prompt: str,
        primary_model: str,
        intent: str,
    ) -> FallbackResult:
        """
        Invoke the primary model, falling back through the chain on failure.
        """
        chain = [primary_model] + self.FALLBACK_MAP.get(primary_model, [])
        tier_names = ["primary", "fallback_1", "fallback_2"]
        for i, model in enumerate(chain):
            tier = tier_names[min(i, len(tier_names) - 1)]
            if model == "template":
                return FallbackResult(
                    response_text=self.TEMPLATE_RESPONSES.get(
                        intent, self.TEMPLATE_RESPONSES["default"]
                    ),
                    model_used="template",
                    tier_used=tier,
                    latency_ms=1.0,
                    was_degraded=(i > 0),
                    degradation_reason=(
                        f"All FM models failed; serving template for {intent}"
                        if i > 0 else None
                    ),
                )
            try:
                start = time.monotonic()
                response = await asyncio.wait_for(
                    self._invoke_model(model, prompt),
                    timeout=self.timeout_s,
                )
                latency = (time.monotonic() - start) * 1000
                return FallbackResult(
                    response_text=response,
                    model_used=model,
                    tier_used=tier,
                    latency_ms=latency,
                    was_degraded=(i > 0),
                    degradation_reason=(
                        f"Degraded from {primary_model} to {model}" if i > 0 else None
                    ),
                )
            except asyncio.TimeoutError:
                logger.warning("Model %s timed out after %.1fs", model, self.timeout_s)
                continue
            except Exception as e:
                logger.error("Model %s failed: %s", model, str(e))
                continue
        # All models failed — should not reach here because template is always last
        return FallbackResult(
            response_text=self.TEMPLATE_RESPONSES["default"],
            model_used="template",
            tier_used="fallback_2",
            latency_ms=1.0,
            was_degraded=True,
            degradation_reason="All models and templates exhausted",
        )

    async def _invoke_model(self, model: str, prompt: str) -> str:
        """Invoke a Bedrock model. In production, use an async Bedrock client."""
        model_ids = {
            "sonnet": "anthropic.claude-3-sonnet-20240229-v1:0",
            "haiku": "anthropic.claude-3-haiku-20240307-v1:0",
        }
        model_id = model_ids.get(model)
        if not model_id:
            raise ValueError(f"Unknown model: {model}")
        # json.dumps handles prompt escaping; building the body with an f-string
        # would break on quotes or newlines in the prompt.
        body = json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": prompt}],
        })
        # In production: use aioboto3 or run_in_executor
        loop = asyncio.get_running_loop()
        response = await loop.run_in_executor(
            None,
            lambda: self.bedrock.invoke_model(
                modelId=model_id,
                contentType="application/json",
                accept="application/json",
                body=body,
            ),
        )
        payload = json.loads(response["body"].read())
        return payload["content"][0]["text"]
```
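A condensed, dependency-free sketch of the same cascade logic, with a stub invoker standing in for Bedrock (the stub and its failure behavior are illustrative assumptions):

```python
import asyncio

async def invoke_with_fallback(chain, invoke, timeout_s: float = 5.0):
    """Try each model in order; fall through on timeout or error."""
    for model in chain:
        if model == "template":
            return "template", "canned response"
        try:
            text = await asyncio.wait_for(invoke(model), timeout=timeout_s)
            return model, text
        except (asyncio.TimeoutError, RuntimeError):
            continue  # degrade to the next tier
    raise RuntimeError("chain exhausted")

async def flaky_invoke(model: str) -> str:
    # Stand-in for Bedrock: Sonnet is throttled, Haiku succeeds.
    if model == "sonnet":
        raise RuntimeError("ThrottlingException")
    return f"{model} answer"

model, text = asyncio.run(invoke_with_fallback(["sonnet", "haiku", "template"], flaky_invoke))
print(model, "->", text)  # haiku -> haiku answer
```

Because the template tier never raises, the chain always terminates with a response — the property the behavior table below depends on.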
### 6.3 Fallback Chain — Expected Behavior

| Scenario | Primary | Fallback 1 | Fallback 2 | User Impact |
| --- | --- | --- | --- | --- |
| Normal operation | Sonnet succeeds | N/A | N/A | Full quality |
| Sonnet throttled | Sonnet 429 | Haiku succeeds | N/A | Slight quality drop |
| Sonnet + Haiku throttled | Sonnet 429 | Haiku 429 | Template | Canned response |
| Sonnet timeout (> 5s) | Timeout | Haiku succeeds | N/A | Faster, lower quality |
| All Bedrock down | Sonnet fail | Haiku fail | Template | Canned response |
## 7. Cost Projection — Monthly Forecast

### 7.1 Current Baseline

```mermaid
pie title Monthly FM Cost Projection — Current Routing ($54,225/mo)
    "Sonnet (recommendation)" : 33300
    "Sonnet (manga_qa)" : 13050
    "Sonnet (complex product_search)" : 4995
    "Haiku (product_search)" : 2250
    "Haiku (shipping_info)" : 630
    "Template (all)" : 0
```
### 7.2 Scenario Projections

| Scenario | Daily Messages | Sonnet % | Haiku % | Template % | Monthly Cost | vs Baseline |
| --- | --- | --- | --- | --- | --- | --- |
| All Sonnet | 1M | 100% | 0% | 0% | $229,500 | +323% |
| All Haiku | 1M | 0% | 100% | 0% | $5,063 | -91% |
| Current Tiered | 1M | 15% | 50% | 35% | $54,225 | Baseline |
| Optimized (after A/B tests) | 1M | 10% | 55% | 35% | $38,475 | -29% |
| 2x Growth | 2M | 15% | 50% | 35% | $108,450 | +100% |
| Sale Event (3x spike) | 3M | 15% | 50% | 35% | $162,675 | +200% |
| Sale + Budget Guardian | 3M | 5% | 50% | 45% | $56,250 | +4% |
**Key takeaway:** During a 3x traffic spike (sale event), the Budget Guardian keeps monthly costs nearly flat by dynamically shifting traffic from Sonnet to Haiku/Template. Without the guardian, costs would triple.
### 7.3 Annual Budget Planning

| Quarter | Expected Daily Volume | Routing Strategy | Monthly Budget | Quarterly Budget |
| --- | --- | --- | --- | --- |
| Q1 2026 | 1.0M | Current Tiered | $54,225 | $162,675 |
| Q2 2026 | 1.2M (spring sale) | Optimized + Guardian | $46,170 | $138,510 |
| Q3 2026 | 1.5M (growth) | Optimized + Guardian | $57,713 | $173,138 |
| Q4 2026 | 2.0M (holiday) | Aggressive + Guardian | $62,500 | $187,500 |
| **Annual Total** | | | | **$661,823** |
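A quick check of the table's arithmetic, summing the quarterly column as printed (the quarterly figures carry rounding from the monthly numbers):

```python
# Quarterly budgets as printed in the table above.
quarterly = {"Q1": 162_675, "Q2": 138_510, "Q3": 173_138, "Q4": 187_500}
annual = sum(quarterly.values())
print(f"Annual total: ${annual:,}")  # Annual total: $661,823
```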
## 8. Key Takeaways

- 94.7% of FM cost comes from 15% of traffic (Sonnet calls). Optimizing Sonnet routing has the highest ROI.
- Budget Guardian prevents runaway costs during traffic spikes — keeping a 3x spike nearly cost-neutral.
- A/B testing before routing changes prevents quality regressions — always test with statistical rigor (n >= 1,000, p < 0.05).
- Fallback chains provide graceful degradation — users always get a response, even during Bedrock outages.
- Cross-region arbitrage does not apply for MangaAssist (uniform Bedrock pricing) — minimize latency by staying in ap-northeast-1.
- Projected annual cost of $661,823 for 2026 is 76% less than an all-Sonnet approach ($2.75M).
**Previous:** 01-model-selection-framework.md — Model selection architecture, complexity classifier, routing maps.
**Next:** 03-scenarios-and-runbooks.md — Five MangaAssist production scenarios with decision trees.