
Inference Cost Optimization

MangaAssist Context: a chatbot for a Japanese manga store, running on AWS. Bedrock Claude 3 Sonnet ($3/$15 per 1M input/output tokens) handles complex queries; Claude 3 Haiku ($0.25/$1.25 per 1M input/output tokens) handles simple ones. Traffic is 1M messages/day across product search, order status, manga recommendations, and Q&A. Infrastructure: OpenSearch Serverless, DynamoDB, ECS Fargate, API Gateway WebSocket, ElastiCache Redis.


Skill Mapping

| AWS AIP-C01 Domain | Task | Skill | This File Covers |
|---|---|---|---|
| Domain 4: Operational Efficiency | Task 4.1: Cost Optimization | 4.1.2 Cost-Effective Model Selection | Inference cost patterns, budget guardian, traffic analysis, A/B testing model assignments, fallback chains, cost projections |

1. Inference Cost Patterns Mind Map

mindmap
  root((Inference Cost<br/>Optimization))
    Traffic Analysis
      Volume by Intent
      Cost Distribution
      Peak vs Off-Peak
      Seasonal Patterns
    Budget Guardian
      Daily Budget Tracking
      Spend Rate Monitoring
      Dynamic Downgrade
      Emergency Mode
    A/B Testing
      Quality Impact Measurement
      Statistical Significance
      Routing Change Rollout
      Rollback Triggers
    Fallback Chains
      Sonnet to Haiku
      Haiku to Template
      Cascading Failures
      Graceful Degradation
    Cross-Region Pricing
      ap-northeast-1 vs us-east-1
      Latency-Cost Tradeoff
      Regional Availability
    Cost Projections
      Monthly Forecasting
      Growth Modeling
      Scenario Planning

2. MangaAssist Traffic Analysis

2.1 Daily Traffic Distribution by Intent and Tier

pie title Daily Message Distribution by Tier (1M messages/day)
    "Tier 0 — Template (35%)" : 350000
    "Tier 1 — Haiku (50%)" : 500000
    "Tier 2 — Sonnet (15%)" : 150000

Detailed Breakdown

| Intent | Tier | % Traffic | Daily Volume | Input Tokens/Req | Output Tokens/Req | Cost/Req | Daily Cost |
|---|---|---|---|---|---|---|---|
| product_search | Haiku | 30% | 300,000 | 400 | 120 | $0.000250 | $75.00 |
| order_status | Template | 20% | 200,000 | 0 | 0 | $0.000000 | $0.00 |
| chitchat | Template | 15% | 150,000 | 0 | 0 | $0.000000 | $0.00 |
| shipping_info | Haiku | 10% | 100,000 | 300 | 80 | $0.000175 | $17.50 |
| escalation | Template | 5% | 50,000 | 0 | 0 | $0.000000 | $0.00 |
| recommendation | Sonnet | 10% | 100,000 | 1,200 | 500 | $0.011100 | $1,110.00 |
| manga_qa | Sonnet | 5% | 50,000 | 900 | 400 | $0.008700 | $435.00 |
| product_search (complex) | Sonnet | 3% | 30,000 | 600 | 250 | $0.005550 | $166.50 |
| shipping_info (complex) | Haiku | 2% | 20,000 | 300 | 80 | $0.000175 | $3.50 |
| **Totals** | | 100% | 1,000,000 | | | | **$1,807.50** |
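The Cost/Req column is per-token pricing applied to the token counts; a short script to reproduce the table's arithmetic:

```python
# Reproduce the breakdown table's Cost/Req and Daily Cost columns from
# Bedrock per-1M-token pricing (prices from the MangaAssist context above).
PRICING = {
    "sonnet": {"input": 3.00, "output": 15.00},  # $ per 1M tokens
    "haiku": {"input": 0.25, "output": 1.25},
    "template": {"input": 0.00, "output": 0.00},
}

# (intent, model, daily_volume, input_tokens, output_tokens) — non-template rows
ROWS = [
    ("product_search", "haiku", 300_000, 400, 120),
    ("shipping_info", "haiku", 100_000, 300, 80),
    ("recommendation", "sonnet", 100_000, 1_200, 500),
    ("manga_qa", "sonnet", 50_000, 900, 400),
    ("product_search (complex)", "sonnet", 30_000, 600, 250),
    ("shipping_info (complex)", "haiku", 20_000, 300, 80),
]

def cost_per_request(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICING[model]
    return (input_tokens / 1e6) * p["input"] + (output_tokens / 1e6) * p["output"]

total = sum(vol * cost_per_request(m, i, o) for _, m, vol, i, o in ROWS)
print(f"${total:,.2f}/day")  # $1,807.50/day, matching the table total
```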

2.2 Cost Distribution — Where the Money Goes

pie title Daily FM Cost Distribution ($1,807.50/day)
    "recommendation (Sonnet) — $1,110" : 1110
    "manga_qa (Sonnet) — $435" : 435
    "product_search complex (Sonnet) — $167" : 167
    "product_search (Haiku) — $75" : 75
    "shipping_info (Haiku) — $21" : 21
    "Template intents — $0" : 1

Key insight: Sonnet calls represent only 18% of traffic (180,000 messages/day) but account for 94.7% of daily FM cost ($1,711.50 of $1,807.50). Every 1% of traffic shifted from Sonnet to Haiku saves roughly $90-100/day (about $2,700-3,000/month), depending on the token profile of the shifted intent.
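The per-point savings can be sanity-checked by pricing the same token profile under both models (an estimate only; the exact figure depends on which intent shifts):

```python
# Estimate savings from shifting 1% of daily traffic (10,000 messages)
# from Sonnet to Haiku while holding the token profile constant.
SONNET = {"input": 3.00, "output": 15.00}  # $ per 1M tokens
HAIKU = {"input": 0.25, "output": 1.25}

def per_request(pricing: dict, input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1e6) * pricing["input"] + (output_tokens / 1e6) * pricing["output"]

# recommendation profile from the breakdown table: 1,200 in / 500 out
saving_per_req = per_request(SONNET, 1200, 500) - per_request(HAIKU, 1200, 500)
daily_saving = saving_per_req * 10_000  # 1% of 1M messages/day
print(f"${daily_saving:,.2f}/day")  # ≈ $101.75/day for this profile
```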

2.3 Hourly Traffic Pattern (JST)

xychart-beta
    title "MangaAssist Hourly Traffic (JST) — Messages per Hour"
    x-axis ["0","1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23"]
    y-axis "Messages (thousands)" 0 --> 90
    bar [8,5,3,2,2,5,15,30,45,55,60,65,70,55,50,48,52,58,72,85,80,65,40,20]

| Time Block (JST) | % of Daily Traffic | Hourly Avg | Peak Consideration |
|---|---|---|---|
| 00:00-06:00 | 4% | 6,667 | Lowest — aggressive cost savings |
| 06:00-12:00 | 27% | 45,000 | Morning ramp — normal routing |
| 12:00-18:00 | 31% | 51,667 | Afternoon steady — normal routing |
| 18:00-00:00 | 38% | 63,333 | Evening peak — budget monitoring critical |
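One way to act on this distribution (a sketch with block shares assumed from the table, not part of the source system) is to pace the daily budget by time block, so morning overspend does not starve the evening peak:

```python
# Allocate the daily budget across JST time blocks in proportion to
# historical traffic share; budget-mode checks can then compare spend
# against a time-aware target instead of the flat daily number.
DAILY_BUDGET = 2_500.00
BLOCK_SHARE = {  # from the hourly traffic table
    "00-06": 0.04,
    "06-12": 0.27,
    "12-18": 0.31,
    "18-24": 0.38,
}

def budget_target_through(hour: int) -> float:
    """Cumulative spend target at the start of the given JST hour."""
    target = 0.0
    for block, share in BLOCK_SHARE.items():
        start, end = (int(x) for x in block.split("-"))
        covered = min(max(hour - start, 0), end - start)  # hours elapsed in block
        target += DAILY_BUDGET * share * covered / (end - start)
    return target

print(round(budget_target_through(12), 2))  # target by noon: 4% + 27% of budget
print(round(budget_target_through(24), 2))  # full budget by end of day
```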

3. Dynamic Model Downgrade — Budget Guardian

3.1 Architecture

flowchart TD
    subgraph "Every Request"
        REQ[Incoming Request] --> ROUTER[Model Router]
        ROUTER --> BG[Budget Guardian]
        BG --> CHECK{Check Daily<br/>Spend vs Budget}
    end

    CHECK -->|"< 60% spent"| NORMAL["NORMAL MODE<br/>Standard intent-to-model routing"]
    CHECK -->|"60-80% spent"| CAUTIOUS["CAUTIOUS MODE<br/>Downgrade manga_qa: Sonnet→Haiku<br/>Est. savings: ~20% on Sonnet spend"]
    CHECK -->|"80-95% spent"| AGGRESSIVE["AGGRESSIVE MODE<br/>Only recommendation stays Sonnet<br/>All else: Haiku or Template<br/>Est. savings: ~60% on Sonnet spend"]
    CHECK -->|"> 95% spent"| EMERGENCY["EMERGENCY MODE<br/>No Sonnet calls<br/>recommendation→Haiku, rest→Template<br/>Queue critical requests for next day"]

    NORMAL --> SERVE[Serve Response]
    CAUTIOUS --> SERVE
    AGGRESSIVE --> SERVE
    EMERGENCY --> SERVE

    BG --> LOG[Log to CloudWatch<br/>budget_mode metric]
    BG --> ALERT{Threshold<br/>Breach?}
    ALERT -->|80%| SNS1["SNS Alert:<br/>Budget Warning"]
    ALERT -->|95%| SNS2["PagerDuty Alert:<br/>Budget Critical"]

    style NORMAL fill:#2ecc71,color:#fff
    style CAUTIOUS fill:#f39c12,color:#fff
    style AGGRESSIVE fill:#e67e22,color:#fff
    style EMERGENCY fill:#e74c3c,color:#fff

3.2 Budget Mode Impact on Quality and Cost

| Budget Mode | Trigger | Sonnet % | Haiku % | Template % | Quality Impact | Cost Reduction |
|---|---|---|---|---|---|---|
| Normal | < 60% budget | 15% | 50% | 35% | Baseline (8.5) | 0% |
| Cautious | 60-80% | 10% | 55% | 35% | -0.3 (8.2) | ~15% |
| Aggressive | 80-95% | 5% | 50% | 45% | -0.8 (7.7) | ~45% |
| Emergency | > 95% | 0% | 35% | 65% | -1.5 (7.0) | ~85% |

3.3 Budget Guardian Sequence Diagram

sequenceDiagram
    participant User as Customer
    participant Router as Model Router
    participant BG as Budget Guardian
    participant Redis as ElastiCache Redis
    participant CW as CloudWatch
    participant SNS as SNS / PagerDuty

    User->>Router: "Recommend manga like Attack on Titan"
    Router->>BG: check_budget(intent=recommendation)
    BG->>Redis: GET budget:daily:2026-03-31
    Redis-->>BG: $1,850.00 spent (74% of $2,500)

    BG-->>Router: mode=CAUTIOUS (recommendation stays Sonnet)
    Router->>Router: Route to Sonnet

    Note over BG,CW: Later that evening — manga sale event spikes traffic

    User->>Router: "What themes in Chainsaw Man?"
    Router->>BG: check_budget(intent=manga_qa)
    BG->>Redis: GET budget:daily:2026-03-31
    Redis-->>BG: $2,150.00 spent (86% of $2,500)

    BG-->>Router: mode=AGGRESSIVE (manga_qa downgraded to Haiku)
    BG->>CW: Emit budget_mode=aggressive metric
    BG->>SNS: Budget Warning Alert (86%)
    Router->>Router: Route to Haiku (downgraded)

    Note over BG,SNS: Budget hits 96%

    User->>Router: "Suggest something similar to Vinland Saga"
    Router->>BG: check_budget(intent=recommendation)
    BG->>Redis: GET budget:daily:2026-03-31
    Redis-->>BG: $2,400.00 spent (96% of $2,500)

    BG-->>Router: mode=EMERGENCY (recommendation→Haiku)
    BG->>SNS: PagerDuty CRITICAL Alert (96%)
    Router->>Router: Route to Haiku (emergency downgrade)

3.4 BudgetGuardian — Full Production Implementation

"""
MangaAssist Budget Guardian — Dynamic Model Downgrade System
Monitors daily FM spend and dynamically adjusts routing tiers.
"""

import time
import logging
from dataclasses import dataclass
from enum import Enum
from typing import Optional

import boto3
import redis

logger = logging.getLogger("mangaassist.budget_guardian")


class BudgetMode(Enum):
    NORMAL = "normal"
    CAUTIOUS = "cautious"
    AGGRESSIVE = "aggressive"
    EMERGENCY = "emergency"


@dataclass
class BudgetStatus:
    """Current budget state with full context."""
    mode: BudgetMode
    daily_budget: float
    spent_today: float
    remaining: float
    utilization_pct: float
    projected_eod_spend: float    # projected end-of-day spend
    projected_eod_pct: float
    minutes_until_budget_exhausted: Optional[float]


class BudgetGuardian:
    """
    Tracks daily FM spend in Redis and returns the current budget mode.
    Provides cost projection, alerting, and automatic downgrade triggers.
    """

    DAILY_BUDGET = 2_500.00  # $2,500/day baseline

    THRESHOLDS = {
        BudgetMode.NORMAL: 0.00,
        BudgetMode.CAUTIOUS: 0.60,
        BudgetMode.AGGRESSIVE: 0.80,
        BudgetMode.EMERGENCY: 0.95,
    }

    # Which intents get downgraded at each mode level
    DOWNGRADE_MAP = {
        BudgetMode.CAUTIOUS: {
            "manga_qa": "haiku",        # Sonnet -> Haiku
        },
        BudgetMode.AGGRESSIVE: {
            "manga_qa": "haiku",
            "product_search": "template",  # Haiku -> Template (simple searches)
        },
        BudgetMode.EMERGENCY: {
            "recommendation": "haiku",  # Even recommendation gets downgraded
            "manga_qa": "haiku",
            "product_search": "template",
            "shipping_info": "template",
        },
    }

    def __init__(
        self,
        redis_client: redis.Redis,
        cloudwatch_client=None,
        sns_client=None,
        daily_budget: Optional[float] = None,
    ):
        self.redis = redis_client
        self.cw = cloudwatch_client or boto3.client("cloudwatch", region_name="ap-northeast-1")
        self.sns = sns_client or boto3.client("sns", region_name="ap-northeast-1")
        self.daily_budget = daily_budget or self.DAILY_BUDGET
        self._alert_topic_arn = "arn:aws:sns:ap-northeast-1:123456789012:mangaassist-budget-alerts"
        self._critical_topic_arn = "arn:aws:sns:ap-northeast-1:123456789012:mangaassist-budget-critical"

    # ----- Core Budget Check -----

    def get_budget_status(self) -> BudgetStatus:
        """Full budget status with projections."""
        today = time.strftime("%Y-%m-%d")
        spent = self._get_daily_spend(today)
        remaining = max(self.daily_budget - spent, 0)
        utilization = spent / self.daily_budget

        # Project end-of-day spend based on current burn rate
        projected_eod, minutes_left = self._project_spend(today, spent)

        mode = self._determine_mode(utilization)

        return BudgetStatus(
            mode=mode,
            daily_budget=self.daily_budget,
            spent_today=spent,
            remaining=remaining,
            utilization_pct=utilization * 100,
            projected_eod_spend=projected_eod,
            projected_eod_pct=(projected_eod / self.daily_budget) * 100,
            minutes_until_budget_exhausted=minutes_left,
        )

    def get_budget_mode(self) -> BudgetMode:
        """Quick mode check for hot-path routing."""
        today = time.strftime("%Y-%m-%d")
        spent = self._get_daily_spend(today)
        utilization = spent / self.daily_budget
        return self._determine_mode(utilization)

    def get_model_override(self, intent: str, current_model: str) -> Optional[str]:
        """
        Check if the current budget mode requires a model downgrade.
        Returns None if no override needed, or the downgraded model name.
        """
        mode = self.get_budget_mode()

        if mode == BudgetMode.NORMAL:
            return None

        downgrades = self.DOWNGRADE_MAP.get(mode, {})
        override = downgrades.get(intent)

        if override and self._tier_rank(override) < self._tier_rank(current_model):
            logger.warning(
                "Budget %s: downgrading %s from %s to %s",
                mode.value, intent, current_model, override,
            )
            return override

        return None

    # ----- Cost Recording -----

    def record_cost(self, cost: float, intent: str, model: str) -> None:
        """Record an inference cost and emit metrics."""
        today = time.strftime("%Y-%m-%d")
        pipe = self.redis.pipeline()

        # Total daily spend
        total_key = f"budget:daily:{today}"
        pipe.incrbyfloat(total_key, cost)
        pipe.expire(total_key, 86400 * 2)

        # Per-intent spend
        intent_key = f"budget:daily:{today}:intent:{intent}"
        pipe.incrbyfloat(intent_key, cost)
        pipe.expire(intent_key, 86400 * 2)

        # Per-model spend
        model_key = f"budget:daily:{today}:model:{model}"
        pipe.incrbyfloat(model_key, cost)
        pipe.expire(model_key, 86400 * 2)

        # Request count for burn-rate projection
        count_key = f"budget:daily:{today}:count"
        pipe.incr(count_key)
        pipe.expire(count_key, 86400 * 2)

        # Timestamp of first request today (for projection)
        first_key = f"budget:daily:{today}:first_ts"
        pipe.setnx(first_key, str(time.time()))
        pipe.expire(first_key, 86400 * 2)

        pipe.execute()

        # Emit CloudWatch metrics
        self._emit_metrics(cost, intent, model)

        # Check for alert thresholds
        self._check_alerts(today)

    # ----- Spend Projection -----

    def _project_spend(self, today: str, current_spend: float) -> tuple[float, Optional[float]]:
        """
        Project end-of-day spend based on current burn rate.
        Returns (projected_eod_spend, minutes_until_exhaustion).
        """
        first_ts = self.redis.get(f"budget:daily:{today}:first_ts")
        count = self.redis.get(f"budget:daily:{today}:count")

        if not first_ts or not count:
            return current_spend, None

        elapsed_seconds = time.time() - float(first_ts)
        if elapsed_seconds < 60:
            return current_spend, None

        # Calculate burn rate ($/second)
        burn_rate = current_spend / elapsed_seconds

        # Seconds remaining in the day
        now = time.localtime()
        seconds_remaining = (
            (23 - now.tm_hour) * 3600
            + (59 - now.tm_min) * 60
            + (60 - now.tm_sec)
        )

        projected_eod = current_spend + (burn_rate * seconds_remaining)

        # Minutes until budget exhaustion
        remaining_budget = self.daily_budget - current_spend
        if burn_rate > 0 and remaining_budget > 0:
            minutes_left = (remaining_budget / burn_rate) / 60
        else:
            minutes_left = None

        return projected_eod, minutes_left

    # ----- Alerting -----

    def _check_alerts(self, today: str) -> None:
        """Send alerts when budget thresholds are crossed."""
        spent = self._get_daily_spend(today)
        pct = spent / self.daily_budget

        # Use Redis to track which alerts have been sent today
        if pct >= 0.80:
            alert_key = f"budget:alert:{today}:80"
            if not self.redis.exists(alert_key):
                self.redis.setex(alert_key, 86400, "sent")
                self._send_alert(
                    topic_arn=self._alert_topic_arn,
                    subject="MangaAssist Budget Warning (80%)",
                    message=(
                        f"Daily FM budget is {pct:.1%} consumed.\n"
                        f"Spent: ${spent:.2f} / ${self.daily_budget:.2f}\n"
                        f"Mode: AGGRESSIVE — non-critical Sonnet calls downgraded."
                    ),
                )

        if pct >= 0.95:
            alert_key = f"budget:alert:{today}:95"
            if not self.redis.exists(alert_key):
                self.redis.setex(alert_key, 86400, "sent")
                self._send_alert(
                    topic_arn=self._critical_topic_arn,
                    subject="MangaAssist Budget CRITICAL (95%)",
                    message=(
                        f"Daily FM budget is {pct:.1%} consumed.\n"
                        f"Spent: ${spent:.2f} / ${self.daily_budget:.2f}\n"
                        f"Mode: EMERGENCY — all queries on Haiku/Template.\n"
                        f"ACTION REQUIRED: Review traffic spike and adjust budget."
                    ),
                )

    def _send_alert(self, topic_arn: str, subject: str, message: str) -> None:
        try:
            self.sns.publish(
                TopicArn=topic_arn,
                Subject=subject,
                Message=message,
            )
            logger.info("Sent alert: %s", subject)
        except Exception as e:
            logger.error("Failed to send alert: %s", e)

    # ----- CloudWatch Metrics -----

    def _emit_metrics(self, cost: float, intent: str, model: str) -> None:
        """Emit cost metrics to CloudWatch for dashboarding."""
        try:
            self.cw.put_metric_data(
                Namespace="MangaAssist/FMCost",
                MetricData=[
                    {
                        "MetricName": "InferenceCost",
                        "Value": cost,
                        "Unit": "None",
                        "Dimensions": [
                            {"Name": "Intent", "Value": intent},
                            {"Name": "Model", "Value": model},
                        ],
                    },
                    {
                        "MetricName": "DailySpend",
                        "Value": self._get_daily_spend(time.strftime("%Y-%m-%d")),
                        "Unit": "None",
                    },
                ],
            )
        except Exception as e:
            logger.error("Failed to emit CloudWatch metrics: %s", e)

    # ----- Helpers -----

    def _get_daily_spend(self, today: str) -> float:
        val = self.redis.get(f"budget:daily:{today}")
        return float(val) if val else 0.0

    def _determine_mode(self, utilization: float) -> BudgetMode:
        if utilization >= 0.95:
            return BudgetMode.EMERGENCY
        elif utilization >= 0.80:
            return BudgetMode.AGGRESSIVE
        elif utilization >= 0.60:
            return BudgetMode.CAUTIOUS
        else:
            return BudgetMode.NORMAL

    @staticmethod
    def _tier_rank(model: str) -> int:
        return {"template": 0, "haiku": 1, "sonnet": 2}.get(model, 1)


# ---------------------------------------------------------------------------
# Cost Projection Calculator
# ---------------------------------------------------------------------------

class CostProjectionCalculator:
    """
    Projects monthly FM costs under different traffic and routing scenarios.
    Used for capacity planning and budget requests.
    """

    PRICING = {
        "sonnet": {"input": 3.00, "output": 15.00},
        "haiku": {"input": 0.25, "output": 1.25},
        "template": {"input": 0.00, "output": 0.00},
    }

    DEFAULT_TOKENS = {
        "sonnet": {"input": 900, "output": 400},
        "haiku": {"input": 450, "output": 130},
        "template": {"input": 0, "output": 0},
    }

    def project_monthly_cost(
        self,
        daily_messages: int,
        sonnet_pct: float,
        haiku_pct: float,
        template_pct: float,
        days: int = 30,
    ) -> dict:
        """
        Project monthly cost for a given traffic mix.
        Percentages should sum to 1.0.
        """
        results = {}

        for model, pct in [("sonnet", sonnet_pct), ("haiku", haiku_pct), ("template", template_pct)]:
            daily_vol = daily_messages * pct
            tokens = self.DEFAULT_TOKENS[model]
            pricing = self.PRICING[model]

            cost_per_req = (
                (tokens["input"] / 1_000_000) * pricing["input"]
                + (tokens["output"] / 1_000_000) * pricing["output"]
            )

            daily_cost = daily_vol * cost_per_req
            monthly_cost = daily_cost * days

            results[model] = {
                "daily_volume": int(daily_vol),
                "cost_per_request": cost_per_req,
                "daily_cost": daily_cost,
                "monthly_cost": monthly_cost,
            }

        total_daily = sum(r["daily_cost"] for r in results.values())
        total_monthly = sum(r["monthly_cost"] for r in results.values())

        results["total"] = {
            "daily_cost": total_daily,
            "monthly_cost": total_monthly,
            "avg_cost_per_message": total_daily / daily_messages if daily_messages > 0 else 0,
        }

        return results

    def compare_scenarios(self, daily_messages: int = 1_000_000) -> str:
        """Compare multiple routing scenarios."""
        scenarios = {
            "All Sonnet": (1.00, 0.00, 0.00),
            "All Haiku": (0.00, 1.00, 0.00),
            "Current Tiered (15/50/35)": (0.15, 0.50, 0.35),
            "Optimized Tiered (10/55/35)": (0.10, 0.55, 0.35),
            "Aggressive Savings (5/45/50)": (0.05, 0.45, 0.50),
            "Emergency (0/35/65)": (0.00, 0.35, 0.65),
        }

        lines = [
            "=" * 80,
            "MANGAASSIST COST PROJECTION — SCENARIO COMPARISON",
            f"Base: {daily_messages:,} messages/day, 30-day month",
            "=" * 80,
            "",
            f"{'Scenario':<35} {'Daily':>12} {'Monthly':>14} {'Savings':>10}",
            "-" * 80,
        ]

        baseline = None
        for name, (s, h, t) in scenarios.items():
            result = self.project_monthly_cost(daily_messages, s, h, t)
            monthly = result["total"]["monthly_cost"]
            daily = result["total"]["daily_cost"]

            if baseline is None:
                baseline = monthly
                savings = "Baseline"
            else:
                savings_pct = (1 - monthly / baseline) * 100
                savings = f"{savings_pct:.1f}%"

            lines.append(
                f"{name:<35} ${daily:>10,.2f} ${monthly:>12,.2f} {savings:>10}"
            )

        lines.append("-" * 80)
        return "\n".join(lines)


if __name__ == "__main__":
    calc = CostProjectionCalculator()
    print(calc.compare_scenarios())
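The burn-rate projection inside `_project_spend` is plain linear extrapolation; a standalone sketch of the same arithmetic with hypothetical numbers:

```python
from typing import Optional

# Linear burn-rate projection: extrapolate today's spend to end of day
# and estimate minutes until the budget is exhausted.
def project_spend(spent: float, elapsed_s: float, remaining_s: float,
                  daily_budget: float) -> tuple[float, Optional[float]]:
    burn_rate = spent / elapsed_s  # $/second so far today
    projected_eod = spent + burn_rate * remaining_s
    headroom = daily_budget - spent
    minutes_left = (headroom / burn_rate) / 60 if burn_rate > 0 and headroom > 0 else None
    return projected_eod, minutes_left

# Hypothetical: $1,850 spent by 18:00 JST (64,800 s elapsed, 21,600 s remaining)
eod, mins = project_spend(1850.0, 64_800, 21_600, 2500.0)
print(round(eod, 2), round(mins, 1))  # projected EOD spend, minutes of headroom
```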

4. A/B Testing Model Assignments

4.1 A/B Test Architecture

flowchart TD
    REQ[Incoming Request] --> SPLIT{A/B Split<br/>by session_id hash}

    SPLIT -->|Control: 90%| CONTROL["Control Group<br/>Standard routing<br/>(manga_qa → Sonnet)"]
    SPLIT -->|Treatment: 10%| TREATMENT["Treatment Group<br/>Experimental routing<br/>(manga_qa → Haiku)"]

    CONTROL --> BEDROCK_S[Bedrock Sonnet]
    TREATMENT --> BEDROCK_H[Bedrock Haiku]

    BEDROCK_S --> RESPONSE_C[Response + Metadata]
    BEDROCK_H --> RESPONSE_T[Response + Metadata]

    RESPONSE_C --> METRICS[Metrics Collector]
    RESPONSE_T --> METRICS

    METRICS --> QUALITY["Quality Metrics<br/>- LLM-as-judge score<br/>- User satisfaction<br/>- Response relevance"]
    METRICS --> COST_M["Cost Metrics<br/>- Tokens used<br/>- Cost per request<br/>- Budget impact"]
    METRICS --> LATENCY["Latency Metrics<br/>- TTFT<br/>- Total latency<br/>- Streaming time"]

    QUALITY & COST_M & LATENCY --> ANALYSIS["Statistical Analysis<br/>- t-test for means<br/>- Chi-square for satisfaction<br/>- Required: p < 0.05, n > 1,000"]

    ANALYSIS -->|Quality drop < 10%| PROMOTE["Promote Treatment<br/>Roll out to 100%"]
    ANALYSIS -->|Quality drop >= 10%| ROLLBACK["Rollback Treatment<br/>Keep Control routing"]

    style CONTROL fill:#3498db,color:#fff
    style TREATMENT fill:#f39c12,color:#fff
    style PROMOTE fill:#2ecc71,color:#fff
    style ROLLBACK fill:#e74c3c,color:#fff

4.2 A/B Test Configuration

"""
MangaAssist A/B Testing — Model Assignment Experiments
"""

import hashlib
import time
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("mangaassist.ab_test")


@dataclass
class ABTestConfig:
    """Configuration for a model routing A/B test."""
    test_id: str
    intent: str
    control_model: str        # e.g., "sonnet"
    treatment_model: str      # e.g., "haiku"
    treatment_pct: float      # 0.0 - 1.0 (e.g., 0.10 = 10%)
    start_time: float
    end_time: Optional[float] = None
    min_samples: int = 1_000
    max_quality_drop_pct: float = 10.0  # auto-rollback if quality drops > 10%


@dataclass
class ABTestResult:
    """Metrics for one group in an A/B test."""
    group: str               # "control" or "treatment"
    sample_count: int
    avg_quality_score: float
    avg_cost: float
    avg_latency_ms: float
    satisfaction_rate: float


class ABTestRouter:
    """
    Deterministic A/B test routing based on session_id hash.
    Ensures the same user always sees the same variant within a test.
    """

    def __init__(self, active_tests: list[ABTestConfig]):
        self.tests = {t.intent: t for t in active_tests}

    def get_variant(self, session_id: str, intent: str) -> Optional[str]:
        """
        Return the model to use, or None if no active test for this intent.
        Uses session_id hash for deterministic, repeatable assignment.
        """
        test = self.tests.get(intent)
        if not test:
            return None

        # Check if test is still active
        now = time.time()
        if now < test.start_time:
            return None
        if test.end_time and now > test.end_time:
            return None

        # Deterministic hash-based split
        hash_input = f"{test.test_id}:{session_id}"
        hash_value = int(hashlib.sha256(hash_input.encode()).hexdigest()[:8], 16)
        bucket = (hash_value % 10000) / 10000.0  # 0.0000 to 0.9999

        if bucket < test.treatment_pct:
            return test.treatment_model
        else:
            return test.control_model

    def should_auto_rollback(
        self, intent: str, control: ABTestResult, treatment: ABTestResult
    ) -> bool:
        """
        Check if the treatment should be auto-rolled back.
        Triggers if quality drops more than the configured threshold
        AND we have enough samples for statistical significance.
        """
        test = self.tests.get(intent)
        if not test:
            return False

        if treatment.sample_count < test.min_samples:
            return False  # Not enough data yet

        quality_drop_pct = (
            (control.avg_quality_score - treatment.avg_quality_score)
            / control.avg_quality_score
            * 100
        )

        if quality_drop_pct > test.max_quality_drop_pct:
            logger.warning(
                "A/B test %s: quality drop %.1f%% exceeds threshold %.1f%%. "
                "Auto-rolling back.",
                test.test_id, quality_drop_pct, test.max_quality_drop_pct,
            )
            return True

        return False


# Example: Testing manga_qa on Haiku instead of Sonnet
ACTIVE_TESTS = [
    ABTestConfig(
        test_id="exp-2026-03-manga-qa-haiku",
        intent="manga_qa",
        control_model="sonnet",
        treatment_model="haiku",
        treatment_pct=0.10,  # 10% of manga_qa traffic
        start_time=time.time(),
        min_samples=2_000,
        max_quality_drop_pct=15.0,  # Allow up to 15% quality drop
    ),
]
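The deterministic split in `get_variant` can be checked in isolation; this standalone sketch reimplements the same hashing and verifies both stability and the expected treatment share over synthetic session IDs:

```python
import hashlib

def bucket(test_id: str, session_id: str) -> float:
    """Map a session to [0, 1) deterministically, as in ABTestRouter."""
    h = int(hashlib.sha256(f"{test_id}:{session_id}".encode()).hexdigest()[:8], 16)
    return (h % 10000) / 10000.0

test_id = "exp-2026-03-manga-qa-haiku"
sessions = [f"session-{i}" for i in range(20_000)]

# Determinism: the same session always lands in the same bucket
assert bucket(test_id, "session-42") == bucket(test_id, "session-42")

# Proportionality: roughly 10% of sessions fall under treatment_pct = 0.10
treated = sum(1 for s in sessions if bucket(test_id, s) < 0.10)
print(f"{treated / len(sessions):.1%} in treatment")  # close to 10%
```

Hashing `test_id:session_id` (rather than `session_id` alone) means a user's assignment is stable within a test but re-randomized across tests, avoiding correlated cohorts between experiments.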

4.3 A/B Test Decision Criteria

| Metric | Acceptable Range | Auto-Rollback Trigger | Data Required |
|---|---|---|---|
| Quality score (LLM-as-judge) | < 10% drop | > 15% drop | n >= 1,000 |
| User satisfaction | < 5% drop | > 10% drop | n >= 2,000 |
| Cost reduction | > 50% savings | N/A (always cheaper) | N/A |
| Latency improvement | Any improvement | > 200 ms increase | n >= 500 |

5. Cross-Region Pricing Optimization

| Region | Sonnet Input/1M | Sonnet Output/1M | Haiku Input/1M | Haiku Output/1M | Latency from Tokyo |
|---|---|---|---|---|---|
| ap-northeast-1 (Tokyo) | $3.00 | $15.00 | $0.25 | $1.25 | 0 ms |
| us-east-1 (Virginia) | $3.00 | $15.00 | $0.25 | $1.25 | ~160 ms |
| us-west-2 (Oregon) | $3.00 | $15.00 | $0.25 | $1.25 | ~120 ms |
| eu-west-1 (Ireland) | $3.00 | $15.00 | $0.25 | $1.25 | ~230 ms |

MangaAssist decision: Bedrock pricing is uniform across regions for Claude 3. Since our users are in Japan, we use ap-northeast-1 exclusively to minimize latency. Cross-region arbitrage is not beneficial for this workload — latency cost far exceeds any potential savings.


6. Fallback Chains

6.1 Cascading Fallback Architecture

flowchart TD
    QUERY[Customer Query<br/>Intent: recommendation] --> PRIMARY["PRIMARY: Sonnet<br/>Attempt invocation"]

    PRIMARY -->|Success| RESPONSE[Return Response]
    PRIMARY -->|"Failure (throttle/timeout/5xx)"| FALLBACK1["FALLBACK 1: Haiku<br/>Attempt with simplified prompt"]

    FALLBACK1 -->|Success| RESPONSE
    FALLBACK1 -->|Failure| FALLBACK2["FALLBACK 2: Template<br/>Generic response + apology"]

    FALLBACK2 --> RESPONSE

    PRIMARY -->|Latency > 5s| TIMEOUT["Parallel: Start Haiku<br/>Hedged request — first response wins"]
    TIMEOUT --> RESPONSE

    style PRIMARY fill:#e74c3c,color:#fff
    style FALLBACK1 fill:#3498db,color:#fff
    style FALLBACK2 fill:#2ecc71,color:#fff
    style TIMEOUT fill:#f39c12,color:#fff

6.2 Fallback Chain Implementation

"""
MangaAssist Fallback Chain — Graceful Degradation on FM Failures
"""

import asyncio
import logging
import time
from dataclasses import dataclass
from typing import Optional

import boto3

logger = logging.getLogger("mangaassist.fallback")


@dataclass
class FallbackResult:
    """Result from fallback chain invocation."""
    response_text: str
    model_used: str
    tier_used: str          # "primary", "fallback_1", "fallback_2"
    latency_ms: float
    was_degraded: bool
    degradation_reason: Optional[str] = None


class FallbackChain:
    """
    Executes a cascade of model invocations with graceful degradation.
    Primary → Fallback 1 → Fallback 2 (template).
    """

    FALLBACK_MAP = {
        "sonnet": ["haiku", "template"],
        "haiku": ["template"],
        "template": [],
    }

    TEMPLATE_RESPONSES = {
        "recommendation": (
            "I apologize, but I'm currently unable to provide personalized manga "
            "recommendations. Please try again in a moment, or browse our curated "
            "collections at manga.example.com/collections."
        ),
        "manga_qa": (
            "I'm sorry, I'm temporarily unable to answer detailed manga questions. "
            "You can find manga information on our wiki at manga.example.com/wiki."
        ),
        "product_search": (
            "I'm having trouble searching right now. Please try using the search bar "
            "on our website, or contact support for assistance."
        ),
        "default": (
            "I apologize for the inconvenience. I'm experiencing temporary difficulties. "
            "Please try again shortly or contact our support team."
        ),
    }

    def __init__(self, bedrock_client=None, timeout_ms: int = 5_000):
        self.bedrock = bedrock_client or boto3.client(
            "bedrock-runtime", region_name="ap-northeast-1"
        )
        self.timeout_s = timeout_ms / 1000.0

    async def invoke_with_fallback(
        self,
        prompt: str,
        primary_model: str,
        intent: str,
    ) -> FallbackResult:
        """
        Invoke the primary model, falling back through the chain on failure.
        """
        chain = [primary_model] + self.FALLBACK_MAP.get(primary_model, [])
        tier_names = ["primary", "fallback_1", "fallback_2"]

        for i, model in enumerate(chain):
            tier = tier_names[min(i, len(tier_names) - 1)]

            if model == "template":
                return FallbackResult(
                    response_text=self.TEMPLATE_RESPONSES.get(
                        intent, self.TEMPLATE_RESPONSES["default"]
                    ),
                    model_used="template",
                    tier_used=tier,
                    latency_ms=1.0,
                    was_degraded=(i > 0),
                    degradation_reason=(
                        f"All FM models failed; serving template for {intent}"
                        if i > 0 else None
                    ),
                )

            try:
                start = time.monotonic()
                response = await asyncio.wait_for(
                    self._invoke_model(model, prompt),
                    timeout=self.timeout_s,
                )
                latency = (time.monotonic() - start) * 1000

                return FallbackResult(
                    response_text=response,
                    model_used=model,
                    tier_used=tier,
                    latency_ms=latency,
                    was_degraded=(i > 0),
                    degradation_reason=(
                        f"Degraded from {primary_model} to {model}" if i > 0 else None
                    ),
                )

            except asyncio.TimeoutError:
                logger.warning("Model %s timed out after %.1fs", model, self.timeout_s)
                continue

            except Exception as e:
                logger.error("Model %s failed: %s", model, str(e))
                continue

        # All models failed — should not reach here because template is always last
        return FallbackResult(
            response_text=self.TEMPLATE_RESPONSES["default"],
            model_used="template",
            tier_used="fallback_2",
            latency_ms=1.0,
            was_degraded=True,
            degradation_reason="All models and templates exhausted",
        )

    async def _invoke_model(self, model: str, prompt: str) -> str:
        """Invoke a Bedrock model. In production, use async Bedrock client."""
        model_ids = {
            "sonnet": "anthropic.claude-3-sonnet-20240229-v1:0",
            "haiku": "anthropic.claude-3-haiku-20240307-v1:0",
        }

        model_id = model_ids.get(model)
        if not model_id:
            raise ValueError(f"Unknown model: {model}")

        # In production, prefer an async Bedrock client (e.g. aioboto3);
        # here the blocking boto3 call is moved off the event loop instead.
        import json  # typically placed at module top

        request_body = json.dumps({
            # json.dumps escapes quotes/newlines, so arbitrary prompts are safe
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": prompt}],
        })

        loop = asyncio.get_running_loop()
        response = await loop.run_in_executor(
            None,
            lambda: self.bedrock.invoke_model(
                modelId=model_id,
                contentType="application/json",
                accept="application/json",
                body=request_body,
            ),
        )

        body = json.loads(response["body"].read())
        return body["content"][0]["text"]

### 6.3 Fallback Chain — Expected Behavior

| Scenario | Primary | Fallback 1 | Fallback 2 | User Impact |
|---|---|---|---|---|
| Normal operation | Sonnet succeeds | N/A | N/A | Full quality |
| Sonnet throttled | Sonnet 429 | Haiku succeeds | N/A | Slight quality drop |
| Sonnet + Haiku throttled | Sonnet 429 | Haiku 429 | Template | Canned response |
| Sonnet timeout (> 5s) | Timeout | Haiku succeeds | N/A | Faster, lower quality |
| All Bedrock down | Sonnet fail | Haiku fail | Template | Canned response |
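The ordering in this table falls out of the first line of `invoke_with_fallback`, which prepends the primary model to its fallback list. A minimal sketch of that step (the contents of `FALLBACK_MAP` are an assumption here — the class defines it outside this excerpt):

```python
# Assumed fallback map: each model degrades toward cheaper options,
# always ending at "template" so a response can be served no matter what.
FALLBACK_MAP = {
    "sonnet": ["haiku", "template"],
    "haiku": ["template"],
}

def build_chain(primary_model: str) -> list[str]:
    """Ordered candidates to try, mirroring invoke_with_fallback's chain."""
    return [primary_model] + FALLBACK_MAP.get(primary_model, [])

print(build_chain("sonnet"))  # ['sonnet', 'haiku', 'template']
print(build_chain("haiku"))   # ['haiku', 'template']
```

Because `"template"` terminates every chain, the "all models failed" branch at the end of `invoke_with_fallback` is effectively unreachable for these primaries.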

## 7. Cost Projection — Monthly Forecast

### 7.1 Current Baseline

```mermaid
pie title Monthly FM Cost Projection — Current Routing ($54,225/mo)
    "Sonnet (recommendation)" : 33300
    "Sonnet (manga_qa)" : 13050
    "Sonnet (complex product_search)" : 4995
    "Haiku (product_search)" : 2250
    "Haiku (shipping_info)" : 630
    "Template (all)" : 0
```

### 7.2 Scenario Projections

| Scenario | Daily Messages | Sonnet % | Haiku % | Template % | Monthly Cost | vs Baseline |
|---|---|---|---|---|---|---|
| All Sonnet | 1M | 100% | 0% | 0% | $229,500 | +323% |
| All Haiku | 1M | 0% | 100% | 0% | $5,063 | -91% |
| Current Tiered | 1M | 15% | 50% | 35% | $54,225 | Baseline |
| Optimized (after A/B tests) | 1M | 10% | 55% | 35% | $38,475 | -29% |
| 2x Growth | 2M | 15% | 50% | 35% | $108,450 | +100% |
| Sale Event (3x spike) | 3M | 15% | 50% | 35% | $162,675 | +200% |
| Sale + Budget Guardian | 3M | 5% | 50% | 45% | $56,250 | +4% |
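The flat-mix rows pin down per-message costs (All Sonnet: $229,500 / 30M messages ≈ $0.00765; All Haiku: $5,063 / 30M ≈ $0.00017). A sketch of the projection math using those back-derived figures — an assumption, since the tiered rows weight each intent's actual token counts and therefore only match approximately:

```python
# Per-message costs back-derived from the flat-mix scenario rows above.
# Illustrative only: tiered scenarios use intent-specific token profiles,
# so this helper reproduces the flat rows, not the tiered ones.
COST_PER_MSG = {"sonnet": 0.00765, "haiku": 0.00016875, "template": 0.0}

def monthly_cost(daily_messages: int, mix: dict[str, float], days: int = 30) -> float:
    """Project monthly FM spend for a tier mix (fractions summing to 1.0)."""
    blended = sum(share * COST_PER_MSG[tier] for tier, share in mix.items())
    return days * daily_messages * blended

all_sonnet = monthly_cost(1_000_000, {"sonnet": 1.0, "haiku": 0.0, "template": 0.0})
print(round(all_sonnet))  # 229500, matching the All Sonnet row
```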

Key takeaway: During a 3x traffic spike (sale event), the Budget Guardian keeps monthly costs nearly flat by dynamically shifting traffic from Sonnet to Haiku/Template. Without the guardian, costs would triple.
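The "nearly flat" claim is quick to verify from the table's own numbers:

```python
# Figures taken from the scenario table: Current Tiered baseline vs.
# a 3x sale spike with the Budget Guardian shifting the routing mix.
baseline = 54_225        # $/month, Current Tiered at 1M msgs/day
spike_guardian = 56_250  # $/month, 3M msgs/day with Guardian active
delta_pct = (spike_guardian - baseline) / baseline * 100
print(f"{delta_pct:+.1f}%")  # +3.7%, i.e. the +4% shown in the table
```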

### 7.3 Annual Budget Planning

| Quarter | Expected Daily Volume | Routing Strategy | Monthly Budget | Quarterly Budget |
|---|---|---|---|---|
| Q1 2026 | 1.0M | Current Tiered | $54,225 | $162,675 |
| Q2 2026 | 1.2M (spring sale) | Optimized + Guardian | $46,170 | $138,510 |
| Q3 2026 | 1.5M (growth) | Optimized + Guardian | $57,713 | $173,138 |
| Q4 2026 | 2.0M (holiday) | Aggressive + Guardian | $62,500 | $187,500 |
| **Annual Total** | | | | **$661,823** |
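The annual total is the four quarterly budgets summed (each quarterly figure is three times the monthly budget; Q3's $173,138 carries a $0.50 round-up from $57,712.50 x 3):

```python
# Quarterly budgets taken from the table rows above.
quarterly = {"Q1": 162_675, "Q2": 138_510, "Q3": 173_138, "Q4": 187_500}
annual = sum(quarterly.values())
print(f"${annual:,}")  # $661,823
```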

## 8. Key Takeaways

  1. 94.7% of FM cost comes from 15% of traffic (Sonnet calls). Optimizing Sonnet routing has the highest ROI.
  2. Budget Guardian prevents runaway costs during traffic spikes — keeping a 3x spike cost-neutral.
  3. A/B testing before routing changes prevents quality regressions — always test with statistical rigor (n >= 1,000, p < 0.05).
  4. Fallback chains provide graceful degradation — users always get a response, even during Bedrock outages.
  5. Cross-region arbitrage does not apply for MangaAssist (uniform Bedrock pricing) — minimize latency by staying in ap-northeast-1.
  6. Projected annual cost of $661,823 for 2026 is 76% less than a flat-volume all-Sonnet approach (~$2.75M/yr).
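For takeaway 3, the standard tool is a two-proportion z-test comparing, say, answer-acceptance rates between a Sonnet control arm and a Haiku treatment arm. A self-contained sketch (the counts below are made-up illustration, not MangaAssist data):

```python
from math import erf, sqrt

def two_proportion_p_value(x_a: int, n_a: int, x_b: int, n_b: int) -> float:
    """Two-sided p-value for H0: both arms have the same success rate.
    Assumes the pooled rate is strictly between 0 and 1."""
    p_a, p_b = x_a / n_a, x_b / n_b
    pooled = (x_a + x_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal CDF via erf

# 900/1000 accepted on Sonnet vs 850/1000 on Haiku: both arms clear n >= 1,000
p = two_proportion_p_value(900, 1_000, 850, 1_000)
print(p < 0.05)  # True: the quality drop is significant; do not roll out
```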

Previous: 01-model-selection-framework.md -- Model selection architecture, complexity classifier, routing maps.

Next: 03-scenarios-and-runbooks.md -- Five MangaAssist production scenarios with decision trees.