
02: Model Evaluation and Optimal Configuration

AIP-C01 Mapping

Content Domain 5: Testing, Validation, and Troubleshooting
Task 5.1: Implement evaluation systems for GenAI
Skill 5.1.2: Create systematic model evaluation systems to identify optimal configurations (for example, by using Amazon Bedrock Model Evaluations, A/B testing and canary testing of FMs, multi-model evaluation, and cost-performance analysis to measure token efficiency, latency-to-quality ratios, and business outcomes).


User Story

As a MangaAssist ML engineer, I want to systematically evaluate different FM configurations — model versions, parameter settings, and routing strategies — against cost, latency, and quality metrics, so that I can identify the optimal configuration for each intent type and confidently deploy model changes without degrading the customer experience.


Acceptance Criteria

  • Model evaluation compares at least two FM configurations side-by-side on the same golden dataset
  • A/B testing infrastructure routes a configurable percentage of live traffic to candidate models
  • Canary deployment validates new model versions on 5% traffic before full rollout
  • Cost-performance analysis measures token efficiency (cost per satisfactory response) per intent
  • Latency-to-quality ratio is tracked: each model configuration shows median latency vs. quality score
  • Business outcome metrics (cart adds, escalation reduction, session duration) are correlated with model choice
  • Evaluation results are stored in Redshift and surfaced in dashboards

The Model Configuration Problem

MangaAssist uses Amazon Bedrock with Claude 3.5 Sonnet as the primary LLM. But "use Sonnet for everything" is not the optimal strategy. Different intents have different complexity-cost profiles:

| Intent | Complexity | Token Budget | Quality Sensitivity | Cost Sensitivity |
|---|---|---|---|---|
| recommendation | High — open-ended generation | ~2,000 tokens | High — bad recs erode trust | Medium |
| product_question | Medium — factual lookup + formatting | ~800 tokens | High — wrong facts are visible | Medium |
| faq | Medium — RAG-grounded | ~1,200 tokens | Very high — policy accuracy critical | Medium |
| order_tracking | Low — structured API data | ~400 tokens | Very high — wrong order info is unacceptable | Low (can use template) |
| chitchat | Low — greeting/thanks | ~200 tokens | Low — wording flexibility | Very high (can skip LLM) |

The question is not "which model is best?" but "which model is best for each intent, given our latency SLA and cost budget?"
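The answer this section builds toward is a per-intent routing table. A minimal sketch of the shape of that output (the model assignments below are illustrative placeholders, not evaluated results):

```python
# Hypothetical intent → model routing table; in practice the assignments
# come out of the evaluation pipeline described below.
ROUTING_TABLE: dict[str, str] = {
    "recommendation": "anthropic.claude-3-5-sonnet-20241022-v2:0",
    "product_question": "anthropic.claude-3-5-sonnet-20241022-v2:0",
    "faq": "anthropic.claude-3-5-sonnet-20241022-v2:0",
    "order_tracking": "anthropic.claude-3-5-haiku-20241022-v1:0",
    "chitchat": "template-no-llm",  # placeholder: skip the LLM entirely
}

def model_for(intent: str) -> str:
    """Route an intent to a model, defaulting to the premium model."""
    return ROUTING_TABLE.get(intent, "anthropic.claude-3-5-sonnet-20241022-v2:0")
```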


High-Level Design

Multi-Model Evaluation Architecture

graph TD
    subgraph "Evaluation Orchestrator"
        A[Golden Dataset<br>200+ test cases per intent] --> B[Evaluation Runner]
        B --> C1[Model A: Claude 3.5 Sonnet]
        B --> C2[Model B: Claude 3.5 Haiku]
        B --> C3[Model C: Claude 3 Sonnet]
        B --> C4[Model D: Custom Fine-tuned<br>SageMaker Endpoint]
    end

    subgraph "Scoring Engine"
        C1 --> D[Quality Scorer<br>Skill 5.1.1 Framework]
        C2 --> D
        C3 --> D
        C4 --> D
        D --> E[Cost Calculator<br>Token counts × pricing]
        D --> F[Latency Recorder<br>P50, P95, P99]
    end

    subgraph "Analysis"
        E --> G[Cost-Quality Matrix]
        F --> G
        G --> H[Pareto Frontier<br>Optimal configs per intent]
        H --> I[Configuration Recommendation]
    end

    subgraph "Output"
        I --> J[Model Routing Table<br>Intent → Model mapping]
        I --> K[Comparison Report<br>Redshift + Dashboard]
    end

A/B Testing and Canary Architecture

graph LR
    subgraph "Traffic Router"
        A[Incoming Request] --> B{Feature Flag<br>Service}
        B -->|90% Control| C[Current Model<br>Claude 3.5 Sonnet v1]
        B -->|10% Treatment| D[Candidate Model<br>Claude 3.5 Sonnet v2]
    end

    subgraph "Response Path"
        C --> E[Quality Scorer]
        D --> E
        E --> F[Response to User]
    end

    subgraph "Metrics Collection"
        C --> G[Control Metrics<br>Kinesis Stream]
        D --> H[Treatment Metrics<br>Kinesis Stream]
        G --> I[A/B Analysis Engine<br>Redshift]
        H --> I
    end

    subgraph "Decision"
        I --> J{Statistical<br>Significance?}
        J -->|Yes, Treatment Better| K[Promote to 100%]
        J -->|Yes, Treatment Worse| L[Rollback to Control]
        J -->|Not Yet| M[Continue Experiment]
    end

Canary Deployment Flow

sequenceDiagram
    participant Deploy as Deployment Pipeline
    participant Router as Traffic Router
    participant Canary as Canary Model
    participant Prod as Production Model
    participant Monitor as Canary Monitor
    participant Alarm as CloudWatch Alarm

    Deploy->>Router: Deploy candidate at 5% traffic
    Router->>Canary: Route 5% requests
    Router->>Prod: Route 95% requests

    loop Every 5 minutes for 30 minutes
        Monitor->>Canary: Collect quality scores
        Monitor->>Prod: Collect baseline scores
        Monitor->>Monitor: Compare distributions

        alt Quality drop > 5%
            Monitor->>Alarm: Trigger rollback alarm
            Alarm->>Router: Revert to 100% production
            Alarm->>Deploy: Notify team
        else Quality stable or improved
            Monitor->>Monitor: Continue monitoring
        end
    end

    Monitor->>Router: Promote canary to 25%
    Note over Monitor,Router: Repeat monitoring at 25%, 50%, 100%

Cost-Performance Analysis Framework

graph TD
    subgraph "Per-Request Metrics"
        A[Input Tokens] --> D[Cost Calculator]
        B[Output Tokens] --> D
        C[Model Pricing Tier] --> D
        D --> E[Cost per Request]

        F[Quality Score] --> G[Efficiency Ratio]
        E --> G
    end

    subgraph "Aggregated Metrics"
        G --> H[Cost per Satisfactory Response<br>cost / quality_gate_pass_rate]
        G --> I[Token Efficiency<br>quality_score / total_tokens]
        G --> J[Latency-Quality Ratio<br>quality_score / P95_latency]
    end

    subgraph "Business Outcomes"
        K[Cart Add Rate after Recommendation] --> L[Business Impact Score]
        M[Escalation Rate Reduction] --> L
        N[Session Continuation Rate] --> L
        L --> O[ROI per Model Configuration]
    end
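The three aggregated ratios in the diagram reduce to simple arithmetic over per-request metrics; a sketch with illustrative numbers (all values hypothetical):

```python
# Illustrative per-intent aggregates (hypothetical values)
avg_cost_usd = 0.011      # average cost per request
pass_rate = 0.88          # fraction of responses passing the quality gate
avg_quality = 0.85        # mean composite quality score
avg_total_tokens = 2_500  # mean input + output tokens per request
p95_latency_ms = 2_100

cost_per_satisfactory = avg_cost_usd / pass_rate      # $ per good answer
token_efficiency = avg_quality / avg_total_tokens     # quality per token
latency_quality_ratio = avg_quality / p95_latency_ms  # quality per ms
```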

Low-Level Design

Model Configuration Data Model

from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
import time


class ModelProvider(Enum):
    BEDROCK_SONNET_35_V1 = "anthropic.claude-3-5-sonnet-20240620-v1:0"
    BEDROCK_SONNET_35_V2 = "anthropic.claude-3-5-sonnet-20241022-v2:0"
    BEDROCK_HAIKU_35 = "anthropic.claude-3-5-haiku-20241022-v1:0"
    BEDROCK_SONNET_3 = "anthropic.claude-3-sonnet-20240229-v1:0"
    SAGEMAKER_CUSTOM = "sagemaker-custom-endpoint"


@dataclass
class ModelConfiguration:
    """A specific model + parameter combination to evaluate."""
    config_id: str
    model_id: ModelProvider
    temperature: float = 0.1
    max_tokens: int = 1024
    top_p: float = 0.9
    system_prompt_version: str = "v1.0"
    description: str = ""


@dataclass
class ModelPricing:
    """Pricing per 1K tokens for cost calculation."""
    model_id: ModelProvider
    input_cost_per_1k: float   # USD per 1K input tokens
    output_cost_per_1k: float  # USD per 1K output tokens


# Current Bedrock pricing (as of design time)
PRICING_TABLE: dict[ModelProvider, ModelPricing] = {
    ModelProvider.BEDROCK_SONNET_35_V2: ModelPricing(
        model_id=ModelProvider.BEDROCK_SONNET_35_V2,
        input_cost_per_1k=0.003,
        output_cost_per_1k=0.015,
    ),
    ModelProvider.BEDROCK_HAIKU_35: ModelPricing(
        model_id=ModelProvider.BEDROCK_HAIKU_35,
        input_cost_per_1k=0.001,
        output_cost_per_1k=0.005,
    ),
    ModelProvider.BEDROCK_SONNET_3: ModelPricing(
        model_id=ModelProvider.BEDROCK_SONNET_3,
        input_cost_per_1k=0.003,
        output_cost_per_1k=0.015,
    ),
}


@dataclass
class ModelEvaluationResult:
    """Result of evaluating one model configuration on one test case."""
    config_id: str
    test_id: str
    intent: str
    quality_score: float               # Composite from Skill 5.1.1 framework
    dimension_scores: dict[str, float]
    input_tokens: int
    output_tokens: int
    cost_usd: float
    latency_ms: float
    response_text: str
    passed_quality_gate: bool
    timestamp: float = field(default_factory=time.time)


@dataclass
class ModelComparisonReport:
    """Comparison of multiple model configurations across all test cases."""
    comparison_id: str
    configurations: list[str]          # config_ids compared
    intent_results: dict[str, dict]    # intent -> {config_id -> aggregated metrics}
    pareto_optimal: dict[str, str]     # intent -> recommended config_id
    total_cost_estimate: dict[str, float]  # config_id -> monthly cost projection
    recommendation: str                # Human-readable recommendation
    timestamp: float = field(default_factory=time.time)
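
As a worked check of the pricing arithmetic: a recommendation-style request on Sonnet 3.5 v2 at the rates above (1,800 input and 450 output tokens, $0.003/$0.015 per 1K):

```python
# Cost of one request: tokens / 1000 × per-1K rate, with input and output
# priced separately (Sonnet 3.5 v2 rates from the pricing table)
input_tokens, output_tokens = 1_800, 450
cost = (input_tokens / 1000) * 0.003 + (output_tokens / 1000) * 0.015
# 0.0054 + 0.00675 = 0.01215 USD
```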

Multi-Model Evaluation Runner

import json
import logging
import time
import uuid
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Optional

import boto3

logger = logging.getLogger(__name__)


class MultiModelEvaluator:
    """Evaluates multiple FM configurations against the same golden dataset.

    For MangaAssist, this answers: should we use Sonnet for recommendations
    and Haiku for chitchat? Or is Sonnet v2 worth the cost over v1?
    """

    def __init__(self, quality_evaluator=None):
        self.bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
        self.quality_evaluator = quality_evaluator  # FMOutputQualityEvaluator from Skill 5.1.1

    def evaluate_configurations(
        self,
        configurations: list[ModelConfiguration],
        test_cases: list,  # EvaluationTestCase from Skill 5.1.1
        max_parallel: int = 3,
    ) -> ModelComparisonReport:
        """Run all test cases against all model configurations and compare."""
        all_results: dict[str, list[ModelEvaluationResult]] = {}

        for config in configurations:
            logger.info("Evaluating configuration: %s (%s)", config.config_id, config.model_id.value)

            # Evaluate test cases concurrently, bounded by max_parallel to
            # stay under Bedrock invocation rate limits
            with ThreadPoolExecutor(max_workers=max_parallel) as pool:
                futures = [
                    pool.submit(self._evaluate_single, config, tc)
                    for tc in test_cases
                ]
                config_results = [f.result() for f in as_completed(futures)]

            all_results[config.config_id] = config_results

        return self._build_comparison_report(configurations, all_results)

    def _evaluate_single(
        self, config: ModelConfiguration, test_case
    ) -> ModelEvaluationResult:
        """Evaluate one model configuration on one test case."""
        start_time = time.time()

        # Generate response from the specific model
        response_text, input_tokens, output_tokens = self._invoke_model(config, test_case)
        latency_ms = (time.time() - start_time) * 1000

        # Calculate cost
        pricing = PRICING_TABLE.get(config.model_id)
        cost = 0.0
        if pricing:
            cost = (
                (input_tokens / 1000) * pricing.input_cost_per_1k
                + (output_tokens / 1000) * pricing.output_cost_per_1k
            )

        # Score quality using Skill 5.1.1 framework
        quality_score = 0.0
        dimension_scores = {}
        passed = False
        if self.quality_evaluator:
            eval_result = self.quality_evaluator.evaluate(test_case, response_text)
            quality_score = eval_result.composite_score
            dimension_scores = {
                ds.dimension.value: ds.score for ds in eval_result.dimension_scores
            }
            passed = eval_result.passed

        return ModelEvaluationResult(
            config_id=config.config_id,
            test_id=test_case.test_id,
            intent=test_case.intent.value,
            quality_score=quality_score,
            dimension_scores=dimension_scores,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            cost_usd=cost,
            latency_ms=latency_ms,
            response_text=response_text,
            passed_quality_gate=passed,
        )

    def _invoke_model(self, config: ModelConfiguration, test_case) -> tuple[str, int, int]:
        """Invoke the specified model and return (response_text, input_tokens, output_tokens)."""
        prompt = self._build_prompt(config, test_case)

        try:
            body = json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": config.max_tokens,
                "temperature": config.temperature,
                "top_p": config.top_p,
                "messages": [{"role": "user", "content": prompt}],
            })

            response = self.bedrock.invoke_model(
                modelId=config.model_id.value, body=body
            )
            resp_body = json.loads(response["body"].read())
            text = resp_body["content"][0]["text"]
            input_tokens = resp_body["usage"]["input_tokens"]
            output_tokens = resp_body["usage"]["output_tokens"]
            return text, input_tokens, output_tokens

        except Exception as e:
            logger.error("Model invocation failed for %s: %s", config.config_id, e)
            return f"[ERROR: {e}]", 0, 0

    def _build_prompt(self, config: ModelConfiguration, test_case) -> str:
        """Build the full prompt for the test case using the config's system prompt version."""
        return f"""You are MangaAssist, a helpful shopping assistant for the JP Manga store on Amazon.com.

User query: {test_case.query}
Intent: {test_case.intent.value}
Page context: {json.dumps(test_case.page_context)}

Conversation history:
{json.dumps(test_case.conversation_history)}

Respond helpfully and concisely."""

    def _build_comparison_report(
        self,
        configurations: list[ModelConfiguration],
        all_results: dict[str, list[ModelEvaluationResult]],
    ) -> ModelComparisonReport:
        """Aggregate results and identify Pareto-optimal configurations per intent."""
        intent_results: dict[str, dict] = {}
        total_costs: dict[str, float] = {}

        for config_id, results in all_results.items():
            total_costs[config_id] = sum(r.cost_usd for r in results)

            for result in results:
                intent = result.intent
                if intent not in intent_results:
                    intent_results[intent] = {}
                if config_id not in intent_results[intent]:
                    intent_results[intent][config_id] = {
                        "avg_quality": 0.0,
                        "avg_cost": 0.0,
                        "avg_latency": 0.0,
                        "pass_rate": 0.0,
                        "token_efficiency": 0.0,
                        "count": 0,
                        "quality_scores": [],
                        "costs": [],
                        "latencies": [],
                        "passes": [],
                        "tokens": [],
                    }
                bucket = intent_results[intent][config_id]
                bucket["count"] += 1
                bucket["quality_scores"].append(result.quality_score)
                bucket["costs"].append(result.cost_usd)
                bucket["latencies"].append(result.latency_ms)
                bucket["passes"].append(1 if result.passed_quality_gate else 0)
                bucket["tokens"].append(result.input_tokens + result.output_tokens)

        # Compute averages
        for intent, configs in intent_results.items():
            for config_id, bucket in configs.items():
                n = bucket["count"]
                bucket["avg_quality"] = sum(bucket["quality_scores"]) / n
                bucket["avg_cost"] = sum(bucket["costs"]) / n
                bucket["avg_latency"] = sum(bucket["latencies"]) / n
                bucket["pass_rate"] = sum(bucket["passes"]) / n
                # Use actual token counts rather than back-deriving them from
                # cost, which would assume a single model's pricing
                avg_tokens = sum(bucket["tokens"]) / n
                bucket["token_efficiency"] = bucket["avg_quality"] / max(avg_tokens, 1)
                # Clean up raw observation lists
                del bucket["quality_scores"], bucket["costs"]
                del bucket["latencies"], bucket["passes"], bucket["tokens"]

        # Identify Pareto-optimal config per intent
        pareto_optimal = {}
        for intent, configs in intent_results.items():
            best_config = None
            best_score = -1
            for config_id, metrics in configs.items():
                # Selection rule: highest quality among configs meeting the 80%
                # pass-rate gate (a simplification of a full Pareto frontier
                # over quality, cost, and latency)
                if metrics["pass_rate"] >= 0.80 and metrics["avg_quality"] > best_score:
                    best_score = metrics["avg_quality"]
                    best_config = config_id
            if best_config is None:
                # Fallback: pick highest quality regardless of pass rate
                best_config = max(configs.keys(), key=lambda c: configs[c]["avg_quality"])
            pareto_optimal[intent] = best_config

        # Monthly cost projection (assuming 1M messages/day, intent distribution from architecture)
        intent_distribution = {
            "recommendation": 0.25, "product_question": 0.20, "faq": 0.15,
            "order_tracking": 0.15, "promotion": 0.10, "checkout_help": 0.05,
            "chitchat": 0.08, "return_request": 0.02,
        }
        messages_per_day = 1_000_000
        monthly_estimates = {}
        for config_id in all_results:
            monthly_cost = 0.0
            for intent, share in intent_distribution.items():
                if intent in intent_results and config_id in intent_results[intent]:
                    avg_cost = intent_results[intent][config_id]["avg_cost"]
                    monthly_cost += avg_cost * messages_per_day * share * 30
            monthly_estimates[config_id] = monthly_cost

        recommendation = self._generate_recommendation(
            intent_results, pareto_optimal, monthly_estimates
        )

        return ModelComparisonReport(
            comparison_id=str(uuid.uuid4()),
            configurations=[c.config_id for c in configurations],
            intent_results=intent_results,
            pareto_optimal=pareto_optimal,
            total_cost_estimate=monthly_estimates,
            recommendation=recommendation,
        )

    def _generate_recommendation(
        self,
        intent_results: dict,
        pareto_optimal: dict,
        monthly_estimates: dict,
    ) -> str:
        """Generate a human-readable recommendation based on evaluation results."""
        lines = ["## Model Configuration Recommendation\n"]
        lines.append("Based on evaluation across all intents:\n")

        for intent, config_id in pareto_optimal.items():
            metrics = intent_results[intent][config_id]
            lines.append(
                f"- **{intent}**: Use `{config_id}` "
                f"(quality={metrics['avg_quality']:.2f}, "
                f"pass_rate={metrics['pass_rate']:.0%}, "
                f"avg_latency={metrics['avg_latency']:.0f}ms)"
            )

        lines.append("\n### Monthly Cost Estimates:")
        for config_id, cost in sorted(monthly_estimates.items(), key=lambda x: x[1]):
            lines.append(f"- `{config_id}`: ${cost:,.0f}/month")

        return "\n".join(lines)

A/B Test Manager

import hashlib


class ABTestManager:
    """Manages A/B tests for model configurations in MangaAssist.

    Routes traffic between control and treatment models, collects metrics,
    and determines statistical significance.
    """

    @dataclass
    class ABTestConfig:
        experiment_id: str
        control_config: ModelConfiguration
        treatment_config: ModelConfiguration
        traffic_split: float = 0.10       # 10% to treatment
        min_samples: int = 1000           # Minimum samples per arm before analysis
        significance_level: float = 0.05  # p-value threshold
        min_effect_size: float = 0.02     # Minimum quality delta to declare winner
        max_duration_hours: int = 168     # 7 days max

    def __init__(self, dynamodb_table: str = "manga_ab_tests"):
        self.dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
        self.table = self.dynamodb.Table(dynamodb_table)

    def route_request(self, experiment_id: str, session_id: str) -> ModelConfiguration:
        """Deterministic routing: same session always goes to the same arm."""
        # Use a stable hash for assignment: the built-in hash() is salted
        # per process (PYTHONHASHSEED), so it would reshuffle arms on restart
        digest = hashlib.sha256(f"{experiment_id}:{session_id}".encode()).hexdigest()
        hash_val = int(digest, 16) % 100
        experiment = self._get_experiment(experiment_id)

        if hash_val < experiment["traffic_split"] * 100:
            return experiment["treatment_config"]
        return experiment["control_config"]

    def record_outcome(
        self,
        experiment_id: str,
        session_id: str,
        arm: str,
        quality_score: float,
        latency_ms: float,
        cost_usd: float,
        business_outcome: Optional[dict] = None,
    ) -> None:
        """Record a single observation for the experiment."""
        self.table.put_item(Item={
            "pk": f"EXP#{experiment_id}",
            "sk": f"OBS#{session_id}#{int(time.time())}",
            "arm": arm,
            "quality_score": str(quality_score),
            "latency_ms": str(latency_ms),
            "cost_usd": str(cost_usd),
            "business_outcome": business_outcome or {},
            "timestamp": int(time.time()),
        })

    def analyze_experiment(self, experiment_id: str) -> dict:
        """Analyze the experiment and return significance results."""
        # A single Query call returns at most 1 MB of items, so paginate
        # through all observations for this experiment
        items = []
        query_kwargs = {
            "KeyConditionExpression": "pk = :pk AND begins_with(sk, :prefix)",
            "ExpressionAttributeValues": {
                ":pk": f"EXP#{experiment_id}",
                ":prefix": "OBS#",
            },
        }
        while True:
            response = self.table.query(**query_kwargs)
            items.extend(response.get("Items", []))
            if "LastEvaluatedKey" not in response:
                break
            query_kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]

        control_scores = []
        treatment_scores = []
        control_costs = []
        treatment_costs = []

        for item in items:
            score = float(item["quality_score"])
            cost = float(item["cost_usd"])
            if item["arm"] == "control":
                control_scores.append(score)
                control_costs.append(cost)
            else:
                treatment_scores.append(score)
                treatment_costs.append(cost)

        if len(control_scores) < 30 or len(treatment_scores) < 30:
            return {"status": "insufficient_data", "control_n": len(control_scores),
                    "treatment_n": len(treatment_scores)}

        # Two-sample t-test for quality scores
        from statistics import mean, stdev
        import math

        control_mean = mean(control_scores)
        treatment_mean = mean(treatment_scores)
        control_std = stdev(control_scores) if len(control_scores) > 1 else 0
        treatment_std = stdev(treatment_scores) if len(treatment_scores) > 1 else 0

        # Welch's t-test
        n1, n2 = len(control_scores), len(treatment_scores)
        se = math.sqrt((control_std ** 2 / n1) + (treatment_std ** 2 / n2)) if (control_std + treatment_std) > 0 else 1
        t_stat = (treatment_mean - control_mean) / se if se > 0 else 0

        effect_size = treatment_mean - control_mean
        cost_delta = mean(treatment_costs) - mean(control_costs)

        return {
            "status": "analyzed",
            "control": {"n": n1, "mean_quality": control_mean, "mean_cost": mean(control_costs)},
            "treatment": {"n": n2, "mean_quality": treatment_mean, "mean_cost": mean(treatment_costs)},
            "effect_size": effect_size,
            "t_statistic": t_stat,
            "cost_delta_per_request": cost_delta,
            "recommendation": self._recommend(effect_size, t_stat, cost_delta),
        }

    def _recommend(self, effect_size: float, t_stat: float, cost_delta: float) -> str:
        """Generate recommendation based on statistical analysis."""
        if abs(t_stat) < 1.96:  # Not significant at α = 0.05 (normal approximation; reasonable for n >= 30 per arm)
            return "NO_DECISION — Continue experiment (not yet statistically significant)"
        if effect_size > 0.02:
            if cost_delta <= 0:
                return "PROMOTE_TREATMENT — Better quality at same or lower cost"
            elif cost_delta < 0.001:
                return "PROMOTE_TREATMENT — Better quality, marginal cost increase"
            else:
                return f"REVIEW — Better quality (+{effect_size:.3f}) but higher cost (+${cost_delta:.4f}/req)"
        elif effect_size < -0.02:
            return "ROLLBACK — Treatment is worse than control"
        else:
            if cost_delta < -0.0005:
                return "PROMOTE_TREATMENT — Similar quality at lower cost"
            return "NO_DECISION — No meaningful difference"

    def _get_experiment(self, experiment_id: str) -> dict:
        """Retrieve experiment configuration from DynamoDB."""
        response = self.table.get_item(Key={"pk": f"EXP#{experiment_id}", "sk": "CONFIG"})
        return response.get("Item", {})
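
The `_recommend` gate above approximates significance with a fixed |t| > 1.96 cutoff. If an explicit p-value is wanted for reporting, a normal-approximation version (reasonable at these sample sizes) can be sketched with nothing beyond the standard library:

```python
import math
from statistics import mean, stdev

def welch_p_value(control: list[float], treatment: list[float]) -> float:
    """Two-sided p-value for the difference in means (Welch's t-statistic,
    normal approximation — fine when both arms have hundreds of samples)."""
    n1, n2 = len(control), len(treatment)
    se = math.sqrt(stdev(control) ** 2 / n1 + stdev(treatment) ** 2 / n2)
    t = (mean(treatment) - mean(control)) / se
    # P(|Z| > |t|) for a standard normal, via the error function
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(t) / math.sqrt(2.0))))
```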

Canary Deployment Controller

class CanaryDeploymentController:
    """Controls staged rollout of new model configurations.

    MangaAssist canary flow:
    5% → monitor 30 min → 25% → monitor 30 min → 50% → monitor 30 min → 100%

    At each stage, if quality drops more than 5% vs. baseline, auto-rollback.
    """

    STAGES = [
        {"name": "canary_5pct", "traffic_pct": 5, "monitor_minutes": 30},
        {"name": "canary_25pct", "traffic_pct": 25, "monitor_minutes": 30},
        {"name": "canary_50pct", "traffic_pct": 50, "monitor_minutes": 30},
        {"name": "full_rollout", "traffic_pct": 100, "monitor_minutes": 60},
    ]

    def __init__(self, cloudwatch_namespace: str = "MangaAssist/Canary"):
        self.cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
        self.namespace = cloudwatch_namespace

    def should_promote(
        self,
        canary_scores: list[float],
        baseline_scores: list[float],
        max_degradation_pct: float = 5.0,
    ) -> tuple[bool, str]:
        """Decide whether to promote the canary to the next stage."""
        if len(canary_scores) < 10:
            return False, "Insufficient canary samples (need >= 10)"

        from statistics import mean
        canary_mean = mean(canary_scores)
        baseline_mean = mean(baseline_scores) if baseline_scores else 0.85

        degradation_pct = ((baseline_mean - canary_mean) / baseline_mean) * 100

        if degradation_pct > max_degradation_pct:
            return False, (
                f"Quality degradation {degradation_pct:.1f}% exceeds threshold "
                f"{max_degradation_pct}% (canary={canary_mean:.3f}, baseline={baseline_mean:.3f})"
            )

        if canary_mean < 0.70:  # Absolute floor — no model below 0.70 composite
            return False, f"Canary quality {canary_mean:.3f} below absolute floor (0.70)"

        return True, (
            f"Canary quality {canary_mean:.3f} within tolerance "
            f"(baseline={baseline_mean:.3f}, degradation={degradation_pct:.1f}%)"
        )

    def publish_canary_metrics(
        self,
        stage: str,
        canary_quality: float,
        baseline_quality: float,
        canary_latency_ms: float,
    ) -> None:
        """Publish canary vs. baseline metrics to CloudWatch."""
        self.cloudwatch.put_metric_data(
            Namespace=self.namespace,
            MetricData=[
                {
                    "MetricName": "CanaryQuality",
                    "Dimensions": [{"Name": "Stage", "Value": stage}],
                    "Value": canary_quality,
                    "Unit": "None",
                },
                {
                    "MetricName": "BaselineQuality",
                    "Dimensions": [{"Name": "Stage", "Value": stage}],
                    "Value": baseline_quality,
                    "Unit": "None",
                },
                {
                    "MetricName": "QualityDelta",
                    "Dimensions": [{"Name": "Stage", "Value": stage}],
                    "Value": canary_quality - baseline_quality,
                    "Unit": "None",
                },
                {
                    "MetricName": "CanaryLatency",
                    "Dimensions": [{"Name": "Stage", "Value": stage}],
                    "Value": canary_latency_ms,
                    "Unit": "Milliseconds",
                },
            ],
        )
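
To see the promotion rule behave, the gate logic can be exercised in isolation (this mirrors `should_promote` without the CloudWatch dependency, so the sketch runs standalone):

```python
from statistics import mean

def promote_gate(canary_scores, baseline_scores, max_degradation_pct=5.0):
    """Standalone mirror of CanaryDeploymentController.should_promote."""
    if len(canary_scores) < 10:
        return False, "insufficient canary samples"
    canary_mean = mean(canary_scores)
    baseline_mean = mean(baseline_scores) if baseline_scores else 0.85
    degradation_pct = (baseline_mean - canary_mean) / baseline_mean * 100
    if degradation_pct > max_degradation_pct:
        return False, f"degradation {degradation_pct:.1f}% exceeds threshold"
    if canary_mean < 0.70:
        return False, f"canary {canary_mean:.3f} below absolute floor"
    return True, "within tolerance"

# A 1.2% quality dip at healthy absolute quality passes the gate;
# an 8%+ dip does not.
ok_small_dip, _ = promote_gate([0.84] * 20, [0.85] * 200)
ok_big_dip, _ = promote_gate([0.78] * 20, [0.85] * 200)
```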

MangaAssist Scenarios

Scenario A: Sonnet vs. Haiku for Recommendation Intent

Context: The team hypothesized that Claude 3.5 Haiku could handle recommendation queries at 60% lower cost while maintaining acceptable quality. Both models were evaluated on 200 recommendation test cases.

Results:

| Metric | Sonnet 3.5 v2 | Haiku 3.5 | Delta |
|---|---|---|---|
| Avg Quality Score | 0.84 | 0.72 | -0.12 |
| Pass Rate (≥0.75) | 91% | 58% | -33% |
| Avg Cost/Request | $0.0081 | $0.0027 | -$0.0054 |
| P95 Latency | 2,100ms | 890ms | -1,210ms |
| Relevance Sub-Score | 0.86 | 0.68 | -0.18 |
| Factual Accuracy | 0.82 | 0.79 | -0.03 |

Decision: Haiku failed the quality gate for recommendations (58% pass rate vs. the required 80%). Haiku's relevance sub-score was the culprit — it generated generic suggestions rather than personalized ones drawing on the user's browsing history. Haiku was approved for chitchat and order_tracking (both passed their quality gates), saving roughly $3,000/month across those two intents.

Cost Savings from Model Tiering:

  • chitchat (8% of traffic): Sonnet → Haiku saves ~$1,200/month
  • order_tracking (15% of traffic): Sonnet → Haiku saves ~$1,800/month (most responses use templates anyway)
  • Total: $3,000/month savings without quality degradation

Scenario B: A/B Test for Claude 3.5 Sonnet v1 → v2 Upgrade

Context: Anthropic released Claude 3.5 Sonnet v2. The team ran a 7-day A/B test with 10% traffic on v2.

What Happened: after 3 days (45,000 samples per arm), the A/B analysis showed:

  • v2 quality: 0.86 mean (vs. v1: 0.84) — effect size +0.02
  • v2 cost: $0.0083/req (vs. v1: $0.0081) — marginal increase (+2.5%)
  • v2 latency: 1,950ms P95 (vs. v1: 2,100ms) — 7% faster
  • t-statistic: 3.2 (statistically significant at p < 0.01)

Decision: PROMOTE_TREATMENT — better quality and lower latency with marginal cost increase. Promoted to canary at 5% → 25% → 50% → 100% over 2 hours.

Business Outcome Correlation:

  • Cart add rate after recommendation: +1.8% with v2 (statistically significant)
  • Escalation rate: -3.2% with v2 (approaching significance, p=0.08)

Scenario C: Token Efficiency Analysis Across Intents

Context: Monthly LLM cost was $315,000. The team ran a cost-performance analysis to identify where token budgets were over-allocated.

Findings:

| Intent | Avg Input Tokens | Avg Output Tokens | Cost/Req | Quality | Token Efficiency |
|---|---|---|---|---|---|
| recommendation | 1,800 | 450 | $0.012 | 0.84 | 0.37 |
| faq | 2,200 | 300 | $0.011 | 0.88 | 0.27 |
| product_question | 1,400 | 200 | $0.007 | 0.82 | 0.51 |
| order_tracking | 800 | 150 | $0.004 | 0.91 | 0.95 |
| promotion | 1,200 | 250 | $0.007 | 0.81 | 0.47 |

Key Insight: FAQ intent had the worst token efficiency (0.27) because it retrieved 5 RAG chunks when 3 were sufficient. Reducing to top-3 chunks saved 800 input tokens per request with only a 0.01 quality drop (0.88 → 0.87). At Sonnet input pricing ($0.003 per 1K tokens) that is $0.0024 per request; at 150K FAQ requests/day, roughly $10,800/month in savings.

Scenario D: Canary Catches Latency Regression During Model Upgrade

Context: A new model version was deployed via canary at 5% traffic. Quality scores looked fine, but the canary monitor caught a latency regression.

What Happened:

  • Canary quality: 0.85 (baseline: 0.84) — within tolerance
  • Canary P95 latency: 4,200ms (baseline: 2,100ms) — a 100% increase
  • The canary monitor's latency check triggered a rollback alert even though quality was acceptable

Root Cause: The new model version had a cold-start issue on Bedrock — the first few hundred invocations were 3-4x slower due to model loading. After warmup, latency normalized to 2,000ms.

Fix: Added a warmup step to the canary deployment: send 500 synthetic requests before routing live traffic. Re-deployed with warmup — P95 latency was 2,050ms, and the canary progressed through all stages.


Intuition Gained

Model Selection Is Intent-Dependent

The most expensive model is not the best choice for every intent. MangaAssist's model tiering strategy saves 15-20% on LLM costs by routing simple intents (chitchat, order_tracking) to cheaper models while reserving the premium model for complex intents (recommendation, faq). The evaluation system makes this tiering decision data-driven rather than gut-based.

A/B Tests Need Business Outcomes, Not Just Quality Scores

Quality scores tell you whether the model is technically better. Business outcomes (cart adds, escalation rate, session duration) tell you whether customers care. A model that scores 0.02 higher on quality but does not move business metrics is not worth the additional cost. Always instrument A/B tests with both technical and business metrics.

Canary Deployments Must Check Latency, Not Just Quality

Model upgrades can pass quality gates but fail latency requirements. The MangaAssist 3-second SLA means a model that produces better responses in 5 seconds is worse than a model that produces good-enough responses in 2 seconds. Canary monitors should check quality, latency, error rate, and cost — not just one dimension.


References