02: Model Evaluation and Optimal Configuration
AIP-C01 Mapping
Content Domain 5: Testing, Validation, and Troubleshooting
Task 5.1: Implement evaluation systems for GenAI
Skill 5.1.2: Create systematic model evaluation systems to identify optimal configurations (for example, by using Amazon Bedrock Model Evaluations, A/B testing and canary testing of FMs, multi-model evaluation, cost-performance analysis to measure token efficiency, latency-to-quality ratios, and business outcomes).
User Story
As a MangaAssist ML engineer, I want to systematically evaluate different FM configurations (model versions, parameter settings, and routing strategies) against cost, latency, and quality metrics, so that I can identify the optimal configuration for each intent type and confidently deploy model changes without degrading the customer experience.
Acceptance Criteria
- Model evaluation compares at least two FM configurations side-by-side on the same golden dataset
- A/B testing infrastructure routes a configurable percentage of live traffic to candidate models
- Canary deployment validates new model versions on 5% traffic before full rollout
- Cost-performance analysis measures token efficiency (cost per satisfactory response) per intent
- Latency-to-quality ratio is tracked: each model configuration shows median latency vs. quality score
- Business outcome metrics (cart adds, escalation reduction, session duration) are correlated with model choice
- Evaluation results are stored in Redshift and surfaced in dashboards
The Model Configuration Problem
MangaAssist uses Amazon Bedrock with Claude 3.5 Sonnet as the primary LLM. But "use Sonnet for everything" is not the optimal strategy. Different intents have different complexity-cost profiles:
| Intent | Complexity | Token Budget | Quality Sensitivity | Cost Sensitivity |
|---|---|---|---|---|
| recommendation | High — open-ended generation | ~2,000 tokens | High — bad recs erode trust | Medium |
| product_question | Medium — factual lookup + formatting | ~800 tokens | High — wrong facts are visible | Medium |
| faq | Medium — RAG-grounded | ~1,200 tokens | Very high — policy accuracy critical | Medium |
| order_tracking | Low — structured API data | ~400 tokens | Very high — wrong order info is unacceptable | Low (can use template) |
| chitchat | Low — greeting/thanks | ~200 tokens | Low — wording flexibility | Very high (can skip LLM) |
The question is not "which model is best?" but "which model is best for each intent, given our latency SLA and cost budget?"
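The end state is a per-intent routing table. A minimal sketch of what the router would consume, using the Bedrock model identifiers defined later in this section (the assignments shown are illustrative, not the evaluated result):

# Hypothetical intent -> model routing table; the evaluation system below
# produces and maintains this mapping from measured cost/quality/latency data.
INTENT_MODEL_ROUTING: dict[str, str] = {
    "recommendation": "anthropic.claude-3-5-sonnet-20241022-v2:0",
    "product_question": "anthropic.claude-3-5-sonnet-20241022-v2:0",
    "faq": "anthropic.claude-3-5-sonnet-20241022-v2:0",
    "order_tracking": "anthropic.claude-3-5-haiku-20241022-v1:0",  # or a template, no LLM
    "chitchat": "anthropic.claude-3-5-haiku-20241022-v1:0",        # or a canned response
}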
High-Level Design
Multi-Model Evaluation Architecture
graph TD
subgraph "Evaluation Orchestrator"
A[Golden Dataset<br>200+ test cases per intent] --> B[Evaluation Runner]
B --> C1[Model A: Claude 3.5 Sonnet]
B --> C2[Model B: Claude 3.5 Haiku]
B --> C3[Model C: Claude 3 Sonnet]
B --> C4[Model D: Custom Fine-tuned<br>SageMaker Endpoint]
end
subgraph "Scoring Engine"
C1 --> D[Quality Scorer<br>Skill 5.1.1 Framework]
C2 --> D
C3 --> D
C4 --> D
D --> E[Cost Calculator<br>Token counts × pricing]
D --> F[Latency Recorder<br>P50, P95, P99]
end
subgraph "Analysis"
E --> G[Cost-Quality Matrix]
F --> G
G --> H[Pareto Frontier<br>Optimal configs per intent]
H --> I[Configuration Recommendation]
end
subgraph "Output"
I --> J[Model Routing Table<br>Intent → Model mapping]
I --> K[Comparison Report<br>Redshift + Dashboard]
end
A/B Testing and Canary Architecture
graph LR
subgraph "Traffic Router"
A[Incoming Request] --> B{Feature Flag<br>Service}
B -->|90% Control| C[Current Model<br>Claude 3.5 Sonnet v1]
B -->|10% Treatment| D[Candidate Model<br>Claude 3.5 Sonnet v2]
end
subgraph "Response Path"
C --> E[Quality Scorer]
D --> E
E --> F[Response to User]
end
subgraph "Metrics Collection"
C --> G[Control Metrics<br>Kinesis Stream]
D --> H[Treatment Metrics<br>Kinesis Stream]
G --> I[A/B Analysis Engine<br>Redshift]
H --> I
end
subgraph "Decision"
I --> J{Statistical<br>Significance?}
J -->|Yes, Treatment Better| K[Promote to 100%]
J -->|Yes, Treatment Worse| L[Rollback to Control]
J -->|Not Yet| M[Continue Experiment]
end
Canary Deployment Flow
sequenceDiagram
participant Deploy as Deployment Pipeline
participant Router as Traffic Router
participant Canary as Canary Model
participant Prod as Production Model
participant Monitor as Canary Monitor
participant Alarm as CloudWatch Alarm
Deploy->>Router: Deploy candidate at 5% traffic
Router->>Canary: Route 5% requests
Router->>Prod: Route 95% requests
loop Every 5 minutes for 30 minutes
Monitor->>Canary: Collect quality scores
Monitor->>Prod: Collect baseline scores
Monitor->>Monitor: Compare distributions
alt Quality drop > 5%
Monitor->>Alarm: Trigger rollback alarm
Alarm->>Router: Revert to 100% production
Alarm->>Deploy: Notify team
else Quality stable or improved
Monitor->>Monitor: Continue monitoring
end
end
Monitor->>Router: Promote canary to 25%
Note over Monitor,Router: Repeat monitoring at 25%, 50%, 100%
Cost-Performance Analysis Framework
graph TD
subgraph "Per-Request Metrics"
A[Input Tokens] --> D[Cost Calculator]
B[Output Tokens] --> D
C[Model Pricing Tier] --> D
D --> E[Cost per Request]
F[Quality Score] --> G[Efficiency Ratio]
E --> G
end
subgraph "Aggregated Metrics"
G --> H[Cost per Satisfactory Response<br>cost / quality_gate_pass_rate]
G --> I[Token Efficiency<br>quality_score / total_tokens]
G --> J[Latency-Quality Ratio<br>quality_score / P95_latency]
end
subgraph "Business Outcomes"
K[Cart Add Rate after Recommendation] --> L[Business Impact Score]
M[Escalation Rate Reduction] --> L
N[Session Continuation Rate] --> L
L --> O[ROI per Model Configuration]
end
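The three aggregate ratios in the middle subgraph reduce to simple formulas. A minimal sketch as pure functions (function names are illustrative; the production pipeline computes these in Redshift):

def cost_per_satisfactory_response(avg_cost_usd: float, pass_rate: float) -> float:
    """Cost of a request divided by the quality-gate pass rate."""
    return avg_cost_usd / max(pass_rate, 1e-6)

def token_efficiency(avg_quality: float, avg_total_tokens: float) -> float:
    """Quality points delivered per token consumed; higher is better."""
    return avg_quality / max(avg_total_tokens, 1.0)

def latency_quality_ratio(avg_quality: float, p95_latency_ms: float) -> float:
    """Quality per millisecond of P95 latency; higher is better."""
    return avg_quality / max(p95_latency_ms, 1.0)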
Low-Level Design
Model Configuration Data Model
from dataclasses import dataclass, field
from enum import Enum
import time
class ModelProvider(Enum):
BEDROCK_SONNET_35_V1 = "anthropic.claude-3-5-sonnet-20240620-v1:0"
BEDROCK_SONNET_35_V2 = "anthropic.claude-3-5-sonnet-20241022-v2:0"
BEDROCK_HAIKU_35 = "anthropic.claude-3-5-haiku-20241022-v1:0"
BEDROCK_SONNET_3 = "anthropic.claude-3-sonnet-20240229-v1:0"
SAGEMAKER_CUSTOM = "sagemaker-custom-endpoint"
@dataclass
class ModelConfiguration:
"""A specific model + parameter combination to evaluate."""
config_id: str
model_id: ModelProvider
temperature: float = 0.1
max_tokens: int = 1024
top_p: float = 0.9
system_prompt_version: str = "v1.0"
description: str = ""
@dataclass
class ModelPricing:
"""Pricing per 1K tokens for cost calculation."""
model_id: ModelProvider
input_cost_per_1k: float # USD per 1K input tokens
output_cost_per_1k: float # USD per 1K output tokens
# Current Bedrock pricing (as of design time)
PRICING_TABLE: dict[ModelProvider, ModelPricing] = {
ModelProvider.BEDROCK_SONNET_35_V2: ModelPricing(
model_id=ModelProvider.BEDROCK_SONNET_35_V2,
input_cost_per_1k=0.003,
output_cost_per_1k=0.015,
),
ModelProvider.BEDROCK_HAIKU_35: ModelPricing(
model_id=ModelProvider.BEDROCK_HAIKU_35,
input_cost_per_1k=0.001,
output_cost_per_1k=0.005,
),
ModelProvider.BEDROCK_SONNET_3: ModelPricing(
model_id=ModelProvider.BEDROCK_SONNET_3,
input_cost_per_1k=0.003,
output_cost_per_1k=0.015,
),
}
@dataclass
class ModelEvaluationResult:
"""Result of evaluating one model configuration on one test case."""
config_id: str
test_id: str
intent: str
quality_score: float # Composite from Skill 5.1.1 framework
dimension_scores: dict[str, float]
input_tokens: int
output_tokens: int
cost_usd: float
latency_ms: float
response_text: str
passed_quality_gate: bool
timestamp: float = field(default_factory=time.time)
@dataclass
class ModelComparisonReport:
"""Comparison of multiple model configurations across all test cases."""
comparison_id: str
configurations: list[str] # config_ids compared
intent_results: dict[str, dict] # intent -> {config_id -> aggregated metrics}
pareto_optimal: dict[str, str] # intent -> recommended config_id
total_cost_estimate: dict[str, float] # config_id -> monthly cost projection
recommendation: str # Human-readable recommendation
timestamp: float = field(default_factory=time.time)
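A short usage sketch defining two candidate configurations for the runner below (the config IDs and parameter values are illustrative):

sonnet_config = ModelConfiguration(
    config_id="sonnet-35-v2-t01",
    model_id=ModelProvider.BEDROCK_SONNET_35_V2,
    temperature=0.1,
    max_tokens=1024,
    description="Current production candidate",
)
haiku_config = ModelConfiguration(
    config_id="haiku-35-t01",
    model_id=ModelProvider.BEDROCK_HAIKU_35,
    temperature=0.1,
    max_tokens=512,
    description="Low-cost candidate for simple intents",
)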
Multi-Model Evaluation Runner
import json
import logging
import time
import uuid
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Optional
import boto3
logger = logging.getLogger(__name__)
class MultiModelEvaluator:
"""Evaluates multiple FM configurations against the same golden dataset.
For MangaAssist, this answers: should we use Sonnet for recommendations
and Haiku for chitchat? Or is Sonnet v2 worth the cost over v1?
"""
def __init__(self, quality_evaluator=None):
self.bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
self.quality_evaluator = quality_evaluator # FMOutputQualityEvaluator from Skill 5.1.1
def evaluate_configurations(
self,
configurations: list[ModelConfiguration],
test_cases: list, # EvaluationTestCase from Skill 5.1.1
max_parallel: int = 3,
) -> ModelComparisonReport:
"""Run all test cases against all model configurations and compare."""
        all_results: dict[str, list[ModelEvaluationResult]] = {}
        for config in configurations:
            logger.info("Evaluating configuration: %s (%s)", config.config_id, config.model_id.value)
            # Fan test cases out across a small thread pool; Bedrock throttles
            # aggressively, so keep max_parallel low (default 3).
            with ThreadPoolExecutor(max_workers=max_parallel) as pool:
                futures = [pool.submit(self._evaluate_single, config, tc) for tc in test_cases]
                config_results = [f.result() for f in as_completed(futures)]
            all_results[config.config_id] = config_results
return self._build_comparison_report(configurations, all_results)
def _evaluate_single(
self, config: ModelConfiguration, test_case
) -> ModelEvaluationResult:
"""Evaluate one model configuration on one test case."""
start_time = time.time()
# Generate response from the specific model
response_text, input_tokens, output_tokens = self._invoke_model(config, test_case)
latency_ms = (time.time() - start_time) * 1000
# Calculate cost
pricing = PRICING_TABLE.get(config.model_id)
cost = 0.0
if pricing:
cost = (
(input_tokens / 1000) * pricing.input_cost_per_1k
+ (output_tokens / 1000) * pricing.output_cost_per_1k
)
# Score quality using Skill 5.1.1 framework
quality_score = 0.0
dimension_scores = {}
passed = False
if self.quality_evaluator:
eval_result = self.quality_evaluator.evaluate(test_case, response_text)
quality_score = eval_result.composite_score
dimension_scores = {
ds.dimension.value: ds.score for ds in eval_result.dimension_scores
}
passed = eval_result.passed
return ModelEvaluationResult(
config_id=config.config_id,
test_id=test_case.test_id,
intent=test_case.intent.value,
quality_score=quality_score,
dimension_scores=dimension_scores,
input_tokens=input_tokens,
output_tokens=output_tokens,
cost_usd=cost,
latency_ms=latency_ms,
response_text=response_text,
passed_quality_gate=passed,
)
def _invoke_model(self, config: ModelConfiguration, test_case) -> tuple[str, int, int]:
"""Invoke the specified model and return (response_text, input_tokens, output_tokens)."""
prompt = self._build_prompt(config, test_case)
try:
body = json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": config.max_tokens,
"temperature": config.temperature,
"top_p": config.top_p,
"messages": [{"role": "user", "content": prompt}],
})
response = self.bedrock.invoke_model(
modelId=config.model_id.value, body=body
)
resp_body = json.loads(response["body"].read())
text = resp_body["content"][0]["text"]
input_tokens = resp_body["usage"]["input_tokens"]
output_tokens = resp_body["usage"]["output_tokens"]
return text, input_tokens, output_tokens
except Exception as e:
logger.error("Model invocation failed for %s: %s", config.config_id, e)
return f"[ERROR: {e}]", 0, 0
def _build_prompt(self, config: ModelConfiguration, test_case) -> str:
"""Build the full prompt for the test case using the config's system prompt version."""
return f"""You are MangaAssist, a helpful shopping assistant for the JP Manga store on Amazon.com.
User query: {test_case.query}
Intent: {test_case.intent.value}
Page context: {json.dumps(test_case.page_context)}
Conversation history:
{json.dumps(test_case.conversation_history)}
Respond helpfully and concisely."""
def _build_comparison_report(
self,
configurations: list[ModelConfiguration],
all_results: dict[str, list[ModelEvaluationResult]],
) -> ModelComparisonReport:
"""Aggregate results and identify Pareto-optimal configurations per intent."""
intent_results: dict[str, dict] = {}
total_costs: dict[str, float] = {}
for config_id, results in all_results.items():
total_costs[config_id] = sum(r.cost_usd for r in results)
for result in results:
intent = result.intent
if intent not in intent_results:
intent_results[intent] = {}
if config_id not in intent_results[intent]:
                    intent_results[intent][config_id] = {
                        "avg_quality": 0.0,
                        "avg_cost": 0.0,
                        "avg_latency": 0.0,
                        "pass_rate": 0.0,
                        "token_efficiency": 0.0,
                        "count": 0,
                        "quality_scores": [],
                        "costs": [],
                        "latencies": [],
                        "passes": [],
                        "tokens": [],
                    }
                bucket = intent_results[intent][config_id]
                bucket["count"] += 1
                bucket["quality_scores"].append(result.quality_score)
                bucket["costs"].append(result.cost_usd)
                bucket["latencies"].append(result.latency_ms)
                bucket["passes"].append(1 if result.passed_quality_gate else 0)
                bucket["tokens"].append(result.input_tokens + result.output_tokens)
# Compute averages
for intent, configs in intent_results.items():
for config_id, bucket in configs.items():
n = bucket["count"]
bucket["avg_quality"] = sum(bucket["quality_scores"]) / n
bucket["avg_cost"] = sum(bucket["costs"]) / n
bucket["avg_latency"] = sum(bucket["latencies"]) / n
bucket["pass_rate"] = sum(bucket["passes"]) / n
                # Use measured token counts rather than back-deriving them from
                # cost, which would bake one model's pricing into every config.
                avg_tokens = sum(bucket["tokens"]) / n
                bucket["token_efficiency"] = bucket["avg_quality"] / max(avg_tokens, 1)
                # Clean up raw lists
                del bucket["quality_scores"], bucket["costs"]
                del bucket["latencies"], bucket["passes"], bucket["tokens"]
# Identify Pareto-optimal config per intent
pareto_optimal = {}
for intent, configs in intent_results.items():
best_config = None
best_score = -1
for config_id, metrics in configs.items():
                # Selection rule (simplified Pareto): highest average quality among
                # configs that meet the >= 80% quality-gate pass rate
if metrics["pass_rate"] >= 0.80 and metrics["avg_quality"] > best_score:
best_score = metrics["avg_quality"]
best_config = config_id
if best_config is None:
# Fallback: pick highest quality regardless of pass rate
best_config = max(configs.keys(), key=lambda c: configs[c]["avg_quality"])
pareto_optimal[intent] = best_config
# Monthly cost projection (assuming 1M messages/day, intent distribution from architecture)
intent_distribution = {
"recommendation": 0.25, "product_question": 0.20, "faq": 0.15,
"order_tracking": 0.15, "promotion": 0.10, "checkout_help": 0.05,
"chitchat": 0.08, "return_request": 0.02,
}
messages_per_day = 1_000_000
monthly_estimates = {}
for config_id in all_results:
monthly_cost = 0.0
for intent, share in intent_distribution.items():
if intent in intent_results and config_id in intent_results[intent]:
avg_cost = intent_results[intent][config_id]["avg_cost"]
monthly_cost += avg_cost * messages_per_day * share * 30
monthly_estimates[config_id] = monthly_cost
recommendation = self._generate_recommendation(
intent_results, pareto_optimal, monthly_estimates
)
return ModelComparisonReport(
comparison_id=str(uuid.uuid4()),
configurations=[c.config_id for c in configurations],
intent_results=intent_results,
pareto_optimal=pareto_optimal,
total_cost_estimate=monthly_estimates,
recommendation=recommendation,
)
def _generate_recommendation(
self,
intent_results: dict,
pareto_optimal: dict,
monthly_estimates: dict,
) -> str:
"""Generate a human-readable recommendation based on evaluation results."""
lines = ["## Model Configuration Recommendation\n"]
lines.append("Based on evaluation across all intents:\n")
for intent, config_id in pareto_optimal.items():
metrics = intent_results[intent][config_id]
lines.append(
f"- **{intent}**: Use `{config_id}` "
f"(quality={metrics['avg_quality']:.2f}, "
f"pass_rate={metrics['pass_rate']:.0%}, "
f"avg_latency={metrics['avg_latency']:.0f}ms)"
)
lines.append("\n### Monthly Cost Estimates:")
for config_id, cost in sorted(monthly_estimates.items(), key=lambda x: x[1]):
lines.append(f"- `{config_id}`: ${cost:,.0f}/month")
return "\n".join(lines)
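A hedged usage sketch; `golden_test_cases` and `quality_evaluator` are assumed to come from the Skill 5.1.1 framework, and the configs are the ones sketched above:

evaluator = MultiModelEvaluator(quality_evaluator=quality_evaluator)
report = evaluator.evaluate_configurations(
    configurations=[sonnet_config, haiku_config],
    test_cases=golden_test_cases,
)
print(report.recommendation)        # human-readable summary
print(report.pareto_optimal)        # e.g. {"chitchat": "haiku-35-t01", ...}
print(report.total_cost_estimate)   # monthly projection per config_id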
A/B Test Manager
class ABTestManager:
"""Manages A/B tests for model configurations in MangaAssist.
Routes traffic between control and treatment models, collects metrics,
and determines statistical significance.
"""
@dataclass
class ABTestConfig:
experiment_id: str
control_config: ModelConfiguration
treatment_config: ModelConfiguration
traffic_split: float = 0.10 # 10% to treatment
min_samples: int = 1000 # Minimum samples per arm before analysis
significance_level: float = 0.05 # p-value threshold
min_effect_size: float = 0.02 # Minimum quality delta to declare winner
max_duration_hours: int = 168 # 7 days max
def __init__(self, dynamodb_table: str = "manga_ab_tests"):
self.dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
self.table = self.dynamodb.Table(dynamodb_table)
    def route_request(self, experiment_id: str, session_id: str) -> dict:
        """Deterministic routing: same session always goes to the same arm.

        Returns the stored control or treatment config item for this session.
        """
        import hashlib  # stdlib; local import keeps the sketch self-contained
        # Built-in hash() is salted per process (PYTHONHASHSEED), so it would
        # reassign sessions on every worker restart. Use a stable digest.
        digest = hashlib.sha256(f"{experiment_id}:{session_id}".encode()).hexdigest()
        hash_val = int(digest, 16) % 100
        experiment = self._get_experiment(experiment_id)
        if hash_val < float(experiment["traffic_split"]) * 100:
            return experiment["treatment_config"]
        return experiment["control_config"]
def record_outcome(
self,
experiment_id: str,
session_id: str,
arm: str,
quality_score: float,
latency_ms: float,
cost_usd: float,
business_outcome: Optional[dict] = None,
) -> None:
"""Record a single observation for the experiment."""
self.table.put_item(Item={
"pk": f"EXP#{experiment_id}",
"sk": f"OBS#{session_id}#{int(time.time())}",
"arm": arm,
"quality_score": str(quality_score),
"latency_ms": str(latency_ms),
"cost_usd": str(cost_usd),
"business_outcome": business_outcome or {},
"timestamp": int(time.time()),
})
def analyze_experiment(self, experiment_id: str) -> dict:
"""Analyze the experiment and return significance results."""
# Query all observations for this experiment
response = self.table.query(
KeyConditionExpression="pk = :pk AND begins_with(sk, :prefix)",
ExpressionAttributeValues={
":pk": f"EXP#{experiment_id}",
":prefix": "OBS#",
},
)
control_scores = []
treatment_scores = []
control_costs = []
treatment_costs = []
for item in response.get("Items", []):
score = float(item["quality_score"])
cost = float(item["cost_usd"])
if item["arm"] == "control":
control_scores.append(score)
control_costs.append(cost)
else:
treatment_scores.append(score)
treatment_costs.append(cost)
if len(control_scores) < 30 or len(treatment_scores) < 30:
return {"status": "insufficient_data", "control_n": len(control_scores),
"treatment_n": len(treatment_scores)}
# Two-sample t-test for quality scores
from statistics import mean, stdev
import math
control_mean = mean(control_scores)
treatment_mean = mean(treatment_scores)
control_std = stdev(control_scores) if len(control_scores) > 1 else 0
treatment_std = stdev(treatment_scores) if len(treatment_scores) > 1 else 0
# Welch's t-test
n1, n2 = len(control_scores), len(treatment_scores)
se = math.sqrt((control_std ** 2 / n1) + (treatment_std ** 2 / n2)) if (control_std + treatment_std) > 0 else 1
t_stat = (treatment_mean - control_mean) / se if se > 0 else 0
effect_size = treatment_mean - control_mean
cost_delta = mean(treatment_costs) - mean(control_costs)
return {
"status": "analyzed",
"control": {"n": n1, "mean_quality": control_mean, "mean_cost": mean(control_costs)},
"treatment": {"n": n2, "mean_quality": treatment_mean, "mean_cost": mean(treatment_costs)},
"effect_size": effect_size,
"t_statistic": t_stat,
"cost_delta_per_request": cost_delta,
"recommendation": self._recommend(effect_size, t_stat, cost_delta),
}
def _recommend(self, effect_size: float, t_stat: float, cost_delta: float) -> str:
"""Generate recommendation based on statistical analysis."""
        if abs(t_stat) < 1.96:  # Not significant at 0.05 (normal approximation; fine at these sample sizes)
return "NO_DECISION — Continue experiment (not yet statistically significant)"
if effect_size > 0.02:
if cost_delta <= 0:
return "PROMOTE_TREATMENT — Better quality at same or lower cost"
elif cost_delta < 0.001:
return "PROMOTE_TREATMENT — Better quality, marginal cost increase"
else:
return f"REVIEW — Better quality (+{effect_size:.3f}) but higher cost (+${cost_delta:.4f}/req)"
elif effect_size < -0.02:
return "ROLLBACK — Treatment is worse than control"
else:
if cost_delta < -0.0005:
return "PROMOTE_TREATMENT — Similar quality at lower cost"
return "NO_DECISION — No meaningful difference"
def _get_experiment(self, experiment_id: str) -> dict:
"""Retrieve experiment configuration from DynamoDB."""
response = self.table.get_item(Key={"pk": f"EXP#{experiment_id}", "sk": "CONFIG"})
return response.get("Item", {})
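A usage sketch for one request/response cycle (the experiment ID and metric values are illustrative):

ab = ABTestManager()
chosen = ab.route_request("exp-sonnet-v2", session_id="sess-123")
# ... invoke the chosen configuration and score the response (Skill 5.1.1) ...
ab.record_outcome(
    experiment_id="exp-sonnet-v2",
    session_id="sess-123",
    arm="treatment",  # derived from which config route_request returned
    quality_score=0.86,
    latency_ms=1900.0,
    cost_usd=0.0082,
    business_outcome={"cart_add": True, "escalated": False},
)
print(ab.analyze_experiment("exp-sonnet-v2")["recommendation"])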
Canary Deployment Controller
class CanaryDeploymentController:
"""Controls staged rollout of new model configurations.
MangaAssist canary flow:
5% → monitor 30 min → 25% → monitor 30 min → 50% → monitor 30 min → 100%
At each stage, if quality drops more than 5% vs. baseline, auto-rollback.
"""
STAGES = [
{"name": "canary_5pct", "traffic_pct": 5, "monitor_minutes": 30},
{"name": "canary_25pct", "traffic_pct": 25, "monitor_minutes": 30},
{"name": "canary_50pct", "traffic_pct": 50, "monitor_minutes": 30},
{"name": "full_rollout", "traffic_pct": 100, "monitor_minutes": 60},
]
def __init__(self, cloudwatch_namespace: str = "MangaAssist/Canary"):
self.cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
self.namespace = cloudwatch_namespace
def should_promote(
self,
canary_scores: list[float],
baseline_scores: list[float],
max_degradation_pct: float = 5.0,
) -> tuple[bool, str]:
"""Decide whether to promote the canary to the next stage."""
if len(canary_scores) < 10:
return False, "Insufficient canary samples (need >= 10)"
from statistics import mean
canary_mean = mean(canary_scores)
        baseline_mean = mean(baseline_scores) if baseline_scores else 0.85  # default when no baseline samples yet
degradation_pct = ((baseline_mean - canary_mean) / baseline_mean) * 100
if degradation_pct > max_degradation_pct:
return False, (
f"Quality degradation {degradation_pct:.1f}% exceeds threshold "
f"{max_degradation_pct}% (canary={canary_mean:.3f}, baseline={baseline_mean:.3f})"
)
if canary_mean < 0.70: # Absolute floor — no model below 0.70 composite
return False, f"Canary quality {canary_mean:.3f} below absolute floor (0.70)"
return True, (
f"Canary quality {canary_mean:.3f} within tolerance "
f"(baseline={baseline_mean:.3f}, degradation={degradation_pct:.1f}%)"
)
def publish_canary_metrics(
self,
stage: str,
canary_quality: float,
baseline_quality: float,
canary_latency_ms: float,
) -> None:
"""Publish canary vs. baseline metrics to CloudWatch."""
self.cloudwatch.put_metric_data(
Namespace=self.namespace,
MetricData=[
{
"MetricName": "CanaryQuality",
"Dimensions": [{"Name": "Stage", "Value": stage}],
"Value": canary_quality,
"Unit": "None",
},
{
"MetricName": "BaselineQuality",
"Dimensions": [{"Name": "Stage", "Value": stage}],
"Value": baseline_quality,
"Unit": "None",
},
{
"MetricName": "QualityDelta",
"Dimensions": [{"Name": "Stage", "Value": stage}],
"Value": canary_quality - baseline_quality,
"Unit": "None",
},
{
"MetricName": "CanaryLatency",
"Dimensions": [{"Name": "Stage", "Value": stage}],
"Value": canary_latency_ms,
"Unit": "Milliseconds",
},
],
)
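A minimal driver loop tying the stages together; `set_traffic` (a feature-flag update) and `fetch_scores` (reads recent per-arm quality scores) are assumed callables, not part of the controller above:

def run_canary(controller: CanaryDeploymentController, set_traffic, fetch_scores) -> bool:
    """Walk the staged rollout; True on full promotion, False on rollback."""
    for stage in controller.STAGES:
        set_traffic(stage["traffic_pct"])
        time.sleep(stage["monitor_minutes"] * 60)  # real monitor polls every 5 min
        canary_scores, baseline_scores = fetch_scores(stage["name"])
        ok, reason = controller.should_promote(canary_scores, baseline_scores)
        if not ok:
            set_traffic(0)  # auto-rollback: 100% back to production
            logger.error("Canary rolled back at %s: %s", stage["name"], reason)
            return False
        logger.info("Stage %s passed: %s", stage["name"], reason)
    return True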
MangaAssist Scenarios
Scenario A: Sonnet vs. Haiku for Recommendation Intent
Context: The team hypothesized that Claude 3.5 Haiku could handle recommendation queries at 60% lower cost while maintaining acceptable quality. Both models were evaluated on 200 recommendation test cases.
Results:
| Metric | Sonnet 3.5 v2 | Haiku 3.5 | Delta |
|---|---|---|---|
| Avg Quality Score | 0.84 | 0.72 | -0.12 |
| Pass Rate (≥0.75) | 91% | 58% | -33 pts |
| Avg Cost/Request | $0.0081 | $0.0027 | -$0.0054 |
| P95 Latency | 2,100ms | 890ms | -1,210ms |
| Relevance Sub-Score | 0.86 | 0.68 | -0.18 |
| Factual Accuracy | 0.82 | 0.79 | -0.03 |
Decision: Haiku failed the quality gate for recommendations (58% pass rate vs. required 80%). Haiku's relevance sub-score was the culprit — it generated generic suggestions rather than personalized ones using the user's browsing history. Haiku was approved for chitchat and order_tracking (both passed quality gates), saving $3,000/month across those two intents.
Cost Savings from Model Tiering:
- chitchat (8% of traffic): Sonnet → Haiku saves ~$1,200/month
- order_tracking (15% of traffic): Sonnet → Haiku saves ~$1,800/month (most responses use templates anyway)
- Total: $3,000/month savings without quality degradation
Scenario B: A/B Test for Claude 3.5 Sonnet v1 → v2 Upgrade
Context: Anthropic released Claude 3.5 Sonnet v2. The team ran a 7-day A/B test with 10% traffic on v2.
What Happened:
- After 3 days (45,000 samples per arm), the A/B analysis showed:
  - v2 quality: 0.86 mean (vs. v1: 0.84) — effect size +0.02
  - v2 cost: $0.0083/req (vs. v1: $0.0081) — marginal increase (+2.5%)
  - v2 latency: 1,950ms P95 (vs. v1: 2,100ms) — 7% faster
  - t-statistic: 3.2 (statistically significant at p < 0.01)
Decision: PROMOTE_TREATMENT — better quality and lower latency with marginal cost increase. Promoted to canary at 5% → 25% → 50% → 100% over 2 hours.
Business Outcome Correlation:
- Cart add rate after recommendation: +1.8% with v2 (statistically significant)
- Escalation rate: -3.2% with v2 (approaching significance, p=0.08)
Scenario C: Token Efficiency Analysis Across Intents
Context: Monthly LLM cost was $315,000. The team ran a cost-performance analysis to identify where token budgets were over-allocated.
Findings:
| Intent | Avg Input Tokens | Avg Output Tokens | Cost/Req | Quality | Token Efficiency |
|---|---|---|---|---|---|
| recommendation | 1,800 | 450 | $0.012 | 0.84 | 0.37 |
| faq | 2,200 | 300 | $0.011 | 0.88 | 0.27 |
| product_question | 1,400 | 200 | $0.007 | 0.82 | 0.51 |
| order_tracking | 800 | 150 | $0.004 | 0.91 | 0.95 |
| promotion | 1,200 | 250 | $0.007 | 0.81 | 0.47 |
Key Insight: FAQ intent had the worst token efficiency (0.27) because it retrieved 5 RAG chunks when 3 were sufficient. Reducing to top-3 chunks saved 800 input tokens per request with only a 0.01 quality drop (0.88 → 0.87). At 150K FAQ requests/day and Sonnet input pricing of $0.003/1K tokens, that saved roughly $10,800/month.
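The arithmetic behind that figure, using the Sonnet input pricing from the table earlier in this section:

tokens_saved_per_request = 800          # 5 RAG chunks reduced to 3
input_cost_per_1k = 0.003               # USD, Claude 3.5 Sonnet input pricing
faq_requests_per_day = 150_000
monthly_savings = (tokens_saved_per_request / 1000) * input_cost_per_1k \
    * faq_requests_per_day * 30
# 0.8 * 0.003 * 150,000 * 30 = $10,800/month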
Scenario D: Canary Catches Latency Regression During Model Upgrade
Context: A new model version was deployed via canary at 5% traffic. Quality scores looked fine, but the canary monitor caught a latency regression.
What Happened:
- Canary quality: 0.85 (baseline: 0.84) — within tolerance
- Canary P95 latency: 4,200ms (baseline: 2,100ms) — 100% increase
- The canary monitor's latency check triggered a rollback alert even though quality was acceptable
Root Cause: The new model version had a cold-start issue on Bedrock — the first few hundred invocations were 3-4x slower due to model loading. After warmup, latency normalized to 2,000ms.
Fix: Added a warmup step to the canary deployment: send 500 synthetic requests before routing live traffic. Re-deployed with warmup — P95 latency was 2,050ms, and the canary progressed through all stages.
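A minimal warmup sketch under those assumptions (the synthetic prompt and request count are illustrative):

def warm_up_model(bedrock_client, model_id: str, n_requests: int = 500) -> None:
    """Send synthetic requests so the model is hot before live traffic arrives."""
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 16,
        "messages": [{"role": "user", "content": "warmup ping"}],
    })
    for _ in range(n_requests):
        try:
            bedrock_client.invoke_model(modelId=model_id, body=body)
        except Exception:
            pass  # warmup failures are non-fatal; the canary gate catches real issues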
Intuition Gained
Model Selection Is Intent-Dependent
The most expensive model is not the best choice for every intent. MangaAssist's model tiering strategy saves 15-20% on LLM costs by routing simple intents (chitchat, order_tracking) to cheaper models while reserving the premium model for complex intents (recommendation, faq). The evaluation system makes this tiering decision data-driven rather than gut-based.
A/B Tests Need Business Outcomes, Not Just Quality Scores
Quality scores tell you whether the model is technically better. Business outcomes (cart adds, escalation rate, session duration) tell you whether customers care. A model that scores 0.02 higher on quality but does not move business metrics is not worth the additional cost. Always instrument A/B tests with both technical and business metrics.
Canary Deployments Must Check Latency, Not Just Quality
Model upgrades can pass quality gates but fail latency requirements. The MangaAssist 3-second SLA means a model that produces better responses in 5 seconds is worse than a model that produces good-enough responses in 2 seconds. Canary monitors should check quality, latency, error rate, and cost — not just one dimension.
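A sketch of such a multi-dimensional gate; every threshold below except the 3-second SLA is illustrative:

def canary_gate(
    quality: float, baseline_quality: float,
    p95_latency_ms: float, error_rate: float,
    cost_per_req: float, baseline_cost: float,
) -> tuple[bool, str]:
    """Promote only if every dimension is within tolerance."""
    if quality < baseline_quality * 0.95:
        return False, "quality degraded more than 5%"
    if p95_latency_ms > 3000:  # MangaAssist 3-second SLA
        return False, "P95 latency breaches the SLA"
    if error_rate > 0.01:  # illustrative 1% error budget
        return False, "error rate above budget"
    if cost_per_req > baseline_cost * 1.10:  # illustrative 10% cost guard
        return False, "cost per request up more than 10%"
    return True, "all dimensions within tolerance"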
References
- MangaAssist Architecture HLD — Deployment architecture and compute stack
- MangaAssist Architecture LLD — API contracts and service definitions
- FM Output Quality Assessment — Quality scoring framework used by this evaluation system
- Model Evaluation Framework — Foundational evaluation architecture
- Model Evaluation Deep Dive — Production evaluation platform design
- LLM Token Cost Optimization — Token cost optimization strategies
- LLM Model Tiering Tradeoffs — Model tiering decision framework