A/B Testing and Parameter Tuning for FM Optimization
AWS AIP-C01 Task 4.2 — Skill 4.2.4: Optimize FM performance through systematic A/B testing and parameter tuning
Context: MangaAssist JP manga store chatbot — Bedrock Claude 3 (Sonnet/Haiku), OpenSearch Serverless, DynamoDB, ECS Fargate
Intents: product_search, order_status, recommendation, manga_qa, chitchat, shipping_info
Skill Mapping
| Certification | Domain | Task | Skill |
|---|---|---|---|
| AWS AIP-C01 | Domain 4 — Operational Efficiency | Task 4.2 — Optimize FM performance | Skill 4.2.4 — Design and execute A/B experiments for FM parameter optimization with statistical rigor |
Skill scope: Design controlled experiments to compare FM parameter configurations, compute statistical significance, implement gradual rollout strategies, and apply Bayesian optimization for automated parameter search.
A/B Test Lifecycle
flowchart TD
A[Identify Optimization Target] --> B[Design Experiment]
B --> C[Calculate Minimum Sample Size]
C --> D{Sufficient Traffic<br/>for Power?}
D -- No --> E[Increase Test Duration<br/>or Reduce Variants]
E --> C
D -- Yes --> F[Configure Experiment in DynamoDB]
F --> G[Launch at 5% Traffic — Canary]
G --> H{Errors or<br/>Regressions?}
H -- Yes --> I[Kill Experiment<br/>Rollback to Control]
H -- No --> J[Expand to 20% Traffic]
J --> K[Collect Observations<br/>Quality + Latency + Cost]
K --> L{Min Samples<br/>Reached?}
L -- No --> K
L -- Yes --> M[Run Statistical Analysis]
M --> N{p-value < 0.05<br/>AND Practical<br/>Significance?}
N -- Not Significant --> O[Continue or Stop]
O --> P{Max Duration<br/>Reached?}
P -- No --> K
P -- Yes --> Q[Conclude: No Winner<br/>Keep Control]
N -- Significant --> R{Quality Improvement<br/> > Cost Increase?}
R -- No --> S[Reject Variant<br/>Not Cost-Effective]
R -- Yes --> T[Expand to 50% Traffic]
T --> U[Monitor for 48 Hours]
U --> V{Stable?}
V -- No --> I
V -- Yes --> W[Promote to 100%<br/>Update Default Profile]
W --> X[Archive Experiment<br/>Document Learnings]
style G fill:#fff3cd
style J fill:#fff3cd
style T fill:#fff3cd
style W fill:#d4edda
style I fill:#f8d7da
style Q fill:#e2e3e5
style S fill:#f8d7da
Experiment Design
Control vs Variant
Every A/B test has exactly one control (the current production parameters) and one or more variants (proposed parameter changes). Changing only one parameter per variant isolates its effect.
| Component | Control | Variant A | Variant B |
|---|---|---|---|
| Description | Current production config | Higher temperature | Larger top_k |
| temperature | 0.45 | 0.55 | 0.45 |
| top_k | 100 | 100 | 150 |
| top_p | 0.90 | 0.90 | 0.90 |
| max_tokens | 512 | 512 | 512 |
Rule: Never change multiple parameters in one variant. If temperature AND top_k both change, you cannot attribute any quality difference to either one.
Minimum Sample Size Calculation
Before launching any experiment, calculate the minimum samples needed to detect a meaningful improvement:
Required Samples Per Variant = f(baseline_rate, minimum_detectable_effect, alpha, power)
| Parameter | Meaning | Typical Value |
|---|---|---|
| baseline_rate | Current quality score | 0.85 (85% quality) |
| minimum_detectable_effect | Smallest improvement worth pursuing | 0.03 (3 percentage points) |
| alpha | False positive rate | 0.05 (5%) |
| power | Probability of detecting a real effect | 0.80 (80%) |
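A minimal sample-size sketch under the two-proportion normal approximation. This is a simplification; the counts in the examples table below were produced with a similar but not identical calculation, so expect small differences.

```python
import math

def required_samples_per_variant(
    baseline_rate: float,
    minimum_detectable_effect: float,
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """Two-proportion sample size, normal approximation, two-tailed test."""
    z_alpha = {0.01: 2.576, 0.05: 1.960, 0.10: 1.645}[alpha]  # two-tailed critical value
    z_beta = {0.80: 0.842, 0.90: 1.282}[power]                # power quantile
    p1 = baseline_rate
    p2 = baseline_rate + minimum_detectable_effect
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = ((z_alpha + z_beta) ** 2 * variance) / minimum_detectable_effect ** 2
    return math.ceil(n)

# product_search example: baseline 0.88, MDE 0.03 -> roughly 1,600-2,000 per variant
# depending on the exact formula and corrections applied
print(required_samples_per_variant(0.88, 0.03))
```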
MangaAssist sample size examples:
| Intent | Baseline | MDE | Required Samples/Variant | Daily Traffic | Days Needed |
|---|---|---|---|---|---|
| product_search | 0.88 | 0.03 | ~1,950 | 2,400 | 2 days |
| order_status | 0.95 | 0.02 | ~2,800 | 1,800 | 4 days |
| recommendation | 0.80 | 0.05 | ~750 | 1,200 | 2 days |
| manga_qa | 0.85 | 0.03 | ~1,950 | 800 | 5 days |
| chitchat | 0.78 | 0.05 | ~650 | 3,000 | 1 day |
| shipping_info | 0.93 | 0.02 | ~3,200 | 600 | 11 days |
The shipping_info intent has the lowest daily traffic and the largest required sample size (a high baseline combined with a small MDE), so its experiments take the longest. Consider running shipping_info experiments alongside product_search experiment windows when possible.
Metrics to Measure
Primary Metrics (Decision Criteria)
| Metric | Type | Collection Method | Threshold for Winner |
|---|---|---|---|
| Quality Score | Continuous [0,1] | Automated relevance scoring + sampled human evaluation | Variant mean > Control mean, p < 0.05 |
| Response Relevance | Continuous [0,1] | Cosine similarity between response and ground-truth answer set | Variant mean > Control mean |
| User Satisfaction | Ordinal [1-5] | Post-interaction survey (sampled 10% of sessions) | Variant mean > Control mean |
Guardrail Metrics (Safety Checks)
| Metric | Type | Max Acceptable Regression |
|---|---|---|
| Latency P95 | Continuous (ms) | No more than 10% increase |
| Token Usage | Continuous (count) | No more than 15% increase |
| Hallucination Rate | Proportion | No increase allowed |
| Cost per Request | Continuous (USD) | No more than 20% increase |
| Error Rate | Proportion | No increase allowed |
If any guardrail metric regresses beyond its threshold, the experiment is automatically paused regardless of primary metric improvement.
Metric Collection Architecture
graph LR
subgraph "FM Invocation"
A[Bedrock InvokeModel] --> B[Raw Response]
end
subgraph "Automated Scoring"
B --> C[Relevance Scorer<br/>Cosine Similarity]
B --> D[Hallucination Detector<br/>Claim Verification]
B --> E[Latency Recorder]
B --> F[Token Counter]
end
subgraph "Human Evaluation — Sampled"
B --> G{10% Sample?}
G -- Yes --> H[Human Review Queue<br/>SQS + Lambda]
H --> I[Quality Score 0-1]
end
subgraph "Storage"
C --> J[(DynamoDB<br/>ABResults)]
D --> J
E --> J
F --> J
I --> J
end
subgraph "Analysis"
J --> K[StatisticalSignificanceCalculator]
K --> L[CloudWatch Dashboard]
K --> M[SNS Alerts]
end
Statistical Significance
Welch's t-test (Primary)
Used for comparing continuous metrics (quality score, latency) between control and variant. Welch's t-test does not assume equal variances, making it robust for real-world data.
Hypotheses:
- H0: mean_control = mean_variant (no difference)
- H1: mean_control ≠ mean_variant (two-tailed)
Decision rule: Reject H0 if p-value < 0.05.
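A quick check of the same test with SciPy, if it is available; the score lists are illustrative.

```python
from scipy import stats

control_scores = [0.88, 0.91, 0.85, 0.87, 0.90, 0.86, 0.89, 0.92]
variant_scores = [0.90, 0.93, 0.88, 0.91, 0.92, 0.89, 0.94, 0.90]

# equal_var=False selects Welch's t-test (no equal-variance assumption)
t_stat, p_value = stats.ttest_ind(variant_scores, control_scores, equal_var=False)
print(f"t={t_stat:.3f}, p={p_value:.4f}, significant={p_value < 0.05}")
```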
Chi-squared Test (Binary Outcomes)
Used for comparing proportions (e.g., "did the user add to cart?" or "was the response flagged by guardrails?").
Hypotheses:
- H0: proportion_control = proportion_variant
- H1: proportion_control ≠ proportion_variant
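The equivalent SciPy call for a 2x2 contingency table; the counts are illustrative, and correction=False matches the uncorrected statistic used in the implementation below.

```python
from scipy.stats import chi2_contingency

# Rows are control/variant, columns are successes/failures
table = [
    [180, 820],  # control: 180 add-to-cart events out of 1,000 sessions
    [215, 785],  # variant
]
chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
print(f"chi2={chi2:.3f}, p={p_value:.4f}, significant={p_value < 0.05}")
```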
Confidence Intervals
Report the 95% confidence interval for the difference in means. If the interval does not contain zero, the result is significant.
CI = (mean_variant - mean_control) ± z_alpha/2 * SE_diff
Early Stopping Rules
To prevent wasting traffic on clearly losing variants:
| Rule | Condition | Action |
|---|---|---|
| Futility stop | p-value > 0.95 at 50% of planned samples | Stop: variant is almost certainly not better |
| Harm stop | Variant quality < control quality at p < 0.01 | Stop immediately: variant is actively worse |
| Winner stop | p-value < 0.001 at 50% of planned samples | Optional: declare early win (with caution) |
| Guardrail stop | Any guardrail metric breached | Stop immediately: variant causes regression |
Multi-Armed Bandit as Alternative
Traditional A/B testing allocates fixed traffic percentages. A multi-armed bandit (MAB) dynamically shifts traffic toward better-performing variants during the experiment.
When to Use MAB vs Fixed A/B
| Criterion | Fixed A/B | Multi-Armed Bandit |
|---|---|---|
| Goal | Statistical proof of superiority | Minimize regret (maximize reward) during testing |
| Traffic cost | 50% of traffic goes to inferior variant | Traffic shifts away from losers automatically |
| Duration | Fixed duration (based on sample size) | Runs continuously; no fixed end |
| Statistical rigor | Full hypothesis testing with p-values | Bayesian posterior probabilities |
| Best for MangaAssist | High-stakes changes (order_status params) | Low-risk optimizations (chitchat temperature) |
Thompson Sampling for MangaAssist
graph TD
A[New Request Arrives] --> B[For Each Variant:<br/>Sample from Beta Distribution]
B --> C[Select Variant with<br/>Highest Sampled Value]
C --> D[Serve Request with<br/>Selected Variant]
D --> E[Observe Quality Score]
E --> F{Score > Threshold?}
F -- Yes --> G[Update Beta: alpha += 1<br/>Success]
F -- No --> H[Update Beta: beta += 1<br/>Failure]
G --> I[Posterior Updated<br/>Better variants sampled more]
H --> I
I --> A
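A minimal Thompson sampling sketch of the loop above; the success threshold and variant names are illustrative.

```python
import random

# Beta posteriors per variant: [alpha, beta] = [successes + 1, failures + 1]
posteriors = {"control": [1, 1], "variant_a": [1, 1]}

def choose_variant() -> str:
    """Sample each variant's Beta posterior and serve the highest draw."""
    draws = {name: random.betavariate(a, b) for name, (a, b) in posteriors.items()}
    return max(draws, key=draws.get)

def record_outcome(variant: str, quality_score: float, threshold: float = 0.85) -> None:
    """Update the chosen variant's posterior from the observed quality score."""
    if quality_score >= threshold:
        posteriors[variant][0] += 1   # success: alpha += 1
    else:
        posteriors[variant][1] += 1   # failure: beta += 1

# Over time, variants with higher success rates are sampled more often
variant = choose_variant()
record_outcome(variant, quality_score=0.91)
```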
MangaAssist Experiment Examples
Experiment 1 — Temperature for Product Descriptions
Hypothesis: Increasing temperature from 0.45 to 0.55 for product_search will produce more engaging product descriptions without sacrificing accuracy.
| Attribute | Value |
|---|---|
| Intent | product_search |
| Control | temperature=0.45, top_k=100, top_p=0.90 |
| Variant A | temperature=0.55, top_k=100, top_p=0.90 |
| Primary metric | Quality score (human eval) |
| Guardrail metrics | Hallucination rate, latency P95 |
| Traffic split | 80% control / 20% variant |
| Min samples | 1,950 per variant |
| Expected duration | 3 days at current traffic |
Example observations after 2,500 samples per variant:
| Metric | Control (T=0.45) | Variant (T=0.55) | Difference |
|---|---|---|---|
| Mean quality score | 0.883 | 0.901 | +0.018 |
| Std dev | 0.071 | 0.078 | — |
| Hallucination rate | 2.1% | 2.3% | +0.2% (within tolerance) |
| P95 latency | 3,210 ms | 3,280 ms | +70 ms (within tolerance) |
| Mean tokens | 385 | 412 | +7% (within tolerance) |
| p-value (t-test) | — | — | < 0.001 |
| Significant? | — | — | Yes |
Decision: Promote Variant A. The +1.8-point quality improvement is statistically significant (p < 0.001) and all guardrail metrics remain within acceptable bounds.
Experiment 2 — Top-k for Recommendations
Hypothesis: Increasing top_k from 250 to 400 for recommendation will produce more diverse manga suggestions.
| Attribute | Value |
|---|---|
| Intent | recommendation |
| Control | temperature=0.80, top_k=250, top_p=0.95 |
| Variant A | temperature=0.80, top_k=400, top_p=0.95 |
| Primary metric | Recommendation diversity (unique titles per session) |
| Secondary metric | Click-through rate on recommended titles |
| Min samples | 750 per variant |
Gradual Rollout Strategy
Never go from experiment winner to 100% traffic in one step. The gradual rollout protects against edge cases that the A/B test did not cover.
graph LR
A["5% Canary<br/>24 hours"] --> B{Errors?}
B -- Yes --> C[Rollback]
B -- No --> D["20% Validation<br/>48 hours"]
D --> E{Metrics<br/>Stable?}
E -- No --> C
E -- Yes --> F["50% Expansion<br/>72 hours"]
F --> G{Business Metrics<br/>Healthy?}
G -- No --> C
G -- Yes --> H["100% Full Rollout"]
H --> I[Update Default Profile<br/>Archive Experiment]
style A fill:#fff3cd
style D fill:#fff3cd
style F fill:#fff3cd
style H fill:#d4edda
style C fill:#f8d7da
| Stage | Traffic % | Duration | Success Criteria | Rollback Trigger |
|---|---|---|---|---|
| Canary | 5% | 24 hours | No errors, no latency spikes | Error rate > 1% OR P95 latency > 5s |
| Validation | 20% | 48 hours | Quality score matches A/B test results | Quality drops >2% from test OR guardrail breach |
| Expansion | 50% | 72 hours | Business metrics stable (conversion, CSAT) | Conversion rate drops >5% OR CSAT drops >0.2 |
| Full rollout | 100% | Permanent | All dashboards green for 24 hours | Any alert fires within first 24 hours |
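One possible wiring for the canary rollback trigger, assuming the chatbot publishes a custom per-experiment ErrorRate metric. The namespace, metric, dimension names, and experiment ID below are illustrative assumptions.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")

# Canary guard: sustained error rate above 1% notifies the experiments SNS topic,
# which can drive an automated rollback via the ExperimentManager below.
cloudwatch.put_metric_alarm(
    AlarmName="manga-assist-ab-canary-error-rate",
    Namespace="MangaAssist/Experiments",
    MetricName="ErrorRate",
    Dimensions=[{"Name": "ExperimentId", "Value": "exp-2024-ps-temp-055"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=0.01,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:ap-northeast-1:123456789012:manga-assist-experiments"],
)
```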
Parameter Tuning Automation — Bayesian Optimization
For intents with many tunable parameters, manually A/B testing every combination is impractical. Bayesian optimization explores the parameter space systematically.
How It Works
- Define the search space: temperature [0.1, 0.9], top_k [20, 300], top_p [0.5, 0.99]
- Build a surrogate model (Gaussian Process) of quality score as a function of parameters
- Acquisition function (Expected Improvement) selects the next parameter combination to try
- Evaluate by running the selected parameters on a sample of requests
- Update the surrogate model with the new observation
- Repeat until the improvement plateaus or budget is exhausted
MangaAssist Bayesian Optimization Boundaries
| Parameter | Min | Max | Step | Intent Group |
|---|---|---|---|---|
| temperature | 0.05 | 0.35 | 0.05 | Factual (order_status, shipping_info) |
| temperature | 0.30 | 0.70 | 0.05 | Balanced (product_search, manga_qa) |
| temperature | 0.60 | 0.95 | 0.05 | Creative (recommendation, chitchat) |
| top_k | 20 | 80 | 10 | Factual |
| top_k | 60 | 200 | 20 | Balanced |
| top_k | 150 | 400 | 25 | Creative |
| top_p | 0.50 | 0.85 | 0.05 | Factual |
| top_p | 0.80 | 0.95 | 0.05 | Balanced |
| top_p | 0.90 | 0.99 | 0.01 | Creative |
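A sketch of feeding the Balanced group's bounds to an off-the-shelf Bayesian optimizer (here scikit-optimize's gp_minimize). The evaluate_parameters helper is a hypothetical stand-in for a real evaluation over sampled requests.

```python
from skopt import gp_minimize
from skopt.space import Real, Integer

# Search space for the Balanced group (product_search, manga_qa) from the table above
search_space = [
    Real(0.30, 0.70, name="temperature"),
    Integer(60, 200, name="top_k"),
    Real(0.80, 0.95, name="top_p"),
]

def evaluate_parameters(temperature: float, top_k: int, top_p: float) -> float:
    """Placeholder scorer: replace with a real run over a sample of requests."""
    # Synthetic illustration only, peaking near temperature=0.5, top_k=120, top_p=0.90
    return 0.85 - abs(temperature - 0.5) * 0.1 - abs(top_k - 120) / 2000 - abs(top_p - 0.90) * 0.2

def objective(params: list) -> float:
    temperature, top_k, top_p = params
    # gp_minimize minimizes, so return the negated quality score
    return -evaluate_parameters(temperature, int(top_k), top_p)

result = gp_minimize(objective, search_space, n_calls=40, acq_func="EI", random_state=42)
print("Best parameters:", result.x, "best quality:", -result.fun)
```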
Python Implementation — ExperimentManager
"""
ExperimentManager — Orchestrates the full A/B experiment lifecycle for MangaAssist.
Handles experiment creation, traffic allocation, metric collection,
analysis triggers, and automated rollout decisions.
"""
import boto3
import json
from decimal import Decimal
import logging
from dataclasses import dataclass
from datetime import datetime, timezone, timedelta
from typing import Optional
from enum import Enum
logger = logging.getLogger(__name__)
class ExperimentStatus(Enum):
DRAFT = "draft"
CANARY = "canary" # 5% traffic
VALIDATING = "validating" # 20% traffic
EXPANDING = "expanding" # 50% traffic
PROMOTED = "promoted" # 100% traffic — winner
ROLLED_BACK = "rolled_back"
CONCLUDED = "concluded" # No winner, keeping control
class RolloutStage:
"""Traffic allocation per rollout stage."""
STAGES = {
ExperimentStatus.CANARY: {
"traffic_pct": 0.05,
"min_duration_hours": 24,
"auto_advance": True,
},
ExperimentStatus.VALIDATING: {
"traffic_pct": 0.20,
"min_duration_hours": 48,
"auto_advance": True,
},
ExperimentStatus.EXPANDING: {
"traffic_pct": 0.50,
"min_duration_hours": 72,
"auto_advance": False, # Requires manual approval for 100%
},
ExperimentStatus.PROMOTED: {
"traffic_pct": 1.00,
"min_duration_hours": 0,
"auto_advance": False,
},
}
@dataclass
class RolloutDecision:
"""Result of evaluating whether to advance, hold, or rollback."""
action: str # "advance", "hold", "rollback"
reason: str
current_stage: ExperimentStatus
next_stage: Optional[ExperimentStatus]
metrics_summary: dict
class ExperimentManager:
"""
Manages the lifecycle of FM parameter experiments in MangaAssist.
Responsibilities:
- Create and configure experiments
- Control traffic allocation per rollout stage
- Evaluate guardrail and quality metrics at each stage gate
- Trigger automatic advancement or rollback
- Persist decisions for audit trail
"""
def __init__(self, region: str = "ap-northeast-1"):
self.dynamodb = boto3.resource("dynamodb", region_name=region)
self.experiments_table = self.dynamodb.Table("MangaAssist-ABExperiments")
self.results_table = self.dynamodb.Table("MangaAssist-ABResults")
self.audit_table = self.dynamodb.Table("MangaAssist-ExperimentAudit")
self.sns = boto3.client("sns", region_name=region)
self.alert_topic_arn = "arn:aws:sns:ap-northeast-1:123456789012:manga-assist-experiments"
def create_experiment(
self,
experiment_id: str,
intent: str,
description: str,
control_params: dict,
variant_params: dict,
min_samples: int = 1000,
confidence_level: float = 0.95,
) -> dict:
"""
Create a new experiment in DRAFT status.
Args:
experiment_id: Unique experiment identifier.
intent: The MangaAssist intent to test (e.g., 'product_search').
description: Human-readable description of what is being tested.
control_params: Current production parameters.
variant_params: Proposed parameter changes.
min_samples: Minimum observations per variant before analysis.
confidence_level: Required confidence for significance (default 0.95).
Returns:
The created experiment record.
"""
now = datetime.now(timezone.utc).isoformat()
experiment = {
"experiment_id": experiment_id,
"intent": intent,
"description": description,
"status": ExperimentStatus.DRAFT.value,
"control": {
"variant_id": "control",
"description": "Current production parameters",
"parameter_overrides": control_params,
"traffic_percentage": 0.95,
},
"variants": [
{
"variant_id": "variant_a",
"description": description,
"parameter_overrides": variant_params,
"traffic_percentage": 0.05,
}
],
"min_samples_per_variant": min_samples,
"confidence_level": confidence_level,
"stage_entered_at": now,
"created_at": now,
"guardrail_config": {
"max_latency_increase_pct": 10,
"max_token_increase_pct": 15,
"max_hallucination_increase": 0,
"max_cost_increase_pct": 20,
"max_error_rate_increase": 0,
},
}
        # boto3's DynamoDB resource rejects native floats; round-trip through JSON to store Decimals
        self.experiments_table.put_item(
            Item=json.loads(json.dumps(experiment, default=float), parse_float=Decimal)
        )
self._audit_log(experiment_id, "created", f"Experiment created: {description}")
logger.info(f"Created experiment {experiment_id} for intent={intent}")
return experiment
def launch_experiment(self, experiment_id: str) -> RolloutDecision:
"""Move experiment from DRAFT to CANARY (5% traffic)."""
experiment = self._load_experiment(experiment_id)
if not experiment:
return RolloutDecision(
action="rollback", reason="Experiment not found",
current_stage=ExperimentStatus.DRAFT, next_stage=None, metrics_summary={}
)
if experiment["status"] != ExperimentStatus.DRAFT.value:
return RolloutDecision(
action="hold", reason=f"Cannot launch from status={experiment['status']}",
current_stage=ExperimentStatus(experiment["status"]),
next_stage=None, metrics_summary={}
)
# Update traffic allocation for canary
experiment["status"] = ExperimentStatus.CANARY.value
experiment["control"]["traffic_percentage"] = 0.95
experiment["variants"][0]["traffic_percentage"] = 0.05
experiment["stage_entered_at"] = datetime.now(timezone.utc).isoformat()
        self.experiments_table.put_item(
            Item=json.loads(json.dumps(experiment, default=float), parse_float=Decimal)
        )
self._audit_log(experiment_id, "launched", "Experiment launched at 5% canary traffic")
self._send_alert(experiment_id, "Experiment launched", "5% canary traffic activated")
return RolloutDecision(
action="advance", reason="Experiment launched successfully",
current_stage=ExperimentStatus.CANARY,
next_stage=ExperimentStatus.VALIDATING, metrics_summary={}
)
def evaluate_stage_gate(self, experiment_id: str) -> RolloutDecision:
"""
Evaluate whether the experiment should advance, hold, or rollback.
Checks:
1. Minimum duration at current stage has elapsed
2. Minimum samples collected
3. No guardrail breaches
4. Quality metrics meet advancement criteria
"""
experiment = self._load_experiment(experiment_id)
if not experiment:
return RolloutDecision(
action="hold", reason="Experiment not found",
current_stage=ExperimentStatus.DRAFT, next_stage=None, metrics_summary={}
)
current_status = ExperimentStatus(experiment["status"])
stage_config = RolloutStage.STAGES.get(current_status)
if not stage_config:
return RolloutDecision(
action="hold", reason=f"No stage config for status={current_status.value}",
current_stage=current_status, next_stage=None, metrics_summary={}
)
# Check minimum duration
stage_entered = datetime.fromisoformat(experiment["stage_entered_at"])
hours_elapsed = (datetime.now(timezone.utc) - stage_entered).total_seconds() / 3600
if hours_elapsed < stage_config["min_duration_hours"]:
return RolloutDecision(
action="hold",
reason=(
f"Minimum duration not met: {hours_elapsed:.1f}h elapsed, "
f"{stage_config['min_duration_hours']}h required"
),
current_stage=current_status, next_stage=None,
metrics_summary={"hours_elapsed": round(hours_elapsed, 1)}
)
# Load metrics
metrics = self._compute_stage_metrics(experiment_id, experiment)
# Check guardrails
guardrail_result = self._check_guardrails(experiment, metrics)
if guardrail_result["breached"]:
self._rollback_experiment(experiment_id, experiment, guardrail_result["reason"])
return RolloutDecision(
action="rollback", reason=guardrail_result["reason"],
current_stage=current_status, next_stage=None,
metrics_summary=metrics
)
# Check sample size
if metrics.get("variant_samples", 0) < experiment["min_samples_per_variant"]:
return RolloutDecision(
action="hold",
reason=(
f"Insufficient samples: {metrics.get('variant_samples', 0)} "
f"of {experiment['min_samples_per_variant']} required"
),
current_stage=current_status, next_stage=None,
metrics_summary=metrics
)
# Determine next stage
stage_order = [
ExperimentStatus.CANARY,
ExperimentStatus.VALIDATING,
ExperimentStatus.EXPANDING,
ExperimentStatus.PROMOTED,
]
current_idx = stage_order.index(current_status)
if current_idx + 1 >= len(stage_order):
return RolloutDecision(
action="hold", reason="Already at final stage",
current_stage=current_status, next_stage=None,
metrics_summary=metrics
)
next_status = stage_order[current_idx + 1]
# Advance
self._advance_experiment(experiment_id, experiment, next_status)
return RolloutDecision(
action="advance",
reason=f"All criteria met. Advancing to {next_status.value}.",
current_stage=current_status, next_stage=next_status,
metrics_summary=metrics
)
def _check_guardrails(self, experiment: dict, metrics: dict) -> dict:
"""Check if any guardrail metric has been breached."""
config = experiment.get("guardrail_config", {})
# Latency check
ctrl_latency = metrics.get("control_p95_latency", 0)
var_latency = metrics.get("variant_p95_latency", 0)
if ctrl_latency > 0:
latency_increase = (var_latency - ctrl_latency) / ctrl_latency * 100
if latency_increase > config.get("max_latency_increase_pct", 10):
return {
"breached": True,
"reason": f"Latency guardrail breached: +{latency_increase:.1f}% "
f"(max {config['max_latency_increase_pct']}%)"
}
# Token check
ctrl_tokens = metrics.get("control_mean_tokens", 0)
var_tokens = metrics.get("variant_mean_tokens", 0)
if ctrl_tokens > 0:
token_increase = (var_tokens - ctrl_tokens) / ctrl_tokens * 100
if token_increase > config.get("max_token_increase_pct", 15):
return {
"breached": True,
"reason": f"Token usage guardrail breached: +{token_increase:.1f}% "
f"(max {config['max_token_increase_pct']}%)"
}
# Hallucination check
ctrl_halluc = metrics.get("control_hallucination_rate", 0)
var_halluc = metrics.get("variant_hallucination_rate", 0)
if var_halluc > ctrl_halluc + config.get("max_hallucination_increase", 0):
return {
"breached": True,
"reason": f"Hallucination guardrail breached: variant={var_halluc:.3f} "
f"> control={ctrl_halluc:.3f}"
}
return {"breached": False, "reason": "All guardrails passed"}
def _advance_experiment(
self, experiment_id: str, experiment: dict, next_status: ExperimentStatus
) -> None:
"""Advance the experiment to the next rollout stage."""
stage_config = RolloutStage.STAGES[next_status]
variant_traffic = stage_config["traffic_pct"]
experiment["status"] = next_status.value
experiment["control"]["traffic_percentage"] = round(1.0 - variant_traffic, 2)
experiment["variants"][0]["traffic_percentage"] = variant_traffic
experiment["stage_entered_at"] = datetime.now(timezone.utc).isoformat()
        self.experiments_table.put_item(
            Item=json.loads(json.dumps(experiment, default=float), parse_float=Decimal)
        )
self._audit_log(
experiment_id, "advanced",
f"Advanced to {next_status.value} at {variant_traffic*100:.0f}% variant traffic"
)
self._send_alert(
experiment_id, f"Experiment advanced to {next_status.value}",
f"Variant traffic now at {variant_traffic*100:.0f}%"
)
def _rollback_experiment(
self, experiment_id: str, experiment: dict, reason: str
) -> None:
"""Rollback: set variant traffic to 0% and mark as rolled back."""
experiment["status"] = ExperimentStatus.ROLLED_BACK.value
experiment["control"]["traffic_percentage"] = 1.0
experiment["variants"][0]["traffic_percentage"] = 0.0
experiment["stage_entered_at"] = datetime.now(timezone.utc).isoformat()
experiment["rollback_reason"] = reason
        self.experiments_table.put_item(
            Item=json.loads(json.dumps(experiment, default=float), parse_float=Decimal)
        )
self._audit_log(experiment_id, "rolled_back", f"Rolled back: {reason}")
self._send_alert(experiment_id, "EXPERIMENT ROLLED BACK", reason)
def _compute_stage_metrics(self, experiment_id: str, experiment: dict) -> dict:
"""Compute summary metrics for the current stage evaluation."""
# In production, query DynamoDB/CloudWatch for real metrics.
# Simplified structure shown here.
return {
"control_samples": 0,
"variant_samples": 0,
"control_mean_quality": 0.0,
"variant_mean_quality": 0.0,
"control_p95_latency": 0.0,
"variant_p95_latency": 0.0,
"control_mean_tokens": 0.0,
"variant_mean_tokens": 0.0,
"control_hallucination_rate": 0.0,
"variant_hallucination_rate": 0.0,
}
def _load_experiment(self, experiment_id: str) -> Optional[dict]:
"""Load experiment from DynamoDB."""
try:
response = self.experiments_table.get_item(
Key={"experiment_id": experiment_id}
)
return response.get("Item")
except Exception as e:
logger.error(f"Failed to load experiment {experiment_id}: {e}")
return None
def _audit_log(self, experiment_id: str, action: str, detail: str) -> None:
"""Write an immutable audit record."""
try:
self.audit_table.put_item(
Item={
"experiment_id": experiment_id,
"timestamp": datetime.now(timezone.utc).isoformat(),
"action": action,
"detail": detail,
}
)
except Exception as e:
logger.error(f"Failed to write audit log: {e}")
def _send_alert(self, experiment_id: str, subject: str, message: str) -> None:
"""Send SNS notification about experiment state change."""
try:
self.sns.publish(
TopicArn=self.alert_topic_arn,
Subject=f"[MangaAssist A/B] {subject}",
Message=f"Experiment: {experiment_id}\n\n{message}",
)
except Exception as e:
logger.error(f"Failed to send alert: {e}")
Python Implementation — StatisticalSignificanceCalculator
"""
StatisticalSignificanceCalculator — Statistical tests for A/B experiment analysis.
Implements Welch's t-test, chi-squared test, confidence intervals,
and early stopping rules for MangaAssist parameter experiments.
"""
import math
import logging
from dataclasses import dataclass
from typing import Optional
logger = logging.getLogger(__name__)
@dataclass
class TTestResult:
"""Result of a Welch's t-test comparison."""
control_mean: float
variant_mean: float
control_std: float
variant_std: float
control_n: int
variant_n: int
t_statistic: float
degrees_of_freedom: float
p_value: float
significant: bool
confidence_interval: tuple[float, float]
effect_size: float # Cohen's d
@dataclass
class ChiSquaredResult:
"""Result of a chi-squared test for proportions."""
control_successes: int
control_total: int
variant_successes: int
variant_total: int
control_rate: float
variant_rate: float
chi_squared: float
p_value: float
significant: bool
@dataclass
class EarlyStopDecision:
"""Result of early stopping evaluation."""
should_stop: bool
reason: str
rule_triggered: Optional[str] # "futility", "harm", "winner", "guardrail"
samples_collected: int
samples_planned: int
class StatisticalSignificanceCalculator:
"""
Performs statistical significance tests for MangaAssist A/B experiments.
Supports:
- Welch's t-test for continuous metrics (quality score, latency)
- Chi-squared test for binary metrics (conversion, hallucination)
- Confidence interval estimation
- Early stopping evaluation
"""
def __init__(self, alpha: float = 0.05):
self.alpha = alpha
def welch_t_test(
self,
control_values: list[float],
variant_values: list[float],
) -> TTestResult:
"""
Perform Welch's t-test comparing control and variant quality scores.
Welch's t-test is preferred over Student's t-test because it does not
assume equal variances between groups — a realistic assumption for
FM outputs where different parameter configs produce different variance.
"""
n_c = len(control_values)
n_v = len(variant_values)
if n_c < 2 or n_v < 2:
raise ValueError(f"Need at least 2 samples per group (got {n_c}, {n_v})")
mean_c = sum(control_values) / n_c
mean_v = sum(variant_values) / n_v
var_c = sum((x - mean_c) ** 2 for x in control_values) / (n_c - 1)
var_v = sum((x - mean_v) ** 2 for x in variant_values) / (n_v - 1)
std_c = math.sqrt(var_c)
std_v = math.sqrt(var_v)
# Standard error of the difference
se = math.sqrt(var_c / n_c + var_v / n_v)
if se == 0:
return TTestResult(
control_mean=mean_c, variant_mean=mean_v,
control_std=std_c, variant_std=std_v,
control_n=n_c, variant_n=n_v,
t_statistic=0.0, degrees_of_freedom=n_c + n_v - 2,
p_value=1.0, significant=False,
confidence_interval=(0.0, 0.0), effect_size=0.0
)
t_stat = (mean_v - mean_c) / se
# Welch-Satterthwaite degrees of freedom
num = (var_c / n_c + var_v / n_v) ** 2
denom = (var_c / n_c) ** 2 / (n_c - 1) + (var_v / n_v) ** 2 / (n_v - 1)
df = num / denom if denom > 0 else 1.0
# Two-tailed p-value
p_value = self._t_distribution_two_tailed_p(abs(t_stat), df)
# 95% confidence interval for difference in means
t_critical = self._t_critical_value(df, self.alpha)
ci_lower = (mean_v - mean_c) - t_critical * se
ci_upper = (mean_v - mean_c) + t_critical * se
# Cohen's d effect size
pooled_std = math.sqrt(
((n_c - 1) * var_c + (n_v - 1) * var_v) / (n_c + n_v - 2)
)
effect_size = (mean_v - mean_c) / pooled_std if pooled_std > 0 else 0.0
return TTestResult(
control_mean=round(mean_c, 4),
variant_mean=round(mean_v, 4),
control_std=round(std_c, 4),
variant_std=round(std_v, 4),
control_n=n_c,
variant_n=n_v,
t_statistic=round(t_stat, 4),
degrees_of_freedom=round(df, 1),
p_value=round(p_value, 6),
significant=p_value < self.alpha,
confidence_interval=(round(ci_lower, 4), round(ci_upper, 4)),
effect_size=round(effect_size, 4),
)
def chi_squared_test(
self,
control_successes: int,
control_total: int,
variant_successes: int,
variant_total: int,
) -> ChiSquaredResult:
"""
Chi-squared test for comparing two proportions.
Used for binary outcomes: e.g., "did the user click a recommended title?"
or "was the response flagged by guardrails?"
"""
if control_total == 0 or variant_total == 0:
raise ValueError("Total counts must be > 0")
rate_c = control_successes / control_total
rate_v = variant_successes / variant_total
# Expected counts under H0 (pooled rate)
total_success = control_successes + variant_successes
total_n = control_total + variant_total
pooled_rate = total_success / total_n
# 2x2 contingency table expected values
e_c_success = control_total * pooled_rate
e_c_failure = control_total * (1 - pooled_rate)
e_v_success = variant_total * pooled_rate
e_v_failure = variant_total * (1 - pooled_rate)
# Chi-squared statistic
chi2 = 0.0
observed = [
control_successes, control_total - control_successes,
variant_successes, variant_total - variant_successes,
]
expected = [e_c_success, e_c_failure, e_v_success, e_v_failure]
for obs, exp in zip(observed, expected):
if exp > 0:
chi2 += (obs - exp) ** 2 / exp
# p-value from chi-squared distribution with df=1
p_value = self._chi2_survival(chi2, df=1)
return ChiSquaredResult(
control_successes=control_successes,
control_total=control_total,
variant_successes=variant_successes,
variant_total=variant_total,
control_rate=round(rate_c, 4),
variant_rate=round(rate_v, 4),
chi_squared=round(chi2, 4),
p_value=round(p_value, 6),
significant=p_value < self.alpha,
)
def evaluate_early_stopping(
self,
control_values: list[float],
variant_values: list[float],
planned_samples: int,
guardrail_breached: bool = False,
) -> EarlyStopDecision:
"""
Evaluate whether to stop the experiment early.
Rules applied in order of priority:
1. Guardrail breach → immediate stop
2. Harm detection → variant significantly worse at p < 0.01
3. Winner detection → variant significantly better at p < 0.001 (at 50%+ samples)
4. Futility → p > 0.95 at 50%+ samples
"""
current_samples = min(len(control_values), len(variant_values))
progress = current_samples / planned_samples if planned_samples > 0 else 0
# Rule 1: Guardrail breach
if guardrail_breached:
return EarlyStopDecision(
should_stop=True,
reason="Guardrail metric breached — stopping immediately",
rule_triggered="guardrail",
samples_collected=current_samples,
samples_planned=planned_samples,
)
# Need minimum samples for statistical tests
if current_samples < 30:
return EarlyStopDecision(
should_stop=False,
reason=f"Insufficient samples for analysis ({current_samples} < 30)",
rule_triggered=None,
samples_collected=current_samples,
samples_planned=planned_samples,
)
# Run t-test
result = self.welch_t_test(control_values, variant_values)
# Rule 2: Harm detection — variant is worse
if result.variant_mean < result.control_mean and result.p_value < 0.01:
return EarlyStopDecision(
should_stop=True,
reason=(
f"Variant is significantly WORSE than control "
f"(p={result.p_value:.4f}, effect={result.effect_size:.3f})"
),
rule_triggered="harm",
samples_collected=current_samples,
samples_planned=planned_samples,
)
# Rules 3 & 4 only apply after 50% of planned samples
if progress < 0.50:
return EarlyStopDecision(
should_stop=False,
reason=f"Only {progress:.0%} of planned samples collected — too early for stopping rules",
rule_triggered=None,
samples_collected=current_samples,
samples_planned=planned_samples,
)
# Rule 3: Early winner
if result.variant_mean > result.control_mean and result.p_value < 0.001:
return EarlyStopDecision(
should_stop=True,
reason=(
f"Early winner detected (p={result.p_value:.6f}, "
f"improvement={result.variant_mean - result.control_mean:.4f})"
),
rule_triggered="winner",
samples_collected=current_samples,
samples_planned=planned_samples,
)
# Rule 4: Futility
if result.p_value > 0.95:
return EarlyStopDecision(
should_stop=True,
reason=(
f"Futility stop: p-value={result.p_value:.4f} indicates "
f"no meaningful difference is likely"
),
rule_triggered="futility",
samples_collected=current_samples,
samples_planned=planned_samples,
)
return EarlyStopDecision(
should_stop=False,
reason="Experiment in progress — no stopping rule triggered",
rule_triggered=None,
samples_collected=current_samples,
samples_planned=planned_samples,
)
@staticmethod
def _t_distribution_two_tailed_p(t: float, df: float) -> float:
"""
Approximate two-tailed p-value from the t-distribution.
Uses the normal approximation for df > 30 and a lookup-based
approximation for smaller df. In production, use scipy.stats.t.sf().
"""
if df > 30:
p_one_tail = 0.5 * math.erfc(t / math.sqrt(2))
return 2 * p_one_tail
# Conservative approximation for small df
if t > 4.0:
return 0.001
elif t > 3.0:
return 0.005
elif t > 2.5:
return 0.02
elif t > 2.0:
return 0.06
elif t > 1.5:
return 0.15
elif t > 1.0:
return 0.33
else:
return 0.50
@staticmethod
def _t_critical_value(df: float, alpha: float) -> float:
"""
Approximate critical value for two-tailed t-test.
For df > 30, uses z-approximation. In production, use scipy.stats.t.ppf().
"""
if df > 30:
# z critical values for common alpha levels
z_map = {0.01: 2.576, 0.05: 1.960, 0.10: 1.645}
return z_map.get(alpha, 1.960)
# Conservative for small df
return 2.1
@staticmethod
def _chi2_survival(chi2: float, df: int) -> float:
"""
Approximate survival function (1 - CDF) for chi-squared distribution.
Uses the normal approximation for the chi-squared statistic.
In production, use scipy.stats.chi2.sf().
"""
if df != 1:
raise ValueError("Only df=1 implemented in this approximation")
z = math.sqrt(chi2)
return math.erfc(z / math.sqrt(2))
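A usage sketch of the calculator; the score lists are illustrative stand-ins for values read from the ABResults table.

```python
calc = StatisticalSignificanceCalculator(alpha=0.05)

control_scores = [0.86, 0.90, 0.88, 0.84, 0.91, 0.87, 0.89, 0.85] * 50  # 400 observations
variant_scores = [0.89, 0.92, 0.90, 0.87, 0.93, 0.90, 0.91, 0.88] * 50

t_result = calc.welch_t_test(control_scores, variant_scores)
print(f"diff={t_result.variant_mean - t_result.control_mean:+.3f}, "
      f"p={t_result.p_value}, significant={t_result.significant}")

stop = calc.evaluate_early_stopping(control_scores, variant_scores, planned_samples=1950)
print(stop.should_stop, "-", stop.reason)
```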
Key Takeaways
- A/B testing is not optional — Intuition about "better" parameters fails regularly. Only controlled experiments with statistical significance testing can confirm improvements.
- Calculate sample size before launching — Running an underpowered experiment wastes traffic and produces inconclusive results. Use the sample size calculator for every experiment.
- Guardrail metrics protect against blind optimization — A variant can improve quality score while doubling latency. Always enforce guardrail thresholds.
- Gradual rollout prevents disasters — Even statistically significant results can fail at scale. The 5% to 20% to 50% to 100% ramp catches edge cases.
- Session-level assignment prevents contamination — Always assign users to variants based on session ID, not request ID; a user who sees both variants within one conversation produces unreliable data (see the sketch after this list).
- Practical significance matters as much as statistical significance — A p-value < 0.05 with a 0.001 quality improvement is not worth a 20% cost increase. Always evaluate the cost-benefit ratio.
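A minimal sketch of deterministic session-level assignment, referenced in the takeaway above; the hash scheme and variant split are illustrative.

```python
import hashlib

def assign_variant(session_id: str, experiment_id: str, variant_traffic_pct: float) -> str:
    """Deterministically map a session to an experiment arm.

    Hashing session_id together with experiment_id keeps every request in a
    conversation on the same arm and keeps assignments independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment_id}:{session_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform value in [0, 1]
    return "variant_a" if bucket < variant_traffic_pct else "control"

# The same session always lands on the same arm for a given experiment
assert assign_variant("sess-123", "exp-2024-ps-temp-055", 0.20) == \
       assign_variant("sess-123", "exp-2024-ps-temp-055", 0.20)
```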