A/B Testing and Parameter Tuning for FM Optimization
AWS AIP-C01 Task 4.2 — Skill 4.2.4: Optimize FM performance through systematic A/B testing and parameter tuning
Context: MangaAssist JP manga store chatbot — Bedrock Claude 3 (Sonnet/Haiku), OpenSearch Serverless, DynamoDB, ECS Fargate
Intents: product_search, order_status, recommendation, manga_qa, chitchat, shipping_info
Skill Mapping
| Certification | Domain | Task | Skill |
|---|---|---|---|
| AWS AIP-C01 | Domain 4 — Operational Efficiency | Task 4.2 — Optimize FM performance | Skill 4.2.4 — Design and execute A/B experiments for FM parameter optimization with statistical rigor |
Skill scope: Design controlled experiments to compare FM parameter configurations, compute statistical significance, implement gradual rollout strategies, and apply Bayesian optimization for automated parameter search.
A/B Test Lifecycle
flowchart TD
A[Identify Optimization Target] --> B[Design Experiment]
B --> C[Calculate Minimum Sample Size]
C --> D{Sufficient Traffic<br/>for Power?}
D -- No --> E[Increase Test Duration<br/>or Reduce Variants]
E --> C
D -- Yes --> F[Configure Experiment in DynamoDB]
F --> G[Launch at 5% Traffic — Canary]
G --> H{Errors or<br/>Regressions?}
H -- Yes --> I[Kill Experiment<br/>Rollback to Control]
H -- No --> J[Expand to 20% Traffic]
J --> K[Collect Observations<br/>Quality + Latency + Cost]
K --> L{Min Samples<br/>Reached?}
L -- No --> K
L -- Yes --> M[Run Statistical Analysis]
M --> N{p-value < 0.05<br/>AND Practical<br/>Significance?}
N -- Not Significant --> O[Continue or Stop]
O --> P{Max Duration<br/>Reached?}
P -- No --> K
P -- Yes --> Q[Conclude: No Winner<br/>Keep Control]
N -- Significant --> R{Quality Improvement<br/> > Cost Increase?}
R -- No --> S[Reject Variant<br/>Not Cost-Effective]
R -- Yes --> T[Expand to 50% Traffic]
T --> U[Monitor for 48 Hours]
U --> V{Stable?}
V -- No --> I
V -- Yes --> W[Promote to 100%<br/>Update Default Profile]
W --> X[Archive Experiment<br/>Document Learnings]
style G fill:#fff3cd
style J fill:#fff3cd
style T fill:#fff3cd
style W fill:#d4edda
style I fill:#f8d7da
style Q fill:#e2e3e5
style S fill:#f8d7da
Experiment Design
Control vs Variant
Every A/B test has exactly one control (the current production parameters) and one or more variants (proposed parameter changes). Changing only one parameter per variant isolates its effect.
| Component | Control | Variant A | Variant B |
|---|---|---|---|
| Description | Current production config | Higher temperature | Larger top_k |
| temperature | 0.45 | 0.55 | 0.45 |
| top_k | 100 | 100 | 150 |
| top_p | 0.90 | 0.90 | 0.90 |
| max_tokens | 512 | 512 | 512 |
Rule: Never change multiple parameters in one variant. If temperature AND top_k both change, you cannot attribute any quality difference to either one.
Minimum Sample Size Calculation
Before launching any experiment, calculate the minimum samples needed to detect a meaningful improvement:
Required Samples Per Variant = f(baseline_rate, minimum_detectable_effect, alpha, power)
| Parameter | Meaning | Typical Value |
|---|---|---|
| baseline_rate | Current quality score | 0.85 (85% quality) |
| minimum_detectable_effect | Smallest improvement worth pursuing | 0.03 (3 percentage points) |
| alpha | False positive rate | 0.05 (5%) |
| power | Probability of detecting a real effect | 0.80 (80%) |
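A minimal sample-size sketch under the two-proportion normal approximation. This is a simplification; the counts in the examples table below were produced with a similar but not identical calculation, so expect small differences.

```python
import math

def required_samples_per_variant(
    baseline_rate: float,
    minimum_detectable_effect: float,
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """Two-proportion sample size, normal approximation, two-tailed test."""
    z_alpha = {0.01: 2.576, 0.05: 1.960, 0.10: 1.645}[alpha]  # two-tailed critical value
    z_beta = {0.80: 0.842, 0.90: 1.282}[power]                # power quantile
    p1 = baseline_rate
    p2 = baseline_rate + minimum_detectable_effect
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = ((z_alpha + z_beta) ** 2 * variance) / minimum_detectable_effect ** 2
    return math.ceil(n)

# product_search example: baseline 0.88, MDE 0.03 -> roughly 1,600-2,000 per variant
# depending on the exact formula and corrections applied
print(required_samples_per_variant(0.88, 0.03))
```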
MangaAssist sample size examples:
| Intent | Baseline | MDE | Required Samples/Variant | Daily Traffic | Days Needed |
|---|---|---|---|---|---|
| product_search | 0.88 | 0.03 | ~1,950 | 2,400 | 2 days |
| order_status | 0.95 | 0.02 | ~2,800 | 1,800 | 4 days |
| recommendation | 0.80 | 0.05 | ~750 | 1,200 | 2 days |
| manga_qa | 0.85 | 0.03 | ~1,950 | 800 | 5 days |
| chitchat | 0.78 | 0.05 | ~650 | 3,000 | 1 day |
| shipping_info | 0.93 | 0.02 | ~3,200 | 600 | 11 days |
The shipping_info intent has the lowest daily traffic and the largest required sample size (a high baseline combined with a small MDE), so its experiments take the longest. Consider running shipping_info experiments alongside product_search experiment windows when possible.
Metrics to Measure
Primary Metrics (Decision Criteria)
| Metric | Type | Collection Method | Threshold for Winner |
|---|---|---|---|
| Quality Score | Continuous [0,1] | Automated relevance scoring + sampled human evaluation | Variant mean > Control mean, p < 0.05 |
| Response Relevance | Continuous [0,1] | Cosine similarity between response and ground-truth answer set | Variant mean > Control mean |
| User Satisfaction | Ordinal [1-5] | Post-interaction survey (sampled 10% of sessions) | Variant mean > Control mean |
Guardrail Metrics (Safety Checks)
| Metric | Type | Max Acceptable Regression |
|---|---|---|
| Latency P95 | Continuous (ms) | No more than 10% increase |
| Token Usage | Continuous (count) | No more than 15% increase |
| Hallucination Rate | Proportion | No increase allowed |
| Cost per Request | Continuous (USD) | No more than 20% increase |
| Error Rate | Proportion | No increase allowed |
If any guardrail metric regresses beyond its threshold, the experiment is automatically paused regardless of primary metric improvement.
Metric Collection Architecture
graph LR
subgraph "FM Invocation"
A[Bedrock InvokeModel] --> B[Raw Response]
end
subgraph "Automated Scoring"
B --> C[Relevance Scorer<br/>Cosine Similarity]
B --> D[Hallucination Detector<br/>Claim Verification]
B --> E[Latency Recorder]
B --> F[Token Counter]
end
subgraph "Human Evaluation — Sampled"
B --> G{10% Sample?}
G -- Yes --> H[Human Review Queue<br/>SQS + Lambda]
H --> I[Quality Score 0-1]
end
subgraph "Storage"
C --> J[(DynamoDB<br/>ABResults)]
D --> J
E --> J
F --> J
I --> J
end
subgraph "Analysis"
J --> K[StatisticalSignificanceCalculator]
K --> L[CloudWatch Dashboard]
K --> M[SNS Alerts]
end
Statistical Significance
Welch's t-test (Primary)
Used for comparing continuous metrics (quality score, latency) between control and variant. Welch's t-test does not assume equal variances, making it robust for real-world data.
Hypotheses:
- H0: mean_control = mean_variant (no difference)
- H1: mean_control ≠ mean_variant (two-tailed)
Decision rule: Reject H0 if p-value < 0.05.
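A quick check of the same test with SciPy, if it is available; the score lists are illustrative.

```python
from scipy import stats

control_scores = [0.88, 0.91, 0.85, 0.87, 0.90, 0.86, 0.89, 0.92]
variant_scores = [0.90, 0.93, 0.88, 0.91, 0.92, 0.89, 0.94, 0.90]

# equal_var=False selects Welch's t-test (no equal-variance assumption)
t_stat, p_value = stats.ttest_ind(variant_scores, control_scores, equal_var=False)
print(f"t={t_stat:.3f}, p={p_value:.4f}, significant={p_value < 0.05}")
```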
Chi-squared Test (Binary Outcomes)
Used for comparing proportions (e.g., "did the user add to cart?" or "was the response flagged by guardrails?").
Hypotheses:
- H0: proportion_control = proportion_variant
- H1: proportion_control ≠ proportion_variant
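The equivalent SciPy call for a 2x2 contingency table; the counts are illustrative, and correction=False matches the uncorrected statistic used in the implementation below.

```python
from scipy.stats import chi2_contingency

# Rows are control/variant, columns are successes/failures
table = [
    [180, 820],  # control: 180 add-to-cart events out of 1,000 sessions
    [215, 785],  # variant
]
chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
print(f"chi2={chi2:.3f}, p={p_value:.4f}, significant={p_value < 0.05}")
```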
Confidence Intervals
Report the 95% confidence interval for the difference in means. If the interval does not contain zero, the result is significant.
CI = (mean_variant - mean_control) ± z_alpha/2 * SE_diff
Early Stopping Rules
To prevent wasting traffic on clearly losing variants:
| Rule | Condition | Action |
|---|---|---|
| Futility stop | p-value > 0.95 at 50% of planned samples | Stop: variant is almost certainly not better |
| Harm stop | Variant quality < control quality at p < 0.01 | Stop immediately: variant is actively worse |
| Winner stop | p-value < 0.001 at 50% of planned samples | Optional: declare early win (with caution) |
| Guardrail stop | Any guardrail metric breached | Stop immediately: variant causes regression |
Multi-Armed Bandit as Alternative
Traditional A/B testing allocates fixed traffic percentages. A multi-armed bandit (MAB) dynamically shifts traffic toward better-performing variants during the experiment.
When to Use MAB vs Fixed A/B
| Criterion | Fixed A/B | Multi-Armed Bandit |
|---|---|---|
| Goal | Statistical proof of superiority | Minimize regret (maximize reward) during testing |
| Traffic cost | 50% of traffic goes to inferior variant | Traffic shifts away from losers automatically |
| Duration | Fixed duration (based on sample size) | Runs continuously; no fixed end |
| Statistical rigor | Full hypothesis testing with p-values | Bayesian posterior probabilities |
| Best for MangaAssist | High-stakes changes (order_status params) | Low-risk optimizations (chitchat temperature) |
Thompson Sampling for MangaAssist
graph TD
A[New Request Arrives] --> B[For Each Variant:<br/>Sample from Beta Distribution]
B --> C[Select Variant with<br/>Highest Sampled Value]
C --> D[Serve Request with<br/>Selected Variant]
D --> E[Observe Quality Score]
E --> F{Score > Threshold?}
F -- Yes --> G[Update Beta: alpha += 1<br/>Success]
F -- No --> H[Update Beta: beta += 1<br/>Failure]
G --> I[Posterior Updated<br/>Better variants sampled more]
H --> I
I --> A
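A minimal Thompson sampling sketch of the loop above; the success threshold and variant names are illustrative.

```python
import random

# Beta posteriors per variant: [alpha, beta] = [successes + 1, failures + 1]
posteriors = {"control": [1, 1], "variant_a": [1, 1]}

def choose_variant() -> str:
    """Sample each variant's Beta posterior and serve the highest draw."""
    draws = {name: random.betavariate(a, b) for name, (a, b) in posteriors.items()}
    return max(draws, key=draws.get)

def record_outcome(variant: str, quality_score: float, threshold: float = 0.85) -> None:
    """Update the chosen variant's posterior from the observed quality score."""
    if quality_score >= threshold:
        posteriors[variant][0] += 1   # success: alpha += 1
    else:
        posteriors[variant][1] += 1   # failure: beta += 1

# Over time, variants with higher success rates are sampled more often
variant = choose_variant()
record_outcome(variant, quality_score=0.91)
```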
MangaAssist Experiment Examples
Experiment 1 — Temperature for Product Descriptions
Hypothesis: Increasing temperature from 0.45 to 0.55 for product_search will produce more engaging product descriptions without sacrificing accuracy.
| Attribute | Value |
|---|---|
| Intent | product_search |
| Control | temperature=0.45, top_k=100, top_p=0.90 |
| Variant A | temperature=0.55, top_k=100, top_p=0.90 |
| Primary metric | Quality score (human eval) |
| Guardrail metrics | Hallucination rate, latency P95 |
| Traffic split | 80% control / 20% variant |
| Min samples | 1,950 per variant |
| Expected duration | 3 days at current traffic |
Example observations after 2,500 samples per variant:
| Metric | Control (T=0.45) | Variant (T=0.55) | Difference |
|---|---|---|---|
| Mean quality score | 0.883 | 0.901 | +0.018 |
| Std dev | 0.071 | 0.078 | — |
| Hallucination rate | 2.1% | 2.3% | +0.2% (within tolerance) |
| P95 latency | 3,210 ms | 3,280 ms | +70 ms (within tolerance) |
| Mean tokens | 385 | 412 | +7% (within tolerance) |
| p-value (t-test) | — | — | < 0.001 |
| Significant? | — | — | Yes |
Decision: Promote Variant A. The +1.8-point quality improvement is statistically significant (p < 0.001) and all guardrail metrics remain within acceptable bounds.
Experiment 2 — Top-k for Recommendations
Hypothesis: Increasing top_k from 250 to 400 for recommendation will produce more diverse manga suggestions.
| Attribute | Value |
|---|---|
| Intent | recommendation |
| Control | temperature=0.80, top_k=250, top_p=0.95 |
| Variant A | temperature=0.80, top_k=400, top_p=0.95 |
| Primary metric | Recommendation diversity (unique titles per session) |
| Secondary metric | Click-through rate on recommended titles |
| Min samples | 750 per variant |
Gradual Rollout Strategy
Never go from experiment winner to 100% traffic in one step. The gradual rollout protects against edge cases that the A/B test did not cover.
graph LR
A["5% Canary<br/>24 hours"] --> B{Errors?}
B -- Yes --> C[Rollback]
B -- No --> D["20% Validation<br/>48 hours"]
D --> E{Metrics<br/>Stable?}
E -- No --> C
E -- Yes --> F["50% Expansion<br/>72 hours"]
F --> G{Business Metrics<br/>Healthy?}
G -- No --> C
G -- Yes --> H["100% Full Rollout"]
H --> I[Update Default Profile<br/>Archive Experiment]
style A fill:#fff3cd
style D fill:#fff3cd
style F fill:#fff3cd
style H fill:#d4edda
style C fill:#f8d7da
| Stage | Traffic % | Duration | Success Criteria | Rollback Trigger |
|---|---|---|---|---|
| Canary | 5% | 24 hours | No errors, no latency spikes | Error rate > 1% OR P95 latency > 5s |
| Validation | 20% | 48 hours | Quality score matches A/B test results | Quality drops >2% from test OR guardrail breach |
| Expansion | 50% | 72 hours | Business metrics stable (conversion, CSAT) | Conversion rate drops >5% OR CSAT drops >0.2 |
| Full rollout | 100% | Permanent | All dashboards green for 24 hours | Any alert fires within first 24 hours |
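One possible wiring for the canary rollback trigger, assuming the chatbot publishes a custom per-experiment ErrorRate metric. The namespace, metric, dimension names, and experiment ID below are illustrative assumptions.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")

# Canary guard: sustained error rate above 1% notifies the experiments SNS topic,
# which can drive an automated rollback via the ExperimentManager below.
cloudwatch.put_metric_alarm(
    AlarmName="manga-assist-ab-canary-error-rate",
    Namespace="MangaAssist/Experiments",
    MetricName="ErrorRate",
    Dimensions=[{"Name": "ExperimentId", "Value": "exp-2024-ps-temp-055"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=0.01,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:ap-northeast-1:123456789012:manga-assist-experiments"],
)
```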
Parameter Tuning Automation — Bayesian Optimization
For intents with many tunable parameters, manually A/B testing every combination is impractical. Bayesian optimization explores the parameter space systematically.
How It Works
- Define the search space: temperature [0.1, 0.9], top_k [20, 300], top_p [0.5, 0.99]
- Build a surrogate model (Gaussian Process) of quality score as a function of parameters
- Acquisition function (Expected Improvement) selects the next parameter combination to try
- Evaluate by running the selected parameters on a sample of requests
- Update the surrogate model with the new observation
- Repeat until the improvement plateaus or budget is exhausted
MangaAssist Bayesian Optimization Boundaries
| Parameter | Min | Max | Step | Intent Group |
|---|---|---|---|---|
| temperature | 0.05 | 0.35 | 0.05 | Factual (order_status, shipping_info) |
| temperature | 0.30 | 0.70 | 0.05 | Balanced (product_search, manga_qa) |
| temperature | 0.60 | 0.95 | 0.05 | Creative (recommendation, chitchat) |
| top_k | 20 | 80 | 10 | Factual |
| top_k | 60 | 200 | 20 | Balanced |
| top_k | 150 | 400 | 25 | Creative |
| top_p | 0.50 | 0.85 | 0.05 | Factual |
| top_p | 0.80 | 0.95 | 0.05 | Balanced |
| top_p | 0.90 | 0.99 | 0.01 | Creative |
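A sketch of feeding the Balanced group's bounds to an off-the-shelf Bayesian optimizer (here scikit-optimize's gp_minimize). The evaluate_parameters helper is a hypothetical stand-in for a real evaluation over sampled requests.

```python
from skopt import gp_minimize
from skopt.space import Real, Integer

# Search space for the Balanced group (product_search, manga_qa) from the table above
search_space = [
    Real(0.30, 0.70, name="temperature"),
    Integer(60, 200, name="top_k"),
    Real(0.80, 0.95, name="top_p"),
]

def evaluate_parameters(temperature: float, top_k: int, top_p: float) -> float:
    """Placeholder scorer: replace with a real run over a sample of requests."""
    # Synthetic illustration only, peaking near temperature=0.5, top_k=120, top_p=0.90
    return 0.85 - abs(temperature - 0.5) * 0.1 - abs(top_k - 120) / 2000 - abs(top_p - 0.90) * 0.2

def objective(params: list) -> float:
    temperature, top_k, top_p = params
    # gp_minimize minimizes, so return the negated quality score
    return -evaluate_parameters(temperature, int(top_k), top_p)

result = gp_minimize(objective, search_space, n_calls=40, acq_func="EI", random_state=42)
print("Best parameters:", result.x, "best quality:", -result.fun)
```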
Python Implementation — ExperimentManager
"""
ExperimentManager — Orchestrates the full A/B experiment lifecycle for MangaAssist.
Handles experiment creation, traffic allocation, metric collection,
analysis triggers, and automated rollout decisions.
"""
import boto3
import json
from decimal import Decimal
import logging
from dataclasses import dataclass
from datetime import datetime, timezone, timedelta
from typing import Optional
from enum import Enum
logger = logging.getLogger(__name__)
class ExperimentStatus(Enum):
DRAFT = "draft"
CANARY = "canary" # 5% traffic
VALIDATING = "validating" # 20% traffic
EXPANDING = "expanding" # 50% traffic
PROMOTED = "promoted" # 100% traffic — winner
ROLLED_BACK = "rolled_back"
CONCLUDED = "concluded" # No winner, keeping control
class RolloutStage:
"""Traffic allocation per rollout stage."""
STAGES = {
ExperimentStatus.CANARY: {
"traffic_pct": 0.05,
"min_duration_hours": 24,
"auto_advance": True,
},
ExperimentStatus.VALIDATING: {
"traffic_pct": 0.20,
"min_duration_hours": 48,
"auto_advance": True,
},
ExperimentStatus.EXPANDING: {
"traffic_pct": 0.50,
"min_duration_hours": 72,
"auto_advance": False, # Requires manual approval for 100%
},
ExperimentStatus.PROMOTED: {
"traffic_pct": 1.00,
"min_duration_hours": 0,
"auto_advance": False,
},
}
@dataclass
class RolloutDecision:
"""Result of evaluating whether to advance, hold, or rollback."""
action: str # "advance", "hold", "rollback"
reason: str
current_stage: ExperimentStatus
next_stage: Optional[ExperimentStatus]
metrics_summary: dict
class ExperimentManager:
"""
Manages the lifecycle of FM parameter experiments in MangaAssist.
Responsibilities:
- Create and configure experiments
- Control traffic allocation per rollout stage
- Evaluate guardrail and quality metrics at each stage gate
- Trigger automatic advancement or rollback
- Persist decisions for audit trail
"""
def __init__(self, region: str = "ap-northeast-1"):
self.dynamodb = boto3.resource("dynamodb", region_name=region)
self.experiments_table = self.dynamodb.Table("MangaAssist-ABExperiments")
self.results_table = self.dynamodb.Table("MangaAssist-ABResults")
self.audit_table = self.dynamodb.Table("MangaAssist-ExperimentAudit")
self.sns = boto3.client("sns", region_name=region)
self.alert_topic_arn = "arn:aws:sns:ap-northeast-1:123456789012:manga-assist-experiments"
def create_experiment(
self,
experiment_id: str,
intent: str,
description: str,
control_params: dict,
variant_params: dict,
min_samples: int = 1000,
confidence_level: float = 0.95,
) -> dict:
"""
Create a new experiment in DRAFT status.
Args:
experiment_id: Unique experiment identifier.
intent: The MangaAssist intent to test (e.g., 'product_search').
description: Human-readable description of what is being tested.
control_params: Current production parameters.
variant_params: Proposed parameter changes.
min_samples: Minimum observations per variant before analysis.
confidence_level: Required confidence for significance (default 0.95).
Returns:
The created experiment record.
"""
now = datetime.now(timezone.utc).isoformat()
experiment = {
"experiment_id": experiment_id,
"intent": intent,
"description": description,
"status": ExperimentStatus.DRAFT.value,
"control": {
"variant_id": "control",
"description": "Current production parameters",
"parameter_overrides": control_params,
"traffic_percentage": 0.95,
},
"variants": [
{
"variant_id": "variant_a",
"description": description,
"parameter_overrides": variant_params,
"traffic_percentage": 0.05,
}
],
"min_samples_per_variant": min_samples,
"confidence_level": confidence_level,
"stage_entered_at": now,
"created_at": now,
"guardrail_config": {
"max_latency_increase_pct": 10,
"max_token_increase_pct": 15,
"max_hallucination_increase": 0,
"max_cost_increase_pct": 20,
"max_error_rate_increase": 0,
},
}
        # boto3's DynamoDB resource rejects native floats; round-trip through JSON to store Decimals
        self.experiments_table.put_item(
            Item=json.loads(json.dumps(experiment, default=float), parse_float=Decimal)
        )
self._audit_log(experiment_id, "created", f"Experiment created: {description}")
logger.info(f"Created experiment {experiment_id} for intent={intent}")
return experiment
def launch_experiment(self, experiment_id: str) -> RolloutDecision:
"""Move experiment from DRAFT to CANARY (5% traffic)."""
experiment = self._load_experiment(experiment_id)
if not experiment:
return RolloutDecision(
action="rollback", reason="Experiment not found",
current_stage=ExperimentStatus.DRAFT, next_stage=None, metrics_summary={}
)
if experiment["status"] != ExperimentStatus.DRAFT.value:
return RolloutDecision(
action="hold", reason=f"Cannot launch from status={experiment['status']}",
current_stage=ExperimentStatus(experiment["status"]),
next_stage=None, metrics_summary={}
)
# Update traffic allocation for canary
experiment["status"] = ExperimentStatus.CANARY.value
experiment["control"]["traffic_percentage"] = 0.95
experiment["variants"][0]["traffic_percentage"] = 0.05
experiment["stage_entered_at"] = datetime.now(timezone.utc).isoformat()
        self.experiments_table.put_item(
            Item=json.loads(json.dumps(experiment, default=float), parse_float=Decimal)
        )
self._audit_log(experiment_id, "launched", "Experiment launched at 5% canary traffic")
self._send_alert(experiment_id, "Experiment launched", "5% canary traffic activated")
return RolloutDecision(
action="advance", reason="Experiment launched successfully",
current_stage=ExperimentStatus.CANARY,
next_stage=ExperimentStatus.VALIDATING, metrics_summary={}
)
def evaluate_stage_gate(self, experiment_id: str) -> RolloutDecision:
"""
Evaluate whether the experiment should advance, hold, or rollback.
Checks:
1. Minimum duration at current stage has elapsed
2. Minimum samples collected
3. No guardrail breaches
4. Quality metrics meet advancement criteria
"""
experiment = self._load_experiment(experiment_id)
if not experiment:
return RolloutDecision(
action="hold", reason="Experiment not found",
current_stage=ExperimentStatus.DRAFT, next_stage=None, metrics_summary={}
)
current_status = ExperimentStatus(experiment["status"])
stage_config = RolloutStage.STAGES.get(current_status)
if not stage_config:
return RolloutDecision(
action="hold", reason=f"No stage config for status={current_status.value}",
current_stage=current_status, next_stage=None, metrics_summary={}
)
# Check minimum duration
stage_entered = datetime.fromisoformat(experiment["stage_entered_at"])
hours_elapsed = (datetime.now(timezone.utc) - stage_entered).total_seconds() / 3600
if hours_elapsed < stage_config["min_duration_hours"]:
return RolloutDecision(
action="hold",
reason=(
f"Minimum duration not met: {hours_elapsed:.1f}h elapsed, "
f"{stage_config['min_duration_hours']}h required"
),
current_stage=current_status, next_stage=None,
metrics_summary={"hours_elapsed": round(hours_elapsed, 1)}
)
# Load metrics
metrics = self._compute_stage_metrics(experiment_id, experiment)
# Check guardrails
guardrail_result = self._check_guardrails(experiment, metrics)
if guardrail_result["breached"]:
self._rollback_experiment(experiment_id, experiment, guardrail_result["reason"])
return RolloutDecision(
action="rollback", reason=guardrail_result["reason"],
current_stage=current_status, next_stage=None,
metrics_summary=metrics
)
# Check sample size
if metrics.get("variant_samples", 0) < experiment["min_samples_per_variant"]:
return RolloutDecision(
action="hold",
reason=(
f"Insufficient samples: {metrics.get('variant_samples', 0)} "
f"of {experiment['min_samples_per_variant']} required"
),
current_stage=current_status, next_stage=None,
metrics_summary=metrics
)
# Determine next stage
stage_order = [
ExperimentStatus.CANARY,
ExperimentStatus.VALIDATING,
ExperimentStatus.EXPANDING,
ExperimentStatus.PROMOTED,
]
current_idx = stage_order.index(current_status)
if current_idx + 1 >= len(stage_order):
return RolloutDecision(
action="hold", reason="Already at final stage",
current_stage=current_status, next_stage=None,
metrics_summary=metrics
)
next_status = stage_order[current_idx + 1]
# Advance
self._advance_experiment(experiment_id, experiment, next_status)
return RolloutDecision(
action="advance",
reason=f"All criteria met. Advancing to {next_status.value}.",
current_stage=current_status, next_stage=next_status,
metrics_summary=metrics
)
def _check_guardrails(self, experiment: dict, metrics: dict) -> dict:
"""Check if any guardrail metric has been breached."""
config = experiment.get("guardrail_config", {})
# Latency check
ctrl_latency = metrics.get("control_p95_latency", 0)
var_latency = metrics.get("variant_p95_latency", 0)
if ctrl_latency > 0:
latency_increase = (var_latency - ctrl_latency) / ctrl_latency * 100
if latency_increase > config.get("max_latency_increase_pct", 10):
return {
"breached": True,
"reason": f"Latency guardrail breached: +{latency_increase:.1f}% "
f"(max {config['max_latency_increase_pct']}%)"
}
# Token check
ctrl_tokens = metrics.get("control_mean_tokens", 0)
var_tokens = metrics.get("variant_mean_tokens", 0)
if ctrl_tokens > 0:
token_increase = (var_tokens - ctrl_tokens) / ctrl_tokens * 100
if token_increase > config.get("max_token_increase_pct", 15):
return {
"breached": True,
"reason": f"Token usage guardrail breached: +{token_increase:.1f}% "
f"(max {config['max_token_increase_pct']}%)"
}
# Hallucination check
ctrl_halluc = metrics.get("control_hallucination_rate", 0)
var_halluc = metrics.get("variant_hallucination_rate", 0)
if var_halluc > ctrl_halluc + config.get("max_hallucination_increase", 0):
return {
"breached": True,
"reason": f"Hallucination guardrail breached: variant={var_halluc:.3f} "
f"> control={ctrl_halluc:.3f}"
}
return {"breached": False, "reason": "All guardrails passed"}
def _advance_experiment(
self, experiment_id: str, experiment: dict, next_status: ExperimentStatus
) -> None:
"""Advance the experiment to the next rollout stage."""
stage_config = RolloutStage.STAGES[next_status]
variant_traffic = stage_config["traffic_pct"]
experiment["status"] = next_status.value
experiment["control"]["traffic_percentage"] = round(1.0 - variant_traffic, 2)
experiment["variants"][0]["traffic_percentage"] = variant_traffic
experiment["stage_entered_at"] = datetime.now(timezone.utc).isoformat()
        self.experiments_table.put_item(
            Item=json.loads(json.dumps(experiment, default=float), parse_float=Decimal)
        )
self._audit_log(
experiment_id, "advanced",
f"Advanced to {next_status.value} at {variant_traffic*100:.0f}% variant traffic"
)
self._send_alert(
experiment_id, f"Experiment advanced to {next_status.value}",
f"Variant traffic now at {variant_traffic*100:.0f}%"
)
def _rollback_experiment(
self, experiment_id: str, experiment: dict, reason: str
) -> None:
"""Rollback: set variant traffic to 0% and mark as rolled back."""
experiment["status"] = ExperimentStatus.ROLLED_BACK.value
experiment["control"]["traffic_percentage"] = 1.0
experiment["variants"][0]["traffic_percentage"] = 0.0
experiment["stage_entered_at"] = datetime.now(timezone.utc).isoformat()
experiment["rollback_reason"] = reason
        self.experiments_table.put_item(
            Item=json.loads(json.dumps(experiment, default=float), parse_float=Decimal)
        )
self._audit_log(experiment_id, "rolled_back", f"Rolled back: {reason}")
self._send_alert(experiment_id, "EXPERIMENT ROLLED BACK", reason)
def _compute_stage_metrics(self, experiment_id: str, experiment: dict) -> dict:
"""Compute summary metrics for the current stage evaluation."""
# In production, query DynamoDB/CloudWatch for real metrics.
# Simplified structure shown here.
return {
"control_samples": 0,
"variant_samples": 0,
"control_mean_quality": 0.0,
"variant_mean_quality": 0.0,
"control_p95_latency": 0.0,
"variant_p95_latency": 0.0,
"control_mean_tokens": 0.0,
"variant_mean_tokens": 0.0,
"control_hallucination_rate": 0.0,
"variant_hallucination_rate": 0.0,
}
def _load_experiment(self, experiment_id: str) -> Optional[dict]:
"""Load experiment from DynamoDB."""
try:
response = self.experiments_table.get_item(
Key={"experiment_id": experiment_id}
)
return response.get("Item")
except Exception as e:
logger.error(f"Failed to load experiment {experiment_id}: {e}")
return None
def _audit_log(self, experiment_id: str, action: str, detail: str) -> None:
"""Write an immutable audit record."""
try:
self.audit_table.put_item(
Item={
"experiment_id": experiment_id,
"timestamp": datetime.now(timezone.utc).isoformat(),
"action": action,
"detail": detail,
}
)
except Exception as e:
logger.error(f"Failed to write audit log: {e}")
def _send_alert(self, experiment_id: str, subject: str, message: str) -> None:
"""Send SNS notification about experiment state change."""
try:
self.sns.publish(
TopicArn=self.alert_topic_arn,
Subject=f"[MangaAssist A/B] {subject}",
Message=f"Experiment: {experiment_id}\n\n{message}",
)
except Exception as e:
logger.error(f"Failed to send alert: {e}")
Python Implementation — StatisticalSignificanceCalculator
"""
StatisticalSignificanceCalculator — Statistical tests for A/B experiment analysis.
Implements Welch's t-test, chi-squared test, confidence intervals,
and early stopping rules for MangaAssist parameter experiments.
"""
import math
import logging
from dataclasses import dataclass
from typing import Optional
logger = logging.getLogger(__name__)
@dataclass
class TTestResult:
"""Result of a Welch's t-test comparison."""
control_mean: float
variant_mean: float
control_std: float
variant_std: float
control_n: int
variant_n: int
t_statistic: float
degrees_of_freedom: float
p_value: float
significant: bool
confidence_interval: tuple[float, float]
effect_size: float # Cohen's d
@dataclass
class ChiSquaredResult:
"""Result of a chi-squared test for proportions."""
control_successes: int
control_total: int
variant_successes: int
variant_total: int
control_rate: float
variant_rate: float
chi_squared: float
p_value: float
significant: bool
@dataclass
class EarlyStopDecision:
"""Result of early stopping evaluation."""
should_stop: bool
reason: str
rule_triggered: Optional[str] # "futility", "harm", "winner", "guardrail"
samples_collected: int
samples_planned: int
class StatisticalSignificanceCalculator:
"""
Performs statistical significance tests for MangaAssist A/B experiments.
Supports:
- Welch's t-test for continuous metrics (quality score, latency)
- Chi-squared test for binary metrics (conversion, hallucination)
- Confidence interval estimation
- Early stopping evaluation
"""
def __init__(self, alpha: float = 0.05):
self.alpha = alpha
def welch_t_test(
self,
control_values: list[float],
variant_values: list[float],
) -> TTestResult:
"""
Perform Welch's t-test comparing control and variant quality scores.
Welch's t-test is preferred over Student's t-test because it does not
assume equal variances between groups — a realistic assumption for
FM outputs where different parameter configs produce different variance.
"""
n_c = len(control_values)
n_v = len(variant_values)
if n_c < 2 or n_v < 2:
raise ValueError(f"Need at least 2 samples per group (got {n_c}, {n_v})")
mean_c = sum(control_values) / n_c
mean_v = sum(variant_values) / n_v
var_c = sum((x - mean_c) ** 2 for x in control_values) / (n_c - 1)
var_v = sum((x - mean_v) ** 2 for x in variant_values) / (n_v - 1)
std_c = math.sqrt(var_c)
std_v = math.sqrt(var_v)
# Standard error of the difference
se = math.sqrt(var_c / n_c + var_v / n_v)
if se == 0:
return TTestResult(
control_mean=mean_c, variant_mean=mean_v,
control_std=std_c, variant_std=std_v,
control_n=n_c, variant_n=n_v,
t_statistic=0.0, degrees_of_freedom=n_c + n_v - 2,
p_value=1.0, significant=False,
confidence_interval=(0.0, 0.0), effect_size=0.0
)
t_stat = (mean_v - mean_c) / se
# Welch-Satterthwaite degrees of freedom
num = (var_c / n_c + var_v / n_v) ** 2
denom = (var_c / n_c) ** 2 / (n_c - 1) + (var_v / n_v) ** 2 / (n_v - 1)
df = num / denom if denom > 0 else 1.0
# Two-tailed p-value
p_value = self._t_distribution_two_tailed_p(abs(t_stat), df)
# 95% confidence interval for difference in means
t_critical = self._t_critical_value(df, self.alpha)
ci_lower = (mean_v - mean_c) - t_critical * se
ci_upper = (mean_v - mean_c) + t_critical * se
# Cohen's d effect size
pooled_std = math.sqrt(
((n_c - 1) * var_c + (n_v - 1) * var_v) / (n_c + n_v - 2)
)
effect_size = (mean_v - mean_c) / pooled_std if pooled_std > 0 else 0.0
return TTestResult(
control_mean=round(mean_c, 4),
variant_mean=round(mean_v, 4),
control_std=round(std_c, 4),
variant_std=round(std_v, 4),
control_n=n_c,
variant_n=n_v,
t_statistic=round(t_stat, 4),
degrees_of_freedom=round(df, 1),
p_value=round(p_value, 6),
significant=p_value < self.alpha,
confidence_interval=(round(ci_lower, 4), round(ci_upper, 4)),
effect_size=round(effect_size, 4),
)
def chi_squared_test(
self,
control_successes: int,
control_total: int,
variant_successes: int,
variant_total: int,
) -> ChiSquaredResult:
"""
Chi-squared test for comparing two proportions.
Used for binary outcomes: e.g., "did the user click a recommended title?"
or "was the response flagged by guardrails?"
"""
if control_total == 0 or variant_total == 0:
raise ValueError("Total counts must be > 0")
rate_c = control_successes / control_total
rate_v = variant_successes / variant_total
# Expected counts under H0 (pooled rate)
total_success = control_successes + variant_successes
total_n = control_total + variant_total
pooled_rate = total_success / total_n
# 2x2 contingency table expected values
e_c_success = control_total * pooled_rate
e_c_failure = control_total * (1 - pooled_rate)
e_v_success = variant_total * pooled_rate
e_v_failure = variant_total * (1 - pooled_rate)
# Chi-squared statistic
chi2 = 0.0
observed = [
control_successes, control_total - control_successes,
variant_successes, variant_total - variant_successes,
]
expected = [e_c_success, e_c_failure, e_v_success, e_v_failure]
for obs, exp in zip(observed, expected):
if exp > 0:
chi2 += (obs - exp) ** 2 / exp
# p-value from chi-squared distribution with df=1
p_value = self._chi2_survival(chi2, df=1)
return ChiSquaredResult(
control_successes=control_successes,
control_total=control_total,
variant_successes=variant_successes,
variant_total=variant_total,
control_rate=round(rate_c, 4),
variant_rate=round(rate_v, 4),
chi_squared=round(chi2, 4),
p_value=round(p_value, 6),
significant=p_value < self.alpha,
)
def evaluate_early_stopping(
self,
control_values: list[float],
variant_values: list[float],
planned_samples: int,
guardrail_breached: bool = False,
) -> EarlyStopDecision:
"""
Evaluate whether to stop the experiment early.
Rules applied in order of priority:
1. Guardrail breach → immediate stop
2. Harm detection → variant significantly worse at p < 0.01
3. Winner detection → variant significantly better at p < 0.001 (at 50%+ samples)
4. Futility → p > 0.95 at 50%+ samples
"""
current_samples = min(len(control_values), len(variant_values))
progress = current_samples / planned_samples if planned_samples > 0 else 0
# Rule 1: Guardrail breach
if guardrail_breached:
return EarlyStopDecision(
should_stop=True,
reason="Guardrail metric breached — stopping immediately",
rule_triggered="guardrail",
samples_collected=current_samples,
samples_planned=planned_samples,
)
# Need minimum samples for statistical tests
if current_samples < 30:
return EarlyStopDecision(
should_stop=False,
reason=f"Insufficient samples for analysis ({current_samples} < 30)",
rule_triggered=None,
samples_collected=current_samples,
samples_planned=planned_samples,
)
# Run t-test
result = self.welch_t_test(control_values, variant_values)
# Rule 2: Harm detection — variant is worse
if result.variant_mean < result.control_mean and result.p_value < 0.01:
return EarlyStopDecision(
should_stop=True,
reason=(
f"Variant is significantly WORSE than control "
f"(p={result.p_value:.4f}, effect={result.effect_size:.3f})"
),
rule_triggered="harm",
samples_collected=current_samples,
samples_planned=planned_samples,
)
# Rules 3 & 4 only apply after 50% of planned samples
if progress < 0.50:
return EarlyStopDecision(
should_stop=False,
reason=f"Only {progress:.0%} of planned samples collected — too early for stopping rules",
rule_triggered=None,
samples_collected=current_samples,
samples_planned=planned_samples,
)
# Rule 3: Early winner
if result.variant_mean > result.control_mean and result.p_value < 0.001:
return EarlyStopDecision(
should_stop=True,
reason=(
f"Early winner detected (p={result.p_value:.6f}, "
f"improvement={result.variant_mean - result.control_mean:.4f})"
),
rule_triggered="winner",
samples_collected=current_samples,
samples_planned=planned_samples,
)
# Rule 4: Futility
if result.p_value > 0.95:
return EarlyStopDecision(
should_stop=True,
reason=(
f"Futility stop: p-value={result.p_value:.4f} indicates "
f"no meaningful difference is likely"
),
rule_triggered="futility",
samples_collected=current_samples,
samples_planned=planned_samples,
)
return EarlyStopDecision(
should_stop=False,
reason="Experiment in progress — no stopping rule triggered",
rule_triggered=None,
samples_collected=current_samples,
samples_planned=planned_samples,
)
@staticmethod
def _t_distribution_two_tailed_p(t: float, df: float) -> float:
"""
Approximate two-tailed p-value from the t-distribution.
Uses the normal approximation for df > 30 and a lookup-based
approximation for smaller df. In production, use scipy.stats.t.sf().
"""
if df > 30:
p_one_tail = 0.5 * math.erfc(t / math.sqrt(2))
return 2 * p_one_tail
# Conservative approximation for small df
if t > 4.0:
return 0.001
elif t > 3.0:
return 0.005
elif t > 2.5:
return 0.02
elif t > 2.0:
return 0.06
elif t > 1.5:
return 0.15
elif t > 1.0:
return 0.33
else:
return 0.50
@staticmethod
def _t_critical_value(df: float, alpha: float) -> float:
"""
Approximate critical value for two-tailed t-test.
For df > 30, uses z-approximation. In production, use scipy.stats.t.ppf().
"""
if df > 30:
# z critical values for common alpha levels
z_map = {0.01: 2.576, 0.05: 1.960, 0.10: 1.645}
return z_map.get(alpha, 1.960)
# Conservative for small df
return 2.1
@staticmethod
def _chi2_survival(chi2: float, df: int) -> float:
"""
Approximate survival function (1 - CDF) for chi-squared distribution.
Uses the normal approximation for the chi-squared statistic.
In production, use scipy.stats.chi2.sf().
"""
if df != 1:
raise ValueError("Only df=1 implemented in this approximation")
z = math.sqrt(chi2)
return math.erfc(z / math.sqrt(2))
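A usage sketch of the calculator; the score lists are illustrative stand-ins for values read from the ABResults table.

```python
calc = StatisticalSignificanceCalculator(alpha=0.05)

control_scores = [0.86, 0.90, 0.88, 0.84, 0.91, 0.87, 0.89, 0.85] * 50  # 400 observations
variant_scores = [0.89, 0.92, 0.90, 0.87, 0.93, 0.90, 0.91, 0.88] * 50

t_result = calc.welch_t_test(control_scores, variant_scores)
print(f"diff={t_result.variant_mean - t_result.control_mean:+.3f}, "
      f"p={t_result.p_value}, significant={t_result.significant}")

stop = calc.evaluate_early_stopping(control_scores, variant_scores, planned_samples=1950)
print(stop.should_stop, "-", stop.reason)
```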
Key Takeaways
- A/B testing is not optional — Intuition about "better" parameters fails regularly. Only controlled experiments with statistical significance testing can confirm improvements.
- Calculate sample size before launching — Running an underpowered experiment wastes traffic and produces inconclusive results. Use the sample size calculator for every experiment.
- Guardrail metrics protect against blind optimization — A variant can improve quality score while doubling latency. Always enforce guardrail thresholds.
- Gradual rollout prevents disasters — Even statistically significant results can fail at scale. The 5% to 20% to 50% to 100% ramp catches edge cases.
- Session-level assignment prevents contamination — Always assign users to variants based on session ID, not request ID; a user who sees both variants within one conversation produces unreliable data (see the sketch after this list).
- Practical significance matters as much as statistical significance — A p-value < 0.05 with a 0.001 quality improvement is not worth a 20% cost increase. Always evaluate the cost-benefit ratio.
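A minimal sketch of deterministic session-level assignment, referenced in the takeaway above; the hash scheme and variant split are illustrative.

```python
import hashlib

def assign_variant(session_id: str, experiment_id: str, variant_traffic_pct: float) -> str:
    """Deterministically map a session to an experiment arm.

    Hashing session_id together with experiment_id keeps every request in a
    conversation on the same arm and keeps assignments independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment_id}:{session_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform value in [0, 1]
    return "variant_a" if bucket < variant_traffic_pct else "control"

# The same session always lands on the same arm for a given experiment
assert assign_variant("sess-123", "exp-2024-ps-temp-055", 0.20) == \
       assign_variant("sess-123", "exp-2024-ps-temp-055", 0.20)
```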