
FM Performance Enhancement Architecture

AWS AIP-C01 Task 4.2 — Skill 4.2.4: Optimize FM performance through parameter tuning, A/B testing, and prompt engineering
Context: MangaAssist JP manga store chatbot — Bedrock Claude 3 (Sonnet/Haiku), OpenSearch Serverless, DynamoDB, ECS Fargate
Intents: product_search, order_status, recommendation, manga_qa, chitchat, shipping_info


Skill Mapping

| Certification | Domain | Task | Skill |
| --- | --- | --- | --- |
| AWS AIP-C01 | Domain 4 — Operational Efficiency | Task 4.2 — Optimize FM performance | Skill 4.2.4 — Enhance FM performance through parameter configuration, A/B testing, and systematic prompt engineering |

Skill scope: Design per-intent parameter profiles (temperature, top_k, top_p, max_tokens), implement A/B testing frameworks for parameter optimization, apply prompt engineering techniques to maximize FM output quality, and build automated pipelines for continuous performance enhancement.


Mind Map — FM Performance Enhancement Dimensions

mindmap
  root((FM Performance<br/>Enhancement))
    Parameter Configuration
      Temperature
        Low 0.1-0.3 — Factual
        Medium 0.4-0.6 — Balanced
        High 0.7-0.9 — Creative
      Top-k Sampling
        k=40 — Constrained Factual
        k=100 — Exploratory
        k=250 — Maximum Diversity
      Top-p Nucleus
        p=0.7 — Conservative
        p=0.9 — Balanced Diversity
        p=0.95 — Near-Full Vocabulary
      Max Tokens
        Short 128 — Chitchat
        Medium 512 — Q&A
        Long 1024 — Recommendations
      Stop Sequences
        Intent-Specific Stops
        Safety Boundary Tokens
    A/B Testing
      Experiment Design
        Control vs Variant
        Sample Size Calculation
        Traffic Splitting
      Statistical Analysis
        Chi-squared Test
        Welch t-test
        Confidence Intervals
      Quality Metrics
        Human Evaluation Score
        Automated Relevance
        Latency Impact
        Token Efficiency
      Rollout Strategy
        5% Canary
        20% Validation
        50% Expansion
        100% Full Rollout
    Prompt Engineering
      System Prompt Optimization
        Role Definition
        Constraint Specification
        Output Format Directives
      Few-Shot Selection
        Intent-Aligned Examples
        Diversity in Examples
        Edge Case Coverage
      Instruction Clarity
        Step-by-Step Decomposition
        Explicit Constraints
        Negative Examples
      Chain-of-Thought
        Reasoning Traces
        Self-Verification
    Generation Controls
      Stop Sequences
        Per-Intent Terminators
        Safety Boundaries
      Frequency Penalty
        Reduce Repetition
        Balance Creativity
      Response Validation
        Schema Conformance
        Hallucination Check
    Per-Intent Profiles
      product_search
      order_status
      recommendation
      manga_qa
      chitchat
      shipping_info

Temperature Selection Guide

Temperature controls the randomness of token selection. It directly affects the trade-off between deterministic accuracy and creative diversity in MangaAssist responses.

How Temperature Works

At each generation step the model produces a probability distribution over all tokens. Temperature scales the logits before softmax:

P(token_i) = exp(logit_i / T) / sum(exp(logit_j / T))
  • T → 0: Distribution collapses to the highest-probability token (greedy decoding). Factual, repetitive.
  • T = 1.0: Original distribution. The model's native confidence is preserved.
  • T > 1.0: Distribution flattens. Low-probability tokens become more likely. Creative but risky.
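The scaling formula above can be checked with a short standalone sketch (plain Python with toy logits, not the actual model's vocabulary):

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Scale logits by 1/T, then apply softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # toy next-token logits
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At T=0.2 the top token absorbs nearly all the probability mass (near-greedy decoding); at T=2.0 the distribution flattens and lower-ranked tokens become plausible picks.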

MangaAssist Temperature Zones

graph LR
    subgraph "Low Temperature 0.1 - 0.3"
        A[order_status] --> A1["T=0.15<br/>Exact dates, tracking IDs<br/>Zero hallucination tolerance"]
        B[shipping_info] --> B1["T=0.20<br/>Policy statements<br/>Carrier information"]
    end
    subgraph "Medium Temperature 0.4 - 0.6"
        C[product_search] --> C1["T=0.45<br/>Balanced product descriptions<br/>Accurate metadata + engaging text"]
        D[manga_qa] --> D1["T=0.50<br/>Knowledgeable answers<br/>Allow nuanced phrasing"]
    end
    subgraph "High Temperature 0.7 - 0.9"
        E[recommendation] --> E1["T=0.80<br/>Creative suggestions<br/>Diverse title discovery"]
        F[chitchat] --> F1["T=0.75<br/>Natural conversation<br/>Personality and warmth"]
    end
    style A1 fill:#d4edda
    style B1 fill:#d4edda
    style C1 fill:#fff3cd
    style D1 fill:#fff3cd
    style E1 fill:#f8d7da
    style F1 fill:#f8d7da
| Zone | Temperature | Use Case | Rationale | Risk if Wrong |
| --- | --- | --- | --- | --- |
| Low | 0.1 - 0.3 | Factual queries: order status, shipping policy | Must not hallucinate dates, tracking numbers, or policy details | Higher temps produce fabricated delivery dates |
| Medium | 0.4 - 0.6 | Product descriptions, manga knowledge Q&A | Needs accuracy but benefits from natural, engaging phrasing | Too low = robotic; too high = inaccurate metadata |
| High | 0.7 - 0.9 | Recommendations, casual conversation | Creative diversity is the goal; users want surprising discoveries | Too low = repetitive "safe" picks; too high = incoherent |

Top-k and Top-p Selection

Top-k Sampling

Top-k restricts the model to choosing from only the k most probable tokens at each step. Tokens outside the top-k are zeroed out before sampling.

| top_k Value | Behavior | MangaAssist Use Case |
| --- | --- | --- |
| 40 | Constrained — only high-confidence tokens survive | order_status, shipping_info — factual precision |
| 100 | Balanced — moderate vocabulary exploration | product_search, manga_qa — informative with variety |
| 250 | Exploratory — wide vocabulary access | recommendation, chitchat — creative and diverse |

Top-p (Nucleus) Sampling

Top-p selects from the smallest set of tokens whose cumulative probability exceeds p. Unlike top-k, it adapts dynamically: when the model is confident, fewer tokens are considered; when uncertain, more are included.

| top_p Value | Behavior | MangaAssist Use Case |
| --- | --- | --- |
| 0.70 | Conservative — only the probability mass "core" | order_status — no room for low-probability tangents |
| 0.90 | Balanced diversity — the standard choice | product_search, manga_qa, shipping_info |
| 0.95 | Near-full vocabulary — maximum creative range | recommendation, chitchat |

Interaction Between top_k and top_p

When both are set, the filters apply in sequence, so the effective candidate set is whatever survives the more restrictive of the two at each step. For MangaAssist, the recommended strategy:

  1. Set top_k as the hard ceiling on vocabulary size
  2. Set top_p as the adaptive filter within that ceiling
  3. Lower both for factual intents, raise both for creative intents
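The three steps above can be sketched as a standalone filter (an illustrative model of the sampler, not the exact Bedrock implementation):

```python
def filter_top_k_top_p(probs: dict[str, float], top_k: int, top_p: float) -> dict[str, float]:
    """Apply top-k as a hard ceiling, then nucleus (top-p) within it, and renormalize."""
    # 1. Hard ceiling: keep only the k most probable tokens
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # 2. Adaptive filter: smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # 3. Renormalize the surviving candidates
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

probs = {"the": 0.45, "a": 0.25, "one": 0.15, "that": 0.10, "zebra": 0.05}
print(filter_top_k_top_p(probs, top_k=4, top_p=0.70))
```

Here top_k=4 removes "zebra" outright, then top_p=0.70 keeps only "the" and "a" (cumulative 0.45 + 0.25 = 0.70), demonstrating why lowering both parameters tightens the response toward high-confidence tokens.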

Per-Intent Parameter Profiles

Complete Configuration Table

| Intent | Temperature | top_k | top_p | max_tokens | Stop Sequences | Quality Score Target | Rationale |
| --- | --- | --- | --- | --- | --- | --- | --- |
| order_status | 0.15 | 40 | 0.70 | 256 | ["\n\nUser:", "\n\nHuman:"] | >0.95 | Factual: tracking IDs, dates, statuses must be exact. Zero tolerance for hallucinated delivery dates. |
| shipping_info | 0.20 | 40 | 0.80 | 384 | ["\n\nUser:", "\n\nHuman:"] | >0.93 | Policy-based: carrier names, timeframes, and costs must match official data. Slight flexibility in phrasing. |
| product_search | 0.45 | 100 | 0.90 | 512 | ["\n\nUser:", "\n\nHuman:"] | >0.88 | Balanced: accurate ISBNs, prices, and metadata with engaging natural-language descriptions. |
| manga_qa | 0.50 | 100 | 0.90 | 768 | ["\n\nUser:", "\n\nHuman:"] | >0.85 | Knowledge: correct facts about manga series, authors, and genres with nuanced, conversational explanations. |
| recommendation | 0.80 | 250 | 0.95 | 1024 | ["\n\nUser:", "\n\nHuman:"] | >0.80 | Creative: diverse title suggestions that go beyond the obvious. Users want to discover new series. |
| chitchat | 0.75 | 200 | 0.95 | 128 | ["\n\nUser:", "\n\nHuman:"] | >0.78 | Conversational: warm, natural personality. Short responses. Variety in phrasing prevents robotic feel. |

Prompt Engineering for Performance

System Prompt Optimization

Each intent uses a tailored system prompt that constrains the model's behavior:

SYSTEM_PROMPTS = {
    "order_status": (
        "You are MangaAssist, a precise order-tracking assistant for a Japanese manga store. "
        "RULES:\n"
        "1. Only report data retrieved from the order database. Never estimate or guess dates.\n"
        "2. If a field is unavailable, say 'I don't have that information yet' — do not fabricate.\n"
        "3. Format dates as YYYY-MM-DD.\n"
        "4. Include the order ID and tracking number in every response.\n"
        "5. If the order status is 'processing', do NOT predict a delivery date."
    ),
    "recommendation": (
        "You are MangaAssist, an enthusiastic manga recommendation specialist. "
        "RULES:\n"
        "1. Suggest 3-5 titles per request, mixing popular and hidden gems.\n"
        "2. Explain WHY each title fits the user's taste (genre, art style, themes).\n"
        "3. Never recommend the same set of titles twice in a session — check conversation history.\n"
        "4. Include genre tags and volume count for each recommendation.\n"
        "5. If the user's preferences are vague, ask one clarifying question before recommending."
    ),
    "product_search": (
        "You are MangaAssist, a knowledgeable product search assistant for a Japanese manga store. "
        "RULES:\n"
        "1. Present search results with title, author, price, and availability.\n"
        "2. Only cite products that appear in the retrieved context — never invent titles.\n"
        "3. If fewer than 3 results match, suggest broadening the search.\n"
        "4. Highlight any ongoing discounts or bundle deals.\n"
        "5. Format results as a numbered list for easy scanning."
    ),
}

Few-Shot Example Selection Strategy

Few-shot examples dramatically improve output consistency. The selection strategy for MangaAssist:

| Strategy | Description | Best For |
| --- | --- | --- |
| Intent-Aligned | Examples match the current intent exactly | All intents — baseline strategy |
| Edge-Case Coverage | Include examples of tricky/ambiguous queries | product_search (misspelled titles), manga_qa (spoiler avoidance) |
| Diversity Sampling | Examples span different sub-categories | recommendation (mix genres in examples) |
| Negative Examples | Show what NOT to do | order_status (example of a hallucinated date with correction) |
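One way to combine these strategies is a small selector that leads with intent-aligned examples and reserves the last slot for an edge case. This is an illustrative sketch; the example pool and its `q`/`edge_case` fields are hypothetical, not part of the MangaAssist codebase:

```python
def select_few_shot(pool: list[dict], intent: str, count: int) -> list[dict]:
    """Pick examples for an intent: intent-aligned first, one edge case in the last slot."""
    aligned = [ex for ex in pool if ex["intent"] == intent]
    edge_cases = [ex for ex in aligned if ex.get("edge_case")]
    regular = [ex for ex in aligned if not ex.get("edge_case")]
    if count > 1 and edge_cases:
        return regular[: count - 1] + edge_cases[:1]
    return regular[:count]

EXAMPLE_POOL = [
    {"intent": "product_search", "q": "one piece vol 1", "edge_case": False},
    {"intent": "product_search", "q": "wan pees vol 1", "edge_case": True},  # misspelled title
    {"intent": "product_search", "q": "naruto box set", "edge_case": False},
    {"intent": "chitchat", "q": "hello!", "edge_case": False},
]
print(select_few_shot(EXAMPLE_POOL, "product_search", count=3))
```

The same structure extends to diversity sampling (group the pool by sub-category before slicing) and negative examples (tag corrected failure cases and always include one for factual intents).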

Stop Sequences and Generation Controls

Stop sequences prevent the model from generating beyond the intended response boundary:

| Control | Value | Purpose |
| --- | --- | --- |
| stop_sequences | ["\n\nUser:", "\n\nHuman:", "\n\n---"] | Prevent the model from simulating the next user turn |
| max_tokens | Per-intent (see table above) | Hard ceiling on generation length |
| frequency_penalty | 0.3 for recommendation, 0.0 for order_status | Reduce repetitive phrasing in creative intents |
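Putting these controls together, a profile row can be assembled into a Claude 3 Messages API request body. This is a sketch: the field names and `anthropic_version` string follow the Bedrock Claude Messages format, the profile values come from the order_status row above, and `frequency_penalty` is deliberately left out because the Claude Messages API does not define that field (it would have to be handled outside the request):

```python
import json

def build_claude_request(profile: dict, system_prompt: str, user_message: str) -> str:
    """Assemble a Bedrock Claude 3 Messages API request body from a parameter profile."""
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": profile["max_tokens"],
        "temperature": profile["temperature"],
        "top_k": profile["top_k"],
        "top_p": profile["top_p"],
        "stop_sequences": profile["stop_sequences"],
        "system": system_prompt,
        "messages": [{"role": "user", "content": user_message}],
    }
    return json.dumps(body)

order_status_profile = {
    "temperature": 0.15, "top_k": 40, "top_p": 0.70,
    "max_tokens": 256, "stop_sequences": ["\n\nUser:", "\n\nHuman:"],
}
payload = build_claude_request(order_status_profile, "You are MangaAssist...", "Where is order #1234?")
print(payload)
```

The resulting string would be passed as the `body` argument of `bedrock_runtime.invoke_model(...)`.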

Architecture — Parameter Configuration Pipeline

graph TB
    subgraph "Request Ingestion"
        A[User Message via WebSocket] --> B[API Gateway]
        B --> C[ECS Orchestrator]
    end

    subgraph "Intent Classification"
        C --> D[Bedrock Haiku — Intent Classifier]
        D --> E{Classified Intent}
    end

    subgraph "Parameter Resolution"
        E --> F[ParameterProfileManager]
        F --> G[(DynamoDB<br/>Parameter Profiles)]
        F --> H{A/B Test Active?}
        H -- Yes --> I[ABTestingFramework<br/>Assign Variant]
        I --> J[(DynamoDB<br/>Experiment Config)]
        H -- No --> K[Use Default Profile]
        I --> L[Resolved Parameters]
        K --> L
    end

    subgraph "Prompt Assembly"
        L --> M[System Prompt Selection]
        M --> N[Few-Shot Example Injection]
        N --> O[Context from OpenSearch RAG]
        O --> P[Final Prompt + Parameters]
    end

    subgraph "FM Invocation"
        P --> Q[Bedrock InvokeModel<br/>Claude 3 Sonnet]
        Q --> R[Response + Metrics]
    end

    subgraph "Metrics Collection"
        R --> S[Quality Scorer]
        S --> T[CloudWatch Metrics<br/>per intent, per variant]
        T --> U[A/B Analysis Dashboard]
    end

    style F fill:#e1f5fe
    style I fill:#fff9c4
    style S fill:#f3e5f5

Python Implementation — ParameterProfileManager

"""
ParameterProfileManager — Per-intent FM parameter configuration for MangaAssist.

Loads parameter profiles from DynamoDB, supports A/B test overrides,
and provides the correct temperature/top_k/top_p/max_tokens for each intent.
"""

import boto3
import logging
from dataclasses import dataclass, asdict
from typing import Optional
from datetime import datetime, timezone

logger = logging.getLogger(__name__)

DYNAMODB_TABLE = "MangaAssist-ParameterProfiles"


@dataclass
class ParameterProfile:
    """FM invocation parameters for a single intent."""
    intent: str
    temperature: float
    top_k: int
    top_p: float
    max_tokens: int
    stop_sequences: list[str]
    frequency_penalty: float
    system_prompt_key: str
    few_shot_count: int
    quality_score_target: float
    version: str
    updated_at: str

    def to_bedrock_params(self) -> dict:
        """Convert to Bedrock InvokeModel parameter format.

        frequency_penalty is intentionally omitted: Claude models on Bedrock
        do not accept it, so it is applied via prompting/post-processing instead.
        """
        return {
            "temperature": self.temperature,
            "top_k": self.top_k,
            "top_p": self.top_p,
            "max_tokens": self.max_tokens,
            "stop_sequences": self.stop_sequences,
        }


# Default profiles — used when DynamoDB is unavailable or for bootstrapping
DEFAULT_PROFILES: dict[str, ParameterProfile] = {
    "order_status": ParameterProfile(
        intent="order_status",
        temperature=0.15,
        top_k=40,
        top_p=0.70,
        max_tokens=256,
        stop_sequences=["\n\nUser:", "\n\nHuman:"],
        frequency_penalty=0.0,
        system_prompt_key="order_status_v3",
        few_shot_count=2,
        quality_score_target=0.95,
        version="default-v1",
        updated_at="2025-01-01T00:00:00Z",
    ),
    "shipping_info": ParameterProfile(
        intent="shipping_info",
        temperature=0.20,
        top_k=40,
        top_p=0.80,
        max_tokens=384,
        stop_sequences=["\n\nUser:", "\n\nHuman:"],
        frequency_penalty=0.0,
        system_prompt_key="shipping_info_v2",
        few_shot_count=2,
        quality_score_target=0.93,
        version="default-v1",
        updated_at="2025-01-01T00:00:00Z",
    ),
    "product_search": ParameterProfile(
        intent="product_search",
        temperature=0.45,
        top_k=100,
        top_p=0.90,
        max_tokens=512,
        stop_sequences=["\n\nUser:", "\n\nHuman:"],
        frequency_penalty=0.1,
        system_prompt_key="product_search_v4",
        few_shot_count=3,
        quality_score_target=0.88,
        version="default-v1",
        updated_at="2025-01-01T00:00:00Z",
    ),
    "manga_qa": ParameterProfile(
        intent="manga_qa",
        temperature=0.50,
        top_k=100,
        top_p=0.90,
        max_tokens=768,
        stop_sequences=["\n\nUser:", "\n\nHuman:"],
        frequency_penalty=0.1,
        system_prompt_key="manga_qa_v3",
        few_shot_count=2,
        quality_score_target=0.85,
        version="default-v1",
        updated_at="2025-01-01T00:00:00Z",
    ),
    "recommendation": ParameterProfile(
        intent="recommendation",
        temperature=0.80,
        top_k=250,
        top_p=0.95,
        max_tokens=1024,
        stop_sequences=["\n\nUser:", "\n\nHuman:"],
        frequency_penalty=0.3,
        system_prompt_key="recommendation_v5",
        few_shot_count=3,
        quality_score_target=0.80,
        version="default-v1",
        updated_at="2025-01-01T00:00:00Z",
    ),
    "chitchat": ParameterProfile(
        intent="chitchat",
        temperature=0.75,
        top_k=200,
        top_p=0.95,
        max_tokens=128,
        stop_sequences=["\n\nUser:", "\n\nHuman:"],
        frequency_penalty=0.2,
        system_prompt_key="chitchat_v2",
        few_shot_count=1,
        quality_score_target=0.78,
        version="default-v1",
        updated_at="2025-01-01T00:00:00Z",
    ),
}


class ParameterProfileManager:
    """
    Manages per-intent FM parameter profiles for MangaAssist.

    Loads profiles from DynamoDB with fallback to defaults.
    Integrates with ABTestingFramework for experiment overrides.
    """

    def __init__(self, table_name: str = DYNAMODB_TABLE, region: str = "ap-northeast-1"):
        self.dynamodb = boto3.resource("dynamodb", region_name=region)
        self.table = self.dynamodb.Table(table_name)
        self._cache: dict[str, ParameterProfile] = {}
        self._cache_ttl_seconds = 300  # Refresh profiles every 5 minutes
        self._last_refresh: Optional[datetime] = None

    def get_profile(self, intent: str, ab_variant: Optional[str] = None) -> ParameterProfile:
        """
        Retrieve the parameter profile for a given intent.

        Args:
            intent: The classified user intent (e.g., 'order_status').
            ab_variant: Optional A/B test variant ID to apply overrides.

        Returns:
            ParameterProfile with the resolved parameters.
        """
        self._refresh_cache_if_stale()

        # Base profile: cached from DynamoDB or default
        profile_key = f"{intent}:{ab_variant}" if ab_variant else intent
        if profile_key in self._cache:
            return self._cache[profile_key]

        # Try DynamoDB
        profile = self._load_from_dynamodb(intent, ab_variant)
        if profile:
            self._cache[profile_key] = profile
            return profile

        # Fallback to defaults
        if intent in DEFAULT_PROFILES:
            logger.warning(f"Using default profile for intent={intent}")
            return DEFAULT_PROFILES[intent]

        # Unknown intent — use conservative defaults
        logger.error(f"Unknown intent={intent}, using order_status defaults as safest option")
        return DEFAULT_PROFILES["order_status"]

    def _load_from_dynamodb(
        self, intent: str, ab_variant: Optional[str] = None
    ) -> Optional[ParameterProfile]:
        """Load a profile from DynamoDB."""
        try:
            key = {"intent": intent}
            if ab_variant:
                key["variant_id"] = ab_variant

            response = self.table.get_item(Key=key)
            item = response.get("Item")
            if not item:
                return None

            return ParameterProfile(
                intent=item["intent"],
                temperature=float(item["temperature"]),
                top_k=int(item["top_k"]),
                top_p=float(item["top_p"]),
                max_tokens=int(item["max_tokens"]),
                stop_sequences=item.get("stop_sequences", ["\n\nUser:", "\n\nHuman:"]),
                frequency_penalty=float(item.get("frequency_penalty", 0.0)),
                system_prompt_key=item.get("system_prompt_key", f"{intent}_v1"),
                few_shot_count=int(item.get("few_shot_count", 2)),
                quality_score_target=float(item.get("quality_score_target", 0.85)),
                version=item.get("version", "unknown"),
                updated_at=item.get("updated_at", ""),
            )
        except Exception as e:
            logger.error(f"Failed to load profile from DynamoDB: {e}")
            return None

    def _refresh_cache_if_stale(self) -> None:
        """Clear cache if TTL has expired."""
        now = datetime.now(timezone.utc)
        if self._last_refresh is None or (
            now - self._last_refresh
        ).total_seconds() > self._cache_ttl_seconds:
            self._cache.clear()
            self._last_refresh = now

    def update_profile(self, profile: ParameterProfile) -> bool:
        """
        Write an updated profile to DynamoDB.

        Used by the A/B testing system to persist winning configurations.
        """
        try:
            item = asdict(profile)
            item["updated_at"] = datetime.now(timezone.utc).isoformat()
            self.table.put_item(Item=item)
            # Invalidate cache for this intent
            keys_to_remove = [
                k for k in self._cache
                if k == profile.intent or k.startswith(f"{profile.intent}:")
            ]
            for k in keys_to_remove:
                del self._cache[k]
            logger.info(f"Updated profile for intent={profile.intent} version={profile.version}")
            return True
        except Exception as e:
            logger.error(f"Failed to update profile: {e}")
            return False

    def list_all_profiles(self) -> dict[str, ParameterProfile]:
        """Return all current profiles (from cache/DynamoDB or defaults)."""
        profiles = {}
        for intent in DEFAULT_PROFILES:
            profiles[intent] = self.get_profile(intent)
        return profiles
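The A/B framework in the next section assigns sessions to variants via consistent hashing. The core bucketing step (the same SHA-256 scheme used by `ABTestingFramework.assign_variant`) can be verified standalone:

```python
import hashlib

def bucket_for(experiment_id: str, session_id: str) -> float:
    """Map (experiment, session) deterministically into [0.0, 1.0)."""
    digest = hashlib.sha256(f"{experiment_id}:{session_id}".encode("utf-8")).hexdigest()
    return (int(digest, 16) % 10000) / 10000.0

# The same session always lands in the same bucket — no cross-variant contamination
b1 = bucket_for("temp-exp-01", "session-abc")
b2 = bucket_for("temp-exp-01", "session-abc")
print(b1 == b2)

# With a 50/50 split, buckets below 0.5 go to control and the rest to the variant
control_count = sum(
    1 for i in range(1000) if bucket_for("temp-exp-01", f"s{i}") < 0.5
)
print(control_count)
```

Because the hash includes the experiment ID, the same session can land in different buckets across different experiments, so one long-running test does not bias the population seen by another.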

Python Implementation — ABTestingFramework

"""
ABTestingFramework — Statistical A/B testing for FM parameter optimization.

Assigns users to experiment variants, collects quality metrics,
and determines statistical significance using chi-squared and Welch's t-test.
"""

import boto3
import hashlib
import logging
import math
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

logger = logging.getLogger(__name__)

EXPERIMENTS_TABLE = "MangaAssist-ABExperiments"
RESULTS_TABLE = "MangaAssist-ABResults"


@dataclass
class ExperimentVariant:
    """A single variant in an A/B experiment."""
    variant_id: str
    description: str
    parameter_overrides: dict  # e.g., {"temperature": 0.5, "top_k": 80}
    traffic_percentage: float  # 0.0 to 1.0


@dataclass
class Experiment:
    """An A/B test experiment definition."""
    experiment_id: str
    intent: str
    description: str
    status: str  # "draft", "running", "paused", "completed"
    control: ExperimentVariant
    variants: list[ExperimentVariant]
    start_date: str
    end_date: Optional[str]
    min_samples_per_variant: int
    confidence_level: float  # e.g., 0.95
    created_at: str


@dataclass
class ObservationRecord:
    """A single observation from an A/B test."""
    experiment_id: str
    variant_id: str
    session_id: str
    quality_score: float
    relevance_score: float
    latency_ms: float
    token_count: int
    user_satisfaction: Optional[float]
    timestamp: str


class ABTestingFramework:
    """
    Manages A/B experiments for FM parameter tuning in MangaAssist.

    Provides deterministic user-to-variant assignment, metric collection,
    and statistical significance analysis.
    """

    def __init__(self, region: str = "ap-northeast-1"):
        self.dynamodb = boto3.resource("dynamodb", region_name=region)
        self.experiments_table = self.dynamodb.Table(EXPERIMENTS_TABLE)
        self.results_table = self.dynamodb.Table(RESULTS_TABLE)
        self._active_experiments: dict[str, Experiment] = {}

    def assign_variant(
        self, experiment_id: str, session_id: str
    ) -> Optional[ExperimentVariant]:
        """
        Deterministically assign a session to a variant.

        Uses consistent hashing so the same session always gets the same variant.
        This prevents cross-variant contamination within a session.

        Args:
            experiment_id: The experiment to assign for.
            session_id: The user's session ID.

        Returns:
            The assigned ExperimentVariant, or None if the experiment is unknown.
        """
        experiment = self._get_experiment(experiment_id)
        if not experiment:
            return None
        if experiment.status != "running":
            return experiment.control

        # Consistent hash: SHA-256 of (experiment_id + session_id) → [0.0, 1.0)
        hash_input = f"{experiment_id}:{session_id}".encode("utf-8")
        hash_value = int(hashlib.sha256(hash_input).hexdigest(), 16)
        bucket = (hash_value % 10000) / 10000.0  # 0.0000 to 0.9999

        # Walk through variants by traffic allocation
        cumulative = 0.0
        for variant in [experiment.control] + experiment.variants:
            cumulative += variant.traffic_percentage
            if bucket < cumulative:
                logger.info(
                    f"Assigned session={session_id} to variant={variant.variant_id} "
                    f"in experiment={experiment_id}"
                )
                return variant

        # Fallback to control
        return experiment.control

    def record_observation(self, observation: ObservationRecord) -> None:
        """Record a single observation (one FM invocation result) for analysis."""
        try:
            self.results_table.put_item(
                Item={
                    "experiment_id": observation.experiment_id,
                    "observation_key": f"{observation.variant_id}#{observation.timestamp}",
                    "variant_id": observation.variant_id,
                    "session_id": observation.session_id,
                    "quality_score": str(observation.quality_score),
                    "relevance_score": str(observation.relevance_score),
                    "latency_ms": str(observation.latency_ms),
                    "token_count": observation.token_count,
                    "user_satisfaction": (
                        str(observation.user_satisfaction)
                        if observation.user_satisfaction is not None
                        else None
                    ),
                    "timestamp": observation.timestamp,
                }
            )
        except Exception as e:
            logger.error(f"Failed to record observation: {e}")

    def analyze_experiment(self, experiment_id: str) -> dict:
        """
        Compute statistical significance for an experiment.

        Returns:
            Dict with per-variant stats, t-test results, and recommendations.
        """
        experiment = self._get_experiment(experiment_id)
        if not experiment:
            return {"error": f"Experiment {experiment_id} not found"}

        # Collect observations per variant
        observations = self._load_observations(experiment_id)
        control_obs = observations.get(experiment.control.variant_id, [])

        results = {
            "experiment_id": experiment_id,
            "intent": experiment.intent,
            "status": experiment.status,
            "control": self._compute_variant_stats(
                experiment.control.variant_id, control_obs
            ),
            "variants": [],
            "recommendation": None,
        }

        best_variant = None
        best_improvement = 0.0

        for variant in experiment.variants:
            variant_obs = observations.get(variant.variant_id, [])
            stats = self._compute_variant_stats(variant.variant_id, variant_obs)

            # Welch's t-test: compare quality scores
            if len(control_obs) >= 30 and len(variant_obs) >= 30:
                control_scores = [o.quality_score for o in control_obs]
                variant_scores = [o.quality_score for o in variant_obs]
                t_stat, p_value = self._welch_t_test(control_scores, variant_scores)
                stats["t_statistic"] = round(t_stat, 4)
                stats["p_value"] = round(p_value, 6)
                stats["significant"] = p_value < (1 - experiment.confidence_level)

                improvement = stats["mean_quality"] - results["control"]["mean_quality"]
                stats["quality_improvement"] = round(improvement, 4)

                if stats["significant"] and improvement > best_improvement:
                    best_improvement = improvement
                    best_variant = variant.variant_id
            else:
                stats["t_statistic"] = None
                stats["p_value"] = None
                stats["significant"] = None
                stats["note"] = (
                    f"Insufficient samples: control={len(control_obs)}, "
                    f"variant={len(variant_obs)} (need 30+ each)"
                )

            results["variants"].append(stats)

        # Recommendation
        if best_variant:
            results["recommendation"] = {
                "action": "promote",
                "variant_id": best_variant,
                "quality_improvement": round(best_improvement, 4),
                "message": (
                    f"Variant '{best_variant}' shows statistically significant "
                    f"improvement of {best_improvement:.2%} in quality score. "
                    f"Recommend promoting to default profile for intent='{experiment.intent}'."
                ),
            }
        else:
            results["recommendation"] = {
                "action": "continue",
                "message": "No variant has reached statistical significance. Continue collecting data.",
            }

        return results

    def _compute_variant_stats(self, variant_id: str, observations: list) -> dict:
        """Compute summary statistics for a variant."""
        if not observations:
            return {
                "variant_id": variant_id,
                "sample_count": 0,
                "mean_quality": 0.0,
                "std_quality": 0.0,
                "mean_latency_ms": 0.0,
                "mean_tokens": 0.0,
            }

        quality_scores = [o.quality_score for o in observations]
        latencies = [o.latency_ms for o in observations]
        tokens = [o.token_count for o in observations]

        mean_q = sum(quality_scores) / len(quality_scores)
        std_q = (
            sum((x - mean_q) ** 2 for x in quality_scores) / len(quality_scores)
        ) ** 0.5

        return {
            "variant_id": variant_id,
            "sample_count": len(observations),
            "mean_quality": round(mean_q, 4),
            "std_quality": round(std_q, 4),
            "mean_latency_ms": round(sum(latencies) / len(latencies), 1),
            "mean_tokens": round(sum(tokens) / len(tokens), 1),
        }

    @staticmethod
    def _welch_t_test(sample_a: list[float], sample_b: list[float]) -> tuple[float, float]:
        """
        Welch's t-test for two independent samples with unequal variances.

        Returns (t_statistic, p_value).
        """
        n_a, n_b = len(sample_a), len(sample_b)
        mean_a = sum(sample_a) / n_a
        mean_b = sum(sample_b) / n_b
        var_a = sum((x - mean_a) ** 2 for x in sample_a) / (n_a - 1)
        var_b = sum((x - mean_b) ** 2 for x in sample_b) / (n_b - 1)

        se = math.sqrt(var_a / n_a + var_b / n_b)
        if se == 0:
            return 0.0, 1.0

        t_stat = (mean_a - mean_b) / se

        # Welch-Satterthwaite degrees of freedom
        numerator = (var_a / n_a + var_b / n_b) ** 2
        denominator = (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
        df = numerator / denominator if denominator > 0 else 1

        # Approximate p-value using the t-distribution survival function
        # For production, use scipy.stats.t.sf — here we use a conservative approximation
        p_value = ABTestingFramework._approximate_t_pvalue(abs(t_stat), df)
        return t_stat, p_value * 2  # Two-tailed test

    @staticmethod
    def _approximate_t_pvalue(t: float, df: float) -> float:
        """
        Conservative approximation of the one-tailed p-value.

        For production use scipy.stats.t.sf(t, df) instead.
        This uses the normal approximation valid for df > 30.
        """
        # Normal approximation for large df
        if df > 30:
            z = t
            p = 0.5 * math.erfc(z / math.sqrt(2))
            return p
        # Very rough approximation for small df — use lookup tables in production
        if t > 3.5:
            return 0.001
        elif t > 2.5:
            return 0.01
        elif t > 2.0:
            return 0.03
        elif t > 1.5:
            return 0.07
        else:
            return 0.15

    def _get_experiment(self, experiment_id: str) -> Optional[Experiment]:
        """Retrieve an experiment definition from DynamoDB."""
        if experiment_id in self._active_experiments:
            return self._active_experiments[experiment_id]

        try:
            response = self.experiments_table.get_item(
                Key={"experiment_id": experiment_id}
            )
            item = response.get("Item")
            if not item:
                return None

            control = ExperimentVariant(
                variant_id=item["control"]["variant_id"],
                description=item["control"]["description"],
                parameter_overrides=item["control"].get("parameter_overrides", {}),
                traffic_percentage=float(item["control"]["traffic_percentage"]),
            )

            variants = [
                ExperimentVariant(
                    variant_id=v["variant_id"],
                    description=v["description"],
                    parameter_overrides=v.get("parameter_overrides", {}),
                    traffic_percentage=float(v["traffic_percentage"]),
                )
                for v in item.get("variants", [])
            ]

            experiment = Experiment(
                experiment_id=item["experiment_id"],
                intent=item["intent"],
                description=item["description"],
                status=item["status"],
                control=control,
                variants=variants,
                start_date=item["start_date"],
                end_date=item.get("end_date"),
                min_samples_per_variant=int(item.get("min_samples_per_variant", 500)),
                confidence_level=float(item.get("confidence_level", 0.95)),
                created_at=item["created_at"],
            )

            if experiment.status == "running":
                self._active_experiments[experiment_id] = experiment

            return experiment
        except Exception as e:
            logger.error(f"Failed to load experiment {experiment_id}: {e}")
            return None

    def _load_observations(self, experiment_id: str) -> dict[str, list]:
        """Load all observations for an experiment, grouped by variant."""
        grouped: dict[str, list] = {}
        try:
            from boto3.dynamodb.conditions import Key  # idiomatic import for query conditions

            query_kwargs = {
                "KeyConditionExpression": Key("experiment_id").eq(experiment_id)
            }
            while True:
                response = self.results_table.query(**query_kwargs)
                for item in response.get("Items", []):
                    vid = item["variant_id"]
                    obs = ObservationRecord(
                        experiment_id=item["experiment_id"],
                        variant_id=vid,
                        session_id=item["session_id"],
                        quality_score=float(item["quality_score"]),
                        relevance_score=float(item["relevance_score"]),
                        latency_ms=float(item["latency_ms"]),
                        token_count=int(item["token_count"]),
                        user_satisfaction=(
                            float(item["user_satisfaction"])
                            # Explicit None check: a satisfaction score of 0 is valid data
                            if item.get("user_satisfaction") is not None
                            else None
                        ),
                        timestamp=item["timestamp"],
                    )
                    grouped.setdefault(vid, []).append(obs)
                # A single query returns at most 1 MB of items; follow pagination
                last_key = response.get("LastEvaluatedKey")
                if not last_key:
                    break
                query_kwargs["ExclusiveStartKey"] = last_key
        except Exception as e:
            logger.error(f"Failed to load observations for experiment {experiment_id}: {e}")
        return grouped


# ---------------------------------------------------------------------------
# Minimum sample size calculator (used before launching experiments)
# ---------------------------------------------------------------------------

def calculate_minimum_sample_size(
    baseline_rate: float,
    minimum_detectable_effect: float,
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """
    Calculate minimum sample size per variant for an A/B test.

    Uses the standard formula for comparing two proportions.

    Args:
        baseline_rate: Current quality score (e.g., 0.85).
        minimum_detectable_effect: Smallest improvement worth detecting (e.g., 0.03).
        alpha: Significance level (default 0.05).
        power: Statistical power (default 0.80).

    Returns:
        Minimum samples needed per variant.

    Example:
        >>> calculate_minimum_sample_size(0.85, 0.03)
        2036  # Need ~2000 samples per variant to detect a 3-point improvement
    """
    import scipy.stats as stats  # local import keeps scipy optional at module load time

    z_alpha = stats.norm.ppf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = stats.norm.ppf(power)           # ~0.84 for power = 0.80

    p1 = baseline_rate
    p2 = baseline_rate + minimum_detectable_effect
    p_bar = (p1 + p2) / 2

    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar)) +
                 z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    denominator = (p2 - p1) ** 2

    return math.ceil(numerator / denominator)
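The helper above depends on scipy. If scipy is unavailable at runtime, the same z-quantiles can come from the standard library's `statistics.NormalDist`; a minimal equivalent sketch (the function name is illustrative), which also shows how quickly the required sample count grows as the minimum detectable effect shrinks:

```python
import math
from statistics import NormalDist


def min_sample_size_stdlib(
    baseline_rate: float,
    minimum_detectable_effect: float,
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """Same two-proportion formula, using only the standard library."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for power = 0.80

    p1 = baseline_rate
    p2 = baseline_rate + minimum_detectable_effect
    p_bar = (p1 + p2) / 2

    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar)) +
                 z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


# Smaller effects need disproportionately more samples per variant
print(min_sample_size_stdlib(0.85, 0.03))  # detect a 3-point lift
print(min_sample_size_stdlib(0.85, 0.05))  # detect a 5-point lift: far fewer samples
```

Halving the minimum detectable effect roughly quadruples the required sample size, which is why the MDE should be chosen before the experiment launches, not after.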

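The Welch statistic computed inside the framework can also be exercised in isolation. A minimal standalone sketch with two synthetic samples (the sample values are illustrative, chosen so the expected t and df are exact):

```python
import math


def welch_t(sample_a: list[float], sample_b: list[float]) -> tuple[float, float]:
    """Welch's t-statistic and Welch-Satterthwaite df for two unpaired samples."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = sum(sample_a) / n_a
    mean_b = sum(sample_b) / n_b
    var_a = sum((x - mean_a) ** 2 for x in sample_a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in sample_b) / (n_b - 1)

    se = math.sqrt(var_a / n_a + var_b / n_b)
    t_stat = (mean_a - mean_b) / se

    # Welch-Satterthwaite degrees of freedom
    df = (var_a / n_a + var_b / n_b) ** 2 / (
        (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
    )
    return t_stat, df


# Equal variances (2.5 each), means one unit apart: t = -1.0, df = 8.0 exactly
t, df = welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(round(t, 4), round(df, 4))  # -1.0 8.0
```

In production, scipy's `ttest_ind(a, b, equal_var=False)` performs the same Welch test with an exact t-distribution p-value, which is preferable to the rough approximation used above.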
## Summary — Intent × Parameter × Quality Matrix

| Intent | Temp | top_k | top_p | max_tokens | Quality Target | Rationale |
|---|---|---|---|---|---|---|
| order_status | 0.15 | 40 | 0.70 | 256 | >0.95 | Factual precision is paramount. Hallucinated dates erode trust instantly. |
| shipping_info | 0.20 | 40 | 0.80 | 384 | >0.93 | Policy accuracy required. Slight flexibility in natural phrasing. |
| product_search | 0.45 | 100 | 0.90 | 512 | >0.88 | Balance between accurate metadata and engaging descriptions. |
| manga_qa | 0.50 | 100 | 0.90 | 768 | >0.85 | Knowledge accuracy with conversational nuance. Longer explanations. |
| recommendation | 0.80 | 250 | 0.95 | 1024 | >0.80 | Creative diversity drives discovery. Users want surprising suggestions. |
| chitchat | 0.75 | 200 | 0.95 | 128 | >0.78 | Warm personality with varied phrasing. Short and snappy. |
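The matrix maps directly onto a per-intent profile lookup in code. A minimal sketch (the dict and helper names are illustrative, not part of the MangaAssist codebase; the parameter values come from the table above):

```python
# Per-intent Bedrock inference profiles, values from the summary matrix
INTENT_PROFILES: dict[str, dict] = {
    "order_status":   {"temperature": 0.15, "top_k": 40,  "top_p": 0.70, "max_tokens": 256},
    "shipping_info":  {"temperature": 0.20, "top_k": 40,  "top_p": 0.80, "max_tokens": 384},
    "product_search": {"temperature": 0.45, "top_k": 100, "top_p": 0.90, "max_tokens": 512},
    "manga_qa":       {"temperature": 0.50, "top_k": 100, "top_p": 0.90, "max_tokens": 768},
    "recommendation": {"temperature": 0.80, "top_k": 250, "top_p": 0.95, "max_tokens": 1024},
    "chitchat":       {"temperature": 0.75, "top_k": 200, "top_p": 0.95, "max_tokens": 128},
}


def profile_for(intent: str) -> dict:
    """Unknown intents fall back to the most conservative profile (order_status)."""
    return INTENT_PROFILES.get(intent, INTENT_PROFILES["order_status"])


print(profile_for("recommendation")["temperature"])   # 0.8
print(profile_for("totally_new_intent")["temperature"])  # 0.15 — conservative fallback
```

Keeping the profiles in one table-shaped structure makes the A/B testing layer trivial: a variant is just a shallow `parameter_overrides` merge on top of the control profile.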

## Key Takeaways

1. **One temperature does not fit all.** MangaAssist uses six temperature profiles because order-status precision (T=0.15) and recommendation creativity (T=0.80) are fundamentally different tasks.
2. **top_k and top_p work together.** top_k sets a hard vocabulary ceiling; top_p adapts within it. Both should be tuned per intent.
3. **A/B testing is mandatory.** Intuition about "good" parameters is unreliable. Only statistical testing reveals whether T=0.45 or T=0.55 actually produces better product descriptions.
4. **Prompt engineering amplifies parameter tuning.** The best temperature setting cannot compensate for a vague system prompt. Both must be optimized together.
5. **Fall back to conservative defaults.** Unknown intents should use order_status-level conservative parameters. It is safer to be boring than to hallucinate.
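One practical detail behind takeaway 3: variant assignment must be sticky per session, or a user would bounce between parameter sets mid-conversation and contaminate both arms. A common approach (sketch; the salt format and bucket granularity are illustrative) hashes the session ID into a stable bucket for the 5% canary:

```python
import hashlib


def assign_variant(session_id: str, experiment_id: str, canary_pct: float = 5.0) -> str:
    """Deterministically assign a session to 'control' or 'variant'.

    Hashing (experiment_id, session_id) together keeps the assignment sticky
    for a session and statistically independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment_id}:{session_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000 / 100.0  # uniform bucket in [0, 100)
    return "variant" if bucket < canary_pct else "control"


# Same session always lands in the same arm of the same experiment
assert assign_variant("sess-42", "exp-temp-045") == assign_variant("sess-42", "exp-temp-045")
```

Because assignment is a pure function of the IDs, no assignment table is needed: any ECS Fargate task computes the same answer, and the canary percentage can be raised by changing one number.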