FM Performance Enhancement Architecture
AWS AIP-C01 Task 4.2 — Skill 4.2.4: Optimize FM performance through parameter tuning, A/B testing, and prompt engineering
Context: MangaAssist JP manga store chatbot — Bedrock Claude 3 (Sonnet/Haiku), OpenSearch Serverless, DynamoDB, ECS Fargate
Intents: product_search, order_status, recommendation, manga_qa, chitchat, shipping_info
Skill Mapping
| Certification | Domain | Task | Skill |
|---|---|---|---|
| AWS AIP-C01 | Domain 4 — Operational Efficiency | Task 4.2 — Optimize FM performance | Skill 4.2.4 — Enhance FM performance through parameter configuration, A/B testing, and systematic prompt engineering |
Skill scope: Design per-intent parameter profiles (temperature, top_k, top_p, max_tokens), implement A/B testing frameworks for parameter optimization, apply prompt engineering techniques to maximize FM output quality, and build automated pipelines for continuous performance enhancement.
Mind Map — FM Performance Enhancement Dimensions
mindmap
root((FM Performance<br/>Enhancement))
Parameter Configuration
Temperature
Low 0.1-0.3 — Factual
Medium 0.4-0.6 — Balanced
High 0.7-0.9 — Creative
Top-k Sampling
k=40 — Constrained Factual
k=100 — Exploratory
k=250 — Maximum Diversity
Top-p Nucleus
p=0.7 — Conservative
p=0.9 — Balanced Diversity
p=0.95 — Near-Full Vocabulary
Max Tokens
Short 128 — Chitchat
Medium 512 — Q&A
Long 1024 — Recommendations
Stop Sequences
Intent-Specific Stops
Safety Boundary Tokens
A/B Testing
Experiment Design
Control vs Variant
Sample Size Calculation
Traffic Splitting
Statistical Analysis
Chi-squared Test
Welch t-test
Confidence Intervals
Quality Metrics
Human Evaluation Score
Automated Relevance
Latency Impact
Token Efficiency
Rollout Strategy
5% Canary
20% Validation
50% Expansion
100% Full Rollout
Prompt Engineering
System Prompt Optimization
Role Definition
Constraint Specification
Output Format Directives
Few-Shot Selection
Intent-Aligned Examples
Diversity in Examples
Edge Case Coverage
Instruction Clarity
Step-by-Step Decomposition
Explicit Constraints
Negative Examples
Chain-of-Thought
Reasoning Traces
Self-Verification
Generation Controls
Stop Sequences
Per-Intent Terminators
Safety Boundaries
Frequency Penalty
Reduce Repetition
Balance Creativity
Response Validation
Schema Conformance
Hallucination Check
Per-Intent Profiles
product_search
order_status
recommendation
manga_qa
chitchat
shipping_info
Temperature Selection Guide
Temperature controls the randomness of token selection. It directly affects the trade-off between deterministic accuracy and creative diversity in MangaAssist responses.
How Temperature Works
At each generation step the model produces a probability distribution over all tokens. Temperature scales the logits before softmax:
P(token_i) = exp(logit_i / T) / sum(exp(logit_j / T))
- T → 0: Distribution collapses to the highest-probability token (greedy decoding). Factual, repetitive.
- T = 1.0: Original distribution. The model's native confidence is preserved.
- T > 1.0: Distribution flattens. Low-probability tokens become more likely. Creative but risky.
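The scaling formula above can be sketched in a few lines of plain Python (a toy illustration of the mechanics, not tied to any Bedrock API):

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Scale logits by 1/T, then apply a numerically stabilized softmax."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max to avoid overflow in exp()
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                        # toy next-token logits
cold = softmax_with_temperature(logits, 0.15)   # near-greedy: top token dominates
hot = softmax_with_temperature(logits, 0.80)    # flatter: runner-ups gain mass
```

At T=0.15 the top token takes essentially all of the probability mass; at T=0.80 the runner-up tokens become realistic sampling candidates.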
MangaAssist Temperature Zones
graph LR
subgraph "Low Temperature 0.1 - 0.3"
A[order_status] --> A1["T=0.15<br/>Exact dates, tracking IDs<br/>Zero hallucination tolerance"]
B[shipping_info] --> B1["T=0.20<br/>Policy statements<br/>Carrier information"]
end
subgraph "Medium Temperature 0.4 - 0.6"
C[product_search] --> C1["T=0.45<br/>Balanced product descriptions<br/>Accurate metadata + engaging text"]
D[manga_qa] --> D1["T=0.50<br/>Knowledgeable answers<br/>Allow nuanced phrasing"]
end
subgraph "High Temperature 0.7 - 0.9"
E[recommendation] --> E1["T=0.80<br/>Creative suggestions<br/>Diverse title discovery"]
F[chitchat] --> F1["T=0.75<br/>Natural conversation<br/>Personality and warmth"]
end
style A1 fill:#d4edda
style B1 fill:#d4edda
style C1 fill:#fff3cd
style D1 fill:#fff3cd
style E1 fill:#f8d7da
style F1 fill:#f8d7da
| Zone | Temperature | Use Case | Rationale | Risk if Wrong |
|---|---|---|---|---|
| Low | 0.1 - 0.3 | Factual queries: order status, shipping policy | Must not hallucinate dates, tracking numbers, or policy details | Higher temps produce fabricated delivery dates |
| Medium | 0.4 - 0.6 | Product descriptions, manga knowledge Q&A | Needs accuracy but benefits from natural, engaging phrasing | Too low = robotic; too high = inaccurate metadata |
| High | 0.7 - 0.9 | Recommendations, casual conversation | Creative diversity is the goal; users want surprising discoveries | Too low = repetitive "safe" picks; too high = incoherent |
Top-k and Top-p Selection
Top-k Sampling
Top-k restricts the model to choosing from only the k most probable tokens at each step. Tokens outside the top k are discarded and the remaining probabilities are renormalized before sampling.
| top_k Value | Behavior | MangaAssist Use Case |
|---|---|---|
| 40 | Constrained — only high-confidence tokens survive | order_status, shipping_info — factual precision |
| 100 | Balanced — moderate vocabulary exploration | product_search, manga_qa — informative with variety |
| 250 | Exploratory — wide vocabulary access | recommendation, chitchat — creative and diverse |
Top-p (Nucleus) Sampling
Top-p selects from the smallest set of tokens whose cumulative probability exceeds p. Unlike top-k, it adapts dynamically: when the model is confident, fewer tokens are considered; when uncertain, more are included.
| top_p Value | Behavior | MangaAssist Use Case |
|---|---|---|
| 0.70 | Conservative — only the probability mass "core" | order_status — no room for low-probability tangents |
| 0.90 | Balanced diversity — the standard choice | product_search, manga_qa, shipping_info |
| 0.95 | Near-full vocabulary — maximum creative range | recommendation, chitchat |
Interaction Between top_k and top_p
When both are set, both filters apply — the candidate set is the intersection. For MangaAssist, the recommended strategy:
- Set top_k as the hard ceiling on vocabulary size
- Set top_p as the adaptive filter within that ceiling
- Lower both for factual intents, raise both for creative intents
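The intersection behaviour can be illustrated on a toy distribution (Bedrock applies these filters server-side — this standalone sketch only mimics the mechanics):

```python
def filter_top_k_top_p(probs: dict[str, float], top_k: int, top_p: float) -> dict[str, float]:
    """Keep the top_k most probable tokens, then the smallest prefix of them
    whose cumulative probability reaches top_p, and renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

probs = {"the": 0.40, "a": 0.25, "one": 0.15, "this": 0.10, "that": 0.06, "my": 0.04}
conservative = filter_top_k_top_p(probs, top_k=2, top_p=0.70)   # order_status-style
exploratory = filter_top_k_top_p(probs, top_k=6, top_p=0.95)    # recommendation-style
```

The conservative setting leaves only two candidates; the exploratory setting keeps five of the six tokens, dropping only the long tail.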
Per-Intent Parameter Profiles
Complete Configuration Table
| Intent | Temperature | top_k | top_p | max_tokens | Stop Sequences | Quality Score Target | Rationale |
|---|---|---|---|---|---|---|---|
| order_status | 0.15 | 40 | 0.70 | 256 | ["\n\nUser:", "\n\nHuman:"] | >0.95 | Factual: tracking IDs, dates, and statuses must be exact. Zero tolerance for hallucinated delivery dates. |
| shipping_info | 0.20 | 40 | 0.80 | 384 | ["\n\nUser:", "\n\nHuman:"] | >0.93 | Policy-based: carrier names, timeframes, and costs must match official data. Slight flexibility in phrasing. |
| product_search | 0.45 | 100 | 0.90 | 512 | ["\n\nUser:", "\n\nHuman:"] | >0.88 | Balanced: accurate ISBNs, prices, and metadata with engaging natural-language descriptions. |
| manga_qa | 0.50 | 100 | 0.90 | 768 | ["\n\nUser:", "\n\nHuman:"] | >0.85 | Knowledge: correct facts about manga series, authors, and genres with nuanced, conversational explanations. |
| recommendation | 0.80 | 250 | 0.95 | 1024 | ["\n\nUser:", "\n\nHuman:"] | >0.80 | Creative: diverse title suggestions that go beyond the obvious. Users want to discover new series. |
| chitchat | 0.75 | 200 | 0.95 | 128 | ["\n\nUser:", "\n\nHuman:"] | >0.78 | Conversational: warm, natural personality. Short responses. Variety in phrasing prevents a robotic feel. |
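Putting a profile to work, the snippet below assembles an Anthropic Messages API request body for Bedrock from one row of the table. The helper name and example message are illustrative; the actual invocation (commented out) requires AWS credentials and Bedrock model access.

```python
import json

def build_bedrock_body(profile: dict, system_prompt: str, user_message: str) -> str:
    """Assemble an Anthropic Messages API request body from a per-intent profile."""
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "system": system_prompt,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": profile["temperature"],
        "top_k": profile["top_k"],
        "top_p": profile["top_p"],
        "max_tokens": profile["max_tokens"],
        "stop_sequences": profile["stop_sequences"],
    })

order_status_profile = {  # values taken from the table above
    "temperature": 0.15, "top_k": 40, "top_p": 0.70,
    "max_tokens": 256, "stop_sequences": ["\n\nUser:", "\n\nHuman:"],
}
body = build_bedrock_body(order_status_profile, "You are MangaAssist...", "Where is order #1234?")
# import boto3
# bedrock = boto3.client("bedrock-runtime", region_name="ap-northeast-1")
# response = bedrock.invoke_model(
#     modelId="anthropic.claude-3-sonnet-20240229-v1:0", body=body)
```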
Prompt Engineering for Performance
System Prompt Optimization
Each intent uses a tailored system prompt that constrains the model's behavior:
SYSTEM_PROMPTS = {
"order_status": (
"You are MangaAssist, a precise order-tracking assistant for a Japanese manga store. "
"RULES:\n"
"1. Only report data retrieved from the order database. Never estimate or guess dates.\n"
"2. If a field is unavailable, say 'I don't have that information yet' — do not fabricate.\n"
"3. Format dates as YYYY-MM-DD.\n"
"4. Include the order ID and tracking number in every response.\n"
"5. If the order status is 'processing', do NOT predict a delivery date."
),
"recommendation": (
"You are MangaAssist, an enthusiastic manga recommendation specialist. "
"RULES:\n"
"1. Suggest 3-5 titles per request, mixing popular and hidden gems.\n"
"2. Explain WHY each title fits the user's taste (genre, art style, themes).\n"
"3. Never recommend the same set of titles twice in a session — check conversation history.\n"
"4. Include genre tags and volume count for each recommendation.\n"
"5. If the user's preferences are vague, ask one clarifying question before recommending."
),
"product_search": (
"You are MangaAssist, a knowledgeable product search assistant for a Japanese manga store. "
"RULES:\n"
"1. Present search results with title, author, price, and availability.\n"
"2. Only cite products that appear in the retrieved context — never invent titles.\n"
"3. If fewer than 3 results match, suggest broadening the search.\n"
"4. Highlight any ongoing discounts or bundle deals.\n"
"5. Format results as a numbered list for easy scanning."
),
}
Few-Shot Example Selection Strategy
Few-shot examples dramatically improve output consistency. The selection strategy for MangaAssist:
| Strategy | Description | Best For |
|---|---|---|
| Intent-Aligned | Examples match the current intent exactly | All intents — baseline strategy |
| Edge-Case Coverage | Include examples of tricky/ambiguous queries | product_search (misspelled titles), manga_qa (spoiler avoidance) |
| Diversity Sampling | Examples span different sub-categories | recommendation (mix genres in examples) |
| Negative Examples | Show what NOT to do | order_status (example of hallucinated date with correction) |
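The strategies above can be combined in a small selector. This is a minimal sketch with a hypothetical example pool and tagging scheme — a production system would draw from a curated, evaluated example store:

```python
EXAMPLE_POOL = [  # hypothetical curated examples, tagged by intent and sub-category
    {"intent": "recommendation", "category": "shonen", "prompt": "...", "response": "..."},
    {"intent": "recommendation", "category": "seinen", "prompt": "...", "response": "..."},
    {"intent": "recommendation", "category": "shojo", "prompt": "...", "response": "..."},
    {"intent": "order_status", "category": "negative", "prompt": "...", "response": "..."},
]

def select_few_shot(intent: str, count: int) -> list[dict]:
    """Intent-aligned selection with diversity: at most one example per sub-category."""
    seen_categories: set[str] = set()
    selected: list[dict] = []
    for example in EXAMPLE_POOL:
        if example["intent"] != intent or example["category"] in seen_categories:
            continue
        selected.append(example)
        seen_categories.add(example["category"])
        if len(selected) == count:
            break
    return selected

shots = select_few_shot("recommendation", count=3)  # spans three genres
```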
Stop Sequences and Generation Controls
Stop sequences prevent the model from generating beyond the intended response boundary:
| Control | Value | Purpose |
|---|---|---|
| stop_sequences | ["\n\nUser:", "\n\nHuman:", "\n\n---"] | Prevent the model from simulating the next user turn |
| max_tokens | Per-intent (see table above) | Hard ceiling on generation length |
| frequency_penalty | 0.3 for recommendation, 0.0 for order_status | Reduce repetitive phrasing in creative intents |

Note: the Anthropic Claude request schema on Bedrock does not include a frequency_penalty field; the profiles store it for client-side use only, which is why to_bedrock_params in the code below omits it.
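Bedrock enforces stop sequences server-side, but a defensive client-side truncation is a cheap extra guard. The helper below is illustrative, not part of the Bedrock API:

```python
def truncate_at_stop(text: str, stop_sequences: list[str]) -> str:
    """Cut the response at the earliest stop sequence, if one slipped through."""
    earliest = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            earliest = min(earliest, idx)
    return text[:earliest]

raw = "Your order shipped on 2025-03-02.\n\nUser: thanks"
clean = truncate_at_stop(raw, ["\n\nUser:", "\n\nHuman:"])
# clean == "Your order shipped on 2025-03-02."
```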
Architecture — Parameter Configuration Pipeline
graph TB
subgraph "Request Ingestion"
A[User Message via WebSocket] --> B[API Gateway]
B --> C[ECS Orchestrator]
end
subgraph "Intent Classification"
C --> D[Bedrock Haiku — Intent Classifier]
D --> E{Classified Intent}
end
subgraph "Parameter Resolution"
E --> F[ParameterProfileManager]
F --> G[(DynamoDB<br/>Parameter Profiles)]
F --> H{A/B Test Active?}
H -- Yes --> I[ABTestingFramework<br/>Assign Variant]
I --> J[(DynamoDB<br/>Experiment Config)]
H -- No --> K[Use Default Profile]
I --> L[Resolved Parameters]
K --> L
end
subgraph "Prompt Assembly"
L --> M[System Prompt Selection]
M --> N[Few-Shot Example Injection]
N --> O[Context from OpenSearch RAG]
O --> P[Final Prompt + Parameters]
end
subgraph "FM Invocation"
P --> Q[Bedrock InvokeModel<br/>Claude 3 Sonnet]
Q --> R[Response + Metrics]
end
subgraph "Metrics Collection"
R --> S[Quality Scorer]
S --> T[CloudWatch Metrics<br/>per intent, per variant]
T --> U[A/B Analysis Dashboard]
end
style F fill:#e1f5fe
style I fill:#fff9c4
style S fill:#f3e5f5
Python Implementation — ParameterProfileManager
"""
ParameterProfileManager — Per-intent FM parameter configuration for MangaAssist.
Loads parameter profiles from DynamoDB, supports A/B test overrides,
and provides the correct temperature/top_k/top_p/max_tokens for each intent.
"""
import boto3
import json
import logging
from dataclasses import dataclass, asdict
from typing import Optional
from datetime import datetime, timezone
logger = logging.getLogger(__name__)
DYNAMODB_TABLE = "MangaAssist-ParameterProfiles"
@dataclass
class ParameterProfile:
"""FM invocation parameters for a single intent."""
intent: str
temperature: float
top_k: int
top_p: float
max_tokens: int
stop_sequences: list[str]
frequency_penalty: float
system_prompt_key: str
few_shot_count: int
quality_score_target: float
version: str
updated_at: str
def to_bedrock_params(self) -> dict:
"""Convert to Bedrock InvokeModel parameter format."""
return {
"temperature": self.temperature,
"top_k": self.top_k,
"top_p": self.top_p,
"max_tokens": self.max_tokens,
"stop_sequences": self.stop_sequences,
}
# Default profiles — used when DynamoDB is unavailable or for bootstrapping
DEFAULT_PROFILES: dict[str, ParameterProfile] = {
"order_status": ParameterProfile(
intent="order_status",
temperature=0.15,
top_k=40,
top_p=0.70,
max_tokens=256,
stop_sequences=["\n\nUser:", "\n\nHuman:"],
frequency_penalty=0.0,
system_prompt_key="order_status_v3",
few_shot_count=2,
quality_score_target=0.95,
version="default-v1",
updated_at="2025-01-01T00:00:00Z",
),
"shipping_info": ParameterProfile(
intent="shipping_info",
temperature=0.20,
top_k=40,
top_p=0.80,
max_tokens=384,
stop_sequences=["\n\nUser:", "\n\nHuman:"],
frequency_penalty=0.0,
system_prompt_key="shipping_info_v2",
few_shot_count=2,
quality_score_target=0.93,
version="default-v1",
updated_at="2025-01-01T00:00:00Z",
),
"product_search": ParameterProfile(
intent="product_search",
temperature=0.45,
top_k=100,
top_p=0.90,
max_tokens=512,
stop_sequences=["\n\nUser:", "\n\nHuman:"],
frequency_penalty=0.1,
system_prompt_key="product_search_v4",
few_shot_count=3,
quality_score_target=0.88,
version="default-v1",
updated_at="2025-01-01T00:00:00Z",
),
"manga_qa": ParameterProfile(
intent="manga_qa",
temperature=0.50,
top_k=100,
top_p=0.90,
max_tokens=768,
stop_sequences=["\n\nUser:", "\n\nHuman:"],
frequency_penalty=0.1,
system_prompt_key="manga_qa_v3",
few_shot_count=2,
quality_score_target=0.85,
version="default-v1",
updated_at="2025-01-01T00:00:00Z",
),
"recommendation": ParameterProfile(
intent="recommendation",
temperature=0.80,
top_k=250,
top_p=0.95,
max_tokens=1024,
stop_sequences=["\n\nUser:", "\n\nHuman:"],
frequency_penalty=0.3,
system_prompt_key="recommendation_v5",
few_shot_count=3,
quality_score_target=0.80,
version="default-v1",
updated_at="2025-01-01T00:00:00Z",
),
"chitchat": ParameterProfile(
intent="chitchat",
temperature=0.75,
top_k=200,
top_p=0.95,
max_tokens=128,
stop_sequences=["\n\nUser:", "\n\nHuman:"],
frequency_penalty=0.2,
system_prompt_key="chitchat_v2",
few_shot_count=1,
quality_score_target=0.78,
version="default-v1",
updated_at="2025-01-01T00:00:00Z",
),
}
class ParameterProfileManager:
"""
Manages per-intent FM parameter profiles for MangaAssist.
Loads profiles from DynamoDB with fallback to defaults.
Integrates with ABTestingFramework for experiment overrides.
"""
def __init__(self, table_name: str = DYNAMODB_TABLE, region: str = "ap-northeast-1"):
self.dynamodb = boto3.resource("dynamodb", region_name=region)
self.table = self.dynamodb.Table(table_name)
self._cache: dict[str, ParameterProfile] = {}
self._cache_ttl_seconds = 300 # Refresh profiles every 5 minutes
self._last_refresh: Optional[datetime] = None
def get_profile(self, intent: str, ab_variant: Optional[str] = None) -> ParameterProfile:
"""
Retrieve the parameter profile for a given intent.
Args:
intent: The classified user intent (e.g., 'order_status').
ab_variant: Optional A/B test variant ID to apply overrides.
Returns:
ParameterProfile with the resolved parameters.
"""
self._refresh_cache_if_stale()
# Base profile: cached from DynamoDB or default
profile_key = f"{intent}:{ab_variant}" if ab_variant else intent
if profile_key in self._cache:
return self._cache[profile_key]
# Try DynamoDB
profile = self._load_from_dynamodb(intent, ab_variant)
if profile:
self._cache[profile_key] = profile
return profile
# Fallback to defaults
if intent in DEFAULT_PROFILES:
logger.warning(f"Using default profile for intent={intent}")
return DEFAULT_PROFILES[intent]
# Unknown intent — use conservative defaults
logger.error(f"Unknown intent={intent}, using order_status defaults as safest option")
return DEFAULT_PROFILES["order_status"]
def _load_from_dynamodb(
self, intent: str, ab_variant: Optional[str] = None
) -> Optional[ParameterProfile]:
"""Load a profile from DynamoDB."""
try:
key = {"intent": intent}
if ab_variant:
key["variant_id"] = ab_variant
response = self.table.get_item(Key=key)
item = response.get("Item")
if not item:
return None
return ParameterProfile(
intent=item["intent"],
temperature=float(item["temperature"]),
top_k=int(item["top_k"]),
top_p=float(item["top_p"]),
max_tokens=int(item["max_tokens"]),
stop_sequences=item.get("stop_sequences", ["\n\nUser:", "\n\nHuman:"]),
frequency_penalty=float(item.get("frequency_penalty", 0.0)),
system_prompt_key=item.get("system_prompt_key", f"{intent}_v1"),
few_shot_count=int(item.get("few_shot_count", 2)),
quality_score_target=float(item.get("quality_score_target", 0.85)),
version=item.get("version", "unknown"),
updated_at=item.get("updated_at", ""),
)
except Exception as e:
logger.error(f"Failed to load profile from DynamoDB: {e}")
return None
def _refresh_cache_if_stale(self) -> None:
"""Clear cache if TTL has expired."""
now = datetime.now(timezone.utc)
if self._last_refresh is None or (
now - self._last_refresh
).total_seconds() > self._cache_ttl_seconds:
self._cache.clear()
self._last_refresh = now
def update_profile(self, profile: ParameterProfile) -> bool:
"""
Write an updated profile to DynamoDB.
Used by the A/B testing system to persist winning configurations.
"""
        try:
            from decimal import Decimal  # DynamoDB rejects Python floats; store them as Decimal
            item = {
                k: Decimal(str(v)) if isinstance(v, float) else v
                for k, v in asdict(profile).items()
            }
            item["updated_at"] = datetime.now(timezone.utc).isoformat()
            self.table.put_item(Item=item)
# Invalidate cache for this intent
keys_to_remove = [k for k in self._cache if k.startswith(profile.intent)]
for k in keys_to_remove:
del self._cache[k]
logger.info(f"Updated profile for intent={profile.intent} version={profile.version}")
return True
except Exception as e:
logger.error(f"Failed to update profile: {e}")
return False
def list_all_profiles(self) -> dict[str, ParameterProfile]:
"""Return all current profiles (from cache/DynamoDB or defaults)."""
profiles = {}
for intent in DEFAULT_PROFILES:
profiles[intent] = self.get_profile(intent)
return profiles
Python Implementation — ABTestingFramework
"""
ABTestingFramework — Statistical A/B testing for FM parameter optimization.
Assigns users to experiment variants, collects quality metrics,
and determines statistical significance using chi-squared and Welch's t-test.
"""
import boto3
import hashlib
import json
import logging
import math
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional
logger = logging.getLogger(__name__)
EXPERIMENTS_TABLE = "MangaAssist-ABExperiments"
RESULTS_TABLE = "MangaAssist-ABResults"
@dataclass
class ExperimentVariant:
"""A single variant in an A/B experiment."""
variant_id: str
description: str
parameter_overrides: dict # e.g., {"temperature": 0.5, "top_k": 80}
traffic_percentage: float # 0.0 to 1.0
@dataclass
class Experiment:
"""An A/B test experiment definition."""
experiment_id: str
intent: str
description: str
status: str # "draft", "running", "paused", "completed"
control: ExperimentVariant
variants: list[ExperimentVariant]
start_date: str
end_date: Optional[str]
min_samples_per_variant: int
confidence_level: float # e.g., 0.95
created_at: str
@dataclass
class ObservationRecord:
"""A single observation from an A/B test."""
experiment_id: str
variant_id: str
session_id: str
quality_score: float
relevance_score: float
latency_ms: float
token_count: int
user_satisfaction: Optional[float]
timestamp: str
class ABTestingFramework:
"""
Manages A/B experiments for FM parameter tuning in MangaAssist.
Provides deterministic user-to-variant assignment, metric collection,
and statistical significance analysis.
"""
def __init__(self, region: str = "ap-northeast-1"):
self.dynamodb = boto3.resource("dynamodb", region_name=region)
self.experiments_table = self.dynamodb.Table(EXPERIMENTS_TABLE)
self.results_table = self.dynamodb.Table(RESULTS_TABLE)
self._active_experiments: dict[str, Experiment] = {}
    def assign_variant(
        self, experiment_id: str, session_id: str
    ) -> Optional[ExperimentVariant]:
"""
Deterministically assign a session to a variant.
Uses consistent hashing so the same session always gets the same variant.
This prevents cross-variant contamination within a session.
Args:
experiment_id: The experiment to assign for.
session_id: The user's session ID.
Returns:
            The assigned ExperimentVariant, or None if the experiment cannot be loaded.
"""
        experiment = self._get_experiment(experiment_id)
        if experiment is None:
            logger.warning(f"Experiment {experiment_id} not found or failed to load")
            return None
        if experiment.status != "running":
            return experiment.control
# Consistent hash: SHA-256 of (experiment_id + session_id) → [0.0, 1.0)
hash_input = f"{experiment_id}:{session_id}".encode("utf-8")
hash_value = int(hashlib.sha256(hash_input).hexdigest(), 16)
bucket = (hash_value % 10000) / 10000.0 # 0.0000 to 0.9999
# Walk through variants by traffic allocation
cumulative = 0.0
for variant in [experiment.control] + experiment.variants:
cumulative += variant.traffic_percentage
if bucket < cumulative:
logger.info(
f"Assigned session={session_id} to variant={variant.variant_id} "
f"in experiment={experiment_id}"
)
return variant
# Fallback to control
return experiment.control
def record_observation(self, observation: ObservationRecord) -> None:
"""Record a single observation (one FM invocation result) for analysis."""
try:
self.results_table.put_item(
Item={
"experiment_id": observation.experiment_id,
"observation_key": f"{observation.variant_id}#{observation.timestamp}",
"variant_id": observation.variant_id,
"session_id": observation.session_id,
"quality_score": str(observation.quality_score),
"relevance_score": str(observation.relevance_score),
"latency_ms": str(observation.latency_ms),
"token_count": observation.token_count,
"user_satisfaction": (
str(observation.user_satisfaction)
if observation.user_satisfaction is not None
else None
),
"timestamp": observation.timestamp,
}
)
except Exception as e:
logger.error(f"Failed to record observation: {e}")
def analyze_experiment(self, experiment_id: str) -> dict:
"""
Compute statistical significance for an experiment.
Returns:
Dict with per-variant stats, t-test results, and recommendations.
"""
experiment = self._get_experiment(experiment_id)
if not experiment:
return {"error": f"Experiment {experiment_id} not found"}
# Collect observations per variant
observations = self._load_observations(experiment_id)
control_obs = observations.get(experiment.control.variant_id, [])
results = {
"experiment_id": experiment_id,
"intent": experiment.intent,
"status": experiment.status,
"control": self._compute_variant_stats(
experiment.control.variant_id, control_obs
),
"variants": [],
"recommendation": None,
}
best_variant = None
best_improvement = 0.0
for variant in experiment.variants:
variant_obs = observations.get(variant.variant_id, [])
stats = self._compute_variant_stats(variant.variant_id, variant_obs)
# Welch's t-test: compare quality scores
if len(control_obs) >= 30 and len(variant_obs) >= 30:
control_scores = [o.quality_score for o in control_obs]
variant_scores = [o.quality_score for o in variant_obs]
t_stat, p_value = self._welch_t_test(control_scores, variant_scores)
stats["t_statistic"] = round(t_stat, 4)
stats["p_value"] = round(p_value, 6)
stats["significant"] = p_value < (1 - experiment.confidence_level)
improvement = stats["mean_quality"] - results["control"]["mean_quality"]
stats["quality_improvement"] = round(improvement, 4)
if stats["significant"] and improvement > best_improvement:
best_improvement = improvement
best_variant = variant.variant_id
else:
stats["t_statistic"] = None
stats["p_value"] = None
stats["significant"] = None
stats["note"] = (
f"Insufficient samples: control={len(control_obs)}, "
f"variant={len(variant_obs)} (need 30+ each)"
)
results["variants"].append(stats)
# Recommendation
if best_variant:
results["recommendation"] = {
"action": "promote",
"variant_id": best_variant,
"quality_improvement": round(best_improvement, 4),
"message": (
f"Variant '{best_variant}' shows statistically significant "
f"improvement of {best_improvement:.2%} in quality score. "
f"Recommend promoting to default profile for intent='{experiment.intent}'."
),
}
else:
results["recommendation"] = {
"action": "continue",
"message": "No variant has reached statistical significance. Continue collecting data.",
}
return results
def _compute_variant_stats(self, variant_id: str, observations: list) -> dict:
"""Compute summary statistics for a variant."""
if not observations:
return {
"variant_id": variant_id,
"sample_count": 0,
"mean_quality": 0.0,
"std_quality": 0.0,
"mean_latency_ms": 0.0,
"mean_tokens": 0.0,
}
quality_scores = [o.quality_score for o in observations]
latencies = [o.latency_ms for o in observations]
tokens = [o.token_count for o in observations]
mean_q = sum(quality_scores) / len(quality_scores)
std_q = (
sum((x - mean_q) ** 2 for x in quality_scores) / len(quality_scores)
) ** 0.5
return {
"variant_id": variant_id,
"sample_count": len(observations),
"mean_quality": round(mean_q, 4),
"std_quality": round(std_q, 4),
"mean_latency_ms": round(sum(latencies) / len(latencies), 1),
"mean_tokens": round(sum(tokens) / len(tokens), 1),
}
@staticmethod
def _welch_t_test(sample_a: list[float], sample_b: list[float]) -> tuple[float, float]:
"""
Welch's t-test for two independent samples with unequal variances.
Returns (t_statistic, p_value).
"""
n_a, n_b = len(sample_a), len(sample_b)
mean_a = sum(sample_a) / n_a
mean_b = sum(sample_b) / n_b
var_a = sum((x - mean_a) ** 2 for x in sample_a) / (n_a - 1)
var_b = sum((x - mean_b) ** 2 for x in sample_b) / (n_b - 1)
se = math.sqrt(var_a / n_a + var_b / n_b)
if se == 0:
return 0.0, 1.0
t_stat = (mean_a - mean_b) / se
# Welch-Satterthwaite degrees of freedom
numerator = (var_a / n_a + var_b / n_b) ** 2
denominator = (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
df = numerator / denominator if denominator > 0 else 1
# Approximate p-value using the t-distribution survival function
# For production, use scipy.stats.t.sf — here we use a conservative approximation
        p_one_tailed = ABTestingFramework._approximate_t_pvalue(abs(t_stat), df)
        return t_stat, min(1.0, 2 * p_one_tailed)  # Two-tailed test, clamped to 1.0
@staticmethod
def _approximate_t_pvalue(t: float, df: float) -> float:
"""
Conservative approximation of the one-tailed p-value.
For production use scipy.stats.t.sf(t, df) instead.
This uses the normal approximation valid for df > 30.
"""
# Normal approximation for large df
if df > 30:
z = t
p = 0.5 * math.erfc(z / math.sqrt(2))
return p
# Very rough approximation for small df — use lookup tables in production
if t > 3.5:
return 0.001
elif t > 2.5:
return 0.01
elif t > 2.0:
return 0.03
elif t > 1.5:
return 0.07
else:
return 0.15
def _get_experiment(self, experiment_id: str) -> Optional[Experiment]:
"""Retrieve an experiment definition from DynamoDB."""
if experiment_id in self._active_experiments:
return self._active_experiments[experiment_id]
try:
response = self.experiments_table.get_item(
Key={"experiment_id": experiment_id}
)
item = response.get("Item")
if not item:
return None
control = ExperimentVariant(
variant_id=item["control"]["variant_id"],
description=item["control"]["description"],
parameter_overrides=item["control"].get("parameter_overrides", {}),
traffic_percentage=float(item["control"]["traffic_percentage"]),
)
variants = [
ExperimentVariant(
variant_id=v["variant_id"],
description=v["description"],
parameter_overrides=v.get("parameter_overrides", {}),
traffic_percentage=float(v["traffic_percentage"]),
)
for v in item.get("variants", [])
]
experiment = Experiment(
experiment_id=item["experiment_id"],
intent=item["intent"],
description=item["description"],
status=item["status"],
control=control,
variants=variants,
start_date=item["start_date"],
end_date=item.get("end_date"),
min_samples_per_variant=int(item.get("min_samples_per_variant", 500)),
confidence_level=float(item.get("confidence_level", 0.95)),
created_at=item["created_at"],
)
if experiment.status == "running":
self._active_experiments[experiment_id] = experiment
return experiment
except Exception as e:
logger.error(f"Failed to load experiment {experiment_id}: {e}")
return None
def _load_observations(self, experiment_id: str) -> dict[str, list]:
"""Load all observations for an experiment, grouped by variant."""
grouped: dict[str, list] = {}
try:
            from boto3.dynamodb.conditions import Key  # `import boto3` alone does not load this submodule
            items: list[dict] = []
            kwargs = {"KeyConditionExpression": Key("experiment_id").eq(experiment_id)}
            while True:  # paginate: a single query page returns at most 1 MB of items
                response = self.results_table.query(**kwargs)
                items.extend(response.get("Items", []))
                if "LastEvaluatedKey" not in response:
                    break
                kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]
            for item in items:
vid = item["variant_id"]
obs = ObservationRecord(
experiment_id=item["experiment_id"],
variant_id=vid,
session_id=item["session_id"],
quality_score=float(item["quality_score"]),
relevance_score=float(item["relevance_score"]),
latency_ms=float(item["latency_ms"]),
token_count=int(item["token_count"]),
user_satisfaction=(
float(item["user_satisfaction"])
if item.get("user_satisfaction")
else None
),
timestamp=item["timestamp"],
)
grouped.setdefault(vid, []).append(obs)
except Exception as e:
logger.error(f"Failed to load observations for experiment {experiment_id}: {e}")
return grouped
# ---------------------------------------------------------------------------
# Minimum sample size calculator (used before launching experiments)
# ---------------------------------------------------------------------------
def calculate_minimum_sample_size(
baseline_rate: float,
minimum_detectable_effect: float,
alpha: float = 0.05,
power: float = 0.80,
) -> int:
"""
Calculate minimum sample size per variant for an A/B test.
Uses the standard formula for comparing two proportions.
Args:
baseline_rate: Current quality score (e.g., 0.85).
minimum_detectable_effect: Smallest improvement worth detecting (e.g., 0.03).
alpha: Significance level (default 0.05).
power: Statistical power (default 0.80).
Returns:
Minimum samples needed per variant.
Example:
        >>> calculate_minimum_sample_size(0.85, 0.03)
        2036  # Need ~2000 samples per variant to detect a 3-point improvement
"""
import scipy.stats as stats
z_alpha = stats.norm.ppf(1 - alpha / 2)
z_beta = stats.norm.ppf(power)
p1 = baseline_rate
p2 = baseline_rate + minimum_detectable_effect
p_bar = (p1 + p2) / 2
numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar)) +
z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
denominator = (p2 - p1) ** 2
return math.ceil(numerator / denominator)
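The same calculation can be done without scipy using the standard library's statistics.NormalDist (Python 3.8+), which is handy as a sanity check before launching an experiment:

```python
import math
from statistics import NormalDist

def min_sample_size_stdlib(baseline: float, mde: float,
                           alpha: float = 0.05, power: float = 0.80) -> int:
    """Two-proportion sample-size formula using stdlib normal quantiles."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p1, p2 = baseline, baseline + mde
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

n = min_sample_size_stdlib(0.85, 0.03)  # roughly 2,000 samples per variant
```

A larger minimum detectable effect shrinks the required sample size quadratically, which is why small parameter tweaks need long-running experiments.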
Summary — Intent × Parameter × Quality Matrix
| Intent | Temp | top_k | top_p | max_tokens | Quality Target | Rationale |
|---|---|---|---|---|---|---|
| order_status | 0.15 | 40 | 0.70 | 256 | >0.95 | Factual precision is paramount. Hallucinated dates erode trust instantly. |
| shipping_info | 0.20 | 40 | 0.80 | 384 | >0.93 | Policy accuracy required. Slight flexibility in natural phrasing. |
| product_search | 0.45 | 100 | 0.90 | 512 | >0.88 | Balance between accurate metadata and engaging descriptions. |
| manga_qa | 0.50 | 100 | 0.90 | 768 | >0.85 | Knowledge accuracy with conversational nuance. Longer explanations. |
| recommendation | 0.80 | 250 | 0.95 | 1024 | >0.80 | Creative diversity drives discovery. Users want surprising suggestions. |
| chitchat | 0.75 | 200 | 0.95 | 128 | >0.78 | Warm personality with varied phrasing. Short and snappy. |
Key Takeaways
- One temperature does not fit all — MangaAssist uses 6 different temperature profiles because order status precision (T=0.15) and recommendation creativity (T=0.80) are fundamentally different tasks.
- top_k and top_p work together — top_k sets a hard vocabulary ceiling; top_p adapts within it. Both should be tuned per intent.
- A/B testing is mandatory — Intuition about "good" parameters is unreliable. Only statistical testing reveals whether T=0.45 or T=0.55 actually produces better product descriptions.
- Prompt engineering amplifies parameter tuning — The best temperature setting cannot compensate for a vague system prompt. Both must be optimized together.
- Fallback to conservative defaults — Unknown intents should use order_status-level conservative parameters. It is safer to be boring than to hallucinate.