02 — A/B Testing & Canary Deployment — Answers
Easy
A1. Canary Deployment Pattern on ECS Fargate
MangaAssist canary deployment — stage-by-stage:
Infrastructure setup:
- The MangaAssist API runs on ECS Fargate behind an Application Load Balancer (ALB)
- Two ECS task sets exist in the same ECS service: primary (current model) and canary (new model)
- Both task sets run identical container images; the model version is controlled via environment variable MODEL_ID sourced from SSM Parameter Store
Stage progression:
| Stage | Canary Traffic | Duration | Progression Trigger |
|---|---|---|---|
| Stage 0 | 0% | Pre-deploy | Bedrock offline evaluation passes all gates |
| Stage 1 | 5% | 4 hours minimum | Error rate < 0.5%, P99 latency within 10% of baseline |
| Stage 2 | 25% | 24 hours minimum | All guardrail metrics pass, CSAT ≥ baseline - 1% |
| Stage 3 | 50% | 48 hours minimum | Statistical significance on primary success metric |
| Stage 4 | 100% | Permanent | Human approval + all metrics confirmed |
ALB weighted target group configuration:
ALB Listener Rule:
TargetGroup-Primary: weight = 95 (Stage 1)
TargetGroup-Canary: weight = 5 (Stage 1)
At each stage, a CodeDeploy Blue/Green deployment updates the weights. The ECS service maintains both task sets simultaneously — no cold starts during traffic shifts.
Progression triggers:
- Automated (Stages 1→2, 2→3): A CloudWatch composite alarm evaluates the guardrail metrics. If all pass after the minimum duration, a Step Functions workflow updates the ALB weights (see the sketch below).
- Manual (Stage 3→4): Requires explicit approval via a CodePipeline manual approval action. An ML engineer reviews the full A/B test report before promoting to 100%.
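A minimal boto3 sketch of the weight update the Step Functions task might perform at each stage transition (the rule and target group ARNs are placeholders):

import boto3

elbv2 = boto3.client("elbv2")

def set_canary_weight(rule_arn: str, primary_tg_arn: str, canary_tg_arn: str, canary_pct: int) -> None:
    """Shift ALB listener-rule weights for a canary stage (e.g., canary_pct=5 for Stage 1)."""
    elbv2.modify_rule(
        RuleArn=rule_arn,
        Actions=[{
            "Type": "forward",
            "ForwardConfig": {
                "TargetGroups": [
                    {"TargetGroupArn": primary_tg_arn, "Weight": 100 - canary_pct},
                    {"TargetGroupArn": canary_tg_arn, "Weight": canary_pct},
                ]
            },
        }],
    )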
A2. A/B Test Metrics for MangaAssist
Guardrail metrics (must not degrade — hard block if violated):
| Metric | Threshold | Source |
|---|---|---|
| Error rate (5xx) | < 0.5% | ALB access logs / CloudWatch |
| P99 latency (end-to-end) | < 5 seconds | X-Ray traces |
| LLM hallucination rate | < current + 1% | Custom hallucination detector (Lambda) |
| Safety filter trigger rate | < current + 0.5% | Bedrock guardrails logs |
| Cart abandonment rate | No increase > 2% | Amazon analytics pipeline |
Success metrics (must improve — primary decision drivers):
| Metric | Target | Source |
|---|---|---|
| Task completion rate | +2% improvement | DynamoDB conversation logs (did the user complete their goal?) |
| CSAT score (post-chat survey) | +0.1 points (4-point scale) | Survey responses |
| Recommendation click-through rate | +3% improvement | Clickstream analytics |
| Conversations per resolution | Fewer turns = better | DynamoDB conversation metadata |
| Escalation rate | Decrease | Intent classifier logs |
Observational metrics (monitor but don't gate on):
| Metric | Purpose |
|---|---|
| Token usage per conversation | Cost impact tracking |
| Cache hit rate (ElastiCache) | Ensure caching still effective with new model's output patterns |
| Intent distribution shift | Detect if the new model changes how users interact |
A3. Minimum Sample Size Calculation
Given:
- Baseline task completion rate: p₁ = 0.78
- Minimum detectable effect: δ = 0.02 (2% improvement → p₂ = 0.80)
- Significance level: α = 0.05 (two-tailed)
- Power: 1 - β = 0.80
Formula (two-proportion z-test):
$$n = \frac{(Z_{\alpha/2} + Z_\beta)^2 \cdot [p_1(1-p_1) + p_2(1-p_2)]}{\delta^2}$$
$$n = \frac{(1.96 + 0.84)^2 \cdot [0.78(0.22) + 0.80(0.20)]}{0.02^2}$$
$$n = \frac{7.84 \cdot [0.1716 + 0.16]}{0.0004}$$
$$n = \frac{7.84 \cdot 0.3316}{0.0004} = \frac{2.5997}{0.0004} \approx 6{,}500$$
Per group: ~6,500 conversations needed in each arm (control + treatment = ~13,000 total).
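A small helper reproducing this power analysis (scipy assumed available), useful when the baseline or minimum detectable effect changes:

import math
from scipy.stats import norm

def required_sample_size(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm sample size for a two-proportion z-test, matching the formula above."""
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 for two-tailed alpha = 0.05
    z_beta = norm.ppf(power)           # 0.84 for 80% power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)

print(required_sample_size(0.78, 0.80))  # ~6,500 per arm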
Time to collect at MangaAssist's traffic:
- Total traffic: ~300K conversations/day
- At 5% canary (Stage 1): the canary arm receives 15K conversations/day → reaches the sample size in under 12 hours (the 4-hour minimum stage duration exists for guardrail observation, not sample collection)
- At 25% canary (Stage 2): the canary arm receives 75K conversations/day → reaches the sample size in under 3 hours
Caveat: Reaching raw sample size quickly ≠ reaching reliable significance. We impose minimum durations (24–48 hours) to account for time-of-day effects, day-of-week variation, and novelty effects (see Q8).
Medium
A4. Stratified A/B Testing for Imbalanced Intent Traffic
MangaAssist intent traffic distribution:
| Intent | Daily Volume | % of Total |
|---|---|---|
| recommendation | 72K | 24% |
| product_discovery | 48K | 16% |
| faq | 45K | 15% |
| product_question | 36K | 12% |
| order_tracking | 30K | 10% |
| checkout_help | 24K | 8% |
| return_request | 18K | 6% |
| promotion | 12K | 4% |
| chitchat | 9K | 3% |
| escalation | 6K | 2% |
Problem: At 5% canary, escalation receives only 300 conversations/day. Reaching 6,500 sample size takes ~22 days — far too slow.
Stratified approach:
- Intent-level traffic splitting: Instead of uniform 5% canary across all traffic, apply different canary percentages per intent:
| Intent Tier | Intents | Canary % | Reason |
|---|---|---|---|
| High-traffic | recommendation, product_discovery, faq | 5% | Standard canary — ample volume |
| Medium-traffic | product_question, order_tracking, checkout_help | 10% | Slight increase to accelerate |
| Low-traffic | return_request, promotion, chitchat, escalation | 25% | Aggressive split — these intents have less revenue impact |
- Implementation: The intent classifier (SageMaker endpoint) returns the intent label. The application routing layer (ECS Fargate) checks the canary assignment table in Redis:
canary_rates = {
"recommendation": 0.05, "product_discovery": 0.05, "faq": 0.05,
"product_question": 0.10, "order_tracking": 0.10, "checkout_help": 0.10,
"return_request": 0.25, "promotion": 0.25, "chitchat": 0.25, "escalation": 0.25
}
- Sequential testing for low-traffic intents: Use a group sequential design — analyze at pre-defined intervals rather than waiting for the full sample size. Use an alpha spending function (O'Brien-Fleming) to control Type I error across interim analyses (see A8 for the schedule and a spending-function sketch).
- Pooled analysis as backup: For intents that can never reach individual significance (e.g., escalation with only 6K/day), group them into an "operational intents" pool and test at the pool level. Accept per-intent significance only when achievable.
A5. Conflicting Per-Intent Signals at the 25% Canary Stage
Situation:
- recommendation CSAT: +3% (positive)
- return_request task completion: -7% (negative)
- Composite metric: net positive
Decision framework — MangaAssist Canary Progression Rules:
Rule 1 — No critical intent regression:
IF any revenue_critical_intent has regression > 5%:
BLOCK progression regardless of composite metric
return_request is not revenue-critical (it's operational), so this rule doesn't block — but the 7% regression is concerning.
Rule 2 — Severity-weighted composite:
Assign weights by business impact:
| Intent | Weight | Reasoning |
|---|---|---|
| recommendation | 0.25 | Directly drives revenue |
| checkout_help | 0.20 | Conversion-critical |
| product_discovery | 0.15 | Revenue driver |
| return_request | 0.10 | Operational — impacts CSAT but not revenue |
| Others | 0.30 | Distributed |
Weighted score: (0.25 × +3%) + (0.10 × -7%) + (others neutral) = +0.75% - 0.70% = +0.05% → barely positive.
Rule 3 — Root cause before proceeding:
The +3%/-7% pattern is suspicious. Before progressing to 50%:
- Investigate the return_request regression:
  - Pull sample conversations from the canary arm
  - Check if the new model is failing to generate return labels, missing policy details, or misunderstanding return eligibility
  - Look for systematic failures vs. random quality variance
- Check whether the improvement and regression are correlated:
  - Are the same users seeing both? (Session-level analysis)
  - Is the new model redirecting return_request users to browse recommendations instead of processing their return?
- Run a targeted evaluation:
  - Execute the return_request Bedrock evaluation job against the new model
  - Compare against the golden dataset
Decision:
- Do NOT proceed to 50%. Hold at 25% for an additional 48 hours while investigating.
- If the return_request regression is a genuine model behavior → reject the candidate
- If it's a statistical artifact (small sample, edge cases) → extend the test at 25% for more data
A6. Session Stickiness During A/B Testing
MangaAssist session stickiness design:
Assignment persistence:
When a user starts a conversation, the A/B test assignment is generated and stored:
Redis Key: ab:session:{session_id}
Value: {
"model_version": "sonnet-v2-canary",
"assigned_at": "2025-03-15T10:30:00Z",
"test_id": "ab-test-047",
"intent_assignments": {
"recommendation": "canary",
"faq": "control"
}
}
TTL: 7 days
Flow when user returns 2 hours later:
- User sends a message → ECS Fargate receives the request
- Application layer extracts session_id from the cookie or Amazon customer ID
- Redis lookup: check ab:session:{session_id}
- If found → route to the same model version (canary or control)
- If not found (TTL expired or new session) → generate a new assignment based on current canary percentages
Mid-conversation model consistency:
The stickiness is at the conversation level, not the request level. Even if the intent changes mid-conversation (e.g., chitchat → recommendation), the model version stays consistent within that conversation.
What happens if the canary has been rolled back:
Scenario: User was assigned to canary at 10:00 AM. Canary rolled back at 11:00 AM. User returns at 12:00 PM.
- User sends a message → application checks Redis → finds model_version: "sonnet-v2-canary"
- Application checks a model availability flag in Parameter Store: canary_active: false
- Graceful fallback: route the user to the control model silently (a minimal sketch follows this list)
- Update Redis: model_version: "sonnet-v1-control", reassigned_reason: "canary_rollback"
- Log the reassignment for analysis (these users are excluded from A/B test results)
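A minimal sketch of the fallback check, assuming the Parameter Store status flag from A10 and a hypothetical assign_new() helper for fresh sessions:

import json

import boto3
import redis

ssm = boto3.client("ssm")
r = redis.Redis()  # assumed ElastiCache connection

def resolve_model(session_id: str) -> str:
    """Return the session's model version, silently falling back after a canary rollback."""
    raw = r.get(f"ab:session:{session_id}")
    if raw is None:
        return assign_new(session_id)  # hypothetical fresh-assignment helper
    assignment = json.loads(raw)
    if assignment["model_version"].endswith("-canary"):
        status = ssm.get_parameter(Name="/mangaassist/canary/status")["Parameter"]["Value"]
        if status == "rolled_back":
            # Silent reassignment; the flag excludes this session from A/B analysis
            assignment["model_version"] = "sonnet-v1-control"
            assignment["reassigned_reason"] = "canary_rollback"
            r.set(f"ab:session:{session_id}", json.dumps(assignment), keepttl=True)
    return assignment["model_version"]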
Data cleanliness: Users who experienced a mid-test rollback are flagged in analytics. Their conversations are excluded from the final A/B test statistical analysis to avoid contamination.
Hard
A7. Complete A/B Testing Infrastructure Design
Architecture overview:
Customer Request
│
▼
┌──────────────┐
│ ALB │ (No traffic splitting here —
│ │ uniform routing to ECS)
└──────┬───────┘
│
▼
┌──────────────┐
│ ECS Fargate │ Application-level A/B routing
│ (API Layer) │
└──────┬───────┘
│
┌──────────┼──────────┐
▼ ▼ ▼
┌────────────┐ ┌─────┐ ┌─────────┐
│ ElastiCache│ │ Req │ │ Intent │
│ Redis │ │Route│ │Classify │
│ (Assignment│ │ r │ │(SageMkr)│
│ Store) │ └──┬──┘ └────┬────┘
└────────────┘ │ │
▼ ▼
┌─────────────────────┐
│ Model Router │
│ (Control vs Canary) │
└────────┬────────────┘
┌─────┴─────┐
▼ ▼
┌──────────┐ ┌──────────┐
│ Bedrock │ │ Bedrock │
│ Control │ │ Canary │
│ (Sonnet │ │ (Sonnet │
│ v1) │ │ v2) │
└──────────┘ └──────────┘
Why application-level routing over ALB weighted routing:
ALB weighted routing operates at the infrastructure level — it can't make intent-aware decisions and can't persist assignment logic in Redis. Application-level routing enables:
- Per-intent canary percentages
- Session-sticky assignment
- Feature flag integration
- Detailed per-experiment logging
Traffic assignment logic (Python pseudocode):
import mmh3  # MurmurHash3: deterministic, fast, non-cryptographic hash

async def get_model_assignment(session_id: str, intent: str, test_config: dict) -> str:
    # Check Redis for an existing assignment (session stickiness)
    cached = await redis.hget(f"ab:{session_id}", intent)
    if cached:
        return cached
    # New assignment — deterministic hash for reproducibility: the same session
    # always lands in the same bucket for a given test_id
    hash_val = mmh3.hash(f"{session_id}:{test_config['test_id']}") % 10000
    threshold = test_config["canary_rates"][intent] * 10000  # e.g., 5% = 500
    assignment = "canary" if hash_val < threshold else "control"
    # Persist in Redis with a TTL matching the 7-day stickiness window
    await redis.hset(f"ab:{session_id}", intent, assignment)
    await redis.expire(f"ab:{session_id}", 604800)  # 7 days in seconds
    # Emit an assignment metric (redis and cloudwatch are assumed pre-initialized clients)
    cloudwatch.put_metric("ABAssignment", 1, dimensions={
        "test_id": test_config["test_id"],
        "intent": intent,
        "assignment": assignment
    })
    return assignment
Metrics collection pipeline:
- Per-request: ECS emits structured logs to CloudWatch Logs (JSON with test_id, assignment, intent, latency_ms, token_count, error_flag; an illustrative record follows this list)
- Aggregation: Kinesis Data Firehose streams logs to S3 (partitioned by date/test_id)
- Near-real-time analysis: A Lambda function runs every 5 minutes:
  - Queries CloudWatch Metrics Insights for guardrail checks
  - Runs the statistical significance test (z-test for proportions)
  - Publishes results to an S3 dashboard bucket and SNS
Statistical significance computation (near-real-time):
import math

from scipy import stats
def compute_significance(control_successes, control_total, canary_successes, canary_total):
p_control = control_successes / control_total
p_canary = canary_successes / canary_total
# Pooled proportion
p_pool = (control_successes + canary_successes) / (control_total + canary_total)
se = math.sqrt(p_pool * (1 - p_pool) * (1/control_total + 1/canary_total))
z = (p_canary - p_control) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z)))
return {
"z_score": z,
"p_value": p_value,
"significant": p_value < 0.05,
"effect_size": p_canary - p_control,
"confidence_interval": (
(p_canary - p_control) - 1.96 * se,
(p_canary - p_control) + 1.96 * se
)
}
A8. Testing Protocol That Accounts for Temporal Confounds
Problem: MangaAssist reaches statistical significance in 6 hours, but this captures only one time-of-day slice:
- Morning commuters in Japan may browse manga casually (easy FAQs)
- Evening US users may be making purchase decisions (complex recommendations)
- Weekend traffic patterns differ from weekday
Testing protocol:
1. Minimum test duration: 7 full days (one complete weekly cycle)
Regardless of when statistical significance is reached, the test continues for 7 days minimum. This captures:
- All 7 days of the week
- All time zones (US + JP traffic patterns)
- Payday effects (beginning/end of month)
2. Sequential analysis with alpha spending:
Use an O'Brien-Fleming alpha spending function to enable interim looks without inflating false positive rates:
| Interim Look | Calendar Time | Alpha Spent | Cumulative Alpha |
|---|---|---|---|
| Look 1 | Day 2 | 0.001 | 0.001 |
| Look 2 | Day 4 | 0.008 | 0.009 |
| Look 3 | Day 7 | 0.041 | 0.050 |
Early stopping is allowed only if the effect is massive (e.g., a 10%+ improvement) AND statistically significant at the adjusted alpha. This prevents acting on small fluctuations while still allowing rapid response to dramatically good or bad outcomes.
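A sketch of the Lan-DeMets O'Brien-Fleming spending function that approximately reproduces the table above (the Look 1 value rounds to ~0.0002 under this variant; the table's 0.001 is a coarser approximation):

from scipy.stats import norm

def obf_cumulative_alpha(t: float, alpha: float = 0.05) -> float:
    """Lan-DeMets O'Brien-Fleming approximation: alpha(t) = 2 * (1 - Phi(z_{alpha/2} / sqrt(t)))."""
    z = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    return 2 * (1 - norm.cdf(z / t ** 0.5))

# Interim looks on Days 2, 4, and 7 of a 7-day test (information fraction = day / 7)
for day in (2, 4, 7):
    print(f"Day {day}: cumulative alpha ~ {obf_cumulative_alpha(day / 7):.4f}")
# Day 2: ~0.0002, Day 4: ~0.0095, Day 7: 0.0500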
3. Day-of-week analysis:
After 7 days, check for interaction effects:
# compute_significance (defined in A7) takes per-arm success counts and totals;
# the *_by_day dicts are assumed aggregates from the metrics pipeline
for day in ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']:
    day_result = compute_significance(
        control_successes_by_day[day], control_totals_by_day[day],
        canary_successes_by_day[day], canary_totals_by_day[day],
    )
    if day_result['effect_size'] < 0:
        flag_as_inconsistent(day)
If the treatment is positive on weekdays but negative on weekends → the result is not robust. Extend the test.
4. Novelty/primacy effect detection:
Plot the treatment effect over time:
- If the effect is largest on Day 1 and decays → novelty effect (users react differently to a "new feel" initially)
- Compare the Day 1-2 effect size vs. the Day 5-7 effect size
- If > 50% decay → the true long-term effect is approximately the Day 5-7 estimate (a small check is sketched below)
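A tiny helper encoding the >50% decay rule from the list above:

def novelty_decayed(effect_early: float, effect_late: float) -> bool:
    """Compare the Day 1-2 effect vs. the Day 5-7 effect; flag decay above 50%."""
    if effect_early <= 0:
        return False  # no positive early effect to decay from
    return (effect_early - effect_late) / effect_early > 0.50

# Example: +4.0% on Days 1-2 shrinking to +1.5% on Days 5-7 is a 62.5% decay,
# so report the Day 5-7 estimate (+1.5%) as the long-term effect
novelty_decayed(0.040, 0.015)  # True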
5. MangaAssist-specific temporal factors:
- New manga release Wednesdays (Shonen Jump schedule) — product_discovery and recommendation traffic spikes 3×
- Amazon promotional events — the test must not span a promotional boundary (start or end during a sale)
- Japanese holidays — Golden Week and Obon may shift traffic patterns dramatically
Minimum test duration matrix:
| Scenario | Min Duration |
|---|---|
| Standard test | 7 days |
| Low-traffic intent (< 10K/day) | 14 days |
| Test spans a promotional event | 14 days (7 pre + 7 post) |
| Revenue-critical intent | 14 days |
A9. Multi-Region A/B Test (US and Japan)
Phased multi-region rollout:
Phase 1 (Week 1-2): US-East canary at 5% → 25% → 50%
Phase 2 (Week 3): US-East at 100%, JP canary at 5%
Phase 3 (Week 3-4): JP canary → 25% → 50%
Phase 4 (Week 5): JP at 100% (if approved)
Why sequential, not parallel:
- Risk isolation: If the model fails in US, JP is unaffected
- Learning transfer: US test results inform what to watch for in JP
- Resource management: Running parallel canaries requires double the monitoring infrastructure
Handling interaction effects (US English vs. Japanese):
Region-specific evaluation criteria:
| Metric | US Threshold | JP Threshold |
|---|---|---|
| Task completion rate | Improve ≥ 1% | Improve ≥ 1% |
| CSAT | No degradation | No degradation |
| Product name accuracy | ≥ 95% | ≥ 99% (Japanese titles must be exact — kanji errors are unacceptable) |
| Honorific correctness | N/A | ≥ 98% (お客様, ~さん must be correct) |
| Cross-language handling | English responses only | Must handle mixed JP/EN queries correctly |
Known risk — Japanese manga title handling:
The new model might improve English conversational quality but regress on Japanese by:
- Romanizing titles that should stay in kanji (進撃の巨人 → "Shingeki no Kyojin" when the customer used kanji)
- Suggesting incorrect furigana
- Mixing simplified Chinese characters with Japanese kanji
Mitigation:
1. Run the MangaDomainAccuracy (MDA) evaluation specifically on JP test data before deploying to JP
2. Add a JP-specific guardrail metric, "Japanese product name preservation rate", measured by the hallucination detector (a rough heuristic sketch follows this list)
3. If the US test passes but the JP offline evaluation fails → do not deploy to JP. Use a separate model version for JP if needed.
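A rough heuristic sketch of the preservation check; a production detector would match against the actual product catalog rather than raw kanji runs:

import re

# Runs of 2+ CJK Unified Ideographs serve as a crude proxy for Japanese titles in a query
KANJI_RUN = re.compile(r"[\u4e00-\u9fff]{2,}")

def japanese_titles_preserved(query: str, response: str) -> bool:
    """True if every kanji run from the customer's query appears verbatim in the response."""
    return all(run in response for run in KANJI_RUN.findall(query))

japanese_titles_preserved("進撃の巨人の新刊はある?", "進撃の巨人の最新刊はこちらです")          # True
japanese_titles_preserved("進撃の巨人の新刊はある?", "Shingeki no Kyojin の最新刊はこちらです")  # False: romanized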
Infrastructure:
Each region has its own ECS cluster, so canary deployments are region-independent. The A/B test configuration is stored in regional SSM Parameter Stores. A global dashboard aggregates results from both regions with clear per-region breakdowns.
Very Hard
A10. Automated Canary Rollback System
Rollback trigger hierarchy (fastest to slowest detection):
| Trigger | Detection Time | Threshold | Source |
|---|---|---|---|
| Error rate spike | < 1 minute | > 2% 5xx rate (5-min window) | ALB CloudWatch metrics |
| Latency degradation | < 2 minutes | P99 > 8s (2× baseline) | X-Ray + CloudWatch |
| Bedrock throttling | < 2 minutes | > 5% throttled requests | Bedrock CloudWatch metrics |
| Hallucination spike | < 10 minutes | > 3% hallucination rate | Custom detector Lambda |
| CSAT drop | < 4 hours | > 10% drop (rolling 4h) | Survey pipeline |
| Task completion drop | < 6 hours | > 5% drop (rolling 6h) | DynamoDB conversation analysis |
CloudWatch Alarm configuration:
# Critical alarm — triggers immediate rollback
CriticalCanaryAlarm:
Type: AWS::CloudWatch::CompositeAlarm
Properties:
AlarmRule: |
ALARM(CanaryErrorRate) OR
ALARM(CanaryLatencyP99) OR
ALARM(CanaryThrottling)
AlarmActions:
- !Ref RollbackSNSTopic # Notify
- !Ref RollbackLambdaARN # Execute rollback
# Warning alarm — pauses progression, alerts team
WarningCanaryAlarm:
Type: AWS::CloudWatch::CompositeAlarm
Properties:
AlarmRule: |
ALARM(CanaryHallucinationRate) OR
ALARM(CanaryCSATDrop)
AlarmActions:
- !Ref PauseSNSTopic
Rollback execution (Lambda function):
import json
from datetime import datetime
from uuid import uuid4

import boto3

# Clients, ARNs, and the Redis connection are assumed to be initialized at module load
elbv2 = boto3.client("elbv2")
ssm = boto3.client("ssm")
sns = boto3.client("sns")
rollbacks_table = boto3.resource("dynamodb").Table("canary_rollbacks")

def execute_rollback(event, context):
    # 1. Update ALB weights — shift 100% to control
    elbv2.modify_rule(
        RuleArn=CANARY_RULE_ARN,
        Actions=[{
            'Type': 'forward',
            'ForwardConfig': {
                'TargetGroups': [
                    {'TargetGroupArn': CONTROL_TG_ARN, 'Weight': 100},
                    {'TargetGroupArn': CANARY_TG_ARN, 'Weight': 0}
                ]
            }
        }]
    )
    # 2. Update Redis — mark canary as inactive so the app layer stops assigning to it
    redis_client.set("canary_active", "false")
    # 3. Update SSM Parameter Store
    ssm.put_parameter(
        Name='/mangaassist/canary/status',
        Value='rolled_back',
        Overwrite=True
    )
    # 4. Log the rollback event for the post-rollback diagnostic workflow
    rollback_details = {
        'rollback_id': str(uuid4()),
        'timestamp': datetime.utcnow().isoformat(),
        'trigger': event['detail']['alarmName'],
        'canary_stage': get_current_stage(),
        'metrics_snapshot': collect_metrics_snapshot()
    }
    rollbacks_table.put_item(Item=rollback_details)
    # 5. Notify team
    sns.publish(
        TopicArn=TEAM_NOTIFICATION_TOPIC,
        Subject="[ROLLBACK] MangaAssist canary rolled back",
        Message=json.dumps(rollback_details, default=str)
    )
Session migration during rollback:
- Users currently mid-conversation on the canary model are transparently migrated to the control model
- Redis stores the full conversation history (not just the model assignment), so the control model can continue the conversation using the history
- The reassigned_reason: "rollback" flag ensures these sessions are excluded from A/B test analysis
Anti-flapping mechanism:
| Mechanism | Implementation |
|---|---|
| Cooldown period | After a rollback, no new canary deployment for 24 hours minimum |
| Rollback counter | If 3 rollbacks occur within 7 days for the same candidate, the candidate is permanently rejected |
| Progressive threshold tightening | After a rollback, the next deployment must pass at 5% for 8 hours (2× normal) before progressing |
| Rollback root cause requirement | The canary cannot be re-deployed until a root cause analysis document is filed and reviewed |
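A sketch of a deployment gate enforcing the first two rows of this table; it assumes the rollback items carry a candidate_id attribute in addition to the fields logged by the rollback Lambda:

from datetime import datetime, timedelta

import boto3

table = boto3.resource("dynamodb").Table("canary_rollbacks")

def deployment_allowed(candidate_id: str) -> bool:
    """Enforce the 24-hour cooldown and the 3-rollbacks-in-7-days rejection rule."""
    now = datetime.utcnow()
    week = [
        item for item in table.scan()["Items"]  # table is tiny; a time-keyed query would scale better
        if datetime.fromisoformat(item["timestamp"]) > now - timedelta(days=7)
    ]
    # Cooldown: no new canary within 24 hours of any rollback
    if any(datetime.fromisoformat(i["timestamp"]) > now - timedelta(hours=24) for i in week):
        return False
    # Counter: 3 rollbacks in 7 days for the same candidate means permanent rejection
    return sum(1 for i in week if i.get("candidate_id") == candidate_id) < 3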
Post-rollback diagnostic workflow:
- Automated: Lambda captures a metrics snapshot, sample of failed conversations, and Bedrock invocation logs
- Within 1 hour: On-call engineer reviews the snapshot and confirms the rollback was justified
- Within 24 hours: Root cause analysis with comparison of canary vs control outputs on the failing conversations
- Before re-deployment: Fix validated in offline evaluation before any new canary attempt
A11. Controlling False Discovery Rate in Continuous Testing
Problem: MangaAssist runs a new A/B test every 2–3 weeks. Over a year, that's ~20 tests. At α = 0.05, the probability of at least one false positive is:
$$P(\text{at least 1 FP}) = 1 - (1 - 0.05)^{20} = 1 - 0.358 = 0.642$$
A 64% chance of declaring a winner that isn't actually better.
Framework — Continuous testing with FDR control:
1. Family-wise error rate (FWER) control — Bonferroni:
Adjust α per test: α_adjusted = 0.05 / K (where K = number of tests planned)
- For 20 tests/year: α_adjusted = 0.0025
- Problem: Extremely conservative. Moving α from 0.05 to 0.0025 raises Z_{α/2} from 1.96 to ≈3.02, which nearly doubles the required sample size per test. Tests take weeks longer at MangaAssist's volume.
2. False Discovery Rate (FDR) control — Benjamini-Hochberg (recommended):
Instead of controlling the probability of ANY false positive, control the proportion of false positives among all rejections:
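Sort the m p-values in ascending order and find the largest rank k satisfying

$$p_{(k)} \leq \frac{k}{m} \cdot q$$

then reject the hypotheses at ranks 1 through k. The expected proportion of false discoveries among the rejections is then at most q.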
Application to MangaAssist's continuous testing:
def apply_bh_correction(test_results: list[dict]) -> list[dict]:
    """
    Apply the Benjamini-Hochberg step-up procedure across all completed tests.
    Called when making final deployment decisions.
    """
    # Sort by p-value (ascending)
    sorted_results = sorted(test_results, key=lambda x: x['p_value'])
    m = len(sorted_results)
    q = 0.10  # FDR threshold (10% false discovery rate)
    # Step-up: find the largest rank k with p_(k) <= (k/m) * q ...
    k = 0
    for i, result in enumerate(sorted_results):
        rank = i + 1
        result['bh_threshold'] = (rank / m) * q
        if result['p_value'] <= result['bh_threshold']:
            k = rank
    # ... then reject ALL hypotheses at ranks 1..k, even those whose own
    # p-value exceeded their individual threshold
    for i, result in enumerate(sorted_results):
        result['significant_after_correction'] = (i + 1) <= k
    return sorted_results
3. Practical implementation for MangaAssist:
| Component | Design |
|---|---|
| Test registry | DynamoDB table tracking all tests: hypothesis, start/end dates, raw p-values, corrected p-values |
| Rolling window | Apply BH correction over the last 12 months of tests (older tests are "confirmed" and excluded) |
| Decision timing | Raw p-values are computed in real-time. BH correction is applied when making the deploy/no-deploy decision |
| FDR budget | q = 0.10 for non-critical intents, q = 0.05 for revenue-critical intents |
4. Alpha investing (alternative approach):
Treat the α budget as a "bank account":
- Start with α-wealth = 0.05
- Each test spends some α (if you test, you spend)
- Each successful test earns back α (validated positive results replenish the budget)
- If α-wealth reaches 0, stop testing until you earn more from validated wins
This naturally throttles testing when too many tests fail and encourages high-confidence experiments.
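A simplified bookkeeping sketch in the spirit of Foster and Stine's alpha-investing; the payout value is an assumed policy choice, and the exact published update rule differs slightly:

class AlphaWealthLedger:
    """Simplified alpha-investing ledger: spend on every test, earn back on wins."""

    def __init__(self, initial_wealth: float = 0.05, payout: float = 0.025):
        self.wealth = initial_wealth
        self.payout = payout  # assumed reward per validated positive result

    def can_afford(self, alpha_spend: float) -> bool:
        return self.wealth >= alpha_spend

    def record_test(self, alpha_spend: float, rejected_h0: bool) -> None:
        # Every test spends alpha; a validated rejection replenishes the budget
        self.wealth -= alpha_spend
        if rejected_h0:
            self.wealth += self.payout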
5. MangaAssist recommendation:
Use BH correction (q = 0.10) as the primary framework, supplemented by:
- A minimum 7-day test duration (prevents p-hacking through early stopping)
- Pre-registration of all test hypotheses before starting (prevents post-hoc metric selection)
- A quarterly review where the data science team evaluates the FDR across all tests run that quarter
A12. Multi-Armed Bandit vs A/B Testing — Hybrid Approach
Critique of pure MAB for MangaAssist:
Arguments FOR MAB:
- Reduces regret — shifts traffic to the better model faster
- Continuously adapts — no fixed test duration needed
- Efficient for many-arm scenarios (testing 5+ model configurations simultaneously)
Arguments AGAINST MAB for production model deployment:
| Risk | Explanation |
|---|---|
| Delayed convergence with noisy rewards | CSAT and task completion are noisy signals. MAB may oscillate between arms for weeks before converging. |
| Revenue-critical intents need certainty | For recommendation and checkout_help, the business needs to KNOW which model is better with statistical confidence — not probabilistically exploit the "probably better" arm. |
| Seasonal confounds | MAB algorithms assume stationary reward distributions. MangaAssist's traffic patterns shift weekly (new manga releases) and seasonally. A model that looks better during Shonen Jump release week may not be better generally. |
| Irreversible business decisions | Once you promote a model to production and update documentation/training/monitoring around it, switching back has organizational cost. MAB's continuous arm-switching doesn't match this operational reality. |
| Lack of causal inference | MAB optimizes allocation but doesn't produce a p-value or confidence interval. Stakeholders need to report "Model B is 3% ± 1% better with 95% confidence" — MAB can't provide this. |
When MAB IS appropriate for MangaAssist:
- Non-critical intents with fast, clear reward signals
- Exploring prompt template variants (not full model swaps)
- Tuning inference parameters (temperature, top_p) where all arms are the same model
Hybrid design — intent-tiered experimentation:
┌─────────────────────────────────────────────────┐
│ MangaAssist Experimentation Tiers │
├─────────────────────────────────────────────────┤
│ │
│ Tier 1 — Traditional A/B Testing │
│ Intents: recommendation, checkout_help, │
│ product_question, product_discovery │
│ Method: Fixed-horizon A/B test, 7-day min │
│ Decision: p < 0.05 (BH-corrected) │
│ Reason: Revenue impact — need statistical │
│ certainty before committing │
│ │
│ Tier 2 — Thompson Sampling (MAB) │
│ Intents: faq, chitchat, promotion, escalation │
│ Method: Beta-Bernoulli Thompson Sampling │
│ Arms: Up to 4 model configs simultaneously │
│ Reason: Low risk, fast iteration, explore │
│ many configurations cheaply │
│ │
│ Tier 3 — Contextual Bandit │
│ Intents: order_tracking, return_request │
│ Method: LinUCB with context features │
│ Context: query complexity, language, │
│ customer tier (Prime/non-Prime) │
│ Reason: Optimal model may vary by context │
│ — no single "best" model │
│ │
└─────────────────────────────────────────────────┘
Thompson Sampling implementation for Tier 2:
import numpy as np
class MangaAssistBandit:
def __init__(self, arms: list[str]):
# Beta distribution priors (uniform)
self.alpha = {arm: 1.0 for arm in arms}
self.beta = {arm: 1.0 for arm in arms}
def select_arm(self) -> str:
"""Thompson Sampling — sample from each arm's posterior, pick highest."""
samples = {
arm: np.random.beta(self.alpha[arm], self.beta[arm])
for arm in self.alpha
}
return max(samples, key=samples.get)
def update(self, arm: str, reward: float):
"""Update posterior with observed reward (0 or 1)."""
self.alpha[arm] += reward
self.beta[arm] += (1 - reward)
def get_best_arm(self) -> tuple[str, float]:
"""Return the arm with highest expected reward."""
expected = {
arm: self.alpha[arm] / (self.alpha[arm] + self.beta[arm])
for arm in self.alpha
}
best = max(expected, key=expected.get)
return best, expected[best]
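A usage sketch of how the Tier 2 router might drive this class per conversation (the arm names are illustrative):

bandit = MangaAssistBandit(arms=["sonnet-v1", "sonnet-v2", "haiku-config-a"])

arm = bandit.select_arm()        # pick a model config for this conversation
# ... serve the conversation with `arm`, then observe the binary reward
bandit.update(arm, reward=1.0)   # e.g., 1.0 = task completed, 0.0 = not

best_arm, expected_reward = bandit.get_best_arm()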
Guardrails on the MAB system:
| Guardrail | Purpose |
|---|---|
| Minimum exploration rate: 10% | Even the "worst" arm gets at least 10% traffic — prevents premature convergence on a local optimum |
| Weekly posterior review | Data scientist reviews the posterior distributions. If they haven't converged after 14 days, the reward signal may be too noisy. |
| Hard floor on all arms | No arm's response quality (BERTScore) is allowed below 0.80. If Thompson Sampling exploits an arm that drops below, that arm is removed. |
| Non-stationarity detection | Monitor reward mean per arm over time. If an arm's reward shifts > 2σ week-over-week, reset that arm's priors and restart exploration. |