02 — A/B Testing & Canary Deployment — Answers
Easy
A1. Canary Deployment Pattern on ECS Fargate
MangaAssist canary deployment — stage-by-stage:
Infrastructure setup:
- The MangaAssist API runs on ECS Fargate behind an Application Load Balancer (ALB)
- Two ECS task sets exist in the same ECS service: primary (current model) and canary (new model)
- Both task sets run identical container images; the model version is controlled via environment variable MODEL_ID sourced from SSM Parameter Store
Stage progression:
| Stage | Canary Traffic | Duration | Progression Trigger |
|---|---|---|---|
| Stage 0 | 0% | Pre-deploy | Bedrock offline evaluation passes all gates |
| Stage 1 | 5% | 4 hours minimum | Error rate < 0.5%, P99 latency within 10% of baseline |
| Stage 2 | 25% | 24 hours minimum | All guardrail metrics pass, CSAT ≥ baseline - 1% |
| Stage 3 | 50% | 48 hours minimum | Statistical significance on primary success metric |
| Stage 4 | 100% | Permanent | Human approval + all metrics confirmed |
ALB weighted target group configuration:
ALB Listener Rule:
TargetGroup-Primary: weight = 95 (Stage 1)
TargetGroup-Canary: weight = 5 (Stage 1)
At each stage, a CodeDeploy Blue/Green deployment updates the weights. The ECS service maintains both task sets simultaneously — no cold starts during traffic shifts.
Progression triggers:
- Automated (Stages 1→2, 2→3): A CloudWatch composite alarm evaluates the guardrail metrics. If all pass after the minimum duration, a Step Functions workflow updates the ALB weights (see the sketch below).
- Manual (Stage 3→4): Requires explicit approval via a CodePipeline manual approval action. An ML engineer reviews the full A/B test report before promoting to 100%.
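A minimal boto3 sketch of the weight update the Step Functions task might perform at each stage transition (the rule and target group ARNs are placeholders):

import boto3

elbv2 = boto3.client("elbv2")

def set_canary_weight(rule_arn: str, primary_tg_arn: str, canary_tg_arn: str, canary_pct: int) -> None:
    """Shift ALB listener-rule weights for a canary stage (e.g., canary_pct=5 for Stage 1)."""
    elbv2.modify_rule(
        RuleArn=rule_arn,
        Actions=[{
            "Type": "forward",
            "ForwardConfig": {
                "TargetGroups": [
                    {"TargetGroupArn": primary_tg_arn, "Weight": 100 - canary_pct},
                    {"TargetGroupArn": canary_tg_arn, "Weight": canary_pct},
                ]
            },
        }],
    )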
A2. A/B Test Metrics for MangaAssist
Guardrail metrics (must not degrade — hard block if violated):
| Metric | Threshold | Source |
|---|---|---|
| Error rate (5xx) | < 0.5% | ALB access logs / CloudWatch |
| P99 latency (end-to-end) | < 5 seconds | X-Ray traces |
| LLM hallucination rate | < current + 1% | Custom hallucination detector (Lambda) |
| Safety filter trigger rate | < current + 0.5% | Bedrock guardrails logs |
| Cart abandonment rate | No increase > 2% | Amazon analytics pipeline |
Success metrics (must improve — primary decision drivers):
| Metric | Target | Source |
|---|---|---|
| Task completion rate | +2% improvement | DynamoDB conversation logs (did the user complete their goal?) |
| CSAT score (post-chat survey) | +0.1 points (4-point scale) | Survey responses |
| Recommendation click-through rate | +3% improvement | Clickstream analytics |
| Conversations per resolution | Fewer turns = better | DynamoDB conversation metadata |
| Escalation rate | Decrease | Intent classifier logs |
Observational metrics (monitor but don't gate on):
| Metric | Purpose |
|---|---|
| Token usage per conversation | Cost impact tracking |
| Cache hit rate (ElastiCache) | Ensure caching still effective with new model's output patterns |
| Intent distribution shift | Detect if the new model changes how users interact |
A3. Minimum Sample Size Calculation
Given:
- Baseline task completion rate: p₁ = 0.78
- Minimum detectable effect: δ = 0.02 (2% improvement → p₂ = 0.80)
- Significance level: α = 0.05 (two-tailed)
- Power: 1 - β = 0.80
Formula (two-proportion z-test):
$$n = \frac{(Z_{\alpha/2} + Z_\beta)^2 \cdot [p_1(1-p_1) + p_2(1-p_2)]}{\delta^2}$$
$$n = \frac{(1.96 + 0.84)^2 \cdot [0.78(0.22) + 0.80(0.20)]}{0.02^2}$$
$$n = \frac{7.84 \cdot [0.1716 + 0.16]}{0.0004}$$
$$n = \frac{7.84 \cdot 0.3316}{0.0004} = \frac{2.5997}{0.0004} \approx 6{,}500$$
Per group: ~6,500 conversations needed in each arm (control + treatment = ~13,000 total).
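A small helper reproducing this power analysis (scipy assumed available), useful when the baseline or minimum detectable effect changes:

import math
from scipy.stats import norm

def required_sample_size(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm sample size for a two-proportion z-test, matching the formula above."""
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 for two-tailed alpha = 0.05
    z_beta = norm.ppf(power)           # 0.84 for 80% power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)

print(required_sample_size(0.78, 0.80))  # ~6,500 per arm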
Time to collect at MangaAssist's traffic:
- Total traffic: ~300K conversations/day
- At 5% canary (Stage 1): the canary arm receives 15K conversations/day → reaches the sample size in under 12 hours (the 4-hour minimum stage duration exists for guardrail observation, not sample collection)
- At 25% canary (Stage 2): the canary arm receives 75K conversations/day → reaches the sample size in under 3 hours
Caveat: Reaching raw sample size quickly ≠ reaching reliable significance. We impose minimum durations (24–48 hours) to account for time-of-day effects, day-of-week variation, and novelty effects (see Q8).
Medium
A4. Stratified A/B Testing for Imbalanced Intent Traffic
MangaAssist intent traffic distribution:
| Intent | Daily Volume | % of Total |
|---|---|---|
| recommendation | 72K | 24% |
| product_discovery | 48K | 16% |
| faq | 45K | 15% |
| product_question | 36K | 12% |
| order_tracking | 30K | 10% |
| checkout_help | 24K | 8% |
| return_request | 18K | 6% |
| promotion | 12K | 4% |
| chitchat | 9K | 3% |
| escalation | 6K | 2% |
Problem: At 5% canary, escalation receives only 300 conversations/day. Reaching 6,500 sample size takes ~22 days — far too slow.
Stratified approach:
- Intent-level traffic splitting: Instead of uniform 5% canary across all traffic, apply different canary percentages per intent:
| Intent Tier | Intents | Canary % | Reason |
|---|---|---|---|
| High-traffic | recommendation, product_discovery, faq | 5% | Standard canary — ample volume |
| Medium-traffic | product_question, order_tracking, checkout_help | 10% | Slight increase to accelerate |
| Low-traffic | return_request, promotion, chitchat, escalation | 25% | Aggressive split — these intents have less revenue impact |
- Implementation: The intent classifier (SageMaker endpoint) returns the intent label. The application routing layer (ECS Fargate) checks the canary assignment table in Redis:
canary_rates = {
"recommendation": 0.05, "product_discovery": 0.05, "faq": 0.05,
"product_question": 0.10, "order_tracking": 0.10, "checkout_help": 0.10,
"return_request": 0.25, "promotion": 0.25, "chitchat": 0.25, "escalation": 0.25
}
- Sequential testing for low-traffic intents: Use a group sequential design — analyze at pre-defined intervals rather than waiting for the full sample size. Use an alpha spending function (O'Brien-Fleming) to control Type I error across interim analyses (see A8 for the schedule and a spending-function sketch).
- Pooled analysis as backup: For intents that can never reach individual significance (e.g., escalation with only 6K/day), group them into an "operational intents" pool and test at the pool level. Accept per-intent significance only when achievable.
A5. Conflicting Per-Intent Signals at the 25% Canary Stage
Situation:
- recommendation CSAT: +3% (positive)
- return_request task completion: -7% (negative)
- Composite metric: net positive
Decision framework — MangaAssist Canary Progression Rules:
Rule 1 — No critical intent regression:
IF any revenue_critical_intent has regression > 5%:
BLOCK progression regardless of composite metric
return_request is not revenue-critical (it's operational), so this rule doesn't block — but the 7% regression is concerning.
Rule 2 — Severity-weighted composite:
Assign weights by business impact:
| Intent | Weight | Reasoning |
|---|---|---|
| recommendation | 0.25 | Directly drives revenue |
| checkout_help | 0.20 | Conversion-critical |
| product_discovery | 0.15 | Revenue driver |
| return_request | 0.10 | Operational — impacts CSAT but not revenue |
| Others | 0.30 | Distributed |
Weighted score: (0.25 × +3%) + (0.10 × -7%) + (others neutral) = +0.75% - 0.70% = +0.05% → barely positive.
Rule 3 — Root cause before proceeding:
The +3%/-7% pattern is suspicious. Before progressing to 50%:
- Investigate the return_request regression:
  - Pull sample conversations from the canary arm
  - Check if the new model is failing to generate return labels, missing policy details, or misunderstanding return eligibility
  - Look for systematic failures vs. random quality variance
- Check whether the improvement and regression are correlated:
  - Are the same users seeing both? (Session-level analysis)
  - Is the new model redirecting return_request users to browse recommendations instead of processing their return?
- Run a targeted evaluation:
  - Execute the return_request Bedrock evaluation job against the new model
  - Compare against the golden dataset
Decision:
- Do NOT proceed to 50%. Hold at 25% for an additional 48 hours while investigating.
- If the return_request regression is a genuine model behavior → reject the candidate
- If it's a statistical artifact (small sample, edge cases) → extend the test at 25% for more data
A6. Session Stickiness During A/B Testing
MangaAssist session stickiness design:
Assignment persistence:
When a user starts a conversation, the A/B test assignment is generated and stored:
Redis Key: ab:session:{session_id}
Value: {
"model_version": "sonnet-v2-canary",
"assigned_at": "2025-03-15T10:30:00Z",
"test_id": "ab-test-047",
"intent_assignments": {
"recommendation": "canary",
"faq": "control"
}
}
TTL: 7 days
Flow when user returns 2 hours later:
- User sends a message → ECS Fargate receives the request
- Application layer extracts session_id from the cookie or Amazon customer ID
- Redis lookup: check ab:session:{session_id}
- If found → route to the same model version (canary or control)
- If not found (TTL expired or new session) → generate a new assignment based on current canary percentages
Mid-conversation model consistency:
The stickiness is at the conversation level, not the request level. Even if the intent changes mid-conversation (e.g., chitchat → recommendation), the model version stays consistent within that conversation.
What happens if the canary has been rolled back:
Scenario: User was assigned to canary at 10:00 AM. Canary rolled back at 11:00 AM. User returns at 12:00 PM.
- User sends a message → application checks Redis → finds model_version: "sonnet-v2-canary"
- Application checks a model availability flag in Parameter Store: canary_active: false
- Graceful fallback: route the user to the control model silently (a minimal sketch follows this list)
- Update Redis: model_version: "sonnet-v1-control", reassigned_reason: "canary_rollback"
- Log the reassignment for analysis (these users are excluded from A/B test results)
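A minimal sketch of the fallback check, assuming the Parameter Store status flag from A10 and a hypothetical assign_new() helper for fresh sessions:

import json

import boto3
import redis

ssm = boto3.client("ssm")
r = redis.Redis()  # assumed ElastiCache connection

def resolve_model(session_id: str) -> str:
    """Return the session's model version, silently falling back after a canary rollback."""
    raw = r.get(f"ab:session:{session_id}")
    if raw is None:
        return assign_new(session_id)  # hypothetical fresh-assignment helper
    assignment = json.loads(raw)
    if assignment["model_version"].endswith("-canary"):
        status = ssm.get_parameter(Name="/mangaassist/canary/status")["Parameter"]["Value"]
        if status == "rolled_back":
            # Silent reassignment; the flag excludes this session from A/B analysis
            assignment["model_version"] = "sonnet-v1-control"
            assignment["reassigned_reason"] = "canary_rollback"
            r.set(f"ab:session:{session_id}", json.dumps(assignment), keepttl=True)
    return assignment["model_version"]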
Data cleanliness: Users who experienced a mid-test rollback are flagged in analytics. Their conversations are excluded from the final A/B test statistical analysis to avoid contamination.
Hard
A7. Complete A/B Testing Infrastructure Design
Architecture overview:
Customer Request
│
▼
┌──────────────┐
│ ALB │ (No traffic splitting here —
│ │ uniform routing to ECS)
└──────┬───────┘
│
▼
┌──────────────┐
│ ECS Fargate │ Application-level A/B routing
│ (API Layer) │
└──────┬───────┘
│
┌──────────┼──────────┐
▼ ▼ ▼
┌────────────┐ ┌─────┐ ┌─────────┐
│ ElastiCache│ │ Req │ │ Intent │
│ Redis │ │Route│ │Classify │
│ (Assignment│ │ r │ │(SageMkr)│
│ Store) │ └──┬──┘ └────┬────┘
└────────────┘ │ │
▼ ▼
┌─────────────────────┐
│ Model Router │
│ (Control vs Canary) │
└────────┬────────────┘
┌─────┴─────┐
▼ ▼
┌──────────┐ ┌──────────┐
│ Bedrock │ │ Bedrock │
│ Control │ │ Canary │
│ (Sonnet │ │ (Sonnet │
│ v1) │ │ v2) │
└──────────┘ └──────────┘
Why application-level routing over ALB weighted routing:
ALB weighted routing operates at the infrastructure level — it can't make intent-aware decisions and can't persist assignment logic in Redis. Application-level routing enables:
- Per-intent canary percentages
- Session-sticky assignment
- Feature flag integration
- Detailed per-experiment logging
Traffic assignment logic (Python pseudocode):
import mmh3  # MurmurHash3: deterministic, fast, non-cryptographic hash

async def get_model_assignment(session_id: str, intent: str, test_config: dict) -> str:
    # Check Redis for an existing assignment (session stickiness)
    cached = await redis.hget(f"ab:{session_id}", intent)
    if cached:
        return cached
    # New assignment — deterministic hash for reproducibility: the same session
    # always lands in the same bucket for a given test_id
    hash_val = mmh3.hash(f"{session_id}:{test_config['test_id']}") % 10000
    threshold = test_config["canary_rates"][intent] * 10000  # e.g., 5% = 500
    assignment = "canary" if hash_val < threshold else "control"
    # Persist in Redis with a TTL matching the 7-day stickiness window
    await redis.hset(f"ab:{session_id}", intent, assignment)
    await redis.expire(f"ab:{session_id}", 604800)  # 7 days in seconds
    # Emit an assignment metric (redis and cloudwatch are assumed pre-initialized clients)
    cloudwatch.put_metric("ABAssignment", 1, dimensions={
        "test_id": test_config["test_id"],
        "intent": intent,
        "assignment": assignment
    })
    return assignment
Metrics collection pipeline:
- Per-request: ECS emits structured logs to CloudWatch Logs (JSON with test_id, assignment, intent, latency_ms, token_count, error_flag; an illustrative record follows this list)
- Aggregation: Kinesis Data Firehose streams logs to S3 (partitioned by date/test_id)
- Near-real-time analysis: A Lambda function runs every 5 minutes:
  - Queries CloudWatch Metrics Insights for guardrail checks
  - Runs the statistical significance test (z-test for proportions)
  - Publishes results to an S3 dashboard bucket and SNS
Statistical significance computation (near-real-time):
import math

from scipy import stats
def compute_significance(control_successes, control_total, canary_successes, canary_total):
p_control = control_successes / control_total
p_canary = canary_successes / canary_total
# Pooled proportion
p_pool = (control_successes + canary_successes) / (control_total + canary_total)
se = math.sqrt(p_pool * (1 - p_pool) * (1/control_total + 1/canary_total))
z = (p_canary - p_control) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z)))
return {
"z_score": z,
"p_value": p_value,
"significant": p_value < 0.05,
"effect_size": p_canary - p_control,
"confidence_interval": (
(p_canary - p_control) - 1.96 * se,
(p_canary - p_control) + 1.96 * se
)
}
A8. Testing Protocol That Accounts for Temporal Confounds
Problem: MangaAssist reaches statistical significance in 6 hours, but this captures only one time-of-day slice:
- Morning commuters in Japan may browse manga casually (easy FAQs)
- Evening US users may be making purchase decisions (complex recommendations)
- Weekend traffic patterns differ from weekday
Testing protocol:
1. Minimum test duration: 7 full days (one complete weekly cycle)
Regardless of when statistical significance is reached, the test continues for 7 days minimum. This captures:
- All 7 days of the week
- All time zones (US + JP traffic patterns)
- Payday effects (beginning/end of month)
2. Sequential analysis with alpha spending:
Use an O'Brien-Fleming alpha spending function to enable interim looks without inflating false positive rates:
| Interim Look | Calendar Time | Alpha Spent | Cumulative Alpha |
|---|---|---|---|
| Look 1 | Day 2 | 0.001 | 0.001 |
| Look 2 | Day 4 | 0.008 | 0.009 |
| Look 3 | Day 7 | 0.041 | 0.050 |
Early stopping is allowed only if the effect is massive (e.g., a 10%+ improvement) AND statistically significant at the adjusted alpha. This prevents acting on small fluctuations while still allowing rapid response to dramatically good or bad outcomes.
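A sketch of the Lan-DeMets O'Brien-Fleming spending function that approximately reproduces the table above (the Look 1 value rounds to ~0.0002 under this variant; the table's 0.001 is a coarser approximation):

from scipy.stats import norm

def obf_cumulative_alpha(t: float, alpha: float = 0.05) -> float:
    """Lan-DeMets O'Brien-Fleming approximation: alpha(t) = 2 * (1 - Phi(z_{alpha/2} / sqrt(t)))."""
    z = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    return 2 * (1 - norm.cdf(z / t ** 0.5))

# Interim looks on Days 2, 4, and 7 of a 7-day test (information fraction = day / 7)
for day in (2, 4, 7):
    print(f"Day {day}: cumulative alpha ~ {obf_cumulative_alpha(day / 7):.4f}")
# Day 2: ~0.0002, Day 4: ~0.0095, Day 7: 0.0500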
3. Day-of-week analysis:
After 7 days, check for interaction effects:
# compute_significance (defined in A7) takes per-arm success counts and totals;
# the *_by_day dicts are assumed aggregates from the metrics pipeline
for day in ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']:
    day_result = compute_significance(
        control_successes_by_day[day], control_totals_by_day[day],
        canary_successes_by_day[day], canary_totals_by_day[day],
    )
    if day_result['effect_size'] < 0:
        flag_as_inconsistent(day)
If the treatment is positive on weekdays but negative on weekends → the result is not robust. Extend the test.
4. Novelty/primacy effect detection:
Plot the treatment effect over time:
- If the effect is largest on Day 1 and decays → novelty effect (users react differently to a "new feel" initially)
- Compare the Day 1-2 effect size vs. the Day 5-7 effect size
- If > 50% decay → the true long-term effect is approximately the Day 5-7 estimate (a small check is sketched below)
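A tiny helper encoding the >50% decay rule from the list above:

def novelty_decayed(effect_early: float, effect_late: float) -> bool:
    """Compare the Day 1-2 effect vs. the Day 5-7 effect; flag decay above 50%."""
    if effect_early <= 0:
        return False  # no positive early effect to decay from
    return (effect_early - effect_late) / effect_early > 0.50

# Example: +4.0% on Days 1-2 shrinking to +1.5% on Days 5-7 is a 62.5% decay,
# so report the Day 5-7 estimate (+1.5%) as the long-term effect
novelty_decayed(0.040, 0.015)  # True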
5. MangaAssist-specific temporal factors:
- New manga release Wednesdays (Shonen Jump schedule) — product_discovery and recommendation traffic spikes 3×
- Amazon promotional events — the test must not span a promotional boundary (start or end during a sale)
- Japanese holidays — Golden Week and Obon may shift traffic patterns dramatically
Minimum test duration matrix:
| Scenario | Min Duration |
|---|---|
| Standard test | 7 days |
| Low-traffic intent (< 10K/day) | 14 days |
| Test spans a promotional event | 14 days (7 pre + 7 post) |
| Revenue-critical intent | 14 days |
A9. Multi-Region A/B Test (US and Japan)
Phased multi-region rollout:
Phase 1 (Week 1-2): US-East canary at 5% → 25% → 50%
Phase 2 (Week 3): US-East at 100%, JP canary at 5%
Phase 3 (Week 3-4): JP canary → 25% → 50%
Phase 4 (Week 5): JP at 100% (if approved)
Why sequential, not parallel:
- Risk isolation: If the model fails in US, JP is unaffected
- Learning transfer: US test results inform what to watch for in JP
- Resource management: Running parallel canaries requires double the monitoring infrastructure
Handling interaction effects (US English vs. Japanese):
Region-specific evaluation criteria:
| Metric | US Threshold | JP Threshold |
|---|---|---|
| Task completion rate | Improve ≥ 1% | Improve ≥ 1% |
| CSAT | No degradation | No degradation |
| Product name accuracy | ≥ 95% | ≥ 99% (Japanese titles must be exact — kanji errors are unacceptable) |
| Honorific correctness | N/A | ≥ 98% (お客様, ~さん must be correct) |
| Cross-language handling | English responses only | Must handle mixed JP/EN queries correctly |
Known risk — Japanese manga title handling:
The new model might improve English conversational quality but regress on Japanese by:
- Romanizing titles that should stay in kanji (進撃の巨人 → "Shingeki no Kyojin" when the customer used kanji)
- Suggesting incorrect furigana
- Mixing simplified Chinese characters with Japanese kanji
Mitigation:
1. Run the MangaDomainAccuracy (MDA) evaluation specifically on JP test data before deploying to JP
2. Add a JP-specific guardrail metric, "Japanese product name preservation rate", measured by the hallucination detector (a rough heuristic sketch follows this list)
3. If the US test passes but the JP offline evaluation fails → do not deploy to JP. Use a separate model version for JP if needed.
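A rough heuristic sketch of the preservation check; a production detector would match against the actual product catalog rather than raw kanji runs:

import re

# Runs of 2+ CJK Unified Ideographs serve as a crude proxy for Japanese titles in a query
KANJI_RUN = re.compile(r"[\u4e00-\u9fff]{2,}")

def japanese_titles_preserved(query: str, response: str) -> bool:
    """True if every kanji run from the customer's query appears verbatim in the response."""
    return all(run in response for run in KANJI_RUN.findall(query))

japanese_titles_preserved("進撃の巨人の新刊はある?", "進撃の巨人の最新刊はこちらです")          # True
japanese_titles_preserved("進撃の巨人の新刊はある?", "Shingeki no Kyojin の最新刊はこちらです")  # False: romanized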
Infrastructure:
Each region has its own ECS cluster, so canary deployments are region-independent. The A/B test configuration is stored in regional SSM Parameter Stores. A global dashboard aggregates results from both regions with clear per-region breakdowns.
Very Hard
A10. Automated Canary Rollback System
Rollback trigger hierarchy (fastest to slowest detection):
| Trigger | Detection Time | Threshold | Source |
|---|---|---|---|
| Error rate spike | < 1 minute | > 2% 5xx rate (5-min window) | ALB CloudWatch metrics |
| Latency degradation | < 2 minutes | P99 > 8s (2× baseline) | X-Ray + CloudWatch |
| Bedrock throttling | < 2 minutes | > 5% throttled requests | Bedrock CloudWatch metrics |
| Hallucination spike | < 10 minutes | > 3% hallucination rate | Custom detector Lambda |
| CSAT drop | < 4 hours | > 10% drop (rolling 4h) | Survey pipeline |
| Task completion drop | < 6 hours | > 5% drop (rolling 6h) | DynamoDB conversation analysis |
CloudWatch Alarm configuration:
# Critical alarm — triggers immediate rollback
CriticalCanaryAlarm:
Type: AWS::CloudWatch::CompositeAlarm
Properties:
AlarmRule: |
ALARM(CanaryErrorRate) OR
ALARM(CanaryLatencyP99) OR
ALARM(CanaryThrottling)
AlarmActions:
- !Ref RollbackSNSTopic # Notify
- !Ref RollbackLambdaARN # Execute rollback
# Warning alarm — pauses progression, alerts team
WarningCanaryAlarm:
Type: AWS::CloudWatch::CompositeAlarm
Properties:
AlarmRule: |
ALARM(CanaryHallucinationRate) OR
ALARM(CanaryCSATDrop)
AlarmActions:
- !Ref PauseSNSTopic
Rollback execution (Lambda function):
import json
from datetime import datetime
from uuid import uuid4

import boto3

# Clients, ARNs, and the Redis connection are assumed to be initialized at module load
elbv2 = boto3.client("elbv2")
ssm = boto3.client("ssm")
sns = boto3.client("sns")
rollbacks_table = boto3.resource("dynamodb").Table("canary_rollbacks")

def execute_rollback(event, context):
    # 1. Update ALB weights — shift 100% to control
    elbv2.modify_rule(
        RuleArn=CANARY_RULE_ARN,
        Actions=[{
            'Type': 'forward',
            'ForwardConfig': {
                'TargetGroups': [
                    {'TargetGroupArn': CONTROL_TG_ARN, 'Weight': 100},
                    {'TargetGroupArn': CANARY_TG_ARN, 'Weight': 0}
                ]
            }
        }]
    )
    # 2. Update Redis — mark canary as inactive so the app layer stops assigning to it
    redis_client.set("canary_active", "false")
    # 3. Update SSM Parameter Store
    ssm.put_parameter(
        Name='/mangaassist/canary/status',
        Value='rolled_back',
        Overwrite=True
    )
    # 4. Log the rollback event for the post-rollback diagnostic workflow
    rollback_details = {
        'rollback_id': str(uuid4()),
        'timestamp': datetime.utcnow().isoformat(),
        'trigger': event['detail']['alarmName'],
        'canary_stage': get_current_stage(),
        'metrics_snapshot': collect_metrics_snapshot()
    }
    rollbacks_table.put_item(Item=rollback_details)
    # 5. Notify team
    sns.publish(
        TopicArn=TEAM_NOTIFICATION_TOPIC,
        Subject="[ROLLBACK] MangaAssist canary rolled back",
        Message=json.dumps(rollback_details, default=str)
    )
Session migration during rollback:
- Users currently mid-conversation on the canary model are transparently migrated to the control model
- Redis stores the full conversation history (not just the model assignment), so the control model can continue the conversation using the history
- The reassigned_reason: "rollback" flag ensures these sessions are excluded from A/B test analysis
Anti-flapping mechanism:
| Mechanism | Implementation |
|---|---|
| Cooldown period | After a rollback, no new canary deployment for 24 hours minimum |
| Rollback counter | If 3 rollbacks occur within 7 days for the same candidate, the candidate is permanently rejected |
| Progressive threshold tightening | After a rollback, the next deployment must pass at 5% for 8 hours (2× normal) before progressing |
| Rollback root cause requirement | The canary cannot be re-deployed until a root cause analysis document is filed and reviewed |
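A sketch of a deployment gate enforcing the first two rows of this table; it assumes the rollback items carry a candidate_id attribute in addition to the fields logged by the rollback Lambda:

from datetime import datetime, timedelta

import boto3

table = boto3.resource("dynamodb").Table("canary_rollbacks")

def deployment_allowed(candidate_id: str) -> bool:
    """Enforce the 24-hour cooldown and the 3-rollbacks-in-7-days rejection rule."""
    now = datetime.utcnow()
    week = [
        item for item in table.scan()["Items"]  # table is tiny; a time-keyed query would scale better
        if datetime.fromisoformat(item["timestamp"]) > now - timedelta(days=7)
    ]
    # Cooldown: no new canary within 24 hours of any rollback
    if any(datetime.fromisoformat(i["timestamp"]) > now - timedelta(hours=24) for i in week):
        return False
    # Counter: 3 rollbacks in 7 days for the same candidate means permanent rejection
    return sum(1 for i in week if i.get("candidate_id") == candidate_id) < 3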
Post-rollback diagnostic workflow:
- Automated: Lambda captures a metrics snapshot, sample of failed conversations, and Bedrock invocation logs
- Within 1 hour: On-call engineer reviews the snapshot and confirms the rollback was justified
- Within 24 hours: Root cause analysis with comparison of canary vs control outputs on the failing conversations
- Before re-deployment: Fix validated in offline evaluation before any new canary attempt
A11. Controlling False Discovery Rate in Continuous Testing
Problem: MangaAssist runs a new A/B test every 2–3 weeks. Over a year, that's ~20 tests. At α = 0.05, the probability of at least one false positive is:
$$P(\text{at least 1 FP}) = 1 - (1 - 0.05)^{20} = 1 - 0.358 = 0.642$$
A 64% chance of declaring a winner that isn't actually better.
Framework — Continuous testing with FDR control:
1. Family-wise error rate (FWER) control — Bonferroni:
Adjust α per test: α_adjusted = 0.05 / K (where K = number of tests planned)
- For 20 tests/year: α_adjusted = 0.0025
- Problem: Extremely conservative. Moving α from 0.05 to 0.0025 raises Z_{α/2} from 1.96 to ≈3.02, which nearly doubles the required sample size per test. Tests take weeks longer at MangaAssist's volume.
2. False Discovery Rate (FDR) control — Benjamini-Hochberg (recommended):
Instead of controlling the probability of ANY false positive, control the proportion of false positives among all rejections:
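Sort the m p-values in ascending order and find the largest rank k satisfying

$$p_{(k)} \leq \frac{k}{m} \cdot q$$

then reject the hypotheses at ranks 1 through k. The expected proportion of false discoveries among the rejections is then at most q.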
Application to MangaAssist's continuous testing:
def apply_bh_correction(test_results: list[dict]) -> list[dict]:
    """
    Apply the Benjamini-Hochberg step-up procedure across all completed tests.
    Called when making final deployment decisions.
    """
    # Sort by p-value (ascending)
    sorted_results = sorted(test_results, key=lambda x: x['p_value'])
    m = len(sorted_results)
    q = 0.10  # FDR threshold (10% false discovery rate)
    # Step-up: find the largest rank k with p_(k) <= (k/m) * q ...
    k = 0
    for i, result in enumerate(sorted_results):
        rank = i + 1
        result['bh_threshold'] = (rank / m) * q
        if result['p_value'] <= result['bh_threshold']:
            k = rank
    # ... then reject ALL hypotheses at ranks 1..k, even those whose own
    # p-value exceeded their individual threshold
    for i, result in enumerate(sorted_results):
        result['significant_after_correction'] = (i + 1) <= k
    return sorted_results
3. Practical implementation for MangaAssist:
| Component | Design |
|---|---|
| Test registry | DynamoDB table tracking all tests: hypothesis, start/end dates, raw p-values, corrected p-values |
| Rolling window | Apply BH correction over the last 12 months of tests (older tests are "confirmed" and excluded) |
| Decision timing | Raw p-values are computed in real-time. BH correction is applied when making the deploy/no-deploy decision |
| FDR budget | q = 0.10 for non-critical intents, q = 0.05 for revenue-critical intents |
4. Alpha investing (alternative approach):
Treat the α budget as a "bank account":
- Start with α-wealth = 0.05
- Each test spends some α (if you test, you spend)
- Each successful test earns back α (validated positive results replenish the budget)
- If α-wealth reaches 0, stop testing until you earn more from validated wins
This naturally throttles testing when too many tests fail and encourages high-confidence experiments.
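A simplified bookkeeping sketch in the spirit of Foster and Stine's alpha-investing; the payout value is an assumed policy choice, and the exact published update rule differs slightly:

class AlphaWealthLedger:
    """Simplified alpha-investing ledger: spend on every test, earn back on wins."""

    def __init__(self, initial_wealth: float = 0.05, payout: float = 0.025):
        self.wealth = initial_wealth
        self.payout = payout  # assumed reward per validated positive result

    def can_afford(self, alpha_spend: float) -> bool:
        return self.wealth >= alpha_spend

    def record_test(self, alpha_spend: float, rejected_h0: bool) -> None:
        # Every test spends alpha; a validated rejection replenishes the budget
        self.wealth -= alpha_spend
        if rejected_h0:
            self.wealth += self.payout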
5. MangaAssist recommendation:
Use BH correction (q = 0.10) as the primary framework, supplemented by:
- A minimum 7-day test duration (prevents p-hacking through early stopping)
- Pre-registration of all test hypotheses before starting (prevents post-hoc metric selection)
- A quarterly review where the data science team evaluates the FDR across all tests run that quarter
A12. Multi-Armed Bandit vs A/B Testing — Hybrid Approach
Critique of pure MAB for MangaAssist:
Arguments FOR MAB:
- Reduces regret — shifts traffic to the better model faster
- Continuously adapts — no fixed test duration needed
- Efficient for many-arm scenarios (testing 5+ model configurations simultaneously)
Arguments AGAINST MAB for production model deployment:
| Risk | Explanation |
|---|---|
| Delayed convergence with noisy rewards | CSAT and task completion are noisy signals. MAB may oscillate between arms for weeks before converging. |
| Revenue-critical intents need certainty | For recommendation and checkout_help, the business needs to KNOW which model is better with statistical confidence — not probabilistically exploit the "probably better" arm. |
| Seasonal confounds | MAB algorithms assume stationary reward distributions. MangaAssist's traffic patterns shift weekly (new manga releases) and seasonally. A model that looks better during Shonen Jump release week may not be better generally. |
| Irreversible business decisions | Once you promote a model to production and update documentation/training/monitoring around it, switching back has organizational cost. MAB's continuous arm-switching doesn't match this operational reality. |
| Lack of causal inference | MAB optimizes allocation but doesn't produce a p-value or confidence interval. Stakeholders need to report "Model B is 3% ± 1% better with 95% confidence" — MAB can't provide this. |
When MAB IS appropriate for MangaAssist:
- Non-critical intents with fast, clear reward signals
- Exploring prompt template variants (not full model swaps)
- Tuning inference parameters (temperature, top_p) where all arms are the same model
Hybrid design — intent-tiered experimentation:
┌─────────────────────────────────────────────────┐
│ MangaAssist Experimentation Tiers │
├─────────────────────────────────────────────────┤
│ │
│ Tier 1 — Traditional A/B Testing │
│ Intents: recommendation, checkout_help, │
│ product_question, product_discovery │
│ Method: Fixed-horizon A/B test, 7-day min │
│ Decision: p < 0.05 (BH-corrected) │
│ Reason: Revenue impact — need statistical │
│ certainty before committing │
│ │
│ Tier 2 — Thompson Sampling (MAB) │
│ Intents: faq, chitchat, promotion, escalation │
│ Method: Beta-Bernoulli Thompson Sampling │
│ Arms: Up to 4 model configs simultaneously │
│ Reason: Low risk, fast iteration, explore │
│ many configurations cheaply │
│ │
│ Tier 3 — Contextual Bandit │
│ Intents: order_tracking, return_request │
│ Method: LinUCB with context features │
│ Context: query complexity, language, │
│ customer tier (Prime/non-Prime) │
│ Reason: Optimal model may vary by context │
│ — no single "best" model │
│ │
└─────────────────────────────────────────────────┘
Thompson Sampling implementation for Tier 2:
import numpy as np
class MangaAssistBandit:
def __init__(self, arms: list[str]):
# Beta distribution priors (uniform)
self.alpha = {arm: 1.0 for arm in arms}
self.beta = {arm: 1.0 for arm in arms}
def select_arm(self) -> str:
"""Thompson Sampling — sample from each arm's posterior, pick highest."""
samples = {
arm: np.random.beta(self.alpha[arm], self.beta[arm])
for arm in self.alpha
}
return max(samples, key=samples.get)
def update(self, arm: str, reward: float):
"""Update posterior with observed reward (0 or 1)."""
self.alpha[arm] += reward
self.beta[arm] += (1 - reward)
def get_best_arm(self) -> tuple[str, float]:
"""Return the arm with highest expected reward."""
expected = {
arm: self.alpha[arm] / (self.alpha[arm] + self.beta[arm])
for arm in self.alpha
}
best = max(expected, key=expected.get)
return best, expected[best]
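A usage sketch of how the Tier 2 router might drive this class per conversation (the arm names are illustrative):

bandit = MangaAssistBandit(arms=["sonnet-v1", "sonnet-v2", "haiku-config-a"])

arm = bandit.select_arm()        # pick a model config for this conversation
# ... serve the conversation with `arm`, then observe the binary reward
bandit.update(arm, reward=1.0)   # e.g., 1.0 = task completed, 0.0 = not

best_arm, expected_reward = bandit.get_best_arm()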
Guardrails on the MAB system:
| Guardrail | Purpose |
|---|---|
| Minimum exploration rate: 10% | Even the "worst" arm gets at least 10% traffic — prevents premature convergence on a local optimum |
| Weekly posterior review | Data scientist reviews the posterior distributions. If they haven't converged after 14 days, the reward signal may be too noisy. |
| Hard floor on all arms | No arm's response quality (BERTScore) is allowed below 0.80. If Thompson Sampling exploits an arm that drops below, that arm is removed. |
| Non-stationarity detection | Monitor reward mean per arm over time. If an arm's reward shifts > 2σ week-over-week, reset that arm's priors and restart exploration. |