
02 — A/B Testing & Canary Deployment — Answers

Easy

A1. Canary Deployment Pattern on ECS Fargate

MangaAssist canary deployment — stage-by-stage:

Infrastructure setup:

  • The MangaAssist API runs on ECS Fargate behind an Application Load Balancer (ALB)
  • Two ECS task sets exist in the same ECS service: primary (current model) and canary (new model)
  • Both task sets run identical container images; the model version is controlled via environment variable MODEL_ID sourced from SSM Parameter Store

Stage progression:

| Stage | Canary Traffic | Duration | Progression Trigger |
|---|---|---|---|
| Stage 0 | 0% | Pre-deploy | Bedrock offline evaluation passes all gates |
| Stage 1 | 5% | 4 hours minimum | Error rate < 0.5%, P99 latency within 10% of baseline |
| Stage 2 | 25% | 24 hours minimum | All guardrail metrics pass, CSAT ≥ baseline - 1% |
| Stage 3 | 50% | 48 hours minimum | Statistical significance on primary success metric |
| Stage 4 | 100% | Permanent | Human approval + all metrics confirmed |

ALB weighted target group configuration:

ALB Listener Rule:
  TargetGroup-Primary:  weight = 95  (Stage 1)
  TargetGroup-Canary:   weight = 5   (Stage 1)

At each stage, a CodeDeploy Blue/Green deployment updates the weights. The ECS service maintains both task sets simultaneously — no cold starts during traffic shifts.

Progression triggers:

  • Automated (Stages 1→2, 2→3): A CloudWatch composite alarm evaluates the guardrail metrics. If all pass after the minimum duration, a Step Functions workflow updates the ALB weights.
  • Manual (Stage 3→4): Requires explicit approval via a CodePipeline manual approval action. An ML engineer reviews the full A/B test report before promoting to 100%.
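
A minimal sketch of the automated weight-advance step (the alarm name matches the composite alarm defined in A10; the rule and target-group ARNs are illustrative assumptions):

import boto3

cloudwatch = boto3.client("cloudwatch")
elbv2 = boto3.client("elbv2")

def advance_canary(rule_arn: str, control_tg: str, canary_tg: str, canary_weight: int) -> bool:
    """Shift ALB weights to the next stage only if the guardrail composite alarm is OK."""
    alarms = cloudwatch.describe_alarms(
        AlarmNames=["CriticalCanaryAlarm"], AlarmTypes=["CompositeAlarm"]
    )
    if any(a["StateValue"] != "OK" for a in alarms["CompositeAlarms"]):
        return False  # guardrails breached — hold at the current stage

    elbv2.modify_rule(
        RuleArn=rule_arn,
        Actions=[{
            "Type": "forward",
            "ForwardConfig": {"TargetGroups": [
                {"TargetGroupArn": control_tg, "Weight": 100 - canary_weight},
                {"TargetGroupArn": canary_tg, "Weight": canary_weight},
            ]},
        }],
    )
    return True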


A2. A/B Test Metrics for MangaAssist

Guardrail metrics (must not degrade — hard block if violated):

| Metric | Threshold | Source |
|---|---|---|
| Error rate (5xx) | < 0.5% | ALB access logs / CloudWatch |
| P99 latency (end-to-end) | < 5 seconds | X-Ray traces |
| LLM hallucination rate | < current + 1% | Custom hallucination detector (Lambda) |
| Safety filter trigger rate | < current + 0.5% | Bedrock guardrails logs |
| Cart abandonment rate | No increase > 2% | Amazon analytics pipeline |

Success metrics (must improve — primary decision drivers):

| Metric | Target | Source |
|---|---|---|
| Task completion rate | +2% improvement | DynamoDB conversation logs (did the user complete their goal?) |
| CSAT score (post-chat survey) | +0.1 points (4-point scale) | Survey responses |
| Recommendation click-through rate | +3% improvement | Clickstream analytics |
| Conversations per resolution | Fewer turns = better | DynamoDB conversation metadata |
| Escalation rate | Decrease | Intent classifier logs |

Observational metrics (monitor but don't gate on):

| Metric | Purpose |
|---|---|
| Token usage per conversation | Cost impact tracking |
| Cache hit rate (ElastiCache) | Ensure caching still effective with new model's output patterns |
| Intent distribution shift | Detect if the new model changes how users interact |
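
Metrics that don't arrive through ALB or X-Ray (hallucination flags, task completion, survey scores) can be published as custom CloudWatch metrics tagged by test arm. A minimal sketch, assuming a hypothetical MangaAssist/ABTest namespace and dimension names:

import boto3

cloudwatch = boto3.client("cloudwatch")

def emit_ab_metric(metric_name: str, value: float, test_id: str, arm: str, intent: str) -> None:
    """Publish one per-request observation, tagged by experiment arm and intent."""
    cloudwatch.put_metric_data(
        Namespace="MangaAssist/ABTest",          # assumed namespace
        MetricData=[{
            "MetricName": metric_name,           # e.g. "HallucinationFlag", "TaskCompleted"
            "Dimensions": [
                {"Name": "TestId", "Value": test_id},
                {"Name": "Arm", "Value": arm},   # "control" or "canary"
                {"Name": "Intent", "Value": intent},
            ],
            "Value": value,
        }],
    )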

A3. Minimum Sample Size Calculation

Given:

  • Baseline task completion rate: p₁ = 0.78
  • Minimum detectable effect: δ = 0.02 (2% improvement → p₂ = 0.80)
  • Significance level: α = 0.05 (two-tailed)
  • Power: 1 - β = 0.80

Formula (two-proportion z-test):

$$n = \frac{(Z_{\alpha/2} + Z_\beta)^2 \cdot [p_1(1-p_1) + p_2(1-p_2)]}{\delta^2}$$

$$n = \frac{(1.96 + 0.84)^2 \cdot [0.78(0.22) + 0.80(0.20)]}{0.02^2}$$

$$n = \frac{7.84 \cdot [0.1716 + 0.16]}{0.0004}$$

$$n = \frac{7.84 \cdot 0.3316}{0.0004} = \frac{2.5997}{0.0004} \approx 6{,}500$$

Per group: ~6,500 conversations needed in each arm (control + treatment = ~13,000 total).
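
As a quick numerical check of the formula above, a small scipy sketch (nothing here is MangaAssist-specific):

from scipy.stats import norm

p1, p2 = 0.78, 0.80
delta = p2 - p1                      # minimum detectable effect
z_alpha = norm.ppf(1 - 0.05 / 2)     # ≈ 1.960 for two-tailed alpha = 0.05
z_beta = norm.ppf(0.80)              # ≈ 0.842 for 80% power

n = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / delta ** 2
print(round(n))                      # ~6,500 conversations per arm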

Time to collect at MangaAssist's traffic:

  • Total traffic: ~300K conversations/day
  • At 5% canary (Stage 1): the canary arm receives 15K conversations/day — it reaches the sample size in under 12 hours, but the minimum stage duration is 4 hours for guardrail observation
  • At 25% canary (Stage 2): the canary arm receives 75K conversations/day — it reaches the required sample size in under 3 hours

Caveat: Reaching raw sample size quickly ≠ reaching reliable significance. We impose minimum durations (24–48 hours) to account for time-of-day effects, day-of-week variation, and novelty effects (see Q8).


Medium

A4. Stratified A/B Testing for Imbalanced Intent Traffic

MangaAssist intent traffic distribution:

| Intent | Daily Volume | % of Total |
|---|---|---|
| recommendation | 72K | 24% |
| product_discovery | 48K | 16% |
| faq | 45K | 15% |
| product_question | 36K | 12% |
| order_tracking | 30K | 10% |
| checkout_help | 24K | 8% |
| return_request | 18K | 6% |
| promotion | 12K | 4% |
| chitchat | 9K | 3% |
| escalation | 6K | 2% |

Problem: At 5% canary, escalation receives only 300 conversations/day. Reaching 6,500 sample size takes ~22 days — far too slow.

Stratified approach:

  1. Intent-level traffic splitting: Instead of uniform 5% canary across all traffic, apply different canary percentages per intent:
| Intent Tier | Intents | Canary % | Reason |
|---|---|---|---|
| High-traffic | recommendation, product_discovery, faq | 5% | Standard canary — ample volume |
| Medium-traffic | product_question, order_tracking, checkout_help | 10% | Slight increase to accelerate |
| Low-traffic | return_request, promotion, chitchat, escalation | 25% | Aggressive split — these intents have less revenue impact |
  2. Implementation: The intent classifier (SageMaker endpoint) returns the intent label. The application routing layer (ECS Fargate) checks the canary assignment table in Redis:
canary_rates = {
    "recommendation": 0.05, "product_discovery": 0.05, "faq": 0.05,
    "product_question": 0.10, "order_tracking": 0.10, "checkout_help": 0.10,
    "return_request": 0.25, "promotion": 0.25, "chitchat": 0.25, "escalation": 0.25
}
  3. Sequential testing for low-traffic intents: Use group sequential design — analyze at pre-defined intervals rather than waiting for full sample size. Use alpha spending functions (O'Brien-Fleming) to control Type I error across interim analyses.

  4. Pooled analysis as backup: For intents that can never reach individual significance (e.g., escalation with only 6K/day), group them into an "operational intents" pool and test at the pool level (see the sketch below). Accept per-intent significance only when achievable.
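
A minimal sketch of the pooled fallback in item 4, using statsmodels' two-proportion z-test (all counts are illustrative placeholders, not real data):

from statsmodels.stats.proportion import proportions_ztest

# Per-intent (successes, totals) for each arm — illustrative numbers only
pool_intents = ["return_request", "promotion", "chitchat", "escalation"]
control = {"return_request": (610, 800), "promotion": (390, 500),
           "chitchat": (310, 400), "escalation": (180, 250)}
canary = {"return_request": (640, 820), "promotion": (400, 510),
          "chitchat": (305, 395), "escalation": (190, 255)}

# Aggregate the low-traffic intents into one "operational" pool per arm
successes = [sum(control[i][0] for i in pool_intents), sum(canary[i][0] for i in pool_intents)]
totals = [sum(control[i][1] for i in pool_intents), sum(canary[i][1] for i in pool_intents)]

z_stat, p_value = proportions_ztest(count=successes, nobs=totals)
print(f"pooled z = {z_stat:.2f}, p = {p_value:.3f}")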


A5. Conflicting Per-Intent Signals at the 25% Canary Stage

Situation:

  • recommendation CSAT: +3% (positive)
  • return_request task completion: -7% (negative)
  • Composite metric: net positive

Decision framework — MangaAssist Canary Progression Rules:

Rule 1 — No critical intent regression:

IF any revenue_critical_intent has regression > 5%:
    BLOCK progression regardless of composite metric
return_request is not revenue-critical (it's operational), so this rule doesn't block — but the 7% regression is concerning.

Rule 2 — Severity-weighted composite:

Assign weights by business impact:

| Intent | Weight | Reasoning |
|---|---|---|
| recommendation | 0.25 | Directly drives revenue |
| checkout_help | 0.20 | Conversion-critical |
| product_discovery | 0.15 | Revenue driver |
| return_request | 0.10 | Operational — impacts CSAT but not revenue |
| Others | 0.30 | Distributed |

Weighted score: (0.25 × +3%) + (0.10 × -7%) + (others neutral) = +0.75% - 0.70% = +0.05% → barely positive.
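
The same computation as a small helper (a sketch; the weights mirror the table above and the per-intent effects come from the A/B results):

def weighted_composite(effects: dict[str, float], weights: dict[str, float]) -> float:
    """Severity-weighted composite: intents with no measured effect contribute 0."""
    return sum(weights.get(intent, 0.0) * effect for intent, effect in effects.items())

weights = {"recommendation": 0.25, "checkout_help": 0.20, "product_discovery": 0.15,
           "return_request": 0.10}
effects = {"recommendation": 0.03, "return_request": -0.07}    # +3%, -7%
print(f"{weighted_composite(effects, weights):+.4f}")          # +0.0005 → barely positive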

Rule 3 — Root cause before proceeding:

The +3%/-7% pattern is suspicious. Before progressing to 50%:

  1. Investigate the return_request regression:
     - Pull sample conversations from the canary arm
     - Check if the new model is failing to generate return labels, missing policy details, or misunderstanding return eligibility
     - Look for systematic failures vs. random quality variance

  2. Check if the improvement and regression are correlated:
     - Are the same users seeing both? (Session-level analysis)
     - Is the new model redirecting return_request users to browse recommendations instead of processing their return?

  3. Run a targeted evaluation:
     - Execute the return_request Bedrock evaluation job against the new model
     - Compare against the golden dataset

Decision:

  • Do NOT proceed to 50%. Hold at 25% for an additional 48 hours while investigating.
  • If the return_request regression is genuine model behavior → reject the candidate
  • If it's a statistical artifact (small sample, edge cases) → extend the test at 25% for more data


A6. Session Stickiness During A/B Testing

MangaAssist session stickiness design:

Assignment persistence:

When a user starts a conversation, the A/B test assignment is generated and stored:

Redis Key: ab:session:{session_id}
Value: {
    "model_version": "sonnet-v2-canary",
    "assigned_at": "2025-03-15T10:30:00Z",
    "test_id": "ab-test-047",
    "intent_assignments": {
        "recommendation": "canary",
        "faq": "control"
    }
}
TTL: 7 days

Flow when user returns 2 hours later:

  1. User sends a message → ECS Fargate receives the request
  2. Application layer extracts session_id from the cookie or Amazon customer ID
  3. Redis lookup: Check ab:session:{session_id}
  4. If found → route to the same model version (canary or control)
  5. If not found (TTL expired or new session) → generate new assignment based on current canary percentages

Mid-conversation model consistency:

The stickiness is at the conversation level, not the request level. Even if the intent changes mid-conversation (e.g., chitchat → recommendation), the model version stays consistent within that conversation.

What happens if the canary has been rolled back:

Scenario: User was assigned to canary at 10:00 AM. Canary rolled back at 11:00 AM. User returns at 12:00 PM.

  1. User sends message → application checks Redis → finds model_version: "sonnet-v2-canary"
  2. Application checks a model availability flag in Parameter Store: canary_active: false
  3. Graceful fallback: Route the user to the control model silently
  4. Update Redis: model_version: "sonnet-v1-control", reassigned_reason: "canary_rollback"
  5. Log the reassignment for analysis (these users are excluded from A/B test results)
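
A minimal sketch of this fallback flow (the SSM parameter name follows the rollback Lambda in A10; the Redis key shape follows the example above, and the helper name is illustrative):

import json
import boto3
import redis

ssm = boto3.client("ssm")
r = redis.Redis()

def resolve_model_version(session_id: str) -> str:
    """Return the model version to use, falling back to control if the canary was rolled back."""
    key = f"ab:session:{session_id}"
    assignment = json.loads(r.get(key) or "{}")
    model = assignment.get("model_version", "sonnet-v1-control")

    if model.endswith("canary"):
        status = ssm.get_parameter(Name="/mangaassist/canary/status")["Parameter"]["Value"]
        if status == "rolled_back":
            assignment["model_version"] = "sonnet-v1-control"
            assignment["reassigned_reason"] = "canary_rollback"
            r.set(key, json.dumps(assignment), keepttl=True)   # flagged for exclusion from analysis
            return "sonnet-v1-control"
    return model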

Data cleanliness: Users who experienced a mid-test rollback are flagged in analytics. Their conversations are excluded from the final A/B test statistical analysis to avoid contamination.


Hard

A7. Complete A/B Testing Infrastructure Design

Architecture overview:

                Customer Request
                       │
                       ▼
                ┌──────────────┐
                │     ALB      │  (No traffic splitting here — 
                │              │   uniform routing to ECS)
                └──────┬───────┘
                       │
                       ▼
                ┌──────────────┐
                │  ECS Fargate │  Application-level A/B routing
                │  (API Layer) │  
                └──────┬───────┘
                       │
            ┌──────────┼──────────┐
            ▼          ▼          ▼
     ┌────────────┐  ┌─────┐  ┌─────────┐
     │ ElastiCache│  │ Req │  │ Intent  │
     │ Redis      │  │Route│  │Classify │
     │ (Assignment│  │ r   │  │(SageMkr)│
     │  Store)    │  └──┬──┘  └────┬────┘
     └────────────┘     │          │
                        ▼          ▼
                 ┌─────────────────────┐
                 │   Model Router      │
                 │ (Control vs Canary) │
                 └────────┬────────────┘
                    ┌─────┴─────┐
                    ▼           ▼
              ┌──────────┐ ┌──────────┐
              │ Bedrock  │ │ Bedrock  │
              │ Control  │ │ Canary   │
              │ (Sonnet  │ │ (Sonnet  │
              │  v1)     │ │  v2)     │
              └──────────┘ └──────────┘

Why application-level routing over ALB weighted routing:

ALB weighted routing routes at the infrastructure level — it can't make intent-aware decisions and can't persist assignment logic in Redis. Application-level routing enables:

  • Per-intent canary percentages
  • Session-sticky assignment
  • Feature flag integration
  • Detailed per-experiment logging

Traffic assignment logic (Python pseudocode):

import mmh3
import boto3
from redis.asyncio import Redis

redis = Redis()                          # ElastiCache assignment store
cloudwatch = boto3.client("cloudwatch")

async def get_model_assignment(session_id: str, intent: str, test_config: dict) -> str:
    # Check Redis for existing assignment
    cached = await redis.hget(f"ab:{session_id}", intent)
    if cached:
        return cached.decode()  # Session stickiness

    # New assignment — deterministic hash for reproducibility
    hash_val = mmh3.hash(f"{session_id}:{test_config['test_id']}") % 10000
    threshold = test_config["canary_rates"][intent] * 10000  # e.g., 5% = 500

    assignment = "canary" if hash_val < threshold else "control"

    # Persist in Redis with TTL
    await redis.hset(f"ab:{session_id}", intent, assignment)
    await redis.expire(f"ab:{session_id}", 604800)  # 7 days

    # Emit assignment metric (namespace is illustrative)
    cloudwatch.put_metric_data(
        Namespace="MangaAssist/ABTest",
        MetricData=[{
            "MetricName": "ABAssignment",
            "Value": 1,
            "Dimensions": [
                {"Name": "TestId", "Value": test_config["test_id"]},
                {"Name": "Intent", "Value": intent},
                {"Name": "Assignment", "Value": assignment},
            ],
        }],
    )

    return assignment

Metrics collection pipeline:

  1. Per-request: ECS emits structured logs to CloudWatch Logs (JSON with test_id, assignment, intent, latency_ms, token_count, error_flag)
  2. Aggregation: Kinesis Data Firehose streams logs to S3 (partitioned by date/test_id)
  3. Near-real-time analysis: A Lambda function runs every 5 minutes:
     - Queries CloudWatch Metrics Insights for guardrail checks
     - Runs the statistical significance test (z-test for proportions)
     - Publishes results to an S3 dashboard bucket and SNS
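
An illustrative per-request log record in the shape this pipeline assumes (field values are made up):

import json, time

log_record = {
    "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    "test_id": "ab-test-047",
    "assignment": "canary",          # "control" or "canary"
    "intent": "recommendation",
    "latency_ms": 1840,
    "token_count": 612,
    "error_flag": False,
}
print(json.dumps(log_record))        # one JSON line per request → CloudWatch Logs → Firehose → S3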

Statistical significance computation (near-real-time):

import math

from scipy import stats

def compute_significance(control_successes, control_total, canary_successes, canary_total):
    p_control = control_successes / control_total
    p_canary = canary_successes / canary_total

    # Pooled proportion
    p_pool = (control_successes + canary_successes) / (control_total + canary_total)
    se = math.sqrt(p_pool * (1 - p_pool) * (1/control_total + 1/canary_total))

    z = (p_canary - p_control) / se
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))

    return {
        "z_score": z,
        "p_value": p_value,
        "significant": p_value < 0.05,
        "effect_size": p_canary - p_control,
        "confidence_interval": (
            (p_canary - p_control) - 1.96 * se,
            (p_canary - p_control) + 1.96 * se
        )
    }

A8. Testing Protocol That Accounts for Temporal Confounds

Problem: MangaAssist reaches statistical significance in 6 hours, but this captures only one time-of-day slice:

  • Morning commuters in Japan may browse manga casually (easy FAQs)
  • Evening US users may be making purchase decisions (complex recommendations)
  • Weekend traffic patterns differ from weekday

Testing protocol:

1. Minimum test duration: 7 full days (one complete weekly cycle)

Regardless of when statistical significance is reached, the test continues for 7 days minimum. This captures:

  • All 7 days of the week
  • All time zones (US + JP traffic patterns)
  • Payday effects (beginning/end of month)

2. Sequential analysis with alpha spending:

Use an O'Brien-Fleming alpha spending function to enable interim looks without inflating false positive rates:

| Interim Look | Calendar Time | Alpha Spent | Cumulative Alpha |
|---|---|---|---|
| Look 1 | Day 2 | 0.001 | 0.001 |
| Look 2 | Day 4 | 0.008 | 0.009 |
| Look 3 | Day 7 | 0.041 | 0.050 |

Early stopping is only allowed if the effect is massive (e.g., 10%+ improvement) AND statistically significant at the adjusted alpha. This prevents acting on small fluctuations while allowing rapid response to dramatically good/bad outcomes.
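
For reference, the cumulative-alpha column can be approximated with the Lan–DeMets O'Brien-Fleming-type spending function alpha(t) = 2·(1 − Φ(z_{α/2}/√t)), where t is the information fraction. A minimal sketch (using calendar time as a proxy for information fraction):

from scipy.stats import norm

alpha = 0.05
z = norm.ppf(1 - alpha / 2)                  # ≈ 1.96

def obf_cumulative_alpha(t: float) -> float:
    """O'Brien-Fleming-type spending function: cumulative alpha at information fraction t."""
    return 2 * (1 - norm.cdf(z / t ** 0.5))

for day in (2, 4, 7):
    t = day / 7
    print(f"Day {day}: cumulative alpha ≈ {obf_cumulative_alpha(t):.4f}")
# ≈ 0.0002, 0.0095, 0.0500 — close to the table above (exact values depend on the spending function used)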

3. Day-of-week analysis:

After 7 days, check for interaction effects:

# control_by_day[day] and canary_by_day[day] hold (successes, total) counts for that day
for day in ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']:
    day_result = compute_significance(*control_by_day[day], *canary_by_day[day])
    if day_result['effect_size'] < 0:
        flag_as_inconsistent(day)

If the treatment is positive on weekdays but negative on weekends → the result is not robust. Extend the test.

4. Novelty/primacy effect detection:

Plot the treatment effect over time:

  • If the effect is largest on Day 1 and decays → novelty effect (users react differently to a "new feel" initially)
  • Compare Day 1-2 effect size vs Day 5-7 effect size
  • If > 50% decay → the true long-term effect is approximately the Day 5-7 estimate
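
A minimal sketch of the decay check (the daily effect sizes are illustrative placeholders; in practice they come from the per-day analysis above):

# Daily treatment effects (canary minus control) — illustrative values
daily_effect = {1: 0.040, 2: 0.036, 3: 0.024, 4: 0.020, 5: 0.016, 6: 0.015, 7: 0.014}

early = sum(daily_effect[d] for d in (1, 2)) / 2      # Day 1-2 average effect
late = sum(daily_effect[d] for d in (5, 6, 7)) / 3    # Day 5-7 average effect
decay = 1 - late / early

if decay > 0.5:
    print(f"Novelty effect suspected ({decay:.0%} decay); use the Day 5-7 estimate ≈ {late:.1%}")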

5. MangaAssist-specific temporal factors:

  • New manga release Wednesdays (Shonen Jump schedule) — product_discovery and recommendation traffic spikes 3×
  • Amazon promotional events — test must not span a promotional boundary (start or end during a sale)
  • Japanese holidays — Golden Week, Obon may shift traffic patterns dramatically

Minimum test duration matrix:

| Scenario | Min Duration |
|---|---|
| Standard test | 7 days |
| Low-traffic intent (< 10K/day) | 14 days |
| Test spans a promotional event | 14 days (7 pre + 7 post) |
| Revenue-critical intent | 14 days |

A9. Multi-Region A/B Test (US and Japan)

Phased multi-region rollout:

Phase 1 (Week 1-2):  US-East canary at 5% → 25% → 50%
Phase 2 (Week 3):    US-East at 100%, JP canary at 5%
Phase 3 (Week 3-4):  JP canary → 25% → 50%
Phase 4 (Week 5):    JP at 100% (if approved)

Why sequential, not parallel:

  1. Risk isolation: If the model fails in US, JP is unaffected
  2. Learning transfer: US test results inform what to watch for in JP
  3. Resource management: Running parallel canaries requires double the monitoring infrastructure

Handling interaction effects (US English vs. Japanese):

Region-specific evaluation criteria:

| Metric | US Threshold | JP Threshold |
|---|---|---|
| Task completion rate | Improve ≥ 1% | Improve ≥ 1% |
| CSAT | No degradation | No degradation |
| Product name accuracy | ≥ 95% | ≥ 99% (Japanese titles must be exact — kanji errors are unacceptable) |
| Honorific correctness | N/A | ≥ 98% (お客様, ~さん must be correct) |
| Cross-language handling | English responses only | Must handle mixed JP/EN queries correctly |

Known risk — Japanese manga title handling:

The new model might improve English conversational quality but regress on Japanese by:

  • Romanizing titles that should stay in kanji (進撃の巨人 → "Shingeki no Kyojin" when the customer used kanji)
  • Incorrect furigana suggestions
  • Mixing simplified Chinese characters with Japanese kanji

Mitigation:

  1. Run the MangaDomainAccuracy (MDA) evaluation specifically on JP test data before deploying to JP
  2. Add a JP-specific guardrail metric — "Japanese product name preservation rate" — measured by the hallucination detector (see the sketch below)
  3. If the US test passes but the JP offline evaluation fails → do not deploy to JP. Use a separate model version for JP if needed.
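
A minimal sketch of how the "Japanese product name preservation rate" check might work (the catalog titles and the detection rule are illustrative assumptions, not the actual detector):

def title_preserved(response: str, canonical_titles: list[str], user_used_japanese: bool) -> bool:
    """If the customer used the Japanese title, the response must contain it verbatim."""
    if not user_used_japanese:
        return True
    return any(title in response for title in canonical_titles)

# Illustrative example: the customer asked about 進撃の巨人 in kanji
canonical = ["進撃の巨人"]
response = 'Volume 34 of "Shingeki no Kyojin" ships tomorrow.'          # romanized — counts as a violation
print(title_preserved(response, canonical, user_used_japanese=True))    # False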

Infrastructure:

Each region has its own ECS cluster, so canary deployments are region-independent. The A/B test configuration is stored in regional SSM Parameter Stores. A global dashboard aggregates results from both regions with clear per-region breakdowns.


Very Hard

A10. Automated Canary Rollback System

Rollback trigger hierarchy (fastest to slowest detection):

| Trigger | Detection Time | Threshold | Source |
|---|---|---|---|
| Error rate spike | < 1 minute | > 2% 5xx rate (5-min window) | ALB CloudWatch metrics |
| Latency degradation | < 2 minutes | P99 > 8s (2× baseline) | X-Ray + CloudWatch |
| Bedrock throttling | < 2 minutes | > 5% throttled requests | Bedrock CloudWatch metrics |
| Hallucination spike | < 10 minutes | > 3% hallucination rate | Custom detector Lambda |
| CSAT drop | < 4 hours | > 10% drop (rolling 4h) | Survey pipeline |
| Task completion drop | < 6 hours | > 5% drop (rolling 6h) | DynamoDB conversation analysis |

CloudWatch Alarm configuration:

# Critical alarm — triggers immediate rollback
CriticalCanaryAlarm:
  Type: AWS::CloudWatch::CompositeAlarm
  Properties:
    AlarmRule: |
      ALARM(CanaryErrorRate) OR 
      ALARM(CanaryLatencyP99) OR 
      ALARM(CanaryThrottling)
    AlarmActions:
      - !Ref RollbackSNSTopic     # Notify
      - !Ref RollbackLambdaARN    # Execute rollback

# Warning alarm — pauses progression, alerts team
WarningCanaryAlarm:
  Type: AWS::CloudWatch::CompositeAlarm
  Properties:
    AlarmRule: |
      ALARM(CanaryHallucinationRate) OR 
      ALARM(CanaryCSATDrop)
    AlarmActions:
      - !Ref PauseSNSTopic

Rollback execution (Lambda function):

import json
import boto3
from uuid import uuid4
from datetime import datetime

elbv2 = boto3.client('elbv2')
ssm = boto3.client('ssm')
sns = boto3.client('sns')
rollback_table = boto3.resource('dynamodb').Table('canary_rollbacks')
# CANARY_RULE_ARN, CONTROL_TG_ARN, CANARY_TG_ARN, TEAM_NOTIFICATION_TOPIC, redis_client,
# get_current_stage and collect_metrics_snapshot are assumed to come from configuration/helpers

def execute_rollback(event, context):
    # 1. Update ALB weights — shift 100% to control
    elbv2.modify_rule(
        RuleArn=CANARY_RULE_ARN,
        Actions=[{
            'Type': 'forward',
            'ForwardConfig': {
                'TargetGroups': [
                    {'TargetGroupArn': CONTROL_TG_ARN, 'Weight': 100},
                    {'TargetGroupArn': CANARY_TG_ARN, 'Weight': 0}
                ]
            }
        }]
    )

    # 2. Update Redis — mark canary as inactive
    redis_client.set("canary_active", "false")

    # 3. Update SSM Parameter Store
    ssm.put_parameter(
        Name='/mangaassist/canary/status',
        Value='rolled_back',
        Overwrite=True
    )

    # 4. Log rollback event (DynamoDB resource table accepts plain Python types)
    rollback_details = {
        'rollback_id': str(uuid4()),
        'timestamp': datetime.utcnow().isoformat(),
        'trigger': event['detail']['alarmName'],
        'canary_stage': get_current_stage(),
        'metrics_snapshot': collect_metrics_snapshot()
    }
    rollback_table.put_item(Item=rollback_details)

    # 5. Notify team
    sns.publish(
        TopicArn=TEAM_NOTIFICATION_TOPIC,
        Subject=f"[ROLLBACK] MangaAssist canary rolled back",
        Message=json.dumps(rollback_details)
    )

Session migration during rollback:

  • Users currently mid-conversation on the canary model are transparently migrated to the control model
  • Redis stores the full conversation history (not just model assignment), so the control model can continue the conversation using the history
  • The reassigned_reason: "rollback" flag ensures these sessions are excluded from A/B test analysis

Anti-flapping mechanism:

| Mechanism | Implementation |
|---|---|
| Cooldown period | After a rollback, no new canary deployment for 24 hours minimum |
| Rollback counter | If 3 rollbacks occur within 7 days for the same candidate, the candidate is permanently rejected |
| Progressive threshold tightening | After a rollback, the next deployment must pass at 5% for 8 hours (2× normal) before progressing |
| Rollback root cause requirement | The canary cannot be re-deployed until a root cause analysis document is filed and reviewed |
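
A sketch of how the cooldown and rollback-counter gates could be enforced before a new deployment, assuming each canary_rollbacks item also records a candidate_id (illustrative, not the actual implementation):

from datetime import datetime, timedelta

def deployment_allowed(candidate_id: str, rollbacks: list[dict]) -> bool:
    """Anti-flapping gate: 24h cooldown after any rollback, 3 rollbacks in 7 days → rejected."""
    now = datetime.utcnow()
    candidate_rollbacks = [
        datetime.fromisoformat(r["timestamp"])
        for r in rollbacks if r.get("candidate_id") == candidate_id
    ]
    if any(now - ts < timedelta(hours=24) for ts in candidate_rollbacks):
        return False                                   # cooldown still active
    recent = [ts for ts in candidate_rollbacks if now - ts < timedelta(days=7)]
    return len(recent) < 3                             # 3 strikes → permanently rejected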

Post-rollback diagnostic workflow:

  1. Automated: Lambda captures a metrics snapshot, sample of failed conversations, and Bedrock invocation logs
  2. Within 1 hour: On-call engineer reviews the snapshot and confirms the rollback was justified
  3. Within 24 hours: Root cause analysis with comparison of canary vs control outputs on the failing conversations
  4. Before re-deployment: Fix validated in offline evaluation before any new canary attempt

A11. Controlling False Discovery Rate in Continuous Testing

Problem: MangaAssist runs a new A/B test every 2–3 weeks. Over a year, that's ~20 tests. At α = 0.05, the probability of at least one false positive is:

$$P(\text{at least 1 FP}) = 1 - (1 - 0.05)^{20} = 1 - 0.358 = 0.642$$

A 64% chance of declaring a winner that isn't actually better.

Framework — Continuous testing with FDR control:

1. Family-wise error rate (FWER) control — Bonferroni:

Adjust α per test: α_adjusted = 0.05 / K (where K = number of tests planned)

  • For 20 tests/year: α_adjusted = 0.0025
  • Problem: Extremely conservative. Requires 4× the sample size per test. Tests take weeks longer at MangaAssist's volume.

2. False Discovery Rate (FDR) control — Benjamini-Hochberg (recommended):

Instead of controlling the probability of ANY false positive, control the proportion of false positives among all rejections:

Application to MangaAssist's continuous testing:

def apply_bh_correction(test_results: list[dict]) -> list[dict]:
    """
    Apply the Benjamini-Hochberg step-up correction across all completed tests.
    Called when making final deployment decisions.
    """
    # Sort by p-value (ascending)
    sorted_results = sorted(test_results, key=lambda x: x['p_value'])
    m = len(sorted_results)
    q = 0.10  # FDR threshold (10% false discovery rate)

    # Find the largest rank k with p_(k) <= (k/m) * q
    max_rank = 0
    for i, result in enumerate(sorted_results):
        rank = i + 1
        result['bh_threshold'] = (rank / m) * q
        if result['p_value'] <= result['bh_threshold']:
            max_rank = rank

    # Step-up rule: every test ranked <= max_rank is declared significant
    for i, result in enumerate(sorted_results):
        result['significant_after_correction'] = (i + 1) <= max_rank

    return sorted_results

3. Practical implementation for MangaAssist:

| Component | Design |
|---|---|
| Test registry | DynamoDB table tracking all tests: hypothesis, start/end dates, raw p-values, corrected p-values |
| Rolling window | Apply BH correction over the last 12 months of tests (older tests are "confirmed" and excluded) |
| Decision timing | Raw p-values are computed in real time; BH correction is applied when making the deploy/no-deploy decision |
| FDR budget | q = 0.10 for non-critical intents, q = 0.05 for revenue-critical intents |

4. Alpha investing (alternative approach):

Treat the α budget as a "bank account":

  • Start with α-wealth = 0.05
  • Each test spends some α (if you test, you spend)
  • Each successful test earns back α (validated positive results replenish the budget)
  • If α-wealth reaches 0, stop testing until you earn more from validated wins

This naturally throttles testing when too many tests fail and encourages high-confidence experiments.
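
A toy sketch of the alpha-wealth ledger (the spend/earn rule here is a simplified illustration, not a specific published procedure):

class AlphaWealthLedger:
    """Toy alpha-investing ledger: spend alpha on each test, earn some back on validated wins."""
    def __init__(self, initial_wealth: float = 0.05, payout: float = 0.02):
        self.wealth = initial_wealth
        self.payout = payout            # alpha earned back for each validated rejection

    def can_test(self, spend: float) -> bool:
        return self.wealth >= spend

    def record_test(self, spend: float, p_value: float) -> bool:
        """Spend alpha on the test; if p_value < spend, declare a win and earn back the payout."""
        if not self.can_test(spend):
            raise RuntimeError("Alpha wealth exhausted — stop testing until wins replenish it")
        self.wealth -= spend
        rejected = p_value < spend
        if rejected:
            self.wealth += self.payout
        return rejected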

5. MangaAssist recommendation:

Use BH correction (q=0.10) as the primary framework, supplemented by: - Minimum 7-day test duration (prevents p-hacking through early stopping) - Pre-registration of all test hypotheses before starting (prevents post-hoc metric selection) - A quarterly review where the data science team evaluates the FDR across all tests run that quarter


A12. Multi-Armed Bandit vs A/B Testing — Hybrid Approach

Critique of pure MAB for MangaAssist:

Arguments FOR MAB:

  • Reduces regret — shifts traffic to the better model faster
  • Continuously adapts — no fixed test duration needed
  • Efficient for many-arm scenarios (testing 5+ model configurations simultaneously)

Arguments AGAINST MAB for production model deployment:

| Risk | Explanation |
|---|---|
| Delayed convergence with noisy rewards | CSAT and task completion are noisy signals. MAB may oscillate between arms for weeks before converging. |
| Revenue-critical intents need certainty | For recommendation and checkout_help, the business needs to KNOW which model is better with statistical confidence — not probabilistically exploit the "probably better" arm. |
| Seasonal confounds | MAB algorithms assume stationary reward distributions. MangaAssist's traffic patterns shift weekly (new manga releases) and seasonally. A model that looks better during Shonen Jump release week may not be better generally. |
| Irreversible business decisions | Once you promote a model to production and update documentation/training/monitoring around it, switching back has organizational cost. MAB's continuous arm-switching doesn't match this operational reality. |
| Lack of causal inference | MAB optimizes allocation but doesn't produce a p-value or confidence interval. Stakeholders need to report "Model B is 3% ± 1% better with 95% confidence" — MAB can't provide this. |

When MAB IS appropriate for MangaAssist:

  • Non-critical intents with fast, clear reward signals
  • Exploring prompt template variants (not full model swaps)
  • Tuning inference parameters (temperature, top_p) where all arms are the same model

Hybrid design — intent-tiered experimentation:

┌─────────────────────────────────────────────────┐
│          MangaAssist Experimentation Tiers       │
├─────────────────────────────────────────────────┤
│                                                 │
│  Tier 1 — Traditional A/B Testing               │
│  Intents: recommendation, checkout_help,        │
│           product_question, product_discovery    │
│  Method: Fixed-horizon A/B test, 7-day min      │
│  Decision: p < 0.05 (BH-corrected)             │
│  Reason: Revenue impact — need statistical      │
│          certainty before committing             │
│                                                 │
│  Tier 2 — Thompson Sampling (MAB)               │
│  Intents: faq, chitchat, promotion, escalation  │
│  Method: Beta-Bernoulli Thompson Sampling       │
│  Arms: Up to 4 model configs simultaneously     │
│  Reason: Low risk, fast iteration, explore       │
│          many configurations cheaply             │
│                                                 │
│  Tier 3 — Contextual Bandit                     │
│  Intents: order_tracking, return_request        │
│  Method: LinUCB with context features            │
│  Context: query complexity, language,            │
│           customer tier (Prime/non-Prime)        │
│  Reason: Optimal model may vary by context       │
│          — no single "best" model                │
│                                                 │
└─────────────────────────────────────────────────┘

Thompson Sampling implementation for Tier 2:

import numpy as np

class MangaAssistBandit:
    def __init__(self, arms: list[str]):
        # Beta distribution priors (uniform)
        self.alpha = {arm: 1.0 for arm in arms}
        self.beta = {arm: 1.0 for arm in arms}

    def select_arm(self) -> str:
        """Thompson Sampling — sample from each arm's posterior, pick highest."""
        samples = {
            arm: np.random.beta(self.alpha[arm], self.beta[arm])
            for arm in self.alpha
        }
        return max(samples, key=samples.get)

    def update(self, arm: str, reward: float):
        """Update posterior with observed reward (0 or 1)."""
        self.alpha[arm] += reward
        self.beta[arm] += (1 - reward)

    def get_best_arm(self) -> tuple[str, float]:
        """Return the arm with highest expected reward."""
        expected = {
            arm: self.alpha[arm] / (self.alpha[arm] + self.beta[arm])
            for arm in self.alpha
        }
        best = max(expected, key=expected.get)
        return best, expected[best]
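
A brief usage illustration (the arm names and the reward definition are illustrative):

bandit = MangaAssistBandit(arms=["sonnet-v1", "sonnet-v2", "haiku-tuned"])

# Per conversation: pick an arm, serve it, then record a binary reward
arm = bandit.select_arm()
reward = 1.0            # e.g., 1 if the FAQ was resolved without escalation, else 0
bandit.update(arm, reward)

print(bandit.get_best_arm())   # current best arm and its expected reward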

Guardrails on the MAB system:

| Guardrail | Purpose |
|---|---|
| Minimum exploration rate: 10% | Even the "worst" arm gets at least 10% traffic — prevents premature convergence on a local optimum |
| Weekly posterior review | A data scientist reviews the posterior distributions. If they haven't converged after 14 days, the reward signal may be too noisy. |
| Hard floor on all arms | No arm's response quality (BERTScore) may fall below 0.80. If Thompson Sampling exploits an arm that drops below the floor, that arm is removed. |
| Non-stationarity detection | Monitor the reward mean per arm over time. If an arm's reward shifts > 2σ week-over-week, reset that arm's priors and restart exploration. |

← Back to Questions · ← Back to Skill 02 Hub