
02 — A/B Testing & Canary Deployment

Safely rolling out new model versions for MangaAssist using staged canary deployments and statistically rigorous A/B testing.

Context

MangaAssist serves millions of manga customers on Amazon.com. Deploying a new model version (e.g., Claude 3.5 Sonnet v2, or promoting a fine-tuned Haiku) directly to 100% of traffic is unacceptable — a regression could impact conversion rates, CSAT, and customer trust. This scenario covers canary deployment stages (5% → 25% → 50% → 100%), A/B test design, statistical significance testing, and automated rollback criteria.

Questions (12)

Easy (1–3)

Q1. Explain the canary deployment pattern for MangaAssist's ECS Fargate service. When deploying a new model version, what does a 5% → 25% → 50% → 100% rollout look like in terms of ECS task sets and ALB target group weighting? What triggers the progression between stages?
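The stage-progression logic Q1 asks about can be sketched as follows. This is a minimal illustration, not MangaAssist's actual controller: the stage list matches the question, but the error-rate, latency, and bake-time thresholds are assumed values.

```python
# Canary stages, expressed as ALB target-group weight (%) for the new ECS task set.
STAGES = [5, 25, 50, 100]

def next_canary_weight(current_pct, error_rate, p99_latency_ms,
                       bake_minutes, max_error_rate=0.01,
                       max_p99_ms=3000, min_bake_minutes=30):
    """Return the next canary weight: 0 signals rollback, the same
    value signals 'keep baking', a larger value signals promotion."""
    # A guardrail breach at any stage triggers an immediate rollback.
    if error_rate > max_error_rate or p99_latency_ms > max_p99_ms:
        return 0
    # Each stage must bake long enough to accumulate meaningful metrics.
    if bake_minutes < min_bake_minutes:
        return current_pct
    idx = STAGES.index(current_pct)
    return STAGES[min(idx + 1, len(STAGES) - 1)]
```

In production the returned weight would be applied to the two ECS task sets (old and new) via ALB weighted target groups, e.g. boto3's `elbv2.modify_listener` with a `ForwardConfig` listing both target groups and their weights.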

Q2. For MangaAssist, define the key metrics you would track during an A/B test comparing the current Claude 3.5 Sonnet against a new version. Categorize these metrics into guardrail metrics (must not degrade) and success metrics (must improve).

Q3. What is the minimum sample size required to detect a 2% improvement in MangaAssist's task completion rate with 95% confidence and 80% power? Assume the current baseline task completion rate is 78%. How long would it take to collect this sample at MangaAssist's traffic volume of ~300K conversations/day?
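Q3 can be answered with a standard two-proportion power calculation (stdlib only). The 50/50 traffic split in the duration estimate is an assumption:

```python
import math
from statistics import NormalDist

def n_per_arm(p_baseline, p_treatment, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sided two-proportion z-test."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for 95% confidence
    z_b = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    p_bar = (p_baseline + p_treatment) / 2
    numerator = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * math.sqrt(p_baseline * (1 - p_baseline)
                                   + p_treatment * (1 - p_treatment))) ** 2
    return math.ceil(numerator / (p_baseline - p_treatment) ** 2)

n = n_per_arm(0.78, 0.80)          # ~6,510 conversations per arm
hours = n / (300_000 * 0.5) * 24   # ~1 hour at a 50/50 split
```

Note that detecting a 2-point absolute lift from a 78% baseline needs only ~6.5K conversations per arm, so raw traffic is not the bottleneck at 300K conversations/day; minimum test duration (the subject of Q8) is.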

Medium (4–6)

Q4. MangaAssist's traffic is not uniformly distributed across intents — recommendation and product_discovery represent 40% of traffic, while escalation represents only 3%. Design a stratified A/B testing approach that ensures statistical significance is reached for low-traffic intents without waiting months. How do you handle the intent imbalance?
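A back-of-the-envelope helper for Q4's imbalance: time-to-sample within a stratum scales inversely with that intent's traffic share, which is why low-traffic strata are typically oversampled. The per-arm target and the 90% boost below are illustrative assumptions:

```python
DAILY_CONVERSATIONS = 300_000

def days_to_sample(n_per_arm, intent_share, treatment_frac=0.5):
    """Days until one intent stratum accumulates n_per_arm treatment samples."""
    per_day = DAILY_CONVERSATIONS * intent_share * treatment_frac
    return n_per_arm / per_day

# escalation is only 3% of traffic; assume a hypothetical 20K-per-arm
# requirement (small effects need large samples):
slow = days_to_sample(20_000, 0.03)        # ~4.4 days at a 50/50 split
fast = days_to_sample(20_000, 0.03, 0.9)   # oversampling: route 90% of
                                           # escalation traffic to treatment
```

The oversampled stratum still yields an unbiased per-intent comparison as long as assignment within the stratum is random; only the cross-intent composite needs reweighting back to true traffic shares.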

Q5. During a canary deployment at the 25% stage, you observe that the new model version has a 3% higher CSAT for recommendation but a 7% lower task completion rate for return_request. The composite metric is net positive. Should you proceed to 50%? Design the decision framework that handles conflicting per-intent signals during canary progression.
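One way to frame the Q5 decision is as a per-intent non-inferiority gate: promotion is blocked if any intent degrades past its margin, even when the composite is net positive. The margins below are illustrative, not MangaAssist's real thresholds:

```python
def may_promote(per_intent_delta, margins, default_margin=-0.02):
    """per_intent_delta maps intent -> (treatment - control) on that
    intent's primary metric. Returns (ok, blocking_intents)."""
    blockers = [intent for intent, delta in per_intent_delta.items()
                if delta < margins.get(intent, default_margin)]
    return (len(blockers) == 0), blockers

ok, blockers = may_promote(
    {"recommendation": +0.03, "return_request": -0.07},
    margins={"return_request": -0.02},
)
# ok is False: return_request's -7% breaches its -2% margin,
# so the canary holds at 25% pending investigation.
```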

Q6. Explain how MangaAssist implements session stickiness during A/B testing. A customer starts a conversation with model version A and returns 2 hours later to continue. How does the system ensure they remain on the same model version? What happens if the canary has been rolled back in the interim?
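The stickiness in Q6 is commonly implemented with deterministic hash-based bucketing, so the same customer always resolves to the same variant, plus a persisted record in Redis. A minimal sketch (key names and TTL are assumptions):

```python
import hashlib

def assign_variant(customer_id, experiment, treatment_pct):
    """Deterministic bucketing: the same (experiment, customer) pair
    always hashes to the same bucket, so re-computing the assignment
    2 hours later yields the same variant."""
    digest = hashlib.sha256(f"{experiment}:{customer_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "treatment" if bucket < treatment_pct else "control"

# In production the result would also be written to Redis, e.g.
#   SETEX ab:{experiment}:{customer_id} 86400 treatment
# On a returning session the app reads Redis first. If the canary was
# rolled back in the interim, a rollback flag on the experiment forces
# "control" even for customers whose stored assignment says "treatment".
```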

Hard (7–9)

Q7. MangaAssist runs on ECS Fargate with ElastiCache Redis for session management. Design the complete A/B testing infrastructure: how traffic is split (ALB weighted routing vs. application-level routing), how the assignment is persisted (Redis), how metrics are collected (CloudWatch + custom metrics), and how the analysis pipeline computes statistical significance in near-real-time.

Q8. You are A/B testing a change that replaces Sonnet with fine-tuned Haiku for the faq intent. However, the faq intent receives 50K conversations/day and the test reaches statistical significance within 6 hours. Your data science team warns about novelty effects and day-of-week confounds. Design a testing protocol that accounts for temporal confounds while still enabling rapid iteration. How long should the minimum test duration be regardless of statistical significance?

Q9. MangaAssist operates in both the US and Japan marketplaces. Describe how you run a multi-region A/B test where the treatment (new model version) is deployed as a canary in US-East first, then JP after US achieves significance. How do you handle the interaction effects — e.g., a model that improves US English performance but degrades Japanese manga title handling?

Very Hard (10–12)

Q10. Design an automated canary rollback system for MangaAssist. Define the rollback triggers (error rate spike, latency degradation, CSAT drop, LLM hallucination detection), implementation using CloudWatch Alarms and Lambda, rollback execution (ECS task set revert, Redis session migration), and the post-rollback diagnostic workflow. How do you prevent flapping (repeated rollback/redeploy cycles)?
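One anti-flapping guard for Q10 can be sketched as a deploy gate: after a rollback, the same build is blocked until a cooldown elapses and its rollback has been diagnosed. The cooldown value and field names are illustrative assumptions:

```python
import time

class FlapGuard:
    """Blocks redeploys of a rolled-back build for cooldown_s seconds,
    and indefinitely until the rollback has been root-caused."""

    def __init__(self, cooldown_s=6 * 3600):
        self.cooldown_s = cooldown_s
        self.rollbacks = {}  # build_id -> (rollback_time, diagnosed)

    def record_rollback(self, build_id, now=None):
        now = time.time() if now is None else now
        self.rollbacks[build_id] = (now, False)

    def mark_diagnosed(self, build_id):
        t, _ = self.rollbacks[build_id]
        self.rollbacks[build_id] = (t, True)

    def may_deploy(self, build_id, now=None):
        if build_id not in self.rollbacks:
            return True
        now = time.time() if now is None else now
        t, diagnosed = self.rollbacks[build_id]
        return diagnosed and (now - t) >= self.cooldown_s
```

In the CloudWatch/Lambda design this check would sit in the rollback Lambda's companion deploy pipeline, so an alarm-triggered rollback cannot be immediately undone by an automated redeploy of the identical build.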

Q11. MangaAssist wants to run continuous A/B tests — as soon as one test concludes, the next candidate model begins testing. However, running sequential tests introduces multiple comparison bias (the more tests you run, the more likely you are to get a false positive). Design a framework that controls the false discovery rate across MangaAssist's continuous testing program. How do you apply corrections (Bonferroni, Benjamini-Hochberg) in a continuous testing context?
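The Benjamini-Hochberg step-up procedure named in Q11 is short enough to sketch directly; in a continuous testing program it would be re-run over a rolling window of recently concluded tests rather than a fixed family:

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return indices of hypotheses rejected at false discovery rate q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k/m)*q; reject all ranked <= k.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k_max = rank
    return sorted(order[:k_max])

# Four sequential model tests: BH declares only the two strongest results
# significant, whereas a naive per-test alpha=0.05 would also flag p=0.04.
rejected = benjamini_hochberg([0.001, 0.04, 0.012, 0.9])  # -> [0, 2]
```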

Q12. The MangaAssist team proposes a Multi-Armed Bandit (MAB) approach instead of traditional A/B testing for model selection, arguing it minimizes regret by quickly shifting traffic to the better model. Critique this proposal: under what conditions would MAB be appropriate for MangaAssist's model deployment, and under what conditions would it be dangerous? Design a hybrid approach that uses MAB for non-critical intents (chitchat, faq) and traditional A/B testing for revenue-impacting intents (recommendation, checkout_help).
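For the MAB half of the Q12 hybrid, a Beta-Bernoulli Thompson sampler is the usual starting point for binary feedback (e.g., thumbs-up on chitchat/faq turns). This is a simulation sketch under assumed success rates, not a claim about real model performance:

```python
import random

class ThompsonSampler:
    """Beta-Bernoulli Thompson sampling over candidate model versions."""

    def __init__(self, arms):
        # Beta(1, 1) prior: one pseudo-success, one pseudo-failure per arm.
        self.stats = {arm: [1, 1] for arm in arms}

    def choose(self):
        # Sample a plausible success rate per arm; serve the best draw.
        draws = {a: random.betavariate(s, f) for a, (s, f) in self.stats.items()}
        return max(draws, key=draws.get)

    def update(self, arm, success):
        self.stats[arm][0 if success else 1] += 1

random.seed(0)
mab = ThompsonSampler(["sonnet", "haiku-ft"])
for _ in range(2000):
    arm = mab.choose()
    # Simulated feedback: haiku-ft is truly better on this intent.
    rate = 0.80 if arm == "haiku-ft" else 0.70
    mab.update(arm, random.random() < rate)
# Traffic concentrates on haiku-ft without ever fixing a 50/50 split.
```

The regret-minimizing behavior shown here is exactly why MAB is risky for revenue intents: traffic shifts before a fixed-horizon significance test would conclude, and the unequal, time-varying allocation complicates unbiased effect estimation, which is the argument for keeping classical A/B tests on recommendation and checkout_help.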


← Back to Skill 02 Hub · Answers →