Deep Dive Scenarios: How I Used Hypothesis Testing in MangaAssist
1. How I would summarize this in an interview
If an interviewer asks where I used hypothesis testing in this chatbot, I would not answer with theory first. I would answer with release decisions.
I used hypothesis testing in four major places:
- during canary rollout to decide whether a new model or prompt was safe enough to promote
- during A/B testing to prove whether the chatbot improved business metrics like conversion
- during shadow mode to compare a candidate retriever or prompt against production on the same live traffic
- during continuous monitoring to detect drift, regression, and hidden quality incidents after launch
The important point is that I did not use hypothesis testing as a notebook exercise. I used it as a production decision system.
2. Scenario 1: Canary rollout decision for a new generation model
2.1 Business and engineering context
MangaAssist used a hybrid architecture, but the recommendation, FAQ, and complex conversation paths still depended on an LLM. That meant a model or prompt change could improve helpfulness while also making the system slower, more expensive, or less safe.
After a candidate model passed offline evaluation and shadow mode, I exposed it to a small amount of real traffic through a staged canary rollout:
- 1 percent
- 10 percent
- 50 percent
- 100 percent
The key question was not "does the output look better?" The key question was "is it safe to send more users to this version?"
2.2 What problem I was solving
The biggest production risk was not a total outage. It was a silent degradation:
- more users asking for a human
- more failed resolutions
- more unsafe or blocked responses
- slower tail latency
Those problems do not always show up clearly in offline testing, so I needed a formal live decision rule.
2.3 Hypothesis
For the canary safety gate, I used escalation rate as the primary metric.
- Null hypothesis (H0): the candidate escalation rate is the same as the baseline escalation rate
- Alternative hypothesis (H1): the candidate escalation rate is higher than the baseline escalation rate
I treated this as a safety-oriented test. I only cared about detecting whether the new version was worse.
2.4 Why escalation rate was the primary metric
I chose escalation rate because it captured multiple failure modes at once:
- low answer quality
- incorrect routing
- unresolved issues
- user frustration
- higher downstream support cost
If escalation went up, it usually meant the new version was harming both user experience and operations.
2.5 Experimental design
The traffic split compared the canary population against the live baseline population in the same time window.
Primary metric:
- escalation rate
Secondary guardrail metrics:
- thumbs-down rate
- error rate
- P99 latency
- guardrail block rate
Decision rules:
- auto-rollback if a hard threshold was breached
- otherwise wait until minimum sample size was reached
- then run a formal significance test before promotion
This mattered because a small canary can show noisy percentages early on. I did not want to promote or roll back based on noise.
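To make the decision rule concrete, here is a minimal sketch of the gate logic. All thresholds and field names are illustrative placeholders, not the real production values:

```python
from dataclasses import dataclass

@dataclass
class CanaryMetrics:
    sessions: int
    error_rate: float           # fraction of failed requests
    p99_latency_ms: float
    guardrail_block_rate: float
    escalation_p_value: float   # from the significance test in 2.6

# Hypothetical gate values, for illustration only.
HARD_ERROR_RATE = 0.05
HARD_P99_MS = 3_000
HARD_BLOCK_RATE = 0.03
MIN_SESSIONS = 5_000
ALPHA = 0.05

def canary_decision(m: CanaryMetrics) -> str:
    # Hard thresholds trigger rollback immediately, no p-value needed.
    if (m.error_rate > HARD_ERROR_RATE
            or m.p99_latency_ms > HARD_P99_MS
            or m.guardrail_block_rate > HARD_BLOCK_RATE):
        return "rollback"
    # Below minimum sample size, percentages are too noisy to act on.
    if m.sessions < MIN_SESSIONS:
        return "hold"
    # Borderline cases fall through to the formal significance test.
    return "rollback" if m.escalation_p_value < ALPHA else "promote"

m = CanaryMetrics(sessions=6_000, error_rate=0.01, p99_latency_ms=1_800,
                  guardrail_block_rate=0.005, escalation_p_value=0.002)
print(canary_decision(m))  # "rollback": escalation regression is significant
```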
2.6 Statistical test I used
I used a two-proportion z-test because escalation rate is a binary outcome:
- escalation happened
- escalation did not happen
This test was the right fit because I was comparing the rate of a binary event between two groups.
2.7 Worked example from the chatbot
Baseline:
- total sessions = 495,000
- escalations = 59,400
- escalation rate = 12 percent
Canary:
- total sessions = 5,000
- escalations = 700
- escalation rate = 14 percent
Observed delta:
- absolute increase = 2 percentage points
- relative increase = about 16.7 percent
The z-score from the two-proportion test was about 4.33, far beyond the one-sided critical value of roughly 1.645 at the 95 percent confidence level.
Interpretation:
- this was very unlikely to be random noise
- the canary was meaningfully worse than the baseline
Decision:
- roll back the candidate
- do not move from 1 percent to 10 percent
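A minimal implementation of that test, which reproduces the numbers in the worked example:

```python
from math import sqrt

from scipy.stats import norm

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """One-sided two-proportion z-test: is rate B higher than rate A?"""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = norm.sf(z)  # upper-tail probability for the one-sided test
    return z, p_value

# Baseline vs. canary escalation counts from the worked example.
z, p = two_proportion_z(59_400, 495_000, 700, 5_000)
print(f"z = {z:.2f}, p = {p:.2e}")  # z ≈ 4.33, p ≈ 7.6e-06
```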
2.8 Critical decisions I took
Decision 1: I used a safety-first primary metric
I did not make "answer quality" the primary live decision metric because quality is harder to measure in real time. Escalation rate was faster, more reliable, and directly tied to business impact.
Decision 2: I separated hard thresholds from statistical significance
If error rate or guardrail block rate spiked sharply, I would not wait for a perfect p-value. I used hard rollback thresholds for severe incidents and hypothesis testing for borderline cases.
Decision 3: I used staged rollout instead of a big-bang release
This reduced blast radius. If the model was bad, only a small percentage of users were exposed before rollback.
Decision 4: I looked at practical significance, not just statistical significance
At scale, tiny effects can become statistically significant. I cared whether the increase was large enough to change support load and user experience, not only whether p < 0.05.
Decision 5: I treated canary as a system test, not a model-only test
I was evaluating the full path:
- routing
- retrieval
- generation
- guardrails
- frontend behavior
That made the canary decision much more reliable than judging the model in isolation.
2.9 Panel-style follow-up questions and strong answers
| Who asks | Follow-up question | Strong answer |
|---|---|---|
| Hiring manager | Why not just compare raw percentages and move on? | Because early canary traffic is noisy. Raw deltas can overreact to randomness. I needed a repeatable decision rule for promotion and rollback. |
| Data scientist | Why did you use a z-test instead of a t-test? | Escalation is a binary event, so I was comparing two proportions, not two continuous means. That makes the two-proportion z-test the natural fit. |
| Product manager | Why did you prioritize escalation rate over CSAT? | CSAT is slower, sparser, and more biased because only a subset of users respond. Escalation is immediate, high-volume, and operationally meaningful. |
| SRE | What if the sample is too small but the regression looks severe? | I did not rely only on significance. I also defined hard thresholds for error rate, latency, and safety metrics so that severe regressions triggered rollback immediately. |
| ML engineer | How did you prevent peeking? | I predefined rollout gates, minimum sample sizes, and decision metrics before reading the results. That prevented me from stopping the test as soon as I saw a favorable number. |
| Senior engineer | What was the most important design choice here? | Treating canary as full-system evaluation instead of model evaluation. Many regressions come from routing, retrieval, prompt length, or guardrails, not only from the model itself. |
3. Scenario 2: A/B test to prove business impact of the chatbot
3.1 Business context
Once the chatbot was technically stable, the next question was not about safety. It was about value.
The business wanted to know whether showing the chatbot to users actually improved outcomes such as:
- conversion rate
- add-to-cart rate
- revenue per chat session
- average order value
This is where hypothesis testing moved from release safety to product impact.
3.2 What problem I was solving
Product teams often see a higher conversion rate and assume the chatbot caused it. That is dangerous because conversion is influenced by:
- promotions
- seasonality
- traffic source mix
- page placement
- user segment differences
I needed a design that could isolate the impact of the chatbot itself.
3.3 Experiment setup
I set up an A/B test with:
- control: users who did not see the chatbot
- treatment: users who did see the chatbot
- primary metric: purchase within 24 hours
I randomized at the user level so the same user would not bounce between control and treatment across repeated sessions. That reduced contamination.
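A common way to implement sticky user-level assignment is to hash the user ID deterministically. A sketch, where the salt string and the 50/50 split are hypothetical choices:

```python
import hashlib

def assign_variant(user_id: str, treatment_share: float = 0.5,
                   salt: str = "chatbot-ab-experiment") -> str:
    """Deterministically map a user to control or treatment.

    The same user always lands in the same bucket across sessions,
    which prevents contamination between variants.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return "treatment" if bucket < treatment_share else "control"

print(assign_variant("user-12345"))  # stable across repeated calls
```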
3.4 Hypothesis
Primary business hypothesis:
- Null hypothesis (H0): treatment conversion rate is equal to control conversion rate
- Alternative hypothesis (H1): treatment conversion rate is higher than control conversion rate
I also tracked secondary metrics:
- AOV
- CSAT
- escalation rate
- latency
This was important because a chatbot that increases conversion but also explodes support cost or hurts user trust is not a real win.
3.5 Sample size and power
Before launch, I defined:
- acceptable false positive rate
- target power
- minimum detectable effect
That prevented a common mistake in experimentation: running the test with too little traffic and declaring "no impact" when there simply was not enough data.
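Here is a sketch of the standard sample-size calculation for a two-proportion test, using the closed-form normal approximation. The baseline rate and minimum detectable effect below are illustrative inputs, not the real experiment parameters:

```python
from math import ceil, sqrt

from scipy.stats import norm

def sample_size_per_group(p_base, mde, alpha=0.05, power=0.8):
    """Users needed per group to detect an absolute lift of `mde`
    over baseline rate `p_base` with a one-sided two-proportion test."""
    p_treat = p_base + mde
    p_bar = (p_base + p_treat) / 2
    z_alpha = norm.ppf(1 - alpha)  # one-sided test
    z_beta = norm.ppf(power)
    num = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
           + z_beta * sqrt(p_base * (1 - p_base) + p_treat * (1 - p_treat)))
    return ceil((num / mde) ** 2)

# Illustrative: baseline conversion 8.4%, detect a +1 point absolute lift.
print(sample_size_per_group(0.084, 0.01))  # ≈ 10,025 users per group
```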
3.6 Statistical tests I used
I used different tests for different metric types:
- two-proportion z-test for conversion rate
- t-test for AOV because it is a continuous metric
- multiple testing correction when several secondary metrics were reviewed together
This mattered because not every metric should be tested with the same method.
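For the multiple-testing correction, a minimal Holm-Bonferroni sketch; the p-values at the bottom are placeholders, not real experiment results:

```python
def holm_correction(p_values, alpha=0.05):
    """Holm-Bonferroni: step-down control of the family-wise error rate.

    Returns a list of booleans, True where the null is rejected.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # Compare the k-th smallest p-value against alpha / (m - k).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

# Placeholder p-values for AOV, CSAT, escalation, latency.
print(holm_correction([0.003, 0.04, 0.02, 0.30]))
# [True, False, False, False]: only the smallest survives correction
```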
3.7 Worked example from the chatbot
Control group:
- users = 50,000
- purchases = 4,200
- conversion rate = 8.4 percent
Treatment group:
- users = 50,000
- purchases = 4,700
- conversion rate = 9.4 percent
Observed delta:
- absolute lift = 1 percentage point
- relative lift = about 11.9 percent
The two-proportion z-score works out to about 5.6, far beyond any conventional significance threshold, so the improvement is very unlikely to be explained by random variation alone.
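A sketch that reproduces that z-score and adds a 95 percent confidence interval for the absolute lift, which is what I used to judge practical significance:

```python
from math import sqrt

from scipy.stats import norm

def lift_with_ci(purch_c, n_c, purch_t, n_t, confidence=0.95):
    """Absolute conversion lift (treatment minus control), z-score,
    and a Wald confidence interval for the lift."""
    p_c, p_t = purch_c / n_c, purch_t / n_t
    lift = p_t - p_c
    p_pool = (purch_c + purch_t) / (n_c + n_t)
    se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = lift / se_pooled
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z_crit = norm.ppf(1 - (1 - confidence) / 2)
    return z, lift, (lift - z_crit * se, lift + z_crit * se)

z, lift, (lo, hi) = lift_with_ci(4_200, 50_000, 4_700, 50_000)
print(f"z = {z:.1f}, lift = {lift:.4f}, 95% CI = [{lo:.4f}, {hi:.4f}]")
# z ≈ 5.6, lift = 0.0100, CI ≈ [0.0065, 0.0135] percentage points
```

The confidence interval matters as much as the p-value here: even the low end of the interval is a lift worth shipping, which is what made the result practically significant and not just statistically significant.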
3.8 The decision I took
I did not stop at "conversion went up."
I checked whether:
- escalation stayed within acceptable bounds
- latency remained within SLA
- AOV was flat or positive
- the lift held across key slices such as device type and major intent families
After that, I was comfortable recommending wider rollout because the chatbot showed measurable business lift without an obvious operational regression.
3.9 Critical decisions I took
Decision 1: I pre-registered one primary metric
The primary success metric was conversion within 24 hours. This avoided cherry-picking whichever metric looked good after the fact.
Decision 2: I separated business metrics from guardrail metrics
Conversion decided whether the experiment created value. Guardrail metrics decided whether the value was safe to keep.
Decision 3: I used practical significance, not only p-values
Even if a metric is statistically significant, it may not be worth shipping if the effect is too small to matter operationally.
Decision 4: I corrected for multiple comparisons on secondary metrics
If I looked at conversion, AOV, CSAT, escalation, and latency together, I had to control the false positive risk. Otherwise, one of them could appear significant by chance.
Decision 5: I randomized at the user level
That was important because repeated exposure from the same user can bias results if the same person sees both experiences.
3.10 Panel-style follow-up questions and strong answers
| Who asks | Follow-up question | Strong answer |
|---|---|---|
| Product manager | Why was conversion the primary metric instead of engagement? | Engagement is only an activity signal. Conversion ties directly to revenue and is closer to the business objective. |
| Data scientist | How did you avoid claiming victory from one good-looking metric? | I pre-registered conversion as primary, treated other metrics as secondary, and corrected for multiple comparisons when interpreting them. |
| Senior engineer | Why not just compare chat users versus non-chat users without an experiment? | That would be heavily confounded because the users who choose to open chat are already different from the users who do not. Randomization was necessary for causal inference. |
| Finance stakeholder | What if conversion improves but cost also rises? | Then I would compare net value, not only topline lift. A statistically significant lift is not enough if margin or support cost gets worse. |
| ML engineer | How did you handle AOV, which is not binary like conversion? | I treated AOV as a continuous metric and used a mean comparison test instead of a proportion test. |
| Hiring manager | What senior-level decision did you make here? | I made sure the experiment answered a business question and not just a modeling question. That meant aligning metric choice, sample design, and rollout criteria with the business objective. |
4. Scenario 3: Shadow mode comparison for a retriever and prompt change
4.1 Engineering context
MangaAssist used a RAG pipeline for FAQ, policy, and recommendation-style answers. We retrieved chunks from OpenSearch, reranked them, and injected the top results into the prompt.
That meant a retriever change could improve grounding but also introduce production regressions such as:
- longer prompts
- higher token cost
- slower response time
- more blocked responses because of noisier context
4.2 Why offline results were not enough
Suppose the new retriever looked great offline:
- Recall@3 improved on the golden dataset
- human reviewers preferred the grounded answers
That still was not enough for production approval because live traffic is messier:
- queries are shorter and noisier
- user intent is ambiguous
- traffic slices behave differently
- prompt size can explode in long-tail cases
So I used shadow mode before any user-facing rollout.
4.3 What shadow mode looked like
For the same live request:
- production handled the real user response
- the candidate retriever and prompt ran asynchronously in parallel
- both results were stored under the same request ID
- a comparison service measured differences in quality, safety, latency, and cost
This design was important because it let me compare the two systems on the exact same traffic without risking customer experience.
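A highly simplified asyncio sketch of that fan-out. The pipeline and storage functions here are hypothetical stand-ins for the real services, stubbed out so the example runs on its own:

```python
import asyncio
from typing import Optional

# Hypothetical stand-ins for the real production and candidate services.
async def production_pipeline(query: str) -> str:
    return f"production answer to: {query}"

async def candidate_pipeline(query: str) -> str:
    return f"candidate answer to: {query}"

async def store_result(request_id: str, variant: str,
                       answer: Optional[str]) -> None:
    print(request_id, variant, answer)  # real system: write to a store

async def shadow_candidate(request_id: str, query: str) -> None:
    try:
        answer = await candidate_pipeline(query)
    except Exception:
        answer = None  # a candidate failure must never reach the user
    await store_result(request_id, "candidate", answer)

async def handle_request(request_id: str, query: str) -> str:
    # The user waits only for the production path.
    answer = await production_pipeline(query)
    # The candidate runs fire-and-forget, off the critical path.
    asyncio.create_task(shadow_candidate(request_id, query))
    await store_result(request_id, "production", answer)
    return answer

async def main() -> None:
    await handle_request("req-001", "where is my order?")
    await asyncio.sleep(0.1)  # demo only: let the shadow task finish

asyncio.run(main())
```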
4.4 Hypothesis
I framed the change as a multi-metric production hypothesis:
- quality hypothesis: the candidate reduces unsupported or incorrect answers
- safety hypothesis: the candidate does not reduce guardrail pass rate
- operational hypothesis: the candidate does not create unacceptable latency or cost regression
This was one of the most important decisions I made. I refused to let an offline quality gain override a production regression in safety or latency.
4.5 Statistical tests I used
I used different tests for different outputs:
- paired or matched comparisons where the same live request was evaluated by both versions
- two-proportion z-test for guardrail pass rate and unsupported-answer rate
- distribution comparison for latency tails because averages can hide p95 and p99 regressions
The idea was simple: use the same request whenever possible, because paired comparisons are more sensitive and less noisy than comparing unrelated traffic.
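For the paired binary outcomes, one natural choice is McNemar's test, which uses only the requests where the two versions disagreed. A sketch built on an exact binomial test; the disagreement counts are placeholders:

```python
from scipy.stats import binomtest

def mcnemar_exact(baseline_only_fail: int, candidate_only_fail: int) -> float:
    """Exact McNemar test on paired pass/fail outcomes.

    Only discordant pairs carry information: requests where exactly one
    of the two versions failed. Under H0 a discordant pair is equally
    likely to go either way, i.e. a fair coin.
    """
    n = baseline_only_fail + candidate_only_fail
    result = binomtest(candidate_only_fail, n, p=0.5)
    return result.pvalue

# Placeholder counts: the candidate failed alone far more often.
print(mcnemar_exact(baseline_only_fail=120, candidate_only_fail=210))
```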
4.6 Worked example from the chatbot
The candidate retriever improved grounding offline, but shadow mode showed a more complicated picture.
Guardrail pass rate:
- baseline: 49,000 passes out of 50,000 requests = 98.0 percent
- candidate: 48,600 passes out of 50,000 requests = 97.2 percent
Prompt size:
- baseline average tokens = 120
- candidate average tokens = 195
- increase = 62.5 percent
Operational effect:
- latency tail worsened because the candidate was injecting more context
- token cost per session also rose
Interpretation:
- the quality gain was not enough to justify the safety and latency regression
- the regression was both statistically significant and operationally relevant
Decision:
- do not promote this retriever configuration to canary
- reduce retrieved context, tighten metadata filters, and rerun shadow mode
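For the latency tail specifically, averages hide exactly this kind of regression, so one option is to bootstrap the difference in p99 directly. A sketch on synthetic data; the latency distributions are made up for illustration:

```python
import numpy as np

def bootstrap_p99_diff(baseline, candidate, n_boot=2_000, seed=0):
    """Bootstrap a 95% CI for the p99 latency gap (candidate - baseline)."""
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        b = rng.choice(baseline, size=baseline.size, replace=True)
        c = rng.choice(candidate, size=candidate.size, replace=True)
        diffs[i] = np.percentile(c, 99) - np.percentile(b, 99)
    return np.percentile(diffs, [2.5, 97.5])

# Made-up latency samples; the candidate tail is heavier because it
# injects more retrieved context into the prompt.
rng = np.random.default_rng(42)
baseline = rng.gamma(shape=2.0, scale=400.0, size=5_000)   # milliseconds
candidate = rng.gamma(shape=2.0, scale=480.0, size=5_000)  # milliseconds

lo, hi = bootstrap_p99_diff(baseline, candidate)
print(f"95% CI for the p99 regression: [{lo:.0f} ms, {hi:.0f} ms]")
# If the whole interval sits above zero, the tail regression is real.
```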
4.7 Critical decisions I took
Decision 1: I required shadow mode even after offline success
This prevented me from shipping a change that looked good on curated examples but behaved badly under real traffic conditions.
Decision 2: I ran shadow inference asynchronously
I did not put the candidate on the critical path. That protected user latency and reduced operational risk during evaluation.
Decision 3: I evaluated the whole pipeline, not only retrieval quality
The retriever change affected:
- context size
- generation behavior
- guardrail outcomes
- latency
- cost
That is why I treated it as a system change, not just a search-quality change.
Decision 4: I allowed veto metrics
Even if groundedness improved, I did not promote when safety and latency regressed beyond acceptable levels.
Decision 5: I sliced the analysis
I did not trust only the aggregate. I checked whether the regression was concentrated in:
- FAQ
- policy
- recommendation
- mobile traffic
- specific locales
That helped isolate where the candidate was actually failing.
4.8 Panel-style follow-up questions and strong answers
| Who asks | Follow-up question | Strong answer |
|---|---|---|
| ML engineer | Why not go directly from offline evaluation to canary? | Because retrieval changes often create second-order effects in latency, prompt length, and guardrail behavior that are invisible offline. Shadow mode catches those safely. |
| SRE | Why did async shadowing matter so much? | If I put the candidate inline, I would double latency risk and expand blast radius. Async fan-out let me evaluate the candidate without hurting users. |
| Data scientist | Why use paired comparisons here? | Because both systems processed the same request. Pairing reduces variance and gives a cleaner estimate of the real effect. |
| Product manager | If answer quality improved, why not ship anyway? | Because a better answer is not enough if it costs too much, breaches latency targets, or increases blocked responses. Production decisions are multi-objective. |
| Staff engineer | What was the most important judgment call? | Treating safety and latency as veto metrics. That prevented us from optimizing one metric while degrading the overall system. |
5. Scenario 4: Continuous monitoring and drift detection after launch
5.1 Why hypothesis testing still mattered after deployment
Shipping a model is not the end of the problem. The environment keeps changing:
- catalog changes
- promotions change user behavior
- traffic mix changes by season
- new phrasing appears in user queries
That means a model can degrade even when no code changed.
5.2 Example production signals I watched
- weekly intent accuracy against a labeled baseline
- daily thumbs-down rate
- out-of-scope rate
- escalation rate
- embedding and query distribution shifts
5.3 Example drift scenario
Suppose the historical recommendation intent share was 32 percent, and then it fell to 21 percent during a major campaign while promotion-related traffic surged.
At the same time:
- thumbs-down rate rose from 6 percent to 9 percent
- out-of-scope rate increased
- escalation volume climbed
That combination suggests a real distribution shift, not just random metric wobble.
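A sketch of how I would formalize that check, comparing the current window's intent mix against a rolling baseline with a chi-square test. The window counts below are hypothetical, chosen only to match the shares in the scenario:

```python
from scipy.stats import chi2_contingency

# Hypothetical intent counts: rolling baseline window vs. current window.
intents = ["recommendation", "promotion", "faq", "other"]
baseline_counts = [3_200, 1_500, 4_100, 1_200]  # 32% recommendation share
current_counts = [2_100, 3_100, 3_700, 1_100]   # 21% recommendation share

chi2, p_value, dof, _ = chi2_contingency([baseline_counts, current_counts])
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p_value:.2e}")
# A tiny p-value on a sustained shift (not a single spike) is what
# escalates this from "metric wobble" to a confirmed drift event.
```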
5.4 What I used hypothesis testing for
I used it to decide:
- is this a real drift event or normal noise?
- do we need to retrain the classifier?
- do we need to refresh prompts or rules?
- do we need to update the golden dataset with new traffic patterns?
5.5 Critical decisions I took
- I used rolling baselines instead of comparing everything to a stale launch-day reference
- I combined automated statistical alerts with human review for high-impact changes
- I updated evaluation datasets after confirmed drift so we did not keep testing only on yesterday's product behavior
5.6 Panel-style follow-up questions and strong answers
| Who asks | Follow-up question | Strong answer |
|---|---|---|
| SRE | How did you avoid noisy alerts? | I used sustained deviation rules, rolling baselines, and statistical thresholds instead of alerting on every isolated spike. |
| Data scientist | What did you do after drift was confirmed? | I refreshed the dataset, investigated the shifted traffic slices, and retrained or adjusted the relevant model component rather than only acknowledging the alert. |
| Hiring manager | Why is this a hypothesis testing problem and not just monitoring? | Because monitoring tells me what moved, but hypothesis testing helps me decide whether the movement is real enough to justify intervention. |
6. Cross-scenario decisions that matter in a senior interview
If a panel keeps drilling deeper, these are the decisions I would emphasize most clearly.
6.1 I separated safety questions from business questions
I did not use one generic decision rule for everything.
- safety releases focused on not making the system worse
- business experiments focused on proving real lift
That changed the metric choice, test design, and rollout criteria.
6.2 I always paired statistical significance with practical significance
I never treated p < 0.05 as the final answer. I also asked:
- is the effect large enough to matter?
- does it change user experience?
- does it change support cost?
- does it change revenue or margin?
6.3 I treated the chatbot as a system, not just an LLM
Most real regressions in a production chatbot come from the interaction of:
- routing
- retrieval
- prompts
- model behavior
- guardrails
- latency
- frontend behavior
That is why my evaluation logic always covered the full pipeline.
6.4 I used different tests for different metric types
- proportions for binary outcomes like escalation, conversion, and guardrail pass rate
- mean comparisons for AOV, latency, and cost
- distribution tests for drift and long-tail latency behavior
I chose the test based on the data, not the other way around.
6.5 I made rollout decisions automatable
Hypothesis testing was useful because it was connected to operational action:
- block the release
- promote to the next canary stage
- hold for more data
- roll back automatically
- trigger retraining or dataset refresh
That is what made it valuable in MLOps.
7. Best final answer if the panel asks "Where exactly did you use hypothesis testing?"
I used hypothesis testing as the decision layer for the chatbot lifecycle.
- In canary rollout, I used it to decide whether a new model or prompt was safe enough to promote, with escalation rate as the primary safety metric.
- In A/B testing, I used it to prove whether the chatbot actually improved conversion and revenue-related outcomes.
- In shadow mode, I used it to compare candidate retrievers and prompts against production on the same live traffic before exposing users.
- In continuous monitoring, I used it to tell the difference between real drift and normal metric noise so we could retrain, refresh prompts, or roll back with evidence.
The senior-level part was not just knowing the formulas. It was choosing the right metric, the right test, the right rollout stage, and the right operational action for each kind of decision.