Deep Dive Scenarios: How I Used Hypothesis Testing in MangaAssist
1. How I would summarize this in an interview
If an interviewer asks where I used hypothesis testing in this chatbot, I would not answer with theory first. I would answer with release decisions.
I used hypothesis testing in four major places:
- during canary rollout to decide whether a new model or prompt was safe enough to promote
- during A/B testing to prove whether the chatbot improved business metrics like conversion
- during shadow mode to compare a candidate retriever or prompt against production on the same live traffic
- during continuous monitoring to detect drift, regression, and hidden quality incidents after launch
The important point is that I did not use hypothesis testing as a notebook exercise. I used it as a production decision system.
2. Scenario 1: Canary rollout decision for a new generation model
2.1 Business and engineering context
MangaAssist used a hybrid architecture, but the recommendation, FAQ, and complex conversation paths still depended on an LLM. That meant a model or prompt change could improve helpfulness while also making the system slower, more expensive, or less safe.
After a candidate model passed offline evaluation and shadow mode, I exposed it to a small amount of real traffic through a staged canary rollout:
- 1 percent
- 10 percent
- 50 percent
- 100 percent
The key question was not "does the output look better?" The key question was "is it safe to send more users to this version?"
2.2 What problem I was solving
The biggest production risk was not a total outage. It was a silent degradation:
- more users asking for a human
- more failed resolutions
- more unsafe or blocked responses
- slower tail latency
Those problems do not always show up clearly in offline testing, so I needed a formal live decision rule.
2.3 Hypothesis
For the canary safety gate, I used escalation rate as the primary metric.
- Null hypothesis (H0): the candidate escalation rate is the same as the baseline escalation rate
- Alternative hypothesis (H1): the candidate escalation rate is higher than the baseline escalation rate
I treated this as a safety-oriented test. I only cared about detecting whether the new version was worse.
2.4 Why escalation rate was the primary metric
I chose escalation rate because it captured multiple failure modes at once:
- low answer quality
- incorrect routing
- unresolved issues
- user frustration
- higher downstream support cost
If escalation went up, it usually meant the new version was harming both user experience and operations.
2.5 Experimental design
The traffic split compared the canary population against the live baseline population in the same time window.
Primary metric:
- escalation rate
Secondary guardrail metrics:
- thumbs-down rate
- error rate
- P99 latency
- guardrail block rate
Decision rules:
- auto-rollback if a hard threshold was breached
- otherwise wait until minimum sample size was reached
- then run a formal significance test before promotion
This mattered because a small canary can show noisy percentages early on. I did not want to promote or roll back based on noise.
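To make the decision rule concrete, here is a minimal sketch of the gate logic. All thresholds and field names are illustrative placeholders, not the real production values:

```python
from dataclasses import dataclass

@dataclass
class CanaryMetrics:
    sessions: int
    error_rate: float           # fraction of failed requests
    p99_latency_ms: float
    guardrail_block_rate: float
    escalation_p_value: float   # from the significance test in 2.6

# Hypothetical gate values, for illustration only.
HARD_ERROR_RATE = 0.05
HARD_P99_MS = 3_000
HARD_BLOCK_RATE = 0.03
MIN_SESSIONS = 5_000
ALPHA = 0.05

def canary_decision(m: CanaryMetrics) -> str:
    # Hard thresholds trigger rollback immediately, no p-value needed.
    if (m.error_rate > HARD_ERROR_RATE
            or m.p99_latency_ms > HARD_P99_MS
            or m.guardrail_block_rate > HARD_BLOCK_RATE):
        return "rollback"
    # Below minimum sample size, percentages are too noisy to act on.
    if m.sessions < MIN_SESSIONS:
        return "hold"
    # Borderline cases fall through to the formal significance test.
    return "rollback" if m.escalation_p_value < ALPHA else "promote"

m = CanaryMetrics(sessions=6_000, error_rate=0.01, p99_latency_ms=1_800,
                  guardrail_block_rate=0.005, escalation_p_value=0.002)
print(canary_decision(m))  # "rollback": escalation regression is significant
```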
2.6 Statistical test I used
I used a two-proportion z-test because escalation rate is a binary outcome:
- escalation happened
- escalation did not happen
This test was the right fit because I was comparing the rate of a binary event between two groups.
2.7 Worked example from the chatbot
Baseline:
- total sessions = 495,000
- escalations = 59,400
- escalation rate = 12 percent
Canary:
- total sessions = 5,000
- escalations = 700
- escalation rate = 14 percent
Observed delta:
- absolute increase = 2 percentage points
- relative increase = about 16.7 percent
The z-score from the two-proportion test was about 4.33, far beyond the one-sided critical value of roughly 1.645 at the 95 percent confidence level.
Interpretation:
- this was very unlikely to be random noise
- the canary was meaningfully worse than the baseline
Decision:
- roll back the candidate
- do not move from 1 percent to 10 percent
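A minimal implementation of that test, which reproduces the numbers in the worked example:

```python
from math import sqrt

from scipy.stats import norm

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """One-sided two-proportion z-test: is rate B higher than rate A?"""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = norm.sf(z)  # upper-tail probability for the one-sided test
    return z, p_value

# Baseline vs. canary escalation counts from the worked example.
z, p = two_proportion_z(59_400, 495_000, 700, 5_000)
print(f"z = {z:.2f}, p = {p:.2e}")  # z ≈ 4.33, p ≈ 7.6e-06
```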
2.8 Critical decisions I took
Decision 1: I used a safety-first primary metric
I did not make "answer quality" the primary live decision metric because quality is harder to measure in real time. Escalation rate was faster, more reliable, and directly tied to business impact.
Decision 2: I separated hard thresholds from statistical significance
If error rate or guardrail block rate spiked sharply, I would not wait for a perfect p-value. I used hard rollback thresholds for severe incidents and hypothesis testing for borderline cases.
Decision 3: I used staged rollout instead of a big-bang release
This reduced blast radius. If the model was bad, only a small percentage of users were exposed before rollback.
Decision 4: I looked at practical significance, not just statistical significance
At scale, tiny effects can become statistically significant. I cared whether the increase was large enough to change support load and user experience, not only whether p < 0.05.
Decision 5: I treated canary as a system test, not a model-only test
I was evaluating the full path:
- routing
- retrieval
- generation
- guardrails
- frontend behavior
That made the canary decision much more reliable than judging the model in isolation.
2.9 Panel-style follow-up questions and strong answers
| Who asks | Follow-up question | Strong answer |
|---|---|---|
| Hiring manager | Why not just compare raw percentages and move on? | Because early canary traffic is noisy. Raw deltas can overreact to randomness. I needed a repeatable decision rule for promotion and rollback. |
| Data scientist | Why did you use a z-test instead of a t-test? | Escalation is a binary event, so I was comparing two proportions, not two continuous means. That makes the two-proportion z-test the natural fit. |
| Product manager | Why did you prioritize escalation rate over CSAT? | CSAT is slower, sparser, and more biased because only a subset of users respond. Escalation is immediate, high-volume, and operationally meaningful. |
| SRE | What if the sample is too small but the regression looks severe? | I did not rely only on significance. I also defined hard thresholds for error rate, latency, and safety metrics so that severe regressions triggered rollback immediately. |
| ML engineer | How did you prevent peeking? | I predefined rollout gates, minimum sample sizes, and decision metrics before reading the results. That prevented me from stopping the test as soon as I saw a favorable number. |
| Senior engineer | What was the most important design choice here? | Treating canary as full-system evaluation instead of model evaluation. Many regressions come from routing, retrieval, prompt length, or guardrails, not only from the model itself. |
3. Scenario 2: A/B test to prove business impact of the chatbot
3.1 Business context
Once the chatbot was technically stable, the next question was not about safety. It was about value.
The business wanted to know whether showing the chatbot to users actually improved outcomes such as:
- conversion rate
- add-to-cart rate
- revenue per chat session
- average order value
This is where hypothesis testing moved from release safety to product impact.
3.2 What problem I was solving
Product teams often see a higher conversion rate and assume the chatbot caused it. That is dangerous because conversion is influenced by:
- promotions
- seasonality
- traffic source mix
- page placement
- user segment differences
I needed a design that could isolate the impact of the chatbot itself.
3.3 Experiment setup
I set up an A/B test with:
- control: users who did not see the chatbot
- treatment: users who did see the chatbot
- primary metric: purchase within 24 hours
I randomized at the user level so the same user would not bounce between control and treatment across repeated sessions. That reduced contamination.
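A common way to implement sticky user-level assignment is to hash the user ID deterministically. A sketch, where the salt string and the 50/50 split are hypothetical choices:

```python
import hashlib

def assign_variant(user_id: str, treatment_share: float = 0.5,
                   salt: str = "chatbot-ab-experiment") -> str:
    """Deterministically map a user to control or treatment.

    The same user always lands in the same bucket across sessions,
    which prevents contamination between variants.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return "treatment" if bucket < treatment_share else "control"

print(assign_variant("user-12345"))  # stable across repeated calls
```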
3.4 Hypothesis
Primary business hypothesis:
- Null hypothesis (H0): treatment conversion rate is equal to control conversion rate
- Alternative hypothesis (H1): treatment conversion rate is higher than control conversion rate
I also tracked secondary metrics:
- AOV
- CSAT
- escalation rate
- latency
This was important because a chatbot that increases conversion but also explodes support cost or hurts user trust is not a real win.
3.5 Sample size and power
Before launch, I defined:
- acceptable false positive rate
- target power
- minimum detectable effect
That prevented a common mistake in experimentation: running the test with too little traffic and declaring "no impact" when there simply was not enough data.
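Here is a sketch of the standard sample-size calculation for a two-proportion test, using the closed-form normal approximation. The baseline rate and minimum detectable effect below are illustrative inputs, not the real experiment parameters:

```python
from math import ceil, sqrt

from scipy.stats import norm

def sample_size_per_group(p_base, mde, alpha=0.05, power=0.8):
    """Users needed per group to detect an absolute lift of `mde`
    over baseline rate `p_base` with a one-sided two-proportion test."""
    p_treat = p_base + mde
    p_bar = (p_base + p_treat) / 2
    z_alpha = norm.ppf(1 - alpha)  # one-sided test
    z_beta = norm.ppf(power)
    num = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
           + z_beta * sqrt(p_base * (1 - p_base) + p_treat * (1 - p_treat)))
    return ceil((num / mde) ** 2)

# Illustrative: baseline conversion 8.4%, detect a +1 point absolute lift.
print(sample_size_per_group(0.084, 0.01))  # ≈ 10,025 users per group
```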
3.6 Statistical tests I used
I used different tests for different metric types:
- two-proportion z-test for conversion rate
- t-test for AOV because it is a continuous metric
- multiple testing correction when several secondary metrics were reviewed together
This mattered because not every metric should be tested with the same method.
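For the multiple-testing correction, a minimal Holm-Bonferroni sketch; the p-values at the bottom are placeholders, not real experiment results:

```python
def holm_correction(p_values, alpha=0.05):
    """Holm-Bonferroni: step-down control of the family-wise error rate.

    Returns a list of booleans, True where the null is rejected.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # Compare the k-th smallest p-value against alpha / (m - k).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

# Placeholder p-values for AOV, CSAT, escalation, latency.
print(holm_correction([0.003, 0.04, 0.02, 0.30]))
# [True, False, False, False]: only the smallest survives correction
```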
3.7 Worked example from the chatbot
Control group:
- users = 50,000
- purchases = 4,200
- conversion rate = 8.4 percent
Treatment group:
- users = 50,000
- purchases = 4,700
- conversion rate = 9.4 percent
Observed delta:
- absolute lift = 1 percentage point
- relative lift = about 11.9 percent
The two-proportion z-score works out to about 5.6, far beyond any conventional significance threshold, so the improvement is very unlikely to be explained by random variation alone.
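A sketch that reproduces that z-score and adds a 95 percent confidence interval for the absolute lift, which is what I used to judge practical significance:

```python
from math import sqrt

from scipy.stats import norm

def lift_with_ci(purch_c, n_c, purch_t, n_t, confidence=0.95):
    """Absolute conversion lift (treatment minus control), z-score,
    and a Wald confidence interval for the lift."""
    p_c, p_t = purch_c / n_c, purch_t / n_t
    lift = p_t - p_c
    p_pool = (purch_c + purch_t) / (n_c + n_t)
    se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = lift / se_pooled
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z_crit = norm.ppf(1 - (1 - confidence) / 2)
    return z, lift, (lift - z_crit * se, lift + z_crit * se)

z, lift, (lo, hi) = lift_with_ci(4_200, 50_000, 4_700, 50_000)
print(f"z = {z:.1f}, lift = {lift:.4f}, 95% CI = [{lo:.4f}, {hi:.4f}]")
# z ≈ 5.6, lift = 0.0100, CI ≈ [0.0065, 0.0135] percentage points
```

The confidence interval matters as much as the p-value here: even the low end of the interval is a lift worth shipping, which is what made the result practically significant and not just statistically significant.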
3.8 The decision I took
I did not stop at "conversion went up."
I checked whether:
- escalation stayed within acceptable bounds
- latency remained within SLA
- AOV was flat or positive
- the lift held across key slices such as device type and major intent families
After that, I was comfortable recommending wider rollout because the chatbot showed measurable business lift without an obvious operational regression.
3.9 Critical decisions I took
Decision 1: I pre-registered one primary metric
The primary success metric was conversion within 24 hours. This avoided cherry-picking whichever metric looked good after the fact.
Decision 2: I separated business metrics from guardrail metrics
Conversion decided whether the experiment created value. Guardrail metrics decided whether the value was safe to keep.
Decision 3: I used practical significance, not only p-values
Even if a metric is statistically significant, it may not be worth shipping if the effect is too small to matter operationally.
Decision 4: I corrected for multiple comparisons on secondary metrics
If I looked at conversion, AOV, CSAT, escalation, and latency together, I had to control the false positive risk. Otherwise, one of them could appear significant by chance.
Decision 5: I randomized at the user level
That was important because repeated exposure from the same user can bias results if the same person sees both experiences.
3.10 Panel-style follow-up questions and strong answers
| Who asks | Follow-up question | Strong answer |
|---|---|---|
| Product manager | Why was conversion the primary metric instead of engagement? | Engagement is only an activity signal. Conversion ties directly to revenue and is closer to the business objective. |
| Data scientist | How did you avoid claiming victory from one good-looking metric? | I pre-registered conversion as primary, treated other metrics as secondary, and corrected for multiple comparisons when interpreting them. |
| Senior engineer | Why not just compare chat users versus non-chat users without an experiment? | That would be heavily confounded because the users who choose to open chat are already different from the users who do not. Randomization was necessary for causal inference. |
| Finance stakeholder | What if conversion improves but cost also rises? | Then I would compare net value, not only topline lift. A statistically significant lift is not enough if margin or support cost gets worse. |
| ML engineer | How did you handle AOV, which is not binary like conversion? | I treated AOV as a continuous metric and used a mean comparison test instead of a proportion test. |
| Hiring manager | What senior-level decision did you make here? | I made sure the experiment answered a business question and not just a modeling question. That meant aligning metric choice, sample design, and rollout criteria with the business objective. |
4. Scenario 3: Shadow mode comparison for a retriever and prompt change
4.1 Engineering context
MangaAssist used a RAG pipeline for FAQ, policy, and recommendation-style answers. We retrieved chunks from OpenSearch, reranked them, and injected the top results into the prompt.
That meant a retriever change could improve grounding but also introduce production regressions such as:
- longer prompts
- higher token cost
- slower response time
- more blocked responses because of noisier context
4.2 Why offline results were not enough
Suppose the new retriever looked great offline:
- Recall@3 improved on the golden dataset
- human reviewers preferred the grounded answers
That still was not enough for production approval because live traffic is messier:
- queries are shorter and noisier
- user intent is ambiguous
- traffic slices behave differently
- prompt size can explode in long-tail cases
So I used shadow mode before any user-facing rollout.
4.3 What shadow mode looked like
For the same live request:
- production handled the real user response
- the candidate retriever and prompt ran asynchronously in parallel
- both results were stored under the same request ID
- a comparison service measured differences in quality, safety, latency, and cost
This design was important because it let me compare the two systems on the exact same traffic without risking customer experience.
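A highly simplified asyncio sketch of that fan-out. The pipeline and storage functions here are hypothetical stand-ins for the real services, stubbed out so the example runs on its own:

```python
import asyncio
from typing import Optional

# Hypothetical stand-ins for the real production and candidate services.
async def production_pipeline(query: str) -> str:
    return f"production answer to: {query}"

async def candidate_pipeline(query: str) -> str:
    return f"candidate answer to: {query}"

async def store_result(request_id: str, variant: str,
                       answer: Optional[str]) -> None:
    print(request_id, variant, answer)  # real system: write to a store

async def shadow_candidate(request_id: str, query: str) -> None:
    try:
        answer = await candidate_pipeline(query)
    except Exception:
        answer = None  # a candidate failure must never reach the user
    await store_result(request_id, "candidate", answer)

async def handle_request(request_id: str, query: str) -> str:
    # The user waits only for the production path.
    answer = await production_pipeline(query)
    # The candidate runs fire-and-forget, off the critical path.
    asyncio.create_task(shadow_candidate(request_id, query))
    await store_result(request_id, "production", answer)
    return answer

async def main() -> None:
    await handle_request("req-001", "where is my order?")
    await asyncio.sleep(0.1)  # demo only: let the shadow task finish

asyncio.run(main())
```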
4.4 Hypothesis
I framed the change as a multi-metric production hypothesis:
- quality hypothesis: the candidate reduces unsupported or incorrect answers
- safety hypothesis: the candidate does not reduce guardrail pass rate
- operational hypothesis: the candidate does not create unacceptable latency or cost regression
This was one of the most important decisions I made. I refused to let an offline quality gain override a production regression in safety or latency.
4.5 Statistical tests I used
I used different tests for different outputs:
- paired or matched comparisons where the same live request was evaluated by both versions
- two-proportion z-test for guardrail pass rate and unsupported-answer rate
- distribution comparison for latency tails because averages can hide p95 and p99 regressions
The idea was simple: use the same request whenever possible, because paired comparisons are more sensitive and less noisy than comparing unrelated traffic.
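For the paired binary outcomes, one natural choice is McNemar's test, which uses only the requests where the two versions disagreed. A sketch built on an exact binomial test; the disagreement counts are placeholders:

```python
from scipy.stats import binomtest

def mcnemar_exact(baseline_only_fail: int, candidate_only_fail: int) -> float:
    """Exact McNemar test on paired pass/fail outcomes.

    Only discordant pairs carry information: requests where exactly one
    of the two versions failed. Under H0 a discordant pair is equally
    likely to go either way, i.e. a fair coin.
    """
    n = baseline_only_fail + candidate_only_fail
    result = binomtest(candidate_only_fail, n, p=0.5)
    return result.pvalue

# Placeholder counts: the candidate failed alone far more often.
print(mcnemar_exact(baseline_only_fail=120, candidate_only_fail=210))
```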
4.6 Worked example from the chatbot
The candidate retriever improved grounding offline, but shadow mode showed a more complicated picture.
Guardrail pass rate:
- baseline: 49,000 passes out of 50,000 requests = 98.0 percent
- candidate: 48,600 passes out of 50,000 requests = 97.2 percent
Prompt size:
- baseline average tokens = 120
- candidate average tokens = 195
- increase = 62.5 percent
Operational effect:
- latency tail worsened because the candidate was injecting more context
- token cost per session also rose
Interpretation:
- the quality gain was not enough to justify the safety and latency regression
- the regression was both statistically significant and operationally relevant
Decision:
- do not promote this retriever configuration to canary
- reduce retrieved context, tighten metadata filters, and rerun shadow mode
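For the latency tail specifically, averages hide exactly this kind of regression, so one option is to bootstrap the difference in p99 directly. A sketch on synthetic data; the latency distributions are made up for illustration:

```python
import numpy as np

def bootstrap_p99_diff(baseline, candidate, n_boot=2_000, seed=0):
    """Bootstrap a 95% CI for the p99 latency gap (candidate - baseline)."""
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        b = rng.choice(baseline, size=baseline.size, replace=True)
        c = rng.choice(candidate, size=candidate.size, replace=True)
        diffs[i] = np.percentile(c, 99) - np.percentile(b, 99)
    return np.percentile(diffs, [2.5, 97.5])

# Made-up latency samples; the candidate tail is heavier because it
# injects more retrieved context into the prompt.
rng = np.random.default_rng(42)
baseline = rng.gamma(shape=2.0, scale=400.0, size=5_000)   # milliseconds
candidate = rng.gamma(shape=2.0, scale=480.0, size=5_000)  # milliseconds

lo, hi = bootstrap_p99_diff(baseline, candidate)
print(f"95% CI for the p99 regression: [{lo:.0f} ms, {hi:.0f} ms]")
# If the whole interval sits above zero, the tail regression is real.
```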
4.7 Critical decisions I took
Decision 1: I required shadow mode even after offline success
This prevented me from shipping a change that looked good on curated examples but behaved badly under real traffic conditions.
Decision 2: I ran shadow inference asynchronously
I did not put the candidate on the critical path. That protected user latency and reduced operational risk during evaluation.
Decision 3: I evaluated the whole pipeline, not only retrieval quality
The retriever change affected:
- context size
- generation behavior
- guardrail outcomes
- latency
- cost
That is why I treated it as a system change, not just a search-quality change.
Decision 4: I allowed veto metrics
Even if groundedness improved, I did not promote when safety and latency regressed beyond acceptable levels.
Decision 5: I sliced the analysis
I did not trust only the aggregate. I checked whether the regression was concentrated in:
- FAQ
- policy
- recommendation
- mobile traffic
- specific locales
That helped isolate where the candidate was actually failing.
4.8 Panel-style follow-up questions and strong answers
| Who asks | Follow-up question | Strong answer |
|---|---|---|
| ML engineer | Why not go directly from offline evaluation to canary? | Because retrieval changes often create second-order effects in latency, prompt length, and guardrail behavior that are invisible offline. Shadow mode catches those safely. |
| SRE | Why did async shadowing matter so much? | If I put the candidate inline, I would double latency risk and expand blast radius. Async fan-out let me evaluate the candidate without hurting users. |
| Data scientist | Why use paired comparisons here? | Because both systems processed the same request. Pairing reduces variance and gives a cleaner estimate of the real effect. |
| Product manager | If answer quality improved, why not ship anyway? | Because a better answer is not enough if it costs too much, breaches latency targets, or increases blocked responses. Production decisions are multi-objective. |
| Staff engineer | What was the most important judgment call? | Treating safety and latency as veto metrics. That prevented us from optimizing one metric while degrading the overall system. |
5. Scenario 4: Continuous monitoring and drift detection after launch
5.1 Why hypothesis testing still mattered after deployment
Shipping a model is not the end of the problem. The environment keeps changing:
- catalog changes
- promotions change user behavior
- traffic mix changes by season
- new phrasing appears in user queries
That means a model can degrade even when no code changed.
5.2 Example production signals I watched
- weekly intent accuracy against a labeled baseline
- daily thumbs-down rate
- out-of-scope rate
- escalation rate
- embedding and query distribution shifts
5.3 Example drift scenario
Suppose the historical recommendation intent share was 32 percent, and then it fell to 21 percent during a major campaign while promotion-related traffic surged.
At the same time:
- thumbs-down rate rose from 6 percent to 9 percent
- out-of-scope rate increased
- escalation volume climbed
That combination suggests a real distribution shift, not just random metric wobble.
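A sketch of how I would formalize that check, comparing the current window's intent mix against a rolling baseline with a chi-square test. The window counts below are hypothetical, chosen only to match the shares in the scenario:

```python
from scipy.stats import chi2_contingency

# Hypothetical intent counts: rolling baseline window vs. current window.
intents = ["recommendation", "promotion", "faq", "other"]
baseline_counts = [3_200, 1_500, 4_100, 1_200]  # 32% recommendation share
current_counts = [2_100, 3_100, 3_700, 1_100]   # 21% recommendation share

chi2, p_value, dof, _ = chi2_contingency([baseline_counts, current_counts])
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p_value:.2e}")
# A tiny p-value on a sustained shift (not a single spike) is what
# escalates this from "metric wobble" to a confirmed drift event.
```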
5.4 What I used hypothesis testing for
I used it to decide:
- is this a real drift event or normal noise?
- do we need to retrain the classifier?
- do we need to refresh prompts or rules?
- do we need to update the golden dataset with new traffic patterns?
5.5 Critical decisions I took
- I used rolling baselines instead of comparing everything to a stale launch-day reference
- I combined automated statistical alerts with human review for high-impact changes
- I updated evaluation datasets after confirmed drift so we did not keep testing only on yesterday's product behavior
5.6 Panel-style follow-up questions and strong answers
| Who asks | Follow-up question | Strong answer |
|---|---|---|
| SRE | How did you avoid noisy alerts? | I used sustained deviation rules, rolling baselines, and statistical thresholds instead of alerting on every isolated spike. |
| Data scientist | What did you do after drift was confirmed? | I refreshed the dataset, investigated the shifted traffic slices, and retrained or adjusted the relevant model component rather than only acknowledging the alert. |
| Hiring manager | Why is this a hypothesis testing problem and not just monitoring? | Because monitoring tells me what moved, but hypothesis testing helps me decide whether the movement is real enough to justify intervention. |
6. Cross-scenario decisions that matter in a senior interview
If a panel keeps drilling deeper, these are the decisions I would emphasize most clearly.
6.1 I separated safety questions from business questions
I did not use one generic decision rule for everything.
- safety releases focused on not making the system worse
- business experiments focused on proving real lift
That changed the metric choice, test design, and rollout criteria.
6.2 I always paired statistical significance with practical significance
I never treated p < 0.05 as the final answer. I also asked:
- is the effect large enough to matter?
- does it change user experience?
- does it change support cost?
- does it change revenue or margin?
6.3 I treated the chatbot as a system, not just an LLM
Most real regressions in a production chatbot come from the interaction of:
- routing
- retrieval
- prompts
- model behavior
- guardrails
- latency
- frontend behavior
That is why my evaluation logic always covered the full pipeline.
6.4 I used different tests for different metric types
- proportions for binary outcomes like escalation, conversion, and guardrail pass rate
- mean comparisons for AOV, latency, and cost
- distribution tests for drift and long-tail latency behavior
I chose the test based on the data, not the other way around.
6.5 I made rollout decisions automatable
Hypothesis testing was useful because it was connected to operational action:
- block the release
- promote to the next canary stage
- hold for more data
- roll back automatically
- trigger retraining or dataset refresh
That is what made it valuable in MLOps.
7. Best final answer if the panel asks "Where exactly did you use hypothesis testing?"
I used hypothesis testing as the decision layer for the chatbot lifecycle.
- In canary rollout, I used it to decide whether a new model or prompt was safe enough to promote, with escalation rate as the primary safety metric.
- In A/B testing, I used it to prove whether the chatbot actually improved conversion and revenue-related outcomes.
- In shadow mode, I used it to compare candidate retrievers and prompts against production on the same live traffic before exposing users.
- In continuous monitoring, I used it to tell the difference between real drift and normal metric noise so we could retrain, refresh prompts, or roll back with evidence.
The senior-level part was not just knowing the formulas. It was choosing the right metric, the right test, the right rollout stage, and the right operational action for each kind of decision.