
Hypothesis Testing in MangaAssist

1. Why Hypothesis Testing Mattered

Every production decision in MangaAssist — promoting a new model, changing a prompt, updating retriever configuration, or rolling out a guardrail change — required evidence that the change was safe and beneficial. Hypothesis testing provided the formal framework for making those decisions with controlled risk.

Without it, the team would have been making rollout decisions based on gut feeling, or on raw metric deltas that could easily have been noise.


2. Foundational Concepts

2.1 Null and Alternative Hypotheses

Every statistical test starts with two competing claims:

  • Null hypothesis ($H_0$): There is no meaningful difference between the candidate and the baseline. Any observed difference is due to random variation.
  • Alternative hypothesis ($H_1$): There is a real difference between the candidate and the baseline.

MangaAssist examples:

| Decision | $H_0$ | $H_1$ |
|---|---|---|
| Canary rollout: escalation rate | Candidate escalation rate = baseline escalation rate | Candidate escalation rate ≠ baseline escalation rate |
| A/B test: conversion lift | Chatbot users convert at the same rate as non-chatbot users | Chatbot users convert at a higher rate |
| Model upgrade: latency | New model P99 latency = old model P99 latency | New model P99 latency is different |
| Prompt change: thumbs-down rate | New prompt thumbs-down rate = old prompt thumbs-down rate | New prompt thumbs-down rate is different |

2.2 One-Tailed vs. Two-Tailed Tests

| Test Type | When Used in MangaAssist | Example |
|---|---|---|
| Two-tailed | We care about change in either direction | Latency comparison — could get better or worse |
| One-tailed (upper) | We specifically fear an increase | Escalation rate during canary — we only care if it went up |
| One-tailed (lower) | We want to confirm improvement | Conversion rate A/B test — we want to confirm it went up |

In practice, canary safety checks used one-tailed tests (is the candidate worse?) while A/B tests for business metrics used two-tailed tests (is there a real difference?).
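
These choices map directly onto the alternative= argument of the test functions used throughout this document. A minimal sketch with made-up counts (candidate listed first, baseline second):

from statsmodels.stats.proportion import proportions_ztest

# Illustrative counts: [candidate, baseline] successes and trials
count = [140, 120]    # escalations observed
nobs = [1000, 1000]   # requests served

# Two-tailed: is there a real difference in either direction?
_, p_two = proportions_ztest(count, nobs, alternative='two-sided')

# One-tailed (upper): did the candidate's rate go up?
_, p_up = proportions_ztest(count, nobs, alternative='larger')

# One-tailed (lower): did the candidate's rate go down?
_, p_down = proportions_ztest(count, nobs, alternative='smaller')

print(f"two-sided: {p_two:.3f}, upper: {p_up:.3f}, lower: {p_down:.3f}")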

2.3 Significance Level ($\alpha$)

The significance level is the probability of rejecting $H_0$ when it is actually true (Type I error — false alarm).

| Context | $\alpha$ Used | Rationale |
|---|---|---|
| Canary safety checks | 0.05 | Standard threshold — 5% false alarm risk acceptable |
| Business metric tests | 0.05 | Industry standard for A/B testing |
| Multi-metric tests (with correction) | 0.05 / number of tests | Bonferroni correction to control family-wise error rate |

2.4 P-Value

The p-value is the probability of observing a result at least as extreme as the one observed, assuming $H_0$ is true.

  • p < $\alpha$: Reject $H_0$ — the difference is statistically significant.
  • p ≥ $\alpha$: Fail to reject $H_0$ — insufficient evidence to declare a real difference.

Critical interpretation rule: A p-value does not tell you the size of the effect or whether it matters to the business. It only tells you whether the observed difference is unlikely to be noise.
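
A minimal sketch of the decision rule, converting an illustrative z-statistic into a p-value with scipy:

from scipy.stats import norm

alpha = 0.05
z = 2.10  # illustrative z-statistic from some test

p_two_tailed = 2 * (1 - norm.cdf(abs(z)))  # extreme in either direction
p_one_tailed = 1 - norm.cdf(z)             # extreme in the upper tail only

if p_two_tailed < alpha:
    print("Reject H0: the difference is unlikely to be noise")
else:
    print("Fail to reject H0: insufficient evidence of a real difference")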

2.5 Type I and Type II Errors

| Error Type | What Happens | MangaAssist Consequence |
|---|---|---|
| Type I (False Positive) | We reject $H_0$ when it is true | We roll back a good model change unnecessarily, wasting engineering time |
| Type II (False Negative) | We fail to reject $H_0$ when $H_1$ is true | We promote a degraded model to 100% traffic, harming users |

Trade-off in MangaAssist:

For canary safety (is the candidate worse?):

  • Type II errors are more dangerous — missing a real regression means shipping a bad model.
  • So we set power ≥ 0.80 (80% chance of detecting a real regression if one exists).

For A/B test business lift:

  • Type I errors are more costly — false claims of improvement lead to wrong strategy.
  • So we stick with $\alpha = 0.05$ and require practical significance alongside statistical significance.

2.6 Statistical Power

$$\text{Power} = 1 - \beta = P(\text{Reject } H_0 \mid H_1 \text{ is true})$$

Power depends on:

  • Sample size ($n$): More data → more power.
  • Effect size: Larger real differences are easier to detect.
  • Significance level ($\alpha$): Higher $\alpha$ → more power but more false alarms.
  • Variance: Lower variance → more power.

MangaAssist application:

Before launching an A/B test, the team computed minimum sample size:

from statsmodels.stats.power import NormalIndPower

power_analysis = NormalIndPower()
sample_size = power_analysis.solve_power(
    effect_size=0.05,   # standardized effect size (Cohen's h), not a raw 5% lift
    alpha=0.05,
    power=0.80,
    alternative='two-sided'
)
# Result: ~6,280 per group

At 500K messages/day, reaching roughly 6,300 users per group was trivial. But rare events needed more data: detecting a 1 percentage point absolute change in a metric with a 5% base rate required on the order of 8,000 users per group, and smaller shifts required far more.
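
As a rough check of that claim, the same power machinery extends to rare events. The sketch below is illustrative (the 5% → 6% rates are assumptions, not project numbers): it converts raw proportions into Cohen's h via statsmodels' proportion_effectsize before solving for the per-group sample size.

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Convert raw rates into a standardized effect size (Cohen's h)
h = proportion_effectsize(0.06, 0.05)  # detect 5% -> 6% (1 pp absolute change)

n_per_group = NormalIndPower().solve_power(
    effect_size=h,
    alpha=0.05,
    power=0.80,
    alternative='two-sided'
)
print(f"Cohen's h = {h:.4f}")               # ~0.044
print(f"n per group = {n_per_group:,.0f}")  # ~8,100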


3. Hypothesis Testing in the Model Evaluation Framework

3.1 Layer 1 — Golden Dataset Evaluation

At this layer, hypothesis testing is implicit. The evaluation compares candidate metrics against fixed thresholds rather than running formal statistical tests.

| Metric | Threshold (Gate) | How It Works |
|---|---|---|
| Intent accuracy | ≥ 90% | If candidate accuracy < 90%, block the PR |
| Per-class F1 | ≥ 0.85 for each class | If any class F1 < 0.85, flag for review |
| BERTScore F1 | ≥ 0.80 | If BERTScore < 0.80, block promotion |
| Guardrail pass rate | ≥ 95% | If guardrail pass rate < 95%, block |
| Format compliance | ≥ 95% | If format compliance < 95%, block |

These are fixed-threshold gates — simpler than a full hypothesis test but effective for regression detection on curated datasets.
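
In code, such a gate reduces to a table of thresholds and a comparison loop. The following is a hypothetical sketch (THRESHOLDS and check_gates are invented names; the values mirror the table above):

# Hypothetical threshold gate mirroring the table above
THRESHOLDS = {
    "intent_accuracy": 0.90,
    "min_per_class_f1": 0.85,
    "bertscore_f1": 0.80,
    "guardrail_pass_rate": 0.95,
    "format_compliance": 0.95,
}

def check_gates(metrics: dict[str, float]) -> list[str]:
    """Return the metrics that fail their fixed thresholds."""
    return [name for name, floor in THRESHOLDS.items()
            if metrics.get(name, 0.0) < floor]

failures = check_gates({
    "intent_accuracy": 0.92,
    "min_per_class_f1": 0.83,   # fails: below 0.85
    "bertscore_f1": 0.81,
    "guardrail_pass_rate": 0.97,
    "format_compliance": 0.96,
})
if failures:
    print(f"Blocked: {failures}")  # Blocked: ['min_per_class_f1']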

3.2 Layer 2 — Shadow Mode

Shadow mode compares two distributions (baseline vs. candidate) on real production traffic. This is where formal hypothesis testing begins.

Example: Guardrail pass rate comparison

from statsmodels.stats.proportion import proportions_ztest

# Baseline: 49,000 passes out of 50,000 requests → 98.0%
# Candidate: 48,600 passes out of 50,000 requests → 97.2%

baseline_pass = 49000
baseline_total = 50000
candidate_pass = 48600
candidate_total = 50000

# Two-proportion z-test (one-tailed: is the candidate's pass rate lower?)
count = [candidate_pass, baseline_pass]
nobs = [candidate_total, baseline_total]
z_stat, p_value = proportions_ztest(count, nobs, alternative='smaller')

print(f"z-statistic: {z_stat:.4f}")
print(f"p-value: {p_value:.6f}")
# If p_value < 0.05, candidate guardrail pass rate is significantly lower

3.3 Layer 3 — Canary Deployment

Canary is where hypothesis testing is most critical. The decision to promote from 1% → 10% → 50% → 100% depends on statistical tests at each gate.

The two-proportion z-test for canary escalation rate:

Let:

  • $p_1 = x_1 / n_1$ = baseline escalation rate
  • $p_2 = x_2 / n_2$ = canary escalation rate
  • $\hat{p} = (x_1 + x_2) / (n_1 + n_2)$ = pooled rate

Standard error:

$$SE = \sqrt{\hat{p}(1 - \hat{p}) \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

z-score:

$$z = \frac{p_2 - p_1}{SE}$$

Worked example from the project:

Baseline: $n_1 = 495{,}000$, $x_1 = 59{,}400$, $p_1 = 0.12$

Canary: $n_2 = 5{,}000$, $x_2 = 700$, $p_2 = 0.14$

$$\hat{p} = \frac{59{,}400 + 700}{500{,}000} = 0.1202$$

$$SE = \sqrt{0.1202 \times 0.8798 \times \left(\frac{1}{495{,}000} + \frac{1}{5{,}000}\right)} \approx 0.00462$$

$$z = \frac{0.14 - 0.12}{0.00462} \approx 4.33$$

Since $z \approx 4.33$ far exceeds the one-tailed critical value of 1.645 at $\alpha = 0.05$ (and comfortably clears the two-tailed 1.96), the increase in escalation rate is statistically significant. Decision: Do not promote. Roll back.
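
The same arithmetic in code, using the worked example's numbers:

import math
from scipy.stats import norm

n1, x1 = 495_000, 59_400   # baseline
n2, x2 = 5_000, 700        # canary

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)

se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p2 - p1) / se
p_value = 1 - norm.cdf(z)  # one-tailed: is the canary rate higher?

print(f"z = {z:.2f}")        # z = 4.33
print(f"p = {p_value:.2e}")  # far below 0.05 → roll back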

3.4 Layer 4 — Continuous Monitoring

Continuous monitoring uses hypothesis testing on a rolling basis to detect drift and regressions:

  • Weekly intent accuracy checks compared against historical baseline.
  • Daily thumbs-down rate tested for significant increases.
  • Embedding drift detected via distribution comparison tests (KS test, KL divergence).
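
As a sketch of the distribution-comparison piece, a two-sample Kolmogorov–Smirnov test can compare a reference window of some scalar drift signal against the current window (the arrays below are synthetic stand-ins, not project data):

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Illustrative drift signal, e.g., per-request distance to an embedding centroid
reference_window = rng.normal(loc=0.30, scale=0.05, size=5000)  # last week
current_window = rng.normal(loc=0.33, scale=0.05, size=5000)    # this week

# H0: both windows come from the same distribution
stat, p_value = ks_2samp(reference_window, current_window)
if p_value < 0.05:
    print(f"Drift detected (KS statistic = {stat:.3f}, p = {p_value:.2e})")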

4. Hypothesis Testing in A/B Testing

4.1 Conversion Rate Test

The primary business A/B test: does the chatbot increase conversion rate?

Setup:

  • Control: users who do not see the chatbot
  • Treatment: users who see the chatbot
  • Metric: purchase within 24 hours

Formal test:

$H_0$: $p_{\text{treatment}} = p_{\text{control}}$

$H_1$: $p_{\text{treatment}} > p_{\text{control}}$

from statsmodels.stats.proportion import proportions_ztest

# Control: 4,200 purchases out of 50,000 users → 8.4%
# Treatment: 4,700 purchases out of 50,000 users → 9.4%

count = [4700, 4200]
nobs = [50000, 50000]
z_stat, p_value = proportions_ztest(count, nobs, alternative='larger')

print(f"z = {z_stat:.4f}, p = {p_value:.6f}")
# If p < 0.05, the chatbot significantly increases conversion
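
Since a significant p-value says nothing about the size of the lift (Section 2.4), it helps to report the effect with a confidence interval. A minimal Wald-style interval for the difference in proportions, using the same counts:

import math

p_t, n_t = 4700 / 50000, 50000   # treatment: 9.4%
p_c, n_c = 4200 / 50000, 50000   # control: 8.4%

diff = p_t - p_c
se = math.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
lo, hi = diff - 1.96 * se, diff + 1.96 * se

print(f"Lift = {diff:.3f} (95% CI: {lo:.3f} to {hi:.3f})")
# Lift = 0.010 (95% CI: 0.006 to 0.014)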

4.2 Average Order Value Test

AOV is a continuous metric, so a t-test is appropriate.

from scipy.stats import ttest_ind

# treatment_aov and control_aov are arrays of per-order values.
# equal_var=False selects Welch's t-test, which does not assume
# the two groups have equal variance.
t_stat, p_value = ttest_ind(treatment_aov, control_aov,
                            equal_var=False, alternative='greater')

4.3 Multiple Testing Correction

When testing conversion rate, AOV, CSAT, escalation rate, and latency simultaneously:

from statsmodels.stats.multitest import multipletests

p_values = [0.012, 0.087, 0.041, 0.003, 0.220]
reject, corrected_p, _, _ = multipletests(p_values, method='bonferroni')

# Metrics where reject[i] is True remain significant after correction
# (method='holm' is a uniformly more powerful drop-in alternative)

5. Common Pitfalls

| Pitfall | Why It's Dangerous | MangaAssist Mitigation |
|---|---|---|
| Peeking — checking results early and stopping when p < 0.05 | Inflates the false positive rate to 20-30% | Pre-registered sample sizes; sequential testing for canaries |
| Multiple comparisons — testing 5 metrics without correction | ~23% chance of at least one false positive at $\alpha = 0.05$ | Bonferroni correction applied |
| Confusing statistical and practical significance | A 0.01% lift can be "significant" with enough data | Minimum detectable effects defined before tests |
| Ignoring power — running tests with too little data | High chance of missing real regressions | Power analysis computed before every experiment |
| Using one-tailed tests when two-tailed is appropriate | Misses unexpected regressions in the opposite direction | Canary uses one-tailed (safety); A/B uses two-tailed |
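
The peeking row deserves a concrete illustration. One crude but valid guard, assuming the number of interim looks is fixed in advance, is to split $\alpha$ across the looks Bonferroni-style; proper group-sequential methods (e.g., O'Brien-Fleming boundaries) are less conservative, but the sketch shows the idea:

ALPHA = 0.05
N_LOOKS = 5  # interim analyses scheduled before the experiment starts

# Bonferroni-style alpha spending: each look gets alpha / N_LOOKS
per_look_alpha = ALPHA / N_LOOKS

def stop_early(p_value: float) -> bool:
    """Stop at an interim look only if p clears the stricter per-look bar."""
    return p_value < per_look_alpha

# Stopping at p = 0.03 on any of five peeks would inflate false positives;
# under this rule, 0.03 does not stop the test (0.03 >= 0.01)
print(stop_early(0.03))   # False
print(stop_early(0.008))  # True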

6. Summary

Hypothesis testing was embedded at every decision layer of MangaAssist:

  1. Golden dataset gates — threshold-based implicit hypothesis tests
  2. Shadow mode comparisons — z-tests and distribution tests on real traffic
  3. Canary rollout gates — two-proportion z-tests with auto-rollback
  4. A/B tests — conversion, AOV, and CSAT hypothesis tests with power analysis
  5. Continuous monitoring — rolling hypothesis tests for drift detection