Additional Statistical Tests in MangaAssist
Beyond the fundamentals covered earlier (hypothesis testing, confidence intervals, t-tests, and chi-square tests), several other statistical tests were applied across the MangaAssist project for specialized scenarios.
1. Two-Proportion Z-Test
Purpose
Compares two proportions from independent samples. This was the most frequently used test in canary and A/B testing for rate-based metrics.
Formula
$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
Where $\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$ is the pooled proportion.
MangaAssist Applications
| Comparison | Baseline Rate | Candidate Rate | Decision |
|---|---|---|---|
| Canary escalation rate | 12.0% (n=495K) | 14.0% (n=5K) | Do not promote (z=4.33, p<0.001) |
| A/B conversion rate | 8.4% (n=50K) | 9.4% (n=50K) | Ship chatbot (significant lift) |
| Guardrail pass rate (shadow) | 98.0% | 97.2% | Investigate regression |
| Thumbs-up rate (prompt A vs B) | 60% (n=1K) | 65% (n=1K) | Prompt B is better |
from statsmodels.stats.proportion import proportions_ztest
count = [700, 59400]    # escalations in canary vs baseline
nobs = [5000, 495000]   # total sessions in canary vs baseline
z_stat, p_value = proportions_ztest(count, nobs, alternative='larger')
print(f"z = {z_stat:.2f}, p = {p_value:.6f}")
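For intuition, the same statistic can be computed directly from the pooled-proportion formula. A minimal sketch using the canary row from the table above:
import math
x1, n1 = 700, 5000       # candidate: escalations / sessions
x2, n2 = 59400, 495000   # baseline: escalations / sessions
p_pooled = (x1 + x2) / (n1 + n2)
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
z = (x1 / n1 - x2 / n2) / se
print(f"z = {z:.2f}")  # ≈ 4.33, matching the table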
2. Binomial Test (Exact Test for a Single Proportion)
Purpose
Tests whether an observed proportion differs from a hypothesized value. Used when sample sizes are small or when an exact test is preferred.
MangaAssist Application — Hallucination Rate Audit
Question: From this week's audit, is the hallucination rate significantly above the 2% target?
from scipy.stats import binomtest
# Observed: 7 hallucinations in 200 audited responses → 3.5%
result = binomtest(7, 200, p=0.02, alternative='greater')
print(f"p-value = {result.pvalue:.4f}")
# If p < 0.05, hallucination rate is significantly above 2%
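Under the hood, the one-sided exact test is just a binomial tail probability. A quick sketch to make that concrete:
from scipy.stats import binom
# P(X >= 7) when X ~ Binomial(n=200, p=0.02); sf(6) = P(X > 6) = P(X >= 7)
p_tail = binom.sf(6, 200, 0.02)
print(f"exact tail probability = {p_tail:.4f}")  # equals result.pvalue above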
Also used for:
- PII leak rate check after guardrail update
- Competitor mention rate against zero-tolerance target
- ASIN validation failure rate against 99.5% threshold
3. Mann-Whitney U Test (Wilcoxon Rank-Sum)
Purpose
Non-parametric alternative to the two-sample t-test. Compares two independent groups without assuming normality. Tests whether one distribution is stochastically greater than the other.
When Used in MangaAssist
- Latency distributions that are heavily right-skewed, especially tail metrics (P95, P99)
- Revenue per session which is zero-inflated (many sessions produce $0 revenue)
- Small-sample comparisons where CLT does not apply
from scipy.stats import mannwhitneyu
# Compare P99 latency distributions
u_stat, p_value = mannwhitneyu(
candidate_p99_latencies,
baseline_p99_latencies,
alternative='greater' # Is candidate worse?
)
print(f"U = {u_stat:.0f}, p = {p_value:.6f}")
T-Test vs. Mann-Whitney Decision
| Condition | Use |
|---|---|
| n > 30 and data is roughly symmetric | T-test (CLT applies) |
| n > 1000 even with skew | T-test (CLT applies at scale) |
| Small n and heavy skew | Mann-Whitney U |
| Comparing medians specifically | Mann-Whitney U |
| Latency tail percentiles | Mann-Whitney U |
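The table can be codified as a small helper. A hypothetical sketch (the function name and arguments are illustrative, mirroring the rows above):
def choose_two_sample_test(n, heavy_skew, compare_medians=False, tail_metric=False):
    """Hypothetical helper encoding the decision table above."""
    if compare_medians or tail_metric:
        return "mann-whitney"
    if n > 1000:
        return "t-test"       # CLT applies at scale, even with skew
    if n > 30 and not heavy_skew:
        return "t-test"       # roughly symmetric, CLT applies
    return "mann-whitney"     # small n and/or heavy skew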
4. Kolmogorov-Smirnov (KS) Test
Purpose
Tests whether two samples come from the same distribution, or whether a sample matches a theoretical distribution. Unlike chi-square or z-tests, KS tests the entire distribution shape, not just a summary statistic.
MangaAssist Application — Embedding Drift Detection
Question: Has the cosine similarity distribution between query embeddings and retrieved document embeddings shifted after an embedding model update?
from scipy.stats import ks_2samp
# cosine similarities from last week vs. this week
stat, p_value = ks_2samp(last_week_cosine_sims, this_week_cosine_sims)
print(f"KS statistic = {stat:.4f}, p = {p_value:.6f}")
MangaAssist Application — Latency Distribution Shift
# Full latency distributions before and after infra change
stat, p_value = ks_2samp(pre_change_latencies, post_change_latencies)
# Detects shifts in shape, spread, or location — not just mean
MangaAssist Application — Response Length Distribution Drift
The model evaluation framework caught a response length inflation (120 → 195 tokens). The KS test would detect this as a distributional shift even before the mean crossed a threshold.
stat, p_value = ks_2samp(baseline_lengths, candidate_lengths)
Why KS Is Useful Beyond Mean Comparisons
A t-test only detects shifts in the mean. The KS test detects:
- Mean shifts
- Variance changes
- Shape changes (e.g., a bimodal distribution appearing)
- Tail behavior changes
This is why KS tests were used alongside metric-specific tests for drift monitoring.
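A synthetic check makes the point: two samples with identical means but different spreads, where a t-test sees nothing and KS flags the change. Illustrative only; the seed and parameters are arbitrary:
import numpy as np
from scipy.stats import ks_2samp, ttest_ind
rng = np.random.default_rng(0)
a = rng.normal(loc=100, scale=10, size=5000)   # baseline
b = rng.normal(loc=100, scale=25, size=5000)   # same mean, wider spread
t_p = ttest_ind(a, b, equal_var=False).pvalue  # compares means only
ks_p = ks_2samp(a, b).pvalue                   # compares whole distributions
print(f"t-test p = {t_p:.3f}, KS p = {ks_p:.2e}")
# Typically: t-test p is large, KS p is tiny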
5. Fisher's Exact Test
Purpose
Exact test for independence in a 2×2 contingency table. Unlike the chi-square test, it is valid at any sample size, including tables with very small expected cell counts.
MangaAssist Application — Rare Guardrail Event Analysis
When analyzing rare guardrail block types (e.g., PII leak) where counts are in single digits:
from scipy.stats import fisher_exact
# 2×2 table: rows = model (baseline, candidate), columns = PII leak blocked (yes, no)
table = [[3, 12], [0, 15]]
odds_ratio, p_value = fisher_exact(table, alternative='two-sided')
# The zero cell makes the sample odds ratio infinite; the p-value is still valid
print(f"Odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")
Also used when analyzing rare events by locale or time segment where cell counts were too small for chi-square.
6. ANOVA (Analysis of Variance)
Purpose
Extends the t-test to compare means across three or more groups simultaneously. Tests whether at least one group mean is significantly different from the others.
Formula
$$F = \frac{\text{Between-group variance}}{\text{Within-group variance}} = \frac{MS_{\text{between}}}{MS_{\text{within}}}$$
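Spelled out, with $k$ groups, $N$ total observations, group sizes $n_i$, group means $\bar{x}_i$, sample variances $s_i^2$, and grand mean $\bar{x}$:
$$MS_{\text{between}} = \frac{\sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2}{k - 1}, \qquad MS_{\text{within}} = \frac{\sum_{i=1}^{k} (n_i - 1)\,s_i^2}{N - k}$$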
MangaAssist Application — Latency Across Intent Types
Question: Does response latency differ significantly across intent types?
| Intent | Mean Latency | Std Dev | n |
|---|---|---|---|
| chitchat | 45ms | 15ms | 2,000 |
| order_tracking | 250ms | 80ms | 3,000 |
| recommendation | 950ms | 300ms | 5,000 |
| faq (RAG) | 1,200ms | 400ms | 4,000 |
| complex (full LLM) | 2,100ms | 600ms | 1,000 |
from scipy.stats import f_oneway
f_stat, p_value = f_oneway(
chitchat_latencies,
order_tracking_latencies,
recommendation_latencies,
faq_latencies,
complex_latencies
)
print(f"F = {f_stat:.2f}, p = {p_value:.6f}")
# p ≈ 0 — latency differs significantly across intents (expected)
Follow-up with Tukey's HSD:
ANOVA only tells you that at least one group is different. To find which pairs differ:
from statsmodels.stats.multicomp import pairwise_tukeyhsd
tukey = pairwise_tukeyhsd(
endog=all_latencies,
groups=all_intent_labels,
alpha=0.05
)
print(tukey.summary())
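One caveat: classic one-way ANOVA assumes roughly equal group variances, and the standard deviations in the table above range from 15ms to 600ms. A heteroscedasticity-robust alternative worth checking is the Alexander-Govern test (scipy >= 1.7); a sketch assuming the same latency arrays as above:
from scipy.stats import alexandergovern
# Robust to unequal group variances, unlike f_oneway
result = alexandergovern(
    chitchat_latencies,
    order_tracking_latencies,
    recommendation_latencies,
    faq_latencies,
    complex_latencies
)
print(f"A = {result.statistic:.2f}, p = {result.pvalue:.6f}")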
MangaAssist Application — BERTScore Across Prompt Versions
Question: When testing 3 prompt variants simultaneously, does BERTScore differ?
f_stat, p_value = f_oneway(
prompt_v1_scores,
prompt_v2_scores,
prompt_v3_scores
)
7. Multiple Comparisons Correction
The Problem
When running multiple hypothesis tests simultaneously, the chance of at least one false positive increases rapidly.
For $m$ independent tests at $\alpha = 0.05$:
$$P(\text{at least one false positive}) = 1 - (1 - \alpha)^m$$
| Number of Tests | P(at least one false positive) |
|---|---|
| 1 | 5.0% |
| 5 | 22.6% |
| 10 | 40.1% |
| 20 | 64.2% |
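The table values follow directly from the formula:
# Reproduce the table: P(at least one false positive) = 1 - (1 - alpha)^m
alpha = 0.05
for m in [1, 5, 10, 20]:
    print(f"m={m:2d}: {1 - (1 - alpha)**m:.1%}")  # 5.0%, 22.6%, 40.1%, 64.2%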
Bonferroni Correction
The simplest correction: divide $\alpha$ by the number of tests.
$$\alpha_{\text{adjusted}} = \frac{\alpha}{m}$$
MangaAssist Application — Canary Multi-Metric Testing
Canary checks 5 metrics simultaneously: escalation rate, thumbs-down rate, error rate, P99 latency, and guardrail block rate.
from statsmodels.stats.multitest import multipletests
p_values = [0.032, 0.087, 0.003, 0.048, 0.210]
metrics = ['escalation', 'thumbs_down', 'error_rate', 'p99_latency', 'guardrail_block']
reject_bonferroni, corrected_p_bonf, _, _ = multipletests(
p_values, method='bonferroni'
)
reject_bh, corrected_p_bh, _, _ = multipletests(
p_values, method='fdr_bh' # Benjamini-Hochberg
)
for i, metric in enumerate(metrics):
print(f"{metric}: raw p={p_values[i]:.3f}, "
f"Bonferroni reject={reject_bonferroni[i]}, "
f"BH reject={reject_bh[i]}")
Bonferroni vs. Benjamini-Hochberg
| Method | Controls | Strictness | MangaAssist Usage |
|---|---|---|---|
| Bonferroni | Family-wise error rate | Very strict — fewer false alarms | Canary safety checks (high stakes) |
| Benjamini-Hochberg | False discovery rate | Less strict — more power | Exploratory analysis, weekly reports |
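To demystify what method='fdr_bh' does, here is a hand-rolled version of the Benjamini-Hochberg step-up procedure (a sketch for intuition, not a replacement for the statsmodels implementation):
import numpy as np
def benjamini_hochberg(p_values, q=0.05):
    """Reject the k smallest p-values, where k is the largest rank
    with p_(k) <= (k / m) * q."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    passes = p[order] <= (np.arange(1, m + 1) / m) * q
    reject = np.zeros(m, dtype=bool)
    if passes.any():
        k = passes.nonzero()[0].max()  # largest passing rank (0-indexed)
        reject[order[:k + 1]] = True
    return reject
print(benjamini_hochberg([0.032, 0.087, 0.003, 0.048, 0.210]))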
8. Sequential Testing
The Problem with Fixed-Sample Tests
Standard hypothesis tests require pre-specifying a fixed sample size. But canary rollouts need to make decisions as data accumulates — not wait for a fixed window.
How Sequential Testing Was Used
Instead of running a single z-test after 24 hours, the canary controller checked metrics at defined checkpoints (2h, 6h, 12h, 24h) using alpha-spending to control the overall false positive rate.
The O'Brien-Fleming spending function allocates most of the alpha budget to later checkpoints:
| Checkpoint | Fraction of Data | Alpha Spent | Cumulative Alpha |
|---|---|---|---|
| 2 hours | 8% | 0.0001 | 0.0001 |
| 6 hours | 25% | 0.005 | 0.0051 |
| 12 hours | 50% | 0.014 | 0.0191 |
| 24 hours | 100% | 0.031 | 0.0500 |
This means:
- At 2 hours, only extreme differences trigger rollback (very conservative).
- At 24 hours, the full remaining alpha budget is available.
- The overall false positive rate across all checks remains at 5%.
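A conservative simplification of the controller logic treats the per-checkpoint allocations from the table as a weighted Bonferroni budget: test each checkpoint's p-value against the alpha spent at that look. (True group-sequential boundaries account for correlation between looks and are slightly less conservative; the values below are the table's illustrative allocations, not an exact O'Brien-Fleming computation.)
# Alpha spent per checkpoint, taken from the table above
CHECKPOINTS = [("2h", 0.0001), ("6h", 0.005), ("12h", 0.014), ("24h", 0.031)]
def canary_decision(p_values_by_checkpoint):
    """Roll back at the first checkpoint whose p-value beats its
    allocated alpha; by the union bound, overall FPR stays ~5%."""
    for (label, alpha_spent), p in zip(CHECKPOINTS, p_values_by_checkpoint):
        if p < alpha_spent:
            return f"rollback at {label} (p={p:.4f} < {alpha_spent})"
    return "promote: no checkpoint triggered"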
9. Cohen's Kappa for Inter-Rater Agreement
Purpose
Measures agreement between two human raters beyond what would be expected by chance. Used in the weekly human audit process.
$$\kappa = \frac{P_o - P_e}{1 - P_e}$$
Where:
- $P_o$ = observed proportion of agreement
- $P_e$ = expected proportion of agreement by chance
MangaAssist Application
from sklearn.metrics import cohen_kappa_score
# Two raters labeling 200 responses as: correct, partially_correct, incorrect
rater_1 = [...]
rater_2 = [...]
kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Cohen's κ = {kappa:.3f}")
| κ Value | Interpretation | Project Status |
|---|---|---|
| < 0.20 | Poor | Retrain raters |
| 0.21 – 0.40 | Fair | Improve rubric |
| 0.41 – 0.60 | Moderate | Acceptable for some tasks |
| 0.61 – 0.80 | Substantial | Target for MangaAssist (achieved 0.78) |
| 0.81 – 1.00 | Almost perfect | Achieved for intent labeling (0.85) |
10. Summary: Test Selection Guide
| Scenario | Test | Library |
|---|---|---|
| Compare two proportions (rates) | Two-proportion z-test | statsmodels.stats.proportion.proportions_ztest() |
| Single proportion against target | Binomial test | scipy.stats.binomtest() |
| Compare two means (large samples) | Welch's t-test | scipy.stats.ttest_ind() |
| Compare paired observations | Paired t-test | scipy.stats.ttest_rel() |
| Compare two non-normal distributions | Mann-Whitney U | scipy.stats.mannwhitneyu() |
| Compare full distributions | KS test | scipy.stats.ks_2samp() |
| Categorical independence | Chi-square | scipy.stats.chi2_contingency() |
| Categorical independence (small n) | Fisher's exact | scipy.stats.fisher_exact() |
| Distribution fit to expected | Chi-square goodness-of-fit | scipy.stats.chisquare() |
| Compare 3+ group means | ANOVA + Tukey HSD | scipy.stats.f_oneway() + statsmodels |
| Correct for multiple tests | Bonferroni / BH | statsmodels.stats.multitest.multipletests() |
| Sequential canary decisions | Alpha-spending | Custom implementation |
| Inter-rater agreement | Cohen's kappa | sklearn.metrics.cohen_kappa_score() |