Additional Statistical Tests in MangaAssist
Beyond the fundamentals covered earlier (hypothesis testing, confidence intervals, t-tests, and chi-square tests), several other statistical tests were applied across the MangaAssist project for specialized scenarios.
1. Two-Proportion Z-Test
Purpose
Compares two proportions from independent samples. This was the most frequently used test in canary and A/B testing for rate-based metrics.
Formula
$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
Where $\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$ is the pooled proportion.
MangaAssist Applications
| Comparison | Baseline Rate | Candidate Rate | Decision |
|---|---|---|---|
| Canary escalation rate | 12.0% (n=495K) | 14.0% (n=5K) | Do not promote (z=4.33, p<0.001) |
| A/B conversion rate | 8.4% (n=50K) | 9.4% (n=50K) | Ship chatbot (significant lift) |
| Guardrail pass rate (shadow) | 98.0% | 97.2% | Investigate regression |
| Thumbs-up rate (prompt A vs B) | 60% (n=1K) | 65% (n=1K) | Prompt B is better |
from statsmodels.stats.proportion import proportions_ztest
count = [700, 59400]    # escalations in canary vs baseline
nobs = [5000, 495000]   # total sessions in canary vs baseline
z_stat, p_value = proportions_ztest(count, nobs, alternative='larger')
print(f"z = {z_stat:.2f}, p = {p_value:.6f}")
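For intuition, the same statistic can be computed directly from the pooled-proportion formula. A minimal sketch using the canary row from the table above:
import math
x1, n1 = 700, 5000       # candidate: escalations / sessions
x2, n2 = 59400, 495000   # baseline: escalations / sessions
p_pooled = (x1 + x2) / (n1 + n2)
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
z = (x1 / n1 - x2 / n2) / se
print(f"z = {z:.2f}")  # ≈ 4.33, matching the table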
2. Binomial Test (Exact Test for a Single Proportion)
Purpose
Tests whether an observed proportion differs from a hypothesized value. Used when sample sizes are small or when an exact test is preferred.
MangaAssist Application — Hallucination Rate Audit
Question: From this week's audit, is the hallucination rate significantly above the 2% target?
from scipy.stats import binomtest
# Observed: 7 hallucinations in 200 audited responses → 3.5%
result = binomtest(7, 200, p=0.02, alternative='greater')
print(f"p-value = {result.pvalue:.4f}")
# If p < 0.05, hallucination rate is significantly above 2%
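Under the hood, the one-sided exact test is just a binomial tail probability. A quick sketch to make that concrete:
from scipy.stats import binom
# P(X >= 7) when X ~ Binomial(n=200, p=0.02); sf(6) = P(X > 6) = P(X >= 7)
p_tail = binom.sf(6, 200, 0.02)
print(f"exact tail probability = {p_tail:.4f}")  # equals result.pvalue above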
Also used for:
- PII leak rate check after guardrail update
- Competitor mention rate against zero-tolerance target
- ASIN validation failure rate against 99.5% threshold
3. Mann-Whitney U Test (Wilcoxon Rank-Sum)
Purpose
Non-parametric alternative to the two-sample t-test. Compares two independent groups without assuming normality. Tests whether one distribution is stochastically greater than the other.
When Used in MangaAssist
- Latency distributions that are heavily right-skewed, especially tail metrics (P95, P99)
- Revenue per session which is zero-inflated (many sessions produce $0 revenue)
- Small-sample comparisons where CLT does not apply
from scipy.stats import mannwhitneyu
# Compare P99 latency distributions
u_stat, p_value = mannwhitneyu(
candidate_p99_latencies,
baseline_p99_latencies,
alternative='greater' # Is candidate worse?
)
print(f"U = {u_stat:.0f}, p = {p_value:.6f}")
T-Test vs. Mann-Whitney Decision
| Condition | Use |
|---|---|
| n > 30 and data is roughly symmetric | T-test (CLT applies) |
| n > 1000 even with skew | T-test (CLT applies at scale) |
| Small n and heavy skew | Mann-Whitney U |
| Comparing medians specifically | Mann-Whitney U |
| Latency tail percentiles | Mann-Whitney U |
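The table can be codified as a small helper. A hypothetical sketch (the function name and arguments are illustrative, mirroring the rows above):
def choose_two_sample_test(n, heavy_skew, compare_medians=False, tail_metric=False):
    """Hypothetical helper encoding the decision table above."""
    if compare_medians or tail_metric:
        return "mann-whitney"
    if n > 1000:
        return "t-test"       # CLT applies at scale, even with skew
    if n > 30 and not heavy_skew:
        return "t-test"       # roughly symmetric, CLT applies
    return "mann-whitney"     # small n and/or heavy skew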
4. Kolmogorov-Smirnov (KS) Test
Purpose
Tests whether two samples come from the same distribution, or whether a sample matches a theoretical distribution. Unlike chi-square or z-tests, KS tests the entire distribution shape, not just a summary statistic.
MangaAssist Application — Embedding Drift Detection
Question: Has the cosine similarity distribution between query embeddings and retrieved document embeddings shifted after an embedding model update?
from scipy.stats import ks_2samp
# cosine similarities from last week vs. this week
stat, p_value = ks_2samp(last_week_cosine_sims, this_week_cosine_sims)
print(f"KS statistic = {stat:.4f}, p = {p_value:.6f}")
MangaAssist Application — Latency Distribution Shift
# Full latency distributions before and after infra change
stat, p_value = ks_2samp(pre_change_latencies, post_change_latencies)
# Detects shifts in shape, spread, or location — not just mean
MangaAssist Application — Response Length Distribution Drift
The model evaluation framework caught a response length inflation (120 → 195 tokens). The KS test would detect this as a distributional shift even before the mean crossed a threshold.
stat, p_value = ks_2samp(baseline_lengths, candidate_lengths)
Why KS Is Useful Beyond Mean Comparisons
A t-test only detects shifts in the mean. The KS test detects:
- Mean shifts
- Variance changes
- Shape changes (e.g., a bimodal distribution appearing)
- Tail behavior changes
This is why KS tests were used alongside metric-specific tests for drift monitoring.
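A synthetic check makes the point: two samples with identical means but different spreads, where a t-test sees nothing and KS flags the change. Illustrative only; the seed and parameters are arbitrary:
import numpy as np
from scipy.stats import ks_2samp, ttest_ind
rng = np.random.default_rng(0)
a = rng.normal(loc=100, scale=10, size=5000)   # baseline
b = rng.normal(loc=100, scale=25, size=5000)   # same mean, wider spread
t_p = ttest_ind(a, b, equal_var=False).pvalue  # compares means only
ks_p = ks_2samp(a, b).pvalue                   # compares whole distributions
print(f"t-test p = {t_p:.3f}, KS p = {ks_p:.2e}")
# Typically: t-test p is large, KS p is tiny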
5. Fisher's Exact Test
Purpose
Exact test for independence in a 2×2 contingency table. Unlike the chi-square test, it is valid at any sample size, including tables with very small expected cell counts.
MangaAssist Application — Rare Guardrail Event Analysis
When analyzing rare guardrail block types (e.g., PII leak) where counts are in single digits:
from scipy.stats import fisher_exact
# 2×2 table: rows = model (baseline, candidate), columns = PII leak blocked (yes, no)
table = [[3, 12], [0, 15]]
odds_ratio, p_value = fisher_exact(table, alternative='two-sided')
# The zero cell makes the sample odds ratio infinite; the p-value is still valid
print(f"Odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")
Also used when analyzing rare events by locale or time segment where cell counts were too small for chi-square.
6. ANOVA (Analysis of Variance)
Purpose
Extends the t-test to compare means across three or more groups simultaneously. Tests whether at least one group mean is significantly different from the others.
Formula
$$F = \frac{\text{Between-group variance}}{\text{Within-group variance}} = \frac{MS_{\text{between}}}{MS_{\text{within}}}$$
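Spelled out, with $k$ groups, $N$ total observations, group sizes $n_i$, group means $\bar{x}_i$, sample variances $s_i^2$, and grand mean $\bar{x}$:
$$MS_{\text{between}} = \frac{\sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2}{k - 1}, \qquad MS_{\text{within}} = \frac{\sum_{i=1}^{k} (n_i - 1)\,s_i^2}{N - k}$$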
MangaAssist Application — Latency Across Intent Types
Question: Does response latency differ significantly across intent types?
| Intent | Mean Latency | Std Dev | n |
|---|---|---|---|
| chitchat | 45ms | 15ms | 2,000 |
| order_tracking | 250ms | 80ms | 3,000 |
| recommendation | 950ms | 300ms | 5,000 |
| faq (RAG) | 1,200ms | 400ms | 4,000 |
| complex (full LLM) | 2,100ms | 600ms | 1,000 |
from scipy.stats import f_oneway
f_stat, p_value = f_oneway(
chitchat_latencies,
order_tracking_latencies,
recommendation_latencies,
faq_latencies,
complex_latencies
)
print(f"F = {f_stat:.2f}, p = {p_value:.6f}")
# p ≈ 0 — latency differs significantly across intents (expected)
Follow-up with Tukey's HSD:
ANOVA only tells you that at least one group is different. To find which pairs differ:
from statsmodels.stats.multicomp import pairwise_tukeyhsd
tukey = pairwise_tukeyhsd(
endog=all_latencies,
groups=all_intent_labels,
alpha=0.05
)
print(tukey.summary())
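One caveat: classic one-way ANOVA assumes roughly equal group variances, and the standard deviations in the table above range from 15ms to 600ms. A heteroscedasticity-robust alternative worth checking is the Alexander-Govern test (scipy >= 1.7); a sketch assuming the same latency arrays as above:
from scipy.stats import alexandergovern
# Robust to unequal group variances, unlike f_oneway
result = alexandergovern(
    chitchat_latencies,
    order_tracking_latencies,
    recommendation_latencies,
    faq_latencies,
    complex_latencies
)
print(f"A = {result.statistic:.2f}, p = {result.pvalue:.6f}")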
MangaAssist Application — BERTScore Across Prompt Versions
Question: When testing 3 prompt variants simultaneously, does BERTScore differ?
f_stat, p_value = f_oneway(
prompt_v1_scores,
prompt_v2_scores,
prompt_v3_scores
)
7. Multiple Comparisons Correction
The Problem
When running multiple hypothesis tests simultaneously, the chance of at least one false positive increases rapidly.
For $m$ independent tests at $\alpha = 0.05$:
$$P(\text{at least one false positive}) = 1 - (1 - \alpha)^m$$
| Number of Tests | P(at least one false positive) |
|---|---|
| 1 | 5.0% |
| 5 | 22.6% |
| 10 | 40.1% |
| 20 | 64.2% |
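The table values follow directly from the formula:
# Reproduce the table: P(at least one false positive) = 1 - (1 - alpha)^m
alpha = 0.05
for m in [1, 5, 10, 20]:
    print(f"m={m:2d}: {1 - (1 - alpha)**m:.1%}")  # 5.0%, 22.6%, 40.1%, 64.2%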
Bonferroni Correction
The simplest correction: divide $\alpha$ by the number of tests.
$$\alpha_{\text{adjusted}} = \frac{\alpha}{m}$$
MangaAssist Application — Canary Multi-Metric Testing
Canary checks 5 metrics simultaneously: escalation rate, thumbs-down rate, error rate, P99 latency, and guardrail block rate.
from statsmodels.stats.multitest import multipletests
p_values = [0.032, 0.087, 0.003, 0.048, 0.210]
metrics = ['escalation', 'thumbs_down', 'error_rate', 'p99_latency', 'guardrail_block']
reject_bonferroni, corrected_p_bonf, _, _ = multipletests(
p_values, method='bonferroni'
)
reject_bh, corrected_p_bh, _, _ = multipletests(
p_values, method='fdr_bh' # Benjamini-Hochberg
)
for i, metric in enumerate(metrics):
print(f"{metric}: raw p={p_values[i]:.3f}, "
f"Bonferroni reject={reject_bonferroni[i]}, "
f"BH reject={reject_bh[i]}")
Bonferroni vs. Benjamini-Hochberg
| Method | Controls | Strictness | MangaAssist Usage |
|---|---|---|---|
| Bonferroni | Family-wise error rate | Very strict — fewer false alarms | Canary safety checks (high stakes) |
| Benjamini-Hochberg | False discovery rate | Less strict — more power | Exploratory analysis, weekly reports |
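To demystify what method='fdr_bh' does, here is a hand-rolled version of the Benjamini-Hochberg step-up procedure (a sketch for intuition, not a replacement for the statsmodels implementation):
import numpy as np
def benjamini_hochberg(p_values, q=0.05):
    """Reject the k smallest p-values, where k is the largest rank
    with p_(k) <= (k / m) * q."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    passes = p[order] <= (np.arange(1, m + 1) / m) * q
    reject = np.zeros(m, dtype=bool)
    if passes.any():
        k = passes.nonzero()[0].max()  # largest passing rank (0-indexed)
        reject[order[:k + 1]] = True
    return reject
print(benjamini_hochberg([0.032, 0.087, 0.003, 0.048, 0.210]))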
8. Sequential Testing
The Problem with Fixed-Sample Tests
Standard hypothesis tests require pre-specifying a fixed sample size. But canary rollouts need to make decisions as data accumulates — not wait for a fixed window.
How Sequential Testing Was Used
Instead of running a single z-test after 24 hours, the canary controller checked metrics at defined checkpoints (2h, 6h, 12h, 24h) using alpha-spending to control the overall false positive rate.
The O'Brien-Fleming spending function allocates most of the alpha budget to later checkpoints:
| Checkpoint | Fraction of Data | Alpha Spent | Cumulative Alpha |
|---|---|---|---|
| 2 hours | 8% | 0.0001 | 0.0001 |
| 6 hours | 25% | 0.005 | 0.0051 |
| 12 hours | 50% | 0.014 | 0.0191 |
| 24 hours | 100% | 0.031 | 0.0500 |
This means:
- At 2 hours, only extreme differences trigger rollback (very conservative).
- At 24 hours, the full remaining alpha budget is available.
- The overall false positive rate across all checks remains at 5%.
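A conservative simplification of the controller logic treats the per-checkpoint allocations from the table as a weighted Bonferroni budget: test each checkpoint's p-value against the alpha spent at that look. (True group-sequential boundaries account for correlation between looks and are slightly less conservative; the values below are the table's illustrative allocations, not an exact O'Brien-Fleming computation.)
# Alpha spent per checkpoint, taken from the table above
CHECKPOINTS = [("2h", 0.0001), ("6h", 0.005), ("12h", 0.014), ("24h", 0.031)]
def canary_decision(p_values_by_checkpoint):
    """Roll back at the first checkpoint whose p-value beats its
    allocated alpha; by the union bound, overall FPR stays ~5%."""
    for (label, alpha_spent), p in zip(CHECKPOINTS, p_values_by_checkpoint):
        if p < alpha_spent:
            return f"rollback at {label} (p={p:.4f} < {alpha_spent})"
    return "promote: no checkpoint triggered"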
9. Cohen's Kappa for Inter-Rater Agreement
Purpose
Measures agreement between two human raters beyond what would be expected by chance. Used in the weekly human audit process.
$$\kappa = \frac{P_o - P_e}{1 - P_e}$$
Where:
- $P_o$ = observed proportion of agreement
- $P_e$ = expected proportion of agreement by chance
MangaAssist Application
from sklearn.metrics import cohen_kappa_score
# Two raters labeling 200 responses as: correct, partially_correct, incorrect
rater_1 = [...]
rater_2 = [...]
kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Cohen's κ = {kappa:.3f}")
| κ Value | Interpretation | Project Status |
|---|---|---|
| < 0.20 | Poor | Retrain raters |
| 0.21 – 0.40 | Fair | Improve rubric |
| 0.41 – 0.60 | Moderate | Acceptable for some tasks |
| 0.61 – 0.80 | Substantial | Target for MangaAssist (achieved 0.78) |
| 0.81 – 1.00 | Almost perfect | Achieved for intent labeling (0.85) |
10. Summary: Test Selection Guide
| Scenario | Test | Library |
|---|---|---|
| Compare two proportions (rates) | Two-proportion z-test | statsmodels.stats.proportion.proportions_ztest() |
| Single proportion against target | Binomial test | scipy.stats.binomtest() |
| Compare two means (large samples) | Welch's t-test | scipy.stats.ttest_ind() |
| Compare paired observations | Paired t-test | scipy.stats.ttest_rel() |
| Compare two non-normal distributions | Mann-Whitney U | scipy.stats.mannwhitneyu() |
| Compare full distributions | KS test | scipy.stats.ks_2samp() |
| Categorical independence | Chi-square | scipy.stats.chi2_contingency() |
| Categorical independence (small n) | Fisher's exact | scipy.stats.fisher_exact() |
| Distribution fit to expected | Chi-square goodness-of-fit | scipy.stats.chisquare() |
| Compare 3+ group means | ANOVA + Tukey HSD | scipy.stats.f_oneway() + statsmodels |
| Correct for multiple tests | Bonferroni / BH | statsmodels.stats.multitest.multipletests() |
| Sequential canary decisions | Alpha-spending | Custom implementation |
| Inter-rater agreement | Cohen's kappa | sklearn.metrics.cohen_kappa_score() |