
Additional Statistical Tests in MangaAssist

In addition to standard hypothesis testing with confidence intervals, t-tests, and chi-square tests, several other statistical tests were applied across the MangaAssist project for specialized scenarios.


1. Two-Proportion Z-Test

Purpose

Compares two proportions from independent samples. This was the most frequently used test in canary and A/B testing for rate-based metrics.

Formula

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

Where $\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$ is the pooled proportion.

MangaAssist Applications

| Comparison | Baseline Rate | Candidate Rate | Decision |
|---|---|---|---|
| Canary escalation rate | 12.0% (n=495K) | 14.0% (n=5K) | Do not promote (z=4.33, p<0.001) |
| A/B conversion rate | 8.4% (n=50K) | 9.4% (n=50K) | Ship chatbot (significant lift) |
| Guardrail pass rate (shadow) | 98.0% | 97.2% | Investigate regression |
| Thumbs-up rate (prompt A vs B) | 60% (n=1K) | 65% (n=1K) | Prompt B is better |

```python
from statsmodels.stats.proportion import proportions_ztest

count = [700, 59400]     # escalations in canary vs baseline
nobs = [5000, 495000]    # total sessions
z_stat, p_value = proportions_ztest(count, nobs, alternative='larger')
```
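
Tying the formula to the canary row above, the same statistic can be computed by hand from the pooled proportion (same counts as in the snippet):

```python
import math

x1, n1 = 700, 5000        # canary: 14.0% escalation
x2, n2 = 59400, 495000    # baseline: 12.0% escalation

p_pool = (x1 + x2) / (n1 + n2)                             # pooled proportion ≈ 0.1202
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))  # pooled standard error
z = (x1 / n1 - x2 / n2) / se
print(f"z = {z:.2f}")                                      # 4.33, matching the table
```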

2. Binomial Test (Exact Test for a Single Proportion)

Purpose

Tests whether an observed proportion differs from a hypothesized value. Used when sample sizes are small or when an exact test is preferred.

MangaAssist Application — Hallucination Rate Audit

Question: From this week's audit, is the hallucination rate significantly above the 2% target?

```python
from scipy.stats import binomtest

# Observed: 7 hallucinations in 200 audited responses → 3.5%
result = binomtest(7, 200, p=0.02, alternative='greater')
print(f"p-value = {result.pvalue:.4f}")
# If p < 0.05, the hallucination rate is significantly above 2%.
# Here p ≈ 0.11: 7/200 is not yet significant evidence that the true rate
# exceeds 2% — audits this small have limited power.
```

Also used for:

  • PII leak rate check after guardrail update
  • Competitor mention rate against zero-tolerance target
  • ASIN validation failure rate against 99.5% threshold


3. Mann-Whitney U Test (Wilcoxon Rank-Sum)

Purpose

Non-parametric alternative to the two-sample t-test. Compares two independent groups without assuming normality. Tests whether one distribution is stochastically greater than the other.

When Used in MangaAssist

  • Latency distributions that are heavily right-skewed, especially tail metrics (P95, P99)
  • Revenue per session which is zero-inflated (many sessions produce $0 revenue)
  • Small-sample comparisons where CLT does not apply

```python
from scipy.stats import mannwhitneyu

# Compare P99 latency distributions
u_stat, p_value = mannwhitneyu(
    candidate_p99_latencies,
    baseline_p99_latencies,
    alternative='greater'  # Is candidate worse?
)
print(f"U = {u_stat:.0f}, p = {p_value:.6f}")
```

T-Test vs. Mann-Whitney Decision

| Condition | Use |
|---|---|
| n > 30 and data is roughly symmetric | T-test (CLT applies) |
| n > 1000 even with skew | T-test (CLT applies at scale) |
| Small n and heavy skew | Mann-Whitney U |
| Comparing medians specifically | Mann-Whitney U |
| Latency tail percentiles | Mann-Whitney U |

4. Kolmogorov-Smirnov (KS) Test

Purpose

Tests whether two samples come from the same distribution, or whether a sample matches a theoretical distribution. Unlike chi-square or z-tests, KS tests the entire distribution shape, not just a summary statistic.

MangaAssist Application — Embedding Drift Detection

Question: Has the cosine similarity distribution between query embeddings and retrieved document embeddings shifted after an embedding model update?

```python
from scipy.stats import ks_2samp

# cosine similarities from last week vs. this week
stat, p_value = ks_2samp(last_week_cosine_sims, this_week_cosine_sims)
print(f"KS statistic = {stat:.4f}, p = {p_value:.6f}")
```

MangaAssist Application — Latency Distribution Shift

```python
# Full latency distributions before and after infra change
stat, p_value = ks_2samp(pre_change_latencies, post_change_latencies)
# Detects shifts in shape, spread, or location — not just mean
```

MangaAssist Application — Response Length Distribution Drift

The model evaluation framework caught a response length inflation (120 → 195 tokens). The KS test would detect this as a distributional shift even before the mean crossed a threshold.

```python
stat, p_value = ks_2samp(baseline_lengths, candidate_lengths)
```

Why KS Is Useful Beyond Mean Comparisons

A t-test only detects shifts in the mean. The KS test detects:

  • Mean shifts
  • Variance changes
  • Shape changes (e.g., a bimodal distribution appearing)
  • Tail behavior changes

This is why KS tests were used alongside metric-specific tests for drift monitoring.
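
To see the difference concretely, here is a synthetic demonstration (made-up normal samples, not project data): two latency-like samples share a mean of 1200ms but differ in spread, so the t-test has nothing to detect while the KS test flags the change.

```python
import numpy as np
from scipy.stats import ttest_ind, ks_2samp

rng = np.random.default_rng(42)
baseline = rng.normal(loc=1200, scale=200, size=5000)    # mean 1200ms, sd 200ms
candidate = rng.normal(loc=1200, scale=400, size=5000)   # same mean, double the spread

print(ttest_ind(baseline, candidate, equal_var=False).pvalue)  # no mean shift to find
print(ks_2samp(baseline, candidate).pvalue)                    # ≈ 0: spread change caught
```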


5. Fisher's Exact Test

Purpose

Exact test for independence in a 2×2 contingency table. Unlike chi-square, it is valid for any sample size, including very small expected counts.

MangaAssist Application — Rare Guardrail Event Analysis

When analyzing rare guardrail block types (e.g., PII leak) where counts are in single digits:

```python
from scipy.stats import fisher_exact

# Rows: model (baseline, candidate); columns: PII leak (yes, no)
table = [[3, 12], [0, 15]]
odds_ratio, p_value = fisher_exact(table, alternative='two-sided')
print(f"Odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")
# Note: the zero cell makes the sample odds ratio infinite; the p-value is still exact.
```

Also used when analyzing rare events by locale or time segment where cell counts were too small for chi-square.


6. ANOVA (Analysis of Variance)

Purpose

Extends the t-test to compare means across three or more groups simultaneously. Tests whether at least one group mean is significantly different from the others.

Formula

$$F = \frac{\text{Between-group variance}}{\text{Within-group variance}} = \frac{MS_{\text{between}}}{MS_{\text{within}}}$$

MangaAssist Application — Latency Across Intent Types

Question: Does response latency differ significantly across intent types?

| Intent | Mean Latency | Std Dev | n |
|---|---|---|---|
| chitchat | 45ms | 15ms | 2,000 |
| order_tracking | 250ms | 80ms | 3,000 |
| recommendation | 950ms | 300ms | 5,000 |
| faq (RAG) | 1,200ms | 400ms | 4,000 |
| complex (full LLM) | 2,100ms | 600ms | 1,000 |

```python
from scipy.stats import f_oneway

f_stat, p_value = f_oneway(
    chitchat_latencies,
    order_tracking_latencies,
    recommendation_latencies,
    faq_latencies,
    complex_latencies
)
print(f"F = {f_stat:.2f}, p = {p_value:.6f}")
# p ≈ 0 — latency differs significantly across intents (expected)
```

Follow-up with Tukey's HSD:

ANOVA only tells you that at least one group is different. To find which pairs differ:

```python
from statsmodels.stats.multicomp import pairwise_tukeyhsd

tukey = pairwise_tukeyhsd(
    endog=all_latencies,
    groups=all_intent_labels,
    alpha=0.05
)
print(tukey.summary())
```

MangaAssist Application — BERTScore Across Prompt Versions

Question: When testing 3 prompt variants simultaneously, does BERTScore differ?

```python
f_stat, p_value = f_oneway(
    prompt_v1_scores,
    prompt_v2_scores,
    prompt_v3_scores
)
```

7. Multiple Comparisons Correction

The Problem

When running multiple hypothesis tests simultaneously, the chance of at least one false positive increases rapidly.

For $m$ independent tests at $\alpha = 0.05$:

$$P(\text{at least one false positive}) = 1 - (1 - \alpha)^m$$

| Number of Tests | P(at least one false positive) |
|---|---|
| 1 | 5.0% |
| 5 | 22.6% |
| 10 | 40.1% |
| 20 | 64.2% |
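
These values fall straight out of the formula:

```python
# P(at least one false positive) for m independent tests at alpha = 0.05
for m in [1, 5, 10, 20]:
    print(f"m = {m:2d}: {1 - (1 - 0.05)**m:.1%}")
# m =  1: 5.0%, m =  5: 22.6%, m = 10: 40.1%, m = 20: 64.2%
```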

Bonferroni Correction

The simplest correction: divide $\alpha$ by the number of tests.

$$\alpha_{\text{adjusted}} = \frac{\alpha}{m}$$

MangaAssist Application — Canary Multi-Metric Testing

The canary checks 5 metrics simultaneously: escalation rate, thumbs-down rate, error rate, P99 latency, and guardrail block rate.

```python
from statsmodels.stats.multitest import multipletests

p_values = [0.032, 0.087, 0.003, 0.048, 0.210]
metrics = ['escalation', 'thumbs_down', 'error_rate', 'p99_latency', 'guardrail_block']

reject_bonferroni, corrected_p_bonf, _, _ = multipletests(
    p_values, method='bonferroni'
)

reject_bh, corrected_p_bh, _, _ = multipletests(
    p_values, method='fdr_bh'  # Benjamini-Hochberg
)

for i, metric in enumerate(metrics):
    print(f"{metric}: raw p={p_values[i]:.3f}, "
          f"Bonferroni reject={reject_bonferroni[i]}, "
          f"BH reject={reject_bh[i]}")
```

Bonferroni vs. Benjamini-Hochberg

| Method | Controls | Strictness | MangaAssist Usage |
|---|---|---|---|
| Bonferroni | Family-wise error rate | Very strict — fewer false alarms | Canary safety checks (high stakes) |
| Benjamini-Hochberg | False discovery rate | Less strict — more power | Exploratory analysis, weekly reports |

8. Sequential Testing

The Problem with Fixed-Sample Tests

Standard hypothesis tests require pre-specifying a fixed sample size. But canary rollouts need to make decisions as data accumulates — not wait for a fixed window.

How Sequential Testing Was Used

Instead of running a single z-test after 24 hours, the canary controller checked metrics at defined checkpoints (2h, 6h, 12h, 24h) using alpha-spending to control the overall false positive rate.

The O'Brien-Fleming spending function allocates most of the alpha budget to later checks:

| Checkpoint | Fraction of Data | Alpha Spent | Cumulative Alpha |
|---|---|---|---|
| 2 hours | 8% | 0.0001 | 0.0001 |
| 6 hours | 25% | 0.005 | 0.0051 |
| 12 hours | 50% | 0.014 | 0.0191 |
| 24 hours | 100% | 0.031 | 0.0500 |

This means:

  • At 2 hours, only extreme differences trigger rollback (very conservative).
  • At 24 hours, the full remaining alpha budget is available.
  • The overall false positive rate across all checks remains at 5%.
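
A minimal sketch of what a checkpoint gate could look like, assuming the per-checkpoint alpha values from the table above. The actual controller was a custom implementation (see the summary table below); `should_roll_back` and the hard-coded thresholds are illustrative, and comparing the nominal p-value to the incremental alpha spend at each look is a conservative simplification of a true spending-function boundary.

```python
from statsmodels.stats.proportion import proportions_ztest

# Nominal alpha available at each checkpoint (values from the table above).
# A full implementation would derive boundaries from the spending function,
# accounting for the correlation between successive looks at the same data.
CHECKPOINT_ALPHA = {'2h': 0.0001, '6h': 0.005, '12h': 0.014, '24h': 0.031}

def should_roll_back(checkpoint, canary_bad, canary_n, base_bad, base_n):
    """One-sided z-test: is the canary's bad-event rate higher than baseline's?"""
    _, p = proportions_ztest([canary_bad, base_bad], [canary_n, base_n],
                             alternative='larger')
    return p < CHECKPOINT_ALPHA[checkpoint]

# e.g. at the 6-hour checkpoint (hypothetical counts):
# should_roll_back('6h', canary_bad=130, canary_n=1250, base_bad=14850, base_n=123750)
```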


9. Cohen's Kappa for Inter-Rater Agreement

Purpose

Measures agreement between two human raters beyond what would be expected by chance. Used in the weekly human audit process.

$$\kappa = \frac{P_o - P_e}{1 - P_e}$$

Where:

  • $P_o$ = observed proportion of agreement
  • $P_e$ = expected proportion of agreement by chance

MangaAssist Application

```python
from sklearn.metrics import cohen_kappa_score

# Two raters labeling 200 responses as: correct, partially_correct, incorrect
rater_1 = [...]
rater_2 = [...]

kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Cohen's κ = {kappa:.3f}")
```

| κ Value | Interpretation | Project Status |
|---|---|---|
| < 0.20 | Poor | Retrain raters |
| 0.21 – 0.40 | Fair | Improve rubric |
| 0.41 – 0.60 | Moderate | Acceptable for some tasks |
| 0.61 – 0.80 | Substantial | Target for MangaAssist (achieved 0.78) |
| 0.81 – 1.00 | Almost perfect | Achieved for intent labeling (0.85) |

10. Summary: Test Selection Guide

| Scenario | Test | Library |
|---|---|---|
| Compare two proportions (rates) | Two-proportion z-test | statsmodels.stats.proportion.proportions_ztest() |
| Single proportion against target | Binomial test | scipy.stats.binomtest() |
| Compare two means (large samples) | Welch's t-test | scipy.stats.ttest_ind() with equal_var=False |
| Compare paired observations | Paired t-test | scipy.stats.ttest_rel() |
| Compare two non-normal distributions | Mann-Whitney U | scipy.stats.mannwhitneyu() |
| Compare full distributions | KS test | scipy.stats.ks_2samp() |
| Categorical independence | Chi-square | scipy.stats.chi2_contingency() |
| Categorical independence (small n) | Fisher's exact | scipy.stats.fisher_exact() |
| Distribution fit to expected | Chi-square goodness-of-fit | scipy.stats.chisquare() |
| Compare 3+ group means | ANOVA + Tukey HSD | scipy.stats.f_oneway() + statsmodels |
| Correct for multiple tests | Bonferroni / BH | statsmodels.stats.multitest.multipletests() |
| Sequential canary decisions | Alpha-spending | Custom implementation |
| Inter-rater agreement | Cohen's kappa | sklearn.metrics.cohen_kappa_score() |