T-Tests in MangaAssist
1. Why T-Tests Were Used
T-tests compare means between groups or against a baseline. In MangaAssist, they were the primary tool for answering questions about continuous metrics (latency, token counts, response quality scores, revenue, and CSAT), where proportion tests (z-tests) are not appropriate.
2. Types of T-Tests Used
2.1 One-Sample T-Test
Question: Does the mean of a metric differ from a known target or threshold?
$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$
Where:
- $\bar{x}$ = sample mean
- $\mu_0$ = hypothesized population mean (the target)
- $s$ = sample standard deviation
- $n$ = sample size
- Degrees of freedom: $df = n - 1$
MangaAssist Application — Is mean latency within the SLA target?
The SLA target for TTFT (Time to First Token) P50 is 800ms. After a model upgrade, we want to verify compliance.
```python
from scipy.stats import ttest_1samp

# Sample of 2,000 TTFT measurements after upgrade
ttft_samples = [...]  # milliseconds

t_stat, p_value = ttest_1samp(ttft_samples, popmean=800)
print(f"t = {t_stat:.4f}, p = {p_value:.6f}")

# If p < 0.05, the mean is significantly different from 800ms
# Check direction: if the sample mean < 800, we are within SLA
```
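As a cross-check, the statistic can be computed directly from the formula above; `ttest_1samp` uses the sample standard deviation (`ddof=1`), so the two agree. A minimal sketch, assuming `ttft_samples` has been loaded as a NumPy array:

```python
import numpy as np
from scipy.stats import t as t_dist

x = np.asarray(ttft_samples)
n = x.size
t_manual = (x.mean() - 800) / (x.std(ddof=1) / np.sqrt(n))  # matches ttest_1samp
p_manual = 2 * t_dist.sf(abs(t_manual), df=n - 1)           # two-sided p-value
```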
MangaAssist Application — Does mean token count exceed budget?
The cost budget assumes an average of 130 tokens per response. A prompt change might inflate this.
```python
# Sample of 5,000 responses after prompt change
t_stat, p_value = ttest_1samp(token_counts, popmean=130, alternative='greater')
# One-tailed: we specifically fear an increase
```
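For budget discussions, a one-sided confidence bound on the mean token count is often more useful than the p-value alone. A sketch, assuming SciPy 1.10 or later, where the result object exposes a confidence interval:

```python
# SciPy >= 1.10: the result object carries a confidence interval on the mean
res = ttest_1samp(token_counts, popmean=130, alternative='greater')
ci = res.confidence_interval(confidence_level=0.95)
# For alternative='greater', ci.high is +inf; ci.low is the 95% lower bound
print(f"95% lower bound on mean tokens per response: {ci.low:.1f}")
```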
2.2 Two-Sample Independent T-Test (Welch's)
Question: Do two groups have different means?
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
Degrees of freedom (Welch-Satterthwaite approximation):
$$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1-1} + \frac{(s_2^2/n_2)^2}{n_2-1}}$$
Welch's t-test is preferred over Student's t-test because it does not assume equal variances.
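Depending on the SciPy version, the degrees of freedom may not be reported alongside the statistic, but they follow directly from the Welch-Satterthwaite formula. A minimal sketch:

```python
import numpy as np

def welch_df(a, b):
    """Welch-Satterthwaite degrees of freedom for two independent samples."""
    va, vb = np.var(a, ddof=1) / len(a), np.var(b, ddof=1) / len(b)
    return (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
```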
MangaAssist Application — Latency comparison between model versions
```python
from scipy.stats import ttest_ind

# baseline_latencies: 10,000 TTFT samples from Claude 3.5 Sonnet
# candidate_latencies: 10,000 TTFT samples from Claude 3.5 Sonnet v2
t_stat, p_value = ttest_ind(
    candidate_latencies,
    baseline_latencies,
    equal_var=False,  # Welch's t-test
    alternative='two-sided'
)
print(f"t = {t_stat:.4f}, p = {p_value:.6f}")
```
MangaAssist Application — Average Order Value (AOV) in A/B test
```python
import numpy as np

# treatment_aov: order values for chatbot users
# control_aov: order values for non-chatbot users
t_stat, p_value = ttest_ind(
    treatment_aov,
    control_aov,
    equal_var=False,
    alternative='greater'  # We want to confirm the chatbot increases AOV
)

# Mean difference
lift = np.mean(treatment_aov) - np.mean(control_aov)
print(f"AOV lift: ${lift:.2f}, p = {p_value:.6f}")
```
MangaAssist Application — BERTScore comparison across prompt versions
```python
# Score distributions for 500 golden-dataset examples, scored under two prompts
t_stat, p_value = ttest_ind(
    new_prompt_bertscores,
    old_prompt_bertscores,
    equal_var=False
)
```
2.3 Paired T-Test
Question: For paired observations (same inputs, different treatments), is the mean difference non-zero?
$$t = \frac{\bar{d}}{s_d / \sqrt{n}}$$
Where $d_i = x_{i,\text{new}} - x_{i,\text{old}}$ for each paired observation, and $\bar{d}$ and $s_d$ are the mean and standard deviation of the differences.
MangaAssist Application — Golden dataset regression test
Each golden-dataset query is evaluated under both the baseline and candidate pipelines, producing paired scores.
```python
from scipy.stats import ttest_rel

# Paired BERTScores for the same 500 queries, evaluated under old vs. new prompt
t_stat, p_value = ttest_rel(new_scores, old_scores, alternative='greater')
print(f"Paired t = {t_stat:.4f}, p = {p_value:.6f}")
```
Why pairing is better here: the same query might be inherently easy or hard. Pairing removes that between-query variance, giving a more sensitive test of the prompt-change effect.
MangaAssist Application — Shadow mode latency comparison
The same production requests are processed by both baseline and candidate (asynchronously). Pairing by request ID removes variance due to query complexity.
```python
# For each request_id, we have aligned baseline and candidate latency arrays
t_stat, p_value = ttest_rel(candidate_latencies, baseline_latencies)
# Equivalent to a one-sample t-test of the per-request differences
# (candidate_latencies - baseline_latencies) against zero
```
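That equivalence is easy to see by computing the paired statistic directly from the difference scores, per the formula above. A minimal sketch:

```python
import numpy as np
from scipy.stats import t as t_dist

def paired_t(new, old):
    """Paired t-statistic and two-sided p-value from the difference scores."""
    d = np.asarray(new) - np.asarray(old)  # per-pair differences
    n = d.size
    t_stat = d.mean() / (d.std(ddof=1) / np.sqrt(n))
    p_value = 2 * t_dist.sf(abs(t_stat), df=n - 1)
    return t_stat, p_value
```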
3. Assumptions and When They Hold
| Assumption | Description | MangaAssist Situation | Mitigation |
|---|---|---|---|
| Independence | Observations are independent | Mostly true for distinct user sessions | Analyze one observation per session so correlated multi-turn data is not treated as independent |
| Normality | Data is approximately normally distributed | Violated for latency (right-skewed) and revenue (zero-inflated) | Large sample sizes make the t-test robust via the CLT; use Mann-Whitney U for small samples |
| Equal variances | Both groups have similar variance | Often violated across model versions | Always use Welch's t-test (equal_var=False) |
When CLT Saves the T-Test
The Central Limit Theorem guarantees that the sampling distribution of the mean approaches normality as $n$ increases, for any underlying distribution with finite variance. For MangaAssist:
- Latency data is right-skewed, but with $n = 10{,}000$, $\bar{x}$ is approximately normal.
- Revenue data is zero-inflated, but with $n = 50{,}000$, the mean is well-behaved.
- BERTScore is bounded [0, 1], but with $n = 500$, the paired mean difference is approximately normal.
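The effect is easy to demonstrate by simulation. A sketch using a synthetic lognormal "latency" population (the distribution and its parameters are illustrative assumptions, not MangaAssist data):

```python
import numpy as np

rng = np.random.default_rng(42)
# Right-skewed synthetic population, roughly latency-shaped (median ~735ms)
population = rng.lognormal(mean=6.6, sigma=0.4, size=1_000_000)

# Distribution of sample means at n = 10,000: approximately normal despite the skew
sample_means = rng.choice(population, size=(1_000, 10_000)).mean(axis=1)
print(f"mean of means = {sample_means.mean():.1f}ms, "
      f"std of means = {sample_means.std():.2f}ms")
```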
When CLT Is Not Enough
For very small samples (e.g., the weekly audit of 200 responses) or extremely skewed metrics (e.g., P99 latency), non-parametric alternatives were used:
```python
from scipy.stats import mannwhitneyu

# Mann-Whitney U test — does not require normality
u_stat, p_value = mannwhitneyu(
    candidate_p99_latencies,
    baseline_p99_latencies,
    alternative='greater'
)
```
4. Effect Size: Cohen's d
Statistical significance tells you whether a difference exists. Effect size tells you whether it matters.
$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_{\text{pooled}}}$$
Where:
$$s_{\text{pooled}} = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}}$$
| Cohen's d | Interpretation | MangaAssist Example |
|---|---|---|
| 0.2 | Small | 70ms latency difference against a 350ms std |
| 0.5 | Medium | 5-token response length difference against a 10-token std |
| 0.8 | Large | 280ms latency difference against a 350ms std |
Example calculation:
```python
import numpy as np

# Baseline latency:  mean=820, std=350, n=10000
# Candidate latency: mean=780, std=320, n=10000
s_pooled = np.sqrt(((10000 - 1) * 350**2 + (10000 - 1) * 320**2) / (20000 - 2))
d = (820 - 780) / s_pooled
print(f"Cohen's d = {d:.3f}")
# d ≈ 0.119 → small effect size
# Statistically significant but practically modest
```
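Wrapped as a reusable helper for raw samples (a sketch; SciPy does not ship a built-in Cohen's d):

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d with the pooled standard deviation from the formula above."""
    na, nb = len(a), len(b)
    s_pooled = np.sqrt(((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1))
                       / (na + nb - 2))
    return (np.mean(a) - np.mean(b)) / s_pooled
```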
5. T-Test Decision Matrix
| Scenario | Test Type | Alternative | Rationale |
|---|---|---|---|
| Is mean latency within SLA? | One-sample | Two-sided | Check both directions |
| Does candidate have higher latency? | Two-sample Welch's | One-sided (greater) | Safety check |
| Did prompt change increase tokens? | Two-sample Welch's | One-sided (greater) | Cost concern |
| AOV improvement in A/B test | Two-sample Welch's | One-sided (greater) | Business lift |
| BERTScore regression on golden dataset | Paired | Two-sided | Same queries, paired evaluation |
| Shadow mode latency same-request comparison | Paired | Two-sided | Paired by request ID |
| CSAT difference between weeks | Two-sample Welch's | Two-sided | Independent survey samples |
6. Real Worked Example: Model Upgrade Latency Decision
Context: Upgrading from Claude 3.5 Sonnet to a newer version. Need to verify latency did not regress.
Data:
- Baseline (shadow mode, 24 hours): $n_1 = 50{,}000$, $\bar{x}_1 = 823\text{ms}$, $s_1 = 341\text{ms}$
- Candidate (shadow mode, 24 hours): $n_2 = 50{,}000$, $\bar{x}_2 = 795\text{ms}$, $s_2 = 318\text{ms}$
Hypothesis:
$H_0$: $\mu_{\text{candidate}} = \mu_{\text{baseline}}$
$H_1$: $\mu_{\text{candidate}} \neq \mu_{\text{baseline}}$
Computation:
$$SE = \sqrt{\frac{341^2}{50000} + \frac{318^2}{50000}} = \sqrt{2.326 + 2.022} = \sqrt{4.348} \approx 2.085$$
$$t = \frac{795 - 823}{2.085} = \frac{-28}{2.085} \approx -13.43$$
$p \approx 0$ (extremely significant)
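The hand computation can be verified from the summary statistics alone with SciPy's `ttest_ind_from_stats`, which runs Welch's test when `equal_var=False`:

```python
from scipy.stats import ttest_ind_from_stats

t_stat, p_value = ttest_ind_from_stats(
    mean1=795, std1=318, nobs1=50_000,  # candidate
    mean2=823, std2=341, nobs2=50_000,  # baseline
    equal_var=False                     # Welch's t-test
)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")  # t ≈ -13.43, p effectively 0
```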
Cohen's d:
$$d = \frac{28}{\sqrt{(341^2 + 318^2)/2}} \approx \frac{28}{330} \approx 0.085$$
Decision: Statistically significant improvement ($p \approx 0$) but small effect size ($d = 0.085$). The 28ms improvement is real but modest. Combined with no quality regression, approved for promotion.
7. Summary
| T-Test Type | MangaAssist Application | When to Choose It |
|---|---|---|
| One-sample | SLA compliance, budget threshold checks | Comparing a measured mean against a fixed target |
| Two-sample (Welch's) | A/B test AOV, cross-version latency, CSAT comparison | Comparing independent groups with unequal variances |
| Paired | Golden dataset regression, shadow mode same-request comparison | Same inputs evaluated under two conditions |
| Mann-Whitney U (non-parametric alternative) | P99 latency comparison on small samples, skewed metrics | Normality assumption is violated with insufficient sample size |