
T-Tests in MangaAssist

1. Why T-Tests Were Used

T-tests compare means between groups or against a baseline. In MangaAssist, they were the primary tool for answering questions about continuous metrics — latency, token counts, response quality scores, revenue, and CSAT — where proportion tests (z-tests) are not appropriate.


2. Types of T-Tests Used

2.1 One-Sample T-Test

Question: Does the mean of a metric differ from a known target or threshold?

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

Where:

  • $\bar{x}$ = sample mean
  • $\mu_0$ = hypothesized population mean (target)
  • $s$ = sample standard deviation
  • $n$ = sample size
  • Degrees of freedom: $df = n - 1$
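As a quick sanity check of the formula, here is a minimal sketch (the eight TTFT values are made up for illustration) computing $t$ by hand and confirming it matches scipy:

import numpy as np
from scipy.stats import ttest_1samp

x = np.array([790, 812, 805, 798, 831, 776, 809, 815])  # hypothetical TTFT samples (ms)
mu0 = 800  # hypothesized mean

# t = (x_bar - mu0) / (s / sqrt(n)), with the sample std (ddof=1)
t_manual = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(len(x)))
t_scipy, p_value = ttest_1samp(x, popmean=mu0)

assert np.isclose(t_manual, t_scipy)  # identical by construction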

MangaAssist Application — Is mean latency within the SLA target?

The SLA target for TTFT (Time to First Token) is a P50 of 800ms. Strictly, a one-sample t-test checks the mean rather than the median, so we use mean TTFT as the compliance metric here. After a model upgrade, we want to verify compliance.

from scipy.stats import ttest_1samp

# Sample of 2,000 TTFT measurements after upgrade
ttft_samples = [...]  # milliseconds

t_stat, p_value = ttest_1samp(ttft_samples, popmean=800)
print(f"t = {t_stat:.4f}, p = {p_value:.6f}")

# If p < 0.05, the mean is significantly different from 800ms
# Check direction: if mean < 800, we are within SLA
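A complementary view is the 95% confidence interval for the mean, which shows how far from the 800ms target the estimate plausibly sits (a sketch, reusing the `ttft_samples` array above):

import numpy as np
from scipy import stats

# 95% CI for the mean TTFT, based on the t distribution
n = len(ttft_samples)
ci_low, ci_high = stats.t.interval(
    0.95,
    df=n - 1,
    loc=np.mean(ttft_samples),
    scale=stats.sem(ttft_samples),  # standard error of the mean
)
print(f"95% CI for mean TTFT: [{ci_low:.1f}ms, {ci_high:.1f}ms]")
# If the entire interval sits below 800ms, we are comfortably within SLA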

MangaAssist Application — Does mean token count exceed budget?

The cost budget assumes an average of 130 tokens per response. A prompt change might inflate this.

# Sample of 5,000 responses after prompt change
t_stat, p_value = ttest_1samp(token_counts, popmean=130, alternative='greater')
# One-tailed: we specifically fear an increase

2.2 Two-Sample Independent T-Test (Welch's)

Question: Do two groups have different means?

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

Degrees of freedom (Welch-Satterthwaite approximation):

$$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1-1} + \frac{(s_2^2/n_2)^2}{n_2-1}}$$

Welch's t-test is preferred over Student's t-test because it does not assume equal variances.
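Passing `equal_var=False` makes scipy apply this correction automatically, but as an illustrative sketch, the Welch df can also be computed directly from summary statistics:

def welch_df(s1, n1, s2, n2):
    """Welch-Satterthwaite degrees of freedom from sample stds and sizes."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# Using the summary statistics from the worked example in Section 6:
print(welch_df(341, 50_000, 318, 50_000))  # about 99,500; with df this large, t is effectively normal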

MangaAssist Application — Latency comparison between model versions

from scipy.stats import ttest_ind

# baseline_latencies: 10,000 TTFT samples from Claude 3.5 Sonnet
# candidate_latencies: 10,000 TTFT samples from Claude 3.5 Sonnet v2

t_stat, p_value = ttest_ind(
    candidate_latencies,
    baseline_latencies,
    equal_var=False,       # Welch's t-test
    alternative='two-sided'
)
print(f"t = {t_stat:.4f}, p = {p_value:.6f}")

MangaAssist Application — Average Order Value (AOV) in A/B test

import numpy as np

# treatment_aov: order values for chatbot users
# control_aov: order values for non-chatbot users

t_stat, p_value = ttest_ind(
    treatment_aov,
    control_aov,
    equal_var=False,
    alternative='greater'  # We want to confirm chatbot increases AOV
)

# Mean difference
lift = np.mean(treatment_aov) - np.mean(control_aov)
print(f"AOV lift: ${lift:.2f}, p = {p_value:.6f}")

MangaAssist Application — BERTScore comparison across prompt versions

# Score distributions for 500 golden-dataset examples, scored under two prompts
t_stat, p_value = ttest_ind(
    new_prompt_bertscores,
    old_prompt_bertscores,
    equal_var=False
)

2.3 Paired T-Test

Question: For paired observations (same inputs, different treatments), is the mean difference non-zero?

$$t = \frac{\bar{d}}{s_d / \sqrt{n}}$$

Where $d_i = x_{i,\text{new}} - x_{i,\text{old}}$ for each paired observation, and $\bar{d}$ and $s_d$ are the mean and std of the differences.
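Note that the paired t-test is exactly a one-sample t-test on the differences against $\mu_0 = 0$. A minimal sketch (with made-up score arrays) confirming the equivalence:

import numpy as np
from scipy.stats import ttest_1samp, ttest_rel

new = np.array([0.84, 0.79, 0.91, 0.88, 0.82])  # hypothetical paired scores
old = np.array([0.81, 0.80, 0.87, 0.85, 0.78])

t_paired, p_paired = ttest_rel(new, old)
t_diff, p_diff = ttest_1samp(new - old, popmean=0)

assert np.isclose(t_paired, t_diff) and np.isclose(p_paired, p_diff)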

MangaAssist Application — Golden dataset regression test

Each golden-dataset query is evaluated under both the baseline and candidate pipelines, producing paired scores.

from scipy.stats import ttest_rel

# paired BERTScores for same 500 queries, evaluated under old vs. new prompt
t_stat, p_value = ttest_rel(new_scores, old_scores, alternative='greater')
print(f"Paired t = {t_stat:.4f}, p = {p_value:.6f}")

Why paired is better here: The same query might be inherently easy or hard. Pairing removes that between-query variance, giving a more sensitive test of the prompt change effect.
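To make the sensitivity gain concrete, here is a small simulation sketch (purely synthetic numbers, not MangaAssist data): each query has its own difficulty, the new prompt adds a small uniform improvement, and pairing cancels the between-query variance that drowns the effect in the independent test:

import numpy as np
from scipy.stats import ttest_ind, ttest_rel

rng = np.random.default_rng(0)
difficulty = rng.normal(0.75, 0.10, 500)             # per-query baseline quality
old = difficulty + rng.normal(0, 0.02, 500)          # scoring noise per query
new = difficulty + 0.005 + rng.normal(0, 0.02, 500)  # small real improvement

_, p_independent = ttest_ind(new, old, equal_var=False)
_, p_paired = ttest_rel(new, old)
print(f"independent p = {p_independent:.4f}, paired p = {p_paired:.6f}")
# The between-query std (0.10) swamps the 0.005 effect in the independent
# test; pairing removes it, leaving only the 0.02 scoring noise, so the
# paired p-value is typically orders of magnitude smaller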

MangaAssist Application — Shadow mode latency comparison

The same production requests are processed by both baseline and candidate (asynchronously). Pairing by request ID removes variance due to query complexity.

# For each request_id, we have baseline_latency and candidate_latency
diffs = candidate_latencies - baseline_latencies  # per-request differences
t_stat, p_value = ttest_rel(candidate_latencies, baseline_latencies)
print(f"mean per-request delta = {diffs.mean():.1f}ms, p = {p_value:.6f}")

3. Assumptions and When They Hold

| Assumption | Description | MangaAssist Situation | Mitigation |
|---|---|---|---|
| Independence | Observations are independent | Mostly true for distinct user sessions | Filter out correlated multi-turn messages within a session |
| Normality | Data is approximately normally distributed | Violated for latency (right-skewed) and revenue (zero-inflated) | Large sample sizes make the t-test robust via the CLT; use Mann-Whitney for small samples |
| Equal variances | Both groups have similar variance | Often violated across model versions | Always use Welch's t-test (equal_var=False) |

When CLT Saves the T-Test

The Central Limit Theorem guarantees that the sampling distribution of the mean approaches normality as $n$ increases, regardless of the underlying data distribution; the simulation sketch after the list below illustrates this. For MangaAssist:

  • Latency data is right-skewed, but with $n = 10{,}000$, $\bar{x}$ is approximately normal.
  • Revenue data is zero-inflated, but with $n = 50{,}000$, the mean is well-behaved.
  • BERTScore is bounded [0, 1], but with $n = 500$, the paired mean difference is approximately normal.
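A quick simulation sketch (synthetic lognormal draws standing in for skewed latencies; none of these numbers are MangaAssist measurements) shows the effect:

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
# Heavily right-skewed "latency" population, median around 665ms
population = rng.lognormal(mean=6.5, sigma=0.5, size=1_000_000)

# Distribution of the sample mean at n = 10,000
sample_means = [rng.choice(population, 10_000).mean() for _ in range(2_000)]

print(f"raw skewness:  {skew(population):.2f}")     # about 1.8, strongly skewed
print(f"mean skewness: {skew(sample_means):.2f}")   # near 0, approximately normal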

When CLT Is Not Enough

For very small samples (e.g., the weekly audit of 200 responses) or extreme skew (e.g., P99 latency), non-parametric alternatives were used:

from scipy.stats import mannwhitneyu

# Mann-Whitney U test — does not require normality
u_stat, p_value = mannwhitneyu(
    candidate_p99_latencies,
    baseline_p99_latencies,
    alternative='greater'
)
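For paired data with the same normality concerns, such as a small-sample version of the shadow-mode comparison in 2.3, the rank-based analogue is the Wilcoxon signed-rank test (a sketch, reusing the paired latency arrays from that section):

from scipy.stats import wilcoxon

# Wilcoxon signed-rank test: paired, rank-based, no normality assumption
w_stat, p_value = wilcoxon(
    candidate_latencies,
    baseline_latencies,
    alternative='greater'
)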

4. Effect Size: Cohen's d

Statistical significance tells you whether a difference exists. Effect size tells you whether it matters.

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_{\text{pooled}}}$$

Where:

$$s_{\text{pooled}} = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}}$$
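A small helper sketch (not part of the original pipeline) that computes $d$ from two raw samples using the formulas above:

import numpy as np

def cohens_d(x1, x2):
    """Cohen's d with pooled standard deviation."""
    n1, n2 = len(x1), len(x2)
    s1, s2 = np.std(x1, ddof=1), np.std(x2, ddof=1)
    s_pooled = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (np.mean(x1) - np.mean(x2)) / s_pooled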

| Cohen's d | Interpretation | MangaAssist Example |
|---|---|---|
| 0.2 | Small | 70ms latency difference on a 350ms std |
| 0.5 | Medium | 5-token response length difference |
| 0.8 | Large | 280ms latency improvement after infra change |

Example calculation:

import numpy as np

# Baseline latency: mean=820, std=350, n=10000
# Candidate latency: mean=780, std=320, n=10000

s_pooled = np.sqrt(((10000-1)*350**2 + (10000-1)*320**2) / (20000-2))
d = (820 - 780) / s_pooled
print(f"Cohen's d = {d:.3f}")
# d ≈ 0.119 → small effect size
# Statistically significant but practically modest

5. T-Test Decision Matrix

| Scenario | Test Type | Alternative | Rationale |
|---|---|---|---|
| Is mean latency within SLA? | One-sample | Two-sided | Check both directions |
| Does candidate have higher latency? | Two-sample Welch's | One-sided (greater) | Safety check |
| Did prompt change increase tokens? | Two-sample Welch's | One-sided (greater) | Cost concern |
| AOV improvement in A/B test | Two-sample Welch's | One-sided (greater) | Business lift |
| BERTScore regression on golden dataset | Paired | Two-sided | Same queries, paired evaluation |
| Shadow mode latency same-request comparison | Paired | Two-sided | Paired by request ID |
| CSAT difference between weeks | Two-sample Welch's | Two-sided | Independent survey samples |

6. Real Worked Example: Model Upgrade Latency Decision

Context: Upgrading from Claude 3.5 Sonnet to a newer version. Need to verify latency did not regress.

Data:

  • Baseline (shadow mode, 24 hours): $n_1 = 50{,}000$, $\bar{x}_1 = 823\text{ms}$, $s_1 = 341\text{ms}$
  • Candidate (shadow mode, 24 hours): $n_2 = 50{,}000$, $\bar{x}_2 = 795\text{ms}$, $s_2 = 318\text{ms}$

Hypothesis:

$H_0$: $\mu_{\text{candidate}} = \mu_{\text{baseline}}$

$H_1$: $\mu_{\text{candidate}} \neq \mu_{\text{baseline}}$

Computation:

$$SE = \sqrt{\frac{341^2}{50000} + \frac{318^2}{50000}} = \sqrt{2.326 + 2.022} = \sqrt{4.348} \approx 2.085$$

$$t = \frac{795 - 823}{2.085} = \frac{-28}{2.085} \approx -13.43$$

$p \approx 0$ (extremely significant)
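The same numbers can be reproduced directly from the summary statistics with scipy's ttest_ind_from_stats (a sketch; no raw samples needed):

from scipy.stats import ttest_ind_from_stats

t_stat, p_value = ttest_ind_from_stats(
    mean1=795, std1=318, nobs1=50_000,  # candidate
    mean2=823, std2=341, nobs2=50_000,  # baseline
    equal_var=False                     # Welch's
)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
# t ≈ -13.43, p far below any reasonable alpha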

Cohen's d:

$$d = \frac{28}{\sqrt{(341^2 + 318^2)/2}} \approx \frac{28}{330} \approx 0.085$$

Decision: Statistically significant improvement ($p \approx 0$) but small effect size ($d = 0.085$). The 28ms improvement is real but modest. Combined with no quality regression, approved for promotion.


7. Summary

| T-Test Type | MangaAssist Application | When to Choose It |
|---|---|---|
| One-sample | SLA compliance, budget threshold checks | Comparing a measured mean against a fixed target |
| Two-sample (Welch's) | A/B test AOV, cross-version latency, CSAT comparison | Comparing independent groups with unequal variances |
| Paired | Golden dataset regression, shadow mode same-request comparison | Same inputs evaluated under two conditions |
| Mann-Whitney U (non-parametric alternative) | P99 latency comparison on small samples, skewed metrics | Normality assumption is violated with insufficient sample size |