
Confidence Intervals in MangaAssist

1. What Confidence Intervals Are and Why They Matter

A confidence interval (CI) provides a range of plausible values for a population parameter, based on observed sample data. Unlike a point estimate that gives a single number, a CI communicates uncertainty.

For MangaAssist, confidence intervals were used to:

  • Report metric estimates with uncertainty bounds on dashboards
  • Determine whether an A/B test lift was practically meaningful
  • Set error bars on canary metric comparisons
  • Quantify uncertainty around hallucination rates estimated from audits
  • Provide stakeholders with ranges rather than false precision

2. Confidence Interval for a Proportion

2.1 Formula (Normal Approximation / Wald Interval)

For a sample proportion $\hat{p}$ from $n$ observations:

$$\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

Where $z_{\alpha/2} = 1.96$ for a 95% CI.
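As a quick sanity check, the Wald formula can be computed directly with numpy (a minimal sketch using the 600-of-5,000 escalation counts that appear in section 2.2):

```python
import numpy as np
from scipy.stats import norm

p_hat, n = 600 / 5000, 5000      # 12% observed escalation rate
z = norm.ppf(0.975)              # ≈ 1.96 for a 95% CI
se = np.sqrt(p_hat * (1 - p_hat) / n)

lower, upper = p_hat - z * se, p_hat + z * se
print(f"[{lower:.4f}, {upper:.4f}]")  # [0.1110, 0.1290]
```

This matches the statsmodels output shown below, which is the same computation packaged as a library call.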

2.2 MangaAssist Applications

Escalation rate:

from statsmodels.stats.proportion import proportion_confint

# Observed: 600 escalations out of 5,000 sessions → 12%
lower, upper = proportion_confint(600, 5000, alpha=0.05, method='normal')
print(f"95% CI for escalation rate: [{lower:.4f}, {upper:.4f}]")
# Output: [0.1110, 0.1290]

Hallucination rate (from weekly audit):

# Observed: 4 hallucinations out of 200 audited responses → 2%
lower, upper = proportion_confint(4, 200, alpha=0.05, method='wilson')
print(f"95% CI for hallucination rate: [{lower:.4f}, {upper:.4f}]")
# Output: [0.0078, 0.0503]

Note: For small samples or extreme proportions, the Wilson interval (method='wilson') is preferred over the Wald interval because it avoids impossible bounds (negative or above 1.0).
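To see the failure mode concretely, consider a hypothetical audit week with a single hallucination in 200 responses: the Wald lower bound goes negative (an impossible proportion), while the Wilson bound stays inside [0, 1].

```python
import numpy as np
from statsmodels.stats.proportion import proportion_confint

# Hypothetical extreme case: 1 hallucination in 200 audited responses (0.5%)
p_hat, n, z = 1 / 200, 200, 1.96
se = np.sqrt(p_hat * (1 - p_hat) / n)
wald_lower = p_hat - z * se  # negative: below any possible proportion

wilson_lower, _ = proportion_confint(1, 200, alpha=0.05, method='wilson')
print(f"Wald lower: {wald_lower:.4f}, Wilson lower: {wilson_lower:.4f}")
```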

Conversion rate:

# Observed: 4,700 purchases out of 50,000 chatbot users → 9.4%
lower, upper = proportion_confint(4700, 50000, alpha=0.05, method='wilson')
print(f"95% CI for conversion rate: [{lower:.4f}, {upper:.4f}]")
# Output: [0.0915, 0.0965]

2.3 Key Project Proportion CIs

| Metric | Observed Rate | Sample Size | 95% CI | Interpretation |
|---|---|---|---|---|
| Escalation rate | 12.0% | 5,000 | [11.1%, 12.9%] | Tight: large sample gives a precise estimate |
| Hallucination rate | 2.0% | 200 | [0.8%, 5.0%] | Wide: small audit sample means high uncertainty |
| Conversion rate | 9.4% | 50,000 | [9.2%, 9.7%] | Very tight: massive traffic sample |
| Thumbs-up rate | 62.0% | 1,000 | [58.9%, 65.0%] | Moderate: survey sample |
| Guardrail block rate | 3.5% | 10,000 | [3.1%, 3.9%] | Tight |
| Intent accuracy | 89.3% | 200 | [84.5%, 93.0%] | Wide: weekly sample |

2.4 Insight: Why Small Audit Samples Produce Wide CIs

The hallucination rate CI is [0.8%, 5.0%] — a 4.2-percentage-point range. This means the true hallucination rate could be anywhere from acceptable (< 2%) to concerning (5%).

This drove the decision to increase audit sample size from 200 to 500 per week for high-stakes metrics.
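Assuming the same 2% observed rate at the larger audit size (a hypothetical 10-of-500 week), the Wilson interval width shrinks as expected:

```python
from statsmodels.stats.proportion import proportion_confint

l200, u200 = proportion_confint(4, 200, alpha=0.05, method='wilson')
l500, u500 = proportion_confint(10, 500, alpha=0.05, method='wilson')
print(f"n=200 width: {u200 - l200:.4f}")  # ≈ 0.0425
print(f"n=500 width: {u500 - l500:.4f}")  # ≈ 0.0255
```

The width does not halve, because $\sqrt{500/200} \approx 1.58$, not 2; halving the width would require quadrupling the sample (see section 6).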


3. Confidence Interval for a Mean

3.1 Formula

For a sample mean $\bar{x}$ with sample standard deviation $s$ and sample size $n$:

$$\bar{x} \pm t_{\alpha/2, n-1} \cdot \frac{s}{\sqrt{n}}$$

Where $t_{\alpha/2, n-1}$ is the critical value from the t-distribution. For large $n$, this approaches the z-value.

3.2 MangaAssist Applications

Mean latency for the new model:

import numpy as np
from scipy.stats import t

# Sample of 1,000 response latencies for new model
latencies = np.array([...])  # milliseconds
n = len(latencies)
mean = np.mean(latencies)
se = np.std(latencies, ddof=1) / np.sqrt(n)
t_crit = t.ppf(0.975, df=n-1)

lower = mean - t_crit * se
upper = mean + t_crit * se
print(f"95% CI for mean latency: [{lower:.1f}ms, {upper:.1f}ms]")
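The same interval drops out of scipy.stats.t.interval in a single call. A sketch on synthetic gamma-distributed latencies (the real traces are not reproduced in this document):

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(0)
latencies = rng.gamma(shape=4.0, scale=200.0, size=1000)  # synthetic, in ms

n = len(latencies)
mean = latencies.mean()
se = latencies.std(ddof=1) / np.sqrt(n)

# Equivalent to mean ± t.ppf(0.975, n-1) * se
lower, upper = t.interval(0.95, df=n - 1, loc=mean, scale=se)
print(f"95% CI for mean latency: [{lower:.1f}ms, {upper:.1f}ms]")
```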

Average tokens per response:

# Baseline: mean = 120 tokens, std = 45, n = 5000
# Candidate: mean = 135 tokens, std = 50, n = 5000
import numpy as np

# With n = 5000, the t critical value is effectively the z value, 1.96
se_baseline = 45 / np.sqrt(5000)   # ≈ 0.636
se_candidate = 50 / np.sqrt(5000)  # ≈ 0.707

ci_baseline = (120 - 1.96*se_baseline, 120 + 1.96*se_baseline)
ci_candidate = (135 - 1.96*se_candidate, 135 + 1.96*se_candidate)
# Baseline CI: [118.75, 121.25]
# Candidate CI: [133.61, 136.39]
# Non-overlapping → significant cost increase

Revenue per chat session:

# Sample: 10,000 sessions, mean revenue = $5.20, std = $12.50
se = 12.50 / np.sqrt(10000)  # = 0.125
ci = (5.20 - 1.96*0.125, 5.20 + 1.96*0.125)
# 95% CI: [$4.96, $5.44]

3.3 Key Project Mean CIs

| Metric | Sample Mean | Std Dev | n | 95% CI |
|---|---|---|---|---|
| Mean latency (ms) | 820 | 350 | 10,000 | [813.1, 826.9] |
| Avg tokens per response | 135 | 50 | 5,000 | [133.6, 136.4] |
| Revenue per session | $5.20 | $12.50 | 10,000 | [$4.96, $5.44] |
| CSAT score | 4.2 | 0.9 | 500 | [4.12, 4.28] |
| Turns to resolution | 3.8 | 2.1 | 2,000 | [3.71, 3.89] |

4. Confidence Interval for a Difference (A/B Testing)

4.1 Difference of Two Proportions

$$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2} \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$

Conversion rate lift:

import numpy as np

p_treatment = 0.094   # 9.4%
p_control = 0.084     # 8.4%
n_treatment = 50000
n_control = 50000

diff = p_treatment - p_control  # 1.0%
se_diff = np.sqrt(
    p_treatment*(1-p_treatment)/n_treatment +
    p_control*(1-p_control)/n_control
)
# se_diff ≈ 0.0018

ci_lower = diff - 1.96 * se_diff  # ≈ 0.0065
ci_upper = diff + 1.96 * se_diff  # ≈ 0.0135

print(f"Conversion lift: {diff:.4f}")
print(f"95% CI: [{ci_lower:.4f}, {ci_upper:.4f}]")
# Lift: 1.0%, CI: [0.65%, 1.35%]
# CI does not include 0 → significant
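Recent statsmodels versions also ship a helper for this interval, confint_proportions_2indep; a sketch on the same counts (4,700 of 50,000 treatment conversions vs. 4,200 of 50,000 control), with the method name an assumption about the installed version:

```python
from statsmodels.stats.proportion import confint_proportions_2indep

# 9.4% treatment vs. 8.4% control, 50,000 users per arm
low, upp = confint_proportions_2indep(
    4700, 50000, 4200, 50000, compare='diff', method='wald', alpha=0.05
)
print(f"95% CI for the conversion lift: [{low:.4f}, {upp:.4f}]")
```

The result should agree with the manual formula above to within rounding.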

4.2 Difference of Two Means

$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

Latency comparison between model versions:

# Old model: mean = 820ms, std = 350ms, n = 10000
# New model: mean = 780ms, std = 320ms, n = 10000

diff = 780 - 820  # = -40ms (improvement)
se_diff = np.sqrt(350**2/10000 + 320**2/10000)  # ≈ 4.74

ci_lower = diff - 1.96 * se_diff  # ≈ -49.3ms
ci_upper = diff + 1.96 * se_diff  # ≈ -30.7ms

# 95% CI for latency change: [-49.3ms, -30.7ms]
# Entire CI is negative → new model is significantly faster

5. Bootstrap Confidence Intervals

5.1 When Bootstrapping Was Used

Some metrics in MangaAssist did not follow standard distributional assumptions:

  • NDCG@3 — bounded, non-normal distribution
  • BERTScore — bounded between 0 and 1, often skewed
  • Revenue per session — heavily right-skewed (many $0, occasional large orders)

For these, bootstrap CIs provided distribution-free interval estimates.

5.2 How It Works

  1. Resample the observed data with replacement, $B$ times (e.g., $B = 10{,}000$).
  2. Compute the statistic of interest for each resample.
  3. Take the 2.5th and 97.5th percentiles of the bootstrap distribution.

from scipy.stats import bootstrap
import numpy as np

# BERTScore observations for 500 golden-dataset queries
bertscores = np.array([...])  # shape (500,)

result = bootstrap(
    (bertscores,),
    statistic=np.mean,
    n_resamples=10000,
    confidence_level=0.95,
    method='percentile'
)
print(f"95% Bootstrap CI: [{result.confidence_interval.low:.4f}, "
      f"{result.confidence_interval.high:.4f}]")
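The three steps can also be written out by hand in a few lines of numpy; synthetic beta-distributed scores stand in for the real 500 observations:

```python
import numpy as np

rng = np.random.default_rng(42)
bertscores = rng.beta(8, 2, size=500)  # skewed, bounded in [0, 1]

# Steps 1-2: resample with replacement B times, computing the mean each time
boot_means = np.array([
    rng.choice(bertscores, size=bertscores.size, replace=True).mean()
    for _ in range(10_000)
])
# Step 3: take the 2.5th and 97.5th percentiles
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean: [{lower:.4f}, {upper:.4f}]")
```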

5.3 Bootstrap vs. Parametric CIs

| Aspect | Parametric CI | Bootstrap CI |
|---|---|---|
| Requires distributional assumption | Yes (normality for means, binomial for proportions) | No |
| Works well for skewed data | No, unless sample is very large | Yes |
| Computationally expensive | No | Moderate (10K resamples is fast) |
| MangaAssist usage | Conversion rate, escalation rate, latency means | BERTScore, NDCG, revenue per session |

6. CI Width and Sample Size Planning

6.1 The Relationship

CI width is inversely proportional to $\sqrt{n}$. To halve the width of a CI, you need 4× the sample size.
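This scaling rule is what produced the "Required Sample" column in the table below; a sketch with a hypothetical helper named required_n:

```python
def required_n(current_n, current_width, desired_width):
    # Width scales as 1/sqrt(n), so n scales with the squared width ratio
    return int(round(current_n * (current_width / desired_width) ** 2))

print(required_n(200, 0.021, 0.010))  # hallucination audit: 882, i.e. ~880
print(required_n(500, 0.08, 0.04))    # CSAT: halving the width needs 4x, 2000
```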

6.2 Practical Implication

| Metric | Weekly Sample | CI Width | Desired Width | Required Sample |
|---|---|---|---|---|
| Hallucination rate (audit) | 200 | ±2.1% | ±1.0% | ~880 |
| Intent accuracy (sample check) | 200 | ±4.3% | ±2.0% | ~920 |
| CSAT score | 500 | ±0.08 | ±0.04 | ~2,000 |

This analysis drove the decision to scale up weekly audit sample sizes for metrics where CI width was too large to support reliable decisions.


7. How CIs Appeared on Dashboards

Metric dashboards in MangaAssist always displayed confidence intervals as:

  • Error bars on bar charts comparing baseline vs. candidate
  • Shaded bands on time-series trend lines
  • Range columns in weekly report tables

The rule was: Never report a point estimate without its CI. A conversion rate of 9.4% is not useful to stakeholders without knowing whether the CI is [4%, 14%] (uninformative) or [9.2%, 9.6%] (precise).


8. Summary

| CI Type | MangaAssist Use Case | Primary Library |
|---|---|---|
| Proportion CI | Escalation rate, hallucination rate, conversion rate | statsmodels.stats.proportion.proportion_confint() |
| Mean CI | Latency, token count, revenue per session, CSAT | scipy.stats.t.interval() |
| Difference of proportions CI | A/B test conversion lift, canary rate deltas | Manual formula with numpy |
| Difference of means CI | Latency comparison, AOV lift | scipy.stats.ttest_ind() (returns CI in newer versions) |
| Bootstrap CI | BERTScore, NDCG, skewed revenue | scipy.stats.bootstrap() |