Confidence Intervals in MangaAssist
1. What Confidence Intervals Are and Why They Matter
A confidence interval (CI) provides a range of plausible values for a population parameter, based on observed sample data. Unlike a point estimate that gives a single number, a CI communicates uncertainty.
For MangaAssist, confidence intervals were used to:
- Report metric estimates with uncertainty bounds on dashboards
- Determine whether an A/B test lift was practically meaningful
- Set error bars on canary metric comparisons
- Quantify uncertainty around hallucination rates estimated from audits
- Provide stakeholders with ranges rather than false precision
2. Confidence Interval for a Proportion
2.1 Formula (Normal Approximation / Wald Interval)
For a sample proportion $\hat{p}$ from $n$ observations:
$$\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
Where $z_{\alpha/2} = 1.96$ for a 95% CI.
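As a sanity check, the formula can be evaluated by hand; the numbers below reproduce the escalation-rate example in 2.2:

```python
import numpy as np

# Wald interval by hand: 600 escalations out of 5,000 sessions
p_hat, n, z = 600 / 5000, 5000, 1.96
margin = z * np.sqrt(p_hat * (1 - p_hat) / n)
print(p_hat - margin, p_hat + margin)
# ≈ 0.1110, 0.1290 (matches the library output in 2.2)
```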
2.2 MangaAssist Applications
Escalation rate:
```python
from statsmodels.stats.proportion import proportion_confint

# Observed: 600 escalations out of 5,000 sessions → 12%
lower, upper = proportion_confint(600, 5000, alpha=0.05, method='normal')
print(f"95% CI for escalation rate: [{lower:.4f}, {upper:.4f}]")
# Output: [0.1110, 0.1290]
```
Hallucination rate (from weekly audit):
```python
# Observed: 4 hallucinations out of 200 audited responses → 2%
lower, upper = proportion_confint(4, 200, alpha=0.05, method='wilson')
print(f"95% CI for hallucination rate: [{lower:.4f}, {upper:.4f}]")
# Output: [0.0078, 0.0503]
```
Note: For small samples or extreme proportions, the Wilson interval (method='wilson') is preferred over the Wald interval because it avoids impossible bounds (negative or above 1.0).
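To see why the Wald interval misbehaves, here is a minimal sketch with hypothetical numbers (a rarer event in a smaller audit):

```python
import numpy as np

# Hypothetical: 1 hallucination in 50 audited responses (2%)
p_hat, n, z = 1 / 50, 50, 1.96
margin = z * np.sqrt(p_hat * (1 - p_hat) / n)
print(p_hat - margin, p_hat + margin)
# ≈ -0.0188 and 0.0588: the Wald lower bound is negative, impossible for a rate
```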
Conversion rate:
```python
# Observed: 4,700 purchases out of 50,000 chatbot users → 9.4%
lower, upper = proportion_confint(4700, 50000, alpha=0.05, method='wilson')
print(f"95% CI for conversion rate: [{lower:.4f}, {upper:.4f}]")
# Output: [0.0915, 0.0966]
```
2.3 Key Project Proportion CIs
| Metric | Observed Rate | Sample Size | 95% CI | Interpretation |
|---|---|---|---|---|
| Escalation rate | 12.0% | 5,000 | [11.1%, 12.9%] | Tight — large sample gives precise estimate |
| Hallucination rate | 2.0% | 200 | [0.8%, 5.0%] | Wide — small audit sample means high uncertainty |
| Conversion rate | 9.4% | 50,000 | [9.2%, 9.7%] | Very tight — massive traffic sample |
| Thumbs-up rate | 62.0% | 1,000 | [58.9%, 65.0%] | Moderate — survey sample |
| Guardrail block rate | 3.5% | 10,000 | [3.1%, 3.9%] | Tight |
| Intent accuracy | 89.3% | 200 | [84.5%, 93.0%] | Wide — weekly sample |
2.4 Insight: Why Small Audit Samples Produce Wide CIs
The hallucination rate CI is [0.8%, 5.0%], a 4.2-percentage-point range: the audit data are consistent with a true rate anywhere from acceptable (< 2%) to concerning (5%).
This drove the decision to increase audit sample size from 200 to 500 per week for high-stakes metrics.
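A quick sketch of the expected payoff, assuming the observed 2% rate holds at the larger sample (10 hallucinations in 500 is hypothetical):

```python
from statsmodels.stats.proportion import proportion_confint

# Hypothetical: the same 2% rate in a 500-response audit (10/500)
lower, upper = proportion_confint(10, 500, alpha=0.05, method='wilson')
print(f"95% CI at n=500: [{lower:.4f}, {upper:.4f}]")
# ≈ [0.0109, 0.0364], roughly 2.6 points wide instead of 4.2
```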
3. Confidence Interval for a Mean
3.1 Formula
For a sample mean $\bar{x}$ with sample standard deviation $s$ and sample size $n$:
$$\bar{x} \pm t_{\alpha/2, n-1} \cdot \frac{s}{\sqrt{n}}$$
Where $t_{\alpha/2, n-1}$ is the critical value from the t-distribution. For large $n$, this approaches the z-value.
3.2 MangaAssist Applications
Mean latency (new model):
```python
import numpy as np
from scipy.stats import t

# Sample of 1,000 response latencies for new model
latencies = np.array([...])  # milliseconds

n = len(latencies)
mean = np.mean(latencies)
se = np.std(latencies, ddof=1) / np.sqrt(n)
t_crit = t.ppf(0.975, df=n-1)
lower = mean - t_crit * se
upper = mean + t_crit * se
print(f"95% CI for mean latency: [{lower:.1f}ms, {upper:.1f}ms]")
```
Average tokens per response:
```python
import numpy as np

# Baseline: mean = 120 tokens, std = 45, n = 5000
# Candidate: mean = 135 tokens, std = 50, n = 5000
# At n = 5000 the t critical value is effectively z = 1.96
se_baseline = 45 / np.sqrt(5000)   # ≈ 0.636
se_candidate = 50 / np.sqrt(5000)  # ≈ 0.707
ci_baseline = (120 - 1.96*se_baseline, 120 + 1.96*se_baseline)
ci_candidate = (135 - 1.96*se_candidate, 135 + 1.96*se_candidate)
# Baseline CI: [118.75, 121.25]
# Candidate CI: [133.61, 136.39]
# Non-overlapping → significant cost increase
```

Note: non-overlapping 95% CIs do imply a significant difference, but the converse fails: two CIs can overlap while the difference is still significant, so the CI on the difference (Section 4) is the definitive check.
Revenue per chat session:
```python
import numpy as np

# Sample: 10,000 sessions, mean revenue = $5.20, std = $12.50
se = 12.50 / np.sqrt(10000)  # = 0.125
ci = (5.20 - 1.96*0.125, 5.20 + 1.96*0.125)
# 95% CI: [$4.96, $5.44]
```
3.3 Key Project Mean CIs
| Metric | Sample Mean | Std Dev | n | 95% CI |
|---|---|---|---|---|
| Mean latency (ms) | 820 | 350 | 10,000 | [813.1, 826.9] |
| Avg tokens per response | 135 | 50 | 5,000 | [133.6, 136.4] |
| Revenue per session | $5.20 | $12.50 | 10,000 | [$4.96, $5.44] |
| CSAT score | 4.2 | 0.9 | 500 | [4.12, 4.28] |
| Turns to resolution | 3.8 | 2.1 | 2,000 | [3.71, 3.89] |
4. Confidence Interval for a Difference (A/B Testing)
4.1 Difference of Two Proportions
$$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2} \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$
Conversion rate lift:
```python
import numpy as np

p_treatment = 0.094  # 9.4%
p_control = 0.084    # 8.4%
n_treatment = 50000
n_control = 50000

diff = p_treatment - p_control  # 1.0%
se_diff = np.sqrt(
    p_treatment*(1-p_treatment)/n_treatment +
    p_control*(1-p_control)/n_control
)
# se_diff ≈ 0.0018
ci_lower = diff - 1.96 * se_diff  # ≈ 0.0065
ci_upper = diff + 1.96 * se_diff  # ≈ 0.0135
print(f"Conversion lift: {diff:.4f}")
print(f"95% CI: [{ci_lower:.4f}, {ci_upper:.4f}]")
# Lift: 1.0%, CI: [0.65%, 1.35%]
# CI does not include 0 → significant
```
4.2 Difference of Two Means
$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
Latency comparison between model versions:
```python
import numpy as np

# Old model: mean = 820ms, std = 350ms, n = 10000
# New model: mean = 780ms, std = 320ms, n = 10000
diff = 780 - 820  # = -40ms (improvement)
se_diff = np.sqrt(350**2/10000 + 320**2/10000)  # ≈ 4.74
ci_lower = diff - 1.96 * se_diff  # ≈ -49.3ms
ci_upper = diff + 1.96 * se_diff  # ≈ -30.7ms
# 95% CI for latency change: [-49.3ms, -30.7ms]
# Entire CI is negative → new model is significantly faster
```
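The z-based interval above is fine at n = 10,000 per arm; for smaller samples, a t critical value with Welch–Satterthwaite degrees of freedom is safer. A minimal sketch of that variant (welch_ci is a hypothetical helper, not part of the MangaAssist codebase):

```python
import numpy as np
from scipy.stats import t

def welch_ci(m1, s1, n1, m2, s2, n2, conf=0.95):
    """CI for (m1 - m2) from summary statistics, using Welch's t."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    se = np.sqrt(v1 + v2)
    # Welch-Satterthwaite degrees of freedom
    df = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    t_crit = t.ppf(0.5 + conf / 2, df)
    diff = m1 - m2
    return diff - t_crit * se, diff + t_crit * se

print(welch_ci(780, 320, 10000, 820, 350, 10000))
# ≈ (-49.3, -30.7): matches the z-based interval at this sample size
```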
5. Bootstrap Confidence Intervals
5.1 When Bootstrapping Was Used
Some metrics in MangaAssist did not follow standard distributional assumptions:
- NDCG@3 — bounded, non-normal distribution
- BERTScore — bounded between 0 and 1, often skewed
- Revenue per session — heavily right-skewed (many $0, occasional large orders)
For these, bootstrap CIs provided distribution-free interval estimates.
5.2 How It Works
- Resample the observed data with replacement, $B$ times (e.g., $B = 10{,}000$).
- Compute the statistic of interest for each resample.
- Take the 2.5th and 97.5th percentiles of the bootstrap distribution.
```python
import numpy as np
from scipy.stats import bootstrap

# BERTScore observations for 500 golden-dataset queries
bertscores = np.array([...])  # shape (500,)

result = bootstrap(
    (bertscores,),
    statistic=np.mean,
    n_resamples=10000,
    confidence_level=0.95,
    method='percentile',
)
print(f"95% Bootstrap CI: [{result.confidence_interval.low:.4f}, "
      f"{result.confidence_interval.high:.4f}]")
```

Note: SciPy's default method is 'BCa' (bias-corrected and accelerated), which adjusts the percentile interval for bias and skew; 'percentile' is used here because it matches the three-step recipe above.
5.3 Bootstrap vs. Parametric CIs
| Aspect | Parametric CI | Bootstrap CI |
|---|---|---|
| Requires distributional assumption | Yes (normality for means, binomial for proportions) | No |
| Works well for skewed data | No, unless sample is very large | Yes |
| Computationally expensive | No | Moderate (10K resamples is fast) |
| MangaAssist usage | Conversion rate, escalation rate, latency means | BERTScore, NDCG, revenue per session |
6. CI Width and Sample Size Planning
6.1 The Relationship
CI width is inversely proportional to $\sqrt{n}$. To halve the width of a CI, you need 4× the sample size.
6.2 Practical Implication
| Metric | Weekly Sample | CI Width | Desired Width | Required Sample |
|---|---|---|---|---|
| Hallucination rate (audit) | 200 | ±2.1% | ±1.0% | ~880 |
| Intent accuracy (sample check) | 200 | ±4.3% | ±2.0% | ~920 |
| CSAT score | 500 | ±0.08 | ±0.04 | ~2,000 |
This analysis drove the decision to scale up weekly audit sample sizes for metrics where CI width was too large to support reliable decisions.
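The Required Sample column follows from the square-law above: to shrink a CI's width by a factor k, multiply the sample by k². A minimal sketch (required_n is a hypothetical helper):

```python
import math

def required_n(n_current, width_current, width_desired):
    # CI width scales as 1/sqrt(n), so n scales with the square of the width ratio
    return math.ceil(n_current * (width_current / width_desired) ** 2)

print(required_n(200, 2.1, 1.0))    # 882, i.e. ~880 for the hallucination audit
print(required_n(200, 4.3, 2.0))    # 925, i.e. ~920 for intent accuracy
print(required_n(500, 0.08, 0.04))  # 2000 for CSAT
```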
7. How CIs Appeared on Dashboards
Metric dashboards in MangaAssist always displayed confidence intervals as:
- Error bars on bar charts comparing baseline vs. candidate
- Shaded bands on time-series trend lines
- Range columns in weekly report tables
The rule was: Never report a point estimate without its CI. A conversion rate of 9.4% is not useful to stakeholders without knowing whether the CI is [4%, 14%] (uninformative) or [9.2%, 9.6%] (precise).
8. Summary
| CI Type | MangaAssist Use Case | Primary Library |
|---|---|---|
| Proportion CI | Escalation rate, hallucination rate, conversion rate | statsmodels.stats.proportion.proportion_confint() |
| Mean CI | Latency, token count, revenue per session, CSAT | scipy.stats.t.interval() |
| Difference of proportions CI | A/B test conversion lift, canary rate deltas | Manual formula with numpy |
| Difference of means CI | Latency comparison, AOV lift | scipy.stats.ttest_ind() (returns CI in newer versions) |
| Bootstrap CI | BERTScore, NDCG, skewed revenue | scipy.stats.bootstrap() |
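The ttest_ind route in the difference-of-means row deserves a quick illustration. A minimal sketch, assuming SciPy ≥ 1.10 (where the returned result exposes a confidence_interval() method), with synthetic data for illustration only:

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic latencies matching the Section 4.2 summary stats
rng = np.random.default_rng(0)
old = rng.normal(820, 350, 10000)
new = rng.normal(780, 320, 10000)

res = ttest_ind(new, old, equal_var=False)  # Welch's t-test
ci = res.confidence_interval(confidence_level=0.95)
print(f"95% CI for mean latency change: [{ci.low:.1f}ms, {ci.high:.1f}ms]")
# Should land near the [-49.3ms, -30.7ms] interval from Section 4.2
```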