
Confidence Intervals in MangaAssist

1. What Confidence Intervals Are and Why They Matter

A confidence interval (CI) provides a range of plausible values for a population parameter, based on observed sample data. Unlike a point estimate that gives a single number, a CI communicates uncertainty.

For MangaAssist, confidence intervals were used to:

  • Report metric estimates with uncertainty bounds on dashboards
  • Determine whether an A/B test lift was practically meaningful
  • Set error bars on canary metric comparisons
  • Quantify uncertainty around hallucination rates estimated from audits
  • Provide stakeholders with ranges rather than false precision

2. Confidence Interval for a Proportion

2.1 Formula (Normal Approximation / Wald Interval)

For a sample proportion $\hat{p}$ from $n$ observations:

$$\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

Where $z_{\alpha/2} = 1.96$ for a 95% CI.
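As a quick sanity check, the Wald formula can be computed directly with numpy (a minimal sketch using the 600-of-5,000 escalation counts that appear in section 2.2):

```python
import numpy as np
from scipy.stats import norm

p_hat, n = 600 / 5000, 5000      # 12% observed escalation rate
z = norm.ppf(0.975)              # ≈ 1.96 for a 95% CI
se = np.sqrt(p_hat * (1 - p_hat) / n)

lower, upper = p_hat - z * se, p_hat + z * se
print(f"[{lower:.4f}, {upper:.4f}]")  # [0.1110, 0.1290]
```

This matches the statsmodels output shown below, which is the same computation packaged as a library call.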

2.2 MangaAssist Applications

Escalation rate:

from statsmodels.stats.proportion import proportion_confint

# Observed: 600 escalations out of 5,000 sessions → 12%
lower, upper = proportion_confint(600, 5000, alpha=0.05, method='normal')
print(f"95% CI for escalation rate: [{lower:.4f}, {upper:.4f}]")
# Output: [0.1110, 0.1290]

Hallucination rate (from weekly audit):

# Observed: 4 hallucinations out of 200 audited responses → 2%
lower, upper = proportion_confint(4, 200, alpha=0.05, method='wilson')
print(f"95% CI for hallucination rate: [{lower:.4f}, {upper:.4f}]")
# Output: [0.0078, 0.0503]

Note: For small samples or extreme proportions, the Wilson interval (method='wilson') is preferred over the Wald interval because it avoids impossible bounds (negative or above 1.0).
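To see the failure mode concretely, consider a hypothetical audit week with a single hallucination in 200 responses: the Wald lower bound goes negative (an impossible proportion), while the Wilson bound stays inside [0, 1].

```python
import numpy as np
from statsmodels.stats.proportion import proportion_confint

# Hypothetical extreme case: 1 hallucination in 200 audited responses (0.5%)
p_hat, n, z = 1 / 200, 200, 1.96
se = np.sqrt(p_hat * (1 - p_hat) / n)
wald_lower = p_hat - z * se  # negative: below any possible proportion

wilson_lower, _ = proportion_confint(1, 200, alpha=0.05, method='wilson')
print(f"Wald lower: {wald_lower:.4f}, Wilson lower: {wilson_lower:.4f}")
```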

Conversion rate:

# Observed: 4,700 purchases out of 50,000 chatbot users → 9.4%
lower, upper = proportion_confint(4700, 50000, alpha=0.05, method='wilson')
print(f"95% CI for conversion rate: [{lower:.4f}, {upper:.4f}]")
# Output: [0.0915, 0.0965]

2.3 Key Project Proportion CIs

| Metric | Observed Rate | Sample Size | 95% CI | Interpretation |
|---|---|---|---|---|
| Escalation rate | 12.0% | 5,000 | [11.1%, 12.9%] | Tight: large sample gives a precise estimate |
| Hallucination rate | 2.0% | 200 | [0.8%, 5.0%] | Wide: small audit sample means high uncertainty |
| Conversion rate | 9.4% | 50,000 | [9.2%, 9.7%] | Very tight: massive traffic sample |
| Thumbs-up rate | 62.0% | 1,000 | [58.9%, 65.0%] | Moderate: survey sample |
| Guardrail block rate | 3.5% | 10,000 | [3.1%, 3.9%] | Tight |
| Intent accuracy | 89.3% | 200 | [84.5%, 93.0%] | Wide: weekly sample |

2.4 Insight: Why Small Audit Samples Produce Wide CIs

The hallucination rate CI is [0.8%, 5.0%] — a 4.2-percentage-point range. This means the true hallucination rate could be anywhere from acceptable (< 2%) to concerning (5%).

This drove the decision to increase audit sample size from 200 to 500 per week for high-stakes metrics.
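Assuming the same 2% observed rate at the larger audit size (a hypothetical 10-of-500 week), the Wilson interval width shrinks as expected:

```python
from statsmodels.stats.proportion import proportion_confint

l200, u200 = proportion_confint(4, 200, alpha=0.05, method='wilson')
l500, u500 = proportion_confint(10, 500, alpha=0.05, method='wilson')
print(f"n=200 width: {u200 - l200:.4f}")  # ≈ 0.0425
print(f"n=500 width: {u500 - l500:.4f}")  # ≈ 0.0255
```

The width does not halve, because $\sqrt{500/200} \approx 1.58$, not 2; halving the width would require quadrupling the sample (see section 6).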


3. Confidence Interval for a Mean

3.1 Formula

For a sample mean $\bar{x}$ with sample standard deviation $s$ and sample size $n$:

$$\bar{x} \pm t_{\alpha/2, n-1} \cdot \frac{s}{\sqrt{n}}$$

Where $t_{\alpha/2, n-1}$ is the critical value from the t-distribution. For large $n$, this approaches the z-value.

3.2 MangaAssist Applications

Mean latency for the new model:

import numpy as np
from scipy.stats import t

# Sample of 1,000 response latencies for new model
latencies = np.array([...])  # milliseconds
n = len(latencies)
mean = np.mean(latencies)
se = np.std(latencies, ddof=1) / np.sqrt(n)
t_crit = t.ppf(0.975, df=n-1)

lower = mean - t_crit * se
upper = mean + t_crit * se
print(f"95% CI for mean latency: [{lower:.1f}ms, {upper:.1f}ms]")
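The same interval drops out of scipy.stats.t.interval in a single call. A sketch on synthetic gamma-distributed latencies (the real traces are not reproduced in this document):

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(0)
latencies = rng.gamma(shape=4.0, scale=200.0, size=1000)  # synthetic, in ms

n = len(latencies)
mean = latencies.mean()
se = latencies.std(ddof=1) / np.sqrt(n)

# Equivalent to mean ± t.ppf(0.975, n-1) * se
lower, upper = t.interval(0.95, df=n - 1, loc=mean, scale=se)
print(f"95% CI for mean latency: [{lower:.1f}ms, {upper:.1f}ms]")
```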

Average tokens per response:

# Baseline: mean = 120 tokens, std = 45, n = 5000
# Candidate: mean = 135 tokens, std = 50, n = 5000
import numpy as np

# With n = 5000, the t critical value is effectively the z value, 1.96
se_baseline = 45 / np.sqrt(5000)   # ≈ 0.636
se_candidate = 50 / np.sqrt(5000)  # ≈ 0.707

ci_baseline = (120 - 1.96*se_baseline, 120 + 1.96*se_baseline)
ci_candidate = (135 - 1.96*se_candidate, 135 + 1.96*se_candidate)
# Baseline CI: [118.75, 121.25]
# Candidate CI: [133.61, 136.39]
# Non-overlapping → significant cost increase

Revenue per chat session:

# Sample: 10,000 sessions, mean revenue = $5.20, std = $12.50
se = 12.50 / np.sqrt(10000)  # = 0.125
ci = (5.20 - 1.96*0.125, 5.20 + 1.96*0.125)
# 95% CI: [$4.96, $5.44]

3.3 Key Project Mean CIs

| Metric | Sample Mean | Std Dev | n | 95% CI |
|---|---|---|---|---|
| Mean latency (ms) | 820 | 350 | 10,000 | [813.1, 826.9] |
| Avg tokens per response | 135 | 50 | 5,000 | [133.6, 136.4] |
| Revenue per session | $5.20 | $12.50 | 10,000 | [$4.96, $5.44] |
| CSAT score | 4.2 | 0.9 | 500 | [4.12, 4.28] |
| Turns to resolution | 3.8 | 2.1 | 2,000 | [3.71, 3.89] |

4. Confidence Interval for a Difference (A/B Testing)

4.1 Difference of Two Proportions

$$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2} \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$

Conversion rate lift:

import numpy as np

p_treatment = 0.094   # 9.4%
p_control = 0.084     # 8.4%
n_treatment = 50000
n_control = 50000

diff = p_treatment - p_control  # 1.0%
se_diff = np.sqrt(
    p_treatment*(1-p_treatment)/n_treatment +
    p_control*(1-p_control)/n_control
)
# se_diff ≈ 0.0018

ci_lower = diff - 1.96 * se_diff  # ≈ 0.0065
ci_upper = diff + 1.96 * se_diff  # ≈ 0.0135

print(f"Conversion lift: {diff:.4f}")
print(f"95% CI: [{ci_lower:.4f}, {ci_upper:.4f}]")
# Lift: 1.0%, CI: [0.65%, 1.35%]
# CI does not include 0 → significant
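Recent statsmodels versions also ship a helper for this interval, confint_proportions_2indep; a sketch on the same counts (4,700 of 50,000 treatment conversions vs. 4,200 of 50,000 control), with the method name an assumption about the installed version:

```python
from statsmodels.stats.proportion import confint_proportions_2indep

# 9.4% treatment vs. 8.4% control, 50,000 users per arm
low, upp = confint_proportions_2indep(
    4700, 50000, 4200, 50000, compare='diff', method='wald', alpha=0.05
)
print(f"95% CI for the conversion lift: [{low:.4f}, {upp:.4f}]")
```

The result should agree with the manual formula above to within rounding.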

4.2 Difference of Two Means

$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

Latency comparison between model versions:

# Old model: mean = 820ms, std = 350ms, n = 10000
# New model: mean = 780ms, std = 320ms, n = 10000

diff = 780 - 820  # = -40ms (improvement)
se_diff = np.sqrt(350**2/10000 + 320**2/10000)  # ≈ 4.74

ci_lower = diff - 1.96 * se_diff  # ≈ -49.3ms
ci_upper = diff + 1.96 * se_diff  # ≈ -30.7ms

# 95% CI for latency change: [-49.3ms, -30.7ms]
# Entire CI is negative → new model is significantly faster

5. Bootstrap Confidence Intervals

5.1 When Bootstrapping Was Used

Some metrics in MangaAssist did not follow standard distributional assumptions:

  • NDCG@3 — bounded, non-normal distribution
  • BERTScore — bounded between 0 and 1, often skewed
  • Revenue per session — heavily right-skewed (many $0, occasional large orders)

For these, bootstrap CIs provided distribution-free interval estimates.

5.2 How It Works

  1. Resample the observed data with replacement, $B$ times (e.g., $B = 10{,}000$).
  2. Compute the statistic of interest for each resample.
  3. Take the 2.5th and 97.5th percentiles of the bootstrap distribution.

from scipy.stats import bootstrap
import numpy as np

# BERTScore observations for 500 golden-dataset queries
bertscores = np.array([...])  # shape (500,)

result = bootstrap(
    (bertscores,),
    statistic=np.mean,
    n_resamples=10000,
    confidence_level=0.95,
    method='percentile'
)
print(f"95% Bootstrap CI: [{result.confidence_interval.low:.4f}, "
      f"{result.confidence_interval.high:.4f}]")
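The three steps can also be written out by hand in a few lines of numpy; synthetic beta-distributed scores stand in for the real 500 observations:

```python
import numpy as np

rng = np.random.default_rng(42)
bertscores = rng.beta(8, 2, size=500)  # skewed, bounded in [0, 1]

# Steps 1-2: resample with replacement B times, computing the mean each time
boot_means = np.array([
    rng.choice(bertscores, size=bertscores.size, replace=True).mean()
    for _ in range(10_000)
])
# Step 3: take the 2.5th and 97.5th percentiles
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean: [{lower:.4f}, {upper:.4f}]")
```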

5.3 Bootstrap vs. Parametric CIs

| Aspect | Parametric CI | Bootstrap CI |
|---|---|---|
| Requires distributional assumption | Yes (normality for means, binomial for proportions) | No |
| Works well for skewed data | No, unless sample is very large | Yes |
| Computationally expensive | No | Moderate (10K resamples is fast) |
| MangaAssist usage | Conversion rate, escalation rate, latency means | BERTScore, NDCG, revenue per session |

6. CI Width and Sample Size Planning

6.1 The Relationship

CI width is inversely proportional to $\sqrt{n}$. To halve the width of a CI, you need 4× the sample size.
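This scaling rule is what produced the "Required Sample" column in the table below; a sketch with a hypothetical helper named required_n:

```python
def required_n(current_n, current_width, desired_width):
    # Width scales as 1/sqrt(n), so n scales with the squared width ratio
    return int(round(current_n * (current_width / desired_width) ** 2))

print(required_n(200, 0.021, 0.010))  # hallucination audit: 882, i.e. ~880
print(required_n(500, 0.08, 0.04))    # CSAT: halving the width needs 4x, 2000
```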

6.2 Practical Implication

| Metric | Weekly Sample | CI Width | Desired Width | Required Sample |
|---|---|---|---|---|
| Hallucination rate (audit) | 200 | ±2.1% | ±1.0% | ~880 |
| Intent accuracy (sample check) | 200 | ±4.3% | ±2.0% | ~920 |
| CSAT score | 500 | ±0.08 | ±0.04 | ~2,000 |

This analysis drove the decision to scale up weekly audit sample sizes for metrics where CI width was too large to support reliable decisions.


7. How CIs Appeared on Dashboards

Metric dashboards in MangaAssist always displayed confidence intervals as:

  • Error bars on bar charts comparing baseline vs. candidate
  • Shaded bands on time-series trend lines
  • Range columns in weekly report tables

The rule was: Never report a point estimate without its CI. A conversion rate of 9.4% is not useful to stakeholders without knowing whether the CI is [4%, 14%] (uninformative) or [9.2%, 9.6%] (precise).


8. Summary

| CI Type | MangaAssist Use Case | Primary Library |
|---|---|---|
| Proportion CI | Escalation rate, hallucination rate, conversion rate | statsmodels.stats.proportion.proportion_confint() |
| Mean CI | Latency, token count, revenue per session, CSAT | scipy.stats.t.interval() |
| Difference of proportions CI | A/B test conversion lift, canary rate deltas | Manual formula with numpy |
| Difference of means CI | Latency comparison, AOV lift | scipy.stats.ttest_ind() (returns CI in newer versions) |
| Bootstrap CI | BERTScore, NDCG, skewed revenue | scipy.stats.bootstrap() |