Statistics Foundations for MangaAssist
This folder explains the core statistics concepts that are already being used, either explicitly or implicitly, across the MangaAssist project documents.
The goal is not to turn the project into a pure statistics exercise. The goal is to make the measurement model behind evaluation, monitoring, and experimentation explicit.
Reading Order
- Start here with this README to understand the statistics concepts and how they connect to the project.
- Then read 01-ml-ai-libraries-and-statistics-foundations.md to see how real ML and AI libraries use these same concepts.
Where Statistics Shows Up in This Project
Statistics is used in three main places:
- Experimentation and attribution
  - A/B testing for chatbot impact on conversion, AOV, and support reduction.
  - Holdout groups and attribution windows for causal measurement.
- Model and product evaluation
  - Intent accuracy, hallucination rate, Recall@3, thumbs-down rate, and guardrail false positive rate.
  - Canary thresholds and statistical significance checks during rollout.
- Operational monitoring
  - P50 and P99 latency, intent distribution drift, response length drift, and error-rate tracking.
Relevant project docs:
- 13-metrics.md
- Challenges/real-world-challenges.md
- Model-Inference/06-model-evaluation-framework.md
- 01-ml-ai-libraries-and-statistics-foundations.md
1. Probability Rules Used in This Project
These are the probability ideas the project relies on most.
1. Event Probability
Many project metrics are probabilities in business form:
- Conversion rate = $P(\text{purchase} \mid \text{chat session})$
- Escalation rate = $P(\text{human handoff} \mid \text{chat session})$
- Hallucination rate = $P(\text{factually wrong response} \mid \text{response})$
- Thumbs-up rate = $P(\text{positive feedback} \mid \text{feedback shown})$
This is the basic rule behind almost every success-rate metric in 13-metrics.md.
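As a minimal sketch of how these rates are computed, each one is just successes over trials within the conditioning set. The session records below are hypothetical, not project data:

```python
# Each success-rate metric is a conditional probability estimated as
# successes / trials over the conditioning set (here: chat sessions).
sessions = [
    {"chatted": True, "purchased": True},
    {"chatted": True, "purchased": False},
    {"chatted": True, "purchased": True},
    {"chatted": False, "purchased": False},  # no chat: excluded from the rate
]

chat_sessions = [s for s in sessions if s["chatted"]]
conversion_rate = sum(s["purchased"] for s in chat_sessions) / len(chat_sessions)
print(conversion_rate)  # P(purchase | chat session) = 2/3
```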
2. Conditional Probability
Conditional probability is central to the project because outcomes are measured within a context, not in isolation.
Examples:
- $P(\text{purchase} \mid \text{user saw chatbot})$
- $P(\text{purchase within 24h} \mid \text{ASIN mentioned in chat})$
- $P(\text{thumbs down} \mid \text{new canary model})$
- $P(\text{recommendation intent} \mid \text{holiday traffic})$
This is directly reflected in:
- A/B testing and holdout measurement in Challenges/real-world-challenges.md
- Canary comparisons in Model-Inference/06-model-evaluation-framework.md
- Intent-shift monitoring in Challenges/real-world-challenges.md
3. Joint Probability / Intersection of Events
Attribution in this project uses multiple conditions at once.
Example:
- A purchase is attributed only if it happened within 24 hours and the purchased ASIN was mentioned or recommended in the conversation.
That is effectively the intersection:
$$ P(\text{Attributed Purchase}) = P(\text{Purchase within 24h} \cap \text{ASIN mentioned}) $$
This matters because it prevents inflated attribution.
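The attribution rule can be sketched as a conjunction of both conditions. The function name, ASINs, and timestamps below are illustrative, not taken from the project code:

```python
from datetime import datetime, timedelta

# Hypothetical attribution check: a purchase counts only if it occurred
# within 24 hours of the chat AND the purchased ASIN was mentioned in
# the conversation -- the intersection of both events.
def is_attributed(chat_end, purchase_time, purchased_asin, mentioned_asins):
    within_24h = timedelta(0) <= purchase_time - chat_end <= timedelta(hours=24)
    asin_mentioned = purchased_asin in mentioned_asins
    return within_24h and asin_mentioned

chat_end = datetime(2024, 6, 1, 12, 0)
print(is_attributed(chat_end, datetime(2024, 6, 1, 20, 0), "B00X", {"B00X", "B00Y"}))  # True
print(is_attributed(chat_end, datetime(2024, 6, 3, 9, 0), "B00X", {"B00X"}))           # False: outside window
```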
4. Complement Rule
Several metrics are complements of each other.
Examples:
- Support cost deflection is closely related to $1 - P(\text{escalation})$
- Availability-related reliability can be framed as $1 - P(\text{failure})$
- Resolution failure rate is $1 - P(\text{resolution})$
The complement rule is heavily used in operational reporting and dashboard thinking.
5. Distribution of Outcomes Across Classes
Intent monitoring depends on class probabilities across categories such as:
`recommendation`, `product_question`, `faq`, `order_tracking`, `chitchat`, `escalation`
Those proportions form an intent distribution. Drift alerts trigger when the share of an intent changes materially over time.
This is described explicitly in 13-metrics.md and Challenges/real-world-challenges.md.
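A simple version of this drift check compares each intent's observed share against a baseline. The baseline shares and the 10-point alert threshold below are illustrative assumptions, not values from the project docs:

```python
from collections import Counter

# Hypothetical baseline intent distribution (shares sum to 1.0).
baseline = {"recommendation": 0.30, "product_question": 0.25, "faq": 0.20,
            "order_tracking": 0.15, "chitchat": 0.07, "escalation": 0.03}

# This week's classified queries (synthetic sample of 100).
this_week = (["recommendation"] * 45 + ["product_question"] * 20 +
             ["faq"] * 15 + ["order_tracking"] * 12 +
             ["chitchat"] * 5 + ["escalation"] * 3)

counts = Counter(this_week)
total = len(this_week)
# Flag any intent whose share moved more than 10 points from baseline.
drifted = {intent: counts[intent] / total - share
           for intent, share in baseline.items()
           if abs(counts[intent] / total - share) > 0.10}
print(drifted)  # recommendation share jumped from 30% to 45%
```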
6. Sampling and Statistical Significance
The project explicitly depends on sampling logic in:
- Golden datasets
- Human audits
- Canary traffic
- Weekly quality review samples
- A/B testing
The repo also explicitly mentions automatic statistical significance detection for prompt and model changes. That means the project is not just reporting point estimates; it is comparing whether observed differences are large enough to likely reflect a real effect rather than sampling noise.
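A significance check on two conversion rates can be sketched with a hand-rolled two-proportion z-test; production code would more likely call `statsmodels.stats.proportion.proportions_ztest`. The counts below are hypothetical A/B data:

```python
import math

# Two-proportion z-test: is the treatment conversion rate significantly
# different from control, or could the gap be sampling noise?
def two_proportion_z(success_a, n_a, success_b, n_b):
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# control: 500/10,000 converted; treatment: 600/10,000 converted
z, p = two_proportion_z(success_a=500, n_a=10_000, success_b=600, n_b=10_000)
print(round(z, 2), round(p, 4))  # p < 0.05: likely a real lift
```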
7. Quantiles / Tail Probability Thinking
The project uses percentile latency rather than average latency alone:
- P50 latency
- P99 latency
This is important because user experience is often dominated by tail behavior, not mean behavior. In practical terms, the project asks, "How bad are the worst slow requests?" rather than only "What is the average?"
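A nearest-rank percentile over a batch of latency samples shows why the tail matters; production systems usually read these from streaming histograms (for example, in Prometheus) rather than sorting raw samples. The latency values are hypothetical:

```python
import math

# Nearest-rank percentile: the smallest value covering pct percent of samples.
def percentile(samples, pct):
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

latencies = [120, 130, 125, 140, 135, 150, 128, 132, 2400, 138]  # one slow outlier (ms)
print("mean:", sum(latencies) / len(latencies))  # 359.8, inflated by one request
print("P50:", percentile(latencies, 50))         # 132: typical experience
print("P99:", percentile(latencies, 99))         # 2400: the tail the user feels
```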
8. Bayes' Theorem
Bayes' theorem describes how to update a probability estimate given new evidence.
$$ P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)} $$
Project relevance:
- Intent classification models are implicitly Bayesian: they estimate $P(\text{intent} \mid \text{query})$ using learned likelihoods and prior class frequencies.
- Guardrail tuning involves trading off prior assumptions about safety risk against observed evidence in the user query.
- Canary analysis updates the team's belief about whether a new model is better, given observed metrics.
This is rarely computed by hand in production, but the reasoning pattern — "update your belief given evidence" — drives model selection, threshold tuning, and experiment interpretation.
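A worked Bayes update with hypothetical guardrail numbers makes the reasoning concrete: even with a sensitive guardrail, a low prior risk rate means most flags are false positives.

```python
# Bayes update: P(risky | flagged) from a prior and two likelihoods.
# All three input probabilities are illustrative assumptions.
p_risk = 0.01             # prior: P(query is genuinely risky)
p_flag_given_risk = 0.95  # guardrail sensitivity
p_flag_given_safe = 0.05  # guardrail false positive rate

# total probability of a flag
p_flag = p_flag_given_risk * p_risk + p_flag_given_safe * (1 - p_risk)
p_risk_given_flag = p_flag_given_risk * p_risk / p_flag
print(round(p_risk_given_flag, 3))  # only ~16% of flags are truly risky
```

The base-rate effect shown here is exactly the trade-off guardrail tuning has to navigate: lowering the flag threshold catches more real risk but floods review queues with false positives.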
9. Hypothesis Testing Framework
Hypothesis testing is the formal structure behind A/B testing, canary checks, and significance decisions in this project.
Core concepts:
- Null hypothesis ($H_0$): There is no difference between treatment and control, or between canary and baseline.
- Alternative hypothesis ($H_1$): There is a real difference.
- Type I error (false positive): Concluding there is a difference when there is not. Controlled by the significance level $\alpha$, typically 0.05.
- Type II error (false negative): Failing to detect a real difference. Related to statistical power.
- p-value: The probability of observing results at least as extreme as the data, assuming $H_0$ is true. A small p-value is evidence against $H_0$.
- Statistical power ($1 - \beta$): The probability of correctly detecting a real effect. Depends on sample size, effect size, and $\alpha$.
Project relevance:
- A/B tests for conversion lift require choosing $\alpha$ and computing required sample size for adequate power.
- Canary rollouts check whether the new model's thumbs-down rate is significantly worse than baseline.
- Prompt version comparisons need significance checks to avoid shipping changes based on noise.
- The project explicitly mentions automatic significance detection, which means this framework is built into the deployment pipeline.
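The sample-size side of this can be sketched with the standard approximate formula for a two-proportion test; serious planning would typically use `statsmodels.stats.power.NormalIndPower`. The baseline rate and target lift below are hypothetical:

```python
import math

# Approximate per-arm sample size to detect a shift from p1 to p2
# at two-sided alpha = 0.05 with 80% power.
def sample_size_per_arm(p1, p2):
    z_alpha = 1.96    # two-sided alpha = 0.05
    z_beta = 0.8416   # power = 0.80
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# detecting a conversion lift from 5% to 6% needs thousands of sessions per arm
print(sample_size_per_arm(0.05, 0.06))
```

The takeaway: small absolute lifts on low base rates require large traffic, which is why low-volume canaries cannot settle small differences on their own.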
10. Confidence Intervals
A confidence interval provides a range of plausible values for an unknown parameter, not just a single point estimate.
For a proportion $\hat{p}$ with sample size $n$, a common approximate 95% confidence interval is:
$$ \hat{p} \pm 1.96 \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} $$
Project relevance:
- Conversion rate, escalation rate, and hallucination rate are all point estimates. Reporting them without confidence intervals hides how uncertain the estimate is.
- Small sample sizes (such as weekly human audits of 50 responses) produce wide intervals, which limits what conclusions can be drawn.
- Canary decisions need to account for interval overlap: if the confidence intervals for canary and baseline overlap heavily, the difference may not be meaningful.
- Dashboards that show only a single number without an interval can mislead stakeholders into over-reacting to noise.
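The Wald interval above is simple to compute, and doing so for a small audit sample shows how wide the uncertainty really is. The audit numbers below are illustrative:

```python
import math

# Approximate 95% Wald confidence interval for a proportion.
def wald_ci(successes, n, z=1.96):
    p_hat = successes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# e.g. 4 hallucinations found in a 50-response weekly audit
low, high = wald_ci(successes=4, n=50)
print(f"{low:.3f} to {high:.3f}")  # roughly 0.5% to 15.5%: a very wide range
```

With n = 50 the point estimate of 8% is compatible with anything from under 1% to over 15%, which is why small audits support directional monitoring but not precise claims.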
11. Central Limit Theorem
The Central Limit Theorem (CLT) states that the sampling distribution of the mean of a sufficiently large number of independent observations approaches a normal distribution, regardless of the shape of the underlying data distribution.
Project relevance:
- This is the reason the project can use normal-approximation-based significance tests for metrics like conversion rate, even though individual sessions are binary (convert or not).
- It justifies treating average order value lift and average latency comparisons as approximately normal when the sample size is large, even though the raw data is skewed.
- It is the statistical basis behind the approximate normal distribution discussion in section 3.7 below.
- In practice, the CLT is why large-traffic A/B tests converge to reliable conclusions faster than small ones.
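A quick simulation with hypothetical Bernoulli conversions illustrates the CLT in action: the spread of the sample mean shrinks like $1/\sqrt{n}$ as traffic grows.

```python
import random
import statistics

# Each session converts with probability 0.05 (a Bernoulli outcome),
# yet the sample mean across sessions concentrates as n grows.
random.seed(42)

def sample_mean(n, p=0.05):
    return sum(random.random() < p for _ in range(n)) / n

means_small = [sample_mean(100) for _ in range(300)]    # 100 sessions per sample
means_large = [sample_mean(2500) for _ in range(300)]   # 2500 sessions per sample

print(statistics.stdev(means_small))  # wider spread
print(statistics.stdev(means_large))  # roughly 5x narrower (sqrt(2500/100) = 5)
```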
2. Levels of Measurement Used in This Project
The project uses all four common levels of measurement.
1. Nominal
Nominal variables are categories with no numeric order.
Project examples:
- Intent label: `recommendation`, `faq`, `order_tracking`
- User persona: new visitor, enthusiast, comparison shopper, support seeker
- Language/edition: English, Japanese, deluxe, standard
- Guardrail outcome: pass, blocked
- Escalation path or issue type
Why it matters:
- Used for confusion matrices, class distribution, routing analysis, and segmentation.
2. Ordinal
Ordinal variables have an order, but the distance between values is not guaranteed to be equal.
Project examples:
- CSAT score on a 1-5 scale
- Severity levels such as low, medium, high
- Relevance judgments from human raters
- Priority classes in operational review
Why it matters:
- You can rank results, but treating the gap between 1 and 2 as identical to the gap between 4 and 5 can be misleading.
3. Interval
Interval variables have meaningful differences, but no true zero.
Project-adjacent examples:
- Calendar dates used for release schedules or weekly trend comparisons
- Time-of-day or timestamp-based comparisons when looking at shifts over windows
Why it matters:
- Differences are meaningful, but ratios are not. For example, one date is not "twice" another date.
This project uses interval-like time indexing in trend and drift analysis, though most operational metrics themselves are ratio-scale.
4. Ratio
Ratio variables have equal intervals and a true zero. This is the dominant scale in the project.
Project examples:
- Latency in milliseconds
- Session duration
- Revenue per chat session
- Token count
- Cost per session
- Message count
- Error count
- Recommendation clicks
- Add-to-cart count
- Response length
Why it matters:
- Ratios are meaningful. A 2-second response really is twice as long as a 1-second response.
- This is why percent lift, cost ratios, and latency reduction calculations are valid for these metrics.
3. Distributions Used in This Project
The repo does not always name the distributions formally, but several are clearly implied by the measurement design.
1. Bernoulli Distribution
Used when an outcome is binary for a single observation.
Project examples:
- Did the session convert: yes/no
- Was the response escalated: yes/no
- Was the answer hallucinated: yes/no
- Did the user click a recommendation: yes/no
- Did the response receive thumbs down: yes/no
This is the atomic unit behind many KPI rates.
2. Binomial Distribution
Used when counting the number of successes out of many Bernoulli trials.
Project examples:
- Number of converted sessions out of all chat sessions
- Number of escalations out of all sessions
- Number of blocked outputs out of all generations
- Number of positive ratings out of all rated responses
This is the natural distribution behind conversion-rate comparisons and canary pass/fail counts.
3. Categorical / Multinomial Distribution
Used when each observation falls into one of several categories.
Project examples:
- Intent distribution across intent classes
- Persona mix across user types
- Root-cause classification of escalations
- Query type mix during events like holidays or Prime Day
This distribution is fundamental to drift detection in the classifier and to the dashboard view of intent shares.
4. Poisson-Like Count Distribution
Used for event counts over time windows, especially operational monitoring.
Project examples:
- Messages per second
- Error counts per minute
- Circuit breaker trips per day
- Guardrail blocks per hour
- Escalations per hour during peak traffic
In practice, high-scale systems often monitor these as counts per unit time even if the real process is only approximately Poisson.
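Under that approximation, an alert threshold can be derived from the Poisson tail: pick the smallest count exceeded less than some small fraction of the time during normal operation. The error rate and 0.1% false-alarm budget below are illustrative:

```python
import math

# P(X >= k) for X ~ Poisson(lam), computed from the complement of the CDF.
def poisson_tail(k, lam):
    return 1 - sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))

lam = 3.0  # normal operation: 3 errors per minute on average
# smallest per-minute count whose tail probability falls below 0.1%
threshold = next(k for k in range(100) if poisson_tail(k, lam) < 0.001)
print(threshold)  # alert only when errors/min reach this count
```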
5. Long-Tailed Latency Distribution
Latency is almost never symmetric in distributed systems. It is usually right-skewed, with rare slow requests stretching the tail.
Project evidence:
- The repo tracks P50 and P99 latency, not just average latency.
- The architecture discusses throttling, cold starts, queueing, and downstream fan-out.
This strongly implies a heavy-tailed or log-normal-like latency distribution.
Why it matters:
- Mean latency can look healthy while the worst user experience is still bad.
- Tail metrics are the right choice for this system.
6. Response Length Distribution
The project explicitly monitors response token count and alerts on shifts from baseline.
Project examples:
- Average response length drift
- Response length inflation after model updates
This is a distribution-over-time problem, not just a single average. Monitoring the full spread helps catch subtle behavioral drift.
7. Approximate Normal Distribution for Aggregate Means and Lift Estimates
The repo talks about A/B testing, significance, and lift measurement. In practice, once sample sizes are large enough, aggregate averages and rate estimates are often analyzed using normal approximations.
Project examples:
- Average order value lift
- Difference in conversion rate between treatment and control
- Thumbs-down rate comparison in canary vs. baseline
Important note:
- The raw session-level data is not necessarily normal.
- The sampling distribution of the estimated mean or proportion is often treated as approximately normal when traffic is large.
That is the statistical basis for many rollout and experimentation decisions.
4. Recommended Mapping of Metrics to Statistics Concepts
| Project Metric | Measurement Level | Core Probability View | Likely Distribution | Common Libraries |
|---|---|---|---|---|
| Conversion rate | Ratio at aggregate level; binary at event level | Conditional probability | Bernoulli/Binomial | scipy.stats, statsmodels |
| Escalation rate | Ratio at aggregate level; binary at event level | Conditional probability | Bernoulli/Binomial | scipy.stats, statsmodels |
| Hallucination rate | Ratio at aggregate level; binary at event level | Conditional probability | Bernoulli/Binomial | ragas, deepeval, scipy.stats |
| Intent class | Nominal | Class probability | Categorical/Multinomial | scikit-learn, transformers |
| Intent distribution | Ratio proportions across classes | Probability mass across labels | Multinomial | evidently, whylogs, pandas |
| CSAT | Ordinal | Satisfaction probability by score band | Ordinal categorical | pandas, numpy |
| Latency | Ratio | Tail probability and quantiles | Right-skewed / long-tailed | Prometheus, Grafana, OpenTelemetry |
| Revenue per chat session | Ratio | Expected value over sessions | Often skewed continuous | pandas, numpy, statsmodels |
| Token count | Ratio | Expected value and drift monitoring | Count distribution, often right-skewed | tokenizer APIs, pandas |
| Error count per minute | Ratio count | Event probability over time | Poisson-like | Prometheus, statsmodels |
5. Practical Summary
If someone asks, "What statistics are actually used in this project?" the short answer is:
- Probability rules
  - Event probability, conditional probability, joint events for attribution, complement rule, class-probability monitoring, and significance testing.
- Levels of measurement
  - Nominal for intents and categories, ordinal for ratings and severity, interval for calendar/time indexing, and ratio for most operational and business metrics.
- Distributions
  - Bernoulli/binomial for success rates, multinomial for intent mix, Poisson-like for event counts, and right-skewed long-tail distributions for latency and token length.
6. Most Important Takeaway
This project is measured like a production AI product, not like a demo.
That means the statistical backbone is mostly about:
- rates
- class distributions
- tail behavior
- drift over time
- causal comparison through experiments
Those are the concepts doing the real work across MangaAssist.