Statistics Foundations for MangaAssist
This folder explains the core statistics concepts that are already being used, either explicitly or implicitly, across the MangaAssist project documents.
The goal is not to turn the project into a pure statistics exercise. The goal is to make the measurement model behind evaluation, monitoring, and experimentation explicit.
Reading Order
- Start here with this README to understand the statistics concepts and how they connect to the project.
- Then read 01-ml-ai-libraries-and-statistics-foundations.md to see how real ML and AI libraries use these same concepts.
Where Statistics Shows Up in This Project
Statistics is used in three main places:
- Experimentation and attribution
  - A/B testing for chatbot impact on conversion, AOV, and support reduction.
  - Holdout groups and attribution windows for causal measurement.
- Model and product evaluation
  - Intent accuracy, hallucination rate, Recall@3, thumbs-down rate, and guardrail false positive rate.
  - Canary thresholds and statistical significance checks during rollout.
- Operational monitoring
  - P50 and P99 latency, intent distribution drift, response length drift, and error-rate tracking.
Relevant project docs:
- 13-metrics.md
- Challenges/real-world-challenges.md
- Model-Inference/06-model-evaluation-framework.md
- 01-ml-ai-libraries-and-statistics-foundations.md
1. Probability Rules Used in This Project
These are the probability ideas the project relies on most.
1. Event Probability
Many project metrics are probabilities in business form:
- Conversion rate = $P(\text{purchase} \mid \text{chat session})$
- Escalation rate = $P(\text{human handoff} \mid \text{chat session})$
- Hallucination rate = $P(\text{factually wrong response} \mid \text{response})$
- Thumbs-up rate = $P(\text{positive feedback} \mid \text{feedback shown})$
This is the basic rule behind almost every success-rate metric in 13-metrics.md.
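As a minimal sketch of how these rates are computed, each one is just successes over trials within the conditioning set. The session records below are hypothetical, not project data:

```python
# Each success-rate metric is a conditional probability estimated as
# successes / trials over the conditioning set (here: chat sessions).
sessions = [
    {"chatted": True, "purchased": True},
    {"chatted": True, "purchased": False},
    {"chatted": True, "purchased": True},
    {"chatted": False, "purchased": False},  # no chat: excluded from the rate
]

chat_sessions = [s for s in sessions if s["chatted"]]
conversion_rate = sum(s["purchased"] for s in chat_sessions) / len(chat_sessions)
print(conversion_rate)  # P(purchase | chat session) = 2/3
```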
2. Conditional Probability
Conditional probability is central to the project because outcomes are measured within a context, not in isolation.
Examples:
- $P(\text{purchase} \mid \text{user saw chatbot})$
- $P(\text{purchase within 24h} \mid \text{ASIN mentioned in chat})$
- $P(\text{thumbs down} \mid \text{new canary model})$
- $P(\text{recommendation intent} \mid \text{holiday traffic})$
This is directly reflected in:
- A/B testing and holdout measurement in Challenges/real-world-challenges.md
- Canary comparisons in Model-Inference/06-model-evaluation-framework.md
- Intent-shift monitoring in Challenges/real-world-challenges.md
3. Joint Probability / Intersection of Events
Attribution in this project uses multiple conditions at once.
Example:
- A purchase is attributed only if it happened within 24 hours and the purchased ASIN was mentioned or recommended in the conversation.
That is effectively the intersection:
$$ P(\text{Attributed Purchase}) = P(\text{Purchase within 24h} \cap \text{ASIN mentioned}) $$
This matters because it prevents inflated attribution.
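The attribution rule can be sketched as a conjunction of both conditions. The function name, ASINs, and timestamps below are illustrative, not taken from the project code:

```python
from datetime import datetime, timedelta

# Hypothetical attribution check: a purchase counts only if it occurred
# within 24 hours of the chat AND the purchased ASIN was mentioned in
# the conversation -- the intersection of both events.
def is_attributed(chat_end, purchase_time, purchased_asin, mentioned_asins):
    within_24h = timedelta(0) <= purchase_time - chat_end <= timedelta(hours=24)
    asin_mentioned = purchased_asin in mentioned_asins
    return within_24h and asin_mentioned

chat_end = datetime(2024, 6, 1, 12, 0)
print(is_attributed(chat_end, datetime(2024, 6, 1, 20, 0), "B00X", {"B00X", "B00Y"}))  # True
print(is_attributed(chat_end, datetime(2024, 6, 3, 9, 0), "B00X", {"B00X"}))           # False: outside window
```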
4. Complement Rule
Several metrics are complements of each other.
Examples:
- Support cost deflection is closely related to $1 - P(\text{escalation})$
- Availability-related reliability can be framed as $1 - P(\text{failure})$
- Resolution failure rate is $1 - P(\text{resolution})$
The complement rule is heavily used in operational reporting and dashboard thinking.
5. Distribution of Outcomes Across Classes
Intent monitoring depends on class probabilities across categories such as:
`recommendation`, `product_question`, `faq`, `order_tracking`, `chitchat`, `escalation`
Those proportions form an intent distribution. Drift alerts trigger when the share of an intent changes materially over time.
This is described explicitly in 13-metrics.md and Challenges/real-world-challenges.md.
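A simple version of this drift check compares each intent's observed share against a baseline. The baseline shares and the 10-point alert threshold below are illustrative assumptions, not values from the project docs:

```python
from collections import Counter

# Hypothetical baseline intent distribution (shares sum to 1.0).
baseline = {"recommendation": 0.30, "product_question": 0.25, "faq": 0.20,
            "order_tracking": 0.15, "chitchat": 0.07, "escalation": 0.03}

# This week's classified queries (synthetic sample of 100).
this_week = (["recommendation"] * 45 + ["product_question"] * 20 +
             ["faq"] * 15 + ["order_tracking"] * 12 +
             ["chitchat"] * 5 + ["escalation"] * 3)

counts = Counter(this_week)
total = len(this_week)
# Flag any intent whose share moved more than 10 points from baseline.
drifted = {intent: counts[intent] / total - share
           for intent, share in baseline.items()
           if abs(counts[intent] / total - share) > 0.10}
print(drifted)  # recommendation share jumped from 30% to 45%
```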
6. Sampling and Statistical Significance
The project explicitly depends on sampling logic in:
- Golden datasets
- Human audits
- Canary traffic
- Weekly quality review samples
- A/B testing
The repo also explicitly mentions automatic statistical significance detection for prompt and model changes. That means the project is not just reporting point estimates; it is comparing whether observed differences are large enough to likely reflect a real effect rather than sampling noise.
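A significance check on two conversion rates can be sketched with a hand-rolled two-proportion z-test; production code would more likely call `statsmodels.stats.proportion.proportions_ztest`. The counts below are hypothetical A/B data:

```python
import math

# Two-proportion z-test: is the treatment conversion rate significantly
# different from control, or could the gap be sampling noise?
def two_proportion_z(success_a, n_a, success_b, n_b):
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# control: 500/10,000 converted; treatment: 600/10,000 converted
z, p = two_proportion_z(success_a=500, n_a=10_000, success_b=600, n_b=10_000)
print(round(z, 2), round(p, 4))  # p < 0.05: likely a real lift
```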
7. Quantiles / Tail Probability Thinking
The project uses percentile latency rather than average latency alone:
- P50 latency
- P99 latency
This is important because user experience is often dominated by tail behavior, not mean behavior. In practical terms, the project asks, "How bad are the worst slow requests?" rather than only "What is the average?"
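A nearest-rank percentile over a batch of latency samples shows why the tail matters; production systems usually read these from streaming histograms (for example, in Prometheus) rather than sorting raw samples. The latency values are hypothetical:

```python
import math

# Nearest-rank percentile: the smallest value covering pct percent of samples.
def percentile(samples, pct):
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

latencies = [120, 130, 125, 140, 135, 150, 128, 132, 2400, 138]  # one slow outlier (ms)
print("mean:", sum(latencies) / len(latencies))  # 359.8, inflated by one request
print("P50:", percentile(latencies, 50))         # 132: typical experience
print("P99:", percentile(latencies, 99))         # 2400: the tail the user feels
```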
8. Bayes' Theorem
Bayes' theorem describes how to update a probability estimate given new evidence.
$$ P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)} $$
Project relevance:
- Intent classification models are implicitly Bayesian: they estimate $P(\text{intent} \mid \text{query})$ using learned likelihoods and prior class frequencies.
- Guardrail tuning involves trading off prior assumptions about safety risk against observed evidence in the user query.
- Canary analysis updates the team's belief about whether a new model is better, given observed metrics.
This is rarely computed by hand in production, but the reasoning pattern — "update your belief given evidence" — drives model selection, threshold tuning, and experiment interpretation.
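A worked Bayes update with hypothetical guardrail numbers makes the reasoning concrete: even with a sensitive guardrail, a low prior risk rate means most flags are false positives.

```python
# Bayes update: P(risky | flagged) from a prior and two likelihoods.
# All three input probabilities are illustrative assumptions.
p_risk = 0.01             # prior: P(query is genuinely risky)
p_flag_given_risk = 0.95  # guardrail sensitivity
p_flag_given_safe = 0.05  # guardrail false positive rate

# total probability of a flag
p_flag = p_flag_given_risk * p_risk + p_flag_given_safe * (1 - p_risk)
p_risk_given_flag = p_flag_given_risk * p_risk / p_flag
print(round(p_risk_given_flag, 3))  # only ~16% of flags are truly risky
```

The base-rate effect shown here is exactly the trade-off guardrail tuning has to navigate: lowering the flag threshold catches more real risk but floods review queues with false positives.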
9. Hypothesis Testing Framework
Hypothesis testing is the formal structure behind A/B testing, canary checks, and significance decisions in this project.
Core concepts:
- Null hypothesis ($H_0$): There is no difference between treatment and control, or between canary and baseline.
- Alternative hypothesis ($H_1$): There is a real difference.
- Type I error (false positive): Concluding there is a difference when there is not. Controlled by the significance level $\alpha$, typically 0.05.
- Type II error (false negative): Failing to detect a real difference. Related to statistical power.
- p-value: The probability of observing results at least as extreme as the data, assuming $H_0$ is true. A small p-value is evidence against $H_0$.
- Statistical power ($1 - \beta$): The probability of correctly detecting a real effect. Depends on sample size, effect size, and $\alpha$.
Project relevance:
- A/B tests for conversion lift require choosing $\alpha$ and computing required sample size for adequate power.
- Canary rollouts check whether the new model's thumbs-down rate is significantly worse than baseline.
- Prompt version comparisons need significance checks to avoid shipping changes based on noise.
- The project explicitly mentions automatic significance detection, which means this framework is built into the deployment pipeline.
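The sample-size side of this can be sketched with the standard approximate formula for a two-proportion test; serious planning would typically use `statsmodels.stats.power.NormalIndPower`. The baseline rate and target lift below are hypothetical:

```python
import math

# Approximate per-arm sample size to detect a shift from p1 to p2
# at two-sided alpha = 0.05 with 80% power.
def sample_size_per_arm(p1, p2):
    z_alpha = 1.96    # two-sided alpha = 0.05
    z_beta = 0.8416   # power = 0.80
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# detecting a conversion lift from 5% to 6% needs thousands of sessions per arm
print(sample_size_per_arm(0.05, 0.06))
```

The takeaway: small absolute lifts on low base rates require large traffic, which is why low-volume canaries cannot settle small differences on their own.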
10. Confidence Intervals
A confidence interval provides a range of plausible values for an unknown parameter, not just a single point estimate.
For a proportion $\hat{p}$ with sample size $n$, a common approximate 95% confidence interval is:
$$ \hat{p} \pm 1.96 \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} $$
Project relevance:
- Conversion rate, escalation rate, and hallucination rate are all point estimates. Reporting them without confidence intervals hides how uncertain the estimate is.
- Small sample sizes (such as weekly human audits of 50 responses) produce wide intervals, which limits what conclusions can be drawn.
- Canary decisions need to account for interval overlap: if the confidence intervals for canary and baseline overlap heavily, the difference may not be meaningful.
- Dashboards that show only a single number without an interval can mislead stakeholders into over-reacting to noise.
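The Wald interval above is simple to compute, and doing so for a small audit sample shows how wide the uncertainty really is. The audit numbers below are illustrative:

```python
import math

# Approximate 95% Wald confidence interval for a proportion.
def wald_ci(successes, n, z=1.96):
    p_hat = successes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# e.g. 4 hallucinations found in a 50-response weekly audit
low, high = wald_ci(successes=4, n=50)
print(f"{low:.3f} to {high:.3f}")  # roughly 0.5% to 15.5%: a very wide range
```

With n = 50 the point estimate of 8% is compatible with anything from under 1% to over 15%, which is why small audits support directional monitoring but not precise claims.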
11. Central Limit Theorem
The Central Limit Theorem (CLT) states that the sampling distribution of the mean of a sufficiently large number of independent observations approaches a normal distribution, regardless of the shape of the underlying data distribution.
Project relevance:
- This is the reason the project can use normal-approximation-based significance tests for metrics like conversion rate, even though individual sessions are binary (convert or not).
- It justifies treating average order value lift and average latency comparisons as approximately normal when the sample size is large, even though the raw data is skewed.
- It is the statistical basis behind the approximate normal distribution discussion in section 3.7 below.
- In practice, the CLT is why large-traffic A/B tests converge to reliable conclusions faster than small ones.
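A quick simulation with hypothetical Bernoulli conversions illustrates the CLT in action: the spread of the sample mean shrinks like $1/\sqrt{n}$ as traffic grows.

```python
import random
import statistics

# Each session converts with probability 0.05 (a Bernoulli outcome),
# yet the sample mean across sessions concentrates as n grows.
random.seed(42)

def sample_mean(n, p=0.05):
    return sum(random.random() < p for _ in range(n)) / n

means_small = [sample_mean(100) for _ in range(300)]    # 100 sessions per sample
means_large = [sample_mean(2500) for _ in range(300)]   # 2500 sessions per sample

print(statistics.stdev(means_small))  # wider spread
print(statistics.stdev(means_large))  # roughly 5x narrower (sqrt(2500/100) = 5)
```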
2. Levels of Measurement Used in This Project
The project uses all four common levels of measurement.
1. Nominal
Nominal variables are categories with no numeric order.
Project examples:
- Intent label: `recommendation`, `faq`, `order_tracking`
- User persona: new visitor, enthusiast, comparison shopper, support seeker
- Language/edition: English, Japanese, deluxe, standard
- Guardrail outcome: pass, blocked
- Escalation path or issue type
Why it matters:
- Used for confusion matrices, class distribution, routing analysis, and segmentation.
2. Ordinal
Ordinal variables have an order, but the distance between values is not guaranteed to be equal.
Project examples:
- CSAT score on a 1-5 scale
- Severity levels such as low, medium, high
- Relevance judgments from human raters
- Priority classes in operational review
Why it matters:
- You can rank results, but treating the gap between 1 and 2 as identical to the gap between 4 and 5 can be misleading.
3. Interval
Interval variables have meaningful differences, but no true zero.
Project-adjacent examples:
- Calendar dates used for release schedules or weekly trend comparisons
- Time-of-day or timestamp-based comparisons when looking at shifts over windows
Why it matters:
- Differences are meaningful, but ratios are not. For example, one date is not "twice" another date.
This project uses interval-like time indexing in trend and drift analysis, though most operational metrics themselves are ratio-scale.
4. Ratio
Ratio variables have equal intervals and a true zero. This is the dominant scale in the project.
Project examples:
- Latency in milliseconds
- Session duration
- Revenue per chat session
- Token count
- Cost per session
- Message count
- Error count
- Recommendation clicks
- Add-to-cart count
- Response length
Why it matters:
- Ratios are meaningful. A 2-second response really is twice as long as a 1-second response.
- This is why percent lift, cost ratios, and latency reduction calculations are valid for these metrics.
3. Distributions Used in This Project
The repo does not always name the distributions formally, but several are clearly implied by the measurement design.
1. Bernoulli Distribution
Used when an outcome is binary for a single observation.
Project examples:
- Did the session convert: yes/no
- Was the response escalated: yes/no
- Was the answer hallucinated: yes/no
- Did the user click a recommendation: yes/no
- Did the response receive thumbs down: yes/no
This is the atomic unit behind many KPI rates.
2. Binomial Distribution
Used when counting the number of successes out of many Bernoulli trials.
Project examples:
- Number of converted sessions out of all chat sessions
- Number of escalations out of all sessions
- Number of blocked outputs out of all generations
- Number of positive ratings out of all rated responses
This is the natural distribution behind conversion-rate comparisons and canary pass/fail counts.
3. Categorical / Multinomial Distribution
Used when each observation falls into one of several categories.
Project examples:
- Intent distribution across intent classes
- Persona mix across user types
- Root-cause classification of escalations
- Query type mix during events like holidays or Prime Day
This distribution is fundamental to drift detection in the classifier and to the dashboard view of intent shares.
4. Poisson-Like Count Distribution
Used for event counts over time windows, especially operational monitoring.
Project examples:
- Messages per second
- Error counts per minute
- Circuit breaker trips per day
- Guardrail blocks per hour
- Escalations per hour during peak traffic
In practice, high-scale systems often monitor these as counts per unit time even if the real process is only approximately Poisson.
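Under that approximation, an alert threshold can be derived from the Poisson tail: pick the smallest count exceeded less than some small fraction of the time during normal operation. The error rate and 0.1% false-alarm budget below are illustrative:

```python
import math

# P(X >= k) for X ~ Poisson(lam), computed from the complement of the CDF.
def poisson_tail(k, lam):
    return 1 - sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))

lam = 3.0  # normal operation: 3 errors per minute on average
# smallest per-minute count whose tail probability falls below 0.1%
threshold = next(k for k in range(100) if poisson_tail(k, lam) < 0.001)
print(threshold)  # alert only when errors/min reach this count
```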
5. Long-Tailed Latency Distribution
Latency is almost never symmetric in distributed systems. It is usually right-skewed, with rare slow requests stretching the tail.
Project evidence:
- The repo tracks P50 and P99 latency, not just average latency.
- The architecture discusses throttling, cold starts, queueing, and downstream fan-out.
This strongly implies a heavy-tailed or log-normal-like latency distribution.
Why it matters:
- Mean latency can look healthy while the worst user experience is still bad.
- Tail metrics are the right choice for this system.
6. Response Length Distribution
The project explicitly monitors response token count and alerts on shifts from baseline.
Project examples:
- Average response length drift
- Response length inflation after model updates
This is a distribution-over-time problem, not just a single average. Monitoring the full spread helps catch subtle behavioral drift.
7. Approximate Normal Distribution for Aggregate Means and Lift Estimates
The repo talks about A/B testing, significance, and lift measurement. In practice, once sample sizes are large enough, aggregate averages and rate estimates are often analyzed using normal approximations.
Project examples:
- Average order value lift
- Difference in conversion rate between treatment and control
- Thumbs-down rate comparison in canary vs. baseline
Important note:
- The raw session-level data is not necessarily normal.
- The sampling distribution of the estimated mean or proportion is often treated as approximately normal when traffic is large.
That is the statistical basis for many rollout and experimentation decisions.
4. Recommended Mapping of Metrics to Statistics Concepts
| Project Metric | Measurement Level | Core Probability View | Likely Distribution | Common Libraries |
|---|---|---|---|---|
| Conversion rate | Ratio at aggregate level; binary at event level | Conditional probability | Bernoulli/Binomial | scipy.stats, statsmodels |
| Escalation rate | Ratio at aggregate level; binary at event level | Conditional probability | Bernoulli/Binomial | scipy.stats, statsmodels |
| Hallucination rate | Ratio at aggregate level; binary at event level | Conditional probability | Bernoulli/Binomial | ragas, deepeval, scipy.stats |
| Intent class | Nominal | Class probability | Categorical/Multinomial | scikit-learn, transformers |
| Intent distribution | Ratio proportions across classes | Probability mass across labels | Multinomial | evidently, whylogs, pandas |
| CSAT | Ordinal | Satisfaction probability by score band | Ordinal categorical | pandas, numpy |
| Latency | Ratio | Tail probability and quantiles | Right-skewed / long-tailed | Prometheus, Grafana, OpenTelemetry |
| Revenue per chat session | Ratio | Expected value over sessions | Often skewed continuous | pandas, numpy, statsmodels |
| Token count | Ratio | Expected value and drift monitoring | Count distribution, often right-skewed | tokenizer APIs, pandas |
| Error count per minute | Ratio count | Event probability over time | Poisson-like | Prometheus, statsmodels |
5. Practical Summary
If someone asks, "What statistics are actually used in this project?" the short answer is:
- Probability rules
  - Event probability, conditional probability, joint events for attribution, complement rule, class-probability monitoring, and significance testing.
- Levels of measurement
  - Nominal for intents and categories, ordinal for ratings and severity, interval for calendar/time indexing, and ratio for most operational and business metrics.
- Distributions
  - Bernoulli/binomial for success rates, multinomial for intent mix, Poisson-like for event counts, and right-skewed long-tail distributions for latency and token length.
6. Most Important Takeaway
This project is measured like a production AI product, not like a demo.
That means the statistical backbone is mostly about:
- rates
- class distributions
- tail behavior
- drift over time
- causal comparison through experiments
Those are the concepts doing the real work across MangaAssist.