How Real-World ML and AI Libraries Use These Statistics Foundations
This document connects the statistics foundations in this folder to the machine learning and AI libraries that use them in practice.
The goal is to show that these ideas are not abstract theory. They are the same concepts used by real training libraries, evaluation libraries, experiment-analysis tools, and production monitoring stacks.
1. Two Places Statistics Appears in Real AI Systems
Statistics shows up in two layers of a real ML or AI system.
1. Inside the model
This is the part handled by training and inference libraries such as:
`scikit-learn`, `PyTorch`, `TensorFlow`, `XGBoost`, `LightGBM`, `CatBoost`, and `transformers`.
At this layer, statistics appears as:
- class probabilities
- logits and probability calibration
- loss functions such as cross-entropy
- confidence scores
- ranking scores
- sampling from token distributions
2. Around the model
This is the part handled by evaluation, experimentation, and monitoring libraries such as:
`pandas`, `numpy`, `scipy`, `statsmodels`, `mlflow`, `evidently`, `whylogs`, and `prometheus_client`, along with OpenTelemetry-based observability stacks and LLM evaluation tools such as `ragas`, `deepeval`, and `promptfoo`.
At this layer, statistics appears as:
- A/B testing
- canary analysis
- confidence intervals
- lift measurement
- drift detection
- latency percentiles
- error-rate tracking
- response-quality scoring
The statistics foundations in this folder are mainly about the second layer, but many of the same ideas also drive the first layer.
2. Probability Rules and the Libraries That Use Them
1. Event probability
Examples from this project include metrics such as conversion rate, escalation rate, hallucination rate, and thumbs-up rate.
How real libraries use it:
- `scikit-learn` reports class probabilities through `predict_proba()`. Calibration is checked with `sklearn.calibration.calibration_curve()` and `sklearn.calibration.CalibratedClassifierCV`.
- `PyTorch` and `TensorFlow` typically produce logits that are converted to probabilities using `torch.sigmoid()` / `torch.softmax()` or `tf.nn.sigmoid()` / `tf.nn.softmax()`.
- `transformers` uses token probability distributions during text generation. Token log-probabilities are accessible via `model.generate(output_scores=True)` or through the `logits` field of model output.
- `statsmodels` and `scipy` are used to estimate and compare these rates after deployment. Key functions include `scipy.stats.binomtest()` for single-proportion tests and `statsmodels.stats.proportion.proportions_ztest()` for comparing two rates. A sketch follows at the end of this subsection.
Real-world interpretation:
- A binary business metric such as "did the user convert" is a Bernoulli event.
- A model score such as "probability this query is order tracking" is also a probability estimate.
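A minimal sketch of estimating a single event rate and its uncertainty with these functions; the counts are made up for illustration:

```python
from scipy.stats import binomtest
from statsmodels.stats.proportion import proportion_confint

# Hypothetical audit: 37 escalations observed in 1,000 chatbot sessions.
escalations, sessions = 37, 1000

# Exact test against a hypothesized 5% escalation rate.
result = binomtest(escalations, sessions, p=0.05)
print(f"observed rate = {escalations / sessions:.3f}, p-value = {result.pvalue:.4f}")

# Wilson confidence interval for the same rate.
low, high = proportion_confint(escalations, sessions, alpha=0.05, method="wilson")
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```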
2. Conditional probability
Examples from this project include:
- $P(\text{purchase} \mid \text{user saw chatbot})$
- $P(\text{thumbs down} \mid \text{new canary model})$
- $P(\text{recommendation intent} \mid \text{holiday traffic})$
How real libraries use it:
- `scikit-learn`, `XGBoost`, `LightGBM`, and `CatBoost` all learn predictions conditioned on input features. Evaluation is segmented using `sklearn.metrics.classification_report(y_true, y_pred, target_names=...)` or by filtering `pandas` DataFrames before computing metrics.
- Recommender systems condition scores on user, item, and context features.
- LLM evaluation frameworks condition outcome analysis on prompt version, retrieval strategy, user segment, or model version. Tools like `mlflow.evaluate()` and `ragas.evaluate()` can be sliced by metadata fields.
- `pandas` and SQL-based analytics are commonly used to segment these probabilities by cohort using `groupby()` operations, as in the sketch at the end of this subsection.
Real-world interpretation:
- Most deployed AI metrics only make sense inside a condition or segment.
- This is why dashboards are often split by model version, traffic cohort, language, geography, or seasonality.
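A small `pandas` sketch of conditional rates by segment, using a made-up session log. The mean of a 0/1 outcome column within a group is exactly the conditional probability estimate for that segment:

```python
import pandas as pd

# Hypothetical session log: each row is one chatbot session.
df = pd.DataFrame({
    "model_version": ["baseline", "baseline", "canary", "canary", "canary"],
    "thumbs_down":   [0, 1, 0, 0, 1],
})

# P(thumbs down | model version): the per-group mean of a 0/1 column
# is the conditional rate for that segment.
rates = df.groupby("model_version")["thumbs_down"].mean()
print(rates)
```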
3. Joint probability and attribution rules
Examples from this project include attributed purchases that depend on more than one condition being true.
How real libraries and tooling use it:
- `pandas`, `PySpark`, and SQL pipelines are used to compute labels from multiple event conditions.
- Feature-engineering code often creates joint indicators such as "clicked and purchased within 24h".
- attribution and experimentation systems use these intersections to avoid inflated success metrics.
Real-world interpretation:
- The statistic is often implemented as a filtered label, attribution rule, or composite metric.
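A sketch of such a joint indicator built in `pandas`, with hypothetical columns (`clicked`, `hours_to_purchase`) standing in for real event data:

```python
import pandas as pd

# Hypothetical event log with one row per session.
df = pd.DataFrame({
    "clicked":           [True, True, False, True],
    "hours_to_purchase": [3.0, 40.0, 2.0, 10.0],
})

# Joint indicator: clicked AND purchased within 24 hours.
# The mean of the intersection is P(clicked AND purchased < 24h),
# which can never exceed either marginal probability alone.
df["attributed"] = df["clicked"] & (df["hours_to_purchase"] < 24)
print(df["attributed"].mean())
```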
4. Complement rule
Examples from this project include support deflection and failure-rate framing.
How real libraries use it:
- This usually appears in metric definitions built in `pandas`, `numpy`, BI tools, or monitoring dashboards.
- Teams often compute one metric directly and derive its complement for reporting.
Real-world interpretation:
- Reliability dashboards often present success rate and failure rate as complements.
- Support deflection and escalation rate are often viewed the same way.
3. Measurement Levels and the Libraries That Work With Them
1. Nominal variables
Project examples include intent labels, personas, language, issue types, and guardrail outcomes.
Libraries that commonly use them:
- `scikit-learn` for classification models and confusion matrices via `sklearn.metrics.confusion_matrix()` and `sklearn.metrics.classification_report()`
- `transformers` for sequence classification and routing tasks via `pipeline("text-classification", ...)`
- `pandas` for grouped analysis and class distributions via `value_counts()` and `crosstab()`
- `evidently` and `whylogs` for category-distribution monitoring via dataset drift reports
Typical operations:
- label encoding
- confusion matrix analysis
- class distribution drift
- routing quality analysis
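A minimal example of the nominal-variable workflow using `scikit-learn`, with a made-up audit set of intent labels:

```python
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical intent labels from a small audit set.
y_true = ["order", "refund", "order", "other", "refund", "order"]
y_pred = ["order", "order",  "order", "other", "refund", "refund"]

labels = ["order", "refund", "other"]
print(confusion_matrix(y_true, y_pred, labels=labels))
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
```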
2. Ordinal variables
Project examples include CSAT score, severity bands, and human judgment scales.
Libraries that commonly use them:
- `pandas` and `numpy` for score aggregation
- `scikit-learn` and `statsmodels` when teams model ranked or bucketed outcomes
- evaluation pipelines that treat rubric scores as ordered labels
Typical operations:
- score-band analysis
- threshold-based evaluation
- comparison of rating distributions across model versions
3. Interval variables
Project examples include timestamps, week-over-week comparisons, and trend windows.
Libraries that commonly use them:
- `pandas` time-series indexing
- `numpy` date/time transformations
- observability platforms and BI systems for trend slicing
Typical operations:
- time-window aggregation
- release-over-release comparison
- drift analysis over calendar periods
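A short `pandas` sketch of time-window aggregation over made-up daily error counts:

```python
import numpy as np
import pandas as pd

# Hypothetical daily error counts indexed by calendar date.
idx = pd.date_range("2024-01-01", periods=28, freq="D")
daily = pd.Series(np.random.default_rng(0).poisson(lam=5, size=28), index=idx)

# Interval-scale timestamps support window aggregation and
# period-over-period differences, not ratios of the timestamps themselves.
weekly = daily.resample("W").sum()
print(weekly)
print(weekly.diff())  # week-over-week change
```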
4. Ratio variables
Project examples include latency, token count, revenue per session, message count, and error count.
Libraries that commonly use them:
- `numpy` and `pandas` for numeric aggregation
- `scipy` and `statsmodels` for mean comparison and interval estimation
- Prometheus and Grafana for operational metrics
Typical operations:
- mean and percentile tracking
- lift analysis
- cost and efficiency analysis
- throughput and latency reporting
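A sketch of ratio-variable lift analysis with `numpy`, using simulated revenue-per-session samples; the distributions and numbers are illustrative only:

```python
import numpy as np

# Hypothetical revenue-per-session samples for control and treatment.
rng = np.random.default_rng(7)
control = rng.exponential(scale=20.0, size=5000)    # mean near $20
treatment = rng.exponential(scale=21.0, size=5000)  # mean near $21

# Ratio data supports means, differences, and relative lift.
lift = treatment.mean() / control.mean() - 1
print(f"control mean = {control.mean():.2f}, "
      f"treatment mean = {treatment.mean():.2f}, lift = {lift:+.1%}")
```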
4. Distributions and Their Direct Library Correlations
1. Bernoulli distribution
Used for a single yes/no outcome.
Real AI and ML use cases:
- click or no click
- purchase or no purchase
- thumbs-down or no thumbs-down
- blocked or not blocked
- hallucinated or not hallucinated in a labeled audit set
Common libraries:
- `scikit-learn`: `sklearn.metrics.accuracy_score()`, `sklearn.metrics.precision_recall_fscore_support()`, `sklearn.metrics.roc_auc_score()`
- `statsmodels`: `statsmodels.stats.proportion.proportions_ztest()`, `statsmodels.stats.proportion.proportion_confint()`
- `scipy`: `scipy.stats.binomtest()`, `scipy.stats.bernoulli`
Why it matters:
- Many product KPIs begin as Bernoulli events at the single-session or single-response level.
2. Binomial distribution
Used when counting the number of successes out of many Bernoulli observations.
Real AI and ML use cases:
- number of escalations out of all sessions
- number of positive ratings out of all ratings
- number of policy blocks out of all generated responses
Common libraries:
- `scipy.stats`: `scipy.stats.binom`, `scipy.stats.binom_test()` (legacy) or `scipy.stats.binomtest()`
- `statsmodels`: `statsmodels.stats.proportion.proportions_ztest()` for two-sample proportion comparison, `statsmodels.stats.proportion.proportion_confint()` for confidence intervals
- experiment-analysis notebooks using `pandas` and `numpy` for computing rates and lifts directly
Why it matters:
- Rate comparisons in A/B tests and canaries often rely on binomial reasoning or normal approximations of binomial outcomes.
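A minimal two-proportion comparison with `statsmodels`, using invented thumbs-down counts for a baseline and a canary arm:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical canary comparison: thumbs-downs out of rated responses.
counts = np.array([120, 150])   # thumbs-downs: baseline, canary
nobs = np.array([4000, 4000])   # rated responses per arm

stat, pvalue = proportions_ztest(counts, nobs)
print(f"baseline rate = {counts[0]/nobs[0]:.3%}, "
      f"canary rate = {counts[1]/nobs[1]:.3%}")
print(f"z = {stat:.2f}, p = {pvalue:.4f}")
```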
3. Categorical or multinomial distribution
Used when each example belongs to one of several classes.
Real AI and ML use cases:
- intent classification
- sentiment categories
- moderation labels
- routing outcomes
- root-cause classification
Common libraries:
- `scikit-learn`: `sklearn.metrics.confusion_matrix()`, `sklearn.metrics.classification_report()`, `sklearn.preprocessing.LabelEncoder()`
- `transformers`: `pipeline("text-classification", ...)`, `AutoModelForSequenceClassification`
- `PyTorch`: `torch.nn.CrossEntropyLoss()`
- `TensorFlow`: `tf.keras.losses.CategoricalCrossentropy()`, `tf.keras.losses.SparseCategoricalCrossentropy()`
Why it matters:
- This is the distribution behind class-probability vectors, confusion matrices, and intent drift monitoring.
4. Poisson-like count distributions
Used for counts over time windows.
Real AI and ML use cases:
- requests per minute
- errors per minute
- tool failures per hour
- escalation spikes during peak traffic
Common libraries and systems:
- `statsmodels` for Poisson-style regression or count modeling
- Prometheus and Grafana for count monitoring
- OpenTelemetry-based observability stacks
Why it matters:
- This is more common in production operations than in model training, but it is essential for AI system reliability.
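A small `scipy` sketch of Poisson tail reasoning for count alerts; the rate and observed count are hypothetical:

```python
from scipy import stats

# Hypothetical alerting check: errors arrive at about 2 per minute on average.
# Under a Poisson(2) model, how surprising is seeing 8 errors in one minute?
lam = 2.0
observed = 8

# P(X >= 8) via the survival function; a very small value suggests a real
# anomaly rather than normal fluctuation.
p_tail = stats.poisson.sf(observed - 1, mu=lam)
print(f"P(X >= {observed} | lambda={lam}) = {p_tail:.5f}")
```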
5. Right-skewed or long-tailed latency distributions
Used for response time and other tail-dominated operational metrics.
Real AI and ML use cases:
- model inference latency
- retrieval latency
- vector database latency
- tool-calling latency
- end-to-end chatbot response latency
Common libraries and systems:
- Prometheus histograms
- Grafana percentile dashboards
- OpenTelemetry tracing
- cloud monitoring systems
Why it matters:
- Teams monitor P50, P95, or P99 because user experience is dominated by slow outliers, not just by the mean.
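A quick `numpy` illustration of why percentiles matter for skewed latency: a simulated lognormal sample (a common stand-in for real latency data) has a mean noticeably above its median:

```python
import numpy as np

# Hypothetical right-skewed latency sample in milliseconds.
rng = np.random.default_rng(0)
latency_ms = rng.lognormal(mean=5.0, sigma=0.7, size=50_000)

p50, p95, p99 = np.percentile(latency_ms, [50, 95, 99])
print(f"mean = {latency_ms.mean():.0f} ms")  # pulled up by the tail
print(f"P50 = {p50:.0f} ms, P95 = {p95:.0f} ms, P99 = {p99:.0f} ms")
```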
6. Response-length or token-count distributions
Used for generation behavior and cost monitoring.
Real AI and ML use cases:
- output length inflation after prompt changes
- cost growth due to longer completions
- quality drift where answers become too short or too verbose
Common libraries and systems:
- tokenizer APIs from LLM SDKs
- `pandas` for distribution analysis
- monitoring systems for trend alerts
Why it matters:
- This helps catch changes in model behavior that a single average would hide.
7. Approximate normality of aggregate estimates
Used when comparing means, lifts, or large-sample rate estimates.
Real AI and ML use cases:
- conversion lift in A/B testing
- average order value lift
- thumbs-down rate comparison in canary versus baseline
- average evaluator score comparison across prompt versions
Common libraries:
- `scipy.stats`
- `statsmodels`
- `numpy`
- `pandas`
Why it matters:
- Even when the raw data is not normally distributed, the sampling distribution of aggregated estimates is often treated as approximately normal at sufficient scale.
5. Direct Mapping to Specific Library Categories
| Library Category | Example Libraries | Statistics Foundations They Rely On | How They Use Them |
|---|---|---|---|
| Classical ML | `scikit-learn`, `XGBoost`, `LightGBM`, `CatBoost` | probabilities, Bernoulli/binomial, categorical distributions, confusion matrices | classification, ranking, calibration, evaluation |
| Deep learning | `PyTorch`, `TensorFlow`, `Keras` | probability distributions, cross-entropy, sampling, class imbalance | training neural models, classification, generation |
| NLP and LLMs | `transformers`, LLM SDKs | token distributions, class probabilities, response-length distributions | text generation, intent classification, scoring |
| Experiment analysis | `scipy`, `statsmodels`, `numpy`, `pandas` | sampling, significance, confidence intervals, lift estimation | A/B tests, canaries, rollout decisions |
| LLM evaluation | `ragas`, `deepeval`, `promptfoo`, `mlflow` | sample-based scoring, binary and ordinal judgments, drift and regression tracking | evaluate factuality, relevance, answer quality, regressions |
| Monitoring and observability | `prometheus_client`, Grafana, OpenTelemetry, `evidently`, `whylogs` | quantiles, count distributions, class-distribution drift, tail behavior | latency dashboards, drift monitoring, alerting |
6. Practical Interpretation for This Project
In this project, the statistics foundations map cleanly to the libraries and systems that a production team would use:
- Intent accuracy and intent distribution map to classifier libraries and drift-monitoring tools.
- Hallucination rate and thumbs-down rate map to evaluation pipelines using audit labels, sampled review sets, and experiment analysis.
- P50 and P99 latency map to observability tooling rather than to model-training libraries.
- A/B testing, holdouts, and canary checks map to `scipy`, `statsmodels`, and `pandas`-based experiment workflows.
- Response-length drift and token-cost growth map to LLM telemetry, tokenizer usage, and time-series monitoring.
That means the statistics in this folder are directly connected to how real chatbot, search, recommendation, and LLM systems are built and operated.
7. Most Important Takeaway
These statistics foundations correlate with real-world ML and AI libraries in a very direct way:
- probability theory explains model scores and KPI rates
- measurement levels explain what kind of analysis is valid for a metric
- distributions explain how raw data behaves in production
- significance testing explains how teams make rollout decisions
- quantiles and drift analysis explain how teams keep AI systems healthy after launch
So the foundations in this folder are not separate from practical ML engineering. They are the measurement language behind real libraries, real dashboards, and real production decisions.
8. Bayes' Theorem, Hypothesis Testing, Confidence Intervals, and CLT in Libraries
These four concepts were added to the README as foundational additions. Here is how they map to real libraries.
1. Bayes' Theorem
Bayes' theorem is used implicitly by classifiers and explicitly in some modeling approaches.
Library correlations:
- `scikit-learn`: `sklearn.naive_bayes.MultinomialNB()` and `sklearn.naive_bayes.GaussianNB()` are direct implementations of Bayesian classification.
- `PyTorch` and `TensorFlow`: Bayesian reasoning appears in posterior estimation, uncertainty quantification, and probabilistic layers.
- `scipy.stats`: Bayesian updating can be done manually using `scipy.stats.beta` as a conjugate prior for binomial outcomes, as in the sketch below.
- In practice, most teams use Bayesian reasoning informally when interpreting canary results: "Given this observed thumbs-down rate, how confident am I that the new model is worse?"
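A minimal sketch of that beta-binomial update; the prior parameters and canary counts are invented for illustration:

```python
from scipy import stats

# Hypothetical canary read-out: 9 thumbs-downs in 300 rated responses,
# with a Beta(2, 98) prior reflecting a historical ~2% thumbs-down rate.
prior_a, prior_b = 2, 98
thumbs_down, rated = 9, 300

# Conjugate update: Beta prior + binomial data -> Beta posterior.
post = stats.beta(prior_a + thumbs_down, prior_b + (rated - thumbs_down))

# Posterior probability that the true rate exceeds the historical 2%.
print(f"posterior mean = {post.mean():.3%}")
print(f"P(rate > 2%)   = {post.sf(0.02):.2%}")
```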
2. Hypothesis Testing
This is the formal engine behind A/B testing and canary decisions.
Library correlations:
- `scipy.stats.ttest_ind()`: two-sample t-test for comparing means (e.g., average latency between model versions).
- `scipy.stats.chi2_contingency()`: chi-squared test for comparing categorical distributions (e.g., intent distribution shift, as in the sketch below).
- `scipy.stats.binomtest()`: exact test for a single proportion.
- `statsmodels.stats.proportion.proportions_ztest()`: z-test for comparing two proportions (e.g., conversion rate in treatment vs. control).
- `statsmodels.stats.power.NormalIndPower().solve_power()`: sample size and power calculations.
- `statsmodels.stats.multitest.multipletests()`: correction for multiple comparisons (Bonferroni, FDR).
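As one example from this list, a chi-squared check for intent-distribution shift, using made-up weekly counts:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical intent counts per week (rows = weeks, columns = intents).
table = np.array([
    [500, 300, 200],   # last week: order, refund, other
    [450, 380, 170],   # this week
])

chi2, pvalue, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {pvalue:.4g}")
# A small p-value flags a shift in the intent mix worth investigating.
```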
3. Confidence Intervals
Confidence intervals quantify how uncertain a point estimate is.
Library correlations:
- `statsmodels.stats.proportion.proportion_confint()`: confidence interval for a single proportion (e.g., hallucination rate).
- `scipy.stats.t.interval()`: confidence interval for a mean.
- `scipy.stats.bootstrap()` (SciPy 1.7+): bootstrap confidence intervals for any statistic, as in the sketch below.
- `numpy`: manual computation using `np.percentile()` for bootstrap distributions.
- In dashboards, these are often shown as error bars or shaded bands around metric trend lines.
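A short sketch of `scipy.stats.bootstrap()` on simulated evaluator scores:

```python
import numpy as np
from scipy.stats import bootstrap

# Hypothetical evaluator scores for one prompt version.
rng = np.random.default_rng(1)
scores = rng.normal(loc=4.1, scale=0.6, size=400)

# Bootstrap CI for the mean; the same call works for medians,
# percentiles, or any other statistic that accepts an axis argument.
res = bootstrap((scores,), np.mean, confidence_level=0.95, random_state=rng)
print(f"mean = {scores.mean():.2f}, 95% CI = "
      f"[{res.confidence_interval.low:.2f}, {res.confidence_interval.high:.2f}]")
```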
4. Central Limit Theorem
The CLT is not a function you call directly, but it is the reason many library functions work correctly at scale.
Library correlations:
- `scipy.stats.norm`: the normal distribution used for large-sample approximations.
- `statsmodels` z-tests and t-tests: these assume approximate normality of the sampling distribution, which the CLT justifies at large sample sizes.
- `numpy.random`: simulation and bootstrapping can empirically demonstrate CLT convergence, as in the sketch below.
- In practice, the CLT is why `proportions_ztest()` gives reliable results for conversion-rate comparisons when traffic is in the thousands or more.
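A small simulation that shows the CLT empirically: the raw data is heavily skewed, but sample means concentrate and their spread shrinks like $1/\sqrt{n}$:

```python
import numpy as np

# Raw data is exponential (skewed), yet the distribution of sample means
# tightens and symmetrizes as the sample size n grows.
rng = np.random.default_rng(0)

for n in (5, 50, 500):
    means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    # The std of the means should shrink roughly like 1/sqrt(n).
    print(f"n={n:4d}  mean of means={means.mean():.3f}  "
          f"std of means={means.std():.3f}")
```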
9. Common Pitfalls When Applying Statistics in ML Systems
These are mistakes that production teams and interviewers commonly encounter.
1. Peeking at A/B test results before reaching required sample size
- Checking results repeatedly and stopping early when a p-value happens to be small inflates the false positive rate.
- Fix: pre-register the sample size (for example, via `statsmodels.stats.power.NormalIndPower().solve_power()`), or use a sequential testing method with alpha-spending adjustments so that interim looks do not inflate the error rate.
2. Ignoring multiple comparisons
- Testing many metrics at once (conversion, AOV, escalation, CSAT, latency) without correction means at least one will appear significant by chance.
- Fix: apply Bonferroni or Benjamini-Hochberg correction via `statsmodels.stats.multitest.multipletests()`, as in the sketch below.
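A sketch of that correction with invented p-values:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from testing five metrics in one experiment.
pvals = [0.012, 0.034, 0.21, 0.048, 0.003]

# Benjamini-Hochberg false-discovery-rate correction.
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
for p, pa, r in zip(pvals, p_adj, reject):
    print(f"raw p = {p:.3f} -> adjusted p = {pa:.3f}, significant: {r}")
```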
3. Computing means on ordinal data as if it were ratio data
- Averaging CSAT scores (1-5) treats the gap between 1 and 2 as equal to the gap between 4 and 5, which may not be valid.
- Fix: report median or score-band distributions rather than (or in addition to) the mean.
4. Ignoring class imbalance in classification metrics
- Accuracy can be misleading when one class dominates (e.g., 95% of queries are not escalations, so a model that never predicts escalation gets 95% accuracy).
- Fix: use precision, recall, F1, or `sklearn.metrics.classification_report()` with per-class breakdowns.
5. Reporting mean latency instead of percentiles
- A small number of slow requests can be hidden by a healthy mean.
- Fix: always report P50, P95, and P99. This is already done correctly in this project.
6. Using accuracy as the primary metric for imbalanced guardrail detection
- If guardrail blocks are rare (e.g., 0.1% of responses), a model that never blocks achieves 99.9% accuracy.
- Fix: focus on precision and recall of the block class, and track the false positive rate explicitly.
7. Treating model confidence scores as calibrated probabilities
- A model that outputs 0.9 confidence does not necessarily mean the prediction is correct 90% of the time.
- Fix: check calibration using `sklearn.calibration.calibration_curve()` and recalibrate with `sklearn.calibration.CalibratedClassifierCV` if needed, as in the sketch below.
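A sketch of a calibration check on a synthetic dataset; the dataset and model choice here are illustrative only:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary task standing in for a real classifier's data.
X, y = make_classification(n_samples=5000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
prob = clf.predict_proba(X_te)[:, 1]

# If the model is calibrated, observed frequency tracks predicted probability.
frac_pos, mean_pred = calibration_curve(y_te, prob, n_bins=10)
print(np.round(mean_pred, 2))
print(np.round(frac_pos, 2))

# Recalibrate with cross-validated isotonic regression if the curve is off.
calibrated = CalibratedClassifierCV(LogisticRegression(max_iter=1000),
                                    method="isotonic", cv=5).fit(X_tr, y_tr)
```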
8. Drawing causal conclusions from observational data
- Observing that users who interact with the chatbot have higher conversion does not mean the chatbot caused the conversion. These users may already have higher purchase intent.
- Fix: use randomized A/B tests with holdout groups, as described in the project's experimentation design.
9. Ignoring drift between training data and production data
- A model trained on last year's data may not perform well on this year's traffic, especially around seasonal events.
- Fix: monitor prediction distributions and feature distributions using `evidently` or `whylogs`, and retrain or re-evaluate on fresh data.
10. Confusing statistical significance with practical significance
- With enough traffic, even a 0.01% conversion lift can be statistically significant. That does not mean it matters to the business.
- Fix: define a minimum detectable effect (MDE) or minimum business-relevant lift before the experiment, and evaluate results against that threshold.