Tools and Libraries for Statistical Inference in MangaAssist
1. Overview
Statistical inference in MangaAssist was not a manual, notebook-only exercise. It was embedded into automated pipelines, CI/CD gates, monitoring dashboards, and experiment platforms. This document maps every tool and library that enabled the inference workflows described in this folder.
2. Core Python Libraries
2.1 SciPy (scipy.stats)
Role: The primary library for statistical tests across the project.
| Function | Used For in MangaAssist |
|---|---|
| ttest_1samp() | One-sample t-test — latency SLA compliance |
| ttest_ind() | Two-sample t-test — AOV lift, latency comparison, BERTScore regression |
| ttest_rel() | Paired t-test — golden dataset evaluation (same inputs, two models) |
| mannwhitneyu() | Non-parametric two-sample comparison (rank-based) — P99 latency tails |
| chi2_contingency() | Chi-square test of independence — intent × guardrail outcome |
| chisquare() | Goodness-of-fit — intent distribution drift detection |
| binomtest() | Exact proportion test — hallucination rate vs. 2% target |
| fisher_exact() | Small-sample independence test — rare guardrail events |
| ks_2samp() | Distribution comparison — embedding drift, latency distribution shift |
| f_oneway() | ANOVA — latency across intent types |
| norm, t, beta, binom | Distribution objects for CI computation and p-value lookups |
| bootstrap() | Bootstrap CIs for BERTScore, NDCG, skewed revenue distributions |
Installation: Part of the standard scientific Python stack.
pip install scipy
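For instance, the hallucination-rate check against the 2% target and a two-sample latency comparison can be sketched as below. The counts, latencies, and variable names are illustrative assumptions, not values from the actual pipelines.

```python
from scipy import stats

# Hypothetical sample data: 9 hallucinations observed in 600 audited responses.
hallucinations, audited = 9, 600

# Exact binomial test of whether the rate exceeds the 2% target (one-sided).
binom_result = stats.binomtest(hallucinations, audited, p=0.02, alternative="greater")
print(f"hallucination rate p-value: {binom_result.pvalue:.4f}")

# Hypothetical per-request latencies (ms) for baseline vs. candidate model.
baseline = [812, 794, 903, 850, 777, 921, 866, 845, 798, 910]
candidate = [760, 742, 815, 790, 731, 805, 770, 752, 744, 801]

# Welch's two-sample t-test (equal_var=False) for the latency comparison.
t_stat, p_value = stats.ttest_ind(candidate, baseline, equal_var=False)
print(f"latency t={t_stat:.2f}, p={p_value:.4f}")
```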
2.2 Statsmodels
Role: Proportions tests, confidence intervals, power analysis, and multiple comparisons correction.
| Function | Used For in MangaAssist |
|---|---|
| proportions_ztest() | Two-proportion z-test — canary escalation rate, A/B conversion rate |
| proportion_confint() | CI for proportions — hallucination rate, escalation rate, conversion rate |
| NormalIndPower().solve_power() | Sample size planning for A/B tests |
| multipletests() | Bonferroni and BH correction for multi-metric canary tests |
| pairwise_tukeyhsd() | Post-hoc pairwise comparisons after ANOVA |
Installation:
pip install statsmodels
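A minimal sketch of how these calls combine for a canary escalation-rate check and A/B sample-size planning, assuming made-up counts and a nominal effect size:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest, proportion_confint
from statsmodels.stats.power import NormalIndPower

# Hypothetical escalation counts: baseline arm and canary arm.
escalations = np.array([48, 63])
requests = np.array([5000, 5000])

# Two-proportion z-test for the canary escalation rate.
z_stat, p_value = proportions_ztest(escalations, requests)
print(f"z={z_stat:.2f}, p={p_value:.4f}")

# Wilson CI for the canary's escalation rate.
low, high = proportion_confint(escalations[1], requests[1], alpha=0.05, method="wilson")
print(f"canary escalation rate CI: [{low:.4f}, {high:.4f}]")

# Sample size per arm to detect a small effect (Cohen's h of about 0.05) at 80% power.
n_per_arm = NormalIndPower().solve_power(effect_size=0.05, alpha=0.05, power=0.8)
print(f"required sample size per arm: {n_per_arm:.0f}")
```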
2.3 NumPy
Role: Numerical foundation — array operations, basic statistics, and manual CI computations.
| Function | Used For |
|---|---|
| np.mean(), np.std() | Point estimates for metrics |
| np.percentile() | Bootstrap CI bounds, latency percentiles |
| np.sqrt() | Standard error computations |
| np.random.choice() | Manual bootstrap resampling |
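The manual bootstrap that backed skewed metrics can be sketched as follows; the revenue values and resample count are placeholders.

```python
import numpy as np

np.random.seed(42)  # reproducible resampling for the example

# Hypothetical skewed per-order revenue values.
revenue = np.array([1200, 980, 15400, 2100, 890, 1340, 7600, 1100, 950, 23000])

# Manual bootstrap: resample with replacement, record each resample's mean.
boot_means = [
    np.random.choice(revenue, size=revenue.size, replace=True).mean()
    for _ in range(10_000)
]

# Percentile method: the 2.5th and 97.5th percentiles bound the 95% CI.
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean: [{ci_low:.0f}, {ci_high:.0f}]")
```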
2.4 Pandas
Role: Data manipulation, metric computation, and aggregation before statistical testing.
| Operation | Used For |
|---|---|
| groupby().agg() | Segment metrics by intent, locale, model version |
| crosstab() | Build contingency tables for chi-square tests |
| value_counts() | Compute intent distributions for goodness-of-fit tests |
| merge() | Join canary and baseline data by request ID for paired tests |
| resample() | Time-based aggregation for trend CI bands |
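For example, a contingency table built with crosstab() feeds straight into the chi-square test. The column names and labels below are assumed for illustration:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical request log with an intent label and a guardrail outcome per request.
df = pd.DataFrame({
    "intent": ["order_status", "recommendation", "order_status", "refund",
               "recommendation", "refund", "order_status", "recommendation"],
    "guardrail_outcome": ["pass", "pass", "blocked", "pass",
                          "blocked", "pass", "pass", "blocked"],
})

# Contingency table: intent × guardrail outcome.
table = pd.crosstab(df["intent"], df["guardrail_outcome"])

# Chi-square test of independence on the contingency table.
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p_value:.4f}, dof={dof}")
```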
2.5 Scikit-learn
Role: Classification metrics, confusion matrices, and inter-rater agreement.
| Function | Used For |
|---|---|
| classification_report() | Per-class precision, recall, F1 for intent classifier |
| confusion_matrix() | Routing confusion analysis |
| cohen_kappa_score() | Inter-rater agreement in human audits |
| calibration_curve() | Classifier confidence calibration |
| CalibratedClassifierCV() | Recalibration when confidence scores are unreliable |
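A brief sketch of this usage with fabricated labels (the intent names and auditor verdicts are placeholders):

```python
from sklearn.metrics import classification_report, confusion_matrix, cohen_kappa_score

# Hypothetical intent labels: ground truth vs. classifier predictions.
y_true = ["order", "reco", "refund", "order", "reco", "order", "refund", "reco"]
y_pred = ["order", "reco", "order",  "order", "reco", "reco",  "refund", "reco"]

# Per-class precision, recall, and F1 for the intent classifier.
print(classification_report(y_true, y_pred, zero_division=0))

# Routing confusion analysis.
print(confusion_matrix(y_true, y_pred, labels=["order", "reco", "refund"]))

# Inter-rater agreement between two hypothetical human auditors.
auditor_a = ["ok", "ok", "fail", "ok", "fail", "ok"]
auditor_b = ["ok", "ok", "fail", "fail", "fail", "ok"]
print(f"Cohen's kappa: {cohen_kappa_score(auditor_a, auditor_b):.2f}")
```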
3. Experiment and Evaluation Platforms
3.1 MLflow
Role: Experiment tracking, metric logging, and model versioning.
| Capability | How It Supported Inference |
|---|---|
| mlflow.log_metric() | Logged all golden dataset evaluation metrics per run |
| mlflow.log_param() | Tracked model version, prompt version, retriever config |
| Run comparison UI | Visual comparison of metric distributions across experiments |
| mlflow.evaluate() | Automated evaluation with custom metrics on golden datasets |
Inference connection: MLflow stored the raw metric values that were then fed into statistical tests for significance checking. A prompt change created a new MLflow run, and the evaluation pipeline compared it against the baseline run using t-tests and z-tests.
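The logging half of that loop might look roughly like the sketch below; the run name, parameter values, and metric values are placeholders rather than the project's real configuration.

```python
import mlflow

# Hypothetical evaluation run for a new prompt version.
with mlflow.start_run(run_name="prompt-v2-golden-eval"):
    mlflow.log_param("model_version", "claude-3-5-sonnet")
    mlflow.log_param("prompt_version", "v2")
    mlflow.log_param("retriever_config", "hybrid-k10")

    # Metrics computed on the golden dataset; values are placeholders.
    mlflow.log_metric("bertscore_f1", 0.912)
    mlflow.log_metric("faithfulness", 0.947)
    mlflow.log_metric("ndcg_at_10", 0.873)
```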
3.2 RAGAS
Role: LLM-specific evaluation — faithfulness, answer relevancy, context relevancy.
| Metric | Statistical Treatment |
|---|---|
| Faithfulness score | Bootstrap CIs for confidence bounds |
| Answer relevancy | Paired t-test vs. baseline prompt |
| Context precision | Welch's t-test across retriever versions |
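The bootstrap treatment can be illustrated without relying on the RAGAS API itself: given an array of per-example faithfulness scores from an evaluation run (the values below are invented), scipy's bootstrap() produces the confidence bounds.

```python
import numpy as np
from scipy.stats import bootstrap

# Hypothetical per-example faithfulness scores from one evaluation run.
faithfulness_scores = np.array([0.91, 0.88, 0.97, 0.84, 0.93, 0.90, 0.79, 0.95, 0.92, 0.87])

# BCa bootstrap CI for the mean faithfulness score.
result = bootstrap((faithfulness_scores,), np.mean,
                   confidence_level=0.95, n_resamples=10_000, method="BCa")
ci = result.confidence_interval
print(f"95% CI for mean faithfulness: [{ci.low:.3f}, {ci.high:.3f}]")
```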
3.3 DeepEval
Role: Alternative LLM evaluation framework for hallucination detection and rubric-based scoring.
Used alongside RAGAS to cross-check evaluation results. When RAGAS and DeepEval disagreed, the team investigated the scoring rubric rather than blindly trusting either.
3.4 PromptFoo
Role: Prompt regression testing and comparison.
| Capability | Inference Connection |
|---|---|
| Side-by-side prompt comparison | Generated paired data for paired t-tests |
| Assertion-based testing | Binary pass/fail data for proportion tests |
| Multi-model evaluation | ANOVA across model variants |
4. AWS Infrastructure Tools
4.1 Amazon CloudWatch
Role: Operational metrics collection, dashboarding, and alerting.
| Metric Type | Inference Usage |
|---|---|
| Latency percentiles (P50, P95, P99) | Source data for t-tests and KS tests |
| Error rates | Source data for proportion tests and canary checks |
| Request counts | Sample size tracking for power analysis |
| Custom metrics | Published evaluation scores for time-series CI bands |
CloudWatch Anomaly Detection used statistical baselines (2-week rolling window) to flag anomalies — effectively an automated hypothesis test against historical behavior.
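That behavior can be approximated conceptually as a rolling-baseline z-score check. The sketch below is a simplified stand-in for the idea, not CloudWatch's actual algorithm, and the latency values are synthetic.

```python
import numpy as np

# Hypothetical daily P99 latency (ms) over a 2-week baseline window, plus today's value.
baseline_window = np.array([905, 930, 897, 912, 940, 921, 908,
                            933, 918, 902, 927, 915, 910, 925])
todays_value = 1_040

# Flag an anomaly if today's value falls outside mean +/- 3 standard deviations
# of the rolling baseline (a rough stand-in for an automated hypothesis test).
mean, std = baseline_window.mean(), baseline_window.std(ddof=1)
z_score = (todays_value - mean) / std
print(f"z={z_score:.2f}, anomaly={abs(z_score) > 3}")
```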
4.2 Amazon SageMaker
Role: Model training, hosting, and evaluation pipeline execution.
| Capability | Inference Connection |
|---|---|
| Processing jobs | Ran golden dataset evaluations and computed metrics |
| Model registry | Versioned models for paired comparisons |
| Experiments | Tracked A/B test results and evaluation runs |
| Clarify | Bias detection and feature importance (uses statistical tests internally) |
4.3 Amazon Bedrock
Role: LLM inference for Claude 3.5 Sonnet.
Inference connection: Bedrock usage metrics (token counts, latency, throttling) were the raw data for operational t-tests and proportion tests. Cost-per-response analysis used mean CIs from Bedrock billing data.
4.4 Amazon OpenSearch
Role: Vector search for RAG retrieval.
Inference connection: Retrieval metrics (Recall@K, Precision@K, MRR, NDCG) computed from OpenSearch results were compared across retriever configurations using paired t-tests on the golden dataset.
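The comparison itself happens outside OpenSearch: per-query NDCG scores for the same golden queries under two retriever configurations feed a paired t-test. The score arrays below are placeholders.

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical per-query NDCG@10 for the same golden queries under two retriever configs.
ndcg_baseline = np.array([0.72, 0.81, 0.64, 0.90, 0.77, 0.69, 0.85, 0.74])
ndcg_candidate = np.array([0.78, 0.83, 0.70, 0.91, 0.80, 0.73, 0.86, 0.79])

# Paired t-test: each query contributes one (baseline, candidate) pair.
t_stat, p_value = ttest_rel(ndcg_candidate, ndcg_baseline)
print(f"t={t_stat:.2f}, p={p_value:.4f}")
```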
5. Monitoring and Observability Tools
5.1 Prometheus + Grafana
Role: Custom metric collection and visualization for operational dashboards.
| Feature | Inference Support |
|---|---|
| Histogram metrics | Latency distribution data for KS tests |
| Counter metrics | Rate data for proportion tests |
| Grafana panels with error bands | Visual CI representation on dashboards |
| Alert rules with for-duration | Implicit sequential testing — alert only if metric breaches for sustained period |
5.2 Evidently AI
Role: Data and model monitoring — drift detection.
| Feature | Statistical Method Used |
|---|---|
| Data drift report | KS test, chi-square test, Population Stability Index (PSI) |
| Target drift report | Proportion tests on label distribution |
| Classification performance report | Precision/recall/F1 with statistical significance |
| Numerical feature drift | Wasserstein distance, KS statistic |
Evidently automated many of the drift checks that would otherwise require manual scipy calls.
5.3 WhyLogs
Role: Lightweight data logging and profiling for drift detection.
| Feature | Inference Connection |
|---|---|
| Statistical profiling | Captured distribution summaries for each feature per time window |
| Constraint validation | Threshold-based checks equivalent to one-sample tests |
| Drift detection | Chi-square for categorical, KS for numerical features |
6. Custom Tools Built for MangaAssist
6.1 Canary Controller
A custom service that orchestrated the canary rollout and automated statistical decisions.
Key components:
| Component | Statistical Method |
|---|---|
| Rate comparator | Two-proportion z-test with pooled variance |
| Multi-metric gate | Bonferroni-corrected p-values across 5 metrics |
| Sequential checker | O'Brien-Fleming alpha-spending function |
| Auto-rollback trigger | p-value threshold breach → immediate revert |
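A simplified sketch of the rate comparator feeding the Bonferroni-corrected multi-metric gate (the O'Brien-Fleming sequential logic is omitted); the metric names and counts are invented.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.multitest import multipletests

# Hypothetical (canary, baseline) failure counts out of 5000 requests per arm, per metric.
metrics = {
    "escalation":    ((63, 5000), (48, 5000)),
    "hallucination": ((12, 5000), (10, 5000)),
    "guardrail":     ((22, 5000), (19, 5000)),
    "error_rate":    ((31, 5000), (25, 5000)),
    "timeout":       ((8, 5000),  (9, 5000)),
}

# Two-proportion z-test per metric (canary vs. baseline).
p_values = []
for name, ((c_cnt, c_n), (b_cnt, b_n)) in metrics.items():
    _, p = proportions_ztest([c_cnt, b_cnt], [c_n, b_n])
    p_values.append(p)

# Bonferroni correction across the five metrics; any rejection triggers a rollback.
reject, p_corrected, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
print(f"rollback: {bool(np.any(reject))}")
```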
6.2 Shadow Mode Diff Service
Compared baseline and candidate outputs on the same requests.
Statistical methods embedded:
- Paired comparisons for text quality scores
- Chi-square on routing confusion matrices
- KS test on latency and response length distributions
- KL divergence for intent distribution shift
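Two of these checks, sketched with synthetic data: a KS test on latency distributions and KL divergence between intent distributions (smoothed to avoid zeros).

```python
import numpy as np
from scipy.stats import ks_2samp, entropy

rng = np.random.default_rng(0)

# Hypothetical latency samples (ms) from baseline and candidate on the same traffic.
latency_baseline = rng.gamma(shape=9.0, scale=100.0, size=2000)
latency_candidate = rng.gamma(shape=9.0, scale=95.0, size=2000)

# KS test: has the latency distribution shifted?
ks_stat, p_value = ks_2samp(latency_baseline, latency_candidate)
print(f"KS={ks_stat:.3f}, p={p_value:.4f}")

# Intent distributions (proportions per intent class), lightly smoothed.
intent_baseline = np.array([0.42, 0.31, 0.18, 0.09]) + 1e-9
intent_candidate = np.array([0.40, 0.33, 0.17, 0.10]) + 1e-9

# KL divergence of candidate from baseline (scipy's entropy with two arguments).
kl = entropy(intent_candidate, intent_baseline)
print(f"KL divergence: {kl:.4f}")
```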
6.3 Evaluation Orchestrator
The CI/CD integration that ran golden dataset evaluations and gated pull requests.
Statistical methods used:
- Fixed-threshold gates (implicit hypothesis tests)
- Paired t-tests for metric regression detection
- Bootstrap CIs for bounded metrics (BERTScore, NDCG)
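A compressed sketch of how such a gate might combine a fixed threshold with a paired regression test; the floor value, metric values, and the gate_passes helper are hypothetical.

```python
import numpy as np
from scipy.stats import ttest_rel

def gate_passes(baseline_scores, candidate_scores, floor=0.85, alpha=0.05):
    """Pass if the candidate clears a fixed floor and shows no statistically
    significant regression against the baseline (one-sided paired t-test)."""
    baseline_scores = np.asarray(baseline_scores)
    candidate_scores = np.asarray(candidate_scores)

    # Fixed-threshold gate on the candidate's mean score.
    if candidate_scores.mean() < floor:
        return False

    # Paired t-test, one-sided: is the candidate significantly worse?
    _, p_worse = ttest_rel(candidate_scores, baseline_scores, alternative="less")
    return p_worse >= alpha

# Hypothetical per-example BERTScore F1 on the golden dataset.
baseline = [0.90, 0.88, 0.93, 0.87, 0.91, 0.89, 0.92, 0.90]
candidate = [0.91, 0.88, 0.92, 0.88, 0.91, 0.90, 0.93, 0.90]
print(gate_passes(baseline, candidate))
```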
7. Tool Stack Summary
| Layer | Tools | Statistical Tests Enabled |
|---|---|---|
| Core computation | scipy, statsmodels, numpy, pandas | All hypothesis tests, CIs, power analysis |
| ML evaluation | scikit-learn, MLflow, RAGAS, DeepEval, PromptFoo | Classification metrics, paired evaluations, regression detection |
| AWS infra | CloudWatch, SageMaker, Bedrock, OpenSearch | Latency metrics, cost analysis, retrieval evaluation |
| Monitoring | Prometheus, Grafana, Evidently, WhyLogs | Drift detection, distribution tests, alert-based testing |
| Custom services | Canary controller, shadow diff service, eval orchestrator | Sequential testing, multi-metric correction, automated rollback |
8. How Tools Fit the 4-Layer Evaluation Framework
| Evaluation Layer | Primary Tools | Statistical Methods |
|---|---|---|
| Layer 1: Golden Dataset | MLflow, SageMaker, PromptFoo, scikit-learn | Paired t-tests, bootstrap CIs, threshold gates |
| Layer 2: Shadow Mode | Custom diff service, Evidently, scipy | KS test, chi-square, KL divergence, paired comparison |
| Layer 3: Canary | Custom canary controller, statsmodels, CloudWatch | Z-test, Bonferroni correction, sequential testing |
| Layer 4: Continuous Monitoring | Prometheus, Grafana, Evidently, WhyLogs, CloudWatch | Drift tests, anomaly detection, rolling CIs |