Tools and Libraries for Statistical Inference in MangaAssist

1. Overview

Statistical inference in MangaAssist was not a manual, notebook-only exercise. It was embedded into automated pipelines, CI/CD gates, monitoring dashboards, and experiment platforms. This document maps every tool and library that enabled the inference workflows described in this folder.


2. Core Python Libraries

2.1 SciPy (scipy.stats)

Role: The primary library for statistical tests across the project.

| Function | Used For in MangaAssist |
| --- | --- |
| ttest_1samp() | One-sample t-test — latency SLA compliance |
| ttest_ind() | Two-sample t-test — AOV lift, latency comparison, BERTScore regression |
| ttest_rel() | Paired t-test — golden dataset evaluation (same inputs, two models) |
| mannwhitneyu() | Non-parametric rank-based two-sample comparison — P99 latency tails |
| chi2_contingency() | Chi-square test of independence — intent × guardrail outcome |
| chisquare() | Goodness-of-fit — intent distribution drift detection |
| binomtest() | Exact proportion test — hallucination rate vs. 2% target |
| fisher_exact() | Small-sample independence test — rare guardrail events |
| ks_2samp() | Distribution comparison — embedding drift, latency distribution shift |
| f_oneway() | ANOVA — latency across intent types |
| norm, t, beta, binom | Distribution objects for CI computation and p-value lookups |
| bootstrap() | Bootstrap CIs for BERTScore, NDCG, skewed revenue distributions |

Installation: Part of the standard scientific Python stack.

pip install scipy
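
A minimal sketch of two of these checks, with illustrative data and thresholds (the 2.0 s SLA bound and the counts below are not the project's actual values):

```python
import numpy as np
from scipy import stats

# One-sample t-test: is mean latency below an assumed 2.0 s SLA target?
latencies = np.array([1.8, 2.1, 1.9, 1.7, 2.0, 1.6, 1.9, 2.2])
t_stat, p_value = stats.ttest_1samp(latencies, popmean=2.0, alternative="less")
print(f"latency SLA check: t={t_stat:.2f}, p={p_value:.3f}")

# Exact binomial test: is the hallucination rate above the 2% target?
result = stats.binomtest(k=31, n=1200, p=0.02, alternative="greater")
print(f"hallucination rate check: p={result.pvalue:.3f}")
```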

2.2 Statsmodels

Role: Proportions tests, confidence intervals, power analysis, and multiple comparisons correction.

| Function | Used For in MangaAssist |
| --- | --- |
| proportions_ztest() | Two-proportion z-test — canary escalation rate, A/B conversion rate |
| proportion_confint() | CI for proportions — hallucination rate, escalation rate, conversion rate |
| NormalIndPower().solve_power() | Sample size planning for A/B tests |
| multipletests() | Bonferroni and BH correction for multi-metric canary tests |
| pairwise_tukeyhsd() | Post-hoc pairwise comparisons after ANOVA |

Installation:

pip install statsmodels
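
A minimal sketch of the proportions workflow, sample size planning, and multiple-comparison correction listed above; all counts and effect sizes are illustrative:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest, proportion_confint
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.multitest import multipletests

# Two-proportion z-test: canary vs. baseline escalation rate
counts = np.array([42, 30])        # escalations in canary, baseline
nobs = np.array([1000, 1000])      # requests per arm
z_stat, p_value = proportions_ztest(counts, nobs)

# Wilson CI for the canary escalation rate
ci_low, ci_high = proportion_confint(count=42, nobs=1000, alpha=0.05, method="wilson")

# Sample size per arm for 80% power at alpha = 0.05 (effect size is illustrative)
n_per_arm = NormalIndPower().solve_power(effect_size=0.1, alpha=0.05, power=0.8)

# Bonferroni correction across several canary metric p-values
reject, p_adjusted, _, _ = multipletests([p_value, 0.03, 0.20], method="bonferroni")
print(z_stat, (ci_low, ci_high), n_per_arm, reject)
```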

2.3 NumPy

Role: Numerical foundation — array operations, basic statistics, and manual CI computations.

| Function | Used For |
| --- | --- |
| np.mean(), np.std() | Point estimates for metrics |
| np.percentile() | Bootstrap CI bounds, latency percentiles |
| np.sqrt() | Standard error computations |
| np.random.choice() | Manual bootstrap resampling |
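
A minimal sketch of a manual bootstrap CI built only from the NumPy pieces above (using the Generator API rather than the legacy np.random.choice); the scores are simulated:

```python
import numpy as np

rng = np.random.default_rng(42)
scores = rng.normal(loc=0.85, scale=0.05, size=200)   # stand-in for per-example metric values

# Resample the dataset with replacement and collect the mean of each resample
boot_means = np.array([
    np.mean(rng.choice(scores, size=scores.size, replace=True))
    for _ in range(10_000)
])

# Percentile bootstrap 95% CI
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean={np.mean(scores):.3f}, 95% CI=({ci_low:.3f}, {ci_high:.3f})")
```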

2.4 Pandas

Role: Data manipulation, metric computation, and aggregation before statistical testing.

| Operation | Used For |
| --- | --- |
| groupby().agg() | Segment metrics by intent, locale, model version |
| crosstab() | Build contingency tables for chi-square tests |
| value_counts() | Compute intent distributions for goodness-of-fit tests |
| merge() | Join canary and baseline data by request ID for paired tests |
| resample() | Time-based aggregation for trend CI bands |
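
A minimal sketch of how a contingency table built with crosstab() might feed a chi-square test; the column names and rows are illustrative, not the actual MangaAssist schema:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Tiny illustrative frame: one row per request
df = pd.DataFrame({
    "intent": ["order_status", "refund", "refund", "order_status", "recommendation"],
    "guardrail_outcome": ["pass", "block", "pass", "pass", "pass"],
})

# Contingency table: intent x guardrail outcome
table = pd.crosstab(df["intent"], df["guardrail_outcome"])

# Chi-square test of independence on the table
chi2, p_value, dof, expected = chi2_contingency(table)
print(table, p_value)
```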

2.5 Scikit-learn

Role: Classification metrics, confusion matrices, and inter-rater agreement.

| Function | Used For |
| --- | --- |
| classification_report() | Per-class precision, recall, F1 for intent classifier |
| confusion_matrix() | Routing confusion analysis |
| cohen_kappa_score() | Inter-rater agreement in human audits |
| calibration_curve() | Classifier confidence calibration |
| CalibratedClassifierCV() | Recalibration when confidence scores are unreliable |
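
A minimal sketch of the classification and agreement checks above, with illustrative labels:

```python
from sklearn.metrics import classification_report, confusion_matrix, cohen_kappa_score

# Intent classifier evaluation (labels are illustrative)
y_true = ["order_status", "refund", "refund", "recommendation", "order_status"]
y_pred = ["order_status", "refund", "order_status", "recommendation", "order_status"]

print(classification_report(y_true, y_pred))   # per-class precision, recall, F1
print(confusion_matrix(y_true, y_pred))        # routing confusion analysis

# Inter-rater agreement between two human auditors (1 = acceptable, 0 = not)
rater_a = [1, 1, 0, 1, 0, 1]
rater_b = [1, 0, 0, 1, 0, 1]
print(cohen_kappa_score(rater_a, rater_b))
```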

3. Experiment and Evaluation Platforms

3.1 MLflow

Role: Experiment tracking, metric logging, and model versioning.

| Capability | How It Supported Inference |
| --- | --- |
| mlflow.log_metric() | Logged all golden dataset evaluation metrics per run |
| mlflow.log_param() | Tracked model version, prompt version, retriever config |
| Run comparison UI | Visual comparison of metric distributions across experiments |
| mlflow.evaluate() | Automated evaluation with custom metrics on golden datasets |

Inference connection: MLflow stored the raw metric values that were then fed into statistical tests for significance checking. A prompt change created a new MLflow run, and the evaluation pipeline compared it against the baseline run using t-tests and z-tests.
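
A minimal sketch of the logging side of that flow; the experiment name, run name, and metric keys are illustrative:

```python
import mlflow

mlflow.set_experiment("golden-dataset-eval")   # assumed experiment name

with mlflow.start_run(run_name="prompt-v2"):
    # Configuration that identifies what is being compared
    mlflow.log_param("model_version", "claude-3-5-sonnet")
    mlflow.log_param("prompt_version", "v2")
    # Aggregate evaluation metrics for this run
    mlflow.log_metric("bertscore_f1_mean", 0.871)
    mlflow.log_metric("ndcg_at_5", 0.64)

# The logged values are later retrieved (e.g., via MlflowClient) and compared
# against the baseline run with the t-tests and z-tests described above.
```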

3.2 RAGAS

Role: LLM-specific evaluation — faithfulness, answer relevancy, context relevancy.

| Metric | Statistical Treatment |
| --- | --- |
| Faithfulness score | Bootstrap CIs for confidence bounds |
| Answer relevancy | Paired t-test vs. baseline prompt |
| Context precision | Welch's t-test across retriever versions |
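
As a sketch of the bootstrap treatment above, the snippet below computes a 95% CI for the mean faithfulness score with scipy.stats.bootstrap; the score array is simulated and stands in for the per-row output of a RAGAS evaluation run:

```python
import numpy as np
from scipy.stats import bootstrap

rng = np.random.default_rng(0)
faithfulness = rng.beta(a=9, b=1, size=150)   # illustrative per-sample scores in [0, 1]

# BCa bootstrap CI for the mean faithfulness score
res = bootstrap((faithfulness,), np.mean, confidence_level=0.95,
                n_resamples=10_000, method="BCa")
print(res.confidence_interval)
```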

3.3 DeepEval

Role: Alternative LLM evaluation framework for hallucination detection and rubric-based scoring.

Used alongside RAGAS for cross-validation of evaluation results. When RAGAS and DeepEval disagreed, the team investigated the scoring rubric rather than blindly trusting either.

3.4 PromptFoo

Role: Prompt regression testing and comparison.

| Capability | Inference Connection |
| --- | --- |
| Side-by-side prompt comparison | Generated paired data for paired t-tests |
| Assertion-based testing | Binary pass/fail data for proportion tests |
| Multi-model evaluation | ANOVA across model variants |

4. AWS Infrastructure Tools

4.1 Amazon CloudWatch

Role: Operational metrics collection, dashboarding, and alerting.

| Metric Type | Inference Usage |
| --- | --- |
| Latency percentiles (P50, P95, P99) | Source data for t-tests and KS tests |
| Error rates | Source data for proportion tests and canary checks |
| Request counts | Sample size tracking for power analysis |
| Custom metrics | Published evaluation scores for time-series CI bands |

CloudWatch Anomaly Detection used statistical baselines (2-week rolling window) to flag anomalies — effectively an automated hypothesis test against historical behavior.
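
A minimal sketch of pulling windowed latency statistics with boto3 and comparing two weeks with a KS test; the namespace and metric name are assumptions, not the project's actual values:

```python
from datetime import datetime, timedelta, timezone

import boto3
from scipy.stats import ks_2samp

cw = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

def latency_series(start, end):
    """Fetch 5-minute average latency datapoints for a time window."""
    resp = cw.get_metric_statistics(
        Namespace="MangaAssist",          # assumed namespace
        MetricName="ResponseLatency",     # assumed metric name
        StartTime=start, EndTime=end,
        Period=300, Statistics=["Average"],
    )
    return [dp["Average"] for dp in resp["Datapoints"]]

# Compare last week's latency distribution against the week before
baseline = latency_series(now - timedelta(days=14), now - timedelta(days=7))
current = latency_series(now - timedelta(days=7), now)
stat, p_value = ks_2samp(baseline, current)
print(f"KS={stat:.3f}, p={p_value:.3f}")
```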

4.2 Amazon SageMaker

Role: Model training, hosting, and evaluation pipeline execution.

| Capability | Inference Connection |
| --- | --- |
| Processing jobs | Ran golden dataset evaluations and computed metrics |
| Model registry | Versioned models for paired comparisons |
| Experiments | Tracked A/B test results and evaluation runs |
| Clarify | Bias detection and feature importance (uses statistical tests internally) |

4.3 Amazon Bedrock

Role: LLM inference for Claude 3.5 Sonnet.

Inference connection: Bedrock usage metrics (token counts, latency, throttling) were the raw data for operational t-tests and proportion tests. Cost-per-response analysis used mean CIs from Bedrock billing data.

4.4 Amazon OpenSearch

Role: Vector search for RAG retrieval.

Inference connection: Retrieval metrics (Recall@K, Precision@K, MRR, NDCG) computed from OpenSearch results were compared across retriever configurations using paired t-tests on the golden dataset.


5. Monitoring and Observability Tools

5.1 Prometheus + Grafana

Role: Custom metric collection and visualization for operational dashboards.

| Feature | Inference Support |
| --- | --- |
| Histogram metrics | Latency distribution data for KS tests |
| Counter metrics | Rate data for proportion tests |
| Grafana panels with error bands | Visual CI representation on dashboards |
| Alert rules with for-duration | Implicit sequential testing — alert only if the metric breaches for a sustained period |

5.2 Evidently AI

Role: Data and model monitoring — drift detection.

| Feature | Statistical Method Used |
| --- | --- |
| Data drift report | KS test, chi-square test, Population Stability Index (PSI) |
| Target drift report | Proportion tests on label distribution |
| Classification performance report | Precision/recall/F1 with statistical significance |
| Numerical feature drift | Wasserstein distance, KS statistic |

Evidently automated many of the drift checks that would otherwise require manual scipy calls.
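
A minimal sketch of a drift report, assuming a recent Evidently release with the Report/preset API (the interface has changed across versions); the data frames are simulated:

```python
import numpy as np
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

rng = np.random.default_rng(1)
reference = pd.DataFrame({
    "latency_s": rng.normal(1.2, 0.2, 500),
    "intent": rng.choice(["order_status", "refund", "recommendation"], 500),
})
current = pd.DataFrame({
    "latency_s": rng.normal(1.5, 0.3, 500),
    "intent": rng.choice(["order_status", "refund", "recommendation"], 500, p=[0.3, 0.5, 0.2]),
})

# Runs per-column drift tests (KS for numerical, chi-square for categorical, etc.)
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("drift_report.html")
```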

5.3 WhyLogs

Role: Lightweight data logging and profiling for drift detection.

| Feature | Inference Connection |
| --- | --- |
| Statistical profiling | Captured distribution summaries for each feature per time window |
| Constraint validation | Threshold-based checks equivalent to one-sample tests |
| Drift detection | Chi-square for categorical, KS for numerical features |

6. Custom Tools Built for MangaAssist

6.1 Canary Controller

A custom service that orchestrated the canary rollout and automated statistical decisions.

Key components:

| Component | Statistical Method |
| --- | --- |
| Rate comparator | Two-proportion z-test with pooled variance |
| Multi-metric gate | Bonferroni-corrected p-values across 5 metrics |
| Sequential checker | O'Brien-Fleming alpha-spending function |
| Auto-rollback trigger | p-value threshold breach → immediate revert |
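
Not the production Canary Controller, but a minimal sketch of how the rate comparator and the Bonferroni gate could be combined with statsmodels; the counts and the trailing p-values are illustrative:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.multitest import multipletests

def rate_pvalue(canary_events, canary_n, baseline_events, baseline_n):
    """Two-proportion z-test (statsmodels pools the variance under the null by default)."""
    _, p = proportions_ztest(np.array([canary_events, baseline_events]),
                             np.array([canary_n, baseline_n]))
    return p

# p-values for the 5 gated metrics; the last three are illustrative placeholders
p_values = [
    rate_pvalue(55, 2000, 40, 2000),   # escalation rate
    rate_pvalue(18, 2000, 15, 2000),   # hallucination rate
    0.21, 0.47, 0.03,
]

# Bonferroni-corrected gate: any rejection triggers a rollback
reject, _, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
if reject.any():
    print("gate failed -> trigger rollback")
```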

6.2 Shadow Mode Diff Service

Compared baseline and candidate outputs on the same requests.

Statistical methods embedded:

  • Paired comparisons for text quality scores
  • Chi-square on routing confusion matrices
  • KS test on latency and response length distributions
  • KL divergence for intent distribution shift
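
Not the production diff service, but a minimal sketch of two of the checks above, a KS test on latency and KL divergence on intent distributions, with simulated data:

```python
import numpy as np
from scipy.stats import ks_2samp
from scipy.special import rel_entr

rng = np.random.default_rng(7)

# Latency distribution shift between baseline and candidate on the same requests
baseline_latency = rng.lognormal(mean=0.20, sigma=0.4, size=1000)
candidate_latency = rng.lognormal(mean=0.25, sigma=0.4, size=1000)
ks_stat, ks_p = ks_2samp(baseline_latency, candidate_latency)

# Intent distribution shift: KL(candidate || baseline) over the same categories
baseline_intents = np.array([0.50, 0.30, 0.15, 0.05])
candidate_intents = np.array([0.45, 0.32, 0.17, 0.06])
kl_divergence = rel_entr(candidate_intents, baseline_intents).sum()
print(ks_stat, ks_p, kl_divergence)
```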

6.3 Evaluation Orchestrator

The CI/CD integration that ran golden dataset evaluations and gated pull requests.

Statistical methods used:

  • Fixed-threshold gates (implicit hypothesis tests)
  • Paired t-tests for metric regression detection
  • Bootstrap CIs for bounded metrics (BERTScore, NDCG)
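
Not the production orchestrator, but a minimal sketch of a paired-t-test regression gate combined with a fixed threshold; the scores and the 0.80 floor are illustrative:

```python
import numpy as np
from scipy.stats import ttest_rel

# Per-example golden dataset scores for the same inputs under both versions
baseline_scores = np.array([0.86, 0.91, 0.84, 0.88, 0.90, 0.87, 0.89, 0.85])
candidate_scores = np.array([0.84, 0.90, 0.82, 0.87, 0.88, 0.86, 0.88, 0.83])

# One-sided paired t-test: did the candidate score significantly drop?
t_stat, p_value = ttest_rel(candidate_scores, baseline_scores, alternative="less")

# Gate passes only if no significant regression AND the fixed threshold holds
gate_passed = p_value >= 0.05 and candidate_scores.mean() >= 0.80
print(f"p={p_value:.3f}, gate_passed={gate_passed}")
```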

7. Tool Stack Summary

| Layer | Tools | Statistical Tests Enabled |
| --- | --- | --- |
| Core computation | scipy, statsmodels, numpy, pandas | All hypothesis tests, CIs, power analysis |
| ML evaluation | scikit-learn, MLflow, RAGAS, DeepEval, PromptFoo | Classification metrics, paired evaluations, regression detection |
| AWS infra | CloudWatch, SageMaker, Bedrock, OpenSearch | Latency metrics, cost analysis, retrieval evaluation |
| Monitoring | Prometheus, Grafana, Evidently, WhyLogs | Drift detection, distribution tests, alert-based testing |
| Custom services | Canary controller, shadow diff service, eval orchestrator | Sequential testing, multi-metric correction, automated rollback |

8. How Tools Fit the 4-Layer Evaluation Framework

| Evaluation Layer | Primary Tools | Statistical Methods |
| --- | --- | --- |
| Layer 1: Golden Dataset | MLflow, SageMaker, PromptFoo, scikit-learn | Paired t-tests, bootstrap CIs, threshold gates |
| Layer 2: Shadow Mode | Custom diff service, Evidently, scipy | KS test, chi-square, KL divergence, paired comparison |
| Layer 3: Canary | Custom canary controller, statsmodels, CloudWatch | Z-test, Bonferroni correction, sequential testing |
| Layer 4: Continuous Monitoring | Prometheus, Grafana, Evidently, WhyLogs, CloudWatch | Drift tests, anomaly detection, rolling CIs |