Tools and Libraries for Statistical Inference in MangaAssist

1. Overview

Statistical inference in MangaAssist was not a manual, notebook-only exercise. It was embedded into automated pipelines, CI/CD gates, monitoring dashboards, and experiment platforms. This document maps every tool and library that enabled the inference workflows described in this folder.


2. Core Python Libraries

2.1 SciPy (scipy.stats)

Role: The primary library for statistical tests across the project.

| Function | Used For in MangaAssist |
| --- | --- |
| ttest_1samp() | One-sample t-test — latency SLA compliance |
| ttest_ind() | Two-sample t-test — AOV lift, latency comparison, BERTScore regression |
| ttest_rel() | Paired t-test — golden dataset evaluation (same inputs, two models) |
| mannwhitneyu() | Non-parametric rank-based two-sample comparison — P99 latency tails |
| chi2_contingency() | Chi-square test of independence — intent × guardrail outcome |
| chisquare() | Goodness-of-fit — intent distribution drift detection |
| binomtest() | Exact proportion test — hallucination rate vs. 2% target |
| fisher_exact() | Small-sample independence test — rare guardrail events |
| ks_2samp() | Distribution comparison — embedding drift, latency distribution shift |
| f_oneway() | ANOVA — latency across intent types |
| norm, t, beta, binom | Distribution objects for CI computation and p-value lookups |
| bootstrap() | Bootstrap CIs for BERTScore, NDCG, skewed revenue distributions |

Installation: Part of the standard scientific Python stack.

pip install scipy
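
A minimal sketch of two of these checks, with illustrative data and thresholds (the 2.0 s SLA bound and the counts below are not the project's actual values):

```python
import numpy as np
from scipy import stats

# One-sample t-test: is mean latency below an assumed 2.0 s SLA target?
latencies = np.array([1.8, 2.1, 1.9, 1.7, 2.0, 1.6, 1.9, 2.2])
t_stat, p_value = stats.ttest_1samp(latencies, popmean=2.0, alternative="less")
print(f"latency SLA check: t={t_stat:.2f}, p={p_value:.3f}")

# Exact binomial test: is the hallucination rate above the 2% target?
result = stats.binomtest(k=31, n=1200, p=0.02, alternative="greater")
print(f"hallucination rate check: p={result.pvalue:.3f}")
```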

2.2 Statsmodels

Role: Proportions tests, confidence intervals, power analysis, and multiple comparisons correction.

| Function | Used For in MangaAssist |
| --- | --- |
| proportions_ztest() | Two-proportion z-test — canary escalation rate, A/B conversion rate |
| proportion_confint() | CI for proportions — hallucination rate, escalation rate, conversion rate |
| NormalIndPower().solve_power() | Sample size planning for A/B tests |
| multipletests() | Bonferroni and BH correction for multi-metric canary tests |
| pairwise_tukeyhsd() | Post-hoc pairwise comparisons after ANOVA |

Installation:

pip install statsmodels
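
A minimal sketch of the proportions workflow, sample size planning, and multiple-comparison correction listed above; all counts and effect sizes are illustrative:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest, proportion_confint
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.multitest import multipletests

# Two-proportion z-test: canary vs. baseline escalation rate
counts = np.array([42, 30])        # escalations in canary, baseline
nobs = np.array([1000, 1000])      # requests per arm
z_stat, p_value = proportions_ztest(counts, nobs)

# Wilson CI for the canary escalation rate
ci_low, ci_high = proportion_confint(count=42, nobs=1000, alpha=0.05, method="wilson")

# Sample size per arm for 80% power at alpha = 0.05 (effect size is illustrative)
n_per_arm = NormalIndPower().solve_power(effect_size=0.1, alpha=0.05, power=0.8)

# Bonferroni correction across several canary metric p-values
reject, p_adjusted, _, _ = multipletests([p_value, 0.03, 0.20], method="bonferroni")
print(z_stat, (ci_low, ci_high), n_per_arm, reject)
```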

2.3 NumPy

Role: Numerical foundation — array operations, basic statistics, and manual CI computations.

| Function | Used For |
| --- | --- |
| np.mean(), np.std() | Point estimates for metrics |
| np.percentile() | Bootstrap CI bounds, latency percentiles |
| np.sqrt() | Standard error computations |
| np.random.choice() | Manual bootstrap resampling |
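
A minimal sketch of a manual bootstrap CI built only from the NumPy pieces above (using the Generator API rather than the legacy np.random.choice); the scores are simulated:

```python
import numpy as np

rng = np.random.default_rng(42)
scores = rng.normal(loc=0.85, scale=0.05, size=200)   # stand-in for per-example metric values

# Resample the dataset with replacement and collect the mean of each resample
boot_means = np.array([
    np.mean(rng.choice(scores, size=scores.size, replace=True))
    for _ in range(10_000)
])

# Percentile bootstrap 95% CI
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean={np.mean(scores):.3f}, 95% CI=({ci_low:.3f}, {ci_high:.3f})")
```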

2.4 Pandas

Role: Data manipulation, metric computation, and aggregation before statistical testing.

| Operation | Used For |
| --- | --- |
| groupby().agg() | Segment metrics by intent, locale, model version |
| crosstab() | Build contingency tables for chi-square tests |
| value_counts() | Compute intent distributions for goodness-of-fit tests |
| merge() | Join canary and baseline data by request ID for paired tests |
| resample() | Time-based aggregation for trend CI bands |
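
A minimal sketch of how a contingency table built with crosstab() might feed a chi-square test; the column names and rows are illustrative, not the actual MangaAssist schema:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Tiny illustrative frame: one row per request
df = pd.DataFrame({
    "intent": ["order_status", "refund", "refund", "order_status", "recommendation"],
    "guardrail_outcome": ["pass", "block", "pass", "pass", "pass"],
})

# Contingency table: intent x guardrail outcome
table = pd.crosstab(df["intent"], df["guardrail_outcome"])

# Chi-square test of independence on the table
chi2, p_value, dof, expected = chi2_contingency(table)
print(table, p_value)
```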

2.5 Scikit-learn

Role: Classification metrics, confusion matrices, and inter-rater agreement.

| Function | Used For |
| --- | --- |
| classification_report() | Per-class precision, recall, F1 for intent classifier |
| confusion_matrix() | Routing confusion analysis |
| cohen_kappa_score() | Inter-rater agreement in human audits |
| calibration_curve() | Classifier confidence calibration |
| CalibratedClassifierCV() | Recalibration when confidence scores are unreliable |
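
A minimal sketch of the classification and agreement checks above, with illustrative labels:

```python
from sklearn.metrics import classification_report, confusion_matrix, cohen_kappa_score

# Intent classifier evaluation (labels are illustrative)
y_true = ["order_status", "refund", "refund", "recommendation", "order_status"]
y_pred = ["order_status", "refund", "order_status", "recommendation", "order_status"]

print(classification_report(y_true, y_pred))   # per-class precision, recall, F1
print(confusion_matrix(y_true, y_pred))        # routing confusion analysis

# Inter-rater agreement between two human auditors (1 = acceptable, 0 = not)
rater_a = [1, 1, 0, 1, 0, 1]
rater_b = [1, 0, 0, 1, 0, 1]
print(cohen_kappa_score(rater_a, rater_b))
```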

3. Experiment and Evaluation Platforms

3.1 MLflow

Role: Experiment tracking, metric logging, and model versioning.

| Capability | How It Supported Inference |
| --- | --- |
| mlflow.log_metric() | Logged all golden dataset evaluation metrics per run |
| mlflow.log_param() | Tracked model version, prompt version, retriever config |
| Run comparison UI | Visual comparison of metric distributions across experiments |
| mlflow.evaluate() | Automated evaluation with custom metrics on golden datasets |

Inference connection: MLflow stored the raw metric values that were then fed into statistical tests for significance checking. A prompt change created a new MLflow run, and the evaluation pipeline compared it against the baseline run using t-tests and z-tests.
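
A minimal sketch of the logging side of that flow; the experiment name, run name, and metric keys are illustrative:

```python
import mlflow

mlflow.set_experiment("golden-dataset-eval")   # assumed experiment name

with mlflow.start_run(run_name="prompt-v2"):
    # Configuration that identifies what is being compared
    mlflow.log_param("model_version", "claude-3-5-sonnet")
    mlflow.log_param("prompt_version", "v2")
    # Aggregate evaluation metrics for this run
    mlflow.log_metric("bertscore_f1_mean", 0.871)
    mlflow.log_metric("ndcg_at_5", 0.64)

# The logged values are later retrieved (e.g., via MlflowClient) and compared
# against the baseline run with the t-tests and z-tests described above.
```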

3.2 RAGAS

Role: LLM-specific evaluation — faithfulness, answer relevancy, context relevancy.

| Metric | Statistical Treatment |
| --- | --- |
| Faithfulness score | Bootstrap CIs for confidence bounds |
| Answer relevancy | Paired t-test vs. baseline prompt |
| Context precision | Welch's t-test across retriever versions |
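
As a sketch of the bootstrap treatment above, the snippet below computes a 95% CI for the mean faithfulness score with scipy.stats.bootstrap; the score array is simulated and stands in for the per-row output of a RAGAS evaluation run:

```python
import numpy as np
from scipy.stats import bootstrap

rng = np.random.default_rng(0)
faithfulness = rng.beta(a=9, b=1, size=150)   # illustrative per-sample scores in [0, 1]

# BCa bootstrap CI for the mean faithfulness score
res = bootstrap((faithfulness,), np.mean, confidence_level=0.95,
                n_resamples=10_000, method="BCa")
print(res.confidence_interval)
```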

3.3 DeepEval

Role: Alternative LLM evaluation framework for hallucination detection and rubric-based scoring.

Used alongside RAGAS for cross-validation of evaluation results. When RAGAS and DeepEval disagreed, the team investigated the scoring rubric rather than blindly trusting either.

3.4 PromptFoo

Role: Prompt regression testing and comparison.

| Capability | Inference Connection |
| --- | --- |
| Side-by-side prompt comparison | Generated paired data for paired t-tests |
| Assertion-based testing | Binary pass/fail data for proportion tests |
| Multi-model evaluation | ANOVA across model variants |

4. AWS Infrastructure Tools

4.1 Amazon CloudWatch

Role: Operational metrics collection, dashboarding, and alerting.

| Metric Type | Inference Usage |
| --- | --- |
| Latency percentiles (P50, P95, P99) | Source data for t-tests and KS tests |
| Error rates | Source data for proportion tests and canary checks |
| Request counts | Sample size tracking for power analysis |
| Custom metrics | Published evaluation scores for time-series CI bands |

CloudWatch Anomaly Detection used statistical baselines (2-week rolling window) to flag anomalies — effectively an automated hypothesis test against historical behavior.
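
A minimal sketch of pulling windowed latency statistics with boto3 and comparing two weeks with a KS test; the namespace and metric name are assumptions, not the project's actual values:

```python
from datetime import datetime, timedelta, timezone

import boto3
from scipy.stats import ks_2samp

cw = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

def latency_series(start, end):
    """Fetch 5-minute average latency datapoints for a time window."""
    resp = cw.get_metric_statistics(
        Namespace="MangaAssist",          # assumed namespace
        MetricName="ResponseLatency",     # assumed metric name
        StartTime=start, EndTime=end,
        Period=300, Statistics=["Average"],
    )
    return [dp["Average"] for dp in resp["Datapoints"]]

# Compare last week's latency distribution against the week before
baseline = latency_series(now - timedelta(days=14), now - timedelta(days=7))
current = latency_series(now - timedelta(days=7), now)
stat, p_value = ks_2samp(baseline, current)
print(f"KS={stat:.3f}, p={p_value:.3f}")
```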

4.2 Amazon SageMaker

Role: Model training, hosting, and evaluation pipeline execution.

| Capability | Inference Connection |
| --- | --- |
| Processing jobs | Ran golden dataset evaluations and computed metrics |
| Model registry | Versioned models for paired comparisons |
| Experiments | Tracked A/B test results and evaluation runs |
| Clarify | Bias detection and feature importance (uses statistical tests internally) |

4.3 Amazon Bedrock

Role: LLM inference for Claude 3.5 Sonnet.

Inference connection: Bedrock usage metrics (token counts, latency, throttling) were the raw data for operational t-tests and proportion tests. Cost-per-response analysis used mean CIs from Bedrock billing data.

4.4 Amazon OpenSearch

Role: Vector search for RAG retrieval.

Inference connection: Retrieval metrics (Recall@K, Precision@K, MRR, NDCG) computed from OpenSearch results were compared across retriever configurations using paired t-tests on the golden dataset.


5. Monitoring and Observability Tools

5.1 Prometheus + Grafana

Role: Custom metric collection and visualization for operational dashboards.

| Feature | Inference Support |
| --- | --- |
| Histogram metrics | Latency distribution data for KS tests |
| Counter metrics | Rate data for proportion tests |
| Grafana panels with error bands | Visual CI representation on dashboards |
| Alert rules with for-duration | Implicit sequential testing — alert only if the metric breaches for a sustained period |

5.2 Evidently AI

Role: Data and model monitoring — drift detection.

| Feature | Statistical Method Used |
| --- | --- |
| Data drift report | KS test, chi-square test, Population Stability Index (PSI) |
| Target drift report | Proportion tests on label distribution |
| Classification performance report | Precision/recall/F1 with statistical significance |
| Numerical feature drift | Wasserstein distance, KS statistic |

Evidently automated many of the drift checks that would otherwise require manual scipy calls.
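
A minimal sketch of a drift report, assuming a recent Evidently release with the Report/preset API (the interface has changed across versions); the data frames are simulated:

```python
import numpy as np
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

rng = np.random.default_rng(1)
reference = pd.DataFrame({
    "latency_s": rng.normal(1.2, 0.2, 500),
    "intent": rng.choice(["order_status", "refund", "recommendation"], 500),
})
current = pd.DataFrame({
    "latency_s": rng.normal(1.5, 0.3, 500),
    "intent": rng.choice(["order_status", "refund", "recommendation"], 500, p=[0.3, 0.5, 0.2]),
})

# Runs per-column drift tests (KS for numerical, chi-square for categorical, etc.)
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("drift_report.html")
```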

5.3 WhyLogs

Role: Lightweight data logging and profiling for drift detection.

| Feature | Inference Connection |
| --- | --- |
| Statistical profiling | Captured distribution summaries for each feature per time window |
| Constraint validation | Threshold-based checks equivalent to one-sample tests |
| Drift detection | Chi-square for categorical, KS for numerical features |

6. Custom Tools Built for MangaAssist

6.1 Canary Controller

A custom service that orchestrated the canary rollout and automated statistical decisions.

Key components:

| Component | Statistical Method |
| --- | --- |
| Rate comparator | Two-proportion z-test with pooled variance |
| Multi-metric gate | Bonferroni-corrected p-values across 5 metrics |
| Sequential checker | O'Brien-Fleming alpha-spending function |
| Auto-rollback trigger | p-value threshold breach → immediate revert |
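
Not the production Canary Controller, but a minimal sketch of how the rate comparator and the Bonferroni gate could be combined with statsmodels; the counts and the trailing p-values are illustrative:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.multitest import multipletests

def rate_pvalue(canary_events, canary_n, baseline_events, baseline_n):
    """Two-proportion z-test (statsmodels pools the variance under the null by default)."""
    _, p = proportions_ztest(np.array([canary_events, baseline_events]),
                             np.array([canary_n, baseline_n]))
    return p

# p-values for the 5 gated metrics; the last three are illustrative placeholders
p_values = [
    rate_pvalue(55, 2000, 40, 2000),   # escalation rate
    rate_pvalue(18, 2000, 15, 2000),   # hallucination rate
    0.21, 0.47, 0.03,
]

# Bonferroni-corrected gate: any rejection triggers a rollback
reject, _, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
if reject.any():
    print("gate failed -> trigger rollback")
```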

6.2 Shadow Mode Diff Service

Compared baseline and candidate outputs on the same requests.

Statistical methods embedded:

  • Paired comparisons for text quality scores
  • Chi-square on routing confusion matrices
  • KS test on latency and response length distributions
  • KL divergence for intent distribution shift
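
Not the production diff service, but a minimal sketch of two of the checks above, a KS test on latency and KL divergence on intent distributions, with simulated data:

```python
import numpy as np
from scipy.stats import ks_2samp
from scipy.special import rel_entr

rng = np.random.default_rng(7)

# Latency distribution shift between baseline and candidate on the same requests
baseline_latency = rng.lognormal(mean=0.20, sigma=0.4, size=1000)
candidate_latency = rng.lognormal(mean=0.25, sigma=0.4, size=1000)
ks_stat, ks_p = ks_2samp(baseline_latency, candidate_latency)

# Intent distribution shift: KL(candidate || baseline) over the same categories
baseline_intents = np.array([0.50, 0.30, 0.15, 0.05])
candidate_intents = np.array([0.45, 0.32, 0.17, 0.06])
kl_divergence = rel_entr(candidate_intents, baseline_intents).sum()
print(ks_stat, ks_p, kl_divergence)
```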

6.3 Evaluation Orchestrator

The CI/CD integration that ran golden dataset evaluations and gated pull requests.

Statistical methods used:

  • Fixed-threshold gates (implicit hypothesis tests)
  • Paired t-tests for metric regression detection
  • Bootstrap CIs for bounded metrics (BERTScore, NDCG)
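
Not the production orchestrator, but a minimal sketch of a paired-t-test regression gate combined with a fixed threshold; the scores and the 0.80 floor are illustrative:

```python
import numpy as np
from scipy.stats import ttest_rel

# Per-example golden dataset scores for the same inputs under both versions
baseline_scores = np.array([0.86, 0.91, 0.84, 0.88, 0.90, 0.87, 0.89, 0.85])
candidate_scores = np.array([0.84, 0.90, 0.82, 0.87, 0.88, 0.86, 0.88, 0.83])

# One-sided paired t-test: did the candidate score significantly drop?
t_stat, p_value = ttest_rel(candidate_scores, baseline_scores, alternative="less")

# Gate passes only if no significant regression AND the fixed threshold holds
gate_passed = p_value >= 0.05 and candidate_scores.mean() >= 0.80
print(f"p={p_value:.3f}, gate_passed={gate_passed}")
```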

7. Tool Stack Summary

| Layer | Tools | Statistical Tests Enabled |
| --- | --- | --- |
| Core computation | scipy, statsmodels, numpy, pandas | All hypothesis tests, CIs, power analysis |
| ML evaluation | scikit-learn, MLflow, RAGAS, DeepEval, PromptFoo | Classification metrics, paired evaluations, regression detection |
| AWS infra | CloudWatch, SageMaker, Bedrock, OpenSearch | Latency metrics, cost analysis, retrieval evaluation |
| Monitoring | Prometheus, Grafana, Evidently, WhyLogs | Drift detection, distribution tests, alert-based testing |
| Custom services | Canary controller, shadow diff service, eval orchestrator | Sequential testing, multi-metric correction, automated rollback |

8. How Tools Fit the 4-Layer Evaluation Framework

| Evaluation Layer | Primary Tools | Statistical Methods |
| --- | --- | --- |
| Layer 1: Golden Dataset | MLflow, SageMaker, PromptFoo, scikit-learn | Paired t-tests, bootstrap CIs, threshold gates |
| Layer 2: Shadow Mode | Custom diff service, Evidently, scipy | KS test, chi-square, KL divergence, paired comparison |
| Layer 3: Canary | Custom canary controller, statsmodels, CloudWatch | Z-test, Bonferroni correction, sequential testing |
| Layer 4: Continuous Monitoring | Prometheus, Grafana, Evidently, WhyLogs, CloudWatch | Drift tests, anomaly detection, rolling CIs |