Chi-Square Tests in MangaAssist
1. What Chi-Square Tests Are
Chi-square ($\chi^2$) tests evaluate associations between categorical variables. They answer questions like:
- Has the intent distribution shifted after a model change?
- Is guardrail block outcome independent of intent type?
- Does the distribution of CSAT ratings differ between model versions?
- Is escalation rate associated with user locale?
In MangaAssist, where intent classification, guardrail outcomes, and categorical user segments were central, chi-square tests were a critical tool.
2. Chi-Square Test of Independence
2.1 Purpose
Tests whether two categorical variables are independent of each other.
$H_0$: The two variables are independent (no association).
$H_1$: The two variables are associated.
2.2 Formula
$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
Where:
- $O_{ij}$ = observed count in cell $(i, j)$
- $E_{ij}$ = expected count under independence = $\frac{\text{row}_i \text{ total} \times \text{col}_j \text{ total}}{\text{grand total}}$
- Degrees of freedom: $df = (r - 1)(c - 1)$ where $r$ = rows, $c$ = columns
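To make the formula concrete, the following minimal sketch computes the expected counts and the statistic by hand and cross-checks them against SciPy. The 2×3 table is a hypothetical toy example, not MangaAssist data.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x3 table of counts, used only to illustrate the arithmetic.
observed = np.array([
    [30, 50, 20],
    [20, 30, 50],
])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 3)
grand_total = observed.sum()

# E_ij = (row_i total * col_j total) / grand total
expected = row_totals @ col_totals / grand_total

# Chi-square statistic and degrees of freedom, straight from the definitions.
chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Cross-check against SciPy's implementation.
chi2_scipy, p_value, dof_scipy, expected_scipy = chi2_contingency(observed)

print(f"manual: χ² = {chi2_manual:.3f}, df = {dof_manual}")
print(f"scipy:  χ² = {chi2_scipy:.3f}, df = {dof_scipy}, p = {p_value:.6f}")
```

The two results should agree because SciPy applies its continuity correction only when the table has one degree of freedom.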
2.3 MangaAssist Application — Intent × Guardrail Outcome
Question: Is the guardrail block rate independent of intent type, or do certain intents trigger more blocks?
| Intent | Passed | Blocked | Total |
|---|---|---|---|
| recommendation | 4,850 | 150 | 5,000 |
| product_question | 3,920 | 80 | 4,000 |
| faq | 2,900 | 100 | 3,000 |
| order_tracking | 1,980 | 20 | 2,000 |
| chitchat | 1,450 | 50 | 1,500 |
| Total | 15,100 | 400 | 15,500 |
```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows = intents, columns = [Passed, Blocked]
observed = np.array([
    [4850, 150],
    [3920, 80],
    [2900, 100],
    [1980, 20],
    [1450, 50],
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"χ² = {chi2:.2f}")
print(f"p-value = {p_value:.6f}")
print(f"Degrees of freedom = {dof}")
print(f"\nExpected frequencies:\n{expected}")
```
Interpretation: If $p < 0.05$, reject independence: the block rate differs across intents by more than chance would explain. The test itself does not say which intents differ. Comparing the per-intent block rates (here from 1.0% for order_tracking up to 3.3% for faq and chitchat) shows where to target guardrail tuning.
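One common way to see which intents drive a significant result is to look at adjusted standardized residuals, which are approximately standard normal under independence. This is a sketch of that follow-up using the table above. The original analysis does not say it used residuals, so treat this step as an assumption.

```python
import numpy as np
from scipy.stats import chi2_contingency

intents = ["recommendation", "product_question", "faq", "order_tracking", "chitchat"]
# Rows = intents, columns = [Passed, Blocked]
observed = np.array([
    [4850, 150],
    [3920, 80],
    [2900, 100],
    [1980, 20],
    [1450, 50],
])

chi2, p_value, dof, expected = chi2_contingency(observed)

n = observed.sum()
row_prop = observed.sum(axis=1, keepdims=True) / n
col_prop = observed.sum(axis=0, keepdims=True) / n

# Adjusted standardized residual for each cell:
# (O - E) / sqrt(E * (1 - row share) * (1 - column share))
adj_resid = (observed - expected) / np.sqrt(expected * (1 - row_prop) * (1 - col_prop))

block_rate = observed[:, 1] / observed.sum(axis=1)
for name, rate, z in zip(intents, block_rate, adj_resid[:, 1]):
    print(f"{name:<18} block rate = {rate:.2%}  adjusted residual (Blocked) = {z:+.2f}")
```

Residuals with an absolute value above roughly 2 to 3 mark the intents whose block rate departs most from the overall rate. The sign shows whether the intent is blocked more or less often than expected.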
2.4 MangaAssist Application — Escalation × User Locale
Question: Is escalation rate independent of user locale (US, JP, EU)?
| Locale | Escalated | Not Escalated | Total |
|---|---|---|---|
| US | 600 | 4,400 | 5,000 |
| JP | 450 | 2,550 | 3,000 |
| EU | 200 | 1,800 | 2,000 |
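```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows = locales (US, JP, EU), columns = [Escalated, Not Escalated]
observed = np.array([
    [600, 4400],
    [450, 2550],
    [200, 1800],
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"χ² = {chi2:.2f}, p = {p_value:.6f}")
```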
Why this matters: If escalation depends on locale, the chatbot may need locale-specific prompt tuning or FAQ content. A statistically significant association triggers an investigation, not an immediate code change.
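If the overall test is significant, the investigation usually starts by asking which locales differ. This is a minimal sketch of pairwise 2×2 follow-up tests with a Bonferroni-adjusted threshold. The choice of Bonferroni and of $\alpha = 0.05$ is an assumption for illustration, not something the original analysis specifies.

```python
from itertools import combinations

import numpy as np
from scipy.stats import chi2_contingency

locales = ["US", "JP", "EU"]
# Rows = locales, columns = [Escalated, Not Escalated]
observed = np.array([
    [600, 4400],
    [450, 2550],
    [200, 1800],
])

# Escalation rate per locale.
rates = observed[:, 0] / observed.sum(axis=1)
for locale, rate in zip(locales, rates):
    print(f"{locale}: escalation rate = {rate:.1%}")

pairs = list(combinations(range(len(locales)), 2))
alpha = 0.05
alpha_adj = alpha / len(pairs)  # Bonferroni: split alpha across the 3 comparisons

results = {}
for i, j in pairs:
    sub_table = observed[[i, j], :]
    # For 2x2 tables SciPy applies Yates' continuity correction by default.
    chi2, p, _, _ = chi2_contingency(sub_table)
    results[(locales[i], locales[j])] = p
    verdict = "differs" if p < alpha_adj else "no clear difference"
    print(f"{locales[i]} vs {locales[j]}: χ² = {chi2:.2f}, p = {p:.5f} ({verdict})")
```

Dividing $\alpha$ by the number of comparisons keeps the family-wise false-positive rate at or below the original threshold, at the cost of some power.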
3. Chi-Square Goodness-of-Fit Test
3.1 Purpose
Tests whether an observed categorical distribution matches an expected distribution.
$H_0$: The observed distribution matches the expected distribution.
$H_1$: The observed distribution differs from the expected distribution.
3.2 Formula
$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$
Degrees of freedom: $df = k - 1$ where $k$ = number of categories.
3.3 MangaAssist Application — Intent Distribution Drift
Question: Has this week's intent distribution shifted compared to the historical baseline?
| Intent | Historical % | This Week Observed | Expected (based on 10,000 total) |
|---|---|---|---|
| recommendation | 35% | 3,200 | 3,500 |
| product_question | 22% | 2,400 | 2,200 |
| faq | 18% | 1,700 | 1,800 |
| order_tracking | 12% | 1,300 | 1,200 |
| chitchat | 8% | 900 | 800 |
| escalation | 5% | 500 | 500 |
```python
from scipy.stats import chisquare

observed = [3200, 2400, 1700, 1300, 900, 500]
expected = [3500, 2200, 1800, 1200, 800, 500]

chi2, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"χ² = {chi2:.2f}, p = {p_value:.6f}")
```
Why this matters: An intent distribution shift can indicate:
- Seasonal behavior change (holiday traffic patterns)
- External event driving specific queries
- Classifier drift requiring retraining
- Product catalog or policy changes driving different question types
Relationship to KL Divergence: The model evaluation framework also used KL divergence for intent distribution drift. Chi-square is a formal hypothesis test with a p-value, while KL divergence is a continuous measure. They complement each other:
- $\chi^2$ test answers: "Is the shift statistically significant?"
- KL divergence answers: "How large is the shift?"
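The two measures can be computed side by side from the same counts. The sketch below uses the intent table above. The KL direction, KL(observed ‖ baseline) in nats, is an assumption, since the framework's convention is not stated here.

```python
import numpy as np
from scipy.stats import chisquare, entropy

intents = ["recommendation", "product_question", "faq",
           "order_tracking", "chitchat", "escalation"]
baseline_pct = np.array([0.35, 0.22, 0.18, 0.12, 0.08, 0.05])
observed = np.array([3200, 2400, 1700, 1300, 900, 500])
n = observed.sum()

# Significance: goodness-of-fit test against the historical proportions.
expected = baseline_pct * n
chi2, p_value = chisquare(f_obs=observed, f_exp=expected)

# Magnitude: KL(observed || baseline) in nats (direction is an assumption).
observed_pct = observed / n
kl_nats = entropy(observed_pct, baseline_pct)

# Per-intent contribution to chi-square shows which category moved most.
contributions = (observed - expected) ** 2 / expected
top_intent = intents[int(np.argmax(contributions))]

print(f"χ² = {chi2:.2f}, p = {p_value:.2e}")
print(f"KL divergence = {kl_nats:.5f} nats")
print(f"Largest contribution: {top_intent}")
```

With 10,000 messages, even a small divergence can be highly significant, which is why the p-value and the KL magnitude are best read together.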
3.4 MangaAssist Application — CSAT Rating Distribution Change
Question: After a major prompt update, did the CSAT rating distribution change?
| Rating | Before (%) | After (Observed) | Expected |
|---|---|---|---|
| 1 | 5% | 20 | 25 |
| 2 | 8% | 30 | 40 |
| 3 | 15% | 80 | 75 |
| 4 | 32% | 170 | 160 |
| 5 | 40% | 200 | 200 |
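```python
from scipy.stats import chisquare

observed = [20, 30, 80, 170, 200]
expected = [25, 40, 75, 160, 200]

chi2, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"χ² = {chi2:.2f}, p = {p_value:.4f}")
```
Interpretation: With these counts, $\chi^2 \approx 4.46$ on 4 degrees of freedom ($p \approx 0.35$), so the test does not show a significant change in the rating distribution at the 0.05 level.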
4. Chi-Square for Routing Drift Detection
4.1 The Routing Confusion Matrix
Shadow mode produces a routing-difference matrix that compares how baseline and candidate classify the same requests:
| Baseline \ Candidate | Rec | ProdQ | FAQ | Other |
|---|---|---|---|---|
| Rec | 820 | 70 | 10 | 0 |
| ProdQ | 25 | 910 | 15 | 0 |
| FAQ | 8 | 12 | 730 | 0 |
| Other | 2 | 3 | 5 | 370 |
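Question: Are candidate routes associated with baseline routes, and which intent pairs account for the disagreements?
```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows = baseline intent, columns = candidate intent (Rec, ProdQ, FAQ, Other).
# Under perfect agreement every off-diagonal cell would be 0.
observed = np.array([
    [820, 70, 10, 0],
    [25, 910, 15, 0],
    [8, 12, 730, 0],
    [2, 3, 5, 370],
])

# Tests whether candidate labels are independent of baseline labels.
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"χ² = {chi2:.2f}, p = {p_value:.6f}")
```
Because the test of independence rejects whenever the two classifiers agree more often than chance, a very small p-value here confirms agreement. It does not measure departure from perfect agreement. Routing confusion shows up in the off-diagonal cells: the largest off-diagonal counts (here Baseline Rec routed to Candidate ProdQ, 70 requests) identify the intent pairs to target for classifier improvements. The per-cell $(O - E)^2 / E$ contributions are dominated by the diagonal and by off-diagonal cells that fall far below their independence expectation, so they should not be read as a ranking of confusion.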
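A test of independence on an agreement matrix mostly confirms that the two classifiers agree. As a complementary sketch, and not part of the original analysis, the code below computes raw agreement, Cohen's kappa, and Bowker's test of symmetry, which asks whether disagreements flow in one direction more than the other. All three are swapped-in techniques chosen for illustration.

```python
import numpy as np
from scipy.stats import chi2

labels = ["Rec", "ProdQ", "FAQ", "Other"]
# Rows = baseline intent, columns = candidate intent.
observed = np.array([
    [820, 70, 10, 0],
    [25, 910, 15, 0],
    [8, 12, 730, 0],
    [2, 3, 5, 370],
])
n = observed.sum()

# Raw agreement and Cohen's kappa (agreement corrected for chance).
p_observed = np.trace(observed) / n
p_chance = (observed.sum(axis=1) * observed.sum(axis=0)).sum() / n**2
kappa = (p_observed - p_chance) / (1 - p_chance)

# Bowker's test of symmetry: compare each off-diagonal pair (i, j) with (j, i).
stat, dof = 0.0, 0
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        pair_total = observed[i, j] + observed[j, i]
        if pair_total > 0:  # pairs with no disagreements carry no information
            stat += (observed[i, j] - observed[j, i]) ** 2 / pair_total
            dof += 1
p_symmetry = chi2.sf(stat, dof)

# Rank off-diagonal cells by raw count to find the most confused pairs.
off_diag = [(observed[i, j], labels[i], labels[j])
            for i in range(len(labels)) for j in range(len(labels)) if i != j]
off_diag.sort(reverse=True)

print(f"agreement = {p_observed:.3f}, kappa = {kappa:.3f}")
print(f"Bowker χ² = {stat:.2f}, df = {dof}, p = {p_symmetry:.6f}")
print("Top confusions (count, baseline, candidate):", off_diag[:3])
```

Several off-diagonal pairs here have very small counts, so the chi-square approximation behind Bowker's test is rough. Treat its p-value as indicative only.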
5. Assumptions and Limitations
| Assumption | Description | MangaAssist Situation | Mitigation |
|---|---|---|---|
| Expected frequency ≥ 5 | All expected cell counts should be at least 5 | Violated for rare intents or rare guardrail outcomes in fine-grained tables | Combine rare categories or use Fisher's exact test |
| Independent observations | Each observation counted once | Mostly holds — each message is classified once | Ensure no double-counting of multi-turn messages |
| Categorical data | Variables must be categorical, not continuous | Naturally holds for intent, locale, guardrail outcome | Do not discretize continuous variables for chi-square without good reason |
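The expected-frequency assumption can be checked programmatically before trusting a p-value. The helper below is a hypothetical sketch: `association_test` and its `min_expected` threshold are illustrative names, not part of any MangaAssist codebase. It falls back to Fisher's exact test only for 2×2 tables.

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

def association_test(table, min_expected=5.0):
    """Hypothetical helper: pick a test based on the smallest expected count.

    Returns (method_name, p_value). For sparse tables larger than 2x2 it
    returns no p-value and signals that rare categories should be merged.
    """
    table = np.asarray(table)
    _, p_chi2, _, expected = chi2_contingency(table)
    if expected.min() >= min_expected:
        return "chi-square", p_chi2
    if table.shape == (2, 2):
        _, p_fisher = fisher_exact(table)
        return "fisher-exact", p_fisher
    return "merge-rare-categories", None

# Large counts: the chi-square approximation is fine.
print(association_test([[600, 4400], [450, 2550], [200, 1800]]))
# Small 2x2 table: smallest expected count is 4.5, so Fisher's test is used.
print(association_test([[8, 2], [1, 9]]))
# Sparse 3x2 table (hypothetical counts): merge categories before testing.
print(association_test([[3, 1], [2, 2], [1, 4]]))
```

The threshold of 5 follows the rule of thumb in the table above. Stricter or looser variants exist, so the cutoff is a parameter and not a constant.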
When to Use Fisher's Exact Test Instead
For small sample sizes or when expected frequencies fall below 5:
```python
import numpy as np
from scipy.stats import fisher_exact

# 2×2 contingency table with small counts
table = np.array([[8, 2], [1, 9]])
odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")
```
In MangaAssist, Fisher's exact test was used for:
- Rare guardrail block types (e.g., PII leak vs. toxicity) where cell counts were small
- Locale-specific analyses where some segments had few sessions
6. Cramér's V — Effect Size for Chi-Square
Just as Cohen's d quantifies effect size for t-tests, Cramér's V quantifies the strength of association for chi-square tests.
$$V = \sqrt{\frac{\chi^2}{n \cdot \min(r-1, c-1)}}$$
| Cramér's V | Interpretation |
|---|---|
| 0.0 – 0.1 | Negligible association |
| 0.1 – 0.3 | Small association |
| 0.3 – 0.5 | Medium association |
| 0.5+ | Large association |
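```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(observed):
    """Cramér's V for an r×c contingency table of counts."""
    observed = np.asarray(observed)
    # correction=False: skip Yates' correction so 2×2 tables match the formula above
    chi2, _, _, _ = chi2_contingency(observed, correction=False)
    n = observed.sum()
    r, c = observed.shape
    return np.sqrt(chi2 / (n * min(r - 1, c - 1)))

# `observed` is any contingency table defined earlier, e.g. the intent × guardrail array
v = cramers_v(observed)
print(f"Cramér's V = {v:.3f}")
```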
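As a worked example, the sketch below applies the formula to the intent × guardrail table from section 2.3 and cross-checks it against `scipy.stats.contingency.association`. It is an illustration under the counts shown in this document, not a reported MangaAssist result.

```python
import numpy as np
from scipy.stats import chi2_contingency
from scipy.stats.contingency import association

# Intent × guardrail outcome: rows = intents, columns = [Passed, Blocked]
observed = np.array([
    [4850, 150],
    [3920, 80],
    [2900, 100],
    [1980, 20],
    [1450, 50],
])

chi2, p_value, dof, _ = chi2_contingency(observed, correction=False)
n = observed.sum()
r, c = observed.shape

# Cramér's V from the formula above.
v_manual = np.sqrt(chi2 / (n * min(r - 1, c - 1)))

# SciPy's built-in implementation (no bias correction by default).
v_scipy = association(observed, method="cramer")

print(f"χ² = {chi2:.2f}, p = {p_value:.2e}")
print(f"Cramér's V (manual) = {v_manual:.3f}")
print(f"Cramér's V (scipy)  = {v_scipy:.3f}")
```

With 15,500 observations the p-value is very small, yet V lands in the negligible band of the table above. This is why the effect size is worth reporting next to the test result.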
7. Summary
| Test | Application in MangaAssist | What It Answers |
|---|---|---|
| Chi-square test of independence | Intent × guardrail outcome, escalation × locale | Are two categorical variables associated? |
| Chi-square goodness-of-fit | Intent distribution drift, CSAT distribution change | Does the observed distribution match the expected? |
| Chi-square on routing matrix | Shadow mode intent agreement analysis | Which intents are confused after a model change? |
| Fisher's exact test | Small-count guardrail analyses, rare locale combinations | Same as independence test but valid for small samples |
| Cramér's V | Effect size for all chi-square tests | How strong is the categorical association? |