Chi-Square Tests in MangaAssist

1. What Chi-Square Tests Are

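Chi-square ($\chi^2$) tests evaluate hypotheses about categorical data: whether two categorical variables are associated, or whether an observed distribution matches an expected one. They answer questions like: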

Has the intent distribution shifted after a model change?
Is guardrail block outcome independent of intent type?
Does the distribution of CSAT ratings differ between model versions?
Is escalation rate associated with user locale?

In MangaAssist, where intent classification, guardrail outcomes, and categorical user segments were central, chi-square tests were a critical tool.

2. Chi-Square Test of Independence

2.1 Purpose

Tests whether two categorical variables are independent of each other.

$H_0$: The two variables are independent (no association).

$H_1$: The two variables are associated.

2.2 Formula

$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

Where:

- $O_{ij}$ = observed count in cell $(i, j)$
- $E_{ij}$ = expected count under independence = $\frac{\text{row}_i \text{ total} \times \text{col}_j \text{ total}}{\text{grand total}}$
- Degrees of freedom: $df = (r - 1)(c - 1)$ where $r$ = rows, $c$ = columns
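As a minimal sketch of the formula above, the expected counts and the statistic can be computed by hand with NumPy and checked against `chi2_contingency`. The 2×3 table here is made up for illustration and is not MangaAssist data.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x3 contingency table (illustrative counts only)
observed = np.array([
    [30, 20, 50],
    [20, 30, 50],
])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
grand_total = observed.sum()

# E_ij = (row_i total * col_j total) / grand total
expected_manual = np.outer(row_totals, col_totals) / grand_total

# chi^2 = sum over all cells of (O - E)^2 / E
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Cross-check against SciPy (no continuity correction applies when df > 1)
chi2_scipy, p_value, dof_scipy, expected_scipy = chi2_contingency(observed)

print(f"manual chi2 = {chi2_manual:.3f}, scipy chi2 = {chi2_scipy:.3f}")
print(f"manual df = {dof_manual}, scipy df = {dof_scipy}")
```

The two results should agree, which confirms that `chi2_contingency` implements the same cell-by-cell sum.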

2.3 MangaAssist Application — Intent × Guardrail Outcome

Question: Is the guardrail block rate independent of intent type, or do certain intents trigger more blocks?

Intent	Passed	Blocked	Total
recommendation	4,850	150	5,000
product_question	3,920	80	4,000
faq	2,900	100	3,000
order_tracking	1,980	20	2,000
chitchat	1,450	50	1,500
Total	15,100	400	15,500

import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([
    [4850, 150],
    [3920, 80],
    [2900, 100],
    [1980, 20],
    [1450, 50]
])

chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"χ² = {chi2:.2f}")
print(f"p-value = {p_value:.6f}")
print(f"Degrees of freedom = {dof}")
print(f"\nExpected frequencies:\n{expected}")

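Interpretation: If $p < 0.05$, reject $H_0$: the block rate is not the same across intents, so at least one intent is blocked at a significantly different rate than the others. This helps target guardrail tuning to the most affected intent categories. With 15,500 observations even small differences reach significance, so the result should be read alongside the per-intent block rates.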
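To see which intents drive a significant result, one option is to look at per-intent block rates and Pearson residuals, $(O - E) / \sqrt{E}$, for the Blocked column. The sketch below reuses the table above; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

intents = ["recommendation", "product_question", "faq", "order_tracking", "chitchat"]
observed = np.array([
    [4850, 150],
    [3920, 80],
    [2900, 100],
    [1980, 20],
    [1450, 50],
])

chi2, p_value, dof, expected = chi2_contingency(observed)

# Share of each intent's messages that were blocked
block_rate = observed[:, 1] / observed.sum(axis=1)

# Pearson residuals: positive = more blocks than independence predicts
residuals = (observed - expected) / np.sqrt(expected)

for name, rate, resid in zip(intents, block_rate, residuals[:, 1]):
    print(f"{name:18s} block rate = {rate:.2%}, blocked residual = {resid:+.2f}")

# Intent whose Blocked count deviates most from its expected value
most_deviant = intents[int(np.argmax(np.abs(residuals[:, 1])))]
print(f"Largest deviation: {most_deviant}")
```

A large negative residual means the intent is blocked less often than the overall rate would predict; a large positive one means it is blocked more often.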

2.4 MangaAssist Application — Escalation × User Locale

Question: Is escalation rate independent of user locale (US, JP, EU)?

Locale	Escalated	Not Escalated	Total
US	600	4,400	5,000
JP	450	2,550	3,000
EU	200	1,800	2,000

observed = np.array([
    [600, 4400],
    [450, 2550],
    [200, 1800]
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"χ² = {chi2:.2f}, p = {p_value:.6f}")

Why this matters: If escalation depends on locale, the chatbot may need locale-specific prompt tuning or FAQ content. A statistically significant association triggers an investigation, not an immediate code change.
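One possible first step in that investigation is a set of pairwise 2×2 comparisons between locales. The sketch below uses a Bonferroni-adjusted threshold to account for running three tests; this follow-up procedure is an assumption for illustration, not something the analysis above prescribes.

```python
from itertools import combinations

import numpy as np
from scipy.stats import chi2_contingency

locales = ["US", "JP", "EU"]
observed = np.array([
    [600, 4400],   # US: escalated, not escalated
    [450, 2550],   # JP
    [200, 1800],   # EU
])

escalation_rate = observed[:, 0] / observed.sum(axis=1)
for name, rate in zip(locales, escalation_rate):
    print(f"{name}: escalation rate = {rate:.1%}")

pairs = list(combinations(range(len(locales)), 2))
alpha = 0.05
alpha_adjusted = alpha / len(pairs)  # Bonferroni correction for 3 comparisons

pairwise_p = {}
for i, j in pairs:
    sub_table = observed[[i, j], :]
    # For 2x2 tables chi2_contingency applies Yates' correction by default
    _, p, _, _ = chi2_contingency(sub_table)
    pairwise_p[(locales[i], locales[j])] = p
    verdict = "significant" if p < alpha_adjusted else "not significant"
    print(f"{locales[i]} vs {locales[j]}: p = {p:.6f} ({verdict} at {alpha_adjusted:.4f})")
```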

3. Chi-Square Goodness-of-Fit Test

3.1 Purpose

Tests whether an observed categorical distribution matches an expected distribution.

$H_0$: The observed distribution matches the expected distribution.

$H_1$: The observed distribution differs from the expected distribution.

3.2 Formula

$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$

Degrees of freedom: $df = k - 1$ where $k$ = number of categories.

3.3 MangaAssist Application — Intent Distribution Drift

Question: Has this week's intent distribution shifted compared to the historical baseline?

Intent	Historical %	This Week Observed	Expected (based on 10,000 total)
recommendation	35%	3,200	3,500
product_question	22%	2,400	2,200
faq	18%	1,700	1,800
order_tracking	12%	1,300	1,200
chitchat	8%	900	800
escalation	5%	500	500

from scipy.stats import chisquare

observed = [3200, 2400, 1700, 1300, 900, 500]
expected = [3500, 2200, 1800, 1200, 800, 500]

chi2, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"χ² = {chi2:.2f}, p = {p_value:.6f}")
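The expected counts in the table come from multiplying each historical percentage by this week's total. A small sketch of that step, plus the per-intent contributions to $\chi^2$, shows which intents account for most of the shift (variable names are illustrative):

```python
import numpy as np
from scipy.stats import chisquare

intents = ["recommendation", "product_question", "faq",
           "order_tracking", "chitchat", "escalation"]
historical_pct = np.array([35, 22, 18, 12, 8, 5])       # baseline, in percent
observed = np.array([3200, 2400, 1700, 1300, 900, 500])  # this week's counts

# Expected count = baseline share * this week's total volume
total = observed.sum()
expected = historical_pct * total / 100

chi2, p_value = chisquare(f_obs=observed, f_exp=expected)

# Per-category terms of the chi-square sum: (O - E)^2 / E
contributions = (observed - expected) ** 2 / expected
for name, contrib in sorted(zip(intents, contributions), key=lambda t: -t[1]):
    print(f"{name:18s} contribution = {contrib:6.2f}")

top_intent = intents[int(np.argmax(contributions))]
print(f"chi2 = {chi2:.2f}, p = {p_value:.6f}, largest contributor: {top_intent}")
```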

Why this matters: An intent distribution shift can indicate:

- Seasonal behavior change (holiday traffic patterns)
- External event driving specific queries
- Classifier drift requiring retraining
- Product catalog or policy changes driving different question types

Relationship to KL Divergence: The model evaluation framework also used KL divergence for intent distribution drift. Chi-square is a formal hypothesis test with a p-value, while KL divergence is a continuous measure. They complement each other:

$\chi^2$ test answers: "Is the shift statistically significant?"
KL divergence answers: "How large is the shift?"
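A minimal sketch of running both measures on the same data, assuming `scipy.stats.entropy` is used for the KL term (called with two arguments it returns $D_{KL}(p \,\|\, q)$ in nats). Which direction of KL divergence the evaluation framework actually used is not stated here, so observed-versus-baseline is an assumption.

```python
import numpy as np
from scipy.stats import chisquare, entropy

observed = np.array([3200, 2400, 1700, 1300, 900, 500])
expected = np.array([3500, 2200, 1800, 1200, 800, 500])

# Significance: is the shift larger than sampling noise would explain?
chi2, p_value = chisquare(f_obs=observed, f_exp=expected)

# Magnitude: KL(observed || baseline) on the normalized distributions
p_obs = observed / observed.sum()
q_base = expected / expected.sum()
kl = entropy(p_obs, q_base)

print(f"chi2 = {chi2:.2f}, p = {p_value:.6f}")
print(f"KL divergence = {kl:.5f} nats")
```

On large samples the two can disagree in a useful way: the p-value may be tiny while the KL divergence stays small, which signals a real but minor shift.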

3.4 MangaAssist Application — CSAT Rating Distribution Change

Question: After a major prompt update, did the CSAT rating distribution change?

Rating	Before (%)	After (Observed)	Expected
1	5%	20	25
2	8%	30	40
3	15%	80	75
4	32%	170	160
5	40%	200	200

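observed = [20, 30, 80, 170, 200]
expected = [25, 40, 75, 160, 200]  # Before (%) × 500 responses

chi2, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"χ² = {chi2:.2f}, p = {p_value:.6f}")

Interpretation: With these counts, $\chi^2 \approx 4.46$ on 4 degrees of freedom ($p \approx 0.35$), so there is no evidence that the rating distribution changed after the prompt update.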

4. Chi-Square for Routing Drift Detection

4.1 The Routing Confusion Matrix

Shadow mode produces a routing-difference matrix that compares how baseline and candidate classify the same requests:

                    Candidate
              Rec   ProdQ   FAQ   Other
Baseline Rec   820    70     10     0
Baseline ProdQ  25   910     15     0
Baseline FAQ     8    12    730     0
Baseline Other   2     3      5   370

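Question: Are the candidate's routing decisions associated with the baseline's, and where do the two disagree?

A chi-square test of independence cannot test for "perfect agreement". Its null hypothesis is that candidate labels are unrelated to baseline labels, so a small p-value only confirms that the two classifiers are strongly related.

# H0: the candidate label is independent of the baseline label.
# Rejecting H0 is expected here; it does not measure how well the two agree.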
observed = np.array([
    [820, 70, 10, 0],
    [25, 910, 15, 0],
    [8, 12, 730, 0],
    [2, 3, 5, 370]
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"χ² = {chi2:.2f}, p = {p_value:.6f}")

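Because agreement is high, the off-diagonal cells fall well below their expected counts under independence, so their $(O - E)^2 / E$ contributions do not rank confusion. The raw off-diagonal counts and row-normalized rates do: here Rec → ProdQ (70 of 900 baseline Rec requests, about 7.8%) is the largest disagreement, which guides targeted classifier improvements.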
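As a complement to the independence test, agreement can be measured directly. The sketch below computes the raw agreement rate, Cohen's kappa (agreement corrected for chance), and a ranked list of off-diagonal cells. Using kappa here is a suggestion for illustration, not a description of the original shadow-mode pipeline.

```python
import numpy as np

labels = ["Rec", "ProdQ", "FAQ", "Other"]
observed = np.array([
    [820, 70, 10, 0],
    [25, 910, 15, 0],
    [8, 12, 730, 0],
    [2, 3, 5, 370],
])  # rows = baseline label, columns = candidate label

n = observed.sum()

# Fraction of requests routed identically by both models
p_observed = np.trace(observed) / n

# Agreement expected by chance, from the row and column marginals
p_chance = (observed.sum(axis=1) * observed.sum(axis=0)).sum() / n**2

# Cohen's kappa: 1 = perfect agreement, 0 = chance-level agreement
kappa = (p_observed - p_chance) / (1 - p_chance)

# Off-diagonal cells ranked by count, with the share of the baseline row
confusions = [
    (labels[i], labels[j], int(observed[i, j]), observed[i, j] / observed[i].sum())
    for i in range(len(labels))
    for j in range(len(labels))
    if i != j and observed[i, j] > 0
]
confusions.sort(key=lambda item: item[2], reverse=True)

print(f"Agreement = {p_observed:.3f}, Cohen's kappa = {kappa:.3f}")
for base, cand, count, share in confusions[:3]:
    print(f"{base} -> {cand}: {count} requests ({share:.1%} of baseline {base})")
```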

5. Assumptions and Limitations

Assumption	Description	MangaAssist Situation	Mitigation
Expected frequency ≥ 5	All expected cell counts should be at least 5	Violated for rare intents or rare guardrail outcomes in fine-grained tables	Combine rare categories or use Fisher's exact test
Independent observations	Each observation counted once	Mostly holds — each message is classified once	Ensure no double-counting of multi-turn messages
Categorical data	Variables must be categorical, not continuous	Naturally holds for intent, locale, guardrail outcome	Do not discretize continuous variables for chi-square without good reason

When to Use Fisher's Exact Test Instead

For small sample sizes or when expected frequencies fall below 5:

from scipy.stats import fisher_exact

# 2×2 contingency table with small counts
table = np.array([[8, 2], [1, 9]])
odds_ratio, p_value = fisher_exact(table)

In MangaAssist, Fisher's exact test was used for:

- Rare guardrail block types (e.g., PII leak vs. toxicity) where cell counts were small
- Locale-specific analyses where some segments had few sessions
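The choice between the two tests can be automated with a small helper. The function below is a hypothetical sketch (`association_test` and its `min_expected` parameter are not part of SciPy): it falls back to Fisher's exact test only for 2×2 tables whose smallest expected count is under the threshold. Larger tables with sparse cells still need rare categories combined by hand.

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

def association_test(table, min_expected=5):
    """Return (method_name, p_value) for a contingency table of counts.

    Hypothetical helper: uses Fisher's exact test for sparse 2x2 tables,
    otherwise the chi-square test of independence.
    """
    table = np.asarray(table)
    _, p_chi2, _, expected = chi2_contingency(table)
    if table.shape == (2, 2) and expected.min() < min_expected:
        _, p_fisher = fisher_exact(table)
        return "fisher_exact", p_fisher
    return "chi2_contingency", p_chi2

# Sparse 2x2 table: smallest expected count is 4.5, so Fisher's test is used
small_method, small_p = association_test([[8, 2], [1, 9]])
print(f"{small_method}: p = {small_p:.4f}")

# Large table (escalation by locale): chi-square is appropriate
large_method, large_p = association_test([[600, 4400], [450, 2550], [200, 1800]])
print(f"{large_method}: p = {large_p:.6f}")
```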

6. Cramér's V — Effect Size for Chi-Square

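Just as Cohen's d quantifies effect size for t-tests, Cramér's V quantifies the strength of association for chi-square tests of independence. It ranges from 0 (no association) to 1 (perfect association) and, unlike $\chi^2$ itself, does not grow with sample size when the underlying proportions stay the same.

$$V = \sqrt{\frac{\chi^2}{n \cdot \min(r-1, c-1)}}$$

Where $n$ is the total number of observations, and $r$ and $c$ are the numbers of rows and columns. The thresholds in the table are common rules of thumb, not strict cutoffs.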

Cramér's V	Interpretation
0.0 – 0.1	Negligible association
0.1 – 0.3	Small association
0.3 – 0.5	Medium association
0.5+	Large association

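import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(observed):
    """Cramér's V for an r × c contingency table of counts."""
    observed = np.asarray(observed)
    # correction=False: use the uncorrected statistic, since chi2_contingency
    # applies Yates' correction to 2×2 tables by default
    chi2, _, _, _ = chi2_contingency(observed, correction=False)
    n = observed.sum()
    r, c = observed.shape
    return np.sqrt(chi2 / (n * min(r - 1, c - 1)))

# `observed` is any contingency table defined earlier
v = cramers_v(observed)
print(f"Cramér's V = {v:.3f}")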
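Applying the effect size to the two independence tables from Section 2 illustrates why it matters: with thousands of observations a test can be highly significant while the association itself is weak. The sketch below is self-contained and redefines the helper so it can run on its own.

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Cramér's V from the uncorrected chi-square statistic."""
    table = np.asarray(table)
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.sum()
    r, c = table.shape
    return np.sqrt(chi2 / (n * min(r - 1, c - 1)))

# Intent x guardrail outcome (Section 2.3)
intent_guardrail = np.array([
    [4850, 150], [3920, 80], [2900, 100], [1980, 20], [1450, 50],
])
# Escalation x locale (Section 2.4)
escalation_locale = np.array([
    [600, 4400], [450, 2550], [200, 1800],
])

_, p_intent, _, _ = chi2_contingency(intent_guardrail)
_, p_locale, _, _ = chi2_contingency(escalation_locale)
v_intent = cramers_v(intent_guardrail)
v_locale = cramers_v(escalation_locale)

print(f"Intent x guardrail: p = {p_intent:.2e}, V = {v_intent:.3f}")
print(f"Escalation x locale: p = {p_locale:.2e}, V = {v_locale:.3f}")
```

Both tables reject independence, yet both values of V fall in the "negligible" band of the table above, so the differences are real but small in overall strength.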

7. Summary

Test	Application in MangaAssist	What It Answers
Chi-square test of independence	Intent × guardrail outcome, escalation × locale	Are two categorical variables associated?
Chi-square goodness-of-fit	Intent distribution drift, CSAT distribution change	Does the observed distribution match the expected?
Chi-square on routing matrix	Shadow mode intent agreement analysis	Which intents are confused after a model change?
Fisher's exact test	Small-count guardrail analyses, rare locale combinations	Same as independence test but valid for small samples
Cramér's V	Effect size for all chi-square tests	How strong is the categorical association?