Chi-Square Tests in MangaAssist

1. What Chi-Square Tests Are

Chi-square ($\chi^2$) tests evaluate associations between categorical variables. They answer questions like:

  • Has the intent distribution shifted after a model change?
  • Is guardrail block outcome independent of intent type?
  • Does the distribution of CSAT ratings differ between model versions?
  • Is escalation rate associated with user locale?

In MangaAssist, where intent classification, guardrail outcomes, and categorical user segments were central, chi-square tests were a critical tool.


2. Chi-Square Test of Independence

2.1 Purpose

Tests whether two categorical variables are independent of each other.

$H_0$: The two variables are independent (no association).

$H_1$: The two variables are associated.

2.2 Formula

$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

Where:

  • $O_{ij}$ = observed count in cell $(i, j)$
  • $E_{ij}$ = expected count under independence = $\frac{\text{row}_i \text{ total} \times \text{col}_j \text{ total}}{\text{grand total}}$
  • Degrees of freedom: $df = (r - 1)(c - 1)$, where $r$ = number of rows and $c$ = number of columns
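The formula can be checked by hand. The sketch below recomputes $E_{ij}$, $\chi^2$, and $df$ with NumPy for the intent × guardrail counts used in section 2.3 and compares them with `chi2_contingency`. The variable names are illustrative, not taken from the MangaAssist codebase.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Intent × guardrail counts from section 2.3; columns are (Passed, Blocked).
observed = np.array([
    [4850, 150],
    [3920, 80],
    [2900, 100],
    [1980, 20],
    [1450, 50],
])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (5, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
grand_total = observed.sum()

# E_ij = (row_i total × col_j total) / grand total, via broadcasting
expected_manual = row_totals * col_totals / grand_total

# χ² = sum over all cells of (O - E)² / E
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Library result for comparison (no continuity correction applies when df > 1)
chi2_scipy, _, dof_scipy, expected_scipy = chi2_contingency(observed)

print(f"manual: χ² = {chi2_manual:.2f}, df = {dof_manual}")
print(f"scipy:  χ² = {chi2_scipy:.2f}, df = {dof_scipy}")
```

The two results should agree to floating-point precision, which confirms that the library call implements the formula above.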

2.3 MangaAssist Application — Intent × Guardrail Outcome

Question: Is the guardrail block rate independent of intent type, or do certain intents trigger more blocks?

| Intent | Passed | Blocked | Total |
|---|---|---|---|
| recommendation | 4,850 | 150 | 5,000 |
| product_question | 3,920 | 80 | 4,000 |
| faq | 2,900 | 100 | 3,000 |
| order_tracking | 1,980 | 20 | 2,000 |
| chitchat | 1,450 | 50 | 1,500 |
| Total | 15,100 | 400 | 15,500 |

import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([
    [4850, 150],
    [3920, 80],
    [2900, 100],
    [1980, 20],
    [1450, 50]
])

chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"χ² = {chi2:.2f}")
print(f"p-value = {p_value:.6f}")
print(f"Degrees of freedom = {dof}")
print(f"\nExpected frequencies:\n{expected}")

Interpretation: If $p < 0.05$, reject $H_0$: the block rate differs across intents, so at least one intent is blocked at a rate different from the others. The test alone does not say which one; comparing per-intent block rates and cell-level deviations from the expected counts shows where to target guardrail tuning.
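One possible way to find the intents that drive a significant result is to look at per-intent block rates and Pearson residuals, $(O - E) / \sqrt{E}$, for the Blocked column. This is a minimal sketch on the table above; the residual-based follow-up is an assumption about how one might proceed, not a documented MangaAssist procedure.

```python
import numpy as np
from scipy.stats import chi2_contingency

intents = ["recommendation", "product_question", "faq", "order_tracking", "chitchat"]
observed = np.array([
    [4850, 150],
    [3920, 80],
    [2900, 100],
    [1980, 20],
    [1450, 50],
])

chi2, p_value, dof, expected = chi2_contingency(observed)

# Share of messages blocked per intent (column 1 = Blocked)
block_rate = observed[:, 1] / observed.sum(axis=1)

# Pearson residuals: signed, scaled deviation of each cell from independence
pearson_resid = (observed - expected) / np.sqrt(expected)

for name, rate, resid in zip(intents, block_rate, pearson_resid[:, 1]):
    print(f"{name:<17} block rate = {rate:.2%}  residual (Blocked) = {resid:+.2f}")
```

On these counts the largest residual in absolute value belongs to order_tracking, whose 1% block rate sits well below the overall rate of about 2.6%.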

2.4 MangaAssist Application — Escalation × User Locale

Question: Is escalation rate independent of user locale (US, JP, EU)?

| Locale | Escalated | Not Escalated | Total |
|---|---|---|---|
| US | 600 | 4,400 | 5,000 |
| JP | 450 | 2,550 | 3,000 |
| EU | 200 | 1,800 | 2,000 |

observed = np.array([
    [600, 4400],
    [450, 2550],
    [200, 1800]
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"χ² = {chi2:.2f}, p = {p_value:.6f}")

Why this matters: If escalation depends on locale, the chatbot may need locale-specific prompt tuning or FAQ content. A statistically significant association triggers an investigation, not an immediate code change.


3. Chi-Square Goodness-of-Fit Test

3.1 Purpose

Tests whether an observed categorical distribution matches an expected distribution.

$H_0$: The observed distribution matches the expected distribution.

$H_1$: The observed distribution differs from the expected distribution.

3.2 Formula

$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$

Degrees of freedom: $df = k - 1$ where $k$ = number of categories.

3.3 MangaAssist Application — Intent Distribution Drift

Question: Has this week's intent distribution shifted compared to the historical baseline?

| Intent | Historical % | This Week (Observed) | Expected (based on 10,000 total) |
|---|---|---|---|
| recommendation | 35% | 3,200 | 3,500 |
| product_question | 22% | 2,400 | 2,200 |
| faq | 18% | 1,700 | 1,800 |
| order_tracking | 12% | 1,300 | 1,200 |
| chitchat | 8% | 900 | 800 |
| escalation | 5% | 500 | 500 |

from scipy.stats import chisquare

observed = [3200, 2400, 1700, 1300, 900, 500]
expected = [3500, 2200, 1800, 1200, 800, 500]

chi2, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"χ² = {chi2:.2f}, p = {p_value:.6f}")
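As a worked check of the goodness-of-fit formula on the counts above (a hand calculation from the table, not output copied from the original analysis):

$$\chi^2 = \frac{(3200-3500)^2}{3500} + \frac{(2400-2200)^2}{2200} + \frac{(1700-1800)^2}{1800} + \frac{(1300-1200)^2}{1200} + \frac{(900-800)^2}{800} + \frac{(500-500)^2}{500}$$

$$\chi^2 \approx 25.71 + 18.18 + 5.56 + 8.33 + 12.50 + 0 = 70.28$$

With $df = 6 - 1 = 5$, the critical value at $\alpha = 0.05$ is about 11.07, so a statistic of this size rejects $H_0$: this week's intent distribution differs from the historical baseline.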

Why this matters: An intent distribution shift can indicate:

  • Seasonal behavior change (holiday traffic patterns)
  • External event driving specific queries
  • Classifier drift requiring retraining
  • Product catalog or policy changes driving different question types

Relationship to KL Divergence: The model evaluation framework also used KL divergence for intent distribution drift. Chi-square is a formal hypothesis test with a p-value, while KL divergence is a descriptive measure of how far apart two distributions are, with no built-in significance threshold. They complement each other:

  • $\chi^2$ test answers: "Is the shift statistically significant?"
  • KL divergence answers: "How large is the shift?"
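The two measures can be computed side by side. The sketch below uses `scipy.stats.entropy`, which returns $KL(p \,\|\, q)$ in nats when given two distributions. The direction (this week relative to baseline) is an assumption, since the source does not state which direction the evaluation framework used.

```python
import numpy as np
from scipy.stats import chisquare, entropy

observed = np.array([3200, 2400, 1700, 1300, 900, 500])  # this week
expected = np.array([3500, 2200, 1800, 1200, 800, 500])  # baseline × 10,000

chi2, p_value = chisquare(f_obs=observed, f_exp=expected)

# Convert counts to proportions before measuring divergence
p = observed / observed.sum()
q = expected / expected.sum()
kl = entropy(p, q)  # KL(p || q), natural log

# Same proportions at one tenth of the traffic: KL is unchanged,
# while the chi-square statistic shrinks by a factor of ten.
chi2_small, p_small = chisquare(f_obs=observed / 10, f_exp=expected / 10)
kl_small = entropy(observed / 10, expected / 10)  # entropy() normalizes inputs

print(f"n = 10,000: χ² = {chi2:.2f}, p = {p_value:.2e}, KL = {kl:.4f}")
print(f"n =  1,000: χ² = {chi2_small:.2f}, p = {p_small:.2e}, KL = {kl_small:.4f}")
```

The comparison illustrates why both are useful: the chi-square statistic scales with sample size, so a small shift becomes significant with enough traffic, while KL divergence depends only on the proportions.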

3.4 MangaAssist Application — CSAT Rating Distribution Change

Question: After a major prompt update, did the CSAT rating distribution change?

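| Rating | Before (%) | After (Observed) | Expected (Before % × 500) |
|---|---|---|---|
| 1 | 5% | 20 | 25 |
| 2 | 8% | 30 | 40 |
| 3 | 15% | 80 | 75 |
| 4 | 32% | 170 | 160 |
| 5 | 40% | 200 | 200 |

# After-update rating counts vs. counts expected from the before-update percentages
observed = [20, 30, 80, 170, 200]
expected = [25, 40, 75, 160, 200]

chi2, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"χ² = {chi2:.2f}, p = {p_value:.6f}")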
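The expected counts need not be typed in by hand. A minimal sketch, assuming the baseline is available as proportions, derives them from the before-update percentages and the after-update sample size, which also guarantees that observed and expected totals match as `chisquare` requires.

```python
import numpy as np
from scipy.stats import chisquare

# Before-update rating shares for ratings 1..5 (from the table above)
before_pct = np.array([0.05, 0.08, 0.15, 0.32, 0.40])

# After-update rating counts for ratings 1..5
after_counts = np.array([20, 30, 80, 170, 200])

# Expected count per rating = baseline share × after-update sample size
expected_counts = before_pct * after_counts.sum()

stat, p_value = chisquare(f_obs=after_counts, f_exp=expected_counts)
dof = len(after_counts) - 1

print(f"expected = {expected_counts}")
print(f"χ² = {stat:.2f}, df = {dof}, p = {p_value:.3f}")
```

For these counts the statistic is about 4.46 on 4 degrees of freedom, which is not significant at the 0.05 level, so the data give no evidence that the rating distribution changed.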

4. Chi-Square for Routing Drift Detection

4.1 The Routing Confusion Matrix

Shadow mode produces a routing-difference matrix that compares how baseline and candidate classify the same requests:

                    Candidate
              Rec   ProdQ   FAQ   Other
Baseline Rec   820    70     10     0
Baseline ProdQ  25   910     15     0
Baseline FAQ     8    12    730     0
Baseline Other   2     3      5   370

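Question: Are baseline and candidate routing decisions associated, and where do they disagree?

# Under perfect agreement every off-diagonal cell would be 0.
# chi2_contingency tests H0: baseline and candidate labels are independent;
# it does not test H0: perfect agreement.
observed = np.array([
    [820, 70, 10, 0],
    [25, 910, 15, 0],
    [8, 12, 730, 0],
    [2, 3, 5, 370]
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"χ² = {chi2:.2f}, p = {p_value:.6f}")

Two classifiers that mostly agree will reject independence decisively, so the p-value here only confirms that the routing decisions are strongly related. Routing confusion shows up in the off-diagonal cells: the largest off-diagonal counts (here Baseline Rec → Candidate ProdQ, 70 requests) identify the intent pairs to target for classifier improvements. Off-diagonal cells sit well below their expected counts under independence, so a large $(O - E)^2 / E$ contribution there reflects agreement rather than confusion.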
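A simple way to locate the intent pairs that disagree most is to rank the off-diagonal cells of the matrix by count and report the overall agreement rate. This is an illustrative sketch on the matrix above; the label list and variable names are assumptions.

```python
import numpy as np

labels = ["Rec", "ProdQ", "FAQ", "Other"]

# Rows = baseline routing, columns = candidate routing
observed = np.array([
    [820, 70, 10, 0],
    [25, 910, 15, 0],
    [8, 12, 730, 0],
    [2, 3, 5, 370],
])

# Share of requests routed identically by both models (diagonal / total)
agreement = np.trace(observed) / observed.sum()

# Collect non-zero off-diagonal cells as (baseline, candidate, count)
off_diag = [
    (labels[i], labels[j], int(observed[i, j]))
    for i in range(len(labels))
    for j in range(len(labels))
    if i != j and observed[i, j] > 0
]
off_diag.sort(key=lambda cell: cell[2], reverse=True)

print(f"Agreement rate = {agreement:.1%}")
for baseline, candidate, count in off_diag[:3]:
    print(f"{baseline} -> {candidate}: {count}")
```

Here the Rec → ProdQ cell dominates the disagreements, which points at the boundary between recommendation and product-question intents as the first place to look.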


5. Assumptions and Limitations

| Assumption | Description | MangaAssist Situation | Mitigation |
|---|---|---|---|
| Expected frequency ≥ 5 | All expected cell counts should be at least 5 | Violated for rare intents or rare guardrail outcomes in fine-grained tables | Combine rare categories or use Fisher's exact test |
| Independent observations | Each observation counted once | Mostly holds: each message is classified once | Ensure no double-counting of multi-turn messages |
| Categorical data | Variables must be categorical, not continuous | Naturally holds for intent, locale, guardrail outcome | Do not discretize continuous variables for chi-square without good reason |
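The expected-frequency assumption can be checked before running the test. The sketch below uses `scipy.stats.contingency.expected_freq` to compute the expected counts under independence; the helper name `min_expected_count` and the small 2×2 table are hypothetical.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

def min_expected_count(observed):
    """Smallest expected cell count under independence."""
    return float(expected_freq(np.asarray(observed)).min())

# Intent × guardrail table from section 2.3: large counts everywhere
large = np.array([
    [4850, 150],
    [3920, 80],
    [2900, 100],
    [1980, 20],
    [1450, 50],
])

# Hypothetical small 2×2 table with few observations
small = np.array([[8, 2], [1, 9]])

for name, table in [("large", large), ("small", small)]:
    m = min_expected_count(table)
    verdict = "chi-square is reasonable" if m >= 5 else "consider Fisher's exact test"
    print(f"{name}: min expected = {m:.2f} -> {verdict}")
```

The threshold of 5 is the rule of thumb from the table above, applied here as a hard cutoff for simplicity.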

When to Use Fisher's Exact Test Instead

For small sample sizes or when expected frequencies fall below 5:

from scipy.stats import fisher_exact

# 2×2 contingency table with small counts
table = np.array([[8, 2], [1, 9]])
odds_ratio, p_value = fisher_exact(table)

In MangaAssist, Fisher's exact test was used for:

  • Rare guardrail block types (e.g., PII leak vs. toxicity) where cell counts were small
  • Locale-specific analyses where some segments had few sessions


6. Cramér's V — Effect Size for Chi-Square

Just as Cohen's d quantifies effect size for t-tests, Cramér's V quantifies the strength of association for chi-square tests.

$$V = \sqrt{\frac{\chi^2}{n \cdot \min(r-1, c-1)}}$$

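| Cramér's V | Interpretation |
|---|---|
| 0.0 – 0.1 | Negligible association |
| 0.1 – 0.3 | Small association |
| 0.3 – 0.5 | Medium association |
| 0.5+ | Large association |

import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(observed):
    """Cramér's V for an r × c contingency table of counts."""
    observed = np.asarray(observed)
    # correction=False skips the Yates continuity correction that
    # chi2_contingency applies to 2×2 tables by default.
    chi2, _, _, _ = chi2_contingency(observed, correction=False)
    n = observed.sum()
    r, c = observed.shape
    return np.sqrt(chi2 / (n * min(r - 1, c - 1)))

# Intent × guardrail outcome table from section 2.3
observed = np.array([
    [4850, 150], [3920, 80], [2900, 100], [1980, 20], [1450, 50]
])
v = cramers_v(observed)
print(f"Cramér's V = {v:.3f}")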
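Effect size matters most when the sample is large. As an illustration, the sketch below computes Cramér's V for the escalation × locale table from section 2.4, both from the formula and with `scipy.stats.contingency.association` (available in recent SciPy releases), so the two can be compared.

```python
import numpy as np
from scipy.stats import chi2_contingency
from scipy.stats.contingency import association

# Escalation × locale counts from section 2.4; columns are (Escalated, Not Escalated)
observed = np.array([
    [600, 4400],   # US
    [450, 2550],   # JP
    [200, 1800],   # EU
])

chi2, p_value, dof, _ = chi2_contingency(observed)

# Cramér's V from the formula above
n = observed.sum()
r, c = observed.shape
v_manual = np.sqrt(chi2 / (n * min(r - 1, c - 1)))

# Same quantity from SciPy's built-in association measure
v_scipy = association(observed, method="cramer")

print(f"χ² = {chi2:.2f}, p = {p_value:.2e}")
print(f"Cramér's V (formula) = {v_manual:.3f}, (scipy) = {v_scipy:.3f}")
```

With 10,000 sessions the test is highly significant, yet V comes out at roughly 0.05, which falls in the negligible band of the table above. A result like this supports investigating locale differences without treating them as a large effect.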

7. Summary

| Test | Application in MangaAssist | What It Answers |
|---|---|---|
| Chi-square test of independence | Intent × guardrail outcome, escalation × locale | Are two categorical variables associated? |
| Chi-square goodness-of-fit | Intent distribution drift, CSAT distribution change | Does the observed distribution match the expected? |
| Chi-square on routing matrix | Shadow mode intent agreement analysis | Are baseline and candidate routing associated, and which intent pairs disagree? |
| Fisher's exact test | Small-count guardrail analyses, rare locale combinations | Same as independence test but valid for small samples |
| Cramér's V | Effect size for chi-square tests of independence | How strong is the categorical association? |