Interpretability Scenarios After Fine-Tuning - MangaAssist
Interpretability helps the MangaAssist team understand why a fine-tuned model changed behavior before it is promoted. This is especially important for high-risk intents, retrieval failures, and preference-aligned response models.
When This Topic Matters
Use interpretability when:
- a candidate improves accuracy but creates surprising errors,
- rare-class recall changes,
- a model becomes overconfident,
- a LoRA or DPO model changes tone unexpectedly,
- support or escalation routing regresses.
Scenario 1 - Intent Classifier Attention and Attribution
Example message:
"I have asked twice. Can a real person help with my missing order?"
Desired signals:
- "real person" supports
escalation, - "missing order" supports
order_tracking, - "asked twice" supports frustration.
Inspection:
- token attribution for predicted intent,
- top-2 probability and margin,
- layer-wise embedding drift,
- confusion pair review.
Promotion rule:
If attribution for escalation examples depends on irrelevant tokens like title names or punctuation, do not promote until error analysis is complete.
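A minimal sketch of the attribution and margin checks above, assuming a Hugging Face sequence-classification checkpoint; the model path, label mapping, and the gradient-x-input attribution choice are placeholders, not the MangaAssist pipeline itself.

```python
# Hedged sketch: top-2 margin plus gradient-x-input token attribution
# for an intent classifier candidate. MODEL_DIR is a hypothetical path.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_DIR = "mangaassist/intent-candidate"  # placeholder, not a real checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
model.eval()

def attribution_and_margin(text: str):
    enc = tokenizer(text, return_tensors="pt")
    # Embed tokens as a leaf tensor so gradients land on the token embeddings.
    embeds = model.get_input_embeddings()(enc["input_ids"]).detach().requires_grad_(True)
    out = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"])
    probs = out.logits.softmax(dim=-1)[0]
    top2 = probs.topk(2)
    margin = (top2.values[0] - top2.values[1]).item()

    # Gradient-x-input saliency for the predicted intent.
    probs[top2.indices[0]].backward()
    scores = (embeds.grad * embeds).sum(dim=-1)[0].abs().detach()
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    attributions = sorted(zip(tokens, scores.tolist()), key=lambda t: -t[1])
    return int(top2.indices[0]), margin, attributions

pred_id, margin, attributions = attribution_and_margin(
    "I have asked twice. Can a real person help with my missing order?"
)
print(model.config.id2label.get(pred_id, pred_id), round(margin, 3), attributions[:5])
```

If the highest-attribution tokens for escalation examples are punctuation or title names rather than phrases like "real person", that is exactly the promotion blocker described above.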
Scenario 2 - Retrieval Embedding Space Inspection
Inspect whether fine-tuned embeddings form useful neighborhoods.
Clusters should group by:
- theme,
- tone,
- demographic,
- user intent,
- support policy topic.
Clusters should not group mainly by:
- raw popularity,
- title length,
- publisher,
- noisy synthetic template.
Metrics plus visuals:
| Check | Use |
|---|---|
| UMAP/t-SNE | cluster sanity check |
| nearest-neighbor probes | title-level relevance |
| centroid drift | compare old and new adapters |
| hard-negative distance | check whether previously confused titles now separate |
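A minimal sketch of the centroid-drift and nearest-neighbor checks, assuming two embedding matrices (old adapter vs. candidate) exist for the same titles; the array names, label field, and random example data are illustrative only.

```python
# Hedged sketch: centroid drift per theme and nearest-neighbor probes over
# old vs. candidate embedding matrices aligned row-for-row by title.
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def centroid_drift(old, new, labels):
    """Cosine distance between per-label centroids of old vs. new embeddings."""
    old, new = normalize(old), normalize(new)
    labels = np.asarray(labels)
    drift = {}
    for label in np.unique(labels):
        c_old = old[labels == label].mean(axis=0)
        c_new = new[labels == label].mean(axis=0)
        cos = np.dot(c_old, c_new) / (np.linalg.norm(c_old) * np.linalg.norm(c_new))
        drift[str(label)] = 1.0 - float(cos)
    return drift

def nearest_neighbors(embeds, titles, query_idx, k=5):
    """Top-k most similar titles to one probe title, by cosine similarity."""
    embeds = normalize(embeds)
    sims = embeds @ embeds[query_idx]
    order = np.argsort(-sims)
    return [(titles[i], float(sims[i])) for i in order[1 : k + 1]]

# Illustrative shapes only: 1000 titles, 384-dim embeddings.
rng = np.random.default_rng(0)
old_embeds = rng.normal(size=(1000, 384))
new_embeds = old_embeds + 0.05 * rng.normal(size=(1000, 384))
themes = rng.choice(["sports", "thriller", "romance"], size=1000)
titles = [f"title_{i}" for i in range(1000)]

print(centroid_drift(old_embeds, new_embeds, themes))
print(nearest_neighbors(new_embeds, titles, query_idx=42))
```

Large centroid drift for one theme with no change in that theme's recall is a signal to inspect neighborhoods manually before trusting the new adapter.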
Scenario 3 - DPO and LoRA Behavior Inspection
For response models, inspect:
- answer length distribution,
- spoiler words,
- catalog fact claims,
- policy promise language,
- refusal and escalation style,
- retrieval citation use.
Example probe:
"Recommend manga like Monster but no spoilers."
Reject a candidate if it wins style preference but increases spoiler or hallucination risk.
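A minimal sketch of these behavior probes over paired baseline and candidate responses; the spoiler terms, policy-promise phrases, and blocking rule are placeholders for MangaAssist's real lexicons and gates.

```python
# Hedged sketch: compare baseline vs. candidate responses on spoiler rate,
# policy-promise language, and answer length. Term lists are placeholders.
import re
import statistics

SPOILER_TERMS = ["dies", "the killer is", "the ending", "turns out to be"]
POLICY_PROMISES = ["guaranteed refund", "we promise", "always free shipping"]

def hit_rate(responses, terms):
    pattern = re.compile("|".join(map(re.escape, terms)), re.IGNORECASE)
    return sum(bool(pattern.search(r)) for r in responses) / max(len(responses), 1)

def length_summary(responses):
    lengths = sorted(len(r.split()) for r in responses)
    return {
        "median": statistics.median(lengths),
        "p95": lengths[int(0.95 * len(lengths))],
    }

def compare(baseline, candidate):
    report = {
        "spoiler_rate": (hit_rate(baseline, SPOILER_TERMS), hit_rate(candidate, SPOILER_TERMS)),
        "policy_promises": (hit_rate(baseline, POLICY_PROMISES), hit_rate(candidate, POLICY_PROMISES)),
        "length": (length_summary(baseline), length_summary(candidate)),
    }
    # Block promotion if the candidate raises spoiler or policy-promise risk,
    # even when it wins the style preference evaluation.
    report["block"] = (
        report["spoiler_rate"][1] > report["spoiler_rate"][0]
        or report["policy_promises"][1] > report["policy_promises"][0]
    )
    return report
```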
Failure Modes
| Failure | Detection | Fix |
|---|---|---|
| plausible but wrong explanation | attribution not faithful | combine multiple methods |
| visual cluster overread | pretty UMAP, weak metrics | rely on eval plus inspection |
| rare failure hidden | global plots look fine | inspect critical class slices |
| explanation theater | no release decision changes | tie findings to gates |
Inspection Report Skeleton
Model:
Candidate version:
Changed behavior:
Top improved slices:
Top regressed slices:
Critical examples inspected:
Attribution findings:
Embedding neighborhood findings:
Release decision:
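A minimal sketch showing how this skeleton could be captured as a structured record so the release decision is derived from the findings rather than filled in by hand; the field names and blocking rules are illustrative, not an existing MangaAssist schema.

```python
# Hedged sketch: inspection report as a structured record with a release gate
# that is tied directly to the findings, avoiding "explanation theater".
from dataclasses import dataclass, field

@dataclass
class InspectionReport:
    model: str
    candidate_version: str
    changed_behavior: str
    top_improved_slices: list = field(default_factory=list)
    top_regressed_slices: list = field(default_factory=list)
    critical_examples_inspected: int = 0
    attribution_faithful: bool = False   # summarized attribution findings
    neighborhoods_sane: bool = False     # summarized embedding findings

    def release_decision(self) -> str:
        if self.critical_examples_inspected == 0:
            return "block: no critical examples inspected"
        if not self.attribution_faithful or not self.neighborhoods_sane:
            return "block: inspection findings unresolved"
        if self.top_regressed_slices:
            return "block: regressed slices need error analysis"
        return "promote"
```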
Final Decision
Interpretability is not a replacement for evaluation. For MangaAssist, it is the debugging layer that explains why a metric moved and whether the movement is trustworthy enough for production.