Interpretability Scenarios After Fine-Tuning - MangaAssist
Interpretability helps the MangaAssist team understand why a fine-tuned model changed behavior before it is promoted. This is especially important for high-risk intents, retrieval failures, and preference-aligned response models.
When This Topic Matters
Use interpretability when:
- a candidate improves accuracy but creates surprising errors,
- rare-class recall changes,
- a model becomes overconfident,
- a LoRA or DPO model changes tone unexpectedly,
- support or escalation routing regresses.
Scenario 1 - Intent Classifier Attention and Attribution
Example message:
"I have asked twice. Can a real person help with my missing order?"
Desired signals:
- "real person" supports
escalation, - "missing order" supports
order_tracking, - "asked twice" supports frustration.
Inspection:
- token attribution for predicted intent,
- top-2 probability and margin,
- layer-wise embedding drift,
- confusion pair review.
Promotion rule:
If attribution for escalation examples depends on irrelevant tokens like title names or punctuation, do not promote until error analysis is complete.
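A minimal sketch of the attribution and margin checks above, assuming a Hugging Face sequence-classification checkpoint; the model path, label mapping, and the gradient-x-input attribution choice are placeholders, not the MangaAssist pipeline itself.

```python
# Hedged sketch: top-2 margin plus gradient-x-input token attribution
# for an intent classifier candidate. MODEL_DIR is a hypothetical path.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_DIR = "mangaassist/intent-candidate"  # placeholder, not a real checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
model.eval()

def attribution_and_margin(text: str):
    enc = tokenizer(text, return_tensors="pt")
    # Embed tokens as a leaf tensor so gradients land on the token embeddings.
    embeds = model.get_input_embeddings()(enc["input_ids"]).detach().requires_grad_(True)
    out = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"])
    probs = out.logits.softmax(dim=-1)[0]
    top2 = probs.topk(2)
    margin = (top2.values[0] - top2.values[1]).item()

    # Gradient-x-input saliency for the predicted intent.
    probs[top2.indices[0]].backward()
    scores = (embeds.grad * embeds).sum(dim=-1)[0].abs().detach()
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    attributions = sorted(zip(tokens, scores.tolist()), key=lambda t: -t[1])
    return int(top2.indices[0]), margin, attributions

pred_id, margin, attributions = attribution_and_margin(
    "I have asked twice. Can a real person help with my missing order?"
)
print(model.config.id2label.get(pred_id, pred_id), round(margin, 3), attributions[:5])
```

If the highest-attribution tokens for escalation examples are punctuation or title names rather than phrases like "real person", that is exactly the promotion blocker described above.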
Scenario 2 - Retrieval Embedding Space Inspection
Inspect whether fine-tuned embeddings form useful neighborhoods.
Clusters should group by:
- theme,
- tone,
- demographic,
- user intent,
- support policy topic.
Clusters should not group mainly by:
- raw popularity,
- title length,
- publisher,
- noisy synthetic template.
Metrics plus visuals:
| Check | Use |
|---|---|
| UMAP/t-SNE | cluster sanity check |
| nearest-neighbor probes | title-level relevance |
| centroid drift | compare old and new adapters |
| hard-negative distance | check whether previously confused titles now separate |
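A minimal sketch of the centroid-drift and nearest-neighbor checks, assuming two embedding matrices (old adapter vs. candidate) exist for the same titles; the array names, label field, and random example data are illustrative only.

```python
# Hedged sketch: centroid drift per theme and nearest-neighbor probes over
# old vs. candidate embedding matrices aligned row-for-row by title.
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def centroid_drift(old, new, labels):
    """Cosine distance between per-label centroids of old vs. new embeddings."""
    old, new = normalize(old), normalize(new)
    labels = np.asarray(labels)
    drift = {}
    for label in np.unique(labels):
        c_old = old[labels == label].mean(axis=0)
        c_new = new[labels == label].mean(axis=0)
        cos = np.dot(c_old, c_new) / (np.linalg.norm(c_old) * np.linalg.norm(c_new))
        drift[str(label)] = 1.0 - float(cos)
    return drift

def nearest_neighbors(embeds, titles, query_idx, k=5):
    """Top-k most similar titles to one probe title, by cosine similarity."""
    embeds = normalize(embeds)
    sims = embeds @ embeds[query_idx]
    order = np.argsort(-sims)
    return [(titles[i], float(sims[i])) for i in order[1 : k + 1]]

# Illustrative shapes only: 1000 titles, 384-dim embeddings.
rng = np.random.default_rng(0)
old_embeds = rng.normal(size=(1000, 384))
new_embeds = old_embeds + 0.05 * rng.normal(size=(1000, 384))
themes = rng.choice(["sports", "thriller", "romance"], size=1000)
titles = [f"title_{i}" for i in range(1000)]

print(centroid_drift(old_embeds, new_embeds, themes))
print(nearest_neighbors(new_embeds, titles, query_idx=42))
```

Large centroid drift for one theme with no change in that theme's recall is a signal to inspect neighborhoods manually before trusting the new adapter.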
Scenario 3 - DPO and LoRA Behavior Inspection
For response models, inspect:
- answer length distribution,
- spoiler words,
- catalog fact claims,
- policy promise language,
- refusal and escalation style,
- retrieval citation use.
Example probe:
"Recommend manga like Monster but no spoilers."
Reject a candidate if it wins style preference but increases spoiler or hallucination risk.
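A minimal sketch of these behavior probes over paired baseline and candidate responses; the spoiler terms, policy-promise phrases, and blocking rule are placeholders for MangaAssist's real lexicons and gates.

```python
# Hedged sketch: compare baseline vs. candidate responses on spoiler rate,
# policy-promise language, and answer length. Term lists are placeholders.
import re
import statistics

SPOILER_TERMS = ["dies", "the killer is", "the ending", "turns out to be"]
POLICY_PROMISES = ["guaranteed refund", "we promise", "always free shipping"]

def hit_rate(responses, terms):
    pattern = re.compile("|".join(map(re.escape, terms)), re.IGNORECASE)
    return sum(bool(pattern.search(r)) for r in responses) / max(len(responses), 1)

def length_summary(responses):
    lengths = sorted(len(r.split()) for r in responses)
    return {
        "median": statistics.median(lengths),
        "p95": lengths[int(0.95 * len(lengths))],
    }

def compare(baseline, candidate):
    report = {
        "spoiler_rate": (hit_rate(baseline, SPOILER_TERMS), hit_rate(candidate, SPOILER_TERMS)),
        "policy_promises": (hit_rate(baseline, POLICY_PROMISES), hit_rate(candidate, POLICY_PROMISES)),
        "length": (length_summary(baseline), length_summary(candidate)),
    }
    # Block promotion if the candidate raises spoiler or policy-promise risk,
    # even when it wins the style preference evaluation.
    report["block"] = (
        report["spoiler_rate"][1] > report["spoiler_rate"][0]
        or report["policy_promises"][1] > report["policy_promises"][0]
    )
    return report
```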
Failure Modes
| Failure | Detection | Fix |
|---|---|---|
| plausible but wrong explanation | attribution not faithful | combine multiple methods |
| visual cluster overread | pretty UMAP, weak metrics | rely on eval plus inspection |
| rare failure hidden | global plots look fine | inspect critical class slices |
| explanation theater | no release decision changes | tie findings to gates |
Inspection Report Skeleton
Model:
Candidate version:
Changed behavior:
Top improved slices:
Top regressed slices:
Critical examples inspected:
Attribution findings:
Embedding neighborhood findings:
Release decision:
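A minimal sketch showing how this skeleton could be captured as a structured record so the release decision is derived from the findings rather than filled in by hand; the field names and blocking rules are illustrative, not an existing MangaAssist schema.

```python
# Hedged sketch: inspection report as a structured record with a release gate
# that is tied directly to the findings, avoiding "explanation theater".
from dataclasses import dataclass, field

@dataclass
class InspectionReport:
    model: str
    candidate_version: str
    changed_behavior: str
    top_improved_slices: list = field(default_factory=list)
    top_regressed_slices: list = field(default_factory=list)
    critical_examples_inspected: int = 0
    attribution_faithful: bool = False   # summarized attribution findings
    neighborhoods_sane: bool = False     # summarized embedding findings

    def release_decision(self) -> str:
        if self.critical_examples_inspected == 0:
            return "block: no critical examples inspected"
        if not self.attribution_faithful or not self.neighborhoods_sane:
            return "block: inspection findings unresolved"
        if self.top_regressed_slices:
            return "block: regressed slices need error analysis"
        return "promote"
```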
Final Decision
Interpretability is not a replacement for evaluation. For MangaAssist, it is the debugging layer that explains why a metric moved and whether the movement is trustworthy enough for production.