Interpretability Scenarios After Fine-Tuning - MangaAssist

Interpretability helps the MangaAssist team understand why a fine-tuned model changed behavior before it is promoted. This is especially important for high-risk intents, retrieval failures, and preference-aligned response models.

When This Topic Matters

Use interpretability when:

  • a candidate improves accuracy but creates surprising errors,
  • rare-class recall changes,
  • a model becomes overconfident,
  • a LoRA or DPO model changes tone unexpectedly,
  • support or escalation routing regresses.

Scenario 1 - Intent Classifier Attention and Attribution

Example message:

"I have asked twice. Can a real person help with my missing order?"

Desired signals:

  • "real person" supports escalation,
  • "missing order" supports order_tracking,
  • "asked twice" supports frustration.

Inspection:

  • token attribution for predicted intent,
  • top-2 probability and margin,
  • layer-wise embedding drift,
  • confusion pair review.

Promotion rule:

If attribution for escalation examples depends on irrelevant tokens like title names or punctuation, do not promote until error analysis is complete.
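
The cheapest of these checks can be scripted. Below is a minimal sketch, assuming a Hugging Face style classifier checkpoint; the "mangaassist/intent-candidate" path and label names are hypothetical, and word-level occlusion stands in for whatever attribution method the team actually uses.

```python
# Minimal sketch: top-2 margin plus occlusion-style token attribution.
# Model path and labels are hypothetical placeholders for the candidate.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_DIR = "mangaassist/intent-candidate"  # hypothetical candidate checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
model.eval()

text = "I have asked twice. Can a real person help with my missing order?"

def intent_probs(message: str) -> torch.Tensor:
    inputs = tokenizer(message, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    return logits.softmax(dim=-1)

# Top-2 probability and margin: a small margin flags low-confidence predictions.
probs = intent_probs(text)
top2 = torch.topk(probs, k=2)
margin = (top2.values[0] - top2.values[1]).item()
predicted = model.config.id2label[top2.indices[0].item()]
print(f"predicted={predicted} margin={margin:.3f}")

# Occlusion attribution: drop each word and measure how much the predicted
# intent's probability falls. Large drops mark the tokens the model relies on.
words = text.split()
baseline = probs[top2.indices[0]].item()
for i, word in enumerate(words):
    occluded = " ".join(words[:i] + words[i + 1:])
    drop = baseline - intent_probs(occluded)[top2.indices[0]].item()
    print(f"{word:>10s}  drop={drop:+.3f}")
```

If a probe like this shows large drops on punctuation or title names rather than on "real person" or "missing order", that is the signal to hold the candidate for error analysis.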

Scenario 2 - Retrieval Embedding Space Inspection

Inspect whether fine-tuned embeddings form useful neighborhoods.

Clusters should group by:

  • theme,
  • tone,
  • demographic,
  • user intent,
  • support policy topic.

Clusters should not group only by:

  • raw popularity,
  • title length,
  • publisher,
  • noisy synthetic templates.

Metrics plus visuals:

Check                      Use
UMAP / t-SNE               cluster sanity check
nearest-neighbor probes    title-level relevance
centroid drift             compare old and new adapters
hard-negative distance     check whether confusing titles now separate
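
The centroid-drift and hard-negative checks need nothing more than the old and new embedding matrices. The sketch below assumes precomputed embeddings grouped by support-policy topic; the topic keys, shapes, and random stand-in vectors are illustrative, not the real MangaAssist index.

```python
# Minimal sketch of two checks from the table above: centroid drift between the
# old and new adapters, and hard-negative separation.
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def centroid_drift(old_emb: dict, new_emb: dict) -> dict:
    """Cosine distance between per-topic centroids of the old and new adapters."""
    drift = {}
    for topic in old_emb:
        old_c = l2_normalize(old_emb[topic].mean(axis=0))
        new_c = l2_normalize(new_emb[topic].mean(axis=0))
        drift[topic] = float(1.0 - old_c @ new_c)
    return drift

def hard_negative_gap(anchor: np.ndarray, positive: np.ndarray,
                      hard_negative: np.ndarray) -> float:
    """Positive similarity minus hard-negative similarity; it should grow after fine-tuning."""
    a, p, n = (l2_normalize(v) for v in (anchor, positive, hard_negative))
    return float(a @ p - a @ n)

# Example with random stand-in vectors (replace with real title embeddings).
rng = np.random.default_rng(0)
old = {"order_tracking": rng.normal(size=(50, 384))}
new = {"order_tracking": rng.normal(size=(50, 384))}
print(centroid_drift(old, new))
print(hard_negative_gap(rng.normal(size=384), rng.normal(size=384), rng.normal(size=384)))
```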

Scenario 3 - DPO and LoRA Behavior Inspection

For response models, inspect:

  • answer length distribution,
  • spoiler words,
  • catalog fact claims,
  • policy promise language,
  • refusal and escalation style,
  • retrieval citation use.

Example probe:

"Recommend manga like Monster but no spoilers."

Reject a candidate if it wins style preference but increases spoiler or hallucination risk.
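
A lightweight probe harness for these checks might look like the sketch below. The spoiler and policy-promise word lists are placeholders, not the team's curated lexicons, and a keyword scan is only a first pass before manual review.

```python
# Minimal sketch of a behavior probe over candidate responses: answer length,
# spoiler terms, and policy-promise phrasing. Word lists are illustrative only.
import re
from statistics import mean

SPOILER_TERMS = {"dies", "killer", "ending", "twist"}        # hypothetical
POLICY_PROMISES = {"guaranteed refund", "free shipping"}     # hypothetical

def probe_response(answer: str) -> dict:
    tokens = re.findall(r"[a-z']+", answer.lower())
    return {
        "length": len(tokens),
        "spoiler_hits": sorted(set(tokens) & SPOILER_TERMS),
        "policy_promises": [p for p in POLICY_PROMISES if p in answer.lower()],
    }

def summarize(answers: list) -> dict:
    reports = [probe_response(a) for a in answers]
    return {
        "mean_length": mean(r["length"] for r in reports),
        "spoiler_rate": mean(bool(r["spoiler_hits"]) for r in reports),
        "policy_promise_rate": mean(bool(r["policy_promises"]) for r in reports),
    }

answers = [
    "Try 20th Century Boys; it has a similar slow-burn mystery tone.",
    "You will love it once you learn who the killer is in the ending.",
]
print(summarize(answers))
```

Comparing these summaries between the baseline and the candidate makes the "wins style preference but raises spoiler risk" case visible before promotion.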

Failure Modes

Failure                          Detection                     Fix
plausible but wrong explanation  attribution not faithful      combine multiple methods
visual cluster overread          pretty UMAP, weak metrics     rely on eval plus inspection
rare failure hidden              global plots look fine        inspect critical class slices
explanation theater              no release decision changes   tie findings to gates

Inspection Report Skeleton

Model:
Candidate version:
Changed behavior:
Top improved slices:
Top regressed slices:
Critical examples inspected:
Attribution findings:
Embedding neighborhood findings:
Release decision:
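
If the team wants these reports to be comparable across candidates, the skeleton can live as a small data structure. The field names below simply mirror the skeleton, and the example values are invented.

```python
# Minimal sketch: the report skeleton as a dataclass so findings can be stored
# alongside the release decision and diffed between candidates.
from dataclasses import dataclass, field, asdict

@dataclass
class InspectionReport:
    model: str
    candidate_version: str
    changed_behavior: str
    top_improved_slices: list = field(default_factory=list)
    top_regressed_slices: list = field(default_factory=list)
    critical_examples_inspected: int = 0
    attribution_findings: str = ""
    embedding_neighborhood_findings: str = ""
    release_decision: str = "hold"  # "promote", "hold", or "reject"

report = InspectionReport(
    model="intent-classifier",
    candidate_version="lora-candidate-03",  # hypothetical
    changed_behavior="escalation recall up, order_tracking precision down",
)
print(asdict(report))
```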

Final Decision

Interpretability is not a replacement for evaluation. For MangaAssist, it is the debugging layer that explains why a metric moved and whether the movement is trustworthy enough for production.