
17. Visualization and Interpretability After Fine-Tuning - Inspecting the MangaAssist Intent Classifier Beyond Accuracy

Problem Statement and MangaAssist Context

The fine-tuned DistilBERT intent classifier in Doc 01 already clears the headline production bar: 92.1% accuracy, strong rare-intent recovery, and sub-15ms P95 latency after Inferentia compilation. That was enough to promote the model, but it was not enough to answer the questions that matter once the model becomes part of a production routing path:

  1. Which layers actually adapted to manga-domain intent boundaries?
  2. Which tokens now drive rare-intent decisions such as escalation, return_request, and promotion?
  3. Are the remaining product_discovery vs recommendation confusions caused by missing data, weak representations, or brittle lexical shortcuts?
  4. Does the exported ONNX plus Inferentia artifact preserve the same structure and routing behavior that the training notebook produced?

Static Mermaid diagrams in Doc 01 explain the architecture, gradient flow, learning-rate schedule, and deployment path well. They do not answer these post-fine-tuning inspection questions. For that, the team applied the same interpretability discipline used in Amazon-scale production ML — where a model that meets top-line metrics can still fail in operationally expensive slices because nobody inspected what it learned internally — and ran a focused set of visualization tools after training, before locking the next retraining pipeline.

The goal was not to produce pretty dashboards. The goal was to determine whether the 10-intent classifier had learned stable routing features for manga jargon, JP-EN mixed phrasing, class-imbalanced rare intents, and multi-intent edge cases without breaking the 15ms P95 serving budget. Each tool below was selected because it answered one operational question that Mermaid alone could not.

Why Mermaid Still Stays in the Series

Mermaid remains the better medium for:

  1. Training-loop structure and pipeline sequencing
  2. Learning-rate schedule and optimizer flow
  3. Loss-landscape intuition and architectural overviews
  4. Deployment and MLOps control flow

The tools in this document are investigative, not explanatory. They help validate whether those static diagrams describe what the fine-tuned model actually learned.


Mathematical Foundations

Several of the inspection tools below rely on formal definitions that are worth stating once before they appear in context. This section covers the five mathematical ideas that drive the analysis: why attention weights are a poor proxy for feature importance, how to attribute predictions to individual tokens, how to measure where fine-tuning actually changed internal representations, how to probe whether intent separability forms at a given layer, and how to localize causally relevant layers with activation patching.

Why Attention Weights Are Not Attributions

The attention-shift Mermaid diagram in Doc 01 shows Head 7, Layer 5 redistributing weight from structural tokens to domain tokens after fine-tuning. That is useful but misleading if read as an importance ranking. The softmax Jacobian explains why.

For logits $z \in \mathbb{R}^n$ and softmax output $a_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$, the Jacobian entry is:

$$\frac{\partial a_i}{\partial z_j} = a_i(\delta_{ij} - a_j)$$

where $\delta_{ij}$ is the Kronecker delta. This means every attention weight $a_i$ depends on every logit $z_j$, not just its own. A token can have low attention weight but high gradient influence through the off-diagonal terms. The practical consequence: attention maps from BertViz show where the model looks, not what drives the prediction. The Captum section uses attribution methods that account for this.
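
A minimal sketch (PyTorch, not from the original analysis) that checks these analytic Jacobian entries against autograd on a toy logit vector and makes the off-diagonal coupling concrete:

```python
import torch

# Toy logits; any values work.
z = torch.tensor([2.0, 0.5, -1.0], requires_grad=True)
a = torch.softmax(z, dim=0)

# Analytic Jacobian: J[i, j] = a_i * (delta_ij - a_j) = diag(a) - a a^T
analytic = torch.diag(a) - torch.outer(a, a)

# Autograd Jacobian, one row per output a_i
rows = [torch.autograd.grad(a[i], z, retain_graph=True)[0] for i in range(3)]
autograd_jac = torch.stack(rows)

assert torch.allclose(analytic, autograd_jac, atol=1e-6)
# The off-diagonal entries -a_i * a_j are nonzero, so every attention
# weight depends on every logit, not just its own.
```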

Integrated Gradients — Path-Based Attribution

For an input $x$, a neutral baseline $x'$ (typically the zero-embedding or [PAD] token embedding), and model output function $F$, Integrated Gradients for feature $i$ is:

$$IG_i(x) = (x_i - x_i') \times \int_{\alpha=0}^{1} \frac{\partial F(x' + \alpha(x - x'))}{\partial x_i} \, d\alpha$$

The integral accumulates gradient contributions along a straight-line path from baseline to input. In practice, the integral is approximated with $m$ Riemann steps:

$$IG_i(x) \approx (x_i - x_i') \times \frac{1}{m} \sum_{k=1}^{m} \frac{\partial F\bigl(x' + \frac{k}{m}(x - x')\bigr)}{\partial x_i}$$

The method satisfies two axioms that raw gradient saliency does not: completeness (attributions sum to the difference $F(x) - F(x')$) and sensitivity (if changing one feature changes the output, that feature receives nonzero attribution). For the MangaAssist classifier, completeness means the attribution budget over all tokens must equal the logit gap between baseline and input, making it possible to compare token importance across different intents on the same scale.
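
A minimal sketch of the Riemann approximation above, assuming `F` maps an embedding tensor to a vector of intent logits (the production analysis used Captum's implementation, shown later):

```python
import torch

def integrated_gradients(F, x, baseline, target, m=64):
    """Approximate IG_i(x) with m Riemann steps along the straight path."""
    total_grad = torch.zeros_like(x)
    for k in range(1, m + 1):
        point = (baseline + (k / m) * (x - baseline)).detach().requires_grad_(True)
        logit = F(point)[target]  # scalar logit for the target intent
        total_grad += torch.autograd.grad(logit, point)[0]
    # Completeness: the result should sum to F(x)[target] - F(baseline)[target].
    return (x - baseline) * total_grad / m
```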

Centered Kernel Alignment (CKA) — Measuring Representational Change

To quantify how much each layer changed during fine-tuning, the team compared pre-trained and fine-tuned representations using CKA. Given two representation matrices $X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^{n \times q}$ (same $n$ examples, different feature spaces), the linear CKA is:

$$\text{CKA}(X, Y) = \frac{\|Y^\top X\|_F^2}{\|X^\top X\|_F \, \|Y^\top Y\|_F}$$

where $\|\cdot\|_F$ is the Frobenius norm. Linear CKA equals 1.0 when the two representations are identical up to an orthogonal transformation and isotropic scaling, and approaches 0.0 when they are unrelated. Unlike raw cosine similarity on individual vectors, CKA compares the geometry of entire representation matrices: it answers "did the space of representations change?" rather than "did one vector move?"

For the MangaAssist classifier, the team computed CKA between pre-trained and fine-tuned [CLS] representations at each layer over the 5,500-example validation set. A low CKA at layer $l$ means that layer's representation was substantially reorganized during fine-tuning.
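
A minimal sketch of the linear CKA computation on two cached [CLS] matrices; column centering is assumed, as in Kornblith et al.:

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """X: (n, p) and Y: (n, q) representations of the same n examples."""
    X = X - X.mean(axis=0, keepdims=True)  # center each feature column
    Y = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    den = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return float(num / den)

# Example (illustrative names): CKA at layer 5 between the two checkpoints
# cka_5 = linear_cka(cls_pretrained[5], cls_finetuned[5])
```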

Linear Probing — Layer-Wise Intent Separability

Linear probing measures how much task-relevant information is accessible at each layer without further nonlinear transformation. For layer $l$, extract the [CLS] hidden state $h^{(l)} \in \mathbb{R}^{768}$ for all validation examples, then train a logistic regression classifier:

$$P(y = c \mid h^{(l)}) = \text{softmax}(W^{(l)} h^{(l)} + b^{(l)})_c$$

where $W^{(l)} \in \mathbb{R}^{10 \times 768}$ and $b^{(l)} \in \mathbb{R}^{10}$ are the probe parameters (not the model's own classification head). The probe's macro F1 on held-out data gives a lower bound on the intent information present at that layer. If probe accuracy jumps sharply between layers $l$ and $l+1$, that transition is where the model builds intent-discriminative features.

Activation Patching — Causal Localization

Activation patching goes beyond correlation to test causal sufficiency. Given a correctly classified example $x_{\text{correct}}$ and a misclassified example $x_{\text{error}}$, the procedure is:

  1. Run both examples through the model, caching all layer hidden states.
  2. Re-run $x_{\text{error}}$ but replace the hidden state at layer $l$ with the corresponding state from $x_{\text{correct}}$.
  3. Measure whether the patched run now produces the correct prediction.

Formally, define the patched logit as:

$$\hat{y}_{\text{patched}}^{(l)} = F\bigl(x_{\text{error}} \mid h^{(l)} \leftarrow h_{\text{correct}}^{(l)}\bigr)$$

If $\hat{y}_{\text{patched}}^{(l)}$ is correct for layer $l = 5$ but not for $l = 1$, then layer 5 carries causal information that layer 1 does not. This is the strongest evidence for localizing where the model's intent discrimination lives, because it tests intervention, not just observation.
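
A minimal sketch of steps 1-3 for a Hugging Face DistilBERT classifier, using a forward hook to swap one layer's output. This is illustrative, not the team's exact pipeline, and it assumes both examples tokenize to the same length (pad to match in practice):

```python
import torch

def patched_prediction(model, tokenizer, text_error, text_correct, layer_idx):
    enc_err = tokenizer(text_error, return_tensors="pt")
    enc_ok = tokenizer(text_correct, return_tensors="pt")

    # Step 1: cache the donor hidden state from the correct example.
    with torch.no_grad():
        donor = model.distilbert(
            **enc_ok, output_hidden_states=True
        ).hidden_states[layer_idx + 1]  # index 0 is the embedding output

    # Step 2: re-run the error example, swapping in the donor state.
    def swap(module, inputs, output):
        # DistilBERT blocks return a tuple whose last element is the
        # hidden state; replace it and keep anything else unchanged.
        return output[:-1] + (donor,)

    handle = model.distilbert.transformer.layer[layer_idx].register_forward_hook(swap)
    try:
        with torch.no_grad():
            logits = model(**enc_err).logits
    finally:
        handle.remove()

    # Step 3: did the patch repair the prediction?
    return logits.argmax(dim=-1).item()
```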


Tool Selection Matrix

| Tool | Primary Purpose | Exact Artifact Inspected | Why It Was Chosen Over Mermaid | Best Mode |
|---|---|---|---|---|
| BertViz | Inspect attention redistribution after fine-tuning | Layer/head attention maps for pre-trained vs fine-tuned DistilBERT | Mermaid can illustrate attention conceptually, but it cannot show which heads now focus on isekai, JP-EN tokens, or ambiguous product phrases | Notebook analysis and one-off investigations |
| TransformerLens | Localize where intent separability emerges | Residual stream states, layer outputs, activation patching traces | Mermaid shows layer order, not where the representation becomes linearly separable for the 10 intents | Notebook analysis and recurring deep dives |
| LIT | Stress-test counterfactual phrasing and confidence shifts | Example-level predictions, saliency, and editable token substitutions | Mermaid cannot show how one token swap changes a routing decision in real time | Recurring validation and debugging |
| Ecco | Inspect token meaning evolution across layers | Layerwise token embedding trajectories and saliency views | Mermaid cannot expose how ambiguous words like "return" or "order" change meaning through the encoder stack | Notebook analysis |
| Captum | Quantify attribution and audit spurious lexical shortcuts | Integrated Gradients and Layer Conductance scores over tokens and layers | Mermaid cannot assign measurable contribution scores to individual tokens or layers | Recurring validation and auditability |
| Netron | Validate exported serving graphs and shapes | ONNX graph, classifier head layout, mask tensors, and compiled-serving input signatures | Mermaid cannot confirm that the production export matches the intended architecture | Pre-deployment validation |

Post-Fine-Tuning Investigation Setup

The inspection workflow reused the same fine-tuned DistilBERT classifier described in Doc 01:

  • 10 intents
  • DistilBERT encoder with 6 layers and 768-d hidden size
  • Focal loss with class weights for class imbalance
  • Strong gains on manga jargon, JP-EN mixed prompts, and multi-intent routing
  • AWS Inferentia deployment target with <15ms P95 latency budget

The team evaluated tools against three representative slices:

| Slice | Example Prompt | Why It Matters |
|---|---|---|
| Manga jargon | "Is this isekai more like Re:Zero or mushoku tensei?" | Measures whether domain tokens acquired stable task meaning |
| JP-EN mix | "Kono manga wa English desu ka for vol 12?" | Tests mixed-script robustness that improved from 52.1% to 81.4% in Doc 01 |
| Ambiguous commerce intent | "I want something darker than Naruto and maybe return my last order too" | Exposes multi-intent and product_discovery vs recommendation confusion |

BertViz - Attention Redistribution After Fine-Tuning

Context

Doc 01 already argued that the top DistilBERT layers adapt most strongly during fine-tuning. The open question was whether attention actually moved toward manga-domain tokens and away from generic structural tokens on the examples the production system still found difficult.

Setup

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from bertviz import head_view
import torch

tokenizer = AutoTokenizer.from_pretrained("./artifacts/intent-distilbert")
model = AutoModelForSequenceClassification.from_pretrained(
    "./artifacts/intent-distilbert",
    output_attentions=True,
)

text = "Kono manga wa English desu ka for vol 12?"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
head_view(outputs.attentions, tokens=tokens)
```

The same prompts were run through the pre-trained checkpoint and the fine-tuned checkpoint to compare head specialization.

Observation

Three patterns appeared consistently:

  1. In Layer 5, two heads that previously focused on punctuation and [SEP] shifted attention toward domain-bearing tokens such as isekai, English, vol, and title fragments.
  2. JP-EN mixed prompts showed a stronger bridge between romanized Japanese fragments and nearby commerce tokens after fine-tuning, instead of treating them as disconnected noise.
  3. The largest remaining ambiguity appeared in product_discovery vs recommendation, where attention maps still split between comparison terms like darker than and explicit request verbs like want something.

On the most informative head for the isekai sample, attention mass on the domain token cluster increased from roughly 0.18 in the pre-trained model to 0.56 after fine-tuning. That was the first concrete sign that the model had not merely memorized labels; it had reorganized attention around intent-bearing evidence.

Impact

BertViz changed two engineering decisions:

  1. The team kept the Mermaid attention diagram in Doc 01 because the interactive view confirmed the static conceptual story was directionally correct.
  2. The retraining set was expanded with more contrastive examples for product_discovery vs recommendation, because the head maps showed that the model still distributed attention across both comparison language and recommendation language on borderline prompts.

Limits

Attention is not a complete explanation. BertViz showed where heads focused, not whether those tokens causally drove the final logit. It was useful as a first diagnostic, not as proof.


TransformerLens - Where Intent Separability Emerges

Context

The team needed to verify a central claim from Doc 01: the top 2-3 layers change most, so discriminative learning rates are justified. Accuracy gains alone do not show where separability actually forms in the residual stream.

Setup

TransformerLens is more natural for decoder-only transformers, so the team used a thin DistilBERT compatibility wrapper to expose hidden states and run TransformerLens-style residual analysis and activation patching.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# layerwise_cls_cache maps layer index -> {"train": ..., "valid": ...}
# arrays of cached [CLS] hidden states; labels holds matching intent labels.
layer_scores = {}

for layer_idx, hidden_states in layerwise_cls_cache.items():
    probe = LogisticRegression(max_iter=2000)
    probe.fit(hidden_states["train"], labels["train"])
    preds = probe.predict(hidden_states["valid"])
    layer_scores[layer_idx] = f1_score(labels["valid"], preds, average="macro")

print(layer_scores)
```

Activation patching then replaced one layer's hidden state from a correct run into an error case to test where the decision boundary could be repaired.

Observation

The analysis combined three complementary measures: CKA between pre-trained and fine-tuned representations at each layer, linear probing accuracy, and activation patching on misclassified examples.

CKA: Where Representations Changed

| Layer | CKA (Pre-trained vs Fine-tuned) | Interpretation |
|---|---|---|
| Embeddings | 0.99 | Nearly unchanged; pre-trained token representations preserved |
| 0 | 0.98 | Minimal adaptation |
| 1 | 0.96 | Minimal adaptation |
| 2 | 0.93 | Slight structural reorganization |
| 3 | 0.88 | Moderate adaptation begins |
| 4 | 0.79 | Substantial reorganization |
| 5 | 0.72 | Heaviest adaptation; intent-discriminative space learned here |

The CKA gradient from 0.99 at embeddings to 0.72 at Layer 5 directly mirrors the discriminative learning-rate decay from Doc 01: $\eta_l = 2 \times 10^{-5} \cdot 0.8^{(6-l)}$. Layers allowed to learn more ($\eta$ closer to base) changed more (CKA further from 1.0).

Linear Probing: Where Intent Separability Emerges

| Layer | Macro F1 ([CLS] Probe) | Interpretation |
|---|---|---|
| 0 | 0.41 | Mostly lexical and positional signal |
| 1 | 0.57 | Early phrase-level grouping |
| 2 | 0.68 | Generic semantic grouping |
| 3 | 0.79 | Emerging intent clusters |
| 4 | 0.88 | Strong intent separability appears |
| 5 | 0.91 | Final boundary sharpening |

A pronounced jump (0.68 → 0.79) occurs between layers 2 and 3, matching where CKA first drops below 0.93. This is where the model transitions from generic language features to task-relevant groupings.

Activation Patching: Causal Layer Localization

Activation patching on misclassified JP-EN samples confirmed the causal story. For 42 misclassified examples from the validation set:

| Patched Layer | Examples Corrected (of 42) | Correction Rate |
|---|---|---|
| 0 | 2 | 4.8% |
| 1 | 3 | 7.1% |
| 2 | 5 | 11.9% |
| 3 | 12 | 28.6% |
| 4 | 27 | 64.3% |
| 5 | 34 | 81.0% |

Swapping Layer 4 or Layer 5 states from a correct example often restored the right prediction, while patching lower layers rarely did. That validated the original learning-rate strategy: lower layers preserved language structure, while the final two layers encoded most of the routing adaptation.

Impact

This analysis changed the training policy in three ways:

  1. The team kept the discriminative learning-rate schedule instead of simplifying to a flat learning rate.
  2. Layer-freezing experiments for cost reduction were limited to lower layers only; freezing Layers 4-5 would have removed most of the domain adaptation.
  3. Regression monitoring was updated to log per-layer linear-probe scores on a fixed validation subset during monthly retraining.

Limits

The wrapper and patching pipeline were notebook-heavy and slower than the rest of the toolchain. This was valuable for deeper investigations, but not something every routine training run should execute end-to-end.


LIT - Counterfactual Debugging for Product-Safety Cases

Context

Production risk was concentrated in a small number of route-sensitive intents. A false negative on escalation can strand a frustrated customer in self-service. A false positive on promotion can send a support question to the wrong flow. The team needed an interactive way to perturb language and watch confidence change.

Setup

The Language Interpretability Tool (LIT) was wired to the fine-tuned classifier and seeded with borderline examples from the golden set.

```python
from lit_nlp.api import model as lit_model
from lit_nlp.api import dataset as lit_dataset

examples = [
    {"text": "I need a real person, this order is wrong again"},
    {"text": "Can I send this volume back and get store credit?"},
    {"text": "Any deals if I buy the whole set?"},
]

# Wrap the borderline examples for LIT; the exact Dataset subclass and
# spec wiring are project-specific.
lit_examples = lit_dataset.IndexedDataset(examples)

# build_lit_app is a project-specific helper that wires the classifier and
# dataset into a LIT server; it is not part of the lit_nlp API.
lit_app = build_lit_app(model_path="./artifacts/intent-distilbert", dataset=lit_examples)
```

Analysts then edited tokens directly inside the UI to compare confidence movement.

Observation

Two brittle patterns appeared quickly:

  1. escalation confidence dropped too sharply when the user expressed frustration indirectly, for example "I need a real person" → "can someone help me directly". The model overweighted explicit human-handoff phrases such as "human" or "agent".
  2. promotion confidence jumped whenever "deal", "discount", or "sale" appeared, even when the message was actually a checkout or return question such as "Can I return this if the sale price changes?"

Counterfactual editing exposed both failures more clearly than aggregate confusion tables because analysts could watch the logit shift after one phrase change instead of waiting for a full retraining cycle.

Impact

The LIT findings produced two operational changes:

  1. The training data was augmented with indirect escalation phrases and mixed-intent support prompts.
  2. A class-specific confidence threshold was added for auto-routing promotion; low-margin cases now fall back to the safer commerce-support path instead of immediately triggering the promo flow (see the sketch after this list).
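
A minimal sketch of that gating logic. The names and the non-promotion default are illustrative; Decision Point 2 later settles the promotion threshold at 0.75:

```python
# Class-specific auto-route thresholds; everything else uses the default.
AUTO_ROUTE_THRESHOLDS = {"promotion": 0.75}
DEFAULT_THRESHOLD = 0.50  # illustrative default, not from the original
FALLBACK_ROUTE = "commerce_support"

def route(intent: str, confidence: float) -> str:
    """Auto-route only when confidence clears the class threshold."""
    threshold = AUTO_ROUTE_THRESHOLDS.get(intent, DEFAULT_THRESHOLD)
    return intent if confidence >= threshold else FALLBACK_ROUTE
```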

Limits

LIT is excellent for example-level debugging but weak as a compact static artifact. Its main value is during review sessions, not as a document embedded inside the repository.


Ecco - Token Behavior Across Layers

Context

Some of the hardest classifier errors were driven by ambiguous tokens rather than entire sentences. Words like return, order, and peak change meaning depending on nearby terms. The team wanted to watch those token representations evolve across the encoder stack.

Setup

Ecco was used in exploratory notebook mode. Because the target model is an encoder classifier rather than a causal language model, the team exported layerwise hidden states and inspected them with an Ecco-style token analysis workflow rather than relying on generation-centric features.

text = "I want to return to this series after volume 8"
encoded = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model.distilbert(**encoded, output_hidden_states=True)

hidden_states = outputs.hidden_states
token_strings = tokenizer.convert_ids_to_tokens(encoded["input_ids"][0])
visualize_token_trajectory(token_strings, hidden_states)
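
`visualize_token_trajectory` is a project-specific helper rather than an Ecco API. A minimal sketch of the underlying computation: for one token, measure how close each layer's state already is to the final-layer state. It assumes the word of interest survives tokenization as a single wordpiece:

```python
import torch.nn.functional as F

def token_trajectory(hidden_states, token_idx):
    """Cosine similarity of each layer's state to the final-layer state."""
    final = hidden_states[-1][0, token_idx]
    return [
        F.cosine_similarity(h[0, token_idx], final, dim=0).item()
        for h in hidden_states
    ]

# Example: trajectory of the ambiguous token "return"
idx = token_strings.index("return")
print(token_trajectory(hidden_states, idx))
```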

Observation

The layerwise trajectories showed a useful pattern:

  1. Early layers treated "return" similarly across commerce and content contexts.
  2. Middle layers started separating the "return" in "return my order" from "return to this series" once nearby noun phrases were integrated.
  3. Upper layers pushed the commerce sense closer to return_request while pushing the reading-progress sense closer to product_discovery or recommendation.

The same effect appeared for domain words such as seinen and isekai: lower layers mostly treated them as unusual lexical items, while upper layers repositioned them into stable domain clusters tied to recommendation and product-query intents.

Impact

Ecco changed the dataset strategy rather than the model architecture:

  1. The team created more minimal pairs around ambiguous verbs (return, order, ship, cancel) to force upper-layer disambiguation on short contexts.
  2. Token-level cluster drift became part of exploratory analysis when a retrained model improved accuracy but behaved strangely on manga jargon.

Limits

Ecco was the least production-ready tool in this stack for an encoder classifier. It was useful for exploratory reasoning, but not robust enough to become a required recurring check.


Captum - Attribution and Auditability

Context

BertViz suggested which tokens attracted attention, but the team still needed a causal attribution view for auditability. This mattered most for rare intents and safety-sensitive routing rules.

Mathematical Anchor: Integrated Gradients

The formal definition and axioms for Integrated Gradients are given in the Mathematical Foundations section above. The key property for auditability is completeness: attributions over all tokens must sum to the logit difference $F(x) - F(x')$, so the attribution budget is fixed and comparable across intents and examples. This makes it possible to ask "which tokens consumed the most attribution budget for escalation?" and get a meaningful answer.

Setup

```python
from captum.attr import IntegratedGradients, LayerConductance

# forward_intent_logit, input_embeddings, baseline_embeddings, and
# target_intent_idx are constructed as sketched after this block.
ig = IntegratedGradients(forward_intent_logit)
attributions = ig.attribute(
    inputs=input_embeddings,
    baselines=baseline_embeddings,
    target=target_intent_idx,
    n_steps=64,
)

# Layer Conductance attributes the same target logit to Layer 4's units.
conductance = LayerConductance(forward_intent_logit, model.distilbert.transformer.layer[4])
layer_attr = conductance.attribute(input_embeddings, target=target_intent_idx)
```
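
The snippet assumes a forward wrapper and embedding-space inputs. A minimal sketch of how those might be constructed for this classifier (illustrative; the label-map lookup assumes `label2id` was saved in the model config):

```python
import torch

def forward_intent_logit(embeds):
    # Run the classifier on precomputed input embeddings; Captum indexes
    # the returned logits with its `target` argument.
    return model(inputs_embeds=embeds).logits

enc = tokenizer("Any deals if I buy the whole set?", return_tensors="pt")
input_embeddings = model.distilbert.embeddings(enc["input_ids"])

# Baseline: the [PAD] embedding at every position (a neutral input).
pad_ids = torch.full_like(enc["input_ids"], tokenizer.pad_token_id)
baseline_embeddings = model.distilbert.embeddings(pad_ids)

target_intent_idx = model.config.label2id["promotion"]
```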

Observation

Captum exposed one especially important shortcut: the model gave disproportionate positive attribution to tokens like deal, discount, and sale, even when surrounding context clearly pointed to checkout or returns. It also showed that explicit frustration markers such as again, still, and wrong carried more reliable escalation signal than generic support words such as help.

Layer Conductance reinforced the TransformerLens result: Layer 4 contributed the largest share of useful attribution mass on hard routing decisions, with Layer 5 sharpening the final class margin.

Impact

Captum changed both training and evaluation:

  1. The team added a lexical-shortcut audit to the monthly model review (sketched after this list). If one token family dominates attribution for an intent, the retrain is flagged for manual inspection.
  2. The golden set was expanded with adversarial examples such as sale return, discount exchange, and help me with promo code refund to make sure the classifier uses surrounding semantics, not just trigger words.
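
A minimal sketch of the audit logic; the token families and the dominance cutoff are illustrative. The idea is to flag an intent when one token family absorbs an outsized share of the total positive attribution over the audit set:

```python
from collections import defaultdict

TOKEN_FAMILIES = {"promo_words": {"deal", "discount", "sale"}}
DOMINANCE_CUTOFF = 0.5  # illustrative share of the attribution budget

def audit_intent(examples):
    """examples: list of (tokens, attributions) pairs for one intent."""
    family_mass, total_mass = defaultdict(float), 0.0
    for tokens, attrs in examples:
        for tok, attr in zip(tokens, attrs):
            if attr <= 0:
                continue  # only positive evidence counts toward the budget
            total_mass += attr
            for family, vocab in TOKEN_FAMILIES.items():
                if tok in vocab:
                    family_mass[family] += attr
    # Families that dominate the budget trigger manual review.
    return {f: m / total_mass for f, m in family_mass.items()
            if total_mass and m / total_mass > DOMINANCE_CUTOFF}
```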

Limits

Attribution values depend on the chosen baseline and on the specific target logit. Captum improved auditability, but it still required human judgment to interpret whether a high-attribution token reflected a valid shortcut or a harmful one.


Netron - Deployment Validation Before Inferentia Serving

Context

The training notebook can look perfect while the exported graph quietly changes the serving contract. Because this classifier routes every request, the deployment artifact needed to be inspected before the Neuron-compiled binary was promoted.

Setup

The team exported the ONNX graph used for Inferentia compilation and inspected it in Netron before and after Neuron conversion.

```bash
python export_intent_classifier_to_onnx.py \
  --checkpoint ./artifacts/intent-distilbert \
  --output ./exports/intent-distilbert.onnx

netron ./exports/intent-distilbert.onnx
```
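
The export script's internals are not shown in the original; a minimal sketch of the core call, assuming a standard `torch.onnx.export` path, is below. The `dynamic_axes` entry is what keeps the sequence length symbolic, which is exactly the property the fixed-length export variant described next had lost:

```python
import torch

model.eval()
model.config.return_dict = False  # export plain tuples, not model outputs
dummy = tokenizer("dummy input", return_tensors="pt")

torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "./exports/intent-distilbert.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch"},
    },
    opset_version=14,
)
```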

Observation

Netron confirmed three expected properties and exposed one practical issue:

  1. The export preserved the 6 encoder blocks, embedding path, and 768 → 10 classification head expected from Doc 01.
  2. The ONNX graph kept the attention-mask branch instead of folding it away incorrectly.
  3. The classifier head dimensions matched the 10-intent label mapping used by the runtime.
  4. One export variant had a fixed sequence-length assumption left over from an optimization script, which would have broken padding behavior for longer JP-EN mixed prompts after compilation.

That last issue would not have appeared in training metrics, but it would have surfaced as an intermittent serving bug.

Impact

Netron changed the release checklist:

  1. Every model export now gets a structural inspection before Inferentia compilation.
  2. Input-shape and label-order assertions were added to the export script (see the sketch after this list).
  3. The team stopped treating ONNX export as a transparent serialization step; it is now a validated transformation stage.
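
A minimal sketch of what such assertions might look like with the `onnx` Python API. The details are illustrative, and label order itself is checked against the runtime mapping separately; only the structural parts are shown:

```python
import onnx

m = onnx.load("./exports/intent-distilbert.onnx")
onnx.checker.check_model(m)

# The sequence dimension must stay symbolic, not a baked-in integer.
seq_dim = m.graph.input[0].type.tensor_type.shape.dim[1]
assert seq_dim.dim_param, "sequence length was frozen to a fixed size"

# The classifier head must emit exactly 10 intent logits.
label_dim = m.graph.output[0].type.tensor_type.shape.dim[-1]
assert label_dim.dim_value == 10, "label dimension mismatch"
```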

Limits

Netron answers structural questions, not behavioral ones. It can confirm that the graph looks right, but it cannot tell whether the logits still reflect the intended semantic behavior without downstream tests.


Synthesis - Which Tools Were Actually Worth It?

| Tool | Interpretability Depth | Ease of Use | Best Audience | Recurring or One-Off | Works Well in Static Markdown? |
|---|---|---|---|---|---|
| BertViz | Medium | Medium | ML engineers, reviewers | One-off after major retrains | Partially |
| TransformerLens | High | Low | Senior ML engineers | Recurring for deep regressions only | No |
| LIT | Medium | High | ML, PM, QA, support leads | Recurring | No |
| Ecco | Medium | Low | ML researchers | One-off exploratory work | No |
| Captum | High | Medium | ML engineers, audit reviewers | Recurring | Yes |
| Netron | Low for semantics, high for structure | High | Platform and deployment engineers | Recurring before release | Yes |

Practical Ranking for This Project

If the team had to keep only three recurring checks for the MangaAssist intent classifier, they would be:

  1. Captum, because it catches spurious lexical shortcuts before they become production bugs.
  2. LIT, because it turns ambiguous routing failures into fast counterfactual debugging sessions.
  3. Netron, because the Inferentia path makes export correctness a deployment requirement, not a convenience.

BertViz remains the best bridge between conceptual diagrams and learned attention behavior. TransformerLens is the highest-value deep tool when the model regresses and the reason is unclear. Ecco is useful when language behavior feels odd but aggregate metrics do not explain why.


Group Discussion: Key Decision Points

Decision Point 1: Which Checks Enter the Recurring Retraining Pipeline?

Jordan (MLOps): I do not want the monthly retrain blocked on heavyweight notebook analysis. The current pipeline runs in 38 minutes on spot instances. If we add TransformerLens probing, that becomes 90+ minutes per run — and longer jobs are more likely to get interrupted on spot. My proposal is a tiered gate.

Priya (ML Engineer): TransformerLens and Captum together gave us the clearest view of where the classifier learned the domain. Layer 4 is doing most of the useful intent separation, Layer 5 sharpens the boundary, and Captum shows where the attribution mass goes on hard examples. If we drop all interpretability checks, we will eventually miss a shortcut-heavy retrain.

| Check | Runtime | Monthly Cost (Spot) | Signal Value |
|---|---|---|---|
| Captum attribution audit | ~8 min | ~$0.02 | Catches lexical shortcuts before production |
| LIT golden-set review | ~5 min (automated) + 30 min (human) | ~$0.01 + analyst time | Validates route-sensitive phrase robustness |
| Netron shape assertion | ~1 min | ~$0.00 | Catches ONNX/Inferentia export mismatches |
| TransformerLens CKA + probing | ~45 min | ~$0.12 | Deep layer-level regression detection |
| BertViz attention review | ~10 min (manual) | ~$0.00 | Visual confirmation of attention redistribution |

Aiko (Data Scientist): I want CKA comparison as a retraining quality gate. If CKA between the new and previous model at Layer 5 drops below 0.65, something fundamentally changed in the representation space and we should investigate before shipping. That is a 45-minute check, but it only blocks the pipeline if it fires — we can run it in parallel with the accuracy gate.

Marcus (Architect): Netron is non-negotiable. The model only matters if the exported graph and Neuron artifact preserve the same contract. We are serving on Inferentia because we need the cost-performance and the <15ms P95 budget. A structurally wrong export can erase every training improvement.

Sam (PM): The total added cost for Captum + LIT + Netron is under $1/month in compute. TransformerLens adds $1.44/year if run monthly. The real cost is analyst time for LIT review — about 30 minutes per retrain. At our team's loaded rate, that is $25/month. Still well under the $50 CPQ threshold.

Resolution: Make Captum attribution audit, LIT golden-set review, and Netron shape assertion part of every monthly retraining gate. Run TransformerLens CKA comparison quarterly or when unexplained accuracy regressions appear. Keep BertViz for milestone retrains and Ecco as optional exploratory analysis.

Decision Point 2: Should the Interpretability Findings Change the Training Data?

Aiko (Data Scientist): Captum found that escalation relies too heavily on human and person. LIT found that promotion fires on deal/discount/sale even inside return and checkout queries. Both are lexical shortcuts, not semantic reasoning. The fix is targeted augmentation: add training examples where those trigger words appear in non-target contexts.

Priya (ML Engineer): I estimated the augmentation scope:

| Augmentation Target | New Examples Needed | Labeling Method | Estimated Cost |
|---|---|---|---|
| Escalation synonym diversity | ~200 paraphrases of indirect frustration | Claude synthetic + human review | ~$120 |
| Promotion false-positive suppression | ~150 deal/discount in non-promo contexts | Sampled from production logs + label | ~$95 |
| JP-EN mixed-intent expansion | ~100 bilingual multi-intent queries | Human-authored (domain-specific) | ~$180 |

Sam (PM): Total labeling cost is ~$395 for a one-time augmentation pass. The current monthly labeling budget is $500 for active-learning samples. We can fold this into one cycle. The CPQ math: if augmentation fixes even half the escalation false negatives (currently ~12% of escalation queries), that is roughly 180 fewer misrouted escalation requests per month. Each misrouted escalation costs ~$2.50 in wasted LLM calls and user friction. That is $450/month in saved cost, paying back the augmentation in under one month.

Jordan (MLOps): I need the augmentation data to go through the same confident-learning quality filter described in Doc 16. No unreviewed synthetic data enters the training set.

Marcus (Architect): The confidence threshold for promotion auto-routing should also be tightened. It is currently 0.70, and LIT showed that borderline cases below 0.65 are almost always wrong. Raise the auto-route threshold to 0.75 and send low-confidence cases to the commerce-support fallback.

Resolution: Proceed with targeted augmentation for escalation synonyms, promotion false-positive suppression, and JP-EN mixed-intent expansion. Total one-time cost: ~$395, folded into the next monthly labeling cycle. Tighten promotion auto-route threshold from 0.70 to 0.75. All augmented data goes through the confident-learning pipeline from Doc 16 before entering the training set.

Decision Point 3: Where Do Mermaid Diagrams Remain the Better Medium?

Marcus (Architect): I want to be clear about what this document does not replace. Doc 01 has six Mermaid diagrams covering the architecture pipeline, attention redistribution concept, training dynamics, learning-rate schedule, loss landscape, and MLOps lifecycle. Those are conceptual explanations — they show how the system is designed, not what one specific trained model does. Mermaid is the right tool for that.

Priya (ML Engineer): Agreed. BertViz and TransformerLens findings are data-dependent — they change every time the model retrains. Mermaid diagrams in Doc 01 are design-intent diagrams that stay stable across retraining cycles. If we tried to put CKA values or attention heatmaps into Mermaid, we would have to update them every month.

Aiko (Data Scientist): There is one exception. The layerwise probe F1 progression (0.41 → 0.91) and CKA values (0.99 → 0.72) are structural findings that would not change drastically between retrains unless we changed the architecture or training setup. Those could reasonably be added to Doc 01 as a supplementary diagram.

Sam (PM): Cost of maintaining Mermaid diagrams is near zero — they are text. Cost of maintaining live visualization outputs is nonzero. Let us keep the distinction: Mermaid for architecture and design intent, tool outputs for post-training validation. This document describes tool outputs textually with enough detail to reproduce, but does not embed interactive artifacts.

Resolution: Keep all six existing Mermaid diagrams in Doc 01 unchanged. This document references them but does not replace them. Tool outputs are described textually with reproduction instructions. The layerwise probe and CKA results are stable enough to be added to Doc 01 in a future update if the team decides they belong in the architecture narrative.


Decisions, Operationalization, and Scope Boundaries

What Becomes Standard

| Check | Frequency | Reason |
|---|---|---|
| Captum attribution audit | Every monthly retrain | Detect lexical shortcuts and rare-intent shortcut regressions |
| LIT counterfactual review | Every monthly retrain | Validate route-sensitive phrase robustness |
| Netron export inspection | Every release candidate | Catch ONNX and Inferentia export mismatches |
| BertViz review | Major model revisions only | Confirm attention redistribution on representative prompts |
| TransformerLens deep dive | Regression investigations | Localize which layer behavior changed |
| Ecco exploratory analysis | Optional | Investigate token-level semantic drift |

Scope Included

This document covers the interpretability and model-inspection workflow for the fine-tuned DistilBERT intent classifier used by MangaAssist.

Scope Excluded

This document does not include full notebook dumps, embedded screenshots, or a replacement for Doc 01. It also does not claim that any single visualization tool proves causality by itself. The value comes from combining multiple tools and tying each finding to an engineering decision.

Final Takeaway

The post-fine-tuning inspection phase answered a more important question than "did accuracy go up?" It answered: what did the model learn, where did it learn it, what shortcuts remain, and did that behavior survive export to production? That is the difference between a classifier that looks good in a report and one that can safely sit on the critical path of a retail chatbot.


Research References

Foundational papers that grounded the analysis methodology in this document:

| Paper | Year | Relevance to This Document |
|---|---|---|
| Sundararajan, Taly & Yan, "Axiomatic Attribution for Deep Networks" | 2017 | Formal basis for Integrated Gradients used in the Captum section. Defines the completeness and sensitivity axioms. |
| Jain & Wallace, "Attention is not Explanation" | 2019 | Motivates why BertViz attention maps are diagnostic, not causal. Justifies the need for attribution methods alongside attention analysis. |
| Wiegreffe & Pinter, "Attention is not not Explanation" | 2019 | Counterpoint to Jain & Wallace: attention is still informative under certain conditions. Contextualizes why the team used BertViz as a first diagnostic despite its theoretical limitations. |
| Kornblith et al., "Similarity of Neural Network Representations Revisited" | 2019 | Defines CKA and shows it is more robust than CCA or raw cosine for comparing layer representations. Used in the TransformerLens section to measure where fine-tuning changed the representation space. |
| Vig, "A Multiscale Visualization of Attention in the Transformer Model" | 2019 | BertViz paper. Provides the head-view and model-view visualizations used to compare pre-trained and fine-tuned attention patterns. |
| Tenney, Das & Pavlick, "BERT Rediscovers the Classical NLP Pipeline" | 2019 | Establishes that BERT layers encode increasingly abstract linguistic features. Grounded the team's expectation that lower layers preserve syntax while upper layers encode task-specific semantics. |
| Nanda & Bloom, "TransformerLens" | 2022 | Mechanistic interpretability library. Inspired the residual analysis and activation patching methodology adapted for the DistilBERT encoder. |
| Kokhlikyan et al., "Captum: A Unified and Generic Model Interpretability Library for PyTorch" | 2020 | Captum library paper. Provides the Integrated Gradients and Layer Conductance implementations used in the attribution audit. |
| Alammar, "Ecco: An Open Source Library for the Exploration and Explanation of Transformer Language Models" | 2021 | Ecco library. Inspired the token-level activation trajectory analysis adapted for the encoder classifier. |

Cross-References

This document connects to the following documents in the Fine-Tuning-Foundational-Models series:

| Document | Connection |
|---|---|
| 01 — Intent Classifier Fine-Tuning | Source of all anchoring data: the 10 intents, DistilBERT architecture, Head 7/Layer 5 attention weights, discriminative learning-rate schedule ($\xi=0.8$), gradient decay ratios, Inferentia compile_for_inferentia function, focal loss configuration, and the six Mermaid diagrams this document references without replacing. |
| 05 — Knowledge Distillation Pipeline | The TransformerLens finding that layers 0-2 can be safely frozen feeds directly into distillation decisions: a student model may not need to replicate the frozen layers' exact behavior if those layers preserved their pre-trained representations. |
| 09 — Training Infrastructure and MLOps | The recurring validation checks from Decision Point 1 (Captum, LIT, Netron in CI; TransformerLens quarterly) integrate into the MLOps pipeline described there. Spot-instance runtime constraints informed the tiered-gate design. |
| 12 — Quantization-Aware Training | INT8 quantization can change attribution patterns by collapsing fine-grained weight distinctions. The Captum audit should be re-run after quantization to verify that attribution stability holds at reduced precision. |
| 16 — Data Curation and Synthetic Generation | The targeted augmentation from Decision Point 2 (escalation synonyms, promotion false-positive suppression) must pass through the confident-learning quality filter described there before entering the training set. |