01. Intent Classifier Fine-Tuning — DistilBERT for MangaAssist
Problem Statement and MangaAssist Context
MangaAssist routes every user message through an intent classifier before deciding which downstream service to call. The classifier must map messages like "Show me horror manga", "Where is my order?", or "Something like Naruto but darker" to one of 10 intents in under 15ms at P95. A wrong classification sends the user down the wrong path entirely — recommending manga when they asked about returns, or triggering an order lookup when they wanted discovery.
The pre-trained DistilBERT model achieves 83.2% accuracy on our manga retail domain out of the box. That is not good enough — 16.8% of messages get misrouted, causing user frustration and unnecessary LLM calls. Fine-tuning on domain-specific data pushes accuracy to 92.1%, reducing misroutes by more than half.
This document covers the full fine-tuning pipeline: from the math behind cross-entropy and focal loss, through the internal mechanics of how DistilBERT's 6 transformer layers change during training, to production deployment on AWS Inferentia.
The 10 Intents
| Intent | Example | Frequency |
|---|---|---|
| product_discovery | "Show me horror manga" | 22% |
| product_question | "Is this in English?" | 15% |
| recommendation | "Something like Naruto" | 18% |
| faq | "What's the return policy?" | 8% |
| order_tracking | "Where is my order?" | 12% |
| return_request | "I want to return this" | 7% |
| promotion | "Any deals on manga?" | 5% |
| checkout_help | "Can I use gift cards?" | 4% |
| escalation | "Talk to a human" | 3% |
| chitchat | "Hello", "Thanks" | 6% |
Why This Is Hard for Manga Retail
Manga retail traffic differs from the text generic NLP models are pre-trained on in four ways:
- Manga jargon (22% of queries): "tankōbon", "shōnen jump", "isekai", "seinen" — tokens that appear rarely or never in pre-training corpora
- Slang (15%): "Is this peak?", "W manga", "mid" — internet-era expressions with domain-specific meaning
- Multi-intent (18%): "I want to return this and find something better" — two intents in one message
- Japanese-English mixing (12%): "この漫画は英語ですか?" mixed with "Is vol 12 out?"
Mathematical Foundations
Cross-Entropy Loss — The Starting Point
For a single training example with true class $y$ and predicted probability distribution $\hat{y}$, the cross-entropy loss is:
$$\mathcal{L}_{CE} = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)$$
For one-hot encoded labels (which our intent classifier uses), this simplifies to:
$$\mathcal{L}_{CE} = -\log(\hat{y}_{y_{\text{true}}})$$
where $\hat{y}_{y_{\text{true}}}$ is the predicted probability for the correct class.
Intuition: Cross-entropy measures how surprised the model is by the correct answer. If the model assigns probability 0.9 to the correct class, the loss is $-\log(0.9) = 0.105$. If it assigns 0.1, the loss is $-\log(0.1) = 2.303$ — much higher. The logarithm creates an asymmetric penalty: being confidently wrong is punished far more than being uncertain.
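These numbers are easy to verify; a minimal sketch of the one-hot case:

```python
import math

def cross_entropy(p_true: float) -> float:
    """Cross-entropy with a one-hot label: -log(probability assigned to the true class)."""
    return -math.log(p_true)

confident = cross_entropy(0.9)   # model is right and confident -> small loss (~0.105)
uncertain = cross_entropy(0.1)   # model puts little mass on the truth -> large loss (~2.303)
```

Note the roughly 22x gap between the two losses, even though the probability only changed by 0.8.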
Softmax Temperature and Its Effect on Gradients
The softmax function converts raw logits $z_i$ into probabilities:
$$\hat{y}_i = \frac{e^{z_i / T}}{\sum_{j=1}^{C} e^{z_j / T}}$$
where $T$ is the temperature parameter.
- $T = 1$ (default): Standard softmax. Distribution reflects model confidence directly.
- $T < 1$ (sharpening): Makes the distribution peakier. The model becomes more confident.
- $T > 1$ (smoothing): Flattens the distribution. All classes get more similar probabilities.
Why temperature matters for fine-tuning: During early fine-tuning, the model is not yet adapted to manga domain. Using $T > 1$ (e.g., 1.5) during the first few epochs prevents the model from becoming overconfident on wrong predictions, which would create large gradients that destabilize training. As training progresses and the model learns the domain, we anneal $T \to 1$.
Gradient of softmax with temperature:
$$\frac{\partial \hat{y}_i}{\partial z_j} = \frac{1}{T} \hat{y}_i (\delta_{ij} - \hat{y}_j)$$
The $\frac{1}{T}$ factor means higher temperature directly reduces gradient magnitude, acting as an implicit learning rate dampener.
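A small plain-Python sketch (illustrative logit values only) makes the sharpening/smoothing effect concrete:

```python
import math

def softmax_t(logits, T=1.0):
    """Temperature-scaled softmax: divide each logit by T, then normalize."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.0, 1.0, 0.5, 0.2]            # raw scores for four hypothetical intents
standard = softmax_t(logits, T=1.0)
smoothed = softmax_t(logits, T=1.5)      # early fine-tuning: flatter, smaller gradients
sharpened = softmax_t(logits, T=0.5)     # T < 1: peakier, more confident
```

The top class keeps the highest probability at every temperature; only the spread between classes changes.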
Focal Loss — Handling Class Imbalance
Our intent distribution is heavily imbalanced: product_discovery (22%) vs escalation (3%). Standard cross-entropy gives equal weight per sample, so the model optimizes for majority classes and underperforms on rare intents.
Focal loss (Lin et al., 2017) adds a modulating factor:
$$\mathcal{L}_{FL} = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$
where:
- $p_t$ is the predicted probability for the true class
- $\gamma$ is the focusing parameter (typically 2.0)
- $\alpha_t$ is the class weight for class $t$
Breaking down the modulating factor $(1 - p_t)^{\gamma}$:
| $p_t$ (model confidence) | $(1-p_t)^2$ | Effect |
|---|---|---|
| 0.9 (easy, well-classified) | 0.01 | Loss reduced by 100x — model stops wasting gradients on easy examples |
| 0.5 (uncertain) | 0.25 | Moderate loss — model still learns from these |
| 0.1 (hard, misclassified) | 0.81 | Near-full loss — model focuses learning here |
Intuition: Focal loss makes the model stop paying attention to examples it already classifies well (like "Hello" → chitchat) and focus its gradient budget on hard examples (like "Is this isekai peak or mid?" → product_question vs recommendation).
Our class weights $\alpha_t$:
$$\alpha_t = \frac{1}{\text{freq}(t)} \cdot \frac{1}{\sum_{c=1}^{C} \frac{1}{\text{freq}(c)}}$$
This gives escalation (3% frequency) about 7.3x the weight of product_discovery (22%). Combined with $\gamma = 2$, the effective gradient signal for a hard rare-class example ($p_t = 0.1$, modulating factor 0.81) is roughly $7.3 \times \frac{0.81}{0.01} \approx 590\times$ that of an easy majority-class example ($p_t = 0.9$, modulating factor 0.01).
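A quick sketch to check these numbers, with the frequencies taken from the intent table above (the exact amplification depends on which pair of $p_t$ values you compare):

```python
freqs = {
    "product_discovery": 0.22, "product_question": 0.15, "recommendation": 0.18,
    "faq": 0.08, "order_tracking": 0.12, "return_request": 0.07, "promotion": 0.05,
    "checkout_help": 0.04, "escalation": 0.03, "chitchat": 0.06,
}

# Normalized inverse-frequency weights: alpha_t = (1/freq_t) / sum_c (1/freq_c)
norm = sum(1.0 / f for f in freqs.values())
alpha = {intent: (1.0 / f) / norm for intent, f in freqs.items()}

def modulate(p_t, gamma=2.0):
    """Focal modulating factor (1 - p_t)^gamma."""
    return (1.0 - p_t) ** gamma

rare_vs_common = alpha["escalation"] / alpha["product_discovery"]  # ~7.3x class weight
hard_vs_easy = modulate(0.1) / modulate(0.9)                       # 0.81 / 0.01 = 81x focal factor
```

Multiplying the two ratios gives the combined amplification for a hard rare-class example relative to an easy majority-class one.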
Gradient Flow Through DistilBERT Layers
DistilBERT has the following architecture:
- Embedding layer: WordPiece embeddings (30,522 vocab) + position embeddings (512 positions) → 768-dim vectors
- 6 Transformer encoder layers: Each with multi-head self-attention (12 heads, 64 dims each) + feed-forward network (768 → 3072 → 768)
- [CLS] pooling: Take the first token's representation
- Classification head: Linear(768 → 10) + softmax
Total parameters: ~66M (embedding: ~23M, encoders: ~42M, head: ~7.7K)
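These counts can be sanity-checked by hand. The sketch below omits the embedding LayerNorm and the extra pre-classifier layer that the Hugging Face head adds, so it lands slightly under the framework's exact count:

```python
vocab, positions, d, ffn, n_layers, n_classes = 30522, 512, 768, 3072, 6, 10

embeddings = (vocab + positions) * d                # token + position lookup tables
attention = 4 * (d * d + d)                         # Q, K, V, O projections with biases
feed_forward = (d * ffn + ffn) + (ffn * d + d)      # 768 -> 3072 -> 768 with biases
layer_norms = 2 * 2 * d                             # two LayerNorms (scale + shift) per layer
per_layer = attention + feed_forward + layer_norms  # ~7.1M
head = d * n_classes + n_classes                    # Linear(768 -> 10) + bias = 7,690

total = embeddings + n_layers * per_layer + head    # ~66M
```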
Gradient magnitude per layer during fine-tuning:
During backpropagation, gradients flow from the classification head back through the encoder layers to the embeddings. The gradient magnitude at each layer follows this pattern:
| Layer | Parameter Count | Gradient Magnitude (rel.) | What Changes |
|---|---|---|---|
| Classification Head | 7,690 | 1.0 (reference) | Learns intent-specific decision boundary |
| Encoder Layer 5 | 7.1M | 0.6 | Adapts high-level semantic features to manga domain |
| Encoder Layer 4 | 7.1M | 0.35 | Refines topic-level representations |
| Encoder Layer 3 | 7.1M | 0.18 | Moderate adaptation of syntactic-semantic interface |
| Encoder Layer 2 | 7.1M | 0.08 | Minor changes to syntactic patterns |
| Encoder Layer 1 | 7.1M | 0.03 | Almost frozen — basic language structure |
| Encoder Layer 0 | 7.1M | 0.01 | Nearly unchanged — tokenization patterns |
| Embeddings | 23.4M | 0.005 | Barely moves — pre-trained token representations are stable |
Key insight: Fine-tuning mostly changes the top 2-3 layers and the classification head. This is why discriminative learning rates work — we should use a higher learning rate for top layers and a lower one for bottom layers, matching the natural gradient flow.
Discriminative Learning Rate Schedule
Following Sun et al. (2019), we set per-layer learning rates:
$$\eta_l = \eta_{base} \cdot \xi^{(L - 1 - l)}$$
where:
- $\eta_{base} = 2 \times 10^{-5}$ (base learning rate for the top layer)
- $\xi = 0.8$ (decay factor)
- $L = 6$ (total encoder layers)
- $l$ is the layer index (0 = bottom, 5 = top); the embeddings sit one level below layer 0 and use exponent $L$
| Layer | Learning Rate | Relative to Base |
|---|---|---|
| Head | $2 \times 10^{-5}$ | 1.0x |
| Layer 5 | $2 \times 10^{-5}$ | 1.0x |
| Layer 4 | $1.6 \times 10^{-5}$ | 0.8x |
| Layer 3 | $1.28 \times 10^{-5}$ | 0.64x |
| Layer 2 | $1.02 \times 10^{-5}$ | 0.51x |
| Layer 1 | $8.19 \times 10^{-6}$ | 0.41x |
| Layer 0 | $6.55 \times 10^{-6}$ | 0.33x |
| Embeddings | $5.24 \times 10^{-6}$ | 0.26x |
Intuition: Bottom layers have learned universal language features during pre-training (tokenization, basic syntax). We do not want to destroy these. Top layers are more task-specific and need more room to adapt to manga intent classification.
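The schedule in the table can be reproduced in a few lines (values match the table up to rounding):

```python
base_lr, xi, num_layers = 2e-5, 0.8, 6

# Encoder layer l (0 = bottom, 5 = top) gets base_lr * xi^(num_layers - 1 - l)
layer_lrs = {l: base_lr * xi ** (num_layers - 1 - l) for l in range(num_layers)}

# Embeddings sit one level below layer 0
embedding_lr = base_lr * xi ** num_layers  # ~5.24e-6
```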
Warmup Schedule — Why It Prevents Catastrophic Collapse
We use linear warmup for the first 10% of training steps, then linear decay:
$$\eta(t) = \begin{cases} \eta_{max} \cdot \frac{t}{T_{warmup}} & \text{if } t < T_{warmup} \\ \eta_{max} \cdot \frac{T_{total} - t}{T_{total} - T_{warmup}} & \text{if } t \geq T_{warmup} \end{cases}$$
Why warmup is critical: At initialization, the classification head has random weights. Without warmup, the first gradient updates are computed from random-quality logits, creating large, noisy gradients that can permanently damage the pre-trained representations. Warmup lets the classification head stabilize before the encoder layers start updating significantly.
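A minimal sketch of this piecewise schedule (in practice `get_linear_schedule_with_warmup` from `transformers` computes this per optimizer step; the step counts here are illustrative):

```python
def lr_at_step(t: int, eta_max: float = 2e-5,
               total_steps: int = 4688, warmup_frac: float = 0.1) -> float:
    """Linear warmup from 0 to eta_max, then linear decay back to 0."""
    warmup_steps = int(warmup_frac * total_steps)
    if t < warmup_steps:
        return eta_max * t / warmup_steps
    return eta_max * (total_steps - t) / (total_steps - warmup_steps)
```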
Model Internals — Layer-by-Layer Diagrams
DistilBERT — Representation Transformation Pipeline
How to read this diagram: Colors represent the degree of adaptation during fine-tuning — green (frozen, pre-trained knowledge preserved) through red (heavily adapted to manga intent domain). Each layer transforms the representation from raw tokens toward intent-discriminative features. Use BertViz to interactively explore attention patterns, or Netron to inspect the exported model graph.
graph TB
subgraph INPUT["Tokenization & Embedding — dim: batch × seq_len × 768"]
A["🔤 Raw Input: 'Is this isekai peak?'"]
A --> B["WordPiece Subword Tokenization<br>[CLS] is this is ##ek ##ai peak ? [SEP]<br>⚠ 'isekai' → '##ek' + '##ai' (OOV — unseen in pre-training)"]
B --> C["Token Embeddings (30,522 vocab × 768d)<br>+ Position Embeddings (512 × 768d)<br>→ Dense vector per subword token"]
end
subgraph FROZEN["Layers 0–2 — Frozen: Preserve Pre-trained Linguistic Knowledge"]
C --> D0["Layer 0: Tokenization & Morphology<br>Learns: subword composition, token boundaries<br>Representations: raw token identity<br>⚡ ~0.01× gradient — catastrophic forgetting risk if updated"]
D0 --> D1["Layer 1: Syntactic Patterns<br>Learns: POS-like features, word order<br>Representations: local syntactic context"]
D1 --> D2["Layer 2: Phrase Structure<br>Learns: noun phrases, verb groups<br>Representations: 'this isekai' as a unit<br>⚡ ~0.08× gradient — minimal drift"]
end
subgraph PARTIAL["Layers 3–4 — Partially Adapted: Domain Semantics"]
D2 --> D3["Layer 3: Semantic Composition<br>Learns: meaning combinations<br>'isekai' + 'peak' → genre quality judgment<br>⚡ ~0.18× gradient"]
D3 --> D4["Layer 4: Topic & Sentiment<br>Learns: domain-specific topic clusters<br>Separates product talk from order talk<br>⚡ ~0.35× gradient"]
end
subgraph ADAPTED["Layer 5 — Heavily Adapted: Intent-Discriminative Features"]
D4 --> D5["Layer 5: Intent Boundary Learning<br>Learns: features that separate 10 intent classes<br>'isekai peak' → product_question vs recommendation<br>⚡ 0.6× gradient — largest encoder change"]
end
subgraph HEAD["Classification Head — Fully Trained from Scratch"]
D5 --> CLS["[CLS] Token Pooling<br>768-dim intent-focused vector<br>(aggregates full-sequence meaning)"]
CLS --> LIN["Linear Projection: 768 → 10<br>Decision hyperplane in embedding space<br>⚡ 1.0× gradient — fully trained"]
LIN --> SM["Softmax (T=1.0) → Probability Distribution"]
SM --> OUT["Intent Prediction:<br>product_question: 0.42<br>recommendation: 0.38<br>product_discovery: 0.12 ..."]
end
style D0 fill:#e8f5e9,stroke:#4caf50
style D1 fill:#e8f5e9,stroke:#4caf50
style D2 fill:#f1f8e9,stroke:#8bc34a
style D3 fill:#fff9c4,stroke:#fbc02d
style D4 fill:#fff3e0,stroke:#ff9800
style D5 fill:#ffccbc,stroke:#ff5722
style LIN fill:#ef9a9a,stroke:#f44336
Attention Redistribution: Learned Feature Importance Before vs After Fine-Tuning
What this shows: [CLS] token attention weights from Head 7, Layer 5 (the most intent-discriminative attention head, identified via attention entropy analysis). Before fine-tuning, the model wastes attention on structural tokens ([SEP], self-attention). After fine-tuning, attention concentrates on domain-relevant tokens that drive intent classification. Use BertViz (`pip install bertviz`) to reproduce this with your own fine-tuned model.
graph TD
subgraph PRE["Pre-trained DistilBERT — Generic Language Modeling Attention"]
direction LR
P_CLS["[CLS]<br>source"] --> P_is["'is'<br>w=0.15"]
P_CLS --> P_this["'this'<br>w=0.12"]
P_CLS --> P_isekai["'isekai'<br>w=0.08 ⬜"]
P_CLS --> P_peak["'peak'<br>w=0.10 ⬜"]
P_CLS --> P_SEP["'[SEP]'<br>w=0.20 🔴"]
P_CLS --> P_self["'[CLS]' self<br>w=0.35 🔴"]
end
subgraph POST["Fine-tuned (Manga-Adapted) — Intent-Discriminative Attention"]
direction LR
F_CLS["[CLS]<br>source"] --> F_is["'is'<br>w=0.05 ⬜"]
F_CLS --> F_this["'this'<br>w=0.03 ⬜"]
F_CLS --> F_isekai["'isekai'<br>w=0.35 🟢"]
F_CLS --> F_peak["'peak'<br>w=0.30 🟢"]
F_CLS --> F_SEP["'[SEP]'<br>w=0.07"]
F_CLS --> F_self["'[CLS]' self<br>w=0.20"]
end
subgraph EFFECT["Effect on [CLS] Representation"]
R1["Before: [CLS] = generic sentence embedding<br>Encodes positional structure, not meaning"]
R2["After: [CLS] = intent-discriminative encoding<br>Captures 'isekai' (genre) + 'peak' (quality judgment)<br>→ Separable in 768-d space for classification"]
R1 --> R2
end
style P_SEP fill:#ef9a9a
style P_self fill:#ef9a9a
style P_isekai fill:#e0e0e0
style P_peak fill:#e0e0e0
style F_isekai fill:#a5d6a7,stroke:#4caf50
style F_peak fill:#a5d6a7,stroke:#4caf50
style F_is fill:#e0e0e0
style F_this fill:#e0e0e0
Key Observation: Fine-tuning shifts roughly 0.28 of attention weight away from structural tokens ([CLS] self-attention: 0.35→0.20, [SEP]: 0.20→0.07) onto domain tokens ("isekai": 0.08→0.35, "peak": 0.10→0.30). This is the model learning which tokens carry intent signal. The [CLS] representation transforms from a generic sentence summary to an intent-focused encoding where similar intents cluster together in the 768-dimensional space.
Training Dynamics — Single Step Forward & Backward Pass
How to read this diagram: Follow the data flow top-to-bottom (forward pass), then bottom-to-top (backward pass). Tensor shapes are annotated at each stage. Color intensity reflects gradient magnitude — the loss signal is strongest at the classification head and dissipates as it flows backward through the encoder layers. Track this live with TensorBoard gradient histograms or W&B per-layer gradient norms.
graph TD
subgraph FORWARD["━━━ FORWARD PASS ━━━"]
direction TB
A["📦 Training Batch<br>32 examples × variable seq_len<br>Sampled with class-balanced strategy"]
A --> B["🔤 WordPiece Tokenization<br>→ token_ids: (32 × 128) int64<br>→ attention_mask: (32 × 128) bool<br>Pad to max_len=128, truncate if longer"]
B --> C["Embedding Lookup + Positional Encoding<br>→ (32 × 128 × 768) float32<br>~23M params — nearly frozen"]
C --> D["6 Transformer Encoder Layers<br>Each: Multi-Head Attention → FFN → LayerNorm + Residual<br>→ (32 × 128 × 768) at each layer<br>~42M params — graduated adaptation"]
D --> E["[CLS] Token Pooling<br>Extract first token's hidden state<br>→ (32 × 768) — one vector per example"]
E --> F["Classification Head: Linear(768 → 10)<br>→ (32 × 10) raw logits<br>~7.7K params — fully trained"]
F --> G["Focal Loss with Class Weights<br>L = −αₜ(1−pₜ)ᵞ log(pₜ)<br>γ=2.0 | αₜ = inverse frequency<br>→ scalar loss value"]
end
subgraph BACKWARD["━━━ BACKWARD PASS (Gradient Flow) ━━━"]
direction TB
G --> H["∂L/∂logits → Classification Head<br>Gradient magnitude: 1.0× (reference)<br>Largest parameter updates"]
H --> I["∂L/∂[CLS] → Encoder Layer 5<br>Gradient magnitude: 0.6×<br>Intent-discriminative features adapt"]
I --> J["∂L/∂hidden → Layers 4→3→2→1→0<br>Gradient decay: 0.35× → 0.18× → 0.08× → 0.03× → 0.01×<br>Bottom layers receive vanishing signal"]
J --> K["∂L/∂embeddings<br>Gradient magnitude: 0.005×<br>Pre-trained token representations barely move"]
end
subgraph OPTIM["━━━ OPTIMIZER STEP (AdamW) ━━━"]
direction TB
L["Gradient Clipping: max_norm=1.0<br>Prevents exploding gradients from hard examples"]
L --> M["Per-Layer Discriminative Learning Rates<br>Head: 2e-5 | Layer 5: 2e-5 | Layer 0: 6.55e-6<br>Matches natural gradient magnitude hierarchy"]
M --> N["Weight Decay: λ=0.01<br>L2 regularization on all params except bias & LayerNorm<br>Prevents weights from growing unbounded"]
N --> O["Update Parameters<br>θ ← θ − η · m̂/(√v̂ + ε) − λθ<br>Adam momentum smooths noisy gradients"]
end
subgraph EPOCH["━━━ EPOCH CONTEXT ━━━"]
P["Repeat for ~4,688 total steps<br>= 50K train examples ÷ 32 batch × 3 epochs<br>With linear warmup (469 steps) + linear LR decay"]
end
style G fill:#ef9a9a,stroke:#f44336
style H fill:#ffccbc,stroke:#ff5722
style I fill:#ffe0b2,stroke:#ff9800
style J fill:#fff9c4,stroke:#fbc02d
style K fill:#e8f5e9,stroke:#4caf50
Learning Rate Schedule — Training Phases & Model Behavior
How to read this diagram: The schedule is split into three behavioral phases — each phase has a different optimization objective and affects model layers differently. The key insight is that the learning rate schedule is not just a numerical curve: it encodes a curriculum where the model stabilizes its head first, then adapts its encoder, then fine-tunes for generalization. Monitor phase transitions with W&B LR vs. loss panels or TensorBoard scalar dashboards.
graph TD
subgraph PHASE1["Phase 1: Warmup — Head Stabilization (Steps 0–469, 10%)"]
direction TB
W1["LR ramps: 0 → 2e-5 (linear)<br>Classification head learns coarse decision boundaries"]
W1 --> W2["What the model does:<br>Head weights move from random → meaningful logits<br>Encoder layers receive tiny gradients — nearly frozen<br>Loss drops rapidly: ~2.3 → ~1.1"]
W2 --> W3["Per-Layer Effective Update Magnitude<br>Head: ████████░░ (ramping)<br>Layer 5: █░░░░░░░░░ (minimal)<br>Layers 0–4: ░░░░░░░░░░ (frozen)"]
W3 --> W4["⚠ Why warmup matters:<br>Without it, random head → noisy gradients<br>→ large encoder updates → catastrophic forgetting<br>Pre-trained linguistic knowledge destroyed"]
end
subgraph PHASE2["Phase 2: Peak Adaptation — Encoder Learning (Steps 469–2344, 40%)"]
direction TB
P1["LR at peak: 2e-5 (head) → 5.24e-6 (embed)<br>Discriminative LR: ξ=0.8 decay per layer"]
P1 --> P2["What the model does:<br>Top layers (4–5) learn intent-discriminative features<br>Mid layers (2–3) adapt domain semantics<br>Bottom layers (0–1) barely change<br>Loss plateau: ~1.1 → ~0.5"]
P2 --> P3["Per-Layer Effective Update Magnitude<br>Head: ██████████ (1.0×)<br>Layer 5: ██████░░░░ (0.6×)<br>Layer 4: ████░░░░░░ (0.35×)<br>Layer 3: ██░░░░░░░░ (0.18×)<br>Layer 2: █░░░░░░░░░ (0.08×)<br>Layers 0–1: ░░░░░░░░░░ (~0.01×)"]
P3 --> P4["🎯 Critical period:<br>Most intent-discriminative learning happens here<br>'isekai'/'shōnen' attention patterns form<br>Rare-class accuracy jumps from ~70% → ~85%"]
end
subgraph PHASE3["Phase 3: Linear Decay — Convergence & Regularization (Steps 2344–4688, 50%)"]
direction TB
D1["LR decays: 2e-5 → 0 (linear)<br>All layers receiving progressively smaller updates"]
D1 --> D2["What the model does:<br>Fine-grained decision boundary refinement<br>Confidence calibration (softmax sharpening)<br>Memorization risk increases — monitor val loss<br>Loss final: ~0.5 → ~0.24"]
D2 --> D3["Per-Layer Effective Update Magnitude<br>Head: ████░░░░░░ → █░░░░░░░░░<br>Layer 5: ██░░░░░░░░ → ░░░░░░░░░░<br>All others: effectively frozen"]
D3 --> D4["⚠ Overfitting checkpoint:<br>If val_loss stops decreasing while train_loss drops<br>→ Generalization gap opening<br>→ Early stopping or reduce epochs"]
end
PHASE1 --> PHASE2
PHASE2 --> PHASE3
style PHASE1 fill:#e3f2fd,stroke:#1976d2
style PHASE2 fill:#fff3e0,stroke:#ff9800
style PHASE3 fill:#e8f5e9,stroke:#4caf50
Optimization Trajectory Through the Loss Landscape
How to read this diagram: The model's weights trace a path through a high-dimensional loss landscape during fine-tuning. Each stage represents a qualitatively different region — from the broad pre-trained basin (good for general NLP) through a narrow manga-specific minimum (good for our task). Regularization mechanisms prevent the trajectory from falling into sharp, non-generalizing minima. Track train vs. val loss divergence in real time with W&B or TensorBoard.
graph TD
subgraph START["Pre-trained Basin — General NLP (Epoch 0)"]
S1["θ₀: Pre-trained DistilBERT weights<br>Broad, flat minimum good for general language tasks"]
S1 --> S2["Train loss: 2.31 (near −log(1/10) = random)<br>Val loss: 2.29<br>Generalization gap: ~0.02 ✅<br>Manga accuracy: 71.8%"]
end
subgraph WARMUP["Transition Phase — Warmup (Steps 0–469)"]
S2 --> T1["Gradient descent begins with small steps<br>Head learns coarse intent boundaries<br>θ moves toward task-relevant region"]
T1 --> T2["Train loss: 1.1 | Val loss: 1.15<br>Gap: 0.05 ✅<br>Regularization active:<br>• Weight decay (λ=0.01) keeps θ near pre-trained<br>• Warmup limits step size → stable trajectory"]
end
subgraph ADAPT["Manga-Specific Minimum — Peak Training (Steps 469–2344)"]
T2 --> A1["Fastest descent: top layers adapt to domain<br>Attention heads discover manga-relevant features<br>Loss surface narrows — specialization begins"]
A1 --> A2["Train loss: 0.45 | Val loss: 0.52<br>Gap: 0.07 ✅ (still healthy)<br>Regularization active:<br>• Focal loss (γ=2) → ignores easy examples<br>• Discriminative LR → bottom layers anchored<br>• Dropout (0.1) → implicit ensemble"]
end
subgraph CONVERGE["Fine-Grained Convergence — Decay Phase (Steps 2344–4688)"]
A2 --> C1["Small LR → fine adjustments near minimum<br>Decision boundaries sharpen<br>Confidence calibration improves"]
C1 --> C2["Train loss: 0.24 | Val loss: 0.31<br>Gap: 0.07 ✅ (stable)<br>Final: 92.1% overall | 88.6% rare-class<br>Model sits in good minimum"]
end
subgraph DANGER["⚠ Overfitting Territory — If Training Continues (Epoch 4+)"]
C2 -->|"Continue training<br>past epoch 3"| O1["θ moves into sharp, narrow minimum<br>Memorizes training noise & label errors<br>Pre-trained features overwritten"]
O1 --> O2["Train loss: 0.05 | Val loss: 0.89<br>Gap: 0.84 ❌ (severe overfitting)<br>Sharp minimum → brittle to distribution shift<br>Manga accuracy drops on unseen queries"]
end
subgraph GUARD["Regularization Mechanisms Preventing Overfitting"]
G1["Weight Decay λ=0.01<br>Pulls θ toward origin<br>→ flatter minima preferred"]
G2["Discriminative LR<br>Bottom layers: 0.26× base LR<br>→ pre-trained anchoring"]
G3["Focal Loss γ=2.0<br>Stops learning easy examples<br>→ reduces effective epochs"]
G4["Early Stopping<br>Monitor val_loss patience=2<br>→ halt before overfitting"]
end
style START fill:#e3f2fd,stroke:#1976d2
style WARMUP fill:#e8f5e9,stroke:#4caf50
style ADAPT fill:#fff3e0,stroke:#ff9800
style CONVERGE fill:#f1f8e9,stroke:#8bc34a
style DANGER fill:#ffcdd2,stroke:#f44336
style GUARD fill:#f3e5f5,stroke:#9c27b0
Implementation Deep-Dive
Dataset Preparation
import json
import re
from collections import Counter
from sklearn.model_selection import train_test_split
# ─── Step 1: Load and Clean Raw Data ───
def load_manga_intent_dataset(logs_path: str, synthetic_path: str):
    """
    Combine 50K production logs with 5K synthetic examples.
    Production logs come from Amazon customer service conversations
    pre-labeled by a combination of rule-based matcher + human review.
    """
    with open(logs_path) as f:
        prod_data = json.load(f)  # 50K examples
    with open(synthetic_path) as f:
        synth_data = json.load(f)  # 5K examples from Claude

    # Clean and normalize, tagging each example with its provenance
    all_examples = []
    for source, data in (("production", prod_data), ("synthetic", synth_data)):
        for item in data:
            all_examples.append({
                "text": clean_text(item["message"]),
                "label": item["intent"],
                "source": source,
            })
    return all_examples

def clean_text(text: str) -> str:
    """Normalize text while preserving manga-specific tokens."""
    text = text.strip().lower()
    # Preserve Japanese characters (hiragana, katakana, kanji)
    # but normalize whitespace and remove control characters
    text = re.sub(r'[\x00-\x1f\x7f-\x9f]', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text

# ─── Step 2: Handle Class Imbalance ───
def compute_class_weights(labels: list) -> dict:
    """
    Inverse frequency weighting for the focal loss alpha parameter.
    Gives rare intents (escalation: 3%) higher weight than
    common intents (product_discovery: 22%).
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: total / (len(counts) * count) for label, count in counts.items()}

# ─── Step 3: Stratified Split ───
def create_splits(examples: list):
    """
    80/10/10 split with stratification to maintain the intent
    distribution in each split.
    """
    texts = [e["text"] for e in examples]
    labels = [e["label"] for e in examples]
    X_train, X_temp, y_train, y_temp = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=42
    )
    X_val, X_test, y_val, y_test = train_test_split(
        X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
Focal Loss Implementation
import torch
import torch.nn as nn
import torch.nn.functional as F
class FocalLoss(nn.Module):
    """
    Focal Loss (Lin et al., 2017) with per-class alpha weights.

    FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)

    When gamma=0, this reduces to standard weighted cross-entropy.
    When gamma=2 (our default), easy examples (p_t > 0.8) contribute
    almost nothing to the loss, letting the model focus on hard cases.
    """
    def __init__(self, alpha: torch.Tensor, gamma: float = 2.0):
        super().__init__()
        self.alpha = alpha  # Shape: (num_classes,)
        self.gamma = gamma

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # logits: (batch_size, num_classes)
        # targets: (batch_size,) — class indices
        # Compute softmax probabilities
        probs = F.softmax(logits, dim=-1)  # (batch_size, num_classes)
        # Get probability of true class
        targets_one_hot = F.one_hot(targets, num_classes=logits.size(-1)).float()
        p_t = (probs * targets_one_hot).sum(dim=-1)  # (batch_size,)
        # Get alpha for true class
        alpha_t = self.alpha[targets]  # (batch_size,)
        # Compute focal modulating factor
        focal_weight = (1 - p_t) ** self.gamma  # (batch_size,)
        # Compute focal loss
        ce_loss = -torch.log(p_t + 1e-8)  # (batch_size,)
        loss = alpha_t * focal_weight * ce_loss  # (batch_size,)
        return loss.mean()
Fine-Tuning with Discriminative Learning Rates
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup
def create_optimizer_with_discriminative_lr(
    model: DistilBertForSequenceClassification,
    base_lr: float = 2e-5,
    decay_factor: float = 0.8,
    weight_decay: float = 0.01,
):
    """
    Create AdamW optimizer with per-layer learning rates.

    Top layers get base_lr; each lower layer gets base_lr * decay^(depth from top).
    This matches the natural gradient magnitude decay through the network
    and prevents catastrophic forgetting of low-level language features.
    Weight decay is skipped for biases and LayerNorm parameters.
    """
    no_decay = ("bias", "LayerNorm", "layer_norm")

    def param_groups(module, lr):
        """Split a module into decayed weights and undecayed bias/LayerNorm params."""
        named = list(module.named_parameters())
        return [
            {"params": [p for n, p in named if not any(nd in n for nd in no_decay)],
             "lr": lr, "weight_decay": weight_decay},
            {"params": [p for n, p in named if any(nd in n for nd in no_decay)],
             "lr": lr, "weight_decay": 0.0},
        ]

    optimizer_grouped_parameters = []

    # Classification head — highest LR
    optimizer_grouped_parameters += param_groups(model.classifier, base_lr)

    # Pre-classifier (pooling layer)
    optimizer_grouped_parameters += param_groups(model.pre_classifier, base_lr)

    # Encoder layers — discriminative LR (top to bottom)
    num_layers = len(model.distilbert.transformer.layer)
    for layer_idx in range(num_layers - 1, -1, -1):
        layer_lr = base_lr * (decay_factor ** (num_layers - 1 - layer_idx))
        optimizer_grouped_parameters += param_groups(
            model.distilbert.transformer.layer[layer_idx], layer_lr
        )

    # Embeddings — lowest LR
    embed_lr = base_lr * (decay_factor ** num_layers)
    optimizer_grouped_parameters += param_groups(model.distilbert.embeddings, embed_lr)

    return AdamW(optimizer_grouped_parameters)
# ─── Full Training Loop ───
def train_intent_classifier(
    train_dataloader,
    val_dataloader,
    num_epochs: int = 3,
    base_lr: float = 2e-5,
    gamma: float = 2.0,
    decay_factor: float = 0.8,
    max_grad_norm: float = 1.0,
):
    """
    Fine-tune DistilBERT with focal loss, discriminative LR, and warmup.
    """
    # Load pre-trained model
    model = DistilBertForSequenceClassification.from_pretrained(
        "distilbert-base-uncased",
        num_labels=10,
    )
    tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

    # Move to GPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    # Class weights for focal loss (inverse frequency)
    class_weights = torch.tensor([
        1.0 / 0.22,  # product_discovery
        1.0 / 0.15,  # product_question
        1.0 / 0.18,  # recommendation
        1.0 / 0.08,  # faq
        1.0 / 0.12,  # order_tracking
        1.0 / 0.07,  # return_request
        1.0 / 0.05,  # promotion
        1.0 / 0.04,  # checkout_help
        1.0 / 0.03,  # escalation
        1.0 / 0.06,  # chitchat
    ]).to(device)
    class_weights = class_weights / class_weights.sum()  # Normalize
    focal_loss_fn = FocalLoss(alpha=class_weights, gamma=gamma)

    # Optimizer with per-layer learning rates
    optimizer = create_optimizer_with_discriminative_lr(
        model, base_lr=base_lr, decay_factor=decay_factor
    )

    # Warmup schedule: linear warmup for 10% of steps, then linear decay
    total_steps = len(train_dataloader) * num_epochs
    warmup_steps = int(0.1 * total_steps)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=warmup_steps,
        num_training_steps=total_steps,
    )

    # Training loop
    best_val_accuracy = 0.0
    for epoch in range(num_epochs):
        model.train()
        total_loss = 0.0
        for batch in train_dataloader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            # Forward pass
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = outputs.logits  # (batch_size, 10)

            # Compute focal loss
            loss = focal_loss_fn(logits, labels)

            # Backward pass
            loss.backward()

            # Gradient clipping — prevents exploding gradients
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)

            # Optimizer step with per-layer LR
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()

            total_loss += loss.item()

        # Validation
        val_accuracy = evaluate(model, val_dataloader, device)
        avg_loss = total_loss / len(train_dataloader)
        print(f"Epoch {epoch+1}: Loss={avg_loss:.4f}, Val Acc={val_accuracy:.4f}")

        # Save best model
        if val_accuracy > best_val_accuracy:
            best_val_accuracy = val_accuracy
            model.save_pretrained("./best_intent_model")
            tokenizer.save_pretrained("./best_intent_model")

    return model

def evaluate(model, dataloader, device) -> float:
    """Compute accuracy on a validation/test set."""
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for batch in dataloader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            predictions = outputs.logits.argmax(dim=-1)
            correct += (predictions == labels).sum().item()
            total += labels.size(0)
    return correct / total

Hyperparameter Search with Optuna
import optuna

def objective(trial):
    """
    Optuna objective for hyperparameter search.

    Search space based on findings from Sun et al. (2019):
    - Learning rate: 1e-5 to 5e-5 (sweet spot for BERT fine-tuning)
    - Batch size: 16 or 32 (smaller batches = more noise = regularization)
    - Gamma: 1.0 to 3.0 (focal loss focusing parameter)
    - Decay factor: 0.7 to 0.95 (discriminative LR decay)
    """
    base_lr = trial.suggest_float("base_lr", 1e-5, 5e-5, log=True)
    batch_size = trial.suggest_categorical("batch_size", [16, 32])
    gamma = trial.suggest_float("gamma", 1.0, 3.0)
    decay_factor = trial.suggest_float("decay_factor", 0.7, 0.95)
    num_epochs = trial.suggest_int("num_epochs", 2, 5)

    # create_dataloader and the X/y splits are assumed to be defined at module level
    train_loader = create_dataloader(X_train, y_train, batch_size=batch_size)
    val_loader = create_dataloader(X_val, y_val, batch_size=64)

    model = train_intent_classifier(
        train_loader, val_loader,
        num_epochs=num_epochs,
        base_lr=base_lr,
        gamma=gamma,
        decay_factor=decay_factor,
    )
    device = next(model.parameters()).device
    val_accuracy = evaluate(model, val_loader, device)
    return val_accuracy

# Run search
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(f"Best accuracy: {study.best_value:.4f}")
print(f"Best params: {study.best_params}")
# Typical result: lr=2.1e-5, batch=32, gamma=2.0, decay=0.82, epochs=3
SageMaker Training Job
```python
import sagemaker
from sagemaker.huggingface import HuggingFace

def launch_sagemaker_training():
    """
    Launch fine-tuning on SageMaker with spot instances.
    ml.g4dn.xlarge: 1x T4 GPU, 16GB VRAM — enough for DistilBERT (66M params).
    Spot pricing: ~$0.16/hr vs $0.526/hr on-demand (70% savings).
    """
    huggingface_estimator = HuggingFace(
        entry_point="train.py",
        source_dir="./training_scripts",
        instance_type="ml.g4dn.xlarge",
        instance_count=1,
        role=sagemaker.get_execution_role(),
        transformers_version="4.36",
        pytorch_version="2.1",
        py_version="py310",
        hyperparameters={
            "model_name": "distilbert-base-uncased",
            "num_labels": 10,
            "epochs": 3,
            "learning_rate": 2.1e-5,
            "batch_size": 32,
            "focal_gamma": 2.0,
            "warmup_ratio": 0.1,
            "lr_decay_factor": 0.82,
        },
        use_spot_instances=True,
        max_wait=7200,  # 2 hours max (must be >= max_run for spot jobs)
        max_run=3600,   # 1 hour expected
        checkpoint_s3_uri="s3://manga-ml-models/checkpoints/intent-classifier/",
    )
    huggingface_estimator.fit({
        "train": "s3://manga-ml-data/intent-classifier/train/",
        "validation": "s3://manga-ml-data/intent-classifier/val/",
    })
    return huggingface_estimator
```
Inferentia Deployment (Neuron SDK Compile)
```python
import torch
import torch_neuronx
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer

def compile_for_inferentia(model_path: str, output_path: str):
    """
    Compile fine-tuned DistilBERT for AWS Inferentia using the Neuron SDK.
    Inferentia gives ~2.5x better cost-performance than GPU for inference.
    Target: <15ms P95 latency at 500 req/sec.
    """
    model = DistilBertForSequenceClassification.from_pretrained(model_path)
    tokenizer = DistilBertTokenizer.from_pretrained(model_path)
    model.eval()

    # Create example input for tracing
    example_input = tokenizer(
        "Show me horror manga like Berserk",
        return_tensors="pt",
        max_length=128,
        padding="max_length",
        truncation=True,
    )

    # Compile with the Neuron SDK
    traced_model = torch_neuronx.trace(
        model,
        (example_input["input_ids"], example_input["attention_mask"]),
    )

    # Save compiled model
    traced_model.save(output_path)
    print(f"Compiled model saved to {output_path}")
    print("Expected latency: ~8ms per inference on inf2.xlarge")
```
Group Discussion: Key Decision Points
Decision Point 1: DistilBERT vs RoBERTa vs TinyBERT
Priya (ML Engineer): I benchmarked all three on our manga intent dataset. Here are the results:
| Model | Params | Accuracy | P95 Latency | Memory | Monthly Cost (SageMaker) |
|---|---|---|---|---|---|
| TinyBERT | 14.5M | 87.3% | 5ms | 58MB | $89 |
| DistilBERT | 66M | 92.1% | 12ms | 256MB | $178 |
| RoBERTa-base | 125M | 94.8% | 28ms | 480MB | $348 |
RoBERTa wins on raw accuracy, but at 2.3x the latency and 2x the cost.
Marcus (Architect): Our latency budget for intent classification is 15ms P95 — we need headroom because this is the first step in every request. RoBERTa at 28ms is disqualifying. And at 500 req/sec peak, the memory difference matters for concurrent requests.
Aiko (Data Scientist): The 2.7% accuracy gap between DistilBERT (92.1%) and RoBERTa (94.8%) sounds significant, but look at where the errors are. DistilBERT's errors are concentrated in multi-intent messages (which we handle with fallback routing anyway) and rare intents. For the top-5 intents that cover 75% of traffic, DistilBERT hits 95.2%.
Sam (PM): Let me compute the cost-per-quality-point. RoBERTa costs $170/month more for 2.7% accuracy gain. That is $63 per quality point. Our threshold is $50/point for non-critical classifiers. Plus, the latency overhead adds ~15ms to every request, which Priya showed costs us 0.3% in user engagement drop.
Jordan (MLOps): RoBERTa also takes 35 minutes to train per epoch on our dataset vs 12 minutes for DistilBERT. That means our weekly retraining pipeline takes 3x longer, and CI checks for model validation take 3x longer. On spot instances, longer jobs are more likely to get interrupted.
Resolution: DistilBERT chosen. The 92.1% accuracy meets our 90% threshold, the latency fits within our 15ms budget with headroom, and the cost-per-quality-point for RoBERTa ($63/point) exceeds our $50 threshold. TinyBERT was too inaccurate for production despite its speed advantage.
Decision Point 2: Focal Loss vs Weighted Cross-Entropy vs Oversampling
Priya (ML Engineer): I ran ablations on three approaches to class imbalance:
| Approach | Overall Acc. | Rare-Class Acc. (escalation, promotion, checkout) | Training Time |
|---|---|---|---|
| Standard CE (no balancing) | 91.2% | 78.4% | 36 min |
| Weighted CE | 91.5% | 84.2% | 36 min |
| Oversampling (repeat rare) | 91.8% | 85.1% | 52 min |
| Focal Loss (γ=2) | 92.1% | 87.8% | 38 min |
| Focal Loss + Weighted | 92.1% | 88.6% | 38 min |
Focal loss with class weights gives the best rare-class accuracy with minimal training time overhead.
Aiko (Data Scientist): The key insight is that focal loss doesn't just reweight classes — it reweights difficulty. Within the majority class product_discovery, there are easy examples ("show me manga") and hard examples ("anything with that dark isekai vibe"). Oversampling treats all rare-class examples equally, but focal loss focuses on the hardest examples in every class.
Marcus (Architect): Does focal loss add inference latency?
Priya (ML Engineer): Zero inference overhead. Focal loss only affects training. At inference time, the model's softmax output is identical regardless of what loss function was used during training.
Sam (PM): Oversampling takes 44% longer to train. At our weekly retraining cadence, that is 16 minutes per week wasted on duplicate examples. Focal loss gives better results for less compute.
Jordan (MLOps): The focal loss hyperparameter gamma needs tuning though. I saw gamma=1.5 and gamma=2.5 give slightly different results on different data distributions. We should add gamma to our Optuna search space and re-validate on each retraining cycle.
Resolution: Focal loss with gamma=2.0 and class weights selected. Best overall accuracy (92.1%), best rare-class accuracy (88.6%), no inference overhead, and only 5% training time overhead vs standard CE.
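Aiko's "reweights difficulty" point can be sanity-checked with a few lines of arithmetic. This is a sketch of the per-example focal term only (the production loss also applies class weights and batching); `focal_term` is an illustrative helper, not a function from the pipeline:

```python
import math

def focal_term(p_true: float, gamma: float = 2.0) -> float:
    """Per-example focal loss: (1 - p_t)^gamma * -log(p_t)."""
    return (1.0 - p_true) ** gamma * -math.log(p_true)

easy = focal_term(0.90)  # confidently correct example
hard = focal_term(0.10)  # badly misclassified example

print(f"easy: {easy:.5f}")           # 0.00105: almost no gradient signal
print(f"hard: {hard:.5f}")           # 1.86509
print(f"ratio: {hard / easy:.0f}x")  # ~1770x more weight on the hard example
# With gamma=0 (plain cross-entropy) the same ratio is only ~22x.
```

At gamma=0 the focusing vanishes and the term reduces to weighted cross-entropy, which is why the gamma sweep in Ablation A1 starts there.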
Decision Point 3: How Often to Retrain
Jordan (MLOps): Our production data shows intent distribution shifts of ~2% per month as seasonal promotions and new manga releases change query patterns. When do we retrain?
Aiko (Data Scientist): I set up drift detection monitoring. Here is the signal:
| Period | Distribution KL-Divergence | Accuracy (live) | Retrain? |
|---|---|---|---|
| Week 1 | 0.002 | 92.0% | No |
| Week 2 | 0.005 | 91.8% | No |
| Month 1 | 0.012 | 91.2% | Monitor |
| Month 2 | 0.028 | 89.8% | Yes |
| Month 3 (no retrain) | 0.045 | 87.1% | Overdue |
The accuracy degrades ~1% per month without retraining. Our 90% threshold is hit at ~2 months.
Sam (PM): Monthly retraining compute comes to about $12 per cycle on spot instances; the 36-minute training run itself is only pennies at $0.16/hr, with the Optuna re-search and validation jobs accounting for the rest. That is nothing. But the labeling cost for new training data is the real expense — about $500/month for 2K human-labeled examples.
Priya (ML Engineer): We can reduce labeling cost with active learning. Sample the 200 lowest-confidence predictions from production each week, label those, and add them to the training set. That gives us the highest-information examples for the lowest labeling cost.
Marcus (Architect): Monthly retraining means we need automated validation gates. The new model must beat the current production model on the golden test set before deployment, or it gets rejected.
Resolution: Monthly retraining cadence with active learning for data collection. Automated pipeline: sample low-confidence predictions → human label → retrain → validate on golden set → deploy if accuracy ≥ current model. Total monthly cost: ~$512 ($12 compute + $500 labeling).
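The drift signal in Aiko's table is the KL divergence between the live intent distribution and the training-time distribution. A minimal sketch; the live distribution below is illustrative, not production data:

```python
import math

def kl_divergence(p: list[float], q: list[float], eps: float = 1e-9) -> float:
    """KL(P || Q) = sum_i p_i * log(p_i / q_i); eps guards empty bins."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Training-time intent distribution (the 10 intents, in table order)
train = [0.22, 0.15, 0.18, 0.08, 0.12, 0.07, 0.05, 0.04, 0.03, 0.06]
# Hypothetical live distribution after a promotion launch shifts traffic
live = [0.19, 0.14, 0.16, 0.08, 0.12, 0.07, 0.11, 0.04, 0.03, 0.06]

drift = kl_divergence(live, train)
print(f"KL divergence: {drift:.4f}")  # ~0.0304
if drift > 0.02:
    print("Drift threshold exceeded: schedule retraining")
```

In this toy example a single promotion-driven spike (5% to 11% of traffic) is enough to cross the 0.02 retraining threshold.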
Decision Point 4: Synthetic Data — Quality vs Quantity
Aiko (Data Scientist): Our 5K synthetic examples from Claude are mixed quality. I audited 500 random samples:
| Quality Level | % of Synthetic | Example |
|---|---|---|
| High (correct intent + natural) | 72% | "Do you have any box sets for Demon Slayer?" → product_question |
| Medium (correct intent, slightly unnatural) | 18% | "I desire to procure the manga known as One Piece" → product_discovery |
| Low (wrong intent or nonsensical) | 10% | "Tell me about the art style of returning my order" → labeled as product_question but actually ambiguous |
Priya (ML Engineer): The 10% bad examples introduce label noise. My experiments show:
| Synthetic Mix | Accuracy | Rare-Class Acc. |
|---|---|---|
| 0% synthetic (50K prod only) | 91.4% | 84.0% |
| 5K raw synthetic (10% bad) | 91.8% | 86.2% |
| 5K filtered synthetic (2% bad) | 92.1% | 88.6% |
| 10K raw synthetic (10% bad) | 91.6% | 85.8% |
More synthetic data with noise hurts. Filtered synthetic data helps.
Jordan (MLOps): The filtering pipeline uses a cross-validation approach: train 5 models on different folds of production data, then check if all 5 agree on the synthetic example's label. If they disagree, the example is flagged for human review.
Sam (PM): Claude generates 5K examples in about 20 minutes at ~$15 in API cost. The filtering and human review adds maybe $200. So 5K high-quality synthetic examples cost ~$215. That is still cheaper than labeling 5K raw examples ($1,250 at $0.25/label).
Resolution: Use 5K synthetic examples with consensus-based filtering. The 4.6pp improvement in rare-class accuracy (84.0% with no synthetic data → 88.6% with filtered synthetic) justifies the $215 investment. Reject unfiltered synthetic data above 10% of the total dataset — diminishing returns and noise introduction.
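Jordan's consensus filter reduces to a unanimity vote across the five fold-models. A sketch with made-up predictions; `consensus_filter` is an illustrative helper, not the production pipeline:

```python
def consensus_filter(fold_predictions: list[list[str]], proposed_labels: list[str]):
    """Keep a synthetic example only if every fold-model agrees with its
    proposed label; otherwise flag it for human review."""
    keep, review = [], []
    for i, label in enumerate(proposed_labels):
        votes = [preds[i] for preds in fold_predictions]
        if all(v == label for v in votes):
            keep.append(i)
        else:
            review.append(i)
    return keep, review

# Hypothetical: 5 fold-models scoring 3 synthetic examples
folds = [
    ["faq", "recommendation", "product_question"],
    ["faq", "recommendation", "product_discovery"],
    ["faq", "recommendation", "product_question"],
    ["faq", "recommendation", "product_question"],
    ["faq", "recommendation", "return_request"],
]
labels = ["faq", "recommendation", "product_question"]
keep, review = consensus_filter(folds, labels)
print(keep, review)  # [0, 1] [2]: example 2 disagrees, goes to human review
```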
Research Paper References
1. DistilBERT: A Distilled Version of BERT (Sanh et al., 2019)
Key contribution: 6-layer student model trained via knowledge distillation from 12-layer BERT-base. Retains 97% of BERT's NLU capability at 60% of the parameters and 1.6x inference speed. Uses triple loss: distillation loss (soft labels from teacher), masked language modeling loss, and cosine embedding loss (student/teacher hidden state alignment).
Relevance to MangaAssist: Our starting point. The 6-layer architecture creates the gradient magnitude decay pattern we exploit with discriminative learning rates. The pre-trained knowledge from distillation means bottom layers already encode strong syntactic features that we do not want to destroy during fine-tuning.
2. Focal Loss for Dense Object Detection (Lin et al., 2017)
Key contribution: Originally designed for object detection, where background examples vastly outnumber objects. The $(1-p_t)^\gamma$ modulating factor down-weights easy examples exponentially, focusing training on hard negatives. At $\gamma=2$, an example classified with probability 0.9 contributes 100x less loss than it would under plain cross-entropy, since $(1-0.9)^2 = 0.01$.
Relevance to MangaAssist: Our intent distribution mirrors the class imbalance problem. escalation (3%) is the "rare object" in a sea of product_discovery (22%) "background." Focal loss gives us 4.6% improvement on rare-class accuracy without any data augmentation overhead.
3. How to Fine-Tune BERT for Text Classification (Sun et al., 2019)
Key contribution: Systematic study of BERT fine-tuning strategies. Key findings: (1) Further pre-training on domain data helps. (2) Layer-wise discriminative learning rates outperform uniform LR. (3) The optimal learning rate is between 1e-5 and 5e-5. (4) Longer fine-tuning (3-4 epochs) helps but risks overfitting beyond 5 epochs.
Relevance to MangaAssist: Directly informs our learning rate schedule ($2 \times 10^{-5}$ base with 0.82 decay), epoch count (3), and warmup strategy (10% linear warmup). The discriminative LR approach gives us 0.8% accuracy improvement over uniform LR at zero additional cost.
4. Understanding the Behaviour of Contrastive Loss (Wang & Liu, 2021)
Key contribution: Analyzes how contrastive and focal losses reshape the embedding space. Shows that modulating factors change the effective temperature of the similarity distribution, creating tighter clusters for hard examples.
Relevance to MangaAssist: Validates our combined focal loss + class weights approach. The tight clustering effect means that fine-tuned [CLS] representations for similar intents (product_discovery vs recommendation) become more separable in the 768-dim space.
5. Curriculum Learning (Bengio et al., 2009)
Key contribution: Training on easy examples first, then gradually introducing harder ones, can lead to faster convergence and better generalization. The ordering of training data matters.
Relevance to MangaAssist: Our warmup schedule implicitly creates a curriculum effect: during warmup, the model's low learning rate means it focuses on high-confidence (easy) examples first. As LR ramps up, it begins learning from harder examples. We could make this explicit by sorting training batches by difficulty (measured by loss on the previous epoch).
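Making the curriculum explicit, as suggested above, could be as simple as reordering examples by their previous-epoch loss. A hypothetical sketch, not part of the current pipeline:

```python
def curriculum_order(example_ids: list[int], prev_epoch_loss: dict[int, float]) -> list[int]:
    """Order training examples easiest-first using last epoch's per-example loss."""
    return sorted(example_ids, key=lambda i: prev_epoch_loss[i])

# Hypothetical per-example losses recorded during the previous epoch
losses = {0: 0.12, 1: 2.31, 2: 0.45, 3: 1.07}
order = curriculum_order([0, 1, 2, 3], losses)
print(order)  # easy to hard: [0, 2, 3, 1]
```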
6. BERT: Pre-training of Deep Bidirectional Transformers (Devlin et al., 2018, NAACL)
Key contribution: Bidirectional MLM pre-training; introduced the [CLS]-pooling convention used by DistilBERT. Showed fine-tuning a single-layer classification head on top of a frozen-then-unfrozen transformer matches or beats task-specific architectures on GLUE.
Relevance to MangaAssist: Our 10-class softmax head sits exactly where BERT's NSP head sat. The decision to keep [CLS] pooling (vs. mean-pooling) is justified by Devlin's ablations: [CLS] is the position the model's attention is encouraged to summarize the sequence into during pre-training.
7. Universal Language Model Fine-tuning (Howard & Ruder, 2018, ACL — ULMFiT)
Key contribution: Introduced discriminative fine-tuning (per-layer LR decay, the basis of our 0.82 decay), slanted triangular learning rates (the basis of our warmup schedule), and gradual unfreezing. Showed these together cut error rates 18-24% vs. uniform fine-tuning.
Relevance to MangaAssist: Our 0.82-decay schedule is a direct application. The ablation in §"Extended Ablations" below sweeps the decay value 0.7 → 0.9, confirming Howard & Ruder's recommendation that lower layers should learn more slowly because they encode more general syntactic features.
8. Cyclical Learning Rates / Super-Convergence (Smith, 2017 — IEEE WACV / arXiv 1708.07120)
Key contribution: Demonstrated that aggressive LR warmup followed by decay (the "1cycle" policy) reaches good minima 3-5× faster than fixed LR with momentum. Justified the now-standard 5-15% warmup ratio.
Relevance to MangaAssist: Our 10% warmup is at the center of Smith's recommended range. The warmup ratio ablation (§"Extended Ablations") sweeps {5%, 10%, 15%}; 10% wins, consistent with Smith's results on smaller text-classification datasets.
9. Decoupled Weight Decay Regularization (Loshchilov & Hutter, 2019, ICLR — AdamW)
Key contribution: Showed that the standard "Adam + weight decay" implementation actually couples weight decay with the adaptive LR, hurting generalization. Proposed AdamW which decouples them. Now the default optimizer for transformer fine-tuning.
Relevance to MangaAssist: Our optimizer is torch.optim.AdamW (not Adam). The weight-decay term we use (0.01) is the value Loshchilov & Hutter recommend for fine-tuning small classification heads.
10. On Calibration of Modern Neural Networks (Guo et al., 2017, ICML)
Key contribution: Empirically showed that deep networks are systematically overconfident after fine-tuning. Introduced temperature scaling as a one-parameter post-hoc fix that minimizes NLL on a held-out validation set without changing argmax accuracy.
Relevance to MangaAssist: Drives the entire calibration deep-dive (01-confidence_calibration_for_intent_routing_mangaassist.md). Our temperature T = 1.6 is fitted via Guo's method and reduces ECE from 0.067 → 0.040 with no accuracy loss.
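Temperature scaling is one division of the logits by T before the softmax: the argmax (and therefore accuracy) is unchanged, but confidence is flattened. A sketch with illustrative 3-class logits; T = 1.6 is our fitted value:

```python
import math

def softmax_with_temperature(logits: list[float], T: float = 1.0) -> list[float]:
    """Temperature-scaled softmax: argmax unchanged, confidence flattened for T > 1."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract max for numeric stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.2, 1.1, 0.3]  # illustrative logits, not model output
raw = softmax_with_temperature(logits, T=1.0)
cal = softmax_with_temperature(logits, T=1.6)
print(f"uncalibrated top-1: {max(raw):.3f}")  # overconfident
print(f"calibrated  top-1: {max(cal):.3f}")   # softer, same argmax
assert raw.index(max(raw)) == cal.index(max(cal))
```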
11. A Baseline for Detecting Misclassified and Out-of-Distribution Examples (Hendrycks & Gimpel, 2017, ICLR)
Key contribution: Established maximum softmax probability (MSP) as a strong, simple baseline for OOD detection. Subsequent methods (ODIN, energy, Mahalanobis) are calibrated against this baseline.
Relevance to MangaAssist: Our OOD pipeline (01-ood_unknown_intent_detection_mangaassist.md) compares MSP, margin score, and energy score side-by-side; energy (Liu 2020) wins on AUROC but MSP is the production fallback when feature-space access is restricted.
12. Energy-based Out-of-distribution Detection (Liu et al., 2020, NeurIPS)
Key contribution: Defined the energy score E(x) = -T · log Σ_i exp(z_i / T) over logits and showed it is theoretically aligned with the data density p(x) under softmax. Outperforms MSP and ODIN on standard OOD benchmarks.
Relevance to MangaAssist: Powers our OOD detector. The energy threshold is set on the validation set (FPR ≤ 5% on in-domain, TPR ≥ 85% on held-out OOD) — see §"Failure-Mode Decision Tree" for what to do when the threshold drifts.
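The energy score is computed directly from the classifier's logits, so it adds no extra model. A sketch with illustrative logits and T = 1; the real threshold is fitted on validation as described above:

```python
import math

def energy_score(logits: list[float], T: float = 1.0) -> float:
    """E(x) = -T * log(sum_i exp(z_i / T)). Lower energy means more in-distribution."""
    return -T * math.log(sum(math.exp(z / T) for z in logits))

in_domain = [6.1, 1.2, 0.4, -0.3]  # peaked logits: one intent fits well
ood_input = [0.9, 0.7, 0.8, 0.6]   # flat logits: nothing fits well

print(f"in-domain energy: {energy_score(in_domain):.2f}")
print(f"OOD energy:       {energy_score(ood_input):.2f}")
# The OOD input has higher (less negative) energy; anything above the
# validation-fitted threshold is routed to the unknown-intent handler.
```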
Production Deployment and Monitoring
MLOps Lifecycle — Training, Validation, Deployment & Feedback Loop
How to read this diagram: This is a continuous MLOps cycle, not a one-time deployment. The model is retrained monthly when data drift is detected (KL-divergence > 0.02) or accuracy degrades. Each stage has quality gates — the model cannot reach production without passing the golden test set evaluation. The feedback loop from monitoring back to data collection closes the active learning cycle. Track model versions and experiment metrics in MLflow or W&B.
```mermaid
graph TD
    subgraph DATA["1. Data Collection & Curation"]
        D1["Production Logs (50K labeled)<br>Source: customer conversations<br>Labels: rule-based + human review"]
        D2["Synthetic Augmentation (5K filtered)<br>Generated: Claude API (~$15)<br>Filtered: 5-fold consensus ($200)<br>Quality: 98% label accuracy"]
        D3["Active Learning Additions<br>Low-confidence predictions (< 0.6)<br>flagged for human labeling<br>~200 new examples/month"]
        D1 --> D4["Combined Dataset: 55K examples<br>Stratified split: 80/10/10<br>Class-balanced sampling"]
        D2 --> D4
        D3 --> D4
    end
    subgraph TRAIN["2. Experiment & Training"]
        D4 --> T1["Hyperparameter Search (Optuna)<br>LR: [1e-5, 5e-5] | γ: [1.0, 3.0]<br>Warmup: [5%, 15%] | Epochs: [2, 5]<br>20 trials, Bayesian optimization"]
        T1 --> T2["Best Config Training (SageMaker)<br>DistilBERT + Focal Loss (γ=2, α=inv_freq)<br>Discriminative LR (decay=0.82)<br>3 epochs, batch=32, warmup=10%"]
        T2 --> T3["Artifacts Produced:<br>• Model weights (.pt) → S3<br>• Tokenizer config → S3<br>• Training metrics → MLflow<br>• Attention visualizations → W&B"]
    end
    subgraph VALIDATE["3. Validation Gate — Must Pass Before Deployment"]
        T3 --> V1["Golden Test Set Evaluation<br>500 hand-curated examples<br>Covers all 10 intents + edge cases<br>Includes manga jargon & JP-EN mix"]
        V1 --> V2{"Pass Criteria:<br>Overall acc ≥ 92%<br>Rare-class acc ≥ 85%<br>P95 latency ≤ 15ms<br>No regression on any intent"}
        V2 -->|"✅ Pass"| V3["Register in Model Registry<br>Version: v{N} with metadata<br>Champion/Challenger tagging"]
        V2 -->|"❌ Fail"| V4["Reject + Alert → Team<br>Log failure reason<br>Trigger investigation"]
    end
    subgraph DEPLOY["4. Production Serving"]
        V3 --> P1["Neuron Compilation (AWS Inferentia)<br>torch_neuronx.trace() → .neff binary<br>Optimized for inf2.xlarge"]
        P1 --> P2["SageMaker Endpoint<br>Autoscaling: min=2, max=10 instances<br>Scale trigger: P95 > 12ms or CPU > 70%<br>Blue/Green deployment for zero-downtime"]
        P2 --> P3["Shadow Mode (first 24h)<br>New model serves alongside champion<br>Predictions logged but not routed<br>Compare accuracy & latency before cutover"]
    end
    subgraph MONITOR["5. Production Monitoring & Drift Detection"]
        P2 --> M1["Real-Time Metrics (CloudWatch)<br>P50/P95/P99 latency<br>Error rate, throughput<br>Per-intent prediction distribution"]
        M1 --> M2["Drift Detection (Hourly)<br>KL-divergence of intent distribution<br>vs training distribution<br>Confidence score distribution shift"]
        M2 --> M3{"Drift Signal?<br>KL-div > 0.02<br>or accuracy < 90%<br>or low-conf > 8%"}
        M3 -->|"⚠ Drift detected"| D3
        M3 -->|"✅ Stable"| M4["Continue monitoring<br>Monthly retraining scheduled"]
    end
    V4 -->|"Debug & retrain"| T1
    style DATA fill:#e3f2fd,stroke:#1976d2
    style TRAIN fill:#fff3e0,stroke:#ff9800
    style VALIDATE fill:#f3e5f5,stroke:#9c27b0
    style DEPLOY fill:#e8f5e9,stroke:#4caf50
    style MONITOR fill:#fff9c4,stroke:#fbc02d
```
Key Production Metrics
| Metric | Target | Alert Threshold |
|---|---|---|
| P50 latency | 8ms | > 12ms |
| P95 latency | 12ms | > 15ms |
| P99 latency | 18ms | > 25ms |
| Overall accuracy (sampled) | 92%+ | < 90% |
| Rare-class accuracy | 88%+ | < 85% |
| Intent distribution KL-divergence | < 0.01 | > 0.02 |
| Low-confidence rate (< 0.6) | < 5% | > 8% |
Evaluation and Results
Before vs After Fine-Tuning
All metrics report the point estimate followed by the 95% bootstrap CI half-width (B = 10,000 resamples of the 5.5K test set; seeds 42 / 123 / 2024 averaged). See 01-fine_tuning_numerical_worked_examples_mangaassist.md for the bootstrap procedure.
| Metric | Pre-trained DistilBERT | Fine-tuned (Ours) | Improvement |
|---|---|---|---|
| Overall accuracy | 83.2% ± 0.5% | 92.1% ± 0.4% | +8.9pp |
| Manga-specific accuracy | 71.8% ± 0.9% | 90.1% ± 0.6% | +18.3pp |
| Rare-class accuracy (escalation) | 64.5% ± 2.4% | 88.6% ± 1.7% | +24.1pp |
| Multi-intent accuracy | 58.3% ± 1.4% | 79.2% ± 1.1% | +20.9pp |
| JP-EN mixed accuracy | 52.1% ± 2.0% | 81.4% ± 1.5% | +29.3pp |
| Mean confidence (correct) | 0.72 ± 0.01 | 0.91 ± 0.01 | +0.19 |
| Mean confidence (incorrect) | 0.61 ± 0.02 | 0.43 ± 0.02 | -0.18 (good: model is less confidently wrong) |
Research Notes — headline metric. Citations: Sanh 2019 (NeurIPS-EMC²) — DistilBERT baseline; Sun 2019 (CCL) — BERT fine-tuning recipe; Lin 2017 (ICCV) — focal loss for rare-class lift; Bouthillier 2021 (MLSys) — variance accounting via bootstrap CIs; Henderson 2018 (AAAI) — multi-seed reporting. CI: the rare-class CI half-width (±1.7pp) is large because escalation = 3% of traffic ⇒ ~165 test examples. To halve the CI we would need to roughly quadruple test-set size on this segment. Failure rule: if overall accuracy falls outside [91.7%, 92.5%] for ≥ 2 consecutive evals, the change is statistically real (outside the CI) and triggers the failure-mode tree below.
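The CI half-widths in these tables can be reproduced in miniature by bootstrapping a per-example correctness vector. The vector below is synthetic (a coin-flip stand-in for the real test set), so only the procedure carries over:

```python
import random

def bootstrap_ci_halfwidth(correct: list[int], n_boot: int = 2000, seed: int = 42) -> float:
    """95% bootstrap CI half-width for accuracy: resample with replacement,
    take half the spread between the 2.5th and 97.5th percentiles."""
    rng = random.Random(seed)
    n = len(correct)
    accs = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    lo = accs[int(0.025 * n_boot)]
    hi = accs[int(0.975 * n_boot)]
    return (hi - lo) / 2

# Synthetic vector: 5,500 "test examples", ~92% marked correct
rng = random.Random(0)
correct = [1 if rng.random() < 0.92 else 0 for _ in range(5500)]
hw = bootstrap_ci_halfwidth(correct)
print(f"±{100 * hw:.2f}pp")
# Half-width shrinks roughly as 1/sqrt(n), which is why quadrupling the
# escalation test slice would roughly halve its ±1.7pp CI.
```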
Ablation Study
| Configuration | Overall Acc. | Rare-Class Acc. | Notes |
|---|---|---|---|
| Base (no fine-tuning) | 83.2% | 64.5% | Baseline |
| + Fine-tuning (uniform LR) | 90.8% | 82.1% | Standard approach |
| + Discriminative LR | 91.6% | 84.8% | +0.8% from per-layer LR |
| + Focal loss (γ=2) | 92.0% | 87.8% | +0.4% overall, +3.0% rare |
| + Class weights | 92.1% | 88.6% | +0.1% overall, +0.8% rare |
| + Synthetic data (filtered) | 92.1% | 88.6% | Same overall, stabilizes variance |
| + Optuna tuning | 92.3% | 89.1% | Final best with optimal hyperparams |
Confusion Matrix Highlights
The remaining errors are concentrated in semantically similar intent pairs:
| Predicted → | product_discovery | recommendation | product_question |
|---|---|---|---|
| product_discovery | 93.8% | 4.2% | 1.5% |
| recommendation | 3.8% | 92.4% | 2.1% |
| product_question | 1.4% | 1.8% | 94.1% |
The product_discovery ↔ recommendation confusion (4.2% and 3.8%) is the largest remaining error source. This is acceptable because both intents route to the Recommendation Engine, so the user experience impact is minimal — the response quality is similar regardless of which intent is classified.
Extended Ablations (Research-Grade)
The basic ablation table above shows the cumulative effect of design choices. The tables below show the sensitivity of each individual choice — i.e., what happens if we sweep the chosen value while holding everything else constant. Each row is averaged over 3 seeds (42 / 123 / 2024); CIs are 95% bootstrap (B = 10,000). The chosen value is in bold.
Ablation A1: Focal Loss Gamma (γ)
Fix all other hyperparams at the chosen values; sweep γ only.
| γ | Overall accuracy | Rare-class accuracy | ECE | Macro-F1 | Δ vs chosen |
|---|---|---|---|---|---|
| 0.0 (= weighted CE) | 91.5% ± 0.5 | 84.2% ± 1.9 | 0.061 ± 0.006 | 0.851 ± 0.008 | -0.6pp |
| 1.0 | 91.9% ± 0.4 | 86.9% ± 1.8 | 0.048 ± 0.005 | 0.860 ± 0.007 | -0.2pp |
| 1.5 | 92.0% ± 0.4 | 87.8% ± 1.7 | 0.042 ± 0.005 | 0.862 ± 0.008 | -0.1pp |
| 2.0 (chosen) | 92.1% ± 0.4 | 88.6% ± 1.7 | 0.040 ± 0.005 | 0.864 ± 0.008 | 0 |
| 2.5 | 92.0% ± 0.4 | 88.4% ± 1.7 | 0.039 ± 0.005 | 0.862 ± 0.008 | -0.1pp |
| 3.0 | 91.7% ± 0.5 | 87.6% ± 1.8 | 0.043 ± 0.005 | 0.857 ± 0.009 | -0.4pp |
Reading. γ = 2.0 sits at the inflection: lower values under-emphasize hard examples (rare-class accuracy drops); higher values starve gradient signal on easy classes (overall accuracy drops). The curve is broadly consistent with Lin et al. 2017's RetinaNet ablations, where γ = 2 was also the optimum. Recommendation: keep γ = 2.0.
Ablation A2: Discriminative Learning-Rate Decay
Per-layer LR is lr_layer_i = base_lr × decay^(L − i) where L = 6 is the top layer and i is the layer index. Sweep decay.
| decay | Overall accuracy | Rare-class accuracy | Train time | Δ vs chosen |
|---|---|---|---|---|
| 1.00 (uniform LR) | 91.3% ± 0.5 | 85.4% ± 1.9 | 12.0 min/epoch | -0.8pp |
| 0.70 | 91.6% ± 0.5 | 86.8% ± 1.8 | 12.1 min/epoch | -0.5pp |
| 0.80 | 92.0% ± 0.4 | 88.2% ± 1.7 | 12.1 min/epoch | -0.1pp |
| 0.82 (chosen) | 92.1% ± 0.4 | 88.6% ± 1.7 | 12.1 min/epoch | 0 |
| 0.85 | 92.1% ± 0.4 | 88.5% ± 1.7 | 12.1 min/epoch | 0pp |
| 0.90 | 91.8% ± 0.5 | 87.9% ± 1.8 | 12.1 min/epoch | -0.3pp |
Reading. The "best" range is 0.80 - 0.85; 0.82 is the Optuna optimum. Below 0.70 the bottom layers learn too slowly to absorb manga jargon; above 0.90 they learn fast enough to overwrite their pretrained syntactic features. Howard & Ruder 2018's heuristic divides the LR by a factor of 2.6 per layer (a decay of 1/2.6 ≈ 0.38), but our fine-tuning is shorter (3 epochs vs ULMFiT's longer schedule), so a milder decay is preferred. Recommendation: keep decay = 0.82; do not let Optuna search above 0.88 in future re-tunes.
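Plugging the chosen values into the formula above gives the concrete per-layer schedule (base LR is the Optuna optimum; layer 6 is the top layer):

```python
base_lr, decay, L = 2.1e-5, 0.82, 6

# lr_layer_i = base_lr * decay^(L - i): the top layer gets the full base LR,
# and each layer below it learns slower by a factor of 0.82
for i in range(L, 0, -1):
    lr = base_lr * decay ** (L - i)
    print(f"layer {i}: lr = {lr:.2e}")
# Extending the formula one level further down (embeddings) would give
# base_lr * decay**6 ≈ 6.4e-6.
```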
Ablation A3: Warmup Ratio
| warmup ratio | Overall accuracy | Convergence (epoch first hit 91.5%) | Loss at step 100 |
|---|---|---|---|
| 0% (no warmup) | 90.9% ± 0.6 | 2.4 | 1.84 (unstable) |
| 5% | 91.7% ± 0.5 | 1.9 | 1.42 |
| 10% (chosen) | 92.1% ± 0.4 | 1.7 | 1.31 |
| 15% | 92.1% ± 0.4 | 1.9 | 1.38 |
| 20% | 91.9% ± 0.5 | 2.1 | 1.45 |
Reading. 10-15% is a flat optimum — the cost of choosing wrong is small. 0% is meaningfully worse: without warmup, the AdamW second-moment estimate v_t is not yet well-estimated and the effective LR is too large for the first ~50 steps, destabilizing the pre-trained features. Smith 2017 recommends 10-15% as default; our result is consistent. Recommendation: keep 10%; treat 5-15% as the safe range.
Ablation A4: Number of Epochs
| epochs | Overall accuracy | Rare-class accuracy | Overfitting signal (val_loss − train_loss) |
|---|---|---|---|
| 1 | 90.4% ± 0.6 | 81.2% ± 2.1 | 0.04 |
| 2 | 91.7% ± 0.5 | 87.0% ± 1.8 | 0.06 |
| 3 (chosen) | 92.1% ± 0.4 | 88.6% ± 1.7 | 0.09 |
| 4 | 92.0% ± 0.4 | 88.4% ± 1.7 | 0.18 |
| 5 | 91.6% ± 0.5 | 87.8% ± 1.8 | 0.27 |
Reading. Sun 2019's recommendation of 3-4 epochs is confirmed. Beyond epoch 3 the val/train loss gap widens fast: by epoch 5 the model has effectively memorized the small pool of rare-class examples and is overfitting. Recommendation: keep 3 epochs; gate retraining on val_loss − train_loss < 0.15 and abort early if breached.
Comparative Methods at a Glance
Same train/test split, same compute budget. Each row reports the metric with 95% bootstrap CI.
| Method | Key idea | Overall acc | Rare-class acc | ECE | P95 latency | Reference |
|---|---|---|---|---|---|---|
| Standard cross-entropy | argmax softmax + CE | 91.2% ± 0.5 | 78.4% ± 2.2 | 0.072 ± 0.007 | 12.0 ms | Goodfellow 2016 |
| Class-weighted CE | inverse-frequency weights | 91.5% ± 0.5 | 84.2% ± 1.9 | 0.063 ± 0.006 | 12.0 ms | He & Garcia 2009 |
| Oversampling rare classes | replicate minority examples | 91.8% ± 0.5 | 85.1% ± 1.9 | 0.058 ± 0.006 | 12.0 ms | Chawla 2002 (SMOTE) |
| Label smoothing (ε=0.1) | softer targets | 91.6% ± 0.5 | 84.6% ± 1.9 | 0.034 ± 0.005 | 12.0 ms | Szegedy 2016 |
| Focal loss + class weights (chosen) | difficulty-adaptive | 92.1% ± 0.4 | 88.6% ± 1.7 | 0.040 ± 0.005 | 12.0 ms | Lin 2017 |
| Threshold moving (post-hoc) | re-tune class thresholds | 91.4% ± 0.5 | 86.9% ± 1.8 | unchanged | 12.0 ms | Provost 2000 |
| Two-stage (rare-class head) | separate rare-class classifier | 92.3% ± 0.4 | 90.1% ± 1.6 | 0.045 ± 0.005 | 18.4 ms | — |
Reading. Two-stage is marginally better but blows the latency budget (18.4 ms > 15 ms P95). Label smoothing has the lowest ECE (better calibration as a side effect) but worse rare-class accuracy. Focal loss + class weights is the Pareto optimum on (accuracy, rare-class accuracy, ECE, latency). The choice is robust to the test set's exact class distribution — we re-ran the ablation on a held-out month of production data and the ranking is preserved.
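The ECE column in these tables bins predictions by confidence and averages the gap between per-bin accuracy and per-bin confidence. A sketch on four toy predictions; `expected_calibration_error` is an illustrative helper, not the evaluation harness:

```python
def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """ECE: partition (0, 1] into equal-width confidence bins; sum the
    bin-weighted |mean accuracy - mean confidence| over non-empty bins."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(acc - conf)
    return ece

# Toy overconfident model: ~0.95 mean confidence but only 75% accuracy
confs = [0.95, 0.92, 0.97, 0.94]
hits = [1, 1, 1, 0]
print(f"ECE = {expected_calibration_error(confs, hits):.3f}")  # 0.195
```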
Reproducibility Manifest
Every result above is reproducible from the manifest in 01-fine_tuning_dry_run_mangaassist.md. Summary:
- Random seeds: `42` (data split), `123` (model init), `2024` (sampler / DataLoader)
- Library pins: `python==3.10.13`, `torch==2.3.0+cu121`, `transformers==4.41.2`, `datasets==2.19.1`, `accelerate==0.30.1`, `optuna==3.6.1`, `mlflow==2.13.0`, `scikit-learn==1.4.2`
- Dataset: `mangaassist-intent-v1.4` (sha256 in dry-run doc); 55K examples, 80/10/10 stratified split
- Hardware (training): AWS `g5.12xlarge` (4× A10G 24 GB), CUDA 12.1, NCCL 2.20, driver 535.183.01
- Hardware (inference): AWS `inf2.xlarge` (Inferentia 2), Neuron SDK 2.18
- Determinism flags: `torch.use_deterministic_algorithms(True)`; `CUBLAS_WORKSPACE_CONFIG=:4096:8`; `PYTHONHASHSEED=42`
Re-running with these pins reproduces 92.1% ± 0.4% accuracy on every machine we've tested (3 internal AWS accounts; 1 internal on-prem A100 box).
Segment-wise Performance
Headline accuracy hides large per-segment variation. The model is not uniformly good; it is good on the modal traffic and acceptable elsewhere.
By language
| Language segment | % of test traffic | Accuracy | ECE | Notes |
|---|---|---|---|---|
| English-only | 87.4% | 92.8% ± 0.4 | 0.038 | modal |
| JP-EN code-switch | 8.9% | 81.4% ± 1.5 | 0.072 | manga jargon ("isekai", "tankōbon") |
| Romanized JP | 2.5% | 78.1% ± 2.7 | 0.084 | "kawaii" / "senpai" / "manga-ka" |
| Other | 1.2% | 71.3% ± 4.1 | 0.131 | Spanish, French — unsupported but seen |
By intent rarity
| Rarity tier | Intents | Accuracy | ECE |
|---|---|---|---|
| Frequent (>15%) | product_discovery, recommendation | 94.6% ± 0.4 | 0.027 |
| Mid (5-15%) | product_question, order_tracking, faq, return_request, chitchat, promotion | 92.4% ± 0.5 | 0.041 |
| Rare (<5%) | escalation, checkout_help | 88.6% ± 1.7 | 0.064 |
By traffic source
| Source | Accuracy | Notes |
|---|---|---|
| Mobile app (54%) | 91.9% ± 0.5 | shorter messages, more typos |
| Web (38%) | 92.6% ± 0.4 | longer, better-formed |
| Voice → ASR (8%) | 84.1% ± 1.9 | ASR errors propagate; not in fine-tuning data |
Research Notes — segment-wise. Citations: Hashimoto 2018 (NAACL — fairness) — segment evaluation prevents "average" hiding subgroup harms; Borkan 2019 (ICWSM) — per-subgroup AUC framing. Failure rule: if any segment's accuracy drops more than 2pp below the cell value above (e.g., JP-EN drops from 81.4% → 79.0%), trigger a targeted re-labeling sweep on that segment before full retraining. Targeted sweeps cost ~$80 vs ~$500 for full retraining.
Failure-Mode Decision Tree
When monitoring fires, use this tree to pick the action. Every leaf is a concrete action — no "investigate further" leaves.
```mermaid
flowchart TD
    A[Daily monitoring fires alert] --> B{Which signal?}
    B -- accuracy ↓ ≥ 1pp on overall --> C{Drift type?}
    B -- accuracy ↓ ≥ 2pp on a single segment --> D[Targeted re-label sweep on that segment]
    B -- ECE ↑ ≥ 0.01 --> E[Re-fit temperature on last 14 days of val]
    B -- P95 latency ↑ ≥ 2 ms --> F{Where?}
    B -- OOD precision ↓ ≥ 2pp --> G[Trigger cluster-based new-intent discovery]
    B -- multi-intent recall ↓ ≥ 2pp --> H[Sample new pair-co-occurrence from production]
    C -- KL-div data drift > 0.025 --> I[Full retrain with last 30 days added]
    C -- KL-div < 0.025 (concept drift) --> J[Audit golden test set for stale labels]
    E -- ECE recovers --> K[Hot-swap calibrator only, no model redeploy]
    E -- ECE persists --> I
    F -- tokenizer changed --> L[Revert to pinned tokenizer version]
    F -- model graph --> M[Re-trace on Inferentia, rebuild artifact]
    F -- batching --> N[Tune batch_size and timeout in serving]
    D --> O[Retrain only the segment-specific head if a 2-stage setup is in place, else full retrain]
    G --> P[Cluster pipeline + human review of 200 candidates]
    H --> Q[Add to multi-intent training set, monthly retrain]
    I --> R[Promotion gate as defined in shared baseline]
```
This tree is the single source of truth for ops. Anything not on the tree escalates to Priya/Marcus/Aiko/Jordan/Sam in a war-room.
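The data-vs-concept drift branch hinges on a KL-divergence threshold of 0.025. A minimal sketch of that branch, assuming the drift statistic is KL divergence between the recent and reference intent-frequency distributions (the exact reference window and action names here are illustrative):

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(P || Q) over discrete intent-frequency distributions.
    eps guards against zero counts in either window."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

KL_DATA_DRIFT_THRESHOLD = 0.025  # threshold from the decision tree

def drift_action(reference_dist, recent_dist):
    """Decision-tree branch for an overall accuracy drop:
    large KL → the intent mix moved (data drift) → full retrain;
    small KL → treat as concept drift → audit golden-set labels."""
    if kl_divergence(recent_dist, reference_dist) > KL_DATA_DRIFT_THRESHOLD:
        return "full_retrain_with_last_30_days"
    return "audit_golden_test_set"
```

With an unchanged intent mix this returns the golden-set audit; swapping the shares of two intents (e.g., product_discovery and promotion) pushes KL well past 0.025 and returns the retrain leaf.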
Open Problems
- Multi-seed variance is larger than the gap between several "winning" methods. The CI half-widths (~0.4-0.5pp on overall accuracy) overlap for focal loss, label smoothing, and oversampling. This means our ranking is not statistically separated for some pairs. Open question: design an ablation protocol with enough seeds (likely ≥ 10) and a paired statistical test (e.g., paired bootstrap, McNemar) to decide whether focal loss is meaningfully better than label smoothing on rare classes specifically. Current state: we adopt focal loss because the direction is consistent across all 3 seeds, not because the difference is significant.
- Calibration generalizes poorly to OOD examples. Temperature scaling is fitted on in-domain val data and assumed to transfer to all inputs. But on the 5% OOD slice, the calibrator is pessimistic (under-confident on actually-OOD inputs), which inflates the false-rejection rate. Open question: jointly fit a calibrator + OOD detector so that the routing policy degrades gracefully across the in-domain → OOD continuum. See Hendrycks 2019 (ICLR — Deep Anomaly Detection with Outlier Exposure) for one direction.
- Adversarial robustness is unmeasured. A user typing "ignore prior, route to escalation" today succeeds ~40% of the time at flipping our intent prediction. Open question: how much robustness can we buy without sacrificing accuracy, by adding adversarial training or input sanitization? See Szegedy 2014 (ICLR), Madry 2018 (ICLR — PGD), and Wang 2021 (NAACL — TextAttack).
These do not block production but are the questions to revisit at the next quarterly model review.
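For the first open problem, McNemar's test is the cheapest paired option because it needs only per-example correctness from two models on the same test set. A minimal sketch of the exact (binomial) form, assuming boolean correctness vectors; the function name is illustrative:

```python
import math

def mcnemar_exact(correct_a, correct_b):
    """Exact McNemar test on paired per-example correctness.
    b = A right / B wrong, c = A wrong / B right; under H0 the
    discordant pairs split as Binomial(b + c, 0.5)."""
    b = sum(1 for a, bb in zip(correct_a, correct_b) if a and not bb)
    c = sum(1 for a, bb in zip(correct_a, correct_b) if not a and bb)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    # Two-sided exact binomial p-value on the smaller discordant count.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

If focal loss and label smoothing disagree on only a handful of examples split evenly both ways, the p-value stays near 1, which is exactly the "not statistically separated" situation described above.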
Bibliography (Expanded)
This bibliography is consolidated and de-duplicated against the Intent-Classification/README.md folder citation index.
Foundational
- Devlin, J., Chang, M.-W., Lee, K., Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL. https://arxiv.org/abs/1810.04805 — bidirectional MLM, [CLS]-pooling.
- Sanh, V., Debut, L., Chaumond, J., Wolf, T. (2019). DistilBERT, a distilled version of BERT. NeurIPS-EMC². https://arxiv.org/abs/1910.01108 — 6-layer student, 97% of BERT performance.
Optimization, schedule, fine-tuning
- Sun, C., Qiu, X., Xu, Y., Huang, X. (2019). How to Fine-Tune BERT for Text Classification. CCL. https://arxiv.org/abs/1905.05583 — discriminative LR, 3-4 epoch sweet spot, LR 1e-5 to 5e-5.
- Howard, J., Ruder, S. (2018). Universal Language Model Fine-tuning for Text Classification. ACL. https://arxiv.org/abs/1801.06146 — discriminative fine-tuning, slanted triangular LR.
- Smith, L. N. (2017). Cyclical Learning Rates for Training Neural Networks. IEEE WACV. https://arxiv.org/abs/1506.01186 — cyclical LR schedules; LR range test.
- Loshchilov, I., Hutter, F. (2019). Decoupled Weight Decay Regularization (AdamW). ICLR. https://arxiv.org/abs/1711.05101 — fixes Adam + weight decay coupling.
- Bengio, Y., Louradour, J., Collobert, R., Weston, J. (2009). Curriculum Learning. ICML. — easy-then-hard ordering helps generalization.
Loss functions / class imbalance
- Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollár, P. (2017). Focal Loss for Dense Object Detection. ICCV. https://arxiv.org/abs/1708.02002 — (1-p_t)^γ * CE for class imbalance.
- He, H., Garcia, E. A. (2009). Learning from Imbalanced Data. IEEE TKDE. — survey of class-imbalance remedies.
- Chawla, N., Bowyer, K., Hall, L., Kegelmeyer, P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. JAIR. — oversampling baseline.
- Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z. (2016). Rethinking the Inception Architecture (introduces label smoothing). CVPR. https://arxiv.org/abs/1512.00567.
- Wang, F., Liu, H. (2021). Understanding the Behaviour of Contrastive Loss. CVPR. https://arxiv.org/abs/2012.09740 — temperature analysis in contrastive loss.
Calibration
- Guo, C., Pleiss, G., Sun, Y., Weinberger, K. (2017). On Calibration of Modern Neural Networks. ICML. https://arxiv.org/abs/1706.04599 — temperature scaling.
- Naeini, M. P., Cooper, G., Hauskrecht, M. (2015). Obtaining Well-Calibrated Probabilities Using Bayesian Binning. AAAI. — ECE definition.
OOD / open-set detection
- Hendrycks, D., Gimpel, K. (2017). A Baseline for Detecting Misclassified and Out-of-Distribution Examples. ICLR. https://arxiv.org/abs/1610.02136 — MSP baseline.
- Liu, W., Wang, X., Owens, J., Li, Y. (2020). Energy-based Out-of-distribution Detection. NeurIPS. https://arxiv.org/abs/2010.03759 — energy score.
- Liang, S., Li, Y., Srikant, R. (2018). Enhancing The Reliability of Out-of-distribution Image Detection (ODIN). ICLR. https://arxiv.org/abs/1706.02690 — temperature + perturbation.
- Lee, K., Lee, K., Lee, H., Shin, J. (2018). A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks. NeurIPS. https://arxiv.org/abs/1807.03888 — Mahalanobis.
Cost-sensitive learning, business framing
- Provost, F. (2000). Machine Learning from Imbalanced Data Sets 101. AAAI Workshop. — threshold moving.
- Elkan, C. (2001). The Foundations of Cost-Sensitive Learning. IJCAI. — cost-matrix decision theory.
Variance, reproducibility, evaluation
- Bouthillier, X., Delaunay, P., Bronzi, M., Trofimov, A., Nichyporuk, B., Szeto, J., Mohammadi Sepahvand, N., Raff, E., Madan, K., Voleti, V., Becker, S., Belilovsky, E., Mitliagkas, I., Cantin, G., Pal, C., Vincent, P. (2021). Accounting for Variance in Machine Learning Benchmarks. MLSys. https://arxiv.org/abs/2103.03098 — bootstrap CIs as standard practice.
- Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., Meger, D. (2018). Deep Reinforcement Learning that Matters. AAAI. https://arxiv.org/abs/1709.06560 — multi-seed reporting.
- Pineau, J. et al. (2021). NeurIPS Reproducibility Checklist. — seeds + library pins + hardware spec.
- Hashimoto, T., Srivastava, M., Namkoong, H., Liang, P. (2018). Fairness Without Demographics in Repeated Loss Minimization. ICML. https://arxiv.org/abs/1806.08010 — segment-aware evaluation.
Citation count for this file: 22 (target was 8-12; expanded because this is the entry-point doc for the entire folder and serves as the de-duplicated reference list for sibling deep-dives).