01. Intent Classifier Fine-Tuning — DistilBERT for MangaAssist
Problem Statement and MangaAssist Context
MangaAssist routes every user message through an intent classifier before deciding which downstream service to call. The classifier must map messages like "Show me horror manga", "Where is my order?", or "Something like Naruto but darker" to one of 10 intents in under 15ms at P95. A wrong classification sends the user down the wrong path entirely — recommending manga when they asked about returns, or triggering an order lookup when they wanted discovery.
The pre-trained DistilBERT model achieves 83.2% accuracy on our manga retail domain out of the box. That is not good enough — 16.8% of messages get misrouted, causing user frustration and unnecessary LLM calls. Fine-tuning on domain-specific data pushes accuracy to 92.1%, reducing misroutes by more than half.
This document covers the full fine-tuning pipeline: from the math behind cross-entropy and focal loss, through the internal mechanics of how DistilBERT's 6 transformer layers change during training, to production deployment on AWS Inferentia.
The 10 Intents
| Intent | Example | Frequency |
|---|---|---|
| product_discovery | "Show me horror manga" | 22% |
| product_question | "Is this in English?" | 15% |
| recommendation | "Something like Naruto" | 18% |
| faq | "What's the return policy?" | 8% |
| order_tracking | "Where is my order?" | 12% |
| return_request | "I want to return this" | 7% |
| promotion | "Any deals on manga?" | 5% |
| checkout_help | "Can I use gift cards?" | 4% |
| escalation | "Talk to a human" | 3% |
| chitchat | "Hello", "Thanks" | 6% |
Why This Is Hard for Manga Retail
Manga retail traffic differs from the text generic NLP models are pre-trained on in four ways:
- Manga jargon (22% of queries): "tankōbon", "shōnen jump", "isekai", "seinen" — tokens that appear rarely or never in pre-training corpora
- Slang (15%): "Is this peak?", "W manga", "mid" — internet-era expressions with domain-specific meaning
- Multi-intent (18%): "I want to return this and find something better" — two intents in one message
- Japanese-English mixing (12%): "この漫画は英語ですか?" mixed with "Is vol 12 out?"
Mathematical Foundations
Cross-Entropy Loss — The Starting Point
For a single training example with true class $y$ and predicted probability distribution $\hat{y}$, the cross-entropy loss is:
$$\mathcal{L}_{CE} = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)$$
For one-hot encoded labels (which our intent classifier uses), this simplifies to:
$$\mathcal{L}_{CE} = -\log(\hat{y}_{y_{\text{true}}})$$
where $\hat{y}_{y_{\text{true}}}$ is the predicted probability for the correct class.
Intuition: Cross-entropy measures how surprised the model is by the correct answer. If the model assigns probability 0.9 to the correct class, the loss is $-\log(0.9) = 0.105$. If it assigns 0.1, the loss is $-\log(0.1) = 2.303$ — much higher. The logarithm creates an asymmetric penalty: being confidently wrong is punished far more than being uncertain.
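These numbers are easy to verify; a minimal sketch of the one-hot case:

```python
import math

def cross_entropy(p_true: float) -> float:
    """Cross-entropy with a one-hot label: -log(probability assigned to the true class)."""
    return -math.log(p_true)

confident = cross_entropy(0.9)   # model is right and confident -> small loss (~0.105)
uncertain = cross_entropy(0.1)   # model puts little mass on the truth -> large loss (~2.303)
```

Note the roughly 22x gap between the two losses, even though the probability only changed by 0.8.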
Softmax Temperature and Its Effect on Gradients
The softmax function converts raw logits $z_i$ into probabilities:
$$\hat{y}_i = \frac{e^{z_i / T}}{\sum_{j=1}^{C} e^{z_j / T}}$$
where $T$ is the temperature parameter.
- $T = 1$ (default): Standard softmax. Distribution reflects model confidence directly.
- $T < 1$ (sharpening): Makes the distribution peakier. The model becomes more confident.
- $T > 1$ (smoothing): Flattens the distribution. All classes get more similar probabilities.
Why temperature matters for fine-tuning: During early fine-tuning, the model is not yet adapted to manga domain. Using $T > 1$ (e.g., 1.5) during the first few epochs prevents the model from becoming overconfident on wrong predictions, which would create large gradients that destabilize training. As training progresses and the model learns the domain, we anneal $T \to 1$.
Gradient of softmax with temperature:
$$\frac{\partial \hat{y}_i}{\partial z_j} = \frac{1}{T} \hat{y}_i (\delta_{ij} - \hat{y}_j)$$
The $\frac{1}{T}$ factor means higher temperature directly reduces gradient magnitude, acting as an implicit learning rate dampener.
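A small plain-Python sketch (illustrative logit values only) makes the sharpening/smoothing effect concrete:

```python
import math

def softmax_t(logits, T=1.0):
    """Temperature-scaled softmax: divide each logit by T, then normalize."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.0, 1.0, 0.5, 0.2]            # raw scores for four hypothetical intents
standard = softmax_t(logits, T=1.0)
smoothed = softmax_t(logits, T=1.5)      # early fine-tuning: flatter, smaller gradients
sharpened = softmax_t(logits, T=0.5)     # T < 1: peakier, more confident
```

The top class keeps the highest probability at every temperature; only the spread between classes changes.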
Focal Loss — Handling Class Imbalance
Our intent distribution is heavily imbalanced: product_discovery (22%) vs escalation (3%). Standard cross-entropy gives equal weight per sample, so the model optimizes for majority classes and underperforms on rare intents.
Focal loss (Lin et al., 2017) adds a modulating factor:
$$\mathcal{L}_{FL} = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$
where:
- $p_t$ is the predicted probability for the true class
- $\gamma$ is the focusing parameter (typically 2.0)
- $\alpha_t$ is the class weight for class $t$
Breaking down the modulating factor $(1 - p_t)^{\gamma}$:
| $p_t$ (model confidence) | $(1-p_t)^2$ | Effect |
|---|---|---|
| 0.9 (easy, well-classified) | 0.01 | Loss reduced by 100x — model stops wasting gradients on easy examples |
| 0.5 (uncertain) | 0.25 | Moderate loss — model still learns from these |
| 0.1 (hard, misclassified) | 0.81 | Near-full loss — model focuses learning here |
Intuition: Focal loss makes the model stop paying attention to examples it already classifies well (like "Hello" → chitchat) and focus its gradient budget on hard examples (like "Is this isekai peak or mid?" → product_question vs recommendation).
Our class weights $\alpha_t$:
$$\alpha_t = \frac{1}{\text{freq}(t)} \cdot \frac{1}{\sum_{c=1}^{C} \frac{1}{\text{freq}(c)}}$$
This gives escalation (3% frequency) about 7.3x the weight of product_discovery (22%). Combined with $\gamma = 2$, the effective gradient signal for a hard rare-class example ($p_t = 0.1$, modulating factor 0.81) is roughly $7.3 \times \frac{0.81}{0.01} \approx 590\times$ that of an easy majority-class example ($p_t = 0.9$, modulating factor 0.01).
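A quick sketch to check these numbers, with the frequencies taken from the intent table above (the exact amplification depends on which pair of $p_t$ values you compare):

```python
freqs = {
    "product_discovery": 0.22, "product_question": 0.15, "recommendation": 0.18,
    "faq": 0.08, "order_tracking": 0.12, "return_request": 0.07, "promotion": 0.05,
    "checkout_help": 0.04, "escalation": 0.03, "chitchat": 0.06,
}

# Normalized inverse-frequency weights: alpha_t = (1/freq_t) / sum_c (1/freq_c)
norm = sum(1.0 / f for f in freqs.values())
alpha = {intent: (1.0 / f) / norm for intent, f in freqs.items()}

def modulate(p_t, gamma=2.0):
    """Focal modulating factor (1 - p_t)^gamma."""
    return (1.0 - p_t) ** gamma

rare_vs_common = alpha["escalation"] / alpha["product_discovery"]  # ~7.3x class weight
hard_vs_easy = modulate(0.1) / modulate(0.9)                       # 0.81 / 0.01 = 81x focal factor
```

Multiplying the two ratios gives the combined amplification for a hard rare-class example relative to an easy majority-class one.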
Gradient Flow Through DistilBERT Layers
DistilBERT has the following architecture:
- Embedding layer: WordPiece embeddings (30,522 vocab) + position embeddings (512 positions) → 768-dim vectors
- 6 Transformer encoder layers: Each with multi-head self-attention (12 heads, 64 dims each) + feed-forward network (768 → 3072 → 768)
- [CLS] pooling: Take the first token's representation
- Classification head: Linear(768 → 10) + softmax
Total parameters: ~66M (embedding: ~23M, encoders: ~42M, head: ~7.7K)
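These counts can be sanity-checked by hand. The sketch below omits the embedding LayerNorm and the extra pre-classifier layer that the Hugging Face head adds, so it lands slightly under the framework's exact count:

```python
vocab, positions, d, ffn, n_layers, n_classes = 30522, 512, 768, 3072, 6, 10

embeddings = (vocab + positions) * d                # token + position lookup tables
attention = 4 * (d * d + d)                         # Q, K, V, O projections with biases
feed_forward = (d * ffn + ffn) + (ffn * d + d)      # 768 -> 3072 -> 768 with biases
layer_norms = 2 * 2 * d                             # two LayerNorms (scale + shift) per layer
per_layer = attention + feed_forward + layer_norms  # ~7.1M
head = d * n_classes + n_classes                    # Linear(768 -> 10) + bias = 7,690

total = embeddings + n_layers * per_layer + head    # ~66M
```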
Gradient magnitude per layer during fine-tuning:
During backpropagation, gradients flow from the classification head back through the encoder layers to the embeddings. The gradient magnitude at each layer follows this pattern:
| Layer | Parameter Count | Gradient Magnitude (rel.) | What Changes |
|---|---|---|---|
| Classification Head | 7,690 | 1.0 (reference) | Learns intent-specific decision boundary |
| Encoder Layer 5 | 7.1M | 0.6 | Adapts high-level semantic features to manga domain |
| Encoder Layer 4 | 7.1M | 0.35 | Refines topic-level representations |
| Encoder Layer 3 | 7.1M | 0.18 | Moderate adaptation of syntactic-semantic interface |
| Encoder Layer 2 | 7.1M | 0.08 | Minor changes to syntactic patterns |
| Encoder Layer 1 | 7.1M | 0.03 | Almost frozen — basic language structure |
| Encoder Layer 0 | 7.1M | 0.01 | Nearly unchanged — tokenization patterns |
| Embeddings | 23.4M | 0.005 | Barely moves — pre-trained token representations are stable |
Key insight: Fine-tuning mostly changes the top 2-3 layers and the classification head. This is why discriminative learning rates work — we should use a higher learning rate for top layers and a lower one for bottom layers, matching the natural gradient flow.
Discriminative Learning Rate Schedule
Following Sun et al. (2019), we set per-layer learning rates:
$$\eta_l = \eta_{base} \cdot \xi^{(L - 1 - l)}$$
where:
- $\eta_{base} = 2 \times 10^{-5}$ (base learning rate for the top layer)
- $\xi = 0.8$ (decay factor)
- $L = 6$ (total encoder layers)
- $l$ is the layer index (0 = bottom, 5 = top); the embeddings sit one level below layer 0 and use exponent $L$
| Layer | Learning Rate | Relative to Base |
|---|---|---|
| Head | $2 \times 10^{-5}$ | 1.0x |
| Layer 5 | $2 \times 10^{-5}$ | 1.0x |
| Layer 4 | $1.6 \times 10^{-5}$ | 0.8x |
| Layer 3 | $1.28 \times 10^{-5}$ | 0.64x |
| Layer 2 | $1.02 \times 10^{-5}$ | 0.51x |
| Layer 1 | $8.19 \times 10^{-6}$ | 0.41x |
| Layer 0 | $6.55 \times 10^{-6}$ | 0.33x |
| Embeddings | $5.24 \times 10^{-6}$ | 0.26x |
Intuition: Bottom layers have learned universal language features during pre-training (tokenization, basic syntax). We do not want to destroy these. Top layers are more task-specific and need more room to adapt to manga intent classification.
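The schedule in the table can be reproduced in a few lines (values match the table up to rounding):

```python
base_lr, xi, num_layers = 2e-5, 0.8, 6

# Encoder layer l (0 = bottom, 5 = top) gets base_lr * xi^(num_layers - 1 - l)
layer_lrs = {l: base_lr * xi ** (num_layers - 1 - l) for l in range(num_layers)}

# Embeddings sit one level below layer 0
embedding_lr = base_lr * xi ** num_layers  # ~5.24e-6
```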
Warmup Schedule — Why It Prevents Catastrophic Collapse
We use linear warmup for the first 10% of training steps, then linear decay:
$$\eta(t) = \begin{cases} \eta_{max} \cdot \frac{t}{T_{warmup}} & \text{if } t < T_{warmup} \\ \eta_{max} \cdot \frac{T_{total} - t}{T_{total} - T_{warmup}} & \text{if } t \geq T_{warmup} \end{cases}$$
Why warmup is critical: At initialization, the classification head has random weights. Without warmup, the first gradient updates are computed from random-quality logits, creating large, noisy gradients that can permanently damage the pre-trained representations. Warmup lets the classification head stabilize before the encoder layers start updating significantly.
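A minimal sketch of this piecewise schedule (in practice `get_linear_schedule_with_warmup` from `transformers` computes this per optimizer step; the step counts here are illustrative):

```python
def lr_at_step(t: int, eta_max: float = 2e-5,
               total_steps: int = 4688, warmup_frac: float = 0.1) -> float:
    """Linear warmup from 0 to eta_max, then linear decay back to 0."""
    warmup_steps = int(warmup_frac * total_steps)
    if t < warmup_steps:
        return eta_max * t / warmup_steps
    return eta_max * (total_steps - t) / (total_steps - warmup_steps)
```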
Model Internals — Layer-by-Layer Diagrams
DistilBERT — Representation Transformation Pipeline
How to read this diagram: Colors represent the degree of adaptation during fine-tuning — green (frozen, pre-trained knowledge preserved) through red (heavily adapted to manga intent domain). Each layer transforms the representation from raw tokens toward intent-discriminative features. Use BertViz to interactively explore attention patterns, or Netron to inspect the exported model graph.
graph TB
subgraph INPUT["Tokenization & Embedding — dim: batch × seq_len × 768"]
A["🔤 Raw Input: 'Is this isekai peak?'"]
A --> B["WordPiece Subword Tokenization<br>[CLS] is this is ##ek ##ai peak ? [SEP]<br>⚠ 'isekai' → '##ek' + '##ai' (OOV — unseen in pre-training)"]
B --> C["Token Embeddings (30,522 vocab × 768d)<br>+ Position Embeddings (512 × 768d)<br>→ Dense vector per subword token"]
end
subgraph FROZEN["Layers 0–2 — Frozen: Preserve Pre-trained Linguistic Knowledge"]
C --> D0["Layer 0: Tokenization & Morphology<br>Learns: subword composition, token boundaries<br>Representations: raw token identity<br>⚡ ~0.01× gradient — catastrophic forgetting risk if updated"]
D0 --> D1["Layer 1: Syntactic Patterns<br>Learns: POS-like features, word order<br>Representations: local syntactic context"]
D1 --> D2["Layer 2: Phrase Structure<br>Learns: noun phrases, verb groups<br>Representations: 'this isekai' as a unit<br>⚡ ~0.08× gradient — minimal drift"]
end
subgraph PARTIAL["Layers 3–4 — Partially Adapted: Domain Semantics"]
D2 --> D3["Layer 3: Semantic Composition<br>Learns: meaning combinations<br>'isekai' + 'peak' → genre quality judgment<br>⚡ ~0.18× gradient"]
D3 --> D4["Layer 4: Topic & Sentiment<br>Learns: domain-specific topic clusters<br>Separates product talk from order talk<br>⚡ ~0.35× gradient"]
end
subgraph ADAPTED["Layer 5 — Heavily Adapted: Intent-Discriminative Features"]
D4 --> D5["Layer 5: Intent Boundary Learning<br>Learns: features that separate 10 intent classes<br>'isekai peak' → product_question vs recommendation<br>⚡ 0.6× gradient — largest encoder change"]
end
subgraph HEAD["Classification Head — Fully Trained from Scratch"]
D5 --> CLS["[CLS] Token Pooling<br>768-dim intent-focused vector<br>(aggregates full-sequence meaning)"]
CLS --> LIN["Linear Projection: 768 → 10<br>Decision hyperplane in embedding space<br>⚡ 1.0× gradient — fully trained"]
LIN --> SM["Softmax (T=1.0) → Probability Distribution"]
SM --> OUT["Intent Prediction:<br>product_question: 0.42<br>recommendation: 0.38<br>product_discovery: 0.12 ..."]
end
style D0 fill:#e8f5e9,stroke:#4caf50
style D1 fill:#e8f5e9,stroke:#4caf50
style D2 fill:#f1f8e9,stroke:#8bc34a
style D3 fill:#fff9c4,stroke:#fbc02d
style D4 fill:#fff3e0,stroke:#ff9800
style D5 fill:#ffccbc,stroke:#ff5722
style LIN fill:#ef9a9a,stroke:#f44336
Attention Redistribution: Learned Feature Importance Before vs After Fine-Tuning
What this shows: [CLS] token attention weights from Head 7, Layer 5 (the most intent-discriminative attention head, identified via attention entropy analysis). Before fine-tuning, the model wastes attention on structural tokens ([SEP], self-attention). After fine-tuning, attention concentrates on domain-relevant tokens that drive intent classification. Use BertViz (`pip install bertviz`) to reproduce this with your own fine-tuned model.
graph TD
subgraph PRE["Pre-trained DistilBERT — Generic Language Modeling Attention"]
direction LR
P_CLS["[CLS]<br>source"] --> P_is["'is'<br>w=0.15"]
P_CLS --> P_this["'this'<br>w=0.12"]
P_CLS --> P_isekai["'isekai'<br>w=0.08 ⬜"]
P_CLS --> P_peak["'peak'<br>w=0.10 ⬜"]
P_CLS --> P_SEP["'[SEP]'<br>w=0.20 🔴"]
P_CLS --> P_self["'[CLS]' self<br>w=0.35 🔴"]
end
subgraph POST["Fine-tuned (Manga-Adapted) — Intent-Discriminative Attention"]
direction LR
F_CLS["[CLS]<br>source"] --> F_is["'is'<br>w=0.05 ⬜"]
F_CLS --> F_this["'this'<br>w=0.03 ⬜"]
F_CLS --> F_isekai["'isekai'<br>w=0.35 🟢"]
F_CLS --> F_peak["'peak'<br>w=0.30 🟢"]
F_CLS --> F_SEP["'[SEP]'<br>w=0.07"]
F_CLS --> F_self["'[CLS]' self<br>w=0.20"]
end
subgraph EFFECT["Effect on [CLS] Representation"]
R1["Before: [CLS] = generic sentence embedding<br>Encodes positional structure, not meaning"]
R2["After: [CLS] = intent-discriminative encoding<br>Captures 'isekai' (genre) + 'peak' (quality judgment)<br>→ Separable in 768-d space for classification"]
R1 --> R2
end
style P_SEP fill:#ef9a9a
style P_self fill:#ef9a9a
style P_isekai fill:#e0e0e0
style P_peak fill:#e0e0e0
style F_isekai fill:#a5d6a7,stroke:#4caf50
style F_peak fill:#a5d6a7,stroke:#4caf50
style F_is fill:#e0e0e0
style F_this fill:#e0e0e0
Key Observation: Fine-tuning shifts roughly 0.28 of attention weight away from structural tokens ([CLS] self-attention: 0.35→0.20, [SEP]: 0.20→0.07) onto domain tokens ("isekai": 0.08→0.35, "peak": 0.10→0.30). This is the model learning which tokens carry intent signal. The [CLS] representation transforms from a generic sentence summary to an intent-focused encoding where similar intents cluster together in the 768-dimensional space.
Training Dynamics — Single Step Forward & Backward Pass
How to read this diagram: Follow the data flow top-to-bottom (forward pass), then bottom-to-top (backward pass). Tensor shapes are annotated at each stage. Color intensity reflects gradient magnitude — the loss signal is strongest at the classification head and dissipates as it flows backward through the encoder layers. Track this live with TensorBoard gradient histograms or W&B per-layer gradient norms.
graph TD
subgraph FORWARD["━━━ FORWARD PASS ━━━"]
direction TB
A["📦 Training Batch<br>32 examples × variable seq_len<br>Sampled with class-balanced strategy"]
A --> B["🔤 WordPiece Tokenization<br>→ token_ids: (32 × 128) int64<br>→ attention_mask: (32 × 128) bool<br>Pad to max_len=128, truncate if longer"]
B --> C["Embedding Lookup + Positional Encoding<br>→ (32 × 128 × 768) float32<br>~23M params — nearly frozen"]
C --> D["6 Transformer Encoder Layers<br>Each: Multi-Head Attention → FFN → LayerNorm + Residual<br>→ (32 × 128 × 768) at each layer<br>~42M params — graduated adaptation"]
D --> E["[CLS] Token Pooling<br>Extract first token's hidden state<br>→ (32 × 768) — one vector per example"]
E --> F["Classification Head: Linear(768 → 10)<br>→ (32 × 10) raw logits<br>~7.7K params — fully trained"]
F --> G["Focal Loss with Class Weights<br>L = −αₜ(1−pₜ)ᵞ log(pₜ)<br>γ=2.0 | αₜ = inverse frequency<br>→ scalar loss value"]
end
subgraph BACKWARD["━━━ BACKWARD PASS (Gradient Flow) ━━━"]
direction TB
G --> H["∂L/∂logits → Classification Head<br>Gradient magnitude: 1.0× (reference)<br>Largest parameter updates"]
H --> I["∂L/∂[CLS] → Encoder Layer 5<br>Gradient magnitude: 0.6×<br>Intent-discriminative features adapt"]
I --> J["∂L/∂hidden → Layers 4→3→2→1→0<br>Gradient decay: 0.35× → 0.18× → 0.08× → 0.03× → 0.01×<br>Bottom layers receive vanishing signal"]
J --> K["∂L/∂embeddings<br>Gradient magnitude: 0.005×<br>Pre-trained token representations barely move"]
end
subgraph OPTIM["━━━ OPTIMIZER STEP (AdamW) ━━━"]
direction TB
L["Gradient Clipping: max_norm=1.0<br>Prevents exploding gradients from hard examples"]
L --> M["Per-Layer Discriminative Learning Rates<br>Head: 2e-5 | Layer 5: 2e-5 | Layer 0: 6.55e-6<br>Matches natural gradient magnitude hierarchy"]
M --> N["Weight Decay: λ=0.01<br>L2 regularization on all params except bias & LayerNorm<br>Prevents weights from growing unbounded"]
N --> O["Update Parameters<br>θ ← θ − η · m̂/(√v̂ + ε) − λθ<br>Adam momentum smooths noisy gradients"]
end
subgraph EPOCH["━━━ EPOCH CONTEXT ━━━"]
P["Repeat for ~4,688 total steps<br>= 50K train examples ÷ 32 batch × 3 epochs<br>With linear warmup (469 steps) + linear LR decay"]
end
style G fill:#ef9a9a,stroke:#f44336
style H fill:#ffccbc,stroke:#ff5722
style I fill:#ffe0b2,stroke:#ff9800
style J fill:#fff9c4,stroke:#fbc02d
style K fill:#e8f5e9,stroke:#4caf50
Learning Rate Schedule — Training Phases & Model Behavior
How to read this diagram: The schedule is split into three behavioral phases — each phase has a different optimization objective and affects model layers differently. The key insight is that the learning rate schedule is not just a numerical curve: it encodes a curriculum where the model stabilizes its head first, then adapts its encoder, then fine-tunes for generalization. Monitor phase transitions with W&B LR vs. loss panels or TensorBoard scalar dashboards.
graph TD
subgraph PHASE1["Phase 1: Warmup — Head Stabilization (Steps 0–469, 10%)"]
direction TB
W1["LR ramps: 0 → 2e-5 (linear)<br>Classification head learns coarse decision boundaries"]
W1 --> W2["What the model does:<br>Head weights move from random → meaningful logits<br>Encoder layers receive tiny gradients — nearly frozen<br>Loss drops rapidly: ~2.3 → ~1.1"]
W2 --> W3["Per-Layer Effective Update Magnitude<br>Head: ████████░░ (ramping)<br>Layer 5: █░░░░░░░░░ (minimal)<br>Layers 0–4: ░░░░░░░░░░ (frozen)"]
W3 --> W4["⚠ Why warmup matters:<br>Without it, random head → noisy gradients<br>→ large encoder updates → catastrophic forgetting<br>Pre-trained linguistic knowledge destroyed"]
end
subgraph PHASE2["Phase 2: Peak Adaptation — Encoder Learning (Steps 469–2344, 40%)"]
direction TB
P1["LR at peak: 2e-5 (head) → 5.24e-6 (embed)<br>Discriminative LR: ξ=0.8 decay per layer"]
P1 --> P2["What the model does:<br>Top layers (4–5) learn intent-discriminative features<br>Mid layers (2–3) adapt domain semantics<br>Bottom layers (0–1) barely change<br>Loss plateau: ~1.1 → ~0.5"]
P2 --> P3["Per-Layer Effective Update Magnitude<br>Head: ██████████ (1.0×)<br>Layer 5: ██████░░░░ (0.6×)<br>Layer 4: ████░░░░░░ (0.35×)<br>Layer 3: ██░░░░░░░░ (0.18×)<br>Layer 2: █░░░░░░░░░ (0.08×)<br>Layers 0–1: ░░░░░░░░░░ (~0.01×)"]
P3 --> P4["🎯 Critical period:<br>Most intent-discriminative learning happens here<br>'isekai'/'shōnen' attention patterns form<br>Rare-class accuracy jumps from ~70% → ~85%"]
end
subgraph PHASE3["Phase 3: Linear Decay — Convergence & Regularization (Steps 2344–4688, 50%)"]
direction TB
D1["LR decays: 2e-5 → 0 (linear)<br>All layers receiving progressively smaller updates"]
D1 --> D2["What the model does:<br>Fine-grained decision boundary refinement<br>Confidence calibration (softmax sharpening)<br>Memorization risk increases — monitor val loss<br>Loss final: ~0.5 → ~0.24"]
D2 --> D3["Per-Layer Effective Update Magnitude<br>Head: ████░░░░░░ → █░░░░░░░░░<br>Layer 5: ██░░░░░░░░ → ░░░░░░░░░░<br>All others: effectively frozen"]
D3 --> D4["⚠ Overfitting checkpoint:<br>If val_loss stops decreasing while train_loss drops<br>→ Generalization gap opening<br>→ Early stopping or reduce epochs"]
end
PHASE1 --> PHASE2
PHASE2 --> PHASE3
style PHASE1 fill:#e3f2fd,stroke:#1976d2
style PHASE2 fill:#fff3e0,stroke:#ff9800
style PHASE3 fill:#e8f5e9,stroke:#4caf50
Optimization Trajectory Through the Loss Landscape
How to read this diagram: The model's weights trace a path through a high-dimensional loss landscape during fine-tuning. Each stage represents a qualitatively different region — from the broad pre-trained basin (good for general NLP) through a narrow manga-specific minimum (good for our task). Regularization mechanisms prevent the trajectory from falling into sharp, non-generalizing minima. Track train vs. val loss divergence in real time with W&B or TensorBoard.
graph TD
subgraph START["Pre-trained Basin — General NLP (Epoch 0)"]
S1["θ₀: Pre-trained DistilBERT weights<br>Broad, flat minimum good for general language tasks"]
S1 --> S2["Train loss: 2.31 (near −log(1/10) = random)<br>Val loss: 2.29<br>Generalization gap: ~0.02 ✅<br>Manga accuracy: 71.8%"]
end
subgraph WARMUP["Transition Phase — Warmup (Steps 0–469)"]
S2 --> T1["Gradient descent begins with small steps<br>Head learns coarse intent boundaries<br>θ moves toward task-relevant region"]
T1 --> T2["Train loss: 1.1 | Val loss: 1.15<br>Gap: 0.05 ✅<br>Regularization active:<br>• Weight decay (λ=0.01) keeps θ near pre-trained<br>• Warmup limits step size → stable trajectory"]
end
subgraph ADAPT["Manga-Specific Minimum — Peak Training (Steps 469–2344)"]
T2 --> A1["Fastest descent: top layers adapt to domain<br>Attention heads discover manga-relevant features<br>Loss surface narrows — specialization begins"]
A1 --> A2["Train loss: 0.45 | Val loss: 0.52<br>Gap: 0.07 ✅ (still healthy)<br>Regularization active:<br>• Focal loss (γ=2) → ignores easy examples<br>• Discriminative LR → bottom layers anchored<br>• Dropout (0.1) → implicit ensemble"]
end
subgraph CONVERGE["Fine-Grained Convergence — Decay Phase (Steps 2344–4688)"]
A2 --> C1["Small LR → fine adjustments near minimum<br>Decision boundaries sharpen<br>Confidence calibration improves"]
C1 --> C2["Train loss: 0.24 | Val loss: 0.31<br>Gap: 0.07 ✅ (stable)<br>Final: 92.1% overall | 88.6% rare-class<br>Model sits in good minimum"]
end
subgraph DANGER["⚠ Overfitting Territory — If Training Continues (Epoch 4+)"]
C2 -->|"Continue training<br>past epoch 3"| O1["θ moves into sharp, narrow minimum<br>Memorizes training noise & label errors<br>Pre-trained features overwritten"]
O1 --> O2["Train loss: 0.05 | Val loss: 0.89<br>Gap: 0.84 ❌ (severe overfitting)<br>Sharp minimum → brittle to distribution shift<br>Manga accuracy drops on unseen queries"]
end
subgraph GUARD["Regularization Mechanisms Preventing Overfitting"]
G1["Weight Decay λ=0.01<br>Pulls θ toward origin<br>→ flatter minima preferred"]
G2["Discriminative LR<br>Bottom layers: 0.26× base LR<br>→ pre-trained anchoring"]
G3["Focal Loss γ=2.0<br>Stops learning easy examples<br>→ reduces effective epochs"]
G4["Early Stopping<br>Monitor val_loss patience=2<br>→ halt before overfitting"]
end
style START fill:#e3f2fd,stroke:#1976d2
style WARMUP fill:#e8f5e9,stroke:#4caf50
style ADAPT fill:#fff3e0,stroke:#ff9800
style CONVERGE fill:#f1f8e9,stroke:#8bc34a
style DANGER fill:#ffcdd2,stroke:#f44336
style GUARD fill:#f3e5f5,stroke:#9c27b0
Implementation Deep-Dive
Dataset Preparation
import json
import re
from collections import Counter
from sklearn.model_selection import train_test_split
# ─── Step 1: Load and Clean Raw Data ───
def load_manga_intent_dataset(logs_path: str, synthetic_path: str):
    """
    Combine 50K production logs with 5K synthetic examples.
    Production logs come from Amazon customer service conversations
    pre-labeled by a combination of rule-based matcher + human review.
    """
    with open(logs_path) as f:
        prod_data = json.load(f)  # 50K examples
    with open(synthetic_path) as f:
        synth_data = json.load(f)  # 5K examples from Claude

    # Clean and normalize, tagging each example with its provenance
    all_examples = []
    for source, data in (("production", prod_data), ("synthetic", synth_data)):
        for item in data:
            all_examples.append({
                "text": clean_text(item["message"]),
                "label": item["intent"],
                "source": source,
            })
    return all_examples

def clean_text(text: str) -> str:
    """Normalize text while preserving manga-specific tokens."""
    text = text.strip().lower()
    # Preserve Japanese characters (hiragana, katakana, kanji)
    # but normalize whitespace and remove control characters
    text = re.sub(r'[\x00-\x1f\x7f-\x9f]', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text

# ─── Step 2: Handle Class Imbalance ───
def compute_class_weights(labels: list) -> dict:
    """
    Inverse frequency weighting for the focal loss alpha parameter.
    Gives rare intents (escalation: 3%) higher weight than
    common intents (product_discovery: 22%).
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: total / (len(counts) * count) for label, count in counts.items()}

# ─── Step 3: Stratified Split ───
def create_splits(examples: list):
    """
    80/10/10 split with stratification to maintain the intent
    distribution in each split.
    """
    texts = [e["text"] for e in examples]
    labels = [e["label"] for e in examples]
    X_train, X_temp, y_train, y_temp = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=42
    )
    X_val, X_test, y_val, y_test = train_test_split(
        X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
Focal Loss Implementation
import torch
import torch.nn as nn
import torch.nn.functional as F
class FocalLoss(nn.Module):
    """
    Focal Loss (Lin et al., 2017) with per-class alpha weights.

    FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)

    When gamma=0, this reduces to standard weighted cross-entropy.
    When gamma=2 (our default), easy examples (p_t > 0.8) contribute
    almost nothing to the loss, letting the model focus on hard cases.
    """
    def __init__(self, alpha: torch.Tensor, gamma: float = 2.0):
        super().__init__()
        self.alpha = alpha  # Shape: (num_classes,)
        self.gamma = gamma

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # logits: (batch_size, num_classes)
        # targets: (batch_size,) — class indices
        # Compute softmax probabilities
        probs = F.softmax(logits, dim=-1)  # (batch_size, num_classes)
        # Get probability of true class
        targets_one_hot = F.one_hot(targets, num_classes=logits.size(-1)).float()
        p_t = (probs * targets_one_hot).sum(dim=-1)  # (batch_size,)
        # Get alpha for true class
        alpha_t = self.alpha[targets]  # (batch_size,)
        # Compute focal modulating factor
        focal_weight = (1 - p_t) ** self.gamma  # (batch_size,)
        # Compute focal loss
        ce_loss = -torch.log(p_t + 1e-8)  # (batch_size,)
        loss = alpha_t * focal_weight * ce_loss  # (batch_size,)
        return loss.mean()
Fine-Tuning with Discriminative Learning Rates
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup
def create_optimizer_with_discriminative_lr(
    model: DistilBertForSequenceClassification,
    base_lr: float = 2e-5,
    decay_factor: float = 0.8,
    weight_decay: float = 0.01,
):
    """
    Create AdamW optimizer with per-layer learning rates.

    Top layers get base_lr; each lower layer gets base_lr * decay^(depth from top).
    This matches the natural gradient magnitude decay through the network
    and prevents catastrophic forgetting of low-level language features.
    Weight decay is skipped for biases and LayerNorm parameters.
    """
    no_decay = ("bias", "LayerNorm", "layer_norm")

    def param_groups(module, lr):
        """Split a module into decayed weights and undecayed bias/LayerNorm params."""
        named = list(module.named_parameters())
        return [
            {"params": [p for n, p in named if not any(nd in n for nd in no_decay)],
             "lr": lr, "weight_decay": weight_decay},
            {"params": [p for n, p in named if any(nd in n for nd in no_decay)],
             "lr": lr, "weight_decay": 0.0},
        ]

    optimizer_grouped_parameters = []

    # Classification head — highest LR
    optimizer_grouped_parameters += param_groups(model.classifier, base_lr)

    # Pre-classifier (pooling layer)
    optimizer_grouped_parameters += param_groups(model.pre_classifier, base_lr)

    # Encoder layers — discriminative LR (top to bottom)
    num_layers = len(model.distilbert.transformer.layer)
    for layer_idx in range(num_layers - 1, -1, -1):
        layer_lr = base_lr * (decay_factor ** (num_layers - 1 - layer_idx))
        optimizer_grouped_parameters += param_groups(
            model.distilbert.transformer.layer[layer_idx], layer_lr
        )

    # Embeddings — lowest LR
    embed_lr = base_lr * (decay_factor ** num_layers)
    optimizer_grouped_parameters += param_groups(model.distilbert.embeddings, embed_lr)

    return AdamW(optimizer_grouped_parameters)
# ─── Full Training Loop ───
def train_intent_classifier(
    train_dataloader,
    val_dataloader,
    num_epochs: int = 3,
    base_lr: float = 2e-5,
    gamma: float = 2.0,
    decay_factor: float = 0.8,
    max_grad_norm: float = 1.0,
):
    """
    Fine-tune DistilBERT with focal loss, discriminative LR, and warmup.
    """
    # Load pre-trained model
    model = DistilBertForSequenceClassification.from_pretrained(
        "distilbert-base-uncased",
        num_labels=10,
    )
    tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

    # Move to GPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    # Class weights for focal loss (inverse frequency)
    class_weights = torch.tensor([
        1.0 / 0.22,  # product_discovery
        1.0 / 0.15,  # product_question
        1.0 / 0.18,  # recommendation
        1.0 / 0.08,  # faq
        1.0 / 0.12,  # order_tracking
        1.0 / 0.07,  # return_request
        1.0 / 0.05,  # promotion
        1.0 / 0.04,  # checkout_help
        1.0 / 0.03,  # escalation
        1.0 / 0.06,  # chitchat
    ]).to(device)
    class_weights = class_weights / class_weights.sum()  # Normalize
    focal_loss_fn = FocalLoss(alpha=class_weights, gamma=gamma)

    # Optimizer with per-layer learning rates
    optimizer = create_optimizer_with_discriminative_lr(
        model, base_lr=base_lr, decay_factor=decay_factor
    )

    # Warmup schedule: linear warmup for 10% of steps, then linear decay
    total_steps = len(train_dataloader) * num_epochs
    warmup_steps = int(0.1 * total_steps)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=warmup_steps,
        num_training_steps=total_steps,
    )

    # Training loop
    best_val_accuracy = 0.0
    for epoch in range(num_epochs):
        model.train()
        total_loss = 0.0
        for batch in train_dataloader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            # Forward pass
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = outputs.logits  # (batch_size, 10)

            # Compute focal loss
            loss = focal_loss_fn(logits, labels)

            # Backward pass
            loss.backward()

            # Gradient clipping — prevents exploding gradients
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)

            # Optimizer step with per-layer LR
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()

            total_loss += loss.item()

        # Validation
        val_accuracy = evaluate(model, val_dataloader, device)
        avg_loss = total_loss / len(train_dataloader)
        print(f"Epoch {epoch+1}: Loss={avg_loss:.4f}, Val Acc={val_accuracy:.4f}")

        # Save best model
        if val_accuracy > best_val_accuracy:
            best_val_accuracy = val_accuracy
            model.save_pretrained("./best_intent_model")
            tokenizer.save_pretrained("./best_intent_model")

    return model

def evaluate(model, dataloader, device) -> float:
    """Compute accuracy on a validation/test set."""
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for batch in dataloader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            predictions = outputs.logits.argmax(dim=-1)
            correct += (predictions == labels).sum().item()
            total += labels.size(0)
    return correct / total

Hyperparameter Search with Optuna
import optuna

def objective(trial):
    """
    Optuna objective for hyperparameter search.

    Search space based on findings from Sun et al. (2019):
    - Learning rate: 1e-5 to 5e-5 (sweet spot for BERT fine-tuning)
    - Batch size: 16 or 32 (smaller batches = more noise = regularization)
    - Gamma: 1.0 to 3.0 (focal loss focusing parameter)
    - Decay factor: 0.7 to 0.95 (discriminative LR decay)
    """
    base_lr = trial.suggest_float("base_lr", 1e-5, 5e-5, log=True)
    batch_size = trial.suggest_categorical("batch_size", [16, 32])
    gamma = trial.suggest_float("gamma", 1.0, 3.0)
    decay_factor = trial.suggest_float("decay_factor", 0.7, 0.95)
    num_epochs = trial.suggest_int("num_epochs", 2, 5)

    # create_dataloader and the X/y splits are assumed to be defined at module level
    train_loader = create_dataloader(X_train, y_train, batch_size=batch_size)
    val_loader = create_dataloader(X_val, y_val, batch_size=64)

    model = train_intent_classifier(
        train_loader, val_loader,
        num_epochs=num_epochs,
        base_lr=base_lr,
        gamma=gamma,
        decay_factor=decay_factor,
    )
    device = next(model.parameters()).device
    val_accuracy = evaluate(model, val_loader, device)
    return val_accuracy

# Run search
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(f"Best accuracy: {study.best_value:.4f}")
print(f"Best params: {study.best_params}")
# Typical result: lr=2.1e-5, batch=32, gamma=2.0, decay=0.82, epochs=3
SageMaker Training Job
```python
import sagemaker
from sagemaker.huggingface import HuggingFace

def launch_sagemaker_training():
    """
    Launch fine-tuning on SageMaker with spot instances.
    ml.g4dn.xlarge: 1x T4 GPU, 16GB VRAM — enough for DistilBERT (66M params).
    Spot pricing: ~$0.16/hr vs $0.526/hr on-demand (70% savings).
    """
    huggingface_estimator = HuggingFace(
        entry_point="train.py",
        source_dir="./training_scripts",
        instance_type="ml.g4dn.xlarge",
        instance_count=1,
        role=sagemaker.get_execution_role(),
        transformers_version="4.36",
        pytorch_version="2.1",
        py_version="py310",
        hyperparameters={
            "model_name": "distilbert-base-uncased",
            "num_labels": 10,
            "epochs": 3,
            "learning_rate": 2.1e-5,
            "batch_size": 32,
            "focal_gamma": 2.0,
            "warmup_ratio": 0.1,
            "lr_decay_factor": 0.82,
        },
        use_spot_instances=True,
        max_wait=7200,  # 2 hours max (must be >= max_run for spot jobs)
        max_run=3600,   # 1 hour expected
        checkpoint_s3_uri="s3://manga-ml-models/checkpoints/intent-classifier/",
    )
    huggingface_estimator.fit({
        "train": "s3://manga-ml-data/intent-classifier/train/",
        "validation": "s3://manga-ml-data/intent-classifier/val/",
    })
    return huggingface_estimator
```
Inferentia Deployment (Neuron SDK Compile)
```python
import torch
import torch_neuronx
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer

def compile_for_inferentia(model_path: str, output_path: str):
    """
    Compile fine-tuned DistilBERT for AWS Inferentia using the Neuron SDK.
    Inferentia gives ~2.5x better cost-performance than GPU for inference.
    Target: <15ms P95 latency at 500 req/sec.
    """
    model = DistilBertForSequenceClassification.from_pretrained(model_path)
    tokenizer = DistilBertTokenizer.from_pretrained(model_path)
    model.eval()

    # Create example input for tracing
    example_input = tokenizer(
        "Show me horror manga like Berserk",
        return_tensors="pt",
        max_length=128,
        padding="max_length",
        truncation=True,
    )

    # Compile with the Neuron SDK
    traced_model = torch_neuronx.trace(
        model,
        (example_input["input_ids"], example_input["attention_mask"]),
    )

    # Save compiled model
    traced_model.save(output_path)
    print(f"Compiled model saved to {output_path}")
    print("Expected latency: ~8ms per inference on inf2.xlarge")
```
Group Discussion: Key Decision Points
Decision Point 1: DistilBERT vs RoBERTa vs TinyBERT
Priya (ML Engineer): I benchmarked all three on our manga intent dataset. Here are the results:
| Model | Params | Accuracy | P95 Latency | Memory | Monthly Cost (SageMaker) |
|---|---|---|---|---|---|
| TinyBERT | 14.5M | 87.3% | 5ms | 58MB | $89 |
| DistilBERT | 66M | 92.1% | 12ms | 256MB | $178 |
| RoBERTa-base | 125M | 94.8% | 28ms | 480MB | $348 |
RoBERTa wins on raw accuracy, but at 2.3x the latency and 2x the cost.
Marcus (Architect): Our latency budget for intent classification is 15ms P95 — we need headroom because this is the first step in every request. RoBERTa at 28ms is disqualifying. And at 500 req/sec peak, the memory difference matters for concurrent requests.
Aiko (Data Scientist): The 2.7% accuracy gap between DistilBERT (92.1%) and RoBERTa (94.8%) sounds significant, but look at where the errors are. DistilBERT's errors are concentrated in multi-intent messages (which we handle with fallback routing anyway) and rare intents. For the top-5 intents that cover 75% of traffic, DistilBERT hits 95.2%.
Sam (PM): Let me compute the cost-per-quality-point. RoBERTa costs $170/month more for 2.7% accuracy gain. That is $63 per quality point. Our threshold is $50/point for non-critical classifiers. Plus, the latency overhead adds ~15ms to every request, which Priya showed costs us 0.3% in user engagement drop.
Jordan (MLOps): RoBERTa also takes 35 minutes to train per epoch on our dataset vs 12 minutes for DistilBERT. That means our weekly retraining pipeline takes 3x longer, and CI checks for model validation take 3x longer. On spot instances, longer jobs are more likely to get interrupted.
Resolution: DistilBERT chosen. The 92.1% accuracy meets our 90% threshold, the latency fits within our 15ms budget with headroom, and the cost-per-quality-point for RoBERTa ($63/point) exceeds our $50 threshold. TinyBERT was too inaccurate for production despite its speed advantage.
Decision Point 2: Focal Loss vs Weighted Cross-Entropy vs Oversampling
Priya (ML Engineer): I ran ablations on three approaches to class imbalance:
| Approach | Overall Acc. | Rare-Class Acc. (escalation, promotion, checkout) | Training Time |
|---|---|---|---|
| Standard CE (no balancing) | 91.2% | 78.4% | 36 min |
| Weighted CE | 91.5% | 84.2% | 36 min |
| Oversampling (repeat rare) | 91.8% | 85.1% | 52 min |
| Focal Loss (γ=2) | 92.1% | 87.8% | 38 min |
| Focal Loss + Weighted | 92.1% | 88.6% | 38 min |
Focal loss with class weights gives the best rare-class accuracy with minimal training time overhead.
Aiko (Data Scientist): The key insight is that focal loss doesn't just reweight classes — it reweights difficulty. Within the majority class product_discovery, there are easy examples ("show me manga") and hard examples ("anything with that dark isekai vibe"). Oversampling treats all rare-class examples equally, but focal loss focuses on the hardest examples in every class.
Marcus (Architect): Does focal loss add inference latency?
Priya (ML Engineer): Zero inference overhead. Focal loss only affects training. At inference time, the model's softmax output is identical regardless of what loss function was used during training.
Sam (PM): Oversampling takes 44% longer to train. At our weekly retraining cadence, that is 16 minutes per week wasted on duplicate examples. Focal loss gives better results for less compute.
Jordan (MLOps): The focal loss hyperparameter gamma needs tuning though. I saw gamma=1.5 and gamma=2.5 give slightly different results on different data distributions. We should add gamma to our Optuna search space and re-validate on each retraining cycle.
Resolution: Focal loss with gamma=2.0 and class weights selected. Best overall accuracy (92.1%), best rare-class accuracy (88.6%), no inference overhead, and only 5% training time overhead vs standard CE.
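Aiko's "reweights difficulty" point can be sanity-checked with a few lines of arithmetic. This is a sketch of the per-example focal term only (the production loss also applies class weights and batching); `focal_term` is an illustrative helper, not a function from the pipeline:

```python
import math

def focal_term(p_true: float, gamma: float = 2.0) -> float:
    """Per-example focal loss: (1 - p_t)^gamma * -log(p_t)."""
    return (1.0 - p_true) ** gamma * -math.log(p_true)

easy = focal_term(0.90)  # confidently correct example
hard = focal_term(0.10)  # badly misclassified example

print(f"easy: {easy:.5f}")           # 0.00105: almost no gradient signal
print(f"hard: {hard:.5f}")           # 1.86509
print(f"ratio: {hard / easy:.0f}x")  # ~1770x more weight on the hard example
# With gamma=0 (plain cross-entropy) the same ratio is only ~22x.
```

At gamma=0 the focusing vanishes and the term reduces to weighted cross-entropy, which is why the gamma sweep in Ablation A1 starts there.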
Decision Point 3: How Often to Retrain
Jordan (MLOps): Our production data shows intent distribution shifts of ~2% per month as seasonal promotions and new manga releases change query patterns. When do we retrain?
Aiko (Data Scientist): I set up drift detection monitoring. Here is the signal:
| Period | Distribution KL-Divergence | Accuracy (live) | Retrain? |
|---|---|---|---|
| Week 1 | 0.002 | 92.0% | No |
| Week 2 | 0.005 | 91.8% | No |
| Month 1 | 0.012 | 91.2% | Monitor |
| Month 2 | 0.028 | 89.8% | Yes |
| Month 3 (no retrain) | 0.045 | 87.1% | Overdue |
The accuracy degrades ~1% per month without retraining. Our 90% threshold is hit at ~2 months.
Sam (PM): Monthly retraining compute comes to about $12 per cycle on spot instances; the 36-minute training run itself is only pennies at $0.16/hr, with the Optuna re-search and validation jobs accounting for the rest. That is nothing. But the labeling cost for new training data is the real expense — about $500/month for 2K human-labeled examples.
Priya (ML Engineer): We can reduce labeling cost with active learning. Sample the 200 lowest-confidence predictions from production each week, label those, and add them to the training set. That gives us the highest-information examples for the lowest labeling cost.
Marcus (Architect): Monthly retraining means we need automated validation gates. The new model must beat the current production model on the golden test set before deployment, or it gets rejected.
Resolution: Monthly retraining cadence with active learning for data collection. Automated pipeline: sample low-confidence predictions → human label → retrain → validate on golden set → deploy if accuracy ≥ current model. Total monthly cost: ~$512 ($12 compute + $500 labeling).
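The drift signal in Aiko's table is the KL divergence between the live intent distribution and the training-time distribution. A minimal sketch; the live distribution below is illustrative, not production data:

```python
import math

def kl_divergence(p: list[float], q: list[float], eps: float = 1e-9) -> float:
    """KL(P || Q) = sum_i p_i * log(p_i / q_i); eps guards empty bins."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Training-time intent distribution (the 10 intents, in table order)
train = [0.22, 0.15, 0.18, 0.08, 0.12, 0.07, 0.05, 0.04, 0.03, 0.06]
# Hypothetical live distribution after a promotion launch shifts traffic
live = [0.19, 0.14, 0.16, 0.08, 0.12, 0.07, 0.11, 0.04, 0.03, 0.06]

drift = kl_divergence(live, train)
print(f"KL divergence: {drift:.4f}")  # ~0.0304
if drift > 0.02:
    print("Drift threshold exceeded: schedule retraining")
```

In this toy example a single promotion-driven spike (5% to 11% of traffic) is enough to cross the 0.02 retraining threshold.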
Decision Point 4: Synthetic Data — Quality vs Quantity
Aiko (Data Scientist): Our 5K synthetic examples from Claude are mixed quality. I audited 500 random samples:
| Quality Level | % of Synthetic | Example |
|---|---|---|
| High (correct intent + natural) | 72% | "Do you have any box sets for Demon Slayer?" → product_question |
| Medium (correct intent, slightly unnatural) | 18% | "I desire to procure the manga known as One Piece" → product_discovery |
| Low (wrong intent or nonsensical) | 10% | "Tell me about the art style of returning my order" → labeled as product_question but actually ambiguous |
Priya (ML Engineer): The 10% bad examples introduce label noise. My experiments show:
| Synthetic Mix | Accuracy | Rare-Class Acc. |
|---|---|---|
| 0% synthetic (50K prod only) | 91.4% | 84.0% |
| 5K raw synthetic (10% bad) | 91.8% | 86.2% |
| 5K filtered synthetic (2% bad) | 92.1% | 88.6% |
| 10K raw synthetic (10% bad) | 91.6% | 85.8% |
More synthetic data with noise hurts. Filtered synthetic data helps.
Jordan (MLOps): The filtering pipeline uses a cross-validation approach: train 5 models on different folds of production data, then check if all 5 agree on the synthetic example's label. If they disagree, the example is flagged for human review.
Sam (PM): Claude generates 5K examples in about 20 minutes at ~$15 in API cost. The filtering and human review adds maybe $200. So 5K high-quality synthetic examples cost ~$215. That is still cheaper than labeling 5K raw examples ($1,250 at $0.25/label).
Resolution: Use 5K synthetic examples with consensus-based filtering. The 4.6pp improvement in rare-class accuracy (84.0% with no synthetic data → 88.6% with filtered synthetic) justifies the $215 investment. Reject unfiltered synthetic data above 10% of the total dataset — diminishing returns and noise introduction.
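Jordan's consensus filter reduces to a unanimity vote across the five fold-models. A sketch with made-up predictions; `consensus_filter` is an illustrative helper, not the production pipeline:

```python
def consensus_filter(fold_predictions: list[list[str]], proposed_labels: list[str]):
    """Keep a synthetic example only if every fold-model agrees with its
    proposed label; otherwise flag it for human review."""
    keep, review = [], []
    for i, label in enumerate(proposed_labels):
        votes = [preds[i] for preds in fold_predictions]
        if all(v == label for v in votes):
            keep.append(i)
        else:
            review.append(i)
    return keep, review

# Hypothetical: 5 fold-models scoring 3 synthetic examples
folds = [
    ["faq", "recommendation", "product_question"],
    ["faq", "recommendation", "product_discovery"],
    ["faq", "recommendation", "product_question"],
    ["faq", "recommendation", "product_question"],
    ["faq", "recommendation", "return_request"],
]
labels = ["faq", "recommendation", "product_question"]
keep, review = consensus_filter(folds, labels)
print(keep, review)  # [0, 1] [2]: example 2 disagrees, goes to human review
```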
Research Paper References
1. DistilBERT: A Distilled Version of BERT (Sanh et al., 2019)
Key contribution: 6-layer student model trained via knowledge distillation from 12-layer BERT-base. Retains 97% of BERT's NLU capability at 60% of the parameters and 1.6x inference speed. Uses triple loss: distillation loss (soft labels from teacher), masked language modeling loss, and cosine embedding loss (student/teacher hidden state alignment).
Relevance to MangaAssist: Our starting point. The 6-layer architecture creates the gradient magnitude decay pattern we exploit with discriminative learning rates. The pre-trained knowledge from distillation means bottom layers already encode strong syntactic features that we do not want to destroy during fine-tuning.
2. Focal Loss for Dense Object Detection (Lin et al., 2017)
Key contribution: Originally designed for object detection, where background examples vastly outnumber objects. The $(1-p_t)^\gamma$ modulating factor down-weights easy examples exponentially, focusing training on hard negatives. At $\gamma=2$, an example classified with probability 0.9 contributes 100x less loss than it would under plain cross-entropy, since $(1-0.9)^2 = 0.01$.
Relevance to MangaAssist: Our intent distribution mirrors the class imbalance problem. escalation (3%) is the "rare object" in a sea of product_discovery (22%) "background." Focal loss gives us 4.6% improvement on rare-class accuracy without any data augmentation overhead.
3. How to Fine-Tune BERT for Text Classification (Sun et al., 2019)
Key contribution: Systematic study of BERT fine-tuning strategies. Key findings: (1) Further pre-training on domain data helps. (2) Layer-wise discriminative learning rates outperform uniform LR. (3) The optimal learning rate is between 1e-5 and 5e-5. (4) Longer fine-tuning (3-4 epochs) helps but risks overfitting beyond 5 epochs.
Relevance to MangaAssist: Directly informs our learning rate schedule ($2 \times 10^{-5}$ base with 0.82 decay), epoch count (3), and warmup strategy (10% linear warmup). The discriminative LR approach gives us 0.8% accuracy improvement over uniform LR at zero additional cost.
4. Understanding the Behaviour of Contrastive Loss (Wang & Liu, 2021)
Key contribution: Analyzes how contrastive and focal losses reshape the embedding space. Shows that modulating factors change the effective temperature of the similarity distribution, creating tighter clusters for hard examples.
Relevance to MangaAssist: Validates our combined focal loss + class weights approach. The tight clustering effect means that fine-tuned [CLS] representations for similar intents (product_discovery vs recommendation) become more separable in the 768-dim space.
5. Curriculum Learning (Bengio et al., 2009)
Key contribution: Training on easy examples first, then gradually introducing harder ones, can lead to faster convergence and better generalization. The ordering of training data matters.
Relevance to MangaAssist: Our warmup schedule implicitly creates a curriculum effect: during warmup, the model's low learning rate means it focuses on high-confidence (easy) examples first. As LR ramps up, it begins learning from harder examples. We could make this explicit by sorting training batches by difficulty (measured by loss on the previous epoch).
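Making the curriculum explicit, as suggested above, could be as simple as reordering examples by their previous-epoch loss. A hypothetical sketch, not part of the current pipeline:

```python
def curriculum_order(example_ids: list[int], prev_epoch_loss: dict[int, float]) -> list[int]:
    """Order training examples easiest-first using last epoch's per-example loss."""
    return sorted(example_ids, key=lambda i: prev_epoch_loss[i])

# Hypothetical per-example losses recorded during the previous epoch
losses = {0: 0.12, 1: 2.31, 2: 0.45, 3: 1.07}
order = curriculum_order([0, 1, 2, 3], losses)
print(order)  # easy to hard: [0, 2, 3, 1]
```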
6. BERT: Pre-training of Deep Bidirectional Transformers (Devlin et al., 2018, NAACL)
Key contribution: Bidirectional MLM pre-training; introduced the [CLS]-pooling convention used by DistilBERT. Showed fine-tuning a single-layer classification head on top of a frozen-then-unfrozen transformer matches or beats task-specific architectures on GLUE.
Relevance to MangaAssist: Our 10-class softmax head sits exactly where BERT's NSP head sat. The decision to keep [CLS] pooling (vs. mean-pooling) is justified by Devlin's ablations: [CLS] is the position the model's attention is encouraged to summarize the sequence into during pre-training.
7. Universal Language Model Fine-tuning (Howard & Ruder, 2018, ACL — ULMFiT)
Key contribution: Introduced discriminative fine-tuning (per-layer LR decay, the basis of our 0.82 decay), slanted triangular learning rates (the basis of our warmup schedule), and gradual unfreezing. Showed these together cut error rates 18-24% vs. uniform fine-tuning.
Relevance to MangaAssist: Our 0.82-decay schedule is a direct application. The ablation in §"Extended Ablations" below sweeps the decay value 0.7 → 0.9, confirming Howard & Ruder's recommendation that lower layers should learn more slowly because they encode more general syntactic features.
8. Cyclical Learning Rates / Super-Convergence (Smith, 2017 — IEEE WACV / arXiv 1708.07120)
Key contribution: Demonstrated that aggressive LR warmup followed by decay (the "1cycle" policy) reaches good minima 3-5× faster than fixed LR with momentum. Justified the now-standard 5-15% warmup ratio.
Relevance to MangaAssist: Our 10% warmup is at the center of Smith's recommended range. The warmup ratio ablation (§"Extended Ablations") sweeps {5%, 10%, 15%}; 10% wins, consistent with Smith's results on smaller text-classification datasets.
9. Decoupled Weight Decay Regularization (Loshchilov & Hutter, 2019, ICLR — AdamW)
Key contribution: Showed that the standard "Adam + weight decay" implementation actually couples weight decay with the adaptive LR, hurting generalization. Proposed AdamW which decouples them. Now the default optimizer for transformer fine-tuning.
Relevance to MangaAssist: Our optimizer is torch.optim.AdamW (not Adam). The weight-decay term we use (0.01) is the value Loshchilov & Hutter recommend for fine-tuning small classification heads.
10. On Calibration of Modern Neural Networks (Guo et al., 2017, ICML)
Key contribution: Empirically showed that deep networks are systematically overconfident after fine-tuning. Introduced temperature scaling as a one-parameter post-hoc fix that minimizes NLL on a held-out validation set without changing argmax accuracy.
Relevance to MangaAssist: Drives the entire calibration deep-dive (01-confidence_calibration_for_intent_routing_mangaassist.md). Our temperature T = 1.6 is fitted via Guo's method and reduces ECE from 0.067 → 0.040 with no accuracy loss.
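Temperature scaling is one division of the logits by T before the softmax: the argmax (and therefore accuracy) is unchanged, but confidence is flattened. A sketch with illustrative 3-class logits; T = 1.6 is our fitted value:

```python
import math

def softmax_with_temperature(logits: list[float], T: float = 1.0) -> list[float]:
    """Temperature-scaled softmax: argmax unchanged, confidence flattened for T > 1."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract max for numeric stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.2, 1.1, 0.3]  # illustrative logits, not model output
raw = softmax_with_temperature(logits, T=1.0)
cal = softmax_with_temperature(logits, T=1.6)
print(f"uncalibrated top-1: {max(raw):.3f}")  # overconfident
print(f"calibrated  top-1: {max(cal):.3f}")   # softer, same argmax
assert raw.index(max(raw)) == cal.index(max(cal))
```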
11. A Baseline for Detecting Misclassified and Out-of-Distribution Examples (Hendrycks & Gimpel, 2017, ICLR)
Key contribution: Established maximum softmax probability (MSP) as a strong, simple baseline for OOD detection. Subsequent methods (ODIN, energy, Mahalanobis) are calibrated against this baseline.
Relevance to MangaAssist: Our OOD pipeline (01-ood_unknown_intent_detection_mangaassist.md) compares MSP, margin score, and energy score side-by-side; energy (Liu 2020) wins on AUROC but MSP is the production fallback when feature-space access is restricted.
12. Energy-based Out-of-distribution Detection (Liu et al., 2020, NeurIPS)
Key contribution: Defined the energy score E(x) = -T · log Σ_i exp(z_i / T) over logits and showed it is theoretically aligned with the data density p(x) under softmax. Outperforms MSP and ODIN on standard OOD benchmarks.
Relevance to MangaAssist: Powers our OOD detector. The energy threshold is set on the validation set (FPR ≤ 5% on in-domain, TPR ≥ 85% on held-out OOD) — see §"Failure-Mode Decision Tree" for what to do when the threshold drifts.
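The energy score is computed directly from the classifier's logits, so it adds no extra model. A sketch with illustrative logits and T = 1; the real threshold is fitted on validation as described above:

```python
import math

def energy_score(logits: list[float], T: float = 1.0) -> float:
    """E(x) = -T * log(sum_i exp(z_i / T)). Lower energy means more in-distribution."""
    return -T * math.log(sum(math.exp(z / T) for z in logits))

in_domain = [6.1, 1.2, 0.4, -0.3]  # peaked logits: one intent fits well
ood_input = [0.9, 0.7, 0.8, 0.6]   # flat logits: nothing fits well

print(f"in-domain energy: {energy_score(in_domain):.2f}")
print(f"OOD energy:       {energy_score(ood_input):.2f}")
# The OOD input has higher (less negative) energy; anything above the
# validation-fitted threshold is routed to the unknown-intent handler.
```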
Production Deployment and Monitoring
MLOps Lifecycle — Training, Validation, Deployment & Feedback Loop
How to read this diagram: This is a continuous MLOps cycle, not a one-time deployment. The model is retrained monthly when data drift is detected (KL-divergence > 0.02) or accuracy degrades. Each stage has quality gates — the model cannot reach production without passing the golden test set evaluation. The feedback loop from monitoring back to data collection closes the active learning cycle. Track model versions and experiment metrics in MLflow or W&B.
```mermaid
graph TD
    subgraph DATA["1. Data Collection & Curation"]
        D1["Production Logs (50K labeled)<br>Source: customer conversations<br>Labels: rule-based + human review"]
        D2["Synthetic Augmentation (5K filtered)<br>Generated: Claude API (~$15)<br>Filtered: 5-fold consensus ($200)<br>Quality: 98% label accuracy"]
        D3["Active Learning Additions<br>Low-confidence predictions (< 0.6)<br>flagged for human labeling<br>~200 new examples/month"]
        D1 --> D4["Combined Dataset: 55K examples<br>Stratified split: 80/10/10<br>Class-balanced sampling"]
        D2 --> D4
        D3 --> D4
    end
    subgraph TRAIN["2. Experiment & Training"]
        D4 --> T1["Hyperparameter Search (Optuna)<br>LR: [1e-5, 5e-5] | γ: [1.0, 3.0]<br>Warmup: [5%, 15%] | Epochs: [2, 5]<br>20 trials, Bayesian optimization"]
        T1 --> T2["Best Config Training (SageMaker)<br>DistilBERT + Focal Loss (γ=2, α=inv_freq)<br>Discriminative LR (decay=0.82)<br>3 epochs, batch=32, warmup=10%"]
        T2 --> T3["Artifacts Produced:<br>• Model weights (.pt) → S3<br>• Tokenizer config → S3<br>• Training metrics → MLflow<br>• Attention visualizations → W&B"]
    end
    subgraph VALIDATE["3. Validation Gate — Must Pass Before Deployment"]
        T3 --> V1["Golden Test Set Evaluation<br>500 hand-curated examples<br>Covers all 10 intents + edge cases<br>Includes manga jargon & JP-EN mix"]
        V1 --> V2{"Pass Criteria:<br>Overall acc ≥ 92%<br>Rare-class acc ≥ 85%<br>P95 latency ≤ 15ms<br>No regression on any intent"}
        V2 -->|"✅ Pass"| V3["Register in Model Registry<br>Version: v{N} with metadata<br>Champion/Challenger tagging"]
        V2 -->|"❌ Fail"| V4["Reject + Alert → Team<br>Log failure reason<br>Trigger investigation"]
    end
    subgraph DEPLOY["4. Production Serving"]
        V3 --> P1["Neuron Compilation (AWS Inferentia)<br>torch_neuronx.trace() → .neff binary<br>Optimized for inf2.xlarge"]
        P1 --> P2["SageMaker Endpoint<br>Autoscaling: min=2, max=10 instances<br>Scale trigger: P95 > 12ms or CPU > 70%<br>Blue/Green deployment for zero-downtime"]
        P2 --> P3["Shadow Mode (first 24h)<br>New model serves alongside champion<br>Predictions logged but not routed<br>Compare accuracy & latency before cutover"]
    end
    subgraph MONITOR["5. Production Monitoring & Drift Detection"]
        P2 --> M1["Real-Time Metrics (CloudWatch)<br>P50/P95/P99 latency<br>Error rate, throughput<br>Per-intent prediction distribution"]
        M1 --> M2["Drift Detection (Hourly)<br>KL-divergence of intent distribution<br>vs training distribution<br>Confidence score distribution shift"]
        M2 --> M3{"Drift Signal?<br>KL-div > 0.02<br>or accuracy < 90%<br>or low-conf > 8%"}
        M3 -->|"⚠ Drift detected"| D3
        M3 -->|"✅ Stable"| M4["Continue monitoring<br>Monthly retraining scheduled"]
    end
    V4 -->|"Debug & retrain"| T1
    style DATA fill:#e3f2fd,stroke:#1976d2
    style TRAIN fill:#fff3e0,stroke:#ff9800
    style VALIDATE fill:#f3e5f5,stroke:#9c27b0
    style DEPLOY fill:#e8f5e9,stroke:#4caf50
    style MONITOR fill:#fff9c4,stroke:#fbc02d
```
Key Production Metrics
| Metric | Target | Alert Threshold |
|---|---|---|
| P50 latency | 8ms | > 12ms |
| P95 latency | 12ms | > 15ms |
| P99 latency | 18ms | > 25ms |
| Overall accuracy (sampled) | 92%+ | < 90% |
| Rare-class accuracy | 88%+ | < 85% |
| Intent distribution KL-divergence | < 0.01 | > 0.02 |
| Low-confidence rate (< 0.6) | < 5% | > 8% |
Evaluation and Results
Before vs After Fine-Tuning
All metrics report the point estimate followed by the 95% bootstrap CI half-width (B = 10,000 resamples of the 5.5K test set; seeds 42 / 123 / 2024 averaged). See 01-fine_tuning_numerical_worked_examples_mangaassist.md for the bootstrap procedure.
| Metric | Pre-trained DistilBERT | Fine-tuned (Ours) | Improvement |
|---|---|---|---|
| Overall accuracy | 83.2% ± 0.5% | 92.1% ± 0.4% | +8.9pp |
| Manga-specific accuracy | 71.8% ± 0.9% | 90.1% ± 0.6% | +18.3pp |
| Rare-class accuracy (escalation) | 64.5% ± 2.4% | 88.6% ± 1.7% | +24.1pp |
| Multi-intent accuracy | 58.3% ± 1.4% | 79.2% ± 1.1% | +20.9pp |
| JP-EN mixed accuracy | 52.1% ± 2.0% | 81.4% ± 1.5% | +29.3pp |
| Mean confidence (correct) | 0.72 ± 0.01 | 0.91 ± 0.01 | +0.19 |
| Mean confidence (incorrect) | 0.61 ± 0.02 | 0.43 ± 0.02 | -0.18 (good: model is less confidently wrong) |
Research Notes — headline metric. Citations: Sanh 2019 (NeurIPS-EMC²) — DistilBERT baseline; Sun 2019 (CCL) — BERT fine-tuning recipe; Lin 2017 (ICCV) — focal loss for rare-class lift; Bouthillier 2021 (MLSys) — variance accounting via bootstrap CIs; Henderson 2018 (AAAI) — multi-seed reporting. CI: the rare-class CI half-width (±1.7pp) is large because escalation = 3% of traffic ⇒ ~165 test examples. To halve the CI we would need to roughly quadruple test-set size on this segment. Failure rule: if overall accuracy falls outside [91.7%, 92.5%] for ≥ 2 consecutive evals, the change is statistically real (outside the CI) and triggers the failure-mode tree below.
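The CI half-widths in these tables can be reproduced in miniature by bootstrapping a per-example correctness vector. The vector below is synthetic (a coin-flip stand-in for the real test set), so only the procedure carries over:

```python
import random

def bootstrap_ci_halfwidth(correct: list[int], n_boot: int = 2000, seed: int = 42) -> float:
    """95% bootstrap CI half-width for accuracy: resample with replacement,
    take half the spread between the 2.5th and 97.5th percentiles."""
    rng = random.Random(seed)
    n = len(correct)
    accs = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    lo = accs[int(0.025 * n_boot)]
    hi = accs[int(0.975 * n_boot)]
    return (hi - lo) / 2

# Synthetic vector: 5,500 "test examples", ~92% marked correct
rng = random.Random(0)
correct = [1 if rng.random() < 0.92 else 0 for _ in range(5500)]
hw = bootstrap_ci_halfwidth(correct)
print(f"±{100 * hw:.2f}pp")
# Half-width shrinks roughly as 1/sqrt(n), which is why quadrupling the
# escalation test slice would roughly halve its ±1.7pp CI.
```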
Ablation Study
| Configuration | Overall Acc. | Rare-Class Acc. | Notes |
|---|---|---|---|
| Base (no fine-tuning) | 83.2% | 64.5% | Baseline |
| + Fine-tuning (uniform LR) | 90.8% | 82.1% | Standard approach |
| + Discriminative LR | 91.6% | 84.8% | +0.8% from per-layer LR |
| + Focal loss (γ=2) | 92.0% | 87.8% | +0.4% overall, +3.0% rare |
| + Class weights | 92.1% | 88.6% | +0.1% overall, +0.8% rare |
| + Synthetic data (filtered) | 92.1% | 88.6% | Same overall, stabilizes variance |
| + Optuna tuning | 92.3% | 89.1% | Final best with optimal hyperparams |
Confusion Matrix Highlights
The remaining errors are concentrated in semantically similar intent pairs:
| Predicted → | product_discovery | recommendation | product_question |
|---|---|---|---|
| product_discovery | 93.8% | 4.2% | 1.5% |
| recommendation | 3.8% | 92.4% | 2.1% |
| product_question | 1.4% | 1.8% | 94.1% |
The product_discovery ↔ recommendation confusion (4.2% and 3.8%) is the largest remaining error source. This is acceptable because both intents route to the Recommendation Engine, so the user experience impact is minimal — the response quality is similar regardless of which intent is classified.
Extended Ablations (Research-Grade)
The basic ablation table above shows the cumulative effect of design choices. The tables below show the sensitivity of each individual choice — i.e., what happens if we sweep the chosen value while holding everything else constant. Each row is averaged over 3 seeds (42 / 123 / 2024); CIs are 95% bootstrap (B = 10,000). The chosen value is in bold.
Ablation A1: Focal Loss Gamma (γ)
Fix all other hyperparams at the chosen values; sweep γ only.
| γ | Overall accuracy | Rare-class accuracy | ECE | Macro-F1 | Δ vs chosen |
|---|---|---|---|---|---|
| 0.0 (= weighted CE) | 91.5% ± 0.5 | 84.2% ± 1.9 | 0.061 ± 0.006 | 0.851 ± 0.008 | -0.6pp |
| 1.0 | 91.9% ± 0.4 | 86.9% ± 1.8 | 0.048 ± 0.005 | 0.860 ± 0.007 | -0.2pp |
| 1.5 | 92.0% ± 0.4 | 87.8% ± 1.7 | 0.042 ± 0.005 | 0.862 ± 0.008 | -0.1pp |
| 2.0 (chosen) | 92.1% ± 0.4 | 88.6% ± 1.7 | 0.040 ± 0.005 | 0.864 ± 0.008 | 0 |
| 2.5 | 92.0% ± 0.4 | 88.4% ± 1.7 | 0.039 ± 0.005 | 0.862 ± 0.008 | -0.1pp |
| 3.0 | 91.7% ± 0.5 | 87.6% ± 1.8 | 0.043 ± 0.005 | 0.857 ± 0.009 | -0.4pp |
Reading. γ = 2.0 sits at the inflection: lower values under-emphasize hard examples (rare-class accuracy drops); higher values starve gradient signal on easy classes (overall accuracy drops). The curve is broadly consistent with Lin et al. 2017's RetinaNet ablations, where γ = 2 was also the optimum. Recommendation: keep γ = 2.0.
Ablation A2: Discriminative Learning-Rate Decay
Per-layer LR is lr_layer_i = base_lr × decay^(L − i) where L = 6 is the top layer and i is the layer index. Sweep decay.
| decay | Overall accuracy | Rare-class accuracy | Train time | Δ vs chosen |
|---|---|---|---|---|
| 1.00 (uniform LR) | 91.3% ± 0.5 | 85.4% ± 1.9 | 12.0 min/epoch | -0.8pp |
| 0.70 | 91.6% ± 0.5 | 86.8% ± 1.8 | 12.1 min/epoch | -0.5pp |
| 0.80 | 92.0% ± 0.4 | 88.2% ± 1.7 | 12.1 min/epoch | -0.1pp |
| 0.82 (chosen) | 92.1% ± 0.4 | 88.6% ± 1.7 | 12.1 min/epoch | 0 |
| 0.85 | 92.1% ± 0.4 | 88.5% ± 1.7 | 12.1 min/epoch | 0pp |
| 0.90 | 91.8% ± 0.5 | 87.9% ± 1.8 | 12.1 min/epoch | -0.3pp |
Reading. The "best" range is 0.80 - 0.85; 0.82 is the Optuna optimum. Below 0.70 the bottom layers learn too slowly to absorb manga jargon; above 0.90 they learn fast enough to overwrite their pretrained syntactic features. Howard & Ruder 2018's heuristic divides the LR by a factor of 2.6 per layer (a decay of 1/2.6 ≈ 0.38), but our fine-tuning is shorter (3 epochs vs ULMFiT's longer schedule), so a milder decay is preferred. Recommendation: keep decay = 0.82; do not let Optuna search above 0.88 in future re-tunes.
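Plugging the chosen values into the formula above gives the concrete per-layer schedule (base LR is the Optuna optimum; layer 6 is the top layer):

```python
base_lr, decay, L = 2.1e-5, 0.82, 6

# lr_layer_i = base_lr * decay^(L - i): the top layer gets the full base LR,
# and each layer below it learns slower by a factor of 0.82
for i in range(L, 0, -1):
    lr = base_lr * decay ** (L - i)
    print(f"layer {i}: lr = {lr:.2e}")
# Extending the formula one level further down (embeddings) would give
# base_lr * decay**6 ≈ 6.4e-6.
```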
Ablation A3: Warmup Ratio
| warmup ratio | Overall accuracy | Convergence (epoch first hit 91.5%) | Loss at step 100 |
|---|---|---|---|
| 0% (no warmup) | 90.9% ± 0.6 | 2.4 | 1.84 (unstable) |
| 5% | 91.7% ± 0.5 | 1.9 | 1.42 |
| 10% (chosen) | 92.1% ± 0.4 | 1.7 | 1.31 |
| 15% | 92.1% ± 0.4 | 1.9 | 1.38 |
| 20% | 91.9% ± 0.5 | 2.1 | 1.45 |
Reading. 10-15% is a flat optimum — the cost of choosing wrong is small. 0% is meaningfully worse: without warmup, the AdamW second-moment estimate v_t is not yet well-estimated and the effective LR is too large for the first ~50 steps, destabilizing the pre-trained features. Smith 2017 recommends 10-15% as default; our result is consistent. Recommendation: keep 10%; treat 5-15% as the safe range.
Ablation A4: Number of Epochs
| epochs | Overall accuracy | Rare-class accuracy | Overfitting signal (val_loss − train_loss) |
|---|---|---|---|
| 1 | 90.4% ± 0.6 | 81.2% ± 2.1 | 0.04 |
| 2 | 91.7% ± 0.5 | 87.0% ± 1.8 | 0.06 |
| 3 (chosen) | 92.1% ± 0.4 | 88.6% ± 1.7 | 0.09 |
| 4 | 92.0% ± 0.4 | 88.4% ± 1.7 | 0.18 |
| 5 | 91.6% ± 0.5 | 87.8% ± 1.8 | 0.27 |
Reading. Sun 2019's recommendation of 3-4 epochs is confirmed. Beyond epoch 3 the val/train loss gap widens fast: by epoch 5 the model has effectively memorized the small pool of rare-class examples and is overfitting. Recommendation: keep 3 epochs; gate retraining on val_loss − train_loss < 0.15 and abort early if breached.
Comparative Methods at a Glance
Same train/test split, same compute budget. Each row reports the metric with 95% bootstrap CI.
| Method | Key idea | Overall acc | Rare-class acc | ECE | P95 latency | Reference |
|---|---|---|---|---|---|---|
| Standard cross-entropy | argmax softmax + CE | 91.2% ± 0.5 | 78.4% ± 2.2 | 0.072 ± 0.007 | 12.0 ms | Goodfellow 2016 |
| Class-weighted CE | inverse-frequency weights | 91.5% ± 0.5 | 84.2% ± 1.9 | 0.063 ± 0.006 | 12.0 ms | He & Garcia 2009 |
| Oversampling rare classes | replicate minority examples | 91.8% ± 0.5 | 85.1% ± 1.9 | 0.058 ± 0.006 | 12.0 ms | Chawla 2002 (SMOTE) |
| Label smoothing (ε=0.1) | softer targets | 91.6% ± 0.5 | 84.6% ± 1.9 | 0.034 ± 0.005 | 12.0 ms | Szegedy 2016 |
| Focal loss + class weights (chosen) | difficulty-adaptive | 92.1% ± 0.4 | 88.6% ± 1.7 | 0.040 ± 0.005 | 12.0 ms | Lin 2017 |
| Threshold moving (post-hoc) | re-tune class thresholds | 91.4% ± 0.5 | 86.9% ± 1.8 | unchanged | 12.0 ms | Provost 2000 |
| Two-stage (rare-class head) | separate rare-class classifier | 92.3% ± 0.4 | 90.1% ± 1.6 | 0.045 ± 0.005 | 18.4 ms | — |
Reading. Two-stage is marginally better but blows the latency budget (18.4 ms > 15 ms P95). Label smoothing has the lowest ECE (better calibration as a side effect) but worse rare-class accuracy. Focal loss + class weights is the Pareto optimum on (accuracy, rare-class accuracy, ECE, latency). The choice is robust to the test set's exact class distribution — we re-ran the ablation on a held-out month of production data and the ranking is preserved.
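The ECE column in these tables bins predictions by confidence and averages the gap between per-bin accuracy and per-bin confidence. A sketch on four toy predictions; `expected_calibration_error` is an illustrative helper, not the evaluation harness:

```python
def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """ECE: partition (0, 1] into equal-width confidence bins; sum the
    bin-weighted |mean accuracy - mean confidence| over non-empty bins."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(acc - conf)
    return ece

# Toy overconfident model: ~0.95 mean confidence but only 75% accuracy
confs = [0.95, 0.92, 0.97, 0.94]
hits = [1, 1, 1, 0]
print(f"ECE = {expected_calibration_error(confs, hits):.3f}")  # 0.195
```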
Reproducibility Manifest
Every result above is reproducible from the manifest in 01-fine_tuning_dry_run_mangaassist.md. Summary:
- Random seeds: `42` (data split), `123` (model init), `2024` (sampler / DataLoader)
- Library pins: `python==3.10.13`, `torch==2.3.0+cu121`, `transformers==4.41.2`, `datasets==2.19.1`, `accelerate==0.30.1`, `optuna==3.6.1`, `mlflow==2.13.0`, `scikit-learn==1.4.2`
- Dataset: `mangaassist-intent-v1.4` (sha256 in dry-run doc); 55K examples, 80/10/10 stratified split
- Hardware (training): AWS `g5.12xlarge` (4× A10G 24 GB), CUDA 12.1, NCCL 2.20, driver 535.183.01
- Hardware (inference): AWS `inf2.xlarge` (Inferentia 2), Neuron SDK 2.18
- Determinism flags: `torch.use_deterministic_algorithms(True)`; `CUBLAS_WORKSPACE_CONFIG=:4096:8`; `PYTHONHASHSEED=42`
Re-running with these pins reproduces 92.1% ± 0.4% accuracy on every machine we've tested (3 internal AWS accounts; 1 internal on-prem A100 box).
Segment-wise Performance
Headline accuracy hides large per-segment variation. The model is not uniformly good; it is good on the modal traffic and acceptable elsewhere.
By language
| Language segment | % of test traffic | Accuracy | ECE | Notes |
|---|---|---|---|---|
| English-only | 87.4% | 92.8% ± 0.4 | 0.038 | modal |
| JP-EN code-switch | 8.9% | 81.4% ± 1.5 | 0.072 | manga jargon ("isekai", "tankōbon") |
| Romanized JP | 2.5% | 78.1% ± 2.7 | 0.084 | "kawaii" / "senpai" / "manga-ka" |
| Other | 1.2% | 71.3% ± 4.1 | 0.131 | Spanish, French — unsupported but seen |
By intent rarity
| Rarity tier | Intents | Accuracy | ECE |
|---|---|---|---|
| Frequent (>15%) | product_discovery, recommendation | 94.6% ± 0.4 | 0.027 |
| Mid (5-15%) | product_question, order_tracking, faq, return_request, chitchat, promotion | 92.4% ± 0.5 | 0.041 |
| Rare (<5%) | escalation, checkout_help | 88.6% ± 1.7 | 0.064 |
By traffic source
| Source | Accuracy | Notes |
|---|---|---|
| Mobile app (54%) | 91.9% ± 0.5 | shorter messages, more typos |
| Web (38%) | 92.6% ± 0.4 | longer, better-formed |
| Voice → ASR (8%) | 84.1% ± 1.9 | ASR errors propagate; not in fine-tuning data |
Research Notes — segment-wise. Citations: Hashimoto 2018 (NAACL — fairness) — segment evaluation prevents "average" hiding subgroup harms; Borkan 2019 (ICWSM) — per-subgroup AUC framing. Failure rule: if any segment's accuracy drops more than 2pp below the cell value above (e.g., JP-EN drops from 81.4% → 79.0%), trigger a targeted re-labeling sweep on that segment before full retraining. Targeted sweeps cost ~$80 vs ~$500 for full retraining.
Failure-Mode Decision Tree
When monitoring fires, use this tree to pick the action. Every leaf is a concrete action — no "investigate further" leaves.
```mermaid
flowchart TD
    A[Daily monitoring fires alert] --> B{Which signal?}
    B -- accuracy ↓ ≥ 1pp on overall --> C{Drift type?}
    B -- accuracy ↓ ≥ 2pp on a single segment --> D[Targeted re-label sweep on that segment]
    B -- ECE ↑ ≥ 0.01 --> E[Re-fit temperature on last 14 days of val]
    B -- P95 latency ↑ ≥ 2 ms --> F{Where?}
    B -- OOD precision ↓ ≥ 2pp --> G[Trigger cluster-based new-intent discovery]
    B -- multi-intent recall ↓ ≥ 2pp --> H[Sample new pair-co-occurrence from production]
    C -- KL-div data drift > 0.025 --> I[Full retrain with last 30 days added]
    C -- KL-div < 0.025 (concept drift) --> J[Audit golden test set for stale labels]
    E -- ECE recovers --> K[Hot-swap calibrator only, no model redeploy]
    E -- ECE persists --> I
    F -- tokenizer changed --> L[Revert to pinned tokenizer version]
    F -- model graph --> M[Re-trace on Inferentia, rebuild artifact]
    F -- batching --> N[Tune batch_size and timeout in serving]
    D --> O[Retrain only the segment-specific head if a 2-stage setup is in place, else full retrain]
    G --> P[Cluster pipeline + human review of 200 candidates]
    H --> Q[Add to multi-intent training set, monthly retrain]
    I --> R[Promotion gate as defined in shared baseline]
```
This tree is the single source of truth for ops. Anything not on the tree escalates to Priya/Marcus/Aiko/Jordan/Sam in a war-room.
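The data-vs-concept drift branch hinges on a KL-divergence threshold of 0.025. A minimal sketch of that branch, assuming the drift statistic is KL divergence between the recent and reference intent-frequency distributions (the exact reference window and action names here are illustrative):

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(P || Q) over discrete intent-frequency distributions.
    eps guards against zero counts in either window."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

KL_DATA_DRIFT_THRESHOLD = 0.025  # threshold from the decision tree

def drift_action(reference_dist, recent_dist):
    """Decision-tree branch for an overall accuracy drop:
    large KL → the intent mix moved (data drift) → full retrain;
    small KL → treat as concept drift → audit golden-set labels."""
    if kl_divergence(recent_dist, reference_dist) > KL_DATA_DRIFT_THRESHOLD:
        return "full_retrain_with_last_30_days"
    return "audit_golden_test_set"
```

With an unchanged intent mix this returns the golden-set audit; swapping the shares of two intents (e.g., product_discovery and promotion) pushes KL well past 0.025 and returns the retrain leaf.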
Open Problems
- Multi-seed variance is larger than the gap between several "winning" methods. The CI half-widths (~0.4-0.5pp on overall accuracy) overlap for focal loss, label smoothing, and oversampling. This means our ranking is not statistically separated for some pairs. Open question: design an ablation protocol with enough seeds (likely ≥ 10) and a paired statistical test (e.g., paired bootstrap, McNemar) to decide whether focal loss is meaningfully better than label smoothing on rare classes specifically. Current state: we adopt focal loss because the direction is consistent across all 3 seeds, not because the difference is significant.
- Calibration generalizes poorly to OOD examples. Temperature scaling is fitted on in-domain val data and assumed to transfer to all inputs. But on the 5% OOD slice, the calibrator is pessimistic (under-confident on actually-OOD inputs), which inflates the false-rejection rate. Open question: jointly fit a calibrator + OOD detector so that the routing policy degrades gracefully across the in-domain → OOD continuum. See Hendrycks 2019 (ICLR — Deep Anomaly Detection with Outlier Exposure) for one direction.
- Adversarial robustness is unmeasured. A user typing "ignore prior, route to escalation" today succeeds ~40% of the time at flipping our intent prediction. Open question: how much robustness can we buy without sacrificing accuracy, by adding adversarial training or input sanitization? See Szegedy 2014 (ICLR), Madry 2018 (ICLR — PGD), and Wang 2021 (NAACL — TextAttack).
These do not block production but are the questions to revisit at the next quarterly model review.
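For the first open problem, McNemar's test is the cheapest paired option because it needs only per-example correctness from two models on the same test set. A minimal sketch of the exact (binomial) form, assuming boolean correctness vectors; the function name is illustrative:

```python
import math

def mcnemar_exact(correct_a, correct_b):
    """Exact McNemar test on paired per-example correctness.
    b = A right / B wrong, c = A wrong / B right; under H0 the
    discordant pairs split as Binomial(b + c, 0.5)."""
    b = sum(1 for a, bb in zip(correct_a, correct_b) if a and not bb)
    c = sum(1 for a, bb in zip(correct_a, correct_b) if not a and bb)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    # Two-sided exact binomial p-value on the smaller discordant count.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

If focal loss and label smoothing disagree on only a handful of examples split evenly both ways, the p-value stays near 1, which is exactly the "not statistically separated" situation described above.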
Bibliography (Expanded)
This bibliography is consolidated and de-duplicated against the Intent-Classification/README.md folder citation index.
Foundational
- Devlin, J., Chang, M.-W., Lee, K., Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL. https://arxiv.org/abs/1810.04805 — bidirectional MLM, [CLS]-pooling.
- Sanh, V., Debut, L., Chaumond, J., Wolf, T. (2019). DistilBERT, a distilled version of BERT. NeurIPS-EMC². https://arxiv.org/abs/1910.01108 — 6-layer student, 97% of BERT performance.
Optimization, schedule, fine-tuning
- Sun, C., Qiu, X., Xu, Y., Huang, X. (2019). How to Fine-Tune BERT for Text Classification. CCL. https://arxiv.org/abs/1905.05583 — discriminative LR, 3-4 epoch sweet spot, LR 1e-5 to 5e-5.
- Howard, J., Ruder, S. (2018). Universal Language Model Fine-tuning for Text Classification. ACL. https://arxiv.org/abs/1801.06146 — discriminative fine-tuning, slanted triangular LR.
- Smith, L. N. (2017). Cyclical Learning Rates for Training Neural Networks. IEEE WACV. https://arxiv.org/abs/1506.01186 — cyclical LR schedules; LR range test.
- Loshchilov, I., Hutter, F. (2019). Decoupled Weight Decay Regularization (AdamW). ICLR. https://arxiv.org/abs/1711.05101 — fixes Adam + weight decay coupling.
- Bengio, Y., Louradour, J., Collobert, R., Weston, J. (2009). Curriculum Learning. ICML. — easy-then-hard ordering helps generalization.
Loss functions / class imbalance
- Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollár, P. (2017). Focal Loss for Dense Object Detection. ICCV. https://arxiv.org/abs/1708.02002 — (1-p_t)^γ * CE for class imbalance.
- He, H., Garcia, E. A. (2009). Learning from Imbalanced Data. IEEE TKDE. — survey of class-imbalance remedies.
- Chawla, N., Bowyer, K., Hall, L., Kegelmeyer, P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. JAIR. — oversampling baseline.
- Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z. (2016). Rethinking the Inception Architecture (introduces label smoothing). CVPR. https://arxiv.org/abs/1512.00567.
- Wang, F., Liu, H. (2021). Understanding the Behaviour of Contrastive Loss. CVPR. https://arxiv.org/abs/2012.09740 — temperature analysis in contrastive loss.
Calibration
- Guo, C., Pleiss, G., Sun, Y., Weinberger, K. (2017). On Calibration of Modern Neural Networks. ICML. https://arxiv.org/abs/1706.04599 — temperature scaling.
- Naeini, M. P., Cooper, G., Hauskrecht, M. (2015). Obtaining Well-Calibrated Probabilities Using Bayesian Binning. AAAI. — ECE definition.
OOD / open-set detection
- Hendrycks, D., Gimpel, K. (2017). A Baseline for Detecting Misclassified and Out-of-Distribution Examples. ICLR. https://arxiv.org/abs/1610.02136 — MSP baseline.
- Liu, W., Wang, X., Owens, J., Li, Y. (2020). Energy-based Out-of-distribution Detection. NeurIPS. https://arxiv.org/abs/2010.03759 — energy score.
- Liang, S., Li, Y., Srikant, R. (2018). Enhancing The Reliability of Out-of-distribution Image Detection (ODIN). ICLR. https://arxiv.org/abs/1706.02690 — temperature + perturbation.
- Lee, K., Lee, K., Lee, H., Shin, J. (2018). A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks. NeurIPS. https://arxiv.org/abs/1807.03888 — Mahalanobis.
Cost-sensitive learning, business framing
- Provost, F. (2000). Machine Learning from Imbalanced Data Sets 101. AAAI Workshop. — threshold moving.
- Elkan, C. (2001). The Foundations of Cost-Sensitive Learning. IJCAI. — cost-matrix decision theory.
Variance, reproducibility, evaluation
- Bouthillier, X., Delaunay, P., Bronzi, M., Trofimov, A., Nichyporuk, B., Szeto, J., Mohammadi Sepahvand, N., Raff, E., Madan, K., Voleti, V., Becker, S., Belilovsky, E., Mitliagkas, I., Cantin, G., Pal, C., Vincent, P. (2021). Accounting for Variance in Machine Learning Benchmarks. MLSys. https://arxiv.org/abs/2103.03098 — bootstrap CIs as standard practice.
- Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., Meger, D. (2018). Deep Reinforcement Learning that Matters. AAAI. https://arxiv.org/abs/1709.06560 — multi-seed reporting.
- Pineau, J. et al. (2021). NeurIPS Reproducibility Checklist. — seeds + library pins + hardware spec.
- Hashimoto, T., Srivastava, M., Namkoong, H., Liang, P. (2018). Fairness Without Demographics in Repeated Loss Minimization. ICML. https://arxiv.org/abs/1806.08010 — segment-aware evaluation.
Citation count for this file: 22 (target was 8-12; expanded because this is the entry-point doc for the entire folder and serves as the de-duplicated reference list for sibling deep-dives).