# Fine-Tuning Foundational Models for MangaAssist
## Overview

This folder contains 18 deep-dive scenario documents covering fine-tuning, model customization, post-fine-tuning inspection techniques, and capstone intuition synthesis for the MangaAssist chatbot. Each document is written as a group discussion among five engineers and includes mathematical derivations with geometric intuition, layer-by-layer mermaid diagrams or inspection workflows showing what happens inside models during training, production-grade code, and references to foundational research papers.
## MangaAssist Scenario Companion Set

The newer MangaAssist-grounded scenario documents form a companion set that sits beside the original deep dives and covers the non-intent fine-tuning topics: embeddings, reranking, RAFT, LoRA/QLoRA, prompt tuning, QAT, distillation, MoE, continual learning, few-shot learning, multi-task learning, sentiment, DPO/RLHF, MLOps, data curation, interpretability, and capstone decision-making.
## Reading Order

### Tier 1: Start Here (Core Fine-Tuning for MangaAssist Production)

These cover the techniques already deployed or planned for MangaAssist.

### Tier 2: Advanced Model Customization

These extend MangaAssist with more sophisticated fine-tuning techniques.

### Tier 3: Specialized Techniques

For specific production challenges:
| # | Document | What You Learn | Difficulty |
|---|----------|----------------|------------|
| 06 | Continual Learning and Catastrophic Forgetting | EWC, rehearsal buffers, drift detection, monthly retraining | Advanced |
| 07 | Few-Shot Learning and Rapid Adaptation | Prototypical networks, MAML, SetFit, adding new intents with <100 examples | Advanced |
| 12 | Quantization-Aware Training | INT8/INT4 quantization, GPTQ, AWQ, SmoothQuant | Advanced |
| 13 | Multi-Task Learning | Single model for intent + sentiment + entities, gradient surgery | Expert |
| 15 | Mixture of Experts Routing | MoE with genre-specific experts, gating networks, load balancing | Expert |
| 17 | Visualization and Interpretability After Fine-Tuning | BertViz, TransformerLens, LIT, Ecco, Captum, and Netron for post-fine-tuning analysis | Advanced |
### Tier 4: Infrastructure and Data

### Tier 5: Capstone Synthesis
## Group Discussion Personas

Every document features debates among the same five engineers, modeled on an Amazon-style cross-functional team:
| Persona | Role | Focus Area | Typical Question |
|---------|------|------------|-------------------|
| Priya | Senior ML Engineer | Model architecture, training code, loss functions, math derivations | "What does the gradient look like at layer 4 when we use focal loss?" |
| Marcus | Staff Platform Architect | Infrastructure, deployment, scaling, latency budgets | "Can we serve 10 concurrent LoRA adapters within our P95 latency SLA?" |
| Aiko | Data Scientist | Data quality, evaluation metrics, experiment design, statistical rigor | "The ablation study shows diminishing returns above rank 12." |
| Jordan | MLOps Engineer | CI/CD, monitoring, drift detection, reproducibility, model registry | "How do we detect if the retrained model regresses on old intents?" |
| Sam | Cost-Aware PM | ROI, cost-per-quality-point, build vs buy, business impact | "That's a $170K/year difference for 2.7% accuracy. What's the CPQ?" |
### Decision Point: Should we use LoRA rank 8 or 16?
**Priya (ML Engineer):** Rank 16 gives us more expressive capacity to capture
manga-specific patterns. The SVD analysis of weight updates shows the top 16
singular values capture 94% of the variance, vs 87% for rank 8...
**Marcus (Architect):** But rank 16 doubles adapter memory from 4MB to 8MB per
adapter. With 10 concurrent adapters for different intents, that's 80MB just
for adapters...
**Aiko (Data Scientist):** The ablation study on our validation set shows:
rank 4 = 89.1%, rank 8 = 91.8%, rank 12 = 92.3%, rank 16 = 92.5%.
Diminishing returns above rank 8...
**Sam (PM):** The cost difference is $240/month (rank 8) vs $480/month (rank 16)
for GPU memory. For 0.7% accuracy gain, that's $342 per quality point...
**Jordan (MLOps):** More parameters means longer CI validation cycles. Rank 8
validates in 12 minutes, rank 16 takes 22 minutes...
> **Resolution:** Rank 8 chosen. The 91.8% accuracy meets our 90% threshold,
> the cost-per-quality-point for rank 16 ($342/point) exceeds our $300
> threshold, and faster CI cycles matter for weekly adapter updates.
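
Priya's 94% vs. 87% figures come from an SVD of the fine-tuning weight update. Here is a minimal sketch of that check, assuming PyTorch; the synthetic `delta_w` below is a stand-in, since in practice you would diff a fine-tuned projection matrix against its pretrained counterpart:

```python
import torch

def rank_energy(delta_w: torch.Tensor, ranks=(4, 8, 12, 16)) -> dict:
    """Fraction of the update's squared Frobenius norm ("variance")
    captured by the top-r singular values, per candidate LoRA rank."""
    s = torch.linalg.svdvals(delta_w)                   # singular values, descending
    cumulative = s.pow(2).cumsum(dim=0) / s.pow(2).sum()
    return {r: round(cumulative[r - 1].item(), 3) for r in ranks}

# Synthetic stand-in with a decaying spectrum; real usage:
# delta_w = W_finetuned - W_pretrained for one attention projection.
torch.manual_seed(0)
u, _ = torch.linalg.qr(torch.randn(768, 768))
v, _ = torch.linalg.qr(torch.randn(768, 768))
delta_w = u @ torch.diag(torch.logspace(0, -2, 768)) @ v.T
print(rank_energy(delta_w))  # {4: ..., 8: ..., 12: ..., 16: ...}
```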
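
Sam's cost-per-quality-point is simply the marginal monthly cost divided by the marginal accuracy gain. A quick sketch using only the numbers quoted in the discussion above:

```python
# Rank -> validation accuracy (%) and GPU-memory cost ($/month), as quoted above.
accuracy = {4: 89.1, 8: 91.8, 12: 92.3, 16: 92.5}
gpu_cost = {8: 240, 16: 480}

gain = accuracy[16] - accuracy[8]      # 0.7 accuracy points
extra = gpu_cost[16] - gpu_cost[8]     # $240/month extra for rank 16
cpq = extra / gain                     # cost per quality point
print(f"rank 16 over rank 8: ${cpq:.0f}/point/month")  # ~$343; Sam's $342 truncates the same ratio
```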
## Mathematical Prerequisites

To get the most from these documents, you should be comfortable with the following (a short numeric self-test follows the table):
| Topic | What You Need | Used In |
|-------|---------------|---------|
| Linear Algebra | Matrix multiplication, SVD, eigenvalues, rank | Docs 04, 05, 12, 15 |
| Calculus | Partial derivatives, chain rule, gradients | All docs |
| Probability | Bayes' theorem, KL-divergence, entropy, softmax | Docs 01, 05, 06, 10 |
| Optimization | SGD, Adam, learning rate schedules, loss landscapes | All docs |
| Information Theory | Cross-entropy, mutual information, InfoNCE | Docs 01, 02, 05 |
| Statistics | Hypothesis testing, confidence intervals, Cohen's kappa | Docs 08, 16, 17 |
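
As a quick self-check on the probability and information-theory rows, here is a minimal, dependency-free sketch of softmax, cross-entropy, and KL-divergence; all numbers are arbitrary:

```python
import math

def softmax(logits):
    m = max(logits)                          # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 0.5, -1.0]                    # arbitrary 3-class scores
p = softmax(logits)

ce = -math.log(p[0])                         # cross-entropy vs. a one-hot target (class 0)

q = [0.7, 0.2, 0.1]                          # arbitrary reference distribution
kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))  # KL(q || p)

print(f"softmax={p}\ncross-entropy={ce:.4f}\nKL(q||p)={kl:.4f}")
```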
## Master Research Paper Reading List

### Foundational (Read These First)

| Paper | Year | Key Contribution | Referenced In |
|-------|------|------------------|----------------|
| Attention Is All You Need (Vaswani et al.) | 2017 | Transformer architecture | All docs |
| BERT (Devlin et al.) | 2018 | Bidirectional pre-training | Docs 01, 02, 08 |
| DistilBERT (Sanh et al.) | 2019 | Knowledge distillation for BERT | Docs 01, 05 |
### Fine-Tuning Techniques

| Paper | Year | Key Contribution | Referenced In |
|-------|------|------------------|----------------|
| How to Fine-Tune BERT for Text Classification (Sun et al.) | 2019 | Layer-wise LR, gradual unfreezing | Docs 01, 08 |
| ULMFiT (Howard & Ruder) | 2018 | Discriminative fine-tuning, slanted triangular LR | Doc 08 |
| LoRA (Hu et al.) | 2021 | Low-rank adaptation of large language models | Doc 04 |
| QLoRA (Dettmers et al.) | 2023 | 4-bit quantization + LoRA | Docs 04, 12 |
| Prefix-Tuning (Li & Liang) | 2021 | Learnable prefix key-value pairs | Doc 11 |
| P-Tuning v2 (Liu et al.) | 2021 | Deep prompt tuning at every layer | Doc 11 |
| Prompt Tuning (Lester et al.) | 2021 | Soft prompts scale with model size | Doc 11 |
### Contrastive Learning and Retrieval

| Paper | Year | Key Contribution | Referenced In |
|-------|------|------------------|----------------|
| SimCLR (Chen et al.) | 2020 | Contrastive learning framework | Doc 02 |
| Dense Passage Retrieval (Karpukhin et al.) | 2020 | Dual-encoder for retrieval | Doc 02 |
| Sentence-BERT (Reimers & Gurevych) | 2019 | Sentence embeddings via siamese BERT | Doc 02 |
| ColBERT (Khattab & Zaharia) | 2020 | Late interaction for efficient reranking | Doc 03 |
| MS-MARCO (Bajaj et al.) | 2016 | Large-scale passage ranking dataset | Doc 03 |
### Alignment and Preference Learning

| Paper | Year | Key Contribution | Referenced In |
|-------|------|------------------|----------------|
| InstructGPT (Ouyang et al.) | 2022 | RLHF pipeline for instruction following | Doc 10 |
| DPO (Rafailov et al.) | 2023 | Direct preference optimization without reward model | Doc 10 |
| Constitutional AI (Bai et al.) | 2022 | Self-supervision for alignment | Doc 10 |
| RLAIF (Lee et al.) | 2023 | AI feedback instead of human feedback | Doc 10 |
### Distillation and Compression

| Paper | Year | Key Contribution | Referenced In |
|-------|------|------------------|----------------|
| Distilling Knowledge (Hinton et al.) | 2015 | Temperature-scaled soft labels | Doc 05 |
| TinyBERT (Jiao et al.) | 2019 | Two-stage task-agnostic + task-specific distillation | Doc 05 |
| GPTQ (Frantar et al.) | 2022 | One-shot weight quantization via Hessian | Doc 12 |
| AWQ (Lin et al.) | 2023 | Activation-aware weight quantization | Doc 12 |
| SmoothQuant (Xiao et al.) | 2022 | Migrate quantization difficulty from activations to weights | Doc 12 |
### Continual and Few-Shot Learning

| Paper | Year | Key Contribution | Referenced In |
|-------|------|------------------|----------------|
| EWC (Kirkpatrick et al.) | 2017 | Fisher Information for catastrophic forgetting prevention | Doc 06 |
| Progressive Neural Networks (Rusu et al.) | 2016 | Lateral connections for continual learning | Doc 06 |
| Prototypical Networks (Snell et al.) | 2017 | Distance-based few-shot classification | Doc 07 |
| MAML (Finn et al.) | 2017 | Model-agnostic meta-learning | Doc 07 |
| SetFit (Tunstall et al.) | 2022 | Few-shot fine-tuning via contrastive learning | Doc 07 |
### Multi-Task and Expert Models

| Paper | Year | Key Contribution | Referenced In |
|-------|------|------------------|----------------|
| Multi-Task Uncertainty (Kendall et al.) | 2018 | Homoscedastic uncertainty for task weighting | Doc 13 |
| GradNorm (Chen et al.) | 2018 | Gradient normalization for multi-task | Doc 13 |
| Gradient Surgery (Yu et al.) | 2020 | Project conflicting gradients | Doc 13 |
| Switch Transformers (Fedus et al.) | 2021 | Sparse MoE with simplified routing | Doc 15 |
| Mixtral (Jiang et al.) | 2024 | Production MoE with expert routing | Doc 15 |
### RAG and Retrieval-Augmented Training

| Paper | Year | Key Contribution | Referenced In |
|-------|------|------------------|----------------|
| RAFT (Zhang et al.) | 2024 | Fine-tune LLM to use retrieved context | Doc 14 |
| Self-RAG (Asai et al.) | 2023 | Self-reflective retrieval-augmented generation | Doc 14 |
| REALM (Guu et al.) | 2020 | Retrieval-augmented language model pre-training | Doc 14 |
### Data Quality and Synthesis

| Paper | Year | Key Contribution | Referenced In |
|-------|------|------------------|----------------|
| Confident Learning (Northcutt et al.) | 2021 | Label noise estimation and correction | Doc 16 |
| Self-Instruct (Wang et al.) | 2022 | Synthetic instruction generation | Doc 16 |
| Alpaca (Taori et al.) | 2023 | Instruction tuning with synthetic data | Doc 16 |
### Infrastructure

| Paper | Year | Key Contribution | Referenced In |
|-------|------|------------------|----------------|
| ZeRO (Rajbhandari et al.) | 2019 | Memory-efficient distributed training | Doc 09 |
| Mixed Precision Training (Micikevicius et al.) | 2017 | FP16 training with loss scaling | Doc 09 |
| Focal Loss (Lin et al.) | 2017 | Class imbalance via modulating factor | Docs 01, 08 |
| LambdaRank (Burges et al.) | 2006 | Learning to rank with NDCG gradients | Doc 03 |
## Cross-References to Other Project Folders

| This Folder's Doc | References | For |
|-------------------|------------|-----|
| Doc 01 (Intent Classifier) | 04b-architecture-lld.md LLD-2 | Intent classifier design spec |
| Doc 02 (Embeddings) | 04b-architecture-lld.md LLD-3 | RAG pipeline and vector store |
| Doc 04 (LoRA/QLoRA) | Prompt-Engineering/ | When prompting is enough vs fine-tuning |
| Doc 05 (Distillation) | Model-Inference/ | Inference pipeline constraints |
| Doc 09 (MLOps) | MLflow/ | Experiment tracking integration |
| Doc 14 (RAFT) | Prompt-Engineering/ | RAG grounding and prompt design |
| All docs | model_evaluation_framework_deep_dive.md | 7-dimensional evaluation framework |
| All docs | 10-ai-llm-design.md | Model selection and CPQ framework |
| All docs | Optimization-Tradeoffs-User-Stories/ | Cost-quality-performance trilemma |
## How to Use These Documents
- For interview prep: Start with the README, then read Tier 1 docs, then use the decision-point sections plus Docs 17 and 18 to rehearse architecture, training, interpretability, and strategic tradeoffs.
- For implementation: Read the relevant scenario doc end-to-end, then reference Doc 09 (MLOps) for deployment.
- For understanding the math: Each doc is self-contained. Read the "Mathematical Foundations" section, then trace through the mermaid layer diagrams.
- For the group discussion format: Each decision point shows how a real team would debate the tradeoff. Use these as templates for your own technical discussions, then read Doc 18 for the cross-technique synthesis.