04. LoRA/QLoRA LLM Customization — Adapting Claude 3.5 Sonnet via Parameter-Efficient Methods
Problem Statement and MangaAssist Context
MangaAssist uses Claude 3.5 Sonnet (via Amazon Bedrock) as its primary language model for generating product descriptions, answering complex manga questions, and creating personalized recommendations. While Claude 3.5 Sonnet is exceptional at general language tasks, it lacks deep manga domain knowledge — it sometimes generates inaccurate publication dates, confuses character arcs across series, or produces generic rather than genre-specific recommendations.
Full fine-tuning of a model this size (~175B parameters estimated) is impractical: it requires hundreds of GPUs and costs $100K+ per run. LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) offer an alternative — training only 0.1-1% of the parameters while achieving 90-97% of full fine-tuning quality.
Since Claude 3.5 Sonnet is a managed API (Bedrock), we cannot directly apply LoRA to it. Instead, this document explores LoRA/QLoRA on an open-source alternative (Llama 3 70B) as the self-hosted fallback model, and discusses how the same principles apply conceptually to Bedrock's custom model training API.
Why LoRA Instead of Full Fine-Tuning
| Approach | Trainable Params | GPU Memory | Training Time | Cost per Run |
|---|---|---|---|---|
| Full Fine-Tuning (70B) | 70B | 560GB (8× A100 80GB) | 72 hours | $28,800 |
| LoRA (rank 16) | 84M (0.12%) | 80GB (1× A100) | 4 hours | $16 |
| QLoRA (rank 16, 4-bit) | 84M (0.12%) | 24GB (1× A10G) | 6 hours | $8 |
LoRA reduces cost by 1800×. QLoRA reduces it further by enabling consumer-grade GPUs.
Mathematical Foundations
Low-Rank Decomposition — The Core Idea
In a standard transformer, each weight matrix $\mathbf{W}_0 \in \mathbb{R}^{d \times k}$ is updated during fine-tuning:
$$\mathbf{W} = \mathbf{W}_0 + \Delta\mathbf{W}$$
Full fine-tuning learns all $d \times k$ parameters in $\Delta\mathbf{W}$. LoRA hypothesizes that the task-specific update $\Delta\mathbf{W}$ has low intrinsic rank — it can be decomposed as:
$$\Delta\mathbf{W} = \mathbf{B}\mathbf{A}$$
where $\mathbf{B} \in \mathbb{R}^{d \times r}$ and $\mathbf{A} \in \mathbb{R}^{r \times k}$, with rank $r \ll \min(d, k)$.
Parameter savings:
| Model Dimension | Full Update | LoRA (r=16) | Savings |
|---|---|---|---|
| $\mathbf{W}_Q$: 8192 × 8192 | 67M | 262K | 256× |
| $\mathbf{W}_K$: 8192 × 1024 | 8.4M | 147K | 57× |
| $\mathbf{W}_V$: 8192 × 1024 | 8.4M | 147K | 57× |
| $\mathbf{W}_O$: 1024 × 8192 | 8.4M | 147K | 57× |
| All layers (80 layers) | ~6.7B | ~84M | 80× |
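The per-matrix arithmetic in this table can be reproduced in a few lines (a sketch; `lora_param_counts` is an illustrative helper, not part of any library):

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int]:
    """Return (full_update_params, lora_params) for a d×k weight with rank-r LoRA."""
    full = d * k          # every entry of ΔW
    lora = r * (d + k)    # B is d×r, A is r×k
    return full, lora

# W_Q: 8192 × 8192 at rank 16
full_q, lora_q = lora_param_counts(8192, 8192, 16)
print(full_q, lora_q, round(full_q / lora_q))   # 67108864 262144 256

# W_K / W_V: 8192 × 1024 at rank 16
full_k, lora_k = lora_param_counts(8192, 1024, 16)
print(full_k, lora_k, round(full_k / lora_k))   # 8388608 147456 57
```

The savings ratio $dk / r(d+k)$ grows with matrix size, which is why LoRA pays off most on the largest projections.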
Forward Pass with LoRA
The modified forward pass for a linear layer becomes:
$$\mathbf{h} = \mathbf{W}_0\mathbf{x} + \frac{\alpha}{r}\mathbf{B}\mathbf{A}\mathbf{x}$$
where:
- $\mathbf{W}_0\mathbf{x}$ is the original (frozen) computation
- $\mathbf{B}\mathbf{A}\mathbf{x}$ is the LoRA adapter's contribution
- $\frac{\alpha}{r}$ is the scaling factor
Scaling factor $\frac{\alpha}{r}$:
The $\alpha$ hyperparameter controls how much influence the adapter has relative to the frozen weights. The division by $r$ normalizes the contribution so that changing rank does not change the effective magnitude.
| $\alpha$ / $r$ | Scaling | Effect |
|---|---|---|
| $\alpha = 8$, $r = 8$ | 1.0 | Full adapter influence |
| $\alpha = 16$, $r = 16$ | 1.0 | Same effective scaling despite double rank |
| $\alpha = 32$, $r = 16$ | 2.0 | Adapter dominates — risk of overriding base knowledge |
| $\alpha = 8$, $r = 16$ | 0.5 | Conservative — base knowledge preserved more |
Intuition: Think of $\alpha/r$ as a volume knob. Higher values make the adapter louder (more domain-specific but risk losing base capabilities). Lower values keep it quieter (more conservative, preserving general knowledge).
Initialization Strategy
LoRA initializes:
- $\mathbf{A}$: Kaiming uniform (random) — provides diverse initial directions in the low-rank space
- $\mathbf{B}$: Zero — ensures the adapter's initial contribution is zero ($\Delta\mathbf{W} = \mathbf{0}$), so the adapted layer starts out identical to the frozen layer
This means at the start of training, $\Delta\mathbf{W} = \mathbf{B}\mathbf{A} = \mathbf{0} \cdot \mathbf{A} = \mathbf{0}$, so the model behaves exactly like the pre-trained model. The adapter gradually learns deviations from pre-trained behavior during training.
Why this matters: If both $\mathbf{A}$ and $\mathbf{B}$ were randomly initialized, the initial output would be perturbed by random noise. For a 70B model, this could produce incoherent text on the very first training step, creating massive loss and unstable gradients.
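The forward pass, scaling factor, and initialization can be captured in a framework-free NumPy sketch (toy dimensions, not Llama-sized):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 32, 4, 8              # toy sizes; real layers are 8192-wide

W0 = rng.normal(size=(d, k))               # frozen pre-trained weight
A = rng.uniform(-0.1, 0.1, size=(r, k))    # Kaiming-style uniform (random)
B = np.zeros((d, r))                       # zero init → adapter contributes nothing

def forward(x):
    # h = W0·x + (alpha/r)·B·A·x — frozen path plus scaled adapter path
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=k)
assert np.allclose(forward(x), W0 @ x)     # at init: identical to the base model

B += rng.normal(scale=0.01, size=(d, r))   # simulate a few gradient steps on B
assert not np.allclose(forward(x), W0 @ x) # adapter now deviates from the base
```

Note that the adapter path costs $r(d + k)$ multiplies per token instead of $dk$, which is also why the unmerged adapter adds little inference overhead.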
The Rank Selection Problem
Rank $r$ determines the expressiveness of the adapter. Too low → underfitting (cannot capture complexity). Too high → overfitting (memorizes training data).
Singular Value Decomposition (SVD) analysis:
If we compute the SVD of the "ideal" update $\Delta\mathbf{W}^*$ obtained from full fine-tuning:
$$\Delta\mathbf{W}^* = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$$
then the optimal rank-$r$ approximation is:
$$\Delta\mathbf{W}_r = \sum_{i=1}^{r} \sigma_i \mathbf{u}_i \mathbf{v}_i^T$$
The Eckart-Young theorem guarantees this is the best rank-$r$ approximation in Frobenius norm.
Empirical findings for LLM fine-tuning:
The singular values of $\Delta\mathbf{W}^*$ decay rapidly. For Llama-class models fine-tuned on domain data:
| Rank | % of Frobenius Norm Captured | Task Quality (% of Full FT) |
|---|---|---|
| 4 | 72% | 85% |
| 8 | 85% | 91% |
| 16 | 93% | 95% |
| 32 | 97% | 97% |
| 64 | 99% | 98% |
| 256 | 99.9% | 99% |
For our manga domain task, rank 16 captures 93% of the update's information at 0.12% of the parameter cost. The remaining 7% are fine-grained distinctions that matter less for our use case.
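The Eckart-Young truncation is easy to demonstrate on a synthetic $\Delta\mathbf{W}$ with a rapidly decaying spectrum (the decay rate here is invented for illustration; the percentages will not match the table above):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthesize a ΔW with exponentially decaying singular values
d, k = 128, 96
U, _ = np.linalg.qr(rng.normal(size=(d, d)))
V, _ = np.linalg.qr(rng.normal(size=(k, k)))
sigma = np.exp(-0.15 * np.arange(k))       # fast decay, as observed empirically
dW = U[:, :k] @ np.diag(sigma) @ V.T

def truncate(dW, r):
    # Best rank-r approximation (Eckart-Young): keep the top-r singular triplets
    U, s, Vt = np.linalg.svd(dW, full_matrices=False)
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

for r in (4, 16, 64):
    frac = np.linalg.norm(truncate(dW, r)) / np.linalg.norm(dW)
    print(f"rank {r:3d}: {100 * frac:.1f}% of Frobenius norm captured")
```

Because the Frobenius norm of the truncation is $\sqrt{\sum_{i \le r} \sigma_i^2}$, a handful of leading singular directions captures most of the update whenever the spectrum decays quickly.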
QLoRA — 4-Bit Quantization Mathematics
QLoRA extends LoRA by quantizing the frozen base model to 4-bit precision, dramatically reducing memory:
4-bit NormalFloat (NF4) Quantization:
Standard 4-bit integers can represent 16 uniformly spaced values (0-15). But weight distributions in LLMs are approximately normal, not uniform. NF4 places its 16 quantization levels to optimally cover the normal distribution:
$$q_i = \Phi^{-1}\left(\frac{i + 0.5}{16}\right), \quad i = 0, 1, \ldots, 15$$
where $\Phi^{-1}$ is the inverse normal CDF. This maps the 16 levels to:
| Level | NF4 Value | Probability Coverage |
|---|---|---|
| 0 | -1.75 | Extreme negative |
| 4 | -0.44 | Below mean |
| 7 | -0.06 | Near zero |
| 8 | +0.06 | Near zero |
| 11 | +0.44 | Above mean |
| 15 | +1.75 | Extreme positive |
Why NF4 > INT4: Standard INT4 wastes quantization levels on the tails (values 0, 1, 14, 15 rarely appear). NF4 places more levels near zero where most weights cluster, reducing quantization error by ~30%.
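The quantile construction above can be reproduced with the standard library alone. Note this is only a sketch of the idea: the production NF4 table in bitsandbytes uses a slightly different construction (levels rescaled to [-1, 1] with an exact zero level), so the values printed here differ somewhat from the table above.

```python
from statistics import NormalDist

# q_i = Phi^{-1}((i + 0.5) / 16): 16 equal-probability quantiles of N(0, 1)
phi_inv = NormalDist().inv_cdf
levels = [phi_inv((i + 0.5) / 16) for i in range(16)]

for i, q in enumerate(levels):
    print(f"level {i:2d}: {q:+.2f}")
```

The levels are symmetric around zero and spaced densely near the mean, sparsely in the tails — exactly the opposite of INT4's uniform grid.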
Double Quantization:
NF4 requires per-block quantization constants (one FP32 scale factor per 64 weights). With 70B parameters, this adds ~4.4GB of overhead (70B / 64 blocks × 4 bytes each). QLoRA applies a second round of quantization to these constants:
$$\text{Level 1: } \mathbf{W}_{NF4} = \text{quant}(\mathbf{W}_{FP16}, \text{scale}_{FP32})$$
$$\text{Level 2: } \text{scale}_{FP8} = \text{quant}(\text{scale}_{FP32}, \text{superscale}_{FP32})$$
This reduces the constant overhead from 0.5 bits/param to 0.127 bits/param — saving ~3GB for a 70B model.
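The bits-per-parameter figures are a quick back-of-envelope calculation (block sizes follow the QLoRA paper: 64 weights per first-level block, 256 scales per second-level block):

```python
# Per-parameter storage overhead of the quantization constants
block1 = 64        # weights per first-level block
block2 = 256       # first-level scales per second-level block

single_quant = 32 / block1                           # one FP32 scale per 64 weights
double_quant = 8 / block1 + 32 / (block1 * block2)   # FP8 scales + FP32 superscales

print(f"{single_quant:.3f} bits/param -> {double_quant:.3f} bits/param")
# 0.500 bits/param -> 0.127 bits/param

saved_gb = (single_quant - double_quant) * 70e9 / 8 / 1e9
print(f"saved for 70B params: {saved_gb:.1f} GB")  # ~3.3 GB
```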
Memory comparison for Llama 3 70B:
| Precision | Model Memory | Adapter Memory | Total |
|---|---|---|---|
| FP16 (baseline) | 140GB | — | 140GB |
| LoRA (FP16 frozen + FP16 adapter) | 140GB | 168MB | ~140GB |
| QLoRA (NF4 frozen + FP16 adapter) | 35GB | 168MB | ~35GB |
| QLoRA (NF4 + double quant) | ~33GB | 168MB | ~33GB |
QLoRA fits a 70B model on a single A100 80GB (with room for activations and optimizer states).
Dequantization During Forward Pass
During the forward pass, quantized weights are dequantized on-the-fly:
$$\mathbf{h} = \text{dequant}(\mathbf{W}_{NF4}) \cdot \mathbf{x} + \frac{\alpha}{r} \mathbf{B}\mathbf{A}\mathbf{x}$$
The dequantization computes: $\mathbf{W}_{FP16} \approx \text{scale} \times \text{NF4\_lookup}[\mathbf{W}_{NF4}]$
This happens in GPU registers, not in global memory, so the additional computation overhead is minimal (~5-8% slowdown). The big win is the 4× memory reduction, which determines whether the model fits on a given GPU at all.
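A toy blockwise quantize/dequantize roundtrip illustrates the lookup-table mechanics (the uniformly spaced `levels` here are a stand-in for the real NF4 table, and real kernels do this per 64-weight block in registers):

```python
import numpy as np

levels = np.linspace(-1.0, 1.0, 16)          # illustrative 16-entry lookup table

def quantize_block(w):
    scale = np.abs(w).max()                  # absmax scaling per block
    idx = np.abs(w / scale - levels[:, None]).argmin(axis=0)  # nearest level
    return idx.astype(np.uint8), scale       # 4-bit codes + one scale factor

def dequantize_block(idx, scale):
    return scale * levels[idx]               # W ≈ scale × lookup[codes]

rng = np.random.default_rng(1)
w = rng.normal(scale=0.02, size=64)          # one 64-weight block
idx, scale = quantize_block(w)
w_hat = dequantize_block(idx, scale)
print("max abs error:", np.abs(w - w_hat).max())
```

Each weight round-trips to within half a level spacing times the block scale, which is the quantization error the NF4 level placement minimizes in expectation.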
Model Internals — Layer-by-Layer Diagrams
LoRA Injection Points in a Transformer Layer
graph TB
subgraph "Single Transformer Layer (e.g., Layer 40 of 80)"
A["Input Hidden State h ∈ ℝ⁸¹⁹²"]
subgraph "Multi-Head Attention"
A --> Q["W_Q (8192×8192)<br>FROZEN 🔒"]
A --> K["W_K (8192×1024)<br>FROZEN 🔒"]
A --> V["W_V (8192×1024)<br>FROZEN 🔒"]
A --> QA["LoRA_Q: B_Q(8192×16) × A_Q(16×8192)<br>TRAINABLE 🔥 (262K params)"]
A --> VA["LoRA_V: B_V(8192×16) × A_V(16×1024)<br>TRAINABLE 🔥 (147K params)"]
Q --> QR["Q + (α/r)·LoRA_Q(x)"]
QA --> QR
K --> ATT["Multi-Head<br>Attention"]
V --> VR["V + (α/r)·LoRA_V(x)"]
VA --> VR
QR --> ATT
VR --> ATT
end
ATT --> O["W_O (1024×8192)<br>FROZEN 🔒"]
O --> RES1["Residual + LayerNorm"]
subgraph "Feed-Forward Network"
RES1 --> UP["W_up (8192×22016)<br>FROZEN 🔒"]
RES1 --> GATE["W_gate (8192×22016)<br>FROZEN 🔒"]
UP --> SwiGLU["SwiGLU Activation"]
GATE --> SwiGLU
SwiGLU --> DOWN["W_down (22016×8192)<br>FROZEN 🔒"]
end
DOWN --> RES2["Residual + LayerNorm"]
RES2 --> OUT["Output Hidden State"]
end
style Q fill:#bbdefb
style K fill:#bbdefb
style V fill:#bbdefb
style O fill:#bbdefb
style UP fill:#bbdefb
style GATE fill:#bbdefb
style DOWN fill:#bbdefb
style QA fill:#fff9c4
style VA fill:#fff9c4
Rank's Effect on the Weight Update Space
graph LR
subgraph "Rank 4 (Underfitting Risk)"
A4["4-dim subspace<br>Can represent:<br>- Genre preference shifts<br>- Basic tone adjustments<br>Cannot represent:<br>- Nuanced character knowledge<br>- Complex recommendation logic"]
end
subgraph "Rank 16 (Sweet Spot)"
A16["16-dim subspace<br>Can represent:<br>- Genre preference shifts<br>- Tone and mood adjustments<br>- Character relationship nuances<br>- Publication date corrections<br>Diminishing returns beyond this"]
end
subgraph "Rank 64 (Overfitting Risk)"
A64["64-dim subspace<br>Can represent everything rank 16 can<br>Plus:<br>- Memorizes training examples<br>- Learns spurious correlations<br>- 4× more params to update<br>Risk: overfits to training set"]
end
A4 -->|"+12 dims"| A16
A16 -->|"+48 dims"| A64
style A4 fill:#ffcdd2
style A16 fill:#c8e6c9
style A64 fill:#ffcdd2
QLoRA Memory Layout
graph TB
subgraph "GPU Memory Layout: QLoRA on 70B Model (A100 80GB)"
subgraph "Frozen Weights (NF4) — 33GB"
FW["70B params × 4 bits + double-quant overhead<br>Dequantized on-the-fly during forward pass<br>NO gradient stored"]
end
subgraph "LoRA Adapters (FP16) — 168MB"
LA["84M params × 16 bits<br>B matrices (80 layers × 2 targets × 8192×16)<br>A matrices (80 layers × 2 targets × 16×dim)"]
end
subgraph "Optimizer States (FP32) — 672MB"
OS_["Adam: 2 states per param × 84M × 32 bits<br>m (momentum) + v (variance) for each adapter param"]
end
subgraph "Gradients (FP16) — 168MB"
GR["84M gradient values × 16 bits<br>Only for adapter params (frozen params have no gradient)"]
end
subgraph "Activations (varies) — ~20GB"
AC["Intermediate activations for backprop<br>Depends on batch size and sequence length<br>Gradient checkpointing reduces this to ~5GB"]
end
subgraph "Free — ~26GB"
FR["Available for batch size increase<br>or sequence length increase"]
end
end
style FW fill:#bbdefb
style LA fill:#fff9c4
style OS_ fill:#ffe0b2
style GR fill:#ffccbc
style AC fill:#e1bee7
style FR fill:#e8f5e9
Training Loop: Forward-Backward with LoRA
sequenceDiagram
participant Data as Training Batch<br>"What genre is Berserk?"
participant Frozen as Frozen LLM<br>(70B params, NF4)
participant LoRA as LoRA Adapters<br>(84M params, FP16)
participant Loss as Cross-Entropy<br>Loss
participant Adam as Paged AdamW<br>Optimizer
Data->>Frozen: Input tokens
Data->>LoRA: Same input tokens
Note over Frozen: Dequantize NF4 → FP16<br>Compute W₀·x<br>(no gradient tracked)
Note over LoRA: Compute B·A·x in FP16<br>(gradient tracked)
Frozen->>Loss: W₀·x (part 1 of output)
LoRA->>Loss: (α/r)·B·A·x (part 2)
Note over Loss: Combined: h = W₀·x + (α/r)·BA·x<br>Continue through 80 layers<br>Compute CE loss against target
Loss->>LoRA: ∂L/∂B, ∂L/∂A for all 80 layers
Note over Frozen: NO gradient to frozen weights<br>Gradient memory avoided: 70B params × 2 bytes (FP16) = 140GB
LoRA->>Adam: 84M gradient values
Note over Adam: Paged AdamW:<br>If GPU OOM → offload to CPU<br>Update: θ ← θ - lr·m̂/(√v̂+ε)
Adam->>LoRA: Updated adapter weights
SVD Spectrum: Why Low Rank Works
graph TD
subgraph "Singular Value Spectrum of ΔW (Full Fine-Tuning)"
A["σ₁ = 12.4 (genre knowledge shift)"]
B["σ₂ = 8.7 (tone understanding)"]
C["σ₃ = 5.2 (character relationships)"]
D["σ₄ = 3.1 (publication metadata)"]
E["σ₅-σ₁₆ = 2.8 → 0.4 (progressively finer)"]
F["σ₁₇-σ₆₄ = 0.3 → 0.01 (noise-level)"]
G["σ₆₅+ ≈ 0 (no information)"]
end
subgraph "LoRA Rank Selection"
H["Rank 4: captures σ₁-σ₄<br>72% of ΔW energy<br>Genre + tone + characters + dates"]
I["Rank 16: captures σ₁-σ₁₆<br>93% of ΔW energy<br>All meaningful adaptations"]
J["Rank 64: captures σ₁-σ₆₄<br>99% of ΔW energy<br>Including noise — overfits"]
end
A --> H
B --> H
C --> H
D --> H
E --> I
F --> J
style I fill:#c8e6c9
Implementation Deep-Dive
LoRA Fine-Tuning with Hugging Face PEFT
import torch
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
TrainingArguments,
Trainer,
BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
def setup_qlora_model(model_name: str = "meta-llama/Meta-Llama-3-70B"):
"""
Load a 70B model in 4-bit quantization with LoRA adapters.
This fits on a single A100 80GB or even an A10G 24GB.
"""
# ── 4-bit Quantization Config ──
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # NormalFloat4 — optimal for normal-distributed weights
bnb_4bit_compute_dtype=torch.bfloat16, # Compute in BF16 for stability
bnb_4bit_use_double_quant=True, # Double quantization saves ~3GB
)
# ── Load Model ──
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map="auto",
trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
# Prepare for k-bit training (enable gradient checkpointing, etc.)
model = prepare_model_for_kbit_training(model)
# ── LoRA Config ──
lora_config = LoraConfig(
r=16, # Rank — 16 is our sweet spot
lora_alpha=32, # Scaling: alpha/r = 2.0
target_modules=[
"q_proj", "v_proj", # Attention: query and value projections
],
lora_dropout=0.05, # Dropout on adapter for regularization
bias="none", # Don't train bias terms
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 83,886,080 || all params: 70,553,706,496 || trainable%: 0.1189
return model, tokenizer
Manga Domain Training Dataset
def prepare_manga_dataset(tokenizer, max_length: int = 512):
"""
Prepare instruction-following dataset for manga domain fine-tuning.
3K examples covering: product knowledge, recommendations,
publication details, genre classification, and character analysis.
"""
PROMPT_TEMPLATE = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are MangaAssist, an expert manga advisor for Amazon's Japanese manga store.
Provide accurate, helpful information about manga titles, genres, and recommendations.
<|eot_id|><|start_header_id|>user<|end_header_id|>
{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{answer}<|eot_id|>"""
# Example training instances
training_examples = [
{
"question": "What genre is Berserk and who would enjoy it?",
"answer": "Berserk is a dark fantasy seinen manga by Kentaro Miura, "
"serialized from 1989 to 2021. It follows Guts, a lone mercenary "
"in a medieval dark fantasy setting. Recommended for readers who "
"enjoy: mature themes, detailed artwork, complex character development, "
"and dark fantasy similar to Claymore or Vagabond. Not recommended "
"for readers sensitive to graphic violence.",
},
{
"question": "Recommend manga similar to One Piece for someone who likes long adventure series.",
"answer": "For fans of One Piece's epic adventure style: "
"1. Naruto (72 volumes) — ninja adventure with similar themes of friendship "
"2. Hunter × Hunter — strategic battles with a deeper power system "
"3. Fairy Tail (63 volumes) — guild-based adventure with lighter tone "
"4. My Hero Academia (ongoing) — modern superhero shōnen "
"5. Toriko — food-hunting adventure by a Jump contemporary. "
"All are shōnen manga with long-form storytelling and world-building.",
},
# ... 3K examples total
]
# Tokenize
def tokenize(example):
    text = PROMPT_TEMPLATE.format(
        question=example["question"],
        answer=example["answer"],
    )
    tokens = tokenizer(
        text,
        max_length=max_length,
        padding="max_length",
        truncation=True,
    )
    # Causal LM training needs labels; here every token is a prediction target
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens
dataset = load_dataset("json", data_files="manga_training_data.json")
tokenized = dataset.map(tokenize, batched=False, remove_columns=dataset["train"].column_names)
return tokenized
Training Configuration
def train_manga_lora(model, tokenizer, dataset):
"""
Fine-tune with QLoRA on manga domain data.
Key hyperparameters:
- Epochs: 3 (sufficient for 3K examples with rank 16)
- Batch size: 4 with gradient accumulation 4 = effective 16
- Learning rate: 2e-4 (higher than full FT because only adapter params)
- Warmup: 100 steps (stabilize adapter before full LR)
"""
training_args = TrainingArguments(
output_dir="./manga_lora_output",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4, # Effective batch size: 16
learning_rate=2e-4, # Higher LR for adapter-only training
weight_decay=0.01,
warmup_steps=100,
lr_scheduler_type="cosine",
logging_steps=10,
save_strategy="epoch",
evaluation_strategy="epoch",
bf16=True, # BF16 mixed precision
gradient_checkpointing=True, # Trade compute for memory
optim="paged_adamw_8bit", # Paged optimizer — offloads to CPU if OOM
max_grad_norm=0.3, # Conservative clipping for stability
group_by_length=True, # Batch similar-length sequences
report_to="mlflow",
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=dataset["train"],
eval_dataset=dataset["validation"],
)
trainer.train()
# Save only the adapter weights (168MB vs 140GB for full model)
model.save_pretrained("./manga_lora_adapter")
return model
Merging LoRA Weights for Inference
def merge_and_deploy(adapter_path: str, base_model: str):
"""
Merge LoRA adapter into the base model for inference.
After merging, the model is a standard transformer with no
LoRA overhead. Inference latency is identical to the base model.
For Bedrock deployment, we upload the merged model to S3
and create a custom model import.
"""
from peft import PeftModel
# Load base model in FP16 (for merging, need full precision)
base = AutoModelForCausalLM.from_pretrained(
base_model,
torch_dtype=torch.float16,
device_map="auto",
)
# Load adapter
model = PeftModel.from_pretrained(base, adapter_path)
# Merge: W_merged = W_0 + (alpha/r) * B * A
model = model.merge_and_unload()
# Save merged model
model.save_pretrained("./manga_llm_merged")
# The merged model is the same size as the base model
# but incorporates all manga domain knowledge from the adapter
return model
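The merge identity used by `merge_and_unload` can be sanity-checked on toy matrices (a NumPy sketch with invented dimensions):

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, r, alpha = 32, 32, 4, 8
W0 = rng.normal(size=(d, k))
B, A = rng.normal(size=(d, r)), rng.normal(size=(r, k))
x = rng.normal(size=k)

W_merged = W0 + (alpha / r) * (B @ A)             # fold the adapter into the weight
h_adapter = W0 @ x + (alpha / r) * (B @ (A @ x))  # unmerged runtime adapter path
assert np.allclose(W_merged @ x, h_adapter)       # identical outputs
```

Because the two paths are mathematically equal, the merged checkpoint serves requests with zero LoRA-specific inference code or latency overhead.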
Hyperparameter Search: Rank and Alpha
import optuna
def lora_hparam_search(trial):
"""
Search for optimal LoRA hyperparameters.
Key search dimensions:
- rank: 4 to 64 (adapter capacity)
- alpha: rank to 4*rank (scaling factor)
- learning rate: 5e-5 to 5e-4 (adapter-appropriate range)
- target modules: which attention matrices to adapt
"""
rank = trial.suggest_categorical("rank", [4, 8, 16, 32, 64])
# Optuna requires identical categorical choices across trials, so search an
# alpha/rank multiplier rather than rank-dependent alpha values
alpha = rank * trial.suggest_categorical("alpha_multiplier", [1, 2, 4])
lr = trial.suggest_float("learning_rate", 5e-5, 5e-4, log=True)
target_modules = ["q_proj", "v_proj"]
if trial.suggest_categorical("add_k_o", [True, False]):
target_modules.extend(["k_proj", "o_proj"])
if trial.suggest_categorical("add_ffn", [True, False]):
target_modules.extend(["gate_proj", "up_proj", "down_proj"])
lora_config = LoraConfig(
r=rank,
lora_alpha=alpha,
target_modules=target_modules,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
# Train and evaluate...
model, tokenizer = setup_qlora_model()
model = get_peft_model(model, lora_config)
# (training code using `lr` for learning rate)
# Evaluate on held-out manga QA set
eval_score = evaluate_manga_qa(model, tokenizer)
return eval_score
# Run search
study = optuna.create_study(direction="maximize")
study.optimize(lora_hparam_search, n_trials=20)
# Typical best: rank=16, alpha=32, lr=2e-4, q+v only, no FFN
Group Discussion: Key Decision Points
Decision Point 1: Rank 8 vs 16 vs 32
Priya (ML Engineer): I trained adapters at ranks 4, 8, 16, 32, 64 on our 3K manga QA dataset:
| Rank | Trainable Params | Manga QA Score | MMLU (General) | Training Time | Adapter Size |
|---|---|---|---|---|---|
| 4 | 21M | 71.2% | 68.4% | 1.5 hr | 42MB |
| 8 | 42M | 75.8% | 68.2% | 2 hr | 84MB |
| 16 | 84M | 79.4% | 67.9% | 4 hr | 168MB |
| 32 | 168M | 80.1% | 66.8% | 8 hr | 336MB |
| 64 | 336M | 79.8% | 64.2% | 16 hr | 672MB |
Rank 16 hits the sweet spot: 79.4% manga QA score with minimal degradation on MMLU (general knowledge).
Marcus (Architect): The MMLU degradation at rank 32+ is concerning. That means the adapter is overriding useful general knowledge to memorize manga-specific patterns. At rank 64, the model actually gets worse on manga QA (79.8% < 80.1%) while losing 4% on general tasks — classic overfitting.
Aiko (Data Scientist): The SVD analysis confirms this. When I decompose the rank-32 adapter's $\mathbf{B}\mathbf{A}$ product, the top 16 singular values capture 93% of the energy. The remaining 16 dimensions are mostly noise — the adapter is spending capacity on memorizing training examples rather than learning transferable manga knowledge.
Jordan (MLOps): Rank 16 adapter is 168MB. We can store 100 versions in S3 for under $1/month. The 4-hour training time means we can iterate daily during development without blocking the GPU for other experiments.
Sam (PM): The quality jump from rank 8 (75.8%) to rank 16 (79.4%) is 3.6 points. From rank 16 to 32 it is only 0.7 points but doubles the training time. Clear diminishing returns.
Resolution: Rank 16 selected. Best quality-efficiency tradeoff. 93% of full fine-tuning information captured. Training time (4 hr) fits our daily iteration cycle. Adapter size (168MB) is trivial for storage and deployment.
Decision Point 2: Which Layers to Apply LoRA
Priya (ML Engineer): I tested different target module configurations on rank 16:
| Target Modules | Trainable Params | Manga QA | Training Time |
|---|---|---|---|
| Q only | 42M | 74.1% | 2 hr |
| Q + V | 84M | 79.4% | 4 hr |
| Q + K + V + O | 168M | 80.2% | 8 hr |
| Q + V + FFN (gate, up, down) | 420M | 80.8% | 18 hr |
| All (Q + K + V + O + FFN) | 504M | 80.6% | 22 hr |
Aiko (Data Scientist): The Q+V configuration is well-supported by theory. In attention, $Q$ determines "what to look for" and $V$ determines "what to extract". By adapting Q, we teach the model what manga-relevant features to attend to. By adapting V, we teach it what to extract from manga-specific context. K (keys) and O (output projection) add less for our task.
Marcus (Architect): Adding FFN adapters gives 1.4% improvement but 4.5× more training time. The cost-per-quality-point: 14 more hours × $4/hr = $56 / 1.4 points = $40/point. Under our $50 threshold but not by much, and the operational complexity of 5× more adapter parameters is not worth it for 1.4%.
Jordan (MLOps): Q+V is also the most tested configuration in the literature. LoRA's original paper used Q+V, and most reproduction studies do the same. We benefit from the ecosystem's validation.
Resolution: Apply LoRA to Q and V projections only. Best parameters-to-quality ratio. Matches literature defaults. Adding K+O+FFN provides diminishing returns (0.8-1.4%) at significant training cost increase.
Decision Point 3: Self-Hosted vs Bedrock Custom Model Training
Marcus (Architect): We have two paths for LLM customization: (A) self-host Llama 3 70B with QLoRA on SageMaker, or (B) use Bedrock's model customization API with Claude.
| Approach | Quality Control | Latency | Monthly Cost | Operational Load |
|---|---|---|---|---|
| SageMaker QLoRA (Llama 3 70B) | Full control over rank, alpha, modules | 300-800ms | $2,400 (p4d.24xlarge) | High — manage GPU fleet |
| Bedrock Custom (Claude) | Limited — Bedrock controls hyperparams | 500-1500ms | $1,100 (API pricing) | Low — managed service |
| Bedrock base Claude (no tuning) | None — prompt engineering only | 500-1500ms | $800 | None |
Priya (ML Engineer): Bedrock's customization API is a black box. We provide training data as JSONL, and Bedrock handles everything. We cannot control rank, target modules, or training schedule. For production, this lack of visibility makes debugging regressions nearly impossible.
Sam (PM): The cost difference is stark: $2,400/month for SageMaker vs $1,100/month for Bedrock API. But the SageMaker route requires Jordan to manage GPU instances, handle cold starts, implement model serving, and monitor GPU utilization. That is easily 20 hours/month of MLOps time.
Jordan (MLOps): I would recommend a hybrid: use Bedrock Claude base model with excellent prompt engineering for launch (cheapest, simplest). Run QLoRA experiments on SageMaker during development to understand what domain adaptation buys us. If the gap is significant (>5% quality), migrate to self-hosted. If small (<3%), stay on Bedrock.
Marcus (Architect): The hybrid approach also aligns with our CPQ framework. We only invest in self-hosting if the quality delta × user impact justifies the infrastructure cost.
Resolution: Phase 1 (MVP): Bedrock Claude base with prompt engineering. Phase 2: Run QLoRA experiments on Llama 3 70B to quantify the quality delta. Phase 3: Deploy self-hosted Llama 3 with QLoRA only if the quality delta exceeds 5% on our manga QA benchmark and the CPQ analysis justifies the cost.
Decision Point 4: TIES-Merging for Multi-Task Adapters
Priya (ML Engineer): If we train multiple LoRA adapters (manga knowledge, structured output, safety), can we merge them into one?
Aiko (Data Scientist): TIES-Merging (Yadav et al., 2023) addresses this. It merges multiple adapters by:
1. Trimming: Keep only the top-K% most changed parameters per adapter
2. Resolving sign conflicts: If two adapters push a parameter in opposite directions, take the majority vote
3. Merging: Average the surviving parameter deltas
I tested with two adapters (manga_knowledge + structured_output):
| Approach | Manga QA | Format Compliance | Combined |
|---|---|---|---|
| Manga adapter only | 79.4% | 68.2% | 73.8% |
| Format adapter only | 62.1% | 91.5% | 76.8% |
| Sequential (manga then format) | 76.2% | 85.3% | 80.8% |
| TIES-Merged | 77.8% | 88.1% | 83.0% |
TIES merging preserves more of each adapter's specialty than sequential application.
Jordan (MLOps): But this adds another dimension to our validation gate. Now we need to validate the merged model on both tasks simultaneously. If manga QA degrades below 75% post-merge, we reject and debug which adapter is interfering.
Resolution: Explore TIES merging in Phase 3 when we have multiple stable adapters. For MVP, single adapter (manga knowledge) is sufficient. The merging research validates the adapter approach as composable and extensible.
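The three TIES steps Aiko described can be sketched on synthetic flat parameter deltas. This is an illustrative simplification (real implementations operate per-tensor and weight the sign election by magnitude mass; `ties_merge` is a hypothetical helper):

```python
import numpy as np

def ties_merge(deltas, trim_frac=0.2):
    """Merge adapter deltas: trim, elect signs, average survivors."""
    deltas = [d.copy() for d in deltas]
    # 1. Trim: zero all but the top trim_frac of parameters by magnitude
    for d in deltas:
        cutoff = np.quantile(np.abs(d), 1 - trim_frac)
        d[np.abs(d) < cutoff] = 0.0
    stacked = np.stack(deltas)
    # 2. Elect sign: per-parameter majority sign of the summed deltas
    elected = np.sign(stacked.sum(axis=0))
    # 3. Merge: average only the deltas that agree with the elected sign
    agree = (np.sign(stacked) == elected) & (stacked != 0)
    counts = np.maximum(agree.sum(axis=0), 1)
    return (stacked * agree).sum(axis=0) / counts

rng = np.random.default_rng(7)
manga_delta = rng.normal(size=1000)   # stand-ins for flattened adapter deltas
format_delta = rng.normal(size=1000)
merged = ties_merge([manga_delta, format_delta], trim_frac=0.2)
print("nonzero fraction:", (merged != 0).mean())
```

Disagreeing deltas are dropped rather than averaged toward zero, which is the mechanism behind TIES preserving each adapter's specialty better than naive averaging.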
Research Paper References
1. LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021)
Key contribution: Demonstrated that task-specific updates to large pre-trained models have low intrinsic rank. By decomposing weight updates into $\mathbf{B}\mathbf{A}$ with small $r$, achieves comparable quality to full fine-tuning at a fraction of the compute. Key insight: LoRA adapters can be merged back into the base model, adding zero inference overhead.
Relevance to MangaAssist: Foundation of our LLM customization strategy. The rank-16 configuration gives us 79.4% manga QA accuracy (vs ~81% for full fine-tuning) at 1/1800th the cost. The merge capability means we deploy a standard model checkpoint with no LoRA-specific inference code.
2. QLoRA: Efficient Finetuning of Quantized Language Models (Dettmers et al., 2023)
Key contribution: Combined 4-bit NormalFloat quantization with LoRA to fine-tune 65B parameter models on a single 48GB GPU. Introduced NF4 (optimal for normally distributed weights), double quantization (quantize the quantization constants), and paged optimizers (offload optimizer state to CPU when OOM).
Relevance to MangaAssist: QLoRA makes our 70B model experiments feasible on SageMaker ml.g5.2xlarge (A10G, 24GB) instead of requiring ml.p4d.24xlarge (8× A100). This reduces experiment cost from ~$110/hour to ~$6/hour — enabling rapid iteration during the R&D phase.
3. Resolving Interference When Merging Models — TIES-Merging (Yadav et al., 2023)
Key contribution: Identified that naively averaging model weights causes interference — parameters pushed in opposite directions by different tasks partially cancel out. TIES (TrIm, Elect Sign, and merge) resolves this by keeping only the most important changes, resolving sign conflicts via majority vote, and then averaging.
Relevance to MangaAssist: Once we have multiple LoRA adapters (manga knowledge, structured output, safety), TIES merging gives us a principled way to combine them. Our ablation shows TIES preserves 98% of each adapter's specialty vs 93% for naive averaging.
4. Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning (Lialin et al., 2023)
Key contribution: Comprehensive survey of 40+ parameter-efficient methods. Key finding: LoRA consistently outperforms other PEFT methods (adapters, prefix tuning, prompt tuning) on large models (>7B parameters) across diverse tasks. For smaller models (<1B), the differences are minimal.
Relevance to MangaAssist: Validates our choice of LoRA over prefix tuning or adapter layers for the 70B model. For our smaller models (DistilBERT 66M, MiniLM 33M), the paper suggests full fine-tuning is still appropriate — consistent with our approach in docs 01 and 03.
Production Deployment and Monitoring
Deployment Architecture
graph LR
subgraph "Development (Phase 2)"
A[Manga QA<br>Dataset (3K)] --> B[QLoRA Training<br>SageMaker g5.2xlarge<br>(6 hr, $36)]
B --> C[Adapter Artifact<br>S3 (168MB)]
C --> D[Merge + Validate<br>on Manga QA<br>and MMLU]
end
subgraph "Production (Phase 3, if justified)"
D --> E{Quality Delta > 5%?}
E -->|Yes| F[Deploy Merged Model<br>SageMaker Endpoint<br>(g5.12xlarge)]
E -->|No| G[Stay on Bedrock<br>Claude Base +<br>Prompt Engineering]
end
subgraph "Inference"
F --> H[Self-hosted Llama 3<br>Manga-adapted<br>300-800ms]
G --> I[Bedrock Claude<br>Prompt-engineered<br>500-1500ms]
end
Key Metrics for LoRA Custom Model
| Metric | Target | Alert Threshold |
|---|---|---|
| Manga QA accuracy | ≥ 78% | < 75% |
| MMLU general knowledge | ≥ 66% | < 64% (base=68.5%) |
| Factual accuracy (manga dates, authors) | ≥ 90% | < 85% |
| Structured output compliance | ≥ 85% | < 80% |
| Training loss convergence | < 0.5 by epoch 3 | > 1.0 at epoch 3 |
| Adapter size | 168MB (rank 16) | > 500MB (rank too high) |