
04. LoRA/QLoRA LLM Customization — Adapting Claude 3.5 Sonnet via Parameter-Efficient Methods

Problem Statement and MangaAssist Context

MangaAssist uses Claude 3.5 Sonnet (via Amazon Bedrock) as its primary language model for generating product descriptions, answering complex manga questions, and creating personalized recommendations. While Claude 3.5 Sonnet is exceptional at general language tasks, it lacks deep manga domain knowledge — it sometimes generates inaccurate publication dates, confuses character arcs across series, or produces generic rather than genre-specific recommendations.

Full fine-tuning of a model this size (~175B parameters estimated) is impractical: it requires hundreds of GPUs and costs $100K+ per run. LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) offer an alternative — training only 0.1-1% of the parameters while achieving 90-97% of full fine-tuning quality.

Since Claude 3.5 Sonnet is a managed API (Bedrock), we cannot directly apply LoRA to it. Instead, this document explores LoRA/QLoRA on an open-source alternative (Llama 3 70B) as the self-hosted fallback model, and discusses how the same principles apply conceptually to Bedrock's custom model training API.

Why LoRA Instead of Full Fine-Tuning

| Approach | Trainable Params | GPU Memory | Training Time | Cost per Run |
|---|---|---|---|---|
| Full Fine-Tuning (70B) | 70B | 560GB (8× A100 80GB) | 72 hours | $28,800 |
| LoRA (rank 16) | 84M (0.12%) | 80GB (1× A100) | 4 hours | $16 |
| QLoRA (rank 16, 4-bit) | 84M (0.12%) | 24GB (1× A10G) | 6 hours | $8 |

LoRA reduces the cost by 1,800×. QLoRA halves it again and, because it fits in 24GB, runs on far cheaper single-GPU instances.


Mathematical Foundations

Low-Rank Decomposition — The Core Idea

In a standard transformer, each weight matrix $\mathbf{W}_0 \in \mathbb{R}^{d \times k}$ is updated during fine-tuning:

$$\mathbf{W} = \mathbf{W}_0 + \Delta\mathbf{W}$$

Full fine-tuning learns all $d \times k$ parameters in $\Delta\mathbf{W}$. LoRA hypothesizes that the task-specific update $\Delta\mathbf{W}$ has low intrinsic rank — it can be decomposed as:

$$\Delta\mathbf{W} = \mathbf{B}\mathbf{A}$$

where $\mathbf{B} \in \mathbb{R}^{d \times r}$ and $\mathbf{A} \in \mathbb{R}^{r \times k}$, with rank $r \ll \min(d, k)$.

Parameter savings:

| Model Dimension | Full Update | LoRA (r=16) | Savings |
|---|---|---|---|
| $\mathbf{W}_Q$: 8192 × 8192 | 67M | 262K | 256× |
| $\mathbf{W}_K$: 8192 × 1024 | 8.4M | 147K | 57× |
| $\mathbf{W}_V$: 8192 × 1024 | 8.4M | 147K | 57× |
| $\mathbf{W}_O$: 1024 × 8192 | 8.4M | 147K | 57× |
| All layers (80 layers) | ~6.7B | ~84M | 80× |
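
Sanity check: these counts follow directly from the decomposition. A minimal sketch (plain Python, no dependencies; the dimensions are the ones from the table) that reproduces the per-matrix rows:

def lora_param_counts(d: int, k: int, r: int = 16):
    """Return (full-update params, LoRA params, savings) for a d×k weight matrix."""
    full = d * k          # every entry of ΔW
    lora = (d + k) * r    # B is d×r, A is r×k
    return full, lora, full / lora

for name, (d, k) in {"W_Q": (8192, 8192), "W_K": (8192, 1024),
                     "W_V": (8192, 1024), "W_O": (1024, 8192)}.items():
    full, lora, savings = lora_param_counts(d, k)
    print(f"{name}: full={full / 1e6:.1f}M  lora={lora / 1e3:.0f}K  savings={savings:.0f}x")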

Forward Pass with LoRA

The modified forward pass for a linear layer becomes:

$$\mathbf{h} = \mathbf{W}_0\mathbf{x} + \frac{\alpha}{r}\mathbf{B}\mathbf{A}\mathbf{x}$$

where:
- $\mathbf{W}_0\mathbf{x}$ is the original (frozen) computation
- $\mathbf{B}\mathbf{A}\mathbf{x}$ is the LoRA adapter's contribution
- $\frac{\alpha}{r}$ is the scaling factor

Scaling factor $\frac{\alpha}{r}$:

The $\alpha$ hyperparameter controls how much influence the adapter has relative to the frozen weights. The division by $r$ normalizes the contribution so that changing rank does not change the effective magnitude.

| $\alpha$ / $r$ | Scaling | Effect |
|---|---|---|
| $\alpha = 8$, $r = 8$ | 1.0 | Full adapter influence |
| $\alpha = 16$, $r = 16$ | 1.0 | Same effective scaling despite double rank |
| $\alpha = 32$, $r = 16$ | 2.0 | Adapter dominates — risk of overriding base knowledge |
| $\alpha = 8$, $r = 16$ | 0.5 | Conservative — base knowledge preserved more |

Intuition: Think of $\alpha/r$ as a volume knob. Higher values make the adapter louder (more domain-specific but risk losing base capabilities). Lower values keep it quieter (more conservative, preserving general knowledge).

Initialization Strategy

LoRA initializes:
- $\mathbf{A}$: Kaiming uniform (random) — provides diverse initial directions in the low-rank space
- $\mathbf{B}$: Zero — ensures the adapter's initial contribution is zero ($\Delta\mathbf{W} = \mathbf{0}$)

This means at the start of training, $\Delta\mathbf{W} = \mathbf{B}\mathbf{A} = \mathbf{0} \cdot \mathbf{A} = \mathbf{0}$, so the model behaves exactly like the pre-trained model. The adapter gradually learns deviations from pre-trained behavior during training.

Why this matters: If both $\mathbf{A}$ and $\mathbf{B}$ were randomly initialized, the initial output would be perturbed by random noise. For a 70B model, this could produce incoherent text on the very first training step, creating massive loss and unstable gradients.
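
To make the forward pass, the $\alpha/r$ scaling, and the initialization concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer. This is an illustration of the math above, not the PEFT library's actual implementation:

import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (illustrative)."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze W_0
        self.scaling = alpha / r                # the alpha/r "volume knob"
        # A: Kaiming uniform (random), giving diverse directions in the rank-r subspace
        self.lora_A = nn.Parameter(torch.empty(r, base.in_features))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        # B: zeros, so BA = 0 and training starts exactly at the pre-trained weights
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W_0·x + (alpha/r)·B·A·x
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)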

The Rank Selection Problem

Rank $r$ determines the expressiveness of the adapter. Too low → underfitting (cannot capture complexity). Too high → overfitting (memorizes training data).

Singular Value Decomposition (SVD) analysis:

If we compute the SVD of the "ideal" update $\Delta\mathbf{W}^*$ obtained from full fine-tuning:

$$\Delta\mathbf{W}^* = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$$

then the optimal rank-$r$ approximation is:

$$\Delta\mathbf{W}_r = \sum_{i=1}^{r} \sigma_i \mathbf{u}_i \mathbf{v}_i^T$$

The Eckart-Young theorem guarantees this is the best rank-$r$ approximation in Frobenius norm.
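
This is easy to verify numerically. A short NumPy sketch on a synthetic stand-in for $\Delta\mathbf{W}^*$ (low-rank signal plus noise), computing the fraction of the Frobenius norm captured by the top-$r$ singular values:

import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a fine-tuning update: rank-64 signal plus small noise
delta_w = rng.normal(size=(1024, 64)) @ rng.normal(size=(64, 256)) \
          + 0.5 * rng.normal(size=(1024, 256))

sigma = np.linalg.svd(delta_w, compute_uv=False)  # singular values, descending
energy = np.cumsum(sigma**2) / np.sum(sigma**2)   # ||ΔW||_F² = Σ σᵢ²

for r in (4, 8, 16, 32, 64):
    print(f"rank {r:3d}: {100 * np.sqrt(energy[r - 1]):.1f}% of ||ΔW||_F captured")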

Empirical findings for LLM fine-tuning:

The singular values of $\Delta\mathbf{W}^*$ decay rapidly. For Llama-class models fine-tuned on domain data:

| Rank | % of Frobenius Norm Captured | Task Quality (% of Full FT) |
|---|---|---|
| 4 | 72% | 85% |
| 8 | 85% | 91% |
| 16 | 93% | 95% |
| 32 | 97% | 97% |
| 64 | 99% | 98% |
| 256 | 99.9% | 99% |

For our manga domain task, rank 16 captures 93% of the update's information at 0.12% of the parameter cost. The remaining 7% are fine-grained distinctions that matter less for our use case.

QLoRA — 4-Bit Quantization Mathematics

QLoRA extends LoRA by quantizing the frozen base model to 4-bit precision, dramatically reducing memory:

4-bit NormalFloat (NF4) Quantization:

Standard 4-bit integers can represent values 0-15. But weight distributions in LLMs follow a normal distribution, not uniform. NF4 creates 16 quantization levels that optimally cover the normal distribution:

$$q_i = \Phi^{-1}\left(\frac{i + 0.5}{16}\right), \quad i = 0, 1, \ldots, 15$$

where $\Phi^{-1}$ is the inverse normal CDF. This maps the 16 levels to:

| Level | NF4 Value | Probability Coverage |
|---|---|---|
| 0 | -1.75 | Extreme negative |
| 4 | -0.44 | Below mean |
| 7 | -0.06 | Near zero |
| 8 | +0.06 | Near zero |
| 11 | +0.44 | Above mean |
| 15 | +1.75 | Extreme positive |

Why NF4 > INT4: Standard INT4 wastes quantization levels on the tails (values 0, 1, 14, 15 rarely appear). NF4 places more levels near zero where most weights cluster, reducing quantization error by ~30%.
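
A sketch of the codebook construction using the quantile formula above (SciPy's norm.ppf as $\Phi^{-1}$), with a simple absmax-scaled quantize/dequantize round trip. The production NF4 codebook in bitsandbytes is built slightly differently, so treat this as illustrative:

import numpy as np
from scipy.stats import norm

# 16 levels at the quantiles of the standard normal, per the formula above
levels = norm.ppf((np.arange(16) + 0.5) / 16)

def quantize_block(w):
    """Quantize one block of weights to 4-bit indices plus an FP32 scale."""
    scale = np.abs(w).max() / np.abs(levels).max()  # absmax scaling per block
    idx = np.abs(w[:, None] / scale - levels[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), np.float32(scale)

def dequantize_block(idx, scale):
    return scale * levels[idx]  # table lookup × scale

w = np.random.default_rng(0).normal(size=64).astype(np.float32)
idx, scale = quantize_block(w)
print(f"mean abs quantization error: {np.abs(w - dequantize_block(idx, scale)).mean():.4f}")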

Double Quantization:

NF4 requires per-block quantization constants (one FP32 scale factor per 64 weights). With 70B parameters, this adds 4.375B bytes of overhead. QLoRA applies a second round of quantization to these constants:

$$\text{Level 1: } \mathbf{W}_{NF4} = \text{quant}(\mathbf{W}_{FP16}, \text{scale}_{FP32})$$
$$\text{Level 2: } \text{scale}_{FP8} = \text{quant}(\text{scale}_{FP32}, \text{superscale}_{FP32})$$

This reduces the constant overhead from 0.5 bits/param to 0.127 bits/param — saving ~3GB for a 70B model.
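
The overhead arithmetic, worked out explicitly (assuming one FP32 scale per 64-weight block, re-quantized to FP8 with one FP32 super-scale per 256 blocks, matching the figures above):

PARAMS = 70e9
BLOCK = 64   # weights per quantization block
GROUP = 256  # blocks per second-level quantization group

single = 32 / BLOCK                        # FP32 scale per block, in bits/param
double = 8 / BLOCK + 32 / (BLOCK * GROUP)  # FP8 scale + amortized FP32 super-scale

print(f"single quantization: {single:.3f} bits/param = {PARAMS * single / 8 / 1e9:.2f} GB")
print(f"double quantization: {double:.3f} bits/param = {PARAMS * double / 8 / 1e9:.2f} GB")
print(f"saved: {PARAMS * (single - double) / 8 / 1e9:.2f} GB")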

Memory comparison for Llama 3 70B:

| Precision | Model Memory | Adapter Memory | Total |
|---|---|---|---|
| FP16 (baseline) | 140GB | — | 140GB |
| LoRA (FP16 frozen + FP16 adapter) | 140GB | 168MB | ~140GB |
| QLoRA (NF4 frozen + FP16 adapter) | 35GB | 168MB | ~35GB |
| QLoRA (NF4 + double quant) | ~33GB | 168MB | ~33GB |

QLoRA fits a 70B model on a single A100 80GB (with room for activations and optimizer states).

Dequantization During Forward Pass

During the forward pass, quantized weights are dequantized on-the-fly:

$$\mathbf{h} = \text{dequant}(\mathbf{W}_{NF4}) \cdot \mathbf{x} + \frac{\alpha}{r} \mathbf{B}\mathbf{A}\mathbf{x}$$

The dequantization computes: $\mathbf{W}_{FP16} \approx \text{scale} \times \text{NF4\_lookup}[\mathbf{W}_{NF4}]$

This happens in GPU registers, not in global memory, so the additional computation overhead is minimal (~5-8% slowdown). The big win is the 4× memory reduction, which determines whether the model fits on a given GPU at all.


Model Internals — Layer-by-Layer Diagrams

LoRA Injection Points in a Transformer Layer

graph TB
    subgraph "Single Transformer Layer (e.g., Layer 40 of 80)"
        A["Input Hidden State h ∈ ℝ⁸¹⁹²"]

        subgraph "Multi-Head Attention"
            A --> Q["W_Q (8192×8192)<br>FROZEN 🔒"]
            A --> K["W_K (8192×1024)<br>FROZEN 🔒"]
            A --> V["W_V (8192×1024)<br>FROZEN 🔒"]

            A --> QA["LoRA_Q: B_Q(8192×16) × A_Q(16×8192)<br>TRAINABLE 🔥 (262K params)"]
            A --> VA["LoRA_V: B_V(1024×16) × A_V(16×8192)<br>TRAINABLE 🔥 (147K params)"]

            Q --> QR["Q + (α/r)·LoRA_Q(x)"]
            QA --> QR
            K --> ATT["Multi-Head<br>Attention"]
            V --> VR["V + (α/r)·LoRA_V(x)"]
            VA --> VR
            QR --> ATT
            VR --> ATT
        end

        ATT --> O["W_O (1024×8192)<br>FROZEN 🔒"]
        O --> RES1["Residual + LayerNorm"]

        subgraph "Feed-Forward Network"
            RES1 --> UP["W_up (8192×22016)<br>FROZEN 🔒"]
            RES1 --> GATE["W_gate (8192×22016)<br>FROZEN 🔒"]
            UP --> SwiGLU["SwiGLU Activation"]
            GATE --> SwiGLU
            SwiGLU --> DOWN["W_down (22016×8192)<br>FROZEN 🔒"]
        end

        DOWN --> RES2["Residual + LayerNorm"]
        RES2 --> OUT["Output Hidden State"]
    end

    style Q fill:#bbdefb
    style K fill:#bbdefb
    style V fill:#bbdefb
    style O fill:#bbdefb
    style UP fill:#bbdefb
    style GATE fill:#bbdefb
    style DOWN fill:#bbdefb
    style QA fill:#fff9c4
    style VA fill:#fff9c4

Rank's Effect on the Weight Update Space

graph LR
    subgraph "Rank 4 (Underfitting Risk)"
        A4["4-dim subspace<br>Can represent:<br>- Genre preference shifts<br>- Basic tone adjustments<br>Cannot represent:<br>- Nuanced character knowledge<br>- Complex recommendation logic"]
    end

    subgraph "Rank 16 (Sweet Spot)"
        A16["16-dim subspace<br>Can represent:<br>- Genre preference shifts<br>- Tone and mood adjustments<br>- Character relationship nuances<br>- Publication date corrections<br>Diminishing returns beyond this"]
    end

    subgraph "Rank 64 (Overfitting Risk)"
        A64["64-dim subspace<br>Can represent everything rank 16 can<br>Plus:<br>- Memorizes training examples<br>- Learns spurious correlations<br>- 4× more params to update<br>Risk: overfits to training set"]
    end

    A4 -->|"+12 dims"| A16
    A16 -->|"+48 dims"| A64

    style A4 fill:#ffcdd2
    style A16 fill:#c8e6c9
    style A64 fill:#ffcdd2

QLoRA Memory Layout

graph TB
    subgraph "GPU Memory Layout: QLoRA on 70B Model (A100 80GB)"
        subgraph "Frozen Weights (NF4) — 33GB"
            FW["70B params × 4 bits + double-quant overhead<br>Dequantized on-the-fly during forward pass<br>NO gradient stored"]
        end

        subgraph "LoRA Adapters (FP16) — 168MB"
            LA["84M params × 16 bits<br>B matrices (80 layers × 2 targets × 8192×16)<br>A matrices (80 layers × 2 targets × 16×dim)"]
        end

        subgraph "Optimizer States (FP32) — 672MB"
            OS_["Adam: 2 states per param × 84M × 32 bits<br>m (momentum) + v (variance) for each adapter param"]
        end

        subgraph "Gradients (FP16) — 168MB"
            GR["84M gradient values × 16 bits<br>Only for adapter params (frozen params have no gradient)"]
        end

        subgraph "Activations (varies) — ~20GB"
            AC["Intermediate activations for backprop<br>Depends on batch size and sequence length<br>Gradient checkpointing reduces this to ~5GB"]
        end

        subgraph "Free — ~26GB"
            FR["Available for batch size increase<br>or sequence length increase"]
        end
    end

    style FW fill:#bbdefb
    style LA fill:#fff9c4
    style OS_ fill:#ffe0b2
    style GR fill:#ffccbc
    style AC fill:#e1bee7
    style FR fill:#e8f5e9

Training Loop: Forward-Backward with LoRA

sequenceDiagram
    participant Data as Training Batch<br>"What genre is Berserk?"
    participant Frozen as Frozen LLM<br>(70B params, NF4)
    participant LoRA as LoRA Adapters<br>(84M params, FP16)
    participant Loss as Cross-Entropy<br>Loss
    participant Adam as Paged AdamW<br>Optimizer

    Data->>Frozen: Input tokens
    Data->>LoRA: Same input tokens

    Note over Frozen: Dequantize NF4 → FP16<br>Compute W₀·x<br>(no gradient tracked)

    Note over LoRA: Compute B·A·x in FP16<br>(gradient tracked)

    Frozen->>Loss: W₀·x (part 1 of output)
    LoRA->>Loss: (α/r)·B·A·x (part 2)

    Note over Loss: Combined: h = W₀·x + (α/r)·BA·x<br>Continue through 80 layers<br>Compute CE loss against target

    Loss->>LoRA: ∂L/∂B, ∂L/∂A for all 80 layers
    Note over Frozen: NO gradient stored for frozen weights<br>Saves 70B × 2 bytes = 140GB of gradient memory

    LoRA->>Adam: 84M gradient values
    Note over Adam: Paged AdamW:<br>If GPU OOM → offload to CPU<br>Update: θ ← θ - lr·m̂/(√v̂+ε)

    Adam->>LoRA: Updated adapter weights

SVD Spectrum: Why Low Rank Works

graph TD
    subgraph "Singular Value Spectrum of ΔW (Full Fine-Tuning)"
        A["σ₁ = 12.4 (genre knowledge shift)"]
        B["σ₂ = 8.7 (tone understanding)"]
        C["σ₃ = 5.2 (character relationships)"]
        D["σ₄ = 3.1 (publication metadata)"]
        E["σ₅-σ₁₆ = 2.8 → 0.4 (progressively finer)"]
        F["σ₁₇-σ₆₄ = 0.3 → 0.01 (noise-level)"]
        G["σ₆₅+ ≈ 0 (no information)"]
    end

    subgraph "LoRA Rank Selection"
        H["Rank 4: captures σ₁-σ₄<br>72% of ΔW energy<br>Genre + tone + characters + dates"]
        I["Rank 16: captures σ₁-σ₁₆<br>93% of ΔW energy<br>All meaningful adaptations"]
        J["Rank 64: captures σ₁-σ₆₄<br>99% of ΔW energy<br>Including noise — overfits"]
    end

    A --> H
    B --> H
    C --> H
    D --> H
    E --> I
    F --> J

    style I fill:#c8e6c9

Implementation Deep-Dive

LoRA Fine-Tuning with Hugging Face PEFT

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset


def setup_qlora_model(model_name: str = "meta-llama/Meta-Llama-3-70B"):
    """
    Load a 70B model in 4-bit quantization with LoRA adapters.
    This fits on a single A100 80GB or even an A10G 24GB.
    """
    # ── 4-bit Quantization Config ──
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",           # NormalFloat4 — optimal for normal-distributed weights
        bnb_4bit_compute_dtype=torch.bfloat16, # Compute in BF16 for stability
        bnb_4bit_use_double_quant=True,        # Double quantization saves ~3GB
    )

    # ── Load Model ──
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token

    # Prepare for k-bit training (enable gradient checkpointing, etc.)
    model = prepare_model_for_kbit_training(model)

    # ── LoRA Config ──
    lora_config = LoraConfig(
        r=16,                          # Rank — 16 is our sweet spot
        lora_alpha=32,                 # Scaling: alpha/r = 2.0
        target_modules=[
            "q_proj", "v_proj",        # Attention: query and value projections
        ],
        lora_dropout=0.05,             # Dropout on adapter for regularization
        bias="none",                   # Don't train bias terms
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
    # Output: trainable params: 83,886,080 || all params: 70,553,706,496 || trainable%: 0.1189

    return model, tokenizer

Manga Domain Training Dataset

def prepare_manga_dataset(tokenizer, max_length: int = 512):
    """
    Prepare instruction-following dataset for manga domain fine-tuning.
    3K examples covering: product knowledge, recommendations,
    publication details, genre classification, and character analysis.
    """

    PROMPT_TEMPLATE = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are MangaAssist, an expert manga advisor for Amazon's Japanese manga store.
Provide accurate, helpful information about manga titles, genres, and recommendations.
<|eot_id|><|start_header_id|>user<|end_header_id|>
{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
{answer}<|eot_id|>"""

    # Example training instances
    training_examples = [
        {
            "question": "What genre is Berserk and who would enjoy it?",
            "answer": "Berserk is a dark fantasy seinen manga by Kentaro Miura, "
                      "serialized from 1989 to 2021. It follows Guts, a lone mercenary "
                      "in a medieval dark fantasy setting. Recommended for readers who "
                      "enjoy: mature themes, detailed artwork, complex character development, "
                      "and dark fantasy similar to Claymore or Vagabond. Not recommended "
                      "for readers sensitive to graphic violence.",
        },
        {
            "question": "Recommend manga similar to One Piece for someone who likes long adventure series.",
            "answer": "For fans of One Piece's epic adventure style: "
                      "1. Naruto (72 volumes) — ninja adventure with similar themes of friendship "
                      "2. Hunter × Hunter — strategic battles with a deeper power system "
                      "3. Fairy Tail (63 volumes) — guild-based adventure with lighter tone "
                      "4. My Hero Academia (ongoing) — modern superhero shōnen "
                      "5. Toriko — food-hunting adventure by a Jump contemporary. "
                      "All are shōnen manga with long-form storytelling and world-building.",
        },
        # ... 3K examples total; in practice these are serialized to
        # manga_training_data.json and loaded below
    ]

    # Tokenize
    def tokenize(example):
        text = PROMPT_TEMPLATE.format(
            question=example["question"],
            answer=example["answer"],
        )
        tokens = tokenizer(
            text,
            max_length=max_length,
            padding="max_length",
            truncation=True,
        )
        # Causal LM objective: labels mirror input_ids (Trainer shifts them internally)
        tokens["labels"] = tokens["input_ids"].copy()
        return tokens

    dataset = load_dataset(
        "json",
        data_files={
            "train": "manga_training_data.json",
            # validation file name assumed; train_manga_lora() expects this split
            "validation": "manga_validation_data.json",
        },
    )
    tokenized = dataset.map(tokenize, batched=False, remove_columns=dataset["train"].column_names)
    return tokenized

Training Configuration

def train_manga_lora(model, tokenizer, dataset):
    """
    Fine-tune with QLoRA on manga domain data.

    Key hyperparameters:
    - Epochs: 3 (sufficient for 3K examples with rank 16)
    - Batch size: 4 with gradient accumulation 4 = effective 16
    - Learning rate: 2e-4 (higher than full FT because only adapter params)
    - Warmup: 100 steps (stabilize adapter before full LR)
    """
    training_args = TrainingArguments(
        output_dir="./manga_lora_output",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,       # Effective batch size: 16
        learning_rate=2e-4,                  # Higher LR for adapter-only training
        weight_decay=0.01,
        warmup_steps=100,
        lr_scheduler_type="cosine",
        logging_steps=10,
        save_strategy="epoch",
        evaluation_strategy="epoch",
        bf16=True,                           # BF16 mixed precision
        gradient_checkpointing=True,         # Trade compute for memory
        optim="paged_adamw_8bit",            # Paged optimizer — offloads to CPU if OOM
        max_grad_norm=0.3,                   # Conservative clipping for stability
        group_by_length=True,                # Batch similar-length sequences
        report_to="mlflow",
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset["train"],
        eval_dataset=dataset["validation"],
    )

    trainer.train()

    # Save only the adapter weights (168MB vs 140GB for full model)
    model.save_pretrained("./manga_lora_adapter")

    return model

Merging LoRA Weights for Inference

def merge_and_deploy(adapter_path: str, base_model: str):
    """
    Merge LoRA adapter into the base model for inference.

    After merging, the model is a standard transformer with no
    LoRA overhead. Inference latency is identical to the base model.

    For Bedrock deployment, we upload the merged model to S3
    and create a custom model import.
    """
    from peft import PeftModel

    # Load base model in FP16 (for merging, need full precision)
    base = AutoModelForCausalLM.from_pretrained(
        base_model,
        torch_dtype=torch.float16,
        device_map="auto",
    )

    # Load adapter
    model = PeftModel.from_pretrained(base, adapter_path)

    # Merge: W_merged = W_0 + (alpha/r) * B * A
    model = model.merge_and_unload()

    # Save merged model
    model.save_pretrained("./manga_llm_merged")

    # The merged model is the same size as the base model
    # but incorporates all manga domain knowledge from the adapter
    return model
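
A quick smoke test of the merged checkpoint (the prompt and generation settings here are illustrative):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

merged = AutoModelForCausalLM.from_pretrained(
    "./manga_llm_merged",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B")  # unchanged by LoRA

prompt = "What genre is Berserk and who would enjoy it?"
inputs = tokenizer(prompt, return_tensors="pt").to(merged.device)
output = merged.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))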

Hyperparameter Search: Rank and Alpha

import optuna


def lora_hparam_search(trial):
    """
    Search for optimal LoRA hyperparameters.

    Key search dimensions:
    - rank: 4 to 64 (adapter capacity)
    - alpha: rank to 4*rank (scaling factor)
    - learning rate: 5e-5 to 5e-4 (adapter-appropriate range)
    - target modules: which attention matrices to adapt
    """
    rank = trial.suggest_categorical("rank", [4, 8, 16, 32, 64])
    # Optuna requires a fixed choice set per parameter, so search the
    # alpha/rank ratio rather than alpha itself (alpha ∈ {rank, 2·rank, 4·rank})
    alpha = rank * trial.suggest_categorical("alpha_ratio", [1, 2, 4])
    lr = trial.suggest_float("learning_rate", 5e-5, 5e-4, log=True)

    target_modules = ["q_proj", "v_proj"]
    if trial.suggest_categorical("add_k_o", [True, False]):
        target_modules.extend(["k_proj", "o_proj"])
    if trial.suggest_categorical("add_ffn", [True, False]):
        target_modules.extend(["gate_proj", "up_proj", "down_proj"])

    lora_config = LoraConfig(
        r=rank,
        lora_alpha=alpha,
        target_modules=target_modules,
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )

    # Train and evaluate...
    model, tokenizer = setup_qlora_model()
    # setup_qlora_model() already attached a default rank-16 adapter;
    # drop it before applying this trial's configuration
    model = model.unload()
    model = get_peft_model(model, lora_config)

    # (training code using `lr` for learning rate)

    # Evaluate on held-out manga QA set
    eval_score = evaluate_manga_qa(model, tokenizer)
    return eval_score


# Run search
study = optuna.create_study(direction="maximize")
study.optimize(lora_hparam_search, n_trials=20)

# Typical best: rank=16, alpha=32, lr=2e-4, q+v only, no FFN

Group Discussion: Key Decision Points

Decision Point 1: Rank 8 vs 16 vs 32

Priya (ML Engineer): I trained adapters at ranks 4, 8, 16, 32, 64 on our 3K manga QA dataset:

| Rank | Trainable Params | Manga QA Score | MMLU (General) | Training Time | Adapter Size |
|---|---|---|---|---|---|
| 4 | 21M | 71.2% | 68.4% | 1.5 hr | 42MB |
| 8 | 42M | 75.8% | 68.2% | 2 hr | 84MB |
| 16 | 84M | 79.4% | 67.9% | 4 hr | 168MB |
| 32 | 168M | 80.1% | 66.8% | 8 hr | 336MB |
| 64 | 336M | 79.8% | 64.2% | 16 hr | 672MB |

Rank 16 hits the sweet spot: 79.4% manga QA score with minimal degradation on MMLU (general knowledge).

Marcus (Architect): The MMLU degradation at rank 32+ is concerning. That means the adapter is overriding useful general knowledge to memorize manga-specific patterns. At rank 64, the model actually gets worse on manga QA (79.8% < 80.1%) while losing 4% on general tasks — classic overfitting.

Aiko (Data Scientist): The SVD analysis confirms this. When I decompose the rank-32 adapter's $\mathbf{B}\mathbf{A}$ product, the top 16 singular values capture 93% of the energy. The remaining 16 dimensions are mostly noise — the adapter is spending capacity on memorizing training examples rather than learning transferable manga knowledge.

Jordan (MLOps): Rank 16 adapter is 168MB. We can store 100 versions in S3 for under $1/month. The 4-hour training time means we can iterate daily during development without blocking the GPU for other experiments.

Sam (PM): The quality jump from rank 8 (75.8%) to rank 16 (79.4%) is 3.6 points. From rank 16 to 32 it is only 0.7 points but doubles the training time. Clear diminishing returns.

Resolution: Rank 16 selected. Best quality-efficiency tradeoff. 93% of full fine-tuning information captured. Training time (4 hr) fits our daily iteration cycle. Adapter size (168MB) is trivial for storage and deployment.

Decision Point 2: Which Layers to Apply LoRA

Priya (ML Engineer): I tested different target module configurations on rank 16:

| Target Modules | Trainable Params | Manga QA | Training Time |
|---|---|---|---|
| Q only | 42M | 74.1% | 2 hr |
| Q + V | 84M | 79.4% | 4 hr |
| Q + K + V + O | 168M | 80.2% | 8 hr |
| Q + V + FFN (gate, up, down) | 420M | 80.8% | 18 hr |
| All (Q + K + V + O + FFN) | 504M | 80.6% | 22 hr |

Aiko (Data Scientist): The Q+V configuration is well-supported by theory. In attention, $Q$ determines "what to look for" and $V$ determines "what to extract". By adapting Q, we teach the model what manga-relevant features to attend to. By adapting V, we teach it what to extract from manga-specific context. K (keys) and O (output projection) add less for our task.

Marcus (Architect): Adding FFN adapters gives 1.4% improvement but 4.5× more training time. The cost-per-quality-point: 14 more hours × $4/hr = $56 / 1.4 points = $40/point. Under our $50 threshold but not by much, and the operational complexity of 5× more adapter parameters is not worth it for 1.4%.

Jordan (MLOps): Q+V is also the most tested configuration in the literature. LoRA's original paper used Q+V. Most reproduction studies use Q+V. We benefit from the ecosystem's validation.

Resolution: Apply LoRA to Q and V projections only. Best parameters-to-quality ratio. Matches literature defaults. Adding K+O+FFN provides diminishing returns (0.8-1.4%) at significant training cost increase.

Decision Point 3: Self-Hosted vs Bedrock Custom Model Training

Marcus (Architect): We have two paths for LLM customization: (A) self-host Llama 3 70B with QLoRA on SageMaker, or (B) use Bedrock's model customization API with Claude.

| Approach | Quality Control | Latency | Monthly Cost | Operational Load |
|---|---|---|---|---|
| SageMaker QLoRA (Llama 3 70B) | Full control over rank, alpha, modules | 300-800ms | $2,400 (p4d.24xlarge) | High — manage GPU fleet |
| Bedrock Custom (Claude) | Limited — Bedrock controls hyperparams | 500-1500ms | $1,100 (API pricing) | Low — managed service |
| Bedrock base Claude (no tuning) | None — prompt engineering only | 500-1500ms | $800 | None |

Priya (ML Engineer): Bedrock's customization API is a black box. We provide training data as JSONL, and Bedrock handles everything. We cannot control rank, target modules, or training schedule. For production, this lack of visibility makes debugging regressions nearly impossible.
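
For reference, the Bedrock path looks roughly like this. A hedged sketch using boto3's create_model_customization_job; the job names, ARNs, and S3 URIs are placeholders, and which base models are customizable varies by region and model family:

import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

response = bedrock.create_model_customization_job(
    jobName="manga-assist-ft-001",           # placeholder
    customModelName="manga-assist-custom",   # placeholder
    roleArn="arn:aws:iam::123456789012:role/BedrockCustomizationRole",
    baseModelIdentifier="<customizable-base-model-id>",  # must be a model Bedrock allows for fine-tuning
    trainingDataConfig={"s3Uri": "s3://manga-assist-data/train.jsonl"},
    outputDataConfig={"s3Uri": "s3://manga-assist-data/output/"},
    # Only coarse knobs are exposed; no rank, alpha, or target modules
    hyperParameters={"epochCount": "3", "learningRate": "0.0002"},
)
print(response["jobArn"])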

Sam (PM): The cost difference is stark: $2,400/month for SageMaker vs $1,100/month for Bedrock API. But the SageMaker route requires Jordan to manage GPU instances, handle cold starts, implement model serving, and monitor GPU utilization. That is easily 20 hours/month of MLOps time.

Jordan (MLOps): I would recommend a hybrid: use Bedrock Claude base model with excellent prompt engineering for launch (cheapest, simplest). Run QLoRA experiments on SageMaker during development to understand what domain adaptation buys us. If the gap is significant (>5% quality), migrate to self-hosted. If small (<3%), stay on Bedrock.

Marcus (Architect): The hybrid approach also aligns with our CPQ framework. We only invest in self-hosting if the quality delta × user impact justifies the infrastructure cost.

Resolution: Phase 1 (MVP): Bedrock Claude base with prompt engineering. Phase 2: Run QLoRA experiments on Llama 3 70B to quantify the quality delta. Phase 3: Deploy self-hosted Llama 3 with QLoRA only if the quality delta exceeds 5% on our manga QA benchmark and the CPQ analysis justifies the cost.

Decision Point 4: TIES-Merging for Multi-Task Adapters

Priya (ML Engineer): If we train multiple LoRA adapters (manga knowledge, structured output, safety), can we merge them into one?

Aiko (Data Scientist): TIES-Merging (Yadav et al., 2023) addresses this. It merges multiple adapters in three steps (see the sketch below):
1. Trimming: Keep only the top-K% most changed parameters per adapter
2. Resolving sign conflicts: If two adapters push a parameter in opposite directions, take the majority vote
3. Merging: Average the surviving parameter deltas
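
A compact sketch of those three steps applied to per-task weight deltas (illustrative; the paper works with full task vectors and a tuned trim density):

import torch

def ties_merge(deltas: list[torch.Tensor], keep: float = 0.2) -> torch.Tensor:
    """Merge per-task weight deltas via TIES: trim, elect sign, average."""
    trimmed = []
    for d in deltas:
        # 1. Trim: keep only the top-`keep` fraction of entries by magnitude
        k = max(1, int(keep * d.numel()))
        threshold = d.abs().flatten().kthvalue(d.numel() - k + 1).values
        trimmed.append(torch.where(d.abs() >= threshold, d, torch.zeros_like(d)))

    stacked = torch.stack(trimmed)
    # 2. Elect sign: magnitude-weighted majority sign per parameter
    sign = torch.sign(stacked.sum(dim=0))
    # 3. Merge: average the surviving entries that agree with the elected sign
    agree = (torch.sign(stacked) == sign) & (stacked != 0)
    return (stacked * agree).sum(dim=0) / agree.sum(dim=0).clamp(min=1)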

I tested with two adapters (manga_knowledge + structured_output):

| Approach | Manga QA | Format Compliance | Combined |
|---|---|---|---|
| Manga adapter only | 79.4% | 68.2% | 73.8% |
| Format adapter only | 62.1% | 91.5% | 76.8% |
| Sequential (manga then format) | 76.2% | 85.3% | 80.8% |
| TIES-Merged | 77.8% | 88.1% | 83.0% |

TIES merging preserves more of each adapter's specialty than sequential application.

Jordan (MLOps): But this adds another dimension to our validation gate. Now we need to validate the merged model on both tasks simultaneously. If manga QA degrades below 75% post-merge, we reject and debug which adapter is interfering.

Resolution: Explore TIES merging in Phase 3 when we have multiple stable adapters. For MVP, single adapter (manga knowledge) is sufficient. The merging research validates the adapter approach as composable and extensible.


Research Paper References

1. LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021)

Key contribution: Demonstrated that task-specific updates to large pre-trained models have low intrinsic rank. By decomposing weight updates into $\mathbf{B}\mathbf{A}$ with small $r$, LoRA achieves quality comparable to full fine-tuning at a fraction of the compute. Key insight: LoRA adapters can be merged back into the base model, adding zero inference overhead.

Relevance to MangaAssist: Foundation of our LLM customization strategy. The rank-16 configuration gives us 79.4% manga QA accuracy (vs ~81% for full fine-tuning) at 1/1800th the cost. The merge capability means we deploy a standard model checkpoint with no LoRA-specific inference code.

2. QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023)

Key contribution: Combined 4-bit NormalFloat quantization with LoRA to fine-tune 65B parameter models on a single 48GB GPU. Introduced NF4 (optimal for normally distributed weights), double quantization (quantize the quantization constants), and paged optimizers (offload optimizer state to CPU when OOM).

Relevance to MangaAssist: QLoRA makes our 70B model experiments feasible on SageMaker ml.g5.2xlarge (A10G, 24GB) instead of requiring ml.p4d.24xlarge (8× A100). This reduces experiment cost from ~$110/hour to ~$6/hour — enabling rapid iteration during the R&D phase.

3. TIES-Merging: Resolving Interference When Merging Models (Yadav et al., 2023)

Key contribution: Identified that naively averaging model weights causes interference — parameters pushed in opposite directions by different tasks partially cancel out. TIES (TrIm, Elect Sign, and merge) resolves this by keeping only the most important changes, resolving sign conflicts via majority vote, and then averaging.

Relevance to MangaAssist: Once we have multiple LoRA adapters (manga knowledge, structured output, safety), TIES merging gives us a principled way to combine them. Our ablation shows TIES preserves 98% of each adapter's specialty vs 93% for naive averaging.

4. Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning (Lialin et al., 2023)

Key contribution: Comprehensive survey of 40+ parameter-efficient methods. Key finding: LoRA consistently outperforms other PEFT methods (adapters, prefix tuning, prompt tuning) on large models (>7B parameters) across diverse tasks. For smaller models (<1B), the differences are minimal.

Relevance to MangaAssist: Validates our choice of LoRA over prefix tuning or adapter layers for the 70B model. For our smaller models (DistilBERT 66M, MiniLM 33M), the paper suggests full fine-tuning is still appropriate — consistent with our approach in docs 01 and 03.


Production Deployment and Monitoring

Deployment Architecture

graph LR
    subgraph "Development (Phase 2)"
        A["Manga QA<br>Dataset (3K)"] --> B["QLoRA Training<br>SageMaker g5.2xlarge<br>(6 hr, $36)"]
        B --> C["Adapter Artifact<br>S3 (168MB)"]
        C --> D["Merge + Validate<br>on Manga QA<br>and MMLU"]
    end

    subgraph "Production (Phase 3, if justified)"
        D --> E{"Quality Delta > 5%?"}
        E -->|Yes| F["Deploy Merged Model<br>SageMaker Endpoint<br>(g5.12xlarge)"]
        E -->|No| G["Stay on Bedrock<br>Claude Base +<br>Prompt Engineering"]
    end

    subgraph "Inference"
        F --> H["Self-hosted Llama 3<br>Manga-adapted<br>300-800ms"]
        G --> I["Bedrock Claude<br>Prompt-engineered<br>500-1500ms"]
    end

Key Metrics for LoRA Custom Model

| Metric | Target | Alert Threshold |
|---|---|---|
| Manga QA accuracy | ≥ 78% | < 75% |
| MMLU general knowledge | ≥ 66% | < 64% (base = 68.5%) |
| Factual accuracy (manga dates, authors) | ≥ 90% | < 85% |
| Structured output compliance | ≥ 85% | < 80% |
| Training loss convergence | < 0.5 by epoch 3 | > 1.0 at epoch 3 |
| Adapter size | 168MB (rank 16) | > 500MB (rank too high) |
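
These thresholds can be wired directly into the post-training validation gate; a minimal sketch (metric names and the surrounding evaluation harness are assumptions):

# Alert thresholds from the table above; metric names are illustrative
THRESHOLDS = {
    "manga_qa_accuracy": 0.75,
    "mmlu_accuracy": 0.64,
    "factual_accuracy": 0.85,
    "structured_output_compliance": 0.80,
}

def validation_gate(metrics: dict) -> bool:
    """Reject the adapter if any tracked metric falls below its alert threshold."""
    failures = {name: value for name, value in metrics.items()
                if name in THRESHOLDS and value < THRESHOLDS[name]}
    for name, value in failures.items():
        print(f"GATE FAIL: {name}={value:.3f} < {THRESHOLDS[name]:.2f}")
    return not failures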