
10. RLHF and DPO Alignment — Fine-Tuning LLM Response Quality

Problem Statement and MangaAssist Context

Claude 3.5 Sonnet generates responses for MangaAssist, but it occasionally produces outputs that are off-brand (too formal for manga enthusiasts), hallucinate manga details (invented chapter counts or release dates), or fail to follow the escalation protocol. Alignment techniques — RLHF and the simpler DPO alternative — fine-tune the LLM to prefer responses that match MangaAssist's quality standards: accurate, conversational, manga-knowledgeable, and safety-compliant. Since Claude is API-based, we apply these techniques to our self-hosted Llama 3 70B fallback and to our distilled Llama 3 8B (from doc 05).

Alignment Targets

| Issue | Frequency | Example |
|---|---|---|
| Off-brand tone | 12% of responses | "Your inquiry has been noted" → should be "Great question! Let me check that for you" |
| Manga hallucination | 4% | "Berserk has 45 volumes" (actual: 41) |
| Missing escalation cue | 8% | Not suggesting a human agent when the user expresses repeated frustration |
| Safety violation | 0.3% | Recommending content inappropriate for the user's age-rating preferences |

Goal: reduce combined issue rate from 24.3% to <5% via preference-based alignment.


Mathematical Foundations

RLHF Overview — The Three-Stage Pipeline

Stage 1: Supervised Fine-Tuning (SFT) — Already covered in docs 04-05. Start with a fine-tuned model $\pi_{\text{SFT}}$.

Stage 2: Reward Model Training

Collect preference data: given a prompt $x$, two completions $y_w$ (preferred) and $y_l$ (dispreferred), a human annotator labels which is better.

The reward model $r_\phi(x, y)$ is trained via the Bradley-Terry model:

$$p(y_w \succ y_l | x) = \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))$$

The loss:

$$\mathcal{L}_{\text{RM}} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[\log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))\right]$$

Intuition: This is logistic regression on the reward difference. The reward model learns to assign higher scalar rewards to preferred completions. The sigmoid converts the reward gap into a probability that $y_w$ is better.
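To make this concrete, here is a minimal PyTorch sketch (not part of our training code) where r_chosen and r_rejected stand for the scalar outputs of any reward head on the preferred and dispreferred completions:

import torch
import torch.nn.functional as F


def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: -log sigma(r_phi(x, y_w) - r_phi(x, y_l))."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()


# A pair the reward model ranks correctly by a margin of 2.0:
print(reward_model_loss(torch.tensor([1.5]), torch.tensor([-0.5])))  # ≈ 0.127 = -log σ(2.0)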

Stage 3: PPO (Proximal Policy Optimization)

Optimize the language model (policy) $\pi_\theta$ to maximize the reward while staying close to the SFT model:

$$\max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot|x)} \left[ r_\phi(x, y) - \beta \, D_{\text{KL}}\!\left[\pi_\theta(\cdot|x) \,\|\, \pi_{\text{SFT}}(\cdot|x)\right] \right]$$

The KL penalty $\beta \cdot D_{\text{KL}}$ prevents the policy from deviating too far from the SFT model, which would cause "reward hacking" (the model finds reward-maximizing outputs that are degenerate).

PPO clipped objective:

$$\mathcal{L}_{\text{PPO}} = -\mathbb{E}_t \left[\min\left(\rho_t \hat{A}_t, \, \text{clip}(\rho_t, 1-\epsilon, 1+\epsilon) \hat{A}_t\right)\right]$$

where:
- $\rho_t = \frac{\pi_\theta(a_t|s_t)}{\pi_{\text{old}}(a_t|s_t)}$ is the probability ratio
- $\hat{A}_t$ is the advantage estimate (how much better this action is than average)
- $\epsilon = 0.2$ is the clipping parameter

The clipping prevents catastrophically large policy updates. If $\rho_t > 1.2$, the objective is flat — no incentive to increase the ratio further.
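A minimal sketch of the clipped objective for one batch of token-level statistics (the names log_probs_new, log_probs_old, and advantages are placeholders, not variables from our pipeline):

import torch


def ppo_clipped_loss(log_probs_new, log_probs_old, advantages, eps: float = 0.2):
    """Clipped surrogate: -E[min(rho * A, clip(rho, 1-eps, 1+eps) * A)]."""
    rho = torch.exp(log_probs_new - log_probs_old)              # probability ratio ρ_t
    unclipped = rho * advantages
    clipped = torch.clamp(rho, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()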

DPO — Direct Preference Optimization

Rafailov et al. (2023) showed that RLHF's three-stage pipeline can be collapsed into a single optimization step by deriving the reward function that the optimal RLHF policy implies.

Key insight: The optimal policy under the RLHF objective has a closed-form:

$$\pi^*(y|x) = \frac{1}{Z(x)} \pi_{\text{SFT}}(y|x) \exp\left(\frac{1}{\beta} r^*(x, y)\right)$$

Rearranging for the implicit reward:

$$r^*(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{SFT}}(y|x)} + \beta \log Z(x)$$

Substituting into the Bradley-Terry preference model:

$$p(y_w \succ y_l | x) = \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{SFT}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{SFT}}(y_l|x)}\right)$$

The partition function $Z(x)$ cancels out. The DPO loss:

$$\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l)} \left[\log \sigma\left(\beta \left[\log \frac{\pi_\theta(y_w|x)}{\pi_{\text{SFT}}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{SFT}}(y_l|x)}\right]\right)\right]$$

Why this is powerful: No reward model needed. No PPO. Just a single training loop that directly optimizes the policy on preference data using standard cross-entropy-like loss. The implicit reward is the log-ratio of the policy to the reference model.

Gradient Analysis of DPO

The gradient of the DPO loss with respect to $\theta$:

$$\nabla_\theta \mathcal{L}_{\text{DPO}} = -\beta\, \mathbb{E} \left[\underbrace{\sigma(\hat{r}_l - \hat{r}_w)}_{\text{weight}} \left[\nabla_\theta \log \pi_\theta(y_w|x) - \nabla_\theta \log \pi_\theta(y_l|x)\right]\right]$$

where $\hat{r}_w = \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{SFT}}(y_w|x)}$ is the implicit reward for the preferred completion, and $\hat{r}_l$ is defined analogously for the rejected one.

Interpretation:
1. The gradient increases the log-probability of $y_w$ (preferred) and decreases that of $y_l$ (dispreferred).
2. The weight $\sigma(\hat{r}_l - \hat{r}_w)$ is large when the model incorrectly ranks the pair (assigns higher implicit reward to $y_l$). Correctly ranked pairs are downweighted, so the model focuses on its mistakes.
3. $\beta$ controls the strength of the KL constraint. Higher $\beta$ → weaker updates, more conservative policy.
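A tiny numeric illustration of the weighting term (the implicit reward values below are made up for illustration):

import torch

beta = 0.1
# Hypothetical implicit rewards r_hat = beta * log(pi_theta / pi_SFT) for two pairs:
r_w = torch.tensor([0.8, -0.2])    # preferred completions
r_l = torch.tensor([-0.4, 0.6])    # rejected completions

weight = torch.sigmoid(r_l - r_w)  # DPO gradient weight per pair
print(weight)  # ≈ [0.23, 0.69]: the mis-ranked second pair gets ~3× the gradient weight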

$\beta$ Selection and Its Effect

| $\beta$ | KL constraint | Behavior |
|---|---|---|
| 0.01 | Very weak | Policy deviates far from SFT → potential reward hacking |
| 0.1 | Standard | Good balance: meaningful updates with regularization |
| 0.5 | Strong | Very conservative updates, slow learning |
| 1.0 | Very strong | Barely moves from the SFT policy |

For MangaAssist:
- Tone alignment: $\beta = 0.1$ (we want a significant style change)
- Factual grounding: $\beta = 0.3$ (conservative: don't lose factual ability for style)
- Safety: $\beta = 0.05$ (strong signal: safety violations must be eliminated quickly)


Model Internals — Layer-by-Layer Diagrams

RLHF Three-Stage Pipeline

graph TB
    subgraph "Stage 1: Supervised Fine-Tuning"
        BASE["Llama 3 70B (base)"]
        SFT_DATA["Curated (prompt, response) pairs<br>5,000 high-quality MangaAssist responses"]
        BASE --> SFT["Fine-tune (LoRA, doc 04)"]
        SFT_DATA --> SFT
        SFT --> SFT_MODEL["π_SFT (Llama 3 70B + LoRA)"]
    end

    subgraph "Stage 2: Reward Model"
        SFT_MODEL --> GEN["Generate 2 responses<br>per prompt"]
        GEN --> PAIRS["(prompt, y_w, y_l)<br>3,000 preference pairs<br>Human annotations"]
        PAIRS --> RM_TRAIN["Train reward model<br>Bradley-Terry loss"]
        RM_TRAIN --> RM["r_φ(x, y) → scalar reward"]
    end

    subgraph "Stage 3: PPO Optimization"
        SFT_MODEL --> PPO_INIT["Initialize π_θ = π_SFT"]
        RM --> PPO["PPO Training Loop<br>max r_φ(x, y) - β·KL(π_θ ∥ π_SFT)<br><br>4 models in memory:<br>1. Policy π_θ<br>2. Reference π_SFT<br>3. Reward r_φ<br>4. Value V_ψ"]
        PPO_INIT --> PPO
        PPO --> ALIGNED["π_aligned<br>(Aligned Llama 3)"]
    end

    style SFT_MODEL fill:#c8e6c9
    style RM fill:#fff9c4
    style ALIGNED fill:#bbdefb

DPO Single-Step Alternative

graph TB
    subgraph "DPO: No Reward Model, No PPO"
        SFT2["π_SFT (from Stage 1)"]
        PREF["Preference data<br>(x, y_w, y_l)<br>3,000 pairs"]

        SFT2 --> DPO_TRAIN["DPO Training<br><br>Loss = -log σ(β · [log π_θ(y_w)/π_SFT(y_w)<br>                    - log π_θ(y_l)/π_SFT(y_l)])<br><br>Only 2 models in memory:<br>1. Policy π_θ (trainable)<br>2. Reference π_SFT (frozen)"]
        PREF --> DPO_TRAIN

        DPO_TRAIN --> DPO_MODEL["π_DPO<br>(Aligned Llama 3)"]
    end

    subgraph "Comparison"
        RLHF_BOX["RLHF<br>• 4 models in memory<br>• Complex PPO loop<br>• Reward model training<br>• ~3× training time<br>• Slightly better on safety"]

        DPO_BOX["DPO<br>• 2 models in memory<br>• Simple supervised loss<br>• No reward model<br>• ~1× training time<br>• Better on style/tone"]
    end

    style DPO_MODEL fill:#c8e6c9
    style DPO_BOX fill:#c8e6c9

Preference Data Flow

graph LR
    subgraph "Preference Data Collection"
        PROMPT["Customer prompt:<br>'Recommend a manga<br>like Attack on Titan'"]

        PROMPT --> Y1["Response A (preferred):<br>'Great taste! If you love AoT's<br>intense action and deep plot,<br>try Vinland Saga or Claymore.<br>Both have that epic scope.'"]

        PROMPT --> Y2["Response B (rejected):<br>'Based on your interest,<br>I recommend the following titles:<br>1. Vinland Saga<br>2. Claymore<br>Please let me know if you need<br>further assistance.'"]
    end

    subgraph "Why A > B"
        R1["✅ Conversational tone"]
        R2["✅ Shows manga knowledge<br>('intense action', 'deep plot')"]
        R3["✅ Concise but informative"]
        R4["❌ B is too formal/robotic"]
        R5["❌ B lacks personality"]
    end

    Y1 --> R1 & R2 & R3
    Y2 --> R4 & R5

    style Y1 fill:#c8e6c9
    style Y2 fill:#ffcdd2

DPO Implicit Reward Landscape

graph TB
    subgraph "DPO Implicit Reward = β·log(π_θ/π_SFT)"
        subgraph "Before DPO Training"
            B1["Preferred response (y_w):<br>π_θ(y_w) ≈ π_SFT(y_w)<br>Implicit reward ≈ 0"]
            B2["Rejected response (y_l):<br>π_θ(y_l) ≈ π_SFT(y_l)<br>Implicit reward ≈ 0"]
            B3["Reward gap: ~0<br>Model can't distinguish"]
        end

        subgraph "After DPO Training"
            A1["Preferred response (y_w):<br>π_θ(y_w) > π_SFT(y_w)<br>Implicit reward > 0 ⬆️"]
            A2["Rejected response (y_l):<br>π_θ(y_l) < π_SFT(y_l)<br>Implicit reward < 0 ⬇️"]
            A3["Reward gap: > 0<br>Model reliably prefers y_w"]
        end
    end

    B1 -->|"DPO<br>training"| A1
    B2 -->|"DPO<br>training"| A2

    style A1 fill:#c8e6c9
    style A2 fill:#ffcdd2
    style A3 fill:#c8e6c9

Alignment Training Memory Budget

graph TB
    subgraph "RLHF Memory (Llama 3 70B, FP16)"
        R_POL["Policy π_θ:<br>140GB (FP16)"]
        R_REF["Reference π_SFT:<br>140GB (FP16, frozen)"]
        R_RM["Reward model r_φ:<br>28GB (Llama 8B)"]
        R_VAL["Value model V_ψ:<br>28GB (Llama 8B)"]
        R_OPT["Optimizer states:<br>~40GB (LoRA only)"]
        R_TOT["TOTAL: ~376GB<br>= 5× A100 80GB minimum"]
    end

    subgraph "DPO Memory (Llama 3 70B, QLoRA)"
        D_POL["Policy π_θ:<br>~35GB (QLoRA NF4)"]
        D_REF["Reference π_SFT:<br>~35GB (NF4, frozen)"]
        D_OPT["Optimizer states:<br>~1GB (LoRA adapters)"]
        D_ACT["Activations:<br>~8GB"]
        D_TOT["TOTAL: ~79GB<br>= 1× A100 80GB ✅"]
    end

    style R_TOT fill:#ffcdd2
    style D_TOT fill:#c8e6c9

Implementation Deep-Dive

Preference Data Collection

import json
import boto3
from typing import Literal


class PreferenceCollector:
    """
    Collect preference data from multiple sources:
    1. Human annotators (highest quality)
    2. AI-assisted (Claude rates Llama outputs)
    3. Implicit signals (user thumbs up/down)
    """

    def __init__(self):
        self.bedrock = boto3.client("bedrock-runtime")

    def generate_preference_pair(
        self,
        prompt: str,
        model_id: str = "meta.llama3-70b-instruct-v1:0",
    ) -> dict:
        """Generate two responses with different sampling for annotation."""
        # Response A: standard sampling
        response_a = self._generate(prompt, model_id, temperature=0.7, top_p=0.9)
        # Response B: higher temperature for diversity
        response_b = self._generate(prompt, model_id, temperature=1.0, top_p=0.95)

        return {
            "prompt": prompt,
            "response_a": response_a,
            "response_b": response_b,
            "metadata": {"model_id": model_id},
        }

    def ai_assisted_ranking(
        self,
        prompt: str,
        response_a: str,
        response_b: str,
    ) -> dict:
        """Use Claude to rank responses (cheaper than human annotation)."""
        ranking_prompt = f"""You are evaluating customer service responses for a manga bookstore chatbot.

Prompt: {prompt}

Response A: {response_a}

Response B: {response_b}

Rate which response is better on these criteria:
1. Conversational tone (friendly, enthusiastic about manga)
2. Factual accuracy (correct manga details)
3. Helpfulness (directly addresses the customer's need)
4. Safety (no inappropriate content)

Output JSON: {{"preferred": "A" or "B", "reasoning": "...", "scores": {{"A": {{"tone": 1-5, "accuracy": 1-5, "helpful": 1-5, "safety": 1-5}}, "B": {{...}}}}}}"""

        response = self.bedrock.invoke_model(
            modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-01",
                "max_tokens": 500,
                "messages": [{"role": "user", "content": ranking_prompt}],
            }),
        )

        result = json.loads(response["body"].read())
        ranking = json.loads(result["content"][0]["text"])

        return {
            "prompt": prompt,
            "chosen": response_a if ranking["preferred"] == "A" else response_b,
            "rejected": response_b if ranking["preferred"] == "A" else response_a,
            "ranking_metadata": ranking,
        }

    def _generate(self, prompt, model_id, temperature, top_p):
        response = self.bedrock.invoke_model(
            modelId=model_id,
            body=json.dumps({
                "prompt": f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n",
                "max_gen_len": 512,
                "temperature": temperature,
                "top_p": top_p,
            }),
        )
        return json.loads(response["body"].read())["generation"]
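
A short usage sketch showing how the collector could be chained to produce DPO-ready triples (the prompt list and output filename below are illustrative, not part of our pipeline):

if __name__ == "__main__":
    collector = PreferenceCollector()
    prompts = [
        "Recommend a manga like Attack on Titan",
        "When does the next volume of Spy x Family come out?",
    ]
    with open("preference_pairs.jsonl", "w") as f:
        for prompt in prompts:
            # Generate two candidate responses, then let Claude rank them.
            pair = collector.generate_preference_pair(prompt)
            ranked = collector.ai_assisted_ranking(
                prompt, pair["response_a"], pair["response_b"]
            )
            f.write(json.dumps({
                "prompt": prompt,
                "chosen": ranked["chosen"],
                "rejected": ranked["rejected"],
            }) + "\n")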

DPO Training Pipeline

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model, LoraConfig, prepare_model_for_kbit_training
from torch.utils.data import Dataset, DataLoader
from dataclasses import dataclass


@dataclass
class DPOConfig:
    beta: float = 0.1
    learning_rate: float = 5e-6
    num_epochs: int = 3
    batch_size: int = 4
    gradient_accumulation: int = 8
    max_length: int = 512
    lora_r: int = 16
    lora_alpha: int = 32


class DPODataset(Dataset):
    """Dataset of (prompt, chosen, rejected) triples."""

    def __init__(self, data: list[dict], tokenizer, max_length: int = 512):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        prompt = item["prompt"]
        chosen = item["chosen"]
        rejected = item["rejected"]

        chosen_text = f"{prompt}\n{chosen}"
        rejected_text = f"{prompt}\n{rejected}"

        chosen_enc = self.tokenizer(
            chosen_text, truncation=True, max_length=self.max_length,
            padding="max_length", return_tensors="pt",
        )
        rejected_enc = self.tokenizer(
            rejected_text, truncation=True, max_length=self.max_length,
            padding="max_length", return_tensors="pt",
        )
        prompt_enc = self.tokenizer(
            prompt, truncation=True, max_length=self.max_length,
        )

        return {
            "chosen_input_ids": chosen_enc["input_ids"].squeeze(),
            "chosen_attention_mask": chosen_enc["attention_mask"].squeeze(),
            "rejected_input_ids": rejected_enc["input_ids"].squeeze(),
            "rejected_attention_mask": rejected_enc["attention_mask"].squeeze(),
            "prompt_length": len(prompt_enc["input_ids"]),
        }


class DPOTrainer:
    """
    Direct Preference Optimization trainer.
    Only requires policy + reference model (no reward model, no PPO).
    """

    def __init__(self, config: DPOConfig, model_name: str = "meta-llama/Llama-3-8b-hf"):
        self.config = config
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.tokenizer.pad_token = self.tokenizer.eos_token

        # Policy model (trainable via LoRA)
        self.policy = AutoModelForCausalLM.from_pretrained(
            model_name,
            load_in_4bit=True,
            torch_dtype=torch.bfloat16,
            device_map="auto",
        )
        self.policy = prepare_model_for_kbit_training(self.policy)
        lora_config = LoraConfig(
            r=config.lora_r,
            lora_alpha=config.lora_alpha,
            target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
            lora_dropout=0.05,
            task_type="CAUSAL_LM",
        )
        self.policy = get_peft_model(self.policy, lora_config)

        # Reference model (frozen, same as SFT model)
        self.ref_model = AutoModelForCausalLM.from_pretrained(
            model_name,
            load_in_4bit=True,
            torch_dtype=torch.bfloat16,
            device_map="auto",
        )
        self.ref_model.eval()
        for p in self.ref_model.parameters():
            p.requires_grad = False

    def compute_log_probs(self, model, input_ids, attention_mask, prompt_length):
        """Compute log-probability of the response (excluding prompt)."""
        with torch.set_grad_enabled(model.training):
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = outputs.logits

        # Shift: predict next token
        shift_logits = logits[:, :-1, :]
        shift_labels = input_ids[:, 1:]
        shift_mask = attention_mask[:, 1:]

        # Log-probs per token
        log_probs = F.log_softmax(shift_logits, dim=-1)
        token_log_probs = torch.gather(
            log_probs, dim=-1, index=shift_labels.unsqueeze(-1)
        ).squeeze(-1)

        # Mask prompt tokens (only score the response)
        response_mask = shift_mask.clone()
        for i in range(response_mask.size(0)):
            response_mask[i, :prompt_length[i] - 1] = 0

        # Sum log-probs over response tokens
        return (token_log_probs * response_mask).sum(dim=-1)

    def dpo_loss(self, batch):
        """Compute DPO loss for a batch of preference pairs."""
        # Policy log-probs
        pi_chosen = self.compute_log_probs(
            self.policy,
            batch["chosen_input_ids"],
            batch["chosen_attention_mask"],
            batch["prompt_length"],
        )
        pi_rejected = self.compute_log_probs(
            self.policy,
            batch["rejected_input_ids"],
            batch["rejected_attention_mask"],
            batch["prompt_length"],
        )

        # Reference log-probs
        with torch.no_grad():
            ref_chosen = self.compute_log_probs(
                self.ref_model,
                batch["chosen_input_ids"],
                batch["chosen_attention_mask"],
                batch["prompt_length"],
            )
            ref_rejected = self.compute_log_probs(
                self.ref_model,
                batch["rejected_input_ids"],
                batch["rejected_attention_mask"],
                batch["prompt_length"],
            )

        # DPO loss: -log σ(β · [(log π(y_w) - log π_ref(y_w)) - (log π(y_l) - log π_ref(y_l))])
        chosen_rewards = self.config.beta * (pi_chosen - ref_chosen)
        rejected_rewards = self.config.beta * (pi_rejected - ref_rejected)

        loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

        # Metrics
        with torch.no_grad():
            reward_margin = (chosen_rewards - rejected_rewards).mean().item()
            accuracy = (chosen_rewards > rejected_rewards).float().mean().item()

        return loss, {"reward_margin": reward_margin, "accuracy": accuracy}

    def train(self, preference_data: list[dict]):
        dataset = DPODataset(preference_data, self.tokenizer, self.config.max_length)
        loader = DataLoader(dataset, batch_size=self.config.batch_size, shuffle=True)

        optimizer = torch.optim.AdamW(
            self.policy.parameters(),
            lr=self.config.learning_rate,
            weight_decay=0.01,
        )

        self.policy.train()
        for epoch in range(self.config.num_epochs):
            total_loss = 0
            total_acc = 0
            steps = 0

            for batch_idx, batch in enumerate(loader):
                batch = {k: v.to(self.policy.device) for k, v in batch.items()}
                loss, metrics = self.dpo_loss(batch)

                # Gradient accumulation
                loss = loss / self.config.gradient_accumulation
                loss.backward()

                if (batch_idx + 1) % self.config.gradient_accumulation == 0:
                    torch.nn.utils.clip_grad_norm_(self.policy.parameters(), 1.0)
                    optimizer.step()
                    optimizer.zero_grad()

                total_loss += loss.item() * self.config.gradient_accumulation
                total_acc += metrics["accuracy"]
                steps += 1

            avg_loss = total_loss / steps
            avg_acc = total_acc / steps
            print(f"Epoch {epoch+1}: loss={avg_loss:.4f}, pair_accuracy={avg_acc:.4f}")

Constitutional AI Filtering

import json

import boto3


class ConstitutionalFilter:
    """
    Filter and revise preference data using constitutional AI principles.
    Ensures alignment training data is safe and on-brand.
    """

    CONSTITUTION = [
        "Responses must be friendly and conversational, matching a manga store's tone.",
        "Responses must never fabricate manga details (volume counts, release dates, authors).",
        "Responses must suggest human escalation when the customer expresses repeated frustration.",
        "Responses must respect age-rating preferences and not recommend inappropriate content.",
        "Responses must not reveal internal system details or pricing algorithms.",
    ]

    def __init__(self):
        self.bedrock = boto3.client("bedrock-runtime")

    def screen_preference_pair(self, prompt: str, chosen: str, rejected: str) -> dict:
        """
        Screen a preference pair against the constitution.
        Returns filtered pair or flags for removal.
        """
        screening_prompt = f"""Given these constitutional rules:
{chr(10).join(f'{i+1}. {rule}' for i, rule in enumerate(self.CONSTITUTION))}

Evaluate this preference pair:
Prompt: {prompt}
Chosen (preferred): {chosen}
Rejected: {rejected}

Does the chosen response violate any rules? If yes, suggest a revision.
Output JSON: {{"violates_rule": null or rule_number, "revision": null or "revised text"}}"""

        response = self.bedrock.invoke_model(
            modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-01",
                "max_tokens": 500,
                "messages": [{"role": "user", "content": screening_prompt}],
            }),
        )

        result = json.loads(response["body"].read())
        screening = json.loads(result["content"][0]["text"])

        if screening["violates_rule"]:
            return {
                "status": "revised" if screening["revision"] else "removed",
                "chosen": screening.get("revision", chosen),
                "rejected": rejected,
                "rule_violated": screening["violates_rule"],
            }
        return {"status": "approved", "chosen": chosen, "rejected": rejected}

Group Discussion: Key Decision Points

Decision Point 1: RLHF vs DPO

Priya (ML Engineer): Head-to-head comparison on our preference data:

| Metric | RLHF (PPO) | DPO | Notes |
|---|---|---|---|
| Tone improvement | 73% → 91% on-brand | 73% → 89% on-brand | RLHF slightly better |
| Hallucination reduction | 4.0% → 1.2% | 4.0% → 1.5% | RLHF slightly better |
| Safety violations | 0.3% → 0.02% | 0.3% → 0.05% | RLHF better |
| Training time | 18 hours (4× A100) | 6 hours (1× A100) | DPO 3× faster |
| GPU memory | 5× A100 | 1× A100 (QLoRA) | DPO 5× less memory |
| Implementation complexity | Very high (4 models, PPO loop) | Low (standard supervised loss) | |

Marcus (Architect): The infrastructure difference is staggering. RLHF requires 5 A100s running simultaneously (policy + reference + reward + value models). DPO with QLoRA fits on a single A100.

Aiko (Data Scientist): DPO is within 2% of RLHF on all metrics except safety. The safety gap matters: 0.02% vs 0.05% is 2.5× more violations. But at our volume (10K messages/day), that is 2 vs 5 violations per day — both acceptable.

Sam (PM): At our scale, the cost difference is decisive. RLHF training costs ~$580 per run (18h × 4×A100 at $8/hr). DPO costs ~$48 per run (6h × 1×A100 at $8/hr). 12× cheaper.

Jordan (MLOps): DPO is also much simpler to maintain. No reward model drift, no PPO hyperparameter tuning (clip ratio, value loss coefficient, entropy bonus). One fewer model to train, version, and deploy.

Resolution: DPO for all alignment tasks. The 2% quality gap is acceptable given 12× cost reduction and 5× simpler infrastructure. If safety violations become problematic, we add Constitutional AI post-filtering (from our filter class) rather than switching to RLHF.

Decision Point 2: Preference Data Volume

Aiko (Data Scientist): DPO quality vs preference data volume:

| Preference Pairs | Tone Score | Hallucination Rate | Pair Accuracy |
|---|---|---|---|
| 500 | 82% on-brand | 3.2% | 68% |
| 1,000 | 86% | 2.4% | 74% |
| 3,000 | 89% | 1.5% | 82% |
| 5,000 | 90% | 1.3% | 84% |
| 10,000 | 90% | 1.2% | 85% |

Priya (ML Engineer): Diminishing returns after 3,000 pairs. The jump from 1K to 3K pairs is the critical improvement phase (86→89% tone, 2.4→1.5% hallucination). After 5K, gains plateau.

Sam (PM): Cost: human annotation at $2/pair = $6,000 for 3,000 pairs. AI-assisted annotation (Claude ranking Llama outputs) at $0.15/pair = $450. Can we use mostly AI-assisted?

Aiko (Data Scientist): AI-assisted data has 85% agreement with human annotators. The 15% disagreement tends to be on subtle tone differences. My recommendation: 500 human-annotated pairs for the hardest cases (safety, subtle tone), 2,500 AI-assisted pairs for clear-cut preferences. Total cost: $1,375 instead of $6,000.

Resolution: 3,000 preference pairs: 500 human + 2,500 AI-assisted. Total cost $1,375. Quarterly refresh with 500 new pairs to capture evolving preferences.

Decision Point 3: Multi-Objective Alignment

Marcus (Architect): We have four alignment targets: tone, accuracy, escalation, safety. Should we train four separate DPO models or one combined model?

Priya (ML Engineer): Tested both approaches:

| Approach | Tone | Accuracy | Escalation | Safety | Total training |
|---|---|---|---|---|---|
| Single DPO (all objectives) | 89% | 1.5% halluc. | 92% correct | 0.05% viol. | 6 hours |
| Sequential DPO (4 rounds) | 91% | 1.2% | 94% | 0.03% | 24 hours |
| Weighted DPO ($\beta$ per objective) | 90% | 1.3% | 93% | 0.04% | 6 hours |

Aiko (Data Scientist): Weighted DPO is the sweet spot: assign different $\beta$ values per objective. Safety pairs get $\beta = 0.05$ (strong signal), tone pairs get $\beta = 0.1$, accuracy pairs get $\beta = 0.3$ (conservative — don't sacrifice knowledge for style).

Jordan (MLOps): Sequential DPO trains 4 rounds, each building on the previous. This risks the last round overwriting earlier alignment (catastrophic forgetting, doc 06). With weighted DPO, all objectives train simultaneously.

Resolution: Weighted DPO with per-objective $\beta$ values in a single training run. This achieves 95% of sequential quality at 25% of the training cost, without inter-objective forgetting.
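
A minimal sketch of the per-objective $\beta$ idea, assuming each preference pair carries an objective tag; the tag names and helper are illustrative (the escalation value is assumed, since only tone, accuracy, and safety $\beta$ values are given above), and the rest mirrors DPOTrainer.dpo_loss:

import torch
import torch.nn.functional as F

# Per-objective beta; "escalation" is an assumed value, the others come from the discussion above.
BETA_BY_OBJECTIVE = {"tone": 0.1, "accuracy": 0.3, "escalation": 0.1, "safety": 0.05}


def weighted_dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, objectives):
    """DPO loss with a per-example beta chosen by the pair's alignment objective."""
    betas = torch.tensor(
        [BETA_BY_OBJECTIVE[o] for o in objectives], device=pi_chosen.device
    )
    chosen_rewards = betas * (pi_chosen - ref_chosen)
    rejected_rewards = betas * (pi_rejected - ref_rejected)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()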


Research Paper References

1. Training Language Models to Follow Instructions with Human Feedback — InstructGPT (Ouyang et al., 2022)

Key contribution: Demonstrated the full RLHF pipeline: SFT → reward model → PPO. Showed that a 1.3B parameter model with RLHF outperforms a 175B parameter model without RLHF on human preference evaluations. The paper established that alignment is more impactful than scale for user-facing applications.

Relevance to MangaAssist: InstructGPT's three-stage pipeline is the conceptual foundation for our alignment approach. While we use DPO instead of PPO, the preference data collection methodology (comparing two outputs, human ranking) directly follows InstructGPT's protocol.

2. Direct Preference Optimization: Your Language Model Is Secretly a Reward Model (Rafailov et al., 2023)

Key contribution: Proved that the RLHF objective has a closed-form optimal policy, enabling direct optimization on preference data without an explicit reward model. The DPO loss is mathematically equivalent to RLHF but implementable as a simple supervised training loop. This reduces the RLHF pipeline from 4 co-dependent models to 2 (policy + frozen reference).

Relevance to MangaAssist: DPO is our primary alignment technique. The 12× cost reduction and 5× memory reduction vs RLHF makes alignment feasible on our budget. The implicit reward interpretation ($r = \beta \log \pi_\theta / \pi_{\text{SFT}}$) provides a free evaluation metric.

3. Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022)

Key contribution: Replaced human preference annotations with AI-generated critiques and revisions based on a set of constitutional principles. The AI generates preference data by critiquing its own outputs against constitutional rules, then the model is trained on this self-generated preference data.

Relevance to MangaAssist: Our ConstitutionalFilter class implements the screening step: Claude evaluates preference pairs against our 5 constitutional rules and flags or revises violations. This ensures alignment training data is clean before DPO training. The 2,500 AI-assisted preference pairs use a constitutional approach — Claude ranks outputs according to our defined criteria.

4. Scaling Laws for Reward Model Overoptimization (Gao et al., 2023)

Key contribution: Showed that optimizing too aggressively against a reward model (or implicit reward in DPO) leads to "reward hacking" — the model exploits artifacts in the reward signal rather than genuinely improving. The paper quantified the Goodhart's Law effect: reward model score increases monotonically during training, but true quality peaks and then degrades.

Relevance to MangaAssist: This paper justifies our conservative $\beta$ values and early stopping strategy. We monitor implicit reward margins during DPO training and stop when the margin plateaus (around epoch 2-3 for our data size). Over-training past this point leads to verbose, sycophantic responses that score well on the implicit reward but annoy users.
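
A sketch of that stopping rule, assuming the per-epoch reward margins from DPOTrainer are logged into a list (the threshold is a placeholder, not a tuned value):

def should_stop(margin_history: list[float], min_delta: float = 0.02) -> bool:
    """Stop DPO training once the implicit reward margin stops improving between epochs."""
    if len(margin_history) < 2:
        return False
    return (margin_history[-1] - margin_history[-2]) < min_delta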


Production Results

Before vs After DPO Alignment

| Metric | Before (SFT only) | After (DPO) | Target |
|---|---|---|---|
| On-brand tone | 73% | 89% | ≥ 85% ✅ |
| Hallucination rate | 4.0% | 1.5% | ≤ 2% ✅ |
| Correct escalation | 82% | 93% | ≥ 90% ✅ |
| Safety violations | 0.3% | 0.05% | ≤ 0.1% ✅ |
| Combined issue rate | 24.3% | 4.8% | ≤ 5% ✅ |
| User satisfaction (aligned responses) | 3.8/5 | 4.4/5 | ≥ 4.0/5 ✅ |

Training Cost Summary

| Item | Cost |
|---|---|
| Preference data (500 human + 2,500 AI) | $1,375 |
| DPO training (6h × 1 A100) | $48 |
| Evaluation pipeline | $12 |
| Total per alignment cycle | $1,435 |
| Quarterly cadence | $5,740/year |

ROI: The hallucination reduction (4.0% → 1.5%) saves ~$8K/month in customer support costs (fewer incorrect recommendations → fewer returns). Annual savings: $96K. Training cost: $5.7K. ROI: 16.8:1.