RLHF and DPO Alignment Scenarios - MangaAssist

RLHF and DPO align MangaAssist responses with human preferences. For this project, DPO is usually the practical first choice: it trains directly on preferred-versus-rejected answer pairs, without the separate reward model and PPO loop that RLHF requires.
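
As a rough sketch of the mechanics: DPO pushes the policy's log-probability margin on chosen over rejected responses above the margin of a frozen reference model. A minimal rendering of the loss, assuming per-sequence log-probabilities are already computed (all names here are illustrative):

import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # How much more the policy prefers chosen over rejected,
    # compared to the same margin under the reference model.
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    # Standard DPO objective: -log sigmoid(beta * margin difference),
    # written as log1p(exp(-z)) for numerical clarity.
    z = beta * (policy_margin - ref_margin)
    return math.log1p(math.exp(-z))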

When This Topic Matters

Use preference alignment when the model can answer but chooses the wrong style, depth, safety posture, or tradeoff.

Examples:

  • too much plot spoiler detail,
  • invented manga facts,
  • missing escalation language,
  • overly formal customer support tone,
  • recommendations that ignore user constraints.

Scenario 1 - Recommendation Preference Pairs

Prompt:

"I loved Vinland Saga. Recommend something mature, historical, and not too supernatural."

Preferred answer:

  • gives 2-3 titles,
  • explains historical and mature tone match,
  • avoids unrelated battle shonen,
  • mentions availability only if grounded.

Rejected answer:

  • lists popular manga generically,
  • ignores "not too supernatural",
  • invents facts.

DPO record:

{
  "prompt": "I loved Vinland Saga...",
  "chosen": "Try Vagabond, Historie, and Kingdom...",
  "rejected": "You may like Naruto, One Piece, and Bleach..."
}
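
Records in exactly this shape can be fed to an off-the-shelf DPO trainer. A minimal sketch, assuming the Hugging Face TRL library (argument names vary across TRL versions, and "your-sft-checkpoint" is a placeholder for the post-SFT MangaAssist model):

from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Placeholder: the supervised fine-tuned checkpoint to align.
model_name = "your-sft-checkpoint"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference records in the same shape as the example above.
pairs = Dataset.from_list([{
    "prompt": "I loved Vinland Saga...",
    "chosen": "Try Vagabond, Historie, and Kingdom...",
    "rejected": "You may like Naruto, One Piece, and Bleach...",
}])

# beta controls how far the policy may drift from the reference model.
args = DPOConfig(output_dir="manga-dpo", beta=0.1)
trainer = DPOTrainer(model=model, args=args, train_dataset=pairs,
                     processing_class=tokenizer)
trainer.train()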

Promotion gate:

Metric                              Gate
human preference win rate vs SFT    >= 65%
spoiler violation rate              <= 1%
hallucinated catalog facts          <= 3%
support escalation compliance       >= 98%
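
A checkpoint promotion script can encode these thresholds directly. The metric keys below mirror the table and are assumptions about what the eval pipeline emits:

# Illustrative promotion gate check; thresholds mirror the table above.
GATES = {
    "human_pref_win_rate_vs_sft": (">=", 0.65),
    "spoiler_violation_rate": ("<=", 0.01),
    "hallucinated_catalog_facts": ("<=", 0.03),
    "support_escalation_compliance": (">=", 0.98),
}

def passes_gates(metrics):
    for name, (op, threshold) in GATES.items():
        value = metrics[name]
        ok = value >= threshold if op == ">=" else value <= threshold
        if not ok:
            print(f"gate failed: {name} = {value} (need {op} {threshold})")
            return False
    return True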

Scenario 2 - Support Tone Alignment

Users should not get robotic or dismissive answers during support.

Preferred:

"I can help with that. Since this is about a damaged item, I will guide you to the replacement or return path."

Rejected:

"Your request has been logged. Please review the policy."

Source preference pairs from support reviewers, not only from synthetic generation.
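
One way to turn reviewer ratings into such pairs is to match the highest- and lowest-rated responses per prompt. A sketch, assuming each rating is a (prompt, response, score) tuple with higher scores preferred:

from collections import defaultdict

def build_pairs(ratings, min_gap=1.0):
    # Group reviewer-scored responses by the prompt they answer.
    by_prompt = defaultdict(list)
    for prompt, response, score in ratings:
        by_prompt[prompt].append((score, response))
    pairs = []
    for prompt, scored in by_prompt.items():
        scored.sort(reverse=True)
        best, worst = scored[0], scored[-1]
        # Only keep pairs where reviewers clearly preferred one response.
        if best[0] - worst[0] >= min_gap:
            pairs.append({"prompt": prompt,
                          "chosen": best[1],
                          "rejected": worst[1]})
    return pairs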

Scenario 3 - Safety And Age-Rating Preferences

MangaAssist should respect age and content preferences:

  • avoid mature recommendations when the user requests family-friendly titles,
  • warn about intense content when appropriate,
  • avoid explicit plot spoilers.

Preference data should include edge cases where a title is relevant but not appropriate for the user's constraints.
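
When mining those edge cases, a small constraint check can flag titles that match the query but exceed the user's content ceiling. The rating scale below is an assumed ordering, not a MangaAssist API:

# Assumed content-rating scale, ordered from most to least restrictive.
RATING_ORDER = ["all-ages", "teen", "older-teen", "mature"]

def violates_constraint(title_rating, max_rating):
    # True when the title's rating sits above the user's ceiling.
    return RATING_ORDER.index(title_rating) > RATING_ORDER.index(max_rating)

# A relevant-but-inappropriate candidate becomes a "rejected" example:
if violates_constraint("mature", max_rating="teen"):
    print("relevant but not appropriate -> use as a rejected response")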

Failure Modes

Failure                      Detection                       Fix
preference overfits style    answers become samey            diversify chosen responses
DPO weakens factuality       polished but wrong answers      pair with RAFT eval
reviewer disagreement        noisy chosen/rejected labels    require guidelines and adjudication
reward hacking in RLHF       verbose answers score high      length-normalized rubric (sketch below)
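
The length-normalization fix referenced in the table can be as simple as dividing a rubric score by a function of response length, so verbosity stops paying. A sketch:

import math

def length_normalized_score(raw_score, num_tokens):
    # Dampen the advantage of long answers: a response must earn its
    # extra tokens rather than farm reward through sheer verbosity.
    return raw_score / math.log(max(num_tokens, 2))

Under this scheme a 9.0 rubric score at 800 tokens (9.0 / ln 800, about 1.35) ranks below an 8.0 at 120 tokens (8.0 / ln 120, about 1.67).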

Production Log

{
  "event": "preference_aligned_response",
  "model": "manga-dpo-v04",
  "intent": "recommendation",
  "spoiler_check": "pass",
  "constraint_following": 0.93,
  "human_eval_bucket": "weekly_sample"
}
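
A lightweight schema check keeps these records consistent across model versions before they are emitted; the required fields simply mirror the example above:

import json

# Fields mirror the example log record above.
REQUIRED_FIELDS = {"event", "model", "intent", "spoiler_check",
                   "constraint_following", "human_eval_bucket"}

def emit_preference_log(record):
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"log record missing fields: {sorted(missing)}")
    print(json.dumps(record))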

Final Decision

For MangaAssist, preference alignment is about response judgment. Use DPO after supervised fine-tuning or LoRA when the model already knows the task but needs to prefer answers that are more grounded, helpful, spoiler-safe, and support-aware.