RLHF and DPO Alignment Scenarios - MangaAssist
RLHF and DPO align MangaAssist responses with human preferences. For this project, DPO is usually the practical first choice: it trains directly on preferred/rejected answer pairs and does not require a separate reward model or a PPO training loop.
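To make the tradeoff concrete, here is a minimal sketch of the per-pair DPO objective. It assumes you already have summed log-probabilities of the chosen and rejected answers under the policy being trained and under a frozen reference model; the function name and `beta` default are illustrative, not a tuned configuration.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    pi_*  : summed log-probs under the policy being trained
    ref_* : summed log-probs under the frozen reference model
    beta  : strength of the KL-style pull toward the reference
    """
    # Implicit reward margin: how much more the policy prefers the
    # chosen answer than the reference does, minus the same for rejected.
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Negative log-sigmoid: small when the policy prefers chosen, large otherwise.
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

When policy and reference agree exactly, the loss sits at log 2; it falls as the policy learns to prefer the chosen answer.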
When This Topic Matters
Use preference alignment when the model can answer but chooses the wrong style, depth, safety posture, or tradeoff.
Examples:
- too much plot spoiler detail,
- invented manga facts,
- missing escalation language,
- overly formal customer support tone,
- recommendations that ignore user constraints.
Scenario 1 - Recommendation Preference Pairs
Prompt:
"I loved Vinland Saga. Recommend something mature, historical, and not too supernatural."
Preferred answer:
- gives 2-3 titles,
- explains historical and mature tone match,
- avoids unrelated battle shonen,
- mentions availability only if grounded.
Rejected answer:
- lists popular manga generically,
- ignores "not too supernatural",
- invents facts.
DPO record:
```json
{
  "prompt": "I loved Vinland Saga...",
  "chosen": "Try Vagabond, Historie, and Kingdom...",
  "rejected": "You may like Naruto, One Piece, and Bleach..."
}
```
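Malformed records silently degrade DPO training, so it is worth validating each one before it enters the dataset. A minimal checker might look like the following; the field names match the record format above, while the function name is an illustrative assumption.

```python
def validate_dpo_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is usable."""
    problems = []
    for field in ("prompt", "chosen", "rejected"):
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    # A pair where both sides are identical carries no preference signal.
    if record.get("chosen") == record.get("rejected"):
        problems.append("chosen and rejected are identical")
    return problems
```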
Promotion gate:
| Metric | Gate |
|---|---|
| human preference win rate vs SFT | >= 65% |
| spoiler violation rate | <= 1% |
| hallucinated catalog facts | <= 3% |
| support escalation compliance | >= 98% |
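The gate table above can be enforced mechanically in the eval pipeline. The sketch below encodes each gate as a direction plus threshold; metric key names are assumptions chosen for illustration, so map them to whatever your eval harness actually emits.

```python
# Gates mirror the promotion table: (direction, threshold).
GATES = {
    "pref_win_rate_vs_sft": (">=", 0.65),
    "spoiler_violation_rate": ("<=", 0.01),
    "hallucinated_catalog_facts": ("<=", 0.03),
    "support_escalation_compliance": (">=", 0.98),
}

def passes_promotion(metrics: dict) -> bool:
    """True only if every gate is satisfied; a single miss blocks promotion."""
    for name, (direction, threshold) in GATES.items():
        value = metrics[name]
        ok = value >= threshold if direction == ">=" else value <= threshold
        if not ok:
            return False
    return True
```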
Scenario 2 - Support Tone Alignment
Users should not get robotic or dismissive answers during support.
Preferred:
"I can help with that. Since this is about a damaged item, I will guide you to the replacement or return path."
Rejected:
"Your request has been logged. Please review the policy."
Use preference pairs from support reviewers, not only synthetic generation.
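One way to turn reviewer votes into training pairs is to keep only pairs where reviewers agree, which also surfaces the disagreement noise called out in the failure-mode table. This is a sketch under assumed input shapes (a `(prompt, answer_a, answer_b, vote)` tuple per review), not a fixed schema.

```python
from collections import Counter

def pairs_from_reviews(reviews, min_agreement=2):
    """Build DPO records from reviewer votes.

    reviews: iterable of (prompt, answer_a, answer_b, vote), vote in {"a", "b"}.
    A pair is kept only when at least min_agreement reviewers picked
    the same side; split votes are dropped for adjudication.
    """
    tally = {}
    for prompt, a, b, vote in reviews:
        tally.setdefault((prompt, a, b), Counter())[vote] += 1
    records = []
    for (prompt, a, b), votes in tally.items():
        side, count = votes.most_common(1)[0]
        if count >= min_agreement:
            chosen, rejected = (a, b) if side == "a" else (b, a)
            records.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return records
```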
Scenario 3 - Safety And Age-Rating Preferences
MangaAssist should respect age and content preferences:
- avoid mature recommendations when the user requests family-friendly titles,
- warn about intense content when appropriate,
- avoid explicit plot spoilers.
Preference data should include edge cases where a title is relevant but not appropriate for the user's constraints.
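A constraint like "family-friendly only" can also be enforced as a hard filter on candidate titles before ranking, so preference data only has to teach judgment at the margins. The rating ladder below is an assumed example scheme; a real catalog would use its own ratings.

```python
# Illustrative rating ladder, lowest to highest maturity.
RATING_ORDER = ["all-ages", "teen", "older-teen", "mature"]

def allowed_titles(candidates, max_rating):
    """candidates: list of (title, rating). Drop titles above the user's ceiling."""
    ceiling = RATING_ORDER.index(max_rating)
    return [title for title, rating in candidates
            if RATING_ORDER.index(rating) <= ceiling]
```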
Failure Modes
| Failure | Detection | Fix |
|---|---|---|
| preference overfits style | answers become samey | diversify chosen responses |
| DPO weakens factuality | polished but wrong | pair with RAFT eval |
| reviewer disagreement | noisy chosen/rejected labels | require guidelines and adjudication |
| reward hacking in RLHF | verbose answers score high | length-normalized rubric |
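The length-normalized rubric in the last row can be as simple as docking the raw score for tokens beyond a target length, which removes the incentive to pad answers. `target_len` and `penalty` here are illustrative knobs, not tuned values.

```python
def length_normalized_score(raw_score, answer, target_len=120, penalty=0.002):
    """Subtract a small penalty per token beyond target_len to blunt
    verbosity-based reward hacking. Uses whitespace tokens for simplicity."""
    tokens = answer.split()
    excess = max(0, len(tokens) - target_len)
    return raw_score - penalty * excess
```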
Production Log
```json
{
  "event": "preference_aligned_response",
  "model": "manga-dpo-v04",
  "intent": "recommendation",
  "spoiler_check": "pass",
  "constraint_following": 0.93,
  "human_eval_bucket": "weekly_sample"
}
```
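A small helper can keep these log entries consistent across services. The field names mirror the production log above; the function name and fixed `human_eval_bucket` value are assumptions for illustration.

```python
import json

def log_preference_event(model, intent, spoiler_check, constraint_following):
    """Serialize one preference-alignment event in the shared log shape."""
    event = {
        "event": "preference_aligned_response",
        "model": model,
        "intent": intent,
        "spoiler_check": spoiler_check,
        # Round so dashboards bucket scores consistently.
        "constraint_following": round(constraint_following, 2),
        "human_eval_bucket": "weekly_sample",
    }
    return json.dumps(event)
```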
Final Decision
For MangaAssist, preference alignment is about response judgment. Use DPO after supervised fine-tuning or LoRA when the model already knows the task but needs to prefer answers that are more grounded, helpful, spoiler-safe, and support-aware.