# Data Curation and Synthetic Generation Scenarios - MangaAssist
Data quality controls whether fine-tuning helps or simply teaches the model noisy behavior. This document grounds data curation and synthetic generation in MangaAssist production workflows.
## When This Topic Matters
Use this topic before any training run where:
- labels come from production logs,
- rare intents are underrepresented,
- synthetic examples are being added,
- OOD clusters may become new intents,
- policy or catalog facts may be stale.
## Scenario 1 - Intent Dataset Refresh
Base dataset:
- 50K production examples,
- 5K synthetic examples,
- 10 known intents,
- 80/10/10 train-validation-test split.
Quality checks:
| Check | Threshold |
|---|---|
| duplicate rate across splits | 0 exact or near-duplicate conversations |
| label disagreement sample | <= 6% |
| synthetic share | <= 10% unless justified |
| rare intent minimum | >= 300 examples per critical class |
| language/script coverage | representative of production |
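The thresholds above can be enforced as a simple pre-training gate. This is a minimal sketch; `DatasetStats` and `passes_quality_gate` are hypothetical names, not part of any MangaAssist API, and the language-coverage check is omitted because "representative of production" needs a distribution comparison rather than a scalar threshold.

```python
from dataclasses import dataclass


@dataclass
class DatasetStats:
    cross_split_duplicates: int      # exact or near-duplicate conversations across splits
    label_disagreement_rate: float   # measured on a double-annotated sample
    synthetic_share: float           # synthetic examples / total examples
    min_rare_intent_count: int       # smallest example count among critical classes


def passes_quality_gate(stats: DatasetStats) -> bool:
    """Return True only if every Scenario 1 threshold is met."""
    return (
        stats.cross_split_duplicates == 0
        and stats.label_disagreement_rate <= 0.06
        and stats.synthetic_share <= 0.10
        and stats.min_rare_intent_count >= 300
    )
```

With the audit numbers later in this document (`synthetic_share` 0.091, disagreement 0.052), a dataset with zero cross-split duplicates and 310 examples in its rarest critical class would pass this gate.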
## Scenario 2 - Synthetic Rare-Class Expansion
Rare classes: `promotion`, `checkout_help`, `escalation`.
Synthetic generation prompt should include:
- intent definition,
- positive examples,
- negative examples,
- forbidden shortcuts,
- tone variations,
- Japanese-English mixed phrasing,
- commerce constraints.
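One way to make that checklist operational is to assemble the generation prompt from structured inputs, so no section can be silently dropped. This is an illustrative sketch; `build_generation_prompt` and its parameters are hypothetical, and the exact wording of each section would be tuned against the generator model in use.

```python
def build_generation_prompt(intent: str, definition: str,
                            positives: list[str], negatives: list[str],
                            forbidden_shortcuts: list[str]) -> str:
    """Assemble a synthetic-generation prompt covering the checklist above."""
    lines = [
        f"Intent: {intent}",
        f"Definition: {definition}",
        "Positive examples:",
        *[f"- {p}" for p in positives],
        "Negative examples (messages that look similar but are NOT this intent):",
        *[f"- {n}" for n in negatives],
        "Forbidden shortcuts (the example must not rely on these surface cues):",
        *[f"- {s}" for s in forbidden_shortcuts],
        "Vary tone across examples (polite, terse, frustrated).",
        "Include some Japanese-English mixed phrasing.",
        "Respect commerce constraints: real catalog terms only, no invented promotions.",
    ]
    return "\n".join(lines)
```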
Review process:
- generate 5x needed examples,
- filter duplicates and template-like messages,
- embed and cluster to remove near copies,
- human-review at least 20%,
- cap synthetic influence with sampling weights.
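The review steps above can be sketched as a small filtering pass. This is a simplified stand-in: exact dedup on normalized text replaces the embedding-and-cluster step, and `filter_synthetic` is a hypothetical name. A production version would also score template-likeness and apply sampling weights at training time.

```python
import random
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for duplicate matching."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def filter_synthetic(candidates: list[str], needed: int,
                     review_fraction: float = 0.20, seed: int = 0):
    """Dedup an over-generated candidate pool (the 5x batch), keep `needed`
    examples, and sample a slice for human review."""
    seen, kept = set(), []
    for msg in candidates:
        key = normalize(msg)
        if key not in seen:          # drop exact duplicates after normalization
            seen.add(key)
            kept.append(msg)
    rng = random.Random(seed)        # deterministic shuffle for reproducibility
    rng.shuffle(kept)
    accepted = kept[:needed]
    review = accepted[: max(1, round(len(accepted) * review_fraction))]
    return accepted, review
```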
## Scenario 3 - Preference Data For DPO
Preference data pairs each prompt with a chosen and a rejected answer.
Rejected answers should be realistic failures, not cartoonishly bad ones; otherwise DPO learns only easy distinctions that never occur in production.
Pair categories:
| Category | Example rejected behavior |
|---|---|
| factuality | invented volume count |
| constraint following | recommends mature manga after family-friendly request |
| spoiler safety | reveals plot twist |
| support tone | dismissive or vague response |
| escalation | fails to offer handoff |
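A preference record can carry its category so that coverage across the table above is auditable. This is a minimal sketch; `PreferencePair` and the category identifiers are hypothetical, chosen to mirror the table rows.

```python
from dataclasses import dataclass

# Category names mirror the pair-category table above.
CATEGORIES = {"factuality", "constraint_following", "spoiler_safety",
              "support_tone", "escalation"}


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str    # grounded, policy-compliant answer
    rejected: str  # realistic failure, not a cartoonish one
    category: str

    def __post_init__(self):
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown pair category: {self.category}")
        if self.chosen.strip() == self.rejected.strip():
            raise ValueError("chosen and rejected must differ")
```

Validating at construction time keeps malformed pairs out of the DPO training file instead of surfacing as silent training noise.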
## Scenario 4 - Retrieval Grounding Data
For RAFT and retrieval:
- keep document IDs,
- keep source timestamp,
- mark stale passages,
- include distractors,
- evaluate on current policy/catalog data.
## Failure Modes
| Failure | Detection | Fix |
|---|---|---|
| synthetic artifacts | model reacts to generated phrasing | diversity filters and production eval |
| label leakage | same conversation in train and test | group-based split |
| weak negatives | retrieval eval too easy | hard negative mining |
| outdated facts | model learns stale catalog details | timestamped source checks |
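The group-based split that fixes label leakage can be implemented by hashing the conversation ID into a stable bucket, so every turn of a conversation lands in the same split. A minimal sketch; `group_split` is a hypothetical helper, and bucket counts would be adjusted for the 80/10/10 split used in Scenario 1.

```python
import hashlib


def group_split(conversation_ids: list[str], test_percent: int = 10):
    """Deterministically assign whole conversations to train or test,
    so the same conversation can never appear in both splits."""
    train, test = [], []
    for cid in conversation_ids:
        # Stable hash (unlike built-in hash(), sha256 is the same across runs).
        bucket = int(hashlib.sha256(cid.encode("utf-8")).hexdigest(), 16) % 100
        (test if bucket < test_percent else train).append(cid)
    return train, test
```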
## Data Audit Log
```json
{
  "event": "dataset_audit",
  "dataset": "intent-data-2026-04-20",
  "total_examples": 55000,
  "synthetic_share": 0.091,
  "duplicate_rate": 0.004,
  "label_disagreement_rate": 0.052,
  "decision": "approved_for_training"
}
```
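Before an audit record like this is honored downstream, it can be schema-checked. This sketch assumes the field set shown above; `validate_audit_event` is a hypothetical helper, and threshold enforcement is left to the training gate rather than duplicated here.

```python
def validate_audit_event(event: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the record is usable."""
    problems = []
    required = {"event": str, "dataset": str, "total_examples": int,
                "synthetic_share": float, "duplicate_rate": float,
                "label_disagreement_rate": float, "decision": str}
    for key, typ in required.items():
        if key not in event:
            problems.append(f"missing field: {key}")
        elif not isinstance(event[key], typ):
            problems.append(f"bad type for {key}")
    # Rates must be proportions, not percentages.
    for rate in ("synthetic_share", "duplicate_rate", "label_disagreement_rate"):
        value = event.get(rate)
        if isinstance(value, float) and not 0.0 <= value <= 1.0:
            problems.append(f"{rate} out of range")
    return problems
```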
## Final Decision
For MangaAssist, synthetic data is a multiplier, not a replacement for production labels. Use it to cover rare and dangerous edges, then keep its influence bounded and validated against real traffic.