# Data Curation and Synthetic Generation Scenarios - MangaAssist
Data quality controls whether fine-tuning helps or simply teaches the model noisy behavior. This document grounds data curation and synthetic generation in MangaAssist production workflows.
## When This Topic Matters
Use this topic before any training run where:
- labels come from production logs,
- rare intents are underrepresented,
- synthetic examples are being added,
- OOD clusters may become new intents,
- policy or catalog facts may be stale.
## Scenario 1 - Intent Dataset Refresh
Base dataset:
- 50K production examples,
- 5K synthetic examples,
- 10 known intents,
- 80/10/10 train-validation-test split.
Quality checks:
| Check | Threshold |
|---|---|
| duplicate rate across splits | 0 exact or near-duplicate conversations |
| label disagreement sample | <= 6% |
| synthetic share | <= 10% unless justified |
| rare intent minimum | >= 300 examples per critical class |
| language/script coverage | representative of production |
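The thresholds above can be enforced as a simple pre-training gate. This is a minimal sketch; `DatasetStats` and `passes_quality_gate` are hypothetical names, not part of any MangaAssist API, and the language-coverage check is omitted because "representative of production" needs a distribution comparison rather than a scalar threshold.

```python
from dataclasses import dataclass


@dataclass
class DatasetStats:
    cross_split_duplicates: int      # exact or near-duplicate conversations across splits
    label_disagreement_rate: float   # measured on a double-annotated sample
    synthetic_share: float           # synthetic examples / total examples
    min_rare_intent_count: int       # smallest example count among critical classes


def passes_quality_gate(stats: DatasetStats) -> bool:
    """Return True only if every Scenario 1 threshold is met."""
    return (
        stats.cross_split_duplicates == 0
        and stats.label_disagreement_rate <= 0.06
        and stats.synthetic_share <= 0.10
        and stats.min_rare_intent_count >= 300
    )
```

With the audit numbers later in this document (`synthetic_share` 0.091, disagreement 0.052), a dataset with zero cross-split duplicates and 310 examples in its rarest critical class would pass this gate.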
## Scenario 2 - Synthetic Rare-Class Expansion
Rare classes: `promotion`, `checkout_help`, `escalation`.
Synthetic generation prompt should include:
- intent definition,
- positive examples,
- negative examples,
- forbidden shortcuts,
- tone variations,
- Japanese-English mixed phrasing,
- commerce constraints.
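One way to make that checklist operational is to assemble the generation prompt from structured inputs, so no section can be silently dropped. This is an illustrative sketch; `build_generation_prompt` and its parameters are hypothetical, and the exact wording of each section would be tuned against the generator model in use.

```python
def build_generation_prompt(intent: str, definition: str,
                            positives: list[str], negatives: list[str],
                            forbidden_shortcuts: list[str]) -> str:
    """Assemble a synthetic-generation prompt covering the checklist above."""
    lines = [
        f"Intent: {intent}",
        f"Definition: {definition}",
        "Positive examples:",
        *[f"- {p}" for p in positives],
        "Negative examples (messages that look similar but are NOT this intent):",
        *[f"- {n}" for n in negatives],
        "Forbidden shortcuts (the example must not rely on these surface cues):",
        *[f"- {s}" for s in forbidden_shortcuts],
        "Vary tone across examples (polite, terse, frustrated).",
        "Include some Japanese-English mixed phrasing.",
        "Respect commerce constraints: real catalog terms only, no invented promotions.",
    ]
    return "\n".join(lines)
```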
Review process:
- generate 5x needed examples,
- filter duplicates and template-like messages,
- embed and cluster to remove near copies,
- human-review at least 20%,
- cap synthetic influence with sampling weights.
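The review steps above can be sketched as a small filtering pass. This is a simplified stand-in: exact dedup on normalized text replaces the embedding-and-cluster step, and `filter_synthetic` is a hypothetical name. A production version would also score template-likeness and apply sampling weights at training time.

```python
import random
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for duplicate matching."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def filter_synthetic(candidates: list[str], needed: int,
                     review_fraction: float = 0.20, seed: int = 0):
    """Dedup an over-generated candidate pool (the 5x batch), keep `needed`
    examples, and sample a slice for human review."""
    seen, kept = set(), []
    for msg in candidates:
        key = normalize(msg)
        if key not in seen:          # drop exact duplicates after normalization
            seen.add(key)
            kept.append(msg)
    rng = random.Random(seed)        # deterministic shuffle for reproducibility
    rng.shuffle(kept)
    accepted = kept[:needed]
    review = accepted[: max(1, round(len(accepted) * review_fraction))]
    return accepted, review
```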
## Scenario 3 - Preference Data For DPO
Preference data pairs each prompt with a chosen and a rejected answer.
Rejected answers should be realistic failures, not cartoonishly bad ones; otherwise DPO learns only easy distinctions that never occur in production.
Pair categories:
| Category | Example rejected behavior |
|---|---|
| factuality | invented volume count |
| constraint following | recommends mature manga after family-friendly request |
| spoiler safety | reveals plot twist |
| support tone | dismissive or vague response |
| escalation | fails to offer handoff |
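A preference record can carry its category so that coverage across the table above is auditable. This is a minimal sketch; `PreferencePair` and the category identifiers are hypothetical, chosen to mirror the table rows.

```python
from dataclasses import dataclass

# Category names mirror the pair-category table above.
CATEGORIES = {"factuality", "constraint_following", "spoiler_safety",
              "support_tone", "escalation"}


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str    # grounded, policy-compliant answer
    rejected: str  # realistic failure, not a cartoonish one
    category: str

    def __post_init__(self):
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown pair category: {self.category}")
        if self.chosen.strip() == self.rejected.strip():
            raise ValueError("chosen and rejected must differ")
```

Validating at construction time keeps malformed pairs out of the DPO training file instead of surfacing as silent training noise.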
## Scenario 4 - Retrieval Grounding Data
For RAFT and retrieval:
- keep document IDs,
- keep source timestamp,
- mark stale passages,
- include distractors,
- evaluate on current policy/catalog data.
## Failure Modes
| Failure | Detection | Fix |
|---|---|---|
| synthetic artifacts | model reacts to generated phrasing | diversity filters and production eval |
| label leakage | same conversation in train and test | group-based split |
| weak negatives | retrieval eval too easy | hard negative mining |
| outdated facts | model learns stale catalog details | timestamped source checks |
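The group-based split that fixes label leakage can be implemented by hashing the conversation ID into a stable bucket, so every turn of a conversation lands in the same split. A minimal sketch; `group_split` is a hypothetical helper, and bucket counts would be adjusted for the 80/10/10 split used in Scenario 1.

```python
import hashlib


def group_split(conversation_ids: list[str], test_percent: int = 10):
    """Deterministically assign whole conversations to train or test,
    so the same conversation can never appear in both splits."""
    train, test = [], []
    for cid in conversation_ids:
        # Stable hash (unlike built-in hash(), sha256 is the same across runs).
        bucket = int(hashlib.sha256(cid.encode("utf-8")).hexdigest(), 16) % 100
        (test if bucket < test_percent else train).append(cid)
    return train, test
```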
## Data Audit Log
```json
{
  "event": "dataset_audit",
  "dataset": "intent-data-2026-04-20",
  "total_examples": 55000,
  "synthetic_share": 0.091,
  "duplicate_rate": 0.004,
  "label_disagreement_rate": 0.052,
  "decision": "approved_for_training"
}
```
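Before an audit record like this is honored downstream, it can be schema-checked. This sketch assumes the field set shown above; `validate_audit_event` is a hypothetical helper, and threshold enforcement is left to the training gate rather than duplicated here.

```python
def validate_audit_event(event: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the record is usable."""
    problems = []
    required = {"event": str, "dataset": str, "total_examples": int,
                "synthetic_share": float, "duplicate_rate": float,
                "label_disagreement_rate": float, "decision": str}
    for key, typ in required.items():
        if key not in event:
            problems.append(f"missing field: {key}")
        elif not isinstance(event[key], typ):
            problems.append(f"bad type for {key}")
    # Rates must be proportions, not percentages.
    for rate in ("synthetic_share", "duplicate_rate", "label_disagreement_rate"):
        value = event.get(rate)
        if isinstance(value, float) and not 0.0 <= value <= 1.0:
            problems.append(f"{rate} out of range")
    return problems
```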
## Final Decision
For MangaAssist, synthetic data is a multiplier, not a replacement for production labels. Use it to cover rare and dangerous edges, then keep its influence bounded and validated against real traffic.