
Data Curation and Synthetic Generation Scenarios - MangaAssist

Data quality controls whether fine-tuning helps or simply teaches the model noisy behavior. This document grounds data curation and synthetic generation in MangaAssist production workflows.

When This Topic Matters

Use this topic before any training run where:

  • labels come from production logs,
  • rare intents are underrepresented,
  • synthetic examples are being added,
  • OOD clusters may become new intents,
  • policy or catalog facts may be stale.

Scenario 1 - Intent Dataset Refresh

Base dataset:

  • 50K production examples,
  • 5K synthetic examples,
  • 10 known intents,
  • 80/10/10 train-validation-test split.

Quality checks:

Check                         Threshold
duplicate rate across splits  0 exact or near-duplicate conversations
label disagreement (sampled)  <= 6%
synthetic share               <= 10% unless justified
rare intent minimum           >= 300 examples per critical class
language/script coverage      representative of production
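The quality gate above can be sketched as a single audit function. This is a hedged sketch, not the MangaAssist implementation: the stat field names (`cross_split_duplicates`, `synthetic_justified`, and so on) are assumptions; only the threshold values come from the table.

```python
# Sketch of the Scenario 1 quality gate. Field names are illustrative
# assumptions; the numeric thresholds mirror the table above.

def audit_intent_dataset(stats: dict) -> list[str]:
    """Return a list of gate failures; an empty list means approved."""
    failures = []
    if stats["cross_split_duplicates"] > 0:
        failures.append("duplicate conversations shared across splits")
    if stats["label_disagreement_rate"] > 0.06:
        failures.append("label disagreement above 6%")
    if stats["synthetic_share"] > 0.10 and not stats.get("synthetic_justified"):
        failures.append("synthetic share above 10% without justification")
    for intent, count in stats["critical_intent_counts"].items():
        if count < 300:
            failures.append(f"critical intent '{intent}' below 300 examples")
    return failures

stats = {
    "cross_split_duplicates": 0,
    "label_disagreement_rate": 0.052,
    "synthetic_share": 0.091,
    "critical_intent_counts": {"escalation": 412, "checkout_help": 305},
}
print(audit_intent_dataset(stats))  # -> [] when every gate passes
```

Keeping the gate as a pure function over summary stats makes it easy to log the failure list alongside the audit record.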

Scenario 2 - Synthetic Rare-Class Expansion

Rare classes:

  • promotion,
  • checkout_help,
  • escalation.

Synthetic generation prompt should include:

  • intent definition,
  • positive examples,
  • negative examples,
  • forbidden shortcuts,
  • tone variations,
  • Japanese-English mixed phrasing,
  • commerce constraints.

Review process:

  1. generate 5x needed examples,
  2. filter duplicates and template-like messages,
  3. embed and cluster to remove near copies,
  4. human-review at least 20% of the remaining examples,
  5. cap synthetic influence with sampling weights.
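Steps 2 and 3 of the review process can be sketched as a small filter. This is a minimal stand-in: `difflib.SequenceMatcher` plays the role of embedding similarity here, and the 0.9 threshold is an assumption; a production pipeline would embed and cluster instead.

```python
# Sketch of steps 2-3: drop exact duplicates, then near copies.
# difflib stands in for embedding-based clustering; the 0.9 similarity
# threshold is an illustrative assumption.
from difflib import SequenceMatcher

def filter_synthetic(candidates: list[str], near_dup_threshold: float = 0.9) -> list[str]:
    kept: list[str] = []
    seen: set[str] = set()
    for text in candidates:
        norm = " ".join(text.lower().split())
        if norm in seen:  # step 2: exact duplicate after normalization
            continue
        # step 3: near copy of anything already kept
        if any(SequenceMatcher(None, norm, " ".join(k.lower().split())).ratio()
               >= near_dup_threshold for k in kept):
            continue
        seen.add(norm)
        kept.append(text)
    return kept

raw = [
    "Where is my checkout discount code?",
    "where is my checkout  discount code?",   # exact duplicate once normalized
    "Where is my checkout discount codes?",   # near copy
    "The coupon will not apply at checkout.",
]
print(filter_synthetic(raw))  # two of the four survive
```

Generating 5x and filtering down, as step 1 prescribes, assumes most candidates will be discarded by exactly this kind of pass.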

Scenario 3 - Preference Data For DPO

Preference data pairs each prompt with a chosen and a rejected answer.

Rejected answers should be realistically flawed, not cartoonishly bad; otherwise DPO learns only easy distinctions and misses the subtle failures that actually occur in production.

Pair categories:

Category              Example rejected behavior
factuality            invented volume count
constraint following  recommends mature manga after family-friendly request
spoiler safety        reveals plot twist
support tone          dismissive or vague response
escalation            fails to offer handoff
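One preference record per the categories above might look like the following sketch. The schema and field names are assumptions for illustration, not a fixed MangaAssist format.

```python
# Hedged sketch of a single DPO preference record; the schema is an
# illustrative assumption. Note the rejected answer is plausibly wrong
# (violates the family-friendly constraint), not cartoonishly bad.
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str
    category: str  # one of the pair categories in the table above

pair = PreferencePair(
    prompt="Recommend something family-friendly like Yotsuba&!.",
    chosen="You might enjoy another gentle slice-of-life series in the same vein.",
    rejected="Try this gritty seinen thriller; the violence picks up fast.",
    category="constraint following",
)
assert pair.category in {
    "factuality", "constraint following", "spoiler safety",
    "support tone", "escalation",
}
```

Tagging each pair with its category makes it possible to check that no single failure mode dominates the preference set.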

Scenario 4 - Retrieval Grounding Data

For RAFT (retrieval-augmented fine-tuning) and other retrieval-grounded data:

  • keep document IDs,
  • keep source timestamp,
  • mark stale passages,
  • include distractors,
  • evaluate on current policy/catalog data.
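A grounded training record following the checklist above could be shaped like this sketch. The field names and the 90-day staleness window are assumptions; only the checklist items (document IDs, source timestamp, staleness flag, distractors) come from the list.

```python
# Sketch of a retrieval-grounded record. Field names and the 90-day
# staleness window are illustrative assumptions.
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=90)

def is_stale(source_timestamp: datetime, now: datetime) -> bool:
    return now - source_timestamp > STALE_AFTER

record = {
    "question": "Does the spring promotion apply to box sets?",
    "doc_ids": ["policy-promo-2026-03", "catalog-boxsets"],  # keep document IDs
    "distractor_doc_ids": ["policy-returns"],                # include distractors
    "source_timestamp": datetime(2026, 3, 1, tzinfo=timezone.utc),
}
record["stale"] = is_stale(
    record["source_timestamp"],
    datetime(2026, 4, 20, tzinfo=timezone.utc),
)
print(record["stale"])  # -> False: 50 days old, inside the 90-day window
```

Marking staleness at curation time, rather than at evaluation time, keeps the timestamped source check cheap to rerun.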

Failure Modes

Failure              Detection                            Fix
synthetic artifacts  model reacts to generated phrasing   diversity filters and production eval
label leakage        same conversation in train and test  group-based split
weak negatives       retrieval eval too easy              hard negative mining
outdated facts       model learns stale catalog details   timestamped source checks
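The group-based split fix for label leakage can be sketched by hashing the conversation ID into a split bucket, so every example from one conversation lands in the same split. The ratios follow the 80/10/10 scheme from Scenario 1; the hashing scheme itself is an illustrative assumption.

```python
# Sketch of a group-based split: hash the conversation ID, not the
# individual example, so no conversation straddles train and test.
# Bucket boundaries follow the 80/10/10 split above.
import hashlib

def split_for(conversation_id: str) -> str:
    bucket = int(hashlib.sha256(conversation_id.encode()).hexdigest(), 16) % 100
    if bucket < 80:
        return "train"
    if bucket < 90:
        return "validation"
    return "test"

# All turns from the same conversation deterministically share a split.
assert split_for("conv-123") == split_for("conv-123")
```

Because the assignment is a pure function of the ID, rebuilding the dataset later cannot silently shuffle conversations across the split boundary.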

Data Audit Log

{
  "event": "dataset_audit",
  "dataset": "intent-data-2026-04-20",
  "total_examples": 55000,
  "synthetic_share": 0.091,
  "duplicate_rate": 0.004,
  "label_disagreement_rate": 0.052,
  "decision": "approved_for_training"
}

Final Decision

For MangaAssist, synthetic data is a multiplier, not a replacement for production labels. Use it to cover rare and dangerous edges, then keep its influence bounded and validated against real traffic.