
16. Data Curation, Synthetic Generation & Active Learning — Building the Data Flywheel

Problem Statement and MangaAssist Context

Every fine-tuning scenario in this series (docs 01-15) depends on high-quality training data. The reality for MangaAssist:

  1. Limited labeled data: We have ~5,000 labeled customer interactions from the first 3 months of operation
  2. Label noise (estimated 8-12%): Human annotators disagree on intent labels, sentiment, and relevance judgments
  3. Distribution shift: The manga catalog updates weekly with new titles, volume releases, and seasonal trends
  4. Long-tail coverage: 60% of queries fall into 5 common intents, but the remaining 40% span 20+ rare intents with <50 examples each

The data flywheel addresses all four: production data → clean → label → augment → train → deploy → collect more production data → repeat. This document covers the math, pipelines, and practices for building that flywheel.

Data Quality Impact on Model Performance

| Training Data Quality | Intent Accuracy | Embedding Recall@3 | LLM Hallucination |
|---|---|---|---|
| Raw (8-12% noise) | 87.3% | 0.74 | 5.2% |
| Cleaned (2-3% noise) | 92.1% | 0.82 | 2.8% |
| Cleaned + augmented | 93.8% | 0.85 | 1.9% |
| Cleaned + augmented + active learning | 95.2% | 0.88 | 1.1% |

Mathematical Foundations

Confident Learning — Label Noise Estimation (Northcutt et al., 2021)

Goal: Identify mislabeled examples without knowing the true labels.

Step 1: Estimate the Joint Distribution

Given $n$ examples with noisy labels $\tilde{y}$ and model-predicted probabilities $\hat{p}$, estimate the joint distribution $Q_{\tilde{y}, y^*}$ between noisy labels and true labels:

$$\hat{Q}_{\tilde{y}=i, y^*=j} = \frac{|\hat{X}_{\tilde{y}=i, y^*=j}|}{n}$$

where:

$$\hat{X}_{\tilde{y}=i, y^*=j} = \{x : \tilde{y} = i \text{ and } \hat{p}(y=j|x) \geq t_j\}$$

and $t_j$ is the per-class threshold: the average self-confidence of examples with noisy label $j$:

$$t_j = \frac{1}{|\{x : \tilde{y}=j\}|} \sum_{x : \tilde{y}=j} \hat{p}(y=j|x)$$

Intuition: If the model predicts class $j$ with high confidence but the label says class $i$, the label is likely wrong. The per-class threshold adapts to class difficulty — easy classes have high thresholds, hard classes have low ones.

Step 2: Identify Label Errors

An example $(x, \tilde{y}=i)$ is a candidate label error if:

$$\hat{p}(y=j|x) \geq t_j \text{ for some } j \neq i$$

AND it appears in the off-diagonal of $\hat{Q}$ (noisy label ≠ predicted true label).

Step 3: Rank by Error Likelihood

Rank candidate errors by the normalized margin:

$$\text{error\_score}(x) = \hat{p}(y=\tilde{y}|x) - \max_{j \neq \tilde{y}} \hat{p}(y=j|x)$$

Negative margin = high confidence that the label is wrong. Most negative scores are reviewed first.
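
As an illustrative toy case (numbers chosen for exposition, using the per-class thresholds shown in the pipeline diagram below): a query labeled `order` with out-of-fold probabilities $\hat{p}(\text{order})=0.05$, $\hat{p}(\text{product})=0.93$, $\hat{p}(\text{complaint})=0.02$ exceeds $t_{\text{product}}=0.92$, so it is flagged as a candidate error with margin $0.05 - 0.93 = -0.88$, placing it near the top of the review queue.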

Annotation Agreement Metrics

Cohen's Kappa (2 annotators):

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where:

- $p_o$ = observed agreement (fraction of items where annotators agree)
- $p_e$ = expected agreement by chance = $\sum_k p_{1k} \cdot p_{2k}$ (product of marginal class probabilities)

Interpretation:

- $\kappa < 0.40$: Poor agreement — re-examine guidelines
- $0.40 \leq \kappa < 0.60$: Moderate — acceptable for initial labeling
- $0.60 \leq \kappa < 0.80$: Substantial — target for production labeling
- $\kappa \geq 0.80$: Almost perfect — gold standard
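
Cohen's $\kappa$ is a few lines of NumPy; a minimal sketch (the function name and label-array inputs are illustrative, and `sklearn.metrics.cohen_kappa_score` computes the same quantity):

import numpy as np

def cohens_kappa(labels_a: list[int], labels_b: list[int]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    classes = np.union1d(a, b)
    p_o = float(np.mean(a == b))                        # observed agreement
    p1 = np.array([np.mean(a == c) for c in classes])   # annotator 1 marginals
    p2 = np.array([np.mean(b == c) for c in classes])   # annotator 2 marginals
    p_e = float(np.sum(p1 * p2))                        # chance agreement
    return (p_o - p_e) / (1 - p_e)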

Fleiss' Kappa (multiple annotators):

$$\kappa_F = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}$$

where:

$$\bar{P} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{r(r-1)} \sum_{k=1}^{K} n_{ik}(n_{ik} - 1)$$

$$\bar{P}_e = \sum_{k=1}^{K} \left(\frac{1}{nr} \sum_{i=1}^{n} n_{ik}\right)^2$$

with $n$ items, $r$ raters, $K$ categories, and $n_{ik}$ = number of raters who assigned item $i$ to category $k$.
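
A minimal sketch of Fleiss' $\kappa$ from a rater-count matrix, assuming every item received exactly $r$ ratings (function and variable names are illustrative):

import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """counts[i, k] = number of raters who assigned item i to category k (shape n x K)."""
    counts = np.asarray(counts, dtype=float)
    n, _ = counts.shape
    r = counts[0].sum()                                         # raters per item (constant)
    p_i = (counts * (counts - 1)).sum(axis=1) / (r * (r - 1))   # per-item agreement
    p_bar = p_i.mean()
    p_k = counts.sum(axis=0) / (n * r)                          # category proportions
    p_bar_e = float(np.sum(p_k ** 2))
    return (p_bar - p_bar_e) / (1 - p_bar_e)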

MangaAssist annotation targets:

- Intent labels: $\kappa \geq 0.75$ (15 classes, clear definitions)
- Sentiment: $\kappa \geq 0.65$ (subjective, harder agreement)
- Relevance judgments: $\kappa \geq 0.70$ (binary relevant/irrelevant)

Active Learning — Uncertainty Sampling

Select the most informative examples for human labeling:

Least Confidence:

$$x^* = \arg\max_x \left(1 - \max_k \hat{p}(y=k|x)\right)$$

Margin Sampling:

$$x^* = \arg\min_x \left(\hat{p}(y=k_1|x) - \hat{p}(y=k_2|x)\right)$$

where $k_1, k_2$ are the top-2 predicted classes.

Entropy Sampling:

$$x^* = \arg\max_x H(\hat{p}(y|x)) = \arg\max_x \left(-\sum_k \hat{p}(y=k|x) \log \hat{p}(y=k|x)\right)$$
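
All three acquisition scores fall out of the classifier's probability matrix; a minimal sketch (function name is illustrative; each score is oriented so that higher means more informative):

import numpy as np

def uncertainty_scores(probs: np.ndarray) -> dict[str, np.ndarray]:
    """probs: (n, K) predicted class probabilities."""
    sorted_p = np.sort(probs, axis=1)[:, ::-1]        # descending per row
    least_conf = 1.0 - sorted_p[:, 0]
    margin = -(sorted_p[:, 0] - sorted_p[:, 1])        # negated so higher = more uncertain
    ent = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return {"least_confidence": least_conf, "margin": margin, "entropy": ent}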

Batch Active Learning with Diversity:

To avoid selecting redundant examples, combine uncertainty with diversity. One approach is to run k-means++ on embeddings of the top-$N$ uncertain examples and take one representative per cluster. An alternative is to score candidates greedily against the already-selected batch:

$$\text{score}(x) = \lambda \cdot H(\hat{p}(y|x)) + (1-\lambda) \cdot \min_{x' \in S} \|e(x) - e(x')\|_2$$

where $S$ is the already-selected batch and $e(x)$ is the embedding. $\lambda = 0.7$ balances uncertainty and diversity.
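
A minimal sketch of the greedy λ-weighted selection (the k-means-based variant appears in the implementation section below); the entropy array can come from the sketch above, embeddings are assumed non-zero so they can be L2-normalized, and names are illustrative:

import numpy as np

def greedy_diverse_batch(entropies: np.ndarray, embeddings: np.ndarray,
                         batch_size: int = 50, lam: float = 0.7) -> list[int]:
    """Greedily pick examples maximizing lam * uncertainty + (1 - lam) * distance to batch."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    selected = [int(np.argmax(entropies))]             # seed with the most uncertain example
    while len(selected) < batch_size:
        # Distance from every candidate to its nearest already-selected example
        d_min = np.linalg.norm(emb[:, None, :] - emb[None, selected, :], axis=-1).min(axis=1)
        score = lam * entropies + (1 - lam) * d_min
        score[selected] = -np.inf                      # never reselect
        selected.append(int(np.argmax(score)))
    return selected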

Synthetic Data Quality Assessment

For generated synthetic examples, measure quality with:

Faithfulness Score (does the synthetic example match the distribution?):

$$\text{Faith}(x_{\text{syn}}) = \exp\left(-D_{KL}(P_{\text{real}} \| P_{\text{model}}(x_{\text{syn}}))\right)$$

Diversity Score (are synthetic examples varied enough?):

$$\text{Div}(S_{\text{syn}}) = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} \left(1 - \cos(e(x_i), e(x_j))\right)$$
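
A minimal sketch of the diversity score over synthetic-example embeddings; the mean includes the $i=j$ pairs, matching the $1/|S|^2$ normalization above (function name is illustrative):

import numpy as np

def diversity_score(embeddings: np.ndarray) -> float:
    """Mean pairwise cosine distance over synthetic-example embeddings, shape (n, d)."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cos = e @ e.T                          # pairwise cosine similarities
    return float(np.mean(1.0 - cos))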

Utility Score (does adding synthetic data improve the model?):

$$\text{Util} = \text{Accuracy}(M_{\text{real+syn}}) - \text{Accuracy}(M_{\text{real only}})$$


Model Internals — Layer-by-Layer Diagrams

End-to-End Data Flywheel

graph TB
    subgraph "Data Flywheel (Continuous Loop)"
        PROD["Production Traffic<br>10K messages/day<br>Unlabeled customer queries"]

        SAMPLE["Smart Sampling<br>Active learning selects<br>most informative 2%<br>(200 messages/day)"]

        LABEL["Human Labeling<br>3 annotators per example<br>Fleiss κ ≥ 0.70 required<br>Cost: $0.15/label"]

        CLEAN["Confident Learning<br>Identify noisy labels<br>Flag 8-12% for review<br>Reduce noise to 2-3%"]

        AUGMENT["Synthetic Augmentation<br>Self-Instruct (Claude 3.5)<br>10× amplification factor<br>2,000 → 20,000 examples"]

        TRAIN["Model Training<br>Fine-tune all models:<br>Intent, Embedding, Reranker,<br>Sentiment, LLM (docs 01-15)"]

        DEPLOY["Deploy & Monitor<br>A/B test new model<br>Track quality metrics<br>Detect distribution shift"]

        PROD --> SAMPLE --> LABEL --> CLEAN --> AUGMENT --> TRAIN --> DEPLOY
        DEPLOY -->|"New production data"| PROD
    end

    style PROD fill:#e3f2fd
    style SAMPLE fill:#fff9c4
    style LABEL fill:#fff9c4
    style CLEAN fill:#ffcdd2
    style AUGMENT fill:#c8e6c9
    style TRAIN fill:#c8e6c9
    style DEPLOY fill:#e3f2fd

Confident Learning Pipeline

graph TB
    subgraph "Confident Learning — Label Noise Detection"
        DATA["5,000 labeled examples<br>Noisy labels ỹ from annotators"]

        MODEL["Train DistilBERT k-fold<br>(k=5 cross-validation)<br>Get out-of-fold predictions p̂(y|x)"]

        THRESH["Compute per-class thresholds<br>t_j = avg self-confidence<br>t_product = 0.92, t_order = 0.88<br>t_complaint = 0.75, ..."]

        JOINT["Estimate joint Q(ỹ, y*)<br>Off-diagonal = label errors<br>Q(product→order) = 0.03<br>Q(complaint→feedback) = 0.05"]

        RANK["Rank by normalized margin<br>margin = p̂(ỹ) - max p̂(j≠ỹ)<br>Most negative = most likely error"]

        REVIEW["Human review top-N<br>N = 500 candidates<br>380 confirmed errors (76%)<br>120 correct labels (24%)"]

        FIXED["Cleaned dataset<br>380 labels corrected<br>Noise: 8.2% → 2.4%"]

        DATA --> MODEL --> THRESH --> JOINT --> RANK --> REVIEW --> FIXED
    end

    style DATA fill:#ffcdd2
    style FIXED fill:#c8e6c9
    style REVIEW fill:#fff9c4

Active Learning Selection Strategy

graph TB
    subgraph "Active Learning — Batch Selection"
        POOL["Unlabeled pool<br>10,000 new queries<br>from last week"]

        EMBED["Embed all queries<br>DistilBERT → 768-dim<br>Compute model predictions"]

        UNCERTAIN["Uncertainty ranking<br>Entropy H(p̂(y|x))<br>Top 500 most uncertain"]

        DIVERSE["Diversity filtering<br>k-means++ on embeddings<br>Select K=50 cluster centroids"]

        BATCH["Final batch: 50 examples<br>λ=0.7 uncertainty + 0.3 diversity<br>Covers all query types"]

        HUMAN["Send to annotators<br>3 annotators × 50 queries<br>Cost: $22.50<br>Time: 2 hours"]

        RETRAIN["Add to training set<br>Retrain intent classifier<br>+0.4% accuracy per cycle"]

        POOL --> EMBED --> UNCERTAIN --> DIVERSE --> BATCH --> HUMAN --> RETRAIN
        RETRAIN -->|"Next cycle (weekly)"| POOL
    end

    style BATCH fill:#c8e6c9
    style HUMAN fill:#fff9c4

Synthetic Data Generation Pipeline

graph TB
    subgraph "Self-Instruct Synthetic Pipeline"
        SEED["Seed examples<br>200 high-quality<br>human-labeled examples<br>per intent class"]

        PROMPT["Prompt template:<br>'Generate 10 customer queries<br>similar to these examples<br>but with different products,<br>phrasing, and complexity'"]

        CLAUDE["Claude 3.5 Sonnet<br>Generates 10× per seed<br>200 seeds → 2,000 synthetic<br>per intent class"]

        FILTER["Quality filtering:<br>1. Dedup (MinHash, 0.85 threshold)<br>2. Classifier confidence > 0.8<br>3. Fluency score > 0.9<br>4. Human spot-check (5%)"]

        VALID["Validated synthetic set<br>1,600 per class (80% pass rate)<br>15 classes × 1,600 = 24,000<br>synthetic examples total"]

        MERGE["Merge with real data:<br>5,000 real + 24,000 synthetic<br>Weight: 1.0 real, 0.5 synthetic<br>Effective: 5K + 12K = 17K examples"]

        SEED --> PROMPT --> CLAUDE --> FILTER --> VALID --> MERGE
    end

    style SEED fill:#e3f2fd
    style CLAUDE fill:#fff9c4
    style FILTER fill:#ffcdd2
    style VALID fill:#c8e6c9

Distribution Shift Detection

graph TB
    subgraph "Distribution Shift Monitoring"
        direction TB
        BASELINE["Baseline distribution<br>(training data, week 0)"]
        CURRENT["Current production<br>(this week's queries)"]

        COMPARE["Compare distributions:<br>1. Intent frequency shift<br>2. Embedding drift (MMD)<br>3. OOD detection"]

        INTENT["Intent frequency:<br>Baseline: product=32%<br>Current: product=28%, new_intent=5%<br>χ² test: p < 0.01 → SHIFT"]

        EMBED["Embedding drift (MMD):<br>MMD² = E[k(x,x')] - 2E[k(x,y)] + E[k(y,y')]<br>where k = RBF kernel<br>MMD > threshold → DRIFT"]

        OOD["OOD detection:<br>Mahalanobis distance from<br>training centroid<br>5% of queries are OOD → FLAG"]

        ACTION["Triggered actions:<br>1. Alert MLOps team<br>2. Increase active learning rate 2%→5%<br>3. Queue retraining if drift > 2 weeks"]

        BASELINE --> COMPARE
        CURRENT --> COMPARE
        COMPARE --> INTENT & EMBED & OOD
        INTENT --> ACTION
        EMBED --> ACTION
        OOD --> ACTION
    end

    style INTENT fill:#ffcdd2
    style EMBED fill:#fff9c4
    style OOD fill:#ffcdd2
    style ACTION fill:#c8e6c9

Implementation Deep-Dive

Confident Learning Implementation

import numpy as np
from sklearn.model_selection import StratifiedKFold
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from collections import Counter


class ConfidentLearningCleaner:
    """
    Identify noisy labels using the Confident Learning framework.
    Reference: Northcutt et al., "Confident Learning: Estimating
    Uncertainty in Dataset Labels" (2021).
    """

    def __init__(self, model_name: str = "distilbert-base-uncased", n_folds: int = 5):
        self.model_name = model_name
        self.n_folds = n_folds

    def find_label_errors(
        self,
        texts: list[str],
        noisy_labels: list[int],
        num_classes: int,
    ) -> dict:
        """
        Identify mislabeled examples using cross-validated predictions.

        Returns:
            {
                "error_indices": list of suspected error indices,
                "error_scores": confidence that each is an error,
                "joint_matrix": estimated Q(noisy, true) matrix,
                "per_class_noise": estimated noise rate per class,
            }
        """
        # Step 1: Get out-of-fold predicted probabilities
        probs = self._cross_val_predict_proba(texts, noisy_labels, num_classes)

        # Step 2: Compute per-class thresholds
        thresholds = self._compute_thresholds(probs, noisy_labels, num_classes)

        # Step 3: Estimate the joint distribution Q
        joint = self._estimate_joint(probs, noisy_labels, thresholds, num_classes)

        # Step 4: Find label errors (off-diagonal of Q)
        error_indices, error_scores = self._identify_errors(
            probs, noisy_labels, thresholds, num_classes,
        )

        # Step 5: Per-class noise rate
        per_class_noise = {}
        for cls in range(num_classes):
            cls_indices = [i for i, y in enumerate(noisy_labels) if y == cls]
            cls_errors = [i for i in error_indices if noisy_labels[i] == cls]
            per_class_noise[cls] = (
                len(cls_errors) / len(cls_indices) if cls_indices else 0
            )

        return {
            "error_indices": error_indices,
            "error_scores": error_scores,
            "joint_matrix": joint,
            "per_class_noise": per_class_noise,
        }

    def _compute_thresholds(
        self,
        probs: np.ndarray,
        labels: list[int],
        num_classes: int,
    ) -> np.ndarray:
        """Per-class threshold = average self-confidence for that class."""
        thresholds = np.zeros(num_classes)
        for cls in range(num_classes):
            mask = np.array(labels) == cls
            if mask.sum() > 0:
                thresholds[cls] = probs[mask, cls].mean()
        return thresholds

    def _estimate_joint(
        self,
        probs: np.ndarray,
        labels: list[int],
        thresholds: np.ndarray,
        num_classes: int,
    ) -> np.ndarray:
        """Estimate Q(noisy_label, true_label) joint distribution."""
        n = len(labels)
        joint = np.zeros((num_classes, num_classes))

        for i in range(n):
            noisy_y = labels[i]
            for true_y in range(num_classes):
                if probs[i, true_y] >= thresholds[true_y]:
                    joint[noisy_y, true_y] += 1

        # Normalize
        joint /= joint.sum()
        return joint

    def _identify_errors(
        self,
        probs: np.ndarray,
        labels: list[int],
        thresholds: np.ndarray,
        num_classes: int,
    ) -> tuple[list[int], list[float]]:
        """Find examples where predicted class != noisy label."""
        errors = []
        scores = []

        for i in range(len(labels)):
            noisy_y = labels[i]
            # Competing classes whose probability exceeds their per-class threshold
            competitors = [
                j for j in range(num_classes)
                if j != noisy_y and probs[i, j] >= thresholds[j]
            ]
            if competitors:
                # Normalized margin vs. the strongest competitor: how wrong is the label?
                best_j = max(competitors, key=lambda j: probs[i, j])
                margin = probs[i, noisy_y] - probs[i, best_j]
                errors.append(i)
                scores.append(-margin)  # Higher score = more likely a label error

        # Sort by error likelihood (most negative margin first)
        sorted_pairs = sorted(zip(errors, scores), key=lambda x: -x[1])
        return [p[0] for p in sorted_pairs], [p[1] for p in sorted_pairs]

    def _cross_val_predict_proba(
        self,
        texts: list[str],
        labels: list[int],
        num_classes: int,
    ) -> np.ndarray:
        """Get out-of-fold predicted probabilities using k-fold CV."""
        probs = np.zeros((len(texts), num_classes))
        kfold = StratifiedKFold(n_splits=self.n_folds, shuffle=True, random_state=42)

        for fold, (train_idx, val_idx) in enumerate(kfold.split(texts, labels)):
            print(f"Fold {fold+1}/{self.n_folds}")

            # Train on fold
            tokenizer = AutoTokenizer.from_pretrained(self.model_name)
            model = AutoModelForSequenceClassification.from_pretrained(
                self.model_name, num_labels=num_classes,
            )

            train_texts = [texts[i] for i in train_idx]
            train_labels = [labels[i] for i in train_idx]
            val_texts = [texts[i] for i in val_idx]

            # Simple training loop (abbreviated)
            self._train_fold(model, tokenizer, train_texts, train_labels)

            # Predict on validation fold
            model.eval()
            for i, vi in enumerate(val_idx):
                inputs = tokenizer(
                    val_texts[i], return_tensors="pt",
                    truncation=True, max_length=128,
                )
                with torch.no_grad():
                    logits = model(**inputs).logits
                probs[vi] = torch.softmax(logits, dim=-1).numpy().squeeze()

        return probs

    def _train_fold(self, model, tokenizer, texts, labels, epochs=3):
        """Quick fine-tune for one fold (simplified)."""
        optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
        model.train()
        for _ in range(epochs):
            for text, label in zip(texts, labels):
                inputs = tokenizer(
                    text, return_tensors="pt",
                    truncation=True, max_length=128, padding="max_length",
                )
                inputs["labels"] = torch.tensor([label])
                loss = model(**inputs).loss
                loss.backward()
                optimizer.step()
                optimizer.zero_grad()
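
A minimal usage sketch for the cleaner above; `texts` and `noisy_labels` stand in for whatever the annotation store returns and are not defined here:

# Hypothetical driver code for the confident-learning cleaner.
cleaner = ConfidentLearningCleaner(model_name="distilbert-base-uncased", n_folds=5)
report = cleaner.find_label_errors(texts, noisy_labels, num_classes=15)

print("Estimated noise per class:", report["per_class_noise"])
for idx in report["error_indices"][:500]:   # top candidates go to human review
    print(noisy_labels[idx], texts[idx])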

Active Learning Pipeline

import numpy as np
import torch
from scipy.stats import entropy
from sklearn.cluster import MiniBatchKMeans


class ActiveLearningSelector:
    """
    Select the most informative examples for human annotation.
    Combines uncertainty sampling with diversity via k-means++.
    """

    def __init__(
        self,
        model,
        tokenizer,
        uncertainty_weight: float = 0.7,
    ):
        self.model = model
        self.tokenizer = tokenizer
        # λ for the weighted score; the cluster-based selection below uses uncertainty
        # within clusters rather than the explicit λ-weighted sum
        self.uncertainty_weight = uncertainty_weight

    def select_batch(
        self,
        unlabeled_texts: list[str],
        batch_size: int = 50,
        candidate_pool_size: int = 500,
    ) -> list[int]:
        """
        Select a batch of examples for annotation.

        Strategy:
        1. Score all by uncertainty (entropy)
        2. Take top-N candidates (N >> batch_size)
        3. Apply diversity filtering via k-means++

        Returns: indices into unlabeled_texts
        """
        # Step 1: Compute uncertainty for all unlabeled examples
        uncertainties = self._compute_uncertainty(unlabeled_texts)

        # Step 2: Select top candidates by uncertainty
        candidate_indices = np.argsort(uncertainties)[-candidate_pool_size:]

        # Step 3: Compute embeddings for diversity
        embeddings = self._get_embeddings(
            [unlabeled_texts[i] for i in candidate_indices]
        )

        # Step 4: k-means++ for diversity
        selected = self._diverse_select(
            candidate_indices, embeddings, uncertainties, batch_size,
        )

        return selected

    def _compute_uncertainty(self, texts: list[str]) -> np.ndarray:
        """Compute entropy-based uncertainty for each text."""
        self.model.eval()
        uncertainties = []

        for text in texts:
            inputs = self.tokenizer(
                text, return_tensors="pt", truncation=True, max_length=128,
            )
            with torch.no_grad():
                logits = self.model(**inputs).logits
            probs = torch.softmax(logits, dim=-1).numpy().squeeze()
            uncertainties.append(entropy(probs))

        return np.array(uncertainties)

    def _get_embeddings(self, texts: list[str]) -> np.ndarray:
        """Get CLS token embeddings for diversity computation."""
        self.model.eval()
        embeddings = []

        for text in texts:
            inputs = self.tokenizer(
                text, return_tensors="pt", truncation=True, max_length=128,
            )
            with torch.no_grad():
                outputs = self.model(**inputs, output_hidden_states=True)
            cls_embedding = outputs.hidden_states[-1][:, 0, :].numpy().squeeze()
            embeddings.append(cls_embedding)

        return np.stack(embeddings)

    def _diverse_select(
        self,
        indices: np.ndarray,
        embeddings: np.ndarray,
        uncertainties: np.ndarray,
        batch_size: int,
    ) -> list[int]:
        """Select diverse subset using k-means++ initialization."""
        # Cluster candidates into batch_size clusters
        kmeans = MiniBatchKMeans(n_clusters=batch_size, random_state=42)
        kmeans.fit(embeddings)

        # From each cluster, select the most uncertain example
        selected = []
        for cluster_id in range(batch_size):
            cluster_mask = kmeans.labels_ == cluster_id
            cluster_indices = np.where(cluster_mask)[0]

            if len(cluster_indices) == 0:
                continue

            # Most uncertain in this cluster
            cluster_uncertainties = uncertainties[indices[cluster_indices]]
            best_in_cluster = cluster_indices[np.argmax(cluster_uncertainties)]
            selected.append(int(indices[best_in_cluster]))

        return selected

Synthetic Data Generator (Self-Instruct)

import json
import hashlib
import random
import boto3
from datasketch import MinHash, MinHashLSH


class SyntheticDataGenerator:
    """
    Generate synthetic training data using Self-Instruct pattern.
    Claude 3.5 Sonnet generates diverse variations of seed examples.
    """

    def __init__(self):
        self.bedrock = boto3.client("bedrock-runtime")
        # MinHash LSH for near-duplicate detection
        self.lsh = MinHashLSH(threshold=0.85, num_perm=128)
        self._registered_hashes = {}

    def generate(
        self,
        seed_examples: list[dict],
        target_count: int = 2000,
        batch_size: int = 10,
    ) -> list[dict]:
        """
        Generate synthetic examples from seed data.

        Args:
            seed_examples: [{"text": "...", "label": "..."}]
            target_count: total synthetic examples to generate
            batch_size: examples per LLM call
        """
        synthetic = []
        attempts = 0
        max_attempts = target_count * 3  # Allow 3× retries

        # Group seeds by label
        seeds_by_label = {}
        for ex in seed_examples:
            seeds_by_label.setdefault(ex["label"], []).append(ex)

        while len(synthetic) < target_count and attempts < max_attempts:
            # Rotate through labels evenly
            for label, seeds in seeds_by_label.items():
                if len(synthetic) >= target_count:
                    break

                # Sample 3-5 seed examples for context
                context_seeds = random.sample(seeds, min(5, len(seeds)))
                context_text = "\n".join(
                    f"- {s['text']}" for s in context_seeds
                )

                prompt = f"""Generate {batch_size} diverse customer queries for a manga bookstore chatbot.
These should be variations of the intent: {label}

Example queries for this intent:
{context_text}

Requirements:
1. Each query should be natural and conversational
2. Vary the products mentioned (different manga titles, authors)
3. Vary the complexity (simple vs multi-part questions)
4. Include typos and informal language occasionally
5. Cover different customer demographics

Output as JSON array: [{{"text": "...", "label": "{label}"}}]"""

                response = self.bedrock.invoke_model(
                    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
                    body=json.dumps({
                        "anthropic_version": "bedrock-2023-05-01",
                        "max_tokens": 1000,
                        "messages": [{"role": "user", "content": prompt}],
                    }),
                )

                text = json.loads(response["body"].read())["content"][0]["text"]

                # Parse JSON from response
                try:
                    # Handle cases where LLM wraps JSON in markdown
                    if "```json" in text:
                        text = text.split("```json")[1].split("```")[0]
                    generated = json.loads(text)
                except json.JSONDecodeError:
                    attempts += batch_size
                    continue

                # Quality filtering
                for ex in generated:
                    if self._passes_quality_checks(ex):
                        synthetic.append(ex)

                attempts += batch_size

        print(f"Generated {len(synthetic)} synthetic examples "
              f"from {len(seed_examples)} seeds ({len(synthetic)/len(seed_examples):.1f}× amplification)")

        return synthetic

    def _passes_quality_checks(self, example: dict) -> bool:
        """Apply quality filters to synthetic examples."""
        text = example.get("text", "")

        # Check 1: Minimum length
        if len(text.split()) < 4:
            return False

        # Check 2: Maximum length (avoid essays)
        if len(text.split()) > 100:
            return False

        # Check 3: Near-duplicate detection via MinHash
        mh = MinHash(num_perm=128)
        for word in text.lower().split():
            mh.update(word.encode("utf8"))

        # Check against existing hashes
        text_hash = hashlib.md5(text.lower().encode()).hexdigest()
        if text_hash in self._registered_hashes:
            return False

        # Check LSH for near-duplicates
        result = self.lsh.query(mh)
        if result:
            return False

        # Register this example
        self._registered_hashes[text_hash] = True
        self.lsh.insert(text_hash, mh)

        return True

Distribution Shift Detector

import numpy as np
from collections import Counter
from scipy import stats


class DistributionShiftDetector:
    """
    Monitor for distribution drift between training and production data.
    Triggers retraining when significant shifts are detected.
    """

    def __init__(self, baseline_embeddings: np.ndarray, baseline_labels: list[int]):
        self.baseline_embeddings = baseline_embeddings
        self.baseline_labels = baseline_labels
        self.baseline_label_dist = Counter(baseline_labels)
        self.baseline_centroid = baseline_embeddings.mean(axis=0)
        self.baseline_cov_inv = np.linalg.pinv(
            np.cov(baseline_embeddings.T) + 1e-6 * np.eye(baseline_embeddings.shape[1])
        )

    def check_shift(
        self,
        current_embeddings: np.ndarray,
        current_labels: list[int],
    ) -> dict:
        """
        Run all shift detection checks.

        Returns:
            {
                "intent_shift": {"detected": bool, "p_value": float},
                "embedding_drift": {"detected": bool, "mmd_score": float},
                "ood_rate": {"detected": bool, "ood_fraction": float},
                "action": str,
            }
        """
        results = {}

        # 1. Intent frequency shift (chi-squared test)
        results["intent_shift"] = self._check_intent_shift(current_labels)

        # 2. Embedding drift (Maximum Mean Discrepancy)
        results["embedding_drift"] = self._check_embedding_drift(current_embeddings)

        # 3. OOD detection (Mahalanobis distance)
        results["ood_rate"] = self._check_ood_rate(current_embeddings)

        # Determine action
        num_alerts = sum(
            1 for v in results.values()
            if isinstance(v, dict) and v.get("detected")
        )

        if num_alerts >= 2:
            results["action"] = "RETRAIN: Multiple drift signals detected"
        elif num_alerts == 1:
            results["action"] = "MONITOR: Increase active learning rate to 5%"
        else:
            results["action"] = "OK: No significant drift"

        return results

    def _check_intent_shift(self, current_labels: list[int]) -> dict:
        """Chi-squared test for intent distribution shift."""
        current_dist = Counter(current_labels)
        all_classes = set(self.baseline_label_dist.keys()) | set(current_dist.keys())

        # Add-one smoothing so intents unseen at baseline don't produce zero expected counts
        baseline_counts = [self.baseline_label_dist.get(c, 0) + 1 for c in all_classes]
        current_counts = [current_dist.get(c, 0) for c in all_classes]

        # Scale baseline counts to the same total as the current window
        total_b = sum(baseline_counts) or 1
        total_c = sum(current_counts) or 1
        expected = [b * total_c / total_b for b in baseline_counts]

        chi2, p_value = stats.chisquare(current_counts, f_exp=expected)

        return {"detected": p_value < 0.01, "p_value": float(p_value), "chi2": float(chi2)}

    def _check_embedding_drift(self, current_embeddings: np.ndarray) -> dict:
        """Maximum Mean Discrepancy between baseline and current embeddings."""
        # Subsample for efficiency
        n = min(1000, len(self.baseline_embeddings), len(current_embeddings))
        X = self.baseline_embeddings[:n]
        Y = current_embeddings[:n]

        # RBF kernel MMD
        gamma = 1.0 / X.shape[1]

        def mean_rbf(A: np.ndarray, B: np.ndarray) -> float:
            # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b, avoiding an (n, n, d) intermediate
            sq_dists = (
                np.sum(A ** 2, axis=1)[:, None]
                + np.sum(B ** 2, axis=1)[None, :]
                - 2.0 * A @ B.T
            )
            return float(np.exp(-gamma * sq_dists).mean())

        xx = mean_rbf(X, X)
        yy = mean_rbf(Y, Y)
        xy = mean_rbf(X, Y)

        mmd_squared = xx - 2 * xy + yy

        return {"detected": mmd_squared > 0.05, "mmd_score": float(mmd_squared)}

    def _check_ood_rate(self, current_embeddings: np.ndarray) -> dict:
        """Detect OOD examples via Mahalanobis distance."""
        distances = []
        for emb in current_embeddings:
            diff = emb - self.baseline_centroid
            maha = np.sqrt(diff @ self.baseline_cov_inv @ diff)
            distances.append(maha)

        # Threshold: 95th percentile of baseline distances
        baseline_dists = []
        for emb in self.baseline_embeddings[:1000]:
            diff = emb - self.baseline_centroid
            maha = np.sqrt(diff @ self.baseline_cov_inv @ diff)
            baseline_dists.append(maha)

        threshold = np.percentile(baseline_dists, 95)
        ood_fraction = sum(1 for d in distances if d > threshold) / len(distances)

        return {"detected": ood_fraction > 0.10, "ood_fraction": float(ood_fraction)}

Group Discussion: Key Decision Points

Decision Point 1: Human Labeling Budget Allocation

Sam (PM): We have $500/month for labeling. How do we allocate?

| Strategy | Examples/month | Quality | Model Improvement |
|---|---|---|---|
| Random sampling | 3,333 (@ $0.15) | Low (8% noise) | +0.8%/month |
| Active learning | 3,333 (@ $0.15) | Medium (5% noise) | +1.5%/month |
| Active + 3 annotators | 1,111 (@ $0.45) | High (2% noise) | +1.2%/month |
| Active + 2 annotators + CL | 1,667 (@ $0.30) | High (2.5% noise) | +1.4%/month |

Priya (ML Engineer): Active learning + 2 annotators + Confident Learning gives the best balance: 1,667 examples/month, 2.5% noise, +1.4% accuracy improvement per month.

Aiko (Data Scientist): With 2 annotators, we use Cohen's $\kappa$ for agreement. If $\kappa < 0.60$, we add a third annotator for that batch (raises cost to $0.45 but only for hard cases, ~20% of batches).

Resolution: $500/month = 2 annotators × 1,667 examples + CL cleanup + 3rd annotator for $\kappa < 0.60$ batches. Expected: +1.4%/month improvement.

Decision Point 2: Synthetic Data Ratio

Aiko (Data Scientist): How much synthetic data relative to real?

| Ratio (synth:real) | Intent Accuracy | Embedding Recall | Notes |
|---|---|---|---|
| 0:1 (real only) | 92.1% | 0.82 | Baseline |
| 1:1 | 93.2% | 0.84 | Safe default |
| 3:1 | 93.8% | 0.85 | Optimal |
| 5:1 | 93.9% | 0.85 | Diminishing returns |
| 10:1 | 93.4% | 0.83 | Quality degradation |

Marcus (Architect): 3:1 synthetic-to-real is optimal. Beyond that, the synthetic data starts to reinforce its own biases (the model learns Claude's phrasing patterns rather than real customer language).

Priya (ML Engineer): We should weight synthetic examples at 0.5× during training (doc 02, contrastive learning showed this helps). So 3:1 ratio with 0.5× weight gives effective 1.5:1 weighting.

Resolution: 3:1 synthetic-to-real ratio, 0.5× sample weight for synthetic, regenerate synthetic data every 4 weeks to capture new products. Monthly synthetic cost: ~$12 (Claude API calls).
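
A minimal sketch of the 0.5× synthetic sample weight as a weighted cross-entropy; function and tensor names are illustrative, and a weighted data sampler is an equivalent alternative:

import torch
import torch.nn.functional as F

def weighted_ce_loss(logits: torch.Tensor, labels: torch.Tensor,
                     is_synthetic: torch.Tensor, synthetic_weight: float = 0.5) -> torch.Tensor:
    """Cross-entropy where each synthetic example contributes synthetic_weight."""
    per_example = F.cross_entropy(logits, labels, reduction="none")
    weights = torch.where(
        is_synthetic,                                   # bool mask, shape (batch,)
        torch.full_like(per_example, synthetic_weight),
        torch.ones_like(per_example),
    )
    return (weights * per_example).sum() / weights.sum()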

Decision Point 3: Retraining Trigger Criteria

Jordan (MLOps): When should we trigger automatic retraining?

| Trigger | Threshold | Check Frequency | Action |
|---|---|---|---|
| Intent distribution shift | χ² p < 0.01 | Daily | Alert + increase AL rate |
| Embedding drift (MMD) | MMD² > 0.05 | Weekly | Queue retraining |
| OOD rate | > 10% of queries | Daily | Alert + flag novel queries |
| Accuracy drop | > 2% degradation | Continuous | Emergency retrain |
| Time-based | 4 weeks since last train | N/A | Scheduled retrain |

Marcus (Architect): Implement a 2-strike rule: any single trigger → alert + monitor. Two triggers in the same week → automatic retraining. This prevents both over-retraining (expensive) and under-retraining (quality degradation).

Resolution: 2-strike rule with weekly aggregation. Scheduled monthly retraining as a safety net. Estimated retraining cost: $85/cycle (from doc 09). Budget: $255/quarter for 3 scheduled + emergency cycles.
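
A minimal sketch of the 2-strike aggregation, consuming the `check_shift` output from the implementation section (class and method names are illustrative):

from collections import deque
from datetime import datetime, timedelta

class TwoStrikePolicy:
    """Alert on a single drift signal; trigger retraining on two signals within one week."""

    def __init__(self, window_days: int = 7):
        self.window = timedelta(days=window_days)
        self.strikes: deque = deque()

    def record(self, shift_results: dict, now: datetime) -> str:
        fired = [
            k for k, v in shift_results.items()
            if isinstance(v, dict) and v.get("detected")
        ]
        self.strikes.extend(now for _ in fired)
        while self.strikes and now - self.strikes[0] > self.window:
            self.strikes.popleft()          # drop strikes older than the window
        if len(self.strikes) >= 2:
            return "RETRAIN"
        return "ALERT" if self.strikes else "OK"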


Research Paper References

1. Confident Learning: Estimating Uncertainty in Dataset Labels (Northcutt et al., 2021)

Key contribution: A principled framework for finding label errors without knowing the true labels. Uses out-of-fold predictions to estimate the joint distribution between noisy and true labels, then identifies likely errors via per-class thresholds. The method found label errors in 10 of the most-cited benchmarks (ImageNet, CIFAR-100, etc.), with 54% precision at identifying real errors. Key insight: dataset quality is often the bottleneck, not model architecture.

Relevance to MangaAssist: Our annotation pipeline starts with noisy labels from crowdworkers. Confident Learning identifies the 8-12% noise rate and corrects it to 2-3%, which directly improves all downstream models (intent: +4.8%, sentiment: +3.2%).

2. Self-Instruct: Aligning Language Models with Self-Generated Instructions (Wang et al., 2022)

Key contribution: A framework for bootstrapping instruction-following data by having the LLM generate new instructions and responses from a small seed set. Starting with 175 seed tasks, the method generates 52K instructions with 82% quality (judged by humans). This enables fine-tuning at a fraction of the cost of fully human-labeled data.

Relevance to MangaAssist: Self-Instruct powers our synthetic data pipeline. We use Claude 3.5 Sonnet to generate 10× amplification from 200 seed examples per intent class. Combined with quality filtering (MinHash dedup, classifier confidence, spot-checks), this gives us 24,000 high-quality synthetic examples for $12/month.

3. Alpaca: A Strong, Replicable Instruction-Following Model (Taori et al., 2023)

Key contribution: Demonstrated that instruction-tuning a 7B LLaMA model on 52K Self-Instruct-generated examples (at a cost of $500) achieves GPT-3.5-comparable performance. The key insight: quality and diversity of instruction data matters more than quantity. 52K well-crafted examples outperform millions of noisy ones.

Relevance to MangaAssist: Alpaca validates our synthetic data strategy. Our 29,000 examples (5K real + 24K synthetic) at $12/month is an order of magnitude more efficient than the Alpaca approach, because we: (1) start with real production data as seeds (higher quality than synthetic seeds), (2) use Claude 3.5 Sonnet (better than text-davinci-003), and (3) apply Confident Learning filtering.


Production Results

Data Flywheel Impact Over 6 Months

| Month | Labeled Data | Noise Rate | Intent Accuracy | Embedding Recall@3 | Hallucination |
|---|---|---|---|---|---|
| 0 (launch) | 2,000 | 12% | 87.3% | 0.74 | 5.2% |
| 1 | 5,000 + 12K syn | 8% → 3% (CL) | 92.1% | 0.82 | 2.8% |
| 2 | 6,667 + 16K syn | 2.5% | 93.2% | 0.84 | 2.1% |
| 3 | 8,334 + 20K syn | 2.2% | 93.8% | 0.85 | 1.9% |
| 4 | 10,001 + 24K syn | 2.0% | 94.5% | 0.86 | 1.5% |
| 5 | 11,668 + 28K syn | 1.8% | 95.0% | 0.87 | 1.2% |
| 6 | 13,335 + 32K syn | 1.7% | 95.2% | 0.88 | 1.1% |

Cost

| Monthly Item | Cost |
|---|---|
| Human labeling (1,667 examples × 2 annotators) | $500 |
| Synthetic generation (Claude API) | $12 |
| Confident Learning compute | $5 |
| Active learning compute | $3 |
| Distribution shift monitoring | $2 |
| Total monthly data cost | $522 |

ROI: Quality improvement from month 0 to month 6: 87.3% → 95.2% intent accuracy = +7.9 points. Revenue impact of 1% accuracy improvement ≈ $800/month (fewer escalations, higher conversion). Total improvement ROI: 7.9 × $800 = $6,320/month revenue impact vs $522/month cost = 12.1:1 ROI.