16. Data Curation, Synthetic Generation & Active Learning — Building the Data Flywheel
Problem Statement and MangaAssist Context
Every fine-tuning scenario in this series (docs 01-15) depends on high-quality training data. The reality for MangaAssist:
- Limited labeled data: We have ~5,000 labeled customer interactions from the first 3 months of operation
- Label noise (estimated 8-12%): Human annotators disagree on intent labels, sentiment, and relevance judgments
- Distribution shift: The manga catalog updates weekly with new titles, volume releases, and seasonal trends
- Long-tail coverage: 60% of queries fall into 5 common intents, but the remaining 40% span 20+ rare intents with <50 examples each
The data flywheel addresses all of these: production data → sample → label → clean → augment → train → deploy → collect more production data → repeat. This document covers the math, pipelines, and operational practices for building that flywheel.
Data Quality Impact on Model Performance
| Training Data Quality | Intent Accuracy | Embedding Recall@3 | LLM Hallucination Rate |
|---|---|---|---|
| Raw (8-12% noise) | 87.3% | 0.74 | 5.2% |
| Cleaned (2-3% noise) | 92.1% | 0.82 | 2.8% |
| Cleaned + augmented | 93.8% | 0.85 | 1.9% |
| Cleaned + augmented + active learning | 95.2% | 0.88 | 1.1% |
Mathematical Foundations
Confident Learning — Label Noise Estimation (Northcutt et al., 2021)
Goal: Identify mislabeled examples without knowing the true labels.
Step 1: Estimate the Joint Distribution
Given $n$ examples with noisy labels $\tilde{y}$ and model-predicted probabilities $\hat{p}$, estimate the joint distribution $Q_{\tilde{y}, y^*}$ between noisy labels and true labels:
$$\hat{Q}_{\tilde{y}=i, y^*=j} = \frac{|\hat{X}_{\tilde{y}=i, y^*=j}|}{n}$$
where:
$$\hat{X}_{\tilde{y}=i, y^*=j} = \left\{x : \tilde{y} = i \text{ and } \hat{p}(y=j|x) \geq t_j\right\}$$
and $t_j$ is the per-class threshold: the average self-confidence of examples with noisy label $j$:
$$t_j = \frac{1}{|\{x : \tilde{y}=j\}|} \sum_{x : \tilde{y}=j} \hat{p}(y=j|x)$$
Intuition: If the model predicts class $j$ with high confidence but the label says class $i$, the label is likely wrong. The per-class threshold adapts to class difficulty — easy classes have high thresholds, hard classes have low ones.
Step 2: Identify Label Errors
An example $(x, \tilde{y}=i)$ is a candidate label error if:
$$\hat{p}(y=j|x) \geq t_j \text{ for some } j \neq i$$
AND it appears in the off-diagonal of $\hat{Q}$ (noisy label ≠ predicted true label).
Step 3: Rank by Error Likelihood
Rank candidate errors by the normalized margin:
$$\text{error\_score}(x) = \hat{p}(y=\tilde{y}|x) - \max_{j \neq \tilde{y}} \hat{p}(y=j|x)$$
Negative margin = high confidence that the label is wrong. Most negative scores are reviewed first.
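To make Steps 1 through 3 concrete, here is a minimal NumPy sketch on a toy 3-class problem (the probabilities and labels are illustrative, not MangaAssist data); the full k-fold pipeline appears in the Implementation Deep-Dive below.

```python
import numpy as np

probs = np.array([          # out-of-fold p̂(y|x) for 5 examples (toy numbers)
    [0.90, 0.05, 0.05],
    [0.20, 0.70, 0.10],
    [0.10, 0.85, 0.05],     # noisy label says class 0 -> likely mislabeled
    [0.60, 0.30, 0.10],
    [0.05, 0.10, 0.85],
])
noisy = np.array([0, 1, 0, 0, 2])

# Step 1-2: per-class thresholds t_j = mean self-confidence of examples labeled j
t = np.array([probs[noisy == j, j].mean() for j in range(3)])

# Step 3: candidate errors are examples where some other class exceeds its threshold;
# rank them by the (negative) margin defined above
for i, y in enumerate(noisy):
    over = [j for j in range(3) if j != y and probs[i, j] >= t[j]]
    if over:
        margin = probs[i, y] - probs[i, over].max()
        print(f"example {i}: noisy label {y}, margin {margin:+.2f} -> review candidate")
```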
Annotation Agreement Metrics
Cohen's Kappa (2 annotators):
$$\kappa = \frac{p_o - p_e}{1 - p_e}$$
where:
- $p_o$ = observed agreement (fraction of items where annotators agree)
- $p_e$ = expected agreement by chance $= \sum_k p_{1k} \cdot p_{2k}$ (product of the annotators' marginal class probabilities, summed over classes)
Interpretation:
- $\kappa < 0.40$: Poor agreement — re-examine guidelines
- $0.40 \leq \kappa < 0.60$: Moderate — acceptable for initial labeling
- $0.60 \leq \kappa < 0.80$: Substantial — target for production labeling
- $\kappa \geq 0.80$: Almost perfect — gold standard
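For two annotators this is a one-liner; a minimal sketch with scikit-learn's `cohen_kappa_score` (the annotations are illustrative):

```python
from sklearn.metrics import cohen_kappa_score

# Two annotators labeling the same five queries (illustrative labels)
annotator_a = ["order", "product", "complaint", "product", "order"]
annotator_b = ["order", "product", "feedback",  "product", "order"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa = {kappa:.2f}")   # compare against the targets listed below
```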
Fleiss' Kappa (multiple annotators):
$$\kappa_F = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}$$
where:
$$\bar{P} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{r(r-1)} \sum_{k=1}^{K} n_{ik}(n_{ik} - 1)$$
$$\bar{P}_e = \sum_{k=1}^{K} \left(\frac{1}{nr} \sum_{i=1}^{n} n_{ik}\right)^2$$
with $n$ items, $r$ raters, $K$ categories, and $n_{ik}$ = number of raters who assigned item $i$ to category $k$.
MangaAssist annotation targets:
- Intent labels: $\kappa \geq 0.75$ (15 classes, clear definitions)
- Sentiment: $\kappa \geq 0.65$ (subjective, harder agreement)
- Relevance judgments: $\kappa \geq 0.70$ (binary relevant/irrelevant)
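A minimal sketch computing Fleiss' $\kappa$ directly from the formulas above, on an illustrative count matrix (rows are items, columns are categories, three raters per item):

```python
import numpy as np

# n_ik counts: rows = items, columns = categories, r = 3 raters per item (illustrative)
counts = np.array([
    [3, 0, 0],
    [2, 1, 0],
    [0, 3, 0],
    [1, 1, 1],
])
n, K = counts.shape
r = counts[0].sum()

P_i = (counts * (counts - 1)).sum(axis=1) / (r * (r - 1))   # per-item agreement
P_bar = P_i.mean()
p_k = counts.sum(axis=0) / (n * r)                          # category marginals
P_e_bar = (p_k ** 2).sum()

kappa_f = (P_bar - P_e_bar) / (1 - P_e_bar)
print(f"Fleiss' kappa = {kappa_f:.2f}")
```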
Active Learning — Uncertainty Sampling
Select the most informative examples for human labeling:
Least Confidence:
$$x^* = \arg\max_x \left(1 - \max_k \hat{p}(y=k|x)\right)$$
Margin Sampling:
$$x^* = \arg\min_x \left(\hat{p}(y=k_1|x) - \hat{p}(y=k_2|x)\right)$$
where $k_1, k_2$ are the top-2 predicted classes.
Entropy Sampling:
$$x^* = \arg\max_x H(\hat{p}(y|x)) = \arg\max_x \left(-\sum_k \hat{p}(y=k|x) \log \hat{p}(y=k|x)\right)$$
Batch Active Learning with Diversity:
To avoid selecting redundant examples, combine uncertainty with diversity. Use k-means++ on embeddings of the top-$N$ uncertain examples, then select cluster centroids:
$$\text{score}(x) = \lambda \cdot H(\hat{p}(y|x)) + (1-\lambda) \cdot \min_{x' \in S} \|e(x) - e(x')\|_2$$
where $S$ is the already-selected batch and $e(x)$ is the embedding. $\lambda = 0.7$ balances uncertainty and diversity.
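A minimal greedy sketch of this combined score on toy data. One practical caveat: entropy and L2 distance live on different scales, so in practice each term is usually normalized before mixing with $\lambda$; the production selector later in this document instead clusters the uncertain candidates and picks the most uncertain example per cluster.

```python
import numpy as np

def select_batch(probs, embeddings, batch_size, lam=0.7):
    """probs: (n, K) predicted class probabilities; embeddings: (n, d) query embeddings."""
    unc = -(probs * np.log(probs + 1e-12)).sum(axis=1)       # entropy H(p̂(y|x))
    selected = [int(np.argmax(unc))]                         # seed with the most uncertain query
    while len(selected) < batch_size:
        # min L2 distance from every candidate to the already-selected set S
        dists = np.linalg.norm(
            embeddings[:, None, :] - embeddings[selected][None, :, :], axis=-1
        ).min(axis=1)
        score = lam * unc + (1 - lam) * dists
        score[selected] = -np.inf                            # never reselect
        selected.append(int(np.argmax(score)))
    return selected

# Toy run: 200 candidates, 15 classes, 768-dim embeddings
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(15), size=200)
e = rng.normal(size=(200, 768))
print(select_batch(p, e, batch_size=5))
```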
Synthetic Data Quality Assessment
For generated synthetic examples, measure quality with:
Faithfulness Score (does the synthetic example match the distribution?):
$$\text{Faith}(x_{\text{syn}}) = \exp\left(-D_{KL}\left(P_{\text{real}} \,\|\, P_{\text{model}}(x_{\text{syn}})\right)\right)$$
Diversity Score (are synthetic examples varied enough?):
$$\text{Div}(S_{\text{syn}}) = \frac{1}{|S|^2} \sum_{x_i, x_j \in S} \left(1 - \cos(e(x_i), e(x_j))\right)$$
Utility Score (does adding synthetic data improve the model?):
$$\text{Util} = \text{Accuracy}(M_{\text{real+syn}}) - \text{Accuracy}(M_{\text{real only}})$$
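Minimal sketches of the Diversity and Utility scores (Faithfulness requires a density model of the real distribution, so it is not sketched here):

```python
import numpy as np

def diversity_score(embeddings: np.ndarray) -> float:
    """Div(S_syn): mean pairwise cosine distance over the synthetic set."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cosine = normed @ normed.T            # |S| x |S| similarity matrix
    return float((1.0 - cosine).mean())

def utility_score(acc_real_plus_syn: float, acc_real_only: float) -> float:
    """Util: held-out accuracy gain from adding the synthetic data."""
    return acc_real_plus_syn - acc_real_only

# From the data-quality table above: utility_score(0.938, 0.921) -> +0.017 (+1.7 points)
```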
Pipeline Diagrams
End-to-End Data Flywheel
graph TB
subgraph "Data Flywheel (Continuous Loop)"
PROD["Production Traffic<br>10K messages/day<br>Unlabeled customer queries"]
SAMPLE["Smart Sampling<br>Active learning selects<br>most informative 2%<br>(200 messages/day)"]
LABEL["Human Labeling<br>3 annotators per example<br>Fleiss κ ≥ 0.70 required<br>Cost: $0.15/label"]
CLEAN["Confident Learning<br>Identify noisy labels<br>Flag 8-12% for review<br>Reduce noise to 2-3%"]
AUGMENT["Synthetic Augmentation<br>Self-Instruct (Claude 3.5)<br>10× amplification factor<br>2,000 → 20,000 examples"]
TRAIN["Model Training<br>Fine-tune all models:<br>Intent, Embedding, Reranker,<br>Sentiment, LLM (docs 01-15)"]
DEPLOY["Deploy & Monitor<br>A/B test new model<br>Track quality metrics<br>Detect distribution shift"]
PROD --> SAMPLE --> LABEL --> CLEAN --> AUGMENT --> TRAIN --> DEPLOY
DEPLOY -->|"New production data"| PROD
end
style PROD fill:#e3f2fd
style SAMPLE fill:#fff9c4
style LABEL fill:#fff9c4
style CLEAN fill:#ffcdd2
style AUGMENT fill:#c8e6c9
style TRAIN fill:#c8e6c9
style DEPLOY fill:#e3f2fd
Confident Learning Pipeline
graph TB
subgraph "Confident Learning — Label Noise Detection"
DATA["5,000 labeled examples<br>Noisy labels ỹ from annotators"]
MODEL["Train DistilBERT k-fold<br>(k=5 cross-validation)<br>Get out-of-fold predictions p̂(y|x)"]
THRESH["Compute per-class thresholds<br>t_j = avg self-confidence<br>t_product = 0.92, t_order = 0.88<br>t_complaint = 0.75, ..."]
JOINT["Estimate joint Q(ỹ, y*)<br>Off-diagonal = label errors<br>Q(product→order) = 0.03<br>Q(complaint→feedback) = 0.05"]
RANK["Rank by normalized margin<br>margin = p̂(ỹ) - max p̂(j≠ỹ)<br>Most negative = most likely error"]
REVIEW["Human review top-N<br>N = 500 candidates<br>380 confirmed errors (76%)<br>120 correct labels (24%)"]
FIXED["Cleaned dataset<br>380 labels corrected<br>Noise: 8.2% → 2.4%"]
DATA --> MODEL --> THRESH --> JOINT --> RANK --> REVIEW --> FIXED
end
style DATA fill:#ffcdd2
style FIXED fill:#c8e6c9
style REVIEW fill:#fff9c4
Active Learning Selection Strategy
graph TB
subgraph "Active Learning — Batch Selection"
POOL["Unlabeled pool<br>10,000 new queries<br>from last week"]
EMBED["Embed all queries<br>DistilBERT → 768-dim<br>Compute model predictions"]
UNCERTAIN["Uncertainty ranking<br>Entropy H(p̂(y|x))<br>Top 500 most uncertain"]
DIVERSE["Diversity filtering<br>k-means++ on embeddings<br>Select K=50 cluster centroids"]
BATCH["Final batch: 50 examples<br>λ=0.7 uncertainty + 0.3 diversity<br>Covers all query types"]
HUMAN["Send to annotators<br>3 annotators × 50 queries<br>Cost: $22.50<br>Time: 2 hours"]
RETRAIN["Add to training set<br>Retrain intent classifier<br>+0.4% accuracy per cycle"]
POOL --> EMBED --> UNCERTAIN --> DIVERSE --> BATCH --> HUMAN --> RETRAIN
RETRAIN -->|"Next cycle (weekly)"| POOL
end
style BATCH fill:#c8e6c9
style HUMAN fill:#fff9c4
Synthetic Data Generation Pipeline
graph TB
subgraph "Self-Instruct Synthetic Pipeline"
SEED["Seed examples<br>200 high-quality<br>human-labeled examples<br>per intent class"]
PROMPT["Prompt template:<br>'Generate 10 customer queries<br>similar to these examples<br>but with different products,<br>phrasing, and complexity'"]
CLAUDE["Claude 3.5 Sonnet<br>Generates 10× per seed<br>200 seeds → 2,000 synthetic<br>per intent class"]
FILTER["Quality filtering:<br>1. Dedup (MinHash, 0.85 threshold)<br>2. Classifier confidence > 0.8<br>3. Fluency score > 0.9<br>4. Human spot-check (5%)"]
VALID["Validated synthetic set<br>1,600 per class (80% pass rate)<br>15 classes × 1,600 = 24,000<br>synthetic examples total"]
MERGE["Merge with real data:<br>5,000 real + 24,000 synthetic<br>Weight: 1.0 real, 0.5 synthetic<br>Effective: 5K + 12K = 17K examples"]
SEED --> PROMPT --> CLAUDE --> FILTER --> VALID --> MERGE
end
style SEED fill:#e3f2fd
style CLAUDE fill:#fff9c4
style FILTER fill:#ffcdd2
style VALID fill:#c8e6c9
Distribution Shift Detection
graph TB
subgraph "Distribution Shift Monitoring"
direction TB
BASELINE["Baseline distribution<br>(training data, week 0)"]
CURRENT["Current production<br>(this week's queries)"]
COMPARE["Compare distributions:<br>1. Intent frequency shift<br>2. Embedding drift (MMD)<br>3. OOD detection"]
INTENT["Intent frequency:<br>Baseline: product=32%<br>Current: product=28%, new_intent=5%<br>χ² test: p < 0.01 → SHIFT"]
EMBED["Embedding drift (MMD):<br>MMD² = E[k(x,x')] - 2E[k(x,y)] + E[k(y,y')]<br>where k = RBF kernel<br>MMD > threshold → DRIFT"]
OOD["OOD detection:<br>Mahalanobis distance from<br>training centroid<br>5% of queries are OOD → FLAG"]
ACTION["Triggered actions:<br>1. Alert MLOps team<br>2. Increase active learning rate 2%→5%<br>3. Queue retraining if drift > 2 weeks"]
BASELINE --> COMPARE
CURRENT --> COMPARE
COMPARE --> INTENT & EMBED & OOD
INTENT --> ACTION
EMBED --> ACTION
OOD --> ACTION
end
style INTENT fill:#ffcdd2
style EMBED fill:#fff9c4
style OOD fill:#ffcdd2
style ACTION fill:#c8e6c9
Implementation Deep-Dive
Confident Learning Implementation
import numpy as np
from sklearn.model_selection import StratifiedKFold
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from collections import Counter
class ConfidentLearningCleaner:
"""
Identify noisy labels using the Confident Learning framework.
Reference: Northcutt et al., "Confident Learning: Estimating
Uncertainty in Dataset Labels" (2021).
"""
def __init__(self, model_name: str = "distilbert-base-uncased", n_folds: int = 5):
self.model_name = model_name
self.n_folds = n_folds
def find_label_errors(
self,
texts: list[str],
noisy_labels: list[int],
num_classes: int,
) -> dict:
"""
Identify mislabeled examples using cross-validated predictions.
Returns:
{
"error_indices": list of suspected error indices,
"error_scores": confidence that each is an error,
"joint_matrix": estimated Q(noisy, true) matrix,
"per_class_noise": estimated noise rate per class,
}
"""
# Step 1: Get out-of-fold predicted probabilities
probs = self._cross_val_predict_proba(texts, noisy_labels, num_classes)
# Step 2: Compute per-class thresholds
thresholds = self._compute_thresholds(probs, noisy_labels, num_classes)
# Step 3: Estimate the joint distribution Q
joint = self._estimate_joint(probs, noisy_labels, thresholds, num_classes)
# Step 4: Find label errors (off-diagonal of Q)
error_indices, error_scores = self._identify_errors(
probs, noisy_labels, thresholds, num_classes,
)
# Step 5: Per-class noise rate
per_class_noise = {}
for cls in range(num_classes):
cls_indices = [i for i, y in enumerate(noisy_labels) if y == cls]
cls_errors = [i for i in error_indices if noisy_labels[i] == cls]
per_class_noise[cls] = (
len(cls_errors) / len(cls_indices) if cls_indices else 0
)
return {
"error_indices": error_indices,
"error_scores": error_scores,
"joint_matrix": joint,
"per_class_noise": per_class_noise,
}
def _compute_thresholds(
self,
probs: np.ndarray,
labels: list[int],
num_classes: int,
) -> np.ndarray:
"""Per-class threshold = average self-confidence for that class."""
thresholds = np.zeros(num_classes)
for cls in range(num_classes):
mask = np.array(labels) == cls
if mask.sum() > 0:
thresholds[cls] = probs[mask, cls].mean()
return thresholds
def _estimate_joint(
self,
probs: np.ndarray,
labels: list[int],
thresholds: np.ndarray,
num_classes: int,
) -> np.ndarray:
"""Estimate Q(noisy_label, true_label) joint distribution."""
n = len(labels)
joint = np.zeros((num_classes, num_classes))
for i in range(n):
noisy_y = labels[i]
for true_y in range(num_classes):
if probs[i, true_y] >= thresholds[true_y]:
joint[noisy_y, true_y] += 1
# Normalize
joint /= joint.sum()
return joint
def _identify_errors(
self,
probs: np.ndarray,
labels: list[int],
thresholds: np.ndarray,
num_classes: int,
) -> tuple[list[int], list[float]]:
"""Find examples where predicted class != noisy label."""
errors = []
scores = []
for i in range(len(labels)):
noisy_y = labels[i]
# Check if any other class exceeds its threshold
for j in range(num_classes):
if j != noisy_y and probs[i, j] >= thresholds[j]:
# Normalized margin: how wrong is the label?
margin = probs[i, noisy_y] - probs[i, j]
errors.append(i)
scores.append(-margin) # More negative = more likely error
break
# Sort by error likelihood (most negative margin first)
sorted_pairs = sorted(zip(errors, scores), key=lambda x: -x[1])
return [p[0] for p in sorted_pairs], [p[1] for p in sorted_pairs]
def _cross_val_predict_proba(
self,
texts: list[str],
labels: list[int],
num_classes: int,
) -> np.ndarray:
"""Get out-of-fold predicted probabilities using k-fold CV."""
probs = np.zeros((len(texts), num_classes))
kfold = StratifiedKFold(n_splits=self.n_folds, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kfold.split(texts, labels)):
print(f"Fold {fold+1}/{self.n_folds}")
# Train on fold
tokenizer = AutoTokenizer.from_pretrained(self.model_name)
model = AutoModelForSequenceClassification.from_pretrained(
self.model_name, num_labels=num_classes,
)
train_texts = [texts[i] for i in train_idx]
train_labels = [labels[i] for i in train_idx]
val_texts = [texts[i] for i in val_idx]
# Simple training loop (abbreviated)
self._train_fold(model, tokenizer, train_texts, train_labels)
# Predict on validation fold
model.eval()
for i, vi in enumerate(val_idx):
inputs = tokenizer(
val_texts[i], return_tensors="pt",
truncation=True, max_length=128,
)
with torch.no_grad():
logits = model(**inputs).logits
probs[vi] = torch.softmax(logits, dim=-1).numpy().squeeze()
return probs
def _train_fold(self, model, tokenizer, texts, labels, epochs=3):
"""Quick fine-tune for one fold (simplified)."""
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(epochs):
for text, label in zip(texts, labels):
inputs = tokenizer(
text, return_tensors="pt",
truncation=True, max_length=128, padding="max_length",
)
inputs["labels"] = torch.tensor([label])
loss = model(**inputs).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
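A hypothetical invocation of the cleaner; the labels.jsonl path and record format are placeholders for illustration:

```python
import json

# Assumed input: one {"text": ..., "label": <int id>} record per line
with open("labels.jsonl") as f:
    records = [json.loads(line) for line in f]
texts = [r["text"] for r in records]
noisy_labels = [r["label"] for r in records]

cleaner = ConfidentLearningCleaner(n_folds=5)
report = cleaner.find_label_errors(texts, noisy_labels, num_classes=15)

# Top-500 candidates (ranked most-likely-error first) go to human review,
# matching the Confident Learning pipeline diagram above
review_queue = report["error_indices"][:500]
print("Per-class noise estimates:", report["per_class_noise"])
```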
Active Learning Pipeline
import numpy as np
import torch
from scipy.stats import entropy
from sklearn.cluster import MiniBatchKMeans
class ActiveLearningSelector:
"""
Select the most informative examples for human annotation.
Combines uncertainty sampling with diversity via k-means++.
"""
def __init__(
self,
model,
tokenizer,
uncertainty_weight: float = 0.7,
):
self.model = model
self.tokenizer = tokenizer
self.uncertainty_weight = uncertainty_weight
def select_batch(
self,
unlabeled_texts: list[str],
batch_size: int = 50,
candidate_pool_size: int = 500,
) -> list[int]:
"""
Select a batch of examples for annotation.
Strategy:
1. Score all by uncertainty (entropy)
2. Take top-N candidates (N >> batch_size)
3. Apply diversity filtering via k-means++
Returns: indices into unlabeled_texts
"""
# Step 1: Compute uncertainty for all unlabeled examples
uncertainties = self._compute_uncertainty(unlabeled_texts)
# Step 2: Select top candidates by uncertainty
candidate_indices = np.argsort(uncertainties)[-candidate_pool_size:]
# Step 3: Compute embeddings for diversity
embeddings = self._get_embeddings(
[unlabeled_texts[i] for i in candidate_indices]
)
# Step 4: k-means++ for diversity
selected = self._diverse_select(
candidate_indices, embeddings, uncertainties, batch_size,
)
return selected
def _compute_uncertainty(self, texts: list[str]) -> np.ndarray:
"""Compute entropy-based uncertainty for each text."""
self.model.eval()
uncertainties = []
for text in texts:
inputs = self.tokenizer(
text, return_tensors="pt", truncation=True, max_length=128,
)
with torch.no_grad():
logits = self.model(**inputs).logits
probs = torch.softmax(logits, dim=-1).numpy().squeeze()
uncertainties.append(entropy(probs))
return np.array(uncertainties)
def _get_embeddings(self, texts: list[str]) -> np.ndarray:
"""Get CLS token embeddings for diversity computation."""
self.model.eval()
embeddings = []
for text in texts:
inputs = self.tokenizer(
text, return_tensors="pt", truncation=True, max_length=128,
)
with torch.no_grad():
outputs = self.model(**inputs, output_hidden_states=True)
cls_embedding = outputs.hidden_states[-1][:, 0, :].numpy().squeeze()
embeddings.append(cls_embedding)
return np.stack(embeddings)
def _diverse_select(
self,
indices: np.ndarray,
embeddings: np.ndarray,
uncertainties: np.ndarray,
batch_size: int,
) -> list[int]:
"""Select diverse subset using k-means++ initialization."""
# Cluster candidates into batch_size clusters
kmeans = MiniBatchKMeans(n_clusters=batch_size, random_state=42)
kmeans.fit(embeddings)
# From each cluster, select the most uncertain example
selected = []
for cluster_id in range(batch_size):
cluster_mask = kmeans.labels_ == cluster_id
cluster_indices = np.where(cluster_mask)[0]
if len(cluster_indices) == 0:
continue
# Most uncertain in this cluster
cluster_uncertainties = uncertainties[indices[cluster_indices]]
best_in_cluster = cluster_indices[np.argmax(cluster_uncertainties)]
selected.append(int(indices[best_in_cluster]))
return selected
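A hypothetical weekly selection run; the checkpoint path and the load_weekly_queries helper are placeholders, not artifacts defined elsewhere in this series:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholders: fine-tuned intent checkpoint and the week's unlabeled query pool
model = AutoModelForSequenceClassification.from_pretrained("path/to/intent-classifier")
tokenizer = AutoTokenizer.from_pretrained("path/to/intent-classifier")
unlabeled_texts = load_weekly_queries()   # hypothetical helper returning list[str]

selector = ActiveLearningSelector(model, tokenizer, uncertainty_weight=0.7)
batch = selector.select_batch(unlabeled_texts, batch_size=50, candidate_pool_size=500)
to_annotate = [unlabeled_texts[i] for i in batch]   # ship to the annotation queue
```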
Synthetic Data Generator (Self-Instruct)
import json
import random
import hashlib
import boto3
from datasketch import MinHash, MinHashLSH
class SyntheticDataGenerator:
"""
Generate synthetic training data using Self-Instruct pattern.
Claude 3.5 Sonnet generates diverse variations of seed examples.
"""
def __init__(self):
self.bedrock = boto3.client("bedrock-runtime")
# MinHash LSH for near-duplicate detection
self.lsh = MinHashLSH(threshold=0.85, num_perm=128)
self._registered_hashes = {}
def generate(
self,
seed_examples: list[dict],
target_count: int = 2000,
batch_size: int = 10,
) -> list[dict]:
"""
Generate synthetic examples from seed data.
Args:
seed_examples: [{"text": "...", "label": "..."}]
target_count: total synthetic examples to generate
batch_size: examples per LLM call
"""
synthetic = []
attempts = 0
max_attempts = target_count * 3 # Allow 3× retries
# Group seeds by label
seeds_by_label = {}
for ex in seed_examples:
seeds_by_label.setdefault(ex["label"], []).append(ex)
while len(synthetic) < target_count and attempts < max_attempts:
# Rotate through labels evenly
for label, seeds in seeds_by_label.items():
if len(synthetic) >= target_count:
break
# Sample 3-5 seed examples for context
context_seeds = random.sample(seeds, min(5, len(seeds)))
context_text = "\n".join(
f"- {s['text']}" for s in context_seeds
)
prompt = f"""Generate {batch_size} diverse customer queries for a manga bookstore chatbot.
These should be variations of the intent: {label}
Example queries for this intent:
{context_text}
Requirements:
1. Each query should be natural and conversational
2. Vary the products mentioned (different manga titles, authors)
3. Vary the complexity (simple vs multi-part questions)
4. Include typos and informal language occasionally
5. Cover different customer demographics
Output as JSON array: [{{"text": "...", "label": "{label}"}}]"""
response = self.bedrock.invoke_model(
modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
body=json.dumps({
"anthropic_version": "bedrock-2023-05-01",
"max_tokens": 1000,
"messages": [{"role": "user", "content": prompt}],
}),
)
text = json.loads(response["body"].read())["content"][0]["text"]
# Parse JSON from response
try:
# Handle cases where LLM wraps JSON in markdown
if "```json" in text:
text = text.split("```json")[1].split("```")[0]
generated = json.loads(text)
except json.JSONDecodeError:
attempts += batch_size
continue
# Quality filtering
for ex in generated:
if self._passes_quality_checks(ex):
synthetic.append(ex)
attempts += batch_size
print(f"Generated {len(synthetic)} synthetic examples "
f"from {len(seed_examples)} seeds ({len(synthetic)/len(seed_examples):.1f}× amplification)")
return synthetic
def _passes_quality_checks(self, example: dict) -> bool:
"""Apply quality filters to synthetic examples."""
text = example.get("text", "")
# Check 1: Minimum length
if len(text.split()) < 4:
return False
# Check 2: Maximum length (avoid essays)
if len(text.split()) > 100:
return False
# Check 3: Near-duplicate detection via MinHash
mh = MinHash(num_perm=128)
for word in text.lower().split():
mh.update(word.encode("utf8"))
# Check against existing hashes
text_hash = hashlib.md5(text.lower().encode()).hexdigest()
if text_hash in self._registered_hashes:
return False
# Check LSH for near-duplicates
result = self.lsh.query(mh)
if result:
return False
# Register this example
self._registered_hashes[text_hash] = True
self.lsh.insert(text_hash, mh)
return True
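A hypothetical generation run for a single intent class; the seed file path and format are assumptions for illustration:

```python
import json

# Assumed seed file: [{"text": "...", "label": "order_status"}, ...]
with open("seeds/order_status.json") as f:
    seeds = json.load(f)

generator = SyntheticDataGenerator()
synthetic = generator.generate(seeds, target_count=2000, batch_size=10)
# Downstream: apply the merge weighting from the pipeline diagram (1.0 real, 0.5 synthetic)
```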
Distribution Shift Detector
import numpy as np
from collections import Counter
from scipy import stats
class DistributionShiftDetector:
"""
Monitor for distribution drift between training and production data.
Triggers retraining when significant shifts are detected.
"""
def __init__(self, baseline_embeddings: np.ndarray, baseline_labels: list[int]):
self.baseline_embeddings = baseline_embeddings
self.baseline_labels = baseline_labels
self.baseline_label_dist = Counter(baseline_labels)
self.baseline_centroid = baseline_embeddings.mean(axis=0)
self.baseline_cov_inv = np.linalg.pinv(
np.cov(baseline_embeddings.T) + 1e-6 * np.eye(baseline_embeddings.shape[1])
)
def check_shift(
self,
current_embeddings: np.ndarray,
current_labels: list[int],
) -> dict:
"""
Run all shift detection checks.
Returns:
{
"intent_shift": {"detected": bool, "p_value": float},
"embedding_drift": {"detected": bool, "mmd_score": float},
"ood_rate": {"detected": bool, "ood_fraction": float},
"action": str,
}
"""
results = {}
# 1. Intent frequency shift (chi-squared test)
results["intent_shift"] = self._check_intent_shift(current_labels)
# 2. Embedding drift (Maximum Mean Discrepancy)
results["embedding_drift"] = self._check_embedding_drift(current_embeddings)
# 3. OOD detection (Mahalanobis distance)
results["ood_rate"] = self._check_ood_rate(current_embeddings)
# Determine action
num_alerts = sum(
1 for v in results.values()
if isinstance(v, dict) and v.get("detected")
)
if num_alerts >= 2:
results["action"] = "RETRAIN: Multiple drift signals detected"
elif num_alerts == 1:
results["action"] = "MONITOR: Increase active learning rate to 5%"
else:
results["action"] = "OK: No significant drift"
return results
def _check_intent_shift(self, current_labels: list[int]) -> dict:
"""Chi-squared test for intent distribution shift."""
current_dist = Counter(current_labels)
all_classes = set(self.baseline_label_dist.keys()) | set(current_dist.keys())
baseline_counts = [self.baseline_label_dist.get(c, 0) for c in all_classes]
current_counts = [current_dist.get(c, 0) for c in all_classes]
# Normalize to same total
total_b = sum(baseline_counts) or 1
total_c = sum(current_counts) or 1
expected = [b * total_c / total_b for b in baseline_counts]
chi2, p_value = stats.chisquare(current_counts, f_exp=expected)
return {"detected": p_value < 0.01, "p_value": float(p_value), "chi2": float(chi2)}
def _check_embedding_drift(self, current_embeddings: np.ndarray) -> dict:
"""Maximum Mean Discrepancy between baseline and current embeddings."""
# Subsample for efficiency
n = min(1000, len(self.baseline_embeddings), len(current_embeddings))
X = self.baseline_embeddings[:n]
Y = current_embeddings[:n]
# RBF kernel MMD
gamma = 1.0 / X.shape[1]
xx = np.exp(-gamma * np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)).mean()
yy = np.exp(-gamma * np.sum((Y[:, None] - Y[None, :]) ** 2, axis=-1)).mean()
xy = np.exp(-gamma * np.sum((X[:, None] - Y[None, :]) ** 2, axis=-1)).mean()
mmd_squared = xx - 2 * xy + yy
return {"detected": mmd_squared > 0.05, "mmd_score": float(mmd_squared)}
def _check_ood_rate(self, current_embeddings: np.ndarray) -> dict:
"""Detect OOD examples via Mahalanobis distance."""
distances = []
for emb in current_embeddings:
diff = emb - self.baseline_centroid
maha = np.sqrt(diff @ self.baseline_cov_inv @ diff)
distances.append(maha)
# Threshold: 95th percentile of baseline distances
baseline_dists = []
for emb in self.baseline_embeddings[:1000]:
diff = emb - self.baseline_centroid
maha = np.sqrt(diff @ self.baseline_cov_inv @ diff)
baseline_dists.append(maha)
threshold = np.percentile(baseline_dists, 95)
ood_fraction = sum(1 for d in distances if d > threshold) / len(distances)
return {"detected": ood_fraction > 0.10, "ood_fraction": float(ood_fraction)}
Group Discussion: Key Decision Points
Decision Point 1: Human Labeling Budget Allocation
Sam (PM): We have $500/month for labeling. How do we allocate?
| Strategy | Examples/month | Quality | Model Improvement |
|---|---|---|---|
| Random sampling | 3,333 (@ $0.15) | Low (8% noise) | +0.8%/month |
| Active learning | 3,333 (@ $0.15) | Medium (5% noise) | +1.5%/month |
| Active + 3 annotators | 1,111 (@ $0.45) | High (2% noise) | +1.2%/month |
| Active + 2 annotators + CL | 1,667 (@ $0.30) | High (2.5% noise) | +1.4%/month |
Priya (ML Engineer): Active learning + 2 annotators + Confident Learning gives the best balance: 1,667 examples/month, 2.5% noise, +1.4% accuracy improvement per month.
Aiko (Data Scientist): With 2 annotators, we use Cohen's $\kappa$ for agreement. If $\kappa < 0.60$, we add a third annotator for that batch (raises cost to $0.45 but only for hard cases, ~20% of batches).
Resolution: $500/month = 2 annotators × 1,667 examples + CL cleanup + 3rd annotator for $\kappa < 0.60$ batches. Expected: +1.4%/month improvement.
Decision Point 2: Synthetic Data Ratio
Aiko (Data Scientist): How much synthetic data relative to real?
| Ratio (synth:real) | Intent Accuracy | Embedding Recall | Notes |
|---|---|---|---|
| 0:1 (real only) | 92.1% | 0.82 | Baseline |
| 1:1 | 93.2% | 0.84 | Safe default |
| 3:1 | 93.8% | 0.85 | Optimal |
| 5:1 | 93.9% | 0.85 | Diminishing returns |
| 10:1 | 93.4% | 0.83 | Quality degradation |
Marcus (Architect): 3:1 synthetic-to-real is optimal. Beyond that, the synthetic data starts to reinforce its own biases (the model learns Claude's phrasing patterns rather than real customer language).
Priya (ML Engineer): We should weight synthetic examples at 0.5× during training (doc 02, contrastive learning showed this helps). So 3:1 ratio with 0.5× weight gives effective 1.5:1 weighting.
Resolution: 3:1 synthetic-to-real ratio, 0.5× sample weight for synthetic, regenerate synthetic data every 4 weeks to capture new products. Monthly synthetic cost: ~$12 (Claude API calls).
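A minimal sketch of the 0.5× synthetic weighting; the field names are assumptions, and the weight can feed either a per-sample loss multiplier or a weighted sampler:

```python
def merge_with_weights(real: list[dict], synthetic: list[dict]) -> list[dict]:
    """Tag each example with a sample weight: 1.0 for real, 0.5 for synthetic."""
    return (
        [{**ex, "weight": 1.0, "source": "real"} for ex in real]
        + [{**ex, "weight": 0.5, "source": "synthetic"} for ex in synthetic]
    )

# 5,000 real + 24,000 synthetic -> 5,000 + 12,000 = 17,000 effective examples,
# matching the merge step in the synthetic pipeline diagram
```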
Decision Point 3: Retraining Trigger Criteria
Jordan (MLOps): When should we trigger automatic retraining?
| Trigger | Threshold | Check Frequency | Action |
|---|---|---|---|
| Intent distribution shift | χ² p < 0.01 | Daily | Alert + increase AL rate |
| Embedding drift (MMD) | MMD² > 0.05 | Weekly | Queue retraining |
| OOD rate | > 10% of queries | Daily | Alert + flag novel queries |
| Accuracy drop | > 2% degradation | Continuous | Emergency retrain |
| Time-based | 4 weeks since last train | N/A | Scheduled retrain |
Marcus (Architect): Implement a 2-strike rule: any single trigger → alert + monitor. Two triggers in the same week → automatic retraining. This prevents both over-retraining (expensive) and under-retraining (quality degradation).
Resolution: 2-strike rule with weekly aggregation. Scheduled monthly retraining as a safety net. Estimated retraining cost: $85/cycle (from doc 09). Budget: $255/quarter for 3 scheduled + emergency cycles.
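A small sketch of how the 2-strike rule might be wired into the weekly monitoring job; the class and its return values are illustrative, not an artifact from earlier docs:

```python
from collections import deque
from datetime import datetime, timedelta

class RetrainTrigger:
    """Aggregate drift alerts over a rolling window; two strikes trigger retraining."""

    def __init__(self, window_days: int = 7):
        self.window = timedelta(days=window_days)
        self.alerts: deque = deque()   # timestamps of individual fired triggers

    def record(self, drift_report: dict, now: datetime) -> str:
        """drift_report is the dict returned by DistributionShiftDetector.check_shift()."""
        fired = [k for k, v in drift_report.items()
                 if isinstance(v, dict) and v.get("detected")]
        self.alerts.extend(now for _ in fired)
        # Drop alerts that fell outside the rolling window
        while self.alerts and now - self.alerts[0] > self.window:
            self.alerts.popleft()
        if len(self.alerts) >= 2:
            return "RETRAIN"           # two strikes within the window
        return "ALERT" if fired else "OK"
```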
Research Paper References
1. Confident Learning: Estimating Uncertainty in Dataset Labels (Northcutt et al., 2021)
Key contribution: A principled framework for finding label errors without knowing the true labels. Uses out-of-fold predictions to estimate the joint distribution between noisy and true labels, then identifies likely errors via per-class thresholds. The method found label errors in 10 of the most-cited benchmarks (ImageNet, CIFAR-100, etc.), with 54% precision at identifying real errors. Key insight: dataset quality is often the bottleneck, not model architecture.
Relevance to MangaAssist: Our annotation pipeline starts with noisy labels from crowdworkers. Confident Learning identifies the 8-12% noise rate and corrects it to 2-3%, which directly improves all downstream models (intent: +4.8%, sentiment: +3.2%).
2. Self-Instruct: Aligning Language Models with Self-Generated Instructions (Wang et al., 2022)
Key contribution: A framework for bootstrapping instruction-following data by having the LLM generate new instructions and responses from a small seed set. Starting with 175 seed tasks, the method generates 52K instructions with 82% quality (judged by humans). This enables fine-tuning at a fraction of the cost of fully human-labeled data.
Relevance to MangaAssist: Self-Instruct powers our synthetic data pipeline. We use Claude 3.5 Sonnet to generate 10× amplification from 200 seed examples per intent class. Combined with quality filtering (MinHash dedup, classifier confidence, spot-checks), this gives us 24,000 high-quality synthetic examples for $12/month.
3. Alpaca: A Strong, Replicable Instruction-Following Model (Taori et al., 2023)
Key contribution: Demonstrated that instruction-tuning a 7B LLaMA model on 52K Self-Instruct-generated examples (at a cost of $500) achieves GPT-3.5-comparable performance. The key insight: quality and diversity of instruction data matters more than quantity. 52K well-crafted examples outperform millions of noisy ones.
Relevance to MangaAssist: Alpaca validates our synthetic data strategy. Our 29,000-example dataset (5K real + 24K synthetic), built for $12/month in generation cost, is an order of magnitude more efficient than the Alpaca approach because we: (1) start with real production data as seeds (higher quality than synthetic seeds), (2) use Claude 3.5 Sonnet (better than text-davinci-003), and (3) apply Confident Learning filtering.
Production Results
Data Flywheel Impact Over 6 Months
| Month | Labeled Data | Noise Rate | Intent Acc | Embedding R@3 | Hallucination |
|---|---|---|---|---|---|
| 0 (launch) | 2,000 | 12% | 87.3% | 0.74 | 5.2% |
| 1 | 5,000 + 12K syn | 8% → 3% (CL) | 92.1% | 0.82 | 2.8% |
| 2 | 6,667 + 16K syn | 2.5% | 93.2% | 0.84 | 2.1% |
| 3 | 8,334 + 20K syn | 2.2% | 93.8% | 0.85 | 1.9% |
| 4 | 10,001 + 24K syn | 2.0% | 94.5% | 0.86 | 1.5% |
| 5 | 11,668 + 28K syn | 1.8% | 95.0% | 0.87 | 1.2% |
| 6 | 13,335 + 32K syn | 1.7% | 95.2% | 0.88 | 1.1% |
Cost
| Monthly Item | Cost |
|---|---|
| Human labeling (1,667 examples × 2 annotators) | $500 |
| Synthetic generation (Claude API) | $12 |
| Confident Learning compute | $5 |
| Active learning compute | $3 |
| Distribution shift monitoring | $2 |
| Total monthly data cost | $522 |
ROI: Quality improvement from month 0 to month 6: 87.3% → 95.2% intent accuracy = +7.9 points. Revenue impact of 1% accuracy improvement ≈ $800/month (fewer escalations, higher conversion). Total improvement ROI: 7.9 × $800 = $6,320/month revenue impact vs $522/month cost = 12.1:1 ROI.