07. Few-Shot Learning and Rapid Adaptation — New Intents with <100 Examples
Problem Statement and MangaAssist Context
MangaAssist needs to rapidly support new intent categories as business needs evolve. When anime adaptations surge, a "preorder_anime_adaptation" intent emerges with only 30-50 labeled examples. Traditional fine-tuning requires thousands of examples and hours of training. Few-shot learning lets the model recognize new intents from as few as 5 examples (and rarely more than 100), deploying in minutes rather than days.
The Cold-Start Problem
| Scenario | Available Examples | Required Accuracy | Time Budget |
|---|---|---|---|
| New intent: "preorder_anime" | 30 | ≥ 80% | < 1 hour |
| Seasonal: "holiday_gift_bundle" | 50 | ≥ 82% | < 2 hours |
| Emergency: "product_recall" | 10 | ≥ 75% | < 30 min |
| Market expansion: Japanese queries | 80 | ≥ 78% | < 4 hours |
Standard fine-tuning on 30 examples overfits severely (roughly 70% train vs. 45% test accuracy); few-shot techniques bridge this gap.
Mathematical Foundations
Prototypical Networks — Learning by Comparison
Prototypical Networks (Snell et al., 2017) classify by comparing a query to class "prototypes" — the mean embedding of each class's support examples.
Setup: Given $N$ classes, each with $K$ support examples (N-way K-shot):
- Support set $S = \{(\mathbf{x}_1^{(c)}, \ldots, \mathbf{x}_K^{(c)})\}_{c=1}^{N}$
- Query $\mathbf{x}_q$ to classify
Step 1: Compute prototypes
The prototype for class $c$ is the centroid of its support embeddings:
$$\mathbf{p}_c = \frac{1}{K} \sum_{k=1}^{K} f_\phi(\mathbf{x}_k^{(c)})$$
where $f_\phi$ is the embedding model (e.g., DistilBERT's [CLS] output).
Step 2: Distance-based classification
The probability that query $\mathbf{x}_q$ belongs to class $c$:
$$p(y = c \mid \mathbf{x}_q) = \frac{\exp(-d(f_\phi(\mathbf{x}_q), \mathbf{p}_c))}{\sum_{c'=1}^{N} \exp(-d(f_\phi(\mathbf{x}_q), \mathbf{p}_{c'}))}$$
where $d(\cdot, \cdot)$ is Euclidean distance:
$$d(\mathbf{a}, \mathbf{b}) = \|\mathbf{a} - \mathbf{b}\|_2 = \sqrt{\sum_i (a_i - b_i)^2}$$
Why Euclidean over cosine? Squared Euclidean distance is a Bregman divergence, and for Bregman divergences the class mean is the optimal cluster representative — which is exactly what the prototype is. Snell et al. found that Euclidean distance clearly outperformed cosine in their experiments. (Note that on L2-normalized embeddings the two are monotonically related, since $\|\mathbf{a} - \mathbf{b}\|_2^2 = 2 - 2\cos(\mathbf{a}, \mathbf{b})$.)
Intuition: Each class forms a "cluster" in embedding space. The prototype is the cluster center. Classification asks: "Which cluster center is closest to this query?" With a good embedding model, manga queries about preorders will cluster near the "preorder" prototype regardless of the specific title.
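The two steps above can be worked through with toy numbers. This is a minimal NumPy sketch; the 2-D "embeddings" are invented for illustration (real ones would be 768-d [CLS] vectors):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy 2-D "embeddings": three support examples per class (3-shot, 2-way)
support = {
    "order_status":   np.array([[0.9, 0.1], [1.1, 0.0], [1.0, 0.2]]),
    "preorder_anime": np.array([[0.0, 1.0], [0.1, 0.9], [-0.1, 1.1]]),
}

# Step 1: the prototype is the per-class mean of the support embeddings
prototypes = {c: emb.mean(axis=0) for c, emb in support.items()}

# Step 2: classify a query via softmax over negative Euclidean distances
query = np.array([0.05, 0.95])
classes = list(prototypes)
dists = np.array([np.linalg.norm(query - prototypes[c]) for c in classes])
probs = softmax(-dists)

print(dict(zip(classes, probs.round(3))))  # "preorder_anime" wins
```

The query sits near the "preorder_anime" cluster, so its prototype is closest and receives almost all of the probability mass.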
Episodic Training for Prototypical Networks
Standard training optimizes on (input, label) pairs. Episodic training mimics the few-shot evaluation protocol during training:
- Sample an episode: Pick $N$ random classes, sample $K$ examples per class (support) + $Q$ queries per class
- Compute prototypes from $K$ support examples
- Classify queries against prototypes
- Loss: Negative log-probability of correct classification:
$$\mathcal{L} = -\frac{1}{NQ} \sum_{c=1}^{N} \sum_{q=1}^{Q} \log p(y = c \mid \mathbf{x}_q^{(c)})$$
Why this works: By repeatedly training on random N-way K-shot episodes, the model learns embeddings that naturally cluster by class — even for classes it has never seen during training. The embedding space becomes "class-aware" in a general sense.
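The episode construction and loss above can be sketched as follows, with the encoder $f_\phi$ abstracted away by storing embedding vectors directly (illustrative helper names, not MangaAssist production code):

```python
import random
import numpy as np

def sample_episode(data, n_way, k_shot, q_queries, rng=random):
    """Sample an N-way K-shot episode from {class_name: [embedding, ...]}."""
    classes = rng.sample(sorted(data), n_way)
    support, query = {}, {}
    for c in classes:
        picks = rng.sample(range(len(data[c])), k_shot + q_queries)
        support[c] = [data[c][i] for i in picks[:k_shot]]
        query[c] = [data[c][i] for i in picks[k_shot:]]
    return support, query

def episode_loss(support, query):
    """Mean negative log-probability of the correct class over all queries."""
    classes = sorted(support)
    # Prototypes: per-class means of the support embeddings
    protos = np.stack([np.mean(support[c], axis=0) for c in classes])
    losses = []
    for ci, c in enumerate(classes):
        for q in query[c]:
            d = np.linalg.norm(protos - q, axis=1)
            logp = -d - np.log(np.exp(-d).sum())  # log-softmax of -distances
            losses.append(-logp[ci])
    return float(np.mean(losses))
```

During episodic training this loss would be backpropagated into the encoder; here it only scores fixed embeddings, which is enough to see the mechanics.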
MAML — Model-Agnostic Meta-Learning
MAML (Finn et al., 2017) takes a different approach: instead of learning a fixed embedding, it learns an initialization that can be rapidly fine-tuned for any new task.
Inner loop (task-specific adaptation):
For each task $\mathcal{T}_i$ with support set $\mathcal{D}_i^{\text{train}}$:
$$\boldsymbol{\theta}_i' = \boldsymbol{\theta} - \alpha \nabla_{\boldsymbol{\theta}} \mathcal{L}_{\mathcal{T}_i}(\boldsymbol{\theta}, \mathcal{D}_i^{\text{train}})$$
Take 1-5 gradient steps on the task's support data, starting from the meta-initialization $\boldsymbol{\theta}$.
Outer loop (meta-update):
Evaluate the adapted parameters $\boldsymbol{\theta}_i'$ on the task's query set $\mathcal{D}_i^{\text{test}}$, and update the meta-initialization:
$$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \beta \sum_{\mathcal{T}_i} \nabla_{\boldsymbol{\theta}} \mathcal{L}_{\mathcal{T}_i}(\boldsymbol{\theta}_i', \mathcal{D}_i^{\text{test}})$$
The key insight: the outer-loop gradient $\nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_i')$ is taken with respect to the meta-parameters $\boldsymbol{\theta}$, so it requires differentiating through the inner-loop gradient step — a "gradient of a gradient" (second-order derivative). This is what lets MAML learn an initialization from which a few gradient steps reach any task's optimum:
$$\nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_i') = \nabla_{\boldsymbol{\theta}} \mathcal{L}\big(\boldsymbol{\theta} - \alpha \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})\big)$$
Using the chain rule:
$$= \left(\mathbf{I} - \alpha \nabla_{\boldsymbol{\theta}}^2 \mathcal{L}(\boldsymbol{\theta})\right) \nabla_{\boldsymbol{\theta}_i'} \mathcal{L}(\boldsymbol{\theta}_i')$$
The Hessian $\nabla_{\boldsymbol{\theta}}^2 \mathcal{L}$ makes MAML computationally expensive for large models. First-order MAML (FOMAML) drops the Hessian term, using:
$$\nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_i') \approx \nabla_{\boldsymbol{\theta}_i'} \mathcal{L}(\boldsymbol{\theta}_i')$$
This approximation works surprisingly well in practice (within 1-2% of full MAML).
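To see the two loops in action without any neural network, consider a toy task family where task $i$ has loss $\mathcal{L}_i(\theta) = (\theta - a_i)^2$ (so $\nabla \mathcal{L}_i = 2(\theta - a_i)$). A hand-rolled FOMAML sketch converges to the mean of the task optima:

```python
# Toy FOMAML on a scalar parameter: task i has loss (theta - a_i)^2
task_optima = [1.0, 3.0, 5.0]          # a_i for three tasks
theta, alpha, beta = 0.0, 0.1, 0.05    # meta-init, inner lr, outer lr

for _ in range(500):
    meta_grad = 0.0
    for a in task_optima:
        # Inner loop: one gradient step from the meta-initialization
        theta_i = theta - alpha * 2 * (theta - a)
        # FOMAML outer gradient: the gradient at the adapted point,
        # dropping the (I - alpha * Hessian) factor of full MAML
        meta_grad += 2 * (theta_i - a)
    theta -= beta * meta_grad / len(task_optima)

print(theta)  # ≈ 3.0, the mean of the task optima
```

The meta-initialization settles at the point from which one inner step gets closest, on average, to every task's optimum — here, the mean of the $a_i$.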
SetFit — Sentence Transformer Fine-Tuning for Few-Shot
SetFit (Tunstall et al., 2022) bridges the gap between prototypical nets and full fine-tuning. It uses contrastive learning on a Sentence Transformer:
Stage 1: Contrastive fine-tuning
From $K$ examples per class, generate $\binom{K}{2}$ positive pairs per class (same class) and cross-class negative pairs. Fine-tune using the contrastive loss:
$$\mathcal{L}_{\text{contrastive}} = \frac{1}{2N_{\text{pairs}}} \sum_{(i,j)} \left[ y_{ij} \, d(\mathbf{h}_i, \mathbf{h}_j)^2 + (1 - y_{ij}) \max\big(0,\, m - d(\mathbf{h}_i, \mathbf{h}_j)\big)^2 \right]$$
where $y_{ij} = 1$ for same-class pairs, $y_{ij} = 0$ otherwise, and $m$ is the margin.
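A direct transcription of this loss with hand-checkable numbers (toy 2-D vectors, not real sentence embeddings):

```python
import numpy as np

def contrastive_loss(pairs, margin=2.0):
    """Contrastive loss over (h_i, h_j, y_ij) tuples: same-class pairs
    (y=1) are pulled together, different-class pairs (y=0) are pushed
    out beyond the margin."""
    total = 0.0
    for h_i, h_j, y in pairs:
        d = np.linalg.norm(np.asarray(h_i) - np.asarray(h_j))
        total += y * d**2 + (1 - y) * max(0.0, margin - d)**2
    return total / (2 * len(pairs))

pairs = [
    ([0.0, 0.0], [1.0, 0.0], 1),  # same class, d=1 -> term 1.0
    ([0.0, 0.0], [1.0, 0.0], 0),  # different class, (2-1)^2 -> term 1.0
]
print(contrastive_loss(pairs))  # 0.5
```

Both example pairs contribute a term of 1.0, and the $1/(2N_{\text{pairs}})$ factor with $N_{\text{pairs}} = 2$ gives 0.5.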
Stage 2: Classification head
After contrastive fine-tuning, train a logistic regression classifier on the $N \times K$ labeled embeddings.
Why SetFit outperforms prototypical networks for our use case:
- Prototypical nets use a fixed pre-trained encoder (no adaptation to our domain)
- SetFit fine-tunes the encoder on our few-shot data, adapting it to the manga domain
- The contrastive stage is much more data-efficient than supervised fine-tuning because pair generation creates $O(K^2)$ training signals from $K$ examples
Data amplification via pairing:
| Examples per class (K) | Supervised signals | Same-class (positive) pairs | Total contrastive pairs |
|---|---|---|---|
| 10 | 10 | 45 | 405 |
| 30 | 30 | 435 | 3,915 |
| 50 | 50 | 1,225 | 11,025 |
| 100 | 100 | 4,950 | 44,550 |
From 30 examples, SetFit generates 3,915 contrastive training pairs — a 130× amplification.
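The totals in the table correspond to $\binom{K}{2}$ exhaustive same-class pairs plus eight sampled cross-class negatives per positive pair — the 8:1 ratio is inferred from the table's own numbers, not a fixed SetFit setting:

```python
from math import comb

def pair_counts(k: int, negatives_per_positive: int = 8):
    """Contrastive-pair counts for K examples of one class, assuming
    `negatives_per_positive` sampled cross-class negatives per
    same-class pair (the ratio implied by the table above)."""
    positives = comb(k, 2)
    total = positives * (1 + negatives_per_positive)
    return positives, total

for k in (10, 30, 50, 100):
    print(k, pair_counts(k))
# 30 examples yield 435 positive pairs and 3,915 pairs in total
```

This reproduces every row of the table, including the 3,915 pairs (a 130× amplification) from 30 examples.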
Triplet Loss — Refining the Embedding Space
An alternative to contrastive loss that produces tighter clusters:
$$\mathcal{L}_{\text{triplet}} = \max(0, d(\mathbf{a}, \mathbf{p}) - d(\mathbf{a}, \mathbf{n}) + m)$$
where $\mathbf{a}$ is an anchor, $\mathbf{p}$ is a positive (same class), $\mathbf{n}$ is a negative (different class), and $m$ is the margin.
Hard negative mining: Random negatives are often "easy" (very different from the anchor). Hard negatives — the closest examples from other classes — provide the strongest learning signal:
$$\mathbf{n}^* = \arg\min_{\mathbf{n} \in \mathcal{N}} d(\mathbf{a}, \mathbf{n})$$
For our "preorder_anime" intent, a hard negative would be "product_inquiry" about the same manga title — different intent but nearly identical vocabulary.
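Hard negative mining and the triplet loss can be sketched directly from the formulas (toy 2-D vectors; in practice these would be encoder embeddings):

```python
import numpy as np

def hardest_negative(anchor, negatives):
    """Index of the closest example from another class (hard negative)."""
    d = np.linalg.norm(negatives - anchor, axis=1)
    return int(np.argmin(d))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(a, p) - d(a, n) + m) for a single triplet."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(0.0, d_ap - d_an + margin)

anchor   = np.array([0.0, 0.0])    # e.g. a "preorder_anime" query
positive = np.array([1.0, 0.0])    # same intent
negatives = np.array([[3.0, 0.0],  # easy negative (far away)
                      [1.5, 0.0]]) # hard negative (nearly identical wording)

i = hardest_negative(anchor, negatives)
print(i, triplet_loss(anchor, positive, negatives[i]))  # 1 0.5
```

The easy negative yields zero loss (it already sits beyond the margin), so only the hard negative produces a gradient — which is exactly why mining them matters.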
Model Internals — Layer-by-Layer Diagrams
Prototypical Network Classification
graph TB
subgraph "Support Set (5-shot, 3-way for illustration)"
subgraph "order_status (5 examples)"
OS1["'Where is my Berserk order?'"]
OS2["'Track my delivery'"]
OS3["'When will volume 2 arrive?'"]
OS4["'Order #123 status?'"]
OS5["'Is my package shipped?'"]
end
subgraph "recommendation (5 examples)"
R1["'Manga like One Piece?'"]
R2["'Recommend dark fantasy'"]
R3["'What should I read next?'"]
R4["'Similar to Naruto?'"]
R5["'Best shōnen for adults?'"]
end
subgraph "preorder_anime (5 examples — NEW)"
PA1["'Preorder Jujutsu Kaisen 0?'"]
PA2["'When can I preorder?'"]
PA3["'Reserve anime manga'"]
PA4["'Upcoming anime tie-in'"]
PA5["'New anime manga release'"]
end
end
subgraph "Embedding (DistilBERT → 768d)"
OS1 --> E_OS["Mean of 5 embeddings"]
R1 --> E_R["Mean of 5 embeddings"]
PA1 --> E_PA["Mean of 5 embeddings"]
end
E_OS --> P_OS["Prototype: p_order ∈ ℝ⁷⁶⁸"]
E_R --> P_R["Prototype: p_reco ∈ ℝ⁷⁶⁸"]
E_PA --> P_PA["Prototype: p_preorder ∈ ℝ⁷⁶⁸"]
subgraph "Query Classification"
Q["Query: 'Can I preorder the Chainsaw Man anime edition?'"]
Q --> E_Q["Embedding ∈ ℝ⁷⁶⁸"]
end
E_Q --> D1["d(q, p_order) = 4.82"]
E_Q --> D2["d(q, p_reco) = 3.15"]
E_Q --> D3["d(q, p_preorder) = 1.24"]
P_OS --> D1
P_R --> D2
P_PA --> D3
D1 --> SM["Softmax(-distances)<br>order: 0.02<br>reco: 0.12<br>preorder: 0.86"]
style P_PA fill:#c8e6c9
style D3 fill:#c8e6c9
MAML Inner-Outer Loop
sequenceDiagram
participant Meta as Meta-Initialization θ
participant Task1 as Task 1: Intent (10-way 5-shot)
participant Task2 as Task 2: Sentiment (3-way 5-shot)
participant Task3 as Task 3: Genre (8-way 5-shot)
participant Update as Meta-Update
Note over Meta: θ = random init (or pretrained)
rect rgb(255, 249, 196)
Note over Task1: INNER LOOP — Task 1
Meta->>Task1: Copy θ → θ₁
Note over Task1: 3 gradient steps on support set:<br>θ₁' = θ₁ - α∇L₁(θ₁)<br>θ₁'' = θ₁' - α∇L₁(θ₁')<br>θ₁''' = θ₁'' - α∇L₁(θ₁'')
Task1->>Update: Evaluate θ₁''' on query set → L₁(θ₁''')
end
rect rgb(187, 222, 251)
Note over Task2: INNER LOOP — Task 2
Meta->>Task2: Copy θ → θ₂
Note over Task2: 3 gradient steps on support set:<br>θ₂' ... θ₂'''
Task2->>Update: Evaluate θ₂''' on query set → L₂(θ₂''')
end
rect rgb(200, 230, 201)
Note over Task3: INNER LOOP — Task 3
Meta->>Task3: Copy θ → θ₃
Note over Task3: 3 gradient steps on support set:<br>θ₃' ... θ₃'''
Task3->>Update: Evaluate θ₃''' on query set → L₃(θ₃''')
end
Note over Update: OUTER LOOP<br>θ ← θ - β∇θ[L₁(θ₁''') + L₂(θ₂''') + L₃(θ₃''')]<br><br>This gradient flows THROUGH<br>the inner loop steps<br>(second-order derivative)
Update->>Meta: Updated meta-initialization θ
SetFit Two-Stage Pipeline
graph TB
subgraph "Input: 30 labeled examples (3 intents × 10 each)"
D["order_status: 10<br>recommendation: 10<br>preorder_anime: 10"]
end
subgraph "Stage 1: Contrastive Fine-Tuning"
D --> PAIRS["Generate contrastive pairs<br>Positive (same class): 3 × C(10,2) = 135<br>Negative (cross-class): C(3,2) × 10 × 10 = 300 distinct<br>Resampled to ~3,915 training pairs"]
PAIRS --> ST["Sentence Transformer<br>(all-MiniLM-L6-v2, 22M params)"]
ST --> CONT["Contrastive Loss<br>Pull same-class closer<br>Push diff-class apart<br><br>Fine-tune for ~20 epochs<br>~2 minutes on GPU"]
end
subgraph "Stage 2: Classification Head"
CONT --> EMB["Embed all 30 examples<br>with fine-tuned encoder"]
EMB --> LR["Logistic Regression<br>on 30 × 384d vectors<br>Training: <1 second"]
end
subgraph "Inference"
Q["New query"] --> EMB2["Encode with<br>fine-tuned Sentence-T"]
EMB2 --> PRED["Logistic Regression<br>→ Intent prediction"]
end
style CONT fill:#fff9c4
style LR fill:#c8e6c9
Embedding Space: Before vs After SetFit
graph LR
subgraph "Before SetFit (Pre-trained Embeddings)"
B1["order_status queries<br>scattered — overlap with<br>product_inquiry"]
B2["recommendation queries<br>partially clustered"]
B3["preorder_anime queries<br>NO cluster — unseen intent"]
end
subgraph "After SetFit (30 examples, 2 min training)"
A1["order_status queries<br>tight cluster — separated<br>from product_inquiry"]
A2["recommendation queries<br>compact cluster"]
A3["preorder_anime queries<br>NEW cluster formed from<br>contrastive pairs"]
end
B1 -->|"Contrastive<br>fine-tuning"| A1
B2 -->|"2 min"| A2
B3 -->|"130× data<br>amplification"| A3
style A1 fill:#c8e6c9
style A2 fill:#c8e6c9
style A3 fill:#c8e6c9
Few-Shot Decision Flow
graph TD
START["New intent emerges<br>How many labeled examples?"]
START -->|"5-20 examples"| PROTO["Prototypical Networks<br>No training needed<br>Just compute prototypes<br>Expected: 70-78% accuracy"]
START -->|"20-50 examples"| SETFIT["SetFit<br>2-5 min contrastive training<br>+ logistic regression<br>Expected: 78-85% accuracy"]
START -->|"50-100 examples"| COMBINED["SetFit + FOMAML init<br>5-10 min training<br>Expected: 82-88% accuracy"]
START -->|"100+ examples"| FULL["Standard fine-tuning<br>(with continual learning, doc 06)<br>Expected: 88-92% accuracy"]
PROTO --> VAL{"Accuracy ≥ 80%?"}
SETFIT --> VAL
COMBINED --> VAL
FULL --> DEPLOY["Deploy to production"]
VAL -->|Yes| DEPLOY
VAL -->|No| MORE["Collect more examples<br>or try next method"]
MORE --> START
style SETFIT fill:#c8e6c9
style COMBINED fill:#c8e6c9
Implementation Deep-Dive
Prototypical Network for MangaAssist Intents
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
from collections import defaultdict
import numpy as np
class PrototypicalClassifier:
"""
Few-shot intent classification using prototypical networks.
No training required — just compute prototypes from support examples.
"""
def __init__(self, model_name: str = "distilbert-base-uncased"):
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModel.from_pretrained(model_name)
self.model.eval()
self.prototypes: dict[str, torch.Tensor] = {}
@torch.no_grad()
def encode(self, texts: list[str]) -> torch.Tensor:
"""Encode texts to [CLS] embeddings."""
inputs = self.tokenizer(
texts,
padding=True,
truncation=True,
max_length=128,
return_tensors="pt",
)
outputs = self.model(**inputs)
# Use [CLS] token embedding
return outputs.last_hidden_state[:, 0, :] # (N, 768)
def register_intent(self, intent_name: str, examples: list[str]):
"""
Register a new intent from few-shot examples.
Prototype = mean embedding of all examples.
"""
embeddings = self.encode(examples) # (K, 768)
prototype = embeddings.mean(dim=0) # (768,)
# L2 normalize for stable distance computation
prototype = F.normalize(prototype, dim=0)
self.prototypes[intent_name] = prototype
def classify(self, query: str, top_k: int = 3) -> list[dict]:
"""Classify a query by distance to prototypes."""
query_emb = self.encode([query])[0] # (768,)
query_emb = F.normalize(query_emb, dim=0)
distances = {}
for intent, proto in self.prototypes.items():
# Euclidean distance (on normalized vectors)
distances[intent] = torch.norm(query_emb - proto).item()
# Convert distances to probabilities
dist_tensor = torch.tensor(list(distances.values()))
probs = F.softmax(-dist_tensor, dim=0)
results = []
for (intent, dist), prob in zip(distances.items(), probs):
results.append({
"intent": intent,
"distance": dist,
"probability": prob.item(),
})
results.sort(key=lambda x: x["probability"], reverse=True)
return results[:top_k]
# Usage
clf = PrototypicalClassifier()
clf.register_intent("order_status", [
"Where is my Berserk order?",
"Track my delivery",
"When will volume 2 arrive?",
"Order #123 status?",
"Is my package shipped?",
])
clf.register_intent("preorder_anime", [
"Preorder Jujutsu Kaisen 0?",
"When can I preorder the anime edition?",
"Reserve anime manga",
"Upcoming anime tie-in preorder",
"New anime manga release date",
])
result = clf.classify("Can I preorder the Chainsaw Man anime edition?")
# [{"intent": "preorder_anime", "probability": 0.86, "distance": 1.24}, ...]
SetFit Training Pipeline
from setfit import SetFitModel, SetFitTrainer, TrainingArguments
from datasets import Dataset
import pandas as pd
def train_setfit_intent(
examples: list[dict],
model_name: str = "sentence-transformers/all-MiniLM-L6-v2",
num_epochs: int = 20,
):
"""
Train SetFit model for few-shot intent classification.
From 30 examples, generates ~3,900 contrastive pairs.
Training takes 2-5 minutes on a single GPU.
"""
# Prepare dataset
df = pd.DataFrame(examples)
dataset = Dataset.from_pandas(df)
# Split if enough data
if len(examples) > 20:
splits = dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = splits["train"]
eval_dataset = splits["test"]
else:
train_dataset = dataset
eval_dataset = None
# Load SetFit model
model = SetFitModel.from_pretrained(
model_name,
labels=sorted(set(df["label"])),  # deterministic label order
)
args = TrainingArguments(
batch_size=16,
num_epochs=num_epochs,
num_iterations=20,  # pair-sampling iterations (controls how many contrastive pairs are drawn per example)
evaluation_strategy="epoch" if eval_dataset else "no",
)
trainer = SetFitTrainer(
model=model,
args=args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
)
trainer.train()
metrics = trainer.evaluate() if eval_dataset else {}
return model, metrics
# Example: 30 examples for 3 intents
examples = [
{"text": "Where is my Berserk order?", "label": "order_status"},
{"text": "Track my delivery", "label": "order_status"},
# ... 8 more order_status examples
{"text": "Manga like One Piece?", "label": "recommendation"},
# ... 9 more recommendation examples
{"text": "Preorder Jujutsu Kaisen 0?", "label": "preorder_anime"},
# ... 9 more preorder_anime examples
]
model, metrics = train_setfit_intent(examples)
# Training: ~2 minutes on T4 GPU
# Generated pairs: ~3,900
# Accuracy: ~83% with 30 examples
FOMAML for MangaAssist
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer
from copy import deepcopy
import random
class FOMAMLIntentClassifier(nn.Module):
"""
First-Order MAML for few-shot intent classification.
Learns a meta-initialization that can be adapted with a few gradient steps.
"""
def __init__(self, model_name: str = "distilbert-base-uncased", num_classes: int = 10):
super().__init__()
self.encoder = AutoModel.from_pretrained(model_name)
self.classifier = nn.Linear(768, num_classes)
def forward(self, input_ids, attention_mask):
outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
cls_emb = outputs.last_hidden_state[:, 0, :]
return self.classifier(cls_emb)
def fomaml_train(
model: FOMAMLIntentClassifier,
tasks: list,
meta_lr: float = 1e-5,
inner_lr: float = 1e-3,
inner_steps: int = 3,
meta_epochs: int = 100,
):
"""
Train FOMAML with episodic sampling.
Each episode:
1. Sample a task (N-way K-shot)
2. Inner loop: adapt model on support set
3. Outer loop: update meta-params using query set loss
"""
meta_optimizer = torch.optim.Adam(model.parameters(), lr=meta_lr)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
for epoch in range(meta_epochs):
meta_loss = 0
# Sample batch of tasks
task_batch = random.sample(tasks, min(4, len(tasks)))
for task in task_batch:
support, query = task["support"], task["query"]
# Clone model for inner loop
adapted_model = deepcopy(model)
inner_optimizer = torch.optim.SGD(
adapted_model.parameters(), lr=inner_lr
)
# Inner loop: adapt on support set
for _ in range(inner_steps):
support_inputs = tokenizer(
[s["text"] for s in support],
padding=True, truncation=True,
max_length=128, return_tensors="pt",
)
support_labels = torch.tensor([s["label"] for s in support])
logits = adapted_model(**support_inputs)
loss = nn.functional.cross_entropy(logits, support_labels)
inner_optimizer.zero_grad()
loss.backward()
inner_optimizer.step()
# Outer loop: evaluate adapted model on query set
query_inputs = tokenizer(
[q["text"] for q in query],
padding=True, truncation=True,
max_length=128, return_tensors="pt",
)
query_labels = torch.tensor([q["label"] for q in query])
            query_logits = adapted_model(**query_inputs)
            query_loss = nn.functional.cross_entropy(query_logits, query_labels)
            meta_loss += query_loss.item()
            # FOMAML: backprop through the adapted copy, then transfer its
            # gradients onto the meta-parameters (first-order approximation:
            # no differentiation through the inner-loop steps, no Hessian).
            # Note: calling backward() on meta_loss alone would never reach
            # `model`, because adapted_model is a detached deepcopy.
            adapted_model.zero_grad()
            query_loss.backward()
            for meta_p, adapted_p in zip(model.parameters(), adapted_model.parameters()):
                if adapted_p.grad is None:
                    continue
                if meta_p.grad is None:
                    meta_p.grad = adapted_p.grad.detach().clone()
                else:
                    meta_p.grad += adapted_p.grad.detach()
        # Meta-update: average the accumulated task gradients and step
        for p in model.parameters():
            if p.grad is not None:
                p.grad /= len(task_batch)
        meta_optimizer.step()
        meta_optimizer.zero_grad()
        meta_loss /= len(task_batch)
        if epoch % 10 == 0:
            print(f"Epoch {epoch}: meta_loss = {meta_loss:.4f}")
return model
Rapid Deployment Pipeline
import time
from typing import Literal
def rapid_adapt_and_deploy(
new_intent: str,
examples: list[dict],
method: Literal["auto", "prototype", "setfit", "fomaml"] = "auto",
):
"""
Rapid adaptation pipeline: from labeled examples to deployed model.
Automatically selects method based on example count.
"""
n_examples = len(examples)
start_time = time.time()
# Auto-select method
if method == "auto":
if n_examples < 20:
method = "prototype"
elif n_examples < 100:
method = "setfit"
else:
method = "fomaml"
if method == "prototype":
clf = PrototypicalClassifier()
# Register all existing intents (from production model's support set)
for intent, support in load_existing_support_sets().items():
clf.register_intent(intent, support)
# Register new intent
clf.register_intent(new_intent, [e["text"] for e in examples])
accuracy = evaluate_prototype(clf, examples)
elif method == "setfit":
# Combine existing intent examples with new intent
all_examples = load_existing_few_shot_examples()
all_examples.extend(examples)
model, metrics = train_setfit_intent(all_examples)
accuracy = metrics.get("accuracy", 0)
elif method == "fomaml":
model = FOMAMLIntentClassifier()
model.load_state_dict(torch.load("meta_init.pt"))
# Few inner loop steps on new data
adapted = adapt_fomaml(model, examples, inner_steps=5)
accuracy = evaluate_adapted(adapted, examples)
elapsed = time.time() - start_time
result = {
"intent": new_intent,
"method": method,
"n_examples": n_examples,
"accuracy": accuracy,
"training_time_seconds": elapsed,
"ready_for_production": accuracy >= 0.80,
}
if result["ready_for_production"]:
deploy_to_lambda(result)
return result
Group Discussion: Key Decision Points
Decision Point 1: Prototypical Net vs SetFit vs MAML
Priya (ML Engineer): I benchmarked all three on our "preorder_anime" intent scenario:
| Method | 10 examples | 30 examples | 50 examples | Training Time | Deployment |
|---|---|---|---|---|---|
| Prototypical Net | 72.4% | 78.1% | 81.3% | 0 sec | Instant |
| SetFit | 74.8% | 83.2% | 87.1% | 120 sec | 120 sec |
| FOMAML | 76.1% | 82.8% | 86.4% | 180 sec | 180 sec |
| Fine-tune (DistilBERT) | 45.2% | 68.4% | 79.2% | 300 sec | 300 sec |
Aiko (Data Scientist): Notice that vanilla fine-tuning with 10 examples is catastrophically bad (45.2%) — it overfits to the 10 examples completely. Prototypical networks avoid this by not training at all. SetFit avoids it through contrastive pair generation (10 examples → 405 pairs).
Marcus (Architect): The zero-training property of prototypical networks is compelling for emergency scenarios (product recall intent with 10 examples). We can deploy instantly with 72% accuracy and improve later.
Sam (PM): SetFit at 30 examples gives 83.2% in 2 minutes. That is close to production quality (our SLA is 90%, but new intents get a 6-month grace period at 80%). For the MVP launch of a new intent, SetFit is the sweet spot.
Jordan (MLOps): FOMAML requires maintaining a meta-initialization model, which adds complexity. It is only 0.4% better than SetFit at 30 examples. I would not add the meta-learning infrastructure unless we are adding new intents weekly.
Resolution: Stage-based approach: (1) Emergency: Prototypical networks (0 training time, 72%+ accuracy). (2) Standard: SetFit when 20+ examples are available (2 min training, 83%+ accuracy). (3) MAML only if new intents emerge faster than monthly (adds infrastructure complexity). (4) Full fine-tuning once 100+ examples accumulate (transitions to continual learning from doc 06).
Decision Point 2: How Many Examples Are "Enough"?
Priya (ML Engineer): I measured accuracy vs example count for SetFit:
| Examples per Class | Accuracy | Marginal Gain | Confidence Interval (95%) |
|---|---|---|---|
| 5 | 74.8% | — | ±8.2% |
| 10 | 78.5% | +3.7% | ±5.1% |
| 20 | 81.2% | +2.7% | ±3.4% |
| 30 | 83.2% | +2.0% | ±2.8% |
| 50 | 87.1% | +3.9% | ±1.9% |
| 100 | 90.4% | +3.3% | ±1.2% |
Aiko (Data Scientist): There is a clear diminishing-returns curve. The first 20 examples provide the steepest improvements. After 30, gains are still meaningful but slower. After 100, we enter standard fine-tuning territory.
The confidence intervals are critical: with 5 examples, our accuracy estimate has ±8.2% uncertainty — maybe 67%, maybe 83%. With 30 examples, uncertainty drops to ±2.8%.
Sam (PM): Our minimum viable threshold for a new intent is 80% accuracy with ≤3% confidence interval. That requires roughly 30 examples. The labeling cost is ~$0.50 per example (10 seconds of human labeling) = $15 for a new intent launch.
Marcus (Architect): We should build a "new intent intake" process: when analytics detects emerging queries, route 30 of them to human labelers. Once labeled, trigger the SetFit pipeline automatically.
Resolution: 30 examples is the minimum for production SetFit deployment. Below 30, use prototypical networks as a stopgap. The automated pipeline: detect → label 30 → SetFit → validate → deploy runs end-to-end in under 1 hour including human labeling time.
Decision Point 3: Embedding Model Choice for Few-Shot
Priya (ML Engineer): The embedding model quality significantly impacts few-shot performance:
| Model | Params | Latency | Proto@10 | SetFit@30 |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 22M | 3ms | 72.4% | 83.2% |
| all-mpnet-base-v2 | 109M | 12ms | 76.1% | 85.8% |
| DistilBERT (our intent model) | 66M | 8ms | 74.8% | 84.5% |
| Titan Embeddings V2 (adapted, doc 02) | — | 30ms | 78.3% | — |
Aiko (Data Scientist): Our manga-adapted Titan embeddings perform best for prototypical classification because they already understand manga vocabulary. But SetFit cannot fine-tune Titan (it is an API, not a local model).
Marcus (Architect): all-MiniLM-L6-v2 is the default for SetFit — small, fast, well-tested. The 2.6% accuracy gap with mpnet is meaningful for 30-example scenarios, but mpnet adds 9ms latency. At our P99 budget of 15ms for intent classification, mpnet is too slow.
Jordan (MLOps): MiniLM is also easier to deploy — 22M parameters fits in Lambda's 256MB limit without compression. mpnet requires 512MB.
Resolution: all-MiniLM-L6-v2 for SetFit (best speed/quality/deployment tradeoff). For prototypical classification, use our manga-adapted Titan embeddings via API (best few-shot accuracy, acceptable 30ms latency since it is a stopgap).
Decision Point 4: Transitioning from Few-Shot to Full Model
Jordan (MLOps): When do we stop using few-shot models and retrain the full intent classifier?
Sam (PM): Once a new intent has 100+ examples (typically after 2-4 weeks of production traffic), the accuracy gap between SetFit (87%) and full fine-tuning (92%) justifies the retrain.
Priya (ML Engineer): The transition should be seamless. Proposed lifecycle:
| Phase | Trigger | Method | Expected Accuracy | Duration |
|---|---|---|---|---|
| Emergency | 5-10 examples | Prototypical | 72-78% | Hours |
| Launch | 30 examples | SetFit | 83-87% | Days to weeks |
| Maturation | 100+ examples | Continual learning (doc 06) | 88-92% | Ongoing |
Marcus (Architect): This creates a natural A/B test: traffic hitting the SetFit model generates labeled data for the full retrain. We can shadow-test the full model against SetFit before cutting over.
Resolution: Three-phase lifecycle. Automatic transition triggered by example count thresholds: 30 (SetFit) → 100 (full retrain). Each transition is gated by a comparative evaluation on 20% held-out data — only cut over if the new model beats the old by ≥2%.
Research Paper References
1. Prototypical Networks for Few-shot Learning (Snell et al., 2017)
Key contribution: Introduced prototype-based classification using mean embeddings. Showed that Euclidean distance in embedding space is sufficient for few-shot classification, outperforming more complex learnable distance metrics. The simplicity of prototypical networks (no training, just compute centroids) makes them ideal for zero-training-time deployment.
Relevance to MangaAssist: Our emergency intent deployment uses prototypical networks. With 5-10 examples, we compute prototypes and deploy instantly. The 72-78% accuracy is acceptable for a stopgap that activates within minutes.
2. Model-Agnostic Meta-Learning for Fast Adaptation — MAML (Finn et al., 2017)
Key contribution: Proposed learning an initialization from which a few gradient steps reach a good solution for any task. The key mathematical insight is the gradient-through-gradient (second-order) meta-update. FOMAML (first-order approximation) drops the Hessian term with minimal quality loss.
Relevance to MangaAssist: FOMAML provides the theoretical foundation for our adaptation strategy, though we use SetFit in practice due to lower infrastructure complexity. MAML's inner/outer loop concept informs how we think about rapid adaptation generally — our 3-phase lifecycle mirrors MAML's "meta-train then adapt" paradigm.
3. Efficient Few-Shot Learning Without Prompts — SetFit (Tunstall et al., 2022)
Key contribution: Combined contrastive Sentence Transformer fine-tuning with a simple logistic regression head. Achieved competitive results with GPT-3 few-shot prompting at 1000× lower cost. The key insight is the data amplification: K examples generate O(K²) contrastive pairs, making few-shot training data-efficient.
Relevance to MangaAssist: Primary few-shot method for new intent deployment. The 130× data amplification (30 examples → 3,900 pairs) is the key enabler. Training in 2 minutes on a T4 GPU makes it feasible as an automated pipeline triggered by human labeling.
Production Deployment Results
Few-Shot Deployment: "preorder_anime" Intent
| Metric | Prototypical (10 ex) | SetFit (30 ex) | Full Train (200 ex) |
|---|---|---|---|
| Accuracy | 74.2% | 84.1% | 91.3% |
| Time to deploy | 5 min | 15 min | 4 hours |
| Model size | 0 (uses existing encoder) | 88MB | 264MB |
| Human labeling cost | $5 | $15 | $100 |
| Total cost | $5 | $16 | $108 |
Lifecycle Transition Results
| Week | Method | Examples | Accuracy | Notes |
|---|---|---|---|---|
| Week 0 (launch) | Prototypical | 10 | 74.2% | Same-day deploy |
| Week 1 | SetFit | 30 | 84.1% | Auto-triggered by 30 labels |
| Week 3 | SetFit | 65 | 87.4% | Retrained with more data |
| Week 6 | Full (continual) | 150 | 91.3% | Auto-triggered by 100 threshold |
| Week 12 | Full (production) | 500+ | 93.1% | Merged into monthly retrain |