
07. Few-Shot Learning and Rapid Adaptation — New Intents with <100 Examples

Problem Statement and MangaAssist Context

MangaAssist needs to rapidly support new intent categories as business needs evolve. When anime adaptations surge, a "preorder_anime_adaptation" intent emerges with only 30-50 labeled examples. Traditional fine-tuning requires thousands of examples and hours of training. Few-shot learning enables the model to recognize new intents from as few as 5 examples (and at most ~100), deploying in minutes rather than days.

The Cold-Start Problem

| Scenario | Available Examples | Required Accuracy | Time Budget |
|---|---|---|---|
| New intent: "preorder_anime" | 30 | ≥ 80% | < 1 hour |
| Seasonal: "holiday_gift_bundle" | 50 | ≥ 82% | < 2 hours |
| Emergency: "product_recall" | 10 | ≥ 75% | < 30 min |
| Market expansion: Japanese queries | 80 | ≥ 78% | < 4 hours |

Standard fine-tuning on 30 examples → severe overfitting (70% train / 45% test). Few-shot techniques bridge this gap.


Mathematical Foundations

Prototypical Networks — Learning by Comparison

Prototypical Networks (Snell et al., 2017) classify by comparing a query to class "prototypes" — the mean embedding of each class's support examples.

Setup: Given $N$ classes, each with $K$ support examples (N-way K-shot):

- Support set $S = \{(\mathbf{x}_1^{(c)}, \ldots, \mathbf{x}_K^{(c)})\}_{c=1}^{N}$
- Query $\mathbf{x}_q$ to classify

Step 1: Compute prototypes

The prototype for class $c$ is the centroid of its support embeddings:

$$\mathbf{p}_c = \frac{1}{K} \sum_{k=1}^{K} f_\phi(\mathbf{x}_k^{(c)})$$

where $f_\phi$ is the embedding model (e.g., DistilBERT's [CLS] output).

Step 2: Distance-based classification

The probability that query $\mathbf{x}_q$ belongs to class $c$:

$$p(y = c \mid \mathbf{x}_q) = \frac{\exp(-d(f_\phi(\mathbf{x}_q), \mathbf{p}_c))}{\sum_{c'=1}^{N} \exp(-d(f_\phi(\mathbf{x}_q), \mathbf{p}_{c'}))}$$

where $d(\cdot, \cdot)$ is Euclidean distance:

$$d(\mathbf{a}, \mathbf{b}) = \|\mathbf{a} - \mathbf{b}\|_2 = \sqrt{\sum_i (a_i - b_i)^2}$$

Why Euclidean over cosine? On unit-normalized embeddings the two rank classes identically, since $\|\mathbf{a}-\mathbf{b}\|_2^2 = 2 - 2\cos(\mathbf{a}, \mathbf{b})$. The difference matters during episodic training: squared Euclidean distance is a Bregman divergence, for which the class mean is the optimal prototype, and Snell et al. found that Euclidean consistently outperformed cosine distance in their experiments.
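That fixed-norm equivalence is easy to verify numerically (a quick sanity check, not from the original text): for unit vectors, $\|\mathbf{a}-\mathbf{b}\|_2^2 = 2 - 2\cos(\mathbf{a},\mathbf{b})$.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
a = F.normalize(torch.randn(8), dim=0)  # unit-norm stand-in embeddings
b = F.normalize(torch.randn(8), dim=0)

sq_euclid = (torch.norm(a - b) ** 2).item()
two_minus_2cos = (2 - 2 * torch.dot(a, b)).item()

# ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b = 2 - 2 cos(a, b) on the unit sphere
print(sq_euclid, two_minus_2cos)  # the two values match
```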

Intuition: Each class forms a "cluster" in embedding space. The prototype is the cluster center. Classification asks: "Which cluster center is closest to this query?" With a good embedding model, manga queries about preorders will cluster near the "preorder" prototype regardless of the specific title.
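The two steps can be sketched in a few lines of PyTorch; the embeddings below are random stand-ins for $f_\phi$ outputs rather than real query encodings:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins for encoder outputs: 3 classes ("3-way"), 5 support examples
# each ("5-shot"), 8-dim vectors in place of 768-dim DistilBERT embeddings
support = torch.randn(3, 5, 8)

# Step 1: each prototype is the centroid of its class's support embeddings
prototypes = support.mean(dim=1)  # (3, 8)

# A query embedding that happens to land near class 2's prototype
query = prototypes[2] + 0.05 * torch.randn(8)

# Step 2: softmax over negative Euclidean distances
dists = torch.norm(prototypes - query, dim=1)  # (3,)
probs = F.softmax(-dists, dim=0)
pred = probs.argmax().item()  # class 2 wins: its center is closest
```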

Episodic Training for Prototypical Networks

Standard training optimizes on (input, label) pairs. Episodic training mimics the few-shot evaluation protocol during training:

  1. Sample an episode: Pick $N$ random classes, sample $K$ examples per class (support) + $Q$ queries per class
  2. Compute prototypes from $K$ support examples
  3. Classify queries against prototypes
  4. Loss: Negative log-probability of correct classification:

$$\mathcal{L} = -\frac{1}{NQ} \sum_{c=1}^{N} \sum_{q=1}^{Q} \log p(y = c \mid \mathbf{x}_{q}^{(c)})$$

Why this works: By repeatedly training on random N-way K-shot episodes, the model learns embeddings that naturally cluster by class — even for classes it has never seen during training. The embedding space becomes "class-aware" in a general sense.
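A single episode of the four steps above, with a toy linear layer standing in for $f_\phi$ and synthetic inputs, might look like:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, K, Q, D = 3, 5, 4, 16            # N-way, K-shot, Q queries per class
f_phi = torch.nn.Linear(D, 8)       # toy embedding model standing in for f_phi

# 1. Sample an episode: support (N, K, D) and query (N, Q, D) inputs
support_x = torch.randn(N, K, D)
query_x = torch.randn(N, Q, D)

# 2. Compute prototypes from the K support embeddings of each class
protos = f_phi(support_x).mean(dim=1)         # (N, 8)

# 3. Classify queries against prototypes via negative distances
q_emb = f_phi(query_x).reshape(N * Q, -1)     # (N*Q, 8)
log_p = F.log_softmax(-torch.cdist(q_emb, protos), dim=1)

# 4. Loss: negative log-probability of each query's true class
labels = torch.arange(N).repeat_interleave(Q)  # (N*Q,)
loss = F.nll_loss(log_p, labels)
loss.backward()                                # gradients reach f_phi
```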

MAML — Model-Agnostic Meta-Learning

MAML (Finn et al., 2017) takes a different approach: instead of learning a fixed embedding, it learns an initialization that can be rapidly fine-tuned for any new task.

Inner loop (task-specific adaptation):

For each task $\mathcal{T}_i$ with support set $\mathcal{D}_i^{\text{train}}$:

$$\boldsymbol{\theta}_i' = \boldsymbol{\theta} - \alpha \nabla_{\boldsymbol{\theta}} \mathcal{L}_{\mathcal{T}_i}(\boldsymbol{\theta}, \mathcal{D}_i^{\text{train}})$$

Take 1-5 gradient steps on the task's support data, starting from the meta-initialization $\boldsymbol{\theta}$.

Outer loop (meta-update):

Evaluate the adapted parameters $\boldsymbol{\theta}_i'$ on the task's query set $\mathcal{D}_i^{\text{test}}$, and update the meta-initialization:

$$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \beta \sum_{\mathcal{T}_i} \nabla_{\boldsymbol{\theta}} \mathcal{L}_{\mathcal{T}_i}(\boldsymbol{\theta}_i', \mathcal{D}_i^{\text{test}})$$

The key insight: The outer-loop gradient $\nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_i')$ requires differentiating through the inner-loop gradient step — a "gradient of a gradient" (second-order derivative). This is what shapes $\boldsymbol{\theta}$ into an initialization from which a few gradient steps reach a good solution on any task. For a single inner step:

$$\nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_i') = \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta} - \alpha \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}))$$

Using the chain rule:

$$= (\mathbf{I} - \alpha \nabla_{\boldsymbol{\theta}}^2 \mathcal{L}(\boldsymbol{\theta}))\, \nabla_{\boldsymbol{\theta}_i'} \mathcal{L}(\boldsymbol{\theta}_i')$$

The Hessian $\nabla_{\boldsymbol{\theta}}^2 \mathcal{L}$ makes MAML computationally expensive for large models. First-order MAML (FOMAML) drops the Hessian term, using:

$$\nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_i') \approx \nabla_{\boldsymbol{\theta}_i'} \mathcal{L}(\boldsymbol{\theta}_i')$$

This approximation works surprisingly well in practice (within 1-2% of full MAML).
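The dropped Hessian term can be seen concretely on a one-parameter quadratic loss $\mathcal{L}(\theta) = \tfrac{1}{2}\theta^2$: one inner step gives $\theta' = (1-\alpha)\theta$, so the full MAML gradient is $(1-\alpha)^2\theta$ while FOMAML returns $(1-\alpha)\theta$ (a toy check, not from the original text):

```python
import torch

alpha = 0.1
theta = torch.tensor(2.0, requires_grad=True)

def loss(t):
    return 0.5 * t ** 2

# Inner step, keeping the graph so we can differentiate through it
inner_grad = torch.autograd.grad(loss(theta), theta, create_graph=True)[0]
theta_adapted = theta - alpha * inner_grad       # (1 - alpha) * theta = 1.8

# Full MAML outer gradient: differentiates THROUGH the inner step
full_grad = torch.autograd.grad(loss(theta_adapted), theta)[0]

# FOMAML: gradient evaluated at the adapted point only (dL/dtheta' = theta')
fomaml_grad = theta_adapted.detach()

print(full_grad.item())    # (1 - alpha)^2 * theta = 0.81 * 2 = 1.62
print(fomaml_grad.item())  # (1 - alpha) * theta   = 0.9  * 2 = 1.8
```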

SetFit — Sentence Transformer Fine-Tuning for Few-Shot

SetFit (Tunstall et al., 2022) bridges the gap between prototypical nets and full fine-tuning. It uses contrastive learning on a Sentence Transformer:

Stage 1: Contrastive fine-tuning

From $K$ examples per class, generate $\binom{K}{2}$ positive pairs (same class) and negative pairs (different classes). Fine-tune using the contrastive loss:

$$\mathcal{L}_{\text{contrastive}} = \frac{1}{2N_{\text{pairs}}} \sum_{(i,j)} \left[y_{ij} \cdot d(\mathbf{h}_i, \mathbf{h}_j)^2 + (1-y_{ij}) \cdot \max(0, m - d(\mathbf{h}_i, \mathbf{h}_j))^2\right]$$

where $y_{ij}=1$ for same-class pairs, $m$ is the margin.
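A direct transcription of this loss on synthetic embeddings (the `0.5 * mean` implements the $\frac{1}{2N_{\text{pairs}}}$ normalization; the margin value is an arbitrary choice for illustration):

```python
import torch

def contrastive_loss(h_i, h_j, y, margin=1.0):
    """y[k] = 1 for a same-class pair, 0 for a cross-class pair."""
    d = torch.norm(h_i - h_j, dim=1)
    pos = y * d ** 2                                      # pull same-class pairs together
    neg = (1 - y) * torch.clamp(margin - d, min=0) ** 2   # push cross-class past the margin
    return 0.5 * (pos + neg).mean()

torch.manual_seed(0)
h_i, h_j = torch.randn(6, 8), torch.randn(6, 8)  # stand-in embedding pairs
y = torch.tensor([1., 1., 1., 0., 0., 0.])
loss = contrastive_loss(h_i, h_j, y)
```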

Stage 2: Classification head

After contrastive fine-tuning, train a logistic regression classifier on the $N \times K$ labeled embeddings.
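Stage 2 is an ordinary classifier over frozen embeddings. A minimal sketch with scikit-learn, using synthetic cluster data in place of the fine-tuned encoder's outputs:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
N, K, D = 3, 10, 384   # 3 intents x 10 examples, MiniLM-sized vectors

# Stand-ins for the fine-tuned encoder's outputs: one tight cluster per intent
centers = rng.normal(size=(N, D))
X = np.vstack([c + 0.1 * rng.normal(size=(K, D)) for c in centers])
y = np.repeat(np.arange(N), K)

# The classification head: fits the 30 labeled embeddings in well under a second
head = LogisticRegression(max_iter=1000).fit(X, y)
print(head.score(X, y))  # cleanly separable clusters
```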

Why SetFit outperforms prototypical networks for our use case:

- Prototypical nets use a fixed pre-trained encoder (no adaptation to our domain)
- SetFit fine-tunes the encoder on our few-shot data, adapting it to the manga domain
- The contrastive stage is much more data-efficient than supervised fine-tuning because pair generation creates $O(K^2)$ training signals from $K$ examples

Data amplification via pairing:

| K examples | Supervised signals | Contrastive pairs (same class) | Contrastive pairs (total) |
|---|---|---|---|
| 10 | 10 | 45 | 405 |
| 30 | 30 | 435 | 3,915 |
| 50 | 50 | 1,225 | 11,025 |
| 100 | 100 | 4,950 | 44,550 |

From 30 examples, SetFit generates 3,915 contrastive training pairs — a 130× amplification.

Triplet Loss — Refining the Embedding Space

An alternative to contrastive loss that produces tighter clusters:

$$\mathcal{L}_{\text{triplet}} = \max(0, d(\mathbf{a}, \mathbf{p}) - d(\mathbf{a}, \mathbf{n}) + m)$$

where $\mathbf{a}$ is an anchor, $\mathbf{p}$ is a positive (same class), $\mathbf{n}$ is a negative (different class), and $m$ is the margin.

Hard negative mining: Random negatives are often "easy" (very different from the anchor). Hard negatives — the closest examples from other classes — provide the strongest learning signal:

$$\mathbf{n}^* = \arg\min_{\mathbf{n} \in \mathcal{N}} d(\mathbf{a}, \mathbf{n})$$

For our "preorder_anime" intent, a hard negative would be "product_inquiry" about the same manga title — different intent but nearly identical vocabulary.
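Hard negative mining follows directly from the two formulas above; a sketch with synthetic embeddings:

```python
import torch

def hard_triplet_loss(anchor, positive, negatives, margin=0.5):
    """Triplet loss using the hardest negative: n* = argmin_n d(a, n)."""
    d_pos = torch.norm(anchor - positive)
    # Hardest negative = the closest example from any other class
    d_neg_hard = torch.norm(anchor - negatives, dim=1).min()
    return torch.clamp(d_pos - d_neg_hard + margin, min=0)

torch.manual_seed(0)
anchor = torch.randn(8)
positive = anchor + 0.1 * torch.randn(8)   # same intent, near the anchor
negatives = torch.randn(5, 8)              # candidates from other intents
loss = hard_triplet_loss(anchor, positive, negatives)
```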


Model Internals — Layer-by-Layer Diagrams

Prototypical Network Classification

graph TB
    subgraph "Support Set (5-shot, 3-way for illustration)"
        subgraph "order_status (5 examples)"
            OS1["'Where is my Berserk order?'"]
            OS2["'Track my delivery'"]
            OS3["'When will volume 2 arrive?'"]
            OS4["'Order #123 status?'"]
            OS5["'Is my package shipped?'"]
        end
        subgraph "recommendation (5 examples)"
            R1["'Manga like One Piece?'"]
            R2["'Recommend dark fantasy'"]
            R3["'What should I read next?'"]
            R4["'Similar to Naruto?'"]
            R5["'Best shōnen for adults?'"]
        end
        subgraph "preorder_anime (5 examples — NEW)"
            PA1["'Preorder Jujutsu Kaisen 0?'"]
            PA2["'When can I preorder?'"]
            PA3["'Reserve anime manga'"]
            PA4["'Upcoming anime tie-in'"]
            PA5["'New anime manga release'"]
        end
    end

    subgraph "Embedding (DistilBERT → 768d)"
        OS1 --> E_OS["Mean of 5 embeddings"]
        R1 --> E_R["Mean of 5 embeddings"]
        PA1 --> E_PA["Mean of 5 embeddings"]
    end

    E_OS --> P_OS["Prototype: p_order ∈ ℝ⁷⁶⁸"]
    E_R --> P_R["Prototype: p_reco ∈ ℝ⁷⁶⁸"]
    E_PA --> P_PA["Prototype: p_preorder ∈ ℝ⁷⁶⁸"]

    subgraph "Query Classification"
        Q["Query: 'Can I preorder the Chainsaw Man anime edition?'"]
        Q --> E_Q["Embedding ∈ ℝ⁷⁶⁸"]
    end

    E_Q --> D1["d(q, p_order) = 4.82"]
    E_Q --> D2["d(q, p_reco) = 3.15"]
    E_Q --> D3["d(q, p_preorder) = 1.24"]

    P_OS --> D1
    P_R --> D2
    P_PA --> D3

    D1 --> SM["Softmax(-distances)<br>order: 0.02<br>reco: 0.12<br>preorder: 0.86"]

    style P_PA fill:#c8e6c9
    style D3 fill:#c8e6c9

MAML Inner-Outer Loop

sequenceDiagram
    participant Meta as Meta-Initialization θ
    participant Task1 as Task 1: Intent (10-way 5-shot)
    participant Task2 as Task 2: Sentiment (3-way 5-shot)
    participant Task3 as Task 3: Genre (8-way 5-shot)
    participant Update as Meta-Update

    Note over Meta: θ = random init (or pretrained)

    rect rgb(255, 249, 196)
        Note over Task1: INNER LOOP — Task 1
        Meta->>Task1: Copy θ → θ₁
        Note over Task1: 3 gradient steps on support set:<br>θ₁' = θ₁ - α∇L₁(θ₁)<br>θ₁'' = θ₁' - α∇L₁(θ₁')<br>θ₁''' = θ₁'' - α∇L₁(θ₁'')
        Task1->>Update: Evaluate θ₁''' on query set → L₁(θ₁''')
    end

    rect rgb(187, 222, 251)
        Note over Task2: INNER LOOP — Task 2
        Meta->>Task2: Copy θ → θ₂
        Note over Task2: 3 gradient steps on support set:<br>θ₂' ... θ₂'''
        Task2->>Update: Evaluate θ₂''' on query set → L₂(θ₂''')
    end

    rect rgb(200, 230, 201)
        Note over Task3: INNER LOOP — Task 3
        Meta->>Task3: Copy θ → θ₃
        Note over Task3: 3 gradient steps on support set:<br>θ₃' ... θ₃'''
        Task3->>Update: Evaluate θ₃''' on query set → L₃(θ₃''')
    end

    Note over Update: OUTER LOOP<br>θ ← θ - β∇θ[L₁(θ₁''') + L₂(θ₂''') + L₃(θ₃''')]<br><br>This gradient flows THROUGH<br>the inner loop steps<br>(second-order derivative)

    Update->>Meta: Updated meta-initialization θ

SetFit Two-Stage Pipeline

graph TB
    subgraph "Input: 30 labeled examples (3 intents × 10 each)"
        D["order_status: 10<br>recommendation: 10<br>preorder_anime: 10"]
    end

    subgraph "Stage 1: Contrastive Fine-Tuning"
        D --> PAIRS["Generate contrastive pairs<br>Positive (same class): 3 × C(10,2) = 135<br>Negative (cross-class): ~3,780<br>Total: ~3,915 pairs"]

        PAIRS --> ST["Sentence Transformer<br>(all-MiniLM-L6-v2, 22M params)"]
        ST --> CONT["Contrastive Loss<br>Pull same-class closer<br>Push diff-class apart<br><br>Fine-tune for ~20 epochs<br>~2 minutes on GPU"]
    end

    subgraph "Stage 2: Classification Head"
        CONT --> EMB["Embed all 30 examples<br>with fine-tuned encoder"]
        EMB --> LR["Logistic Regression<br>on 30 × 384d vectors<br>Training: <1 second"]
    end

    subgraph "Inference"
        Q["New query"] --> EMB2["Encode with<br>fine-tuned Sentence-T"]
        EMB2 --> PRED["Logistic Regression<br>→ Intent prediction"]
    end

    style CONT fill:#fff9c4
    style LR fill:#c8e6c9

Embedding Space: Before vs After SetFit

graph LR
    subgraph "Before SetFit (Pre-trained Embeddings)"
        B1["order_status queries<br>scattered — overlap with<br>product_inquiry"]
        B2["recommendation queries<br>partially clustered"]
        B3["preorder_anime queries<br>NO cluster — unseen intent"]
    end

    subgraph "After SetFit (30 examples, 2 min training)"
        A1["order_status queries<br>tight cluster — separated<br>from product_inquiry"]
        A2["recommendation queries<br>compact cluster"]
        A3["preorder_anime queries<br>NEW cluster formed from<br>contrastive pairs"]
    end

    B1 -->|"Contrastive<br>fine-tuning"| A1
    B2 -->|"2 min"| A2
    B3 -->|"130× data<br>amplification"| A3

    style A1 fill:#c8e6c9
    style A2 fill:#c8e6c9
    style A3 fill:#c8e6c9

Few-Shot Decision Flow

graph TD
    START["New intent emerges<br>How many labeled examples?"]

    START -->|"5-20 examples"| PROTO["Prototypical Networks<br>No training needed<br>Just compute prototypes<br>Expected: 70-78% accuracy"]

    START -->|"20-50 examples"| SETFIT["SetFit<br>2-5 min contrastive training<br>+ logistic regression<br>Expected: 78-85% accuracy"]

    START -->|"50-100 examples"| COMBINED["SetFit + FOMAML init<br>5-10 min training<br>Expected: 82-88% accuracy"]

    START -->|"100+ examples"| FULL["Standard fine-tuning<br>(with continual learning, doc 06)<br>Expected: 88-92% accuracy"]

    PROTO --> VAL{"Accuracy ≥ 80%?"}
    SETFIT --> VAL
    COMBINED --> VAL
    FULL --> DEPLOY["Deploy to production"]

    VAL -->|Yes| DEPLOY
    VAL -->|No| MORE["Collect more examples<br>or try next method"]
    MORE --> START

    style SETFIT fill:#c8e6c9
    style COMBINED fill:#c8e6c9

Implementation Deep-Dive

Prototypical Network for MangaAssist Intents

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
from collections import defaultdict
import numpy as np


class PrototypicalClassifier:
    """
    Few-shot intent classification using prototypical networks.
    No training required — just compute prototypes from support examples.
    """

    def __init__(self, model_name: str = "distilbert-base-uncased"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
        self.model.eval()
        self.prototypes: dict[str, torch.Tensor] = {}

    @torch.no_grad()
    def encode(self, texts: list[str]) -> torch.Tensor:
        """Encode texts to [CLS] embeddings."""
        inputs = self.tokenizer(
            texts,
            padding=True,
            truncation=True,
            max_length=128,
            return_tensors="pt",
        )
        outputs = self.model(**inputs)
        # Use [CLS] token embedding
        return outputs.last_hidden_state[:, 0, :]  # (N, 768)

    def register_intent(self, intent_name: str, examples: list[str]):
        """
        Register a new intent from few-shot examples.
        Prototype = mean embedding of all examples.
        """
        embeddings = self.encode(examples)  # (K, 768)
        prototype = embeddings.mean(dim=0)  # (768,)
        # L2 normalize for stable distance computation
        prototype = F.normalize(prototype, dim=0)
        self.prototypes[intent_name] = prototype

    def classify(self, query: str, top_k: int = 3) -> list[dict]:
        """Classify a query by distance to prototypes."""
        query_emb = self.encode([query])[0]  # (768,)
        query_emb = F.normalize(query_emb, dim=0)

        distances = {}
        for intent, proto in self.prototypes.items():
            # Euclidean distance (on normalized vectors)
            distances[intent] = torch.norm(query_emb - proto).item()

        # Convert distances to probabilities
        dist_tensor = torch.tensor(list(distances.values()))
        probs = F.softmax(-dist_tensor, dim=0)

        results = []
        for (intent, dist), prob in zip(distances.items(), probs):
            results.append({
                "intent": intent,
                "distance": dist,
                "probability": prob.item(),
            })

        results.sort(key=lambda x: x["probability"], reverse=True)
        return results[:top_k]


# Usage
clf = PrototypicalClassifier()

clf.register_intent("order_status", [
    "Where is my Berserk order?",
    "Track my delivery",
    "When will volume 2 arrive?",
    "Order #123 status?",
    "Is my package shipped?",
])

clf.register_intent("preorder_anime", [
    "Preorder Jujutsu Kaisen 0?",
    "When can I preorder the anime edition?",
    "Reserve anime manga",
    "Upcoming anime tie-in preorder",
    "New anime manga release date",
])

result = clf.classify("Can I preorder the Chainsaw Man anime edition?")
# [{"intent": "preorder_anime", "probability": 0.86, "distance": 1.24}, ...]

SetFit Training Pipeline

from setfit import SetFitModel, SetFitTrainer, TrainingArguments
from datasets import Dataset
import pandas as pd


def train_setfit_intent(
    examples: list[dict],
    model_name: str = "sentence-transformers/all-MiniLM-L6-v2",
    num_epochs: int = 20,
):
    """
    Train SetFit model for few-shot intent classification.

    From 30 examples, generates ~3,900 contrastive pairs.
    Training takes 2-5 minutes on a single GPU.
    """
    # Prepare dataset
    df = pd.DataFrame(examples)
    dataset = Dataset.from_pandas(df)

    # Split if enough data
    if len(examples) > 20:
        splits = dataset.train_test_split(test_size=0.2, seed=42)
        train_dataset = splits["train"]
        eval_dataset = splits["test"]
    else:
        train_dataset = dataset
        eval_dataset = None

    # Load SetFit model
    model = SetFitModel.from_pretrained(
        model_name,
        labels=list(set(df["label"])),
    )

    args = TrainingArguments(
        batch_size=16,
        num_epochs=num_epochs,
        num_iterations=20,  # Number of text pairs to generate per class
        evaluation_strategy="epoch" if eval_dataset else "no",
    )

    trainer = SetFitTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )

    trainer.train()
    metrics = trainer.evaluate() if eval_dataset else {}

    return model, metrics


# Example: 30 examples for 3 intents
examples = [
    {"text": "Where is my Berserk order?", "label": "order_status"},
    {"text": "Track my delivery", "label": "order_status"},
    # ... 8 more order_status examples
    {"text": "Manga like One Piece?", "label": "recommendation"},
    # ... 9 more recommendation examples
    {"text": "Preorder Jujutsu Kaisen 0?", "label": "preorder_anime"},
    # ... 9 more preorder_anime examples
]

model, metrics = train_setfit_intent(examples)
# Training: ~2 minutes on T4 GPU
# Generated pairs: ~3,900
# Accuracy: ~83% with 30 examples

FOMAML for MangaAssist

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer
from copy import deepcopy
import random


class FOMAMLIntentClassifier(nn.Module):
    """
    First-Order MAML for few-shot intent classification.
    Learns a meta-initialization that can be adapted with a few gradient steps.
    """

    def __init__(self, model_name: str = "distilbert-base-uncased", num_classes: int = 10):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.classifier = nn.Linear(768, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_emb = outputs.last_hidden_state[:, 0, :]
        return self.classifier(cls_emb)


def fomaml_train(
    model: FOMAMLIntentClassifier,
    tasks: list,
    meta_lr: float = 1e-5,
    inner_lr: float = 1e-3,
    inner_steps: int = 3,
    meta_epochs: int = 100,
):
    """
    Train FOMAML with episodic sampling.

    Each episode:
    1. Sample a task (N-way K-shot)
    2. Inner loop: adapt model on support set
    3. Outer loop: update meta-params using query set loss
    """
    meta_optimizer = torch.optim.Adam(model.parameters(), lr=meta_lr)
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

    for epoch in range(meta_epochs):
        meta_loss = 0.0
        meta_optimizer.zero_grad()

        # Sample a batch of tasks
        task_batch = random.sample(tasks, min(4, len(tasks)))

        for task in task_batch:
            support, query = task["support"], task["query"]

            # Clone model for the inner loop
            adapted_model = deepcopy(model)
            inner_optimizer = torch.optim.SGD(
                adapted_model.parameters(), lr=inner_lr
            )

            # Inner loop: adapt on support set
            for _ in range(inner_steps):
                support_inputs = tokenizer(
                    [s["text"] for s in support],
                    padding=True, truncation=True,
                    max_length=128, return_tensors="pt",
                )
                support_labels = torch.tensor([s["label"] for s in support])

                logits = adapted_model(**support_inputs)
                loss = nn.functional.cross_entropy(logits, support_labels)

                inner_optimizer.zero_grad()
                loss.backward()
                inner_optimizer.step()

            # Outer loop: evaluate adapted model on query set
            query_inputs = tokenizer(
                [q["text"] for q in query],
                padding=True, truncation=True,
                max_length=128, return_tensors="pt",
            )
            query_labels = torch.tensor([q["label"] for q in query])

            query_logits = adapted_model(**query_inputs)
            query_loss = nn.functional.cross_entropy(query_logits, query_labels)
            meta_loss += query_loss.item()

            # FOMAML: the query-loss gradient w.r.t. the ADAPTED parameters is
            # applied directly to the meta-parameters (drops the Hessian term).
            # Note: a plain query_loss.backward() would only populate the
            # deepcopied model's grads and never update the meta-parameters.
            grads = torch.autograd.grad(query_loss, adapted_model.parameters())
            for p, g in zip(model.parameters(), grads):
                p.grad = g.clone() if p.grad is None else p.grad + g

        # Average the accumulated gradients over the task batch, then meta-step
        for p in model.parameters():
            if p.grad is not None:
                p.grad /= len(task_batch)
        meta_optimizer.step()

        if epoch % 10 == 0:
            print(f"Epoch {epoch}: meta_loss = {meta_loss / len(task_batch):.4f}")

    return model

Rapid Deployment Pipeline

import time
from typing import Literal


def rapid_adapt_and_deploy(
    new_intent: str,
    examples: list[dict],
    method: Literal["auto", "prototype", "setfit", "fomaml"] = "auto",
):
    """
    Rapid adaptation pipeline: from labeled examples to deployed model.
    Automatically selects method based on example count.
    """
    n_examples = len(examples)
    start_time = time.time()

    # Auto-select method
    if method == "auto":
        if n_examples < 20:
            method = "prototype"
        elif n_examples < 100:
            method = "setfit"
        else:
            method = "fomaml"

    if method == "prototype":
        clf = PrototypicalClassifier()
        # Register all existing intents (from production model's support set)
        for intent, support in load_existing_support_sets().items():
            clf.register_intent(intent, support)
        # Register new intent
        clf.register_intent(new_intent, [e["text"] for e in examples])
        accuracy = evaluate_prototype(clf, examples)

    elif method == "setfit":
        # Combine existing intent examples with new intent
        all_examples = load_existing_few_shot_examples()
        all_examples.extend(examples)
        model, metrics = train_setfit_intent(all_examples)
        accuracy = metrics.get("accuracy", 0)

    elif method == "fomaml":
        model = FOMAMLIntentClassifier()
        model.load_state_dict(torch.load("meta_init.pt"))
        # Few inner loop steps on new data
        adapted = adapt_fomaml(model, examples, inner_steps=5)
        accuracy = evaluate_adapted(adapted, examples)

    elapsed = time.time() - start_time

    result = {
        "intent": new_intent,
        "method": method,
        "n_examples": n_examples,
        "accuracy": accuracy,
        "training_time_seconds": elapsed,
        "ready_for_production": accuracy >= 0.80,
    }

    if result["ready_for_production"]:
        deploy_to_lambda(result)

    return result

Group Discussion: Key Decision Points

Decision Point 1: Prototypical Net vs SetFit vs MAML

Priya (ML Engineer): I benchmarked all three on our "preorder_anime" intent scenario:

| Method | 10 examples | 30 examples | 50 examples | Training Time | Deployment |
|---|---|---|---|---|---|
| Prototypical Net | 72.4% | 78.1% | 81.3% | 0 sec | Instant |
| SetFit | 74.8% | 83.2% | 87.1% | 120 sec | 120 sec |
| FOMAML | 76.1% | 82.8% | 86.4% | 180 sec | 180 sec |
| Fine-tune (DistilBERT) | 45.2% | 68.4% | 79.2% | 300 sec | 300 sec |

Aiko (Data Scientist): Notice that vanilla fine-tuning with 10 examples is catastrophically bad (45.2%) — it overfits to the 10 examples completely. Prototypical networks avoid this by not training at all. SetFit avoids it through contrastive pair generation (10 examples → 405 pairs).

Marcus (Architect): The zero-training property of prototypical networks is compelling for emergency scenarios (product recall intent with 10 examples). We can deploy instantly with 72% accuracy and improve later.

Sam (PM): SetFit at 30 examples gives 83.2% in 2 minutes. That is close to production quality (our SLA is 90%, but new intents get a 6-month grace period at 80%). For the MVP launch of a new intent, SetFit is the sweet spot.

Jordan (MLOps): FOMAML requires maintaining a meta-initialization model, which adds complexity. It is only 0.4% better than SetFit at 30 examples. I would not add the meta-learning infrastructure unless we are adding new intents weekly.

Resolution: Stage-based approach: (1) Emergency: Prototypical networks (0 training time, 72%+ accuracy). (2) Standard: SetFit when 20+ examples are available (2 min training, 83%+ accuracy). (3) MAML only if new intents emerge faster than monthly (adds infrastructure complexity). (4) Full fine-tuning once 100+ examples accumulate (transitions to continual learning from doc 06).

Decision Point 2: How Many Examples Are "Enough"?

Priya (ML Engineer): I measured accuracy vs example count for SetFit:

| Examples per Class | Accuracy | Marginal Gain | Confidence Interval (95%) |
|---|---|---|---|
| 5 | 74.8% | | ±8.2% |
| 10 | 78.5% | +3.7% | ±5.1% |
| 20 | 81.2% | +2.7% | ±3.4% |
| 30 | 83.2% | +2.0% | ±2.8% |
| 50 | 87.1% | +3.9% | ±1.9% |
| 100 | 90.4% | +3.3% | ±1.2% |

Aiko (Data Scientist): There is a clear diminishing-returns curve. The first 20 examples provide the steepest improvements. After 30, gains are still meaningful but slower. After 100, we enter standard fine-tuning territory.

The confidence intervals are critical: with 5 examples, our accuracy estimate has ±8.2% uncertainty — maybe 67%, maybe 83%. With 30 examples, uncertainty drops to ±2.8%.

Sam (PM): Our minimum viable threshold for a new intent is 80% accuracy with ≤3% confidence interval. That requires roughly 30 examples. The labeling cost is ~$0.50 per example (10 seconds of human labeling) = $15 for a new intent launch.

Marcus (Architect): We should build a "new intent intake" process: when analytics detects emerging queries, route 30 of them to human labelers. Once labeled, trigger the SetFit pipeline automatically.

Resolution: 30 examples is the minimum for production SetFit deployment. Below 30, use prototypical networks as a stopgap. The automated pipeline: detect → label 30 → SetFit → validate → deploy runs end-to-end in under 1 hour including human labeling time.

Decision Point 3: Embedding Model Choice for Few-Shot

Priya (ML Engineer): The embedding model quality significantly impacts few-shot performance:

| Model | Params | Latency | Proto@10 | SetFit@30 |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 22M | 3ms | 72.4% | 83.2% |
| all-mpnet-base-v2 | 109M | 12ms | 76.1% | 85.8% |
| DistilBERT (our intent model) | 66M | 8ms | 74.8% | 84.5% |
| Titan Embeddings V2 (adapted, doc 02) | | 30ms | 78.3% | N/A (API-only) |

Aiko (Data Scientist): Our manga-adapted Titan embeddings perform best for prototypical classification because they already understand manga vocabulary. But SetFit cannot fine-tune Titan (it is an API, not a local model).

Marcus (Architect): all-MiniLM-L6-v2 is the default for SetFit — small, fast, well-tested. The 2.6% accuracy gap with mpnet is meaningful for 30-example scenarios, but mpnet adds 9ms latency. At our P99 budget of 15ms for intent classification, mpnet is too slow.

Jordan (MLOps): MiniLM is also easier to deploy — 22M parameters fits in Lambda's 256MB limit without compression. mpnet requires 512MB.

Resolution: all-MiniLM-L6-v2 for SetFit (best speed/quality/deployment tradeoff). For prototypical classification, use our manga-adapted Titan embeddings via API (best few-shot accuracy, acceptable 30ms latency since it is a stopgap).

Decision Point 4: Transitioning from Few-Shot to Full Model

Jordan (MLOps): When do we stop using few-shot models and retrain the full intent classifier?

Sam (PM): Once a new intent has 100+ examples (typically after 2-4 weeks of production traffic), the accuracy gap between SetFit (87%) and full fine-tuning (92%) justifies the retrain.

Priya (ML Engineer): The transition should be seamless. Proposed lifecycle:

| Phase | Trigger | Method | Expected Accuracy | Duration |
|---|---|---|---|---|
| Emergency | 5-10 examples | Prototypical | 72-78% | Hours |
| Launch | 30 examples | SetFit | 83-87% | Days to weeks |
| Maturation | 100+ examples | Continual learning (doc 06) | 88-92% | Ongoing |

Marcus (Architect): This creates a natural A/B test: traffic hitting the SetFit model generates labeled data for the full retrain. We can shadow-test the full model against SetFit before cutting over.

Resolution: Three-phase lifecycle. Automatic transition triggered by example count thresholds: 30 (SetFit) → 100 (full retrain). Each transition is gated by a comparative evaluation on 20% held-out data — only cut over if the new model beats the old by ≥2%.
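The ≥2% cut-over gate from this resolution reduces to a one-line check (the function name and structure are hypothetical, not an existing MangaAssist API):

```python
def should_cut_over(new_acc: float, old_acc: float, min_gain: float = 0.02) -> bool:
    """Promote the candidate model only if it beats the incumbent by at
    least min_gain (2 points by default) on the 20% held-out split."""
    return (new_acc - old_acc) >= min_gain

# Week 6 numbers from the lifecycle table: full retrain (91.3%) vs SetFit (87.4%)
promote = should_cut_over(new_acc=0.913, old_acc=0.874)  # 3.9-point gain: promote
```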


Research Paper References

1. Prototypical Networks for Few-shot Learning (Snell et al., 2017)

Key contribution: Introduced prototype-based classification using mean embeddings. Showed that Euclidean distance in embedding space is sufficient for few-shot classification, outperforming more complex learnable distance metrics. The simplicity of prototypical networks (no training, just compute centroids) makes them ideal for zero-training-time deployment.

Relevance to MangaAssist: Our emergency intent deployment uses prototypical networks. With 5-10 examples, we compute prototypes and deploy instantly. The 72-78% accuracy is acceptable for a stopgap that activates within minutes.

2. Model-Agnostic Meta-Learning for Fast Adaptation — MAML (Finn et al., 2017)

Key contribution: Proposed learning an initialization from which a few gradient steps reach a good solution for any task. The key mathematical insight is the gradient-through-gradient (second-order) meta-update. FOMAML (first-order approximation) drops the Hessian term with minimal quality loss.

Relevance to MangaAssist: FOMAML provides the theoretical foundation for our adaptation strategy, though we use SetFit in practice due to lower infrastructure complexity. MAML's inner/outer loop concept informs how we think about rapid adaptation generally — our 3-phase lifecycle mirrors MAML's "meta-train then adapt" paradigm.

3. Efficient Few-Shot Learning Without Prompts — SetFit (Tunstall et al., 2022)

Key contribution: Combined contrastive Sentence Transformer fine-tuning with a simple logistic regression head. Achieved competitive results with GPT-3 few-shot prompting at 1000× lower cost. The key insight is the data amplification: K examples generate O(K²) contrastive pairs, making few-shot training data-efficient.

Relevance to MangaAssist: Primary few-shot method for new intent deployment. The 130× data amplification (30 examples → 3,900 pairs) is the key enabler. Training in 2 minutes on a T4 GPU makes it feasible as an automated pipeline triggered by human labeling.


Production Deployment Results

Few-Shot Deployment: "preorder_anime" Intent

| Metric | Prototypical (10 ex) | SetFit (30 ex) | Full Train (200 ex) |
|---|---|---|---|
| Accuracy | 74.2% | 84.1% | 91.3% |
| Time to deploy | 5 min | 15 min | 4 hours |
| Model size | 0 (uses existing encoder) | 88MB | 264MB |
| Human labeling cost | $5 | $15 | $100 |
| Total cost | $5 | $16 | $108 |

Lifecycle Transition Results

| Week | Method | Examples | Accuracy | Notes |
|---|---|---|---|---|
| Week 0 (launch) | Prototypical | 10 | 74.2% | Same-day deploy |
| Week 1 | SetFit | 30 | 84.1% | Auto-triggered by 30 labels |
| Week 3 | SetFit | 65 | 87.4% | Retrained with more data |
| Week 6 | Full (continual) | 150 | 91.3% | Auto-triggered by 100 threshold |
| Week 12 | Full (production) | 500+ | 93.1% | Merged into monthly retrain |