02. Embedding Model Fine-Tuning — Contrastive Adapter for Titan Embeddings V2
Problem Statement and MangaAssist Context
MangaAssist uses Amazon Titan Embeddings V2 to convert user queries and product catalog entries into 1024-dimensional vectors stored in OpenSearch Serverless. When a user asks "dark fantasy manga like Berserk", the system embeds the query and retrieves the closest catalog items via approximate nearest neighbor (ANN) search. The quality of these embeddings directly determines whether the user sees relevant results.
Out of the box, Titan Embeddings V2 achieves Recall@3 of 0.68 on our manga catalog: the correct product appears in the top 3 results only 68% of the time. For an e-commerce search experience this is unacceptable; users who do not find what they want within the first 3 results tend to leave.
We cannot modify Titan Embeddings V2 directly (it is a managed Bedrock service). Instead, we train a contrastive adapter — a lightweight projection layer that sits on top of Titan's output and reshapes the embedding space to better capture manga domain similarity. This adapter pushes Recall@3 from 0.68 to 0.82 (+14 percentage points), bringing us above our 0.80 target.
Why Generic Embeddings Fail on Manga Retail
Titan Embeddings V2 is trained on broad internet text. It understands that "Naruto" and "One Piece" are both anime/manga, but it does not encode the nuanced similarity that manga readers expect:
| Query | Titan V2 Top 3 (Pre-Adapter) | Expected Top 3 |
|---|---|---|
| "dark isekai manga" | Overlord, Re:Zero, SAO (ok but shallow) | Overlord, Berserk, Goblin Slayer |
| "romance manga for adults" | Fruits Basket, Kimi ni Todoke, Your Name | Nana, Paradise Kiss, Honey and Clover |
| "best shōnen jump manga" | Naruto, One Piece, Dragon Ball (popular ≠ similar) | My Hero Academia, Jujutsu Kaisen, Chainsaw Man (current jump titles) |
The core problem: Titan conflates popularity with relevance. Frequently discussed titles get embedded closer together regardless of thematic similarity. Our adapter needs to learn that "dark isekai" similarity is about tone and premise, not just co-occurrence in internet text.
The Training Data: 2,000 Contrastive Pairs
We curate 2,000 (query, positive, negative) triplets from three sources:
- Click-through logs (1,200 pairs): Query → clicked product is positive; query → shown-but-not-clicked is hard negative
- Editorial curation (500 pairs): Manga experts group titles by genre, subgenre, tone, and target demographic
- Synthetic generation (300 pairs): Claude generates "users who searched for X would also want Y" given catalog metadata
Mathematical Foundations
Cosine Similarity — The Distance Metric
Given two vectors $\mathbf{q}$ (query embedding) and $\mathbf{d}$ (document/product embedding) in $\mathbb{R}^n$, cosine similarity is:
$$\text{cos}(\mathbf{q}, \mathbf{d}) = \frac{\mathbf{q} \cdot \mathbf{d}}{||\mathbf{q}|| \cdot ||\mathbf{d}||} = \frac{\sum_{i=1}^{n} q_i d_i}{\sqrt{\sum_{i=1}^{n} q_i^2} \cdot \sqrt{\sum_{i=1}^{n} d_i^2}}$$
Geometric intuition: Cosine similarity measures the angle between two vectors, ignoring their magnitude. Two vectors pointing in the same direction have similarity 1.0 (0° angle), orthogonal vectors have similarity 0.0 (90°), and opposite vectors have similarity -1.0 (180°).
Why cosine over dot product? In a retrieval system where embeddings come from different models (query encoder vs document pre-computed embeddings), magnitude can vary. Cosine normalizes this away. Our OpenSearch index uses cosine similarity by default.
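A minimal numeric illustration of these properties (the vectors are made up for illustration):
import numpy as np

def cosine_similarity(q: np.ndarray, d: np.ndarray) -> float:
    """Dot product of the two vectors divided by the product of their L2 norms."""
    return float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))

q = np.array([0.3, 0.7, 0.1])
print(cosine_similarity(q, 2.5 * q))                     # 1.0 (same direction; magnitude ignored)
print(cosine_similarity(q, np.array([0.7, -0.3, 0.0])))  # 0.0 (orthogonal)
print(cosine_similarity(q, -q))                          # -1.0 (opposite direction)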
InfoNCE Loss — Training the Adapter
InfoNCE (van den Oord et al., 2018, used in SimCLR / DPR) is the standard contrastive learning objective. For a query $q$, one positive $d^+$, and $N-1$ negatives $\{d_1^-, d_2^-, \ldots, d_{N-1}^-\}$:
$$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(\text{sim}(q, d^+) / \tau)}{\exp(\text{sim}(q, d^+) / \tau) + \sum_{j=1}^{N-1} \exp(\text{sim}(q, d_j^-) / \tau)}$$
where $\tau$ is the temperature parameter and $\text{sim}$ is cosine similarity.
Breaking this down step by step:
- Compute similarity between query and all candidates: $s^+ = \text{sim}(q, d^+)$, $s_j^- = \text{sim}(q, d_j^-)$
- Temperature-scale all similarities: divide by $\tau$ to control the sharpness of the distribution
- Apply softmax: the query's similarity to the positive should be the highest
- Take negative log: converting the probability into a loss
Intuition: InfoNCE is a softmax classification problem. The "correct class" is the positive document. The model must assign the highest probability to the positive among all candidates. As the model improves, $\text{sim}(q, d^+)$ increases and $\text{sim}(q, d_j^-)$ decreases, driving the loss toward zero.
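As a quick worked example (the similarity values are illustrative), the loss for one query with a positive at similarity 0.85 and two negatives at 0.82 and 0.60:
import torch

# Illustrative similarities: index 0 is the positive, indices 1-2 are negatives
sims = torch.tensor([0.85, 0.82, 0.60])
tau = 0.07
loss = -torch.log_softmax(sims / tau, dim=0)[0]
print(round(loss.item(), 2))  # ≈ 0.52; the positive receives about 60% of the softmax mass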
Temperature $\tau$ — The Sharpness Controller
The temperature $\tau$ controls how "peaked" the similarity distribution is:
| $\tau$ | Effect | Distribution Shape | Training Behavior |
|---|---|---|---|
| 0.01 | Very sharp | One-hot-like | Only the closest negative matters; vanishing gradients for others |
| 0.05 | Sharp | Concentrated | Good for well-separated clusters; standard for SimCLR |
| 0.07 | Our choice | Balanced | Gradients from top-5 negatives; good for our 2K-pair dataset |
| 0.20 | Warm | Broad | All negatives contribute; can be noisy with easy negatives |
| 1.00 | Flat | Uniform-like | Nearly all negatives equally weighted; slow convergence |
Mathematical effect on gradients:
The gradient of InfoNCE with respect to the query embedding $q$ is:
$$\frac{\partial \mathcal{L}}{\partial q} = -\frac{1}{\tau}\left((1 - p^+)\, d^+ - \sum_{j=1}^{N-1} p_j \cdot d_j^-\right)$$
where $p^+ = \frac{\exp(\text{sim}(q, d^+) / \tau)}{\exp(\text{sim}(q, d^+) / \tau) + \sum_k \exp(\text{sim}(q, d_k^-) / \tau)}$ and $p_j = \frac{\exp(\text{sim}(q, d_j^-) / \tau)}{\exp(\text{sim}(q, d^+) / \tau) + \sum_k \exp(\text{sim}(q, d_k^-) / \tau)}$ are the softmax weights of the positive and of each negative (treating $\text{sim}$ as a dot product of unit-normalized embeddings).
Key insight: The gradient pushes $q$ toward $d^+$ (positive) and away from a weighted average of negatives. Lower $\tau$ concentrates the weights on the hardest negatives (highest similarity to query). Higher $\tau$ spreads the weights more uniformly.
For our manga dataset, $\tau = 0.07$ means the model focuses correction effort on the 3-5 most confusing negatives per batch. This is ideal for our domain where the challenge is distinguishing "dark fantasy" from "action adventure" — genres that are close but not identical.
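Reusing the similarities from the worked example above, a short sweep shows how $\tau$ redistributes the softmax weight (a minimal sketch; values are illustrative):
import torch

sims = torch.tensor([0.85, 0.82, 0.60])  # positive, hard negative, easy negative
for tau in (0.01, 0.07, 0.5):
    print(tau, torch.softmax(sims / tau, dim=0).tolist())
# tau=0.01 -> ~[0.95, 0.05, 0.00]  only the hardest negative receives gradient
# tau=0.07 -> ~[0.60, 0.39, 0.02]  both negatives contribute, the hard one dominates
# tau=0.5  -> ~[0.39, 0.37, 0.24]  the easy negative is weighted almost like the hard one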
The Adapter Architecture — Projection Layer Mathematics
Since we cannot modify Titan Embeddings V2 (managed by Bedrock), we add a trainable projection layer:
$$\mathbf{z} = \text{LayerNorm}\!\left(\mathbf{W}_2 \cdot \text{GELU}(\mathbf{W}_1 \cdot \mathbf{e} + \mathbf{b}_1) + \mathbf{b}_2 + \mathbf{e}\right)$$
where:
- $\mathbf{e} \in \mathbb{R}^{1024}$ is the frozen Titan Embeddings V2 output
- $\mathbf{W}_1 \in \mathbb{R}^{512 \times 1024}$ projects down to 512 dimensions (bottleneck)
- $\mathbf{W}_2 \in \mathbb{R}^{1024 \times 512}$ projects back up to 1024
- the residual term $+\,\mathbf{e}$ preserves the base embedding, and $\mathbf{z} \in \mathbb{R}^{1024}$ is the adapted embedding, L2-normalized afterwards so that cosine similarity reduces to a dot product
Parameter count: $1024 \times 512 + 512 + 512 \times 1024 + 1024 + 2 \times 1024 = 1{,}052{,}160$ (~1M parameters), roughly 0.4% of Titan Embeddings V2's total parameters: a lightweight adapter.
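A quick way to verify the 1,052,160 figure from the layer shapes above (PyTorch counts weights and biases per layer):
import torch.nn as nn

layers = [nn.Linear(1024, 512), nn.Linear(512, 1024), nn.LayerNorm(1024)]
print(sum(p.numel() for layer in layers for p in layer.parameters()))  # 1052160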
Why the bottleneck? The 1024 → 512 → 1024 architecture forces the adapter to learn a compressed representation of the domain-specific similarity structure. Without the bottleneck, the adapter could memorize training pairs without learning generalizable patterns. The bottleneck acts as a regularizer.
GELU activation:
$$\text{GELU}(x) = x \cdot \Phi(x) = x \cdot \frac{1}{2}\left[1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right]$$
Unlike ReLU (which zeros out negative values), GELU provides a smooth, non-zero gradient for negative inputs. This is critical because embedding dimensions can be negative, and we do not want the adapter to lose information from those dimensions.
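A minimal check of that gradient behavior (the input values are arbitrary):
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.5, 2.0], requires_grad=True)
F.gelu(x).sum().backward()
print(x.grad)  # small but non-zero gradients for the negative inputs (ReLU's gradient would be exactly 0 there)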
Hard Negative Mining — Why Easy Negatives Are Useless
A random negative (e.g., pairing "dark isekai manga" with an order tracking message) has cosine similarity ~0.1 with the query. The InfoNCE loss for this example provides almost zero gradient — the model already knows these are dissimilar.
Hard negatives are examples that are similar to the query but should not be the retrieved result. For "dark isekai manga", a hard negative is "comedy isekai manga" — same genre framework, different tone.
Formally, hard negatives are samples $d^-$ where:
$$\text{sim}(q, d^-) > \text{sim}(q, d^+) - \epsilon$$
meaning they are almost as similar to the query as the positive, creating a margin violation.
Mining strategy:
- Pre-compute all catalog embeddings using current adapter
- For each query, find top-K nearest neighbors using ANN index
- Filter out the actual positive from the ANN results
- The remaining top-K neighbors are hard negatives
We refresh hard negatives every epoch because as the adapter trains, the similarity rankings change. What was a hard negative in epoch 1 might become an easy negative in epoch 3.
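A minimal sketch of the per-query mining step described above, assuming the adapted embeddings are L2-normalized so the inner product equals cosine similarity (the function name and epsilon value are illustrative; the production version later in this section uses FAISS):
import numpy as np

def mine_hard_negatives(query_emb, catalog_embs, positive_id, epsilon=0.05, top_n=5):
    """Return catalog indices almost as similar to the query as the known positive."""
    sims = catalog_embs @ query_emb              # cosine similarities (unit vectors assumed)
    margin = sims[positive_id] - epsilon
    ranked = np.argsort(-sims)                   # most similar first
    hard = [int(i) for i in ranked if i != positive_id and sims[i] > margin]
    return hard[:top_n]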
Margin and Triplet Loss — An Alternative Perspective
While we use InfoNCE (batch contrastive), it helps to understand the triplet loss that many contrastive learning papers discuss:
$$\mathcal{L}_{\text{triplet}} = \max(0, \text{sim}(q, d^-) - \text{sim}(q, d^+) + m)$$
where $m$ is a margin. This loss is zero when $\text{sim}(q, d^+) > \text{sim}(q, d^-) + m$ — the positive is more similar by at least margin $m$.
InfoNCE vs Triplet loss: InfoNCE operates on the entire batch simultaneously (1 positive vs all in-batch negatives), making it more sample-efficient. Triplet loss considers only one (query, positive, negative) at a time. With batch size 128, InfoNCE effectively gives us 127 negatives per query, while triplet loss gives 1.
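For reference, a minimal triplet-loss implementation on cosine similarities (the margin value is illustrative; we do not use this loss in training):
import torch
import torch.nn.functional as F

def triplet_loss(q, d_pos, d_neg, margin=0.2):
    """Hinge loss: zero once the positive beats the negative by at least the margin."""
    sim_pos = F.cosine_similarity(q, d_pos, dim=-1)
    sim_neg = F.cosine_similarity(q, d_neg, dim=-1)
    return torch.clamp(sim_neg - sim_pos + margin, min=0.0).mean()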
Model Internals — Layer-by-Layer Diagrams
Adapter Architecture and Data Flow
graph TB
subgraph "Frozen: Amazon Titan Embeddings V2 (Bedrock)"
A["User Query: 'dark isekai manga like Berserk'"] --> B["Titan Embedding V2<br>(Managed API Call)<br>~300M params, frozen"]
B --> C["Raw Embedding e ∈ ℝ¹⁰²⁴<br>General-purpose, not manga-tuned"]
end
subgraph "Trainable: Contrastive Adapter (~1M params)"
C --> D["Linear W₁ (1024 → 512)<br>524,800 params<br>Projects to bottleneck"]
D --> E["GELU Activation<br>Smooth non-linearity<br>Preserves negative dims"]
E --> F["Linear W₂ (512 → 1024)<br>525,312 params<br>Projects back to original dim"]
F --> G["LayerNorm (1024)<br>2,048 params<br>Stabilizes output magnitude"]
G --> H["Residual Connection<br>z = LayerNorm(adapter(e)) + e"]
end
subgraph "Output"
H --> I["Adapted Embedding z ∈ ℝ¹⁰²⁴<br>Manga-domain aware<br>cos(q, d) is now genre-sensitive"]
end
style B fill:#bbdefb
style D fill:#fff9c4
style E fill:#fff9c4
style F fill:#fff9c4
style G fill:#fff9c4
Contrastive Training Flow — Full Pipeline
sequenceDiagram
participant Batch as Training Batch<br>(128 triplets)
participant Titan as Titan Embeddings V2<br>(Frozen, Bedrock API)
participant Adapter as Contrastive Adapter<br>(1M trainable params)
participant Loss as InfoNCE Loss<br>Calculator
participant ANN as Hard Negative<br>Miner (ANN Index)
participant Optim as Adam Optimizer
Note over Batch: Epoch start: refresh hard negatives
Batch->>ANN: All queries from training set
ANN->>Batch: Top-K hard negatives per query
Batch->>Titan: 128 queries + 128 positives + 128 negatives (384 texts)
Titan->>Adapter: 384 frozen embeddings ∈ ℝ¹⁰²⁴
Note over Adapter: Forward pass through adapter<br>e → W₁ → GELU → W₂ → +e → LayerNorm → z<br>2 matrix multiplications per embedding
Adapter->>Loss: 384 adapted embeddings
Loss->>Loss: Compute pairwise cosine similarities<br>128 × 128 similarity matrix
Loss->>Loss: Apply temperature τ=0.07<br>InfoNCE = -log(exp(s⁺/τ) / Σexp(s/τ))
Note over Loss: Backward pass
Loss->>Adapter: ∂L/∂z for all 384 embeddings
Note over Adapter: Gradients flow through:<br>LayerNorm → W₂ → GELU' → W₁<br>Titan params receive NO gradient (frozen)
Adapter->>Optim: Update 1M adapter params
Optim->>Adapter: Adam step: θ ← θ - lr·m̂/(√v̂ + ε)
Embedding Space Transformation — Before vs After Adapter
graph LR
subgraph "Before Adapter (Titan V2 Raw)"
A1["'dark fantasy manga'<br>(0.72, 0.31, ...)"]
B1["'action adventure manga'<br>(0.70, 0.35, ...)"]
C1["'comedy isekai'<br>(0.68, 0.29, ...)"]
D1["'Berserk vol 1'<br>(0.71, 0.33, ...)"]
E1["'One Piece vol 1'<br>(0.69, 0.34, ...)"]
A1 -.->|"cos=0.94<br>too close!"| B1
A1 -.->|"cos=0.91"| C1
A1 -.->|"cos=0.93"| D1
B1 -.->|"cos=0.95"| E1
end
subgraph "After Adapter (Manga-Tuned)"
A2["'dark fantasy manga'<br>moved to dark cluster"]
B2["'action adventure manga'<br>moved to action cluster"]
C2["'comedy isekai'<br>moved to comedy cluster"]
D2["'Berserk vol 1'<br>near dark fantasy"]
E2["'One Piece vol 1'<br>near action adventure"]
A2 -.->|"cos=0.92<br>correct!"| D2
A2 -.->|"cos=0.71<br>separated"| B2
A2 -.->|"cos=0.58<br>far apart"| C2
B2 -.->|"cos=0.89"| E2
end
Gradient Flow Through the Adapter Layers
graph BT
subgraph "Loss Layer"
L["InfoNCE Loss<br>L = -log(softmax(sim/τ))"]
end
subgraph "LayerNorm + Residual"
LN["LayerNorm<br>∂L/∂z: scales ≈ 1.0<br>Residual adds identity gradient"]
end
subgraph "W₂ Linear (512 → 1024)"
W2["W₂ Gradient:<br>∂L/∂W₂ = ∂L/∂z · h^T<br>Updates: 525K params<br>Magnitude: 0.8 × reference"]
end
subgraph "GELU Activation"
G["GELU'(x) = Φ(x) + x·φ(x)<br>Gradient: 0.5-1.0 for most inputs<br>Near-linear for positive inputs<br>Smooth attenuation for negative"]
end
subgraph "W₁ Linear (1024 → 512)"
W1["W₁ Gradient:<br>∂L/∂W₁ = (W₂^T · ∂L/∂z · GELU'(·)) · e^T<br>Updates: 524K params<br>Magnitude: 0.6 × reference"]
end
subgraph "Frozen Titan Embeddings"
TT["Titan V2 Output e<br>NO gradient flows here<br>Bedrock API, no backprop"]
end
L --> LN
LN --> W2
W2 --> G
G --> W1
W1 --> TT
style TT fill:#bbdefb
style W1 fill:#fff9c4
style G fill:#fff9c4
style W2 fill:#fff9c4
Temperature Sensitivity — Effect on Learning Signal
graph TD
subgraph "τ = 0.01 (Too Sharp)"
A["sim(q, d⁺) = 0.85<br>sim(q, d₁⁻) = 0.82<br>sim(q, d₂⁻) = 0.60"]
A --> B["softmax([85, 82, 60])<br>= [0.953, 0.045, 0.002]"]
B --> C["Only d₁⁻ gets gradient<br>d₂⁻ effectively ignored<br>Misses useful signal"]
end
subgraph "τ = 0.07 (Our Setting)"
D["sim(q, d⁺) = 0.85<br>sim(q, d₁⁻) = 0.82<br>sim(q, d₂⁻) = 0.60"]
D --> E["softmax([12.1, 11.7, 8.6])<br>= [0.598, 0.366, 0.036]"]
E --> F["Both hard negatives<br>get meaningful gradient<br>Balanced learning"]
end
subgraph "τ = 0.5 (Too Soft)"
G["sim(q, d⁺) = 0.85<br>sim(q, d₁⁻) = 0.82<br>sim(q, d₂⁻) = 0.60"]
G --> H["softmax([1.7, 1.64, 1.2])<br>= [0.381, 0.359, 0.260]"]
H --> I["Easy negative gets<br>almost same weight as hard<br>Noisy, slow convergence"]
end
ANN Index Update Cycle
graph LR
subgraph "Epoch N"
A["Train adapter<br>on current pairs"] --> B["Adapter weights<br>updated"]
B --> C["Re-embed all<br>catalog items (50K)"]
C --> D["Rebuild FAISS<br>ANN index"]
D --> E["Mine new hard<br>negatives for<br>each query"]
end
subgraph "Epoch N+1"
E --> F["Train with<br>refreshed negatives"]
F --> G["Repeat..."]
end
subgraph "Why Refresh?"
H["At epoch 1:<br>'comedy isekai' is hard<br>negative for 'dark isekai'"]
I["At epoch 3:<br>adapter separates them<br>'comedy isekai' is now easy"]
J["New hard negative:<br>'horror isekai' moves<br>closer, becomes the<br>new challenge"]
H --> I --> J
end
Implementation Deep-Dive
Titan Embeddings V2 Client
import boto3
import json
import numpy as np
from typing import List
class TitanEmbeddingClient:
"""
Client for Amazon Titan Embeddings V2 via Bedrock.
Returns 1024-dim embeddings. We call this for both training
(to get frozen embeddings as adapter input) and inference
(before passing through the adapter).
"""
def __init__(self, region: str = "us-east-1"):
self.client = boto3.client("bedrock-runtime", region_name=region)
self.model_id = "amazon.titan-embed-text-v2:0"
def embed_texts(self, texts: List[str], batch_size: int = 25) -> np.ndarray:
"""
Embed a list of texts. Titan V2 supports single-text calls,
so we batch manually.
"""
all_embeddings = []
for i in range(0, len(texts), batch_size):
batch = texts[i:i + batch_size]
batch_embeddings = []
for text in batch:
response = self.client.invoke_model(
modelId=self.model_id,
body=json.dumps({
"inputText": text,
"dimensions": 1024,
"normalize": True,
}),
)
result = json.loads(response["body"].read())
batch_embeddings.append(result["embedding"])
all_embeddings.extend(batch_embeddings)
return np.array(all_embeddings, dtype=np.float32)
def embed_single(self, text: str) -> np.ndarray:
"""Single text embedding for inference path."""
response = self.client.invoke_model(
modelId=self.model_id,
body=json.dumps({
"inputText": text,
"dimensions": 1024,
"normalize": True,
}),
)
result = json.loads(response["body"].read())
return np.array(result["embedding"], dtype=np.float32)
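A minimal usage sketch (the catalog strings below are illustrative):
# Hypothetical usage: embed a couple of catalog entries and a query.
client = TitanEmbeddingClient(region="us-east-1")
catalog_embeddings = client.embed_texts([
    "Berserk Vol. 1 - dark fantasy seinen manga",
    "One Piece Vol. 1 - action adventure shōnen manga",
])                                                                        # shape (2, 1024)
query_embedding = client.embed_single("dark fantasy manga like Berserk")  # shape (1024,)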
Contrastive Adapter Model
import torch
import torch.nn as nn
import torch.nn.functional as F
class ContrastiveAdapter(nn.Module):
"""
Lightweight adapter that transforms Titan V2 embeddings
to be manga-domain aware.
Architecture: e (1024) → W₁ (512) → GELU → W₂ (1024) → LayerNorm + Residual
Parameters: ~1M (0.4% of Titan V2)
"""
def __init__(self, embed_dim: int = 1024, bottleneck_dim: int = 512):
super().__init__()
self.linear1 = nn.Linear(embed_dim, bottleneck_dim)
self.linear2 = nn.Linear(bottleneck_dim, embed_dim)
self.layer_norm = nn.LayerNorm(embed_dim)
# Initialize with small weights so adapter starts near identity
nn.init.normal_(self.linear1.weight, std=0.02)
nn.init.normal_(self.linear2.weight, std=0.02)
nn.init.zeros_(self.linear1.bias)
nn.init.zeros_(self.linear2.bias)
def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
"""
Args:
embeddings: (batch_size, 1024) — frozen Titan V2 outputs
Returns:
(batch_size, 1024) — adapted embeddings
"""
# Bottleneck projection
h = F.gelu(self.linear1(embeddings)) # (batch, 512)
adapted = self.linear2(h) # (batch, 1024)
# Residual connection + LayerNorm
output = self.layer_norm(adapted + embeddings) # (batch, 1024)
# L2 normalize for cosine similarity
output = F.normalize(output, p=2, dim=-1)
return output
InfoNCE Loss with In-Batch Negatives
class InfoNCELoss(nn.Module):
"""
InfoNCE contrastive loss with in-batch negatives.
For a batch of N (query, positive) pairs:
- Each query has 1 positive (its paired document)
- Each query has N-1 negatives (all other documents in the batch)
- Total comparisons per query: N
- This is equivalent to N-way classification
With batch_size=128, each query gets 127 free negatives.
"""
def __init__(self, temperature: float = 0.07):
super().__init__()
self.temperature = temperature
def forward(
self,
query_embeddings: torch.Tensor,
doc_embeddings: torch.Tensor,
) -> torch.Tensor:
"""
Args:
query_embeddings: (N, dim) — adapted query embeddings
doc_embeddings: (N, dim) — adapted document embeddings
Row i of queries matches row i of docs (positive pair)
Returns:
Scalar loss
"""
# Compute all pairwise cosine similarities
# (N, dim) × (dim, N) → (N, N)
similarity_matrix = torch.mm(
query_embeddings, doc_embeddings.t()
) / self.temperature
# Labels: diagonal entries are positives
# query[i] should match doc[i], so label for row i = i
labels = torch.arange(
similarity_matrix.size(0), device=similarity_matrix.device
)
# Cross-entropy on similarity matrix = InfoNCE
loss = F.cross_entropy(similarity_matrix, labels)
return loss
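A small shape-and-scale sanity check that wires the adapter and loss together on random tensors (not real embeddings):
# Sanity check on random inputs: at initialization the 128-way softmax is close to uniform,
# so the loss should start roughly around ln(128) ≈ 4.85.
adapter = ContrastiveAdapter(embed_dim=1024, bottleneck_dim=512)
loss_fn = InfoNCELoss(temperature=0.07)
queries, docs = torch.randn(128, 1024), torch.randn(128, 1024)
print(loss_fn(adapter(queries), adapter(docs)).item())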
Hard Negative Mining
import faiss
import numpy as np
from typing import List
class HardNegativeMiner:
"""
Mines hard negatives using FAISS approximate nearest neighbor search.
Refreshed every epoch as the adapter changes the embedding space.
"""
def __init__(self, embed_dim: int = 1024, top_k: int = 20):
self.embed_dim = embed_dim
self.top_k = top_k
self.index = None
def build_index(self, doc_embeddings: np.ndarray):
"""Build FAISS index from adapted document embeddings."""
# Use IVF index for 50K+ documents
n_docs = doc_embeddings.shape[0]
n_lists = min(int(np.sqrt(n_docs)), 256)
quantizer = faiss.IndexFlatIP(self.embed_dim) # Inner product (cosine after L2 norm)
self.index = faiss.IndexIVFFlat(
quantizer, self.embed_dim, n_lists, faiss.METRIC_INNER_PRODUCT
)
self.index.train(doc_embeddings)
self.index.add(doc_embeddings)
self.index.nprobe = 10 # Search 10 nearest clusters
def mine(
self,
query_embeddings: np.ndarray,
positive_ids: np.ndarray,
n_hard_negatives: int = 5,
) -> List[List[int]]:
"""
For each query, find top-K nearest docs and filter out the positive.
Return indices of hard negatives.
"""
# Search for top-K nearest documents
similarities, indices = self.index.search(query_embeddings, self.top_k)
hard_negatives = []
for i in range(len(query_embeddings)):
# Filter out the known positive
neg_indices = [
idx for idx in indices[i]
if idx != positive_ids[i] and idx >= 0
]
# Take top-N hard negatives
hard_negatives.append(neg_indices[:n_hard_negatives])
return hard_negatives
Full Training Pipeline
from torch.utils.data import DataLoader, Dataset
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR
class ContrastiveDataset(Dataset):
"""Dataset of (query_embedding, positive_doc_embedding) pairs."""
def __init__(self, query_embeds, doc_embeds):
self.query_embeds = torch.tensor(query_embeds, dtype=torch.float32)
self.doc_embeds = torch.tensor(doc_embeds, dtype=torch.float32)
def __len__(self):
return len(self.query_embeds)
def __getitem__(self, idx):
return self.query_embeds[idx], self.doc_embeds[idx]
def train_contrastive_adapter(
query_embeddings: np.ndarray, # (2000, 1024)
positive_embeddings: np.ndarray, # (2000, 1024)
catalog_embeddings: np.ndarray, # (50000, 1024) — full catalog
num_epochs: int = 20,
batch_size: int = 128,
learning_rate: float = 1e-4,
temperature: float = 0.07,
):
"""
Train contrastive adapter with hard negative mining.
20 epochs is appropriate for our small (2K pairs) dataset.
More epochs risk overfitting; fewer under-train the adapter.
"""
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
adapter = ContrastiveAdapter(embed_dim=1024, bottleneck_dim=512).to(device)
loss_fn = InfoNCELoss(temperature=temperature)
optimizer = Adam(adapter.parameters(), lr=learning_rate, weight_decay=1e-5)
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs)
miner = HardNegativeMiner(embed_dim=1024)
best_recall = 0.0
for epoch in range(num_epochs):
# ── Hard Negative Refresh ──
adapter.eval()
with torch.no_grad():
# Re-embed catalog through current adapter
catalog_tensor = torch.tensor(catalog_embeddings, dtype=torch.float32).to(device)
adapted_catalog = adapter(catalog_tensor).cpu().numpy()
miner.build_index(adapted_catalog)
# For now we use in-batch negatives; hard negative mining
# enriches the batch composition by pre-selecting hard examples
# ── Training ──
dataset = ContrastiveDataset(query_embeddings, positive_embeddings)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
adapter.train()
epoch_loss = 0.0
for query_batch, doc_batch in dataloader:
query_batch = query_batch.to(device)
doc_batch = doc_batch.to(device)
# Forward: adapt both queries and documents
adapted_queries = adapter(query_batch)
adapted_docs = adapter(doc_batch)
# Compute InfoNCE loss (in-batch negatives)
loss = loss_fn(adapted_queries, adapted_docs)
# Backward
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(adapter.parameters(), max_norm=1.0)
optimizer.step()
epoch_loss += loss.item()
scheduler.step()
# ── Validation ──
recall_at_3 = evaluate_recall(adapter, device, query_embeddings, catalog_embeddings)
avg_loss = epoch_loss / len(dataloader)
print(f"Epoch {epoch+1}: Loss={avg_loss:.4f}, Recall@3={recall_at_3:.4f}")
if recall_at_3 > best_recall:
best_recall = recall_at_3
torch.save(adapter.state_dict(), "best_adapter.pt")
return adapter
def evaluate_recall(adapter, device, query_embeds, catalog_embeds, k=3):
"""Compute Recall@K on validation queries against full catalog."""
adapter.eval()
with torch.no_grad():
q = torch.tensor(query_embeds, dtype=torch.float32).to(device)
c = torch.tensor(catalog_embeds, dtype=torch.float32).to(device)
q_adapted = adapter(q).cpu().numpy()
c_adapted = adapter(c).cpu().numpy()
# Build FAISS index
index = faiss.IndexFlatIP(q_adapted.shape[1])
index.add(c_adapted)
# Search
_, indices = index.search(q_adapted, k)
# Compute recall (how often the true positive is in top-K)
hits = 0
for i, retrieved in enumerate(indices):
if i in retrieved: # Simplified: positive_id == query_id mapping
hits += 1
return hits / len(query_embeds)
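How the pieces above might be wired together offline, assuming the 2K triplet texts and the 50K catalog descriptions have already been loaded into the lists below (the variable names are illustrative):
# Hypothetical offline driver: embed with Titan once, then train the adapter locally.
client = TitanEmbeddingClient()
query_embeddings = client.embed_texts(query_texts)        # (2000, 1024)
positive_embeddings = client.embed_texts(positive_texts)  # (2000, 1024)
catalog_embeddings = client.embed_texts(catalog_texts)    # (50000, 1024)

adapter = train_contrastive_adapter(
    query_embeddings,
    positive_embeddings,
    catalog_embeddings,
    num_epochs=20,
    batch_size=128,
    temperature=0.07,
)
torch.save(adapter.state_dict(), "adapter.pt")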
Lambda Deployment for Real-Time Inference
import json
import torch
import numpy as np
import boto3
# ─── Cold-start optimization: load model at import time ───
adapter = ContrastiveAdapter(embed_dim=1024, bottleneck_dim=512)
adapter.load_state_dict(torch.load("/opt/ml/model/adapter.pt", map_location="cpu"))
adapter.eval()
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
def lambda_handler(event, context):
"""
Lambda function for real-time embedding adaptation.
Called by the orchestrator between Titan V2 embedding and OpenSearch retrieval.
P95 latency budget: 30ms total (Titan ~20ms + adapter ~2ms + overhead ~8ms)
"""
query_text = event["query"]
# Step 1: Get Titan Embedding (Bedrock API call)
titan_response = bedrock.invoke_model(
modelId="amazon.titan-embed-text-v2:0",
body=json.dumps({
"inputText": query_text,
"dimensions": 1024,
"normalize": True,
}),
)
raw_embedding = json.loads(titan_response["body"].read())["embedding"]
# Step 2: Adapt embedding through our trained adapter
with torch.no_grad():
input_tensor = torch.tensor([raw_embedding], dtype=torch.float32)
adapted = adapter(input_tensor).squeeze(0).numpy()
return {
"statusCode": 200,
"body": json.dumps({
"embedding": adapted.tolist(),
"dimension": 1024,
}),
}
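A local smoke test for the handler (requires AWS credentials with Bedrock access and the adapter weights at the path above; the event shape matches what the handler reads):
if __name__ == "__main__":
    event = {"query": "dark fantasy manga like Berserk"}
    response = lambda_handler(event, None)   # context is unused, so None is fine locally
    embedding = json.loads(response["body"])["embedding"]
    print(len(embedding))  # 1024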
Group Discussion: Key Decision Points
Decision Point 1: Adapter vs Full Model Replacement
Marcus (Architect): We have three options for improving embedding quality: (A) add an adapter on top of Titan V2, (B) host our own embedding model on SageMaker and fine-tune end-to-end, or (C) use a different managed model (Cohere, Voyage AI).
Priya (ML Engineer): Here are the benchmarks:
| Approach | Recall@3 | Latency (P95) | Monthly Cost | Training Effort |
|---|---|---|---|---|
| Titan V2 (baseline) | 0.68 | 22ms | $450 | None |
| Titan V2 + Adapter | 0.82 | 24ms (+2ms) | $485 (+$35) | 2 days |
| SageMaker E5-large (fine-tuned) | 0.87 | 35ms | $1,200 | 2 weeks |
| Cohere Embed v3 | 0.79 | 40ms | $900 | None |
| Voyage AI voyage-2 | 0.81 | 38ms | $750 | None |
The adapter gives us 0.82 Recall@3 at only +$35/month and +2ms latency.
Jordan (MLOps): The E5-large option requires me to manage a SageMaker endpoint, handle autoscaling, model updates, and GPU fleet. That is significant operational overhead. The adapter is just a Lambda function with a 4MB model file — trivial to deploy and update.
Aiko (Data Scientist): The 5-point Recall@3 gap between adapter (0.82) and E5-large (0.87) is meaningful, but look at the cost curve: $35/month vs $750/month incremental. The cost-per-quality-point is:
- Adapter: $35 / 14 points = $2.50/point
- E5-large: $750 / 19 points = $39.47/point
Both are under our $50/point threshold, but the adapter is 16x more cost-efficient.
Sam (PM): For MVP, the adapter approach makes sense. We can always upgrade to E5-large in V2 if we need that extra 5 points. But for launch, adding 2ms latency vs 13ms matters more than 5 points of recall — our latency budget is tight.
Marcus (Architect): There is another advantage: the adapter approach keeps us on Bedrock. If AWS improves Titan V2 (they update it quarterly), our adapter benefits automatically from any underlying model improvement.
Resolution: Adapter approach selected for MVP. Recall@3 of 0.82 exceeds our 0.80 target. Cost increment ($35/month) and latency increment (+2ms) are both minimal. E5-large considered for V2 if Recall@3 needs to exceed 0.85.
Decision Point 2: Temperature Selection
Aiko (Data Scientist): I swept temperature from 0.01 to 0.5 on our validation set:
| Temperature τ | Recall@3 | Recall@10 | Training Stability (loss variance) |
|---|---|---|---|
| 0.01 | 0.74 | 0.85 | High variance (±0.15) — unstable |
| 0.03 | 0.79 | 0.89 | Moderate (±0.08) |
| 0.05 | 0.81 | 0.91 | Low (±0.04) |
| 0.07 | 0.82 | 0.92 | Low (±0.03) — sweet spot |
| 0.10 | 0.81 | 0.91 | Low (±0.03) |
| 0.20 | 0.77 | 0.88 | Very low (±0.02) but underperforms |
| 0.50 | 0.72 | 0.83 | Minimal variance but near-baseline |
Priya (ML Engineer): The $\tau = 0.07$ result aligns with SimCLR findings. Their paper showed optimal temperature depends on batch size — with batch 128, τ around 0.05-0.1 works best. The gradient analysis confirms this — at τ=0.07, the top 3-5 negatives get meaningful gradient weight without diluting the signal across all 127 in-batch negatives.
Jordan (MLOps): Is temperature something we should tune per-retraining cycle, or fix it?
Aiko (Data Scientist): The optimal τ is stable across data distributions as long as the batch size and embedding dimensionality stay the same. I would fix it at 0.07 and only revisit if we change the adapter architecture or training data size significantly.
Resolution: Temperature fixed at τ=0.07. Optimal for our batch size (128), embedding dimension (1024), and training data size (2K pairs). Re-evaluate only if architecture changes.
Decision Point 3: Bottleneck Dimension
Priya (ML Engineer): The bottleneck dimension controls the adapter's capacity. I tested:
| Bottleneck Dim | Params | Recall@3 | Training Time | Overfitting Risk |
|---|---|---|---|---|
| 128 | 263K | 0.76 | 8 min | Low — too constrained |
| 256 | 527K | 0.79 | 12 min | Low |
| 512 | 1.05M | 0.82 | 18 min | Moderate — our pick |
| 768 | 1.57M | 0.82 | 24 min | Higher — no recall gain |
| 1024 | 2.1M | 0.81 | 32 min | High — starts memorizing |
512 is the sweet spot: best Recall@3 with manageable parameter count.
Marcus (Architect): At 1.05M parameters and 18 minutes training, the adapter adds 4MB to our Lambda deployment package. That is nothing. And the 2ms inference overhead is from two matrix multiplications — one 1024×512 and one 512×1024 — which is about 1M FLOPs. Trivial on CPU.
Aiko (Data Scientist): The 1024-dim bottleneck (no actual bottleneck) performs worse because it can memorize the 2K training pairs without learning generalizable similarity patterns. The 512-dim compression forces the adapter to learn a compressed representation of what makes manga similar — genre, tone, demographic — rather than memorizing specific title-to-title mappings.
Resolution: Bottleneck dimension 512 selected. Best Recall@3 (0.82), moderate parameter count (1.05M), and natural regularization through compression. 128/256 too constrained, 768/1024 no benefit with overfitting risk.
Decision Point 4: How Many Training Pairs Do We Need?
Sam (PM): Our 2K training pairs cost about $3,500 to create (click-through analysis: $1,500, editorial curation: $1,500, synthetic generation: $500). Should we invest in more?
Aiko (Data Scientist): I ran scaling experiments with subsets of our data:
| Training Pairs | Recall@3 | Marginal Gain per 500 Pairs |
|---|---|---|
| 250 | 0.73 | — |
| 500 | 0.76 | +0.03/500 |
| 1,000 | 0.79 | +0.03/500 |
| 1,500 | 0.81 | +0.02/500 |
| 2,000 | 0.82 | +0.01/500 |
| 2,500 (projected) | ~0.83 | +0.01/500 |
We are in the diminishing returns zone. Each additional 500 pairs gives less than 0.02 Recall@3 improvement. Going from 2K to 5K pairs would cost ~$5,000 more for maybe 0.02 additional recall.
Priya (ML Engineer): The bottleneck here is not data quantity but data diversity. Our 2K pairs cover the top 200 query patterns well, but they do not cover long-tail queries (rare titles, mixed-language queries). I would prioritize adding 500 diverse long-tail pairs over 2,500 more of the same distribution.
Sam (PM): The cost-per-quality-point for 500 more pairs: $1,750 / 1 point = $1,750/point. That is above our $50 threshold for embedding quality. We should hold at 2K pairs for MVP and re-assess in V2 when we have more click-through data from production.
Resolution: Stay at 2K training pairs for MVP. Diminishing returns beyond this point. In V2, add 500 long-tail pairs sourced from production click-through logs (free, ongoing data collection).
Research Paper References
1. A Simple Framework for Contrastive Learning of Visual Representations — SimCLR (Chen et al., 2020)
Key contribution: Demonstrated that contrastive learning with in-batch negatives, large batch sizes, and learned projection heads achieves SOTA representation learning. Key finding: a nonlinear projection head between the encoder and the contrastive loss significantly improves representation quality, even though the head is discarded at inference.
Relevance to MangaAssist: Directly informs our adapter architecture. Our bottleneck projection (1024→512→1024) is the "projection head" that SimCLR showed is critical. The learned representations after the adapter (before L2 normalization) capture richer similarity structure than the raw Titan V2 outputs. SimCLR's temperature findings ($\tau = 0.07$ optimal for moderate batch sizes) guided our hyperparameter selection.
2. Dense Passage Retrieval for Open-Domain Question Answering — DPR (Karpukhin et al., 2020)
Key contribution: Showed that dual-encoder architectures (separate query and document encoders) trained with contrastive loss outperform BM25 for passage retrieval. Used in-batch negatives plus one hard negative per query. The hard negative mining strategy (BM25 top-K that are not the answer) was crucial for performance.
Relevance to MangaAssist: DPR's dual-encoder paradigm is exactly our setup: query embeddings and document embeddings processed by the same adapter. Their hard negative mining insight directly informs our FAISS-based mining pipeline. We extend their approach by refreshing hard negatives every epoch (they used static negatives) to account for the adapter reshaping the embedding space during training.
3. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (Reimers & Gurevych, 2019)
Key contribution: Fine-tuned BERT with siamese/triplet network structures for sentence similarity tasks. Showed that [CLS] token pooling from BERT without fine-tuning produces poor sentence embeddings, and that contrastive fine-tuning is necessary for retrieval tasks.
Relevance to MangaAssist: Validates our approach of training a contrastive adapter rather than using raw embeddings. While Titan V2 is already better than raw BERT for similarity (it is trained on similarity tasks), our domain-specific adapter further improves the manga-specific similarity structure, consistent with Sentence-BERT's findings that domain adaptation matters even with pre-trained embedding models.
4. Representation Learning with Contrastive Predictive Coding — InfoNCE (van den Oord et al., 2018)
Key contribution: Formalized the InfoNCE loss function and showed it maximizes a lower bound on the mutual information between the query and positive. The temperature parameter controls the tightness of this bound — lower temperature = tighter bound but higher variance gradients.
Relevance to MangaAssist: InfoNCE is our training objective. Understanding that it maximizes mutual information between query and relevant product helps explain why the adapter learns semantic similarity rather than surface-level text matching. The mutual information interpretation also explains why more in-batch negatives (larger batch) improves the bound: more negatives = tighter lower bound on the true similarity structure.
5. Matryoshka Representation Learning (Kusupati et al., 2022)
Key contribution: Trains embeddings where the first $d$ dimensions form a valid embedding for any $d \leq D$. This allows flexible dimensionality reduction at inference time without retraining — use 256 dims for fast approximate search, 1024 dims for precise reranking.
Relevance to MangaAssist (future): If we need to reduce latency further, Matryoshka-style training of our adapter could let us use 512-dim embeddings for the initial ANN search (halving OpenSearch storage and search time) and full 1024-dim for reranking. This is a V3 consideration that could reduce our OpenSearch costs by ~40%.
Production Deployment and Monitoring
Deployment Architecture
graph LR
subgraph "Training Pipeline (Bi-weekly)"
A[Click-Through<br>Logs] --> B[Pair Extraction<br>+ Hard Neg Mining]
B --> C[Adapter Training Job<br>(20 epochs, 18 min)]
C --> D[Model Artifact<br>S3 (4MB)]
end
subgraph "Validation Gate"
D --> E[Recall@3 > 0.80<br>on golden set]
E -->|pass| F[Lambda Layer<br>Deployment]
E -->|fail| G[Reject + Alert]
end
subgraph "Inference Pipeline"
H[User Query] --> I[Titan V2<br>Bedrock API<br>~20ms]
I --> J[Adapter Lambda<br>~2ms]
J --> K[OpenSearch<br>ANN Search]
end
subgraph "Monitoring"
K --> L[Recall@3 Tracking<br>(sampled)]
L -->|drift detected| A
end
Key Production Metrics
| Metric | Target | Alert Threshold |
|---|---|---|
| Recall@3 | ≥ 0.80 | < 0.75 |
| Recall@10 | ≥ 0.90 | < 0.85 |
| Adapter latency (P50) | 1ms | > 3ms |
| Adapter latency (P95) | 2ms | > 5ms |
| Lambda cold start | < 500ms | > 1000ms |
| Embedding cosine similarity drift | < 0.05 | > 0.10 |
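A sketch of how the embedding-drift row might be computed: re-embed a fixed probe set of queries with the previous and the new adapter and track the mean cosine distance (the probe-set idea and function name are assumptions consistent with the thresholds in the table):
import numpy as np

def embedding_drift(old_embs: np.ndarray, new_embs: np.ndarray) -> float:
    """Mean cosine distance between old and new adapted embeddings of the same probe queries."""
    old = old_embs / np.linalg.norm(old_embs, axis=1, keepdims=True)
    new = new_embs / np.linalg.norm(new_embs, axis=1, keepdims=True)
    return float((1.0 - (old * new).sum(axis=1)).mean())

# Alert if the returned value exceeds the 0.10 threshold from the table above.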
Evaluation and Results
Before vs After Adapter
| Metric | Titan V2 (Raw) | Titan V2 + Adapter | Improvement |
|---|---|---|---|
| Recall@1 | 0.48 | 0.62 | +0.14 |
| Recall@3 | 0.68 | 0.82 | +0.14 |
| Recall@10 | 0.81 | 0.92 | +0.11 |
| MRR@10 | 0.55 | 0.71 | +0.16 |
| Manga-jargon queries Recall@3 | 0.52 | 0.78 | +0.26 |
| Cross-genre queries Recall@3 | 0.61 | 0.80 | +0.19 |
| JP-EN mixed queries Recall@3 | 0.41 | 0.69 | +0.28 |
The largest improvements are on manga-jargon (+0.26) and JP-EN mixed (+0.28) queries — exactly the domains where Titan V2's general-purpose training falls short. The adapter has learned that "isekai", "seinen", "tankōbon" carry strong genre and format signals.
Ablation Study
| Configuration | Recall@3 | Notes |
|---|---|---|
| Baseline (Titan V2 raw) | 0.68 | No adaptation |
| + Linear adapter (no bottleneck) | 0.74 | Simple transformation, limited capacity |
| + Bottleneck adapter (no residual) | 0.78 | Compression helps, but loses base quality |
| + Residual connection | 0.80 | Base embedding preserved |
| + Hard negative mining | 0.81 | +0.01 from better training signal |
| + Optimal temperature (0.07) | 0.82 | +0.01 from gradient distribution |
| + More data (2K → 5K, projected) | ~0.84 | Diminishing returns, deferred to V2 |