03. Cross-Encoder Reranker Fine-Tuning — ms-marco-MiniLM for MangaAssist
Problem Statement and MangaAssist Context
After the embedding adapter retrieves the top-10 candidate products from OpenSearch, a cross-encoder reranker re-scores them to produce the final top-3 results. This two-stage retrieval is necessary because embedding-based (bi-encoder) search is fast but imprecise — it encodes queries and documents independently. The cross-encoder sees the query and document together, computing deep cross-attention between them, which captures nuanced relevance that bi-encoders miss.
MangaAssist uses ms-marco-MiniLM-L-6-v2 as the reranker. Out of the box (trained on MS-MARCO web passages), it achieves NDCG@3 of 0.71 on our manga catalog. After fine-tuning on 3,000 manga query-document pairs with pairwise ranking loss, it reaches NDCG@3 of 0.84 — a 13-point improvement that translates to noticeably better result ordering.
Why Cross-Attention Matters for Manga
Bi-encoders embed "dark isekai manga" and a product description independently. They cannot model fine-grained interactions like:
| Query | Document Excerpt | Bi-Encoder Score | Cross-Encoder Score |
|---|---|---|---|
| "dark isekai manga" | "Overlord — A dark fantasy isekai where..." | 0.82 | 0.94 |
| "dark isekai manga" | "Sword Art Online — An isekai adventure with..." | 0.79 | 0.61 |
| "manga like Berserk" | "Vagabond — A seinen epic by Takehiko Inoue..." | 0.71 | 0.88 |
| "manga like Berserk" | "Dragon Ball — A shōnen classic by Toriyama..." | 0.68 | 0.32 |
The cross-encoder "sees" that "dark" in the query aligns with "dark fantasy" in Overlord's description but not with SAO's "adventure". This word-level alignment is impossible for bi-encoders that produce a single vector per text.
The Two-Stage Retrieval Pipeline
- Stage 1 — Bi-Encoder (Titan V2 + Adapter): Retrieve top-10 from 50K catalog items. Latency: ~34ms (24ms embedding + 10ms ANN search). Cost: $0.0001/query.
- Stage 2 — Cross-Encoder (ms-marco-MiniLM): Re-score 10 candidates with the query. Latency: ~50ms (5ms × 10 pairs). Cost: $0.0003/query (SageMaker real-time endpoint).
Total retrieval latency: ~84ms. Without the reranker, users would see the bi-encoder's top-3, which has NDCG@3 of only 0.72. The fine-tuned reranker lifts this to 0.84, a 12-point gain in top-3 ordering quality.
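The control flow is simple enough to sketch. In the snippet below, embed_query, knn_search, and cross_encoder_scores are hypothetical stand-ins for the Titan V2 adapter call, the OpenSearch k-NN query, and the reranker endpoint:

```python
from typing import List, Tuple

def retrieve_top3(query: str) -> List[Tuple[str, float]]:
    """Two-stage retrieval: fast approximate recall, then precise reranking."""
    # Stage 1: bi-encoder retrieval pulls top-10 of 50K items (~34ms)
    query_vec = embed_query(query)            # hypothetical: Titan V2 + adapter
    candidates = knn_search(query_vec, k=10)  # hypothetical: OpenSearch ANN

    # Stage 2: cross-encoder scores each (query, doc) pair jointly (~50ms)
    scores = cross_encoder_scores(query, candidates)  # hypothetical endpoint call
    reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return reranked[:3]
```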
Mathematical Foundations
NDCG — Normalized Discounted Cumulative Gain
NDCG measures how well a ranked list places the most relevant items at the top. For a ranking of $K$ items:
$$\text{DCG@K} = \sum_{i=1}^{K} \frac{2^{r_i} - 1}{\log_2(i + 1)}$$
where $r_i$ is the relevance score of the item at position $i$.
$$\text{NDCG@K} = \frac{\text{DCG@K}}{\text{IDCG@K}}$$
where IDCG@K is the DCG of the ideal (perfect) ranking.
Intuition: The $\frac{1}{\log_2(i+1)}$ discount means position 1 is worth 1.0, position 2 is worth 0.63, position 3 is worth 0.50, and so on. A highly relevant item at position 1 contributes much more than the same item at position 5. This matches real user behavior — most users only look at the first 2-3 results.
Example for MangaAssist:
Query: "dark fantasy manga" - Berserk (relevance = 3, highly relevant) - Claymore (relevance = 2, relevant) - One Piece (relevance = 0, not relevant)
$$\text{DCG@3} = \frac{2^3 - 1}{\log_2(2)} + \frac{2^2 - 1}{\log_2(3)} + \frac{2^0 - 1}{\log_2(4)} = \frac{7}{1} + \frac{3}{1.585} + \frac{0}{2} = 7 + 1.893 + 0 = 8.893$$
Ideal ranking would put Berserk first, Claymore second: IDCG@3 = same = 8.893, so NDCG@3 = 1.0.
If the ranker swaps Berserk and One Piece: $$\text{DCG@3} = \frac{0}{1} + \frac{3}{1.585} + \frac{7}{2} = 0 + 1.893 + 3.5 = 5.393$$
NDCG@3 = 5.393 / 8.893 = 0.606. Much worse — placing the best result last costs 39.4% of the possible score.
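A few lines of Python reproduce this arithmetic, which makes a useful sanity check for any NDCG implementation:

```python
import math

def dcg_at_k(relevances, k=3):
    # 2^r - 1 gain with the log2(position + 1) discount used above
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(relevances[:k]))

ideal = [3, 2, 0]                             # Berserk, Claymore, One Piece
print(dcg_at_k(ideal))                        # 8.893 (already the ideal order)
print(dcg_at_k([0, 2, 3]) / dcg_at_k(ideal))  # 0.606 (Berserk demoted to last)
```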
Pairwise Ranking Loss — RankNet
For training a ranker, we use pairwise loss. Given a query $q$, a relevant document $d^+$, and a less-relevant document $d^-$:
$$\mathcal{L}_{\text{pairwise}} = -\log\sigma(s(q, d^+) - s(q, d^-))$$
where $s(q, d)$ is the cross-encoder's relevance score and $\sigma$ is the sigmoid function.
Gradient analysis:
$$\frac{\partial \mathcal{L}}{\partial s(q, d^+)} = -(1 - \sigma(\Delta s)) = -\sigma(-\Delta s)$$
where $\Delta s = s(q, d^+) - s(q, d^-)$.
| $\Delta s$ | $\sigma(-\Delta s)$ | Gradient Magnitude | Interpretation |
|---|---|---|---|
| -2.0 (wrong order) | 0.88 | 0.88 (high) | Model ranks incorrectly — strong correction |
| 0.0 (tie) | 0.50 | 0.50 (moderate) | Model is uncertain — moderate push |
| +2.0 (correct order) | 0.12 | 0.12 (low) | Model ranks correctly — minimal update |
| +5.0 (very confident) | 0.007 | 0.007 (tiny) | Already well-separated — almost no gradient |
Key insight: Pairwise loss naturally focuses learning on the boundary cases where the model is unsure, similar to focal loss for classification. Confidently correct pairs get near-zero gradient.
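The table's gradient magnitudes fall straight out of autograd; a minimal check:

```python
import torch
import torch.nn.functional as F

def pairwise_grad(delta_s: float) -> float:
    # delta_s = s(q, d+) - s(q, d-); loss = -log sigma(delta_s)
    s = torch.tensor(delta_s, requires_grad=True)
    (-F.logsigmoid(s)).backward()
    return abs(s.grad.item())  # equals sigma(-delta_s)

for ds in (-2.0, 0.0, 2.0, 5.0):
    print(f"Δs={ds:+.1f}  |grad|={pairwise_grad(ds):.3f}")
# Δs=-2.0 |grad|=0.881, Δs=+0.0 |grad|=0.500, Δs=+2.0 |grad|=0.119, Δs=+5.0 |grad|=0.007
```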
LambdaRank — NDCG-Aware Training
Standard pairwise loss treats all swaps equally — swapping items at positions (1,2) gets the same gradient as swapping (8,9). But from a user experience perspective, the first swap is far more important.
LambdaRank multiplies the pairwise gradient by the change in NDCG that the swap would cause:
$$\lambda_{ij} = -\sigma(-\Delta s_{ij}) \cdot |\Delta\text{NDCG}_{ij}|$$
where $|\Delta\text{NDCG}_{ij}|$ is the absolute change in NDCG if items $i$ and $j$ were swapped.
Intuition: Swapping positions (1,2) changes NDCG by ~0.37 (for binary relevance). Swapping positions (8,9) changes NDCG by ~0.03. So the gradient for the (1,2) swap is amplified by 12x compared to (8,9). This focuses the model's learning on getting the top positions right.
NDCG swap deltas for our top-3 reranking:
| Positions Swapped | $|\Delta\text{NDCG}|$ | Gradient Multiplier |
|---|---|---|
| (1, 2) | 0.369 | 12.3× |
| (1, 3) | 0.500 | 16.7× |
| (2, 3) | 0.131 | 4.4× |
| (1, 5) | 0.570 | 19.0× |
| (5, 10) | 0.030 | 1.0× (reference) |
This is why LambdaRank is critical for MangaAssist: we only show 3 results to the user, so getting the top-3 ordering right matters enormously, while positions beyond 5 are irrelevant.
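For the single-relevant-item case the swap delta reduces to a difference of position discounts (IDCG = 1), which reproduces the top-3 rows of the table above:

```python
import math

def delta_ndcg_single_relevant(i: int, j: int) -> float:
    # |ΔNDCG| from moving the lone relevant item between ranks i and j (1-indexed);
    # with a single relevant item IDCG = 1, so only the discounts matter
    return abs(1 / math.log2(i + 1) - 1 / math.log2(j + 1))

for i, j in [(1, 2), (1, 3), (2, 3)]:
    print(f"swap ({i},{j}): |ΔNDCG| = {delta_ndcg_single_relevant(i, j):.3f}")
# swap (1,2): 0.369   swap (1,3): 0.500   swap (2,3): 0.131
```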
Cross-Attention — How the Reranker "Reads" Query-Document Together
The cross-encoder concatenates query and document: [CLS] query [SEP] document [SEP]. The transformer's self-attention then computes attention across all tokens, allowing every query token to attend to every document token.
For a token at position $i$ and all tokens at positions $j$:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$
where $Q = W_Q \cdot h_i$, $K = W_K \cdot h_j$, $V = W_V \cdot h_j$ for each of the 12 attention heads ($d_k = 384/12 = 32$).
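The input construction is exactly what a standard BERT-style tokenizer produces for a text pair; a quick check (actual WordPiece splits may differ from the illustrations in this section):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cross-encoder/ms-marco-MiniLM-L-6-v2")
enc = tokenizer(
    "dark isekai manga",                          # query -> segment A
    "Overlord - A dark fantasy isekai where...",  # document -> segment B
    truncation=True,
    max_length=128,
)
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'dark', ..., '[SEP]', 'over', '##lord', ..., '[SEP]']
print(enc["token_type_ids"])  # 0 for query tokens, 1 for document tokens
```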
What cross-attention learns for manga queries:
When processing [CLS] dark isekai manga [SEP] Overlord - A dark fantasy isekai where...:
- "dark" (query) attends strongly to "dark" and "fantasy" in the document (direct match → high relevance)
- "isekai" (query) attends to "isekai" in the document (genre match)
- "manga" (query) attends broadly — nearly all catalog items are manga (uninformative token)
- [CLS] aggregates all cross-attention signals into the relevance score
After fine-tuning, the attention pattern shifts:
- "dark" suppresses attention to "comedy" (learns that dark ≠ comedy)
- "isekai" attends strongly to "transported to another world" (learns the paraphrase)
- Non-matching genre tokens get near-zero attention (efficient filtering)
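These shifts can be inspected directly by running the model with attention outputs enabled. A sketch of the probe (which head carries which pattern varies by checkpoint, so "Layer 5 Head 3" in the diagram below is illustrative):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "cross-encoder/ms-marco-MiniLM-L-6-v2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, output_attentions=True)

enc = tokenizer("dark isekai manga",
                "Overlord - A dark fantasy isekai where...", return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
attn = out.attentions[-1][0]   # last layer: (heads, seq_len, seq_len)
q_idx = tokens.index("dark")   # first occurrence is the query token
for head in range(attn.shape[0]):
    weights, indices = attn[head, q_idx].topk(3)
    pairs = ", ".join(f"{tokens[int(i)]}={float(w):.2f}" for w, i in zip(weights, indices))
    print(f"head {head:2d}: {pairs}")
```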
Gradient Flow Through the 6-Layer MiniLM
ms-marco-MiniLM-L-6-v2 is a 6-layer BERT-style encoder with the same depth as DistilBERT but half the hidden width (384 vs. 768):
| Layer | Parameters | Role in Reranking | Fine-Tuning Gradient |
|---|---|---|---|
| Relevance Head | Linear(384→1) = 385 params | Produces final score | 1.0× (reference) |
| Layer 5 | ~1.8M | High-level relevance matching | 0.65× |
| Layer 4 | ~1.8M | Semantic alignment (genre, tone) | 0.38× |
| Layer 3 | ~1.8M | Cross-attention refinement | 0.18× |
| Layer 2 | ~1.8M | Syntactic interaction patterns | 0.07× |
| Layer 1 | ~1.8M | Basic token relationships | 0.02× |
| Layer 0 | ~1.8M | Token-level features | 0.008× |
| Embeddings | 11.7M | Word representations | 0.003× |
Total: ~22.7M parameters. Hidden dimension: 384. Attention heads: 12 (32 dims each).
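These counts can be verified by loading the checkpoint and summing parameters per component:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "cross-encoder/ms-marco-MiniLM-L-6-v2", num_labels=1
)

def n_params(module) -> int:
    return sum(p.numel() for p in module.parameters())

print(f"embeddings: {n_params(model.bert.embeddings) / 1e6:.1f}M")
for i, layer in enumerate(model.bert.encoder.layer):
    print(f"layer {i}:    {n_params(layer) / 1e6:.2f}M")
print(f"head:       {n_params(model.classifier)} params")
print(f"total:      {n_params(model) / 1e6:.1f}M")
```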
Model Internals — Layer-by-Layer Diagrams
Cross-Encoder Architecture for Reranking
graph TB
subgraph "Input Construction"
A["Query: 'dark isekai manga'"] --> C["[CLS] dark isekai manga [SEP]<br>Overlord - A dark fantasy isekai<br>where the protagonist... [SEP]"]
B["Doc: 'Overlord - A dark fantasy...'"] --> C
end
subgraph "Tokenization"
C --> D["WordPiece Tokens (max 128)<br>[CLS] dark is ##ek ##ai manga [SEP]<br>over ##lord - a dark fantasy<br>is ##ek ##ai where the... [SEP]"]
end
subgraph "Embedding (384-dim, ~12M params)"
D --> E["Token Emb + Position Emb + Segment Emb<br>Segment A: query tokens<br>Segment B: document tokens"]
end
subgraph "6 Transformer Layers"
E --> F0["Layer 0: Basic token features<br>LR: 3.2e-6 | Grad: 0.008×"]
F0 --> F1["Layer 1-2: Syntactic interactions<br>Query-doc token alignment begins"]
F1 --> F3["Layer 3-4: Semantic matching<br>'dark' ↔ 'dark fantasy' link formed<br>LR: 6.4e-6 to 1e-5 | Grad: 0.18-0.38×"]
F3 --> F5["Layer 5: High-level relevance<br>'isekai' + 'dark fantasy' → relevant<br>LR: 1.6e-5 | Grad: 0.65×"]
end
subgraph "Relevance Scoring"
F5 --> G["[CLS] Pooling<br>384-dim relevance vector"]
G --> H["Linear(384 → 1)<br>LR: 2e-5 | Grad: 1.0×"]
H --> I["Relevance Score: 0.94"]
end
style F0 fill:#e8f5e9
style F1 fill:#fff9c4
style F3 fill:#fff3e0
style F5 fill:#ffccbc
style H fill:#ef9a9a
Cross-Attention Pattern: Before vs After Fine-Tuning
graph TD
subgraph "Before Fine-Tuning (Layer 5 Head 3)"
direction LR
Q1["dark<br>(query)"] -->|"0.15"| D1a["dark<br>(doc)"]
Q1 -->|"0.12"| D1b["fantasy<br>(doc)"]
Q1 -->|"0.10"| D1c["comedy<br>(doc)"]
Q1 -->|"0.08"| D1d["adventure<br>(doc)"]
Q1 -->|"0.55"| D1e["other tokens<br>(distributed)"]
end
subgraph "After Fine-Tuning (Layer 5 Head 3)"
direction LR
Q2["dark<br>(query)"] -->|"0.38"| D2a["dark<br>(doc)"]
Q2 -->|"0.28"| D2b["fantasy<br>(doc)"]
Q2 -->|"0.02"| D2c["comedy<br>(doc)"]
Q2 -->|"0.04"| D2d["adventure<br>(doc)"]
Q2 -->|"0.28"| D2e["other tokens"]
end
Key change: After fine-tuning, "dark" (query) concentrates attention on "dark" and "fantasy" and suppresses attention to "comedy". The model has learned that "dark" is an intent signal, not just a word to match literally.
LambdaRank Gradient Amplification
graph LR
subgraph "Standard Pairwise Loss"
A["Pair (pos=1, neg=2)<br>Gradient: 0.50"] --> B["All pairs get<br>equal treatment"]
C["Pair (pos=5, neg=8)<br>Gradient: 0.50"] --> B
end
subgraph "LambdaRank"
D["Pair (pos=1, neg=2)<br>|ΔNDCG| = 0.37<br>Gradient: 0.50 × 0.37 = 0.185"] --> E["Top positions get<br>12× more gradient<br>than bottom"]
F["Pair (pos=5, neg=8)<br>|ΔNDCG| = 0.03<br>Gradient: 0.50 × 0.03 = 0.015"] --> E
end
subgraph "Effect on MangaAssist"
E --> G["Position 1 error:<br>Wrong manga at top<br>= user leaves"]
E --> H["Position 8 error:<br>Wrong order beyond top-3<br>= user never sees it"]
end
Full Reranking Inference Pipeline
sequenceDiagram
participant Query as User Query
participant BiEnc as Bi-Encoder<br>(Titan V2 + Adapter)
participant OS as OpenSearch<br>(ANN Index)
participant CrossEnc as Cross-Encoder<br>(MiniLM Reranker)
participant User as User Sees<br>Top 3 Results
Query->>BiEnc: "dark isekai manga"
BiEnc->>OS: Query embedding (1024-dim)
OS->>CrossEnc: Top 10 candidates with scores
Note over CrossEnc: Score each of 10 candidates:<br>[CLS] query [SEP] doc_i [SEP]<br>10 forward passes × 5ms each = 50ms
loop For each candidate (10x)
CrossEnc->>CrossEnc: Concatenate query + doc<br>→ 6 attention layers<br>→ [CLS] → Linear → score
end
CrossEnc->>CrossEnc: Sort by cross-encoder score<br>Reorder: [7,2,5,...] → [2,5,7,...]
CrossEnc->>User: Top 3 reranked results<br>1. Overlord (0.94)<br>2. Berserk (0.91)<br>3. Goblin Slayer (0.87)
Training Data Construction Pipeline
graph TD
subgraph "Source 1: Click-Through Logs (2000 pairs)"
A["User searched X<br>Clicked product Y"] --> D["Positive pair:<br>(X, Y, relevance=3)"]
B["User searched X<br>Saw product Z, didn't click"] --> E["Hard negative:<br>(X, Z, relevance=0)"]
end
subgraph "Source 2: Editorial (600 pairs)"
F["Manga experts rate<br>query-product relevance<br>0=irrelevant, 1=marginal,<br>2=relevant, 3=perfect"] --> G["Graded relevance pairs"]
end
subgraph "Source 3: Synthetic (400 pairs)"
H["Claude generates<br>'user searching for X<br>should/shouldn't find Y'"] --> I["Synthetic pairs<br>with quality filtering"]
end
D --> J["Combined Dataset<br>3000 query-doc pairs<br>with graded relevance"]
E --> J
G --> J
I --> J
J --> K["80/10/10 Split<br>Stratified by<br>relevance level"]
Implementation Deep-Dive
Dataset Preparation
import json
from dataclasses import dataclass
from typing import List, Tuple
from sklearn.model_selection import train_test_split
@dataclass
class RankingExample:
query: str
document: str
relevance: int # 0=irrelevant, 1=marginal, 2=relevant, 3=perfect
def load_ranking_dataset(
clickthrough_path: str,
editorial_path: str,
synthetic_path: str,
) -> List[RankingExample]:
"""
Combine three sources into a unified ranking dataset.
"""
examples = []
# Click-through: binary relevance (clicked=3, not-clicked=0)
with open(clickthrough_path) as f:
for item in json.load(f):
examples.append(RankingExample(
query=item["query"],
document=item["product_description"],
relevance=3 if item["clicked"] else 0,
))
# Editorial: graded relevance (0-3)
with open(editorial_path) as f:
for item in json.load(f):
examples.append(RankingExample(
query=item["query"],
document=item["product_description"],
relevance=item["relevance"],
))
# Synthetic: binary (relevant=2, not-relevant=0)
with open(synthetic_path) as f:
for item in json.load(f):
examples.append(RankingExample(
query=item["query"],
document=item["product_description"],
relevance=2 if item["relevant"] else 0,
))
return examples
def create_pairwise_samples(
examples: List[RankingExample],
) -> List[Tuple[str, str, str]]:
"""
Create (query, doc_better, doc_worse) triplets for pairwise training.
Group by query, then form all pairs where relevance differs.
"""
from collections import defaultdict
by_query = defaultdict(list)
for ex in examples:
by_query[ex.query].append(ex)
triplets = []
for query, docs in by_query.items():
for i in range(len(docs)):
for j in range(len(docs)):
if docs[i].relevance > docs[j].relevance:
triplets.append((
query,
docs[i].document, # better
docs[j].document, # worse
))
return triplets
Pairwise Ranking Loss with LambdaRank Weighting
import torch
import torch.nn as nn
import numpy as np
from typing import List
class LambdaRankLoss(nn.Module):
"""
Pairwise ranking loss with NDCG-aware gradient weighting.
L = -log(σ(s_better - s_worse)) × |ΔNDCG|
The |ΔNDCG| factor amplifies gradients for pairs whose swap
would most affect the user-visible ranking (top positions).
"""
def __init__(self):
super().__init__()
def forward(
self,
scores_better: torch.Tensor, # (batch,) — scores for better docs
scores_worse: torch.Tensor, # (batch,) — scores for worse docs
ndcg_deltas: torch.Tensor, # (batch,) — |ΔNDCG| for each pair
) -> torch.Tensor:
        # Pairwise logistic loss, computed stably via logsigmoid
        diff = scores_better - scores_worse
        pairwise_loss = -torch.nn.functional.logsigmoid(diff)
# Weight by NDCG delta
weighted_loss = pairwise_loss * ndcg_deltas
return weighted_loss.mean()
def compute_ndcg_delta(
relevances: List[int],
pos_i: int,
pos_j: int,
k: int = 3,
) -> float:
"""
Compute |ΔNDCG@K| if items at positions i and j were swapped.
"""
# Current DCG@K
dcg = sum(
(2 ** relevances[p] - 1) / np.log2(p + 2)
for p in range(min(k, len(relevances)))
)
# Swapped DCG@K
swapped = list(relevances)
swapped[pos_i], swapped[pos_j] = swapped[pos_j], swapped[pos_i]
dcg_swapped = sum(
(2 ** swapped[p] - 1) / np.log2(p + 2)
for p in range(min(k, len(swapped)))
)
# Ideal DCG@K
ideal = sorted(relevances, reverse=True)
idcg = sum(
(2 ** ideal[p] - 1) / np.log2(p + 2)
for p in range(min(k, len(ideal)))
)
if idcg == 0:
return 0.0
return abs((dcg - dcg_swapped) / idcg)
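To make the wiring concrete, here is a toy forward pass combining the two pieces above (the scores and relevance labels are illustrative; the training loop in the next subsection opts for MarginRankingLoss instead):

```python
relevances = [3, 2, 1, 0]  # one query's candidate list, already in ranked order

loss_fn = LambdaRankLoss()
scores_better = torch.tensor([2.1, 1.7])   # model scores for the better docs
scores_worse = torch.tensor([1.9, 2.0])    # model scores for the worse docs
ndcg_deltas = torch.tensor([
    compute_ndcg_delta(relevances, 0, 2),  # swap ranks 1 and 3 (0-indexed args)
    compute_ndcg_delta(relevances, 1, 3),  # swap ranks 2 and 4
])
loss = loss_fn(scores_better, scores_worse, ndcg_deltas)
print(loss.item())
```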
Cross-Encoder Fine-Tuning
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup
from torch.utils.data import DataLoader, Dataset
class PairwiseRankingDataset(Dataset):
"""Dataset of (query, better_doc, worse_doc, ndcg_delta) tuples."""
def __init__(self, triplets, tokenizer, max_length=128):
self.triplets = triplets
self.tokenizer = tokenizer
self.max_length = max_length
def __len__(self):
return len(self.triplets)
def __getitem__(self, idx):
query, better_doc, worse_doc = self.triplets[idx]
# Tokenize query-better_doc pair
better_encoding = self.tokenizer(
query, better_doc,
max_length=self.max_length,
padding="max_length",
truncation=True,
return_tensors="pt",
)
# Tokenize query-worse_doc pair
worse_encoding = self.tokenizer(
query, worse_doc,
max_length=self.max_length,
padding="max_length",
truncation=True,
return_tensors="pt",
)
return {
"better_input_ids": better_encoding["input_ids"].squeeze(0),
"better_attention_mask": better_encoding["attention_mask"].squeeze(0),
"worse_input_ids": worse_encoding["input_ids"].squeeze(0),
"worse_attention_mask": worse_encoding["attention_mask"].squeeze(0),
}
def train_reranker(
train_triplets: list,
val_triplets: list,
num_epochs: int = 5,
batch_size: int = 16,
base_lr: float = 2e-5,
):
"""
Fine-tune ms-marco-MiniLM cross-encoder for manga reranking.
Uses pairwise ranking loss: for each (query, better_doc, worse_doc),
the model should score better_doc higher than worse_doc.
"""
model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
model_name, num_labels=1
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# Optimizer with discriminative learning rates
optimizer_params = []
# Relevance head — highest LR
optimizer_params.append({
"params": model.classifier.parameters(),
"lr": base_lr,
})
# Encoder layers — discriminative LR
num_layers = len(model.bert.encoder.layer)
for layer_idx in range(num_layers - 1, -1, -1):
        # Decay 0.8 per layer below the head, so the head keeps the highest LR
        layer_lr = base_lr * (0.8 ** (num_layers - layer_idx))
optimizer_params.append({
"params": model.bert.encoder.layer[layer_idx].parameters(),
"lr": layer_lr,
})
# Embeddings — lowest LR
optimizer_params.append({
"params": model.bert.embeddings.parameters(),
"lr": base_lr * (0.8 ** num_layers),
})
optimizer = AdamW(optimizer_params, weight_decay=0.01)
# Training
train_dataset = PairwiseRankingDataset(train_triplets, tokenizer)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
total_steps = len(train_loader) * num_epochs
scheduler = get_linear_schedule_with_warmup(
optimizer,
num_warmup_steps=int(0.1 * total_steps),
num_training_steps=total_steps,
)
    # Standard pairwise margin objective; swapping in LambdaRankLoss with
    # per-pair |ΔNDCG| weights gives the NDCG-aware variant derived above
    loss_fn = nn.MarginRankingLoss(margin=1.0)
best_ndcg = 0.0
for epoch in range(num_epochs):
model.train()
epoch_loss = 0.0
for batch in train_loader:
# Score better documents
better_scores = model(
input_ids=batch["better_input_ids"].to(device),
attention_mask=batch["better_attention_mask"].to(device),
).logits.squeeze(-1)
# Score worse documents
worse_scores = model(
input_ids=batch["worse_input_ids"].to(device),
attention_mask=batch["worse_attention_mask"].to(device),
).logits.squeeze(-1)
# Target: better > worse (all ones)
target = torch.ones_like(better_scores)
loss = loss_fn(better_scores, worse_scores, target)
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
scheduler.step()
epoch_loss += loss.item()
# Validation
ndcg = evaluate_ndcg(model, tokenizer, val_triplets, device)
avg_loss = epoch_loss / len(train_loader)
print(f"Epoch {epoch+1}: Loss={avg_loss:.4f}, NDCG@3={ndcg:.4f}")
if ndcg > best_ndcg:
best_ndcg = ndcg
model.save_pretrained("./best_reranker")
tokenizer.save_pretrained("./best_reranker")
return model
def evaluate_ndcg(model, tokenizer, val_triplets, device, k=3):
"""Compute NDCG@K on validation set."""
model.eval()
from collections import defaultdict
# Group by query
query_scores = defaultdict(list)
for query, better_doc, worse_doc in val_triplets:
for doc, rel in [(better_doc, 1), (worse_doc, 0)]:
encoding = tokenizer(
query, doc,
max_length=128,
padding="max_length",
truncation=True,
return_tensors="pt",
)
with torch.no_grad():
score = model(
input_ids=encoding["input_ids"].to(device),
attention_mask=encoding["attention_mask"].to(device),
).logits.squeeze().item()
query_scores[query].append((score, rel))
# Compute NDCG per query
ndcg_scores = []
for query, scored_docs in query_scores.items():
scored_docs.sort(key=lambda x: x[0], reverse=True)
rels = [rel for _, rel in scored_docs]
dcg = sum(
(2 ** rels[i] - 1) / np.log2(i + 2)
for i in range(min(k, len(rels)))
)
ideal_rels = sorted(rels, reverse=True)
idcg = sum(
(2 ** ideal_rels[i] - 1) / np.log2(i + 2)
for i in range(min(k, len(ideal_rels)))
)
if idcg > 0:
ndcg_scores.append(dcg / idcg)
return np.mean(ndcg_scores) if ndcg_scores else 0.0
SageMaker Deployment
import sagemaker
from sagemaker.huggingface import HuggingFaceModel
def deploy_reranker():
"""
Deploy fine-tuned reranker to SageMaker real-time endpoint.
ml.c5.xlarge: 4 vCPU, 8GB RAM — CPU inference is sufficient
for the small model (~23M params) at 10 candidates per query.
"""
huggingface_model = HuggingFaceModel(
model_data="s3://manga-ml-models/reranker/model.tar.gz",
role=sagemaker.get_execution_role(),
transformers_version="4.36",
pytorch_version="2.1",
py_version="py310",
env={
"SAGEMAKER_MODEL_SERVER_WORKERS": "2",
},
)
predictor = huggingface_model.deploy(
initial_instance_count=2,
instance_type="ml.c5.xlarge",
endpoint_name="manga-reranker-v1",
)
return predictor
def invoke_reranker(predictor, query: str, candidates: list) -> list:
"""
Score all candidates and return sorted by relevance.
Input: query + list of candidate product descriptions
Output: candidates sorted by cross-encoder relevance score
"""
payload = {
"inputs": [
{"text": query, "text_pair": doc}
for doc in candidates
],
}
    raw = predictor.predict(payload)
    # The HF inference toolkit returns {"label", "score"} dicts for
    # sequence-classification models; normalize to plain floats either way
    scores = [r["score"] if isinstance(r, dict) else float(r) for r in raw]
    # Sort candidates by score (descending)
    scored = list(zip(candidates, scores))
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored
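Hypothetical usage, with the candidate list standing in for Stage-1 bi-encoder output:

```python
predictor = deploy_reranker()
candidates = [
    "Overlord - A dark fantasy isekai where the protagonist...",
    "Sword Art Online - An isekai adventure with...",
    "Goblin Slayer - A dark fantasy about...",
]
for doc, score in invoke_reranker(predictor, "dark isekai manga", candidates)[:3]:
    print(f"{score:.3f}  {doc[:50]}")
```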
Group Discussion: Key Decision Points
Decision Point 1: Cross-Encoder vs ColBERT vs Sparse-Dense Hybrid
Priya (ML Engineer): I benchmarked three reranking architectures:
| Approach | NDCG@3 | Latency (10 docs) | Params | Monthly Cost |
|---|---|---|---|---|
| ms-marco-MiniLM (cross-encoder) | 0.84 | 50ms | 23M | $380 |
| ColBERT v2 (late interaction) | 0.81 | 25ms | 110M | $620 |
| SPLADE + cross-encoder (hybrid) | 0.86 | 70ms | 23M + index | $520 |
Cross-encoder wins on quality-per-dollar. ColBERT is faster but lower quality. SPLADE hybrid is best quality but latency is too high.
Marcus (Architect): Our total retrieval latency target is 100ms. Stage 1 (bi-encoder + OpenSearch) takes 24ms + 10ms = 34ms. The reranker gets the remaining 66ms budget. Cross-encoder at 50ms fits; SPLADE at 70ms is tight with no headroom for spikes.
Aiko (Data Scientist): ColBERT's late interaction approach is interesting for scale — it pre-computes per-token embeddings and does MaxSim at query time. But for reranking 10 candidates (not 1000), the overhead of maintaining per-token indexes is not justified. Cross-encoder's token-level cross-attention gives strictly better quality for small candidate sets.
Jordan (MLOps): Cross-encoder is also the simplest to operate. One model, one endpoint, no additional indexes. ColBERT requires a special per-token index and SPLADE needs a sparse index. More moving parts = more failure modes.
Sam (PM): At NDCG@3 = 0.84, the cross-encoder gives us 12 points over the bi-encoder alone (0.72). That translates to roughly 8% more users finding what they want in the top 3. For our 50K daily active users, that is 4,000 users per day with a better experience.
Resolution: Cross-encoder (ms-marco-MiniLM) selected. Best NDCG@3 within latency budget, lowest cost, simplest deployment. Consider ColBERT for V2 if candidate set grows beyond 50 (where cross-encoder becomes too slow) or if latency budget tightens.
Decision Point 2: Pairwise Loss vs Listwise Loss
Priya (ML Engineer): I compared pairwise (MarginRankingLoss) and listwise (ListMLE) approaches:
| Loss Function | NDCG@3 | NDCG@10 | Training Time |
|---|---|---|---|
| Pointwise (MSE) | 0.78 | 0.83 | 20 min |
| Pairwise (MarginRanking) | 0.83 | 0.88 | 35 min |
| Pairwise + LambdaRank | 0.84 | 0.89 | 38 min |
| Listwise (ListMLE) | 0.84 | 0.90 | 55 min |
Pairwise + LambdaRank matches listwise NDCG@3 at 30% less training time.
Aiko (Data Scientist): Our dataset has mostly binary relevance (clicked vs not-clicked from logs), with only the editorial subset having graded relevance (0-3). Listwise losses work best with fully graded lists, which we don't have for most queries. Pairwise is more robust to incomplete relevance labels.
Jordan (MLOps): Listwise training requires all candidates for a query to be in the same batch, which constrains batch construction and makes distributed training harder. Pairwise triplets can be freely shuffled across batches.
Resolution: Pairwise ranking loss with LambdaRank weighting. Matches listwise quality for our data, trains 30% faster, more robust to incomplete graded labels.
Decision Point 3: Reranking Depth — Top-K Candidates
Marcus (Architect): How many candidates should we rerank?
| Candidates Reranked | NDCG@3 | Latency | Monthly Cost |
|---|---|---|---|
| 5 | 0.80 | 25ms | $190 |
| 10 | 0.84 | 50ms | $380 |
| 20 | 0.86 | 100ms | $760 |
| 50 | 0.87 | 250ms | $1,900 |
Aiko (Data Scientist): The jump from 5 to 10 candidates gives +4 NDCG points (0.80 → 0.84). Going from 10 to 20 gives only +2 (0.84 → 0.86). The diminishing returns are clear.
Priya (ML Engineer): With 10 candidates, the reranker has seen enough diversity that the correct top-3 is almost always somewhere in the candidate pool. Going to 20 mostly helps edge cases where the bi-encoder ranks the truly relevant item at positions 11-20, which happens ~5% of the time.
Sam (PM): The cost doubles from $380 to $760 for 2 NDCG points. That is $190/point — well above our $50 threshold. 10 candidates at $380/month is the sweet spot.
Resolution: Rerank top-10 candidates. Best cost-quality tradeoff. NDCG@3 of 0.84 within 50ms latency budget. Monitor recall@10 of the bi-encoder — if it drops below 0.85, consider increasing to 15 candidates.
Decision Point 4: Co-Training Reranker with Adapter
Priya (ML Engineer): Currently we train the embedding adapter and reranker independently. Should we co-train them?
Aiko (Data Scientist): Co-training means using the reranker's feedback to improve the adapter's training signal. Intuition: if the reranker consistently boosts certain items that the adapter ranked low, the adapter should learn to rank them higher. This is "knowledge distillation from reranker to retriever."
Marcus (Architect): The challenge is complexity. Co-training creates a dependency loop: adapter quality affects reranker training data, reranker quality affects adapter optimization signal. Getting this right is a research project, not an engineering task.
Jordan (MLOps): Our retraining cadence is monthly. Co-training would couple the two pipelines, meaning a failure in one blocks the other. Independent training gives us isolation — we can update the adapter without retraining the reranker, and vice versa.
Sam (PM): The independent approach already gives us NDCG@3 of 0.84, which exceeds our 0.80 target by 4 points. Co-training is premature optimization.
Resolution: Train independently for now. Revisit co-training in V2 when both pipelines are stable and we have enough data to validate the feedback loop.
Research Paper References
1. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset (Bajaj et al., 2016)
Key contribution: Created the largest public passage ranking dataset (8.8M passages, 1M queries). Enabled training of neural ranking models at scale. The cross-encoder approach (concatenate query + passage → relevance score) became the standard architecture for reranking.
Relevance to MangaAssist: ms-marco-MiniLM-L-6-v2 is pre-trained on this dataset. The pre-training gives us a strong foundation for passage-level relevance scoring. However, MS-MARCO queries are web search (factoid questions), not e-commerce product search. Our fine-tuning adapts the model from "is this passage the answer?" to "is this manga what the user wants?".
2. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction (Khattab & Zaharia, 2020)
Key contribution: Introduced late interaction — pre-compute per-token embeddings for documents, then compute MaxSim at query time. Achieves near cross-encoder quality at bi-encoder speed for large candidate sets. The token-level matching allows fine-grained relevance without full cross-attention.
Relevance to MangaAssist: ColBERT is our V2 candidate if we need to rerank more than 50 candidates (e.g., if we expand to cross-category search). For our current 10-candidate reranking, cross-encoder is better. But ColBERT's architecture insight informs our understanding: the quality gap between bi-encoder and cross-encoder comes specifically from token-level interactions that bi-encoders cannot capture.
3. From RankNet to LambdaRank to LambdaMART (Burges, 2010)
Key contribution: Derived LambdaRank as a practical refinement of RankNet: scale each pairwise gradient by the |ΔNDCG| of swapping the pair. Showed that this heuristic empirically optimizes NDCG directly, even though no closed-form loss function is being minimized. LambdaMART (LambdaRank gradients + gradient-boosted trees) became the industry standard for learning-to-rank.
Relevance to MangaAssist: We use the LambdaRank gradient weighting to focus our cross-encoder's training on getting the top-3 positions right. The |ΔNDCG| multiplier ensures that swapping positions (1,2) gets 12× the gradient of swapping (8,9), directly optimizing for user-visible result quality.
4. Intra-Document Cascading for Efficient Passage Retrieval (Hofstätter et al., 2021)
Key contribution: Proposed cascaded retrieval where progressively more expensive models score progressively fewer candidates. Showed that 3-stage cascading (BM25 → bi-encoder → cross-encoder) achieves better latency-quality tradeoffs than any single model.
Relevance to MangaAssist: Our 2-stage pipeline (bi-encoder → cross-encoder) is a simplified cascade. The paper validates our architectural choice and suggests that adding a lightweight first-stage filter (e.g., BM25 prefiltering before the bi-encoder) could further improve latency if needed in V2.
Production Deployment and Monitoring
Deployment Architecture
graph LR
subgraph "Training Pipeline (Monthly)"
A["Click-through<br>Logs"] --> B["Pairwise Triplet<br>Construction"]
B --> C["Cross-Encoder<br>Fine-Tuning<br>(SageMaker)"]
C --> D["Model Artifact<br>S3"]
end
subgraph "Validation Gate"
D --> E["NDCG@3 > 0.80<br>on golden set"]
E -->|pass| F["SageMaker Endpoint<br>Blue-Green Deploy"]
E -->|fail| G["Reject + Alert"]
end
subgraph "Inference"
H["Bi-Encoder<br>Top 10"] --> I["Cross-Encoder<br>Reranker<br>(2x ml.c5.xlarge)"]
I --> J["Top 3 Sorted<br>Results"]
end
Key Production Metrics
| Metric | Target | Alert Threshold |
|---|---|---|
| NDCG@3 (sampled) | ≥ 0.80 | < 0.75 |
| P50 latency (10 candidates) | 35ms | > 50ms |
| P95 latency (10 candidates) | 50ms | > 70ms |
| Pairwise accuracy (sampled) | ≥ 85% | < 80% |
| Score distribution shift | KL < 0.02 | KL > 0.05 |
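As one way to implement the last row, a hedged sketch of the drift check, assuming sigmoid-normalized scores in [0, 1] (bin count and sample sizes are assumptions, not values from our monitoring config):

```python
import numpy as np

def score_drift_kl(baseline_scores, live_scores, bins: int = 20) -> float:
    """KL(live || baseline) over histogrammed reranker scores."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    p, _ = np.histogram(live_scores, bins=edges)
    q, _ = np.histogram(baseline_scores, bins=edges)
    p = (p + 1) / (p.sum() + bins)  # Laplace smoothing avoids log(0)
    q = (q + 1) / (q.sum() + bins)
    return float(np.sum(p * np.log(p / q)))

kl = score_drift_kl(np.random.beta(5, 2, 10_000), np.random.beta(5, 2, 1_000))
if kl > 0.05:  # alert threshold from the table above
    print(f"ALERT: reranker score drift, KL={kl:.3f}")
```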
Evaluation and Results
Before vs After Fine-Tuning
| Metric | ms-marco (Pre-trained) | Fine-tuned (Ours) | Improvement |
|---|---|---|---|
| NDCG@3 | 0.71 | 0.84 | +0.13 |
| NDCG@10 | 0.77 | 0.89 | +0.12 |
| Pairwise accuracy | 72.3% | 86.8% | +14.5% |
| Manga-jargon NDCG@3 | 0.58 | 0.81 | +0.23 |
| Genre-crossing NDCG@3 | 0.63 | 0.82 | +0.19 |
| Mean reciprocal rank | 0.68 | 0.82 | +0.14 |
End-to-End Retrieval Quality (Bi-Encoder + Reranker)
| Pipeline Configuration | NDCG@3 | P95 Latency | Monthly Cost |
|---|---|---|---|
| Titan V2 raw (no adapter, no reranker) | 0.58 | 32ms | $450 |
| Titan V2 + adapter (no reranker) | 0.72 | 34ms | $485 |
| Titan V2 + adapter + ms-marco (pre-trained) | 0.78 | 84ms | $865 |
| Titan V2 + adapter + ms-marco (fine-tuned, ours) | 0.84 | 84ms | $865 |
The fine-tuned reranker adds 6 NDCG points over the pre-trained version at zero additional infrastructure cost — same endpoint, same latency, just better weights.