03. Cross-Encoder Reranker Fine-Tuning — ms-marco-MiniLM for MangaAssist
Problem Statement and MangaAssist Context
After the embedding adapter retrieves the top-10 candidate products from OpenSearch, a cross-encoder reranker re-scores them to produce the final top-3 results. This two-stage retrieval is necessary because embedding-based (bi-encoder) search is fast but imprecise — it encodes queries and documents independently. The cross-encoder sees the query and document together, computing deep cross-attention between them, which captures nuanced relevance that bi-encoders miss.
MangaAssist uses ms-marco-MiniLM-L-6-v2 as the reranker. Out of the box (trained on MS-MARCO web passages), it achieves NDCG@3 of 0.71 on our manga catalog. After fine-tuning on 3,000 manga query-document pairs with pairwise ranking loss, it reaches NDCG@3 of 0.84 — a 13-point improvement that translates to noticeably better result ordering.
Why Cross-Attention Matters for Manga
Bi-encoders embed "dark isekai manga" and a product description independently. They cannot model fine-grained interactions like:
| Query | Document Excerpt | Bi-Encoder Score | Cross-Encoder Score |
|---|---|---|---|
| "dark isekai manga" | "Overlord — A dark fantasy isekai where..." | 0.82 | 0.94 |
| "dark isekai manga" | "Sword Art Online — An isekai adventure with..." | 0.79 | 0.61 |
| "manga like Berserk" | "Vagabond — A seinen epic by Takehiko Inoue..." | 0.71 | 0.88 |
| "manga like Berserk" | "Dragon Ball — A shōnen classic by Toriyama..." | 0.68 | 0.32 |
The cross-encoder "sees" that "dark" in the query aligns with "dark fantasy" in Overlord's description but not with SAO's "adventure". This word-level alignment is impossible for bi-encoders that produce a single vector per text.
The Two-Stage Retrieval Pipeline
- Stage 1 — Bi-Encoder (Titan V2 + Adapter): Retrieve top-10 from 50K catalog items. Latency: ~34ms (24ms embedding + 10ms ANN search). Cost: $0.0001/query.
- Stage 2 — Cross-Encoder (ms-marco-MiniLM): Re-score 10 candidates with the query. Latency: ~50ms (5ms × 10 pairs). Cost: $0.0003/query (SageMaker real-time endpoint).
Total retrieval latency: ~84ms. Without the reranker, users would see the bi-encoder's top-3, which has NDCG@3 of only 0.72. The fine-tuned reranker lifts this to 0.84, a 12-point gain in top-3 ordering quality.
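The control flow is simple enough to sketch. In the snippet below, embed_query, knn_search, and cross_encoder_scores are hypothetical stand-ins for the Titan V2 adapter call, the OpenSearch k-NN query, and the reranker endpoint:

```python
from typing import List, Tuple

def retrieve_top3(query: str) -> List[Tuple[str, float]]:
    """Two-stage retrieval: fast approximate recall, then precise reranking."""
    # Stage 1: bi-encoder retrieval pulls top-10 of 50K items (~34ms)
    query_vec = embed_query(query)            # hypothetical: Titan V2 + adapter
    candidates = knn_search(query_vec, k=10)  # hypothetical: OpenSearch ANN

    # Stage 2: cross-encoder scores each (query, doc) pair jointly (~50ms)
    scores = cross_encoder_scores(query, candidates)  # hypothetical endpoint call
    reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return reranked[:3]
```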
Mathematical Foundations
NDCG — Normalized Discounted Cumulative Gain
NDCG measures how well a ranked list places the most relevant items at the top. For a ranking of $K$ items:
$$\text{DCG@K} = \sum_{i=1}^{K} \frac{2^{r_i} - 1}{\log_2(i + 1)}$$
where $r_i$ is the relevance score of the item at position $i$.
$$\text{NDCG@K} = \frac{\text{DCG@K}}{\text{IDCG@K}}$$
where IDCG@K is the DCG of the ideal (perfect) ranking.
Intuition: The $\frac{1}{\log_2(i+1)}$ discount means position 1 is worth 1.0, position 2 is worth 0.63, position 3 is worth 0.50, and so on. A highly relevant item at position 1 contributes much more than the same item at position 5. This matches real user behavior — most users only look at the first 2-3 results.
Example for MangaAssist:
Query: "dark fantasy manga" - Berserk (relevance = 3, highly relevant) - Claymore (relevance = 2, relevant) - One Piece (relevance = 0, not relevant)
$$\text{DCG@3} = \frac{2^3 - 1}{\log_2(2)} + \frac{2^2 - 1}{\log_2(3)} + \frac{2^0 - 1}{\log_2(4)} = \frac{7}{1} + \frac{3}{1.585} + \frac{0}{2} = 7 + 1.893 + 0 = 8.893$$
Ideal ranking would put Berserk first, Claymore second: IDCG@3 = same = 8.893, so NDCG@3 = 1.0.
If the ranker swaps Berserk and One Piece: $$\text{DCG@3} = \frac{0}{1} + \frac{3}{1.585} + \frac{7}{2} = 0 + 1.893 + 3.5 = 5.393$$
NDCG@3 = 5.393 / 8.893 = 0.606. Much worse — placing the best result last costs 39.4% of the possible score.
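A few lines of Python reproduce this arithmetic, which makes a useful sanity check for any NDCG implementation:

```python
import math

def dcg_at_k(relevances, k=3):
    # 2^r - 1 gain with the log2(position + 1) discount used above
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(relevances[:k]))

ideal = [3, 2, 0]                             # Berserk, Claymore, One Piece
print(dcg_at_k(ideal))                        # 8.893 (already the ideal order)
print(dcg_at_k([0, 2, 3]) / dcg_at_k(ideal))  # 0.606 (Berserk demoted to last)
```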
Pairwise Ranking Loss — RankNet
For training a ranker, we use pairwise loss. Given a query $q$, a relevant document $d^+$, and a less-relevant document $d^-$:
$$\mathcal{L}_{\text{pairwise}} = -\log\sigma(s(q, d^+) - s(q, d^-))$$
where $s(q, d)$ is the cross-encoder's relevance score and $\sigma$ is the sigmoid function.
Gradient analysis:
$$\frac{\partial \mathcal{L}}{\partial s(q, d^+)} = -(1 - \sigma(\Delta s)) = -\sigma(-\Delta s)$$
where $\Delta s = s(q, d^+) - s(q, d^-)$.
| $\Delta s$ | $\sigma(-\Delta s)$ | Gradient Magnitude | Interpretation |
|---|---|---|---|
| -2.0 (wrong order) | 0.88 | 0.88 (high) | Model ranks incorrectly — strong correction |
| 0.0 (tie) | 0.50 | 0.50 (moderate) | Model is uncertain — moderate push |
| +2.0 (correct order) | 0.12 | 0.12 (low) | Model ranks correctly — minimal update |
| +5.0 (very confident) | 0.007 | 0.007 (tiny) | Already well-separated — almost no gradient |
Key insight: Pairwise loss naturally focuses learning on the boundary cases where the model is unsure, similar to focal loss for classification. Confidently correct pairs get near-zero gradient.
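The table's gradient magnitudes fall straight out of autograd; a minimal check:

```python
import torch
import torch.nn.functional as F

def pairwise_grad(delta_s: float) -> float:
    # delta_s = s(q, d+) - s(q, d-); loss = -log sigma(delta_s)
    s = torch.tensor(delta_s, requires_grad=True)
    (-F.logsigmoid(s)).backward()
    return abs(s.grad.item())  # equals sigma(-delta_s)

for ds in (-2.0, 0.0, 2.0, 5.0):
    print(f"Δs={ds:+.1f}  |grad|={pairwise_grad(ds):.3f}")
# Δs=-2.0 |grad|=0.881, Δs=+0.0 |grad|=0.500, Δs=+2.0 |grad|=0.119, Δs=+5.0 |grad|=0.007
```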
LambdaRank — NDCG-Aware Training
Standard pairwise loss treats all swaps equally — swapping items at positions (1,2) gets the same gradient as swapping (8,9). But from a user experience perspective, the first swap is far more important.
LambdaRank multiplies the pairwise gradient by the change in NDCG that the swap would cause:
$$\lambda_{ij} = -\sigma(-\Delta s_{ij}) \cdot |\Delta\text{NDCG}_{ij}|$$
where $|\Delta\text{NDCG}_{ij}|$ is the absolute change in NDCG if items $i$ and $j$ were swapped.
Intuition: Swapping positions (1,2) changes NDCG by ~0.37 (for binary relevance). Swapping positions (8,9) changes NDCG by ~0.03. So the gradient for the (1,2) swap is amplified by 12x compared to (8,9). This focuses the model's learning on getting the top positions right.
NDCG swap deltas for our top-3 reranking:
| Positions Swapped | $|\Delta\text{NDCG}|$ | Gradient Multiplier |
|---|---|---|
| (1, 2) | 0.369 | 12.3× |
| (1, 3) | 0.500 | 16.7× |
| (2, 3) | 0.131 | 4.4× |
| (1, 5) | 0.570 | 19.0× |
| (5, 10) | 0.030 | 1.0× (reference) |
This is why LambdaRank is critical for MangaAssist: we only show 3 results to the user, so getting the top-3 ordering right matters enormously, while positions beyond 5 are irrelevant.
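For the single-relevant-item case the swap delta reduces to a difference of position discounts (IDCG = 1), which reproduces the top-3 rows of the table above:

```python
import math

def delta_ndcg_single_relevant(i: int, j: int) -> float:
    # |ΔNDCG| from moving the lone relevant item between ranks i and j (1-indexed);
    # with a single relevant item IDCG = 1, so only the discounts matter
    return abs(1 / math.log2(i + 1) - 1 / math.log2(j + 1))

for i, j in [(1, 2), (1, 3), (2, 3)]:
    print(f"swap ({i},{j}): |ΔNDCG| = {delta_ndcg_single_relevant(i, j):.3f}")
# swap (1,2): 0.369   swap (1,3): 0.500   swap (2,3): 0.131
```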
Cross-Attention — How the Reranker "Reads" Query-Document Together
The cross-encoder concatenates query and document: [CLS] query [SEP] document [SEP]. The transformer's self-attention then computes attention across all tokens, allowing every query token to attend to every document token.
For a token at position $i$ and all tokens at positions $j$:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$
where $Q = W_Q \cdot h_i$, $K = W_K \cdot h_j$, $V = W_V \cdot h_j$ for each of the 12 attention heads ($d_k = 384/12 = 32$).
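The input construction is exactly what a standard BERT-style tokenizer produces for a text pair; a quick check (actual WordPiece splits may differ from the illustrations in this section):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cross-encoder/ms-marco-MiniLM-L-6-v2")
enc = tokenizer(
    "dark isekai manga",                          # query -> segment A
    "Overlord - A dark fantasy isekai where...",  # document -> segment B
    truncation=True,
    max_length=128,
)
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'dark', ..., '[SEP]', 'over', '##lord', ..., '[SEP]']
print(enc["token_type_ids"])  # 0 for query tokens, 1 for document tokens
```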
What cross-attention learns for manga queries:
When processing [CLS] dark isekai manga [SEP] Overlord - A dark fantasy isekai where...:
- "dark" (query) attends strongly to "dark" and "fantasy" in the document (direct match → high relevance)
- "isekai" (query) attends to "isekai" in the document (genre match)
- "manga" (query) attends broadly — nearly all catalog items are manga (uninformative token)
- [CLS] aggregates all cross-attention signals into the relevance score
After fine-tuning, the attention pattern shifts:
- "dark" suppresses attention to "comedy" (learns that dark ≠ comedy)
- "isekai" attends strongly to "transported to another world" (learns the paraphrase)
- Non-matching genre tokens get near-zero attention (efficient filtering)
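These shifts can be inspected directly by running the model with attention outputs enabled. A sketch of the probe (which head carries which pattern varies by checkpoint, so "Layer 5 Head 3" in the diagram below is illustrative):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "cross-encoder/ms-marco-MiniLM-L-6-v2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, output_attentions=True)

enc = tokenizer("dark isekai manga",
                "Overlord - A dark fantasy isekai where...", return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
attn = out.attentions[-1][0]   # last layer: (heads, seq_len, seq_len)
q_idx = tokens.index("dark")   # first occurrence is the query token
for head in range(attn.shape[0]):
    weights, indices = attn[head, q_idx].topk(3)
    pairs = ", ".join(f"{tokens[int(i)]}={float(w):.2f}" for w, i in zip(weights, indices))
    print(f"head {head:2d}: {pairs}")
```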
Gradient Flow Through the 6-Layer MiniLM
ms-marco-MiniLM-L-6-v2 is a 6-layer BERT-style encoder with the same depth as DistilBERT but half the hidden width (384 vs. 768):
| Layer | Parameters | Role in Reranking | Fine-Tuning Gradient |
|---|---|---|---|
| Relevance Head | Linear(384→1) = 385 params | Produces final score | 1.0× (reference) |
| Layer 5 | ~1.8M | High-level relevance matching | 0.65× |
| Layer 4 | ~1.8M | Semantic alignment (genre, tone) | 0.38× |
| Layer 3 | ~1.8M | Cross-attention refinement | 0.18× |
| Layer 2 | ~1.8M | Syntactic interaction patterns | 0.07× |
| Layer 1 | ~1.8M | Basic token relationships | 0.02× |
| Layer 0 | ~1.8M | Token-level features | 0.008× |
| Embeddings | 11.7M | Word representations | 0.003× |
Total: ~22.7M parameters. Hidden dimension: 384. Attention heads: 12 (32 dims each).
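These counts can be verified by loading the checkpoint and summing parameters per component:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "cross-encoder/ms-marco-MiniLM-L-6-v2", num_labels=1
)

def n_params(module) -> int:
    return sum(p.numel() for p in module.parameters())

print(f"embeddings: {n_params(model.bert.embeddings) / 1e6:.1f}M")
for i, layer in enumerate(model.bert.encoder.layer):
    print(f"layer {i}:    {n_params(layer) / 1e6:.2f}M")
print(f"head:       {n_params(model.classifier)} params")
print(f"total:      {n_params(model) / 1e6:.1f}M")
```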
Model Internals — Layer-by-Layer Diagrams
Cross-Encoder Architecture for Reranking
graph TB
subgraph "Input Construction"
A["Query: 'dark isekai manga'"] --> C["[CLS] dark isekai manga [SEP]<br>Overlord - A dark fantasy isekai<br>where the protagonist... [SEP]"]
B["Doc: 'Overlord - A dark fantasy...'"] --> C
end
subgraph "Tokenization"
C --> D["WordPiece Tokens (max 128)<br>[CLS] dark is ##ek ##ai manga [SEP]<br>over ##lord - a dark fantasy<br>is ##ek ##ai where the... [SEP]"]
end
subgraph "Embedding (384-dim, ~12M params)"
D --> E["Token Emb + Position Emb + Segment Emb<br>Segment A: query tokens<br>Segment B: document tokens"]
end
subgraph "6 Transformer Layers"
E --> F0["Layer 0: Basic token features<br>LR: 3.2e-6 | Grad: 0.008×"]
F0 --> F1["Layer 1-2: Syntactic interactions<br>Query-doc token alignment begins"]
F1 --> F3["Layer 3-4: Semantic matching<br>'dark' ↔ 'dark fantasy' link formed<br>LR: 6.4e-6 to 1e-5 | Grad: 0.18-0.38×"]
F3 --> F5["Layer 5: High-level relevance<br>'isekai' + 'dark fantasy' → relevant<br>LR: 1.6e-5 | Grad: 0.65×"]
end
subgraph "Relevance Scoring"
F5 --> G["[CLS] Pooling<br>384-dim relevance vector"]
G --> H["Linear(384 → 1)<br>LR: 2e-5 | Grad: 1.0×"]
H --> I["Relevance Score: 0.94"]
end
style F0 fill:#e8f5e9
style F1 fill:#fff9c4
style F3 fill:#fff3e0
style F5 fill:#ffccbc
style H fill:#ef9a9a
Cross-Attention Pattern: Before vs After Fine-Tuning
graph TD
subgraph "Before Fine-Tuning (Layer 5 Head 3)"
direction LR
Q1["dark<br>(query)"] -->|"0.15"| D1a["dark<br>(doc)"]
Q1 -->|"0.12"| D1b["fantasy<br>(doc)"]
Q1 -->|"0.10"| D1c["comedy<br>(doc)"]
Q1 -->|"0.08"| D1d["adventure<br>(doc)"]
Q1 -->|"0.55"| D1e["other tokens<br>(distributed)"]
end
subgraph "After Fine-Tuning (Layer 5 Head 3)"
direction LR
Q2["dark<br>(query)"] -->|"0.38"| D2a["dark<br>(doc)"]
Q2 -->|"0.28"| D2b["fantasy<br>(doc)"]
Q2 -->|"0.02"| D2c["comedy<br>(doc)"]
Q2 -->|"0.04"| D2d["adventure<br>(doc)"]
Q2 -->|"0.28"| D2e["other tokens"]
end
Key change: After fine-tuning, "dark" (query) concentrates attention on "dark" and "fantasy" and suppresses attention to "comedy". The model has learned that "dark" is an intent signal, not just a word to match literally.
LambdaRank Gradient Amplification
graph LR
subgraph "Standard Pairwise Loss"
A["Pair (pos=1, neg=2)<br>Gradient: 0.50"] --> B["All pairs get<br>equal treatment"]
C["Pair (pos=5, neg=8)<br>Gradient: 0.50"] --> B
end
subgraph "LambdaRank"
D["Pair (pos=1, neg=2)<br>|ΔNDCG| = 0.37<br>Gradient: 0.50 × 0.37 = 0.185"] --> E["Top positions get<br>12× more gradient<br>than bottom"]
F["Pair (pos=5, neg=8)<br>|ΔNDCG| = 0.03<br>Gradient: 0.50 × 0.03 = 0.015"] --> E
end
subgraph "Effect on MangaAssist"
E --> G["Position 1 error:<br>Wrong manga at top<br>= user leaves"]
E --> H["Position 8 error:<br>Wrong order beyond top-3<br>= user never sees it"]
end
Full Reranking Inference Pipeline
sequenceDiagram
participant Query as User Query
participant BiEnc as Bi-Encoder<br>(Titan V2 + Adapter)
participant OS as OpenSearch<br>(ANN Index)
participant CrossEnc as Cross-Encoder<br>(MiniLM Reranker)
participant User as User Sees<br>Top 3 Results
Query->>BiEnc: "dark isekai manga"
BiEnc->>OS: Query embedding (1024-dim)
OS->>CrossEnc: Top 10 candidates with scores
Note over CrossEnc: Score each of 10 candidates:<br>[CLS] query [SEP] doc_i [SEP]<br>10 forward passes × 5ms each = 50ms
loop For each candidate (10x)
CrossEnc->>CrossEnc: Concatenate query + doc<br>→ 6 attention layers<br>→ [CLS] → Linear → score
end
CrossEnc->>CrossEnc: Sort by cross-encoder score<br>Reorder: [7,2,5,...] → [2,5,7,...]
CrossEnc->>User: Top 3 reranked results<br>1. Overlord (0.94)<br>2. Berserk (0.91)<br>3. Goblin Slayer (0.87)
Training Data Construction Pipeline
graph TD
subgraph "Source 1: Click-Through Logs (2000 pairs)"
A["User searched X<br>Clicked product Y"] --> D["Positive pair:<br>(X, Y, relevance=3)"]
B["User searched X<br>Saw product Z, didn't click"] --> E["Hard negative:<br>(X, Z, relevance=0)"]
end
subgraph "Source 2: Editorial (600 pairs)"
F["Manga experts rate<br>query-product relevance<br>0=irrelevant, 1=marginal,<br>2=relevant, 3=perfect"] --> G["Graded relevance pairs"]
end
subgraph "Source 3: Synthetic (400 pairs)"
H["Claude generates<br>'user searching for X<br>should/shouldn't find Y'"] --> I["Synthetic pairs<br>with quality filtering"]
end
D --> J["Combined Dataset<br>3000 query-doc pairs<br>with graded relevance"]
E --> J
G --> J
I --> J
J --> K["80/10/10 Split<br>Stratified by<br>relevance level"]
Implementation Deep-Dive
Dataset Preparation
import json
from dataclasses import dataclass
from typing import List, Tuple
from sklearn.model_selection import train_test_split
@dataclass
class RankingExample:
query: str
document: str
relevance: int # 0=irrelevant, 1=marginal, 2=relevant, 3=perfect
def load_ranking_dataset(
clickthrough_path: str,
editorial_path: str,
synthetic_path: str,
) -> List[RankingExample]:
"""
Combine three sources into a unified ranking dataset.
"""
examples = []
# Click-through: binary relevance (clicked=3, not-clicked=0)
with open(clickthrough_path) as f:
for item in json.load(f):
examples.append(RankingExample(
query=item["query"],
document=item["product_description"],
relevance=3 if item["clicked"] else 0,
))
# Editorial: graded relevance (0-3)
with open(editorial_path) as f:
for item in json.load(f):
examples.append(RankingExample(
query=item["query"],
document=item["product_description"],
relevance=item["relevance"],
))
# Synthetic: binary (relevant=2, not-relevant=0)
with open(synthetic_path) as f:
for item in json.load(f):
examples.append(RankingExample(
query=item["query"],
document=item["product_description"],
relevance=2 if item["relevant"] else 0,
))
return examples
def create_pairwise_samples(
examples: List[RankingExample],
) -> List[Tuple[str, str, str]]:
"""
Create (query, doc_better, doc_worse) triplets for pairwise training.
Group by query, then form all pairs where relevance differs.
"""
from collections import defaultdict
by_query = defaultdict(list)
for ex in examples:
by_query[ex.query].append(ex)
triplets = []
for query, docs in by_query.items():
for i in range(len(docs)):
for j in range(len(docs)):
if docs[i].relevance > docs[j].relevance:
triplets.append((
query,
docs[i].document, # better
docs[j].document, # worse
))
return triplets
Pairwise Ranking Loss with LambdaRank Weighting
import torch
import torch.nn as nn
import numpy as np
from typing import List
class LambdaRankLoss(nn.Module):
"""
Pairwise ranking loss with NDCG-aware gradient weighting.
L = -log(σ(s_better - s_worse)) × |ΔNDCG|
The |ΔNDCG| factor amplifies gradients for pairs whose swap
would most affect the user-visible ranking (top positions).
"""
def __init__(self):
super().__init__()
def forward(
self,
scores_better: torch.Tensor, # (batch,) — scores for better docs
scores_worse: torch.Tensor, # (batch,) — scores for worse docs
ndcg_deltas: torch.Tensor, # (batch,) — |ΔNDCG| for each pair
) -> torch.Tensor:
        # Pairwise logistic loss, computed stably via logsigmoid
        diff = scores_better - scores_worse
        pairwise_loss = -torch.nn.functional.logsigmoid(diff)
# Weight by NDCG delta
weighted_loss = pairwise_loss * ndcg_deltas
return weighted_loss.mean()
def compute_ndcg_delta(
relevances: List[int],
pos_i: int,
pos_j: int,
k: int = 3,
) -> float:
"""
Compute |ΔNDCG@K| if items at positions i and j were swapped.
"""
# Current DCG@K
dcg = sum(
(2 ** relevances[p] - 1) / np.log2(p + 2)
for p in range(min(k, len(relevances)))
)
# Swapped DCG@K
swapped = list(relevances)
swapped[pos_i], swapped[pos_j] = swapped[pos_j], swapped[pos_i]
dcg_swapped = sum(
(2 ** swapped[p] - 1) / np.log2(p + 2)
for p in range(min(k, len(swapped)))
)
# Ideal DCG@K
ideal = sorted(relevances, reverse=True)
idcg = sum(
(2 ** ideal[p] - 1) / np.log2(p + 2)
for p in range(min(k, len(ideal)))
)
if idcg == 0:
return 0.0
return abs((dcg - dcg_swapped) / idcg)
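To make the wiring concrete, here is a toy forward pass combining the two pieces above (the scores and relevance labels are illustrative; the training loop in the next subsection opts for MarginRankingLoss instead):

```python
relevances = [3, 2, 1, 0]  # one query's candidate list, already in ranked order

loss_fn = LambdaRankLoss()
scores_better = torch.tensor([2.1, 1.7])   # model scores for the better docs
scores_worse = torch.tensor([1.9, 2.0])    # model scores for the worse docs
ndcg_deltas = torch.tensor([
    compute_ndcg_delta(relevances, 0, 2),  # swap ranks 1 and 3 (0-indexed args)
    compute_ndcg_delta(relevances, 1, 3),  # swap ranks 2 and 4
])
loss = loss_fn(scores_better, scores_worse, ndcg_deltas)
print(loss.item())
```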
Cross-Encoder Fine-Tuning
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup
from torch.utils.data import DataLoader, Dataset
class PairwiseRankingDataset(Dataset):
"""Dataset of (query, better_doc, worse_doc, ndcg_delta) tuples."""
def __init__(self, triplets, tokenizer, max_length=128):
self.triplets = triplets
self.tokenizer = tokenizer
self.max_length = max_length
def __len__(self):
return len(self.triplets)
def __getitem__(self, idx):
query, better_doc, worse_doc = self.triplets[idx]
# Tokenize query-better_doc pair
better_encoding = self.tokenizer(
query, better_doc,
max_length=self.max_length,
padding="max_length",
truncation=True,
return_tensors="pt",
)
# Tokenize query-worse_doc pair
worse_encoding = self.tokenizer(
query, worse_doc,
max_length=self.max_length,
padding="max_length",
truncation=True,
return_tensors="pt",
)
return {
"better_input_ids": better_encoding["input_ids"].squeeze(0),
"better_attention_mask": better_encoding["attention_mask"].squeeze(0),
"worse_input_ids": worse_encoding["input_ids"].squeeze(0),
"worse_attention_mask": worse_encoding["attention_mask"].squeeze(0),
}
def train_reranker(
train_triplets: list,
val_triplets: list,
num_epochs: int = 5,
batch_size: int = 16,
base_lr: float = 2e-5,
):
"""
Fine-tune ms-marco-MiniLM cross-encoder for manga reranking.
Uses pairwise ranking loss: for each (query, better_doc, worse_doc),
the model should score better_doc higher than worse_doc.
"""
model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
model_name, num_labels=1
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# Optimizer with discriminative learning rates
optimizer_params = []
# Relevance head — highest LR
optimizer_params.append({
"params": model.classifier.parameters(),
"lr": base_lr,
})
# Encoder layers — discriminative LR
num_layers = len(model.bert.encoder.layer)
for layer_idx in range(num_layers - 1, -1, -1):
        # Decay 0.8 per layer below the head, so the head keeps the highest LR
        layer_lr = base_lr * (0.8 ** (num_layers - layer_idx))
optimizer_params.append({
"params": model.bert.encoder.layer[layer_idx].parameters(),
"lr": layer_lr,
})
# Embeddings — lowest LR
optimizer_params.append({
"params": model.bert.embeddings.parameters(),
"lr": base_lr * (0.8 ** num_layers),
})
optimizer = AdamW(optimizer_params, weight_decay=0.01)
# Training
train_dataset = PairwiseRankingDataset(train_triplets, tokenizer)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
total_steps = len(train_loader) * num_epochs
scheduler = get_linear_schedule_with_warmup(
optimizer,
num_warmup_steps=int(0.1 * total_steps),
num_training_steps=total_steps,
)
    # Standard pairwise margin objective; swapping in LambdaRankLoss with
    # per-pair |ΔNDCG| weights gives the NDCG-aware variant derived above
    loss_fn = nn.MarginRankingLoss(margin=1.0)
best_ndcg = 0.0
for epoch in range(num_epochs):
model.train()
epoch_loss = 0.0
for batch in train_loader:
# Score better documents
better_scores = model(
input_ids=batch["better_input_ids"].to(device),
attention_mask=batch["better_attention_mask"].to(device),
).logits.squeeze(-1)
# Score worse documents
worse_scores = model(
input_ids=batch["worse_input_ids"].to(device),
attention_mask=batch["worse_attention_mask"].to(device),
).logits.squeeze(-1)
# Target: better > worse (all ones)
target = torch.ones_like(better_scores)
loss = loss_fn(better_scores, worse_scores, target)
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
scheduler.step()
epoch_loss += loss.item()
# Validation
ndcg = evaluate_ndcg(model, tokenizer, val_triplets, device)
avg_loss = epoch_loss / len(train_loader)
print(f"Epoch {epoch+1}: Loss={avg_loss:.4f}, NDCG@3={ndcg:.4f}")
if ndcg > best_ndcg:
best_ndcg = ndcg
model.save_pretrained("./best_reranker")
tokenizer.save_pretrained("./best_reranker")
return model
def evaluate_ndcg(model, tokenizer, val_triplets, device, k=3):
"""Compute NDCG@K on validation set."""
model.eval()
from collections import defaultdict
# Group by query
query_scores = defaultdict(list)
for query, better_doc, worse_doc in val_triplets:
for doc, rel in [(better_doc, 1), (worse_doc, 0)]:
encoding = tokenizer(
query, doc,
max_length=128,
padding="max_length",
truncation=True,
return_tensors="pt",
)
with torch.no_grad():
score = model(
input_ids=encoding["input_ids"].to(device),
attention_mask=encoding["attention_mask"].to(device),
).logits.squeeze().item()
query_scores[query].append((score, rel))
# Compute NDCG per query
ndcg_scores = []
for query, scored_docs in query_scores.items():
scored_docs.sort(key=lambda x: x[0], reverse=True)
rels = [rel for _, rel in scored_docs]
dcg = sum(
(2 ** rels[i] - 1) / np.log2(i + 2)
for i in range(min(k, len(rels)))
)
ideal_rels = sorted(rels, reverse=True)
idcg = sum(
(2 ** ideal_rels[i] - 1) / np.log2(i + 2)
for i in range(min(k, len(ideal_rels)))
)
if idcg > 0:
ndcg_scores.append(dcg / idcg)
return np.mean(ndcg_scores) if ndcg_scores else 0.0
SageMaker Deployment
import sagemaker
from sagemaker.huggingface import HuggingFaceModel
def deploy_reranker():
"""
Deploy fine-tuned reranker to SageMaker real-time endpoint.
ml.c5.xlarge: 4 vCPU, 8GB RAM — CPU inference is sufficient
for the small model (~23M params) at 10 candidates per query.
"""
huggingface_model = HuggingFaceModel(
model_data="s3://manga-ml-models/reranker/model.tar.gz",
role=sagemaker.get_execution_role(),
transformers_version="4.36",
pytorch_version="2.1",
py_version="py310",
env={
"SAGEMAKER_MODEL_SERVER_WORKERS": "2",
},
)
predictor = huggingface_model.deploy(
initial_instance_count=2,
instance_type="ml.c5.xlarge",
endpoint_name="manga-reranker-v1",
)
return predictor
def invoke_reranker(predictor, query: str, candidates: list) -> list:
"""
Score all candidates and return sorted by relevance.
Input: query + list of candidate product descriptions
Output: candidates sorted by cross-encoder relevance score
"""
payload = {
"inputs": [
{"text": query, "text_pair": doc}
for doc in candidates
],
}
    raw = predictor.predict(payload)
    # The HF inference toolkit returns {"label", "score"} dicts for
    # sequence-classification models; normalize to plain floats either way
    scores = [r["score"] if isinstance(r, dict) else float(r) for r in raw]
    # Sort candidates by score (descending)
    scored = list(zip(candidates, scores))
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored
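Hypothetical usage, with the candidate list standing in for Stage-1 bi-encoder output:

```python
predictor = deploy_reranker()
candidates = [
    "Overlord - A dark fantasy isekai where the protagonist...",
    "Sword Art Online - An isekai adventure with...",
    "Goblin Slayer - A dark fantasy about...",
]
for doc, score in invoke_reranker(predictor, "dark isekai manga", candidates)[:3]:
    print(f"{score:.3f}  {doc[:50]}")
```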
Group Discussion: Key Decision Points
Decision Point 1: Cross-Encoder vs ColBERT vs Sparse-Dense Hybrid
Priya (ML Engineer): I benchmarked three reranking architectures:
| Approach | NDCG@3 | Latency (10 docs) | Params | Monthly Cost |
|---|---|---|---|---|
| ms-marco-MiniLM (cross-encoder) | 0.84 | 50ms | 23M | $380 |
| ColBERT v2 (late interaction) | 0.81 | 25ms | 110M | $620 |
| SPLADE + cross-encoder (hybrid) | 0.86 | 70ms | 23M + index | $520 |
Cross-encoder wins on quality-per-dollar. ColBERT is faster but lower quality. SPLADE hybrid is best quality but latency is too high.
Marcus (Architect): Our total retrieval latency target is 100ms. Stage 1 (bi-encoder + OpenSearch) takes 24ms + 10ms = 34ms. The reranker gets the remaining 66ms budget. Cross-encoder at 50ms fits; SPLADE at 70ms is tight with no headroom for spikes.
Aiko (Data Scientist): ColBERT's late interaction approach is interesting for scale — it pre-computes per-token embeddings and does MaxSim at query time. But for reranking 10 candidates (not 1000), the overhead of maintaining per-token indexes is not justified. Cross-encoder's token-level cross-attention gives strictly better quality for small candidate sets.
Jordan (MLOps): Cross-encoder is also the simplest to operate. One model, one endpoint, no additional indexes. ColBERT requires a special per-token index and SPLADE needs a sparse index. More moving parts = more failure modes.
Sam (PM): At NDCG@3 = 0.84, the cross-encoder gives us 12 points over the bi-encoder alone (0.72). That translates to roughly 8% more users finding what they want in the top 3. For our 50K daily active users, that is 4,000 users per day with a better experience.
Resolution: Cross-encoder (ms-marco-MiniLM) selected. Best NDCG@3 within latency budget, lowest cost, simplest deployment. Consider ColBERT for V2 if candidate set grows beyond 50 (where cross-encoder becomes too slow) or if latency budget tightens.
Decision Point 2: Pairwise Loss vs Listwise Loss
Priya (ML Engineer): I compared pairwise (MarginRankingLoss) and listwise (ListMLE) approaches:
| Loss Function | NDCG@3 | NDCG@10 | Training Time |
|---|---|---|---|
| Pointwise (MSE) | 0.78 | 0.83 | 20 min |
| Pairwise (MarginRanking) | 0.83 | 0.88 | 35 min |
| Pairwise + LambdaRank | 0.84 | 0.89 | 38 min |
| Listwise (ListMLE) | 0.84 | 0.90 | 55 min |
Pairwise + LambdaRank matches listwise NDCG@3 at 30% less training time.
Aiko (Data Scientist): Our dataset has mostly binary relevance (clicked vs not-clicked from logs), with only the editorial subset having graded relevance (0-3). Listwise losses work best with fully graded lists, which we don't have for most queries. Pairwise is more robust to incomplete relevance labels.
Jordan (MLOps): Listwise training requires all candidates for a query to be in the same batch, which constrains batch construction and makes distributed training harder. Pairwise triplets can be freely shuffled across batches.
Resolution: Pairwise ranking loss with LambdaRank weighting. Matches listwise quality for our data, trains 30% faster, more robust to incomplete graded labels.
Decision Point 3: Reranking Depth — Top-K Candidates
Marcus (Architect): How many candidates should we rerank?
| Candidates Reranked | NDCG@3 | Latency | Monthly Cost |
|---|---|---|---|
| 5 | 0.80 | 25ms | $190 |
| 10 | 0.84 | 50ms | $380 |
| 20 | 0.86 | 100ms | $760 |
| 50 | 0.87 | 250ms | $1,900 |
Aiko (Data Scientist): The jump from 5 to 10 candidates gives +4 NDCG points (0.80 → 0.84). Going from 10 to 20 gives only +2 (0.84 → 0.86). The diminishing returns are clear.
Priya (ML Engineer): With 10 candidates, the reranker has seen enough diversity that the correct top-3 is almost always somewhere in the candidate pool. Going to 20 mostly helps edge cases where the bi-encoder ranks the truly relevant item at positions 11-20, which happens ~5% of the time.
Sam (PM): The cost doubles from $380 to $760 for 2 NDCG points. That is $190/point — well above our $50 threshold. 10 candidates at $380/month is the sweet spot.
Resolution: Rerank top-10 candidates. Best cost-quality tradeoff. NDCG@3 of 0.84 within 50ms latency budget. Monitor recall@10 of the bi-encoder — if it drops below 0.85, consider increasing to 15 candidates.
Decision Point 4: Co-Training Reranker with Adapter
Priya (ML Engineer): Currently we train the embedding adapter and reranker independently. Should we co-train them?
Aiko (Data Scientist): Co-training means using the reranker's feedback to improve the adapter's training signal. Intuition: if the reranker consistently boosts certain items that the adapter ranked low, the adapter should learn to rank them higher. This is "knowledge distillation from reranker to retriever."
Marcus (Architect): The challenge is complexity. Co-training creates a dependency loop: adapter quality affects reranker training data, reranker quality affects adapter optimization signal. Getting this right is a research project, not an engineering task.
Jordan (MLOps): Our retraining cadence is monthly. Co-training would couple the two pipelines, meaning a failure in one blocks the other. Independent training gives us isolation — we can update the adapter without retraining the reranker, and vice versa.
Sam (PM): The independent approach already gives us NDCG@3 of 0.84, which exceeds our 0.80 target by 4 points. Co-training is premature optimization.
Resolution: Train independently for now. Revisit co-training in V2 when both pipelines are stable and we have enough data to validate the feedback loop.
Research Paper References
1. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset (Bajaj et al., 2016)
Key contribution: Created the largest public passage ranking dataset (8.8M passages, 1M queries). Enabled training of neural ranking models at scale. The cross-encoder approach (concatenate query + passage → relevance score) became the standard architecture for reranking.
Relevance to MangaAssist: ms-marco-MiniLM-L-6-v2 is pre-trained on this dataset. The pre-training gives us a strong foundation for passage-level relevance scoring. However, MS-MARCO queries are web search (factoid questions), not e-commerce product search. Our fine-tuning adapts the model from "is this passage the answer?" to "is this manga what the user wants?".
2. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction (Khattab & Zaharia, 2020)
Key contribution: Introduced late interaction — pre-compute per-token embeddings for documents, then compute MaxSim at query time. Achieves near cross-encoder quality at bi-encoder speed for large candidate sets. The token-level matching allows fine-grained relevance without full cross-attention.
Relevance to MangaAssist: ColBERT is our V2 candidate if we need to rerank more than 50 candidates (e.g., if we expand to cross-category search). For our current 10-candidate reranking, cross-encoder is better. But ColBERT's architecture insight informs our understanding: the quality gap between bi-encoder and cross-encoder comes specifically from token-level interactions that bi-encoders cannot capture.
3. From RankNet to LambdaRank to LambdaMART (Burges, 2010)
Key contribution: Derived LambdaRank as a practical refinement of RankNet: scale each pairwise gradient by the |ΔNDCG| of swapping the pair. Showed that this heuristic empirically optimizes NDCG directly, even though no closed-form loss function is being minimized. LambdaMART (LambdaRank gradients + gradient-boosted trees) became the industry standard for learning-to-rank.
Relevance to MangaAssist: We use the LambdaRank gradient weighting to focus our cross-encoder's training on getting the top-3 positions right. The |ΔNDCG| multiplier ensures that swapping positions (1,2) gets 12× the gradient of swapping (8,9), directly optimizing for user-visible result quality.
4. Intra-Document Cascading for Efficient Passage Retrieval (Hofstätter et al., 2021)
Key contribution: Proposed cascaded retrieval where progressively more expensive models score progressively fewer candidates. Showed that 3-stage cascading (BM25 → bi-encoder → cross-encoder) achieves better latency-quality tradeoffs than any single model.
Relevance to MangaAssist: Our 2-stage pipeline (bi-encoder → cross-encoder) is a simplified cascade. The paper validates our architectural choice and suggests that adding a lightweight first-stage filter (e.g., BM25 prefiltering before the bi-encoder) could further improve latency if needed in V2.
Production Deployment and Monitoring
Deployment Architecture
graph LR
subgraph "Training Pipeline (Monthly)"
A["Click-through<br>Logs"] --> B["Pairwise Triplet<br>Construction"]
B --> C["Cross-Encoder<br>Fine-Tuning<br>(SageMaker)"]
C --> D["Model Artifact<br>S3"]
end
subgraph "Validation Gate"
D --> E["NDCG@3 > 0.80<br>on golden set"]
E -->|pass| F["SageMaker Endpoint<br>Blue-Green Deploy"]
E -->|fail| G["Reject + Alert"]
end
subgraph "Inference"
H["Bi-Encoder<br>Top 10"] --> I["Cross-Encoder<br>Reranker<br>(2x ml.c5.xlarge)"]
I --> J["Top 3 Sorted<br>Results"]
end
Key Production Metrics
| Metric | Target | Alert Threshold |
|---|---|---|
| NDCG@3 (sampled) | ≥ 0.80 | < 0.75 |
| P50 latency (10 candidates) | 35ms | > 50ms |
| P95 latency (10 candidates) | 50ms | > 70ms |
| Pairwise accuracy (sampled) | ≥ 85% | < 80% |
| Score distribution shift | KL < 0.02 | KL > 0.05 |
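As one way to implement the last row, a hedged sketch of the drift check, assuming sigmoid-normalized scores in [0, 1] (bin count and sample sizes are assumptions, not values from our monitoring config):

```python
import numpy as np

def score_drift_kl(baseline_scores, live_scores, bins: int = 20) -> float:
    """KL(live || baseline) over histogrammed reranker scores."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    p, _ = np.histogram(live_scores, bins=edges)
    q, _ = np.histogram(baseline_scores, bins=edges)
    p = (p + 1) / (p.sum() + bins)  # Laplace smoothing avoids log(0)
    q = (q + 1) / (q.sum() + bins)
    return float(np.sum(p * np.log(p / q)))

kl = score_drift_kl(np.random.beta(5, 2, 10_000), np.random.beta(5, 2, 1_000))
if kl > 0.05:  # alert threshold from the table above
    print(f"ALERT: reranker score drift, KL={kl:.3f}")
```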
Evaluation and Results
Before vs After Fine-Tuning
| Metric | ms-marco (Pre-trained) | Fine-tuned (Ours) | Improvement |
|---|---|---|---|
| NDCG@3 | 0.71 | 0.84 | +0.13 |
| NDCG@10 | 0.77 | 0.89 | +0.12 |
| Pairwise accuracy | 72.3% | 86.8% | +14.5% |
| Manga-jargon NDCG@3 | 0.58 | 0.81 | +0.23 |
| Genre-crossing NDCG@3 | 0.63 | 0.82 | +0.19 |
| Mean reciprocal rank | 0.68 | 0.82 | +0.14 |
End-to-End Retrieval Quality (Bi-Encoder + Reranker)
| Pipeline Configuration | NDCG@3 | P95 Latency | Monthly Cost |
|---|---|---|---|
| Titan V2 raw (no adapter, no reranker) | 0.58 | 32ms | $450 |
| Titan V2 + adapter (no reranker) | 0.72 | 34ms | $485 |
| Titan V2 + adapter + ms-marco (pre-trained) | 0.78 | 84ms | $865 |
| Titan V2 + adapter + ms-marco (fine-tuned, ours) | 0.84 | 84ms | $865 |
The fine-tuned reranker adds 6 NDCG points over the pre-trained version at zero additional infrastructure cost — same endpoint, same latency, just better weights.