Hybrid Search and Custom Scoring for Manga Product Retrieval
AWS AIP-C01 Task 4.2 → Skill 4.2.2: Optimize retrieval mechanisms for FM-augmented applications
System: MangaAssist e-commerce chatbot — Bedrock Claude 3 Sonnet/Haiku, OpenSearch Serverless (HNSW k-NN), DynamoDB, ECS Fargate, ElastiCache Redis
Focus: Deep-dive into hybrid search implementation, score fusion mathematics, custom scoring functions, re-ranking with Bedrock, and retrieval evaluation metrics
Skill Mapping
| AWS AIP-C01 Element | Coverage |
|---|---|
| Task 4.2 | Optimize application performance for FM workloads |
| Skill 4.2.2 | Optimize retrieval mechanisms to improve FM-augmented application performance |
| This File | Hybrid search scoring, fusion methods, re-ranking with Bedrock, evaluation methodology |
| MangaAssist Context | Combining BM25 (exact manga titles/ISBNs) + kNN (semantic "manga like One Piece") to maximize both precision and recall across 100K+ products |
Why Hybrid Search? — The BM25 vs kNN Tradeoff
The Core Problem
MangaAssist receives two fundamentally different query types that require two fundamentally different retrieval approaches.
graph TB
subgraph ExactQueries["Exact Queries — BM25 Wins"]
style ExactQueries fill:#2ecc71,stroke:#16213e,color:#fff
Q1["'鬼滅の刃 23巻'<br/>Demon Slayer Vol 23"]
Q2["'ISBN 978-4088820...']
Q3["'Eiichiro Oda'"]
Q4["'Weekly Shonen Jump<br/>2024 Issue 15'"]
end
subgraph SemanticQueries["Semantic Queries — kNN Wins"]
style SemanticQueries fill:#3498db,stroke:#16213e,color:#fff
Q5["'adventure manga<br/>with pirates'"]
Q6["'something dark and<br/>philosophical'"]
Q7["'manga for someone<br/>who loved Naruto'"]
Q8["'feel-good slice of life<br/>romance'"]
end
subgraph HybridQueries["Mixed Queries — Hybrid Wins"]
style HybridQueries fill:#e94560,stroke:#16213e,color:#fff
Q9["'manga like ワンピース<br/>with good art'"]
Q10["'shonen manga about<br/>cooking competitions'"]
Q11["'new releases similar<br/>to Attack on Titan'"]
end
Q1 & Q2 & Q3 & Q4 --> BM25["BM25<br/>Keyword Match"]
Q5 & Q6 & Q7 & Q8 --> KNN["kNN Vector<br/>Semantic Similarity"]
Q9 & Q10 & Q11 --> HYBRID["Hybrid<br/>BM25 + kNN"]
Head-to-Head Comparison
| Dimension | BM25 (Keyword) | kNN (Vector) | Hybrid |
|---|---|---|---|
| Exact title match | Excellent — direct term matching | Poor — semantic drift from exact surface forms | Excellent — BM25 component handles this |
| Semantic similarity | Poor — "pirate adventure" does not match "One Piece" | Excellent — embedding captures meaning | Excellent — kNN component handles this |
| Author lookup | Excellent — keyword match on name field | Moderate — depends on embedding training data | Excellent |
| Cross-language | Poor — "冒険" does not match "adventure" in BM25 | Good — multilingual embeddings bridge languages | Good |
| Handling typos | Poor — "Narruto" misses "Naruto" | Good — embeddings are robust to misspellings | Good |
| Novel/unseen terms | Zero recall for OOV terms | Moderate — nearest semantic neighbor | Moderate |
| Latency | 15-25ms | 30-50ms | 40-60ms (parallel) |
| Explainability | High — TF-IDF scores are interpretable | Low — cosine similarity is opaque | Medium |
Score Fusion Methods
The central challenge of hybrid search: BM25 scores (unbounded, typically 5-50) and kNN scores (cosine similarity, 0.0-1.0) are on completely different scales. Score fusion resolves this.
Method 1: Reciprocal Rank Fusion (RRF)
RRF is rank-based, not score-based. It sidesteps the score normalization problem entirely.
Formula:
RRF(d) = Σ_{i ∈ {bm25, knn}} 1 / (k + rank_i(d))
Where k is a smoothing constant (default 60) and rank_i(d) is the rank of document d in ranker i (1-indexed).
Properties:
- Score-scale agnostic: works regardless of BM25 vs kNN score ranges
- Documents appearing in both result lists get double contribution
- The smoothing constant k=60 prevents top-ranked documents from dominating
graph TB
subgraph BM25Ranks["BM25 Rankings"]
B1["Rank 1: Naruto Vol 72<br/>BM25=38.2"]
B2["Rank 2: One Piece Vol 100<br/>BM25=31.7"]
B3["Rank 3: Dragon Ball Vol 42<br/>BM25=28.1"]
B4["Rank 4: Bleach Vol 74<br/>BM25=22.5"]
end
subgraph KNNRanks["kNN Rankings"]
K1["Rank 1: One Piece Vol 100<br/>cos=0.94"]
K2["Rank 2: Fairy Tail Vol 63<br/>cos=0.91"]
K3["Rank 3: Naruto Vol 72<br/>cos=0.88"]
K4["Rank 4: Black Clover Vol 33<br/>cos=0.85"]
end
subgraph RRFScores["RRF Fused Scores (k=60)"]
style RRFScores fill:#e94560,stroke:#16213e,color:#fff
R1["One Piece Vol 100<br/>1/(60+2) + 1/(60+1) = 0.0325<br/>← Appeared in BOTH lists"]
R2["Naruto Vol 72<br/>1/(60+1) + 1/(60+3) = 0.0322"]
R3["Fairy Tail Vol 63<br/>0 + 1/(60+2) = 0.0161"]
R4["Dragon Ball Vol 42<br/>1/(60+3) + 0 = 0.0159"]
end
B1 & B2 & K1 & K3 --> R1
B1 & K3 --> R2
K2 --> R3
B3 --> R4
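As a sanity check, the diagram's RRF arithmetic can be reproduced in a few lines of Python (a minimal sketch; the document IDs are illustrative):

# Sanity check for the RRF diagram above (k=60; ranks are 1-indexed).
bm25_ranks = {"naruto_v72": 1, "one_piece_v100": 2, "dragon_ball_v42": 3, "bleach_v74": 4}
knn_ranks = {"one_piece_v100": 1, "fairy_tail_v63": 2, "naruto_v72": 3, "black_clover_v33": 4}

def rrf(doc_id: str, k: int = 60) -> float:
    """Sum 1/(k + rank) over every ranker that returned the document."""
    return sum(
        1.0 / (k + ranks[doc_id])
        for ranks in (bm25_ranks, knn_ranks)
        if doc_id in ranks
    )

for doc_id in sorted(set(bm25_ranks) | set(knn_ranks), key=rrf, reverse=True):
    print(f"{doc_id}: {rrf(doc_id):.4f}")
# one_piece_v100: 0.0325   (in both lists, so it wins)
# naruto_v72: 0.0323
# fairy_tail_v63: 0.0161 ...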
Method 2: Weighted Linear Combination
Requires min-max normalization to bring both score distributions onto [0, 1].
Formula:
score(d) = α × norm_knn(d) + β × norm_bm25(d)
where:
norm(s) = (s - min) / (max - min)
α + β = 1.0
MangaAssist defaults: α = 0.6 (kNN weight), β = 0.4 (BM25 weight) — tuned on a labeled relevance set of 500 manga queries.
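A worked sketch of the normalization and combination steps, reusing the example scores from the RRF diagram above:

# Worked example: min-max normalization + weighted combination (α=0.6, β=0.4).
def min_max(scores: list[float]) -> list[float]:
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # guard: identical scores would divide by zero
    return [(s - lo) / span for s in scores]

bm25_norm = min_max([38.2, 31.7, 28.1, 22.5])  # [1.0, 0.586, 0.357, 0.0]
knn_norm = min_max([0.94, 0.91, 0.88, 0.85])   # [1.0, 0.667, 0.333, 0.0]

# One Piece Vol 100 is BM25 rank 2 and kNN rank 1:
fused = 0.6 * knn_norm[0] + 0.4 * bm25_norm[1]
print(round(fused, 3))  # 0.834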
Method 3: Learned Score Combination
Train a lightweight model (logistic regression or small neural net) to combine BM25 score, kNN score, and metadata features into a single relevance score.
Features used:
- Normalized BM25 score
- Cosine similarity (kNN score)
- Query-document genre match (binary)
- Title overlap (Jaccard similarity)
- Document recency (days since release)
- Document popularity (log sales rank)
Advantage: Can learn non-linear interactions (e.g., "for buy-intent queries, weight BM25 higher"). Disadvantage: Requires labeled training data and periodic retraining.
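A minimal sketch of Method 3, assuming scikit-learn's LogisticRegression and hypothetical training rows over the six features above; in practice the labels come from the ~500-query relevance set:

# Minimal learned-combination sketch (library choice and data are assumptions).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: norm_bm25, cosine_sim, genre_match, title_jaccard,
#          days_since_release, log_sales_rank
X_train = np.array([
    [0.90, 0.85, 1, 0.60, 12.0, 2.1],    # labeled relevant
    [0.15, 0.42, 0, 0.05, 400.0, 8.3],   # labeled irrelevant
    [0.55, 0.91, 1, 0.10, 45.0, 3.0],    # labeled relevant
    [0.05, 0.30, 0, 0.00, 900.0, 9.5],   # labeled irrelevant
])
y_train = np.array([1, 0, 1, 0])

model = LogisticRegression().fit(X_train, y_train)

def learned_score(features: list[float]) -> float:
    """P(relevant) from the trained combiner, used as the fused score."""
    return float(model.predict_proba([features])[0, 1])

# Non-linear interactions (e.g., intent x BM25) are added as extra feature
# columns, or by swapping in a small MLP / gradient-boosted trees.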
Fusion Method Comparison
| Method | NDCG@5 on MangaAssist | Latency Overhead | Requires Training Data | Robustness |
|---|---|---|---|---|
| RRF (k=60) | 0.82 | < 1ms | No | High — rank-based, no normalization needed |
| Weighted Linear (α=0.6) | 0.80 | < 1ms | No (but weights need tuning) | Medium — sensitive to score distribution shifts |
| Learned Combination | 0.87 | 2-5ms | Yes (500+ labeled queries) | Medium — needs retraining as catalog changes |
Recommendation for MangaAssist: Start with RRF as the default. Move to learned combination only after accumulating sufficient click-through / relevance judgment data.
Custom Scoring for MangaAssist
Beyond fusion scores, MangaAssist applies business-aware scoring boosts. These operate as multiplicative factors on the fused score.
Scoring Dimensions
graph LR
FUSED["Fused Score<br/>(RRF or Linear)"] --> RECENCY["Recency Boost<br/>New releases +20%"]
RECENCY --> POPULARITY["Popularity Boost<br/>High-rated +15%"]
POPULARITY --> GENRE["Genre Match<br/>Intent-aligned +10%"]
GENRE --> AVAILABILITY["Availability<br/>In-stock +10%"]
AVAILABILITY --> FINAL["Final Score"]
Boost Factor Configuration
| Boost | Condition | Multiplier | Rationale |
|---|---|---|---|
| Recency — New Release | Released within last 30 days | 1.20 | New manga drives engagement; customers expect fresh recommendations |
| Recency — Recent | Released within last 90 days | 1.10 | Still relevant, moderate boost |
| Recency — Trending | Sales rank improved > 50% in last 7 days | 1.15 | Capture viral/trending titles |
| Popularity — Top Rated | avg_rating >= 4.5 and rating_count >= 50 | 1.15 | High confidence in quality |
| Popularity — Well Rated | avg_rating >= 4.0 and rating_count >= 20 | 1.05 | Moderate confidence |
| Genre Match | Query intent matches document genre/demographic | 1.10 | "shonen recommendation" should surface shonen manga |
| Availability — In Stock | in_stock = true (for buy intent) | 1.10 | Avoid frustrating buy-intent users with OOS results |
| Availability — Preorder | release_date in future (for browse intent) | 1.05 | Surface upcoming titles for browsers |
| Price Range | Within user's stated price range | 1.05 | Only if explicit price constraint detected |
Intent-Aware Scoring Matrix
Different user intents require different boost profiles.
| Factor | Browse Intent | Buy Intent | Recommend Intent | Research Intent |
|---|---|---|---|---|
| Recency boost | High (1.20) | Medium (1.10) | Medium (1.10) | Low (1.00) |
| Popularity boost | Medium (1.10) | Low (1.00) | High (1.15) | Low (1.00) |
| In-stock boost | None (1.00) | High (1.15) | None (1.00) | None (1.00) |
| Genre match boost | Medium (1.10) | Low (1.05) | High (1.15) | Medium (1.10) |
| Review count boost | Low (1.00) | Low (1.00) | Medium (1.10) | High (1.15) |
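A minimal sketch of how this matrix might be encoded as configuration (the dictionary and function names are hypothetical); each boost whose condition fires multiplies the fused score:

# Intent-aware boost profiles, mirroring the matrix above (names hypothetical).
BOOST_PROFILES: dict[str, dict[str, float]] = {
    "browse":    {"recency": 1.20, "popularity": 1.10, "in_stock": 1.00,
                  "genre_match": 1.10, "review_count": 1.00},
    "buy":       {"recency": 1.10, "popularity": 1.00, "in_stock": 1.15,
                  "genre_match": 1.05, "review_count": 1.00},
    "recommend": {"recency": 1.10, "popularity": 1.15, "in_stock": 1.00,
                  "genre_match": 1.15, "review_count": 1.10},
    "research":  {"recency": 1.00, "popularity": 1.00, "in_stock": 1.00,
                  "genre_match": 1.10, "review_count": 1.15},
}

def apply_boosts(fused_score: float, intent: str, signals: dict[str, bool]) -> float:
    """Multiply the fused score by each boost whose condition fired."""
    profile = BOOST_PROFILES.get(intent, BOOST_PROFILES["browse"])
    for signal, fired in signals.items():
        if fired:
            fused_score *= profile.get(signal, 1.0)
    return fused_score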
OpenSearch Serverless Query DSL — Hybrid Query Examples
Example 1: Basic Hybrid Query with RRF
This uses OpenSearch's native hybrid search with the search_pipeline feature.
{
"comment": "MangaAssist hybrid search: adventure shonen manga",
"size": 50,
"_source": ["manga_id", "title_ja", "title_en", "genre", "avg_rating", "release_date"],
"query": {
"hybrid": {
"queries": [
{
"multi_match": {
"query": "adventure shonen manga pirates 冒険 少年 漫画",
"fields": ["title_ja^3", "title_en^3", "description", "tags^1.5", "genre"],
"type": "cross_fields",
"minimum_should_match": "30%"
}
},
{
"knn": {
"description_embedding": {
"vector": [0.023, -0.118, 0.045, "... 1536 dims ..."],
"k": 50
}
}
}
]
}
},
"search_pipeline": "manga-hybrid-rrf-pipeline"
}
Search pipeline definition (created once via the OpenSearch API). Note that despite its name, this pipeline applies min-max normalization with a weighted arithmetic mean (Method 2 above) rather than rank-based RRF; MangaAssist performs true RRF client-side in the ScoreFusion class shown later:
{
"description": "MangaAssist RRF hybrid search pipeline",
"phase_results_processors": [
{
"normalization-processor": {
"normalization": { "technique": "min_max" },
"combination": {
"technique": "arithmetic_mean",
"parameters": { "weights": [0.4, 0.6] }
}
}
}
]
}
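A minimal sketch of registering this pipeline with opensearch-py (the collection endpoint is a placeholder; SigV4 auth with service name "aoss" is assumed for OpenSearch Serverless, and search-pipeline availability depends on the service version):

# One-time pipeline registration (endpoint and auth details are assumptions).
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, "ap-northeast-1", "aoss")
client = OpenSearch(
    hosts=[{"host": "your-collection.ap-northeast-1.aoss.amazonaws.com", "port": 443}],
    http_auth=auth,
    use_ssl=True,
    connection_class=RequestsHttpConnection,
)

pipeline_body = {
    "description": "MangaAssist RRF hybrid search pipeline",
    "phase_results_processors": [{
        "normalization-processor": {
            "normalization": {"technique": "min_max"},
            "combination": {
                "technique": "arithmetic_mean",
                "parameters": {"weights": [0.4, 0.6]},
            },
        }
    }],
}
client.transport.perform_request(
    "PUT", "/_search/pipeline/manga-hybrid-rrf-pipeline", body=pipeline_body
)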
Example 2: Filtered Hybrid Query (Buy Intent)
{
"comment": "Buy intent: specific manga, must be in stock, under 1000 JPY",
"size": 20,
"_source": ["manga_id", "title_ja", "title_en", "price_jpy", "in_stock", "genre"],
"query": {
"hybrid": {
"queries": [
{
"bool": {
"must": [
{
"multi_match": {
"query": "鬼滅の刃 最新刊",
"fields": ["title_ja^5", "title_en^3", "description"],
"type": "best_fields"
}
}
],
"filter": [
{ "term": { "in_stock": true } },
{ "range": { "price_jpy": { "lte": 1000 } } }
]
}
},
{
"knn": {
"description_embedding": {
"vector": [0.031, -0.092, "... 1536 dims ..."],
"k": 20,
"filter": {
"bool": {
"filter": [
{ "term": { "in_stock": true } },
{ "range": { "price_jpy": { "lte": 1000 } } }
]
}
}
}
}
}
]
}
},
"search_pipeline": "manga-hybrid-rrf-pipeline"
}
Example 3: Recommendation Query (Semantic-Heavy)
{
"comment": "Recommend intent: semantic similarity to 'manga like One Piece with good art'",
"size": 50,
"_source": ["manga_id", "title_ja", "title_en", "genre", "avg_rating", "author"],
"query": {
"hybrid": {
"queries": [
{
"multi_match": {
"query": "ワンピース One Piece 冒険 adventure shonen art 画力",
"fields": ["title_ja^2", "title_en^2", "tags^2", "genre", "description"],
"type": "cross_fields",
"minimum_should_match": "20%"
}
},
{
"knn": {
"description_embedding": {
"vector": [0.012, -0.067, "... 1536 dims ..."],
"k": 50
}
}
}
]
}
},
"search_pipeline": "manga-hybrid-recommend-pipeline"
}
Recommendation pipeline — higher kNN weight:
{
"description": "MangaAssist recommendation-tuned hybrid pipeline",
"phase_results_processors": [
{
"normalization-processor": {
"normalization": { "technique": "min_max" },
"combination": {
"technique": "arithmetic_mean",
"parameters": { "weights": [0.25, 0.75] }
}
}
}
]
}
Re-ranking with Bedrock
Why Re-rank?
Hybrid search retrieves a broad candidate set (top-50), but ranking quality degrades past the top-10. Re-ranking applies a more expensive but more accurate relevance model to distill the top-50 down to the top-5.
graph TB
subgraph Retrieval["Hybrid Search (80ms)"]
style Retrieval fill:#0f3460,stroke:#16213e,color:#fff
SEARCH["Top-50 Candidates<br/>from RRF hybrid"]
end
subgraph Reranking["Re-ranking (60ms budget)"]
style Reranking fill:#e94560,stroke:#16213e,color:#fff
HEURISTIC["Stage 1: Heuristic Scorer<br/>Recency + Popularity + Intent<br/>(5ms) → Top-20"]
CROSS["Stage 2: Cross-Encoder<br/>or Claude Haiku Reranker<br/>(50ms) → Top-5"]
end
subgraph Output["RAG Context"]
style Output fill:#2ecc71,stroke:#16213e,color:#fff
TOP5["Top-5 Results<br/>→ Claude 3 Sonnet prompt"]
end
SEARCH --> HEURISTIC --> CROSS --> TOP5
Claude Haiku as Re-ranker
For high-value queries (buy intent detected, returning customer), MangaAssist can use Claude Haiku to score query-document relevance.
Prompt template:
You are a manga relevance scorer for an e-commerce search engine.
Query: {query}
User intent: {intent}
Rate each manga result from 0.0 (irrelevant) to 1.0 (perfect match).
Consider: title relevance, genre match, thematic similarity, and user intent.
Results to score:
{formatted_results}
Return ONLY a JSON array of scores in the same order:
[0.92, 0.85, 0.71, ...]
Cost analysis: Haiku re-ranking of 20 candidates costs ~$0.001 per query (input: ~2K tokens, output: ~100 tokens). At 100K queries/day = ~$100/day.
Python — ScoreFusion
"""
MangaAssist Score Fusion
Implements RRF, weighted linear, and learned score combination.
"""
from dataclasses import dataclass, field
from typing import Protocol
@dataclass
class ScoredDocument:
"""Document with scores from multiple rankers."""
doc_id: str
title: str
bm25_score: float = 0.0
bm25_rank: int = 0
knn_score: float = 0.0
knn_rank: int = 0
fused_score: float = 0.0
    metadata: dict = field(default_factory=dict)
class FusionStrategy(Protocol):
"""Protocol for score fusion strategies."""
def fuse(
self,
bm25_results: list[ScoredDocument],
knn_results: list[ScoredDocument],
) -> list[ScoredDocument]: ...
class ScoreFusion:
"""
Orchestrates score fusion for MangaAssist hybrid search.
Supports three strategies:
- RRF: Reciprocal Rank Fusion (default, most robust)
- Linear: Weighted linear combination of normalized scores
    - Learned: Feature-based combination with a trained model (Method 3; not implemented in this class)
Usage:
fusion = ScoreFusion(strategy="rrf", rrf_k=60)
results = fusion.fuse(bm25_results, knn_results)
"""
def __init__(
self,
strategy: str = "rrf",
rrf_k: int = 60,
bm25_weight: float = 0.4,
knn_weight: float = 0.6,
):
self.strategy = strategy
self.rrf_k = rrf_k
self.bm25_weight = bm25_weight
self.knn_weight = knn_weight
def fuse(
self,
bm25_results: list[ScoredDocument],
knn_results: list[ScoredDocument],
) -> list[ScoredDocument]:
"""Fuse BM25 and kNN results using the configured strategy."""
if self.strategy == "rrf":
return self._rrf_fuse(bm25_results, knn_results)
elif self.strategy == "linear":
return self._linear_fuse(bm25_results, knn_results)
else:
raise ValueError(f"Unknown strategy: {self.strategy}")
def _rrf_fuse(
self,
bm25_results: list[ScoredDocument],
knn_results: list[ScoredDocument],
) -> list[ScoredDocument]:
"""
Reciprocal Rank Fusion.
RRF(d) = Σ 1/(k + rank_i(d))
Rank-based: immune to score scale differences.
k=60 is the original paper's recommendation.
"""
k = self.rrf_k
score_map: dict[str, float] = {}
doc_map: dict[str, ScoredDocument] = {}
# BM25 contribution
for rank, doc in enumerate(bm25_results, start=1):
rrf_contribution = 1.0 / (k + rank)
score_map[doc.doc_id] = score_map.get(
doc.doc_id, 0.0
) + rrf_contribution
doc.bm25_rank = rank
doc_map[doc.doc_id] = doc
# kNN contribution
for rank, doc in enumerate(knn_results, start=1):
rrf_contribution = 1.0 / (k + rank)
score_map[doc.doc_id] = score_map.get(
doc.doc_id, 0.0
) + rrf_contribution
doc.knn_rank = rank
if doc.doc_id not in doc_map:
doc_map[doc.doc_id] = doc
else:
doc_map[doc.doc_id].knn_rank = rank
doc_map[doc.doc_id].knn_score = doc.knn_score
# Build fused result list
fused: list[ScoredDocument] = []
for doc_id, score in score_map.items():
doc = doc_map[doc_id]
doc.fused_score = score
fused.append(doc)
fused.sort(key=lambda d: d.fused_score, reverse=True)
return fused
def _linear_fuse(
self,
bm25_results: list[ScoredDocument],
knn_results: list[ScoredDocument],
) -> list[ScoredDocument]:
"""
Weighted linear combination with min-max normalization.
score(d) = α * norm_knn(d) + β * norm_bm25(d)
"""
# Min-max normalize BM25 scores
bm25_scores = [d.bm25_score for d in bm25_results]
bm25_min = min(bm25_scores) if bm25_scores else 0
bm25_max = max(bm25_scores) if bm25_scores else 1
bm25_range = bm25_max - bm25_min or 1.0
# Min-max normalize kNN scores
knn_scores = [d.knn_score for d in knn_results]
knn_min = min(knn_scores) if knn_scores else 0
knn_max = max(knn_scores) if knn_scores else 1
knn_range = knn_max - knn_min or 1.0
score_map: dict[str, float] = {}
doc_map: dict[str, ScoredDocument] = {}
for doc in bm25_results:
norm_score = (doc.bm25_score - bm25_min) / bm25_range
score_map[doc.doc_id] = self.bm25_weight * norm_score
doc_map[doc.doc_id] = doc
for doc in knn_results:
norm_score = (doc.knn_score - knn_min) / knn_range
score_map[doc.doc_id] = score_map.get(
doc.doc_id, 0.0
) + self.knn_weight * norm_score
if doc.doc_id not in doc_map:
doc_map[doc.doc_id] = doc
fused: list[ScoredDocument] = []
for doc_id, score in score_map.items():
doc = doc_map[doc_id]
doc.fused_score = score
fused.append(doc)
fused.sort(key=lambda d: d.fused_score, reverse=True)
return fused
# ---- Convenience factory ----
def create_fusion(
intent: str = "browse",
) -> ScoreFusion:
"""
Create a ScoreFusion instance with intent-aware weights.
- browse/research: balanced (RRF default)
- recommend: kNN-heavy (0.75 kNN)
- buy: BM25-heavy (0.6 BM25) for exact match priority
"""
if intent == "recommend":
return ScoreFusion(
strategy="linear", bm25_weight=0.25, knn_weight=0.75
)
elif intent == "buy":
return ScoreFusion(
strategy="linear", bm25_weight=0.60, knn_weight=0.40
)
else:
return ScoreFusion(strategy="rrf", rrf_k=60)
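Example usage, replaying the RRF diagram's results through the factory (document IDs are illustrative):

# Replaying the RRF diagram through ScoreFusion (IDs illustrative).
bm25_results = [
    ScoredDocument("m-072", "Naruto Vol 72", bm25_score=38.2),
    ScoredDocument("m-100", "One Piece Vol 100", bm25_score=31.7),
    ScoredDocument("m-042", "Dragon Ball Vol 42", bm25_score=28.1),
    ScoredDocument("m-074", "Bleach Vol 74", bm25_score=22.5),
]
knn_results = [
    ScoredDocument("m-100", "One Piece Vol 100", knn_score=0.94),
    ScoredDocument("m-063", "Fairy Tail Vol 63", knn_score=0.91),
    ScoredDocument("m-072", "Naruto Vol 72", knn_score=0.88),
    ScoredDocument("m-033", "Black Clover Vol 33", knn_score=0.85),
]

fusion = create_fusion(intent="browse")  # -> RRF with k=60
for doc in fusion.fuse(bm25_results, knn_results)[:3]:
    print(f"{doc.title}: {doc.fused_score:.4f}")
# One Piece Vol 100: 0.0325
# Naruto Vol 72: 0.0323
# Fairy Tail Vol 63: 0.0161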
Python — RelevanceReranker
"""
MangaAssist Relevance Re-ranker
Two-stage re-ranking: heuristic scorer → cross-encoder or LLM reranker.
Budget: 60ms total (5ms heuristic + 50ms cross-encoder).
"""
import datetime
import json
from dataclasses import dataclass, field

import boto3
@dataclass
class RankedResult:
"""A search result with re-ranking metadata."""
doc_id: str
title: str
original_rank: int
original_score: float
heuristic_score: float = 0.0
rerank_score: float = 0.0
final_score: float = 0.0
metadata: dict = field(default_factory=dict)
class RelevanceReranker:
"""
Two-stage re-ranking pipeline for MangaAssist.
Stage 1: Heuristic scoring (< 5ms)
- Recency boost, popularity boost, genre match, availability
- Reduces top-50 → top-20
Stage 2: Model-based re-ranking (< 50ms)
- Option A: Cross-encoder model (self-hosted, ~30ms for 20 docs)
- Option B: Claude Haiku via Bedrock (~80ms, higher quality)
- Reduces top-20 → top-5
Usage:
reranker = RelevanceReranker(
region="ap-northeast-1",
rerank_method="haiku",
)
top5 = reranker.rerank(query, intent, candidates, top_k=5)
"""
def __init__(
self,
region: str = "ap-northeast-1",
rerank_method: str = "heuristic", # "heuristic", "haiku"
heuristic_cutoff: int = 20,
):
self.region = region
self.rerank_method = rerank_method
self.heuristic_cutoff = heuristic_cutoff
if rerank_method == "haiku":
self.bedrock = boto3.client(
"bedrock-runtime", region_name=region
)
def rerank(
self,
query: str,
intent: str,
candidates: list[RankedResult],
top_k: int = 5,
) -> list[RankedResult]:
"""
Full re-ranking pipeline.
Returns top_k results sorted by final_score.
"""
# Stage 1: Heuristic scoring → top-N
heuristic_ranked = self._heuristic_score(candidates, intent)
heuristic_ranked.sort(
key=lambda r: r.heuristic_score, reverse=True
)
stage1_results = heuristic_ranked[:self.heuristic_cutoff]
# Stage 2: Model-based re-ranking → top_k
if self.rerank_method == "haiku" and len(stage1_results) > top_k:
final = self._haiku_rerank(query, intent, stage1_results)
else:
final = stage1_results
final.sort(key=lambda r: r.final_score, reverse=True)
return final[:top_k]
# ---- Stage 1: Heuristic Scoring ----
def _heuristic_score(
self,
candidates: list[RankedResult],
intent: str,
) -> list[RankedResult]:
"""
Fast heuristic scoring using metadata signals.
Target: < 5ms for 50 candidates.
"""
        now = datetime.date.today()
for r in candidates:
score = r.original_score
meta = r.metadata
# Recency
release_str = meta.get("release_date", "")
if release_str:
try:
release = datetime.date.fromisoformat(release_str[:10])
days_old = (now - release).days
if days_old <= 30:
score *= 1.20
elif days_old <= 90:
score *= 1.10
except (ValueError, TypeError):
pass
# Popularity
rating = meta.get("avg_rating", 0.0)
rating_count = meta.get("rating_count", 0)
if rating >= 4.5 and rating_count >= 50:
score *= 1.15
elif rating >= 4.0 and rating_count >= 20:
score *= 1.05
# Intent-aware
if intent == "buy":
if meta.get("in_stock", False):
score *= 1.15
elif intent == "recommend":
if rating >= 4.0:
score *= 1.10
elif intent == "research":
if meta.get("review_count", 0) > 100:
score *= 1.10
r.heuristic_score = score
r.final_score = score # default; may be overridden by stage 2
return candidates
# ---- Stage 2: Claude Haiku Re-ranking ----
def _haiku_rerank(
self,
query: str,
intent: str,
candidates: list[RankedResult],
) -> list[RankedResult]:
"""
Use Claude 3 Haiku to score query-document relevance.
Cost: ~$0.001 per call (2K input + 100 output tokens).
Latency: 80-150ms.
"""
# Format candidates for the prompt
formatted = ""
for i, r in enumerate(candidates):
formatted += (
f"{i+1}. [{r.doc_id}] {r.title} "
f"(Genre: {r.metadata.get('genre', 'N/A')}, "
f"Rating: {r.metadata.get('avg_rating', 'N/A')})\n"
)
prompt = f"""You are a manga relevance scorer for a Japanese e-commerce search engine.
Query: {query}
User intent: {intent}
Rate each manga result from 0.0 (completely irrelevant) to 1.0 (perfect match).
Consider: title relevance, genre match, thematic similarity, and user intent.
Results to score:
{formatted}
Return ONLY a JSON array of scores in the same order, e.g., [0.92, 0.85, 0.71, ...]:"""
try:
response = self.bedrock.invoke_model(
modelId="anthropic.claude-3-haiku-20240307-v1:0",
contentType="application/json",
accept="application/json",
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 200,
"temperature": 0.0,
"messages": [
{"role": "user", "content": prompt}
],
}),
)
result = json.loads(response["body"].read())
text = result["content"][0]["text"].strip()
scores = json.loads(text)
            # Apply Haiku scores, blending with the stage-1 heuristic
            max_heuristic = max(
                (c.heuristic_score for c in candidates), default=0.0
            ) or 1.0  # guard against dividing by zero
            for i, r in enumerate(candidates):
                if i < len(scores):
                    r.rerank_score = float(scores[i])
                    # Blend: 70% Haiku + 30% heuristic (normalized)
                    r.final_score = (
                        0.7 * r.rerank_score
                        + 0.3 * (r.heuristic_score / max_heuristic)
                    )
                else:
                    r.final_score = r.heuristic_score
        except Exception:
            # Fall back to heuristic scores if the Bedrock call or parse fails
            for r in candidates:
                r.final_score = r.heuristic_score
return candidates
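Example usage in heuristic-only mode, which needs no Bedrock call (the IDs and metadata are illustrative):

# Heuristic-only re-ranking example (metadata illustrative).
candidates = [
    RankedResult(
        doc_id="m-100",
        title="One Piece Vol 100",
        original_rank=1,
        original_score=0.0325,
        metadata={
            "release_date": "2025-01-15",
            "avg_rating": 4.8,
            "rating_count": 1200,
            "in_stock": True,
        },
    ),
    # ... remaining top-50 candidates from hybrid search
]

reranker = RelevanceReranker(rerank_method="heuristic")
top5 = reranker.rerank(
    query="ワンピースみたいな冒険漫画",  # "adventure manga like One Piece"
    intent="recommend",
    candidates=candidates,
    top_k=5,
)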
Evaluation Methodology
Core Retrieval Quality Metrics
| Metric | Formula | What It Measures | MangaAssist Target |
|---|---|---|---|
| NDCG@K (Normalized Discounted Cumulative Gain) | DCG@K / IDCG@K, where DCG@K = Σ (2^rel_i − 1) / log₂(i + 1) | Quality of ranking order; penalizes relevant docs ranked low | NDCG@5 >= 0.85 |
| MAP (Mean Average Precision) | (1/Q) Σ AP(q), where AP = (1/R) Σ P@k × rel(k) | Average precision across all queries; rewards finding ALL relevant docs | MAP >= 0.80 |
| MRR (Mean Reciprocal Rank) | (1/Q) Σ 1/rank(first relevant) | How quickly the first relevant result appears | MRR >= 0.85 |
| Recall@K | \|relevant ∩ retrieved@K\| / \|relevant\| | Fraction of all relevant documents found in top K | Recall@50 >= 0.95 |
| Precision@K | \|relevant ∩ retrieved@K\| / K | Fraction of retrieved documents that are relevant | Precision@5 >= 0.80 |
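A minimal sketch of NDCG@K computed directly from the table's formula (graded relevance labels for the top-K ranked documents; for simplicity the ideal ranking is taken over the same K judgments):

# NDCG@K from the table's formula; relevance labels are graded (e.g., 0-3).
import math

def dcg_at_k(relevances: list[float], k: int) -> float:
    """DCG@K = sum over ranks of (2^rel - 1) / log2(rank + 1), rank 1-indexed."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )

def ndcg_at_k(relevances: list[float], k: int) -> float:
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

print(round(ndcg_at_k([3, 2, 3, 0, 1], k=5), 2))  # 0.96 -- near-ideal ordering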
Before/After: Hybrid Search Improving Manga Discovery
Example 1: Semantic Recommendation Query
Query: "冒険が好きで、ワンピースみたいな漫画が読みたい" (I like adventure; I want to read manga like One Piece)
| Rank | BM25 Only | kNN Only | Hybrid (RRF) + Re-rank |
|---|---|---|---|
| 1 | One Piece Vol 100 (exact title match) | Fairy Tail (adventure, similar themes) | One Piece Vol 100 (exact + semantic) |
| 2 | One Piece Vol 99 | Magi: Labyrinth of Magic | Fairy Tail (semantic, high rating) |
| 3 | One Piece Vol 98 | Hunter x Hunter | Hunter x Hunter (semantic, trending) |
| 4 | One Piece Episode A (spinoff) | Seven Deadly Sins | Magi (semantic, genre match) |
| 5 | One Piece Film Red (movie tie-in) | Black Clover | Seven Deadly Sins (new volume) |
| NDCG@5 | 0.52 (too many One Piece volumes) | 0.74 (good diversity, missing exact) | 0.91 (exact match #1 + diverse recs) |
BM25 returns multiple volumes of One Piece because they all match the title keyword. kNN returns semantically similar titles but misses the specific One Piece mention. Hybrid + re-rank gets the best of both worlds.
Example 2: Exact Product Lookup
Query: "鬼滅の刃 23巻 購入" (Demon Slayer Volume 23 purchase)
| Rank | BM25 Only | kNN Only | Hybrid + Re-rank |
|---|---|---|---|
| 1 | Demon Slayer Vol 23 | Demon Slayer Vol 23 | Demon Slayer Vol 23 (in stock) |
| 2 | Demon Slayer Vol 22 | Demon Slayer Vol 22 | Demon Slayer Box Set (buy intent boost) |
| 3 | Demon Slayer Vol 21 | Jujutsu Kaisen Vol 20 | Demon Slayer Vol 22 (bundle suggestion) |
| 4 | Demon Slayer Artbook | Chainsaw Man Vol 15 | Demon Slayer Artbook (cross-sell) |
| 5 | Demon Slayer Guide | Tokyo Ghoul Vol 14 | Koyoharu Gotouge Art Collection |
| NDCG@5 | 0.88 (strong on exact, weak cross-sell) | 0.61 (semantic drift) | 0.94 (exact + smart cross-sell) |
For buy intent, BM25 dominates because the query contains the exact title. But hybrid search still wins by adding intent-aware cross-sell suggestions (box set, artbook) that pure BM25 ranks lower.
Example 3: Japanese Tokenization Challenge
Query: "進撃の巨人の作者の他の作品" (Other works by the author of Attack on Titan)
| Rank | BM25 Only | kNN Only | Hybrid + Preprocessing |
|---|---|---|---|
| 1 | Attack on Titan Vol 34 | Attack on Titan Vol 34 | Attack on Titan (anchor result) |
| 2 | Attack on Titan Vol 33 | Attack on Titan: Before the Fall | Hajime Isayama Interview Book |
| 3 | Attack on Titan: Junior High | Vinland Saga (thematic sim) | Attack on Titan: Lost Girls |
| 4 | No further relevant | Kabaneri of the Iron Fortress | Isayama Short Stories Collection |
| 5 | -- | Claymore (dark fantasy sim) | Before the Fall (same universe) |
| NDCG@5 | 0.41 (cannot resolve "author's other works") | 0.55 (thematic but wrong author) | 0.83 (preprocessor expands to author name) |
The query preprocessor detects the "作者の他の作品" (author's other works) pattern, resolves "進撃の巨人" to author "Hajime Isayama" via metadata lookup, and expands the query to include author-specific terms. Neither pure BM25 nor pure kNN can do this without preprocessing.
Monitoring Retrieval Quality in Production
Key Metrics to Track
| Metric | Collection Method | Alert Threshold | Dashboard |
|---|---|---|---|
| Search latency p95 | CloudWatch custom metric from ECS | > 100ms (search stage) | Real-time |
| Fusion latency | Application timer around fusion code | > 5ms | Real-time |
| Re-rank latency | Application timer around reranker | > 80ms | Real-time |
| Zero-result rate | Count queries returning 0 results | > 2% of queries | Daily |
| Click-through rate (CTR) | User clicks on search results | < 15% (below baseline) | Daily |
| NDCG@5 (offline) | Weekly evaluation on labeled query set | < 0.80 | Weekly |
| BM25/kNN overlap | % of top-10 docs appearing in both lists | < 20% (too little overlap may indicate misalignment) | Weekly |
| Re-rank position change | Average rank change after re-ranking | < 2 positions (if too high, fusion may be poor) | Weekly |
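A minimal sketch of publishing these latency samples as CloudWatch custom metrics from ECS (the namespace, metric, and dimension names are assumptions):

# Emitting custom search metrics to CloudWatch (names are assumptions).
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")

def emit_search_metric(name: str, value_ms: float, stage: str) -> None:
    """Publish one latency sample; CloudWatch computes p95 from raw samples."""
    cloudwatch.put_metric_data(
        Namespace="MangaAssist/Search",
        MetricData=[{
            "MetricName": name,
            "Value": value_ms,
            "Unit": "Milliseconds",
            "Dimensions": [{"Name": "Stage", "Value": stage}],
        }],
    )

# e.g., after timing the fusion step:
# emit_search_metric("FusionLatency", elapsed_ms, stage="fusion")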
A/B Testing Search Configurations
When tuning hybrid search parameters (weights, fusion method, re-ranking strategy), run A/B tests.
| Experiment | Metric | Control | Variant | Result |
|---|---|---|---|---|
| RRF vs Linear fusion | NDCG@5 | Linear (α=0.6) → 0.80 | RRF (k=60) → 0.82 | RRF wins (+2.5%) |
| kNN weight 0.6 vs 0.7 | CTR | α=0.6 → CTR 18.2% | α=0.7 → CTR 17.8% | 0.6 wins (0.7 over-indexes semantic) |
| Haiku re-rank vs cross-encoder | NDCG@5 | Cross-encoder → 0.88 | Haiku → 0.91 | Haiku wins (+3.4%) but costs $100/day |
| With recency boost vs without | CTR on new releases | Without → 12.1% | With (1.2x) → 16.8% | With wins (+39% CTR lift) |
Key Takeaways
- Hybrid search is mandatory — no single retrieval method handles both exact lookups ("Demon Slayer Vol 23") and semantic discovery ("adventure manga like One Piece").
- RRF is the safest default fusion — rank-based, no normalization needed, robust to score scale changes. Switch to learned combination only with sufficient labeled data.
- Intent detection drives scoring configuration — buy intent needs BM25-heavy weights and in-stock boosts; recommend intent needs kNN-heavy weights and diversity.
- Re-ranking is the highest-leverage optimization — in MangaAssist's A/B tests, a cross-encoder over the top-20 lifted NDCG@5 from 0.82 to 0.88 for 30-50ms of added latency, and Haiku re-ranking reached 0.91, but requires careful latency and cost budgeting.
- Measure retrieval quality continuously — NDCG@5, MRR, and CTR should be tracked weekly. Degradation in these metrics signals embedding drift, synonym gaps, or scoring misconfiguration.
- Query preprocessing is essential for Japanese — without kuromoji tokenization and synonym expansion, queries like "shonen" will miss "少年" indexed content entirely.