
Hybrid Search and Custom Scoring for Manga Product Retrieval

AWS AIP-C01 Task 4.2 → Skill 4.2.2: Optimize retrieval mechanisms for FM-augmented applications
System: MangaAssist e-commerce chatbot — Bedrock Claude 3 Sonnet/Haiku, OpenSearch Serverless (HNSW k-NN), DynamoDB, ECS Fargate, ElastiCache Redis
Focus: Deep-dive into hybrid search implementation, score fusion mathematics, custom scoring functions, re-ranking with Bedrock, and retrieval evaluation metrics


Skill Mapping

| AWS AIP-C01 Element | Coverage |
| --- | --- |
| Task 4.2 | Optimize application performance for FM workloads |
| Skill 4.2.2 | Optimize retrieval mechanisms to improve FM-augmented application performance |
| This File | Hybrid search scoring, fusion methods, re-ranking with Bedrock, evaluation methodology |
| MangaAssist Context | Combining BM25 (exact manga titles/ISBNs) + kNN (semantic "manga like One Piece") to maximize both precision and recall across 100K+ products |

Why Hybrid Search? — The BM25 vs kNN Tradeoff

The Core Problem

MangaAssist receives two fundamentally different query types that require two fundamentally different retrieval approaches.

graph TB
    subgraph ExactQueries["Exact Queries — BM25 Wins"]
        style ExactQueries fill:#2ecc71,stroke:#16213e,color:#fff
        Q1["'鬼滅の刃 23巻'<br/>Demon Slayer Vol 23"]
        Q2["'ISBN 978-4088820...']
        Q3["'Eiichiro Oda'"]
        Q4["'Weekly Shonen Jump<br/>2024 Issue 15'"]
    end

    subgraph SemanticQueries["Semantic Queries — kNN Wins"]
        style SemanticQueries fill:#3498db,stroke:#16213e,color:#fff
        Q5["'adventure manga<br/>with pirates'"]
        Q6["'something dark and<br/>philosophical'"]
        Q7["'manga for someone<br/>who loved Naruto'"]
        Q8["'feel-good slice of life<br/>romance'"]
    end

    subgraph HybridQueries["Mixed Queries — Hybrid Wins"]
        style HybridQueries fill:#e94560,stroke:#16213e,color:#fff
        Q9["'manga like ワンピース<br/>with good art'"]
        Q10["'shonen manga about<br/>cooking competitions'"]
        Q11["'new releases similar<br/>to Attack on Titan'"]
    end

    Q1 & Q2 & Q3 & Q4 --> BM25["BM25<br/>Keyword Match"]
    Q5 & Q6 & Q7 & Q8 --> KNN["kNN Vector<br/>Semantic Similarity"]
    Q9 & Q10 & Q11 --> HYBRID["Hybrid<br/>BM25 + kNN"]

Head-to-Head Comparison

| Dimension | BM25 (Keyword) | kNN (Vector) | Hybrid |
| --- | --- | --- | --- |
| Exact title match | Excellent — direct term matching | Poor — semantic drift from exact surface forms | Excellent — BM25 component handles this |
| Semantic similarity | Poor — "pirate adventure" does not match "One Piece" | Excellent — embedding captures meaning | Excellent — kNN component handles this |
| Author lookup | Excellent — keyword match on name field | Moderate — depends on embedding training data | Excellent |
| Cross-language | Poor — "冒険" does not match "adventure" in BM25 | Good — multilingual embeddings bridge languages | Good |
| Handling typos | Poor — "Narruto" misses "Naruto" | Good — embeddings are robust to misspellings | Good |
| Novel/unseen terms | Zero recall for OOV terms | Moderate — nearest semantic neighbor | Moderate |
| Latency | 15-25ms | 30-50ms | 40-60ms (parallel) |
| Explainability | High — TF-IDF scores are interpretable | Low — cosine similarity is opaque | Medium |

Score Fusion Methods

The central challenge of hybrid search: BM25 scores (unbounded, typically 5-50) and kNN scores (cosine similarity, 0.0-1.0) are on completely different scales. Score fusion resolves this.

Method 1: Reciprocal Rank Fusion (RRF)

RRF is rank-based, not score-based. It sidesteps the score normalization problem entirely.

Formula:

RRF(d) = Σ  1 / (k + rank_i(d))
         i∈{bm25, knn}

Where k is a smoothing constant (default 60) and rank_i(d) is the rank of document d in ranker i (1-indexed).

Properties:
- Score-scale agnostic: works regardless of BM25 vs kNN score ranges
- Documents appearing in both result lists receive a contribution from each ranker, rewarding consensus
- The smoothing constant k=60 prevents top-ranked documents from dominating

graph TB
    subgraph BM25Ranks["BM25 Rankings"]
        B1["Rank 1: Naruto Vol 72<br/>BM25=38.2"]
        B2["Rank 2: One Piece Vol 100<br/>BM25=31.7"]
        B3["Rank 3: Dragon Ball Vol 42<br/>BM25=28.1"]
        B4["Rank 4: Bleach Vol 74<br/>BM25=22.5"]
    end

    subgraph KNNRanks["kNN Rankings"]
        K1["Rank 1: One Piece Vol 100<br/>cos=0.94"]
        K2["Rank 2: Fairy Tail Vol 63<br/>cos=0.91"]
        K3["Rank 3: Naruto Vol 72<br/>cos=0.88"]
        K4["Rank 4: Black Clover Vol 33<br/>cos=0.85"]
    end

    subgraph RRFScores["RRF Fused Scores (k=60)"]
        style RRFScores fill:#e94560,stroke:#16213e,color:#fff
        R1["One Piece Vol 100<br/>1/(60+2) + 1/(60+1) = 0.0325<br/>← Appeared in BOTH lists"]
        R2["Naruto Vol 72<br/>1/(60+1) + 1/(60+3) = 0.0322"]
        R3["Fairy Tail Vol 63<br/>0 + 1/(60+2) = 0.0161"]
        R4["Dragon Ball Vol 42<br/>1/(60+3) + 0 = 0.0159"]
    end

    B1 & B2 & K1 & K3 --> R1
    B1 & K3 --> R2
    K2 --> R3
    B3 --> R4

Method 2: Weighted Linear Combination

Requires min-max normalization to bring both score distributions onto [0, 1].

Formula:

score(d) = α × norm_knn(d) + β × norm_bm25(d)

where:
  norm(s) = (s - min) / (max - min)
  α + β = 1.0

MangaAssist defaults: α = 0.6 (kNN weight), β = 0.4 (BM25 weight) — tuned on a labeled relevance set of 500 manga queries.
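
A quick worked example, reusing the candidate scores from the RRF diagram above: One Piece Vol 100 has BM25 = 31.7 (list range 22.5-38.2) and cosine = 0.94 (list range 0.85-0.94):

norm_bm25 = (31.7 - 22.5) / (38.2 - 22.5) ≈ 0.59
norm_knn  = (0.94 - 0.85) / (0.94 - 0.85) = 1.00
score     = 0.6 × 1.00 + 0.4 × 0.59 ≈ 0.83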

Method 3: Learned Score Combination

Train a lightweight model (logistic regression or small neural net) to combine BM25 score, kNN score, and metadata features into a single relevance score.

Features used:
- Normalized BM25 score
- Cosine similarity (kNN score)
- Query-document genre match (binary)
- Title overlap (Jaccard similarity)
- Document recency (days since release)
- Document popularity (log sales rank)

Advantage: can learn non-linear interactions (e.g., "for buy-intent queries, weight BM25 higher"). Disadvantage: requires labeled training data and periodic retraining.
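
To make Method 3 concrete, here is a minimal sketch of a logistic-regression combiner over the six features above, assuming scikit-learn; the training rows and the learned_score helper are illustrative, not MangaAssist's production model.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Feature order mirrors the list above:
# [norm_bm25, cos_sim, genre_match, title_jaccard, days_since_release, log_sales_rank]
X_train = np.array([
    [0.91, 0.94, 1.0, 0.80, 12.0, 2.1],    # labeled relevant
    [0.15, 0.42, 0.0, 0.05, 400.0, 9.3],   # labeled irrelevant
    # ... 500+ labeled (query, document) pairs in practice
])
y_train = np.array([1, 0])

model = LogisticRegression().fit(X_train, y_train)

def learned_score(features: list[float]) -> float:
    """Relevance probability for one (query, document) feature vector."""
    return float(model.predict_proba([features])[0][1])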

Fusion Method Comparison

| Method | NDCG@5 on MangaAssist | Latency Overhead | Requires Training Data | Robustness |
| --- | --- | --- | --- | --- |
| RRF (k=60) | 0.82 | < 1ms | No | High — rank-based, no normalization needed |
| Weighted Linear (α=0.6) | 0.80 | < 1ms | No (but weights need tuning) | Medium — sensitive to score distribution shifts |
| Learned Combination | 0.87 | 2-5ms | Yes (500+ labeled queries) | Medium — needs retraining as catalog changes |

Recommendation for MangaAssist: Start with RRF as the default. Move to learned combination only after accumulating sufficient click-through / relevance judgment data.


Custom Scoring for MangaAssist

Beyond fusion scores, MangaAssist applies business-aware scoring boosts. These operate as multiplicative factors on the fused score.

Scoring Dimensions

graph LR
    FUSED["Fused Score<br/>(RRF or Linear)"] --> RECENCY["Recency Boost<br/>New releases +20%"]
    RECENCY --> POPULARITY["Popularity Boost<br/>High-rated +15%"]
    POPULARITY --> GENRE["Genre Match<br/>Intent-aligned +10%"]
    GENRE --> AVAILABILITY["Availability<br/>In-stock +10%"]
    AVAILABILITY --> FINAL["Final Score"]

Boost Factor Configuration

| Boost | Condition | Multiplier | Rationale |
| --- | --- | --- | --- |
| Recency — New Release | Released within last 30 days | 1.20 | New manga drives engagement; customers expect fresh recommendations |
| Recency — Recent | Released within last 90 days | 1.10 | Still relevant, moderate boost |
| Recency — Trending | Sales rank improved > 50% in last 7 days | 1.15 | Capture viral/trending titles |
| Popularity — Top Rated | avg_rating >= 4.5 and rating_count >= 50 | 1.15 | High confidence in quality |
| Popularity — Well Rated | avg_rating >= 4.0 and rating_count >= 20 | 1.05 | Moderate confidence |
| Genre Match | Query intent matches document genre/demographic | 1.10 | "shonen recommendation" should surface shonen manga |
| Availability — In Stock | in_stock = true (for buy intent) | 1.10 | Avoid frustrating buy-intent users with OOS results |
| Availability — Preorder | release_date in future (for browse intent) | 1.05 | Surface upcoming titles for browsers |
| Price Range | Within user's stated price range | 1.05 | Only if explicit price constraint detected |

Intent-Aware Scoring Matrix

Different user intents require different boost profiles.

| Factor | Browse Intent | Buy Intent | Recommend Intent | Research Intent |
| --- | --- | --- | --- | --- |
| Recency boost | High (1.20) | Medium (1.10) | Medium (1.10) | Low (1.00) |
| Popularity boost | Medium (1.10) | Low (1.00) | High (1.15) | Low (1.00) |
| In-stock boost | None (1.00) | High (1.15) | None (1.00) | None (1.00) |
| Genre match boost | Medium (1.10) | Low (1.05) | High (1.15) | Medium (1.10) |
| Review count boost | Low (1.00) | Low (1.00) | Medium (1.10) | High (1.15) |
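
Operationally, these profiles can live in a plain configuration mapping so the scorer stays data-driven. A minimal sketch mirroring the matrix above (the key names are illustrative):

# Intent → boost-profile mapping, taken from the matrix above.
# A multiplier of 1.00 means "no boost" for that intent.
INTENT_BOOST_PROFILES: dict[str, dict[str, float]] = {
    "browse":    {"recency": 1.20, "popularity": 1.10, "in_stock": 1.00,
                  "genre_match": 1.10, "review_count": 1.00},
    "buy":       {"recency": 1.10, "popularity": 1.00, "in_stock": 1.15,
                  "genre_match": 1.05, "review_count": 1.00},
    "recommend": {"recency": 1.10, "popularity": 1.15, "in_stock": 1.00,
                  "genre_match": 1.15, "review_count": 1.10},
    "research":  {"recency": 1.00, "popularity": 1.00, "in_stock": 1.00,
                  "genre_match": 1.10, "review_count": 1.15},
}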

OpenSearch Serverless Query DSL — Hybrid Query Examples

Example 1: Basic Hybrid Query with RRF

This uses OpenSearch's native hybrid search with the search_pipeline feature; the example searches for adventure shonen manga about pirates. (OpenSearch rejects unrecognized top-level keys in the _search body, so annotations must live outside the JSON.)

{
  "comment": "MangaAssist hybrid search: adventure shonen manga",
  "size": 50,
  "_source": ["manga_id", "title_ja", "title_en", "genre", "avg_rating", "release_date"],
  "query": {
    "hybrid": {
      "queries": [
        {
          "multi_match": {
            "query": "adventure shonen manga pirates 冒険 少年 漫画",
            "fields": ["title_ja^3", "title_en^3", "description", "tags^1.5", "genre"],
            "type": "cross_fields",
            "minimum_should_match": "30%"
          }
        },
        {
          "knn": {
            "description_embedding": {
              "vector": [0.023, -0.118, 0.045, "... 1536 dims ..."],
              "k": 50
            }
          }
        }
      ]
    }
  },
  "search_pipeline": "manga-hybrid-rrf-pipeline"
}

Search pipeline definition (created once via the OpenSearch search pipeline API). Note that despite the pipeline's name, the normalization-processor below implements min-max normalization with weighted arithmetic-mean combination, i.e., Method 2 above rather than true RRF; newer OpenSearch releases also ship a dedicated RRF processor (score-ranker-processor) if pure rank-based fusion is required:

{
  "description": "MangaAssist RRF hybrid search pipeline",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": { "technique": "min_max" },
        "combination": {
          "technique": "arithmetic_mean",
          "parameters": { "weights": [0.4, 0.6] }
        }
      }
    }
  ]
}
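
For reference, the pipeline is registered once with a single request against the collection endpoint; the name must match the search_pipeline parameter used in the queries:

PUT /_search/pipeline/manga-hybrid-rrf-pipeline
{
  ... pipeline definition above ...
}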

Example 2: Filtered Hybrid Query (Buy Intent)

Buy intent: the requested manga must be in stock and priced at or under 1,000 JPY.

{
  "size": 20,
  "_source": ["manga_id", "title_ja", "title_en", "price_jpy", "in_stock", "genre"],
  "query": {
    "hybrid": {
      "queries": [
        {
          "bool": {
            "must": [
              {
                "multi_match": {
                  "query": "鬼滅の刃 最新刊",
                  "fields": ["title_ja^5", "title_en^3", "description"],
                  "type": "best_fields"
                }
              }
            ],
            "filter": [
              { "term": { "in_stock": true } },
              { "range": { "price_jpy": { "lte": 1000 } } }
            ]
          }
        },
        {
          "knn": {
            "description_embedding": {
              "vector": [0.031, -0.092, "... 1536 dims ..."],
              "k": 20,
              "filter": {
                "bool": {
                  "filter": [
                    { "term": { "in_stock": true } },
                    { "range": { "price_jpy": { "lte": 1000 } } }
                  ]
                }
              }
            }
          }
        }
      ]
    }
  },
  "search_pipeline": "manga-hybrid-rrf-pipeline"
}

Example 3: Recommendation Query (Semantic-Heavy)

Recommend intent: semantic similarity to "manga like One Piece with good art".

{
  "size": 50,
  "_source": ["manga_id", "title_ja", "title_en", "genre", "avg_rating", "author"],
  "query": {
    "hybrid": {
      "queries": [
        {
          "multi_match": {
            "query": "ワンピース One Piece 冒険 adventure shonen art 画力",
            "fields": ["title_ja^2", "title_en^2", "tags^2", "genre", "description"],
            "type": "cross_fields",
            "minimum_should_match": "20%"
          }
        },
        {
          "knn": {
            "description_embedding": {
              "vector": [0.012, -0.067, "... 1536 dims ..."],
              "k": 50
            }
          }
        }
      ]
    }
  },
  "search_pipeline": "manga-hybrid-recommend-pipeline"
}

Recommendation pipeline — higher kNN weight:

{
  "description": "MangaAssist recommendation-tuned hybrid pipeline",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": { "technique": "min_max" },
        "combination": {
          "technique": "arithmetic_mean",
          "parameters": { "weights": [0.25, 0.75] }
        }
      }
    }
  ]
}

Re-ranking with Bedrock

Why Re-rank?

Hybrid search retrieves a broad candidate set (top-50), but ranking quality degrades past the top-10. Re-ranking applies a more expensive but accurate relevance model to distill top-50 down to top-5.

graph TB
    subgraph Retrieval["Hybrid Search (80ms)"]
        style Retrieval fill:#0f3460,stroke:#16213e,color:#fff
        SEARCH["Top-50 Candidates<br/>from RRF hybrid"]
    end

    subgraph Reranking["Re-ranking (60ms budget)"]
        style Reranking fill:#e94560,stroke:#16213e,color:#fff
        HEURISTIC["Stage 1: Heuristic Scorer<br/>Recency + Popularity + Intent<br/>(5ms) → Top-20"]
        CROSS["Stage 2: Cross-Encoder<br/>or Claude Haiku Reranker<br/>(50ms) → Top-5"]
    end

    subgraph Output["RAG Context"]
        style Output fill:#2ecc71,stroke:#16213e,color:#fff
        TOP5["Top-5 Results<br/>→ Claude 3 Sonnet prompt"]
    end

    SEARCH --> HEURISTIC --> CROSS --> TOP5

Claude Haiku as Re-ranker

For high-value queries (buy intent detected, returning customer), MangaAssist can use Claude Haiku to score query-document relevance.

Prompt template:

You are a manga relevance scorer for an e-commerce search engine.

Query: {query}
User intent: {intent}

Rate each manga result from 0.0 (irrelevant) to 1.0 (perfect match).
Consider: title relevance, genre match, thematic similarity, and user intent.

Results to score:
{formatted_results}

Return ONLY a JSON array of scores in the same order:
[0.92, 0.85, 0.71, ...]

Cost analysis: Haiku re-ranking of 20 candidates costs ~$0.001 per query (input: ~2K tokens, output: ~100 tokens). At 100K queries/day = ~$100/day.

Python — ScoreFusion

"""
MangaAssist Score Fusion
Implements RRF, weighted linear, and learned score combination.
"""

from dataclasses import dataclass
from typing import Protocol


@dataclass
class ScoredDocument:
    """Document with scores from multiple rankers."""
    doc_id: str
    title: str
    bm25_score: float = 0.0
    bm25_rank: int = 0
    knn_score: float = 0.0
    knn_rank: int = 0
    fused_score: float = 0.0
    metadata: dict | None = None

    def __post_init__(self):
        if self.metadata is None:
            self.metadata = {}


class FusionStrategy(Protocol):
    """Protocol for score fusion strategies."""
    def fuse(
        self,
        bm25_results: list[ScoredDocument],
        knn_results: list[ScoredDocument],
    ) -> list[ScoredDocument]: ...


class ScoreFusion:
    """
    Orchestrates score fusion for MangaAssist hybrid search.

    Supports two in-process strategies:
    - "rrf": Reciprocal Rank Fusion (default, most robust)
    - "linear": weighted linear combination of normalized scores

    The learned combination (Method 3) is trained offline and is not
    implemented in this class; fuse() raises ValueError for unknown names.

    Usage:
        fusion = ScoreFusion(strategy="rrf", rrf_k=60)
        results = fusion.fuse(bm25_results, knn_results)
    """

    def __init__(
        self,
        strategy: str = "rrf",
        rrf_k: int = 60,
        bm25_weight: float = 0.4,
        knn_weight: float = 0.6,
    ):
        self.strategy = strategy
        self.rrf_k = rrf_k
        self.bm25_weight = bm25_weight
        self.knn_weight = knn_weight

    def fuse(
        self,
        bm25_results: list[ScoredDocument],
        knn_results: list[ScoredDocument],
    ) -> list[ScoredDocument]:
        """Fuse BM25 and kNN results using the configured strategy."""
        if self.strategy == "rrf":
            return self._rrf_fuse(bm25_results, knn_results)
        elif self.strategy == "linear":
            return self._linear_fuse(bm25_results, knn_results)
        else:
            raise ValueError(f"Unknown strategy: {self.strategy}")

    def _rrf_fuse(
        self,
        bm25_results: list[ScoredDocument],
        knn_results: list[ScoredDocument],
    ) -> list[ScoredDocument]:
        """
        Reciprocal Rank Fusion.

        RRF(d) = Σ 1/(k + rank_i(d))

        Rank-based: immune to score scale differences.
        k=60 is the original paper's recommendation.
        """
        k = self.rrf_k
        score_map: dict[str, float] = {}
        doc_map: dict[str, ScoredDocument] = {}

        # BM25 contribution
        for rank, doc in enumerate(bm25_results, start=1):
            rrf_contribution = 1.0 / (k + rank)
            score_map[doc.doc_id] = score_map.get(
                doc.doc_id, 0.0
            ) + rrf_contribution
            doc.bm25_rank = rank
            doc_map[doc.doc_id] = doc

        # kNN contribution
        for rank, doc in enumerate(knn_results, start=1):
            rrf_contribution = 1.0 / (k + rank)
            score_map[doc.doc_id] = score_map.get(
                doc.doc_id, 0.0
            ) + rrf_contribution
            doc.knn_rank = rank
            if doc.doc_id not in doc_map:
                doc_map[doc.doc_id] = doc
            else:
                doc_map[doc.doc_id].knn_rank = rank
                doc_map[doc.doc_id].knn_score = doc.knn_score

        # Build fused result list
        fused: list[ScoredDocument] = []
        for doc_id, score in score_map.items():
            doc = doc_map[doc_id]
            doc.fused_score = score
            fused.append(doc)

        fused.sort(key=lambda d: d.fused_score, reverse=True)
        return fused

    def _linear_fuse(
        self,
        bm25_results: list[ScoredDocument],
        knn_results: list[ScoredDocument],
    ) -> list[ScoredDocument]:
        """
        Weighted linear combination with min-max normalization.

        score(d) = α * norm_knn(d) + β * norm_bm25(d)
        """
        # Min-max normalize BM25 scores
        bm25_scores = [d.bm25_score for d in bm25_results]
        bm25_min = min(bm25_scores) if bm25_scores else 0
        bm25_max = max(bm25_scores) if bm25_scores else 1
        bm25_range = bm25_max - bm25_min or 1.0

        # Min-max normalize kNN scores
        knn_scores = [d.knn_score for d in knn_results]
        knn_min = min(knn_scores) if knn_scores else 0
        knn_max = max(knn_scores) if knn_scores else 1
        knn_range = knn_max - knn_min or 1.0

        score_map: dict[str, float] = {}
        doc_map: dict[str, ScoredDocument] = {}

        for doc in bm25_results:
            norm_score = (doc.bm25_score - bm25_min) / bm25_range
            score_map[doc.doc_id] = self.bm25_weight * norm_score
            doc_map[doc.doc_id] = doc

        for doc in knn_results:
            norm_score = (doc.knn_score - knn_min) / knn_range
            score_map[doc.doc_id] = score_map.get(
                doc.doc_id, 0.0
            ) + self.knn_weight * norm_score
            if doc.doc_id not in doc_map:
                doc_map[doc.doc_id] = doc

        fused: list[ScoredDocument] = []
        for doc_id, score in score_map.items():
            doc = doc_map[doc_id]
            doc.fused_score = score
            fused.append(doc)

        fused.sort(key=lambda d: d.fused_score, reverse=True)
        return fused


# ---- Convenience factory ----

def create_fusion(
    intent: str = "browse",
) -> ScoreFusion:
    """
    Create a ScoreFusion instance with intent-aware weights.
    - browse/research: balanced (RRF default)
    - recommend: kNN-heavy (0.75 kNN)
    - buy: BM25-heavy (0.6 BM25) for exact match priority
    """
    if intent == "recommend":
        return ScoreFusion(
            strategy="linear", bm25_weight=0.25, knn_weight=0.75
        )
    elif intent == "buy":
        return ScoreFusion(
            strategy="linear", bm25_weight=0.60, knn_weight=0.40
        )
    else:
        return ScoreFusion(strategy="rrf", rrf_k=60)
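
A quick usage sketch, reusing the rankings from the RRF diagram above (the document IDs are illustrative); it reproduces the fused ordering shown there:

if __name__ == "__main__":
    bm25 = [
        ScoredDocument("naruto-72", "Naruto Vol 72", bm25_score=38.2),
        ScoredDocument("onepiece-100", "One Piece Vol 100", bm25_score=31.7),
        ScoredDocument("dragonball-42", "Dragon Ball Vol 42", bm25_score=28.1),
        ScoredDocument("bleach-74", "Bleach Vol 74", bm25_score=22.5),
    ]
    knn = [
        ScoredDocument("onepiece-100", "One Piece Vol 100", knn_score=0.94),
        ScoredDocument("fairytail-63", "Fairy Tail Vol 63", knn_score=0.91),
        ScoredDocument("naruto-72", "Naruto Vol 72", knn_score=0.88),
        ScoredDocument("blackclover-33", "Black Clover Vol 33", knn_score=0.85),
    ]
    for doc in ScoreFusion(strategy="rrf", rrf_k=60).fuse(bm25, knn):
        print(f"{doc.fused_score:.4f}  {doc.title}")
    # 0.0325 One Piece Vol 100: present in both lists, it beats
    # Naruto's single #1 rank (0.0323).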

Python — RelevanceReranker

"""
MangaAssist Relevance Re-ranker
Two-stage re-ranking: heuristic scorer → cross-encoder or LLM reranker.
Budget: 60ms total (5ms heuristic + 50ms cross-encoder).
"""

import datetime
import json
from dataclasses import dataclass, field

import boto3
from botocore.exceptions import ClientError


@dataclass
class RankedResult:
    """A search result with re-ranking metadata."""
    doc_id: str
    title: str
    original_rank: int
    original_score: float
    heuristic_score: float = 0.0
    rerank_score: float = 0.0
    final_score: float = 0.0
    metadata: dict = field(default_factory=dict)


class RelevanceReranker:
    """
    Two-stage re-ranking pipeline for MangaAssist.

    Stage 1: Heuristic scoring (< 5ms)
      - Recency boost, popularity boost, genre match, availability
      - Reduces top-50 → top-20

    Stage 2: Model-based re-ranking (< 50ms)
      - Option A: Cross-encoder model (self-hosted, ~30ms for 20 docs)
      - Option B: Claude Haiku via Bedrock (~80ms, higher quality)
      - Reduces top-20 → top-5

    Usage:
        reranker = RelevanceReranker(
            region="ap-northeast-1",
            rerank_method="haiku",
        )
        top5 = reranker.rerank(query, intent, candidates, top_k=5)
    """

    def __init__(
        self,
        region: str = "ap-northeast-1",
        rerank_method: str = "heuristic",  # "heuristic", "haiku"
        heuristic_cutoff: int = 20,
    ):
        self.region = region
        self.rerank_method = rerank_method
        self.heuristic_cutoff = heuristic_cutoff

        if rerank_method == "haiku":
            self.bedrock = boto3.client(
                "bedrock-runtime", region_name=region
            )

    def rerank(
        self,
        query: str,
        intent: str,
        candidates: list[RankedResult],
        top_k: int = 5,
    ) -> list[RankedResult]:
        """
        Full re-ranking pipeline.

        Returns top_k results sorted by final_score.
        """
        # Stage 1: Heuristic scoring → top-N
        heuristic_ranked = self._heuristic_score(candidates, intent)
        heuristic_ranked.sort(
            key=lambda r: r.heuristic_score, reverse=True
        )
        stage1_results = heuristic_ranked[:self.heuristic_cutoff]

        # Stage 2: Model-based re-ranking → top_k
        if self.rerank_method == "haiku" and len(stage1_results) > top_k:
            final = self._haiku_rerank(query, intent, stage1_results)
        else:
            final = stage1_results

        final.sort(key=lambda r: r.final_score, reverse=True)
        return final[:top_k]

    # ---- Stage 1: Heuristic Scoring ----

    def _heuristic_score(
        self,
        candidates: list[RankedResult],
        intent: str,
    ) -> list[RankedResult]:
        """
        Fast heuristic scoring using metadata signals.
        Target: < 5ms for 50 candidates.
        """
        now = datetime.date.today()

        for r in candidates:
            score = r.original_score
            meta = r.metadata

            # Recency
            release_str = meta.get("release_date", "")
            if release_str:
                try:
                    release = datetime.date.fromisoformat(release_str[:10])
                    days_old = (now - release).days
                    if days_old <= 30:
                        score *= 1.20
                    elif days_old <= 90:
                        score *= 1.10
                except (ValueError, TypeError):
                    pass

            # Popularity
            rating = meta.get("avg_rating", 0.0)
            rating_count = meta.get("rating_count", 0)
            if rating >= 4.5 and rating_count >= 50:
                score *= 1.15
            elif rating >= 4.0 and rating_count >= 20:
                score *= 1.05

            # Intent-aware
            if intent == "buy":
                if meta.get("in_stock", False):
                    score *= 1.15
            elif intent == "recommend":
                if rating >= 4.0:
                    score *= 1.10
            elif intent == "research":
                if meta.get("review_count", 0) > 100:
                    score *= 1.10

            r.heuristic_score = score
            r.final_score = score  # default; may be overridden by stage 2

        return candidates

    # ---- Stage 2: Claude Haiku Re-ranking ----

    def _haiku_rerank(
        self,
        query: str,
        intent: str,
        candidates: list[RankedResult],
    ) -> list[RankedResult]:
        """
        Use Claude 3 Haiku to score query-document relevance.
        Cost: ~$0.001 per call (2K input + 100 output tokens).
        Latency: 80-150ms.
        """
        # Format candidates for the prompt
        formatted = ""
        for i, r in enumerate(candidates):
            formatted += (
                f"{i+1}. [{r.doc_id}] {r.title} "
                f"(Genre: {r.metadata.get('genre', 'N/A')}, "
                f"Rating: {r.metadata.get('avg_rating', 'N/A')})\n"
            )

        prompt = f"""You are a manga relevance scorer for a Japanese e-commerce search engine.

Query: {query}
User intent: {intent}

Rate each manga result from 0.0 (completely irrelevant) to 1.0 (perfect match).
Consider: title relevance, genre match, thematic similarity, and user intent.

Results to score:
{formatted}

Return ONLY a JSON array of scores in the same order, e.g., [0.92, 0.85, 0.71, ...]:"""

        try:
            response = self.bedrock.invoke_model(
                modelId="anthropic.claude-3-haiku-20240307-v1:0",
                contentType="application/json",
                accept="application/json",
                body=json.dumps({
                    "anthropic_version": "bedrock-2023-05-31",
                    "max_tokens": 200,
                    "temperature": 0.0,
                    "messages": [
                        {"role": "user", "content": prompt}
                    ],
                }),
            )

            result = json.loads(response["body"].read())
            text = result["content"][0]["text"].strip()
            scores = json.loads(text)

            # Apply Haiku scores (guarding against an all-zero heuristic pass)
            max_heuristic = max(
                (c.heuristic_score for c in candidates), default=0.0
            ) or 1.0
            for i, r in enumerate(candidates):
                if i < len(scores):
                    r.rerank_score = float(scores[i])
                    # Blend: 70% Haiku + 30% heuristic (normalized)
                    r.final_score = (
                        0.7 * r.rerank_score
                        + 0.3 * (r.heuristic_score / max_heuristic)
                    )
                else:
                    r.final_score = r.heuristic_score

        except (ClientError, ValueError, KeyError, IndexError):
            # Fall back to heuristic scores if the Bedrock call fails
            # or the model returns malformed JSON
            for r in candidates:
                r.final_score = r.heuristic_score

        return candidates
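
A usage sketch feeding fused candidates into the two-stage pipeline (the doc ID and metadata values are illustrative); with rerank_method="heuristic" no Bedrock call is made:

if __name__ == "__main__":
    candidates = [
        RankedResult(
            doc_id="onepiece-100", title="One Piece Vol 100",
            original_rank=1, original_score=0.0325,
            metadata={"release_date": "2024-03-04", "avg_rating": 4.8,
                      "rating_count": 320, "in_stock": True},
        ),
        # ... remaining top-50 candidates from ScoreFusion
    ]
    reranker = RelevanceReranker(rerank_method="heuristic")
    for r in reranker.rerank("ワンピースみたいな冒険漫画", "recommend",
                             candidates, top_k=5):
        print(f"{r.final_score:.4f}  {r.title}")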

Evaluation Methodology

Core Retrieval Quality Metrics

| Metric | Formula | What It Measures | MangaAssist Target |
| --- | --- | --- | --- |
| NDCG@K (Normalized Discounted Cumulative Gain) | DCG@K / IDCG@K, where DCG = Σ (2^rel - 1) / log₂(rank + 1) | Quality of ranking order; penalizes relevant docs ranked low | NDCG@5 >= 0.85 |
| MAP (Mean Average Precision) | (1/Q) Σ AP(q), where AP = (1/R) Σ P@k × rel(k) | Average precision across all queries; rewards finding ALL relevant docs | MAP >= 0.80 |
| MRR (Mean Reciprocal Rank) | (1/Q) Σ 1/rank(first_relevant) | How quickly the first relevant result appears | MRR >= 0.85 |
| Recall@K | \|relevant ∩ retrieved@K\| / \|relevant\| | Fraction of all relevant documents found in top K | Recall@50 >= 0.95 |
| Precision@K | \|relevant ∩ retrieved@K\| / K | Fraction of retrieved documents that are relevant | Precision@5 >= 0.80 |
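
A minimal NDCG@K implementation matching the formula in the table above; the graded labels in the demo are illustrative:

import math

def ndcg_at_k(relevances: list[float], k: int = 5) -> float:
    """NDCG@K = DCG@K / IDCG@K over graded labels in ranked order."""
    def dcg(rels: list[float]) -> float:
        return sum(
            (2 ** rel - 1) / math.log2(rank + 1)
            for rank, rel in enumerate(rels[:k], start=1)
        )
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical top-5 with graded labels (3 = perfect, 0 = irrelevant)
print(round(ndcg_at_k([3, 2, 3, 0, 1], k=5), 2))  # 0.96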

Before/After: Hybrid Search Improving Manga Discovery

Example 1: Semantic Recommendation Query

Query: "冒険が好きで、ワンピースみたいな漫画が読みたい" (I like adventure; I want to read manga like One Piece)

| Rank | BM25 Only | kNN Only | Hybrid (RRF) + Re-rank |
| --- | --- | --- | --- |
| 1 | One Piece Vol 100 (exact title match) | Fairy Tail (adventure, similar themes) | One Piece Vol 100 (exact + semantic) |
| 2 | One Piece Vol 99 | Magi: Labyrinth of Magic | Fairy Tail (semantic, high rating) |
| 3 | One Piece Vol 98 | Hunter x Hunter | Hunter x Hunter (semantic, trending) |
| 4 | One Piece Episode A (spinoff) | Seven Deadly Sins | Magi (semantic, genre match) |
| 5 | One Piece Film Red (movie tie-in) | Black Clover | Seven Deadly Sins (new volume) |
| NDCG@5 | 0.52 (too many One Piece volumes) | 0.74 (good diversity, missing exact) | 0.91 (exact match #1 + diverse recs) |

BM25 returns multiple volumes of One Piece because they all match the title keyword. kNN returns semantically similar titles but misses the specific One Piece mention. Hybrid + re-rank gets the best of both worlds.

Example 2: Exact Product Lookup

Query: "鬼滅の刃 23巻 購入" (Demon Slayer Volume 23 purchase)

| Rank | BM25 Only | kNN Only | Hybrid + Re-rank |
| --- | --- | --- | --- |
| 1 | Demon Slayer Vol 23 | Demon Slayer Vol 23 | Demon Slayer Vol 23 (in stock) |
| 2 | Demon Slayer Vol 22 | Demon Slayer Vol 22 | Demon Slayer Box Set (buy intent boost) |
| 3 | Demon Slayer Vol 21 | Jujutsu Kaisen Vol 20 | Demon Slayer Vol 22 (bundle suggestion) |
| 4 | Demon Slayer Artbook | Chainsaw Man Vol 15 | Demon Slayer Artbook (cross-sell) |
| 5 | Demon Slayer Guide | Tokyo Ghoul Vol 14 | Koyoharu Gotouge Art Collection |
| NDCG@5 | 0.88 (strong on exact, weak cross-sell) | 0.61 (semantic drift) | 0.94 (exact + smart cross-sell) |

For buy intent, BM25 dominates because the query contains the exact title. But hybrid search still wins by adding intent-aware cross-sell suggestions (box set, artbook) that pure BM25 ranks lower.

Example 3: Japanese Tokenization Challenge

Query: "進撃の巨人の作者の他の作品" (Other works by the author of Attack on Titan)

| Rank | BM25 Only | kNN Only | Hybrid + Preprocessing |
| --- | --- | --- | --- |
| 1 | Attack on Titan Vol 34 | Attack on Titan Vol 34 | Attack on Titan (anchor result) |
| 2 | Attack on Titan Vol 33 | Attack on Titan: Before the Fall | Hajime Isayama Interview Book |
| 3 | Attack on Titan: Junior High | Vinland Saga (thematic sim) | Attack on Titan: Lost Girls |
| 4 | No further relevant results | Kabaneri of the Iron Fortress | Isayama Short Stories Collection |
| 5 | — | Claymore (dark fantasy sim) | Before the Fall (same universe) |
| NDCG@5 | 0.41 (cannot resolve "author's other works") | 0.55 (thematic but wrong author) | 0.83 (preprocessor expands to author name) |

The query preprocessor detects the "作者の他の作品" (author's other works) pattern, resolves "進撃の巨人" to author "Hajime Isayama" via metadata lookup, and expands the query to include author-specific terms. Neither pure BM25 nor pure kNN can do this without preprocessing.
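
A sketch of that preprocessing step, where lookup_author_by_title is a hypothetical stand-in for the DynamoDB metadata lookup:

import re

# "<title>の作者の他の作品" = "other works by the author of <title>"
AUTHOR_PATTERN = re.compile(r"(?P<title>.+?)の作者の他の作品")

def lookup_author_by_title(title: str) -> str | None:
    """Hypothetical metadata lookup (DynamoDB in MangaAssist)."""
    catalog = {"進撃の巨人": "Hajime Isayama 諫山創"}
    return catalog.get(title)

def expand_author_query(query: str) -> str:
    """Rewrite 'other works by the author of X' into an author-term query."""
    match = AUTHOR_PATTERN.search(query)
    if not match:
        return query
    author = lookup_author_by_title(match.group("title"))
    if author is None:
        return query
    # BM25 now matches the author field; kNN embeds the expanded text
    return f"{author} 作品"

print(expand_author_query("進撃の巨人の作者の他の作品"))
# → "Hajime Isayama 諫山創 作品"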


Monitoring Retrieval Quality in Production

Key Metrics to Track

| Metric | Collection Method | Alert Threshold | Dashboard |
| --- | --- | --- | --- |
| Search latency p95 | CloudWatch custom metric from ECS | > 100ms (search stage) | Real-time |
| Fusion latency | Application timer around fusion code | > 5ms | Real-time |
| Re-rank latency | Application timer around reranker | > 80ms | Real-time |
| Zero-result rate | Count queries returning 0 results | > 2% of queries | Daily |
| Click-through rate (CTR) | User clicks on search results | < 15% (below baseline) | Daily |
| NDCG@5 (offline) | Weekly evaluation on labeled query set | < 0.80 | Weekly |
| BM25/kNN overlap | % of top-10 docs appearing in both lists | < 20% (too little overlap may indicate misalignment) | Weekly |
| Re-rank position change | Average rank change after re-ranking | > 2 positions (large average movement suggests weak fusion ordering) | Weekly |
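
The application-timer metrics are published from the ECS task with a single boto3 call; a minimal sketch, with the namespace and dimension names being illustrative:

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")

def emit_latency(stage: str, latency_ms: float) -> None:
    """Publish one per-stage search latency sample to CloudWatch."""
    cloudwatch.put_metric_data(
        Namespace="MangaAssist/Search",
        MetricData=[{
            "MetricName": f"{stage}LatencyMs",  # e.g. FusionLatencyMs
            "Dimensions": [{"Name": "Service", "Value": "hybrid-search"}],
            "Value": latency_ms,
            "Unit": "Milliseconds",
        }],
    )

emit_latency("Fusion", 0.8)    # alert threshold: > 5ms
emit_latency("Rerank", 52.3)   # alert threshold: > 80ms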

A/B Testing Search Configurations

When tuning hybrid search parameters (weights, fusion method, re-ranking strategy), run A/B tests.

| Experiment | Metric | Control | Variant | Result |
| --- | --- | --- | --- | --- |
| RRF vs Linear fusion | NDCG@5 | Linear (α=0.6) → 0.80 | RRF (k=60) → 0.82 | RRF wins (+2.5%) |
| kNN weight 0.6 vs 0.7 | CTR | α=0.6 → CTR 18.2% | α=0.7 → CTR 17.8% | 0.6 wins (0.7 over-indexes on semantic matches) |
| Haiku re-rank vs cross-encoder | NDCG@5 | Cross-encoder → 0.88 | Haiku → 0.91 | Haiku wins (+3.4%) but costs ~$100/day |
| Recency boost on vs off | CTR on new releases | Without → 12.1% | With (1.2x) → 16.8% | With wins (+39% CTR lift) |

Key Takeaways

  1. Hybrid search is mandatory — no single retrieval method handles both exact lookups ("Demon Slayer Vol 23") and semantic discovery ("adventure manga like One Piece").
  2. RRF is the safest default fusion — rank-based, no normalization needed, robust to score scale changes. Switch to learned combination only with sufficient labeled data.
  3. Intent detection drives scoring configuration — buy intent needs BM25-heavy weights and in-stock boosts; recommend intent needs kNN-heavy weights and diversity.
  4. Re-ranking is the highest-leverage optimization — a cross-encoder on top-20 adds +25% NDCG for 30-50ms. Haiku re-ranking adds +35% NDCG but requires careful latency budgeting.
  5. Measure retrieval quality continuously — NDCG@5, MRR, and CTR should be tracked weekly. Degradation in these metrics signals embedding drift, synonym gaps, or scoring misconfiguration.
  6. Query preprocessing is essential for Japanese — without kuromoji tokenization and synonym expansion, queries like "shonen" will miss "少年" indexed content entirely.