
Hybrid Search and Custom Scoring for Manga Product Retrieval

AWS AIP-C01 Task 4.2 → Skill 4.2.2: Optimize retrieval mechanisms for FM-augmented applications
System: MangaAssist e-commerce chatbot — Bedrock Claude 3 Sonnet/Haiku, OpenSearch Serverless (HNSW k-NN), DynamoDB, ECS Fargate, ElastiCache Redis
Focus: Deep-dive into hybrid search implementation, score fusion mathematics, custom scoring functions, re-ranking with Bedrock, and retrieval evaluation metrics


Skill Mapping

| AWS AIP-C01 Element | Coverage |
| --- | --- |
| Task 4.2 | Optimize application performance for FM workloads |
| Skill 4.2.2 | Optimize retrieval mechanisms to improve FM-augmented application performance |
| This File | Hybrid search scoring, fusion methods, re-ranking with Bedrock, evaluation methodology |
| MangaAssist Context | Combining BM25 (exact manga titles/ISBNs) + kNN (semantic "manga like One Piece") to maximize both precision and recall across 100K+ products |

Why Hybrid Search? — The BM25 vs kNN Tradeoff

The Core Problem

MangaAssist receives two fundamentally different query types that require two fundamentally different retrieval approaches.

graph TB
    subgraph ExactQueries["Exact Queries — BM25 Wins"]
        style ExactQueries fill:#2ecc71,stroke:#16213e,color:#fff
        Q1["'鬼滅の刃 23巻'<br/>Demon Slayer Vol 23"]
        Q2["'ISBN 978-4088820...']
        Q3["'Eiichiro Oda'"]
        Q4["'Weekly Shonen Jump<br/>2024 Issue 15'"]
    end

    subgraph SemanticQueries["Semantic Queries — kNN Wins"]
        style SemanticQueries fill:#3498db,stroke:#16213e,color:#fff
        Q5["'adventure manga<br/>with pirates'"]
        Q6["'something dark and<br/>philosophical'"]
        Q7["'manga for someone<br/>who loved Naruto'"]
        Q8["'feel-good slice of life<br/>romance'"]
    end

    subgraph HybridQueries["Mixed Queries — Hybrid Wins"]
        style HybridQueries fill:#e94560,stroke:#16213e,color:#fff
        Q9["'manga like ワンピース<br/>with good art'"]
        Q10["'shonen manga about<br/>cooking competitions'"]
        Q11["'new releases similar<br/>to Attack on Titan'"]
    end

    Q1 & Q2 & Q3 & Q4 --> BM25["BM25<br/>Keyword Match"]
    Q5 & Q6 & Q7 & Q8 --> KNN["kNN Vector<br/>Semantic Similarity"]
    Q9 & Q10 & Q11 --> HYBRID["Hybrid<br/>BM25 + kNN"]

Head-to-Head Comparison

| Dimension | BM25 (Keyword) | kNN (Vector) | Hybrid |
| --- | --- | --- | --- |
| Exact title match | Excellent — direct term matching | Poor — semantic drift from exact surface forms | Excellent — BM25 component handles this |
| Semantic similarity | Poor — "pirate adventure" does not match "One Piece" | Excellent — embedding captures meaning | Excellent — kNN component handles this |
| Author lookup | Excellent — keyword match on name field | Moderate — depends on embedding training data | Excellent |
| Cross-language | Poor — "冒険" does not match "adventure" in BM25 | Good — multilingual embeddings bridge languages | Good |
| Handling typos | Poor — "Narruto" misses "Naruto" | Good — embeddings are robust to misspellings | Good |
| Novel/unseen terms | Zero recall for OOV terms | Moderate — nearest semantic neighbor | Moderate |
| Latency | 15-25ms | 30-50ms | 40-60ms (parallel) |
| Explainability | High — TF-IDF scores are interpretable | Low — cosine similarity is opaque | Medium |

Score Fusion Methods

The central challenge of hybrid search: BM25 scores (unbounded, typically 5-50) and kNN scores (cosine similarity, 0.0-1.0) are on completely different scales. Score fusion resolves this.

Method 1: Reciprocal Rank Fusion (RRF)

RRF is rank-based, not score-based. It sidesteps the score normalization problem entirely.

Formula:

RRF(d) = Σ  1 / (k + rank_i(d))
         i∈{bm25, knn}

Where k is a smoothing constant (default 60) and rank_i(d) is the rank of document d in ranker i (1-indexed).

Properties:
- Score-scale agnostic: works regardless of BM25 vs kNN score ranges
- Documents appearing in both result lists receive a contribution from each ranker, rewarding consensus
- The smoothing constant k=60 prevents top-ranked documents from dominating

graph TB
    subgraph BM25Ranks["BM25 Rankings"]
        B1["Rank 1: Naruto Vol 72<br/>BM25=38.2"]
        B2["Rank 2: One Piece Vol 100<br/>BM25=31.7"]
        B3["Rank 3: Dragon Ball Vol 42<br/>BM25=28.1"]
        B4["Rank 4: Bleach Vol 74<br/>BM25=22.5"]
    end

    subgraph KNNRanks["kNN Rankings"]
        K1["Rank 1: One Piece Vol 100<br/>cos=0.94"]
        K2["Rank 2: Fairy Tail Vol 63<br/>cos=0.91"]
        K3["Rank 3: Naruto Vol 72<br/>cos=0.88"]
        K4["Rank 4: Black Clover Vol 33<br/>cos=0.85"]
    end

    subgraph RRFScores["RRF Fused Scores (k=60)"]
        style RRFScores fill:#e94560,stroke:#16213e,color:#fff
        R1["One Piece Vol 100<br/>1/(60+2) + 1/(60+1) = 0.0325<br/>← Appeared in BOTH lists"]
        R2["Naruto Vol 72<br/>1/(60+1) + 1/(60+3) = 0.0322"]
        R3["Fairy Tail Vol 63<br/>0 + 1/(60+2) = 0.0161"]
        R4["Dragon Ball Vol 42<br/>1/(60+3) + 0 = 0.0159"]
    end

    B1 & B2 & K1 & K3 --> R1
    B1 & K3 --> R2
    K2 --> R3
    B3 --> R4

Method 2: Weighted Linear Combination

Requires min-max normalization to bring both score distributions onto [0, 1].

Formula:

score(d) = α × norm_knn(d) + β × norm_bm25(d)

where:
  norm(s) = (s - min) / (max - min)
  α + β = 1.0

MangaAssist defaults: α = 0.6 (kNN weight), β = 0.4 (BM25 weight) — tuned on a labeled relevance set of 500 manga queries.
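
A quick worked example, reusing the candidate scores from the RRF diagram above: One Piece Vol 100 has BM25 = 31.7 (list range 22.5-38.2) and cosine = 0.94 (list range 0.85-0.94):

norm_bm25 = (31.7 - 22.5) / (38.2 - 22.5) ≈ 0.59
norm_knn  = (0.94 - 0.85) / (0.94 - 0.85) = 1.00
score     = 0.6 × 1.00 + 0.4 × 0.59 ≈ 0.83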

Method 3: Learned Score Combination

Train a lightweight model (logistic regression or small neural net) to combine BM25 score, kNN score, and metadata features into a single relevance score.

Features used:
- Normalized BM25 score
- Cosine similarity (kNN score)
- Query-document genre match (binary)
- Title overlap (Jaccard similarity)
- Document recency (days since release)
- Document popularity (log sales rank)

Advantage: can learn non-linear interactions (e.g., "for buy-intent queries, weight BM25 higher"). Disadvantage: requires labeled training data and periodic retraining.
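
To make Method 3 concrete, here is a minimal sketch of a logistic-regression combiner over the six features above, assuming scikit-learn; the training rows and the learned_score helper are illustrative, not MangaAssist's production model.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Feature order mirrors the list above:
# [norm_bm25, cos_sim, genre_match, title_jaccard, days_since_release, log_sales_rank]
X_train = np.array([
    [0.91, 0.94, 1.0, 0.80, 12.0, 2.1],    # labeled relevant
    [0.15, 0.42, 0.0, 0.05, 400.0, 9.3],   # labeled irrelevant
    # ... 500+ labeled (query, document) pairs in practice
])
y_train = np.array([1, 0])

model = LogisticRegression().fit(X_train, y_train)

def learned_score(features: list[float]) -> float:
    """Relevance probability for one (query, document) feature vector."""
    return float(model.predict_proba([features])[0][1])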

Fusion Method Comparison

| Method | NDCG@5 on MangaAssist | Latency Overhead | Requires Training Data | Robustness |
| --- | --- | --- | --- | --- |
| RRF (k=60) | 0.82 | < 1ms | No | High — rank-based, no normalization needed |
| Weighted Linear (α=0.6) | 0.80 | < 1ms | No (but weights need tuning) | Medium — sensitive to score distribution shifts |
| Learned Combination | 0.87 | 2-5ms | Yes (500+ labeled queries) | Medium — needs retraining as catalog changes |

Recommendation for MangaAssist: Start with RRF as the default. Move to learned combination only after accumulating sufficient click-through / relevance judgment data.


Custom Scoring for MangaAssist

Beyond fusion scores, MangaAssist applies business-aware scoring boosts. These operate as multiplicative factors on the fused score.

Scoring Dimensions

graph LR
    FUSED["Fused Score<br/>(RRF or Linear)"] --> RECENCY["Recency Boost<br/>New releases +20%"]
    RECENCY --> POPULARITY["Popularity Boost<br/>High-rated +15%"]
    POPULARITY --> GENRE["Genre Match<br/>Intent-aligned +10%"]
    GENRE --> AVAILABILITY["Availability<br/>In-stock +10%"]
    AVAILABILITY --> FINAL["Final Score"]

Boost Factor Configuration

| Boost | Condition | Multiplier | Rationale |
| --- | --- | --- | --- |
| Recency — New Release | Released within last 30 days | 1.20 | New manga drives engagement; customers expect fresh recommendations |
| Recency — Recent | Released within last 90 days | 1.10 | Still relevant, moderate boost |
| Recency — Trending | Sales rank improved > 50% in last 7 days | 1.15 | Capture viral/trending titles |
| Popularity — Top Rated | avg_rating >= 4.5 and rating_count >= 50 | 1.15 | High confidence in quality |
| Popularity — Well Rated | avg_rating >= 4.0 and rating_count >= 20 | 1.05 | Moderate confidence |
| Genre Match | Query intent matches document genre/demographic | 1.10 | "shonen recommendation" should surface shonen manga |
| Availability — In Stock | in_stock = true (for buy intent) | 1.10 | Avoid frustrating buy-intent users with OOS results |
| Availability — Preorder | release_date in future (for browse intent) | 1.05 | Surface upcoming titles for browsers |
| Price Range | Within user's stated price range | 1.05 | Only if explicit price constraint detected |

Intent-Aware Scoring Matrix

Different user intents require different boost profiles.

| Factor | Browse Intent | Buy Intent | Recommend Intent | Research Intent |
| --- | --- | --- | --- | --- |
| Recency boost | High (1.20) | Medium (1.10) | Medium (1.10) | Low (1.00) |
| Popularity boost | Medium (1.10) | Low (1.00) | High (1.15) | Low (1.00) |
| In-stock boost | None (1.00) | High (1.15) | None (1.00) | None (1.00) |
| Genre match boost | Medium (1.10) | Low (1.05) | High (1.15) | Medium (1.10) |
| Review count boost | Low (1.00) | Low (1.00) | Medium (1.10) | High (1.15) |
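
Operationally, these profiles can live in a plain configuration mapping so the scorer stays data-driven. A minimal sketch mirroring the matrix above (the key names are illustrative):

# Intent → boost-profile mapping, taken from the matrix above.
# A multiplier of 1.00 means "no boost" for that intent.
INTENT_BOOST_PROFILES: dict[str, dict[str, float]] = {
    "browse":    {"recency": 1.20, "popularity": 1.10, "in_stock": 1.00,
                  "genre_match": 1.10, "review_count": 1.00},
    "buy":       {"recency": 1.10, "popularity": 1.00, "in_stock": 1.15,
                  "genre_match": 1.05, "review_count": 1.00},
    "recommend": {"recency": 1.10, "popularity": 1.15, "in_stock": 1.00,
                  "genre_match": 1.15, "review_count": 1.10},
    "research":  {"recency": 1.00, "popularity": 1.00, "in_stock": 1.00,
                  "genre_match": 1.10, "review_count": 1.15},
}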

OpenSearch Serverless Query DSL — Hybrid Query Examples

Example 1: Basic Hybrid Query with RRF

This uses OpenSearch's native hybrid search with the search_pipeline feature; the example searches for adventure shonen manga about pirates. (OpenSearch rejects unrecognized top-level keys in the _search body, so annotations must live outside the JSON.)

{
  "comment": "MangaAssist hybrid search: adventure shonen manga",
  "size": 50,
  "_source": ["manga_id", "title_ja", "title_en", "genre", "avg_rating", "release_date"],
  "query": {
    "hybrid": {
      "queries": [
        {
          "multi_match": {
            "query": "adventure shonen manga pirates 冒険 少年 漫画",
            "fields": ["title_ja^3", "title_en^3", "description", "tags^1.5", "genre"],
            "type": "cross_fields",
            "minimum_should_match": "30%"
          }
        },
        {
          "knn": {
            "description_embedding": {
              "vector": [0.023, -0.118, 0.045, "... 1536 dims ..."],
              "k": 50
            }
          }
        }
      ]
    }
  },
  "search_pipeline": "manga-hybrid-rrf-pipeline"
}

Search pipeline definition (created once via the OpenSearch search pipeline API). Note that despite the pipeline's name, the normalization-processor below implements min-max normalization with weighted arithmetic-mean combination, i.e., Method 2 above rather than true RRF; newer OpenSearch releases also ship a dedicated RRF processor (score-ranker-processor) if pure rank-based fusion is required:

{
  "description": "MangaAssist RRF hybrid search pipeline",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": { "technique": "min_max" },
        "combination": {
          "technique": "arithmetic_mean",
          "parameters": { "weights": [0.4, 0.6] }
        }
      }
    }
  ]
}
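
For reference, the pipeline is registered once with a single request against the collection endpoint; the name must match the search_pipeline parameter used in the queries:

PUT /_search/pipeline/manga-hybrid-rrf-pipeline
{
  ... pipeline definition above ...
}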

Example 2: Filtered Hybrid Query (Buy Intent)

Buy intent: the requested manga must be in stock and priced at or under 1,000 JPY.

{
  "size": 20,
  "_source": ["manga_id", "title_ja", "title_en", "price_jpy", "in_stock", "genre"],
  "query": {
    "hybrid": {
      "queries": [
        {
          "bool": {
            "must": [
              {
                "multi_match": {
                  "query": "鬼滅の刃 最新刊",
                  "fields": ["title_ja^5", "title_en^3", "description"],
                  "type": "best_fields"
                }
              }
            ],
            "filter": [
              { "term": { "in_stock": true } },
              { "range": { "price_jpy": { "lte": 1000 } } }
            ]
          }
        },
        {
          "knn": {
            "description_embedding": {
              "vector": [0.031, -0.092, "... 1536 dims ..."],
              "k": 20,
              "filter": {
                "bool": {
                  "filter": [
                    { "term": { "in_stock": true } },
                    { "range": { "price_jpy": { "lte": 1000 } } }
                  ]
                }
              }
            }
          }
        }
      ]
    }
  },
  "search_pipeline": "manga-hybrid-rrf-pipeline"
}

Example 3: Recommendation Query (Semantic-Heavy)

Recommend intent: semantic similarity to "manga like One Piece with good art".

{
  "size": 50,
  "_source": ["manga_id", "title_ja", "title_en", "genre", "avg_rating", "author"],
  "query": {
    "hybrid": {
      "queries": [
        {
          "multi_match": {
            "query": "ワンピース One Piece 冒険 adventure shonen art 画力",
            "fields": ["title_ja^2", "title_en^2", "tags^2", "genre", "description"],
            "type": "cross_fields",
            "minimum_should_match": "20%"
          }
        },
        {
          "knn": {
            "description_embedding": {
              "vector": [0.012, -0.067, "... 1536 dims ..."],
              "k": 50
            }
          }
        }
      ]
    }
  },
  "search_pipeline": "manga-hybrid-recommend-pipeline"
}

Recommendation pipeline — higher kNN weight:

{
  "description": "MangaAssist recommendation-tuned hybrid pipeline",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": { "technique": "min_max" },
        "combination": {
          "technique": "arithmetic_mean",
          "parameters": { "weights": [0.25, 0.75] }
        }
      }
    }
  ]
}

Re-ranking with Bedrock

Why Re-rank?

Hybrid search retrieves a broad candidate set (top-50), but ranking quality degrades past the top-10. Re-ranking applies a more expensive but accurate relevance model to distill top-50 down to top-5.

graph TB
    subgraph Retrieval["Hybrid Search (80ms)"]
        style Retrieval fill:#0f3460,stroke:#16213e,color:#fff
        SEARCH["Top-50 Candidates<br/>from RRF hybrid"]
    end

    subgraph Reranking["Re-ranking (60ms budget)"]
        style Reranking fill:#e94560,stroke:#16213e,color:#fff
        HEURISTIC["Stage 1: Heuristic Scorer<br/>Recency + Popularity + Intent<br/>(5ms) → Top-20"]
        CROSS["Stage 2: Cross-Encoder<br/>or Claude Haiku Reranker<br/>(50ms) → Top-5"]
    end

    subgraph Output["RAG Context"]
        style Output fill:#2ecc71,stroke:#16213e,color:#fff
        TOP5["Top-5 Results<br/>→ Claude 3 Sonnet prompt"]
    end

    SEARCH --> HEURISTIC --> CROSS --> TOP5

Claude Haiku as Re-ranker

For high-value queries (buy intent detected, returning customer), MangaAssist can use Claude Haiku to score query-document relevance.

Prompt template:

You are a manga relevance scorer for an e-commerce search engine.

Query: {query}
User intent: {intent}

Rate each manga result from 0.0 (irrelevant) to 1.0 (perfect match).
Consider: title relevance, genre match, thematic similarity, and user intent.

Results to score:
{formatted_results}

Return ONLY a JSON array of scores in the same order:
[0.92, 0.85, 0.71, ...]

Cost analysis: Haiku re-ranking of 20 candidates costs ~$0.001 per query (input: ~2K tokens, output: ~100 tokens). At 100K queries/day = ~$100/day.

Python — ScoreFusion

"""
MangaAssist Score Fusion
Implements RRF, weighted linear, and learned score combination.
"""

from dataclasses import dataclass
from typing import Protocol


@dataclass
class ScoredDocument:
    """Document with scores from multiple rankers."""
    doc_id: str
    title: str
    bm25_score: float = 0.0
    bm25_rank: int = 0
    knn_score: float = 0.0
    knn_rank: int = 0
    fused_score: float = 0.0
    metadata: dict | None = None

    def __post_init__(self):
        if self.metadata is None:
            self.metadata = {}


class FusionStrategy(Protocol):
    """Protocol for score fusion strategies."""
    def fuse(
        self,
        bm25_results: list[ScoredDocument],
        knn_results: list[ScoredDocument],
    ) -> list[ScoredDocument]: ...


class ScoreFusion:
    """
    Orchestrates score fusion for MangaAssist hybrid search.

    Supports two in-process strategies:
    - "rrf": Reciprocal Rank Fusion (default, most robust)
    - "linear": weighted linear combination of normalized scores

    The learned combination (Method 3) is trained offline and is not
    implemented in this class; fuse() raises ValueError for unknown names.

    Usage:
        fusion = ScoreFusion(strategy="rrf", rrf_k=60)
        results = fusion.fuse(bm25_results, knn_results)
    """

    def __init__(
        self,
        strategy: str = "rrf",
        rrf_k: int = 60,
        bm25_weight: float = 0.4,
        knn_weight: float = 0.6,
    ):
        self.strategy = strategy
        self.rrf_k = rrf_k
        self.bm25_weight = bm25_weight
        self.knn_weight = knn_weight

    def fuse(
        self,
        bm25_results: list[ScoredDocument],
        knn_results: list[ScoredDocument],
    ) -> list[ScoredDocument]:
        """Fuse BM25 and kNN results using the configured strategy."""
        if self.strategy == "rrf":
            return self._rrf_fuse(bm25_results, knn_results)
        elif self.strategy == "linear":
            return self._linear_fuse(bm25_results, knn_results)
        else:
            raise ValueError(f"Unknown strategy: {self.strategy}")

    def _rrf_fuse(
        self,
        bm25_results: list[ScoredDocument],
        knn_results: list[ScoredDocument],
    ) -> list[ScoredDocument]:
        """
        Reciprocal Rank Fusion.

        RRF(d) = Σ 1/(k + rank_i(d))

        Rank-based: immune to score scale differences.
        k=60 is the original paper's recommendation.
        """
        k = self.rrf_k
        score_map: dict[str, float] = {}
        doc_map: dict[str, ScoredDocument] = {}

        # BM25 contribution
        for rank, doc in enumerate(bm25_results, start=1):
            rrf_contribution = 1.0 / (k + rank)
            score_map[doc.doc_id] = score_map.get(
                doc.doc_id, 0.0
            ) + rrf_contribution
            doc.bm25_rank = rank
            doc_map[doc.doc_id] = doc

        # kNN contribution
        for rank, doc in enumerate(knn_results, start=1):
            rrf_contribution = 1.0 / (k + rank)
            score_map[doc.doc_id] = score_map.get(
                doc.doc_id, 0.0
            ) + rrf_contribution
            doc.knn_rank = rank
            if doc.doc_id not in doc_map:
                doc_map[doc.doc_id] = doc
            else:
                doc_map[doc.doc_id].knn_rank = rank
                doc_map[doc.doc_id].knn_score = doc.knn_score

        # Build fused result list
        fused: list[ScoredDocument] = []
        for doc_id, score in score_map.items():
            doc = doc_map[doc_id]
            doc.fused_score = score
            fused.append(doc)

        fused.sort(key=lambda d: d.fused_score, reverse=True)
        return fused

    def _linear_fuse(
        self,
        bm25_results: list[ScoredDocument],
        knn_results: list[ScoredDocument],
    ) -> list[ScoredDocument]:
        """
        Weighted linear combination with min-max normalization.

        score(d) = α * norm_knn(d) + β * norm_bm25(d)
        """
        # Min-max normalize BM25 scores
        bm25_scores = [d.bm25_score for d in bm25_results]
        bm25_min = min(bm25_scores) if bm25_scores else 0
        bm25_max = max(bm25_scores) if bm25_scores else 1
        bm25_range = bm25_max - bm25_min or 1.0

        # Min-max normalize kNN scores
        knn_scores = [d.knn_score for d in knn_results]
        knn_min = min(knn_scores) if knn_scores else 0
        knn_max = max(knn_scores) if knn_scores else 1
        knn_range = knn_max - knn_min or 1.0

        score_map: dict[str, float] = {}
        doc_map: dict[str, ScoredDocument] = {}

        for doc in bm25_results:
            norm_score = (doc.bm25_score - bm25_min) / bm25_range
            score_map[doc.doc_id] = self.bm25_weight * norm_score
            doc_map[doc.doc_id] = doc

        for doc in knn_results:
            norm_score = (doc.knn_score - knn_min) / knn_range
            score_map[doc.doc_id] = score_map.get(
                doc.doc_id, 0.0
            ) + self.knn_weight * norm_score
            if doc.doc_id not in doc_map:
                doc_map[doc.doc_id] = doc

        fused: list[ScoredDocument] = []
        for doc_id, score in score_map.items():
            doc = doc_map[doc_id]
            doc.fused_score = score
            fused.append(doc)

        fused.sort(key=lambda d: d.fused_score, reverse=True)
        return fused


# ---- Convenience factory ----

def create_fusion(
    intent: str = "browse",
) -> ScoreFusion:
    """
    Create a ScoreFusion instance with intent-aware weights.
    - browse/research: balanced (RRF default)
    - recommend: kNN-heavy (0.75 kNN)
    - buy: BM25-heavy (0.6 BM25) for exact match priority
    """
    if intent == "recommend":
        return ScoreFusion(
            strategy="linear", bm25_weight=0.25, knn_weight=0.75
        )
    elif intent == "buy":
        return ScoreFusion(
            strategy="linear", bm25_weight=0.60, knn_weight=0.40
        )
    else:
        return ScoreFusion(strategy="rrf", rrf_k=60)
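
A quick usage sketch, reusing the rankings from the RRF diagram above (the document IDs are illustrative); it reproduces the fused ordering shown there:

if __name__ == "__main__":
    bm25 = [
        ScoredDocument("naruto-72", "Naruto Vol 72", bm25_score=38.2),
        ScoredDocument("onepiece-100", "One Piece Vol 100", bm25_score=31.7),
        ScoredDocument("dragonball-42", "Dragon Ball Vol 42", bm25_score=28.1),
        ScoredDocument("bleach-74", "Bleach Vol 74", bm25_score=22.5),
    ]
    knn = [
        ScoredDocument("onepiece-100", "One Piece Vol 100", knn_score=0.94),
        ScoredDocument("fairytail-63", "Fairy Tail Vol 63", knn_score=0.91),
        ScoredDocument("naruto-72", "Naruto Vol 72", knn_score=0.88),
        ScoredDocument("blackclover-33", "Black Clover Vol 33", knn_score=0.85),
    ]
    for doc in ScoreFusion(strategy="rrf", rrf_k=60).fuse(bm25, knn):
        print(f"{doc.fused_score:.4f}  {doc.title}")
    # 0.0325 One Piece Vol 100: present in both lists, it beats
    # Naruto's single #1 rank (0.0323).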

Python — RelevanceReranker

"""
MangaAssist Relevance Re-ranker
Two-stage re-ranking: heuristic scorer → cross-encoder or LLM reranker.
Budget: 60ms total (5ms heuristic + 50ms cross-encoder).
"""

import datetime
import json
from dataclasses import dataclass, field

import boto3
from botocore.exceptions import ClientError


@dataclass
class RankedResult:
    """A search result with re-ranking metadata."""
    doc_id: str
    title: str
    original_rank: int
    original_score: float
    heuristic_score: float = 0.0
    rerank_score: float = 0.0
    final_score: float = 0.0
    metadata: dict = field(default_factory=dict)


class RelevanceReranker:
    """
    Two-stage re-ranking pipeline for MangaAssist.

    Stage 1: Heuristic scoring (< 5ms)
      - Recency boost, popularity boost, genre match, availability
      - Reduces top-50 → top-20

    Stage 2: Model-based re-ranking (< 50ms)
      - Option A: Cross-encoder model (self-hosted, ~30ms for 20 docs)
      - Option B: Claude Haiku via Bedrock (~80ms, higher quality)
      - Reduces top-20 → top-5

    Usage:
        reranker = RelevanceReranker(
            region="ap-northeast-1",
            rerank_method="haiku",
        )
        top5 = reranker.rerank(query, intent, candidates, top_k=5)
    """

    def __init__(
        self,
        region: str = "ap-northeast-1",
        rerank_method: str = "heuristic",  # "heuristic", "haiku"
        heuristic_cutoff: int = 20,
    ):
        self.region = region
        self.rerank_method = rerank_method
        self.heuristic_cutoff = heuristic_cutoff

        if rerank_method == "haiku":
            self.bedrock = boto3.client(
                "bedrock-runtime", region_name=region
            )

    def rerank(
        self,
        query: str,
        intent: str,
        candidates: list[RankedResult],
        top_k: int = 5,
    ) -> list[RankedResult]:
        """
        Full re-ranking pipeline.

        Returns top_k results sorted by final_score.
        """
        # Stage 1: Heuristic scoring → top-N
        heuristic_ranked = self._heuristic_score(candidates, intent)
        heuristic_ranked.sort(
            key=lambda r: r.heuristic_score, reverse=True
        )
        stage1_results = heuristic_ranked[:self.heuristic_cutoff]

        # Stage 2: Model-based re-ranking → top_k
        if self.rerank_method == "haiku" and len(stage1_results) > top_k:
            final = self._haiku_rerank(query, intent, stage1_results)
        else:
            final = stage1_results

        final.sort(key=lambda r: r.final_score, reverse=True)
        return final[:top_k]

    # ---- Stage 1: Heuristic Scoring ----

    def _heuristic_score(
        self,
        candidates: list[RankedResult],
        intent: str,
    ) -> list[RankedResult]:
        """
        Fast heuristic scoring using metadata signals.
        Target: < 5ms for 50 candidates.
        """
        now = datetime.date.today()

        for r in candidates:
            score = r.original_score
            meta = r.metadata

            # Recency
            release_str = meta.get("release_date", "")
            if release_str:
                try:
                    release = datetime.date.fromisoformat(release_str[:10])
                    days_old = (now - release).days
                    if days_old <= 30:
                        score *= 1.20
                    elif days_old <= 90:
                        score *= 1.10
                except (ValueError, TypeError):
                    pass

            # Popularity
            rating = meta.get("avg_rating", 0.0)
            rating_count = meta.get("rating_count", 0)
            if rating >= 4.5 and rating_count >= 50:
                score *= 1.15
            elif rating >= 4.0 and rating_count >= 20:
                score *= 1.05

            # Intent-aware
            if intent == "buy":
                if meta.get("in_stock", False):
                    score *= 1.15
            elif intent == "recommend":
                if rating >= 4.0:
                    score *= 1.10
            elif intent == "research":
                if meta.get("review_count", 0) > 100:
                    score *= 1.10

            r.heuristic_score = score
            r.final_score = score  # default; may be overridden by stage 2

        return candidates

    # ---- Stage 2: Claude Haiku Re-ranking ----

    def _haiku_rerank(
        self,
        query: str,
        intent: str,
        candidates: list[RankedResult],
    ) -> list[RankedResult]:
        """
        Use Claude 3 Haiku to score query-document relevance.
        Cost: ~$0.001 per call (2K input + 100 output tokens).
        Latency: 80-150ms.
        """
        # Format candidates for the prompt
        formatted = ""
        for i, r in enumerate(candidates):
            formatted += (
                f"{i+1}. [{r.doc_id}] {r.title} "
                f"(Genre: {r.metadata.get('genre', 'N/A')}, "
                f"Rating: {r.metadata.get('avg_rating', 'N/A')})\n"
            )

        prompt = f"""You are a manga relevance scorer for a Japanese e-commerce search engine.

Query: {query}
User intent: {intent}

Rate each manga result from 0.0 (completely irrelevant) to 1.0 (perfect match).
Consider: title relevance, genre match, thematic similarity, and user intent.

Results to score:
{formatted}

Return ONLY a JSON array of scores in the same order, e.g., [0.92, 0.85, 0.71, ...]:"""

        try:
            response = self.bedrock.invoke_model(
                modelId="anthropic.claude-3-haiku-20240307-v1:0",
                contentType="application/json",
                accept="application/json",
                body=json.dumps({
                    "anthropic_version": "bedrock-2023-05-31",
                    "max_tokens": 200,
                    "temperature": 0.0,
                    "messages": [
                        {"role": "user", "content": prompt}
                    ],
                }),
            )

            result = json.loads(response["body"].read())
            text = result["content"][0]["text"].strip()
            scores = json.loads(text)

            # Apply Haiku scores (guarding against an all-zero heuristic pass)
            max_heuristic = max(
                (c.heuristic_score for c in candidates), default=0.0
            ) or 1.0
            for i, r in enumerate(candidates):
                if i < len(scores):
                    r.rerank_score = float(scores[i])
                    # Blend: 70% Haiku + 30% heuristic (normalized)
                    r.final_score = (
                        0.7 * r.rerank_score
                        + 0.3 * (r.heuristic_score / max_heuristic)
                    )
                else:
                    r.final_score = r.heuristic_score

        except (ClientError, ValueError, KeyError, IndexError):
            # Fall back to heuristic scores if the Bedrock call fails
            # or the model returns malformed JSON
            for r in candidates:
                r.final_score = r.heuristic_score

        return candidates
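
A usage sketch feeding fused candidates into the two-stage pipeline (the doc ID and metadata values are illustrative); with rerank_method="heuristic" no Bedrock call is made:

if __name__ == "__main__":
    candidates = [
        RankedResult(
            doc_id="onepiece-100", title="One Piece Vol 100",
            original_rank=1, original_score=0.0325,
            metadata={"release_date": "2024-03-04", "avg_rating": 4.8,
                      "rating_count": 320, "in_stock": True},
        ),
        # ... remaining top-50 candidates from ScoreFusion
    ]
    reranker = RelevanceReranker(rerank_method="heuristic")
    for r in reranker.rerank("ワンピースみたいな冒険漫画", "recommend",
                             candidates, top_k=5):
        print(f"{r.final_score:.4f}  {r.title}")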

Evaluation Methodology

Core Retrieval Quality Metrics

| Metric | Formula | What It Measures | MangaAssist Target |
| --- | --- | --- | --- |
| NDCG@K (Normalized Discounted Cumulative Gain) | DCG@K / IDCG@K, where DCG = Σ (2^rel - 1) / log₂(rank + 1) | Quality of ranking order; penalizes relevant docs ranked low | NDCG@5 >= 0.85 |
| MAP (Mean Average Precision) | (1/Q) Σ AP(q), where AP = (1/R) Σ P@k × rel(k) | Average precision across all queries; rewards finding ALL relevant docs | MAP >= 0.80 |
| MRR (Mean Reciprocal Rank) | (1/Q) Σ 1/rank(first_relevant) | How quickly the first relevant result appears | MRR >= 0.85 |
| Recall@K | \|relevant ∩ retrieved@K\| / \|relevant\| | Fraction of all relevant documents found in top K | Recall@50 >= 0.95 |
| Precision@K | \|relevant ∩ retrieved@K\| / K | Fraction of retrieved documents that are relevant | Precision@5 >= 0.80 |
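
A minimal NDCG@K implementation matching the formula in the table above; the graded labels in the demo are illustrative:

import math

def ndcg_at_k(relevances: list[float], k: int = 5) -> float:
    """NDCG@K = DCG@K / IDCG@K over graded labels in ranked order."""
    def dcg(rels: list[float]) -> float:
        return sum(
            (2 ** rel - 1) / math.log2(rank + 1)
            for rank, rel in enumerate(rels[:k], start=1)
        )
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical top-5 with graded labels (3 = perfect, 0 = irrelevant)
print(round(ndcg_at_k([3, 2, 3, 0, 1], k=5), 2))  # 0.96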

Before/After: Hybrid Search Improving Manga Discovery

Example 1: Semantic Recommendation Query

Query: "冒険が好きで、ワンピースみたいな漫画が読みたい" (I like adventure; I want to read manga like One Piece)

| Rank | BM25 Only | kNN Only | Hybrid (RRF) + Re-rank |
| --- | --- | --- | --- |
| 1 | One Piece Vol 100 (exact title match) | Fairy Tail (adventure, similar themes) | One Piece Vol 100 (exact + semantic) |
| 2 | One Piece Vol 99 | Magi: Labyrinth of Magic | Fairy Tail (semantic, high rating) |
| 3 | One Piece Vol 98 | Hunter x Hunter | Hunter x Hunter (semantic, trending) |
| 4 | One Piece Episode A (spinoff) | Seven Deadly Sins | Magi (semantic, genre match) |
| 5 | One Piece Film Red (movie tie-in) | Black Clover | Seven Deadly Sins (new volume) |
| NDCG@5 | 0.52 (too many One Piece volumes) | 0.74 (good diversity, missing exact) | 0.91 (exact match #1 + diverse recs) |

BM25 returns multiple volumes of One Piece because they all match the title keyword. kNN returns semantically similar titles but misses the specific One Piece mention. Hybrid + re-rank gets the best of both worlds.

Example 2: Exact Product Lookup

Query: "鬼滅の刃 23巻 購入" (Demon Slayer Volume 23 purchase)

| Rank | BM25 Only | kNN Only | Hybrid + Re-rank |
| --- | --- | --- | --- |
| 1 | Demon Slayer Vol 23 | Demon Slayer Vol 23 | Demon Slayer Vol 23 (in stock) |
| 2 | Demon Slayer Vol 22 | Demon Slayer Vol 22 | Demon Slayer Box Set (buy intent boost) |
| 3 | Demon Slayer Vol 21 | Jujutsu Kaisen Vol 20 | Demon Slayer Vol 22 (bundle suggestion) |
| 4 | Demon Slayer Artbook | Chainsaw Man Vol 15 | Demon Slayer Artbook (cross-sell) |
| 5 | Demon Slayer Guide | Tokyo Ghoul Vol 14 | Koyoharu Gotouge Art Collection |
| NDCG@5 | 0.88 (strong on exact, weak cross-sell) | 0.61 (semantic drift) | 0.94 (exact + smart cross-sell) |

For buy intent, BM25 dominates because the query contains the exact title. But hybrid search still wins by adding intent-aware cross-sell suggestions (box set, artbook) that pure BM25 ranks lower.

Example 3: Japanese Tokenization Challenge

Query: "進撃の巨人の作者の他の作品" (Other works by the author of Attack on Titan)

| Rank | BM25 Only | kNN Only | Hybrid + Preprocessing |
| --- | --- | --- | --- |
| 1 | Attack on Titan Vol 34 | Attack on Titan Vol 34 | Attack on Titan (anchor result) |
| 2 | Attack on Titan Vol 33 | Attack on Titan: Before the Fall | Hajime Isayama Interview Book |
| 3 | Attack on Titan: Junior High | Vinland Saga (thematic sim) | Attack on Titan: Lost Girls |
| 4 | No further relevant results | Kabaneri of the Iron Fortress | Isayama Short Stories Collection |
| 5 | — | Claymore (dark fantasy sim) | Before the Fall (same universe) |
| NDCG@5 | 0.41 (cannot resolve "author's other works") | 0.55 (thematic but wrong author) | 0.83 (preprocessor expands to author name) |

The query preprocessor detects the "作者の他の作品" (author's other works) pattern, resolves "進撃の巨人" to author "Hajime Isayama" via metadata lookup, and expands the query to include author-specific terms. Neither pure BM25 nor pure kNN can do this without preprocessing.
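
A sketch of that preprocessing step, where lookup_author_by_title is a hypothetical stand-in for the DynamoDB metadata lookup:

import re

# "<title>の作者の他の作品" = "other works by the author of <title>"
AUTHOR_PATTERN = re.compile(r"(?P<title>.+?)の作者の他の作品")

def lookup_author_by_title(title: str) -> str | None:
    """Hypothetical metadata lookup (DynamoDB in MangaAssist)."""
    catalog = {"進撃の巨人": "Hajime Isayama 諫山創"}
    return catalog.get(title)

def expand_author_query(query: str) -> str:
    """Rewrite 'other works by the author of X' into an author-term query."""
    match = AUTHOR_PATTERN.search(query)
    if not match:
        return query
    author = lookup_author_by_title(match.group("title"))
    if author is None:
        return query
    # BM25 now matches the author field; kNN embeds the expanded text
    return f"{author} 作品"

print(expand_author_query("進撃の巨人の作者の他の作品"))
# → "Hajime Isayama 諫山創 作品"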


Monitoring Retrieval Quality in Production

Key Metrics to Track

| Metric | Collection Method | Alert Threshold | Dashboard |
| --- | --- | --- | --- |
| Search latency p95 | CloudWatch custom metric from ECS | > 100ms (search stage) | Real-time |
| Fusion latency | Application timer around fusion code | > 5ms | Real-time |
| Re-rank latency | Application timer around reranker | > 80ms | Real-time |
| Zero-result rate | Count queries returning 0 results | > 2% of queries | Daily |
| Click-through rate (CTR) | User clicks on search results | < 15% (below baseline) | Daily |
| NDCG@5 (offline) | Weekly evaluation on labeled query set | < 0.80 | Weekly |
| BM25/kNN overlap | % of top-10 docs appearing in both lists | < 20% (too little overlap may indicate misalignment) | Weekly |
| Re-rank position change | Average rank change after re-ranking | > 2 positions (large average movement suggests weak fusion ordering) | Weekly |
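
The application-timer metrics are published from the ECS task with a single boto3 call; a minimal sketch, with the namespace and dimension names being illustrative:

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")

def emit_latency(stage: str, latency_ms: float) -> None:
    """Publish one per-stage search latency sample to CloudWatch."""
    cloudwatch.put_metric_data(
        Namespace="MangaAssist/Search",
        MetricData=[{
            "MetricName": f"{stage}LatencyMs",  # e.g. FusionLatencyMs
            "Dimensions": [{"Name": "Service", "Value": "hybrid-search"}],
            "Value": latency_ms,
            "Unit": "Milliseconds",
        }],
    )

emit_latency("Fusion", 0.8)    # alert threshold: > 5ms
emit_latency("Rerank", 52.3)   # alert threshold: > 80ms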

A/B Testing Search Configurations

When tuning hybrid search parameters (weights, fusion method, re-ranking strategy), run A/B tests.

| Experiment | Metric | Control | Variant | Result |
| --- | --- | --- | --- | --- |
| RRF vs Linear fusion | NDCG@5 | Linear (α=0.6) → 0.80 | RRF (k=60) → 0.82 | RRF wins (+2.5%) |
| kNN weight 0.6 vs 0.7 | CTR | α=0.6 → CTR 18.2% | α=0.7 → CTR 17.8% | 0.6 wins (0.7 over-indexes on semantic matches) |
| Haiku re-rank vs cross-encoder | NDCG@5 | Cross-encoder → 0.88 | Haiku → 0.91 | Haiku wins (+3.4%) but costs ~$100/day |
| Recency boost on vs off | CTR on new releases | Without → 12.1% | With (1.2x) → 16.8% | With wins (+39% CTR lift) |

Key Takeaways

  1. Hybrid search is mandatory — no single retrieval method handles both exact lookups ("Demon Slayer Vol 23") and semantic discovery ("adventure manga like One Piece").
  2. RRF is the safest default fusion — rank-based, no normalization needed, robust to score scale changes. Switch to learned combination only with sufficient labeled data.
  3. Intent detection drives scoring configuration — buy intent needs BM25-heavy weights and in-stock boosts; recommend intent needs kNN-heavy weights and diversity.
  4. Re-ranking is the highest-leverage optimization — a cross-encoder on top-20 adds +25% NDCG for 30-50ms. Haiku re-ranking adds +35% NDCG but requires careful latency budgeting.
  5. Measure retrieval quality continuously — NDCG@5, MRR, and CTR should be tracked weekly. Degradation in these metrics signals embedding drift, synonym gaps, or scoring misconfiguration.
  6. Query preprocessing is essential for Japanese — without kuromoji tokenization and synonym expansion, queries like "shonen" will miss "少年" indexed content entirely.