
06: Retrieval Quality Testing

AIP-C01 Mapping

Content Domain 5: Testing, Validation, and Troubleshooting
Task 5.1: Implement evaluation systems for GenAI
Skill 5.1.6: Test the quality of retrieval mechanisms (for example, with relevance scoring, context matching, and retrieval latency).


User Story

As a MangaAssist search/retrieval engineer, I want to measure and optimize the quality of our OpenSearch Serverless retrieval pipeline — including relevance of retrieved chunks, precision/recall against expected contexts, and retrieval latency under production load — So that the LLM receives the best possible context for every query and we can diagnose whether quality issues originate from retrieval or generation.


Acceptance Criteria

  • Retrieval relevance scoring evaluates top-K chunks against query intent
  • Context matching measures whether retrieved chunks contain the information needed to answer the query
  • Retrieval precision@K and recall@K computed against labeled ground truth per intent
  • Retrieval latency P50/P95/P99 tracked per query type and index
  • Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (nDCG) computed for ranking quality
  • End-to-end retrieval testing includes embedding generation, vector search, and re-ranking stages
  • Degradation alerts fire when retrieval quality drops below threshold
  • Retrieval A/B testing framework supports comparing embedding models and search configurations

Why Retrieval Quality Is the Foundation

In MangaAssist's RAG architecture, the retrieval pipeline determines the ceiling of response quality. Even the best LLM cannot generate a correct answer from irrelevant context:

Response Quality ≤ min(Retrieval Quality, Generation Quality)

Retrieval Quality       Generation Quality      User Experience
High (correct chunks)   High (good synthesis)   Correct, helpful response
High (correct chunks)   Low (poor synthesis)    Fixable — improve prompt/model
Low (wrong chunks)      High (good synthesis)   Confident hallucination — worst case
Low (wrong chunks)      Low (poor synthesis)    Obvious failure — user asks again

The dangerous quadrant is high generation quality + low retrieval quality — the model confidently presents wrong information because it was given wrong context. This is why retrieval must be tested independently.
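The triage implied by the quadrant table can be captured in a few lines; a minimal sketch (the `diagnose_failure_quadrant` helper is a hypothetical name, not part of the MangaAssist codebase):

```python
def diagnose_failure_quadrant(retrieval_ok: bool, generation_ok: bool) -> str:
    """Map the quadrant table above to a triage action (illustrative only)."""
    if retrieval_ok and generation_ok:
        return "healthy"
    if retrieval_ok and not generation_ok:
        return "fix generation: improve prompt/model"
    if not retrieval_ok and generation_ok:
        # The dangerous quadrant: the model sounds right but was fed wrong chunks
        return "confident hallucination: fix retrieval first"
    return "obvious failure: fix retrieval, then re-test generation"
```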

MangaAssist Retrieval Pipeline

User Query → Intent Classifier → Query Rewriter → Titan Embeddings v2 →
OpenSearch Serverless (k-NN + BM25 hybrid) → Re-ranker → Top-5 Chunks → LLM

Each stage can fail independently: the query rewriter can lose intent, embeddings can miss semantic nuance, the hybrid search can rank poorly, the re-ranker can demote the correct chunk.


High-Level Design

Retrieval Quality Testing Architecture

graph TD
    subgraph "Test Data"
        A[Golden Retrieval Dataset<br>300 queries with labeled chunks] --> B[Test Runner]
    end

    subgraph "Retrieval Pipeline Under Test"
        B --> C[Query Rewriter]
        C --> D[Embedding Generator<br>Titan Embeddings v2]
        D --> E[OpenSearch Serverless<br>Hybrid k-NN + BM25]
        E --> F[Re-ranker]
        F --> G[Top-K Retrieved Chunks]
    end

    subgraph "Quality Metrics Engine"
        G --> H[Relevance Scorer<br>Per-chunk relevance]
        G --> I[Precision@K / Recall@K<br>Against ground truth]
        G --> J[MRR / nDCG@K<br>Ranking quality]
        G --> K[Latency Tracker<br>P50, P95, P99]
    end

    subgraph "Stage-Level Diagnostics"
        C --> L[Query Rewrite Quality<br>Intent preservation check]
        D --> M[Embedding Quality<br>Cosine similarity distribution]
        E --> N[Search Quality<br>k-NN vs BM25 contribution]
        F --> O[Re-ranker Quality<br>Position changes analysis]
    end

    subgraph "Outputs"
        H --> P[Retrieval Quality Report]
        I --> P
        J --> P
        K --> P
        L --> P
        M --> P
        N --> P
        O --> P
        P --> Q[CloudWatch Dashboard]
        P --> R[Degradation Alerts]
    end

Retrieval A/B Testing Architecture

graph LR
    subgraph "Experiment Setup"
        A[Same Query Set] --> B{A/B Router}
        B --> C[Config A: Titan v2<br>k=5, hybrid 0.7/0.3]
        B --> D[Config B: Titan v2<br>k=7, hybrid 0.5/0.5]
    end

    subgraph "Evaluation"
        C --> E[Retrieve & Score]
        D --> E
        E --> F[Paired Comparison<br>Same query, both configs]
    end

    subgraph "Decision"
        F --> G{Config B better<br>on MRR + Latency?}
        G -->|Yes, significant| H[Adopt Config B]
        G -->|No| I[Keep Config A]
    end

End-to-End Retrieval Latency Breakdown

sequenceDiagram
    participant QR as Query Rewriter
    participant EM as Embedding Generator
    participant OS as OpenSearch
    participant RR as Re-ranker
    participant Total as Total Latency

    Note over QR,Total: Target: < 500ms total

    QR->>QR: Rewrite query<br>~50ms
    QR->>EM: Rewritten query
    EM->>EM: Generate embedding<br>~30ms (Titan v2)
    EM->>OS: 1536-dim vector
    OS->>OS: k-NN search + BM25<br>~80ms (warm) / ~300ms (cold)
    OS->>RR: Top-20 candidates
    RR->>RR: Re-rank to top-5<br>~40ms
    RR->>Total: Final chunks

    Note over Total: Warm: ~200ms | Cold: ~420ms | P99: ~600ms
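The per-stage timings above can be checked against the 500 ms budget programmatically; a minimal sketch (stage names and numbers mirror the diagram's warm and cold paths; `over_budget_stages` is a hypothetical helper):

```python
BUDGET_MS = 500.0


def over_budget_stages(stage_ms: dict[str, float], budget: float = BUDGET_MS) -> list[str]:
    """If the total exceeds the budget, return stages ordered by latency
    (largest first) as the investigation order; otherwise return []."""
    total = sum(stage_ms.values())
    if total <= budget:
        return []
    return sorted(stage_ms, key=stage_ms.get, reverse=True)


warm = {"rewrite": 50, "embed": 30, "search": 80, "rerank": 40}     # ~200 ms total
cold = {"rewrite": 50, "embed": 30, "search": 300, "rerank": 40}    # ~420 ms total
spike = {"rewrite": 50, "embed": 30, "search": 1500, "rerank": 40}  # Scenario D numbers
```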

Low-Level Design

Retrieval Test Data Model

from dataclasses import dataclass, field
import time
import uuid


@dataclass
class RetrievalTestCase:
    """A test case for retrieval quality evaluation."""
    test_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    query: str = ""
    intent: str = ""
    expected_chunk_ids: list[str] = field(default_factory=list)    # Ground truth chunks
    expected_key_facts: list[str] = field(default_factory=list)    # Facts that must appear
    min_relevance_threshold: float = 0.6                           # Minimum acceptable relevance
    difficulty: str = "medium"                                      # easy, medium, hard, adversarial


@dataclass
class RetrievalResult:
    """Result of a single retrieval operation."""
    test_id: str = ""
    query: str = ""
    retrieved_chunks: list[dict] = field(default_factory=list)     # {chunk_id, text, score, source}
    retrieval_latency_ms: float = 0.0
    embedding_latency_ms: float = 0.0
    search_latency_ms: float = 0.0
    rerank_latency_ms: float = 0.0
    total_latency_ms: float = 0.0


@dataclass
class RetrievalQualityMetrics:
    """Aggregated retrieval quality metrics."""
    precision_at_k: dict[int, float] = field(default_factory=dict)   # k -> precision
    recall_at_k: dict[int, float] = field(default_factory=dict)      # k -> recall
    mrr: float = 0.0                                                  # Mean Reciprocal Rank
    ndcg_at_k: dict[int, float] = field(default_factory=dict)        # k -> nDCG
    avg_relevance_score: float = 0.0
    fact_coverage_rate: float = 0.0                                   # % of expected facts found
    latency_p50_ms: float = 0.0
    latency_p95_ms: float = 0.0
    latency_p99_ms: float = 0.0
    total_queries: int = 0
    timestamp: float = field(default_factory=time.time)

Retrieval Quality Evaluator

import json
import logging
import math
import time

import boto3

logger = logging.getLogger(__name__)


class RetrievalQualityEvaluator:
    """Evaluates retrieval quality across multiple dimensions.

    Metrics computed:
    - Precision@K: fraction of retrieved chunks that are relevant
    - Recall@K: fraction of relevant chunks that are retrieved
    - MRR: average reciprocal rank of the first relevant chunk
    - nDCG@K: ranking quality considering position
    - Fact coverage: does the retrieved context contain the key facts?
    - Latency breakdown: per-stage timing
    """

    def __init__(self, opensearch_endpoint: str = "", embedding_model_id: str = "amazon.titan-embed-text-v2:0"):
        self.bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
        self.embedding_model_id = embedding_model_id
        self.opensearch_endpoint = opensearch_endpoint
        self.cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    def evaluate_retrieval_batch(
        self,
        test_cases: list[RetrievalTestCase],
        retrieval_fn=None,  # Callable that takes query -> RetrievalResult
        k_values: list[int] | None = None,
    ) -> RetrievalQualityMetrics:
        """Evaluate retrieval quality on a batch of test cases."""
        if k_values is None:
            k_values = [1, 3, 5, 10]

        all_results: list[tuple[RetrievalTestCase, RetrievalResult]] = []

        for tc in test_cases:
            if retrieval_fn:
                result = retrieval_fn(tc.query)
            else:
                result = self._default_retrieve(tc.query)
            all_results.append((tc, result))

        # Compute metrics
        precision_at_k = {k: self._compute_precision_at_k(all_results, k) for k in k_values}
        recall_at_k = {k: self._compute_recall_at_k(all_results, k) for k in k_values}
        mrr = self._compute_mrr(all_results)
        ndcg_at_k = {k: self._compute_ndcg_at_k(all_results, k) for k in k_values}
        avg_relevance = self._compute_avg_relevance(all_results)
        fact_coverage = self._compute_fact_coverage(all_results)

        # Latency statistics
        latencies = [r.total_latency_ms for _, r in all_results if r.total_latency_ms > 0]
        sorted_latencies = sorted(latencies) if latencies else [0]

        return RetrievalQualityMetrics(
            precision_at_k=precision_at_k,
            recall_at_k=recall_at_k,
            mrr=mrr,
            ndcg_at_k=ndcg_at_k,
            avg_relevance_score=avg_relevance,
            fact_coverage_rate=fact_coverage,
            latency_p50_ms=self._percentile(sorted_latencies, 50),
            latency_p95_ms=self._percentile(sorted_latencies, 95),
            latency_p99_ms=self._percentile(sorted_latencies, 99),
            total_queries=len(all_results),
        )

    def _compute_precision_at_k(
        self, results: list[tuple[RetrievalTestCase, RetrievalResult]], k: int
    ) -> float:
        """Precision@K: of the top-K retrieved, how many are relevant?"""
        precisions = []
        for tc, result in results:
            top_k = result.retrieved_chunks[:k]
            if not top_k:
                precisions.append(0.0)
                continue
            relevant = sum(
                1 for chunk in top_k
                if chunk.get("chunk_id") in tc.expected_chunk_ids
            )
            precisions.append(relevant / len(top_k))
        return sum(precisions) / len(precisions) if precisions else 0.0

    def _compute_recall_at_k(
        self, results: list[tuple[RetrievalTestCase, RetrievalResult]], k: int
    ) -> float:
        """Recall@K: of all relevant chunks, how many are in top-K?"""
        recalls = []
        for tc, result in results:
            if not tc.expected_chunk_ids:
                continue
            top_k_ids = {
                chunk.get("chunk_id") for chunk in result.retrieved_chunks[:k]
            }
            relevant_retrieved = len(top_k_ids & set(tc.expected_chunk_ids))
            recalls.append(relevant_retrieved / len(tc.expected_chunk_ids))
        return sum(recalls) / len(recalls) if recalls else 0.0

    def _compute_mrr(
        self, results: list[tuple[RetrievalTestCase, RetrievalResult]]
    ) -> float:
        """Mean Reciprocal Rank: average of 1/rank of first relevant chunk."""
        reciprocal_ranks = []
        for tc, result in results:
            rr = 0.0
            for i, chunk in enumerate(result.retrieved_chunks):
                if chunk.get("chunk_id") in tc.expected_chunk_ids:
                    rr = 1.0 / (i + 1)
                    break
            reciprocal_ranks.append(rr)
        return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0

    def _compute_ndcg_at_k(
        self, results: list[tuple[RetrievalTestCase, RetrievalResult]], k: int
    ) -> float:
        """Normalized Discounted Cumulative Gain at K."""
        ndcg_scores = []
        for tc, result in results:
            # Relevance: 1 if chunk is expected, 0 otherwise
            relevance = [
                1 if chunk.get("chunk_id") in tc.expected_chunk_ids else 0
                for chunk in result.retrieved_chunks[:k]
            ]

            # DCG
            dcg = sum(
                rel / math.log2(i + 2)  # rank = i + 1, so the discount is log2(rank + 1)
                for i, rel in enumerate(relevance)
            )

            # Ideal DCG (all relevant chunks ranked first)
            ideal_relevance = sorted(relevance, reverse=True)
            idcg = sum(
                rel / math.log2(i + 2)
                for i, rel in enumerate(ideal_relevance)
            )

            ndcg = dcg / idcg if idcg > 0 else 0.0
            ndcg_scores.append(ndcg)

        return sum(ndcg_scores) / len(ndcg_scores) if ndcg_scores else 0.0

    def _compute_avg_relevance(
        self, results: list[tuple[RetrievalTestCase, RetrievalResult]]
    ) -> float:
        """Average relevance score of top-K retrieved chunks."""
        scores = []
        for _, result in results:
            for chunk in result.retrieved_chunks:
                score = chunk.get("score", chunk.get("relevance_score", 0))
                scores.append(float(score))
        return sum(scores) / len(scores) if scores else 0.0

    def _compute_fact_coverage(
        self, results: list[tuple[RetrievalTestCase, RetrievalResult]]
    ) -> float:
        """What percentage of expected key facts appear in retrieved chunks?"""
        total_facts = 0
        found_facts = 0
        for tc, result in results:
            if not tc.expected_key_facts:
                continue
            combined_text = " ".join(
                chunk.get("text", "") for chunk in result.retrieved_chunks
            ).lower()
            for fact in tc.expected_key_facts:
                total_facts += 1
                if fact.lower() in combined_text:
                    found_facts += 1
        return found_facts / total_facts if total_facts > 0 else 0.0

    def _percentile(self, sorted_values: list[float], pct: int) -> float:
        """Compute percentile from sorted values."""
        if not sorted_values:
            return 0.0
        idx = int(len(sorted_values) * pct / 100)
        return sorted_values[min(idx, len(sorted_values) - 1)]

    def _default_retrieve(self, query: str) -> RetrievalResult:
        """Default retrieval using MangaAssist's OpenSearch Serverless pipeline."""
        total_start = time.time()

        # Stage 1: Generate embedding
        embed_start = time.time()
        embedding = self._generate_embedding(query)
        embed_latency = (time.time() - embed_start) * 1000

        # Stage 2: Hybrid search (k-NN + BM25)
        search_start = time.time()
        raw_results = self._hybrid_search(query, embedding, k=20)
        search_latency = (time.time() - search_start) * 1000

        # Stage 3: Re-rank
        rerank_start = time.time()
        reranked = self._rerank(query, raw_results, top_k=5)
        rerank_latency = (time.time() - rerank_start) * 1000

        total_latency = (time.time() - total_start) * 1000

        return RetrievalResult(
            query=query,
            retrieved_chunks=reranked,
            embedding_latency_ms=embed_latency,
            search_latency_ms=search_latency,
            rerank_latency_ms=rerank_latency,
            total_latency_ms=total_latency,
        )

    def _generate_embedding(self, text: str) -> list[float]:
        """Generate embedding using Amazon Titan Embeddings v2."""
        body = json.dumps({"inputText": text})
        response = self.bedrock.invoke_model(modelId=self.embedding_model_id, body=body)
        return json.loads(response["body"].read())["embedding"]

    def _hybrid_search(self, query: str, embedding: list[float], k: int = 20) -> list[dict]:
        """Hybrid k-NN + BM25 search on OpenSearch Serverless."""
        # In production, this calls OpenSearch Serverless
        # Simplified for illustration
        return []

    def _rerank(self, query: str, candidates: list[dict], top_k: int = 5) -> list[dict]:
        """Re-rank candidates using cross-encoder or LLM-based re-ranker."""
        # In production, this uses a cross-encoder model
        return candidates[:top_k]

Stage-Level Diagnostic Evaluator

import math


class RetrievalStageDiagnostics:
    """Diagnoses retrieval quality issues at each pipeline stage.

    When overall retrieval quality drops, this class helps isolate
    which stage is the bottleneck:
    - Query rewriting losing intent?
    - Embeddings not capturing semantics?
    - Search ranking poorly?
    - Re-ranker demoting correct results?
    """

    def diagnose_query_rewriter(
        self, original_query: str, rewritten_query: str, intent: str
    ) -> dict:
        """Evaluate whether the query rewriter preserved intent and improved searchability."""
        # Check intent preservation: key terms from original should appear in rewritten
        original_terms = set(original_query.lower().split())
        rewritten_terms = set(rewritten_query.lower().split())
        overlap = original_terms & rewritten_terms
        preservation_rate = len(overlap) / len(original_terms) if original_terms else 0

        # Check expansion: did the rewriter add useful terms?
        added_terms = rewritten_terms - original_terms
        expansion_rate = len(added_terms) / max(len(rewritten_terms), 1)

        return {
            "stage": "query_rewriter",
            "intent_preservation": preservation_rate,
            "query_expansion_rate": expansion_rate,
            "original_length": len(original_query),
            "rewritten_length": len(rewritten_query),
            "added_terms": list(added_terms)[:10],
            "healthy": preservation_rate >= 0.5,
        }

    def diagnose_embedding_quality(
        self, query_embedding: list[float], expected_chunk_embeddings: list[list[float]]
    ) -> dict:
        """Evaluate embedding quality by checking cosine similarity distribution."""
        similarities = []
        for chunk_emb in expected_chunk_embeddings:
            sim = self._cosine_similarity(query_embedding, chunk_emb)
            similarities.append(sim)

        if not similarities:
            return {"stage": "embedding", "healthy": False, "reason": "No embeddings to compare"}

        avg_sim = sum(similarities) / len(similarities)
        max_sim = max(similarities)
        min_sim = min(similarities)

        return {
            "stage": "embedding",
            "avg_similarity": round(avg_sim, 4),
            "max_similarity": round(max_sim, 4),
            "min_similarity": round(min_sim, 4),
            "similarity_spread": round(max_sim - min_sim, 4),
            "healthy": avg_sim >= 0.65,
            "recommendation": (
                "Embedding model captures query-chunk similarity well"
                if avg_sim >= 0.65
                else "Consider fine-tuning embeddings or adjusting chunk size"
            ),
        }

    def diagnose_search_components(
        self,
        knn_results: list[dict],
        bm25_results: list[dict],
        hybrid_results: list[dict],
        expected_chunk_ids: list[str],
    ) -> dict:
        """Compare k-NN and BM25 contributions to identify search issues."""
        knn_set = {r.get("chunk_id") for r in knn_results[:10]}
        bm25_set = {r.get("chunk_id") for r in bm25_results[:10]}
        hybrid_set = {r.get("chunk_id") for r in hybrid_results[:10]}
        expected_set = set(expected_chunk_ids)

        return {
            "stage": "search",
            "knn_recall@10": len(knn_set & expected_set) / max(len(expected_set), 1),
            "bm25_recall@10": len(bm25_set & expected_set) / max(len(expected_set), 1),
            "hybrid_recall@10": len(hybrid_set & expected_set) / max(len(expected_set), 1),
            "knn_unique": len(knn_set - bm25_set),
            "bm25_unique": len(bm25_set - knn_set),
            "overlap": len(knn_set & bm25_set),
            "recommendation": self._search_recommendation(
                knn_set, bm25_set, hybrid_set, expected_set
            ),
        }

    def diagnose_reranker(
        self,
        pre_rerank: list[dict],
        post_rerank: list[dict],
        expected_chunk_ids: list[str],
    ) -> dict:
        """Evaluate whether the re-ranker improved or degraded ranking quality."""
        expected_set = set(expected_chunk_ids)

        # Compare positions of relevant chunks pre and post rerank
        pre_positions = {}
        for i, chunk in enumerate(pre_rerank):
            if chunk.get("chunk_id") in expected_set:
                pre_positions[chunk["chunk_id"]] = i

        post_positions = {}
        for i, chunk in enumerate(post_rerank):
            if chunk.get("chunk_id") in expected_set:
                post_positions[chunk["chunk_id"]] = i

        improvements = 0
        degradations = 0
        for cid in expected_set:
            pre = pre_positions.get(cid, len(pre_rerank))
            post = post_positions.get(cid, len(post_rerank))
            if post < pre:
                improvements += 1
            elif post > pre:
                degradations += 1

        return {
            "stage": "reranker",
            "improvements": improvements,
            "degradations": degradations,
            "no_change": len(expected_set) - improvements - degradations,
            "healthy": improvements >= degradations,
            "recommendation": (
                "Re-ranker is improving ranking"
                if improvements > degradations
                else "Re-ranker may be degrading results — review cross-encoder model"
            ),
        }

    def _cosine_similarity(self, a: list[float], b: list[float]) -> float:
        """Compute cosine similarity between two vectors."""
        dot_product = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot_product / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0

    def _search_recommendation(self, knn, bm25, hybrid, expected) -> str:
        knn_recall = len(knn & expected) / max(len(expected), 1)
        bm25_recall = len(bm25 & expected) / max(len(expected), 1)
        if knn_recall > bm25_recall + 0.2:
            return "k-NN dominates — consider increasing k-NN weight in hybrid"
        elif bm25_recall > knn_recall + 0.2:
            return "BM25 dominates — queries may be keyword-heavy, check embeddings"
        return "Balanced contribution — hybrid search is working well"

MangaAssist Scenarios

Scenario A: Embedding Model Upgrade Breaks Semantic Retrieval

Context: The team upgraded from Titan Embeddings v1 (1024-dim) to v2 (1536-dim) for better semantic representation. Retrieval quality was expected to improve.

What Happened:

  • Retrieval quality metrics before upgrade: Precision@5: 0.72, Recall@5: 0.68, MRR: 0.78, nDCG@5: 0.74
  • After v2 deployment: Precision@5: 0.58 (-0.14), Recall@5: 0.52 (-0.16), MRR: 0.63 (-0.15), nDCG@5: 0.59 (-0.15)

How Caught: The retrieval quality evaluator ran its nightly regression suite. Every metric dropped. The stage-level diagnostics showed:

  • Embedding quality: avg cosine similarity between query and expected chunks dropped from 0.78 to 0.52
  • Search quality: k-NN recall@10 dropped from 0.80 to 0.55

Root Cause: The existing OpenSearch index was built with v1 embeddings (1024-dim). The new v2 embeddings (1536-dim) were being compared against v1 embeddings in the index. Cosine similarity between different-dimensional embeddings is meaningless — OpenSearch was truncating/padding silently.

Fix: Re-indexed the entire knowledge base with v2 embeddings. This required:

  1. Generate v2 embeddings for all 150K chunks
  2. Create a new OpenSearch index with 1536-dim configuration
  3. Perform a blue-green index switch

After re-indexing: Precision@5: 0.79, Recall@5: 0.75, MRR: 0.84
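A cheap guard would have caught this dimension mismatch before any quality regression shipped; a minimal sketch (hypothetical `check_embedding_dimension` helper, assuming the index's configured k-NN dimension is known client-side):

```python
def check_embedding_dimension(embedding: list[float], index_dimension: int) -> None:
    """Fail fast when a query embedding doesn't match the index's k-NN dimension.

    A 1024-dim (Titan v1) index queried with 1536-dim (Titan v2) vectors
    yields meaningless similarities; raising here prevents silent quality loss.
    """
    if len(embedding) != index_dimension:
        raise ValueError(
            f"Embedding dimension {len(embedding)} does not match index "
            f"dimension {index_dimension}: re-index before switching models"
        )
```

Calling this once per model or index configuration change (for example, in a deployment smoke test) turns a silent ranking collapse into a hard error.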

Scenario B: Hybrid Search Weight Tuning via Retrieval A/B Test

Context: MangaAssist's hybrid search uses 70% k-NN (semantic) + 30% BM25 (keyword). The team hypothesized that product_question intent might benefit from higher BM25 weight because users often search by exact product names.

What Happened: A/B test configurations:

  • Config A (control): k-NN=0.7, BM25=0.3, k=5
  • Config B (treatment): k-NN=0.5, BM25=0.5, k=5

Results on 200 product_question test cases:

Metric          Config A (70/30)   Config B (50/50)   Delta
Precision@5     0.68               0.76               +0.08
Recall@5        0.64               0.72               +0.08
MRR             0.72               0.81               +0.09
Latency P95     85ms               82ms               -3ms

  • On recommendation intent, Config B was worse: Precision@5 dropped from 0.74 to 0.65 (semantic similarity matters more for recommendations than keyword matching)

Decision: Use intent-dependent search weights:

  • product_question, order_tracking: 50/50 hybrid
  • recommendation, chitchat: 70/30 hybrid
  • faq: 60/40 hybrid (balanced)
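This decision reduces to a simple lookup keyed by the classified intent; a minimal sketch (the `HYBRID_WEIGHTS` table name and the fallback default are illustrative assumptions):

```python
# (k-NN weight, BM25 weight) per intent, from the A/B test decision above
HYBRID_WEIGHTS: dict[str, tuple[float, float]] = {
    "product_question": (0.5, 0.5),
    "order_tracking": (0.5, 0.5),
    "recommendation": (0.7, 0.3),
    "chitchat": (0.7, 0.3),
    "faq": (0.6, 0.4),
}
DEFAULT_WEIGHTS = (0.7, 0.3)  # conservative semantic-heavy default


def hybrid_weights_for(intent: str) -> tuple[float, float]:
    """Return (k-NN weight, BM25 weight) for an intent; default for unknowns."""
    return HYBRID_WEIGHTS.get(intent, DEFAULT_WEIGHTS)
```

Keeping the table in configuration rather than code would let the A/B framework update weights per intent without a redeploy.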

Scenario C: Re-ranker Demotes Correct Chunks for Multi-Turn Queries

Context: The cross-encoder re-ranker was trained on single-turn query-chunk pairs. For multi-turn conversations, it was receiving the full conversation context as the "query."

What Happened: The stage diagnostics showed:

  • Pre-rerank: correct chunk at position 3
  • Post-rerank: correct chunk dropped to position 8 (outside top-5)
  • This happened for 23% of multi-turn test cases

Root Cause: The re-ranker was confused by long conversation histories. For a query like "What about the hardcover edition?" with conversation history mentioning "One Piece Volume 108," the re-ranker scored chunks about hardcover editions in general higher than the specific One Piece chunk because it weighted the current turn query ("hardcover edition") more than the conversation context.

Fix: Changed the re-ranker input strategy:

  • Single-turn: pass the query as-is
  • Multi-turn: pass a synthesized query that merges the current turn with the resolved reference ("One Piece Volume 108 hardcover edition") — using the query rewriter to resolve coreferences before re-ranking

Post-fix: multi-turn re-ranker accuracy improved from 77% to 91%.
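The input-selection rule from the fix can be sketched in one function (hypothetical `reranker_input` helper; `resolved_query` is assumed to come from the coreference-resolving query rewriter):

```python
def reranker_input(current_turn: str, resolved_query: str, num_turns: int) -> str:
    """Choose the re-ranker input per the multi-turn fix above.

    Single-turn: the raw query; multi-turn: the synthesized query that
    merges the current turn with resolved references.
    """
    return current_turn if num_turns <= 1 else resolved_query
```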

Scenario D: Latency Spike During Product Launch Event

Context: A major manga publisher announced a new series, and 50,000 users simultaneously searched for information about it. Retrieval latency spiked.

What Happened:

  • Normal P95 latency: 200ms
  • During spike: P95 latency: 1,800ms; P99 latency: 4,200ms (above the 3-second SLA)
  • OpenSearch Serverless auto-scaled, but the scaling lag was 3-4 minutes

How Caught: The retrieval latency monitor fired an alarm when P95 exceeded 500ms for 5 consecutive minutes. The stage-level diagnostics showed:

  • Embedding generation: 30ms (normal)
  • OpenSearch search: 1,500ms (10x normal)
  • Re-ranker: 40ms (normal)
  • Bottleneck: OpenSearch Serverless compute scaling lag

Fix:

  1. Pre-warmed OpenSearch capacity before known product launch events
  2. Added request queuing with a 2-second timeout — if retrieval exceeded 2 seconds, return cached results from ElastiCache Redis for the most common queries
  3. Implemented circuit breaker: if retrieval latency > 1 second for 3 consecutive requests, switch to ElastiCache-only retrieval for 60 seconds
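The circuit breaker described in fix 3 can be sketched as a small stateful class; a minimal, illustrative version (the thresholds mirror the numbers above; a production implementation would also need to be concurrency-safe):

```python
import time


class RetrievalCircuitBreaker:
    """After `threshold` consecutive slow retrievals, route to cache-only
    retrieval for `cooldown_s` seconds (sketch of Scenario D, fix 3)."""

    def __init__(self, latency_limit_ms: float = 1000.0,
                 threshold: int = 3, cooldown_s: float = 60.0):
        self.latency_limit_ms = latency_limit_ms
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.consecutive_slow = 0
        self.open_until = 0.0  # monotonic timestamp; 0.0 means closed

    def record(self, latency_ms: float) -> None:
        """Record one retrieval's latency; open the breaker on a slow streak."""
        if latency_ms > self.latency_limit_ms:
            self.consecutive_slow += 1
            if self.consecutive_slow >= self.threshold:
                self.open_until = time.monotonic() + self.cooldown_s
        else:
            self.consecutive_slow = 0

    def use_cache_only(self) -> bool:
        """True while the breaker is open: serve from ElastiCache only."""
        return time.monotonic() < self.open_until
```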


Intuition Gained

Test Retrieval Independently from Generation

When response quality drops, the first question should be: "Is the problem in retrieval or generation?" Without separate retrieval metrics, you cannot distinguish between a model that is generating poorly from good context vs. a model that is generating confidently from bad context. The latter is far more dangerous.

MRR Is the Most Actionable Retrieval Metric

Precision@K and Recall@K tell you about the top-K as a set. MRR tells you about rank order — specifically, where the first correct chunk appears. For MangaAssist, if the correct chunk is at position 1 vs. position 5, the LLM's utilization of that chunk changes dramatically. A 0.05 improvement in MRR has more impact on downstream quality than a 0.05 improvement in Precision@5.
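A two-line example makes the contrast concrete; a minimal sketch (chunk IDs are illustrative):

```python
def reciprocal_rank(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """1 / rank of the first relevant chunk; 0.0 if none was retrieved."""
    for i, cid in enumerate(retrieved_ids):
        if cid in relevant_ids:
            return 1.0 / (i + 1)
    return 0.0


relevant = {"c1"}
# Both result lists have Precision@5 = 0.2 (one relevant chunk of five),
# but the rank position makes reciprocal rank differ by 5x:
rr_top = reciprocal_rank(["c1", "x2", "x3", "x4", "x5"], relevant)
rr_bottom = reciprocal_rank(["x2", "x3", "x4", "x5", "c1"], relevant)
```

Precision@5 cannot distinguish these two retrievals; MRR does, which is exactly why it tracks downstream answer quality more closely.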

Retrieval Configuration Should Be Intent-Dependent

Different intents have different retrieval characteristics. Product questions are keyword-heavy (exact product names). Recommendations are semantic-heavy (genre similarity, thematic matching). FAQ queries are a mix. Using a single hybrid search weight for all intents is a compromise that serves none of them optimally. Intent-dependent retrieval configuration is a low-effort, high-impact optimization.

