
Embedding and Retrieval Architectures in MangaAssist

The retrieval path is the backbone of grounded responses. If the embedding space is weak or the ranking pipeline is noisy, the generator receives poor context and the final answer quality drops with it.

This document covers the math and system design behind the RAG path used in MangaAssist.

1. Retrieval Flow

User query
  -> embedding model
  -> ANN search
  -> metadata filtering
  -> reranking
  -> top context chunks
  -> generation model

Each arrow corresponds to a real mathematical step: vector encoding, similarity scoring, graph traversal, or learned ranking.


2. Dense Vector Embeddings

2.1 What an Embedding Is

An embedding maps discrete text into a continuous vector space:

$$f: \text{Text} \rightarrow \mathbb{R}^d$$

In this project configuration, Titan Text Embeddings V2 returns 1,024-dimensional float vectors.
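For reference, a minimal sketch of requesting an embedding from Titan Text Embeddings V2 through the Bedrock runtime; the model ID, request fields, and region here are assumptions to check against the current AWS SDK documentation:

```python
import json
import boto3

# Assumed Bedrock model ID for Titan Text Embeddings V2.
MODEL_ID = "amazon.titan-embed-text-v2:0"

client = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed(text: str) -> list[float]:
    """Return a 1,024-dimensional embedding for `text`."""
    response = client.invoke_model(
        modelId=MODEL_ID,
        body=json.dumps({
            "inputText": text,
            "dimensions": 1024,  # Titan V2 also supports smaller sizes
            "normalize": True,   # unit-length vectors simplify cosine math
        }),
    )
    payload = json.loads(response["body"].read())
    return payload["embedding"]
```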

2.2 What Good Embeddings Should Do

| Property | Mathematical meaning | Why it matters |
|---|---|---|
| Similar text -> nearby vectors | $\cos(\mathbf{e}_a, \mathbf{e}_b)$ is high | Relevant chunks rise in the ranking |
| Dissimilar text -> separated vectors | $\cos(\mathbf{e}_a, \mathbf{e}_c)$ is low | Irrelevant context stays out of the prompt |
| Stable geometry | Neighborhoods remain meaningful over time | Retrieval quality is easier to monitor |
| Good coverage of the space | Vectors do not collapse into one region | The index can discriminate across many intents and topics |

2.3 Why Contrastive Objectives Matter

Many modern embedding systems are trained with a contrastive objective:

$$\mathcal{L} = -\log \frac{e^{\text{sim}(\mathbf{q}, \mathbf{d}^+)/\tau}}{e^{\text{sim}(\mathbf{q}, \mathbf{d}^+)/\tau} + \sum_j e^{\text{sim}(\mathbf{q}, \mathbf{d}_j^-)/\tau}}$$

This encourages relevant pairs to move closer together and irrelevant pairs to move farther apart in the vector space.

Even when a managed model's internals are abstracted away, this objective remains the right intuition for how semantic retrieval behavior is learned.
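As a concrete illustration, here is a small numpy sketch of the loss above for one query, one positive document, and a batch of negatives (the temperature value is illustrative):

```python
import numpy as np

def info_nce_loss(q, d_pos, d_negs, tau=0.05):
    """Contrastive loss for one (query, positive, negatives) triple.

    q:      (d,) query embedding
    d_pos:  (d,) embedding of the relevant document
    d_negs: (n, d) embeddings of irrelevant documents
    """
    def sim(u, v):
        # cosine similarity, matching sim(., .) in the loss above
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    pos = np.exp(sim(q, d_pos) / tau)
    negs = np.sum([np.exp(sim(q, d_neg) / tau) for d_neg in d_negs])
    return -np.log(pos / (pos + negs))
```

Minimizing this pushes the positive term to dominate the denominator: relevant pairs move closer, negatives move apart.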


3. Similarity Metrics

3.1 Cosine Similarity

$$\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u}\cdot\mathbf{v}}{\|\mathbf{u}\|_2\,\|\mathbf{v}\|_2}$$

Cosine similarity is the primary metric because it focuses on direction rather than raw magnitude.

Used in:

  • vector retrieval
  • embedding diagnostics
  • semantic similarity evaluation

3.2 Dot Product

$$\mathbf{u}\cdot\mathbf{v} = \sum_{i=1}^{d} u_i v_i$$

If vectors are normalized, dot product and cosine similarity induce the same ranking. This matters because some systems normalize embeddings at write time to make retrieval cheaper.

3.3 Euclidean Distance

$$\|\mathbf{u} - \mathbf{v}\|_2 = \sqrt{\sum_{i=1}^{d}(u_i - v_i)^2}$$

Euclidean distance is still useful for clustering or diagnostics, but cosine similarity is usually the better default for text retrieval.
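All three metrics, and the ranking equivalence noted in 3.2, can be checked directly. A minimal numpy sketch:

```python
import numpy as np

def cosine(u, v):
    """Direction-only similarity: normalized dot product."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def euclidean(u, v):
    """Straight-line distance; sensitive to vector magnitude."""
    return np.linalg.norm(u - v)

rng = np.random.default_rng(0)
q = rng.normal(size=1024)
docs = rng.normal(size=(5, 1024))

# Normalize at write time, as some systems do to make retrieval cheaper.
q_unit = q / np.linalg.norm(q)
docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)

# With unit vectors, dot product and cosine induce the same ranking.
by_cosine = np.argsort([-cosine(q, d) for d in docs])
by_dot = np.argsort(-(docs_unit @ q_unit))
assert (by_cosine == by_dot).all()
```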

3.4 Why Cosine Usually Wins for Text

| Property | Cosine | Euclidean |
|---|---|---|
| Scale-invariant | Yes | No |
| Works well with normalized embeddings | Yes | Less direct |
| Less sensitive to length effects | Yes | No |
| Retrieval default in this project | Yes | No |

4. Approximate Nearest-Neighbor Search

4.1 Why Exact Search Does Not Scale

Exact KNN compares the query with every indexed vector:

$$O(Nd)$$

For large indexes, that becomes too expensive for real-time chat traffic.
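For intuition, exact search is a single full scan (a minimal numpy sketch; the index size is illustrative):

```python
import numpy as np

N, d = 100_000, 1024  # illustrative; production indexes may be far larger
rng = np.random.default_rng(1)
index = rng.normal(size=(N, d)).astype(np.float32)
index /= np.linalg.norm(index, axis=1, keepdims=True)

def exact_knn(query: np.ndarray, k: int = 10) -> np.ndarray:
    """O(N * d): one similarity computation against every indexed vector."""
    scores = index @ query          # the full scan that ANN avoids
    return np.argsort(-scores)[:k]  # indices of the k best matches
```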

4.2 HNSW Intuition

OpenSearch commonly uses HNSW (Hierarchical Navigable Small World graphs), a graph-based ANN method.

Search intuition:

  1. start from a coarse graph layer
  2. move greedily toward better neighbors
  3. descend into denser layers
  4. refine the candidate set near the query

This avoids scanning the full corpus while preserving high recall.
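A toy sketch of the greedy step on a single layer; real HNSW adds the layer hierarchy, a candidate heap of breadth ef, and visited-set bookkeeping, so treat `graph` and `vectors` as stand-ins:

```python
import numpy as np

def greedy_layer_search(graph, vectors, entry, query):
    """Toy greedy descent over one HNSW-style graph layer.

    graph:   dict of node id -> list of neighbor ids (assumed non-empty)
    vectors: (N, d) array of unit-normalized node embeddings
    entry:   starting node; in real HNSW this comes from the coarser layer
    """
    current = entry
    while True:
        # Move to the neighbor most similar to the query, if any improves.
        best = max(graph[current], key=lambda n: vectors[n] @ query)
        if vectors[best] @ query <= vectors[current] @ query:
            return current  # local optimum: refine the candidate set here
        current = best
```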

4.3 Important Tuning Knobs

| Parameter | Meaning | Trade-off |
|---|---|---|
| M | Maximum graph connectivity per node | Higher recall, more memory |
| ef_construction | Candidate breadth during index build | Better graph quality, slower indexing |
| ef_search | Candidate breadth during search, when the engine exposes it | Better recall, slower queries |

Note: exact semantics can vary by OpenSearch engine and version, so these parameters should be treated as implementation-specific tuning knobs rather than universal constants.
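For example, an index definition with these knobs might look like the following opensearch-py sketch; the host, index name, and engine choice are assumptions, and as noted above the parameter names vary by engine and version:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

client.indices.create(
    index="manga-chunks",  # hypothetical index name
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "chunk_text": {"type": "text"},
                "chunk_vector": {
                    "type": "knn_vector",
                    "dimension": 1024,       # matches Titan V2 output
                    "method": {
                        "name": "hnsw",
                        "space_type": "cosinesimil",
                        "engine": "lucene",  # engine choice affects knob names
                        "parameters": {
                            "m": 16,                 # graph connectivity
                            "ef_construction": 128,  # build-time breadth
                        },
                    },
                },
            }
        },
    },
)
```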


5. Chunking Strategy

5.1 Why Chunking Exists

Documents must be split before embedding because:

  • embedding models have context-window limits
  • retrieval needs fine-grained units
  • overly large chunks dilute relevance
  • overly small chunks lose context

5.2 Project Chunking Heuristics

| Content type | Chunk size | Overlap | Rationale |
|---|---|---|---|
| Product descriptions | 256 tokens | 25 tokens | Keep product details focused |
| FAQ content | 512 tokens | 50 tokens | Preserve self-contained answers |
| Policies | 512 tokens | 50 tokens | Avoid splitting rules across chunks |
| Reviews | 128 tokens | 0 tokens | Reviews are naturally short |

5.3 Overlap Math

If chunk size is 512 and overlap is 50, then the stride is:

$$512 - 50 = 462$$

So chunk start positions are:

$$0, 462, 924, 1386, \ldots$$

Overlap improves recall at document boundaries because important facts are less likely to be split away from each other.
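A minimal sliding-window chunker over a token sequence (tokenization itself is out of scope; the token list is assumed to come from whatever tokenizer matches the embedding model):

```python
def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 50):
    """Yield overlapping windows; stride = size - overlap."""
    stride = size - overlap                      # 512 - 50 = 462
    for start in range(0, len(tokens), stride):  # starts at 0, 462, 924, ...
        yield tokens[start:start + size]
        if start + size >= len(tokens):
            break  # the final (possibly short) chunk has been emitted
```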


6. Two-Stage Retrieval

6.1 Stage 1: Embedding Retrieval

Query
  -> Titan Text Embeddings V2
  -> query vector
  -> OpenSearch ANN index
  -> top candidate chunks

Why this stage is fast:

  • chunk embeddings are precomputed
  • the query is embedded once
  • ANN search prunes the space aggressively
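A sketch of this stage, reusing the hypothetical `embed` helper from section 2 and the `manga-chunks` index from section 4:

```python
def retrieve_candidates(query: str, k: int = 20) -> list[dict]:
    """Embed the query once, then let the ANN index prune the space."""
    query_vector = embed(query)  # hypothetical helper from section 2
    response = client.search(
        index="manga-chunks",
        body={
            "size": k,
            "query": {
                "knn": {
                    "chunk_vector": {"vector": query_vector, "k": k}
                }
            },
        },
    )
    return [hit["_source"] for hit in response["hits"]["hits"]]
```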

6.2 Stage 2: Cross-Encoder Reranking

For each candidate:
  [CLS] query [SEP] chunk [SEP]
    -> cross-encoder
    -> relevance score

Sort by score -> keep top few chunks

Why this stage is more accurate:

  • query and document tokens interact directly
  • relevance is learned as a ranking problem
  • lexical match and semantic match can both influence the score
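A sketch of this stage using the sentence-transformers CrossEncoder API; the model name is a public example, not necessarily what MangaAssist deploys:

```python
from sentence_transformers import CrossEncoder

# Example public reranking model; swap in the production cross-encoder.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], keep: int = 3) -> list[str]:
    """Score every (query, chunk) pair jointly, keep the top few."""
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: -pair[1])
    return [chunk for chunk, _ in ranked[:keep]]
```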

6.3 Why Two Stages Instead of One

| Option | Speed | Accuracy | Operational fit |
|---|---|---|---|
| Bi-encoder only | Fast | Good | Great for large candidate pools |
| Cross-encoder only | Slow | Best | Too expensive over a full corpus |
| Two-stage pipeline | Fast enough | Near-best | Best balance for production |

7. Retrieval Metrics

7.1 Recall@K

$$\text{Recall@}K = \frac{|\text{relevant docs in top-}K|}{|\text{total relevant docs}|}$$

Use this when multiple documents may be relevant.

7.2 Precision@K

$$\text{Precision@}K = \frac{|\text{relevant docs in top-}K|}{K}$$

This matters because low-precision context wastes tokens and increases hallucination risk.

7.3 Hit@K

$$\text{Hit@}K = \mathbb{1}[\text{at least one relevant item appears in top-}K]$$

This is often the most intuitive operational metric for RAG. If the top three chunks contain at least one useful grounding chunk, the generator still has a good chance to answer well.

7.4 Mean Reciprocal Rank

$$\text{MRR} = \frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{\text{rank}_i}$$

MRR rewards putting the first relevant item as high as possible.

7.5 NDCG@K

$$\text{DCG@}K = \sum_{i=1}^{K}\frac{2^{rel_i} - 1}{\log_2(i+1)}$$

$$\text{NDCG@}K = \frac{\text{DCG@}K}{\text{IDCG@}K}$$

NDCG is especially useful when relevance is graded rather than binary.

7.6 Reranking Lift

$$\text{Lift} = \text{NDCG@3}_{\text{after rerank}} - \text{NDCG@3}_{\text{before rerank}}$$

This quantifies whether the reranker is earning its latency cost.
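Pulling sections 7.1 through 7.6 together, a minimal sketch that scores one ranked list against binary relevance labels and computes the rerank lift (all inputs are toy values):

```python
import numpy as np

def recall_at_k(rels, total_relevant, k):
    return sum(rels[:k]) / total_relevant

def precision_at_k(rels, k):
    return sum(rels[:k]) / k

def hit_at_k(rels, k):
    return int(any(rels[:k]))

def reciprocal_rank(rels):
    """Per-query term of MRR; MRR is the mean of this over all queries."""
    for i, rel in enumerate(rels, start=1):
        if rel:
            return 1.0 / i
    return 0.0

def ndcg_at_k(rels, k):
    def dcg(seq):
        return sum((2**r - 1) / np.log2(i + 1)
                   for i, r in enumerate(seq[:k], start=1))
    idcg = dcg(sorted(rels, reverse=True))
    return dcg(rels) / idcg if idcg > 0 else 0.0

# Toy relevance labels (1 = relevant); two relevant docs exist in total.
before = [0, 1, 0, 1, 0]  # order straight out of ANN retrieval
after  = [1, 1, 0, 0, 0]  # order after cross-encoder reranking

lift = ndcg_at_k(after, 3) - ndcg_at_k(before, 3)
print(f"Recall@3 before: {recall_at_k(before, 2, 3):.2f}, "
      f"NDCG@3 lift: {lift:.3f}")
```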


8. Embedding Drift Detection

8.1 Why Drift Happens

Embedding quality can degrade even if the model weights do not change:

  • user vocabulary shifts
  • new products or content are indexed
  • catalog mix changes
  • seasonal query patterns alter the query distribution

8.2 Practical Monitoring Methods

Top-1 similarity distribution

  • track the cosine similarity between each query and its top result
  • watch for distribution shifts over time

Probe set monitoring

  • keep a fixed set of representative queries
  • compare their embeddings or retrieval outputs across time windows

Cluster separation

  • group embeddings by content type
  • measure whether clusters are still well separated
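As a sketch of the first method, the two similarity distributions can be compared with a two-sample Kolmogorov-Smirnov test via scipy; the significance threshold is illustrative:

```python
from scipy.stats import ks_2samp

def check_top1_drift(baseline_sims, current_sims, alpha=0.01):
    """Compare two windows of query -> top-1 cosine similarities.

    baseline_sims: similarities from a healthy reference window
    current_sims:  similarities from the current window
    """
    statistic, p_value = ks_2samp(baseline_sims, current_sims)
    if p_value < alpha:
        print(f"Possible drift: KS={statistic:.3f}, p={p_value:.4f}")
    return p_value < alpha
```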

For significance testing around distribution shift, see 05-additional-statistical-tests.md in the Statistical-Inference folder.


9. Summary

| Component | Main math idea | Operational purpose |
|---|---|---|
| Embedding model | Text to dense vector map | Represent semantics numerically |
| Cosine similarity | Normalized dot product | Measure semantic closeness |
| HNSW ANN | Graph-based nearest-neighbor search | Make retrieval fast enough for production |
| Chunking | Sliding windows over token sequences | Balance context and precision |
| Cross-encoder reranking | Full token-level interaction | Improve ranking quality on a small set |
| Retrieval metrics | Recall, precision, hit rate, MRR, NDCG | Evaluate search quality |
| Drift monitoring | Distribution and stability checks | Catch silent retrieval degradation |

Retrieval quality is the compound result of embedding geometry, chunk design, ANN tuning, reranking quality, and evaluation discipline. Weakness in any one of those layers will show up in answer quality downstream.