Embedding and Retrieval Architectures in MangaAssist
The retrieval path is the backbone of grounded responses. If the embedding space is weak or the ranking pipeline is noisy, the generator receives poor context and the final answer quality drops with it.
This document covers the math and system design behind the RAG path used in MangaAssist.
1. Retrieval Flow
User query
-> embedding model
-> ANN search
-> metadata filtering
-> reranking
-> top context chunks
-> generation model
Each arrow corresponds to a real mathematical step: vector encoding, similarity scoring, graph traversal, or learned ranking.
2. Dense Vector Embeddings
2.1 What an Embedding Is
An embedding maps discrete text into a continuous vector space:
$$f: \text{Text} \rightarrow \mathbb{R}^d$$
In this project configuration, Titan Text Embeddings V2 returns 1,024-dimensional float vectors.
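As a minimal sketch of this mapping, the call below uses the Bedrock runtime API via boto3; the model ID, request fields, and response shape are taken from the public Titan Text Embeddings V2 documentation and should be checked against the client code this project actually uses.

```python
import json

import boto3

# Assumed model ID and request/response shape for Titan Text Embeddings V2;
# verify against the Bedrock documentation for your region and SDK version.
bedrock = boto3.client("bedrock-runtime")

def embed(text: str, dimensions: int = 1024) -> list[float]:
    """Return a dense vector for `text`, i.e. f: Text -> R^d."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text, "dimensions": dimensions}),
    )
    payload = json.loads(response["body"].read())
    return payload["embedding"]
```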
2.2 What Good Embeddings Should Do
| Property | Mathematical meaning | Why it matters |
|---|---|---|
| Similar text -> nearby vectors | $\cos(\mathbf{e}_a, \mathbf{e}_b)$ is high | Relevant chunks rise in the ranking |
| Dissimilar text -> separated vectors | $\cos(\mathbf{e}_a, \mathbf{e}_c)$ is low | Irrelevant context stays out of the prompt |
| Stable geometry | Neighborhoods remain meaningful over time | Retrieval quality is easier to monitor |
| Good coverage of the space | Vectors do not collapse into one region | The index can discriminate across many intents and topics |
2.3 Why Contrastive Objectives Matter
Many modern embedding systems are trained with a contrastive objective:
$$\mathcal{L} = -\log \frac{e^{\text{sim}(\mathbf{q}, \mathbf{d}^+)/\tau}}{e^{\text{sim}(\mathbf{q}, \mathbf{d}^+)/\tau} + \sum_j e^{\text{sim}(\mathbf{q}, \mathbf{d}_j^-)/\tau}}$$
This encourages relevant pairs to move closer together and irrelevant pairs to move farther apart in the vector space.
Even when the managed model internals are abstracted away, this is still the right intuition for how semantic retrieval behavior is learned.
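To make the objective concrete, here is a small numpy sketch of the loss for a single query with one positive and a handful of negative documents; it is purely illustrative and says nothing about how the managed model was actually trained.

```python
import numpy as np

def info_nce_loss(q, d_pos, d_negs, tau=0.05):
    """Contrastive loss for one query: pull d_pos toward q, push d_negs away.

    q: (d,) query embedding; d_pos: (d,) positive document; d_negs: (n, d) negatives.
    Cosine similarity plays the role of sim(., .), matching the retrieval metric.
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    pos = np.exp(cos(q, d_pos) / tau)
    negs = np.exp(np.array([cos(q, d) for d in d_negs]) / tau)
    return -np.log(pos / (pos + negs.sum()))
```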
3. Similarity Metrics
3.1 Cosine Similarity
$$\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u}\cdot\mathbf{v}}{\|\mathbf{u}\|_2\,\|\mathbf{v}\|_2}$$
Cosine similarity is the primary metric because it focuses on direction rather than raw magnitude.
Used in:
- vector retrieval
- embedding diagnostics
- semantic similarity evaluation
3.2 Dot Product
$$\mathbf{u}\cdot\mathbf{v} = \sum_{i=1}^{d} u_i v_i$$
If vectors are normalized, dot product and cosine similarity induce the same ranking. This matters because some systems normalize embeddings at write time to make retrieval cheaper.
3.3 Euclidean Distance
$$\|\mathbf{u} - \mathbf{v}\|_2 = \sqrt{\sum_{i=1}^{d}(u_i - v_i)^2}$$
Euclidean distance is still useful for clustering or diagnostics, but cosine similarity is usually the better default for text retrieval.
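A short numpy sketch of the three metrics, including a check that dot product and cosine similarity produce the same ranking once vectors are L2-normalized:

```python
import numpy as np

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def euclidean(u, v):
    return np.linalg.norm(u - v)

def unit(x):
    """L2-normalize vectors along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
query = rng.normal(size=1024)
docs = rng.normal(size=(5, 1024))

# On normalized vectors, dot product and cosine similarity rank identically.
cosine_order = np.argsort([-cosine(query, d) for d in docs])
dot_order = np.argsort(-(unit(docs) @ unit(query)))
assert np.array_equal(cosine_order, dot_order)

# Euclidean distance stays useful for clustering diagnostics, but it is not
# scale-invariant, which is why cosine is the retrieval default here.
distances = np.array([euclidean(query, d) for d in docs])
```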
3.4 Why Cosine Usually Wins for Text
| Property | Cosine | Euclidean |
|---|---|---|
| Scale-invariant | Yes | No |
| Works well with normalized embeddings | Yes | Less direct |
| Less sensitive to length effects | Yes | No |
| Retrieval default in this project | Yes | No |
4. Approximate Nearest-Neighbor Search
4.1 Why Exact Search Does Not Scale
Exact KNN compares the query with every indexed vector:
$$O(Nd)$$
For large indexes, that becomes too expensive for real-time chat traffic.
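For intuition, exact search is a single pass over every indexed vector, which is exactly where the O(Nd) cost comes from. A numpy sketch (not the production path):

```python
import numpy as np

def exact_top_k(query, index_vectors, k=5):
    """Brute-force cosine KNN: touches every row, so cost grows as O(N * d)."""
    q = query / np.linalg.norm(query)
    X = index_vectors / np.linalg.norm(index_vectors, axis=1, keepdims=True)
    scores = X @ q                          # N dot products of length d
    top = np.argpartition(-scores, k)[:k]   # unordered top-k candidates
    return top[np.argsort(-scores[top])]    # indices of the k best chunks
```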
4.2 HNSW Intuition
OpenSearch commonly uses HNSW, a graph-based ANN method.
Search intuition:
- start from a coarse graph layer
- move greedily toward better neighbors
- descend into denser layers
- refine the candidate set near the query
This avoids scanning the full corpus while preserving high recall.
4.3 Important Tuning Knobs
| Parameter | Meaning | Trade-off |
|---|---|---|
| `M` | Maximum graph connectivity per node | Higher recall, more memory |
| `ef_construction` | Candidate breadth during index build | Better graph quality, slower indexing |
| `ef_search` | Candidate breadth during search when the engine exposes it | Better recall, slower queries |
Note: exact semantics can vary by OpenSearch engine and version, so these parameters should be treated as implementation-specific tuning knobs rather than universal constants.
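As a hedged illustration of where these knobs live, the sketch below creates a k-NN index with opensearch-py; the engine, space type, parameter spellings, and the index name mangaassist-chunks are assumptions to check against the OpenSearch version actually deployed.

```python
from opensearchpy import OpenSearch  # assumes the opensearch-py client

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Field mapping and parameters are illustrative; names and defaults vary by
# k-NN engine and OpenSearch version.
index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "chunk_text": {"type": "text"},
            "chunk_vector": {
                "type": "knn_vector",
                "dimension": 1024,
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "nmslib",
                    "parameters": {"m": 16, "ef_construction": 128},
                },
            },
        }
    },
}
client.indices.create(index="mangaassist-chunks", body=index_body)
```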
5. Chunking Strategy
5.1 Why Chunking Exists
Documents must be split before embedding because:
- embedding models have context-window limits
- retrieval needs fine-grained units
- overly large chunks dilute relevance
- overly small chunks lose context
5.2 Project Chunking Heuristics
| Content type | Chunk size | Overlap | Rationale |
|---|---|---|---|
| Product descriptions | 256 tokens | 25 tokens | Keep product details focused |
| FAQ content | 512 tokens | 50 tokens | Preserve self-contained answers |
| Policies | 512 tokens | 50 tokens | Avoid splitting rules across chunks |
| Reviews | 128 tokens | 0 tokens | Reviews are naturally short |
5.3 Overlap Math
If chunk size is 512 and overlap is 50, then the stride is:
$$512 - 50 = 462$$
So chunk start positions are:
$$0, 462, 924, 1386, \ldots$$
Overlap improves recall at chunk boundaries: a fact that would otherwise be split across two chunks appears intact in at least one of them.
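The same arithmetic as a small helper, assuming token counts are already known (tokenization itself is out of scope here):

```python
def chunk_spans(num_tokens: int, chunk_size: int = 512, overlap: int = 50):
    """Yield (start, end) token offsets for overlapping chunks."""
    stride = chunk_size - overlap              # 512 - 50 = 462
    for start in range(0, num_tokens, stride):
        yield start, min(start + chunk_size, num_tokens)

# Example: a 1,500-token policy document.
print(list(chunk_spans(1500)))
# [(0, 512), (462, 974), (924, 1436), (1386, 1500)]
```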
6. Two-Stage Retrieval
6.1 Stage 1: Embedding Retrieval
Query
-> Titan Text Embeddings V2
-> query vector
-> OpenSearch ANN index
-> top candidate chunks
Why this stage is fast:
- chunk embeddings are precomputed
- the query is embedded once
- ANN search prunes the space aggressively
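A sketch of the stage-1 query, assuming an opensearch-py client plus the hypothetical embed() helper and index name from the earlier sketches:

```python
def retrieve_candidates(client, embed, query_text: str, k: int = 20) -> list[str]:
    """Stage 1: embed the query once, then let the ANN index prune the corpus.

    `client` is an opensearch-py client and `embed` is the embedding helper
    sketched earlier; both are assumptions about the surrounding code.
    """
    query_vector = embed(query_text)
    body = {
        "size": k,
        "query": {"knn": {"chunk_vector": {"vector": query_vector, "k": k}}},
        "_source": ["chunk_text"],
    }
    response = client.search(index="mangaassist-chunks", body=body)
    return [hit["_source"]["chunk_text"] for hit in response["hits"]["hits"]]
```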
6.2 Stage 2: Cross-Encoder Reranking
For each candidate:
[CLS] query [SEP] chunk [SEP]
-> cross-encoder
-> relevance score
Sort by score -> keep top few chunks
Why this stage is more accurate:
- query and document tokens interact directly
- relevance is learned as a ranking problem
- lexical match and semantic match can both influence the score
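As an illustrative stand-in for this stage, the snippet below uses the sentence-transformers CrossEncoder class with a public MS MARCO checkpoint; the model named here is an example, not necessarily the reranker this project runs.

```python
from sentence_transformers import CrossEncoder

# Any cross-encoder checkpoint works here; this one is a common public example.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 3) -> list[str]:
    """Stage 2: score every (query, chunk) pair jointly, keep the best few."""
    scores = reranker.predict([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]
```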
6.3 Why Two Stages Instead of One
| Option | Speed | Accuracy | Operational fit |
|---|---|---|---|
| Bi-encoder only | Fast | Good | Great for large candidate pools |
| Cross-encoder only | Slow | Best | Too expensive over a full corpus |
| Two-stage pipeline | Fast enough | Near-best | Best balance for production |
7. Retrieval Metrics
7.1 Recall@K
$$\text{Recall@}K = \frac{|\text{relevant docs in top-}K|}{|\text{total relevant docs}|}$$
Use this when multiple documents may be relevant.
7.2 Precision@K
$$\text{Precision@}K = \frac{|\text{relevant docs in top-}K|}{K}$$
This matters because low-precision context wastes tokens and increases hallucination risk.
7.3 Hit@K
$$\text{Hit@}K = \mathbb{1}[\text{at least one relevant item appears in top-}K]$$
This is often the most intuitive operational metric for RAG. If the top three chunks contain at least one useful grounding chunk, the generator still has a good chance to answer well.
7.4 Mean Reciprocal Rank
$$\text{MRR} = \frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{\text{rank}_i}$$
MRR rewards putting the first relevant item as high as possible.
7.5 NDCG@K
$$\text{DCG@}K = \sum_{i=1}^{K}\frac{2^{rel_i} - 1}{\log_2(i+1)}$$
$$\text{NDCG@}K = \frac{\text{DCG@}K}{\text{IDCG@}K}$$
NDCG is especially useful when relevance is graded rather than binary.
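Reference implementations of these metrics, written against simple relevance labels (binary for recall, precision, hit rate, and MRR; graded for NDCG):

```python
import numpy as np

def recall_at_k(retrieved, relevant, k):
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    return len(set(retrieved[:k]) & set(relevant)) / k

def hit_at_k(retrieved, relevant, k):
    return int(any(doc in relevant for doc in retrieved[:k]))

def mrr(ranked_lists, relevant_sets):
    """Mean reciprocal rank over a query set (ranks are 1-based)."""
    rr = []
    for retrieved, relevant in zip(ranked_lists, relevant_sets):
        rank = next((i + 1 for i, doc in enumerate(retrieved) if doc in relevant), None)
        rr.append(1.0 / rank if rank else 0.0)
    return float(np.mean(rr))

def ndcg_at_k(gains, k):
    """gains: graded relevance of the retrieved list, in retrieved order.

    Normalizes against the best ordering of the retrieved gains, a common
    simplification when full relevance judgments are unavailable.
    """
    gains = np.asarray(gains, dtype=float)[:k]
    discounts = np.log2(np.arange(2, gains.size + 2))
    dcg = np.sum((2 ** gains - 1) / discounts)
    ideal = np.sort(gains)[::-1]
    idcg = np.sum((2 ** ideal - 1) / discounts)
    return dcg / idcg if idcg > 0 else 0.0
```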
7.6 Reranking Lift
$$\text{Lift} = \text{NDCG@3}_{\text{after rerank}} - \text{NDCG@3}_{\text{before rerank}}$$
This quantifies whether the reranker is earning its latency cost.
8. Embedding Drift Detection
8.1 Why Drift Happens
Embedding quality can degrade even if the model weights do not change:
- user vocabulary shifts
- new products or content are indexed
- catalog mix changes
- seasonal query patterns alter the query distribution
8.2 Practical Monitoring Methods
Top-1 similarity distribution
- track the cosine similarity between each query and its top result
- watch for distribution shifts over time
Probe set monitoring
- keep a fixed set of representative queries
- compare their embeddings or retrieval outputs across time windows
Cluster separation
- group embeddings by content type
- measure whether clusters are still well separated
For significance testing around distribution shift, see 05-additional-statistical-tests.md in the Statistical-Inference folder.
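One way to operationalize the top-1 similarity check is to compare the current window against a baseline window with a two-sample test; the sketch below uses scipy's KS test as one reasonable option, not a prescribed choice.

```python
import numpy as np
from scipy.stats import ks_2samp

def top1_similarity_shift(baseline_sims, current_sims, alpha=0.01):
    """Compare top-1 cosine similarity distributions across two time windows.

    baseline_sims / current_sims: arrays of cos(query, top-1 chunk) values
    collected from production traffic or a fixed probe query set.
    """
    stat, p_value = ks_2samp(baseline_sims, current_sims)
    return {
        "baseline_mean": float(np.mean(baseline_sims)),
        "current_mean": float(np.mean(current_sims)),
        "ks_statistic": float(stat),
        "p_value": float(p_value),
        "drift_flag": p_value < alpha,
    }
```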
9. Summary
| Component | Main math idea | Operational purpose |
|---|---|---|
| Embedding model | Text to dense vector map | Represent semantics numerically |
| Cosine similarity | Normalized dot product | Measure semantic closeness |
| HNSW ANN | Graph-based nearest-neighbor search | Make retrieval fast enough for production |
| Chunking | Sliding windows over token sequences | Balance context and precision |
| Cross-encoder reranking | Full token-level interaction | Improve ranking quality on a small set |
| Retrieval metrics | Recall, precision, hit rate, MRR, NDCG | Evaluate search quality |
| Drift monitoring | Distribution and stability checks | Catch silent retrieval degradation |
Retrieval quality is the compound result of embedding geometry, chunk design, ANN tuning, reranking quality, and evaluation discipline. Weakness in any one of those layers will show up in answer quality downstream.