Embedding and Retrieval Architectures in MangaAssist
The retrieval path is the backbone of grounded responses. If the embedding space is weak or the ranking pipeline is noisy, the generator receives poor context and the final answer quality drops with it.
This document covers the math and system design behind the RAG path used in MangaAssist.
1. Retrieval Flow
User query
-> embedding model
-> ANN search
-> metadata filtering
-> reranking
-> top context chunks
-> generation model
Each arrow corresponds to a real mathematical step: vector encoding, similarity scoring, graph traversal, or learned ranking.
2. Dense Vector Embeddings
2.1 What an Embedding Is
An embedding maps discrete text into a continuous vector space:
$$f: \text{Text} \rightarrow \mathbb{R}^d$$
In this project configuration, Titan Text Embeddings V2 returns 1,024-dimensional float vectors.
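As a minimal sketch of this mapping, the call below uses the Bedrock runtime API via boto3; the model ID, request fields, and response shape are taken from the public Titan Text Embeddings V2 documentation and should be checked against the client code this project actually uses.

```python
import json

import boto3

# Assumed model ID and request/response shape for Titan Text Embeddings V2;
# verify against the Bedrock documentation for your region and SDK version.
bedrock = boto3.client("bedrock-runtime")

def embed(text: str, dimensions: int = 1024) -> list[float]:
    """Return a dense vector for `text`, i.e. f: Text -> R^d."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text, "dimensions": dimensions}),
    )
    payload = json.loads(response["body"].read())
    return payload["embedding"]
```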
2.2 What Good Embeddings Should Do
| Property | Mathematical meaning | Why it matters |
|---|---|---|
| Similar text -> nearby vectors | $\cos(\mathbf{e}_a, \mathbf{e}_b)$ is high | Relevant chunks rise in the ranking |
| Dissimilar text -> separated vectors | $\cos(\mathbf{e}_a, \mathbf{e}_c)$ is low | Irrelevant context stays out of the prompt |
| Stable geometry | Neighborhoods remain meaningful over time | Retrieval quality is easier to monitor |
| Good coverage of the space | Vectors do not collapse into one region | The index can discriminate across many intents and topics |
2.3 Why Contrastive Objectives Matter
Many modern embedding systems are trained with a contrastive objective:
$$\mathcal{L} = -\log \frac{e^{\text{sim}(\mathbf{q}, \mathbf{d}^+)/\tau}}{e^{\text{sim}(\mathbf{q}, \mathbf{d}^+)/\tau} + \sum_j e^{\text{sim}(\mathbf{q}, \mathbf{d}_j^-)/\tau}}$$
This encourages relevant pairs to move closer together and irrelevant pairs to move farther apart in the vector space.
Even when the managed model internals are abstracted away, this is still the right intuition for how semantic retrieval behavior is learned.
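To make the objective concrete, here is a small numpy sketch of the loss for a single query with one positive and a handful of negative documents; it is purely illustrative and says nothing about how the managed model was actually trained.

```python
import numpy as np

def info_nce_loss(q, d_pos, d_negs, tau=0.05):
    """Contrastive loss for one query: pull d_pos toward q, push d_negs away.

    q: (d,) query embedding; d_pos: (d,) positive document; d_negs: (n, d) negatives.
    Cosine similarity plays the role of sim(., .), matching the retrieval metric.
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    pos = np.exp(cos(q, d_pos) / tau)
    negs = np.exp(np.array([cos(q, d) for d in d_negs]) / tau)
    return -np.log(pos / (pos + negs.sum()))
```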
3. Similarity Metrics
3.1 Cosine Similarity
$$\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u}\cdot\mathbf{v}}{\|\mathbf{u}\|_2\,\|\mathbf{v}\|_2}$$
Cosine similarity is the primary metric because it focuses on direction rather than raw magnitude.
Used in:
- vector retrieval
- embedding diagnostics
- semantic similarity evaluation
3.2 Dot Product
$$\mathbf{u}\cdot\mathbf{v} = \sum_{i=1}^{d} u_i v_i$$
If vectors are normalized, dot product and cosine similarity induce the same ranking. This matters because some systems normalize embeddings at write time to make retrieval cheaper.
3.3 Euclidean Distance
$$\|\mathbf{u} - \mathbf{v}\|_2 = \sqrt{\sum_{i=1}^{d}(u_i - v_i)^2}$$
Euclidean distance is still useful for clustering or diagnostics, but cosine similarity is usually the better default for text retrieval.
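A short numpy sketch of the three metrics, including a check that dot product and cosine similarity produce the same ranking once vectors are L2-normalized:

```python
import numpy as np

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def euclidean(u, v):
    return np.linalg.norm(u - v)

def unit(x):
    """L2-normalize vectors along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
query = rng.normal(size=1024)
docs = rng.normal(size=(5, 1024))

# On normalized vectors, dot product and cosine similarity rank identically.
cosine_order = np.argsort([-cosine(query, d) for d in docs])
dot_order = np.argsort(-(unit(docs) @ unit(query)))
assert np.array_equal(cosine_order, dot_order)

# Euclidean distance stays useful for clustering diagnostics, but it is not
# scale-invariant, which is why cosine is the retrieval default here.
distances = np.array([euclidean(query, d) for d in docs])
```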
3.4 Why Cosine Usually Wins for Text
| Property | Cosine | Euclidean |
|---|---|---|
| Scale-invariant | Yes | No |
| Works well with normalized embeddings | Yes | Less direct |
| Less sensitive to length effects | Yes | No |
| Retrieval default in this project | Yes | No |
4. Approximate Nearest-Neighbor Search
4.1 Why Exact Search Does Not Scale
Exact KNN compares the query with every indexed vector:
$$O(Nd)$$
For large indexes, that becomes too expensive for real-time chat traffic.
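For intuition, exact search is a single pass over every indexed vector, which is exactly where the O(Nd) cost comes from. A numpy sketch (not the production path):

```python
import numpy as np

def exact_top_k(query, index_vectors, k=5):
    """Brute-force cosine KNN: touches every row, so cost grows as O(N * d)."""
    q = query / np.linalg.norm(query)
    X = index_vectors / np.linalg.norm(index_vectors, axis=1, keepdims=True)
    scores = X @ q                          # N dot products of length d
    top = np.argpartition(-scores, k)[:k]   # unordered top-k candidates
    return top[np.argsort(-scores[top])]    # indices of the k best chunks
```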
4.2 HNSW Intuition
OpenSearch commonly uses HNSW, a graph-based ANN method.
Search intuition:
- start from a coarse graph layer
- move greedily toward better neighbors
- descend into denser layers
- refine the candidate set near the query
This avoids scanning the full corpus while preserving high recall.
4.3 Important Tuning Knobs
| Parameter | Meaning | Trade-off |
|---|---|---|
| `M` | Maximum graph connectivity per node | Higher recall, more memory |
| `ef_construction` | Candidate breadth during index build | Better graph quality, slower indexing |
| `ef_search` | Candidate breadth during search when the engine exposes it | Better recall, slower queries |
Note: exact semantics can vary by OpenSearch engine and version, so these parameters should be treated as implementation-specific tuning knobs rather than universal constants.
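As a hedged illustration of where these knobs live, the sketch below creates a k-NN index with opensearch-py; the engine, space type, parameter spellings, and the index name mangaassist-chunks are assumptions to check against the OpenSearch version actually deployed.

```python
from opensearchpy import OpenSearch  # assumes the opensearch-py client

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Field mapping and parameters are illustrative; names and defaults vary by
# k-NN engine and OpenSearch version.
index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "chunk_text": {"type": "text"},
            "chunk_vector": {
                "type": "knn_vector",
                "dimension": 1024,
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "nmslib",
                    "parameters": {"m": 16, "ef_construction": 128},
                },
            },
        }
    },
}
client.indices.create(index="mangaassist-chunks", body=index_body)
```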
5. Chunking Strategy
5.1 Why Chunking Exists
Documents must be split before embedding because:
- embedding models have context-window limits
- retrieval needs fine-grained units
- overly large chunks dilute relevance
- overly small chunks lose context
5.2 Project Chunking Heuristics
| Content type | Chunk size | Overlap | Rationale |
|---|---|---|---|
| Product descriptions | 256 tokens | 25 tokens | Keep product details focused |
| FAQ content | 512 tokens | 50 tokens | Preserve self-contained answers |
| Policies | 512 tokens | 50 tokens | Avoid splitting rules across chunks |
| Reviews | 128 tokens | 0 tokens | Reviews are naturally short |
5.3 Overlap Math
If chunk size is 512 and overlap is 50, then the stride is:
$$512 - 50 = 462$$
So chunk start positions are:
$$0, 462, 924, 1386, \ldots$$
Overlap improves recall at chunk boundaries: a fact that would otherwise be split across two chunks appears intact in at least one of them.
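The same arithmetic as a small helper, assuming token counts are already known (tokenization itself is out of scope here):

```python
def chunk_spans(num_tokens: int, chunk_size: int = 512, overlap: int = 50):
    """Yield (start, end) token offsets for overlapping chunks."""
    stride = chunk_size - overlap              # 512 - 50 = 462
    for start in range(0, num_tokens, stride):
        yield start, min(start + chunk_size, num_tokens)

# Example: a 1,500-token policy document.
print(list(chunk_spans(1500)))
# [(0, 512), (462, 974), (924, 1436), (1386, 1500)]
```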
6. Two-Stage Retrieval
6.1 Stage 1: Embedding Retrieval
Query
-> Titan Text Embeddings V2
-> query vector
-> OpenSearch ANN index
-> top candidate chunks
Why this stage is fast:
- chunk embeddings are precomputed
- the query is embedded once
- ANN search prunes the space aggressively
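A sketch of the stage-1 query, assuming an opensearch-py client plus the hypothetical embed() helper and index name from the earlier sketches:

```python
def retrieve_candidates(client, embed, query_text: str, k: int = 20) -> list[str]:
    """Stage 1: embed the query once, then let the ANN index prune the corpus.

    `client` is an opensearch-py client and `embed` is the embedding helper
    sketched earlier; both are assumptions about the surrounding code.
    """
    query_vector = embed(query_text)
    body = {
        "size": k,
        "query": {"knn": {"chunk_vector": {"vector": query_vector, "k": k}}},
        "_source": ["chunk_text"],
    }
    response = client.search(index="mangaassist-chunks", body=body)
    return [hit["_source"]["chunk_text"] for hit in response["hits"]["hits"]]
```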
6.2 Stage 2: Cross-Encoder Reranking
For each candidate:
[CLS] query [SEP] chunk [SEP]
-> cross-encoder
-> relevance score
Sort by score -> keep top few chunks
Why this stage is more accurate:
- query and document tokens interact directly
- relevance is learned as a ranking problem
- lexical match and semantic match can both influence the score
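As an illustrative stand-in for this stage, the snippet below uses the sentence-transformers CrossEncoder class with a public MS MARCO checkpoint; the model named here is an example, not necessarily the reranker this project runs.

```python
from sentence_transformers import CrossEncoder

# Any cross-encoder checkpoint works here; this one is a common public example.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 3) -> list[str]:
    """Stage 2: score every (query, chunk) pair jointly, keep the best few."""
    scores = reranker.predict([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]
```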
6.3 Why Two Stages Instead of One
| Option | Speed | Accuracy | Operational fit |
|---|---|---|---|
| Bi-encoder only | Fast | Good | Great for large candidate pools |
| Cross-encoder only | Slow | Best | Too expensive over a full corpus |
| Two-stage pipeline | Fast enough | Near-best | Best balance for production |
7. Retrieval Metrics
7.1 Recall@K
$$\text{Recall@}K = \frac{|\text{relevant docs in top-}K|}{|\text{total relevant docs}|}$$
Use this when multiple documents may be relevant.
7.2 Precision@K
$$\text{Precision@}K = \frac{|\text{relevant docs in top-}K|}{K}$$
This matters because low-precision context wastes tokens and increases hallucination risk.
7.3 Hit@K
$$\text{Hit@}K = \mathbb{1}[\text{at least one relevant item appears in top-}K]$$
This is often the most intuitive operational metric for RAG. If the top three chunks contain at least one useful grounding chunk, the generator still has a good chance to answer well.
7.4 Mean Reciprocal Rank
$$\text{MRR} = \frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{\text{rank}_i}$$
MRR rewards putting the first relevant item as high as possible.
7.5 NDCG@K
$$\text{DCG@}K = \sum_{i=1}^{K}\frac{2^{rel_i} - 1}{\log_2(i+1)}$$
$$\text{NDCG@}K = \frac{\text{DCG@}K}{\text{IDCG@}K}$$
NDCG is especially useful when relevance is graded rather than binary.
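Reference implementations of these metrics, written against simple relevance labels (binary for recall, precision, hit rate, and MRR; graded for NDCG):

```python
import numpy as np

def recall_at_k(retrieved, relevant, k):
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    return len(set(retrieved[:k]) & set(relevant)) / k

def hit_at_k(retrieved, relevant, k):
    return int(any(doc in relevant for doc in retrieved[:k]))

def mrr(ranked_lists, relevant_sets):
    """Mean reciprocal rank over a query set (ranks are 1-based)."""
    rr = []
    for retrieved, relevant in zip(ranked_lists, relevant_sets):
        rank = next((i + 1 for i, doc in enumerate(retrieved) if doc in relevant), None)
        rr.append(1.0 / rank if rank else 0.0)
    return float(np.mean(rr))

def ndcg_at_k(gains, k):
    """gains: graded relevance of the retrieved list, in retrieved order.

    Normalizes against the best ordering of the retrieved gains, a common
    simplification when full relevance judgments are unavailable.
    """
    gains = np.asarray(gains, dtype=float)[:k]
    discounts = np.log2(np.arange(2, gains.size + 2))
    dcg = np.sum((2 ** gains - 1) / discounts)
    ideal = np.sort(gains)[::-1]
    idcg = np.sum((2 ** ideal - 1) / discounts)
    return dcg / idcg if idcg > 0 else 0.0
```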
7.6 Reranking Lift
$$\text{Lift} = \text{NDCG@3}_{\text{after rerank}} - \text{NDCG@3}_{\text{before rerank}}$$
This quantifies whether the reranker is earning its latency cost.
8. Embedding Drift Detection
8.1 Why Drift Happens
Embedding quality can degrade even if the model weights do not change:
- user vocabulary shifts
- new products or content are indexed
- catalog mix changes
- seasonal query patterns alter the query distribution
8.2 Practical Monitoring Methods
Top-1 similarity distribution
- track the cosine similarity between each query and its top result
- watch for distribution shifts over time
Probe set monitoring
- keep a fixed set of representative queries
- compare their embeddings or retrieval outputs across time windows
Cluster separation
- group embeddings by content type
- measure whether clusters are still well separated
For significance testing around distribution shift, see 05-additional-statistical-tests.md in the Statistical-Inference folder.
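One way to operationalize the top-1 similarity check is to compare the current window against a baseline window with a two-sample test; the sketch below uses scipy's KS test as one reasonable option, not a prescribed choice.

```python
import numpy as np
from scipy.stats import ks_2samp

def top1_similarity_shift(baseline_sims, current_sims, alpha=0.01):
    """Compare top-1 cosine similarity distributions across two time windows.

    baseline_sims / current_sims: arrays of cos(query, top-1 chunk) values
    collected from production traffic or a fixed probe query set.
    """
    stat, p_value = ks_2samp(baseline_sims, current_sims)
    return {
        "baseline_mean": float(np.mean(baseline_sims)),
        "current_mean": float(np.mean(current_sims)),
        "ks_statistic": float(stat),
        "p_value": float(p_value),
        "drift_flag": p_value < alpha,
    }
```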
9. Summary
| Component | Main math idea | Operational purpose |
|---|---|---|
| Embedding model | Text to dense vector map | Represent semantics numerically |
| Cosine similarity | Normalized dot product | Measure semantic closeness |
| HNSW ANN | Graph-based nearest-neighbor search | Make retrieval fast enough for production |
| Chunking | Sliding windows over token sequences | Balance context and precision |
| Cross-encoder reranking | Full token-level interaction | Improve ranking quality on a small set |
| Retrieval metrics | Recall, precision, hit rate, MRR, NDCG | Evaluate search quality |
| Drift monitoring | Distribution and stability checks | Catch silent retrieval degradation |
Retrieval quality is the compound result of embedding geometry, chunk design, ANN tuning, reranking quality, and evaluation discipline. Weakness in any one of those layers will show up in answer quality downstream.