Matrix Theory and Linear Algebra in MangaAssist
Matrix theory is the runtime language of modern ML systems. Every embedding lookup, attention score, classifier head, and recommender score in MangaAssist can be explained as vector operations, matrix multiplication, normalization, and low-rank structure.
This document uses a row-vector convention for forward passes so the shapes stay consistent with how tensors are usually written in production code:
- single example: $\mathbf{y} = \mathbf{x}\mathbf{W} + \mathbf{b}$
- batch of examples: $\mathbf{Y} = \mathbf{X}\mathbf{W} + \mathbf{b}$
1. Why Matrix Theory Matters
The project uses several different model components, but they all reduce to a common set of linear algebra operations:
- intent classification uses embedding lookup tables, projection matrices, and softmax outputs
- retrieval uses dense vectors, similarity scoring, and approximate nearest-neighbor search
- reranking uses attention over a concatenated query-document sequence
- recommendation uses latent vectors and dot-product scoring
Understanding that shared math makes the stack easier to debug, explain, and optimize.
2. Vectors and Vector Spaces
2.1 What a Vector Represents in MangaAssist
| Component | What the vector represents | Typical dimensionality |
|---|---|---|
| Titan Text Embeddings V2 | Dense representation of a query, chunk, or document | 1,024 in this project configuration |
| DistilBERT hidden state | Contextual representation of one token | 768 |
| MiniLM reranker output | Joint query-document representation before scoring | 384 |
| One-hot intent label | Sparse encoding of one intent class | 8 |
| Softmax output | Probability distribution over intent classes | 8 |
2.2 Core Vector Operations
Dot product
$$\mathbf{u} \cdot \mathbf{v} = \sum_{i=1}^{d} u_i v_i$$
Used in attention, latent-factor scoring, and linear layers.
Cosine similarity
$$\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{|\mathbf{u}|_2 \cdot |\mathbf{v}|_2}$$
Used in retrieval, embedding comparisons, and drift monitoring.
Example: retrieval scoring
When a user asks for a recommendation such as "What dark fantasy manga is similar to Berserk?", the query is converted to a dense vector and compared with document vectors in the index.
```python
# Embed the query, then score every chunk vector in the index
query_vector = embed("What dark fantasy manga is similar to Berserk?")
scores = [cosine_similarity(query_vector, chunk_vector) for chunk_vector in index]
```
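A minimal runnable version of that scoring step is sketched below with numpy; `embed` and `index` are placeholders standing in for the project's embedding call and vector store.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two dense vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def top_k_chunks(query_vector: np.ndarray, index: list[np.ndarray], k: int = 5) -> np.ndarray:
    """Return the indices of the k chunk vectors most similar to the query."""
    scores = np.array([cosine_similarity(query_vector, chunk) for chunk in index])
    return np.argsort(scores)[::-1][:k]
```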
Euclidean norm and distance
$$|\mathbf{u} - \mathbf{v}|_2 = \sqrt{\sum_{i=1}^{d}(u_i - v_i)^2}$$
This is less common for text retrieval than cosine similarity, but it is still useful for diagnostics and clustering analysis.
3. Matrices in Neural Networks
3.1 Linear Layers
For one example written as a row vector:
$$\mathbf{y} = \mathbf{x}\mathbf{W} + \mathbf{b}$$
Where:
- $\mathbf{x} \in \mathbb{R}^{1 \times d_{\text{in}}}$ is the input
- $\mathbf{W} \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}$ is the weight matrix
- $\mathbf{b} \in \mathbb{R}^{1 \times d_{\text{out}}}$ is the bias
- $\mathbf{y} \in \mathbb{R}^{1 \times d_{\text{out}}}$ is the output
For a batch of $B$ examples:
$$\mathbf{Y} = \mathbf{X}\mathbf{W} + \mathbf{b}$$
Where $\mathbf{X} \in \mathbb{R}^{B \times d_{\text{in}}}$ and $\mathbf{Y} \in \mathbb{R}^{B \times d_{\text{out}}}$.
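A minimal numpy sketch of this batched layer under the row-vector convention (the sizes below are illustrative, not taken from a specific model):

```python
import numpy as np

rng = np.random.default_rng(0)
B, d_in, d_out = 32, 768, 3072          # illustrative batch size and layer widths

X = rng.standard_normal((B, d_in))      # batch of input row vectors
W = rng.standard_normal((d_in, d_out))  # weight matrix
b = rng.standard_normal((1, d_out))     # bias, broadcast over the batch

Y = X @ W + b                           # (B, d_in) @ (d_in, d_out) -> (B, d_out)
assert Y.shape == (B, d_out)
```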
This is why GPUs matter so much: transformer inference is dominated by large batched matrix multiplications.
3.2 Example Matrices in the Stack
| Component | Layer | Matrix shape | What it does |
|---|---|---|---|
| DistilBERT | Token embedding table | $(30{,}522 \times 768)$ | Maps token IDs to dense vectors |
| DistilBERT | Query projection | $(768 \times 768)$ | Projects hidden states into query space |
| DistilBERT | Key projection | $(768 \times 768)$ | Projects hidden states into key space |
| DistilBERT | Value projection | $(768 \times 768)$ | Projects hidden states into value space |
| DistilBERT | FFN layer 1 | $(768 \times 3072)$ | Expands features inside the transformer block |
| DistilBERT | FFN layer 2 | $(3072 \times 768)$ | Projects back to model width |
| DistilBERT | Classification head | $(768 \times 8)$ | Maps sequence representation to 8 intent logits |
| MiniLM reranker | Final scoring layer | $(384 \times 1)$ | Produces one relevance score |
| Titan Text Embeddings V2 | Output projection | $(h \times 1024)$ | Maps internal hidden states to the configured embedding size |
3.3 Matrix View of Production Inference
In production, a request batch is processed as a tensor rather than one item at a time:
```text
X  : batch_size x sequence_length x hidden_size
WQ : hidden_size x hidden_size
Q  = X WQ
```
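A minimal numpy sketch of that projection, relying on numpy broadcasting the matrix multiplication over the leading batch dimension (shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, hidden = 8, 128, 768

X  = rng.standard_normal((batch_size, seq_len, hidden))   # activation tensor
WQ = rng.standard_normal((hidden, hidden))                 # query projection weights

Q = X @ WQ                       # matmul applied over the last two axes
assert Q.shape == (batch_size, seq_len, hidden)
```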
The same pattern repeats across attention projections, feedforward blocks, classifier heads, and recommender scoring.
4. Attention as Matrix Operations
4.1 Scaled Dot-Product Attention
The core attention formula is:
$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$$
This breaks into four steps:
- $\mathbf{Q}\mathbf{K}^T$ computes pairwise similarity between query tokens and key tokens.
- Division by $\sqrt{d_k}$ prevents large dot products from saturating softmax.
- Softmax converts each row into a probability distribution.
- Multiplying by $\mathbf{V}$ forms a weighted mixture of value vectors.
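A minimal numpy sketch of those four steps for a single head (no masking, illustrative shapes):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # pairwise query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted mixture of value vectors

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((10, 64)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)          # shape (10, 64)
```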
4.2 Multi-Head Attention
Transformers do not use a single attention map. They run several in parallel:
$$\text{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\mathbf{W}^O$$
with
$$\text{head}_i = \text{Attention}(\mathbf{Q}\mathbf{W}_i^Q, \mathbf{K}\mathbf{W}_i^K, \mathbf{V}\mathbf{W}_i^V)$$
For DistilBERT, the common base configuration uses 12 attention heads and hidden size 768, so each head works over 64-dimensional slices.
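A minimal numpy sketch of the head split for that configuration (the projection weights here are random placeholders, not trained parameters):

```python
import numpy as np

hidden, n_heads = 768, 12
d_head = hidden // n_heads                        # 64 dimensions per head

rng = np.random.default_rng(0)
seq_len = 16
H = rng.standard_normal((seq_len, hidden))        # hidden states for one sequence
W_Q = rng.standard_normal((hidden, hidden))       # placeholder query projection

Q = H @ W_Q                                       # (seq_len, 768)
Q_heads = Q.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
assert Q_heads.shape == (n_heads, seq_len, d_head)   # 12 heads of 64-dim queries
```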
4.3 Self-Attention vs. Cross-Attention in This Stack
| Type | Query source | Key/value source | Project usage |
|---|---|---|---|
| Self-attention | Same sequence | Same sequence | DistilBERT encoding a user message |
| Causal self-attention | Prefix of the same sequence | Prefix of the same sequence | Claude-style next-token generation over prompt and response |
| Bi-encoder pattern | Query and document encoded separately | No token-level interaction at retrieval time | Embedding-based first-stage retrieval |
| Cross-encoder interaction | Query and document tokens in one sequence | Same concatenated sequence | MiniLM reranking |
5. Eigendecomposition and Singular Value Decomposition
5.1 Eigendecomposition
For a square matrix $\mathbf{A}$:
$$\mathbf{A}\mathbf{v} = \lambda\mathbf{v}$$
Where $\mathbf{v}$ is an eigenvector and $\lambda$ is the corresponding eigenvalue.
Useful applications in this project include:
- interpreting covariance structure in embeddings
- PCA-style visualization of high-dimensional vectors
- studying concentration or dispersion in attention maps
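As one concrete use, a PCA-style 2-D projection of a batch of embeddings can be computed from the eigendecomposition of their covariance matrix. A minimal numpy sketch, with random vectors standing in for real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.standard_normal((500, 1024))        # stand-in for 500 query embeddings

E_centered = E - E.mean(axis=0)
cov = np.cov(E_centered, rowvar=False)      # (1024, 1024) covariance matrix

eigvals, eigvecs = np.linalg.eigh(cov)      # eigh: symmetric input, ascending eigenvalues
top2 = eigvecs[:, -2:]                      # two largest-variance directions
coords_2d = E_centered @ top2               # (500, 2) points for plotting
```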
5.2 Singular Value Decomposition
For any matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$:
$$\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$$
SVD is especially useful when the matrix is rectangular or sparse.
Project-relevant uses:
- low-rank recommendation models
- dimensionality reduction for sparse features or embeddings
- compressing weight matrices while preserving most of their energy
In practice, many pipelines use SVD rather than direct eigendecomposition because it is numerically stable and works on arbitrary matrices.
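A minimal numpy sketch of a rank-$k$ approximation via truncated SVD (the matrix here is random, standing in for a weight or interaction matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 300))        # stand-in for a rectangular matrix
k = 32

U, S, Vt = np.linalg.svd(A, full_matrices=False)
A_k = (U[:, :k] * S[:k]) @ Vt[:k, :]        # best rank-k approximation in Frobenius norm

energy = (S[:k] ** 2).sum() / (S ** 2).sum()   # fraction of the matrix "energy" retained
```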
6. Matrix Factorization in Recommendation
6.1 User-Item Matrix
The recommender can be represented as a sparse interaction matrix:
$$\mathbf{R} \in \mathbb{R}^{|\text{users}| \times |\text{items}|}$$
Each entry may encode clicks, purchases, ratings, or other engagement signals.
6.2 Low-Rank Approximation
The standard factorization view is:
$$\mathbf{R} \approx \mathbf{P}\mathbf{Q}^T$$
Where:
- $\mathbf{P} \in \mathbb{R}^{|\text{users}| \times k}$ is the user-factor matrix
- $\mathbf{Q} \in \mathbb{R}^{|\text{items}| \times k}$ is the item-factor matrix
- $k$ is the latent dimensionality
The predicted affinity for user $u$ and item $i$ is:
$$\hat{R}_{u,i} = \mathbf{p}_u \cdot \mathbf{q}_i$$
6.3 Optimization Objective
$$\min_{\mathbf{P}, \mathbf{Q}} \sum_{(u,i) \in \Omega} \left(R_{u,i} - \mathbf{p}_u \cdot \mathbf{q}_i\right)^2 + \lambda\left(|\mathbf{P}|_F^2 + |\mathbf{Q}|_F^2\right)$$
Where $\Omega$ is the set of observed interactions and $\lambda$ controls regularization.
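A minimal stochastic gradient descent sketch of this objective (synthetic interactions and a small latent dimension; this is an illustration, not the project's actual training loop):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 100, 200, 16
lr, lam = 0.01, 0.05

# Synthetic observed interactions: (user, item, value) triples
observed = [(rng.integers(n_users), rng.integers(n_items), rng.random())
            for _ in range(2000)]

P = 0.1 * rng.standard_normal((n_users, k))     # user factors
Q = 0.1 * rng.standard_normal((n_items, k))     # item factors

for epoch in range(10):
    for u, i, r in observed:
        err = r - P[u] @ Q[i]                   # prediction error on one observation
        P[u] += lr * (err * Q[i] - lam * P[u])  # gradient step with L2 regularization
        Q[i] += lr * (err * P[u] - lam * Q[i])
```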
This is the same design pattern seen elsewhere in the stack: learn latent vectors, then compare them with simple linear algebra.
7. Softmax as a Probability Transform
7.1 Definition
$$\text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
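A minimal, numerically stable numpy sketch (subtracting the maximum logit does not change the result but avoids overflow):

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Convert a vector of logits into a probability distribution."""
    z = z - z.max()
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)          # non-negative entries that sum to 1.0
```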
7.2 Where It Appears
| Component | Input | Output |
|---|---|---|
| Intent classifier | Intent logits | Probability over 8 classes |
| Attention layer | Scaled query-key scores | Attention weights over tokens |
| Generator output layer | Vocabulary logits | Probability over next tokens |
7.3 Temperature Scaling
$$\text{softmax}(\mathbf{z}/T)_i = \frac{e^{z_i/T}}{\sum_{j=1}^{K} e^{z_j/T}}$$
- $T = 1$: standard softmax
- $T < 1$: sharper distribution
- $T > 1$: flatter distribution
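A small self-contained sketch showing the effect of the temperature on the same logits (illustrative values only):

```python
import numpy as np

def softmax_t(z: np.ndarray, T: float = 1.0) -> np.ndarray:
    """Softmax with temperature: rescale logits before normalizing."""
    z = z / T
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

logits = np.array([2.0, 1.0, 0.1])
softmax_t(logits, T=0.5)   # sharper: more mass on the top class
softmax_t(logits, T=1.0)   # standard softmax
softmax_t(logits, T=2.0)   # flatter: probabilities move toward uniform
```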
For a customer support or shopping assistant, lower temperatures are often preferred because they reduce randomness and make answers easier to control.
8. Layer Normalization
8.1 Formula
$$\text{LayerNorm}(\mathbf{x}) = \gamma \odot \frac{\mathbf{x} - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$
Where:
- $\mu$ is the mean across the feature dimension
- $\sigma^2$ is the variance across the feature dimension
- $\gamma$ and $\beta$ are learned scale and shift parameters
- $\epsilon$ is a small constant for numerical stability
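A minimal numpy sketch of the formula applied to a batch of row vectors ($\epsilon$ and the learned parameters below are placeholders):

```python
import numpy as np

def layer_norm(x: np.ndarray, gamma: np.ndarray, beta: np.ndarray,
               eps: float = 1e-5) -> np.ndarray:
    """Normalize each row over its feature dimension, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 768))                        # 4 activations of width 768
out = layer_norm(x, gamma=np.ones(768), beta=np.zeros(768))
```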
8.2 Why It Matters
Layer normalization keeps activations in a stable range so deep transformer stacks remain trainable and predictable at inference time.
Without it, the stack is more vulnerable to:
- unstable gradients during training
- highly variable activation magnitudes across layers
- brittle attention behavior
9. End-to-End View of the Pipeline
```text
User message: "Recommend something like Berserk"
  |
  +-- Tokenization
  |     -> token IDs
  |
  +-- DistilBERT intent classifier
  |     -> embedding lookup
  |     -> repeated attention + FFN blocks
  |     -> linear head + softmax
  |     -> intent = "recommendation"
  |
  +-- Query embedding
  |     -> 1024-dimensional dense vector
  |
  +-- Vector retrieval
  |     -> cosine similarity / ANN search
  |     -> top candidate chunks
  |
  +-- MiniLM reranker
  |     -> query-document token interaction
  |     -> one relevance score per candidate
  |
  +-- Claude-style response generation
        -> autoregressive next-token prediction
        -> softmax over vocabulary at each step
```
The important takeaway is that the pipeline looks diverse at the product level, but it is surprisingly unified at the math level. Most of the system is built from the same building blocks: vectors, projections, similarities, normalizations, and low-rank structure.