Matrix Theory and Linear Algebra in MangaAssist

Matrix theory is the runtime language of modern ML systems. Every embedding lookup, attention score, classifier head, and recommender score in MangaAssist can be explained as vector operations, matrix multiplication, normalization, and low-rank structure.

This document uses a row-vector convention for forward passes so the shapes stay consistent with how tensors are usually written in production code:

  • single example: $\mathbf{y} = \mathbf{x}\mathbf{W} + \mathbf{b}$
  • batch of examples: $\mathbf{Y} = \mathbf{X}\mathbf{W} + \mathbf{b}$

1. Why Matrix Theory Matters

The project uses several different model components, but they all reduce to a common set of linear algebra operations:

  • intent classification uses embedding lookup tables, projection matrices, and softmax outputs
  • retrieval uses dense vectors, similarity scoring, and approximate nearest-neighbor search
  • reranking uses attention over a concatenated query-document sequence
  • recommendation uses latent vectors and dot-product scoring

Understanding that shared math makes the stack easier to debug, explain, and optimize.


2. Vectors and Vector Spaces

2.1 What a Vector Represents in MangaAssist

| Component | What the vector represents | Typical dimensionality |
| --- | --- | --- |
| Titan Text Embeddings V2 | Dense representation of a query, chunk, or document | 1,024 in this project configuration |
| DistilBERT hidden state | Contextual representation of one token | 768 |
| MiniLM reranker output | Joint query-document representation before scoring | 384 |
| One-hot intent label | Sparse encoding of one intent class | 8 |
| Softmax output | Probability distribution over intent classes | 8 |

2.2 Core Vector Operations

Dot product

$$\mathbf{u} \cdot \mathbf{v} = \sum_{i=1}^{d} u_i v_i$$

Used in attention, latent-factor scoring, and linear layers.

Cosine similarity

$$\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\|_2 \cdot \|\mathbf{v}\|_2}$$

Used in retrieval, embedding comparisons, and drift monitoring.

Example: retrieval scoring

When a user asks for a recommendation such as "What dark fantasy manga is similar to Berserk?", the query is converted to a dense vector and compared with document vectors in the index.

# embed() and cosine_similarity() are stand-ins for the project's helper functions
query_vector = embed("What dark fantasy manga is similar to Berserk?")
scores = [cosine_similarity(query_vector, chunk_vector) for chunk_vector in index]

Euclidean norm and distance

$$\|\mathbf{u} - \mathbf{v}\|_2 = \sqrt{\sum_{i=1}^{d}(u_i - v_i)^2}$$

This is less common for text retrieval than cosine similarity, but it is still useful for diagnostics and clustering analysis.
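
As a minimal NumPy sketch of these three operations (the vectors below are random stand-ins for real embeddings):

import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=1024)  # stand-in for a query embedding
v = rng.normal(size=1024)  # stand-in for a document embedding

dot  = u @ v                                          # dot product
cos  = dot / (np.linalg.norm(u) * np.linalg.norm(v))  # cosine similarity
dist = np.linalg.norm(u - v)                          # Euclidean distance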


3. Matrices in Neural Networks

3.1 Linear Layers

For one example written as a row vector:

$$\mathbf{y} = \mathbf{x}\mathbf{W} + \mathbf{b}$$

Where:

  • $\mathbf{x} \in \mathbb{R}^{1 \times d_{\text{in}}}$ is the input
  • $\mathbf{W} \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}$ is the weight matrix
  • $\mathbf{b} \in \mathbb{R}^{1 \times d_{\text{out}}}$ is the bias
  • $\mathbf{y} \in \mathbb{R}^{1 \times d_{\text{out}}}$ is the output

For a batch of $B$ examples:

$$\mathbf{Y} = \mathbf{X}\mathbf{W} + \mathbf{b}$$

Where $\mathbf{X} \in \mathbb{R}^{B \times d_{\text{in}}}$ and $\mathbf{Y} \in \mathbb{R}^{B \times d_{\text{out}}}$.

This is why GPUs matter so much: transformer inference is dominated by large batched matrix multiplications.
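
A shape-level sketch of the batched form in NumPy (the dimensions are illustrative, not taken from a specific layer):

import numpy as np

B, d_in, d_out = 32, 768, 3072
X = np.random.randn(B, d_in)      # batch of row vectors
W = np.random.randn(d_in, d_out)  # weight matrix
b = np.zeros(d_out)               # bias, broadcast across the batch

Y = X @ W + b                     # shape (B, d_out)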

3.2 Example Matrices in the Stack

| Component | Layer | Matrix shape | What it does |
| --- | --- | --- | --- |
| DistilBERT | Token embedding table | $(30{,}522 \times 768)$ | Maps token IDs to dense vectors |
| DistilBERT | Query projection | $(768 \times 768)$ | Projects hidden states into query space |
| DistilBERT | Key projection | $(768 \times 768)$ | Projects hidden states into key space |
| DistilBERT | Value projection | $(768 \times 768)$ | Projects hidden states into value space |
| DistilBERT | FFN layer 1 | $(768 \times 3072)$ | Expands features inside the transformer block |
| DistilBERT | FFN layer 2 | $(3072 \times 768)$ | Projects back to model width |
| DistilBERT | Classification head | $(768 \times 8)$ | Maps the sequence representation to 8 intent logits |
| MiniLM reranker | Final scoring layer | $(384 \times 1)$ | Produces one relevance score |
| Titan embedding | Output projection | $(h \times 1024)$ | Maps internal hidden states to the configured embedding size |

3.3 Matrix View of Production Inference

In production, a request batch is processed as a tensor rather than one item at a time:

X  : batch_size x sequence_length x hidden_size
WQ : hidden_size x hidden_size
Q  = XWQ
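
In NumPy this is a single batched matrix multiplication; the shapes below are illustrative:

import numpy as np

B, L, h = 16, 128, 768         # batch, sequence length, hidden size
X  = np.random.randn(B, L, h)  # batch of token representations
WQ = np.random.randn(h, h)     # query projection

Q = X @ WQ                     # shape (B, L, h); matmul broadcasts over the batch axis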

The same pattern repeats across attention projections, feedforward blocks, classifier heads, and recommender scoring.


4. Attention as Matrix Operations

4.1 Scaled Dot-Product Attention

The core attention formula is:

$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$$

This breaks into four steps, sketched in code after the list:

  1. $\mathbf{Q}\mathbf{K}^T$ computes pairwise similarity between query tokens and key tokens.
  2. Division by $\sqrt{d_k}$ prevents large dot products from saturating softmax.
  3. softmax converts each row into a probability distribution.
  4. Multiplying by $\mathbf{V}$ forms a weighted mixture of value vectors.
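
A minimal NumPy sketch of the four steps for a single head (shapes and inputs are illustrative):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # steps 1-2: scaled similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability shift
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # step 3: row-wise softmax
    return weights @ V                              # step 4: weighted mixture of values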

4.2 Multi-Head Attention

Transformers do not use a single attention map. They run several in parallel:

$$\text{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\mathbf{W}^O$$

with

$$\text{head}_i = \text{Attention}(\mathbf{Q}\mathbf{W}_i^Q, \mathbf{K}\mathbf{W}_i^K, \mathbf{V}\mathbf{W}_i^V)$$

For DistilBERT, the common base configuration uses 12 attention heads and hidden size 768, so each head works over 64-dimensional slices.
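
The head split itself is just a reshape: the 768-wide projection is viewed as 12 independent 64-dimensional slices. A shape-only sketch:

import numpy as np

B, L, h, n_heads = 16, 128, 768, 12
d_head = h // n_heads          # 64 for the DistilBERT base configuration

Q = np.random.randn(B, L, h)   # output of the query projection
Q_heads = Q.reshape(B, L, n_heads, d_head).transpose(0, 2, 1, 3)
# Q_heads has shape (B, n_heads, L, d_head): one attention computation per head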

4.3 Self-Attention vs. Cross-Attention in This Stack

| Type | Query source | Key/value source | Project usage |
| --- | --- | --- | --- |
| Self-attention | Same sequence | Same sequence | DistilBERT encoding a user message |
| Causal self-attention | Prefix of the same sequence | Prefix of the same sequence | Claude-style next-token generation over prompt and response |
| Bi-encoder pattern | Query and document encoded separately | No token-level interaction at retrieval time | Embedding-based first-stage retrieval |
| Cross-encoder interaction | Query and document tokens in one sequence | Same concatenated sequence | MiniLM reranking |

5. Eigendecomposition and Singular Value Decomposition

5.1 Eigendecomposition

For a square matrix $\mathbf{A}$:

$$\mathbf{A}\mathbf{v} = \lambda\mathbf{v}$$

Where $\mathbf{v}$ is an eigenvector and $\lambda$ is the corresponding eigenvalue.

Useful applications in this project include:

  • interpreting covariance structure in embeddings
  • PCA-style visualization of high-dimensional vectors (sketched after this list)
  • studying concentration or dispersion in attention maps
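
A sketch of the PCA-style use: eigendecompose the covariance of an embedding matrix and project onto the top components. The embeddings here are random placeholders.

import numpy as np

E = np.random.randn(500, 1024)            # placeholder matrix of 500 embeddings
C = np.cov(E, rowvar=False)               # (1024, 1024) covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)      # eigh, since C is symmetric
top2 = eigvecs[:, np.argsort(eigvals)[::-1][:2]]
projected = (E - E.mean(axis=0)) @ top2   # 2-D coordinates for visualization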

5.2 Singular Value Decomposition

For any matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$:

$$\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$$

SVD is especially useful when the matrix is rectangular or sparse.

Project-relevant uses:

  • low-rank recommendation models
  • dimensionality reduction for sparse features or embeddings
  • compressing weight matrices while preserving most of their energy

In practice, many pipelines use SVD rather than direct eigendecomposition because it is numerically stable and works on arbitrary matrices.
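
A minimal NumPy sketch of the low-rank use: keep the top $k$ singular values and reconstruct. The matrix and rank are placeholders.

import numpy as np

A = np.random.randn(1000, 300)               # placeholder rectangular matrix
U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 32                                       # target rank
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]  # best rank-k approximation in Frobenius norm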


6. Matrix Factorization in Recommendation

6.1 User-Item Matrix

The recommender can be represented as a sparse interaction matrix:

$$\mathbf{R} \in \mathbb{R}^{|\text{users}| \times |\text{items}|}$$

Each entry may encode clicks, purchases, ratings, or other engagement signals.

6.2 Low-Rank Approximation

The standard factorization view is:

$$\mathbf{R} \approx \mathbf{P}\mathbf{Q}^T$$

Where:

  • $\mathbf{P} \in \mathbb{R}^{|\text{users}| \times k}$ is the user-factor matrix
  • $\mathbf{Q} \in \mathbb{R}^{|\text{items}| \times k}$ is the item-factor matrix
  • $k$ is the latent dimensionality

The predicted affinity for user $u$ and item $i$ is:

$$\hat{R}_{u,i} = \mathbf{p}_u \cdot \mathbf{q}_i$$

6.3 Optimization Objective

$$\min_{\mathbf{P}, \mathbf{Q}} \sum_{(u,i) \in \Omega} \left(R_{u,i} - \mathbf{p}_u \cdot \mathbf{q}_i\right)^2 + \lambda\left(\|\mathbf{P}\|_F^2 + \|\mathbf{Q}\|_F^2\right)$$

Where $\Omega$ is the set of observed interactions and $\lambda$ controls regularization.
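
A minimal SGD sketch for this objective, assuming observations arrive as (user, item, value) triples; the learning rate and regularization strength are illustrative, not tuned values from the project:

import numpy as np

n_users, n_items, k = 1000, 5000, 64
P = 0.1 * np.random.randn(n_users, k)   # user-factor matrix
Q = 0.1 * np.random.randn(n_items, k)   # item-factor matrix
lr, lam = 0.01, 0.05

def sgd_step(u, i, r):
    err = r - P[u] @ Q[i]               # error on one observed entry
    p_u = P[u].copy()                   # cache before the in-place update
    P[u] += lr * (err * Q[i] - lam * P[u])
    Q[i] += lr * (err * p_u - lam * Q[i])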

This is the same design pattern seen elsewhere in the stack: learn latent vectors, then compare them with simple linear algebra.


7. Softmax as a Probability Transform

7.1 Definition

$$\text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

7.2 Where It Appears

| Component | Input | Output |
| --- | --- | --- |
| Intent classifier | Intent logits | Probability over 8 classes |
| Attention layer | Scaled query-key scores | Attention weights over tokens |
| Generator output layer | Vocabulary logits | Probability over next tokens |

7.3 Temperature Scaling

$$\text{softmax}(\mathbf{z}/T)_i = \frac{e^{z_i/T}}{\sum_{j=1}^{K} e^{z_j/T}}$$

  • $T = 1$: standard softmax
  • $T < 1$: sharper distribution
  • $T > 1$: flatter distribution

For a customer support or shopping assistant, lower temperatures are often preferred because they reduce randomness and make answers easier to control.
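
A numerically stable sketch of both forms (subtracting the max logit before exponentiating avoids overflow; the shift does not change the result):

import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z -= z.max()           # stability shift
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # illustrative intent logits
softmax(logits, T=1.0)              # standard distribution
softmax(logits, T=0.5)              # sharper distribution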


8. Layer Normalization

8.1 Formula

$$\text{LayerNorm}(\mathbf{x}) = \gamma \odot \frac{\mathbf{x} - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

Where:

  • $\mu$ is the mean across the feature dimension
  • $\sigma^2$ is the variance across the feature dimension
  • $\gamma$ and $\beta$ are learned scale and shift parameters
  • $\epsilon$ is a small constant for numerical stability
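
A direct NumPy translation of the formula above, normalizing each row over its feature dimension (gamma and beta are set to identity values here; in a trained model they are learned):

import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)   # mean over the feature dimension
    var = x.var(axis=-1, keepdims=True)   # variance over the feature dimension
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.random.randn(4, 768)               # four token representations
out = layer_norm(x, gamma=np.ones(768), beta=np.zeros(768))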

8.2 Why It Matters

Layer normalization keeps activations in a stable range so deep transformer stacks remain trainable and predictable at inference time.

Without it, the stack is more vulnerable to:

  • unstable gradients during training
  • highly variable activation magnitudes across layers
  • brittle attention behavior

9. End-to-End View of the Pipeline

User message: "Recommend something like Berserk"
  |
  +-- Tokenization
  |     -> token IDs
  |
  +-- DistilBERT intent classifier
  |     -> embedding lookup
  |     -> repeated attention + FFN blocks
  |     -> linear head + softmax
  |     -> intent = "recommendation"
  |
  +-- Query embedding
  |     -> 1024-dimensional dense vector
  |
  +-- Vector retrieval
  |     -> cosine similarity / ANN search
  |     -> top candidate chunks
  |
  +-- MiniLM reranker
  |     -> query-document token interaction
  |     -> one relevance score per candidate
  |
  +-- Claude-style response generation
        -> autoregressive next-token prediction
        -> softmax over vocabulary at each step

The important takeaway is that the pipeline looks diverse at the product level, but it is surprisingly unified at the math level. Most of the system is built from the same building blocks: vectors, projections, similarities, normalizations, and low-rank structure.