Matrix Theory and Linear Algebra in MangaAssist
Matrix theory is the runtime language of modern ML systems. Every embedding lookup, attention score, classifier head, and recommender score in MangaAssist can be explained as vector operations, matrix multiplication, normalization, and low-rank structure.
This document uses a row-vector convention for forward passes so the shapes stay consistent with how tensors are usually written in production code:
- single example: $\mathbf{y} = \mathbf{x}\mathbf{W} + \mathbf{b}$
- batch of examples: $\mathbf{Y} = \mathbf{X}\mathbf{W} + \mathbf{b}$
1. Why Matrix Theory Matters
The project uses several different model components, but they all reduce to a common set of linear algebra operations:
- intent classification uses embedding lookup tables, projection matrices, and softmax outputs
- retrieval uses dense vectors, similarity scoring, and approximate nearest-neighbor search
- reranking uses attention over a concatenated query-document sequence
- recommendation uses latent vectors and dot-product scoring
Understanding that shared math makes the stack easier to debug, explain, and optimize.
2. Vectors and Vector Spaces
2.1 What a Vector Represents in MangaAssist
| Component | What the vector represents | Typical dimensionality |
|---|---|---|
| Titan Text Embeddings V2 | Dense representation of a query, chunk, or document | 1,024 in this project configuration |
| DistilBERT hidden state | Contextual representation of one token | 768 |
| MiniLM reranker output | Joint query-document representation before scoring | 384 |
| One-hot intent label | Sparse encoding of one intent class | 8 |
| Softmax output | Probability distribution over intent classes | 8 |
2.2 Core Vector Operations
Dot product
$$\mathbf{u} \cdot \mathbf{v} = \sum_{i=1}^{d} u_i v_i$$
Used in attention, latent-factor scoring, and linear layers.
Cosine similarity
$$\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{|\mathbf{u}|_2 \cdot |\mathbf{v}|_2}$$
Used in retrieval, embedding comparisons, and drift monitoring.
Example: retrieval scoring
When a user asks for a recommendation such as "What dark fantasy manga is similar to Berserk?", the query is converted to a dense vector and compared with document vectors in the index.
```python
# Embed the query, then score every chunk vector in the index
query_vector = embed("What dark fantasy manga is similar to Berserk?")
scores = [cosine_similarity(query_vector, chunk_vector) for chunk_vector in index]
```
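A minimal runnable version of that scoring step is sketched below with numpy; `embed` and `index` are placeholders standing in for the project's embedding call and vector store.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two dense vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def top_k_chunks(query_vector: np.ndarray, index: list[np.ndarray], k: int = 5) -> np.ndarray:
    """Return the indices of the k chunk vectors most similar to the query."""
    scores = np.array([cosine_similarity(query_vector, chunk) for chunk in index])
    return np.argsort(scores)[::-1][:k]
```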
Euclidean norm and distance
$$|\mathbf{u} - \mathbf{v}|_2 = \sqrt{\sum_{i=1}^{d}(u_i - v_i)^2}$$
This is less common for text retrieval than cosine similarity, but it is still useful for diagnostics and clustering analysis.
3. Matrices in Neural Networks
3.1 Linear Layers
For one example written as a row vector:
$$\mathbf{y} = \mathbf{x}\mathbf{W} + \mathbf{b}$$
Where:
- $\mathbf{x} \in \mathbb{R}^{1 \times d_{\text{in}}}$ is the input
- $\mathbf{W} \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}$ is the weight matrix
- $\mathbf{b} \in \mathbb{R}^{1 \times d_{\text{out}}}$ is the bias
- $\mathbf{y} \in \mathbb{R}^{1 \times d_{\text{out}}}$ is the output
For a batch of $B$ examples:
$$\mathbf{Y} = \mathbf{X}\mathbf{W} + \mathbf{b}$$
Where $\mathbf{X} \in \mathbb{R}^{B \times d_{\text{in}}}$ and $\mathbf{Y} \in \mathbb{R}^{B \times d_{\text{out}}}$.
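A minimal numpy sketch of this batched layer under the row-vector convention (the sizes below are illustrative, not taken from a specific model):

```python
import numpy as np

rng = np.random.default_rng(0)
B, d_in, d_out = 32, 768, 3072          # illustrative batch size and layer widths

X = rng.standard_normal((B, d_in))      # batch of input row vectors
W = rng.standard_normal((d_in, d_out))  # weight matrix
b = rng.standard_normal((1, d_out))     # bias, broadcast over the batch

Y = X @ W + b                           # (B, d_in) @ (d_in, d_out) -> (B, d_out)
assert Y.shape == (B, d_out)
```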
This is why GPUs matter so much: transformer inference is dominated by large batched matrix multiplications.
3.2 Example Matrices in the Stack
| Component | Layer | Matrix shape | What it does |
|---|---|---|---|
| DistilBERT | Token embedding table | $(30{,}522 \times 768)$ | Maps token IDs to dense vectors |
| DistilBERT | Query projection | $(768 \times 768)$ | Projects hidden states into query space |
| DistilBERT | Key projection | $(768 \times 768)$ | Projects hidden states into key space |
| DistilBERT | Value projection | $(768 \times 768)$ | Projects hidden states into value space |
| DistilBERT | FFN layer 1 | $(768 \times 3072)$ | Expands features inside the transformer block |
| DistilBERT | FFN layer 2 | $(3072 \times 768)$ | Projects back to model width |
| DistilBERT | Classification head | $(768 \times 8)$ | Maps sequence representation to 8 intent logits |
| MiniLM reranker | Final scoring layer | $(384 \times 1)$ | Produces one relevance score |
| Titan Text Embeddings V2 | Output projection | $(h \times 1024)$ | Maps internal hidden states to the configured embedding size |
3.3 Matrix View of Production Inference
In production, a request batch is processed as a tensor rather than one item at a time:
```text
X  : batch_size x sequence_length x hidden_size
WQ : hidden_size x hidden_size
Q  = X WQ
```
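A minimal numpy sketch of that projection, relying on numpy broadcasting the matrix multiplication over the leading batch dimension (shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, hidden = 8, 128, 768

X  = rng.standard_normal((batch_size, seq_len, hidden))   # activation tensor
WQ = rng.standard_normal((hidden, hidden))                 # query projection weights

Q = X @ WQ                       # matmul applied over the last two axes
assert Q.shape == (batch_size, seq_len, hidden)
```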
The same pattern repeats across attention projections, feedforward blocks, classifier heads, and recommender scoring.
4. Attention as Matrix Operations
4.1 Scaled Dot-Product Attention
The core attention formula is:
$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$$
This breaks into four steps:
- $\mathbf{Q}\mathbf{K}^T$ computes pairwise similarity between query tokens and key tokens.
- Division by $\sqrt{d_k}$ prevents large dot products from saturating softmax.
- Softmax converts each row into a probability distribution.
- Multiplying by $\mathbf{V}$ forms a weighted mixture of value vectors.
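A minimal numpy sketch of those four steps for a single head (no masking, illustrative shapes):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # pairwise query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted mixture of value vectors

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((10, 64)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)          # shape (10, 64)
```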
4.2 Multi-Head Attention
Transformers do not use a single attention map. They run several in parallel:
$$\text{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\mathbf{W}^O$$
with
$$\text{head}_i = \text{Attention}(\mathbf{Q}\mathbf{W}_i^Q, \mathbf{K}\mathbf{W}_i^K, \mathbf{V}\mathbf{W}_i^V)$$
For DistilBERT, the common base configuration uses 12 attention heads and hidden size 768, so each head works over 64-dimensional slices.
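A minimal numpy sketch of the head split for that configuration (the projection weights here are random placeholders, not trained parameters):

```python
import numpy as np

hidden, n_heads = 768, 12
d_head = hidden // n_heads                        # 64 dimensions per head

rng = np.random.default_rng(0)
seq_len = 16
H = rng.standard_normal((seq_len, hidden))        # hidden states for one sequence
W_Q = rng.standard_normal((hidden, hidden))       # placeholder query projection

Q = H @ W_Q                                       # (seq_len, 768)
Q_heads = Q.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
assert Q_heads.shape == (n_heads, seq_len, d_head)   # 12 heads of 64-dim queries
```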
4.3 Self-Attention vs. Cross-Attention in This Stack
| Type | Query source | Key/value source | Project usage |
|---|---|---|---|
| Self-attention | Same sequence | Same sequence | DistilBERT encoding a user message |
| Causal self-attention | Prefix of the same sequence | Prefix of the same sequence | Claude-style next-token generation over prompt and response |
| Bi-encoder pattern | Query and document encoded separately | No token-level interaction at retrieval time | Embedding-based first-stage retrieval |
| Cross-encoder interaction | Query and document tokens in one sequence | Same concatenated sequence | MiniLM reranking |
5. Eigendecomposition and Singular Value Decomposition
5.1 Eigendecomposition
For a square matrix $\mathbf{A}$:
$$\mathbf{A}\mathbf{v} = \lambda\mathbf{v}$$
Where $\mathbf{v}$ is an eigenvector and $\lambda$ is the corresponding eigenvalue.
Useful applications in this project include:
- interpreting covariance structure in embeddings
- PCA-style visualization of high-dimensional vectors
- studying concentration or dispersion in attention maps
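As one concrete use, a PCA-style 2-D projection of a batch of embeddings can be computed from the eigendecomposition of their covariance matrix. A minimal numpy sketch, with random vectors standing in for real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.standard_normal((500, 1024))        # stand-in for 500 query embeddings

E_centered = E - E.mean(axis=0)
cov = np.cov(E_centered, rowvar=False)      # (1024, 1024) covariance matrix

eigvals, eigvecs = np.linalg.eigh(cov)      # eigh: symmetric input, ascending eigenvalues
top2 = eigvecs[:, -2:]                      # two largest-variance directions
coords_2d = E_centered @ top2               # (500, 2) points for plotting
```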
5.2 Singular Value Decomposition
For any matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$:
$$\mathbf{A} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$$
SVD is especially useful when the matrix is rectangular or sparse.
Project-relevant uses:
- low-rank recommendation models
- dimensionality reduction for sparse features or embeddings
- compressing weight matrices while preserving most of their energy
In practice, many pipelines use SVD rather than direct eigendecomposition because it is numerically stable and works on arbitrary matrices.
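A minimal numpy sketch of a rank-$k$ approximation via truncated SVD (the matrix here is random, standing in for a weight or interaction matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 300))        # stand-in for a rectangular matrix
k = 32

U, S, Vt = np.linalg.svd(A, full_matrices=False)
A_k = (U[:, :k] * S[:k]) @ Vt[:k, :]        # best rank-k approximation in Frobenius norm

energy = (S[:k] ** 2).sum() / (S ** 2).sum()   # fraction of the matrix "energy" retained
```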
6. Matrix Factorization in Recommendation
6.1 User-Item Matrix
The recommender can be represented as a sparse interaction matrix:
$$\mathbf{R} \in \mathbb{R}^{|\text{users}| \times |\text{items}|}$$
Each entry may encode clicks, purchases, ratings, or other engagement signals.
6.2 Low-Rank Approximation
The standard factorization view is:
$$\mathbf{R} \approx \mathbf{P}\mathbf{Q}^T$$
Where:
- $\mathbf{P} \in \mathbb{R}^{|\text{users}| \times k}$ is the user-factor matrix
- $\mathbf{Q} \in \mathbb{R}^{|\text{items}| \times k}$ is the item-factor matrix
- $k$ is the latent dimensionality
The predicted affinity for user $u$ and item $i$ is:
$$\hat{R}_{u,i} = \mathbf{p}_u \cdot \mathbf{q}_i$$
6.3 Optimization Objective
$$\min_{\mathbf{P}, \mathbf{Q}} \sum_{(u,i) \in \Omega} \left(R_{u,i} - \mathbf{p}_u \cdot \mathbf{q}_i\right)^2 + \lambda\left(|\mathbf{P}|_F^2 + |\mathbf{Q}|_F^2\right)$$
Where $\Omega$ is the set of observed interactions and $\lambda$ controls regularization.
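A minimal stochastic gradient descent sketch of this objective (synthetic interactions and a small latent dimension; this is an illustration, not the project's actual training loop):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 100, 200, 16
lr, lam = 0.01, 0.05

# Synthetic observed interactions: (user, item, value) triples
observed = [(rng.integers(n_users), rng.integers(n_items), rng.random())
            for _ in range(2000)]

P = 0.1 * rng.standard_normal((n_users, k))     # user factors
Q = 0.1 * rng.standard_normal((n_items, k))     # item factors

for epoch in range(10):
    for u, i, r in observed:
        err = r - P[u] @ Q[i]                   # prediction error on one observation
        P[u] += lr * (err * Q[i] - lam * P[u])  # gradient step with L2 regularization
        Q[i] += lr * (err * P[u] - lam * Q[i])
```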
This is the same design pattern seen elsewhere in the stack: learn latent vectors, then compare them with simple linear algebra.
7. Softmax as a Probability Transform
7.1 Definition
$$\text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
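A minimal, numerically stable numpy sketch (subtracting the maximum logit does not change the result but avoids overflow):

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Convert a vector of logits into a probability distribution."""
    z = z - z.max()
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)          # non-negative entries that sum to 1.0
```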
7.2 Where It Appears
| Component | Input | Output |
|---|---|---|
| Intent classifier | Intent logits | Probability over 8 classes |
| Attention layer | Scaled query-key scores | Attention weights over tokens |
| Generator output layer | Vocabulary logits | Probability over next tokens |
7.3 Temperature Scaling
$$\text{softmax}(\mathbf{z}/T)_i = \frac{e^{z_i/T}}{\sum_{j=1}^{K} e^{z_j/T}}$$
- $T = 1$: standard softmax
- $T < 1$: sharper distribution
- $T > 1$: flatter distribution
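A small self-contained sketch showing the effect of the temperature on the same logits (illustrative values only):

```python
import numpy as np

def softmax_t(z: np.ndarray, T: float = 1.0) -> np.ndarray:
    """Softmax with temperature: rescale logits before normalizing."""
    z = z / T
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

logits = np.array([2.0, 1.0, 0.1])
softmax_t(logits, T=0.5)   # sharper: more mass on the top class
softmax_t(logits, T=1.0)   # standard softmax
softmax_t(logits, T=2.0)   # flatter: probabilities move toward uniform
```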
For a customer support or shopping assistant, lower temperatures are often preferred because they reduce randomness and make answers easier to control.
8. Layer Normalization
8.1 Formula
$$\text{LayerNorm}(\mathbf{x}) = \gamma \odot \frac{\mathbf{x} - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$
Where:
- $\mu$ is the mean across the feature dimension
- $\sigma^2$ is the variance across the feature dimension
- $\gamma$ and $\beta$ are learned scale and shift parameters
- $\epsilon$ is a small constant for numerical stability
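A minimal numpy sketch of the formula applied to a batch of row vectors ($\epsilon$ and the learned parameters below are placeholders):

```python
import numpy as np

def layer_norm(x: np.ndarray, gamma: np.ndarray, beta: np.ndarray,
               eps: float = 1e-5) -> np.ndarray:
    """Normalize each row over its feature dimension, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 768))                        # 4 activations of width 768
out = layer_norm(x, gamma=np.ones(768), beta=np.zeros(768))
```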
8.2 Why It Matters
Layer normalization keeps activations in a stable range so deep transformer stacks remain trainable and predictable at inference time.
Without it, the stack is more vulnerable to:
- unstable gradients during training
- highly variable activation magnitudes across layers
- brittle attention behavior
9. End-to-End View of the Pipeline
```text
User message: "Recommend something like Berserk"
  |
  +-- Tokenization
  |     -> token IDs
  |
  +-- DistilBERT intent classifier
  |     -> embedding lookup
  |     -> repeated attention + FFN blocks
  |     -> linear head + softmax
  |     -> intent = "recommendation"
  |
  +-- Query embedding
  |     -> 1024-dimensional dense vector
  |
  +-- Vector retrieval
  |     -> cosine similarity / ANN search
  |     -> top candidate chunks
  |
  +-- MiniLM reranker
  |     -> query-document token interaction
  |     -> one relevance score per candidate
  |
  +-- Claude-style response generation
        -> autoregressive next-token prediction
        -> softmax over vocabulary at each step
```
The important takeaway is that the pipeline looks diverse at the product level, but it is surprisingly unified at the math level. Most of the system is built from the same building blocks: vectors, projections, similarities, normalizations, and low-rank structure.