
Neural Network Architectures in MangaAssist

MangaAssist uses different neural architectures for different jobs. Fast classification, vector retrieval, reranking, and text generation all benefit from different inductive biases and latency trade-offs.

This document focuses on the architectural patterns that matter most for the system design. Where a managed model does not publicly document its internals, the explanation stays at a level that can be defended.

1. Architecture Overview

Component          | Architecture pattern                                | Example model                         | Main task
Intent classifier  | Encoder-only transformer                            | DistilBERT (fine-tuned)               | 8-class routing
Sentiment detector | Encoder-only transformer                            | DistilBERT variant                    | Binary or multiclass sentiment
Retrieval encoder  | Embedding model with bi-encoder retrieval behavior  | Titan Text Embeddings V2              | Dense vector generation
Reranker           | Cross-encoder transformer                           | cross-encoder/ms-marco-MiniLM-L-6-v2  | Query-document relevance scoring
Response generator | Decoder-style text generation model                 | Claude 3.5 Sonnet via Bedrock         | Autoregressive response generation

2. Feedforward Networks

2.1 Core Pattern

A feedforward network stacks linear layers with nonlinear activations:

$$\mathbf{h}_1 = \text{ReLU}(\mathbf{x}\mathbf{W}_1 + \mathbf{b}_1)$$

$$\mathbf{h}_2 = \text{ReLU}(\mathbf{h}_1\mathbf{W}_2 + \mathbf{b}_2)$$

$$\hat{\mathbf{y}} = \text{softmax}(\mathbf{h}_2\mathbf{W}_3 + \mathbf{b}_3)$$
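
A minimal PyTorch sketch of the same stack (all layer sizes here are illustrative placeholders, not taken from any MangaAssist model):

import torch
import torch.nn as nn

# Two hidden layers with ReLU, producing logits for 8 classes.
ffn = nn.Sequential(
    nn.Linear(256, 128),   # x W1 + b1
    nn.ReLU(),
    nn.Linear(128, 64),    # h1 W2 + b2
    nn.ReLU(),
    nn.Linear(64, 8),      # h2 W3 + b3 (class logits)
)

x = torch.randn(4, 256)           # batch of 4 input vectors
probs = ffn(x).softmax(dim=-1)    # softmax turns logits into class probabilities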

2.2 Why FFNs Matter Here

MangaAssist does not rely on standalone MLPs for the main customer-facing tasks, but feedforward blocks appear inside every transformer layer:

Transformer block
  -> attention sublayer
  -> residual connection + layer norm
  -> feedforward sublayer
  -> residual connection + layer norm

For DistilBERT, the feedforward block expands from 768 to 3072 dimensions and then projects back down. That widened middle layer lets the model express richer feature interactions than a single linear projection could.
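
A sketch of that sublayer, using DistilBERT's published sizes (768 hidden, 3072 intermediate) and its GELU activation:

import torch.nn as nn

# Feedforward sublayer as it appears inside a DistilBERT layer:
# expand 768 -> 3072, apply GELU, project back 3072 -> 768.
ffn_block = nn.Sequential(
    nn.Linear(768, 3072),
    nn.GELU(),
    nn.Linear(3072, 768),
)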

2.3 Common Activations

Activation   | Formula                           | Common use
ReLU         | $\max(0, x)$                      | Basic FFNs
GELU         | $x\Phi(x)$                        | BERT-family transformers
SiLU / Swish | $x\sigma(x)$                      | Many modern decoder models
Sigmoid      | $\frac{1}{1 + e^{-x}}$            | Binary heads and gates
Softmax      | $\frac{e^{z_i}}{\sum_j e^{z_j}}$  | Multiclass outputs and attention weights
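
For concreteness, the same activations via PyTorch's built-ins; a quick sanity check, not production code:

import torch
import torch.nn.functional as F

z = torch.tensor([-2.0, 0.0, 2.0])
F.relu(z)               # tensor([0., 0., 2.])
F.gelu(z)               # x * Phi(x): smooth, near-zero for negative inputs
F.silu(z)               # x * sigmoid(x)
torch.sigmoid(z)        # squashes each value into (0, 1)
F.softmax(z, dim=0)     # normalizes the vector into a probability distribution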

3. Transformer Families

3.1 Encoder-Decoder Transformer

The original transformer pairs an encoder with a decoder:

Source text
  -> encoder stack
  -> contextual hidden states
  -> decoder stack with cross-attention
  -> next-token probabilities

This pattern is useful to know because it explains where "cross-attention" comes from, even though MangaAssist's main runtime path uses encoder-only and decoder-only models instead.

3.2 Encoder-Only Transformer: DistilBERT

Encoder-only models process the whole input bidirectionally. Each token can attend to every other token in the same sequence.

Input sequence
  -> token embeddings + position embeddings
  -> transformer block 1
  -> transformer block 2
  -> ...
  -> transformer block 6
  -> pooled sequence representation
  -> classification head

Important properties:

  • bidirectional attention captures both left and right context
  • every token receives a contextualized representation
  • the first-token representation is commonly used for classification
  • the architecture is well suited for classification, tagging, and reranking

DistilBERT-specific notes:

  • 6 transformer layers instead of 12 in BERT-base
  • 12 attention heads
  • 768 hidden size
  • roughly 40% fewer parameters and about 60% faster inference than BERT-base, while retaining about 97% of its language-understanding performance according to the original DistilBERT paper
  • token-type embeddings and the pooler are removed relative to BERT-base

Fine-tuning example:

from transformers import (
    DistilBertForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# 8 output labels, one per intent class.
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=8
)

training_args = TrainingArguments(
    output_dir="intent-classifier",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

# train_dataset and eval_dataset are assumed to be tokenized datasets
# with "input_ids", "attention_mask", and "label" columns.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
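
Once trained, routing is a single forward pass. A hedged inference sketch reusing the model above; the sample message is an assumption, and mapping the label index to an intent name happens elsewhere in the project:

import torch
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

inputs = tokenizer("Where is my One Piece order?", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, 8)
intent_id = logits.argmax(dim=-1).item()     # index of the predicted intent class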

3.3 Decoder-Only Generation Model

Decoder-style models generate text left to right. Each token can attend only to the tokens before it.

Prompt
  -> token embeddings + positional information
  -> repeated masked self-attention blocks
  -> language-model head
  -> next-token distribution
  -> sample or select next token
  -> append token and repeat

The causal mask enforces autoregressive generation:

$$\text{Mask}_{ij} = \begin{cases} 0 & \text{if } j \le i \\ -\infty & \text{if } j > i \end{cases}$$

For MangaAssist, the practical implication is more important than the hidden implementation details: the generator conditions on the prompt, retrieved context, and conversation history, then emits one token at a time. Exact layer counts and parameter counts for proprietary hosted models are not always publicly documented, so this document avoids relying on unpublished internals.
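
A small sketch of the mask itself, written here in PyTorch (the sequence length is arbitrary):

import torch

seq_len = 5
# 0 where attention is allowed (j <= i), -inf where it is blocked (j > i).
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

# Added to attention scores before softmax, the -inf entries get zero weight,
# so each position attends only to itself and earlier positions.
scores = torch.randn(seq_len, seq_len) + mask
weights = scores.softmax(dim=-1)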

3.4 Cross-Encoder Reranker

The reranker is still an encoder-style transformer, but the input format is different.

Bi-encoder retrieval

encode(query)    -> vector_q
encode(document) -> vector_d
score = cosine(vector_q, vector_d)

Cross-encoder reranking

[CLS] query [SEP] document [SEP]
  -> transformer
  -> sequence representation
  -> scalar relevance score

Why it is stronger:

  • query tokens and document tokens interact inside attention layers
  • phrase-level and negation-sensitive relevance signals are easier to capture
  • the score is learned directly for ranking, not derived from a generic vector similarity

Why it is slower:

  • every candidate document must be re-encoded with the query
  • latency grows with the number of candidates

That is why the system uses it only after a fast first-stage retriever narrows the search space.
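
A minimal reranking sketch using the sentence-transformers CrossEncoder wrapper; the query and candidate texts are placeholders:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "manga similar to Fullmetal Alchemist"
candidates = ["...chunk 1...", "...chunk 2...", "...chunk 3..."]  # from first-stage retrieval

# Each (query, document) pair is encoded jointly and scored.
scores = reranker.predict([(query, doc) for doc in candidates])
ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)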


4. Positional Information

4.1 Why Position Must Be Added Explicitly

Attention alone is permutation-invariant. Without positional information, "I love manga" and "manga love I" would look equivalent to the model.

4.2 Learned Absolute Positions

DistilBERT uses learned position embeddings:

$$\mathbf{x}_i = \mathbf{e}_{\text{token}_i} + \mathbf{e}_{\text{pos}_i}$$

This is enough for many classification and understanding tasks.
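
In embedding-table form, with sizes matching DistilBERT's published configuration (30522-token vocabulary, 512 maximum positions); the token ids are illustrative:

import torch
import torch.nn as nn

token_emb = nn.Embedding(30522, 768)   # vocabulary size x hidden size
pos_emb = nn.Embedding(512, 768)       # maximum sequence length x hidden size

ids = torch.tensor([[101, 2026, 2171, 102]])        # token ids (illustrative)
positions = torch.arange(ids.size(1)).unsqueeze(0)  # 0, 1, 2, 3
x = token_emb(ids) + pos_emb(positions)             # summed input representation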

4.3 Relative and Rotary Schemes

Many modern generation models use relative or rotary positional schemes instead of fixed learned absolute embeddings. The exact choice depends on the model family, but the shared goal is the same: inject sequence order without abandoning efficient attention computation.


5. Residual Connections and Layer Normalization

Transformer sublayers are wrapped in a residual path:

$$\text{output} = \text{LayerNorm}(\mathbf{x} + \text{SubLayer}(\mathbf{x}))$$

This matters because:

  • gradients can flow more easily through deep stacks
  • optimization becomes more stable
  • intermediate computations can refine features without losing the original signal

Without residual connections, deep transformers would be much harder to train reliably.
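
The post-norm wrapper in code form; SubLayer stands in for either the attention or the feedforward sublayer:

import torch.nn as nn

class ResidualBlock(nn.Module):
    # Wraps any sublayer in the post-norm residual pattern above.
    def __init__(self, sublayer, hidden_size=768):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))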


6. Knowledge Distillation and DistilBERT

DistilBERT is a compressed student model trained to imitate BERT-base.

The distillation objective combines ground-truth supervision with teacher guidance:

$$\mathcal{L}_{\text{distill}} = \alpha \, \mathcal{L}_{\text{CE}} + (1-\alpha) \, T^2 \cdot \text{KL}\left(p_{\text{teacher}}^{(T)} \,\|\, p_{\text{student}}^{(T)}\right)$$

Where:

  • $\mathcal{L}_{\text{CE}}$ is cross-entropy against labels
  • $\text{KL}$ is KL divergence between teacher and student distributions
  • $T$ is the distillation temperature
  • $\alpha$ balances the two objectives
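
A sketch of this combined objective; the alpha and T defaults are tuning choices, not values taken from the DistilBERT paper:

import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    # Hard-label cross-entropy against the ground truth.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label KL between temperature-scaled teacher and student distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    )
    return alpha * ce + (1 - alpha) * (T ** 2) * kl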

Why this matters operationally:

  • lower latency than BERT-base
  • smaller memory footprint
  • better production fit for real-time routing

The trade-off is straightforward: a small loss in model capacity buys a large improvement in serving efficiency.


7. Architecture Comparison

Property          | Encoder-only classifier                 | Cross-encoder reranker                   | Decoder-style generator
Attention pattern | Bidirectional                           | Bidirectional over query + doc together  | Causal
Input             | Single sequence                         | Paired query-document sequence           | Prompt prefix
Output            | Class probabilities or representations  | One relevance score                      | Next-token distribution
Strength          | Fast understanding and classification   | Fine-grained ranking                     | Fluent grounded generation
Weakness          | Not ideal for generation                | Too slow for full-corpus search          | Higher latency and cost
Project role      | Intent routing and sentiment            | Retrieval reranking                      | Final answer generation

8. How the Architectures Work Together

User asks for a recommendation
  |
  +-- DistilBERT
  |     -> classify intent
  |
  +-- Titan Text Embeddings V2
  |     -> convert query to a retrieval vector
  |
  +-- OpenSearch ANN
  |     -> fetch top candidate chunks
  |
  +-- MiniLM cross-encoder
  |     -> rerank the candidate set
  |
  +-- Claude 3.5 Sonnet
        -> generate a grounded response using prompt + retrieved context

The system works well because each model is assigned to the job its architecture handles best:

  • encoder-only models for fast understanding
  • cross-encoders for precision ranking on a small candidate set
  • decoder-style models for final language generation
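
Sketched as orchestration pseudocode; every helper function name here is hypothetical, standing in for the services in the diagram above:

def answer(query: str, history: list[str]) -> str:
    intent = classify_intent(query)                  # DistilBERT intent classifier
    query_vec = embed(query)                         # Titan Text Embeddings V2
    candidates = ann_search(query_vec, k=50)         # OpenSearch ANN retrieval
    top_chunks = rerank(query, candidates, k=5)      # MiniLM cross-encoder
    return generate(query, top_chunks, history, intent)  # Claude 3.5 Sonnet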