Neural Network Architectures in MangaAssist
MangaAssist uses several neural architectures, each matched to a specific job: fast classification, vector retrieval, reranking, and text generation benefit from different inductive biases and latency trade-offs.
This document focuses on the architectural patterns that matter most for the system design. Where a managed model does not publicly expose all internals, the explanation stays at the level that can be defended.
1. Architecture Overview
| Component | Architecture pattern | Example model | Main task |
|---|---|---|---|
| Intent classifier | Encoder-only transformer | DistilBERT (fine-tuned) | 8-class routing |
| Sentiment detector | Encoder-only transformer | DistilBERT variant | Binary or multiclass sentiment |
| Retrieval encoder | Embedding model with bi-encoder retrieval behavior | Titan Text Embeddings V2 | Dense vector generation |
| Reranker | Cross-encoder transformer | cross-encoder/ms-marco-MiniLM-L-6-v2 | Query-document relevance scoring |
| Response generator | Decoder-style text generation model | Claude 3.5 Sonnet via Bedrock | Autoregressive response generation |
2. Feedforward Networks
2.1 Core Pattern
A feedforward network stacks linear layers with nonlinear activations:
$$\mathbf{h}_1 = \text{ReLU}(\mathbf{x}\mathbf{W}_1 + \mathbf{b}_1)$$
$$\mathbf{h}_2 = \text{ReLU}(\mathbf{h}_1\mathbf{W}_2 + \mathbf{b}_2)$$
$$\hat{\mathbf{y}} = \text{softmax}(\mathbf{h}_2\mathbf{W}_3 + \mathbf{b}_3)$$
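A minimal PyTorch sketch of these three equations; the class name, layer widths, and batch size are illustrative, not part of any MangaAssist component:

```python
import torch
import torch.nn as nn

class FeedForwardClassifier(nn.Module):
    """Two hidden layers with ReLU and a softmax output, mirroring the equations above."""

    def __init__(self, in_dim=128, hidden_dim=64, num_classes=8):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)       # W1, b1
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)   # W2, b2
        self.fc3 = nn.Linear(hidden_dim, num_classes)  # W3, b3

    def forward(self, x):
        h1 = torch.relu(self.fc1(x))
        h2 = torch.relu(self.fc2(h1))
        return torch.softmax(self.fc3(h2), dim=-1)

probs = FeedForwardClassifier()(torch.randn(4, 128))  # (batch=4, num_classes=8)
```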
2.2 Why FFNs Matter Here
MangaAssist does not rely on standalone MLPs for the main customer-facing tasks, but feedforward blocks appear inside every transformer layer:
Transformer block
-> attention sublayer
-> residual connection + layer norm
-> feedforward sublayer
-> residual connection + layer norm
For DistilBERT, the feedforward block expands from 768 to 3072 dimensions and then projects back down. That widened middle layer lets the model express richer feature interactions than a single linear projection could.
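A sketch of that sublayer on its own, assuming DistilBERT's published sizes (768 hidden, 3072 intermediate) and GELU; the class name is illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerFFN(nn.Module):
    """Position-wise feedforward sublayer: expand 768 -> 3072, apply GELU, project back to 768."""

    def __init__(self, hidden=768, intermediate=3072):
        super().__init__()
        self.up = nn.Linear(hidden, intermediate)
        self.down = nn.Linear(intermediate, hidden)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))

out = TransformerFFN()(torch.randn(1, 16, 768))  # applied independently at every position
```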
2.3 Common Activations
| Activation | Formula | Common use |
|---|---|---|
| ReLU | $\max(0, x)$ | Basic FFNs |
| GELU | $x\Phi(x)$ | BERT-family transformers |
| SiLU / Swish | $x\sigma(x)$ | Many modern decoder models |
| Sigmoid | $\frac{1}{1 + e^{-x}}$ | Binary heads and gates |
| Softmax | $\frac{e^{z_i}}{\sum_j e^{z_j}}$ | Multiclass outputs and attention weights |
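Each of these corresponds to a standard PyTorch call; the snippet below evaluates them on a small example tensor:

```python
import torch
import torch.nn.functional as F

z = torch.tensor([-2.0, 0.0, 3.0])

relu = F.relu(z)               # max(0, x)
gelu = F.gelu(z)               # x * Phi(x)
silu = F.silu(z)               # x * sigmoid(x)
sigmoid = torch.sigmoid(z)     # 1 / (1 + e^{-x})
softmax = F.softmax(z, dim=0)  # e^{z_i} / sum_j e^{z_j}
```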
3. Transformer Families
3.1 Encoder-Decoder Transformer
The original transformer pairs an encoder with a decoder:
Source text
-> encoder stack
-> contextual hidden states
-> decoder stack with cross-attention
-> next-token probabilities
This pattern is useful to know because it explains where "cross-attention" comes from, even though MangaAssist's main runtime path uses encoder-only and decoder-only models instead.
3.2 Encoder-Only Transformer: DistilBERT
Encoder-only models process the whole input bidirectionally. Each token can attend to every other token in the same sequence.
Input sequence
-> token embeddings + position embeddings
-> transformer block 1
-> transformer block 2
-> ...
-> transformer block 6
-> pooled sequence representation
-> classification head
Important properties:
- bidirectional attention captures both left and right context
- every token receives a contextualized representation
- the first-token representation is commonly used for classification (see the sketch after this list)
- the architecture is well suited for classification, tagging, and reranking
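A minimal sketch of that first-token pooling, using the stock distilbert-base-uncased checkpoint; the example sentence is illustrative:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")

inputs = tokenizer("Where is my order for volume 42?", return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

# last_hidden_state: (batch, seq_len, 768), one contextual vector per token.
# The first-token ([CLS]) vector is the representation a classification head consumes.
cls_vector = outputs.last_hidden_state[:, 0, :]  # shape (1, 768)
```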
DistilBERT-specific notes:
- 6 transformer layers instead of 12 in BERT-base
- 12 attention heads
- 768 hidden size
- roughly 40% fewer parameters and about 60% faster inference than BERT-base, while retaining about 97% of its language-understanding performance as reported in the original DistilBERT paper
- token-type embeddings and the pooler are removed relative to BERT-base
Fine-tuning example:
```python
from transformers import (
    DistilBertForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# Eight labels to match the intent-routing classes in the overview table.
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=8,
)

# train_dataset and eval_dataset are assumed to be tokenized datasets prepared upstream.
training_args = TrainingArguments(output_dir="intent-classifier")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
```
3.3 Decoder-Only Generation Model
Decoder-style models generate text left to right. Each token can attend only to the tokens before it.
Prompt
-> token embeddings + positional information
-> repeated masked self-attention blocks
-> language-model head
-> next-token distribution
-> sample or select next token
-> append token and repeat
The causal mask enforces autoregressive generation:
$$\text{Mask}_{ij} = \begin{cases} 0 & \text{if } j \le i \\ -\infty & \text{if } j > i \end{cases}$$
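A short sketch of building this mask and applying it to raw attention scores before the softmax; the sequence length and scores are illustrative:

```python
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)  # raw query-key attention scores

# 0 where j <= i (visible), -inf where j > i (future positions are hidden).
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

weights = torch.softmax(scores + mask, dim=-1)
# Row i now assigns zero probability to every position j > i.
```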
For MangaAssist, the practical implication is more important than the hidden implementation details: the generator conditions on the prompt, retrieved context, and conversation history, then emits one token at a time. Exact layer counts and parameter counts for proprietary hosted models are not always publicly documented, so this document avoids relying on unpublished internals.
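A minimal sketch of invoking the hosted generator with retrieved context, assuming the Bedrock Converse API; the model ID, system prompt, and question wording are illustrative, not taken from the production configuration:

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

retrieved_context = "..."  # top reranked chunks, concatenated by the retrieval stage
history = []               # prior conversation turns, if any

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # illustrative model ID
    system=[{"text": "Answer using only the provided manga catalog context."}],
    messages=history + [{
        "role": "user",
        "content": [{"text": f"Context:\n{retrieved_context}\n\nQuestion: Which volume comes next?"}],
    }],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)

answer = response["output"]["message"]["content"][0]["text"]
```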
3.4 Cross-Encoder Reranker
The reranker is still an encoder-style transformer, but the input format is different.
Bi-encoder retrieval
encode(query) -> vector_q
encode(document) -> vector_d
score = cosine(vector_q, vector_d)
Cross-encoder reranking
[CLS] query [SEP] document [SEP]
-> transformer
-> sequence representation
-> scalar relevance score
Why it is stronger:
- query tokens and document tokens interact inside attention layers
- phrase-level and negation-sensitive relevance signals are easier to capture
- the score is learned directly for ranking, not derived from a generic vector similarity
Why it is slower:
- every candidate document must be re-encoded with the query
- latency grows with the number of candidates
That is why the system uses it only after a fast first-stage retriever narrows the search space.
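A minimal sketch of the reranking stage with the sentence-transformers CrossEncoder class; the query and candidate strings stand in for chunks returned by the first-stage retriever:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "shonen series about a pirate crew"
candidates = [
    "One Piece follows Luffy and his crew as they search for the ultimate treasure.",
    "A slice-of-life cooking manga set in a small Kyoto restaurant.",
    "A space opera about rival bounty hunters.",
]

# Each (query, candidate) pair is encoded jointly, producing one relevance score per pair.
scores = reranker.predict([(query, doc) for doc in candidates])
ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
```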
4. Positional Information
4.1 Why Position Must Be Added Explicitly
Attention alone is permutation-invariant. Without positional information, "I love manga" and "manga love I" would look equivalent to the model.
4.2 Learned Absolute Positions
DistilBERT uses learned position embeddings:
$$\mathbf{x}_i = \mathbf{e}_{\text{token}_i} + \mathbf{e}_{\text{pos}_i}$$
This is enough for many classification and understanding tasks.
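A sketch of that addition, using DistilBERT-style sizes (30522 vocabulary entries, 512 positions, 768 hidden dimensions); the token IDs are illustrative:

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30522, 512, 768
token_emb = nn.Embedding(vocab_size, hidden)
pos_emb = nn.Embedding(max_len, hidden)  # learned absolute position embeddings

token_ids = torch.tensor([[101, 2023, 2003, 2019, 2742, 102]])  # (batch, seq_len)
positions = torch.arange(token_ids.size(1)).unsqueeze(0)        # 0, 1, 2, ...

x = token_emb(token_ids) + pos_emb(positions)  # input to the first transformer block
```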
4.3 Relative and Rotary Schemes
Many modern generation models use relative or rotary positional schemes instead of fixed learned absolute embeddings. The exact choice depends on the model family, but the shared goal is the same: inject sequence order without abandoning efficient attention computation.
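As one concrete example of a rotary scheme, the sketch below rotates consecutive query/key feature pairs by position-dependent angles; it is a simplified illustration with an assumed function name, not a description of any specific hosted model:

```python
import torch

def apply_rotary(x, base=10000.0):
    """Rotate consecutive feature pairs of x (..., seq_len, dim) by position-dependent angles."""
    seq_len, dim = x.shape[-2], x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq  # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # Relative offsets between positions survive inside the attention dot product.
    return torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1).flatten(-2)

q = apply_rotary(torch.randn(8, 64))  # queries for one attention head: (seq_len=8, head_dim=64)
```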
5. Residual Connections and Layer Normalization
Transformer sublayers are wrapped in a residual path:
$$\text{output} = \text{LayerNorm}(\mathbf{x} + \text{SubLayer}(\mathbf{x}))$$
This matters because:
- gradients can flow more easily through deep stacks
- optimization becomes more stable
- intermediate computations can refine features without losing the original signal
Without residual connections, deep transformers would be much harder to train reliably.
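A sketch of that post-norm wrapping, with a plain linear layer standing in for the attention or feedforward sublayer; the class name is illustrative:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """output = LayerNorm(x + SubLayer(x)), the post-norm wrapping described above."""

    def __init__(self, sublayer, hidden=768):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(hidden)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

block = ResidualBlock(nn.Linear(768, 768))
out = block(torch.randn(2, 16, 768))  # shape preserved: (batch, seq_len, hidden)
```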
6. Knowledge Distillation and DistilBERT
DistilBERT is a compressed student model trained to imitate BERT-base.
The distillation objective combines ground-truth supervision with teacher guidance:
$$\mathcal{L}_{\text{distill}} = \alpha \, \mathcal{L}_{\text{CE}} + (1-\alpha) \, T^2 \cdot \text{KL}\left(p_{\text{teacher}}^{(T)} \,\|\, p_{\text{student}}^{(T)}\right)$$
Where:
- $\mathcal{L}_{\text{CE}}$ is cross-entropy against labels
- $\text{KL}$ is KL divergence between teacher and student distributions
- $T$ is the distillation temperature
- $\alpha$ balances the two objectives
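A sketch of the combined objective, assuming teacher and student logits of matching shape; the function name, alpha, and T values are illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    # Hard-label term: ordinary cross-entropy against the ground-truth classes.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL divergence between temperature-softened distributions,
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    )
    return alpha * ce + (1 - alpha) * (T ** 2) * kl

loss = distillation_loss(torch.randn(4, 8), torch.randn(4, 8), torch.randint(0, 8, (4,)))
```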
Why this matters operationally:
- lower latency than BERT-base
- smaller memory footprint
- better production fit for real-time routing
The trade-off is straightforward: a small loss in model capacity buys a large improvement in serving efficiency.
7. Architecture Comparison
| Property | Encoder-only classifier | Cross-encoder reranker | Decoder-style generator |
|---|---|---|---|
| Attention pattern | Bidirectional | Bidirectional over query + doc together | Causal |
| Input | Single sequence | Paired query-document sequence | Prompt prefix |
| Output | Class probabilities or representations | One relevance score | Next-token distribution |
| Strength | Fast understanding and classification | Fine-grained ranking | Fluent grounded generation |
| Weakness | Not ideal for generation | Too slow for full-corpus search | Higher latency and cost |
| Project role | Intent routing and sentiment | Retrieval reranking | Final answer generation |
8. How the Architectures Work Together
User asks for a recommendation
|
+-- DistilBERT
| -> classify intent
|
+-- Titan Text Embeddings V2
| -> convert query to a retrieval vector
|
+-- OpenSearch ANN
| -> fetch top candidate chunks
|
+-- MiniLM cross-encoder
| -> rerank the candidate set
|
+-- Claude 3.5 Sonnet
-> generate a grounded response using prompt + retrieved context
The system works well because each model is assigned to the job its architecture handles best:
- encoder-only models for fast understanding
- cross-encoders for precision ranking on a small candidate set
- decoder-style models for final language generation