
Neural Network Architectures in MangaAssist

MangaAssist uses different neural architectures for different jobs. Fast classification, vector retrieval, reranking, and text generation all benefit from different inductive biases and latency trade-offs.

This document focuses on the architectural patterns that matter most for the system design. Where a managed model does not publicly document its internals, the explanation stays at a level that can be defended.

1. Architecture Overview

Component          | Architecture pattern                                | Example model                         | Main task
Intent classifier  | Encoder-only transformer                            | DistilBERT (fine-tuned)               | 8-class routing
Sentiment detector | Encoder-only transformer                            | DistilBERT variant                    | Binary or multiclass sentiment
Retrieval encoder  | Embedding model with bi-encoder retrieval behavior  | Titan Text Embeddings V2              | Dense vector generation
Reranker           | Cross-encoder transformer                           | cross-encoder/ms-marco-MiniLM-L-6-v2  | Query-document relevance scoring
Response generator | Decoder-style text generation model                 | Claude 3.5 Sonnet via Bedrock         | Autoregressive response generation

2. Feedforward Networks

2.1 Core Pattern

A feedforward network stacks linear layers with nonlinear activations:

$$\mathbf{h}_1 = \text{ReLU}(\mathbf{x}\mathbf{W}_1 + \mathbf{b}_1)$$

$$\mathbf{h}_2 = \text{ReLU}(\mathbf{h}_1\mathbf{W}_2 + \mathbf{b}_2)$$

$$\hat{\mathbf{y}} = \text{softmax}(\mathbf{h}_2\mathbf{W}_3 + \mathbf{b}_3)$$
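
A minimal PyTorch sketch of the same stack (all layer sizes here are illustrative placeholders, not taken from any MangaAssist model):

import torch
import torch.nn as nn

# Two hidden layers with ReLU, producing logits for 8 classes.
ffn = nn.Sequential(
    nn.Linear(256, 128),   # x W1 + b1
    nn.ReLU(),
    nn.Linear(128, 64),    # h1 W2 + b2
    nn.ReLU(),
    nn.Linear(64, 8),      # h2 W3 + b3 (class logits)
)

x = torch.randn(4, 256)           # batch of 4 input vectors
probs = ffn(x).softmax(dim=-1)    # softmax turns logits into class probabilities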

2.2 Why FFNs Matter Here

MangaAssist does not rely on standalone MLPs for the main customer-facing tasks, but feedforward blocks appear inside every transformer layer:

Transformer block
  -> attention sublayer
  -> residual connection + layer norm
  -> feedforward sublayer
  -> residual connection + layer norm

For DistilBERT, the feedforward block expands from 768 to 3072 dimensions and then projects back down. That widened middle layer lets the model express richer feature interactions than a single linear projection could.
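
A sketch of that sublayer, using DistilBERT's published sizes (768 hidden, 3072 intermediate) and its GELU activation:

import torch.nn as nn

# Feedforward sublayer as it appears inside a DistilBERT layer:
# expand 768 -> 3072, apply GELU, project back 3072 -> 768.
ffn_block = nn.Sequential(
    nn.Linear(768, 3072),
    nn.GELU(),
    nn.Linear(3072, 768),
)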

2.3 Common Activations

Activation   | Formula                           | Common use
ReLU         | $\max(0, x)$                      | Basic FFNs
GELU         | $x\Phi(x)$                        | BERT-family transformers
SiLU / Swish | $x\sigma(x)$                      | Many modern decoder models
Sigmoid      | $\frac{1}{1 + e^{-x}}$            | Binary heads and gates
Softmax      | $\frac{e^{z_i}}{\sum_j e^{z_j}}$  | Multiclass outputs and attention weights
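
For concreteness, the same activations via PyTorch's built-ins; a quick sanity check, not production code:

import torch
import torch.nn.functional as F

z = torch.tensor([-2.0, 0.0, 2.0])
F.relu(z)               # tensor([0., 0., 2.])
F.gelu(z)               # x * Phi(x): smooth, near-zero for negative inputs
F.silu(z)               # x * sigmoid(x)
torch.sigmoid(z)        # squashes each value into (0, 1)
F.softmax(z, dim=0)     # normalizes the vector into a probability distribution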

3. Transformer Families

3.1 Encoder-Decoder Transformer

The original transformer pairs an encoder with a decoder:

Source text
  -> encoder stack
  -> contextual hidden states
  -> decoder stack with cross-attention
  -> next-token probabilities

This pattern is useful to know because it explains where "cross-attention" comes from, even though MangaAssist's main runtime path uses encoder-only and decoder-only models instead.

3.2 Encoder-Only Transformer: DistilBERT

Encoder-only models process the whole input bidirectionally. Each token can attend to every other token in the same sequence.

Input sequence
  -> token embeddings + position embeddings
  -> transformer block 1
  -> transformer block 2
  -> ...
  -> transformer block 6
  -> pooled sequence representation
  -> classification head

Important properties:

  • bidirectional attention captures both left and right context
  • every token receives a contextualized representation
  • the first-token representation is commonly used for classification
  • the architecture is well suited for classification, tagging, and reranking

DistilBERT-specific notes:

  • 6 transformer layers instead of 12 in BERT-base
  • 12 attention heads
  • 768 hidden size
  • roughly 40% fewer parameters and about 60% faster inference than BERT-base, while retaining about 97% of its language-understanding performance according to the original DistilBERT paper
  • token-type embeddings and the pooler are removed relative to BERT-base

Fine-tuning example:

from transformers import (
    DistilBertForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# 8 output labels, one per intent class.
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=8
)

training_args = TrainingArguments(
    output_dir="intent-classifier",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

# train_dataset and eval_dataset are assumed to be tokenized datasets
# with "input_ids", "attention_mask", and "label" columns.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
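
Once trained, routing is a single forward pass. A hedged inference sketch reusing the model above; the sample message is an assumption, and mapping the label index to an intent name happens elsewhere in the project:

import torch
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

inputs = tokenizer("Where is my One Piece order?", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, 8)
intent_id = logits.argmax(dim=-1).item()     # index of the predicted intent class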

3.3 Decoder-Only Generation Model

Decoder-style models generate text left to right. Each token can attend only to the tokens before it.

Prompt
  -> token embeddings + positional information
  -> repeated masked self-attention blocks
  -> language-model head
  -> next-token distribution
  -> sample or select next token
  -> append token and repeat

The causal mask enforces autoregressive generation:

$$\text{Mask}_{ij} = \begin{cases} 0 & \text{if } j \le i \\ -\infty & \text{if } j > i \end{cases}$$

For MangaAssist, the practical implication is more important than the hidden implementation details: the generator conditions on the prompt, retrieved context, and conversation history, then emits one token at a time. Exact layer counts and parameter counts for proprietary hosted models are not always publicly documented, so this document avoids relying on unpublished internals.
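
A small sketch of the mask itself, written here in PyTorch (the sequence length is arbitrary):

import torch

seq_len = 5
# 0 where attention is allowed (j <= i), -inf where it is blocked (j > i).
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

# Added to attention scores before softmax, the -inf entries get zero weight,
# so each position attends only to itself and earlier positions.
scores = torch.randn(seq_len, seq_len) + mask
weights = scores.softmax(dim=-1)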

3.4 Cross-Encoder Reranker

The reranker is still an encoder-style transformer, but the input format is different.

Bi-encoder retrieval

encode(query)    -> vector_q
encode(document) -> vector_d
score = cosine(vector_q, vector_d)

Cross-encoder reranking

[CLS] query [SEP] document [SEP]
  -> transformer
  -> sequence representation
  -> scalar relevance score

Why it is stronger:

  • query tokens and document tokens interact inside attention layers
  • phrase-level and negation-sensitive relevance signals are easier to capture
  • the score is learned directly for ranking, not derived from a generic vector similarity

Why it is slower:

  • every candidate document must be re-encoded with the query
  • latency grows with the number of candidates

That is why the system uses it only after a fast first-stage retriever narrows the search space.
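
A minimal reranking sketch using the sentence-transformers CrossEncoder wrapper; the query and candidate texts are placeholders:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "manga similar to Fullmetal Alchemist"
candidates = ["...chunk 1...", "...chunk 2...", "...chunk 3..."]  # from first-stage retrieval

# Each (query, document) pair is encoded jointly and scored.
scores = reranker.predict([(query, doc) for doc in candidates])
ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)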


4. Positional Information

4.1 Why Position Must Be Added Explicitly

Attention alone is permutation-invariant. Without positional information, "I love manga" and "manga love I" would look equivalent to the model.

4.2 Learned Absolute Positions

DistilBERT uses learned position embeddings:

$$\mathbf{x}_i = \mathbf{e}_{\text{token}_i} + \mathbf{e}_{\text{pos}_i}$$

This is enough for many classification and understanding tasks.
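
In embedding-table form, with sizes matching DistilBERT's published configuration (30522-token vocabulary, 512 maximum positions); the token ids are illustrative:

import torch
import torch.nn as nn

token_emb = nn.Embedding(30522, 768)   # vocabulary size x hidden size
pos_emb = nn.Embedding(512, 768)       # maximum sequence length x hidden size

ids = torch.tensor([[101, 2026, 2171, 102]])        # token ids (illustrative)
positions = torch.arange(ids.size(1)).unsqueeze(0)  # 0, 1, 2, 3
x = token_emb(ids) + pos_emb(positions)             # summed input representation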

4.3 Relative and Rotary Schemes

Many modern generation models use relative or rotary positional schemes instead of fixed learned absolute embeddings. The exact choice depends on the model family, but the shared goal is the same: inject sequence order without abandoning efficient attention computation.


5. Residual Connections and Layer Normalization

Transformer sublayers are wrapped in a residual path:

$$\text{output} = \text{LayerNorm}(\mathbf{x} + \text{SubLayer}(\mathbf{x}))$$

This matters because:

  • gradients can flow more easily through deep stacks
  • optimization becomes more stable
  • intermediate computations can refine features without losing the original signal

Without residual connections, deep transformers would be much harder to train reliably.
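
The post-norm wrapper in code form; SubLayer stands in for either the attention or the feedforward sublayer:

import torch.nn as nn

class ResidualBlock(nn.Module):
    # Wraps any sublayer in the post-norm residual pattern above.
    def __init__(self, sublayer, hidden_size=768):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))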


6. Knowledge Distillation and DistilBERT

DistilBERT is a compressed student model trained to imitate BERT-base.

The distillation objective combines ground-truth supervision with teacher guidance:

$$\mathcal{L}_{\text{distill}} = \alpha \, \mathcal{L}_{\text{CE}} + (1-\alpha) \, T^2 \cdot \text{KL}\left(p_{\text{teacher}}^{(T)} \,\|\, p_{\text{student}}^{(T)}\right)$$

Where:

  • $\mathcal{L}_{\text{CE}}$ is cross-entropy against labels
  • $\text{KL}$ is KL divergence between teacher and student distributions
  • $T$ is the distillation temperature
  • $\alpha$ balances the two objectives
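
A sketch of this combined objective; the alpha and T defaults are tuning choices, not values taken from the DistilBERT paper:

import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    # Hard-label cross-entropy against the ground truth.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label KL between temperature-scaled teacher and student distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    )
    return alpha * ce + (1 - alpha) * (T ** 2) * kl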

Why this matters operationally:

  • lower latency than BERT-base
  • smaller memory footprint
  • better production fit for real-time routing

The trade-off is straightforward: a small loss in model capacity buys a large improvement in serving efficiency.


7. Architecture Comparison

Property          | Encoder-only classifier                 | Cross-encoder reranker                   | Decoder-style generator
Attention pattern | Bidirectional                           | Bidirectional over query + doc together  | Causal
Input             | Single sequence                         | Paired query-document sequence           | Prompt prefix
Output            | Class probabilities or representations  | One relevance score                      | Next-token distribution
Strength          | Fast understanding and classification   | Fine-grained ranking                     | Fluent grounded generation
Weakness          | Not ideal for generation                | Too slow for full-corpus search          | Higher latency and cost
Project role      | Intent routing and sentiment            | Retrieval reranking                      | Final answer generation

8. How the Architectures Work Together

User asks for a recommendation
  |
  +-- DistilBERT
  |     -> classify intent
  |
  +-- Titan Text Embeddings V2
  |     -> convert query to a retrieval vector
  |
  +-- OpenSearch ANN
  |     -> fetch top candidate chunks
  |
  +-- MiniLM cross-encoder
  |     -> rerank the candidate set
  |
  +-- Claude 3.5 Sonnet
        -> generate a grounded response using prompt + retrieved context

The system works well because each model is assigned to the job its architecture handles best:

  • encoder-only models for fast understanding
  • cross-encoders for precision ranking on a small candidate set
  • decoder-style models for final language generation
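
Sketched as orchestration pseudocode; every helper function name here is hypothetical, standing in for the services in the diagram above:

def answer(query: str, history: list[str]) -> str:
    intent = classify_intent(query)                  # DistilBERT intent classifier
    query_vec = embed(query)                         # Titan Text Embeddings V2
    candidates = ann_search(query_vec, k=50)         # OpenSearch ANN retrieval
    top_chunks = rerank(query, candidates, k=5)      # MiniLM cross-encoder
    return generate(query, top_chunks, history, intent)  # Claude 3.5 Sonnet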