
Linear Regression and Generalized Models in MangaAssist

Linear models still matter in an LLM-centric system. They show up as practical baselines, interpretable analysis tools, and the conceptual template for many neural-network components.

This document covers the parts of linear and generalized modeling that are most useful for understanding the MangaAssist stack.

1. Why Linear Models Still Matter

Even in a transformer-heavy architecture, linear models remain important in three ways:

  1. They provide strong, cheap baselines that every more complex model should beat.
  2. They are easy to interpret, which makes them useful for analysis, calibration, and business-facing reporting.
  3. Many neural components are linear models applied to learned features rather than hand-engineered ones.

2. Simple Linear Regression

2.1 Model

$$y = \beta_0 + \beta_1 x + \epsilon$$

Where:

  • $y$ is the response
  • $x$ is the predictor
  • $\beta_0$ is the intercept
  • $\beta_1$ is the slope
  • $\epsilon$ is the residual error term

2.2 Ordinary Least Squares

OLS estimates coefficients by minimizing squared residual error:

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2$$

Closed-form solution:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$

In practice, this closed form is most useful for intuition. Large or ill-conditioned problems are often fit with iterative methods or regularization.
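For intuition, the closed form can be computed directly with NumPy. This is a minimal sketch on synthetic data (in real code, `np.linalg.lstsq` or a library fit is the numerically safer route):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2 + 3x + noise
x = rng.uniform(0, 10, size=200)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, size=200)

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(x), x])

# Closed-form OLS: beta_hat = (X^T X)^{-1} X^T y,
# computed via solve() rather than an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(beta_hat)  # approximately [2.0, 3.0]
```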

2.3 Project Examples

Cost prediction

$$\text{cost per session} = \beta_0 + \beta_1(\text{avg response tokens}) + \epsilon$$

This estimates how prompt changes or retrieval changes affect serving cost before rollout.

Latency trend analysis

$$\text{P99 latency} = \beta_0 + \beta_1(\text{week}) + \epsilon$$

A positive slope suggests gradual performance drift over time.

```python
import statsmodels.api as sm

X = sm.add_constant(avg_tokens_per_response)
model = sm.OLS(cost_per_session, X).fit()
print(model.summary())
```

3. Multiple Linear Regression

3.1 Model

$$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_px_p + \epsilon$$

Or in matrix form:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$

3.2 Example: Session Value Analysis

$$\text{revenue} = \beta_0 + \beta_1(\text{turns}) + \beta_2(\text{recommendations shown}) + \beta_3(\text{is prime}) + \beta_4(\text{add-to-cart clicks}) + \epsilon$$

This style of model helps answer business questions such as:

  • which conversation features correlate with value
  • whether recommendation exposure is associated with higher downstream conversion
  • which parts of the experience deserve optimization effort
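A multi-factor fit like this reduces to the same matrix closed form. A sketch on synthetic session data with known true coefficients (all feature names and coefficient values here are illustrative, not real MangaAssist fields):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Illustrative session features
turns = rng.integers(1, 20, size=n).astype(float)
recs_shown = rng.integers(0, 10, size=n).astype(float)
is_prime = rng.integers(0, 2, size=n).astype(float)

# Synthetic revenue generated with known coefficients plus noise
revenue = 5.0 + 0.8 * turns + 1.5 * recs_shown + 4.0 * is_prime + rng.normal(0, 1.0, n)

# Least-squares fit recovers the generating coefficients
X = np.column_stack([np.ones(n), turns, recs_shown, is_prime])
beta_hat, *_ = np.linalg.lstsq(X, revenue, rcond=None)

print(beta_hat)  # roughly [5.0, 0.8, 1.5, 4.0]
```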

3.3 Assumptions and Diagnostics

| Assumption | Meaning | What to check |
|---|---|---|
| Linearity | Mean response changes linearly with predictors | Residual plots, partial dependence |
| Independence | Residuals are not correlated | Session-level grouping, time effects |
| Homoscedasticity | Error variance is roughly constant | Residual spread vs. fitted values |
| Low multicollinearity | Predictors are not redundant | Correlation matrix, VIF |
| Approximate normality | Residuals are not too heavy-tailed | QQ plot, robust alternatives |

For business metrics such as revenue, heteroscedasticity is common. Robust standard errors are often a better default than pretending the classical assumptions hold perfectly.

```python
# HC3 heteroscedasticity-consistent standard errors
model = sm.OLS(revenue, X).fit(cov_type="HC3")
```

4. Generalized Linear Model View

The most useful mental model is not "linear regression vs. logistic regression", but the broader GLM template:

$$g(\mathbb{E}[y \mid \mathbf{x}]) = \mathbf{x}^T\boldsymbol{\beta}$$

Where $g(\cdot)$ is a link function.

| Response type | Distributional family | Link function | Example use |
|---|---|---|---|
| Continuous | Gaussian | Identity | Cost or latency modeling |
| Binary | Bernoulli | Logit | Escalation prediction |
| Count | Poisson | Log | Ticket or retry count modeling |
| Multiclass | Categorical | Softmax / multinomial logit | Intent classification |

This framing makes it easier to see classifier heads in neural networks as generalized linear models on top of learned representations.
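The table's link functions can be made concrete: every GLM predicts on the linear scale and maps back to the mean scale through the inverse link. A small NumPy sketch (the `inverse_link` helper is illustrative, not a library API):

```python
import numpy as np

def inverse_link(eta, family):
    """Map the linear predictor eta = x^T beta back to the mean scale."""
    if family == "gaussian":   # identity link: mean = eta
        return eta
    if family == "bernoulli":  # logit link: mean = sigmoid(eta)
        return 1.0 / (1.0 + np.exp(-eta))
    if family == "poisson":    # log link: mean = exp(eta)
        return np.exp(eta)
    raise ValueError(f"unknown family: {family}")

eta = np.array([-2.0, 0.0, 2.0])
print(inverse_link(eta, "bernoulli"))  # probabilities in (0, 1)
print(inverse_link(eta, "poisson"))    # positive expected counts
```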


5. Logistic Regression

5.1 Model

For a binary outcome:

$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^T\mathbf{x} + b)}}$$

The log-odds is linear in the features:

$$\log \frac{P(y=1)}{P(y=0)} = \mathbf{w}^T\mathbf{x} + b$$

5.2 Loss Function

$$\mathcal{L} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i\log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i)\right]$$
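The loss above, written out for a tiny batch (a sketch; the clipping guards against `log(0)` on extreme predictions):

```python
import numpy as np

def binary_cross_entropy(y_true, p_hat, eps=1e-12):
    # Clip probabilities away from 0 and 1 for numerical safety
    p_hat = np.clip(p_hat, eps, 1 - eps)
    return -np.mean(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))

y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.1, 0.8, 0.6])
print(binary_cross_entropy(y, p))
```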

5.3 Project Uses

Baseline intent routing

Before fine-tuning DistilBERT, a TF-IDF plus logistic-regression baseline gives a fast sanity-check model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

vectorizer = TfidfVectorizer(max_features=10000)
X_train = vectorizer.fit_transform(train_messages)

# lbfgs fits a multinomial model for multiclass targets by default;
# the explicit multi_class argument is deprecated in recent scikit-learn
clf = LogisticRegression(solver="lbfgs", max_iter=1000)
clf.fit(X_train, train_intents)
```

Escalation prediction

$$P(\text{escalate}) = \sigma(\beta_0 + \beta_1(\text{sentiment}) + \beta_2(\text{turn count}) + \beta_3(\text{failed attempts}))$$

Recommendation click prediction

$$P(\text{click}) = \sigma(\beta_0 + \beta_1(\text{similarity}) + \beta_2(\text{price match}) + \beta_3(\text{is prime}))$$
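With fitted coefficients in hand, scoring either model is a single sigmoid over a dot product. A sketch for the escalation model, with made-up coefficient values standing in for a real fit:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical fitted coefficients:
# intercept, sentiment, turn count, failed attempts
beta = np.array([-3.0, -1.2, 0.15, 0.9])

def p_escalate(sentiment, turns, failed_attempts):
    x = np.array([1.0, sentiment, turns, failed_attempts])
    return sigmoid(beta @ x)

print(p_escalate(sentiment=-0.8, turns=12, failed_attempts=3))  # high-risk session
print(p_escalate(sentiment=0.9, turns=2, failed_attempts=0))    # low-risk session
```

Negative sentiment, long conversations, and repeated failures all push the log-odds up, which is exactly what makes the coefficients easy to sanity-check against intuition.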

5.4 Why the Baseline Matters

Even when DistilBERT wins on accuracy, the linear baseline remains useful because it:

  • provides a performance floor
  • highlights easy label or feature issues quickly
  • offers a low-latency fallback path
  • is easier to retrain and inspect

6. Multinomial Logistic Regression

For $K$ classes:

$$P(y = k \mid \mathbf{x}) = \frac{e^{\mathbf{w}_k^T\mathbf{x} + b_k}}{\sum_{j=1}^{K} e^{\mathbf{w}_j^T\mathbf{x} + b_j}}$$

Training uses categorical cross-entropy:

$$\mathcal{L} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} y_{ik}\log(\hat{p}_{ik})$$

This is the right lens for:

  • the 8-way intent classifier
  • the final output layer of many classification models
  • vocabulary prediction in next-token generation
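Both the softmax and the cross-entropy loss fit in a few lines of NumPy (a sketch; subtracting the row maximum keeps the softmax numerically stable without changing its output):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def categorical_cross_entropy(y_onehot, p_hat, eps=1e-12):
    return -np.mean(np.sum(y_onehot * np.log(p_hat + eps), axis=-1))

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
probs = softmax(logits)

y = np.array([[1, 0, 0],
              [0, 0, 1]])  # true classes: 0 and 2
print(probs)
print(categorical_cross_entropy(y, probs))
```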

6.1 Connection to DistilBERT

The final classifier head in DistilBERT is not magic. It is a learned linear projection followed by softmax:

$$\text{logits} = \mathbf{h}_{\text{[CLS]}}\mathbf{W}_{\text{cls}} + \mathbf{b}_{\text{cls}}$$

$$\text{probs} = \text{softmax}(\text{logits})$$

That is equivalent to multinomial logistic regression on top of a learned feature vector.
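The head can be written down directly. A NumPy sketch: 768 is DistilBERT's hidden size and 8 is this project's intent-class count, but the weights and the [CLS] vector below are random stand-ins for learned values:

```python
import numpy as np

rng = np.random.default_rng(42)

hidden_size, num_intents = 768, 8
W_cls = rng.normal(0, 0.02, size=(hidden_size, num_intents))  # stand-in for learned weights
b_cls = np.zeros(num_intents)

h_cls = rng.normal(size=hidden_size)  # stand-in for the [CLS] representation

# Exactly the multinomial-logistic-regression forward pass
logits = h_cls @ W_cls + b_cls
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print(probs)  # a valid distribution over 8 intents
```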


7. Regularization

7.1 L2 Regularization

$$\mathcal{L}_{\text{ridge}} = \mathcal{L}_{\text{original}} + \lambda \sum_j w_j^2$$

L2 regularization discourages large weights and typically improves generalization.

7.2 L1 Regularization

$$\mathcal{L}_{\text{lasso}} = \mathcal{L}_{\text{original}} + \lambda \sum_j |w_j|$$

L1 regularization encourages sparse solutions and can help with feature selection.
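Ridge regression even keeps a closed form: the penalty just adds $\lambda$ to the diagonal of $\mathbf{X}^T\mathbf{X}$. A minimal sketch (it penalizes all columns uniformly; real implementations usually leave the intercept unpenalized):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge: (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 2.0, 0.0, 0.0, -1.0]) + rng.normal(0, 0.1, 100)

for lam in (0.0, 1.0, 100.0):
    beta = ridge_fit(X, y, lam)
    print(lam, np.linalg.norm(beta))  # the coefficient norm shrinks as lam grows
```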

7.3 Other Regularizers in the Stack

| Regularizer | Where it appears | Purpose |
|---|---|---|
| L2 / weight decay | Linear baselines and transformer fine-tuning | Reduces overfitting |
| L1 | Sparse text-feature models | Selects informative features |
| Dropout | Transformer blocks and classifier heads | Adds stochastic regularization |
| Early stopping | Fine-tuning and retraining | Prevents over-training on small datasets |
In fine-tuning code, weight decay is applied through the optimizer:

```python
from torch.optim import AdamW

# Decoupled weight decay (the L2-style penalty) applied per update
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
```

8. Gradient Descent and Optimization

All learned models in the stack are trained by optimizing a loss:

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta}_t)$$

8.1 Adam

Adam adapts the effective learning rate per parameter:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla\mathcal{L}$$

$$v_t = \beta_2 v_{t-1} + (1-\beta_2)(\nabla\mathcal{L})^2$$

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$

$$\theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
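These updates fit in a dozen lines. A sketch applying Adam to the toy loss $L(\theta) = (\theta - 3)^2$, whose gradient is $2(\theta - 3)$:

```python
import numpy as np

def adam_minimize(grad_fn, theta, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g       # first-moment (mean) estimate
        v = beta2 * v + (1 - beta2) * g * g   # second-moment estimate
        m_hat = m / (1 - beta1 ** t)          # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy loss L(theta) = (theta - 3)^2, gradient 2(theta - 3)
theta = adam_minimize(lambda th: 2 * (th - 3.0), np.array([0.0]))
print(theta)  # approaches 3.0
```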

8.2 Learning Rate Scheduling

Warmup plus decay is common for transformer fine-tuning:

```python
from transformers import get_linear_schedule_with_warmup

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=500,
    num_training_steps=10000
)
```

Warmup prevents unstable early updates. Decay helps the optimizer converge more gently later in training.


9. From Linear Models to Deep Learning

The conceptual jump from linear models to neural networks is smaller than it first appears.

Logistic regression

$$\hat{y} = \sigma(\mathbf{x}\mathbf{W} + \mathbf{b})$$

Two-layer network

$$\hat{y} = \sigma(\text{ReLU}(\mathbf{x}\mathbf{W}_1 + \mathbf{b}_1)\mathbf{W}_2 + \mathbf{b}_2)$$

Transformer-based classifier

$$\hat{y} = \text{softmax}(\mathbf{h}_{\text{sequence}}\mathbf{W}_{\text{cls}} + \mathbf{b}_{\text{cls}})$$

The feature extractor changes dramatically, but the final decision layer often remains a generalized linear model.
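A forward pass through the first two forms makes the point concrete. A NumPy sketch with random stand-in weights (untrained, for shape and structure only):

```python
import numpy as np

rng = np.random.default_rng(7)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
relu = lambda z: np.maximum(0.0, z)

x = rng.normal(size=4)  # raw feature vector

# Logistic regression: one linear map, then sigmoid
W, b = rng.normal(size=(4, 1)), np.zeros(1)
y_linear = sigmoid(x @ W + b)

# Two-layer network: a learned feature extractor, then the same decision layer
W1, b1 = rng.normal(size=(4, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)
h = relu(x @ W1 + b1)           # learned features replace hand-engineered ones
y_deep = sigmoid(h @ W2 + b2)   # still a GLM on top of h

print(y_linear, y_deep)
```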


10. Summary

| Concept | Practical role in MangaAssist |
|---|---|
| Linear regression | Cost analysis, latency trends, experiment readouts |
| Multiple regression | Multi-factor business and product analysis |
| Logistic regression | Baseline classification, escalation modeling, CTR prediction |
| Multinomial logistic regression | Intent classification and token prediction |
| GLMs | Common framework connecting regression and classification |
| Regularization | Prevents overfitting in both simple and deep models |
| Gradient-based optimization | Trains every learned component in the stack |

Linear models are not "pre-deep-learning leftovers." They are still useful operational tools and the cleanest bridge between classical statistics and modern neural systems.