Linear Regression and Generalized Models in MangaAssist
Linear models still matter in an LLM-centric system. They show up as practical baselines, interpretable analysis tools, and the conceptual template for many neural-network components.
This document covers the parts of linear and generalized modeling that are most useful for understanding the MangaAssist stack.
1. Why Linear Models Still Matter
Even in a transformer-heavy architecture, linear models remain important in three ways:
- They provide strong, cheap baselines that every more complex model should beat.
- They are easy to interpret, which makes them useful for analysis, calibration, and business-facing reporting.
- Many neural components are linear models applied to learned features rather than hand-engineered ones.
2. Simple Linear Regression
2.1 Model
$$y = \beta_0 + \beta_1 x + \epsilon$$
Where:
- $y$ is the response
- $x$ is the predictor
- $\beta_0$ is the intercept
- $\beta_1$ is the slope
- $\epsilon$ is the residual error term
2.2 Ordinary Least Squares
OLS estimates coefficients by minimizing squared residual error:
$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2$$
Closed-form solution:
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
In practice, this closed form is most useful for intuition. Large or ill-conditioned problems are often fit with iterative methods or regularization.
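To make the contrast concrete, here is a minimal numpy sketch on synthetic data (the data and coefficients are illustrative, not from the project): the normal-equations solve matches `np.linalg.lstsq`, but the latter is the numerically safer default.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative synthetic data: y = 2 + 0.5x + noise
x = rng.uniform(0, 100, size=200)
y = 2.0 + 0.5 * x + rng.normal(0, 5, size=200)

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(x), x])

# Normal equations: good for intuition, fragile when X^T X is ill-conditioned
beta_closed = np.linalg.inv(X.T @ X) @ X.T @ y

# SVD-based least squares: the numerically preferred route
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```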
2.3 Project Examples
Cost prediction
$$\text{cost per session} = \beta_0 + \beta_1(\text{avg response tokens}) + \epsilon$$
This estimates how prompt or retrieval changes affect serving cost before rollout.
Latency trend analysis
$$\text{P99 latency} = \beta_0 + \beta_1(\text{week}) + \epsilon$$
A positive slope suggests gradual performance drift over time.
In statsmodels, the cost model above is a few lines:

```python
import statsmodels.api as sm

# avg_tokens_per_response and cost_per_session are per-session arrays
X = sm.add_constant(avg_tokens_per_response)  # add the intercept column
model = sm.OLS(cost_per_session, X).fit()
print(model.summary())
```
3. Multiple Linear Regression
3.1 Model
$$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_px_p + \epsilon$$
Or in matrix form:
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$
3.2 Example: Session Value Analysis
$$\text{revenue} = \beta_0 + \beta_1(\text{turns}) + \beta_2(\text{recommendations shown}) + \beta_3(\text{is prime}) + \beta_4(\text{add-to-cart clicks}) + \epsilon$$
This style of model helps answer business questions such as:
- which conversation features correlate with value
- whether recommendation exposure is associated with higher downstream conversion
- which parts of the experience deserve optimization effort
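As a hedged sketch of what that fit could look like with the statsmodels formula API (the DataFrame and column names here are hypothetical stand-ins for real session logs):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical session-level frame; in practice this comes from session logs
sessions = pd.DataFrame({
    "revenue":     [0.0, 12.5, 0.0, 44.0, 8.0, 0.0, 27.0, 15.5],
    "turns":       [2, 6, 3, 9, 4, 2, 8, 5],
    "recs_shown":  [0, 4, 1, 6, 2, 1, 5, 3],
    "is_prime":    [0, 1, 0, 1, 1, 0, 1, 0],
    "add_to_cart": [0, 1, 0, 2, 1, 0, 2, 1],
})

model = smf.ols("revenue ~ turns + recs_shown + is_prime + add_to_cart",
                data=sessions).fit()
print(model.summary())
```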
3.3 Assumptions and Diagnostics
| Assumption | Meaning | What to check |
|---|---|---|
| Linearity | Mean response changes linearly with predictors | Residual plots, partial dependence |
| Independence | Residuals are not correlated | Session-level grouping, time effects |
| Homoscedasticity | Error variance is roughly constant | Residual spread vs. fitted values |
| Low multicollinearity | Predictors are not redundant | Correlation matrix, VIF |
| Approximate normality | Residuals are not too heavy-tailed | QQ plot, robust alternatives |
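The multicollinearity row is the easiest to automate. A small helper, assuming the design matrix is a pandas DataFrame that already includes the constant column (the helper name and threshold are suggestions, not project conventions):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.DataFrame:
    # VIF above roughly 5-10 usually signals redundant predictors
    return pd.DataFrame({
        "feature": X.columns,
        "VIF": [variance_inflation_factor(X.values, i)
                for i in range(X.shape[1])],
    })
```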
For business metrics such as revenue, heteroscedasticity is common. Robust standard errors are often a better default than pretending the classical assumptions hold perfectly.
```python
# HC3 heteroscedasticity-robust standard errors
model = sm.OLS(revenue, X).fit(cov_type="HC3")
```
4. Generalized Linear Model View
The most useful mental model is not "linear regression vs. logistic regression", but the broader GLM template:
$$g(\mathbb{E}[y \mid \mathbf{x}]) = \mathbf{x}^T\boldsymbol{\beta}$$
Where $g(\cdot)$ is a link function.
| Response type | Distributional family | Link function | Example use |
|---|---|---|---|
| Continuous | Gaussian | Identity | Cost or latency modeling |
| Binary | Bernoulli | Logit | Escalation prediction |
| Count | Poisson | Log | Ticket or retry count modeling |
| Multiclass | Categorical | Softmax / multinomial logit | Intent classification |
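For instance, the count row corresponds to Poisson regression. A minimal statsmodels sketch on synthetic data (the `load` feature and coefficients are illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Synthetic example: retry counts whose log-rate is linear in a load feature
load = rng.uniform(0, 1, size=500)
retries = rng.poisson(np.exp(0.3 + 1.2 * load))

X = sm.add_constant(load)
poisson = sm.GLM(retries, X, family=sm.families.Poisson()).fit()
print(poisson.params)  # estimates live on the log (link) scale
```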
This framing makes it easier to see classifier heads in neural networks as generalized linear models on top of learned representations.
5. Logistic Regression
5.1 Model
For a binary outcome:
$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^T\mathbf{x} + b)}}$$
The log-odds is linear in the features:
$$\log \frac{P(y=1 \mid \mathbf{x})}{P(y=0 \mid \mathbf{x})} = \mathbf{w}^T\mathbf{x} + b$$
5.2 Loss Function
$$\mathcal{L} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i\log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i)\right]$$
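This loss is short enough to transcribe directly; a numpy sketch (the clipping epsilon is an implementation detail, not part of the math):

```python
import numpy as np

def binary_cross_entropy(y_true, p_hat, eps=1e-12):
    # Clip to avoid log(0) on saturated predictions
    p = np.clip(p_hat, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
```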
5.3 Project Uses
Baseline intent routing
Before fine-tuning DistilBERT, a TF-IDF plus logistic-regression baseline gives a fast sanity-check model.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Sparse TF-IDF features over the raw training messages
vectorizer = TfidfVectorizer(max_features=10000)
X_train = vectorizer.fit_transform(train_messages)

# Multinomial (softmax) logistic regression over the intents
clf = LogisticRegression(
    multi_class="multinomial",
    solver="lbfgs",
    max_iter=1000,
)
clf.fit(X_train, train_intents)
```
Escalation prediction
$$P(\text{escalate}) = \sigma(\beta_0 + \beta_1(\text{sentiment}) + \beta_2(\text{turn count}) + \beta_3(\text{failed attempts}))$$
Recommendation click prediction
$$P(\text{click}) = \sigma(\beta_0 + \beta_1(\text{similarity}) + \beta_2(\text{price match}) + \beta_3(\text{is prime}))$$
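Both models are ordinary scikit-learn fits. A hedged sketch for the escalation case, with hypothetical toy features standing in for real conversation logs:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features: [sentiment score, turn count, failed attempts]
X = np.array([[-0.8, 12, 3], [0.4, 3, 0], [-0.2, 7, 1], [0.9, 2, 0],
              [-0.6, 9, 2], [0.1, 5, 1], [-0.9, 15, 4], [0.7, 4, 0]])
y = np.array([1, 0, 0, 0, 1, 0, 1, 0])  # 1 = escalated to a human agent

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba(X)[:, 1])  # P(escalate) per conversation
```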
5.4 Why the Baseline Matters
Even when DistilBERT wins on accuracy, the linear baseline remains useful because it:
- provides a performance floor
- highlights easy label or feature issues quickly
- offers a low-latency fallback path
- is easier to retrain and inspect
6. Multinomial Logistic Regression
For $K$ classes:
$$P(y = k \mid \mathbf{x}) = \frac{e^{\mathbf{w}_k^T\mathbf{x} + b_k}}{\sum_{j=1}^{K} e^{\mathbf{w}_j^T\mathbf{x} + b_j}}$$
Training uses categorical cross-entropy:
$$\mathcal{L} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} y_{ik}\log(\hat{p}_{ik})$$
This is the right lens for:
- the 8-way intent classifier
- the final output layer of many classification models
- vocabulary prediction in next-token generation
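The softmax and its loss are compact enough to write out; a numpy sketch mirroring the two formulas above:

```python
import numpy as np

def softmax(logits):
    # Subtract the row max for numerical stability
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def categorical_cross_entropy(y_onehot, p_hat, eps=1e-12):
    return -np.mean(np.sum(y_onehot * np.log(np.clip(p_hat, eps, 1.0)),
                           axis=-1))
```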
6.1 Connection to DistilBERT
The final classifier head in DistilBERT is not magic. It is a learned linear projection followed by softmax:
$$\text{logits} = \mathbf{h}_{\text{[CLS]}}\mathbf{W}_{\text{cls}} + \mathbf{b}_{\text{cls}}$$
$$\text{probs} = \text{softmax}(\text{logits})$$
That is equivalent to multinomial logistic regression on top of a learned feature vector.
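A minimal PyTorch sketch of that head (the hidden size matches DistilBERT's 768; the batch and the single-layer head are illustrative simplifications):

```python
import torch
import torch.nn as nn

hidden_size, num_intents = 768, 8

# The head is just a learned linear projection over the [CLS] vector
cls_head = nn.Linear(hidden_size, num_intents)

h_cls = torch.randn(4, hidden_size)             # a batch of [CLS] vectors
probs = torch.softmax(cls_head(h_cls), dim=-1)  # multinomial logistic output
```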
7. Regularization
7.1 L2 Regularization
$$\mathcal{L}_{\text{ridge}} = \mathcal{L}_{\text{original}} + \lambda \sum_j w_j^2$$
L2 regularization discourages large weights and typically improves generalization.
7.2 L1 Regularization
$$\mathcal{L}_{\text{lasso}} = \mathcal{L}_{\text{original}} + \lambda \sum_j |w_j|$$
L1 regularization encourages sparse solutions and can help with feature selection.
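The sparsity difference is easy to see on synthetic data (the alphas and dimensions below are arbitrary illustrations):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# Only the first two of ten features actually matter
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)  # shrinks every weight toward zero
lasso = Lasso(alpha=0.1).fit(X, y)  # zeroes out the irrelevant weights
print(np.count_nonzero(lasso.coef_))  # typically close to 2 here
```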
7.3 Other Regularizers in the Stack
| Regularizer | Where it appears | Purpose |
|---|---|---|
| L2 / weight decay | Linear baselines and transformer fine-tuning | Reduces overfitting |
| L1 | Sparse text-feature models | Selects informative features |
| Dropout | Transformer blocks and classifier heads | Adds stochastic regularization |
| Early stopping | Fine-tuning and retraining | Prevents over-training on small datasets |
```python
from torch.optim import AdamW

# Decoupled weight decay applies L2-style shrinkage to the weights
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
```
8. Gradient Descent and Optimization
All learned models in the stack are trained by optimizing a loss:
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta}_t)$$
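For plain linear regression, the loop is a few lines of numpy; a sketch on synthetic data (learning rate and step count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(0, 1, 100)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=100)

theta, eta = np.zeros(2), 0.1
for _ in range(500):
    grad = 2 / len(y) * X.T @ (X @ theta - y)  # gradient of mean squared error
    theta -= eta * grad                        # the update rule above
```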
8.1 Adam
Adam adapts the effective learning rate per parameter:
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla\mathcal{L}$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)(\nabla\mathcal{L})^2$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$
$$\theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
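Those four updates translate almost line for line into code; a minimal numpy sketch of a single Adam step (defaults mirror the common beta and epsilon choices):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad          # first-moment estimate
    v = b2 * v + (1 - b2) * grad**2       # second-moment estimate
    m_hat = m / (1 - b1**t)               # bias correction
    v_hat = v / (1 - b2**t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```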
8.2 Learning Rate Scheduling
Warmup plus decay is common for transformer fine-tuning:
```python
from transformers import get_linear_schedule_with_warmup

# Linear warmup for 500 steps, then linear decay to zero
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=500,
    num_training_steps=10000,
)
```
Warmup prevents unstable early updates. Decay helps the optimizer converge more gently later in training.
9. From Linear Models to Deep Learning
The conceptual jump from linear models to neural networks is smaller than it first appears.
Logistic regression
$$\hat{y} = \sigma(\mathbf{x}\mathbf{W} + \mathbf{b})$$
Two-layer network
$$\hat{y} = \sigma(\text{ReLU}(\mathbf{x}\mathbf{W}_1 + \mathbf{b}_1)\mathbf{W}_2 + \mathbf{b}_2)$$
Transformer-based classifier
$$\hat{y} = \text{softmax}(\mathbf{h}_{\text{sequence}}\mathbf{W}_{\text{cls}} + \mathbf{b}_{\text{cls}})$$
The feature extractor changes dramatically, but the final decision layer often remains a generalized linear model.
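In PyTorch terms, the progression is just a change of feature extractor in front of the same kind of final layer (dimensions here are illustrative):

```python
import torch.nn as nn

d, h, k = 768, 256, 8  # illustrative feature, hidden, and class sizes

logistic = nn.Linear(d, k)              # GLM on raw features
two_layer = nn.Sequential(
    nn.Linear(d, h), nn.ReLU(),         # learned features...
    nn.Linear(h, k),                    # ...then the same GLM-style head
)
# A transformer swaps the feature extractor again; the last
# nn.Linear is still a generalized linear decision layer.
```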
10. Summary
| Concept | Practical role in MangaAssist |
|---|---|
| Linear regression | Cost analysis, latency trends, experiment readouts |
| Multiple regression | Multi-factor business and product analysis |
| Logistic regression | Baseline classification, escalation modeling, CTR prediction |
| Multinomial logistic regression | Intent classification and token prediction |
| GLMs | Common framework connecting regression and classification |
| Regularization | Prevents overfitting in both simple and deep models |
| Gradient-based optimization | Trains every learned component in the stack |
Linear models are not "pre-deep-learning leftovers." They are still useful operational tools and the cleanest bridge between classical statistics and modern neural systems.