
Linear Regression and Generalized Models in MangaAssist

Linear models still matter in an LLM-centric system. They show up as practical baselines, interpretable analysis tools, and the conceptual template for many neural-network components.

This document covers the parts of linear and generalized modeling that are most useful for understanding the MangaAssist stack.

1. Why Linear Models Still Matter

Even in a transformer-heavy architecture, linear models remain important in three ways:

  1. They provide strong, cheap baselines that every more complex model should beat.
  2. They are easy to interpret, which makes them useful for analysis, calibration, and business-facing reporting.
  3. Many neural components are linear models applied to learned features rather than hand-engineered ones.

2. Simple Linear Regression

2.1 Model

$$y = \beta_0 + \beta_1 x + \epsilon$$

Where:

  • $y$ is the response
  • $x$ is the predictor
  • $\beta_0$ is the intercept
  • $\beta_1$ is the slope
  • $\epsilon$ is the residual error term

2.2 Ordinary Least Squares

OLS estimates coefficients by minimizing squared residual error:

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2$$

Closed-form solution:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$

In practice, this closed form is most useful for intuition. Large or ill-conditioned problems are often fit with iterative methods or regularization.
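For intuition, the closed form can be computed directly with NumPy. This is a minimal sketch on synthetic data (in real code, `np.linalg.lstsq` or a library fit is the numerically safer route):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2 + 3x + noise
x = rng.uniform(0, 10, size=200)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, size=200)

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(x), x])

# Closed-form OLS: beta_hat = (X^T X)^{-1} X^T y,
# computed via solve() rather than an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(beta_hat)  # approximately [2.0, 3.0]
```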

2.3 Project Examples

Cost prediction

$$\text{cost per session} = \beta_0 + \beta_1(\text{avg response tokens}) + \epsilon$$

This estimates how prompt changes or retrieval changes affect serving cost before rollout.

Latency trend analysis

$$\text{P99 latency} = \beta_0 + \beta_1(\text{week}) + \epsilon$$

A positive slope suggests gradual performance drift over time.

```python
import statsmodels.api as sm

X = sm.add_constant(avg_tokens_per_response)
model = sm.OLS(cost_per_session, X).fit()
print(model.summary())
```

3. Multiple Linear Regression

3.1 Model

$$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_px_p + \epsilon$$

Or in matrix form:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$

3.2 Example: Session Value Analysis

$$\text{revenue} = \beta_0 + \beta_1(\text{turns}) + \beta_2(\text{recommendations shown}) + \beta_3(\text{is prime}) + \beta_4(\text{add-to-cart clicks}) + \epsilon$$

This style of model helps answer business questions such as:

  • which conversation features correlate with value
  • whether recommendation exposure is associated with higher downstream conversion
  • which parts of the experience deserve optimization effort
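A multi-factor fit like this reduces to the same matrix closed form. A sketch on synthetic session data with known true coefficients (all feature names and coefficient values here are illustrative, not real MangaAssist fields):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Illustrative session features
turns = rng.integers(1, 20, size=n).astype(float)
recs_shown = rng.integers(0, 10, size=n).astype(float)
is_prime = rng.integers(0, 2, size=n).astype(float)

# Synthetic revenue generated with known coefficients plus noise
revenue = 5.0 + 0.8 * turns + 1.5 * recs_shown + 4.0 * is_prime + rng.normal(0, 1.0, n)

# Least-squares fit recovers the generating coefficients
X = np.column_stack([np.ones(n), turns, recs_shown, is_prime])
beta_hat, *_ = np.linalg.lstsq(X, revenue, rcond=None)

print(beta_hat)  # roughly [5.0, 0.8, 1.5, 4.0]
```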

3.3 Assumptions and Diagnostics

| Assumption | Meaning | What to check |
|---|---|---|
| Linearity | Mean response changes linearly with predictors | Residual plots, partial dependence |
| Independence | Residuals are not correlated | Session-level grouping, time effects |
| Homoscedasticity | Error variance is roughly constant | Residual spread vs. fitted values |
| Low multicollinearity | Predictors are not redundant | Correlation matrix, VIF |
| Approximate normality | Residuals are not too heavy-tailed | QQ plot, robust alternatives |

For business metrics such as revenue, heteroscedasticity is common. Robust standard errors are often a better default than pretending the classical assumptions hold perfectly.

```python
# HC3 heteroscedasticity-consistent standard errors
model = sm.OLS(revenue, X).fit(cov_type="HC3")
```

4. Generalized Linear Model View

The most useful mental model is not "linear regression vs. logistic regression", but the broader GLM template:

$$g(\mathbb{E}[y \mid \mathbf{x}]) = \mathbf{x}^T\boldsymbol{\beta}$$

Where $g(\cdot)$ is a link function.

| Response type | Distributional family | Link function | Example use |
|---|---|---|---|
| Continuous | Gaussian | Identity | Cost or latency modeling |
| Binary | Bernoulli | Logit | Escalation prediction |
| Count | Poisson | Log | Ticket or retry count modeling |
| Multiclass | Categorical | Softmax / multinomial logit | Intent classification |

This framing makes it easier to see classifier heads in neural networks as generalized linear models on top of learned representations.
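The table's link functions can be made concrete: every GLM predicts on the linear scale and maps back to the mean scale through the inverse link. A small NumPy sketch (the `inverse_link` helper is illustrative, not a library API):

```python
import numpy as np

def inverse_link(eta, family):
    """Map the linear predictor eta = x^T beta back to the mean scale."""
    if family == "gaussian":   # identity link: mean = eta
        return eta
    if family == "bernoulli":  # logit link: mean = sigmoid(eta)
        return 1.0 / (1.0 + np.exp(-eta))
    if family == "poisson":    # log link: mean = exp(eta)
        return np.exp(eta)
    raise ValueError(f"unknown family: {family}")

eta = np.array([-2.0, 0.0, 2.0])
print(inverse_link(eta, "bernoulli"))  # probabilities in (0, 1)
print(inverse_link(eta, "poisson"))    # positive expected counts
```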


5. Logistic Regression

5.1 Model

For a binary outcome:

$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^T\mathbf{x} + b)}}$$

The log-odds is linear in the features:

$$\log \frac{P(y=1)}{P(y=0)} = \mathbf{w}^T\mathbf{x} + b$$

5.2 Loss Function

$$\mathcal{L} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i\log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i)\right]$$
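The loss above, written out for a tiny batch (a sketch; the clipping guards against `log(0)` on extreme predictions):

```python
import numpy as np

def binary_cross_entropy(y_true, p_hat, eps=1e-12):
    # Clip probabilities away from 0 and 1 for numerical safety
    p_hat = np.clip(p_hat, eps, 1 - eps)
    return -np.mean(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))

y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.1, 0.8, 0.6])
print(binary_cross_entropy(y, p))
```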

5.3 Project Uses

Baseline intent routing

Before fine-tuning DistilBERT, a TF-IDF plus logistic-regression baseline gives a fast sanity-check model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

vectorizer = TfidfVectorizer(max_features=10000)
X_train = vectorizer.fit_transform(train_messages)

# lbfgs fits a multinomial model for multiclass targets by default;
# the explicit multi_class argument is deprecated in recent scikit-learn
clf = LogisticRegression(solver="lbfgs", max_iter=1000)
clf.fit(X_train, train_intents)
```

Escalation prediction

$$P(\text{escalate}) = \sigma(\beta_0 + \beta_1(\text{sentiment}) + \beta_2(\text{turn count}) + \beta_3(\text{failed attempts}))$$

Recommendation click prediction

$$P(\text{click}) = \sigma(\beta_0 + \beta_1(\text{similarity}) + \beta_2(\text{price match}) + \beta_3(\text{is prime}))$$
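With fitted coefficients in hand, scoring either model is a single sigmoid over a dot product. A sketch for the escalation model, with made-up coefficient values standing in for a real fit:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical fitted coefficients:
# intercept, sentiment, turn count, failed attempts
beta = np.array([-3.0, -1.2, 0.15, 0.9])

def p_escalate(sentiment, turns, failed_attempts):
    x = np.array([1.0, sentiment, turns, failed_attempts])
    return sigmoid(beta @ x)

print(p_escalate(sentiment=-0.8, turns=12, failed_attempts=3))  # high-risk session
print(p_escalate(sentiment=0.9, turns=2, failed_attempts=0))    # low-risk session
```

Negative sentiment, long conversations, and repeated failures all push the log-odds up, which is exactly what makes the coefficients easy to sanity-check against intuition.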

5.4 Why the Baseline Matters

Even when DistilBERT wins on accuracy, the linear baseline remains useful because it:

  • provides a performance floor
  • highlights easy label or feature issues quickly
  • offers a low-latency fallback path
  • is easier to retrain and inspect

6. Multinomial Logistic Regression

For $K$ classes:

$$P(y = k \mid \mathbf{x}) = \frac{e^{\mathbf{w}_k^T\mathbf{x} + b_k}}{\sum_{j=1}^{K} e^{\mathbf{w}_j^T\mathbf{x} + b_j}}$$

Training uses categorical cross-entropy:

$$\mathcal{L} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} y_{ik}\log(\hat{p}_{ik})$$

This is the right lens for:

  • the 8-way intent classifier
  • the final output layer of many classification models
  • vocabulary prediction in next-token generation
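Both the softmax and the cross-entropy loss fit in a few lines of NumPy (a sketch; subtracting the row maximum keeps the softmax numerically stable without changing its output):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def categorical_cross_entropy(y_onehot, p_hat, eps=1e-12):
    return -np.mean(np.sum(y_onehot * np.log(p_hat + eps), axis=-1))

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
probs = softmax(logits)

y = np.array([[1, 0, 0],
              [0, 0, 1]])  # true classes: 0 and 2
print(probs)
print(categorical_cross_entropy(y, probs))
```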

6.1 Connection to DistilBERT

The final classifier head in DistilBERT is not magic. It is a learned linear projection followed by softmax:

$$\text{logits} = \mathbf{h}_{\text{[CLS]}}\mathbf{W}_{\text{cls}} + \mathbf{b}_{\text{cls}}$$

$$\text{probs} = \text{softmax}(\text{logits})$$

That is equivalent to multinomial logistic regression on top of a learned feature vector.
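The head can be written down directly. A NumPy sketch: 768 is DistilBERT's hidden size and 8 is this project's intent-class count, but the weights and the [CLS] vector below are random stand-ins for learned values:

```python
import numpy as np

rng = np.random.default_rng(42)

hidden_size, num_intents = 768, 8
W_cls = rng.normal(0, 0.02, size=(hidden_size, num_intents))  # stand-in for learned weights
b_cls = np.zeros(num_intents)

h_cls = rng.normal(size=hidden_size)  # stand-in for the [CLS] representation

# Exactly the multinomial-logistic-regression forward pass
logits = h_cls @ W_cls + b_cls
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print(probs)  # a valid distribution over 8 intents
```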


7. Regularization

7.1 L2 Regularization

$$\mathcal{L}_{\text{ridge}} = \mathcal{L}_{\text{original}} + \lambda \sum_j w_j^2$$

L2 regularization discourages large weights and typically improves generalization.

7.2 L1 Regularization

$$\mathcal{L}_{\text{lasso}} = \mathcal{L}_{\text{original}} + \lambda \sum_j |w_j|$$

L1 regularization encourages sparse solutions and can help with feature selection.
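Ridge regression even keeps a closed form: the penalty just adds $\lambda$ to the diagonal of $\mathbf{X}^T\mathbf{X}$. A minimal sketch (it penalizes all columns uniformly; real implementations usually leave the intercept unpenalized):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge: (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 2.0, 0.0, 0.0, -1.0]) + rng.normal(0, 0.1, 100)

for lam in (0.0, 1.0, 100.0):
    beta = ridge_fit(X, y, lam)
    print(lam, np.linalg.norm(beta))  # the coefficient norm shrinks as lam grows
```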

7.3 Other Regularizers in the Stack

| Regularizer | Where it appears | Purpose |
|---|---|---|
| L2 / weight decay | Linear baselines and transformer fine-tuning | Reduces overfitting |
| L1 | Sparse text-feature models | Selects informative features |
| Dropout | Transformer blocks and classifier heads | Adds stochastic regularization |
| Early stopping | Fine-tuning and retraining | Prevents over-training on small datasets |
In fine-tuning code, weight decay is applied through the optimizer:

```python
from torch.optim import AdamW

# Decoupled weight decay (the L2-style penalty) applied per update
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
```

8. Gradient Descent and Optimization

All learned models in the stack are trained by optimizing a loss:

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta}_t)$$

8.1 Adam

Adam adapts the effective learning rate per parameter:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla\mathcal{L}$$

$$v_t = \beta_2 v_{t-1} + (1-\beta_2)(\nabla\mathcal{L})^2$$

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$

$$\theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
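These updates fit in a dozen lines. A sketch applying Adam to the toy loss $L(\theta) = (\theta - 3)^2$, whose gradient is $2(\theta - 3)$:

```python
import numpy as np

def adam_minimize(grad_fn, theta, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g       # first-moment (mean) estimate
        v = beta2 * v + (1 - beta2) * g * g   # second-moment estimate
        m_hat = m / (1 - beta1 ** t)          # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy loss L(theta) = (theta - 3)^2, gradient 2(theta - 3)
theta = adam_minimize(lambda th: 2 * (th - 3.0), np.array([0.0]))
print(theta)  # approaches 3.0
```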

8.2 Learning Rate Scheduling

Warmup plus decay is common for transformer fine-tuning:

```python
from transformers import get_linear_schedule_with_warmup

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=500,
    num_training_steps=10000
)
```

Warmup prevents unstable early updates. Decay helps the optimizer converge more gently later in training.


9. From Linear Models to Deep Learning

The conceptual jump from linear models to neural networks is smaller than it first appears.

Logistic regression

$$\hat{y} = \sigma(\mathbf{x}\mathbf{W} + \mathbf{b})$$

Two-layer network

$$\hat{y} = \sigma(\text{ReLU}(\mathbf{x}\mathbf{W}_1 + \mathbf{b}_1)\mathbf{W}_2 + \mathbf{b}_2)$$

Transformer-based classifier

$$\hat{y} = \text{softmax}(\mathbf{h}_{\text{sequence}}\mathbf{W}_{\text{cls}} + \mathbf{b}_{\text{cls}})$$

The feature extractor changes dramatically, but the final decision layer often remains a generalized linear model.
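A forward pass through the first two forms makes the point concrete. A NumPy sketch with random stand-in weights (untrained, for shape and structure only):

```python
import numpy as np

rng = np.random.default_rng(7)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
relu = lambda z: np.maximum(0.0, z)

x = rng.normal(size=4)  # raw feature vector

# Logistic regression: one linear map, then sigmoid
W, b = rng.normal(size=(4, 1)), np.zeros(1)
y_linear = sigmoid(x @ W + b)

# Two-layer network: a learned feature extractor, then the same decision layer
W1, b1 = rng.normal(size=(4, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)
h = relu(x @ W1 + b1)           # learned features replace hand-engineered ones
y_deep = sigmoid(h @ W2 + b2)   # still a GLM on top of h

print(y_linear, y_deep)
```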


10. Summary

| Concept | Practical role in MangaAssist |
|---|---|
| Linear regression | Cost analysis, latency trends, experiment readouts |
| Multiple regression | Multi-factor business and product analysis |
| Logistic regression | Baseline classification, escalation modeling, CTR prediction |
| Multinomial logistic regression | Intent classification and token prediction |
| GLMs | Common framework connecting regression and classification |
| Regularization | Prevents overfitting in both simple and deep models |
| Gradient-based optimization | Trains every learned component in the stack |

Linear models are not "pre-deep-learning leftovers." They are still useful operational tools and the cleanest bridge between classical statistics and modern neural systems.