
Story 05 — Stopping a 4%-per-cycle accuracy bleed with a Fisher-information penalty

One-line: Monthly retraining was costing us 4 points of last-quarter accuracy every cycle. Team wanted to roll back to a frozen model. I implemented EWC (Elastic Weight Consolidation), tuned λ against the loss-landscape curvature, and unlocked sustainable continual learning.

Situation

Three months post-launch. The team had a working monthly retraining cadence: every fourth week, kick off a fine-tune on the previous month's labeled feedback (thumbs-down corrections, escalated cases, new catalog entries). The retraining job ran, the model shipped, the eval gate passed.

Then the previous months' eval suites started failing. Each retraining cycle was costing ~4% on the prior month's golden set — the textbook catastrophic forgetting signature of a model that's overfitting to recent data and degrading on what it used to know. Three cycles in, we were measurably worse on common queries than at launch, even though we were better on the recent corrections.

The team's proposal was to freeze the model and stop monthly retraining. From their POV: retraining is making things worse, so don't retrain. Defensible. Also wrong — it would cap quality at launch and let the model rot as the catalog and customer language drifted.

Task

Find a way to keep retraining without the 4% regression — or accept that we couldn't and freeze the model.

Action

1. Diagnosed by reading the gradients, not the metrics. I pulled the per-parameter delta between pre-retrain and post-retrain weights. The retraining was making large updates to parameters that controlled common-query behavior — exactly the parameters that should have been protected because they encoded the settled knowledge from earlier training. The new data was, statistically, a small fraction of total training signal, but its gradient was disproportionately large because there was no penalty for moving "well-known" parameters.
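
A minimal version of that diagnostic, assuming two PyTorch checkpoints saved as plain state_dicts (file names here are illustrative, not the actual pipeline paths):

```python
import torch

# Load pre- and post-retrain checkpoints (assumed to be plain state_dicts).
pre = torch.load("checkpoint_pre_retrain.pt", map_location="cpu")
post = torch.load("checkpoint_post_retrain.pt", map_location="cpu")

report = []
for name, w_pre in pre.items():
    w_pre = w_pre.float()
    w_post = post[name].float()
    # Relative movement per tensor, normalized by the parameter scale so
    # large and small layers are comparable.
    rel_move = ((w_post - w_pre).norm() / (w_pre.norm() + 1e-12)).item()
    report.append((name, rel_move))

# Largest relative movers first: these are the parameters the fine-tune is
# rewriting, regardless of whether they encode settled knowledge.
for name, rel_move in sorted(report, key=lambda r: -r[1])[:20]:
    print(f"{name:60s} {rel_move:.4f}")
```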

This is the canonical setup for Elastic Weight Consolidation.

2. Implemented EWC. EWC adds a regularization term to the loss: λ × Σᵢ Fᵢ × (θᵢ − θ*ᵢ)², where Fᵢ is the i-th diagonal entry of the Fisher information matrix at the previous task's optimum and θ*ᵢ is the corresponding parameter value at that optimum. In English: penalize moving parameters that mattered for the previous task, in proportion to how much they mattered.

This is the math people skip past in papers. It actually matters here:

  • The Fisher information Fᵢ = E[(∂log p / ∂θᵢ)²] is, intuitively, the curvature of the loss landscape in the direction of θᵢ at the previous optimum. High curvature = small movement causes large loss increase = important parameter.
  • λ trades off plasticity (learn new things) against stability (remember old things). Too low: keep forgetting. Too high: can't learn the new corrections at all.
  • F is diagonal in the cheap version — full-matrix EWC is intractable. Diagonal works because most of the regularization signal lives in the diagonal anyway.
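
The bullets above map directly onto code. A minimal diagonal-EWC sketch in PyTorch — `model`, `prev_loader`, and `loss_fn` are assumptions standing in for our actual fine-tuning stack, and this uses the empirical Fisher (squared gradients of the task loss against ground-truth labels) rather than the paper's variant that samples labels from the model's own predictions; only the penalty structure mirrors the formula above:

```python
import torch

def estimate_diag_fisher(model, prev_loader, loss_fn, n_batches=100):
    """Empirical diagonal Fisher at the previous optimum: mean of squared
    task-loss gradients over prior-task data."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()
              if p.requires_grad}
    model.eval()
    seen = 0
    for x, y in prev_loader:
        if seen >= n_batches:
            break
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
        seen += 1
    return {n: f / max(seen, 1) for n, f in fisher.items()}

def ewc_penalty(model, fisher, prev_params, lam=400.0):
    """lam * sum_i F_i * (theta_i - theta*_i)^2, summed over all parameters."""
    penalty = 0.0
    for n, p in model.named_parameters():
        if n in fisher:
            penalty = penalty + (fisher[n] * (p - prev_params[n]) ** 2).sum()
    return lam * penalty

# Before retraining: snapshot theta* and estimate F on last cycle's data.
#   prev_params = {n: p.detach().clone() for n, p in model.named_parameters()}
#   fisher = estimate_diag_fisher(model, prev_loader, loss_fn)
# During retraining, the total loss becomes:
#   loss = new_task_loss + ewc_penalty(model, fisher, prev_params, lam=400.0)
```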

3. Tuned λ against the validation curves. Not by grid search alone — by reading the loss-landscape curvature at the previous optimum. A diagnostic plot: for each candidate λ, look at the ratio of (prior-task validation loss change) to (new-task validation loss change). The sweet spot is where the prior-task loss change goes near zero before the new-task loss change starts to balloon. Landed on λ=400 for our distribution.
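
What that sweep looked like in practice, sketched with hypothetical helpers (`train_with_ewc`, `eval_loss`, and the loaders stand in for our retraining and eval code; the grid is illustrative):

```python
# Candidate lambdas on a rough log grid; 400 is where we landed.
lambdas = [0, 10, 50, 100, 200, 400, 800, 1600]

base_prior = eval_loss(model, prior_val_loader)  # frozen model, prior month's golden set
base_new = eval_loss(model, new_val_loader)      # frozen model, new corrections

for lam in lambdas:
    candidate = train_with_ewc(model, new_train_loader, fisher, prev_params, lam=lam)
    d_prior = eval_loss(candidate, prior_val_loader) - base_prior  # want ~0
    d_new = eval_loss(candidate, new_val_loader) - base_new        # want clearly negative
    print(f"lambda={lam:5d}  prior-task dloss={d_prior:+.4f}  new-task dloss={d_new:+.4f}")

# Pick the smallest lambda where the prior-task loss change flattens near zero
# before the new-task loss change stops improving.
```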

4. Built it into the retraining pipeline. Not as a one-time fix — as a permanent component of the monthly job. Documented in Fine-Tuning-Foundational-Models/, specifically the continual-learning deep dive.

The math/algorithmic depth that mattered

This is the second story (after the InfoNCE one) where the DS degree directly paid back its tuition:

  • Fisher information as loss-landscape curvature. F isn't an abstract symbol from a paper. It's the curvature term in the second-order Taylor expansion of the loss at the optimum — it tells you which directions in parameter space the loss is steep in. If you understand that, EWC's regularization term reads as "don't move along the steep directions of the previous task's loss landscape." Geometric intuition unlocks tuning intuition; see the worked expansion after this list.
  • Why diagonal-EWC is enough. The full Fisher matrix equals the expected negative Hessian of the log-likelihood (under the usual regularity conditions), and it captures parameter interactions. The diagonal approximation throws those out. For our case (a small fine-tune on a large pretrained model), the diagonal captures most of the signal because the dominant curvature directions are roughly axis-aligned per layer. Knowing this saved us from implementing full-matrix EWC, which would have been infeasible.
  • λ as the plasticity-stability dial. Setting λ is the actual continual-learning trade-off in one number. There's no "right" value; there's only the right value for your distribution and how much old vs. new data quality matters. This is a product decision dressed in math — and you have to do the math to make the product call.
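
The curvature claim in the first bullet, written out loosely (assuming the usual regularity conditions and that θ* is the previous task's optimum, so the first-order gradient term vanishes):

```latex
L_A(\theta) \;\approx\; L_A(\theta^{*}) \;+\; \tfrac{1}{2}\,(\theta - \theta^{*})^{\top} H \,(\theta - \theta^{*}),
\qquad H = \nabla^{2} L_A(\theta^{*}).

\text{For a negative log-likelihood loss, }\;
\mathbb{E}\big[\nabla^{2}(-\log p)\big] \;=\; \mathbb{E}\big[(\nabla \log p)(\nabla \log p)^{\top}\big] \;=\; F,

\text{so }\;
L_A(\theta) - L_A(\theta^{*}) \;\approx\; \tfrac{1}{2}\sum_i F_i\,(\theta_i - \theta_i^{*})^{2}
\quad \text{(diagonal approximation).}
```

That last expression is, up to the factor λ, exactly the penalty from step 2: moving along a high-curvature (high-Fᵢ) direction is what raises the prior-task loss, so that's what gets penalized.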

The leadership move

The team's instinct (freeze the model) was a reasonable engineering move under bad information. My job was to give them better information, not to override them. I ran the EWC implementation as a parallel two-week experiment against the proposed freeze-the-model timeline. If EWC didn't show a clear stability gain, we'd freeze. If it did, we'd ship it.

It did. λ=400 cut prior-task regression from 4% to <0.5% per cycle while preserving the gain on new corrections.

Then I wrote up the math with geometric intuition (why F is curvature, why diagonal is enough, why λ is plasticity-stability). New ML engineers joining the team a year later read that doc, understand why the regularization term exists, and can debug it themselves. That's the leadership multiplier from algorithmic depth: every difficult math thing you teach the team is one less escalation back to you.

Result

  • Prior-task accuracy regression per retraining cycle: 4% → <0.5%.
  • Sustained 6+ retraining cycles without accumulated drift.
  • Quality on each cycle's new signal continued to improve at +1.5–2% per cycle.
  • Continual-learning pipeline became a standard component, not an open question — see Fine-Tuning-Foundational-Models/.
  • "Freeze the model" never came up again.

What I'd want a future ML lead to take away

Catastrophic forgetting isn't a software bug. It's a property of the optimization landscape — you fix it by changing the loss function, not by adjusting the training schedule. To know that, you have to understand what's actually being optimized. Most ML teams that "ship and freeze" do so because nobody on the team understood the loss-landscape geometry well enough to keep iterating safely. That's a gap a DS-trained lead closes.