Story 03 — Three weeks of stalled embedding fine-tuning, fixed in four days
One-line: Recommendation embeddings plateaued at +5% Recall@3 for three weeks, and the team wanted to throw more data at the problem. I read the loss geometry, diagnosed a negative-sampling problem, switched to hard-negative mining, tuned the InfoNCE temperature τ, and got +14% in four days.
Situation
We needed manga-aware embeddings for the recommendation pipeline. Off-the-shelf Titan V2 embeddings worked for general semantic search but missed manga-specific signals — couldn't reliably distinguish seinen from shōnen, missed mangaka-style similarity, and confused franchise volumes with standalone works.
Three weeks of contrastive fine-tuning on a curated pairs dataset had gotten us +5% Recall@3 and stopped. The team's proposal was to scale up: more pairs, more epochs, larger batch size, possibly a bigger base model. Eng was about to spin up a multi-week data-collection effort to build a 10× larger pairs set.
Task
Diagnose why the fine-tune was plateauing, and unblock without burning a month on more data collection.
Action
1. Read the loss curves, not the dashboard. The training-set InfoNCE loss was still trending down. The validation Recall@K had flattened. That gap is diagnostic — the model was getting better at the training objective but not at the evaluation objective. That's not a data-volume problem. It's a what-the-loss-is-actually-optimizing problem.
2. Looked at what the negatives were doing. The contrastive setup used random in-batch negatives — the standard default. With a batch size of 256 and ~10K item types in the catalog, random negatives were almost always trivially far from the anchor. Two random manga were almost certainly in different genres, different art styles, different demographics. The model was learning "these two random things aren't the same" — a task that gives no useful gradient signal once it's solved, which it had been by week one.
The fine-tune had plateaued because the negatives weren't informative anymore.
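To make the diagnosis concrete, here is a minimal numpy sketch (toy dimensions and noise levels, not our actual setup) of how much softmax mass, and therefore gradient, InfoNCE assigns to negatives that are random versus negatives that sit near the anchor:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_neg, tau = 256, 255, 0.07        # batch of 256 -> 255 in-batch negatives

def unit(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def mass_on_negatives(anchor, positive, negatives, tau):
    """Softmax probability mass InfoNCE puts on the negatives; this is
    what scales the gradient that pushes negatives away from the anchor."""
    logits = np.concatenate(([anchor @ positive], negatives @ anchor)) / tau
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return 1.0 - p[0]

anchor = unit(rng.normal(size=dim))
positive = unit(anchor + 0.02 * rng.normal(size=dim))   # well-aligned pair

random_negs = unit(rng.normal(size=(n_neg, dim)))       # near-orthogonal to anchor
hard_negs = unit(anchor + 0.1 * rng.normal(size=(n_neg, dim)))  # near the anchor

print(mass_on_negatives(anchor, positive, random_negs, tau))  # negligible
print(mass_on_negatives(anchor, positive, hard_negs, tau))    # substantial
```

In high dimensions, random vectors are nearly orthogonal to the anchor, so the softmax puts essentially all its mass on the positive and the gradient on the negatives vanishes; negatives mined near the anchor restore it.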
3. Switched to hard-negative mining. Built a k-NN nearest-off-genre miner: for each anchor, find the 32 nearest items in the current embedding space that aren't true positives. These are items the model currently thinks are similar but shouldn't be — exactly the gradient signal we needed. Mixed in 25% random negatives so the model kept seeing easy cases and the coarse global structure of the space didn't degrade.
4. Tuned τ. The InfoNCE temperature τ controls how sharp the softmax over negatives is. Default 0.07 was too soft for our case — it was averaging gradient signal across the (now informative) negatives instead of focusing on the hardest ones. Dropped τ to 0.05. Sharper gradients, faster convergence on the genre-discrimination task.
5. Re-ran the fine-tune. Four days. Recall@3 went from +5% to +14% over the off-the-shelf baseline, documented in Model-Inference/02-data-scientist-collaboration.md.
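For illustration, the miner in step 3 can be sketched roughly like this (synthetic embeddings and genre labels as stand-ins for the real catalog; the 32-negative / 25%-random split matches the recipe above):

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, dim = 1000, 64                              # toy stand-ins

emb = rng.normal(size=(n_items, dim))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)    # "current" embedding space
genre = rng.integers(0, 8, size=n_items)             # stand-in positive labels

def mine_negatives(anchor_idx, emb, labels, k=32, frac_random=0.25):
    """k negatives per anchor: the hardest non-positives by cosine similarity,
    topped up with random ones so easy cases stay in the mix."""
    sims = emb @ emb[anchor_idx]                     # fresh array; safe to mask
    sims[labels == labels[anchor_idx]] = -np.inf     # mask anchor + true positives
    n_hard = round(k * (1 - frac_random))            # 24 hard, 8 random here
    hard = np.argsort(-sims)[:n_hard]                # most similar first
    pool = np.setdiff1d(np.flatnonzero(np.isfinite(sims)), hard)
    rand = rng.choice(pool, size=k - n_hard, replace=False)
    return np.concatenate([hard, rand])

negs = mine_negatives(0, emb, genre)
```

In the real job the mined indices feed the contrastive loss each epoch, so the negative set tracks the moving embedding space rather than staying fixed.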
The math/algorithmic depth that mattered
This is the story I'd point to if someone asked "what does the DS degree actually buy you?" because the diagnosis was invisible without it:
- Contrastive loss geometry. InfoNCE is essentially a softmax cross-entropy over (positive vs. negatives). Its gradient signal is dominated by the hardest negative — the one closest to the anchor in current embedding space. If your negatives are easy, the loss converges to a trivial solution and the gradient becomes near-zero on the part of the manifold you actually care about.
- Hard-negative mining as a curriculum. Mining negatives from the current embedding space turns training into a moving-target curriculum: every epoch, the negatives that were hard last epoch get easier, and new ones surface at the boundary. This is a known phenomenon in metric learning literature (semi-hard negative mining for triplet loss is the lineage), but it's not the default in most contrastive recipes — you have to know to ask for it.
- Temperature τ as gradient-sharpness control. τ is not a "tuning parameter" — it's the sharpness of the implicit softmax over negatives. Smaller τ → sharper attention on hardest negatives → faster gradient on the boundary. Too small and the loss becomes unstable; too large and the gradient diffuses across all negatives. 0.05 was right for our distribution; 0.07 was too soft.
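A toy calculation makes the sharpness effect visible. The similarity values below are made up for illustration, but the mechanism is the one described above: the softmax weight each negative receives in the gradient.

```python
import numpy as np

def negative_weights(sims, tau):
    """Softmax weight each negative carries in the InfoNCE gradient."""
    z = np.exp((sims - sims.max()) / tau)
    return z / z.sum()

# Hypothetical cosine similarities of five mined hard negatives to an anchor.
sims = np.array([0.82, 0.78, 0.74, 0.70, 0.66])

w_soft = negative_weights(sims, tau=0.07)
w_sharp = negative_weights(sims, tau=0.05)
print(f"tau=0.07: hardest negative carries {w_soft[0]:.0%} of the weight")
print(f"tau=0.05: hardest negative carries {w_sharp[0]:.0%} of the weight")
```

At these (made-up) similarities, dropping τ from 0.07 to 0.05 moves the hardest negative's share of the gradient from roughly 46% to 56% — the "sharper attention" in the bullet above, made numeric.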
The team had been asking "how much more data do we need?" The right question was "what is the gradient actually telling us?" — and that question only makes sense if you understand what the loss is computing under the hood.
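Written out (this is standard InfoNCE over one positive and a set of negatives, with anchor a, similarities s_k = a·x_k, and softmax probabilities p_k):

```latex
\mathcal{L} \;=\; -\log \frac{e^{s_p/\tau}}{e^{s_p/\tau} + \sum_{j} e^{s_j/\tau}},
\qquad
\frac{\partial \mathcal{L}}{\partial a}
\;=\; \frac{1}{\tau}\Big[(p_p - 1)\,x_p \;+\; \sum_{j} p_j\, x_j\Big],
\qquad
p_k \;=\; \frac{e^{s_k/\tau}}{e^{s_p/\tau} + \sum_{j'} e^{s_{j'}/\tau}}
```

Each negative's pull on the anchor scales with p_j, which decays exponentially in (s_p − s_j)/τ: the negative closest to the anchor dominates the sum, and shrinking τ concentrates the gradient on it further. That is the quantity "the gradient is telling us" about.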
The leadership move
I didn't override the team's "more data" plan. I asked them to run my hard-negative-mining experiment for one week as a parallel track while data collection planning continued. If it worked, we'd cancel the data effort. If it didn't, we hadn't lost time.
It worked. We saved ~3 weeks and a chunk of data-engineering bandwidth.
Then I made the lesson permanent. Wrote up the InfoNCE-and-temperature deep dive as one of the Fine-Tuning-Foundational-Models/ entries, with the geometric intuition (why hard negatives sharpen the manifold boundary) drawn out alongside the math. Added it to the onboarding scaffolding for new ML engineers.
This is the part of leadership that is invisible from outside the team: the next time an embedding fine-tune plateaus, someone else on the team can debug it instead of escalating to me. That's the multiplier.
Result
- +14% Recall@3 over off-the-shelf embeddings (vs. +5% the team had been stuck at) — see Model-Inference/02-data-scientist-collaboration.md.
- Four days of work instead of a multi-week data-collection effort.
- Recommendation CTR uplift downstream (which is the actual business surface — embedding Recall@3 is a proxy for discovery quality).
- Hard-negative mining and τ tuning became standard practice for every embedding job.
- The math write-up shipped to the team's onboarding library.
What I'd want a future ML lead to take away
When a fine-tune plateaus, look at the loss curves and ask what the gradient is actually doing. "More data" is the hardest, slowest, most expensive answer. Sometimes the bottleneck is in the loss formulation, not the data — and you can only see that if you understand the math. That's the entire argument for having someone with a real DS background in the lead seat.