
ML Scenario 02 — Sentiment Classifier Domain Shift

TL;DR

The sentiment classifier on top of the Review-Sentiment MCP scores incoming user reviews on a polarity scale used by the bot to summarize "what readers think." It was fine-tuned 14 months ago on a 50K-review labeled corpus assembled in Q1. Production reviews now contain a steady stream of new vocabulary the classifier was never trained on: anime-tie-in slang, manhwa/webtoon-specific terminology, ironic positive/negative phrasings popular on social media ("this kills me" / "no thoughts head empty" — both positive), and a growing share of code-switched English-Spanish-Portuguese reviews after locale launches. F1 on the held-out test set is unchanged at 0.91; F1 on a freshly-labeled 1K-review sample drawn from current production is 0.71. The model is confidently mis-classifying ~20% of reviews, and the downstream summaries say "mixed" or "negative" when the audience is enthusiastic. The shape of the fix: continuous fresh-label sampling, language- and domain-stratified eval, an active-learning loop targeting the cases where the model is most uncertain and the downstream impact is highest, and an explicit monthly drift-detection process before every retrain decision.

Context & Trigger

  • Axis of change: Time (the dominant axis — language and slang shift continuously) + Requirements (new locales and new content formats add new domains).
  • Subsystem affected: Sentiment / ABSA service powering RAG-MCP-Integration/04-review-sentiment-mcp.md. Knock-on effect: bot summaries cite "what readers say" using these labels.
  • Trigger event: A spot check by a content ops analyst — "the bot says reviewers are mixed on this title but I just read 30 reviews and they're glowing" — leads to a fresh manual labeling of 1K production reviews. F1 = 0.71 against the freshly-labeled set; the model thought it was 0.91. The held-out test set is two years old.

The Old Ground Truth

Original sentiment pipeline:

  • 50K labeled reviews (Q1 baseline). Vendor-labeled into {strongly_negative, negative, neutral, positive, strongly_positive}.
  • Train/val/test split at the time of labeling. Held-out test ≈ 5K reviews.
  • Fine-tuned XLM-RoBERTa (multilingual, but the training corpus was 80% English, ~15% Spanish, ~5% Japanese-romaji).
  • Eval metric: macro-F1 on the held-out test, gated at 0.88 for promotion.
  • Retrain cadence: annual. (The implicit assumption: language doesn't change that fast.)
  • Reasonable assumptions at the time:
      • Held-out F1 is a reasonable proxy for production F1.
      • Multilingual base + multilingual labels = adequate cross-language coverage.
      • Annual retrain is enough.

What this gets wrong: the held-out is a frozen sample of two-year-old language; new vocabulary, new platforms, new genres, and new locales all introduce distribution shift; ironic and code-switched reviews are growing as a share of traffic; and macro-F1 on an aging held-out is mathematically incapable of detecting that.

The New Reality

  1. Vocabulary moves fast. "Slay," "lit," "no thoughts head empty," anime-meme phrases, manhwa-specific terms ("isekai-coded," "boy-WLW-coded"). The classifier has never seen them.
  2. Irony is a structural blind spot. "This title made me cry. 10/10." → strongly positive. The classifier sees "made me cry" and tilts negative. Irony is concentrated in younger demographics and is increasing as a share of reviews.
  3. Code-switching is rising. Spanish and Portuguese reviews now mix English titles, Japanese romaji entity names, and English-internet slang. The classifier handles each language well in isolation but stumbles on code-switched mixes.
  4. New genre vocabularies enter. Webtoon vertical-scroll readers use terms (panel-coded, scroll-trapped) that aren't in the manga vocabulary the model trained on.
  5. The held-out is a stale frame. Two years ago, "lit" wasn't a positive sentiment marker in this audience the way it is now. The held-out doesn't even contain the words the model is failing on.
  6. F1 is a lying aggregate. F1 on the irony slice is < 0.4 even while overall macro-F1 reads 0.91 — ironic reviews are rare in the (old) test set and contribute little to the average.

The schema didn't change (still a polarity classifier). The meaning under the schema shifted as the population's language shifted. The held-out is a photograph; the language is a movie.

Why Naive Approaches Fail

  • "Annual retrain on more data." Doesn't address the test-set staleness. The model retrains on labels collected mostly during the same era as the old test set. Same blind spots, refreshed.
  • "Use a bigger pretrained model." A bigger XLM-RoBERTa or a switch to LLM-based classification might help on irony, but without a fresh labeled set you can't tell, and you can't tune.
  • "Add explicit neg/pos signals from users." Sparse, biased, slow.
  • "Just trust the LLM." Asking an LLM to classify is tempting but circular when the LLM is the application model — the same FM that makes downstream summaries can't be the labeler. And LLM-as-judge has its own drift (see GenAI 04).
  • "Lower the F1 threshold." Doesn't help — you're still aiming at the old target.
  • "Use distant supervision (star ratings as labels)." Ratings carry sentiment imperfectly (some 5-star reviews complain bitterly; some 1-star reviews are jokes). Useful as weak signal, dangerous as primary.

Detection — How You Notice the Shift

Online signals.

  • Bot-summary accuracy spot checks. Content ops samples 30 summaries/week and labels "agree" / "disagree." Disagreement rate over 15% triggers investigation.
  • Disagreement between sentiment and rating. A growing rate of high-rating reviews labeled negative (or vice versa) is a leading indicator. (A monitoring sketch follows this list.)
  • User pushback. "The bot says reviews are mixed, but I just read them — they're great." Slow signal but high-trust.
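
A minimal sketch of the rating-disagreement monitor, assuming reviews arrive as a pandas DataFrame; the column names star_rating and predicted_sentiment are hypothetical:

import pandas as pd

# Polarity buckets from the five-class label scheme.
NEGATIVE = {"strongly_negative", "negative"}
POSITIVE = {"strongly_positive", "positive"}

def rating_sentiment_disagreement(df: pd.DataFrame) -> float:
    # Fraction of reviews where star rating and predicted polarity
    # conflict. Track the trend; a sustained rise is a drift signal.
    high_rated_negative = (df["star_rating"] >= 4) & df["predicted_sentiment"].isin(NEGATIVE)
    low_rated_positive = (df["star_rating"] <= 2) & df["predicted_sentiment"].isin(POSITIVE)
    return float((high_rated_negative | low_rated_positive).mean())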

Offline signals.

  • Drift-detection metrics on input distribution. PSI/KL on token frequency, embedding centroid, n-gram distribution — between train-time and current production. (A PSI sketch follows this list.)
  • Per-language F1 on a freshly-labeled sample. The freshly-labeled sample is the new ground truth — old held-out is irrelevant.
  • Per-failure-class breakdown. Manually triage 50 disagreements; tag them as "irony," "slang," "code-switch," "new vocabulary," or "rare-class genuinely hard." Per-failure-class F1 makes hidden failure modes visible.
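
A minimal PSI sketch over token frequencies, under the assumption that both windows are available as flat token lists:

import math
from collections import Counter

def token_psi(train_tokens, prod_tokens, epsilon=1e-6):
    # Population Stability Index between two token-frequency
    # distributions. Rule of thumb: > 0.25 = significant drift.
    train_counts, prod_counts = Counter(train_tokens), Counter(prod_tokens)
    train_total, prod_total = len(train_tokens), len(prod_tokens)
    score = 0.0
    for token in set(train_counts) | set(prod_counts):
        p = train_counts[token] / train_total + epsilon
        q = prod_counts[token] / prod_total + epsilon
        score += (q - p) * math.log(q / p)
    return score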

Distribution signals.

  • Token-coverage rate. What fraction of tokens in current production are in the model's training vocabulary? Falling coverage = vocabulary drift. (A coverage sketch follows this list.)
  • Embedding-similarity histogram. For each production review, the cosine similarity to the nearest 5 training-set reviews. A flattening histogram (further from training) = the input distribution is moving.
  • Confidence-calibration drift. Are high-confidence predictions still correct at the rate the model reports? Calibration plots compared to baseline.
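
A sketch of the coverage metric, using whitespace tokenization for illustration (the real pipeline would use the model's own tokenizer):

def token_coverage(prod_reviews, train_vocab):
    # Fraction of production tokens the training vocabulary has seen.
    # Falling coverage over successive windows = vocabulary drift.
    tokens = [t for review in prod_reviews for t in review.lower().split()]
    if not tokens:
        return 1.0
    return sum(1 for t in tokens if t in train_vocab) / len(tokens)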

Architecture / Implementation Deep Dive

flowchart TB
    subgraph Sources["Data sources"]
        PROD["Production reviews stream"]
        VENDOR["Annotation vendor<br/>(weekly micro-batches)"]
        EXPERT["In-house experts<br/>(ambiguous + rare-class)"]
    end

    subgraph Active["Active-learning loop"]
        UNCERT["Uncertainty sampler<br/>(model entropy)"]
        IMPACT["Impact sampler<br/>(downstream weight:<br/>review on hot title)"]
        STRAT["Strata: language ·<br/>genre · review-length"]
    end

    subgraph Labels["Living labeled corpus"]
        FRESH["Fresh micro-batches<br/>(weekly, 500-1000 each)"]
        DRIFT["Drift-eval set<br/>(monthly, balanced)"]
        REGRESS["Regression set<br/>(growing, never deleted)"]
    end

    subgraph Train["Training + eval"]
        TR["XLM-RoBERTa fine-tune<br/>or distilled LLM"]
        F1L["Per-language F1"]
        F1F["Per-failure-class F1<br/>(irony · slang · code-switch)"]
        DRIFTM["Drift metrics<br/>vs train distribution"]
    end

    subgraph Serve["Serving"]
        MAIN["Main classifier"]
        FALLBACK["LLM fallback<br/>for low-confidence"]
        ABS["Abstain path<br/>('not confident enough')"]
    end

    PROD --> UNCERT
    PROD --> IMPACT
    UNCERT --> STRAT
    IMPACT --> STRAT
    STRAT --> VENDOR
    STRAT --> EXPERT
    VENDOR --> FRESH
    EXPERT --> FRESH
    FRESH --> DRIFT
    FRESH --> REGRESS
    DRIFT --> TR
    REGRESS --> TR
    TR --> F1L
    TR --> F1F
    TR --> DRIFTM
    F1L -->|gate| MAIN
    F1F -->|gate| MAIN
    DRIFTM -->|gate| MAIN
    MAIN --> FALLBACK
    MAIN --> ABS

    style FRESH fill:#fde68a,stroke:#92400e,color:#111
    style F1F fill:#fee2e2,stroke:#991b1b,color:#111
    style ABS fill:#dcfce7,stroke:#166534,color:#111
    style FALLBACK fill:#dbeafe,stroke:#1e40af,color:#111

1. Data layer — living corpus, not frozen training set

Three categories of labeled data:

  • Fresh micro-batches. Every week, an active-learning sampler selects 500–1000 reviews from production for labeling. Stratified by language, genre, review length, and (critically) model uncertainty + downstream impact. Vendor-labeled within 5 working days. These are added to the training corpus.
  • Drift-eval set. Every month, a balanced 1K-review set is freshly labeled. This is the new ground truth for evaluating the classifier. Old held-outs are kept for trend analysis but no longer gate promotion.
  • Regression set. Growing, never deleted. Includes hard cases discovered over time (irony, code-switch, rare classes). New models must not regress on this set; it's the floor of what we know how to handle.

Active-learning sampling (per week):

import numpy as np

def model_entropy(review):
    # Shannon entropy of the deployed classifier's softmax output;
    # high entropy = the model is torn between classes.
    probs = classifier.predict_proba(review.text)
    return -np.sum(probs * np.log(probs + 1e-12))

def select_fresh_batch(production_stream, target_size=750):
    # sample_top_by, random_sample, stratify_by, title_traffic_weight,
    # and classifier are pipeline helpers assumed to exist.
    sample = []
    # 50% from highest-uncertainty predictions (entropy on softmax)
    sample += sample_top_by(production_stream, key=model_entropy, n=target_size // 2)
    # 30% from highest-impact (reviews on titles with high traffic)
    sample += sample_top_by(production_stream, key=lambda r: title_traffic_weight(r.title_id),
                            n=target_size * 3 // 10)
    # 20% random for distribution coverage
    sample += random_sample(production_stream, n=target_size // 5)
    return stratify_by(sample, ["language", "genre", "review_length"])

The mix is the key: pure uncertainty sampling misses common-but-wrong cases; pure impact sampling misses tail patterns; pure random misses everything specific. The 50/30/20 mix is empirical and tuned over time.

2. Pipeline layer — drift detection before retrain

Every retrain decision starts with a drift-detection step:

def should_retrain():
    # compute_drift_metrics, last_30_days, current_train_set, and
    # production_threshold are pipeline helpers/constants assumed to exist.
    drift = compute_drift_metrics(prod_window=last_30_days, train_window=current_train_set)
    if drift.psi_input_distribution > 0.25:  # standard "significant drift" PSI cutoff
        return True, "input distribution drift"
    if drift.token_coverage < 0.92:  # >8% of production tokens unseen in training
        return True, "vocabulary coverage drop"
    if drift.fresh_set_f1 < production_threshold - 0.04:  # vs the gated promotion floor
        return True, "fresh-label F1 regression"
    return False, "stable"

Retraining happens monthly or whenever drift metrics trip thresholds — whichever is sooner. The adversarial-cadence lesson from GenAI scenario 07 applies here too: retraining has to be fast enough to keep up with language drift.

3. Serving layer — confidence + LLM fallback + abstain

Three-tier serving (a routing sketch follows the list):

  • Main classifier (XLM-RoBERTa fine-tuned). Handles the bulk of traffic at low latency / cost.
  • Confidence threshold. If softmax max-prob < 0.7 (calibrated), fall back to the LLM tier.
  • LLM fallback (e.g., Haiku with a tight prompt). Slower, more expensive, but better on rare patterns. Cached per-review so repeated lookups are cheap.
  • Abstain path. If both the main and the fallback report low confidence (very rare cases, e.g., super-short or highly-coded reviews), the system labels "uncertain" and excludes the review from sentiment summaries — better silent than wrong.
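
A minimal routing sketch under assumed interfaces — main_model.predict and llm_fallback.classify returning (label, calibrated confidence) are illustrative, not a real API:

ABSTAIN = "uncertain"

def classify_review(text, main_model, llm_fallback,
                    main_threshold=0.7, fallback_threshold=0.5):
    # Tier 1: the cheap fine-tuned classifier handles the bulk of traffic.
    label, confidence = main_model.predict(text)
    if confidence >= main_threshold:
        return label
    # Tier 2: low-confidence reviews fall back to the LLM (cached upstream).
    label, confidence = llm_fallback.classify(text)
    if confidence >= fallback_threshold:
        return label
    # Tier 3: both tiers unsure — abstain and exclude from summaries.
    return ABSTAIN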

The abstain path is the architectural humility that says "ground truth on this review is genuinely unknown; don't pretend."

4. Governance — model versioning + reviewer agreement audit

  • Annotator-agreement on fresh batches. Each fresh batch is independently labeled by 2 annotators on a 10% subset. Inter-annotator agreement (Cohen's kappa) is tracked; falling below 0.75 indicates either ambiguous labeling guidelines (fix the guidelines) or an inherently hard sub-domain (irony, sarcasm) that needs expert review. (An audit sketch follows this list.)
  • Model-version log. Each promoted model carries (training_corpus_sha, fresh_batch_versions_used, drift_metrics_at_promotion, per-class-F1, per-language-F1). Rollback is supported.
  • Annotation-vendor versioning. If the vendor changes their guidelines or annotators, label drift is real. Tag labels with (vendor_id, vendor_guideline_version, annotator_id_hash) so cross-vendor / cross-guideline shifts can be diagnosed.
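
A sketch of the agreement audit using scikit-learn's cohen_kappa_score; the double_labeled pair structure is assumed for illustration:

from sklearn.metrics import cohen_kappa_score

def audit_fresh_batch(double_labeled, floor=0.75):
    # double_labeled: [(annotator_a_label, annotator_b_label), ...]
    # from the 10% double-labeled subset of a fresh batch.
    a_labels, b_labels = zip(*double_labeled)
    kappa = cohen_kappa_score(a_labels, b_labels)
    # Below the floor: either the guidelines are ambiguous (fix them)
    # or the sub-domain is inherently hard (route to expert review).
    return kappa, kappa >= floor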

Trade-offs & Alternatives Considered

| Approach | Tracks drift | Cost | Latency | Verdict |
| --- | --- | --- | --- | --- |
| Annual retrain on growing corpus | No | Low | Fast | Original — falls behind |
| Quarterly retrain | Partial | Medium | Fast | Helps but doesn't address eval staleness |
| Active-learning + fresh labeled batches + drift gate + LLM fallback | Yes | Medium-High | Medium (fallback) | Chosen |
| Pure LLM-as-classifier | Yes (inherits LLM) | High (per-call) | Slow | Cost-prohibitive at production scale |
| Distill LLM into student | Yes (per re-distill) | High training cost | Fast serve | Reasonable evolution; does not replace fresh labels |
| Star-rating distant supervision only | No | Free labels | Fast | Noisy signal; useful as augmentation |
| Few-shot LLM with retrieval | Yes | Medium | Slow | Good for rare classes; too slow as the primary path |

The chosen architecture trades operational complexity (active learning, weekly vendor coordination) for sustained accuracy. The LLM fallback is the safety valve that catches what the main classifier doesn't yet know.

Production Pitfalls

  1. Vendor-labeled drift. Vendors update their internal guidelines or annotator pools without telling you. Sample-label disagreement is the leading indicator. Maintain a small in-house gold set for cross-checking vendor outputs monthly.
  2. Active-learning bias. If you only sample uncertain examples, the labeled set drifts toward boundary cases and the model overfits to ambiguity. The 50/30/20 mix (uncertainty / impact / random) is the protection.
  3. Per-class F1 hides under aggregate. Macro-F1 of 0.91 with one class at 0.40 looks fine on the headline. Always report per-class and gate per-class separately.
  4. Calibration drift is silent. A model can have stable F1 but wildly mis-calibrated confidence (over-confident on wrong predictions). Recalibrate on the fresh set periodically — Platt scaling or temperature scaling — to keep the LLM-fallback gate meaningful. (A temperature-scaling sketch follows this list.)
  5. Code-switched reviews have language-detection issues. A review is "mostly Spanish with English title and Japanese romaji entity names" — language ID may say es, en, or mixed. Stratify by detected language but include mixed as its own bucket.
  6. Hot-title bias in active-learning. Impact-weighted sampling concentrates labeling effort on popular titles; tail-title sentiment quality degrades. Reserve a fraction of labeling budget for tail titles even if their per-review impact is small.
  7. Annotator consistency matters more than annotator volume. Two consistent annotators are better than five inconsistent ones. Audit kappa monthly; rotate annotators that drift.
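
A temperature-scaling sketch, assuming access to the main classifier's pre-softmax logits on the fresh labeled set:

import numpy as np
from scipy.optimize import minimize_scalar

def fit_temperature(logits, labels):
    # logits: (N, C) float array of pre-softmax outputs on the fresh set;
    # labels: (N,) integer class ids. Returns the temperature T to
    # divide future logits by before softmax.
    def nll(t):
        z = logits / t
        z -= z.max(axis=1, keepdims=True)  # numerical stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()
    return minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded").x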

Interview Q&A Drill

Opening question

Your sentiment classifier reports 0.91 F1 on the held-out test set, unchanged since launch 14 months ago. Content ops manually labeled 1000 fresh production reviews — F1 against those is 0.71. What happened and what's your fix?

Model answer.

The held-out is a stale frame. The classifier is still good at the language it saw 14 months ago and bad at the language production users speak now. Sources of drift: new slang and irony common in younger cohorts, code-switching as locales launched, vocabulary specific to webtoon/manhwa formats, and platform-driven phrasings from social media. Held-out F1 doesn't move because the held-out doesn't contain those patterns.

The fix is a living eval surface and an active labeling loop, not a single retrain.

(1) Build a drift-eval set — every month, a freshly-labeled 1K balanced review set. This becomes the gate, not the old held-out. (2) Build a regression set that grows monotonically, never deleted, and includes hard cases (irony, code-switch, rare-class). New models must not regress on it. (3) Drive labeling with active learning: weekly batches of 500–1000 reviews, stratified by language/genre/length and selected for high uncertainty + high downstream impact + a 20% random tail for coverage. (4) Run drift detection before every retrain — input distribution PSI, token coverage, fresh-set F1 against the production threshold. Retrain on schedule or when drift trips. (5) Add a confidence-aware serving layer with an LLM fallback for low-confidence reviews and an abstain path for the genuinely-ambiguous.

The conceptual move: the labeled set is a living artifact tied to the production distribution, not a frozen snapshot. F1 on yesterday's snapshot tells you about yesterday.

Follow-up grill 1

Your "drift-eval set" is freshly-labeled monthly. How do you make sure the labelers don't drift quarter-over-quarter?

Annotator drift is real and a top cause of label-quality degradation. Three protections. (1) Inter-annotator agreement. Each fresh batch has 10% double-labeled; Cohen's kappa is computed and tracked. Below 0.75 = guideline ambiguity or annotator pool shift. (2) Anchor-set re-labeling. A small "anchor set" (~ 200 reviews with known correct labels) is re-labeled every quarter by current annotators; their accuracy on the anchor set is the trust signal. Drift on anchors → re-train annotators or change vendor. (3) Cross-vendor sample. When using multiple vendors or in-house experts, periodically have both label the same 100 reviews; cross-vendor disagreement reveals systematic guideline differences.

The labels are themselves a measurement, and like any measurement they need calibration. The architecture treats annotation as a process to be monitored, not just a service to be invoked.

Follow-up grill 2

You said active learning is 50% uncertainty + 30% impact + 20% random. Why those proportions?

Empirical, but the rationale: (1) Uncertainty sampling alone over-fits to boundary cases — the model learns ambiguity, not coverage. (2) Impact sampling alone concentrates on popular titles and starves tail-title sentiment. (3) Random sampling alone wastes budget on easy examples the model already gets right. The 50/30/20 mix balances learning the boundaries (uncertainty), keeping the high-traffic content correct (impact), and capturing distribution shift (random).

In practice the proportions are tunable per quarter — if you notice the model is good on common cases but missing rare classes, raise random / lower impact. Treat the proportions as hyperparameters, not constants.

Follow-up grill 3

Your LLM fallback is per-review-cached. What if the same review is cached as positive but the language has shifted by the time it's queried again?

Cache TTL is the answer, and it should match the expected drift cadence. For sentiment on review text — the text itself doesn't change after publication — the cached classification stays valid as long as neither the classifier nor the fallback model has been updated. Keying the cache on (review_text_hash, classifier_version, fallback_model_version) invalidates correctly when either model updates (a key-construction sketch follows).

A subtler case: the language itself keeps shifting under the cached label. Today, "lit" classifies as positive; six months later we agree it's overwhelmingly positive; the cached label is still positive, merely under-strength. Cache invalidation on every classifier release plus a TTL ceiling (e.g., 90 days) keeps things sane.
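
A minimal key-construction sketch (function and field names are illustrative):

import hashlib

def fallback_cache_key(review_text, classifier_version, fallback_model_version):
    # The key self-invalidates whenever either model is updated;
    # the TTL ceiling (e.g., 90 days) is enforced by the cache layer.
    text_hash = hashlib.sha256(review_text.encode("utf-8")).hexdigest()[:16]
    return f"{text_hash}:{classifier_version}:{fallback_model_version}"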

Follow-up grill 4

Drift detection trips, you retrain, F1 on the fresh set is now 0.84 — better than 0.71 but worse than the old 0.91. Do you ship?

Two questions to answer. (1) Is 0.84 better than the production model on the fresh set? Yes — production is at 0.71, so shipping the new model is a strict improvement on the metric that actually represents production. (2) Does the candidate hold the line on the regression set? The new model must not regress on the rolling regression set. If both answers are yes, ship.

The 0.91 number from launch is irrelevant. It was a measurement against a frozen, stale, biased set. Comparing the new model to that number is comparing apples to a year-old photo of an apple. The right comparator is "the production model on the new fresh set" (0.71) and "the candidate model on the new fresh set" (0.84). Improvement = ship.

The cultural barrier is that everyone's been quoting 0.91 for 14 months. Re-anchor the conversation: "the model has been running at 0.71 for some time; this fix moves it to 0.84." Honest framing.

Architect-level escalation 1

Six months out, the company adds a feature where users can write reviews in voice (transcribed by Whisper). The transcribed text is much messier than typed reviews. How does your sentiment pipeline handle this?

This is a new domain inside an existing schema. Three considerations.

(1) Transcribed-vs-typed is a new stratum. The fresh-label batches and drift-eval sets must include both. Sentiment patterns differ — voice reviews are more colloquial, longer, often more emotional. Stratify and gate per stratum.

(2) Transcription quality propagates. Whisper isn't perfect; transcription errors look like new "vocabulary." A 10% transcription error rate means the input distribution drifted by ~10% beyond what the language change introduced. Run drift detection on the transcribed-text distribution separately, with a higher tolerance for vocabulary churn but a tighter gate for systematic shift (e.g., a Whisper version upgrade).

(3) Pre-processing as a pipeline component. Add a "transcription confidence" signal — Whisper exposes per-word confidence — and use it as a feature for the sentiment model and as a routing signal: very-low-confidence transcriptions go to LLM fallback or abstain.

The architectural pattern: when a new modality enters, treat it as a new stratum in the data pipeline (not just "more data of the same kind"), with its own labels, its own drift gate, and its own per-stratum metric. The danger is folding voice reviews into the same training set and losing visibility into which stratum is failing.

Architect-level escalation 2

Cost pressure: the LLM fallback is now ~$15K/month. Cut it in half without losing quality.

Three knobs in order of safety.

(1) Tier the fallback. Today every low-confidence review hits the same LLM (Haiku, say). Many low-confidence reviews are mildly uncertain (confidence 0.6–0.7) and a smaller / cheaper LLM (or even a calibrated rule-based fallback) handles them. Reserve the more capable LLM for very-low-confidence (< 0.5) cases. Estimate: 40% reduction.

(2) Tighten the LLM-fallback prompt and use a smaller context. Many fallback prompts include too much context (full review history, surrounding reviews). For sentiment, the per-review text plus a small set of style examples is enough. Cut prompt size by 50% → input-token cost roughly halves.

(3) Cache + dedupe more aggressively. Reviews on the same title with very similar text are common (template-like reviews, copy-paste). Dedup on near-duplicate text (minhash) → cache hit rate rises.
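
A near-duplicate index sketch using the datasketch library (one option, not a requirement; reviews as (review_id, text) pairs is an assumed shape):

from datasketch import MinHash, MinHashLSH

def build_dedupe_index(reviews, threshold=0.9, num_perm=128):
    # Reviews whose MinHash signatures collide above the Jaccard
    # threshold can share one cache entry for the LLM fallback.
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    for review_id, text in reviews:
        m = MinHash(num_perm=num_perm)
        for token in text.lower().split():
            m.update(token.encode("utf-8"))
        lsh.insert(review_id, m)
    return lsh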

Where I would not cut: the main classifier's confidence threshold. Lowering it (so fewer reviews fall to fallback) increases the wrong-confident-classification rate, which has CSAT cost downstream. Cost on inference is recoverable; CSAT isn't.

The architectural commitment: cost reduction comes from tiering and caching at the fallback layer, not from raising thresholds on the gate. Keep the gate honest; make the path past it efficient.

Red-flag answers

  • "Annual retrain is fine; just on more data." (Doesn't address held-out staleness.)
  • "Use a bigger model." (Doesn't help without fresh labels.)
  • "Trust held-out F1." (Mirror of training distribution.)
  • "Use star ratings as labels." (Noisy, biased.)
  • "Just use an LLM for everything." (Cost-prohibitive at scale, judge-drift).

Strong-answer indicators

  • Recognizes held-out as a stale frame.
  • Names active-learning + fresh-label batches + regression set as the canonical pattern.
  • Has per-class and per-language gates, not just aggregate F1.
  • Treats annotator drift as a real measurement problem (kappa, anchor sets).
  • Knows confidence + abstain are first-class architectural pieces.
  • Has cost-cut levers that preserve quality (tier, tighten, cache) and knows what not to cut.