ML Scenario 03 — Search Ranking Invalidated by UI Redesign
TL;DR
The search ranker (a learning-to-rank model trained on click and dwell signals) was retrained quarterly against an A/B-validated baseline for two years. In Q3, the product team shipped a UI redesign: the search results moved from a dense list (10 visible per screen) to a sparse grid with cover thumbnails (4 visible per screen, scrolling required for more). Within two weeks, NDCG@10 on the held-out set still looked great, but production CTR on rank-3-onwards dropped 28%, and the bot's "search this catalog" experience CSAT slipped. The UI change shifted what users see and what counts as a "good rank" — but the labels (clicks, dwells) were still being collected as if the old UI applied. The training set quietly mixed pre-redesign and post-redesign labels, training a Frankenstein ranker. The fix shape is treating UI surface as a primary key on training labels, segregating data pre/post redesign, retraining on UI-aware labels, and adding a UI-version field as a CI-gate column on every promotion.
Context & Trigger
- Axis of change: Requirements (the product surface change is the trigger).
- Subsystem affected: Search ranker, downstream of the catalog-search MCP (for the chatbot search-tool path) and the main search page (which shares the ranker). Closely tied to RAG-MCP-Integration/01-catalog-search-mcp.md.
- Trigger event: Q3 UI redesign launches, dense list → sparse grid. Three weeks later, the team's quarterly ranker retrain ships and production CTR on rank-3+ tanks. Investigation reveals the new ranker was trained on a mix of pre-redesign and post-redesign clicks. Pre-redesign clicks at rank 3 were "user saw it on first screen, clicked or didn't"; post-redesign clicks at rank 3 are "user scrolled past two cards, then engaged" — fundamentally different signals.
The Old Ground Truth
Original ranking pipeline:
- Implicit-feedback labels: click (positive), dwell-time-weighted positive, no-click (negative).
- Position bias correction: standard inverse-position weighting (assumed dense list of 10).
- Training cadence: quarterly retrain.
- Eval: NDCG@10 on a held-out time-window slice.
- Promotion: A/B test for 2 weeks, gated on CTR uplift.
- Reasonable assumptions:
- The UI is stable; position bias correction is a fixed function.
- Clicks at position N have consistent semantics over time.
- The held-out reflects production behavior.
What this misses: the UI is itself a model parameter. When the UI changes, every label collected after the change has a different meaning. Position-bias correction calibrated on the old UI doesn't apply.
The New Reality
- Position bias is UI-dependent. In a dense list, position 3 is "saw it without scrolling." In a sparse grid, position 3 is "saw it after looking at 2 prominent cards." The implicit "saw it" weighting is wrong for either dataset if you mix them.
- Click semantics shifted. A click in the dense UI was often exploratory ("let me see what this is"). A click in the sparse-grid UI is more committed (the user scrolled to it, then clicked the larger thumbnail). Same signal, different meaning.
- Dwell times shifted. The grid's larger thumbnails cause longer pre-click hover; dwell-time as a positive signal needs recalibration.
- Held-out is corrupted. If the held-out spans the redesign, it has both pre- and post-redesign labels with no UI-version column. The model trains on a Frankenstein corpus.
- A/B comparisons are confounded. The new ranker ran in A/B against the old ranker, but both served users on the new UI. The A/B told you which ranker was better for the new UI, but the training data had pre-redesign biases baked in. Consistent A/B winning doesn't mean the ranker is well-fit for the new world.
- The "right" rank is now context-dependent. What ranks well on a list isn't the same as what ranks well on a grid (where cover-image attractiveness becomes a much stronger ranking signal).
Why Naive Approaches Fail
- "Wait six months and retrain." Trains on enough post-UI data to maybe wash out pre-UI bias, but burns six months of poor user experience.
- "Just A/B and trust the numbers." A/B confounds with mixed training data; uplift on the new UI doesn't mean the ranker is correctly fit.
- "Add UI version as a feature." Helps the model condition on UI but doesn't fix the bias in collected labels.
- "Drop pre-redesign data." Tempting; loses long-tail signals. Some queries are rare and only have pre-redesign data.
- "Use the LLM as a relevance judge." Possibly — but the question is "what would the user click in this UI?" not "is this relevant?" — the LLM doesn't see the UI.
- "Hold the old ranker; don't retrain." Old ranker was calibrated for old UI position bias; on new UI it scores items wrong for the new layout. Active mismatch, not safety.
Detection — How You Notice the Shift
Online signals.
- CTR by rank position, pre and post redesign. A CTR-vs-position curve that flattens or takes on a new shape is the signature (sketch after this list).
- Scroll depth per query. In the grid UI, users scroll more or less depending on query type; if scroll depth is sharply different from prior assumptions, position bias must be re-derived.
- Time-to-first-click. In the new UI, this often increases because cover images are larger and users hover more.
- Bounce rate from result page. Up = users aren't finding what they want at the top; down = ranking is fitting the new UI better.
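A minimal sketch of the CTR-vs-position check above, assuming the per-impression log (schema shown in the deep dive below) is loaded as a pandas DataFrame; only the logged fields are used, nothing else is assumed:

```python
import pandas as pd

def ctr_by_position(events: pd.DataFrame) -> pd.DataFrame:
    """CTR per rank position, one column per UI version, for side-by-side comparison."""
    return (
        events.groupby(["ui_version", "rank_position"])["clicked"]
        .mean()
        .unstack("ui_version")   # rows: rank_position, columns: ui_version
    )

# A post-redesign column that flattens (or bulges at low ranks) relative to the
# pre-redesign column is the signature described above.
```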
Offline signals.
- Pre/post UI segregation in training data. Plot label distributions side-by-side. Differences in click-position distribution, average dwell time, and click-through rate confirm the schema change.
- Position-bias re-derivation. Run an off-policy estimator (e.g., randomized intervention or PIE — position-based intervention experiment) on the new UI to derive the new position-bias function. Compare to the old function — significant differences confirm the recalibration is needed.
- Per-UI NDCG. Stratify the held-out by UI version; ranker should not be expected to score well on pre-UI data with post-UI corrections (or vice versa).
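A sketch of the per-UI stratified eval, assuming graded relevance judgments exist per query and using sklearn's ndcg_score for illustration; the eval-row structure is hypothetical:

```python
import numpy as np
from sklearn.metrics import ndcg_score

def ndcg_by_ui(eval_rows, k=10):
    """eval_rows: dicts with ui_version plus per-query relevance and ranker-score arrays."""
    per_ui = {}
    for row in eval_rows:
        score = ndcg_score(np.asarray([row["relevance"]]),  # graded relevance labels
                           np.asarray([row["scores"]]),      # ranker scores
                           k=k)
        per_ui.setdefault(row["ui_version"], []).append(score)
    return {ui: float(np.mean(vals)) for ui, vals in per_ui.items()}
```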
Distribution signals.
- Click-distribution histogram. Pre-UI: clicks concentrated in positions 1-5. Post-UI: clicks spread further but smaller absolute counts in 1-5. Different shapes are different ground-truth distributions.
- Cover-image-feature importance. The feature importance of "cover-image quality" or "color salience" should rise after a grid redesign. If feature importance is unchanged, the ranker isn't using the visual signal that the new UI rewards.
Architecture / Implementation Deep Dive
flowchart TB
subgraph UI["UI surfaces (versioned)"]
UI_V1["Dense list v1<br/>(retired)"]
UI_V2["Sparse grid v2<br/>(current)"]
end
subgraph Logs["Logged events"]
EV["Per-impression log<br/>(query, item, rank, ui_version,<br/>click_t, dwell_t, scroll_depth)"]
end
subgraph Bias["UI-aware position-bias model"]
PIE["Randomized intervention<br/>(small permuted slice)"]
BIAS_V1["Bias fn for v1"]
BIAS_V2["Bias fn for v2"]
end
subgraph GT["UI-keyed training labels"]
L_V1["Labels under v1<br/>(historical, unchanged)"]
L_V2["Labels under v2<br/>(growing)"]
end
subgraph Train["Training"]
FUSED["Fused training set<br/>(weighted by UI · age · bias correction)"]
EVAL_UI["Per-UI NDCG eval<br/>(strictly current-UI gates promotion)"]
end
subgraph Serve["Serving"]
FEAT["Features include ui_version"]
RANK["Ranker"]
AB["A/B against UI-matched baseline"]
end
UI_V1 --> EV
UI_V2 --> EV
EV --> PIE
EV --> L_V1
EV --> L_V2
PIE --> BIAS_V1
PIE --> BIAS_V2
BIAS_V1 --> L_V1
BIAS_V2 --> L_V2
L_V1 --> FUSED
L_V2 --> FUSED
FUSED --> RANK --> EVAL_UI
FEAT --> RANK
RANK --> AB
style EV fill:#fde68a,stroke:#92400e,color:#111
style L_V2 fill:#dbeafe,stroke:#1e40af,color:#111
style EVAL_UI fill:#fee2e2,stroke:#991b1b,color:#111
style AB fill:#dcfce7,stroke:#166534,color:#111
1. Data layer — ui_version as a primary key on labels
Every logged impression carries ui_version:
{
"query_id": "...",
"item_id": "...",
"rank_position": 3,
"ui_version": "grid_v2",
"viewport_height": 800,
"scroll_depth": 0,
"clicked": true,
"dwell_seconds": 4.2,
"shown_at": "..."
}
The training pipeline filters or weights by ui_version. The default for the production ranker is "current UI version only" with a small (e.g., 10%) tail of recent-historical-UI examples, downweighted by an age-and-UI factor.
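A minimal sketch of that selection policy, assuming the impression log is a pandas DataFrame with the fields above; the 10% tail cap and the 0.3 downweight are illustrative defaults, not tuned values:

```python
import pandas as pd

def build_training_set(events: pd.DataFrame, current_ui="grid_v2",
                       hist_fraction=0.10, hist_weight=0.3) -> pd.DataFrame:
    """Current-UI labels at full weight plus a small, downweighted historical-UI tail."""
    current = events[events["ui_version"] == current_ui].copy()
    current["sample_weight"] = 1.0

    historical = events[events["ui_version"] != current_ui]
    n_hist = int(hist_fraction * len(current))                        # cap the historical tail
    historical = historical.sort_values("shown_at", ascending=False).head(n_hist).copy()
    historical["sample_weight"] = hist_weight                         # age-and-UI factor, flattened to one number here

    return pd.concat([current, historical], ignore_index=True)
```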
2. Pipeline layer — UI-aware position bias
Position-bias correction is per-UI. Don't reuse the old function.
To derive a new position-bias function: run a small randomization slice — for ~1% of queries, randomly permute the top-K positions of the result list and observe the click-through rate per position. The CTR-by-position on the randomized slice is the position-bias function for that UI. (This is the standard PIE technique.)
def estimate_position_bias(events, ui_version, n_observations_min=10_000):
    # Use only impressions from the randomized slice for this UI version.
    rand_slice = events[(events["ui_version"] == ui_version) & events["randomized"]]
    if len(rand_slice) < n_observations_min:
        raise ValueError(f"only {len(rand_slice)} randomized impressions for {ui_version}")
    bias = rand_slice.groupby("rank_position")["clicked"].mean()
    return bias / bias.iloc[0]  # normalize so position 1 = 1.0
Position bias estimation needs ongoing renewal — even within a UI, behavior shifts over time (mobile vs desktop split, new device sizes). Re-estimate quarterly per UI.
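One way to consume the per-UI bias function at training time is inverse-propensity weighting — a sketch, assuming estimate_position_bias above returns a Series per UI version, sample_weight comes from the data-layer selection sketch, and the trainer accepts per-row weights (the clipping floor is an illustrative choice):

```python
import numpy as np

def add_ipw_weights(train_df, bias_by_ui, min_propensity=0.05):
    """Clicked rows get weight 1 / P(examined | rank, ui_version); negatives keep weight 1."""
    def propensity(row):
        return float(bias_by_ui[row["ui_version"]].loc[row["rank_position"]])

    train_df = train_df.copy()
    prop = train_df.apply(propensity, axis=1).clip(lower=min_propensity)  # clip to bound variance
    train_df["ipw"] = np.where(train_df["clicked"], 1.0 / prop, 1.0)
    train_df["final_weight"] = train_df["sample_weight"] * train_df["ipw"]
    return train_df
```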
3. Serving layer — UI-version feature
The ranker receives ui_version as a categorical feature. Models conditioned on UI version can correctly weight rank-vs-content tradeoffs per surface. At training time, the loss is computed with per-UI position-bias corrections; at serve time, the served UI version is passed through.
For a ranker that doesn't natively support per-UI conditioning (e.g., a shared LightGBM model across all UI variants), the alternative is per-UI sub-models: train a separate ranker per UI version, route at serve time. More expensive but cleaner.
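A sketch of the two serving options side by side — route to a per-UI sub-model when one exists, otherwise fall back to a shared model conditioned on ui_version. The predict-over-a-feature-dict interface is an assumption for illustration:

```python
class UIRoutedRanker:
    """Per-UI sub-model routing, with a shared UI-conditioned model as the fallback."""
    def __init__(self, shared_model, per_ui_models=None):
        self.shared_model = shared_model
        self.per_ui_models = per_ui_models or {}   # e.g. {"grid_v2": grid_ranker, "list_v1": list_ranker}

    def score(self, features: dict, ui_version: str):
        sub_model = self.per_ui_models.get(ui_version)
        if sub_model is not None:
            return sub_model.predict(features)                        # cleaner, but 2x pipelines
        return self.shared_model.predict({**features, "ui_version": ui_version})  # single conditioned model
```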
4. Governance — UI changes as ML events
The product/UI team and the ML team must coordinate before any UI change ships:
- Pre-redesign: lock the training corpus's UI version. New UI = new corpus.
- During-redesign canary: the redesign is canaried (10% of users see the new UI). The ML team logs ui_version for all impressions and observes label-distribution shifts in real time.
- Post-redesign: the ML team holds back the full retrain until enough post-UI data has accumulated (say, 50K impressions per intent class) and a fresh position-bias function is estimated.
- Promotion gate: the new ranker is promoted only with a per-UI NDCG check on the new UI's data. Old-UI metrics are advisory only (they reflect a historical state).
A UI release shouldn't be a surprise to the ranker. Treating UI changes as ML events is the discipline the original system lacked.
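A sketch of the promotion gate above as a CI check, assuming the per-UI NDCG numbers come from the stratified eval described in the detection section; the threshold handling is illustrative:

```python
def promotion_gate(candidate_ndcg_by_ui, baseline_ndcg_by_ui,
                   current_ui="grid_v2", min_uplift=0.0):
    """Gate only on the current UI's slice; retired-UI metrics are printed but advisory."""
    cand, base = candidate_ndcg_by_ui[current_ui], baseline_ndcg_by_ui[current_ui]
    if cand < base + min_uplift:
        raise SystemExit(f"blocked: NDCG on {current_ui} {cand:.3f} < baseline {base:.3f}")
    for ui, score in candidate_ndcg_by_ui.items():
        if ui != current_ui:
            print(f"advisory: NDCG on {ui} = {score:.3f}")
    return True
```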
Trade-offs & Alternatives Considered
| Approach | Handles UI change | Speed of recovery | Complexity | Verdict |
|---|---|---|---|---|
| Mix all labels, retrain with old position bias | No | Doesn't recover | Low | Original — broken |
| Drop pre-redesign labels, train on post-only | Partial | Fast (data starvation risk) | Low | Loses signal |
| Add ui_version feature, mix labels | Partial | Slow (data still mixed) | Medium | Helps but doesn't fix bias |
| UI-keyed labels + per-UI position bias + per-UI eval gate | Yes | Moderate (waits for new data) | Medium-High | Chosen |
| Per-UI sub-rankers (separate models) | Yes | Moderate | High (2× pipeline) | Cleaner; consider for major UIs |
| Wait six months + standard retrain | Yes (slowly) | Very slow | Lowest | Burns user experience |
The chosen pattern keeps a single ranker (operational simplicity) but is honest about the data partitioning. Per-UI sub-models are the "next step" if the system has many UI surfaces (mobile vs desktop vs chat-tool with different rendering).
Production Pitfalls
- A/B confounded by UI. If you A/B-test the new ranker before UI position-bias is recalibrated, you compare two rankers both fitting the new UI poorly. The "winner" is the less-bad of two miscalibrated models. Recalibrate position bias first, then A/B.
- Mobile vs desktop is its own UI dimension. "Same UI v2" but different viewport/density on mobile leads to different position bias. Add device_class to ui_version if positions render differently.
- Old-UI history dies slowly. Some sessions still serve the old UI for cache/compatibility reasons after rollout. Don't accidentally treat their events as new-UI data — log faithfully.
- Cover-image features are now first-class. A grid UI rewards visually-attractive thumbnails. If the ranker doesn't have cover-image features (color, salience, OCR-extracted title, blur level), retrain to include them. Otherwise the ranker is missing the signal the new UI uses to drive engagement.
- Randomization-slice resistance. Product teams hate "we'll randomize 1% of results" because it is perceived as "showing bad results." Frame it as the cost of having an unbiased position-bias function — without it, the ranker is miscalibrated on every UI change. Show the cost of not having it (the 28% drop in this scenario).
- Cold-start UI. A brand-new UI starts with no labels. Bootstrap by serving a fallback ranker (or the old ranker with conservative re-ranking, or a popularity baseline) until enough new-UI labels accumulate.
Interview Q&A Drill
Opening question
Your search ranker has a stable held-out NDCG@10 of 0.87 across quarterly retrains. The product team ships a UI redesign — list to grid. Three weeks after the redesign, the next quarterly retrain ships and CTR on rank-3+ drops 28%. What's wrong and what's the fix?
Model answer.
The UI is a primary key on training labels and the system didn't know it. Pre-redesign and post-redesign clicks have different semantics — a click at position 3 in a dense list is "saw without scrolling," in a sparse grid it's "scrolled past 2, then clicked on a larger thumbnail." Mixing them in training is mixing two distributions; the position-bias correction calibrated for the old UI doesn't apply to the new; held-out NDCG looks fine because the held-out also mixes them.
The fix has four pieces. (1) Add ui_version as a primary key on every logged impression. Every label is now (query, item, ui_version) → engagement, not just (query, item). (2) Re-derive position-bias per UI using a small randomization slice (~1% of queries with permuted top-K positions). The CTR-by-rank on the randomized slice is the new UI's position-bias function. (3) Filter the training set to current-UI labels with a small downweighted historical tail. (4) Per-UI NDCG eval as the gate; old-UI metrics are advisory.
The conceptual move: UI changes are ML events. The training pipeline must know the UI version of every label, and the position-bias correction must be derived per UI from real randomized exposure.
Follow-up grill 1
Walk me through the randomization slice. How big, on which queries, and what's the operational cost?
A 1% randomization slice on a representative cohort of queries is typically enough. Specifics. (1) Cohort selection: tail and head queries together — randomize within head queries to get statistical power on common intents, randomize on tail to validate that bias is consistent across the long tail. (2) Permutation strategy: don't fully shuffle. Take the ranker's top-K and permute within the top-K. This bounds the user-experience cost — items the ranker thought were top-K still appear, just in a different order. (3) Volume: 1% × ~ millions of impressions/day = adequate samples per (rank-position × cohort) within a week.
The operational cost: a small CTR drop on the randomized slice (perhaps 1–2% absolute on those queries — items you'd put at #1 sometimes appear at #5). Aggregated over 1% of traffic, the macro CTR cost is negligible. Frame it like the random-policy slice in scenario ML-01: it's the price of having an unbiased measurement system. Without it, every UI / ranker change is a leap of faith.
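A sketch of that permutation strategy: deterministically hash ~1% of queries into the randomized slice and shuffle only within the ranker's top-K, so nothing outside the top-K gets promoted. The hashing scheme and names are illustrative:

```python
import hashlib
import random

def maybe_randomize(query_id: str, ranked_items: list, top_k=10, slice_fraction=0.01):
    """Assign ~slice_fraction of queries to the randomized slice; permute within top-K only."""
    bucket = int(hashlib.sha256(query_id.encode()).hexdigest(), 16) % 10_000
    if bucket >= slice_fraction * 10_000:
        return ranked_items, False                 # normal serving
    head, tail = list(ranked_items[:top_k]), list(ranked_items[top_k:])
    random.shuffle(head)                           # bounded UX cost: same items, new order
    return head + tail, True                       # log randomized=True on these impressions
```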
Follow-up grill 2
Adding ui_version as a feature lets the ranker condition on UI. Why is that not enough on its own?
Because the labels the ranker trains on are biased by UI. A feature lets the model condition on UI; a bias correction tells the loss function how much to weight each rank position. A ranker with ui_version as a feature but training-time labels that mix UIs and use the old position-bias correction will still learn distorted preferences — it just learns which UI it's distorted for. The fix has to address both: feature for conditioning + per-UI bias correction at training-time loss.
The deeper point: features adjust what the model sees at scoring time; bias corrections adjust what counts as ground truth at training time. Both pieces are needed for a clean fix.
Follow-up grill 3
You said you'd hold the post-redesign retrain until enough data accumulates. How much is enough, and what serves traffic in the meantime?
"Enough" is per-cohort statistical adequacy — for the active-learning slice and the position-bias estimation, you want ~10K impressions per (intent class × rank position × cohort) bucket for stable estimates. At common-intent-volume, that's days or a week or two; at tail-intent-volume, that may be a quarter.
In the meantime, three options. (1) Run the old ranker but apply a per-UI re-rank correction at serve time — even before retraining, the new position-bias estimates can shift the served order. (2) Use a popularity / heuristic ranker as a UI-agnostic baseline; it doesn't depend on labels at all. (3) Per-UI sub-rankers, with the old UI sub-ranker for sessions still on old UI (during the rollout transition).
The pragmatic answer: option (1) is fastest because it leverages the existing ranker but corrects for the new UI's bias. Option (2) is the safety net for queries the ranker doesn't have signal for. Option (3) is structurally cleaner if the system has long-lived multi-UI surfaces (mobile vs desktop), but adds operational complexity.
Follow-up grill 4
In the new grid UI, cover-image quality matters more. Your ranker doesn't have cover-image features. Adding them means a model architecture change. How do you sequence?
Two-step rollout. (1) First, ship the UI-aware bias correction and per-UI eval with the existing ranker architecture. This recovers most of the regression because position-bias is the dominant fix. The new ranker is better-fit to the new UI but still missing some visual signal. (2) Second, in parallel, build the cover-image feature pipeline. This is non-trivial — extract image features (CNN embedding, color salience, blurriness, OCR text on cover), join to items, retrain the ranker with the new feature set. Test the new feature set's lift via a separate A/B against the bias-corrected ranker.
Sequencing matters: trying to ship both at once means you can't tell whether the bias correction or the new features drove improvement. One change at a time keeps the eval interpretable.
The architectural commitment: the model architecture is a separable concern from the data correctness. Fix data first (labels + bias correction); add features after.
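A sketch of step (2)'s extraction, assuming torchvision and OpenCV are available; the ResNet-18 backbone and the Laplacian-variance blur heuristic stand in for whatever the team actually picks:

```python
import cv2
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models import resnet18

backbone = resnet18(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()                  # drop the classifier head -> 512-d cover embedding
backbone.eval()
preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def cover_image_features(path: str) -> dict:
    """Per-item visual features to join onto the item table before retraining."""
    img = Image.open(path).convert("RGB")
    with torch.no_grad():
        embedding = backbone(preprocess(img).unsqueeze(0)).squeeze(0).tolist()
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    blur = float(cv2.Laplacian(gray, cv2.CV_64F).var())   # low variance ≈ blurry cover
    return {"cover_embedding": embedding, "cover_blur": blur}
```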
Architect-level escalation 1
Three months later, the product team plans an A/B test of two grid variants — one with denser cards (5/screen) and one with even sparser (3/screen). How does your eval architecture handle a UI A/B?
This is "UI redesign as ML event" applied at sub-version granularity. The architecture should handle it.
(1) Each grid variant gets its own ui_version_sub tag. Training labels are stratified per sub-variant.
(2) Per-sub-variant position-bias estimation. The randomization slice runs in each variant's traffic; per-variant bias function is derived. Variants with different card densities will have different bias functions even if they're "the same UI family."
(3) The ranker conditions on the sub-variant. If volume per sub-variant is high enough, train a sub-ranker per variant; if not, condition on ui_version_sub as a feature with shared backbone.
(4) The A/B test compares full-stack experiences: ranker tuned for variant X vs ranker tuned for variant Y. The product comparison is "which UI + ranker combination wins," which is the right question. Comparing rankers across UIs without acknowledging the UI dependency is misleading.
(5) Roll-out plan honors data dependencies. If the test runs for 4 weeks, the rankers can't fully retrain to each variant's distribution within the test (data is too thin for a fresh model). The right comparison is the existing ranker on each variant with the variant's bias correction applied, not fresh-retrained per-variant rankers. Document this explicitly so the decision-makers understand the comparison.
The architectural payoff: the system that originally collapsed under one UI change now handles multi-variant UI A/Bs as routine work. The cost was building UI-as-key into the data layer; the benefit is repeatable.
Architect-level escalation 2
Six months later, the company adds voice-based search ("hey bot, find me a long-form mystery"). The voice UI has no rank positions — the bot speaks one or two recommendations. The training data has tens of thousands of voice queries with nothing meaningful in rank_position. How does your ranking architecture extend?
Voice is a different paradigm, not just a different UI. The implicit assumption "ranking = ordered list users see and click in" doesn't apply.
Three architectural moves.
(1) Treat voice as a separate decision problem. Don't try to fit voice queries into the rank-position-based ranker. The right model for voice is "given query, pick one (maybe two) recommendation" — closer to a top-1 recommender than a ranker. Training labels are different: "user accepted the suggestion" / "user asked for another" / "user gave up."
(2) Shared candidate generation, separate decision layer. The candidate-generation stage (semantic + lexical retrieval, RAG) is shared between voice and grid. The post-candidate decision layer is specialized: a ranker for grid, a top-1 picker for voice. They share features and pipelines but have different objectives.
(3) Cross-pollination via embedding alignment. Voice queries and text queries map to the same embedding space; relevance signals from one transfer to the other. Use them as multi-task training (same backbone, two heads — ranker + voice-picker).
The deeper commitment: ground truth in the multi-modal world is per-modality. A voice "I'll take that one" is a different label from a text-search click. Both are valid; neither substitutes for the other. The architecture treats them as separate label streams feeding separate decision layers, sharing infrastructure where useful but never collapsing them into a single uniform "click" event.
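A minimal sketch of move (3): a shared backbone with a ranking head and a voice top-1 head. Dimensions and module layout are illustrative, not a production architecture:

```python
import torch
import torch.nn as nn

class SharedBackboneRankerPicker(nn.Module):
    """Shared query-item encoder feeding two task heads: list ranking and voice top-1 selection."""
    def __init__(self, feat_dim=256, hidden=128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, hidden), nn.ReLU())
        self.rank_head = nn.Linear(hidden, 1)     # ranking score for grid/list surfaces
        self.voice_head = nn.Linear(hidden, 1)    # acceptance logit for the voice "pick one" decision

    def forward(self, features: torch.Tensor, task: str = "rank") -> torch.Tensor:
        h = self.backbone(features)               # features: (batch, feat_dim) query-item pairs
        head = self.rank_head if task == "rank" else self.voice_head
        return head(h).squeeze(-1)
```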
Red-flag answers
- "Just retrain on more data; it'll wash out." (Misses the bias correction.)
- "Add ui_version as a feature." (Helps, doesn't fix the loss.)
- "A/B test the new ranker." (Confounded.)
- "Drop pre-redesign data." (Loses long-tail signal.)
- "Use the LLM as relevance judge." (LLM doesn't see UI.)
Strong-answer indicators
- Names UI as a primary key on labels and ground truth.
- Knows position-bias must be re-derived per UI via randomization slice.
- Has a sequenced fix (correct bias first, add features second).
- Treats UI changes as ML events with cross-team coordination.
- Recognizes voice search is a paradigm shift, not a UI variant.