
ML Scenario 07 — Embedding Model and Category Expansion

TL;DR

The shared embedding service produces vector representations of every catalog item, used by retrieval (kNN), the recommender (item-similarity for cold-start), the search ranker, and the cross-title-link MCP. The current embedding model was fine-tuned 14 months ago against ~ 280K manga items using a triplet-loss objective with ground truth derived from co-purchase + editorial-curated similar pairs.

Two new content types entered the catalog this year: manhwa (Korean vertical-scroll, ~ 60K titles) and manhua (Chinese, ~ 20K titles). They share the schema but inhabit a different semantic space — different art traditions, different reading direction, different genre distribution. The triplet-loss "ground truth" of similar/dissimilar pairs is dominated by manga-on-manga pairs; manhwa-on-manga similarity is mostly mislabeled by inheritance ("looks like a manga, is therefore similar to manga of the same genre"); cold-start similarity for new manhwa items is consistently wrong; and downstream features that depend on similarity (recommendations on the homepage's "if you liked this" rail) underperform on the new content types.

The fix shape is multi-domain ground-truth construction (manga ↔ manga, manhwa ↔ manhwa, cross-domain pairs labeled separately), domain-aware sampling for triplet loss, per-domain evaluation, and a co-trained model with explicit domain-conditioning rather than retraining a single uniform space.

Context & Trigger

  • Axis of change: Scale (corpus grew 25% with the addition of new content types) + Requirements (the new types have semantically different similarity rules — "isekai" in manga and "isekai" in manhwa are similar but not identical concepts).
  • Subsystem affected: Shared embedding service per RAG-MCP-Integration/09-rag-retrieval-pipeline-deep-dive.md. Knock-on effects in RAG-MCP-Integration/01-catalog-search-mcp.md, 02-user-preferences-recommendation-mcp.md, 07-cross-title-link-mcp.md.
  • Trigger event: Manhwa launch reveals that "if you liked this manhwa, you'll like..." recommendations include random manga that share genre tags but are stylistically incongruous. CTR on cross-title-link recommendations for manhwa drops to 0.6× of equivalent-position manga recommendations. Investigation reveals the embedding model places manhwa items near genre-matching manga and not near other manhwa.

The Old Ground Truth

Original setup:

  • Triplet-loss objective: anchor / positive (similar) / negative (dissimilar) item triples.
  • Positive pairs derived from:
      • Co-purchase (users who bought A also bought B at high rates).
      • Editorial-curated "similar series" pairs (~ 50K hand-curated pairs).
      • Sequel relationships and same-author bundles.
  • Negative pairs: random sampling within and across genres.
  • Eval: recall@K on the editorial-curated test set + downstream task lift (recommender CTR, cross-title-link CTR).
  • Reasonable assumptions:
      • The catalog is one homogeneous space; an item from any subset is comparable to any other.
      • Co-purchase patterns reflect "true" similarity.
      • The editorial-curated set is a reasonable test population.

What this misses: the catalog isn't one space; co-purchase patterns are mostly within-domain (manga buyers buy manga); the editorial set is overwhelmingly manga-on-manga. When manhwa enters, no pillar of ground truth covers it.

The New Reality

  1. Semantic spaces are domain-shaped. Manga, manhwa, and manhua share concepts (genres, story tropes) but differ in tradition, format, art style, reading direction, and audience expectations. Distance in the embedding space should respect domain boundaries.
  2. Co-purchase ground truth is sparse for new domains. Manhwa users haven't yet established strong co-purchase patterns within manhwa (the catalog is new); the dominant co-purchase signal is manga-buyers occasionally trying a manhwa, which is a cross-domain signal labeled as primary similarity.
  3. Editorial pairs are overwhelmingly intra-manga. New cross-domain pairs (manga ↔ manhwa "if you liked X you'll like Y") aren't curated.
  4. The triplet-loss objective is silently mis-defining negatives. Random "negatives" from a catalog dominated by manga means manhwa rarely appears as a negative for manga anchors. The model never learns the boundary.
  5. The evaluation is structurally blind. Recall@K on the editorial set says the model is fine — the editorial set has no manhwa-on-manhwa pairs. The metric doesn't measure what's broken.
  6. Downstream features amplify the failure. Cross-title-link recommendations rely on embedding similarity; bad similarity → bad recommendations. The CTR drop is observable but the cause is two layers up (in the embedding ground truth).

Why Naive Approaches Fail

  • "Add manhwa to the existing training; retrain." Without ground-truth pairs covering manhwa-on-manhwa similarity, the model has nothing to learn from. The new model still places manhwa near manga.
  • "Use co-purchase data as it accumulates." Slow (months), and biased (cross-domain co-purchases will dominate until intra-manhwa patterns stabilize, which themselves depend on having good recommendations to drive them — chicken-and-egg).
  • "Use a single multilingual embedding model and call it done." Multilingual / multimodal pretrained models have decent priors but were not trained on manhwa-specific similarity; the gap is specifically about similarity rules in the new domain.
  • "Train per-domain separate models." Loses cross-domain comparability ("readers of X manga might also like Y manhwa"), which is a real product use case.
  • "Trust embedding-space cluster analysis." Clusters reveal that manhwa is in its own region in the existing model — but the recommendations team needs the model to actively use that info, not just have it there.

Detection — How You Notice the Shift

Online signals.

  • Per-domain CTR on similarity-driven features. Cross-title-link CTR for manhwa anchors → manhwa recommendations is the most direct signal.
  • Recommendation diversity per domain. If "if you liked this manhwa" recommendations consistently include manga, that's the model failing to respect domain.
  • User feedback on cross-title-link. "These don't look like the same thing" → low ratings on the rail; track per-domain.

Offline signals.

  • Per-domain k-nearest-neighbors quality. For each domain, sample 100 anchors and ask annotators to rate the relevance of the top-5 nearest neighbors. Per-domain quality scores reveal where the embedding fails.
  • Cross-domain pair recall. If you have a labeled "users who liked manga X also liked manhwa Y" set, what's the model's recall on those pairs? Low recall = no cross-domain bridge.
  • Embedding-space domain analysis. Compute centroid distances between domains and within-domain variance. A model that respects domains has clearly separated domain centroids: manhwa items sit closer to the manhwa centroid than to the manga centroid.
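
A minimal sketch of this centroid check, assuming numpy arrays of item embeddings grouped by domain (at least two domains); the function name and report fields are illustrative, not an existing tool:

import numpy as np

def domain_centroid_report(embeddings_by_domain):
    # embeddings_by_domain: dict of domain name -> (N, D) array of item embeddings.
    centroids = {d: e.mean(axis=0) for d, e in embeddings_by_domain.items()}
    report = {}
    for d, embs in embeddings_by_domain.items():
        own = np.linalg.norm(embs - centroids[d], axis=1)
        # Distance to the nearest *other* domain centroid; items closer to a
        # foreign centroid than their own are the "misplaced" cases.
        other = np.min(
            [np.linalg.norm(embs - centroids[o], axis=1)
             for o in centroids if o != d], axis=0)
        report[d] = {
            "within_domain_variance": float(own.var()),
            "misplaced_fraction": float((other < own).mean()),
        }
    return report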

Distribution signals.

  • Triplet-loss training-data composition. What fraction of training triplets have a manhwa anchor? What fraction have cross-domain (manga, manhwa) pairs? If both are tiny, the model has no signal to learn the new domain (a counting sketch follows this list).
  • Editorial-curated set composition. Manga / manhwa / manhua / cross-domain proportions. Compare to catalog volume. Imbalance = eval set doesn't represent the catalog.
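
A quick composition check over the training triplets, as a hedged sketch; the triplets iterable and its item objects with a .domain attribute are assumed names for illustration:

from collections import Counter

def triplet_composition(triplets):
    # triplets: iterable of (anchor, positive, negative) item objects.
    anchor_domains = Counter()
    pair_kinds = Counter()
    total = 0
    for anchor, positive, negative in triplets:
        total += 1
        anchor_domains[anchor.domain] += 1
        # Label the anchor-positive pair as within-domain or cross-domain.
        kind = "within" if anchor.domain == positive.domain else "cross"
        pair_kinds[(kind, anchor.domain, positive.domain)] += 1
    if total == 0:
        return {}
    return {
        "anchor_domain_fraction": {d: c / total for d, c in anchor_domains.items()},
        "pair_kind_counts": dict(pair_kinds),
    }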

Architecture / Implementation Deep Dive

flowchart TB
    subgraph Sources["Per-domain similarity ground truth"]
        ED_M["Editorial pairs - manga"]
        ED_K["Editorial pairs - manhwa<br/>(NEW, growing)"]
        ED_C["Editorial pairs - manhua<br/>(NEW)"]
        CP_M["Co-purchase intra-manga"]
        CP_K["Co-purchase intra-manhwa<br/>(sparse, growing)"]
        CROSS["Cross-domain pairs<br/>(curated, sparse)"]
    end

    subgraph Triplet["Domain-aware triplet sampling"]
        BAL["Stratified by anchor domain"]
        NEG_HARD["Hard negatives within<br/>same domain"]
        NEG_SOFT["Soft negatives across<br/>domains for cross-domain training"]
    end

    subgraph Model["Co-trained model"]
        BACK["Shared backbone"]
        DC["Domain-conditioning head"]
        EMB["Final embedding<br/>(includes domain)"]
    end

    subgraph Eval["Per-domain + cross-domain eval"]
        D_M["Manga recall@k"]
        D_K["Manhwa recall@k"]
        X["Cross-domain recall@k<br/>(blocking when feature uses it)"]
    end

    subgraph Serve["Downstream"]
        REC["Recommender"]
        SIM["Cross-title-link"]
        SEARCH["Search reranker"]
    end

    ED_M --> BAL
    ED_K --> BAL
    ED_C --> BAL
    CP_M --> BAL
    CP_K --> BAL
    CROSS --> BAL
    BAL --> NEG_HARD
    BAL --> NEG_SOFT
    NEG_HARD --> BACK
    NEG_SOFT --> BACK
    BACK --> DC --> EMB
    EMB --> D_M
    EMB --> D_K
    EMB --> X
    D_M -->|gate| REC
    D_K -->|gate| REC
    X -->|gate| SIM
    EMB --> REC
    EMB --> SIM
    EMB --> SEARCH

    style ED_K fill:#fde68a,stroke:#92400e,color:#111
    style CROSS fill:#dbeafe,stroke:#1e40af,color:#111
    style X fill:#fee2e2,stroke:#991b1b,color:#111
    style DC fill:#dcfce7,stroke:#166534,color:#111

1. Data layer — multi-domain similarity ground truth

The training ground truth becomes a stratified set:

Pair source | Coverage | Strength
Editorial pairs - manga | Mature, ~50K | High
Editorial pairs - manhwa | Growing, target ~10K Y1 | Curated
Editorial pairs - manhua | Growing, target ~5K Y1 | Curated
Cross-domain editorial pairs | New, ~2K Y1 | Hand-curated
Co-purchase intra-manga | Mature | Signal-strong but biased
Co-purchase intra-manhwa | Sparse, growing | Use sparingly while sparse
Cross-domain co-purchase | Skewed (manga-buyers crossing over) | Don't use as primary similarity

Editorial coverage of new domains is the lift. Without it, the model has no signal source it can trust for the new categories. A small editorial team curating ~10K manhwa pairs over Y1 is the bootstrap.

2. Pipeline layer — domain-aware triplet sampling

Triplet sampling is stratified to balance domain representation:

import random

def sample_triplet(corpus):
    # Helper samplers (sample_domain_balanced, sample_item, sample_positive,
    # sample_hard_negative, sample_negative) and the pair-source handles
    # (editorial, co_purchase, cross_domain_editorial) are defined elsewhere.
    anchor_domain = sample_domain_balanced()
    anchor = sample_item(corpus, domain=anchor_domain)
    if random.random() < 0.8:
        # Within-domain triplet (the main task): same-domain positive,
        # hard negative from the same domain but a different cluster.
        positive = sample_positive(anchor, sources=[editorial, co_purchase],
                                   same_domain=True)
        negative = sample_hard_negative(anchor, same_domain=True)
    else:
        # Cross-domain bridge triplet (~15-20% of triplets): positive comes from
        # the curated cross-domain editorial pairs, negative is cross-domain.
        positive = sample_positive(anchor, sources=[cross_domain_editorial],
                                   same_domain=False)
        negative = sample_negative(anchor, cross_domain=True)
    return anchor, positive, negative

Most triplets train within-domain similarity (the main task). A small fraction (~ 15–20%) explicitly trains cross-domain bridges using the curated cross-domain editorial pairs as positives.
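
A minimal sketch of how the sampled triplets feed a standard margin-based triplet loss, assuming PyTorch; model.embed() is a hypothetical stand-in for the domain-conditioned model described in the next subsection, and the 0.2 margin is illustrative:

import torch.nn.functional as F

def triplet_loss_step(model, batch, margin=0.2):
    # batch: list of (anchor, positive, negative) items from sample_triplet.
    # model.embed is a hypothetical helper mapping items to (B, D) embeddings.
    anchors = model.embed([t[0] for t in batch])
    positives = model.embed([t[1] for t in batch])
    negatives = model.embed([t[2] for t in batch])
    # Pull positives inside the margin, push negatives outside it.
    return F.triplet_margin_loss(anchors, positives, negatives, margin=margin)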

3. Serving layer — domain-conditioned embedding

The final embedding includes a domain feature. Two design options:

  • Single backbone, domain feature concat. A single network produces embeddings; domain is a one-hot feature concatenated to the input or to an intermediate layer.
  • Multi-head: shared backbone + per-domain projection head. The backbone learns shared representations; per-domain heads specialize the projection. Final embedding is the projected output for the item's domain.

In practice, the multi-head design has worked better in this kind of multi-domain catalog because the per-domain heads can capture domain-specific structure while the backbone shares cross-domain knowledge.
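
A compact sketch of the multi-head option, assuming a PyTorch-style setup; the layer sizes, input featurization, and class name are illustrative choices, not the production design:

import torch.nn as nn
import torch.nn.functional as F

class MultiDomainEmbedder(nn.Module):
    # Shared backbone holds most parameters; small per-domain heads specialize
    # the final projection while staying in one comparable embedding space.
    def __init__(self, input_dim, backbone_dim, embed_dim, domains):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(input_dim, backbone_dim), nn.ReLU(),
            nn.Linear(backbone_dim, backbone_dim), nn.ReLU(),
        )
        self.heads = nn.ModuleDict(
            {d: nn.Linear(backbone_dim, embed_dim) for d in domains})

    def forward(self, features, domain):
        shared = self.backbone(features)           # cross-domain knowledge
        projected = self.heads[domain](shared)     # domain-specific structure
        return F.normalize(projected, dim=-1)      # unit-norm for cosine kNN

model = MultiDomainEmbedder(input_dim=512, backbone_dim=256, embed_dim=128,
                            domains=["manga", "manhwa", "manhua"])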

4. Governance — per-domain promotion gates

A new embedding model is promoted only if:

  • Per-domain recall@K passes for each domain.
  • Cross-domain recall@K passes (using the curated cross-domain editorial set).
  • No regression on previously-stable domains.
  • Downstream task lift (recommender CTR, cross-title-link CTR) is stable or improved per-domain in shadow A/B.

The promotion gate explicitly tests every domain's metric, not aggregate. A model that improves manga but regresses manhwa is rejected.
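
A hedged sketch of the gate logic; the metric names, thresholds, and regression tolerance are placeholders for whatever the eval pipeline actually emits:

def passes_promotion_gate(candidate, baseline, thresholds, regression_tolerance=0.01):
    # candidate / baseline: dict of domain -> recall@K, including a
    # "cross_domain" entry for the curated cross-domain editorial set.
    for domain, floor in thresholds.items():
        if candidate[domain] < floor:
            return False, f"{domain} below absolute floor"
        if candidate[domain] < baseline[domain] - regression_tolerance:
            return False, f"{domain} regressed vs. current production model"
    return True, "all per-domain and cross-domain gates passed"

# Example: a model that improves manga but regresses manhwa is rejected.
ok, reason = passes_promotion_gate(
    candidate={"manga": 0.84, "manhwa": 0.61, "manhua": 0.58, "cross_domain": 0.52},
    baseline={"manga": 0.81, "manhwa": 0.66, "manhua": 0.57, "cross_domain": 0.51},
    thresholds={"manga": 0.75, "manhwa": 0.60, "manhua": 0.55, "cross_domain": 0.45},
)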

Trade-offs & Alternatives Considered

Approach | Per-domain quality | Cross-domain quality | Cost | Verdict
Single uniform model, no domain awareness | Mediocre on minority | Poor | Lowest | Original — broken
Single model + domain feature | Better | OK | Medium | Reasonable but bottlenecked by backbone
Multi-head: shared backbone + per-domain heads + balanced ground truth | Good | Good | Medium-High | Chosen
Per-domain separate models | Best per domain | Loses cross-domain | Highest | Bad for cross-recommendation use case
Single model + clever loss (only) | Variable | Variable | Medium | Fix is data, not loss alone
Wait for co-purchase signal to mature | Eventually OK | Eventually OK | Low engineering | Months of bad UX in the meantime

The multi-head + balanced data is the standard pattern for multi-domain catalogs. The lift is the editorial labeling investment for new domains.

Production Pitfalls

  1. Editorial labeling for new domains is a bottleneck. Curating 10K manhwa pairs takes months with a small editorial team. Frontload this work before the new domain launches in features that depend on similarity (e.g., cross-title-link). Otherwise the feature ships broken on day one.
  2. Cross-domain pairs are doubly hard. Manga-to-manhwa similarity requires editors fluent in both traditions. Pair-level disagreement among editors is high; protect the pair quality with cross-editor agreement checks.
  3. Co-purchase data has chicken-and-egg dynamics. Good intra-manhwa co-purchase requires good manhwa recommendations, which require good embeddings, which require co-purchase. Bootstrap with editorial; let co-purchase grow gradually.
  4. Negative sampling is the silent killer. If negatives are all "very different," the model never learns the boundary; it learns that manga and manhwa are as far from each other as a manga and a cookbook. Hard negatives — same-domain, similar-but-not-identical — are essential.
  5. Per-domain model promotion is more complex. A re-train improves manga and hurts manhwa? You either ship per-domain (manga gets the new model, manhwa stays) or block until the manhwa hit is fixed. Per-domain ship is operationally complex but sometimes the right call.
  6. Cross-encoder reranker is downstream and inherits drift. The reranker over the embedding's top-K has its own training signal; if the embedding's top-K shifts post-retrain, the reranker's score distribution shifts too — re-evaluate (see GenAI 04 + 08 patterns).
  7. The "same genre" trap. Two items both tagged "isekai" can be genuinely different (manga isekai vs manhwa isekai, both protagonist-overpowered but stylistically distinct). The model can over-anchor on tag matching unless the negatives include same-tag-different-style cases.

Interview Q&A Drill

Opening question

Your shared embedding model places manhwa near manga that share genre tags, not near other manhwa. The "if you liked this manhwa, you'll like..." rail returns mostly manga, CTR is 0.6× of the manga rail. The embedding model is 14 months old. Walk me through the fix.

Model answer.

The training ground truth is overwhelmingly manga-on-manga, with no signal for manhwa-on-manhwa similarity and almost no curated cross-domain pairs. The model can't learn what it's never trained on. Aggregate eval looks fine because the editorial test set is also overwhelmingly manga.

The fix is a multi-domain ground-truth strategy plus a domain-aware model.

(1) Per-domain editorial pair labeling. Bootstrap manhwa-on-manhwa pairs (target ~10K Y1) and manhua-on-manhua pairs (~5K Y1) with a focused editorial effort. Curate cross-domain pairs (~2K) for the cross-recommendation use case.

(2) Domain-stratified triplet sampling. Triplets balanced across anchor domains. Most triplets train within-domain similarity (the main job); ~15–20% train cross-domain bridges using the curated cross-domain pairs.

(3) Hard negatives within domain. Negatives sampled from same domain, different cluster — the boundary cases are what the model needs to learn.

(4) Multi-head architecture: shared backbone + per-domain projection heads. Captures shared cross-domain knowledge in the backbone and per-domain structure in the heads.

(5) Per-domain eval gates. Aggregate is advisory. Promotion blocks if any domain regresses or if cross-domain recall on the curated set drops.

The conceptual move: the embedding's "ground truth" was an unbalanced sum of manga signals. Treating new domains as additive (just train more) was the failure; the architecture needs the new domains as equal partners in ground-truth construction, not as a tail.

Follow-up grill 1

Editorial labeling for new domains is slow. The product team wants the rail to work on day one when manhwa launches. What ships on day one?

A staged rollout, not "off until perfect."

(1) Day one: feature shipped, restricted to within-domain. Manhwa anchors yield manhwa recommendations from a candidate pool within manhwa only, ranked by content features (genre tag overlap, story tropes, format) and the existing embedding's similarity (which is poor but better than random within manhwa).

(2) Day one: cross-domain rail labeled "Discover." Cross-domain recommendations get a different name and clear copy: "Try something different — manga readers also enjoy these manhwa." Sets expectations that the recommendation is an exploration, not a similarity match.

(3) Months 1–3: editorial pair labeling ramps. As pairs land, the rail's quality improves. Track per-cohort CTR; promote the rail from "Discover" to "Similar to your tastes" when quality clears a threshold.

(4) Months 3–6: model re-train with the new pairs. The properly trained embedding becomes the primary similarity backbone.

The architectural commitment: don't ship a confidently-wrong feature. Ship an honestly curated one and let it improve as the data does.

Follow-up grill 2

Negative sampling. You said hard negatives within domain are crucial. How do you find hard negatives without expensive labeling?

Three sources of hard negatives that don't require explicit labeling.

(1) In-batch hard negatives. During triplet training, look at items in the same batch that have high cosine similarity to the anchor but aren't the labeled positive. They're "close but wrong" — good negatives. This is standard in-batch hard mining (a minimal sketch follows this answer).

(2) Embedding-space exploration. After a few training epochs, query the current model: for each anchor, find the top-K nearest neighbors that aren't in the editorial positive set. These are model-suggested similar items; if the editorial set says they're not similar, they're hard negatives. (Caution: editorial sets are non-exhaustive — an item not in the labeled set isn't necessarily a true negative.)

(3) Genre-but-not-style. Items sharing top-level genre tags but differing in finer attributes (art style, target demographic, pacing). These are precisely the cases the model needs to distinguish. Heuristic-mined.

The combination produces useful hard negatives without per-pair human labeling.
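
A minimal sketch of source (1), in-batch hard-negative mining, assuming PyTorch and already L2-normalized embeddings; the function name and shapes are illustrative:

import torch

def in_batch_hard_negatives(anchor_emb, candidate_emb, positive_idx):
    # anchor_emb:    (B, D) normalized anchor embeddings
    # candidate_emb: (B, D) normalized candidate embeddings from the same batch
    # positive_idx:  (B,) index of each anchor's labeled positive in the batch
    sims = anchor_emb @ candidate_emb.T                 # (B, B) cosine similarities
    batch = torch.arange(sims.shape[0])
    sims[batch, positive_idx] = float("-inf")           # never pick the labeled positive
    sims.fill_diagonal_(float("-inf"))                  # never pick the anchor itself
    return sims.argmax(dim=1)                           # hardest remaining item per anchor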

Follow-up grill 3

Multi-head architecture: per-domain projection heads. What stops the heads from drifting completely apart so cross-domain similarity becomes meaningless?

Three constraints during training.

(1) Shared backbone. Most parameters live in the shared backbone; per-domain heads are relatively small. Most learning happens in the shared part, anchoring cross-domain comparability.

(2) Cross-domain training signal. The 15–20% of triplets that explicitly train cross-domain pairs forces the projection heads to align in the cross-domain direction. Without this, the heads would silo.

(3) Inter-domain centroid regularization. During training, a small loss term penalizes the distance between domain centroids in the projected space — it keeps the manga centroid near the manhwa centroid in a controlled way (a minimal sketch follows this answer). Tunable: too strong → collapses to a uniform model; too weak → heads drift.

(4) Cross-domain eval as a gate. The curated cross-domain editorial pairs are the test of whether the heads stay aligned. Recall@K on cross-domain pairs is the explicit check.

The pattern: shared backbone + small heads + explicit cross-domain training data + regularization + eval gate. Each piece reinforces the others.
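
A hedged sketch of the centroid regularization term from point (3), assuming PyTorch; the weight and the per-batch computation are illustrative choices:

import torch

def centroid_alignment_loss(embeddings, domain_labels, weight=0.01):
    # embeddings:    (N, D) projected embeddings for one batch
    # domain_labels: list of N domain strings, e.g. ["manga", "manhwa", ...]
    centroids = []
    for d in sorted(set(domain_labels)):
        mask = torch.tensor([lbl == d for lbl in domain_labels],
                            device=embeddings.device)
        centroids.append(embeddings[mask].mean(dim=0))
    if len(centroids) < 2:
        return embeddings.new_zeros(())              # nothing to align in this batch
    centroids = torch.stack(centroids)               # (K, D)
    # Mean pairwise squared distance between domain centroids; added to the
    # triplet loss with a small weight so it nudges rather than collapses.
    pairwise = torch.cdist(centroids, centroids).pow(2)
    k = centroids.shape[0]
    return weight * pairwise.sum() / (k * (k - 1))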

Follow-up grill 4

Two months in, manhwa co-purchase data starts to mature. Do you ingest it as another ground-truth source?

Yes, with care.

(1) Down-weight initially. Early co-purchase data is noisy (small sample, biased by which recommendations were shown — selection bias). Weight at maybe 0.3 of editorial-pair signal initially. Increase weight as confidence grows.

(2) Filter cross-domain co-purchase. Manga-buyers occasionally trying a manhwa is a cross-domain signal. Treat it as cross-domain pair candidate, not as primary intra-manhwa similarity.

(3) Selection-bias correction. Manhwa items that were heavily recommended will have inflated co-purchase. IPS (inverse propensity scoring) or a random-policy slice (scenario ML-01 pattern) corrects for this. Without correction, popular items become artificially "similar" to everything.

(4) Iterate the model. As co-purchase matures (months 3–6), retrain with the now-richer signal. Editorial pairs remain the high-confidence anchor; co-purchase is the volume signal.

The architectural commitment: each ground-truth source has a confidence and a bias profile, and the training mixes them with appropriate weights — not as a free-for-all sum.
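
A small sketch of mixing ground-truth sources by confidence when sampling positives, as described above; the source names and weights are illustrative, not tuned values:

import random

# Hypothetical per-source sampling weights: editorial pairs stay the
# high-confidence anchor, early co-purchase is down-weighted.
SOURCE_WEIGHTS = {
    "editorial_intra_domain": 1.0,
    "editorial_cross_domain": 1.0,
    "co_purchase_intra_manga": 0.8,
    "co_purchase_intra_manhwa": 0.3,   # raise as the signal matures
}

def sample_positive_source(available_sources):
    # available_sources: source names that actually hold a pair for this anchor.
    weights = [SOURCE_WEIGHTS.get(s, 0.0) for s in available_sources]
    return random.choices(available_sources, weights=weights, k=1)[0]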

Architect-level escalation 1

A new content type — light novels, text-only, no visual element — enters the catalog. The visual-similarity backbone won't help. How does the embedding architecture extend?

This is "modality expansion," not just domain expansion. Three architectural moves.

(1) Per-modality backbone, shared higher-level representation. The visual backbone (handles cover art, panel imagery for manga / manhwa) doesn't apply to text-only light novels. Train a text backbone for light novels using their textual content (synopsis, sample chapter, themes). Both backbones project into a shared higher-level embedding space where similarity is comparable across modalities.

(2) Anchored cross-modal training. Editorial-curated cross-modal pairs ("readers of X manga also enjoy Y light novel") are the bridge. Train both backbones jointly with these pairs as positive signal.

(3) Modality-aware downstream features. The recommender knows the user's modality preferences (some readers exclusively read manga; others mix). Filter / weight recommendations by modality preference per user.

The deeper commitment: when a new modality enters, treat it as a new component (new backbone) integrated through shared higher-level representation, not as a special case of existing modalities. The "embedding model" stops being one model and becomes a system of co-trained models.

Architect-level escalation 2

Six months later, a regulator in the EU asks you to demonstrate that your similarity-based recommendations don't reinforce stereotypes (e.g., recommending content with certain demographics only to certain user groups). How does your embedding architecture support that audit?

The audit chain has three parts.

(1) Fairness on the embedding itself. For protected attributes that the catalog can map (e.g., creator demographics, character demographics in the content), measure whether items with attribute X cluster together more than they would by random chance, and whether that clustering causes systematic over/under-recommendation to certain user groups. This is the "embedding fairness" check; it requires per-item attribute labels (which the catalog may have through editorial work).

(2) Fairness in downstream recommendations. For each user demographic cohort, what's the diversity of recommendations served? If cohort A is consistently recommended only items with attribute X (where the catalog has 50% X / 50% non-X), the recommender (or the embedding it uses) is over-anchoring on X.

(3) Counterfactual analysis. For a sample of users, swap their cohort label and see if the recommendation set changes meaningfully. If it does, the system is using cohort information; if it shouldn't, that's a fairness signal.

The architectural commitments to make this auditable.

(a) Item attribute metadata. The catalog stores creator demographic attributes (where available, with creator consent), character demographic attributes (where annotated), and content classification. Without this, the audit can't even start.

(b) Cohort-level aggregation. Recommendation diversity tracked per user cohort, with k-anonymity (only emit cohort metrics when the cohort has k ≥ 50 users) to avoid re-identification via aggregate diffs (a minimal sketch follows this answer).

(c) Embedding interpretability. For high-impact recommendations, the system can surface which attributes drove the similarity (e.g., "this was recommended because you liked items in similar genre clusters"). Per-recommendation explanations.

The hard regulator question: "is the model itself biased, or is the data?" The honest answer is usually "both" — biased data trains biased models. The architectural work is making each layer measurable and surfacing the bias. Not pretending it's absent.
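
A hedged sketch of the cohort-level aggregation from point (b); the diversity measure (share of recommendations per attribute) and the k = 50 threshold follow the text, everything else is illustrative:

from collections import defaultdict

K_ANONYMITY_MIN = 50  # only emit metrics for cohorts with at least 50 users

def cohort_recommendation_diversity(events):
    # events: iterable of (cohort, user_id, item_attribute) tuples, one per
    # served recommendation; item_attribute is the audited content attribute.
    users = defaultdict(set)
    attr_counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for cohort, user_id, attribute in events:
        users[cohort].add(user_id)
        attr_counts[cohort][attribute] += 1
        totals[cohort] += 1
    report = {}
    for cohort, counts in attr_counts.items():
        if len(users[cohort]) < K_ANONYMITY_MIN:
            continue  # suppress small cohorts instead of emitting them
        report[cohort] = {a: c / totals[cohort] for a, c in counts.items()}
    return report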

Red-flag answers

  • "Add manhwa data and retrain." (Doesn't address ground-truth gap.)
  • "Use a multilingual pretrained embedding." (Has decent priors but not domain-specific similarity rules.)
  • "Per-domain separate models." (Loses cross-domain comparability.)
  • "Wait for co-purchase data." (Months of bad UX.)
  • "Trust aggregate recall@K." (Domains hidden in the average.)

Strong-answer indicators

  • Recognizes embedding ground truth as multi-domain and the gap as label-side.
  • Knows multi-head shared-backbone is the standard pattern for multi-domain.
  • Has a staged-rollout plan that ships honest functionality on day one.
  • Treats co-purchase as a maturing signal with selection-bias correction.
  • Anticipates modality expansion (light novels) requires per-modality backbones.
  • Has a serious answer for fairness auditing tied to embedding-level metadata.