GenAI Scenario 05 — Localization / Multilingual Ground Truth
TL;DR
The Catalog-Search MCP launched in English. Nine months later, three locales are live (JP-native, ES, PT-BR) with two more in pipeline (DE, FR), and the index now stores titles, synopses, and aliases in a mix of languages — Japanese titles in both romaji and native script, Spanish synopses written by translation vendors, and English fallback for un-localized fields. The retrieval golden set was an English-only set; "translate the queries" was treated as the localization plan. In production, queries arrive in mixed scripts (a Spanish user querying an isekai title in romanized Japanese), translation introduces semantic drift (the Spanish synopsis lost a key plot tag), and the retriever silently favors the dominant-language fields. The fix shape is multilingual ground truth keyed on (query_language, document_language) pairs, not on a single language; translation as a measurable lossy transform with its own quality bar; and per-locale cohorts blocking promotion.
Context & Trigger
- Axis of change: Requirements (new locales mean new product surfaces and new policies for what "correct" means) + Scale (the cross-product of language pairs explodes the eval space).
- Subsystem affected: RAG-MCP-Integration/01-catalog-search-mcp.md and RAG-MCP-Integration/09-rag-retrieval-pipeline-deep-dive.md (multilingual embedding, BM25 + kNN fusion).
- Trigger event: Brazil launches with a PT-BR catalog. Within 4 weeks, support sees a class of complaints — "el bot no encuentra X" (the bot doesn't find X) — for queries that do have a matching title in the catalog, just under a different alias or in a different field's language. Aggregate recall@10 stays above 0.90; recall@10 for (query_lang=es, document_lang=ja) queries is 0.34.
The Old Ground Truth
The original design:
- English-only golden set (≈ 2,400 pairs).
- One embedding model (multilingual sentence transformer, fine-tuned on English manga corpus).
- One BM25 analyzer (English with stopwords + stemming).
- Aliases stored as a flat array per title, mostly English + a few romaji.
- Reasonable assumptions:
- "Multilingual" embeddings mean cross-lingual queries Just Work.
- Translation of queries to English at retrieval time would solve the user side.
- Title aliases naturally cover romaji + English; everything else is rare.
What this misses: multilingual embeddings handle some cross-lingual semantics but not all, BM25 is language-specific in ways that fight cross-lingual recall, and the alias table is incomplete in the languages that matter most for the new launches.
The New Reality
- Queries arrive in mixed scripts. A Brazilian Portuguese user types "isekai onde o protagonista é um espadachim" (isekai where the protagonist is a swordsman) — a description in PT, an entity tag in Japanese-via-English (isekai), and the document they want is a JP-native title with an ES synopsis.
- Documents are partially translated. A title's synopsis field exists in EN, PT, and ES, but its aliases are JP + EN only. The relevant chunk for one user is not in the language they speak.
- Translation drifts the signal. "School-life" in EN becomes "vida escolar" in ES — fine. "Slice-of-life" becomes "rebanada de la vida" — semantically broken. The translation pipeline introduces ground-truth drift independent of the model.
- BM25 stopwords cut differently per language. "El bot" loses "el"; "is the bot" loses "is" and "the." If a query is mixed (bot encontrar X), the analyzer has to pick a language; whichever it picks, the other tokens get over-weighted.
- Romanization is ambiguous. "Sakamoto" is the romaji of サカモト but also circulates as a plain Latin-script surname; "Shounen Jump" appears as "shonen jump", "shōnen jump", and "shounen-jump." Without normalization, recall fragments.
The schema isn't single-keyed (query → relevant_chunks). It's at least double-keyed ((query_language, document_language) → relevant_chunks), and the cross-lingual cells that matter most for the new launches (en→ja, es→ja, pt→ja) have very few labeled examples.
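A minimal sketch of that pair-keyed shape, assuming a simple in-memory store; GoldenPair and its field names mirror the eval-pipeline example later in this section, while add_pair and cell_size are illustrative helpers, not an established API:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class GoldenPair:
    query: str
    query_lang: str        # language of the query text
    doc_lang: str          # language of the field the relevant chunk lives in
    intended_chunks: list  # chunk IDs: pointers into documents, never translated text

# Sparse 2-D grid keyed by (query_lang, doc_lang) instead of one flat list.
golden_set: dict[tuple, list] = defaultdict(list)

def add_pair(pair: GoldenPair) -> None:
    golden_set[(pair.query_lang, pair.doc_lang)].append(pair)

def cell_size(query_lang: str, doc_lang: str) -> int:
    """How many labeled examples a language-pair cell currently holds."""
    return len(golden_set[(query_lang, doc_lang)])
```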
Why Naive Approaches Fail
- "Translate queries to English at retrieval time." Loses meaning on local-flavored queries; introduces translation latency on the hot path; doesn't help with mixed-script queries; and biases the index toward English fields.
- "Use a multilingual embedding model and call it done." Multilingual embeddings handle cross-lingual proximity unevenly — fine on common pairs (en↔es), worse on en↔ja and worst on es↔ja. The eval will show this if you measure it, which is the actual blocker.
- "Translate the golden set to all locales." Translation introduces noise. The "correct chunk" for a translated query may not be the same chunk as for the source query. Worse, vendor translations are often slightly wrong, so your golden set drifts in unmeasured ways.
- "Add more aliases." Useful but partial — covers entity recognition, doesn't solve description-style queries.
- "Just train one big multilingual model." Without localized golden sets to evaluate against, you can't verify it's working; the model could over-fit the dominant-language fields and you wouldn't know.
Detection — How You Notice the Shift
Online signals.
- Per-locale "did not find" rate. Especially the cross-lingual cohort (e.g.,
query=es, intended_doc=jatitles). When this cohort lifts above the locale's overall rate, cross-lingual retrieval is the gap. - Locale-specific escalations. Support tickets from PT/ES/JP referencing queries that "should have worked." Track them by query-language tag.
- Click-through differential by document-language. Are users in
esclicking on results that show ES synopses, or are they getting EN-only results and bouncing?
Offline signals.
- Per-(query_lang, doc_lang) recall@10. The diagonal cells (en→en, es→es) tell you the per-language retriever quality; the off-diagonal cells tell you cross-lingual quality. A healthy system has all cells above threshold; in this scenario, the off-diagonal cells fail (a slicing sketch follows this list).
- Translation quality on labels. Periodically sample translated golden-set labels and have a fluent reviewer score them on adequacy. A translation quality below threshold is itself a ground-truth corruption.
- Alias coverage. For each title, count aliases per language. Coverage gaps (a title with 0 ES aliases) predict per-locale recall failures.
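A minimal sketch of the per-cell slice, assuming the pair-keyed golden set above and a retrieve(query, k) callable that returns ranked chunk IDs; the names are illustrative:

```python
from collections import defaultdict

def per_cell_recall_at_k(golden_set, retrieve, k=10):
    """recall@k sliced by (query_lang, doc_lang) instead of one aggregate number."""
    hits, totals = defaultdict(int), defaultdict(int)
    for cell, pairs in golden_set.items():
        for pair in pairs:
            top_k = retrieve(pair.query, k=k)            # ranked chunk IDs
            if set(top_k) & set(pair.intended_chunks):   # any relevant chunk in the top-k
                hits[cell] += 1
            totals[cell] += 1
    return {cell: hits[cell] / totals[cell] for cell in totals}
```

Each cell in the returned dict is then compared against its own threshold rather than one global bar.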
Distribution signals.
- Per-locale query-script histograms. What fraction of es queries contain non-Latin-1 characters (JP script or macron romaji)? If that fraction is high and stable, the index needs first-class JP-token support even in ES locales (see the script-bucketing sketch after this list).
- Field-of-match distribution. Of winning retrievals, which document field matched (alias vs synopsis vs character-list) per (query_lang, doc_lang) pair? If aliases dominate for one pair and synopses for another, the eval set must reflect that stratification.
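A rough sketch of the script histogram and the "mixed" routing bucket; the 0.8 dominance cutoff is an illustrative choice, not a recommendation:

```python
import unicodedata

def script_profile(query: str) -> dict:
    """Count alphabetic characters by script family (very coarse: Latin vs JP vs other)."""
    counts = {"latin": 0, "jp": 0, "other": 0}
    for ch in query:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if "HIRAGANA" in name or "KATAKANA" in name or "CJK" in name:
            counts["jp"] += 1
        elif "LATIN" in name:
            counts["latin"] += 1
        else:
            counts["other"] += 1
    return counts

def script_bucket(query: str) -> str:
    """'latin', 'jp', or 'mixed' when neither script clearly dominates."""
    counts = script_profile(query)
    total = sum(counts.values()) or 1
    if counts["jp"] / total >= 0.8:
        return "jp"
    if counts["latin"] / total >= 0.8:
        return "latin"
    return "mixed"
```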
Architecture / Implementation Deep Dive
flowchart TB
subgraph Query["Query side"]
Q["User query<br/>+ detected language<br/>+ script"]
NORM["Normalization<br/>(romaji, accent, case,<br/>JP/CJK script)"]
QEMB["Multilingual embedding<br/>(query)"]
end
subgraph Index["Index side (multilingual)"]
DOC["Per-doc fields:<br/>title{lang} · synopsis{lang}<br/>· aliases{lang} · tags"]
DEMB["Per-field multilingual<br/>embeddings"]
BM["Per-language BM25 analyzers"]
end
subgraph Retrieval["Retrieval"]
FUS["Hybrid: BM25 (any lang) +<br/>kNN (cross-lingual)<br/>+ alias exact-match boost"]
RR["Cross-encoder reranker<br/>(multilingual, calibrated<br/>per language pair)"]
end
subgraph GT["Ground-truth layer"]
SET["Pair-keyed golden set<br/>(query_lang, doc_lang) cells"]
TRX["Translation quality<br/>monitor (BLEU + adequacy)"]
ALI["Alias coverage<br/>monitor per locale"]
end
subgraph Gate["Per-locale promotion"]
COH["Block on per-cell regression<br/>(not just aggregate)"]
end
Q --> NORM --> QEMB
NORM --> BM
DOC --> DEMB
DOC --> BM
QEMB --> FUS
BM --> FUS
FUS --> RR
SET -.->|eval| RR
TRX -.->|alerts| SET
ALI -.->|alerts| DOC
RR --> COH
style SET fill:#fde68a,stroke:#92400e,color:#111
style FUS fill:#dbeafe,stroke:#1e40af,color:#111
style TRX fill:#fee2e2,stroke:#991b1b,color:#111
style COH fill:#dcfce7,stroke:#166534,color:#111
1. Data layer — pair-keyed golden set
The golden set becomes a sparse 2-D grid (query_lang, doc_lang):
| query_lang \ doc_lang | en | es | pt | ja |
|---|---|---|---|---|
| en | 900 | 80 | 60 | 100 |
| es | 60 | 300 | 50 | 40 |
| pt | 40 | 30 | 200 | 25 |
| ja | 80 | 10 | 10 | 150 |
Every cell has a target population. Recently added cells (the PT row, launched in Q3) start small and grow weekly via a derivative builder seeded from production logs (same pattern as scenario 01, with language-pair stratification added).
Each pair has its own recall@10 target. Diagonal cells (same-language) get the strictest target (e.g., 0.92). Off-diagonal cells (cross-lingual) start at a lower bar (e.g., 0.80) and improve over time.
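A minimal sketch of the resulting per-cell gate, with the 0.92 / 0.80 targets as illustrative constants:

```python
DIAGONAL_TARGET = 0.92       # same-language cells
OFF_DIAGONAL_TARGET = 0.80   # cross-lingual cells, tightened over time

def failing_cells(per_cell_recall: dict) -> list:
    """Return human-readable failures; promotion is blocked unless this list is empty."""
    failures = []
    for (q_lang, d_lang), recall in sorted(per_cell_recall.items()):
        target = DIAGONAL_TARGET if q_lang == d_lang else OFF_DIAGONAL_TARGET
        if recall < target:
            failures.append(f"{q_lang}->{d_lang}: recall@10 {recall:.2f} < {target:.2f}")
    return failures
```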
2. Pipeline layer — translation as a measurable transform
Translation is how eval-set labels reach new locales. Each translation pass produces:
- The translated query.
- A vendor-confidence score.
- A back-translation (round-trip EN → ES → EN) for adequacy spot-check.
Adequacy check: BLEU(original_en, back_translated_en) ≥ 0.65 for the pair to enter the labeled set. Items that fail are flagged for human review. The pipeline keeps (translation_vendor, vendor_version_sha) so a translation regression in one vendor is detectable as a quality drop in cells fed by that vendor.
def add_to_locale_eval(en_pair, target_lang):
    # Translate the query, then round-trip back to EN for the adequacy check.
    translated = translator.translate(en_pair.query, target=target_lang)
    back = translator.translate(translated, target="en")
    back_bleu = bleu(en_pair.query, back)
    if back_bleu < 0.65:
        # Adequacy tripwire failed: hold the pair for human review instead of labeling it.
        send_to_human_review(en_pair, translated, back)
        return None
    return GoldenPair(
        query=translated,
        intended_chunks=en_pair.intended_chunks,  # pointers into the doc, never translated
        query_lang=target_lang,
        doc_lang=en_pair.doc_lang,
        translation_meta={"vendor": vendor.id, "sha": vendor.version, "back_bleu": back_bleu},
    )
The intended chunks aren't translated — they're still pointers into the document. What changed is the query side. The right answer is the same chunk in whatever language the doc is in.
3. Serving layer — hybrid + per-pair calibration
The retriever stack:
- BM25 multi-analyzer. The query is run through BM25 with the detected query-language analyzer and also with a fallback "permissive" analyzer that doesn't strip stopwords. Top-K from each is unioned.
- kNN with multilingual embedding. Single embedding model, but evaluated per pair to catch weak pairs.
- Alias exact-match boost. Aliases are normalized (romaji collapse, accent strip, lowercase) and exact matches are boosted. This is the highest-precision signal for entity queries.
- Cross-encoder reranker. Fine-tuned with pair-aware training data. Reranker scoring is calibrated per pair — a 0.7 score on en→en is not the same as 0.7 on es→ja.
The fusion weights (BM25 vs kNN vs alias) are per-pair learnable, because BM25 dominates intra-language recall and kNN dominates cross-lingual recall.
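A minimal sketch of per-pair fusion, assuming each candidate already carries normalized BM25, kNN, and alias-match scores; the weight values are illustrative priors, not tuned numbers:

```python
# Per-(query_lang, doc_lang) weights: off-diagonal pairs lean on kNN, diagonal pairs on BM25.
FUSION_WEIGHTS = {
    ("es", "es"): {"bm25": 0.5, "knn": 0.4, "alias": 0.1},
    ("es", "ja"): {"bm25": 0.2, "knn": 0.6, "alias": 0.2},
}
DEFAULT_WEIGHTS = {"bm25": 0.4, "knn": 0.5, "alias": 0.1}

def fuse(candidates: dict, query_lang: str, doc_lang: str) -> list:
    """candidates: {chunk_id: {"bm25": s, "knn": s, "alias": 0/1}} with scores in [0, 1]."""
    w = FUSION_WEIGHTS.get((query_lang, doc_lang), DEFAULT_WEIGHTS)
    fused = {
        chunk_id: sum(w[sig] * scores.get(sig, 0.0) for sig in w)
        for chunk_id, scores in candidates.items()
    }
    return sorted(fused, key=fused.get, reverse=True)  # chunk IDs, best first
```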
4. Governance — per-locale rollout
A new locale (e.g., DE) launches in three stages:
- Index-only. Documents have DE fields populated; the retriever doesn't serve DE traffic yet. The eval set's DE→ja and DE→DE cells start to fill.
- Shadow. DE traffic is served by the existing system and shadow-evaluated against the new path; results are compared offline (a shadow-comparison sketch appears below).
- Promote. Once the DE row of the matrix passes thresholds, DE is promoted; cohort metric stays as the gate.
A fourth stage — expand — adds new pairs to the matrix as locales mix (DE users querying for JP titles, etc.). The matrix grows quadratically in worst case; it grows linearly in practice because most users query in one or two languages.
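A minimal sketch of the shadow stage: users keep getting the existing path while the new path is scored silently. The retriever callables and logging sink are assumptions:

```python
import logging

log = logging.getLogger("locale_shadow_eval")

def shadow_compare(query: str, old_retrieve, new_retrieve, k: int = 10) -> list:
    """Serve the old path; evaluate the new path in the shadows and log the overlap."""
    served = old_retrieve(query, k=k)
    shadow = new_retrieve(query, k=k)
    overlap = len(set(served) & set(shadow)) / k
    log.info("shadow overlap@%d=%.2f query=%r", k, overlap, query)
    return served  # users always see the old results during the shadow stage
```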
Trade-offs & Alternatives Considered
| Approach | Cross-lingual recall | Per-locale eval clarity | Cost | Verdict |
|---|---|---|---|---|
| Translate queries to English at runtime | Medium | Low (eval still EN-only) | Latency+API | Loses local flavor |
| Per-language indexes | High intra-lang | High | Storage × N | Cross-lingual gap |
| Single multilingual index + pair-keyed eval + per-pair fusion | High | High | Medium | Chosen |
| Per-language fine-tuned embeddings | Highest | High | Training cost | Diminishing returns above multilingual base |
| Pure exact-match alias retrieval | Highest precision, low recall | n/a | Low | Misses description queries |
| One mega-translated golden set | Medium | Low (translation noise muddies eval) | Translation $$ | Unreliable |
The pair-keyed golden set is the central bet — without it, every other change is unmeasurable per language pair and you can't tell whether you're improving es→ja or hurting en→en.
Production Pitfalls
- Translation vendors silently version their models. Same input, different output a month later. Track vendor_version_sha per labeled pair; a regression in a cell right after a vendor update is a strong signal that translation drifted.
- Romaji normalization rules differ. "Shōnen" vs "shounen" vs "shonen" — pick a canonical form for the index and the analyzer, document it loudly, and treat any deviation as a bug (see the normalization sketch after this list). Inconsistency here destroys recall on the most-searched JP entities.
- CJK tokenization. Japanese has no spaces; tokenization is morphological. If your analyzer treats JP text as one token, BM25 collapses. Use a JP analyzer (kuromoji or equivalent) on JP fields specifically.
- The matrix is sparse and stays sparse. Most cells will have low query volume. Don't over-invest in cells that don't matter; do measure them so you can tell when they grow.
- Per-locale teams want to optimize their cells in isolation. If the JP team tunes en→ja and the BR team tunes es→pt, they can both make changes that hurt the third pair (es→ja). Treat fusion-weight changes as a global change that needs cross-locale review.
- Mixed-script queries break detection. Language detectors mis-identify queries that are 60% ES + 30% JP romaji + 10% EN tags. Have a "mixed" bucket and route it to a special hybrid path.
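A sketch of one possible alias canonicalization, assuming the convention "fold macrons, collapse long-vowel romaji, strip accents and separators"; the specific folding rules are a choice to document, not the only valid ones:

```python
import re
import unicodedata

# One illustrative canonical form. Whatever rules you pick, apply the SAME
# function to index-side aliases and to query tokens, or the canonical form
# defeats its purpose.
MACRONS = str.maketrans({"ā": "a", "ē": "e", "ī": "i", "ō": "o", "ū": "u"})

def normalize_alias(alias: str) -> str:
    s = alias.lower().translate(MACRONS)
    s = s.replace("ou", "o").replace("uu", "u")                   # long-vowel collapse
    s = unicodedata.normalize("NFKD", s)
    s = "".join(ch for ch in s if not unicodedata.combining(ch))  # accent strip
    s = re.sub(r"[-_./]+", " ", s)                                # unify separators
    return re.sub(r"\s+", " ", s).strip()

# "Shōnen Jump", "shounen-jump", and "shonen jump" all converge on one form.
assert {normalize_alias(a) for a in ["Shōnen Jump", "shounen-jump", "shonen jump"]} == {"shonen jump"}
```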
Interview Q&A Drill
Opening question
Your manga catalog is launching in Brazil with PT-BR localization. The catalog-search MCP currently runs an English-only golden set against a multilingual embedding model and gets aggregate recall@10 of 0.91. After launch, BR support escalates 'the bot doesn't find titles users know exist.' Walk me through your approach.
Model answer.
The aggregate metric is hiding the problem. The first move is to slice recall@10 by (query_language, document_language) and look at the off-diagonal cells — specifically pt→ja, pt→en, and pt→pt separately. I'd expect the diagonal to be near aggregate but pt→ja to be much lower, because the multilingual embedding handles common pairs (en↔es, en↔pt) better than rarer ones (es↔ja).
Once that's confirmed, three changes. (1) The golden set becomes pair-keyed: a sparse matrix where each (query_lang, doc_lang) cell has its own population, target, and stratification. (2) Translation enters the eval pipeline as a measurable lossy transform — every translated query gets a back-translation and a BLEU adequacy score, and items below 0.65 BLEU go to human review before entering the labeled set. (3) Per-pair fusion: BM25 (with the right analyzer per language), multilingual kNN, and alias exact-match are blended with per-pair weights, because intra-language and cross-lingual queries want different mixes.
The promotion gate moves from "aggregate recall@10 ≥ X" to "every cell passes its own threshold." Diagonal cells get the strictest bar; off-diagonal cells start lower and improve over time.
The conceptual move: ground truth in a multilingual system isn't single-keyed. Treating it as single-keyed is the bug. Translation is not the same as localization.
Follow-up grill 1
"Just translate queries to English at retrieval time" is a tempting shortcut. Defend not doing that.
Three reasons. (1) Latency. Translation on the hot path adds 100–300 ms; that's a meaningful chunk of a 1–2 s SLA. (2) Loss of meaning. Local-flavored phrasing ("isekai com protagonista pacífico" — isekai with a peaceful protagonist) doesn't translate cleanly; subtle meaning gets dropped. The user's intent is in their language; translating to English to search English fields biases the result toward English-only documents. (3) Index-side bias. If the strategy is "translate query to EN, search EN fields," the system never learns to use the rich PT/ES/JA fields the catalog already has. The fields exist precisely so users can match them in their native language. Throwing that away is wasteful.
The deeper objection: translation-on-query is a workaround for not having pair-keyed eval. With pair-keyed eval, you can measure whether cross-lingual retrieval works directly, and invest in it if it doesn't. Without pair-keyed eval, you have no choice but the workaround.
Follow-up grill 2
Your eval pipeline uses back-translation BLEU as an adequacy check. BLEU is a famously imperfect metric. Why use it?
For its purpose, it's adequate. BLEU is bad as a primary translation quality metric for fluency or naturalness, but it's reasonable as a guardrail for adequacy on short queries — it correlates well enough with "did the meaning survive" to catch the worst translations. The 0.65 threshold isn't a quality target; it's a "this has lost enough meaning that someone should look at it" tripwire. Items above 0.65 go in; items below get human review.
For higher-stakes assessments (whether the labels themselves are right), I'd add: (1) a human review of a sample, calibrated weekly; (2) cross-vendor diff (translate the same query through two vendors, flag large disagreements); (3) downstream effect monitoring — if a vendor change drops pt→ja recall, the translation pipeline is a suspect even before BLEU flags it. BLEU is a cheap top-of-funnel filter, not the verdict.
Follow-up grill 3
You said per-pair fusion weights are learned. How? You don't have enough data per pair to train a learning-to-rank model from scratch.
Right — for thin pairs you can't fit a per-pair model. Two practical patterns. (1) Bayesian update of fusion weights. Start with a sensible prior (e.g., {BM25: 0.5, kNN: 0.4, alias: 0.1} for diagonal pairs; {BM25: 0.2, kNN: 0.6, alias: 0.2} for off-diagonal). Update weights from cell-level eval data using a small grid search around the prior. With 200 labeled pairs in a cell, you can move weights confidently in 0.05 increments and check cell-level recall improves. (2) Borrow strength across similar pairs. Pairs that share a language family (es, pt, fr, it) share fusion behavior approximately. A hierarchical model with per-pair weights regularized toward a per-family weight gives you stable estimates on small cells.
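A minimal sketch of the grid search around a prior, assuming a recall_at_10(weights, eval_cell) helper that runs the fused retriever over one cell's labeled pairs; the step size and span are illustrative:

```python
import itertools

def tune_cell_weights(prior, eval_cell, recall_at_10, step=0.05, span=2):
    """Search a small grid of weight vectors around the prior for one language-pair cell."""
    signals = list(prior)
    offsets = [step * i for i in range(-span, span + 1)]
    best_weights, best_recall = dict(prior), recall_at_10(prior, eval_cell)
    for deltas in itertools.product(offsets, repeat=len(signals)):
        candidate = {s: max(0.0, prior[s] + d) for s, d in zip(signals, deltas)}
        total = sum(candidate.values())
        if total == 0:
            continue
        candidate = {s: w / total for s, w in candidate.items()}  # renormalize to sum to 1
        r = recall_at_10(candidate, eval_cell)
        if r > best_recall:
            best_weights, best_recall = candidate, r
    return best_weights, best_recall
```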
In production you don't typically need a learned reranker per pair — you need a calibrated reranker (the cross-encoder is multilingual, so it gives meaningful scores on all pairs) and pair-specific score thresholds for filtering. The learned piece is small and stable; the multilingual encoder does the heavy lifting.
Follow-up grill 4
A new locale launches every quarter. Your matrix grows quadratically with cells, golden sets, vendors, and team time. How do you keep the engineering cost from exploding?
Three knobs. (1) Prioritize cells by query volume. Of the 16 cells in a 4-locale matrix, the diagonal 4 carry maybe 70% of traffic, and a handful of off-diagonal cells carry another 25%. Most cells are in the long tail and need basic monitoring, not deep investment. (2) Reuse labels across locales for intent-level eval. Some labeled pairs (e.g., synonym handling, typo robustness) are language-agnostic enough that English-only labels carry over with minor translation. Don't re-label things that don't actually depend on language. (3) Centralize tooling, decentralize content. The golden-set builder, translation pipeline, and eval harness are one team's responsibility; per-locale content (aliases, edge cases, escalation triage) is owned by per-locale ops. The work that scales linearly stays with locales; the work that scales quadratically stays with platform.
The real defense is not trying to make every cell perfect. Aggregate recall@10 ≥ 0.92 with diagonal cells ≥ 0.92 and every off-diagonal cell ≥ 0.80 is a far healthier bar than "all cells ≥ 0.92," which is unachievable on long-tail cells.
Architect-level escalation 1
A user complaint reveals a category of failure: religious / culturally-sensitive content. A title is correctly indexed in one locale and mis-classified in another (e.g., a same-sex romance correctly tagged in EN/ES, suppressed by JP regional editors). The retriever returns the right title in EN but censors it in JP. The user is in a third locale (BR) but reading the JP edition. What's the right architecture?
This is a content-policy question riding on a localization architecture, and the right framing is policy is per-locale, ground truth carries per-locale visibility, and the user's served-locale is a routing key. Concretely:
(1) Per-locale visibility flags on documents. Every chunk carries visible_in: ["en", "es", ...]. The index respects them at retrieval time. This is not censorship hidden in the model; it's a versioned policy edge (a filtering sketch appears at the end of this answer).
(2) User-served-locale, not query-locale, gates the visibility filter. A BR user reading the JP edition is served by BR rules — the rules of the user's home locale govern, regardless of which catalog edition they're browsing. This is a deliberate stance and it's auditable.
(3) Eval has visibility-aware cohorts. A title that's visible_in: [es] should not surface in EN eval. The golden set must reflect this. Otherwise eval will flag "missing title" for users where the title was correctly hidden.
(4) Conflicts are escalated, not silently suppressed. When a query strongly matches a title that the user's locale has hidden, the bot says "I can't show you this in your region" rather than pretending the title doesn't exist. That's both more honest and operationally cheaper — users who hit the boundary stop searching for the same thing repeatedly.
The deeper architecture commitment: localization isn't just translation, it's policy. Ground truth in a multi-locale catalog is a (content, visibility-set, locale-applicability) triple — closely analogous to scenario 02 (policy versioning), and the same pattern (manifest-driven, versioned, judged-at-serve) applies.
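A minimal sketch of points (1) and (2): visibility is a per-chunk, per-locale flag, and the user's served locale, not the query language, applies it. The chunk shape and field names are assumptions:

```python
def visible_to(chunk: dict, served_locale: str) -> bool:
    """The user's served locale gates visibility, regardless of query or edition language."""
    return served_locale in chunk.get("visible_in", [])

def split_results(chunks: list, served_locale: str):
    """Split retrieved chunks into servable results and locale-hidden matches.
    Hidden strong matches drive an explicit 'not available in your region' message
    rather than being silently dropped."""
    servable = [c for c in chunks if visible_to(c, served_locale)]
    hidden = [c for c in chunks if not visible_to(c, served_locale)]
    return servable, hidden
```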
Architect-level escalation 2
Six months from now, the company is testing a real-time machine-translation feature: any review or synopsis is auto-translated to the user's locale on the fly. This means every document chunk now has effectively N language variants, generated dynamically. How does this break — or improve — your eval architecture?
Two sides.
What it breaks. Pair-keyed eval assumed (query_lang, doc_lang) was static. With on-the-fly translation, doc_lang varies per session. The "intended chunk" identifier is the same (it's a pointer to a source chunk), but the served text is now an MT artifact. Eval that grades "did we serve a relevant chunk" still works on the pointer; eval that grades "was the served text faithful to the source" is a new dimension. You need a translation-quality eval running in parallel — every served MT artifact gets a sample-based adequacy check (back-translation BLEU, or human spot-check), independent of retrieval quality. A relevant chunk badly translated is a worse outcome than a less-relevant chunk well-translated, in many cases.
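A minimal sketch of that parallel fidelity check, assuming a back_translate callable and a bleu scorer; the sample rate and threshold are illustrative:

```python
import random

def sample_mt_adequacy(served_artifacts, back_translate, bleu,
                       sample_rate=0.02, threshold=0.65):
    """Spot-check served MT artifacts independently of retrieval quality.
    served_artifacts yields (source_text, mt_text, target_lang) tuples."""
    flagged = []
    for source_text, mt_text, target_lang in served_artifacts:
        if random.random() > sample_rate:
            continue  # only a small sample gets the expensive round-trip check
        back = back_translate(mt_text, source_lang=target_lang)
        score = bleu(source_text, back)
        if score < threshold:
            flagged.append({"source": source_text, "mt": mt_text,
                            "lang": target_lang, "back_bleu": score})
    return flagged
```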
What it improves. Cross-lingual cells get healthier because the user is served their own language. The off-diagonal cells in the matrix start to behave more like diagonal cells over time. The architecture simplifies on the retrieval side — you don't need per-locale translated synopses in the index; you have source-language synopses and translate at serve.
The new ground-truth question. "Correct chunk in the right language at the right level of fidelity" replaces "correct chunk." It adds a translation-quality axis to the eval. The judge gets a new responsibility: assessing whether the MT output is faithful. Like the rubric in scenario 04, the judge needs calibration on MT quality (a 200-pair human-labeled MT calibration set) before its scores are trustable. Without that calibration, you can't distinguish "the retriever returned a worse chunk" from "the chunk was right, the MT was bad."
The pattern that survives: ground truth is multi-faceted and per-facet calibrated. Each facet (relevance, language pair, MT fidelity) has its own data, its own metric, its own gate. The architectural cost is more pieces; the benefit is that you can actually answer "what changed and why" when something regresses.
Red-flag answers
- "Just use a multilingual model." (Doesn't measure cross-lingual gaps.)
- "Translate everything to English." (Loses local meaning, biases retrieval.)
- "Aggregate recall@10 is fine, ship it." (Cohort gaps hide.)
- "Auto-translate the golden set." (Vendor noise + drift, no calibration.)
- "Add more aliases." (Helps entity queries, doesn't fix descriptive queries.)
Strong-answer indicators
- Names ground truth as pair-keyed (query_lang, doc_lang).
- Treats translation as a measurable lossy transform with its own quality bar.
- Has per-pair fusion and per-cell promotion gates.
- Recognizes the matrix grows but most cells are long-tail and don't need deep investment.
- Sees the policy/visibility extension when content rules differ per locale.
- Anticipates that real-time MT introduces a translation-quality axis to eval.