US-MLE-05: Embedding Adapter Fine-Tuning + Re-Index
User Story
As an ML Engineer at Amazon-scale on the MangaAssist team,
I want to own a quarterly fine-tuning pipeline for the LoRA adapter on top of Amazon Titan Embeddings v2, the corresponding 5M-document blue/green re-index of the OpenSearch corpus, and the online query-encoding endpoint that serves every retrieval-bearing chatbot turn,
So that every downstream RAG and personalization consumer (US-MLE-02 cross-encoder reranker, US-MLE-06 recommendation two-tower, the seven MCP servers in RAG-MCP-Integration/) reads from a single, drift-corrected, bilingual JP/EN embedding space whose recall@10 contract is defended end-to-end and whose rollouts are atomic, reversible, and never cause a serving outage.
Acceptance Criteria
- LoRA adapter (r=16, alpha=32, on q/k/v projections, ~6M trainable params) trained on 200K InfoNCE triplets achieves recall@10 ≥ 0.92 on the JP+EN holdout (baseline: 0.86 with no adapter).
- Per-language recall@10 ≥ 0.90 on JP-only holdout AND ≥ 0.92 on EN-only holdout (no aggregate-hides-language regression).
- Cross-lingual recall@10 ≥ 0.78 on the EN-query→JP-doc and JP-query→EN-doc held-out sets (verifies cross-lingual alignment did not collapse).
- Quarterly full re-index of all 5M documents completes blue/green within 4 hours wall-clock, end-to-end (export → batch encode → bulk-load green → validate → atomic alias flip).
- New-document cold-start embedding latency ≤ 30s from catalog write to OpenSearch-indexable vector (real-time encoding path).
- Online query-encoding p95 ≤ 12ms on the SageMaker real-time endpoint (ml.g5.xlarge, FP16, batch ≤ 16, dynamic batching window 8ms).
- Online query-encoding p99 ≤ 25ms.
- Blue/green rollback (alias flip from green back to blue) executes in ≤ 5 minutes; the prior blue index is retained for 14 days.
- Spot-reclaim resilience: a fine-tune run survives ≤ 3 reclaims by resuming from S3 checkpoints written every 500 training steps; cumulative reclaim wall-clock exceeding 60 minutes triggers on-demand fallback.
- Drift detection emits CloudWatch alarm within 24h of input drift (PSI on query-token distribution > 0.2) and within 7 days of recall regression (Δ-recall@10 > 0.02 on canonical eval).
- Full-run training cost ≤ $150 on spot, ≤ $400 on-demand. Per-1M-document re-encode cost ≤ $90.
Architecture (HLD)
The Production Surface
The MangaAssist embedding service sits underneath every retrieval-bearing chatbot turn. Per RAG-MCP-Integration/09-rag-retrieval-pipeline-deep-dive.md, it produces dense vectors for:
- Online query encoding — every chat message that needs retrieval (~10M encode calls/day at the contracted traffic mix; with the embedding cache hit rate of ~68% the underlying endpoint serves ~3.2M unique encodes/day).
- Online new-document encoding — when the catalog ingests a new manga / manhwa / manhua title, when a new policy doc lands in S3, or when a moderated review passes US-MLE-07's spam filter, the document needs an embedding before it can be retrieved (~80K writes/day).
- Batch document encoding — quarterly full re-encode of the entire 5M-document corpus when the LoRA adapter is rotated, plus a nightly delta re-encode for documents whose embedding_model_version is stale (typically ~5–20K docs/night during transitional periods).
The base model is Amazon Titan Embeddings v2 (1024-dim output, ~6B-parameter base, frozen). We adapt it via a LoRA adapter (rank 16, alpha 32, on the q/k/v attention projections of the top 12 transformer blocks) which exposes ~6M trainable parameters on top of the frozen 6B base. The adapter is fine-tuned on (query, positive_doc, negative_doc) triplets with InfoNCE loss; the positive comes from click-through and editorial-curated pairs; negatives come from in-batch hard mining and the dedicated hard-negative mining job described below. The adapter sits as a separate file (~24MB) loaded alongside the frozen base; the design decision (separate adapter vs merged weights) is discussed in §3.
The adapter is the primary lever for fixing the per-language and per-domain recall regressions that drove this story. From the partner drift story (Ground-Truth-Evolution/ML-Scenarios/07-embedding-category-expansion.md), the Titan-v2-without-adapter baseline places manhwa items near genre-matching manga rather than near other manhwa; the adapter is what closes that gap with a multi-domain ground-truth construction.
End-to-End ML Lifecycle Diagram
flowchart TB
subgraph DATA[Data Layer - Triplet Construction]
L1[Click-Through Logs<br/>retrieval_logs Iceberg<br/>~50M impressions/d]
L2[Editorial Pairs<br/>EN ~50K, JP ~30K<br/>+ manhwa ~10K Y1]
L3[LLM-Distilled Pairs<br/>Sonnet generates Q-D pairs<br/>for cold-start genres]
L4[Synthetic Positives<br/>Review-corpus paraphrases<br/>via T5-paraphrase]
L5[Hard-Negative Miner<br/>top-50 retrieved, no-click<br/>~200K/d]
L6[Triplet Store<br/>Iceberg on S3<br/>200K/run sampled]
L1 --> L5 --> L6
L2 --> L6
L3 --> L6
L4 --> L6
end
subgraph TRAIN[Distributed Training - SageMaker]
T1[1. Data Validation<br/>per-domain coverage<br/>per-language IAA]
T2[2. Triplet Materialization<br/>PIT-correct text snapshot]
T3[3. Stratified Sampling<br/>by domain + language]
T4[4. FSDP Training<br/>p4d.24xl x 2<br/>16 A100 80GB]
T5[5. Offline Eval<br/>5 modes + cross-lingual]
T6[6. Slice Analysis<br/>per-domain per-language]
T7[7. Adapter Registration<br/>~24MB LoRA file]
T1 --> T2 --> T3 --> T4 --> T5 --> T6 --> T7
end
subgraph REINDEX[Blue Green Re-Index - 5M Docs]
R1[Export Doc Corpus<br/>OpenSearch -> S3<br/>~30 min]
R2[Batch Transform<br/>p4d.24xl x 2 FSDP<br/>~3h, 5M docs]
R3[Bulk Load Green Index<br/>OpenSearch _bulk<br/>~25 min]
R4[Validation<br/>recall@10 + smoke<br/>~10 min]
R5{Gate}
R6[Atomic Alias Flip<br/>blue -> green]
R7[Retain Blue 14d<br/>rollback target]
R1 --> R2 --> R3 --> R4 --> R5
R5 -->|pass| R6 --> R7
R5 -.fail.-> RB[Hold; investigate]
end
subgraph SERVE[Online Serving]
S1[Query Encode<br/>SM real-time<br/>ml.g5.xlarge]
S2[New-Doc Encode<br/>SM real-time<br/>same endpoint]
S3[Embedding Cache<br/>ElastiCache 24h TTL<br/>~68% hit]
S4[OpenSearch<br/>kNN + BM25<br/>5M docs]
S1 --> S3
S2 --> S4
S3 --> S4
end
subgraph DRIFT[Drift Hub]
D1[PSI on query tokens<br/>5-min cadence]
D2[Recall@10 nightly<br/>vs canonical eval]
D3[Per-domain centroid<br/>weekly]
D1 --> CW[CloudWatch Alarms]
D2 --> CW
D3 --> CW
CW -.triggers.-> T1
end
L6 --> T1
T7 --> R1
T7 -.adapter file.-> S1
T7 -.adapter file.-> S2
R6 -.serves from.-> S4
S1 -.encode logs.-> D1
S4 -.recall measurements.-> D2
style L6 fill:#9cf,stroke:#333
style T4 fill:#fd2,stroke:#333
style R2 fill:#fd2,stroke:#333
style R6 fill:#2d8,stroke:#333
style S3 fill:#9cf,stroke:#333
style D2 fill:#fd2,stroke:#333
Data Contracts and Volume
| Asset | Schema Version | Snapshot Cadence | Volume | Owner |
|---|---|---|---|---|
| retrieval_logs Iceberg | log_v4 | Continuous (Kinesis Firehose) | ~50M impressions/d, retained 30d hot, 12mo Glacier | Data Platform |
| editorial_pairs Iceberg | pair_v3 | Weekly batch from editorial CMS | EN ~50K, JP ~30K, manhwa ~10K (Y1 target), manhua ~5K | Editorial PM |
| triplets Iceberg | triplet_v2 | Per-run (200K/run) | 200K rows/run × 4 runs/yr + ad-hoc | ML Eng (this story) |
| embedding_adapter model package | adapter_v3 | Quarterly | ~24MB LoRA file, ~6M trainable params | ML Eng (this story) |
| manga-catalog OpenSearch index | index_v7 (live) | Quarterly full + nightly delta | 5M docs × 1024-dim FP16 vectors ≈ 20GB embeddings + raw fields | ML Eng (this story) |
| embedding_eval_canonical Iceberg | eval_v3 | Quarterly refresh | 8K labeled (q, d, relevant) tuples; 6K EN, 4K JP, 1K cross-lingual, 1.5K per-domain (manhwa/manhua) | Editorial + ML Eng |
| embedding_drift_metrics CloudWatch | n/a | 5 min input; daily recall | ~80K data points/d | Drift Hub |
Model Registry + Promotion Gates
flowchart LR
A47[adapter_v3<br/>prod, label_v=1840]:::prod
A48[adapter_v4<br/>candidate, label_v=2105]:::cand
A48 --> G1{Stage 1<br/>Offline Gate<br/>recall + slice}
G1 -->|pass| G2{Stage 2<br/>Re-Index Green<br/>5M docs}
G1 -.fail.-> RB1[Reject]
G2 -->|pass| G3{Stage 3<br/>Online Eval<br/>shadow query path}
G2 -.fail.-> RB2[Discard green]
G3 -->|pass| G4{Stage 4<br/>Atomic Alias Flip<br/>blue -> green}
G3 -.fail.-> RB3[Discard green]
G4 --> R14[Retain Blue 14d<br/>rollback target]
classDef prod fill:#2d8,stroke:#333
classDef cand fill:#fd2,stroke:#333
style RB1 fill:#f66,stroke:#333
style RB2 fill:#f66,stroke:#333
style RB3 fill:#f66,stroke:#333
style G4 fill:#2d8,stroke:#333
The four-stage gate is the standard from deep-dives/00-foundations-and-primitives-for-ml-engineering.md §5.1, wrapped in the blue/green pattern from §5.2 because this story serves a stateful index (the embedding rotation requires the entire 5M-document corpus to be re-encoded before the new adapter can serve a single query). Story-specific thresholds:
- Stage 1 (offline): recall@10 ≥ 0.92 on JP+EN holdout, per-language ≥ 0.90 JP / ≥ 0.92 EN, cross-lingual ≥ 0.78, no per-domain regression > 0.02 on existing domains.
- Stage 2 (re-index green): wall-clock ≤ 4h end-to-end; recall@10 on the same canonical eval set, measured against the green index, ≥ the Stage 1 measurement minus 0.005 (allows for vector quantization in OpenSearch).
- Stage 3 (online eval): a 24h shadow query path that issues each production query through both blue and green index paths in parallel; per-query rank-overlap@10 between blue and green ≥ 0.6 (low because the adapter is supposed to change rankings) AND per-language slice rank-overlap within ±0.05 of the global rate.
- Stage 4 (atomic alias flip): SLA 5 minutes; blue retained 14 days as rollback target.
Low-Level Design
1. Triplet / Data Pipeline
The triplet generation is the most label-engineering-heavy step in this story. Per primitive §1.1 we use four label sources, none of which alone is sufficient:
- Click-through derived positives (~70% of training positives): a query and the document the user eventually clicked (and did not bounce back from) form a (q, d+) pair. Subject to position bias; corrected with IPS weighting.
- Editorial-curated pairs (~15%): hand-curated similar-document pairs from the editorial CMS. The lift here is per-domain coverage — manhwa and manhua editorial pairs are the bootstrap signal for the new domains, per the partner drift story.
- LLM-distilled query-doc pairs (~10%): for cold-start coverage of new genres and rare query patterns, Sonnet generates (synthetic_query, real_doc) pairs over the catalog. Capped at 25% of any single domain's positives to prevent inheriting LLM biases.
- Synthetic positives from review paraphrases (~5%): T5-paraphrase generates near-paraphrases of existing review-corpus snippets to densify the positive set; only used as auxiliary positives, never as the primary source for any (q, d) entry.
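The source-mix percentages and the 25% per-domain cap on LLM-distilled positives are enforced when the run's 200K triplets are materialized. A minimal sketch of that composition step, assuming a hypothetical pairs_by_source input keyed by the four sources above:

# triplet_mix.py (illustrative sketch; pairs_by_source is a hypothetical input)
import random
from collections import defaultdict

SOURCE_MIX = {"click": 0.70, "editorial": 0.15, "llm_distilled": 0.10, "paraphrase": 0.05}


def compose_positives(pairs_by_source: dict[str, list], n_total: int) -> list:
    """pairs_by_source maps source name -> list of (query, positive_doc, domain)."""
    chosen, non_llm_per_domain, llm_per_domain = [], defaultdict(int), defaultdict(int)
    # Fill LLM-distilled pairs last so the 25% per-domain cap can be checked
    # against the positives already selected from the other three sources
    # (llm <= 25% of a domain's total is equivalent to llm <= non_llm / 3).
    for source in ("click", "editorial", "paraphrase", "llm_distilled"):
        quota = int(n_total * SOURCE_MIX[source])
        pool = random.sample(pairs_by_source[source], min(quota, len(pairs_by_source[source])))
        for query, doc, domain in pool:
            if source == "llm_distilled":
                if llm_per_domain[domain] + 1 > non_llm_per_domain[domain] / 3:
                    continue
                llm_per_domain[domain] += 1
            else:
                non_llm_per_domain[domain] += 1
            chosen.append((query, doc, domain))
    return chosen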
The hard-negative mining job runs nightly and is the operational core of training quality:
# hard_negative_miner.py
from dataclasses import dataclass
from datetime import datetime

import boto3
import numpy as np
from opensearchpy import OpenSearch

# IcebergClient and stratified_sample are internal platform helpers
# (Iceberg table scans and domain/language-stratified sampling).


@dataclass
class HardNegative:
    query_id: str
    query_text: str
    doc_id: str                    # the negative doc
    doc_text: str
    rank_seen: int                 # rank in the top-50 retrieved
    embedding_model_version: str   # to allow PIT replay
    captured_at: datetime


class HardNegativeMiner:
    """Mines (query, doc-) pairs from production retrieval logs.

    A 'hard negative' is a document that the current production embedding model
    placed in the top-50 retrieved, but that the user neither clicked nor dwelled
    on. The model was confident it was relevant; the user disagreed. These are
    the cases the next training run needs to learn.
    """

    def __init__(self, region: str = "ap-northeast-1"):
        self.os = OpenSearch([{"host": "search-manga-prod.aoss.amazonaws.com", "port": 443}])
        self.iceberg = IcebergClient(region=region)

    def mine(self, since_ts: datetime, limit: int = 200_000) -> list[HardNegative]:
        impressions = self.iceberg.scan(
            table="retrieval_logs",
            where=f"captured_at > '{since_ts.isoformat()}' AND rank <= 50 "
                  f"AND clicked = false AND dwell_ms < 1500",
        )
        candidates = []
        for imp in impressions:
            # Filter co-purchase noise: ignore impressions where the user
            # later clicked a near-duplicate document in the same session.
            if self._has_near_duplicate_click(imp):
                continue
            candidates.append(HardNegative(
                query_id=imp.query_id, query_text=imp.query_text,
                doc_id=imp.doc_id, doc_text=self._fetch_doc_text(imp.doc_id),
                rank_seen=imp.rank, embedding_model_version=imp.model_version,
                captured_at=imp.captured_at,
            ))
        # Stratify by domain + language so the mined negatives don't skew
        # toward whatever domain dominates production traffic.
        return stratified_sample(candidates, k=limit, strata=["domain", "language"])
The miner has caught two systematic issues during pilot runs: (a) reviews from spam-clean reviewers being mined as negatives because they were retrieved-but-not-clicked despite being highly relevant — fixed by filtering out impressions where US-MLE-07's spam confidence was < 0.05, and (b) JP-region cross-lingual queries (e.g., Japanese user typing the EN title) that clicked the JP doc but rendered the EN doc as a negative — fixed by adding a language_aware=True flag that skips cross-lingual impressions during single-language negative mining.
The PIT correctness contract (primitive §2.1) applies here: the doc's text is read from the catalog's snapshot at as_of=impression.captured_at, never the current text, because catalog text mutates (description rewrites, tag updates) and using the current text leaks future information into the training signal.
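A minimal sketch of what the PIT-correct hydration inside _fetch_doc_text looks like, assuming a hypothetical catalog_snapshots helper that exposes as-of reads over the Iceberg catalog table:

# pit_fetch.py (illustrative; catalog_snapshots.read_as_of is a hypothetical helper)
from datetime import datetime


def fetch_doc_text_pit(catalog_snapshots, doc_id: str, as_of: datetime) -> str:
    """Read the document text as it existed at impression time, never the
    current text, so later description rewrites cannot leak into training."""
    row = catalog_snapshots.read_as_of(doc_id=doc_id, as_of=as_of)
    return row["description"] + "\n" + " ".join(row["tags"])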
2. Distributed Training (FSDP on p4d.24xlarge × 2)
The training step is the single most compute-intensive step in any of the 8 ML Engineer stories, justifying its own subsection. The choice of FSDP (Fully Sharded Data Parallel) over DDP (Distributed Data Parallel) is forced by the parameter count:
- Titan v2 base: ~6B parameters in FP16 = ~12GB activations + ~12GB params ≈ 24GB minimum per replica.
- A100 80GB has headroom but: optimizer state for the LoRA adapter (Adam: 2 × params for momentum and variance), activation checkpointing memory, and batch-level activations for batch size 256 push DDP-per-GPU memory to ~70GB.
- DDP would replicate the full 6B base on every GPU and shard only the data. On 16 A100s, that wastes 16 × 12GB = 192GB of duplicated model weights. Workable but inefficient.
- FSDP with SHARD_GRAD_OP (ZeRO Stage 2 equivalent) shards the optimizer state and gradients across GPUs; the model parameters are still replicated on every rank.
- FSDP with FULL_SHARD (ZeRO Stage 3 equivalent) shards the parameters too — every GPU holds only 1/16th of the base at any moment; parameters are gathered for the current layer and freed after the backward pass.
We use FULL_SHARD for two reasons: (a) the LoRA adapter q/k/v projections are inserted into every transformer block, so the gradient flow goes through the full base; sharding the base lets us push batch size to 384 instead of 256 and finish in ~2.3 hours/epoch instead of ~3.5; (b) FULL_SHARD is the safest path forward when the next iteration of this story considers a 13B base — the same code scales without rewriting.
# train_adapter.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy, BackwardPrefetch
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModel, AutoTokenizer
from peft import LoraConfig, get_peft_model


def setup_fsdp_model(local_rank: int) -> FSDP:
    # Frozen Titan v2 base, loaded in BF16 for activation memory savings.
    base = AutoModel.from_pretrained(
        "amazon/titan-embed-text-v2-base",
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
    )
    for p in base.parameters():
        p.requires_grad = False

    # LoRA adapter: r=16, alpha=32, on q/k/v of top 12 blocks.
    lora_config = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj"],
        layers_to_transform=list(range(20, 32)),  # top 12 of 32 blocks
        bias="none", task_type="FEATURE_EXTRACTION",
    )
    model = get_peft_model(base, lora_config)

    # Verify only ~6M params are trainable.
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    assert 5_500_000 < trainable < 6_500_000, f"Trainable param count {trainable} off"

    # FSDP wrap policy: wrap each transformer block individually so that
    # the all-gather granularity matches the layer boundary. This is the
    # standard pattern for transformer FSDP and gives the best overlap
    # of compute and communication. TitanTransformerBlock is the base
    # model's transformer-block class.
    auto_wrap_policy = lambda module, recurse, nonwrapped_numel: \
        transformer_auto_wrap_policy(
            module, recurse, nonwrapped_numel,
            transformer_layer_cls={TitanTransformerBlock},
        )

    model = FSDP(
        model,
        sharding_strategy=ShardingStrategy.FULL_SHARD,  # ZeRO-3
        auto_wrap_policy=auto_wrap_policy,
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16,
            reduce_dtype=torch.float32,  # gradient reduction in FP32 for stability
            buffer_dtype=torch.bfloat16,
        ),
        backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
        device_id=local_rank,
        use_orig_params=True,    # required for LoRA + FSDP compatibility
        limit_all_gathers=True,  # avoid OOM from concurrent all-gathers
    )
    return model


def info_nce_loss(q_emb, p_emb, n_emb_batch, temperature=0.05):
    """InfoNCE with in-batch negatives.

    For each anchor query, its own positive doc is the positive; every other
    document in the batch is an in-batch negative. The dedicated hard
    negatives (n_emb_batch) are appended on top.
    """
    q = torch.nn.functional.normalize(q_emb, dim=-1)
    p = torch.nn.functional.normalize(p_emb, dim=-1)
    n = torch.nn.functional.normalize(n_emb_batch, dim=-1)
    # Positives + in-batch negatives + dedicated hard negatives all in one matrix.
    all_docs = torch.cat([p, n], dim=0)
    logits = (q @ all_docs.T) / temperature
    labels = torch.arange(q.size(0), device=q.device)
    return torch.nn.functional.cross_entropy(logits, labels)


def train_loop(model, dataloader, optimizer, ckpt_dir: str):
    # No GradScaler needed: BF16 autocast does not require loss scaling.
    step = 0
    for epoch in range(3):
        for batch in dataloader:
            with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
                q_emb = model(**batch["query"]).pooler_output
                p_emb = model(**batch["positive"]).pooler_output
                # Hard negatives arrive as [anchors, n_neg, seq]; flatten the
                # first two dims so they pass through the encoder as one batch.
                neg_inputs = {k: v.view(-1, v.size(-1)) for k, v in batch["hard_negatives"].items()}
                n_emb = model(**neg_inputs).pooler_output
                loss = info_nce_loss(q_emb, p_emb, n_emb, temperature=0.05)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
            step += 1
            # Spot-reclaim resilient: checkpoint every 500 steps (rank 0 only;
            # save_fsdp_checkpoint is the platform's FSDP state-dict helper).
            if step % 500 == 0 and dist.get_rank() == 0:
                save_fsdp_checkpoint(model, optimizer, ckpt_dir, step)
The training run profile on p4d.24xlarge × 2 (16 × A100 80GB):
| Metric | Value |
|---|---|
| Effective per-device batch | 24 (queries) + 24 (positives) + 192 (hard negatives) = 240 examples |
| Global batch (16 GPUs) | 384 anchors × 9 docs each (1 positive + 8 hard negatives) = 3,456 doc forward passes |
| Sequence length | 256 tokens (queries), 512 tokens (docs) |
| Time per epoch | ~2.3h on spot |
| Wall-clock (3 epochs) | ~7h with no reclaim, ~7.6h with 1 reclaim |
| FSDP activation memory | ~38GB / 80GB available per A100 |
| FSDP communication overhead | ~18% of step time (all-gather + reduce-scatter) |
| On-demand instance pricing (ap-northeast-1) | $32.77/hr × 2 nodes = $65.54/hr |
| Cost per full run (spot) | ~$120 with no reclaim, capped at ~$150 with 1 reclaim |
| Cost per full run (on-demand) | ~$340 |
| Reclaim frequency observed | ~12% of runs see ≥1 reclaim |
Spot reclaim handling. Per primitive §3.2, every 500 steps the trainer writes an FSDP checkpoint to S3 (s3://manga-ml-checkpoints-apne1/embedding-adapter/run_id/). On reclaim, SageMaker auto-retries on a new instance pool; the trainer's resume-loader rehydrates the FSDP shards and the optimizer state from S3. Cumulative reclaim time is tracked; if it exceeds 60 minutes, the next retry is forced to on-demand via use_spot_instances=False in the resubmit, ensuring the Friday quarterly window is met even in adverse spot conditions.
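A sketch of the resume-or-fallback decision, assuming hypothetical list_checkpoints and submit_training_job helpers around the SageMaker training API:

# spot_resume.py (illustrative; list_checkpoints / submit_training_job are hypothetical helpers)
CKPT_ROOT = "s3://manga-ml-checkpoints-apne1/embedding-adapter/"
MAX_CUMULATIVE_RECLAIM_MIN = 60


def resume_or_fallback(run_id: str, cumulative_reclaim_min: float) -> dict:
    """Decide how to resubmit a reclaimed training job."""
    latest = max(list_checkpoints(f"{CKPT_ROOT}{run_id}/"), default=None,
                 key=lambda c: c.step)
    use_spot = cumulative_reclaim_min <= MAX_CUMULATIVE_RECLAIM_MIN
    return submit_training_job(
        run_id=run_id,
        resume_from=latest.s3_uri if latest else None,  # rehydrate FSDP + optimizer shards
        use_spot_instances=use_spot,                    # on-demand fallback past 60 min
    )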
3. LoRA Adapter — Separate File vs Merged Weights
The adapter is published as a separate ~24MB LoRA file (adapter_model.safetensors + adapter_config.json) loaded alongside the frozen ~12GB Titan v2 base, not merged into the base. This is a deliberate tradeoff:
| Property | Separate adapter file | Merged into base |
|---|---|---|
| Inference latency | +~0.3ms per forward pass (extra matmul on q/k/v projections) | Equal to base latency |
| Adapter swap time on the endpoint | ~50ms (load 24MB file, swap pointer; no model re-init) | Full endpoint redeploy (~6 min for 12GB model) |
| Memory footprint | 12GB base + 24MB adapter = 12.024GB | 12GB merged (no savings — base copy is identical) |
| A/B test of two adapters in parallel | Trivial (load both adapters, route requests to one) | Requires two endpoints (2× cost) |
| Cold-start time | Identical (base dominates load time) | Identical |
| Operational complexity | Adapter version pinned in registry independent of base | Base version + adapter version locked together |
We chose the separate adapter file for two operational reasons. (a) We need to A/B test two adapters in parallel during the Stage 3 online-eval shadow path; without separate adapters this requires standing up a second GPU endpoint just for shadow. (b) When we eventually rotate to a Titan v3 base, only one of (base_v3, adapter_v3) or (base_v3, adapter_v4) is the right path, and we want to test both without a 12GB-model redeploy each time. The +0.3ms latency cost is well within the 12ms p95 budget.
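Because the adapter is a separate file, serving two adapter variants from one endpoint reduces to peft's named-adapter API. A minimal sketch, assuming the adapter files are already synced to the illustrative local container paths shown:

# adapter_ab.py (illustrative; local paths and variant routing are assumptions)
from peft import PeftModel

ADAPTERS = {
    "v3": "/opt/ml/model/adapter_v3",
    "v4": "/opt/ml/model/adapter_v4",
}


def load_ab_model(base):
    # Load the production adapter, then attach the candidate under its own name.
    model = PeftModel.from_pretrained(base, ADAPTERS["v3"], adapter_name="v3")
    model.load_adapter(ADAPTERS["v4"], adapter_name="v4")
    return model


def encode_with(model, inputs, variant: str):
    # Switching adapters is a pointer swap (tens of ms), not an endpoint redeploy.
    model.set_adapter(variant)
    return model(**inputs)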
The trade is documented in the Runbooks/embedding-adapter-swap.md runbook because the adapter swap is the only fast-rollout path we have for a quality regression that doesn't involve a 4-hour re-index.
4. Re-Index Pipeline — The Headline Technical Pattern
Per primitive §5.2, the embedding adapter is the one ML system on the platform whose rollout requires re-indexing the entire stateful corpus before the new adapter can serve a single query. The 4-hour blue/green re-index is the headline pattern of this story.
flowchart TB
subgraph BLUE[Blue - Currently Live]
BI[OpenSearch Index<br/>manga-catalog-v7<br/>5M docs, adapter_v3 vectors]
BA[Index Alias<br/>manga-catalog -> v7]
end
subgraph EXPORT[Stage 1 - Export 30 min]
EX1[Scroll API<br/>extract doc_ids + raw text<br/>5M docs]
EX2[Write to S3<br/>1024 files of 5K docs each]
EX1 --> EX2
end
subgraph ENCODE[Stage 2 - Batch Encode 3h]
BT[SageMaker Batch Transform<br/>p4d.24xl x 2 nodes]
BT_FSDP[FSDP-distributed inference<br/>throughput ~ 1400 docs/sec/node]
BT_OUT[1024 output files<br/>doc_id + 1024-dim FP16 vector]
BT --> BT_FSDP --> BT_OUT
end
subgraph BUILD[Stage 3 - Build Green Index 25 min]
GR1[Create manga-catalog-v8<br/>same mapping, same shards 5]
GR2[OpenSearch _bulk<br/>10K docs/req<br/>parallel 8 connections]
GR1 --> GR2
end
subgraph VAL[Stage 4 - Validate 10 min]
V1[Smoke: 100 known queries<br/>vs canonical eval]
V2[Recall@10 measured<br/>>= 0.92 JP+EN]
V3[Cross-domain spot check<br/>manhwa->manhwa similarity]
V1 --> V2 --> V3
end
subgraph FLIP[Stage 5 - Atomic Flip 30 sec]
FL1[POST _aliases<br/>remove blue add green<br/>ATOMIC]
FL2[manga-catalog -> v8]
FL1 --> FL2
end
subgraph GREEN[Green - Now Live]
GI[OpenSearch Index<br/>manga-catalog-v8<br/>5M docs, adapter_v4 vectors]
end
subgraph RETAIN[Retain 14d]
RB[manga-catalog-v7<br/>read-only<br/>rollback target]
end
BI -.export.-> EX1
EX2 --> BT
BT_OUT --> GR2
GR2 --> V1
V3 -->|pass| FL1
V3 -.fail.-> DISCARD[Discard v8 index<br/>no traffic affected]
FL2 --> GI
BI -.kept.-> RB
style BI fill:#9cf,stroke:#333
style GI fill:#2d8,stroke:#333
style FL1 fill:#fd2,stroke:#333
style RB fill:#ddd,stroke:#333
Stage 1 — Export (30 min). The OpenSearch scroll API extracts (doc_id, raw_text_fields) for all 5M docs into 1024 partitioned S3 files (~5K docs per file). We do not export the existing vectors — they will be regenerated by the new adapter. The scroll runs at low priority (1 request/sec/shard) to avoid affecting live read traffic; all 5 shards in parallel give ~5K docs/sec sustained, completing in ~17 min plus ~13 min for slow tail-end shards.
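A sketch of the export stage using opensearch-py's scan helper; the S3 bucket name and the write_partition Parquet writer are illustrative:

# export_corpus.py (illustrative; write_partition is a hypothetical S3 Parquet writer)
from opensearchpy import OpenSearch, helpers

client = OpenSearch([{"host": "search-manga-prod.aoss.amazonaws.com", "port": 443}])


def export_corpus(index: str = "manga-catalog-v7", docs_per_file: int = 5_000) -> int:
    batch, file_idx, exported = [], 0, 0
    # helpers.scan wraps the scroll API; _source limits the export to raw text
    # fields only — existing vectors are regenerated by the new adapter.
    for hit in helpers.scan(client, index=index, query={"query": {"match_all": {}}},
                            _source=["doc_id", "title", "description", "tags"], size=1_000):
        batch.append(hit["_source"])
        exported += 1
        if len(batch) == docs_per_file:
            write_partition(f"s3://manga-reindex-apne1/export/part-{file_idx:04d}.parquet", batch)
            batch, file_idx = [], file_idx + 1
    if batch:
        write_partition(f"s3://manga-reindex-apne1/export/part-{file_idx:04d}.parquet", batch)
    return exported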
Stage 2 — Batch encode (3 hours). SageMaker Batch Transform on p4d.24xl × 2 nodes runs the new adapter forward over all 5M docs. FSDP-distributed inference (same FULL_SHARD config as training) gives ~1,400 docs/sec/node × 2 nodes = ~2,800 docs/sec sustained throughput → 5M / 2,800 ≈ 30 min compute-bound + ~2.5h overhead from instance startup, FSDP init, S3 read/write, and tail latency on the slowest shard. The output is 1024 files of (doc_id, embedding_fp16[1024]) in Parquet to S3.
Stage 3 — Bulk-load green index (25 min). A new index manga-catalog-v8 is created with the identical mapping as v7. The OpenSearch _bulk API ingests at 10K docs per request, parallelized over 8 connections, completing 5M docs in ~22 min. The index alias manga-catalog is not yet pointed at v8.
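A sketch of the green bulk load using opensearch-py's parallel_bulk helper; iter_encoded_docs, a reader over the Stage 2 Parquet output, is hypothetical:

# bulk_load_green.py (illustrative; iter_encoded_docs is a hypothetical reader)
from opensearchpy import OpenSearch, helpers

client = OpenSearch([{"host": "search-manga-prod.aoss.amazonaws.com", "port": 443}])


def bulk_load(green_index: str = "manga-catalog-v8") -> None:
    def actions():
        for doc in iter_encoded_docs():  # yields doc_id, embedding, raw fields
            yield {
                "_op_type": "index",
                "_index": green_index,
                "_id": doc["doc_id"],          # idempotent on doc_id
                "embedding": doc["embedding"],
                **doc["fields"],
            }
    # 8 parallel connections, 10K docs per _bulk request, matching the stage budget.
    for ok, item in helpers.parallel_bulk(client, actions(), thread_count=8, chunk_size=10_000):
        if not ok:
            raise RuntimeError(f"Bulk load failure: {item}")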
Stage 4 — Validate (10 min). A smoke test fires 100 canonical queries against v8 directly (using the index name, not the alias) and measures recall@10 against the canonical eval set. The gate condition is recall@10 ≥ 0.92 and per-language ≥ thresholds and no per-domain regression. A failed gate discards v8 and emits a SEV-3 page; v7 continues serving without interruption (this is the key safety property of blue/green — a failed candidate never sees real traffic).
Stage 5 — Atomic alias flip (30 sec). OpenSearch POST _aliases with both remove (alias → v7) and add (alias → v8) in a single request is atomic from the OpenSearch coordinator's perspective. The cutover is invisible to live readers — they continue using manga-catalog as the index name; the underlying physical index swaps under them. Read traffic on v7 drops to 0; read traffic on v8 ramps from 0 to 100% within the OpenSearch refresh interval (~30s).
# atomic_flip.py
from opensearchpy import OpenSearch


def flip_alias(client: OpenSearch, alias: str, old: str, new: str) -> None:
    """Atomic alias flip per the OpenSearch _aliases API contract.

    Both 'remove' and 'add' actions are applied in a single transaction;
    there is no observable window where neither index is aliased.
    """
    client.indices.update_aliases(body={
        "actions": [
            {"remove": {"index": old, "alias": alias}},
            {"add": {"index": new, "alias": alias}},
        ]
    })
Dual-write during build. Between Stage 1 (export) and Stage 5 (flip), new documents continue to land in the catalog (~80K writes/day = ~13K writes during a 4h re-index). These writes go to v7 and are queued in a Kinesis stream for re-encoding into v8; on Stage 5 flip, the consumer drains the queue using the new adapter and writes to v8. This dual-write pattern is the only reason the re-index can complete without freezing catalog writes for 4 hours. The runbook handles the rare case where a write lands during the 30-second flip window: writes are idempotent on doc_id, so a write that lands twice (once via v7's mirror, once via v8 direct) is safe.
# dual_write.py
from typing import Optional

# Document, the adapter handles, the encoder, and the Kinesis client are
# injected by the catalog service; they appear here only as attributes.


class CatalogWriter:
    """Writes new documents to both blue and green during a re-index."""

    def __init__(self, blue_alias: str, green_index: Optional[str]):
        self.blue_alias = blue_alias
        self.green_index = green_index  # None when no re-index in progress

    def write(self, doc: Document) -> None:
        # Always write to current production (blue alias).
        self._write_to(self.blue_alias, doc, model_version=self.adapter_v3)
        # If a re-index is in progress, mirror to green with the new adapter.
        if self.green_index is not None:
            embedding = self.encode(doc.text, adapter=self.adapter_v4)
            self._write_to(self.green_index, doc, embedding, model_version=self.adapter_v4)
            # Also enqueue for batch validation against the green index's
            # canonical eval; ensures dual-write is producing equivalent recall.
            self.kinesis.put_record(self.green_validation_stream, doc.doc_id)
Rollback semantics (≤ 5 min). If a quality regression appears post-flip, the inverse alias flip restores v7. The runbook is Runbooks/embedding-rollback.md. Because v7 is retained for 14 days as a read-only target, the rollback is a single POST _aliases call. Compared to the stateless model rollback in US-MLE-01 (60-second SageMaker traffic shift), the embedding rollback's 5-minute SLA is justified by the alias propagation across all OpenSearch coordinator nodes. Beyond 14 days, v7 is deleted; rollback then would require a fresh re-index from a v7-era checkpoint of the adapter.
5. Online Serving (Query + New-Document Encoding)
The same SageMaker real-time endpoint serves both online query encoding and online new-document encoding. The endpoint runs on ml.g5.xlarge (single A10G 24GB GPU); the 6B-base + 6M-adapter combo fits in BF16 with ~10GB headroom. Dynamic batching collects requests for up to 8ms or until 16 concurrent requests are in flight, whichever comes first.
# endpoint_handler.py
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from peft import PeftModel

# ModelRegistry and mean_pool are internal platform helpers (registry lookup
# and attention-mask-aware mean pooling).


class EmbeddingHandler:
    def __init__(self):
        # Loaded once on container init.
        self.tokenizer = AutoTokenizer.from_pretrained("amazon/titan-embed-text-v2-base")
        base = AutoModel.from_pretrained(
            "amazon/titan-embed-text-v2-base",
            torch_dtype=torch.bfloat16,
            attn_implementation="flash_attention_2",
        ).eval().cuda()
        # Adapter loaded as a separate file; look up the latest from the registry.
        self.adapter_version = ModelRegistry.get_latest("embedding_adapter").version
        self.model = PeftModel.from_pretrained(
            base, f"s3://manga-ml-models-apne1/embedding-adapter/{self.adapter_version}"
        ).eval()

    @torch.inference_mode()
    def encode(self, texts: list[str], max_length: int = 256) -> np.ndarray:
        # max_length is 256 for queries and 512 for docs, selected by a
        # client-supplied flag in the request envelope.
        inputs = self.tokenizer(
            texts, max_length=max_length, padding=True, truncation=True, return_tensors="pt"
        ).to("cuda")
        outputs = self.model(**inputs)
        # Mean-pooling over non-pad tokens, then L2-normalize.
        emb = mean_pool(outputs.last_hidden_state, inputs.attention_mask)
        emb = torch.nn.functional.normalize(emb, p=2, dim=1)
        return emb.float().cpu().numpy()
The endpoint is fronted by an embedding cache (ElastiCache Redis, 24h TTL, ~68% hit rate per RAG-MCP-Integration/09-rag-retrieval-pipeline-deep-dive.md); only cache misses reach the endpoint. The cache is keyed on (adapter_version, sha256(text)), so adapter rotation invalidates it cleanly without a manual flush — old keys expire, new keys populate, and there is no risk of stale-vector serving.
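A cache-aside sketch of that key scheme, assuming a redis-py client in front of the handler above; the cache hostname is illustrative:

# embedding_cache.py (illustrative; hostname is an assumption)
import hashlib
import json
import redis

cache = redis.Redis(host="embedding-cache.apne1.cache.amazonaws.com", port=6379)
TTL_SECONDS = 24 * 3600


def cached_encode(handler, text: str) -> list[float]:
    # Key includes adapter_version so a rotation naturally ages out old entries.
    key = f"emb:{handler.adapter_version}:{hashlib.sha256(text.encode()).hexdigest()}"
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    vector = handler.encode([text])[0].tolist()
    cache.setex(key, TTL_SECONDS, json.dumps(vector))  # 24h TTL
    return vector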
Autoscaling on the endpoint:
# autoscaling.py
import boto3

client = boto3.client("application-autoscaling", region_name="ap-northeast-1")

client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/embedding-svc/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=2, MaxCapacity=20,
)
client.put_scaling_policy(
    PolicyName="EmbeddingTargetTracking",
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/embedding-svc/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
        },
        "ScaleInCooldown": 600,
        "ScaleOutCooldown": 60,
    },
)
6. Offline Evaluation (5 modes + cross-lingual)
Per primitive §4.1, the offline-eval step exercises all five modes plus a story-specific cross-lingual mode:
# offline_eval.py
import numpy as np

# EvalReport, GoldenReport, CrossLingualReport, PerDomainReport are report
# dataclasses defined alongside the evaluator; only the story-specific checks
# are shown in full, the other eval modes follow the shared primitive.


class EmbeddingOfflineEvaluator:
    def __init__(self, model, baseline_model, eval_set, baseline_index, candidate_index):
        self.model = model
        self.baseline = baseline_model
        self.eval_set = eval_set              # 8K labeled (q, d, relevant)
        self.baseline_idx = baseline_index    # blue (v7)
        self.candidate_idx = candidate_index  # green (v8)

    def evaluate(self) -> EvalReport:
        return EvalReport(
            golden=self.eval_golden(),
            slice=self.eval_slice(),
            adversarial=self.eval_adversarial(),
            counterfactual=self.eval_counterfactual_replay(),
            offline_online_corr=self.eval_offline_online_corr(),
            cross_lingual=self.eval_cross_lingual(),  # story-specific
            per_domain=self.eval_per_domain(),        # story-specific
        )

    def eval_golden(self) -> GoldenReport:
        """8K (q, d, relevant) tuples, refreshed quarterly. Recall@10."""
        recalls = []
        for q, relevant_docs in self.eval_set.iter_queries():
            retrieved = self.candidate_idx.knn_search(q, k=10)
            recalls.append(len(set(retrieved) & set(relevant_docs)) / len(relevant_docs))
        return GoldenReport(recall_at_10=np.mean(recalls))

    def eval_cross_lingual(self) -> CrossLingualReport:
        """EN query -> JP doc and JP query -> EN doc held-out sets.

        The 1K cross-lingual pairs are hand-curated parallel-summary pairs
        from the editorial team. Recall@10 on these pairs is the explicit
        check that cross-lingual alignment didn't collapse during fine-tuning.
        """
        en_to_jp = self.eval_set.cross_lingual(direction="en2jp")
        jp_to_en = self.eval_set.cross_lingual(direction="jp2en")
        return CrossLingualReport(
            en2jp_recall=self._compute_recall(en_to_jp),
            jp2en_recall=self._compute_recall(jp_to_en),
            symmetric=abs(self._compute_recall(en_to_jp) - self._compute_recall(jp_to_en)) < 0.05,
        )

    def eval_per_domain(self) -> PerDomainReport:
        """Per-domain recall: manga, manhwa, manhua. Catches per-domain
        regression that aggregate recall would hide."""
        return PerDomainReport(
            manga=self._compute_recall(self.eval_set.filter(domain="manga")),
            manhwa=self._compute_recall(self.eval_set.filter(domain="manhwa")),
            manhua=self._compute_recall(self.eval_set.filter(domain="manhua")),
        )
The cross-lingual eval is the explicit guard against a real failure mode: fine-tuning on JP-heavy in-domain triplets can pull JP queries toward JP docs and away from their EN-language counterparts, breaking cross-lingual retrieval (where a Japanese user types the EN title and expects the JP doc to surface). This is silent in aggregate recall — the failure manifests as en2jp_recall collapsing while jp2jp_recall improves. The Stage 1 gate explicitly fails any candidate whose cross-lingual recall drops below 0.78.
The per-domain eval is the explicit guard from the partner drift story (Ground-Truth-Evolution/ML-Scenarios/07-embedding-category-expansion.md): aggregate recall hides per-domain regression. A model that improves manga recall by 0.04 but regresses manhwa recall by 0.06 is a regression — the slice gate fails on manhwa.
7. Drift Detection
| Drift Kind | Detector | Cadence | Threshold |
|---|---|---|---|
| Input drift (query side) | PSI on top-30 query token frequencies; KL on language-detector dist | 5 min | PSI > 0.2 sustained 24h |
| Input drift (doc side) | PSI on doc-text-length and per-domain volume | Daily | PSI > 0.15 sustained 7d |
| Embedding drift | Per-domain centroid distance between current docs and reference centroid | Weekly | Δcentroid > 0.08 cosine |
| Recall drift | recall@10 on canonical eval against live manga-catalog alias | Daily | Δ-recall > 0.02 sustained 7d |
| Concept drift | Click-through-rate on top-10 retrieved per language slice | Daily | Δ-CTR > 0.03 sustained 14d |
# embedding_drift_check.py
import numpy as np


def per_domain_centroid_drift(
    current_embeddings: dict[str, np.ndarray],   # by domain
    reference_centroids: dict[str, np.ndarray],  # unit-norm reference centroids
) -> dict[str, float]:
    """Per-domain centroid drift. Detects category-expansion-induced drift
    (the manhwa scenario from Ground-Truth-Evolution ML-07) before recall
    metrics show it."""
    drifts = {}
    for domain, embs in current_embeddings.items():
        current_centroid = embs.mean(axis=0)
        current_centroid /= np.linalg.norm(current_centroid)
        ref = reference_centroids[domain]
        cosine = float(np.dot(current_centroid, ref))
        drifts[domain] = 1.0 - cosine
    return drifts
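The query-side detector in the first row of the table is a standard PSI over aligned token-frequency buckets (the reference window's top-30 tokens plus an "other" bucket); a minimal sketch:

# psi_query_tokens.py (illustrative)
import numpy as np


def psi(reference_freqs: np.ndarray, current_freqs: np.ndarray, eps: float = 1e-6) -> float:
    """Population Stability Index over aligned token-frequency buckets."""
    ref = reference_freqs / reference_freqs.sum()
    cur = current_freqs / current_freqs.sum()
    ref = np.clip(ref, eps, None)
    cur = np.clip(cur, eps, None)
    return float(np.sum((cur - ref) * np.log(cur / ref)))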
The drift hub is shared across all 8 ML stories per primitive §6.2; this story consumes from MangaAssist/MLDrift/embedding_adapter/* and emits to #manga-ml-oncall slack on alarm.
8. Multilingual Handling (JP/EN-Specific)
Three concerns are JP-specific and shape the training signal:
Cross-lingual alignment via parallel summaries. Editorial maintains ~3K parallel manga summary pairs (the same manga's JP summary and EN summary, hand-translated). These are used as positive (q, d+) pairs in training, with the JP version as the query and the EN version as the positive doc (and vice versa). This explicit cross-lingual signal prevents the common failure mode where in-domain fine-tuning silos JP queries to JP docs and EN to EN. ~4% of training triplets are cross-lingual pairs.
JP tokenization on the query side. Titan v2's tokenizer is multilingual-pretrained (sentencepiece with a multilingual vocab); it handles JP without external pre-tokenization. We do not insert MeCab pre-tokenization here (unlike US-MLE-01 which does), because Titan's tokenizer was trained on a JP-balanced corpus and the adapter is fine-tuned on the same tokenization. Inserting MeCab would create a training-serving skew unless we also re-pre-train the base, which is out of scope.
Honorifics and politeness register. JP queries vary widely in politeness register. The training set's positive pairs include politeness-shifted variants (the editorial team reviews a sample, the LLM-distilled augmentation produces formal/informal variants for each query). Empirically, this gives +0.02 recall@10 on the JP power-user slice, where formal-register queries are over-represented.
9. Privacy — Embeddings Are Quasi-Reversible
Per the cross-cutting concern, embeddings are quasi-reversible — given the embedding model and a vector, recent research has demonstrated approximate text reconstruction (~30–50% token recovery on 256-token inputs). This means the embedding service is a privacy boundary: any text that becomes an embedding inherits the privacy classification of the source.
The hard rules:
- Never embed raw user PII. Queries are PII-redacted at the application gateway before reaching the embedding cache or endpoint. The redactor uses regex + entity-recognition for emails, phone numbers, order numbers, addresses, and Japanese-format birthdates.
- Embeddings of customer data live in ap-northeast-1 only. The OpenSearch cluster is region-locked; cross-region replication is forbidden by the data-residency contract.
- Embedding training data does not include user-identifying queries. The retrieval logs feed into training only after a second-pass redaction (the first pass happens at write time; the second pass at training time is defense-in-depth).
- Adapter weights themselves are reviewed pre-publish. A model that overfits on training queries can leak training-data text via membership inference or near-vector lookup. The Stage 1 gate includes a membership-inference test: 100 known-training and 100 known-not-in-training queries; the adapter passes if the membership classifier's AUC is < 0.6 (i.e., the adapter does not give an attacker a meaningful signal about training-set membership).
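The membership-inference gate needs only the candidate adapter's similarity scores and an AUC. A sketch, assuming encode_fn is the candidate's query encoder and train_embs are precomputed, L2-normalized embeddings of known training queries:

# membership_gate.py (illustrative; encode_fn / train_embs are assumptions)
import numpy as np
from sklearn.metrics import roc_auc_score


def membership_auc(encode_fn, train_queries, holdout_queries, train_embs) -> float:
    """Score each probe query by its max cosine similarity to known training
    embeddings; the gate passes when AUC < 0.6 (no usable membership signal)."""
    def max_sim(queries):
        q = encode_fn(queries)                 # (n, 1024), L2-normalized
        return (q @ train_embs.T).max(axis=1)  # nearest-training-vector similarity
    scores = np.concatenate([max_sim(train_queries), max_sim(holdout_queries)])
    labels = np.concatenate([np.ones(len(train_queries)), np.zeros(len(holdout_queries))])
    return float(roc_auc_score(labels, scores))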
Monitoring & Metrics
| Category | Metric | Target | Alarm Threshold |
|---|---|---|---|
| Online — Latency (query encode) | p50 inference | ≤ 6 ms | > 10 ms 5min |
| | p95 inference | ≤ 12 ms | > 18 ms 5min |
| | p99 inference | ≤ 25 ms | > 40 ms 5min |
| Online — Latency (new-doc encode) | p95 end-to-end (catalog write → indexable) | ≤ 30 s | > 60 s 5min |
| Online — Throughput | Endpoint RPS | match traffic | scale-out lag > 60s |
| | Endpoint instance count | 2–20 | stuck at max 30min |
| Online — Errors | Endpoint 5xx | < 0.05% | > 0.5% 5min |
| | Cache hit rate | ≥ 65% | < 55% 1h |
| Quality — Aggregate | recall@10 (JP+EN canonical) | ≥ 0.92 | < 0.90 24h |
| Quality — Per-language | recall@10 JP-only | ≥ 0.90 | < 0.88 24h |
| | recall@10 EN-only | ≥ 0.92 | < 0.90 24h |
| Quality — Cross-lingual | recall@10 EN→JP | ≥ 0.78 | < 0.74 7d |
| | recall@10 JP→EN | ≥ 0.78 | < 0.74 7d |
| Quality — Per-domain | recall@10 manga | ≥ 0.93 | < 0.91 24h |
| | recall@10 manhwa | ≥ 0.88 | < 0.85 24h |
| | recall@10 manhua | ≥ 0.85 | < 0.82 24h |
| Drift | Query-token PSI top-30 | < 0.2 | > 0.2 24h |
| | Per-domain centroid drift | < 0.08 cosine | > 0.08 7d |
| Re-Index | Wall-clock end-to-end | ≤ 4 h | > 5 h |
| | Failed re-index rate (rolling) | < 5% | > 10% / 90d |
| | Dual-write lag | ≤ 60 s | > 5 min |
| Cost | $/1M query encodes (incl. cache) | ≤ $4.20 | > $6.00 24h |
| | $/full re-index 5M docs | ≤ $450 | > $700 |
| | $/training run | ≤ $150 spot | > $250 |
| Pipeline | Quarterly retrain success rate | ≥ 90% | < 85% (12mo rolling) |
Risks & Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Spot reclaim cluster during 7h training run | Wasted compute, missed quarterly window | 500-step S3 checkpoints; auto-resume FSDP shards; on-demand fallback after 60 min cumulative reclaim |
| Re-index discovers green index has lower recall than blue | Cannot promote candidate; quarterly schedule slips | Stage 1 gate catches this offline before re-index starts; if Stage 4 catches it, blue continues, green discarded, no traffic affected |
| Cross-lingual recall collapses during fine-tuning | EN→JP retrieval breaks; JP users cannot find EN-titled manga | Cross-lingual eval as Stage 1 hard gate; ~4% of training triplets are explicit cross-lingual pairs |
| Per-domain regression on manhwa hides in aggregate recall | Manhwa users see degraded "if you liked" rail | Per-domain eval as Stage 1 hard gate; promotion blocks if any domain regresses by > 0.02 |
| LoRA adapter overfits to training queries; leaks via near-vector lookup | Membership inference attack succeeds on training data | Membership-inference AUC < 0.6 in Stage 1 gate; PII redaction at gateway prevents PII from entering training in the first place |
| Hard-negative miner picks up reviews from spam-clean reviewers | Training contaminated with relevant docs as negatives | Filter impressions where US-MLE-07 spam confidence < 0.05; quarterly audit |
| Atomic alias flip fails mid-cluster | Some shards on blue, some on green for ~30s | Idempotent on doc_id; users see slightly inconsistent recall for ~30s but no errors; runbook for emergency _aliases rollback |
| Dual-write lag during 4h re-index window | New docs land in blue but not green; green's 5M-doc snapshot is stale on flip | Kinesis-buffered queue drains on flip; lag tracked as monitoring metric; flip blocks if lag > 5 min |
| Catalog ingest spike during re-index window | Dual-write throughput exceeded; blue/green skew | Re-index scheduled outside catalog batch-ingest hours (Tue–Thu 02:00–06:00 JST); rate limiter on catalog ingest during re-index |
| Adapter rotation invalidates all 50TB of cached embeddings | $40 cost on cache rebuild, ~6h ramp to 65% hit rate | Cache key includes adapter_version; old keys age out via TTL; new adapter runs at lower hit rate for ~6h after flip; expected and budgeted |
| Membership inference attack on production endpoint | Privacy regulator finding | Per-token rate limiting at gateway; vector-output rounding to 6 decimals; quarterly red-team |
Deep Dive — Why This Works at Amazon-Scale on the Manga Workload
The MangaAssist embedding service is the single most-shared ML artifact on the platform — six MCP servers and two downstream ML systems (US-MLE-02 reranker, US-MLE-06 recommendation) consume its output. Four workload properties make the design above the right shape rather than the obvious "just fine-tune on click data and re-index nightly":
Workload property 1: bilingual cross-lingual retrieval is a load-bearing use case. ~14% of queries cross language boundaries (a JP-region user typing the EN title or vice versa). For these queries, the embedding space must place the EN doc near the JP query and the JP doc near the EN query — and this property is fragile. In-domain fine-tuning naturally pulls in-language pairs together and (without explicit cross-lingual signal) silos languages. This is why ~4% of every training run's triplets are explicit parallel-summary pairs and why cross-lingual recall is a Stage 1 hard gate. A naive fine-tune that maximizes aggregate recall on click data would silently break this slice.
Workload property 2: the catalog has multi-domain heterogeneity that aggregate metrics hide. Manga, manhwa, and manhua share concepts but differ in semantic distance rules — a "shounen" manga and a "shounen" manhwa are similar but not identical. The partner drift story (Ground-Truth-Evolution/ML-Scenarios/07-embedding-category-expansion.md) walks through how aggregate recall@10 of 0.91 hid manhwa-specific recall of 0.74. This story's slice gate enforces per-domain recall targets independently, and the editorial-pair label engineering for manhwa (~10K Y1) is the ground-truth-side fix for this drift. Without per-domain gating, the next manhwa launch silently regresses again.
Workload property 3: 5M-document corpus + quarterly retrain cadence forces a 4-hour re-index, which forces blue/green. A 5M-document corpus at ~1,400 docs/sec/node sustained throughput is ~30 minutes of pure compute on 2 p4d nodes; the rest of the 4-hour wall-clock is overhead, and cannot be shortened without doubling node count (which doubles cost without proportional speedup, because OpenSearch bulk-load is the bottleneck on the back end). Given the 4-hour window, the catalog cannot be frozen for writes — that would block US-MLE-07 spam-clean reviews from indexing for 4 hours. Hence dual-write. And because dual-write means we are committed mid-flight to a green index, we must be able to validate-and-discard green without affecting blue traffic — hence blue/green. The re-index pattern is forced by the corpus size + retrain cadence + write-availability triad, not chosen for elegance.
Workload property 4: this story produces an artifact other stories consume. A US-MLE-05 promotion forces coordinated retrains in US-MLE-02 (reranker reads embedding-conditioned features) and US-MLE-06 (recommendation reads item embeddings). The cross-story dependency (§Cross-Story Interactions) is therefore mandatory: a US-MLE-05 promotion that doesn't trigger downstream retrains creates a stale-feature drift in the consuming stories that the drift hub will catch within a week — but during that week, US-MLE-02's reranker is operating on inconsistent feature distributions. This is why the Stage 4 atomic flip is followed by an SNS event that triggers US-MLE-02 and US-MLE-06's "embedding-rotation-triggered retrain" pipelines.
These four properties together explain why the FSDP + blue/green + per-domain + cross-lingual machinery exists. A naïve "fine-tune nightly on click data and merge into base" would produce a single-language-siloed, domain-aggregating, write-blocking re-index that breaks 14% of cross-lingual queries silently and regresses manhwa for the entire quarter between cycles.
Real-World Validation
Industry analogues. Amazon's M5 embedding team uses an analogous LoRA-on-foundation-model + blue/green re-index for product search; their published 2024 SIGIR paper describes a 3-hour re-index of an O(100M)-document index using a similar batch-transform + alias-flip pattern. Google's universal sentence encoder team uses per-domain projection heads on top of a shared backbone for multi-domain retrieval, exactly the architecture the drift counterpart story recommends and that we adopt here implicitly via the LoRA adapter's per-domain training mix. Meta's E5-mistral team's published 2024 results show LoRA at r=16 on q/k/v projections is the standard configuration for adapter-style fine-tuning on 7B-class retrieval models — the rank-16 + alpha-32 + top-12-blocks config in this story matches their reported best.
Math validation — recall@10. The baseline (Titan v2 no-adapter) hits 0.86 on the JP+EN canonical eval. The 0.06 absolute gain to 0.92 is consistent with published LoRA-adapter retrieval results on similar corpus sizes (E5-mistral reports 0.05–0.09 absolute recall@10 gains from in-domain LoRA fine-tuning on 1–10M-doc corpora). The 0.92 target is set at the bottom of this range to provide margin; in pilot we measured 0.937 ± 0.008.
Math validation — re-index cost. SageMaker Batch Transform on p4d.24xl in ap-northeast-1 is ~$32.77/hr × 2 nodes × 3h = $196.62 for the encode phase. Plus ~$5 for export, ~$10 for the OpenSearch bulk load (effectively just existing cluster capacity), and ~$5 for validation. Total per re-index: ~$220. Quarterly: ~$880/yr just for re-indexes, immaterial against the ~$15K/month serving fleet.
Math validation — query-encode cost. ~3.2M unique encodes/day after the 68% cache hit rate × 365 days ≈ 1.17B/yr. On an ml.g5.xlarge ($1.408/hr in ap-northeast-1) with a peak of 12 instances and an autoscaled average of 4, monthly serving cost is ~$4,055, i.e. $4,055 / (3.2M × 30) ≈ $0.042 per 1K unique encodes. Amortized across the full user-facing query volume (cache hits included), this lands at the ~$4.20-per-1M-query-encodes target; verified against a 7-day production canary in pilot.
Math validation — training cost. 200K triplets / 384 anchors per global step ≈ 520 steps/epoch, or ~1,560 steps over 3 epochs. Pilot measurements at 16 GPUs FSDP-sharded showed per-epoch wall-clock of ~2.3h, so ~6.9h per run; at the observed ~0.32 spot discount against on-demand $32.77/hr × 2 nodes, that is ≈ 6.9h × 0.32 × $65.54/hr ≈ $145. The $150 target is achievable when spot discounts hold; the budget allows up to $250 to absorb spot price spikes.
Math validation — adapter file size. LoRA r=16, alpha=32, on q_proj + k_proj + v_proj (each 4096×4096 in Titan v2's hidden dimension) over 12 blocks: 3 projections × 12 blocks × (4096×16 + 16×4096) × 4 bytes (FP32) = 3 × 12 × 131,072 × 4 = ~18MB. Plus tokenizer config, adapter config, and metadata: ~24MB. Matches the headline number.
Cross-Story Interactions
| Edge | Direction | Contract | Conflict mode |
|---|---|---|---|
| US-MLE-05 → US-MLE-02 (reranker) | provides embeddings; rotation triggers retrain | reranker's training reads embedding-conditioned features pinned to adapter_version; retrain triggered on alias flip | If the reranker doesn't retrain after the flip, its features come from a different embedding distribution than serving — drift hub catches this within 7d. SNS event on flip auto-triggers the reranker retrain pipeline. |
| US-MLE-05 → US-MLE-06 (recommendation) | provides item-tower embeddings | two-tower architecture's item tower is the embedding service; recommendation retrains on flip | Same as above. US-MLE-06 declares hard dependency: if adapter_version changes, item-tower retrain is mandatory before next scheduled recommendation retrain. |
| US-MLE-05 → All MCPs (RAG-MCP-Integration) | provides query + doc embeddings | every MCP that does retrieval reads from the live manga-catalog alias | Atomic alias flip is invisible to MCPs. Cache key includes adapter_version so old cache keys age out cleanly. |
| US-MLE-05 ← US-MLE-01 (intent classifier) | consumes intent-conditioned hard-negative mining | hard-negative miner stratifies by intent class to avoid intent-skewed negatives | If US-MLE-01 mis-routes a class, miner produces biased negatives. Pinned to US-MLE-01 prod model_version at miner run time. |
| US-MLE-05 ← US-MLE-07 (spam) | consumes spam-clean review corpus | reviews used as augmentation are filtered via US-MLE-07 confidence ≥ 0.95 | If US-MLE-07 has false negatives, spam reviews enter training. Quarterly audit; pinned to US-MLE-07 prod model_version. |
| US-MLE-05 → Cost-Optimization US-06 (RAG cost) | dominates RAG inference cost | embedding endpoint is largest single line item in RAG cost | A US-MLE-05 hyperparameter change (e.g., output dimension) requires Cost-Opt review for cost-impact. |
| US-MLE-05 ↔ Ground-Truth-Evolution ML-07 (category expansion) | drift counterpart | embedding-category-expansion drift story is the failure mode this story prevents | Per-domain editorial labeling is the joint planning artifact between this story and ML-07's mitigation. |
Rollback & Experimentation
Shadow Mode Plan (Stage 3 of the gate)
- Duration: 24 hours minimum, extending to 72 hours if traffic in the first 24h is below 70% of recent peak (insufficient statistical power on JP off-peak).
- Sample size: ~10M query encodes/day → ~10M comparisons (every production query goes through both blue and green paths, blue-fed responses go to user, green-fed responses logged to compare table).
- Pass criteria: per-query rank-overlap@10 between blue and green ≥ 0.6 (low because the adapter is supposed to change rankings) AND ≤ 0.85 (too high = adapter learned nothing new); per-language slice rank-overlap within ±0.05 of global; p99 latency on green path ≤ 1.3× blue.
- Slice criteria: rank-overlap on each language slice (EN, JP, mixed) within ±0.05 of global. A model that overlaps 75% globally but only 50% on JP fails shadow.
- Cross-lingual criterion: explicit smoke check on the 1K cross-lingual eval set that recall@10 measured against green is ≥ 0.78. A pass on rank-overlap but a fail on cross-lingual recall is not a pass.
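Rank-overlap@10 is the per-query top-10 set overlap between the blue-served and green-shadow result lists, averaged over the comparison table; a minimal sketch:

# rank_overlap.py (illustrative)
def rank_overlap_at_10(blue_top10: list[str], green_top10: list[str]) -> float:
    """Fraction of the blue top-10 doc_ids that also appear in the green top-10."""
    return len(set(blue_top10[:10]) & set(green_top10[:10])) / 10.0


def mean_rank_overlap(pairs: list[tuple[list[str], list[str]]]) -> float:
    return sum(rank_overlap_at_10(b, g) for b, g in pairs) / len(pairs)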
Re-Index Discard Path
If Stage 4 validation fails (recall@10 < 0.92, a per-language threshold fails, or cross-lingual fails), the green index is discarded without affecting blue. Discard semantics:
- Stop dual-write to green; drain any in-flight writes.
- Delete the green index (DELETE manga-catalog-v8).
- Mark adapter_v4 in the registry as status: failed_promotion.
- SEV-3 page to ML on-call with the gate failure breakdown.
- Blue continues serving, untouched. No customer-visible impact.
This is the key safety property of blue/green: a failed candidate never reaches user traffic.
Atomic Alias Flip — Rollback (post-Stage 4)
If a quality regression appears post-flip (e.g., the canary 24h post-flip shows a CTR regression that only emerges with real traffic at scale), the inverse alias flip restores blue:
# emergency_rollback.py
import boto3
from opensearchpy import OpenSearch

client = OpenSearch([{"host": "search-manga-prod.aoss.amazonaws.com", "port": 443}])
sagemaker = boto3.client("sagemaker", region_name="ap-northeast-1")


def emergency_rollback_to_blue():
    client.indices.update_aliases(body={
        "actions": [
            {"remove": {"index": "manga-catalog-v8", "alias": "manga-catalog"}},
            {"add": {"index": "manga-catalog-v7", "alias": "manga-catalog"}},
        ]
    })
    # Embedding cache: invalidation is automatic via the adapter_version cache key.
    # ElastiCache will serve adapter_v3-keyed entries until adapter_v4 keys expire.
    # New cache fills happen via the embedding endpoint, which is also rolled
    # back via a SageMaker traffic shift on the endpoint variant.
    sagemaker.update_endpoint(
        EndpointName="embedding-svc",
        EndpointConfigName="embedding-svc-adapter-v3",  # previous config
    )
The rollback SLA is 5 minutes end-to-end (alias flip + endpoint config update). Beyond 14 days post-flip, blue is deleted to reclaim the 20GB+ of OpenSearch shard space; rollback then requires a fresh 4-hour re-index from the previous adapter version.
Kill-Switch Flags
- embedding_adapter_promotion_enabled (default: false; SSM /manga-ml/embedding/promotion_enabled) — when false, the quarterly trigger logs and exits without starting the pipeline.
- embedding_blue_green_enabled (default: true) — when set false during a SEV, the blue/green machinery is disabled and re-indexing falls back to in-place updates; used only if a bug in the alias-flip path is suspected.
- embedding_dual_write_enabled (default: true) — if false, green gets no new docs during the build; the post-build queue drain handles the catch-up at flip time.
- global_ml_freeze (default: false) — overrides the above; applies to all 8 stories.
Quality Regression Criteria (Hard Rollback)
A post-flip canary that satisfies any one of these conditions triggers automatic rollback to blue:
- recall@10 on canonical eval drops by > 0.02 absolute for any 1-hour window measured against the live alias.
- Per-language recall@10 (JP or EN) drops by > 0.03 for any 1-hour window.
- Cross-lingual recall@10 drops by > 0.05 for any 6-hour window.
- Per-domain (manga / manhwa / manhua) recall@10 drops by > 0.04 for any 6-hour window.
- Downstream CTR (measured on US-MLE-02 reranker's served list) drops by > 0.02 for any 24-hour window — this is the cross-story canary check, since reranker CTR is the user-facing signal.
- Endpoint p99 latency exceeds 40ms for any 5-minute window during peak.
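A sketch of the canary check that wires these thresholds to the emergency rollback above; fetch_metric, a CloudWatch reader returning the signed delta over the given window, is hypothetical:

# canary_rollback.py (illustrative; fetch_metric is a hypothetical CloudWatch reader)
ROLLBACK_RULES = [
    ("recall_at_10_delta",         "1h",  -0.02),
    ("recall_at_10_jp_delta",      "1h",  -0.03),
    ("recall_at_10_en_delta",      "1h",  -0.03),
    ("cross_lingual_recall_delta", "6h",  -0.05),
    ("per_domain_recall_delta",    "6h",  -0.04),
    ("reranker_ctr_delta",         "24h", -0.02),
]


def check_and_rollback():
    for metric, window, floor in ROLLBACK_RULES:
        if fetch_metric(metric, window=window) < floor:
            emergency_rollback_to_blue()  # defined in emergency_rollback.py above
            return metric
    # The p99-latency criterion is checked separately against the 40ms / 5-min peak threshold.
    return None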
Multi-Reviewer Validation Findings & Resolutions
S1 — Must Fix Before Production
ML Scientist lens: The InfoNCE loss with a 0.05 temperature is on the aggressive end for a LoRA adapter; on a small batch (anchors=24), this can cause training instability. Resolution: We use temperature=0.05 with BF16 mixed precision and the FSDP reduce_dtype=fp32 to stabilize gradient reduction; pilot loss curves were clean across 10 trial runs. If instability appears in production runs, temperature is a hyperparameter and can be moved to 0.07 without changing the promotion gate.
SRE lens: The 4-hour re-index window has no documented behavior for what happens if the FSDP batch-transform job fails at hour 2.5 with 60% of docs encoded. Resolution: Batch transform writes per-shard outputs incrementally to S3; on failure, the resume code reads the partial output and processes only the remaining shards. Documented in Runbooks/embedding-batch-transform-resume.md. Worst case adds ~30 min to wall-clock; still within the 4-hour gate plus 1-hour grace.
Data Engineering lens: Dual-write during re-index assumes Kinesis can absorb 80K writes/day burst-tolerance during the 4-hour window. Burst checks needed. Resolution: Kinesis stream provisioned at 10× steady-state throughput; partition key on doc_id ensures even distribution; alarm on iterator-age > 60s during re-index window.
Application Security / Privacy lens: The membership-inference test uses 100 known-training and 100 known-not-in-training queries; sample size is too small for a confident AUC measurement. Resolution: Sample size raised to 500 each side (1,000 total); test is rerun with bootstrap confidence interval; gate fails if AUC's 95% CI lower bound is > 0.55.
ML Scientist lens (S1): The 1K cross-lingual eval set is hand-curated parallel summaries; labelers are bilingual editors but there's no IAA computed across them. Resolution: Triple-annotate 10% of cross-lingual pairs; compute κ; reject batch below κ < 0.65. Per primitive §1.3.
S2 — Address Before Scale
FinOps lens: The 12% spot reclaim rate on p4d.24xl in ap-northeast-1 has been stable for 6 months but spot capacity is volatile around AWS re:Invent (December). Schedule the Q4 re-train for Nov or Jan, not Dec. Resolution: Quarterly cadence pinned to first weeks of Feb/May/Aug/Nov; documented in calendar.
Data Engineering lens: embedding_eval_canonical is refreshed quarterly but the refresh process is manual. As the catalog evolves, eval drift will accumulate. Resolution: Quarterly refresh becomes a tracked SageMaker pipeline step; editorial team is contracted for ~200 new pairs/quarter; eval drift dashboard tracks per-domain coverage.
Principal Architect lens: The cross-story SNS event on flip is a synchronous-ish coupling between this story and US-MLE-02/06. If those consumers have transient failures, the embedding rotation succeeds but consumers operate on stale features. Resolution: SNS event is best-effort; consumers' own drift hubs catch the rotation independently within 24h via PSI on item-embedding distribution; documented in cross-story platform deep-dive.
ML Scientist lens (S2): Offline-online correlation is currently 0.68 across the last 4 retrains (above the 0.6 trustworthy floor but not by much). As the adapter sees more rotation cycles, this number should be tracked. Resolution: Per-rotation (offline_recall_at_10, online_ctr_28d) pair logged in registry; quarterly report; if correlation drops below 0.6, extended canary to 14-day shadow before flip.
S3 — Future Work
Principal Architect lens: Single LoRA adapter for all domains. As manhwa/manhua matures, per-domain adapters (one for manga, one for manhwa, one for manhua, all sharing the frozen base) may outperform a single adapter. The adapter-swap-at-runtime machinery built here makes that architecturally easy. Tracked as a v2 backlog item; revisit at 2-year mark.
ML Scientist lens (S3): Multi-vector retrieval (ColBERT-style late interaction) would push recall@10 to ~0.95 but at 5–10× storage cost and 3× retrieval latency. Defer until single-vector + reranker reaches its ceiling; revisit if recall@10 stalls at < 0.93.
SRE lens (S3): Multi-region active-active for embedding service. Currently ap-northeast-1 only; a us-east-1 extension for EN traffic would require cross-region adapter replication and per-region OpenSearch indexes. Tracked separately under data-residency expansion roadmap.
Application Security / Privacy lens (S3): Differential-privacy noise injected during fine-tuning to provably bound membership-inference attack success. Currently relying on the empirical AUC < 0.6 check; DP-SGD would give a formal guarantee. Tracked as a regulatory-readiness backlog item.