US-MLE-06: Recommendation Model Training + Cold-Start
User Story
As an ML Engineer at Amazon-scale on the MangaAssist team, I want to own the hybrid recommendation training stack — a custom two-tower neural retriever for warm users plus Amazon Personalize HRNN-meta for cold-start and fallback — with nightly incremental retrains, monthly full retrains, position-debiased implicit-feedback labels, an explore-exploit bandit on the warming-cohort, and a delta-merge ANN index that is rebuilt nightly and patched online during the day, So that every one of MangaAssist's 12M monthly active users gets retrieval that respects their actual taste (not the average user's taste), new items reach a relevant audience within hours of catalog ingest, and JP-skewed traffic patterns drive the loss rather than being averaged away by the larger global user base.
Acceptance Criteria
- Hit rate HR@20 ≥ 0.18 on the position-debiased holdout (currently 0.13 baseline).
- Cold-start CTR / warm CTR ≥ 0.80 (cold = first 5 interactions; warm = ≥ 5).
- Online p95 retrieval latency ≤ 30 ms end-to-end (user-tower forward pass + ANN search + business filter).
- Online p99 retrieval latency ≤ 60 ms.
- Hybrid router selects Personalize for users with `interaction_count < 5` and two-tower for `≥ 5`, measured per-request and logged with `request_id`.
- Item-tower embeddings precomputed nightly into ANN index (FAISS-on-EFS or OpenSearch kNN) covering all 5M catalog items; incremental delta merge keeps index fresh within 4 hours of any new catalog item.
- Two-tower nightly incremental retrain on `g5.12xlarge` finishes in ≤ 6 hours (target ~5h); monthly full retrain finishes in ≤ 14 hours (target ~12h).
- Position-debiasing IPS coefficients sourced from US-MLE-02's propensity model; coordinated retrain when US-MLE-02 propensity model promotes a new version.
- Per-slice HR@20 gate: EN/JP × {new-user, returning, power-user} × {peak, off-peak} all ≥ 0.16.
- Out-of-stock items never appear in the `top_k` final result; signal sourced from US-MLE-04.
- Personalize and two-tower agreement-rate ≥ 65% on the overlap zone (users at the 5-interaction boundary) for 28 days post-launch.
- Drift detection: HR@20 regression ≥ 0.02 absolute or cold-start CTR regression ≥ 0.05 triggers SEV-3 within 48 hours.
- Per-request retrieval cost ≤ $0.000045 ($4.50 per 100K retrievals) at the contracted traffic mix.
- Rollback to previous monthly full model in ≤ 4 hours (Personalize re-import of last-known-good solution version + ANN index swap from green back to blue).
Architecture (HLD)
The Production Surface
The recommendation system sits behind the User Preference & Recommendation MCP described in RAG-MCP-Integration/02-user-preferences-recommendation-mcp.md. Every call to get_recommendations enters the hybrid router, which inspects the user's interaction depth and dispatches to one of two retrieval backends. Both produce a candidate set of ~100 items; downstream the cross-encoder reranker (US-MLE-02) trims to top-20 and the explanation generator surfaces "why."
At the contracted traffic of 200M sessions/month with ~12M monthly active users and an average of ~6 recommendation surfaces per session, this system serves ~1.2B retrievals/month, or ~460 RPS sustained with JP-region peaks of ~1,400 RPS during 21:00–24:00 JST. Cold-start traffic (users with <5 interactions) is ~14% of total but ~28% of weekend new-signup peaks; the system must not collapse under cold-start spikes.
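A quick back-of-envelope check of those traffic figures; the constants are the contract numbers quoted above, and the script itself is purely illustrative.

```python
# traffic_math.py — sanity check of the contracted traffic figures quoted above.
sessions_per_month = 200_000_000
surfaces_per_session = 6                  # average recommendation surfaces per session
retrievals_per_month = sessions_per_month * surfaces_per_session     # 1.2B

seconds_per_month = 30 * 24 * 3600
sustained_rps = retrievals_per_month / seconds_per_month             # ~463 RPS

cold_start_share = 0.14                   # users with <5 interactions
cold_start_per_month = retrievals_per_month * cold_start_share       # ~168M retrievals

print(f"{retrievals_per_month / 1e9:.1f}B retrievals/month, "
      f"~{sustained_rps:.0f} RPS sustained, "
      f"~{cold_start_per_month / 1e6:.0f}M cold-start retrievals/month")
```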
The two retrieval paths:
- Two-tower neural retriever (`≥ 5` interactions, ~86% of requests) — user tower (256-dim) computed online from session + 30-day reading history; item tower (256-dim) precomputed nightly into ANN index. Scoring is dot-product. ANN backend is FAISS HNSW on EFS-mounted index volumes, with OpenSearch kNN as the fallback for items added in the last 24 hours (delta zone).
- Amazon Personalize HRNN-meta (`< 5` interactions, ~14% of requests) — managed real-time endpoint. Reads the user's first 1–4 interaction events and produces ranked candidate IDs. We use HRNN-meta (the metadata-aware variant) because pure HRNN cold-starts poorly on items, and we feed it US-MLE-05's content embeddings as item metadata.
The 5M-item catalog has the same shape as US-MLE-05: ~3M JP-titled, ~1.7M EN-titled, ~0.3M cross-lingual or KO/ZH. Item tower reads US-MLE-05's cross-lingual embeddings as a feature, so item-side cold-start (new manga added in the last 7 days) is content-grounded by construction.
End-to-End ML Lifecycle Diagram
flowchart TB
subgraph DATA[Data Layer]
L1[Implicit Feedback<br/>Kinesis stream<br/>~50M events/day]
L2[Position Propensities<br/>from US-MLE-02<br/>per-slot p_show]
L3[Explicit Ratings<br/>~8% of users<br/>~400K/day]
L4[LLM-Distilled<br/>would-like-this<br/>~20K/day for cold items]
L5[Returns / Refunds<br/>strong negatives]
L6[Label Platform<br/>Iceberg on S3<br/>PIT-correct]
L1 --> L6
L2 --> L6
L3 --> L6
L4 --> L6
L5 --> L6
end
subgraph FEAT[Feature Layer]
F1[User Features<br/>30d reading history<br/>session-aware]
F2[Item Features<br/>US-MLE-05 embeddings<br/>+ catalog metadata]
F3[Context Features<br/>US-MLE-01 intent<br/>US-MLE-03 aspect<br/>US-MLE-04 stock]
F4[Feature Catalog<br/>schema_v3.4]
F1 -.pinned to.-> F4
F2 -.pinned to.-> F4
F3 -.pinned to.-> F4
end
subgraph TRAIN[Training Pipeline]
T1[Step 1<br/>Data Validation<br/>+ IPS audit]
T2[Step 2<br/>Feature Materialization<br/>PIT join]
T3[Step 3<br/>Negative Sampling<br/>in-batch + popular-stratified]
T4[Step 4<br/>Two-Tower Train<br/>g5.12xlarge sampled-softmax]
T5[Step 5<br/>Personalize Retrain<br/>HRNN-meta SDK]
T6[Step 6<br/>Offline Eval 5 modes]
T7[Step 7<br/>Slice + Cohort Analysis]
T8[Step 8<br/>Item Embedding Export]
T9[Step 9<br/>ANN Index Build<br/>FAISS HNSW]
T10[Step 10<br/>Model Registration]
T1 --> T2 --> T3 --> T4 --> T6
T1 --> T5 --> T6
T6 --> T7 --> T8 --> T9 --> T10
end
subgraph SERVE[Serving Layer]
S1[Hybrid Router<br/>interaction_count gate]
S2[Two-Tower User-Side<br/>ml.g5.xlarge AS]
S3[ANN Index<br/>FAISS HNSW on EFS]
S4[Delta Index<br/>OpenSearch kNN<br/>last 24h items]
S5[Personalize Endpoint<br/>managed real-time]
S6[Bandit on Warming Cohort<br/>Thompson sampling]
S1 --> S2
S2 --> S3
S2 --> S4
S1 --> S5
S1 --> S6
end
subgraph DRIFT[Drift Detection]
D1[HR@20 monitor<br/>per-slice]
D2[Cold-start CTR / Warm CTR]
D3[Label-decay detector<br/>per-recency NDCG]
D4[CloudWatch Alarms]
D1 --> D4
D2 --> D4
D3 --> D4
end
L6 --> T1
F1 --> T2
F2 --> T2
F3 --> T2
T10 --> S1
S2 -.predictions.-> D1
S5 -.predictions.-> D2
S6 -.exploration data.-> D3
style L6 fill:#9cf,stroke:#333
style F1 fill:#9cf,stroke:#333
style T9 fill:#fd2,stroke:#333
style S1 fill:#fd2,stroke:#333
style D1 fill:#fd2,stroke:#333
Data Contracts and Volume
| Asset | Schema Version | Snapshot Cadence | Volume | Owner |
|---|---|---|---|---|
| `recommendation_events` Iceberg | `event_v4` | Continuous (Kinesis) | ~50M events/day; ~500M positives/month | Data Platform |
| `position_propensities` table | `propensity_v2` | Daily refresh from US-MLE-02 | ~50K (slot, intent, language, cohort) tuples | US-MLE-02 owner |
| `user_features` feature group | `schema_v3.4` | 5min batch + online realtime | ~12M user rows; ~200M session rows/month | ML Eng (this story) |
| `item_features` feature group | `schema_v3.4` | 1h batch | ~5M item rows | ML Eng (this story) |
| `two_tower_v` model package | `model_v23` in prod | Nightly incremental + monthly full | ~100M params (50M user + 50M item) | ML Eng (this story) |
| `personalize_v` solution version | `sol_v17` in prod | Monthly + on-demand | Managed by Personalize | Personalize PM |
| `item_embeddings` ANN index | `idx_v23` in prod | Nightly rebuild + delta-merge | 5M × 256-dim float32 = 5GB | ML Eng (this story) |
| `recommendation_predictions_log` | `log_v3` | Continuous | ~1.2B rows/month, 60d hot, 24mo glacier | Data Platform |
Model Registry + Promotion Gates
flowchart LR
R23[Registry v23<br/>prod, label_v=4810]:::prod
R24[Registry v24<br/>candidate, label_v=4920]:::cand
R24 --> G1{Stage 1<br/>Offline Gate<br/>HR@20 + slice}
G1 -->|pass| G2{Stage 2<br/>Shadow 48h<br/>vs v23 ANN}
G1 -.fail.-> RB1[Rollback v23]
G2 -->|pass| G3{Stage 3<br/>EXTENDED Canary<br/>10% / 14d}
G2 -.fail.-> RB2[Rollback v23]
G3 -->|pass| G4[Stage 4<br/>Full Promote v24<br/>+ blue/green ANN swap]
G3 -.fail.-> RB3[Rollback v23]
classDef prod fill:#2d8,stroke:#333
classDef cand fill:#fd2,stroke:#333
style RB1 fill:#f66,stroke:#333
style RB2 fill:#f66,stroke:#333
style RB3 fill:#f66,stroke:#333
style G3 fill:#f93,stroke:#333,stroke-width:3px
style G4 fill:#2d8,stroke:#333
The promotion gate uses the standard four-stage pattern from deep-dives/00-foundations-and-primitives-for-ml-engineering.md §5.1, but with one critical project-wide deviation: per the foundations §4.3, recommendation has notoriously weak offline-online correlation, so US-MLE-06 is permanently flagged as "always extended canary". The canary stage runs at 10% traffic for 14 days regardless of offline metric performance, and HR@20 alone never gates promotion — online metrics (CTR, add-to-list rate, cold-start CTR ratio) are the binding signals.
Story-specific thresholds:
- Stage 1 (offline): HR@20 ≥ 0.18 on the random-policy holdout; per-slice HR@20 ≥ 0.16; coverage of at least 0.85 of the catalog (avoids head-only recommenders); cold-start regression ≤ 0.03 (a minimal gate-check sketch follows this list).
- Stage 2 (shadow): 48h side-by-side, ANN candidate-set agreement with v23 between 50% and 80% (lower than US-MLE-01 because retrieval should differ more than classification), p99 retrieval latency ≤ 1.3× v23.
- Stage 3 (extended canary): 10% traffic for 14 days. Online CTR within ±5% of v23 baseline, add-to-list rate ≥ baseline - 0.02, cold-start CTR / warm CTR ≥ 0.78. Hourly auto-abort daemon monitors per-cohort CTR.
- Stage 4 (full): ANN index blue/green swap; v23 retained for 21 days as rollback target.
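A minimal sketch of how the Stage 1 offline gate could be evaluated. The `OfflineEval` container and `stage1_gate` function are illustrative names, not the pipeline's actual gate code; the thresholds are the ones listed above.

```python
# offline_gate.py — sketch of the Stage 1 offline gate described above (illustrative).
from dataclasses import dataclass

@dataclass
class OfflineEval:
    hr_at_20: float                         # HR@20 on the random-policy holdout
    per_slice_hr_at_20: dict[str, float]    # e.g. {"jp_new_user_peak": 0.17, ...}
    catalog_coverage: float                 # fraction of catalog appearing in top-K over the holdout
    cold_start_hr_delta: float              # candidate minus prod on the cold-start slice

def stage1_gate(ev: OfflineEval) -> tuple[bool, list[str]]:
    failures = []
    if ev.hr_at_20 < 0.18:
        failures.append(f"HR@20 {ev.hr_at_20:.3f} < 0.18")
    for slice_name, hr in ev.per_slice_hr_at_20.items():
        if hr < 0.16:
            failures.append(f"slice {slice_name} HR@20 {hr:.3f} < 0.16")
    if ev.catalog_coverage < 0.85:
        failures.append(f"coverage {ev.catalog_coverage:.2f} < 0.85")
    if ev.cold_start_hr_delta < -0.03:
        failures.append(f"cold-start regression {abs(ev.cold_start_hr_delta):.3f} > 0.03")
    return (not failures), failures
```

A failing Stage 1 gate keeps the current production version in place and records the failure reasons in registry metadata, per the rollback arrows in the promotion diagram.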
Low-Level Design
1. Feature / Data Pipeline
The training pipeline consumes implicit-feedback events from a Kinesis Data Streams pipeline that fans out to S3 (Iceberg), DynamoDB (online state), and the Personalize event tracker. Three feature groups feed the two-tower:
- User-tower features (computed online from session + 30-day cache):
- 30-day reading history as a list of (item_id, dwell_seconds, completed_chapter_flag) tuples; capped at 200 items.
- Last 5 session intents (US-MLE-01).
- Last-purchased item IDs (capped at 20).
- Returning/new bucket; subscription tier; preferred language flag.
- Session context: time-of-day bucket, device, locale.
- Item-tower features (computed nightly into ANN index):
- US-MLE-05 cross-lingual embedding (1024-dim → projected to 256-dim by item tower).
- Catalog metadata: genre tags, demographic, content rating, publication year, total chapter count, completion status.
- Aggregated US-MLE-03 ABSA aspect-sentiment vector (artwork, story, pacing, characters — 8 aspects).
- Aggregated 30d engagement signals: views, click-through rate, completion rate.
- Context features (per-request):
- Current intent from US-MLE-01 (only used when `intent_confidence ≥ 0.6`; otherwise the context tower is skipped).
- US-MLE-04 stock signal — out-of-stock items are filtered post-retrieval, never enter the ANN search.
- Trending score from the Trending MCP (last 1h tumbling-window).
The PIT-correct read rule is mandatory. A click that happened at t_click reads user features as of t_click - serving_lag for whatever feature group the user-tower depends on. The 30-day reading history is itself PIT-correct: when training on a click at t_click, the history slice is [t_click - 30d, t_click - serving_lag], never including the click being predicted.
# user_tower_features.py
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional
import numpy as np
import pandas as pd
import boto3
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup
@dataclass
class RecoTrainingExample:
example_id: str # sha256(user_id + item_id + captured_at)
user_id: str # hashed (anonymized)
item_id: str
captured_at: datetime
label_value: float # 0..1: 0=neg, 0.3=click, 0.6=add-to-list, 1.0=purchase, -1=return
label_source: str # "implicit", "explicit", "llm_distill"
propensity: float # p(shown_at_position | policy_at_serve), from US-MLE-02
position_at_serve: int # 1..K
language_at_serve: str # "en" / "ja" / "mixed"
intent_at_serve: Optional[str]
class RecoFeaturePipeline:
"""PIT-correct feature read for two-tower training.
Implements primitive §2.1. Reads US-MLE-05 item embeddings as item features.
"""
def __init__(self, region: str = "ap-northeast-1"):
self.session = sagemaker.Session(boto3.Session(region_name=region))
self.fs = sagemaker.feature_store.FeatureStoreRuntime(self.session)
self.catalog = FeatureCatalog.load("schema_v3.4")
self.serving_lag = self.catalog.serving_lag_per_group()
def materialize_user_features(
self,
examples: list[RecoTrainingExample],
) -> "pd.DataFrame":
rows = []
for ex in examples:
as_of = ex.captured_at - self.serving_lag["user_features"]
history_lower = as_of - timedelta(days=30)
history = self.fs.get_history_window(
feature_group_name="user_features",
record_identifier=ex.user_id,
lower=history_lower,
upper=as_of,
)
session_features = self.fs.get_record(
feature_group_name="user_session_features",
record_identifier=ex.user_id,
as_of_timestamp=as_of,
)
# Critical: filter out the very item being predicted from history
history = [h for h in history if h["item_id"] != ex.item_id]
rows.append({
"example_id": ex.example_id,
"history": history[-200:], # cap at 200 items
**session_features,
"label": ex.label_value,
"propensity": ex.propensity,
})
return pd.DataFrame(rows)
def materialize_item_features(
self,
examples: list[RecoTrainingExample],
) -> "pd.DataFrame":
rows = []
for ex in examples:
as_of = ex.captured_at - self.serving_lag["item_features"]
features = self.fs.get_record(
feature_group_name="item_features",
record_identifier=ex.item_id,
as_of_timestamp=as_of,
)
# Critical: read US-MLE-05 embedding at the model_version that was live at as_of,
# not the current embedding model version.
embedding_version = self.us_mle_05_registry.version_at(as_of)
embedding = self.us_mle_05_registry.get_embedding(
item_id=ex.item_id, version=embedding_version
)
features["embedding"] = embedding
rows.append({"example_id": ex.example_id, **features})
return pd.DataFrame(rows)
The leak detector here has caught two production bugs unique to this story: (1) a user-tower feature last_clicked_genre that was being populated by the current training-row's item genre (a label leak that boosted offline HR@20 by +0.04 while crashing online CTR), and (2) an item-tower feature 30d_engagement that was being computed against the training-snapshot-time engagement signal, leaking future popularity into the embedding. Both were caught by the 1ns / serving-lag re-read comparison.
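A minimal sketch of the re-read comparison that caught those bugs. It assumes a `fetch_features(example, as_of)` helper analogous to the Feature Store reads in `RecoFeaturePipeline`; Python's `timedelta` bottoms out at microseconds, so the "1 ns" re-read is approximated with 1 µs here.

```python
# leak_detector.py — sketch of the re-read comparison described above (illustrative).
from datetime import timedelta

def leak_check(example, fetch_features, serving_lag: timedelta) -> list[str]:
    """Compare features read 'just before the label event' against features read
    'as the serving path would have seen them'. Any feature that differs is a leak
    candidate: training saw a value that serving could not have seen."""
    at_label_time = fetch_features(example, as_of=example.captured_at - timedelta(microseconds=1))
    at_serving_time = fetch_features(example, as_of=example.captured_at - serving_lag)
    suspects = []
    for name, label_time_value in at_label_time.items():
        if at_serving_time.get(name) != label_time_value:
            suspects.append(name)
    return suspects
```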
2. Two-Tower Architecture and Training
The two-tower model is implemented in PyTorch and trained on g5.12xlarge (4× A10G GPUs, ~24GB each) using DDP. The architecture:
# two_tower_model.py
import torch
import torch.nn as nn
import torch.nn.functional as F
class UserTower(nn.Module):
"""User tower — embeds (history + session + context) into 256-dim user vector."""
def __init__(
self,
embedding_dim: int = 256,
history_max_len: int = 200,
item_emb_dim: int = 1024, # from US-MLE-05
intent_vocab_size: int = 14, # from US-MLE-01
n_attention_heads: int = 4,
):
super().__init__()
# Project US-MLE-05 item embeddings into 256-dim history token space
self.item_proj = nn.Linear(item_emb_dim, embedding_dim)
# Self-attention over history with positional + recency encoding
self.recency_enc = nn.Embedding(history_max_len, embedding_dim)
self.history_attn = nn.MultiheadAttention(
embed_dim=embedding_dim, num_heads=n_attention_heads, batch_first=True
)
# Context tower (US-MLE-01 intent + session context)
self.intent_emb = nn.Embedding(intent_vocab_size + 1, 32) # +1 for unknown
self.context_mlp = nn.Sequential(
nn.Linear(32 + 16, 128), nn.ReLU(), nn.Linear(128, embedding_dim)
)
# Final fusion: history pooled + context, then MLP
self.fusion = nn.Sequential(
nn.Linear(embedding_dim * 2, embedding_dim * 2),
nn.GELU(),
nn.Dropout(0.1),
nn.Linear(embedding_dim * 2, embedding_dim),
)
def forward(
self,
history_emb: torch.Tensor, # (B, L, item_emb_dim)
history_mask: torch.Tensor, # (B, L) — 1 = valid, 0 = padding
history_recency: torch.Tensor, # (B, L) — 0 = most recent
intent_id: torch.Tensor, # (B,)
intent_confidence: torch.Tensor, # (B,) — gate context tower
session_features: torch.Tensor, # (B, 16) — tod, device, lang flag, etc.
) -> torch.Tensor:
# Project history items
h = self.item_proj(history_emb) # (B, L, 256)
h = h + self.recency_enc(history_recency) # add recency encoding
# Self-attention with mask
attn_mask = ~history_mask.bool() # True = ignore
h_pooled, _ = self.history_attn(h, h, h, key_padding_mask=attn_mask)
h_pooled = (h_pooled * history_mask.unsqueeze(-1)).sum(1) / history_mask.sum(1, keepdim=True).clamp(min=1)
# Context tower — gated by intent confidence (US-MLE-01 fallback rule)
intent_e = self.intent_emb(intent_id)
ctx_input = torch.cat([intent_e, session_features], dim=-1)
ctx_out = self.context_mlp(ctx_input)
# Gate: if intent_confidence < 0.6, zero out context contribution
ctx_gate = (intent_confidence >= 0.6).float().unsqueeze(-1)
ctx_out = ctx_out * ctx_gate
# Fuse and return L2-normalized user vector
user_vec = self.fusion(torch.cat([h_pooled, ctx_out], dim=-1))
return F.normalize(user_vec, p=2, dim=-1)
class ItemTower(nn.Module):
"""Item tower — embeds (US-MLE-05 emb + metadata + aspect signal) into 256-dim."""
def __init__(
self,
embedding_dim: int = 256,
item_emb_dim: int = 1024, # from US-MLE-05
n_genres: int = 64,
n_aspects: int = 8, # from US-MLE-03
):
super().__init__()
self.content_proj = nn.Linear(item_emb_dim, embedding_dim)
self.genre_emb = nn.EmbeddingBag(n_genres, 64, mode="mean")
self.aspect_proj = nn.Linear(n_aspects, 32)
self.engagement_proj = nn.Linear(8, 32) # ctr, completion, etc
self.fusion = nn.Sequential(
nn.Linear(embedding_dim + 64 + 32 + 32, embedding_dim * 2),
nn.GELU(),
nn.Dropout(0.1),
nn.Linear(embedding_dim * 2, embedding_dim),
)
def forward(
self,
content_emb: torch.Tensor, # (B, item_emb_dim) from US-MLE-05
genre_ids: torch.Tensor, # (B, G) sparse multi-hot
genre_offsets: torch.Tensor,
aspect_signal: torch.Tensor, # (B, n_aspects)
engagement: torch.Tensor, # (B, 8)
) -> torch.Tensor:
c = self.content_proj(content_emb)
g = self.genre_emb(genre_ids, offsets=genre_offsets)
a = self.aspect_proj(aspect_signal)
e = self.engagement_proj(engagement)
item_vec = self.fusion(torch.cat([c, g, a, e], dim=-1))
return F.normalize(item_vec, p=2, dim=-1)
class TwoTowerModel(nn.Module):
def __init__(self, user_tower: UserTower, item_tower: ItemTower):
super().__init__()
self.user = user_tower
self.item = item_tower
def score(self, u: torch.Tensor, i: torch.Tensor) -> torch.Tensor:
# Dot product of normalized vectors
return (u * i).sum(-1)
def sampled_softmax_loss(
user_vec: torch.Tensor, # (B, D)
pos_item_vec: torch.Tensor, # (B, D)
neg_item_vec: torch.Tensor, # (B, K, D) — in-batch + popular-stratified negatives
propensity: torch.Tensor, # (B,) — IPS weight from US-MLE-02 propensity model
propensity_floor: float = 0.01,
propensity_clip: float = 100.0,
) -> torch.Tensor:
"""Position-debiased sampled-softmax loss with in-batch + popular negatives."""
pos_score = (user_vec * pos_item_vec).sum(-1, keepdim=True) # (B, 1)
neg_score = torch.einsum("bd,bkd->bk", user_vec, neg_item_vec) # (B, K)
logits = torch.cat([pos_score, neg_score], dim=-1) # (B, 1+K)
target = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
per_example_loss = F.cross_entropy(logits, target, reduction="none")
# IPS reweighting; floor and clip per primitive §6 of the drift counterpart
ips_weight = 1.0 / propensity.clamp(min=propensity_floor)
ips_weight = ips_weight.clamp(max=propensity_clip)
    # IPS-weighted mean over the batch; the floor/clip above bounds the influence
    # of any single low-propensity example on the gradient
    return (per_example_loss * ips_weight).mean()
Training configuration:
- Nightly incremental (~5 hours): `g5.12xlarge` × 1 node, DDP across 4 GPUs, batch size 1024 per GPU (4096 effective). Trains for 1 epoch on the previous day's events, warm-starting from the last good checkpoint. In-batch negatives + 32 popular-item-stratified negatives per positive.
- Monthly full retrain (~12 hours): same instance, 4 epochs over the last 90-day cumulative event window, with stratified sampling balanced across (tenure, recency, action_type) per the drift-counterpart playbook. Resets the model from US-MLE-05 embedding initialization.
- Spot pricing: nightly incremental uses on-demand (the nightly window is non-negotiable), monthly full uses spot with checkpoint-every-2000-steps and on-demand fallback after 3 reclaims.
- Optimizer: AdamW, lr 5e-4 (incremental) / 1e-3 (full), weight decay 0.01, cosine schedule with 5% warmup.
The negative sampling strategy combines two sources because in-batch negatives alone bias toward popular items (popular items appear in many users' positive lists, so they appear as negatives for other users in the same batch). The 32 popular-item-stratified negatives per positive are sampled from the top-1% most-shown items, with a propensity correction so the model doesn't learn "popular is bad."
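A sketch of the popular-item-stratified half of the negative sampler, under stated assumptions: `item_show_counts` stands in for the impression statistics the real pipeline reads from the propensity tables, and the propensity correction is shown here as a standard logQ term returned alongside the negatives.

```python
# negative_sampling.py — illustrative sketch of popular-item-stratified negatives.
import numpy as np
import torch

def sample_popular_negatives(
    item_show_counts: np.ndarray,   # (N_items,) impressions over the training window (assumed input)
    item_vectors: torch.Tensor,     # (N_items, D) current item-tower embeddings
    batch_size: int,
    k_popular: int = 32,
    top_fraction: float = 0.01,
) -> tuple[torch.Tensor, torch.Tensor]:
    """Draw k_popular negatives per positive from the top-1% most-shown items.
    Returns (neg_vectors, logq): logq is the log sampling probability of each negative,
    one way to realize the 'propensity correction so the model doesn't learn popular is bad'."""
    n_top = max(1, int(len(item_show_counts) * top_fraction))
    top_items = np.argsort(-item_show_counts)[:n_top]             # most-shown item ids
    probs = item_show_counts[top_items] / item_show_counts[top_items].sum()
    idx = np.random.choice(n_top, size=(batch_size, k_popular), p=probs)   # indices into top_items
    sampled_items = top_items[idx]                                 # (B, K) item ids
    neg_vectors = item_vectors[torch.as_tensor(sampled_items)]     # (B, K, D)
    logq = torch.log(torch.as_tensor(probs[idx], dtype=torch.float32))     # (B, K)
    return neg_vectors, logq
```

To apply the correction inside `sampled_softmax_loss` as written, `logq` would be subtracted from `neg_score` before the concatenation with `pos_score`; the in-batch half of the negative set needs no extra sampling because the other positives in the batch already serve as negatives.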
3. Cold-Start: Hybrid Routing Decision
The hybrid router is a deterministic dispatch based on interaction_count, but the boundary cases are subtle. The full decision tree:
flowchart TD
A[Request arrives<br/>user_id, context_query] --> B{Fetch interaction_count<br/>from DynamoDB cache}
B -->|0 interactions| C[Onboarding flow<br/>Pick 3 favourite genres]
C --> D[Genre-seeded vector<br/>+ Catalog MCP filter]
D --> Z
B -->|1-4 interactions| E[Personalize HRNN-meta<br/>real-time endpoint]
E --> F[Returns 100 candidate IDs]
F --> Z
B -->|>= 5 interactions| G{User in 'warming' cohort?<br/>i.e., interaction_count in 5..15}
G -->|Yes| H[Two-Tower retrieval<br/>+ Bandit explore-exploit<br/>Thompson sampling]
G -->|No| I[Two-Tower retrieval<br/>pure exploitation]
H --> J[Mix: 70% two-tower, 30% bandit-sampled]
I --> J
J --> Z[Filter: out-of-stock<br/>from US-MLE-04<br/>+ already-read filter]
Z --> K[Return top 100 candidates<br/>to cross-encoder reranker<br/>US-MLE-02]
style C fill:#fde68a,stroke:#92400e
style E fill:#fee2e2,stroke:#991b1b
style H fill:#dbeafe,stroke:#1e40af
style I fill:#dcfce7,stroke:#166534
style J fill:#8E44AD,color:#fff
The 5-interaction threshold is empirical. Before 5 interactions, the user-tower has too little signal to outperform Personalize on cold-start CTR; from 5 onward, the two-tower's neural retriever beats Personalize on all measured slices. The threshold is itself a tunable parameter in the registry and is re-evaluated quarterly via offline replay on a holdout of users who crossed the boundary.
Item-side cold-start is handled separately: items added to the catalog in the last 7 days do not yet have learned item-tower embeddings (they were absent from the previous nightly training run). For these items, the ANN index falls back to a content-only embedding derived directly from the US-MLE-05 cross-lingual embedding (projected through a frozen content-projection layer extracted from the item tower). This gives new items a "warming" representation that improves as interactions accumulate. After 7 days or 200 interactions (whichever first), the item is included in the next nightly training run and gets a learned embedding.
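A sketch of the content-only warming representation for new items, reusing the `ItemTower` defined earlier. Exporting only the frozen `content_proj` layer matches the mechanism described above; the function name is illustrative.

```python
# cold_item_fallback.py — sketch of the content-only warming embedding for new items.
import torch
import torch.nn.functional as F

@torch.no_grad()
def content_only_embedding(item_tower: "ItemTower", us_mle_05_emb: torch.Tensor) -> torch.Tensor:
    """Project a US-MLE-05 cross-lingual embedding (B, 1024) through the item tower's
    frozen content projection and L2-normalize, so new items land in roughly the same
    256-dim space as learned item embeddings."""
    item_tower.eval()
    vec = item_tower.content_proj(us_mle_05_emb)   # frozen linear layer, (B, 256)
    return F.normalize(vec, p=2, dim=-1)
```

This is also the vector the delta-index Lambda writes for `catalog_item_added` events; the 0.85 delta-score multiplier described in section 6 compensates for the fact that these vectors skip the genre/aspect/engagement fusion.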
4. Position Debiasing via IPS (Coordination with US-MLE-02)
Implicit-feedback labels are position-biased: a click on slot 1 is far more likely than a click on slot 20, even when slot 20's item is just as relevant. Treating click-or-no-click as the binary label trains a model that learns what was shown at the top, not what users actually prefer. The fix is Inverse Propensity Scoring (IPS): each event's loss is multiplied by 1 / propensity(item, position | policy_at_serve).
The propensity model is owned by US-MLE-02 (cross-encoder reranker). US-MLE-02 trains a logistic-regression model on (slot, intent, language, cohort) → p_show, refreshed daily. This story consumes US-MLE-02's position_propensities table at training time and applies the IPS weight per the loss function in section 2.
This creates a hard coordination point — when US-MLE-02 promotes a new propensity model (which happens every 2-4 weeks, faster than US-MLE-06's monthly full retrain), the IPS weights this story uses can drift incompatibly. Mitigation:
- US-MLE-02 versions its propensity model in the registry and announces promotions via the platform-level coordination Slack `#manga-ml-coordination`.
- US-MLE-06's training pipeline pins to a specific propensity version at training time and stores the version in the model registry metadata.
- A daily reconciliation job compares US-MLE-06's pinned propensity version to US-MLE-02's current; if the gap is >2 versions and has persisted for ≥7 days, an alarm fires for coordinated retrain (a minimal sketch follows this list).
- IPS variance protection per the drift-counterpart §grill 1: propensity floored at 0.01, IPS weight clipped at 100, doubly-robust estimation reserved as fallback when canary OPE variance exceeds threshold.
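A minimal sketch of that reconciliation check. `registry` stands in for the model-registry client; the method names, metadata keys, and alarm publication are assumptions, and persistence of the gap is approximated by the age of the newest unpinned US-MLE-02 promotion.

```python
# propensity_reconciliation.py — sketch of the daily version-gap check (illustrative).
from datetime import datetime, timedelta

def reconcile_propensity_versions(registry, now: datetime) -> bool:
    """Return True (and raise an alarm) when US-MLE-06's pinned propensity version
    trails US-MLE-02's current version by more than 2 versions for at least 7 days."""
    pinned = registry.get_metadata("two_tower_prod")["propensity_version"]   # e.g. 14
    current, promoted_at = registry.latest_version("us_mle_02_propensity")   # e.g. (17, datetime)
    gap = current - pinned
    gap_age = now - promoted_at
    if gap > 2 and gap_age >= timedelta(days=7):
        registry.publish_alarm(
            topic="recommendation-coordinated-retrain",
            message=f"propensity gap {gap} versions, persisted {gap_age.days}d",
        )
        return True
    return False
```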
5. Bandit / Thompson Sampling on Warming Cohort
Users in the warming cohort (5–15 interactions) are still in a high-uncertainty regime — the two-tower's user vector is informative but not yet stable. To balance exploitation (rank by current model) against exploration (gather data on item-user pairs the model is uncertain about), Thompson sampling runs on top of the ANN candidate set:
# bandit_thompson.py
import torch
import numpy as np
class ThompsonBandit:
"""Per-item Beta posterior on (success_count, fail_count) for warming-cohort users.
'Success' here = click + 30s dwell or purchase (the same positive label as training).
Posterior parameters per item are updated daily from the previous day's events.
"""
def __init__(self, prior_alpha: float = 1.0, prior_beta: float = 19.0):
# Prior: ~5% click rate (alpha=1, beta=19 → mean=0.05)
self.prior_alpha = prior_alpha
self.prior_beta = prior_beta
self.success_counts: dict[str, int] = {}
self.fail_counts: dict[str, int] = {}
def update_from_logs(self, events: "pd.DataFrame") -> None:
"""Daily refresh from yesterday's warming-cohort events."""
for item_id, group in events.groupby("item_id"):
self.success_counts[item_id] = group["is_positive"].sum()
self.fail_counts[item_id] = (~group["is_positive"]).sum()
def sample_score(self, item_id: str) -> float:
a = self.prior_alpha + self.success_counts.get(item_id, 0)
b = self.prior_beta + self.fail_counts.get(item_id, 0)
return np.random.beta(a, b)
def mix_with_two_tower(
self,
candidates: list[tuple[str, float]], # (item_id, two_tower_score) top-100
warming_user: bool,
explore_fraction: float = 0.3,
) -> list[tuple[str, float]]:
if not warming_user:
return candidates
# 70% slots from two-tower exploitation
n_exploit = int(len(candidates) * (1 - explore_fraction))
exploit_set = candidates[:n_exploit]
# 30% slots from Thompson-sampled exploration over the remaining pool
exploration_pool = candidates[n_exploit:]
sampled = sorted(
exploration_pool,
key=lambda x: self.sample_score(x[0]) + 0.1 * x[1], # small bias toward high two-tower
reverse=True,
)
return exploit_set + sampled[: len(candidates) - n_exploit]
The bandit posterior is per-item, not per-user-item — full per-user-item bandits would have 60T parameters at our scale, infeasible. Per-item is a sound approximation because warming users' tastes have not yet diverged enough from the global mean for per-user posteriors to add much information. After the user crosses the 15-interaction threshold and exits the warming cohort, the bandit layer is removed and pure two-tower exploitation takes over.
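A short usage sketch of the bandit wiring, with illustrative event and candidate data; in production `update_from_logs` runs in the nightly pipeline and `mix_with_two_tower` runs per-request for warming-cohort users only.

```python
# bandit_usage.py — illustrative wiring of ThompsonBandit into the retrieval path.
import pandas as pd
from bandit_thompson import ThompsonBandit   # the class defined above

bandit = ThompsonBandit()
yesterday = pd.DataFrame({
    "item_id":     ["m_101", "m_101", "m_202", "m_202", "m_202"],
    "is_positive": [True,    False,   False,   False,   True],
})
bandit.update_from_logs(yesterday)            # daily refresh of per-item Beta posteriors

candidates = [("m_101", 0.92), ("m_202", 0.88), ("m_303", 0.41), ("m_404", 0.39)]
mixed = bandit.mix_with_two_tower(candidates, warming_user=True)
print(mixed)   # first 70% by two-tower score, remainder re-ordered by Thompson samples
```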
6. ANN Index Update Pattern (Nightly Rebuild + Delta-Merge)
The 5M-item ANN index is the bottleneck for retrieval freshness. Two indexes coexist:
flowchart LR
subgraph BLUE[Blue ANN Index - active]
B1[FAISS HNSW<br/>5M items<br/>built last night<br/>EFS volume]
end
subgraph DELTA[Delta Index - last 24h]
D1[OpenSearch kNN<br/>~5K-50K items<br/>added in last 24h<br/>real-time]
end
subgraph GREEN[Green ANN Index - building]
G1[FAISS HNSW<br/>tonight's rebuild<br/>EFS volume<br/>built by 03:00 JST]
end
R[Retrieval] --> B1
R --> D1
R --> M[Merge top-K from both]
M --> Out[Final candidates]
G1 -.atomic swap at 03:30 JST.-> B1
style B1 fill:#dbeafe,stroke:#1e40af
style D1 fill:#fde68a,stroke:#92400e
style G1 fill:#dcfce7,stroke:#166534
Nightly rebuild (Step 9 in the training DAG):
- After the two-tower training step exports the new item embeddings (step 8), a SageMaker Processing job builds a fresh FAISS HNSW index. HNSW parameters: `M=32, ef_construction=200, ef_search=64`. Build time on the 5M-item corpus: ~25 minutes on `r6i.4xlarge` (a minimal build sketch follows this list).
- The new index is written to a new EFS path with a versioned suffix (e.g., `idx_v24/items.faiss`). The router's read flag flips from blue (v23) to green (v24) atomically once shadow validation passes.
- Old index (v23) is retained on EFS for 21 days as rollback target.
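A minimal sketch of the FAISS build with the parameters above, assuming the Step 8 export lands as an (N, 256) float32 array of L2-normalized item vectors; paths and the id-to-row mapping handling are illustrative.

```python
# build_ann_index.py — sketch of the nightly FAISS HNSW build (Step 9), illustrative paths.
import faiss
import numpy as np

def build_hnsw_index(item_embeddings: np.ndarray, out_path: str) -> None:
    """item_embeddings: (N, 256) float32, L2-normalized item-tower outputs.
    Row order must match the item-id-to-row mapping written alongside the index,
    since the serving handler translates FAISS row ids back to item ids."""
    dim = item_embeddings.shape[1]
    index = faiss.IndexHNSWFlat(dim, 32, faiss.METRIC_INNER_PRODUCT)  # M=32; dot-product scoring
    index.hnsw.efConstruction = 200
    index.hnsw.efSearch = 64
    index.add(item_embeddings)
    faiss.write_index(index, out_path)

if __name__ == "__main__":
    emb = np.load("item_embeddings_v24.npy").astype(np.float32)       # exported by Step 8 (assumed path)
    build_hnsw_index(emb, "/mnt/efs/idx_v24/items.faiss")
```

Inner-product metric is the natural choice here because both towers L2-normalize their outputs, so dot-product and cosine ranking coincide.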
Delta-merge during the day:
- Newly added catalog items (e.g., a new manga title released at 14:00 JST) cannot wait until the nightly rebuild. The Catalog MCP publishes a `catalog_item_added` event to a Kinesis stream; a Lambda consumer computes the content-only embedding (US-MLE-05) and writes it directly to an OpenSearch kNN index `delta_index`.
- Retrieval performs a two-shard fan-out: query the FAISS HNSW index for top-K_main (default K_main=80) and the OpenSearch delta index for top-K_delta (default K_delta=20), then merges by score. The score scales are calibrated nightly so the merge is honest.
- Delta items typically have inflated content-only scores compared to learned embeddings; an empirical 0.85 multiplier corrects for this.
- Each night's rebuild absorbs the delta — items present in the delta get learned embeddings in the next FAISS build, and the delta is reset to empty.
7. Online Serving
The user-tower forward pass is computed online on a ml.g5.xlarge SageMaker real-time endpoint. The item-tower is never called online — it is precomputed into the ANN index. Serving is therefore: (user-tower forward) → (FAISS query + delta query) → (filter + rerank handoff).
# endpoint_handler.py
import torch
import numpy as np
import faiss
from opensearchpy import OpenSearch
class RecoServingHandler:
def __init__(self, model_path: str, index_path: str, region: str = "ap-northeast-1"):
self.model = torch.jit.load(model_path).eval().cuda()
self.faiss_index = faiss.read_index(index_path, faiss.IO_FLAG_MMAP)
self.opensearch = OpenSearch(
hosts=[f"recommendations-delta.{region}.es.amazonaws.com"],
http_auth=aws_auth,
use_ssl=True,
)
self.stock_signal = StockSignalClient(region) # US-MLE-04
self.read_history = ReadHistoryClient(region) # DynamoDB
@torch.no_grad()
def retrieve(
self,
user_id: str,
context: dict,
k: int = 100,
) -> list[dict]:
# 1. Build user-tower input from session + 30d history
history, session = self.read_history.fetch(user_id, lookback_days=30)
user_vec = self.model.user(
history_emb=history.embeddings.cuda(),
history_mask=history.mask.cuda(),
history_recency=history.recency.cuda(),
intent_id=torch.tensor([context["intent_id"]]).cuda(),
intent_confidence=torch.tensor([context["intent_confidence"]]).cuda(),
session_features=session.cuda(),
).cpu().numpy().astype(np.float32)
# 2. FAISS HNSW query — top 80 from main index
main_scores, main_ids = self.faiss_index.search(user_vec, 80)
# 3. OpenSearch delta query — top 20 from delta index
delta_resp = self.opensearch.search(
index="delta_index",
body={
"size": 20,
"query": {"knn": {"embedding": {"vector": user_vec[0].tolist(), "k": 20}}},
},
)
delta_results = [
(hit["_source"]["item_id"], hit["_score"] * 0.85) # delta score correction
for hit in delta_resp["hits"]["hits"]
]
# 4. Merge and dedupe
merged = self._merge_scores(main_ids, main_scores, delta_results)
# 5. Filter: out-of-stock (US-MLE-04) and already-read
in_stock_ids = self.stock_signal.filter_in_stock([m["item_id"] for m in merged])
already_read = self.read_history.read_item_set(user_id)
filtered = [m for m in merged if m["item_id"] in in_stock_ids and m["item_id"] not in already_read]
return filtered[:k]
Latency budget (p95 ≤ 30 ms target):
| Step | Budget | Notes |
|---|---|---|
| Read history fetch (DynamoDB) | 4 ms | per-user-id GetItem; cache hot in DAX |
| User-tower forward (g5.xlarge) | 8 ms | batch=1, FP16, JIT-compiled |
| FAISS HNSW query | 6 ms | mmap'd index; ef_search=64 |
| OpenSearch delta query | 5 ms | warm shard; 20-result kNN |
| Merge + filter (stock, read) | 3 ms | in-process |
| Network + serialization | 4 ms | gateway round-trip |
| Total p95 | 30 ms | |
The autoscaler is configured for g5.xlarge between 4 and 32 instances; peak JP-region traffic at 1,400 RPS / ~50 RPS-per-instance ≈ 28 instances.
8. Drift Detection
Drift on this story is more nuanced than US-MLE-01 because recommendation has the four kinds plus a fifth recommendation-specific kind: label decay (the topic of Ground-Truth-Evolution/ML-Scenarios/01-recommendation-label-decay.md).
| Drift Kind | Detector | Cadence | Threshold |
|---|---|---|---|
| Input drift | PSI on top-30 user features; KL on item-popularity distribution | 5 min | PSI > 0.2 sustained 24h |
| Label drift | χ² on per-action-type counts; per-tenure-cohort positive rate | Daily | p < 0.01 sustained 7d |
| Prediction drift | KS on retrieval-score distribution; per-genre call rate | 5 min | KS > 0.15 sustained 24h |
| Concept drift | HR@20 on rolling 14-day random-policy holdout | Daily | Δ-HR@20 > 0.02 sustained 7d |
| Label decay | Per-recency NDCG buckets (0-7d, 7-30d, 30-90d) | Weekly | Δ across buckets > 0.04 |
The label-decay detector is the unique one. It bucket-evaluates NDCG per event-age bucket (0-7d, 7-30d, 30-90d) and looks for monotonic degradation as event age grows. Sustained drops in the 7-30d and 30-90d buckets relative to 0-7d indicate the decay-weighting of training events is mis-calibrated — the half-lives need re-tuning.
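A minimal sketch of the bucketed check, assuming an evaluation frame with one row per holdout event carrying an `event_age_days` column and a per-event `ndcg` value (column names are assumptions).

```python
# label_decay_detector.py — sketch of the per-recency bucket check (illustrative schema).
import pandas as pd

BUCKETS = [(0, 7), (7, 30), (30, 90)]

def label_decay_alarm(eval_df: pd.DataFrame, max_spread: float = 0.04) -> bool:
    """Fire when older-event buckets degrade relative to the 0-7d bucket by more
    than max_spread, i.e. the decay-weighting half-lives need re-tuning."""
    per_bucket = {}
    for lo, hi in BUCKETS:
        mask = (eval_df["event_age_days"] >= lo) & (eval_df["event_age_days"] < hi)
        per_bucket[f"{lo}-{hi}d"] = eval_df.loc[mask, "ndcg"].mean()
    reference = per_bucket["0-7d"]
    spread = max(reference - v for v in per_bucket.values())
    return spread > max_spread
```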
9. Retraining Trigger Logic
Three retrain triggers:
- Nightly incremental: EventBridge rule fires every day at 02:00 JST. Trains on the previous 24h of events, warm-starts from last good checkpoint. ~5h wall-clock; new ANN index ready by 07:30 JST (well before peak).
- Monthly full retrain: scheduled the first Saturday of each month at 22:00 JST (lowest-traffic window). ~12h wall-clock. Runs all the way through Stage 1 offline gate and registers as candidate; promotion is then driven by the standard four-stage gate.
- Drift-triggered: SNS topic on (HR@20 regression ≥ 0.02 sustained 7d) OR (cold-start CTR / warm CTR ratio < 0.75 sustained 7d) OR (US-MLE-05 promotes a new embedding model — coordinated retrain). Goes through the same eligibility gate as US-MLE-01 (`global_ml_freeze`, `recommendation_promotion_enabled`); a minimal eligibility-check sketch follows this list.
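A minimal sketch of the eligibility check run before a drift-triggered retrain launches. The `boto3` SSM call is real; the `promotion_enabled` path is the one listed under Kill-Switch Flags below, while the exact path for `global_ml_freeze` is an assumption.

```python
# retrain_eligibility.py — sketch of the pre-retrain kill-switch check (illustrative paths).
import boto3

ssm = boto3.client("ssm", region_name="ap-northeast-1")

def retrain_allowed() -> bool:
    freeze = ssm.get_parameter(Name="/manga-ml/global_ml_freeze")["Parameter"]["Value"]
    promo = ssm.get_parameter(Name="/manga-ml/recommendation/promotion_enabled")["Parameter"]["Value"]
    # global freeze always wins; the story-level promotion flag must be explicitly enabled
    return freeze.lower() != "true" and promo.lower() == "true"
```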
10. Multilingual Handling (JP/EN-Specific)
Three JP-specific concerns matter project-wide:
Separate user-tower variants for JP-primary vs EN-primary users. The two-tower has two user-tower heads sharing the same item tower: user_tower_jp and user_tower_en. The router selects based on the user's preferred_language flag set during onboarding; mixed-language users default to user_tower_en for now (to be re-evaluated when JP/EN mixed-mode user volume crosses 5%). Each head trains on its language's session data only, sharing item-tower gradients.
The reason: JP user taste profiles are systematically different. JP users over-engage with seinen and slice-of-life genres compared to EN; recommendation patterns that work for EN users (action-skewed) misfire on JP users. A single user tower trained on the global mix optimizes for the EN distribution because EN users have higher rating density; the bilingual split catches this. JP-region traffic is 85% of total, so even a small regression on the JP head is a large absolute loss.
Item tower is language-agnostic. Since US-MLE-05's cross-lingual embedding handles JP/EN/KO/ZH titles in a shared space, the item tower needs no language split. Cross-lingual recommendations (a JP user discovering an EN-primary title) work natively.
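A sketch of the split-head arrangement: two user towers in one module sharing a single item tower, so item-side gradients accumulate from both language streams. The wrapper class and dispatch key are illustrative, not the production artifact layout.

```python
# bilingual_user_tower.py — illustrative sketch of per-language user-tower heads
# sharing one item tower, as described above.
import torch
import torch.nn as nn

class BilingualTwoTower(nn.Module):
    def __init__(self, user_tower_jp: "UserTower", user_tower_en: "UserTower", item_tower: "ItemTower"):
        super().__init__()
        self.user_towers = nn.ModuleDict({"ja": user_tower_jp, "en": user_tower_en})
        self.item = item_tower            # shared: gradients flow from both language heads

    def user_vec(self, preferred_language: str, **user_inputs) -> torch.Tensor:
        # Mixed-language users default to the EN head for now (per the story text).
        key = preferred_language if preferred_language in self.user_towers else "en"
        return self.user_towers[key](**user_inputs)
```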
Politeness markers do not affect this story. Unlike intent classification, recommendation does not read message text directly — the model reads intent labels (US-MLE-01) and item embeddings (US-MLE-05). All language-specific text handling is upstream. This is a deliberate architecture decision: the recommender does not duplicate the language-handling logic that US-MLE-01 and US-MLE-05 already own.
Monitoring & Metrics
| Category | Metric | Target | Alarm Threshold |
|---|---|---|---|
| Online — Latency | p50 retrieval | ≤ 18 ms | > 25 ms 5min |
| | p95 retrieval | ≤ 30 ms | > 40 ms 5min |
| | p99 retrieval | ≤ 60 ms | > 80 ms 5min |
| Online — Quality | HR@20 (random-policy slice) | ≥ 0.18 | < 0.16 24h |
| | Cold-start CTR / warm CTR | ≥ 0.80 | < 0.75 24h |
| | Add-to-list rate | ≥ baseline - 0.02 | regression 7d |
| | Genre coverage @ K=100 | ≥ 0.6 of catalog genres | < 0.5 24h |
| Online — Throughput | RPS | match traffic | scale-out lag > 90s |
| | Endpoint instance count | 4–32 | stuck at max 30min |
| Online — Errors | Personalize 5xx rate | < 0.1% | > 0.5% 5min |
| | ANN-search timeout rate | < 0.05% | > 0.5% 5min |
| | Stock-filter cache miss rate | < 1% | > 5% 1h |
| Per-slice | HR@20 EN | ≥ 0.16 | < 0.14 24h |
| | HR@20 JP | ≥ 0.16 | < 0.14 24h |
| | HR@20 mixed | ≥ 0.14 | < 0.12 24h |
| Drift | PSI per top-30 user feature | < 0.2 | > 0.2 24h |
| | Per-recency NDCG bucket spread | < 0.04 | > 0.04 7d |
| | Δ-HR@20 vs reference | < 0.02 | > 0.02 7d |
| Cost | $/1k retrievals | ≤ $0.045 | > $0.06 24h |
| | Nightly training $/run | ≤ $35 | > $55 |
| | Monthly full retrain $/run | ≤ $180 | > $260 |
| | ANN index storage $/month | ≤ $40 | > $60 |
| Pipeline | Nightly retrain success rate | ≥ 98% | < 95% 30d rolling |
| | ANN rebuild wall-clock | ≤ 35 min | > 60 min |
Risks & Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Position-bias from US-MLE-02 propensity model drifts faster than US-MLE-06 retrains | IPS weights miscalibrated; offline HR@20 stable while online CTR drops | Daily reconciliation alarm at 2-version gap; coordinated retrain SNS trigger |
| In-batch negatives bias toward popular items | Long-tail items under-recommended; genre coverage drops | Popular-item-stratified negatives + per-popularity-bucket monitoring |
| ANN index rebuild fails before peak | Stale recommendations during peak; freshness SLA breach | Green index validates with shadow query before swap; blue-rollback path documented |
| Item-side cold-start skew on new items | New manga don't surface to relevant users | Content-only fallback embedding from US-MLE-05; bandit exploration on warming items |
| Personalize endpoint quota exhausted at peak | Cold-start users see error or fallback; CTR drops | Reserved capacity contract with Personalize; popularity-baseline fallback |
| US-MLE-05 promotes new embedding model without coordinated US-MLE-06 retrain | Item embeddings in two-tower diverge from item-tower expectations; HR@20 regresses | Coordinated change request between US-MLE-05 and US-MLE-06 owners; embedding-version pinning in registry |
| US-MLE-04 stock-signal staleness recommends out-of-stock | Customer sees unavailable items | TTL on stock cache ≤ 60s; post-retrieval refresh check on top-3 |
| Bandit exploration surfaces low-quality items to warming users | Cold-start CTR ratio regresses | Per-warming-item exposure cap; auto-disable bandit on cohort CTR drop > 5% |
| User-tower features quasi-reversible (privacy) | Embedding leak could partially reconstruct reading history | Aggressive PII redaction; user_id always hashed; embedding storage encrypted at rest with KMS |
| JP user-tower regresses while EN user-tower improves | Aggregate HR@20 looks fine while 85% of traffic degrades | Per-language gate is hard-fail; aggregate is advisory only |
| Label decay causes silent drift between full retrains | NDCG@10 stable on biased holdout while production CTR decays | Per-recency NDCG bucket monitor; random-policy 5% slice for honest OPE |
| Nightly incremental warm-start drifts from monthly full | Incremental optima diverge over the month | Monthly full retrain resets; incremental checkpoint diff monitored against monthly baseline |
| 200M sessions/month feature-store load saturates online reads | User-tower fetch latency spikes during peak | DAX in front of DynamoDB; 1000-item per-user history cap; pre-aggregated session windows |
Deep Dive — Why This Works at Amazon-Scale on the Manga Workload
The MangaAssist recommendation system is a 100M-parameter two-tower (50M user + 50M item) plus a managed Personalize fallback. At 200M sessions/month and 1.2B retrievals/month, the system serves at the upper edge of what a single-region two-tower can comfortably handle. Five workload properties make the design above the right shape rather than a single Personalize-only or single-two-tower-only solution:
Workload property 1: cold-start is structural, not edge-case. New users join continuously; ~14% of all retrieval requests come from users with <5 interactions. A pure two-tower system that requires history simply can't serve them. Naive solutions (popularity baseline, content-only) underperform Personalize HRNN-meta by ~25% on cold-start CTR because Personalize's HRNN captures session-internal patterns the two-tower with thin history misses. The hybrid router is the answer: route to the model that actually has signal for the user's regime. This is the same pattern as US-MLE-01's tiered-model policy — pick the model that matches the input regime.
Workload property 2: 85% JP traffic with systematically different taste profiles. Aggregate HR@20 on a globally trained model is a lying metric — it averages JP and EN taste, optimizing toward the larger-rating-density EN cohort. The JP user-tower split is non-optional. The decision to share the item tower (language-agnostic via US-MLE-05's cross-lingual embedding) but split the user tower is the architectural commitment: items are universal, tastes are local. The same pattern shows up in US-MLE-02's reranker training (separate per-language gradient streams for the same cross-encoder) and in US-MLE-08's cover-art classifier (separate JP/EN style heads).
Workload property 3: weak offline-online correlation, by nature. Per the foundations doc §4.3, recommendation is on the short list of stories pre-flagged as "always extended canary." Offline HR@20 on a holdout that inherits the same exposure bias as training is not a leading indicator of online CTR. The 14-day extended canary at 10% traffic is the only honest signal. This is why this story has more comprehensive online monitoring (per-cohort, per-recency, per-genre, per-language) than US-MLE-01 — offline metrics are advisory; online metrics are binding. The drift-counterpart file documents the same lesson: NDCG@10 stable while CTR decays for two consecutive quarters.
Workload property 4: label decay is continuous and asymmetric. A click 90 days ago has weak predictive value for next week; a click yesterday has strong value. The decay-weighting (per-event-type half-life from clicks at 14d to explicit thumbs-up at 365d) handles this. Without it, the training set is a flat sum across staleness windows and the model learns "what users clicked on average over 90 days" rather than "what they'd click now." The IPS reweighting handles selection bias on top of decay: even after weighting events by recency, an event was exposed via the previous policy, and that exposure is biased. Decay + IPS together are necessary; either alone is insufficient.
Workload property 5: catalog freshness matters within hours, not days. A new manga release dropped at 14:00 JST should reach relevant users within 4 hours, not at next morning's nightly rebuild. The delta-merge ANN pattern (FAISS for the stable bulk + OpenSearch kNN for the last-24h delta) is what lets retrieval be both fast (HNSW on a static index) and fresh (real-time delta). The delta-score correction (0.85 multiplier on content-only embeddings) prevents new items from over-ranking against learned embeddings. The bandit on the warming cohort is the third layer: even with the delta merge, new items have noisy reward estimates, so Thompson sampling allocates exposure to gather data efficiently rather than locking into stale estimates.
These five properties together explain why the per-cohort gate + extended canary + decay-weighted training + IPS + delta-merge ANN + bandit + hybrid router machinery exists. Drop any one and either cold-start collapses, JP users get under-served, or label-decay drift goes undetected until customer complaints reach the CX dashboard.
Real-World Validation
Industry analogues. Spotify's discovery system uses an analogous two-tower with a managed cold-start fallback (their internal "Cold Start Service") and reports the same 5-interaction warming threshold. Netflix's recommendation team published a 2023 paper describing per-locale user-tower variants for non-US markets, with shared item embeddings — the same JP/EN architecture decision applied to (US, MX, BR, JP, KR). YouTube's two-tower ranking model uses sampled-softmax with in-batch + popular-item-stratified negatives in their published 2019 paper, and their team has published more recent work confirming per-cohort canary length is the right signal for recommendation promotion (not offline metrics). Amazon's own recommendations team uses Amazon Personalize for cold-start across multiple Amazon services, with custom downstream rerankers for warm users — the same pattern this story implements.
Math validation — cost. Two-tower nightly incremental: g5.12xlarge at $5.672/hr in ap-northeast-1 × 5 hours × 30 nights/month = $851/month. Monthly full retrain: $5.672 × 12 = $68/month. Personalize: ~$0.40 per 1K retrievals × 0.14 × 1.2B/month = $67,200/month. Two-tower serving: g5.xlarge at $1.408/hr × average 12 instances × 24 × 30 = $12,165/month. ANN storage on EFS: 5GB × $0.30/GB/month + 21-day retention × 2 versions ≈ $40/month. Total: ~$80,300/month for retrieval; against ~1.2B retrievals/month = $0.0000669 per retrieval. Cold-start path dominates the cost (Personalize is 84% of cost for 14% of traffic) — this is a known economic asymmetry of managed cold-start services. The contracted target of $0.000045 per retrieval is achieved on the warm path; the cold-start path is over-budget but on a deliberate "spend more for cold-start coverage" basis until in-house cold-start parity is achieved.
Math validation — latency. User-tower forward on g5.xlarge with FP16 and JIT, batch=1: ~6ms (50M params, single matmul-heavy forward). FAISS HNSW search on 5M-item index, ef_search=64: ~6ms (measured). OpenSearch kNN delta query, ~50K items: ~5ms. DynamoDB DAX history fetch: ~4ms. Stock-filter check on top-100: ~3ms. Network + serialization: ~4ms. Total: ~28ms p50, ~35ms p95 with tail-latency mitigation. The ≤30ms p95 target is achievable but only with strict tail-latency engineering — DAX cache warming, FAISS mmap (no first-touch cost), and OpenSearch reserved capacity.
Math validation — training data volume. 200M sessions/month × ~2.5 positive interactions per session = 500M positive events/month, matching the cited contract. Stratified across (tenure × recency × action_type × language) with 12 cohort buckets, the per-bucket minimum-volume gate at 100K events/bucket requires 1.2M events/cohort-month, which 500M / 12 = 41M events/cohort easily satisfies. The training pipeline can therefore safely apply the stratified sampling without starving any cohort.
Math validation — ANN index size. 5M items × 256-dim float32 = 5GB. HNSW with M=32 adds graph structure ≈ 2× the raw vectors = 10GB total. EFS storage for 2 versions (blue + green) = 20GB. mmap'd into memory on a 32GB-RAM r6i.4xlarge retrieval host with 12GB headroom for OS/cache. Tail latency for ef_search=64 on this scale is well-characterized at ~6ms p95.
Cross-Story Interactions
| Edge | Direction | Contract | Conflict mode |
|---|---|---|---|
| US-MLE-06 ← US-MLE-05 (embedding adapter) | reads item embeddings | item-tower content_emb input pinned to US-MLE-05 model_version | If US-MLE-05 promotes a new embedding model without coordinated retrain, item-tower embeddings diverge silently; HR@20 regresses 0.02-0.05 within days. Mitigation: embedding-version registry pinning + coordinated change request. |
| US-MLE-06 ← US-MLE-01 (intent classifier) | reads intent label as context | context-tower input gated by `intent_confidence ≥ 0.6` | When US-MLE-01 mis-routes, context tower reads wrong intent. Mitigation: confidence gate falls back to user-only tower (no context contribution). |
| US-MLE-06 ← US-MLE-02 (reranker propensity model) | reads position propensities for IPS | per (slot, intent, language, cohort) tuple, refreshed daily | US-MLE-02 promoting new propensity model requires US-MLE-06 IPS recompute. Mitigation: 2-version-gap alarm; coordinated retrain SNS trigger. |
| US-MLE-06 ← US-MLE-04 (forecasting) | reads stock signal | item-id → in_stock boolean, TTL ≤ 60s | Out-of-stock items must never appear in final candidates. Mitigation: post-retrieval refresh check on top-3; defense-in-depth via cross-encoder reranker also reads stock. |
| US-MLE-06 ← US-MLE-03 (ABSA) | reads aspect signals | item-id → 8-aspect sentiment vector | ABSA failure leaves aspect input as zero vector; item tower is robust to this (aspect_proj has trained zero-handling). |
| US-MLE-06 → US-MLE-02 (reranker) | provides candidate set | top-100 (item_id, score) tuples | If US-MLE-06 returns < 100 candidates (e.g., aggressive stock filter), reranker has less to rerank; UI fills with popular fallback. |
| US-MLE-06 → Personalize | feeds event tracker | real-time event publish on every interaction | Personalize rate-limit at 100 events/sec/tracker; mitigation: Kinesis batching at 5-event windows. |
| US-MLE-06 ↔ Cost-Optimization US-08 (traffic-based) | tier-aware degradation | under traffic-tier 3 (severe), drop the bandit layer and skip context tower | Coordination point — degradation thresholds are co-owned. |
| US-MLE-06 ↔ RAG-MCP-Integration recommendation MCP | this story owns the model the MCP wraps | model artifact, ANN index, Personalize endpoint name | MCP-level changes (tool surface) coordinate with this story's SLA. |
Rollback & Experimentation
Shadow Mode Plan
- Duration: 48 hours minimum (longer than US-MLE-01's 24h because retrieval signals need more samples for power). 96 hours during low-traffic windows.
- Sample size: ~40M retrievals/day × 2 days × 0.05 (5% sample) ≈ 4M comparisons; ample statistical power.
- Pass criteria: candidate-set agreement with v23 between 50% and 80% (lower bound is loose because retrieval should differ more than classification; upper bound prevents "no new behavior"). Per-language agreement within ±5pp of global.
- Slice criteria: HR@20 on shadow per (language × cohort × time-of-day) within ±0.02 of v23. Cold-start cohort within ±0.04 (more variance allowed because Personalize signal has higher inherent variance).
Canary Thresholds (Extended)
The 14-day extended canary is unique to this story. Three sub-phases:
- Phase A (10% traffic, days 1–4): error rates within 1.2× baseline; no SEV-3 pages from auto-abort daemon. Per-cohort CTR computed daily.
- Phase B (10% traffic, days 5–10): per-cohort CTR within ±5% of baseline for all cohorts; cold-start CTR ratio ≥ 0.78. Per-recency NDCG buckets stable.
- Phase C (10% traffic, days 11–14): full statistical-significance test on add-to-list rate (the lagging signal). Non-inferiority framing: reject H0 that the candidate's add-to-list rate < baseline - 0.02 absolute, via a one-sided two-sample test at α=0.01 (a minimal sketch follows this list).
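One reasonable instantiation of that test is a one-sided two-proportion z-test; the sketch below is illustrative and the production harness may use a different estimator. The counts in the example call are made up.

```python
# canary_phase_c.py — sketch of the Phase C non-inferiority test on add-to-list rate.
from math import sqrt
from statistics import NormalDist

def non_inferior(canary_pos: int, canary_n: int, base_pos: int, base_n: int,
                 margin: float = 0.02, alpha: float = 0.01) -> bool:
    """Reject H0: p_canary < p_baseline - margin (conclude non-inferiority) at level alpha."""
    p_c, p_b = canary_pos / canary_n, base_pos / base_n
    se = sqrt(p_c * (1 - p_c) / canary_n + p_b * (1 - p_b) / base_n)
    z = (p_c - (p_b - margin)) / se
    return z > NormalDist().inv_cdf(1 - alpha)

# Illustrative: 10% canary arm vs the 90% baseline arm over the Phase C window.
print(non_inferior(canary_pos=182_000, canary_n=4_000_000,
                   base_pos=1_710_000, base_n=36_000_000))
```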
Kill-Switch Flags
- `recommendation_promotion_enabled` (default: false; SSM `/manga-ml/recommendation/promotion_enabled`) — when false, monthly retrain logs and exits.
- `recommendation_canary_pause` (default: false) — halts traffic shift at current canary stage; manual unblock required.
- `bandit_enabled` (default: true) — when false, warming-cohort users get pure two-tower exploitation (no Thompson sampling); used during high-stakes experiments to remove exploration variance.
- `delta_merge_enabled` (default: true) — when false, retrieval queries only the FAISS index; new catalog items don't surface until next rebuild. Used to debug delta-index issues.
- `personalize_circuit_breaker` (default: false) — when true, cold-start users get popularity-baseline fallback (no Personalize call). Triggered automatically on Personalize 5xx rate > 5%.
- `global_ml_freeze` (default: false) — overrides the above per kill-switch precedence.
Quality Regression Criteria (Hard Rollback)
A canary that satisfies any one of these conditions is automatically rolled back:
- Per-cohort CTR regression > 0.05 absolute for any 2-hour window in any cohort.
- Cold-start CTR ratio < 0.70 for any 4-hour window.
- Per-language HR@20 regression > 0.04 for any 24-hour window.
- p99 retrieval latency > 80ms for any 5-minute window during peak.
- ANN search timeout rate > 0.5% for any 5-minute window.
- Out-of-stock items appearing in final candidates > 0.1% of requests (compliance breach).
The rollback procedure:
- Personalize: re-import last-known-good solution version (~2-3 minute SDK call).
- Two-tower: SageMaker traffic shift to v23 (60 seconds).
- ANN index: router flag flips green→blue (15 seconds).
End-to-end rollback SLA: ≤ 4 hours (per acceptance criteria), but in practice ≤ 5 minutes for the in-flight critical path.
Multi-Reviewer Validation Findings & Resolutions
S1 — Must Fix Before Production
ML Scientist lens: The two-tower trains with in-batch negatives, but in-batch is biased toward popular items (popular items appear in many users' positive lists, hence as negatives for other users' rows in the same batch). Without correction, the model under-recommends head items. Resolution: Popular-item-stratified negatives (32 per positive, sampled from top-1% most-shown items with propensity correction) added to the loss alongside in-batch. Per-popularity-bucket monitoring confirms head and tail items both have stable HR@20.
ML Scientist lens (S1): IPS variance at low propensities (e.g., shown at slot 20 of K=20) creates training instability. A single low-propensity event with raw IPS weight of 1000+ swamps the loss. Resolution: Floor propensity at 0.01, clip IPS weight at 100, document the bias the floor introduces. Per the drift-counterpart §grill 1, doubly-robust estimation reserved as fallback when canary OPE variance exceeds threshold.
SRE lens: ANN index swap (blue→green) is currently a single-shot atomic operation; if green has corrupted entries (e.g., NaN embeddings from a bad training run), the swap surfaces them immediately. Resolution: Pre-swap validation job runs 1000-query smoke test against green, comparing recall against blue at 95% threshold. If smoke test fails, swap is blocked and on-call is paged.
Application Security / Privacy lens: User-tower features (30-day reading history, taste embedding) are quasi-reversible PII — an attacker with embedding access could partially reconstruct what someone has been reading, which has both regulatory and CX implications in JP (APPI). Resolution: All user_id values hashed (SHA-256 with rotating salt); embedding storage encrypted at rest with KMS-managed keys; embedding access logged to CloudTrail with quarterly audit; data residency enforced in ap-northeast-1 with cross-region replication explicitly forbidden for user-tower artifacts.
S2 — Address Before Scale
Data Engineering lens: 1.2B retrievals/month × 60-day hot retention = 72B prediction-log rows × ~500 bytes/row ≈ 36TB hot. This is bounded but expensive. Resolution: Drop low-cardinality fields from log (tier-3 features that aren't used in drift detection), shorten hot retention to 45 days, accelerate Glacier transition. Net 28% cost reduction with no observable monitoring loss.
FinOps lens: Personalize cost dominates (~$67K/month for 14% of traffic). At current cold-start volume growth, this is projected to hit $90K/month within a year. Resolution: Quarterly evaluation of in-house cold-start replacement (a smaller specialized model trained on first-5-interactions). Until the in-house model achieves cold-start CTR parity within 95%, Personalize stays. Tracked as v2 backlog item.
ML Scientist lens (S2): Offline-online correlation has been 0.42 over the last 6 retrains (well below the 0.6 trustworthy floor — exactly why this story is "always extended canary"). The 14-day canary is the right fix, but quarterly recalibration of the offline gate against current online distribution would tighten the offline signal and give earlier warning of issues. Resolution: Quarterly recalibration job calendared; results land in registry metadata.
Principal Architect lens: The hybrid router's 5-interaction threshold is currently a hardcoded constant. As Personalize's cold-start performance evolves and the two-tower's thin-data behavior improves, the right threshold shifts. Resolution: Threshold is a registry parameter; quarterly re-evaluation via offline replay on holdout of users who crossed the boundary; current 5 stays default but is data-driven going forward.
S3 — Future Work
Principal Architect lens (S3): A unified single model that handles both cold and warm cases (e.g., a Transformer-based recommender that conditions on user-history-length) could replace the hybrid router. Tracked as a v2 backlog item; revisit when the hybrid router's coordination overhead exceeds 10 engineer-days/quarter.
ML Scientist lens (S3): Item-side cold-start currently uses content-only embedding fallback. A dedicated "item-warming" sub-model that learns from the bandit's 7-day exposure data could improve cold-item discovery. Defer until current architecture is stable for 90 days post-launch.
SRE lens (S3): Multi-region active-active for recommendation. Currently this story serves only ap-northeast-1; an EU region extension would require cross-region item-embedding replication (allowed; items are not customer data) and per-region user-tower training (forbidden cross-region for customer data). The architecture supports this but the operational complexity of two independent retrain pipelines is non-trivial. Tracked under data-residency expansion roadmap alongside US-MLE-01 and US-MLE-05.
Personalize PM lens (S3): Personalize's HRNN-meta is itself evolving (AWS roadmap suggests a transformer-based variant in 2026). When that ships, this story's cold-start path will need re-evaluation — possibly re-routing the 5-interaction threshold downward if the new variant outperforms the two-tower on slightly-warmer users. Tracked as quarterly check-in with Personalize PM.