
00. Foundations & Primitives for Amazon-Scale ML Engineering on MangaAssist

This document defines the seven primitives every per-story implementation deep-dive in this folder is built from. The same primitives apply whether the model is a 67M-parameter DistilBERT (US-MLE-01), a 6B-parameter LoRA-adapted Titan embedder (US-MLE-05), or a LightGBM ensemble (US-MLE-07). Read this file once; the per-story files in 01-deep-dive-per-ml-story.md reference these primitives by name without re-deriving them.

The seven primitives:

  1. Label Engineering — how raw signal becomes supervised training data
  2. Feature Engineering — how serving inputs become training inputs without temporal leakage
  3. Training Infrastructure — distributed training, checkpoint discipline, spot reclaim
  4. Evaluation — golden + slice + adversarial + counterfactual + offline-online correlation
  5. Promotion — model registry, shadow, canary, blue/green
  6. Drift Detection — input + label + prediction + concept
  7. Retraining — scheduled vs triggered, freshness windows, rollback semantics

These are the working vocabulary of the ML Engineer role on this project. If a story file says "the standard PIT-correct feature read" or "the standard four-stage promotion gate," the definition lives here.


1. Label Engineering Primitive

1.1 The four label sources

Every supervised model on the MangaAssist platform draws labels from one or more of:

| Source | Volume / day | Latency to label | Quality | Used by |
|---|---|---|---|---|
| Vendor annotation (Appen, Sama, internal Mechanical-Turk-style) | ~5K | 24–72 h | High (κ ≥ 0.75) | US-MLE-01, -03, -08 |
| Implicit feedback (clicks, dwells, purchases, returns) | ~50M | Seconds (online), hours (batch) | Noisy (position and popularity bias) | US-MLE-02, -06 |
| Programmatic labeling (heuristics, regex, Snorkel-style labeling functions) | ~500K | Seconds | Medium-high; biased by rule-set | US-MLE-01 (rule-based pre-filter), -07 |
| LLM-distilled labels (Claude Sonnet writes weak labels) | ~50K | Minutes (batch) | Medium; subject to LLM biases | US-MLE-03, -07 (data augmentation) |

The non-negotiable rule: a model never reads from a single label source alone for production training. Cross-validation across sources catches systematic bias in any one source. US-MLE-07's spam classifier uses all four; US-MLE-01's intent classifier uses three (vendor + programmatic + LLM-distilled, no implicit because intent is not measured by clicks).
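In practice, "cross-validation across sources" can start as simple pairwise agreement on co-labeled examples. A minimal sketch, assuming a pandas DataFrame with one row per (example_id, label_source) and a class column; the 0.85 flag threshold is illustrative, not a platform contract:

import itertools
import pandas as pd

def cross_source_agreement(labels: pd.DataFrame, floor: float = 0.85) -> pd.DataFrame:
    """labels: columns example_id, label_source, class (one row per pair)."""
    wide = labels.pivot_table(index="example_id", columns="label_source",
                              values="class", aggfunc="first")
    rows = []
    for a, b in itertools.combinations(wide.columns, 2):
        both = wide[[a, b]].dropna()  # examples labeled by both sources
        if both.empty:
            continue
        agree = float((both[a] == both[b]).mean())
        rows.append({"source_a": a, "source_b": b, "n_overlap": len(both),
                     "agreement": agree, "flagged": agree < floor})
    return pd.DataFrame(rows)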

1.2 The label platform contract

The label platform (a managed Iceberg table on S3 keyed by (label_source, model_target, example_id, label_version)) exposes a stable schema:

from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class LabelRecord:
    example_id: str                # SHA-256 of (canonical input, model_target)
    model_target: str              # "intent_classifier", "spam_v3", "absa_aspect", ...
    label_value: dict              # canonical: {"class": "...", "confidence": 0.92, "metadata": {...}}
    label_source: str              # "vendor_appen", "vendor_sama", "programmatic", "implicit", "llm_distill"
    annotator_id: Optional[str]    # vendor annotator or LLM model id
    label_version: int             # monotonically increasing per (model_target, source)
    captured_at: datetime          # when raw signal was captured
    labeled_at: datetime           # when label was finalized
    review_status: str             # "draft", "qa_passed", "qa_failed", "withdrawn"
    iaa_lot_id: Optional[str]      # inter-annotator-agreement lot for vendor labels

The contract: a model registry version pins to a label snapshot via (model_target, max(label_version)). A model trained on label_version 47 is reproducible from the Iceberg snapshot taken at the moment label_version 47 was finalized. Snapshot expiration is 18 months; older snapshots are tiered to S3 Glacier with s3:RestoreObject runbooks documented.
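The snapshot pin reduces to a reproducible query. A minimal sketch, assuming a hypothetical table name (label_platform.label_records) and an Athena/Spark-style SQL dialect; the real path and engine live in the label platform documentation:

def training_label_query(model_target: str, pinned_label_version: int) -> str:
    # Replaying this query against the Iceberg snapshot taken when
    # pinned_label_version was finalized reproduces the training set exactly.
    return f"""
        SELECT example_id, label_value, label_source, captured_at
        FROM label_platform.label_records
        WHERE model_target = '{model_target}'
          AND label_version <= {pinned_label_version}
          AND review_status = 'qa_passed'
    """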

1.3 Label QA and IAA (inter-annotator agreement)

For vendor labels:

  • Triple-annotation on 10% of examples (3 annotators per example), with the remaining 90% double-annotated (2 annotators) and disagreements escalated to a senior annotator.
  • Per-batch IAA computed as Cohen's κ on the triple-annotated subset. If κ < 0.65, the batch is rejected and re-annotated at vendor expense (contractual). If 0.65 ≤ κ < 0.75, the batch is accepted but flagged — labels enter the pool but are excluded from holdout evaluation. (A sketch of this gate follows the list.)
  • Per-annotator drift dashboards — running κ vs the consensus over time. An annotator whose κ drops 2σ below their 30-day baseline is paused for re-calibration. This catches both annotator fatigue and silent annotation-guideline misinterpretation.
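A sketch of the per-batch gate under the thresholds above, using scikit-learn's cohen_kappa_score. Averaging pairwise κ across the three annotators is an assumption; Fleiss' κ is the other common choice for a triple-annotated subset.

import itertools
from sklearn.metrics import cohen_kappa_score

def batch_iaa_decision(annotations: dict[str, list[str]]) -> tuple[float, str]:
    """annotations: annotator_id -> class labels, aligned by example,
    on the triple-annotated subset of the batch."""
    pairs = list(itertools.combinations(annotations.values(), 2))
    kappa = sum(cohen_kappa_score(a, b) for a, b in pairs) / len(pairs)
    if kappa < 0.65:
        return kappa, "reject_and_reannotate"  # at vendor expense, per contract
    if kappa < 0.75:
        return kappa, "accept_flagged"         # excluded from holdout evaluation
    return kappa, "accept"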

For programmatic and LLM-distilled labels:

  • Calibration sample of 500 examples re-labeled by a vendor each week, audited against the programmatic / LLM output. If precision drops below the contracted floor (typically 0.90), the rule-set or LLM prompt is rolled back.

1.4 Bilingual label QA (project-specific)

Per the project-wide cross-cutting concern, EN and JP labels are tracked as separate IAA pools. A vendor's English-team κ tells you nothing about their Japanese-team κ. Each story's training run reads label_iaa_en and label_iaa_jp separately and refuses to train if either falls below the floor for that label_version.
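A minimal sketch of the refuse-to-train guard, assuming the per-language IAA values are read alongside the label snapshot; the exception type is illustrative:

IAA_FLOOR = 0.75

def assert_bilingual_iaa(label_iaa_en: float, label_iaa_jp: float,
                         label_version: int) -> None:
    # Each pool must clear the floor on its own; a strong EN pool
    # cannot compensate for a weak JP pool.
    for lang, iaa in (("en", label_iaa_en), ("jp", label_iaa_jp)):
        if iaa < IAA_FLOOR:
            raise RuntimeError(
                f"refusing to train: label_iaa_{lang}={iaa:.3f} < {IAA_FLOOR} "
                f"at label_version={label_version}")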


2. Feature Engineering Primitive

2.1 Point-in-time (PIT) correctness

The single most expensive bug class in ML systems on this platform is training-serving skew from non-PIT-correct feature reads. The rule is unbreakable:

When you read feature f for example e whose ground truth was captured at t_label, you must read f's value as it would have been visible to the serving system at t_label - Δ_serving_lag, never the current value of f.

The platform feature store (managed Iceberg with as_of snapshots, fronted by SageMaker Feature Store for online reads) enforces this in two ways:

  • Offline read API requires as_of parameter — passing as_of=None is a deploy-time lint error, not a runtime warning.
  • Δ_serving_lag is per-feature and lives in the feature catalog. For features computed online (e.g., session-level intent count) it is 0; for batch-computed features (e.g., 7-day reading-history embedding) it is 1 hour minimum.

A worked example, US-MLE-06's recommendation training:

# WRONG — reads current value of `user_taste_embedding`
features = feature_store.get(user_id="abc", features=["user_taste_embedding"])

# RIGHT — reads what it would have been when the click happened
features = feature_store.get(
    user_id="abc",
    features=["user_taste_embedding"],
    as_of=click_event.captured_at - feature_store.serving_lag("user_taste_embedding"),
)

The PIT contract is verified at training time by a leak-detector job that re-reads features at as_of=captured_at - 1ns (just before the label) for a 1% sample and compares against the feature value at as_of=captured_at + serving_lag (the value that would have been logged). Any features whose values correlate suspiciously with the label across this 1ns window are flagged. This catches the classic "label baked into feature via a downstream pipeline" leak, which has cost more on this platform than any other class of bug.
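A simplified sketch of that check, restricted to numeric features and binary labels; the 0.3 correlation threshold is illustrative. The intuition: a feature whose value moves between "just before the label" and "after serving lag", in a way that correlates with the label, has probably had the label baked in by a downstream pipeline.

import numpy as np

def leak_suspects(before: np.ndarray, after: np.ndarray, labels: np.ndarray,
                  feature_names: list[str], threshold: float = 0.3) -> list[str]:
    """before/after: (n_examples, n_features) values read at
    captured_at - 1ns and captured_at + serving_lag on the 1% sample."""
    delta = after - before
    flagged = []
    for j, name in enumerate(feature_names):
        if np.std(delta[:, j]) == 0:  # feature did not move across the window
            continue
        corr = np.corrcoef(delta[:, j], labels)[0, 1]
        if abs(corr) > threshold:
            flagged.append(name)
    return flagged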

2.2 Training-serving skew

Even with PIT-correct reads, training-serving skew arises from:

  • Schema drift — a column type changes upstream; training-time decoder uses the new type, serving still uses the old.
  • Default-value drift — a missing-feature default changes between training and serving.
  • Encoder drift — a categorical encoder is re-fit at training time but the serving encoder is not redeployed.

The platform mitigates each:

  • Schema drift: feature catalog is versioned; training jobs pin to a feature schema version and serving endpoints declare the schema version they expect. A mismatch triggers a deploy-time refusal.
  • Default-value drift: defaults are stored in the feature catalog, not in code. Both training and serving read defaults from the catalog at the pinned schema version.
  • Encoder drift: encoders are model artifacts, not code. They are versioned in the model registry alongside the model, and the serving endpoint loads (model, encoder) as a tuple.

2.3 Feature freshness vs feature recency

These are different. Freshness is "how stale is the value the serving system reads?" Measured in seconds-to-minutes for online features, hours for batch. Recency is "how recent is the signal the value summarizes?" A 7-day reading-history embedding has high freshness (computed every hour) but bounded recency (covers the last 7 days). Both must be tracked per feature; both can drift independently.


3. Training Infrastructure Primitive

3.1 SageMaker Pipelines as the training DAG

Every model on the platform trains via a SageMaker Pipelines DAG with these steps in order:

flowchart LR
    A[1. Data<br/>Validation] --> B[2. Feature<br/>Materialization]
    B --> C[3. Train/Val/Test<br/>Split]
    C --> D[4. Training]
    D --> E[5. Offline<br/>Evaluation]
    E --> F[6. Slice<br/>Analysis]
    F --> G[7. Bias<br/>Audit]
    G --> H[8. Model<br/>Registration]
    H --> I[9. Shadow<br/>Deploy]
    I --> J[10. Canary<br/>Promotion]

    style A fill:#fd2,stroke:#333
    style E fill:#fd2,stroke:#333
    style H fill:#9cf,stroke:#333
    style J fill:#2d8,stroke:#333

Each step emits a structured artifact (Iceberg table for data steps, model package for training+register, evaluation report for eval+slice+bias). The artifacts are immutable; a pipeline run is reproducible by replaying from any step.

Why SageMaker Pipelines vs Step Functions: Pipelines has native model-registry integration, automatic experiment tracking (lineage from raw data → model package), and built-in step caching keyed by input artifact hash. Step Functions would require custom registry and lineage glue. The trade-off is Pipelines' weaker observability for non-ML steps (e.g., calling external systems); the four cases on this platform that need that (label-vendor handoff, embedding re-index, online label join, Personalize batch) wrap the non-ML logic in Lambda or Step Functions and call the wrapped function from a Pipelines step.

3.2 Distributed training

Three of the eight models on this platform are large enough that distributed training is non-optional:

  • US-MLE-05 embedding adapter (LoRA on Titan) — p4d.24xlarge × 2 nodes, FSDP with auto_wrap_policy=transformer_auto_wrap_policy. ~2.3 hours per epoch, 3 epochs. Spot pricing brings this to ~$120/run vs ~$340 on-demand, with a 12% job-interruption rate from spot reclaim that the checkpoint discipline below handles.
  • US-MLE-03 multi-task DeBERTa-v3 — g5.12xlarge × 1 node, DDP across the 4 GPUs. 6 hours per run, ~$45 spot.
  • US-MLE-08 EfficientNet-V2-S fine-tune — g5.2xlarge × 1 node, single GPU (still launched via DDP so the launch script stays identical to the multi-GPU case). ~2 hours per run, ~$5 spot.

The other five models train on ml.m5.4xlarge or ml.g5.xlarge single-instance, no distribution needed.

Spot-reclaim handling: every distributed training job writes a checkpoint to S3 every 500 steps (≈ 12 minutes of wall-clock for the embedding adapter). On spot reclaim, SageMaker auto-retries on a new instance and the trainer resumes from the latest S3 checkpoint. A reclaim costs an average of 18 minutes of wall-clock (allocation + warm-up + resume); jobs succeed within at most 3 reclaims, and a job that hits a fourth reclaim escalates to on-demand fallback.
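A sketch of the checkpoint discipline, assuming PyTorch and SageMaker managed checkpointing, which syncs a local checkpoint directory to the job's checkpoint_s3_uri; the file naming is illustrative.

import glob
import os
import torch

CKPT_DIR = "/opt/ml/checkpoints"  # synced to checkpoint_s3_uri by SageMaker

def save_checkpoint(model, optimizer, step: int) -> None:
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()},
               os.path.join(CKPT_DIR, f"step-{step:08d}.pt"))

def resume_latest(model, optimizer) -> int:
    """Returns the step to resume from; 0 means a cold start."""
    ckpts = sorted(glob.glob(os.path.join(CKPT_DIR, "step-*.pt")))
    if not ckpts:
        return 0
    state = torch.load(ckpts[-1], map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]

# In the training loop:
#   if step % 500 == 0:
#       save_checkpoint(model, optimizer, step)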

3.3 Experiment tracking

All runs log to SageMaker Experiments with mandatory tags:

  • model_target (e.g., intent_classifier, embedding_adapter_v3)
  • code_revision (git SHA, populated by the CI system, never hand-set)
  • data_snapshot_id (Iceberg snapshot id of the training data)
  • label_version (max label_version included)
  • feature_schema_version
  • seed

A run that does not have all six tags is rejected at registration time — the model package will not be registered and the pipeline fails loud. This is the "no anonymous runs" rule and it has caught more reproducibility incidents than any other guardrail.
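A sketch of what that registration-time check amounts to; the real enforcement lives in the pipeline's model-registration step.

REQUIRED_TAGS = {"model_target", "code_revision", "data_snapshot_id",
                 "label_version", "feature_schema_version", "seed"}

def assert_no_anonymous_run(tags: dict[str, str]) -> None:
    # A tag that is present but empty counts as missing.
    missing = REQUIRED_TAGS - {k for k, v in tags.items() if v}
    if missing:
        raise RuntimeError(f"registration refused; missing tags: {sorted(missing)}")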


4. Evaluation Primitive

4.1 The five evaluation modes

Every promotion decision reads from five evaluation modes; a model that passes only one or two does not promote.

| Mode | What it answers | Dataset | Failure behavior |
|---|---|---|---|
| Golden | Does the model meet the contract on a frozen, hand-curated set? | 1–5K hand-curated, refreshed quarterly | Hard-fail; cannot promote |
| Slice | Does it work on every cohort (intent × language × device × cohort)? | Holdout, sliced post-hoc | Hard-fail on safety-critical slice; soft-flag elsewhere |
| Adversarial | Does it survive intentionally hard inputs (typos, JP/EN mix, Unicode tricks)? | 500–2K curated adversarial set | Hard-fail above adversarial regression ceiling |
| Counterfactual replay | What would (cost, quality) have been on real production traffic if this model had served it? | 5–10K replay sessions | Soft-flag; informs canary go/no-go |
| Offline-online correlation | Is the offline metric a leading indicator of the online metric? | Historical pairs of (offline_metric, online_metric) over the last 6 retrains | Calibrates the offline gate; does not gate directly |

The five modes are not redundant. Golden catches contract violations; slice catches cohort regressions; adversarial catches robustness regressions; counterfactual catches workload-shift regressions (golden may say "fine" while real traffic shows -8% NDCG because production has shifted toward harder queries); offline-online correlation tells you whether to trust the first four.

4.2 The slice matrix

Every model on this platform reports metrics on at least these slices:

  • Language (EN / JP / mixed) — non-negotiable per cross-cutting concern
  • Intent (the 14 intent classes from US-MLE-01)
  • Cohort (new user / returning / power user)
  • Device (web / iOS / Android)
  • Catalog tier (long-tail / mid-tail / head; defined by 90-day query frequency)
  • Time-of-day bucket (peak / off-peak; matters for fairness because off-peak traffic skews JP)

A model that improves macro-F1 by +0.02 while regressing on long-tail JP power users by -0.05 is a regression, not an improvement. The slice gate catches this.
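A sketch of the slice gate's hard-fail path, assuming a pandas DataFrame of holdout predictions with one column per slice dimension; the safety-critical set, the slice columns iterated, and the zero-tolerance threshold are all illustrative.

import pandas as pd
from sklearn.metrics import f1_score

SAFETY_CRITICAL = {("intent", "return_request"), ("intent", "escalation")}

def slice_hard_failures(df: pd.DataFrame, baseline: dict[tuple, float],
                        tol: float = 0.0) -> list[str]:
    """df: columns y_true, y_pred, language, intent, cohort, device.
    baseline: (slice_col, slice_value) -> incumbent model's macro-F1."""
    failures = []
    for col in ("language", "intent", "cohort", "device"):
        for value, grp in df.groupby(col):
            ref = baseline.get((col, value))
            if ref is None:
                continue
            f1 = f1_score(grp.y_true, grp.y_pred, average="macro")
            if f1 < ref - tol and (col, value) in SAFETY_CRITICAL:
                failures.append(f"{col}={value}: {f1:.3f} < {ref:.3f}")
            # non-safety-critical regressions are soft-flagged, not returned here
    return failures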

4.3 Offline-online correlation

This is the calibration loop. After every promoted model has run for 28 days, the platform computes:

  • The offline metric on the model's training-time eval set.
  • The online metric (CTR, NDCG@10-from-clicks, escalation rate, csat, etc.) measured during the 28 days.
  • The correlation across the last 6 retrains.

If corr(offline, online) < 0.6 for a given model, the offline gate is untrusted and the next promotion runs an extended canary (10% traffic for 14 days) regardless of offline metric. Stories that have weak offline-online correlation by nature (US-MLE-06 recommendation, US-MLE-04 forecasting) are pre-flagged as "always extended canary."
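A sketch of the trust decision; Pearson correlation over the paired metrics of the last six retrains is an assumption about the estimator.

import numpy as np

def offline_gate_trusted(offline: list[float], online: list[float]) -> bool:
    """offline/online: paired (offline_metric, online_metric) from the
    last 6 retrains of one model."""
    if len(offline) < 6:
        return False  # not enough history; treat the gate as uncalibrated
    r = np.corrcoef(offline, online)[0, 1]
    return bool(r >= 0.6)  # below 0.6 -> extended canary (10% for 14 days)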


5. Promotion Primitive

5.1 The four-stage promotion gate

Every model promotion goes through four stages. A model that fails any stage rolls back to the previous registry version with no further questions.

flowchart LR
    A[Stage 1<br/>Offline Gate] --> B[Stage 2<br/>Shadow]
    B --> C[Stage 3<br/>Canary 1%/5%/25%]
    C --> D[Stage 4<br/>Full Promotion]

    A -.fail.-> R1[Rollback]
    B -.fail.-> R2[Rollback]
    C -.fail.-> R3[Rollback]
    D -.fail.-> R4[Rollback]

    style D fill:#2d8,stroke:#333
    style R1 fill:#f66,stroke:#333
    style R2 fill:#f66,stroke:#333
    style R3 fill:#f66,stroke:#333
    style R4 fill:#f66,stroke:#333

Stage 1 — Offline Gate (automated): all five evaluation modes from §4.1 pass. Decision in seconds, runs in CI.

Stage 2 — Shadow (24–72 hours, depending on traffic volume): the new model serves in parallel with the old; the old's predictions go to users, the new's predictions go to a comparison log. Metrics computed: per-request prediction agreement, per-slice agreement, latency comparison, error rate. A model whose shadow-mode latency exceeds the production endpoint's p99 by >30% fails shadow.

Stage 3 — Canary (typically 7 days at 1% → 3 days at 5% → 7 days at 25%): the new model serves a small fraction of real traffic. Online metrics (CTR, csat, escalation, etc.) are compared to a hold-out group still served by the old model. A canary fails if any safety-critical metric drops below its contracted floor for any contiguous 1-hour window.

Stage 4 — Full Promotion: 100% traffic. Old model retained in registry; can roll back via traffic shift in <60 seconds.

5.2 Blue/green for stateful endpoints

For US-MLE-05's embedding model (where promoting a new model requires re-indexing the entire 5M-document corpus before it can serve), the four-stage gate is wrapped in a blue/green pattern:

  • Blue = current production embedding model + index
  • Green = new candidate embedding model + freshly built index

Both indexes coexist. The router writes to both, reads from blue. After re-indexing finishes (≤ 4 hours per the US-MLE-05 SLA) and offline eval passes, the router's read flag flips from blue to green. The old blue index is retained for 14 days as a rollback target. This is the only way to safely roll out an embedding model change at this scale; in-place re-indexing has caused two SEV-2 incidents elsewhere on the platform.
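A sketch of the router semantics; in production the read flag would live in a shared config store rather than in process memory, and the index names are illustrative.

from dataclasses import dataclass

@dataclass
class EmbeddingRouter:
    blue_index: str                 # current production model + index
    green_index: str                # candidate model + freshly built index
    read_from: str = "blue"

    def write_targets(self) -> list[str]:
        return [self.blue_index, self.green_index]  # dual-write during transition

    def promote_green(self) -> None:
        self.read_from = "green"    # blue index retained 14 days as rollback target

    def rollback_to_blue(self) -> None:
        self.read_from = "blue"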

5.3 Canary auto-abort

The canary gate runs an auto-abort daemon that monitors the contracted online metrics every 5 minutes and triggers an automatic rollback if any of these conditions hold:

  • Canary group's safety-critical metric is more than 2σ below hold-out for >1 hour
  • Canary group's error rate is >2× hold-out for >5 minutes
  • Canary group's p99 latency is >1.5× hold-out for >15 minutes
  • Any single user-facing exception type has a canary-group-only rate >0.1%

Auto-abort emits a SEV-3 page and a Slack notification with the trigger condition. The on-call ML engineer reviews and can manually re-enable the canary if the trigger was a false alarm (validated against a 1-hour pre-canary baseline). This pattern has prevented an estimated 4 quality incidents per year that would otherwise have escaped the offline gate.
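A sketch of the abort decision over the four conditions above, assuming the sustained-for-N-minutes aggregation happens upstream so each input value already reflects its rolling window; the key names are illustrative.

def abort_trigger(canary: dict[str, float], holdout: dict[str, float]) -> str | None:
    """Returns the first tripped condition, or None. Each value is a rolling
    aggregate over the window named in its key."""
    if canary["safety_metric_1h"] < (holdout["safety_metric_1h"]
                                     - 2 * holdout["safety_metric_sigma"]):
        return "safety_metric_2sigma_below_1h"
    if canary["error_rate_5m"] > 2 * holdout["error_rate_5m"]:
        return "error_rate_2x_5m"
    if canary["p99_ms_15m"] > 1.5 * holdout["p99_ms_15m"]:
        return "p99_latency_1_5x_15m"
    if canary["worst_exception_rate"] > 0.001:   # 0.1% canary-only rate
        return "exception_rate_0_1pct"
    return None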


6. Drift Detection Primitive

6.1 The four drift kinds

Drift on this platform is measured in four kinds, each with a separate detection mechanism:

| Kind | What it is | Detection | Threshold |
|---|---|---|---|
| Input drift | Feature distribution shifts | PSI per feature; KL divergence on top-k | PSI > 0.2 sustained 24 h |
| Label drift | Class prior or aspect distribution shifts | χ² on class proportions, weekly | p < 0.01 sustained 7 d |
| Prediction drift | Output distribution shifts | KS test on prediction distribution | KS > 0.15 sustained 24 h |
| Concept drift | Input→label relationship changes | Rolling holdout F1 vs reference F1 | Δ-F1 > 0.03 sustained 7 d |

The four kinds are detected independently because they fail differently. Input drift without prediction drift is benign (the model is robust to the shift); prediction drift without concept drift is "vibes" (the model's outputs look different but are still right); concept drift is the dangerous one — the model has become miscalibrated against reality.
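For reference, a minimal PSI implementation over one numeric feature, with quantile bins fit on the reference (training-holdout) sample; assumes a continuous feature so the bin edges are distinct.

import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """PSI between the reference distribution and the current production sample.
    PSI > 0.2 sustained 24 h is the input-drift threshold above."""
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # catch out-of-range values
    p = np.histogram(reference, edges)[0] / len(reference)
    q = np.histogram(current, edges)[0] / len(current)
    p, q = np.clip(p, 1e-6, None), np.clip(q, 1e-6, None)  # avoid log(0)
    return float(np.sum((p - q) * np.log(p / q)))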

6.2 The drift hub

A central drift detection service consumes prediction logs and feature snapshots from all 8 models and computes the four kinds on a fixed cadence (input/prediction every 5 min, label/concept daily). Detection results land in a CloudWatch metric namespace MangaAssist/MLDrift/<model_target>/<drift_kind>. Per-model thresholds are pinned in the model registry alongside the model artifact.

Drift hub vs per-model drift detection: centralizing means the same statistical machinery (PSI, KS, χ²) is shared across stories. A bug in PSI computation is fixed once. Per-model drift detection in the original prototype meant 8 different PSI implementations with subtle bugs; consolidation cut drift-related on-call pages by 60%.

6.3 Drift triage

A drift signal triggers a triage flow, not an immediate retrain:

  1. Drift detector emits CloudWatch alarm.
  2. ML engineer on-call reviews the drift dashboard for the affected model. Is the drift in a known seasonal pattern? (e.g., new manga volume releases on the 4th Friday of every month spike intent distribution). If so, no action.
  3. Is the drift caused by an upstream change? (e.g., embedding model US-MLE-05 just promoted v3, breaking US-MLE-02 reranker's feature distribution). If so, retrain the affected downstream model.
  4. Is the drift sustained > 7 days with no upstream cause? Triggers retrain. Documented as a "concept drift event."
  5. Is the drift accompanied by quality regression in slice metrics? Triggers retrain immediately, with SEV-3 page if a safety-critical slice (return_request, escalation) is affected.

The full triage runbook lives in Runbooks/ml-drift-triage.md (referenced from each story file).


7. Retraining Primitive

7.1 Scheduled vs triggered retrain

| Story | Scheduled cadence | Trigger conditions |
|---|---|---|
| US-MLE-01 Intent | Weekly | Drift hub concept_drift > 0.03 for 7 d |
| US-MLE-02 Reranker | Weekly | Drift on click-through; canary CTR drop |
| US-MLE-03 Sentiment | Quarterly + on-demand | New aspect emergence detected (US-MLE-03 §6) |
| US-MLE-04 Forecasting | Daily | None — the daily cadence covers all signal |
| US-MLE-05 Embedding Adapter | Quarterly | Catalog category expansion (e.g., manhwa); recall drop on a slice |
| US-MLE-06 Recommendation | Daily incremental + monthly full | HR@20 drop; cold-start failure rate |
| US-MLE-07 Spam | Weekly + adversarial trigger | New attack pattern detected by red team or auto-detector |
| US-MLE-08 Cover-Art | Monthly | Style drift; AI-generated art surge |

The pattern: frequently-changing signal → frequent scheduled retrain; slowly-changing signal → infrequent scheduled retrain plus drift-triggered. The cadence is set by the speed at which the signal evolves, not by the speed at which the model could be retrained.

7.2 Data freshness windows

Each retrain job specifies a freshness window — the time range of data included in training. Two patterns (a sketch follows the list):

  • Sliding window (US-MLE-01, -02, -06, -07): always train on the last N days. Older data is discarded. Catches concept drift fast; risks losing rare-event signal.
  • Cumulative window (US-MLE-03, -04, -05, -08): train on all data ever seen, possibly with importance weighting toward recent. Preserves rare events; risks slow concept-drift adaptation.
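A compact sketch of the two policies, plus the recency weighting mentioned for the cumulative case; the 90-day defaults are illustrative.

from datetime import datetime, timedelta

def training_window(policy: str, now: datetime, sliding_days: int = 90,
                    data_start: datetime | None = None) -> tuple[datetime, datetime]:
    if policy == "sliding":
        return now - timedelta(days=sliding_days), now   # older data is discarded
    if policy == "cumulative":
        if data_start is None:
            raise ValueError("cumulative window needs data_start")
        return data_start, now                           # all data ever seen
    raise ValueError(f"unknown freshness policy: {policy}")

def recency_weight(captured_at: datetime, now: datetime,
                   half_life_days: float = 90.0) -> float:
    # Optional importance weight toward recent examples (cumulative windows).
    age_days = (now - captured_at).total_seconds() / 86400
    return 0.5 ** (age_days / half_life_days)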

The choice is per-story and is not casually changed; flipping a model from sliding to cumulative is a planning-level decision documented in the model's registry metadata.

7.3 Rollback semantics

A retrain that fails any of the four promotion stages rolls back to the previous registry version. Rollback is not a special case — it is the default behavior of "promotion gate failed." The previous version stays pinned as the production model; the failed candidate remains in the registry, flagged status: failed_promotion, and is excluded from future "latest model" queries.

Rollback time SLAs:

  • Stateless model rollback (US-MLE-01, -02, -03, -07, -08): ≤ 60 seconds (SageMaker traffic shift).
  • US-MLE-05 embedding rollback: ≤ 5 minutes (router flag flip from green back to blue index).
  • US-MLE-04, -06 (batch / Personalize): ≤ 4 hours (re-run last-known-good batch; Personalize re-import of last-known-good solution version).

The rollback time is part of every story's acceptance criteria. A story that cannot meet its rollback SLA cannot promote.


How These Primitives Compose

A worked example of all seven primitives in a single training run, US-MLE-01's weekly intent classifier retrain:

  1. Label (§1): pull LabelRecords for model_target='intent_classifier', label_version > previous_run.max_label_version, review_status='qa_passed'. EN and JP IAA both ≥ 0.75 verified.
  2. Feature (§2): for each LabelRecord, read features at as_of=record.captured_at - feature_store.serving_lag. Leak-detector job runs on 1% sample; passes.
  3. Training (§3): SageMaker Pipelines DAG runs on ml.g5.xlarge. Spot, with checkpoints every 500 steps. All six experiment tags populated.
  4. Evaluation (§4): golden set passes (≥ 0.95 macro-F1); slice gate passes on EN, JP, mixed; adversarial set passes; counterfactual replay shows a 2% latency regression but a +1.5% accuracy gain; offline-online correlation = 0.71 (trustworthy).
  5. Promotion (§5): registry version 48 registered. Shadow runs for 24 h. Canary at 1% for 7 d. Full promotion.
  6. Drift detection (§6): drift hub initialized for v48 with PSI/KS/χ² baselines from the training holdout.
  7. Retraining (§7): next scheduled retrain in 7 days; trigger-based retrain armed for any drift signal sustained beyond threshold.

Every per-story file in 01-deep-dive-per-ml-story.md walks through these seven steps with story-specific code, datasets, and thresholds. The platform pieces (label store, feature store, registry, drift hub) are documented in 02-cross-story-platform-deep-dive.md.