02. Cross-Story ML Platform Deep Dive
This document describes the shared ML platform components that all 8 user stories in this folder depend on. Where 00-foundations-and-primitives-for-ml-engineering.md defines the conceptual primitives and 01-deep-dive-per-ml-story.md walks each story through them, this file is about the services — the actual systems that implement label storage, feature serving, model registration, drift detection, and promotion. These are owned by the ML Platform Lead (cross-story coordinator) rather than by any individual story owner.
The five platform components:
- Label Platform — Iceberg-on-S3, IAA tracking, cross-vendor reconciliation
- Feature Store — SageMaker Feature Store (online) + Iceberg (offline) with point-in-time correctness
- Model Registry — SageMaker Model Registry with story-specific contracts and promotion gates
- Drift Hub — central PSI/KS/χ² compute service consuming all 8 models' prediction logs
- Promotion Gate Service — shadow + canary + auto-abort daemon shared across stories
A sixth concern, the experiment-tracking + lineage system, is implicit in SageMaker Experiments and is referenced from all five components above.
1. Label Platform
1.1 Architecture
flowchart TB
subgraph SOURCES[Label Sources]
V1[Vendor: Appen<br/>EN+JP teams]
V2[Vendor: Sama<br/>JP-only fallback]
IMP[Implicit Feedback<br/>Kinesis stream]
PROG[Programmatic<br/>Lambda generators]
LLM[LLM-Distilled<br/>Bedrock batch]
end
subgraph INGEST[Ingestion Layer]
I1[Vendor API Gateway<br/>callback webhooks]
I2[Kinesis Firehose<br/>implicit feedback]
I3[Step Functions<br/>programmatic + LLM batch]
QA1[Label QA Service<br/>schema validation + IAA]
end
subgraph STORE[Label Store - Iceberg on S3]
ICEBERG[label_records table<br/>Iceberg format<br/>partitioned by model_target, label_source, captured_date]
SNAP[Snapshots<br/>retained 18mo hot + Glacier]
IAA_TBL[iaa_metrics table<br/>per-batch kappa, per-annotator running kappa]
end
subgraph CONSUME[Consumers]
TRAIN[Training Pipelines<br/>US-MLE-01..08]
AUDIT[Audit Queries<br/>regulator + post-mortem]
DASH[Vendor Dashboards<br/>per-annotator kappa]
end
V1 --> I1
V2 --> I1
IMP --> I2
PROG --> I3
LLM --> I3
I1 --> QA1
I2 --> QA1
I3 --> QA1
QA1 --> ICEBERG
QA1 --> IAA_TBL
ICEBERG --> SNAP
ICEBERG --> TRAIN
ICEBERG --> AUDIT
IAA_TBL --> DASH
style ICEBERG fill:#9cf,stroke:#333
style QA1 fill:#fd2,stroke:#333
1.2 The label_records Iceberg schema
CREATE TABLE manga_ml.label_records (
example_id STRING NOT NULL, -- sha256(canonical_input + model_target)
model_target STRING NOT NULL, -- e.g., 'intent_classifier', 'absa_aspect'
label_value STRUCT<
class: STRING,
confidence: DOUBLE,
metadata: MAP<STRING, STRING>
> NOT NULL,
label_source STRING NOT NULL, -- 'vendor_appen', 'vendor_sama', 'implicit', 'programmatic', 'llm_distill'
annotator_id STRING, -- vendor annotator hash or LLM model_id
label_version BIGINT NOT NULL, -- monotonic per (model_target, source)
captured_at TIMESTAMP NOT NULL, -- when raw signal was captured
labeled_at TIMESTAMP NOT NULL, -- when label was finalized
review_status STRING NOT NULL, -- 'draft', 'qa_passed', 'qa_failed', 'withdrawn'
iaa_lot_id STRING, -- for vendor labels in IAA pools
language STRING NOT NULL, -- 'en', 'ja', 'mixed' (project-wide bilingual concern)
region STRING NOT NULL, -- 'ap-northeast-1', 'us-east-1', 'eu-west-1' (data residency)
pii_redacted BOOLEAN NOT NULL -- enforced before persistence
)
USING iceberg
PARTITIONED BY (model_target, label_source, days(captured_at))
TBLPROPERTIES (
'write.metadata.delete-after-commit.enabled' = 'true',
'write.metadata.previous-versions-max' = '20',
'history.expire.max-snapshot-age-ms' = '47304000000' -- 18 months
);
The partition strategy (model_target, label_source, days(captured_at)) is non-trivial:
- model_target as the leading partition lets a training job for intent_classifier skip-scan past 50M ABSA labels without I/O cost.
- label_source as the second partition allows per-source IAA recomputation without scanning unrelated sources.
- days(captured_at) allows time-bounded reads for both training (last-N-days windows) and post-incident replay (specific historical day).
The 18-month snapshot expiration matches the longest training window served directly from this table (US-MLE-04 demand forecasting uses 3 years, but it reads from a separate cumulative-history table); a shorter retention would force older data to Glacier, where restore latency breaks debuggability.
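For illustration, a time-bounded training read that exploits all three partition columns might look like the sketch below (Spark session setup and catalog configuration are assumed; only the table and column names come from the schema above):

```python
# Hypothetical training read: prune on all three partition columns so a US-MLE-01 job
# never scans other stories' labels or days outside its training window.
from datetime import date, timedelta

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes the `manga_ml` Iceberg catalog is configured
cutoff = date.today() - timedelta(days=90)

labels = spark.sql(f"""
    SELECT example_id, label_value, label_version, language
    FROM manga_ml.label_records
    WHERE model_target = 'intent_classifier'                 -- leading partition: skips other targets
      AND label_source IN ('vendor_appen', 'vendor_sama')    -- second partition: vendor labels only
      AND captured_at >= TIMESTAMP '{cutoff.isoformat()} 00:00:00'  -- days(captured_at) pruning
      AND review_status = 'qa_passed'
""")
```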
1.3 The Label QA service
A Lambda-triggered service that runs on every label batch arrival. Validation proceeds in the following passes:
# label_qa_service.py
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

from sklearn.metrics import cohen_kappa_score
@dataclass
class QAResult:
accepted: bool
rejected_count: int
flagged_count: int
reason: Optional[str]
iaa_kappa: Optional[float]
iaa_lang_kappa: dict[str, float]
class LabelQAService:
"""Runs on every label batch arrival. Computes IAA, validates schema,
enforces PII redaction, and writes accepted labels to the Iceberg table."""
def validate_batch(self, batch_id: str, model_target: str) -> QAResult:
records = self.load_batch(batch_id)
# Step 1: schema validation
schema_errors = self.validate_schema(records, model_target)
if schema_errors:
return QAResult(
accepted=False, rejected_count=len(records), flagged_count=0,
reason=f"schema errors: {schema_errors[:5]}",
iaa_kappa=None, iaa_lang_kappa={},
)
# Step 2: PII redaction enforcement
pii_violations = [r for r in records if not r.pii_redacted]
if pii_violations:
self.alert_security(batch_id, len(pii_violations))
return QAResult(
accepted=False, rejected_count=len(records), flagged_count=0,
reason=f"{len(pii_violations)} records lack pii_redacted=True",
iaa_kappa=None, iaa_lang_kappa={},
)
# Step 3: IAA per language (project-wide bilingual concern)
# Vendor batches contain triple-annotated examples for IAA; programmatic
# and implicit batches skip this step.
iaa_lang_kappa = {}
if records[0].label_source.startswith("vendor_"):
for lang in ["en", "ja"]:
lang_records = [r for r in records if r.language == lang]
if len(lang_records) >= 50: # need power for kappa
iaa_lang_kappa[lang] = self.compute_iaa_kappa(lang_records)
# Per-language gate
for lang, kappa in iaa_lang_kappa.items():
if kappa < 0.65:
self.reject_batch_to_vendor(batch_id, reason=f"{lang} kappa={kappa:.3f}<0.65")
return QAResult(
accepted=False, rejected_count=len(records), flagged_count=0,
reason=f"{lang} IAA below floor",
iaa_kappa=min(iaa_lang_kappa.values()),
iaa_lang_kappa=iaa_lang_kappa,
)
# Step 4: per-annotator drift check
flagged = self.check_per_annotator_drift(records)
# Step 5: persist
self.write_to_iceberg(records, model_target)
self.write_iaa_metrics(batch_id, iaa_lang_kappa, flagged)
return QAResult(
accepted=True,
rejected_count=0,
flagged_count=len(flagged),
reason=None,
iaa_kappa=min(iaa_lang_kappa.values()) if iaa_lang_kappa else None,
iaa_lang_kappa=iaa_lang_kappa,
)
    def compute_iaa_kappa(self, records: list) -> float:
        """Mean pairwise Cohen's kappa on triple-annotated examples within this language slice."""
        triples = self.group_by_example(records, min_annotators=3)
        if not triples:
            return 0.0
        # Kappa is undefined on a single example, so collect aligned label vectors per
        # annotator pair across all triples, then compute one kappa per pair and average.
        pair_labels = defaultdict(lambda: ([], []))
        for example_id, annotations in triples.items():
            annotations = sorted(annotations, key=lambda a: a.annotator_id)
            for i in range(len(annotations)):
                for j in range(i + 1, len(annotations)):
                    key = (annotations[i].annotator_id, annotations[j].annotator_id)
                    pair_labels[key][0].append(annotations[i].label_value["class"])
                    pair_labels[key][1].append(annotations[j].label_value["class"])
        kappas = [
            cohen_kappa_score(a, b)
            for a, b in pair_labels.values()
            if len(a) >= 2  # a pair needs at least two shared examples for a meaningful kappa
        ]
        return sum(kappas) / len(kappas) if kappas else 0.0
def check_per_annotator_drift(self, records: list) -> list:
"""Flag annotators whose 30-day running kappa has dropped >2sigma below their baseline."""
flagged = []
per_annotator = self.group_by(records, key="annotator_id")
for annotator_id, annotator_records in per_annotator.items():
history = self.load_annotator_history(annotator_id, days=30)
if not history:
continue
baseline_kappa = history.mean_kappa
baseline_std = history.std_kappa
            # Agreement is computed over the examples this annotator touched, keeping the
            # co-annotators' labels, since kappa needs at least one other rater to compare against.
            annotator_examples = {r.example_id for r in annotator_records}
            current_kappa = self.compute_iaa_kappa(
                [r for r in records if r.example_id in annotator_examples]
            )
            if baseline_std > 0 and current_kappa < baseline_kappa - 2 * baseline_std:
flagged.append({
"annotator_id": annotator_id,
"current_kappa": current_kappa,
"baseline_kappa": baseline_kappa,
"z_score": (current_kappa - baseline_kappa) / baseline_std,
})
self.notify_vendor_pm(annotator_id, current_kappa, baseline_kappa)
return flagged
1.4 Cross-story sharing rules
The label platform is shared, which creates a coordination obligation:
- A `model_target` namespace is owned by exactly one story: `intent_classifier` is owned by US-MLE-01, and no other story may write to it. Story owners' AWS IAM roles are scoped to their own `model_target` partition.
- A story may read any `model_target`'s labels for cross-story features (e.g., US-MLE-06 recommendation reads `absa_aspect` labels as an item feature). Read access is reviewed quarterly.
- `label_version` is monotonically increasing per (model_target, source). A retraining run pins to `max(label_version)` at run start, so concurrent label writes do not retroactively change a training run's data (see the sketch below).
- Snapshot tags are written at every promoted-model registration: `label_v=2840, model_v=47, registered_at=2026-04-29T03:00Z`. This makes a model fully reproducible from its registry entry.
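A minimal sketch of the run-start pin (the helper name and the snapshot-metadata query are illustrative, not the production pipeline code):

```python
# Hypothetical run-start pinning: capture max(label_version) and the current Iceberg snapshot id
# so the run's data stays reproducible even while new labels keep arriving.
def pin_label_snapshot(spark, model_target: str) -> dict:
    max_version = spark.sql(f"""
        SELECT MAX(label_version) AS v
        FROM manga_ml.label_records
        WHERE model_target = '{model_target}' AND review_status = 'qa_passed'
    """).first()["v"]
    snapshot_id = spark.sql("""
        SELECT snapshot_id
        FROM manga_ml.label_records.snapshots
        ORDER BY committed_at DESC
        LIMIT 1
    """).first()["snapshot_id"]
    # Both values later become required registry tags (label_version, data_snapshot_id).
    return {"label_version": max_version, "data_snapshot_id": snapshot_id}
```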
2. Feature Store
2.1 Architecture
flowchart LR
subgraph SOURCES[Feature Sources]
ONLINE[Online Computation<br/>Lambda + Redis]
BATCH[Batch Computation<br/>Glue + EMR]
EXT[External APIs<br/>Personalize, OpenSearch]
end
subgraph FS[Feature Store]
FS_ONLINE[SageMaker Feature Store<br/>online: DynamoDB-backed]
FS_OFFLINE[Iceberg Mirror<br/>S3, partitioned by feature_group]
CATALOG[Feature Catalog<br/>schema_v3.4]
end
subgraph CONSUMERS[Consumers]
ONLINE_INF[Online Inference<br/>SageMaker endpoints]
TRAIN[Training Pipelines<br/>PIT-correct reads]
DRIFT[Drift Hub<br/>distribution snapshots]
end
ONLINE --> FS_ONLINE
BATCH --> FS_OFFLINE
EXT --> FS_ONLINE
FS_ONLINE -.replicate.-> FS_OFFLINE
FS_ONLINE --> ONLINE_INF
FS_OFFLINE --> TRAIN
FS_OFFLINE --> DRIFT
CATALOG -.pinned by.-> TRAIN
CATALOG -.pinned by.-> ONLINE_INF
style FS_ONLINE fill:#9cf,stroke:#333
style FS_OFFLINE fill:#9cf,stroke:#333
style CATALOG fill:#fd2,stroke:#333
2.2 Feature catalog and serving lag
The feature catalog (a versioned YAML in s3://manga-ml-config-apne1/feature-catalog/) is the single source of truth for:
- Each feature group's schema (column types, defaults, valid ranges)
- Each feature group's `serving_lag` — the time delay between when the raw signal is captured and when the feature value becomes visible to online inference
- Each feature group's `freshness_target` — the refresh cadence of the underlying batch job
- Each feature group's owner story (US-MLE-XX)
Excerpt:
# feature-catalog/schema_v3.4.yaml
version: schema_v3.4
deployed_at: 2026-04-15T00:00:00Z
feature_groups:
user_taste_embedding:
owner: US-MLE-06
columns:
- name: user_id
type: STRING
nullable: false
- name: taste_embedding
type: ARRAY<FLOAT>
size: 256
default: zero_vector
- name: embedding_model_version
type: STRING
nullable: false
- name: last_updated
type: TIMESTAMP
nullable: false
serving_lag: PT1H # 1 hour batch lag
freshness_target: PT1H
region: ap-northeast-1
pii_redacted: true
session_intent_history:
owner: US-MLE-01
columns:
- name: session_id
type: STRING
- name: last_3_intents
type: ARRAY<STRING>
default: []
- name: turn_count
type: INT
default: 0
serving_lag: PT0S # online, 0 lag
freshness_target: PT0S
region: ap-northeast-1
pii_redacted: true
catalog_demand_p50:
owner: US-MLE-04
columns:
- name: sku_id
type: STRING
- name: forecast_p50_next_7d
type: ARRAY<FLOAT>
size: 7
- name: forecast_p10_next_7d
type: ARRAY<FLOAT>
size: 7
- name: forecast_p90_next_7d
type: ARRAY<FLOAT>
size: 7
- name: forecast_run_id
type: STRING
serving_lag: PT24H # daily batch
freshness_target: PT24H
region: ap-northeast-1
pii_redacted: false # SKU-level, no PII
2.3 PIT correctness at the platform level
The platform enforces PIT correctness with a deploy-time lint and a runtime assertion:
- Deploy-time lint: any training pipeline calling `feature_store.get_record(...)` without an `as_of_timestamp` parameter fails CI. The lint is an AST-walker over the SageMaker Pipelines DAG source code.
- Runtime assertion: every read from the offline mirror logs the requested `as_of` and the feature group's `serving_lag`. Reads where `as_of > now - serving_lag` are flagged in CloudWatch as potential leaks, since they correspond to feature values that would not have been visible to serving at the implied time. Per-feature-group dashboards surface these leak flags (see the sketch below).
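The assertion itself reduces to a timestamp comparison; a minimal, self-contained sketch (the function name is illustrative):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def is_potential_leak(as_of: datetime, serving_lag: timedelta, now: Optional[datetime] = None) -> bool:
    """True when a requested as_of is fresher than anything online serving could have seen."""
    now = now or datetime.now(timezone.utc)
    return as_of > now - serving_lag

# A feature group with a 1-hour batch lag, queried "as of" 10 minutes ago, is flagged:
# online inference could not yet have seen that value at the implied time.
flagged = is_potential_leak(
    as_of=datetime.now(timezone.utc) - timedelta(minutes=10),
    serving_lag=timedelta(hours=1),
)  # -> True
```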
The leak detector described in primitive §2.1 of the foundations doc runs on top of this PIT-correct read API; it catches leaks the API alone cannot (e.g., a feature whose pipeline updates were retroactive — backfilling Jan with Feb data).
2.4 Schema upgrade procedure
A feature schema upgrade (e.g., schema_v3.4 → schema_v3.5) is a coordinated change across all consuming stories:
- Propose: ML Platform Lead opens a CCR (Coordinated Change Request) with affected stories tagged.
- Review: each affected story's owner audits training and serving code for the column changes.
- Dual-publish: schema_v3.5 is published alongside v3.4. Both serve in parallel.
- Migrate consumers: each story migrates its training and serving code to v3.5 on its own cadence; CCR tracks completion.
- Sunset: 30 days after the last consumer migrates, v3.4 is sunset.
Schema version is pinned in every model registry entry, so a model trained on v3.4 continues to read v3.4 features even after v3.5 is in production. This prevents a serving model from silently consuming a different schema than it trained on.
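A sketch of how a consumer could resolve its pinned catalog version from the registry entry (reading the pin from CustomerMetadataProperties and the exact S3 key layout are assumptions):

```python
import boto3
import yaml

def load_pinned_feature_catalog(model_package_arn: str) -> dict:
    """Load exactly the catalog version this model was trained against -- never 'latest'."""
    sm = boto3.client("sagemaker", region_name="ap-northeast-1")
    pkg = sm.describe_model_package(ModelPackageName=model_package_arn)
    schema_version = pkg["CustomerMetadataProperties"]["feature_schema_version"]  # e.g. 'schema_v3.4'

    s3 = boto3.client("s3", region_name="ap-northeast-1")
    obj = s3.get_object(
        Bucket="manga-ml-config-apne1",
        Key=f"feature-catalog/{schema_version}.yaml",
    )
    return yaml.safe_load(obj["Body"].read())
```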
3. Model Registry
3.1 Architecture
The platform uses SageMaker Model Registry with a wrapper service (registry-svc) that enforces story-specific contracts.
flowchart LR
subgraph TRAIN[Training Pipelines]
T1[US-MLE-01<br/>Training]
T2[US-MLE-02..08<br/>Training]
end
subgraph REG[Registry Service]
VAL[Validation Layer<br/>required tags + lineage]
SMR[SageMaker Model<br/>Registry]
META[Registry Metadata<br/>DynamoDB]
end
subgraph CONS[Consumers]
ENDP[Endpoint Deploy<br/>SageMaker]
SHADOW[Shadow Deploy]
AUDIT[Audit Queries]
ROLLBK[Rollback Service]
end
T1 --> VAL
T2 --> VAL
VAL --> SMR
VAL --> META
SMR --> ENDP
SMR --> SHADOW
META --> AUDIT
META --> ROLLBK
style VAL fill:#fd2,stroke:#333
style SMR fill:#9cf,stroke:#333
3.2 Required tags on every registered model
A model package will not register without all six tags:
| Tag | Meaning | Source |
|---|---|---|
| `model_target` | e.g., `intent_classifier`, `embedding_adapter_v3` | Pipeline parameter |
| `code_revision` | git SHA, populated by CI | CI environment |
| `data_snapshot_id` | Iceberg snapshot id of the training data at run start | Pipeline runtime |
| `label_version` | `max(label_version)` included in the training data | Pipeline runtime |
| `feature_schema_version` | e.g., `schema_v3.4` | Pipeline parameter |
| `seed` | training random seed | Pipeline parameter |
A run missing any tag is rejected at registration time, the pipeline fails, and a SEV-4 page goes to the story owner. This is the "no anonymous runs" rule from primitive §3.3.
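The check itself is small; a sketch of the shape it could take inside registry-svc (the exception name and metadata layout are assumptions):

```python
REQUIRED_TAGS = frozenset({
    "model_target", "code_revision", "data_snapshot_id",
    "label_version", "feature_schema_version", "seed",
})

class RegistrationRejected(Exception):
    """Raised by registry-svc; the calling pipeline fails and a SEV-4 page is emitted."""

def validate_required_tags(customer_metadata: dict) -> None:
    missing = REQUIRED_TAGS - customer_metadata.keys()
    if missing:
        raise RegistrationRejected(f"no anonymous runs: missing tags {sorted(missing)}")
```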
3.3 Per-story registry contracts
Each story has a contract registered with the registry service that encodes its specific gates:
# registry_contracts.py
INTENT_CLASSIFIER_CONTRACT = ModelContract(
model_target="intent_classifier",
owner_story="US-MLE-01",
required_tags=REQUIRED_TAGS,
minimum_metrics={
"macro_f1": 0.90,
"safety_critical_macro_f1": 0.92,
"min_class_f1": 0.83,
},
maximum_metrics={
"p95_inference_latency_ms": 15.0,
"adversarial_regression": 0.01,
},
required_slice_metrics=[
("language=en", "macro_f1", ">=", 0.85),
("language=ja", "macro_f1", ">=", 0.85),
("language=mixed", "macro_f1", ">=", 0.85),
],
rollback_sla_seconds=60,
promotion_stages=[OFFLINE_GATE, SHADOW_24H, CANARY_1_5_25, FULL_PROMOTE],
incompatibility_set=["embedding_adapter"], # if embedding_adapter is mid-promotion, hold this story
)
EMBEDDING_ADAPTER_CONTRACT = ModelContract(
model_target="embedding_adapter_v3",
owner_story="US-MLE-05",
required_tags=REQUIRED_TAGS,
minimum_metrics={
"recall_at_10": 0.92,
"cross_lingual_recall_at_10": 0.85,
},
maximum_metrics={
"p95_query_encode_latency_ms": 12.0,
"blue_green_reindex_wall_clock_h": 4.0,
},
required_slice_metrics=[
("language=en", "recall_at_10", ">=", 0.90),
("language=ja", "recall_at_10", ">=", 0.90),
("genre=manhwa", "recall_at_10", ">=", 0.85),
],
rollback_sla_seconds=300, # blue/green flip, not endpoint shift
promotion_stages=[OFFLINE_GATE, BLUE_GREEN_REINDEX, SHADOW_24H, CANARY_1_5_25, FULL_PROMOTE],
coordination_required_with=["cross_encoder_reranker", "recommendation_two_tower"],
)
A registration that doesn't satisfy its contract sets the model package status to failed_promotion and emits a notification.
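For illustration, the slice gates in a contract could be evaluated against an eval report like this (the report shape, a (slice, metric) → value mapping, is an assumption):

```python
import operator

_OPS = {">=": operator.ge, "<=": operator.le}

def violated_slice_gates(contract, slice_metrics: dict) -> list:
    """Return the slice gates a candidate fails; an empty list means all slices pass."""
    violations = []
    for slice_name, metric, op, threshold in contract.required_slice_metrics:
        value = slice_metrics.get((slice_name, metric))
        if value is None or not _OPS[op](value, threshold):
            violations.append((slice_name, metric, op, threshold, value))
    return violations

# violated_slice_gates(INTENT_CLASSIFIER_CONTRACT, {("language=ja", "macro_f1"): 0.84, ...})
# -> [("language=ja", "macro_f1", ">=", 0.85, 0.84)]
```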
3.4 Cross-region replication
Per the project's data-residency rule, registry entries in ap-northeast-1 are the canonical copies for JP-region serving; us-east-1 and eu-west-1 are read-only replicas of the artifact metadata only (the binary artifact itself is region-pinned to where the training data lives).
Consequence: a model trained on JP customer data cannot serve traffic outside ap-northeast-1, even if the artifact is technically deployable. The registry contract enforces this via the region tag: an artifact with region=ap-northeast-1 cannot be deployed to a us-east-1 endpoint.
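The guard reduces to a tag comparison at deploy time; a sketch with illustrative names:

```python
class ResidencyViolation(Exception):
    pass

def assert_deployable(artifact_region: str, endpoint_region: str) -> None:
    """Block deployment when the artifact's region tag differs from the target endpoint region."""
    if artifact_region != endpoint_region:
        raise ResidencyViolation(
            f"artifact pinned to {artifact_region} cannot serve from {endpoint_region}"
        )

# assert_deployable("ap-northeast-1", "us-east-1") raises ResidencyViolation
```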
4. Drift Hub
4.1 Architecture
flowchart TB
subgraph LOGS[Prediction Logs]
L1[Endpoint Data Capture<br/>10pct sample]
L2[Iceberg prediction_log<br/>per model_target]
end
subgraph HUB[Drift Hub]
H1[Streaming Compute<br/>Kinesis Data Analytics<br/>5-min windows]
H2[Batch Compute<br/>EMR Serverless<br/>daily windows]
H3[Drift Metrics Store<br/>CloudWatch + Iceberg]
end
subgraph ALARMS[Alarms]
A1[CloudWatch Alarms<br/>per model_target x drift_kind]
A2[Triage Runbook<br/>Lambda]
A3[SNS to oncall slack]
end
L1 --> L2
L2 --> H1
L2 --> H2
H1 --> H3
H2 --> H3
H3 --> A1
A1 --> A2
A2 --> A3
style H1 fill:#9cf,stroke:#333
style H2 fill:#9cf,stroke:#333
style A1 fill:#fd2,stroke:#333
4.2 The four drift kinds, centralized
The drift hub computes the four drift kinds from primitive §6.1 for every model_target on a uniform schedule:
| Kind | Cadence | Compute | Output metric |
|---|---|---|---|
| Input drift | 5 min | Streaming, Kinesis Data Analytics | MangaAssist/MLDrift/<model_target>/InputPSI/<feature_name> |
| Prediction drift | 5 min | Streaming, Kinesis Data Analytics | MangaAssist/MLDrift/<model_target>/PredictionKS |
| Label drift | Daily | Batch, EMR Serverless | MangaAssist/MLDrift/<model_target>/LabelChiSquare |
| Concept drift | Daily | Batch, EMR Serverless | MangaAssist/MLDrift/<model_target>/ConceptDeltaF1 |
Per-model-target reference distributions are stored in s3://manga-ml-drift-refs-apne1/<model_target>/<registry_version>/ and are pinned at the time the model promotes: when v48 promotes, its training holdout becomes the new reference, and v47's reference is retained for 30 days for cross-version comparison.
4.3 Centralization rationale
The original platform had per-story drift detection. Three problems emerged:
- Implementation drift: 8 different PSI implementations had subtle bugs (off-by-one binning, mishandling of empty bins, different epsilon conventions). Centralizing to a single library cut drift-related on-call pages by ~60%.
- Reference-distribution sprawl: each story stored references in its own format; cross-story comparisons (e.g., "did the input distribution shift on US-MLE-02 the same week as US-MLE-06?") were impossible.
- Threshold drift: each story's owner tuned drift thresholds independently, often lowering them after a false-positive alarm. Centralization moved threshold management into the registry contract, where threshold changes go through code review.
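For illustration, a minimal PSI implementation that fixes the conventions the first bullet refers to (reference-derived bin edges, a single epsilon for empty bins); the shared library is more involved than this sketch:

```python
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `current` against `reference`, binned on reference quantiles."""
    edges = np.unique(np.quantile(reference, np.linspace(0.0, 1.0, bins + 1)))
    # Clip both samples into the reference range so outliers land in the outer bins
    # instead of being silently dropped.
    ref_counts, _ = np.histogram(np.clip(reference, edges[0], edges[-1]), bins=edges)
    cur_counts, _ = np.histogram(np.clip(current, edges[0], edges[-1]), bins=edges)
    ref_frac = np.clip(ref_counts / ref_counts.sum(), eps, None)
    cur_frac = np.clip(cur_counts / cur_counts.sum(), eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))
```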
4.4 Triage Lambda
When a CloudWatch alarm fires, the triage Lambda runs the runbook from primitive §6.3:
# drift_triage.py
def lambda_handler(event, context):
alarm = parse_alarm(event)
model_target = alarm["model_target"]
drift_kind = alarm["drift_kind"]
# Step 1: known seasonal pattern?
if is_known_seasonal(model_target, drift_kind, alarm["timestamp"]):
return acknowledge_and_close(alarm, reason="known seasonal pattern")
# Step 2: upstream cause?
upstream = check_upstream_recent_promotion(model_target)
if upstream:
return notify_oncall(
alarm,
severity="P3",
reason=f"likely caused by {upstream.model_target} v{upstream.version} promotion",
recommended_action="coordinated retrain",
)
# Step 3: sustained > 7 days?
history = get_drift_history(model_target, drift_kind, days=8)
if not history.sustained_above_threshold(days=7):
return acknowledge_and_close(alarm, reason="not sustained")
# Step 4: safety-critical slice affected?
if is_safety_critical_slice_affected(model_target, alarm):
return notify_oncall(
alarm,
severity="P2",
reason="safety-critical slice quality regression",
recommended_action="immediate retrain + canary review",
)
# Step 5: default — sustained drift, no upstream, no safety-critical
return notify_oncall(
alarm,
severity="P3",
reason="sustained concept drift > 7d, no upstream cause",
recommended_action="scheduled retrain",
)
5. Promotion Gate Service
5.1 Architecture
sequenceDiagram
participant T as Training Pipeline
participant R as Model Registry
participant P as Promotion Gate Service
participant E as SageMaker Endpoint
participant D as Drift Hub
participant A as Auto-Abort Daemon
T->>R: Register v48 with required tags
R->>R: Validate against story contract
R-->>T: package_arn
T->>P: Request promotion of v48
P->>P: Stage 1 - Offline Gate (read eval reports)
P-->>T: Pass / Fail
P->>E: Deploy v48 to shadow endpoint
P->>D: Initialize drift refs for v48
Note over E: Shadow 24h
P->>P: Compare shadow vs prod logs
P->>E: Promote to canary 1pct
Note over E,A: Canary 7d
A->>E: Monitor canary metrics
A-->>P: Auto-abort if any condition trips
P->>E: Canary 5pct then 25pct then 100pct
P->>R: Mark v47 as previous, v48 as production
5.2 The auto-abort daemon
The auto-abort daemon described in primitive §5.3 is a shared service across all 8 stories. It reads each story's contract from the registry, polls online metrics every 5 minutes, and triggers rollback when any contract condition is violated for a sustained window.
# auto_abort_daemon.py
import time


class AutoAbortDaemon:
def __init__(self, model_target: str, contract: ModelContract):
self.model_target = model_target
self.contract = contract
self.poll_interval_s = 300
def run(self):
while self.is_canary_active():
metrics = self.poll_canary_metrics()
holdout = self.poll_holdout_metrics()
for condition in self.contract.canary_abort_conditions:
if condition.triggered(metrics, holdout):
self.rollback(reason=condition.description)
return
time.sleep(self.poll_interval_s)
def rollback(self, reason: str):
self.log_p2_page(
model_target=self.model_target,
reason=reason,
recent_metrics=self.recent_metrics_window(hours=2),
)
self.shift_traffic_to_previous(self.model_target)
self.mark_failed_promotion(self.model_target)
The per-story contract is what makes the daemon generic: each story supplies its own thresholds. The four universal abort conditions from §5.3 of the foundations doc are encoded as defaults; story-specific conditions (e.g., US-MLE-02's CTR drop, US-MLE-05's recall regression) are appended.
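A sketch of what one condition object might look like (the field names and the sustained-window logic are assumptions about the contract library, not its actual definition):

```python
from dataclasses import dataclass

@dataclass
class AbortCondition:
    description: str
    metric: str                 # e.g. "macro_f1", or a story-specific metric such as "ctr"
    max_regression: float       # allowed absolute drop of canary vs. concurrent holdout
    sustained_polls: int = 3    # consecutive 5-minute polls that must violate before aborting

    def triggered(self, canary_metrics: dict, holdout_metrics: dict) -> bool:
        # Assumes canary_metrics maps metric name -> list of recent poll values.
        recent = canary_metrics.get(self.metric, [])[-self.sustained_polls:]
        baseline = holdout_metrics.get(self.metric)
        if baseline is None or len(recent) < self.sustained_polls:
            return False
        return all(value < baseline - self.max_regression for value in recent)
```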
5.3 Coordination across in-flight promotions
Two stories cannot promote concurrently if the coordination_required_with field of either contract names the other. The promotion gate service serializes them: the second story's promotion enters a pending queue and starts only after the first completes (success or rollback).
Specific coordinations:
- US-MLE-05 → US-MLE-02 + US-MLE-06: an embedding adapter promotion blocks reranker and recommendation promotions until the new embedding has been canary-stable for 7 days.
- US-MLE-07 → US-MLE-03: a spam classifier promotion must complete before an ABSA retrain begins (so ABSA doesn't train on a labeled corpus that has been re-cleaned mid-stream).
- US-MLE-01 ↔ US-MLE-02 ↔ US-MLE-06: the three intent-consuming stories form a coordination cluster; a promotion in any of them pauses the others' promotions for a 24-hour stability window.
The serialization is enforced via a DynamoDB conditional write: concurrent promotions race for a "promotion lock" record per coordination cluster. The promotion gate service acquires the lock before starting Stage 1 and releases it after Stage 4 completes or a rollback finishes.
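A sketch of the lock acquisition (the table name and key shape are assumptions; the conditional-write pattern is the point):

```python
import boto3
from botocore.exceptions import ClientError

_locks = boto3.resource("dynamodb", region_name="ap-northeast-1").Table("promotion-cluster-locks")

def try_acquire_promotion_lock(cluster_id: str, model_target: str, version: int) -> bool:
    """Return True if this promotion now holds the cluster lock, False if it must queue."""
    try:
        _locks.put_item(
            Item={"cluster_id": cluster_id, "holder": f"{model_target}:v{version}"},
            ConditionExpression="attribute_not_exists(cluster_id)",  # lose the race -> queue
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False
        raise

def release_promotion_lock(cluster_id: str) -> None:
    _locks.delete_item(Key={"cluster_id": cluster_id})
```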
6. Cross-Cutting Operational Concerns
6.1 The 8-story on-call rotation
A single on-call rotation covers all 8 stories, with subject-matter-expert escalation paths per story. Pages route to the rotation; if the on-call cannot diagnose within 30 minutes, they escalate to the story owner. This pattern is borrowed from the Cost-Optimization team's FinOps lead model.
6.2 Cost reconciliation against Cost-Optimization stories
Several ML-Engineer stories interact with cost-optimization stories' KPIs:
| ML Story | Cost Story Interaction |
|---|---|
| US-MLE-01 | Cost-Opt US-02 (Intent Classifier Cost) — autoscaling config tuned by US-02; model quality owned here |
| US-MLE-02, -05, -06 | Cost-Opt US-06 (RAG Pipeline Cost) — embedding/reranking inference cost capped here |
| US-MLE-04, -06 | Cost-Opt US-04 (Compute Cost) — batch transform spot pricing |
| US-MLE-03, -07 | Cost-Opt US-07 (Analytics Pipeline Cost) — review-corpus processing cost |
Quarterly cost reviews involve both this folder's ML Platform Lead and the Cost-Optimization FinOps Lead. Conflicts (e.g., quality vs cost trade-offs) escalate to the joint review board.
6.3 Multi-region readiness
Currently all 8 stories serve only ap-northeast-1. A multi-region extension (US/EU) requires:
- Label platform replication (regional Iceberg copies, with data-residency enforcement)
- Feature store replication (region-pinned online stores, separate offline mirrors)
- Model registry replication (artifact metadata replicates; binary artifacts region-pinned to data origin)
- Drift hub per-region (cannot share statistics across regions because reference distributions are different)
- Promotion gate per-region
Tracked as a roadmap item; out of scope for the initial 8 stories. The platform's design choices today (region tags everywhere, region-aware artifact deployment) make this expansion mechanical rather than a re-architecture.
7. Reading Order for Onboarding a New ML Engineer
A new ML Engineer joining the team should read this folder in this order:
1. `../README.md` — the index and dependency graph
2. `./00-foundations-and-primitives-for-ml-engineering.md` — primitives and vocabulary
3. This file (`02-cross-story-platform-deep-dive.md`) — what services exist
4. The user-story file for whichever model they will own first (e.g., `../US-MLE-01-intent-classifier-retraining-pipeline.md`)
5. `./01-deep-dive-per-ml-story.md` §<their story> — concrete walkthrough
6. `../grill-chains/ml-engineer-grill-chains.md` §<their story> — interview-style stress-test of their understanding
Total reading time: ~3-4 hours. Sufficient for a senior ML Engineer to start contributing within the first week.