
02. Cross-Story ML Platform Deep Dive

This document describes the shared ML platform components that all 8 user stories in this folder depend on. Where 00-foundations-and-primitives-for-ml-engineering.md defines the conceptual primitives and 01-deep-dive-per-ml-story.md walks each story through them, this file is about the services — the actual systems that implement label storage, feature serving, model registration, drift detection, and promotion. These are owned by the ML Platform Lead (cross-story coordinator) rather than by any individual story owner.

The five platform components:

  1. Label Platform — Iceberg-on-S3, IAA tracking, cross-vendor reconciliation
  2. Feature Store — SageMaker Feature Store (online) + Iceberg (offline) with point-in-time correctness
  3. Model Registry — SageMaker Model Registry with story-specific contracts and promotion gates
  4. Drift Hub — central PSI/KS/χ² compute service consuming all 8 models' prediction logs
  5. Promotion Gate Service — shadow + canary + auto-abort daemon shared across stories

A sixth concern, the experiment-tracking and lineage system, is handled implicitly by SageMaker Experiments and is referenced from all five components above.


1. Label Platform

1.1 Architecture

flowchart TB
    subgraph SOURCES[Label Sources]
        V1[Vendor: Appen<br/>EN+JP teams]
        V2[Vendor: Sama<br/>JP-only fallback]
        IMP[Implicit Feedback<br/>Kinesis stream]
        PROG[Programmatic<br/>Lambda generators]
        LLM[LLM-Distilled<br/>Bedrock batch]
    end

    subgraph INGEST[Ingestion Layer]
        I1[Vendor API Gateway<br/>callback webhooks]
        I2[Kinesis Firehose<br/>implicit feedback]
        I3[Step Functions<br/>programmatic + LLM batch]
        QA1[Label QA Service<br/>schema validation + IAA]
    end

    subgraph STORE[Label Store - Iceberg on S3]
        ICEBERG[label_records table<br/>Iceberg format<br/>partitioned by model_target, label_source, captured_date]
        SNAP[Snapshots<br/>retained 18mo hot + Glacier]
        IAA_TBL[iaa_metrics table<br/>per-batch kappa, per-annotator running kappa]
    end

    subgraph CONSUME[Consumers]
        TRAIN[Training Pipelines<br/>US-MLE-01..08]
        AUDIT[Audit Queries<br/>regulator + post-mortem]
        DASH[Vendor Dashboards<br/>per-annotator kappa]
    end

    V1 --> I1
    V2 --> I1
    IMP --> I2
    PROG --> I3
    LLM --> I3
    I1 --> QA1
    I2 --> QA1
    I3 --> QA1
    QA1 --> ICEBERG
    QA1 --> IAA_TBL
    ICEBERG --> SNAP
    ICEBERG --> TRAIN
    ICEBERG --> AUDIT
    IAA_TBL --> DASH

    style ICEBERG fill:#9cf,stroke:#333
    style QA1 fill:#fd2,stroke:#333

1.2 The label_records Iceberg schema

CREATE TABLE manga_ml.label_records (
    example_id        STRING NOT NULL,        -- sha256(canonical_input + model_target)
    model_target      STRING NOT NULL,        -- e.g., 'intent_classifier', 'absa_aspect'
    label_value       STRUCT<
        class: STRING,
        confidence: DOUBLE,
        metadata: MAP<STRING, STRING>
    > NOT NULL,
    label_source      STRING NOT NULL,        -- 'vendor_appen', 'vendor_sama', 'implicit', 'programmatic', 'llm_distill'
    annotator_id      STRING,                 -- vendor annotator hash or LLM model_id
    label_version     BIGINT NOT NULL,        -- monotonic per (model_target, source)
    captured_at       TIMESTAMP NOT NULL,     -- when raw signal was captured
    labeled_at        TIMESTAMP NOT NULL,     -- when label was finalized
    review_status     STRING NOT NULL,        -- 'draft', 'qa_passed', 'qa_failed', 'withdrawn'
    iaa_lot_id        STRING,                 -- for vendor labels in IAA pools
    language          STRING NOT NULL,        -- 'en', 'ja', 'mixed' (project-wide bilingual concern)
    region            STRING NOT NULL,        -- 'ap-northeast-1', 'us-east-1', 'eu-west-1' (data residency)
    pii_redacted      BOOLEAN NOT NULL        -- enforced before persistence
)
USING iceberg
PARTITIONED BY (model_target, label_source, days(captured_at))
TBLPROPERTIES (
    'write.metadata.delete-after-commit.enabled' = 'true',
    'write.metadata.previous-versions-max' = '20',
    'history.expire.max-snapshot-age-ms' = '47304000000'  -- 18 months
);

The partition strategy (model_target, label_source, days(captured_at)) is non-trivial:

  • model_target as the leading partition lets a training job for intent_classifier skip-scan past 50M ABSA labels without I/O cost.
  • label_source as the second partition allows per-source IAA recomputation without scanning unrelated sources.
  • days(captured_at) allows time-bounded reads for both training (last-N-days windows) and post-incident replay (specific historical day).
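For example, a vendor-label read for intent_classifier over a recent window touches only its own partitions. A minimal PySpark sketch, assuming a Spark session with the Iceberg catalog already configured (the session setup and the 90-day window are illustrative):

# All three partition columns appear in the predicate, so Iceberg prunes
# every partition outside (intent_classifier, vendor_*, last 90 days).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("label-read").getOrCreate()

labels = spark.sql("""
    SELECT example_id, label_value, label_version, captured_at
    FROM manga_ml.label_records
    WHERE model_target = 'intent_classifier'
      AND label_source IN ('vendor_appen', 'vendor_sama')
      AND captured_at >= date_sub(current_date(), 90)
      AND review_status = 'qa_passed'
""")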

The 18-month snapshot expiration covers the longest training window that reads this table (US-MLE-04 demand forecasting uses 3 years, but reads from a separate cumulative-history table); a shorter retention would push older snapshots to Glacier, where restore latency breaks debuggability. The TBLPROPERTIES value checks out: 18 months ≈ 547.5 days × 86,400,000 ms/day = 47,304,000,000 ms.

1.3 The Label QA service

A Lambda-triggered service that runs on every label batch arrival. Validation proceeds in five steps:

# label_qa_service.py
from dataclasses import dataclass
from typing import Optional

from sklearn.metrics import cohen_kappa_score

@dataclass
class QAResult:
    accepted: bool
    rejected_count: int
    flagged_count: int
    reason: Optional[str]
    iaa_kappa: Optional[float]
    iaa_lang_kappa: dict[str, float]


class LabelQAService:
    """Runs on every label batch arrival. Computes IAA, validates schema,
    enforces PII redaction, and writes accepted labels to the Iceberg table."""

    def validate_batch(self, batch_id: str, model_target: str) -> QAResult:
        records = self.load_batch(batch_id)

        # Step 1: schema validation
        schema_errors = self.validate_schema(records, model_target)
        if schema_errors:
            return QAResult(
                accepted=False, rejected_count=len(records), flagged_count=0,
                reason=f"schema errors: {schema_errors[:5]}",
                iaa_kappa=None, iaa_lang_kappa={},
            )

        # Step 2: PII redaction enforcement
        pii_violations = [r for r in records if not r.pii_redacted]
        if pii_violations:
            self.alert_security(batch_id, len(pii_violations))
            return QAResult(
                accepted=False, rejected_count=len(records), flagged_count=0,
                reason=f"{len(pii_violations)} records lack pii_redacted=True",
                iaa_kappa=None, iaa_lang_kappa={},
            )

        # Step 3: IAA per language (project-wide bilingual concern)
        # Vendor batches contain triple-annotated examples for IAA; programmatic
        # and implicit batches skip this step.
        iaa_lang_kappa = {}
        if records[0].label_source.startswith("vendor_"):
            for lang in ["en", "ja"]:
                lang_records = [r for r in records if r.language == lang]
                if len(lang_records) >= 50:  # need power for kappa
                    iaa_lang_kappa[lang] = self.compute_iaa_kappa(lang_records)

            # Per-language gate
            for lang, kappa in iaa_lang_kappa.items():
                if kappa < 0.65:
                    self.reject_batch_to_vendor(batch_id, reason=f"{lang} kappa={kappa:.3f}<0.65")
                    return QAResult(
                        accepted=False, rejected_count=len(records), flagged_count=0,
                        reason=f"{lang} IAA below floor",
                        iaa_kappa=min(iaa_lang_kappa.values()),
                        iaa_lang_kappa=iaa_lang_kappa,
                    )

        # Step 4: per-annotator drift check
        flagged = self.check_per_annotator_drift(records)

        # Step 5: persist
        self.write_to_iceberg(records, model_target)
        self.write_iaa_metrics(batch_id, iaa_lang_kappa, flagged)

        return QAResult(
            accepted=True,
            rejected_count=0,
            flagged_count=len(flagged),
            reason=None,
            iaa_kappa=min(iaa_lang_kappa.values()) if iaa_lang_kappa else None,
            iaa_lang_kappa=iaa_lang_kappa,
        )

    def compute_iaa_kappa(self, records: list) -> float:
        """Pairwise Cohen's kappa on triple-annotated examples within this
        language slice. Label pairs are accumulated across examples before
        computing kappa, since kappa is undefined on a single observation."""
        triples = self.group_by_example(records, min_annotators=3)
        if not triples:
            return 0.0

        # Collect (annotator_i, annotator_j) label pairs across all examples,
        # then compute a single kappa over the accumulated sequences.
        left, right = [], []
        for example_id, annotations in triples.items():
            for i in range(len(annotations)):
                for j in range(i + 1, len(annotations)):
                    left.append(annotations[i].label_value["class"])
                    right.append(annotations[j].label_value["class"])
        return cohen_kappa_score(left, right) if left else 0.0

    def check_per_annotator_drift(self, records: list) -> list:
        """Flag annotators whose 30-day running kappa has dropped more than
        2 sigma below their baseline."""
        flagged = []
        per_annotator = self.group_by(records, key="annotator_id")
        for annotator_id, annotator_records in per_annotator.items():
            history = self.load_annotator_history(annotator_id, days=30)
            if not history or history.std_kappa == 0:
                continue  # no baseline (or a degenerate one) to compare against
            baseline_kappa = history.mean_kappa
            baseline_std = history.std_kappa
            current_kappa = self.compute_iaa_kappa(annotator_records)
            if current_kappa < baseline_kappa - 2 * baseline_std:
                flagged.append({
                    "annotator_id": annotator_id,
                    "current_kappa": current_kappa,
                    "baseline_kappa": baseline_kappa,
                    "z_score": (current_kappa - baseline_kappa) / baseline_std,
                })
                self.notify_vendor_pm(annotator_id, current_kappa, baseline_kappa)
        return flagged

1.4 Cross-story sharing rules

The label platform is shared, which creates a coordination obligation:

  • A model_target's namespace is owned by exactly one story. intent_classifier is owned by US-MLE-01; no other story may write to it. Story owners' AWS IAM roles are scoped to their own model_target's partition.
  • A story may read any model_target's labels for cross-story features (e.g., US-MLE-06 recommendation reads absa_aspect labels as a feature for items). Read access is reviewed quarterly.
  • The label_version counter is monotonically increasing per (model_target, source). A retraining run pins to max(label_version) at run start time, so concurrent label writes do not retroactively change a training run's data (see the sketch after this list).
  • Snapshot tags are written at every promoted-model registration: label_v=2840, model_v=47, registered_at=2026-04-29T03:00Z. This makes a model fully reproducible from its registry entry.
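A hedged sketch of the pin (the helper name and PySpark phrasing are illustrative, not the platform's actual API):

# Hypothetical: resolve the pin once at run start, then filter every read on it.
def pin_label_version(spark, model_target: str) -> int:
    """Resolve max(label_version) once, at run start."""
    row = spark.sql(f"""
        SELECT max(label_version) AS v
        FROM manga_ml.label_records
        WHERE model_target = '{model_target}'
    """).first()
    return row.v

pinned = pin_label_version(spark, "intent_classifier")

# Labels written after run start have label_version > pinned and therefore
# cannot leak into this training run.
train_df = spark.sql(f"""
    SELECT * FROM manga_ml.label_records
    WHERE model_target = 'intent_classifier'
      AND label_version <= {pinned}
      AND review_status = 'qa_passed'
""")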

2. Feature Store

2.1 Architecture

flowchart LR
    subgraph SOURCES[Feature Sources]
        ONLINE[Online Computation<br/>Lambda + Redis]
        BATCH[Batch Computation<br/>Glue + EMR]
        EXT[External APIs<br/>Personalize, OpenSearch]
    end

    subgraph FS[Feature Store]
        FS_ONLINE[SageMaker Feature Store<br/>online: DynamoDB-backed]
        FS_OFFLINE[Iceberg Mirror<br/>S3, partitioned by feature_group]
        CATALOG[Feature Catalog<br/>schema_v3.4]
    end

    subgraph CONSUMERS[Consumers]
        ONLINE_INF[Online Inference<br/>SageMaker endpoints]
        TRAIN[Training Pipelines<br/>PIT-correct reads]
        DRIFT[Drift Hub<br/>distribution snapshots]
    end

    ONLINE --> FS_ONLINE
    BATCH --> FS_OFFLINE
    EXT --> FS_ONLINE
    FS_ONLINE -.replicate.-> FS_OFFLINE
    FS_ONLINE --> ONLINE_INF
    FS_OFFLINE --> TRAIN
    FS_OFFLINE --> DRIFT
    CATALOG -.pinned by.-> TRAIN
    CATALOG -.pinned by.-> ONLINE_INF

    style FS_ONLINE fill:#9cf,stroke:#333
    style FS_OFFLINE fill:#9cf,stroke:#333
    style CATALOG fill:#fd2,stroke:#333

2.2 Feature catalog and serving lag

The feature catalog (a versioned YAML in s3://manga-ml-config-apne1/feature-catalog/) is the single source of truth for:

  • Each feature group's schema (column types, defaults, valid ranges)
  • Each feature group's serving_lag — the time delay between when raw signal is captured and when the feature value becomes visible to online inference
  • Each feature group's freshness_target — the refresh cadence of the underlying batch job
  • Each feature group's owner story (US-MLE-XX)

Excerpt:

# feature-catalog/schema_v3.4.yaml
version: schema_v3.4
deployed_at: 2026-04-15T00:00:00Z

feature_groups:
  user_taste_embedding:
    owner: US-MLE-06
    columns:
      - name: user_id
        type: STRING
        nullable: false
      - name: taste_embedding
        type: ARRAY<FLOAT>
        size: 256
        default: zero_vector
      - name: embedding_model_version
        type: STRING
        nullable: false
      - name: last_updated
        type: TIMESTAMP
        nullable: false
    serving_lag: PT1H  # 1 hour batch lag
    freshness_target: PT1H
    region: ap-northeast-1
    pii_redacted: true

  session_intent_history:
    owner: US-MLE-01
    columns:
      - name: session_id
        type: STRING
      - name: last_3_intents
        type: ARRAY<STRING>
        default: []
      - name: turn_count
        type: INT
        default: 0
    serving_lag: PT0S  # online, 0 lag
    freshness_target: PT0S
    region: ap-northeast-1
    pii_redacted: true

  catalog_demand_p50:
    owner: US-MLE-04
    columns:
      - name: sku_id
        type: STRING
      - name: forecast_p50_next_7d
        type: ARRAY<FLOAT>
        size: 7
      - name: forecast_p10_next_7d
        type: ARRAY<FLOAT>
        size: 7
      - name: forecast_p90_next_7d
        type: ARRAY<FLOAT>
        size: 7
      - name: forecast_run_id
        type: STRING
    serving_lag: PT24H  # daily batch
    freshness_target: PT24H
    region: ap-northeast-1
    pii_redacted: false  # SKU-level, no PII

2.3 PIT correctness at the platform level

The platform enforces PIT correctness with a deploy-time lint and a runtime assertion:

  • Deploy-time lint: any training pipeline calling feature_store.get_record(...) without an as_of_timestamp parameter fails CI. The lint is an AST-walker over the SageMaker Pipelines DAG source code.
  • Runtime assertion: every read from the offline mirror logs the requested as_of and the feature group's serving_lag. Reads where as_of > now - serving_lag are flagged in CloudWatch as potential leaks, since they correspond to feature values that would not yet have been visible to serving at the implied time. Per-feature-group dashboards surface these leak checks; a sketch of the check follows.
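A minimal sketch of that runtime check, under illustrative names (check_pit_read is not the actual platform API):

from datetime import datetime, timedelta, timezone

def check_pit_read(as_of: datetime, serving_lag: timedelta) -> bool:
    """Return True if the read is PIT-safe; False means flag it in CloudWatch."""
    # A value requested as of `as_of` is only safe if serving could already
    # have seen it, i.e. as_of <= now - serving_lag.
    visible_horizon = datetime.now(timezone.utc) - serving_lag
    return as_of <= visible_horizon

# Example: reading a 1-hour-lag feature group as of 30 minutes ago is flagged.
safe = check_pit_read(
    as_of=datetime.now(timezone.utc) - timedelta(minutes=30),
    serving_lag=timedelta(hours=1),
)  # False -> potential leak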

The leak detector described in primitive §2.1 of the foundations doc runs on top of this PIT-correct read API; it catches leaks the API alone cannot (e.g., a feature whose pipeline updates were retroactive — backfilling Jan with Feb data).

2.4 Schema upgrade procedure

A feature schema upgrade (e.g., schema_v3.4 → schema_v3.5) is a coordinated change across all consuming stories:

  1. Propose: ML Platform Lead opens a CCR (Coordinated Change Request) with affected stories tagged.
  2. Review: each affected story's owner audits training and serving code for the column changes.
  3. Dual-publish: schema_v3.5 is published alongside v3.4. Both serve in parallel.
  4. Migrate consumers: each story migrates its training and serving code to v3.5 on its own cadence; CCR tracks completion.
  5. Sunset: 30 days after the last consumer migrates, v3.4 is sunset.

Schema version is pinned in every model registry entry, so a model trained on v3.4 continues to read v3.4 features even after v3.5 is in production. This prevents a serving model from silently consuming a different schema than it trained on.


3. Model Registry

3.1 Architecture

The platform uses SageMaker Model Registry with a wrapper service (registry-svc) that enforces story-specific contracts.

flowchart LR
    subgraph TRAIN[Training Pipelines]
        T1[US-MLE-01<br/>Training]
        T2[US-MLE-02..08<br/>Training]
    end

    subgraph REG[Registry Service]
        VAL[Validation Layer<br/>required tags + lineage]
        SMR[SageMaker Model<br/>Registry]
        META[Registry Metadata<br/>DynamoDB]
    end

    subgraph CONS[Consumers]
        ENDP[Endpoint Deploy<br/>SageMaker]
        SHADOW[Shadow Deploy]
        AUDIT[Audit Queries]
        ROLLBK[Rollback Service]
    end

    T1 --> VAL
    T2 --> VAL
    VAL --> SMR
    VAL --> META
    SMR --> ENDP
    SMR --> SHADOW
    META --> AUDIT
    META --> ROLLBK

    style VAL fill:#fd2,stroke:#333
    style SMR fill:#9cf,stroke:#333

3.2 Required tags on every registered model

A model package will not register without all six tags:

Tag                      Meaning                                             Source
model_target             e.g., intent_classifier, embedding_adapter_v3      Pipeline parameter
code_revision            git SHA, populated by CI                            CI environment
data_snapshot_id         Iceberg snapshot id of training data at run start   Pipeline runtime
label_version            max(label_version) included                         Pipeline runtime
feature_schema_version   e.g., schema_v3.4                                   Pipeline parameter
seed                     training random seed                                Pipeline parameter

A run missing any tag is rejected at registration time, the pipeline fails, and a SEV-4 page goes to the story owner. This is the "no anonymous runs" rule from primitive §3.3.
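A sketch of the gate (the actual registry-svc validation layer is not shown in this doc; the function name is illustrative):

REQUIRED_TAGS = {
    "model_target", "code_revision", "data_snapshot_id",
    "label_version", "feature_schema_version", "seed",
}

def validate_registration_tags(tags: dict[str, str]) -> None:
    """Reject any registration missing one of the six required tags."""
    missing = REQUIRED_TAGS - tags.keys()
    if missing:
        # Registration fails, the pipeline fails, SEV-4 page to the story owner.
        raise ValueError(f"anonymous run rejected; missing tags: {sorted(missing)}")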

3.3 Per-story registry contracts

Each story has a contract registered with the registry service that encodes its specific gates:

# registry_contracts.py
INTENT_CLASSIFIER_CONTRACT = ModelContract(
    model_target="intent_classifier",
    owner_story="US-MLE-01",
    required_tags=REQUIRED_TAGS,
    minimum_metrics={
        "macro_f1": 0.90,
        "safety_critical_macro_f1": 0.92,
        "min_class_f1": 0.83,
    },
    maximum_metrics={
        "p95_inference_latency_ms": 15.0,
        "adversarial_regression": 0.01,
    },
    required_slice_metrics=[
        ("language=en", "macro_f1", ">=", 0.85),
        ("language=ja", "macro_f1", ">=", 0.85),
        ("language=mixed", "macro_f1", ">=", 0.85),
    ],
    rollback_sla_seconds=60,
    promotion_stages=[OFFLINE_GATE, SHADOW_24H, CANARY_1_5_25, FULL_PROMOTE],
    incompatibility_set=["embedding_adapter"],  # if embedding_adapter is mid-promotion, hold this story
)

EMBEDDING_ADAPTER_CONTRACT = ModelContract(
    model_target="embedding_adapter_v3",
    owner_story="US-MLE-05",
    required_tags=REQUIRED_TAGS,
    minimum_metrics={
        "recall_at_10": 0.92,
        "cross_lingual_recall_at_10": 0.85,
    },
    maximum_metrics={
        "p95_query_encode_latency_ms": 12.0,
        "blue_green_reindex_wall_clock_h": 4.0,
    },
    required_slice_metrics=[
        ("language=en", "recall_at_10", ">=", 0.90),
        ("language=ja", "recall_at_10", ">=", 0.90),
        ("genre=manhwa", "recall_at_10", ">=", 0.85),
    ],
    rollback_sla_seconds=300,  # blue/green flip, not endpoint shift
    promotion_stages=[OFFLINE_GATE, BLUE_GREEN_REINDEX, SHADOW_24H, CANARY_1_5_25, FULL_PROMOTE],
    coordination_required_with=["cross_encoder_reranker", "recommendation_two_tower"],
)

A registration that doesn't satisfy its contract sets the model package status to failed_promotion and emits a notification.

3.4 Cross-region replication

Per the project's data-residency rule, registry entries in ap-northeast-1 are the canonical copies for JP-region serving; us-east-1 and eu-west-1 are read-only replicas of the artifact metadata only (the binary artifact itself is region-pinned to where the training data lives).

Consequence: a model trained on JP customer data cannot serve traffic outside ap-northeast-1, even if the artifact is technically deployable. The registry contract enforces this via the region tag: an artifact with region=ap-northeast-1 cannot be deployed to a us-east-1 endpoint.
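A sketch of that residency check (an assumed helper, not the actual registry-svc code):

def enforce_data_residency(artifact_region: str, endpoint_region: str) -> None:
    """Refuse deployment when the artifact's region tag and endpoint differ."""
    if artifact_region != endpoint_region:
        raise PermissionError(
            f"artifact is region-pinned to {artifact_region}; "
            f"deployment to a {endpoint_region} endpoint is forbidden"
        )

enforce_data_residency("ap-northeast-1", "us-east-1")  # raises PermissionError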


4. Drift Hub

4.1 Architecture

flowchart TB
    subgraph LOGS[Prediction Logs]
        L1[Endpoint Data Capture<br/>10pct sample]
        L2[Iceberg prediction_log<br/>per model_target]
    end

    subgraph HUB[Drift Hub]
        H1[Streaming Compute<br/>Kinesis Data Analytics<br/>5-min windows]
        H2[Batch Compute<br/>EMR Serverless<br/>daily windows]
        H3[Drift Metrics Store<br/>CloudWatch + Iceberg]
    end

    subgraph ALARMS[Alarms]
        A1[CloudWatch Alarms<br/>per model_target x drift_kind]
        A2[Triage Runbook<br/>Lambda]
        A3[SNS to oncall slack]
    end

    L1 --> L2
    L2 --> H1
    L2 --> H2
    H1 --> H3
    H2 --> H3
    H3 --> A1
    A1 --> A2
    A2 --> A3

    style H1 fill:#9cf,stroke:#333
    style H2 fill:#9cf,stroke:#333
    style A1 fill:#fd2,stroke:#333

4.2 The four drift kinds, centralized

The drift hub computes the four drift kinds from primitive §6.1 for every model_target on a uniform schedule:

Kind              Cadence   Compute                             Output metric
Input drift       5 min     Streaming, Kinesis Data Analytics   MangaAssist/MLDrift/<model_target>/InputPSI/<feature_name>
Prediction drift  5 min     Streaming, Kinesis Data Analytics   MangaAssist/MLDrift/<model_target>/PredictionKS
Label drift       Daily     Batch, EMR Serverless               MangaAssist/MLDrift/<model_target>/LabelChiSquare
Concept drift     Daily     Batch, EMR Serverless               MangaAssist/MLDrift/<model_target>/ConceptDeltaF1

Per-model-target reference distributions are stored in s3://manga-ml-drift-refs-apne1/<model_target>/<registry_version>/ and pinned at promotion time: when v48 promotes, its training holdout becomes the new reference, and v47's reference is retained for 30 days for cross-version comparison.

4.3 Centralization rationale

The original platform had per-story drift detection. Three problems emerged:

  • Implementation drift: 8 different PSI implementations had subtle bugs (off-by-one binning, mishandling of empty bins, different epsilon conventions). Centralizing to a single library cut drift-related on-call pages by ~60% (see the PSI sketch after this list).
  • Reference-distribution sprawl: each story stored references in its own format; cross-story comparisons (e.g., "did the input distribution shift on US-MLE-02 the same week as US-MLE-06?") were impossible.
  • Threshold drift: each story's owner tuned drift thresholds independently, often lowering them after a false-positive alarm. Centralization moved threshold management into the registry contract, where threshold changes go through code review.
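A minimal PSI sketch in the spirit of the centralized library (the binning and epsilon conventions here are illustrative assumptions, not the library's actual choices):

import numpy as np

def psi(reference: np.ndarray, current: np.ndarray,
        n_bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `current` against `reference`."""
    # Bin edges come from reference quantiles and are shared by both samples;
    # inconsistent binning was one of the bugs in the 8 per-story versions.
    edges = np.quantile(reference, np.linspace(0.0, 1.0, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range current values

    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)

    # Epsilon floor avoids log(0) on empty bins, another historical bug source.
    ref_frac = np.clip(ref_frac, eps, None)
    cur_frac = np.clip(cur_frac, eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))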

4.4 Triage Lambda

When a CloudWatch alarm fires, the triage Lambda runs the runbook from primitive §6.3:

# drift_triage.py
def lambda_handler(event, context):
    alarm = parse_alarm(event)
    model_target = alarm["model_target"]
    drift_kind = alarm["drift_kind"]

    # Step 1: known seasonal pattern?
    if is_known_seasonal(model_target, drift_kind, alarm["timestamp"]):
        return acknowledge_and_close(alarm, reason="known seasonal pattern")

    # Step 2: upstream cause?
    upstream = check_upstream_recent_promotion(model_target)
    if upstream:
        return notify_oncall(
            alarm,
            severity="P3",
            reason=f"likely caused by {upstream.model_target} v{upstream.version} promotion",
            recommended_action="coordinated retrain",
        )

    # Step 3: sustained > 7 days?
    history = get_drift_history(model_target, drift_kind, days=8)
    if not history.sustained_above_threshold(days=7):
        return acknowledge_and_close(alarm, reason="not sustained")

    # Step 4: safety-critical slice affected?
    if is_safety_critical_slice_affected(model_target, alarm):
        return notify_oncall(
            alarm,
            severity="P2",
            reason="safety-critical slice quality regression",
            recommended_action="immediate retrain + canary review",
        )

    # Step 5: default — sustained drift, no upstream, no safety-critical
    return notify_oncall(
        alarm,
        severity="P3",
        reason="sustained concept drift > 7d, no upstream cause",
        recommended_action="scheduled retrain",
    )

5. Promotion Gate Service

5.1 Architecture

sequenceDiagram
    participant T as Training Pipeline
    participant R as Model Registry
    participant P as Promotion Gate Service
    participant E as SageMaker Endpoint
    participant D as Drift Hub
    participant A as Auto-Abort Daemon

    T->>R: Register v48 with required tags
    R->>R: Validate against story contract
    R-->>T: package_arn
    T->>P: Request promotion of v48
    P->>P: Stage 1 - Offline Gate (read eval reports)
    P-->>T: Pass / Fail
    P->>E: Deploy v48 to shadow endpoint
    P->>D: Initialize drift refs for v48
    Note over E: Shadow 24h
    P->>P: Compare shadow vs prod logs
    P->>E: Promote to canary 1pct
    Note over E,A: Canary 7d
    A->>E: Monitor canary metrics
    A-->>P: Auto-abort if any condition trips
    P->>E: Canary 5pct then 25pct then 100pct
    P->>R: Mark v47 as previous, v48 as production

5.2 The auto-abort daemon

The auto-abort daemon described in primitive §5.3 is a shared service across all 8 stories. It reads each story's contract from the registry, polls online metrics every 5 minutes, and triggers rollback when any contract condition is violated for a sustained window.

# auto_abort_daemon.py
import time


class AutoAbortDaemon:
    def __init__(self, model_target: str, contract: ModelContract):
        self.model_target = model_target
        self.contract = contract
        self.poll_interval_s = 300

    def run(self):
        while self.is_canary_active():
            metrics = self.poll_canary_metrics()
            holdout = self.poll_holdout_metrics()

            for condition in self.contract.canary_abort_conditions:
                if condition.triggered(metrics, holdout):
                    self.rollback(reason=condition.description)
                    return

            time.sleep(self.poll_interval_s)

    def rollback(self, reason: str):
        self.log_p2_page(
            model_target=self.model_target,
            reason=reason,
            recent_metrics=self.recent_metrics_window(hours=2),
        )
        self.shift_traffic_to_previous(self.model_target)
        self.mark_failed_promotion(self.model_target)

The daemon's per-story contract makes this generic: each story passes its own thresholds. The four universal abort conditions from §5.3 of the foundations doc are encoded as defaults; story-specific conditions (e.g., US-MLE-02's CTR drop, US-MLE-05's recall regression) are appended.
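A hypothetical shape for such a condition (the dataclass and the CTR threshold below are illustrative assumptions, not the contract library's actual definitions):

from dataclasses import dataclass
from typing import Callable

@dataclass
class AbortCondition:
    description: str
    # (canary_metrics, holdout_metrics) -> True when the abort should fire
    triggered: Callable[[dict, dict], bool]

# A story-specific condition appended to the universal defaults, e.g. a canary
# CTR drop relative to the holdout population (threshold illustrative).
ctr_drop = AbortCondition(
    description="canary CTR more than 5% below holdout CTR",
    triggered=lambda canary, holdout: canary["ctr"] < 0.95 * holdout["ctr"],
)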

5.3 Coordination across in-flight promotions

Two stories cannot promote concurrently if the coordination_required_with field of either contract names the other. The promotion gate service serializes them: the second story's promotion enters a pending queue and starts only after the first completes (success or rollback).

Specific coordinations:

  • US-MLE-05 → US-MLE-02 + US-MLE-06: an embedding adapter promotion blocks reranker and recommendation promotions until the new embedding has been canary-stable for 7 days.
  • US-MLE-07 → US-MLE-03: a spam classifier promotion must complete before an ABSA retrain begins (so ABSA doesn't train on a labeled corpus that has been re-cleaned mid-stream).
  • US-MLE-01 ↔ US-MLE-02 ↔ US-MLE-06: the three intent-consuming stories form a coordination cluster; a promotion in any of them pauses the others' promotions for a 24-hour stability window.

The serialization is enforced via a DynamoDB conditional write; concurrent promotions race for a "promotion lock" record per coordination cluster. The promotion gate service acquires the lock before starting Stage 1; releases it after Stage 4 completes or rollback.
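A sketch of the lock acquisition, assuming an illustrative promotion_locks table (the real table and attribute names are not given in this doc):

import boto3
from botocore.exceptions import ClientError

ddb = boto3.client("dynamodb", region_name="ap-northeast-1")

def try_acquire_promotion_lock(cluster_id: str, story: str) -> bool:
    """Exactly one racer wins the conditional put; losers queue as pending."""
    try:
        ddb.put_item(
            TableName="promotion_locks",  # assumed table name
            Item={"cluster_id": {"S": cluster_id}, "holder": {"S": story}},
            ConditionExpression="attribute_not_exists(cluster_id)",
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # another story holds the lock
        raise

def release_promotion_lock(cluster_id: str) -> None:
    # Released after Stage 4 completes or after rollback.
    ddb.delete_item(
        TableName="promotion_locks",
        Key={"cluster_id": {"S": cluster_id}},
    )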


6. Cross-Cutting Operational Concerns

6.1 The 8-story on-call rotation

A single on-call rotation covers all 8 stories, with subject-matter-expert escalation paths per story. Pages route to the rotation; if the on-call cannot diagnose within 30 minutes, they escalate to the story owner. This pattern is borrowed from the Cost-Optimization team's FinOps lead model.

6.2 Cost reconciliation against Cost-Optimization stories

Several ML-Engineer stories interact with cost-optimization stories' KPIs:

ML Story               Cost Story                                 Interaction
US-MLE-01              Cost-Opt US-02 (Intent Classifier Cost)    autoscaling config tuned by US-02; model quality owned here
US-MLE-02, -05, -06    Cost-Opt US-06 (RAG Pipeline Cost)         embedding/reranking inference cost capped here
US-MLE-04, -06         Cost-Opt US-04 (Compute Cost)              batch transform spot pricing
US-MLE-03, -07         Cost-Opt US-07 (Analytics Pipeline Cost)   review-corpus processing cost

Quarterly cost reviews involve both this folder's ML Platform Lead and the Cost-Optimization FinOps Lead. Conflicts (e.g., quality vs cost trade-offs) escalate to the joint review board.

6.3 Multi-region readiness

Currently all 8 stories serve only ap-northeast-1. A multi-region extension (US/EU) requires:

  • Label platform replication (regional Iceberg copies, with data-residency enforcement)
  • Feature store replication (region-pinned online stores, separate offline mirrors)
  • Model registry replication (artifact metadata replicates; binary artifacts region-pinned to data origin)
  • Drift hub per-region (cannot share statistics across regions because reference distributions are different)
  • Promotion gate per-region

Tracked as a roadmap item; out of scope for the initial 8 stories. The platform's design choices today (region tags everywhere, region-aware artifact deployment) make this expansion mechanical rather than a re-architecture.


7. Reading Order for Onboarding a New ML Engineer

A new ML Engineer joining the team should read this folder in this order:

  1. ../README.md — the index and dependency graph
  2. ./00-foundations-and-primitives-for-ml-engineering.md — primitives, vocabulary
  3. This file (02-cross-story-platform-deep-dive.md) — what services exist
  4. The user-story file for whichever model they will own first (e.g., ../US-MLE-01-intent-classifier-retraining-pipeline.md)
  5. ./01-deep-dive-per-ml-story.md §<their story> — concrete walkthrough
  6. ../grill-chains/ml-engineer-grill-chains.md §<their story> — interview-style stress-test of their understanding

Total reading time: ~3-4 hours. Sufficient for a senior ML Engineer to start contributing within the first week.