
US-MLE-04: Demand Forecasting Pipeline

User Story

As an ML Engineer at Amazon-scale on the MangaAssist team, I want to own a daily incremental refit + weekly full retrain pipeline for the multi-horizon demand forecasting model (Temporal Fusion Transformer) that ingests fulfilled-demand actuals with stockout-censoring correction, encodes promotional and anime-release calendars as known-future inputs, produces quantile (P10/P50/P90) next-7-day SKU forecasts via batch transform, evaluates with rolling-origin walk-forward backtests, and detects regime shifts from promo distortion and supply-chain disruptions, So that the chatbot's stockout-prevention dialog ("we expect this to be available in 3 days") and the recommendation system's stock-aware reranking (US-MLE-06) read calibrated, time-causal demand signals that correctly account for promo events, sparse long-tail SKUs, and the bilingual JP/EN release-calendar inputs that drive the most predictive demand spikes.

Acceptance Criteria

  • Daily batch forecast runs at 02:00 JST off-peak window; finishes by 03:30 JST (p99 wall-clock ≤ 90 minutes).
  • Forecast covers all ~1.5M SKUs with daily demand signal (5M SKUs in active catalog; ~3.5M long-tail filtered to category-level forecasts).
  • Next-7-day SKU forecast sMAPE ≤ 14% on rolling-origin walk-forward backtest (last 90 days).
  • Promo-uplift error ≤ 25% on the held-out promo-window slice (per Ground-Truth-Evolution/ML-Scenarios/05-demand-forecasting-promo-distortion.md cross-reference).
  • Quantile calibration: empirical P10 coverage ∈ [8%, 12%]; empirical P90 coverage ∈ [88%, 92%]; pinball loss tracked per quantile.
  • MASE ≤ 1.0 vs seasonal-naive baseline (forecast must beat "same-day-last-week" on every SKU tier).
  • Stockout-censoring correction applied: Kaplan-Meier-style adjustment on observation windows where units_sold == on_hand_at_open.
  • Cold-start SKUs (< 30 days history) served via hierarchical category × brand × format prior with explicit cold_start=true flag in output.
  • Daily run success rate ≥ 99% over 30-day rolling window; no consecutive-day failures.
  • Forecast output written to S3 (Iceberg sku_demand_forecast table) and DynamoDB cache (MangaSkuForecast table, forecast_date#sku PK) within 5 minutes of batch completion.
  • Online lookup p99 latency from DynamoDB ≤ 8 ms for chatbot order-inventory MCP queries.
  • Rollback to previous-day forecast in ≤ 4 hours (re-import last-known-good batch result; documented in primitive §7.3).
  • Promo-event drift alarm fires within 48 hours when realized lift on any event class deviates > 2σ from the overlay model's prediction.

Architecture (HLD)

The Production Surface

The demand forecasting model sits on the chatbot's warm path, not the hot path. The chatbot's order-inventory MCP (per RAG-MCP-Integration/03-order-inventory-mcp.md) reads pre-computed daily forecasts from a DynamoDB cache keyed by (forecast_date, sku_id). There is no real-time inference on the conversation path — the model serves via SageMaker batch transform once per day at 02:00 JST, and the chatbot reads the cached output.

The model is a Temporal Fusion Transformer (TFT, ~30M params) trained on 1095 days (3 years) of fulfilled-demand history. TFT is chosen over Prophet/LightGBM (the prior-generation forecasting stack documented in Ground-Truth-Evolution/ML-Scenarios/05-demand-forecasting-promo-distortion.md) because:

  • Multi-horizon native: TFT outputs all 7 horizons (day+1 through day+7) in a single forward pass, with horizon-specific attention.
  • Quantile output: TFT directly produces P10/P50/P90 via quantile loss; downstream ops needs P90 for safety-stock and P10 for stockout-risk dialog responses.
  • Known-future inputs: TFT separates static covariates (SKU brand, format), past-observed inputs (sales, prices, stockouts), and known-future inputs (anime air dates, promo calendar, manga volume release dates). The known-future channel is what makes TFT correctly handle the calendar-driven spikes that broke the prior Prophet/LightGBM stack.
  • Interpretable attention weights: TFT's variable-selection network exposes per-feature importance, which is critical for the post-event review process that ops + ML-eng run after every major demand event.

The output is three quantiles (P10, P50, P90) per SKU per horizon (1–7 days). The chatbot's stockout-prevention dialog reads P10 (pessimistic estimate of remaining stock days); the recommendation reranker reads P50 (point estimate for stock-aware reranking); inventory planning reads P90 (for safety-stock decisions).
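
To make the consumption pattern concrete, here is a minimal sketch of how a consumer could walk one of those quantile paths into the "available in N days" copy. The helper name and shapes are illustrative only; the real interface lives in the order-inventory MCP (RAG-MCP-Integration/03-order-inventory-mcp.md).

# Sketch only: not the MCP implementation. Assumes a 7-element daily demand
# quantile path (P10/P50/P90 per the assignments above) and current on-hand.
def days_of_cover(on_hand: float, daily_demand_path: list[float]) -> int:
    """Return the first horizon day on which cumulative forecast demand
    exhausts on-hand stock; horizon + 1 means stock survives the full week."""
    remaining = on_hand
    for day, demand in enumerate(daily_demand_path, start=1):
        remaining -= demand
        if remaining <= 0:
            return day
    return len(daily_demand_path) + 1

The returned day count is what gets interpolated into the dialog template; the LLM never generates the number itself.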

Volume and Cadence

  • Active catalog: 5M SKUs.
  • Daily-demand-signal SKUs: ~1.5M (long tail of zero/sparse demand handled at category level).
  • History depth: 1095 days (3 years).
  • Training cadence: Weekly full retrain (Sunday 02:00 JST, ~4 hours on g5.12xlarge); daily incremental refit (every day 02:00 JST, ~30 minutes on g5.4xlarge with encoder weights frozen, decoder fine-tuned on last 7 days).
  • Inference cadence: Daily batch transform on ml.g5.12xlarge cluster (4 instances, sharded by SKU hash); 1.5M SKUs × 7 horizons × 3 quantiles = 31.5M forecast values produced per day.
  • Forecast horizon: Next 7 days (the chatbot's "we expect availability in N days" copy never extends beyond 7).

End-to-End ML Lifecycle Diagram

flowchart TB
    subgraph DATA[Data Layer]
        D1[Order Pipeline<br/>fulfilled_demand_actuals<br/>~12M orders/day]
        D2[Stockout Log<br/>on_hand_at_open per SKU]
        D3[Promo Calendar<br/>marketing platform<br/>14d look-ahead]
        D4[Anime Release Calendar<br/>bilingual JP/EN<br/>90d look-ahead]
        D5[Manga Volume Calendar<br/>publisher feeds<br/>180d look-ahead]
        D6[Censoring-Corrected<br/>Iceberg Table<br/>sku_demand_facts]
        D1 --> D6
        D2 --> D6
        D3 --> D6
        D4 --> D6
        D5 --> D6
    end

    subgraph FEAT[Feature Layer]
        F1[Static Covariates<br/>SKU brand, format,<br/>category, country]
        F2[Past-Observed<br/>sales, price, stockout,<br/>review velocity]
        F3[Known-Future<br/>promo, anime air,<br/>manga release, holiday]
        F4[Hierarchical Priors<br/>category x brand x format<br/>for cold-start SKUs]
    end

    subgraph TRAIN[Training Pipeline - SageMaker Pipelines]
        T1[Step 1<br/>Data Validation +<br/>Stockout Censoring]
        T2[Step 2<br/>Feature Materialization<br/>PIT join]
        T3[Step 3<br/>Rolling-Origin Split<br/>walk-forward NOT random]
        T4[Step 4<br/>Training<br/>g5.12xlarge spot]
        T5[Step 5<br/>Backtest Eval<br/>sMAPE/MASE/quantile loss]
        T6[Step 6<br/>Slice Analysis<br/>tier x event x cold-start]
        T7[Step 7<br/>Calibration Audit<br/>P10/P50/P90 coverage]
        T8[Step 8<br/>Model Registration]
        T1 --> T2 --> T3 --> T4 --> T5 --> T6 --> T7 --> T8
    end

    subgraph SERVE[Serving Layer]
        S1[Model Registry<br/>v23 prod, v24 candidate]
        S2[Batch Transform<br/>ml.g5.12xlarge x 4]
        S3[S3 Output<br/>sku_demand_forecast<br/>Iceberg]
        S4[DynamoDB Cache<br/>MangaSkuForecast<br/>p99 8ms]
        S5[Chatbot Order-<br/>Inventory MCP]
        S6[US-MLE-06<br/>Stock-Aware<br/>Reranker]
        S1 --> S2 --> S3
        S3 --> S4
        S4 --> S5
        S4 --> S6
    end

    subgraph DRIFT[Drift Detection]
        DR1[Drift Hub<br/>quantile coverage,<br/>per-event-class lift error]
        DR2[CloudWatch Alarms]
        DR3[Promo-Event Triage]
        DR1 --> DR2
        DR2 --> DR3
        DR3 -.coordinated retrain.-> T1
    end

    D6 --> T1
    F1 --> T2
    F2 --> T2
    F3 --> T2
    F4 --> T2
    T8 --> S1
    S3 -.actuals join.-> DR1

    style D6 fill:#9cf,stroke:#333
    style F3 fill:#fd2,stroke:#333
    style T3 fill:#f66,stroke:#333,color:#fff
    style T1 fill:#fd2,stroke:#333
    style S4 fill:#9cf,stroke:#333
    style DR1 fill:#fd2,stroke:#333

The colored nodes flag the architecture's three structural commitments: (1) D6, the censoring-corrected facts table, is the single source of truth and the most critical piece to get right; (2) F3, the known-future inputs, is the channel that distinguishes TFT from naive baselines; (3) T3, the rolling-origin split (highlighted in red), is the only correct way to evaluate a forecasting model — random split is a leak, not a methodology.

Data Contracts and Volume

| Asset | Format | Schema Version | Snapshot Cadence | Volume | Owner |
|---|---|---|---|---|---|
| fulfilled_demand_actuals | Iceberg table | demand_v2 | 5-min micro-batch from order pipeline | ~12M orders/day; 3-year history | Order Platform |
| stockout_log | Iceberg table | stockout_v1 | 1-min from inventory service | ~50K stockout events/day | Inventory Platform |
| sku_demand_facts | Iceberg table (censored, joined) | facts_v3.2 | Hourly rebuild | ~1.5M SKUs × 1095 days ≈ 1.6B rows | ML Eng (this story) |
| promo_calendar | Iceberg table | promo_v2 | Daily refresh from marketing | 14-day forward + 3-year backward | Marketing Platform |
| anime_release_calendar | Iceberg table | anime_v3 | Daily refresh, bilingual JP/EN | 90-day forward + 5-year backward | Content Platform |
| manga_volume_calendar | Iceberg table | volume_v2 | Weekly publisher feeds | 180-day forward + 10-year backward | Content Platform |
| sku_demand_forecast | Iceberg table | forecast_v2 | Daily 02:00 JST | 31.5M rows/day; 90d hot, 3y Glacier | ML Eng (this story) |
| MangaSkuForecast | DynamoDB table | n/a | Daily replace | 1.5M items, ~120 MB | ML Eng (this story) |
| tft_forecaster | Model package | model_v23 in prod | Weekly full + daily incremental | ~30M params, ~120 MB artifact | ML Eng (this story) |

Model Registry + Promotion Gates

flowchart LR
    R23[Registry v23<br/>prod, weekly_full=2026-04-21]:::prod
    R24[Registry v24<br/>candidate, weekly_full=2026-04-28]:::cand

    R24 --> G1{Stage 1<br/>Offline Gate<br/>rolling-origin backtest}
    G1 -->|pass| G2{Stage 2<br/>Shadow 7d batch<br/>parallel forecast}
    G1 -.fail.-> RB1[Rollback v23]
    G2 -->|pass| G3{Stage 3<br/>Extended Canary 14d<br/>10% SKU traffic}
    G2 -.fail.-> RB2[Rollback v23]
    G3 -->|pass| G4[Stage 4<br/>Full Promote v24]
    G3 -.fail.-> RB3[Rollback v23]

    classDef prod fill:#2d8,stroke:#333
    classDef cand fill:#fd2,stroke:#333
    style RB1 fill:#f66,stroke:#333
    style RB2 fill:#f66,stroke:#333
    style RB3 fill:#f66,stroke:#333
    style G4 fill:#2d8,stroke:#333

The four-stage promotion gate is the standard from deep-dives/00-foundations-and-primitives-for-ml-engineering.md §5.1. Story-specific thresholds:

  • Stage 1 (offline): rolling-origin walk-forward sMAPE ≤ 14%, MASE ≤ 1.0, P10/P90 coverage in [8%,12%]/[88%,92%], promo-window sMAPE ≤ 25%.
  • Stage 2 (shadow): 7-day parallel forecast comparison; v24 daily-MAE on actuals ≤ v23 daily-MAE × 1.05; per-SKU-tier sMAPE no worse than 1.5σ regression on any tier.
  • Stage 3 (canary): extended canary — TFT is on the §4.3 "weak offline-online correlation" pre-flagged list, so the canary runs at 10% SKU sample for 14 days regardless of offline metrics. The canary forecasts are written to a shadow column in DynamoDB; the chatbot does not read them. Per primitive §4.3.
  • Stage 4 (full): traffic shift to 100% with old v23 retained for 14 days as rollback target.

The 4-hour rollback SLA (per primitive §7.3) is met by re-running the previous-day batch transform with v23 model package and re-importing the result to DynamoDB.


Low-Level Design

1. Feature / Data Pipeline

The TFT consumes three feature kinds, mirroring TFT's native input partitioning:

Static covariates (per SKU, time-invariant): sku_brand, sku_format (single-volume / box-set / digital / pre-order), category (shōnen / shōjo / seinen / josei / manhwa / manhua / light-novel), publisher, country_of_origin_jp, is_part_of_series, series_episode_count_at_creation. 11 features.

Past-observed inputs (per SKU per day, available as actuals): units_sold, units_sold_censoring_indicator (1 if stockout, 0 otherwise), kaplan_meier_corrected_demand, list_price, effective_price_after_promo, review_count_delta, review_avg_rating_delta, inventory_on_hand_at_open, competitor_price_index (where available), prior_day_demand, lag_7_demand, lag_28_demand, rolling_28d_demand_mean, rolling_28d_demand_std. 14 features.

Known-future inputs (per SKU per day, knowable at forecast time): is_promo_day (binary), promo_discount_pct, is_anime_air_window (binary, ±7d around airing), anime_episode_number, is_manga_volume_release_day (binary), manga_volume_number, is_holiday_jp (binary), is_holiday_us (binary), is_payday_jp (binary, 25th of month), day_of_week, day_of_month, week_of_year, month. 13 features.

The known-future channel is the architectural commitment that distinguishes this design from the prior Prophet/LightGBM stack — promo events, anime airings, and manga volume releases are encoded as features the model can attend to, not as anomalies to be filtered out.

# feature_pipeline.py
from dataclasses import dataclass
from datetime import datetime, timedelta, date
from typing import Optional
import boto3
import sagemaker
import pandas as pd
import numpy as np


@dataclass
class TFTTrainingExample:
    sku_id: str
    forecast_origin: date         # the "today" of the example
    target_horizons: list[int]    # [1, 2, 3, 4, 5, 6, 7] days ahead
    target_values: list[float]    # actual demand at each horizon (censoring-corrected)
    static_covariates: dict
    past_observed: pd.DataFrame   # last 90 days of past-observed features
    known_future: pd.DataFrame    # next 7 days of known-future features
    is_cold_start: bool           # < 30 days of SKU history


class TFTFeaturePipeline:
    """PIT-correct feature read for TFT training.

    Implements primitive §2.1 with forecasting-specific PIT semantics:
    a feature observed at day t is included only if it would have been
    visible at the forecast_origin (i.e., t <= forecast_origin - serving_lag).

    Known-future inputs have a different rule: they are included for
    days t in [forecast_origin + 1, forecast_origin + 7] but only the
    *value as known at forecast_origin* — promo plans get revised, and
    a forecast made on Monday must use Monday's view of the promo calendar,
    not Wednesday's revision.
    """

    def __init__(self, region: str = "ap-northeast-1"):
        self.session = sagemaker.Session(boto3.Session(region_name=region))
        # online feature reads go through the Feature Store runtime client
        self.fs = boto3.client("sagemaker-featurestore-runtime", region_name=region)
        self.catalog = FeatureCatalog.load("facts_v3.2")  # internal catalog wrapper
        self.serving_lag = self.catalog.serving_lag_per_group()

    def materialize_for_training(
        self,
        sku_ids: list[str],
        forecast_origin: date,
        history_window_days: int = 90,
    ) -> list[TFTTrainingExample]:
        """For each SKU, build a training example with PIT-correct features."""
        examples = []
        for sku_id in sku_ids:
            past = self._read_past_observed(
                sku_id=sku_id,
                start=forecast_origin - timedelta(days=history_window_days),
                end=forecast_origin,
                as_of=forecast_origin,  # PIT cutoff: nothing visible after origin
            )
            future = self._read_known_future(
                sku_id=sku_id,
                start=forecast_origin + timedelta(days=1),
                end=forecast_origin + timedelta(days=7),
                as_of=forecast_origin,  # CRITICAL: as-of-forecast-origin view
            )
            target = self._read_target(
                sku_id=sku_id,
                horizons=range(1, 8),
                forecast_origin=forecast_origin,
            )
            static = self._read_static(sku_id=sku_id, as_of=forecast_origin)
            sku_history_days = self._sku_history_days(sku_id, forecast_origin)
            examples.append(
                TFTTrainingExample(
                    sku_id=sku_id,
                    forecast_origin=forecast_origin,
                    target_horizons=list(range(1, 8)),
                    target_values=target,
                    static_covariates=static,
                    past_observed=past,
                    known_future=future,
                    is_cold_start=sku_history_days < 30,
                )
            )
        return examples

    def _read_known_future(
        self,
        sku_id: str,
        start: date,
        end: date,
        as_of: date,
    ) -> pd.DataFrame:
        """Read the known-future channel as it was visible at as_of.

        Promo calendars are revised: a campaign added on April 23 for
        an April 26 promo must NOT appear in a forecast made April 22.
        The catalog tables are versioned; we read the snapshot taken
        at end-of-day as_of.
        """
        return self.catalog.read_versioned(
            table="known_future_features",
            sku_id=sku_id,
            day_range=(start, end),
            as_of_snapshot=as_of,  # snapshot taken at 23:59:59 of as_of date
        )

    def stockout_censoring_correction(
        self,
        raw_actuals: pd.DataFrame,
        stockout_log: pd.DataFrame,
    ) -> pd.DataFrame:
        """Kaplan-Meier-style right-censoring correction.

        On any day where units_sold == on_hand_at_open AND on_hand_at_open > 0
        AND the SKU went out-of-stock during the day, the observed sales are
        censored (true demand was AT LEAST observed sales, possibly more).

        We model this as right-censored survival data and apply a KM-style
        adjustment: estimate the survival function S(d) = P(demand > d) on
        non-censored days, then for censored days replace observed sales
        with the conditional expectation E[demand | demand >= observed].
        """
        merged = raw_actuals.merge(stockout_log, on=["sku_id", "day"], how="left")
        merged["is_censored"] = (
            (merged["units_sold"] == merged["on_hand_at_open"])
            & (merged["on_hand_at_open"] > 0)
            & (merged["went_oos_during_day"] == True)
        )
        # Per-SKU-tier KM survival function (tier = quantile bucket of mean demand)
        merged["sku_tier"] = self._sku_tier(merged["sku_id"])
        survival = self._estimate_km_survival_per_tier(
            merged[~merged["is_censored"]]
        )
        # For censored rows, replace with conditional expectation
        merged["corrected_demand"] = merged.apply(
            lambda r: r["units_sold"] if not r["is_censored"]
            else self._conditional_expectation(
                survival[r["sku_tier"]],
                lower_bound=r["units_sold"],
            ),
            axis=1,
        )
        return merged

    def _conditional_expectation(self, survival_fn, lower_bound: float) -> float:
        """E[D | D >= lower_bound] using the per-tier KM survival function.

        With S(d) = P(D > d): P(D >= lower_bound) = S(lower_bound - 1) and
        P(D = d) = S(d - 1) - S(d) for integer demand counts.
        """
        d_grid = np.arange(lower_bound, lower_bound * 5 + 1)
        p_at_least_lb = survival_fn(lower_bound - 1)
        if p_at_least_lb < 1e-6:
            return float(lower_bound)  # extreme tail, return observed
        # probability mass at each grid point, truncated at 5x the observed value
        pmf = np.array([survival_fn(d - 1) - survival_fn(d) for d in d_grid])
        return float((d_grid * pmf).sum() / p_at_least_lb)

    def hierarchical_prior_for_cold_start(
        self,
        sku_id: str,
        as_of: date,
    ) -> dict:
        """For SKUs with < 30 days history, build a hierarchical prior from
        category x brand x format aggregates. Returns features that the
        model uses to substitute missing past-observed lags.

        Three nested levels of priors, with shrinkage:
        - Level 1: SKU's own (sparse) history, weight = n_days / 30
        - Level 2: brand x format prior, weight = (1 - L1) * 0.6
        - Level 3: category x format prior, weight = (1 - L1) * 0.4
        """
        own = self._sku_history_aggregates(sku_id, as_of)
        n_days = own["n_days"]
        l1_weight = min(n_days / 30.0, 1.0)
        if l1_weight >= 1.0:
            return own  # not cold-start

        static = self._read_static(sku_id, as_of)
        brand_format = self._aggregate(
            filter_={"brand": static["brand"], "format": static["format"]},
            as_of=as_of,
        )
        cat_format = self._aggregate(
            filter_={"category": static["category"], "format": static["format"]},
            as_of=as_of,
        )
        return {
            "rolling_28d_demand_mean": (
                l1_weight * own["rolling_28d_demand_mean"]
                + (1 - l1_weight) * 0.6 * brand_format["rolling_28d_demand_mean"]
                + (1 - l1_weight) * 0.4 * cat_format["rolling_28d_demand_mean"]
            ),
            "rolling_28d_demand_std": (
                l1_weight * own["rolling_28d_demand_std"]
                + (1 - l1_weight) * 0.6 * brand_format["rolling_28d_demand_std"]
                + (1 - l1_weight) * 0.4 * cat_format["rolling_28d_demand_std"]
            ),
            # ... lag features substituted similarly
            "is_cold_start": True,
            "cold_start_n_days": n_days,
        }

The two non-trivial pieces above — stockout censoring correction and hierarchical cold-start priors — are workload-specific extensions of the platform's standard feature pipeline. The censoring correction has caught a structural under-forecast on chronically-stocked-out hot SKUs (the prior Prophet model had been learning "demand on this SKU is roughly = inventory" because the censored days dragged the mean down). The hierarchical prior is what makes the system serve forecasts on Day 1 of a new SKU launch instead of falling back to a global mean.

2. Training Pipeline (SageMaker Pipelines)

# training_pipeline.py
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.conditions import ConditionLessThanOrEqualTo
from sagemaker.workflow.functions import JsonGet
from sagemaker.workflow.properties import PropertyFile
from sagemaker.workflow.parameters import ParameterString, ParameterInteger
from sagemaker.pytorch import PyTorch
from sagemaker.processing import ScriptProcessor, ProcessingOutput


def build_pipeline(role: str, region: str = "ap-northeast-1") -> Pipeline:
    train_mode = ParameterString(name="TrainMode", default_value="incremental")
    forecast_origin = ParameterString(name="ForecastOrigin", default_value="")
    feature_schema = ParameterString(name="FeatureSchemaVersion", default_value="facts_v3.2")

    # Step 1: Data validation + stockout censoring
    data_val = ProcessingStep(
        name="TFTDataValidationAndCensoring",
        processor=ScriptProcessor(
            image_uri="763104351884.dkr.ecr.ap-northeast-1.amazonaws.com/pytorch-training:2.1.0-cpu-py310",
            role=role,
            instance_type="ml.m5.4xlarge",
            instance_count=4,  # parallelize censoring across SKU shards
            command=["python3"],
        ),
        code="src/data_validation.py",
        job_arguments=[
            "--forecast-origin", forecast_origin,
            "--feature-schema", feature_schema,
            "--km-survival-min-tier-samples", "5000",
            "--censoring-rate-abort-threshold", "0.15",  # >15% censored = data integrity issue
        ],
    )

    # Step 2: Feature materialization (PIT-correct, including known-future as-of-snapshot)
    feature_mat = ProcessingStep(
        name="TFTFeatureMaterialization",
        processor=ScriptProcessor(
            image_uri="763104351884.dkr.ecr.ap-northeast-1.amazonaws.com/pytorch-training:2.1.0-cpu-py310",
            role=role,
            instance_type="ml.m5.4xlarge",
            instance_count=8,
            command=["python3"],
        ),
        code="src/feature_materialize.py",
        depends_on=[data_val],
        job_arguments=[
            "--feature-schema", feature_schema,
            "--history-window-days", "90",
            "--horizon-days", "7",
            "--known-future-snapshot-policy", "as-of-forecast-origin",
            "--leak-detector-pct", "0.005",
        ],
    )

    # Step 3: Rolling-origin walk-forward split (NOT random; this is critical)
    split = ProcessingStep(
        name="TFTRollingOriginSplit",
        processor=ScriptProcessor(
            image_uri="763104351884.dkr.ecr.ap-northeast-1.amazonaws.com/pytorch-training:2.1.0-cpu-py310",
            role=role,
            instance_type="ml.m5.4xlarge",
            instance_count=1,
            command=["python3"],
        ),
        code="src/walk_forward_split.py",
        depends_on=[feature_mat],
        job_arguments=[
            "--strategy", "rolling-origin-walk-forward",
            "--n-folds", "12",                    # 12 folds × 7-day horizon = 84 days backtest
            "--gap-days", "0",                    # no gap (we test on the day right after train)
            "--initial-train-days", "730",        # 2 years initial training window
            "--step-days", "7",                   # advance origin by 7 days per fold
            "--seed", "42",
        ],
    )

    # Step 4: Training (full or incremental)
    estimator_full = PyTorch(
        entry_point="train_tft.py",
        source_dir="src/",
        instance_type="ml.g5.12xlarge",          # 4x A10G GPUs
        instance_count=1,
        role=role,
        framework_version="2.1.0",
        py_version="py310",
        use_spot_instances=True,
        max_wait=21600,                           # 6h spot wait (4h target run + 2h slack)
        max_run=18000,                            # 5h hard cap
        checkpoint_s3_uri="s3://manga-ml-checkpoints-apne1/tft/",
        checkpoint_local_path="/opt/ml/checkpoints",
        hyperparameters={
            "model_arch": "tft",
            "hidden_size": 160,
            "lstm_layers": 2,
            "attention_heads": 4,
            "dropout": 0.1,
            "max_encoder_length": 90,
            "max_prediction_length": 7,
            "learning_rate": 1e-3,
            "batch_size": 128,
            "max_epochs": 30,
            "early_stopping_patience": 5,
            "loss": "quantile",
            "quantiles": "0.1,0.5,0.9",
            "fp16": True,
            "gradient_clip_val": 0.1,
            "seed": 42,
        },
    )
    estimator_incremental = PyTorch(
        entry_point="train_tft_incremental.py",
        source_dir="src/",
        instance_type="ml.g5.4xlarge",            # 1x A10G GPU
        instance_count=1,
        role=role,
        framework_version="2.1.0",
        py_version="py310",
        use_spot_instances=True,
        max_wait=3600,
        max_run=2400,                             # 40min hard cap
        checkpoint_s3_uri="s3://manga-ml-checkpoints-apne1/tft-incr/",
        hyperparameters={
            "model_arch": "tft",
            "warm_start_from_registry": "v23",   # daily incremental loads prior weights
            "freeze_encoder": True,
            "decoder_finetune_days": 7,
            "learning_rate": 1e-4,                # 10x lower for incremental
            "max_epochs": 3,
            "batch_size": 256,
            "loss": "quantile",
            "quantiles": "0.1,0.5,0.9",
            "fp16": True,
        },
    )
    train = TrainingStep(
        name="TFTTrain",
        estimator=estimator_full,                 # selector logic in pipeline params
        depends_on=[split],
    )

    # Step 5: Backtest evaluation (rolling-origin).
    # backtest_eval.py writes backtest_metrics.json to the "metrics" output;
    # the offline gate below reads it through this PropertyFile.
    backtest_metrics = PropertyFile(
        name="TFTBacktestMetrics",
        output_name="metrics",
        path="backtest_metrics.json",
    )
    backtest = ProcessingStep(
        name="TFTBacktestEval",
        processor=ScriptProcessor(
            image_uri="763104351884.dkr.ecr.ap-northeast-1.amazonaws.com/pytorch-training:2.1.0-gpu-py310",
            role=role,
            instance_type="ml.g5.4xlarge",
            instance_count=1,
            command=["python3"],
        ),
        code="src/backtest_eval.py",
        depends_on=[train],
        outputs=[ProcessingOutput(output_name="metrics", source="/opt/ml/processing/output")],
        property_files=[backtest_metrics],
        job_arguments=[
            "--baseline-model", "seasonal-naive-7d",
            "--baseline-model-prod", "v23",
            "--metrics", "smape,mase,quantile_loss,p10_coverage,p90_coverage,promo_window_smape",
            "--n-folds", "12",
        ],
    )

    # Step 6: Slice analysis (tier x event x cold-start x category)
    slice_analysis = ProcessingStep(
        name="TFTSliceAnalysis",
        processor=ScriptProcessor(
            image_uri="763104351884.dkr.ecr.ap-northeast-1.amazonaws.com/pytorch-training:2.1.0-cpu-py310",
            role=role,
            instance_type="ml.m5.4xlarge",
            instance_count=1,
            command=["python3"],
        ),
        code="src/slice_analysis.py",
        depends_on=[backtest],
        job_arguments=[
            "--slices", "sku_tier,category,format,is_cold_start,is_promo_day,is_anime_air_window,country_jp_us",
            "--regression-sigma-floor", "1.5",
        ],
    )

    # Step 7: Calibration audit (P10/P50/P90 coverage)
    calibration = ProcessingStep(
        name="TFTCalibrationAudit",
        processor=ScriptProcessor(
            image_uri="763104351884.dkr.ecr.ap-northeast-1.amazonaws.com/pytorch-training:2.1.0-cpu-py310",
            role=role,
            instance_type="ml.m5.4xlarge",
            instance_count=1,
            command=["python3"],
        ),
        code="src/calibration_audit.py",
        depends_on=[slice_analysis],
        job_arguments=[
            "--p10-coverage-bounds", "0.08,0.12",
            "--p90-coverage-bounds", "0.88,0.92",
            "--per-tier-calibration", "true",
        ],
    )

    # Stage 1 offline gate: sMAPE read from the backtest metrics JSON via JsonGet
    promote_gate = ConditionStep(
        name="TFTOfflineGate",
        conditions=[
            ConditionLessThanOrEqualTo(
                left=JsonGet(
                    step_name=backtest.name,
                    property_file=backtest_metrics,
                    json_path="smape",
                ),
                right=0.14,
            ),
        ],
        if_steps=[register_model_step()],
        else_steps=[abort_step()],
        depends_on=[calibration],
    )

    return Pipeline(
        name="TFTDemandForecastingPipeline",
        parameters=[train_mode, forecast_origin, feature_schema],
        steps=[data_val, feature_mat, split, train, backtest, slice_analysis, calibration, promote_gate],
    )

3. Rolling-Origin Walk-Forward Backtest (CRITICAL)

This is the single most-important section in this story. A forecasting model evaluated with random train/test split is invalid. The split must be time-causal, period.

flowchart TB
    subgraph WRONG[WRONG - Random Split]
        W1[Day 1]:::train
        W2[Day 2]:::test
        W3[Day 3]:::train
        W4[Day 4]:::train
        W5[Day 5]:::test
        W6[Day 6]:::train
        W7[Day 7]:::test
        W8[Day 8]:::train
        W9[...]:::train
    end

    subgraph RIGHT[RIGHT - Rolling-Origin Walk-Forward]
        F1[Fold 1: train days 1-730, test days 731-737]
        F2[Fold 2: train days 1-737, test days 738-744]
        F3[Fold 3: train days 1-744, test days 745-751]
        F4[Fold 4-12: continue advancing 7-day origin]
    end

    classDef train fill:#9cf,stroke:#333
    classDef test fill:#fd2,stroke:#333
    style F1 fill:#dcfce7,stroke:#166534
    style F2 fill:#dcfce7,stroke:#166534
    style F3 fill:#dcfce7,stroke:#166534
    style F4 fill:#dcfce7,stroke:#166534

The random split on the left is wrong because Day 3 is in train and Day 5 is in test — the model has seen "the future" relative to Day 5 during training, contaminating the evaluation. This is a temporal leak that inflates offline metrics by 30–60% on this workload (measured during pipeline development; the original prototype reported 6% sMAPE on random split, 19% on walk-forward — the gap is the leak).

The rolling-origin walk-forward on the right is the only correct evaluation. Each fold: 1. Trains on all data strictly before the fold's forecast_origin. 2. Predicts the next 7 days. 3. Evaluates predictions against held-out actuals at horizons 1–7. 4. Advances the origin by 7 days and repeats.

12 folds × 7 horizons = 84 days of backtest coverage. Final reported metrics are weighted averages across all folds, with weights proportional to the recency of the fold (recent folds weighted more — they reflect current regime).

# walk_forward_split.py
from dataclasses import dataclass

import pandas as pd


@dataclass
class Fold:
    fold_id: int
    train_range: tuple[int, int]    # [start_day, end_day)
    test_range: tuple[int, int]     # [start_day, end_day)
    forecast_origin: int


def rolling_origin_walk_forward(
    facts: pd.DataFrame,
    n_folds: int = 12,
    initial_train_days: int = 730,
    step_days: int = 7,
    horizon_days: int = 7,
    gap_days: int = 0,
) -> list[Fold]:
    """Generate n_folds time-causal training/test pairs.

    Fold k:
      train: [day 0, day initial_train_days + k*step_days)
      gap:   [day train_end, day train_end + gap_days)
      test:  [day train_end + gap_days, day train_end + gap_days + horizon_days)

    The gap is 0 by default because forecasting at horizon 1 means the
    test starts the day AFTER training ends. Some setups (e.g., when
    daily actuals lag by 1 day in the warehouse) require gap_days = 1
    to mirror serving lag.
    """
    folds = []
    max_day = facts["day"].max()
    for k in range(n_folds):
        train_end = initial_train_days + k * step_days
        test_start = train_end + gap_days
        test_end = test_start + horizon_days
        if test_end > max_day:
            break
        folds.append(Fold(
            fold_id=k,
            train_range=(0, train_end),
            test_range=(test_start, test_end),
            forecast_origin=test_start - 1,
        ))
    return folds

The gap_days parameter exists because in production, the daily actuals from the order pipeline have a 1-day reporting lag — when the forecast runs at 02:00 JST on day D, the latest fully-realized actuals are for day D-2, not D-1. If backtests are run with gap_days=0 but production has a 1-day lag, the offline evaluation is easier than production and offline-online correlation breaks. The serving lag is read from the feature catalog and the backtest mirrors it.
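
As a reference for the fold aggregation described above, here is a minimal sketch of the recency-weighted average; the linear weight ramp is an assumed choice, and the pipeline's exact scheme is not specified here.

# Sketch: recency-weighted aggregation of per-fold backtest metrics.
import numpy as np


def aggregate_fold_metric(per_fold_values: list[float]) -> float:
    """Weighted mean with fold 0 = oldest origin and the last fold = most recent;
    weights grow linearly with recency and are normalized to sum to 1."""
    n = len(per_fold_values)
    weights = np.arange(1, n + 1, dtype=float)
    weights /= weights.sum()
    return float(np.dot(weights, np.asarray(per_fold_values)))

With 12 folds this gives the newest fold about 15% of the total weight and the oldest about 1%.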

4. Forecasting Metrics

Forecasting has its own metric suite that does not generalize from classification:

sMAPE (Symmetric Mean Absolute Percentage Error) — primary metric:

sMAPE = (1/N) × Σ_i  2 × |y_i - ŷ_i|  /  (|y_i| + |ŷ_i| + ε)

The symmetric denominator avoids MAPE's tendency to blow up when the actual value is small. ε = 1.0 prevents division by zero on long-tail SKUs with zero-demand days. Target ≤ 14%.

MASE (Mean Absolute Scaled Error) — naive-baseline comparison:

MASE = MAE(model) / MAE(seasonal_naive_baseline)

where seasonal_naive(t+h) = y(t+h-7)   (same day last week)

A MASE of 1.0 means the model is exactly as good as predicting "this Tuesday will be like last Tuesday." Target ≤ 1.0 — the model must beat seasonal-naive on every SKU tier.

Quantile loss (pinball loss) — for P10/P50/P90 calibration:

ρ_τ(y, ŷ_τ) = max(τ × (y - ŷ_τ), (τ - 1) × (y - ŷ_τ))

QuantileLoss = Σ_τ Σ_i ρ_τ(y_i, ŷ_τ,i)   for τ ∈ {0.1, 0.5, 0.9}

Pinball loss is the proper scoring rule for quantile regression. Per-quantile loss is reported and tracked over time; if P10 loss grows while P50 loss stays flat, the lower quantile has decalibrated.

P10/P90 empirical coverage — calibration audit:

P10_coverage = (1/N) × Σ_i  𝟙[y_i ≤ P10_i]    target ∈ [8%, 12%]
P90_coverage = (1/N) × Σ_i  𝟙[y_i ≤ P90_i]    target ∈ [88%, 92%]

A well-calibrated model has empirical coverage matching the nominal quantile within ±2%. If P10 coverage is 4% (model is overconfident on the low side) the chatbot's "running low" signal is wrong — it predicts stockouts that don't happen.

Promo-window sMAPE — workload-specific slice metric:

promo_window_sMAPE = sMAPE restricted to days where is_promo_day=1
                    OR is_anime_air_window=1
                    OR is_manga_volume_release_day=1

Per the Ground-Truth-Evolution drift counterpart, the prior-generation Prophet stack was 4–10× off on event days because events were filtered out of training. This metric explicitly forces the evaluation to include event days; target ≤ 25%.
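
These definitions translate directly into a few lines of NumPy. The sketch below is an assumed reference implementation, not the exact code in backtest_eval.py:

import numpy as np


def smape(y: np.ndarray, y_hat: np.ndarray, eps: float = 1.0) -> float:
    """Symmetric MAPE; eps guards the zero-demand days on long-tail SKUs."""
    return float(np.mean(2.0 * np.abs(y - y_hat) / (np.abs(y) + np.abs(y_hat) + eps)))


def mase(y: np.ndarray, y_hat: np.ndarray, y_seasonal_naive: np.ndarray) -> float:
    """MAE of the model scaled by MAE of the same-day-last-week baseline."""
    return float(np.mean(np.abs(y - y_hat)) / np.mean(np.abs(y - y_seasonal_naive)))


def pinball_loss(y: np.ndarray, y_hat_tau: np.ndarray, tau: float) -> float:
    """Quantile (pinball) loss at a single quantile level tau."""
    diff = y - y_hat_tau
    return float(np.mean(np.maximum(tau * diff, (tau - 1.0) * diff)))


def empirical_coverage(y: np.ndarray, y_hat_tau: np.ndarray) -> float:
    """Fraction of actuals at or below the predicted quantile."""
    return float(np.mean(y <= y_hat_tau))

The promo-window slice metric is then smape() evaluated on the subset of rows where any of the three event flags is set.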

5. Stockout Censoring — Why It Matters

Without censoring correction, stockout days are training-data poison. Consider a hot SKU that consistently sells out by noon:

| Day | Inventory at open | Units sold | True demand | Without correction | With KM correction |
|---|---|---|---|---|---|
| Mon | 50 | 50 (OOS by noon) | ~85 | 50 | 78 |
| Tue | 50 | 50 (OOS by noon) | ~90 | 50 | 81 |
| Wed | 100 | 92 | 92 (uncensored) | 92 | 92 |
| Thu | 50 | 50 (OOS by noon) | ~88 | 50 | 80 |
| Fri | 50 | 50 (OOS by noon) | ~95 | 50 | 79 |

The "Without correction" column trains the model to predict demand ≈ 50 on this SKU forever, regardless of inventory level. The "With KM correction" column lets the model learn the true demand distribution. On the canary, the censoring correction reduced the under-forecast on chronically-stocked-out hot SKUs by 38%.

The KM-style adjustment uses per-SKU-tier survival functions because hot SKUs (top 5% by 90-day demand) have very different demand distributions from long-tail SKUs. Estimating one global survival function would over-shrink hot SKUs and under-correct long-tail ones.
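
For reference, here is a minimal sketch of the per-tier survival estimation that _estimate_km_survival_per_tier performs. The plain empirical step-function shown is an assumption; it is fit on non-censored days only, per the feature-pipeline docstring above.

import numpy as np
import pandas as pd


def estimate_survival_per_tier(uncensored: pd.DataFrame) -> dict:
    """Empirical S(d) = P(demand > d) per SKU tier, estimated on non-censored days.

    A plain empirical survival curve rather than a full Kaplan-Meier
    product-limit fit, since censored rows are excluded upstream.
    """
    survival_fns = {}
    for tier, grp in uncensored.groupby("sku_tier"):
        demands = np.sort(grp["units_sold"].to_numpy())
        n = len(demands)

        def s(d, demands=demands, n=n):
            # fraction of observed demands strictly greater than d
            return float(n - np.searchsorted(demands, d, side="right")) / n

        survival_fns[tier] = s
    return survival_fns

Each returned callable is what _conditional_expectation() consumes for its tier.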

6. Cold-Start Hierarchical Priors

For SKUs with < 30 days of history (typically pre-orders for upcoming releases, or newly-listed back-catalog), the past-observed channel is mostly missing. The hierarchical prior fills it:

flowchart TB
    SKU[New SKU<br/>n_days=12]
    L1[Level 1: SKU's own history<br/>weight = 12/30 = 0.40]
    L2["Level 2: brand x format prior<br/>weight = 0.6 x (1 - 0.40) = 0.36"]
    L3["Level 3: category x format prior<br/>weight = 0.4 x (1 - 0.40) = 0.24"]
    OUT[Final past-observed features<br/>= 0.40 x L1 + 0.36 x L2 + 0.24 x L3]

    SKU --> L1
    SKU --> L2
    SKU --> L3
    L1 --> OUT
    L2 --> OUT
    L3 --> OUT

    style L1 fill:#dcfce7,stroke:#166534
    style L2 fill:#fef3c7,stroke:#92400e
    style L3 fill:#dbeafe,stroke:#1e40af

The forecast carries an is_cold_start=true flag in its output, and the chatbot's order-inventory MCP wraps the response in hedged language: "we expect this to be available, though demand for new releases like this can be uncertain." The recommendation reranker (US-MLE-06) treats cold-start forecasts as lower-weight signal in stock-aware reranking. Cold-start is a known-uncertainty mode, not a hidden one.

The shrinkage weights (0.6 / 0.4 between brand-level and category-level) are themselves learned via the calibration audit step — brand information is more predictive than category for established publishers, and the weights can flip for less-known brands. This is per-tier and re-fit weekly.
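
A sketch of how the brand-vs-category mixing weight could be re-fit from the cold-start backtest slice. The grid search shown is illustrative; the calibration-audit step's actual fitting procedure is not specified here.

import numpy as np


def fit_brand_weight(
    l1_weight: np.ndarray,      # per-SKU own-history weight, n_days / 30
    own_mean: np.ndarray,       # SKU's own (sparse) rolling-28d demand mean
    brand_prior: np.ndarray,    # brand x format prior mean
    cat_prior: np.ndarray,      # category x format prior mean
    realized: np.ndarray,       # realized demand on the cold-start slice
) -> float:
    """Grid-search the brand share w in [0, 1] and return the MAE-minimizing w.

    The blend matches hierarchical_prior_for_cold_start() above:
    l1*own + (1 - l1) * (w*brand + (1 - w)*cat); w = 0.6 today.
    """
    grid = np.linspace(0.0, 1.0, 21)
    maes = []
    for w in grid:
        blended = l1_weight * own_mean + (1 - l1_weight) * (w * brand_prior + (1 - w) * cat_prior)
        maes.append(np.mean(np.abs(blended - realized)))
    return float(grid[int(np.argmin(maes))])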

7. Promo-Event Handling and Concept Drift

Marketing's known-promo-calendar feeds the known-future channel directly: every promo with discount, start, end, and SKU-filter scope is encoded as is_promo_day and promo_discount_pct features. The TFT learns the multiplicative relationship between discount depth and demand uplift per SKU tier per category.

Unknown promos are the dominant concept-drift source. Cases:

  1. Unannounced flash promos: marketing fires a promo without 24-hour-ahead calendar entry. The forecast made the prior night does not see it. Mitigation: real-time anomaly detector watches actuals vs P90; spikes flagged within 4 hours and incremental retrain re-runs same-day on the affected category.
  2. External viral events (e.g., a YouTube influencer reviews a manga, drives unforeseen demand). Not in any calendar. Mitigation: same as above; the post-event review captures it as a labeled event for next week's full retrain.
  3. Anime announcement gaps: a streaming service announces an adaptation 48h before premiere; the calendar only has 90-day-ahead forecasted entries. Mitigation: bilingual JP/EN content team has a fast-path manual entry; the calendar pipeline supports same-day inserts that trigger a partial-incremental refit.

The promo-event drift detection is described in §9 below and is the primary cross-reference to Ground-Truth-Evolution/ML-Scenarios/05-demand-forecasting-promo-distortion.md.
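
The flash-promo mitigation in case 1 hinges on the actuals-vs-P90 watcher; a minimal sketch of that check follows. Field names and the category-level trigger rule (a quarter of SKUs running above P90) are assumptions.

import pandas as pd


def flag_unforecast_spikes(intraday_actuals: pd.DataFrame,
                           todays_forecast: pd.DataFrame,
                           min_skus_per_category: int = 50) -> list[str]:
    """Return categories whose intraday actuals are running above the P90
    forecast on a large share of SKUs, signalling an unannounced promo or
    viral event.

    intraday_actuals: sku_id, category, units_sold_so_far (projected to full day)
    todays_forecast:  sku_id, p90 (horizon-1 value from last night's batch)
    """
    joined = intraday_actuals.merge(todays_forecast, on="sku_id")
    joined["exceeds_p90"] = joined["units_sold_so_far"] > joined["p90"]
    by_cat = joined.groupby("category")["exceeds_p90"].agg(["mean", "size"])
    hot = by_cat[(by_cat["mean"] > 0.25) & (by_cat["size"] >= min_skus_per_category)]
    return hot.index.tolist()

Flagged categories feed the same-day incremental refit described above.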

8. Batch Transform Serving

The daily batch transform on ml.g5.12xlarge × 4 shards SKUs by hash modulo 4. Each instance processes ~375K SKUs in ~75 minutes. Output is a single Parquet partition per shard, written to s3://manga-ml-forecast-apne1/sku_demand_forecast/forecast_date=2026-04-27/shard=0..3/.

# batch_transform.py
from sagemaker.transformer import Transformer

transformer = Transformer(
    model_name="tft-forecaster-v23",
    instance_count=4,
    instance_type="ml.g5.12xlarge",
    strategy="MultiRecord",
    max_payload=100,
    max_concurrent_transforms=8,
    accept="application/x-parquet",
    output_path="s3://manga-ml-forecast-apne1/sku_demand_forecast/forecast_date=2026-04-27/",
    assemble_with="Line",
    env={
        "FORECAST_ORIGIN": "2026-04-27",
        "HORIZONS": "1,2,3,4,5,6,7",
        "QUANTILES": "0.1,0.5,0.9",
        "EMIT_COLD_START_FLAG": "true",
    },
)

transformer.transform(
    data="s3://manga-ml-feature-snapshots-apne1/forecast_origin=2026-04-27/",
    content_type="application/x-parquet",
    split_type="None",
    join_source="Input",
)

After the batch transform completes, a Lambda step reads the four shard outputs and writes them to the MangaSkuForecast DynamoDB table with (forecast_date, sku_id) as the composite primary key. The DynamoDB writer uses BatchWriteItem with 25-item batches and parallelism = 16; full 1.5M item replace takes ~3 minutes. The chatbot's order-inventory MCP reads from DynamoDB with point-in-time queries:

# online_lookup.py (called by order-inventory MCP)
from datetime import date

import boto3

ddb = boto3.client("dynamodb", region_name="ap-northeast-1")
# ForecastResponse is the MCP's response dataclass, defined in the MCP package


def get_demand_forecast(sku_id: str) -> ForecastResponse:
    today = date.today()  # JST
    item = ddb.get_item(
        TableName="MangaSkuForecast",
        Key={
            "forecast_date": {"S": today.isoformat()},
            "sku_id": {"S": sku_id},
        },
        ProjectionExpression="p10, p50, p90, is_cold_start, model_version",
    ).get("Item")
    if not item:
        return ForecastResponse.unavailable(sku_id, reason="no_forecast")
    return ForecastResponse(
        sku_id=sku_id,
        p10=[float(x["N"]) for x in item["p10"]["L"]],  # 7 horizons
        p50=[float(x["N"]) for x in item["p50"]["L"]],
        p90=[float(x["N"]) for x in item["p90"]["L"]],
        is_cold_start=item["is_cold_start"]["BOOL"],
        model_version=item["model_version"]["S"],
    )

p99 latency on the DynamoDB GetItem is ~6 ms in steady state, well below the 8 ms contract.
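
For completeness, the shard-to-DynamoDB import described above could look roughly like this sketch. boto3's batch_writer handles the 25-item BatchWriteItem grouping and retries; the item layout mirrors what the reader above expects, but the exact Lambda code is not shown in this story.

from decimal import Decimal

import boto3
import pandas as pd


def import_shard_to_dynamodb(shard_parquet_uri: str, forecast_date: str) -> None:
    """Write one shard's forecast rows into MangaSkuForecast (sketch only)."""
    table = boto3.resource("dynamodb", region_name="ap-northeast-1").Table("MangaSkuForecast")
    df = pd.read_parquet(shard_parquet_uri)
    # batch_writer groups puts into 25-item BatchWriteItem calls and retries
    # unprocessed items; the production Lambda fans this out across 16 workers.
    with table.batch_writer() as writer:
        for row in df.itertuples():
            writer.put_item(Item={
                "forecast_date": forecast_date,              # partition key
                "sku_id": row.sku_id,                        # sort key
                "p10": [Decimal(str(v)) for v in row.p10],   # 7 horizon values
                "p50": [Decimal(str(v)) for v in row.p50],
                "p90": [Decimal(str(v)) for v in row.p90],
                "is_cold_start": bool(row.is_cold_start),
                "model_version": row.model_version,
            })

Storing the quantile values as Decimals keeps them DynamoDB numbers, which is what the reader's item["p10"]["L"] / {"N": ...} parsing assumes.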

9. Drift Detection

Drift detection consumes the sku_demand_forecast table (predictions) joined to the fulfilled_demand_actuals table (realized demand) once the actuals catch up. Per primitive §6.1 with forecasting-specific extensions:

| Drift Kind | Detector | Cadence | Threshold |
|---|---|---|---|
| Input drift | PSI on top-30 features (price, on-hand, lag-7, promo flags) | Daily | PSI > 0.2 sustained 7d |
| Label drift | KS on per-SKU-tier daily demand distribution | Daily | KS > 0.15 sustained 7d |
| Prediction drift | KS on P50 distribution per category | Daily | KS > 0.15 sustained 7d |
| Concept drift | Rolling 28-day sMAPE vs reference sMAPE | Daily | Δ sMAPE > 0.03 sustained 7d |
| Promo-event drift | Per-event-class realized lift vs predicted lift | Per event | > 2σ deviation triggers alarm within 48h |
| Quantile decalibration | Empirical P10/P90 coverage vs nominal | Daily | Outside [8%, 12%] / [88%, 92%] for 3d |

The promo-event drift detector is the workload-specific addition. Each event class (Comic-Con week, anime-airing window, manga volume release week, marketing flash promo) has its own historical lift distribution. When a new event of that class realizes, the detector compares predicted vs realized lift. A > 2σ deviation flags the event for the post-event review process described in Ground-Truth-Evolution/ML-Scenarios/05-demand-forecasting-promo-distortion.md.

# promo_event_drift.py
def detect_promo_event_drift(
    event: PromoEvent,
    predicted_lift_per_sku: dict[str, float],
    realized_actuals: pd.DataFrame,
    historical_lift_distribution: dict[str, ScipyDist],
) -> PromoEventDriftReport:
    """For an event that has just concluded, compare predicted vs realized
    lift per SKU and flag deviations > 2σ vs historical distribution."""
    flagged_skus = []
    for sku_id, predicted_lift in predicted_lift_per_sku.items():
        realized_lift = compute_realized_lift(
            sku_id=sku_id,
            event_window=(event.start, event.end),
            actuals=realized_actuals,
        )
        dist = historical_lift_distribution[event.event_class]
        # deviation of realized vs predicted lift, in units of the event class's
        # historical lift std (per the > 2-sigma acceptance criterion)
        sigma_deviation = abs(realized_lift - predicted_lift) / dist.std()
            flagged_skus.append(SkuFlag(
                sku_id=sku_id,
                event_id=event.event_id,
                predicted_lift=predicted_lift,
                realized_lift=realized_lift,
                historical_mean=dist.mean(),
                historical_std=dist.std(),
                sigma_deviation=sigma_deviation,
            ))
    return PromoEventDriftReport(
        event=event,
        flagged_skus=flagged_skus,
        triggers_retrain=len(flagged_skus) >= 100,  # 100+ SKU flags triggers retrain
    )

10. Multilingual Handling (JP/EN-Specific)

Forecast values themselves are language-agnostic — they are numbers, not text. The bilingual concern is in two places:

Release calendar inputs. The anime release calendar and manga volume calendar arrive in both JP (from publishers like Shueisha, Kodansha, Shogakukan) and EN (from translators like Viz, Kodansha USA). The same release event often appears in both feeds with different metadata (JP feed: original air date in JST; EN feed: translation publication date in PST, sometimes weeks later). The feature pipeline reconciles these with a release_event_id join key maintained by the content team. A misalignment (e.g., a JP-only release that has no EN counterpart) does not break the forecast — the JP-only event is still encoded in the known-future channel, just with en_translation_available=false.

Chatbot dialog wrapper. The chatbot's response that uses the forecast is bilingual. The order-inventory MCP receives the forecast (numbers) and the user's locale, then renders the response:

  • EN: "We expect this to ship in 3 days based on current demand."
  • JP: "現在の需要状況から、3日以内の発送を見込んでいます。"

The MCP also handles uncertainty wording differently per locale — JP responses lean more cautious (Japanese e-commerce convention favors hedged delivery promises), so the is_cold_start=true flag triggers a more conservative wording in JP than in EN. This is a downstream rendering concern, not a forecast-model concern, but it is documented here because the forecast metadata explicitly carries is_cold_start and confidence_band_p10_p90 to support locale-specific rendering.
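
A small sketch of that locale-aware rendering; the copy strings are illustrative and owned by the dialog team. The point is structural: the day count is interpolated from the forecast, never generated by the LLM.

# Sketch only: illustrative templates, not the MCP's production copy.
TEMPLATES = {
    ("en", False): "We expect this to ship in {days} days based on current demand.",
    ("en", True):  "We expect this to ship in about {days} days, though demand for new releases can be uncertain.",
    ("ja", False): "現在の需要状況から、{days}日以内の発送を見込んでいます。",
    ("ja", True):  "現在の需要状況から、{days}日前後での発送を見込んでいますが、前後する可能性があります。",
}


def render_availability(locale: str, days: int, is_cold_start: bool) -> str:
    """Interpolate the forecast-derived day count into a fixed, locale-specific template."""
    return TEMPLATES[(locale, is_cold_start)].format(days=days)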


Monitoring & Metrics

| Category | Metric | Target | Alarm Threshold |
|---|---|---|---|
| Online — Latency | DynamoDB GetItem p99 | ≤ 8 ms | > 15 ms for 5 min |
| Online — Availability | Forecast availability for active SKUs | ≥ 99.5% | < 98% for 1 h |
| Quality — Aggregate | sMAPE (rolling 28d) | ≤ 14% | > 18% for 7 d |
| Quality — Aggregate | MASE vs seasonal-naive | ≤ 1.0 | > 1.1 for 7 d |
| Quality — Aggregate | Pinball loss per quantile | stable trend | +20% week-over-week |
| Quality — Calibration | P10 empirical coverage | 8%–12% | < 6% or > 14% for 3 d |
| Quality — Calibration | P90 empirical coverage | 88%–92% | < 86% or > 94% for 3 d |
| Quality — Slice | sMAPE on hot-tier SKUs | ≤ 12% | > 16% for 7 d |
| Quality — Slice | sMAPE on long-tail SKUs | ≤ 30% | > 40% for 7 d |
| Quality — Slice | sMAPE on cold-start SKUs | ≤ 35% | > 45% for 7 d |
| Quality — Slice | sMAPE on promo-window days | ≤ 25% | > 35% for 7 d |
| Quality — Slice | sMAPE on anime-air windows | ≤ 28% | > 40% for 7 d |
| Drift | Concept drift Δ sMAPE | < 0.03 | > 0.03 for 7 d |
| Drift | Promo-event 2σ deviation count | < 20/day | > 100 in 24 h |
| Pipeline | Daily batch wall-clock | ≤ 90 min | > 120 min |
| Pipeline | Daily run success rate (30 d) | ≥ 99% | < 98% for 7 d |
| Pipeline | Censoring rate | < 8% | > 15% (data integrity issue) |
| Cost | Daily batch $/run | ≤ $35 | > $50 |
| Cost | Weekly full retrain $/run | ≤ $80 | > $120 |
| Cost | DynamoDB read $/day | ≤ $4 | > $8 |

Risks & Mitigations

| Risk | Impact | Mitigation |
|---|---|---|
| Random train/test split sneaks back in via PR | Offline metrics inflated 30–60%; promotions go through that should fail | Walk-forward split is enforced as a ProcessingStep precondition; CI lint checks any train_test_split import in src/; quarterly red-team review of split logic |
| Stockout-censoring under-corrects on hot SKUs | Forecast continues under-shooting hot SKU demand; chronic stockouts amplify | Per-tier KM survival; hot-tier survival re-estimated daily; quarterly audit of censored-day correction magnitude |
| Promo calendar arrives late or wrong | Forecast misses the spike; chatbot says "available in 3 days" during stockout | Real-time anomaly detector watches actuals vs P90; same-day incremental refit on affected category; ops fast-path calendar entry |
| Anime release calendar feed JP/EN mismatch | One feed has the event, the other doesn't; SKU-level forecast inconsistent | release_event_id join key; daily reconciliation report; missing-translation flagged but not blocking |
| TFT spot reclaim during weekly full retrain | Wasted compute; weekly cycle slips | Checkpoints every 5 epochs to S3; max_wait=21600s; on-demand fallback on > 3 reclaims; weekly retrain has 6h slack vs Sunday 02:00 start |
| Incremental refit drifts quality vs full retrain | Daily incremental quality silently degrades | Each Sunday's full retrain is the reset point; incremental quality compared to "full retrain on same data" weekly; if delta > 1% sMAPE, incremental is paused and daily falls back to full |
| Cold-start prior is wrong for novel category (e.g., Korean manhwa launch) | Cold-start forecast biased; chatbot mis-estimates new category availability | Cold-start prior re-estimated weekly per category; coordinated with US-MLE-05 embedding adapter retrain on category expansion |
| Concept drift on supply-chain disruption | Models pre-disruption demand pattern; mis-forecasts post-disruption | Concept-drift detector daily; coordinated supply-chain Lambda flips a supply_constrained=true feature when triggered; censored-regression mode activated (per drift counterpart §architect-level escalation 2) |
| Forecast goes stale (e.g., DDB write fails) | Chatbot reads yesterday's forecast as today's | DynamoDB items carry forecast_date; reader rejects items older than 36 hours; alarm if any SKU's forecast is > 36h stale |
| LLM in chatbot hallucinates "available in N days" with N inconsistent with forecast | Customer trust regression | Order-inventory MCP enforces grounded numerical output; the forecast value is interpolated into the response template, not generated by the LLM |
| Promo-event drift detected but no retrain budget that day | Drift accumulates | Drift hub enqueues retrain request; if pipeline busy, queues for next slot; SEV-3 page if queue depth > 3 |

Deep Dive — Why This Works at Amazon-Scale on the Manga Workload

The MangaAssist demand forecasting model is medium-sized (TFT, ~30M params) and serves the warm path of the chatbot rather than the hot path. Four workload properties make the design above the right shape rather than the obvious "just fit Prophet daily":

Workload property 1: 1.5M SKUs with extreme demand heterogeneity. The top 5% of SKUs (~75K) account for ~70% of revenue; the long tail of ~1.4M SKUs has zero-demand days as the median daily observation. A single global model would either over-fit the head (under-serving the tail) or smooth the head's spikes (mis-forecasting hot SKUs). The TFT's static-covariate channel, combined with per-tier evaluation slicing, lets one model serve the full distribution while the slice gate enforces no-regression on each tier independently. The hot-tier sMAPE target (≤12%) is tighter than the long-tail target (≤30%) because hot-tier mistakes have higher revenue impact.

Workload property 2: calendar-driven demand with bilingual release schedules. Anime airings, manga volume releases, and marketing promos drive the spikes that matter most for revenue and CSAT. The prior-generation Prophet/LightGBM stack treated these as anomalies and filtered them out — the architectural mistake documented in Ground-Truth-Evolution/ML-Scenarios/05-demand-forecasting-promo-distortion.md. TFT's known-future-input channel encodes these calendars directly, and the model attends to them via TFT's variable-selection network. The bilingual reconciliation (release_event_id join across JP and EN feeds) is the workload-specific data-integration cost; without it, half the predictive signal would be lost.

Workload property 3: stockout-censored observations. On chronically-stocked-out hot SKUs, observed demand is bounded by inventory. A naive model that ignores this learns "demand on this SKU is roughly = inventory level" and never recommends increasing supply, perpetuating the stockout cycle. The KM-style censoring correction is non-negotiable for this workload. The per-tier survival functions ensure the correction is calibrated to each tier's natural demand distribution.

Workload property 4: warm-path serving with strong daily cadence. Unlike US-MLE-01 (real-time intent classification at ~80 RPS) or US-MLE-06 (recommendation reranking on the hot path), demand forecasting feeds dialog wording and stock-aware reranking that tolerate a 1-day staleness. This unlocks SageMaker batch transform on large GPU instances (cost-efficient at scale) instead of real-time endpoints (latency-efficient but expensive). The DynamoDB cache fills the role of a "fast static lookup" without paying real-time inference cost. Daily cadence also means the freshness window is naturally aligned with how fast the underlying signal evolves — promo calendars rarely change intra-day, so a daily refresh captures essentially all signal.

These four properties together explain why TFT-with-known-future + KM-censoring + hierarchical-cold-start + walk-forward-backtest is the right shape. A naïve "Prophet per SKU, daily refresh" design fails on the first three: no promo features (property 2), no censoring (property 3), no cold-start handling (property 1's long tail). On top of that, the original Prophet stack used random splits internally, which is the wrong evaluation methodology regardless of workload shape.


Real-World Validation

Industry analogues. Amazon's retail demand forecasting team published on multi-horizon TFT-style forecasters at SKU scale; their setup uses similar walk-forward backtests with 12+ folds and explicit known-future channels for promotions and holidays. Uber's Michelangelo platform documents a parallel pattern for marketplace-demand forecasting: per-tier evaluation, censored-regression for rider-cancelation effects, and hierarchical priors for cold-start zones. Google Brain's TFT paper itself uses retail and traffic datasets with rolling-origin evaluation; the paper explicitly warns against random splits as evaluation methodology. The cross-encoder reranker community (US-MLE-02) draws on the same evaluation discipline — time-causal splits, per-slice metrics, calibration audits.

Math validation — quantile loss intuition. For τ = 0.5 (median), pinball loss reduces to 0.5 × |y - ŷ|, i.e., MAE / 2. For τ = 0.9, pinball loss penalizes under-prediction (y > ŷ) at 0.9× the residual and over-prediction at 0.1× — making the loss minimizer the 90th percentile of the conditional distribution. This is why optimizing pinball loss at τ = 0.9 produces a P90 estimate, and why empirical P90 coverage should match 90% if the model is calibrated. A common bug is to optimize MSE and then post-hoc scale the standard deviation to "produce P90"; this is wrong because the conditional distribution is heteroscedastic and right-skewed (count data) — direct pinball-loss optimization is the correct approach.

Math validation — cost. Daily batch on ml.g5.12xlarge × 4 at $7.09/hr × 4 instances × 1.5h = $42.54/day. Adding the daily incremental on ml.g5.4xlarge at $1.83/hr × 0.5h = $0.92, daily compute is ~$43.50. Weekly full retrain on ml.g5.12xlarge × 1 at $7.09/hr × 4h = $28.36/week. Monthly compute: ~$1,305 (daily) + $113 (weekly) = $1,418/month. DynamoDB cost: 1.5M items × 120 bytes × $0.25/GB-month = ~$0.05/month storage + ~$3/day reads at chatbot read volume = ~$95/month. Total: ~$1,513/month. This is ~10% over the $35/run × 30 + $80/week × 4 = $1,370 line-item budget, with the overage coming mostly from DynamoDB reads and S3 transfer, and well within FinOps tolerance.

Math validation — calibration sanity check. With 12 folds × 7 horizons × 1.5M SKUs = 126M predictions per backtest, the standard error on P10 empirical coverage is approximately sqrt(0.1 × 0.9 / 126M) ≈ 0.000027, or ±0.0027% at 1σ. The acceptance band of [8%, 12%] is therefore enormous relative to the sampling noise floor, so a violation is a real signal of decalibration, not noise.

Math validation — censoring rate. With ~50K stockout events/day across the ~1.5M daily-forecast SKUs, roughly 3.3% of SKU-days are censored. Hot-tier SKUs have a higher censoring rate (~12%) because they are more likely to sell out. The 8% target censoring rate aggregates over all tiers; the 15% abort threshold catches days where a supply-chain disruption causes structural under-supply.


Cross-Story Interactions

| Edge | Direction | Contract | Conflict mode |
|---|---|---|---|
| US-MLE-04 → US-MLE-06 (recommendation) | provides stock-aware signal | P50 forecast and stockout probability per SKU | If US-MLE-04 over-forecasts, US-MLE-06 reranks unavailable SKUs to the top; mitigation: US-MLE-06 reads P10 (pessimistic) for stock-aware filtering, P50 only for ranking among available |
| US-MLE-04 → Chatbot Order-Inventory MCP | provides "available in N days" signal | DynamoDB lookup with is_cold_start flag | If forecast missing, MCP falls back to "we don't have an estimate" instead of fabricating one; never let the LLM generate a number not grounded in the forecast |
| US-MLE-04 ← US-MLE-07 (spam classifier) | clean review-velocity signal | review_count_delta feature filtered by spam classifier | If spam classifier over-flags reviews, review-velocity signal under-counts; mitigation: spam-recall metric tracked monthly |
| US-MLE-04 ← US-MLE-05 (embedding adapter) | category expansion (manhwa, manhua) | new category triggers cold-start prior re-fit | If US-MLE-05 promotes a new category before US-MLE-04 cold-start prior is re-fit, new SKUs get global mean instead of category-aware prior; coordinated change request |
| US-MLE-04 ↔ Cost-Optimization US-04 (compute) | compute cost optimization | daily batch instance type and count | US-04 may push for smaller instance types to save cost; US-MLE-04's 90-min wall-clock contract is the floor |
| US-MLE-04 ↔ Ground-Truth-Evolution ML-05 | drift counterpart | promo-event drift documented in ML-05; this story implements detection and mitigation | Coordinated planning between ML-eng and ops on event-class taxonomy |
| US-MLE-04 → RAG-MCP-Integration order-inventory MCP | API contract for forecast lookup | DynamoDB schema, fields, fallback behavior | A forecast schema change requires coordinated MCP-side update; pinned model_version field in response |
| US-MLE-04 ← Marketing Platform | promo calendar feed | 14-day forward calendar, daily refresh | Late-arriving promos trigger same-day incremental refit; > 6h late = SEV-3 |
| US-MLE-04 ← Content Platform | anime + volume calendar | 90-day / 180-day forward, bilingual JP/EN reconciled | release_event_id join key contract; missing-translation tracked but not blocking |

Rollback & Experimentation

Shadow Mode Plan

  • Duration: 7 days minimum. Forecasting models need at least 7 days of shadow because the horizon is 7 days — anything shorter does not let the full forecast cycle complete and validate against actuals.
  • Sample size: 7 days × 1.5M SKUs × 7 horizons = 73.5M shadow predictions. Statistical power is overwhelming for any meaningful effect.
  • Pass criteria: shadow v24 daily-MAE on actuals ≤ v23 daily-MAE × 1.05 averaged across all 7 days; per-SKU-tier sMAPE no worse than a 1.5σ regression on any tier; quantile coverage on shadow within ±3 percentage points of nominal (a minimal gate sketch follows this list).
  • Slice criteria: shadow forecasts on promo-window days, anime-air windows, and cold-start SKUs each tracked separately; any slice regressing > 2σ vs v23 fails shadow.
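
A minimal sketch of the primary shadow gate, assuming per-day MAE series for v23 and v24 and one common sMAPE definition; per-tier handling and the coverage checks are omitted for brevity.

```python
import numpy as np

def smape(y, y_hat):
    """Symmetric MAPE (2|y - y_hat| / (|y| + |y_hat|)), safe for zero-demand days.
    One common definition; the production metric may differ in detail."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    denom = np.abs(y) + np.abs(y_hat)
    ratio = np.divide(2.0 * np.abs(y - y_hat), denom, out=np.zeros_like(denom), where=denom > 0)
    return float(ratio.mean())

def shadow_mae_gate(mae_v24_daily, mae_v23_daily, tolerance=1.05):
    """Primary pass criterion: v24 daily MAE, averaged over the 7 shadow days,
    must not exceed v23 by more than 5%."""
    return float(np.mean(mae_v24_daily)) <= float(np.mean(mae_v23_daily)) * tolerance
```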

Canary Thresholds (Extended Canary, 14d at 10% SKU Sample)

Per primitive §4.3, TFT is on the "weak offline-online correlation" pre-flagged list. The canary:

  • Phase A (10% SKU sample, 14 days): 10% of SKUs (deterministic by hash; see the sketch after this list) get v24 forecasts written to a shadow column in DynamoDB; the chatbot continues to read v23 from the production column. The auto-abort daemon monitors v24 sMAPE on actuals daily; if v24 sMAPE > v23 sMAPE × 1.10 for 3 consecutive days on any SKU tier, the canary aborts.
  • Phase B (full promotion, no further phases): after the 14-day canary passes, the chatbot read path switches to 100% v24 in a single atomic cut-over of which DynamoDB attribute it reads (there is no literal column rename in DynamoDB; the switch is a reader-side configuration flip). Old v23 forecasts are retained for 14 days as the rollback target.
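
A minimal sketch of the deterministic hash sampling used for Phase A; the salt and modulus-100 bucketing are illustrative choices, not a spec.

```python
import hashlib

def in_canary_sample(sku_id: str, sample_pct: int = 10, salt: str = "tft-v24-canary") -> bool:
    """Deterministic SKU sampling: the same SKU lands in (or out of) the canary every day,
    independent of process or host. Salt is a hypothetical namespace for this canary."""
    digest = hashlib.sha256(f"{salt}:{sku_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < sample_pct
```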

The canary differs from US-MLE-01's because there is no real-time per-request canary on this story — the unit of canary is SKU, and we measure forecast quality after actuals arrive. This is the standard pattern for batch-cadence models.

Kill-Switch Flags

  • tft_forecaster_promotion_enabled (default: false; SSM Parameter Store /manga-ml/tft/promotion_enabled) — when false, weekly full-retrain runs but does not register a new candidate.
  • tft_forecaster_canary_pause (default: false) — when true, canary halts at current SKU-sample rate; chatbot continues reading v23.
  • tft_forecaster_use_censoring_correction (default: true) — kill switch for KM-style correction in case the survival estimation has a bug; when false, falls back to raw observed demand. Used only as emergency mitigation.
  • global_ml_freeze (default: false) — overrides the above; applies to all 8 stories. See the README's kill-switch precedence section; a flag-lookup sketch showing this precedence follows this list.
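
A minimal sketch of how a pipeline step might read these flags, assuming boolean values are stored as "true"/"false" strings. The /manga-ml/global_ml_freeze path is a hypothetical placeholder; only the TFT promotion-flag path is specified above.

```python
import boto3

ssm = boto3.client("ssm", region_name="ap-northeast-1")

def flag_enabled(name: str) -> bool:
    """Read a kill-switch flag from SSM Parameter Store (assumes 'true'/'false' string values)."""
    value = ssm.get_parameter(Name=name)["Parameter"]["Value"]
    return value.strip().lower() == "true"

def promotion_allowed() -> bool:
    # global_ml_freeze takes precedence over the story-level flag; the freeze path is illustrative.
    if flag_enabled("/manga-ml/global_ml_freeze"):
        return False
    return flag_enabled("/manga-ml/tft/promotion_enabled")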

Quality Regression Criteria (Hard Rollback)

A canary that satisfies any one of these conditions is automatically rolled back within 4 hours:

  • v24 sMAPE > v23 sMAPE × 1.10 for any 3 consecutive days on any SKU tier (the consecutive-day check is sketched after this list).
  • P10 empirical coverage falls outside [6%, 14%] for any 3 consecutive days.
  • P90 empirical coverage falls outside [86%, 94%] for any 3 consecutive days.
  • Stockout rate on chatbot-handled SKUs increases by > 15% absolute for any 3 consecutive days (online proxy metric).
  • Customer-reported incident on the production forecast that appears traceable to the model (manual override by SRE on-call).
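
All of the automatic criteria above share the same "any 3 consecutive days" shape; a minimal sketch of that check, with the sMAPE condition as the example input.

```python
def consecutive_breach(daily_flags, n: int = 3) -> bool:
    """True when a rollback criterion has been breached on n consecutive days.
    Example input: daily_flags = [smape_v24 > smape_v23 * 1.10 for each canary day]."""
    run = 0
    for breached in daily_flags:
        run = run + 1 if breached else 0
        if run >= n:
            return True
    return False
```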

The rollback is via re-running yesterday's batch transform with v23 model package and re-importing the result to DynamoDB (per primitive §7.3, batch model rollback SLA = 4 hours). The pipeline registers v24 as status: failed_promotion in the registry and does not retry until manual review.


Multi-Reviewer Validation Findings & Resolutions

S1 — Must Fix Before Production

ML Scientist lens: The walk-forward backtest must guard against feature leakage through the known-future channel. If the feature pipeline accidentally reads is_promo_day for the test window from the current (post-event) calendar snapshot, the backtest sees promo information that would not have been available at the forecast origin, making promo-window sMAPE look optimistically low relative to production. Resolution: _read_known_future reads the calendar snapshot pinned to as_of=forecast_origin; a unit test asserts that revising a promo entry after the forecast origin does not change the historical training data. The quarterly leak-detector audit explicitly covers the known-future channel.
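
A self-contained sketch of that leak-guard test, with a toy in-memory calendar store standing in for the real snapshot interface; class and column names are illustrative, not the production API.

```python
from datetime import date
import pandas as pd

class CalendarStore:
    """Tiny in-memory stand-in for the promo-calendar snapshot store (illustrative only)."""
    def __init__(self):
        self.entries = []

    def upsert(self, sku, promo_day, revised_at):
        self.entries.append({"sku": sku, "promo_day": promo_day, "revised_at": revised_at})

    def snapshot(self, as_of):
        """Only entries revised on or before `as_of` are visible at that forecast origin."""
        return pd.DataFrame([e for e in self.entries if e["revised_at"] <= as_of])

def test_known_future_pinned_to_forecast_origin():
    store = CalendarStore()
    origin = date(2024, 6, 1)
    store.upsert("SKU-1", promo_day=date(2024, 6, 3), revised_at=date(2024, 5, 20))

    before = store.snapshot(as_of=origin)
    # A post-origin revision (e.g. marketing extends the promo) must not change the pinned view.
    store.upsert("SKU-1", promo_day=date(2024, 6, 4), revised_at=date(2024, 6, 5))
    after = store.snapshot(as_of=origin)

    pd.testing.assert_frame_equal(before, after)
```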

SRE lens: The 90-minute p99 wall-clock budget exactly fills the 02:00–03:30 window, leaving effectively no slack, and a single spot reclaim can push the run past the window. Resolution: the batch transform uses on-demand instances (not spot); only the weekly full-retrain training step uses spot. The cost increase is ~$8/day, justified by the SLA tightness. Documented in Runbooks/tft-batch-instance-policy.md.

Application Security / Privacy lens: Forecast inputs include per-SKU sales actuals which, at high resolution, can leak commercial information about specific publishers. Resolution: aggregated SKU sales facts are partitioned by publisher in S3 with publisher-specific KMS keys; the training job has access only to its own training-data IAM role; quarterly audit of S3 bucket access logs.

Data Engineering lens: The release_event_id join across JP/EN content feeds is fragile — a publisher-side schema change has caused two outages in the prior year. Resolution: schema-pinning enforced at the Iceberg-table level with version checks at pipeline-step entry; missing-id reconciliation report runs daily; ops team owns the fast-path fix runbook Runbooks/release-event-id-mismatch.md.

S2 — Address Before Scale

FinOps lens: Daily batch is ~$43.50; if SKU count grows to 7M (Korean manhwa expansion + manhua expansion), batch cost projects to ~$80/day. Justified by revenue growth, but the budget needs an early-warning alarm. Resolution: daily_batch_cost CloudWatch metric; alarm at $60/day for 5 consecutive days. FinOps quarterly review.
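
A hedged sketch of that alarm, assuming the cost is published once per day as a custom CloudWatch metric; the namespace, alarm name, and missing-data policy are illustrative, while the metric name, threshold, and five-day evaluation window come from the resolution above.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")

# Alarm on the daily_batch_cost custom metric: > $60/day sustained for 5 consecutive days.
cloudwatch.put_metric_alarm(
    AlarmName="tft-daily-batch-cost-high",        # illustrative name
    Namespace="MangaML/Forecasting",              # illustrative namespace
    MetricName="daily_batch_cost",
    Statistic="Maximum",
    Period=86400,                                 # one datapoint per day
    EvaluationPeriods=5,                          # 5 consecutive days
    Threshold=60.0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",                 # illustrative: a missing cost datapoint is suspicious
)
```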

ML Scientist lens (S2): The hierarchical cold-start prior shrinkage weights (0.6 brand, 0.4 category) are static; they should be data-driven per category. Resolution: weekly weight re-fit step added to the full-retrain pipeline; weights stored in registry alongside model package. Tracked as v2 of the prior.
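
A minimal sketch of the v1 static-weight prior: a 0.6/0.4 blend of brand- and category-level mean daily demand, with a global-mean fallback when a level has no history. Column names and the fallback handling are illustrative.

```python
import pandas as pd

# Static v1 shrinkage weights flagged above; v2 re-fits these per category weekly.
W_BRAND, W_CATEGORY = 0.6, 0.4

def cold_start_prior(history: pd.DataFrame, brand: str, category: str) -> float:
    """Cold-start prior for a SKU with < 30 days of history: a weighted blend of brand-level
    and category-level mean daily demand. Expects illustrative columns: brand, category, units_sold."""
    global_mean = history["units_sold"].mean()
    brand_mean = history.loc[history["brand"] == brand, "units_sold"].mean()
    cat_mean = history.loc[history["category"] == category, "units_sold"].mean()

    # Fall back to the global mean when a level has no observed history.
    brand_mean = global_mean if pd.isna(brand_mean) else brand_mean
    cat_mean = global_mean if pd.isna(cat_mean) else cat_mean
    return W_BRAND * brand_mean + W_CATEGORY * cat_mean
```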

SRE lens (S2): DynamoDB write of 1.5M items in 3 minutes saturates the table's provisioned write capacity briefly. Resolution: switch to on-demand capacity mode for write windows; capacity reverts to provisioned after batch completes. Cost-neutral within FinOps tolerance.

S3 — Future Work

Principal Architect lens: Current architecture is single-region (ap-northeast-1). When KO and EU markets scale, a multi-region forecaster with per-region calendars and per-region demand patterns may outperform a single global model. Tracked as v3 backlog; revisit when KO/EU SKU count exceeds 500K each.

ML Scientist lens (S3): Probabilistic programming (e.g., Pyro / NumPyro) for explicit hierarchical Bayesian forecasting could replace the heuristic shrinkage with principled posterior inference. Trade-off: training cost ~3× higher; calibration ~5% better on cold-start. Defer unless cold-start sMAPE persistently exceeds 35%.

SRE lens (S3): Real-time forecast updates (within-day re-forecasts on viral signals) are a v3 capability. Currently the daily cadence captures the bulk of the signal; viral spikes are caught by the real-time anomaly detector plus the same-day incremental refit. A true within-day re-forecast would require streaming feature ingestion and real-time TFT inference, a substantial infra investment. Tracked but not prioritized.

FinOps lens (S3): TFT inference can be quantized to int8 for batch transform; preliminary tests show 1.8× throughput at < 0.5% sMAPE regression. Could reduce daily batch cost to ~$25. Tracked as v3 cost optimization, coordinated with Cost-Optimization US-04.