ML Scenario 05 — Demand Forecasting and Promotional Distortion
TL;DR
The demand-forecasting model predicts daily units sold per SKU for inventory planning and powers the bot's "we're running low / pre-order" signaling. It was trained on three years of daily-sales data with an "anomaly removal" preprocessing step that filters Black Friday, Comic-Con week, anime-airing windows, and major-release events from the training set. Forecasts on quiet weeks are accurate (sMAPE ≈ 0.18); forecasts on event weeks miss by 4–10×. Inventory ops compensate manually. The "anomaly" assumption is the bug — those events aren't anomalies, they're the very part of the demand distribution that matters most for revenue and customer experience. Treating them as outliers means the model learns "what would demand be if these events never happened," which is a counterfactual nobody asked for. The fix shape is events as first-class features (calendar + tie-in + promo signals), event-conditional forecasts with their own ground truth (per-event-class historical realizations), event-aware eval slicing, and a layered architecture where the steady-state base model and the event-overlay model serve different jobs.
Context & Trigger
- Axis of change: Requirements (the business runs on event-driven demand; the original model was trained as if it didn't) + Time (event windows recur on calendar cycles but with year-over-year drift in magnitude).
- Subsystem affected: Order-Inventory MCP forecasting layer in RAG-MCP-Integration/03-order-inventory-mcp.md. Knock-on effects in stockout signaling, pre-order promotion, and the bot's "we're running low" copy.
- Trigger event: A flagship anime adaptation aired and pushed the source manga's demand to 7× baseline for two weeks. The forecast predicted 1.2× baseline (slight uplift). Inventory ops, who had been quietly compensating with manual overrides for ~18 months, escalated. Investigation found a structural assumption baked into preprocessing.
The Old Ground Truth
Original forecasting setup:
- Three years of `(sku, day, units_sold)` actuals as training data.
- Anomaly filter: dates within Black Friday, Cyber Monday, Comic-Con week, and a hand-curated list of anime-airing windows are removed before training.
- Model: Prophet (seasonality + trend) per SKU for high-volume SKUs; LightGBM with shared features for the long tail.
- Eval: sMAPE on a held-out time-window with the same anomaly filter applied.
- Reasonable assumptions:
- Promo events distort training; removing them produces a cleaner steady-state model.
- Inventory ops handle event-week planning manually anyway.
- The held-out window, after the same filter, is an honest measurement of steady-state accuracy.
What this gets wrong: removing the events from training also removes them from the eval, so the model is "good" at a counterfactual world. It also means the system has no learned response to events — every event is a manual intervention. And the assumption that ops "handle it manually" is an admission that the model isn't doing its job, not a justification.
The New Reality
- Events drive the part of the business that matters most. Comic-Con, anime tie-ins, awards, and seasonal spikes are when the company makes outsized revenue and when stockouts are most painful. The forecast is least useful exactly when it matters most.
- Event magnitude varies. A Comic-Con tie-in for a popular series might be 5×; for a mid-tier series, 1.5×. Year-over-year drift is large — a series that was 3× last year might be 8× this year because an anime adaptation has aired in the meantime.
- Event signals are knowable in advance. Anime-airing schedules, Comic-Con dates, marketing-promo calendars, partnered-publisher releases — all knowable weeks ahead. The model could use them; the architecture chose not to.
- "Anomaly removal" hides label drift. If the preprocessing keeps changing what counts as "anomalous," the steady-state baseline itself shifts — but the change is invisible because it happens in the filter, not the model.
- The inventory-ops manual override has its own ground truth. What ops actually did during event windows (pre-orders ordered, safety stock added) is a labeled dataset of "event-class → realized response." The model could learn from it; the architecture didn't.
- Forecast errors during events compound downstream. Stockout events impact CSAT, drive customers to competitors, and bias future demand signals (a customer who couldn't buy on the spike day may not return).
Why Naive Approaches Fail
- "Stop filtering anomalies; train on everything." Without event features, the model sees noise — a single Comic-Con week looks like a random spike, with no signal about why or when it'll repeat. The model learns "demand is noisy," which is worse than "demand is steady."
- "Add more historical data." More years means more events, but without event labels they're still noise.
- "Use a fancy time-series model (LSTM, Transformer)." Deep models pick up patterns better, but without explicit event features they still struggle with sparse, non-stationary, calendar-driven spikes.
- "Forecast at a higher level." Forecasting category-level demand smooths over the SKU-level spikes that matter.
- "Trust the manual ops process." Manual overrides don't scale, aren't reproducible, and don't learn from year to year.
- "Just multiply by an event coefficient." Naive global multipliers miss the heterogeneity (which SKUs spike, by how much, for how long).
Detection — How You Notice the Shift
Online signals.
- Forecast error during event windows vs steady state. Stratified sMAPE per event class. The ratio is the diagnostic.
- Stockout rate during event windows. If stockouts cluster on event weeks, the forecast is failing structurally on those days.
- Manual override rate. How often inventory ops overrides the forecast. If overrides are concentrated on event-related SKUs, the model isn't doing its job for those SKUs.
Offline signals.
- Stratified eval. Compute sMAPE separately on steady-state days, event days (per event class), and per-SKU-tier slices; collapsing them into one number hides the failure. (A sketch follows the distribution signals below.)
- Counterfactual replay. If the proposed model had been live last year, what would its event-week forecasts have been? Compare to actual realizations.
- Lift attribution. When the forecast is right on an event day, which feature drove the lift? If event features explain most of the lift, they're working.
Distribution signals.
- Per-SKU event elasticity. For each SKU, estimate `event_elasticity = event_week_sales / steady_state_sales`. Plot the distribution. A heavy tail means some SKUs spike enormously; the model needs a SKU-conditional event response.
- Year-over-year event drift. For recurring events, track how the magnitude shifts across years. If the spike grows year over year while the model assumes a fixed multiplier, the gap widens annually.
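A minimal sketch of the stratified-error and elasticity diagnostics above, using pandas. The frame layout is an assumption, not an existing schema: one row per (sku, day) with `actual`, `forecast`, and an `event_class` column where steady-state days carry `"steady"`.

```python
import numpy as np
import pandas as pd

def smape(actual: pd.Series, forecast: pd.Series) -> float:
    """Symmetric MAPE; denominator guarded against zero-sales days."""
    denom = (actual.abs() + forecast.abs()).clip(lower=1e-9)
    return float((2 * (actual - forecast).abs() / denom).mean())

def stratified_smape(df: pd.DataFrame) -> pd.Series:
    """sMAPE per slice; the steady/event ratio is the diagnostic."""
    return df.groupby("event_class").apply(lambda g: smape(g["actual"], g["forecast"]))

def event_elasticity(df: pd.DataFrame) -> pd.Series:
    """Per-SKU ratio of mean event-day sales to mean steady-state sales."""
    steady = df[df["event_class"] == "steady"].groupby("sku")["actual"].mean()
    event = df[df["event_class"] != "steady"].groupby("sku")["actual"].mean()
    return (event / steady.clip(lower=1e-9)).dropna().sort_values(ascending=False)
```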
Architecture / Implementation Deep Dive
flowchart TB
subgraph Cal["Event signals (calendar + tie-in)"]
ANI["Anime airing calendar"]
CON["Comic-Con / convention dates"]
AWA["Awards / publisher events"]
PRO["Marketing promo calendar"]
REL["Major release dates"]
end
subgraph History["Historical labeled events"]
REAL["Realized event impact:<br/>(event_id, sku, lift_factor)<br/>across past years"]
OPS["Ops override log<br/>(what was done, what worked)"]
end
subgraph Models["Two-tier forecasting"]
BASE["Base steady-state model<br/>(Prophet / LightGBM)"]
EVENT["Event-overlay model<br/>(predicts lift_factor<br/>per (sku, event))"]
FUSE["Fused forecast =<br/>base × event_lift"]
end
subgraph GT["Stratified ground truth"]
STEADY["Steady-state actuals<br/>(no events)"]
EVENT_GT["Event-day actuals<br/>(per event class)"]
TIER["Per-SKU-tier evaluation"]
end
subgraph Serve["Operational"]
FCST["Final forecast"]
SIG["Stockout / pre-order signal"]
OVR["Manual override path<br/>(captured for relabel)"]
end
ANI --> EVENT
CON --> EVENT
AWA --> EVENT
PRO --> EVENT
REL --> EVENT
REAL --> EVENT
OPS --> EVENT
BASE --> FUSE
EVENT --> FUSE
FUSE --> FCST
FCST --> SIG
OVR --> OPS
STEADY -.->|eval| BASE
EVENT_GT -.->|eval| EVENT
TIER -.->|eval| FUSE
style EVENT fill:#fde68a,stroke:#92400e,color:#111
style EVENT_GT fill:#dbeafe,stroke:#1e40af,color:#111
style OVR fill:#fee2e2,stroke:#991b1b,color:#111
style OPS fill:#dcfce7,stroke:#166534,color:#111
1. Data layer — events as first-class
Three datasets:
- Steady-state actuals. Days not covered by any event. The base model trains on these.
- Event labels. Each historical event tagged as `(event_id, type, start, end, affected_sku_filter)`, joined to actuals → realized lift per (sku, event). A minimal schema sketch follows this list.
- Future event calendar. Forward-looking events known at forecast time: anime airings, conventions, promo plans.
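A minimal sketch of what an event-label record could look like, as a Python dataclass; the field names mirror the tuple above, and treating `affected_sku_filter` as a predicate over SKU ids is an assumption.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Callable

@dataclass
class EventLabel:
    """One historical (or future) event window, joinable to daily actuals."""
    event_id: str
    type: str                      # e.g. "anime_airing", "convention", "promo"
    start: date
    end: date
    affected_sku_filter: Callable[[str], bool]   # which SKUs the event touches
    features: dict = field(default_factory=dict) # magnitude predictors (tier, size, ...)

    def covers(self, day: date) -> bool:
        return self.start <= day <= self.end
```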
Event types and their feature representations:
| Event type | Forward-knowledge | Magnitude predictor |
|---|---|---|
| Anime airing window | Confirmed weeks/months ahead | Series tier × prior-airing impact × current popularity |
| Comic-Con / convention | Annual calendar | Convention size × historical lift |
| Awards | Quarterly schedule | Award category × historical winners' lift |
| Publisher promo | Days ahead | Promo discount × prior promo lift |
| Major release | Confirmed weeks ahead | Series sales-rank × prior-volume drop |
2. Pipeline layer — two-tier forecasting
The forecast is decomposed:
forecast(sku, day) = base(sku, day) × event_lift(sku, day)
event_lift(sku, day) = max(1.0, model_lift(sku, events_active(day)))
The base model is the existing steady-state Prophet/LightGBM. The event-overlay model is a separate model (typically gradient-boosted) trained on (sku_features, event_features) → realized_lift_factor. It outputs a multiplicative correction.
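A minimal sketch of the fusion rule, under assumed interfaces for the base model, the overlay model, and the forward event calendar (reusing the hypothetical `covers()` helper from the schema sketch above):

```python
def fused_forecast(sku, day, base_model, overlay_model, event_calendar):
    """forecast = base × max(1.0, predicted lift over events active on `day`)."""
    base = base_model.predict(sku, day)                    # steady-state units
    active = [e for e in event_calendar if e.covers(day)]  # forward-looking calendar
    if not active:
        return base
    # The overlay predicts a multiplicative lift from SKU + event features;
    # max(1.0, ·) keeps the overlay from dragging the forecast below baseline.
    lift = max(1.0, overlay_model.predict_lift(sku, active))
    return base * lift
```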
Training the event-overlay model:
def build_event_overlay_training_data(historical_events, base_model, actuals):
    """One row per (sku, event): features plus the realized lift over the base forecast."""
    rows = []
    for event in historical_events:
        for sku in event.affected_skus:
            # Counterfactual "no event" prediction from the steady-state base model.
            base_pred = base_model.predict(sku, event.start_to_end)
            actual = actuals.sales(sku, event.start_to_end)
            if sum(base_pred) <= 0:
                continue  # no meaningful baseline for this SKU; skip
            rows.append({
                "sku_features": sku.features(),
                "event_features": event.features(),
                "lift_factor": sum(actual) / sum(base_pred),
            })
    return rows
The event-overlay model learns (sku, event) → lift. New events at forecast time use the same features.
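One way to fit the overlay on those rows, sketched with LightGBM (the text above only says "typically gradient-boosted"); regressing on log-lift and flattening the feature dicts into numeric columns are added assumptions that keep the multiplicative target well behaved.

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

def fit_event_overlay(rows: list[dict]) -> lgb.LGBMRegressor:
    # Flatten sku_features + event_features into one numeric frame.
    X = pd.DataFrame([{**r["sku_features"], **r["event_features"]} for r in rows])
    y = np.log([max(r["lift_factor"], 1e-3) for r in rows])  # log-lift target
    model = lgb.LGBMRegressor(n_estimators=300, learning_rate=0.05)
    model.fit(X, y)
    return model

# At forecast time: lift = max(1.0, np.exp(model.predict(features)))
```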
3. Serving layer — fused forecast + ops override capture
The fused forecast goes to inventory planning. When ops override, the override is captured:
{
  "sku": "...",
  "forecast_day": "2026-04-25",
  "fused_forecast": 1240,
  "ops_override": 1800,
  "override_reason": "anime tie-in stronger than predicted",
  "override_user": "...",
  "realized_actual": 1720
}
These are gold labels for retraining the event-overlay model. The system learns from manual overrides — the institutional knowledge of inventory ops becomes part of the model.
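A sketch of turning captured override records into retraining rows, assuming the field names in the JSON above: the realized actual drives the label, and the override is kept only as an auxiliary feature (consistent with the pitfall and grill-2 discussion below).

```python
def override_records_to_training_rows(records: list[dict]) -> list[dict]:
    """Retraining rows from override capture: realized actuals are the truth,
    ops overrides are an informative-but-biased auxiliary feature."""
    rows = []
    for r in records:
        if r.get("realized_actual") is None:
            continue  # actuals not posted yet
        rows.append({
            "sku": r["sku"],
            "day": r["forecast_day"],
            # How far the fused forecast missed; re-based onto the base forecast
            # when joined back into the overlay training set.
            "realized_vs_fused": r["realized_actual"] / max(r["fused_forecast"], 1),
            "ops_vs_fused": r["ops_override"] / max(r["fused_forecast"], 1),
        })
    return rows
```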
4. Governance — event-class freshness and post-event review
- Event-class freshness. Each event class has a "last realized" timestamp. Classes that haven't recurred recently (e.g., a once-every-3-years awards format) get human review on the forecast.
- Post-event review. After every major event, ops + forecast team reviews the forecast vs actual. Lessons feed into event-overlay model retraining and feature engineering.
- Calendar-feature audit. Forward-looking calendars are maintained by ops; the forecast pipeline pulls from them. A miss in the calendar (e.g., a surprise anime announcement late in the cycle) is the most common forecast failure source — track calendar-completeness as a meta-metric.
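A minimal sketch of the two governance checks above as meta-metrics; the 540-day staleness threshold and the input shapes are assumptions.

```python
from datetime import date

def stale_event_classes(last_realized: dict[str, date], today: date,
                        max_age_days: int = 540) -> list[str]:
    """Event classes whose most recent realization is old enough to warrant human review."""
    return [cls for cls, last in last_realized.items()
            if (today - last).days > max_age_days]

def calendar_completeness(realized_event_ids: set[str], in_calendar_ahead: set[str]) -> float:
    """Share of realized events that were in the forward calendar before they started."""
    if not realized_event_ids:
        return 1.0
    return len(realized_event_ids & in_calendar_ahead) / len(realized_event_ids)
```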
Trade-offs & Alternatives Considered
| Approach | Steady-state accuracy | Event accuracy | Cost | Verdict |
|---|---|---|---|---|
| Filter events, single model | Good | Bad | Low | Original — broken |
| Train on everything, no event features | Mediocre | Mediocre | Low | Smooths spikes, adds steady-state noise |
| Single model with event features | Good | OK | Medium | Better than original; over-fits event features at SKU level |
| Two-tier (base + event-overlay) | Good | Good | Medium-High | Chosen |
| Per-event-class custom models | Best per class | Best per class | Very high | Operationally expensive; reserve for top events |
| Pure ops manual planning | n/a | Variable | Highest manpower | Doesn't scale |
The two-tier pattern is standard for retail forecasting; the contribution here is the labeled-events ground truth discipline, the override-capture loop, and the explicit per-event-class freshness gates.
Production Pitfalls
- Events the calendar misses. A surprise tie-in or viral moment isn't in any forecast input. The forecast gets caught flat-footed. Mitigation: a "real-time anomaly detector" that watches actuals and flags spikes for ops; manual override path is fast.
- Event-overlay training is data-poor for rare events. Some events recur once a year, twice a year. Sparse data → unreliable lift estimates. Use Bayesian shrinkage toward category-level priors; reserve manual review for first-time-class events.
- Cannibalization between events. Two overlapping events on the same SKU don't simply multiply lifts; they cannibalize. Need interaction features.
- Ops overrides are biased. Ops err toward over-stocking (stockout fear); their overrides may systematically over-shoot. Use realized actuals (not overrides) as the gold label, with overrides as an auxiliary feature.
- Year-over-year drift in event magnitude. A model trained on multi-year data without time-decay weight can under-react to growing event impact. Add recency weights.
- Calendar feeds break. Anime calendar API goes down silently; events disappear from the feed. Monitor calendar-feed liveness as a critical signal.
- Stockout-as-feature feedback. A stockout suppresses sales (zero-sales day) which then enters the training data as low demand. The base model has to mask stockout days from the actuals or it learns to predict zero on chronically-stocked-out SKUs.
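A minimal sketch of the stockout masking from the last pitfall, assuming daily actuals and an inventory snapshot keyed by (sku, day); column names are assumptions.

```python
import pandas as pd

def mask_stockout_days(actuals: pd.DataFrame, inventory: pd.DataFrame) -> pd.DataFrame:
    """Drop days where on-hand inventory was zero, so censored sales never
    teach the base model that demand was zero."""
    merged = actuals.merge(inventory, on=["sku", "day"], how="left")
    stocked = merged["on_hand_units"].fillna(0) > 0
    return merged.loc[stocked, actuals.columns]
```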
Interview Q&A Drill
Opening question
Your demand-forecasting model has a clean preprocessing step that removes "anomalies" (Black Friday, Comic-Con, anime tie-ins) from training. Steady-state forecasts are accurate; event-week forecasts are off by 4–10× and ops manually overrides them. Is this a reasonable architecture?
Model answer.
It's a reasonable starting point, but for a mature business it's a structural failure. The "anomalies" aren't anomalies — they're the part of the demand distribution that matters most for revenue and CSAT. Filtering them out trains the model for a counterfactual world. And the model's gap is patched by manual ops overrides, which means the model isn't actually doing its job during the periods that matter.
The fix is two-tier forecasting with events as first-class.
(1) Base steady-state model on event-free days, like today's model.
(2) Event-overlay model trained on (sku_features, event_features) → realized_lift_factor. Features include event type (anime airing, convention, promo), event magnitude predictors (series tier, convention size, prior-airing impact, current popularity), and SKU features. The overlay outputs a multiplicative lift; fused forecast = base × lift.
(3) Stratified eval. sMAPE separately on steady-state, on per-event-class days, and per-SKU-tier. Aggregate metrics are advisory.
(4) Override capture as ground truth. When ops overrides the forecast, capture (forecast, override, actual). The realized actual is the gold label; the overlay model learns from it.
(5) Forward-looking event calendar maintained by ops: anime airings, conventions, promo plans. The forecast pipeline reads from it; calendar completeness is a tracked meta-metric.
The conceptual move: events are the part of the business that matters most. Removing them from training to make the steady-state metric look good is optimizing the wrong thing. The architecture treats them as a learnable, predictable phenomenon — and learns from the institutional knowledge of inventory ops captured in their overrides.
Follow-up grill 1
Your event-overlay model needs `realized_lift_factor` labels. For one-off events (a never-before-aired anime), there's no historical data. What does the model do?
The architectural answer is graceful uncertainty + Bayesian shrinkage to category priors.
(1) Category priors. For a never-seen anime in genre X, the prior is the average lift of past anime tie-ins in genre X, weighted by similarity (production studio, episode count, prior series success). The overlay outputs a posterior that combines the prior with whatever sparse data exists.
(2) Confidence interval, not point estimate. The overlay returns (predicted_lift, CI_low, CI_high). For one-off events, the CI is wide. Inventory ops sees the CI and chooses how aggressively to stock based on risk tolerance.
(3) Manual review for first-time event classes. When the event class is novel (a genre with no priors), the forecast is flagged for ops review. Don't pretend confidence the model doesn't have.
(4) Real-time monitoring during the event. Once the event is live, watch actuals. If they exceed the overlay's CI within hours, trigger an alert and an ops re-plan. This is the "we got it wrong, course-correct fast" path.
The architectural humility: for events the model has never seen, the right answer is a wide forecast and a fast feedback loop, not a confident point prediction.
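A minimal sketch of the shrinkage in (1), assuming a simple normal-normal model on log-lift: the posterior mean pulls a sparse per-series estimate toward the genre prior, and the posterior spread feeds the CI in (2).

```python
import numpy as np

def shrunk_log_lift(observed_log_lifts: list[float],
                    prior_mean: float, prior_var: float,
                    obs_var: float = 0.25) -> tuple[float, float]:
    """Posterior mean/variance of log-lift; with no observations it returns the prior."""
    n = len(observed_log_lifts)
    if n == 0:
        return prior_mean, prior_var
    sample_mean = float(np.mean(observed_log_lifts))
    post_var = 1.0 / (1.0 / prior_var + n / obs_var)
    post_mean = post_var * (prior_mean / prior_var + n * sample_mean / obs_var)
    return post_mean, post_var

# Example: never-aired series in a genre whose past tie-ins averaged ~3x (log ≈ 1.1):
# mean, var = shrunk_log_lift([], prior_mean=1.1, prior_var=0.3)
# ci = (np.exp(mean - 1.96 * var**0.5), np.exp(mean + 1.96 * var**0.5))
```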
Follow-up grill 2
Ops overrides are biased toward over-stocking — they fear stockouts more than over-stock. How do you keep that bias out of your training labels?
Use realized actuals, not overrides, as the primary label. Specifically:
(1) Label = realized lift. lift_factor = actual_units_sold / base_forecast. The actual is the truth, regardless of whether ops over-stocked or under-stocked.
(2) Override as auxiliary feature. In the event-overlay model, include ops_override_factor as an input feature. The model can learn whether ops' fear-driven overrides tend to precede under-realized demand (over-stocking) or tend to be correct calls. Ops is treated as a noisy signal: it contributes information about the situation but doesn't directly drive the label.
(3) Stockout-day exclusion. If an event week ended with a stockout (sold out before week's end), the realized sales are capped by inventory, not by demand. Either exclude stockout days from training or model demand as a censored variable (we know demand was at least what we sold). Without this, the label systematically under-counts.
The deeper trade-off: ops bias is a feature of human decision-making. Modeling it explicitly ("ops tends to over-stock by 20% when uncertain") lets the system both learn from ops and account for their bias. Hiding it would be cleaner but lose information.
Follow-up grill 3
Year-over-year event magnitudes drift — a Comic-Con bump that was 2× three years ago is now 5×. How do you keep the overlay model adaptive?
Four protections.
(1) Recency-weighted training. Recent event observations weighted more heavily than older ones. Half-life ~ 1–2 years for event magnitudes (consistent with the underlying business growth and audience-size drift).
(2) Trend-conditioning features. Include "growth index" features in the SKU side: how much has the series' steady-state sales grown YoY? Series with growing baselines tend to have growing event lifts. The model learns the relationship.
(3) Year-of-event feature. A categorical "year-since-event-class-launched" feature. If anime tie-ins as a class are growing in impact, the model learns the temporal trend.
(4) Periodic re-fit on most-recent windows. The overlay model re-fits monthly on a window weighted toward the past 12 months. Event-overlay drift is handled by the same continuous-retraining discipline as scenario ML-01.
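A minimal sketch of the recency weighting in (1): exponential-decay sample weights with a configurable half-life, passed to the overlay fit as per-row weights. The column name and the default half-life are assumptions.

```python
import numpy as np
import pandas as pd

def recency_weights(event_dates: pd.Series, as_of: pd.Timestamp,
                    half_life_days: float = 540.0) -> pd.Series:
    """Sample weights that halve every `half_life_days` (~1.5 years by default)."""
    age_days = (as_of - pd.to_datetime(event_dates)).dt.days.clip(lower=0)
    return np.power(0.5, age_days / half_life_days)

# Usage with the overlay fit (LightGBM supports per-row weights):
# model.fit(X, y, sample_weight=recency_weights(df["event_start"], pd.Timestamp.today()))
```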
Follow-up grill 4
Your `forecast = base × lift` decomposition assumes events affect SKUs multiplicatively. Some events are additive (a marketing promo adds N units regardless of base). How do you handle that?
The decomposition is a modeling choice; mixed additive/multiplicative effects need a richer functional form. Two approaches.
(1) Mixed model. forecast = base × multiplicative_lift + additive_lift. The overlay model outputs both components for each (sku, event) pair. Some events drive multiplicative lift (tie-ins on already-popular series), some drive additive lift (promos drive a fixed extra audience), some drive both. The labeled training data lets the model learn which component dominates per event class.
(2) Per-event-class functional form. Each event class has its own functional form. Marketing promos might be forecast = base + additive; anime tie-ins might be forecast = base × multiplicative. Implementing per-class forms is more code but more interpretable.
I'd start with the mixed model (one architecture, learns the decomposition from data) and migrate to per-class forms if certain classes need explicit physics-based modeling for ops trust or regulatory reasons.
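A minimal sketch of the mixed decomposition in (1), assuming two sklearn-style overlay heads, one predicting log-multiplicative lift and one predicting additive units:

```python
import numpy as np

def mixed_fused_forecast(base: float, features, mult_head, add_head) -> float:
    """forecast = base × multiplicative_lift + additive_lift.
    Both heads are trained on the same (sku, event) rows; per event class
    the data decides which component carries the lift."""
    mult = max(1.0, float(np.exp(mult_head.predict([features])[0])))  # lift ≥ 1.0
    add = max(0.0, float(add_head.predict([features])[0]))            # extra units ≥ 0
    return base * mult + add
```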
Architect-level escalation 1
A new product launches: pre-orders for an upcoming chapter release. The "demand" you're forecasting is now anticipatory — pre-order count, not units shipped. How does the forecasting architecture extend?
Pre-orders are a distinct, time-shifted signal: they predict future ship-week demand and are themselves demand at the same time.
(1) Pre-orders as both target and feature. Forecast pre_order_count(sku, day) as its own target; use it as a feature for the eventual ship-week demand forecast. A high pre-order count is a leading indicator for a high ship-week demand.
(2) Pre-order conversion rate. The historical relationship between pre-orders and realized buys is the conversion factor. Some series have 95% conversion; cancellations / no-shows on others can be 30%. Per-series conversion is a feature.
(3) Time-to-release lift curves. Pre-orders ramp in characteristic shapes — slow at first, steep in the final week. The forecast model learns the shape per series tier.
(4) Treat the pre-order period as its own event window. The pre-order period acts like a small event in the demand distribution; the same overlay architecture applies.
The architectural commitment: the schema is now two demand signals per SKU per day — pre-orders and ship demand — that interact. Models learn the joint distribution. Don't try to forecast pre-orders by re-using the ship-demand model.
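A minimal sketch of the conversion logic in (2); the per-series conversion table, the walk-in term, and the default rate are assumptions.

```python
def shipweek_demand_from_preorders(preorder_count: int,
                                   series_conversion: dict[str, float],
                                   series_id: str,
                                   walk_in_forecast: float,
                                   default_conversion: float = 0.8) -> float:
    """Ship-week demand ≈ converted pre-orders + forecast walk-in demand."""
    conv = series_conversion.get(series_id, default_conversion)
    return preorder_count * conv + walk_in_forecast
```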
Architect-level escalation 2
Six months later, supply chain is constrained — even when demand spikes, you can't always meet it. The forecasting question shifts from "how much will sell" to "how much can we sell given a constrained supply." How does the architecture change?
This is demand sensing under capacity constraint, and it's a meaningfully different problem.
Three changes.
(1) Decompose into demand and supply. "Sales" is now min(demand, supply). Forecasting raw sales hides the fact that capacity-bound days don't reveal demand. Estimate demand as a latent variable using censored regression — observed sales = demand on uncapped days, observed sales ≤ demand on capped days.
(2) Capacity is a forecast input. The supply chain provides a per-SKU per-day capacity forecast (or actuals as they come in). The fused forecast outputs both: predicted demand (uncapped) and predicted sales (= min(demand, capacity)). Inventory and procurement use the demand number; revenue planning uses the sales number.
(3) Stockout cost vs over-stock cost. The decision changes — when capacity is constrained, increasing safety stock has a different cost than when it isn't. The forecast feeds an optimization that weighs the asymmetry. The bot's "we're running low" signal becomes more nuanced: "demand is high, supply is constrained, inventory will likely run out by Friday."
The deeper architectural lesson: as the business changes (constrained supply is a structural shift), the meaning of "demand forecasting" shifts. The architecture that served unconstrained-supply forecasting needs new components (latent demand, capacity inputs, asymmetric loss) for the new world. Ground truth itself shifts: "what would have been sold without the cap" is a counterfactual that the model now must estimate, not a directly-observable target.
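A minimal sketch of the censored-demand estimation in (1), as a Tobit-style maximum-likelihood fit on log sales using scipy; the lognormal demand assumption is mine, not the document's.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

def fit_latent_demand(sales: np.ndarray, capped: np.ndarray) -> tuple[float, float]:
    """MLE of (mu, sigma) for log-demand when some observations are right-censored
    at the observed sales level (capacity-bound days)."""
    log_s = np.log(np.maximum(sales, 1e-9))

    def neg_log_lik(params):
        mu, log_sigma = params
        sigma = np.exp(log_sigma)
        ll_exact = stats.norm.logpdf(log_s[~capped], mu, sigma)  # demand observed exactly
        ll_cens = stats.norm.logsf(log_s[capped], mu, sigma)     # demand ≥ observed sales
        return -(ll_exact.sum() + ll_cens.sum())

    res = minimize(neg_log_lik, x0=[log_s.mean(), 0.0], method="Nelder-Mead")
    return float(res.x[0]), float(np.exp(res.x[1]))  # (mu, sigma) of log-demand
```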
Red-flag answers
- "Filter the anomalies; they're noise." (The original failure.)
- "Train on everything as-is." (Without event features, just adds noise.)
- "Use a deeper model." (Architecture, not depth, is the issue.)
- "Trust ops manual planning." (Doesn't scale, doesn't learn.)
- "Multiply by a global event coefficient." (Misses heterogeneity.)
Strong-answer indicators
- Names events as the part of the distribution that matters most.
- Two-tier (base + event-overlay) decomposition.
- Realized actuals as labels; overrides as features (or auxiliary).
- Stratified eval (steady-state, per-event-class, per-tier).
- Calendar completeness as a tracked meta-metric.
- Bayesian shrinkage to priors for sparse-history events.
- Year-over-year drift handled with recency weights and trend features.
- Recognizes that supply constraints transform the problem from forecasting sales to estimating latent demand.