ML Engineer User Stories — MangaAssist Chatbot (Amazon-Scale)
Overview
This directory contains the ML Engineer-owned user stories for the MangaAssist chatbot on Amazon. Where `Cost-Optimization-User-Stories/` is the FinOps lens, `Ground-Truth-Evolution/` is the drift lens, and `RAG-MCP-Integration/` is the tool-interface lens, this folder is the day-to-day ML Engineer lens — the engineer who owns training pipelines, label engineering, model registries, online ranking, drift detection, retrain cadence, and SageMaker serving for the classical (non-FM) ML systems on the platform.
Eight production ML systems sit on the chatbot's hot path. Each one has a user story here describing how it is built, trained, evaluated, deployed, monitored, and retrained at Amazon-scale (millions of requests/day, bilingual JP/EN traffic, ap-northeast-1 data residency, multi-region serving).
User Stories
| # | User Story | Primary System | SageMaker Surface | Headline Metric (Target) |
|---|---|---|---|---|
| US-MLE-01 | Intent Classifier Retraining Pipeline | DistilBERT (multilingual) | Real-time endpoint, ml.g5.xlarge | Val accuracy ≥ 0.90, p95 ≤ 15ms |
| US-MLE-02 | Cross-Encoder Reranker Training & Online Ranking | MiniLM-L6 cross-encoder | Real-time, ml.g5.2xlarge MME | NDCG@10 ≥ 0.78, +3% CTR canary |
| US-MLE-03 | Sentiment + ABSA Lifecycle | Multi-task DeBERTa-v3 | Async + batch transform | Macro-F1 ≥ 0.85 / 0.74, quarterly relabel |
| US-MLE-04 | Demand Forecasting Pipeline | Temporal Fusion Transformer | Batch transform, daily | sMAPE ≤ 14%, promo-uplift err ≤ 25% |
| US-MLE-05 | Embedding Adapter Fine-Tuning + Re-Index | Titan + LoRA adapter | SageMaker Training + batch | Recall@10 ≥ 0.92, blue/green re-index ≤ 4h |
| US-MLE-06 | Recommendation Training + Cold-Start | Two-tower + Personalize hybrid | Personalize + custom serving | HR@20 ≥ 0.18, cold-start CTR ≥ 80% of warm |
| US-MLE-07 | Spam/Abuse Classifier + Adversarial Labels | LightGBM + LLM-distill ensemble | Real-time + nightly batch | P@95R ≥ 0.93, weekly adversarial retrain |
| US-MLE-08 | Cover-Art Style Classifier + Image Monitoring | EfficientNet-V2-S | Real-time + async, ml.g5.xlarge | Top-1 ≥ 0.87, AI-gen detection AUC ≥ 0.95 |
Effort Distribution (Engineering Cost Share)
```mermaid
pie title ML Engineer Effort Share Across the 8 Systems (post-launch steady state)
    "US-MLE-01 Intent Classifier" : 12
    "US-MLE-02 Cross-Encoder Reranker" : 14
    "US-MLE-03 Sentiment / ABSA" : 11
    "US-MLE-04 Demand Forecasting" : 10
    "US-MLE-05 Embedding Adapter + Re-Index" : 18
    "US-MLE-06 Recommendation" : 16
    "US-MLE-07 Spam / Abuse" : 11
    "US-MLE-08 Cover-Art Style" : 8
```
The two largest slices (US-MLE-05 embedding adapter, US-MLE-06 recommendation) reflect, respectively, the operational cost of corpus-scale re-indexing and the experimentation overhead of personalization. US-MLE-08 (cover-art) is the smallest because the cover catalog turns over more slowly than the text corpus and its traffic is bursty around new releases.
Dependency & Sequencing Graph
The 8 stories are not independent. All of them depend on shared platform infrastructure (feature store, model registry, drift hub, label platform, online inference fabric), and several have data dependencies on each other. Implementing in the wrong order creates either silent quality regressions (no drift detection to catch them) or rollback-impossible failures (no registry to pin versions to).
```mermaid
graph TB
    PLAT[ML Platform<br/>Feature Store + Registry + Drift Hub<br/>Label Platform]:::plat
    US01[US-MLE-01 Intent Classifier<br/>Routes traffic to MCPs]
    US02[US-MLE-02 Cross-Encoder Reranker<br/>Ranks RAG retrieval]
    US03[US-MLE-03 Sentiment + ABSA<br/>Reads 50M reviews]
    US04[US-MLE-04 Demand Forecasting<br/>Inventory signal]
    US05[US-MLE-05 Embedding Adapter<br/>Underpins retrieval]
    US06[US-MLE-06 Recommendation<br/>Personalize + 2-tower]
    US07[US-MLE-07 Spam Classifier<br/>Filters review corpus]
    US08[US-MLE-08 Cover-Art Style<br/>Visual classification]
    PLAT --> US01
    PLAT --> US02
    PLAT --> US03
    PLAT --> US04
    PLAT --> US05
    PLAT --> US06
    PLAT --> US07
    PLAT --> US08
    US05 -->|embeddings| US02
    US05 -->|embeddings| US06
    US01 -->|intent label| US02
    US01 -->|intent label| US06
    US07 -->|spam-clean reviews| US03
    US03 -->|aspect signal| US06
    US04 -->|stock signal| US06
    classDef plat fill:#fd2,stroke:#333,stroke-width:2px
    style US05 fill:#9cf,stroke:#333
    style US01 fill:#9cf,stroke:#333
    style PLAT fill:#fd2,stroke:#333
```
Recommended implementation order (a minimal dependency-order sketch follows the list):
- ML Platform (cross-cutting) first — feature store contract, model registry schema, drift hub, label platform. Documented in `deep-dives/02-cross-story-platform-deep-dive.md`. Without this, every story re-implements its own infra and they drift apart.
- US-MLE-01 (Intent Classifier) — produces the intent label that US-MLE-02 and US-MLE-06 read. Must ship before any consumer optimizes against intent-conditional metrics.
- US-MLE-05 (Embedding Adapter) — produces the embedding model that US-MLE-02 reranker and US-MLE-06 two-tower both consume. Re-index throughput sets the ceiling on how often the reranker's training set can be refreshed.
- US-MLE-07 (Spam Classifier) — must run before US-MLE-03 ABSA training because ABSA reads the cleaned review corpus. ABSA quality is bounded by spam recall.
- US-MLE-02, US-MLE-03, US-MLE-04, US-MLE-06, US-MLE-08 in parallel — these are the leaf models. Each can ship independently once the platform + upstream signal is in place.
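To make this ordering checkable rather than tribal knowledge, the dependency edges from the graph above can be encoded and a legal build order derived mechanically. The sketch below is illustrative only — the `DEPENDENCIES` mapping is hand-copied from the diagram, not generated from it — and uses only the Python standard library:

```python
# Hypothetical helper: derive a legal implementation order from the
# cross-story dependency graph shown above.
from graphlib import TopologicalSorter

# story -> set of prerequisites that must ship first
DEPENDENCIES = {
    "US-MLE-01": {"PLAT"},
    "US-MLE-02": {"PLAT", "US-MLE-01", "US-MLE-05"},                # intent label + embeddings
    "US-MLE-03": {"PLAT", "US-MLE-07"},                             # spam-clean reviews
    "US-MLE-04": {"PLAT"},
    "US-MLE-05": {"PLAT"},
    "US-MLE-06": {"PLAT", "US-MLE-01", "US-MLE-05", "US-MLE-03", "US-MLE-04"},
    "US-MLE-07": {"PLAT"},
    "US-MLE-08": {"PLAT"},
}

if __name__ == "__main__":
    # static_order() yields prerequisites before their consumers: PLAT first, leaf models last.
    order = list(TopologicalSorter(DEPENDENCIES).static_order())
    print(" -> ".join(order))
```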
Owner Mapping
| # | User Story | Suggested Owner Role |
|---|---|---|
| US-MLE-01 | Intent Classifier Retraining | Senior ML Engineer (NLP, multilingual) |
| US-MLE-02 | Cross-Encoder Reranker | ML Engineer (Search/Ranking) |
| US-MLE-03 | Sentiment + ABSA | ML Engineer (NLU) + Annotation Vendor PM |
| US-MLE-04 | Demand Forecasting | ML Engineer (Time-Series) + Inventory PM |
| US-MLE-05 | Embedding Adapter + Re-Index | Senior ML Engineer (Retrieval) + Data Platform |
| US-MLE-06 | Recommendation | ML Engineer (RecSys) + Personalize PM |
| US-MLE-07 | Spam / Abuse Classifier | ML Engineer (Trust & Safety) |
| US-MLE-08 | Cover-Art Style | ML Engineer (CV) |
The platform pieces (feature store, registry, drift hub, label platform) are owned by an ML Platform Lead who acts as the coordination point across all 8 stories. This is the same pattern that the Cost-Optimization stories establish for the FinOps Lead.
Unified KPI Rollup
| # | Story | Headline Metric | Target | Baseline | Status |
|---|---|---|---|---|---|
| US-MLE-01 | Intent Classifier | Multilingual val accuracy | ≥ 0.90 | 0.86 | Track |
| US-MLE-01 | Intent Classifier | p95 inference latency | ≤ 15 ms | 22 ms | Track |
| US-MLE-02 | Reranker | NDCG@10 on labeled holdout | ≥ 0.78 | 0.71 | Track |
| US-MLE-02 | Reranker | Canary CTR uplift | +3% | +0% | Track |
| US-MLE-03 | Sentiment | Macro-F1 (3-class) | ≥ 0.85 | 0.81 | Track |
| US-MLE-03 | ABSA | Aspect-F1 | ≥ 0.74 | 0.66 | Track |
| US-MLE-04 | Demand Forecast | sMAPE 7-day SKU | ≤ 14% | 19% | Track |
| US-MLE-04 | Demand Forecast | Promo uplift error | ≤ 25% | 41% | Track |
| US-MLE-05 | Embedding Adapter | Recall@10 (JP+EN) | ≥ 0.92 | 0.86 | Track |
| US-MLE-05 | Embedding Adapter | Re-index wall-clock (5M docs) | ≤ 4h | 11h | Track |
| US-MLE-06 | Recommendation | HR@20 | ≥ 0.18 | 0.13 | Track |
| US-MLE-06 | Recommendation | Cold-start CTR / warm CTR | ≥ 0.80 | 0.42 | Track |
| US-MLE-07 | Spam | Precision @ 95% recall | ≥ 0.93 | 0.84 | Track |
| US-MLE-07 | Spam | Adversarial drift detection lag | ≤ 7 days | 30+ days | Track |
| US-MLE-08 | Cover-Art Style | Top-1 accuracy | ≥ 0.87 | 0.79 | Track |
| US-MLE-08 | Cover-Art Style | AI-gen detection AUC | ≥ 0.95 | n/a (new) | Track |
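These targets are intended to act as promotion gates, not dashboard decoration. A minimal sketch of such a gate check is shown below; the metric names and `higher_is_better` flags are assumptions for illustration, not an existing API:

```python
# Illustrative promotion-gate check against a subset of the KPI targets above.
from dataclasses import dataclass

@dataclass(frozen=True)
class KpiTarget:
    story: str
    metric: str
    target: float
    higher_is_better: bool

TARGETS = [
    KpiTarget("US-MLE-01", "val_accuracy", 0.90, True),
    KpiTarget("US-MLE-01", "p95_latency_ms", 15.0, False),
    KpiTarget("US-MLE-05", "recall_at_10", 0.92, True),
]

def gate(observed: dict, targets=TARGETS) -> list:
    """Return the list of KPI targets the candidate model fails to meet."""
    failures = []
    for t in targets:
        value = observed.get(t.metric)
        if value is None:
            failures.append(f"{t.story}/{t.metric}: missing")  # fail closed on missing metrics
        elif (value < t.target) if t.higher_is_better else (value > t.target):
            failures.append(f"{t.story}/{t.metric}: {value} vs target {t.target}")
    return failures
```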
How to Read This Folder
Each story file contains:
- User Story + Acceptance Criteria — the persona and the testable definition of done
- High-Level Design — production surface, end-to-end ML lifecycle diagram, data contracts
- Low-Level Design — feature pipeline, training pipeline, offline eval, online serving, shadow/canary, drift detection, retrain trigger, multilingual handling, all with concrete code
- Monitoring & Metrics — online + quality + drift + cost
- Risks & Mitigations
- Deep Dive — Why This Works at Amazon-Scale on Manga Workload
- Real-World Validation — industry analogues, math validation
- Cross-Story Interactions — explicit edges to other US-MLE stories and to Cost-Optimization stories
- Rollback & Experimentation — shadow plan, canary thresholds, kill switch
- Multi-Reviewer Validation Findings (S1/S2/S3) — ML Scientist, SRE, Data Eng, AppSec/Privacy, FinOps lenses
The companion docs:
- `deep-dives/00-foundations-and-primitives-for-ml-engineering.md` — the seven primitives every story uses (label, feature, training, evaluation, promotion, drift, retrain). Read this first.
- `deep-dives/01-deep-dive-per-ml-story.md` — uniform per-story implementation walkthrough that applies the primitives to each story.
- `deep-dives/02-cross-story-platform-deep-dive.md` — the ML platform components that all 8 stories share.
- `grill-chains/ml-engineer-grill-chains.md` — 7-round Q&A drill per story for hiring / interview prep.
Cross-References to Existing Folders
| Existing Folder | What It Covers | How It Relates |
|---|---|---|
| `Cost-Optimization-User-Stories/` | FinOps lens — cost reduction per service | US-02 there is the inference-cost view of US-MLE-01 here |
| `Ground-Truth-Evolution/ML-Scenarios/` | Drift lens — what breaks over time | Each US-MLE story links to its drift counterpart |
| `RAG-MCP-Integration/` | Tool-interface lens | US-MLE-02 and -05 sit inside the retrieval pipeline described there |
| `Fine-Tuning-Foundational-Models/` | Training infra scenarios | US-MLE-01 reuses the training infra patterns documented there |
| `Model-Inference/` | Inference + metrics taxonomy | All 8 stories use the metrics taxonomy defined there |
| `POC-to-Production-War-Story/` | Production failure modes | US-MLE stories include "incidents this would have caught" |
Cross-Cutting Concerns Inherited by All 8 Stories
These are non-negotiable shared infrastructure obligations. The Cost-Optimization README enumerates them for the cost lens; this folder inherits the same set, with ML-specific additions:
| Concern | Why required | Applies to |
|---|---|---|
| `request_id` (UUID) on every inference call | Distributed tracing, model attribution, incident forensics | All 8 |
| Model + classifier version pinning in cache keys | Embedding rotation otherwise serves stale vectors silently | US-MLE-01, -02, -05, -06 |
| Language stratification in metrics (EN/JP/mixed) | Bilingual store; aggregated metrics hide JP regressions | All 8 |
| Drift detection (input/label/prediction/concept) | Calibration on month-1 traffic breaks at month-6 | All 8 |
| Point-in-time correctness in feature store | Training-serving skew causes silent quality loss | US-MLE-02, -04, -06, -07 |
| Schema versioning on training datasets | A column type change retroactively poisons last month's runs | All 8 |
| Audit trail on label edits + model promotions | Required for regulated-data review and post-incident replay | All 8 |
| PII redaction at boundary (before training, embedding, archiving) | GDPR / APPI / breach risk; embeddings are quasi-reversible | US-MLE-03, -05, -06, -07 |
| ap-northeast-1 residency for JP customer data | Data residency contract; cross-region forbidden on customer paths | All 8 |
| Region-specific model artifacts (JP, US, EU buckets) | Same artifact, separate signed copies; never one bucket serving multi-region | All 8 |
These obligations are owned at the platform level (see `deep-dives/02-cross-story-platform-deep-dive.md`); per-story implementations conform to the platform contract rather than re-deriving these guarantees.
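As one concrete illustration of the table above — the "model + classifier version pinning in cache keys" obligation — the following minimal sketch derives an embedding cache key that goes stale automatically whenever either model version rotates. The key format and version-string names are assumptions for the sketch:

```python
import hashlib

def embedding_cache_key(text: str, embed_model_version: str, intent_clf_version: str) -> str:
    """Cache key that invalidates itself when either model version rotates.

    Version strings (e.g. "titan-lora-v7", "intent-distilbert-v12") are illustrative.
    """
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:32]
    # Pinning both versions into the key means an embedding-adapter or
    # intent-classifier rotation can never silently serve stale vectors.
    return f"emb:{embed_model_version}:{intent_clf_version}:{digest}"
```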
Kill-Switch Precedence
When multiple safety mechanisms fire simultaneously, this is the precedence order across all 8 stories:
- `global_ml_freeze=true` (set by Incident Commander during SEV-2/1) — every model is pinned to its last known good registry version. New deployments blocked. Drift triggers ignored.
- Per-model `*_promotion_enabled=false` — promotion gates frozen for that one model; existing prod model continues; safe-by-default.
- Per-technique flags within a story (e.g., `embedding_adapter_blue_green_enabled` within US-MLE-05) — finest granularity, honored last.

Default values when SSM is unreachable are safe-by-default per flag: `global_ml_freeze` defaults to false (do not freeze on a missing signal); `*_promotion_enabled` defaults to false (revert to last known good; never promote an unverified path).
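A minimal sketch of this resolution order, assuming the flags live in SSM Parameter Store under hypothetical `/mangaassist/ml/...` paths (the parameter names and the `promotion_allowed` helper are illustrative, not an existing module):

```python
import boto3
from botocore.exceptions import BotoCoreError, ClientError

def _get_flag(name: str, default: bool) -> bool:
    """Read a boolean flag from SSM Parameter Store, falling back to its safe default."""
    try:
        ssm = boto3.client("ssm")
        value = ssm.get_parameter(Name=name)["Parameter"]["Value"]
        return value.strip().lower() == "true"
    except (BotoCoreError, ClientError):
        return default  # SSM unreachable -> safe-by-default, as described above

def promotion_allowed(story_slug: str) -> bool:
    # 1. Global freeze wins over everything downstream (defaults to False: do not freeze on missing signal).
    if _get_flag("/mangaassist/ml/global_ml_freeze", default=False):
        return False
    # 2. Per-model promotion gate (defaults to False: never promote an unverified path).
    return _get_flag(f"/mangaassist/ml/{story_slug}/promotion_enabled", default=False)
```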
Shared Infrastructure (the Distributed-Monolith Risk)
Six stories (US-MLE-01, -02, -03, -05, -06, -07) read from the same feature store. Five stories (US-MLE-01, -02, -03, -06, -08) write to the same model registry. Three stories (US-MLE-02, -05, -06) read the same embedding namespace. The Cost-Optimization README flagged the analogous Redis SPOF; this folder's analogue is:
- Feature Store SPOF mitigated by per-story namespace isolation, point-in-time read snapshots, and read-replica fallback to last-good Iceberg manifest.
- Model Registry SPOF mitigated by registry replication across regions and by per-story registry contracts that fail closed (consumer pins to last-good model artifact if registry is unreachable).
- Embedding namespace SPOF mitigated by versioned blue/green indexes — readers can fall back to the previous green index if the new blue is corrupted.
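The fail-closed registry contract in the second bullet can be illustrated with a small sketch: each consumer caches the last-good artifact reference locally and keeps serving it when the registry is unreachable. The `registry_client.latest_approved(...)` call and the cache path are stand-ins, not a real SDK surface:

```python
import json
from pathlib import Path

# Local pin of the last artifact this consumer successfully resolved (path is illustrative).
LAST_GOOD = Path("/var/run/mangaassist/last_good_model.json")

def resolve_model_artifact(registry_client, model_name: str) -> str:
    """Resolve the latest approved artifact, failing closed to the pinned last-good copy."""
    try:
        artifact_uri = registry_client.latest_approved(model_name)  # stand-in registry call
        LAST_GOOD.parent.mkdir(parents=True, exist_ok=True)
        LAST_GOOD.write_text(json.dumps({"model": model_name, "uri": artifact_uri}))
        return artifact_uri
    except Exception:
        # Registry unreachable: never promote an unverified path; reuse the pinned artifact.
        # (If no pin exists yet, this raises — a brand-new consumer has nothing safe to serve.)
        return json.loads(LAST_GOOD.read_text())["uri"]
```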
Reviewer Sign-Off Status
| Lens | Sign-off | Outstanding |
|---|---|---|
| ML Scientist | Conditional | Per-story slice analysis (intent x language x cohort) before promotion |
| Principal Architect | Conditional | Cross-story platform contracts in place (deep-dives/02) |
| SRE / On-call | Conditional | Runbooks in US-MLE-02, -05, -06; SEV thresholds calibrated |
| Data Engineering | Conditional | Iceberg snapshot expiration policy; PIT correctness audit |
| Application Security / Privacy | Conditional | Embedding redaction, JP residency on US-MLE-03, -05, -06, -07 |
| FinOps | Conditional | Per-story training $/run + serving $/1k-inferences within budget |
Per-story details are in each file's "Multi-Reviewer Validation Findings & Resolutions" section.