ML Engineer User Stories — MangaAssist Chatbot (Amazon-Scale)
Overview
This directory contains the ML Engineer-owned user stories for the MangaAssist chatbot on Amazon. Where `Cost-Optimization-User-Stories/` is the FinOps lens, `Ground-Truth-Evolution/` is the drift lens, and `RAG-MCP-Integration/` is the tool-interface lens, this folder is the day-to-day ML Engineer lens — the engineer who owns training pipelines, label engineering, model registries, online ranking, drift detection, retrain cadence, and SageMaker serving for the classical (non-FM) ML systems on the platform.
Eight production ML systems sit on the chatbot's hot path. Each one has a user story here describing how it is built, trained, evaluated, deployed, monitored, and retrained at Amazon-scale (millions of requests/day, bilingual JP/EN traffic, ap-northeast-1 data residency, multi-region serving).
User Stories
| # | User Story | Primary System | SageMaker Surface | Headline Metric (Target) |
|---|---|---|---|---|
| US-MLE-01 | Intent Classifier Retraining Pipeline | DistilBERT (multilingual) | Real-time endpoint, ml.g5.xlarge | Val accuracy ≥ 0.90, p95 ≤ 15ms |
| US-MLE-02 | Cross-Encoder Reranker Training & Online Ranking | MiniLM-L6 cross-encoder | Real-time, ml.g5.2xlarge MME | NDCG@10 ≥ 0.78, +3% CTR canary |
| US-MLE-03 | Sentiment + ABSA Lifecycle | Multi-task DeBERTa-v3 | Async + batch transform | Macro-F1 ≥ 0.85 / 0.74, quarterly relabel |
| US-MLE-04 | Demand Forecasting Pipeline | Temporal Fusion Transformer | Batch transform, daily | sMAPE ≤ 14%, promo-uplift err ≤ 25% |
| US-MLE-05 | Embedding Adapter Fine-Tuning + Re-Index | Titan + LoRA adapter | SageMaker Training + batch | Recall@10 ≥ 0.92, blue/green re-index ≤ 4h |
| US-MLE-06 | Recommendation Training + Cold-Start | Two-tower + Personalize hybrid | Personalize + custom serving | HR@20 ≥ 0.18, cold-start CTR ≥ 80% of warm |
| US-MLE-07 | Spam/Abuse Classifier + Adversarial Labels | LightGBM + LLM-distill ensemble | Real-time + nightly batch | P@95R ≥ 0.93, weekly adversarial retrain |
| US-MLE-08 | Cover-Art Style Classifier + Image Monitoring | EfficientNet-V2-S | Real-time + async, ml.g5.xlarge | Top-1 ≥ 0.87, AI-gen detection AUC ≥ 0.95 |
Effort Distribution (Engineering Cost Share)
```mermaid
pie title ML Engineer Effort Share Across the 8 Systems (post-launch steady state)
    "US-MLE-01 Intent Classifier" : 12
    "US-MLE-02 Cross-Encoder Reranker" : 14
    "US-MLE-03 Sentiment / ABSA" : 11
    "US-MLE-04 Demand Forecasting" : 10
    "US-MLE-05 Embedding Adapter + Re-Index" : 18
    "US-MLE-06 Recommendation" : 16
    "US-MLE-07 Spam / Abuse" : 11
    "US-MLE-08 Cover-Art Style" : 8
```
The two largest slices (US-MLE-05 embedding adapter, US-MLE-06 recommendation) reflect, respectively, the operational cost of corpus-scale re-indexing and the experimentation overhead of personalization. US-MLE-08 (cover-art) is the smallest because the cover catalog turns over more slowly than the text corpus and its traffic is bursty around new releases.
Dependency & Sequencing Graph
The 8 stories are not independent. All of them depend on shared platform infrastructure (feature store, model registry, drift hub, label platform, online inference fabric), and several have data dependencies on each other. Implementing in the wrong order creates either silent quality regressions (no drift detection to catch them) or rollback-impossible failures (no registry to pin versions to).
```mermaid
graph TB
    PLAT[ML Platform<br/>Feature Store + Registry + Drift Hub<br/>Label Platform]:::plat
    US01[US-MLE-01 Intent Classifier<br/>Routes traffic to MCPs]
    US02[US-MLE-02 Cross-Encoder Reranker<br/>Ranks RAG retrieval]
    US03[US-MLE-03 Sentiment + ABSA<br/>Reads 50M reviews]
    US04[US-MLE-04 Demand Forecasting<br/>Inventory signal]
    US05[US-MLE-05 Embedding Adapter<br/>Underpins retrieval]
    US06[US-MLE-06 Recommendation<br/>Personalize + 2-tower]
    US07[US-MLE-07 Spam Classifier<br/>Filters review corpus]
    US08[US-MLE-08 Cover-Art Style<br/>Visual classification]
    PLAT --> US01
    PLAT --> US02
    PLAT --> US03
    PLAT --> US04
    PLAT --> US05
    PLAT --> US06
    PLAT --> US07
    PLAT --> US08
    US05 -->|embeddings| US02
    US05 -->|embeddings| US06
    US01 -->|intent label| US02
    US01 -->|intent label| US06
    US07 -->|spam-clean reviews| US03
    US03 -->|aspect signal| US06
    US04 -->|stock signal| US06
    classDef plat fill:#fd2,stroke:#333,stroke-width:2px
    style US05 fill:#9cf,stroke:#333
    style US01 fill:#9cf,stroke:#333
    style PLAT fill:#fd2,stroke:#333
```
Recommended implementation order (a minimal dependency-order sketch follows the list):
- ML Platform (cross-cutting) first — feature store contract, model registry schema, drift hub, label platform. Documented in `deep-dives/02-cross-story-platform-deep-dive.md`. Without this, every story re-implements its own infra and they drift apart.
- US-MLE-01 (Intent Classifier) — produces the intent label that US-MLE-02 and US-MLE-06 read. Must ship before any consumer optimizes against intent-conditional metrics.
- US-MLE-05 (Embedding Adapter) — produces the embedding model that US-MLE-02 reranker and US-MLE-06 two-tower both consume. Re-index throughput sets the ceiling on how often the reranker's training set can be refreshed.
- US-MLE-07 (Spam Classifier) — must run before US-MLE-03 ABSA training because ABSA reads the cleaned review corpus. ABSA quality is bounded by spam recall.
- US-MLE-02, US-MLE-03, US-MLE-04, US-MLE-06, US-MLE-08 in parallel — these are the leaf models. Each can ship independently once the platform + upstream signal is in place.
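To make this ordering checkable rather than tribal knowledge, the dependency edges from the graph above can be encoded and a legal build order derived mechanically. The sketch below is illustrative only — the `DEPENDENCIES` mapping is hand-copied from the diagram, not generated from it — and uses only the Python standard library:

```python
# Hypothetical helper: derive a legal implementation order from the
# cross-story dependency graph shown above.
from graphlib import TopologicalSorter

# story -> set of prerequisites that must ship first
DEPENDENCIES = {
    "US-MLE-01": {"PLAT"},
    "US-MLE-02": {"PLAT", "US-MLE-01", "US-MLE-05"},                # intent label + embeddings
    "US-MLE-03": {"PLAT", "US-MLE-07"},                             # spam-clean reviews
    "US-MLE-04": {"PLAT"},
    "US-MLE-05": {"PLAT"},
    "US-MLE-06": {"PLAT", "US-MLE-01", "US-MLE-05", "US-MLE-03", "US-MLE-04"},
    "US-MLE-07": {"PLAT"},
    "US-MLE-08": {"PLAT"},
}

if __name__ == "__main__":
    # static_order() yields prerequisites before their consumers: PLAT first, leaf models last.
    order = list(TopologicalSorter(DEPENDENCIES).static_order())
    print(" -> ".join(order))
```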
Owner Mapping
| # | User Story | Suggested Owner Role |
|---|---|---|
| US-MLE-01 | Intent Classifier Retraining | Senior ML Engineer (NLP, multilingual) |
| US-MLE-02 | Cross-Encoder Reranker | ML Engineer (Search/Ranking) |
| US-MLE-03 | Sentiment + ABSA | ML Engineer (NLU) + Annotation Vendor PM |
| US-MLE-04 | Demand Forecasting | ML Engineer (Time-Series) + Inventory PM |
| US-MLE-05 | Embedding Adapter + Re-Index | Senior ML Engineer (Retrieval) + Data Platform |
| US-MLE-06 | Recommendation | ML Engineer (RecSys) + Personalize PM |
| US-MLE-07 | Spam / Abuse Classifier | ML Engineer (Trust & Safety) |
| US-MLE-08 | Cover-Art Style | ML Engineer (CV) |
The platform pieces (feature store, registry, drift hub, label platform) are owned by an ML Platform Lead who acts as the coordination point across all 8 stories. This is the same pattern that the Cost-Optimization stories establish for the FinOps Lead.
Unified KPI Rollup
| # | Story | Headline Metric | Target | Baseline | Status |
|---|---|---|---|---|---|
| US-MLE-01 | Intent Classifier | Multilingual val accuracy | ≥ 0.90 | 0.86 | Track |
| US-MLE-01 | Intent Classifier | p95 inference latency | ≤ 15 ms | 22 ms | Track |
| US-MLE-02 | Reranker | NDCG@10 on labeled holdout | ≥ 0.78 | 0.71 | Track |
| US-MLE-02 | Reranker | Canary CTR uplift | +3% | +0% | Track |
| US-MLE-03 | Sentiment | Macro-F1 (3-class) | ≥ 0.85 | 0.81 | Track |
| US-MLE-03 | ABSA | Aspect-F1 | ≥ 0.74 | 0.66 | Track |
| US-MLE-04 | Demand Forecast | sMAPE 7-day SKU | ≤ 14% | 19% | Track |
| US-MLE-04 | Demand Forecast | Promo uplift error | ≤ 25% | 41% | Track |
| US-MLE-05 | Embedding Adapter | Recall@10 (JP+EN) | ≥ 0.92 | 0.86 | Track |
| US-MLE-05 | Embedding Adapter | Re-index wall-clock (5M docs) | ≤ 4h | 11h | Track |
| US-MLE-06 | Recommendation | HR@20 | ≥ 0.18 | 0.13 | Track |
| US-MLE-06 | Recommendation | Cold-start CTR / warm CTR | ≥ 0.80 | 0.42 | Track |
| US-MLE-07 | Spam | Precision @ 95% recall | ≥ 0.93 | 0.84 | Track |
| US-MLE-07 | Spam | Adversarial drift detection lag | ≤ 7 days | 30+ days | Track |
| US-MLE-08 | Cover-Art Style | Top-1 accuracy | ≥ 0.87 | 0.79 | Track |
| US-MLE-08 | Cover-Art Style | AI-gen detection AUC | ≥ 0.95 | n/a (new) | Track |
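These targets are intended to act as promotion gates, not dashboard decoration. A minimal sketch of such a gate check is shown below; the metric names and `higher_is_better` flags are assumptions for illustration, not an existing API:

```python
# Illustrative promotion-gate check against a subset of the KPI targets above.
from dataclasses import dataclass

@dataclass(frozen=True)
class KpiTarget:
    story: str
    metric: str
    target: float
    higher_is_better: bool

TARGETS = [
    KpiTarget("US-MLE-01", "val_accuracy", 0.90, True),
    KpiTarget("US-MLE-01", "p95_latency_ms", 15.0, False),
    KpiTarget("US-MLE-05", "recall_at_10", 0.92, True),
]

def gate(observed: dict, targets=TARGETS) -> list:
    """Return the list of KPI targets the candidate model fails to meet."""
    failures = []
    for t in targets:
        value = observed.get(t.metric)
        if value is None:
            failures.append(f"{t.story}/{t.metric}: missing")  # fail closed on missing metrics
        elif (value < t.target) if t.higher_is_better else (value > t.target):
            failures.append(f"{t.story}/{t.metric}: {value} vs target {t.target}")
    return failures
```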
How to Read This Folder
Each story file contains:
- User Story + Acceptance Criteria — the persona and the testable definition of done
- High-Level Design — production surface, end-to-end ML lifecycle diagram, data contracts
- Low-Level Design — feature pipeline, training pipeline, offline eval, online serving, shadow/canary, drift detection, retrain trigger, multilingual handling, all with concrete code
- Monitoring & Metrics — online + quality + drift + cost
- Risks & Mitigations
- Deep Dive — Why This Works at Amazon-Scale on Manga Workload
- Real-World Validation — industry analogues, math validation
- Cross-Story Interactions — explicit edges to other US-MLE stories and to Cost-Optimization stories
- Rollback & Experimentation — shadow plan, canary thresholds, kill switch
- Multi-Reviewer Validation Findings (S1/S2/S3) — ML Scientist, SRE, Data Eng, AppSec/Privacy, FinOps lenses
The companion docs:
- `deep-dives/00-foundations-and-primitives-for-ml-engineering.md` — the seven primitives every story uses (label, feature, training, evaluation, promotion, drift, retrain). Read this first.
- `deep-dives/01-deep-dive-per-ml-story.md` — uniform per-story implementation walkthrough that applies the primitives to each story.
- `deep-dives/02-cross-story-platform-deep-dive.md` — the ML platform components that all 8 stories share.
- `grill-chains/ml-engineer-grill-chains.md` — 7-round Q&A drill per story for hiring / interview prep.
Cross-References to Existing Folders
| Existing Folder | What It Covers | How It Relates |
|---|---|---|
| `Cost-Optimization-User-Stories/` | FinOps lens — cost reduction per service | US-02 there is the inference-cost view of US-MLE-01 here |
| `Ground-Truth-Evolution/ML-Scenarios/` | Drift lens — what breaks over time | Each US-MLE story links to its drift counterpart |
| `RAG-MCP-Integration/` | Tool-interface lens | US-MLE-02 and -05 sit inside the retrieval pipeline described there |
| `Fine-Tuning-Foundational-Models/` | Training infra scenarios | US-MLE-01 reuses the training infra patterns documented there |
| `Model-Inference/` | Inference + metrics taxonomy | All 8 stories use the metrics taxonomy defined there |
| `POC-to-Production-War-Story/` | Production failure modes | US-MLE stories include "incidents this would have caught" |
Cross-Cutting Concerns Inherited by All 8 Stories
These are non-negotiable shared infrastructure obligations. The Cost-Optimization README enumerates them for the cost lens; this folder inherits the same set, with ML-specific additions:
| Concern | Why required | Applies to |
|---|---|---|
| `request_id` (UUID) on every inference call | Distributed tracing, model attribution, incident forensics | All 8 |
| Model + classifier version pinning in cache keys | Embedding rotation otherwise serves stale vectors silently | US-MLE-01, -02, -05, -06 |
| Language stratification in metrics (EN/JP/mixed) | Bilingual store; aggregated metrics hide JP regressions | All 8 |
| Drift detection (input/label/prediction/concept) | Calibration on month-1 traffic breaks at month-6 | All 8 |
| Point-in-time correctness in feature store | Training-serving skew causes silent quality loss | US-MLE-02, -04, -06, -07 |
| Schema versioning on training datasets | A column type change retroactively poisons last month's runs | All 8 |
| Audit trail on label edits + model promotions | Required for regulated-data review and post-incident replay | All 8 |
| PII redaction at boundary (before training, embedding, archiving) | GDPR / APPI / breach risk; embeddings are quasi-reversible | US-MLE-03, -05, -06, -07 |
| ap-northeast-1 residency for JP customer data | Data residency contract; cross-region forbidden on customer paths | All 8 |
| Region-specific model artifacts (JP, US, EU buckets) | Same artifact, separate signed copies; never one bucket serving multi-region | All 8 |
These obligations are owned at the platform level (see `deep-dives/02-cross-story-platform-deep-dive.md`); per-story implementations conform to the platform contract rather than re-deriving these guarantees.
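As one concrete illustration of the table above — the "model + classifier version pinning in cache keys" obligation — the following minimal sketch derives an embedding cache key that goes stale automatically whenever either model version rotates. The key format and version-string names are assumptions for the sketch:

```python
import hashlib

def embedding_cache_key(text: str, embed_model_version: str, intent_clf_version: str) -> str:
    """Cache key that invalidates itself when either model version rotates.

    Version strings (e.g. "titan-lora-v7", "intent-distilbert-v12") are illustrative.
    """
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:32]
    # Pinning both versions into the key means an embedding-adapter or
    # intent-classifier rotation can never silently serve stale vectors.
    return f"emb:{embed_model_version}:{intent_clf_version}:{digest}"
```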
Kill-Switch Precedence
When multiple safety mechanisms fire simultaneously, this is the precedence order across all 8 stories:
- `global_ml_freeze=true` (set by Incident Commander during SEV-2/1) — every model is pinned to its last known good registry version. New deployments blocked. Drift triggers ignored.
- Per-model `*_promotion_enabled=false` — promotion gates frozen for that one model; existing prod model continues; safe-by-default.
- Per-technique flags within a story (e.g., `embedding_adapter_blue_green_enabled` within US-MLE-05) — finest granularity, honored last.

Default values when SSM is unreachable are safe-by-default per flag: `global_ml_freeze` defaults to false (do not freeze on a missing signal); `*_promotion_enabled` defaults to false (revert to last known good; never promote an unverified path).
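A minimal sketch of this resolution order, assuming the flags live in SSM Parameter Store under hypothetical `/mangaassist/ml/...` paths (the parameter names and the `promotion_allowed` helper are illustrative, not an existing module):

```python
import boto3
from botocore.exceptions import BotoCoreError, ClientError

def _get_flag(name: str, default: bool) -> bool:
    """Read a boolean flag from SSM Parameter Store, falling back to its safe default."""
    try:
        ssm = boto3.client("ssm")
        value = ssm.get_parameter(Name=name)["Parameter"]["Value"]
        return value.strip().lower() == "true"
    except (BotoCoreError, ClientError):
        return default  # SSM unreachable -> safe-by-default, as described above

def promotion_allowed(story_slug: str) -> bool:
    # 1. Global freeze wins over everything downstream (defaults to False: do not freeze on missing signal).
    if _get_flag("/mangaassist/ml/global_ml_freeze", default=False):
        return False
    # 2. Per-model promotion gate (defaults to False: never promote an unverified path).
    return _get_flag(f"/mangaassist/ml/{story_slug}/promotion_enabled", default=False)
```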
Shared Infrastructure (the Distributed-Monolith Risk)
Six stories (US-MLE-01, -02, -03, -05, -06, -07) read from the same feature store. Five stories (US-MLE-01, -02, -03, -06, -08) write to the same model registry. Three stories (US-MLE-02, -05, -06) read the same embedding namespace. The Cost-Optimization README flagged the analogous Redis SPOF; this folder's analogue is:
- Feature Store SPOF mitigated by per-story namespace isolation, point-in-time read snapshots, and read-replica fallback to last-good Iceberg manifest.
- Model Registry SPOF mitigated by registry replication across regions and by per-story registry contracts that fail closed (consumer pins to last-good model artifact if registry is unreachable).
- Embedding namespace SPOF mitigated by versioned blue/green indexes — readers can fall back to the previous green index if the new blue is corrupted.
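The fail-closed registry contract in the second bullet can be illustrated with a small sketch: each consumer caches the last-good artifact reference locally and keeps serving it when the registry is unreachable. The `registry_client.latest_approved(...)` call and the cache path are stand-ins, not a real SDK surface:

```python
import json
from pathlib import Path

# Local pin of the last artifact this consumer successfully resolved (path is illustrative).
LAST_GOOD = Path("/var/run/mangaassist/last_good_model.json")

def resolve_model_artifact(registry_client, model_name: str) -> str:
    """Resolve the latest approved artifact, failing closed to the pinned last-good copy."""
    try:
        artifact_uri = registry_client.latest_approved(model_name)  # stand-in registry call
        LAST_GOOD.parent.mkdir(parents=True, exist_ok=True)
        LAST_GOOD.write_text(json.dumps({"model": model_name, "uri": artifact_uri}))
        return artifact_uri
    except Exception:
        # Registry unreachable: never promote an unverified path; reuse the pinned artifact.
        # (If no pin exists yet, this raises — a brand-new consumer has nothing safe to serve.)
        return json.loads(LAST_GOOD.read_text())["uri"]
```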
Reviewer Sign-Off Status
| Lens | Sign-off | Outstanding |
|---|---|---|
| ML Scientist | Conditional | Per-story slice analysis (intent x language x cohort) before promotion |
| Principal Architect | Conditional | Cross-story platform contracts in place (deep-dives/02) |
| SRE / On-call | Conditional | Runbooks in US-MLE-02, -05, -06; SEV thresholds calibrated |
| Data Engineering | Conditional | Iceberg snapshot expiration policy; PIT correctness audit |
| Application Security / Privacy | Conditional | Embedding redaction, JP residency on US-MLE-03, -05, -06, -07 |
| FinOps | Conditional | Per-story training $/run + serving $/1k-inferences within budget |
Per-story details are in each file's "Multi-Reviewer Validation Findings & Resolutions" section.