
ML Engineer User Stories — MangaAssist Chatbot (Amazon-Scale)

Overview

This directory contains the ML Engineer-owned user stories for the MangaAssist chatbot on Amazon. Where Cost-Optimization-User-Stories/ is the FinOps lens, Ground-Truth-Evolution/ is the drift lens, and RAG-MCP-Integration/ is the tool-interface lens, this folder is the day-to-day ML Engineer lens — the person who owns training pipelines, label engineering, model registries, online ranking, drift detection, retrain cadence, and SageMaker serving for the classical (non-FM) ML systems on the platform.

Eight production ML systems sit on the chatbot's hot path. Each one has a user story here describing how it is built, trained, evaluated, deployed, monitored, and retrained at Amazon-scale (millions of requests/day, bilingual JP/EN traffic, ap-northeast-1 data residency, multi-region serving).

User Stories

| # | User Story | Primary System | SageMaker Surface | Headline Metric (Target) |
|---|---|---|---|---|
| US-MLE-01 | Intent Classifier Retraining Pipeline | DistilBERT (multilingual) | Real-time endpoint, ml.g5.xlarge | Val accuracy ≥ 0.90, p95 ≤ 15 ms |
| US-MLE-02 | Cross-Encoder Reranker Training & Online Ranking | MiniLM-L6 cross-encoder | Real-time, ml.g5.2xlarge MME | NDCG@10 ≥ 0.78, +3% CTR canary |
| US-MLE-03 | Sentiment + ABSA Lifecycle | Multi-task DeBERTa-v3 | Async + batch transform | Macro-F1 ≥ 0.85 / 0.74, quarterly relabel |
| US-MLE-04 | Demand Forecasting Pipeline | Temporal Fusion Transformer | Batch transform, daily | sMAPE ≤ 14%, promo-uplift err ≤ 25% |
| US-MLE-05 | Embedding Adapter Fine-Tuning + Re-Index | Titan + LoRA adapter | SageMaker Training + batch | Recall@10 ≥ 0.92, blue/green re-index ≤ 4h |
| US-MLE-06 | Recommendation Training + Cold-Start | Two-tower + Personalize hybrid | Personalize + custom serving | HR@20 ≥ 0.18, cold-start CTR ≥ 80% of warm |
| US-MLE-07 | Spam/Abuse Classifier + Adversarial Labels | LightGBM + LLM-distill ensemble | Real-time + nightly batch | P@95R ≥ 0.93, weekly adversarial retrain |
| US-MLE-08 | Cover-Art Style Classifier + Image Monitoring | EfficientNet-V2-S | Real-time + async, ml.g5.xlarge | Top-1 ≥ 0.87, AI-gen detection AUC ≥ 0.95 |

Effort Distribution (Engineering Cost Share)

```mermaid
pie title ML Engineer Effort Share Across the 8 Systems (post-launch steady state)
    "US-MLE-01 Intent Classifier" : 12
    "US-MLE-02 Cross-Encoder Reranker" : 14
    "US-MLE-03 Sentiment / ABSA" : 11
    "US-MLE-04 Demand Forecasting" : 10
    "US-MLE-05 Embedding Adapter + Re-Index" : 18
    "US-MLE-06 Recommendation" : 16
    "US-MLE-07 Spam / Abuse" : 11
    "US-MLE-08 Cover-Art Style" : 8
```

The two largest slices (US-MLE-05 embedding adapter, US-MLE-06 recommendation) reflect, respectively, the operational cost of corpus-scale re-indexing and the experimentation overhead of personalization. US-MLE-08 (cover-art) is the smallest because the catalog turns over more slowly than text and traffic is bursty around new releases.


Dependency & Sequencing Graph

The 8 stories are not independent. Five of them share infrastructure (feature store, model registry, drift hub, label platform, online inference fabric) and several have data dependencies on each other. Implementing in the wrong order creates either silent quality regressions (no drift detection to catch them) or rollback-impossible failures (no registry to pin versions to).

```mermaid
graph TB
    PLAT[ML Platform<br/>Feature Store + Registry + Drift Hub<br/>Label Platform]:::plat

    US01[US-MLE-01 Intent Classifier<br/>Routes traffic to MCPs]
    US02[US-MLE-02 Cross-Encoder Reranker<br/>Ranks RAG retrieval]
    US03[US-MLE-03 Sentiment + ABSA<br/>Reads 50M reviews]
    US04[US-MLE-04 Demand Forecasting<br/>Inventory signal]
    US05[US-MLE-05 Embedding Adapter<br/>Underpins retrieval]
    US06[US-MLE-06 Recommendation<br/>Personalize + 2-tower]
    US07[US-MLE-07 Spam Classifier<br/>Filters review corpus]
    US08[US-MLE-08 Cover-Art Style<br/>Visual classification]

    PLAT --> US01
    PLAT --> US02
    PLAT --> US03
    PLAT --> US04
    PLAT --> US05
    PLAT --> US06
    PLAT --> US07
    PLAT --> US08

    US05 -->|embeddings| US02
    US05 -->|embeddings| US06
    US01 -->|intent label| US02
    US01 -->|intent label| US06
    US07 -->|spam-clean reviews| US03
    US03 -->|aspect signal| US06
    US04 -->|stock signal| US06

    classDef plat fill:#fd2,stroke:#333,stroke-width:2px
    style US05 fill:#9cf,stroke:#333
    style US01 fill:#9cf,stroke:#333
    style PLAT fill:#fd2,stroke:#333
```

Recommended implementation order:

  1. ML Platform (cross-cutting) first — feature store contract, model registry schema, drift hub, label platform. Documented in deep-dives/02-cross-story-platform-deep-dive.md. Without this, every story re-implements its own infra and they drift apart.
  2. US-MLE-01 (Intent Classifier) — produces the intent label that US-MLE-02 and US-MLE-06 read. Must ship before any consumer optimizes against intent-conditional metrics.
  3. US-MLE-05 (Embedding Adapter) — produces the embedding model that US-MLE-02 reranker and US-MLE-06 two-tower both consume. Re-index throughput sets the ceiling on how often the reranker's training set can be refreshed.
  4. US-MLE-07 (Spam Classifier) — must run before US-MLE-03 ABSA training because ABSA reads the cleaned review corpus. ABSA quality is bounded by spam recall.
  5. US-MLE-02, US-MLE-03, US-MLE-04, US-MLE-06, US-MLE-08 in parallel — these are the leaf models. Each can ship independently once the platform + upstream signal is in place.
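The sequencing argument above can be checked mechanically from the dependency graph. A minimal sketch using Python's stdlib `graphlib` (the short node names are shorthand for the story IDs in the diagram, not identifiers from the repo):

```python
from graphlib import TopologicalSorter

# Edges from the dependency graph: each story maps to its predecessors.
deps = {
    "US01": {"PLAT"},
    "US02": {"PLAT", "US05", "US01"},
    "US03": {"PLAT", "US07"},
    "US04": {"PLAT"},
    "US05": {"PLAT"},
    "US06": {"PLAT", "US05", "US01", "US03", "US04"},
    "US07": {"PLAT"},
    "US08": {"PLAT"},
}

# static_order() yields predecessors before dependents: the platform
# first, the leaf models (US-MLE-02/-03/-06/-08) only after their inputs.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Any valid order it prints matches the recommended sequence: PLAT first, then US-MLE-01/-05/-07 before their consumers.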

Owner Mapping

| # | User Story | Suggested Owner Role |
|---|---|---|
| US-MLE-01 | Intent Classifier Retraining | Senior ML Engineer (NLP, multilingual) |
| US-MLE-02 | Cross-Encoder Reranker | ML Engineer (Search/Ranking) |
| US-MLE-03 | Sentiment + ABSA | ML Engineer (NLU) + Annotation Vendor PM |
| US-MLE-04 | Demand Forecasting | ML Engineer (Time-Series) + Inventory PM |
| US-MLE-05 | Embedding Adapter + Re-Index | Senior ML Engineer (Retrieval) + Data Platform |
| US-MLE-06 | Recommendation | ML Engineer (RecSys) + Personalize PM |
| US-MLE-07 | Spam / Abuse Classifier | ML Engineer (Trust & Safety) |
| US-MLE-08 | Cover-Art Style | ML Engineer (CV) |

The platform pieces (feature store, registry, drift hub, label platform) are owned by an ML Platform Lead who acts as the coordination point across all 8 stories. This is the same pattern that the Cost-Optimization stories establish for the FinOps Lead.


Unified KPI Rollup

| # | Story | Headline Metric | Target | Baseline | Status |
|---|---|---|---|---|---|
| US-MLE-01 | Intent Classifier | Multilingual val accuracy | ≥ 0.90 | 0.86 | Track |
| US-MLE-01 | Intent Classifier | p95 inference latency | ≤ 15 ms | 22 ms | Track |
| US-MLE-02 | Reranker | NDCG@10 on labeled holdout | ≥ 0.78 | 0.71 | Track |
| US-MLE-02 | Reranker | Canary CTR uplift | +3% | +0% | Track |
| US-MLE-03 | Sentiment | Macro-F1 (3-class) | ≥ 0.85 | 0.81 | Track |
| US-MLE-03 | ABSA | Aspect-F1 | ≥ 0.74 | 0.66 | Track |
| US-MLE-04 | Demand Forecast | sMAPE, 7-day SKU | ≤ 14% | 19% | Track |
| US-MLE-04 | Demand Forecast | Promo uplift error | ≤ 25% | 41% | Track |
| US-MLE-05 | Embedding Adapter | Recall@10 (JP+EN) | ≥ 0.92 | 0.86 | Track |
| US-MLE-05 | Embedding Adapter | Re-index wall-clock (5M docs) | ≤ 4h | 11h | Track |
| US-MLE-06 | Recommendation | HR@20 | ≥ 0.18 | 0.13 | Track |
| US-MLE-06 | Recommendation | Cold-start CTR / warm CTR | ≥ 0.80 | 0.42 | Track |
| US-MLE-07 | Spam | Precision @ 95% recall | ≥ 0.93 | 0.84 | Track |
| US-MLE-07 | Spam | Adversarial drift detection lag | ≤ 7 days | 30+ days | Track |
| US-MLE-08 | Cover-Art Style | Top-1 accuracy | ≥ 0.87 | 0.79 | Track |
| US-MLE-08 | Cover-Art Style | AI-gen detection AUC | ≥ 0.95 | n/a (new) | Track |
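For concreteness, the US-MLE-04 headline metric can be computed as follows. This sketch uses one common sMAPE variant (2·|F−A| / (|A|+|F|), averaged and scaled to percent); the story files may pin a different variant, and the 7-day series here is illustrative, not real data:

```python
def smape(actual, forecast):
    """Symmetric MAPE in percent. Variant: mean of 2|F-A| / (|A|+|F|)."""
    assert len(actual) == len(forecast)
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = abs(a) + abs(f)
        total += 0.0 if denom == 0 else 2.0 * abs(f - a) / denom
    return 100.0 * total / len(actual)

# Toy 7-day SKU demand series (illustrative numbers only)
actual   = [120, 135, 90, 210, 180, 95, 160]
forecast = [110, 140, 100, 190, 185, 90, 150]
print(round(smape(actual, forecast), 1))  # → 6.8, well under the 14% target
```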

How to Read This Folder

Each story file contains:

  1. User Story + Acceptance Criteria — the persona and the testable definition of done
  2. High-Level Design — production surface, end-to-end ML lifecycle diagram, data contracts
  3. Low-Level Design — feature pipeline, training pipeline, offline eval, online serving, shadow/canary, drift detection, retrain trigger, multilingual handling, all with concrete code
  4. Monitoring & Metrics — online + quality + drift + cost
  5. Risks & Mitigations
  6. Deep Dive — Why This Works at Amazon-Scale on Manga Workload
  7. Real-World Validation — industry analogues, math validation
  8. Cross-Story Interactions — explicit edges to other US-MLE stories and to Cost-Optimization stories
  9. Rollback & Experimentation — shadow plan, canary thresholds, kill switch
  10. Multi-Reviewer Validation Findings (S1/S2/S3) — ML Scientist, SRE, Data Eng, AppSec/Privacy, FinOps lenses

The companion docs:


Cross-References to Existing Folders

| Existing Folder | What It Covers | How It Relates |
|---|---|---|
| Cost-Optimization-User-Stories/ | FinOps lens — cost reduction per service | US-02 there is the inference-cost view of US-MLE-01 here |
| Ground-Truth-Evolution/ML-Scenarios/ | Drift lens — what breaks over time | Each US-MLE story links to its drift counterpart |
| RAG-MCP-Integration/ | Tool-interface lens | US-MLE-02, -05 sit inside the retrieval pipeline described there |
| Fine-Tuning-Foundational-Models/ | Training infra scenarios | US-MLE-01 reuses the training infra patterns documented there |
| Model-Inference/ | Inference + metrics taxonomy | All 8 stories use the metrics taxonomy defined there |
| POC-to-Production-War-Story/ | Production failure modes | US-MLE stories include "incidents this would have caught" |

Cross-Cutting Concerns Inherited by All 8 Stories

These are non-negotiable shared infrastructure obligations. The Cost-Optimization README enumerates them for the cost lens; this folder inherits the same set, with ML-specific additions:

| Concern | Why required | Applies to |
|---|---|---|
| request_id (UUID) on every inference call | Distributed tracing, model attribution, incident forensics | All 8 |
| Model + classifier version pinning in cache keys | Embedding rotation otherwise serves stale vectors silently | US-MLE-01, -02, -05, -06 |
| Language stratification in metrics (EN/JP/mixed) | Bilingual store; aggregated metrics hide JP regressions | All 8 |
| Drift detection (input/label/prediction/concept) | Calibration on month-1 traffic breaks at month-6 | All 8 |
| Point-in-time correctness in feature store | Training-serving skew causes silent quality loss | US-MLE-02, -04, -06, -07 |
| Schema versioning on training datasets | A column type change retroactively poisons last month's runs | All 8 |
| Audit trail on label edits + model promotions | Required for regulated-data review and post-incident replay | All 8 |
| PII redaction at boundary (before training, embedding, archiving) | GDPR / APPI / breach risk; embeddings are quasi-reversible | US-MLE-03, -05, -06, -07 |
| ap-northeast-1 residency for JP customer data | Data residency contract; cross-region forbidden on customer paths | All 8 |
| Region-specific model artifacts (JP, US, EU buckets) | Same artifact, separate signed copies; never one bucket serving multi-region | All 8 |

These obligations are owned at the platform level (see deep-dives/02-cross-story-platform-deep-dive.md); per-story implementations conform to the platform contract rather than re-deriving these guarantees.
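To illustrate the version-pinning concern from the table above, here is a minimal sketch of a cache-key builder. The function names and field set (`build_cache_key`, `payload_digest`) are assumptions for illustration, not the platform contract:

```python
import hashlib

def payload_digest(text: str) -> str:
    """Stable digest of the request payload (illustrative helper)."""
    return hashlib.sha256(text.encode()).hexdigest()[:16]

def build_cache_key(namespace: str, model_id: str,
                    model_version: str, payload_hash: str) -> str:
    """Key that pins model + version, so rotating an embedding model
    misses old entries instead of silently serving stale vectors."""
    raw = f"{namespace}:{model_id}:{model_version}:{payload_hash}"
    return hashlib.sha256(raw.encode()).hexdigest()

k_v1 = build_cache_key("emb", "titan-lora", "v1", payload_digest("one piece vol 1"))
k_v2 = build_cache_key("emb", "titan-lora", "v2", payload_digest("one piece vol 1"))
assert k_v1 != k_v2  # adapter rotation invalidates, never aliases, old entries
```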


Kill-Switch Precedence

When multiple safety mechanisms fire simultaneously, this is the precedence order across all 8 stories:

  1. global_ml_freeze=true (set by Incident Commander during SEV-2/1) — every model is pinned to its last known good registry version. New deployments blocked. Drift triggers ignored.
  2. Per-model *_promotion_enabled=false — promotion gates frozen for that one model; existing prod model continues; safe-by-default.
  3. Per-technique flags within a story (e.g., embedding_adapter_blue_green_enabled within US-MLE-05) — finest granularity, honored last.

Default value when SSM is unreachable: safe-by-default per flag. global_ml_freeze defaults to false (do not freeze on missing signal); *_promotion_enabled defaults to false (revert to last known good, never promote an unverified path).
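The precedence order and the unreachable-SSM defaults can be sketched as follows. The flag-lookup shape is an assumption (an unreachable SSM is modeled as a missing key); the real resolver lives at the platform layer:

```python
# Safe defaults applied when a flag cannot be read from SSM.
SAFE_DEFAULTS = {
    "global_ml_freeze": False,   # do not freeze on missing signal
    "promotion_enabled": False,  # never promote an unverified path
}

def flag(ssm_values: dict, name: str, kind: str) -> bool:
    """Read one flag, falling back to its safe default when missing."""
    return ssm_values.get(name, SAFE_DEFAULTS[kind])

def may_promote(ssm_values: dict, model: str) -> bool:
    # 1. Incident-commander global freeze wins over everything below it.
    if flag(ssm_values, "global_ml_freeze", "global_ml_freeze"):
        return False
    # 2. Per-model promotion gate; fails closed when SSM is unreachable.
    return flag(ssm_values, f"{model}_promotion_enabled", "promotion_enabled")

# Freeze set during a SEV blocks promotion even when the model gate is open.
assert may_promote({"global_ml_freeze": True,
                    "us_mle_05_promotion_enabled": True}, "us_mle_05") is False
# SSM unreachable (no flags readable): promotion fails closed.
assert may_promote({}, "us_mle_05") is False
```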


Shared Infrastructure (the Distributed-Monolith Risk)

Six stories (US-MLE-01, -02, -03, -05, -06, -07) read from the same feature store. Five stories (US-MLE-01, -02, -03, -06, -08) write to the same model registry. Three stories (US-MLE-02, -05, -06) read the same embedding namespace. The Cost-Optimization README flagged the analogous Redis SPOF; this folder's analogue is:

  • Feature Store SPOF mitigated by per-story namespace isolation, point-in-time read snapshots, and read-replica fallback to last-good Iceberg manifest.
  • Model Registry SPOF mitigated by registry replication across regions and by per-story registry contracts that fail closed (consumer pins to last-good model artifact if registry is unreachable).
  • Embedding namespace SPOF mitigated by versioned blue/green indexes — readers can fall back to the previous green index if the new blue is corrupted.
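The blue/green fallback in the last bullet can be sketched as an alias record that readers consult on every query. Field names here are illustrative, not the platform schema:

```python
from dataclasses import dataclass

@dataclass
class IndexAlias:
    blue: str            # candidate index produced by the latest re-index
    green: str           # last-good index currently serving reads
    blue_healthy: bool   # set by the post-re-index validation job

def read_index(alias: IndexAlias) -> str:
    """Readers take blue only after validation passes; otherwise they
    fall back to the previous green, so a corrupted re-index never serves."""
    return alias.blue if alias.blue_healthy else alias.green

alias = IndexAlias(blue="manga-emb-v7", green="manga-emb-v6", blue_healthy=False)
assert read_index(alias) == "manga-emb-v6"  # corrupted blue: fall back
alias.blue_healthy = True
assert read_index(alias) == "manga-emb-v7"  # validated blue: cut over
```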

Reviewer Sign-Off Status

| Lens | Sign-off | Outstanding |
|---|---|---|
| ML Scientist | Conditional | Per-story slice analysis (intent × language × cohort) before promotion |
| Principal Architect | Conditional | Cross-story platform contracts in place (deep-dives/02) |
| SRE / On-call | Conditional | Runbooks in US-MLE-02, -05, -06; SEV thresholds calibrated |
| Data Engineering | Conditional | Iceberg snapshot expiration policy; PIT correctness audit |
| Application Security / Privacy | Conditional | Embedding redaction, JP residency on US-MLE-03, -05, -06, -07 |
| FinOps | Conditional | Per-story training $/run + serving $/1k-inferences within budget |

Per-story details are in each file's "Multi-Reviewer Validation Findings & Resolutions" section.