
How I led MangaAssist past the 90% ML-failure rate

Role: Applied ML Solutions Architect / ML Product Engineer — MangaAssist (Amazon JP Manga storefront chatbot).

The seam I operated at: solution architecture (reference architectures, AWS-native deployment patterns, multi-region scale, customer/stakeholder translation) and product engineering (ship end-to-end, own the product surface, instrument and iterate on customer outcomes — not just model metrics).

The stakes: 50K → 500K concurrent sessions, P99 < 1.5s, cost < $0.03/session, intent accuracy > 90%, hallucination < 2%, RAG Recall@3 > 80% — see 13-metrics.md and 11-scalability-reliability.md.

Deep-dive stories: every claim below has a STAR-format companion under stories/ — read those for the texture (situation, math, action, leadership move, business outcome).


0 — The 90% problem

Industry data is consistent: 80–90% of ML/AI projects never reach production, and a large share of the rest fail to meet expectations once live. The failure modes are not mysterious — they're predictable: the team builds the wrong thing, a 92%-on-dev-set POC creates false confidence, no offline eval gate exists before prod traffic, guardrails are an afterthought, costs blow up unmonitored, the model rots within six months because no feedback loop was ever wired in, and cross-functional friction (PM ↔ DS ↔ Eng) stalls every decision.

Each of these is a phase in the project lifecycle. At each phase I made a different choice. This document explains the choices, the math/algorithmic understanding behind them, and the artifact in this repo that proves it.


0.5 — What "Applied ML Solutions Architect / ML Product Engineer" meant on this project

I deliberately combined the two hats — the project required both:

As Solutions Architect, I owned the end-to-end reference architecture, not just a model. I picked AWS-native components by trade-off, not by default: Bedrock for foundation-model access, OpenSearch for vector retrieval, ECS Fargate + Lambda for serving, DynamoDB for conversation memory, MLflow + CloudWatch + X-Ray for observability (04-architecture-hld.md, 04b-architecture-lld.md). I translated 8 ambiguous customer use cases (03-use-cases.md) into a deployable, cost-bounded, multi-region-ready architecture with explicit service contracts. I drove the hybrid execution model (10-ai-llm-design.md) — templates / APIs / RAG / LLM by intent — which is the architect-level decision most teams skip, and the reason they pay either too much or ship something that hallucinates.

As ML Product Engineer, I shipped the product, not the notebook. I owned the surface customers actually touch: streaming WebSocket responses, P99 latency budgets, the feedback loop wired from thumbs-down → trace → retraining queue (Implementation-Integration-Domain2/Skill-2.1.5-Collaborative-AI-Systems/02-feedback-augmentation-patterns.md). I optimized for product metrics (CSAT, conversion, support deflection, AOV lift) — not just model metrics. Prompts, guardrails, and rerankers were treated as shippable product features (Prompt-Engineering/07-prompt-evaluation-versioning-and-regression.md) — versioned, A/B tested, rolled back like any other release.

A pure architect ships a beautiful diagram nobody implements. A pure product engineer ships a feature that won't survive Black Friday traffic. MangaAssist required both.


1 — Phase 1: Problem framing (where ~50% of projects die before code is written)

Failure mode: teams jump to "let's build an LLM chatbot" before defining what success means. Six months later, no one can answer "did it work?"

What I did: I drove a 16-document business + technical narrative arc before any architecture — 01-problem-statement.md, 02-user-description.md, 03-use-cases.md, 13-metrics.md, 14-mvp-vs-future.md. I established a 4-tier metrics framework — business / UX / AI quality / operational — with concrete targets the team could be held to.

DS-degree angle: "Discovery overload" is not a metric. Distributions, baselines, MDE (minimum detectable effect), and power-based sample sizes are. My DS training meant every business pain became a measurable hypothesis with a defensible MDE — so we'd know whether the chatbot moved the needle or we were looking at random noise.
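
To make that concrete, here is a minimal sketch of the sample-size arithmetic behind a defensible MDE, using the standard two-proportion formula; the baseline rate and effect size are illustrative stand-ins, not MangaAssist's actual numbers.

```python
# Sample size per arm needed to detect a given MDE on a rate metric,
# via the standard two-proportion z-test formula. Numbers are illustrative.
from scipy.stats import norm

def n_per_arm(p_base: float, mde: float, alpha: float = 0.05, power: float = 0.80) -> int:
    p_alt = p_base + mde
    z_alpha = norm.ppf(1 - alpha / 2)          # two-sided test
    z_beta = norm.ppf(power)
    var = p_base * (1 - p_base) + p_alt * (1 - p_alt)
    return int((z_alpha + z_beta) ** 2 * var / mde ** 2) + 1

# Baseline 12% conversion; we only care about a lift of >= 1 point:
print(n_per_arm(0.12, 0.01))   # ~17,200 sessions per arm before "it worked" means anything
```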

Amazon angle: working-backwards / PRFAQ discipline turned 8 ambiguous use cases into testable customer outcomes. That's the architect lens — not "what will we build" but "what does the customer experience look like, and what's the path back from there to system components?"


2 — Phase 2: Solution design (where over-engineering kills another ~20%)

Failure mode: "use an LLM for everything." Result: cost blowup, latency blowup, hallucinations on problems a regex would solve.

What I did: designed the hybrid execution model in 10-ai-llm-design.md — templates vs. APIs vs. RAG vs. LLM, routed by intent. Documented 6 trade-offs and 7 known challenges in 15-tradeoffs-challenges.md. Stood up the agent framework in agents.md — orchestrator + 4 specialized sub-agents — so each component had a measurable contract.

Math angle: knowing when not to use ML. Information-theoretic argument: when the prior on intent is strong enough (e.g., "track my order #"), an LLM adds entropy, latency, and cost — not signal. The 2-stage intent classifier (rules + DistilBERT) routes ~30% of traffic away from the LLM. That decision came from understanding Bayesian posteriors, not from intuition.
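A minimal sketch of that routing logic, assuming a regex rule table and a fine-tuned DistilBERT head; the patterns, model path, and confidence threshold are hypothetical placeholders, not the production values.

```python
# Stage 1: deterministic rules at near-zero cost. Stage 2: a small DistilBERT
# classifier. Only low-confidence residue falls through to the LLM.
# Patterns, model path, and threshold below are illustrative placeholders.
import re
from transformers import pipeline

RULES = [
    (re.compile(r"\b(track|where is).{0,30}order\b", re.I), "order_status"),
    (re.compile(r"\b(refund|return)\b", re.I), "returns"),
]

classifier = pipeline("text-classification", model="mangaassist/distilbert-intent")  # hypothetical checkpoint

def route(utterance: str, threshold: float = 0.85) -> str:
    for pattern, intent in RULES:                 # stage 1: rules
        if pattern.search(utterance):
            return intent
    pred = classifier(utterance)[0]               # stage 2: DistilBERT
    if pred["score"] >= threshold:
        return pred["label"]
    return "llm_fallback"                         # only the ambiguous tail hits the LLM
```
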

Mentoring signal: group-discussion personas (Priya/ML, Marcus/Architect, Aiko/DS, Jordan/MLOps, Sam/PM) embedded throughout Fine-Tuning-Foundational-Models/ taught the team to argue the trade-off from five perspectives — not just pick the trendy answer.

Deep-dive: stories/02-hybrid-execution-model-decision.md.


3 — Phase 3: POC (the false-confidence trap — the famous "92% problem")

Failure mode: POC scores 92% on a curated dev set. Team celebrates. Production gets adversarial input, concurrency, edge cases — and melts down.

What I did: documented the trap explicitly in POC-to-Production-War-Story/01-the-poc-that-fooled-us.md and the seven catastrophes that followed in POC-to-Production-War-Story/02-seven-production-catastrophes.md, with mitigations in POC-to-Production-War-Story/03-how-we-overcame-each-failure.md. I refused to ship until a 4-layer eval gate was in place: golden dataset → shadow mode → canary → continuous monitoring (Model-Inference/06-model-evaluation-framework.md).

DS-degree angle: distribution shift and sampling bias are first-week material in any DS program. The 92% number was a point estimate on a non-representative sample. My training made it instinctive to ask "what's the support of the dev set vs. the support of prod traffic?" — and to design golden datasets that adversarially cover the long tail (API-Design-and-Testing/04-offline-testing-quality-strategies.md).

Math angle: the canary gate uses formal hypothesis testing — α=0.05, explicit power analysis (Statistical-Inference/01-hypothesis-testing.md). Without that, "the canary looks fine" is just vibes.
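
What the gate's core check looks like, sketched as a one-sided two-proportion z-test on canary vs. control error rates; the counts are invented purely to show the mechanics.

```python
# Canary gate: is the canary's error (or thumbs-down) rate significantly worse
# than control's? One-sided two-proportion z-test; counts are illustrative.
import math
from scipy.stats import norm

def canary_passes(err_ctrl: int, n_ctrl: int, err_canary: int, n_canary: int,
                  alpha: float = 0.05) -> bool:
    p1, p2 = err_ctrl / n_ctrl, err_canary / n_canary
    p_pool = (err_ctrl + err_canary) / (n_ctrl + n_canary)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_ctrl + 1 / n_canary))
    p_value = 1 - norm.cdf((p2 - p1) / se)   # one-sided: canary worse than control?
    return p_value >= alpha                  # block rollout only on a significant regression

print(canary_passes(480, 40_000, 70, 5_000))  # True: 1.4% vs 1.2% is within noise at this n
```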

Deep-dive: stories/01-the-92-percent-poc-trap.md.


4 — Phase 4: Production hardening (where another ~10% die at launch)

Failure mode: no guardrails, no observability, no rollback path. Hallucinations reach customers, costs spike unnoticed, the on-call engineer learns about a regression from Twitter.

What I did: designed the 7-dimensional evaluation framework — relevance, factual accuracy, consistency, fluency, hallucination, latency, cost (Evaluation-Systems-GenAI/). Layered guardrails — input safety → output safety → PII detection → policy enforcement (AI-Safety-Security-Governance/). Built the monitoring stack with calibrated thresholds (Monitoring-GenAI-Systems/Skill-4.3.2-GenAI-Monitoring/01-monitoring-architecture.md).

Math angle (the one most engineers miss): the hallucination alert/block thresholds (0.5 / 0.7) are not guesses — they're calibrated points on an ROC curve of LLM-as-Judge scores against the golden set (AI-Safety-Security-Governance/01-input-output-safety-controls/03-accuracy-verification-hallucination-control.md). Drift detection uses CUSUM for token-burst anomalies and Isolation Forest for response-distribution shift. To debug why the canary fails, you have to read the math, not just stare at the dashboard. That's what the DS degree bought us.
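
Two of those mechanisms, sketched on synthetic data: calibrating the block threshold off the judge-score ROC curve under an explicit false-block budget, and a one-sided CUSUM for token-burst drift. The budget, the drift parameters, and the data are all illustrative.

```python
# (1) Threshold calibration: highest-recall judge-score cutoff whose
#     false-positive rate stays under a block budget on the golden set.
# (2) One-sided CUSUM on per-session token counts for burst drift.
# All data and parameters below are synthetic/illustrative.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(7)
y_true = rng.integers(0, 2, 2_000)                                       # 1 = hallucination
y_score = np.clip(0.35 * y_true + rng.normal(0.30, 0.18, 2_000), 0, 1)   # judge scores

fpr, tpr, thr = roc_curve(y_true, y_score)
ok = np.where(fpr <= 0.05)[0]                 # budget: block at most 5% of good answers
threshold = thr[ok[np.argmax(tpr[ok])]]       # max recall within that budget
print(f"calibrated block threshold: {threshold:.2f}")

def cusum_alarm(xs, target, slack, limit):
    """Flag sustained upward drift: s_t = max(0, s_{t-1} + x_t - target - slack)."""
    s = 0.0
    for t, x in enumerate(xs):
        s = max(0.0, s + x - target - slack)
        if s > limit:
            return t                          # alarm at step t
    return None
```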

Cost discipline: the CPQ (cost-per-quality-point) framework — semantic caching, prompt compression, model tiering (Haiku vs. Sonnet by intent) — projected $209K/month savings (Cost-Optimization-User-Stories/US-01-llm-token-cost-optimization.md). That's the product-engineer hat: cost is a first-class product metric, not a finance ticket.
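
A toy of the semantic-caching leg: serve a cached answer when a new query embeds close enough to a prior one. The embed stub and the 0.92 similarity cutoff are stand-ins; production would call the real embedding endpoint and tune the cutoff on live traffic.

```python
# Semantic cache sketch: skip the LLM when a new query is a near-duplicate of
# a cached one. `embed` is a deterministic toy stand-in for the real embedding
# call (e.g., Titan V2 via Bedrock); the 0.92 cutoff is illustrative.
import numpy as np

_keys: list[np.ndarray] = []   # normalized query embeddings
_vals: list[str] = []          # cached answers

def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % 2**32)   # toy embedding
    v = rng.normal(size=256)
    return v / np.linalg.norm(v)

def put(query: str, answer: str) -> None:
    _keys.append(embed(query))
    _vals.append(answer)

def get(query: str, min_sim: float = 0.92) -> str | None:
    if not _keys:
        return None
    sims = np.stack(_keys) @ embed(query)       # cosine sims (unit vectors)
    best = int(np.argmax(sims))
    return _vals[best] if sims[best] >= min_sim else None
```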

Deep-dive: stories/04-hallucination-threshold-calibration.md.


5 — Phase 5: Operate (where ~30% of "successful" launches rot within 6 months)

Failure mode: the model drifts, feedback isn't wired in, no one retrains, quality silently degrades. Six months later the chatbot is worse than the day it launched.

What I did: wired feedback correlation (thumbs-down, escalations, guardrail blocks tied to exact MLflow traces) — see MLflow/. Designed monthly retraining with EWC (Elastic Weight Consolidation) to prevent catastrophic forgetting in continual learning (Fine-Tuning-Foundational-Models/).

Math angle: EWC is a Fisher-information-matrix penalty term. Setting its λ, debugging it when retraining stalls, or explaining to a PM why "just retrain on the union of all data" breaks last quarter's accuracy — that requires actually understanding the geometry of the loss landscape. That's the kind of bottleneck where the DS degree pays back its tuition.
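
A minimal PyTorch rendering of that penalty, assuming a diagonal Fisher estimate from the previous cycle's data; the λ value is illustrative, and tuning it against curvature is exactly the judgment call described above.

```python
# EWC sketch (PyTorch): diagonal Fisher from the previous cycle, then a
# lam/2 * sum_i F_i * (theta_i - theta*_i)^2 penalty added to the task loss.
# lam below is illustrative; it is the lambda the paragraph describes tuning.
import torch

def fisher_diagonal(model, batches, loss_fn):
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for batch in batches:
        model.zero_grad()
        loss_fn(model, batch).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2   # squared grads ~ Fisher diagonal
    return {n: f / max(len(batches), 1) for n, f in fisher.items()}

def ewc_penalty(model, anchor_params, fisher, lam=0.4):
    total = sum((fisher[n] * (p - anchor_params[n]) ** 2).sum()
                for n, p in model.named_parameters())
    return 0.5 * lam * total

# monthly retraining step:
# loss = task_loss(model, batch) + ewc_penalty(model, last_cycle_weights, fisher)
```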

Cross-functional ownership: the 9 joint ML/Eng collaboration areas in Model-Inference/02-data-scientist-collaboration.md — each one shows the lead translating a DS finding (+3.8% intent accuracy from synthetic data, +14% RAG Recall@3 from embedding fine-tuning) into a shippable engineering decision. That translation step is where most projects die from PM-DS-Eng misalignment, and it's the architect's job to make it happen.

Deep-dive: stories/05-ewc-retraining-save.md.


6 — How I guided the team (the leadership multiplier)

A single person's judgment doesn't ship a project at this scale — a team's does. I scaled mine through:

  • Scenario templates with named personas (Priya/Marcus/Aiko/Jordan/Sam) — taught the team to argue trade-offs from 5 perspectives instead of debating whose intuition wins.
  • 4 behavioral conflict scenarios in CI-CD-Pipeline-User-Stories/ — captured how to handle ML-gate disagreements, rollback policy disputes, and tooling fights without escalating to leadership every time.
  • "Mathematical foundations with geometric intuition" sections — I didn't just hand engineers without a DS background a paper. I built the bridge from intuition → math → code → metric. That's how a single person's algorithmic depth scales across a team.
  • The agent framework reference in agents.md and the 18 fine-tuning deep dives function as onboarding scaffolding for new ML engineers — they walk in and have a 90-day path.

Deep-dive: stories/06-five-persona-mentoring-framework.md.


7 — How algorithmic depth drove decisions and delivered company value

The DS degree and the math are upstream. The decisions are where they paid off. Each item below names a complex ML/AI algorithm, the decision my understanding of it unlocked, and the business value the company captured.

  • Retrieval — Contrastive embedding fine-tuning (InfoNCE + hard-negative mining). Off-the-shelf Titan V2 embeddings underperformed on manga-specific queries. Because I understood that contrastive loss collapses without informative negatives — and that the τ temperature controls gradient sharpness — I directed the team to invest in hard-negative mining over more training data (see the InfoNCE sketch after this list). Result: +14% RAG Recall@3 (Model-Inference/02-data-scientist-collaboration.md). Business value: directly improved recommendation relevance — the conversion-driving surface. (Story: stories/03-embedding-plateau-debugging.md.)

  • Fine-tuning — LoRA / QLoRA rank selection. Engineering instinct said "use rank 16, more capacity is better." My understanding of low-rank adaptation — the intrinsic-dimensionality argument from the LoRA paper, and the regularization effect of lower rank — said the opposite for our task (Fine-Tuning-Foundational-Models/). Result: smaller adapters, faster cold-start, lower training cost — and better generalization on small fine-tuning sets. Business value: faster iteration cadence, lower training spend.

  • Intent classification — 2-stage rules + DistilBERT. Most teams reach for a single LLM call. My understanding of model capacity vs. task complexity (and the cost curve at scale) drove the 2-stage design in 10-ai-llm-design.md. Result: ~30% of traffic bypasses the LLM. Business value: order-of-magnitude cost reduction on routine intents and tighter P99 latency. (Story: stories/02-hybrid-execution-model-decision.md.)

  • Reranking — cross-encoder vs. bi-encoder trade-off. Bi-encoders are fast but lossy; cross-encoders are accurate but slow. Understanding the latency-accuracy frontier let me design two-stage retrieval (bi-encoder shortlist → cross-encoder rerank on top-K; sketched after this list). Business value: kept P99 < 1.5s (11-scalability-reliability.md) without sacrificing relevance.

  • Mixture of Experts — when MoE is worth its operational tax. MoE is fashionable; it is also a deployment nightmare (gating instability, expert collapse, memory blowup). Understanding the gating math meant I could argue for genre-specific experts only where signal density justified it, not as a default. Business value: avoided a 6-month rabbit hole the team almost walked into.

  • Quantization — QAT vs. GPTQ vs. AWQ vs. SmoothQuant. These are not interchangeable. Understanding what each one preserves (activation outliers, weight distribution, calibration set sensitivity) let me match the technique to the deployment target. Business value: shipped quantized models without the silent quality regressions that bite teams who picked by Twitter consensus.

  • Continual learning — EWC. Monthly retraining caused catastrophic forgetting on the previous month's improvements. Because I understood EWC as a Fisher-information penalty and could tune λ against the loss-landscape curvature, the team had a path forward instead of "just retrain on union of all data" (which doesn't scale). Business value: sustained quality over 6+ retraining cycles without regression. (Story: stories/05-ewc-retraining-save.md.)

  • Alignment — DPO vs. RLHF. Understanding why DPO's closed-form objective avoids RLHF's reward-hacking and instability led us to pick DPO for safety alignment (the objective is sketched after this list). Business value: faster, cheaper, more reproducible alignment runs — and a defensible answer when leadership asked "why aren't we doing RLHF like everyone else?"
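
For the retrieval bullet: a minimal InfoNCE sketch with explicit hard negatives. τ is the temperature knob described above; shapes and batching are simplified for illustration.

```python
# InfoNCE with explicit hard negatives (PyTorch sketch).
import torch
import torch.nn.functional as F

def info_nce(query, positive, hard_negatives, tau=0.05):
    """query: (B, d); positive: (B, d); hard_negatives: (B, K, d)."""
    q = F.normalize(query, dim=-1)
    pos = F.normalize(positive, dim=-1)
    neg = F.normalize(hard_negatives, dim=-1)
    pos_sim = (q * pos).sum(-1, keepdim=True)             # (B, 1)
    neg_sim = torch.einsum("bd,bkd->bk", q, neg)          # (B, K)
    logits = torch.cat([pos_sim, neg_sim], dim=1) / tau   # index 0 = positive
    labels = torch.zeros(q.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)

# lower tau -> sharper softmax -> gradient mass concentrates on the hardest
# negatives; with only easy negatives the loss saturates and training plateaus.
```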
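
For the reranking bullet: the bi-encoder shortlist → cross-encoder rerank pattern. The checkpoints named are public models used for illustration, not the production ones, and the brute-force shortlist stands in for a real ANN index.

```python
# Two-stage retrieval: cheap bi-encoder shortlist, accurate cross-encoder
# rerank on the top-K only. Checkpoints are illustrative public models.
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

bi = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve(query: str, corpus: list[str], shortlist: int = 50, top_k: int = 3):
    doc_emb = bi.encode(corpus, normalize_embeddings=True)
    q_emb = bi.encode([query], normalize_embeddings=True)[0]
    idx = np.argsort(doc_emb @ q_emb)[::-1][:shortlist]      # stage 1: fast, lossy
    scores = ce.predict([(query, corpus[i]) for i in idx])   # stage 2: slow, accurate
    order = np.argsort(scores)[::-1][:top_k]
    return [corpus[idx[i]] for i in order]
```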
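
For the alignment bullet: DPO's closed-form objective in a few lines; sequence log-probs are assumed precomputed, and β is illustrative.

```python
# DPO objective (sketch): no reward model, no PPO loop. logp_* are summed
# token log-probs of the chosen/rejected responses under the policy and the
# frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta: float = 0.1) -> torch.Tensor:
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()
```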

The pattern: algorithmic understanding turns high-stakes binary choices into informed trade-offs. A team led by someone who knows the algorithms only by name picks the trendy answer. A team led by someone who knows the algorithms by their math picks the right answer for the constraint set — and can defend it under scrutiny.


8 — The DS-degree + 2-years-at-Amazon multiplier (closing argument)

What the DS degree gave me: probability, hypothesis testing, sampling theory, causal inference, statistical learning theory, optimization. The vocabulary to name what's wrong with a "92% POC" or a stalled fine-tune — and the math to fix it.

What 2 years at Amazon gave me: scale intuition (50K → 500K isn't just a quantitative gap, it's a qualitative one), CPQ thinking, the operational-excellence muscle, working-backwards discipline, and — critically — daily exposure to senior engineers who'd already failed at all these things and learned the lessons. That's where the architect intuition came from.

Combined: math to debug algorithm bottlenecks (why is contrastive embedding fine-tuning plateauing? — look at hard-negative-mining strategy and the τ temperature in InfoNCE loss), and the business-context fluency to translate that math into a use case a PM can defend in a roadmap review.

Most ML projects fail because no single person on the team can hold both the math and the business at the same time. My job as Applied ML Solutions Architect / Product Engineer was to be that person — and to grow more of them on the team.


9 — Quick-reference scorecard

| Failure mode | Industry rate | What I did | Evidence |
|---|---|---|---|
| Wrong problem | ~50% pre-build | 4-tier metrics framework with MDE before any architecture | 13-metrics.md |
| Over-engineering | ~20% | Hybrid execution model — templates/APIs/RAG/LLM by intent | 10-ai-llm-design.md |
| POC false-confidence | most "shipped" projects | 4-layer eval gate: golden → shadow → canary → continuous | Model-Inference/06-model-evaluation-framework.md |
| No statistical rigor on canary | majority of A/B rollouts | Formal hypothesis testing, α=0.05, power analysis | Statistical-Inference/01-hypothesis-testing.md |
| No guardrails | high | Layered input/output safety, PII, calibrated hallucination thresholds | AI-Safety-Security-Governance/ |
| Cost blowup | high | CPQ framework, semantic caching, model tiering — $209K/mo savings | Cost-Optimization-User-Stories/US-01-llm-token-cost-optimization.md |
| Model rot post-launch | ~30% | Feedback-trace correlation + EWC continual learning | MLflow/, Fine-Tuning-Foundational-Models/ |
| Cross-functional friction | endemic | 9 named ML/Eng collaboration areas + 4 conflict scenarios | Model-Inference/02-data-scientist-collaboration.md |
| Team can't scale lead's judgment | high | Persona-driven scenarios + math-with-geometric-intuition onboarding | agents.md, Fine-Tuning-Foundational-Models/ |