Model Inference - Production Challenges, Metrics, and Data Science Collaboration
This folder documents the real-world model inference challenges I faced while building and operating MangaAssist at Amazon scale, how I collaborated with data scientists to solve them, and the metrics frameworks that drove every decision.
Interview Walkthrough Arc (Suggested 8-10 Minute Flow)
Use this order when walking through model inference in a system design interview:
- Start with the inference pipeline (01-inference-pipeline-challenges.md) - Set the stage: "We had 4 models running in sequence for every request. Here's why that was hard at production scale..." (A sketch of that four-model sequence follows this list.)
- Zoom into collaboration (02-data-scientist-collaboration.md) - "I worked closely with data scientists to solve these problems. Here's how we divided ownership..."
- Discuss tradeoffs (03-tradeoffs-decisions.md) - "Every decision was a tradeoff. Here's how I used metrics to drive decisions, not opinions..."
- Show ML metrics depth (04-ml-metrics-taxonomy.md) - "For our intent classifier and RAG pipeline, these are the metrics I tracked and why..."
- Show LLM metrics depth (05-llm-metrics-taxonomy.md) - "For the LLM generation layer, traditional ML metrics do not apply. Here's the framework I built..."
- Close with evaluation (06-model-evaluation-framework.md) - "To tie it all together, here's the end-to-end evaluation framework that gated every deployment..."
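If you can also whiteboard the orchestration behind the opening point, it lands harder. Below is a minimal asyncio sketch of the four-model sequence with per-stage timeouts, template bypass, and graceful degradation. All function names, intents, and timeout budgets here are illustrative stand-ins, not the actual MangaAssist code:

```python
import asyncio

# Illustrative per-stage timeout budgets (seconds), loosely shaped by the
# latency averages in the recall sheet below; not the real production values.
STAGE_TIMEOUT_S = {"classify": 0.05, "embed": 0.10, "rerank": 0.15, "generate": 2.5}
TEMPLATED_INTENTS = {"greeting", "order_status"}  # hypothetical bypass intents

async def call_stage(stage, coro, fallback):
    """Run one model call; on timeout, return a fallback instead of failing."""
    try:
        return await asyncio.wait_for(coro, timeout=STAGE_TIMEOUT_S[stage])
    except asyncio.TimeoutError:
        return fallback

# Stub model calls standing in for SageMaker / Bedrock invocations.
async def classify(text): return "recommendation"
async def embed(text): return [0.1, 0.2, 0.3]
async def rerank(text, docs): return docs
async def generate(text, docs): return f"Grounded answer using {len(docs)} passages."

def retrieve(vec): return ["passage-1", "passage-2", "passage-3", "passage-4"]

async def handle_message(text: str) -> str:
    # Stage 1: intent classification; an unknown intent still gets a best-effort path.
    intent = await call_stage("classify", classify(text), fallback="unknown")

    # Template bypass: simple intents never reach the LLM (the ~40% bypass rate).
    if intent in TEMPLATED_INTENTS:
        return f"[template reply for {intent}]"

    # Stages 2-3: embed, retrieve, rerank; a rerank timeout keeps retrieval order.
    vec = await call_stage("embed", embed(text), fallback=None)
    docs = retrieve(vec) if vec is not None else []
    docs = await call_stage("rerank", rerank(text, docs), fallback=docs)

    # Stage 4: grounded generation; degrade to raw passages if the LLM times out.
    answer = await call_stage("generate", generate(text, docs[:3]), fallback=None)
    return answer if answer is not None else "\n".join(docs[:3])

print(asyncio.run(handle_message("recommend a manga like Fullmetal Alchemist")))
```

The point to narrate from this sketch: every stage has a fallback, so a slow reranker degrades ranking quality rather than failing the request.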
Document Index
| # | File | Focus Area |
|---|---|---|
| 01 | Inference Pipeline Challenges | Multi-model serving, latency at scale, cold starts, SageMaker and Bedrock production issues |
| 02 | Data Scientist Collaboration | Model selection, fine-tuning, evaluation, drift monitoring, and cost-quality tradeoffs |
| 03 | Tradeoffs and Decisions | Latency vs quality, cost vs accuracy, precision vs recall, fine-tuning vs prompting |
| 04 | ML Metrics Taxonomy | Classification, ranking, retrieval, embedding, and reranking metrics |
| 05 | LLM Metrics Taxonomy | Generation quality, hallucination, faithfulness, safety, and operational metrics |
| 06 | Model Evaluation Framework | Golden datasets, offline regression tests, online A/B testing, shadow mode, human evaluation |
| 07 | GPU Architecture Challenges | KV cache fragmentation, continuous batching, cold starts, OOM, Multi-LoRA, quantization, predictive scaling |
| 08 | SageMaker Endpoints, FastAPI, asyncio, and REST | What a SageMaker endpoint is, endpoint types, and when to use FastAPI, asyncio, or REST |
| 09 | SageMaker and Azure Inference APIs | Common AWS and Azure APIs for LLM and traditional ML inference |
Cross-References to Existing Documentation
These documents expand on content already covered elsewhere in the repo:
| Existing File | Relevant Sections | What This Folder Adds |
|---|---|---|
| Challenges/real-world-challenges.md | Latency, model drift, hallucination, RAG quality, cost, guardrails, evaluation, token budget | Inference-specific deep dives, collaboration stories, decision frameworks with metrics |
| 10-ai-llm-design.md | Model selection, intent classification, RAG pipeline, prompt engineering | ML and LLM metrics taxonomies, evaluation rigor |
| 13-metrics.md | AI quality and operational metrics | Metric taxonomy with production context |
| 04b-architecture-lld.md | Inference sequencing, guardrails pipeline | Production challenges in running that pipeline at scale |
| 15-tradeoffs-challenges.md | Summary tradeoff tables | Decision frameworks with specific thresholds and collaboration context |
Key Numbers Quick Reference (Interview Recall Sheet)
Use these numbers to anchor your answers with specifics. Memorize the bolded values; a quick sanity check of how some of them relate follows the table.
| Category | Metric | Value |
|---|---|---|
| Scale | Concurrent sessions | **50K** |
| | Daily messages | **500K** |
| | Daily LLM calls | **~300K** (60% of messages) |
| Latency | End-to-end target | **< 3 seconds** |
| | Intent classifier (DistilBERT) | **15ms** avg, **50ms** P99 |
| | Embeddings (Titan V2) | **30ms** avg |
| | Reranker (cross-encoder) | **50ms** avg |
| | LLM generation (Claude Sonnet) | **500ms** TTFT P50, **1.3s** P99 |
| Cost | Cost per session (optimized) | **$0.025** (down from $0.082) |
| | Monthly inference cost | **~$143K** |
| | Monthly savings from optimizations | **~$119K** |
| Quality | Intent accuracy | **92.1%** (production) |
| | RAG Recall@3 | **86%** (after fine-tuning) |
| | Resolution rate | **73%** |
| | Escalation rate | **12%** |
| | Thumbs up rate | **65%** |
| | CSAT score | **4.3 / 5.0** |
| | Hallucination grounding score | **0.91** avg |
| | ASIN validation rate | **99.7%** |
| Models | Models in pipeline | **4** (classifier → embeddings → reranker → LLM) |
| | LLM bypass rate (templates) | **40%** of messages |
| | Golden dataset size | **500** queries |
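A few of these values are arithmetically linked, which makes them easier to reproduce at the whiteboard. The relationships in this sanity check are assumptions for recall purposes; the input volumes and rates come straight from the table:

```python
# Hedged sanity check: how some recall-sheet numbers relate to each other.
daily_messages = 500_000
llm_bypass_rate = 0.40                       # templated replies that skip the LLM

daily_llm_calls = daily_messages * (1 - llm_bypass_rate)
assert daily_llm_calls == 300_000            # matches "~300K (60% of messages)"

cost_per_session_before = 0.082
cost_per_session_after = 0.025
savings_share = 1 - cost_per_session_after / cost_per_session_before
print(f"per-session cost cut: ~{savings_share:.0%}")  # ~70%
```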
Common Interview Questions → Document Mapping
| Interview Question | Where to Answer From |
|---|---|
| "How did you handle latency at scale?" | 01 §Challenge 1 — speculative execution, graceful degradation |
| "How did you optimize costs?" | 01 §Challenge 5 — LLM bypass, model tiering, prompt caching |
| "How did you work with data scientists?" | 02 — All 9 Areas — model selection, evaluation, drift monitoring |
| "How did you choose between models?" | 03 §Tradeoff 2 — cost per marginal accuracy point |
| "What metrics did you track?" | 04 + 05 — full taxonomy, Tier ½/3 production monitoring |
| "How did you handle hallucinations?" | 05 §Part 2 + 02 §Area 6 — grounding score, ASIN/price validation |
| "How did you deploy model changes safely?" | 06 — All 4 Layers — golden dataset → shadow → canary → monitoring |
| "What tradeoffs did you make?" | 03 — All 7 Tradeoffs — with reversal triggers and decision metrics |
| "How did you scale the inference pipeline?" | 01 §Challenge 2-3 — SageMaker scaling, Bedrock provisioned throughput |
| "How did you evaluate the LLM?" | 06 §Layer 1 — golden dataset, BERTScore, human evaluation |
Key Narrative Themes
When presenting this content in interviews, weave in these recurring themes:
- Metrics-driven decisions, not opinions — Every major decision (model selection, temperature, chunk size) was backed by a quantified metric and a documented reversal trigger.
- Engineer and data scientist partnership — 9 distinct collaboration areas, shared metrics, joint ownership of the overlap zone.
- Production is not research — The DS team's top model (RoBERTa at 94.8%) wasn't the right production choice. Latency, cost, and scaling constraints dominated.
- Layered defense — 4-layer evaluation framework, multi-layer guardrails, graceful degradation per model.
- Continuous improvement — Monthly retraining, quarterly golden dataset refresh, weekly metric reviews, automated drift detection (a drift-check sketch follows this list).
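Automated drift detection is the theme interviewers most often probe for mechanics. One common statistic is the population stability index (PSI) over the intent-label distribution; whether MangaAssist used PSI specifically is an assumption here (drift monitoring itself is covered in 02), and the bucket values below are invented for illustration:

```python
import math

def psi(expected: list[float], observed: list[float], eps: float = 1e-6) -> float:
    """PSI = sum((obs - exp) * ln(obs / exp)) over distribution buckets."""
    return sum(
        (o - e) * math.log((o + eps) / (e + eps))
        for e, o in zip(expected, observed)
    )

training_dist = [0.40, 0.30, 0.20, 0.10]   # intent shares at training time
weekly_dist   = [0.33, 0.28, 0.24, 0.15]   # shares observed this week

# Common rule of thumb: < 0.1 stable, 0.1-0.25 watch closely, > 0.25 retrain.
print(f"PSI = {psi(training_dist, weekly_dist):.3f}")  # ~0.042, stable
```

Wiring a check like this into the weekly metric review is what turns "monthly retraining" from a calendar ritual into a triggered response.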