
# Model Inference - Production Challenges, Metrics, and Data Science Collaboration

This folder documents the real-world model inference challenges I faced while building and operating MangaAssist at Amazon scale, how I collaborated with data scientists to solve them, and the metrics frameworks that drove every decision.


## Interview Walkthrough Arc (Suggested 8-10 Minute Flow)

Use this order when walking through model inference in a system design interview:

  1. Start with the inference pipeline (01-inference-pipeline-challenges.md) - Set the stage: "We had 4 models running in sequence for every request. Here's why that was hard at production scale..." (That four-model sequence is sketched just after this list.)
  2. Zoom into collaboration (02-data-scientist-collaboration.md) - "I worked closely with data scientists to solve these problems. Here's how we divided ownership..."
  3. Discuss tradeoffs (03-tradeoffs-decisions.md) - "Every decision was a tradeoff. Here's how I used metrics to drive decisions, not opinions..."
  4. Show ML metrics depth (04-ml-metrics-taxonomy.md) - "For our intent classifier and RAG pipeline, these are the metrics I tracked and why..."
  5. Show LLM metrics depth (05-llm-metrics-taxonomy.md) - "For the LLM generation layer, traditional ML metrics do not apply. Here's the framework I built..."
  6. Close with evaluation (06-model-evaluation-framework.md) - "To tie it all together, here's the end-to-end evaluation framework that gated every deployment..."
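
As a visual anchor for step 1, here is a minimal sketch of that four-model sequence. The function names and stubs are hypothetical; only the stage order, the latency notes, and the 40% template bypass come from the documents in this folder:

```python
import asyncio

# Hypothetical stage stubs standing in for the real model endpoints.
async def classify_intent(q: str) -> str:        # DistilBERT, ~15ms avg
    return "order_status"

async def embed(q: str) -> list[float]:          # Titan V2 embeddings, ~30ms avg
    return [0.0] * 1024

async def vector_search(vec: list[float], k: int) -> list[str]:
    return [f"doc-{i}" for i in range(k)]

async def rerank(q: str, docs: list[str], k: int) -> list[str]:
    return docs[:k]                              # cross-encoder, ~50ms avg

async def generate(q: str, intent: str, docs: list[str]) -> str:
    return "grounded answer"                     # Claude Sonnet via Bedrock

def lookup_template(intent: str, q: str) -> str | None:
    return None  # ~40% of messages hit a template and skip the LLM entirely

async def handle_message(query: str) -> str:
    """Four-model sequence: classifier → embeddings → reranker → LLM."""
    intent = await classify_intent(query)
    if (template := lookup_template(intent, query)) is not None:
        return template
    vec = await embed(query)
    candidates = await vector_search(vec, k=20)
    top_docs = await rerank(query, candidates, k=3)
    return await generate(query, intent, top_docs)

print(asyncio.run(handle_message("Where is my order?")))
```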

## Document Index

| # | File | Focus Area |
|---|------|------------|
| 01 | Inference Pipeline Challenges | Multi-model serving, latency at scale, cold starts, SageMaker and Bedrock production issues |
| 02 | Data Scientist Collaboration | Model selection, fine-tuning, evaluation, drift monitoring, and cost-quality tradeoffs |
| 03 | Tradeoffs and Decisions | Latency vs quality, cost vs accuracy, precision vs recall, fine-tuning vs prompting |
| 04 | ML Metrics Taxonomy | Classification, ranking, retrieval, embedding, and reranking metrics |
| 05 | LLM Metrics Taxonomy | Generation quality, hallucination, faithfulness, safety, and operational metrics |
| 06 | Model Evaluation Framework | Golden datasets, offline regression tests, online A/B testing, shadow mode, human evaluation |
| 07 | GPU Architecture Challenges | KV cache fragmentation, continuous batching, cold starts, OOM, Multi-LoRA, quantization, predictive scaling |
| 08 | SageMaker Endpoints, FastAPI, asyncio, and REST | What a SageMaker endpoint is, endpoint types, and when to use FastAPI, asyncio, or REST |
| 09 | SageMaker and Azure Inference APIs | Common AWS and Azure APIs for LLM and traditional ML inference |

## Cross-References to Existing Documentation

These documents expand on content already covered elsewhere in the repo:

| Existing File | Relevant Sections | What This Folder Adds |
|---|---|---|
| Challenges/real-world-challenges.md | Latency, model drift, hallucination, RAG quality, cost, guardrails, evaluation, token budget | Inference-specific deep dives, collaboration stories, decision frameworks with metrics |
| 10-ai-llm-design.md | Model selection, intent classification, RAG pipeline, prompt engineering | ML and LLM metrics taxonomies, evaluation rigor |
| 13-metrics.md | AI quality and operational metrics | Metric taxonomy with production context |
| 04b-architecture-lld.md | Inference sequencing, guardrails pipeline | Production challenges in running that pipeline at scale |
| 15-tradeoffs-challenges.md | Summary tradeoff tables | Decision frameworks with specific thresholds and collaboration context |

## Key Numbers Quick Reference (Interview Recall Sheet)

Use these numbers to anchor your answers with specifics. Memorize the bolded values.

| Category | Metric | Value |
|---|---|---|
| Scale | Concurrent sessions | **50K** |
| Scale | Daily messages | **500K** |
| Scale | Daily LLM calls | **~300K** (60% of messages) |
| Latency | End-to-end target | **< 3 seconds** |
| Latency | Intent classifier (DistilBERT) | **15ms avg**, 50ms P99 |
| Latency | Embeddings (Titan V2) | **30ms avg** |
| Latency | Reranker (cross-encoder) | **50ms avg** |
| Latency | LLM generation (Claude Sonnet) | **500ms TTFT P50**, 1.3s P99 |
| Cost | Cost per session (optimized) | **$0.025** (down from $0.082) |
| Cost | Monthly inference cost | **~$143K** |
| Cost | Monthly savings from optimizations | **~$119K** |
| Quality | Intent accuracy | **92.1%** (production) |
| Quality | RAG Recall@3 | **86%** (after fine-tuning) |
| Quality | Resolution rate | **73%** |
| Quality | Escalation rate | **12%** |
| Quality | Thumbs up rate | **65%** |
| Quality | CSAT score | **4.3 / 5.0** |
| Quality | Hallucination grounding score | **0.91 avg** |
| Quality | ASIN validation rate | **99.7%** |
| Models | Models in pipeline | **4** (classifier → embeddings → reranker → LLM) |
| Models | LLM bypass rate (templates) | **40%** of messages |
| Models | Golden dataset size | **500 queries** |
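
A quick sanity check on the latency rows: summing the worst-case figures for the four model stages shows how much of the 3-second budget remains for everything else. The stage labels here are mine; the numbers come from the table above (note the embeddings and reranker rows list averages, not P99s):

```python
# Illustrative latency-budget arithmetic using the figures above.
STAGE_MS = {
    "intent_classifier_p99": 50,   # DistilBERT
    "embeddings_avg": 30,          # Titan V2
    "reranker_avg": 50,            # cross-encoder
    "llm_ttft_p99": 1300,          # Claude Sonnet time-to-first-token
}
TARGET_MS = 3000

model_total = sum(STAGE_MS.values())   # 1430ms across the four models
headroom = TARGET_MS - model_total     # ~1570ms left for vector search,
                                       # guardrails, network, and queueing
print(f"models: {model_total}ms / target: {TARGET_MS}ms / headroom: {headroom}ms")
```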

## Common Interview Questions → Document Mapping

| Interview Question | Where to Answer From |
|---|---|
| "How did you handle latency at scale?" | 01 §Challenge 1 — speculative execution, graceful degradation |
| "How did you optimize costs?" | 01 §Challenge 5 — LLM bypass, model tiering, prompt caching |
| "How did you work with data scientists?" | 02 — all 9 areas — model selection, evaluation, drift monitoring |
| "How did you choose between models?" | 03 §Tradeoff 2 — cost per marginal accuracy point |
| "What metrics did you track?" | 04 + 05 — full taxonomy, Tier 1/2/3 production monitoring |
| "How did you handle hallucinations?" | 05 §Part 2 + 02 §Area 6 — grounding score, ASIN/price validation |
| "How did you deploy model changes safely?" | 06 — all 4 layers — golden dataset → shadow → canary → monitoring |
| "What tradeoffs did you make?" | 03 — all 7 tradeoffs — with reversal triggers and decision metrics |
| "How did you scale the inference pipeline?" | 01 §Challenge 2-3 — SageMaker scaling, Bedrock provisioned throughput |
| "How did you evaluate the LLM?" | 06 §Layer 1 — golden dataset, BERTScore, human evaluation |

## Key Narrative Themes

When presenting this content in interviews, weave in these recurring themes:

  1. Metrics-driven decisions, not opinions — Every major decision (model selection, temperature, chunk size) was backed by a quantified metric and a documented reversal trigger.
  2. Engineer and data scientist partnership — 9 distinct collaboration areas, shared metrics, joint ownership of the overlap zone.
  3. Production is not research — The DS team's top model (RoBERTa at 94.8%) wasn't the right production choice. Latency, cost, and scaling constraints dominated.
  4. Layered defense — 4-layer evaluation framework, multi-layer guardrails, graceful degradation per model.
  5. Continuous improvement — Monthly retraining, quarterly golden dataset refresh, weekly metric reviews, automated drift detection (a sketch follows below).
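
To make the drift-detection theme concrete, here is a minimal sketch using the population stability index (PSI) over intent labels. PSI and the 0.2 alert threshold are common conventions I am assuming for illustration, not details taken from these documents:

```python
import math
from collections import Counter

PSI_ALERT = 0.2  # conventional "significant shift" threshold, assumed here

def population_stability_index(baseline: Counter, live: Counter) -> float:
    """PSI over intent labels; higher means live traffic has drifted."""
    labels = set(baseline) | set(live)
    b_total, l_total = sum(baseline.values()), sum(live.values())
    psi = 0.0
    for label in labels:
        b = max(baseline[label] / b_total, 1e-6)  # floor avoids log(0)
        l = max(live[label] / l_total, 1e-6)
        psi += (l - b) * math.log(l / b)
    return psi

# Hypothetical label counts: training-time baseline vs. last week's traffic.
baseline = Counter(order_status=500, returns=300, product_info=200)
live = Counter(order_status=350, returns=450, product_info=200)
psi = population_stability_index(baseline, live)
print(f"PSI={psi:.3f}, drift alert: {psi > PSI_ALERT}")
```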