Model Inference - Production Challenges, Metrics, and Data Science Collaboration
This folder documents the real-world model inference challenges I faced while building and operating MangaAssist at Amazon scale, how I collaborated with data scientists to solve them, and the metrics frameworks that drove every decision.
Interview Walkthrough Arc (Suggested 8-10 Minute Flow)
Use this order when walking through model inference in a system design interview:
- Start with the inference pipeline (01-inference-pipeline-challenges.md) - Set the stage: "We had 4 models running in sequence for every request. Here's why that was hard at production scale..." (A sketch of that four-model sequence follows this list.)
- Zoom into collaboration (02-data-scientist-collaboration.md) - "I worked closely with data scientists to solve these problems. Here's how we divided ownership..."
- Discuss tradeoffs (03-tradeoffs-decisions.md) - "Every decision was a tradeoff. Here's how I used metrics to drive decisions, not opinions..."
- Show ML metrics depth (04-ml-metrics-taxonomy.md) - "For our intent classifier and RAG pipeline, these are the metrics I tracked and why..."
- Show LLM metrics depth (05-llm-metrics-taxonomy.md) - "For the LLM generation layer, traditional ML metrics do not apply. Here's the framework I built..."
- Close with evaluation (06-model-evaluation-framework.md) - "To tie it all together, here's the end-to-end evaluation framework that gated every deployment..."
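If you can also whiteboard the orchestration behind the opening point, it lands harder. Below is a minimal asyncio sketch of the four-model sequence with per-stage timeouts, template bypass, and graceful degradation. All function names, intents, and timeout budgets here are illustrative stand-ins, not the actual MangaAssist code:

```python
import asyncio

# Illustrative per-stage timeout budgets (seconds), loosely shaped by the
# latency averages in the recall sheet below; not the real production values.
STAGE_TIMEOUT_S = {"classify": 0.05, "embed": 0.10, "rerank": 0.15, "generate": 2.5}
TEMPLATED_INTENTS = {"greeting", "order_status"}  # hypothetical bypass intents

async def call_stage(stage, coro, fallback):
    """Run one model call; on timeout, return a fallback instead of failing."""
    try:
        return await asyncio.wait_for(coro, timeout=STAGE_TIMEOUT_S[stage])
    except asyncio.TimeoutError:
        return fallback

# Stub model calls standing in for SageMaker / Bedrock invocations.
async def classify(text): return "recommendation"
async def embed(text): return [0.1, 0.2, 0.3]
async def rerank(text, docs): return docs
async def generate(text, docs): return f"Grounded answer using {len(docs)} passages."

def retrieve(vec): return ["passage-1", "passage-2", "passage-3", "passage-4"]

async def handle_message(text: str) -> str:
    # Stage 1: intent classification; an unknown intent still gets a best-effort path.
    intent = await call_stage("classify", classify(text), fallback="unknown")

    # Template bypass: simple intents never reach the LLM (the ~40% bypass rate).
    if intent in TEMPLATED_INTENTS:
        return f"[template reply for {intent}]"

    # Stages 2-3: embed, retrieve, rerank; a rerank timeout keeps retrieval order.
    vec = await call_stage("embed", embed(text), fallback=None)
    docs = retrieve(vec) if vec is not None else []
    docs = await call_stage("rerank", rerank(text, docs), fallback=docs)

    # Stage 4: grounded generation; degrade to raw passages if the LLM times out.
    answer = await call_stage("generate", generate(text, docs[:3]), fallback=None)
    return answer if answer is not None else "\n".join(docs[:3])

print(asyncio.run(handle_message("recommend a manga like Fullmetal Alchemist")))
```

The point to narrate from this sketch: every stage has a fallback, so a slow reranker degrades ranking quality rather than failing the request.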
Document Index
| # | File | Focus Area |
|---|---|---|
| 01 | Inference Pipeline Challenges | Multi-model serving, latency at scale, cold starts, SageMaker and Bedrock production issues |
| 02 | Data Scientist Collaboration | Model selection, fine-tuning, evaluation, drift monitoring, and cost-quality tradeoffs |
| 03 | Tradeoffs and Decisions | Latency vs quality, cost vs accuracy, precision vs recall, fine-tuning vs prompting |
| 04 | ML Metrics Taxonomy | Classification, ranking, retrieval, embedding, and reranking metrics |
| 05 | LLM Metrics Taxonomy | Generation quality, hallucination, faithfulness, safety, and operational metrics |
| 06 | Model Evaluation Framework | Golden datasets, offline regression tests, online A/B testing, shadow mode, human evaluation |
| 07 | GPU Architecture Challenges | KV cache fragmentation, continuous batching, cold starts, OOM, Multi-LoRA, quantization, predictive scaling |
| 08 | SageMaker Endpoints, FastAPI, asyncio, and REST | What a SageMaker endpoint is, endpoint types, and when to use FastAPI, asyncio, or REST |
| 09 | SageMaker and Azure Inference APIs | Common AWS and Azure APIs for LLM and traditional ML inference |
Cross-References to Existing Documentation
These documents expand on content already covered elsewhere in the repo:
| Existing File | Relevant Sections | What This Folder Adds |
|---|---|---|
| Challenges/real-world-challenges.md | Latency, model drift, hallucination, RAG quality, cost, guardrails, evaluation, token budget | Inference-specific deep dives, collaboration stories, decision frameworks with metrics |
| 10-ai-llm-design.md | Model selection, intent classification, RAG pipeline, prompt engineering | ML and LLM metrics taxonomies, evaluation rigor |
| 13-metrics.md | AI quality and operational metrics | Metric taxonomy with production context |
| 04b-architecture-lld.md | Inference sequencing, guardrails pipeline | Production challenges in running that pipeline at scale |
| 15-tradeoffs-challenges.md | Summary tradeoff tables | Decision frameworks with specific thresholds and collaboration context |
Key Numbers Quick Reference (Interview Recall Sheet)
Use these numbers to anchor your answers with specifics. Memorize the bolded values; a quick sanity check of how some of them relate follows the table.
| Category | Metric | Value |
|---|---|---|
| Scale | Concurrent sessions | **50K** |
| | Daily messages | **500K** |
| | Daily LLM calls | **~300K** (60% of messages) |
| Latency | End-to-end target | **< 3 seconds** |
| | Intent classifier (DistilBERT) | **15ms** avg, **50ms** P99 |
| | Embeddings (Titan V2) | **30ms** avg |
| | Reranker (cross-encoder) | **50ms** avg |
| | LLM generation (Claude Sonnet) | **500ms** TTFT P50, **1.3s** P99 |
| Cost | Cost per session (optimized) | **$0.025** (down from $0.082) |
| | Monthly inference cost | **~$143K** |
| | Monthly savings from optimizations | **~$119K** |
| Quality | Intent accuracy | **92.1%** (production) |
| | RAG Recall@3 | **86%** (after fine-tuning) |
| | Resolution rate | **73%** |
| | Escalation rate | **12%** |
| | Thumbs up rate | **65%** |
| | CSAT score | **4.3 / 5.0** |
| | Hallucination grounding score | **0.91** avg |
| | ASIN validation rate | **99.7%** |
| Models | Models in pipeline | **4** (classifier → embeddings → reranker → LLM) |
| | LLM bypass rate (templates) | **40%** of messages |
| | Golden dataset size | **500** queries |
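A few of these values are arithmetically linked, which makes them easier to reproduce at the whiteboard. The relationships in this sanity check are assumptions for recall purposes; the input volumes and rates come straight from the table:

```python
# Hedged sanity check: how some recall-sheet numbers relate to each other.
daily_messages = 500_000
llm_bypass_rate = 0.40                       # templated replies that skip the LLM

daily_llm_calls = daily_messages * (1 - llm_bypass_rate)
assert daily_llm_calls == 300_000            # matches "~300K (60% of messages)"

cost_per_session_before = 0.082
cost_per_session_after = 0.025
savings_share = 1 - cost_per_session_after / cost_per_session_before
print(f"per-session cost cut: ~{savings_share:.0%}")  # ~70%
```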
Common Interview Questions → Document Mapping
| Interview Question | Where to Answer From |
|---|---|
| "How did you handle latency at scale?" | 01 §Challenge 1 — speculative execution, graceful degradation |
| "How did you optimize costs?" | 01 §Challenge 5 — LLM bypass, model tiering, prompt caching |
| "How did you work with data scientists?" | 02 — All 9 Areas — model selection, evaluation, drift monitoring |
| "How did you choose between models?" | 03 §Tradeoff 2 — cost per marginal accuracy point |
| "What metrics did you track?" | 04 + 05 — full taxonomy, Tier ½/3 production monitoring |
| "How did you handle hallucinations?" | 05 §Part 2 + 02 §Area 6 — grounding score, ASIN/price validation |
| "How did you deploy model changes safely?" | 06 — All 4 Layers — golden dataset → shadow → canary → monitoring |
| "What tradeoffs did you make?" | 03 — All 7 Tradeoffs — with reversal triggers and decision metrics |
| "How did you scale the inference pipeline?" | 01 §Challenge 2-3 — SageMaker scaling, Bedrock provisioned throughput |
| "How did you evaluate the LLM?" | 06 §Layer 1 — golden dataset, BERTScore, human evaluation |
Key Narrative Themes
When presenting this content in interviews, weave in these recurring themes:
- Metrics-driven decisions, not opinions — Every major decision (model selection, temperature, chunk size) was backed by a quantified metric and a documented reversal trigger.
- Engineer and data scientist partnership — 9 distinct collaboration areas, shared metrics, joint ownership of the overlap zone.
- Production is not research — The DS team's top model (RoBERTa at 94.8%) wasn't the right production choice. Latency, cost, and scaling constraints dominated.
- Layered defense — 4-layer evaluation framework, multi-layer guardrails, graceful degradation per model.
- Continuous improvement — Monthly retraining, quarterly golden dataset refresh, weekly metric reviews, automated drift detection (a drift-check sketch follows this list).
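Automated drift detection is the theme interviewers most often probe for mechanics. One common statistic is the population stability index (PSI) over the intent-label distribution; whether MangaAssist used PSI specifically is an assumption here (drift monitoring itself is covered in 02), and the bucket values below are invented for illustration:

```python
import math

def psi(expected: list[float], observed: list[float], eps: float = 1e-6) -> float:
    """PSI = sum((obs - exp) * ln(obs / exp)) over distribution buckets."""
    return sum(
        (o - e) * math.log((o + eps) / (e + eps))
        for e, o in zip(expected, observed)
    )

training_dist = [0.40, 0.30, 0.20, 0.10]   # intent shares at training time
weekly_dist   = [0.33, 0.28, 0.24, 0.15]   # shares observed this week

# Common rule of thumb: < 0.1 stable, 0.1-0.25 watch closely, > 0.25 retrain.
print(f"PSI = {psi(training_dist, weekly_dist):.3f}")  # ~0.042, stable
```

Wiring a check like this into the weekly metric review is what turns "monthly retraining" from a calendar ritual into a triggered response.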