# Model Inference
Serving patterns: hosted APIs vs self-hosted (vLLM / TGI / Triton), batching, KV-cache reuse, quantisation, and the latency/cost frontier.
## Interview talking points
- API vs self-host? Default to hosted unless cost or latency forces self-hosting. The self-host break-even is usually above ~1M tokens/day at consistent load; see the back-of-envelope sketch after this list.
- vLLM vs Triton. vLLM for LLM serving (PagedAttention, continuous batching); Triton Inference Server for multi-framework, multi-GPU mixed traffic. A minimal vLLM sketch follows the list.
- KV-cache reuse, e.g. Anthropic's prompt caching. Saves both cost and TTFT for repeated system prompts; design the prompt so the large, stable prefix comes first and carries cache markers (sketch below).
- Quantisation tradeoff. INT8/INT4 for self-hosting; the quality cliff is sharp on small models, so re-run your evals after quantising (example below).
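
A back-of-envelope break-even calculation for the hosted-vs-self-host decision. Every price and throughput figure below is an illustrative placeholder, not current vendor pricing, so plug in real quotes before relying on the output.

```python
# Hosted-API vs self-hosted GPU break-even sketch.
# All numbers are assumed placeholders, not real pricing.
HOSTED_PRICE_PER_1K_TOKENS = 0.015   # $ per 1K tokens, assumed blended input/output rate
GPU_COST_PER_HOUR = 2.00             # $ per hour for one inference GPU, assumed

def hosted_cost_per_day(tokens_per_day: float) -> float:
    """Hosted cost scales linearly with volume."""
    return tokens_per_day / 1_000 * HOSTED_PRICE_PER_1K_TOKENS

def self_hosted_cost_per_day(gpus: int = 1) -> float:
    """Self-hosting is a fixed cost: the GPUs run 24h whether or not they are busy."""
    return gpus * GPU_COST_PER_HOUR * 24

break_even = self_hosted_cost_per_day() / HOSTED_PRICE_PER_1K_TOKENS * 1_000
print(f"Break-even volume: {break_even:,.0f} tokens/day")

for tokens_per_day in (500_000, 1_000_000, 5_000_000, 20_000_000):
    h, s = hosted_cost_per_day(tokens_per_day), self_hosted_cost_per_day()
    winner = "self-host" if s < h else "hosted"
    print(f"{tokens_per_day:>12,} tok/day  hosted ${h:7.2f}  self-host ${s:7.2f}  -> {winner}")
```

With these placeholder rates the break-even lands around a few million tokens/day; the real number moves with model tier, GPU price, and how spiky the traffic is (idle GPUs still cost money).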
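A minimal vLLM sketch of the offline batch path. PagedAttention and continuous batching happen inside the engine, so the calling code just submits prompts; the model id is an assumption.

```python
# Minimal vLLM batch-inference sketch; batching is handled by the engine.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # assumed model id
params = SamplingParams(temperature=0.7, max_tokens=256)

prompts = [
    "Summarise the tradeoffs of INT4 quantisation.",
    "When does self-hosting beat a hosted API?",
]
for output in llm.generate(prompts, params):
    print(output.prompt, "->", output.outputs[0].text.strip())
```

For online traffic, recent vLLM versions expose the same engine behind an OpenAI-compatible HTTP server, which is where continuous batching pays off under mixed request sizes.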
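A minimal sketch of prompt-cache markers with the Anthropic Python SDK. The model id and prompt text are placeholders, and cache pricing/TTL details should be checked against the current docs.

```python
# Prompt-caching sketch: mark the large, stable prefix (system prompt, tool
# definitions, few-shot examples) so repeated requests reuse the cached prefix.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."  # large, stable instructions shared across requests

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumed model id
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # cache marker: prefix up to here is cached
        }
    ],
    messages=[{"role": "user", "content": "User-specific question goes here."}],
)

# usage reports whether the prefix was written to or read from the cache
print(response.usage)
```

The design point is prompt layout: anything that changes per request (user text, retrieved context) goes after the marked prefix, otherwise the cache never hits.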
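One concrete quantisation path for self-hosting: a 4-bit (NF4) load via Hugging Face Transformers with bitsandbytes. The model id is an assumption, and the quality caveat from the bullet applies: score the quantised model on the same eval set as the full-precision baseline before routing traffic to it.

```python
# 4-bit NF4 quantised load with Transformers + bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # assumed model id

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16, store weights in 4-bit
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quality gate belongs here: run the same eval suite as the fp16 baseline
# before promoting the quantised model, especially for models under ~7B.
```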
## Files in this folder
| File | Title |
|---|---|
| 01-inference-pipeline-challenges.md | 01. Model Inference Pipeline — Production Challenges |
| 02-data-scientist-collaboration.md | 02. Data Scientist Collaboration — Simplifying Production Inference Together |
| 03-tradeoffs-decisions.md | 03. Tradeoffs & Decision Frameworks — How Metrics Drove Every Choice |
| 04-ml-metrics-taxonomy.md | 04. ML Metrics Taxonomy — Full Reference + Production Application |
| 05-llm-metrics-taxonomy.md | 05. LLM Metrics Taxonomy — Full Reference + Production Application |
| 06-gpu-architecture-challenges.md | 06. GPU Architecture Challenges — MangaAssist |
| 06-model-evaluation-framework.md | 06. Model Evaluation Framework — End-to-End Quality Gates |
| 07-model-evaluation-scenarios.md | 07. Model Evaluation Scenarios — Deep Dive Q&A |
| 08-sagemaker-endpoints-fastapi-asyncio-rest.md | 08. SageMaker Endpoints, FastAPI, asyncio, and REST for Model Inference |
| 09-sagemaker-and-azure-inference-apis.md | 09. SageMaker and Azure Inference APIs for LLMs and Traditional ML Models |
| README.md | Model Inference - Production Challenges, Metrics, and Data Science Collaboration |
Back to the home page.