# Model Inference
Serving patterns: hosted APIs vs self-hosted (vLLM / TGI / Triton), batching, KV-cache reuse, quantisation, and the latency/cost frontier.
## Interview talking points
- API vs self-host? Default to hosted unless cost or latency forces self-hosting. The self-host break-even is usually above ~1M tokens/day at consistent load; see the back-of-envelope sketch after this list.
- vLLM vs Triton. vLLM for LLM serving (PagedAttention, continuous batching); Triton Inference Server for multi-framework, multi-GPU mixed traffic. A minimal vLLM sketch follows the list.
- KV-cache reuse, e.g. Anthropic's prompt caching. Saves both cost and TTFT for repeated system prompts; design the prompt so the large, stable prefix comes first and carries cache markers (sketch below).
- Quantisation tradeoff. INT8/INT4 for self-hosting; the quality cliff is sharp on small models, so re-run your evals after quantising (example below).
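
A back-of-envelope break-even calculation for the hosted-vs-self-host decision. Every price and throughput figure below is an illustrative placeholder, not current vendor pricing, so plug in real quotes before relying on the output.

```python
# Hosted-API vs self-hosted GPU break-even sketch.
# All numbers are assumed placeholders, not real pricing.
HOSTED_PRICE_PER_1K_TOKENS = 0.015   # $ per 1K tokens, assumed blended input/output rate
GPU_COST_PER_HOUR = 2.00             # $ per hour for one inference GPU, assumed

def hosted_cost_per_day(tokens_per_day: float) -> float:
    """Hosted cost scales linearly with volume."""
    return tokens_per_day / 1_000 * HOSTED_PRICE_PER_1K_TOKENS

def self_hosted_cost_per_day(gpus: int = 1) -> float:
    """Self-hosting is a fixed cost: the GPUs run 24h whether or not they are busy."""
    return gpus * GPU_COST_PER_HOUR * 24

break_even = self_hosted_cost_per_day() / HOSTED_PRICE_PER_1K_TOKENS * 1_000
print(f"Break-even volume: {break_even:,.0f} tokens/day")

for tokens_per_day in (500_000, 1_000_000, 5_000_000, 20_000_000):
    h, s = hosted_cost_per_day(tokens_per_day), self_hosted_cost_per_day()
    winner = "self-host" if s < h else "hosted"
    print(f"{tokens_per_day:>12,} tok/day  hosted ${h:7.2f}  self-host ${s:7.2f}  -> {winner}")
```

With these placeholder rates the break-even lands around a few million tokens/day; the real number moves with model tier, GPU price, and how spiky the traffic is (idle GPUs still cost money).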
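A minimal vLLM sketch of the offline batch path. PagedAttention and continuous batching happen inside the engine, so the calling code just submits prompts; the model id is an assumption.

```python
# Minimal vLLM batch-inference sketch; batching is handled by the engine.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # assumed model id
params = SamplingParams(temperature=0.7, max_tokens=256)

prompts = [
    "Summarise the tradeoffs of INT4 quantisation.",
    "When does self-hosting beat a hosted API?",
]
for output in llm.generate(prompts, params):
    print(output.prompt, "->", output.outputs[0].text.strip())
```

For online traffic, recent vLLM versions expose the same engine behind an OpenAI-compatible HTTP server, which is where continuous batching pays off under mixed request sizes.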
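A minimal sketch of prompt-cache markers with the Anthropic Python SDK. The model id and prompt text are placeholders, and cache pricing/TTL details should be checked against the current docs.

```python
# Prompt-caching sketch: mark the large, stable prefix (system prompt, tool
# definitions, few-shot examples) so repeated requests reuse the cached prefix.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."  # large, stable instructions shared across requests

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumed model id
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # cache marker: prefix up to here is cached
        }
    ],
    messages=[{"role": "user", "content": "User-specific question goes here."}],
)

# usage reports whether the prefix was written to or read from the cache
print(response.usage)
```

The design point is prompt layout: anything that changes per request (user text, retrieved context) goes after the marked prefix, otherwise the cache never hits.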
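One concrete quantisation path for self-hosting: a 4-bit (NF4) load via Hugging Face Transformers with bitsandbytes. The model id is an assumption, and the quality caveat from the bullet applies: score the quantised model on the same eval set as the full-precision baseline before routing traffic to it.

```python
# 4-bit NF4 quantised load with Transformers + bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # assumed model id

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16, store weights in 4-bit
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quality gate belongs here: run the same eval suite as the fp16 baseline
# before promoting the quantised model, especially for models under ~7B.
```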
## Files in this folder
| File | Title |
|---|---|
| 01-inference-pipeline-challenges.md | 01. Model Inference Pipeline — Production Challenges |
| 02-data-scientist-collaboration.md | 02. Data Scientist Collaboration — Simplifying Production Inference Together |
| 03-tradeoffs-decisions.md | 03. Tradeoffs & Decision Frameworks — How Metrics Drove Every Choice |
| 04-ml-metrics-taxonomy.md | 04. ML Metrics Taxonomy — Full Reference + Production Application |
| 05-llm-metrics-taxonomy.md | 05. LLM Metrics Taxonomy — Full Reference + Production Application |
| 06-gpu-architecture-challenges.md | 06. GPU Architecture Challenges — MangaAssist |
| 06-model-evaluation-framework.md | 06. Model Evaluation Framework — End-to-End Quality Gates |
| 07-model-evaluation-scenarios.md | 07. Model Evaluation Scenarios — Deep Dive Q&A |
| 08-sagemaker-endpoints-fastapi-asyncio-rest.md | 08. SageMaker Endpoints, FastAPI, asyncio, and REST for Model Inference |
| 09-sagemaker-and-azure-inference-apis.md | 09. SageMaker and Azure Inference APIs for LLMs and Traditional ML Models |
| README.md | Model Inference - Production Challenges, Metrics, and Data Science Collaboration |
Back to the home page.