Model Inference

Serving patterns: hosted APIs vs self-hosted (vLLM / TGI / Triton), batching, KV-cache reuse, quantisation, and the latency/cost frontier.

Interview talking points

  • API vs self-host? Default to hosted unless cost or latency forces self-hosting. The self-host break-even is usually >1M tokens/day at consistent load; see the break-even sketch below.
  • vLLM vs Triton. vLLM for LLM serving (PagedAttention, continuous batching); Triton Inference Server for multi-framework, multi-GPU mixed traffic. A minimal vLLM example follows below.
  • KV-cache reuse via Anthropic's prompt caching. Saves both cost and time-to-first-token (TTFT) for repeated system prompts; put the stable content first and mark it with cache breakpoints, as in the sketch below.
  • Quantisation tradeoff. INT8/INT4 cuts memory and cost for self-hosting, but the quality cliff is sharp on small models; see the 4-bit loading sketch below.
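
A back-of-envelope way to locate that break-even: compare the linear per-token cost of a hosted API against the fixed daily cost of an always-on GPU. Every constant below is an illustrative assumption, not a current list price; substitute your own quotes.

```python
# Back-of-envelope break-even: hosted API vs a self-hosted GPU box.
# All constants are illustrative assumptions, not list prices.

API_COST_PER_1M_TOKENS = 3.00    # assumed blended input/output $ per 1M tokens
GPU_COST_PER_HOUR = 2.50         # assumed on-demand price for one inference GPU
GPU_THROUGHPUT_TOK_S = 2_000     # assumed sustained tokens/sec with batching

def hosted_cost_per_day(tokens_per_day: float) -> float:
    """Hosted cost scales linearly with traffic."""
    return tokens_per_day / 1e6 * API_COST_PER_1M_TOKENS

def self_host_cost_per_day() -> float:
    """Self-host cost is fixed: the GPU runs 24h whether or not traffic comes."""
    return GPU_COST_PER_HOUR * 24

for tokens in (100_000, 1_000_000, 10_000_000, 100_000_000):
    utilisation = tokens / (GPU_THROUGHPUT_TOK_S * 86_400)  # share of daily capacity
    print(f"{tokens:>11,} tok/day  hosted ${hosted_cost_per_day(tokens):7.2f}  "
          f"self-host ${self_host_cost_per_day():6.2f}  gpu-util {utilisation:6.2%}")
```

With these made-up constants the crossover sits near 20M tokens/day, and a single GPU is under 1% utilised at 1M tokens/day, which is why bursty or low traffic almost always favours hosted.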
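
For the vLLM side, the offline entry point is tiny: the engine schedules all prompts together via continuous batching and manages the KV cache with PagedAttention, with no manual batching code. The checkpoint name here is illustrative.

```python
from vllm import LLM, SamplingParams

# Continuous batching and PagedAttention KV-cache management happen
# inside the engine; the checkpoint name is illustrative.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Explain continuous batching in one sentence.",
    "Why does PagedAttention reduce KV-cache fragmentation?",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text.strip())
```

For live traffic, vLLM also ships an OpenAI-compatible HTTP server (`vllm serve <model>`), which is usually the deployment you would benchmark against Triton.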
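
Prompt caching is opt-in per content block: a `cache_control` marker sets a breakpoint, and the prefix up to that block is cached and reused across calls. A minimal sketch with the Anthropic SDK; the model id and prompt text are placeholders, and only prefixes above a minimum length (on the order of 1024 tokens) are actually cached.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder: in practice this is a large, stable block (instructions,
# reference docs) well above the minimum cacheable prefix length.
LONG_SYSTEM_PROMPT = "...thousands of tokens of stable instructions..."

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model id
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Cache breakpoint: the prefix up to and including this block
            # is written to the cache and read back on later requests.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "First question about the docs"}],
)

# usage reports cache_creation_input_tokens on the first call and
# cache_read_input_tokens on later calls that hit the cache.
print(response.usage)
```

Cache reads are billed at a fraction of the normal input-token price and skip prefill for the cached prefix, which is where the TTFT win comes from.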
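
On quantisation, a minimal 4-bit loading sketch with `transformers` and bitsandbytes (the checkpoint name is again illustrative). NF4 weights are roughly a quarter the size of FP16, often the difference between one GPU and two, but re-run your evaluation suite afterwards: the smaller the base model, the more quality headroom quantisation consumes.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative checkpoint

# NF4 4-bit weights with bf16 compute: roughly 4x smaller than FP16 weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # needs a CUDA GPU with enough free memory
)

inputs = tokenizer("Quantisation trades memory for", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```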

Files in this folder

| File | Title |
| --- | --- |
| 01-inference-pipeline-challenges.md | 01. Model Inference Pipeline — Production Challenges |
| 02-data-scientist-collaboration.md | 02. Data Scientist Collaboration — Simplifying Production Inference Together |
| 03-tradeoffs-decisions.md | 03. Tradeoffs & Decision Frameworks — How Metrics Drove Every Choice |
| 04-ml-metrics-taxonomy.md | 04. ML Metrics Taxonomy — Full Reference + Production Application |
| 05-llm-metrics-taxonomy.md | 05. LLM Metrics Taxonomy — Full Reference + Production Application |
| 06-gpu-architecture-challenges.md | 06. GPU Architecture Challenges — MangaAssist |
| 06-model-evaluation-framework.md | 06. Model Evaluation Framework — End-to-End Quality Gates |
| 07-model-evaluation-scenarios.md | 07. Model Evaluation Scenarios — Deep Dive Q&A |
| 08-sagemaker-endpoints-fastapi-asyncio-rest.md | 08. SageMaker Endpoints, FastAPI, asyncio, and REST for Model Inference |
| 09-sagemaker-and-azure-inference-apis.md | 09. SageMaker and Azure Inference APIs for LLMs and Traditional ML Models |
| README.md | Model Inference - Production Challenges, Metrics, and Data Science Collaboration |

Back to the home page.