
vLLM Deep Dive for MangaAssist

A complete scenario pack on how vLLM replaced the previous raw Transformers serving path for self-hosted generation, and why that decision changed latency, concurrency, cost, and operability in the chatbot. It covers architecture, implementation, deployment, monitoring, model preparation, and interview prep.

This folder expands the vLLM discussion in ../02-open-source-libraries.md into scenario walkthroughs, low-level implementation notes, deployment guides, monitoring runbooks, model preparation pipelines, and interview-prep material — all grounded in MangaAssist's production architecture.

Measured Impact Summary

| Metric | Before (HF Transformers) | After (vLLM 0.4.3) |
|---|---|---|
| P50 latency | 1,820 ms | 620 ms |
| P99 latency | 4,200 ms | 1,400 ms |
| Concurrent sessions per GPU | 4–6 | 85–90 |
| GPU instances | 8 | 4 |
| Monthly cost | $18.4K | $9.2K |
| OOM restarts | Multiple/day | Zero |
| TTFT (returning user) | ~900 ms | ~180 ms |
| Prefix cache hit rate | N/A | ~72% |
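
These gains trace back to a small set of engine-level features detailed in documents 02 and 04: AWQ weights, automatic prefix caching, Multi-LoRA, and continuous batching. As orientation before those documents, here is a minimal sketch of an engine setup that exercises the same features using vLLM's AsyncLLMEngine API; the checkpoint name and every numeric value below are illustrative placeholders, not MangaAssist's production settings.

```python
# Illustrative only: engine arguments exercising the features behind the
# metrics above (AWQ, prefix caching, Multi-LoRA, continuous batching).
# Model path and numbers are placeholders, not the production config.
from vllm import AsyncEngineArgs, AsyncLLMEngine

engine_args = AsyncEngineArgs(
    model="TheBloke/Llama-2-13B-chat-AWQ",  # placeholder AWQ checkpoint
    quantization="awq",                      # load 4-bit AWQ weights
    enable_prefix_caching=True,              # reuse KV blocks shared across turns/users
    enable_lora=True,                        # serve multiple LoRA adapters on one base model
    max_loras=4,
    max_num_seqs=96,                         # cap on sequences continuously batched per step
    gpu_memory_utilization=0.90,             # fraction of VRAM given to weights + KV-block pool
    max_model_len=4096,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```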

Document Index

| # | Document | Focus | Area |
|---|---|---|---|
| 1 | 01-vllm-game-changer-scenarios.md | Business and architecture story: 8 scenarios with Mermaid diagrams and concrete metrics | Architecture |
| 2 | 02-vllm-low-level-implementation-and-critical-decisions.md | Production code patterns: engine construction, admission control, streaming, OOM guard, health probes | Implementation |
| 3 | 03-vllm-interview-prep-deep-dive.md | 20 interview questions with hints, answers, and deep dives — tagged by interviewer role | Interview prep |
| 4 | 04-vllm-deployment-and-infrastructure.md | Docker multi-stage build, SageMaker endpoints, auto-scaling, warm pools, environment variables | Deployment |
| 5 | 05-vllm-monitoring-and-troubleshooting.md | Prometheus metrics, CloudWatch alarms, SLOs, dashboards, troubleshooting runbook, load testing | Operations |
| 6 | 06-vllm-model-preparation-and-quantization.md | AWQ calibration, LoRA adapter management, quality gates, CI/CD pipeline, model versioning | ML Engineering |

For architecture understanding (Staff/Principal interviews):
  1. 01-scenarios — Why vLLM, what changed, measured impact
  2. 02-low-level — How decisions translate to code
  3. 03-interview-prep — Polished talking points

For deployment and operations (SRE/DevOps interviews):
  1. 04-deployment — Docker, SageMaker, scaling
  2. 05-monitoring — Metrics, alerts, runbooks
  3. Scenarios 7–8 in 01-scenarios for deployment and observability context

For ML engineering (ML Eng interviews):
  1. 06-model-preparation — AWQ, LoRA, quality gates
  2. Scenarios 4–5 in 01-scenarios for Multi-LoRA and quantization context
  3. 02-low-level — Engine config rationale

What This Pack Covers

  • Why raw HuggingFace Transformers plus custom serving stopped scaling for the self-hosted Llama path
  • How PagedAttention, continuous batching, prefix caching, Multi-LoRA, and AWQ changed the operating model
  • Why the vLLM decision mattered specifically for a multi-turn shopping chatbot instead of a generic model benchmark (see the prefix-caching sketch after this list)
  • Which low-level implementation choices made the migration safe, observable, and reversible
  • Complete Docker image build, SageMaker deployment, and auto-scaling configuration
  • Production monitoring with Prometheus metrics, CloudWatch alarms, SLOs, and troubleshooting runbooks
  • AWQ quantization pipeline with domain-specific calibration and quality gates
  • LoRA adapter training, versioning, and promotion with rollback procedures
  • 20 interview questions covering all aspects, tagged by interviewer role
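
The TTFT and prefix-cache-hit figures in the impact summary hinge on how a multi-turn chatbot rebuilds its prompt every turn. The sketch below illustrates that shape with vLLM's offline LLM API; the model name, prompt template, and chat helper are hypothetical placeholders, but the pattern (an unchanged system prompt plus prior turns, followed only by the new user message) is what lets automatic prefix caching skip most of the prefill for returning users.

```python
# Illustrative only: why automatic prefix caching pays off for multi-turn chat.
# Each turn re-sends the full conversation; with enable_prefix_caching=True the
# engine can serve KV blocks for the unchanged prefix from cache, so only the
# newest message is prefilled. Model, prompt format, and helper are hypothetical.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)  # placeholder model
params = SamplingParams(temperature=0.7, max_tokens=128)

system_prompt = "You are MangaAssist, a shopping assistant for manga."  # hypothetical
history: list[tuple[str, str]] = []  # (user, assistant) turns so far

def chat(user_msg: str) -> str:
    # Everything before the new message is an already-seen prefix the engine
    # can reuse from cached KV blocks instead of re-prefilling it.
    prompt = system_prompt + "\n"
    for user, assistant in history:
        prompt += f"User: {user}\nAssistant: {assistant}\n"
    prompt += f"User: {user_msg}\nAssistant:"
    reply = llm.generate([prompt], params)[0].outputs[0].text
    history.append((user_msg, reply))
    return reply
```

With prefix caching disabled, every turn re-prefills the entire conversation; with it enabled, prefill cost grows only with the new turn once the shared blocks are cached, which is where the returning-user TTFT drop comes from.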

Grounding Docs