vLLM Deep Dive for MangaAssist
A complete scenario pack on how vLLM replaced the raw Hugging Face Transformers serving path for self-hosted generation in MangaAssist, and why that decision changed latency, concurrency, cost, and operability in the chatbot.
This folder expands the vLLM discussion in ../02-open-source-libraries.md into scenario walkthroughs, low-level implementation notes, deployment guides, monitoring runbooks, model-preparation pipelines, and interview-prep material — all grounded in MangaAssist's production architecture.
Measured Impact Summary
| Metric | Before (HF Transformers) | After (vLLM 0.4.3) |
|---|---|---|
| P50 latency | 1,820 ms | 620 ms |
| P99 latency | 4,200 ms | 1,400 ms |
| Concurrent sessions per GPU | 4–6 | 85–90 |
| GPU instances | 8 | 4 |
| Monthly cost | $18.4K | $9.2K |
| OOM restarts | Multiple/day | Zero |
| TTFT (returning user) | ~900 ms | ~180 ms |
| Prefix cache hit rate | N/A | ~72% |
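The latency and cache rows above come from vLLM's built-in Prometheus metrics. A hedged sketch of the queries that would reproduce them (metric names vary across vLLM versions, and the prefix-cache gauge is only exposed when prefix caching is enabled):

```promql
# P50 / P99 end-to-end request latency over the last 5 minutes
histogram_quantile(0.50, sum(rate(vllm:e2e_request_latency_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum(rate(vllm:e2e_request_latency_seconds_bucket[5m])) by (le))

# Time to first token (TTFT), P50
histogram_quantile(0.50, sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le))

# Prefix cache hit rate (exposed as a gauge in newer vLLM releases)
vllm:gpu_prefix_cache_hit_rate
```

These queries assume the vLLM server's `/metrics` endpoint is being scraped by Prometheus; the monitoring document in this pack covers the full dashboard and alarm setup.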
Document Index
| # | Document | Focus | Category |
|---|---|---|---|
| 1 | 01-vllm-game-changer-scenarios.md | Business and architecture story: 8 scenarios with Mermaid diagrams and concrete metrics | Architecture |
| 2 | 02-vllm-low-level-implementation-and-critical-decisions.md | Production code patterns: engine construction, admission control, streaming, OOM guard, health probes | Implementation |
| 3 | 03-vllm-interview-prep-deep-dive.md | 20 interview questions with hints, answers, and deep dives — tagged by interviewer role | Interview prep |
| 4 | 04-vllm-deployment-and-infrastructure.md | Docker multi-stage build, SageMaker endpoints, auto-scaling, warm pools, environment variables | Deployment |
| 5 | 05-vllm-monitoring-and-troubleshooting.md | Prometheus metrics, CloudWatch alarms, SLOs, dashboards, troubleshooting runbook, load testing | Operations |
| 6 | 06-vllm-model-preparation-and-quantization.md | AWQ calibration, LoRA adapter management, quality gates, CI/CD pipeline, model versioning | ML Engineering |
Recommended Reading Order
For architecture understanding (Staff/Principal interviews):
1. 01-scenarios — Why vLLM, what changed, measured impact
2. 02-low-level — How decisions translate to code
3. 03-interview-prep — Polished talking points

For deployment and operations (SRE/DevOps interviews):
1. 04-deployment — Docker, SageMaker, scaling
2. 05-monitoring — Metrics, alerts, runbooks
3. 01-scenarios — Scenarios 7–8 for deployment and observability context

For ML engineering (ML Eng interviews):
1. 06-model-preparation — AWQ, LoRA, quality gates
2. 01-scenarios — Scenarios 4–5 for Multi-LoRA and quantization context
3. 02-low-level — Engine config rationale
What This Pack Covers
- Why raw Hugging Face Transformers plus custom serving stopped scaling for the self-hosted Llama path
- How PagedAttention, continuous batching, prefix caching, Multi-LoRA, and AWQ changed the operating model
- Why the vLLM decision mattered specifically for a multi-turn shopping chatbot instead of a generic model benchmark
- Which low-level implementation choices made the migration safe, observable, and reversible
- Complete Docker image build, SageMaker deployment, and auto-scaling configuration
- Production monitoring with Prometheus metrics, CloudWatch alarms, SLOs, and troubleshooting runbooks
- AWQ quantization pipeline with domain-specific calibration and quality gates
- LoRA adapter training, versioning, and promotion with rollback procedures
- 20 interview questions covering all aspects, tagged by interviewer role
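As a rough illustration of how the features listed above combine at serve time, a single OpenAI-compatible vLLM server can enable AWQ, prefix caching, and Multi-LoRA from the command line (flags shown against vLLM ~0.4.x; the model and adapter paths are illustrative placeholders, not the actual MangaAssist artifacts — PagedAttention and continuous batching are on by default):

```shell
# Illustrative launch of vLLM's OpenAI-compatible server.
# Model and adapter paths are placeholders for MangaAssist's real artifacts.
python -m vllm.entrypoints.openai.api_server \
  --model /models/llama-3-8b-awq \
  --quantization awq \
  --enable-prefix-caching \
  --enable-lora \
  --lora-modules manga-assist=/models/lora/manga-assist \
  --max-loras 4 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192
```

The deployment and model-preparation documents in this pack cover the production configuration in detail, including how these flags interact with SageMaker endpoints and the AWQ calibration pipeline.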