
vLLM Deep Dive for MangaAssist

A complete scenario pack on how vLLM replaced the previous raw Transformers serving path for self-hosted generation, and why that decision changed latency, concurrency, cost, and operability in the chatbot. It covers architecture, implementation, deployment, monitoring, model preparation, and interview prep.

This folder expands the vLLM discussion in ../02-open-source-libraries.md into scenario walkthroughs, low-level implementation notes, deployment guides, monitoring runbooks, model preparation pipelines, and interview-prep material — all grounded in MangaAssist's production architecture.

Measured Impact Summary

| Metric | Before (HF Transformers) | After (vLLM 0.4.3) |
|---|---|---|
| P50 latency | 1,820 ms | 620 ms |
| P99 latency | 4,200 ms | 1,400 ms |
| Concurrent sessions per GPU | 4–6 | 85–90 |
| GPU instances | 8 | 4 |
| Monthly cost | $18.4K | $9.2K |
| OOM restarts | Multiple/day | Zero |
| TTFT (returning user) | ~900 ms | ~180 ms |
| Prefix cache hit rate | N/A | ~72% |
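
These gains trace back to a small set of engine-level features detailed in documents 02 and 04: AWQ weights, automatic prefix caching, Multi-LoRA, and continuous batching. As orientation before those documents, here is a minimal sketch of an engine setup that exercises the same features using vLLM's AsyncLLMEngine API; the checkpoint name and every numeric value below are illustrative placeholders, not MangaAssist's production settings.

```python
# Illustrative only: engine arguments exercising the features behind the
# metrics above (AWQ, prefix caching, Multi-LoRA, continuous batching).
# Model path and numbers are placeholders, not the production config.
from vllm import AsyncEngineArgs, AsyncLLMEngine

engine_args = AsyncEngineArgs(
    model="TheBloke/Llama-2-13B-chat-AWQ",  # placeholder AWQ checkpoint
    quantization="awq",                      # load 4-bit AWQ weights
    enable_prefix_caching=True,              # reuse KV blocks shared across turns/users
    enable_lora=True,                        # serve multiple LoRA adapters on one base model
    max_loras=4,
    max_num_seqs=96,                         # cap on sequences continuously batched per step
    gpu_memory_utilization=0.90,             # fraction of VRAM given to weights + KV-block pool
    max_model_len=4096,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```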

Document Index

| # | Document | Focus | Area |
|---|---|---|---|
| 1 | 01-vllm-game-changer-scenarios.md | Business and architecture story: 8 scenarios with Mermaid diagrams and concrete metrics | Architecture |
| 2 | 02-vllm-low-level-implementation-and-critical-decisions.md | Production code patterns: engine construction, admission control, streaming, OOM guard, health probes | Implementation |
| 3 | 03-vllm-interview-prep-deep-dive.md | 20 interview questions with hints, answers, and deep dives — tagged by interviewer role | Interview prep |
| 4 | 04-vllm-deployment-and-infrastructure.md | Docker multi-stage build, SageMaker endpoints, auto-scaling, warm pools, environment variables | Deployment |
| 5 | 05-vllm-monitoring-and-troubleshooting.md | Prometheus metrics, CloudWatch alarms, SLOs, dashboards, troubleshooting runbook, load testing | Operations |
| 6 | 06-vllm-model-preparation-and-quantization.md | AWQ calibration, LoRA adapter management, quality gates, CI/CD pipeline, model versioning | ML Engineering |

For architecture understanding (Staff/Principal interviews):
  1. 01-scenarios — Why vLLM, what changed, measured impact
  2. 02-low-level — How decisions translate to code
  3. 03-interview-prep — Polished talking points

For deployment and operations (SRE/DevOps interviews):
  1. 04-deployment — Docker, SageMaker, scaling
  2. 05-monitoring — Metrics, alerts, runbooks
  3. Scenarios 7–8 in 01-scenarios for deployment and observability context

For ML engineering (ML Eng interviews):
  1. 06-model-preparation — AWQ, LoRA, quality gates
  2. Scenarios 4–5 in 01-scenarios for Multi-LoRA and quantization context
  3. 02-low-level — Engine config rationale

What This Pack Covers

  • Why raw HuggingFace Transformers plus custom serving stopped scaling for the self-hosted Llama path
  • How PagedAttention, continuous batching, prefix caching, Multi-LoRA, and AWQ changed the operating model
  • Why the vLLM decision mattered specifically for a multi-turn shopping chatbot instead of a generic model benchmark (see the prefix-caching sketch after this list)
  • Which low-level implementation choices made the migration safe, observable, and reversible
  • Complete Docker image build, SageMaker deployment, and auto-scaling configuration
  • Production monitoring with Prometheus metrics, CloudWatch alarms, SLOs, and troubleshooting runbooks
  • AWQ quantization pipeline with domain-specific calibration and quality gates
  • LoRA adapter training, versioning, and promotion with rollback procedures
  • 20 interview questions covering all aspects, tagged by interviewer role
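
The TTFT and prefix-cache-hit figures in the impact summary hinge on how a multi-turn chatbot rebuilds its prompt every turn. The sketch below illustrates that shape with vLLM's offline LLM API; the model name, prompt template, and chat helper are hypothetical placeholders, but the pattern (an unchanged system prompt plus prior turns, followed only by the new user message) is what lets automatic prefix caching skip most of the prefill for returning users.

```python
# Illustrative only: why automatic prefix caching pays off for multi-turn chat.
# Each turn re-sends the full conversation; with enable_prefix_caching=True the
# engine can serve KV blocks for the unchanged prefix from cache, so only the
# newest message is prefilled. Model, prompt format, and helper are hypothetical.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)  # placeholder model
params = SamplingParams(temperature=0.7, max_tokens=128)

system_prompt = "You are MangaAssist, a shopping assistant for manga."  # hypothetical
history: list[tuple[str, str]] = []  # (user, assistant) turns so far

def chat(user_msg: str) -> str:
    # Everything before the new message is an already-seen prefix the engine
    # can reuse from cached KV blocks instead of re-prefilling it.
    prompt = system_prompt + "\n"
    for user, assistant in history:
        prompt += f"User: {user}\nAssistant: {assistant}\n"
    prompt += f"User: {user_msg}\nAssistant:"
    reply = llm.generate([prompt], params)[0].outputs[0].text
    history.append((user_msg, reply))
    return reply
```

With prefix caching disabled, every turn re-prefills the entire conversation; with it enabled, prefill cost grows only with the new turn once the shared blocks are cached, which is where the returning-user TTFT drop comes from.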

Grounding Docs