
vLLM Interview Prep Deep Dive

Use this document to turn the vLLM migration into strong interview answers. Each question includes multiple hints, a final answer, and a deeper explanation so you can answer at both summary and implementation depth. Questions are organized by difficulty and tagged with the roles that typically ask them.

How To Use This

  1. Answer the question out loud before reading the hints.
  2. Use the hints to pressure-test whether your answer covers business impact, system behavior, and low-level implementation.
  3. Use the final answer as the concise interview version.
  4. Use the deep dive explanation when the interviewer keeps drilling.

Role Tags

Tag Who asks this
[Staff/Principal] Staff Engineer, Principal Engineer, or Architect interviews
[SDE] Software Development Engineer or Senior SDE interviews
[ML Eng] Machine Learning Engineer interviews
[SRE/Ops] Site Reliability Engineer or DevOps interviews

Reference Documents

Document When to reference
01-vllm-game-changer-scenarios.md Concrete metrics and before/after numbers
02-vllm-low-level-implementation-and-critical-decisions.md Code patterns and implementation decisions
04-vllm-deployment-and-infrastructure.md Docker, SageMaker, scaling details
05-vllm-monitoring-and-troubleshooting.md Metrics, alerting, SLOs, troubleshooting
06-vllm-model-preparation-and-quantization.md AWQ calibration, LoRA management

Part 1: Core Architecture Questions

Question 1 [Staff/Principal] [SDE]

Why was vLLM a game changer for this chatbot instead of just another inference library?

Hint 1: Compare the old path to the new operating model, not just benchmark numbers.

Hint 2: Mention memory efficiency, dynamic scheduling, and multi-turn chat behavior.

Hint 3: Tie the answer to user-visible latency and GPU cost, not only model-server elegance.

Final answer:

vLLM was a game changer because it fixed the specific production failures of our previous self-hosted inference path. Raw Transformers plus custom serving wasted KV-cache memory, scheduled work poorly during uneven traffic, and recomputed repeated chat prefixes. vLLM gave us PagedAttention, continuous batching, and prefix caching, which let us serve more concurrent chat sessions per GPU, reduce queueing, and improve responsiveness. That translated into both lower cost and a faster-feeling assistant.

Deep dive explanation:

The important part is that MangaAssist was an interactive, bursty, multi-turn system. PagedAttention turned wasted VRAM into usable concurrency. Continuous batching kept short requests from waiting behind long generations. Prefix caching reduced repeated work across turns. So the real win was not just more tokens per second. It was a better chatbot operating model: fewer GPUs, lower tail latency, and a path to handle spikes without degrading UX.


Question 2 [Staff/Principal]

Why did you choose vLLM over TensorRT-LLM and TGI?

Hint 1: Do not say "it was the fastest," because that was not fully true.

Hint 2: Explain the tradeoff between raw speed and operational burden.

Hint 3: Mention hardware flexibility, build complexity, and migration risk.

Final answer:

I chose vLLM because it had the best performance-to-operability ratio. TensorRT-LLM was slightly faster in some benchmarks, but it carried more NVIDIA-specific build and deployment complexity. TGI was easier operationally than raw Transformers, but it still trailed vLLM on our workload. vLLM was fast enough to win economically while staying simpler to run and easier to evolve.

Deep dive explanation:

This is a senior-level tradeoff answer. The right comparison was not "which library wins the benchmark chart by a few percent." The right comparison was "which engine improves throughput and latency enough to change cost while keeping setup, maintenance, and rollback reasonable." vLLM won because it gave us most of the performance upside without accepting hard vendor lock-in or a fragile build pipeline.


Question 3 [ML Eng] [SDE]

Explain PagedAttention in a way that connects directly to chatbot performance.

Hint 1: Start with the KV cache problem.

Hint 2: Use the operating-system page analogy only if it helps the explanation.

Hint 3: End with concurrency and queueing, not a pure memory lecture.

Final answer:

Autoregressive generation requires keeping the KV tensors for all prior tokens in memory. The old serving path reserved a contiguous KV region sized for the maximum context length up front, so a large share of GPU VRAM sat allocated but never used. PagedAttention fixes that by allocating KV memory in small blocks as tokens are actually generated. In chatbot terms, that means more live conversations fit on one GPU, which means less queueing and better latency during traffic spikes.

Deep dive explanation:

The strongest version of this answer is to connect waste to user impact. If memory waste limits how many requests can run concurrently, new requests wait even when the GPU should have had room. PagedAttention increased effective density, so the same hardware absorbed more conversations. That is why it mattered operationally and financially.
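
If the interviewer pushes for numbers, a quick back-of-the-envelope sketch helps. The shapes below assume Llama-3-8B-style dimensions (32 layers, 8 KV heads, head dim 128, fp16) and an illustrative 4,096-token reservation; they are examples, not the production figures.

```python
# Back-of-the-envelope KV cache math (illustrative shapes, not production numbers).
BYTES_PER_TOKEN = 2 * 32 * 8 * 128 * 2  # K and V, 32 layers, 8 KV heads, head dim 128, fp16
# = 131072 bytes, i.e. 128 KiB of KV cache per token

max_len = 4096       # context window a naive allocator reserves per request
typical_len = 900    # tokens a typical chat turn actually uses

reserved_mb = max_len * BYTES_PER_TOKEN / 2**20
used_mb = typical_len * BYTES_PER_TOKEN / 2**20
print(f"reserved per request: {reserved_mb:.0f} MiB, actually used: {used_mb:.0f} MiB")
print(f"wasted: {100 * (1 - typical_len / max_len):.0f}%")
# Paged allocation hands out KV memory in small blocks on demand, so the ~78%
# that a fixed reservation would waste stays available for other sessions.
```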


Question 4 [SDE] [ML Eng]

How did continuous batching improve latency during spikes instead of making users wait longer?

Hint 1: Contrast dynamic refilling with fixed batch windows.

Hint 2: Mention mixed request lengths.

Hint 3: Separate queue wait from generation time in your answer.

Final answer:

The old pattern acted like fixed batching, which meant short requests could sit behind long ones or wait for the next batch window. vLLM continuously refilled open decode slots as sequences finished, so new requests entered the active set immediately instead of waiting for the whole batch to drain. That reduced queue wait time, which is why latency improved even while throughput increased.

Deep dive explanation:

This is really a scheduler story. In chat traffic, some requests need a few tokens and others need hundreds. A rigid batch boundary penalizes short requests unnecessarily. Continuous batching keeps the GPU busy while also reducing the amount of idle or blocked time hidden in the queue. The key implementation detail is to measure queue wait separately from generation latency so you can prove the scheduler is helping.
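
A minimal sketch of that instrumentation, assuming an async engine wrapper and a generic emit_metric helper (both are stand-ins, not the actual service code):

```python
import time

# Sketch only: `engine` is an assumed async wrapper around the inference engine
# and `emit_metric` is a stand-in for the real metrics client.
async def handle_chat_request(request, engine, emit_metric):
    t_enqueue = time.monotonic()     # request accepted, waiting for a decode slot
    first_token_at = None

    async for token in engine.generate(request):
        if first_token_at is None:
            first_token_at = time.monotonic()
            # Queue wait and prefill both live inside TTFT; report queue wait
            # separately if the engine exposes its scheduling timestamp.
            emit_metric("TTFTMs", (first_token_at - t_enqueue) * 1000)
        yield token

    if first_token_at is not None:
        emit_metric("GenerationMs", (time.monotonic() - first_token_at) * 1000)
```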


Question 5 [SDE] [ML Eng]

Why did prefix caching matter so much in a multi-turn shopping assistant?

Hint 1: Think about what stays stable across turns.

Hint 2: Mention both efficiency and perceived speed.

Hint 3: Include a low-level requirement for making caching effective.

Final answer:

Multi-turn shopping chat repeats a lot of leading prompt structure: system instructions, safety rules, formatting rules, and parts of the conversation scaffold. Without prefix caching, we were paying to recompute that shared prefix on every turn. vLLM let us reuse that work, which reduced redundant compute and improved time to first token. That matters because users judge chat quality by how quickly the assistant starts responding.

Deep dive explanation:

The low-level catch is that prefix caching only helps when the leading tokens are actually stable. That means prompt construction must be deterministic before the variable region begins. If you inject timestamps, random IDs, or personalized fields too early, you destroy cache reuse. The architectural lesson is that prompt design and inference efficiency are linked.
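
A sketch of what "deterministic before the variable region" means in code. The rule text and helper names are placeholders, not the real MangaAssist prompt:

```python
from datetime import datetime

# Sketch only: rule text and tags are illustrative.
SYSTEM_RULES = "You are MangaAssist. Follow the safety and formatting rules..."
STABLE_PREFIX = f"<system>\n{SYSTEM_RULES}\n</system>\n"   # identical bytes every turn

def render_history(history):
    return "\n".join(f"{role}: {text}" for role, text in history)

def build_prompt(history, user_message, user_id, now=None):
    now = now or datetime.utcnow()
    # Everything volatile (timestamps, user IDs, personalization) stays AFTER the
    # stable prefix so the engine can reuse the prefix KV cache across turns.
    variable_suffix = (
        f"<context date='{now:%Y-%m-%d}' user='{user_id}'>\n"
        f"{render_history(history)}\n"
        f"User: {user_message}\nAssistant:"
    )
    return STABLE_PREFIX + variable_suffix
```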


Question 6 [Staff/Principal] [ML Eng]

What was the importance of Multi-LoRA in this chatbot?

Hint 1: Frame it as an infrastructure-consolidation decision.

Hint 2: Mention why separate endpoints would have been expensive.

Hint 3: Tie it back to specialization without fleet sprawl.

Final answer:

Multi-LoRA mattered because it let multiple specialized behaviors share one base model runtime. Instead of running separate full-model endpoints for each adapter-driven variant, we routed requests to the right adapter inside a shared vLLM serving layer. That reduced GPU footprint and operational sprawl while preserving domain-specific behavior.

Deep dive explanation:

This is important in a chatbot because specialization needs grow faster than traffic per specialization. If every variant gets its own endpoint, the infrastructure fragments quickly. Multi-LoRA let us keep the specialization surface modular while keeping deployment, warm pools, and observability centered around one base runtime. The low-level discipline was to keep adapter routing explicit and versioned.
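
For the low-level follow-up, a hedged sketch using vLLM's offline LLM API; adapter names, IDs, and paths are examples, and the production path goes through the OpenAI-compatible server instead, but the routing idea is the same:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Illustrative only: model, adapter names, and paths are examples.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_lora=True,
    max_loras=4,          # adapters resident per batch
    max_lora_rank=16,     # must cover every adapter's rank (see Question 16)
)

ADAPTERS = {
    # explicit, versioned routing table: request type -> (name, id, local path)
    "recommendation": LoRARequest("manga_domain_v3", 1, "/adapters/manga_domain_v3"),
    "support":        LoRARequest("support_v1",      2, "/adapters/support_v1"),
}

def generate(prompt, request_type):
    return llm.generate(
        [prompt],
        SamplingParams(max_tokens=256),
        lora_request=ADAPTERS[request_type],
    )
```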


Question 7 [SRE/Ops] [SDE]

How did you keep long conversations from causing GPU OOM and worker restarts?

Hint 1: Quantization alone is not enough.

Hint 2: Mention token budgeting and failure containment.

Hint 3: Explain why long sessions are a normal path, not an edge case.

Final answer:

We handled long conversations with a combination of AWQ quantization, explicit context budgeting, and OOM containment. Quantization reduced the model memory footprint, token budgeting kept prompt growth bounded, and the OOM guard prevented one bad request from crashing the worker. That turned long multi-turn sessions into a supported production path instead of a reliability risk.

Deep dive explanation:

The important design choice was prioritization inside the context window. We preserved system rules, retrieved evidence, and the most recent turns first, then summarized or trimmed lower-value historical chat. We also calibrated AWQ on manga-domain data so the memory savings did not erase Japanese-domain quality. The resilience lesson is that preventing most OOMs is not enough; you also need graceful containment for the rare ones.
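
A compact sketch of that budgeting order; the 3,600-token figure matches Question 19, and count_tokens stands in for the real tokenizer:

```python
# Sketch of the priority-ordered context budget described above.
def budget_context(system_rules, evidence, turns, count_tokens, budget=3600):
    """Keep system rules and evidence, then as many recent turns as fit."""
    kept = [system_rules] + evidence                  # highest priority, never trimmed
    used = sum(count_tokens(part) for part in kept)

    recent = []
    for turn in reversed(turns):                      # newest turns first
        cost = count_tokens(turn)
        if used + cost > budget:
            break                                     # older history gets summarized or dropped
        recent.append(turn)
        used += cost

    return kept + list(reversed(recent))              # restore chronological order
```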


Question 8 [Staff/Principal] [SDE]

How did you make the vLLM migration low risk?

Hint 1: Think about API contracts and shadowing.

Hint 2: Avoid describing it as a big-bang cutover.

Hint 3: Mention rollback explicitly.

Final answer:

We made the migration low risk by keeping the application on a stable generation contract and hiding vLLM behind a backend-neutral gateway. That let us shadow traffic, compare outputs, and cut over by routing configuration instead of rewriting application logic. If something regressed, rollback was a routing change, not a code emergency.

Deep dive explanation:

This is where the OpenAI-compatible API mattered. It reduced coupling between the chatbot and the inference engine. Once that contract was stable, the remaining migration work became benchmarking, observability, canarying, and evaluation. That is a much safer place to be than having business logic call a backend-specific server directly.
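
Because vLLM exposes an OpenAI-compatible server, the application-side contract can stay as thin as this sketch; the base URL, model name, and prompt are illustrative:

```python
from openai import OpenAI

# Swapping backends (vLLM, the old server, a managed API) is a config change,
# because the application only knows this OpenAI-style contract.
client = OpenAI(base_url="http://vllm-gateway.internal:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="manga-assist",   # served model name, set by the gateway
    messages=[{"role": "user", "content": "Recommend a seinen series like Vinland Saga"}],
    stream=True,
)
for chunk in resp:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```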


Question 9 [ML Eng] [SRE/Ops]

What are the most important low-level vLLM tuning decisions you made?

Hint 1: Mention headroom, concurrency limits, and token caps.

Hint 2: Explain why defaults are not enough.

Hint 3: Tie each knob to a concrete failure mode.

Final answer:

The most important tuning decisions were GPU memory headroom, maximum concurrent sequences, total batched-token limits, deterministic prefix construction, and explicit token budgeting before inference. Those settings controlled whether the engine stayed stable under mixed traffic, whether caching actually worked, and whether long prompts crowded out healthy traffic.

Deep dive explanation:

This answer shows engineering maturity because it treats configuration as architecture. gpu_memory_utilization was a stability boundary, not a speed knob. max_num_seqs and max_num_batched_tokens balanced throughput against fairness and TTFT. Prompt determinism controlled cache effectiveness. Token budgeting prevented the engine from being the first component to discover that the request was oversized.
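
A hedged configuration sketch that names those knobs in vLLM terms; the values are examples rather than the production settings:

```python
from vllm import LLM

llm = LLM(
    model="/models/llama-3-8b-instruct-awq",
    quantization="awq",
    gpu_memory_utilization=0.90,     # stability boundary: leave headroom for spikes
    max_num_seqs=64,                 # cap on concurrent sequences in the running batch
    max_num_batched_tokens=8192,     # cap on tokens scheduled per step (fairness vs TTFT)
    enable_prefix_caching=True,      # only pays off if prompt prefixes are deterministic
    max_model_len=4096,              # hard ceiling; token budgeting should trim before this
)
```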


Question 10 [Staff/Principal]

Under what conditions would you revisit the vLLM decision?

Hint 1: Show that the choice was reversible.

Hint 2: Mention measurable triggers, not vague feelings.

Hint 3: Compare future alternatives on both performance and operability.

Final answer:

I would revisit the vLLM decision if another engine opened a materially better performance gap while also reducing operational burden, or if our hardware strategy changed enough that a different runtime became a better fit. The key is that we documented reversal triggers in advance, so reevaluation would be data-driven instead of reactive.

Deep dive explanation:

This is a strong closing answer because it proves the decision was intentional and reversible. A good trigger could be something like a much larger throughput advantage from TensorRT-LLM with simpler builds, or a mature alternative on new hardware. Because we kept the serving contract stable, reevaluation stays feasible. That is the difference between a smart platform choice and accidental lock-in.


Part 2: Deployment And Operations Questions

Question 11 [SRE/Ops] [SDE]

Walk me through how your vLLM Docker image is built and why you bake the model into it.

Hint 1: Mention multi-stage build and what each stage does.

Hint 2: Explain the cold start tradeoff between baked and runtime download.

Hint 3: Include the image size and startup timing numbers.

Final answer:

We use a three-stage Docker build: a builder stage that installs vLLM and dependencies, a model-prep stage that downloads and validates model artifacts from S3, and a minimal runtime stage that copies only the serving code and model weights. The final image is ~8.2 GB. We bake the model into the image because it reduces cold start from 5–8 minutes (runtime S3 download) to ~67 seconds (model already on disk). That matters because our auto-scaling policy needs new instances serving traffic within 90 seconds.

Deep dive explanation:

The key decisions in the Dockerfile are: pinning vLLM 0.4.3 and PyTorch 2.3.0 for reproducibility, using flash-attn 2.5.8 for PagedAttention kernel compatibility, removing build tools from the runtime stage to save ~1.5 GB, and running as a non-root user. The model-prep stage checksums the model artifacts against a manifest to catch corruption. The entrypoint runs a warmup script that sends 3 dummy requests to capture CUDA graphs before the readiness probe passes. See 04-deployment for the full Dockerfile and startup scripts.
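
A sketch of the warmup step in Python, assuming the container runs vLLM's OpenAI-compatible server on localhost; the endpoints follow vLLM's public routes, and the model name is illustrative:

```python
import time
import requests

BASE = "http://127.0.0.1:8000"

def wait_until_up(timeout=120):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"{BASE}/health", timeout=2).ok:
                return
        except requests.ConnectionError:
            pass
        time.sleep(1)
    raise RuntimeError("vLLM server did not come up in time")

def warmup(n=3):
    # A few dummy generations to exercise CUDA graph capture and JIT paths
    # before the readiness probe is allowed to pass.
    for _ in range(n):
        requests.post(
            f"{BASE}/v1/completions",
            json={"model": "manga-assist", "prompt": "warmup", "max_tokens": 8},
            timeout=60,
        ).raise_for_status()

if __name__ == "__main__":
    wait_until_up()
    warmup()
```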


Question 12 [SRE/Ops]

How does your auto-scaling work for GPU inference endpoints? Why not just scale on CPU or request count?

Hint 1: GPU inference has different bottleneck signals than web servers.

Hint 2: Mention the specific metrics you scale on.

Hint 3: Explain asymmetric cooldown.

Final answer:

We scale on GPU-specific metrics: InvocationsPerInstance combined with queue depth and GPU cache utilization. CPU utilization and request count are poor signals for GPU inference because the GPU can be saturated while CPU is idle. We use a scale-out threshold of 70% GPU cache utilization sustained for 2 minutes, and an asymmetric cooldown: 120 seconds for scale-out (act fast) and 600 seconds for scale-in (be cautious about removing capacity). We also maintain warm pools with 1 pre-provisioned instance so scale-out is 67 seconds instead of 5–8 minutes.

Deep dive explanation:

The asymmetric cooldown was learned from an incident where traffic oscillated between 80% and 40% of capacity. Symmetric cooldowns caused thrashing: scale out, traffic drops, scale in, traffic spikes, scale out again. The 600-second scale-in cooldown eliminated this. We also have scheduled scaling actions for predictable peaks: JP market opens (UTC 0:00) and US market opens (UTC 14:00). See 04-deployment Section 9 for the full scaling configuration.
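
A sketch of registering the asymmetric cooldowns on a SageMaker variant through the Application Auto Scaling API; the endpoint and variant names, capacities, and target value are examples:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/manga-assist-vllm/variant/AllTraffic"

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=2,
    MaxCapacity=10,
)

autoscaling.put_scaling_policy(
    PolicyName="invocations-per-instance",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 40.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 120,   # act fast when load rises
        "ScaleInCooldown": 600,    # be cautious about removing capacity
    },
)
```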


Question 13 [SRE/Ops] [ML Eng]

How do you monitor vLLM in production? What are your SLOs and what alerts page on-call?

Hint 1: Separate engine metrics from application metrics.

Hint 2: Mention specific SLO targets with numbers.

Hint 3: Explain the difference between critical and warning alerts.

Final answer:

We monitor at two layers: vLLM engine metrics (Prometheus) and custom application metrics (CloudWatch). The engine exposes num_requests_waiting, gpu_cache_usage_perc, num_preemptions, and TTFT histograms. Our application layer emits QueueWaitMs, TTFTMs, PrefixCacheHit, OOMCaught, and AdmissionRejected per adapter. Our SLOs are: 99.9% availability, P50 TTFT < 200 ms, P99 total latency < 2,000 ms, and error rate < 0.1%. Critical alerts page immediately for queue overflow (> 80 waiting, 2 min), OOM events (any), and endpoint down (1 min). Warning alerts for TTFT regression and prefix cache degradation go to the next-business-day queue.

Deep dive explanation:

The most important design decision was separating queue wait from generation latency in our metrics. Early on, we only measured total latency and could not tell whether a P99 spike was caused by a slow model or a saturated scheduler. Once we split the metrics, we found that 80% of latency spikes were scheduler issues (queue wait), not model issues. That changed how we tuned: we added admission control instead of trying to make the model faster. See 05-monitoring for the full alerting rules, dashboard design, and troubleshooting runbook.
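
A sketch of the application-layer emission path using CloudWatch's put_metric_data; the namespace and dimension names are illustrative:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def emit_request_metrics(adapter, queue_wait_ms, ttft_ms, cache_hit):
    # One call per request keeps queue wait, TTFT, and cache behavior
    # attributable to a specific adapter.
    cloudwatch.put_metric_data(
        Namespace="MangaAssist/Inference",
        MetricData=[
            {"MetricName": "QueueWaitMs", "Value": queue_wait_ms, "Unit": "Milliseconds",
             "Dimensions": [{"Name": "Adapter", "Value": adapter}]},
            {"MetricName": "TTFTMs", "Value": ttft_ms, "Unit": "Milliseconds",
             "Dimensions": [{"Name": "Adapter", "Value": adapter}]},
            {"MetricName": "PrefixCacheHit", "Value": 1.0 if cache_hit else 0.0, "Unit": "Count",
             "Dimensions": [{"Name": "Adapter", "Value": adapter}]},
        ],
    )
```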


Question 14 [SRE/Ops]

An on-call engineer sees the TTFT P95 alarm firing. Walk me through how you would diagnose and fix it.

Hint 1: Start with the most common cause, not the most exotic.

Hint 2: Use a decision tree approach.

Hint 3: Show that you have runbooks, not guesswork.

Final answer:

The diagnostic tree has four branches, checked in order. First, check prefix cache hit rate — if it dropped below 50%, a prompt template change likely broke cache determinism. Fix: revert the prompt change or move volatile fields below the prefix boundary. Second, if cache is normal, check queue wait time — if P95 > 200 ms, the engine is saturated. Fix: scale out or strengthen admission control. Third, if queue is normal, check input token distribution — if average tokens increased, prompts got bigger (maybe retrieval returning more chunks). Fix: review context budgeting. Fourth, if all above are normal, check GPU health with nvidia-smi. If the GPU reports degraded performance, replace the instance.

Deep dive explanation:

The reason we check prefix cache first is that it has caused 60% of our TTFT regressions. Every time a prompt template is updated, there is a risk that a developer adds a timestamp or personalization field in the cacheable prefix. We now have a CI check that verifies the first N tokens of the canonical prompt match the previous version. The full runbook is in 05-monitoring Section 8.
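
A sketch of that CI guard, reusing the STABLE_PREFIX idea from the Question 5 sketch; the tokenizer path and snapshot location are placeholders:

```python
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/models/llama-3-8b-instruct-awq")

def test_prompt_prefix_unchanged():
    # STABLE_PREFIX is the deterministic prompt head from the canonicalizer
    # (see the sketch under Question 5); only it should be compared here.
    current = tokenizer.encode(STABLE_PREFIX)
    with open("tests/snapshots/prompt_prefix_tokens.json") as f:
        previous = json.load(f)
    assert current == previous, (
        "Cacheable prompt prefix changed: either revert the prompt edit or "
        "update the snapshot knowingly and expect a prefix-cache hit-rate drop."
    )
```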


Question 15 [ML Eng]

Explain your AWQ quantization process. How did you validate that INT4 did not degrade quality?

Hint 1: Calibration dataset matters more than the quantization algorithm.

Hint 2: Mention specific quality gates with thresholds.

Hint 3: Explain why generic calibration data would have been wrong.

Final answer:

We quantized Llama-3-8B-Instruct to AWQ INT4, reducing the model from ~16 GB to ~4.5 GB. The calibration dataset was 512 samples from real MangaAssist production traffic, reflecting the actual distribution: 30% short factual, 40% recommendation, 20% detailed comparison, 10% support, and including both English and Japanese content. We validated with quality gates: BLEU ≥ 0.85, factual accuracy ≥ 90%, language quality ≥ 88%, safety ≥ 99%, and hallucination rate ≤ 5%. We ran these gates against 200 test cases from our offline eval suite.

Deep dive explanation:

The critical lesson was that calibration data distribution matters enormously. Our first attempt used generic English calibration text from the AWQ paper examples. The quantized model lost quality on Japanese manga titles — it would hallucinate publication dates and confuse edition names. When we switched to our own production traffic distribution for calibration, the quality gates passed. This is because AWQ uses activation statistics from the calibration data to decide which weight channels to protect, and generic text activates different channels than manga-domain Japanese text. See 06-model-preparation for the full calibration and validation pipeline.
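
A sketch of the quantization step with the AutoAWQ library; the calibration file path and schema, output directory, and quant config values are placeholders, and argument names may vary slightly across AutoAWQ versions:

```python
import json
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"
OUT = "/models/llama-3-8b-instruct-awq"
quant_config = {"w_bit": 4, "q_group_size": 128, "zero_point": True, "version": "GEMM"}

# 512 real production prompts matching the traffic mix described above
# (file path and record schema are illustrative).
with open("calibration/prod_samples.jsonl") as f:
    calib_samples = [json.loads(line)["text"] for line in f]

model = AutoAWQForCausalLM.from_pretrained(MODEL)
tokenizer = AutoTokenizer.from_pretrained(MODEL)

model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_samples)
model.save_quantized(OUT)
tokenizer.save_pretrained(OUT)
```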


Question 16 [ML Eng] [Staff/Principal]

How do you manage LoRA adapters in production? What happens when you need to update one?

Hint 1: Explain versioning and independent promotion.

Hint 2: Mention quality gates specific to adapters.

Hint 3: Describe the rollback path.

Final answer:

Each adapter is versioned independently (e.g., manga_domain_v3). Updates go through a pipeline: offline evaluation against 200 test cases, quality gates (same as base model plus adapter-specific regression checks), canary deployment to 10% of traffic for 1 hour, then full promotion. Rollback is updating the CURRENT pointer file in S3 to the previous version and redeploying the endpoint — takes under 5 minutes with warm pools. All adapters must have the same LoRA rank (16) and target the same modules, because vLLM allocates a fixed-size slot per adapter.

Deep dive explanation:

The most painful lesson was about rank consistency. Early in development, one team member trained an adapter at rank 32 for better quality while the production adapters were rank 16. When loaded together, vLLM silently allocated slots for rank 16 (the first adapter loaded) and truncated the rank-32 adapter. There was no error — just degraded quality that was hard to diagnose. We now enforce rank consistency in our training pipeline and validate it in CI. See 06-model-preparation Section 4 for the full adapter management workflow.
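
A sketch of that CI validation against PEFT adapter_config.json files; the directory layout and the expected target-module set are examples:

```python
import json
from pathlib import Path

EXPECTED_RANK = 16
EXPECTED_TARGETS = {"q_proj", "k_proj", "v_proj", "o_proj"}   # example fleet standard

def validate_adapters(adapter_root="adapters"):
    for config_path in Path(adapter_root).glob("*/adapter_config.json"):
        cfg = json.loads(config_path.read_text())
        assert cfg["r"] == EXPECTED_RANK, (
            f"{config_path.parent.name}: rank {cfg['r']} != {EXPECTED_RANK}; "
            "vLLM sizes LoRA slots once, so mismatched ranks degrade silently."
        )
        assert set(cfg["target_modules"]) == EXPECTED_TARGETS, (
            f"{config_path.parent.name}: target modules differ from the fleet standard."
        )
```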


Part 3: Deep System Design Questions

Question 17 [Staff/Principal]

How would you design the system differently if you had to serve 10× the current traffic?

Hint 1: Think about what changes architecturally, not just "add more GPUs."

Hint 2: Consider the model tier, caching strategy, and routing changes needed.

Hint 3: Address both the happy path and the failure path at scale.

Final answer:

At 10× traffic, three things change. First, I would add a response cache layer before the inference engine — many chatbot queries are near-duplicates, and caching complete responses for common questions (product availability, shipping, returns) would offload 30-40% of requests from the GPU. Second, I would introduce tiered routing: simple factual questions go to a smaller, faster model (or cached responses), while complex recommendations go to the full Llama-3-8B path. Third, I would move to a multi-region deployment to reduce latency for the JP market and add geographic redundancy.

Deep dive explanation:

The key insight is that 10× traffic does not mean 10× GPU cost if you design the tiers correctly. The response cache handles the high-frequency tail. The smaller model handles factual lookups where quality from a large model is wasted. The full model handles the long-tail recommendations where quality matters. The failure path changes too: at 10× scale, a single GPU failure affects a smaller fraction of users, but correlated failures (CUDA driver bug, model corruption) become more dangerous. I would add a fully managed Bedrock tier as a warm fallback that can absorb 100% of traffic if the self-hosted fleet goes down.
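
A sketch of the cache-then-tier routing idea; the classifier, cache client, and model callables are stand-ins, and the threshold and TTL choices would need real tuning:

```python
import hashlib

def normalize(query):
    return " ".join(query.lower().split())

def handle(query, cache, classify, small_model, large_model, ttl_s=300):
    key = "resp:" + hashlib.sha256(normalize(query).encode()).hexdigest()

    if (cached := cache.get(key)) is not None:        # near-duplicate FAQ traffic
        return cached

    tier = classify(query)                            # e.g. "factual" vs "recommendation"
    answer = small_model(query) if tier == "factual" else large_model(query)

    if tier == "factual":                             # only cache the stable answers
        cache.set(key, answer, ttl_s)
    return answer
```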


Question 18 [Staff/Principal] [SDE]

What was the hardest production incident you dealt with on this system?

Hint 1: Pick an incident that shows debugging skill, not just awareness.

Hint 2: Show the diagnostic process, not just the fix.

Hint 3: Explain what you changed to prevent recurrence.

Final answer:

The hardest incident was a silent quality degradation after a vLLM version upgrade from 0.4.1 to 0.4.2. Latency and error rates looked fine, but our weekly quality audit caught that the prefix cache hit rate had dropped from ~72% to ~45%, and TTFT P50 had crept from 180 ms to 340 ms. The root cause was a change in vLLM's cache eviction policy in 0.4.2 that evicted entries more aggressively under memory pressure. The fix was upgrading to 0.4.3 which restored the previous eviction behavior. The prevention was adding prefix cache hit rate as a monitored SLO metric and creating an alarm that fires if it drops below 50% for 15 minutes.

Deep dive explanation:

What made this hard was that it was a performance regression, not a functional failure. All requests succeeded. No errors. No OOMs. But users were experiencing noticeably slower first-token times, and our cost efficiency dropped because we lost the compute savings from prefix reuse. The lesson was that inference engine upgrades need the same rigor as model updates: run the full eval suite, compare latency profiles, and monitor behavioral metrics (cache hit rate) not just reliability metrics (error rate, availability).


Question 19 [SDE] [ML Eng]

Explain the request lifecycle from the moment a user sends a chat message until tokens start streaming back.

Hint 1: Walk through every component in order.

Hint 2: Include timing estimates for each phase.

Hint 3: Highlight where optimizations happen.

Final answer:

The lifecycle:

  1. Chat UI sends the message via WebSocket (~5 ms).
  2. API Gateway routes it to the Chat Orchestrator (~10 ms).
  3. The Orchestrator calls the Model Router, which selects vLLM based on routing config (~2 ms).
  4. The vLLM Gateway receives the request and runs admission control: queue depth and memory pressure check (~1 ms).
  5. The Request Budgeter trims context to 3,600 tokens if needed (~3 ms).
  6. The Prompt Canonicalizer builds the deterministic prefix plus variable suffix (~2 ms).
  7. The Adapter Registry resolves the LoRA adapter ID (~1 ms).
  8. The request enters the vLLM engine queue (0–200 ms depending on load).
  9. The vLLM scheduler admits the request into the active batch.
  10. Prefill runs; the prefix cache check happens here. On a cache hit, the redundant prefix compute is skipped (~30 ms saved).
  11. The first decode token is generated; TTFT is measured here, typically ~180 ms from step 4.
  12. Tokens are streamed via SSE to the UI, with decode continuing at ~15 ms per token.

Deep dive explanation:

The key optimization points are: step 4 (admission control prevents queue saturation), step 5 (token budgeting prevents OOM and unfair scheduling), step 6 (deterministic construction enables prefix cache hits at step 10), and step 8 (continuous batching minimizes queue wait). Each of these was a deliberate implementation choice, not a default. Without admission control, the queue becomes the only backpressure and users wait silently. Without token budgeting, large prompts crowd out small ones. Without prompt canonicalization, cache hit rate drops to near zero.
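
A minimal sketch of the admission check from step 4, using the engine queue and cache signals named in Question 13; the stats accessor and thresholds are illustrative:

```python
MAX_WAITING = 80          # queue-overflow threshold used for paging
MAX_CACHE_UTIL = 0.95     # refuse new work when KV cache is nearly full

def admit(engine_stats):
    """Return (admitted, reason). Rejected requests fall back to the managed path."""
    if engine_stats.num_requests_waiting > MAX_WAITING:
        return False, "queue_depth"
    if engine_stats.gpu_cache_usage_perc > MAX_CACHE_UTIL:
        return False, "memory_pressure"
    return True, "ok"
```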


Question 20 [Staff/Principal]

How did you decide between self-hosting with vLLM and using a managed API like Bedrock for different request types?

Hint 1: Frame it as a cost-quality-control tradeoff, not a binary choice.

Hint 2: Mention which request types go where and why.

Hint 3: Explain the fallback relationship.

Final answer:

We run a dual-path architecture. The self-hosted vLLM path handles manga-domain recommendations and Japanese-language interactions where domain-specific LoRA adapters and prefix caching give us a quality and cost advantage. The managed Bedrock path handles overflow traffic, requests that exceed the self-hosted budget, and serves as a fallback when the GPU fleet is under pressure. The routing decision is made by the Model Router based on request type, adapter availability, and fleet health. About 85% of traffic goes through vLLM in steady state. During scaling events or incidents, Bedrock absorbs the overflow transparently.

Deep dive explanation:

The cost math is clear: self-hosted vLLM at $9.2K/month serves 85% of traffic. The same traffic through Bedrock would cost ~$28K/month (based on token pricing for the equivalent Claude model). But self-hosting requires engineering investment in Docker images, monitoring, scaling, and on-call. The managed path requires zero operational investment. So the hybrid approach optimizes total cost: self-host the high-volume, domain-specialized traffic; use managed APIs for the tail and for resilience. This is not a "we chose vLLM" story — it is a "we chose the right tool for each traffic segment" story.