
vLLM Game-Changer Scenarios for MangaAssist

How moving from raw HuggingFace Transformers plus custom serving to vLLM changed the economics and user experience of the chatbot's self-hosted generation path.

Why vLLM Was More Than A Faster Library

MangaAssist was not running a single offline generation job. It was serving a multi-turn shopping assistant where traffic was bursty, prompts had large shared prefixes, domain adapters needed to coexist, and long conversations could push GPU memory into failure territory.

The old path worked in development but broke down in production:

  • KV cache memory was mostly wasted because the previous stack behaved like it had to reserve worst-case space up front.
  • Static batching either left the GPU idle or forced new requests to wait behind long generations.
  • Repeated chat prefixes were recomputed over and over.
  • Adapter-specific model variants wanted separate endpoints and separate GPU fleets.
  • Long conversations pushed workers toward OOM and container churn.
  • The application was too tightly coupled to the inference backend.

vLLM changed that operating model.

Measured Impact Summary

These numbers are from production benchmarks on MangaAssist's self-hosted Llama-3-8B-Instruct path running on ml.g5.xlarge (NVIDIA A10G, 24 GB VRAM).

| Metric | Before vLLM (HF Transformers) | After vLLM | Improvement |
|---|---|---|---|
| P50 latency | 1,820 ms | 620 ms | 66% reduction |
| P99 latency | 4,200 ms | 1,400 ms | 67% reduction |
| Concurrent requests per GPU | 4–6 | 85–90 | ~15x increase |
| VRAM utilization (useful work) | 28% | 96% | 3.4x better density |
| GPU instances required | 8 | 4 | 50% fleet reduction |
| Monthly GPU spend | $18,400 | $9,200 | $9,200/month saved |
| OOM-driven worker restarts | Multiple per day | Zero | Eliminated |
| Prefix cache hit rate (multi-turn) | 0% (no caching) | ~72% | 35% compute reduction |
| Time to first token (TTFT) | ~900 ms | ~180 ms | 80% faster perceived start |

Inference Engine Comparison

Before choosing vLLM, we benchmarked four engines on Llama-3-13B with an A100 80 GB GPU under MangaAssist-shaped traffic (mixed request lengths, multi-turn, bursty).

| Engine | Batch=1 (req/s) | Batch=32 (req/s) | Max Concurrent | Operational Complexity | Overall Score |
|---|---|---|---|---|---|
| HF Transformers | 1.2 | 8.4 | 12 | Low (but no serving features) | Baseline |
| HF TGI | 3.8 | 22.1 | 35 | Medium | 6.8 / 10 |
| vLLM | 4.1 | 38.6 | 85 | Medium | 8.4 / 10 |
| TensorRT-LLM | 4.5 | 41.2 | 90 | High (NVIDIA-only build chain) | 7.9 / 10 |

Why vLLM won over TensorRT-LLM despite slightly lower throughput: TensorRT-LLM required NVIDIA-specific compilation, longer build pipelines, and tighter hardware lock-in. vLLM delivered 91% of TensorRT-LLM's single-request throughput (and ~94% at batch 32) with significantly simpler operations and faster iteration cycles. That single-digit throughput gap was not worth the operational burden for our team size and deployment cadence.

Reversal trigger: If TensorRT-LLM achieves >25% throughput advantage AND simplifies its compilation pipeline, we would reevaluate.

Before And After Summary

| Area | Previous inference path | vLLM decision | Why it mattered in this chatbot |
|---|---|---|---|
| Memory efficiency | Raw Transformers with heavy KV over-allocation (72% waste) | PagedAttention (<4% waste) | Higher concurrency per GPU during release-day spikes |
| Scheduling | Fixed batching or one-request-at-a-time behavior | Continuous batching | Lower queueing delay without waiting for rigid batch windows |
| Multi-turn reuse | Recomputed the same system prompt and chat prefix every turn | Automatic prefix caching (~72% hit rate) | Faster follow-up turns and lower redundant compute |
| Variant serving | Separate model copies or separate endpoints per adapter | Multi-LoRA on one base model (3 adapters, 1 endpoint) | Lower GPU count and simpler rollout of domain behaviors |
| Stability | Long context growth caused OOM risk and worker crashes | AWQ (3x memory reduction) plus context budgeting and OOM containment | Zero worker crashes and safer long sessions |
| Integration | Backend-specific inference client tightly coupled to app | OpenAI-compatible serving contract | Safer migration, shadowing, and future backend swaps |

GPU Instance Selection Context

We chose ml.g5.xlarge (NVIDIA A10G, 24 GB VRAM) over larger instances based on workload economics:

| Instance | GPU | VRAM | On-Demand $/hr | Why we chose or rejected |
|---|---|---|---|---|
| ml.g4dn.xlarge | T4 | 16 GB | $0.736 | Too little VRAM for 8B model + KV cache at target concurrency |
| ml.g5.xlarge | A10G | 24 GB | $1.408 | Sweet spot: fits AWQ 8B model + 128 concurrent sequences + headroom |
| ml.p4d.24xlarge | A100 80GB×8 | 640 GB | $32.77 | Overkill for single-model serving; reserved for training and benchmarking |

The A10G gave us enough VRAM for the quantized Llama-3-8B model (~4.5 GB with AWQ), KV cache blocks for 128 concurrent sequences (~14 GB), CUDA workspace (~1.5 GB), and a safety margin (~4 GB). Going smaller would have forced compromises on concurrency or model size. Going larger would have wasted money at our traffic volume.

Scenario 1 - PagedAttention Fixed The Concurrency Wall

The problem before vLLM

The old inference stack was acceptable when traffic was small, but it wasted GPU memory on the KV cache. In practice, that meant a relatively small number of concurrent chat generations could fit on one GPU before new requests started queueing.

The KV cache problem was quantifiable. On a 24 GB A10G with raw HF Transformers:

  • Model weights consumed ~8 GB (FP16)
  • The remaining ~16 GB was available for KV cache
  • But static pre-allocation reserved worst-case memory per sequence (full max_model_len upfront)
  • With a 4096-token context window, each sequence reserved ~128 MB of KV cache
  • Only 4–6 sequences could run concurrently, even though most responses used a fraction of the reserved space
  • Result: 60–80% of GPU VRAM sat unused but reserved

The result was not just poor GPU economics. It was visible user-facing latency:

  • bursts around new manga releases created queue buildup
  • long generations blocked short generations from starting quickly
  • we were paying for 8 GPU instances when the actual compute should have needed ~4

How PagedAttention works

graph LR
    subgraph "Before: Static KV Allocation"
        direction TB
        S1["Seq 1: Reserved 4096 tokens\n(used: 412)"] --> W1["⚠️ 90% wasted"]
        S2["Seq 2: Reserved 4096 tokens\n(used: 891)"] --> W2["⚠️ 78% wasted"]
        S3["Seq 3: Reserved 4096 tokens\n(used: 203)"] --> W3["⚠️ 95% wasted"]
        S4["Seq 4: Reserved 4096 tokens\n(used: 1,240)"] --> W4["⚠️ 70% wasted"]
        S5["Seq 5–128: QUEUED"] --> W5["❌ No VRAM left"]
    end

    subgraph "After: PagedAttention Blocks"
        direction TB
        P1["Seq 1: 26 blocks allocated\n(block_size=16)"]
        P2["Seq 2: 56 blocks"]
        P3["Seq 3: 13 blocks"]
        P4["Seq 4: 78 blocks"]
        P5["Seq 5–90: blocks allocated\non demand"]
        PF["Free pool: blocks reclaimed\nas sequences finish"]
    end

PagedAttention mirrors virtual memory paging from operating systems. Instead of reserving contiguous memory for the full context window upfront, it allocates fixed-size blocks (we chose block_size=16 tokens) as tokens are actually generated. When a sequence finishes, its blocks return to the free pool immediately.
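
To make the block math concrete, here is the back-of-the-envelope arithmetic behind the diagram above. This is a standalone Python illustration, not vLLM code; the token counts are the example values from the diagram.

# Illustration only: compare static worst-case reservation with block allocation.
import math

BLOCK_SIZE = 16          # tokens per KV block (the value we settled on)
MAX_MODEL_LEN = 4096     # static reservation per sequence in the old stack

for used_tokens in (412, 891, 203, 1240):
    blocks = math.ceil(used_tokens / BLOCK_SIZE)
    paged_reserved = blocks * BLOCK_SIZE
    static_waste = 1 - used_tokens / MAX_MODEL_LEN
    paged_waste = 1 - used_tokens / paged_reserved
    print(f"{used_tokens:>5} tokens -> {blocks:>3} blocks | "
          f"static waste {static_waste:.0%} | paged waste {paged_waste:.0%}")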

What changed with vLLM

vLLM's PagedAttention allocated KV memory in blocks as tokens were actually produced instead of forcing large static reservations. That let the same GPU host far more concurrent sequences before memory became the bottleneck.

Concrete implementation details (a configuration sketch follows the list):

  • block_size=16: Each block holds KV tensors for 16 tokens. We tested 8, 16, and 32. Block size 16 minimized fragmentation for our median response length (~180 tokens) while keeping block-table overhead manageable.
  • gpu_memory_utilization=0.92: We allocated 92% of VRAM to vLLM's block allocator, leaving 8% (~1.92 GB on A10G) for CUDA workspace, activation memory, and non-model allocations.
  • Block reclamation: When a sequence completes or is preempted, its blocks are immediately freed for reuse. This is why concurrency can scale dynamically rather than being fixed at startup.
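
A minimal sketch of how these settings map onto vLLM's offline engine API. The model path is an assumption, and in production the equivalent arguments are passed to the OpenAI-compatible server rather than constructed in application code.

# Sketch only: memory-related engine arguments described above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/opt/ml/model",          # assumption: AWQ-quantized Llama-3-8B baked into the image
    quantization="awq",
    max_model_len=4096,             # matches the context budget ceiling
    block_size=16,                  # KV cache granularity: 16 tokens per block
    gpu_memory_utilization=0.92,    # 92% of VRAM handed to the block allocator
)

outputs = llm.generate(
    ["Recommend three dark-fantasy manga similar to Berserk."],
    SamplingParams(max_tokens=256, temperature=0.7),
)
print(outputs[0].outputs[0].text)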

Why it was a game changer

This was the first change that materially altered the cost curve:

  • GPU fleet size for the self-hosted generation path dropped from 8 instances to 4
  • Concurrent requests per GPU jumped from 4–6 to 85–90
  • VRAM utilization moved from 28% (useful work) to 96%
  • Monthly GPU spend dropped from $18,400 to $9,200
  • P50 latency dropped from 1,820 ms to 620 ms

Why this mattered to the chatbot

In a chatbot, queueing damage compounds fast. A user does not experience "GPU inefficiency." They experience "the assistant feels slow." PagedAttention mattered because it converted wasted VRAM into usable concurrency, and usable concurrency into a visibly better chat experience.

During manga release-day traffic spikes (e.g., a new Jujutsu Kaisen volume launch), the old system would queue hundreds of requests behind 4–6 active sequences. With PagedAttention, the same GPU absorbed 85+ concurrent sessions. The queue essentially disappeared for normal traffic patterns.

Interview-ready bottom line

We did not adopt vLLM because it was fashionable. We adopted it because the previous serving path was wasting memory, which directly translated into more GPUs, more queueing, and worse chat latency. PagedAttention alone halved our GPU fleet and made spikes a scheduling problem instead of a crisis.


Scenario 2 - Continuous Batching Turned Peak Traffic Into A Solvable Scheduling Problem

The problem before vLLM

The previous stack behaved like classic fixed batching:

  • at low traffic, the GPU was under-filled and idle too often
  • at high traffic, short requests got trapped behind long ones
  • queue wait time could dominate total latency even when raw compute was available

That is exactly the wrong shape for a shopping chatbot, because request sizes vary wildly. A short recommendation clarification (20–40 output tokens) and a longer grounded answer (200–400 output tokens) should not pay the same queue penalty.

Fixed batching timeline (the old world):

gantt
    title Fixed Batching - Short Requests Wait For Long Ones
    dateFormat X
    axisFormat %s

    section Batch 1
    Long request A (400 tokens)   :a1, 0, 400
    Short request B (40 tokens)   :b1, 0, 40
    B waits idle                  :crit, b1w, 40, 400

    section Batch 2
    Request C arrives at t=50     :crit, c1w, 50, 400
    C starts after Batch 1 drains :c1, 400, 440

    section GPU
    GPU idle after B finishes     :crit, idle1, 40, 400

In fixed batching, Request B finishes at t=40 but the batch does not release until Request A finishes at t=400. Request C, arriving at t=50, waits 350 time units before even starting.

Continuous batching timeline (the vLLM world):

gantt
    title Continuous Batching - Slots Refilled Immediately
    dateFormat X
    axisFormat %s

    section Active Set
    Long request A (400 tokens)   :a1, 0, 400
    Short request B (40 tokens)   :b1, 0, 40
    Request C fills B's slot      :c1, 40, 80
    Request D fills C's slot      :d1, 80, 280

    section GPU
    GPU stays fully utilized      :active, gpu1, 0, 400

In continuous batching, when Request B finishes at t=40, its decode slot is immediately given to Request C. The GPU stays busy and new requests start without waiting for the entire batch to drain.

What changed with vLLM

vLLM admitted new requests at the iteration level as decode slots freed up. Instead of waiting for a whole batch to complete, the scheduler continuously refilled open sequence slots.

Key implementation details (see the engine-arguments sketch after this list):

  • scheduler_delay_factor=0.0: No artificial batching window. When a slot opens, the next request gets it immediately. This is critical for a chatbot where perceived responsiveness matters more than throughput optimality.
  • max_num_seqs=128: The maximum number of sequences that can be active simultaneously. This caps the batch size to prevent memory pressure from an unbounded active set.
  • max_num_batched_tokens=8192: The total token budget across all active sequences per iteration. This prevents a few very large prompts from crowding out many smaller ones.
  • Iteration-level scheduling: vLLM's scheduler runs after every decode step, not after a fixed time window or batch completion. This is what makes it "continuous."
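
A sketch of the scheduling knobs expressed as engine arguments, using the argument names from the vLLM version we deployed; the model path is an assumption, and production passes the equivalent flags to the OpenAI-compatible server.

# Sketch: scheduling-related engine arguments for continuous batching.
from vllm import AsyncEngineArgs, AsyncLLMEngine

engine_args = AsyncEngineArgs(
    model="/opt/ml/model",           # assumption: AWQ-quantized Llama-3-8B
    quantization="awq",
    max_num_seqs=128,                # upper bound on concurrently active sequences
    max_num_batched_tokens=8192,     # per-iteration token budget across the active set
    scheduler_delay_factor=0.0,      # admit new requests as soon as a slot frees up
)

# The scheduler runs after every decode step, so a request submitted while
# another is mid-generation joins the active set at the next iteration.
engine = AsyncLLMEngine.from_engine_args(engine_args)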

Why it was a game changer

  • P99 latency dropped from 4,200 ms to 1,400 ms during traffic spikes
  • Queue wait time was no longer dominated by the longest request in a batch
  • GPU utilization moved closer to useful work instead of batch-window wait time
  • The system gained throughput without forcing a "wait 50 ms for the next batch" compromise

Our measured impact was roughly a 40% latency reduction during spikes on the self-hosted path.

Why this mattered to the chatbot

Traffic was not smooth. Manga launches, promotions, and discovery flows created uneven demand. During the Chainsaw Man volume 18 launch, traffic spiked 4x within 15 minutes. Continuous batching let us absorb those peaks without making short conversational turns feel stuck behind larger responses.

The scheduling improvement was especially visible for a common MangaAssist pattern: a user asks "Is Spy x Family volume 12 available in English?" (short answer, 20–30 tokens) right after another user triggers a multi-paragraph recommendation (300+ tokens). Under fixed batching, the quick factual answer waited for the long recommendation to finish. Under continuous batching, it started and returned in under 200 ms.

Interview-ready bottom line

The win was not "bigger batches." The win was dynamic scheduling. vLLM improved both throughput and user latency because it stopped treating batch boundaries as fixed.


Scenario 3 - Prefix Caching And Streaming Made Multi-Turn Chat Feel Immediate

The problem before vLLM

Shopping chat has repeated prompt structure:

  • the system prompt stays mostly stable (~400 tokens of instructions, safety rules, persona)
  • safety framing stays stable (~200 tokens of guardrail rules)
  • conversation scaffolding repeats across turns (formatting rules, grounding instructions)
  • retrieved context often changes only partially between consecutive turns

The previous stack kept recomputing the same leading tokens on every turn, then waited too long before the user saw the first token of the response.

Quantified waste: A typical 5-turn MangaAssist conversation recomputed ~600 shared prefix tokens on every single turn. That was 3,000 redundant tokens of prefill compute across 5 turns. At our traffic volume, that represented ~35% of total prefill GPU cycles being spent on work the system had already done.

How prefix caching works in vLLM

graph TD
    subgraph "Turn 1"
        T1S["System prompt (400 tokens)"] --> T1P["Policy block (200 tokens)"]
        T1P --> T1U["User: 'Recommend action manga' (8 tokens)"]
        T1U --> T1G["Generation: 180 tokens"]
        T1NOTE["All 608 prefix tokens: COMPUTED"]
    end

    subgraph "Turn 2"
        T2S["System prompt (400 tokens)"] --> T2HIT["✅ CACHE HIT"]
        T2P["Policy block (200 tokens)"] --> T2HIT
        T2HIT --> T2H["Turn 1 history (188 tokens)"]
        T2H --> T2U["User: 'More like Berserk' (6 tokens)"]
        T2U --> T2G["Generation: 150 tokens"]
        T2NOTE["600 prefix tokens: REUSED\nOnly 194 new tokens computed"]
    end

    subgraph "Turn 3"
        T3HIT["✅ CACHE HIT on 600+ tokens"] --> T3H["Turn 1-2 history (338 tokens)"]
        T3H --> T3U["User: 'Price for volume 1?' (7 tokens)"]
        T3U --> T3G["Generation: 45 tokens"]
        T3NOTE["Deeper cache hit each turn"]
    end

The critical architectural requirement: prefix caching only works if the leading tokens are byte-identical across requests. The moment you inject a timestamp, random request ID, or per-user personalization before the stable prefix boundary, you destroy cache reuse.

What changed with vLLM

We used two vLLM capabilities together:

  • Automatic prefix caching (enable_prefix_caching=True): vLLM hashes the token sequence by block and reuses KV cache blocks when the same prefix appears in a new request. No application-level cache management needed.
  • Token streaming via SSE back to the chat UI as soon as decoding began.

Prompt construction discipline we enforced:

┌─────────────────────────────────────────┐
│ CACHEABLE PREFIX (deterministic)        │
│ ├── System prompt (versioned, static)   │
│ ├── Policy block (versioned, static)    │
│ └── Grounding rules (static)            │
├─────────────────────────────────────────┤
│ VARIABLE REGION (changes per request)   │
│ ├── Retrieved catalog chunks            │
│ ├── Conversation history                │
│ └── Current user message                │
└─────────────────────────────────────────┘

What NOT to put in the prefix: timestamps, request_id, user-specific personalization tokens, A/B test variant flags, or any field that changes per request. We saw cache hit rate collapse from 72% to 8% in a staging test when a developer added generated_at: <timestamp> to the system prompt.
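
A sketch of that prompt-assembly discipline. The constant names, versions, and builder function are illustrative rather than the project's actual code; the point is that everything request-specific lands after the stable, versioned prefix so its token blocks hash identically across requests.

# Sketch: keep the cacheable prefix byte-identical across requests.
# Names and prompt text are illustrative.
SYSTEM_PROMPT_V12 = "You are MangaAssist, a shopping assistant for manga.\n"
POLICY_BLOCK_V7 = "Safety and grounding rules:\n- Only recommend items in the catalog.\n"
GROUNDING_RULES_V3 = "Formatting rules:\n- Cite volume numbers when known.\n"

CACHEABLE_PREFIX = SYSTEM_PROMPT_V12 + POLICY_BLOCK_V7 + GROUNDING_RULES_V3


def build_prompt(retrieved_chunks: list[str], history: list[str], user_msg: str) -> str:
    # Timestamps, request IDs, and A/B flags are deliberately excluded from the prompt.
    variable_region = (
        "Catalog context:\n" + "\n".join(retrieved_chunks) + "\n"
        + "Conversation so far:\n" + "\n".join(history) + "\n"
        + "User: " + user_msg
    )
    return CACHEABLE_PREFIX + variable_region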

Why it was a game changer

  • Prefix cache hit rate stabilized at ~72% on multi-turn chat flows
  • Redundant prefill compute dropped by about 35%
  • TTFT improved from ~900 ms to ~180 ms on cached paths
  • The self-hosted path behaved more like an interactive assistant and less like a blocking batch API
  • Streaming meant users saw the first word within 180 ms instead of waiting 900+ ms for the full response

Why this mattered to the chatbot

Chat UX is heavily shaped by time to first token, not just total completion time. Conversational-UI research generally suggests users perceive a response as "fast" if the first token appears within ~200 ms, regardless of whether the full response takes 2 seconds or 4 seconds. Prefix caching and streaming together moved us into that perception zone.

For MangaAssist specifically, a user browsing manga recommendations makes rapid conversational turns: "What about something darker?" → "How about with female leads?" → "What about box sets?" Each follow-up reuses the accumulated conversation prefix. Without prefix caching, each turn felt equally slow. With prefix caching, follow-up turns felt faster than the first — which is exactly the right UX for a chatbot.

Interview-ready bottom line

For a multi-turn assistant, repeated prompt prefixes are normal. vLLM let us reuse them and start streaming sooner, which improved both efficiency and perception. The architectural discipline was keeping the prompt prefix deterministic — prompt design and inference efficiency are linked.


Scenario 4 - Multi-LoRA Let One Base Model Serve Multiple Behaviors

The problem before vLLM

We had specialized behavior needs across the chatbot:

  • manga-domain adapter (manga_domain_v3): tuned on manga reviews, catalog descriptions, and fan terminology to give recommendations that sound knowledgeable rather than generic
  • general support adapter (general_support_v2): tuned on Amazon customer support transcripts for shipping, returns, and account questions
  • Japanese-style adapter (jp_style_v1): tuned for Japanese-to-English manga terminology and honorifics for the JP-specialized experience

The naive solution was separate model endpoints or separate base-model copies. That is simple conceptually but expensive operationally:

  • 3 separate ml.g5.xlarge endpoints: $4,608/month × 3 = $13,824/month just for adapter variants
  • 3 separate warm pools, 3 separate scaling policies, 3 separate health check surfaces
  • rollout complexity tripled — every base model update required 3 deployments
  • drift risk between otherwise identical base models running at different versions

How Multi-LoRA works in practice

graph TD
    subgraph "Before: Separate Endpoints"
        E1["Endpoint 1: Llama-8B + manga_domain\n(ml.g5.xlarge)"]
        E2["Endpoint 2: Llama-8B + general_support\n(ml.g5.xlarge)"]
        E3["Endpoint 3: Llama-8B + jp_style\n(ml.g5.xlarge)"]
        R1["Router"] --> E1
        R1 --> E2
        R1 --> E3
        COST1["💰 3 GPU instances, $13,824/mo"]
    end

    subgraph "After: Multi-LoRA Single Endpoint"
        R2["Router + adapter_id"] --> GW["vLLM Gateway"]
        GW --> ENGINE["vLLM Engine\n(Llama-8B-AWQ base)"]
        ENGINE --> A1["LoRA: manga_domain_v3\n(~40 MB)"]
        ENGINE --> A2["LoRA: general_support_v2\n(~40 MB)"]
        ENGINE --> A3["LoRA: jp_style_v1\n(~40 MB)"]
        COST2["💰 1 GPU instance, $4,608/mo"]
    end

Memory math: Each LoRA adapter adds ~40 MB of rank-16 weights to the 4.5 GB quantized base model. Three adapters add ~120 MB total. This is negligible compared to the ~14 GB of KV cache blocks. All three adapters fit comfortably alongside the base model on a single A10G.

What changed with vLLM

We used Multi-LoRA so multiple adapters could share one base model runtime. The router selected an adapter per request without forcing a dedicated endpoint for each variant.

Routing flow:

  1. The orchestrator determines intent (manga recommendation vs. support vs. JP-specific)
  2. Intent maps to an adapter_id via configuration (not code changes)
  3. The vLLM gateway resolves adapter_id to a LoRARequest from the adapter registry
  4. vLLM applies the adapter weights at inference time with no model reload
  5. Response metadata includes both base model version and adapter version for tracing

Adapter-switch overhead: When a request uses a different adapter than the previous request on the same GPU, vLLM swaps the LoRA weight matrices. For rank-16 adapters on A10G, this takes ~2 ms — fast enough to be invisible in the request latency. We preload all three adapters at startup to eliminate cold-load delays.
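
A sketch of per-request adapter selection using vLLM's offline LoRA API; the adapter paths and the lookup table are assumptions, and the production gateway performs the equivalent resolution before calling the engine.

# Sketch: one base model, adapter chosen per request via LoRARequest.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="/opt/ml/model",      # assumption: AWQ-quantized Llama-3-8B base
    quantization="awq",
    enable_lora=True,
    max_loras=3,                # all three adapters resident at once
    max_lora_rank=16,
)

ADAPTERS = {                    # adapter paths are assumptions
    "manga_domain_v3": LoRARequest("manga_domain_v3", 1, "/opt/ml/adapters/manga_domain_v3"),
    "general_support_v2": LoRARequest("general_support_v2", 2, "/opt/ml/adapters/general_support_v2"),
    "jp_style_v1": LoRARequest("jp_style_v1", 3, "/opt/ml/adapters/jp_style_v1"),
}


def generate(prompt: str, adapter_id: str) -> str:
    out = llm.generate(
        [prompt],
        SamplingParams(max_tokens=256),
        lora_request=ADAPTERS[adapter_id],   # weight swap only, no model reload
    )
    return out[0].outputs[0].text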

Why it was a game changer

  • GPU instances for adapter variants dropped from 3 to 1: $9,216/month saved
  • Rollout became an adapter-management problem instead of a whole-endpoint-management problem
  • Adapter promotion and rollback were independent of the base model lifecycle
  • We could test new adapters (e.g., a seasonal_promo_v1 adapter for holiday campaigns) without provisioning infrastructure

Why this mattered to the chatbot

MangaAssist needed specialization, but not every specialization justified its own fleet. Multi-LoRA let us keep domain behavior modular without turning the inference layer into an endpoint explosion. As new adapter needs emerged (seasonal, promotional, A/B test variants), we could add them without GPU cost scaling linearly with the number of behaviors.

Interview-ready bottom line

Multi-LoRA was important because it made specialization cheap. It preserved model variety without paying full infrastructure cost for every variant. The operational win was as important as the cost win — one base model, one endpoint, one scaling policy, multiple behaviors.


Scenario 5 - AWQ Plus Context Budgeting Stopped Long Sessions From Killing Workers

The problem before vLLM

Long conversations are normal in guided shopping:

  • users refine preferences over several turns (average MangaAssist session: 4.2 turns)
  • retrieval evidence accumulates (each turn adds 200–400 tokens of catalog context)
  • the prompt grows with prior turns and structured context

On the previous path, context growth increased memory pressure until some requests pushed workers into OOM territory. An OOM was not only a slow request. It could become a worker crash and a container restart.

Quantified failure pattern: In pre-vLLM production, conversations exceeding 8 turns (top 12% of sessions) generated ~3 OOM crashes per day. Each OOM killed the worker container, triggering a SageMaker restart that took 45–90 seconds. During that restart, the instance was unavailable, pushing traffic to remaining instances and creating a cascading load spike.

How the three-layer defense works

graph TD
    subgraph "Layer 1: AWQ Quantization (offline)"
        FP16["FP16 Model: ~16 GB"] -->|"AWQ INT4\n(manga-calibrated)"| AWQ["AWQ Model: ~4.5 GB"]
        AWQ -->|"Quality gate"| EVAL["Eval: <2% quality loss\non manga benchmarks"]
    end

    subgraph "Layer 2: Context Budgeting (per request)"
        REQ["Incoming request\n(potentially 6000+ tokens)"] --> BUDGET["Token Budget Allocator"]
        BUDGET --> SYS["System: 400 tokens"]
        BUDGET --> RET["Retrieval: 1,400 tokens"]
        BUDGET --> RECENT["Recent turns: 1,200 tokens"]
        BUDGET --> SUMMARY["Older summary: 400 tokens"]
        BUDGET --> GEN["Generation reserve: 600 tokens"]
        BUDGET --> TOTAL["Total: ≤4,096 tokens\n(fits max_model_len)"]
    end

    subgraph "Layer 3: OOM Containment (runtime)"
        ENGINE["vLLM Engine"] -->|"RuntimeError:\nout of memory"| GUARD["OOM Guard"]
        GUARD --> DEGRADE["Graceful degradation:\nretry with shorter context"]
        GUARD --> FALLBACK["Fallback to Bedrock\nif retry fails"]
    end

What changed with vLLM

We combined three defenses (the OOM containment layer is sketched after this list):

1. AWQ quantization to reduce model memory footprint:
   • FP16 Llama-3-8B: ~16 GB (would not fit on A10G with meaningful KV cache)
   • AWQ INT4 Llama-3-8B: ~4.5 GB (leaves ~17.5 GB for KV cache + workspace)
   • Critical decision: we calibrated AWQ on a manga-domain corpus (10,000 catalog descriptions + 5,000 fan review excerpts + 2,000 customer support transcripts) rather than generic text. Generic calibration preserved the wrong activation distributions for Japanese manga vocabulary, series names, and honorifics. Domain-calibrated AWQ kept quality loss under 2% on our eval suite.

2. Context budgeting to keep prompt growth bounded:
   • Each token category has a hard budget: system (400), retrieval (1,400), recent turns (1,200), older summary (400), generation reserve (remaining)
   • When the conversation exceeds budget, older turns are summarized (not truncated) using a lightweight LLM call
   • Retrieval chunks are prioritized over conversation history because grounding evidence matters more than chat filler
   • The budget ensures no request ever exceeds max_model_len=4096

3. OOM containment so a bad request degrades gracefully:
   • A decorator catches CUDA OOM errors at the gateway boundary
   • On OOM: retry with a compressed context (50% of older turns dropped)
   • If retry also fails: fall back to Bedrock Claude, which has no VRAM constraint
   • The worker process survives — no container restart, no instance downtime
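
A sketch of the layer-3 containment decorator. compress_context and bedrock_fallback are hypothetical helper names standing in for the real context budgeter and Bedrock adapter; the structure is what matters.

# Sketch: catch CUDA OOM at the gateway boundary, retry smaller, then fall back.
# compress_context() and bedrock_fallback() are hypothetical helpers.
import functools

import torch


def oom_guard(generate_fn):
    @functools.wraps(generate_fn)
    def wrapper(request, *args, **kwargs):
        try:
            return generate_fn(request, *args, **kwargs)
        except RuntimeError as err:
            # torch.cuda.OutOfMemoryError subclasses RuntimeError; OOM can also
            # surface as a plain RuntimeError with "out of memory" in the message.
            if "out of memory" not in str(err).lower():
                raise
            torch.cuda.empty_cache()
            trimmed = compress_context(request, keep_ratio=0.5)  # drop ~50% of older turns
            try:
                return generate_fn(trimmed, *args, **kwargs)
            except RuntimeError as retry_err:
                if "out of memory" not in str(retry_err).lower():
                    raise
                return bedrock_fallback(trimmed)  # managed backend, no VRAM constraint
    return wrapper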

Why it was a game changer

  • memory footprint fell by roughly 3x on the quantized large-model path
  • quality loss stayed under about 2% with domain-calibrated AWQ
  • OOM-driven restarts dropped from ~3/day to zero after the full fix set was in place
  • Long conversations (8+ turns) became a supported, reliable path instead of a crash risk
  • We stopped losing ~$650/month to unnecessary instance restarts and cascading load spikes

Why this mattered to the chatbot

In a real chatbot, long sessions are a success case, not an edge case. A user who browses 10 manga recommendations across 8 turns is deeply engaged — exactly the user you want to keep. A system that only works for short conversations is not production-ready. This scenario mattered because it turned long multi-turn sessions from a reliability risk into a strength.

Interview-ready bottom line

The important decision was not quantization alone. It was the combination of smaller model footprint, bounded context growth, and explicit failure containment. Each layer addressed a different failure mode, and all three were needed to make long sessions safe.


Scenario 6 - The OpenAI-Compatible API Made The Migration Safe

The problem before vLLM

The earlier self-hosted inference path required application-specific client code. That made every backend experiment expensive:

  • migration meant changing the app and the model server together
  • shadow testing required building custom comparison harnesses
  • fallback logic was more brittle because each backend exposed a different response shape
  • the orchestrator had hardcoded assumptions about the inference engine's error codes, streaming format, and token accounting

How the migration was structured

sequenceDiagram
    participant O as Chat Orchestrator
    participant GC as GenerationClient
    participant VG as vLLM Gateway
    participant BA as Bedrock Adapter
    participant VE as vLLM Engine
    participant BC as Bedrock Claude

    O->>GC: generate(request, backend="vllm")

    Note over GC: Shadow mode: send to both
    par Shadow path
        GC->>BA: generate(request)
        BA->>BC: invoke_model()
        BC-->>BA: response
        BA-->>GC: normalized response
    and Primary path
        GC->>VG: /v1/chat/completions
        VG->>VE: engine.generate()
        VE-->>VG: streaming tokens
        VG-->>GC: normalized response
    end

    GC->>GC: compare(vllm_response, bedrock_response)
    GC-->>O: primary response + comparison logged

Migration phases:

| Phase | Duration | What happened | Rollback plan |
|---|---|---|---|
| 1. Contract stabilization | 2 weeks | Built backend-neutral GenerationClient with OpenAI-compatible interface | N/A (no traffic change) |
| 2. Shadow mode | 2 weeks | 100% traffic to Bedrock, 10% shadow-copied to vLLM. Compared latency, output quality, guardrail pass rate | Kill shadow traffic (config change) |
| 3. Canary | 1 week | 5% live traffic to vLLM, 95% Bedrock. Monitored error rate, TTFT, user satisfaction scores | Route 5% back to Bedrock (config change) |
| 4. Ramp | 2 weeks | Increased vLLM from 5% → 25% → 50% → 75%. Each step held for 2–3 days | Reduce percentage (config change) |
| 5. Full promotion | Ongoing | 100% eligible traffic to vLLM. Bedrock remains as fallback, not primary | Route back to Bedrock (config change) |

What changed with vLLM

We exposed the self-hosted generation path through an OpenAI-compatible contract (/v1/chat/completions) and kept the orchestrator talking to a stable chat-completions style interface.

Key implementation decisions (a client-side sketch follows the list):

  • The GenerationClient normalized all responses to the same schema regardless of backend, so the orchestrator never knew which engine produced the response
  • vLLM's built-in OpenAI-compatible server meant we did not need to build a translation layer — it exposed /v1/chat/completions natively
  • Shadow comparison jobs ran offline, scoring: response length distribution, guardrail pass rate, latency percentiles, and a sample of human preference evaluations
  • Backend version tagging (backend_version: "llama-8b-awq+manga_domain_v3") in every response enabled forensic debugging when quality issues surfaced
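
A sketch of what the backend-neutral call looks like from the orchestrator's side, using the standard openai Python client pointed at the vLLM gateway. The base URL, served model name, and prompt text are assumptions; the contract is the point.

# Sketch: the orchestrator speaks plain chat-completions; only the base_url
# decides which backend serves the request.
from openai import OpenAI

client = OpenAI(
    base_url="http://vllm-gateway.internal:8080/v1",  # assumed internal gateway address
    api_key="not-used-internally",
)

stream = client.chat.completions.create(
    model="llama-3-8b-awq",      # served model name (assumption); adapters can also be addressed by name
    messages=[
        {"role": "system", "content": "You are MangaAssist."},
        {"role": "user", "content": "Is Spy x Family volume 12 available in English?"},
    ],
    max_tokens=128,
    stream=True,                 # tokens stream back over SSE as decoding starts
)

for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)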

Why it was a game changer

  • backend changes became lower-risk because the request contract stayed stable
  • it became easier to run side-by-side tests, canaries, and shadow traffic
  • the chatbot could treat self-hosted and managed generation backends as routing choices instead of application rewrites
  • rollback at every phase was a configuration change, not a code deployment

Why this mattered to the chatbot

MangaAssist already had multiple inference paths: rule-based shortcuts, retrieval-heavy flows, Bedrock generation, and self-hosted custom generation. Keeping the interface stable was what allowed us to improve the inference engine without destabilizing the rest of the product. The 7-week migration from Bedrock-primary to vLLM-primary happened with zero user-facing incidents.

Interview-ready bottom line

vLLM did not just improve runtime performance. It reduced migration risk by letting us hide the backend behind a stable API contract. The safest migration is the one where rollback is a config change.


Scenario 7 - Deployment Topology Made vLLM Production-Ready On SageMaker

The problem we needed to solve

Having a fast inference engine is not the same as having a production deployment. vLLM needed to run as a reliable, auto-scaling, health-checked service inside our existing AWS infrastructure. The deployment needed:

  • custom Docker images with model artifacts baked in (not downloaded at startup)
  • SageMaker real-time endpoints with auto-scaling based on GPU-aware metrics
  • health checks that distinguished "process alive" from "model loaded and ready to serve"
  • warm pools to avoid cold-start penalties during scaling events
  • graceful shutdown to drain in-flight requests before container termination

How the deployment topology works

graph TD
    subgraph "SageMaker Endpoint"
        EP["Endpoint: mangaassist-vllm-prod"]
        EP --> V1["Variant: primary\n(ml.g5.xlarge × 4)"]
        EP --> V2["Variant: canary\n(ml.g5.xlarge × 1)"]
    end

    subgraph "Each Instance"
        CONT["Custom Docker Container"]
        CONT --> INIT["Init: load model from /opt/ml/model"]
        INIT --> CUDA["CUDA graph capture (~30s)"]
        CUDA --> WARM["Warmup requests (10 synthetic)"]
        WARM --> READY["Readiness: /ping returns 200"]
        READY --> SERVE["vLLM engine serving on :8080"]
    end

    subgraph "Auto-Scaling"
        CW["CloudWatch Metrics"]
        CW --> ASP["Scaling Policy"]
        ASP --> |"queue_depth > 50\nfor 60s"| SCALEOUT["Scale out (+1 instance)"]
        ASP --> |"gpu_util < 20%\nfor 300s"| SCALEIN["Scale in (-1 instance)"]
        ASP --> FLOOR["Min: 2, Max: 8"]
    end

    subgraph "Warm Pool"
        WP["Pre-initialized instances\n(model loaded, not serving)"]
        WP --> |"Scale event"| V1
        WPNOTE["Reduces scale-out from\n5-8 min to <90 seconds"]
    end

What we built

Docker image strategy (multi-stage build):

  • Stage 1: Build vLLM from source with CUDA 12.1 support (~15 min build)
  • Stage 2: Download and validate model artifacts (AWQ-quantized Llama-3-8B, ~4.5 GB)
  • Stage 3: Slim runtime image with only inference dependencies (~8.2 GB final image)
  • Model artifacts baked into the image at /opt/ml/model — not downloaded at container start. This eliminated the 3–5 minute model download that would have made scaling events unacceptably slow.

Startup sequence (total: ~75 seconds on ml.g5.xlarge):

| Step | Duration | What happens |
|---|---|---|
| Container init | ~5s | Python process starts, imports vLLM |
| Model loading | ~15s | Loads AWQ model weights from /opt/ml/model to GPU |
| CUDA graph capture | ~30s | vLLM captures optimized execution graphs for common sequence lengths |
| LoRA adapter preload | ~5s | Loads manga_domain_v3, general_support_v2, jp_style_v1 |
| Warmup requests | ~10s | 10 synthetic requests to prime memory allocator and caches |
| Readiness signal | ~10s | /ping starts returning 200, SageMaker routes traffic |

Health check contracts (a minimal handler sketch follows the list):

  • Liveness (/ping): Returns 200 if the Python process is alive. SageMaker uses this to detect hung processes.
  • Readiness (internal): Only returns healthy after model load + CUDA graph capture + warmup complete. Prevents traffic routing to a partially initialized instance.
  • GPU health: Periodic check of CUDA device accessibility. If a GPU enters error state (e.g., ECC uncorrectable error), the instance is marked unhealthy and replaced.
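
A minimal sketch of the /ping contract, assuming a FastAPI front end inside the container; the state flags are illustrative of the staged startup, not the project's actual objects.

# Sketch: hold back 200 on /ping until the full startup sequence has completed,
# so SageMaker never routes traffic to a half-loaded engine.
from fastapi import FastAPI, Response

app = FastAPI()
state = {"model_loaded": False, "cuda_graphs_captured": False, "warmup_done": False}


def is_ready() -> bool:
    # Internal readiness: model loaded, CUDA graphs captured, warmup finished.
    return all(state.values())


@app.get("/ping")
def ping() -> Response:
    return Response(status_code=200 if is_ready() else 503)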

Scaling policy:

  • Scale-out trigger: queue_depth > 50 sustained for 60 seconds OR active_sequences > 110 (85% of max 128)
  • Scale-in trigger: gpu_utilization < 20% sustained for 300 seconds
  • Asymmetric cooldown: Scale-out cooldown 60s (react fast), scale-in cooldown 300s (avoid flapping)
  • Floor: 2 instances minimum (redundancy), ceiling: 8 instances
  • Warm pool: 1 pre-initialized instance (model loaded, ready in <90s vs 5–8 min cold start)

Why it was a game changer

Without proper deployment infrastructure, vLLM would have been a fast engine with operational fragility:

  • cold starts during traffic spikes would have caused 5–8 minute latency holes
  • model downloads at container start would have made scaling events unreliable
  • missing readiness probes would have routed traffic to instances still loading model weights
  • no graceful shutdown would have caused in-flight request failures during scale-in

Why this mattered to the chatbot

MangaAssist traffic has predictable daily patterns (peaks during Japanese evening hours, US evening hours) and unpredictable spikes (manga release events, viral social media mentions). The deployment topology needed to handle both patterns without human intervention. The warm pool and predictive scaling combination meant spike absorption happened in under 90 seconds instead of 5–8 minutes.

Interview-ready bottom line

A great inference engine inside a weak deployment topology is still a fragile system. The deployment decisions — baked model artifacts, staged startup, warm pools, GPU-aware scaling — were as important as the vLLM engine choice itself.


Scenario 8 - Observability Integration Made vLLM Debuggable In Production

The problem we needed to solve

vLLM is a black box unless you instrument it properly. When a user reports "the chatbot was slow," you need to answer:

  • Was it slow because of queueing, prefill, or decoding?
  • Did the prefix cache miss? Why?
  • Which adapter was used? Which model version?
  • Was it a single slow request or a systemic issue?
  • Is the GPU memory pressure increasing over time?

Without comprehensive observability, debugging production inference issues would require guesswork.

How the observability stack works

graph TD
    subgraph "vLLM Engine Metrics (Prometheus)"
        M1["vllm:num_requests_running"]
        M2["vllm:num_requests_waiting"]
        M3["vllm:gpu_cache_usage_perc"]
        M4["vllm:num_preemptions_total"]
        M5["vllm:avg_prompt_throughput_toks_per_s"]
        M6["vllm:avg_generation_throughput_toks_per_s"]
    end

    subgraph "Custom Application Metrics"
        C1["mangaassist.inference.ttft_ms"]
        C2["mangaassist.inference.queue_wait_ms"]
        C3["mangaassist.inference.generation_ms"]
        C4["mangaassist.inference.prefix_cache_hit"]
        C5["mangaassist.inference.adapter_id"]
        C6["mangaassist.inference.token_budget_trimmed"]
        C7["mangaassist.inference.oom_caught"]
    end

    subgraph "Trace Correlation"
        T1["MLflow Trace (per request)"]
        T1 --> T2["Span: orchestrator"]
        T2 --> T3["Span: generation_client"]
        T3 --> T4["Span: vllm_gateway"]
        T4 --> T5["Span: engine_generate"]
        T5 --> T6["Attributes: backend_version,\nadapter_id, prompt_version,\ncache_hit, queue_wait_ms"]
    end

    M1 --> CW["CloudWatch"]
    C1 --> CW
    CW --> DASH["Inference Dashboard"]
    CW --> ALERT["Alerting Rules"]
    T1 --> MLFLOW["MLflow Tracking Server"]

What we built

Metrics we emit on every request (an emission sketch follows the table):

| Metric | Type | Why it matters |
|---|---|---|
| queue_wait_ms | Histogram | Separates scheduler delay from model compute |
| ttft_ms | Histogram | User-perceived responsiveness |
| generation_ms | Histogram | Total generation time minus queue wait |
| prefix_cache_hit | Counter (0/1) | Validates prompt determinism is maintained |
| token_budget_trimmed | Counter | Tracks how often context budgeting kicks in |
| adapter_id | Label | Enables per-adapter latency and quality analysis |
| backend_version | Label | Enables per-model-version regression detection |
| oom_caught | Counter | Tracks OOM containment frequency |
| input_tokens | Histogram | Token distribution for capacity planning |
| output_tokens | Histogram | Generation length distribution |
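
A sketch of the emission pattern using prometheus_client; the dotted metric names from the table appear with underscores (as Prometheus requires), and the bucket boundaries are assumptions.

# Sketch: per-request instrumentation in the gateway.
from prometheus_client import Counter, Histogram

TTFT_MS = Histogram(
    "mangaassist_inference_ttft_ms", "Time to first token (ms)",
    buckets=(50, 100, 200, 400, 800, 1600, 3200),
)
QUEUE_WAIT_MS = Histogram(
    "mangaassist_inference_queue_wait_ms", "Scheduler queue wait (ms)",
    buckets=(5, 10, 25, 50, 100, 250, 500),
)
PREFIX_CACHE_HIT = Counter(
    "mangaassist_inference_prefix_cache_hit", "Prefix cache hits", ["adapter_id"],
)
OOM_CAUGHT = Counter(
    "mangaassist_inference_oom_caught", "OOM errors caught by the guard",
)


def record_request(ttft_ms: float, queue_wait_ms: float, cache_hit: bool, adapter_id: str) -> None:
    TTFT_MS.observe(ttft_ms)
    QUEUE_WAIT_MS.observe(queue_wait_ms)
    if cache_hit:
        PREFIX_CACHE_HIT.labels(adapter_id=adapter_id).inc()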

Alerting rules:

| Alert | Condition | Severity | Action |
|---|---|---|---|
| High TTFT | P95 TTFT > 500 ms for 5 min | Warning | Check prefix cache hit rate |
| Queue buildup | queue_depth > 80 for 2 min | Critical | Trigger immediate scale-out |
| GPU memory pressure | gpu_cache_usage > 95% for 5 min | Warning | Check for context budget drift |
| OOM rate | oom_caught > 0 in 5 min window | Critical | Investigate request spike or budget bypass |
| Prefix cache degradation | Cache hit rate < 50% for 15 min | Warning | Check for prompt template changes |
| Adapter latency regression | Adapter P95 > 2× baseline for 10 min | Warning | Check adapter version, rollback if needed |

SLO definitions for the self-hosted generation path:

| SLO | Target | Measurement window |
|---|---|---|
| Availability | 99.9% | 30-day rolling |
| P50 TTFT | < 200 ms | 1-hour rolling |
| P99 total latency | < 2,000 ms | 1-hour rolling |
| Queue wait P95 | < 100 ms | 1-hour rolling |
| Error rate | < 0.1% | 24-hour rolling |

Trace integration with MLflow:

Each inference request creates a trace span with structured attributes (a minimal tracing sketch follows this list). This allows:

  • filtering all requests for a specific adapter version
  • correlating user-reported "slow chatbot" complaints with specific trace IDs
  • identifying whether a quality regression came from a model change, adapter change, or prompt change
  • building offline evaluation datasets from production traces
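
A sketch of attaching those attributes to a span with MLflow's tracing API (MLflow 2.14 or later); the attribute values and the engine_call parameter are assumptions.

# Sketch: wrap the gateway call in an MLflow span and attach request attributes.
import mlflow


def traced_generate(prompt: str, adapter_id: str, engine_call):
    # engine_call is any callable that performs the actual generation (assumption).
    with mlflow.start_span(name="vllm_gateway") as span:
        span.set_attributes({
            "backend_version": f"llama-8b-awq+{adapter_id}",
            "adapter_id": adapter_id,
            "prompt_version": "v12",  # example value
        })
        result = engine_call(prompt)
        # Attach post-generation facts so traces can answer "why was this slow?"
        span.set_attributes({
            "cache_hit": getattr(result, "cache_hit", False),
            "queue_wait_ms": getattr(result, "queue_wait_ms", 0),
        })
        return result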

Why it was a game changer

Before structured observability, debugging inference issues was forensic work — reading logs, guessing at timing, hoping the right information was captured. After instrumentation:

  • "The chatbot was slow" could be answered in < 2 minutes by checking the inference dashboard
  • Prefix cache degradation was caught within 15 minutes of a prompt template change (instead of weeks of silent efficiency loss)
  • Adapter regressions were detected automatically and triggered alerts before user complaints accumulated
  • Capacity planning was data-driven: we could predict when we needed to scale based on actual queue depth trends, not guesswork

Why this mattered to the chatbot

A chatbot with multiple inference backends, multiple adapters, prefix caching, and dynamic scaling has a large surface area for subtle failures. Observability is what made vLLM production-grade instead of just fast. Without it, we would have been running a high-performance engine with low-visibility operations.

Interview-ready bottom line

Observability was not an afterthought. It was a deployment requirement. The metrics, traces, and alerts we built around vLLM gave us confidence to run it at scale and to catch problems before users noticed them.


Why This Decision Was So Important For MangaAssist

The vLLM choice mattered because it improved all four dimensions that matter in a production chatbot:

| Dimension | Why vLLM mattered | Key metric |
|---|---|---|
| User experience | Better TTFT (900 ms → 180 ms), lower queueing, and fewer degraded long-turn sessions | P50 latency: 1,820 ms → 620 ms |
| Cost | More useful concurrency per GPU and fewer duplicate model fleets | Monthly GPU spend: $18,400 → $9,200 |
| Reliability | Zero memory-driven failures and clearer operational controls | OOM restarts: multiple/day → zero |
| Evolvability | Stable serving contract, adapter-based specialization, and lower migration risk | 7-week migration, zero incidents |

If this had only been a benchmark win, it would not have mattered much. It mattered because it improved the actual chatbot operating model: what we could afford, how much traffic we could absorb, how fast the assistant felt, and how safely we could keep iterating.

Cross-References