Scenario 3 — vLLM Serving Containers For Throughput And Cost
User Story
As an infrastructure engineer on MangaAssist, I wanted our custom fine-tuned manga recommendation model to handle more concurrent reader sessions at lower GPU cost, without creating a brittle serving platform that only one ML engineer could operate.
Context
MangaAssist fine-tuned a base LLM on manga metadata, user preference patterns, and genre taxonomy. That fine-tuned model served the core recommendation and chat path. As concurrent user sessions grew — especially around chapter drop events — we needed a serving runtime that could handle batched inference efficiently without overprovisioning GPU capacity.
What We Actually Did
- Chose vLLM as the serving engine for the fine-tuned model, packaged as a custom container on SageMaker.
- Used PagedAttention to reduce KV-cache VRAM waste.
- Used continuous batching to improve throughput during traffic spikes.
- Used automatic prefix caching to avoid repeated work on common prompt prefixes shared across manga chat turns.
- Used streaming for better first-token experience in the chat interface.
- Used Multi-LoRA to serve domain-specific adapters (shonen, romance, isekai, horror manga tones) from one base model container.
- Used AWQ quantization to shrink model memory footprint.
High-Level Design (HLD)
flowchart TD
Browser["Browser / Mobile App<br/>(WebSocket streaming tokens)"]
APIGW["API Gateway<br/>+ WebSocket API"]
Orch["ECS Fargate<br/>MangaAssist Orchestrator<br/>───────────────────────<br/>1. Load user session<br/>2. RAG prefetch (OpenSearch)<br/>3. Token budget assembly<br/>4. Genre → LoRA routing<br/>5. Call vLLM endpoint<br/>6. Stream tokens to browser"]
vLLM["SageMaker Endpoint<br/>vLLM Container (ECR Image)<br/>───────────────────────<br/>Base: manga-llm-awq-int4<br/>LoRA: shonen / romance / isekai / horror<br/>───────────────────────<br/>PagedAttention ✓<br/>Continuous Batching ✓<br/>Prefix Caching ✓<br/>Streaming SSE ✓<br/>───────────────────────<br/>min_instances=2<br/>GPU: ml.g5.xlarge 24GB VRAM"]
Redis["Redis L1<br/>User Preferences"]
Dynamo["DynamoDB<br/>Session + History"]
OS["OpenSearch<br/>RAG Manga Context"]
S3["S3<br/>LoRA Adapter Files"]
Browser -->|WebSocket| APIGW
APIGW -->|HTTP upgrade| Orch
Orch -->|"HTTP POST /v1/chat/completions<br/>(OpenAI-compatible)"| vLLM
vLLM -->|SSE token chunks| Orch
Orch -->|WebSocket tokens| Browser
Orch --> Redis
Orch --> Dynamo
Orch --> OS
vLLM --> S3
style vLLM fill:#1a1a2e,color:#eee,stroke:#7c3aed
style Orch fill:#0f3460,color:#eee,stroke:#2563eb
style Browser fill:#16213e,color:#eee,stroke:#64748b
Low-Level Design (LLD)
LLD 1 — Container Internals
graph TD
subgraph Container["vLLM Container — Runtime Stage"]
subgraph Model["/model-artifacts/manga-llm-awq-int4/"]
cfg["config.json"]
tok["tokenizer.json"]
w1["model-00001.safetensors ← AWQ INT4"]
w2["model-00002.safetensors"]
w3["model-00003.safetensors"]
w4["model-00004.safetensors"]
end
subgraph LoRA["/lora-adapters/ (~50MB each)"]
ls["lora-shonen-v3/adapter_model.safetensors"]
lr["lora-romance-v2/adapter_model.safetensors"]
li["lora-isekai-v4/adapter_model.safetensors"]
lh["lora-horror-v1/adapter_model.safetensors"]
end
subgraph App["/app/"]
wu["warmup.py ← runs at startup, gates /ping"]
hc["health_check.py"]
end
subgraph Procs["Process Tree (PID 1: vllm api_server)"]
P1["Tokenizer Worker"]
P2["Scheduler<br/>(Continuous Batching Engine)"]
P3["KV-Cache Manager<br/>(PagedAttention)"]
P4["Engine Worker<br/>(GPU Inference Thread)"]
end
Port["Port 8080<br/>POST /v1/chat/completions<br/>GET /ping (health)"]
end
P2 --> P3
P3 --> P4
P1 --> P2
LLD 2 — Request Lifecycle Inside vLLM
flowchart TD
IN["Incoming Request<br/>'recommend a horror manga'<br/>+ lora_request: lora-horror-v1<br/>+ assembled context (token-budgeted)"]
TW["Tokenizer Worker<br/>Convert text → token IDs<br/>Count tokens, validate max_model_len"]
PC{"Prefix Cache<br/>Check<br/>Hash first N token blocks"}
HIT["CACHE HIT ✓<br/>KV-cache pages already computed<br/>Skip ~300 tokens of GPU work<br/>Jump to new tokens only"]
MISS["CACHE MISS ✗<br/>Compute KV for full prefix<br/>Store result in cache<br/>Next request will HIT"]
CB["Continuous Batching Scheduler<br/>─────────────────────────────<br/>Active slots:<br/> slot 0 → User A tok 14/200<br/> slot 1 → User B tok 3/200<br/> slot 2 → User C tok 89/200<br/>─────────────────────────────<br/>New request → insert NOW<br/> slot 3 → NEW REQUEST (no wait)"]
PA["PagedAttention KV Manager<br/>─────────────────────────────<br/>VRAM split into 16-token pages<br/>Page table for this request:<br/> logical 0 → physical page 47<br/> logical 1 → physical page 12<br/> logical 2 → physical page 103<br/>New token → allocate 1 page<br/>Request done → pages freed"]
GPU["GPU Inference Engine<br/>─────────────────────────────<br/>Base weights: manga-llm-awq-int4<br/>+ lora-horror-v1 delta applied<br/>Autoregressive decode loop<br/>tok_n generated → SSE chunk sent"]
OUT["SSE Chunks → Orchestrator → WebSocket → Browser<br/>data: Berserk<br/>data: is a dark...<br/>data: [DONE]"]
IN --> TW
TW --> PC
PC -->|HIT| HIT
PC -->|MISS| MISS
HIT --> CB
MISS --> CB
CB --> PA
PA --> GPU
GPU --> OUT
style HIT fill:#166534,color:#fff
style MISS fill:#7f1d1d,color:#fff
style GPU fill:#1e1b4b,color:#fff
LLD 3 — Multi-Stage Docker Build Flow
flowchart LR
subgraph Stage1["STAGE 1: builder<br/>nvidia/cuda:12.1-devel<br/>(has compiler, pip, git)"]
B1["pip install vllm autoawq safetensors"]
B2["Download manga-llm-base weights<br/>from HuggingFace Hub"]
B3["python quantize_model.py<br/>--bits 4<br/>→ manga-llm-awq-int4<br/>(AWQ INT4 at BUILD TIME)"]
B4["bash download_loras.sh<br/>→ pull 4 LoRA adapters from S3"]
B1 --> B2 --> B3 --> B4
end
subgraph Stage2["STAGE 2: runtime<br/>nvidia/cuda:12.1-runtime<br/>(NO compiler, NO build tools)"]
R1["pip install vllm safetensors httpx<br/>(runtime only — no autoawq)"]
R2["COPY --from=builder<br/>/model-artifacts<br/>/lora-adapters"]
R3["COPY app/ /app<br/>(warmup.py, health_check.py)"]
R4["USER vllm-runner<br/>(non-root)"]
R5["HEALTHCHECK<br/>--start-period=120s<br/>python health_check.py"]
R7["ENTRYPOINT:<br/>python -m vllm.entrypoints.openai.api_server<br/>--quantization awq<br/>--enable-prefix-caching<br/>--enable-lora<br/>--max-loras 4<br/>--gpu-memory-utilization 0.90<br/>--max-model-len 4096"]
R1 --> R2 --> R3 --> R4 --> R5 --> R6
end
ECR["ECR<br/>Push runtime image only<br/>~3-4x smaller than builder"]
Stage1 -->|COPY artifacts only| Stage2
Stage2 -->|docker push| ECR
style Stage1 fill:#292524,color:#d6d3d1
style Stage2 fill:#0c4a6e,color:#e0f2fe
style ECR fill:#14532d,color:#dcfce7
LLD 4 — Container Startup & Readiness Gating
sequenceDiagram
participant Docker as Docker Runtime
participant vLLM as vLLM Process
participant Warmup as warmup.py
participant HC as health_check.py
participant LB as Load Balancer
Docker->>vLLM: Start PID 1 (api_server)
Note over vLLM: Loading model weights...<br/>~45s container start
loop every 2s (up to 120s)
Warmup->>vLLM: GET /health
vLLM-->>Warmup: 503 (not ready yet)
end
vLLM-->>Warmup: 200 OK (process accepting)
Note over Warmup: Run 3 warmup prompts<br/>(real MangaAssist traffic shapes)
Warmup->>vLLM: POST /v1/chat/completions (shonen prompt)
vLLM-->>Warmup: tokens (CUDA kernels compiled for this shape)
Warmup->>vLLM: POST /v1/chat/completions (romance prompt)
vLLM-->>Warmup: tokens
Warmup->>vLLM: POST /v1/chat/completions (horror prompt)
vLLM-->>Warmup: tokens
Warmup->>Docker: write /tmp/vllm_ready
Docker->>HC: HEALTHCHECK every 10s
HC-->>Docker: 0 (healthy) — file exists
Docker->>LB: Container is HEALTHY
LB->>vLLM: Route live MangaAssist traffic
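In code, the gate is small. Below is a minimal sketch of warmup.py under the design above; the /health route, port 8080, and /tmp/vllm_ready path come from the diagrams, while the prompt strings, the BASE constant, and the function names are illustrative, not the production script:

# warmup.py — readiness gate (sketch; prompts and helper names illustrative)
import sys
import time
from pathlib import Path

import httpx

BASE = "http://localhost:8080"
READY_FILE = Path("/tmp/vllm_ready")
WARMUP_PROMPTS = [
    "recommend a shonen manga",
    "recommend a romance manga",
    "recommend a horror manga",
]

def wait_for_server(timeout_s: int = 120) -> None:
    # Poll /health every 2s until PID 1 (api_server) accepts connections.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if httpx.get(f"{BASE}/health", timeout=2.0).status_code == 200:
                return
        except httpx.HTTPError:
            pass
        time.sleep(2)
    sys.exit("vLLM never became healthy; container stays unready")

def run_warmup() -> None:
    # One real-shaped request per genre so CUDA kernels compile before live traffic.
    for prompt in WARMUP_PROMPTS:
        httpx.post(
            f"{BASE}/v1/chat/completions",
            json={
                "model": "manga-assistant",
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 16,
            },
            timeout=120.0,
        ).raise_for_status()

if __name__ == "__main__":
    wait_for_server()
    run_warmup()
    READY_FILE.touch()  # health_check.py exits 0 only once this file exists

With this in place, health_check.py reduces to a file-existence check on /tmp/vllm_ready, which is why the Docker HEALTHCHECK gates the load balancer on warmup completion rather than on process start.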
LLD 5 — Orchestrator: Genre Routing + FM Call + Streaming
# manga_fm_client.py
from __future__ import annotations

import asyncio
import json
import os
from typing import AsyncGenerator

import httpx

# UserSession, ContextWindow, infer_genre_from_turn, load_session,
# opensearch_retrieve, build_context_window, save_session, emit_metrics,
# and the WebSocket type are defined elsewhere in the orchestrator.

VLLM_ENDPOINT = os.environ["VLLM_ENDPOINT"]  # SageMaker endpoint URL, injected via task config

GENRE_TO_LORA = {
    "shonen": "lora-shonen-v3",
    "romance": "lora-romance-v2",
    "isekai": "lora-isekai-v4",
    "horror": "lora-horror-v1",
}

LORA_INT_IDS = {
    "lora-shonen-v3": 1,
    "lora-romance-v2": 2,
    "lora-isekai-v4": 3,
    "lora-horror-v1": 4,
}

# ── Genre Router ───────────────────────────────────────────────────
def select_lora(session: UserSession) -> str | None:
    genre = session.preferences.top_genre
    if genre in GENRE_TO_LORA:
        return GENRE_TO_LORA[genre]
    inferred = infer_genre_from_turn(session.last_message)
    return GENRE_TO_LORA.get(inferred)  # None = use base model, no adapter

# ── FM Call With Streaming ─────────────────────────────────────────
async def call_vllm_streaming(
    session: UserSession,
    context: ContextWindow,  # already token-budgeted (scenario-04 lesson)
) -> AsyncGenerator[str, None]:
    lora = select_lora(session)
    payload = {
        "model": "manga-assistant",
        "messages": context.to_messages(),
        "max_tokens": 512,
        "stream": True,
        "temperature": 0.7,
    }
    if lora:
        # Mirrors the openai-python SDK's extra_body convention; our serving
        # container forwards lora_request to the engine as a vLLM LoRARequest.
        payload["extra_body"] = {
            "lora_request": {
                "lora_name": lora,
                "lora_int_id": LORA_INT_IDS[lora],
                "lora_local_path": f"/lora-adapters/{lora}",
            }
        }
    async with httpx.AsyncClient(timeout=30.0) as client:
        async with client.stream("POST", VLLM_ENDPOINT, json=payload) as resp:
            async for line in resp.aiter_lines():
                if not line.startswith("data: "):
                    continue
                data = line[len("data: "):]
                if data == "[DONE]":
                    return
                chunk = json.loads(data)
                token = chunk["choices"][0]["delta"].get("content", "")
                if token:
                    yield token  # forward to WebSocket handler

# ── WebSocket Handler ──────────────────────────────────────────────
async def handle_chat(ws: WebSocket, session_id: str):
    session = await load_session(session_id)  # Redis L1 → DynamoDB
    # ALL prefetch BEFORE stream starts — never inside the loop
    rag_context = await opensearch_retrieve(session.last_message)
    context_win = build_context_window(session, rag_context)  # token budget
    async for token in call_vllm_streaming(session, context_win):
        await ws.send_json({"type": "token", "content": token})
    await ws.send_json({"type": "done"})
    # Async post-stream work — never blocks token delivery
    asyncio.create_task(save_session(session, context_win))
    asyncio.create_task(emit_metrics(session_id))
End-to-End Request Flow
sequenceDiagram
participant U as User Browser
participant WS as WebSocket Handler (ECS Fargate)
participant Redis as Redis L1
participant DDB as DynamoDB
participant OS as OpenSearch
participant vLLM as vLLM Container
U->>WS: "recommend a horror manga"
Note over WS: ALL prefetch BEFORE stream starts
WS->>Redis: load_session(id)
Redis-->>WS: miss
WS->>DDB: load_session(id)
DDB-->>WS: session (genre=horror, preferences)
WS->>OS: retrieve(query="horror manga", top_k=3)
OS-->>WS: [Berserk context, Uzumaki context, MPD Psycho context]
Note over WS: build_context_window()<br/>system_prompt: 200 tok<br/>user_pref_header: 100 tok<br/>RAG context: 300 tok<br/>recent turns: 200 tok<br/>select_lora → lora-horror-v1
WS->>vLLM: POST /v1/chat/completions<br/>stream=true, lora=lora-horror-v1
Note over vLLM: prefix cache HIT (300 tokens skipped)<br/>join active batch (no wait)<br/>allocate KV pages (PagedAttention)<br/>apply lora-horror-v1 delta
loop autoregressive decode
vLLM-->>WS: data: {"content": "Berserk"}
WS-->>U: {type: token, content: "Berserk"}
vLLM-->>WS: data: {"content": " is"}
WS-->>U: {type: token, content: " is"}
vLLM-->>WS: data: {"content": " a dark..."}
WS-->>U: {type: token, content: " a dark..."}
end
vLLM-->>WS: data: [DONE]
WS-->>U: {type: done}
Note over WS: post-stream (non-blocking async tasks)
WS--)DDB: save_session() [fire-and-forget]
WS--)Redis: update_cache() [fire-and-forget]
LLM Concepts — Deep Dives
Concept 1 — PagedAttention
The problem: standard inference reserves one contiguous VRAM block per request, sized up front for max_len tokens, so with short chat turns 90%+ of each reservation sits unused.
graph TD
subgraph Contiguous["Standard Contiguous Allocation (WASTEFUL)"]
direction LR
CA["Request A<br/>2048 slots reserved<br/>████░░░░░░░░░░░░░░░░░░░░<br/>50/2048 used — 97% wasted"]
CB["Request B<br/>2048 slots reserved<br/>████░░░░░░░░░░░░░░░░░░░░<br/>50/2048 used — 97% wasted"]
CC["Request C<br/>2048 slots reserved<br/>████░░░░░░░░░░░░░░░░░░░░<br/>50/2048 used — 97% wasted"]
end
subgraph Paged["PagedAttention (EFFICIENT)"]
direction LR
PG1["Physical Page 47<br/>tokens 0–15 of Req A"]
PG2["Physical Page 12<br/>tokens 16–31 of Req A"]
PG3["Physical Page 103<br/>tokens 32–47 of Req A"]
PG4["Physical Page 7<br/>tokens 48–49 of Req A (partial)"]
FREE["Remaining VRAM<br/>free for other requests"]
PG1 --> PG2 --> PG3 --> PG4
end
Contiguous -->|"PagedAttention replaces this"| Paged
style CA fill:#7f1d1d,color:#fca5a5
style CB fill:#7f1d1d,color:#fca5a5
style CC fill:#7f1d1d,color:#fca5a5
style FREE fill:#166534,color:#bbf7d0
How pages are managed at runtime:
flowchart LR
T["New token generated"]
A{"Last page<br/>full?"}
AP["Allocate next free page<br/>from VRAM pool"]
W["Write KV to page"]
D{"Request<br/>done?"}
F["Free all pages<br/>back to pool<br/>(immediately reusable)"]
C["Continue decode"]
T --> A
A -->|Yes| AP --> W --> D
A -->|No| W
W --> D
D -->|Yes| F
D -->|No| C --> T
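The same bookkeeping as a toy Python allocator. This illustrates the page-table mechanics only, not vLLM's implementation; KVPagePool and its method names are invented for the sketch:

PAGE_SIZE = 16  # tokens per KV page, as in the diagram

class KVPagePool:
    """Toy page table: logical token blocks map to arbitrary physical pages."""

    def __init__(self, num_pages: int):
        self.free_pages = list(range(num_pages))     # physical page ids
        self.page_tables: dict[str, list[int]] = {}  # request -> pages in logical order

    def append_token(self, request_id: str, token_index: int) -> int:
        """Record one new token's KV entry; returns the physical page used."""
        table = self.page_tables.setdefault(request_id, [])
        if token_index % PAGE_SIZE == 0:
            if not self.free_pages:
                raise MemoryError("KV pool exhausted; scheduler must preempt")
            table.append(self.free_pages.pop())  # allocate only when a page fills
        return table[-1]

    def release(self, request_id: str) -> None:
        """Request done: every page returns to the pool, immediately reusable."""
        self.free_pages.extend(self.page_tables.pop(request_id, []))

A request holding 50 tokens occupies 4 pages (64 token slots) instead of a 2048-slot reservation; the unused slots stay in the pool for other requests.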
MangaAssist result: Same ml.g5.xlarge (24GB VRAM) goes from ~4–6 concurrent sessions to ~15–20.
Concept 2 — Continuous Batching
gantt
title Static Batching vs Continuous Batching
dateFormat x
axisFormat %L ms
section Static Batching (wasteful)
User A arrives :milestone, 0, 0
User B arrives :milestone, 10, 0
User C arrives :milestone, 30, 0
Wait for batch window :crit, wait1, 0, 50
BATCH 1 executes :active, b1, 50, 100
User D arrives :milestone, 50, 0
Wait for batch window :crit, wait2, 50, 100
BATCH 2 executes :active, b2, 100, 150
section Continuous Batching (vLLM)
User A tok-by-tok :active, ca, 0, 80
User B joins immediately :active, cb, 10, 90
User C joins immediately :active, cc, 30, 110
User D joins next iter :active, cd, 50, 120
GPU never idles :active, gpu, 0, 120
Slot-level view:
graph LR
subgraph Iter1["Decode Iteration 1"]
I1A["A: tok 14"]
I1B["B: tok 3"]
I1C["C: tok 89"]
end
subgraph Iter2["Decode Iteration 2 — D joins NOW"]
I2A["A: tok 15"]
I2B["B: tok 4"]
I2C["C: tok 90"]
I2D["D: tok 1 ← NEW"]
end
subgraph Iter4["Decode Iteration 4 — C done, slot freed"]
I4A["A: tok 17"]
I4B["B: tok 6"]
I4C["C: DONE ✓"]
I4D["D: tok 3"]
end
subgraph Iter5["Decode Iteration 5 — E fills freed slot"]
I5A["A: tok 18"]
I5B["B: tok 7"]
I5E["E: tok 1 ← NEW"]
I5D["D: tok 4"]
end
Iter1 --> Iter2 --> Iter4 --> Iter5
style I2D fill:#166534,color:#fff
style I4C fill:#7f1d1d,color:#fff
style I5E fill:#166534,color:#fff
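A toy scheduler loop showing the same slot behavior. vLLM's real scheduler also handles prefill vs decode phases, preemption, and KV budgeting, so treat Req and decode_loop as illustrative names only:

from collections import deque
from dataclasses import dataclass, field

@dataclass
class Req:
    id: str
    max_new_tokens: int
    tokens: list[str] = field(default_factory=list)

def decode_loop(waiting: deque[Req], max_slots: int = 4) -> None:
    active: dict[str, Req] = {}
    while waiting or active:
        # Admission happens at every iteration boundary: no batch window,
        # so a new request waits at most one decode step, not a full batch.
        while waiting and len(active) < max_slots:
            req = waiting.popleft()
            active[req.id] = req
        # One fused forward pass emits one token for every active request.
        for req in list(active.values()):
            req.tokens.append(f"tok_{len(req.tokens) + 1}")
            if len(req.tokens) >= req.max_new_tokens:
                del active[req.id]  # slot frees mid-run; the next arrival fills it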
Concept 3 — Automatic Prefix Caching
The repeated prefix problem for MangaAssist:
graph LR
subgraph Turn1["Turn 1 (320 tokens)"]
T1A["system_prompt<br/>200 tok"]
T1B["user_pref_header<br/>100 tok"]
T1C["message<br/>20 tok"]
end
subgraph Turn2["Turn 2 (372 tokens)"]
T2A["system_prompt<br/>200 tok"]
T2B["user_pref_header<br/>100 tok"]
T2C["history_1<br/>50 tok"]
T2D["message<br/>22 tok"]
end
subgraph Turn3["Turn 3 (418 tokens)"]
T3A["system_prompt<br/>200 tok"]
T3B["user_pref_header<br/>100 tok"]
T3C["history_1+2<br/>100 tok"]
T3D["message<br/>18 tok"]
end
REPEAT["Same 300 tokens<br/>computed from scratch<br/>on EVERY turn<br/>without caching"]
T1A -.->|repeated| T2A -.->|repeated| T3A
T1B -.->|repeated| T2B -.->|repeated| T3B
style REPEAT fill:#7f1d1d,color:#fca5a5
How prefix caching resolves this:
flowchart TD
REQ["Request arrives<br/>tokens 0..319"]
H0{"Hash block 0<br/>tokens 0–15<br/>= 0xAB12..."}
H1{"Hash block 1<br/>tokens 16–31"}
H2{"Hash block 2<br/>tokens 32–47"}
HNEW{"Hash block N<br/>first new content"}
HIT0["CACHE HIT ✓<br/>Reuse stored KV<br/>No GPU compute"]
HIT1["CACHE HIT ✓"]
HIT2["CACHE HIT ✓"]
COMPUTE["CACHE MISS<br/>GPU computes KV<br/>for new tokens only<br/>Store result for next request"]
REQ --> H0
H0 -->|hit| HIT0 --> H1
H1 -->|hit| HIT1 --> H2
H2 -->|hit| HIT2 --> HNEW
HNEW -->|miss| COMPUTE
SAVED["300 / 320 tokens skipped<br/>= 93.75% prefix hit rate<br/>GPU works only on 20 new tokens"]
HIT2 --> SAVED
style HIT0 fill:#166534,color:#fff
style HIT1 fill:#166534,color:#fff
style HIT2 fill:#166534,color:#fff
style COMPUTE fill:#92400e,color:#fff
style SAVED fill:#1e3a5f,color:#fff
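The hashing scheme as a sketch. vLLM keys real KV pages in VRAM; here the cache is a plain dict and count_cached_blocks is an invented name:

import hashlib

BLOCK = 16  # tokens per hashed block, matching the KV page size

def count_cached_blocks(token_ids: list[int], kv_cache: dict[str, object]) -> int:
    """How many leading full blocks of this prompt already have KV entries.

    Each key hashes the ENTIRE prefix up to that block boundary, so a hit
    guarantees every earlier token matches too, never just a lucky block.
    """
    hits = 0
    for end in range(BLOCK, len(token_ids) + 1, BLOCK):
        key = hashlib.sha256(repr(token_ids[:end]).encode()).hexdigest()
        if key not in kv_cache:
            break
        hits += 1
    return hits

For the turns above, Turn 2 re-sends the same 300-token prefix, so the first 18 full blocks (288 tokens) hit and the GPU computes KV only from the first divergent block onward, which is the "~300 tokens skipped" behavior in LLD 2.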
Concept 4 — Multi-LoRA (Low-Rank Adaptation)
What a LoRA adapter is (math made visual):
graph LR
subgraph FullFinetune["Full Fine-Tune (expensive)"]
FFW["W_base (14GB)<br/>ALL parameters updated<br/>7B weights × 2 bytes = 14GB<br/>per genre variant"]
end
subgraph LoRAFT["LoRA Fine-Tune (cheap)"]
LBW["W_base FROZEN (14GB)<br/>not modified"]
LA["A matrix (small)<br/>d_model × rank 64"]
LB["B matrix (small)<br/>rank 64 × d_model"]
EQ["W_effective = W_base + A×B<br/>ΔW ≈ 50MB total per adapter"]
LBW --> EQ
LA --> EQ
LB --> EQ
end
style FFW fill:#7f1d1d,color:#fca5a5
style EQ fill:#166534,color:#bbf7d0
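The arithmetic as a runnable sketch, assuming d_model = 4096 and rank = 64 as in the diagram; real adapters apply one such pair per attention and MLP projection in every layer:

import numpy as np

d_model, rank = 4096, 64
rng = np.random.default_rng(0)

W_base = rng.standard_normal((d_model, d_model), dtype=np.float32)  # frozen
A = rng.standard_normal((d_model, rank), dtype=np.float32)  # trained per genre
B = rng.standard_normal((rank, d_model), dtype=np.float32)  # trained per genre

def forward(x: np.ndarray) -> np.ndarray:
    # Equivalent to x @ (W_base + A @ B), but the d_model × d_model delta
    # is never materialized; the update stays two skinny matmuls.
    return x @ W_base + (x @ A) @ B

adapter_params = A.size + B.size  #     524,288 per projection
full_params = W_base.size         #  16,777,216 per projection, ~32× larger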
Multi-LoRA: one container, four genre personalities:
flowchart TD
subgraph Without["WITHOUT Multi-LoRA<br/>4 endpoints × GPU cost"]
E1["Endpoint 1<br/>manga-llm-shonen<br/>1 GPU fleet"]
E2["Endpoint 2<br/>manga-llm-romance<br/>1 GPU fleet"]
E3["Endpoint 3<br/>manga-llm-isekai<br/>1 GPU fleet"]
E4["Endpoint 4<br/>manga-llm-horror<br/>1 GPU fleet"]
COST1["4× GPU cost<br/>4× container fleets<br/>4× deployment pipelines"]
end
subgraph With["WITH Multi-LoRA<br/>1 container, 1 GPU fleet"]
BASE["Base model in VRAM<br/>manga-llm-awq-int4 (4.5GB)"]
A1["lora-shonen-v3 (~50MB)"]
A2["lora-romance-v2 (~50MB)"]
A3["lora-isekai-v4 (~50MB)"]
A4["lora-horror-v1 (~50MB)"]
SWAP["Request: lora_request=horror<br/>vLLM: W_eff = W_base + A_horror×B_horror<br/>Pointer swap — sub-millisecond<br/>NO disk I/O, NO data copy"]
COST2["1× GPU cost<br/>1× container fleet<br/>1× deployment pipeline"]
end
Without -->|"Multi-LoRA replaces this"| With
style COST1 fill:#7f1d1d,color:#fca5a5
style COST2 fill:#166534,color:#bbf7d0
style SWAP fill:#1e1b4b,color:#c4b5fd
Concept 5 — AWQ INT4 Quantization
Precision vs VRAM tradeoff:
graph LR
subgraph FP16["FP16 Full Precision"]
F1["1 weight = 16 bits = 2 bytes"]
F2["7B params × 2 bytes = 14 GB"]
F3["KV-cache headroom<br/>24GB - 14GB - 2GB overhead<br/>= 8GB → ~5 concurrent sessions"]
end
subgraph AWQ["AWQ INT4 Quantized"]
A1["1 weight = 4 bits = 0.5 bytes"]
A2["7B params × 0.5 bytes = 3.5GB<br/>+ AWQ scales/zeros ≈ 4.5GB total"]
A3["KV-cache headroom<br/>24GB - 4.5GB - 2GB overhead<br/>= 17.5GB → ~18 concurrent sessions"]
end
FP16 -->|"AWQ quantization<br/>~2% quality delta<br/>3× VRAM reduction"| AWQ
style F3 fill:#7f1d1d,color:#fca5a5
style A3 fill:#166534,color:#bbf7d0
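The headroom arithmetic behind the diagram as a quick sketch; the 2GB overhead and roughly 1GB of AWQ scales and zeros are the estimates used above:

GPU_VRAM_GB = 24.0  # ml.g5.xlarge (A10G)
OVERHEAD_GB = 2.0   # CUDA context, activations, fragmentation estimate
PARAMS_B = 7.0      # 7B-parameter base model

def kv_headroom_gb(bytes_per_weight: float, quant_extras_gb: float = 0.0) -> float:
    """VRAM left for the PagedAttention KV pool after weights and overhead."""
    weights_gb = PARAMS_B * bytes_per_weight + quant_extras_gb
    return GPU_VRAM_GB - weights_gb - OVERHEAD_GB

print(kv_headroom_gb(2.0))                       # FP16:  8.0 GB -> ~5 sessions
print(kv_headroom_gb(0.5, quant_extras_gb=1.0))  # INT4: 17.5 GB -> ~18 sessions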
Why AWQ beats naive INT4:
flowchart TD
NAIVE["Naive INT4<br/>Round every weight equally<br/>to nearest integer"]
NAIVEFAIL["Important weights destroyed<br/>Quality crash on manga recommendations"]
AWQ["AWQ: Activation-Aware Weight Quantization"]
CALIB["1. Run on manga calibration data<br/>(real recommendation samples)"]
IDENTIFY["2. Identify salient weights<br/>(most important for output quality)"]
PROTECT["3. Protect salient weights<br/>with higher precision"]
QUANT["4. Aggressively quantize<br/>non-salient weights to INT4"]
RESULT["Result:<br/>Nearly same quality as FP16<br/>~2% perplexity increase<br/>68% VRAM reduction"]
NAIVE --> NAIVEFAIL
AWQ --> CALIB --> IDENTIFY --> PROTECT --> QUANT --> RESULT
style NAIVEFAIL fill:#7f1d1d,color:#fca5a5
style RESULT fill:#166534,color:#bbf7d0
Concept 6 — Streaming (SSE / Server-Sent Events)
Why autoregressive generation streams naturally:
sequenceDiagram
participant GPU as GPU Inference
participant vLLM as vLLM Server
participant Orch as Orchestrator
participant Browser as Browser
Note over GPU: Prompt: "recommend a horror manga"
GPU->>vLLM: token 1 = "Berserk"
vLLM->>Orch: data: {"content": "Berserk"}
Orch->>Browser: {type: "token", content: "Berserk"}
Note over Browser: Renders "Berserk" immediately ← ~200ms
GPU->>vLLM: token 2 = " is"
vLLM->>Orch: data: {"content": " is"}
Orch->>Browser: {type: "token", content: " is"}
GPU->>vLLM: token 3 = " a"
vLLM->>Orch: data: {"content": " a"}
Orch->>Browser: {type: "token", content: " a"}
Note over GPU: ... N tokens later ...
GPU->>vLLM: [EOS] token
vLLM->>Orch: data: [DONE]
Orch->>Browser: {type: "done"}
Note over Browser: Full response: "Berserk is a dark..." shown at ~5s<br/>WITHOUT streaming, user waits this long for first word
The critical rule — no I/O inside the forwarding loop:
flowchart TD
subgraph WRONG["WRONG — I/O inside stream loop"]
W1["async for token in vllm_stream()"]
W2["enrichment = await fetch_manga_details(token)<br/>← BLOCKS HERE during slow enrichment lookup"]
W3["await websocket.send(token + enrichment)<br/>← token delivery delayed"]
W1 --> W2 --> W3 --> W1
end
subgraph CORRECT["CORRECT — all I/O before stream starts"]
C1["manga_details = await fetch_manga_details(session)<br/>← runs once, BEFORE stream"]
C2["async for token in vllm_stream()"]
C3["await websocket.send(token)<br/>← pure forwarding, nothing blocks"]
C1 --> C2 --> C3 --> C2
end
WRONG -->|"fix"| CORRECT
style WRONG fill:#7f1d1d,color:#fca5a5
style CORRECT fill:#166534,color:#bbf7d0
Decision Table
| Dimension | Details |
|---|---|
| vLLM vs TGI | TGI is simpler to set up, but vLLM won our throughput benchmarks on concurrent manga sessions |
| vLLM vs TensorRT-LLM | TensorRT-LLM is marginally faster at peak, but compilation complexity and NVIDIA lock-in outweigh the gain for fast-iteration fine-tuning |
| PagedAttention benefit | Reduces KV-cache VRAM waste — 5 concurrent sessions → 18+ on same GPU |
| Continuous batching tradeoff | Slightly more complex scheduling vs fixed batching — payoff is better throughput + lower latency under load |
| Prefix caching benefit | 300/320 tokens shared per request — 93%+ cache hit rate, GPU skips repeated system prompt work |
| Multi-LoRA tradeoff | Adds adapter pointer management vs eliminating 3 entire GPU fleets |
| AWQ quantization tradeoff | ~2% perplexity delta vs 68% VRAM reduction and 3× concurrent session density |
| Scale mechanism | Continuous batching + Multi-LoRA: one GPU fleet serves all adapter variants under load |
| Key metric | ~50% GPU cost reduction, ~68% latency improvement vs raw Transformers baseline |
Tradeoffs Discussed
| Option considered | Why rejected or scoped |
|---|---|
| Raw Transformers serving | Baseline; highest operational simplicity, lowest throughput — unsuitable at production scale |
| Hugging Face TGI | Good option, but its throughput on our workload lagged behind vLLM's |
| TensorRT-LLM | Fastest peak throughput, but compilation complexity + GPU lock-in rejected for fast-iteration environment |
| Separate endpoints per LoRA adapter | Simple isolation, but 3–5x GPU cost increase; Multi-LoRA consolidated this |
| No quantization | Maximum model quality, but VRAM footprint too large for cost target at scale |
Scale Planned
| Metric | Target |
|---|---|
| Concurrent chat sessions | Handled via continuous batching — bursts absorbed without a fixed batch wait |
| Domain adapters | 1 base model container + Multi-LoRA for n adapters — no GPU fleet proliferation |
| VRAM per instance | Reduced via AWQ INT4 — 8GB KV headroom (FP16) → 17.5GB KV headroom (INT4) |
| Token throughput | ~2x baseline via PagedAttention + continuous batching vs raw Transformers |
| Concurrent sessions per GPU | ~3x increase from FP16 → AWQ INT4 |
LLM Concepts Summary Table
| Concept | Problem solved | Mechanism | MangaAssist gain |
|---|---|---|---|
| PagedAttention | VRAM fragmentation from contiguous KV allocation | OS-style paged virtual memory for KV-cache | 3× more concurrent sessions per GPU |
| Continuous batching | GPU idle during fixed batch window | Insert new request into next decode iteration immediately | Lower latency + higher throughput during chapter-drop spikes |
| Prefix caching | Same system prompt recomputed on every request | Hash prefix blocks, store KV, reuse on cache hit | 93%+ of tokens skipped for shared 300-token prefix |
| Multi-LoRA | Separate GPU fleet per domain adapter | Load all adapter deltas in VRAM, swap pointer per request | 4 genre variants from 1 GPU fleet — 4× cost reduction |
| AWQ quantization | High VRAM cost of FP16 weights | Activation-aware per-channel quantization to INT4 | 14GB → 4.5GB base model; 3× more VRAM for KV-cache |
| Streaming (SSE) | User waits for full response before seeing anything | Send each token as generated, not wait for EOS | First token in ~200ms vs ~5s full response wait |
Intuition From This Scenario
Serving engine choice is a container operations decision, not just an ML performance decision. The six features — PagedAttention, continuous batching, prefix caching, streaming, Multi-LoRA, and AWQ — each solve a real cost or quality problem that showed up at MangaAssist scale. None of them are magic. PagedAttention is OS paging applied to GPU memory. Continuous batching is a scheduler change. Prefix caching is a hash table. Multi-LoRA is a pointer swap. AWQ is a smarter rounding scheme. The compound effect is that one GPU instance serves 3× more users while the team operates one container instead of four. That is the senior engineer answer: understand the mechanism, justify it against the actual workload, and measure it.