Scenario 3 — vLLM Serving Containers For Throughput And Cost
User Story
As an infrastructure engineer on MangaAssist, I wanted our custom fine-tuned manga recommendation model to handle more concurrent reader sessions at lower GPU cost, without creating a brittle serving platform that only one ML engineer could operate.
Context
MangaAssist fine-tuned a base LLM on manga metadata, user preference patterns, and genre taxonomy. That fine-tuned model served the core recommendation and chat path. As concurrent user sessions grew — especially around chapter drop events — we needed a serving runtime that could handle batched inference efficiently without overprovisioning GPU capacity.
What We Actually Did
- Chose vLLM as the serving engine for the fine-tuned model, packaged as a custom container on SageMaker.
- Used PagedAttention to reduce KV-cache VRAM waste.
- Used continuous batching to improve throughput during traffic spikes.
- Used automatic prefix caching to avoid repeated work on common prompt prefixes shared across manga chat turns.
- Used streaming for better first-token experience in the chat interface.
- Used Multi-LoRA to serve domain-specific adapters (shonen, romance, isekai, horror manga tones) from one base model container.
- Used AWQ quantization to shrink model memory footprint.
High-Level Design (HLD)
flowchart TD
Browser["Browser / Mobile App<br/>(WebSocket streaming tokens)"]
APIGW["API Gateway<br/>+ WebSocket API"]
Orch["ECS Fargate<br/>MangaAssist Orchestrator<br/>───────────────────────<br/>1. Load user session<br/>2. RAG prefetch (OpenSearch)<br/>3. Token budget assembly<br/>4. Genre → LoRA routing<br/>5. Call vLLM endpoint<br/>6. Stream tokens to browser"]
vLLM["SageMaker Endpoint<br/>vLLM Container (ECR Image)<br/>───────────────────────<br/>Base: manga-llm-awq-int4<br/>LoRA: shonen / romance / isekai / horror<br/>───────────────────────<br/>PagedAttention ✓<br/>Continuous Batching ✓<br/>Prefix Caching ✓<br/>Streaming SSE ✓<br/>───────────────────────<br/>min_instances=2<br/>GPU: ml.g5.xlarge 24GB VRAM"]
Redis["Redis L1<br/>User Preferences"]
Dynamo["DynamoDB<br/>Session + History"]
OS["OpenSearch<br/>RAG Manga Context"]
S3["S3<br/>LoRA Adapter Files"]
Browser -->|WebSocket| APIGW
APIGW -->|HTTP upgrade| Orch
Orch -->|"HTTP POST /v1/chat/completions<br/>(OpenAI-compatible)"| vLLM
vLLM -->|SSE token chunks| Orch
Orch -->|WebSocket tokens| Browser
Orch --> Redis
Orch --> Dynamo
Orch --> OS
vLLM --> S3
style vLLM fill:#1a1a2e,color:#eee,stroke:#7c3aed
style Orch fill:#0f3460,color:#eee,stroke:#2563eb
style Browser fill:#16213e,color:#eee,stroke:#64748b
Low-Level Design (LLD)
LLD 1 — Container Internals
graph TD
subgraph Container["vLLM Container — Runtime Stage"]
subgraph Model["/model-artifacts/manga-llm-awq-int4/"]
cfg["config.json"]
tok["tokenizer.json"]
w1["model-00001.safetensors ← AWQ INT4"]
w2["model-00002.safetensors"]
w3["model-00003.safetensors"]
w4["model-00004.safetensors"]
end
subgraph LoRA["/lora-adapters/ (~50MB each)"]
ls["lora-shonen-v3/adapter_model.safetensors"]
lr["lora-romance-v2/adapter_model.safetensors"]
li["lora-isekai-v4/adapter_model.safetensors"]
lh["lora-horror-v1/adapter_model.safetensors"]
end
subgraph App["/app/"]
wu["warmup.py ← runs at startup, gates /ping"]
hc["health_check.py"]
end
subgraph Procs["Process Tree (PID 1: vllm api_server)"]
P1["Tokenizer Worker"]
P2["Scheduler<br/>(Continuous Batching Engine)"]
P3["KV-Cache Manager<br/>(PagedAttention)"]
P4["Engine Worker<br/>(GPU Inference Thread)"]
end
Port["Port 8080<br/>POST /v1/chat/completions<br/>GET /ping (health)"]
end
P2 --> P3
P3 --> P4
P1 --> P2
LLD 2 — Request Lifecycle Inside vLLM
flowchart TD
IN["Incoming Request<br/>'recommend a horror manga'<br/>+ lora_request: lora-horror-v1<br/>+ assembled context (token-budgeted)"]
TW["Tokenizer Worker<br/>Convert text → token IDs<br/>Count tokens, validate max_model_len"]
PC{"Prefix Cache<br/>Check<br/>Hash first N token blocks"}
HIT["CACHE HIT ✓<br/>KV-cache pages already computed<br/>Skip ~300 tokens of GPU work<br/>Jump to new tokens only"]
MISS["CACHE MISS ✗<br/>Compute KV for full prefix<br/>Store result in cache<br/>Next request will HIT"]
CB["Continuous Batching Scheduler<br/>─────────────────────────────<br/>Active slots:<br/> slot 0 → User A tok 14/200<br/> slot 1 → User B tok 3/200<br/> slot 2 → User C tok 89/200<br/>─────────────────────────────<br/>New request → insert NOW<br/> slot 3 → NEW REQUEST (no wait)"]
PA["PagedAttention KV Manager<br/>─────────────────────────────<br/>VRAM split into 16-token pages<br/>Page table for this request:<br/> logical 0 → physical page 47<br/> logical 1 → physical page 12<br/> logical 2 → physical page 103<br/>New token → allocate 1 page<br/>Request done → pages freed"]
GPU["GPU Inference Engine<br/>─────────────────────────────<br/>Base weights: manga-llm-awq-int4<br/>+ lora-horror-v1 delta applied<br/>Autoregressive decode loop<br/>tok_n generated → SSE chunk sent"]
OUT["SSE Chunks → Orchestrator → WebSocket → Browser<br/>data: Berserk<br/>data: is a dark...<br/>data: [DONE]"]
IN --> TW
TW --> PC
PC -->|HIT| HIT
PC -->|MISS| MISS
HIT --> CB
MISS --> CB
CB --> PA
PA --> GPU
GPU --> OUT
style HIT fill:#166534,color:#fff
style MISS fill:#7f1d1d,color:#fff
style GPU fill:#1e1b4b,color:#fff
LLD 3 — Multi-Stage Docker Build Flow
flowchart LR
subgraph Stage1["STAGE 1: builder<br/>nvidia/cuda:12.1-devel<br/>(has compiler, pip, git)"]
B1["pip install vllm autoawq safetensors"]
B2["Download manga-llm-base weights<br/>from HuggingFace Hub"]
B3["python quantize_model.py<br/>--bits 4<br/>→ manga-llm-awq-int4<br/>(AWQ INT4 at BUILD TIME)"]
B4["bash download_loras.sh<br/>→ pull 4 LoRA adapters from S3"]
B1 --> B2 --> B3 --> B4
end
subgraph Stage2["STAGE 2: runtime<br/>nvidia/cuda:12.1-runtime<br/>(NO compiler, NO build tools)"]
R1["pip install vllm safetensors httpx<br/>(runtime only — no autoawq)"]
R2["COPY --from=builder<br/>/model-artifacts<br/>/lora-adapters"]
R3["COPY app/ /app<br/>(warmup.py, health_check.py)"]
R4["USER vllm-runner<br/>(non-root)"]
R5["HEALTHCHECK<br/>--start-period=120s<br/>python health_check.py"]
R7["ENTRYPOINT:<br/>python -m vllm.entrypoints.openai.api_server<br/>--quantization awq<br/>--enable-prefix-caching<br/>--enable-lora<br/>--max-loras 4<br/>--gpu-memory-utilization 0.90<br/>--max-model-len 4096"]
R1 --> R2 --> R3 --> R4 --> R5 --> R6
end
ECR["ECR<br/>Push runtime image only<br/>~3-4x smaller than builder"]
Stage1 -->|COPY artifacts only| Stage2
Stage2 -->|docker push| ECR
style Stage1 fill:#292524,color:#d6d3d1
style Stage2 fill:#0c4a6e,color:#e0f2fe
style ECR fill:#14532d,color:#dcfce7
LLD 4 — Container Startup & Readiness Gating
sequenceDiagram
participant Docker as Docker Runtime
participant vLLM as vLLM Process
participant Warmup as warmup.py
participant HC as health_check.py
participant LB as Load Balancer
Docker->>vLLM: Start PID 1 (api_server)
Note over vLLM: Loading model weights...<br/>~45s container start
loop every 2s (up to 120s)
Warmup->>vLLM: GET /health
vLLM-->>Warmup: 503 (not ready yet)
end
vLLM-->>Warmup: 200 OK (process accepting)
Note over Warmup: Run 3 warmup prompts<br/>(real MangaAssist traffic shapes)
Warmup->>vLLM: POST /v1/chat/completions (shonen prompt)
vLLM-->>Warmup: tokens (CUDA kernels compiled for this shape)
Warmup->>vLLM: POST /v1/chat/completions (romance prompt)
vLLM-->>Warmup: tokens
Warmup->>vLLM: POST /v1/chat/completions (horror prompt)
vLLM-->>Warmup: tokens
Warmup->>Docker: write /tmp/vllm_ready
Docker->>HC: HEALTHCHECK every 10s
HC-->>Docker: 0 (healthy) — file exists
Docker->>LB: Container is HEALTHY
LB->>vLLM: Route live MangaAssist traffic
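In code, the gate is small. Below is a minimal sketch of warmup.py under the design above; the /health route, port 8080, and /tmp/vllm_ready path come from the diagrams, while the prompt strings, the BASE constant, and the function names are illustrative, not the production script:

# warmup.py — readiness gate (sketch; prompts and helper names illustrative)
import sys
import time
from pathlib import Path

import httpx

BASE = "http://localhost:8080"
READY_FILE = Path("/tmp/vllm_ready")
WARMUP_PROMPTS = [
    "recommend a shonen manga",
    "recommend a romance manga",
    "recommend a horror manga",
]

def wait_for_server(timeout_s: int = 120) -> None:
    # Poll /health every 2s until PID 1 (api_server) accepts connections.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if httpx.get(f"{BASE}/health", timeout=2.0).status_code == 200:
                return
        except httpx.HTTPError:
            pass
        time.sleep(2)
    sys.exit("vLLM never became healthy; container stays unready")

def run_warmup() -> None:
    # One real-shaped request per genre so CUDA kernels compile before live traffic.
    for prompt in WARMUP_PROMPTS:
        httpx.post(
            f"{BASE}/v1/chat/completions",
            json={
                "model": "manga-assistant",
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 16,
            },
            timeout=120.0,
        ).raise_for_status()

if __name__ == "__main__":
    wait_for_server()
    run_warmup()
    READY_FILE.touch()  # health_check.py exits 0 only once this file exists

With this in place, health_check.py reduces to a file-existence check on /tmp/vllm_ready, which is why the Docker HEALTHCHECK gates the load balancer on warmup completion rather than on process start.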
LLD 5 — Orchestrator: Genre Routing + FM Call + Streaming
# manga_fm_client.py
from __future__ import annotations

import asyncio
import json
import os
from typing import AsyncGenerator

import httpx

# UserSession, ContextWindow, infer_genre_from_turn, load_session,
# opensearch_retrieve, build_context_window, save_session, emit_metrics,
# and the WebSocket type are defined elsewhere in the orchestrator.

VLLM_ENDPOINT = os.environ["VLLM_ENDPOINT"]  # SageMaker endpoint URL, injected via task config

GENRE_TO_LORA = {
    "shonen": "lora-shonen-v3",
    "romance": "lora-romance-v2",
    "isekai": "lora-isekai-v4",
    "horror": "lora-horror-v1",
}

LORA_INT_IDS = {
    "lora-shonen-v3": 1,
    "lora-romance-v2": 2,
    "lora-isekai-v4": 3,
    "lora-horror-v1": 4,
}

# ── Genre Router ───────────────────────────────────────────────────
def select_lora(session: UserSession) -> str | None:
    genre = session.preferences.top_genre
    if genre in GENRE_TO_LORA:
        return GENRE_TO_LORA[genre]
    inferred = infer_genre_from_turn(session.last_message)
    return GENRE_TO_LORA.get(inferred)  # None = use base model, no adapter

# ── FM Call With Streaming ─────────────────────────────────────────
async def call_vllm_streaming(
    session: UserSession,
    context: ContextWindow,  # already token-budgeted (scenario-04 lesson)
) -> AsyncGenerator[str, None]:
    lora = select_lora(session)
    payload = {
        "model": "manga-assistant",
        "messages": context.to_messages(),
        "max_tokens": 512,
        "stream": True,
        "temperature": 0.7,
    }
    if lora:
        # Mirrors the openai-python SDK's extra_body convention; our serving
        # container forwards lora_request to the engine as a vLLM LoRARequest.
        payload["extra_body"] = {
            "lora_request": {
                "lora_name": lora,
                "lora_int_id": LORA_INT_IDS[lora],
                "lora_local_path": f"/lora-adapters/{lora}",
            }
        }
    async with httpx.AsyncClient(timeout=30.0) as client:
        async with client.stream("POST", VLLM_ENDPOINT, json=payload) as resp:
            async for line in resp.aiter_lines():
                if not line.startswith("data: "):
                    continue
                data = line[len("data: "):]
                if data == "[DONE]":
                    return
                chunk = json.loads(data)
                token = chunk["choices"][0]["delta"].get("content", "")
                if token:
                    yield token  # forward to WebSocket handler

# ── WebSocket Handler ──────────────────────────────────────────────
async def handle_chat(ws: WebSocket, session_id: str):
    session = await load_session(session_id)  # Redis L1 → DynamoDB
    # ALL prefetch BEFORE stream starts — never inside the loop
    rag_context = await opensearch_retrieve(session.last_message)
    context_win = build_context_window(session, rag_context)  # token budget
    async for token in call_vllm_streaming(session, context_win):
        await ws.send_json({"type": "token", "content": token})
    await ws.send_json({"type": "done"})
    # Async post-stream work — never blocks token delivery
    asyncio.create_task(save_session(session, context_win))
    asyncio.create_task(emit_metrics(session_id))
End-to-End Request Flow
sequenceDiagram
participant U as User Browser
participant WS as WebSocket Handler (ECS Fargate)
participant Redis as Redis L1
participant DDB as DynamoDB
participant OS as OpenSearch
participant vLLM as vLLM Container
U->>WS: "recommend a horror manga"
Note over WS: ALL prefetch BEFORE stream starts
WS->>Redis: load_session(id)
Redis-->>WS: miss
WS->>DDB: load_session(id)
DDB-->>WS: session (genre=horror, preferences)
WS->>OS: retrieve(query="horror manga", top_k=3)
OS-->>WS: [Berserk context, Uzumaki context, MPD Psycho context]
Note over WS: build_context_window()<br/>system_prompt: 200 tok<br/>user_pref_header: 100 tok<br/>RAG context: 300 tok<br/>recent turns: 200 tok<br/>select_lora → lora-horror-v1
WS->>vLLM: POST /v1/chat/completions<br/>stream=true, lora=lora-horror-v1
Note over vLLM: prefix cache HIT (300 tokens skipped)<br/>join active batch (no wait)<br/>allocate KV pages (PagedAttention)<br/>apply lora-horror-v1 delta
loop autoregressive decode
vLLM-->>WS: data: {"content": "Berserk"}
WS-->>U: {type: token, content: "Berserk"}
vLLM-->>WS: data: {"content": " is"}
WS-->>U: {type: token, content: " is"}
vLLM-->>WS: data: {"content": " a dark..."}
WS-->>U: {type: token, content: " a dark..."}
end
vLLM-->>WS: data: [DONE]
WS-->>U: {type: done}
Note over WS: post-stream (non-blocking async tasks)
WS--)DDB: save_session() [fire-and-forget]
WS--)Redis: update_cache() [fire-and-forget]
LLM Concepts — Deep Dives
Concept 1 — PagedAttention
The problem: standard inference reserves one contiguous VRAM block per request, sized up front for max_len tokens, so with short chat turns 90%+ of each reservation sits unused.
graph TD
subgraph Contiguous["Standard Contiguous Allocation (WASTEFUL)"]
direction LR
CA["Request A<br/>2048 slots reserved<br/>████░░░░░░░░░░░░░░░░░░░░<br/>50/2048 used — 97% wasted"]
CB["Request B<br/>2048 slots reserved<br/>████░░░░░░░░░░░░░░░░░░░░<br/>50/2048 used — 97% wasted"]
CC["Request C<br/>2048 slots reserved<br/>████░░░░░░░░░░░░░░░░░░░░<br/>50/2048 used — 97% wasted"]
end
subgraph Paged["PagedAttention (EFFICIENT)"]
direction LR
PG1["Physical Page 47<br/>tokens 0–15 of Req A"]
PG2["Physical Page 12<br/>tokens 16–31 of Req A"]
PG3["Physical Page 103<br/>tokens 32–47 of Req A"]
PG4["Physical Page 7<br/>tokens 48–49 of Req A (partial)"]
FREE["Remaining VRAM<br/>free for other requests"]
PG1 --> PG2 --> PG3 --> PG4
end
Contiguous -->|"PagedAttention replaces this"| Paged
style CA fill:#7f1d1d,color:#fca5a5
style CB fill:#7f1d1d,color:#fca5a5
style CC fill:#7f1d1d,color:#fca5a5
style FREE fill:#166534,color:#bbf7d0
How pages are managed at runtime:
flowchart LR
T["New token generated"]
A{"Last page<br/>full?"}
AP["Allocate next free page<br/>from VRAM pool"]
W["Write KV to page"]
D{"Request<br/>done?"}
F["Free all pages<br/>back to pool<br/>(immediately reusable)"]
C["Continue decode"]
T --> A
A -->|Yes| AP --> W --> D
A -->|No| W
W --> D
D -->|Yes| F
D -->|No| C --> T
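The same bookkeeping as a toy Python allocator. This illustrates the page-table mechanics only, not vLLM's implementation; KVPagePool and its method names are invented for the sketch:

PAGE_SIZE = 16  # tokens per KV page, as in the diagram

class KVPagePool:
    """Toy page table: logical token blocks map to arbitrary physical pages."""

    def __init__(self, num_pages: int):
        self.free_pages = list(range(num_pages))     # physical page ids
        self.page_tables: dict[str, list[int]] = {}  # request -> pages in logical order

    def append_token(self, request_id: str, token_index: int) -> int:
        """Record one new token's KV entry; returns the physical page used."""
        table = self.page_tables.setdefault(request_id, [])
        if token_index % PAGE_SIZE == 0:
            if not self.free_pages:
                raise MemoryError("KV pool exhausted; scheduler must preempt")
            table.append(self.free_pages.pop())  # allocate only when a page fills
        return table[-1]

    def release(self, request_id: str) -> None:
        """Request done: every page returns to the pool, immediately reusable."""
        self.free_pages.extend(self.page_tables.pop(request_id, []))

A request holding 50 tokens occupies 4 pages (64 token slots) instead of a 2048-slot reservation; the unused slots stay in the pool for other requests.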
MangaAssist result: Same ml.g5.xlarge (24GB VRAM) goes from ~4–6 concurrent sessions to ~15–20.
Concept 2 — Continuous Batching
gantt
title Static Batching vs Continuous Batching
dateFormat x
axisFormat %L ms
section Static Batching (wasteful)
User A arrives :milestone, 0, 0
User B arrives :milestone, 10, 0
User C arrives :milestone, 30, 0
Wait for batch window :crit, wait1, 0, 50
BATCH 1 executes :active, b1, 50, 100
User D arrives :milestone, 50, 0
Wait for batch window :crit, wait2, 50, 100
BATCH 2 executes :active, b2, 100, 150
section Continuous Batching (vLLM)
User A tok-by-tok :active, ca, 0, 80
User B joins immediately :active, cb, 10, 90
User C joins immediately :active, cc, 30, 110
User D joins next iter :active, cd, 50, 120
GPU never idles :active, gpu, 0, 120
Slot-level view:
graph LR
subgraph Iter1["Decode Iteration 1"]
I1A["A: tok 14"]
I1B["B: tok 3"]
I1C["C: tok 89"]
end
subgraph Iter2["Decode Iteration 2 — D joins NOW"]
I2A["A: tok 15"]
I2B["B: tok 4"]
I2C["C: tok 90"]
I2D["D: tok 1 ← NEW"]
end
subgraph Iter4["Decode Iteration 4 — C done, slot freed"]
I4A["A: tok 17"]
I4B["B: tok 6"]
I4C["C: DONE ✓"]
I4D["D: tok 3"]
end
subgraph Iter5["Decode Iteration 5 — E fills freed slot"]
I5A["A: tok 18"]
I5B["B: tok 7"]
I5E["E: tok 1 ← NEW"]
I5D["D: tok 4"]
end
Iter1 --> Iter2 --> Iter4 --> Iter5
style I2D fill:#166534,color:#fff
style I4C fill:#7f1d1d,color:#fff
style I5E fill:#166534,color:#fff
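A toy scheduler loop showing the same slot behavior. vLLM's real scheduler also handles prefill vs decode phases, preemption, and KV budgeting, so treat Req and decode_loop as illustrative names only:

from collections import deque
from dataclasses import dataclass, field

@dataclass
class Req:
    id: str
    max_new_tokens: int
    tokens: list[str] = field(default_factory=list)

def decode_loop(waiting: deque[Req], max_slots: int = 4) -> None:
    active: dict[str, Req] = {}
    while waiting or active:
        # Admission happens at every iteration boundary: no batch window,
        # so a new request waits at most one decode step, not a full batch.
        while waiting and len(active) < max_slots:
            req = waiting.popleft()
            active[req.id] = req
        # One fused forward pass emits one token for every active request.
        for req in list(active.values()):
            req.tokens.append(f"tok_{len(req.tokens) + 1}")
            if len(req.tokens) >= req.max_new_tokens:
                del active[req.id]  # slot frees mid-run; the next arrival fills it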
Concept 3 — Automatic Prefix Caching
The repeated prefix problem for MangaAssist:
graph LR
subgraph Turn1["Turn 1 (320 tokens)"]
T1A["system_prompt<br/>200 tok"]
T1B["user_pref_header<br/>100 tok"]
T1C["message<br/>20 tok"]
end
subgraph Turn2["Turn 2 (372 tokens)"]
T2A["system_prompt<br/>200 tok"]
T2B["user_pref_header<br/>100 tok"]
T2C["history_1<br/>50 tok"]
T2D["message<br/>22 tok"]
end
subgraph Turn3["Turn 3 (418 tokens)"]
T3A["system_prompt<br/>200 tok"]
T3B["user_pref_header<br/>100 tok"]
T3C["history_1+2<br/>100 tok"]
T3D["message<br/>18 tok"]
end
REPEAT["Same 300 tokens<br/>computed from scratch<br/>on EVERY turn<br/>without caching"]
T1A -.->|repeated| T2A -.->|repeated| T3A
T1B -.->|repeated| T2B -.->|repeated| T3B
style REPEAT fill:#7f1d1d,color:#fca5a5
How prefix caching resolves this:
flowchart TD
REQ["Request arrives<br/>tokens 0..319"]
H0{"Hash block 0<br/>tokens 0–15<br/>= 0xAB12..."}
H1{"Hash block 1<br/>tokens 16–31"}
H2{"Hash block 2<br/>tokens 32–47"}
HNEW{"Hash block N<br/>first new content"}
HIT0["CACHE HIT ✓<br/>Reuse stored KV<br/>No GPU compute"]
HIT1["CACHE HIT ✓"]
HIT2["CACHE HIT ✓"]
COMPUTE["CACHE MISS<br/>GPU computes KV<br/>for new tokens only<br/>Store result for next request"]
REQ --> H0
H0 -->|hit| HIT0 --> H1
H1 -->|hit| HIT1 --> H2
H2 -->|hit| HIT2 --> HNEW
HNEW -->|miss| COMPUTE
SAVED["300 / 320 tokens skipped<br/>= 93.75% prefix hit rate<br/>GPU works only on 20 new tokens"]
HIT2 --> SAVED
style HIT0 fill:#166534,color:#fff
style HIT1 fill:#166534,color:#fff
style HIT2 fill:#166534,color:#fff
style COMPUTE fill:#92400e,color:#fff
style SAVED fill:#1e3a5f,color:#fff
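The hashing scheme as a sketch. vLLM keys real KV pages in VRAM; here the cache is a plain dict and count_cached_blocks is an invented name:

import hashlib

BLOCK = 16  # tokens per hashed block, matching the KV page size

def count_cached_blocks(token_ids: list[int], kv_cache: dict[str, object]) -> int:
    """How many leading full blocks of this prompt already have KV entries.

    Each key hashes the ENTIRE prefix up to that block boundary, so a hit
    guarantees every earlier token matches too, never just a lucky block.
    """
    hits = 0
    for end in range(BLOCK, len(token_ids) + 1, BLOCK):
        key = hashlib.sha256(repr(token_ids[:end]).encode()).hexdigest()
        if key not in kv_cache:
            break
        hits += 1
    return hits

For the turns above, Turn 2 re-sends the same 300-token prefix, so the first 18 full blocks (288 tokens) hit and the GPU computes KV only from the first divergent block onward, which is the "~300 tokens skipped" behavior in LLD 2.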
Concept 4 — Multi-LoRA (Low-Rank Adaptation)
What a LoRA adapter is (math made visual):
graph LR
subgraph FullFinetune["Full Fine-Tune (expensive)"]
FFW["W_base (14GB)<br/>ALL parameters updated<br/>7B weights × 2 bytes = 14GB<br/>per genre variant"]
end
subgraph LoRAFT["LoRA Fine-Tune (cheap)"]
LBW["W_base FROZEN (14GB)<br/>not modified"]
LA["A matrix (small)<br/>d_model × rank 64"]
LB["B matrix (small)<br/>rank 64 × d_model"]
EQ["W_effective = W_base + A×B<br/>ΔW ≈ 50MB total per adapter"]
LBW --> EQ
LA --> EQ
LB --> EQ
end
style FFW fill:#7f1d1d,color:#fca5a5
style EQ fill:#166534,color:#bbf7d0
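The arithmetic as a runnable sketch, assuming d_model = 4096 and rank = 64 as in the diagram; real adapters apply one such pair per attention and MLP projection in every layer:

import numpy as np

d_model, rank = 4096, 64
rng = np.random.default_rng(0)

W_base = rng.standard_normal((d_model, d_model), dtype=np.float32)  # frozen
A = rng.standard_normal((d_model, rank), dtype=np.float32)  # trained per genre
B = rng.standard_normal((rank, d_model), dtype=np.float32)  # trained per genre

def forward(x: np.ndarray) -> np.ndarray:
    # Equivalent to x @ (W_base + A @ B), but the d_model × d_model delta
    # is never materialized; the update stays two skinny matmuls.
    return x @ W_base + (x @ A) @ B

adapter_params = A.size + B.size  #     524,288 per projection
full_params = W_base.size         #  16,777,216 per projection, ~32× larger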
Multi-LoRA: one container, four genre personalities:
flowchart TD
subgraph Without["WITHOUT Multi-LoRA<br/>4 endpoints × GPU cost"]
E1["Endpoint 1<br/>manga-llm-shonen<br/>1 GPU fleet"]
E2["Endpoint 2<br/>manga-llm-romance<br/>1 GPU fleet"]
E3["Endpoint 3<br/>manga-llm-isekai<br/>1 GPU fleet"]
E4["Endpoint 4<br/>manga-llm-horror<br/>1 GPU fleet"]
COST1["4× GPU cost<br/>4× container fleets<br/>4× deployment pipelines"]
end
subgraph With["WITH Multi-LoRA<br/>1 container, 1 GPU fleet"]
BASE["Base model in VRAM<br/>manga-llm-awq-int4 (4.5GB)"]
A1["lora-shonen-v3 (~50MB)"]
A2["lora-romance-v2 (~50MB)"]
A3["lora-isekai-v4 (~50MB)"]
A4["lora-horror-v1 (~50MB)"]
SWAP["Request: lora_request=horror<br/>vLLM: W_eff = W_base + A_horror×B_horror<br/>Pointer swap — sub-millisecond<br/>NO disk I/O, NO data copy"]
COST2["1× GPU cost<br/>1× container fleet<br/>1× deployment pipeline"]
end
Without -->|"Multi-LoRA replaces this"| With
style COST1 fill:#7f1d1d,color:#fca5a5
style COST2 fill:#166534,color:#bbf7d0
style SWAP fill:#1e1b4b,color:#c4b5fd
Concept 5 — AWQ INT4 Quantization
Precision vs VRAM tradeoff:
graph LR
subgraph FP16["FP16 Full Precision"]
F1["1 weight = 16 bits = 2 bytes"]
F2["7B params × 2 bytes = 14 GB"]
F3["KV-cache headroom<br/>24GB - 14GB - 2GB overhead<br/>= 8GB → ~5 concurrent sessions"]
end
subgraph AWQ["AWQ INT4 Quantized"]
A1["1 weight = 4 bits = 0.5 bytes"]
A2["7B params × 0.5 bytes = 3.5GB<br/>+ AWQ scales/zeros ≈ 4.5GB total"]
A3["KV-cache headroom<br/>24GB - 4.5GB - 2GB overhead<br/>= 17.5GB → ~18 concurrent sessions"]
end
FP16 -->|"AWQ quantization<br/>~2% quality delta<br/>3× VRAM reduction"| AWQ
style F3 fill:#7f1d1d,color:#fca5a5
style A3 fill:#166534,color:#bbf7d0
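The headroom arithmetic behind the diagram as a quick sketch; the 2GB overhead and roughly 1GB of AWQ scales and zeros are the estimates used above:

GPU_VRAM_GB = 24.0  # ml.g5.xlarge (A10G)
OVERHEAD_GB = 2.0   # CUDA context, activations, fragmentation estimate
PARAMS_B = 7.0      # 7B-parameter base model

def kv_headroom_gb(bytes_per_weight: float, quant_extras_gb: float = 0.0) -> float:
    """VRAM left for the PagedAttention KV pool after weights and overhead."""
    weights_gb = PARAMS_B * bytes_per_weight + quant_extras_gb
    return GPU_VRAM_GB - weights_gb - OVERHEAD_GB

print(kv_headroom_gb(2.0))                       # FP16:  8.0 GB -> ~5 sessions
print(kv_headroom_gb(0.5, quant_extras_gb=1.0))  # INT4: 17.5 GB -> ~18 sessions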
Why AWQ beats naive INT4:
flowchart TD
NAIVE["Naive INT4<br/>Round every weight equally<br/>to nearest integer"]
NAIVEFAIL["Important weights destroyed<br/>Quality crash on manga recommendations"]
AWQ["AWQ: Activation-Aware Weight Quantization"]
CALIB["1. Run on manga calibration data<br/>(real recommendation samples)"]
IDENTIFY["2. Identify salient weights<br/>(most important for output quality)"]
PROTECT["3. Protect salient weights<br/>with higher precision"]
QUANT["4. Aggressively quantize<br/>non-salient weights to INT4"]
RESULT["Result:<br/>Nearly same quality as FP16<br/>~2% perplexity increase<br/>68% VRAM reduction"]
NAIVE --> NAIVEFAIL
AWQ --> CALIB --> IDENTIFY --> PROTECT --> QUANT --> RESULT
style NAIVEFAIL fill:#7f1d1d,color:#fca5a5
style RESULT fill:#166534,color:#bbf7d0
Concept 6 — Streaming (SSE / Server-Sent Events)
Why autoregressive generation streams naturally:
sequenceDiagram
participant GPU as GPU Inference
participant vLLM as vLLM Server
participant Orch as Orchestrator
participant Browser as Browser
Note over GPU: Prompt: "recommend a horror manga"
GPU->>vLLM: token 1 = "Berserk"
vLLM->>Orch: data: {"content": "Berserk"}
Orch->>Browser: {type: "token", content: "Berserk"}
Note over Browser: Renders "Berserk" immediately ← ~200ms
GPU->>vLLM: token 2 = " is"
vLLM->>Orch: data: {"content": " is"}
Orch->>Browser: {type: "token", content: " is"}
GPU->>vLLM: token 3 = " a"
vLLM->>Orch: data: {"content": " a"}
Orch->>Browser: {type: "token", content: " a"}
Note over GPU: ... N tokens later ...
GPU->>vLLM: [EOS] token
vLLM->>Orch: data: [DONE]
Orch->>Browser: {type: "done"}
Note over Browser: Full response: "Berserk is a dark..." shown at ~5s<br/>WITHOUT streaming, user waits this long for first word
The critical rule — no I/O inside the forwarding loop:
flowchart TD
subgraph WRONG["WRONG — I/O inside stream loop"]
W1["async for token in vllm_stream()"]
W2["enrichment = await fetch_manga_details(token)<br/>← BLOCKS HERE during slow enrichment lookup"]
W3["await websocket.send(token + enrichment)<br/>← token delivery delayed"]
W1 --> W2 --> W3 --> W1
end
subgraph CORRECT["CORRECT — all I/O before stream starts"]
C1["manga_details = await fetch_manga_details(session)<br/>← runs once, BEFORE stream"]
C2["async for token in vllm_stream()"]
C3["await websocket.send(token)<br/>← pure forwarding, nothing blocks"]
C1 --> C2 --> C3 --> C2
end
WRONG -->|"fix"| CORRECT
style WRONG fill:#7f1d1d,color:#fca5a5
style CORRECT fill:#166534,color:#bbf7d0
Decision Table
| Dimension | Details |
|---|---|
| vLLM vs TGI | TGI is simpler to set up, but vLLM won our throughput benchmarks on concurrent manga sessions |
| vLLM vs TensorRT-LLM | TensorRT-LLM is marginally faster at peak, but compilation complexity and NVIDIA lock-in outweigh the gain for fast-iteration fine-tuning |
| PagedAttention benefit | Reduces KV-cache VRAM waste — 5 concurrent sessions → 18+ on same GPU |
| Continuous batching tradeoff | Slightly more complex scheduling vs fixed batching — payoff is better throughput + lower latency under load |
| Prefix caching benefit | 300/320 tokens shared per request — 93%+ cache hit rate, GPU skips repeated system prompt work |
| Multi-LoRA tradeoff | Adds adapter pointer management vs eliminating 3 entire GPU fleets |
| AWQ quantization tradeoff | ~2% perplexity delta vs 68% VRAM reduction and 3× concurrent session density |
| Scale mechanism | Continuous batching + Multi-LoRA: one GPU fleet serves all adapter variants under load |
| Key metric | ~50% GPU cost reduction, ~68% latency improvement vs raw Transformers baseline |
Tradeoffs Discussed
| Option considered | Why rejected or scoped |
|---|---|
| Raw Transformers serving | Baseline; highest operational simplicity, lowest throughput — unsuitable at production scale |
| Hugging Face TGI | Good option, but its throughput on our workload lagged behind vLLM's |
| TensorRT-LLM | Fastest peak throughput, but compilation complexity + GPU lock-in rejected for fast-iteration environment |
| Separate endpoints per LoRA adapter | Simple isolation, but 3–5x GPU cost increase; Multi-LoRA consolidated this |
| No quantization | Maximum model quality, but VRAM footprint too large for cost target at scale |
Scale Planned
| Metric | Target |
|---|---|
| Concurrent chat sessions | Handled via continuous batching — bursts absorbed without a fixed batch wait |
| Domain adapters | 1 base model container + Multi-LoRA for n adapters — no GPU fleet proliferation |
| VRAM per instance | Reduced via AWQ INT4 — 8GB KV headroom (FP16) → 17.5GB KV headroom (INT4) |
| Token throughput | ~2x baseline via PagedAttention + continuous batching vs raw Transformers |
| Concurrent sessions per GPU | ~3x increase from FP16 → AWQ INT4 |
LLM Concepts Summary Table
| Concept | Problem solved | Mechanism | MangaAssist gain |
|---|---|---|---|
| PagedAttention | VRAM fragmentation from contiguous KV allocation | OS-style paged virtual memory for KV-cache | 3× more concurrent sessions per GPU |
| Continuous batching | GPU idle during fixed batch window | Insert new request into next decode iteration immediately | Lower latency + higher throughput during chapter-drop spikes |
| Prefix caching | Same system prompt recomputed on every request | Hash prefix blocks, store KV, reuse on cache hit | 93%+ of tokens skipped for shared 300-token prefix |
| Multi-LoRA | Separate GPU fleet per domain adapter | Load all adapter deltas in VRAM, swap pointer per request | 4 genre variants from 1 GPU fleet — 4× cost reduction |
| AWQ quantization | High VRAM cost of FP16 weights | Activation-aware per-channel quantization to INT4 | 14GB → 4.5GB base model; 3× more VRAM for KV-cache |
| Streaming (SSE) | User waits for full response before seeing anything | Send each token as generated, not wait for EOS | First token in ~200ms vs ~5s full response wait |
Intuition From This Scenario
Serving engine choice is a container operations decision, not just an ML performance decision. The six features — PagedAttention, continuous batching, prefix caching, streaming, Multi-LoRA, and AWQ — each solve a real cost or quality problem that showed up at MangaAssist scale. None of them are magic. PagedAttention is OS paging applied to GPU memory. Continuous batching is a scheduler change. Prefix caching is a hash table. Multi-LoRA is a pointer swap. AWQ is a smarter rounding scheme. The compound effect is that one GPU instance serves 3× more users while the team operates one container instead of four. That is the senior engineer answer: understand the mechanism, justify it against the actual workload, and measure it.