
Scenario 3 — vLLM Serving Containers For Throughput And Cost

User Story

As an infrastructure engineer on MangaAssist, I wanted our custom fine-tuned manga recommendation model to handle higher concurrent reader sessions at lower GPU cost, without creating a brittle serving platform that only one ML engineer could operate.


Context

MangaAssist fine-tuned a base LLM on manga metadata, user preference patterns, and genre taxonomy. That fine-tuned model served the core recommendation and chat path. As concurrent user sessions grew — especially around chapter drop events — we needed a serving runtime that could handle batched inference efficiently without overprovisioning GPU capacity.


What We Actually Did

  • Chose vLLM as the serving container for the fine-tuned models on SageMaker.
  • Used PagedAttention to reduce KV-cache VRAM waste.
  • Used continuous batching to improve throughput during traffic spikes.
  • Used automatic prefix caching to avoid repeated work on common prompt prefixes shared across manga chat turns.
  • Used streaming for better first-token experience in the chat interface.
  • Used Multi-LoRA to serve domain-specific adapters (shonen, romance, isekai, horror manga tones) from one base model container.
  • Used AWQ quantization to shrink model memory footprint.

High-Level Design (HLD)

flowchart TD
    Browser["Browser / Mobile App<br/>(WebSocket streaming tokens)"]
    APIGW["API Gateway<br/>+ WebSocket API"]
    Orch["ECS Fargate<br/>MangaAssist Orchestrator<br/>───────────────────────<br/>1. Load user session<br/>2. RAG prefetch (OpenSearch)<br/>3. Token budget assembly<br/>4. Genre → LoRA routing<br/>5. Call vLLM endpoint<br/>6. Stream tokens to browser"]
    vLLM["SageMaker Endpoint<br/>vLLM Container (ECR Image)<br/>───────────────────────<br/>Base: manga-llm-awq-int4<br/>LoRA: shonen / romance / isekai / horror<br/>───────────────────────<br/>PagedAttention ✓<br/>Continuous Batching ✓<br/>Prefix Caching ✓<br/>Streaming SSE ✓<br/>───────────────────────<br/>min_instances=2<br/>GPU: ml.g5.xlarge 24GB VRAM"]

    Redis["Redis L1<br/>User Preferences"]
    Dynamo["DynamoDB<br/>Session + History"]
    OS["OpenSearch<br/>RAG Manga Context"]
    S3["S3<br/>LoRA Adapter Files"]

    Browser -->|WebSocket| APIGW
    APIGW -->|HTTP upgrade| Orch
    Orch -->|"HTTP POST /v1/chat/completions<br/>(OpenAI-compatible)"| vLLM
    vLLM -->|SSE token chunks| Orch
    Orch -->|WebSocket tokens| Browser

    Orch --> Redis
    Orch --> Dynamo
    Orch --> OS
    vLLM --> S3

    style vLLM fill:#1a1a2e,color:#eee,stroke:#7c3aed
    style Orch fill:#0f3460,color:#eee,stroke:#2563eb
    style Browser fill:#16213e,color:#eee,stroke:#64748b

Low-Level Design (LLD)

LLD 1 — Container Internals

graph TD
    subgraph Container["vLLM Container — Runtime Stage"]
        subgraph Model["/model-artifacts/manga-llm-awq-int4/"]
            cfg["config.json"]
            tok["tokenizer.json"]
            w1["model-00001.safetensors  ← AWQ INT4"]
            w2["model-00002.safetensors"]
            w3["model-00003.safetensors"]
            w4["model-00004.safetensors"]
        end

        subgraph LoRA["/lora-adapters/  (~50MB each)"]
            ls["lora-shonen-v3/adapter_model.safetensors"]
            lr["lora-romance-v2/adapter_model.safetensors"]
            li["lora-isekai-v4/adapter_model.safetensors"]
            lh["lora-horror-v1/adapter_model.safetensors"]
        end

        subgraph App["/app/"]
            wu["warmup.py  ← runs at startup, gates /ping"]
            hc["health_check.py"]
        end

        subgraph Procs["Process Tree (PID 1: vllm api_server)"]
            P1["Tokenizer Worker"]
            P2["Scheduler<br/>(Continuous Batching Engine)"]
            P3["KV-Cache Manager<br/>(PagedAttention)"]
            P4["Engine Worker<br/>(GPU Inference Thread)"]
        end

        Port["Port 8080<br/>POST /v1/chat/completions<br/>GET  /ping  (health)"]
    end

    P2 --> P3
    P3 --> P4
    P1 --> P2

LLD 2 — Request Lifecycle Inside vLLM

flowchart TD
    IN["Incoming Request<br/>'recommend a horror manga'<br/>+ lora_request: lora-horror-v1<br/>+ assembled context (token-budgeted)"]

    TW["Tokenizer Worker<br/>Convert text → token IDs<br/>Count tokens, validate max_model_len"]

    PC{"Prefix Cache<br/>Check<br/>Hash first N token blocks"}

    HIT["CACHE HIT ✓<br/>KV-cache pages already computed<br/>Skip ~300 tokens of GPU work<br/>Jump to new tokens only"]

    MISS["CACHE MISS ✗<br/>Compute KV for full prefix<br/>Store result in cache<br/>Next request will HIT"]

    CB["Continuous Batching Scheduler<br/>─────────────────────────────<br/>Active slots:<br/>  slot 0 → User A  tok 14/200<br/>  slot 1 → User B  tok  3/200<br/>  slot 2 → User C  tok 89/200<br/>─────────────────────────────<br/>New request → insert NOW<br/>  slot 3 → NEW REQUEST (no wait)"]

    PA["PagedAttention KV Manager<br/>─────────────────────────────<br/>VRAM split into 16-token pages<br/>Page table for this request:<br/>  logical 0 → physical page 47<br/>  logical 1 → physical page 12<br/>  logical 2 → physical page 103<br/>New token → allocate 1 page<br/>Request done → pages freed"]

    GPU["GPU Inference Engine<br/>─────────────────────────────<br/>Base weights: manga-llm-awq-int4<br/>+ lora-horror-v1 delta applied<br/>Autoregressive decode loop<br/>tok_n generated → SSE chunk sent"]

    OUT["SSE Chunks → Orchestrator → WebSocket → Browser<br/>data: Berserk<br/>data: is a dark...<br/>data: [DONE]"]

    IN --> TW
    TW --> PC
    PC -->|HIT| HIT
    PC -->|MISS| MISS
    HIT --> CB
    MISS --> CB
    CB --> PA
    PA --> GPU
    GPU --> OUT

    style HIT fill:#166534,color:#fff
    style MISS fill:#7f1d1d,color:#fff
    style GPU fill:#1e1b4b,color:#fff

LLD 3 — Multi-Stage Docker Build Flow

flowchart LR
    subgraph Stage1["STAGE 1: builder<br/>nvidia/cuda:12.1-devel<br/>(has compiler, pip, git)"]
        B1["pip install vllm autoawq safetensors"]
        B2["Download manga-llm-base weights<br/>from HuggingFace Hub"]
        B3["python quantize_model.py<br/>--bits 4<br/>→ manga-llm-awq-int4<br/>(AWQ INT4 at BUILD TIME)"]
        B4["bash download_loras.sh<br/>→ pull 4 LoRA adapters from S3"]
        B1 --> B2 --> B3 --> B4
    end

    subgraph Stage2["STAGE 2: runtime<br/>nvidia/cuda:12.1-runtime<br/>(NO compiler, NO build tools)"]
        R1["pip install vllm safetensors httpx<br/>(runtime only — no autoawq)"]
        R2["COPY --from=builder<br/>/model-artifacts<br/>/lora-adapters"]
        R3["COPY app/ /app<br/>(warmup.py, health_check.py)"]
        R4["USER vllm-runner<br/>(non-root)"]
        R5["HEALTHCHECK<br/>--start-period=120s<br/>python health_check.py"]
        R6["ENTRYPOINT: vllm api_server<br/>--quantization awq<br/>--enable-prefix-caching<br/>--enable-lora<br/>--max-loras 4<br/>--gpu-memory-utilization 0.90<br/>--max-model-len 4096"]
        R1 --> R2 --> R3 --> R4 --> R5 --> R6
    end

    ECR["ECR<br/>Push runtime image only<br/>~3-4x smaller than builder"]

    Stage1 -->|COPY artifacts only| Stage2
    Stage2 -->|docker push| ECR

    style Stage1 fill:#292524,color:#d6d3d1
    style Stage2 fill:#0c4a6e,color:#e0f2fe
    style ECR fill:#14532d,color:#dcfce7
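
The build-time quantization step in Stage 1 maps naturally onto the AutoAWQ library. Below is a hedged sketch of what quantize_model.py could look like; the paths, config values, and calibration wiring are illustrative assumptions, not the exact build script, so check your AutoAWQ version for the precise options.

# quantize_model.py: build-stage AWQ INT4 quantization (sketch, assumes AutoAWQ).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

SRC = "/model-artifacts/manga-llm-base"       # illustrative paths
DST = "/model-artifacts/manga-llm-awq-int4"

model = AutoAWQForCausalLM.from_pretrained(SRC)
tokenizer = AutoTokenizer.from_pretrained(SRC)

# 4-bit weights with group-wise scales/zeros (see Concept 5 below).
# AutoAWQ also accepts calib_data= to calibrate on real recommendation
# samples instead of its default dataset.
quant_config = {"w_bit": 4, "q_group_size": 128, "zero_point": True, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(DST)
tokenizer.save_pretrained(DST)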

LLD 4 — Container Startup & Readiness Gating

sequenceDiagram
    participant Docker as Docker Runtime
    participant vLLM as vLLM Process
    participant Warmup as warmup.py
    participant HC as health_check.py
    participant LB as Load Balancer

    Docker->>vLLM: Start PID 1 (api_server)
    Note over vLLM: Loading model weights...<br/>~45s container start

    loop every 2s (up to 120s)
        Warmup->>vLLM: GET /health
        vLLM-->>Warmup: 503 (not ready yet)
    end

    vLLM-->>Warmup: 200 OK (process accepting)

    Note over Warmup: Run 3 warmup prompts<br/>(real MangaAssist traffic shapes)
    Warmup->>vLLM: POST /v1/chat/completions (shonen prompt)
    vLLM-->>Warmup: tokens (CUDA kernels compiled for this shape)
    Warmup->>vLLM: POST /v1/chat/completions (romance prompt)
    vLLM-->>Warmup: tokens
    Warmup->>vLLM: POST /v1/chat/completions (horror prompt)
    vLLM-->>Warmup: tokens

    Warmup->>Docker: write /tmp/vllm_ready

    Docker->>HC: HEALTHCHECK every 10s
    HC-->>Docker: 0 (healthy) — file exists

    Docker->>LB: Container is HEALTHY
    LB->>vLLM: Route live MangaAssist traffic
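
To make the gate concrete, here is a minimal sketch of what warmup.py could look like, assuming vLLM's OpenAI-compatible server on localhost:8080 and the /tmp/vllm_ready flag file that health_check.py tests; the prompt texts and token counts are illustrative.

# warmup.py: sketch of the readiness gate in LLD 4 (illustrative).
import pathlib
import time

import httpx

BASE = "http://127.0.0.1:8080"
READY_FLAG = pathlib.Path("/tmp/vllm_ready")

WARMUP_PROMPTS = [                      # stand-ins for real traffic shapes
    "recommend a shonen manga",
    "recommend a romance manga",
    "recommend a horror manga",
]

def wait_for_server(timeout_s: int = 120) -> None:
    # Poll /health every 2s until the api_server process accepts requests.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if httpx.get(f"{BASE}/health", timeout=2.0).status_code == 200:
                return
        except httpx.HTTPError:
            pass
        time.sleep(2)
    raise RuntimeError("vLLM did not become ready in time")

def run_warmup() -> None:
    # Real completions force CUDA kernel compilation for our request shapes.
    for prompt in WARMUP_PROMPTS:
        httpx.post(
            f"{BASE}/v1/chat/completions",
            json={"model": "manga-assistant",
                  "messages": [{"role": "user", "content": prompt}],
                  "max_tokens": 32},
            timeout=60.0,
        ).raise_for_status()

if __name__ == "__main__":
    wait_for_server()
    run_warmup()
    READY_FLAG.touch()   # health_check.py exits 0 only once this file exists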

LLD 5 — Orchestrator: Genre Routing + FM Call + Streaming

# manga_fm_client.py
import asyncio
import json
from typing import AsyncGenerator

import httpx
from fastapi import WebSocket   # assumed: ws.send_json below matches FastAPI/Starlette

# OpenAI-compatible route exposed by the vLLM container (port 8080 in LLD 1);
# point this at the deployed endpoint.
VLLM_ENDPOINT = "http://vllm-endpoint:8080/v1/chat/completions"

GENRE_TO_LORA = {
    "shonen":  "lora-shonen-v3",
    "romance": "lora-romance-v2",
    "isekai":  "lora-isekai-v4",
    "horror":  "lora-horror-v1",
}

LORA_INT_IDS = {
    "lora-shonen-v3":  1,
    "lora-romance-v2": 2,
    "lora-isekai-v4":  3,
    "lora-horror-v1":  4,
}

# ── Genre Router ───────────────────────────────────────────────────
def select_lora(session: UserSession) -> str | None:
    genre = session.preferences.top_genre
    if genre in GENRE_TO_LORA:
        return GENRE_TO_LORA[genre]
    inferred = infer_genre_from_turn(session.last_message)
    return GENRE_TO_LORA.get(inferred)   # None = use base model, no adapter


# ── FM Call With Streaming ─────────────────────────────────────────
async def call_vllm_streaming(
    session:  UserSession,
    context:  ContextWindow,       # already token-budgeted (scenario-04 lesson)
) -> AsyncGenerator[str, None]:

    lora = select_lora(session)

    payload = {
        "model":      "manga-assistant",
        "messages":   context.to_messages(),
        "max_tokens": 512,
        "stream":     True,
        "temperature": 0.7,
    }
    if lora:
        # We POST directly with httpx, so adapter fields go at the top level
        # of the JSON body. "extra_body" is an openai-client convention that
        # the client merges into the request; a raw server never sees a key
        # named "extra_body". (With adapters registered via --lora-modules,
        # setting "model" to the adapter name is the other way to select one.)
        payload["lora_request"] = {
            "lora_name":       lora,
            "lora_int_id":     LORA_INT_IDS[lora],
            "lora_local_path": f"/lora-adapters/{lora}",
        }

    async with httpx.AsyncClient(timeout=30.0) as client:
        async with client.stream("POST", VLLM_ENDPOINT, json=payload) as resp:
            async for line in resp.aiter_lines():
                if line.startswith("data: "):
                    data = line[6:]
                    if data == "[DONE]":
                        return
                    chunk = json.loads(data)
                    token = chunk["choices"][0]["delta"].get("content", "")
                    if token:
                        yield token   # forward to WebSocket handler


# ── WebSocket Handler ──────────────────────────────────────────────
async def handle_chat(ws: WebSocket, session_id: str):

    session = await load_session(session_id)        # Redis L1 → DynamoDB

    # ALL prefetch BEFORE stream starts — never inside the loop
    rag_context  = await opensearch_retrieve(session.last_message)
    context_win  = build_context_window(session, rag_context)  # token budget

    async for token in call_vllm_streaming(session, context_win):
        await ws.send_json({"type": "token", "content": token})

    await ws.send_json({"type": "done"})

    # Async post-stream work — never blocks token delivery
    asyncio.create_task(save_session(session, context_win))
    asyncio.create_task(emit_metrics(session_id))

End-to-End Request Flow

sequenceDiagram
    participant U as User Browser
    participant WS as WebSocket Handler (ECS Fargate)
    participant Redis as Redis L1
    participant DDB as DynamoDB
    participant OS as OpenSearch
    participant vLLM as vLLM Container

    U->>WS: "recommend a horror manga"

    Note over WS: ALL prefetch BEFORE stream starts
    WS->>Redis: load_session(id)
    Redis-->>WS: miss
    WS->>DDB: load_session(id)
    DDB-->>WS: session (genre=horror, preferences)

    WS->>OS: retrieve(query="horror manga", top_k=3)
    OS-->>WS: [Berserk context, Uzumaki context, MPD Psycho context]

    Note over WS: build_context_window()<br/>system_prompt: 200 tok<br/>user_pref_header: 100 tok<br/>RAG context: 300 tok<br/>recent turns: 200 tok<br/>select_lora → lora-horror-v1

    WS->>vLLM: POST /v1/chat/completions<br/>stream=true, lora=lora-horror-v1

    Note over vLLM: prefix cache HIT (300 tokens skipped)<br/>join active batch (no wait)<br/>allocate KV pages (PagedAttention)<br/>apply lora-horror-v1 delta

    loop autoregressive decode
        vLLM-->>WS: data: {"content": "Berserk"}
        WS-->>U: {type: token, content: "Berserk"}
        vLLM-->>WS: data: {"content": " is"}
        WS-->>U: {type: token, content: " is"}
        vLLM-->>WS: data: {"content": " a dark..."}
        WS-->>U: {type: token, content: " a dark..."}
    end

    vLLM-->>WS: data: [DONE]
    WS-->>U: {type: done}

    Note over WS: post-stream (non-blocking async tasks)
    WS--)DDB: save_session() [fire-and-forget]
    WS--)Redis: update_cache() [fire-and-forget]

LLM Concepts — Deep Dives


Concept 1 — PagedAttention

The problem: standard serving reserves one contiguous VRAM block per request, sized for max_len tokens up front; a request that generates 50 tokens against a 2048-token reservation wastes 97% of its KV-cache allocation.

graph TD
    subgraph Contiguous["Standard Contiguous Allocation (WASTEFUL)"]
        direction LR
        CA["Request A<br/>2048 slots reserved<br/>████░░░░░░░░░░░░░░░░░░░░<br/>50/2048 used — 97% wasted"]
        CB["Request B<br/>2048 slots reserved<br/>████░░░░░░░░░░░░░░░░░░░░<br/>50/2048 used — 97% wasted"]
        CC["Request C<br/>2048 slots reserved<br/>████░░░░░░░░░░░░░░░░░░░░<br/>50/2048 used — 97% wasted"]
    end

    subgraph Paged["PagedAttention (EFFICIENT)"]
        direction LR
        PG1["Physical Page 47<br/>tokens 0–15 of Req A"]
        PG2["Physical Page 12<br/>tokens 16–31 of Req A"]
        PG3["Physical Page 103<br/>tokens 32–47 of Req A"]
        PG4["Physical Page 7<br/>tokens 48–50 of Req A (partial)"]
        FREE["Remaining VRAM<br/>free for other requests"]
        PG1 --> PG2 --> PG3 --> PG4
    end

    Contiguous -->|"PagedAttention replaces this"| Paged

    style CA fill:#7f1d1d,color:#fca5a5
    style CB fill:#7f1d1d,color:#fca5a5
    style CC fill:#7f1d1d,color:#fca5a5
    style FREE fill:#166534,color:#bbf7d0

How pages are managed at runtime:

flowchart LR
    T["New token generated"]
    A{"Last page<br/>full?"}
    AP["Allocate next free page<br/>from VRAM pool"]
    W["Write KV to page"]
    D{"Request<br/>done?"}
    F["Free all pages<br/>back to pool<br/>(immediately reusable)"]
    C["Continue decode"]

    T --> A
    A -->|Yes| AP --> W --> D
    A -->|No| W
    W --> D
    D -->|Yes| F
    D -->|No| C --> T

MangaAssist result: Same ml.g5.xlarge (24GB VRAM) goes from ~4–6 concurrent sessions to ~15–20.
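
A toy allocator shows how little bookkeeping this requires. This is not vLLM's internal code, just a sketch of the mechanism in the diagrams: a pool of fixed 16-token pages, a per-request page table, and immediate reuse on free.

# Toy PagedAttention-style page pool (illustrative, not vLLM internals).
PAGE_SIZE = 16  # tokens per KV page

class PagePool:
    def __init__(self, num_pages: int):
        self.free = list(range(num_pages))            # physical page ids
        self.page_tables: dict[str, list[int]] = {}   # request -> logical->physical

    def append_token(self, request_id: str, token_index: int) -> int:
        table = self.page_tables.setdefault(request_id, [])
        if token_index % PAGE_SIZE == 0:              # last page full: grab a new one
            table.append(self.free.pop())
        return table[-1]                              # physical page holding this token

    def finish(self, request_id: str) -> None:
        # Pages return to the pool and are immediately reusable by other requests.
        self.free.extend(self.page_tables.pop(request_id))

pool = PagePool(num_pages=1024)
for i in range(50):                  # a 50-token request touches only 4 pages,
    pool.append_token("req-A", i)    # not a contiguous 2048-slot reservation
pool.finish("req-A")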


Concept 2 — Continuous Batching

gantt
    title Static Batching vs Continuous Batching
    dateFormat x
    axisFormat %L ms

    section Static Batching (wasteful)
    User A arrives        :milestone, 0, 0
    User B arrives        :milestone, 10, 0
    User C arrives        :milestone, 30, 0
    Wait for batch window :crit, wait1, 0, 50
    BATCH 1 executes      :active, b1, 50, 100
    User D arrives        :milestone, 50, 0
    Wait for batch window :crit, wait2, 50, 100
    BATCH 2 executes      :active, b2, 100, 150

    section Continuous Batching (vLLM)
    User A tok-by-tok     :active, ca, 0, 80
    User B joins immediately :active, cb, 10, 90
    User C joins immediately :active, cc, 30, 110
    User D joins next iter   :active, cd, 50, 120
    GPU never idles          :active, gpu, 0, 120

Slot-level view:

graph LR
    subgraph Iter1["Decode Iteration 1"]
        I1A["A: tok 14"]
        I1B["B: tok 3"]
        I1C["C: tok 89"]
    end
    subgraph Iter2["Decode Iteration 2 — D joins NOW"]
        I2A["A: tok 15"]
        I2B["B: tok 4"]
        I2C["C: tok 90"]
        I2D["D: tok 1 ← NEW"]
    end
    subgraph Iter4["Decode Iteration 4 — C done, slot freed"]
        I4A["A: tok 17"]
        I4B["B: tok 6"]
        I4C["C: DONE ✓"]
        I4D["D: tok 3"]
    end
    subgraph Iter5["Decode Iteration 5 — E fills freed slot"]
        I5A["A: tok 18"]
        I5B["B: tok 7"]
        I5E["E: tok 1 ← NEW"]
        I5D["D: tok 4"]
    end

    Iter1 --> Iter2 --> Iter4 --> Iter5

    style I2D fill:#166534,color:#fff
    style I4C fill:#7f1d1d,color:#fff
    style I5E fill:#166534,color:#fff
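
The slot mechanics above reduce to a small scheduling loop. The sketch below is illustrative (vLLM's real scheduler also weighs KV-page availability and preemption); it shows only the core rule that freed slots are refilled at every decode iteration instead of waiting for a batch window.

# Toy continuous-batching scheduler (illustrative, not vLLM's scheduler).
import collections
from dataclasses import dataclass

MAX_SLOTS = 4   # concurrent sequences the GPU batch can hold

@dataclass
class Request:
    rid: str
    max_tokens: int
    tokens_done: int = 0

waiting: collections.deque[Request] = collections.deque()
active: list[Request] = []

def decode_iteration() -> None:
    # 1. Admit waiting requests into any free slots; no batch-window wait.
    while waiting and len(active) < MAX_SLOTS:
        active.append(waiting.popleft())
    # 2. One forward pass produces one token for EVERY active request.
    for req in active:
        req.tokens_done += 1
    # 3. Finished requests leave now, freeing their slot for the next iteration.
    active[:] = [r for r in active if r.tokens_done < r.max_tokens]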

Concept 3 — Automatic Prefix Caching

The repeated prefix problem for MangaAssist:

graph LR
    subgraph Turn1["Turn 1  (320 tokens)"]
        T1A["system_prompt<br/>200 tok"]
        T1B["user_pref_header<br/>100 tok"]
        T1C["message<br/>20 tok"]
    end
    subgraph Turn2["Turn 2  (372 tokens)"]
        T2A["system_prompt<br/>200 tok"]
        T2B["user_pref_header<br/>100 tok"]
        T2C["history_1<br/>50 tok"]
        T2D["message<br/>22 tok"]
    end
    subgraph Turn3["Turn 3  (418 tokens)"]
        T3A["system_prompt<br/>200 tok"]
        T3B["user_pref_header<br/>100 tok"]
        T3C["history_1+2<br/>100 tok"]
        T3D["message<br/>18 tok"]
    end

    REPEAT["Same 300 tokens<br/>computed from scratch<br/>on EVERY turn<br/>without caching"]

    T1A -.->|repeated| T2A -.->|repeated| T3A
    T1B -.->|repeated| T2B -.->|repeated| T3B

    style REPEAT fill:#7f1d1d,color:#fca5a5

How prefix caching resolves this:

flowchart TD
    REQ["Request arrives<br/>tokens 0..319"]

    H0{"Hash block 0<br/>tokens 0–15<br/>= 0xAB12..."}
    H1{"Hash block 1<br/>tokens 16–31"}
    H2{"Hash block 2<br/>tokens 32–47"}
    HNEW{"Hash block N<br/>first new content"}

    HIT0["CACHE HIT ✓<br/>Reuse stored KV<br/>No GPU compute"]
    HIT1["CACHE HIT ✓"]
    HIT2["CACHE HIT ✓"]
    COMPUTE["CACHE MISS<br/>GPU computes KV<br/>for new tokens only<br/>Store result for next request"]

    REQ --> H0
    H0 -->|hit| HIT0 --> H1
    H1 -->|hit| HIT1 --> H2
    H2 -->|hit| HIT2 --> HNEW
    HNEW -->|miss| COMPUTE

    SAVED["300 / 320 tokens skipped<br/>= 93.75% prefix hit rate<br/>GPU works only on 20 new tokens"]

    HIT2 --> SAVED

    style HIT0 fill:#166534,color:#fff
    style HIT1 fill:#166534,color:#fff
    style HIT2 fill:#166534,color:#fff
    style COMPUTE fill:#92400e,color:#fff
    style SAVED fill:#1e3a5f,color:#fff
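
The cache itself is essentially a hash table keyed on chained token blocks. A toy version follows; the chaining rule (a block can only hit if its entire prefix matched too) mirrors how vLLM keys its blocks, while the block contents here are stand-ins.

# Toy prefix cache (illustrative): hash 16-token blocks left to right.
BLOCK = 16

kv_cache: dict[int, object] = {}   # chained block hash -> stored KV pages

def lookup_and_fill(token_ids: list[int]) -> int:
    """Return how many leading tokens were already cached; cache the rest."""
    cached, chain, still_hitting = 0, 0, True
    for i in range(0, len(token_ids) // BLOCK * BLOCK, BLOCK):
        # Including the previous chain value means a block hits only when
        # everything before it matched as well.
        chain = hash((chain, tuple(token_ids[i:i + BLOCK])))
        if chain in kv_cache and still_hitting:
            cached += BLOCK                 # reuse stored KV, no GPU compute
        else:
            still_hitting = False
            kv_cache[chain] = object()      # stand-in for freshly computed KV
    return cached

turn1 = list(range(300)) + list(range(900, 920))     # shared prefix + 20 new tokens
print(lookup_and_fill(turn1))   # 0: first turn computes and stores everything
turn2 = list(range(300)) + list(range(1000, 1072))   # same 300-token prefix
print(lookup_and_fill(turn2))   # 288: 18 full blocks of the shared prefix hit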

Concept 4 — Multi-LoRA (Low-Rank Adaptation)

What a LoRA adapter is (math made visual):

graph LR
    subgraph FullFinetune["Full Fine-Tune (expensive)"]
        FFW["W_base (14GB)<br/>ALL parameters updated<br/>7B weights × 2 bytes = 14GB<br/>per genre variant"]
    end

    subgraph LoRAFT["LoRA Fine-Tune (cheap)"]
        LBW["W_base FROZEN (14GB)<br/>not modified"]
        LA["A matrix (small)<br/>rank 64 × d_model"]
        LB["B matrix (small)<br/>d_model × rank 64"]
        EQ["W_effective = W_base + A×B<br/>ΔW ≈ 50MB total per adapter"]
        LBW --> EQ
        LA --> EQ
        LB --> EQ
    end

    style FFW fill:#7f1d1d,color:#fca5a5
    style EQ fill:#166534,color:#bbf7d0

Multi-LoRA: one container, four genre personalities:

flowchart TD
    subgraph Without["WITHOUT Multi-LoRA<br/>4 endpoints × GPU cost"]
        E1["Endpoint 1<br/>manga-llm-shonen<br/>1 GPU fleet"]
        E2["Endpoint 2<br/>manga-llm-romance<br/>1 GPU fleet"]
        E3["Endpoint 3<br/>manga-llm-isekai<br/>1 GPU fleet"]
        E4["Endpoint 4<br/>manga-llm-horror<br/>1 GPU fleet"]
        COST1["4× GPU cost<br/>4× container fleets<br/>4× deployment pipelines"]
    end

    subgraph With["WITH Multi-LoRA<br/>1 container, 1 GPU fleet"]
        BASE["Base model in VRAM<br/>manga-llm-awq-int4 (4.5GB)"]
        A1["lora-shonen-v3 (~50MB)"]
        A2["lora-romance-v2 (~50MB)"]
        A3["lora-isekai-v4 (~50MB)"]
        A4["lora-horror-v1 (~50MB)"]
        SWAP["Request: lora_request=horror<br/>vLLM: W_eff = W_base + A_horror×B_horror<br/>Pointer swap — sub-millisecond<br/>NO disk I/O, NO data copy"]
        COST2["1× GPU cost<br/>1× container fleet<br/>1× deployment pipeline"]
    end

    Without -->|"Multi-LoRA replaces this"| With

    style COST1 fill:#7f1d1d,color:#fca5a5
    style COST2 fill:#166534,color:#bbf7d0
    style SWAP fill:#1e1b4b,color:#c4b5fd

Concept 5 — AWQ INT4 Quantization

Precision vs VRAM tradeoff:

graph LR
    subgraph FP16["FP16 Full Precision"]
        F1["1 weight = 16 bits = 2 bytes"]
        F2["7B params × 2 bytes = 14 GB"]
        F3["KV-cache headroom<br/>24GB - 14GB - 2GB overhead<br/>= 8GB → ~5 concurrent sessions"]
    end

    subgraph AWQ["AWQ INT4 Quantized"]
        A1["1 weight = 4 bits = 0.5 bytes"]
        A2["7B params × 0.5 bytes = 3.5GB<br/>+ AWQ scales/zeros ≈ 4.5GB total"]
        A3["KV-cache headroom<br/>24GB - 4.5GB - 2GB overhead<br/>= 17.5GB → ~18 concurrent sessions"]
    end

    FP16 -->|"AWQ quantization<br/>~2% quality delta<br/>3× VRAM reduction"| AWQ

    style F3 fill:#7f1d1d,color:#fca5a5
    style A3 fill:#166534,color:#bbf7d0

Why AWQ beats naive INT4:

flowchart TD
    NAIVE["Naive INT4<br/>Round every weight equally<br/>to nearest integer"]
    NAIVEFAIL["Important weights destroyed<br/>Quality crash on manga recommendations"]

    AWQ["AWQ: Activation-Aware Weight Quantization"]
    CALIB["1. Run on manga calibration data<br/>(real recommendation samples)"]
    IDENTIFY["2. Identify salient weights<br/>(most important for output quality)"]
    PROTECT["3. Protect salient weights<br/>with higher precision"]
    QUANT["4. Aggressively quantize<br/>non-salient weights to INT4"]
    RESULT["Result:<br/>Nearly same quality as FP16<br/>~2% perplexity increase<br/>68% VRAM reduction"]

    NAIVE --> NAIVEFAIL
    AWQ --> CALIB --> IDENTIFY --> PROTECT --> QUANT --> RESULT

    style NAIVEFAIL fill:#7f1d1d,color:#fca5a5
    style RESULT fill:#166534,color:#bbf7d0
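
The storage math that produces these VRAM numbers is just 4-bit codes plus a per-group scale/zero pair. The sketch below shows that packing only; it deliberately omits AWQ's actual contribution, the activation-aware search for which weights to protect.

# Group-wise INT4 storage math (illustrative; not AWQ's saliency search).
import numpy as np

GROUP = 128   # weights sharing one scale/zero pair

def quantize_int4(w: np.ndarray):
    g = w.reshape(-1, GROUP)
    lo = g.min(axis=1, keepdims=True)
    scale = (g.max(axis=1, keepdims=True) - lo) / 15.0   # 4 bits = 16 levels
    scale = np.where(scale == 0, 1.0, scale)             # guard constant groups
    q = np.clip(np.round((g - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    return q * scale + lo

w = np.random.randn(1 << 20).astype(np.float32)   # one million weights
q, scale, lo = quantize_int4(w)

# FP16 stores 2 bytes/weight; INT4 packs 2 weights per byte plus an
# fp32 scale/zero pair per 128-weight group.
fp16_bytes = w.size * 2
int4_bytes = w.size // 2 + (scale.size + lo.size) * 4
print(fp16_bytes / int4_bytes)   # ≈ 3.6× smaller for this tensor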

Concept 6 — Streaming (SSE / Server-Sent Events)

Why autoregressive generation streams naturally:

sequenceDiagram
    participant GPU as GPU Inference
    participant vLLM as vLLM Server
    participant Orch as Orchestrator
    participant Browser as Browser

    Note over GPU: Prompt: "recommend a horror manga"

    GPU->>vLLM: token 1 = "Berserk"
    vLLM->>Orch: data: {"content": "Berserk"}
    Orch->>Browser: {type: "token", content: "Berserk"}
    Note over Browser: Renders "Berserk" immediately ← ~200ms

    GPU->>vLLM: token 2 = " is"
    vLLM->>Orch: data: {"content": " is"}
    Orch->>Browser: {type: "token", content: " is"}

    GPU->>vLLM: token 3 = " a"
    vLLM->>Orch: data: {"content": " a"}
    Orch->>Browser: {type: "token", content: " a"}

    Note over GPU: ... N tokens later ...

    GPU->>vLLM: [EOS] token
    vLLM->>Orch: data: [DONE]
    Orch->>Browser: {type: "done"}
    Note over Browser: Full response: "Berserk is a dark..." shown at ~5s<br/>WITHOUT streaming, user waits this long for first word

The critical rule — no I/O inside the forwarding loop:

flowchart TD
    subgraph WRONG["WRONG — I/O inside stream loop"]
        W1["async for token in vllm_stream()"]
        W2["enrichment = await fetch_manga_details(token)<br/>← BLOCKS HERE during slow enrichment lookup"]
        W3["await websocket.send(token + enrichment)<br/>← token delivery delayed"]
        W1 --> W2 --> W3 --> W1
    end

    subgraph CORRECT["CORRECT — all I/O before stream starts"]
        C1["manga_details = await fetch_manga_details(session)<br/>← runs once, BEFORE stream"]
        C2["async for token in vllm_stream()"]
        C3["await websocket.send(token)<br/>← pure forwarding, nothing blocks"]
        C1 --> C2 --> C3 --> C2
    end

    WRONG -->|"fix"| CORRECT

    style WRONG fill:#7f1d1d,color:#fca5a5
    style CORRECT fill:#166534,color:#bbf7d0

Decision Table

| Dimension | Details |
|---|---|
| vLLM vs TGI | TGI is simpler to set up, but vLLM won our throughput benchmark on concurrent manga sessions |
| vLLM vs TensorRT-LLM | TensorRT-LLM is marginally faster at peak, but compilation complexity plus NVIDIA lock-in outweighed the gain for fast-iteration fine-tuning |
| PagedAttention benefit | Reduces KV-cache VRAM waste: ~5 concurrent sessions → 18+ on the same GPU |
| Continuous batching tradeoff | Slightly more complex scheduling than fixed batching; the payoff is higher throughput and lower latency under load |
| Prefix caching benefit | 300/320 tokens shared per request: 93%+ cache hit rate, so the GPU skips repeated system-prompt work |
| Multi-LoRA tradeoff | Adds adapter pointer management, but eliminates three entire GPU fleets |
| AWQ quantization tradeoff | ~2% perplexity delta against a 68% VRAM reduction and 3× concurrent-session density |
| Scale mechanism | Continuous batching + Multi-LoRA: one GPU fleet serves all adapter variants under load |
| Key metric | ~50% GPU cost reduction, ~68% latency improvement vs the raw Transformers baseline |

Tradeoffs Discussed

| Option considered | Why rejected or scoped |
|---|---|
| Raw Transformers serving | Baseline; highest operational simplicity, lowest throughput — unsuitable at production scale |
| Hugging Face TGI | Good option, but throughput benchmarks on our workload lagged vLLM |
| TensorRT-LLM | Fastest peak throughput, but compilation complexity + GPU lock-in rejected for fast-iteration environment |
| Separate endpoints per LoRA adapter | Simple isolation, but 3–5× GPU cost increase; Multi-LoRA consolidated this |
| No quantization | Maximum model quality, but VRAM footprint too large for cost target at scale |

Scale Planned

| Metric | Target |
|---|---|
| Concurrent chat sessions | Handled via continuous batching — bursts absorbed without a fixed batch wait |
| Domain adapters | 1 base model container + Multi-LoRA for n adapters — no GPU fleet proliferation |
| VRAM per instance | Reduced via AWQ INT4 — 8GB KV headroom (FP16) → 17.5GB KV headroom (INT4) |
| Token throughput | ~2× baseline via PagedAttention + continuous batching vs raw Transformers |
| Concurrent sessions per GPU | ~3× increase from FP16 → AWQ INT4 |

LLM Concepts Summary Table

| Concept | Problem solved | Mechanism | MangaAssist gain |
|---|---|---|---|
| PagedAttention | VRAM fragmentation from contiguous KV allocation | OS-style paged virtual memory for the KV-cache | 3× more concurrent sessions per GPU |
| Continuous batching | GPU idles during the fixed batch window | New requests join the next decode iteration immediately | Lower latency and higher throughput during chapter-drop spikes |
| Prefix caching | The same system prompt is recomputed on every request | Hash prefix blocks, store the KV, reuse on cache hit | 93%+ of tokens skipped for the shared 300-token prefix |
| Multi-LoRA | Separate GPU fleet per domain adapter | Keep all adapter deltas in VRAM, swap a pointer per request | 4 genre variants from 1 GPU fleet — 4× cost reduction |
| AWQ quantization | High VRAM cost of FP16 weights | Activation-aware per-channel quantization to INT4 | 14GB → 4.5GB base model; ~3× more VRAM for the KV-cache |
| Streaming (SSE) | User waits for the full response before seeing anything | Send each token as generated instead of waiting for EOS | First token in ~200ms vs a ~5s full-response wait |

Intuition From This Scenario

Serving engine choice is a container operations decision, not just an ML performance decision. The six features — PagedAttention, continuous batching, prefix caching, streaming, Multi-LoRA, and AWQ — each solve a real cost or quality problem that showed up at MangaAssist scale. None of them are magic. PagedAttention is OS paging applied to GPU memory. Continuous batching is a scheduler change. Prefix caching is a hash table. Multi-LoRA is a pointer swap. AWQ is a smarter rounding scheme. The compound effect is that one GPU instance serves 3× more users while the team operates one container instead of four. That is the senior engineer answer: understand the mechanism, justify it against the actual workload, and measure it.