
vLLM Deployment And Infrastructure For MangaAssist

Complete deployment guide: from environment setup through Docker image construction, direct SageMaker endpoint deployment, the alternative Bedrock import path, startup sequencing, health checks, scaling policies, and graceful shutdown. Every decision is grounded in the MangaAssist chatbot workload.

1. Scope

This document covers the full deployment lifecycle for vLLM in MangaAssist's production environment. It answers:

  • How is vLLM installed and what are its system dependencies?
  • How is the Docker image built and optimized?
  • How do we choose between direct SageMaker hosting and Bedrock-managed inference on AWS?
  • How does the SageMaker endpoint deployment work?
  • What changes if the organization wants Bedrock instead of a self-managed vLLM container?
  • What happens during container startup and how long does each phase take?
  • How do health checks distinguish "alive" from "ready to serve"?
  • How does auto-scaling respond to traffic changes?
  • How are in-flight requests handled during shutdown?

2. Environment And System Requirements

Hardware Requirements

Component Requirement MangaAssist choice Why
GPU NVIDIA with compute capability ≥ 7.0 NVIDIA A10G (compute 8.6) Best cost-performance for AWQ INT4 inference
VRAM ≥ 16 GB for 8B model with AWQ 24 GB (A10G) Fits model (4.5 GB) + KV cache (14 GB) + workspace (5.5 GB)
System RAM ≥ 16 GB 16 GB on ml.g5.xlarge Sufficient for model loading and tokenizer
Storage ≥ 50 GB for model artifacts EBS gp3 mounted at /opt/ml/model Fast enough for 4.5 GB model load

Software Dependencies

Dependency Version Why this version
CUDA 12.1 Required by vLLM ≥ 0.4.0; matches A10G driver compatibility
Python 3.10 vLLM's tested Python version; 3.11+ had intermittent issues with some CUDA bindings
vLLM 0.4.3 First stable release with Multi-LoRA + prefix caching + AWQ working together
PyTorch 2.3.0 Paired with CUDA 12.1; required by vLLM 0.4.3
Transformers 4.40.0 Tokenizer and model config loading
safetensors 0.4.3 Fast model weight loading (2–3× faster than pickle-based checkpoints)
uvicorn 0.29.0 ASGI server for vLLM's OpenAI-compatible HTTP endpoint

vLLM Version Compatibility Notes

Feature Minimum vLLM version Notes
PagedAttention 0.1.0 Core feature since inception
Continuous batching 0.1.0 Core scheduler since inception
AWQ quantization 0.2.0 AutoAWQ integration
Prefix caching 0.3.0 enable_prefix_caching=True flag
Multi-LoRA serving 0.3.3 Multiple adapters on single base model
OpenAI-compatible server 0.2.0 /v1/chat/completions endpoint
CUDA graph support 0.2.0 enforce_eager=False for graph capture
Prometheus metrics 0.3.0 Built-in /metrics endpoint

Installation

# Base CUDA environment (pre-installed in our Docker base image)
# NVIDIA Driver: 535.104.12
# CUDA Toolkit: 12.1

# vLLM and dependencies
pip install vllm==0.4.3 \
    torch==2.3.0 \
    transformers==4.40.0 \
    safetensors==0.4.3 \
    uvicorn==0.29.0

# For AWQ quantization (offline, not needed on serving nodes)
pip install autoawq==0.2.4

# For LoRA adapter training (offline, not needed on serving nodes)
pip install peft==0.10.0

Dependency pinning: We pin all versions in requirements.txt and rebuild images on a weekly cadence. vLLM moves fast; upgrading without testing has caused two regressions for us:

  1. v0.4.1 → v0.4.2 changed prefix cache eviction behavior, dropping cache hit rate by 15%
  2. v0.4.2 → v0.4.3 fixed the eviction bug but changed Multi-LoRA adapter loading semantics

3. GPU Instance Selection Deep Dive

VRAM Budget Breakdown (ml.g5.xlarge, A10G 24 GB)

┌────────────────────────────────────────────────────────┐
│                    VRAM Budget (24 GB)                  │
├────────────────────────────────────────────────────────┤
│ Model weights (AWQ INT4 Llama-3-8B)          4.5 GB   │
│ LoRA adapters (3 × ~40 MB)                   0.12 GB  │
│ KV cache blocks (128 seqs × block_size=16)  ~14.0 GB  │
│ CUDA workspace + activations                  1.5 GB   │
│ CUDA graph memory                             1.0 GB   │
│ Safety headroom (gpu_memory_utilization=0.92) 1.92 GB  │
│ Unallocated (OS/driver overhead)              0.96 GB  │
├────────────────────────────────────────────────────────┤
│ Total                                        24.0 GB   │
└────────────────────────────────────────────────────────┘
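The budget above can be sanity-checked in a few lines. All values are taken directly from the table; the headroom term is simply the 8% of VRAM that gpu_memory_utilization=0.92 leaves unmanaged:

```python
# Sanity-check of the VRAM budget table above (all values in GB).
TOTAL_VRAM = 24.0
GPU_MEMORY_UTILIZATION = 0.92

budget = {
    "model_weights_awq_int4": 4.5,
    "lora_adapters": 0.12,
    "kv_cache_blocks": 14.0,
    "cuda_workspace_activations": 1.5,
    "cuda_graphs": 1.0,
    # Headroom is the fraction vLLM is told not to manage: (1 - 0.92) * 24
    "safety_headroom": round((1 - GPU_MEMORY_UTILIZATION) * TOTAL_VRAM, 2),
    "unallocated_os_driver": 0.96,
}

assert budget["safety_headroom"] == 1.92
assert abs(sum(budget.values()) - TOTAL_VRAM) < 1e-6
print(f"Total: {sum(budget.values()):.2f} GB")
```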

Instance Comparison

Instance GPU VRAM vCPU RAM $/hr (On-Demand) $/month Max concurrent seqs Verdict
ml.g4dn.xlarge T4 16 GB 4 16 GB $0.736 $537 ~40 Too tight for 128 seq target
ml.g5.xlarge A10G 24 GB 4 16 GB $1.408 $1,028 128 Selected: optimal cost/capacity
ml.g5.2xlarge A10G 24 GB 8 32 GB $1.515 $1,106 128 More CPU/RAM, same GPU — not needed
ml.p4d.24xlarge A100×8 640 GB 96 1.1 TB $32.77 $23,914 1000+ Overkill for single-model serving

Why we rejected ml.g4dn.xlarge: With only 16 GB VRAM, the AWQ model (4.5 GB) leaves ~11.5 GB for KV cache. At 128 concurrent sequences with max_model_len=4096, the KV cache alone needs ~14 GB. The math does not work without reducing either concurrency (bad for chatbot UX) or context length (bad for multi-turn conversations).

Why we did not use A100: The A100 80 GB GPU can run unquantized FP16 models without compression, but it costs 23× more per hour. For our traffic volume (~50K requests/day across 4 instances), A10G with AWQ quantization is the cost-optimal choice. The A100 is reserved for training, benchmarking, and future scaling beyond 128 concurrent sequences per GPU.
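As a rough cost check, the on-demand rate and traffic figures quoted above work out as follows (a back-of-envelope sketch, not a billing calculation):

```python
# Back-of-envelope serving cost per request for the A10G fleet.
instances = 4
hourly_rate = 1.408          # ml.g5.xlarge on-demand, $/hr
requests_per_day = 50_000    # fleet-wide traffic from the text above

daily_cost = instances * hourly_rate * 24           # fleet $/day
cost_per_request = daily_cost / requests_per_day    # blended $/request

print(f"${daily_cost:.2f}/day, ${cost_per_request:.4f}/request")
```

At ~$0.003 per request, quantized A10G serving leaves ample margin even if traffic doubles before the next scaling decision.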

4. Docker Image Construction

Multi-Stage Build Strategy

# ======================================
# Stage 1: Build vLLM with CUDA support
# ======================================
FROM nvidia/cuda:12.1.1-devel-ubuntu22.04 AS builder

RUN apt-get update && apt-get install -y \
    python3.10 python3.10-dev python3-pip git && \
    rm -rf /var/lib/apt/lists/*

COPY requirements.txt /build/requirements.txt
RUN pip install --no-cache-dir -r /build/requirements.txt

# ======================================
# Stage 2: Download and validate model
# ======================================
FROM builder AS model-prep

# Model artifacts are downloaded during image build, NOT at container runtime.
# This is a critical decision: runtime download adds 3-5 minutes to startup.
ARG MODEL_ID=meta-llama/Llama-3-8b-instruct-awq
# ARG values persist in image history; prefer BuildKit --mount=type=secret
# for the token in production builds.
ARG HF_TOKEN

# Heredoc RUN requires BuildKit (Dockerfile syntax >= 1.4); a quoted
# multi-line string after `python3 -c` is not valid Dockerfile syntax.
RUN python3 - <<EOF
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="${MODEL_ID}",
    local_dir="/models/base",
    token="${HF_TOKEN}",
    ignore_patterns=["*.md", "*.txt", "original/*"],
)
EOF

# Download LoRA adapters
COPY scripts/download_adapters.py /build/download_adapters.py
RUN python3 /build/download_adapters.py \
    --output-dir /models/lora \
    --adapters manga_domain_v3 general_support_v2 jp_style_v1

# Validate model integrity
RUN python3 - <<'EOF'
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('/models/base')
assert tokenizer.vocab_size > 0, 'Tokenizer validation failed'
print(f'Model validated: vocab_size={tokenizer.vocab_size}')
EOF

# ======================================
# Stage 3: Slim runtime image
# ======================================
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04 AS runtime

# curl is required by the HEALTHCHECK instruction below
RUN apt-get update && apt-get install -y \
    python3.10 python3-pip curl && \
    rm -rf /var/lib/apt/lists/*

# Copy only runtime dependencies (no build tools)
COPY --from=builder /usr/local/lib/python3.10/dist-packages /usr/local/lib/python3.10/dist-packages
COPY --from=builder /usr/local/bin /usr/local/bin

# Copy validated model artifacts
COPY --from=model-prep /models /opt/ml/model

# Copy application code
COPY app/ /app/
COPY scripts/entrypoint.sh /entrypoint.sh

# Health check port
EXPOSE 8080

# Prometheus metrics port
EXPOSE 9090

ENV PYTHONUNBUFFERED=1
ENV MODEL_PATH=/opt/ml/model/base
ENV LORA_PATH=/opt/ml/model/lora
ENV VLLM_PORT=8080

HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD curl -f http://localhost:8080/ping || exit 1

ENTRYPOINT ["/entrypoint.sh"]

Image Size Optimization

Optimization Impact
Multi-stage build (drop build tools) 15 GB → 8.2 GB
nvidia/cuda:runtime instead of devel Saves ~3 GB of CUDA dev headers
--no-cache-dir on pip install Saves ~500 MB of pip cache
.dockerignore excluding tests, docs, notebooks Saves ~200 MB
safetensors format (vs pickle) Same size, but 2–3× faster load time

Why Model Artifacts Are Baked Into The Image

Decision: Bake model weights into the Docker image at build time, not download at container runtime.

Why not runtime download?

Approach Startup time Scaling speed Reliability
Runtime download from S3 +3–5 min Slow (download per instance) Depends on S3 availability
Runtime download from HuggingFace +5–10 min Very slow Depends on HF Hub availability
Baked into image +0 min Fast (image pull is cached) Self-contained

The tradeoff is larger images (~8.2 GB), but ECR image layer caching means subsequent pulls only download changed layers. Since the model weights layer rarely changes (only on model updates), scaling events pull only the application code layer (~50 MB).

When to use runtime download instead: If you have many model variants (10+) and cannot maintain that many image tags, a hybrid approach works: bake the base model, download adapters at startup (adapters are ~40 MB each, fast enough).
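The hybrid approach could look like the sketch below. The bucket name, key layout, and adapter file list are illustrative assumptions, not our actual artifact layout:

```python
"""Sketch: download LoRA adapters at container startup (hybrid approach).

The S3 bucket, prefix, and per-adapter file names are assumptions.
"""
from pathlib import Path

ADAPTER_FILES = ["adapter_model.safetensors", "adapter_config.json"]


def adapter_download_plan(
    adapters: list[str],
    bucket: str,
    prefix: str,
    dest_root: str,
) -> list[tuple[str, str, str]]:
    """Return (bucket, s3_key, local_path) tuples for each adapter file."""
    plan = []
    for name in adapters:
        for fname in ADAPTER_FILES:
            key = f"{prefix}/{name}/{fname}"
            local = str(Path(dest_root) / name / fname)
            plan.append((bucket, key, local))
    return plan


if __name__ == "__main__":
    import boto3  # downloads only run on the serving node

    s3 = boto3.client("s3")
    plan = adapter_download_plan(
        adapters=["manga_domain_v3", "general_support_v2", "jp_style_v1"],
        bucket="mangaassist-model-artifacts",  # hypothetical bucket name
        prefix="lora",
        dest_root="/opt/ml/model/lora",
    )
    for bucket, key, local in plan:
        Path(local).parent.mkdir(parents=True, exist_ok=True)
        s3.download_file(bucket, key, local)   # ~40 MB per adapter: fast
```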

5. Entrypoint And Startup Sequence

Entrypoint Script

#!/bin/bash
set -euo pipefail

echo "[$(date)] Starting vLLM server for MangaAssist"

# Phase 1: Environment validation
echo "[$(date)] Phase 1: Validating environment"
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader
python3 -c "import torch; assert torch.cuda.is_available(), 'CUDA not available'"

# Phase 2: Start vLLM OpenAI-compatible server
echo "[$(date)] Phase 2: Starting vLLM engine"
mkdir -p /var/log/vllm
python3 -m vllm.entrypoints.openai.api_server \
    --model "${MODEL_PATH}" \
    --host 0.0.0.0 \
    --port "${VLLM_PORT:-8080}" \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.92 \
    --max-model-len 4096 \
    --max-num-seqs 128 \
    --max-num-batched-tokens 8192 \
    --block-size 16 \
    --enable-prefix-caching \
    --quantization awq \
    --no-custom-all-reduce \
    --enable-lora \
    --max-loras 4 \
    --max-lora-rank 16 \
    --lora-modules \
        manga_domain_v3="${LORA_PATH}/manga_domain_v3" \
        general_support_v2="${LORA_PATH}/general_support_v2" \
        jp_style_v1="${LORA_PATH}/jp_style_v1" \
    --disable-log-requests \
    > >(tee /var/log/vllm/server.log) 2>&1 &

# Process substitution instead of a pipe: with `... | tee ... &`, $! would be
# the PID of tee, and the SIGTERM trap below would signal the wrong process.
VLLM_PID=$!

# Phase 3: Wait for readiness
echo "[$(date)] Phase 3: Waiting for engine readiness"
MAX_WAIT=180
WAITED=0
while [ $WAITED -lt $MAX_WAIT ]; do
    if curl -sf http://localhost:${VLLM_PORT}/health > /dev/null 2>&1; then
        echo "[$(date)] Engine is ready after ${WAITED}s"
        break
    fi
    sleep 2
    WAITED=$((WAITED + 2))
done

if [ $WAITED -ge $MAX_WAIT ]; then
    echo "[$(date)] ERROR: Engine failed to become ready in ${MAX_WAIT}s"
    kill $VLLM_PID 2>/dev/null
    exit 1
fi

# Phase 4: Warmup requests
echo "[$(date)] Phase 4: Running warmup requests"
python3 /app/scripts/warmup.py \
    --endpoint "http://localhost:${VLLM_PORT}/v1/chat/completions" \
    --num-requests 10 \
    --timeout 30

echo "[$(date)] Phase 5: Server is ready to accept traffic"

# Handle graceful shutdown
trap 'echo "[$(date)] SIGTERM received, draining..."; kill -SIGTERM $VLLM_PID; wait $VLLM_PID' SIGTERM

wait $VLLM_PID

Warmup Script

"""
Warmup script that sends synthetic requests to prime the vLLM engine.

Purpose:
- Trigger CUDA graph capture for common sequence lengths
- Prime the KV block allocator
- Warm the prefix cache with the system prompt
- Validate that all LoRA adapters load correctly
"""

import argparse
import time

import httpx


WARMUP_REQUESTS = [
    {
        "model": "manga_domain_v3",
        "messages": [
            {"role": "system", "content": "You are a manga shopping assistant."},
            {"role": "user", "content": "Recommend action manga."},
        ],
        "max_tokens": 50,
        "stream": False,
    },
    {
        "model": "general_support_v2",
        "messages": [
            {"role": "system", "content": "You are a manga shopping assistant."},
            {"role": "user", "content": "What is the return policy?"},
        ],
        "max_tokens": 50,
        "stream": False,
    },
    {
        "model": "jp_style_v1",
        "messages": [
            {"role": "system", "content": "You are a manga shopping assistant."},
            {"role": "user", "content": "Recommend shounen manga in Japanese."},
        ],
        "max_tokens": 50,
        "stream": False,
    },
]


def run_warmup(endpoint: str, num_requests: int, timeout: int) -> None:
    client = httpx.Client(timeout=timeout)
    start = time.time()

    for i in range(num_requests):
        request = WARMUP_REQUESTS[i % len(WARMUP_REQUESTS)]
        try:
            response = client.post(endpoint, json=request)
            response.raise_for_status()
            elapsed = time.time() - start
            print(f"  Warmup {i+1}/{num_requests}: {response.status_code} ({elapsed:.1f}s)")
        except Exception as e:
            print(f"  Warmup {i+1}/{num_requests}: FAILED - {e}")
            raise

    total = time.time() - start
    print(f"Warmup complete: {num_requests} requests in {total:.1f}s")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--endpoint", required=True)
    parser.add_argument("--num-requests", type=int, default=10)
    parser.add_argument("--timeout", type=int, default=30)
    args = parser.parse_args()
    run_warmup(args.endpoint, args.num_requests, args.timeout)

Startup Timing Breakdown

Phase What happens Duration Failure behavior
Environment validation Check CUDA availability, GPU health ~2s Exit 1 (container restart)
Model weight loading Load AWQ weights from disk to GPU ~15s Exit 1 (container restart)
CUDA graph capture Capture optimized execution graphs ~30s Falls back to eager mode
LoRA adapter preload Load 3 adapter weight sets ~5s Logs warning, serves without adapters
Health endpoint ready /health starts returning 200 ~5s Readiness probe fails, no traffic
Warmup requests 10 synthetic requests through full path ~10s Logs warning, serves anyway
Total startup ~67s

Why warmup matters: The first request through vLLM after startup can be 3–5× slower than steady-state because CUDA graph compilation and memory allocator initialization happen lazily. Warmup requests absorb that penalty before real traffic arrives.

6. Health Check Implementation

SageMaker Health Check Contract

SageMaker calls /ping every 30 seconds. This must return 200 within 2 seconds for the instance to be considered healthy.

"""
Health check implementation for vLLM on SageMaker.

Three levels of health:
1. Liveness: Process is running and can respond to HTTP
2. Readiness: Model is loaded, CUDA graphs captured, warmup complete
3. GPU health: CUDA device is accessible and not in error state
"""

import asyncio
import subprocess
from enum import Enum

from fastapi import FastAPI, Response


class HealthState(Enum):
    INITIALIZING = "initializing"
    WARMING_UP = "warming_up"
    READY = "ready"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


class HealthChecker:
    def __init__(self) -> None:
        self.state = HealthState.INITIALIZING
        self.model_loaded = False
        self.cuda_graphs_ready = False
        self.warmup_complete = False
        self.last_gpu_check_ok = True
        self._gpu_check_interval = 60  # seconds

    def mark_model_loaded(self) -> None:
        self.model_loaded = True
        self._update_state()

    def mark_cuda_graphs_ready(self) -> None:
        self.cuda_graphs_ready = True
        self._update_state()

    def mark_warmup_complete(self) -> None:
        self.warmup_complete = True
        self._update_state()

    def _update_state(self) -> None:
        if self.model_loaded and self.cuda_graphs_ready and self.warmup_complete:
            self.state = HealthState.READY
        elif self.model_loaded:
            self.state = HealthState.WARMING_UP

    async def check_gpu_health(self) -> bool:
        """Check GPU accessibility via nvidia-smi. Run periodically, not per-request."""
        try:
            result = subprocess.run(
                ["nvidia-smi", "--query-gpu=gpu_bus_id", "--format=csv,noheader"],
                capture_output=True,
                text=True,
                timeout=5,
            )
            self.last_gpu_check_ok = result.returncode == 0
        except (subprocess.TimeoutExpired, FileNotFoundError):
            self.last_gpu_check_ok = False
        return self.last_gpu_check_ok

    def is_ready(self) -> bool:
        return self.state == HealthState.READY and self.last_gpu_check_ok

    def is_alive(self) -> bool:
        return self.state != HealthState.UNHEALTHY


health_checker = HealthChecker()


def register_health_routes(app: FastAPI) -> None:
    @app.get("/ping")
    async def ping(response: Response) -> dict:
        """SageMaker liveness check. Returns 200 if process is alive."""
        if not health_checker.is_alive():
            response.status_code = 503
        return {
            "status": health_checker.state.value,
            "model_loaded": health_checker.model_loaded,
            "cuda_graphs_ready": health_checker.cuda_graphs_ready,
            "warmup_complete": health_checker.warmup_complete,
            "gpu_ok": health_checker.last_gpu_check_ok,
        }

    @app.get("/ready")
    async def ready(response: Response) -> dict:
        """Readiness check. Returns 200 only when fully ready to serve."""
        if not health_checker.is_ready():
            response.status_code = 503
        return {"ready": health_checker.is_ready()}

Why Liveness And Readiness Are Separate

Check What it answers When it matters
Liveness (/ping) "Is the process running?" SageMaker restart decision
Readiness (/ready) "Can it serve traffic?" Traffic routing decision

A process can be alive but not ready during the 67-second startup window. If /ping and /ready were the same, SageMaker would either route traffic to an unready instance (causing 503s) or restart a healthy instance that is still loading (causing restart loops).
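The startup-window behavior can be illustrated with a reduced model of the HealthChecker above (a sketch for illustration only):

```python
# Minimal illustration of the liveness/readiness split during startup.
from enum import Enum


class State(Enum):
    INITIALIZING = "initializing"
    WARMING_UP = "warming_up"
    READY = "ready"


def probe(state: State) -> tuple[int, int]:
    """Return (liveness_status, readiness_status) HTTP codes."""
    alive = 200                                   # process responds in all states
    ready = 200 if state is State.READY else 503  # traffic only when READY
    return alive, ready


# During the startup window: keep the instance, but send it no traffic.
assert probe(State.INITIALIZING) == (200, 503)
assert probe(State.WARMING_UP) == (200, 503)
assert probe(State.READY) == (200, 200)
```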

7. Graceful Shutdown

Why Graceful Shutdown Matters

During auto-scaling scale-in or deployment updates, SageMaker sends SIGTERM to the container. Without graceful shutdown:

  • In-flight requests get killed mid-generation → users see truncated or error responses
  • Active WebSocket/SSE streams disconnect abruptly → frontend shows "connection lost"
  • KV cache state is lost without warning → requests already in prefill restart from scratch

Shutdown Sequence

sequenceDiagram
    participant SM as SageMaker
    participant EP as Entrypoint
    participant GW as vLLM Gateway
    participant ENGINE as vLLM Engine

    SM->>EP: SIGTERM
    EP->>GW: Stop accepting new requests
    Note over GW: /ping returns 503\n/ready returns 503
    GW->>ENGINE: Wait for in-flight requests (max 30s)
    Note over ENGINE: Active generations\ncomplete normally
    ENGINE-->>GW: All generations done
    GW->>EP: Clean exit
    EP-->>SM: Exit 0
    Note over SM: Instance removed\nfrom endpoint

Shutdown Implementation

import asyncio
import signal
import logging

logger = logging.getLogger(__name__)


class GracefulShutdown:
    def __init__(self, engine, max_drain_seconds: int = 30) -> None:
        self.engine = engine
        self.max_drain_seconds = max_drain_seconds
        self.shutting_down = False

    def install_signal_handlers(self) -> None:
        # Must be called from inside the running event loop;
        # get_event_loop() is deprecated for this use since Python 3.10.
        loop = asyncio.get_running_loop()
        for sig in (signal.SIGTERM, signal.SIGINT):
            loop.add_signal_handler(sig, lambda s=sig: asyncio.create_task(self.shutdown(s)))

    async def shutdown(self, sig: signal.Signals) -> None:
        logger.info(f"Received {sig.name}, starting graceful shutdown")
        self.shutting_down = True

        # Stop accepting new requests (health checks return 503)
        logger.info("Stopped accepting new requests")

        # Wait for in-flight requests to complete
        waited = 0
        while waited < self.max_drain_seconds:
            active = self.engine.get_num_unfinished_requests()
            if active == 0:
                logger.info(f"All requests drained after {waited}s")
                break
            logger.info(f"Draining: {active} requests still in-flight ({waited}s)")
            await asyncio.sleep(1)
            waited += 1

        if waited >= self.max_drain_seconds:
            active = self.engine.get_num_unfinished_requests()
            logger.warning(f"Drain timeout: {active} requests still active, forcing shutdown")

        logger.info("Shutdown complete")
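The drain loop can be exercised against a stub engine. The stub below is purely illustrative (requests "finish" one per poll) and is not the vLLM engine API:

```python
import asyncio


class StubEngine:
    """Stand-in for the engine: one request finishes per poll."""

    def __init__(self, in_flight: int) -> None:
        self.in_flight = in_flight

    def get_num_unfinished_requests(self) -> int:
        current = self.in_flight
        self.in_flight = max(0, self.in_flight - 1)
        return current


async def drain(engine: StubEngine, max_drain_seconds: int = 30) -> int:
    """Wait (up to max_drain_seconds polls) for in-flight work to finish."""
    waited = 0
    while waited < max_drain_seconds:
        if engine.get_num_unfinished_requests() == 0:
            break
        await asyncio.sleep(0)  # 1s in production; 0 here so the demo is instant
        waited += 1
    return waited


waited = asyncio.run(drain(StubEngine(in_flight=3)))
assert waited == 3  # three polls saw active requests, the fourth saw zero
```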

8. SageMaker Endpoint Deployment Configuration

Endpoint Structure

import sagemaker
from sagemaker.model import Model
from sagemaker.predictor import Predictor


def deploy_vllm_endpoint(
    role: str,
    image_uri: str,
    model_data: str | None = None,  # None when model is baked into image
    instance_type: str = "ml.g5.xlarge",
    initial_instance_count: int = 4,
) -> Predictor:
    model = Model(
        image_uri=image_uri,
        role=role,
        env={
            "MODEL_PATH": "/opt/ml/model/base",
            "LORA_PATH": "/opt/ml/model/lora",
            "VLLM_PORT": "8080",
        },
    )

    predictor = model.deploy(
        initial_instance_count=initial_instance_count,
        instance_type=instance_type,
        endpoint_name="mangaassist-vllm-prod",
        container_startup_health_check_timeout=300,  # 5 min for model load + warmup
        volume_size=50,  # GB, for model artifacts if not baked in
    )

    return predictor

Canary Deployment With Production Variants

from sagemaker.session import production_variant


def deploy_canary(image_uri_prod: str, image_uri_canary: str, role: str) -> None:
    """Deploy canary variant alongside production for safe rollouts."""
    variants = [
        production_variant(
            model_name="mangaassist-vllm-prod",
            instance_type="ml.g5.xlarge",
            initial_instance_count=4,
            variant_name="primary",
            initial_weight=95,  # 95% of traffic
        ),
        production_variant(
            model_name="mangaassist-vllm-canary",
            instance_type="ml.g5.xlarge",
            initial_instance_count=1,
            variant_name="canary",
            initial_weight=5,  # 5% of traffic
        ),
    ]

    sagemaker.Session().endpoint_from_production_variants(
        name="mangaassist-vllm-prod",
        production_variants=variants,
    )

Blue/Green Update Configuration

import boto3

sm_client = boto3.client("sagemaker")

# Configure blue/green update behavior (traffic shift + termination wait)
# for endpoint deployments. Note: update_endpoint also requires
# EndpointConfigName (the new config to roll out); the name below is assumed.
sm_client.update_endpoint(
    EndpointName="mangaassist-vllm-prod",
    EndpointConfigName="mangaassist-vllm-prod-config",  # assumed config name
    RetainAllVariantProperties=True,
    DeploymentConfig={
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "ALL_AT_ONCE",
                "WaitIntervalInSeconds": 60,
            },
            "TerminationWaitInSeconds": 120,
            "MaximumExecutionTimeoutInSeconds": 600,
        },
    },
)

9. Auto-Scaling Policies

Scaling Configuration

import boto3

asg_client = boto3.client("application-autoscaling")

# Register scalable target
asg_client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/mangaassist-vllm-prod/variant/primary",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=2,   # Always keep 2 for redundancy
    MaxCapacity=8,   # Cost ceiling
)

# Scale-out policy: react fast to queue buildup
asg_client.put_scaling_policy(
    PolicyName="mangaassist-vllm-scale-out",
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/mangaassist-vllm-prod/variant/primary",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 50.0,  # Target invocations per instance
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
        },
        "ScaleOutCooldown": 60,   # Fast reaction to spikes
        "ScaleInCooldown": 300,   # Slow scale-in to avoid flapping
    },
)

Custom Metric-Based Scaling

SageMaker's built-in metric (invocations per instance) is a reasonable proxy, but for a GPU inference workload, custom metrics are more accurate:

Metric Scale-out trigger Scale-in trigger Why
queue_depth > 50 for 60s < 5 for 300s Direct measure of backpressure
active_sequences > 110 (85% of 128) for 60s < 30 for 300s GPU saturation signal
gpu_memory_used_pct > 90% for 60s < 40% for 300s Memory pressure signal
p95_queue_wait_ms > 200 ms for 120s < 20 ms for 300s User-visible latency impact

Asymmetric Cooldown Strategy

Direction Cooldown Why
Scale-out 60 seconds Users feel queueing immediately. React fast.
Scale-in 300 seconds Premature scale-in causes oscillation. Wait for sustained low traffic.

Predictive scaling: For the daily traffic pattern (peaks during JP evening hours 18:00–22:00 JST and US evening hours 18:00–22:00 EST), we use scheduled scaling actions that pre-provision instances 15 minutes before the expected peak rather than waiting for reactive scaling.

# Pre-provision for JP evening peak
asg_client.put_scheduled_action(
    ServiceNamespace="sagemaker",
    ScheduledActionName="jp-evening-peak",
    ResourceId="endpoint/mangaassist-vllm-prod/variant/primary",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    Schedule="cron(45 8 * * ? *)",  # 17:45 JST = 08:45 UTC
    ScalableTargetAction={"MinCapacity": 6, "MaxCapacity": 8},
)

# Scale back after JP evening
asg_client.put_scheduled_action(
    ServiceNamespace="sagemaker",
    ScheduledActionName="jp-evening-end",
    ResourceId="endpoint/mangaassist-vllm-prod/variant/primary",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    Schedule="cron(0 14 * * ? *)",  # 23:00 JST = 14:00 UTC
    ScalableTargetAction={"MinCapacity": 2, "MaxCapacity": 8},
)

10. Environment Variable Reference

Variable Default Description Impact of wrong value
MODEL_PATH /opt/ml/model/base Path to base model weights Engine fails to start
LORA_PATH /opt/ml/model/lora Path to LoRA adapter directory Adapters unavailable, falls back to base model
VLLM_PORT 8080 HTTP server port Health checks fail, no traffic
VLLM_GPU_MEMORY_UTILIZATION 0.92 Fraction of VRAM allocated to vLLM Too high: OOM; Too low: reduced concurrency
VLLM_MAX_MODEL_LEN 4096 Maximum context window in tokens Too high: fewer concurrent seqs; Too low: truncated conversations
VLLM_MAX_NUM_SEQS 128 Maximum concurrent sequences Too high: OOM; Too low: artificial queueing
VLLM_MAX_NUM_BATCHED_TOKENS 8192 Max tokens in active batch Too high: TTFT spike; Too low: throughput loss
VLLM_BLOCK_SIZE 16 Tokens per KV cache block Too small: block table overhead; Too large: fragmentation waste
VLLM_LOG_LEVEL info Engine log verbosity debug in prod: log volume explosion
WARMUP_NUM_REQUESTS 10 Synthetic requests before ready 0: cold first-request penalty; Too many: slow startup

What Each Knob Controls

gpu_memory_utilization=0.92: This is a stability boundary, not a performance knob. Setting it to 0.98 would add ~1.4 GB of usable KV cache blocks but risks OOM when CUDA workspace allocations spike during certain operations (graph capture, large batch prefill). We tested 0.90, 0.92, and 0.95 under sustained load:

  • 0.90: Stable, 118 max concurrent sequences
  • 0.92: Stable, 128 max concurrent sequences ← selected
  • 0.95: Intermittent OOM under bursty traffic (2–3 per day)

max_num_seqs=128: The concurrency ceiling. At 128 sequences with block_size=16 and max_model_len=4096, the KV cache budget is ~14 GB. With our observed median response length (~180 tokens), average KV cache usage is ~3.5 GB, leaving substantial room. We chose 128 because:

  • It was the highest value that passed 24-hour soak tests without OOM at gpu_memory_utilization=0.92
  • It provided 2× headroom above our observed peak concurrency (~60 simultaneous requests)
  • Going higher (256) caused occasional preemptions that hurt tail latency

block_size=16: Each block holds KV tensors for 16 tokens. Smaller blocks (8) reduce internal fragmentation but increase block-table memory overhead. Larger blocks (32) reduce overhead but waste more memory on sequences that do not fill the last block. For our median response length of 180 tokens, block_size=16 means 12 blocks per sequence (⌈180/16⌉ = 12) with <16 tokens of waste in the last block.
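The block arithmetic above, worked through:

```python
import math

# KV cache block accounting for one sequence (numbers from the text above).
block_size = 16          # tokens per KV block
median_response = 180    # tokens

blocks = math.ceil(median_response / block_size)   # blocks needed
waste = blocks * block_size - median_response      # unused slots in last block

assert blocks == 12
assert waste < block_size  # waste is always bounded by one block
print(f"{blocks} blocks, {waste} wasted token slots in the last block")
```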

11. AWS Deployment Decision: SageMaker Or Bedrock

Current AWS Service Boundary (Validated 2026-03-25)

SageMaker AI and Amazon Bedrock are not interchangeable deployment targets for this design:

  • SageMaker AI is the right choice when we want to run our own vLLM container on GPU instances and keep control over CUDA, model loading, health endpoints, autoscaling behavior, LoRA loading, and Prometheus metrics.
  • Amazon Bedrock is the right choice when we want AWS to own the serving runtime. In that model we do not deploy our Docker image or our vLLM process. We export supported model artifacts to S3, import them into Bedrock, and invoke the resulting model through Bedrock APIs.

Decision Matrix

Requirement SageMaker AI Amazon Bedrock MangaAssist implication
Run our own vLLM Docker image Yes No SageMaker only
Keep custom /ping and /invocations handling Yes No SageMaker only
Tune gpu_memory_utilization, max_num_seqs, prefix caching, CUDA graphs Yes No SageMaker only
Serve multiple LoRA adapters on one base model Yes Not exposed as a Bedrock runtime primitive Current design fits SageMaker
Use Bedrock-native APIs, Guardrails, Agents, and Converse Indirect Yes Bedrock advantage
Reduce infrastructure operations burden No Yes Bedrock advantage
Import supported Hugging Face model artifacts from S3 Possible but self-managed Yes Bedrock fit
Support arbitrary vLLM backends and custom kernels Yes No SageMaker only

For the architecture described in this document, SageMaker is the primary deployment target. The design depends on vLLM-specific features such as AWQ quantization, Multi-LoRA loading, prefix caching, custom health/readiness logic, and queue-aware autoscaling. Bedrock becomes attractive only if the priority shifts from maximum runtime control to lower operational overhead and Bedrock-native governance.

Implementation Checklist If We Choose SageMaker

  1. Build the runtime image described in Sections 4-5, including the base model, LoRA adapters, health routes, warmup script, and entrypoint.
  2. Expose a SageMaker-compatible inference surface that accepts /ping and /invocations while forwarding inference to the local vLLM OpenAI-compatible endpoint.
  3. Push the image to ECR and create a SageMaker Model, EndpointConfig, and Endpoint.
  4. Set container_startup_health_check_timeout high enough to cover weight loading plus warmup.
  5. Enable autoscaling and blue/green rollout policies from Sections 8-9.
  6. Point the application gateway at the SageMaker endpoint and keep Bedrock out of the request path.

Minimal SageMaker Request Adapter

from fastapi import FastAPI, Request, Response
import httpx
import os

app = FastAPI()
vllm_base = os.getenv("VLLM_BASE_URL", "http://127.0.0.1:8080")


@app.get("/ping")
async def ping() -> Response:
    try:
        async with httpx.AsyncClient(timeout=2.0) as client:
            upstream = await client.get(f"{vllm_base}/health")
        healthy = upstream.status_code == 200
    except httpx.HTTPError:  # engine down or timed out: report unhealthy
        healthy = False
    return Response(status_code=200 if healthy else 503)


@app.post("/invocations")
async def invocations(request: Request) -> Response:
    payload = await request.json()

    vllm_payload = {
        "model": payload.get("model", "manga_domain_v3"),
        "messages": payload["messages"],
        "max_tokens": payload.get("max_tokens", 256),
        "temperature": payload.get("temperature", 0.2),
        "stream": payload.get("stream", False),
    }

    async with httpx.AsyncClient(timeout=70.0) as client:
        upstream = await client.post(
            f"{vllm_base}/v1/chat/completions",
            json=vllm_payload,
        )

    return Response(
        content=upstream.text,
        status_code=upstream.status_code,
        media_type="application/json",
    )

This adapter keeps the external SageMaker contract stable while preserving vLLM's OpenAI-compatible API internally. It is also the cleanest place to add auth, request normalization, admission control, or future multi-backend routing.
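
Once the endpoint is in service, the gateway reaches it through `sagemaker-runtime`. A minimal invocation sketch, assuming the endpoint name above (a placeholder) and the adapter's `/invocations` payload shape; the body builder is pure data and the AWS call runs only when explicitly enabled.

```python
import json
import os


def build_invocation_body(question: str) -> str:
    """Serialize the request body the adapter's /invocations route expects."""
    return json.dumps(
        {
            "messages": [{"role": "user", "content": question}],
            "max_tokens": 256,
            "temperature": 0.2,
        }
    )


if os.getenv("RUN_INVOKE"):
    import boto3

    runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")
    response = runtime.invoke_endpoint(
        EndpointName="mangaassist-vllm-v3",  # placeholder endpoint name
        ContentType="application/json",
        Body=build_invocation_body("Recommend action manga under $20."),
    )
    print(json.loads(response["Body"].read()))
```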

12. Bedrock Deployment Path For The Same Model

What Changes When We Move From SageMaker To Bedrock

The deployment unit changes completely:

  • SageMaker path: ship a container image that starts vLLM on GPU infrastructure we control.
  • Bedrock path: ship model artifacts in Hugging Face format to S3, submit a Bedrock Custom Model Import job, and invoke the imported model through Bedrock APIs.

That means Bedrock is not "host the same vLLM container somewhere else on AWS." It is a managed model-import and inference path.

Bedrock Import Constraints

| Constraint | Bedrock requirement | Why it matters |
| --- | --- | --- |
| Supported architectures | Llama 2/3/3.1/3.2/3.3/Mllama, Mistral, Mixtral, Flan-T5, GPTBigCode, Qwen2/2.5/2-VL/2.5-VL/3, GPT-OSS | Verify model family before choosing Bedrock |
| Model artifact format | Hugging Face weights in S3 | Export from training pipeline in HF format |
| Required files | .safetensors, config.json, tokenizer_config.json, tokenizer.json, tokenizer.model | Package tokenizer with the model |
| Imported model size | < 200 GB for text, < 100 GB for multimodal | Fine for current 8B design |
| Max context length | < 128K | Fine for current 4K design |
| Transformers compatibility | 4.51.3 when fine-tuning for import compatibility | Pin training/export image if Bedrock is a target |
| Unsupported import target | Embedding models | Keep embeddings on a separate stack |
| Feature exclusions | Custom Model Import does not support Batch inference or CloudFormation | Plan operations around API or console flows |

Suggested Artifact Layout In S3

```text
s3://mangaassist-bedrock-import/llama3-domain-v1/
  config.json
  generation_config.json
  tokenizer.json
  tokenizer_config.json
  tokenizer.model
  model-00001-of-00004.safetensors
  model-00002-of-00004.safetensors
  model-00003-of-00004.safetensors
  model-00004-of-00004.safetensors
  model.safetensors.index.json
```

Treat Bedrock packaging as its own promotion target. Do not assume every artifact layout optimized for vLLM runtime serving is automatically importable into Bedrock without a validation step in CI/CD.
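
A minimal sketch of that CI validation step, assuming a local staging directory mirroring the S3 layout. The required-file list and the 200 GB text-model limit come from the constraints table above; the function name and structure are otherwise illustrative.

```python
from pathlib import Path

# Files Bedrock Custom Model Import expects alongside the weight shards
# (per the constraints table above).
REQUIRED_FILES = {
    "config.json",
    "tokenizer_config.json",
    "tokenizer.json",
    "tokenizer.model",
}
MAX_TEXT_MODEL_BYTES = 200 * 1024**3  # import limit for text models


def validate_bedrock_artifacts(artifact_dir: str) -> list[str]:
    """Return a list of problems; an empty list means the layout looks importable."""
    root = Path(artifact_dir)
    names = {p.name for p in root.iterdir() if p.is_file()}

    problems = [f"missing required file: {f}" for f in sorted(REQUIRED_FILES - names)]

    if not list(root.glob("*.safetensors")):
        problems.append("no .safetensors weight shards found")

    total = sum(p.stat().st_size for p in root.iterdir() if p.is_file())
    if total >= MAX_TEXT_MODEL_BYTES:
        problems.append(f"artifacts too large for text-model import: {total} bytes")

    return problems
```

Running this against the staging directory before `aws s3 sync` turns a silent Bedrock import failure into a fast CI failure.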

Submit The Model Import Job

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Point the import job at the S3 prefix laid out above; the role must grant
# Bedrock read access to that bucket.
response = bedrock.create_model_import_job(
    jobName="mangaassist-llama3-import-v1",
    importedModelName="mangaassist-llama3-domain-v1",
    roleArn="arn:aws:iam::123456789012:role/bedrock-model-import-role",
    modelDataSource={
        "s3DataSource": {
            "s3Uri": "s3://mangaassist-bedrock-import/llama3-domain-v1/"
        }
    },
)

job_arn = response["jobArn"]
print(job_arn)
```

Poll Until Import Completes

```python
import time

# Bound the wait so a stuck job fails loudly instead of hanging the pipeline
# (the two-hour ceiling is arbitrary; tune it to observed import times).
deadline = time.monotonic() + 2 * 60 * 60

while True:
    status = bedrock.get_model_import_job(jobIdentifier=job_arn)
    print(status["status"])

    if status["status"] == "Completed":
        imported_model_arn = status["importedModelArn"]
        break
    if status["status"] == "Failed":
        raise RuntimeError(status["failureMessage"])
    if time.monotonic() > deadline:
        raise TimeoutError("Bedrock model import did not finish in time")

    time.sleep(30)
```

Invoke The Imported Model With Converse

```python
import boto3
import json

runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Invoke the imported model through the Converse API, using the ARN captured
# when the import job completed.
response = runtime.converse(
    modelId=imported_model_arn,
    system=[
        {
            "text": "You are a manga shopping assistant."
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "text": "Recommend action manga under $20."
                }
            ],
        }
    ],
    inferenceConfig={
        "maxTokens": 200,
        "temperature": 0.2,
    },
)

print(json.dumps(response, indent=2, default=str))
```

API Compatibility Note

For imported models created after 2025-11-11, Bedrock Custom Model Import supports Bedrock-native completion format plus OpenAI-compatible completion and chat-completion payloads. That reduces application rewrite cost if the client already speaks an OpenAI-like schema.
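
For such clients, the OpenAI-style chat payload can be sent through `invoke_model` directly. A hedged sketch: the payload builder below is illustrative, the model ARN is a placeholder, and the exact accepted schema should be confirmed against current Bedrock documentation before relying on it.

```python
import json
import os


def build_openai_chat_body(question: str) -> str:
    """OpenAI-style chat-completion payload for an imported model
    (supported for models imported after 2025-11-11; verify the schema)."""
    return json.dumps(
        {
            "messages": [
                {"role": "system", "content": "You are a manga shopping assistant."},
                {"role": "user", "content": question},
            ],
            "max_tokens": 200,
            "temperature": 0.2,
        }
    )


if os.getenv("RUN_BEDROCK_INVOKE"):
    import boto3

    runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
    response = runtime.invoke_model(
        modelId="arn:aws:bedrock:us-east-1:123456789012:imported-model/abc123",  # placeholder ARN
        contentType="application/json",
        body=build_openai_chat_body("Recommend action manga under $20."),
    )
    print(json.loads(response["body"].read()))
```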

Operational Consequences

When we choose Bedrock, we give up direct control over the vLLM runtime:

  • No custom container startup sequence
  • No direct vllm serve flags
  • No direct tuning of gpu_memory_utilization, max_num_seqs, KV cache, or CUDA graph behavior
  • No runtime-loaded Multi-LoRA topology from this document

Operationally, the safest Bedrock strategy is to promote a fully materialized model version per domain or release, rather than depending on runtime adapter-loading semantics that Bedrock does not expose.
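
"Fully materialized" here means merging each LoRA adapter into the base weights before export. A sketch using `peft` (the base model and adapter paths are placeholders, and `transformers`/`peft` are assumed installed); the merge runs only when explicitly enabled.

```python
import os


def merged_output_dir(adapter_path: str) -> str:
    """Derive a per-domain export directory name for the merged model."""
    return f"./export/{os.path.basename(adapter_path)}-materialized"


if os.getenv("RUN_MERGE"):
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder base model
    adapter = "./adapters/manga_domain_v3"        # placeholder LoRA adapter

    # merge_and_unload folds the LoRA deltas into the base weights, yielding
    # plain Hugging Face artifacts that Custom Model Import can consume.
    model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)
    model = PeftModel.from_pretrained(model, adapter)
    model = model.merge_and_unload()

    out = merged_output_dir(adapter)
    model.save_pretrained(out, safe_serialization=True)
    AutoTokenizer.from_pretrained(base).save_pretrained(out)
```

One merged export per domain keeps the Bedrock promotion unit identical to the validation and import units described above.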

When Bedrock Is Worth It

Choose Bedrock over SageMaker if most of these are true:

  • The target model architecture is supported by Custom Model Import.
  • The platform team wants Bedrock-native governance, Guardrails, Agents, or Converse APIs.
  • The application can live with less direct control over scheduling, caching, and LoRA loading.
  • Lower infrastructure operations burden matters more than squeezing the last bit of runtime efficiency out of vLLM.

13. Cross-References