vLLM Deployment And Infrastructure For MangaAssist
Complete deployment guide: from environment setup through Docker image construction, direct SageMaker endpoint deployment, the alternative Bedrock import path, startup sequencing, health checks, scaling policies, and graceful shutdown. Every decision is grounded in the MangaAssist chatbot workload.
1. Scope
This document covers the full deployment lifecycle for vLLM in MangaAssist's production environment. It answers:
- How is vLLM installed and what are its system dependencies?
- How is the Docker image built and optimized?
- How do we choose between direct SageMaker hosting and Bedrock-managed inference on AWS?
- How does the SageMaker endpoint deployment work?
- What changes if the organization wants Bedrock instead of a self-managed vLLM container?
- What happens during container startup and how long does each phase take?
- How do health checks distinguish "alive" from "ready to serve"?
- How does auto-scaling respond to traffic changes?
- How are in-flight requests handled during shutdown?
2. Environment And System Requirements
Hardware Requirements
| Component | Requirement | MangaAssist choice | Why |
|---|---|---|---|
| GPU | NVIDIA with compute capability ≥ 7.0 | NVIDIA A10G (compute 8.6) | Best cost-performance for AWQ INT4 inference |
| VRAM | ≥ 16 GB for 8B model with AWQ | 24 GB (A10G) | Fits model (4.5 GB) + KV cache (14 GB) + workspace (5.5 GB) |
| System RAM | ≥ 16 GB | 16 GB on ml.g5.xlarge | Sufficient for model loading and tokenizer |
| Storage | ≥ 50 GB for model artifacts | EBS gp3 mounted at /opt/ml/model | Fast enough for 4.5 GB model load |
Software Dependencies
| Dependency | Version | Why this version |
|---|---|---|
| CUDA | 12.1 | Required by vLLM ≥ 0.4.0; matches A10G driver compatibility |
| Python | 3.10 | vLLM's tested Python version; 3.11+ had intermittent issues with some CUDA bindings |
| vLLM | 0.4.3 | First stable release with Multi-LoRA + prefix caching + AWQ working together |
| PyTorch | 2.3.0 | Paired with CUDA 12.1; required by vLLM 0.4.3 |
| Transformers | 4.40.0 | Tokenizer and model config loading |
| safetensors | 0.4.3 | Fast model weight loading (2–3× faster than pickle-based checkpoints) |
| uvicorn | 0.29.0 | ASGI server for vLLM's OpenAI-compatible HTTP endpoint |
vLLM Version Compatibility Notes
| Feature | Minimum vLLM version | Notes |
|---|---|---|
| PagedAttention | 0.1.0 | Core feature since inception |
| Continuous batching | 0.1.0 | Core scheduler since inception |
| AWQ quantization | 0.2.0 | AutoAWQ integration |
| Prefix caching | 0.3.0 | enable_prefix_caching=True flag |
| Multi-LoRA serving | 0.3.3 | Multiple adapters on single base model |
| OpenAI-compatible server | 0.2.0 | /v1/chat/completions endpoint |
| CUDA graph support | 0.2.0 | enforce_eager=False for graph capture |
| Prometheus metrics | 0.3.0 | Built-in /metrics endpoint |
Installation
# Base CUDA environment (pre-installed in our Docker base image)
# NVIDIA Driver: 535.104.12
# CUDA Toolkit: 12.1
# vLLM and dependencies
pip install vllm==0.4.3 \
torch==2.3.0 \
transformers==4.40.0 \
safetensors==0.4.3 \
uvicorn==0.29.0
# For AWQ quantization (offline, not needed on serving nodes)
pip install autoawq==0.2.4
# For LoRA adapter training (offline, not needed on serving nodes)
pip install peft==0.10.0
Dependency pinning: We pin all versions in requirements.txt and rebuild images on a weekly cadence. vLLM moves fast — upgrading without testing has caused two regressions for us:
1. v0.4.1 → v0.4.2 changed prefix cache eviction behavior, dropping cache hit rate by 15%
2. v0.4.2 → v0.4.3 fixed the eviction bug but changed Multi-LoRA adapter loading semantics
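For reference, the serving-image requirements.txt simply mirrors the pinned install command above; a minimal version might look like this (offline-only tools such as autoawq and peft are deliberately kept out of serving images):
# requirements.txt for serving images (illustrative; matches the pins above)
vllm==0.4.3
torch==2.3.0
transformers==4.40.0
safetensors==0.4.3
uvicorn==0.29.0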
3. GPU Instance Selection Deep Dive
VRAM Budget Breakdown (ml.g5.xlarge, A10G 24 GB)
┌────────────────────────────────────────────────────────┐
│ VRAM Budget (24 GB) │
├────────────────────────────────────────────────────────┤
│ Model weights (AWQ INT4 Llama-3-8B) 4.5 GB │
│ LoRA adapters (3 × ~40 MB) 0.12 GB │
│ KV cache blocks (128 seqs × block_size=16) ~14.0 GB │
│ CUDA workspace + activations 1.5 GB │
│ CUDA graph memory 1.0 GB │
│ Safety headroom (gpu_memory_utilization=0.92) 1.92 GB │
│ Unallocated (OS/driver overhead) 0.96 GB │
├────────────────────────────────────────────────────────┤
│ Total 24.0 GB │
└────────────────────────────────────────────────────────┘
Instance Comparison
| Instance | GPU | VRAM | vCPU | RAM | $/hr (On-Demand) | $/month | Max concurrent seqs | Verdict |
|---|---|---|---|---|---|---|---|---|
| ml.g4dn.xlarge | T4 | 16 GB | 4 | 16 GB | $0.736 | $537 | ~40 | Too tight for 128 seq target |
| ml.g5.xlarge | A10G | 24 GB | 4 | 16 GB | $1.408 | $1,028 | 128 | Selected: optimal cost/capacity |
| ml.g5.2xlarge | A10G | 24 GB | 8 | 32 GB | $1.515 | $1,106 | 128 | More CPU/RAM, same GPU — not needed |
| ml.p4d.24xlarge | A100×8 | 640 GB | 96 | 1.1 TB | $32.77 | $23,914 | 1000+ | Overkill for single-model serving |
Why we rejected ml.g4dn.xlarge: With only 16 GB VRAM, the AWQ model (4.5 GB) leaves ~11.5 GB for KV cache. At 128 concurrent sequences with max_model_len=4096, the KV cache alone needs ~14 GB. The math does not work without reducing either concurrency (bad for chatbot UX) or context length (bad for multi-turn conversations).
Why we did not use A100: The A100 80 GB GPU can run unquantized FP16 models without compression, but it costs 23× more per hour. For our traffic volume (~50K requests/day across 4 instances), A10G with AWQ quantization is the cost-optimal choice. The A100 is reserved for training, benchmarking, and future scaling beyond 128 concurrent sequences per GPU.
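As a quick sanity check of the table's figures, the monthly and fleet costs follow from simple arithmetic (this sketch only restates the table's On-Demand prices and assumes roughly 730 billable hours per month):
# Back-of-envelope cost check for the instance comparison table above.
A10G_HOURLY = 1.408      # ml.g5.xlarge, On-Demand
A100_HOURLY = 32.77      # ml.p4d.24xlarge, On-Demand
HOURS_PER_MONTH = 730

fleet_size = 4                                           # MangaAssist baseline
per_instance_monthly = A10G_HOURLY * HOURS_PER_MONTH      # ~ $1,028
fleet_monthly = fleet_size * per_instance_monthly          # ~ $4,111
price_ratio = A100_HOURLY / A10G_HOURLY                    # ~ 23x

print(f"g5.xlarge per instance: ${per_instance_monthly:,.0f}/month")
print(f"4-instance A10G fleet:  ${fleet_monthly:,.0f}/month")
print(f"p4d.24xlarge vs g5.xlarge hourly ratio: {price_ratio:.0f}x")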
4. Docker Image Construction
Multi-Stage Build Strategy
# ======================================
# Stage 1: Build vLLM with CUDA support
# ======================================
FROM nvidia/cuda:12.1.1-devel-ubuntu22.04 AS builder
RUN apt-get update && apt-get install -y \
python3.10 python3.10-dev python3-pip git && \
rm -rf /var/lib/apt/lists/*
COPY requirements.txt /build/requirements.txt
RUN pip install --no-cache-dir -r /build/requirements.txt
# ======================================
# Stage 2: Download and validate model
# ======================================
FROM builder AS model-prep
# Model artifacts are downloaded during image build, NOT at container runtime.
# This is a critical decision: runtime download adds 3-5 minutes to startup.
ARG MODEL_ID=meta-llama/Llama-3-8b-instruct-awq
ARG HF_TOKEN
# Heredoc RUN requires BuildKit (Dockerfile syntax >= 1.4); a plain `python3 -c`
# cannot span multiple lines in a Dockerfile.
RUN python3 <<EOF
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="${MODEL_ID}",
    local_dir="/models/base",
    token="${HF_TOKEN}",
    ignore_patterns=["*.md", "*.txt", "original/*"],
)
EOF
# Download LoRA adapters
COPY scripts/download_adapters.py /build/download_adapters.py
RUN python3 /build/download_adapters.py \
--output-dir /models/lora \
--adapters manga_domain_v3 general_support_v2 jp_style_v1
# Validate model integrity
RUN python3 <<EOF
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/models/base")
assert tokenizer.vocab_size > 0, "Tokenizer validation failed"
print(f"Model validated: vocab_size={tokenizer.vocab_size}")
EOF
# ======================================
# Stage 3: Slim runtime image
# ======================================
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04 AS runtime
# curl is required by the HEALTHCHECK below and by the entrypoint readiness wait
RUN apt-get update && apt-get install -y \
    python3.10 python3-pip curl && \
    rm -rf /var/lib/apt/lists/*
# Copy only runtime dependencies (no build tools)
COPY --from=builder /usr/local/lib/python3.10/dist-packages /usr/local/lib/python3.10/dist-packages
COPY --from=builder /usr/local/bin /usr/local/bin
# Copy validated model artifacts
COPY --from=model-prep /models /opt/ml/model
# Copy application code
COPY app/ /app/
COPY scripts/entrypoint.sh /entrypoint.sh
# Health check port
EXPOSE 8080
# Prometheus metrics port
EXPOSE 9090
ENV PYTHONUNBUFFERED=1
ENV MODEL_PATH=/opt/ml/model/base
ENV LORA_PATH=/opt/ml/model/lora
ENV VLLM_PORT=8080
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
CMD curl -f http://localhost:8080/ping || exit 1
ENTRYPOINT ["/entrypoint.sh"]
Image Size Optimization
| Optimization | Impact |
|---|---|
| Multi-stage build (drop build tools) | 15 GB → 8.2 GB |
| `nvidia/cuda:runtime` instead of `devel` | Saves ~3 GB of CUDA dev headers |
| `--no-cache-dir` on pip install | Saves ~500 MB of pip cache |
| `.dockerignore` excluding tests, docs, notebooks | Saves ~200 MB |
| safetensors format (vs pickle) | Same size, but 2–3× faster load time |
Why Model Artifacts Are Baked Into The Image
Decision: Bake model weights into the Docker image at build time, not download at container runtime.
Why not runtime download?
| Approach | Startup time | Scaling speed | Reliability |
|---|---|---|---|
| Runtime download from S3 | +3–5 min | Slow (download per instance) | Depends on S3 availability |
| Runtime download from HuggingFace | +5–10 min | Very slow | Depends on HF Hub availability |
| Baked into image | +0 min | Fast (image pull is cached) | Self-contained |
The tradeoff is larger images (~8.2 GB), but ECR image layer caching means subsequent pulls only download changed layers. Since the model weights layer rarely changes (only on model updates), scaling events pull only the application code layer (~50 MB).
When to use runtime download instead: If you have many model variants (10+) and cannot maintain that many image tags, a hybrid approach works: bake the base model, download adapters at startup (adapters are ~40 MB each, fast enough).
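If that hybrid approach were adopted, adapter download becomes a small step at the top of the entrypoint, before the engine launches. A minimal sketch, assuming adapters are synced to an S3 prefix (the bucket name and layout here are hypothetical):
# Hypothetical startup-time adapter fetch for the hybrid approach (base model baked
# into the image, LoRA adapters pulled at container start). Bucket/prefix are illustrative.
import os
import boto3

ADAPTER_BUCKET = "mangaassist-lora-adapters"          # hypothetical bucket name
ADAPTERS = ["manga_domain_v3", "general_support_v2", "jp_style_v1"]
LORA_PATH = os.getenv("LORA_PATH", "/opt/ml/model/lora")

s3 = boto3.client("s3")

def fetch_adapters() -> None:
    for adapter in ADAPTERS:
        prefix = f"{adapter}/"
        for page in s3.get_paginator("list_objects_v2").paginate(
            Bucket=ADAPTER_BUCKET, Prefix=prefix
        ):
            for obj in page.get("Contents", []):
                key = obj["Key"]
                if key.endswith("/"):          # skip directory marker objects
                    continue
                dest = os.path.join(LORA_PATH, key)
                os.makedirs(os.path.dirname(dest), exist_ok=True)
                s3.download_file(ADAPTER_BUCKET, key, dest)

if __name__ == "__main__":
    fetch_adapters()  # adapters are ~40 MB each, so this adds only seconds to startup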
5. Entrypoint And Startup Sequence
Entrypoint Script
#!/bin/bash
set -euo pipefail
echo "[$(date)] Starting vLLM server for MangaAssist"
# Phase 1: Environment validation
echo "[$(date)] Phase 1: Validating environment"
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader
python3 -c "import torch; assert torch.cuda.is_available(), 'CUDA not available'"
# Phase 2: Start vLLM OpenAI-compatible server
echo "[$(date)] Phase 2: Starting vLLM engine"
mkdir -p /var/log/vllm
python3 -m vllm.entrypoints.openai.api_server \
--model "${MODEL_PATH}" \
--host 0.0.0.0 \
--port "${VLLM_PORT:-8080}" \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.92 \
--max-model-len 4096 \
--max-num-seqs 128 \
--max-num-batched-tokens 8192 \
--block-size 16 \
--enable-prefix-caching \
--quantization awq \
--disable-custom-all-reduce \
--enable-lora \
--max-loras 4 \
--max-lora-rank 16 \
--lora-modules \
manga_domain_v3="${LORA_PATH}/manga_domain_v3" \
general_support_v2="${LORA_PATH}/general_support_v2" \
jp_style_v1="${LORA_PATH}/jp_style_v1" \
--disable-log-requests \
2>&1 | tee /var/log/vllm/server.log &
VLLM_PID=$!
# Phase 3: Wait for readiness
echo "[$(date)] Phase 3: Waiting for engine readiness"
MAX_WAIT=180
WAITED=0
while [ $WAITED -lt $MAX_WAIT ]; do
if curl -sf http://localhost:${VLLM_PORT}/health > /dev/null 2>&1; then
echo "[$(date)] Engine is ready after ${WAITED}s"
break
fi
sleep 2
WAITED=$((WAITED + 2))
done
if [ $WAITED -ge $MAX_WAIT ]; then
echo "[$(date)] ERROR: Engine failed to become ready in ${MAX_WAIT}s"
kill $VLLM_PID 2>/dev/null
exit 1
fi
# Phase 4: Warmup requests
echo "[$(date)] Phase 4: Running warmup requests"
python3 /app/scripts/warmup.py \
--endpoint "http://localhost:${VLLM_PORT}/v1/chat/completions" \
--num-requests 10 \
--timeout 30
echo "[$(date)] Phase 5: Server is ready to accept traffic"
# Handle graceful shutdown
trap 'echo "[$(date)] SIGTERM received, draining..."; kill -SIGTERM $VLLM_PID; wait $VLLM_PID' SIGTERM
wait $VLLM_PID
Warmup Script
"""
Warmup script that sends synthetic requests to prime the vLLM engine.
Purpose:
- Trigger CUDA graph capture for common sequence lengths
- Prime the KV block allocator
- Warm the prefix cache with the system prompt
- Validate that all LoRA adapters load correctly
"""
import argparse
import time
import httpx
WARMUP_REQUESTS = [
{
"model": "manga_domain_v3",
"messages": [
{"role": "system", "content": "You are a manga shopping assistant."},
{"role": "user", "content": "Recommend action manga."},
],
"max_tokens": 50,
"stream": False,
},
{
"model": "general_support_v2",
"messages": [
{"role": "system", "content": "You are a manga shopping assistant."},
{"role": "user", "content": "What is the return policy?"},
],
"max_tokens": 50,
"stream": False,
},
{
"model": "jp_style_v1",
"messages": [
{"role": "system", "content": "You are a manga shopping assistant."},
{"role": "user", "content": "Recommend shounen manga in Japanese."},
],
"max_tokens": 50,
"stream": False,
},
]
def run_warmup(endpoint: str, num_requests: int, timeout: int) -> None:
client = httpx.Client(timeout=timeout)
start = time.time()
for i in range(num_requests):
request = WARMUP_REQUESTS[i % len(WARMUP_REQUESTS)]
try:
response = client.post(endpoint, json=request)
response.raise_for_status()
elapsed = time.time() - start
print(f" Warmup {i+1}/{num_requests}: {response.status_code} ({elapsed:.1f}s)")
except Exception as e:
print(f" Warmup {i+1}/{num_requests}: FAILED - {e}")
raise
total = time.time() - start
print(f"Warmup complete: {num_requests} requests in {total:.1f}s")
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--endpoint", required=True)
parser.add_argument("--num-requests", type=int, default=10)
parser.add_argument("--timeout", type=int, default=30)
args = parser.parse_args()
run_warmup(args.endpoint, args.num_requests, args.timeout)
Startup Timing Breakdown
| Phase | What happens | Duration | Failure behavior |
|---|---|---|---|
| Environment validation | Check CUDA availability, GPU health | ~2s | Exit 1 (container restart) |
| Model weight loading | Load AWQ weights from disk to GPU | ~15s | Exit 1 (container restart) |
| CUDA graph capture | Capture optimized execution graphs | ~30s | Falls back to eager mode |
| LoRA adapter preload | Load 3 adapter weight sets | ~5s | Logs warning, serves without adapters |
| Health endpoint ready | `/health` starts returning 200 | ~5s | Readiness probe fails, no traffic |
| Warmup requests | 10 synthetic requests through full path | ~10s | Logs warning, serves anyway |
| Total startup | | ~67s | |
Why warmup matters: The first request through vLLM after startup can be 3–5× slower than steady-state because CUDA graph compilation and memory allocator initialization happen lazily. Warmup requests absorb that penalty before real traffic arrives.
6. Health Check Implementation
SageMaker Health Check Contract
SageMaker calls /ping every 30 seconds. This must return 200 within 2 seconds for the instance to be considered healthy.
"""
Health check implementation for vLLM on SageMaker.
Three levels of health:
1. Liveness: Process is running and can respond to HTTP
2. Readiness: Model is loaded, CUDA graphs captured, warmup complete
3. GPU health: CUDA device is accessible and not in error state
"""
import asyncio
import subprocess
from enum import Enum
from fastapi import FastAPI, Response
class HealthState(Enum):
INITIALIZING = "initializing"
WARMING_UP = "warming_up"
READY = "ready"
DEGRADED = "degraded"
UNHEALTHY = "unhealthy"
class HealthChecker:
def __init__(self) -> None:
self.state = HealthState.INITIALIZING
self.model_loaded = False
self.cuda_graphs_ready = False
self.warmup_complete = False
self.last_gpu_check_ok = True
self._gpu_check_interval = 60 # seconds
def mark_model_loaded(self) -> None:
self.model_loaded = True
self._update_state()
def mark_cuda_graphs_ready(self) -> None:
self.cuda_graphs_ready = True
self._update_state()
def mark_warmup_complete(self) -> None:
self.warmup_complete = True
self._update_state()
def _update_state(self) -> None:
if self.model_loaded and self.cuda_graphs_ready and self.warmup_complete:
self.state = HealthState.READY
elif self.model_loaded:
self.state = HealthState.WARMING_UP
async def check_gpu_health(self) -> bool:
"""Check GPU accessibility via nvidia-smi. Run periodically, not per-request."""
try:
result = subprocess.run(
["nvidia-smi", "--query-gpu=gpu_bus_id", "--format=csv,noheader"],
capture_output=True,
text=True,
timeout=5,
)
self.last_gpu_check_ok = result.returncode == 0
except (subprocess.TimeoutExpired, FileNotFoundError):
self.last_gpu_check_ok = False
return self.last_gpu_check_ok
def is_ready(self) -> bool:
return self.state == HealthState.READY and self.last_gpu_check_ok
def is_alive(self) -> bool:
return self.state != HealthState.UNHEALTHY
health_checker = HealthChecker()
def register_health_routes(app: FastAPI) -> None:
@app.get("/ping")
async def ping(response: Response) -> dict:
"""SageMaker liveness check. Returns 200 if process is alive."""
if not health_checker.is_alive():
response.status_code = 503
return {
"status": health_checker.state.value,
"model_loaded": health_checker.model_loaded,
"cuda_graphs_ready": health_checker.cuda_graphs_ready,
"warmup_complete": health_checker.warmup_complete,
"gpu_ok": health_checker.last_gpu_check_ok,
}
@app.get("/ready")
async def ready(response: Response) -> dict:
"""Readiness check. Returns 200 only when fully ready to serve."""
if not health_checker.is_ready():
response.status_code = 503
return {"ready": health_checker.is_ready()}
Why Liveness And Readiness Are Separate
| Check | What it answers | When it matters |
|---|---|---|
| Liveness (`/ping`) | "Is the process running?" | SageMaker restart decision |
| Readiness (`/ready`) | "Can it serve traffic?" | Traffic routing decision |
A process can be alive but not ready during the 67-second startup window. If /ping and /ready were the same, SageMaker would either route traffic to an unready instance (causing 503s) or restart a healthy instance that is still loading (causing restart loops).
7. Graceful Shutdown
Why Graceful Shutdown Matters
During auto-scaling scale-in or deployment updates, SageMaker sends SIGTERM to the container. Without graceful shutdown:
- In-flight requests get killed mid-generation → users see truncated or error responses
- Active WebSocket/SSE streams disconnect abruptly → frontend shows "connection lost"
- KV cache state is lost without warning → requests already in prefill restart from scratch
Shutdown Sequence
sequenceDiagram
participant SM as SageMaker
participant EP as Entrypoint
participant GW as vLLM Gateway
participant ENGINE as vLLM Engine
SM->>EP: SIGTERM
EP->>GW: Stop accepting new requests
Note over GW: /ping returns 503\n/ready returns 503
GW->>ENGINE: Wait for in-flight requests (max 30s)
Note over ENGINE: Active generations\ncomplete normally
ENGINE-->>GW: All generations done
GW->>EP: Clean exit
EP-->>SM: Exit 0
Note over SM: Instance removed\nfrom endpoint
Shutdown Implementation
import asyncio
import signal
import logging
logger = logging.getLogger(__name__)
class GracefulShutdown:
def __init__(self, engine, max_drain_seconds: int = 30) -> None:
self.engine = engine
self.max_drain_seconds = max_drain_seconds
self.shutting_down = False
def install_signal_handlers(self) -> None:
loop = asyncio.get_event_loop()
for sig in (signal.SIGTERM, signal.SIGINT):
loop.add_signal_handler(sig, lambda s=sig: asyncio.create_task(self.shutdown(s)))
async def shutdown(self, sig: signal.Signals) -> None:
logger.info(f"Received {sig.name}, starting graceful shutdown")
self.shutting_down = True
# Stop accepting new requests (health checks return 503)
logger.info("Stopped accepting new requests")
# Wait for in-flight requests to complete
waited = 0
while waited < self.max_drain_seconds:
active = self.engine.get_num_unfinished_requests()
if active == 0:
logger.info(f"All requests drained after {waited}s")
break
logger.info(f"Draining: {active} requests still in-flight ({waited}s)")
await asyncio.sleep(1)
waited += 1
if waited >= self.max_drain_seconds:
active = self.engine.get_num_unfinished_requests()
logger.warning(f"Drain timeout: {active} requests still active, forcing shutdown")
logger.info("Shutdown complete")
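Wiring this class into the server is a small amount of glue. A minimal sketch, assuming the vLLM async engine object and the register_health_routes / health_checker pieces from Section 6 are importable (the builder function name is illustrative):
# Illustrative glue: install the drain logic when the app starts. `engine` stands in for
# the vLLM async engine created at startup. The "stop accepting new requests" step inside
# shutdown() is the natural place to flip health_checker so /ping and /ready return 503.
from fastapi import FastAPI

def build_app(engine) -> FastAPI:
    app = FastAPI()
    register_health_routes(app)                       # /ping and /ready from Section 6
    drainer = GracefulShutdown(engine, max_drain_seconds=30)

    @app.on_event("startup")
    async def _install_shutdown_handlers() -> None:
        drainer.install_signal_handlers()             # needs a running event loop

    return app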
8. SageMaker Endpoint Deployment Configuration
Endpoint Structure
import sagemaker
from sagemaker.model import Model
from sagemaker.predictor import Predictor
def deploy_vllm_endpoint(
role: str,
image_uri: str,
model_data: str = None, # None when model is baked into image
instance_type: str = "ml.g5.xlarge",
initial_instance_count: int = 4,
) -> Predictor:
model = Model(
image_uri=image_uri,
role=role,
env={
"MODEL_PATH": "/opt/ml/model/base",
"LORA_PATH": "/opt/ml/model/lora",
"VLLM_PORT": "8080",
},
)
predictor = model.deploy(
initial_instance_count=initial_instance_count,
instance_type=instance_type,
endpoint_name="mangaassist-vllm-prod",
container_startup_health_check_timeout=300, # 5 min for model load + warmup
volume_size=50, # GB, for model artifacts if not baked in
)
return predictor
Canary Deployment With Production Variants
from sagemaker.session import production_variant
def deploy_canary(image_uri_prod: str, image_uri_canary: str, role: str) -> None:
"""Deploy canary variant alongside production for safe rollouts."""
variants = [
production_variant(
model_name="mangaassist-vllm-prod",
instance_type="ml.g5.xlarge",
initial_instance_count=4,
variant_name="primary",
initial_weight=95, # 95% of traffic
),
production_variant(
model_name="mangaassist-vllm-canary",
instance_type="ml.g5.xlarge",
initial_instance_count=1,
variant_name="canary",
initial_weight=5, # 5% of traffic
),
]
sagemaker.Session().endpoint_from_production_variants(
name="mangaassist-vllm-prod",
production_variants=variants,
)
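Once the canary variant looks healthy, traffic can be shifted in place without a redeploy; a sketch using the endpoint and variant names above (weights are relative, so 50/50 means an even split):
import boto3

sm_client = boto3.client("sagemaker")

# Promote the canary by shifting traffic weights on the live endpoint.
sm_client.update_endpoint_weights_and_capacities(
    EndpointName="mangaassist-vllm-prod",
    DesiredWeightsAndCapacities=[
        {"VariantName": "primary", "DesiredWeight": 50},
        {"VariantName": "canary", "DesiredWeight": 50},
    ],
)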
Blue/Green Update Configuration
import boto3
sm_client = boto3.client("sagemaker")
# Roll out a new endpoint config via a blue/green update with a bounded termination wait.
# EndpointConfigName is required by update_endpoint; the config name here is illustrative.
sm_client.update_endpoint(
    EndpointName="mangaassist-vllm-prod",
    EndpointConfigName="mangaassist-vllm-prod-config-v2",
    RetainAllVariantProperties=True,
    DeploymentConfig={
"BlueGreenUpdatePolicy": {
"TrafficRoutingConfiguration": {
"Type": "ALL_AT_ONCE",
"WaitIntervalInSeconds": 60,
},
"TerminationWaitInSeconds": 120,
"MaximumExecutionTimeoutInSeconds": 600,
},
},
)
9. Auto-Scaling Policies
Scaling Configuration
import boto3
asg_client = boto3.client("application-autoscaling")
# Register scalable target
asg_client.register_scalable_target(
ServiceNamespace="sagemaker",
ResourceId="endpoint/mangaassist-vllm-prod/variant/primary",
ScalableDimension="sagemaker:variant:DesiredInstanceCount",
MinCapacity=2, # Always keep 2 for redundancy
MaxCapacity=8, # Cost ceiling
)
# Scale-out policy: react fast to queue buildup
asg_client.put_scaling_policy(
PolicyName="mangaassist-vllm-scale-out",
ServiceNamespace="sagemaker",
ResourceId="endpoint/mangaassist-vllm-prod/variant/primary",
ScalableDimension="sagemaker:variant:DesiredInstanceCount",
PolicyType="TargetTrackingScaling",
TargetTrackingScalingPolicyConfiguration={
"TargetValue": 50.0, # Target invocations per instance
"PredefinedMetricSpecification": {
"PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
},
"ScaleOutCooldown": 60, # Fast reaction to spikes
"ScaleInCooldown": 300, # Slow scale-in to avoid flapping
},
)
Custom Metric-Based Scaling
SageMaker's built-in metric (invocations per instance) is a reasonable proxy, but for a GPU inference workload, custom metrics are more accurate:
| Metric | Scale-out trigger | Scale-in trigger | Why |
|---|---|---|---|
| `queue_depth` | > 50 for 60s | < 5 for 300s | Direct measure of backpressure |
| `active_sequences` | > 110 (85% of 128) for 60s | < 30 for 300s | GPU saturation signal |
| `gpu_memory_used_pct` | > 90% for 60s | < 40% for 300s | Memory pressure signal |
| `p95_queue_wait_ms` | > 200 ms for 120s | < 20 ms for 300s | User-visible latency impact |
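Using one of these custom metrics means publishing it to CloudWatch from the serving instances and pointing a target-tracking policy at it. A sketch for queue_depth, assuming the metric is published under a hypothetical MangaAssist/vLLM namespace with matching dimensions:
import boto3

# 1) Publish the custom metric from each serving instance. The namespace, dimensions,
#    and sample value are assumptions and must match whatever the exporter actually emits.
cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_data(
    Namespace="MangaAssist/vLLM",
    MetricData=[{
        "MetricName": "queue_depth",
        "Dimensions": [{"Name": "EndpointName", "Value": "mangaassist-vllm-prod"}],
        "Value": 12.0,
        "Unit": "Count",
    }],
)

# 2) Target-track on that metric instead of the built-in invocations metric.
asg_client = boto3.client("application-autoscaling")
asg_client.put_scaling_policy(
    PolicyName="mangaassist-vllm-queue-depth",
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/mangaassist-vllm-prod/variant/primary",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 50.0,  # average queue_depth per instance to hold
        "CustomizedMetricSpecification": {
            "MetricName": "queue_depth",
            "Namespace": "MangaAssist/vLLM",
            "Dimensions": [{"Name": "EndpointName", "Value": "mangaassist-vllm-prod"}],
            "Statistic": "Average",
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
)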
Asymmetric Cooldown Strategy
| Direction | Cooldown | Why |
|---|---|---|
| Scale-out | 60 seconds | Users feel queueing immediately. React fast. |
| Scale-in | 300 seconds | Premature scale-in causes oscillation. Wait for sustained low traffic. |
Predictive scaling: For the daily traffic pattern (peaks during JP evening hours 18:00–22:00 JST and US evening hours 18:00–22:00 EST), we use scheduled scaling actions that pre-provision instances 15 minutes before the expected peak rather than waiting for reactive scaling.
# Pre-provision for JP evening peak
asg_client.put_scheduled_action(
ServiceNamespace="sagemaker",
ScheduledActionName="jp-evening-peak",
ResourceId="endpoint/mangaassist-vllm-prod/variant/primary",
ScalableDimension="sagemaker:variant:DesiredInstanceCount",
Schedule="cron(45 8 * * ? *)", # 17:45 JST = 08:45 UTC
ScalableTargetAction={"MinCapacity": 6, "MaxCapacity": 8},
)
# Scale back after JP evening
asg_client.put_scheduled_action(
ServiceNamespace="sagemaker",
ScheduledActionName="jp-evening-end",
ResourceId="endpoint/mangaassist-vllm-prod/variant/primary",
ScalableDimension="sagemaker:variant:DesiredInstanceCount",
Schedule="cron(0 14 * * ? *)", # 23:00 JST = 14:00 UTC
ScalableTargetAction={"MinCapacity": 2, "MaxCapacity": 8},
)
10. Environment Variable Reference
| Variable | Default | Description | Impact of wrong value |
|---|---|---|---|
| `MODEL_PATH` | `/opt/ml/model/base` | Path to base model weights | Engine fails to start |
| `LORA_PATH` | `/opt/ml/model/lora` | Path to LoRA adapter directory | Adapters unavailable, falls back to base model |
| `VLLM_PORT` | `8080` | HTTP server port | Health checks fail, no traffic |
| `VLLM_GPU_MEMORY_UTILIZATION` | `0.92` | Fraction of VRAM allocated to vLLM | Too high: OOM; too low: reduced concurrency |
| `VLLM_MAX_MODEL_LEN` | `4096` | Maximum context window in tokens | Too high: fewer concurrent seqs; too low: truncated conversations |
| `VLLM_MAX_NUM_SEQS` | `128` | Maximum concurrent sequences | Too high: OOM; too low: artificial queueing |
| `VLLM_MAX_NUM_BATCHED_TOKENS` | `8192` | Max tokens in active batch | Too high: TTFT spike; too low: throughput loss |
| `VLLM_BLOCK_SIZE` | `16` | Tokens per KV cache block | Too small: block table overhead; too large: fragmentation waste |
| `VLLM_LOG_LEVEL` | `info` | Engine log verbosity | `debug` in prod: log volume explosion |
| `WARMUP_NUM_REQUESTS` | `10` | Synthetic requests before ready | 0: cold first-request penalty; too many: slow startup |
What Each Knob Controls
gpu_memory_utilization=0.92: This is a stability boundary, not a performance knob. Setting it to 0.98 would add ~1.4 GB of usable KV cache blocks but risks OOM when CUDA workspace allocations spike during certain operations (graph capture, large batch prefill). We tested 0.90, 0.92, and 0.95 under sustained load:
- 0.90: Stable, 118 max concurrent sequences
- 0.92: Stable, 128 max concurrent sequences ← selected
- 0.95: Intermittent OOM under bursty traffic (2–3 per day)
max_num_seqs=128: The concurrency ceiling. At 128 sequences with block_size=16 and max_model_len=4096, the worst-case KV cache is ~14 GB. With our observed median response length (~180 tokens), the average KV cache usage is ~3.5 GB, leaving substantial room. We chose 128 because:
- It was the highest value that passed 24-hour soak tests without OOM at gpu_memory_utilization=0.92
- It provided 2× headroom above our observed peak concurrency (~60 simultaneous requests)
- Going higher (256) caused occasional preemptions that hurt tail latency
block_size=16: Each block holds KV tensors for 16 tokens. Smaller blocks (8) reduce internal fragmentation but increase block-table memory overhead. Larger blocks (32) reduce overhead but waste more memory on sequences that do not fill the last block. For our median response length of 180 tokens, block_size=16 means ~11.25 blocks per sequence with <16 tokens of waste in the last block.
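As a quick check of that block accounting, the median case works out as follows (a small sketch that only restates the numbers above):
import math

# Block accounting for block_size=16 and the observed median response of ~180 tokens.
BLOCK_SIZE = 16
median_response_tokens = 180

blocks_needed = math.ceil(median_response_tokens / BLOCK_SIZE)        # 12 blocks allocated
tokens_wasted = blocks_needed * BLOCK_SIZE - median_response_tokens   # 12 idle tokens in the last block

print(f"{blocks_needed} blocks per sequence, {tokens_wasted} tokens of internal fragmentation")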
11. AWS Deployment Decision: SageMaker Or Bedrock
Current AWS Service Boundary (Validated 2026-03-25)
SageMaker AI and Amazon Bedrock are not interchangeable deployment targets for this design:
- SageMaker AI is the right choice when we want to run our own vLLM container on GPU instances and keep control over CUDA, model loading, health endpoints, autoscaling behavior, LoRA loading, and Prometheus metrics.
- Amazon Bedrock is the right choice when we want AWS to own the serving runtime. In that model we do not deploy our Docker image or our vLLM process. We export supported model artifacts to S3, import them into Bedrock, and invoke the resulting model through Bedrock APIs.
Decision Matrix
| Requirement | SageMaker AI | Amazon Bedrock | MangaAssist implication |
|---|---|---|---|
| Run our own vLLM Docker image | Yes | No | SageMaker only |
| Keep custom `/ping` and `/invocations` handling | Yes | No | SageMaker only |
| Tune `gpu_memory_utilization`, `max_num_seqs`, prefix caching, CUDA graphs | Yes | No | SageMaker only |
| Serve multiple LoRA adapters on one base model | Yes | Not exposed as a Bedrock runtime primitive | Current design fits SageMaker |
| Use Bedrock-native APIs, Guardrails, Agents, and Converse | Indirect | Yes | Bedrock advantage |
| Reduce infrastructure operations burden | No | Yes | Bedrock advantage |
| Import supported Hugging Face model artifacts from S3 | Possible but self-managed | Yes | Bedrock fit |
| Support arbitrary vLLM backends and custom kernels | Yes | No | SageMaker only |
Recommended Default For MangaAssist
For the architecture described in this document, SageMaker is the primary deployment target. The design depends on vLLM-specific features such as AWQ quantization, Multi-LoRA loading, prefix caching, custom health/readiness logic, and queue-aware autoscaling. Bedrock becomes attractive only if the priority shifts from maximum runtime control to lower operational overhead and Bedrock-native governance.
Implementation Checklist If We Choose SageMaker
- Build the runtime image described in Sections 4-5, including the base model, LoRA adapters, health routes, warmup script, and entrypoint.
- Expose a SageMaker-compatible inference surface that accepts `/ping` and `/invocations` while forwarding inference to the local vLLM OpenAI-compatible endpoint.
- Push the image to ECR and create a SageMaker `Model`, `EndpointConfig`, and `Endpoint`.
- Set `container_startup_health_check_timeout` high enough to cover weight loading plus warmup.
- Enable autoscaling and blue/green rollout policies from Sections 8-9.
- Point the application gateway at the SageMaker endpoint and keep Bedrock out of the request path.
Minimal SageMaker Request Adapter
from fastapi import FastAPI, Request, Response
import httpx
import os
app = FastAPI()
vllm_base = os.getenv("VLLM_BASE_URL", "http://127.0.0.1:8080")
@app.get("/ping")
async def ping() -> Response:
    try:
        async with httpx.AsyncClient(timeout=2.0) as client:
            upstream = await client.get(f"{vllm_base}/health")
        return Response(status_code=200 if upstream.status_code == 200 else 503)
    except httpx.HTTPError:
        # Engine unreachable (still loading or crashed): report unhealthy instead of raising a 500
        return Response(status_code=503)
@app.post("/invocations")
async def invocations(request: Request) -> Response:
payload = await request.json()
vllm_payload = {
"model": payload.get("model", "manga_domain_v3"),
"messages": payload["messages"],
"max_tokens": payload.get("max_tokens", 256),
"temperature": payload.get("temperature", 0.2),
"stream": payload.get("stream", False),
}
async with httpx.AsyncClient(timeout=70.0) as client:
upstream = await client.post(
f"{vllm_base}/v1/chat/completions",
json=vllm_payload,
)
return Response(
content=upstream.text,
status_code=upstream.status_code,
media_type="application/json",
)
This adapter keeps the external SageMaker contract stable while preserving vLLM's OpenAI-compatible API internally. It is also the cleanest place to add auth, request normalization, admission control, or future multi-backend routing.
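As one example of what the adapter can absorb, a minimal admission-control guard might cap concurrent in-flight requests per instance below the engine's max_num_seqs, so overload turns into a fast 429 rather than unbounded queueing. The limit value and the guarded route name are assumptions, not tuned numbers:
# Illustrative admission control layered on the adapter above.
import asyncio
from fastapi import HTTPException

MAX_IN_FLIGHT = 100                        # below max_num_seqs=128, leaving scheduler headroom
_in_flight = asyncio.Semaphore(MAX_IN_FLIGHT)

@app.post("/invocations-guarded")          # hypothetical variant of the /invocations route above
async def invocations_guarded(request: Request) -> Response:
    if _in_flight.locked():                # every permit taken: shed load immediately
        raise HTTPException(status_code=429, detail="Instance at capacity, retry shortly")
    async with _in_flight:
        return await invocations(request)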
12. Bedrock Deployment Path For The Same Model
What Changes When We Move From SageMaker To Bedrock
The deployment unit changes completely:
- SageMaker path: ship a container image that starts vLLM on GPU infrastructure we control.
- Bedrock path: ship model artifacts in Hugging Face format to S3, submit a Bedrock Custom Model Import job, and invoke the imported model through Bedrock APIs.
That means Bedrock is not "host the same vLLM container somewhere else on AWS." It is a managed model-import and inference path.
Bedrock Import Constraints
| Constraint | Bedrock requirement | Why it matters |
|---|---|---|
| Supported architectures | Llama 2/3/3.1/3.2/3.3/Mllama, Mistral, Mixtral, Flan-T5, GPTBigCode, Qwen2/2.5/2-VL/2.5-VL/3, GPT-OSS | Verify model family before choosing Bedrock |
| Model artifact format | Hugging Face weights in S3 | Export from training pipeline in HF format |
| Required files | `.safetensors`, `config.json`, `tokenizer_config.json`, `tokenizer.json`, `tokenizer.model` | Package tokenizer with the model |
| Imported model size | < 200 GB for text, < 100 GB for multimodal | Fine for current 8B design |
| Max context length | < 128K | Fine for current 4K design |
| Transformers compatibility | 4.51.3 when fine-tuning for import compatibility | Pin training/export image if Bedrock is a target |
| Unsupported import target | Embedding models | Keep embeddings on a separate stack |
| Feature exclusions | Custom Model Import does not support Batch inference or CloudFormation | Plan operations around API or console flows |
Suggested Artifact Layout In S3
s3://mangaassist-bedrock-import/llama3-domain-v1/
config.json
generation_config.json
tokenizer.json
tokenizer_config.json
tokenizer.model
model-00001-of-00004.safetensors
model-00002-of-00004.safetensors
model-00003-of-00004.safetensors
model-00004-of-00004.safetensors
model.safetensors.index.json
Treat Bedrock packaging as its own promotion target. Do not assume every artifact layout optimized for vLLM runtime serving is automatically importable into Bedrock without a validation step in CI/CD.
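That validation step can be as simple as confirming that the required files from the constraints table exist under the S3 prefix before the import job is submitted. A sketch using the bucket and prefix from the layout above (the helper itself is illustrative):
import boto3

# Pre-import validation: fail the pipeline if the prefix is missing required Bedrock files.
REQUIRED_FILES = {"config.json", "tokenizer_config.json", "tokenizer.json", "tokenizer.model"}

def validate_import_prefix(bucket: str, prefix: str) -> None:
    s3 = boto3.client("s3")
    keys = set()
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        keys.update(obj["Key"].removeprefix(prefix) for obj in page.get("Contents", []))
    missing = REQUIRED_FILES - keys
    if missing:
        raise RuntimeError(f"Bedrock import prefix is missing required files: {sorted(missing)}")
    if not any(k.endswith(".safetensors") for k in keys):
        raise RuntimeError("Bedrock import prefix has no .safetensors weight shards")

validate_import_prefix("mangaassist-bedrock-import", "llama3-domain-v1/")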
Submit The Model Import Job
import boto3
bedrock = boto3.client("bedrock", region_name="us-east-1")
response = bedrock.create_model_import_job(
jobName="mangaassist-llama3-import-v1",
importedModelName="mangaassist-llama3-domain-v1",
roleArn="arn:aws:iam::123456789012:role/bedrock-model-import-role",
modelDataSource={
"s3DataSource": {
"s3Uri": "s3://mangaassist-bedrock-import/llama3-domain-v1/"
}
},
)
job_arn = response["jobArn"]
print(job_arn)
Poll Until Import Completes
import time
while True:
status = bedrock.get_model_import_job(jobIdentifier=job_arn)
print(status["status"])
if status["status"] == "Completed":
imported_model_arn = status["importedModelArn"]
break
if status["status"] == "Failed":
raise RuntimeError(status["failureMessage"])
time.sleep(30)
Invoke The Imported Model With Converse
import boto3
import json
runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
response = runtime.converse(
modelId=imported_model_arn,
system=[
{
"text": "You are a manga shopping assistant."
}
],
messages=[
{
"role": "user",
"content": [
{
"text": "Recommend action manga under $20."
}
],
}
],
inferenceConfig={
"maxTokens": 200,
"temperature": 0.2,
},
)
print(json.dumps(response, indent=2, default=str))
API Compatibility Note
For imported models created after 2025-11-11, Bedrock Custom Model Import supports Bedrock-native completion format plus OpenAI-compatible completion and chat-completion payloads. That reduces application rewrite cost if the client already speaks an OpenAI-like schema.
Operational Consequences
When we choose Bedrock, we give up direct control over the vLLM runtime:
- No custom container startup sequence
- No direct `vllm serve` flags
- No direct tuning of `gpu_memory_utilization`, `max_num_seqs`, KV cache, or CUDA graph behavior
- No runtime-loaded Multi-LoRA topology from this document
Operationally, the safest Bedrock strategy is to promote a fully materialized model version per domain or release, not depend on runtime adapter loading semantics.
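In practice that means merging each LoRA adapter into base weights offline and exporting the merged checkpoint in Hugging Face format for the import bucket. A hedged sketch using peft in the offline training environment (paths, model IDs, and adapter names are illustrative; the merge uses an unquantized base rather than the AWQ serving artifact):
# Offline sketch: materialize a domain-specific model by merging a LoRA adapter into the
# base weights, then save in HF format for the Bedrock import prefix. Names are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "meta-llama/Meta-Llama-3-8B-Instruct"        # unquantized base for a clean merge
ADAPTER = "adapters/manga_domain_v3"                 # local copy of the adapter weights
OUT = "export/llama3-domain-v1"

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto")
merged = PeftModel.from_pretrained(base, ADAPTER).merge_and_unload()

merged.save_pretrained(OUT, safe_serialization=True)       # emits .safetensors shards
AutoTokenizer.from_pretrained(BASE).save_pretrained(OUT)    # tokenizer files required by Bedrock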
When Bedrock Is Worth It
Choose Bedrock over SageMaker if most of these are true:
- The target model architecture is supported by Custom Model Import.
- The platform team wants Bedrock-native governance, Guardrails, Agents, or Converse APIs.
- The application can live with less direct control over scheduling, caching, and LoRA loading.
- Lower infrastructure operations burden matters more than squeezing the last bit of runtime efficiency out of vLLM.
13. Cross-References
- Scenario narratives that motivate these decisions: 01-vllm-game-changer-scenarios.md
- Low-level implementation patterns for each scenario: 02-vllm-low-level-implementation-and-critical-decisions.md
- Monitoring and troubleshooting: 05-vllm-monitoring-and-troubleshooting.md
- Model preparation and quantization: 06-vllm-model-preparation-and-quantization.md
- Interview prep questions: 03-vllm-interview-prep-deep-dive.md
- Docker patterns in the broader system: ../../Docker-Interview-Pack/02-docker-scenarios-lld.md
- GPU architecture challenges: ../../Model-Inference/06-gpu-architecture-challenges.md