vLLM Monitoring And Troubleshooting For MangaAssist
Complete monitoring, alerting, SLO, and troubleshooting guide for the vLLM-powered self-hosted generation path. Covers Prometheus metrics, CloudWatch integration, alerting rules, dashboards, and a diagnostics runbook for every failure mode we have seen in production.
1. Scope
This document answers:
- What metrics does vLLM expose and what do they mean?
- What custom application metrics do we emit?
- How do metrics flow from vLLM to CloudWatch and dashboards?
- What alerting rules protect the service?
- What SLOs define "healthy" for the self-hosted generation path?
- How do you diagnose and fix common production issues?
2. Metrics Architecture
graph TD
subgraph "vLLM Engine"
VM1["Built-in Prometheus metrics\n(:9090/metrics)"]
end
subgraph "vLLM Gateway (Application)"
AM1["Custom inference metrics"]
AM2["Trace attributes (MLflow)"]
end
subgraph "Collection Layer"
PROM["Prometheus\n(sidecar or remote)"]
CWA["CloudWatch Agent"]
end
subgraph "Storage and Query"
CW["CloudWatch Metrics"]
MLF["MLflow Tracking Server"]
end
subgraph "Consumption"
DASH["CloudWatch Dashboard:\nMangaAssist Inference"]
ALERT["CloudWatch Alarms"]
GRAFANA["Grafana\n(optional, for Prometheus native)"]
end
VM1 --> PROM
AM1 --> CWA
AM2 --> MLF
PROM --> CW
CWA --> CW
CW --> DASH
CW --> ALERT
PROM --> GRAFANA
3. vLLM Built-In Prometheus Metrics
vLLM exposes these metrics at the /metrics endpoint (port 9090 in our deployment). These are engine-level metrics that require no application code.
| Metric | Type | What it measures | Why it matters |
|---|---|---|---|
| vllm:num_requests_running | Gauge | Active sequences in the engine | GPU utilization signal |
| vllm:num_requests_waiting | Gauge | Requests queued, waiting for decode slots | Primary backpressure signal |
| vllm:num_requests_swapped | Gauge | Requests swapped to CPU (preempted) | Memory pressure indicator |
| vllm:gpu_cache_usage_perc | Gauge | Percentage of KV cache blocks in use | Memory saturation signal |
| vllm:cpu_cache_usage_perc | Gauge | CPU cache usage (for swap) | Should stay near 0 in healthy ops |
| vllm:num_preemptions_total | Counter | Times a sequence was preempted | High count = memory pressure |
| vllm:prompt_tokens_total | Counter | Total input tokens processed | Throughput tracking |
| vllm:generation_tokens_total | Counter | Total output tokens generated | Throughput tracking |
| vllm:avg_prompt_throughput_toks_per_s | Gauge | Prefill throughput | Prefill performance signal |
| vllm:avg_generation_throughput_toks_per_s | Gauge | Decode throughput | Generation speed signal |
| vllm:time_to_first_token_seconds | Histogram | TTFT distribution | User experience metric |
| vllm:time_per_output_token_seconds | Histogram | Inter-token latency | Streaming smoothness |
| vllm:e2e_request_latency_seconds | Histogram | End-to-end request time | Overall latency signal |
How To Interpret Key Metrics
num_requests_waiting > 0 for sustained periods: The engine cannot keep up with incoming requests. If this persists for >60s, scale out. If it persists after scaling, check for:
- A few very large prompts crowding out others (check max_num_batched_tokens)
- Memory pressure causing preemptions (check gpu_cache_usage_perc)
- An upstream admission control failure (requests bypassing the queue limit)
gpu_cache_usage_perc > 95%: The KV block allocator is near saturation. New sequences will be queued or preempted. Check:
- Has max_model_len been increased without reducing max_num_seqs?
- Is there a context budget bypass (requests arriving with 4000+ tokens)?
- Has an adapter update increased per-request memory?
num_preemptions_total rising: Sequences are being swapped out of GPU to make room for new requests. This is vLLM's memory pressure relief valve. A few preemptions per hour is normal under peak traffic. More than that means the concurrency or memory configuration is too aggressive.
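On-call engineers often want these three signals in one place without opening a dashboard. A minimal sketch that scrapes the engine's Prometheus endpoint (port and metric names as in the table above; the thresholds in the comments repeat the guidance here, and the helper name is illustrative):
"""Quick health snapshot from the vLLM engine's Prometheus endpoint."""
import httpx

METRICS_URL = "http://localhost:9090/metrics"  # engine metrics port in our deployment

def _first_sample(text: str, name: str) -> float:
    # Return the first sample value for a metric name, ignoring labels.
    for line in text.splitlines():
        if line.startswith(name):
            return float(line.rsplit(" ", 1)[-1])
    return 0.0

def engine_health_snapshot() -> dict:
    text = httpx.get(METRICS_URL, timeout=5.0).text
    return {
        # Sustained > 0 means the engine cannot keep up; scale out after ~60s.
        "queued_requests": _first_sample(text, "vllm:num_requests_waiting"),
        # Near the 95% threshold means KV cache saturation.
        "gpu_cache_usage": _first_sample(text, "vllm:gpu_cache_usage_perc"),
        # Counter; diff two snapshots to get the preemption rate.
        "preemptions_total": _first_sample(text, "vllm:num_preemptions_total"),
    }

if __name__ == "__main__":
    print(engine_health_snapshot())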
4. Custom Application Metrics
These metrics are emitted by the vLLM gateway application code, not by the vLLM engine itself. They provide business-level observability.
Metric Definitions
"""
Custom inference metrics for the MangaAssist vLLM gateway.
Emitted to CloudWatch via the CloudWatch Agent.
"""
from dataclasses import dataclass, field
import boto3

@dataclass
class InferenceMetrics:
    namespace: str = "MangaAssist/Inference"
    _cw_client: object = field(default=None, repr=False)

    def __post_init__(self) -> None:
        # Reuse an injected client (e.g. a stub in tests); otherwise create one.
        if self._cw_client is None:
            self._cw_client = boto3.client("cloudwatch")
def emit_request_metrics(
self,
request_id: str,
queue_wait_ms: float,
ttft_ms: float,
generation_ms: float,
input_tokens: int,
output_tokens: int,
adapter_id: str,
backend_version: str,
prefix_cache_hit: bool,
token_budget_trimmed: bool,
) -> None:
metrics = [
self._metric("QueueWaitMs", queue_wait_ms, "Milliseconds"),
self._metric("TTFTMs", ttft_ms, "Milliseconds"),
self._metric("GenerationMs", generation_ms, "Milliseconds"),
self._metric("InputTokens", input_tokens, "Count"),
self._metric("OutputTokens", output_tokens, "Count"),
self._metric("PrefixCacheHit", 1.0 if prefix_cache_hit else 0.0, "Count"),
self._metric("TokenBudgetTrimmed", 1.0 if token_budget_trimmed else 0.0, "Count"),
]
dimensions = [
{"Name": "AdapterId", "Value": adapter_id},
{"Name": "BackendVersion", "Value": backend_version},
]
self._cw_client.put_metric_data(
Namespace=self.namespace,
MetricData=[{**m, "Dimensions": dimensions} for m in metrics],
)
def emit_oom_caught(self, adapter_id: str) -> None:
self._cw_client.put_metric_data(
Namespace=self.namespace,
MetricData=[{
"MetricName": "OOMCaught",
"Value": 1.0,
"Unit": "Count",
"Dimensions": [{"Name": "AdapterId", "Value": adapter_id}],
}],
)
def emit_admission_rejected(self, reason: str) -> None:
self._cw_client.put_metric_data(
Namespace=self.namespace,
MetricData=[{
"MetricName": "AdmissionRejected",
"Value": 1.0,
"Unit": "Count",
"Dimensions": [{"Name": "Reason", "Value": reason}],
}],
)
@staticmethod
def _metric(name: str, value: float, unit: str) -> dict:
return {"MetricName": name, "Value": value, "Unit": unit}
Metric Catalog
| Metric | Type | Dimensions | Why it matters |
|---|---|---|---|
| QueueWaitMs | Histogram | adapter_id, backend_version | Separates scheduler delay from model compute |
| TTFTMs | Histogram | adapter_id, backend_version | User-perceived responsiveness benchmark |
| GenerationMs | Histogram | adapter_id, backend_version | Model compute time |
| InputTokens | Histogram | adapter_id | Prompt size distribution for capacity planning |
| OutputTokens | Histogram | adapter_id | Generation length distribution |
| PrefixCacheHit | Counter | adapter_id | Validates prompt determinism |
| TokenBudgetTrimmed | Counter | adapter_id | How often context budgeting activates |
| OOMCaught | Counter | adapter_id | OOM containment frequency |
| AdmissionRejected | Counter | reason | Admission control activation frequency |
Trace Attributes (MLflow)
Every inference request creates a trace span with these structured attributes:
import mlflow
def trace_inference_request(
request_id: str,
session_id: str,
adapter_id: str,
backend_version: str,
prompt_version: str,
input_tokens: int,
output_tokens: int,
queue_wait_ms: float,
ttft_ms: float,
generation_ms: float,
prefix_cache_hit: bool,
finish_reason: str,
) -> None:
with mlflow.start_span(name="vllm_inference") as span:
span.set_attributes({
"request_id": request_id,
"session_id": session_id,
"adapter_id": adapter_id,
"backend_version": backend_version,
"prompt_version": prompt_version,
"input_tokens": input_tokens,
"output_tokens": output_tokens,
"queue_wait_ms": queue_wait_ms,
"ttft_ms": ttft_ms,
"generation_ms": generation_ms,
"prefix_cache_hit": prefix_cache_hit,
"finish_reason": finish_reason,
})
5. Alerting Rules
Critical Alerts (Page On-Call)
| Alert | Condition | Duration | Action |
|---|---|---|---|
| Queue Overflow | num_requests_waiting > 80 | 2 min | Immediate scale-out; check for upstream traffic spike |
| OOM Detected | OOMCaught > 0 | Instant | Check request patterns, context budgets, memory config |
| Endpoint Down | /ping returns non-200 | 1 min | SageMaker auto-replaces; check GPU health, model load |
| Error Rate Spike | 5xx rate > 1% | 5 min | Check engine logs, OOM count, CUDA errors |
Warning Alerts (Investigate Next Business Day)
| Alert | Condition | Duration | Action |
|---|---|---|---|
| High TTFT | P95 TTFT > 500 ms | 5 min | Check prefix cache hit rate, prompt changes |
| GPU Memory Pressure | gpu_cache_usage_perc > 95% | 5 min | Check context budget drift, sequence count |
| Prefix Cache Degradation | Cache hit rate < 50% | 15 min | Check for prompt template changes that broke determinism |
| Adapter Latency Regression | Adapter P95 > 2× baseline | 10 min | Check adapter version, run offline eval, rollback if needed |
| Preemption Rate High | num_preemptions_total rate > 10/min | 10 min | Memory pressure; check max_num_seqs and traffic patterns |
| Queue Wait Creep | P95 queue wait > 200 ms | 10 min | Scale out or check for a few large prompts dominating |
CloudWatch Alarm Configuration
import boto3
cw_client = boto3.client("cloudwatch")
def create_inference_alarms() -> None:
    # Queue overflow alarm (critical tier).
    # Assumes the engine gauge vllm:num_requests_waiting is forwarded into the
    # MangaAssist/Inference namespace by the Prometheus-to-CloudWatch pipeline.
    cw_client.put_metric_alarm(
        AlarmName="mangaassist-vllm-queue-overflow",
        Namespace="MangaAssist/Inference",
        MetricName="vllm:num_requests_waiting",
Statistic="Maximum",
Period=60,
EvaluationPeriods=2,
Threshold=80,
ComparisonOperator="GreaterThanThreshold",
AlarmActions=["arn:aws:sns:us-east-1:ACCOUNT:mangaassist-critical"],
TreatMissingData="notBreaching",
)
# OOM alarm
cw_client.put_metric_alarm(
AlarmName="mangaassist-vllm-oom-detected",
Namespace="MangaAssist/Inference",
MetricName="OOMCaught",
Statistic="Sum",
Period=300,
EvaluationPeriods=1,
Threshold=0,
ComparisonOperator="GreaterThanThreshold",
AlarmActions=["arn:aws:sns:us-east-1:ACCOUNT:mangaassist-critical"],
TreatMissingData="notBreaching",
)
# High TTFT alarm
cw_client.put_metric_alarm(
AlarmName="mangaassist-vllm-high-ttft",
Namespace="MangaAssist/Inference",
MetricName="TTFTMs",
ExtendedStatistic="p95",
Period=300,
EvaluationPeriods=1,
Threshold=500,
ComparisonOperator="GreaterThanThreshold",
AlarmActions=["arn:aws:sns:us-east-1:ACCOUNT:mangaassist-warning"],
TreatMissingData="notBreaching",
)
# Prefix cache degradation
cw_client.put_metric_alarm(
AlarmName="mangaassist-vllm-prefix-cache-degradation",
Namespace="MangaAssist/Inference",
MetricName="PrefixCacheHit",
Statistic="Average",
Period=900, # 15 min
EvaluationPeriods=1,
Threshold=0.5,
ComparisonOperator="LessThanThreshold",
AlarmActions=["arn:aws:sns:us-east-1:ACCOUNT:mangaassist-warning"],
TreatMissingData="notBreaching",
)
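The Error Rate Spike alert from the critical table is not covered above because it needs metric math over two series. A sketch, assuming the gateway emits Errors5xx and RequestCount counters (placeholder names; substitute whatever the gateway actually reports):
def create_error_rate_alarm() -> None:
    # 5xx rate > 1% over 5 minutes, computed with CloudWatch metric math.
    cw_client.put_metric_alarm(
        AlarmName="mangaassist-vllm-error-rate-spike",
        EvaluationPeriods=1,
        Threshold=1.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:ACCOUNT:mangaassist-critical"],
        TreatMissingData="notBreaching",
        Metrics=[
            {
                "Id": "error_rate",
                "Expression": "100 * errors / requests",
                "Label": "5xx error rate (%)",
                "ReturnData": True,
            },
            {
                "Id": "errors",
                "ReturnData": False,
                "MetricStat": {
                    "Metric": {"Namespace": "MangaAssist/Inference", "MetricName": "Errors5xx"},
                    "Period": 300,
                    "Stat": "Sum",
                },
            },
            {
                "Id": "requests",
                "ReturnData": False,
                "MetricStat": {
                    "Metric": {"Namespace": "MangaAssist/Inference", "MetricName": "RequestCount"},
                    "Period": 300,
                    "Stat": "Sum",
                },
            },
        ],
    )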
6. SLO Definitions
Self-Hosted Generation Path SLOs
| SLO | Target | Measurement | Burn rate alert |
|---|---|---|---|
| Availability | 99.9% (43 min downtime/month) | Successful responses / total requests, 30-day rolling | 1% budget consumed in 1 hour |
| P50 TTFT | < 200 ms | 1-hour rolling window | P50 > 300 ms for 30 min |
| P99 Total Latency | < 2,000 ms | 1-hour rolling window | P99 > 3,000 ms for 15 min |
| P95 Queue Wait | < 100 ms | 1-hour rolling window | P95 > 200 ms for 10 min |
| Error Rate | < 0.1% | 24-hour rolling window | > 0.5% in 1 hour |
How SLOs Map To User Experience
| SLO | User experience if violated | Business impact |
|---|---|---|
| Availability < 99.9% | "The chatbot is down" | Lost conversions, support escalation |
| P50 TTFT > 200 ms | "The assistant feels sluggish" | Lower engagement, higher abandonment |
| P99 latency > 2,000 ms | "Sometimes it takes forever" | Frustrated power users, negative reviews |
| Queue wait > 100 ms | "Why is it thinking before it answers?" | Perception of unreliability |
| Error rate > 0.1% | "I got an error, this bot is broken" | Trust damage, hand-off to human agent |
Error Budget Calculation
Monthly error budget at 99.9% availability:
Total minutes: 43,200 (30 days × 24 hours × 60 min)
Budget: 43.2 minutes of downtime
Burn rate alert: If we consume 1% of budget (0.43 min = 26 seconds) in 1 hour,
we are on pace to exhaust the month's budget in 4.2 days.
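The same arithmetic as a small helper, useful when wiring burn-rate alarms:
def error_budget_minutes(availability_target: float, days: int = 30) -> float:
    # 0.999 over 30 days -> 43.2 minutes of allowed downtime.
    return days * 24 * 60 * (1 - availability_target)

def fast_burn_alert_seconds(budget_minutes: float, fraction: float = 0.01) -> float:
    # Downtime within one hour that consumes `fraction` of the monthly budget.
    # 1% of 43.2 minutes is ~26 seconds; at that pace the budget is gone in ~4.2 days.
    return budget_minutes * fraction * 60

budget = error_budget_minutes(0.999)        # 43.2
alert_at = fast_burn_alert_seconds(budget)  # ~25.9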
7. Dashboard Design
Inference Dashboard Panels
The CloudWatch dashboard MangaAssist-Inference has these panels, organized top-to-bottom by urgency:
Row 1: Real-Time Health (auto-refresh 10s)
| Panel | Metric | Visualization | Alert tie-in |
|---|---|---|---|
| Active Sequences | num_requests_running | Number (current) | None |
| Queued Requests | num_requests_waiting | Number (current, red if > 50) | Queue overflow |
| GPU Cache Usage | gpu_cache_usage_perc | Gauge (0–100%) | GPU memory pressure |
| Endpoint Status | /ping success rate | Status (green/red) | Endpoint down |
Row 2: Latency (5-minute aggregation)
| Panel | Metric | Visualization |
|---|---|---|
| TTFT Distribution | TTFTMs P50/P95/P99 | Time series, 3 lines |
| Queue Wait Distribution | QueueWaitMs P50/P95/P99 | Time series, 3 lines |
| Generation Latency | GenerationMs P50/P95/P99 | Time series, 3 lines |
| End-to-End Latency | e2e_request_latency_seconds P50/P95/P99 | Time series, 3 lines |
Row 3: Efficiency Indicators (15-minute aggregation)
| Panel | Metric | Visualization |
|---|---|---|
| Prefix Cache Hit Rate | PrefixCacheHit average | Time series, target line at 70% |
| Token Budget Trimming Rate | TokenBudgetTrimmed rate | Time series |
| Preemptions | num_preemptions_total rate | Time series |
| Throughput | prompt_tokens_total + generation_tokens_total rate | Stacked area |
Row 4: Adapter Performance (1-hour aggregation)
| Panel | Metric | Visualization |
|---|---|---|
| Latency by Adapter | GenerationMs by adapter_id | Multi-line time series |
| Request Volume by Adapter | Request count by adapter_id | Stacked bar |
| Error Rate by Adapter | Error count by adapter_id | Time series |
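The dashboard itself is managed as code. A trimmed sketch of how two of the panels above map onto put_dashboard (only two widgets shown; layout values are illustrative):
import json
import boto3

cw = boto3.client("cloudwatch")

def put_inference_dashboard() -> None:
    body = {
        "widgets": [
            {
                "type": "metric", "x": 0, "y": 0, "width": 6, "height": 6,
                "properties": {
                    "title": "Queued Requests",
                    "view": "singleValue",
                    "region": "us-east-1",
                    "period": 60,
                    "stat": "Maximum",
                    "metrics": [["MangaAssist/Inference", "vllm:num_requests_waiting"]],
                },
            },
            {
                "type": "metric", "x": 6, "y": 0, "width": 12, "height": 6,
                "properties": {
                    "title": "TTFT Distribution",
                    "view": "timeSeries",
                    "region": "us-east-1",
                    "period": 300,
                    "metrics": [
                        ["MangaAssist/Inference", "TTFTMs", {"stat": "p50"}],
                        ["MangaAssist/Inference", "TTFTMs", {"stat": "p95"}],
                        ["MangaAssist/Inference", "TTFTMs", {"stat": "p99"}],
                    ],
                },
            },
        ]
    }
    cw.put_dashboard(DashboardName="MangaAssist-Inference", DashboardBody=json.dumps(body))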
8. Troubleshooting Runbook
Issue 1: High TTFT (P95 > 500 ms)
Symptoms:
- Users report "the bot takes a while to start answering"
- TTFT P95 alarm fires
Diagnostic Steps:
1. Check prefix cache hit rate
→ If < 50%: A prompt template change likely broke cache determinism
→ Check recent prompt version deployments
→ Verify no timestamps/random IDs in the cacheable prefix
→ Fix: Revert prompt change or move volatile fields below the prefix boundary
2. If cache hit rate is normal, check queue wait time
→ If queue_wait_ms P95 > 200ms: The engine is saturated
→ Check active_sequences vs max_num_seqs
→ Check if traffic spiked (admission controller should be rejecting)
→ Fix: Scale out (add instances) or strengthen admission control
3. If queue and cache are normal, check input token distribution
→ If average input_tokens has increased: Prompts got bigger
→ Check if retrieval is returning more chunks than expected
→ Check if conversation history is not being properly summarized
→ Fix: Review context budgeting settings, enforce token limits
4. If all above are normal, check GPU health
→ Run: nvidia-smi -q | grep -i "ecc\|retired\|performance"
→ If GPU reports degraded performance: Replace instance
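Steps 1 and 2 above can be pulled in one CloudWatch query to see whether a TTFT spike lines up with a cache hit-rate drop or a queue-wait increase. A sketch using get_metric_data (window and period are illustrative):
from datetime import datetime, timedelta, timezone
import boto3

cw = boto3.client("cloudwatch")

def ttft_diagnosis_series(hours: int = 3) -> dict:
    """Fetch TTFT p95, prefix cache hit rate, and queue wait p95 for the last few hours."""
    # Note: if these metrics are emitted with dimensions (AdapterId, BackendVersion),
    # the same Dimensions list must be added here -- CloudWatch only matches exact sets.
    now = datetime.now(timezone.utc)
    def stat(metric_name: str, stat_name: str) -> dict:
        return {
            "Metric": {"Namespace": "MangaAssist/Inference", "MetricName": metric_name},
            "Period": 300,
            "Stat": stat_name,
        }
    resp = cw.get_metric_data(
        MetricDataQueries=[
            {"Id": "ttft_p95", "MetricStat": stat("TTFTMs", "p95")},
            {"Id": "cache_hit_rate", "MetricStat": stat("PrefixCacheHit", "Average")},
            {"Id": "queue_wait_p95", "MetricStat": stat("QueueWaitMs", "p95")},
        ],
        StartTime=now - timedelta(hours=hours),
        EndTime=now,
    )
    return {r["Id"]: list(zip(r["Timestamps"], r["Values"])) for r in resp["MetricDataResults"]}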
Issue 2: GPU OOM Events
Symptoms:
- OOMCaught alarm fires
- Some responses return degraded/fallback content
Diagnostic Steps:
1. Check gpu_cache_usage_perc at time of OOM
→ If consistently > 95%: Memory configuration too aggressive
→ Consider: Reduce max_num_seqs from 128 to 112
→ Consider: Reduce gpu_memory_utilization from 0.92 to 0.90
2. Check input_token distribution at time of OOM
→ If spiky (occasional very large prompts): Context budgeting bypass
→ Check if request_budgeter is being called before engine
→ Check if a new code path bypasses the budgeter
→ Fix: Ensure all paths go through token budget allocation
3. Check adapter_id correlation
→ If OOMs correlate with a specific adapter: Adapter too large
→ Check adapter rank and weight size
→ Fix: Reduce LoRA rank or use a smaller adapter
4. If no single cause: Check for memory leak
→ Monitor gpu_cache_usage_perc over hours
→ If it trends upward without traffic increase: Possible block leak
→ Fix: Rolling restart of instances (schedule during low traffic)
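For step 4, a quick way to check for an upward trend is to sample the cache gauge periodically and look at the average drift. A sketch (sampling interval and count are illustrative):
import time
import httpx

METRICS_URL = "http://localhost:9090/metrics"

def cache_usage_drift(samples: int = 12, interval_s: float = 300.0) -> float:
    """Sample vllm:gpu_cache_usage_perc and return the average change between samples."""
    values: list[float] = []
    for i in range(samples):
        text = httpx.get(METRICS_URL, timeout=5.0).text
        for line in text.splitlines():
            if line.startswith("vllm:gpu_cache_usage_perc"):
                values.append(float(line.rsplit(" ", 1)[-1]))
                break
        if i < samples - 1:
            time.sleep(interval_s)
    deltas = [b - a for a, b in zip(values, values[1:])]
    # A steady positive drift while traffic is flat points at a block leak.
    return sum(deltas) / len(deltas) if deltas else 0.0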
Issue 3: Prefix Cache Hit Rate Collapse
Symptoms:
- PrefixCacheHit alarm fires (< 50% for 15 min)
- Efficiency drops, TTFT may increase
- No obvious errors or failures
Diagnostic Steps:
1. Check recent deployments
→ Check prompt_version in recent traces
→ If prompt version changed: Verify new prompt structure
→ Compare prefix bytes of old vs new prompt
→ Look for: timestamps, request IDs, A/B test flags in the prefix
2. If no deployment, check prompt construction code
→ Look for: new personalization fields injected before the prefix boundary
→ Look for: dynamic policy blocks that change frequently
→ Fix: Move volatile content after the deterministic prefix boundary
3. Check traffic pattern
→ If traffic is very low (< 10 req/min): Cache entries may be evicted
→ This is normal during off-peak hours
→ Cache hit rate should recover when traffic increases
4. Check vLLM version
→ vLLM 0.4.1 → 0.4.2 changed eviction behavior (known issue)
→ Fix: Upgrade to 0.4.3 which fixed the eviction bug
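For step 1, diffing the cacheable prefix of the old and new rendered prompts usually surfaces the offending field. A sketch (prefix_boundary_chars stands in for however the prompt builder defines the prefix boundary):
import difflib
import hashlib

def compare_prompt_prefixes(old_prompt: str, new_prompt: str, prefix_boundary_chars: int) -> None:
    """Hash and diff the cacheable prefix of two rendered prompts."""
    old_prefix = old_prompt[:prefix_boundary_chars]
    new_prefix = new_prompt[:prefix_boundary_chars]
    print("old prefix sha256:", hashlib.sha256(old_prefix.encode()).hexdigest()[:16])
    print("new prefix sha256:", hashlib.sha256(new_prefix.encode()).hexdigest()[:16])
    if old_prefix != new_prefix:
        # Any timestamp, request ID, or A/B flag that shows up in this diff breaks caching.
        for diff_line in difflib.unified_diff(
            old_prefix.splitlines(), new_prefix.splitlines(), lineterm=""
        ):
            print(diff_line)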
Issue 4: Adapter Latency Regression
Symptoms:
- One adapter's P95 latency is 2×+ its historical baseline
- Other adapters perform normally
Diagnostic Steps:
1. Check if adapter was recently updated
→ Compare adapter version in traces vs the last known-good version
→ If updated: Run offline evaluation against the previous version
→ If regression confirmed: Rollback adapter in the registry
→ If evaluation is clean: Issue is traffic-pattern, not adapter
2. Check adapter weight size
→ If new adapter is larger (higher rank): More memory per request
→ Check if preemption rate increased for this adapter
→ Fix: Reduce rank or increase GPU headroom
3. Check request pattern for this adapter
→ If this adapter handles longer prompts than others: Expected
→ Verify context budgeting is applied equally across adapters
Issue 5: Scaling Event Failures
Symptoms:
- Scaling alarm fires but instance count does not increase
- Or new instances start but /ping fails for > 5 minutes
Diagnostic Steps:
1. Check SageMaker endpoint events
→ Look for: InsufficientInstanceCapacity
→ If capacity issue: Try a different availability zone
→ If persistent: Add ml.g5.2xlarge as a fallback instance type
2. Check warm pool status
→ If warm pool empty and cold start required: Normal 5-8 min wait
→ Fix: Ensure warm pool has at least 1 instance pre-provisioned
3. If instances start but fail health checks
→ Check container logs for startup errors
→ Common: Model file corruption, CUDA version mismatch
→ Fix: Rebuild and re-push Docker image
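The endpoint status, failure reason, and per-variant instance counts needed for steps 1-3 are all available from a single SageMaker API call. A sketch (the endpoint name is a placeholder):
import boto3

sm = boto3.client("sagemaker")

def endpoint_scaling_snapshot(endpoint_name: str = "mangaassist-vllm-endpoint") -> None:
    """Print endpoint status, failure reason, and per-variant instance counts."""
    desc = sm.describe_endpoint(EndpointName=endpoint_name)
    print("Status:", desc["EndpointStatus"])
    if desc.get("FailureReason"):
        # Capacity problems and failed health checks are reported here.
        print("FailureReason:", desc["FailureReason"])
    for variant in desc.get("ProductionVariants", []):
        print(
            variant["VariantName"],
            "current:", variant.get("CurrentInstanceCount"),
            "desired:", variant.get("DesiredInstanceCount"),
        )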
Issue 6: Performance Degradation After vLLM Upgrade
Symptoms:
- After upgrading vLLM version, throughput or latency regressed
- No configuration changes made
Diagnostic Steps:
1. Run benchmark comparison: new version vs old version on identical traffic
→ Use the shadow comparison job in eval/shadow_compare.py
→ Compare: TTFT, generation_ms, queue_wait_ms, prefix_cache_hit_rate
2. Check for breaking changes in vLLM release notes
→ Prefix caching: Eviction behavior changed in 0.4.2
→ Multi-LoRA: Adapter loading semantics changed in 0.4.3
→ Block size: Internal allocation changed in 0.5.0
3. Check CUDA graph behavior
→ If enforce_eager=False, new version may capture different graphs
→ Try: Set enforce_eager=True temporarily to test
→ If performance recovers with eager mode: Graph capture issue
4. Rollback procedure
→ Revert Docker image tag to previous version
→ Redeploy endpoint (takes ~5 min with warm pool)
→ Monitor for 30 min to confirm regression is gone
Diagnostic Commands Quick Reference
# GPU status and utilization
nvidia-smi
nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used,memory.total --format=csv -l 1
# Check for GPU errors
nvidia-smi -q | grep -i "ecc\|retired\|performance\|error"
# vLLM engine status
curl http://localhost:8080/health
curl http://localhost:9090/metrics | grep vllm
# Active sequences and queue depth
curl -s http://localhost:9090/metrics | grep "num_requests"
# GPU cache usage
curl -s http://localhost:9090/metrics | grep "gpu_cache_usage"
# Prefix cache hit rate (custom metric)
curl -s http://localhost:9090/metrics | grep "prefix_cache"
# CUDA memory summary (from Python)
python3 -c "import torch; print(torch.cuda.memory_summary())"
# Check model loading status
curl -s http://localhost:8080/v1/models | python3 -m json.tool
# Test a single inference request
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "manga_domain_v3",
"messages": [{"role": "user", "content": "test"}],
"max_tokens": 5
}'
9. Load Testing Methodology
Purpose
Before deploying a new vLLM version, configuration change, or model update, run load tests to validate:
- TTFT stays within SLO under target concurrency
- Queue wait does not exceed budget under peak traffic
- OOM does not occur during sustained load
- Prefix cache hit rate matches expectations
Load Test Script
"""
Load test for MangaAssist vLLM endpoint.
Simulates realistic chatbot traffic patterns.
"""
import asyncio
import time
import random
import statistics
from dataclasses import dataclass
import httpx
@dataclass
class LoadTestConfig:
endpoint: str
target_rps: float # Requests per second
duration_seconds: int # Total test duration
warmup_seconds: int = 30 # Gradual ramp-up period
max_concurrent: int = 100 # Max concurrent requests
@dataclass
class RequestResult:
ttft_ms: float
total_ms: float
input_tokens: int
output_tokens: int
status_code: int
adapter_id: str
# Realistic MangaAssist request distribution
REQUESTS = [
# Short factual (30% of traffic)
{"weight": 0.30, "adapter": "manga_domain_v3", "msg": "Is Spy x Family volume 12 available in English?", "max_tokens": 30},
# Medium recommendation (40% of traffic)
{"weight": 0.40, "adapter": "manga_domain_v3", "msg": "I liked Vinland Saga. What should I read next?", "max_tokens": 200},
# Long grounded answer (20% of traffic)
{"weight": 0.20, "adapter": "manga_domain_v3", "msg": "Compare all editions of Berserk volume 1.", "max_tokens": 400},
# Support question (10% of traffic)
{"weight": 0.10, "adapter": "general_support_v2", "msg": "Where is my order?", "max_tokens": 100},
]
SYSTEM_PROMPT = "You are a manga shopping assistant for the JP Manga Store."
def pick_request() -> dict:
r = random.random()
cumulative = 0.0
for req in REQUESTS:
cumulative += req["weight"]
if r <= cumulative:
return req
return REQUESTS[-1]
async def send_request(client: httpx.AsyncClient, endpoint: str) -> RequestResult:
req = pick_request()
start = time.monotonic()
payload = {
"model": req["adapter"],
"messages": [
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": req["msg"]},
],
"max_tokens": req["max_tokens"],
"stream": True,
}
ttft = None
total_tokens = 0
    async with client.stream("POST", endpoint, json=payload) as response:
        async for line in response.aiter_lines():
            if not line.startswith("data:"):
                continue
            if line.strip() == "data: [DONE]":
                # End-of-stream sentinel from the OpenAI-compatible server.
                break
            if '"delta"' in line:
                if ttft is None:
                    ttft = (time.monotonic() - start) * 1000
                total_tokens += 1
total_ms = (time.monotonic() - start) * 1000
return RequestResult(
ttft_ms=ttft or total_ms,
total_ms=total_ms,
input_tokens=len(SYSTEM_PROMPT.split()) + len(req["msg"].split()),
output_tokens=total_tokens,
status_code=response.status_code,
adapter_id=req["adapter"],
)
async def run_load_test(config: LoadTestConfig) -> None:
results: list[RequestResult] = []
errors = 0
async with httpx.AsyncClient(timeout=30.0) as client:
sem = asyncio.Semaphore(config.max_concurrent)
async def bounded_request() -> None:
nonlocal errors
async with sem:
try:
result = await send_request(client, f"{config.endpoint}/v1/chat/completions")
results.append(result)
except Exception:
errors += 1
tasks = []
start_time = time.monotonic()
for i in range(int(config.target_rps * config.duration_seconds)):
elapsed = time.monotonic() - start_time
target_time = i / config.target_rps
# Gradual ramp-up during warmup period
if elapsed < config.warmup_seconds:
ramp_factor = elapsed / config.warmup_seconds
target_time = target_time / max(ramp_factor, 0.1)
if target_time > elapsed:
await asyncio.sleep(target_time - elapsed)
tasks.append(asyncio.create_task(bounded_request()))
await asyncio.gather(*tasks)
# Report
if results:
ttfts = [r.ttft_ms for r in results]
totals = [r.total_ms for r in results]
print(f"\n{'='*60}")
print(f"Load Test Results ({len(results)} successful, {errors} errors)")
print(f"{'='*60}")
print(f"TTFT P50: {statistics.median(ttfts):.0f} ms")
print(f"TTFT P95: {sorted(ttfts)[int(len(ttfts)*0.95)]:.0f} ms")
print(f"TTFT P99: {sorted(ttfts)[int(len(ttfts)*0.99)]:.0f} ms")
print(f"Total P50: {statistics.median(totals):.0f} ms")
print(f"Total P95: {sorted(totals)[int(len(totals)*0.95)]:.0f} ms")
print(f"Total P99: {sorted(totals)[int(len(totals)*0.99)]:.0f} ms")
print(f"Error rate: {errors / (len(results) + errors) * 100:.2f}%")
Load Test Scenarios
| Scenario | Target RPS | Duration | Concurrency | What it validates |
|---|---|---|---|---|
| Baseline | 10 | 5 min | 20 | Normal traffic performance |
| Peak | 50 | 10 min | 100 | Spike absorption |
| Sustained | 30 | 60 min | 50 | Memory stability, no leaks |
| Long conversations | 10 | 30 min | 20 (8+ turns each) | Context budgeting, OOM containment |
| Adapter switching | 30 | 10 min | 50 (random adapters) | Adapter swap overhead |
| Cache validation | 20 | 10 min | 30 (same prefix) | Prefix cache hit rate |
Pass/Fail Criteria
| Metric | Pass | Fail |
|---|---|---|
| P50 TTFT | < 200 ms | > 300 ms |
| P99 total latency | < 2,000 ms | > 3,000 ms |
| Error rate | < 0.1% | > 0.5% |
| OOM events | 0 | > 0 |
| Prefix cache hit (cache scenario) | > 65% | < 50% |
10. Cross-References
- Scenario narratives: 01-vllm-game-changer-scenarios.md
- Low-level implementation patterns: 02-vllm-low-level-implementation-and-critical-decisions.md
- Deployment and infrastructure: 04-vllm-deployment-and-infrastructure.md
- Model preparation and quantization: 06-vllm-model-preparation-and-quantization.md
- Interview prep questions: 03-vllm-interview-prep-deep-dive.md
- MLflow observability integration: ../../MLflow/03-mlflow-low-level-implementation-guide.md