
vLLM Monitoring And Troubleshooting For MangaAssist

Complete monitoring, alerting, SLO, and troubleshooting guide for the vLLM-powered self-hosted generation path. Covers Prometheus metrics, CloudWatch integration, alerting rules, dashboards, and a diagnostics runbook for every failure mode we have seen in production.

1. Scope

This document answers:

  • What metrics does vLLM expose and what do they mean?
  • What custom application metrics do we emit?
  • How do metrics flow from vLLM to CloudWatch and dashboards?
  • What alerting rules protect the service?
  • What SLOs define "healthy" for the self-hosted generation path?
  • How do you diagnose and fix common production issues?

2. Metrics Architecture

graph TD
    subgraph "vLLM Engine"
        VM1["Built-in Prometheus metrics\n(:9090/metrics)"]
    end

    subgraph "vLLM Gateway (Application)"
        AM1["Custom inference metrics"]
        AM2["Trace attributes (MLflow)"]
    end

    subgraph "Collection Layer"
        PROM["Prometheus\n(sidecar or remote)"]
        CWA["CloudWatch Agent"]
    end

    subgraph "Storage and Query"
        CW["CloudWatch Metrics"]
        MLF["MLflow Tracking Server"]
    end

    subgraph "Consumption"
        DASH["CloudWatch Dashboard:\nMangaAssist Inference"]
        ALERT["CloudWatch Alarms"]
        GRAFANA["Grafana\n(optional, for Prometheus native)"]
    end

    VM1 --> PROM
    AM1 --> CWA
    AM2 --> MLF
    PROM --> CW
    CWA --> CW
    CW --> DASH
    CW --> ALERT
    PROM --> GRAFANA

3. vLLM Built-In Prometheus Metrics

vLLM exposes these metrics at the /metrics endpoint (port 9090 by default). These are engine-level metrics that require no application code.

| Metric | Type | What it measures | Why it matters |
|---|---|---|---|
| vllm:num_requests_running | Gauge | Active sequences in the engine | GPU utilization signal |
| vllm:num_requests_waiting | Gauge | Requests queued, waiting for decode slots | Primary backpressure signal |
| vllm:num_requests_swapped | Gauge | Requests swapped to CPU (preempted) | Memory pressure indicator |
| vllm:gpu_cache_usage_perc | Gauge | Percentage of KV cache blocks in use | Memory saturation signal |
| vllm:cpu_cache_usage_perc | Gauge | CPU cache usage (for swap) | Should stay near 0 in healthy ops |
| vllm:num_preemptions_total | Counter | Times a sequence was preempted | High count = memory pressure |
| vllm:prompt_tokens_total | Counter | Total input tokens processed | Throughput tracking |
| vllm:generation_tokens_total | Counter | Total output tokens generated | Throughput tracking |
| vllm:avg_prompt_throughput_toks_per_s | Gauge | Prefill throughput | Prefill performance signal |
| vllm:avg_generation_throughput_toks_per_s | Gauge | Decode throughput | Generation speed signal |
| vllm:time_to_first_token_seconds | Histogram | TTFT distribution | User experience metric |
| vllm:time_per_output_token_seconds | Histogram | Inter-token latency | Streaming smoothness |
| vllm:e2e_request_latency_seconds | Histogram | End-to-end request time | Overall latency signal |

How To Interpret Key Metrics

num_requests_waiting > 0 for sustained periods: The engine cannot keep up with incoming requests. If this persists for >60s, scale out. If it persists after scaling, check for:
  - A few very large prompts crowding out others (check max_num_batched_tokens)
  - Memory pressure causing preemptions (check gpu_cache_usage_perc)
  - An upstream admission control failure (requests bypassing the queue limit)

gpu_cache_usage_perc > 95%: The KV block allocator is near saturation. New sequences will be queued or preempted. Check:
  - Has max_model_len been increased without reducing max_num_seqs?
  - Is there a context budget bypass (requests arriving with 4000+ tokens)?
  - Has an adapter update increased per-request memory?

num_preemptions_total rising: Sequences are being swapped out of GPU to make room for new requests. This is vLLM's memory pressure relief valve. A few preemptions per hour is normal under peak traffic. More than that means the concurrency or memory configuration is too aggressive.
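To turn the raw counter into the per-minute rate that the warning alert in Section 5 watches (> 10/min), compute the delta between two scrapes of :9090/metrics. A minimal sketch, not a production client: parse_metric is a naive prefix match and the scrape payloads are illustrative.

```python
def parse_metric(metrics_text: str, name: str) -> float:
    """Return the first sample value for a metric in Prometheus text format."""
    for line in metrics_text.splitlines():
        if line.startswith(name):
            return float(line.rsplit(" ", 1)[-1])
    raise KeyError(name)


def preemptions_per_min(prev_scrape: str, curr_scrape: str, interval_s: float) -> float:
    """Rate of the vllm:num_preemptions_total counter between two scrapes."""
    delta = parse_metric(curr_scrape, "vllm:num_preemptions_total") - parse_metric(
        prev_scrape, "vllm:num_preemptions_total"
    )
    return delta * 60.0 / interval_s


# Illustrative scrape payloads, 60 seconds apart
scrape_t0 = "vllm:num_preemptions_total 12\nvllm:num_requests_waiting 3"
scrape_t1 = "vllm:num_preemptions_total 42\nvllm:num_requests_waiting 7"
print(preemptions_per_min(scrape_t0, scrape_t1, interval_s=60.0))  # → 30.0
```

At 30 preemptions/min this instance would be well past the warning threshold; a handful per hour, as noted above, is normal.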

4. Custom Application Metrics

These metrics are emitted by the vLLM gateway application code, not by the vLLM engine itself. They provide business-level observability.

Metric Definitions

"""
Custom inference metrics for the MangaAssist vLLM gateway.
Emitted to CloudWatch via the CloudWatch Agent.
"""

from dataclasses import dataclass, field

import boto3


@dataclass
class InferenceMetrics:
    namespace: str = "MangaAssist/Inference"
    _cw_client: object = field(default=None, repr=False)

    def __post_init__(self) -> None:
        self._cw_client = boto3.client("cloudwatch")

    def emit_request_metrics(
        self,
        request_id: str,
        queue_wait_ms: float,
        ttft_ms: float,
        generation_ms: float,
        input_tokens: int,
        output_tokens: int,
        adapter_id: str,
        backend_version: str,
        prefix_cache_hit: bool,
        token_budget_trimmed: bool,
    ) -> None:
        metrics = [
            self._metric("QueueWaitMs", queue_wait_ms, "Milliseconds"),
            self._metric("TTFTMs", ttft_ms, "Milliseconds"),
            self._metric("GenerationMs", generation_ms, "Milliseconds"),
            self._metric("InputTokens", input_tokens, "Count"),
            self._metric("OutputTokens", output_tokens, "Count"),
            self._metric("PrefixCacheHit", 1.0 if prefix_cache_hit else 0.0, "Count"),
            self._metric("TokenBudgetTrimmed", 1.0 if token_budget_trimmed else 0.0, "Count"),
        ]

        dimensions = [
            {"Name": "AdapterId", "Value": adapter_id},
            {"Name": "BackendVersion", "Value": backend_version},
        ]

        self._cw_client.put_metric_data(
            Namespace=self.namespace,
            MetricData=[{**m, "Dimensions": dimensions} for m in metrics],
        )

    def emit_oom_caught(self, adapter_id: str) -> None:
        self._cw_client.put_metric_data(
            Namespace=self.namespace,
            MetricData=[{
                "MetricName": "OOMCaught",
                "Value": 1.0,
                "Unit": "Count",
                "Dimensions": [{"Name": "AdapterId", "Value": adapter_id}],
            }],
        )

    def emit_admission_rejected(self, reason: str) -> None:
        self._cw_client.put_metric_data(
            Namespace=self.namespace,
            MetricData=[{
                "MetricName": "AdmissionRejected",
                "Value": 1.0,
                "Unit": "Count",
                "Dimensions": [{"Name": "Reason", "Value": reason}],
            }],
        )

    @staticmethod
    def _metric(name: str, value: float, unit: str) -> dict:
        return {"MetricName": name, "Value": value, "Unit": unit}

Metric Catalog

| Metric | Type | Dimensions | Why it matters |
|---|---|---|---|
| QueueWaitMs | Histogram | adapter_id, backend_version | Separates scheduler delay from model compute |
| TTFTMs | Histogram | adapter_id, backend_version | User-perceived responsiveness benchmark |
| GenerationMs | Histogram | adapter_id, backend_version | Model compute time |
| InputTokens | Histogram | adapter_id | Prompt size distribution for capacity planning |
| OutputTokens | Histogram | adapter_id | Generation length distribution |
| PrefixCacheHit | Counter | adapter_id | Validates prompt determinism |
| TokenBudgetTrimmed | Counter | adapter_id | How often context budgeting activates |
| OOMCaught | Counter | adapter_id | OOM containment frequency |
| AdmissionRejected | Counter | reason | Admission control activation frequency |

Trace Attributes (MLflow)

Every inference request creates a trace span with these structured attributes:

import mlflow


def trace_inference_request(
    request_id: str,
    session_id: str,
    adapter_id: str,
    backend_version: str,
    prompt_version: str,
    input_tokens: int,
    output_tokens: int,
    queue_wait_ms: float,
    ttft_ms: float,
    generation_ms: float,
    prefix_cache_hit: bool,
    finish_reason: str,
) -> None:
    with mlflow.start_span(name="vllm_inference") as span:
        span.set_attributes({
            "request_id": request_id,
            "session_id": session_id,
            "adapter_id": adapter_id,
            "backend_version": backend_version,
            "prompt_version": prompt_version,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "queue_wait_ms": queue_wait_ms,
            "ttft_ms": ttft_ms,
            "generation_ms": generation_ms,
            "prefix_cache_hit": prefix_cache_hit,
            "finish_reason": finish_reason,
        })

5. Alerting Rules

Critical Alerts (Page On-Call)

| Alert | Condition | Duration | Action |
|---|---|---|---|
| Queue Overflow | num_requests_waiting > 80 | 2 min | Immediate scale-out; check for upstream traffic spike |
| OOM Detected | OOMCaught > 0 | Instant | Check request patterns, context budgets, memory config |
| Endpoint Down | /ping returns non-200 | 1 min | SageMaker auto-replaces; check GPU health, model load |
| Error Rate Spike | 5xx rate > 1% | 5 min | Check engine logs, OOM count, CUDA errors |

Warning Alerts (Investigate Next Business Day)

| Alert | Condition | Duration | Action |
|---|---|---|---|
| High TTFT | P95 TTFT > 500 ms | 5 min | Check prefix cache hit rate, prompt changes |
| GPU Memory Pressure | gpu_cache_usage_perc > 95% | 5 min | Check context budget drift, sequence count |
| Prefix Cache Degradation | Cache hit rate < 50% | 15 min | Check for prompt template changes that broke determinism |
| Adapter Latency Regression | Adapter P95 > 2× baseline | 10 min | Check adapter version, run offline eval, rollback if needed |
| Preemption Rate High | num_preemptions_total rate > 10/min | 10 min | Memory pressure; check max_num_seqs and traffic patterns |
| Queue Wait Creep | P95 queue wait > 200 ms | 10 min | Scale out or check for a few large prompts dominating |

CloudWatch Alarm Configuration

import boto3

cw_client = boto3.client("cloudwatch")


def create_critical_alarms() -> None:
    # Queue overflow alarm
    cw_client.put_metric_alarm(
        AlarmName="mangaassist-vllm-queue-overflow",
        Namespace="MangaAssist/Inference",
        MetricName="vllm:num_requests_waiting",
        Statistic="Maximum",
        Period=60,
        EvaluationPeriods=2,
        Threshold=80,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:ACCOUNT:mangaassist-critical"],
        TreatMissingData="notBreaching",
    )

    # OOM alarm
    cw_client.put_metric_alarm(
        AlarmName="mangaassist-vllm-oom-detected",
        Namespace="MangaAssist/Inference",
        MetricName="OOMCaught",
        Statistic="Sum",
        Period=300,
        EvaluationPeriods=1,
        Threshold=0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:ACCOUNT:mangaassist-critical"],
        TreatMissingData="notBreaching",
    )

    # High TTFT alarm
    cw_client.put_metric_alarm(
        AlarmName="mangaassist-vllm-high-ttft",
        Namespace="MangaAssist/Inference",
        MetricName="TTFTMs",
        ExtendedStatistic="p95",
        Period=300,
        EvaluationPeriods=1,
        Threshold=500,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:ACCOUNT:mangaassist-warning"],
        TreatMissingData="notBreaching",
    )

    # Prefix cache degradation
    cw_client.put_metric_alarm(
        AlarmName="mangaassist-vllm-prefix-cache-degradation",
        Namespace="MangaAssist/Inference",
        MetricName="PrefixCacheHit",
        Statistic="Average",
        Period=900,  # 15 min
        EvaluationPeriods=1,
        Threshold=0.5,
        ComparisonOperator="LessThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:ACCOUNT:mangaassist-warning"],
        TreatMissingData="notBreaching",
    )

6. SLO Definitions

Self-Hosted Generation Path SLOs

| SLO | Target | Measurement | Burn rate alert |
|---|---|---|---|
| Availability | 99.9% (43 min downtime/month) | Successful responses / total requests, 30-day rolling | 1% budget consumed in 1 hour |
| P50 TTFT | < 200 ms | 1-hour rolling window | P50 > 300 ms for 30 min |
| P99 Total Latency | < 2,000 ms | 1-hour rolling window | P99 > 3,000 ms for 15 min |
| P95 Queue Wait | < 100 ms | 1-hour rolling window | P95 > 200 ms for 10 min |
| Error Rate | < 0.1% | 24-hour rolling window | > 0.5% in 1 hour |

How SLOs Map To User Experience

| SLO | User experience if violated | Business impact |
|---|---|---|
| Availability < 99.9% | "The chatbot is down" | Lost conversions, support escalation |
| P50 TTFT > 200 ms | "The assistant feels sluggish" | Lower engagement, higher abandonment |
| P99 latency > 2,000 ms | "Sometimes it takes forever" | Frustrated power users, negative reviews |
| Queue wait > 100 ms | "Why is it thinking before it answers?" | Perception of unreliability |
| Error rate > 0.1% | "I got an error, this bot is broken" | Trust damage, hand-off to human agent |

Error Budget Calculation

Monthly error budget at 99.9% availability:
  Total minutes:    43,200 (30 days × 24 hours × 60 min)
  Budget:           43.2 minutes of downtime
  Burn rate alert:  If we consume 1% of budget (0.43 min = 26 seconds) in 1 hour,
                    we are on pace to exhaust the month's budget in 4.2 days.
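The same arithmetic works for any SLO target and burn rate. A small sketch; the helper names are ours, not from any SRE library.

```python
def budget_minutes(slo: float, days: int = 30) -> float:
    """Downtime allowance for the window, in minutes."""
    return days * 24 * 60 * (1.0 - slo)


def days_to_exhaustion(burn_fraction_per_hour: float) -> float:
    """Days until the budget is fully consumed at a constant hourly burn rate."""
    return (1.0 / burn_fraction_per_hour) / 24.0


print(round(budget_minutes(0.999), 1))      # → 43.2 (matches the figure above)
print(round(days_to_exhaustion(0.01), 1))   # → 4.2 (1% of budget per hour)
```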

7. Dashboard Design

Inference Dashboard Panels

The CloudWatch dashboard MangaAssist-Inference has these panels, organized top-to-bottom by urgency:

Row 1: Real-Time Health (auto-refresh 10s)

| Panel | Metric | Visualization | Alert tie-in |
|---|---|---|---|
| Active Sequences | num_requests_running | Number (current) | None |
| Queued Requests | num_requests_waiting | Number (current, red if > 50) | Queue overflow |
| GPU Cache Usage | gpu_cache_usage_perc | Gauge (0–100%) | GPU memory pressure |
| Endpoint Status | /ping success rate | Status (green/red) | Endpoint down |

Row 2: Latency (5-minute aggregation)

| Panel | Metric | Visualization |
|---|---|---|
| TTFT Distribution | TTFTMs P50/P95/P99 | Time series, 3 lines |
| Queue Wait Distribution | QueueWaitMs P50/P95/P99 | Time series, 3 lines |
| Generation Latency | GenerationMs P50/P95/P99 | Time series, 3 lines |
| End-to-End Latency | e2e_request_latency_seconds P50/P95/P99 | Time series, 3 lines |

Row 3: Efficiency Indicators (15-minute aggregation)

| Panel | Metric | Visualization |
|---|---|---|
| Prefix Cache Hit Rate | PrefixCacheHit average | Time series, target line at 70% |
| Token Budget Trimming Rate | TokenBudgetTrimmed rate | Time series |
| Preemptions | num_preemptions_total rate | Time series |
| Throughput | prompt_tokens_total + generation_tokens_total rate | Stacked area |

Row 4: Adapter Performance (1-hour aggregation)

| Panel | Metric | Visualization |
|---|---|---|
| Latency by Adapter | GenerationMs by adapter_id | Multi-line time series |
| Request Volume by Adapter | Request count by adapter_id | Stacked bar |
| Error Rate by Adapter | Error count by adapter_id | Time series |
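Panels like these can be provisioned as code rather than hand-edited. A hedged sketch of one Row 2 widget in CloudWatch's dashboard-body JSON format; the ttft_panel helper is ours, and the commented-out put_dashboard call shows how it would be deployed.

```python
import json


def ttft_panel(region: str = "us-east-1") -> dict:
    """Build the Row 2 'TTFT Distribution' widget (P50/P95/P99, 5-min period)."""
    metric = ["MangaAssist/Inference", "TTFTMs"]
    return {
        "type": "metric",
        "properties": {
            "title": "TTFT Distribution",
            "region": region,
            "view": "timeSeries",
            "metrics": [
                metric + [{"stat": "p50"}],
                metric + [{"stat": "p95"}],
                metric + [{"stat": "p99"}],
            ],
            "period": 300,  # 5-minute aggregation, per Row 2
        },
    }


body = json.dumps({"widgets": [ttft_panel()]})
# Deploy with:
# boto3.client("cloudwatch").put_dashboard(
#     DashboardName="MangaAssist-Inference", DashboardBody=body)
```

Keeping the dashboard body in version control makes panel changes reviewable alongside the alarm definitions in Section 5.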

8. Troubleshooting Runbook

Issue 1: High TTFT (P95 > 500 ms)

Symptoms:
  - Users report "the bot takes a while to start answering"
  - TTFT P95 alarm fires

Diagnostic Steps:
  1. Check prefix cache hit rate
     → If < 50%: A prompt template change likely broke cache determinism
        → Check recent prompt version deployments
        → Verify no timestamps/random IDs in the cacheable prefix
        → Fix: Revert prompt change or move volatile fields below the prefix boundary

  2. If cache hit rate is normal, check queue wait time
     → If queue_wait_ms P95 > 200ms: The engine is saturated
        → Check active_sequences vs max_num_seqs
        → Check if traffic spiked (admission controller should be rejecting)
        → Fix: Scale out (add instances) or strengthen admission control

  3. If queue and cache are normal, check input token distribution
     → If average input_tokens has increased: Prompts got bigger
        → Check if retrieval is returning more chunks than expected
        → Check if conversation history is not being properly summarized
        → Fix: Review context budgeting settings, enforce token limits

  4. If all above are normal, check GPU health
     → Run: nvidia-smi -q | grep -i "ecc\|retired\|performance"
     → If GPU reports degraded performance: Replace instance

Issue 2: GPU OOM Events

Symptoms:
  - OOMCaught alarm fires
  - Some responses return degraded/fallback content

Diagnostic Steps:
  1. Check gpu_cache_usage_perc at time of OOM
     → If consistently > 95%: Memory configuration too aggressive
        → Consider: Reduce max_num_seqs from 128 to 112
        → Consider: Reduce gpu_memory_utilization from 0.92 to 0.90

  2. Check input_token distribution at time of OOM
     → If spiky (occasional very large prompts): Context budgeting bypass
        → Check if request_budgeter is being called before engine
        → Check if a new code path bypasses the budgeter
        → Fix: Ensure all paths go through token budget allocation

  3. Check adapter_id correlation
     → If OOMs correlate with a specific adapter: Adapter too large
        → Check adapter rank and weight size
        → Fix: Reduce LoRA rank or use a smaller adapter

  4. If no single cause: Check for memory leak
     → Monitor gpu_cache_usage_perc over hours
     → If it trends upward without traffic increase: Possible block leak
        → Fix: Rolling restart of instances (schedule during low traffic)
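Step 4's "trends upward" check can be made objective by fitting a least-squares slope to periodic gpu_cache_usage_perc samples. A sketch; the function name and the 15-minute sampling cadence are illustrative.

```python
def usage_slope_pct_per_hour(samples: list[tuple[float, float]]) -> float:
    """Least-squares slope over (minute, usage_pct) samples, in pct/hour.

    A sustained positive slope with flat traffic suggests a KV block leak.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return (num / den) * 60.0


# Synthetic drift: +0.05 pct/min over 3 hours, sampled every 15 minutes
samples = [(t, 80.0 + 0.05 * t) for t in range(0, 180, 15)]
print(usage_slope_pct_per_hour(samples))  # ≈ 3.0 pct/hour
```

Alerting on the slope rather than the absolute value catches a leak hours before it crosses the 95% memory-pressure threshold.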

Issue 3: Prefix Cache Hit Rate Collapse

Symptoms:
  - PrefixCacheHit alarm fires (< 50% for 15 min)
  - Efficiency drops, TTFT may increase
  - No obvious errors or failures

Diagnostic Steps:
  1. Check recent deployments
     → Check prompt_version in recent traces
     → If prompt version changed: Verify new prompt structure
        → Compare prefix bytes of old vs new prompt
        → Look for: timestamps, request IDs, A/B test flags in the prefix

  2. If no deployment, check prompt construction code
     → Look for: new personalization fields injected before the prefix boundary
     → Look for: dynamic policy blocks that change frequently
     → Fix: Move volatile content after the deterministic prefix boundary

  3. Check traffic pattern
     → If traffic is very low (< 10 req/min): Cache entries may be evicted
        → This is normal during off-peak hours
        → Cache hit rate should recover when traffic increases

  4. Check vLLM version
     → vLLM 0.4.1 → 0.4.2 changed eviction behavior (known issue)
     → Fix: Upgrade to 0.4.3 which fixed the eviction bug
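Step 1's "compare prefix bytes" check is easy to script: two rendered prompts can only share cache blocks up to their first differing character. A sketch; the sample prompts are illustrative, not our real templates.

```python
def shared_prefix_chars(a: str, b: str) -> int:
    """Length of the common leading span between two rendered prompts."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n


STATIC = "SYSTEM: You are a manga shopping assistant for the JP Manga Store.\n"
good = STATIC + "POLICY: returns-v7\nUSER: hi"
same_template = STATIC + "POLICY: returns-v7\nUSER: where is my order?"
bad = "req_id=8f3a1c\n" + STATIC + "POLICY: returns-v7\nUSER: hi"  # volatile field first

print(shared_prefix_chars(good, same_template) > len(STATIC))  # True: reuse up to the user turn
print(shared_prefix_chars(good, bad))  # → 0: a request ID ahead of the prefix kills all reuse
```

Run this against the prefix bytes of the old and new prompt versions from step 1 to pinpoint exactly where determinism broke.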

Issue 4: Adapter Latency Regression

Symptoms:
  - One adapter's P95 latency is 2×+ its historical baseline
  - Other adapters perform normally

Diagnostic Steps:
  1. Check if adapter was recently updated
     → Compare adapter version in traces vs the last known-good version
     → If updated: Run offline evaluation against the previous version
        → If regression confirmed: Rollback adapter in the registry
        → If evaluation is clean: Issue is traffic-pattern, not adapter

  2. Check adapter weight size
     → If new adapter is larger (higher rank): More memory per request
        → Check if preemption rate increased for this adapter
        → Fix: Reduce rank or increase GPU headroom

  3. Check request pattern for this adapter
     → If this adapter handles longer prompts than others: Expected
        → Verify context budgeting is applied equally across adapters

Issue 5: Scaling Event Failures

Symptoms:
  - Scaling alarm fires but instance count does not increase
  - Or new instances start but /ping fails for > 5 minutes

Diagnostic Steps:
  1. Check SageMaker endpoint events
     → Look for: InsufficientInstanceCapacity
        → If capacity issue: Try a different availability zone
        → If persistent: Add ml.g5.2xlarge as a fallback instance type

  2. Check warm pool status
     → If warm pool empty and cold start required: Normal 5-8 min wait
        → Fix: Ensure warm pool has at least 1 instance pre-provisioned

  3. If instances start but fail health checks
     → Check container logs for startup errors
        → Common: Model file corruption, CUDA version mismatch
        → Fix: Rebuild and re-push Docker image

Issue 6: Performance Degradation After vLLM Upgrade

Symptoms:
  - After upgrading vLLM version, throughput or latency regressed
  - No configuration changes made

Diagnostic Steps:
  1. Run benchmark comparison: new version vs old version on identical traffic
     → Use the shadow comparison job in eval/shadow_compare.py
     → Compare: TTFT, generation_ms, queue_wait_ms, prefix_cache_hit_rate

  2. Check for breaking changes in vLLM release notes
     → Prefix caching: Eviction behavior changed in 0.4.2
     → Multi-LoRA: Adapter loading semantics changed in 0.4.3
     → Block size: Internal allocation changed in 0.5.0

  3. Check CUDA graph behavior
     → If enforce_eager=False, new version may capture different graphs
     → Try: Set enforce_eager=True temporarily to test
     → If performance recovers with eager mode: Graph capture issue

  4. Rollback procedure
     → Revert Docker image tag to previous version
     → Redeploy endpoint (takes ~5 min with warm pool)
     → Monitor for 30 min to confirm regression is gone

Diagnostic Commands Quick Reference

# GPU status and utilization
nvidia-smi
nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used,memory.total --format=csv -l 1

# Check for GPU errors
nvidia-smi -q | grep -i "ecc\|retired\|performance\|error"

# vLLM engine status
curl http://localhost:8080/health
curl http://localhost:9090/metrics | grep vllm

# Active sequences and queue depth
curl -s http://localhost:9090/metrics | grep "num_requests"

# GPU cache usage
curl -s http://localhost:9090/metrics | grep "gpu_cache_usage"

# Prefix cache hit rate (custom metric)
curl -s http://localhost:9090/metrics | grep "prefix_cache"

# CUDA memory summary (from Python)
python3 -c "import torch; print(torch.cuda.memory_summary())"

# Check model loading status
curl -s http://localhost:8080/v1/models | python3 -m json.tool

# Test a single inference request
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "manga_domain_v3",
    "messages": [{"role": "user", "content": "test"}],
    "max_tokens": 5
  }'

9. Load Testing Methodology

Purpose

Before deploying a new vLLM version, configuration change, or model update, run load tests to validate:
  - TTFT stays within SLO under target concurrency
  - Queue wait does not exceed budget under peak traffic
  - OOM does not occur during sustained load
  - Prefix cache hit rate matches expectations

Load Test Script

"""
Load test for MangaAssist vLLM endpoint.
Simulates realistic chatbot traffic patterns.
"""

import asyncio
import time
import random
import statistics
from dataclasses import dataclass

import httpx


@dataclass
class LoadTestConfig:
    endpoint: str
    target_rps: float           # Requests per second
    duration_seconds: int       # Total test duration
    warmup_seconds: int = 30    # Gradual ramp-up period
    max_concurrent: int = 100   # Max concurrent requests


@dataclass
class RequestResult:
    ttft_ms: float
    total_ms: float
    input_tokens: int
    output_tokens: int
    status_code: int
    adapter_id: str


# Realistic MangaAssist request distribution
REQUESTS = [
    # Short factual (30% of traffic)
    {"weight": 0.30, "adapter": "manga_domain_v3", "msg": "Is Spy x Family volume 12 available in English?", "max_tokens": 30},
    # Medium recommendation (40% of traffic)
    {"weight": 0.40, "adapter": "manga_domain_v3", "msg": "I liked Vinland Saga. What should I read next?", "max_tokens": 200},
    # Long grounded answer (20% of traffic)
    {"weight": 0.20, "adapter": "manga_domain_v3", "msg": "Compare all editions of Berserk volume 1.", "max_tokens": 400},
    # Support question (10% of traffic)
    {"weight": 0.10, "adapter": "general_support_v2", "msg": "Where is my order?", "max_tokens": 100},
]

SYSTEM_PROMPT = "You are a manga shopping assistant for the JP Manga Store."


def pick_request() -> dict:
    r = random.random()
    cumulative = 0.0
    for req in REQUESTS:
        cumulative += req["weight"]
        if r <= cumulative:
            return req
    return REQUESTS[-1]


async def send_request(client: httpx.AsyncClient, endpoint: str) -> RequestResult:
    req = pick_request()
    start = time.monotonic()

    payload = {
        "model": req["adapter"],
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": req["msg"]},
        ],
        "max_tokens": req["max_tokens"],
        "stream": True,
    }

    ttft = None
    total_tokens = 0

    async with client.stream("POST", endpoint, json=payload) as response:
        async for line in response.aiter_lines():
            if ttft is None and line.startswith("data:") and '"delta"' in line:
                ttft = (time.monotonic() - start) * 1000
            if '"finish_reason"' in line:
                break
            if line.startswith("data:") and '"delta"' in line:
                total_tokens += 1

    total_ms = (time.monotonic() - start) * 1000

    return RequestResult(
        ttft_ms=ttft or total_ms,
        total_ms=total_ms,
        input_tokens=len(SYSTEM_PROMPT.split()) + len(req["msg"].split()),  # rough word-count proxy
        output_tokens=total_tokens,
        status_code=response.status_code,
        adapter_id=req["adapter"],
    )


async def run_load_test(config: LoadTestConfig) -> None:
    results: list[RequestResult] = []
    errors = 0

    async with httpx.AsyncClient(timeout=30.0) as client:
        sem = asyncio.Semaphore(config.max_concurrent)

        async def bounded_request() -> None:
            nonlocal errors
            async with sem:
                try:
                    result = await send_request(client, f"{config.endpoint}/v1/chat/completions")
                    results.append(result)
                except Exception:
                    errors += 1

        tasks = []
        start_time = time.monotonic()

        for i in range(int(config.target_rps * config.duration_seconds)):
            elapsed = time.monotonic() - start_time
            target_time = i / config.target_rps

            # Gradual ramp-up during warmup period
            if elapsed < config.warmup_seconds:
                ramp_factor = elapsed / config.warmup_seconds
                target_time = target_time / max(ramp_factor, 0.1)

            if target_time > elapsed:
                await asyncio.sleep(target_time - elapsed)

            tasks.append(asyncio.create_task(bounded_request()))

        await asyncio.gather(*tasks)

    # Report
    if results:
        ttfts = [r.ttft_ms for r in results]
        totals = [r.total_ms for r in results]
        print(f"\n{'='*60}")
        print(f"Load Test Results ({len(results)} successful, {errors} errors)")
        print(f"{'='*60}")
        print(f"TTFT   P50: {statistics.median(ttfts):.0f} ms")
        print(f"TTFT   P95: {sorted(ttfts)[int(len(ttfts)*0.95)]:.0f} ms")
        print(f"TTFT   P99: {sorted(ttfts)[int(len(ttfts)*0.99)]:.0f} ms")
        print(f"Total  P50: {statistics.median(totals):.0f} ms")
        print(f"Total  P95: {sorted(totals)[int(len(totals)*0.95)]:.0f} ms")
        print(f"Total  P99: {sorted(totals)[int(len(totals)*0.99)]:.0f} ms")
        print(f"Error rate: {errors / (len(results) + errors) * 100:.2f}%")

Load Test Scenarios

| Scenario | Target RPS | Duration | Concurrency | What it validates |
|---|---|---|---|---|
| Baseline | 10 | 5 min | 20 | Normal traffic performance |
| Peak | 50 | 10 min | 100 | Spike absorption |
| Sustained | 30 | 60 min | 50 | Memory stability, no leaks |
| Long conversations | 10 | 30 min | 20 (8+ turns each) | Context budgeting, OOM containment |
| Adapter switching | 30 | 10 min | 50 (random adapters) | Adapter swap overhead |
| Cache validation | 20 | 10 min | 30 (same prefix) | Prefix cache hit rate |

Pass/Fail Criteria

| Metric | Pass | Fail |
|---|---|---|
| P50 TTFT | < 200 ms | > 300 ms |
| P99 total latency | < 2,000 ms | > 3,000 ms |
| Error rate | < 0.1% | > 0.5% |
| OOM events | 0 | > 0 |
| Prefix cache hit (cache scenario) | > 65% | < 50% |

10. Cross-References