
03. Scale Testing Scenarios — Real Issues I Faced and How I Solved Them

"Load testing a chatbot is fundamentally different from load testing a REST API. Every user message triggers a 4-model inference chain, 2-3 parallel service calls, a vector search, and a guardrails pipeline. At 50K concurrent sessions, bottlenecks showed up in places I never expected — DynamoDB hot partitions, Redis cache stampedes, and WebSocket connection ceilings. Here are the 7 worst scenarios I hit and how I fixed each one."


Scale Targets

| Metric | Normal | Peak (Prime Day) | What Breaks First |
| --- | --- | --- | --- |
| Concurrent sessions | ~50,000 | ~500,000 | WebSocket connections, Bedrock quota |
| Messages per second | ~5,000 | ~50,000 | Intent classifier throughput, DynamoDB writes |
| LLM calls per second | ~3,000 | ~30,000 | Bedrock throttling |
| P99 latency (first token) | < 1.5s | < 3s | Reranker queue depth |
| Availability | 99.9% | 99.9% | Circuit breaker cascades |

How I Ran Scale Tests

  • Tool: Artillery.io for WebSocket + REST load generation, plus custom Python scripts for sustained-load LLM benchmarks (a simplified sketch of the Python driver follows this list)
  • Environment: Dedicated load-test environment mirroring production (same instance types, same Bedrock model configs, same DynamoDB table config)
  • Approach: Ramp from 0 to 10x normal load over 30 minutes, hold at peak for 60 minutes, then step down
  • Monitoring: CloudWatch dashboards, X-Ray traces, custom Grafana panels for per-model latency breakdown
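
The sustained-load Python driver followed the same ramp/hold/step-down shape as the Artillery runs. The sketch below is a scaled-down, self-contained illustration of that shape only; `send_message`, the session naming, and the rates are stand-ins (the real tests drove 5K-50K msg/s through Artillery workers, not a single Python process).

```python
import asyncio
import random
import time

async def send_message(session_id: str) -> float:
    """Stub for the real WebSocket/REST client; returns the observed latency."""
    start = time.perf_counter()
    await asyncio.sleep(random.uniform(0.05, 0.3))  # stand-in for the network call
    return time.perf_counter() - start

async def run_phase(rate_start: float, rate_end: float, duration_s: int, latencies: list) -> None:
    """Drive load at a linearly interpolated rate (approximate: one batch per second)."""
    start = time.monotonic()
    while (elapsed := time.monotonic() - start) < duration_s:
        rate = rate_start + (rate_end - rate_start) * (elapsed / duration_s)
        batch = [send_message(f"session-{random.randint(0, 500_000)}")
                 for _ in range(int(rate))]
        latencies.extend(await asyncio.gather(*batch))
        await asyncio.sleep(1)

async def main() -> None:
    latencies: list[float] = []
    # Rates scaled down ~1000x versus production for illustration.
    await run_phase(5, 50, 30 * 60, latencies)   # ramp to 10x over 30 minutes
    await run_phase(50, 50, 60 * 60, latencies)  # hold at peak for 60 minutes
    await run_phase(50, 5, 10 * 60, latencies)   # illustrative step-down
    latencies.sort()
    print("P99 latency:", latencies[int(len(latencies) * 0.99)])

if __name__ == "__main__":
    asyncio.run(main())
```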

Scenario 1: Prime Day Traffic Spike (10x Load)

The Problem

During the first simulated Prime Day load test, I ramped from 5K to 50K messages/second. At around 20K msg/s, three things broke simultaneously:

  1. Bedrock returned 429 (ThrottlingException) — we exceeded our on-demand quota for Claude 3.5 Sonnet
  2. DynamoDB session writes started throttling — popular manga titles (new releases) created hot partition keys
  3. Lambda cold starts spiked — burst overflow from ECS to Lambda triggered 3-5 second cold starts

How I Solved It

| Problem | Solution | Result |
| --- | --- | --- |
| Bedrock throttling | Purchased Provisioned Throughput for Sonnet (baseline) + kept on-demand for Haiku (overflow) | Zero throttling at 30K LLM calls/sec |
| DynamoDB hot partitions | Added a random suffix to partition keys for high-traffic sessions; enabled DynamoDB Adaptive Capacity | Throttle events dropped to zero |
| Lambda cold starts | Pre-warmed Lambda with scheduled pings every 5 min; set ECS scaling target at 60% CPU (triggers earlier) | Cold start P99 dropped from 5s to 0.8s |
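
The hot-partition fix is worth spelling out. Below is a minimal sketch of the write-sharding idea, assuming boto3 and a hypothetical `chat-sessions` table with `pk`/`sk` keys; the real schema and shard count differed. The trade-off is that reads must fan out across every shard and merge, which is why this was applied only to high-traffic sessions.

```python
import random
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("chat-sessions")  # hypothetical table name
SHARDS = 10  # spreads one hot logical key across 10 physical partitions

def put_session_event(session_id: str, timestamp: str, payload: dict) -> None:
    """Write with a sharded partition key so bursts on one session spread out."""
    shard = random.randint(0, SHARDS - 1)
    table.put_item(Item={
        "pk": f"{session_id}#{shard}",  # e.g. "sess-123#7"
        "sk": timestamp,
        **payload,
    })

def get_session_events(session_id: str) -> list[dict]:
    """Reads fan out across all shards and merge the results by sort key."""
    items: list[dict] = []
    for shard in range(SHARDS):
        resp = table.query(KeyConditionExpression=Key("pk").eq(f"{session_id}#{shard}"))
        items.extend(resp["Items"])
    return sorted(items, key=lambda item: item["sk"])
```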

Metrics After Fix

  • P99 latency at 50K msg/s: 2.7s (under the 3s target)
  • Error rate at peak: 0.12% (well under 1% threshold)
  • Zero Bedrock throttling events during the subsequent load test

Scenario 2: Bedrock LLM Throttling Under Sustained Load

The Problem

Even after provisioning baseline throughput, sustained high-load periods (not spikes, but 2+ hours of elevated traffic) caused Bedrock timeout rates to climb to 8%. The issue was that on-demand overflow capacity had soft limits, and during peak periods many AWS customers competed for the same capacity pool.

How I Solved It

Priority queuing: I built a request priority system:

| Priority | Who | Model | Queue Behavior |
| --- | --- | --- | --- |
| P0 (critical) | Authenticated + Prime users | Claude 3.5 Sonnet | Served first, never dropped |
| P1 (normal) | Authenticated non-Prime | Claude 3.5 Sonnet | Served in order, retry once on timeout |
| P2 (best-effort) | Guest users | Claude Haiku (cheaper) | Downgrade to Haiku under pressure |

Semantic response caching: For common queries ("what's the return policy?", "do you have free shipping?"), I cached LLM responses in ElastiCache Redis keyed by a semantic hash of the query. Cache hit rate reached ~15%, directly reducing Bedrock call volume.
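
A rough sketch of that caching flow is below. The Redis endpoint, key prefix, and `call_llm` hook are hypothetical, and the `semantic_key` shown here is a crude text-normalization stand-in for the real semantic hashing.

```python
import hashlib
import json
import re
import redis

r = redis.Redis(host="my-cache.example.internal", port=6379)  # hypothetical endpoint
CACHE_TTL_S = 300

def semantic_key(query: str) -> str:
    """Rough stand-in for the semantic hash: lowercase, strip punctuation,
    collapse whitespace, then hash. Production keying was more sophisticated."""
    normalized = re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", query.lower())).strip()
    return "llm-cache:" + hashlib.sha256(normalized.encode()).hexdigest()

def cached_llm_answer(query: str, call_llm) -> str:
    key = semantic_key(query)
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)            # cache hit: no Bedrock call
    answer = call_llm(query)              # cache miss: one Bedrock call
    r.setex(key, CACHE_TTL_S, json.dumps(answer))
    return answer
```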

Haiku fallback: When Sonnet timeout rate exceeded 5% for 2 consecutive minutes, the orchestrator automatically routed simple intents (chitchat, order status formatting, FAQ) to Claude Haiku, which had far more available capacity and was 10x cheaper.
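
In sketch form, the fallback logic looked something like the class below. The window size, threshold, intent names, and model IDs are illustrative, not the production configuration.

```python
import time
from collections import deque

SIMPLE_INTENTS = {"chitchat", "order_status_format", "faq"}      # safe to downgrade
SONNET = "anthropic.claude-3-5-sonnet-20240620-v1:0"             # model IDs illustrative
HAIKU = "anthropic.claude-3-haiku-20240307-v1:0"

class ModelRouter:
    """Tracks Sonnet timeouts over a sliding window; routes simple intents to
    Haiku once the timeout rate has stayed above 5% for 2 consecutive minutes."""

    def __init__(self, window_s: int = 120, threshold: float = 0.05):
        self.window_s = window_s
        self.threshold = threshold
        self.calls = deque()            # (timestamp, timed_out) pairs
        self.degraded_since = None      # when the rate first crossed the threshold

    def record(self, timed_out: bool) -> None:
        now = time.monotonic()
        self.calls.append((now, timed_out))
        while self.calls and now - self.calls[0][0] > self.window_s:
            self.calls.popleft()

    def _timeout_rate(self) -> float:
        return sum(1 for _, t in self.calls if t) / len(self.calls) if self.calls else 0.0

    def choose_model(self, intent: str) -> str:
        now = time.monotonic()
        if self._timeout_rate() > self.threshold:
            self.degraded_since = self.degraded_since or now
        else:
            self.degraded_since = None
        sustained = self.degraded_since is not None and now - self.degraded_since >= self.window_s
        return HAIKU if sustained and intent in SIMPLE_INTENTS else SONNET
```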

Metrics After Fix

  • LLM timeout rate: < 0.5% (down from 8%)
  • Monthly LLM cost reduction: ~$18K from intelligent routing
  • User-visible impact: None — Haiku quality was sufficient for simple intents, and users couldn't tell the difference

Scenario 3: SageMaker Cold Start / Scale-Up Lag

The Problem

The DistilBERT intent classifier ran on a SageMaker real-time endpoint. When traffic spiked, SageMaker's auto-scaling took 5-8 minutes to add new instances. During that window:

  • Request queue depth grew → P99 latency for intent classification spiked from 50ms to 2+ seconds
  • First requests to newly added instances had a P99 of 55 seconds because the model had to load from S3

How I Solved It

| Fix | What I Did | Impact |
| --- | --- | --- |
| Minimum instances = 2 | Guaranteed at least 2 warm instances at all times, even during low traffic | Eliminated cold starts for normal load |
| Predictive + scheduled scaling | Combined CloudWatch target-tracking scaling with a scheduled scaling action before known peak hours (Prime Day, major manga release dates) | Instances were ready before the spike hit |
| Inferentia migration | Moved DistilBERT from ml.g4dn.xlarge (GPU) to ml.inf1.xlarge (AWS Inferentia) using Neuron SDK compilation | 3x cheaper per instance + 4x faster model load |
| Warmup requests | Sent synthetic inference requests to new instances during scale-up, before they entered the load balancer | Eliminated the 55s first-request latency spike |
| INT8 quantization | Quantized DistilBERT to INT8 via ONNX Runtime | Model download from S3 dropped from 800ms to 200ms; inference latency unchanged |
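
The scheduled-scaling piece can be expressed in a few lines of boto3. This is a sketch with a hypothetical endpoint name, cron expression, and capacities; it assumes the endpoint variant is already registered as a scalable target.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Hypothetical endpoint and variant names; real values differed.
resource_id = "endpoint/intent-classifier-prod/variant/AllTraffic"

# Raise the minimum instance count ahead of a known peak so capacity is in place
# before target-tracking scaling would have reacted on its own.
autoscaling.put_scheduled_action(
    ServiceNamespace="sagemaker",
    ScheduledActionName="pre-prime-day-warmup",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    Schedule="cron(0 6 ? * MON *)",  # illustrative: 06:00 UTC before the peak window
    ScalableTargetAction={"MinCapacity": 8, "MaxCapacity": 20},
)
```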

Metrics After Fix

  • Intent classifier P99: < 50ms (consistent at all load levels)
  • Scale-up time: 2 minutes (predictive scaling starts early, completing before the spike peaks)
  • Cost per inference: $0.0002 (down from $0.0006 on GPU)

Scenario 4: Thundering Herd on DynamoDB

The Problem

Popular sessions (e.g., a viral manga recommendation that got shared on social media) created hot partition reads. When 10,000 users clicked the same shared chat link within minutes, DynamoDB throttled reads on that session's partition key. This caused:

  • ProvisionedThroughputExceededException for session loads
  • Cascade: orchestrator retried immediately → amplified the thundering herd → more throttling

How I Solved It

| Fix | What I Did | Impact |
| --- | --- | --- |
| DynamoDB DAX | Deployed DAX (in-memory cache) in front of DynamoDB for session reads | Hot session reads served from cache in microseconds, never hitting DynamoDB |
| Exponential backoff + jitter | Replaced immediate retries with exponential backoff (100ms, 200ms, 400ms) + random jitter (0-50ms) | Spread retry load over time instead of amplifying the spike |
| Read request coalescing | When multiple concurrent requests asked for the same session ID, only one actual DynamoDB read was issued; others waited for the result | Reduced read volume by 60% for hot sessions |
| On-demand capacity mode | Switched from provisioned to on-demand DynamoDB capacity for the sessions table | Auto-scales without pre-planning; no more manual capacity adjustments |
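
A minimal asyncio sketch of the coalescing and backoff behaviour, with a hypothetical `fetch_fn` standing in for the DAX/DynamoDB read:

```python
import asyncio
import random

class SessionReadCoalescer:
    """Single-flight reads: concurrent requests for the same session ID share
    one backend call instead of each issuing their own."""

    def __init__(self, fetch_fn):
        self.fetch_fn = fetch_fn                  # async fn: session_id -> session dict
        self.in_flight: dict[str, asyncio.Task] = {}

    async def get(self, session_id: str) -> dict:
        task = self.in_flight.get(session_id)
        if task is None:
            task = asyncio.create_task(self._fetch_with_backoff(session_id))
            self.in_flight[session_id] = task
            task.add_done_callback(lambda _: self.in_flight.pop(session_id, None))
        return await task

    async def _fetch_with_backoff(self, session_id: str) -> dict:
        # Exponential backoff (100ms, 200ms, 400ms) plus 0-50ms jitter, replacing
        # the immediate retries that amplified the thundering herd.
        for attempt in range(3):
            try:
                return await self.fetch_fn(session_id)
            except Exception:
                await asyncio.sleep(0.1 * (2 ** attempt) + random.uniform(0, 0.05))
        return await self.fetch_fn(session_id)    # final attempt, let errors surface
```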

Metrics After Fix

  • DynamoDB throttle events in production: zero (monitored for 6 months)
  • Hot session read latency: < 1ms (DAX cache hit)
  • Monthly DynamoDB cost change: +$200/month for DAX, but eliminated the engineering cost of capacity planning

Scenario 5: WebSocket Connection Limit

The Problem

Amazon API Gateway has a default limit of 128,000 concurrent WebSocket connections per account per region. During a load test simulating 500K concurrent sessions, I hit 90% of this limit. A single Prime Day surge could have exceeded it.

How I Solved It

| Fix | What I Did | Impact |
| --- | --- | --- |
| Connection pooling | Frontend reconnects reused existing connections when possible instead of creating new ones | Reduced concurrent connection count by ~20% |
| Idle connection cleanup | Automatically disconnected sessions inactive for > 10 minutes (server-side ping/pong timeout) | Freed ~30% of connections held by abandoned tabs |
| ALB sticky sessions | Used Application Load Balancer sticky sessions so WebSocket reconnects land on the same orchestrator instance | Preserved in-memory context, reduced connection churn |
| AWS limit increase | Requested and received a limit increase to 500K concurrent connections | Headroom for 5x current peak |
| Multi-region architecture | Route 53 latency-based routing distributes connections across us-east-1 and us-west-2 | Each region handles half the connection load |
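
The idle-cleanup half can be sketched as a periodic sweeper against the API Gateway management API. The connection IDs, endpoint URL, and in-memory `last_seen` map are illustrative; in practice the activity timestamps lived in a shared store rather than process memory.

```python
import time
import boto3

# Hypothetical management endpoint for the WebSocket API stage.
apigw = boto3.client(
    "apigatewaymanagementapi",
    endpoint_url="https://abc123.execute-api.us-east-1.amazonaws.com/prod",
)

IDLE_TIMEOUT_S = 10 * 60
last_seen: dict[str, float] = {}   # connection_id -> last activity timestamp

def touch(connection_id: str) -> None:
    """Called on every inbound message or pong from the client."""
    last_seen[connection_id] = time.time()

def sweep_idle_connections() -> None:
    """Runs on a schedule (e.g. every minute); drops connections idle > 10 minutes
    so abandoned tabs stop holding API Gateway connection slots."""
    now = time.time()
    for connection_id, seen in list(last_seen.items()):
        if now - seen > IDLE_TIMEOUT_S:
            try:
                apigw.delete_connection(ConnectionId=connection_id)
            except apigw.exceptions.GoneException:
                pass  # client already disconnected on its own
            last_seen.pop(connection_id, None)
```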

Metrics After Fix

  • Peak concurrent WebSocket connections: ~200K during load test (well within 500K limit)
  • Connection reuse rate: 65% (up from 0% before pooling)
  • Headroom: 2.5x above observed peak

Scenario 6: Cache Stampede on ElastiCache (Catalog Invalidation)

The Problem

Product details for popular ASINs were cached in ElastiCache Redis with a 5-minute TTL. When the TTL expired for a cluster of popular ASINs simultaneously (e.g., all volumes of a trending series), hundreds of concurrent requests hit the Catalog API at once to repopulate the cache. This caused:

  • Catalog API P99 latency spiked from 90ms to 800ms
  • Some requests timed out → orchestrator returned fallback responses → degraded user experience

How I Solved It

| Fix | What I Did | Impact |
| --- | --- | --- |
| Probabilistic Early Reexpiration (PER) | Each cache read has a small probability of triggering a background refresh before the TTL expires; the probability increases as the TTL approaches 0 | Cache is refreshed before expiration; no thundering herd at the TTL boundary |
| Staggered TTLs | Added random jitter (±60 seconds) to cache TTLs so related ASINs don't expire at the exact same moment | Spread refresh load over time |
| Background async refresh | When a cache miss occurs, the first request triggers an async background refresh; concurrent requests get the stale value (stale-while-revalidate pattern) | Only 1 request hits the Catalog API; others get slightly stale data (acceptable for product details, never for prices) |
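
A compact sketch of PER plus staggered TTLs, assuming a Redis client and a hypothetical `fetch_fn` for the Catalog API. The refresh-probability curve here is a simple ramp for illustration (the production curve was tuned separately, as described in the Q&A below), and the stale-while-revalidate path is omitted for brevity.

```python
import json
import random
import threading
import redis

r = redis.Redis(host="my-cache.example.internal", port=6379)  # hypothetical endpoint
TTL_S = 300
JITTER_S = 60   # staggered TTLs: +/-60s so related ASINs don't expire together

def _refresh(asin: str, fetch_fn) -> None:
    value = fetch_fn(asin)                           # one call to the Catalog API
    ttl = TTL_S + random.randint(-JITTER_S, JITTER_S)
    r.setex(f"catalog:{asin}", ttl, json.dumps(value))

def get_product(asin: str, fetch_fn) -> dict:
    key = f"catalog:{asin}"
    raw = r.get(key)
    if raw is None:
        _refresh(asin, fetch_fn)                     # true miss: synchronous fill
        return json.loads(r.get(key))
    remaining = r.ttl(key)                           # seconds left before expiry
    # Early-refresh probability grows as the TTL runs down; this keeps most
    # entries from ever hitting the TTL boundary at the same moment.
    refresh_prob = max(0.0, 1.0 - remaining / TTL_S) ** 2
    if random.random() < refresh_prob * 0.1:         # keep background refreshes rare
        threading.Thread(target=_refresh, args=(asin, fetch_fn), daemon=True).start()
    return json.loads(raw)
```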

Metrics After Fix

  • Catalog API P99 during cache invalidation: 90ms (down from 800ms)
  • Cache hit rate: 94% (up from 85% — PER keeps the cache warmer)
  • Zero timeout events from cache stampede in production

Scenario 7: Multi-Model P99 Latency Spike Under Load

The Problem

At 10K+ concurrent requests, the end-to-end P99 latency spiked from 2.5s to 5.1s. Distributed tracing showed the bottleneck was the Cross-Encoder Reranker endpoint. Under sustained load, a queue built up at the SageMaker endpoint, and P99 for the reranker alone jumped from 120ms to 1.8s.

Root Cause Analysis

The reranker processed 50 candidate chunks per request (scoring each against the query). At 10K concurrent requests, the GPU was saturated. SageMaker auto-scaling had a 5-minute lag, so the queue grew faster than capacity could be added.

How I Solved It

| Fix | What I Did | Impact |
| --- | --- | --- |
| Speculative execution | Started RAG retrieval in parallel with intent classification instead of waiting for intent results; ~70% of intents need RAG anyway, so wasted compute was minimal | Saved 150-300ms on the critical path |
| Reduced reranker input | Cut reranker input from top-50 to top-20 candidates (vector search quality was good enough that top-20 covered 95% of relevant results) | Reranker latency dropped 60% at the same load |
| Per-model circuit breakers | When the reranker P99 exceeded 500ms for 30 seconds, the circuit breaker opened and the orchestrator fell back to raw cosine similarity (top-3 from the vector search) | Prevented cascading latency across the pipeline |
| Batch inference | Grouped reranker requests into micro-batches (4 requests per batch) to improve GPU utilization | Throughput increased 2.5x on the same instance count |
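
Speculative execution is easiest to see in code. Below is a minimal asyncio sketch; the stub service calls, intent names, and timings stand in for the real classifier and retrieval clients.

```python
import asyncio

async def classify_intent(message: str) -> str:
    await asyncio.sleep(0.05)              # stand-in for the SageMaker classifier call
    return "product_qa"

async def retrieve_candidates(message: str) -> list[dict]:
    await asyncio.sleep(0.12)              # stand-in for embedding + vector search
    return [{"chunk": "example chunk"}]

RAG_INTENTS = {"product_qa", "faq", "recommendation"}   # intents that use retrieval

async def handle_message(message: str) -> tuple[str, list[dict]]:
    """Kick off retrieval speculatively instead of waiting for the intent result.
    Most intents need RAG anyway, so the wasted work on the rest is cheap compared
    to the 150-300ms saved on the critical path."""
    retrieval_task = asyncio.create_task(retrieve_candidates(message))
    intent = await classify_intent(message)
    if intent in RAG_INTENTS:
        candidates = await retrieval_task  # already in flight, often already done
    else:
        retrieval_task.cancel()            # speculation not needed; discard it
        candidates = []
    return intent, candidates
```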

Metrics After Fix

  • End-to-end P99 latency: 2.8s (down from 5.1s)
  • Reranker P99 latency: 95ms (down from 1.8s under the same load)
  • Reranker circuit breaker activation: ~2 times/month (during extreme bursts, with graceful fallback)

Summary of Scale Fixes and Business Impact

| Scenario | Before | After | Key Technique |
| --- | --- | --- | --- |
| Prime Day 10x spike | Errors at 20K msg/s | Handled 50K msg/s at P99 < 3s | Provisioned Bedrock + ECS/Lambda hybrid |
| Bedrock throttling | 8% timeout rate | < 0.5% timeout rate | Priority queuing + Haiku fallback + semantic caching |
| SageMaker cold start | 55s P99 first request | < 50ms consistent P99 | Inferentia + warmup + predictive scaling |
| DynamoDB thundering herd | Throttle events on hot sessions | Zero throttle events | DAX + jitter + request coalescing |
| WebSocket connection limit | 90% of 128K limit | 200K peak with 500K ceiling | Pooling + cleanup + limit increase + multi-region |
| Cache stampede | Catalog P99 = 800ms during invalidation | Catalog P99 = 90ms | PER + staggered TTLs + stale-while-revalidate |
| Multi-model P99 spike | E2E P99 = 5.1s at 10K concurrent | E2E P99 = 2.8s | Speculative execution + smaller reranker input + circuit breakers |

Interview Q&A — Scale Testing

Q: How did you approach load testing a chatbot system? It's not a simple REST API.

  • Easy: I used Artillery.io for WebSocket and REST load generation, ramping from normal (5K msg/s) to 10x peak (50K msg/s) over 30 minutes, holding at peak for 60 minutes. I monitored CloudWatch dashboards, X-Ray traces, and custom Grafana panels to watch per-model latency breakdown in real time
  • Medium: The key insight is that chatbot load testing is fundamentally about testing a dependency chain. A single user message triggers the intent classifier, embedding model, vector search, reranker, LLM, and guardrails. I had to instrument each step independently so that when P99 exceeded budget at a given load level, I could pinpoint exactly which model or service was the bottleneck. Standard RPS metrics tell you nothing useful — you need per-model, per-intent latency distributions
  • Hard: The hardest problem was realistic traffic shaping. Real chatbot traffic isn't uniform — it's bursty (a viral tweet can spike one intent by 50x), multi-turn (sessions last 5-15 minutes with 3-8 messages), and intent-skewed (60% of messages are product discovery, only 5% are order tracking). I built traffic profiles based on production analytics that replicated these patterns. A uniform-load test would have missed the DynamoDB hot partition issue entirely because it only happens when many sessions hit the same popular product simultaneously

Q: What was the scariest production issue you found through scale testing?

  • Easy: The cache stampede. When Redis TTL expired for popular ASINs, hundreds of requests simultaneously hit the Catalog API, causing P99 to spike from 90ms to 800ms. Users saw "I couldn't load product details right now" fallback messages during these windows
  • Medium: What made it scary was that it was invisible at normal load. With 5K msg/s the cache stampede involved maybe 10-20 concurrent requests to the Catalog API — no problem. But at 50K msg/s, the same TTL expiry triggered 200-500 concurrent requests, overwhelming the Catalog API. It was a load-dependent bug that only appeared at scale
  • Hard: The fix (Probabilistic Early Reexpiration + stale-while-revalidate) was elegant but required careful tuning. If the early-refresh probability was too high, I'd waste Catalog API calls refreshing cache that wasn't about to expire. If too low, stampedes still occurred. I ran A/B tests on the probability curve to find the sweet spot (beta distribution with alpha=2, centered at 80% of TTL remaining). The result: 94% cache hit rate with zero stampede events. But it took 3 iterations of load testing to get the parameters right

Q: How did you handle the trade-off between cost and latency when scaling ML models?

  • Easy: I migrated the DistilBERT intent classifier from GPU (ml.g4dn.xlarge at $0.736/hr) to AWS Inferentia (ml.inf1.xlarge at ~$0.24/hr). This cut classifier inference cost to roughly a third with no quality impact because the Neuron SDK compiled the model efficiently for Inferentia hardware
  • Medium: The larger cost decision was LLM model routing. Not every message needs Claude 3.5 Sonnet. I built a router that sends chitchat to template responses (zero cost), simple formatting tasks to Claude Haiku (10x cheaper), and only complex reasoning to Sonnet. This saved ~$18K/month. The trade-off was a small quality delta for borderline cases — but evaluation showed Haiku was indistinguishable from Sonnet for simple intent categories
  • Hard: The hardest cost-latency trade-off was the reranker. I could cut latency by reducing the candidate set from top-50 to top-20, but this risked missing relevant results in the long tail. I ran an offline evaluation: for our FAQ and product Q&A datasets, top-20 covered 95% of the relevant results that top-50 found. The 5% miss rate was acceptable because the LLM could compensate with its own knowledge. But I wouldn't have been comfortable making that trade-off without the evaluation data proving it