03. Scale Testing Scenarios — Real Issues I Faced and How I Solved Them
"Load testing a chatbot is fundamentally different from load testing a REST API. Every user message triggers a 4-model inference chain, 2-3 parallel service calls, a vector search, and a guardrails pipeline. At 50K concurrent sessions, bottlenecks showed up in places I never expected — DynamoDB hot partitions, Redis cache stampedes, and WebSocket connection ceilings. Here are the 7 worst scenarios I hit and how I fixed each one."
Scale Targets
| Metric | Normal | Peak (Prime Day) | What Breaks First |
|---|---|---|---|
| Concurrent sessions | ~50,000 | ~500,000 | WebSocket connections, Bedrock quota |
| Messages per second | ~5,000 | ~50,000 | Intent classifier throughput, DynamoDB writes |
| LLM calls per second | ~3,000 | ~30,000 | Bedrock throttling |
| P99 latency (first token) | < 1.5s | < 3s | Reranker queue depth |
| Availability | 99.9% | 99.9% | Circuit breaker cascades |
How I Ran Scale Tests
- Tool: Artillery.io for WebSocket + REST load generation, custom Python scripts for sustained-load LLM benchmarks
- Environment: Dedicated load-test environment mirroring production (same instance types, same Bedrock model configs, same DynamoDB table config)
- Approach: Ramp from 0 to 10x normal load over 30 minutes, hold at peak for 60 minutes, then step down
- Monitoring: CloudWatch dashboards, X-Ray traces, custom Grafana panels for per-model latency breakdown
Scenario 1: Prime Day Traffic Spike (10x Load)
The Problem
During the first simulated Prime Day load test, I ramped from 5K to 50K messages/second. At around 20K msg/s, three things broke simultaneously:
- Bedrock returned 429 (ThrottlingException) — we exceeded our on-demand quota for Claude 3.5 Sonnet
- DynamoDB session writes started throttling — popular manga titles (new releases) created hot partition keys
- Lambda cold starts spiked — burst overflow from ECS to Lambda triggered 3-5 second cold starts
How I Solved It
| Problem | Solution | Result |
|---|---|---|
| Bedrock throttling | Purchased Provisioned Throughput for Sonnet (baseline) + kept on-demand for Haiku (overflow) | Zero throttling at 30K LLM calls/sec |
| DynamoDB hot partitions | Added a random suffix to partition keys for high-traffic sessions; enabled DynamoDB Adaptive Capacity | Throttle events dropped to zero |
| Lambda cold starts | Pre-warmed Lambda with scheduled pings every 5 min; set ECS scaling target at 60% CPU (triggers earlier) | Cold start P99 dropped from 5s to 0.8s |
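The hot-partition fix above boils down to write sharding: spread writes for one hot logical key across several physical partition keys, then fan out on read. A minimal sketch of the key logic (shard count and key format are illustrative, not the production values):

```python
import random

NUM_SHARDS = 10  # assumed shard count; tune to the observed hot-key write rate

def sharded_key(session_id: str) -> str:
    """Append a random suffix so writes for a hot session spread across partitions."""
    return f"{session_id}#{random.randrange(NUM_SHARDS)}"

def all_shard_keys(session_id: str) -> list[str]:
    """Reads must fan out across every shard and merge the results."""
    return [f"{session_id}#{n}" for n in range(NUM_SHARDS)]
```

The read fan-out is the cost of this pattern, which is why it was applied only to high-traffic sessions rather than to the whole table.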
Metrics After Fix
- P99 latency at 50K msg/s: 2.7s (under the 3s target)
- Error rate at peak: 0.12% (well under 1% threshold)
- Zero Bedrock throttling events during the subsequent load test
Scenario 2: Bedrock LLM Throttling Under Sustained Load
The Problem
Even after provisioning baseline throughput, sustained high-load periods (not spikes, but 2+ hours of elevated traffic) caused Bedrock timeout rates to climb to 8%. The issue was that on-demand overflow capacity had soft limits, and during peak periods many AWS customers competed for the same capacity pool.
How I Solved It
Priority queuing: I built a request priority system:
| Priority | Who | Model | Queue Behavior |
|---|---|---|---|
| P0 (critical) | Authenticated + Prime users | Claude 3.5 Sonnet | Served first, never dropped |
| P1 (normal) | Authenticated non-Prime | Claude 3.5 Sonnet | Served in order, retry once on timeout |
| P2 (best-effort) | Guest users | Claude Haiku (cheaper) | Downgrade to Haiku under pressure |
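The tier table maps naturally to a small routing function. A sketch under assumed model identifiers (the actual queueing, retry, and drop machinery is omitted):

```python
from dataclasses import dataclass

@dataclass
class User:
    authenticated: bool
    prime: bool

def route_request(user: User) -> tuple[int, str]:
    """Map a user to (priority, model) per the tier table. Model names are
    illustrative placeholders, not the production identifiers."""
    if user.authenticated and user.prime:
        return 0, "claude-3-5-sonnet"   # P0: served first, never dropped
    if user.authenticated:
        return 1, "claude-3-5-sonnet"   # P1: served in order, one retry on timeout
    return 2, "claude-haiku"            # P2: best-effort guests on the cheaper model
```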
Semantic response caching: For common queries ("what's the return policy?", "do you have free shipping?"), I cached LLM responses in ElastiCache Redis keyed by a semantic hash of the query. Cache hit rate reached ~15%, directly reducing Bedrock call volume.
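The caching pattern can be sketched as follows. The real system used a semantic hash; here a crude text normalization stands in for it, and an in-process dict stands in for ElastiCache Redis:

```python
import hashlib
import re

def semantic_key(query: str) -> str:
    """Stand-in for the semantic hash: lowercase, strip punctuation, then hash.
    (The production system derived the key semantically; this is illustrative.)"""
    normalized = re.sub(r"[^a-z0-9 ]", "", query.lower()).strip()
    return hashlib.sha256(normalized.encode()).hexdigest()

cache: dict[str, str] = {}  # stands in for ElastiCache Redis

def cached_llm_answer(query: str, call_llm) -> str:
    """Return a cached answer when an equivalent query was seen before;
    only cache misses reach the LLM."""
    key = semantic_key(query)
    if key not in cache:
        cache[key] = call_llm(query)
    return cache[key]
```

With this scheme, "What's the return policy?" and "whats the return policy" resolve to the same cache entry and trigger only one LLM call.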
Haiku fallback: When Sonnet timeout rate exceeded 5% for 2 consecutive minutes, the orchestrator automatically routed simple intents (chitchat, order status formatting, FAQ) to Claude Haiku, which had far more available capacity and was 10x cheaper.
Metrics After Fix
- LLM timeout rate: < 0.5% (down from 8%)
- Monthly LLM cost reduction: ~$18K from intelligent routing
- User-visible impact: None — Haiku quality was sufficient for simple intents, and users couldn't tell the difference
Scenario 3: SageMaker Cold Start / Scale-Up Lag
The Problem
The DistilBERT intent classifier ran on a SageMaker real-time endpoint. When traffic spiked, SageMaker's auto-scaling took 5-8 minutes to add new instances. During that window:
- Request queue depth grew → P99 latency for intent classification spiked from 50ms to 2+ seconds
- First request to a new instance had P99 of 55 seconds due to model loading from S3
How I Solved It
| Fix | What I Did | Impact |
|---|---|---|
| Minimum instances = 2 | Guaranteed at least 2 warm instances at all times, even during low traffic | Eliminated cold start for normal load |
| Predictive + step scaling | Combined CloudWatch target-tracking scaling with a scheduled scaling action before known peak hours (Prime Day, major manga release dates) | Instances were ready before the spike hit |
| Inferentia migration | Moved DistilBERT from ml.g4dn.xlarge (GPU) to ml.inf1.xlarge (AWS Inferentia) using Neuron SDK compilation | 3x cheaper per instance + 4x faster model load |
| Warmup requests | Sent synthetic inference requests to new instances during scale-up, before they entered the load balancer | Eliminated the 55s first-request latency spike |
| INT8 quantization | Quantized DistilBERT to INT8 via ONNX Runtime | Model download from S3 dropped from 800ms to 200ms; inference latency unchanged |
Metrics After Fix
- Intent classifier P99: < 50ms (consistent at all load levels)
- Scale-up time: 2 minutes (predictive scaling starts early, completing before the spike peaks)
- Cost per inference: $0.0002 (down from $0.0006 on GPU)
Scenario 4: Thundering Herd on DynamoDB
The Problem
Popular sessions (e.g., a viral manga recommendation that got shared on social media) created hot partition reads. When 10,000 users clicked the same shared chat link within minutes, DynamoDB throttled reads on that session's partition key. This caused:
- `ProvisionedThroughputExceededException` for session loads
- Cascade: orchestrator retried immediately → amplified the thundering herd → more throttling
How I Solved It
| Fix | What I Did | Impact |
|---|---|---|
| DynamoDB DAX | Deployed DAX (in-memory cache) in front of DynamoDB for session reads | Hot session reads served from cache in microseconds, never hitting DynamoDB |
| Exponential backoff + jitter | Replaced immediate retries with exponential backoff (100ms, 200ms, 400ms) + random jitter (0-50ms) | Spread retry load over time instead of amplifying the spike |
| Read request coalescing | When multiple concurrent requests asked for the same session ID, only one actual DynamoDB read was issued; others waited for the result | Reduced read volume by 60% for hot sessions |
| On-demand capacity mode | Switched from provisioned to on-demand DynamoDB capacity for the sessions table | Auto-scales without pre-planning; no more manual capacity adjustments |
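Of these fixes, read coalescing is the least standard, so here is a minimal asyncio single-flight sketch. The class name and shape are illustrative; production code also needs error propagation and works only within one process:

```python
import asyncio

class SingleFlight:
    """Coalesce concurrent reads for the same key into one underlying call."""

    def __init__(self):
        self._inflight: dict[str, asyncio.Task] = {}

    async def do(self, key: str, fn):
        task = self._inflight.get(key)
        if task is None:
            # First caller for this key issues the real read...
            task = asyncio.create_task(self._run(key, fn))
            self._inflight[key] = task
        # ...and every concurrent caller awaits the same task.
        return await task

    async def _run(self, key, fn):
        try:
            return await fn()
        finally:
            self._inflight.pop(key, None)
```

Five concurrent loads of the same hot session ID result in exactly one DynamoDB read; the other four callers simply share its result.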
Metrics After Fix
- DynamoDB throttle events in production: zero (monitored for 6 months)
- Hot session read latency: < 1ms (DAX cache hit)
- Monthly DynamoDB cost change: +$200/month for DAX, but eliminated the engineering cost of capacity planning
Scenario 5: WebSocket Connection Limit
The Problem
Amazon API Gateway has a default limit of 128,000 concurrent WebSocket connections per account per region. During a load test simulating 500K concurrent sessions, I hit 90% of this limit. A single Prime Day surge could have exceeded it.
How I Solved It
| Fix | What I Did | Impact |
|---|---|---|
| Connection pooling | Frontend reconnects reused existing connections when possible instead of creating new ones | Reduced concurrent connection count by ~20% |
| Idle connection cleanup | Automatically disconnected sessions inactive for > 10 minutes (server-side ping/pong timeout) | Freed ~30% of connections held by abandoned tabs |
| ALB sticky sessions | Used Application Load Balancer sticky sessions so WebSocket reconnects land on the same orchestrator instance | Preserved in-memory context, reduced connection churn |
| AWS limit increase | Requested and received a limit increase to 500K concurrent connections | Headroom for 5x current peak |
| Multi-region architecture | Route 53 latency-based routing distributes connections across us-east-1 and us-west-2 | Each region handles half the connection load |
Metrics After Fix
- Peak concurrent WebSocket connections: ~200K during load test (well within 500K limit)
- Connection reuse rate: 65% (up from 0% before pooling)
- Headroom: 2.5x above observed peak
Scenario 6: Cache Stampede on ElastiCache (Catalog Invalidation)
The Problem
Product details for popular ASINs were cached in ElastiCache Redis with a 5-minute TTL. When the TTL expired for a cluster of popular ASINs simultaneously (e.g., all volumes of a trending series), hundreds of concurrent requests hit the Catalog API at once to repopulate the cache. This caused:
- Catalog API P99 latency spiked from 90ms to 800ms
- Some requests timed out → orchestrator returned fallback responses → degraded user experience
How I Solved It
| Fix | What I Did | Impact |
|---|---|---|
| Probabilistic Early Reexpiration (PER) | Each cache read has a small probability of triggering a background refresh before the TTL expires. The probability increases as TTL approaches 0 | Cache is refreshed before expiration; no thundering herd on TTL boundary |
| Staggered TTLs | Added random jitter (±60 seconds) to cache TTLs so related ASINs don't expire at the exact same moment | Spread refresh load over time |
| Background async refresh | When a cache miss occurs, the first request triggers an async background refresh. Concurrent requests get the stale value (stale-while-revalidate pattern) | Only 1 request hits the Catalog API; others get slightly stale data (acceptable for product details, never for prices) |
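The early-refresh probability can be sketched with the standard XFetch formula (from the "optimal probabilistic cache stampede prevention" literature); the production system tuned a beta-distribution variant, so treat this as the generic form:

```python
import math
import random
import time

def should_refresh_early(expiry_ts: float,
                         recompute_cost_s: float,
                         beta: float = 1.0) -> bool:
    """XFetch-style check: refresh now with a probability that rises as the
    TTL boundary nears and as the recompute gets more expensive.
    log(rnd) is negative, so subtracting it pushes 'now' toward the expiry."""
    rnd = random.random() or 1e-16  # guard against log(0)
    return time.time() - recompute_cost_s * beta * math.log(rnd) >= expiry_ts
```

Every cache read runs this check; far from expiry it almost never fires, and close to expiry some reader wins the lottery and repopulates the entry before the TTL boundary is ever crossed.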
Metrics After Fix
- Catalog API P99 during cache invalidation: 90ms (down from 800ms)
- Cache hit rate: 94% (up from 85% — PER keeps the cache warmer)
- Zero timeout events from cache stampede in production
Scenario 7: Multi-Model P99 Latency Spike Under Load
The Problem
At 10K+ concurrent requests, the end-to-end P99 latency spiked from 2.5s to 5.1s. Distributed tracing showed the bottleneck was the Cross-Encoder Reranker endpoint. Under sustained load, a queue built up at the SageMaker endpoint, and P99 for the reranker alone jumped from 120ms to 1.8s.
Root Cause Analysis
The reranker processed 50 candidate chunks per request (scoring each against the query). At 10K concurrent requests, the GPU was saturated. SageMaker auto-scaling had a 5-minute lag, so the queue grew faster than capacity could be added.
How I Solved It
| Fix | What I Did | Impact |
|---|---|---|
| Speculative execution | Started RAG retrieval in parallel with intent classification instead of waiting for intent results. ~70% of intents need RAG anyway, so wasted compute was minimal | Saved 150-300ms on the critical path |
| Reduce reranker input | Cut reranker input from top-50 to top-20 candidates (vector search quality was good enough that top-20 covered 95% of relevant results) | Reranker latency dropped 60% at the same load |
| Per-model circuit breakers | When the reranker P99 exceeded 500ms for 30 seconds, the circuit breaker opened and the orchestrator fell back to raw cosine similarity (top-3 from the vector search) | Prevented cascading latency delays |
| Batch inference | Grouped reranker requests into micro-batches (4 requests per batch) to improve GPU utilization | Throughput increased 2.5x on the same instance count |
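The micro-batching fix can be sketched as an asyncio batcher that flushes at 4 requests or after a short deadline, so a lone request is never stuck waiting for a full batch. Class name, sizes, and the scoring callback are assumptions for illustration:

```python
import asyncio

class MicroBatcher:
    """Queue reranker requests and flush in batches of `max_size`,
    or after `max_wait_ms` if the batch doesn't fill."""

    def __init__(self, score_batch, max_size: int = 4, max_wait_ms: float = 5):
        self._score_batch = score_batch   # scores a list of items in one pass
        self._max_size = max_size
        self._max_wait = max_wait_ms / 1000
        self._pending = []                # (item, future) pairs
        self._timer = None

    async def score(self, item):
        loop = asyncio.get_running_loop()
        fut = loop.create_future()
        self._pending.append((item, fut))
        if len(self._pending) >= self._max_size:
            self._flush()                 # batch is full: score immediately
        elif self._timer is None:
            self._timer = loop.call_later(self._max_wait, self._flush)
        return await fut

    def _flush(self):
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
        batch, self._pending = self._pending, []
        if not batch:
            return
        # One forward pass for the whole micro-batch improves GPU utilization.
        scores = self._score_batch([item for item, _ in batch])
        for (_, fut), s in zip(batch, scores):
            fut.set_result(s)
```

Eight concurrent requests through this batcher produce two batched calls of four instead of eight single-item calls.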
Metrics After Fix
- End-to-end P99 latency: 2.8s (down from 5.1s)
- Reranker P99 latency: 95ms (down from 1.8s under the same load)
- Reranker circuit breaker activation: ~2 times/month (during extreme bursts, with graceful fallback)
Summary of Scale Fixes and Business Impact
| Scenario | Before | After | Key Technique |
|---|---|---|---|
| Prime Day 10x spike | Errors at 20K msg/s | Handled 50K msg/s at P99 < 3s | Provisioned Bedrock + ECS/Lambda hybrid |
| Bedrock throttling | 8% timeout rate | < 0.5% timeout rate | Priority queuing + Haiku fallback + semantic caching |
| SageMaker cold start | 55s P99 first request | < 50ms consistent P99 | Inferentia + warmup + predictive scaling |
| DynamoDB thundering herd | Throttle events on hot sessions | Zero throttle events | DAX + jitter + request coalescing |
| WebSocket connection limit | 90% of 128K limit | 200K peak with 500K ceiling | Pooling + cleanup + limit increase + multi-region |
| Cache stampede | Catalog P99 = 800ms during invalidation | Catalog P99 = 90ms | PER + staggered TTLs + stale-while-revalidate |
| Multi-model P99 spike | E2E P99 = 5.1s at 10K concurrent | E2E P99 = 2.8s | Speculative execution + smaller reranker input + circuit breakers |
Interview Q&A — Scale Testing
Q: How did you approach load testing a chatbot system? It's not a simple REST API.
- Easy: I used Artillery.io for WebSocket and REST load generation, ramping from normal (5K msg/s) to 10x peak (50K msg/s) over 30 minutes, holding at peak for 60 minutes. I monitored CloudWatch dashboards, X-Ray traces, and custom Grafana panels to watch per-model latency breakdown in real time
- Medium: The key insight is that chatbot load testing is fundamentally about testing a dependency chain. A single user message triggers the intent classifier, embedding model, vector search, reranker, LLM, and guardrails. I had to instrument each step independently so that when P99 exceeded budget at a given load level, I could pinpoint exactly which model or service was the bottleneck. Standard RPS metrics tell you nothing useful — you need per-model, per-intent latency distributions
- Hard: The hardest problem was realistic traffic shaping. Real chatbot traffic isn't uniform — it's bursty (a viral tweet can spike one intent by 50x), multi-turn (sessions last 5-15 minutes with 3-8 messages), and intent-skewed (60% of messages are product discovery, only 5% are order tracking). I built traffic profiles based on production analytics that replicated these patterns. A uniform-load test would have missed the DynamoDB hot partition issue entirely because it only happens when many sessions hit the same popular product simultaneously
Q: What was the scariest production issue you found through scale testing?
- Easy: The cache stampede. When Redis TTL expired for popular ASINs, hundreds of requests simultaneously hit the Catalog API, causing P99 to spike from 90ms to 800ms. Users saw "I couldn't load product details right now" fallback messages during these windows
- Medium: What made it scary was that it was invisible at normal load. With 5K msg/s the cache stampede involved maybe 10-20 concurrent requests to the Catalog API — no problem. But at 50K msg/s, the same TTL expiry triggered 200-500 concurrent requests, overwhelming the Catalog API. It was a load-dependent bug that only appeared at scale
- Hard: The fix (Probabilistic Early Reexpiration + stale-while-revalidate) was elegant but required careful tuning. If the early-refresh probability was too high, I'd waste Catalog API calls refreshing cache that wasn't about to expire. If too low, stampedes still occurred. I ran A/B tests on the probability curve to find the sweet spot (beta distribution with alpha=2, centered at 80% of TTL remaining). The result: 94% cache hit rate with zero stampede events. But it took 3 iterations of load testing to get the parameters right
Q: How did you handle the trade-off between cost and latency when scaling ML models?
- Easy: I migrated the DistilBERT intent classifier from GPU (`ml.g4dn.xlarge` at $0.736/hr) to AWS Inferentia (`ml.inf1.xlarge` at ~$0.24/hr). This cut classifier inference cost by 3x with no quality impact because the Neuron SDK compiled the model efficiently for Inferentia hardware
- Medium: The larger cost decision was LLM model routing. Not every message needs Claude 3.5 Sonnet. I built a router that sends chitchat to template responses (zero cost), simple formatting tasks to Claude Haiku (10x cheaper), and only complex reasoning to Sonnet. This saved ~$18K/month. The trade-off was a small quality delta for borderline cases — but evaluation showed Haiku was indistinguishable from Sonnet for simple intent categories
- Hard: The hardest cost-latency trade-off was the reranker. I could cut latency by reducing the candidate set from top-50 to top-20, but this risked missing relevant results in the long tail. I ran an offline evaluation: for our FAQ and product Q&A datasets, top-20 covered 95% of the relevant results that top-50 found. The 5% miss rate was acceptable because the LLM could compensate with its own knowledge. But I wouldn't have been comfortable making that trade-off without the evaluation data proving it