
03. Scale Testing Scenarios — Real Issues I Faced and How I Solved Them

"Load testing a chatbot is fundamentally different from load testing a REST API. Every user message triggers a 4-model inference chain, 2-3 parallel service calls, a vector search, and a guardrails pipeline. At 50K concurrent sessions, bottlenecks showed up in places I never expected — DynamoDB hot partitions, Redis cache stampedes, and WebSocket connection ceilings. Here are the 7 worst scenarios I hit and how I fixed each one."


Scale Targets

| Metric | Normal | Peak (Prime Day) | What Breaks First |
| --- | --- | --- | --- |
| Concurrent sessions | ~50,000 | ~500,000 | WebSocket connections, Bedrock quota |
| Messages per second | ~5,000 | ~50,000 | Intent classifier throughput, DynamoDB writes |
| LLM calls per second | ~3,000 | ~30,000 | Bedrock throttling |
| P99 latency (first token) | < 1.5s | < 3s | Reranker queue depth |
| Availability | 99.9% | 99.9% | Circuit breaker cascades |

How I Ran Scale Tests

  • Tool: Artillery.io for WebSocket + REST load generation, plus custom Python scripts for sustained-load LLM benchmarks (a simplified sketch of the Python driver follows this list)
  • Environment: Dedicated load-test environment mirroring production (same instance types, same Bedrock model configs, same DynamoDB table config)
  • Approach: Ramp from 0 to 10x normal load over 30 minutes, hold at peak for 60 minutes, then step down
  • Monitoring: CloudWatch dashboards, X-Ray traces, custom Grafana panels for per-model latency breakdown
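
The sustained-load Python driver followed the same ramp/hold/step-down shape as the Artillery runs. The sketch below is a scaled-down, self-contained illustration of that shape only; `send_message`, the session naming, and the rates are stand-ins (the real tests drove 5K-50K msg/s through Artillery workers, not a single Python process).

```python
import asyncio
import random
import time

async def send_message(session_id: str) -> float:
    """Stub for the real WebSocket/REST client; returns the observed latency."""
    start = time.perf_counter()
    await asyncio.sleep(random.uniform(0.05, 0.3))  # stand-in for the network call
    return time.perf_counter() - start

async def run_phase(rate_start: float, rate_end: float, duration_s: int, latencies: list) -> None:
    """Drive load at a linearly interpolated rate (approximate: one batch per second)."""
    start = time.monotonic()
    while (elapsed := time.monotonic() - start) < duration_s:
        rate = rate_start + (rate_end - rate_start) * (elapsed / duration_s)
        batch = [send_message(f"session-{random.randint(0, 500_000)}")
                 for _ in range(int(rate))]
        latencies.extend(await asyncio.gather(*batch))
        await asyncio.sleep(1)

async def main() -> None:
    latencies: list[float] = []
    # Rates scaled down ~1000x versus production for illustration.
    await run_phase(5, 50, 30 * 60, latencies)   # ramp to 10x over 30 minutes
    await run_phase(50, 50, 60 * 60, latencies)  # hold at peak for 60 minutes
    await run_phase(50, 5, 10 * 60, latencies)   # illustrative step-down
    latencies.sort()
    print("P99 latency:", latencies[int(len(latencies) * 0.99)])

if __name__ == "__main__":
    asyncio.run(main())
```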

Scenario 1: Prime Day Traffic Spike (10x Load)

The Problem

During the first simulated Prime Day load test, I ramped from 5K to 50K messages/second. At around 20K msg/s, three things broke simultaneously:

  1. Bedrock returned 429 (ThrottlingException) — we exceeded our on-demand quota for Claude 3.5 Sonnet
  2. DynamoDB session writes started throttling — popular manga titles (new releases) created hot partition keys
  3. Lambda cold starts spiked — burst overflow from ECS to Lambda triggered 3-5 second cold starts

How I Solved It

| Problem | Solution | Result |
| --- | --- | --- |
| Bedrock throttling | Purchased Provisioned Throughput for Sonnet (baseline) + kept on-demand for Haiku (overflow) | Zero throttling at 30K LLM calls/sec |
| DynamoDB hot partitions | Added a random suffix to partition keys for high-traffic sessions; enabled DynamoDB Adaptive Capacity | Throttle events dropped to zero |
| Lambda cold starts | Pre-warmed Lambda with scheduled pings every 5 min; set ECS scaling target at 60% CPU (triggers earlier) | Cold start P99 dropped from 5s to 0.8s |
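
The hot-partition fix is worth spelling out. Below is a minimal sketch of the write-sharding idea, assuming boto3 and a hypothetical `chat-sessions` table with `pk`/`sk` keys; the real schema and shard count differed. The trade-off is that reads must fan out across every shard and merge, which is why this was applied only to high-traffic sessions.

```python
import random
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("chat-sessions")  # hypothetical table name
SHARDS = 10  # spreads one hot logical key across 10 physical partitions

def put_session_event(session_id: str, timestamp: str, payload: dict) -> None:
    """Write with a sharded partition key so bursts on one session spread out."""
    shard = random.randint(0, SHARDS - 1)
    table.put_item(Item={
        "pk": f"{session_id}#{shard}",  # e.g. "sess-123#7"
        "sk": timestamp,
        **payload,
    })

def get_session_events(session_id: str) -> list[dict]:
    """Reads fan out across all shards and merge the results by sort key."""
    items: list[dict] = []
    for shard in range(SHARDS):
        resp = table.query(KeyConditionExpression=Key("pk").eq(f"{session_id}#{shard}"))
        items.extend(resp["Items"])
    return sorted(items, key=lambda item: item["sk"])
```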

Metrics After Fix

  • P99 latency at 50K msg/s: 2.7s (under the 3s target)
  • Error rate at peak: 0.12% (well under 1% threshold)
  • Zero Bedrock throttling events during the subsequent load test

Scenario 2: Bedrock LLM Throttling Under Sustained Load

The Problem

Even after provisioning baseline throughput, sustained high-load periods (not spikes, but 2+ hours of elevated traffic) caused Bedrock timeout rates to climb to 8%. The issue was that on-demand overflow capacity had soft limits, and during peak periods many AWS customers competed for the same capacity pool.

How I Solved It

Priority queuing: I built a request priority system:

| Priority | Who | Model | Queue Behavior |
| --- | --- | --- | --- |
| P0 (critical) | Authenticated + Prime users | Claude 3.5 Sonnet | Served first, never dropped |
| P1 (normal) | Authenticated non-Prime | Claude 3.5 Sonnet | Served in order, retry once on timeout |
| P2 (best-effort) | Guest users | Claude Haiku (cheaper) | Downgrade to Haiku under pressure |

Semantic response caching: For common queries ("what's the return policy?", "do you have free shipping?"), I cached LLM responses in ElastiCache Redis keyed by a semantic hash of the query. Cache hit rate reached ~15%, directly reducing Bedrock call volume.
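
A rough sketch of that caching flow is below. The Redis endpoint, key prefix, and `call_llm` hook are hypothetical, and the `semantic_key` shown here is a crude text-normalization stand-in for the real semantic hashing.

```python
import hashlib
import json
import re
import redis

r = redis.Redis(host="my-cache.example.internal", port=6379)  # hypothetical endpoint
CACHE_TTL_S = 300

def semantic_key(query: str) -> str:
    """Rough stand-in for the semantic hash: lowercase, strip punctuation,
    collapse whitespace, then hash. Production keying was more sophisticated."""
    normalized = re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", query.lower())).strip()
    return "llm-cache:" + hashlib.sha256(normalized.encode()).hexdigest()

def cached_llm_answer(query: str, call_llm) -> str:
    key = semantic_key(query)
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)            # cache hit: no Bedrock call
    answer = call_llm(query)              # cache miss: one Bedrock call
    r.setex(key, CACHE_TTL_S, json.dumps(answer))
    return answer
```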

Haiku fallback: When Sonnet timeout rate exceeded 5% for 2 consecutive minutes, the orchestrator automatically routed simple intents (chitchat, order status formatting, FAQ) to Claude Haiku, which had far more available capacity and was 10x cheaper.
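
In sketch form, the fallback logic looked something like the class below. The window size, threshold, intent names, and model IDs are illustrative, not the production configuration.

```python
import time
from collections import deque

SIMPLE_INTENTS = {"chitchat", "order_status_format", "faq"}      # safe to downgrade
SONNET = "anthropic.claude-3-5-sonnet-20240620-v1:0"             # model IDs illustrative
HAIKU = "anthropic.claude-3-haiku-20240307-v1:0"

class ModelRouter:
    """Tracks Sonnet timeouts over a sliding window; routes simple intents to
    Haiku once the timeout rate has stayed above 5% for 2 consecutive minutes."""

    def __init__(self, window_s: int = 120, threshold: float = 0.05):
        self.window_s = window_s
        self.threshold = threshold
        self.calls = deque()            # (timestamp, timed_out) pairs
        self.degraded_since = None      # when the rate first crossed the threshold

    def record(self, timed_out: bool) -> None:
        now = time.monotonic()
        self.calls.append((now, timed_out))
        while self.calls and now - self.calls[0][0] > self.window_s:
            self.calls.popleft()

    def _timeout_rate(self) -> float:
        return sum(1 for _, t in self.calls if t) / len(self.calls) if self.calls else 0.0

    def choose_model(self, intent: str) -> str:
        now = time.monotonic()
        if self._timeout_rate() > self.threshold:
            self.degraded_since = self.degraded_since or now
        else:
            self.degraded_since = None
        sustained = self.degraded_since is not None and now - self.degraded_since >= self.window_s
        return HAIKU if sustained and intent in SIMPLE_INTENTS else SONNET
```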

Metrics After Fix

  • LLM timeout rate: < 0.5% (down from 8%)
  • Monthly LLM cost reduction: ~$18K from intelligent routing
  • User-visible impact: None — Haiku quality was sufficient for simple intents, and users couldn't tell the difference

Scenario 3: SageMaker Cold Start / Scale-Up Lag

The Problem

The DistilBERT intent classifier ran on a SageMaker real-time endpoint. When traffic spiked, SageMaker's auto-scaling took 5-8 minutes to add new instances. During that window:

  • Request queue depth grew → P99 latency for intent classification spiked from 50ms to 2+ seconds
  • First requests to newly added instances had a P99 of 55 seconds because the model had to load from S3

How I Solved It

| Fix | What I Did | Impact |
| --- | --- | --- |
| Minimum instances = 2 | Guaranteed at least 2 warm instances at all times, even during low traffic | Eliminated cold starts for normal load |
| Predictive + scheduled scaling | Combined CloudWatch target-tracking scaling with a scheduled scaling action before known peak hours (Prime Day, major manga release dates) | Instances were ready before the spike hit |
| Inferentia migration | Moved DistilBERT from ml.g4dn.xlarge (GPU) to ml.inf1.xlarge (AWS Inferentia) using Neuron SDK compilation | 3x cheaper per instance + 4x faster model load |
| Warmup requests | Sent synthetic inference requests to new instances during scale-up, before they entered the load balancer | Eliminated the 55s first-request latency spike |
| INT8 quantization | Quantized DistilBERT to INT8 via ONNX Runtime | Model download from S3 dropped from 800ms to 200ms; inference latency unchanged |
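
The scheduled-scaling piece can be expressed in a few lines of boto3. This is a sketch with a hypothetical endpoint name, cron expression, and capacities; it assumes the endpoint variant is already registered as a scalable target.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Hypothetical endpoint and variant names; real values differed.
resource_id = "endpoint/intent-classifier-prod/variant/AllTraffic"

# Raise the minimum instance count ahead of a known peak so capacity is in place
# before target-tracking scaling would have reacted on its own.
autoscaling.put_scheduled_action(
    ServiceNamespace="sagemaker",
    ScheduledActionName="pre-prime-day-warmup",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    Schedule="cron(0 6 ? * MON *)",  # illustrative: 06:00 UTC before the peak window
    ScalableTargetAction={"MinCapacity": 8, "MaxCapacity": 20},
)
```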

Metrics After Fix

  • Intent classifier P99: < 50ms (consistent at all load levels)
  • Scale-up time: 2 minutes (predictive scaling starts early, completing before the spike peaks)
  • Cost per inference: $0.0002 (down from $0.0006 on GPU)

Scenario 4: Thundering Herd on DynamoDB

The Problem

Popular sessions (e.g., a viral manga recommendation that got shared on social media) created hot partition reads. When 10,000 users clicked the same shared chat link within minutes, DynamoDB throttled reads on that session's partition key. This caused:

  • ProvisionedThroughputExceededException for session loads
  • Cascade: orchestrator retried immediately → amplified the thundering herd → more throttling

How I Solved It

| Fix | What I Did | Impact |
| --- | --- | --- |
| DynamoDB DAX | Deployed DAX (in-memory cache) in front of DynamoDB for session reads | Hot session reads served from cache in microseconds, never hitting DynamoDB |
| Exponential backoff + jitter | Replaced immediate retries with exponential backoff (100ms, 200ms, 400ms) + random jitter (0-50ms) | Spread retry load over time instead of amplifying the spike |
| Read request coalescing | When multiple concurrent requests asked for the same session ID, only one actual DynamoDB read was issued; others waited for the result | Reduced read volume by 60% for hot sessions |
| On-demand capacity mode | Switched from provisioned to on-demand DynamoDB capacity for the sessions table | Auto-scales without pre-planning; no more manual capacity adjustments |
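
A minimal asyncio sketch of the coalescing and backoff behaviour, with a hypothetical `fetch_fn` standing in for the DAX/DynamoDB read:

```python
import asyncio
import random

class SessionReadCoalescer:
    """Single-flight reads: concurrent requests for the same session ID share
    one backend call instead of each issuing their own."""

    def __init__(self, fetch_fn):
        self.fetch_fn = fetch_fn                  # async fn: session_id -> session dict
        self.in_flight: dict[str, asyncio.Task] = {}

    async def get(self, session_id: str) -> dict:
        task = self.in_flight.get(session_id)
        if task is None:
            task = asyncio.create_task(self._fetch_with_backoff(session_id))
            self.in_flight[session_id] = task
            task.add_done_callback(lambda _: self.in_flight.pop(session_id, None))
        return await task

    async def _fetch_with_backoff(self, session_id: str) -> dict:
        # Exponential backoff (100ms, 200ms, 400ms) plus 0-50ms jitter, replacing
        # the immediate retries that amplified the thundering herd.
        for attempt in range(3):
            try:
                return await self.fetch_fn(session_id)
            except Exception:
                await asyncio.sleep(0.1 * (2 ** attempt) + random.uniform(0, 0.05))
        return await self.fetch_fn(session_id)    # final attempt, let errors surface
```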

Metrics After Fix

  • DynamoDB throttle events in production: zero (monitored for 6 months)
  • Hot session read latency: < 1ms (DAX cache hit)
  • Monthly DynamoDB cost change: +$200/month for DAX, but eliminated the engineering cost of capacity planning

Scenario 5: WebSocket Connection Limit

The Problem

Amazon API Gateway has a default limit of 128,000 concurrent WebSocket connections per account per region. During a load test simulating 500K concurrent sessions, I hit 90% of this limit. A single Prime Day surge could have exceeded it.

How I Solved It

| Fix | What I Did | Impact |
| --- | --- | --- |
| Connection pooling | Frontend reconnects reused existing connections when possible instead of creating new ones | Reduced concurrent connection count by ~20% |
| Idle connection cleanup | Automatically disconnected sessions inactive for > 10 minutes (server-side ping/pong timeout) | Freed ~30% of connections held by abandoned tabs |
| ALB sticky sessions | Used Application Load Balancer sticky sessions so WebSocket reconnects land on the same orchestrator instance | Preserved in-memory context, reduced connection churn |
| AWS limit increase | Requested and received a limit increase to 500K concurrent connections | Headroom for 5x current peak |
| Multi-region architecture | Route 53 latency-based routing distributes connections across us-east-1 and us-west-2 | Each region handles half the connection load |
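
The idle-cleanup half can be sketched as a periodic sweeper against the API Gateway management API. The connection IDs, endpoint URL, and in-memory `last_seen` map are illustrative; in practice the activity timestamps lived in a shared store rather than process memory.

```python
import time
import boto3

# Hypothetical management endpoint for the WebSocket API stage.
apigw = boto3.client(
    "apigatewaymanagementapi",
    endpoint_url="https://abc123.execute-api.us-east-1.amazonaws.com/prod",
)

IDLE_TIMEOUT_S = 10 * 60
last_seen: dict[str, float] = {}   # connection_id -> last activity timestamp

def touch(connection_id: str) -> None:
    """Called on every inbound message or pong from the client."""
    last_seen[connection_id] = time.time()

def sweep_idle_connections() -> None:
    """Runs on a schedule (e.g. every minute); drops connections idle > 10 minutes
    so abandoned tabs stop holding API Gateway connection slots."""
    now = time.time()
    for connection_id, seen in list(last_seen.items()):
        if now - seen > IDLE_TIMEOUT_S:
            try:
                apigw.delete_connection(ConnectionId=connection_id)
            except apigw.exceptions.GoneException:
                pass  # client already disconnected on its own
            last_seen.pop(connection_id, None)
```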

Metrics After Fix

  • Peak concurrent WebSocket connections: ~200K during load test (well within 500K limit)
  • Connection reuse rate: 65% (up from 0% before pooling)
  • Headroom: 2.5x above observed peak

Scenario 6: Cache Stampede on ElastiCache (Catalog Invalidation)

The Problem

Product details for popular ASINs were cached in ElastiCache Redis with a 5-minute TTL. When the TTL expired for a cluster of popular ASINs simultaneously (e.g., all volumes of a trending series), hundreds of concurrent requests hit the Catalog API at once to repopulate the cache. This caused:

  • Catalog API P99 latency spiked from 90ms to 800ms
  • Some requests timed out → orchestrator returned fallback responses → degraded user experience

How I Solved It

| Fix | What I Did | Impact |
| --- | --- | --- |
| Probabilistic Early Reexpiration (PER) | Each cache read has a small probability of triggering a background refresh before the TTL expires; the probability increases as the TTL approaches 0 | Cache is refreshed before expiration; no thundering herd at the TTL boundary |
| Staggered TTLs | Added random jitter (±60 seconds) to cache TTLs so related ASINs don't expire at the exact same moment | Spread refresh load over time |
| Background async refresh | When a cache miss occurs, the first request triggers an async background refresh; concurrent requests get the stale value (stale-while-revalidate pattern) | Only 1 request hits the Catalog API; others get slightly stale data (acceptable for product details, never for prices) |
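
A compact sketch of PER plus staggered TTLs, assuming a Redis client and a hypothetical `fetch_fn` for the Catalog API. The refresh-probability curve here is a simple ramp for illustration (the production curve was tuned separately, as described in the Q&A below), and the stale-while-revalidate path is omitted for brevity.

```python
import json
import random
import threading
import redis

r = redis.Redis(host="my-cache.example.internal", port=6379)  # hypothetical endpoint
TTL_S = 300
JITTER_S = 60   # staggered TTLs: +/-60s so related ASINs don't expire together

def _refresh(asin: str, fetch_fn) -> None:
    value = fetch_fn(asin)                           # one call to the Catalog API
    ttl = TTL_S + random.randint(-JITTER_S, JITTER_S)
    r.setex(f"catalog:{asin}", ttl, json.dumps(value))

def get_product(asin: str, fetch_fn) -> dict:
    key = f"catalog:{asin}"
    raw = r.get(key)
    if raw is None:
        _refresh(asin, fetch_fn)                     # true miss: synchronous fill
        return json.loads(r.get(key))
    remaining = r.ttl(key)                           # seconds left before expiry
    # Early-refresh probability grows as the TTL runs down; this keeps most
    # entries from ever hitting the TTL boundary at the same moment.
    refresh_prob = max(0.0, 1.0 - remaining / TTL_S) ** 2
    if random.random() < refresh_prob * 0.1:         # keep background refreshes rare
        threading.Thread(target=_refresh, args=(asin, fetch_fn), daemon=True).start()
    return json.loads(raw)
```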

Metrics After Fix

  • Catalog API P99 during cache invalidation: 90ms (down from 800ms)
  • Cache hit rate: 94% (up from 85% — PER keeps the cache warmer)
  • Zero timeout events from cache stampede in production

Scenario 7: Multi-Model P99 Latency Spike Under Load

The Problem

At 10K+ concurrent requests, the end-to-end P99 latency spiked from 2.5s to 5.1s. Distributed tracing showed the bottleneck was the Cross-Encoder Reranker endpoint. Under sustained load, a queue built up at the SageMaker endpoint, and P99 for the reranker alone jumped from 120ms to 1.8s.

Root Cause Analysis

The reranker processed 50 candidate chunks per request (scoring each against the query). At 10K concurrent requests, the GPU was saturated. SageMaker auto-scaling had a 5-minute lag, so the queue grew faster than capacity could be added.

How I Solved It

| Fix | What I Did | Impact |
| --- | --- | --- |
| Speculative execution | Started RAG retrieval in parallel with intent classification instead of waiting for intent results; ~70% of intents need RAG anyway, so wasted compute was minimal | Saved 150-300ms on the critical path |
| Reduced reranker input | Cut reranker input from top-50 to top-20 candidates (vector search quality was good enough that top-20 covered 95% of relevant results) | Reranker latency dropped 60% at the same load |
| Per-model circuit breakers | When the reranker P99 exceeded 500ms for 30 seconds, the circuit breaker opened and the orchestrator fell back to raw cosine similarity (top-3 from the vector search) | Prevented cascading latency across the pipeline |
| Batch inference | Grouped reranker requests into micro-batches (4 requests per batch) to improve GPU utilization | Throughput increased 2.5x on the same instance count |
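
Speculative execution is easiest to see in code. Below is a minimal asyncio sketch; the stub service calls, intent names, and timings stand in for the real classifier and retrieval clients.

```python
import asyncio

async def classify_intent(message: str) -> str:
    await asyncio.sleep(0.05)              # stand-in for the SageMaker classifier call
    return "product_qa"

async def retrieve_candidates(message: str) -> list[dict]:
    await asyncio.sleep(0.12)              # stand-in for embedding + vector search
    return [{"chunk": "example chunk"}]

RAG_INTENTS = {"product_qa", "faq", "recommendation"}   # intents that use retrieval

async def handle_message(message: str) -> tuple[str, list[dict]]:
    """Kick off retrieval speculatively instead of waiting for the intent result.
    Most intents need RAG anyway, so the wasted work on the rest is cheap compared
    to the 150-300ms saved on the critical path."""
    retrieval_task = asyncio.create_task(retrieve_candidates(message))
    intent = await classify_intent(message)
    if intent in RAG_INTENTS:
        candidates = await retrieval_task  # already in flight, often already done
    else:
        retrieval_task.cancel()            # speculation not needed; discard it
        candidates = []
    return intent, candidates
```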

Metrics After Fix

  • End-to-end P99 latency: 2.8s (down from 5.1s)
  • Reranker P99 latency: 95ms (down from 1.8s under the same load)
  • Reranker circuit breaker activation: ~2 times/month (during extreme bursts, with graceful fallback)

Summary of Scale Fixes and Business Impact

| Scenario | Before | After | Key Technique |
| --- | --- | --- | --- |
| Prime Day 10x spike | Errors at 20K msg/s | Handled 50K msg/s at P99 < 3s | Provisioned Bedrock + ECS/Lambda hybrid |
| Bedrock throttling | 8% timeout rate | < 0.5% timeout rate | Priority queuing + Haiku fallback + semantic caching |
| SageMaker cold start | 55s P99 first request | < 50ms consistent P99 | Inferentia + warmup + predictive scaling |
| DynamoDB thundering herd | Throttle events on hot sessions | Zero throttle events | DAX + jitter + request coalescing |
| WebSocket connection limit | 90% of 128K limit | 200K peak with 500K ceiling | Pooling + cleanup + limit increase + multi-region |
| Cache stampede | Catalog P99 = 800ms during invalidation | Catalog P99 = 90ms | PER + staggered TTLs + stale-while-revalidate |
| Multi-model P99 spike | E2E P99 = 5.1s at 10K concurrent | E2E P99 = 2.8s | Speculative execution + smaller reranker input + circuit breakers |

Interview Q&A — Scale Testing

Q: How did you approach load testing a chatbot system? It's not a simple REST API.

  • Easy: I used Artillery.io for WebSocket and REST load generation, ramping from normal (5K msg/s) to 10x peak (50K msg/s) over 30 minutes, holding at peak for 60 minutes. I monitored CloudWatch dashboards, X-Ray traces, and custom Grafana panels to watch per-model latency breakdown in real time
  • Medium: The key insight is that chatbot load testing is fundamentally about testing a dependency chain. A single user message triggers the intent classifier, embedding model, vector search, reranker, LLM, and guardrails. I had to instrument each step independently so that when P99 exceeded budget at a given load level, I could pinpoint exactly which model or service was the bottleneck. Standard RPS metrics tell you nothing useful — you need per-model, per-intent latency distributions
  • Hard: The hardest problem was realistic traffic shaping. Real chatbot traffic isn't uniform — it's bursty (a viral tweet can spike one intent by 50x), multi-turn (sessions last 5-15 minutes with 3-8 messages), and intent-skewed (60% of messages are product discovery, only 5% are order tracking). I built traffic profiles based on production analytics that replicated these patterns. A uniform-load test would have missed the DynamoDB hot partition issue entirely because it only happens when many sessions hit the same popular product simultaneously

Q: What was the scariest production issue you found through scale testing?

  • Easy: The cache stampede. When Redis TTL expired for popular ASINs, hundreds of requests simultaneously hit the Catalog API, causing P99 to spike from 90ms to 800ms. Users saw "I couldn't load product details right now" fallback messages during these windows
  • Medium: What made it scary was that it was invisible at normal load. With 5K msg/s the cache stampede involved maybe 10-20 concurrent requests to the Catalog API — no problem. But at 50K msg/s, the same TTL expiry triggered 200-500 concurrent requests, overwhelming the Catalog API. It was a load-dependent bug that only appeared at scale
  • Hard: The fix (Probabilistic Early Reexpiration + stale-while-revalidate) was elegant but required careful tuning. If the early-refresh probability was too high, I'd waste Catalog API calls refreshing cache that wasn't about to expire. If too low, stampedes still occurred. I ran A/B tests on the probability curve to find the sweet spot (beta distribution with alpha=2, centered at 80% of TTL remaining). The result: 94% cache hit rate with zero stampede events. But it took 3 iterations of load testing to get the parameters right

Q: How did you handle the trade-off between cost and latency when scaling ML models?

  • Easy: I migrated the DistilBERT intent classifier from GPU (ml.g4dn.xlarge at $0.736/hr) to AWS Inferentia (ml.inf1.xlarge at ~$0.24/hr). This cut classifier inference cost to roughly a third with no quality impact because the Neuron SDK compiled the model efficiently for Inferentia hardware
  • Medium: The larger cost decision was LLM model routing. Not every message needs Claude 3.5 Sonnet. I built a router that sends chitchat to template responses (zero cost), simple formatting tasks to Claude Haiku (10x cheaper), and only complex reasoning to Sonnet. This saved ~$18K/month. The trade-off was a small quality delta for borderline cases — but evaluation showed Haiku was indistinguishable from Sonnet for simple intent categories
  • Hard: The hardest cost-latency trade-off was the reranker. I could cut latency by reducing the candidate set from top-50 to top-20, but this risked missing relevant results in the long tail. I ran an offline evaluation: for our FAQ and product Q&A datasets, top-20 covered 95% of the relevant results that top-50 found. The 5% miss rate was acceptable because the LLM could compensate with its own knowledge. But I wouldn't have been comfortable making that trade-off without the evaluation data proving it