11. Scalability and Reliability - Designing for Amazon Scale
Scale Requirements
The JP Manga store can see large traffic spikes during Prime Day, the holiday season, and major release drops.
| Metric | Normal | Peak |
|---|---|---|
| Concurrent chat sessions | ~50,000 | ~500,000 |
| Messages per second | ~5,000 | ~50,000 |
| LLM calls per second | ~3,000 | ~30,000 |
| P99 latency target (first token) | < 1.5s | < 3s |
| Availability target | 99.9% | 99.9% |
High-Traffic Handling
graph TD
subgraph "Load Distribution"
A[CloudFront CDN<br>Static assets] --> B[Application Load Balancer<br>WebSocket sticky sessions]
B --> C[Auto-Scaling Group<br>Orchestrator instances]
C --> D[Lambda Functions<br>Overflow capacity]
end
subgraph "Compute Strategy"
C --> E[ECS Fargate<br>Baseline capacity]
D --> F[Lambda<br>Burst capacity]
end
Baseline traffic runs on ECS Fargate. Overflow traffic can spill to Lambda for burst handling.
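As a hedged sketch of how the Fargate baseline could scale itself, a target-tracking policy on the ECS service might look roughly like the following. The cluster and service names, capacity bounds, and CPU target are illustrative assumptions, not tuned values.

```python
import boto3

# Assumed names: "chat-cluster" / "orchestrator" are placeholders for the real
# ECS cluster and service; 70% CPU is an illustrative target, not a tuned value.
autoscaling = boto3.client("application-autoscaling")

autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/chat-cluster/orchestrator",
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=10,    # baseline capacity for normal traffic
    MaxCapacity=200,   # ceiling for Prime Day / release-drop peaks
)

autoscaling.put_scaling_policy(
    PolicyName="orchestrator-cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId="service/chat-cluster/orchestrator",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,  # keep average service CPU around 70%
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 60,   # scale out quickly during spikes
        "ScaleInCooldown": 300,   # scale in conservatively afterwards
    },
)
```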
Latency Optimization
The biggest latency win is calling downstream services in parallel instead of sequentially.
Sequential:
Reco Engine (200ms) -> Catalog (100ms) -> RAG (300ms) = 600ms total
Parallel:
Reco Engine (200ms) --\
Catalog (100ms)     ---+--> max(200, 100, 300) = 300ms total
RAG (300ms)         --/
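A minimal sketch of the fan-out, assuming each downstream call has an async client; the three `fetch_*` functions below are hypothetical stand-ins that only simulate the latencies from the comparison above.

```python
import asyncio

# Hypothetical stand-ins for the real downstream clients; each just sleeps for
# the latency used in the comparison above and returns a placeholder payload.
async def fetch_recommendations(user_id: str) -> dict:
    await asyncio.sleep(0.2)   # ~200ms Reco Engine call
    return {"recommendations": []}

async def fetch_catalog(asin: str) -> dict:
    await asyncio.sleep(0.1)   # ~100ms Catalog call
    return {"product": asin}

async def fetch_rag_context(query: str) -> dict:
    await asyncio.sleep(0.3)   # ~300ms RAG retrieval
    return {"passages": []}

async def gather_context(user_id: str, asin: str, query: str) -> dict:
    # All three calls are issued at once, so total latency is max(...), not the sum.
    reco, catalog, rag = await asyncio.gather(
        fetch_recommendations(user_id),
        fetch_catalog(asin),
        fetch_rag_context(query),
        return_exceptions=True,  # one failing service must not sink the others
    )
    return {"reco": reco, "catalog": catalog, "rag": rag}

# Completes in roughly 300ms (the slowest call), not 600ms.
context = asyncio.run(gather_context("user-1", "ASIN-123", "latest volume"))
```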
Caching Strategy
| Data | Cache Location | TTL | Invalidation |
|---|---|---|---|
| Product details | ElastiCache | 5 min | Event-driven catalog updates |
| Recommendations | ElastiCache | 15 min | New session |
| Promotions | ElastiCache | 15 min | Event-driven |
| Reviews and ratings | ElastiCache | 1 hour | Scheduled refresh |
| Prices | Never cached | N/A | Always fetch live |
| Conversation memory | DynamoDB | Session lifetime | TTL expiry |
| RAG embeddings | OpenSearch | Until re-indexed | Event-driven or scheduled refresh |
Conversation memory stays in DynamoDB as the durable source of truth; a small read cache can be added later if needed.
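As a sketch of the cache-aside pattern behind the product-details row above, assuming a Redis-compatible ElastiCache endpoint (the hostname is a placeholder) and a hypothetical `get_product_from_catalog` call to the live catalog service:

```python
import json
import redis  # redis-py client pointed at the ElastiCache endpoint

PRODUCT_TTL_SECONDS = 300  # 5 minutes, matching the table above

cache = redis.Redis(host="my-elasticache-endpoint", port=6379)  # assumed endpoint

def get_product_from_catalog(asin: str) -> dict:
    """Hypothetical placeholder for the live Product Catalog call."""
    raise NotImplementedError

def get_product_details(asin: str) -> dict:
    key = f"product:{asin}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)              # cache hit
    product = get_product_from_catalog(asin)   # cache miss: fetch live
    product.pop("price", None)                 # prices are always fetched live, never cached
    cache.setex(key, PRODUCT_TTL_SECONDS, json.dumps(product))
    return product
```

Event-driven invalidation from catalog updates would then simply delete the `product:{asin}` key when the update event arrives.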
Streaming
LLM responses are streamed token by token via WebSocket. The user sees the first word quickly even though the full response takes longer.
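A hedged sketch of the server side, assuming the chat runs over an API Gateway WebSocket API (so tokens are pushed with `post_to_connection`) and a hypothetical `stream_llm_tokens` generator standing in for the model's streaming response:

```python
import boto3

# Assumed: the chat runs over an API Gateway WebSocket API, so the endpoint URL
# and connection_id come from the $connect / message events.
apigw = boto3.client(
    "apigatewaymanagementapi",
    endpoint_url="https://example.execute-api.us-east-1.amazonaws.com/prod",  # assumed stage URL
)

def stream_llm_tokens(prompt: str):
    """Hypothetical generator yielding tokens as the model produces them."""
    yield from ["Here", " are", " three", " manga", " picks", "..."]

def stream_response(connection_id: str, prompt: str) -> None:
    # Push each token to the client as soon as it exists, so the first word
    # appears well before the full response has been generated.
    for token in stream_llm_tokens(prompt):
        apigw.post_to_connection(ConnectionId=connection_id, Data=token.encode("utf-8"))
```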
Rate Limiting
graph TD
A[Incoming Request] --> B{User authenticated?}
B -->|Yes| C[Token Bucket<br>30 msg/min per user]
B -->|No| D[Token Bucket<br>10 msg/min per session]
C --> E{Bucket has tokens?}
D --> E
E -->|Yes| F[Process request]
E -->|No| G[429 Too Many Requests<br>Please wait a moment]
Additional protections:
- Global rate limit for the LLM path.
- Per-session message limits.
- Payload size limits to reduce abuse.
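A minimal in-process sketch of the token bucket from the diagram; in production the counters would live in a shared store (for example ElastiCache) so that every orchestrator instance sees the same limits.

```python
import time

class TokenBucket:
    """Simple token bucket: `capacity` tokens, refilled at `rate_per_minute`."""

    def __init__(self, capacity: int, rate_per_minute: float):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.rate_per_second = rate_per_minute / 60.0
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate_per_second)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller returns 429 Too Many Requests

# 30 msg/min per authenticated user, 10 msg/min per anonymous session.
authenticated_bucket = TokenBucket(capacity=30, rate_per_minute=30)
anonymous_bucket = TokenBucket(capacity=10, rate_per_minute=10)
```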
Retries and Circuit Breakers
graph TD
A[Call Downstream Service] --> B{Success?}
B -->|Yes| C[Return data]
B -->|No| D{Retry count < 2?}
D -->|Yes| E[Exponential backoff + jitter]
E --> A
D -->|No| F{Circuit breaker open?}
F -->|No| G[Open circuit breaker for 30 seconds]
F -->|Yes| H[Return fallback response]
G --> H
Retry policy:
- Max 2 retries with exponential backoff.
- Jitter to prevent thundering herd.
- Idempotent requests only.
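A sketch of that policy, assuming the downstream call is idempotent and that the `fallback` callable is a hypothetical hook into the circuit-breaker and fallback path:

```python
import random
import time

MAX_RETRIES = 2
BASE_DELAY_SECONDS = 0.2

def call_with_retries(call, *, fallback):
    """Retry an idempotent downstream call with exponential backoff and full jitter."""
    for attempt in range(MAX_RETRIES + 1):
        try:
            return call()
        except Exception:
            if attempt == MAX_RETRIES:
                break
            # Exponential backoff with full jitter to avoid a thundering herd.
            delay = BASE_DELAY_SECONDS * (2 ** attempt)
            time.sleep(random.uniform(0, delay))
    return fallback()  # retries exhausted: circuit breaker / fallback path takes over
```

The `fallback` callable is where the table below plugs in, for example returning trending manga when the Recommendation Engine is unavailable.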
Fallback responses:
| Service Down | Fallback |
|---|---|
| Recommendation Engine | Show trending or popular manga |
| Product Catalog | Return a search link and a short apology |
| Order Service | Ask the user to try the order tracking page |
| RAG / Knowledge Base | Use LLM with system knowledge only |
| LLM / Bedrock | Return a brief unavailable message |
Multi-Region Considerations
For MVP, a single-region deployment with cross-AZ redundancy is sufficient. For full production, add latency-based routing across regions and replicate stateful data (for example with DynamoDB Global Tables), as sketched below.
graph LR
subgraph "us-east-1 (Primary)"
A1[Orchestrator]
A2[DynamoDB Global Table]
A3[OpenSearch]
end
subgraph "us-west-2 (Secondary)"
B1[Orchestrator]
B2[DynamoDB Global Table]
B3[OpenSearch]
end
A2 <-->|Replication| B2
R53[Route 53<br>Latency-based routing] --> A1
R53 --> B1
Service Dependency Protection
graph TD
subgraph "Dependency Tiers"
T1[Tier 1 - Critical]
T2[Tier 2 - Important]
T3[Tier 3 - Nice to Have]
end
T1 --> T1A[LLM / Bedrock]
T1 --> T1B[Conversation Memory]
T1 --> T1C[API Gateway]
T2 --> T2A[Product Catalog]
T2 --> T2B[Recommendation Engine]
T2 --> T2C[Order Service]
T3 --> T3A[Promotions]
T3 --> T3B[Reviews]
T3 --> T3C[Analytics Pipeline]
If a Tier 3 service is down, the chatbot still works normally. If a Tier 2 service is down, it works with reduced capability, using the fallbacks above. If a Tier 1 service is down, it returns a graceful error.
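One way to encode the tiers, as a hedged sketch; the service names mirror the diagram and the per-tier behavior matches the paragraph above.

```python
from enum import Enum

class Tier(Enum):
    CRITICAL = 1      # chatbot cannot answer without it
    IMPORTANT = 2     # degrade gracefully, reduced capability
    NICE_TO_HAVE = 3  # skip the feature silently

SERVICE_TIERS = {
    "bedrock_llm": Tier.CRITICAL,
    "conversation_memory": Tier.CRITICAL,
    "api_gateway": Tier.CRITICAL,
    "product_catalog": Tier.IMPORTANT,
    "recommendation_engine": Tier.IMPORTANT,
    "order_service": Tier.IMPORTANT,
    "promotions": Tier.NICE_TO_HAVE,
    "reviews": Tier.NICE_TO_HAVE,
    "analytics_pipeline": Tier.NICE_TO_HAVE,
}

def handle_outage(service: str) -> str:
    tier = SERVICE_TIERS[service]
    if tier is Tier.CRITICAL:
        return "graceful_error"     # apologize and ask the user to retry later
    if tier is Tier.IMPORTANT:
        return "degraded_response"  # answer using the fallback from the table above
    return "skip_feature"           # omit the feature; the reply still goes out
```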
Observability and Alerting
| Metric | Alarm Threshold | Severity | Action |
|---|---|---|---|
| P99 latency (first token) | > 3 seconds for 5 min | SEV-2 | Page on-call |
| Error rate | > 1% for 5 min | SEV-2 | Page on-call |
| Guardrail block rate | > 10% for 15 min | SEV-3 | Investigate prompt or content issue |
| LLM timeout rate | > 5% for 5 min | SEV-2 | Check Bedrock quotas and fallback behavior |
| DynamoDB throttling | Any throttle event | SEV-3 | Review access patterns |
| Circuit breaker open | Any service | SEV-3 | Investigate dependency health |
| Concurrent sessions | > 80% of provisioned capacity | SEV-3 | Pre-scale capacity |
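For illustration, a hedged sketch of the first alarm in the table, defined through the CloudWatch API; the namespace and metric name (ChatbotOrchestrator / FirstTokenLatency) and the SNS topic ARN are assumptions, not existing resources.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# P99 first-token latency > 3 seconds for 5 minutes -> SEV-2 page.
cloudwatch.put_metric_alarm(
    AlarmName="chatbot-first-token-p99-latency",
    Namespace="ChatbotOrchestrator",   # assumed custom namespace
    MetricName="FirstTokenLatency",    # assumed custom metric, in seconds
    ExtendedStatistic="p99",
    Period=300,                        # 5-minute evaluation window
    EvaluationPeriods=1,
    Threshold=3.0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-sev2"],  # assumed topic
    AlarmDescription="SEV-2: P99 first-token latency above 3s for 5 minutes",
)
```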
Dashboard contents:
- Message volume
- Latency percentiles
- Intent distribution
- Error rate
- LLM token usage
- Escalation rate