
11. Scalability and Reliability - Designing for Amazon Scale

Scale Requirements

The JP Manga store sees large traffic spikes during Prime Day, the holiday season, and major release drops.

| Metric | Normal | Peak |
|---|---|---|
| Concurrent chat sessions | ~50,000 | ~500,000 |
| Messages per second | ~5,000 | ~50,000 |
| LLM calls per second | ~3,000 | ~30,000 |
| P99 latency target (first token) | < 1.5 s | < 3 s |
| Availability target | 99.9% | 99.9% |

High-Traffic Handling

```mermaid
graph TD
    subgraph "Load Distribution"
        A[CloudFront CDN<br>Static assets] --> B[Application Load Balancer<br>WebSocket sticky sessions]
        B --> C[Auto-Scaling Group<br>Orchestrator instances]
        C --> D[Lambda Functions<br>Overflow capacity]
    end

    subgraph "Compute Strategy"
        C --> E[ECS Fargate<br>Baseline capacity]
        D --> F[Lambda<br>Burst capacity]
    end
```

Baseline traffic runs on ECS Fargate. Overflow traffic can spill to Lambda for burst handling.

Latency Optimization

The biggest latency win is calling downstream services in parallel instead of sequentially.

```
Sequential:
  Reco Engine (200ms) -> Catalog (100ms) -> RAG (300ms) = 600ms total

Parallel:
  Reco Engine (200ms) ->
  Catalog (100ms)    -> max = 300ms total
  RAG (300ms)        ->
```
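The fan-out above can be sketched with `asyncio.gather`. The service names and delays are illustrative stand-ins for the real downstream clients, not actual APIs:

```python
import asyncio
import time

# Simulated downstream calls; delays mirror the example latencies above.
async def fetch(service: str, delay: float) -> str:
    await asyncio.sleep(delay)
    return f"{service}-data"

async def gather_context() -> tuple:
    """Call all three services concurrently; total time ~= the slowest call,
    not the sum of all calls."""
    start = time.monotonic()
    results = await asyncio.gather(
        fetch("reco", 0.2),
        fetch("catalog", 0.1),
        fetch("rag", 0.3),
    )
    return list(results), time.monotonic() - start
```

Running `asyncio.run(gather_context())` finishes in roughly the time of the slowest call (~0.3 s) rather than the ~0.6 s the sequential version would take.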

Caching Strategy

| Data | Cache Location | TTL | Invalidation |
|---|---|---|---|
| Product details | ElastiCache | 5 min | Event-driven catalog updates |
| Recommendations | ElastiCache | 15 min | New session |
| Promotions | ElastiCache | 15 min | Event-driven |
| Reviews and ratings | ElastiCache | 1 hour | Scheduled refresh |
| Prices | Never cached | N/A | Always fetch live |
| Conversation memory | DynamoDB | Session lifetime | TTL expiry |
| RAG embeddings | OpenSearch | Until re-indexed | Event-driven or scheduled refresh |

Conversation memory stays in DynamoDB as the source of truth. A small read cache can be added later if needed, but the durable store remains DynamoDB.
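A minimal in-process sketch of the read-through pattern with per-kind TTLs (ElastiCache would play this role in production; the kind names and TTL values mirror the table above). Note that prices map to `None`, meaning "never cache, always fetch live":

```python
import time
from typing import Any, Callable, Optional

# Per-data-kind TTLs in seconds; None means "never cache" (prices).
TTLS: dict = {
    "product": 300,
    "recommendations": 900,
    "promotions": 900,
    "reviews": 3600,
    "price": None,
}

class ReadThroughCache:
    """Read-through cache sketch with an injectable clock for testability."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._store: dict = {}  # (kind, key) -> (value, expires_at)

    def get(self, kind: str, key: str, loader: Callable[[], Any]) -> Any:
        ttl = TTLS.get(kind)
        if ttl is None:
            return loader()  # prices: always fetch live
        entry = self._store.get((kind, key))
        if entry and entry[1] > self._clock():
            return entry[0]  # cache hit, still fresh
        value = loader()
        self._store[(kind, key)] = (value, self._clock() + ttl)
        return value

    def invalidate(self, kind: str, key: str) -> None:
        # Hook for event-driven invalidation (e.g. catalog update events).
        self._store.pop((kind, key), None)
```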

Streaming

LLM responses are streamed token by token via WebSocket. The user sees the first word quickly even though the full response takes longer.
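The streaming path can be sketched as an async loop that forwards each token as it arrives instead of buffering the full completion. `fake_llm` below is a stand-in for the real streaming LLM client, and `send` stands in for a WebSocket send:

```python
import asyncio
from typing import AsyncIterator, Callable

async def fake_llm(text: str) -> AsyncIterator[str]:
    # Stand-in for a streaming LLM API; yields one token at a time.
    for tok in text.split():
        await asyncio.sleep(0)
        yield tok

async def stream_response(tokens, send: Callable) -> int:
    """Forward tokens to the client as they arrive; the user sees the
    first token long before the full response completes. Returns the
    number of tokens sent."""
    count = 0
    async for token in tokens:
        await send(token)  # one WebSocket frame per token (or small batch)
        count += 1
    return count
```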

Rate Limiting

```mermaid
graph TD
    A[Incoming Request] --> B{User authenticated?}
    B -->|Yes| C[Token Bucket<br>30 msg/min per user]
    B -->|No| D[Token Bucket<br>10 msg/min per session]
    C --> E{Bucket has tokens?}
    D --> E
    E -->|Yes| F[Process request]
    E -->|No| G[429 Too Many Requests<br>Please wait a moment]
```

Additional protections:

- Global rate limit for the LLM path.
- Per-session message limits.
- Payload size limits to reduce abuse.
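A token bucket is simple enough to sketch directly. The clock is injectable so refill behavior is deterministic in tests, and the capacities match the diagram (30 msg/min authenticated, 10 msg/min anonymous):

```python
import time
from typing import Optional

class TokenBucket:
    """Classic token bucket: refill proportionally to elapsed time,
    spend one token per request, reject when empty."""

    def __init__(self, capacity: int, refill_per_sec: float,
                 clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self._clock = clock
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(
            self.capacity,
            self.tokens + (now - self._last) * self.refill_per_sec,
        )
        self._last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller responds with 429 Too Many Requests

def bucket_for(user_id: Optional[str]) -> TokenBucket:
    # Authenticated users: 30 msg/min; anonymous sessions: 10 msg/min.
    return TokenBucket(30, 30 / 60) if user_id else TokenBucket(10, 10 / 60)
```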

Retries and Circuit Breakers

```mermaid
graph TD
    A[Call Downstream Service] --> B{Success?}
    B -->|Yes| C[Return data]
    B -->|No| D{Retry count < 2?}
    D -->|Yes| E[Exponential backoff + jitter]
    E --> A
    D -->|No| F{Circuit breaker open?}
    F -->|No| G[Open circuit breaker for 30 seconds]
    F -->|Yes| H[Return fallback response]
    G --> H
```

Retry policy:

- Max 2 retries with exponential backoff.
- Jitter to prevent thundering herd.
- Retry idempotent requests only.
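The retry-plus-breaker flow above can be sketched as follows. This is a simplified single-failure-trips breaker matching the diagram; a production breaker would usually track a failure rate over a window:

```python
import random
import time

class CircuitBreaker:
    """Open for `cooldown` seconds after retries are exhausted, then
    allow calls again."""

    def __init__(self, cooldown: float = 30.0, clock=time.monotonic):
        self._cooldown = cooldown
        self._clock = clock
        self._open_until = 0.0

    def is_open(self) -> bool:
        return self._clock() < self._open_until

    def trip(self) -> None:
        self._open_until = self._clock() + self._cooldown

def call_with_retry(fn, breaker: CircuitBreaker, fallback,
                    max_retries: int = 2, base: float = 0.05,
                    sleep=time.sleep):
    """Max 2 retries with full-jitter exponential backoff; on exhaustion,
    trip the breaker and return the fallback. Safe only for idempotent
    calls, per the retry policy above."""
    if breaker.is_open():
        return fallback()  # fail fast while the breaker is open
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                breaker.trip()
                return fallback()
            # Full jitter: sleep a random amount up to base * 2^attempt.
            sleep(random.uniform(0, base * (2 ** attempt)))
```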

Fallback responses:

| Service Down | Fallback |
|---|---|
| Recommendation Engine | Show trending or popular manga |
| Product Catalog | Return a search link and a short apology |
| Order Service | Ask the user to try the order tracking page |
| RAG / Knowledge Base | Use LLM with system knowledge only |
| LLM / Bedrock | Return a brief unavailable message |
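One way to wire the fallback table is a registry of per-service degraded responses; the service keys and payload shapes below are illustrative, not the real service identifiers:

```python
from typing import Any, Callable

# Registry of degraded responses, one per downstream service (hypothetical
# keys mirroring the table above).
FALLBACKS: dict = {
    "recommendation_engine": lambda: {"kind": "trending"},
    "product_catalog": lambda: {"kind": "search_link"},
    "order_service": lambda: {"kind": "redirect_order_tracking"},
    "rag": lambda: {"kind": "llm_system_knowledge_only"},
    "llm": lambda: {"kind": "unavailable"},
}

def call_with_fallback(service: str,
                       primary: Callable[[], Any]) -> Any:
    """Try the real service; on any failure, return its degraded response
    instead of surfacing an error to the user."""
    try:
        return primary()
    except Exception:
        return FALLBACKS[service]()
```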

Multi-Region Considerations

For MVP, single-region deployment with cross-AZ redundancy is sufficient. For full production, use multi-region routing and replication for stateful data.

```mermaid
graph LR
    subgraph "us-east-1 (Primary)"
        A1[Orchestrator]
        A2[DynamoDB Global Table]
        A3[OpenSearch]
    end

    subgraph "us-west-2 (Secondary)"
        B1[Orchestrator]
        B2[DynamoDB Global Table]
        B3[OpenSearch]
    end

    A2 <-->|Replication| B2
    R53[Route 53<br>Latency-based routing] --> A1
    R53 --> B1
```

Service Dependency Protection

```mermaid
graph TD
    subgraph "Dependency Tiers"
        T1[Tier 1 - Critical]
        T2[Tier 2 - Important]
        T3[Tier 3 - Nice to Have]
    end

    T1 --> T1A[LLM / Bedrock]
    T1 --> T1B[Conversation Memory]
    T1 --> T1C[API Gateway]

    T2 --> T2A[Product Catalog]
    T2 --> T2B[Recommendation Engine]
    T2 --> T2C[Order Service]

    T3 --> T3A[Promotions]
    T3 --> T3B[Reviews]
    T3 --> T3C[Analytics Pipeline]
```

If a Tier 3 service is down, the chatbot works normally. If a Tier 2 service is down, it works with reduced capability. If a Tier 1 service is down, it returns a graceful error.
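This tiering rule reduces to "the chatbot's mode is set by the worst (lowest-numbered) tier that is unhealthy." A sketch, with a hypothetical tier map mirroring the diagram:

```python
# Hypothetical service-to-tier map mirroring the dependency diagram.
TIERS = {
    "bedrock": 1, "conversation_memory": 1, "api_gateway": 1,
    "product_catalog": 2, "recommendation_engine": 2, "order_service": 2,
    "promotions": 3, "reviews": 3, "analytics": 3,
}

def service_mode(down: set) -> str:
    """Map the set of unhealthy dependencies to an overall chatbot mode:
    the lowest tier number among down services wins."""
    worst = min((TIERS[s] for s in down), default=4)
    if worst == 1:
        return "graceful_error"
    if worst == 2:
        return "degraded"
    return "normal"  # only Tier 3 services (or nothing) are down
```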

Observability and Alerting

| Metric | Alarm Threshold | Severity | Action |
|---|---|---|---|
| P99 latency (first token) | > 3 seconds for 5 min | SEV-2 | Page on-call |
| Error rate | > 1% for 5 min | SEV-2 | Page on-call |
| Guardrail block rate | > 10% for 15 min | SEV-3 | Investigate prompt or content issue |
| LLM timeout rate | > 5% for 5 min | SEV-2 | Check Bedrock quotas and fallback behavior |
| DynamoDB throttling | Any throttle event | SEV-3 | Review access patterns |
| Circuit breaker open | Any service | SEV-3 | Investigate dependency health |
| Concurrent sessions | > 80% of provisioned capacity | SEV-3 | Pre-scale capacity |

Dashboard contents:

- Message volume
- Latency percentiles
- Intent distribution
- Error rate
- LLM token usage
- Escalation rate
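The "X for N minutes" alarm pattern in the table amounts to a sliding-window check over recent metric samples. A minimal sketch, assuming one sample per minute:

```python
def alarm_breached(samples: list, threshold: float, window: int) -> bool:
    """True when the most recent `window` samples (one per minute) all
    exceed `threshold` -- e.g. P99 first-token latency > 3 s for 5 min.
    CloudWatch alarms express this as datapoints-to-alarm over a period."""
    if len(samples) < window:
        return False
    return all(v > threshold for v in samples[-window:])
```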