
11. Scalability and Reliability - Designing for Amazon Scale

Scale Requirements

The JP Manga store sees large traffic spikes during Prime Day, the holiday season, and major release drops.

| Metric | Normal | Peak |
|---|---|---|
| Concurrent chat sessions | ~50,000 | ~500,000 |
| Messages per second | ~5,000 | ~50,000 |
| LLM calls per second | ~3,000 | ~30,000 |
| P99 latency target (first token) | < 1.5 s | < 3 s |
| Availability target | 99.9% | 99.9% |

High-Traffic Handling

```mermaid
graph TD
    subgraph "Load Distribution"
        A[CloudFront CDN<br>Static assets] --> B[Application Load Balancer<br>WebSocket sticky sessions]
        B --> C[Auto-Scaling Group<br>Orchestrator instances]
        C --> D[Lambda Functions<br>Overflow capacity]
    end

    subgraph "Compute Strategy"
        C --> E[ECS Fargate<br>Baseline capacity]
        D --> F[Lambda<br>Burst capacity]
    end
```

Baseline traffic runs on ECS Fargate. Overflow traffic can spill to Lambda for burst handling.

Latency Optimization

The biggest latency win is calling downstream services in parallel instead of sequentially.

```
Sequential:
  Reco Engine (200ms) -> Catalog (100ms) -> RAG (300ms) = 600ms total

Parallel:
  Reco Engine (200ms) ->
  Catalog (100ms)    -> max = 300ms total
  RAG (300ms)        ->
```
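The fan-out above can be sketched with `asyncio.gather`. The service names and delays are illustrative stand-ins for the real downstream clients, not actual APIs:

```python
import asyncio
import time

# Simulated downstream calls; delays mirror the example latencies above.
async def fetch(service: str, delay: float) -> str:
    await asyncio.sleep(delay)
    return f"{service}-data"

async def gather_context() -> tuple:
    """Call all three services concurrently; total time ~= the slowest call,
    not the sum of all calls."""
    start = time.monotonic()
    results = await asyncio.gather(
        fetch("reco", 0.2),
        fetch("catalog", 0.1),
        fetch("rag", 0.3),
    )
    return list(results), time.monotonic() - start
```

Running `asyncio.run(gather_context())` finishes in roughly the time of the slowest call (~0.3 s) rather than the ~0.6 s the sequential version would take.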

Caching Strategy

| Data | Cache Location | TTL | Invalidation |
|---|---|---|---|
| Product details | ElastiCache | 5 min | Event-driven catalog updates |
| Recommendations | ElastiCache | 15 min | New session |
| Promotions | ElastiCache | 15 min | Event-driven |
| Reviews and ratings | ElastiCache | 1 hour | Scheduled refresh |
| Prices | Never cached | N/A | Always fetch live |
| Conversation memory | DynamoDB | Session lifetime | TTL expiry |
| RAG embeddings | OpenSearch | Until re-indexed | Event-driven or scheduled refresh |

Conversation memory stays in DynamoDB as the source of truth. A small read cache can be added later if needed, but the durable store remains DynamoDB.
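A minimal in-process sketch of the read-through pattern with per-kind TTLs (ElastiCache would play this role in production; the kind names and TTL values mirror the table above). Note that prices map to `None`, meaning "never cache, always fetch live":

```python
import time
from typing import Any, Callable, Optional

# Per-data-kind TTLs in seconds; None means "never cache" (prices).
TTLS: dict = {
    "product": 300,
    "recommendations": 900,
    "promotions": 900,
    "reviews": 3600,
    "price": None,
}

class ReadThroughCache:
    """Read-through cache sketch with an injectable clock for testability."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._store: dict = {}  # (kind, key) -> (value, expires_at)

    def get(self, kind: str, key: str, loader: Callable[[], Any]) -> Any:
        ttl = TTLS.get(kind)
        if ttl is None:
            return loader()  # prices: always fetch live
        entry = self._store.get((kind, key))
        if entry and entry[1] > self._clock():
            return entry[0]  # cache hit, still fresh
        value = loader()
        self._store[(kind, key)] = (value, self._clock() + ttl)
        return value

    def invalidate(self, kind: str, key: str) -> None:
        # Hook for event-driven invalidation (e.g. catalog update events).
        self._store.pop((kind, key), None)
```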

Streaming

LLM responses are streamed token by token via WebSocket. The user sees the first word quickly even though the full response takes longer.
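The streaming path can be sketched as an async loop that forwards each token as it arrives instead of buffering the full completion. `fake_llm` below is a stand-in for the real streaming LLM client, and `send` stands in for a WebSocket send:

```python
import asyncio
from typing import AsyncIterator, Callable

async def fake_llm(text: str) -> AsyncIterator[str]:
    # Stand-in for a streaming LLM API; yields one token at a time.
    for tok in text.split():
        await asyncio.sleep(0)
        yield tok

async def stream_response(tokens, send: Callable) -> int:
    """Forward tokens to the client as they arrive; the user sees the
    first token long before the full response completes. Returns the
    number of tokens sent."""
    count = 0
    async for token in tokens:
        await send(token)  # one WebSocket frame per token (or small batch)
        count += 1
    return count
```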

Rate Limiting

```mermaid
graph TD
    A[Incoming Request] --> B{User authenticated?}
    B -->|Yes| C[Token Bucket<br>30 msg/min per user]
    B -->|No| D[Token Bucket<br>10 msg/min per session]
    C --> E{Bucket has tokens?}
    D --> E
    E -->|Yes| F[Process request]
    E -->|No| G[429 Too Many Requests<br>Please wait a moment]
```

Additional protections:

- Global rate limit for the LLM path.
- Per-session message limits.
- Payload size limits to reduce abuse.
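A token bucket is simple enough to sketch directly. The clock is injectable so refill behavior is deterministic in tests, and the capacities match the diagram (30 msg/min authenticated, 10 msg/min anonymous):

```python
import time
from typing import Optional

class TokenBucket:
    """Classic token bucket: refill proportionally to elapsed time,
    spend one token per request, reject when empty."""

    def __init__(self, capacity: int, refill_per_sec: float,
                 clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self._clock = clock
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(
            self.capacity,
            self.tokens + (now - self._last) * self.refill_per_sec,
        )
        self._last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller responds with 429 Too Many Requests

def bucket_for(user_id: Optional[str]) -> TokenBucket:
    # Authenticated users: 30 msg/min; anonymous sessions: 10 msg/min.
    return TokenBucket(30, 30 / 60) if user_id else TokenBucket(10, 10 / 60)
```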

Retries and Circuit Breakers

```mermaid
graph TD
    A[Call Downstream Service] --> B{Success?}
    B -->|Yes| C[Return data]
    B -->|No| D{Retry count < 2?}
    D -->|Yes| E[Exponential backoff + jitter]
    E --> A
    D -->|No| F{Circuit breaker open?}
    F -->|No| G[Open circuit breaker for 30 seconds]
    F -->|Yes| H[Return fallback response]
    G --> H
```

Retry policy:

- Max 2 retries with exponential backoff.
- Jitter to prevent thundering herd.
- Retry idempotent requests only.
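The retry-plus-breaker flow above can be sketched as follows. This is a simplified single-failure-trips breaker matching the diagram; a production breaker would usually track a failure rate over a window:

```python
import random
import time

class CircuitBreaker:
    """Open for `cooldown` seconds after retries are exhausted, then
    allow calls again."""

    def __init__(self, cooldown: float = 30.0, clock=time.monotonic):
        self._cooldown = cooldown
        self._clock = clock
        self._open_until = 0.0

    def is_open(self) -> bool:
        return self._clock() < self._open_until

    def trip(self) -> None:
        self._open_until = self._clock() + self._cooldown

def call_with_retry(fn, breaker: CircuitBreaker, fallback,
                    max_retries: int = 2, base: float = 0.05,
                    sleep=time.sleep):
    """Max 2 retries with full-jitter exponential backoff; on exhaustion,
    trip the breaker and return the fallback. Safe only for idempotent
    calls, per the retry policy above."""
    if breaker.is_open():
        return fallback()  # fail fast while the breaker is open
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                breaker.trip()
                return fallback()
            # Full jitter: sleep a random amount up to base * 2^attempt.
            sleep(random.uniform(0, base * (2 ** attempt)))
```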

Fallback responses:

| Service Down | Fallback |
|---|---|
| Recommendation Engine | Show trending or popular manga |
| Product Catalog | Return a search link and a short apology |
| Order Service | Ask the user to try the order tracking page |
| RAG / Knowledge Base | Use LLM with system knowledge only |
| LLM / Bedrock | Return a brief unavailable message |
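One way to wire the fallback table is a registry of per-service degraded responses; the service keys and payload shapes below are illustrative, not the real service identifiers:

```python
from typing import Any, Callable

# Registry of degraded responses, one per downstream service (hypothetical
# keys mirroring the table above).
FALLBACKS: dict = {
    "recommendation_engine": lambda: {"kind": "trending"},
    "product_catalog": lambda: {"kind": "search_link"},
    "order_service": lambda: {"kind": "redirect_order_tracking"},
    "rag": lambda: {"kind": "llm_system_knowledge_only"},
    "llm": lambda: {"kind": "unavailable"},
}

def call_with_fallback(service: str,
                       primary: Callable[[], Any]) -> Any:
    """Try the real service; on any failure, return its degraded response
    instead of surfacing an error to the user."""
    try:
        return primary()
    except Exception:
        return FALLBACKS[service]()
```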

Multi-Region Considerations

For MVP, single-region deployment with cross-AZ redundancy is sufficient. For full production, use multi-region routing and replication for stateful data.

```mermaid
graph LR
    subgraph "us-east-1 (Primary)"
        A1[Orchestrator]
        A2[DynamoDB Global Table]
        A3[OpenSearch]
    end

    subgraph "us-west-2 (Secondary)"
        B1[Orchestrator]
        B2[DynamoDB Global Table]
        B3[OpenSearch]
    end

    A2 <-->|Replication| B2
    R53[Route 53<br>Latency-based routing] --> A1
    R53 --> B1
```

Service Dependency Protection

```mermaid
graph TD
    subgraph "Dependency Tiers"
        T1[Tier 1 - Critical]
        T2[Tier 2 - Important]
        T3[Tier 3 - Nice to Have]
    end

    T1 --> T1A[LLM / Bedrock]
    T1 --> T1B[Conversation Memory]
    T1 --> T1C[API Gateway]

    T2 --> T2A[Product Catalog]
    T2 --> T2B[Recommendation Engine]
    T2 --> T2C[Order Service]

    T3 --> T3A[Promotions]
    T3 --> T3B[Reviews]
    T3 --> T3C[Analytics Pipeline]
```

If a Tier 3 service is down, the chatbot works normally. If a Tier 2 service is down, it works with reduced capability. If a Tier 1 service is down, it returns a graceful error.
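This tiering rule reduces to "the chatbot's mode is set by the worst (lowest-numbered) tier that is unhealthy." A sketch, with a hypothetical tier map mirroring the diagram:

```python
# Hypothetical service-to-tier map mirroring the dependency diagram.
TIERS = {
    "bedrock": 1, "conversation_memory": 1, "api_gateway": 1,
    "product_catalog": 2, "recommendation_engine": 2, "order_service": 2,
    "promotions": 3, "reviews": 3, "analytics": 3,
}

def service_mode(down: set) -> str:
    """Map the set of unhealthy dependencies to an overall chatbot mode:
    the lowest tier number among down services wins."""
    worst = min((TIERS[s] for s in down), default=4)
    if worst == 1:
        return "graceful_error"
    if worst == 2:
        return "degraded"
    return "normal"  # only Tier 3 services (or nothing) are down
```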

Observability and Alerting

| Metric | Alarm Threshold | Severity | Action |
|---|---|---|---|
| P99 latency (first token) | > 3 seconds for 5 min | SEV-2 | Page on-call |
| Error rate | > 1% for 5 min | SEV-2 | Page on-call |
| Guardrail block rate | > 10% for 15 min | SEV-3 | Investigate prompt or content issue |
| LLM timeout rate | > 5% for 5 min | SEV-2 | Check Bedrock quotas and fallback behavior |
| DynamoDB throttling | Any throttle event | SEV-3 | Review access patterns |
| Circuit breaker open | Any service | SEV-3 | Investigate dependency health |
| Concurrent sessions | > 80% of provisioned capacity | SEV-3 | Pre-scale capacity |

Dashboard contents:

- Message volume
- Latency percentiles
- Intent distribution
- Error rate
- LLM token usage
- Escalation rate
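The "X for N minutes" alarm pattern in the table amounts to a sliding-window check over recent metric samples. A minimal sketch, assuming one sample per minute:

```python
def alarm_breached(samples: list, threshold: float, window: int) -> bool:
    """True when the most recent `window` samples (one per minute) all
    exceed `threshold` -- e.g. P99 first-token latency > 3 s for 5 min.
    CloudWatch alarms express this as datapoints-to-alarm over a period."""
    if len(samples) < window:
        return False
    return all(v > threshold for v in samples[-window:])
```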