
ECS Fargate + Lambda — How They Work Together in MangaAssist

The Core Problem: Extreme Burst Traffic

MangaAssist must handle:

- Normal load: ~5,000 messages/second, ~50,000 concurrent sessions
- Peak load (flash sales, new release day): ~50,000 messages/second, ~500,000 concurrent sessions

That is a 10x spike that can happen in minutes.

No single compute model handles this perfectly:

- ECS Fargate alone: takes 3-5 minutes to spin up 900 new tasks. Users experience timeouts during the ramp-up.
- Lambda alone: cannot hold long-lived WebSocket connections. Streaming token delivery is awkward. Cold starts add latency to baseline traffic.

The solution in the HLD is a hybrid model where each service does what it is best at.


Architecture: Hybrid Compute Model

graph TD
    subgraph "Traffic Entry"
        U[User Browser / App]
        CF[CloudFront]
        ALB[Application Load Balancer]
        APIGW[API Gateway WebSocket]
    end

    subgraph "Baseline Layer — ECS Fargate"
        WS[WebSocket Handler Service<br>Holds persistent connections<br>10-50 tasks, auto-scaling]
        ORC[Chatbot Orchestrator Service<br>Core processing logic<br>10-100 tasks, auto-scaling]
    end

    subgraph "Burst Layer — Lambda"
        SQS[SQS Burst Queue]
        LW1[Lambda Worker 1]
        LW2[Lambda Worker 2]
        LWN[Lambda Worker N<br>up to 10,000 concurrent]
    end

    subgraph "Shared Downstream Services"
        SM[SageMaker — Intent Classifier]
        BR[Bedrock — Claude 3.5 Sonnet]
        DDB[DynamoDB — Conversation Memory]
        OS[OpenSearch — Vector Store]
        EC[ElastiCache — Cache]
    end

    U -->|WebSocket| CF --> ALB --> WS
    WS -->|Authenticated request| ORC
    ORC -->|Normal load| SM
    ORC -->|Normal load| BR
    ORC -->|Normal load| DDB
    ORC -->|Overflow: enqueue| SQS
    SQS --> LW1 & LW2 & LWN
    LW1 & LW2 & LWN --> SM
    LW1 & LW2 & LWN --> BR
    LW1 & LW2 & LWN --> DDB
    LW1 & LW2 & LWN -->|Push response| APIGW
    APIGW -->|WebSocket push| U

Request Flow: Normal Load (ECS Handles Everything)

1. User sends "Recommend manga like Naruto"
2. CloudFront → ALB → WebSocket Handler Task (ECS)
3. WebSocket Handler → Orchestrator Task (ECS)
4. Orchestrator:
   a. Load conversation memory (DynamoDB)
   b. Classify intent (SageMaker)  → recommendation
   c. Fan out: Recommendation Engine + Product Catalog (parallel)
   d. Build prompt with context
   e. Call Bedrock (Claude 3.5 Sonnet) — stream tokens back
   f. Apply guardrails
   g. Stream response via WebSocket Handler to user
5. Save turn to DynamoDB
6. Emit analytics event to Kinesis

The WebSocket connection stays open on the ECS task for the duration of the streaming response. ECS is the right home for this because containers hold state (the connection) naturally.
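Step 4c's parallel fan-out can be sketched with asyncio; the stub coroutines below stand in for the real Recommendation Engine and Product Catalog calls, and their names and payloads are illustrative, not from the HLD:

```python
import asyncio

# Stub data sources; payload shapes are illustrative only
async def fetch_recommendations(query_title: str) -> list[str]:
    await asyncio.sleep(0)  # stands in for a network call
    return ["One Piece", "Bleach", "My Hero Academia"]

async def fetch_catalog_entry(query_title: str) -> dict:
    await asyncio.sleep(0)
    return {"title": query_title, "in_stock": True}

async def build_context(query_title: str) -> dict:
    # Step 4c: both downstream calls run concurrently, so latency is
    # roughly max(call_a, call_b) instead of their sum
    recs, catalog = await asyncio.gather(
        fetch_recommendations(query_title),
        fetch_catalog_entry(query_title),
    )
    return {"recommendations": recs, "catalog": catalog}

context = asyncio.run(build_context("Naruto"))
```

The same pattern applies to any step where two independent downstream services feed the prompt.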


Request Flow: Burst Load (Lambda Absorbs Overflow)

When ECS tasks are at capacity (CPU > 80% across the service):

1. User sends message
2. CloudFront → ALB → WebSocket Handler Task (ECS) ← connection registered here
3. WebSocket Handler → Orchestrator Task (ECS)
4. Orchestrator detects: task queue depth high, CPU saturated
5. Orchestrator enqueues message to SQS Burst Queue instead of processing inline
6. Lambda Burst Worker picks up message from SQS
7. Lambda Worker performs same orchestration steps as ECS Orchestrator
8. Lambda Worker pushes response via API Gateway WebSocket Management API
   POST https://{api-id}.execute-api.us-east-1.amazonaws.com/prod/@connections/{connection_id}
   Body: { "type": "chat.response.delta", "delta": "Based on your love for action manga..." }
9. WebSocket Handler (ECS) delivers the pushed message to the open connection
10. User sees the streamed response — unaware it was processed by Lambda

Key detail: The WebSocket connection itself always lives on ECS. Lambda cannot hold a WebSocket connection. Lambda pushes responses into the existing connection via the API Gateway WebSocket Management API.
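The offload decision in steps 4-5 can be sketched as a pure threshold check plus a message builder. The queue-depth threshold, field names, and helper names here are assumptions for illustration (the HLD specifies only the CPU > 80% trigger):

```python
import json
import time

def should_offload(cpu_pct: float, queue_depth: int,
                   cpu_limit: float = 80.0, depth_limit: int = 100) -> bool:
    # Offload when the task is CPU-saturated or its work queue backs up
    return cpu_pct > cpu_limit or queue_depth > depth_limit

def build_burst_message(connection_id: str, session_id: str, text: str) -> str:
    # enqueued_at_ms lets the burst worker drop stale messages later
    return json.dumps({
        "connection_id": connection_id,
        "session_id": session_id,
        "text": text,
        "enqueued_at_ms": int(time.time() * 1000),
    })

# In the orchestrator (SQS client creation omitted):
# if should_offload(current_cpu_pct, current_queue_depth):
#     sqs.send_message(QueueUrl=BURST_QUEUE_URL,
#                      MessageBody=build_burst_message(conn_id, sess_id, text))
```

Keeping the decision a pure function makes the 80% threshold easy to unit-test and tune.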


The SQS Queue — Decoupling Mechanism

SQS is the buffer between ECS and Lambda.

SQS Queue: manga-burst-queue
  Type: Standard (at-least-once delivery)
  Visibility timeout: 30s (task must complete within 30s or message reappears)
  Message retention: 1 minute (chat messages are time-sensitive; stale = drop)
  Dead-letter queue: manga-burst-dlq (after 2 failed processing attempts)
  Lambda trigger: batch size 1, max concurrency 10,000

Why SQS in between?

1. Backpressure: if Lambda is overwhelmed, messages queue instead of dropping
2. Retry: if a Lambda worker crashes mid-processing, the message reappears after the visibility timeout
3. Decoupling: ECS and Lambda don't need to know about each other
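A sketch of these settings as boto3 create_queue attributes (the account ID is a placeholder; the attribute values mirror the config above):

```python
import json

# Attribute values mirror the manga-burst-queue config above
burst_queue_attributes = {
    "VisibilityTimeout": "30",       # seconds a worker has before redelivery
    "MessageRetentionPeriod": "60",  # 1 minute: stale chat replies are useless
    "RedrivePolicy": json.dumps({
        "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:manga-burst-dlq",
        "maxReceiveCount": "2",      # move to DLQ after 2 failed attempts
    }),
}

# Sketch of the actual call:
# import boto3
# sqs = boto3.client("sqs")
# sqs.create_queue(QueueName="manga-burst-queue",
#                  Attributes=burst_queue_attributes)
```

Note that 60 seconds is the minimum retention SQS allows, which happens to match the "stale = drop" policy exactly.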


Scaling Comparison: Side by Side

Scenario: Traffic doubles in 30 seconds (flash sale starts)

T+0s:   Normal load — 10 ECS tasks running, 0 Lambda executions
T+10s:  Traffic 2x — ECS CPU climbs from 40% to 80%
T+20s:  ECS auto-scaler triggers: add 10 more tasks
T+30s:  Traffic 5x — ECS cannot scale fast enough
        SQS queue depth climbs
        Lambda burst workers start: 0 → 500 concurrent executions in ~5s
T+60s:  Traffic 10x peak
        ECS: 50 tasks running (scaled up, still climbing)
        Lambda: 2,000-5,000 concurrent executions absorbing overflow
T+5min: Peak subsides
        ECS: ~80 tasks (slightly over-provisioned, slowly scaling in)
        Lambda: scales to 0 in seconds (no idle cost)
T+15min: ECS scales back to 10-20 tasks

Lambda absorbed the burst immediately while ECS was still spinning up. Users saw no degradation.
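On the ECS side, that ramp-up is typically driven by a target-tracking scaling policy. The sketch below uses Application Auto Scaling; the cluster/service names, the 70% target, and the cooldown values are assumptions, chosen so scale-out stays ahead of the 80% offload threshold:

```python
# Target-tracking policy: keep average CPU near 70% so there is headroom
# before the 80% offload threshold kicks in (the 70% target is an assumption)
scaling_policy = {
    "PolicyName": "orchestrator-cpu-target",
    "ServiceNamespace": "ecs",
    "ResourceId": "service/manga-cluster/chatbot-orchestrator",  # placeholders
    "ScalableDimension": "ecs:service:DesiredCount",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 30,   # add tasks quickly during the ramp
        "ScaleInCooldown": 300,   # scale in slowly, matching the T+15min tail
    },
}

# Sketch of the actual call:
# import boto3
# boto3.client("application-autoscaling").put_scaling_policy(**scaling_policy)
```

The asymmetric cooldowns are what produce the "slightly over-provisioned, slowly scaling in" behavior in the timeline.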


Cost Comparison: Peak Traffic Window

Without Lambda (ECS only, worst case)

To handle 10x peak without Lambda, you'd need to pre-provision for peak:

500 tasks × 1 vCPU × 24h × $0.04048/vCPU-hour = $485/day
Even at 3am when load is 5% of peak, you pay for 500 tasks.

With Hybrid Model

ECS baseline: 10-100 tasks (scales to demand)
Lambda: Pay only during burst windows

Typical day:
  ECS: avg 30 tasks × 1 vCPU × 24h × $0.04048 = $29/day
  Lambda burst (2 hours of 10x traffic, ~200,000 overflow invocations):
    200,000 invocations × 2s avg duration × 0.5GB × $0.0000166667/GB-s ≈ $3.33
  Total: ~$32/day vs $485/day pre-provisioned

Savings: ~93% on compute cost for burst capacity.
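The estimates above can be checked in a few lines of Python. As in the figures above, Fargate memory charges and Lambda per-request fees are ignored, and the ~200,000-invocation burst count is an assumption that reproduces the $3.33 figure:

```python
VCPU_HOUR = 0.04048        # Fargate per-vCPU-hour (us-east-1 list price)
GB_SECOND = 0.0000166667   # Lambda per-GB-second

# ECS-only, pre-provisioned for peak
ecs_only_daily = 500 * 1 * 24 * VCPU_HOUR        # ≈ $485.76

# Hybrid: modest ECS baseline + burst absorbed by Lambda
ecs_hybrid_daily = 30 * 1 * 24 * VCPU_HOUR       # ≈ $29.15
lambda_burst = 200_000 * 2 * 0.5 * GB_SECOND     # ≈ $3.33
hybrid_daily = ecs_hybrid_daily + lambda_burst   # ≈ $32.48

savings_pct = (1 - hybrid_daily / ecs_only_daily) * 100  # ≈ 93%
```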


Failure Modes and Graceful Degradation

If Lambda is Throttled (concurrency limit hit)

SQS message → Lambda throttle → message becomes visible again after visibility timeout
→ Lambda retries → if 2 retries fail → message moves to DLQ
→ DLQ triggers alert → on-call investigates → may drop stale messages

User impact: Response delayed by visibility timeout (30s). For chat, this means a slow response — degraded but not broken.

Mitigation: Request concurrency limit increase in advance of known events (product launches).

If SQS Queue Grows Too Deep

SQS message retention is set to 1 minute. Messages older than 1 minute are dropped because a 60-second-old chat response is useless.

import json
import logging
import time

logger = logging.getLogger(__name__)

def lambda_handler(event, context):
    for record in event["Records"]:
        message = json.loads(record["body"])

        # Drop stale messages immediately
        message_age_ms = time.time() * 1000 - message["enqueued_at_ms"]
        if message_age_ms > 45_000:  # 45 seconds
            logger.warning(f"Dropping stale message: {message_age_ms:.0f}ms old")
            # SQS deletes the message when Lambda returns successfully
            continue

        process_chat_turn(message)

If ECS Orchestrator is Unhealthy

ALB health checks detect unhealthy tasks and stop routing to them. New tasks start automatically. Lambda workers can process SQS messages without the Orchestrator — they call downstream services directly.


WebSocket Connection Lifecycle

Understanding this is critical to why ECS and Lambda play different roles:

Connection open  → ECS WebSocket Handler task claims the connection
                   Stores connection_id → session_id mapping in ElastiCache

User sends message → ECS Orchestrator or Lambda Worker processes it
                     Both have access to ElastiCache to look up connection_id
                     Both push response via API Gateway Management API

Connection idle 5min → ECS WebSocket Handler sends close frame, removes from cache

Connection close → ECS WebSocket Handler removes connection_id from cache
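The connection_id → session_id bookkeeping can be sketched as small functions over any Redis-style client (ElastiCache in production, a stub in tests). The key prefix and the fixed 5-minute TTL are assumptions:

```python
CONN_KEY = "conn:{}"       # assumed key prefix
IDLE_TTL_SECONDS = 300     # matches the 5-minute idle timeout above

def register_connection(cache, connection_id: str, session_id: str) -> None:
    # Called by the ECS WebSocket Handler on connection open;
    # the TTL lets idle entries expire on their own
    cache.setex(CONN_KEY.format(connection_id), IDLE_TTL_SECONDS, session_id)

def lookup_session(cache, connection_id: str):
    # Called by either the ECS Orchestrator or a Lambda burst worker
    return cache.get(CONN_KEY.format(connection_id))

def remove_connection(cache, connection_id: str) -> None:
    # Called on close frame or idle timeout
    cache.delete(CONN_KEY.format(connection_id))
```

Because both compute layers share only this cache (not each other's process state), either one can answer any in-flight message.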
# How Lambda pushes to an existing WebSocket connection
import json
import logging

import boto3

logger = logging.getLogger(__name__)

apigw = boto3.client(
    "apigatewaymanagementapi",
    endpoint_url="https://{api-id}.execute-api.us-east-1.amazonaws.com/prod"
)

def push_response_to_user(connection_id: str, response_chunk: str):
    try:
        apigw.post_to_connection(
            ConnectionId=connection_id,
            Data=json.dumps({
                "type": "chat.response.delta",
                "delta": response_chunk
            }).encode()
        )
    except apigw.exceptions.GoneException:
        # Connection closed — user navigated away
        # Clean up: remove from cache, stop processing
        logger.info(f"Connection {connection_id} gone, dropping response")

Summary: Who Does What

| Responsibility | ECS Fargate | Lambda |
| --- | --- | --- |
| Hold WebSocket connections | Yes — containers are long-lived | No — functions are ephemeral |
| Process baseline chat traffic | Yes — primary processing layer | No |
| Absorb 10x burst traffic | Partially — scales but slowly | Yes — scales in seconds |
| Call downstream services (Bedrock, SageMaker, DDB) | Yes | Yes (same code, same SDKs) |
| Push response to user | Via WebSocket Handler | Via API GW Management API |
| DLQ recovery writes | No | Yes — async, non-blocking |
| Scheduled jobs (RAG refresh) | No | Yes — EventBridge trigger |
| Cost model | Pay per task-hour (idle or active) | Pay per ms of actual execution |
| Cold start concern | None (always running) | Yes — mitigated with provisioned concurrency |

The combination gives MangaAssist: steady low latency under normal load (ECS) + near-unlimited burst capacity, bounded only by the account concurrency limit, with zero idle cost (Lambda).