ECS Fargate + Lambda — How They Work Together in MangaAssist
The Core Problem: Extreme Burst Traffic
MangaAssist must handle:
- Normal load: ~5,000 messages/second, ~50,000 concurrent sessions
- Peak load (flash sales, new release day): ~50,000 messages/second, ~500,000 concurrent sessions
That is a 10x spike that can happen in minutes.
No single compute model handles this perfectly:
- ECS Fargate alone: Takes 3-5 minutes to spin up 900 new tasks. Users experience timeouts during the ramp-up.
- Lambda alone: Cannot hold long-lived WebSocket connections. Streaming token delivery is awkward. Cold starts add latency to baseline traffic.
The solution in the HLD is a hybrid model where each service does what it is best at.
Architecture: Hybrid Compute Model
graph TD
subgraph "Traffic Entry"
U[User Browser / App]
CF[CloudFront]
ALB[Application Load Balancer]
APIGW[API Gateway WebSocket]
end
subgraph "Baseline Layer — ECS Fargate"
WS[WebSocket Handler Service<br>Holds persistent connections<br>10-50 tasks, auto-scaling]
ORC[Chatbot Orchestrator Service<br>Core processing logic<br>10-100 tasks, auto-scaling]
end
subgraph "Burst Layer — Lambda"
SQS[SQS Burst Queue]
LW1[Lambda Worker 1]
LW2[Lambda Worker 2]
LWN[Lambda Worker N<br>up to 10,000 concurrent]
end
subgraph "Shared Downstream Services"
SM[SageMaker — Intent Classifier]
BR[Bedrock — Claude 3.5 Sonnet]
DDB[DynamoDB — Conversation Memory]
OS[OpenSearch — Vector Store]
EC[ElastiCache — Cache]
end
U -->|WebSocket| CF --> ALB --> WS
WS -->|Authenticated request| ORC
ORC -->|Normal load| SM
ORC -->|Normal load| BR
ORC -->|Normal load| DDB
ORC -->|Overflow: enqueue| SQS
SQS --> LW1 & LW2 & LWN
LW1 & LW2 & LWN --> SM
LW1 & LW2 & LWN --> BR
LW1 & LW2 & LWN --> DDB
LW1 & LW2 & LWN -->|Push response| APIGW
APIGW -->|WebSocket push| U
Request Flow: Normal Load (ECS Handles Everything)
1. User sends "Recommend manga like Naruto"
2. CloudFront → ALB → WebSocket Handler Task (ECS)
3. WebSocket Handler → Orchestrator Task (ECS)
4. Orchestrator:
a. Load conversation memory (DynamoDB)
b. Classify intent (SageMaker) → recommendation
c. Fan out: Recommendation Engine + Product Catalog (parallel)
d. Build prompt with context
e. Call Bedrock (Claude 3.5 Sonnet) — stream tokens back
f. Apply guardrails
g. Stream response via WebSocket Handler to user
5. Save turn to DynamoDB
6. Emit analytics event to Kinesis
The WebSocket connection stays open on the ECS task for the duration of the streaming response. ECS is the right home for this because containers hold state (the connection) naturally.
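As a rough illustration of steps 4e and 4g, here is a sketch of the streaming call, assuming the Bedrock Converse streaming API and a send_to_client callback supplied by the WebSocket Handler; the model ID and helper names are illustrative, not the actual service code:

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def stream_answer(prompt: str, send_to_client) -> str:
    """Call Claude on Bedrock and forward tokens to the open WebSocket as they arrive."""
    response = bedrock.converse_stream(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",   # assumed model ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.3},
    )
    chunks = []
    for event in response["stream"]:
        delta = event.get("contentBlockDelta", {}).get("delta", {}).get("text")
        if delta:
            chunks.append(delta)
            send_to_client({"type": "chat.response.delta", "delta": delta})   # step 4g
    return "".join(chunks)   # full text, saved to DynamoDB in step 5

Guardrails (step 4f) are omitted in this sketch; they could run per-delta or on the accumulated text before the final frame.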
Request Flow: Burst Load (Lambda Absorbs Overflow)
When ECS tasks are at capacity (CPU > 80% across the service):
1. User sends message
2. CloudFront → ALB → WebSocket Handler Task (ECS) ← connection registered here
3. WebSocket Handler → Orchestrator Task (ECS)
4. Orchestrator detects: task queue depth high, CPU saturated
5. Orchestrator enqueues message to SQS Burst Queue instead of processing inline
6. Lambda Burst Worker picks up message from SQS
7. Lambda Worker performs same orchestration steps as ECS Orchestrator
8. Lambda Worker pushes response via API Gateway WebSocket Management API
POST https://{api-id}.execute-api.us-east-1.amazonaws.com/{stage}/@connections/{connection_id}
Body: { "type": "chat.response.delta", "delta": "Based on your love for action manga..." }
9. WebSocket Handler (ECS) delivers the pushed message to the open connection
10. User sees the streamed response — unaware it was processed by Lambda
Key detail: The WebSocket connection itself always lives on ECS. Lambda cannot hold a WebSocket connection. Lambda pushes responses into the existing connection via the API Gateway WebSocket Management API.
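A sketch of the overflow hand-off (step 5), assuming a saturation check on the orchestrator and the queue described in the next section; the helper names, queue URL, and message fields are illustrative:

import json
import time
import boto3

sqs = boto3.client("sqs")
BURST_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/manga-burst-queue"   # placeholder

def handle_turn(message: dict, connection_id: str):
    if is_saturated():   # assumed check: local work-queue depth or CPU gauge above threshold
        sqs.send_message(
            QueueUrl=BURST_QUEUE_URL,
            MessageBody=json.dumps({
                "connection_id": connection_id,              # lets the Lambda worker push back to this user
                "payload": message,
                "enqueued_at_ms": int(time.time() * 1000),   # used for the stale-message check later
            }),
        )
        return   # a Lambda burst worker takes it from here
    process_inline(message, connection_id)   # assumed helper: the normal ECS path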
The SQS Queue — Decoupling Mechanism
SQS is the buffer between ECS and Lambda.
SQS Queue: manga-burst-queue
Type: Standard (at-least-once delivery)
Visibility timeout: 30s (task must complete within 30s or message reappears)
Message retention: 1 minute (chat messages are time-sensitive; stale = drop)
Dead-letter queue: manga-burst-dlq (after 2 failed processing attempts)
Lambda trigger: batch size 1, max concurrency 10,000
Why SQS in between?
1. Backpressure: If Lambda is overwhelmed, messages queue instead of dropping
2. Retry: If a Lambda worker crashes mid-processing, the message reappears after the visibility timeout
3. Decoupling: ECS and Lambda don't need to know about each other
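For reference, those settings map onto standard SQS attributes. A provisioning sketch with boto3 (in practice this would live in IaC), creating the DLQ first so the redrive policy can reference it:

import json
import boto3

sqs = boto3.client("sqs")

dlq_url = sqs.create_queue(QueueName="manga-burst-dlq")["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

sqs.create_queue(
    QueueName="manga-burst-queue",
    Attributes={
        "VisibilityTimeout": "30",        # task must finish in 30s or the message reappears
        "MessageRetentionPeriod": "60",   # 1 minute (the SQS minimum): stale chat turns are dropped
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": 2,         # after 2 failed attempts, move to the DLQ
        }),
    },
)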
Scaling Comparison: Side by Side
Scenario: Traffic doubles in 30 seconds (flash sale starts)
T+0s: Normal load — 10 ECS tasks running, 0 Lambda executions
T+10s: Traffic 2x — ECS CPU climbs from 40% to 80%
T+20s: ECS auto-scaler triggers: add 10 more tasks
T+30s: Traffic 5x — ECS cannot scale fast enough
       SQS queue depth climbs
       Lambda burst workers start: 0 → 500 concurrent executions in ~5s
T+60s: Traffic 10x peak
       ECS: 50 tasks running (scaled up, still climbing)
       Lambda: 2,000-5,000 concurrent executions absorbing overflow
T+5min: Peak subsides
        ECS: ~80 tasks (slightly over-provisioned, slowly scaling in)
        Lambda: scales to 0 in seconds (no idle cost)
T+15min: ECS scales back to 10-20 tasks
Lambda absorbed the burst immediately while ECS was still spinning up. Users saw no degradation.
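The ECS half of that timeline is ordinary target-tracking auto scaling on service CPU. A sketch follows, with the cluster and service names, capacities, and target value as assumptions:

import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "service/manga-cluster/chatbot-orchestrator"   # assumed cluster/service names

autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=10,
    MaxCapacity=100,
)

autoscaling.put_scaling_policy(
    PolicyName="orchestrator-cpu-target",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,   # assumed: start adding tasks before the 80% overflow threshold
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,   # scale in slowly, matching the T+15min tail above
    },
)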
Cost Comparison: Peak Traffic Window
Without Lambda (ECS only, worst case)
To handle 10x peak without Lambda, you'd need to pre-provision for peak:
500 tasks × 1 vCPU × 24h × $0.04048/vCPU-hour = $485/day
Even at 3am when load is 5% of peak, you pay for 500 tasks.
With Hybrid Model
ECS baseline: 10-100 tasks (scales to demand)
Lambda: Pay only during burst windows
Typical day:
ECS: avg 30 tasks × 1 vCPU × 24h × $0.04048 = $29/day
Lambda burst (2 hours of 10x traffic):
~200,000 overflow invocations × 2s avg duration × 0.5 GB × $0.0000166667/GB-s ≈ $3.33
Total: ~$32/day vs $485/day pre-provisioned
Savings: ~93% on compute cost for burst capacity.
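As a quick sanity check, the same arithmetic in a few lines (the overflow invocation count and average duration are the illustrative figures from above, not measurements):

FARGATE_VCPU_HOUR = 0.04048        # USD per vCPU-hour (memory cost omitted for simplicity)
LAMBDA_GB_SECOND = 0.0000166667    # USD per GB-second

ecs_peak_only = 500 * 1 * 24 * FARGATE_VCPU_HOUR       # pre-provisioned for 10x, ~$485/day
ecs_baseline = 30 * 1 * 24 * FARGATE_VCPU_HOUR         # hybrid baseline, ~$29/day
lambda_burst = 200_000 * 2 * 0.5 * LAMBDA_GB_SECOND    # overflow window, ~$3.33
hybrid_total = ecs_baseline + lambda_burst             # ~$32/day

print(f"ECS-only peak provisioning: ${ecs_peak_only:,.0f}/day")
print(f"Hybrid model:               ${hybrid_total:,.0f}/day")
print(f"Savings:                    {1 - hybrid_total / ecs_peak_only:.0%}")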
Failure Modes and Graceful Degradation
If Lambda is Throttled (concurrency limit hit)
SQS message → Lambda throttle → message becomes visible again after visibility timeout
→ Lambda retries → if 2 retries fail → message moves to DLQ
→ DLQ triggers alert → on-call investigates → may drop stale messages
User impact: Response delayed by visibility timeout (30s). For chat, this means a slow response — degraded but not broken.
Mitigation: Request concurrency limit increase in advance of known events (product launches).
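The DLQ alert can be a plain CloudWatch alarm on visible messages in the dead-letter queue. A sketch; the alarm name and SNS topic ARN are placeholders:

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="manga-burst-dlq-not-empty",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "manga-burst-dlq"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",   # any message in the DLQ pages on-call
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-alerts"],   # placeholder topic
)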
If SQS Queue Grows Too Deep
SQS message retention is set to 1 minute. Messages older than 1 minute are dropped because a 60-second-old chat response is useless.
import json
import logging
import time

logger = logging.getLogger(__name__)

def lambda_handler(event, context):
    for record in event["Records"]:
        message = json.loads(record["body"])
        # Drop stale messages immediately
        message_age_ms = time.time() * 1000 - message["enqueued_at_ms"]
        if message_age_ms > 45_000:  # 45 seconds
            logger.warning(f"Dropping stale message: {message_age_ms}ms old")
            # SQS deletes the message when Lambda returns successfully
            continue
        process_chat_turn(message)
If ECS Orchestrator is Unhealthy
ALB health checks detect unhealthy tasks and stop routing to them. New tasks start automatically. Lambda workers can process SQS messages without the Orchestrator — they call downstream services directly.
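A sketch of the target-group health check behind that behavior; the path, port, VPC ID, and thresholds are assumptions:

import boto3

elbv2 = boto3.client("elbv2")

elbv2.create_target_group(
    Name="manga-orchestrator-tg",
    Protocol="HTTP",
    Port=8080,
    VpcId="vpc-0123456789abcdef0",    # placeholder
    TargetType="ip",                  # Fargate tasks register by task ENI IP
    HealthCheckPath="/healthz",       # assumed health endpoint
    HealthCheckIntervalSeconds=10,
    HealthyThresholdCount=2,
    UnhealthyThresholdCount=2,        # roughly 20s to pull a bad task out of rotation
)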
WebSocket Connection Lifecycle
This lifecycle is the key to why ECS and Lambda play different roles:
Connection open → ECS WebSocket Handler task claims the connection
    Stores connection_id → session_id mapping in ElastiCache
User sends message → ECS Orchestrator or Lambda Worker processes it
    Both have access to ElastiCache to look up connection_id
    Both push response via API Gateway Management API
Connection idle 5min → ECS WebSocket Handler sends close frame, removes from cache
Connection close → ECS WebSocket Handler removes connection_id from cache
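A sketch of that connection registry, assuming ElastiCache for Redis accessed with redis-py and a conn: key prefix; the endpoint and TTL are illustrative:

import redis

cache = redis.Redis(host="manga-cache.example.cache.amazonaws.com", port=6379)   # placeholder endpoint

def register_connection(connection_id: str, session_id: str):
    # TTL a little above the 5-minute idle timeout so the handler's own close path usually wins
    cache.setex(f"conn:{connection_id}", 360, session_id)

def lookup_session(connection_id: str):
    value = cache.get(f"conn:{connection_id}")
    return value.decode() if value else None

def deregister_connection(connection_id: str):
    cache.delete(f"conn:{connection_id}")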
# How Lambda pushes to an existing WebSocket connection
import json
import logging

import boto3

logger = logging.getLogger(__name__)

apigw = boto3.client(
    "apigatewaymanagementapi",
    endpoint_url="https://{api-id}.execute-api.us-east-1.amazonaws.com/prod"
)

def push_response_to_user(connection_id: str, response_chunk: str):
    try:
        apigw.post_to_connection(
            ConnectionId=connection_id,
            Data=json.dumps({
                "type": "chat.response.delta",
                "delta": response_chunk
            }).encode()
        )
    except apigw.exceptions.GoneException:
        # Connection closed — user navigated away
        # Clean up: remove from cache, stop processing
        logger.info(f"Connection {connection_id} gone, dropping response")
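Tying the two Lambda snippets together, process_chat_turn (referenced by the stale-message handler earlier) might look roughly like this; every helper here is an assumption standing in for the real orchestration code:

def process_chat_turn(message: dict):
    connection_id = message["connection_id"]
    prompt = build_prompt(message["payload"])   # assumed helper: memory + intent + retrieved context

    # Same downstream calls as the ECS Orchestrator, but the response goes back
    # through the API Gateway Management API instead of a local WebSocket.
    for delta in generate_answer_stream(prompt):      # assumed helper: Bedrock streaming, as sketched earlier
        push_response_to_user(connection_id, delta)   # step 8 of the burst flow

    save_turn(message["payload"])   # assumed helper: persist the turn to DynamoDB (step 5 of the normal flow)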
Summary: Who Does What
| Responsibility | ECS Fargate | Lambda |
|---|---|---|
| Hold WebSocket connections | Yes — containers are long-lived | No — functions are ephemeral |
| Process baseline chat traffic | Yes — primary processing layer | No |
| Absorb 10x burst traffic | Partially — scales but slowly | Yes — scales in seconds |
| Call downstream services (Bedrock, SageMaker, DDB) | Yes | Yes (same code, same SDKs) |
| Push response to user | Via WebSocket Handler | Via API GW Management API |
| DLQ recovery writes | No | Yes — async, non-blocking |
| Scheduled jobs (RAG refresh) | No | Yes — EventBridge trigger |
| Cost model | Pay per task-hour (idle or active) | Pay per ms of actual execution |
| Cold start concern | None (always running) | Yes — mitigated with provisioned concurrency |
The combination gives MangaAssist steady low latency under normal load (ECS) plus near-instant burst capacity with zero idle cost (Lambda).