AWS Lambda - Basics to Production
What Is Lambda?
AWS Lambda is a serverless compute service. You upload a function, define what triggers it, and AWS runs it — provisioning servers, scaling, patching, and tearing down infrastructure automatically.
The core mental model:
Event happens → Lambda runs your code → Lambda stops
You pay only for the milliseconds your code actually runs. There is no idle cost.
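To make the pay-per-millisecond model concrete, here is a back-of-envelope cost sketch. The per-GB-second and per-request rates below are the published x86 list prices at the time of writing; treat them as assumptions and check the current pricing page for your region.

```python
# Back-of-envelope Lambda cost. Rates are assumptions:
# ~$0.0000166667 per GB-second (x86) plus $0.20 per 1M requests.

def lambda_cost(invocations, duration_ms, memory_mb,
                gb_second_price=0.0000166667, per_million_requests=0.20):
    # Compute is billed in GB-seconds: duration × memory, summed over invocations
    gb_seconds = invocations * (duration_ms / 1000) * (memory_mb / 1024)
    compute = gb_seconds * gb_second_price
    requests = invocations / 1_000_000 * per_million_requests
    return compute + requests

# 1M invocations at 200 ms / 512 MB:
# 1,000,000 × 0.2 s × 0.5 GB = 100,000 GB-seconds ≈ $1.67 compute + $0.20 requests
print(round(lambda_cost(1_000_000, 200, 512), 2))  # → 1.87
```

The same million requests on an always-on server would cost the same whether traffic arrived or not; with Lambda, zero traffic means zero compute cost.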
How It Works — Step by Step
- An event source triggers your function (HTTP request, queue message, schedule, etc.)
- AWS finds (or creates) a compute environment (the "execution environment")
- Your code runs inside that environment
- The environment may be reused for the next invocation (warm) or torn down (cold start)
First invocation:
[Cold Start: ~100-500ms] → [Your code: Xms] = Total latency
Subsequent invocations (same container reused):
[Your code: Xms] = Total latency (no cold start)
Core Concepts
Function
The unit of deployment. One function = one handler + its dependencies.
```python
# handler.py
import json

def handler(event, context):
    session_id = event["session_id"]
    message = event["message"]
    # Process the chat message (process_message is defined elsewhere)
    response = process_message(session_id, message)
    return {
        "statusCode": 200,
        "body": json.dumps({"response": response}),
    }
```
Event
The input to your function. The shape depends on the trigger source.
```json
// From API Gateway
{
  "httpMethod": "POST",
  "path": "/chat/message",
  "body": "{\"session_id\": \"sess_abc\", \"message\": \"Recommend manga\"}",
  "headers": { "Authorization": "Bearer ..." }
}

// From SQS
{
  "Records": [
    {
      "body": "{\"session_id\": \"sess_abc\", \"turn\": {...}}",
      "messageId": "msg-123"
    }
  ]
}
```
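Because the event shape depends on the trigger, a function that serves both sources has to normalize them before processing. A minimal sketch (the `extract_messages` helper is illustrative, not an AWS API):

```python
import json

def extract_messages(event):
    """Normalize API Gateway and SQS event shapes into (session_id, message) pairs."""
    if "Records" in event:
        # SQS: one JSON-encoded body per record
        bodies = [json.loads(r["body"]) for r in event["Records"]]
    else:
        # API Gateway: a single JSON-encoded request body
        bodies = [json.loads(event["body"])]
    return [(b["session_id"], b.get("message")) for b in bodies]
```

In practice most teams deploy one function per trigger instead, precisely so each handler only has to understand one event shape.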
Context
Metadata about the invocation (function name, memory limit, request ID, remaining time).
```python
def handler(event, context):
    print(f"Function: {context.function_name}")
    print(f"Remaining time: {context.get_remaining_time_in_millis()}ms")
    print(f"Memory limit: {context.memory_limit_in_mb}MB")
```
Execution Environment
An isolated container that AWS manages. When a function is invoked:
- Cold start: AWS creates a new environment, downloads your code, initializes the runtime, runs your init code (outside the handler), then calls your handler.
- Warm invocation: AWS reuses an existing environment and calls your handler directly.
```python
import boto3

# This runs once during cold start (environment initialization).
# Subsequent warm invocations reuse this client.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("manga_chatbot_memory")

def handler(event, context):
    # This runs on every invocation,
    # but the DynamoDB client above is already initialized (warm path)
    session_id = event["session_id"]
    response = table.get_item(Key={"pk": f"SESSION#{session_id}"})
    return response
```
Concurrency Model
Lambda scales horizontally by invocation, not by adding threads to an existing process.
10 simultaneous requests → 10 separate Lambda execution environments running in parallel
1,000 simultaneous requests → 1,000 separate Lambda environments
Concurrency limits:
- Burst limit: how fast Lambda can scale up. Under the current per-function model, each function can add roughly 1,000 concurrent executions every 10 seconds (the older region-level model started at ~3,000 instant burst in us-east-1).
- Account limit: 1,000 concurrent executions per region by default — a soft limit that can be raised into the tens of thousands via a quota increase.
- Reserved concurrency: guarantee N executions for a specific function.
- Provisioned concurrency: pre-warm N environments to eliminate cold starts.
For MangaAssist, during a peak flash sale:
Normal: 5,000 msg/sec handled by ECS Fargate (10-100 tasks)
Burst: +45,000 msg/sec overflow → Lambda absorbs this instantly
= up to 10,000 concurrent Lambda executions
ECS cannot spin up 900 tasks in seconds; Lambda can
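The concurrency a burst consumes can be estimated with Little's law: concurrent executions ≈ arrival rate × average duration. A quick sketch, assuming roughly 200 ms of Lambda compute per overflow message (an illustrative figure, not a measurement):

```python
def required_concurrency(requests_per_sec, avg_duration_sec):
    # Little's law: concurrency ≈ arrival rate × average time in system
    return requests_per_sec * avg_duration_sec

# 45,000 overflow msg/sec at ~0.2 s per message → 9,000 concurrent environments,
# which is why the scenario above needs a raised quota near 10,000
print(required_concurrency(45_000, 0.2))  # → 9000.0
```

The same formula also shows why duration matters as much as traffic: at 2 s per message the same burst would need 90,000 concurrent executions.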
Trigger Sources (Event Sources)
Lambda integrates with almost every AWS service as a trigger:
| Trigger | Use Case |
|---|---|
| API Gateway | HTTP/WebSocket endpoint → runs Lambda per request |
| SQS | Message in queue → Lambda processes it |
| SNS | Notification published → Lambda reacts |
| Kinesis | Stream record → Lambda processes batches |
| DynamoDB Streams | Table change → Lambda reacts |
| EventBridge | Scheduled rule or event bus → Lambda |
| S3 | File uploaded → Lambda processes it |
| ALB | HTTP request → Lambda (alternative to ECS) |
How MangaAssist Uses Lambda
Burst Workers — Overflow Handler
The primary use case in the HLD/LLD is handling traffic spikes that exceed ECS Fargate capacity.
Architecture:
ALB → ECS Fargate (10-100 tasks, baseline)
↓
If queue depth > threshold OR ECS tasks at capacity
↓
ECS Orchestrator → SQS Queue → Lambda Burst Workers
Lambda burst workers perform the same orchestration work as ECS tasks:
1. Read message from SQS
2. Load conversation memory from DynamoDB
3. Call intent classifier (SageMaker)
4. Fan out to downstream services
5. Call Bedrock for response generation
6. Apply guardrails
7. Push response back via WebSocket (API Gateway WebSocket management API)
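The steps above can be sketched as a handler skeleton. Every step function here (`load_memory`, `classify_intent`, and so on) is a hypothetical placeholder injected as a parameter — in a real worker each would wrap the AWS call named in the list:

```python
import json

def process_record(record, *, load_memory, classify_intent,
                   call_services, generate, apply_guardrails, push):
    """Run one SQS record through the same steps an ECS task performs."""
    turn = json.loads(record["body"])
    memory = load_memory(turn["session_id"])       # 2. DynamoDB read
    intent = classify_intent(turn, memory)         # 3. SageMaker endpoint
    context_data = call_services(intent, turn)     # 4. fan-out
    draft = generate(turn, memory, context_data)   # 5. Bedrock
    final = apply_guardrails(draft)                # 6. safety filters
    push(turn["session_id"], final)                # 7. API GW WebSocket push
    return final

def handler(event, context, **steps):
    # 1. SQS delivers a batch of records per invocation
    return [process_record(r, **steps) for r in event["Records"]]
```

Injecting the steps keeps the pipeline shape testable without AWS credentials; production code would bind them to module-scope clients initialized outside the handler.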
Why Lambda and not just more ECS tasks?
| Scenario | ECS Auto-Scaling | Lambda |
|---|---|---|
| Scale from 10 to 100 tasks | ~3-5 minutes (spin up, health check) | Milliseconds |
| Handle 10x spike for 5 minutes | Overshoots — starts too many tasks, then slowly drains | Scales exactly to demand, stops immediately |
| Cost for 5-minute peak | Pay for full tasks even after spike subsides | Pay for exact ms of compute used |
| State (WebSocket connection) | Task holds connection | Lambda is stateless — pushes via API GW WebSocket API |
Async Turn Persistence (Dead-Letter Queue Handler)
From LLD-4:
If a DynamoDB TURN write is throttled, the Orchestrator retries with exponential backoff. If the write still fails, the turn is written asynchronously via an SQS dead-letter queue to avoid blocking the response path.
```python
import json

# `table` is the module-scope DynamoDB Table initialized during cold start
def dlq_handler(event, context):
    for record in event["Records"]:
        turn_data = json.loads(record["body"])
        # Retry the DynamoDB write.
        # User already got their response — this is fire-and-forget recovery.
        table.put_item(Item=turn_data)
```
This pattern ensures the user response is never delayed by a storage failure.
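A refinement worth knowing: with `ReportBatchItemFailures` enabled on the SQS event source mapping, the handler can report only the records that still failed, so SQS re-delivers just those instead of the whole batch. A sketch, with the DynamoDB write injected as `put_item` purely for illustration:

```python
import json

def dlq_handler(event, context, put_item):
    """Retry writes; report only records that still fail so SQS
    re-delivers those alone (requires ReportBatchItemFailures on
    the event source mapping)."""
    failures = []
    for record in event["Records"]:
        try:
            turn_data = json.loads(record["body"])
            put_item(turn_data)  # e.g. table.put_item(Item=turn_data)
        except Exception:
            # Returning the messageId tells Lambda/SQS to redeliver it
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Without this, one bad record would force the entire batch back onto the queue and re-run the successful writes.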
Scheduled Tasks (EventBridge + Lambda)
- RAG index refresh trigger (every 6 hours for product descriptions)
- Cache warming jobs
- Analytics aggregation
```python
# Triggered by EventBridge schedule: rate(6 hours)
def rag_refresh_handler(event, context):
    # Fetch updated product descriptions from catalog
    # Re-chunk, re-embed, upsert to OpenSearch
    ...
```
Lambda Limits (Know These)
| Limit | Value | Impact on MangaAssist |
|---|---|---|
| Max execution time | 15 minutes | Fine — chat turns complete in <3s |
| Max memory | 10,240 MB | Fine — burst workers need ~512 MB |
| Max payload (sync) | 6 MB request / 6 MB response | Fine — chat messages are small |
| Max payload (async / SQS) | 256 KB per message | Chat messages + context fit easily |
| Concurrent executions | 1,000 per region by default (soft limit) | Request a quota increase to support 10,000 burst workers |
| Cold start (Python, 512 MB) | ~200-500 ms | Acceptable given P99 target of 3s |
| Deployment package size | 250 MB unzipped | Use Lambda Layers for large deps |
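The 15-minute cap rarely bites a chatbot, but a slow downstream call can still outrun the function's configured timeout. A common guard is to budget remaining time from the context object before starting the call; the 2-second buffer below is an illustrative choice, not a recommendation:

```python
BUFFER_MS = 2_000  # reserve 2 s for cleanup/fallback (illustrative)

def time_left_for_call(context, buffer_ms=BUFFER_MS):
    """Milliseconds safely available for a downstream call, or 0 if none."""
    remaining = context.get_remaining_time_in_millis()
    return max(remaining - buffer_ms, 0)

def handler(event, context):
    budget = time_left_for_call(context)
    if budget == 0:
        # Fail fast with a partial response instead of a hard Lambda timeout
        return {"statusCode": 503, "body": "insufficient time for downstream call"}
    # Pass budget / 1000 as the client timeout for the Bedrock/SageMaker call
    ...
```

Failing fast this way produces a clean error the client can retry, rather than a dropped invocation.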
Cold Start Mitigation
For burst workers where latency matters:
1. Provisioned Concurrency
Pre-warm a number of environments so they are always ready:
Lambda function: manga-burst-worker
Provisioned concurrency: 100
→ First 100 invocations have zero cold start
→ Beyond 100, Lambda scales with normal cold starts
Cost: You pay for provisioned environments even when idle. Use only for latency-sensitive functions.
2. Lambda SnapStart
Restores a snapshot of the initialized environment instead of running init on each cold start. Originally Java-only (AWS has since added Python and .NET support); not used for MangaAssist's burst workers.
3. Minimize Package Size
Smaller packages download and initialize faster.
```text
# Bad: include everything
requirements.txt: boto3, numpy, pandas, scikit-learn, torch   # ~500 MB

# Good: only what the burst worker needs
requirements.txt: boto3, requests                             # ~10 MB

# Move heavy deps to Lambda Layers if shared across functions
```
4. Initialize Outside the Handler
```python
# Good: initialized once per execution environment (warm path reuses)
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("manga_chatbot_memory")
bedrock = boto3.client("bedrock-runtime")

def handler(event, context):
    # Uses already-initialized clients — no re-initialization cost
    ...
```
IAM Permissions for Lambda
Each Lambda function gets an Execution Role:
```text
manga-burst-worker-role:
  Allow: sqs:ReceiveMessage, sqs:DeleteMessage, sqs:GetQueueAttributes (burst-queue only)
  Allow: dynamodb:GetItem, dynamodb:PutItem (manga_chatbot_memory only)
  Allow: bedrock:InvokeModel (Claude 3.5 Sonnet only)
  Allow: sagemaker:InvokeEndpoint (intent-classifier only)
  Allow: execute-api:ManageConnections (chat WebSocket API only)
  Allow: logs:CreateLogGroup, logs:CreateLogStream, logs:PutLogEvents (CloudWatch)
```
Never give Lambda admin permissions. Least privilege prevents blast radius if the function is compromised.
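The first rule above could be expressed as an IAM policy statement like the following sketch (the account ID, region, and queue name in the ARN are placeholders):

```python
import json

# Least-privilege statement for the burst worker's SQS access.
# The Resource ARN is a placeholder for this account/region/queue.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sqs:ReceiveMessage",
                "sqs:DeleteMessage",
                "sqs:GetQueueAttributes",  # required by the SQS event source mapping
            ],
            "Resource": "arn:aws:sqs:us-east-1:123456789012:burst-queue",
        }
    ],
}
print(json.dumps(policy, indent=2))
```

Scoping `Resource` to the one queue means a compromised function cannot read from, or delete messages on, any other queue in the account.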
Monitoring Lambda
CloudWatch Metrics (automatic)
- Invocations — how many times the function ran
- Errors — invocations that threw an exception
- Duration — execution time in ms (p50, p95, p99)
- ConcurrentExecutions — how many are running right now
- Throttles — invocations rejected due to concurrency limit
Key Alarms for MangaAssist
Alarm: burst-worker-errors
Metric: Errors > 50 in 1 minute
Action: Page on-call (SEV-2)
Alarm: burst-worker-throttles
Metric: Throttles > 0 in 1 minute
Action: Increase reserved concurrency or account limit
Alarm: burst-worker-duration-p99
Metric: Duration p99 > 5000ms
Action: Investigate slow downstream calls
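The first alarm above, expressed as keyword arguments for boto3's CloudWatch `put_metric_alarm` call (the SNS topic ARN used for paging is a placeholder):

```python
# kwargs for boto3.client("cloudwatch").put_metric_alarm(**alarm).
# The AlarmActions SNS ARN is a placeholder for the on-call topic.
alarm = {
    "AlarmName": "burst-worker-errors",
    "Namespace": "AWS/Lambda",
    "MetricName": "Errors",
    "Dimensions": [{"Name": "FunctionName", "Value": "manga-burst-worker"}],
    "Statistic": "Sum",
    "Period": 60,                # evaluate over 1-minute windows
    "EvaluationPeriods": 1,
    "Threshold": 50,             # > 50 errors in a minute pages on-call
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:oncall-sev2"],
}
```

The other two alarms differ only in `MetricName` (`Throttles`, `Duration`), the threshold, and — for the p99 duration alarm — using `ExtendedStatistic: "p99"` in place of `Statistic`.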
Lambda vs ECS Fargate — Decision Framework
Use this to decide which to use for a given workload:
| Use Lambda when... | Use ECS Fargate when... |
|---|---|
| Request is short (<15 min) | Need long-lived connections (WebSocket) |
| Workload is bursty / unpredictable | Workload is steady / predictable |
| You want zero idle cost | You need consistent latency (no cold start) |
| Stateless processing | Stateful or streaming workloads |
| Event-driven (queue, schedule, event) | HTTP service with many endpoints |
| Package is small (<250 MB) | Large runtimes or complex dependencies |
MangaAssist uses both because they complement each other:
- ECS Fargate holds the WebSocket connection and is the primary serving layer
- Lambda absorbs burst traffic that ECS cannot scale to fast enough
Summary
| Concept | One-Line Summary |
|---|---|
| Lambda | Run code without managing servers; pay per ms |
| Handler | Entry point — def handler(event, context) |
| Event | Input to the function; shape depends on trigger |
| Execution environment | Isolated container AWS manages; reused for warm invocations |
| Cold start | One-time init cost when a new environment is created |
| Concurrency | N simultaneous requests = N parallel environments |
| Burst scaling | Lambda scales to thousands of concurrent executions in seconds |
| Reserved concurrency | Guarantee N executions; prevents starvation by other functions |
| Provisioned concurrency | Pre-warm environments to eliminate cold starts |
| Execution role | IAM role the function assumes — use least privilege |