AWS Lambda - Basics to Production
What Is Lambda?
AWS Lambda is a serverless compute service. You upload a function, define what triggers it, and AWS runs it — provisioning servers, scaling, patching, and tearing down infrastructure automatically.
The core mental model:
Event happens → Lambda runs your code → Lambda stops
You pay only for the milliseconds your code actually runs. There is no idle cost.
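To make the pay-per-millisecond model concrete, here is a back-of-envelope cost sketch. The per-GB-second and per-request rates below are the published x86 list prices at the time of writing; treat them as assumptions and check the current pricing page for your region.

```python
# Back-of-envelope Lambda cost. Rates are assumptions:
# ~$0.0000166667 per GB-second (x86) plus $0.20 per 1M requests.

def lambda_cost(invocations, duration_ms, memory_mb,
                gb_second_price=0.0000166667, per_million_requests=0.20):
    # Compute is billed in GB-seconds: duration × memory, summed over invocations
    gb_seconds = invocations * (duration_ms / 1000) * (memory_mb / 1024)
    compute = gb_seconds * gb_second_price
    requests = invocations / 1_000_000 * per_million_requests
    return compute + requests

# 1M invocations at 200 ms / 512 MB:
# 1,000,000 × 0.2 s × 0.5 GB = 100,000 GB-seconds ≈ $1.67 compute + $0.20 requests
print(round(lambda_cost(1_000_000, 200, 512), 2))  # → 1.87
```

The same million requests on an always-on server would cost the same whether traffic arrived or not; with Lambda, zero traffic means zero compute cost.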
How It Works — Step by Step
- An event source triggers your function (HTTP request, queue message, schedule, etc.)
- AWS finds (or creates) a compute environment (the "execution environment")
- Your code runs inside that environment
- The environment may be reused for the next invocation (warm) or torn down (cold start)
First invocation:
[Cold Start: ~100-500ms] → [Your code: Xms] = Total latency
Subsequent invocations (same container reused):
[Your code: Xms] = Total latency (no cold start)
Core Concepts
Function
The unit of deployment. One function = one handler + its dependencies.
```python
# handler.py
import json

def handler(event, context):
    session_id = event["session_id"]
    message = event["message"]
    # Process the chat message (process_message is defined elsewhere)
    response = process_message(session_id, message)
    return {
        "statusCode": 200,
        "body": json.dumps({"response": response}),
    }
```
Event
The input to your function. The shape depends on the trigger source.
```json
// From API Gateway
{
  "httpMethod": "POST",
  "path": "/chat/message",
  "body": "{\"session_id\": \"sess_abc\", \"message\": \"Recommend manga\"}",
  "headers": { "Authorization": "Bearer ..." }
}

// From SQS
{
  "Records": [
    {
      "body": "{\"session_id\": \"sess_abc\", \"turn\": {...}}",
      "messageId": "msg-123"
    }
  ]
}
```
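Because the event shape depends on the trigger, a function that serves both sources has to normalize them before processing. A minimal sketch (the `extract_messages` helper is illustrative, not an AWS API):

```python
import json

def extract_messages(event):
    """Normalize API Gateway and SQS event shapes into (session_id, message) pairs."""
    if "Records" in event:
        # SQS: one JSON-encoded body per record
        bodies = [json.loads(r["body"]) for r in event["Records"]]
    else:
        # API Gateway: a single JSON-encoded request body
        bodies = [json.loads(event["body"])]
    return [(b["session_id"], b.get("message")) for b in bodies]
```

In practice most teams deploy one function per trigger instead, precisely so each handler only has to understand one event shape.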
Context
Metadata about the invocation (function name, memory limit, request ID, remaining time).
```python
def handler(event, context):
    print(f"Function: {context.function_name}")
    print(f"Remaining time: {context.get_remaining_time_in_millis()}ms")
    print(f"Memory limit: {context.memory_limit_in_mb}MB")
```
Execution Environment
An isolated container that AWS manages. When a function is invoked:
- Cold start: AWS creates a new environment, downloads your code, initializes the runtime, runs your init code (outside the handler), then calls your handler.
- Warm invocation: AWS reuses an existing environment and calls your handler directly.
```python
import boto3

# This runs once during cold start (environment initialization).
# Subsequent warm invocations reuse this client.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("manga_chatbot_memory")

def handler(event, context):
    # This runs on every invocation,
    # but the DynamoDB client above is already initialized (warm path)
    session_id = event["session_id"]
    response = table.get_item(Key={"pk": f"SESSION#{session_id}"})
    return response
```
Concurrency Model
Lambda scales horizontally by invocation, not by adding threads to an existing process.
10 simultaneous requests → 10 separate Lambda execution environments running in parallel
1,000 simultaneous requests → 1,000 separate Lambda environments
Concurrency limits:
- Burst limit: how fast Lambda can scale up. Under the current per-function model, each function can add roughly 1,000 concurrent executions every 10 seconds (the older region-level model started at ~3,000 instant burst in us-east-1).
- Account limit: 1,000 concurrent executions per region by default — a soft limit that can be raised into the tens of thousands via a quota increase.
- Reserved concurrency: guarantee N executions for a specific function.
- Provisioned concurrency: pre-warm N environments to eliminate cold starts.
For MangaAssist, during a peak flash sale:
Normal: 5,000 msg/sec handled by ECS Fargate (10-100 tasks)
Burst: +45,000 msg/sec overflow → Lambda absorbs this instantly
= up to 10,000 concurrent Lambda executions
ECS cannot spin up 900 tasks in seconds; Lambda can
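The concurrency a burst consumes can be estimated with Little's law: concurrent executions ≈ arrival rate × average duration. A quick sketch, assuming roughly 200 ms of Lambda compute per overflow message (an illustrative figure, not a measurement):

```python
def required_concurrency(requests_per_sec, avg_duration_sec):
    # Little's law: concurrency ≈ arrival rate × average time in system
    return requests_per_sec * avg_duration_sec

# 45,000 overflow msg/sec at ~0.2 s per message → 9,000 concurrent environments,
# which is why the scenario above needs a raised quota near 10,000
print(required_concurrency(45_000, 0.2))  # → 9000.0
```

The same formula also shows why duration matters as much as traffic: at 2 s per message the same burst would need 90,000 concurrent executions.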
Trigger Sources (Event Sources)
Lambda integrates with almost every AWS service as a trigger:
| Trigger | Use Case |
|---|---|
| API Gateway | HTTP/WebSocket endpoint → runs Lambda per request |
| SQS | Message in queue → Lambda processes it |
| SNS | Notification published → Lambda reacts |
| Kinesis | Stream record → Lambda processes batches |
| DynamoDB Streams | Table change → Lambda reacts |
| EventBridge | Scheduled rule or event bus → Lambda |
| S3 | File uploaded → Lambda processes it |
| ALB | HTTP request → Lambda (alternative to ECS) |
How MangaAssist Uses Lambda
Burst Workers — Overflow Handler
The primary use case in the HLD/LLD is handling traffic spikes that exceed ECS Fargate capacity.
Architecture:
ALB → ECS Fargate (10-100 tasks, baseline)
↓
If queue depth > threshold OR ECS tasks at capacity
↓
ECS Orchestrator → SQS Queue → Lambda Burst Workers
Lambda burst workers perform the same orchestration work as ECS tasks:
1. Read message from SQS
2. Load conversation memory from DynamoDB
3. Call intent classifier (SageMaker)
4. Fan out to downstream services
5. Call Bedrock for response generation
6. Apply guardrails
7. Push response back via WebSocket (API Gateway WebSocket management API)
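The steps above can be sketched as a handler skeleton. Every step function here (`load_memory`, `classify_intent`, and so on) is a hypothetical placeholder injected as a parameter — in a real worker each would wrap the AWS call named in the list:

```python
import json

def process_record(record, *, load_memory, classify_intent,
                   call_services, generate, apply_guardrails, push):
    """Run one SQS record through the same steps an ECS task performs."""
    turn = json.loads(record["body"])
    memory = load_memory(turn["session_id"])       # 2. DynamoDB read
    intent = classify_intent(turn, memory)         # 3. SageMaker endpoint
    context_data = call_services(intent, turn)     # 4. fan-out
    draft = generate(turn, memory, context_data)   # 5. Bedrock
    final = apply_guardrails(draft)                # 6. safety filters
    push(turn["session_id"], final)                # 7. API GW WebSocket push
    return final

def handler(event, context, **steps):
    # 1. SQS delivers a batch of records per invocation
    return [process_record(r, **steps) for r in event["Records"]]
```

Injecting the steps keeps the pipeline shape testable without AWS credentials; production code would bind them to module-scope clients initialized outside the handler.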
Why Lambda and not just more ECS tasks?
| Scenario | ECS Auto-Scaling | Lambda |
|---|---|---|
| Scale from 10 to 100 tasks | ~3-5 minutes (spin up, health check) | Milliseconds |
| Handle 10x spike for 5 minutes | Overshoots — starts too many tasks, then slowly drains | Scales exactly to demand, stops immediately |
| Cost for 5-minute peak | Pay for full tasks even after spike subsides | Pay for exact ms of compute used |
| State (WebSocket connection) | Task holds connection | Lambda is stateless — pushes via API GW WebSocket API |
Async Turn Persistence (Dead-Letter Queue Handler)
From LLD-4:
If a DynamoDB TURN write is throttled, the Orchestrator retries with exponential backoff. If the write still fails, the turn is written asynchronously via an SQS dead-letter queue to avoid blocking the response path.
```python
import json

# `table` is the module-scope DynamoDB Table initialized during cold start
def dlq_handler(event, context):
    for record in event["Records"]:
        turn_data = json.loads(record["body"])
        # Retry the DynamoDB write.
        # User already got their response — this is fire-and-forget recovery.
        table.put_item(Item=turn_data)
```
This pattern ensures the user response is never delayed by a storage failure.
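A refinement worth knowing: with `ReportBatchItemFailures` enabled on the SQS event source mapping, the handler can report only the records that still failed, so SQS re-delivers just those instead of the whole batch. A sketch, with the DynamoDB write injected as `put_item` purely for illustration:

```python
import json

def dlq_handler(event, context, put_item):
    """Retry writes; report only records that still fail so SQS
    re-delivers those alone (requires ReportBatchItemFailures on
    the event source mapping)."""
    failures = []
    for record in event["Records"]:
        try:
            turn_data = json.loads(record["body"])
            put_item(turn_data)  # e.g. table.put_item(Item=turn_data)
        except Exception:
            # Returning the messageId tells Lambda/SQS to redeliver it
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Without this, one bad record would force the entire batch back onto the queue and re-run the successful writes.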
Scheduled Tasks (EventBridge + Lambda)
- RAG index refresh trigger (every 6 hours for product descriptions)
- Cache warming jobs
- Analytics aggregation
```python
# Triggered by EventBridge schedule: rate(6 hours)
def rag_refresh_handler(event, context):
    # Fetch updated product descriptions from catalog
    # Re-chunk, re-embed, upsert to OpenSearch
    ...
```
Lambda Limits (Know These)
| Limit | Value | Impact on MangaAssist |
|---|---|---|
| Max execution time | 15 minutes | Fine — chat turns complete in <3s |
| Max memory | 10,240 MB | Fine — burst workers need ~512 MB |
| Max payload (sync) | 6 MB request / 6 MB response | Fine — chat messages are small |
| Max payload (async / SQS) | 256 KB per message | Chat messages + context fit easily |
| Concurrent executions | 1,000 per region by default (soft limit) | Request a quota increase to support 10,000 burst workers |
| Cold start (Python, 512 MB) | ~200-500 ms | Acceptable given P99 target of 3s |
| Deployment package size | 250 MB unzipped | Use Lambda Layers for large deps |
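The 15-minute cap rarely bites a chatbot, but a slow downstream call can still outrun the function's configured timeout. A common guard is to budget remaining time from the context object before starting the call; the 2-second buffer below is an illustrative choice, not a recommendation:

```python
BUFFER_MS = 2_000  # reserve 2 s for cleanup/fallback (illustrative)

def time_left_for_call(context, buffer_ms=BUFFER_MS):
    """Milliseconds safely available for a downstream call, or 0 if none."""
    remaining = context.get_remaining_time_in_millis()
    return max(remaining - buffer_ms, 0)

def handler(event, context):
    budget = time_left_for_call(context)
    if budget == 0:
        # Fail fast with a partial response instead of a hard Lambda timeout
        return {"statusCode": 503, "body": "insufficient time for downstream call"}
    # Pass budget / 1000 as the client timeout for the Bedrock/SageMaker call
    ...
```

Failing fast this way produces a clean error the client can retry, rather than a dropped invocation.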
Cold Start Mitigation
For burst workers where latency matters:
1. Provisioned Concurrency
Pre-warm a number of environments so they are always ready:
Lambda function: manga-burst-worker
Provisioned concurrency: 100
→ First 100 invocations have zero cold start
→ Beyond 100, Lambda scales with normal cold starts
Cost: You pay for provisioned environments even when idle. Use only for latency-sensitive functions.
2. Lambda SnapStart
Restores a snapshot of the initialized environment instead of running init on each cold start. Originally Java-only (AWS has since added Python and .NET support); not used for MangaAssist's burst workers.
3. Minimize Package Size
Smaller packages download and initialize faster.
```text
# Bad: include everything
requirements.txt: boto3, numpy, pandas, scikit-learn, torch   # ~500 MB

# Good: only what the burst worker needs
requirements.txt: boto3, requests                             # ~10 MB

# Move heavy deps to Lambda Layers if shared across functions
```
4. Initialize Outside the Handler
```python
# Good: initialized once per execution environment (warm path reuses)
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("manga_chatbot_memory")
bedrock = boto3.client("bedrock-runtime")

def handler(event, context):
    # Uses already-initialized clients — no re-initialization cost
    ...
```
IAM Permissions for Lambda
Each Lambda function gets an Execution Role:
```text
manga-burst-worker-role:
  Allow: sqs:ReceiveMessage, sqs:DeleteMessage, sqs:GetQueueAttributes (burst-queue only)
  Allow: dynamodb:GetItem, dynamodb:PutItem (manga_chatbot_memory only)
  Allow: bedrock:InvokeModel (Claude 3.5 Sonnet only)
  Allow: sagemaker:InvokeEndpoint (intent-classifier only)
  Allow: execute-api:ManageConnections (chat WebSocket API only)
  Allow: logs:CreateLogGroup, logs:CreateLogStream, logs:PutLogEvents (CloudWatch)
```
Never give Lambda admin permissions. Least privilege prevents blast radius if the function is compromised.
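The first rule above could be expressed as an IAM policy statement like the following sketch (the account ID, region, and queue name in the ARN are placeholders):

```python
import json

# Least-privilege statement for the burst worker's SQS access.
# The Resource ARN is a placeholder for this account/region/queue.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sqs:ReceiveMessage",
                "sqs:DeleteMessage",
                "sqs:GetQueueAttributes",  # required by the SQS event source mapping
            ],
            "Resource": "arn:aws:sqs:us-east-1:123456789012:burst-queue",
        }
    ],
}
print(json.dumps(policy, indent=2))
```

Scoping `Resource` to the one queue means a compromised function cannot read from, or delete messages on, any other queue in the account.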
Monitoring Lambda
CloudWatch Metrics (automatic)
- Invocations — how many times the function ran
- Errors — invocations that threw an exception
- Duration — execution time in ms (p50, p95, p99)
- ConcurrentExecutions — how many are running right now
- Throttles — invocations rejected due to concurrency limit
Key Alarms for MangaAssist
Alarm: burst-worker-errors
Metric: Errors > 50 in 1 minute
Action: Page on-call (SEV-2)
Alarm: burst-worker-throttles
Metric: Throttles > 0 in 1 minute
Action: Increase reserved concurrency or account limit
Alarm: burst-worker-duration-p99
Metric: Duration p99 > 5000ms
Action: Investigate slow downstream calls
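The first alarm above, expressed as keyword arguments for boto3's CloudWatch `put_metric_alarm` call (the SNS topic ARN used for paging is a placeholder):

```python
# kwargs for boto3.client("cloudwatch").put_metric_alarm(**alarm).
# The AlarmActions SNS ARN is a placeholder for the on-call topic.
alarm = {
    "AlarmName": "burst-worker-errors",
    "Namespace": "AWS/Lambda",
    "MetricName": "Errors",
    "Dimensions": [{"Name": "FunctionName", "Value": "manga-burst-worker"}],
    "Statistic": "Sum",
    "Period": 60,                # evaluate over 1-minute windows
    "EvaluationPeriods": 1,
    "Threshold": 50,             # > 50 errors in a minute pages on-call
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:oncall-sev2"],
}
```

The other two alarms differ only in `MetricName` (`Throttles`, `Duration`), the threshold, and — for the p99 duration alarm — using `ExtendedStatistic: "p99"` in place of `Statistic`.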
Lambda vs ECS Fargate — Decision Framework
Use this to decide which to use for a given workload:
| Use Lambda when... | Use ECS Fargate when... |
|---|---|
| Request is short (<15 min) | Need long-lived connections (WebSocket) |
| Workload is bursty / unpredictable | Workload is steady / predictable |
| You want zero idle cost | You need consistent latency (no cold start) |
| Stateless processing | Stateful or streaming workloads |
| Event-driven (queue, schedule, event) | HTTP service with many endpoints |
| Package is small (<250 MB) | Large runtimes or complex dependencies |
MangaAssist uses both because they complement each other:
- ECS Fargate holds the WebSocket connection and is the primary serving layer
- Lambda absorbs burst traffic that ECS cannot scale to fast enough
Summary
| Concept | One-Line Summary |
|---|---|
| Lambda | Run code without managing servers; pay per ms |
| Handler | Entry point — def handler(event, context) |
| Event | Input to the function; shape depends on trigger |
| Execution environment | Isolated container AWS manages; reused for warm invocations |
| Cold start | One-time init cost when a new environment is created |
| Concurrency | N simultaneous requests = N parallel environments |
| Burst scaling | Lambda scales to thousands of concurrent executions in seconds |
| Reserved concurrency | Guarantee N executions; prevents starvation by other functions |
| Provisioned concurrency | Pre-warm environments to eliminate cold starts |
| Execution role | IAM role the function assumes — use least privilege |