
AWS Lambda - Basics to Production

What Is Lambda?

AWS Lambda is a serverless compute service. You upload a function, define what triggers it, and AWS runs it — provisioning servers, scaling, patching, and tearing down infrastructure automatically.

The core mental model:

Event happens → Lambda runs your code → Lambda stops

You pay only for the milliseconds your code actually runs. There is no idle cost.


How It Works — Step by Step

  1. An event source triggers your function (HTTP request, queue message, schedule, etc.)
  2. AWS finds (or creates) a compute environment (the "execution environment")
  3. Your code runs inside that environment
  4. The environment may be reused for the next invocation (warm) or torn down (cold start)

First invocation:
  [Cold Start: ~100-500ms]  →  [Your code: Xms]  =  Total latency

Subsequent invocations (same container reused):
  [Your code: Xms]  =  Total latency (no cold start)
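
One way to observe the warm/cold split is to timestamp module initialization, which only runs on a cold start. A minimal, illustrative sketch:

import time

INIT_TIME = time.time()  # runs once, during cold-start initialization

def handler(event, context):
    # Near zero on a cold start; grows across warm invocations because
    # the same execution environment (and its module state) is reused.
    return {"environment_age_seconds": round(time.time() - INIT_TIME, 3)}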

Core Concepts

Function

The unit of deployment. One function = one handler + its dependencies.

# handler.py
import json

def handler(event, context):
    session_id = event["session_id"]
    message = event["message"]

    # Process the chat message (process_message is the application's
    # own logic, defined elsewhere in the package)
    response = process_message(session_id, message)

    return {
        "statusCode": 200,
        "body": json.dumps({"response": response})
    }

Event

The input to your function. The shape depends on the trigger source.

// From API Gateway
{
  "httpMethod": "POST",
  "path": "/chat/message",
  "body": "{\"session_id\": \"sess_abc\", \"message\": \"Recommend manga\"}",
  "headers": { "Authorization": "Bearer ..." }
}

// From SQS
{
  "Records": [
    {
      "body": "{\"session_id\": \"sess_abc\", \"turn\": {...}}",
      "messageId": "msg-123"
    }
  ]
}
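
Handler code usually starts by normalizing these shapes. A minimal sketch based on the two examples above:

import json

def handler(event, context):
    # SQS wraps a batch of messages in "Records"; API Gateway wraps a
    # single request body in a JSON string; a direct invoke passes the
    # payload through unchanged.
    if "Records" in event:
        payloads = [json.loads(record["body"]) for record in event["Records"]]
    elif "httpMethod" in event:
        payloads = [json.loads(event["body"])]
    else:
        payloads = [event]
    return {"received": len(payloads)}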

Context

Metadata about the invocation (function name, memory limit, request ID, remaining time).

def handler(event, context):
    print(f"Function: {context.function_name}")
    print(f"Remaining time: {context.get_remaining_time_in_millis()}ms")
    print(f"Memory limit: {context.memory_limit_in_mb}MB")

Execution Environment

An isolated container that AWS manages. When a function is invoked:

  • Cold start: AWS creates a new environment, downloads your code, initializes the runtime, runs your init code (outside the handler), then calls your handler.
  • Warm invocation: AWS reuses an existing environment and calls your handler directly.

import boto3

# This runs once during cold start (environment initialization)
# Subsequent warm invocations reuse this client
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("manga_chatbot_memory")

def handler(event, context):
    # This runs on every invocation
    # But DynamoDB client above is already initialized (warm path)
    session_id = event["session_id"]
    response = table.get_item(Key={"pk": f"SESSION#{session_id}"})
    return response

Concurrency Model

Lambda scales horizontally by invocation, not by adding threads to an existing process.

10 simultaneous requests → 10 separate Lambda execution environments running in parallel
1,000 simultaneous requests → 1,000 separate Lambda environments

Concurrency limits:

  • Burst limit: How fast Lambda can scale up (varies by region; ~3,000 in us-east-1 initially)
  • Account limit: 1,000 concurrent executions per region by default; this is a soft limit, and the MangaAssist numbers below assume it has been raised to 10,000
  • Reserved concurrency: Guarantee N executions for a specific function
  • Provisioned concurrency: Pre-warm N environments to eliminate cold starts
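
Reserved concurrency is a per-function setting and takes a single boto3 call. A sketch (the function name comes from this doc's examples; the value is illustrative):

import boto3

lambda_client = boto3.client("lambda")

# Guarantee (and cap) this function at 500 concurrent executions,
# carved out of the account-level concurrency limit.
lambda_client.put_function_concurrency(
    FunctionName="manga-burst-worker",
    ReservedConcurrentExecutions=500,
)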

For MangaAssist, during a peak flash sale:

Normal: 5,000 msg/sec handled by ECS Fargate (10-100 tasks)
Burst:  +45,000 msg/sec overflow → Lambda absorbs this instantly
        = up to 10,000 concurrent Lambda executions
        ECS cannot spin up 900 tasks in seconds; Lambda can


Trigger Sources (Event Sources)

Lambda integrates with almost every AWS service as a trigger:

Trigger            Use Case
API Gateway        HTTP/WebSocket endpoint → runs Lambda per request
SQS                Message in queue → Lambda processes it
SNS                Notification published → Lambda reacts
Kinesis            Stream record → Lambda processes batches
DynamoDB Streams   Table change → Lambda reacts
EventBridge        Scheduled rule or event bus → Lambda
S3                 File uploaded → Lambda processes it
ALB                HTTP request → Lambda (alternative to ECS)

How MangaAssist Uses Lambda

Burst Workers — Overflow Handler

The primary use case in the HLD/LLD is handling traffic spikes that exceed ECS Fargate capacity.

Architecture:
  ALB → ECS Fargate (10-100 tasks, baseline)
         ↓
         If queue depth > threshold OR ECS tasks at capacity
         ↓
  ECS Orchestrator → SQS Queue → Lambda Burst Workers

Lambda burst workers perform the same orchestration work as ECS tasks (sketched below):

  1. Read message from SQS
  2. Load conversation memory from DynamoDB
  3. Call intent classifier (SageMaker)
  4. Fan out to downstream services
  5. Call Bedrock for response generation
  6. Apply guardrails
  7. Push response back via WebSocket (API Gateway WebSocket management API)
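
A compressed sketch of that flow, assuming the queue message carries session_id, message, and connection_id fields (the WebSocket endpoint URL and request/response payload shapes are illustrative; steps 3, 4, and 6 are elided):

import json
import boto3

# Initialized once per environment (see "Execution Environment" above)
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("manga_chatbot_memory")
bedrock = boto3.client("bedrock-runtime")
apigw = boto3.client(
    "apigatewaymanagementapi",
    # Placeholder WebSocket API endpoint
    endpoint_url="https://ws-api-id.execute-api.us-east-1.amazonaws.com/prod",
)

def handler(event, context):
    for record in event["Records"]:                      # 1. read from SQS
        msg = json.loads(record["body"])
        memory = table.get_item(                         # 2. load memory
            Key={"pk": f"SESSION#{msg['session_id']}"}
        ).get("Item", {})
        # 3-6. classification, fan-out, and guardrails elided; the
        # Bedrock call stands in as the representative step, with the
        # request body simplified for illustration
        result = bedrock.invoke_model(
            modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
            body=json.dumps({"prompt": msg["message"], "memory": memory}),
        )
        reply = json.loads(result["body"].read())
        apigw.post_to_connection(                        # 7. push via WebSocket
            ConnectionId=msg["connection_id"],
            Data=json.dumps(reply).encode(),
        )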

Why Lambda and not just more ECS tasks?

Scenario                         ECS Auto-Scaling                                         Lambda
Scale from 10 to 100 tasks       ~3-5 minutes (spin up, health check)                     Milliseconds
Handle 10x spike for 5 minutes   Overshoots — starts too many tasks, then slowly drains   Scales exactly to demand, stops immediately
Cost for 5-minute peak           Pay for full tasks even after spike subsides             Pay for exact ms of compute used
State (WebSocket connection)     Task holds connection                                    Lambda is stateless — pushes via API GW WebSocket API

Async Turn Persistence (Dead-Letter Queue Handler)

From LLD-4:

If a DynamoDB TURN write is throttled, the Orchestrator retries with exponential backoff. If the write still fails, the turn is written asynchronously via an SQS dead-letter queue to avoid blocking the response path.

import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("manga_chatbot_memory")

def dlq_handler(event, context):
    for record in event["Records"]:
        turn_data = json.loads(record["body"])

        # Retry the DynamoDB write.
        # The user already got their response — this is fire-and-forget
        # recovery; if the retry raises, SQS redelivers the message.
        table.put_item(Item=turn_data)

This pattern ensures the user response is never delayed by a storage failure.

Scheduled Tasks (EventBridge + Lambda)

  • RAG index refresh trigger (every 6 hours for product descriptions)
  • Cache warming jobs
  • Analytics aggregation

# Triggered by EventBridge schedule: rate(6 hours)
def rag_refresh_handler(event, context):
    # Fetch updated product descriptions from catalog
    # Re-chunk, re-embed, upsert to OpenSearch
    ...

Lambda Limits (Know These)

Limit                         Value                            Impact on MangaAssist
Max execution time            15 minutes                       Fine — chat turns complete in <3s
Max memory                    10,240 MB                        Fine — burst workers need ~512 MB
Max payload (sync)            6 MB request / 6 MB response     Fine — chat messages are small
Max payload (async / SQS)     256 KB per message               Chat messages + context fit easily
Concurrent executions         1,000 per account by default     Soft limit — raised to 10,000 for burst workers
Cold start (Python, 512 MB)   ~200-500 ms                      Acceptable given P99 target of 3s
Deployment package size       250 MB unzipped                  Use Lambda Layers for large deps

Cold Start Mitigation

For burst workers where latency matters:

1. Provisioned Concurrency

Pre-warm a number of environments so they are always ready:

Lambda function: manga-burst-worker
Provisioned concurrency: 100
→ First 100 invocations have zero cold start
→ Beyond 100, Lambda scales with normal cold starts

Cost: You pay for provisioned environments even when idle. Use only for latency-sensitive functions.
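
Configuring it with boto3 might look like this (the alias name and count are illustrative; provisioned concurrency must target a published version or alias, never $LATEST):

import boto3

lambda_client = boto3.client("lambda")

# Pre-warm 100 environments for the "live" alias of the burst worker
lambda_client.put_provisioned_concurrency_config(
    FunctionName="manga-burst-worker",
    Qualifier="live",
    ProvisionedConcurrentExecutions=100,
)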

2. Lambda SnapStart

Originally Java-only; AWS has since extended SnapStart to newer Python and .NET runtimes. Not used here.

3. Minimize Package Size

Smaller packages download and initialize faster.

# Bad: include everything
requirements.txt: boto3, numpy, pandas, scikit-learn, torch  # 500 MB

# Good: only what the burst worker needs
requirements.txt: boto3, requests  # 10 MB
# Move heavy deps to Lambda Layers if shared across functions

4. Initialize Outside the Handler

# Good: initialized once per execution environment (warm path reuses)
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("manga_chatbot_memory")
bedrock = boto3.client("bedrock-runtime")

def handler(event, context):
    # Uses already-initialized clients — no re-initialization cost
    ...

IAM Permissions for Lambda

Each Lambda function gets an Execution Role:

manga-burst-worker-role:
  Allow: sqs:ReceiveMessage, DeleteMessage (burst-queue only)
  Allow: dynamodb:GetItem, PutItem (manga_chatbot_memory only)
  Allow: bedrock:InvokeModel (Claude 3.5 Sonnet only)
  Allow: sagemaker:InvokeEndpoint (intent-classifier only)
  Allow: execute-api:ManageConnections (chat WebSocket API only)
  Allow: logs:CreateLogGroup, PutLogEvents (CloudWatch)
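
As a sketch, the SQS slice of that role could be attached inline with boto3 (the ARN and account ID are placeholders; note that Lambda's SQS event source mapping also requires sqs:GetQueueAttributes):

import json
import boto3

# Placeholder ARN for the burst queue
BURST_QUEUE_ARN = "arn:aws:sqs:us-east-1:123456789012:burst-queue"

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "sqs:ReceiveMessage",
            "sqs:DeleteMessage",
            "sqs:GetQueueAttributes",  # required by the event source mapping
        ],
        "Resource": BURST_QUEUE_ARN,
    }],
}

iam = boto3.client("iam")
iam.put_role_policy(
    RoleName="manga-burst-worker-role",
    PolicyName="burst-queue-access",
    PolicyDocument=json.dumps(policy),
)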

Never give Lambda admin permissions. Least privilege limits the blast radius if the function is compromised.


Monitoring Lambda

CloudWatch Metrics (automatic)

  • Invocations — how many times the function ran
  • Errors — invocations that threw an exception
  • Duration — execution time in ms (p50, p95, p99)
  • ConcurrentExecutions — how many are running right now
  • Throttles — invocations rejected due to concurrency limit

Key Alarms for MangaAssist

Alarm: burst-worker-errors
  Metric: Errors > 50 in 1 minute
  Action: Page on-call (SEV-2)

Alarm: burst-worker-throttles
  Metric: Throttles > 0 in 1 minute
  Action: Increase reserved concurrency or account limit

Alarm: burst-worker-duration-p99
  Metric: Duration p99 > 5000ms
  Action: Investigate slow downstream calls
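
A sketch of the first alarm defined in code (the SNS topic ARN is a placeholder):

import boto3

cloudwatch = boto3.client("cloudwatch")

# Page on-call when the burst worker throws >50 errors in one minute
cloudwatch.put_metric_alarm(
    AlarmName="burst-worker-errors",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "manga-burst-worker"}],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=50,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-sev2"],
)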

Lambda vs ECS Fargate — Decision Framework

Use this to decide which to use for a given workload:

Use Lambda when...                      Use ECS Fargate when...
Request is short (<15 min)              Need long-lived connections (WebSocket)
Workload is bursty / unpredictable      Workload is steady / predictable
You want zero idle cost                 You need consistent latency (no cold start)
Stateless processing                    Stateful or streaming workloads
Event-driven (queue, schedule, event)   HTTP service with many endpoints
Package is small (<250 MB)              Large runtimes or complex dependencies

MangaAssist uses both because they complement each other:

  • ECS Fargate holds the WebSocket connection and is the primary serving layer
  • Lambda absorbs burst traffic that arrives faster than ECS can scale


Summary

Concept                   One-Line Summary
Lambda                    Run code without managing servers; pay per ms
Handler                   Entry point — def handler(event, context)
Event                     Input to the function; shape depends on trigger
Execution environment     Isolated container AWS manages; reused for warm invocations
Cold start                One-time init cost when a new environment is created
Concurrency               N simultaneous requests = N parallel environments
Burst scaling             Lambda scales to thousands of concurrent executions in seconds
Reserved concurrency      Guarantee N executions; prevents starvation by other functions
Provisioned concurrency   Pre-warm environments to eliminate cold starts
Execution role            IAM role the function assumes — use least privilege