
DynamoDB Scaling Scenarios — How DynamoDB Outperforms Relational and Other NoSQL Databases

Audience: Architects, senior developers, and interview candidates who need to articulate the specific scaling situations where DynamoDB is the right choice and why.


Why Scaling Scenarios Matter

Most interviews and design reviews do not ask "what is DynamoDB?" They ask "why would you pick DynamoDB over Postgres for this specific workload?" This document maps real scaling situations to DynamoDB's strengths, with direct comparisons to alternatives so you can justify the choice precisely.


Scenario 1: E-Commerce Flash Sale — Traffic Spike from 1× to 100× in Minutes

The Situation

An e-commerce platform runs at 5,000 requests per second during normal hours. A flash sale announcement causes traffic to spike to 500,000 requests per second within minutes. The primary workload is session state reads, cart lookups by user ID, and inventory status checks.

Why Relational Databases Struggle Here

| Problem | What Actually Happens |
|---|---|
| Connection pool exhaustion | PostgreSQL supports roughly 200–500 active connections per instance. At 500,000 RPS, a connection pool saturates in seconds, even with PgBouncer |
| Slow replica scale-out | Read replicas take 5–15 minutes to provision and warm up. The spike is over before they are ready |
| Write amplification on updates | Cart updates modify a row, trigger WAL writes, update indexes, and hold row-level locks — not ideal under contention |
| Maintenance locks during load | Index maintenance or autovacuum takes table locks at the worst possible time |

How DynamoDB Handles This

```mermaid
flowchart LR
    subgraph Normal["Normal: 5K RPS"]
        N1[Lambda] --> N2[DynamoDB On-Demand]
        N2 -->|"~5K req/sec"| N3[Auto-managed capacity]
    end

    subgraph Flash["Flash Sale: 500K RPS"]
        F1[Lambda] --> F2[DynamoDB On-Demand]
        F2 -->|"~500K req/sec"| F3[Adaptive capacity scales automatically]
    end
```

| DynamoDB Advantage | Detail |
|---|---|
| No connection limit | DynamoDB is HTTP/HTTPS-based. There is no connection pool; every Lambda invocation issues an independent request |
| On-demand mode absorbs burst | Capacity adjusts within seconds. No provisioning step is needed for the spike |
| Adaptive capacity | DynamoDB detects hot partitions and reallocates throughput to them within seconds |
| Predictable per-item latency | Latency stays at single-digit milliseconds at 5K RPS and at 500K RPS |
| Horizontal partition expansion | DynamoDB automatically splits and redistributes partitions as data and traffic grow |
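
To make the connectionless point concrete, here is a minimal sketch of a Lambda handler doing a cart lookup; the `carts` table name and `user_id` key are illustrative assumptions, not part of the scenario's spec:

```python
import boto3

# Created once per Lambda execution environment, reused across invocations.
dynamodb = boto3.resource("dynamodb")
carts = dynamodb.Table("carts")

def handler(event, context):
    # Each invocation is an independent signed HTTPS request to DynamoDB:
    # no connection setup, no pool to saturate under burst.
    resp = carts.get_item(Key={"user_id": event["user_id"]})
    return resp.get("Item", {})
```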

Comparison Summary

| Database | Flash Sale Behavior |
|---|---|
| DynamoDB (on-demand) | Handles the spike transparently; cost rises in proportion to requests |
| Aurora / PostgreSQL | Likely fails under the connection storm; requires pre-warming read replicas |
| MongoDB Atlas | Scales with sharding, but shard rebalancing takes minutes; connection limits still apply |
| Redis (standalone) | Single-threaded per shard; saturates at high write concurrency; no durability |

Scenario 2: Gaming Leaderboard — Millions of Concurrent Score Updates per Second

The Situation

A mobile game has 10 million daily active users. Every time a player completes a level, their score is updated and regional leaderboards must reflect changes within one second. Peak concurrency is 200,000 score updates per second. Leaderboard reads happen 10× more often than writes.

Why Relational Databases Struggle Here

| Problem | Root Cause |
|---|---|
| Row-level lock contention | Updating a leaderboard rank requires reading, computing, and writing a rank column under a lock |
| `ORDER BY … LIMIT` at scale | With a score index, `SELECT * FROM leaderboard ORDER BY score DESC LIMIT 100` is cheap, but computing an individual player's rank means counting large index ranges: acceptable at small scale, painful at 10M rows with concurrent writes |
| Index maintenance overhead | Every update to a high-concurrency score column rewrites B-tree index entries |

DynamoDB Solution Design

```
PK = LEADERBOARD#global        SK = SCORE#<zero_padded_score>#<player_id>
PK = LEADERBOARD#region_NA     SK = SCORE#<zero_padded_score>#<player_id>
PK = PLAYER#<player_id>        SK = PROFILE → stores raw score
```

Key insight: Score is embedded in the sort key, left-zero-padded so lexicographic sort matches numeric sort. The top-N leaderboard is a Query in reverse order — no ORDER BY needed.

| Operation | DynamoDB Design | Cost |
|---|---|---|
| Update score | `PutItem` new sort-key entry + `DeleteItem` old entry | 2 WCU |
| Read top 100 | `Query(PK=LEADERBOARD#global, ScanIndexForward=False, Limit=100)` | Billed on aggregate data read, not per item: 100 small entries total a few KB, so only a few RCUs (halved with eventual consistency) |
| Player rank lookup | `Query` starting from the player's score entry (`ExclusiveStartKey`) | Sub-millisecond with DAX |
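
A hedged sketch of the update and top-N operations above, assuming an illustrative `leaderboard` table and 10-digit zero-padded scores:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("leaderboard")  # illustrative name

def update_score(player_id, old_score, new_score):
    # Write the new rank entry, then remove the stale one (2 WCU total).
    table.put_item(Item={
        "PK": "LEADERBOARD#global",
        "SK": f"SCORE#{new_score:010d}#{player_id}",  # zero-padded score
    })
    table.delete_item(Key={
        "PK": "LEADERBOARD#global",
        "SK": f"SCORE#{old_score:010d}#{player_id}",
    })

def top_100():
    # Reverse sort-key order puts the highest scores first; no ORDER BY.
    resp = table.query(
        KeyConditionExpression=Key("PK").eq("LEADERBOARD#global"),
        ScanIndexForward=False,
        Limit=100,
    )
    return resp["Items"]
```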

Comparison Summary

| Database | Leaderboard at 200K writes/sec |
|---|---|
| DynamoDB | Sort-key-embedded scores make rank queries a bounded `Query`; scales horizontally |
| PostgreSQL | Requires careful indexing; rank queries degrade as concurrent writes increase; connection bottleneck |
| Redis Sorted Sets | Microsecond latency and excellent for leaderboards, but durability is weaker and each sorted set is limited to a single shard's capacity |
| MongoDB | Aggregation pipeline for ranks is flexible but scan-heavy at 10M rows |

DynamoDB + Redis combined: Use DynamoDB as the durable source of truth for scores. Use Redis Sorted Sets as the hot leaderboard cache. DynamoDB Streams trigger a Lambda to keep Redis in sync.


Scenario 3: IoT Sensor Ingestion — 1 Million Devices Writing Every 30 Seconds

The Situation

An industrial IoT platform collects temperature, pressure, and vibration readings from 1,000,000 devices, each sending a data point every 30 seconds. That is roughly 33,000 writes per second at baseline. Reads are time-windowed: "give me the last 24 hours of readings for device XYZ."

Why Relational Databases Struggle Here

| Problem | Root Cause |
|---|---|
| Write throughput ceiling | A single PostgreSQL primary reliably handles roughly 5,000–10,000 writes per second; 33K WPS requires aggressive sharding |
| Fragile time-range partition pruning | Without proper table partitioning (`PARTITION BY RANGE`), every time-range query scans the full index |
| Schema migrations on live tables | Adding a new sensor type at 33K WPS risks `ALTER TABLE` lock timeouts |
| Storage bloat | Dead row versions accumulate (MVCC bloat); `VACUUM` competes with active writes |

DynamoDB Solution Design

```
PK = DEVICE#<device_id>     SK = TS#<unix_epoch_ms>
TTL = epoch for 30-day retention
```

| Access Pattern | DynamoDB Operation | Why It Works |
|---|---|---|
| Ingest one reading | `PutItem` | 1 WCU per 1 KB reading; 33K WPS = 33K WCU (on-demand absorbs it) |
| Last 24 hours for a device | `Query(PK=DEVICE#id, SK BETWEEN yesterday AND now)` | Bounded range within one partition; no table scan |
| Expire old readings | TTL on a 30-day epoch attribute | Zero WCU; no delete jobs needed |
| Fan-out to alerts pipeline | DynamoDB Streams → Lambda | Every new reading triggers downstream processing asynchronously |
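
A minimal sketch of the time-windowed read, assuming the PK/SK schema above and an illustrative `sensor_readings` table name:

```python
import time
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("sensor_readings")

def last_24_hours(device_id):
    now_ms = int(time.time() * 1000)
    start_ms = now_ms - 24 * 60 * 60 * 1000
    # Epoch-millisecond strings keep the same digit count for decades,
    # so the lexicographic BETWEEN matches the numeric time range.
    resp = table.query(
        KeyConditionExpression=Key("PK").eq(f"DEVICE#{device_id}")
        & Key("SK").between(f"TS#{start_ms}", f"TS#{now_ms}")
    )
    return resp["Items"]
```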

Comparison Summary

| Database | IoT at 33K WPS |
|---|---|
| DynamoDB | On-demand handles the burst; time-range queries are partition-bounded; TTL manages retention |
| PostgreSQL | Requires a time-series extension (TimescaleDB) or manual `PARTITION BY RANGE`; the write ceiling demands careful capacity planning |
| Cassandra | Comparable write throughput, but requires manual ring management, compaction tuning, and infrastructure ownership |
| InfluxDB / TimescaleDB | Purpose-built for time series; better aggregation functions (AVG, ROLLUP); but higher operational overhead than DynamoDB |

When to prefer Timestream or TimescaleDB over DynamoDB for IoT: If the primary query is time-series analytics (moving averages, downsampling, interpolation), a purpose-built time-series database wins on query expressiveness. DynamoDB wins on scale, simplicity, and time-windowed point lookups.


Scenario 4: Multi-Tenant SaaS — Thousands of Tenants, Wildly Variable Load

The Situation

A SaaS platform serves 10,000 tenant companies. A handful of enterprise tenants generate 80% of the traffic. Smaller tenants generate almost no traffic at all. Data isolation between tenants is a compliance requirement. Some tenants grow 10× in a week after a product launch.

Why Shared Relational Databases Struggle Here

| Problem | Root Cause |
|---|---|
| One noisy tenant hurts all | A large tenant running a heavy `SELECT *` query or a long transaction blocks other tenants on the same instance |
| Provisioning for peak wastes money | You size the database for the largest enterprise tenant's peak, so every small tenant pays for unused capacity |
| Schema migrations across all tenants | Adding a column means locking the shared table, affecting every tenant simultaneously |
| Connection multiplexing | 10,000 tenants each wanting persistent connections overwhelms any reasonable connection limit |

DynamoDB Solution: Tenant-Prefixed Key Design

```
PK = TENANT#<tenant_id>#RESOURCE#<resource_id>    SK = <version_or_timestamp>
GSI: PK = TENANT#<tenant_id>, SK = CREATED_AT
```

| Scaling Benefit | How DynamoDB Delivers It |
|---|---|
| Tenant isolation at partition level | Each tenant's data lives in its own partition range; a noisy tenant throttles only its own partitions, leaving other tenants unaffected |
| Linear cost scaling | Small tenants pay for tiny WCU/RCU volumes; large tenants pay more; cost scales precisely with usage |
| No schema migrations | New feature attributes are added per item without altering a shared schema |
| No connection management | DynamoDB is connectionless; 10,000 tenants × 100 concurrent users = 1,000,000 simultaneous HTTP requests, all accepted |
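
A short sketch of a tenant-prefixed write, assuming an illustrative `tenant_data` table; the point is that a new feature attribute is just another item attribute:

```python
import boto3

table = boto3.resource("dynamodb").Table("tenant_data")  # illustrative name

def save_resource(tenant_id, resource_id, version, attributes):
    # Per-item attributes ship new features with no ALTER TABLE and
    # no lock taken across other tenants.
    table.put_item(Item={
        "PK": f"TENANT#{tenant_id}#RESOURCE#{resource_id}",
        "SK": version,
        **attributes,
    })
```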

Comparison Summary

| Database | Multi-Tenant at 10K Tenants |
|---|---|
| DynamoDB | Per-tenant partition isolation; pay-per-use costs; connectionless; no shared lock contention |
| PostgreSQL (schema-per-tenant) | Better isolation, but 10K schemas create index and connection-management complexity |
| PostgreSQL (row-level security) | Simpler than separate schemas; noisy-tenant risk remains; connection limits still apply |
| MongoDB (collection-per-tenant) | 10K collections in one cluster is manageable but adds operational overhead |

Scenario 5: Session and Authentication State — Millions of Active Sessions

The Situation

A consumer web application has 5 million logged-in users during peak hours. Each API request checks if a session token is valid and loads a small payload of session data. Session data must survive server restarts. Sessions expire after 30 minutes of inactivity.

Why Redis Alone Struggles as the Primary Session Store

| Problem | Root Cause |
|---|---|
| Data loss on restart | RDB snapshots are periodic and may lose recent sessions on a crash; AOF persistence narrows the window but adds write latency |
| Memory pressure at 5M sessions | At ~1 KB per session, 5M sessions is a ~5 GB hot working set held in memory; expensive at in-memory pricing |
| Single-shard hot-key ceiling | Naively storing all sessions under one key or a few keys bottlenecks that shard |
| Cluster failover gap | Redis Cluster failover takes 15–30 seconds; sessions are unreadable during the gap |

DynamoDB Solution

```
PK = SESSION#<session_id>     SK = META
TTL = current_time + 1800     (30-minute inactivity expiry, reset on each request)
```

| Requirement | DynamoDB Behavior |
|---|---|
| Durability | Multi-AZ replication; no data loss on AZ failure |
| 5M concurrent sessions | On-demand mode handles any read/write rate; no memory ceiling |
| 30-minute sliding expiry | `UpdateItem` resets the TTL on every request; physical deletion is automatic |
| Sub-10ms reads | `GetItem` by session_id is a direct key lookup; consistent single-digit-ms latency |
| Compliance | Encryption at rest with KMS; access audited via CloudTrail |
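
A sketch of the sliding expiry, assuming an illustrative `sessions` table with TTL enabled on a `ttl` attribute:

```python
import time
import boto3

table = boto3.resource("dynamodb").Table("sessions")  # illustrative name

def touch_session(session_id):
    # Reset the expiry on every authenticated request; DynamoDB deletes
    # the item automatically (at zero write cost) once the TTL passes.
    table.update_item(
        Key={"PK": f"SESSION#{session_id}", "SK": "META"},
        UpdateExpression="SET #ttl = :t",
        ExpressionAttributeNames={"#ttl": "ttl"},
        ExpressionAttributeValues={":t": int(time.time()) + 1800},
    )
```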

Hybrid Architecture: DynamoDB + Redis

```mermaid
flowchart LR
    REQUEST["API Request\n(session token)"] --> REDIS["Redis Cache\n(hot sessions, 60-sec TTL)"]
    REDIS -->|Cache hit| RESPONSE["Response"]
    REDIS -->|Cache miss| DYNAMO["DynamoDB\n(durable session store)"]
    DYNAMO --> REDIS
    DYNAMO --> RESPONSE
```

- Redis handles repeated reads for hot sessions (users clicking rapidly)
- DynamoDB is the durable source of truth
- A cache miss falls through to DynamoDB and warms Redis
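
A read-through sketch of the hybrid flow above, assuming redis-py and illustrative endpoint and table names:

```python
import json
import boto3
import redis

cache = redis.Redis(host="sessions-cache.internal")  # hypothetical endpoint
table = boto3.resource("dynamodb").Table("sessions")

def load_session(session_id):
    key = f"session:{session_id}"
    cached = cache.get(key)
    if cached:
        return json.loads(cached)
    # Cache miss: fall through to the durable store.
    resp = table.get_item(Key={"PK": f"SESSION#{session_id}", "SK": "META"})
    item = resp.get("Item")
    if item:
        # Warm the cache with a short TTL; DynamoDB stays the source of truth.
        cache.setex(key, 60, json.dumps(item, default=str))
    return item
```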

Comparison Summary

| Database | 5M Concurrent Sessions |
|---|---|
| DynamoDB | Durable, TTL-native, effectively unlimited horizontal scale, no connection limit |
| Redis (primary store) | Faster, but carries durability risk, memory cost at 5M sessions, and a failover gap |
| PostgreSQL | A session table at 5M rows with high-frequency TTL updates suffers bloat and lock pressure |
| Memcached | No persistence or replication; all sessions are lost on restart |

Scenario 6: Real-Time Activity Feed — Fan-Out Writes with High Read Amplification

The Situation

A social platform has 100M users. When a celebrity with 10M followers posts content, all 10M followers should see the post in their feeds within seconds. Reads are 10× more frequent than writes. Feed items should expire after 30 days.

The Fan-Out Write Challenge

The write fan-out model (write one post to 10M follower feeds) creates:

- 10M individual writes per celebrity post
- Burst write load lasting 5–30 minutes per viral post

Why Relational Databases Fail at Fan-Out Scale

| Problem | Root Cause |
|---|---|
| 10M row inserts in minutes | PostgreSQL insert throughput with indexes saturates a single primary |
| Index maintenance on insert | Every insert into a feed table with a (user_id, created_at) index triggers B-tree maintenance |
| Connection storm | Every write worker needs a connection; 1,000 parallel writers approach the connection limit |

DynamoDB Solution: Per-User Feed Partition

```
PK = FEED#<user_id>      SK = CREATED_AT#<post_id>
TTL = 30-day epoch
```

| Operation | DynamoDB Behavior |
|---|---|
| Fan-out write (celebrity post) | `BatchWriteItem` calls writing 25 items each, rate-distributed via an SQS queue with Lambda consumers |
| Read user feed | `Query(PK=FEED#user_id, ScanIndexForward=False, Limit=20)` — single-partition, reverse-sorted, fast |
| Expire old posts | TTL handles the 30-day rolling expiry without a cleanup job |
| Handle celebrity fan-out peak | SQS absorbs the burst; Lambda scales the fan-out writers to 1,000 concurrent workers |

```mermaid
flowchart TB
    POST["Celebrity posts content"] --> SQS["SQS Queue\n(fan-out job)"]
    SQS --> LAMBDA["Lambda Fan-Out Workers\n(1,000 concurrent)"]
    LAMBDA -->|"BatchWriteItem × 400K batches"| DYNAMO["DynamoDB\nFeed Table (per-user partition)"]
    USER["Follower reads feed"] -->|"Query(PK=FEED#user_id)"| DYNAMO
```
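
A sketch of one fan-out worker, assuming an invented SQS message shape (`follower_ids`, `post_id`, `created_at`, `author_id`) and an illustrative `feeds` table; `batch_writer()` handles the 25-item batching:

```python
import json
import time
import boto3

table = boto3.resource("dynamodb").Table("feeds")

def handler(event, context):
    expires = int(time.time()) + 30 * 24 * 3600  # 30-day TTL attribute
    for record in event["Records"]:
        job = json.loads(record["body"])  # one slice of the follower list
        # batch_writer() groups puts into 25-item BatchWriteItem calls
        # and retries unprocessed items automatically.
        with table.batch_writer() as batch:
            for follower_id in job["follower_ids"]:
                batch.put_item(Item={
                    "PK": f"FEED#{follower_id}",
                    "SK": f"CREATED_AT#{job['created_at']}#{job['post_id']}",
                    "author_id": job["author_id"],
                    "ttl": expires,
                })
```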

Comparison Summary

| Database | Fan-Out Write to 10M Followers |
|---|---|
| DynamoDB | Per-user partitions distribute the writes; on-demand absorbs the burst; TTL manages retention |
| PostgreSQL | Single-primary insert throughput is insufficient; read replicas help reads but not writes |
| Cassandra | Comparable write throughput and a natural per-user partition mapping, but a higher operational burden |
| MongoDB | Flexible, but the per-shard write ceiling applies; sharding configuration adds complexity |

Scenario 7: Global Active-Active — Multi-Region with Low-Latency Local Reads

The Situation

A global B2B SaaS application serves customers in North America, Europe, and Asia-Pacific. Regulatory requirements mandate that customer data in EU stays in EU. Local read latency must be under 10ms. Write conflicts must be resolved automatically.

Cross-Region Relational Database Challenges

| Problem | Root Cause |
|---|---|
| Write conflicts between regions | Active-active writes to two PostgreSQL primaries require application-level conflict resolution |
| Replication lag | Asynchronous replication means EU reads may lag US writes by hundreds of milliseconds |
| Latency over the wire | Routing US writes to EU for strong consistency adds 80–120ms of cross-region latency |
| Failover complexity | Aurora Global Database failover requires promotion steps and DNS propagation |

DynamoDB Global Tables Solution

```mermaid
flowchart LR
    subgraph US-East["us-east-1"]
        DDB_US["DynamoDB\nGlobal Table Replica"]
        LAMBDA_US["Lambda\n(US traffic)"]
    end

    subgraph EU-West["eu-west-1"]
        DDB_EU["DynamoDB\nGlobal Table Replica"]
        LAMBDA_EU["Lambda\n(EU traffic)"]
    end

    subgraph AP-SE["ap-southeast-1"]
        DDB_AP["DynamoDB\nGlobal Table Replica"]
        LAMBDA_AP["Lambda\n(AP traffic)"]
    end

    LAMBDA_US -->|"Read/Write local"| DDB_US
    LAMBDA_EU -->|"Read/Write local"| DDB_EU
    LAMBDA_AP -->|"Read/Write local"| DDB_AP

    DDB_US <-->|"Async replication\n~1 second"| DDB_EU
    DDB_EU <-->|"Async replication\n~1 second"| DDB_AP
    DDB_US <-->|"Async replication\n~1 second"| DDB_AP
```

| DynamoDB Global Tables Property | Detail |
|---|---|
| Replication lag | Typically under one second between regions |
| Conflict resolution | Last-writer-wins, based on a per-item timestamp |
| Regional isolation | Reads and writes in each region hit the local replica (sub-10ms) |
| Failover | Shifting traffic to another region requires only a DNS/routing change; the replica is already warm and up to date |
| Data residency | Global Tables replicate every item to every replica region, so regulated EU data belongs in a separate EU-only table; IAM policies can deny cross-region access paths |
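
A hedged sketch of adding replicas with the low-level client (2019.11.21 Global Tables); the table and region names are illustrative, and the table must have streams enabled and on-demand or autoscaled capacity:

```python
import boto3

client = boto3.client("dynamodb", region_name="us-east-1")

# Replicas are added one region at a time via UpdateTable.
for region in ("eu-west-1", "ap-southeast-1"):
    client.update_table(
        TableName="customers",
        ReplicaUpdates=[{"Create": {"RegionName": region}}],
    )
```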

Comparison Summary

| Database | Global Active-Active |
|---|---|
| DynamoDB Global Tables | Native multi-region active-active; ~1s replication lag; last-writer-wins conflict resolution |
| Aurora Global Database | One writer region; cross-region read replicas only; failover takes minutes |
| CockroachDB | True multi-region active-active with consensus; more configuration; higher latency for cross-region writes |
| Cassandra (multi-DC) | High write availability; requires manual rack/DC topology configuration plus compaction and repair management |

Scenario 8: Event Sourcing and Audit Logs — Immutable Append-Only Write Patterns

The Situation

A financial services platform stores every state change as an immutable event — account debits, credits, transfers, and status changes. The audit log must be query-able by account ID and time range. Older events must never be modified. Regulators require 7-year retention.

Why This Pattern Suits DynamoDB Naturally

The event log workload is purely append-heavy with ordered reads. It never modifies existing records. The access pattern is simple: "give me all events for account X between date A and date B."

Design

```
PK = ACCOUNT#<account_id>     SK = EVENT#<unix_epoch_ms>#<event_id>
event_type = "CREDIT" | "DEBIT" | "TRANSFER"
amount = 250.00
balance_after = 1750.00
actor = "system" | "user_id"
TTL = not set    ← intentionally absent: 7-year retention for compliance
```

| Access Pattern | Operation | Cost |
|---|---|---|
| Append one event | `PutItem` with `attribute_not_exists(SK)` condition | 1 WCU |
| Load full account history | `Query(PK=ACCOUNT#id)`, paginated | RCUs proportional to the data read per page |
| Load events in a date range | `Query(PK=ACCOUNT#id, SK BETWEEN start AND end)` | Bounded read |
| Audit export to S3 | DynamoDB Streams → Lambda → S3 Parquet | Near-real-time export without a `Scan` |

Conditional Write for Idempotency

```python
import boto3
from decimal import Decimal

table = boto3.resource("dynamodb").Table("event_log")  # illustrative name

# The condition makes the append idempotent: a second write with the
# same PK/SK pair is rejected instead of overwriting the event.
table.put_item(
    Item={
        "PK": f"ACCOUNT#{account_id}",
        "SK": f"EVENT#{timestamp}#{event_id}",
        "event_type": "CREDIT",
        "amount": Decimal("250.00"),
        "actor": actor_id,
    },
    ConditionExpression="attribute_not_exists(SK)",
)
```

If the same event is retried (for example, a Lambda retry after a timeout), the condition fails with a ConditionalCheckFailedException. The writer catches and discards that error, so the duplicate is rejected and the event log stays clean.
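
One way to make that retry path explicit; `append_event` is a hypothetical wrapper around the `put_item` call above, and botocore surfaces the failure as a `ClientError`:

```python
from botocore.exceptions import ClientError

try:
    append_event(account_id, timestamp, event_id)  # hypothetical wrapper
except ClientError as err:
    if err.response["Error"]["Code"] != "ConditionalCheckFailedException":
        raise
    # Duplicate delivery: drop it; the original event is already stored.
```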

Comparison Summary

| Database | Append-Only Event Log at Scale |
|---|---|
| DynamoDB | Natural per-account partitions; sort-key-ordered events; conditional writes prevent duplicates |
| PostgreSQL | Works well at moderate scale; high-volume event inserts require a table partitioning and archiving strategy |
| Kafka (event store) | Excellent for event streaming and replay; less suited to per-entity query-by-account access without a secondary index |
| EventStoreDB | Purpose-built for event sourcing with better stream projection support, but more operational overhead than DynamoDB |

Scenario 9: Burst Traffic with Cold Start — Serverless-First Architecture

The Situation

A startup deploys an API using AWS Lambda + API Gateway. At zero traffic, no compute runs. During a viral social media moment, the API receives 50,000 requests per minute within 30 seconds of the post.

The Database Cold Start Problem

Most databases have a warm-up cost:

- Aurora Serverless v1: cold starts take 25–30 seconds to resume from a paused state
- RDS: always-on, but the connection pool must be pre-allocated
- ElastiCache: the cluster must be provisioned ahead of time
- Aurora Serverless v2: faster, but still needs connection management

Why DynamoDB Is the Natural Serverless Pair

| Property | Why It Matters for Lambda |
|---|---|
| No connection state | Lambda does not hold a database connection between invocations. DynamoDB requests are stateless HTTP calls — no connection warm-up |
| DAX client pool (if used) | A caveat: the DAX client keeps persistent TCP connections from inside a VPC, reintroducing the connection state that plain DynamoDB avoids |
| On-demand mode | No pre-warming; capacity materializes as traffic arrives |
| Zero idle cost | When Lambda is at zero invocations, DynamoDB costs nothing for reads and writes |
| RDS Proxy alternative | If you must use Aurora with Lambda, RDS Proxy manages the connection pool — but adds latency and cost that DynamoDB avoids entirely |
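
A minimal sketch of provisioning nothing up front: an on-demand table (illustrative name and key) accepts the burst with no capacity planning:

```python
import boto3

client = boto3.client("dynamodb")

# PAY_PER_REQUEST (on-demand) means no WCU/RCU to size or pre-warm;
# the table absorbs the viral burst as it arrives.
client.create_table(
    TableName="api_data",
    AttributeDefinitions=[{"AttributeName": "PK", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "PK", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",
)
```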

Comparison Summary

| Database | Serverless Lambda Cold Start |
|---|---|
| DynamoDB | Stateless HTTP; zero connection overhead; on-demand capacity; the natural pair |
| Aurora + RDS Proxy | Works, but adds proxy latency and cost; the proxy itself has a warm-up period |
| Aurora Serverless v1 | Cold start from a paused state was 25–30s — catastrophic for burst traffic |
| Redis (ElastiCache) | Requires a VPC and persistent TCP connections from Lambda; the idle cluster costs money even at zero traffic |

Scenario 10: Time-Based Session Memory with Sliding Window Summarization

The Situation

A conversational AI chatbot (like MangaAssist) maintains multi-turn context. Memory must be loaded on every message within a latency budget. Old turns must be summarized and compressed to avoid token overflow in LLM prompts. Sessions must expire automatically after inactivity.

The Memory Access Pattern

Every message in the chatbot triggers:

1. Load the last N turns (ordered by recency)
2. Load the latest summary item (if it exists)
3. Assemble context from turns + summary
4. Append the new user turn
5. Append the new assistant turn
6. Optionally write a new summary if the window is full

This is a purely key-based, ordered-range, append-heavy workload with no joins.

Schema Recap

```
PK = SESSION#<session_id>     SK = META           → session lifecycle state
PK = SESSION#<session_id>     SK = TURN#<epoch>   → one item per chat turn
PK = SESSION#<session_id>     SK = SUMMARY#<n>    → compressed window summary
GSI: PK = customer_id, SK = updated_at            → resume sessions by customer
```
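
A sketch of steps 1 and 2 of the access pattern above, assuming an illustrative `chat_memory` table using this schema:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("chat_memory")

def load_context(session_id, n_turns=10):
    pk = Key("PK").eq(f"SESSION#{session_id}")
    # Step 1: last N turns, newest first, one bounded round trip.
    turns = table.query(
        KeyConditionExpression=pk & Key("SK").begins_with("TURN#"),
        ScanIndexForward=False,
        Limit=n_turns,
    )["Items"]
    # Step 2: the latest summary item, if any.
    summaries = table.query(
        KeyConditionExpression=pk & Key("SK").begins_with("SUMMARY#"),
        ScanIndexForward=False,
        Limit=1,
    )["Items"]
    # Return chronological turns plus the summary for prompt assembly.
    return list(reversed(turns)), (summaries[0] if summaries else None)
```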

Why This Is Specifically Where DynamoDB Beats Every Alternative

| Access Pattern | DynamoDB | PostgreSQL | Redis |
|---|---|---|---|
| Load last 10 turns | `Query` reverse sort, `Limit=10` — one network round trip | `SELECT … ORDER BY created_at DESC LIMIT 10` — works, but slower at scale and under connection load | `LRANGE` on a list — fast, but no per-item TTL and a durability risk |
| Append one turn | `PutItem` — small item, 1 WCU | `INSERT` — adds a row, updates the index, writes WAL | `RPUSH` — fast, but weaker atomicity on crash |
| Update session metadata | `UpdateItem` — patches specific attributes | `UPDATE` — requires a row lock | `HSET` — fast; no durability guarantee |
| Natural per-item TTL | TTL attribute — no cleanup job | Requires a scheduled `DELETE WHERE` job or partition pruning | `EXPIRE` is per key; per-list-item TTL is awkward |
| Burst at 500K sessions | On-demand scales automatically | Connection pool saturates; needs read replicas | Single-shard OOM risk at 500K hot keys |

Summary: When to Choose DynamoDB Over Alternatives

| Scenario | Primary Reason to Choose DynamoDB |
|---|---|
| E-commerce flash sale | On-demand capacity absorbs instant spikes; no connection pool |
| Gaming leaderboards | Sort-key rank embedding; millisecond read latency at any scale |
| IoT sensor ingestion | High write throughput; per-device partitions; TTL for retention |
| Multi-tenant SaaS | Per-tenant partition isolation; pay-per-use; connectionless |
| Session state | Durable; TTL-native; no connection overhead; multi-AZ |
| Social feed fan-out | Per-user partitions; batch writes; TTL for rolling expiry |
| Global active-active | Global Tables; local reads under 10ms; built-in replication |
| Event sourcing / audit log | Append-only; conditional idempotency; ordered range queries |
| Serverless Lambda workloads | Stateless HTTP; no connection warm-up; zero idle cost |
| Conversational AI memory | Per-session partitions; ordered turns; summary items; TTL |

When NOT to Choose DynamoDB

| Situation | Better Choice | Why |
|---|---|---|
| Complex ad-hoc queries, reporting, analytics | PostgreSQL / Redshift | SQL joins, aggregations, GROUP BY, window functions |
| Multi-entity transactions with full ACID | PostgreSQL | True serializable transactions across arbitrary tables |
| Full-text search on document content | OpenSearch / Elasticsearch | Text tokenization, relevance scoring, faceted search |
| High-cardinality time-series aggregation | InfluxDB / TimescaleDB | Built-in rollup, downsampling, interpolation |
| Graph traversal (friends-of-friends) | Neptune / Neo4j | Efficient multi-hop traversal; DynamoDB has no native graph semantics |
| Schema evolves heavily and unpredictably | MongoDB | Flexible schema querying without key redesign |
| Small data, few users, complex queries | PostgreSQL | Operationally simple; no need for NoSQL overhead |

Architectural Principle: Fit the Access Pattern, Not the Data Shape

The single most important rule when evaluating DynamoDB for a scaling scenario:

> DynamoDB rewards workloads where you know your access patterns upfront. For every access pattern you can enumerate, DynamoDB delivers predictable latency and linear scale. For every access pattern you cannot enumerate, DynamoDB punishes you with Scans or impossible queries.

If your workload has three well-defined access patterns, DynamoDB is likely the right tool. If your product manager changes the analytics requirement every sprint, keep that layer in a relational or document store where ad-hoc queries are natural.