# DynamoDB in MangaAssist - Project Usage and Architecture Decision

## What DynamoDB Does in This Project
In MangaAssist, DynamoDB is the conversation-memory store. It keeps short-lived chat state so the orchestrator can handle multi-turn conversations reliably without turning the memory layer into a heavy relational system.
It is used for:
- Session metadata created during `POST /chat/init`
- Per-turn chat history for user and assistant messages
- Compressed summaries of older conversation windows
- Authenticated session lookup by `customer_id`
- Context loading for reconnect and human handoff
- Automatic expiry of memory after the retention window
It is not used for:
- Product catalog storage
- RAG vector retrieval
- Analytics warehousing
- Long-term user profile storage
## Concrete Schema Definition

Understanding the actual data shape makes every architectural discussion clearer.

### Primary Table: `manga-assist-sessions`
| Attribute | Type | Role | Example Value |
|---|---|---|---|
| `PK` | String | Partition key | `SESSION#a9f3c1...` |
| `SK` | String | Sort key | `META`, `TURN#1711300000000`, `SUMMARY#001` |
| `session_id` | String | Denormalized for ease | `a9f3c1...` |
| `customer_id` | String | GSI partition key | `CUST#u8k21` |
| `updated_at` | Number (epoch ms) | GSI sort key | `1711300000000` |
| `turn_index` | Number | Chat turn ordering | `7` |
| `role` | String | `user` or `assistant` | `assistant` |
| `content_compressed` | Binary | GZIP-compressed message text | — |
| `token_count` | Number | Tokens used in this turn | `312` |
| `intent` | String | Classified intent at this turn | `product_recommendation` |
| `page_context` | Map | Page URL, ASIN, referral source | `{"asin": "B09XY..."}` |
| `summary_text` | String | Compressed narrative of older turns | — |
| `window_start` | Number | First turn index in summary window | `1` |
| `window_end` | Number | Last turn index in summary window | `5` |
| `ttl` | Number | Unix epoch expiry for DynamoDB TTL | `1714000000` |
| `response_id` | String | Idempotency key for retry safety | `resp_...` |
### Global Secondary Index: `GSI1-customer-sessions`

| Attribute | Role |
|---|---|
| `GSI1PK = customer_id` | Find all sessions for an authenticated user |
| `GSI1SK = updated_at` | Sort by recency for resume flows |
Projection: `INCLUDE` only `session_id`, `updated_at`, and `turn_count`, and only `META` items carry the GSI key attributes, which keeps the index lean.
### Item Type Summary

- `SESSION#<id>` | `META` → session lifecycle state (one per session)
- `SESSION#<id>` | `TURN#<ts>` → one chat message (one per turn)
- `SESSION#<id>` | `SUMMARY#<n>` → compressed narrative of a window of turns
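To make the item shapes concrete, here is a minimal Python sketch of builders for the three item types. The helper names and the 72-hour TTL are illustrative assumptions, not code from the real system; the attribute names follow the schema table above.

```python
import time

def session_pk(session_id: str) -> str:
    """Partition key shared by every item in one session."""
    return f"SESSION#{session_id}"

def build_meta_item(session_id: str, customer_id: str) -> dict:
    now_ms = int(time.time() * 1000)
    return {
        "PK": session_pk(session_id),
        "SK": "META",
        "session_id": session_id,
        "customer_id": customer_id,   # also the GSI1 partition key
        "updated_at": now_ms,         # also the GSI1 sort key
        "turn_count": 0,
        "ttl": int(time.time()) + 72 * 3600,  # assumed 72-hour retention tier
    }

def build_turn_item(session_id: str, turn_index: int, role: str,
                    content_compressed: bytes, token_count: int) -> dict:
    now_ms = int(time.time() * 1000)
    return {
        "PK": session_pk(session_id),
        "SK": f"TURN#{now_ms}",       # timestamp sort key keeps turns ordered
        "turn_index": turn_index,
        "role": role,
        "content_compressed": content_compressed,
        "token_count": token_count,
    }

def build_summary_item(session_id: str, n: int, summary_text: str,
                       window_start: int, window_end: int) -> dict:
    return {
        "PK": session_pk(session_id),
        "SK": f"SUMMARY#{n:03d}",     # zero-padded so summaries sort in order
        "summary_text": summary_text,
        "window_start": window_start,
        "window_end": window_end,
    }
```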
Why this matters in an interview: Being able to draw this on a whiteboard immediately shows you went from concept to implementation. Most candidates describe DynamoDB at the concept level but cannot explain what the actual item looks like.
## Capacity Planning and Cost Estimation
Architect's perspective: Numbers ground a decision. When you say "DynamoDB scales for peak traffic," back it up with an actual capacity estimate.
### Traffic Assumptions
| Metric | Normal Load | Peak Load (Prime Day equivalent) |
|---|---|---|
| Concurrent sessions | 50,000 | 500,000 |
| Messages per session per minute | ~2 | ~4 |
| Average item size after GZIP | ~1 KB | ~1 KB |
| GSI reads (resume/handoff) | ~5% of sessions | ~10% of sessions |
### WCU and RCU Estimates
Writes per second at peak:
- Session init (META write): 500,000 / 1800 sec window ≈ 278 WPS
- Turn writes (2 per exchange: user + assistant): 500,000 × 4 msg/min / 60 ≈ 33,333 WPS
- Summary writes (roughly 1 per 5 turns): ≈ 6,666 WPS
- Total writes ≈ 40,000 WPS → ~40,000 WCU (on-demand handles this automatically)
Reads per second at peak:
- Context loads (once per turn, ~latest 10 turns + 1 summary): 33,333 RPS × 11 items ≈ 366,000 item reads per second
- Each TURN item is ~1 KB: 1 RCU strongly consistent, 0.5 RCU eventually consistent
- With eventual consistency (acceptable for most context reads): ~183,000 RCU
- Resume lookups via GSI: a small fraction, not the dominant cost
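The write and read arithmetic above can be sanity-checked in a few lines of Python. The inputs are the assumed traffic numbers from the table, nothing measured:

```python
# Back-of-envelope recomputation of the peak capacity estimates.
PEAK_SESSIONS = 500_000
MSGS_PER_SESSION_PER_MIN = 4
CONTEXT_ITEMS_PER_TURN = 11        # latest 10 turns + 1 summary
# Items are ~1 KB after GZIP, so 1 write = 1 WCU and
# 1 eventually consistent read = 0.5 RCU.

turn_writes_per_sec = PEAK_SESSIONS * MSGS_PER_SESSION_PER_MIN / 60
summary_writes_per_sec = turn_writes_per_sec / 5   # ~1 summary per 5 turns
total_wcu = turn_writes_per_sec + summary_writes_per_sec

item_reads_per_sec = turn_writes_per_sec * CONTEXT_ITEMS_PER_TURN
eventual_rcu = item_reads_per_sec * 0.5

print(round(turn_writes_per_sec))   # ≈ 33,333 turn writes/sec
print(round(total_wcu))             # ≈ 40,000 WCU
print(round(eventual_rcu))          # ≈ 183,333 RCU
```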
### Cost Ballpark (On-Demand, us-east-1, 2024 pricing)
| Resource | At Normal Load | At Peak Load |
|---|---|---|
| Write cost | ~2,000 writes/sec ≈ $9/hr | ~40,000 writes/sec ≈ $180/hr |
| Read cost (eventually consistent) | ~9,200 read units/sec ≈ $8/hr | ~183,000 read units/sec ≈ $165/hr |
| Storage (active sessions) | ~$0.25/GB-month; working set well under 1 GB | Same rate; roughly 10× the working set |
| Estimated monthly | ~$12,000 if normal load ran 24/7; real averages land lower | Dominated by short event spikes |

(Derived from the request rates above at ~$1.25 per million write units and ~$0.25 per million read units; treat these as order-of-magnitude figures only.)
Key insight: On-demand mode absorbs the burst but costs more per read/write than provisioned capacity. For predictable baseline traffic, consider provisioned mode with auto-scaling on top, reserving on-demand as a safety net during events.
### TTL Storage Control
Sessions expire within 24–72 hours depending on tier. At 50,000 concurrent sessions × average 2 KB total session data, the live working set is roughly 100 MB. Storage cost is negligible. Item tombstones created by TTL deletion consume no meaningful additional space.
## What We Store in DynamoDB
The design follows a session timeline model:
- `META` item: session-level state such as `session_id`, `customer_id`, `updated_at`, `turn_count`, and the latest `page_context`
- `TURN` items: one item per chat turn so writes stay small and ordered
- `SUMMARY` items: compressed memory blocks for older turns so prompt size stays under control
This works well because the main access pattern is simple:
- Find the session by `session_id`
- Read the latest turns in order
- Append the next turn
- Update summary or metadata when needed
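This access pattern maps directly onto a single Query. The sketch below builds low-level Query parameters for "latest N turns, newest first"; the table name matches the schema above, the function name is illustrative, and the actual boto3 call is left commented out so the logic stands alone:

```python
TABLE_NAME = "manga-assist-sessions"

def latest_turns_query(session_id: str, limit: int = 10) -> dict:
    """Build Query kwargs for 'newest turns first' within one session."""
    return {
        "TableName": TABLE_NAME,
        "KeyConditionExpression": "PK = :pk AND begins_with(SK, :turn)",
        "ExpressionAttributeValues": {
            ":pk": {"S": f"SESSION#{session_id}"},
            ":turn": {"S": "TURN#"},
        },
        "ScanIndexForward": False,  # descending sort key = newest first
        "Limit": limit,             # fetch only the recent-context window
    }

# Usage with the low-level client (not executed here):
# import boto3
# ddb = boto3.client("dynamodb")
# page = ddb.query(**latest_turns_query("a9f3c1", limit=10))
# turns = list(reversed(page["Items"]))  # chronological order for the prompt
```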
## Why DynamoDB Was a Better Fit for This Workload
The key point is that conversation memory is a high-scale, low-latency, append-heavy workload with simple access patterns. That is where DynamoDB is strong.
| Requirement | Why It Matters in MangaAssist | Why DynamoDB Fit Well |
|---|---|---|
| Low-latency reads and writes | Memory is on the critical chat path | Key-based reads and writes are fast and predictable |
| Burst handling | Traffic can jump from normal volume to Prime Day scale | DynamoDB scales better than connection-oriented databases for spiky traffic |
| Simple session access pattern | Most requests are "load latest turns" and "append next turn" | Partition key plus sort key maps naturally to session timelines |
| Built-in TTL | Chat memory is intentionally short-lived | No cleanup cron or delete job is required for normal expiry |
| Durability | Losing memory during failures hurts the experience | DynamoDB is safer than using cache-only memory |
| Managed operations | This system already has enough moving parts | No server management, sharding, vacuuming, or connection pool tuning |
| AWS-native integration | The rest of the system is on AWS | IAM, KMS, CloudWatch, Streams, DAX, and Global Tables fit cleanly |
| Multi-region path | Conversation continuity matters during failover | Global Tables are available if we need active-active later |
| Cost alignment | Session memory is short-lived and access-pattern driven | We pay for OLTP access, not relational features we do not need |
## Why DynamoDB Was Better Than Other Database Options

### 1. Better than Redis / ElastiCache as the primary store
Redis was attractive for raw speed, but it was a weaker source of truth for this specific problem.
- Redis is excellent for caching and extremely hot reads.
- Redis is weaker as the only durable system of record for customer chat memory.
- TTL exists in Redis, but durability and recovery tradeoffs are worse than DynamoDB for a compliance-sensitive workflow.
- We wanted memory to survive node loss, retries, and failover scenarios without treating snapshots as the main recovery plan.
Decision:
- Redis stayed in the architecture as a cache option.
- DynamoDB stayed the durable memory source of truth.
### 2. Better than Aurora / PostgreSQL for session memory
Aurora would work, but it would solve problems this workload does not have.
- Conversation memory does not require joins.
- Most operations are point lookups and ordered range reads inside one session.
- Expiry would need scheduled cleanup or partition management.
- Burst traffic would introduce more connection-management and write-scaling considerations.
- The schema would become more operationally heavy than the problem justifies.
Decision:
- Aurora is better for transactional business systems such as orders or payments.
- DynamoDB is better for session memory where the access pattern is known upfront.
### 3. Better than DocumentDB / Mongo-style transcript documents
Document stores look natural at first because a conversation looks like a JSON document. That becomes less attractive at scale.
- If the whole transcript is stored in one document, the document grows every turn.
- Large-document rewrites create more write amplification.
- Concurrent updates become more awkward during retries or streaming.
- The 400 KB DynamoDB item limit forced us toward a better per-turn design anyway.
- Ordered per-turn reads and summary windows are easier when turns are separate items.
Decision:
- We modeled the chat as a timeline of small items instead of one growing document.
- DynamoDB with `PK` and `SK` was cleaner than a session-document model.
### 4. Better than OpenSearch for operational memory
OpenSearch is already used in this system, but for RAG, not for chat state.
- OpenSearch is built for search and retrieval, not primary OLTP session memory.
- High write volume plus strict per-session read patterns would be more expensive and less predictable.
- It is the wrong abstraction for "load recent turns for one session and append one more item."
Decision:
- OpenSearch remains the knowledge retrieval store.
- DynamoDB remains the chat memory store.
## Scenarios We Evaluated Before Making the Decision
We compared the database options against the actual workflows the chatbot must support.
### Scenario 1: Session Initialization
Need:
- Create session state quickly
- Attach expiry immediately
- Support guest and authenticated users
Why DynamoDB won:
- One metadata write is enough
- TTL can be set at creation time
- No separate cleanup process is needed for the standard expiry path
### Scenario 2: Append One Turn per Message
Need:
- Avoid rewriting the full transcript every turn
- Keep writes small and idempotent
- Support retries without corrupting memory
Why DynamoDB won:
- One new `TURN` item can be appended per message
- Metadata can be updated independently
- Conditional writes can make retries safe
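A minimal sketch of what "conditional writes can make retries safe" might look like, assuming the low-level item format from the schema above (the helper name is hypothetical). Note that the sort key must be computed once per logical turn and reused across retries, so that the condition actually catches duplicates:

```python
def idempotent_turn_put(session_id: str, sort_key: str, response_id: str,
                        item_attrs: dict) -> dict:
    """Build PutItem kwargs that refuse to overwrite an existing turn."""
    item = {
        "PK": {"S": f"SESSION#{session_id}"},
        "SK": {"S": sort_key},          # same SK on every retry of this turn
        "response_id": {"S": response_id},
        **item_attrs,
    }
    return {
        "TableName": "manga-assist-sessions",
        "Item": item,
        # Fails with ConditionalCheckFailedException if an earlier attempt
        # already wrote this turn, so retries never duplicate a message.
        "ConditionExpression": (
            "attribute_not_exists(PK) AND attribute_not_exists(SK)"
        ),
    }

# ddb.put_item(**idempotent_turn_put(...)) — on retry, catch
# ConditionalCheckFailedException and treat it as success.
```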
### Scenario 3: Load Latest N Turns Fast
Need:
- Prompt assembly must stay inside latency budget
- Most requests only need recent context, not a full transcript scan
Why DynamoDB won:
- Query by partition key plus descending sort key is a direct fit
- We can fetch only the latest turns and latest summary
- The design avoids table scans and complex joins
### Scenario 4: Long Conversations
Need:
- Keep context useful without sending full transcripts to the model
- Avoid item-size growth and large rewrites
Why DynamoDB won:
- Separate `SUMMARY` items let us compress old windows
- Separate `TURN` items avoid transcript-size blowups
- The storage model stays stable even as conversations get longer
### Scenario 5: Authenticated Resume and Human Handoff
Need:
- Find recent sessions by customer
- Load enough context to resume or escalate
Why DynamoDB won:
- A GSI on `customer_id` plus `updated_at` supports recent-session lookup
- Ordered turn history fits escalation packaging well
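The resume lookup can be sketched as one GSI Query, newest session first. The index and attribute names follow the GSI definition earlier; the function name is illustrative:

```python
def recent_sessions_query(customer_id: str, limit: int = 5) -> dict:
    """Build Query kwargs for 'most recent sessions for one customer'."""
    return {
        "TableName": "manga-assist-sessions",
        "IndexName": "GSI1-customer-sessions",
        "KeyConditionExpression": "GSI1PK = :cust",
        "ExpressionAttributeValues": {":cust": {"S": customer_id}},
        "ScanIndexForward": False,  # GSI1SK = updated_at, newest first
        "Limit": limit,
    }

# page = ddb.query(**recent_sessions_query("CUST#u8k21"))
# The projected session_id / turn_count fields are enough to offer a
# "resume previous conversation" choice or package a handoff.
```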
### Scenario 6: Peak Traffic and Burstiness
Need:
- Handle about 50,000 concurrent sessions normally and up to about 500,000 at peak
- Avoid connection bottlenecks or manual shard rebalancing
Why DynamoDB won:
- It handles bursty request patterns better than many relational deployments
- Operational scaling is simpler than managing database connections during chat spikes
### Scenario 7: Expiry and Privacy Retention
Need:
- Memory should not stay forever
- Expiry should align with privacy-by-design goals
Why DynamoDB won:
- TTL provides built-in expiry semantics
- The application can also enforce expiry at read time before physical deletion completes
### Scenario 8: Multi-Region Growth Path
Need:
- Future regional failover should not require redesigning the memory layer
Why DynamoDB won:
- Global Tables provide a direct path if the chatbot later becomes active-active across regions
## Final Decision Summary
We chose DynamoDB because conversation memory in MangaAssist is not a relational reporting problem and not just a cache problem. It is a short-lived, append-heavy, high-concurrency session-state problem with simple but strict access patterns.
That made DynamoDB the best overall choice because it gave:
- Better durability than Redis as the primary store
- Less operational overhead than Aurora for this workload
- Cleaner per-turn modeling than a growing document database design
- Better OLTP behavior than search-oriented systems
- Built-in TTL, strong AWS integration, and a clean path to scale
## Failure Modes and Resilience
SRE's perspective: A design document is incomplete if it does not answer "what happens when this breaks?"
### Scenario: DynamoDB is temporarily unavailable (regional outage or throttle spike)
Impact: Context loads fail → the orchestrator cannot assemble a prompt with history.
Mitigation options ranked by cost and complexity:
- Graceful degradation to zero-turn context — the chat still works; users lose conversation continuity for the duration. Log a `memory_load_failure` metric and resume normally once DynamoDB recovers.
- Short-lived in-process cache — store the last assembled context in the Lambda/container's local memory for the duration of the request. Handles transient blips under 100 ms.
- ElastiCache read fallback — if a warm Redis cache of recent turns exists, fall back to it. The cache may be slightly stale but avoids a full context failure.
- Circuit breaker — after N consecutive failures, skip memory reads entirely and respond with a "starting fresh" UX notice. This prevents cascading timeouts on a degraded DynamoDB path from burning LLM quota.
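The circuit-breaker option can be sketched in a few lines; the thresholds and naming are illustrative, not taken from the real service:

```python
import time

class MemoryCircuitBreaker:
    """Skip DynamoDB reads for a cooldown after repeated load failures."""

    def __init__(self, failure_threshold: int = 5, cooldown_sec: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_sec = cooldown_sec
        self.consecutive_failures = 0
        self.opened_at = None  # None = circuit closed, reads allowed

    def allow_read(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_sec:
            self.opened_at = None  # half-open: try DynamoDB again
            return True
        return False  # open: serve the zero-turn "starting fresh" path

    def record_success(self) -> None:
        self.consecutive_failures = 0

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

The point of the breaker is to fail fast: a blocked read costs microseconds, while a timing-out DynamoDB call holds the request open and still spends LLM quota on a degraded answer.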
### Scenario: Write failures during message persistence
Impact: A turn is generated but not saved. On refresh or reconnect, the user sees the message disappear.
Mitigation: Use an async write-behind pattern with a retry queue (SQS). The response is streamed to the user immediately; persistence is retried at least once in the background. The `response_id` field keeps retries idempotent: if the original write actually succeeded, the retry is detected as a duplicate instead of creating a second entry.
### Scenario: Hot partition causing throttling on one session
Impact: One unusually active session (like a support agent cycling through the same test account) causes that partition to be throttled.
Mitigation: DynamoDB's adaptive capacity handles moderate hot spots automatically. For extreme cases, use exponential backoff with jitter in the SDK. Monitor `ConsumedWriteCapacityUnits` and `SystemErrors` per table in CloudWatch.
### Scenario: TTL deletion lag causing stale session reads
Impact: A session is past its expiry but the item is still physically present. A reconnect attempt succeeds when it should fail.
Mitigation: Always enforce expiry at read time in application logic by checking `ttl < now()`. Never rely solely on DynamoDB having already deleted the item; TTL deletion can lag expiry by hours.
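A minimal sketch of that read-time guard, assuming items carry the numeric `ttl` attribute from the schema (the function name is illustrative):

```python
import time

def is_live(item: dict, now=None) -> bool:
    """Treat an item as gone once its ttl has passed, even if the TTL
    sweeper has not physically deleted it yet."""
    now = time.time() if now is None else now
    ttl = item.get("ttl")
    if ttl is None:
        return True  # no expiry set on this item
    return float(ttl) > now

# session = load_meta(session_id)            # hypothetical loader
# if session is None or not is_live(session):
#     return None  # behave exactly as if DynamoDB had already deleted it
```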
## Security and Encryption
Security Engineer's perspective: Every service that stores user conversation data must have a clear security posture.
### Encryption at Rest
All DynamoDB tables use AWS-managed KMS keys (SSE-KMS) for encryption at rest. If your compliance posture requires customer-managed keys (CMK), configure a dedicated KMS key with a rotation schedule and restrict the key policy to specific IAM roles.
### Encryption in Transit
All communication with DynamoDB uses HTTPS/TLS via the AWS SDK. No plaintext traffic.
### IAM Access Control
The Lambda functions that read and write DynamoDB should have least-privilege IAM roles:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:UpdateItem",
        "dynamodb:Query",
        "dynamodb:DeleteItem"
      ],
      "Resource": [
        "arn:aws:dynamodb:us-east-1:ACCOUNT_ID:table/manga-assist-sessions",
        "arn:aws:dynamodb:us-east-1:ACCOUNT_ID:table/manga-assist-sessions/index/*"
      ]
    }
  ]
}
```
Do not grant `dynamodb:Scan` to the chat service. This makes accidental full-table scans a deployment-time error, not a runtime surprise.
### Data Classification
Chat content is a PII-adjacent workload. Even if message text is not strictly PII, it can contain personally identifying information in user utterances.
Recommended controls:
- GZIP-compress message content before writing (reduces storage and makes accidental log exposure less readable)
- Do not log full message content in application logs — use truncated previews or [REDACTED]
- Enforce TTL strictly — do not extend retention beyond the agreed privacy window without a logged business reason
- Enable DynamoDB Streams only if a downstream consumer exists — inactive streams still incur charges
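The first two controls can be sketched as small helpers (the names are illustrative; `content_compressed` matches the schema attribute above):

```python
import gzip

def compress_content(text: str) -> bytes:
    """Value stored in the content_compressed attribute."""
    return gzip.compress(text.encode("utf-8"))

def decompress_content(blob: bytes) -> str:
    return gzip.decompress(blob).decode("utf-8")

def log_preview(text: str, max_chars: int = 20) -> str:
    """Never log full utterances; keep only a short prefix for debugging."""
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + "...[REDACTED]"
```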
### Compliance Deletion
For GDPR or CCPA right-to-delete flows, TTL alone is insufficient. Maintain an explicit delete job that:
1. Queries all sessions by `customer_id` via the GSI
2. Issues `DeleteItem` calls for each session item and each turn
3. Records a deletion audit event (e.g., to CloudTrail or a dedicated audit table)
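A sketch of that delete job, with the DynamoDB client injected so the three steps are visible without AWS access. Pagination and error handling are elided, and all names are illustrative:

```python
def delete_customer_sessions(ddb, customer_id: str, audit) -> int:
    """Delete every item for every session owned by customer_id."""
    deleted = 0
    # Step 1: find the customer's sessions via the GSI.
    sessions = ddb.query(
        TableName="manga-assist-sessions",
        IndexName="GSI1-customer-sessions",
        KeyConditionExpression="GSI1PK = :c",
        ExpressionAttributeValues={":c": {"S": customer_id}},
    )["Items"]
    for meta in sessions:
        pk = {"S": f"SESSION#{meta['session_id']['S']}"}
        # Step 2: delete every item (META, TURN#*, SUMMARY#*) in the session.
        items = ddb.query(
            TableName="manga-assist-sessions",
            KeyConditionExpression="PK = :pk",
            ExpressionAttributeValues={":pk": pk},
        )["Items"]
        for item in items:
            ddb.delete_item(
                TableName="manga-assist-sessions",
                Key={"PK": item["PK"], "SK": item["SK"]},
            )
            deleted += 1
    # Step 3: record the deletion for compliance evidence.
    audit(f"deleted {deleted} items for {customer_id}")
    return deleted
```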
## Monitoring and Alerting
DevOps perspective: You cannot operate what you cannot observe.
### Key CloudWatch Metrics to Watch
| Metric | What It Signals | Alert Threshold |
|---|---|---|
| `SuccessfulRequestLatency` (P99) | Memory load tail latency affecting chat | > 50 ms sustained |
| `ConsumedWriteCapacityUnits` | Approaching provisioned limit | > 80% of provisioned |
| `SystemErrors` | DynamoDB-side errors (5xx) | Any nonzero sustained count |
| `UserErrors` | Application-side errors (4xx) | Spike above baseline |
| `ThrottledRequests` | Throughput limit hit | > 0 for > 60 seconds |
| `TimeToLiveDeletedItemCount` | TTL churn volume | Unusual drops may signal TTL misconfiguration |
### Recommended Dashboards
- Memory latency dashboard — P50/P95/P99 latency for context read operations, broken out by session age to detect degradation on older sessions
- Write throughput dashboard — WCU consumed vs. provisioned, GSI write amplification
- Error rate dashboard — `SystemErrors` and `UserErrors` with baseline comparison
- Session lifecycle dashboard — sessions created vs. sessions expired (via TTL stream event count) to verify TTL is functioning
### Distributed Tracing Integration
Wrap every DynamoDB call in an X-Ray subsegment. This makes it easy to see exactly how much of the total chat latency budget is consumed by memory operations. Tag segments with `session_id` and `item_type` (META / TURN / SUMMARY) to diagnose item-type-specific bottlenecks.
## Practical Decision Rule

For this project:

- Use DynamoDB for short-lived conversation state
- Use ElastiCache only as an optional accelerator or fallback read cache
- Use OpenSearch for retrieval, not session memory
- Use transactional systems like Aurora for order and payment workflows, not chat memory
- Add encryption, IAM scoping, and CloudWatch alerts from day one — not after an incident