
DynamoDB in MangaAssist - Project Usage and Architecture Decision

What DynamoDB Does in This Project

In MangaAssist, DynamoDB is the conversation-memory store. It keeps short-lived chat state so the orchestrator can handle multi-turn conversations reliably without turning the memory layer into a heavy relational system.

It is used for:

  • Session metadata created during POST /chat/init
  • Per-turn chat history for user and assistant messages
  • Compressed summaries of older conversation windows
  • Authenticated session lookup by customer_id
  • Context loading for reconnect and human handoff
  • Automatic expiry of memory after the retention window

It is not used for:

  • Product catalog storage
  • RAG vector retrieval
  • Analytics warehousing
  • Long-term user profile storage

Concrete Schema Definition

Understanding the actual data shape makes every architectural discussion clearer.

Primary Table: manga-assist-sessions

| Attribute | Type | Role | Example Value |
| --- | --- | --- | --- |
| PK | String | Partition key | SESSION#a9f3c1... |
| SK | String | Sort key | META, TURN#1711300000000, SUMMARY#001 |
| session_id | String | Denormalized for ease of access | a9f3c1... |
| customer_id | String | GSI partition key | CUST#u8k21 |
| updated_at | Number (epoch ms) | GSI sort key | 1711300000000 |
| turn_index | Number | Chat turn ordering | 7 |
| role | String | user or assistant | assistant |
| content_compressed | Binary | GZIP-compressed message text | |
| token_count | Number | Tokens used in this turn | 312 |
| intent | String | Classified intent at this turn | product_recommendation |
| page_context | Map | Page URL, ASIN, referral source | {"asin": "B09XY..."} |
| summary_text | String | Compressed narrative of older turns | |
| window_start | Number | First turn index in summary window | 1 |
| window_end | Number | Last turn index in summary window | 5 |
| ttl | Number | Unix epoch expiry for DynamoDB TTL | 1714000000 |
| response_id | String | Idempotency key for retry safety | resp_... |

Global Secondary Index: GSI1-customer-sessions

| Attribute | Role |
| --- | --- |
| GSI1PK = customer_id | Find all sessions for an authenticated user |
| GSI1SK = updated_at | Sort by recency for resume flows |

Projection: INCLUDE only session_id, updated_at, and turn_count to keep the index lean. Only META items carry the GSI key attributes, so the index is sparse: one entry per session rather than one per turn.

Item Type Summary

SESSION#<id> | META         → Session lifecycle state (one per session)  
SESSION#<id> | TURN#<ts>    → One chat message (one per turn)  
SESSION#<id> | SUMMARY#<n>  → Compressed narrative of a window of turns  
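
The three item types above map directly to small key-builder helpers. A minimal sketch in Python; the key formats follow the schema table, while the function names themselves are illustrative:

```python
def meta_key(session_id: str) -> dict:
    """Key for the one-per-session lifecycle item."""
    return {"PK": f"SESSION#{session_id}", "SK": "META"}

def turn_key(session_id: str, ts_ms: int) -> dict:
    """Key for a single chat turn, ordered by its epoch-millisecond timestamp."""
    return {"PK": f"SESSION#{session_id}", "SK": f"TURN#{ts_ms}"}

def summary_key(session_id: str, window_n: int) -> dict:
    """Key for a compressed summary window; zero-padded so windows sort lexically."""
    return {"PK": f"SESSION#{session_id}", "SK": f"SUMMARY#{window_n:03d}"}
```

Because all three share the same partition key, one Query on `PK = SESSION#<id>` can retrieve any mix of them in sort-key order.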

Why this matters in an interview: Being able to draw this on a whiteboard immediately shows you went from concept to implementation. Most candidates describe DynamoDB at the concept level but cannot explain what the actual item looks like.

Capacity Planning and Cost Estimation

Architect's perspective: Numbers ground a decision. When you say "DynamoDB scales for peak traffic," back it up with an actual capacity estimate.

Traffic Assumptions

| Metric | Normal Load | Peak Load (Prime Day equivalent) |
| --- | --- | --- |
| Concurrent sessions | 50,000 | 500,000 |
| Messages per session per minute | ~2 | ~4 |
| Average item size after GZIP | ~1 KB | ~1 KB |
| GSI reads (resume/handoff) | ~5% of sessions | ~10% of sessions |

WCU and RCU Estimates

Writes per second at peak:

  • Session init (META write): 500,000 sessions over a ~1,800-second ramp window ≈ 278 WPS
  • Turn writes (2 per message exchange: user + assistant): 500,000 × 4 msg/min ÷ 60 ≈ 33,333 WPS
  • Summary writes (roughly 1 per 5 turns): ≈ 6,666 WPS
  • Total writes ≈ 40,000 WPS → ~40,000 WCU (on-demand handles this automatically)

Reads per second at peak:

  • Context loads (once per turn, ~latest 10 turns + 1 summary): 33,333 turns/sec × 11 items ≈ 366,000 item reads/sec
  • Each TURN item is ~1 KB, costing 1 RCU strongly consistent or 0.5 RCU eventually consistent
  • With eventual consistency (acceptable for most context reads): ~183,000 RCU
  • Resume lookups via the GSI are a small fraction, not the dominant cost
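
These estimates can be reproduced with a few lines of arithmetic, which also makes them easy to re-run when assumptions change. A sketch; every input is an assumption taken from the tables above:

```python
# Peak-load assumptions from the traffic table above.
PEAK_SESSIONS = 500_000
MSGS_PER_SESSION_PER_MIN = 4
INIT_WINDOW_SEC = 1_800           # sessions ramping in over ~30 minutes
CONTEXT_ITEMS_PER_TURN = 11       # latest 10 turns + 1 summary

msg_per_sec = PEAK_SESSIONS * MSGS_PER_SESSION_PER_MIN / 60   # ≈ 33,333 WPS
init_wps = PEAK_SESSIONS / INIT_WINDOW_SEC                    # ≈ 278 WPS
summary_wps = msg_per_sec / 5                                 # ≈ 6,667 WPS
total_wps = msg_per_sec + init_wps + summary_wps              # ≈ 40,000 WPS

item_reads_per_sec = msg_per_sec * CONTEXT_ITEMS_PER_TURN     # ≈ 366,667 reads/sec
rcu_eventual = item_reads_per_sec * 0.5                       # 1 KB items → 0.5 RCU each
```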

Cost Ballpark (On-Demand, us-east-1, 2024 pricing)

| Resource | At Normal Load | At Peak Load |
| --- | --- | --- |
| Write cost | ~$1.25/hr | ~$50/hr |
| Read cost | ~$0.09/hr | ~$9/hr |
| Storage (active sessions) | ~$0.25/GB/month | ~$2.50/GB/month |
| Estimated monthly | ~$1,000–$2,000 | Spikes during events |

Key insight: On-demand mode absorbs the burst but costs more per read/write than provisioned capacity. For predictable baseline traffic, consider provisioned mode with auto-scaling on top, reserving on-demand as a safety net during events.

TTL Storage Control

Sessions expire within 24–72 hours depending on tier. At 50,000 concurrent sessions × average 2 KB total session data, the live working set is roughly 100 MB. Storage cost is negligible. Item tombstones created by TTL deletion consume no meaningful additional space.


What We Store in DynamoDB

The design follows a session timeline model:

  • META item: session-level state such as session_id, customer_id, updated_at, turn_count, and latest page_context
  • TURN items: one item per chat turn so writes stay small and ordered
  • SUMMARY items: compressed memory blocks for older turns so prompt size stays under control

This works well because the main access pattern is simple:

  1. Find the session by session_id
  2. Read the latest turns in order
  3. Append the next turn
  4. Update summary or metadata when needed
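
Step 2 of the access pattern ends in a prompt-assembly function. A minimal sketch, assuming the items were already fetched; field names follow the schema table, except `content`, which stands in for the decompressed message text:

```python
from typing import Optional

def assemble_context(summary: Optional[dict], turns: list[dict]) -> list[dict]:
    """Build the ordered message list for prompt assembly.

    `turns` arrive newest-first from a descending sort-key query; the model
    wants them oldest-first, with the summary of older windows up front.
    """
    messages = []
    if summary is not None:
        messages.append({"role": "system", "content": summary["summary_text"]})
    for turn in sorted(turns, key=lambda t: t["turn_index"]):
        messages.append({"role": turn["role"], "content": turn["content"]})
    return messages
```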

Why DynamoDB Was a Better Fit for This Workload

The key point is that conversation memory is a high-scale, low-latency, append-heavy workload with simple access patterns. That is where DynamoDB is strong.

| Requirement | Why It Matters in MangaAssist | Why DynamoDB Fit Well |
| --- | --- | --- |
| Low-latency reads and writes | Memory is on the critical chat path | Key-based reads and writes are fast and predictable |
| Burst handling | Traffic can jump from normal volume to Prime Day scale | DynamoDB scales better than connection-oriented databases for spiky traffic |
| Simple session access pattern | Most requests are "load latest turns" and "append next turn" | Partition key plus sort key maps naturally to session timelines |
| Built-in TTL | Chat memory is intentionally short-lived | No cleanup cron or delete job is required for normal expiry |
| Durability | Losing memory during failures hurts the experience | DynamoDB is safer than cache-only memory |
| Managed operations | This system already has enough moving parts | No server management, sharding, vacuuming, or connection pool tuning |
| AWS-native integration | The rest of the system is on AWS | IAM, KMS, CloudWatch, Streams, DAX, and Global Tables fit cleanly |
| Multi-region path | Conversation continuity matters during failover | Global Tables are available if we need active-active later |
| Cost alignment | Session memory is short-lived and access-pattern driven | We pay for OLTP access, not relational features we do not need |

Why DynamoDB Was Better Than Other Database Options

1. Better than Redis / ElastiCache as the primary store

Redis was attractive for raw speed, but it was a weaker source of truth for this specific problem.

  • Redis is excellent for caching and extremely hot reads.
  • Redis is weaker as the only durable system of record for customer chat memory.
  • TTL exists in Redis, but durability and recovery tradeoffs are worse than DynamoDB for a compliance-sensitive workflow.
  • We wanted memory to survive node loss, retries, and failover scenarios without treating snapshots as the main recovery plan.

Decision:

  • Redis stayed in the architecture as a cache option.
  • DynamoDB stayed the durable memory source of truth.

2. Better than Aurora / PostgreSQL for session memory

Aurora would work, but it would solve problems this workload does not have.

  • Conversation memory does not require joins.
  • Most operations are point lookups and ordered range reads inside one session.
  • Expiry would need scheduled cleanup or partition management.
  • Burst traffic would introduce more connection-management and write-scaling considerations.
  • The schema would become more operationally heavy than the problem justifies.

Decision:

  • Aurora is better for transactional business systems such as orders or payments.
  • DynamoDB is better for session memory where the access pattern is known upfront.

3. Better than DocumentDB / Mongo-style transcript documents

Document stores look natural at first because a conversation looks like a JSON document. That becomes less attractive at scale.

  • If the whole transcript is stored in one document, the document grows every turn.
  • Large-document rewrites create more write amplification.
  • Concurrent updates become more awkward during retries or streaming.
  • The 400 KB DynamoDB item limit forced us toward a better per-turn design anyway.
  • Ordered per-turn reads and summary windows are easier when turns are separate items.

Decision:

  • We modeled the chat as a timeline of small items instead of one growing document.
  • DynamoDB with pk and sk was cleaner than a session-document model.

4. Better than OpenSearch for operational memory

OpenSearch is already used in this system, but for RAG, not for chat state.

  • OpenSearch is built for search and retrieval, not primary OLTP session memory.
  • High write volume plus strict per-session read patterns would be more expensive and less predictable.
  • It is the wrong abstraction for "load recent turns for one session and append one more item."

Decision:

  • OpenSearch remains the knowledge retrieval store.
  • DynamoDB remains the chat memory store.

Scenarios We Evaluated Before Taking the Decision

We compared the database options against the actual workflows the chatbot must support.

Scenario 1: Session Initialization

Need:

  • Create session state quickly
  • Attach expiry immediately
  • Support guest and authenticated users

Why DynamoDB won:

  • One metadata write is enough
  • TTL can be set at creation time
  • No separate cleanup process is needed for the standard expiry path

Scenario 2: Append One Turn per Message

Need:

  • Avoid rewriting the full transcript every turn
  • Keep writes small and idempotent
  • Support retries without corrupting memory

Why DynamoDB won:

  • One new TURN item can be appended per message
  • Metadata can be updated independently
  • Conditional writes can make retries safe
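
The conditional-write point can be made concrete. A sketch of the parameters for an idempotent TURN append, assuming the boto3 Table (resource) API; the helper name is illustrative:

```python
def put_turn_params(session_id: str, ts_ms: int, attrs: dict) -> dict:
    """Parameters for an idempotent TURN append: table.put_item(**params).

    The condition rejects a write if an item with this key already exists,
    so a retried write of the same turn raises ConditionalCheckFailedException
    instead of silently overwriting the stored message. The caller treats
    that exception as success.
    """
    return {
        "Item": {"PK": f"SESSION#{session_id}", "SK": f"TURN#{ts_ms}", **attrs},
        "ConditionExpression": "attribute_not_exists(PK) AND attribute_not_exists(SK)",
    }
```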

Scenario 3: Load Latest N Turns Fast

Need:

  • Prompt assembly must stay inside latency budget
  • Most requests only need recent context, not a full transcript scan

Why DynamoDB won:

  • Query by partition key plus descending sort key is a direct fit
  • We can fetch only the latest turns and latest summary
  • The design avoids table scans and complex joins
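
The query shape is worth writing out. A sketch of the low-level client parameters for "load the latest N turns", assuming the table from the schema section:

```python
def latest_turns_query(session_id: str, limit: int = 10) -> dict:
    """Query parameters for the latest `limit` turns of one session
    (pass to a boto3 DynamoDB client's query()).

    ScanIndexForward=False walks the sort key descending, so the newest
    turns come back first and Limit stops the read early -- no scan,
    no pagination in the common case.
    """
    return {
        "TableName": "manga-assist-sessions",
        "KeyConditionExpression": "PK = :pk AND begins_with(SK, :turn)",
        "ExpressionAttributeValues": {
            ":pk": {"S": f"SESSION#{session_id}"},
            ":turn": {"S": "TURN#"},
        },
        "ScanIndexForward": False,   # descending: newest first
        "Limit": limit,
    }
```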

Scenario 4: Long Conversations

Need:

  • Keep context useful without sending full transcripts to the model
  • Avoid item-size growth and large rewrites

Why DynamoDB won:

  • Separate SUMMARY items let us compress old windows
  • Separate TURN items avoid transcript-size blowups
  • The storage model stays stable even as conversations get longer
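
The SUMMARY item shape can also be sketched. The field names follow the schema table; the summarization step itself (an LLM call in practice) is assumed to have produced `summary_text` already:

```python
def build_summary_item(session_id: str, window_n: int,
                       turns: list[dict], summary_text: str) -> dict:
    """Build a SUMMARY item covering one window of turns.

    window_start / window_end record which turn range this summary
    replaces, so context assembly can skip those TURN items.
    """
    indices = sorted(t["turn_index"] for t in turns)
    return {
        "PK": f"SESSION#{session_id}",
        "SK": f"SUMMARY#{window_n:03d}",
        "window_start": indices[0],
        "window_end": indices[-1],
        "summary_text": summary_text,
    }
```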

Scenario 5: Authenticated Resume and Human Handoff

Need:

  • Find recent sessions by customer
  • Load enough context to resume or escalate

Why DynamoDB won:

  • A GSI on customer_id plus updated_at supports recent-session lookup
  • Ordered turn history fits escalation packaging well

Scenario 6: Peak Traffic and Burstiness

Need:

  • Handle about 50,000 concurrent sessions normally and up to about 500,000 at peak
  • Avoid connection bottlenecks or manual shard rebalancing

Why DynamoDB won:

  • It handles bursty request patterns better than many relational deployments
  • Operational scaling is simpler than managing database connections during chat spikes

Scenario 7: Expiry and Privacy Retention

Need:

  • Memory should not stay forever
  • Expiry should align with privacy-by-design goals

Why DynamoDB won:

  • TTL provides built-in expiry semantics
  • The application can also enforce expiry at read time before physical deletion completes

Scenario 8: Multi-Region Growth Path

Need:

  • Future regional failover should not require redesigning the memory layer

Why DynamoDB won:

  • Global Tables provide a direct path if the chatbot later becomes active-active across regions

Final Decision Summary

We chose DynamoDB because conversation memory in MangaAssist is not a relational reporting problem and not just a cache problem. It is a short-lived, append-heavy, high-concurrency session-state problem with simple but strict access patterns.

That made DynamoDB the best overall choice because it gave:

  • Better durability than Redis as the primary store
  • Less operational overhead than Aurora for this workload
  • Cleaner per-turn modeling than a growing document database design
  • Better OLTP behavior than search-oriented systems
  • Built-in TTL, strong AWS integration, and a clean path to scale

Failure Modes and Resilience

SRE's perspective: A design document is incomplete if it does not answer "what happens when this breaks?"

Scenario: DynamoDB is temporarily unavailable (Regional outage or throttle spike)

Impact: Context loads fail → the orchestrator cannot assemble a prompt with history.

Mitigation options ranked by cost and complexity:

  1. Graceful degradation to zero-turn context — The chat still works; users lose conversation continuity for the duration. Log a memory_load_failure metric and resume normally once DynamoDB recovers.
  2. Short-lived in-process cache — Store the last assembled context in the Lambda/container's local memory for the duration of the request. Handles transient blips under 100 ms.
  3. ElastiCache read fallback — If a warm Redis cache of recent turns exists, fall back to it. The cache may be slightly stale but avoids a full context failure.
  4. Circuit breaker — After N consecutive failures, skip memory reads entirely and respond with a "starting fresh" UX notice. Prevents cascading timeouts from a degraded DynamoDB path from burning LLM quota.
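
Option 4 can be sketched as a minimal in-process circuit breaker; the threshold and cool-down values are illustrative, and the clock is injectable for testing:

```python
import time

class MemoryCircuitBreaker:
    """Skip DynamoDB context reads after repeated failures; retry after a cool-down."""

    def __init__(self, threshold: int = 5, cooldown_sec: float = 30.0,
                 clock=time.monotonic):
        self.threshold = threshold
        self.cooldown_sec = cooldown_sec
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow_read(self) -> bool:
        if self.opened_at is None:
            return True                      # closed: normal operation
        if self.clock() - self.opened_at >= self.cooldown_sec:
            self.opened_at = None            # half-open: let one attempt through
            self.failures = 0
            return True
        return False                         # open: serve "starting fresh" context

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None
```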

Scenario: Write failures during message persistence

Impact: A turn is generated but not saved. On refresh or reconnect, the user sees the message disappear.

Mitigation: Use an async write-behind pattern with a retry queue (SQS). The response is streamed immediately; persistence is retried at least once. The response_id field prevents duplicate entries if the retry succeeds after the original write also went through.
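
A sketch of the retry-queue enqueue step; the queue URL is hypothetical, and the message carries response_id so the consumer's conditional write stays idempotent even if the original write eventually succeeded too:

```python
import json

# Hypothetical queue URL for illustration only.
RETRY_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/ACCOUNT_ID/manga-assist-persist-retry"

def persist_retry_message(session_id: str, ts_ms: int,
                          response_id: str, attrs: dict) -> dict:
    """Parameters for sqs.send_message() to enqueue a failed TURN write."""
    return {
        "QueueUrl": RETRY_QUEUE_URL,
        "MessageBody": json.dumps({
            "session_id": session_id,
            "ts_ms": ts_ms,
            "response_id": response_id,   # idempotency key from the schema
            "attrs": attrs,
        }),
    }
```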

Scenario: Hot partition causing throttling on one session

Impact: One unusually active session (like a support agent cycling through the same test account) causes that partition to be throttled.

Mitigation: DynamoDB's adaptive capacity handles moderate hot spots automatically. For extreme cases, use exponential backoff with jitter in the SDK. Monitor ConsumedWriteCapacityUnits and SystemErrors per-table in CloudWatch.
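
Backoff with jitter is easy to get subtly wrong; a sketch of the "full jitter" variant (the one AWS's architecture blog recommends), with illustrative base and cap values:

```python
import random

def backoff_delay(attempt: int, base: float = 0.05, cap: float = 2.0,
                  rng=random.random) -> float:
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2^attempt)].

    The jitter spreads retries out in time, so a throttled hot partition
    is not hit by a synchronized wave of retries.
    """
    return rng() * min(cap, base * (2 ** attempt))
```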

Scenario: TTL deletion lag causing stale session reads

Impact: A session is past its expiry but the item is still physically present. A reconnect attempt succeeds when it should fail.

Mitigation: Always enforce expiry at read time in application logic by checking ttl < now(). Never rely solely on DynamoDB to have deleted the item yet.
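
The read-time guard is a one-liner worth making explicit. A sketch, with the ttl attribute in epoch seconds as in the schema:

```python
import time

def is_expired(item: dict, now=None) -> bool:
    """Treat an item as gone once its ttl has passed, even though DynamoDB's
    background TTL deletion may not have physically removed it yet."""
    now = time.time() if now is None else now
    return item.get("ttl", float("inf")) <= now
```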


Security and Encryption

Security Engineer's perspective: Every service that stores user conversation data must have a clear security posture.

Encryption at Rest

All DynamoDB tables use AWS-managed KMS keys (SSE-KMS) for encryption at rest. If your compliance posture requires customer-managed keys (CMK), configure a dedicated KMS key with a rotation schedule and restrict the key policy to specific IAM roles.

Encryption in Transit

All communication with DynamoDB uses HTTPS/TLS via the AWS SDK. No plaintext traffic.

IAM Access Control

The Lambda functions that read and write DynamoDB should have least-privilege IAM roles:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:UpdateItem",
        "dynamodb:Query",
        "dynamodb:DeleteItem"
      ],
      "Resource": [
        "arn:aws:dynamodb:us-east-1:ACCOUNT_ID:table/manga-assist-sessions",
        "arn:aws:dynamodb:us-east-1:ACCOUNT_ID:table/manga-assist-sessions/index/*"
      ]
    }
  ]
}

Do not grant dynamodb:Scan to the chat service. This makes accidental full-table scans a deployment-time error, not a runtime surprise.

Data Classification

Chat content is a PII-adjacent workload. Even if message text is not strictly PII, it can contain personally identifying information in user utterances.

Recommended controls:

  • GZIP-compress message content before writing (reduces storage and makes accidental log exposure less readable)
  • Do not log full message content in application logs — use truncated previews or [REDACTED]
  • Enforce TTL strictly — do not extend retention beyond the agreed privacy window without a logged business reason
  • Enable DynamoDB Streams only if a downstream consumer exists — inactive streams still incur charges
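
The compression and log-preview controls fit in a few lines. A sketch; the preview length is illustrative, and the compressed bytes go into the content_compressed (Binary) attribute:

```python
import gzip

def compress_content(text: str) -> bytes:
    """GZIP message text before writing it to content_compressed."""
    return gzip.compress(text.encode("utf-8"))

def decompress_content(blob: bytes) -> str:
    return gzip.decompress(blob).decode("utf-8")

def log_safe(text: str, max_chars: int = 20) -> str:
    """Truncated preview for application logs -- never the full utterance."""
    return text[:max_chars] + ("...[truncated]" if len(text) > max_chars else "")
```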

Compliance Deletion

For GDPR or CCPA right-to-delete flows, TTL alone is insufficient. Maintain an explicit delete job that:

  1. Queries all sessions by customer_id via the GSI
  2. Issues DeleteItem calls for each session and each turn
  3. Records a deletion audit event (e.g., to CloudTrail or a dedicated audit table)
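
Steps 1 and 2 of the delete job can be sketched as pure request builders, assuming the low-level client API; in practice the GSI query paginates and the deletes are batched:

```python
def customer_sessions_query(customer_id: str) -> dict:
    """GSI query parameters for every session belonging to one customer (step 1)."""
    return {
        "TableName": "manga-assist-sessions",
        "IndexName": "GSI1-customer-sessions",
        "KeyConditionExpression": "GSI1PK = :cid",
        "ExpressionAttributeValues": {":cid": {"S": customer_id}},
    }

def delete_item_request(pk: str, sk: str) -> dict:
    """DeleteItem parameters for one META, TURN, or SUMMARY item (step 2)."""
    return {
        "TableName": "manga-assist-sessions",
        "Key": {"PK": {"S": pk}, "SK": {"S": sk}},
    }
```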


Monitoring and Alerting

DevOps perspective: You cannot operate what you cannot observe.

Key CloudWatch Metrics to Watch

| Metric | What It Signals | Alert Threshold |
| --- | --- | --- |
| SuccessfulRequestLatency (P99) | Memory load tail latency affecting chat | > 50 ms sustained |
| ConsumedWriteCapacityUnits | Approaching provisioned limit | > 80% of provisioned |
| SystemErrors | DynamoDB-side errors (5xx) | Any nonzero sustained count |
| UserErrors | Application-side errors (4xx) | Spike above baseline |
| ThrottledRequests | Throughput limit hit | > 0 for > 60 seconds |
| TimeToLiveDeletedItemCount | TTL churn volume | Unusual drops may signal TTL misconfiguration |

Recommended Dashboards

  1. Memory latency dashboard — P50/P95/P99 latency for context read operations, broken out by session_id age to detect degradation on older sessions
  2. Write throughput dashboard — WCU consumed vs. provisioned, GSI write amplification
  3. Error rate dashboard — SystemErrors and UserErrors with baseline comparison
  4. Session lifecycle dashboard — Sessions created vs. sessions expired (via TTL stream event count) to verify TTL is functioning

Distributed Tracing Integration

Wrap every DynamoDB call in an X-Ray segment. This makes it easy to see exactly how much of the total chat latency budget is consumed by memory operations. Tag segments with session_id and item_type (META / TURN / SUMMARY) to diagnose specific item-type bottlenecks.


Practical Decision Rule

For this project:

  • Use DynamoDB for short-lived conversation state
  • Use ElastiCache only as an optional accelerator or fallback read cache
  • Use OpenSearch for retrieval, not session memory
  • Use transactional systems like Aurora for order and payment workflows, not chat memory
  • Add encryption, IAM scoping, and CloudWatch alerts from day one — not after an incident