# DynamoDB in MangaAssist - Project Usage and Architecture Decision

## What DynamoDB Does in This Project
In MangaAssist, DynamoDB is the conversation-memory store. It keeps short-lived chat state so the orchestrator can handle multi-turn conversations reliably without turning the memory layer into a heavy relational system.
It is used for:
- Session metadata created during `POST /chat/init`
- Per-turn chat history for user and assistant messages
- Compressed summaries of older conversation windows
- Authenticated session lookup by `customer_id`
- Context loading for reconnect and human handoff
- Automatic expiry of memory after the retention window
It is not used for:
- Product catalog storage
- RAG vector retrieval
- Analytics warehousing
- Long-term user profile storage
## Concrete Schema Definition

Understanding the actual data shape makes every architectural discussion clearer.

### Primary Table: `manga-assist-sessions`
| Attribute | Type | Role | Example Value |
|---|---|---|---|
| `PK` | String | Partition key | `SESSION#a9f3c1...` |
| `SK` | String | Sort key | `META`, `TURN#1711300000000`, `SUMMARY#001` |
| `session_id` | String | Denormalized for ease | `a9f3c1...` |
| `customer_id` | String | GSI partition key | `CUST#u8k21` |
| `updated_at` | Number (epoch ms) | GSI sort key | `1711300000000` |
| `turn_index` | Number | Chat turn ordering | `7` |
| `role` | String | `user` or `assistant` | `assistant` |
| `content_compressed` | Binary | GZIP-compressed message text | — |
| `token_count` | Number | Tokens used in this turn | `312` |
| `intent` | String | Classified intent at this turn | `product_recommendation` |
| `page_context` | Map | Page URL, ASIN, referral source | `{"asin": "B09XY..."}` |
| `summary_text` | String | Compressed narrative of older turns | — |
| `window_start` | Number | First turn index in summary window | `1` |
| `window_end` | Number | Last turn index in summary window | `5` |
| `ttl` | Number | Unix epoch expiry for DynamoDB TTL | `1714000000` |
| `response_id` | String | Idempotency key for retry safety | `resp_...` |
### Global Secondary Index: `GSI1-customer-sessions`

| Attribute | Role |
|---|---|
| `GSI1PK = customer_id` | Find all sessions for an authenticated user |
| `GSI1SK = updated_at` | Sort by recency for resume flows |
Projection: `INCLUDE` only `session_id`, `updated_at`, and `turn_count`, and only `META` items carry the GSI key attributes, which keeps the index lean.
### Item Type Summary

- `SESSION#<id>` | `META` → session lifecycle state (one per session)
- `SESSION#<id>` | `TURN#<ts>` → one chat message (one per turn)
- `SESSION#<id>` | `SUMMARY#<n>` → compressed narrative of a window of turns
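To make the item shapes concrete, here is a minimal Python sketch of builders for the three item types. The helper names and the 72-hour TTL are illustrative assumptions, not code from the real system; the attribute names follow the schema table above.

```python
import time

def session_pk(session_id: str) -> str:
    """Partition key shared by every item in one session."""
    return f"SESSION#{session_id}"

def build_meta_item(session_id: str, customer_id: str) -> dict:
    now_ms = int(time.time() * 1000)
    return {
        "PK": session_pk(session_id),
        "SK": "META",
        "session_id": session_id,
        "customer_id": customer_id,   # also the GSI1 partition key
        "updated_at": now_ms,         # also the GSI1 sort key
        "turn_count": 0,
        "ttl": int(time.time()) + 72 * 3600,  # assumed 72-hour retention tier
    }

def build_turn_item(session_id: str, turn_index: int, role: str,
                    content_compressed: bytes, token_count: int) -> dict:
    now_ms = int(time.time() * 1000)
    return {
        "PK": session_pk(session_id),
        "SK": f"TURN#{now_ms}",       # timestamp sort key keeps turns ordered
        "turn_index": turn_index,
        "role": role,
        "content_compressed": content_compressed,
        "token_count": token_count,
    }

def build_summary_item(session_id: str, n: int, summary_text: str,
                       window_start: int, window_end: int) -> dict:
    return {
        "PK": session_pk(session_id),
        "SK": f"SUMMARY#{n:03d}",     # zero-padded so summaries sort in order
        "summary_text": summary_text,
        "window_start": window_start,
        "window_end": window_end,
    }
```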
Why this matters in an interview: Being able to draw this on a whiteboard immediately shows you went from concept to implementation. Most candidates describe DynamoDB at the concept level but cannot explain what the actual item looks like.
## Capacity Planning and Cost Estimation
Architect's perspective: Numbers ground a decision. When you say "DynamoDB scales for peak traffic," back it up with an actual capacity estimate.
### Traffic Assumptions
| Metric | Normal Load | Peak Load (Prime Day equivalent) |
|---|---|---|
| Concurrent sessions | 50,000 | 500,000 |
| Messages per session per minute | ~2 | ~4 |
| Average item size after GZIP | ~1 KB | ~1 KB |
| GSI reads (resume/handoff) | ~5% of sessions | ~10% of sessions |
### WCU and RCU Estimates
Writes per second at peak:
- Session init (META write): 500,000 / 1800 sec window ≈ 278 WPS
- Turn writes (2 per exchange: user + assistant): 500,000 × 4 msg/min / 60 ≈ 33,333 WPS
- Summary writes (roughly 1 per 5 turns): ≈ 6,666 WPS
- Total writes ≈ 40,000 WPS → ~40,000 WCU (on-demand handles this automatically)
Reads per second at peak:
- Context loads (once per turn, ~latest 10 turns + 1 summary): 33,333 RPS × 11 items ≈ 366,000 item reads per second
- Each TURN item is ~1 KB: 1 RCU strongly consistent, 0.5 RCU eventually consistent
- With eventual consistency (acceptable for most context reads): ~183,000 RCU
- Resume lookups via GSI: a small fraction, not the dominant cost
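The write and read arithmetic above can be sanity-checked in a few lines of Python. The inputs are the assumed traffic numbers from the table, nothing measured:

```python
# Back-of-envelope recomputation of the peak capacity estimates.
PEAK_SESSIONS = 500_000
MSGS_PER_SESSION_PER_MIN = 4
CONTEXT_ITEMS_PER_TURN = 11        # latest 10 turns + 1 summary
# Items are ~1 KB after GZIP, so 1 write = 1 WCU and
# 1 eventually consistent read = 0.5 RCU.

turn_writes_per_sec = PEAK_SESSIONS * MSGS_PER_SESSION_PER_MIN / 60
summary_writes_per_sec = turn_writes_per_sec / 5   # ~1 summary per 5 turns
total_wcu = turn_writes_per_sec + summary_writes_per_sec

item_reads_per_sec = turn_writes_per_sec * CONTEXT_ITEMS_PER_TURN
eventual_rcu = item_reads_per_sec * 0.5

print(round(turn_writes_per_sec))   # ≈ 33,333 turn writes/sec
print(round(total_wcu))             # ≈ 40,000 WCU
print(round(eventual_rcu))          # ≈ 183,333 RCU
```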
### Cost Ballpark (On-Demand, us-east-1, 2024 pricing)
| Resource | At Normal Load | At Peak Load |
|---|---|---|
| Write cost | ~2,000 writes/sec ≈ $9/hr | ~40,000 writes/sec ≈ $180/hr |
| Read cost (eventually consistent) | ~9,200 read units/sec ≈ $8/hr | ~183,000 read units/sec ≈ $165/hr |
| Storage (active sessions) | ~$0.25/GB-month; working set well under 1 GB | Same rate; roughly 10× the working set |
| Estimated monthly | ~$12,000 if normal load ran 24/7; real averages land lower | Dominated by short event spikes |

(Derived from the request rates above at ~$1.25 per million write units and ~$0.25 per million read units; treat these as order-of-magnitude figures only.)
Key insight: On-demand mode absorbs the burst but costs more per read/write than provisioned capacity. For predictable baseline traffic, consider provisioned mode with auto-scaling on top, reserving on-demand as a safety net during events.
### TTL Storage Control
Sessions expire within 24–72 hours depending on tier. At 50,000 concurrent sessions × average 2 KB total session data, the live working set is roughly 100 MB. Storage cost is negligible. Item tombstones created by TTL deletion consume no meaningful additional space.
## What We Store in DynamoDB
The design follows a session timeline model:
- `META` item: session-level state such as `session_id`, `customer_id`, `updated_at`, `turn_count`, and the latest `page_context`
- `TURN` items: one item per chat turn so writes stay small and ordered
- `SUMMARY` items: compressed memory blocks for older turns so prompt size stays under control
This works well because the main access pattern is simple:
- Find the session by `session_id`
- Read the latest turns in order
- Append the next turn
- Update summary or metadata when needed
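This access pattern maps directly onto a single Query. The sketch below builds low-level Query parameters for "latest N turns, newest first"; the table name matches the schema above, the function name is illustrative, and the actual boto3 call is left commented out so the logic stands alone:

```python
TABLE_NAME = "manga-assist-sessions"

def latest_turns_query(session_id: str, limit: int = 10) -> dict:
    """Build Query kwargs for 'newest turns first' within one session."""
    return {
        "TableName": TABLE_NAME,
        "KeyConditionExpression": "PK = :pk AND begins_with(SK, :turn)",
        "ExpressionAttributeValues": {
            ":pk": {"S": f"SESSION#{session_id}"},
            ":turn": {"S": "TURN#"},
        },
        "ScanIndexForward": False,  # descending sort key = newest first
        "Limit": limit,             # fetch only the recent-context window
    }

# Usage with the low-level client (not executed here):
# import boto3
# ddb = boto3.client("dynamodb")
# page = ddb.query(**latest_turns_query("a9f3c1", limit=10))
# turns = list(reversed(page["Items"]))  # chronological order for the prompt
```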
## Why DynamoDB Was a Better Fit for This Workload
The key point is that conversation memory is a high-scale, low-latency, append-heavy workload with simple access patterns. That is where DynamoDB is strong.
| Requirement | Why It Matters in MangaAssist | Why DynamoDB Fit Well |
|---|---|---|
| Low-latency reads and writes | Memory is on the critical chat path | Key-based reads and writes are fast and predictable |
| Burst handling | Traffic can jump from normal volume to Prime Day scale | DynamoDB scales better than connection-oriented databases for spiky traffic |
| Simple session access pattern | Most requests are "load latest turns" and "append next turn" | Partition key plus sort key maps naturally to session timelines |
| Built-in TTL | Chat memory is intentionally short-lived | No cleanup cron or delete job is required for normal expiry |
| Durability | Losing memory during failures hurts the experience | DynamoDB is safer than using cache-only memory |
| Managed operations | This system already has enough moving parts | No server management, sharding, vacuuming, or connection pool tuning |
| AWS-native integration | The rest of the system is on AWS | IAM, KMS, CloudWatch, Streams, DAX, and Global Tables fit cleanly |
| Multi-region path | Conversation continuity matters during failover | Global Tables are available if we need active-active later |
| Cost alignment | Session memory is short-lived and access-pattern driven | We pay for OLTP access, not relational features we do not need |
## Why DynamoDB Was Better Than Other Database Options

### 1. Better than Redis / ElastiCache as the primary store
Redis was attractive for raw speed, but it was a weaker source of truth for this specific problem.
- Redis is excellent for caching and extremely hot reads.
- Redis is weaker as the only durable system of record for customer chat memory.
- TTL exists in Redis, but durability and recovery tradeoffs are worse than DynamoDB for a compliance-sensitive workflow.
- We wanted memory to survive node loss, retries, and failover scenarios without treating snapshots as the main recovery plan.
Decision:
- Redis stayed in the architecture as a cache option.
- DynamoDB stayed the durable memory source of truth.
### 2. Better than Aurora / PostgreSQL for session memory
Aurora would work, but it would solve problems this workload does not have.
- Conversation memory does not require joins.
- Most operations are point lookups and ordered range reads inside one session.
- Expiry would need scheduled cleanup or partition management.
- Burst traffic would introduce more connection-management and write-scaling considerations.
- The schema would become more operationally heavy than the problem justifies.
Decision:
- Aurora is better for transactional business systems such as orders or payments.
- DynamoDB is better for session memory where the access pattern is known upfront.
### 3. Better than DocumentDB / Mongo-style transcript documents
Document stores look natural at first because a conversation looks like a JSON document. That becomes less attractive at scale.
- If the whole transcript is stored in one document, the document grows every turn.
- Large-document rewrites create more write amplification.
- Concurrent updates become more awkward during retries or streaming.
- The 400 KB DynamoDB item limit forced us toward a better per-turn design anyway.
- Ordered per-turn reads and summary windows are easier when turns are separate items.
Decision:
- We modeled the chat as a timeline of small items instead of one growing document.
- DynamoDB with `PK` and `SK` was cleaner than a session-document model.
### 4. Better than OpenSearch for operational memory
OpenSearch is already used in this system, but for RAG, not for chat state.
- OpenSearch is built for search and retrieval, not primary OLTP session memory.
- High write volume plus strict per-session read patterns would be more expensive and less predictable.
- It is the wrong abstraction for "load recent turns for one session and append one more item."
Decision:
- OpenSearch remains the knowledge retrieval store.
- DynamoDB remains the chat memory store.
## Scenarios We Evaluated Before Making the Decision
We compared the database options against the actual workflows the chatbot must support.
### Scenario 1: Session Initialization
Need:
- Create session state quickly
- Attach expiry immediately
- Support guest and authenticated users
Why DynamoDB won:
- One metadata write is enough
- TTL can be set at creation time
- No separate cleanup process is needed for the standard expiry path
### Scenario 2: Append One Turn per Message
Need:
- Avoid rewriting the full transcript every turn
- Keep writes small and idempotent
- Support retries without corrupting memory
Why DynamoDB won:
- One new `TURN` item can be appended per message
- Metadata can be updated independently
- Conditional writes can make retries safe
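A minimal sketch of what "conditional writes can make retries safe" might look like, assuming the low-level item format from the schema above (the helper name is hypothetical). Note that the sort key must be computed once per logical turn and reused across retries, so that the condition actually catches duplicates:

```python
def idempotent_turn_put(session_id: str, sort_key: str, response_id: str,
                        item_attrs: dict) -> dict:
    """Build PutItem kwargs that refuse to overwrite an existing turn."""
    item = {
        "PK": {"S": f"SESSION#{session_id}"},
        "SK": {"S": sort_key},          # same SK on every retry of this turn
        "response_id": {"S": response_id},
        **item_attrs,
    }
    return {
        "TableName": "manga-assist-sessions",
        "Item": item,
        # Fails with ConditionalCheckFailedException if an earlier attempt
        # already wrote this turn, so retries never duplicate a message.
        "ConditionExpression": (
            "attribute_not_exists(PK) AND attribute_not_exists(SK)"
        ),
    }

# ddb.put_item(**idempotent_turn_put(...)) — on retry, catch
# ConditionalCheckFailedException and treat it as success.
```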
### Scenario 3: Load Latest N Turns Fast
Need:
- Prompt assembly must stay inside latency budget
- Most requests only need recent context, not a full transcript scan
Why DynamoDB won:
- Query by partition key plus descending sort key is a direct fit
- We can fetch only the latest turns and latest summary
- The design avoids table scans and complex joins
### Scenario 4: Long Conversations
Need:
- Keep context useful without sending full transcripts to the model
- Avoid item-size growth and large rewrites
Why DynamoDB won:
- Separate `SUMMARY` items let us compress old windows
- Separate `TURN` items avoid transcript-size blowups
- The storage model stays stable even as conversations get longer
### Scenario 5: Authenticated Resume and Human Handoff
Need:
- Find recent sessions by customer
- Load enough context to resume or escalate
Why DynamoDB won:
- A GSI on `customer_id` plus `updated_at` supports recent-session lookup
- Ordered turn history fits escalation packaging well
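The resume lookup can be sketched as one GSI Query, newest session first. The index and attribute names follow the GSI definition earlier; the function name is illustrative:

```python
def recent_sessions_query(customer_id: str, limit: int = 5) -> dict:
    """Build Query kwargs for 'most recent sessions for one customer'."""
    return {
        "TableName": "manga-assist-sessions",
        "IndexName": "GSI1-customer-sessions",
        "KeyConditionExpression": "GSI1PK = :cust",
        "ExpressionAttributeValues": {":cust": {"S": customer_id}},
        "ScanIndexForward": False,  # GSI1SK = updated_at, newest first
        "Limit": limit,
    }

# page = ddb.query(**recent_sessions_query("CUST#u8k21"))
# The projected session_id / turn_count fields are enough to offer a
# "resume previous conversation" choice or package a handoff.
```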
### Scenario 6: Peak Traffic and Burstiness
Need:
- Handle about 50,000 concurrent sessions normally and up to about 500,000 at peak
- Avoid connection bottlenecks or manual shard rebalancing
Why DynamoDB won:
- It handles bursty request patterns better than many relational deployments
- Operational scaling is simpler than managing database connections during chat spikes
### Scenario 7: Expiry and Privacy Retention
Need:
- Memory should not stay forever
- Expiry should align with privacy-by-design goals
Why DynamoDB won:
- TTL provides built-in expiry semantics
- The application can also enforce expiry at read time before physical deletion completes
### Scenario 8: Multi-Region Growth Path
Need:
- Future regional failover should not require redesigning the memory layer
Why DynamoDB won:
- Global Tables provide a direct path if the chatbot later becomes active-active across regions
## Final Decision Summary
We chose DynamoDB because conversation memory in MangaAssist is not a relational reporting problem and not just a cache problem. It is a short-lived, append-heavy, high-concurrency session-state problem with simple but strict access patterns.
That made DynamoDB the best overall choice because it gave:
- Better durability than Redis as the primary store
- Less operational overhead than Aurora for this workload
- Cleaner per-turn modeling than a growing document database design
- Better OLTP behavior than search-oriented systems
- Built-in TTL, strong AWS integration, and a clean path to scale
## Failure Modes and Resilience
SRE's perspective: A design document is incomplete if it does not answer "what happens when this breaks?"
### Scenario: DynamoDB is temporarily unavailable (regional outage or throttle spike)
Impact: Context loads fail → the orchestrator cannot assemble a prompt with history.
Mitigation options ranked by cost and complexity:
- Graceful degradation to zero-turn context — the chat still works; users lose conversation continuity for the duration. Log a `memory_load_failure` metric and resume normally once DynamoDB recovers.
- Short-lived in-process cache — store the last assembled context in the Lambda/container's local memory for the duration of the request. Handles transient blips under 100 ms.
- ElastiCache read fallback — if a warm Redis cache of recent turns exists, fall back to it. The cache may be slightly stale but avoids a full context failure.
- Circuit breaker — after N consecutive failures, skip memory reads entirely and respond with a "starting fresh" UX notice. This prevents cascading timeouts on a degraded DynamoDB path from burning LLM quota.
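The circuit-breaker option can be sketched in a few lines; the thresholds and naming are illustrative, not taken from the real service:

```python
import time

class MemoryCircuitBreaker:
    """Skip DynamoDB reads for a cooldown after repeated load failures."""

    def __init__(self, failure_threshold: int = 5, cooldown_sec: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_sec = cooldown_sec
        self.consecutive_failures = 0
        self.opened_at = None  # None = circuit closed, reads allowed

    def allow_read(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_sec:
            self.opened_at = None  # half-open: try DynamoDB again
            return True
        return False  # open: serve the zero-turn "starting fresh" path

    def record_success(self) -> None:
        self.consecutive_failures = 0

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

The point of the breaker is to fail fast: a blocked read costs microseconds, while a timing-out DynamoDB call holds the request open and still spends LLM quota on a degraded answer.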
### Scenario: Write failures during message persistence
Impact: A turn is generated but not saved. On refresh or reconnect, the user sees the message disappear.
Mitigation: Use an async write-behind pattern with a retry queue (SQS). The response is streamed to the user immediately; persistence is retried at least once in the background. The `response_id` field keeps retries idempotent: if the original write actually succeeded, the retry is detected as a duplicate instead of creating a second entry.
### Scenario: Hot partition causing throttling on one session
Impact: One unusually active session (like a support agent cycling through the same test account) causes that partition to be throttled.
Mitigation: DynamoDB's adaptive capacity handles moderate hot spots automatically. For extreme cases, use exponential backoff with jitter in the SDK. Monitor `ConsumedWriteCapacityUnits` and `SystemErrors` per table in CloudWatch.
### Scenario: TTL deletion lag causing stale session reads
Impact: A session is past its expiry but the item is still physically present. A reconnect attempt succeeds when it should fail.
Mitigation: Always enforce expiry at read time in application logic by checking `ttl < now()`. Never rely solely on DynamoDB having already deleted the item; TTL deletion can lag expiry by hours.
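A minimal sketch of that read-time guard, assuming items carry the numeric `ttl` attribute from the schema (the function name is illustrative):

```python
import time

def is_live(item: dict, now=None) -> bool:
    """Treat an item as gone once its ttl has passed, even if the TTL
    sweeper has not physically deleted it yet."""
    now = time.time() if now is None else now
    ttl = item.get("ttl")
    if ttl is None:
        return True  # no expiry set on this item
    return float(ttl) > now

# session = load_meta(session_id)            # hypothetical loader
# if session is None or not is_live(session):
#     return None  # behave exactly as if DynamoDB had already deleted it
```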
## Security and Encryption
Security Engineer's perspective: Every service that stores user conversation data must have a clear security posture.
### Encryption at Rest
All DynamoDB tables use AWS-managed KMS keys (SSE-KMS) for encryption at rest. If your compliance posture requires customer-managed keys (CMK), configure a dedicated KMS key with a rotation schedule and restrict the key policy to specific IAM roles.
### Encryption in Transit
All communication with DynamoDB uses HTTPS/TLS via the AWS SDK. No plaintext traffic.
### IAM Access Control
The Lambda functions that read and write DynamoDB should have least-privilege IAM roles:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:UpdateItem",
        "dynamodb:Query",
        "dynamodb:DeleteItem"
      ],
      "Resource": [
        "arn:aws:dynamodb:us-east-1:ACCOUNT_ID:table/manga-assist-sessions",
        "arn:aws:dynamodb:us-east-1:ACCOUNT_ID:table/manga-assist-sessions/index/*"
      ]
    }
  ]
}
```
Do not grant `dynamodb:Scan` to the chat service. This makes accidental full-table scans a deployment-time error, not a runtime surprise.
### Data Classification
Chat content is a PII-adjacent workload. Even if message text is not strictly PII, it can contain personally identifying information in user utterances.
Recommended controls:
- GZIP-compress message content before writing (reduces storage and makes accidental log exposure less readable)
- Do not log full message content in application logs — use truncated previews or [REDACTED]
- Enforce TTL strictly — do not extend retention beyond the agreed privacy window without a logged business reason
- Enable DynamoDB Streams only if a downstream consumer exists — inactive streams still incur charges
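The first two controls can be sketched as small helpers (the names are illustrative; `content_compressed` matches the schema attribute above):

```python
import gzip

def compress_content(text: str) -> bytes:
    """Value stored in the content_compressed attribute."""
    return gzip.compress(text.encode("utf-8"))

def decompress_content(blob: bytes) -> str:
    return gzip.decompress(blob).decode("utf-8")

def log_preview(text: str, max_chars: int = 20) -> str:
    """Never log full utterances; keep only a short prefix for debugging."""
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + "...[REDACTED]"
```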
### Compliance Deletion
For GDPR or CCPA right-to-delete flows, TTL alone is insufficient. Maintain an explicit delete job that:
1. Queries all sessions by `customer_id` via the GSI
2. Issues `DeleteItem` calls for each session item and each turn
3. Records a deletion audit event (e.g., to CloudTrail or a dedicated audit table)
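A sketch of that delete job, with the DynamoDB client injected so the three steps are visible without AWS access. Pagination and error handling are elided, and all names are illustrative:

```python
def delete_customer_sessions(ddb, customer_id: str, audit) -> int:
    """Delete every item for every session owned by customer_id."""
    deleted = 0
    # Step 1: find the customer's sessions via the GSI.
    sessions = ddb.query(
        TableName="manga-assist-sessions",
        IndexName="GSI1-customer-sessions",
        KeyConditionExpression="GSI1PK = :c",
        ExpressionAttributeValues={":c": {"S": customer_id}},
    )["Items"]
    for meta in sessions:
        pk = {"S": f"SESSION#{meta['session_id']['S']}"}
        # Step 2: delete every item (META, TURN#*, SUMMARY#*) in the session.
        items = ddb.query(
            TableName="manga-assist-sessions",
            KeyConditionExpression="PK = :pk",
            ExpressionAttributeValues={":pk": pk},
        )["Items"]
        for item in items:
            ddb.delete_item(
                TableName="manga-assist-sessions",
                Key={"PK": item["PK"], "SK": item["SK"]},
            )
            deleted += 1
    # Step 3: record the deletion for compliance evidence.
    audit(f"deleted {deleted} items for {customer_id}")
    return deleted
```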
## Monitoring and Alerting
DevOps perspective: You cannot operate what you cannot observe.
### Key CloudWatch Metrics to Watch
| Metric | What It Signals | Alert Threshold |
|---|---|---|
| `SuccessfulRequestLatency` (P99) | Memory load tail latency affecting chat | > 50 ms sustained |
| `ConsumedWriteCapacityUnits` | Approaching provisioned limit | > 80% of provisioned |
| `SystemErrors` | DynamoDB-side errors (5xx) | Any nonzero sustained count |
| `UserErrors` | Application-side errors (4xx) | Spike above baseline |
| `ThrottledRequests` | Throughput limit hit | > 0 for > 60 seconds |
| `TimeToLiveDeletedItemCount` | TTL churn volume | Unusual drops may signal TTL misconfiguration |
### Recommended Dashboards
- Memory latency dashboard — P50/P95/P99 latency for context read operations, broken out by session age to detect degradation on older sessions
- Write throughput dashboard — WCU consumed vs. provisioned, GSI write amplification
- Error rate dashboard — `SystemErrors` and `UserErrors` with baseline comparison
- Session lifecycle dashboard — sessions created vs. sessions expired (via TTL stream event count) to verify TTL is functioning
### Distributed Tracing Integration
Wrap every DynamoDB call in an X-Ray subsegment. This makes it easy to see exactly how much of the total chat latency budget is consumed by memory operations. Tag segments with `session_id` and `item_type` (META / TURN / SUMMARY) to diagnose item-type-specific bottlenecks.
## Practical Decision Rule

For this project:

- Use DynamoDB for short-lived conversation state
- Use ElastiCache only as an optional accelerator or fallback read cache
- Use OpenSearch for retrieval, not session memory
- Use transactional systems like Aurora for order and payment workflows, not chat memory
- Add encryption, IAM scoping, and CloudWatch alerts from day one — not after an incident