HLD Deep Dive: Conversation Memory & Context Management
Questions covered: Q7, Q13, Q23
Interviewer level: Senior Engineer → Staff Engineer
Q7. What database is used for conversation memory, and why?
Short Answer
DynamoDB — low-latency key-value store, supports TTL for auto-expiry (24 hours), scales well for session data.
Deep Dive
Why DynamoDB over alternatives:
| Database | Latency | TTL Support | Scale | Cost | Verdict |
|---|---|---|---|---|---|
| DynamoDB | ~1–5ms | ✅ Native | Unlimited | Pay-per-use | ✅ Best choice |
| Redis (ElastiCache) | <1ms | ✅ Native | Limited by mem | $$$ for large data | Cache layer only |
| RDS (MySQL/Postgres) | ~5–20ms | Manual cleanup job | Vertical limit | Fixed cost | ❌ Too slow, wrong tool |
| S3 | ~50–200ms | Via lifecycle rules | Unlimited | Very cheap | ❌ Too slow for real-time |
| MongoDB | ~5–15ms | ✅ | Good | Moderate | ✅ Alternative, but not AWS-native |
DynamoDB schema for conversation memory:
Table: ChatSessions
Primary Key: session_id (PARTITION KEY)
Sort Key: turn_number (SORT KEY)
Attributes:
session_id: "sess_abc123xyz"
turn_number: 7 (increments per turn)
customer_id: "cust_12345" (null for guests)
timestamp: 1711345678
user_message: "What about the second one you mentioned?"
assistant_reply: "The second recommendation was Vinland Saga..."
intent: "recommendation"
ttl: 1711432078 (Unix timestamp 24h later)
How DynamoDB TTL works:
- Each item has a ttl attribute set to now + 86400 (24 hours in Unix seconds).
- DynamoDB's TTL daemon scans for expired items and deletes them automatically.
- Deletion is eventual (typically within 48 hours of expiry). Expired-but-not-yet-deleted items can still appear in query results, so filter on the ttl attribute at read time if stale turns matter.
- Zero cost for TTL-based cleanup; no Lambda needed.
Reading conversation history:
import boto3

dynamodb = boto3.client("dynamodb")

def load_conversation_history(session_id: str, last_n_turns: int = 10) -> list:
    """Fetch the most recent turns for a session, returned oldest first."""
    response = dynamodb.query(
        TableName="ChatSessions",
        KeyConditionExpression="session_id = :sid",
        ExpressionAttributeValues={":sid": {"S": session_id}},
        ScanIndexForward=False,  # Descending by turn_number: most recent first
        Limit=last_n_turns,
    )
    # Reverse to get chronological order for the LLM prompt
    return list(reversed(response["Items"]))
Why last N turns, not everything?
- LLM context windows are finite (Claude: 200K tokens, with each turn at ~200–500 tokens).
- Loading 100 turns ≈ 50K tokens of context → expensive and slow.
- Users rarely reference (or need) more than the last 5–10 turns in a chat session.
- The last 10 turns cover >99% of practical multi-turn references.
Q13. How does conversation memory handle multi-turn context?
Short Answer
Stores last N turns per session in DynamoDB. Without memory, pronouns and references like "the second one you mentioned" are unresolvable.
Deep Dive
Why multi-turn context is critical:
Scenario without memory:
Turn 1:
User: "Recommend dark fantasy manga for beginners"
Bot: "Here are my top 3: (1) Berserk, (2) Vinland Saga, (3) Claymore"
Turn 2:
User: "What about the second one?"
Bot: [WITHOUT MEMORY] "I'm sorry, I don't understand what you mean by 'the second one'."
Bot: [WITH MEMORY] "Vinland Saga is set in medieval Scandinavia. It follows Thorfinn,
a young warrior seeking revenge. It's more grounded than Berserk and slightly
more beginner-friendly..."
Scenario with cross-turn reference resolution:
Turn 1:
User: "Do you have the Berserk deluxe edition?"
Bot: "Yes! Berserk Deluxe Edition Vol 1 is available for $49.99, hardcover."
Turn 2:
User: "Is it available for Kindle?"
Bot: [WITHOUT MEMORY] "What product are you referring to?"
Bot: [WITH MEMORY] "Berserk Deluxe Edition is only available in print.
The standard Berserk volumes (non-deluxe) are available on Kindle."
How the Orchestrator uses conversation history:
async def process_message(session_id: str, user_message: str) -> str:
    # 1. Load last N turns from DynamoDB
    history = await memory_service.load_history(session_id, last_n=10)

    # 2. Build conversation context for LLM
    conversation_context = format_conversation_history(history)
    # Output: "User: Recommend dark fantasy manga\nAssistant: Here are 3 options...\n..."

    # 3. Classify intent with context
    intent = await classifier.classify(
        current_message=user_message,
        conversation_context=conversation_context  # Context helps with ambiguous messages
    )

    # 4. Resolve references using history
    resolved_query = await reference_resolver.resolve(
        message=user_message,
        history=history
    )
    # "What about the second one?" → "Tell me more about Vinland Saga"

    # 5. Retrieve RAG context for the resolved query (retriever interface is illustrative)
    rag_context = await retriever.retrieve(resolved_query, intent=intent)

    # 6. Build final LLM prompt with history
    prompt = build_prompt(
        system_prompt=SYSTEM_PROMPT,
        conversation_history=conversation_context,
        current_query=resolved_query,
        retrieved_context=rag_context
    )
    return await llm_service.generate(prompt)
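The `format_conversation_history` helper used in step 2 is not defined in this document; a minimal sketch, assuming turn items have already been deserialized into plain dicts (e.g. via boto3's TypeDeserializer):

```python
def format_conversation_history(history: list[dict]) -> str:
    """Render conversation turns as alternating User/Assistant lines
    for inclusion in the LLM prompt. Assumes chronological order."""
    lines = []
    for turn in history:
        lines.append(f"User: {turn['user_message']}")
        lines.append(f"Assistant: {turn['assistant_reply']}")
    return "\n".join(lines)
```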
Token budget management:
Total LLM context window: 200,000 tokens (Claude 3.5 Sonnet)
System prompt: ~500 tokens (reserved)
Conversation history: ~3,000 tokens (last 10 turns × 300 tokens)
RAG retrieved context: ~2,500 tokens (top 5 chunks × 500 tokens)
Current user message: ~100 tokens
Response buffer: ~1,000 tokens
─────────────────────────────────────────
Total used: ~7,100 tokens
Context headroom: ~193,000 tokens unused
For typical chatbot use cases, you use ~4% of Claude's context window. The 10-turn limit is about cost efficiency and relevance, not window size.
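A sketch of enforcing the ~3,000-token history budget, using the rough 4-characters-per-token heuristic for English text (the helper names are illustrative):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text
    return max(1, len(text) // 4)

def trim_history_to_budget(turns: list[str], budget_tokens: int = 3000) -> list[str]:
    """Keep the most recent turns that fit within the token budget,
    returned in chronological order."""
    kept, used = [], 0
    for turn in reversed(turns):          # walk newest-first
        cost = estimate_tokens(turn)
        if used + cost > budget_tokens:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))           # back to chronological order
```

In practice a real tokenizer (or the model provider's token-counting API) would replace the heuristic.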
Session continuity across page refreshes:
- Session ID stored in a browser cookie (HttpOnly, Secure, SameSite=Strict).
- If the user refreshes the page, the frontend reconnects the WebSocket with the same session ID.
- The session cookie survives the refresh (unlike in-memory state).
- History is loaded fresh from DynamoDB on reconnect.
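A minimal sketch of issuing that cookie with Python's standard library (`session_cookie` is a hypothetical helper; a real deployment would set this through the web framework):

```python
from http.cookies import SimpleCookie

def session_cookie(session_id: str) -> str:
    """Build a Set-Cookie header value with the hardening flags above."""
    c = SimpleCookie()
    c["session_id"] = session_id
    c["session_id"]["httponly"] = True       # Not readable from JS
    c["session_id"]["secure"] = True         # HTTPS only
    c["session_id"]["samesite"] = "Strict"   # No cross-site sends
    c["session_id"]["max-age"] = 86400       # Align with the 24h DynamoDB TTL
    return c["session_id"].OutputString()
```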
Q23. DynamoDB elevated latency — how do you protect the user experience?
Short Answer
Circuit breaker → fall back to stateless mode → degrade gracefully. ElastiCache hot path for recent turns.
Deep Dive
Understanding the risk:
- DynamoDB p99 latency goal: <10ms.
- If DynamoDB hits elevated latency (say, 500ms–2s due to hot partitions or capacity issues), the chatbot's response time balloons.
- Worse: if DynamoDB times out entirely, the Orchestrator has two options: fail the request, or proceed without history.
Solution 1: ElastiCache hot path
Primary: DynamoDB (cold, persistent)
Hot Path: ElastiCache Redis (warm, last 5 turns, TTL 1hr)
Read path:
1. Try Redis (sub-millisecond) ──► HIT: return immediately
MISS: fall through
2. Read from DynamoDB (~5ms) ──► Success: populate Redis, return
Timeout: fall back to stateless mode
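The read path above can be sketched as follows, assuming an async Redis client and an async DynamoDB read function are injected (`load_recent_turns` and both client interfaces are illustrative):

```python
import json

async def load_recent_turns(session_id: str, redis, dynamodb_reader) -> list:
    """Redis-first read path: sub-ms on hit, DynamoDB on miss,
    stateless fallback on DynamoDB timeout."""
    key = f"session:{session_id}"
    cached = await redis.lrange(key, 0, 4)   # newest-first, last 5 turns
    if cached:
        # Reverse back to chronological order for the prompt
        return [json.loads(t) for t in reversed(cached)]
    try:
        turns = await dynamodb_reader(session_id)   # ~5ms typical
    except TimeoutError:
        return []                                   # stateless fallback
    if turns:
        # Populate the hot cache: push oldest→newest so the newest ends at the head
        await redis.lpush(key, *[json.dumps(t) for t in turns[-5:]])
        await redis.ltrim(key, 0, 4)
    return turns
```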
Write path (write-through cache):
async def save_turn(session_id: str, turn: ConversationTurn):
    # Write to both stores concurrently
    await asyncio.gather(
        dynamodb.put_item(session_id, turn),                    # Persistent
        redis.lpush(f"session:{session_id}", turn.serialize())  # Hot cache, newest at head
    )
    # Trim Redis to the last 5 turns
    await redis.ltrim(f"session:{session_id}", 0, 4)
Solution 2: Circuit Breaker Pattern
class DynamoDBCircuitBreaker:
    def __init__(self):
        self.state = "CLOSED"            # Normal operation
        self.failure_count = 0
        self.failure_threshold = 5       # Open after 5 consecutive failures
        self.timeout = 30                # Stay open for 30 seconds

    async def read_history(self, session_id: str) -> list:
        if self.state == "OPEN":
            # Don't even try DynamoDB; fail fast
            logger.warning("Circuit OPEN: returning empty history for stateless mode")
            return []  # Stateless fallback
        # CLOSED or HALF-OPEN: attempt the read (HALF-OPEN acts as the probe)
        try:
            history = await dynamodb.query_with_timeout(session_id, timeout_ms=200)
            self.on_success()
            return history
        except TimeoutError:
            self.on_failure()
            return []  # Stateless fallback

    def on_success(self):
        # Any success (including the HALF-OPEN probe) closes the circuit
        self.state = "CLOSED"
        self.failure_count = 0

    def on_failure(self):
        self.failure_count += 1
        if self.state == "HALF-OPEN" or self.failure_count >= self.failure_threshold:
            self.state = "OPEN"
            cloudwatch.put_metric("DynamoDBCircuitOpen", 1)
            # Auto-recover: schedule HALF-OPEN probe after timeout
            asyncio.create_task(self.schedule_recovery())

    async def schedule_recovery(self):
        await asyncio.sleep(self.timeout)
        self.state = "HALF-OPEN"  # Allow one probe request
Circuit breaker state machine:
CLOSED (normal) ──[5 failures]──► OPEN (fail fast)
   ▲                                  │
   │                                  │ (30s timeout)
   └───[probe succeeds]─── HALF-OPEN ◄┘
                               │
                         [probe fails]
                               ▼
                        OPEN (reset timer)
Solution 3: Graceful degradation levels
Level 0 (Normal): Full history loaded from cache or DynamoDB
Level 1 (Cache only): Only last 5 turns from Redis (if DynamoDB slow)
Level 2 (Stateless): Process current message only, no history
→ Add user-facing message: "I've lost our conversation
context. Could you briefly remind me what we were discussing?"
Level 3 (Full fallback): If Redis also fails, proceed with zero context but
never block the user from getting a response
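One way to encode these levels is a health-to-mode mapping; a minimal sketch with hypothetical names:

```python
from enum import IntEnum

class MemoryMode(IntEnum):
    FULL = 0        # Full history from cache or DynamoDB
    CACHE_ONLY = 1  # Last 5 turns from Redis only
    STATELESS = 2   # Current message only; tell the user
    FALLBACK = 3    # Redis also down: zero context, never block

def pick_memory_mode(dynamo_healthy: bool, redis_healthy: bool,
                     cache_hit: bool) -> MemoryMode:
    """Map component health to a degradation level (illustrative)."""
    if dynamo_healthy:
        return MemoryMode.FULL
    if redis_healthy:
        # DynamoDB is slow/down; serve from the hot cache if it has the session
        return MemoryMode.CACHE_ONLY if cache_hit else MemoryMode.STATELESS
    return MemoryMode.FALLBACK
```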
User messaging when degraded:
// Level 2 stateless mode — be transparent but not alarming
{
"response": "I'm having a brief technical hiccup with my memory.
I can still help you! What were you looking for?",
"debug_metadata": { "memory_mode": "stateless", "reason": "dynamo_timeout" }
}
Monitoring:
CloudWatch Alarms:
- Metric: DynamoDB/SuccessfulRequestLatency (p99)
Threshold: > 50ms for 3 consecutive minutes
Action: PagerDuty alert + SNS notification
- Metric: Custom/DynamoDBCircuitOpen
Threshold: > 0
Action: Critical PagerDuty alert
- Metric: Custom/StatelessModeSessions
Threshold: > 1% of sessions
Action: Warning alert — degradation is user-visible
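The "3 consecutive minutes" evaluation that the latency alarm relies on can be sketched as a pure function (illustrative, not the actual CloudWatch implementation):

```python
def alarm_breached(datapoints: list[float], threshold: float, periods: int) -> bool:
    """Fire when the metric exceeds the threshold for N consecutive
    evaluation periods (here: p99 latency > 50ms for 3 one-minute periods)."""
    streak = 0
    for value in datapoints:  # one datapoint per evaluation period
        streak = streak + 1 if value > threshold else 0
        if streak >= periods:
            return True
    return False
```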
Prevention: hot partition mitigation
- Session IDs use UUID v4 (random) as the partition key → evenly distributed across DynamoDB partitions.
- Never use sequential IDs (1, 2, 3…) or customer IDs as partition keys; they create hot partitions.
- DynamoDB on-demand capacity automatically handles burst scaling.
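Generating such a partition key is a one-liner; a minimal sketch (the `sess_` prefix matches the schema shown earlier):

```python
import uuid

def new_session_id() -> str:
    """Random (v4) session ID: high-entropy partition key, no hot partitions."""
    return f"sess_{uuid.uuid4().hex}"
```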