
HLD Deep Dive: Conversation Memory & Context Management

Questions covered: Q7, Q13, Q23
Interviewer level: Senior Engineer → Staff Engineer


Q7. What database is used for conversation memory, and why?

Short Answer

DynamoDB — low-latency key-value store, supports TTL for auto-expiry (24 hours), scales well for session data.

Deep Dive

Why DynamoDB over alternatives:

Database               Latency     TTL Support            Scale               Cost                Verdict
DynamoDB               ~1–5ms      ✅ Native              Unlimited           Pay-per-use         ✅ Best choice
Redis (ElastiCache)    <1ms        ✅ Native              Limited by memory   $$$ for large data  Cache layer only
RDS (MySQL/Postgres)   ~5–20ms     Manual cleanup job     Vertical limit      Fixed cost          ❌ Too slow, wrong tool
S3                     ~50–200ms   Via lifecycle rules    Unlimited           Very cheap          ❌ Too slow for real-time
MongoDB                ~5–15ms     ✅ Native (TTL index)  Good                Moderate            ✅ Alternative, but not AWS-native

DynamoDB schema for conversation memory:

Table: ChatSessions
Primary Key: session_id (PARTITION KEY)
Sort Key: turn_number (SORT KEY)

Attributes:
  session_id:    "sess_abc123xyz"
  turn_number:   7  (increments per turn)
  customer_id:   "cust_12345"  (null for guests)
  timestamp:     1711345678
  user_message:  "What about the second one you mentioned?"
  assistant_reply: "The second recommendation was Vinland Saga..."
  intent:        "recommendation"
  ttl:           1711432078  (Unix timestamp 24h later)

How DynamoDB TTL works:
- Each item has a ttl attribute set to now + 86400 (24 hours in Unix seconds).
- DynamoDB's TTL daemon scans for expired items and deletes them automatically.
- Deletion is eventual (it may lag expiry by up to 48 hours). Expired-but-undeleted items still appear in query results, so filter on the ttl attribute at read time if exact expiry matters.
- TTL deletes are free and consume no write capacity; no cleanup Lambda needed.
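As a minimal sketch of the write side, the item (attribute names taken from the schema above; the helper name is invented for this sketch) can be assembled in plain Python and handed to a boto3 put_item call:

```python
import time

def build_turn_item(session_id: str, turn_number: int,
                    user_message: str, assistant_reply: str,
                    ttl_seconds: int = 86400) -> dict:
    """Assemble a ChatSessions item with a ttl attribute 24h in the future."""
    now = int(time.time())
    return {
        "session_id": session_id,      # partition key
        "turn_number": turn_number,    # sort key
        "timestamp": now,
        "user_message": user_message,
        "assistant_reply": assistant_reply,
        "ttl": now + ttl_seconds,      # TTL daemon deletes some time after this
    }
```

The resulting dict is what a `Table("ChatSessions").put_item(Item=...)` call would receive.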

Reading conversation history:

def load_conversation_history(session_id: str, last_n_turns: int = 10) -> list:
    response = dynamodb.query(
        TableName="ChatSessions",
        KeyConditionExpression="session_id = :sid",
        ExpressionAttributeValues={":sid": {"S": session_id}},
        ScanIndexForward=False,  # Most recent first
        Limit=last_n_turns
    )
    # Reverse to get chronological order
    return list(reversed(response["Items"]))

Why last N turns, not everything?
- LLM context windows are finite (Claude: 200K tokens), and each turn costs ~200–500 tokens.
- Loading 100 turns = ~50K tokens of context → expensive and slow.
- Users don't actually remember (or need) more than the last 5–10 turns in a chat session.
- Last 10 turns covers >99% of practical multi-turn references.


Q13. How does conversation memory handle multi-turn context?

Short Answer

Stores last N turns per session in DynamoDB. Without memory, pronouns and references like "the second one you mentioned" are unresolvable.

Deep Dive

Why multi-turn context is critical:

Scenario without memory:

Turn 1:
  User: "Recommend dark fantasy manga for beginners"
  Bot:  "Here are my top 3: (1) Berserk, (2) Vinland Saga, (3) Claymore"

Turn 2:
  User: "What about the second one?"
  Bot:  [WITHOUT MEMORY] "I'm sorry, I don't understand what you mean by 'the second one'."
  Bot:  [WITH MEMORY] "Vinland Saga is set in medieval Scandinavia. It follows Thorfinn,
         a young warrior seeking revenge. It's more grounded than Berserk and slightly
         more beginner-friendly..."

Scenario with cross-turn reference resolution:

Turn 1:
  User: "Do you have the Berserk deluxe edition?"
  Bot:  "Yes! Berserk Deluxe Edition Vol 1 is available for $49.99, hardcover."

Turn 2:
  User: "Is it available for Kindle?"
  Bot:  [WITHOUT MEMORY] "What product are you referring to?"
  Bot:  [WITH MEMORY] "Berserk Deluxe Edition is only available in print. 
         The standard Berserk volumes (non-deluxe) are available on Kindle."

How the Orchestrator uses conversation history:

async def process_message(session_id: str, user_message: str) -> str:
    # 1. Load last N turns from DynamoDB
    history = await memory_service.load_history(session_id, last_n=10)

    # 2. Build conversation context for LLM
    conversation_context = format_conversation_history(history)
    # Output: "User: Recommend dark fantasy manga\nAssistant: Here are 3 options...\n..."

    # 3. Classify intent with context
    intent = await classifier.classify(
        current_message=user_message,
        conversation_context=conversation_context  # Context helps with ambiguous messages
    )

    # 4. Resolve references using history
    resolved_query = await reference_resolver.resolve(
        message=user_message,
        history=history
    )
    # "What about the second one?" → "Tell me more about Vinland Saga"

    # 5. Build final LLM prompt with history
    prompt = build_prompt(
        system_prompt=SYSTEM_PROMPT,
        conversation_history=conversation_context,
        current_query=resolved_query,
        retrieved_context=rag_context
    )

    return await llm_service.generate(prompt)
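The `format_conversation_history` helper used in step 2 is not shown here; a minimal sketch, assuming each stored turn carries the `user_message` and `assistant_reply` attributes from the schema above:

```python
def format_conversation_history(history: list) -> str:
    """Render stored turns as alternating User/Assistant lines for the prompt."""
    lines = []
    for turn in history:  # assumed chronological (oldest first)
        lines.append(f"User: {turn['user_message']}")
        lines.append(f"Assistant: {turn['assistant_reply']}")
    return "\n".join(lines)
```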

Token budget management:

Total LLM context window:    200,000 tokens (Claude 3.5 Sonnet)
  System prompt:               ~500 tokens (reserved)
  Conversation history:      ~3,000 tokens (last 10 turns × 300 tokens)
  RAG retrieved context:     ~2,500 tokens (top 5 chunks × 500 tokens)
  Current user message:        ~100 tokens
  Response buffer:           ~1,000 tokens
  ─────────────────────────────────────────
  Total used:                ~7,100 tokens
  Context headroom:         ~193,000 tokens unused

For typical chatbot use cases, you use ~4% of Claude's context window. The 10-turn limit is about cost efficiency and relevance, not window size.
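The budget is simple enough to keep as a checked constant in code; a sketch with the numbers copied from the table above (the names are invented here):

```python
# Token budget from the table above (all values approximate)
TOKEN_BUDGET = {
    "system_prompt": 500,
    "conversation_history": 10 * 300,  # last 10 turns × ~300 tokens
    "rag_context": 5 * 500,            # top 5 chunks × ~500 tokens
    "user_message": 100,
    "response_buffer": 1000,
}
CONTEXT_WINDOW = 200_000  # Claude 3.5 Sonnet

used = sum(TOKEN_BUDGET.values())    # 7,100 tokens
utilization = used / CONTEXT_WINDOW  # ~3.6%; the window is not the constraint
```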

Session continuity across page refreshes:
- Session ID stored in a browser cookie (HttpOnly, Secure, SameSite=Strict).
- If the user refreshes the page, the frontend reconnects the WebSocket with the same session ID.
- The session cookie survives page refresh (unlike in-memory state).
- History is loaded fresh from DynamoDB on reconnect.
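A framework-agnostic sketch of those cookie attributes (the helper name is invented; real frameworks expose equivalents such as a `set_cookie` method):

```python
def session_cookie_header(session_id: str) -> str:
    """Build the Set-Cookie value: HttpOnly blocks JS access, Secure requires
    HTTPS, SameSite=Strict keeps the cookie off cross-site requests."""
    return f"session_id={session_id}; Path=/; HttpOnly; Secure; SameSite=Strict"
```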


Q23. DynamoDB elevated latency — how do you protect the user experience?

Short Answer

Circuit breaker → fall back to stateless mode → degrade gracefully. ElastiCache hot path for recent turns.

Deep Dive

Understanding the risk:
- DynamoDB p99 latency goal: <10ms.
- If DynamoDB hits elevated latency (say, 500ms–2s due to hot partitions or capacity issues), the chatbot's response time balloons.
- Worse: if DynamoDB times out entirely, the Orchestrator has two options: fail the request, or proceed without history.

Solution 1: ElastiCache hot path

Primary:  DynamoDB (cold, persistent)
Hot Path: ElastiCache Redis (warm, last 5 turns, TTL 1hr)

Read path:
  1. Try Redis (sub-millisecond) ──► HIT: return immediately
                                    MISS: fall through
  2. Read from DynamoDB (~5ms)  ──► Success: populate Redis, return
                                    Timeout: fall back to stateless mode
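The read path above can be sketched with the clients passed in explicitly (the `query_with_timeout` call and JSON serialization are assumptions of this sketch, not a fixed API):

```python
import asyncio
import json

async def load_history_with_fallback(redis, dynamodb, session_id: str,
                                     last_n: int = 5) -> list:
    """Redis hot path first, DynamoDB cold path second, stateless on timeout."""
    key = f"session:{session_id}"

    # 1. Hot path: Redis. LPUSH stores newest-first, so reverse for
    #    chronological order.
    cached = await redis.lrange(key, 0, last_n - 1)
    if cached:
        return [json.loads(t) for t in reversed(cached)]

    # 2. Cold path: DynamoDB with a tight timeout so a slow table cannot
    #    stall the whole response.
    try:
        history = await dynamodb.query_with_timeout(session_id, timeout_ms=200)
    except TimeoutError:
        return []  # Stateless fallback: answer without history

    # 3. Repopulate the hot cache for subsequent reads (newest ends up first).
    for turn in history[-last_n:]:
        await redis.lpush(key, json.dumps(turn))
    await redis.ltrim(key, 0, last_n - 1)
    return history
```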

Write path (write-through cache):

async def save_turn(session_id: str, turn: ConversationTurn):
    # Write to both simultaneously
    await asyncio.gather(
        dynamodb.put_item(session_id, turn),    # Persistent
        redis.lpush(f"session:{session_id}", turn.serialize())  # Hot cache
    )
    # Trim Redis to last 5 turns
    await redis.ltrim(f"session:{session_id}", 0, 4)

Solution 2: Circuit Breaker Pattern

class DynamoDBCircuitBreaker:
    def __init__(self):
        self.state = "CLOSED"        # Normal operation
        self.failure_times = []      # Timestamps of recent failures
        self.failure_threshold = 5   # Open after 5 failures in 10s
        self.window = 10             # Failure-counting window (seconds)
        self.timeout = 30            # Stay open for 30 seconds

    async def read_history(self, session_id: str) -> list:
        if self.state == "OPEN":
            # Don't even try DynamoDB — fail fast
            logger.warning("Circuit OPEN: returning empty history for stateless mode")
            return []  # Stateless fallback

        # CLOSED reads normally; HALF-OPEN lets this request through as the probe
        try:
            history = await dynamodb.query_with_timeout(session_id, timeout_ms=200)
            self.on_success()
            return history
        except TimeoutError:
            self.on_failure()
            return []  # Stateless fallback

    def on_success(self):
        self.state = "CLOSED"        # HALF-OPEN probe succeeded (or normal read)
        self.failure_times = []

    def on_failure(self):
        now = time.monotonic()
        # Keep only failures inside the 10-second window, then add this one
        self.failure_times = [t for t in self.failure_times if now - t < self.window]
        self.failure_times.append(now)
        if self.state == "HALF-OPEN" or len(self.failure_times) >= self.failure_threshold:
            self.state = "OPEN"
            cloudwatch.put_metric("DynamoDBCircuitOpen", 1)
            # Auto-recover: schedule HALF-OPEN check after timeout
            asyncio.create_task(self.schedule_recovery())

    async def schedule_recovery(self):
        await asyncio.sleep(self.timeout)
        self.state = "HALF-OPEN"  # Allow one probe request

Circuit breaker state machine:

CLOSED (normal) ──[5 failures in 10s]──► OPEN (fail fast)
    ▲                                          │
    │                                          │ (30s timeout)
    └──────[probe succeeds]──── HALF-OPEN ◄────┘
                                     │
                              [probe fails]
                                     │
                                  OPEN (reset timer)

Solution 3: Graceful degradation levels

Level 0 (Normal):        Full history loaded from cache or DynamoDB
Level 1 (Cache only):    Only last 5 turns from Redis (if DynamoDB slow)
Level 2 (Stateless):     Process current message only, no history
                         → Add user-facing message: "I've lost our conversation
                           context. Could you briefly remind me what we were discussing?"
Level 3 (Full fallback): If Redis also fails, proceed with zero context but
                         never block the user from getting a response

User messaging when degraded:

// Level 2 stateless mode — be transparent but not alarming
{
  "response": "I'm having a brief technical hiccup with my memory. 
               I can still help you! What were you looking for?",
  "debug_metadata": { "memory_mode": "stateless", "reason": "dynamo_timeout" }
}

Monitoring:

CloudWatch Alarms:
  - Metric: DynamoDB/SuccessfulRequestLatency (p99)
    Threshold: > 50ms for 3 consecutive minutes
    Action: PagerDuty alert + SNS notification

  - Metric: Custom/DynamoDBCircuitOpen
    Threshold: > 0
    Action: Critical PagerDuty alert

  - Metric: Custom/StatelessModeSessions
    Threshold: > 1% of sessions
    Action: Warning alert — degradation is user-visible
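As a configuration sketch, the first alarm could be created via boto3 (the SNS ARN is a placeholder; dimension names follow CloudWatch's DynamoDB namespace):

```python
import boto3  # assumed available with AWS credentials configured

cloudwatch = boto3.client("cloudwatch")

# p99 latency alarm: 3 consecutive one-minute periods above 50ms
cloudwatch.put_metric_alarm(
    AlarmName="ChatSessions-Query-p99-latency",
    Namespace="AWS/DynamoDB",
    MetricName="SuccessfulRequestLatency",
    Dimensions=[
        {"Name": "TableName", "Value": "ChatSessions"},
        {"Name": "Operation", "Value": "Query"},
    ],
    ExtendedStatistic="p99",  # percentile stats use ExtendedStatistic, not Statistic
    Period=60,
    EvaluationPeriods=3,
    Threshold=50.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall"],  # placeholder ARN
)
```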

Prevention: hot partition mitigation
- Session IDs use UUID v4 (random) as the partition key → evenly distributed across DynamoDB partitions.
- Never use sequential IDs (1, 2, 3…) or customer IDs as partition keys — they create hot partitions.
- DynamoDB on-demand capacity automatically handles burst scaling.
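The first point is one line of code; a sketch (the `sess_` prefix matches the IDs shown earlier, and the helper name is invented):

```python
import uuid

def new_session_id() -> str:
    """UUID v4 is random, so writes spread evenly across DynamoDB partitions."""
    return f"sess_{uuid.uuid4().hex}"
```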