08 — Memory Architecture
Three tiers of state, two summarization passes, one prompt cache. The architecture that makes a stateless Lambda look like a stateful conversation.
A user expects the chatbot to remember them. A Lambda forgets everything between invocations. The memory architecture is what bridges that gap — and how well it works determines whether the chatbot feels coherent across turns or amnesiac.
The three tiers
| Tier | Store | TTL | Purpose | Read latency |
|---|---|---|---|---|
| Conversation (short-term) | ElastiCache Redis | 30 min | Recent turns, active intent, pending tool results | ~2ms |
| Session (medium-term) | DynamoDB | 24 hours | Full turn history, summary, extracted entities | ~15ms |
| Profile (long-term) | DynamoDB | Permanent | Genres, purchases, preferences, escalation history | ~15ms |
Reads on every turn pull from all three (the read path is sketched below). Writes are scattered:
- Conversation tier: written every turn
- Session tier: written every 5 turns and at session close
- Profile tier: written by the UserService asynchronously, never directly by the chatbot
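A minimal sketch of that read path, assuming redis-py and boto3 clients. The session key shape and table name (`MangaAssistSessions`) follow this doc; the profile table name (`UserProfiles`) and the helper itself are illustrative, not the production code:

```python
import json

import boto3
import redis

r = redis.Redis(host="cache.example.internal", port=6379)
ddb = boto3.resource("dynamodb")
sessions = ddb.Table("MangaAssistSessions")
profiles = ddb.Table("UserProfiles")  # hypothetical table name

def load_memory(session_id: str, user_id: str) -> dict:
    """Pull all three tiers; Redis is the hot path, DynamoDB the fallback."""
    raw = r.get(f"session:{session_id}:context")                           # ~2ms
    context = json.loads(raw) if raw else None
    profile = profiles.get_item(Key={"user_id": user_id}).get("Item", {})  # ~15ms
    if context is None:
        # Redis evicted or TTL expired: lazily rebuild from the session tier.
        resp = sessions.query(                                             # ~20ms
            KeyConditionExpression="session_id = :sid",
            ExpressionAttributeValues={":sid": session_id},
            ScanIndexForward=False,  # newest first; ts#... items sort after SUMMARY
            Limit=3,                 # last 3 turns only
        )
        context = {"turns": list(reversed(resp["Items"])),
                   "extracted_entities": {}}
    return {"conversation": context, "profile": profile}
```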
Tier 1: Conversation (Redis)
The hottest data path. Redis holds whatever the next turn might need within 30 minutes:
    session:{session_id}:context = {
      "turns": [
        {"role": "user", "content": "...", "ts": "..."},
        {"role": "assistant", "content": "...", "ts": "..."},
        ...
      ],
      "active_intent": "order_tracking",
      "pending_tool_results": {...},
      "extracted_entities": {
        "order_id": "ORD-12345",
        "asin": "B07X1234"
      }
    }
The 30-minute TTL is a UX choice: a user who steps away and returns within half an hour resumes mid-conversation. Past 30 minutes, they get a fresh greeting (we still load profile data; we just don't pretend to remember the recent dialogue).
Redis is the only tier with sub-10ms latency. That's why the per-turn Initialize step (~50ms in the Orchestrator's budget) can include this read.
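The conversation-tier write, sketched under the same assumptions (redis-py; key shape and TTL from this section). `SET` with `EX` refreshes the window on every turn:

```python
import json
import time

import redis

r = redis.Redis(host="cache.example.internal", port=6379)
TTL_SECONDS = 30 * 60  # the 30-minute resume window

def append_turn(session_id: str, role: str, content: str) -> None:
    key = f"session:{session_id}:context"
    raw = r.get(key)
    context = json.loads(raw) if raw else {"turns": [], "extracted_entities": {}}
    context["turns"].append({
        "role": role,
        "content": content,
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    })
    # SET with EX refreshes the TTL: a user who returns inside 30 minutes
    # resumes mid-conversation; past that, the key is simply gone.
    r.set(key, json.dumps(context), ex=TTL_SECONDS)
```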
Tier 2: Session (DynamoDB)
DynamoDB is durable and survives Redis evictions. Schema (in the MangaAssistSessions table):
    PK: session_id
    SK: ts#{turn_id}
    attrs:
      - role
      - content
      - intent
      - tool_calls
      - tool_results
      - latency_ms
      - cache_hit
A second item type holds the running summary:
    PK: session_id
    SK: SUMMARY
    attrs:
      - summary_text
      - last_summarized_turn
      - summary_version
After every 10 turns, the Orchestrator triggers a summarization pass: a separate Bedrock call that compresses the last 10 turns into a paragraph. The summary is stored, the turn buffer continues to grow, and the prompt uses summary + last 3 turns instead of all 13.
This is what keeps the context window bounded:
    Without summarization (turn 50):
      System prompt + 50 turns ≈ 30K tokens → expensive, slow, prone to losing focus

    With summarization (turn 50):
      System prompt + summary (~500 tokens) + last 3 turns (~1.5K) ≈ 5K tokens
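How the bounded prompt is assembled is implied rather than shown. A sketch, treating the stored summary as a leading exchange — the exact injection format is an assumption, not the documented one:

```python
def build_messages(summary_text: str, recent_turns: list[dict],
                   user_msg: str) -> list[dict]:
    """Bounded context: running summary + last 3 turns + current message."""
    messages = []
    if summary_text:
        # The summary stands in for everything before the last 3 turns.
        # Injecting it as a faux user/assistant exchange is one common pattern.
        messages.append({"role": "user",
                         "content": f"[Conversation so far]\n{summary_text}"})
        messages.append({"role": "assistant", "content": "Understood."})
    for turn in recent_turns[-3:]:
        messages.append({"role": turn["role"], "content": turn["content"]})
    messages.append({"role": "user", "content": user_msg})
    return messages
```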
Tier 3: Profile (DynamoDB)
Owned by the UserService, not the chatbot. The chatbot reads but never writes directly. Schema:
    PK: user_id
    attrs:
      - favorite_genres: ["dark_fantasy", "seinen"]
      - language_pref: "en"
      - purchase_history_summary: {...}
      - escalation_history: [...]
      - last_updated: "..."
The chatbot's contribution to profile data goes through an async pipeline: significant signals (preference statements, genre likes) get published to Kinesis; the UserService consumes and updates the profile.
Why async? Because profile writes from the chatbot would race with profile writes from other surfaces (recommendations team, marketing team, etc.). Funneling through one service prevents inconsistent state.
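A sketch of the publish side, assuming boto3; the stream name (`user-preference-signals`) and event shape are hypothetical:

```python
import json

import boto3

kinesis = boto3.client("kinesis")

def publish_preference_signal(user_id: str, signal_type: str, value: str) -> None:
    kinesis.put_record(
        StreamName="user-preference-signals",  # hypothetical stream name
        PartitionKey=user_id,                  # keeps one user's signals ordered
        Data=json.dumps({
            "user_id": user_id,
            "type": signal_type,   # e.g. "genre_like", "language_pref"
            "value": value,
            "source": "chatbot",
        }),
    )
```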
Summarization: how it actually works
    Trigger: turn count modulo 10 == 0
        ↓
    Load last 10 turns from DynamoDB
        ↓
    Bedrock call:
      System: "Summarize this conversation, preserving:
               - User intents
               - Entities discussed (titles, ASINs, order IDs)
               - User preferences expressed
               - Outcomes / decisions"
        ↓
    Returns ~200-500 word summary
        ↓
    Append to existing summary (chained, not replacing)
        ↓
    Store as SUMMARY item in session
        ↓
    Continue conversation; subsequent turns include summary + last 3 turns only
Chained summarization is the failure mode to watch. If turn 10's summary is fed into turn 20's summarization, then turn 30's, the original details degrade across passes (it's a telephone game). Mitigation: always summarize from the original turns in DynamoDB, not from the prior summary. The prior summary is included as context but the new summary is generated against raw turns where possible.
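A sketch of that mitigation using the Bedrock Converse API. The model ID and prompt wording are assumptions; the point is that the transcript, not the prior summary, is the source of truth:

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

def resummarize(raw_turns: list[dict], prior_summary: str) -> str:
    """Regenerate the summary from raw turns; prior summary is context only."""
    transcript = "\n".join(f"{t['role']}: {t['content']}" for t in raw_turns)
    resp = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # assumed: a cheap model suffices
        system=[{"text": "Summarize this conversation, preserving user intents, "
                         "entities (titles, ASINs, order IDs), preferences "
                         "expressed, and outcomes/decisions."}],
        messages=[{"role": "user", "content": [{"text":
            f"Prior summary (context only; the transcript wins on conflict):\n"
            f"{prior_summary}\n\nTranscript:\n{transcript}"}]}],
    )
    return resp["output"]["message"]["content"][0]["text"]
```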
Prompt caching
Bedrock supports prompt caching with a 5-minute TTL. We exploit this aggressively:
    [STATIC, cached → ~5K tokens, hits cache 90%+ in steady state]
      System prompt
      Tool manifest (descriptions + schemas)
      Hard constraints

    [DYNAMIC, fresh per request → ~2K tokens, never cached]
      Conversation summary
      Last 3 turns
      Current user message
      Tool results (if mid-loop)
The static prefix is identical across requests as long as:
- The tool list doesn't change
- The system prompt doesn't change
- The hard constraints don't change
Any deploy that touches these busts the cache for the next 5 minutes. Cost and latency spike during deploys. We schedule deploys outside peak traffic for this reason.
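A sketch of the split using the Converse API's `cachePoint` content block (available on caching-capable models). The model ID is assumed, and the tool manifest is shown inlined into the system prefix for brevity — in practice it lives in `toolConfig`:

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

def converse_with_cache(system_prompt: str, tool_manifest_text: str,
                        constraints: str, dynamic_messages: list[dict]):
    system = [
        {"text": system_prompt},       # static: identical across requests
        {"text": tool_manifest_text},  # static
        {"text": constraints},         # static
        # Everything above this marker is cached (~5-minute TTL).
        {"cachePoint": {"type": "default"}},
    ]
    return bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",  # assumed model
        system=system,
        # Summary, last 3 turns, current message: fresh per request, never cached.
        messages=dynamic_messages,
    )
```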
Entity persistence
The chatbot extracts named entities from each turn:
- ASINs mentioned
- Series names
- Order IDs
- Volume numbers
- Languages
These are stored in session:{session_id}:context.extracted_entities and automatically injected into subsequent turns. The Orchestrator's system prompt has a section:
    Currently active entities (use these unless overridden):
      series: Berserk
      volume: 42
      asin: B07X1234
This is what lets the user say "is it in stock?" three turns after they first mentioned Berserk vol 42 — Claude already has the ASIN as context, no need to re-extract from the user message.
Entities expire when (the replacement rule is sketched below):
- The user explicitly switches subject ("forget that, show me Naruto")
- A new entity of the same type replaces them
- The session times out (Redis TTL)
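A minimal sketch of the merge and reset rules. The extraction itself is done by the LLM upstream; this only shows what happens to the `extracted_entities` map:

```python
def update_entities(active: dict, extracted: dict, user_reset: bool) -> dict:
    """Apply this turn's extractions to the active entity map."""
    if user_reset:           # "forget that, show me Naruto"
        active = {}
    for etype, value in extracted.items():
        active[etype] = value  # same-type replacement: a new asin evicts the old
    return active
```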
Memory consistency
The three tiers can disagree. Cases:
| Inconsistency | Cause | Resolution |
|---|---|---|
| Redis has fresh entities, DynamoDB doesn't | Redis written every turn, DynamoDB every 5 turns | Redis wins for active session |
| DynamoDB has session, Redis evicted | Redis TTL passed | Reload Redis from DynamoDB lazily |
| Profile says English, Redis says Japanese (user just switched) | Active conversation overrides profile | Conversation tier wins for current session |
| Two devices in same session | User on phone + laptop simultaneously | Last write wins on Redis; DynamoDB conditional write prevents lost-update |
The rule of thumb: proximity wins. The closer-to-the-user tier reflects the most current truth.
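The conditional write in the last row might look like this. The sort-key attribute name (`sk`) is an assumption; the key shapes follow the schema above:

```python
import boto3
from botocore.exceptions import ClientError

sessions = boto3.resource("dynamodb").Table("MangaAssistSessions")

def put_turn_once(session_id: str, turn_id: int, item: dict) -> bool:
    """Conditional write: the second device to write the same turn loses."""
    try:
        sessions.put_item(
            Item={"session_id": session_id, "sk": f"ts#{turn_id}", **item},
            ConditionExpression="attribute_not_exists(sk)",
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # another device already wrote this turn
        raise
```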
Latency budget
Memory operations on every turn:
| Stage | Latency |
|---|---|
| Redis GET (conversation) | 2ms |
| DynamoDB GetItem (profile) | 15ms |
| DynamoDB Query (last 3 turns) | 20ms (only on Redis miss) |
| Compose context block | 5ms |
| **Total (Redis hit)** | **22ms** |
| **Total (Redis miss)** | **42ms** |
Within the Orchestrator's 50ms Initialize budget. Comfortable.
End-of-turn writes:
| Write | Latency |
|---|---|
| Redis SET (conversation) | 3ms (mostly fire-and-forget) |
| DynamoDB PutItem (turn record) | 20ms (async; doesn't block the response) |
These happen in parallel with response streaming.
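A sketch of those parallel writes, assuming a thread pool; in Lambda the futures must still be joined before the handler returns, but they run while the response streams:

```python
import json
from concurrent.futures import ThreadPoolExecutor

import boto3
import redis

r = redis.Redis(host="cache.example.internal", port=6379)
sessions = boto3.resource("dynamodb").Table("MangaAssistSessions")
pool = ThreadPoolExecutor(max_workers=2)

def end_of_turn(session_id: str, turn_id: int, context: dict, turn_record: dict):
    """Kick off both end-of-turn writes; the response streams while they run."""
    futures = [
        pool.submit(r.set, f"session:{session_id}:context",
                    json.dumps(context), ex=30 * 60),               # ~3ms
        pool.submit(sessions.put_item, Item={
            "session_id": session_id, "sk": f"ts#{turn_id}", **turn_record,
        }),                                                         # ~20ms
    ]
    return futures  # join these before the Lambda handler returns
```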
Why this shape
| Alternative | Why we rejected it |
|---|---|
| Single tier (DynamoDB only) | 15ms reads × per-turn = adds latency; Redis at 2ms is materially better |
| Single tier (Redis only) | Eviction = data loss; fine for ephemeral but not for session history we need to recover |
| In-Lambda memory | Lambdas are stateless; works for tiny scope only |
| ElastiCache for everything (no DynamoDB) | No durability; expensive for permanent storage |
| No summarization (full history every request) | Token cost grows linearly; >50K tokens at turn 50; cost and latency explode |
| Vector store for memory | Overkill for this scope; structured entity extraction is cheaper and more precise |
Validation: Constraint Sanity Check
| Claimed metric / mechanism | Verdict | Why |
|---|---|---|
| Redis read 2ms | Realistic for ElastiCache | Sub-2ms P50 holds. P99 with network jitter and eviction can hit 5–8ms. Single-digit ms throughout. |
| DynamoDB read 15ms | Realistic with provisioned capacity | P99 of 15ms is achievable with on-demand or well-provisioned tables in same-region. Cold-region reads or under-provisioned tables can hit 50–100ms. |
| Summarize every 10 turns | Inflexible threshold | A user who has 10 short clarification turns shouldn't trigger summarization. A user with 5 long, fact-dense turns might benefit from earlier summarization. Token-based threshold (summarize when context > 8K tokens) is more principled. |
| Summarization preserves user intents and entities | Best-effort, no validation | The summary is generated by an LLM; nothing checks that the summary actually preserved the entities mentioned. Common drift: entity names get aggregated ("various titles") or lost across passes. Should be tested with eval data showing entity preservation rate over multiple summarizations. |
| Chained summarization mitigation by re-reading raw turns | Correct principle, costly | Re-summarizing from raw turns means at turn 50 we summarize turns 1–50, not just the last 10 over the prior summary. That's a much longer Bedrock call (3K input vs. 1.5K). The doc says "where possible" — what triggers the fallback isn't specified. |
| Prompt cache hit rate 90%+ in steady state | Optimistic; drops on deploys | Steady state holds. Deploys, traffic gaps > 5 min, A/B-test variants, or any change to system prompt drops hit rate. The 90% number is the high-water mark, not the average. Realistic 24-hour average: 70–85%. |
| Three-tier consistency: "proximity wins" | Sound rule, race conditions exist | Two near-simultaneous writes from two devices: Redis SET races. DynamoDB conditional write protects DynamoDB but Redis can lose. Result: visible inconsistency for ~1 second. Real impact is rare but possible. |
| Entity persistence in active session | Right pattern, fragile to mis-extraction | Entities are extracted by the LLM. A bad extraction ("ASIN: B07-1234" instead of "B07X1234") persists and pollutes downstream calls until expired. No validation that extracted entities are well-formed. |
| 30-min Redis TTL for conversation | UX choice, not load-tuned | Why 30 minutes and not 15 or 60? The doc says "UX choice" without data. Real choice should consider: average session length, abandonment rate, cost of cache size, recovery cost from DynamoDB. None quoted. |
| Profile owned by UserService, written async | Right architecture, lag matters | If the user says "I love seinen" and asks for recommendations 30 seconds later, has the profile updated? Probably not (Kinesis → consumer → DynamoDB lag is ~5–30s). The agent should rely on conversation tier for in-session signals, not profile. The doc says "proximity wins" but doesn't explicitly call out that profile may lag. |
| Summarization triggers a separate Bedrock call | Latency cost not accounted | The summarization call is itself ~500ms–1s. If it runs synchronously every 10 turns, every 10th user response is delayed. Should run async after response is sent. Not documented. |
| Schedule deploys outside peak | Operational handwave | "We schedule outside peak" is fine until an emergency deploy is needed. The cache-bust cost during a forced deploy is real and the doc doesn't quantify it. |
The biggest issue: summarization quality has no closed loop
Summarization is the linchpin that keeps long sessions from blowing up the context window. But:
- The summary is generated by an LLM
- The summary becomes the input to the next inference
- If the summary loses information, the next inference can't recover it
- Nothing checks that the summary preserved what mattered
A real defense, sketched in code below:
1. Maintain a structured slot list (active_entities, unresolved_intents, key_decisions)
2. After each summarization, verify each slot is still present
3. If a slot was lost, re-summarize with explicit instructions to preserve it
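A minimal sketch of that closed loop, assuming slots are tracked as a flat dict of entity values and a summarizer with the `resummarize`-style signature shown earlier. The exact-match check is deliberately naive; a real one would normalize:

```python
def verify_summary(summary: str, slots: dict) -> list[str]:
    """Return slot values the summary failed to preserve (naive exact match)."""
    return [str(v) for v in slots.values() if str(v) not in summary]

def summarize_with_check(raw_turns: list[dict], prior_summary: str,
                         slots: dict, summarize_fn, max_retries: int = 2) -> str:
    summary = summarize_fn(raw_turns, prior_summary)
    for _ in range(max_retries):
        missing = verify_summary(summary, slots)
        if not missing:
            break
        # Re-summarize with explicit instructions to preserve the lost slots.
        hint = "\nThe new summary MUST mention verbatim: " + ", ".join(missing)
        summary = summarize_fn(raw_turns, prior_summary + hint)
    return summary
```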
Without this, summarization is a slow leak. At turn 30, the chatbot remembers everything. At turn 80, it remembers the gist. At turn 200, it remembers the gist of the gist.
The 10-turn threshold is wrong
10 turns is a count, not a signal. What we really care about is context window pressure. A user with 10 chat-style turns ("hi", "okay", "thanks", etc.) doesn't need summarization. A user with 5 dense turns full of tool results does.
Replace with: summarize when total context tokens exceed 8K, OR every 20 turns, whichever comes first.
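A sketch of that trigger, using a rough 4-characters-per-token estimate; a real implementation would count with the model's tokenizer:

```python
def should_summarize(turns: list[dict], turns_since_summary: int,
                     token_budget: int = 8_000, max_turns: int = 20) -> bool:
    """Summarize on context pressure, with a turn-count backstop."""
    est_tokens = sum(len(t["content"]) for t in turns) // 4  # crude estimate
    return est_tokens > token_budget or turns_since_summary >= max_turns
```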
Profile-tier lag breaks "fresh preferences"
The chatbot pushes preference signals to Kinesis. The UserService writes them to DynamoDB asynchronously, with lag of seconds-to-minutes. Within a session, the profile is stale relative to the conversation. The architecture relies on the conversation tier reflecting fresh signals, but:
- A user logs out and comes back next day → profile may still not reflect yesterday's signals if Kinesis backlog
- A user switches devices mid-session → second device reads profile (which is stale)
Real fix: write preferences synchronously to a "session-scoped profile cache" in Redis, replicated async to DynamoDB. Same shape as the rest of the architecture, applied consistently.
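A sketch of that write-through shape. The overlay key, stream name, and the assumption that `pref` is a flat dict of scalar values are all illustrative:

```python
import json

import boto3
import redis

r = redis.Redis(host="cache.example.internal", port=6379)
kinesis = boto3.client("kinesis")

def record_preference(session_id: str, user_id: str, pref: dict) -> None:
    # Synchronous: visible to this session's very next turn.
    overlay_key = f"session:{session_id}:profile_overlay"  # hypothetical key
    r.hset(overlay_key, mapping=pref)
    r.expire(overlay_key, 24 * 60 * 60)  # session-scoped, matches session tier
    # Asynchronous: the UserService remains the single writer of the real profile.
    kinesis.put_record(StreamName="user-preference-signals",  # hypothetical
                       PartitionKey=user_id,
                       Data=json.dumps({"user_id": user_id, **pref}))
```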
Redis TTL 30 min is unjustified
There's no data behind 30 minutes. Realistic considerations:
- Most chatbot sessions resolve in 3–5 minutes
- Some users multi-task and return after 15–45 minutes
- Cost of holding sessions in Redis is negligible at this scale
15 minutes would be cheaper and adequate; 60 minutes would be more user-friendly. 30 is in the middle without justification. A data-driven choice would inspect session-resumption rate by gap duration and pick the knee of the curve.
Prompt cache hit rate is a 24-hour average, not a number
The 90% target is true in steady state. Realistic 24-hour weighted average across deploys, traffic gaps, and tool-list edits: 70–80%. Cost models built on 90% will be 10–25% optimistic.
Remediation: track hit rate as a real-time metric with hourly granularity, not a single number. Set cost-model assumptions on the worst observed daily average, not the best.
Related documents
- 01-orchestrator-agent.md — Where memory is loaded and used per turn
- 06-tool-dispatch-and-routing.md — Why tool descriptions need to be cache-stable
- 09-escalation-workflow.md — Memory snapshot at handoff
- ../09-data-integrations.md — UserService and profile pipeline