08 — Memory Architecture
Three tiers of state, two summarization passes, one prompt cache. The architecture that makes a stateless Lambda look like a stateful conversation.
A user expects the chatbot to remember them. A Lambda forgets everything between invocations. The memory architecture is what bridges that gap — and how well it works determines whether the chatbot feels coherent across turns or amnesiac.
The three tiers
| Tier | Store | TTL | Purpose | Read latency |
|---|---|---|---|---|
| Conversation (short-term) | ElastiCache Redis | 30 min | Recent turns, active intent, pending tool results | ~2ms |
| Session (medium-term) | DynamoDB | 24 hours | Full turn history, summary, extracted entities | ~15ms |
| Profile (long-term) | DynamoDB | Permanent | Genres, purchases, preferences, escalation history | ~15ms |
Reads on every turn pull from all three (the read path is sketched below). Writes are scattered:
- Conversation tier: written every turn
- Session tier: written every 5 turns and at session close
- Profile tier: written by the UserService asynchronously, never directly by the chatbot
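A minimal sketch of that read path, assuming redis-py and boto3 clients. The session key shape and table name (`MangaAssistSessions`) follow this doc; the profile table name (`UserProfiles`) and the helper itself are illustrative, not the production code:

```python
import json

import boto3
import redis

r = redis.Redis(host="cache.example.internal", port=6379)
ddb = boto3.resource("dynamodb")
sessions = ddb.Table("MangaAssistSessions")
profiles = ddb.Table("UserProfiles")  # hypothetical table name

def load_memory(session_id: str, user_id: str) -> dict:
    """Pull all three tiers; Redis is the hot path, DynamoDB the fallback."""
    raw = r.get(f"session:{session_id}:context")                           # ~2ms
    context = json.loads(raw) if raw else None
    profile = profiles.get_item(Key={"user_id": user_id}).get("Item", {})  # ~15ms
    if context is None:
        # Redis evicted or TTL expired: lazily rebuild from the session tier.
        resp = sessions.query(                                             # ~20ms
            KeyConditionExpression="session_id = :sid",
            ExpressionAttributeValues={":sid": session_id},
            ScanIndexForward=False,  # newest first; ts#... items sort after SUMMARY
            Limit=3,                 # last 3 turns only
        )
        context = {"turns": list(reversed(resp["Items"])),
                   "extracted_entities": {}}
    return {"conversation": context, "profile": profile}
```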
Tier 1: Conversation (Redis)
The hottest data path. Redis holds whatever the next turn might need within 30 minutes:
    session:{session_id}:context = {
      "turns": [
        {"role": "user", "content": "...", "ts": "..."},
        {"role": "assistant", "content": "...", "ts": "..."},
        ...
      ],
      "active_intent": "order_tracking",
      "pending_tool_results": {...},
      "extracted_entities": {
        "order_id": "ORD-12345",
        "asin": "B07X1234"
      }
    }
The 30-minute TTL is a UX choice: a user who steps away and returns within half an hour resumes mid-conversation. Past 30 minutes, they get a fresh greeting (we still load profile data; we just don't pretend to remember the recent dialogue).
Redis is the only tier with sub-10ms latency. That's why the per-turn Initialize step (~50ms in the Orchestrator's budget) can include this read.
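The conversation-tier write, sketched under the same assumptions (redis-py; key shape and TTL from this section). `SET` with `EX` refreshes the window on every turn:

```python
import json
import time

import redis

r = redis.Redis(host="cache.example.internal", port=6379)
TTL_SECONDS = 30 * 60  # the 30-minute resume window

def append_turn(session_id: str, role: str, content: str) -> None:
    key = f"session:{session_id}:context"
    raw = r.get(key)
    context = json.loads(raw) if raw else {"turns": [], "extracted_entities": {}}
    context["turns"].append({
        "role": role,
        "content": content,
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    })
    # SET with EX refreshes the TTL: a user who returns inside 30 minutes
    # resumes mid-conversation; past that, the key is simply gone.
    r.set(key, json.dumps(context), ex=TTL_SECONDS)
```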
Tier 2: Session (DynamoDB)
DynamoDB is durable and survives Redis evictions. Schema (in the MangaAssistSessions table):
    PK: session_id
    SK: ts#{turn_id}
    attrs:
      - role
      - content
      - intent
      - tool_calls
      - tool_results
      - latency_ms
      - cache_hit
A second item type holds the running summary:
    PK: session_id
    SK: SUMMARY
    attrs:
      - summary_text
      - last_summarized_turn
      - summary_version
After every 10 turns, the Orchestrator triggers a summarization pass: a separate Bedrock call that compresses the last 10 turns into a paragraph. The summary is stored, the turn buffer continues to grow, and the prompt uses summary + last 3 turns instead of all 13.
This is what keeps the context window bounded:
    Without summarization (turn 50):
      System prompt + 50 turns ≈ 30K tokens → expensive, slow, prone to losing focus

    With summarization (turn 50):
      System prompt + summary (~500 tokens) + last 3 turns (~1.5K) ≈ 5K tokens
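How the bounded prompt is assembled is implied rather than shown. A sketch, treating the stored summary as a leading exchange — the exact injection format is an assumption, not the documented one:

```python
def build_messages(summary_text: str, recent_turns: list[dict],
                   user_msg: str) -> list[dict]:
    """Bounded context: running summary + last 3 turns + current message."""
    messages = []
    if summary_text:
        # The summary stands in for everything before the last 3 turns.
        # Injecting it as a faux user/assistant exchange is one common pattern.
        messages.append({"role": "user",
                         "content": f"[Conversation so far]\n{summary_text}"})
        messages.append({"role": "assistant", "content": "Understood."})
    for turn in recent_turns[-3:]:
        messages.append({"role": turn["role"], "content": turn["content"]})
    messages.append({"role": "user", "content": user_msg})
    return messages
```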
Tier 3: Profile (DynamoDB)
Owned by the UserService, not the chatbot. The chatbot reads but never writes directly. Schema:
    PK: user_id
    attrs:
      - favorite_genres: ["dark_fantasy", "seinen"]
      - language_pref: "en"
      - purchase_history_summary: {...}
      - escalation_history: [...]
      - last_updated: "..."
The chatbot's contribution to profile data goes through an async pipeline: significant signals (preference statements, genre likes) get published to Kinesis; the UserService consumes and updates the profile.
Why async? Because profile writes from the chatbot would race with profile writes from other surfaces (recommendations team, marketing team, etc.). Funneling through one service prevents inconsistent state.
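A sketch of the publish side, assuming boto3; the stream name (`user-preference-signals`) and event shape are hypothetical:

```python
import json

import boto3

kinesis = boto3.client("kinesis")

def publish_preference_signal(user_id: str, signal_type: str, value: str) -> None:
    kinesis.put_record(
        StreamName="user-preference-signals",  # hypothetical stream name
        PartitionKey=user_id,                  # keeps one user's signals ordered
        Data=json.dumps({
            "user_id": user_id,
            "type": signal_type,   # e.g. "genre_like", "language_pref"
            "value": value,
            "source": "chatbot",
        }),
    )
```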
Summarization: how it actually works
    Trigger: turn count modulo 10 == 0
        ↓
    Load last 10 turns from DynamoDB
        ↓
    Bedrock call:
      System: "Summarize this conversation, preserving:
               - User intents
               - Entities discussed (titles, ASINs, order IDs)
               - User preferences expressed
               - Outcomes / decisions"
        ↓
    Returns ~200-500 word summary
        ↓
    Append to existing summary (chained, not replacing)
        ↓
    Store as SUMMARY item in session
        ↓
    Continue conversation; subsequent turns include summary + last 3 turns only
Chained summarization is the failure mode to watch. If turn 10's summary is fed into turn 20's summarization, then turn 30's, the original details degrade across passes (it's a telephone game). Mitigation: always summarize from the original turns in DynamoDB, not from the prior summary. The prior summary is included as context but the new summary is generated against raw turns where possible.
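A sketch of that mitigation using the Bedrock Converse API. The model ID and prompt wording are assumptions; the point is that the transcript, not the prior summary, is the source of truth:

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

def resummarize(raw_turns: list[dict], prior_summary: str) -> str:
    """Regenerate the summary from raw turns; prior summary is context only."""
    transcript = "\n".join(f"{t['role']}: {t['content']}" for t in raw_turns)
    resp = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # assumed: a cheap model suffices
        system=[{"text": "Summarize this conversation, preserving user intents, "
                         "entities (titles, ASINs, order IDs), preferences "
                         "expressed, and outcomes/decisions."}],
        messages=[{"role": "user", "content": [{"text":
            f"Prior summary (context only; the transcript wins on conflict):\n"
            f"{prior_summary}\n\nTranscript:\n{transcript}"}]}],
    )
    return resp["output"]["message"]["content"][0]["text"]
```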
Prompt caching
Bedrock supports prompt caching with a 5-minute TTL. We exploit this aggressively:
    [STATIC, cached → ~5K tokens, hits cache 90%+ in steady state]
      System prompt
      Tool manifest (descriptions + schemas)
      Hard constraints

    [DYNAMIC, fresh per request → ~2K tokens, never cached]
      Conversation summary
      Last 3 turns
      Current user message
      Tool results (if mid-loop)
The static prefix is identical across requests as long as:
- The tool list doesn't change
- The system prompt doesn't change
- The hard constraints don't change
Any deploy that touches these busts the cache for the next 5 minutes. Cost and latency spike during deploys. We schedule deploys outside peak traffic for this reason.
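A sketch of the split using the Converse API's `cachePoint` content block (available on caching-capable models). The model ID is assumed, and the tool manifest is shown inlined into the system prefix for brevity — in practice it lives in `toolConfig`:

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

def converse_with_cache(system_prompt: str, tool_manifest_text: str,
                        constraints: str, dynamic_messages: list[dict]):
    system = [
        {"text": system_prompt},       # static: identical across requests
        {"text": tool_manifest_text},  # static
        {"text": constraints},         # static
        # Everything above this marker is cached (~5-minute TTL).
        {"cachePoint": {"type": "default"}},
    ]
    return bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",  # assumed model
        system=system,
        # Summary, last 3 turns, current message: fresh per request, never cached.
        messages=dynamic_messages,
    )
```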
Entity persistence
The chatbot extracts named entities from each turn:
- ASINs mentioned
- Series names
- Order IDs
- Volume numbers
- Languages
These are stored in session:{session_id}:context.extracted_entities and automatically injected into subsequent turns. The Orchestrator's system prompt has a section:
    Currently active entities (use these unless overridden):
      series: Berserk
      volume: 42
      asin: B07X1234
This is what lets the user say "is it in stock?" three turns after they first mentioned Berserk vol 42 — Claude already has the ASIN as context, no need to re-extract from the user message.
Entities expire when (the replacement rule is sketched below):
- The user explicitly switches subject ("forget that, show me Naruto")
- A new entity of the same type replaces them
- The session times out (Redis TTL)
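A minimal sketch of the merge and reset rules. The extraction itself is done by the LLM upstream; this only shows what happens to the `extracted_entities` map:

```python
def update_entities(active: dict, extracted: dict, user_reset: bool) -> dict:
    """Apply this turn's extractions to the active entity map."""
    if user_reset:           # "forget that, show me Naruto"
        active = {}
    for etype, value in extracted.items():
        active[etype] = value  # same-type replacement: a new asin evicts the old
    return active
```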
Memory consistency
The three tiers can disagree. Cases:
| Inconsistency | Cause | Resolution |
|---|---|---|
| Redis has fresh entities, DynamoDB doesn't | Redis written every turn, DynamoDB every 5 turns | Redis wins for active session |
| DynamoDB has session, Redis evicted | Redis TTL passed | Reload Redis from DynamoDB lazily |
| Profile says English, Redis says Japanese (user just switched) | Active conversation overrides profile | Conversation tier wins for current session |
| Two devices in same session | User on phone + laptop simultaneously | Last write wins on Redis; DynamoDB conditional write prevents lost-update |
The rule of thumb: proximity wins. The closer-to-the-user tier reflects the most current truth.
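The conditional write in the last row might look like this. The sort-key attribute name (`sk`) is an assumption; the key shapes follow the schema above:

```python
import boto3
from botocore.exceptions import ClientError

sessions = boto3.resource("dynamodb").Table("MangaAssistSessions")

def put_turn_once(session_id: str, turn_id: int, item: dict) -> bool:
    """Conditional write: the second device to write the same turn loses."""
    try:
        sessions.put_item(
            Item={"session_id": session_id, "sk": f"ts#{turn_id}", **item},
            ConditionExpression="attribute_not_exists(sk)",
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # another device already wrote this turn
        raise
```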
Latency budget
Memory operations on every turn:
| Stage | Latency |
|---|---|
| Redis GET (conversation) | 2ms |
| DynamoDB GetItem (profile) | 15ms |
| DynamoDB Query (last 3 turns) | 20ms (only on Redis miss) |
| Compose context block | 5ms |
| **Total (Redis hit)** | **22ms** |
| **Total (Redis miss)** | **42ms** |
Within the Orchestrator's 50ms Initialize budget. Comfortable.
End-of-turn writes:
| Write | Latency |
|---|---|
| Redis SET (conversation) | 3ms (mostly fire-and-forget) |
| DynamoDB PutItem (turn record) | 20ms (async; doesn't block the response) |
These happen in parallel with response streaming.
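A sketch of those parallel writes, assuming a thread pool; in Lambda the futures must still be joined before the handler returns, but they run while the response streams:

```python
import json
from concurrent.futures import ThreadPoolExecutor

import boto3
import redis

r = redis.Redis(host="cache.example.internal", port=6379)
sessions = boto3.resource("dynamodb").Table("MangaAssistSessions")
pool = ThreadPoolExecutor(max_workers=2)

def end_of_turn(session_id: str, turn_id: int, context: dict, turn_record: dict):
    """Kick off both end-of-turn writes; the response streams while they run."""
    futures = [
        pool.submit(r.set, f"session:{session_id}:context",
                    json.dumps(context), ex=30 * 60),               # ~3ms
        pool.submit(sessions.put_item, Item={
            "session_id": session_id, "sk": f"ts#{turn_id}", **turn_record,
        }),                                                         # ~20ms
    ]
    return futures  # join these before the Lambda handler returns
```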
Why this shape
| Alternative | Why we rejected it |
|---|---|
| Single tier (DynamoDB only) | 15ms reads × per-turn = adds latency; Redis at 2ms is materially better |
| Single tier (Redis only) | Eviction = data loss; fine for ephemeral but not for session history we need to recover |
| In-Lambda memory | Lambdas are stateless; works for tiny scope only |
| ElastiCache for everything (no DynamoDB) | No durability; expensive for permanent storage |
| No summarization (full history every request) | Token cost grows linearly; >50K tokens at turn 50; cost and latency explode |
| Vector store for memory | Overkill for this scope; structured entity extraction is cheaper and more precise |
Validation: Constraint Sanity Check
| Claimed metric / mechanism | Verdict | Why |
|---|---|---|
| Redis read 2ms | Realistic for ElastiCache | Sub-2ms P50 holds. P99 with network jitter and eviction can hit 5–8ms. Single-digit ms throughout. |
| DynamoDB read 15ms | Realistic with provisioned capacity | P99 of 15ms is achievable with on-demand or well-provisioned tables in same-region. Cold-region reads or under-provisioned tables can hit 50–100ms. |
| Summarize every 10 turns | Inflexible threshold | A user who has 10 short clarification turns shouldn't trigger summarization. A user with 5 long, fact-dense turns might benefit from earlier summarization. Token-based threshold (summarize when context > 8K tokens) is more principled. |
| Summarization preserves user intents and entities | Best-effort, no validation | The summary is generated by an LLM; nothing checks that the summary actually preserved the entities mentioned. Common drift: entity names get aggregated ("various titles") or lost across passes. Should be tested with eval data showing entity preservation rate over multiple summarizations. |
| Chained summarization mitigation by re-reading raw turns | Correct principle, costly | Re-summarizing from raw turns means at turn 50 we summarize turns 1–50, not just the last 10 over the prior summary. That's a much longer Bedrock call (3K input vs. 1.5K). The doc says "where possible" — what triggers the fallback isn't specified. |
| Prompt cache hit rate 90%+ in steady state | Optimistic; drops on deploys | Steady state holds. Deploys, traffic gaps > 5 min, A/B-test variants, or any change to system prompt drops hit rate. The 90% number is the high-water mark, not the average. Realistic 24-hour average: 70–85%. |
| Three-tier consistency: "proximity wins" | Sound rule, race conditions exist | Two near-simultaneous writes from two devices: Redis SET races. DynamoDB conditional write protects DynamoDB but Redis can lose. Result: visible inconsistency for ~1 second. Real impact is rare but possible. |
| Entity persistence in active session | Right pattern, fragile to mis-extraction | Entities are extracted by the LLM. A bad extraction ("ASIN: B07-1234" instead of "B07X1234") persists and pollutes downstream calls until expired. No validation that extracted entities are well-formed. |
| 30-min Redis TTL for conversation | UX choice, not load-tuned | Why 30 minutes and not 15 or 60? The doc says "UX choice" without data. Real choice should consider: average session length, abandonment rate, cost of cache size, recovery cost from DynamoDB. None quoted. |
| Profile owned by UserService, written async | Right architecture, lag matters | If the user says "I love seinen" and asks for recommendations 30 seconds later, has the profile updated? Probably not (Kinesis → consumer → DynamoDB lag is ~5–30s). The agent should rely on conversation tier for in-session signals, not profile. The doc says "proximity wins" but doesn't explicitly call out that profile may lag. |
| Summarization triggers a separate Bedrock call | Latency cost not accounted | The summarization call is itself ~500ms–1s. If it runs synchronously every 10 turns, every 10th user response is delayed. Should run async after response is sent. Not documented. |
| Schedule deploys outside peak | Operational handwave | "We schedule outside peak" is fine until an emergency deploy is needed. The cache-bust cost during a forced deploy is real and the doc doesn't quantify it. |
The biggest issue: summarization quality has no closed loop
Summarization is the linchpin that keeps long sessions from blowing up the context window. But:
- The summary is generated by an LLM
- The summary becomes the input to the next inference
- If the summary loses information, the next inference can't recover it
- Nothing checks that the summary preserved what mattered
A real defense, sketched in code below:
1. Maintain a structured slot list (active_entities, unresolved_intents, key_decisions)
2. After each summarization, verify each slot is still present
3. If a slot was lost, re-summarize with explicit instructions to preserve it
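A minimal sketch of that closed loop, assuming slots are tracked as a flat dict of entity values and a summarizer with the `resummarize`-style signature shown earlier. The exact-match check is deliberately naive; a real one would normalize:

```python
def verify_summary(summary: str, slots: dict) -> list[str]:
    """Return slot values the summary failed to preserve (naive exact match)."""
    return [str(v) for v in slots.values() if str(v) not in summary]

def summarize_with_check(raw_turns: list[dict], prior_summary: str,
                         slots: dict, summarize_fn, max_retries: int = 2) -> str:
    summary = summarize_fn(raw_turns, prior_summary)
    for _ in range(max_retries):
        missing = verify_summary(summary, slots)
        if not missing:
            break
        # Re-summarize with explicit instructions to preserve the lost slots.
        hint = "\nThe new summary MUST mention verbatim: " + ", ".join(missing)
        summary = summarize_fn(raw_turns, prior_summary + hint)
    return summary
```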
Without this, summarization is a slow leak. At turn 30, the chatbot remembers everything. At turn 80, it remembers the gist. At turn 200, it remembers the gist of the gist.
The 10-turn threshold is wrong
10 turns is a count, not a signal. What we really care about is context window pressure. A user with 10 chat-style turns ("hi", "okay", "thanks", etc.) doesn't need summarization. A user with 5 dense turns full of tool results does.
Replace with: summarize when total context tokens exceed 8K, OR every 20 turns, whichever comes first.
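A sketch of that trigger, using a rough 4-characters-per-token estimate; a real implementation would count with the model's tokenizer:

```python
def should_summarize(turns: list[dict], turns_since_summary: int,
                     token_budget: int = 8_000, max_turns: int = 20) -> bool:
    """Summarize on context pressure, with a turn-count backstop."""
    est_tokens = sum(len(t["content"]) for t in turns) // 4  # crude estimate
    return est_tokens > token_budget or turns_since_summary >= max_turns
```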
Profile-tier lag breaks "fresh preferences"
The chatbot pushes preference signals to Kinesis. The UserService writes them to DynamoDB asynchronously, with lag of seconds-to-minutes. Within a session, the profile is stale relative to the conversation. The architecture relies on the conversation tier reflecting fresh signals, but:
- A user logs out and comes back next day → profile may still not reflect yesterday's signals if Kinesis backlog
- A user switches devices mid-session → second device reads profile (which is stale)
Real fix: write preferences synchronously to a "session-scoped profile cache" in Redis, replicated async to DynamoDB. Same shape as the rest of the architecture, applied consistently.
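A sketch of that write-through shape. The overlay key, stream name, and the assumption that `pref` is a flat dict of scalar values are all illustrative:

```python
import json

import boto3
import redis

r = redis.Redis(host="cache.example.internal", port=6379)
kinesis = boto3.client("kinesis")

def record_preference(session_id: str, user_id: str, pref: dict) -> None:
    # Synchronous: visible to this session's very next turn.
    overlay_key = f"session:{session_id}:profile_overlay"  # hypothetical key
    r.hset(overlay_key, mapping=pref)
    r.expire(overlay_key, 24 * 60 * 60)  # session-scoped, matches session tier
    # Asynchronous: the UserService remains the single writer of the real profile.
    kinesis.put_record(StreamName="user-preference-signals",  # hypothetical
                       PartitionKey=user_id,
                       Data=json.dumps({"user_id": user_id, **pref}))
```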
Redis TTL 30 min is unjustified
There's no data behind 30 minutes. Realistic considerations:
- Most chatbot sessions resolve in 3–5 minutes
- Some users multi-task and return after 15–45 minutes
- Cost of holding sessions in Redis is negligible at this scale
15 minutes would be cheaper and adequate; 60 minutes would be more user-friendly. 30 is in the middle without justification. A data-driven choice would inspect session-resumption rate by gap duration and pick the knee of the curve.
Prompt cache hit rate is a 24-hour average, not a number
The 90% target is true in steady state. Realistic 24-hour weighted average across deploys, traffic gaps, and tool-list edits: 70–80%. Cost models built on 90% will be 10–25% optimistic.
Remediation: track hit rate as a real-time metric with hourly granularity, not a single number. Set cost-model assumptions on the worst observed daily average, not the best.
Related documents
- 01-orchestrator-agent.md — Where memory is loaded and used per turn
- 06-tool-dispatch-and-routing.md — Why tool descriptions need to be cache-stable
- 09-escalation-workflow.md — Memory snapshot at handoff
- ../09-data-integrations.md — UserService and profile pipeline