04 — RecommendationAgent
Personalized title recommendations from a multi-stage pipeline: collaborative filtering → graph traversal → preference re-ranking. Backed by Amazon Personalize, Neptune, DynamoDB.
When a user asks "what should I read next?" or "give me something like Berserk," the Orchestrator routes to the RecommendationAgent. This is the most stateful sub-agent — it depends heavily on user history, current session intent, and global trending signals.
What it is
A logical sub-agent that owns personalized discovery. Three responsibilities:
- Personalized recommendations — based on user history (purchases, views, ratings)
- Similarity-based recommendations — "more like X" via graph traversal
- Cold-start recommendations — for new users, fall back to trending and popular
It is backed by two MCP servers:
- User Preference MCP (../RAG-MCP-Integration/02-user-preferences-recommendation-mcp.md) — wraps Amazon Personalize + DynamoDB user vectors
- Cross-Title Link MCP (../RAG-MCP-Integration/07-cross-title-link-mcp.md) — wraps Neptune graph + OpenSearch metadata
Tools exposed to the Orchestrator
| Tool | Purpose | Backing |
|---|---|---|
| get_recommendations(user_id, count) | Top-N personalized picks | Personalize + DynamoDB |
| get_similar_titles(asin, count) | "If you liked X" | Neptune graph |
| get_trending(genre?, region) | Cold-start / casual browse | DynamoDB Streams + Kinesis |
| update_preference(user_id, signal) | Record preference (read/skip/like) | DynamoDB |
update_preference is the second mutating tool in the system (after initiate_return in OrderStatusAgent), but it's much lower-stakes — wrong preferences degrade future recs but don't cost money.
The three-stage recommendation pipeline
User intent: "recommend something like Berserk"
│
▼
Stage 1: Candidate generation (broad, fast)
├── Personalize: top-100 candidates from collab filtering
└── Neptune: top-100 graph neighbors of "Berserk"
│
▼ (merge, dedupe → ~150 candidates)
│
Stage 2: Feature enrichment
├── User preference vector (DynamoDB)
├── Item embedding (OpenSearch)
└── Trending signal (last-24h boost)
│
▼
Stage 3: Re-rank
└── Score = α·personalize_score
+ β·graph_proximity
+ γ·preference_alignment
+ δ·trending_boost
│
▼
Top-3 results → return to Orchestrator
Weights α, β, γ, δ are tuned per scenario:
- "Recommend something" → emphasis on personalize_score (α high)
- "Similar to X" → emphasis on graph_proximity (β high)
- "What's hot?" → emphasis on trending_boost (δ high)
The Orchestrator's intent classification picks the scenario; the agent applies the corresponding weight profile.
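The per-intent weighting can be sketched as a lookup from intent to weight profile, applied in the Stage-3 scorer. The profile names and weight values below are illustrative assumptions, not tuned constants from the real system:

```python
# Hypothetical weight profiles per Orchestrator intent; values are
# illustrative placeholders, not the production (untuned) constants.
WEIGHT_PROFILES = {
    "personalized": {"alpha": 0.6, "beta": 0.1, "gamma": 0.2, "delta": 0.1},
    "similar_to":   {"alpha": 0.1, "beta": 0.6, "gamma": 0.2, "delta": 0.1},
    "trending":     {"alpha": 0.1, "beta": 0.1, "gamma": 0.1, "delta": 0.7},
}

def rerank_score(candidate: dict, profile: dict) -> float:
    """Stage-3 score: weighted sum of the four pipeline signals."""
    return (
        profile["alpha"] * candidate["personalize_score"]
        + profile["beta"] * candidate["graph_proximity"]
        + profile["gamma"] * candidate["preference_alignment"]
        + profile["delta"] * candidate["trending_boost"]
    )

def rerank(candidates: list[dict], intent: str, top_n: int = 3) -> list[dict]:
    """Apply the intent's weight profile and return the top-N titles."""
    profile = WEIGHT_PROFILES[intent]
    return sorted(candidates, key=lambda c: rerank_score(c, profile),
                  reverse=True)[:top_n]
```

The same candidate set produces different orderings under different profiles, which is the whole point: "similar to Berserk" and "recommend me something" share a pipeline but not a ranking.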
Cold-start handling
Three cold-start regimes, each handled differently:
| Regime | Detection | Strategy |
|---|---|---|
| New user (no history) | DynamoDB profile empty | Pure trending + onboarding question ("what genres do you like?") |
| New item (just released) | Personalize hasn't seen it | Boost via metadata similarity (genre, author, themes) until interaction data accumulates |
| Sparse user (<5 interactions) | Profile exists but thin | Hybrid: trending dominates, personalize is a small lift |
The "tell us what you like" prompt for new users is friction, but recommendation quality without any signal is roughly random — better to ask than to embarrass ourselves with bad picks.
State management
| State | Store | Owner | TTL |
|---|---|---|---|
| User preference vector | DynamoDB | UserService (writes), this agent (reads) | Permanent |
| Interaction events (raw) | Kinesis → S3 | Event pipeline | 90 days |
| Item embeddings | OpenSearch | Catalog ingestion | Refreshed nightly |
| Graph (similarity edges) | Neptune | Recsys batch job | Refreshed nightly |
| Recommendation cache | ElastiCache | This agent | 1 hour per user |
| Trending list | DynamoDB | Trending pipeline | 5 minutes |
The 1-hour reco cache TTL is contentious — see Validation section.
Failure handling
| Failure | Detection | Recovery |
|---|---|---|
| Personalize timeout (>500ms) | Latency probe | Fallback to graph-only recs |
| Neptune timeout | Connection error | Fallback to Personalize-only recs |
| Both fail | Cascade | Trending list (last-known good) |
| Empty user profile | DynamoDB read returns null | Pure trending + onboarding prompt |
| New user but Personalize forced | Personalize returns generic top-100 | Detect "new user" flag, route to cold-start path |
| Graph traversal too deep | Depth limit | Cap traversal at depth 2; depth 3+ via batch precomputation only |
The double-fallback chain (Personalize → Graph → Trending) means the agent always returns something. The risk is silent quality degradation: if Personalize is down for 30 minutes, recommendations stay alive but get noticeably worse. Mitigation: emit a "degraded mode" metric on every fallback.
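The double-fallback chain with the degraded-mode metric can be sketched as follows. The three retrievers and the metrics sink are injected callables here (hypothetical interfaces, not the real MCP client signatures):

```python
def get_candidates(user_id, asin, personalize, neptune, trending, metrics):
    """Double-fallback chain: Personalize -> graph-only -> trending.
    `personalize`, `neptune`, and `trending` are injected callables;
    each may raise TimeoutError. A degraded-mode metric is emitted on
    every fallback so quality loss is never silent."""
    try:
        return personalize(user_id)
    except TimeoutError:
        metrics.append("degraded:personalize_down")
    try:
        return neptune(asin)
    except TimeoutError:
        metrics.append("degraded:graph_down")
    # Last-known-good trending list; assumed always available locally.
    return trending()
```

Note the ordering: the trending path never raises, so the agent always returns something, and the metrics list records exactly how degraded that something is.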
Latency budget
Target: P99 < 1s per tool call.
get_recommendations (cache hit):
Cache GET 5ms
Format 10ms
─────
Total 15ms
get_recommendations (cache miss):
Personalize call 300ms (parallel)
Neptune query 200ms (parallel)
Wait both (max = 300ms)
Feature fetch 100ms
Re-rank 50ms
Cache + format 20ms
─────
Total 470ms (P50)
~1.0s (P99)
get_similar_titles (cache miss):
Neptune (depth 2) 200ms
Enrichment 80ms
Re-rank 40ms
Format 10ms
─────
Total 330ms (P50)
~700ms (P99)
get_trending:
DynamoDB Streams cache lookup 30ms
Format 10ms
─────
Total 40ms (P50)
~150ms (P99)
Trending is fast because it's pre-computed; the heavy lift happens in the batch pipeline that maintains the trending list, not at request time.
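The cache-miss budget above works only because Stage 1 is concurrent: wall time is max(Personalize, Neptune), not their sum. A sketch of that fan-out with asyncio, using simulated delays in place of the real service calls (the retriever bodies are stand-ins, not real client code):

```python
import asyncio

async def candidate_generation(personalize_ms: float = 300,
                               neptune_ms: float = 200) -> list[str]:
    """Stage 1: run both retrievers concurrently, so wall time is
    max(personalize, neptune). Defaults are the P50 figures from the
    budget above; the sleep-based retrievers are simulations."""
    async def personalize() -> list[str]:
        await asyncio.sleep(personalize_ms / 1000)
        return [f"p{i}" for i in range(100)]  # top-100 collab-filter picks

    async def neptune() -> list[str]:
        await asyncio.sleep(neptune_ms / 1000)
        return [f"n{i}" for i in range(100)]  # top-100 graph neighbors

    p, n = await asyncio.gather(personalize(), neptune())
    # Merge + dedupe, preserving order; ~150 survive the typical overlap.
    return list(dict.fromkeys(p + n))
```

With the P50 numbers, max(300, 200) + 100 (enrichment) + 50 (re-rank) + 20 (cache/format) gives the 470ms total quoted above.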
Why this shape
| Alternative | Why we rejected it |
|---|---|
| Personalize alone | No "if you liked X" semantic similarity; collab filtering is item-blind |
| Neptune alone | No personalization; same recs for everyone with same query |
| LLM-generated recommendations | Catastrophic hallucination risk (made-up titles); not grounded in catalog |
| Pre-compute all (user, item) pairs | 10M users × 5M items = 50T cells; infeasible |
| Compute fully online (no cache) | Would breach latency budget; Personalize has its own SLA |
| Single fixed weight profile | Different intents need different weight emphasis |
Validation: Constraint Sanity Check
| Claimed metric | Verdict | Why |
|---|---|---|
| P99 < 1s per tool call | Aggressive | Multi-stage pipeline P99 stacks: Personalize (P99 ~700ms), Neptune (P99 ~400ms), enrichment (P99 ~250ms). Even with parallel candidate-gen, the slowest path dominates and re-rank is sequential. Realistic P99: 1.2–1.8s. |
| Personalize 300ms | Median, not P99 | Amazon Personalize GetRecommendations API: P50 ~150ms, P99 ~700ms. The 300ms here looks like an average, not a P99 budget. |
| Neptune depth-2 traversal 200ms | Realistic for cold cache | Gremlin queries with depth 2 over a graph with ~5M nodes and reasonable fan-out: 100–300ms. Holds. Depth 3+ is much worse — capped for good reason. |
| Reco cache TTL 1 hour | Too long for a shopping session | User behavior in a session changes faster than 1 hour: they click on a horror title, the next reco should reflect that. 1-hour cache means recs are frozen for the full session even after preference signals. Recommended: drop to 5 minutes, or invalidate on update_preference. |
| 10:1 cache hit ratio implied | Not stated, probably wrong | With 1-hour TTL, hit rate is high. With realistic 5-min TTL it drops to ~3:1 in active sessions. Either way, no measured number is quoted. |
| Personalize candidate top-100 + Neptune top-100 → ~150 deduped | Reasonable | Roughly 50/100 overlap is typical for related-item retrieval; 150 final candidates is in the right ballpark. |
| Re-rank weights α, β, γ, δ per scenario | No tuning data | Where do these weights come from? If they're set by hand based on intuition, the system has not been optimized — it's been guessed. A bandit or offline eval pipeline should drive these. |
| Cold-start onboarding prompt | Right approach, conversion risk | Asking new users "what do you like?" adds friction. Some will bounce. There's no quoted conversion-rate impact. The right answer for a real product is to A/B-test friction-vs-quality; the doc just asserts the choice. |
| Graph refresh nightly | Stale signal during the day | New manga reviews, purchases, and trending events all happen during the day. A graph that updates nightly misses 24h of signal. For a "currently hot manga" use case, this is borderline; for "what's similar," it's fine. |
| Trending list TTL 5 min | Realistic | Trending is precomputed from Kinesis aggregations. 5-min freshness is achievable and adequate for the use case. |
| Personalize for 10M users | Cost not discussed | Personalize charges per training hour and per inference. At 10M users with daily training, this is a real cost line; the architecture doc doesn't quote it. |
The biggest issue: 1-hour reco cache is wrong for shopping
The reco cache TTL was chosen, presumably, to absorb load. But shopping is fundamentally about signal acquisition — the user clicks one thing and reveals their intent for that session. A reco system that keeps showing the same recommendations after the user clicks on horror three times has failed at its job.
Two options:
- Drop TTL to 5 minutes. Simple. Loses some load-absorption.
- Event-driven invalidation. Every update_preference call invalidates the user's reco cache. More work but correctly modeled.
Option 2 is the right one. The current 1-hour TTL only makes sense if recommendations are computed per-day and don't react to in-session behavior, which is closer to email recs than chatbot recs.
The pipeline is sequential where it claims to be parallel
The doc says Personalize and Neptune candidate generation run in parallel. True. But re-ranking depends on both completing, and feature enrichment depends on the deduped candidate set. The end-to-end pipeline is therefore:
max(Personalize, Neptune) → Enrichment → Rerank
↑ parallel here ↑ sequential ↑ sequential
That's correctly modeled in the latency budget (470ms P50). The "parallel pipeline" framing earlier in the doc oversells it: only one stage is parallel.
Cold-start friction has no measured cost
"Tell us what you like" is the right answer in theory. In practice, every onboarding step has a measurable bounce rate, and recommendation systems often perform okay even on bare cold start using regional/demographic priors. The doc presents the friction as obviously correct without an A/B-test reference. This is a design hypothesis, not a validated choice.
Weight tuning has no closed loop
α, β, γ, δ per intent profile are stated but their values aren't given, and there's no mention of how they're learned or refreshed. A real recommendation system either:
- Learns weights via offline regression on engagement data
- Tunes via online bandit on live traffic
The current architecture has neither described. The weights are hand-tuned constants. This should be an explicit roadmap item, not buried.
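The offline-regression option is cheap to prototype: treat each past recommendation as a feature vector (the four pipeline signals) with a binary engagement label, and fit the weights by least squares. A stdlib-only gradient-descent sketch under those assumptions — not the system's actual tuning pipeline, which the doc confirms does not yet exist:

```python
def fit_weights(features: list[list[float]], engaged: list[float],
                lr: float = 0.1, steps: int = 2000) -> list[float]:
    """Fit (alpha, beta, gamma, delta) by least squares against observed
    engagement (1 = clicked, 0 = skipped). `features` rows hold the four
    pipeline signals per shown recommendation. A bandit would update the
    same weights online instead of in batch."""
    w = [0.25, 0.25, 0.25, 0.25]  # start from a uniform profile
    n = len(features)
    for _ in range(steps):
        grad = [0.0] * 4
        for x, y in zip(features, engaged):
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j in range(4):
                grad[j] += 2 * err * x[j]
        w = [wi - lr * g / n for wi, g in zip(w, grad)]
    return w
```

Even this toy closes the loop the doc flags as missing: weights come from engagement data and can be refreshed on a schedule, rather than living as hand-tuned constants.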
Related documents
- 01-orchestrator-agent.md — Routing to recommendation intents
- 02-product-search-agent.md — How catalog data feeds enrichment
- ../RAG-MCP-Integration/02-user-preferences-recommendation-mcp.md — User preference MCP internals
- ../RAG-MCP-Integration/07-cross-title-link-mcp.md — Graph MCP internals
- ../13-metrics.md — Recommendation quality metrics