
04 — RecommendationAgent

Personalized title recommendations from a multi-stage pipeline: collaborative filtering → graph traversal → preference re-ranking. Backed by Amazon Personalize, Neptune, DynamoDB.

When a user asks "what should I read next?" or "give me something like Berserk," the Orchestrator routes to the RecommendationAgent. This is the most stateful sub-agent — it depends heavily on user history, current session intent, and global trending signals.


What it is

A logical sub-agent that owns personalized discovery. Three responsibilities:

  1. Personalized recommendations — based on user history (purchases, views, ratings)
  2. Similarity-based recommendations — "more like X" via graph traversal
  3. Cold-start recommendations — for new users, fall back to trending and popular

It is backed by two MCP servers:

  - User Preference MCP (../RAG-MCP-Integration/02-user-preferences-recommendation-mcp.md) — wraps Amazon Personalize + DynamoDB user vectors
  - Cross-Title Link MCP (../RAG-MCP-Integration/07-cross-title-link-mcp.md) — wraps Neptune graph + OpenSearch metadata


Tools exposed to the Orchestrator

| Tool | Purpose | Backing |
| --- | --- | --- |
| get_recommendations(user_id, count) | Top-N personalized picks | Personalize + DynamoDB |
| get_similar_titles(asin, count) | "If you liked X" | Neptune graph |
| get_trending(genre?, region) | Cold-start / casual browse | DynamoDB Streams + Kinesis |
| update_preference(user_id, signal) | Record preference (read/skip/like) | DynamoDB |

update_preference is the second mutating tool in the system (after initiate_return in OrderStatusAgent), but it's much lower-stakes — wrong preferences degrade future recs but don't cost money.
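Because update_preference mutates state, it is worth validating signals at the tool boundary. A minimal sketch, assuming the three signal values from the table; the function name matches the tool but the DynamoDB write is stubbed as a returned record, not the production contract:

```python
# Hypothetical guardrail for the mutating tool. The real handler would
# write to DynamoDB; here the write is stubbed as a returned record.
VALID_SIGNALS = {"read", "skip", "like"}

def update_preference(user_id: str, signal: str) -> dict:
    # Reject unknown signals before anything is persisted. Wrong
    # preferences only degrade future recs, but garbage signals should
    # still be caught at the tool boundary rather than stored.
    if signal not in VALID_SIGNALS:
        raise ValueError(f"unknown preference signal: {signal!r}")
    return {"user_id": user_id, "signal": signal, "status": "recorded"}
```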


The three-stage recommendation pipeline

User intent: "recommend something like Berserk"
   │
   ▼
Stage 1: Candidate generation (broad, fast)
   ├── Personalize: top-100 candidates from collab filtering
   └── Neptune: top-100 graph neighbors of "Berserk"
   │
   ▼ (merge, dedupe → ~150 candidates)
   │
Stage 2: Feature enrichment
   ├── User preference vector (DynamoDB)
   ├── Item embedding (OpenSearch)
   └── Trending signal (last-24h boost)
   │
   ▼
Stage 3: Re-rank
   └── Score = α·personalize_score
              + β·graph_proximity
              + γ·preference_alignment
              + δ·trending_boost
   │
   ▼
Top-3 results → return to Orchestrator

Weights α, β, γ, δ are tuned per scenario:

  - "Recommend something" → emphasis on personalize_score (α high)
  - "Similar to X" → emphasis on graph_proximity (β high)
  - "What's hot?" → emphasis on trending_boost (δ high)

The Orchestrator's intent classification picks the scenario; the agent applies the corresponding weight profile.
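The Stage-3 scoring formula with per-scenario weight profiles can be sketched as follows. The numeric weights and scenario names are illustrative placeholders (the doc gives no real values), not the tuned production profiles:

```python
# Per-scenario weight profiles for the Stage-3 re-rank score.
# Score = α·personalize_score + β·graph_proximity
#       + γ·preference_alignment + δ·trending_boost
# Values below are made-up placeholders for illustration.
WEIGHT_PROFILES = {
    "personalized": {"alpha": 0.6, "beta": 0.1, "gamma": 0.2, "delta": 0.1},
    "similar_to_x": {"alpha": 0.1, "beta": 0.6, "gamma": 0.2, "delta": 0.1},
    "whats_hot":    {"alpha": 0.1, "beta": 0.1, "gamma": 0.1, "delta": 0.7},
}

def rerank_score(features: dict, scenario: str) -> float:
    w = WEIGHT_PROFILES[scenario]
    return (w["alpha"] * features["personalize_score"]
            + w["beta"] * features["graph_proximity"]
            + w["gamma"] * features["preference_alignment"]
            + w["delta"] * features["trending_boost"])

def rerank(candidates: list, scenario: str, top_n: int = 3) -> list:
    # Sort the enriched candidate set by scenario-weighted score, keep top-N.
    return sorted(candidates,
                  key=lambda c: rerank_score(c, scenario),
                  reverse=True)[:top_n]
```

The same candidate set can therefore rank very differently depending on which intent the Orchestrator classified.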


Cold-start handling

Three cold-start regimes, each handled differently:

| Regime | Detection | Strategy |
| --- | --- | --- |
| New user (no history) | DynamoDB profile empty | Pure trending + onboarding question ("what genres do you like?") |
| New item (just released) | Personalize hasn't seen it | Boost via metadata similarity (genre, author, themes) until interaction data accumulates |
| Sparse user (<5 interactions) | Profile exists but thin | Hybrid: trending dominates, personalize is a small lift |

The "tell us what you like" prompt for new users is friction, but recommendation quality without any signal is roughly random — better to ask than to embarrass ourselves with bad picks.
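The regime detection above reduces to a small classifier. A sketch assuming the thresholds from the table (empty profile → new user, fewer than 5 interactions → sparse); function and label names are assumptions:

```python
from typing import Optional

def coldstart_regime(profile: Optional[dict], interaction_count: int) -> str:
    # Thresholds mirror the cold-start table:
    # empty/missing profile -> new user (pure trending + onboarding question),
    # <5 interactions -> sparse user (trending dominates, small personalize lift),
    # otherwise the normal personalized path.
    if not profile:
        return "new_user"
    if interaction_count < 5:
        return "sparse_user"
    return "warm"
```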


State management

State Store Owner TTL
User preference vector DynamoDB UserService (writes), this agent (reads) Permanent
Interaction events (raw) Kinesis → S3 Event pipeline 90 days
Item embeddings OpenSearch Catalog ingestion Refreshed nightly
Graph (similarity edges) Neptune Recsys batch job Refreshed nightly
Recommendation cache ElastiCache This agent 1 hour per user
Trending list DynamoDB Trending pipeline 5 minutes

The 1-hour reco cache TTL is contentious — see Validation section.


Failure handling

| Failure | Detection | Recovery |
| --- | --- | --- |
| Personalize timeout (>500ms) | Latency probe | Fallback to graph-only recs |
| Neptune timeout | Connection error | Fallback to Personalize-only recs |
| Both fail | Cascade | Trending list (last-known good) |
| Empty user profile | DynamoDB read returns null | Pure trending + onboarding prompt |
| New user but Personalize forced | Personalize returns generic top-100 | Detect "new user" flag, route to cold-start path |
| Graph traversal too deep | Depth limit | Cap traversal at depth 2; depth 3+ via batch precomputation only |

The double-fallback chain (Personalize → Graph → Trending) means the agent always returns something. The risk is silent quality degradation: if Personalize is down for 30 minutes, recommendations stay alive but get noticeably worse. Mitigation: emit a "degraded mode" metric on every fallback.
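The per-source fallback plus degraded-mode metric can be sketched like this. The callables stand in for the real Personalize, Neptune, and trending clients (any exception counts as that source being unavailable); names and metric strings are assumptions:

```python
def generate_candidates(user_id, asin,
                        personalize_call, neptune_call, trending_call,
                        metrics: list) -> list:
    """Candidate generation with the double-fallback chain.

    Each source degrades independently: if one fails we continue with
    the other; if both fail we fall back to the trending list. Every
    fallback appends a degraded-mode marker so quality loss is visible.
    """
    personalize, graph = [], []
    try:
        personalize = personalize_call(user_id)
    except Exception:
        metrics.append("degraded:personalize_unavailable")
    try:
        graph = neptune_call(asin)
    except Exception:
        metrics.append("degraded:neptune_unavailable")
    if not personalize and not graph:
        metrics.append("degraded:trending_fallback")
        return trending_call()  # last-known good, always available
    # Merge and dedupe while preserving order (Personalize first).
    return list(dict.fromkeys(personalize + graph))
```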


Latency budget

Target: P99 < 1s per tool call.

get_recommendations (cache hit):
  Cache GET           5ms
  Format             10ms
  ─────
  Total             15ms

get_recommendations (cache miss):
  Personalize call    300ms (parallel)
  Neptune query       200ms (parallel)
  Wait both          (max = 300ms)
  Feature fetch       100ms
  Re-rank             50ms
  Cache + format      20ms
  ─────
  Total              470ms (P50)
                     ~1.0s (P99)

get_similar_titles (cache miss):
  Neptune (depth 2)   200ms
  Enrichment          80ms
  Re-rank             40ms
  Format              10ms
  ─────
  Total              330ms (P50)
                     ~700ms (P99)

get_trending:
  DynamoDB trending-table read   30ms
  Format                         10ms
  ─────
  Total                          40ms (P50)
                                ~150ms (P99)

Trending is fast because it's pre-computed; the heavy lift happens in the batch pipeline that maintains the trending list, not at request time.
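The "Wait both (max = 300ms)" line in the cache-miss budget is the key property of the fan-out: wall time is the slower of the two candidate calls, not their sum. A toy demonstration with sleeps standing in for the Personalize and Neptune calls (the 300ms/200ms figures are taken from the budget above):

```python
import concurrent.futures
import time

def fetch(name: str, latency_s: float) -> str:
    # Stand-in for a candidate-generation call; sleep simulates latency.
    time.sleep(latency_s)
    return name

def parallel_candidate_gen():
    # Fan out both sources, then wait for both before enrichment/re-rank.
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor() as pool:
        personalize = pool.submit(fetch, "personalize", 0.30)
        neptune = pool.submit(fetch, "neptune", 0.20)
        results = [personalize.result(), neptune.result()]
    elapsed = time.perf_counter() - start
    return results, elapsed  # elapsed ~= max(0.30, 0.20), not 0.50
```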


Why this shape

| Alternative | Why we rejected it |
| --- | --- |
| Personalize alone | No "if you liked X" semantic similarity; collab filtering is item-blind |
| Neptune alone | No personalization; same recs for everyone with same query |
| LLM-generated recommendations | Catastrophic hallucination risk (made-up titles); not grounded in catalog |
| Pre-compute all (user, item) pairs | 10M users × 5M items = 50T cells; infeasible |
| Compute fully online (no cache) | Would breach latency budget; Personalize has its own SLA |
| Single fixed weight profile | Different intents need different weight emphasis |

Validation: Constraint Sanity Check

| Claimed metric | Verdict | Why |
| --- | --- | --- |
| P99 < 1s per tool call | Aggressive | Multi-stage pipeline P99 stacks: Personalize (P99 ~700ms), Neptune (P99 ~400ms), enrichment (P99 ~250ms). Even with parallel candidate-gen, the slowest path dominates and re-rank is sequential. Realistic P99: 1.2–1.8s. |
| Personalize 300ms | Median, not P99 | Amazon Personalize GetRecommendations API: P50 ~150ms, P99 ~700ms. The 300ms here looks like an average, not a P99 budget. |
| Neptune depth-2 traversal 200ms | Realistic for cold cache | Gremlin queries with depth 2 over a graph with ~5M nodes and reasonable fan-out: 100–300ms. Holds. Depth 3+ is much worse — capped for good reason. |
| Reco cache TTL 1 hour | Too long for a shopping session | User behavior in a session changes faster than 1 hour: they click on a horror title, the next reco should reflect that. 1-hour cache means recs are frozen for the full session even after preference signals. Recommended: drop to 5 minutes, or invalidate on update_preference. |
| 10:1 cache hit ratio implied | Not stated, probably wrong | With 1-hour TTL, hit rate is high. With realistic 5-min TTL it drops to ~3:1 in active sessions. Either way, no measured number is quoted. |
| Personalize candidate top-100 + Neptune top-100 → ~150 deduped | Reasonable | Roughly 50/100 overlap is typical for related-item retrieval; 150 final candidates is in the right ballpark. |
| Re-rank weights α, β, γ, δ per scenario | No tuning data | Where do these weights come from? If they're set by hand based on intuition, the system has not been optimized — it's been guessed. A bandit or offline eval pipeline should drive these. |
| Cold-start onboarding prompt | Right approach, conversion risk | Asking new users "what do you like?" adds friction. Some will bounce. There's no quoted conversion-rate impact. The right answer for a real product is to A/B-test friction-vs-quality; the doc just asserts the choice. |
| Graph refresh nightly | Stale signal during the day | New manga reviews, purchases, and trending events all happen during the day. A graph that updates nightly misses 24h of signal. For a "currently hot manga" use case, this is borderline; for "what's similar," it's fine. |
| Trending list TTL 5 min | Realistic | Trending is precomputed from Kinesis aggregations. 5-min freshness is achievable and adequate for the use case. |
| Personalize for 10M users | Cost not discussed | Personalize charges per training hour and per inference. At 10M users with daily training, this is a real cost line; the architecture doc doesn't quote it. |

The biggest issue: 1-hour reco cache is wrong for shopping

The reco cache TTL was chosen, presumably, to absorb load. But shopping is fundamentally about signal acquisition — the user clicks one thing and reveals their intent for that session. A reco system that keeps showing the same recommendations after the user clicks on horror three times has failed at its job.

Two options:

  1. Drop TTL to 5 minutes. Simple. Loses some load-absorption.
  2. Event-driven invalidation. Every update_preference call invalidates the user's reco cache. More work but correctly modeled.

Option 2 is the right one. The current 1-hour TTL only makes sense if recommendations are computed per-day and don't react to in-session behavior, which is closer to email recs than chatbot recs.
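Option 2 is small in code terms. A sketch with an in-memory stand-in for the ElastiCache reco cache; class and function names are assumptions, and the real implementation would issue a DELETE against ElastiCache inside the update_preference tool handler:

```python
class RecoCache:
    """In-memory stand-in for the per-user reco cache in ElastiCache."""

    def __init__(self):
        self._store = {}

    def get(self, user_id):
        return self._store.get(user_id)

    def put(self, user_id, recos):
        self._store[user_id] = recos

    def invalidate(self, user_id):
        self._store.pop(user_id, None)

def record_preference(user_id, signal, cache: RecoCache, prefs: dict):
    # Option 2: every preference signal invalidates the user's cached
    # recos, so the next get_recommendations recomputes against the
    # user's revealed in-session intent instead of serving a stale list.
    prefs.setdefault(user_id, []).append(signal)
    cache.invalidate(user_id)
```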

The pipeline is sequential where it claims to be parallel

The doc says Personalize and Neptune candidate generation run in parallel. True. But re-ranking depends on both completing, and feature enrichment depends on the deduped candidate set. The end-to-end pipeline is therefore:

max(Personalize, Neptune) → Enrichment → Rerank
       ↑ parallel here       ↑ sequential   ↑ sequential

That's correctly modeled in the latency budget (470ms P50). The "parallel pipeline" framing earlier in the doc oversells it: only one stage is parallel.

Cold-start friction has no measured cost

"Tell us what you like" is the right answer in theory. In practice, every onboarding step has a measurable bounce rate, and recommendation systems often perform okay even on bare cold start using regional/demographic priors. The doc presents the friction as obviously correct without an A/B-test reference. This is a design hypothesis, not a validated choice.

Weight tuning has no closed loop

α, β, γ, δ per intent profile are stated but their values aren't given, and there's no mention of how they're learned or refreshed. A real recommendation system either:

  - Learns weights via offline regression on engagement data
  - Tunes via online bandit on live traffic

The current architecture describes neither; the weights are hand-tuned constants. Closing this loop should be an explicit roadmap item, not a buried footnote.
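To make the offline option concrete, here is a toy closed loop: fit (α, β, γ, δ) by least-squares gradient descent on logged engagement labels. Purely illustrative — a real pipeline would correct for logging bias or use an online bandit, and the synthetic data below is an assumption:

```python
def fit_weights(examples, lr=0.1, epochs=500):
    """Fit (alpha, beta, gamma, delta) by gradient descent on squared error.

    examples: list of (features, engaged) pairs, where features is a
    4-tuple (personalize_score, graph_proximity, preference_alignment,
    trending_boost) and engaged is 0.0 or 1.0 from logged behavior.
    """
    w = [0.25, 0.25, 0.25, 0.25]  # start from a uniform profile
    n = len(examples)
    for _ in range(epochs):
        grad = [0.0] * 4
        for x, y in examples:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for i in range(4):
                grad[i] += 2 * err * x[i]
        w = [wi - lr * gi / n for wi, gi in zip(w, grad)]
    return w
```

If logged engagement is driven mostly by graph proximity, the fit should recover a β-heavy profile — which is exactly the feedback the hand-tuned constants never get.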