
04 — RecommendationAgent

Personalized title recommendations from a multi-stage pipeline: collaborative filtering → graph traversal → preference re-ranking. Backed by Amazon Personalize, Neptune, DynamoDB.

When a user asks "what should I read next?" or "give me something like Berserk," the Orchestrator routes to the RecommendationAgent. This is the most stateful sub-agent — it depends heavily on user history, current session intent, and global trending signals.


What it is

A logical sub-agent that owns personalized discovery. Three responsibilities:

  1. Personalized recommendations — based on user history (purchases, views, ratings)
  2. Similarity-based recommendations — "more like X" via graph traversal
  3. Cold-start recommendations — for new users, fall back to trending and popular

It is backed by two MCP servers:

  - User Preference MCP (../RAG-MCP-Integration/02-user-preferences-recommendation-mcp.md) — wraps Amazon Personalize + DynamoDB user vectors
  - Cross-Title Link MCP (../RAG-MCP-Integration/07-cross-title-link-mcp.md) — wraps Neptune graph + OpenSearch metadata


Tools exposed to the Orchestrator

| Tool | Purpose | Backing |
| --- | --- | --- |
| get_recommendations(user_id, count) | Top-N personalized picks | Personalize + DynamoDB |
| get_similar_titles(asin, count) | "If you liked X" | Neptune graph |
| get_trending(genre?, region) | Cold-start / casual browse | DynamoDB Streams + Kinesis |
| update_preference(user_id, signal) | Record preference (read/skip/like) | DynamoDB |

update_preference is the second mutating tool in the system (after initiate_return in OrderStatusAgent), but it's much lower-stakes — wrong preferences degrade future recs but don't cost money.
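Because update_preference mutates state, it is worth validating signals at the tool boundary. A minimal sketch, assuming the three signal values from the table; the function name matches the tool but the DynamoDB write is stubbed as a returned record, not the production contract:

```python
# Hypothetical guardrail for the mutating tool. The real handler would
# write to DynamoDB; here the write is stubbed as a returned record.
VALID_SIGNALS = {"read", "skip", "like"}

def update_preference(user_id: str, signal: str) -> dict:
    # Reject unknown signals before anything is persisted. Wrong
    # preferences only degrade future recs, but garbage signals should
    # still be caught at the tool boundary rather than stored.
    if signal not in VALID_SIGNALS:
        raise ValueError(f"unknown preference signal: {signal!r}")
    return {"user_id": user_id, "signal": signal, "status": "recorded"}
```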


The three-stage recommendation pipeline

User intent: "recommend something like Berserk"
   │
   ▼
Stage 1: Candidate generation (broad, fast)
   ├── Personalize: top-100 candidates from collab filtering
   └── Neptune: top-100 graph neighbors of "Berserk"
   │
   ▼ (merge, dedupe → ~150 candidates)
   │
Stage 2: Feature enrichment
   ├── User preference vector (DynamoDB)
   ├── Item embedding (OpenSearch)
   └── Trending signal (last-24h boost)
   │
   ▼
Stage 3: Re-rank
   └── Score = α·personalize_score
              + β·graph_proximity
              + γ·preference_alignment
              + δ·trending_boost
   │
   ▼
Top-3 results → return to Orchestrator

Weights α, β, γ, δ are tuned per scenario:

  - "Recommend something" → emphasis on personalize_score (α high)
  - "Similar to X" → emphasis on graph_proximity (β high)
  - "What's hot?" → emphasis on trending_boost (δ high)

The Orchestrator's intent classification picks the scenario; the agent applies the corresponding weight profile.
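The Stage-3 scoring formula with per-scenario weight profiles can be sketched as follows. The numeric weights and scenario names are illustrative placeholders (the doc gives no real values), not the tuned production profiles:

```python
# Per-scenario weight profiles for the Stage-3 re-rank score.
# Score = α·personalize_score + β·graph_proximity
#       + γ·preference_alignment + δ·trending_boost
# Values below are made-up placeholders for illustration.
WEIGHT_PROFILES = {
    "personalized": {"alpha": 0.6, "beta": 0.1, "gamma": 0.2, "delta": 0.1},
    "similar_to_x": {"alpha": 0.1, "beta": 0.6, "gamma": 0.2, "delta": 0.1},
    "whats_hot":    {"alpha": 0.1, "beta": 0.1, "gamma": 0.1, "delta": 0.7},
}

def rerank_score(features: dict, scenario: str) -> float:
    w = WEIGHT_PROFILES[scenario]
    return (w["alpha"] * features["personalize_score"]
            + w["beta"] * features["graph_proximity"]
            + w["gamma"] * features["preference_alignment"]
            + w["delta"] * features["trending_boost"])

def rerank(candidates: list, scenario: str, top_n: int = 3) -> list:
    # Sort the enriched candidate set by scenario-weighted score, keep top-N.
    return sorted(candidates,
                  key=lambda c: rerank_score(c, scenario),
                  reverse=True)[:top_n]
```

The same candidate set can therefore rank very differently depending on which intent the Orchestrator classified.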


Cold-start handling

Three cold-start regimes, each handled differently:

| Regime | Detection | Strategy |
| --- | --- | --- |
| New user (no history) | DynamoDB profile empty | Pure trending + onboarding question ("what genres do you like?") |
| New item (just released) | Personalize hasn't seen it | Boost via metadata similarity (genre, author, themes) until interaction data accumulates |
| Sparse user (<5 interactions) | Profile exists but thin | Hybrid: trending dominates, personalize is a small lift |

The "tell us what you like" prompt for new users is friction, but recommendation quality without any signal is roughly random — better to ask than to embarrass ourselves with bad picks.
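The regime detection above reduces to a small classifier. A sketch assuming the thresholds from the table (empty profile → new user, fewer than 5 interactions → sparse); function and label names are assumptions:

```python
from typing import Optional

def coldstart_regime(profile: Optional[dict], interaction_count: int) -> str:
    # Thresholds mirror the cold-start table:
    # empty/missing profile -> new user (pure trending + onboarding question),
    # <5 interactions -> sparse user (trending dominates, small personalize lift),
    # otherwise the normal personalized path.
    if not profile:
        return "new_user"
    if interaction_count < 5:
        return "sparse_user"
    return "warm"
```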


State management

State Store Owner TTL
User preference vector DynamoDB UserService (writes), this agent (reads) Permanent
Interaction events (raw) Kinesis → S3 Event pipeline 90 days
Item embeddings OpenSearch Catalog ingestion Refreshed nightly
Graph (similarity edges) Neptune Recsys batch job Refreshed nightly
Recommendation cache ElastiCache This agent 1 hour per user
Trending list DynamoDB Trending pipeline 5 minutes

The 1-hour reco cache TTL is contentious — see Validation section.


Failure handling

| Failure | Detection | Recovery |
| --- | --- | --- |
| Personalize timeout (>500ms) | Latency probe | Fallback to graph-only recs |
| Neptune timeout | Connection error | Fallback to Personalize-only recs |
| Both fail | Cascade | Trending list (last-known good) |
| Empty user profile | DynamoDB read returns null | Pure trending + onboarding prompt |
| New user but Personalize forced | Personalize returns generic top-100 | Detect "new user" flag, route to cold-start path |
| Graph traversal too deep | Depth limit | Cap traversal at depth 2; depth 3+ via batch precomputation only |

The double-fallback chain (Personalize → Graph → Trending) means the agent always returns something. The risk is silent quality degradation: if Personalize is down for 30 minutes, recommendations stay alive but get noticeably worse. Mitigation: emit a "degraded mode" metric on every fallback.
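The per-source fallback plus degraded-mode metric can be sketched like this. The callables stand in for the real Personalize, Neptune, and trending clients (any exception counts as that source being unavailable); names and metric strings are assumptions:

```python
def generate_candidates(user_id, asin,
                        personalize_call, neptune_call, trending_call,
                        metrics: list) -> list:
    """Candidate generation with the double-fallback chain.

    Each source degrades independently: if one fails we continue with
    the other; if both fail we fall back to the trending list. Every
    fallback appends a degraded-mode marker so quality loss is visible.
    """
    personalize, graph = [], []
    try:
        personalize = personalize_call(user_id)
    except Exception:
        metrics.append("degraded:personalize_unavailable")
    try:
        graph = neptune_call(asin)
    except Exception:
        metrics.append("degraded:neptune_unavailable")
    if not personalize and not graph:
        metrics.append("degraded:trending_fallback")
        return trending_call()  # last-known good, always available
    # Merge and dedupe while preserving order (Personalize first).
    return list(dict.fromkeys(personalize + graph))
```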


Latency budget

Target: P99 < 1s per tool call.

get_recommendations (cache hit):
  Cache GET           5ms
  Format             10ms
  ─────
  Total             15ms

get_recommendations (cache miss):
  Personalize call    300ms (parallel)
  Neptune query       200ms (parallel)
  Wait both          (max = 300ms)
  Feature fetch       100ms
  Re-rank             50ms
  Cache + format      20ms
  ─────
  Total              470ms (P50)
                     ~1.0s (P99)

get_similar_titles (cache miss):
  Neptune (depth 2)   200ms
  Enrichment          80ms
  Re-rank             40ms
  Format              10ms
  ─────
  Total              330ms (P50)
                     ~700ms (P99)

get_trending:
  DynamoDB trending-table read   30ms
  Format                         10ms
  ─────
  Total                          40ms (P50)
                                ~150ms (P99)

Trending is fast because it's pre-computed; the heavy lift happens in the batch pipeline that maintains the trending list, not at request time.
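The "Wait both (max = 300ms)" line in the cache-miss budget is the key property of the fan-out: wall time is the slower of the two candidate calls, not their sum. A toy demonstration with sleeps standing in for the Personalize and Neptune calls (the 300ms/200ms figures are taken from the budget above):

```python
import concurrent.futures
import time

def fetch(name: str, latency_s: float) -> str:
    # Stand-in for a candidate-generation call; sleep simulates latency.
    time.sleep(latency_s)
    return name

def parallel_candidate_gen():
    # Fan out both sources, then wait for both before enrichment/re-rank.
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor() as pool:
        personalize = pool.submit(fetch, "personalize", 0.30)
        neptune = pool.submit(fetch, "neptune", 0.20)
        results = [personalize.result(), neptune.result()]
    elapsed = time.perf_counter() - start
    return results, elapsed  # elapsed ~= max(0.30, 0.20), not 0.50
```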


Why this shape

| Alternative | Why we rejected it |
| --- | --- |
| Personalize alone | No "if you liked X" semantic similarity; collab filtering is item-blind |
| Neptune alone | No personalization; same recs for everyone with same query |
| LLM-generated recommendations | Catastrophic hallucination risk (made-up titles); not grounded in catalog |
| Pre-compute all (user, item) pairs | 10M users × 5M items = 50T cells; infeasible |
| Compute fully online (no cache) | Would breach latency budget; Personalize has its own SLA |
| Single fixed weight profile | Different intents need different weight emphasis |

Validation: Constraint Sanity Check

| Claimed metric | Verdict | Why |
| --- | --- | --- |
| P99 < 1s per tool call | Aggressive | Multi-stage pipeline P99 stacks: Personalize (P99 ~700ms), Neptune (P99 ~400ms), enrichment (P99 ~250ms). Even with parallel candidate-gen, the slowest path dominates and re-rank is sequential. Realistic P99: 1.2–1.8s. |
| Personalize 300ms | Median, not P99 | Amazon Personalize GetRecommendations API: P50 ~150ms, P99 ~700ms. The 300ms here looks like an average, not a P99 budget. |
| Neptune depth-2 traversal 200ms | Realistic for cold cache | Gremlin queries with depth 2 over a graph with ~5M nodes and reasonable fan-out: 100–300ms. Holds. Depth 3+ is much worse — capped for good reason. |
| Reco cache TTL 1 hour | Too long for a shopping session | User behavior in a session changes faster than 1 hour: they click on a horror title, the next reco should reflect that. 1-hour cache means recs are frozen for the full session even after preference signals. Recommended: drop to 5 minutes, or invalidate on update_preference. |
| 10:1 cache hit ratio implied | Not stated, probably wrong | With 1-hour TTL, hit rate is high. With realistic 5-min TTL it drops to ~3:1 in active sessions. Either way, no measured number is quoted. |
| Personalize candidate top-100 + Neptune top-100 → ~150 deduped | Reasonable | Roughly 50/100 overlap is typical for related-item retrieval; 150 final candidates is in the right ballpark. |
| Re-rank weights α, β, γ, δ per scenario | No tuning data | Where do these weights come from? If they're set by hand based on intuition, the system has not been optimized — it's been guessed. A bandit or offline eval pipeline should drive these. |
| Cold-start onboarding prompt | Right approach, conversion risk | Asking new users "what do you like?" adds friction. Some will bounce. There's no quoted conversion-rate impact. The right answer for a real product is to A/B-test friction-vs-quality; the doc just asserts the choice. |
| Graph refresh nightly | Stale signal during the day | New manga reviews, purchases, and trending events all happen during the day. A graph that updates nightly misses 24h of signal. For a "currently hot manga" use case, this is borderline; for "what's similar," it's fine. |
| Trending list TTL 5 min | Realistic | Trending is precomputed from Kinesis aggregations. 5-min freshness is achievable and adequate for the use case. |
| Personalize for 10M users | Cost not discussed | Personalize charges per training hour and per inference. At 10M users with daily training, this is a real cost line; the architecture doc doesn't quote it. |

The biggest issue: 1-hour reco cache is wrong for shopping

The reco cache TTL was chosen, presumably, to absorb load. But shopping is fundamentally about signal acquisition — the user clicks one thing and reveals their intent for that session. A reco system that keeps showing the same recommendations after the user clicks on horror three times has failed at its job.

Two options:

  1. Drop TTL to 5 minutes. Simple. Loses some load-absorption.
  2. Event-driven invalidation. Every update_preference call invalidates the user's reco cache. More work but correctly modeled.

Option 2 is the right one. The current 1-hour TTL only makes sense if recommendations are computed per-day and don't react to in-session behavior, which is closer to email recs than chatbot recs.
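Option 2 is small in code terms. A sketch with an in-memory stand-in for the ElastiCache reco cache; class and function names are assumptions, and the real implementation would issue a DELETE against ElastiCache inside the update_preference tool handler:

```python
class RecoCache:
    """In-memory stand-in for the per-user reco cache in ElastiCache."""

    def __init__(self):
        self._store = {}

    def get(self, user_id):
        return self._store.get(user_id)

    def put(self, user_id, recos):
        self._store[user_id] = recos

    def invalidate(self, user_id):
        self._store.pop(user_id, None)

def record_preference(user_id, signal, cache: RecoCache, prefs: dict):
    # Option 2: every preference signal invalidates the user's cached
    # recos, so the next get_recommendations recomputes against the
    # user's revealed in-session intent instead of serving a stale list.
    prefs.setdefault(user_id, []).append(signal)
    cache.invalidate(user_id)
```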

The pipeline is sequential where it claims to be parallel

The doc says Personalize and Neptune candidate generation run in parallel. True. But re-ranking depends on both completing, and feature enrichment depends on the deduped candidate set. The end-to-end pipeline is therefore:

max(Personalize, Neptune) → Enrichment → Rerank
       ↑ parallel here       ↑ sequential   ↑ sequential

That's correctly modeled in the latency budget (470ms P50). The "parallel pipeline" framing earlier in the doc oversells it: only one stage is parallel.

Cold-start friction has no measured cost

"Tell us what you like" is the right answer in theory. In practice, every onboarding step has a measurable bounce rate, and recommendation systems often perform okay even on bare cold start using regional/demographic priors. The doc presents the friction as obviously correct without an A/B-test reference. This is a design hypothesis, not a validated choice.

Weight tuning has no closed loop

α, β, γ, δ per intent profile are stated but their values aren't given, and there's no mention of how they're learned or refreshed. A real recommendation system either:

  - Learns weights via offline regression on engagement data
  - Tunes via online bandit on live traffic

The current architecture describes neither; the weights are hand-tuned constants. Closing this loop should be an explicit roadmap item, not a buried footnote.
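To make the offline option concrete, here is a toy closed loop: fit (α, β, γ, δ) by least-squares gradient descent on logged engagement labels. Purely illustrative — a real pipeline would correct for logging bias or use an online bandit, and the synthetic data below is an assumption:

```python
def fit_weights(examples, lr=0.1, epochs=500):
    """Fit (alpha, beta, gamma, delta) by gradient descent on squared error.

    examples: list of (features, engaged) pairs, where features is a
    4-tuple (personalize_score, graph_proximity, preference_alignment,
    trending_boost) and engaged is 0.0 or 1.0 from logged behavior.
    """
    w = [0.25, 0.25, 0.25, 0.25]  # start from a uniform profile
    n = len(examples)
    for _ in range(epochs):
        grad = [0.0] * 4
        for x, y in examples:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for i in range(4):
                grad[i] += 2 * err * x[i]
        w = [wi - lr * gi / n for wi, gi in zip(w, grad)]
    return w
```

If logged engagement is driven mostly by graph proximity, the fit should recover a β-heavy profile — which is exactly the feedback the hand-tuned constants never get.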