HLD Deep Dive: Recommendations, Personalization & Caching
Questions covered: Q10, Q16, Q19 (partial), Q29 (async patterns)
Interviewer level: Senior Engineer → Staff Engineer
Q10. Why reuse Amazon Personalize instead of building a custom recommendation engine?
Short Answer
Amazon already has one of the best recommendation engines in the world. Reusing it saves months of development, leverages existing user signals, and is battle-tested at billions of interactions per day.
Deep Dive
What building custom would involve:
Custom Recommendation Engine (6-12 months):
Data Pipeline:
- Collect user events (views, purchases, ratings, clicks)
- Store interaction matrix (users × items)
- De-duplicate, normalize, handle cold start
Model Training:
- Collaborative Filtering (Matrix Factorization, ALS)
- Content-Based Filtering (item feature vectors)
- Hybrid model combining both
- Retrain pipeline (weekly? daily? real-time?)
Serving Infrastructure:
- Low-latency serving endpoint (<100ms)
- Model versioning and A/B testing
- Real-time feature updates
Evaluation:
- Offline metrics (NDCG, Recall@K)
- Online A/B testing
- Cold start handling (new users, new items)
Timeline: 6-12 months. Cost: $500K–$2M in engineering time.
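To make that scope concrete, here is a toy sketch of the core machinery the custom path would need: one alternating-least-squares half-sweep for matrix factorization, in plain numpy. Everything here (the dense toy ratings matrix, rank, regularization value) is illustrative; a real pipeline would add sparse storage, implicit-feedback weighting, distributed training, and offline evaluation on top.
import numpy as np

def als_user_step(R, U, V, lam=0.1):
    # One ALS half-sweep: hold item factors V fixed, re-solve user factors U.
    # R is (users x items) with 0 meaning "unobserved".
    k = U.shape[1]
    for u in range(R.shape[0]):
        observed = R[u] > 0                # mask of items this user rated
        V_o = V[observed]                  # factors of those items
        A = V_o.T @ V_o + lam * np.eye(k)  # regularized normal equations
        b = V_o.T @ R[u, observed]
        U[u] = np.linalg.solve(A, b)
    return U

# Toy run: 4 users x 5 items, rank-2 factors; alternate the two steps to train
rng = np.random.default_rng(0)
R = rng.integers(0, 6, size=(4, 5)).astype(float)
U = rng.normal(size=(4, 2))
V = rng.normal(size=(5, 2))
U = als_user_step(R, U, V)  # the item step is symmetric: als_user_step(R.T, V, U)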
What Amazon Personalize gives you out of the box:
Amazon Personalize handles:
✅ Data ingestion API (events, interactions, item catalog)
✅ Model training (HRNN, SIMS, Popularity-Count, etc.)
✅ Automatic retraining (real-time or scheduled)
✅ Cold start handling (new users → popularity-based)
✅ A/B testing infrastructure (experiments)
✅ Campaign management (deploy, update model)
✅ Context-aware recommendations (pass current items/query)
✅ High-availability serving endpoint
What you still control:
✅ Item catalog (what products are available)
✅ User event schema (what signals to send)
✅ Business rules (filter out-of-stock items)
✅ Contextual parameters (current page, search query)
Integration pattern:
import asyncio

import boto3

personalize_runtime = boto3.client("personalize-runtime")

async def get_recommendations(customer_id: str, query_context: str,
                              num_results: int = 10) -> list:
    # boto3 is synchronous, so run the blocking call in a worker thread
    # rather than stalling the event loop
    response = await asyncio.to_thread(
        personalize_runtime.get_recommendations,
        campaignArn="arn:aws:personalize:ap-northeast-1:...:campaign/manga-recs",
        userId=customer_id,
        numResults=num_results,
        context={
            "CURRENT_QUERY": query_context,  # e.g. "dark fantasy manga"
            "DEVICE_TYPE": "desktop",
        },
        filterArn="arn:aws:personalize:...:filter/in-stock-filter",  # only in-stock items
    )
    asins = [item["itemId"] for item in response["itemList"]]
    # Enrich with product details from the catalog service
    products = await catalog.batch_get_products(asins)
    return products
Cold start problem — what Personalize does for new users:
New User (no history):
→ Personalize falls back to "POPULARITY_COUNT" recipe
→ Returns top-N most popular manga in the relevant category
→ As user interacts, real-time signals update recommendations immediately
Guest User:
→ Use session-level context (current page category, search query)
→ In-session recommendations only ("similar to what you're viewing now")
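For the guest path, Personalize's related-items (SIMS) recipes can be queried with an itemId instead of a userId, so no purchase history is required. A hedged sketch, reusing the client from the integration snippet above; the `manga-similar-items` campaign name is an assumption:
async def get_session_recommendations(current_asin: str,
                                      num_results: int = 10) -> list:
    # Guest user: no userId, so query a SIMS (similar-items) campaign
    # keyed on the item the visitor is viewing right now
    response = await asyncio.to_thread(
        personalize_runtime.get_recommendations,
        campaignArn="arn:aws:personalize:ap-northeast-1:...:campaign/manga-similar-items",
        itemId=current_asin,
        numResults=num_results,
    )
    return [item["itemId"] for item in response["itemList"]]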
Why this is a "build vs. buy" win:
- Recommendation quality from day 1 is excellent (Amazon's training data is their competitive advantage, but the model architecture in Personalize is sound).
- Engineering resources can focus on the chatbot's unique value (NLP, orchestration) rather than commodity ML infrastructure.
- At scale, Amazon's recommendation signals (all of amazon.com purchase data) feed downstream models — the manga chatbot benefits from cross-category signals.
Caching Strategy: What Gets Cached and Why
ElastiCache Redis as the Hot Path
┌──────────────────────────────────────────────────────────────┐
│ CACHING DECISION TABLE │
├─────────────────────┬──────────┬──────────┬────────────────┤
│ Data Type │ Cached? │ TTL │ Reason │
├─────────────────────┼──────────┼──────────┼────────────────┤
│ Product details │ ✅ Yes │ 1 hour │ Changes rarely │
│ Promotions/deals │ ✅ Yes │ 5 min │ Time-sensitive │
│ Recommendations │ ✅ Yes │ 15 min │ Expensive call │
│ FAQ answers │ ✅ Yes │ 24 hours │ Very stable │
│ Current PRICE │ ❌ NO │ N/A │ MUST be live │
│ Inventory status │ ❌ NO │ N/A │ MUST be live │
│ Order status │ ❌ NO │ N/A │ MUST be live │
│ Conversation memory │ ✅ Yes │ 1 hour │ DynamoDB backup│
└─────────────────────┴──────────┴──────────┴────────────────┘
Cache key design:
# Product details cache
cache_key = f"product:{asin}" # e.g., "product:B08KTZ8X3Q"
# Recommendation cache (user-specific, 15min TTL)
cache_key = f"recs:{customer_id}:{query_hash}" # e.g., "recs:cust123:a7f3b2"
# FAQ answer cache
cache_key = f"faq:{question_hash}" # e.g., "faq:3f8a1c"
# Promotions cache (global, 5min TTL)
cache_key = f"promotions:category:{category}" # e.g., "promotions:category:manga"
Cache invalidation — event-driven:
import json

# When a product is updated in the catalog:
async def on_product_updated(event: ProductUpdateEvent):
    asin = event.asin

    # Invalidate the product cache entry
    await redis_client.delete(f"product:{asin}")

    # Invalidate any recommendation caches that reference this product
    # (secondary index: asin → set of cache keys that contain it)
    cache_keys = await redis_client.smembers(f"product_cache_refs:{asin}")
    if cache_keys:
        await redis_client.delete(*cache_keys)

    # Trigger re-indexing if the product description changed
    # (PartitionKey is required by Kinesis; an async client such as
    # aioboto3 is assumed here)
    if event.description_changed:
        await kinesis.put_record(
            StreamName="rag-reindex",
            Data=json.dumps({"asin": asin, "action": "reindex"}),
            PartitionKey=asin,
        )
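The invalidation above only works if the product_cache_refs secondary index is populated on the write path. A sketch of that half, reusing the same client and key scheme; the index TTL mirrors the 15-minute recommendation entry so it expires with the data:
async def cache_recommendations(customer_id: str, query_hash: str,
                                asins: list, products: list):
    key = f"recs:{customer_id}:{query_hash}"
    await redis_client.set(key, json.dumps(products), ex=900)  # 15-minute TTL
    # Record, per ASIN, which cache keys reference it so that
    # on_product_updated can find and delete them later
    for asin in asins:
        await redis_client.sadd(f"product_cache_refs:{asin}", key)
        await redis_client.expire(f"product_cache_refs:{asin}", 900)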
Why prices are NEVER cached:
If the chatbot shows a user a price from cache that's $5 cheaper than the current price, and the user clicks "Buy" and sees a different price → trust is broken. Worse, Amazon is at risk of legal/regulatory issues for advertising a price it doesn't honor. Prices are always fetched from the live catalog, with zero caching.
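One way to honor this rule without giving up caching on the rest of the product record: serve static details from cache and overlay price and stock from the live source on every request. A sketch; `pricing.get_live_price` is a hypothetical live-pricing client:
async def get_product_for_display(asin: str) -> dict:
    # Static details (title, author, description) may come from cache...
    product = await get_product_cached(asin)
    # ...but price and availability are always fetched live, never cached
    live = await pricing.get_live_price(asin)  # hypothetical client
    product["price"] = live["price"]
    product["in_stock"] = live["in_stock"]
    return product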
Q29. Where to introduce asynchronous patterns?
Short Answer
Analytics, feedback, RAG re-indexing, human handoff, and slow response generation are all candidates for async.
Deep Dive
Async patterns serve two purposes:
1. Fire-and-forget — the chatbot doesn't need the result before sending its response.
2. Decoupling — the real-time response path is separated from slower background operations.
Current synchronous calls (keep sync):
Intent Classification → needs result before routing (~20ms)
Product Catalog query → needs data for LLM context (~50ms)
Recommendations fetch → needs data for LLM context (~200ms)
LLM generation → needs output to send to user (~1,500ms)
Guardrails validation → must complete before sending to user (~100ms)
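All five stay synchronous, but the two context fetches don't depend on each other and can at least overlap. A sketch with asyncio.gather; `catalog.search` is an assumed catalog client method:
import asyncio

async def build_context(customer_id: str, query: str) -> dict:
    # Both results are needed before LLM generation, but not sequentially:
    # running them concurrently cuts ~50ms + ~200ms down to ~200ms
    products, recs = await asyncio.gather(
        catalog.search(query),                    # ~50ms
        get_recommendations(customer_id, query),  # ~200ms
    )
    return {"products": products, "recommendations": recs}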
Where to go async:
1. Analytics Logging (currently async via Kinesis — correct)
import asyncio
import time

_background_tasks: set = set()  # hold refs so pending tasks aren't garbage-collected

async def handle_message(session_id: str, message: str):
    start = time.monotonic()
    intent = await classify_intent(message)  # assumed helper from the routing step
    response = await generate_response(message)
    elapsed = (time.monotonic() - start) * 1000

    # Don't await — fire and forget
    task = asyncio.create_task(
        analytics.log_event({
            "session_id": session_id,
            "message": message,
            "response": response,
            "intent": intent,
            "latency_ms": elapsed,
        })
    )
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)

    return response  # return immediately without waiting for analytics
2. Feedback Processing (SQS queue)
import json
from datetime import datetime, timezone

async def submit_feedback(session_id: str, turn_id: str, rating: int):
    # Queue the message for async processing
    # (an async SQS client such as aioboto3 is assumed)
    await sqs.send_message(
        QueueUrl=FEEDBACK_QUEUE_URL,
        MessageBody=json.dumps({
            "session_id": session_id,
            "turn_id": turn_id,
            "rating": rating,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
    )
    # Immediately return 200 OK — the user doesn't wait for feedback to be processed
    return {"status": "received"}
3. RAG Re-indexing (event-driven, not real-time path)
import json

# Triggered by a catalog change event (S3 upload, DynamoDB Streams)
async def on_knowledge_base_updated(event: S3Event):
    for record in event.records:
        # PartitionKey is required by Kinesis; an async client is assumed
        await kinesis.put_record(
            StreamName="rag-reindex",
            Data=json.dumps({
                "s3_key": record.s3.key,
                "action": "upsert",
                "category": infer_category(record.s3.key),
            }),
            PartitionKey=record.s3.key,
        )
    # Indexing happens asynchronously — the current RAG index continues serving
    # The new index becomes available in ~minutes without any downtime
4. Human Handoff (Amazon Connect — async escalation)
async def escalate_to_human(session_id: str, customer_id: str,
                            conversation_summary: str):
    # Create a task in the human agent queue
    # (start_task_contact also requires InstanceId and Name; the values
    # below are placeholders, and an async Connect client is assumed)
    await connect.start_task_contact(
        InstanceId=CONNECT_INSTANCE_ID,
        Name="Chatbot escalation",
        ContactFlowId=ESCALATION_FLOW_ID,
        Attributes={
            "customer_id": customer_id,
            "conversation_summary": conversation_summary,
            "session_id": session_id,
            "escalation_reason": "user_requested",
        },
    )
    # Immediately acknowledge to user — agent availability is checked async
    # (get_estimated_wait_time is an assumed wrapper around Connect queue
    # metrics such as get_current_metric_data, not a native API call)
    return {
        "message": "I've connected you with a support agent. They'll be with you shortly.",
        "estimated_wait": await connect.get_estimated_wait_time(),
    }
5. Slow response generation (typing indicator pattern)
For responses expected to take >3 seconds:
1. Immediately send: { "type": "typing_indicator", "message": "MangaAssist is thinking..." }
2. Generate response asynchronously
3. When ready, push via WebSocket: { "type": "response", "content": "..." }
This keeps the user informed and prevents them from thinking the chatbot is broken.
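A sketch of the push side over an API Gateway WebSocket: post_to_connection is the real apigatewaymanagementapi call, while the endpoint URL, connection handling, and the reuse of generate_response are illustrative.
import asyncio
import json

import boto3

apigw = boto3.client(
    "apigatewaymanagementapi",
    endpoint_url="https://abc123.execute-api.ap-northeast-1.amazonaws.com/prod",  # placeholder
)

def push(connection_id: str, payload: dict):
    apigw.post_to_connection(ConnectionId=connection_id,
                             Data=json.dumps(payload).encode())

async def respond_slow(connection_id: str, message: str):
    # 1. Tell the user we're working on it, immediately
    await asyncio.to_thread(push, connection_id,
                            {"type": "typing_indicator", "message": "MangaAssist is thinking..."})
    # 2. Generate the response (the slow part, >3 seconds)
    content = await generate_response(message)
    # 3. Push the finished response over the same WebSocket
    await asyncio.to_thread(push, connection_id,
                            {"type": "response", "content": content})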
Decision framework for sync vs. async:
| Operation | Need result before responding? | Impact if delayed? | Pattern |
|---|---|---|---|
| Intent classification | ✅ Yes | Blocks routing | Sync |
| Product data | ✅ Yes | Can't build context | Sync |
| Analytics write | ❌ No | None to user | Async (Kinesis) |
| Feedback save | ❌ No | None to user | Async (SQS) |
| Cache update | ❌ No | None to user | Async background |
| Human handoff | Partial (acknowledge sync) | Lose the customer | Async queue + sync ack |