HLD Deep Dive: Analytics & Observability
Questions covered: Q15, Q20, Q33 (feedback loop)
Interviewer level: Senior Engineer → Staff Engineer
Q15. Why Kinesis instead of writing directly to Redshift?
Short Answer
Kinesis acts as a durable buffer that decouples the real-time response path from analytical writes. Direct Redshift writes would add latency and create backpressure.
Deep Dive
The problem with direct Redshift writes:
Without Kinesis:
User sends message
→ Orchestrator processes
→ Writes analytics to Redshift (INSERT statement)
Redshift write latency: ~100–500ms (columnar DB, not designed for OLTP)
During peak: Redshift JDBC connection pool exhaustion → INSERT blocks
Result: User's response is delayed because of an analytics write.
The analytics pipeline is on the critical path. This is wrong.
With Kinesis as a buffer:
User sends message
→ Orchestrator processes response (synchronous path)
→ Fire-and-forget: puts the event to Kinesis (<5ms, async)
→ Responds to user immediately
[Background, completely decoupled]:
Kinesis Data Stream (24hr retention buffer)
→ Kinesis Firehose (micro-batches every 60 seconds)
→ S3 (raw event archive)
→ Redshift Spectrum or COPY command (batch analytics load)
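A minimal sketch of the fire-and-forget put, assuming an async Kinesis client (e.g., from aioboto3); the handler schedules the put on the event loop and returns without waiting on it:

import asyncio
import json

async def emit_analytics_event(kinesis, event: dict):
    # PartitionKey keeps all of a session's events on the same shard (ordered)
    await kinesis.put_record(
        StreamName="chatbot-events",
        Data=json.dumps(event).encode(),
        PartitionKey=event["session_id"],
    )

def fire_and_forget(kinesis, event: dict):
    # Schedule the put without awaiting it: the response path never blocks on analytics
    asyncio.create_task(emit_analytics_event(kinesis, event))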
Kinesis pipeline architecture:
Chatbot Events
│
▼ (async, < 5ms)
[Kinesis Data Stream]
│ Shard: 1 shard per 1,000 events/sec
│ Retention: 24 hours
│ Parallel consumers possible
│
├──► [Consumer 1: Kinesis Firehose → S3 raw events]
│ Buffering: 60s or 5MB
│ → s3://manga-analytics/events/2026/03/25/
│
├──► [Consumer 2: Lambda real-time processor]
│ → DynamoDB for live dashboard metrics
│ → Publishes to CloudWatch custom metrics
│
└──► [Consumer 3: Kinesis Analytics (Flink)]
→ SQL queries on live stream
→ Anomaly detection (sudden spike in escalations?)
→ Real-time alerts
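Consumer 2 can be a standard Kinesis-triggered Lambda; a sketch, with the metric namespace assumed (Kinesis record payloads arrive base64-encoded in the Lambda event):

import base64
import json
import boto3

cloudwatch = boto3.client("cloudwatch")

def handler(event, context):
    # Each batch of Kinesis records is delivered base64-encoded
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        if payload.get("type") == "escalation":
            # Publish a custom metric for the live dashboard and alarms
            cloudwatch.put_metric_data(
                Namespace="Chatbot/Operational",  # hypothetical namespace
                MetricData=[{"MetricName": "Escalations", "Value": 1, "Unit": "Count"}],
            )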
Why Kinesis over SQS for analytics?
| | Kinesis | SQS |
|---|---|---|
| Multiple consumers | ✅ Fan-out to N consumers | ❌ Each message consumed by one |
| Message replay | ✅ Replay within retention window | ❌ Once consumed, gone |
| Ordering | ✅ Per-shard ordering | Only with FIFO queues |
| Analytics streaming | ✅ First-class with Kinesis Analytics | ❌ Not designed for this |
| Throughput | 1MB/s or 1,000 records/s per shard; scale by adding shards | Very high (nearly unlimited for standard queues) |
Data retention and cost:
S3 (raw events):
- Compressed Parquet format: ~80% smaller than JSON
- Standard tier: first 30 days
- IA tier: days 30–365
- Glacier: beyond day 365 (compliance retention)
- Cost: ~$0.023/GB/month (standard)
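That tiering maps directly onto an S3 lifecycle rule; a sketch using boto3, with the bucket and prefix taken from the pipeline above:

import boto3

s3 = boto3.client("s3")

# Standard for 30 days → IA until day 365 → Glacier thereafter
s3.put_bucket_lifecycle_configuration(
    Bucket="manga-analytics",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "raw-event-tiering",
            "Status": "Enabled",
            "Filter": {"Prefix": "events/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "GLACIER"},
            ],
        }]
    },
)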
Redshift (analytics):
- Only aggregated/summarized data (not raw events)
- Keeps 90 days of data for trend analysis
- Historical data queried via Redshift Spectrum from S3
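One way to run that batch load without managing JDBC connections is the Redshift Data API, which queues the statement asynchronously; a sketch, with cluster, database, user, and IAM role names assumed:

import boto3

redshift = boto3.client("redshift-data")

# COPY the day's Parquet files from S3 into the aggregate table
redshift.execute_statement(
    ClusterIdentifier="analytics-cluster",  # hypothetical
    Database="analytics",                   # hypothetical
    DbUser="loader",                        # hypothetical
    Sql=(
        "COPY events_daily "
        "FROM 's3://manga-analytics/events/2026/03/25/' "
        "IAM_ROLE 'arn:aws:iam::111122223333:role/redshift-copy' "  # hypothetical role
        "FORMAT AS PARQUET"
    ),
)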
Q20. What metrics to track from Day 1?
Short Answer
Latency (p50/p95/p99), intent distribution, resolution rate, escalation rate, feedback ratio, conversion rate, session length.
Deep Dive
The metrics framework: Operational + Business
Tier 1: Operational Health (engineering owns)
Latency:
p50 latency: < 800ms target
p95 latency: < 2,000ms target
p99 latency: < 5,000ms SLA
Why all three?
p50 = typical user experience
p95 = most users (miss this → 1 in 20 users frustrated)
p99 = outliers (miss this → still impacts real users at scale)
Error rates:
Guardrail trigger rate: < 1% (higher = prompt or data issue)
LLM timeout rate: < 0.1%
Service dependency errors: < 0.5% per service
WebSocket disconnection: < 0.5%
Throughput:
Messages per second (current vs. capacity)
LLM calls per minute
Cache hit rate: > 70% (lower = caching strategy problem)
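All Tier 1 numbers can come from CloudWatch custom metrics emitted per request; a sketch, with the namespace assumed (CloudWatch derives p50/p95/p99 server-side from the raw samples):

import boto3

cloudwatch = boto3.client("cloudwatch")

def emit_request_metrics(latency_ms: float, cache_hit: bool):
    cloudwatch.put_metric_data(
        Namespace="Chatbot/Operational",  # hypothetical namespace
        MetricData=[
            # Raw latency sample; percentile stats are computed at query time
            {"MetricName": "Latency", "Value": latency_ms, "Unit": "Milliseconds"},
            # 1/0 samples; the average over a window is the cache hit rate
            {"MetricName": "CacheHit", "Value": 1.0 if cache_hit else 0.0, "Unit": "Count"},
        ],
    )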
Tier 2: Intent Quality (ML team owns)
Intent distribution (track daily):
product_discovery: 10% ← Is this growing? Shows adoption
product_question: 10%
faq: 25% ← High = users asking same questions = FAQ gap?
order_tracking: 20%
return_request: 5%
recommendation: 15%
chitchat: 10% ← High = users not finding value in shopping help?
escalation: 5% ← Alert if >8% (system not handling user needs)
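A sketch of the daily distribution job, assuming a helper that returns the day's classified intents as a list of strings:

from collections import Counter

ESCALATION_ALERT_SHARE = 0.08  # alert above 8%, per the rule above

def intent_distribution(intents: list[str]) -> dict[str, float]:
    counts = Counter(intents)
    total = sum(counts.values())
    dist = {intent: n / total for intent, n in counts.items()}
    if dist.get("escalation", 0.0) > ESCALATION_ALERT_SHARE:
        print("ALERT: escalation share > 8% -- system not handling user needs")
    return dist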
Intent classification accuracy (measured via sampling):
Target: > 95% correct
Method: Sample 100 conversations/week, label manually, compare vs. classifier
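That weekly check is small enough to script; a sketch, assuming each conversation dict carries an id and the classifier's output, and that human labels arrive keyed by id:

import random

def weekly_intent_accuracy(conversations: list[dict],
                           human_labels: dict[str, str]) -> float:
    # Sample 100 conversations and compare classifier output to manual labels
    sample = random.sample(conversations, k=min(100, len(conversations)))
    correct = sum(1 for c in sample
                  if human_labels[c["id"]] == c["classified_intent"])
    return correct / len(sample)  # target: > 0.95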
Tier 3: User Experience (product team owns)
Resolution Rate:
Definition: Did the user's question get answered without escalation?
Calculation: 1 - (escalation_count / total_sessions) for non-escalation intents
Target: > 85%
Why this matters: The most important chatbot metric. If users escalate to human
agents 50% of the time, the chatbot isn't working.
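Computed directly from the definition above; a sketch, assuming each session record carries its intent and an escalated flag:

def resolution_rate(sessions: list[dict]) -> float:
    # Sessions whose intent was escalation to begin with don't count against the bot
    eligible = [s for s in sessions if s["intent"] != "escalation"]
    escalated = sum(1 for s in eligible if s["escalated"])
    return 1 - escalated / len(eligible)  # target: > 0.85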
Escalation Rate:
Definition: % of sessions that reach human agent
Target: < 5% for the first 3 months, < 3% at maturity
Breakdown by intent:
Returns escalations: 15% ← expected (complex cases)
FAQ escalations: 2% ← should be low
Order escalations: 8% ← monitor closely
User Feedback:
Thumbs up rate: > 70%
Thumbs down rate: < 10%
⚠️ Watch the silent majority: Most users don't give feedback.
Keep the feedback widget prominent but not intrusive (e.g., shown after every 5th message).
Tier 4: Business Impact (business team owns)
Conversion Rate (most important business metric):
Definition: % of recommendation sessions → user adds product to cart
Baseline: Amazon's current manga page conversion rate (e.g., 3.5%)
Target: Chatbot sessions show ≥ 5% conversion (users buy more when guided by recommendations)
Measurement: A/B test — 50% of sessions with chatbot, 50% browse/search only
Compare: Average cart value, conversion rate, session length
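The A/B comparison should be a significance test rather than an eyeball check; a minimal two-proportion z-test sketch over the conversion counts:

import math

def conversion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    # Two-proportion z-test: is arm B's conversion rate genuinely higher than arm A's?
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se  # |z| > 1.96 → significant at the 5% level

# e.g. 3.5% baseline vs 5.2% chatbot arm at 10k sessions each:
# conversion_z_test(350, 10_000, 520, 10_000) ≈ 5.9 → chatbot arm wins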
Session Length:
Definition: Number of turns per session
Target: 3–7 turns
< 3 turns: Users leaving early (chatbot not engaging or useful)
> 10 turns: Users frustrated and looping (chatbot not resolving)
Return Visit Rate:
Definition: Users who used chatbot → came back within 7 days
Target: > 40% (indicates the chatbot provided value)
Metric alerting rules:
Critical (P1 - page on-call immediately):
- p99 latency > 10 seconds for 5 consecutive minutes
- Error rate > 5%
- Circuit breaker opens on Bedrock LLM
- Escalation rate > 20%
High (P2 - alert engineering team):
- p95 latency > 3 seconds for 10 minutes
- Guardrail trigger rate > 5%
- Cache hit rate < 50%
- Escalation rate > 10%
Medium (P3 - investigate within 24h):
- Thumbs down rate > 15%
- Resolution rate < 70%
- Intent classification accuracy < 90%
Low (P4 - weekly review):
- Conversion rate declining 3 weeks in a row
- New intent category growing rapidly (may need to add to classifier)
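Each rule translates to a CloudWatch alarm; a sketch of the P1 latency rule, reusing the hypothetical namespace from the Tier 1 sketch (SNS topic ARN also hypothetical):

import boto3

cloudwatch = boto3.client("cloudwatch")

# P1: p99 latency > 10 seconds for 5 consecutive minutes → page on-call
cloudwatch.put_metric_alarm(
    AlarmName="chatbot-p99-latency-p1",
    Namespace="Chatbot/Operational",
    MetricName="Latency",
    ExtendedStatistic="p99",
    Period=60,                 # evaluate per minute
    EvaluationPeriods=5,       # 5 consecutive breaches
    Threshold=10_000,          # milliseconds
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:oncall-page"],
)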
Q33. Feedback loop that actually improves the chatbot
Short Answer
Capture explicit + implicit signals → label dataset → fine-tune classifier → fix RAG gaps → A/B test prompts → monthly transcript review.
Deep Dive
Signal collection — explicit and implicit:
# Explicit: User thumbs up/down
# (`sqs` and FEEDBACK_QUEUE are assumed module-level: an async SQS client,
# e.g. from aioboto3, and the feedback queue URL)
import json

async def record_feedback(session_id: str, turn_id: str, rating: int,
                          comment: str | None = None):
    await sqs.send_message(
        QueueUrl=FEEDBACK_QUEUE,
        MessageBody=json.dumps({
            "type": "explicit_feedback",
            "session_id": session_id,
            "turn_id": turn_id,
            "rating": rating,  # 1 (positive) or -1 (negative)
            "comment": comment
        })
    )
# Implicit: User clicked recommended product → positive signal
async def record_click(session_id: str, turn_id: str, asin: str):
    await kinesis.put_record(
        StreamName="chatbot-events",
        PartitionKey=session_id,  # required by Kinesis; keeps a session's events on one shard
        Data=json.dumps({
            "type": "product_click",
            "session_id": session_id,
            "turn_id": turn_id,
            "asin": asin,
            "signal": "positive"  # user clicked → the recommended product was relevant
        })
    )
# Implicit: User escalated immediately after response → negative signal
# Implicit: User added to cart → strong positive signal
# Implicit: User left session quickly → possibly negative signal
Building the training dataset:
# Weekly job: Build labeled dataset from feedback
# (get_*_feedback_sessions and labeling_queue are assumed data-access helpers)
def build_classifier_training_data():
positive_examples = []
negative_examples = []
# Get thumbs-up conversations
for session in get_positive_feedback_sessions(last_n_days=7):
positive_examples.append({
"user_message": session.user_message,
"intent": session.classified_intent,
"label": session.classified_intent # Confirmed correct
})
# Get thumbs-down conversations → review for mis-classifications
for session in get_negative_feedback_sessions(last_n_days=7):
# Send to human labeling queue
labeling_queue.push({
"user_message": session.user_message,
"classified_intent": session.classified_intent,
"response": session.response,
"feedback": session.user_feedback
})
# Human labelers correct the intent for negative examples
# These become "hard negatives" for classifier retraining
corrected = labeling_queue.get_completed_labels()
negative_examples.extend(corrected)
return positive_examples + negative_examples
Using negative feedback to fix RAG gaps:
# If users consistently give thumbs-down to FAQ responses about a topic,
# the knowledge base has a gap for that topic.
# (extract_topic and jira are assumed helpers.)
from collections import Counter

def analyze_rag_gaps(negative_feedback_sessions):
    intent_failure_counts = Counter()
for session in negative_feedback_sessions:
if session.intent == "faq":
intent_failure_counts[extract_topic(session.user_message)] += 1
# Topics with > 10 negative responses per week = knowledge base gap
gaps = [(topic, count) for topic, count in intent_failure_counts.items()
if count > 10]
# Create Jira tickets for content team to add FAQ docs for each gap
for topic, count in sorted(gaps, key=lambda x: -x[1]):
jira.create_ticket(
title=f"Add FAQ content for: {topic}",
body=f"Users asked about '{topic}' {count}x this week with negative feedback",
priority="medium"
)
A/B testing prompt improvements:
# Prompt A (control): Current production prompt
# Prompt B (variant): New prompt with improvement
PROMPT_EXPERIMENT = {
"id": "prompt_experiment_007",
"control_pct": 0.90,
"treatment_pct": 0.10,
"control_prompt": CURRENT_SYSTEM_PROMPT,
"treatment_prompt": NEW_SYSTEM_PROMPT_CANDIDATE,
}
# After 2 weeks, compare:
# - Thumbs up rate: Control 68% vs Treatment 73% → Treatment wins
# - Escalation rate: Control 8% vs Treatment 6% → Treatment wins
# - Latency: Control 1.4s vs Treatment 1.5s → Slight regression (OK)
# Decision: Roll out Treatment (Prompt B) to 100%
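Assignment should be sticky per session so a user sees one prompt throughout the conversation; a hash-bucketing sketch over the experiment config above:

import hashlib

def assign_prompt(session_id: str, experiment: dict) -> str:
    # Deterministic bucket in [0, 1): the same session always lands in the same arm
    digest = hashlib.sha256(f"{experiment['id']}:{session_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    if bucket < experiment["treatment_pct"]:
        return experiment["treatment_prompt"]
    return experiment["control_prompt"]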
Monthly escalation transcript review:
Process:
1. Export all escalated conversations from the past month
2. Categorize by reason: (a) chatbot couldn't answer,
(b) wrong answer,
(c) user needed human empathy,
(d) technical failure
3. For category (a): Add to knowledge base or train new intent
4. For category (b): Fix guardrails or prompt for that error type
5. For category (c): Acceptable — some users will always prefer humans
6. For category (d): File engineering bug
Monthly improvement cycle:
Week 1: Collect + analyze feedback data
Week 2: Make improvements (new FAQ content, prompt changes, classifier data)
Week 3: Deploy improvements to staging, A/B test
Week 4: Ship to production, measure impact
Dashboard: Continuous improvement tracker
| Month | Escalation Rate | Resolution Rate | Thumbs Up | Conversion |
|---|---|---|---|---|
| Jan 2026 | 12% | 71% | 62% | 4.1% |
| Feb 2026 | 9% | 76% | 67% | 4.8% |
| Mar 2026 | 7% | 81% | 72% | 5.3% |
| Target | < 5% | > 85% | > 75% | > 6.0% |