
HLD Deep Dive: Analytics & Observability

Questions covered: Q15, Q20, Q33 (feedback loop)
Interviewer level: Senior Engineer → Staff Engineer


Q15. Why Kinesis instead of writing directly to Redshift?

Short Answer

Kinesis acts as a durable buffer that decouples the real-time response path from analytical writes. Direct Redshift writes would add latency and create backpressure.

Deep Dive

The problem with direct Redshift writes:

Without Kinesis:
  User sends message
  → Orchestrator processes
  → Writes analytics to Redshift (INSERT statement)

  Redshift write latency: ~100–500ms (columnar DB, not designed for OLTP)
  During peak: Redshift JDBC connection pool exhaustion → INSERT blocks

  Result: User's response is delayed because of an analytics write.
          The analytics pipeline is on the critical path. This is wrong.

With Kinesis as a buffer:

User sends message
→ Orchestrator processes response (synchronous path)
→ Fire-and-forget: puts event to Kinesis (< 5ms, async)
→ Responds to user immediately

[Background, completely decoupled]:
  Kinesis Data Stream (24hr retention buffer)
  → Kinesis Firehose (micro-batches every 60 seconds)
  → S3 (raw event archive)
  → Redshift Spectrum or COPY command (batch analytics load)
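
A minimal sketch of that fire-and-forget put, assuming an async Kinesis client (e.g. aioboto3); the stream name is illustrative, and failures are logged and swallowed so analytics can never delay the user's response:

import json
import logging

logger = logging.getLogger(__name__)

async def emit_analytics_event(kinesis, session_id: str, event: dict):
    """Fire-and-forget: analytics must never block the response path."""
    try:
        await kinesis.put_record(
            StreamName="chatbot-events",       # stream name is illustrative
            PartitionKey=session_id,           # keeps a session's events on one shard
            Data=json.dumps(event).encode("utf-8"),
        )
    except Exception:
        # Dropping one analytics event is acceptable; delaying the response is not.
        logger.warning("analytics event dropped", exc_info=True)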

Kinesis pipeline architecture:

Chatbot Events
     │
     ▼ (async, < 5ms)
[Kinesis Data Stream]
     │  Shard: 1 shard per 1,000 events/sec
     │  Retention: 24 hours
     │  Parallel consumers possible
     │
     ├──► [Consumer 1: Kinesis Firehose → S3 raw events]
     │         Buffering: 60s or 5MB
     │         → s3://manga-analytics/events/2026/03/25/
     │
     ├──► [Consumer 2: Lambda real-time processor]
     │         → DynamoDB for live dashboard metrics
     │         → Publishes to CloudWatch custom metrics
     │
     └──► [Consumer 3: Kinesis Analytics (Flink)]
               → SQL queries on live stream
               → Anomaly detection (sudden spike in escalations?)
               → Real-time alerts
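
Consumer 2 as a sketch, assuming a standard Kinesis event source mapping to Lambda; the DynamoDB table name, metric namespace, and event shape are illustrative:

import base64
import json
import boto3

dynamodb = boto3.resource("dynamodb")
cloudwatch = boto3.client("cloudwatch")
metrics_table = dynamodb.Table("chatbot-live-metrics")   # table name is illustrative

def handler(event, context):
    """Kinesis batch → DynamoDB counters for the live dashboard + a CloudWatch metric."""
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))

        # Increment a per-intent counter that the live dashboard reads
        metrics_table.update_item(
            Key={"metric": f"intent#{payload.get('intent', 'unknown')}"},
            UpdateExpression="ADD event_count :one",
            ExpressionAttributeValues={":one": 1},
        )

    # One PutMetricData call per batch keeps the custom-metric volume cheap
    cloudwatch.put_metric_data(
        Namespace="Chatbot",
        MetricData=[{
            "MetricName": "EventsProcessed",
            "Value": len(event["Records"]),
            "Unit": "Count",
        }],
    )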

Why Kinesis over SQS for analytics?

                    | Kinesis                                    | SQS
──────────────────────────────────────────────────────────────────────────────────────────
Multiple consumers  | ✅ Fan-out to N consumers                  | ❌ Each message consumed by one
Message replay      | ✅ Replay within retention window          | ❌ Once consumed, gone
Ordering            | ✅ Per-shard ordering                      | FIFO only with a FIFO queue
Analytics streaming | ✅ First-class with Kinesis Analytics      | ❌ Not designed for this
Throughput          | 1 MB/s in per shard (add shards for GB/s)  | Very high

Data retention and cost:

S3 (raw events): 
  - Compressed Parquet format: ~80% smaller than JSON
  - Standard tier: 30 days
  - IA tier: 30–365 days
  - Glacier: >365 days (compliance)
  - Cost: ~$0.023/GB/month (standard)

Redshift (analytics):
  - Only aggregated/summarized data (not raw events)
  - Keeps 90 days of data for trend analysis
  - Historical data queried via Redshift Spectrum from S3
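
The S3 tiers above map onto a standard lifecycle policy; a minimal sketch with boto3 (bucket name, prefix, and rule ID are illustrative):

import boto3

s3 = boto3.client("s3")

# Mirrors the retention tiers: Standard → IA at 30 days, Glacier at 365 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="manga-analytics",                  # bucket name is illustrative
    LifecycleConfiguration={
        "Rules": [{
            "ID": "raw-event-tiering",
            "Filter": {"Prefix": "events/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "GLACIER"},
            ],
        }]
    },
)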


Q20. What metrics to track from Day 1?

Short Answer

Latency (p50/p95/p99), intent distribution, resolution rate, escalation rate, feedback ratio, conversion rate, session length.

Deep Dive

The metrics framework: Operational + Business

Tier 1: Operational Health (engineering owns)

Latency:
  p50 latency:    < 800ms    target
  p95 latency:    < 2,000ms  target
  p99 latency:    < 5,000ms  SLA

  Why all three? 
  p50 = typical user experience
  p95 = most users (miss this → 1 in 20 users frustrated)
  p99 = outliers (miss this → still impacts real users at scale)
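
One way to get these percentiles is to publish each request's latency as a CloudWatch custom metric and let CloudWatch derive p50/p95/p99; a minimal sketch (namespace and dimension are illustrative):

import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_latency(latency_ms: float, intent: str):
    """One latency sample per request; percentiles are computed over these samples."""
    cloudwatch.put_metric_data(
        Namespace="Chatbot",                   # namespace is illustrative
        MetricData=[{
            "MetricName": "ResponseLatency",
            "Dimensions": [{"Name": "Intent", "Value": intent}],
            "Value": latency_ms,
            "Unit": "Milliseconds",
        }],
    )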

Error rates:
  Guardrail trigger rate:   < 1%   (higher = prompt or data issue)
  LLM timeout rate:         < 0.1%
  Service dependency errors: < 0.5% per service
  WebSocket disconnection:  < 0.5%

Throughput:
  Messages per second (current vs. capacity)
  LLM calls per minute
  Cache hit rate: > 70% (lower = caching strategy problem)

Tier 2: Intent Quality (ML team owns)

Intent distribution (track daily):
  product_discovery: 10%   ← Is this growing? Shows adoption
  product_question:  15%
  faq:               25%   ← High = users asking same questions = FAQ gap?
  order_tracking:    20%
  return_request:     5%
  recommendation:    15%
  chitchat:          10%   ← High = users not finding value in shopping help?
  escalation:         5%   ← Alert if >8% (system not handling user needs)

Intent classification accuracy (measured via sampling):
  Target: > 95% correct
  Method: Sample 100 conversations/week, label manually, compare vs. classifier
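
A sketch of that weekly check, assuming the labeled sample is available as (user_message, human_label, classifier_label) tuples:

def intent_accuracy(labeled_sample):
    """labeled_sample: list of (user_message, human_label, classifier_label)."""
    correct = sum(1 for _, human, predicted in labeled_sample if human == predicted)
    return correct / len(labeled_sample)   # target: > 0.95 on ~100 samples/week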

Tier 3: User Experience (product team owns)

Resolution Rate:
  Definition: Did the user's question get answered without escalation?
  Calculation: 1 - (escalation_count / total_sessions) for non-escalation intents
  Target: > 85%

  Why this matters: The most important chatbot metric. If users escalate to human 
  agents 50% of the time, the chatbot isn't working.
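
The same calculation as a short sketch, assuming session records expose an intent field and an escalated flag:

def resolution_rate(sessions):
    """1 - (escalations / total), over sessions that weren't opened just to escalate."""
    eligible = [s for s in sessions if s.intent != "escalation"]
    escalated = sum(1 for s in eligible if s.escalated)
    return 1 - (escalated / len(eligible)) if eligible else None   # target: > 0.85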

Escalation Rate:
  Definition: % of sessions that reach human agent
  Target: < 5% for the first 3 months, < 3% at maturity

  Breakdown by intent:
    Returns escalations: 15%  ← expected (complex cases)
    FAQ escalations:      2%  ← should be low
    Order escalations:    8%  ← monitor closely

User Feedback:
  Thumbs up rate:   > 70%
  Thumbs down rate: < 10%

  ⚠️ Watch the silent majority: Most users don't give feedback.
  Keep the feedback widget prominent but not annoying (it appears after every 5th message).

Tier 4: Business Impact (business team owns)

Conversion Rate (most important business metric):
  Definition: % of recommendation sessions → user adds product to cart
  Baseline: Amazon's current manga page conversion rate (e.g., 3.5%)
  Target: Chatbot sessions show ≥ 5% conversion (users buy more when the chatbot recommends)

  Measurement: A/B test — 50% of sessions with chatbot, 50% browse/search only
  Compare: Average cart value, conversion rate, session length
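
A sketch of the comparison step, assuming each session record carries its experiment arm and an added_to_cart flag:

def conversion_by_arm(sessions):
    """Compare add-to-cart conversion between chatbot and browse/search-only sessions."""
    rates = {}
    for arm in ("chatbot", "control"):
        arm_sessions = [s for s in sessions if s.experiment_arm == arm]
        converted = sum(1 for s in arm_sessions if s.added_to_cart)
        rates[arm] = converted / len(arm_sessions) if arm_sessions else None
    return rates   # e.g. {"chatbot": 0.05, "control": 0.035}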

Session Length:
  Definition: Number of turns per session
  Target: 3–7 turns

  < 3 turns: Users leaving early (chatbot not engaging or useful)
  > 10 turns: Users frustrated and looping (chatbot not resolving)

Return Visit Rate:
  Definition: Users who used chatbot → came back within 7 days
  Target: > 40% (indicates the chatbot provided value)

Metric alerting rules:

Critical (P1 - page on-call immediately):
  - p99 latency > 10 seconds for 5 consecutive minutes
  - Error rate > 5%
  - Circuit breaker opens on Bedrock LLM
  - Escalation rate > 20%

High (P2 - alert engineering team):
  - p95 latency > 3 seconds for 10 minutes
  - Guardrail trigger rate > 5%
  - Cache hit rate < 50%
  - Escalation rate > 10%

Medium (P3 - investigate within 24h):
  - Thumbs down rate > 15%
  - Resolution rate < 70%
  - Intent classification accuracy < 90%

Low (P4 - weekly review):
  - Conversion rate declining 3 weeks in a row
  - New intent category growing rapidly (may need to add to classifier)
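
The P1 latency rule above, expressed as a CloudWatch alarm sketch; it assumes latency is published as the ResponseLatency custom metric, and the alarm name, namespace, and SNS topic ARN are illustrative:

import boto3

cloudwatch = boto3.client("cloudwatch")

# P1: p99 latency > 10 seconds for 5 consecutive minutes → page on-call
cloudwatch.put_metric_alarm(
    AlarmName="chatbot-p99-latency-p1",        # name is illustrative
    Namespace="Chatbot",
    MetricName="ResponseLatency",
    ExtendedStatistic="p99",
    Period=60,                                 # 1-minute windows
    EvaluationPeriods=5,                       # 5 consecutive minutes
    Threshold=10_000,                          # milliseconds
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-page"],   # illustrative ARN
)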

Q33. Feedback loop that actually improves the chatbot

Short Answer

Capture explicit + implicit signals → label dataset → fine-tune classifier → fix RAG gaps → A/B test prompts → monthly transcript review.

Deep Dive

Signal collection — explicit and implicit:

# Explicit: User thumbs up/down
# sqs / kinesis below are assumed to be async clients (e.g. aioboto3);
# FEEDBACK_QUEUE is the SQS queue URL.
import json

async def record_feedback(session_id: str, turn_id: str, rating: int,
                          comment: str = None):
    await sqs.send_message(
        QueueUrl=FEEDBACK_QUEUE,
        MessageBody=json.dumps({
            "type": "explicit_feedback",
            "session_id": session_id,
            "turn_id": turn_id,
            "rating": rating,         # 1 (positive) or -1 (negative)
            "comment": comment
        })
    )

# Implicit: User clicked recommended product → positive signal
async def record_click(session_id: str, turn_id: str, asin: str):
    await kinesis.put_record(
        StreamName="chatbot-events",
        PartitionKey=session_id,  # required by Kinesis; keeps a session's events on one shard
        Data=json.dumps({
            "type": "product_click",
            "session_id": session_id,
            "turn_id": turn_id,
            "asin": asin,
            "signal": "positive"  # User clicked = recommended product was relevant
        })
    )

# Implicit: User escalated immediately after response → negative signal
# Implicit: User added to cart → strong positive signal
# Implicit: User left session quickly → possibly negative signal
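
One of those implicit signals as a sketch; the 30-second window is an assumption, not a measured threshold:

QUICK_ESCALATION_SECONDS = 30   # assumption: escalating this quickly implies the answer failed

def implicit_signal_from_escalation(response_ts: float, escalation_ts: float):
    """Treat an escalation shortly after the bot's reply as implicit negative feedback."""
    if 0 <= escalation_ts - response_ts <= QUICK_ESCALATION_SECONDS:
        return "negative"
    return None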

Building the training dataset:

# Weekly job: Build labeled dataset from feedback
def build_classifier_training_data():
    positive_examples = []
    negative_examples = []

    # Get thumbs-up conversations
    for session in get_positive_feedback_sessions(last_n_days=7):
        positive_examples.append({
            "user_message": session.user_message,
            "intent": session.classified_intent,
            "label": session.classified_intent  # Confirmed correct
        })

    # Get thumbs-down conversations → review for mis-classifications
    for session in get_negative_feedback_sessions(last_n_days=7):
        # Send to human labeling queue
        labeling_queue.push({
            "user_message": session.user_message,
            "classified_intent": session.classified_intent,
            "response": session.response,
            "feedback": session.user_feedback
        })

    # Human labelers correct the intent for negative examples
    # These become "hard negatives" for classifier retraining
    corrected = labeling_queue.get_completed_labels()
    negative_examples.extend(corrected)

    return positive_examples + negative_examples

Using negative feedback to fix RAG gaps:

# If users consistently give thumbs-down to FAQ responses about a topic:
# → Knowledge base has a gap for that topic

from collections import Counter

def analyze_rag_gaps(negative_feedback_sessions):
    intent_failure_counts = Counter()

    for session in negative_feedback_sessions:
        if session.intent == "faq":
            intent_failure_counts[extract_topic(session.user_message)] += 1

    # Topics with > 10 negative responses per week = knowledge base gap
    gaps = [(topic, count) for topic, count in intent_failure_counts.items() 
            if count > 10]

    # Create Jira tickets for content team to add FAQ docs for each gap
    for topic, count in sorted(gaps, key=lambda x: -x[1]):
        jira.create_ticket(
            title=f"Add FAQ content for: {topic}",
            body=f"Users asked about '{topic}' {count}x this week with negative feedback",
            priority="medium"
        )

A/B testing prompt improvements:

# Prompt A (control): Current production prompt
# Prompt B (variant): New prompt with improvement
PROMPT_EXPERIMENT = {
    "id": "prompt_experiment_007",
    "control_pct": 0.90,
    "treatment_pct": 0.10,
    "control_prompt": CURRENT_SYSTEM_PROMPT,
    "treatment_prompt": NEW_SYSTEM_PROMPT_CANDIDATE,
}

# After 2 weeks, compare:
# - Thumbs up rate: Control 68% vs Treatment 73% → Treatment wins
# - Escalation rate: Control 8% vs Treatment 6% → Treatment wins
# - Latency: Control 1.4s vs Treatment 1.5s → Slight regression (OK)
# Decision: Roll out Treatment (Prompt B) to 100%
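
How a session gets assigned a variant isn't shown above; a common approach is deterministic hashing on session_id so every turn in a session sees the same prompt. A sketch under that assumption:

import hashlib

def choose_prompt(session_id: str, experiment: dict) -> str:
    """Deterministically bucket a session into control or treatment."""
    bucket = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % 100
    if bucket < experiment["treatment_pct"] * 100:    # e.g. 0.10 → buckets 0-9
        return experiment["treatment_prompt"]
    return experiment["control_prompt"]

# system_prompt = choose_prompt(session_id, PROMPT_EXPERIMENT)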

Monthly escalation transcript review:

Process:
  1. Export all escalated conversations from the past month
  2. Categorize by reason: (a) chatbot couldn't answer, 
                            (b) wrong answer, 
                            (c) user needed human empathy, 
                            (d) technical failure
  3. For category (a): Add to knowledge base or train new intent
  4. For category (b): Fix guardrails or prompt for that error type
  5. For category (c): Acceptable — some users will always prefer humans
  6. For category (d): File engineering bug

Monthly improvement cycle:
  Week 1: Collect + analyze feedback data
  Week 2: Make improvements (new FAQ content, prompt changes, classifier data)
  Week 3: Deploy improvements to staging, A/B test
  Week 4: Ship to production, measure impact

Dashboard: Continuous improvement tracker

Month     | Escalation Rate | Resolution Rate | Thumbs Up | Conversion
──────────────────────────────────────────────────────────────────────────
Jan 2026  |     12%         |     71%         |   62%     |   4.1%
Feb 2026  |      9%         |     76%         |   67%     |   4.8%
Mar 2026  |      7%         |     81%         |   72%     |   5.3%
Target    |     <5%         |     >85%        |   >75%    |   >6.0%