Answers – Scenario 01: Thumbs Feedback Interface

Questions: README.md
System: MangaAssist – JP Manga Chatbot on Amazon.com


Easy

A1. DynamoDB Table Schema for Thumbs Feedback

Table Name: MangaAssist_ThumbsFeedback

| Attribute | Type | Role | Description |
|---|---|---|---|
| session_id | String | Partition Key | Chat session identifier |
| timestamp#turn_id | String | Sort Key | ISO-8601 timestamp + turn number (e.g., 2025-03-15T10:23:45Z#007) |
| feedback | String | — | thumbs_up or thumbs_down |
| intent | String | — | Classified intent: recommendation, faq, etc. |
| response_id | String | — | Unique ID for the chatbot response |
| user_id_hash | String | — | SHA-256 hashed customer ID (privacy) |
| device_type | String | — | mobile, desktop, app |
| latency_ms | Number | — | Response latency when feedback was given |
| model_version | String | — | Bedrock Claude model version used |
| ttl | Number | — | Epoch timestamp for DynamoDB TTL (90-day retention in hot storage) |

Design rationale:

- Partition key = session_id: distributes writes evenly (sessions are naturally random) and avoids hot partitions from user_id (power users) or date-based keys.
- Sort key = timestamp#turn_id: enables range queries for all feedback within a session, ordered chronologically.
- GSI-1: intent (PK) + timestamp (SK) — enables per-intent aggregation queries.
- GSI-2: feedback (PK) + timestamp (SK) — enables quick retrieval of all negative feedback in a time window.
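
For illustration, the GSI-1 access pattern above as a boto3 query sketch (table and index names follow this answer; pagination and error handling are omitted):

# Sketch of the GSI-1 ("intent-timestamp-index") access pattern described above.
# Assumes the table/index names from this answer; the time window is illustrative.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('MangaAssist_ThumbsFeedback')

def feedback_for_intent(intent: str, start_iso: str, end_iso: str) -> list:
    """Fetch feedback events for one intent in a time range via GSI-1 (single page only)."""
    response = table.query(
        IndexName='intent-timestamp-index',
        KeyConditionExpression=Key('intent').eq(intent) & Key('timestamp').between(start_iso, end_iso)
    )
    return response['Items']

items = feedback_for_intent('recommendation', '2025-03-01T00:00:00Z', '2025-03-15T23:59:59Z')
down_rate = sum(i['feedback'] == 'thumbs_down' for i in items) / max(len(items), 1)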


A2. Three Strategies to Boost Feedback Rate

Strategy 1 — Delayed, contextual prompt: Instead of showing the widget immediately, wait 2–3 seconds after the response renders. For recommendation responses, trigger the widget only after the user has had time to see the manga title and cover image. This reduces "banner blindness" from instant widget display.

Strategy 2 — Gamified micro-incentive: Display a subtle message: "Your feedback helps us recommend better manga! 🎯" after every 5th interaction. For order_tracking and return_request, add: "Was this helpful?" — a natural phrasing that feels less like a survey. Avoid monetary incentives (biases feedback).

Strategy 3 — Inline response feedback: For recommendation intent, embed the thumbs directly next to each recommended manga title (not just the overall response). Users are more likely to react to specific items. For faq, place the widget at the end of the answer paragraph, not in a floating overlay.

Intent prioritization:

- recommendation benefits most — the binary signal directly measures whether the suggested manga resonated.
- faq benefits second — a thumbs-down indicates the answer didn't resolve the question.
- chitchat benefits least — subjective quality is hard to capture with binary feedback.


A3. Why Separate Feedback from Conversation History

Problems with co-located storage:

  1. Write contention: Conversation history is written sequentially during the chat. Feedback writes arrive asynchronously (minutes or hours later). Mixing them in one table creates unpredictable write patterns and complicates throughput planning.

  2. Different access patterns: Conversation history is read by session_id for context window construction (hot path, sub-10ms via ElastiCache). Feedback is read by intent, time range, and aggregation (analytics path). A single table forces conflicting GSI designs.

  3. TTL mismatch: Conversation history may be retained for 30 days; feedback data may need 12-month retention for trend analysis. Shared TTL policies cannot serve both.

  4. Schema evolution: Feedback schema evolves independently (adding free-text fields, multi-dimensional ratings). Mixing with conversation history makes migrations risky.

Preferred architecture:

User clicks 👎 → ECS Fargate API → Kinesis Data Streams (feedback-stream)
                                        ├── Lambda consumer → DynamoDB (MangaAssist_ThumbsFeedback) [hot, 90-day TTL]
                                        ├── Kinesis Firehose → S3 (s3://mangaassist-feedback/raw/) [cold, permanent]
                                        └── Kinesis Analytics → CloudWatch Metrics [real-time aggregation]

Kinesis decouples the write path from storage, enabling multiple consumers without backpressure on the chat API.
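
On the producer side, the ECS Fargate API only needs a single put_record call; a minimal boto3 sketch (stream name from A9, payload fields from the A1 schema):

# Hedged sketch of the feedback write path from the chat API into Kinesis.
import json
import boto3

kinesis = boto3.client('kinesis')

def publish_feedback(event: dict) -> None:
    """Fire-and-forget publish; downstream consumers (DynamoDB, Firehose, Analytics) fan out."""
    kinesis.put_record(
        StreamName='mangaassist-feedback-stream',
        Data=json.dumps(event).encode('utf-8'),
        PartitionKey=event['session_id']   # keeps a session's events on one shard, in order
    )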


Medium

A4. Statistical Significance of Intent-Level Thumbs-Up Rates

Test: Two-proportion z-test (comparing recommendation at 72% vs. faq at 58%).

Hypotheses:

- H₀: p_recommendation = p_faq
- H₁: p_recommendation ≠ p_faq

Formula:

z = (p₁ - p₂) / √(p̂(1-p̂)(1/n₁ + 1/n₂))
where p̂ = (x₁ + x₂) / (n₁ + n₂)

Minimum sample size (α = 0.05, power = 0.80, detecting the observed 72% vs. 58% gap):

n = (Z_α/2 + Z_β)² × (p₁(1-p₁) + p₂(1-p₂)) / (p₁ - p₂)²
n ≈ (1.96 + 0.84)² × (0.72×0.28 + 0.58×0.42) / (0.14)²
n ≈ 7.84 × 0.4452 / 0.0196 ≈ 178 per group

With a 14-point difference, even ~180 feedback events per intent would suffice. But the real concern is sample bias.
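
As a quick numerical check of the z-test and the sample-size formula above, a small Python sketch (scipy only; the inputs mirror the 72% vs. 58% figures in this answer):

# Numerical check of the two-proportion z-test and sample-size formula above.
from math import sqrt
from scipy.stats import norm

def two_proportion_ztest(x1: int, n1: int, x2: int, n2: int):
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, 2 * norm.sf(abs(z))            # two-sided p-value

def required_n_per_group(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> float:
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    return (z_a + z_b) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2

print(two_proportion_ztest(x1=720, n1=1000, x2=580, n2=1000))  # z ≈ 6.6, p ≪ 0.001
print(required_n_per_group(0.72, 0.58))                        # ≈ 178 per group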

Addressing power-user bias:

  1. Stratified analysis: Segment users into frequency tiers (1–5 interactions/week, 5–20, 20+). Compare thumbs rates within each tier.
  2. Inverse propensity weighting: Weight each feedback event by 1/P(leaving_feedback | user_segment). Estimate P using a logistic regression on user features (session count, device type, account age).
  3. Bootstrap confidence intervals: Resample feedback events with replacement 10,000 times, stratified by user segment, to get robust CIs that account for non-random sampling.


A5. Real-Time Negative Feedback Spike Detection

Architecture:

Thumbs events → Kinesis Data Streams → Kinesis Data Analytics (SQL application)
                                                ↓
                                        CloudWatch Custom Metric: "ThumbsDownRate_by_Intent"
                                                ↓
                                        CloudWatch Alarm (threshold breach)
                                                ↓
                                        SNS → PagerDuty / Slack

Kinesis Analytics SQL:

CREATE OR REPLACE STREAM "FEEDBACK_METRICS" (
    intent VARCHAR(64),
    window_end TIMESTAMP,
    total_feedback INTEGER,
    thumbs_down_count INTEGER,
    thumbs_down_rate DOUBLE
);

CREATE OR REPLACE PUMP "FEEDBACK_PUMP" AS
INSERT INTO "FEEDBACK_METRICS"
SELECT STREAM
    intent,
    STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '5' MINUTE) AS window_end,
    COUNT(*) AS total_feedback,
    SUM(CASE WHEN feedback = 'thumbs_down' THEN 1 ELSE 0 END) AS thumbs_down_count,
    CAST(SUM(CASE WHEN feedback = 'thumbs_down' THEN 1 ELSE 0 END) AS DOUBLE) / COUNT(*) AS thumbs_down_rate
FROM "SOURCE_SQL_STREAM_001"
GROUP BY
    intent,
    STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '5' MINUTE);

Alert thresholds (for checkout_help during Prime Day):

- Warning: thumbs_down_rate > 0.45 in a 5-minute window (baseline is ~0.30)
- Critical: thumbs_down_rate > 0.55 OR thumbs_down_count > 500 in 5 minutes
- Composite: alert only fires if both rate AND absolute count exceed thresholds (avoids false alarms from low-volume windows)
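
One way to wire the critical threshold into CloudWatch (a hedged boto3 sketch; the metric namespace, dimension name, and SNS topic ARN are assumptions not fixed above):

# Hedged sketch: CloudWatch alarm for the critical thumbs-down-rate threshold.
import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_alarm(
    AlarmName='mangaassist-checkout-help-thumbs-down-critical',
    Namespace='MangaAssist/Feedback',              # assumed custom namespace
    MetricName='ThumbsDownRate_by_Intent',
    Dimensions=[{'Name': 'intent', 'Value': 'checkout_help'}],
    Statistic='Average',
    Period=300,                                    # 5-minute windows, matching the Kinesis Analytics STEP
    EvaluationPeriods=1,
    Threshold=0.55,
    ComparisonOperator='GreaterThanThreshold',
    TreatMissingData='notBreaching',               # low-volume windows should not page anyone
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:mangaassist-feedback-alerts']  # placeholder ARN
)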

Response playbook:

  1. Alert fires → on-call engineer reviews a sample of thumbs-down responses
  2. If cause is stale cache (e.g., sold-out item still recommended) → invalidate ElastiCache Redis entries for checkout_help
  3. If cause is prompt failure → hot-swap to the fallback prompt template in Bedrock


A6. Free-Text Comment on Thumbs-Down with Classification

End-to-end flow:

User clicks 👎 → Modal: "What went wrong?" [text field, 500 char limit]
    → POST /api/feedback {session_id, turn_id, feedback: "thumbs_down", comment: "..."}
        → API Gateway → Lambda (feedback-processor)
            → Step 1: Write raw event to DynamoDB (MangaAssist_ThumbsFeedback)
            → Step 2: Invoke Bedrock Claude (3.5 Haiku, for cost; see below) to classify the reason
            → Step 3: Write classified category back to DynamoDB (update item)
            → Step 4: Push to Kinesis for downstream analytics

Bedrock classification prompt:

Classify this chatbot feedback into exactly one category:
- wrong_genre: Manga recommendation didn't match user's preferred genre
- wrong_product: Recommended a product the user already owns or isn't interested in
- outdated_info: Price, availability, or shipping info was incorrect
- misunderstood_query: Chatbot didn't understand what the user was asking
- incomplete_answer: Answer was partially correct but missing key details
- tone_issue: Response felt robotic, rude, or inappropriate
- other: Doesn't fit above categories

Feedback: "{user_comment}"
Intent: {intent}

Return JSON: {"category": "...", "confidence": 0.XX}

Cost optimization: Use Claude 3.5 Haiku (~$0.00025/1K input tokens) for classification. At 3M feedback events/month × 6% comment rate = 180K classifications/month × ~100 tokens avg = 18M tokens → ~$4.50/month.

DynamoDB update:

{
    "comment_raw": "it recommended One Piece but I already read all 100+ volumes",
    "comment_category": "wrong_product",
    "comment_confidence": 0.94,
    "comment_classified_at": "2025-03-15T10:24:01Z"
}
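
Steps 2–3 of the flow could look roughly like the sketch below; the Haiku model ID, the build_classification_prompt helper, and the Decimal handling are illustrative assumptions:

# Sketch of Steps 2–3: classify the free-text comment with Bedrock, then update the feedback item.
import json
from decimal import Decimal

import boto3

bedrock = boto3.client('bedrock-runtime')
table = boto3.resource('dynamodb').Table('MangaAssist_ThumbsFeedback')

def classify_and_store(item: dict, comment: str) -> dict:
    prompt = build_classification_prompt(comment, item['intent'])   # assumed helper filling the template above
    response = bedrock.invoke_model(
        modelId='anthropic.claude-3-5-haiku-20241022-v1:0',          # indicative model ID
        body=json.dumps({
            'anthropic_version': 'bedrock-2023-05-31',
            'max_tokens': 100,
            'messages': [{'role': 'user', 'content': prompt}]
        })
    )
    result = json.loads(json.loads(response['body'].read())['content'][0]['text'])

    table.update_item(
        Key={'session_id': item['session_id'], 'timestamp#turn_id': item['timestamp#turn_id']},
        UpdateExpression='SET comment_category = :c, comment_confidence = :p',
        ExpressionAttributeValues={':c': result['category'], ':p': Decimal(str(result['confidence']))}
    )
    return result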


Hard

A7. Feedback-Driven Cache Invalidation Policy

Metric definition:

thumbs_down_ratio(response_hash, window) = 
    count(thumbs_down for response_hash in window) / 
    count(all_feedback for response_hash in window)

Invalidation thresholds:

| Condition | Action |
|---|---|
| thumbs_down_ratio > 0.35 AND total_feedback > 50 (over 24h) | Soft invalidation — mark cache entry as "stale", serve with reduced TTL (1h instead of 24h) |
| thumbs_down_ratio > 0.50 AND total_feedback > 30 (over 12h) | Hard invalidation — delete cache entry, trigger re-generation |
| thumbs_down_ratio > 0.70 (any window, any volume) | Emergency invalidation — immediate delete + alert |

Re-generation flow:

Cache invalidation triggered
    → Lambda publishes to SQS (regeneration-queue)
        → ECS Fargate worker picks up message
            → Fetches original query context from DynamoDB conversation history
            → Calls Bedrock Claude 3.5 Sonnet with updated prompt + latest product catalog from OpenSearch
            → Validates response (guardrails check, length check)
            → Writes to ElastiCache Redis with new cache key: {intent}:{query_hash}:{model_version}:{timestamp}
            → Logs regeneration event for audit
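
The Lambda-to-worker hand-off in the flow above is a plain SQS publish; a minimal sketch (queue URL is a placeholder):

# Sketch of the invalidation → regeneration hand-off via SQS.
import json
import boto3

sqs = boto3.client('sqs')
REGEN_QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/regeneration-queue'  # placeholder

def enqueue_regeneration(response_hash: str, intent: str, reason: str) -> None:
    sqs.send_message(
        QueueUrl=REGEN_QUEUE_URL,
        MessageBody=json.dumps({
            'response_hash': response_hash,
            'intent': intent,
            'reason': reason            # e.g. 'feedback_threshold'
        })
    )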

Redis implementation:

import json

import redis


class FeedbackAwareCacheManager:
    def __init__(self, redis_client: redis.Redis):
        # Assumes a client created without decode_responses, so hash values come back as bytes
        self.redis = redis_client

    def check_and_invalidate(self, response_hash: str) -> bool:
        """Apply the invalidation policy above to one cached response; True means the entry was deleted."""
        feedback_key = f"feedback:aggregate:{response_hash}"
        data = self.redis.hgetall(feedback_key)

        if not data:
            return False

        total = int(data.get(b'total', 0))
        thumbs_down = int(data.get(b'thumbs_down', 0))
        if total == 0:
            return False

        ratio = thumbs_down / total
        cache_key = f"response:{response_hash}"

        if ratio > 0.70:
            # Emergency invalidation — any volume
            self._delete_and_notify(cache_key, response_hash, 'emergency_threshold', ratio, total)
            return True

        if ratio > 0.50 and total > 30:
            # Hard invalidation
            self._delete_and_notify(cache_key, response_hash, 'feedback_threshold', ratio, total)
            return True

        if ratio > 0.35 and total > 50:
            # Soft invalidation — keep the entry but shrink its TTL to 1 hour (down from 24h)
            self.redis.expire(cache_key, 3600)
            return False

        return False

    def _delete_and_notify(self, cache_key: str, response_hash: str, reason: str, ratio: float, total: int):
        self.redis.delete(cache_key)
        self.redis.publish('cache-invalidation', json.dumps({
            'response_hash': response_hash,
            'reason': reason,
            'ratio': ratio,
            'total': total
        }))

Safeguards:

- Rate limiting: maximum 10 re-generations per intent per hour (prevents feedback-bombing from triggering infinite regeneration).
- Circuit breaker: if a re-generated response also accumulates >50% thumbs-down within 2 hours, stop auto-regeneration and alert the team.
- A/B validation: 10% of traffic still sees the old response for 1 hour after regeneration, to confirm the new response performs better.
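
The rate-limiting safeguard could reuse the same Redis instance; a minimal sketch assuming a simple fixed-window counter (key name is illustrative):

# Minimal sketch of the per-intent regeneration rate limit (max 10 per hour), using Redis INCR/EXPIRE.
def allow_regeneration(redis_client, intent: str, max_per_hour: int = 10) -> bool:
    key = f"regen:ratelimit:{intent}"
    count = redis_client.incr(key)          # atomically increment the hourly counter
    if count == 1:
        redis_client.expire(key, 3600)      # first hit starts the 1-hour window
    return count <= max_per_hour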


A8. Thumbs Feedback → SageMaker Training Pipeline

Pipeline stages:

Raw Feedback (DynamoDB/S3)
    → Stage 1: Filtering
    → Stage 2: Label Noise Reduction
    → Stage 3: Class Balancing
    → Stage 4: Feature Engineering
    → Stage 5: SageMaker Training Dataset (S3)

Stage 1 — Filtering rules:

- Remove feedback from sessions < 10 seconds (accidental clicks)
- Remove feedback from users who thumbs-down > 95% of all responses (adversarial/frustrated users)
- Remove feedback on the escalation intent (thumbs-down may reflect frustration with the situation, not the bot's escalation quality)
- Require a minimum 2-second gap between response display and feedback click

Stage 2 — Label noise reduction:

# Confidence-weighted labeling.
# check_alignment() and get_user_intent_consistency() are assumed helpers returning scores in [0, 1].
def compute_label_confidence(feedback_event: dict) -> float:
    signals = []  # (name, score in [0, 1], weight)

    # Signal 1: Comment alignment (if a classified comment exists)
    if feedback_event.get('comment_category'):
        # e.g. comment says "wrong_genre" and intent is "recommendation" → high confidence
        intent_comment_alignment = check_alignment(
            feedback_event['intent'],
            feedback_event['comment_category']
        )
        signals.append(('comment_alignment', intent_comment_alignment, 0.4))

    # Signal 2: Behavioral corroboration
    # Thumbs-up on a recommendation + user clicked the product link = corroborated positive
    if feedback_event['intent'] == 'recommendation' and feedback_event.get('product_click'):
        signals.append(('behavioral', 1.0, 0.3))

    # Signal 3: Consistency — same user, same intent, consistent feedback pattern
    historical_consistency = get_user_intent_consistency(
        feedback_event['user_id_hash'],
        feedback_event['intent']
    )
    signals.append(('consistency', historical_consistency, 0.3))

    # Weighted average over whichever signals are present
    confidence = sum(score * weight for _, score, weight in signals) / sum(w for _, _, w in signals)
    return confidence
  • Only include feedback events with confidence > 0.6 in the training set.
  • Events with confidence 0.4–0.6 go to a human review queue.

Stage 3 — Class imbalance handling:

| Intent | Feedback Volume/Month | Thumbs-Up % | Strategy |
|---|---|---|---|
| recommendation | 800K | 72% | Undersample majority class |
| faq | 600K | 58% | Use as-is (relatively balanced) |
| chitchat | 400K | 80% | Undersample positives |
| order_tracking | 300K | 65% | Use as-is |
| product_discovery | 50K | 60% | SMOTE oversampling for minority class |
| escalation | 30K | 25% | Exclude from training (see Stage 1) |

Stage 4 — Feature engineering for intent classifier:

- Input: user query text, conversation history (last 3 turns), detected entities
- Label: intent (10 classes), with a feedback-derived quality weight
- Format: JSONL for SageMaker, uploaded to S3 with versioning
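
One plausible shape for a single SageMaker JSONL record under this scheme (field names are illustrative, not a fixed contract):

# Illustrative JSONL record builder for the SageMaker training dataset; field names are assumptions.
import json

def to_jsonl_record(event: dict) -> str:
    record = {
        'query': event['user_query'],
        'history': event['last_3_turns'],             # last 3 conversation turns
        'entities': event.get('entities', []),
        'label': event['intent'],                      # one of the 10 intent classes
        'sample_weight': event['label_confidence']     # feedback-derived quality weight from Stage 2
    }
    return json.dumps(record, ensure_ascii=False)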


A9. Full Data Architecture at 50M Interactions/Month

Throughput calculations:

50M interactions/month × 6% feedback rate = 3M feedback events/month
3M / 30 days / 24 hours / 3600 seconds = ~1.16 events/second (average)
Peak (Prime Day): 10× average = ~12 events/second
Burst (flash sale start): 50× average = ~58 events/second

Ingestion — Kinesis Data Streams:

Stream: mangaassist-feedback-stream
Shards: 2 (each shard handles 1,000 writes/sec, 1 MB/sec)
  → Sufficient for 58 events/sec peak with 97% headroom
  → Auto-scaling policy: add shard if utilization > 70% for 5 minutes
Retention: 24 hours (default, sufficient for consumer catch-up)

Hot storage — DynamoDB:

Table: MangaAssist_ThumbsFeedback
Capacity mode: ON-DEMAND
  → Justification: Traffic is bursty (Prime Day, flash sales). Provisioned mode
    requires over-provisioning for peaks or complex auto-scaling. On-demand
    absorbs 10× burst within seconds.
Average item size: ~500 bytes
3M items/month × 500 bytes = 1.5 GB/month
TTL: 90 days → max table size ~4.5 GB (well within DynamoDB limits)

GSIs:
  - intent-timestamp-index: PK=intent, SK=timestamp (read-heavy, ~100 RCU)
  - feedback-timestamp-index: PK=feedback, SK=timestamp (read-heavy, ~50 RCU)

Warm storage — S3 via Kinesis Firehose:

Destination: s3://mangaassist-feedback/raw/year=YYYY/month=MM/day=DD/
Format: Parquet (columnar, compressed)
Buffer: 5 MB or 300 seconds (whichever first)
Compression: Snappy (~70% compression ratio)
Monthly data: 3M × 500 bytes × 0.3 (compressed) = ~450 MB/month
Annual: ~5.4 GB → negligible S3 cost (~$0.12/month at standard storage rates)

Cold analytics — Athena:

-- Athena table definition
CREATE EXTERNAL TABLE mangaassist_feedback (
    session_id STRING,
    `timestamp` STRING,
    turn_id INT,
    feedback STRING,
    intent STRING,
    response_id STRING,
    user_id_hash STRING,
    device_type STRING,
    latency_ms INT,
    model_version STRING,
    comment_raw STRING,
    comment_category STRING
)
PARTITIONED BY (year INT, month INT, day INT)
STORED AS PARQUET
LOCATION 's3://mangaassist-feedback/raw/'
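
To query this table programmatically (for the aggregation feeding QuickSight later), a hedged boto3 sketch; the database name and results bucket are assumptions, and newly written S3 partitions must first be registered (e.g. MSCK REPAIR TABLE or partition projection):

# Hedged sketch: running an aggregation against the table above with the Athena API.
import boto3

athena = boto3.client('athena')

query = """
SELECT intent,
       COUNT(*) AS total_feedback,
       AVG(CASE WHEN feedback = 'thumbs_up' THEN 1.0 ELSE 0.0 END) AS thumbs_up_rate
FROM mangaassist_feedback
WHERE year = 2025 AND month = 3
GROUP BY intent
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={'Database': 'mangaassist_analytics'},          # assumed database name
    ResultConfiguration={'OutputLocation': 's3://mangaassist-athena-results/'}  # assumed results bucket
)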

Monthly cost estimate:

| Component | Calculation | Monthly Cost |
|---|---|---|
| Kinesis Data Streams | 2 shards × $0.015/hr × 730 hrs | $21.90 |
| Kinesis Firehose | 3M records × 5KB per record batch | $1.85 |
| DynamoDB (on-demand) | 3M writes × $1.25/M + ~100M reads × $0.25/M | $28.75 |
| S3 storage | ~5 GB (cumulative first year) | $0.12 |
| Athena queries | ~50 queries/month × 450 MB scanned | $2.40 |
| Total | | ~$55/month |

Very Hard

A10. A/B Test for Feedback Widget's Causal Effect

Experimental design:

Randomization unit: User-level (not session-level). Rationale: If we randomize by session, the same user might see the widget in some sessions and not others, creating confusion and carryover effects. User-level randomization ensures a clean, consistent experience.

Assignment mechanism:

import hashlib

def assign_variant(user_id: str, experiment_id: str = "feedback_widget_v1") -> str:
    hash_input = f"{user_id}:{experiment_id}"
    hash_value = int(hashlib.sha256(hash_input.encode()).hexdigest(), 16)
    bucket = hash_value % 100
    if bucket < 50:
        return "control"  # No widget shown
    else:
        return "treatment"  # Widget shown

Metrics:

| Metric | Type | Control Baseline | MDE | Required Sample |
|---|---|---|---|---|
| Conversation length (turns) | Continuous | 4.2 turns | 0.3 turns (7%) | ~3,500 users/group |
| Cart-add rate (post-recommendation) | Proportion | 12.5% | 1% absolute | ~15,000 users/group |
| Escalation rate | Proportion | 8.0% | 0.8% absolute | ~12,000 users/group |

Duration: With roughly 1.7M interactions/day (50M/month), a 10% experiment allocation reaches the required sample (max 15K users/group, 30K total) within the first day. However, run for 14 days minimum to capture weekly seasonality (weekend manga browsing patterns differ from weekdays).

Handling asymmetric data generation:

The fundamental challenge: treatment group generates feedback data, control group does not. This means:

  1. Primary metrics (conversation length, cart-add rate, escalation rate) are measured identically for both groups — these come from session logs and event streams, NOT from the feedback widget.

  2. Feedback rate can only be measured in the treatment group. To estimate what the control group's "latent satisfaction" would be, use proxy metrics: dwell time per response, message rephrasing rate, session abandonment rate.

  3. Hawthorne effect: Users who know they're being observed (widget visible) may behave differently. Address by:
     - Adding a "placebo" arm: show a non-interactive widget (just an icon, no click functionality) to isolate the visual effect from the feedback effect
     - 3-arm design: control (no widget), placebo (visual-only), treatment (functional widget)

Analysis:

- Primary: intent-to-treat (ITT) analysis comparing control vs. treatment on the primary metrics
- Secondary: instrumental variable (IV) analysis using widget assignment as an instrument for "giving feedback" to estimate the LATE (local average treatment effect)
- Multiple comparison correction: Bonferroni for 3 primary metrics (α = 0.05/3 ≈ 0.0167)


A11. Closed-Loop Feedback Flywheel

System architecture:

┌─────────────────────────────────────────────────────────────────────┐
│                     FEEDBACK FLYWHEEL                                │
│                                                                      │
│  ┌──────────┐    ┌──────────────┐    ┌─────────────┐               │
│  │ 1. COLLECT│───→│ 2. AGGREGATE │───→│ 3. OPTIMIZE │               │
│  │           │    │ & DETECT     │    │   PROMPTS   │               │
│  │ Thumbs    │    │              │    │             │               │
│  │ Widget    │    │ Kinesis +    │    │ Bedrock     │               │
│  │ → Kinesis │    │ Lambda       │    │ Prompt Mgr  │               │
│  └──────────┘    │ Anomaly Det. │    └──────┬──────┘               │
│                  └──────────────┘           │                       │
│                                             ▼                       │
│  ┌──────────┐    ┌──────────────┐    ┌─────────────┐               │
│  │ 5. CLOSE │←───│ 4. A/B       │←───│ Deploy      │               │
│  │   LOOP   │    │  DEPLOY      │    │ Candidate   │               │
│  │           │    │              │    │ Prompts     │               │
│  │ Measure   │    │ 10% traffic  │    │             │               │
│  │ via same  │    │ to new       │    │ ECS Fargate │               │
│  │ thumbs    │    │ prompt       │    │ Config Mgr  │               │
│  └──────────┘    └──────────────┘    └─────────────┘               │
└─────────────────────────────────────────────────────────────────────┘

Stage 1 — Collection (covered in Q1–Q3 above)

Stage 2 — Aggregation & anomaly detection:

# Lambda consumer on Kinesis — aggregates every 15 minutes.
# query_dynamodb_gsi(), get_historical_rate(), and trigger_optimization() are assumed helpers.
from datetime import datetime, timedelta

def aggregate_feedback(intent: str, window_minutes: int = 15):
    """Compute rolling thumbs-up rate and detect anomalies."""
    now = datetime.utcnow()
    recent = query_dynamodb_gsi(
        index='intent-timestamp-index',
        pk=intent,
        sk_range=(now - timedelta(minutes=window_minutes), now)
    )

    total = len(recent)
    if total == 0:
        return {'intent': intent, 'rate': None, 'z_score': None}

    thumbs_up = sum(1 for r in recent if r['feedback'] == 'thumbs_up')
    rate = thumbs_up / total

    # Anomaly detection: compare to 7-day rolling average
    historical_rate = get_historical_rate(intent, days=7)   # returns {'mean': ..., 'std': ...}
    z_score = (rate - historical_rate['mean']) / max(historical_rate['std'], 1e-6)

    if z_score < -2.5:  # Significant drop
        trigger_optimization(intent, rate, recent_samples=recent[:50])

    return {'intent': intent, 'rate': rate, 'z_score': z_score}

Stage 3 — Automatic prompt tuning:

def optimize_prompt(intent: str, current_rate: float, negative_samples: list):
    """Generate candidate prompt improvements using Bedrock Claude."""

    # Retrieve current prompt template
    current_prompt = get_prompt_template(intent)

    # Analyze negative feedback patterns
    analysis_prompt = f"""
    Analyze these {len(negative_samples)} negative feedback responses for the '{intent}' intent 
    in a manga chatbot. Identify the top 3 patterns causing user dissatisfaction:

    {format_samples(negative_samples)}

    Current prompt template:
    {current_prompt}

    Suggest a revised prompt template that addresses these patterns.
    Return JSON: {{"analysis": "...", "revised_prompt": "...", "expected_improvement": "..."}}
    """

    # Bedrock's Anthropic models take the Messages API body format
    response = bedrock_client.invoke_model(
        modelId='anthropic.claude-3-5-sonnet-20241022-v2:0',
        body=json.dumps({
            'anthropic_version': 'bedrock-2023-05-31',
            'max_tokens': 2000,
            'messages': [{'role': 'user', 'content': analysis_prompt}]
        })
    )

    model_output = json.loads(response['body'].read())
    candidate = json.loads(model_output['content'][0]['text'])  # the model returns the JSON as text

    # Store candidate in DynamoDB prompt registry (not yet deployed)
    store_candidate_prompt(intent, candidate['revised_prompt'], 
                          metadata={'trigger_rate': current_rate, 'analysis': candidate['analysis']})

    return candidate

Stage 4 — A/B deployment:

# ECS Fargate prompt router
def get_prompt_for_request(intent: str, user_id: str) -> str:
    experiment = get_active_experiment(intent)

    if experiment and is_in_treatment_group(user_id, experiment['id']):
        return experiment['candidate_prompt']  # 10% of traffic
    else:
        return get_production_prompt(intent)   # 90% of traffic
  • Run for 7 days minimum per intent
  • Require statistical significance (p < 0.05) AND practical significance (thumbs-up rate improvement > 2 percentage points) before promoting

Stage 5 — Closing the loop:

  • If candidate prompt wins → promote to production, archive old prompt
  • If candidate prompt loses → discard, increase anomaly detection threshold to prevent repeated failed optimization
  • Track "optimization attempts per intent per month" — cap at 3 to prevent churn

Cold-start for new intents (e.g., product_discovery):

  1. Bootstrap with synthetic feedback: Generate 500 synthetic conversations for product_discovery using Bedrock. Have 3 internal annotators rate them (thumbs-up/down). Use as initial training signal.
  2. Transfer learning from similar intent: product_discovery is closest to recommendation. Initialize prompt template and feedback thresholds from recommendation parameters.
  3. Aggressive exploration: For the first 30 days, show the feedback widget to 100% of product_discovery users (vs. the standard sampling rate) to accelerate data collection.
  4. Lower confidence threshold: Accept optimization candidates with p < 0.10 (instead of 0.05) during the cold-start phase, reverting to stricter thresholds after 10K feedback events.

Sycophancy risk mitigation:

Optimizing purely for thumbs-up rate can lead to responses that tell users what they want to hear rather than what's accurate:

  1. Guardrail metrics: Track factual accuracy (via automated checks against product catalog in OpenSearch) alongside thumbs-up rate. Reject prompt candidates that improve thumbs-up but decrease factual accuracy.
  2. Diversity constraint: Ensure recommendation responses don't collapse to always recommending the top-5 most popular manga. Track recommendation diversity (entropy of recommended titles); a small sketch follows this list.
  3. Negative feedback value: Assign 2× weight to thumbs-down events in the optimization objective. This makes the system more sensitive to failures than successes.
  4. Human-in-the-loop gate: Any prompt change that would affect > 1M users/month requires human review before deployment, regardless of A/B results.
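
A minimal sketch of the diversity guardrail referenced in item 2 (plain-Python Shannon entropy over the titles recommended in some window):

# Sketch of the recommendation-diversity guardrail: Shannon entropy of recommended titles.
from collections import Counter
from math import log2

def recommendation_entropy(recommended_titles: list[str]) -> float:
    """Higher entropy = more diverse recommendations; collapse toward a few titles drives this toward 0."""
    if not recommended_titles:
        return 0.0
    counts = Counter(recommended_titles)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())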

A12. Unified User Satisfaction Score (USS)

USS Formula:

$$USS = \frac{\sum_{i=1}^{10} w_i \cdot r_i \cdot c_i}{\sum_{i=1}^{10} w_i \cdot c_i}$$

Where:

- $w_i$ = business impact weight for intent $i$
- $r_i$ = thumbs-up rate for intent $i$ (0 to 1)
- $c_i$ = confidence factor for intent $i$ (0 to 1), based on sample size

Business impact weights:

| Intent | Weight ($w_i$) | Justification |
|---|---|---|
| recommendation | 1.00 | Directly drives manga sales (core business metric) |
| product_question | 0.85 | Pre-purchase decision support, high conversion impact |
| product_discovery | 0.80 | Top-of-funnel engagement, drives browse-to-cart |
| checkout_help | 0.90 | Prevents cart abandonment (high revenue impact per interaction) |
| order_tracking | 0.70 | Customer retention, but transactional (low margin for error) |
| return_request | 0.65 | Satisfaction recovery; poor handling causes churn |
| promotion | 0.60 | Drives incremental sales, but lower per-interaction value |
| faq | 0.50 | Deflects support tickets; quality matters but less revenue-direct |
| escalation | 0.40 | Success = smooth handoff; thumbs signal is noisy here |
| chitchat | 0.20 | Brand engagement, low business impact |

Confidence factor (handles sparse signal):

$$c_i = \min\left(1.0, \frac{n_i}{n_{min}}\right)$$

Where $n_i$ is the feedback count for intent $i$ in the measurement window, and $n_{min}$ = 100 (minimum sample for full confidence).

If product_discovery has only 40 feedback events this week, $c_i = 40/100 = 0.4$, effectively down-weighting its contribution to USS until more data accumulates.

Worked example:

| Intent | $w_i$ | $r_i$ | $n_i$ | $c_i$ | $w_i \cdot r_i \cdot c_i$ | $w_i \cdot c_i$ |
|---|---|---|---|---|---|---|
| recommendation | 1.00 | 0.72 | 5000 | 1.00 | 0.720 | 1.000 |
| product_question | 0.85 | 0.65 | 3000 | 1.00 | 0.553 | 0.850 |
| checkout_help | 0.90 | 0.55 | 1500 | 1.00 | 0.495 | 0.900 |
| product_discovery | 0.80 | 0.60 | 40 | 0.40 | 0.192 | 0.320 |
| chitchat | 0.20 | 0.80 | 2000 | 1.00 | 0.160 | 0.200 |
| ... | ... | ... | ... | ... | ... | ... |

USS = Σ(numerator) / Σ(denominator) ≈ 0.66 (66% weighted satisfaction)
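
The same arithmetic as a small helper (inputs mirror the worked example; the intents elided above account for the difference between the five-row figure and the ~0.66 quoted):

# USS computation matching the formula above. Each entry is (w_i, r_i, n_i); n_min = 100.
def compute_uss(per_intent: dict, n_min: int = 100) -> float:
    numerator = denominator = 0.0
    for intent, (w, r, n) in per_intent.items():
        c = min(1.0, n / n_min)                  # confidence factor
        numerator += w * r * c
        denominator += w * c
    return numerator / denominator

example = {
    'recommendation':    (1.00, 0.72, 5000),
    'product_question':  (0.85, 0.65, 3000),
    'checkout_help':     (0.90, 0.55, 1500),
    'product_discovery': (0.80, 0.60, 40),
    'chitchat':          (0.20, 0.80, 2000),
}
print(compute_uss(example))   # ≈ 0.65 for the five intents shown; ≈ 0.66 once all ten are included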

QuickSight dashboard design:

┌─────────────────────────────────────────────────────────┐
│  MangaAssist User Satisfaction Dashboard                │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  [USS Score: 66.2%]  ▲ +1.3% vs last week              │
│  ═══════════════════════════════════                    │
│                                                         │
│  ┌─── Intent Breakdown (Treemap) ────────────────┐     │
│  │ ┌──────────┐ ┌────────┐ ┌──────┐ ┌────┐      │     │
│  │ │ recom.   │ │checkout│ │prod_q│ │faq │      │     │
│  │ │ 72% ▲    │ │ 55% ▼  │ │ 65%  │ │58% │      │     │
│  │ └──────────┘ └────────┘ └──────┘ └────┘      │     │
│  └───────────────────────────────────────────────┘     │
│                                                         │
│  ┌─── USS Time Series (30 days) ─────────────────┐     │
│  │  68%│    ╱╲                                    │     │
│  │  66%│───╱──╲──────╱╲──────                    │     │
│  │  64%│  ╱    ╲    ╱  ╲                         │     │
│  │  62%│─╱──────╲──╱────╲───── ← Prime Day dip  │     │
│  │     └────────────────────── time →            │     │
│  └───────────────────────────────────────────────┘     │
│                                                         │
│  ┌─── Alerts ────────────────────────────────────┐     │
│  │ ⚠ checkout_help down 8% (flash sale impact)   │     │
│  │ ✅ recommendation up 2% (new prompt deployed)  │     │
│  └───────────────────────────────────────────────┘     │
└─────────────────────────────────────────────────────────┘

Dashboard layers:

  1. Executive view: single USS number, trend line, week-over-week delta
  2. Intent drill-down: click any intent in the treemap → see its thumbs-up rate, volume, confidence factor, and top negative feedback categories
  3. Time-series overlay: toggle individual intent lines on/off, annotate with deployment events (prompt changes, model updates)
  4. Anomaly panel: automated highlights of statistically significant changes (z-score > 2.0), linked to root cause analysis

Data pipeline for QuickSight:

DynamoDB → S3 (Firehose, hourly Parquet) → Athena (daily aggregation view) → QuickSight SPICE dataset (refreshed every 4 hours)