Answers – Scenario 01: Thumbs Feedback Interface
Questions: README.md
System: MangaAssist – JP Manga Chatbot on Amazon.com
Easy
A1. DynamoDB Table Schema for Thumbs Feedback
Table Name: MangaAssist_ThumbsFeedback
| Attribute | Type | Role | Description |
|---|---|---|---|
| session_id | String | Partition Key | Chat session identifier |
| timestamp#turn_id | String | Sort Key | ISO-8601 timestamp + turn number (e.g., 2025-03-15T10:23:45Z#007) |
| feedback | String | — | thumbs_up or thumbs_down |
| intent | String | — | Classified intent: recommendation, faq, etc. |
| response_id | String | — | Unique ID for the chatbot response |
| user_id_hash | String | — | SHA-256 hashed customer ID (privacy) |
| device_type | String | — | mobile, desktop, app |
| latency_ms | Number | — | Response latency when feedback was given |
| model_version | String | — | Bedrock Claude model version used |
| ttl | Number | — | Epoch timestamp for DynamoDB TTL (90-day retention in hot storage) |
Design rationale:
- Partition key = session_id: Distributes writes evenly (sessions are naturally random). Avoids hot partitions from user_id (power users) or date-based keys.
- Sort key = timestamp#turn_id: Enables range queries for all feedback within a session, ordered chronologically.
- GSI-1: intent (PK) + timestamp (SK) — enables per-intent aggregation queries.
- GSI-2: feedback (PK) + timestamp (SK) — enables quick retrieval of all negative feedback in a time window.
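A minimal sketch of how an item matching this schema could be assembled; `build_feedback_item` and its arguments are illustrative, and a real writer would pass the resulting dict to boto3's `table.put_item(Item=...)`.

```python
import hashlib
import time

def build_feedback_item(session_id, timestamp, turn_id, feedback, intent,
                        response_id, customer_id, device_type, latency_ms,
                        model_version, retention_days=90):
    return {
        "session_id": session_id,                           # partition key
        "timestamp#turn_id": f"{timestamp}#{turn_id:03d}",  # composite sort key
        "feedback": feedback,
        "intent": intent,
        "response_id": response_id,
        # store only a hash of the customer ID for privacy
        "user_id_hash": hashlib.sha256(customer_id.encode()).hexdigest(),
        "device_type": device_type,
        "latency_ms": latency_ms,
        "model_version": model_version,
        # epoch seconds for the DynamoDB TTL attribute (90-day hot retention)
        "ttl": int(time.time()) + retention_days * 24 * 3600,
    }

item = build_feedback_item("sess-42", "2025-03-15T10:23:45Z", 7, "thumbs_up",
                           "recommendation", "resp-123", "cust-999",
                           "mobile", 850, "claude-3-5-sonnet")
```

The zero-padded turn number keeps lexicographic sort order aligned with chronological order within a session.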
A2. Three Strategies to Boost Feedback Rate
Strategy 1 — Delayed, contextual prompt:
Instead of showing the widget immediately, wait 2–3 seconds after the response renders. For recommendation responses, trigger the widget only after the user has had time to see the manga title and cover image. This reduces "banner blindness" from instant widget display.
Strategy 2 — Gamified micro-incentive:
Display a subtle message: "Your feedback helps us recommend better manga! 🎯" after every 5th interaction. For order_tracking and return_request, add: "Was this helpful?" — a natural phrasing that feels less like a survey. Avoid monetary incentives (biases feedback).
Strategy 3 — Inline response feedback:
For recommendation intent, embed the thumbs directly next to each recommended manga title (not just the overall response). Users are more likely to react to specific items. For faq, place the widget at the end of the answer paragraph, not in a floating overlay.
Intent prioritization:
- recommendation benefits most — binary signal directly measures whether the suggested manga resonated.
- faq benefits second — thumbs-down indicates the answer didn't resolve the question.
- chitchat benefits least — subjective quality is hard to capture with binary feedback.
A3. Why Separate Feedback from Conversation History
Problems with co-located storage:
- Write contention: Conversation history is written sequentially during the chat. Feedback writes arrive asynchronously (minutes or hours later). Mixing them in one table creates unpredictable write patterns and complicates throughput planning.
- Different access patterns: Conversation history is read by session_id for context window construction (hot path, sub-10ms via ElastiCache). Feedback is read by intent, time range, and aggregation (analytics path). A single table forces conflicting GSI designs.
- TTL mismatch: Conversation history may be retained for 30 days; feedback data may need 12-month retention for trend analysis. Shared TTL policies cannot serve both.
- Schema evolution: Feedback schema evolves independently (adding free-text fields, multi-dimensional ratings). Mixing with conversation history makes migrations risky.
Preferred architecture:
User clicks 👎 → ECS Fargate API → Kinesis Data Streams (feedback-stream)
├── Lambda consumer → DynamoDB (MangaAssist_ThumbsFeedback) [hot, 90-day TTL]
├── Kinesis Firehose → S3 (s3://mangaassist-feedback/raw/) [cold, permanent]
└── Kinesis Analytics → CloudWatch Metrics [real-time aggregation]
Kinesis decouples the write path from storage, enabling multiple consumers without backpressure on the chat API.
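The producer side of this flow can be sketched as building the record for Kinesis: serialize the event and use session_id as the partition key so a session's events stay ordered within one shard. The helper name is illustrative; a real producer would pass the result to boto3's `kinesis.put_record(StreamName="mangaassist-feedback-stream", ...)`.

```python
import json

def to_kinesis_record(event: dict) -> dict:
    """Build the Data/PartitionKey pair for a Kinesis put_record call."""
    return {
        "Data": json.dumps(event).encode("utf-8"),
        "PartitionKey": event["session_id"],  # keeps per-session ordering
    }

record = to_kinesis_record({"session_id": "sess-42",
                            "feedback": "thumbs_down",
                            "intent": "faq"})
```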
Medium
A4. Statistical Significance of Intent-Level Thumbs-Up Rates
Test: Two-proportion z-test (comparing recommendation at 72% vs. faq at 58%).
Hypotheses:
- H₀: p_recommendation = p_faq
- H₁: p_recommendation ≠ p_faq
Formula:
z = (p₁ - p₂) / √(p̂(1-p̂)(1/n₁ + 1/n₂))
where p̂ = (x₁ + x₂) / (n₁ + n₂)
Minimum sample size (for α=0.05, power=0.80, with the observed 14-point gap as the MDE):
n = (Z_α/2 + Z_β)² × (p₁(1-p₁) + p₂(1-p₂)) / (p₁ - p₂)²
n ≈ (1.96 + 0.84)² × (0.72×0.28 + 0.58×0.42) / (0.14)²
n ≈ 7.84 × 0.4452 / 0.0196 ≈ 178 per group
With a 14-point difference, even ~180 feedback events per intent would suffice. But the real concern is sample bias.
Addressing power-user bias:
1. Stratified analysis: Segment users into frequency tiers (1–5 interactions/week, 5–20, 20+). Compare thumbs rates within each tier.
2. Inverse propensity weighting: Weight each feedback event by 1/P(leaving_feedback | user_segment). Estimate P using a logistic regression on user features (session count, device type, account age).
3. Bootstrap confidence intervals: Resample feedback events with replacement 10,000 times, stratified by user segment, to get robust CIs that account for non-random sampling.
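The z-test and the sample-size formula above can be reproduced directly from the numbers in the text (the n=200 groups in the z-test call are illustrative):

```python
from math import sqrt

def two_prop_z(x1: int, n1: int, x2: int, n2: int) -> float:
    """Two-proportion z-test using the pooled proportion p-hat."""
    p1, p2 = x1 / n1, x2 / n2
    p_hat = (x1 + x2) / (n1 + n2)
    return (p1 - p2) / sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))

def sample_size_per_group(p1: float, p2: float,
                          z_alpha: float = 1.96, z_beta: float = 0.84) -> float:
    """Minimum n per group for alpha=0.05, power=0.80."""
    return ((z_alpha + z_beta) ** 2
            * (p1 * (1 - p1) + p2 * (1 - p2))
            / (p1 - p2) ** 2)

n = sample_size_per_group(0.72, 0.58)   # ≈ 178 per group, as in the text
z = two_prop_z(144, 200, 116, 200)      # 72% vs 58% at n=200 per group
```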
A5. Real-Time Negative Feedback Spike Detection
Architecture:
Thumbs events → Kinesis Data Streams → Kinesis Data Analytics (SQL application)
↓
CloudWatch Custom Metric: "ThumbsDownRate_by_Intent"
↓
CloudWatch Alarm (threshold breach)
↓
SNS → PagerDuty / Slack
Kinesis Analytics SQL:
```sql
CREATE OR REPLACE STREAM "FEEDBACK_METRICS" (
    intent              VARCHAR(64),
    window_end          TIMESTAMP,
    total_feedback      INTEGER,
    thumbs_down_count   INTEGER,
    thumbs_down_rate    DOUBLE
);

CREATE OR REPLACE PUMP "FEEDBACK_PUMP" AS
INSERT INTO "FEEDBACK_METRICS"
SELECT STREAM
    intent,
    STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '5' MINUTE) AS window_end,
    COUNT(*) AS total_feedback,
    SUM(CASE WHEN feedback = 'thumbs_down' THEN 1 ELSE 0 END) AS thumbs_down_count,
    CAST(SUM(CASE WHEN feedback = 'thumbs_down' THEN 1 ELSE 0 END) AS DOUBLE) / COUNT(*) AS thumbs_down_rate
FROM "SOURCE_SQL_STREAM_001"
GROUP BY
    intent,
    STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '5' MINUTE);
```
Alert thresholds (for checkout_help during Prime Day):
- Warning: thumbs_down_rate > 0.45 in a 5-minute window (baseline is ~0.30)
- Critical: thumbs_down_rate > 0.55 OR thumbs_down_count > 500 in 5 minutes
- Composite: Alert only fires if both rate AND absolute count exceed thresholds (avoids false alarms from low-volume windows)
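The composite rule reduces to a small predicate, sketched here with the critical thresholds from the text:

```python
def should_alert(thumbs_down_count: int, total_feedback: int,
                 rate_threshold: float = 0.55,
                 count_threshold: int = 500) -> bool:
    """Fire only when BOTH the rate and the absolute count breach thresholds,
    so a low-volume window with a few unlucky clicks cannot page on-call."""
    if total_feedback == 0:
        return False
    rate = thumbs_down_count / total_feedback
    return rate > rate_threshold and thumbs_down_count > count_threshold
```

A window with 18 thumbs-down out of 30 events (60% negative, but tiny volume) stays quiet, while 580 out of 1,000 fires.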
Response playbook:
1. Alert fires → On-call engineer reviews sample of thumbs-down responses
2. If cause is stale cache (e.g., sold-out item still recommended) → Invalidate ElastiCache Redis entries for checkout_help
3. If cause is prompt failure → Hot-swap to fallback prompt template in Bedrock
A6. Free-Text Comment on Thumbs-Down with Classification
End-to-end flow:
User clicks 👎 → Modal: "What went wrong?" [text field, 500 char limit]
→ POST /api/feedback {session_id, turn_id, feedback: "thumbs_down", comment: "..."}
→ API Gateway → Lambda (feedback-processor)
→ Step 1: Write raw event to DynamoDB (MangaAssist_ThumbsFeedback)
→ Step 2: Invoke Bedrock Claude 3.5 Haiku (cheaper than Sonnet, sufficient for classification) to classify reason
→ Step 3: Write classified category back to DynamoDB (update item)
→ Step 4: Push to Kinesis for downstream analytics
Bedrock classification prompt:
Classify this chatbot feedback into exactly one category:
- wrong_genre: Manga recommendation didn't match user's preferred genre
- wrong_product: Recommended a product the user already owns or isn't interested in
- outdated_info: Price, availability, or shipping info was incorrect
- misunderstood_query: Chatbot didn't understand what the user was asking
- incomplete_answer: Answer was partially correct but missing key details
- tone_issue: Response felt robotic, rude, or inappropriate
- other: Doesn't fit above categories
Feedback: "{user_comment}"
Intent: {intent}
Return JSON: {"category": "...", "confidence": 0.XX}
Cost optimization: Use Claude 3.5 Haiku (~$0.00025/1K input tokens) for classification. At 3M feedback events/month × 6% comment rate = 180K classifications/month × ~100 tokens avg = 18M tokens → ~$4.50/month.
DynamoDB update:
{
"comment_raw": "it recommended One Piece but I already read all 100+ volumes",
"comment_category": "wrong_product",
"comment_confidence": 0.94,
"comment_classified_at": "2025-03-15T10:24:01Z"
}
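The Step 3 write-back can be sketched as the kwargs for a DynamoDB `update_item` call; the helper name and sample values are illustrative, and the attribute names match the update shown above.

```python
def classification_update(session_id: str, sort_key: str, category: str,
                          confidence: float, now_iso: str) -> dict:
    """Build update_item kwargs that attach the classified comment category
    to the existing feedback item (boto3: table.update_item(**kwargs))."""
    return {
        "Key": {"session_id": session_id, "timestamp#turn_id": sort_key},
        "UpdateExpression": ("SET comment_category = :c, "
                             "comment_confidence = :p, "
                             "comment_classified_at = :t"),
        "ExpressionAttributeValues": {":c": category, ":p": confidence,
                                      ":t": now_iso},
    }

kwargs = classification_update("sess-42", "2025-03-15T10:23:45Z#007",
                               "wrong_product", 0.94, "2025-03-15T10:24:01Z")
```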
Hard
A7. Feedback-Driven Cache Invalidation Policy
Metric definition:
thumbs_down_ratio(response_hash, window) =
count(thumbs_down for response_hash in window) /
count(all_feedback for response_hash in window)
Invalidation thresholds:
| Condition | Action |
|---|---|
| thumbs_down_ratio > 0.35 AND total_feedback > 50 (over 24h) | Soft invalidation — mark cache entry as "stale", serve with reduced TTL (1h instead of 24h) |
| thumbs_down_ratio > 0.50 AND total_feedback > 30 (over 12h) | Hard invalidation — delete cache entry, trigger re-generation |
| thumbs_down_ratio > 0.70 (any window, any volume) | Emergency invalidation — immediate delete + alert |
Re-generation flow:
Cache invalidation triggered
→ Lambda publishes to SQS (regeneration-queue)
→ ECS Fargate worker picks up message
→ Fetches original query context from DynamoDB conversation history
→ Calls Bedrock Claude 3.5 Sonnet with updated prompt + latest product catalog from OpenSearch
→ Validates response (guardrails check, length check)
→ Writes to ElastiCache Redis with new cache key: {intent}:{query_hash}:{model_version}:{timestamp}
→ Logs regeneration event for audit
Redis implementation:
```python
import json

import redis


class FeedbackAwareCacheManager:
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client

    def check_and_invalidate(self, response_hash: str) -> bool:
        feedback_key = f"feedback:aggregate:{response_hash}"
        data = self.redis.hgetall(feedback_key)
        if not data:
            return False

        total = int(data.get(b'total', 0))
        thumbs_down = int(data.get(b'thumbs_down', 0))
        if total < 30:
            return False  # Insufficient sample for any action

        ratio = thumbs_down / total
        cache_key = f"response:{response_hash}"
        if ratio > 0.50:
            # Hard invalidation: delete entry and notify regeneration consumers
            self.redis.delete(cache_key)
            self.redis.publish('cache-invalidation', json.dumps({
                'response_hash': response_hash,
                'reason': 'feedback_threshold',
                'ratio': ratio,
                'total': total,
            }))
            return True
        elif ratio > 0.35 and total > 50:
            # Soft invalidation: keep serving, but shrink TTL to 1 hour
            # (matches the soft-invalidation row in the table above)
            self.redis.expire(cache_key, 3600)
        return False
```
Safeguards:
- Rate limiting: Maximum 10 re-generations per intent per hour (prevents feedback-bombing from triggering infinite regeneration).
- Circuit breaker: If a re-generated response also accumulates >50% thumbs-down within 2 hours, stop auto-regeneration and alert the team.
- A/B validation: 10% of traffic still sees the old response for 1 hour after regeneration, to confirm the new response performs better.
A8. Thumbs Feedback → SageMaker Training Pipeline
Pipeline stages:
Raw Feedback (DynamoDB/S3)
→ Stage 1: Filtering
→ Stage 2: Label Noise Reduction
→ Stage 3: Class Balancing
→ Stage 4: Feature Engineering
→ Stage 5: SageMaker Training Dataset (S3)
Stage 1 — Filtering rules:
- Remove feedback from sessions < 10 seconds (accidental clicks)
- Remove feedback from users who thumbs-down > 95% of all responses (adversarial/frustrated users)
- Remove feedback on escalation intent (thumbs-down may reflect frustration with the situation, not the bot's escalation quality)
- Require minimum 2-second gap between response display and feedback click
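The four rules above reduce to a simple predicate; the event field names (`session_duration_s`, `user_down_ratio`, `feedback_delay_s`) are illustrative, not a fixed schema.

```python
def passes_stage1(event: dict) -> bool:
    """Return True if a feedback event survives the Stage 1 filters."""
    if event["session_duration_s"] < 10:    # accidental clicks
        return False
    if event["user_down_ratio"] > 0.95:     # adversarial/frustrated users
        return False
    if event["intent"] == "escalation":     # noisy signal, excluded
        return False
    if event["feedback_delay_s"] < 2:       # clicked before reading
        return False
    return True
```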
Stage 2 — Label noise reduction:
```python
# Confidence-weighted labeling
def compute_label_confidence(feedback_event: dict) -> float:
    signals = []

    # Signal 1: Comment alignment (if a classified comment exists)
    if feedback_event.get('comment_category'):
        # If comment says "wrong_genre" and intent is "recommendation" → high confidence
        intent_comment_alignment = check_alignment(
            feedback_event['intent'],
            feedback_event['comment_category'])
        signals.append(('comment_alignment', intent_comment_alignment, 0.4))

    # Signal 2: Behavioral corroboration
    # Thumbs-up on recommendation + user clicked product link = corroborated positive
    if feedback_event['intent'] == 'recommendation' and feedback_event.get('product_click'):
        signals.append(('behavioral', 1.0, 0.3))

    # Signal 3: Consistency — same user, same intent, consistent feedback pattern
    historical_consistency = get_user_intent_consistency(
        feedback_event['user_id_hash'],
        feedback_event['intent'])
    signals.append(('consistency', historical_consistency, 0.3))

    # Weighted average over whichever signals are present
    confidence = (sum(score * weight for _, score, weight in signals)
                  / sum(w for _, _, w in signals))
    return confidence
```
- Only include feedback events with confidence > 0.6 in the training set.
- Events with confidence 0.4–0.6 go to a human review queue.
Stage 3 — Class imbalance handling:
| Intent | Feedback Volume/Month | Thumbs-Up % | Strategy |
|---|---|---|---|
| recommendation | 800K | 72% | Undersample majority class |
| faq | 600K | 58% | Use as-is (relatively balanced) |
| chitchat | 400K | 80% | Undersample positives |
| order_tracking | 300K | 65% | Use as-is |
| product_discovery | 50K | 60% | SMOTE oversampling for minority class |
| escalation | 30K | 25% | Exclude from training (see Stage 1) |
Stage 4 — Feature engineering for intent classifier:
- Input: user query text, conversation history (last 3 turns), detected entities
- Label: intent (10 classes), with feedback-derived quality weight
- Format: JSONL for SageMaker, uploaded to S3 with versioning
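The JSONL output of Stage 4 can be sketched as one JSON object per line; the field names (`query`, `history`, `label`, `weight`) are illustrative, not a SageMaker requirement.

```python
import json

def to_jsonl(records: list[dict]) -> str:
    """Serialize training records as JSONL (one JSON object per line)."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

lines = to_jsonl([
    {"query": "recommend a shonen like Naruto", "history": [],
     "label": "recommendation", "weight": 0.9},
    {"query": "where is my order", "history": [],
     "label": "order_tracking", "weight": 1.0},
])
```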
A9. Full Data Architecture at 50M Interactions/Month
Throughput calculations:
50M interactions/month × 6% feedback rate = 3M feedback events/month
3M / 30 days / 24 hours / 3600 seconds = ~1.16 events/second (average)
Peak (Prime Day): 10× average = ~12 events/second
Burst (flash sale start): 50× average = ~58 events/second
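The arithmetic above, reproduced directly:

```python
monthly_events = 50_000_000 * 0.06            # 3M feedback events/month
avg_per_sec = monthly_events / (30 * 24 * 3600)
peak_per_sec = avg_per_sec * 10               # Prime Day: ~12/sec
burst_per_sec = avg_per_sec * 50              # flash sale start: ~58/sec
```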
Ingestion — Kinesis Data Streams:
Stream: mangaassist-feedback-stream
Shards: 2 (each shard handles 1,000 writes/sec, 1 MB/sec)
→ Sufficient for 58 events/sec peak with 97% headroom
→ Auto-scaling policy: add shard if utilization > 70% for 5 minutes
Retention: 24 hours (default, sufficient for consumer catch-up)
Hot storage — DynamoDB:
Table: MangaAssist_ThumbsFeedback
Capacity mode: ON-DEMAND
→ Justification: Traffic is bursty (Prime Day, flash sales). Provisioned mode
requires over-provisioning for peaks or complex auto-scaling. On-demand
absorbs 10× burst within seconds.
Average item size: ~500 bytes
3M items/month × 500 bytes = 1.5 GB/month
TTL: 90 days → max table size ~4.5 GB (well within DynamoDB limits)
GSIs:
- intent-timestamp-index: PK=intent, SK=timestamp (read-heavy, ~100 RCU)
- feedback-timestamp-index: PK=feedback, SK=timestamp (read-heavy, ~50 RCU)
Warm storage — S3 via Kinesis Firehose:
Destination: s3://mangaassist-feedback/raw/year=YYYY/month=MM/day=DD/
Format: Parquet (columnar, compressed)
Buffer: 5 MB or 300 seconds (whichever first)
Compression: Snappy (~70% compression ratio)
Monthly data: 3M × 500 bytes × 0.3 (compressed) = ~450 MB/month
Annual: ~5.4 GB → negligible S3 cost (~$0.12/month at standard storage rates)
Cold analytics — Athena:
-- Athena table definition
```sql
CREATE EXTERNAL TABLE mangaassist_feedback (
    session_id       STRING,
    timestamp        STRING,
    turn_id          INT,
    feedback         STRING,
    intent           STRING,
    response_id      STRING,
    user_id_hash     STRING,
    device_type      STRING,
    latency_ms       INT,
    model_version    STRING,
    comment_raw      STRING,
    comment_category STRING
)
PARTITIONED BY (year INT, month INT, day INT)
STORED AS PARQUET
LOCATION 's3://mangaassist-feedback/raw/';
```
Monthly cost estimate:
| Component | Calculation | Monthly Cost |
|---|---|---|
| Kinesis Data Streams | 2 shards × $0.015/hr × 730 hrs | $21.90 |
| Kinesis Firehose | 3M records × 5KB per record batch | $1.85 |
| DynamoDB (on-demand) | 3M writes × $1.25/M + reads ~100M × $0.25/M | $28.75 |
| S3 storage | ~5 GB (cumulative first year) | $0.12 |
| Athena queries | ~50 queries/month × 450 MB scanned | $2.40 |
| Total | — | ~$55/month |
Very Hard
A10. A/B Test for Feedback Widget's Causal Effect
Experimental design:
Randomization unit: User-level (not session-level). Rationale: If we randomize by session, the same user might see the widget in some sessions and not others, creating confusion and carryover effects. User-level randomization ensures a clean, consistent experience.
Assignment mechanism:
```python
import hashlib

def assign_variant(user_id: str, experiment_id: str = "feedback_widget_v1") -> str:
    """Deterministic 50/50 split: the same user always lands in the same arm."""
    hash_input = f"{user_id}:{experiment_id}"
    hash_value = int(hashlib.sha256(hash_input.encode()).hexdigest(), 16)
    bucket = hash_value % 100
    if bucket < 50:
        return "control"    # No widget shown
    else:
        return "treatment"  # Widget shown
```
Metrics:
| Metric | Type | Control Baseline | MDE | Required Sample |
|---|---|---|---|---|
| Conversation length (turns) | Continuous | 4.2 turns | 0.3 turns (7%) | ~3,500 users/group |
| Cart-add rate (post-recommendation) | Proportion | 12.5% | 1% absolute | ~15,000 users/group |
| Escalation rate | Proportion | 8.0% | 0.8% absolute | ~12,000 users/group |
Duration: Given MangaAssist's traffic of ~1.7M daily active users, we need max 15K users/group = 30K total. With a 10% experiment allocation (170K users/day), we achieve required sample in < 1 day. However, run for 14 days minimum to capture weekly seasonality (weekend manga browsing patterns differ from weekdays).
Handling asymmetric data generation:
The fundamental challenge: treatment group generates feedback data, control group does not. This means:
- Primary metrics (conversation length, cart-add rate, escalation rate) are measured identically for both groups — these come from session logs and event streams, NOT from the feedback widget.
- Feedback rate can only be measured in the treatment group. To estimate what the control group's "latent satisfaction" would be, use proxy metrics: dwell time per response, message rephrasing rate, session abandonment rate.
- Hawthorne effect: Users who know they're being observed (widget visible) may behave differently. Address by adding a "placebo" arm, yielding a 3-arm design: control (no widget), placebo (a non-interactive widget, just an icon with no click functionality, to isolate the visual effect from the feedback effect), and treatment (functional widget).
Analysis:
- Primary: Intent-to-treat (ITT) analysis comparing control vs. treatment on primary metrics
- Secondary: Instrumental variable (IV) analysis using widget assignment as the instrument for "giving feedback" to estimate the local average treatment effect (LATE)
- Multiple comparison correction: Bonferroni for 3 primary metrics (α = 0.05/3 ≈ 0.0167)
A11. Closed-Loop Feedback Flywheel
System architecture:
┌─────────────────────────────────────────────────────────────────────┐
│ FEEDBACK FLYWHEEL │
│ │
│ ┌──────────┐ ┌──────────────┐ ┌─────────────┐ │
│ │ 1. COLLECT│───→│ 2. AGGREGATE │───→│ 3. OPTIMIZE │ │
│ │ │ │ & DETECT │ │ PROMPTS │ │
│ │ Thumbs │ │ │ │ │ │
│ │ Widget │ │ Kinesis + │ │ Bedrock │ │
│ │ → Kinesis │ │ Lambda │ │ Prompt Mgr │ │
│ └──────────┘ │ Anomaly Det. │ └──────┬──────┘ │
│ └──────────────┘ │ │
│ ▼ │
│ ┌──────────┐ ┌──────────────┐ ┌─────────────┐ │
│ │ 5. CLOSE │←───│ 4. A/B │←───│ Deploy │ │
│ │ LOOP │ │ DEPLOY │ │ Candidate │ │
│ │ │ │ │ │ Prompts │ │
│ │ Measure │ │ 10% traffic │ │ │ │
│ │ via same │ │ to new │ │ ECS Fargate │ │
│ │ thumbs │ │ prompt │ │ Config Mgr │ │
│ └──────────┘ └──────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Stage 1 — Collection (covered in Q1–Q3 above)
Stage 2 — Aggregation & anomaly detection:
```python
# Lambda consumer on Kinesis — aggregates every 15 minutes
def aggregate_feedback(intent: str, window_minutes: int = 15):
    """Compute rolling thumbs-up rate and detect anomalies."""
    recent = query_dynamodb_gsi(
        index='intent-timestamp-index',
        pk=intent,
        sk_range=(now - timedelta(minutes=window_minutes), now))
    total = len(recent)
    if total == 0:
        # No feedback in this window: skip anomaly detection entirely
        return {'intent': intent, 'rate': None, 'z_score': None}

    thumbs_up = sum(1 for r in recent if r['feedback'] == 'thumbs_up')
    rate = thumbs_up / total

    # Anomaly detection: compare to 7-day rolling average
    historical_rate = get_historical_rate(intent, days=7)
    z_score = (rate - historical_rate['mean']) / historical_rate['std']
    if z_score < -2.5:  # Significant drop
        trigger_optimization(intent, rate, recent_samples=recent[:50])
    return {'intent': intent, 'rate': rate, 'z_score': z_score}
```
Stage 3 — Automatic prompt tuning:
```python
def optimize_prompt(intent: str, current_rate: float, negative_samples: list):
    """Generate candidate prompt improvements using Bedrock Claude."""
    # Retrieve current prompt template
    current_prompt = get_prompt_template(intent)

    # Analyze negative feedback patterns
    analysis_prompt = f"""
Analyze these {len(negative_samples)} negative feedback responses for the '{intent}' intent
in a manga chatbot. Identify the top 3 patterns causing user dissatisfaction:

{format_samples(negative_samples)}

Current prompt template:
{current_prompt}

Suggest a revised prompt template that addresses these patterns.
Return JSON: {{"analysis": "...", "revised_prompt": "...", "expected_improvement": "..."}}
"""
    response = bedrock_client.invoke_model(
        modelId='anthropic.claude-3-5-sonnet-20241022-v2:0',
        body=json.dumps({'prompt': analysis_prompt, 'max_tokens': 2000}))
    candidate = json.loads(response['body'].read())

    # Store candidate in DynamoDB prompt registry (not yet deployed)
    store_candidate_prompt(
        intent, candidate['revised_prompt'],
        metadata={'trigger_rate': current_rate, 'analysis': candidate['analysis']})
    return candidate
```
Stage 4 — A/B deployment:
```python
# ECS Fargate prompt router
def get_prompt_for_request(intent: str, user_id: str) -> str:
    experiment = get_active_experiment(intent)
    if experiment and is_in_treatment_group(user_id, experiment['id']):
        return experiment['candidate_prompt']  # 10% of traffic
    else:
        return get_production_prompt(intent)   # 90% of traffic
```
- Run for 7 days minimum per intent
- Require statistical significance (p < 0.05) AND practical significance (thumbs-up rate improvement > 2 percentage points) before promoting
Stage 5 — Closing the loop:
- If candidate prompt wins → promote to production, archive old prompt
- If candidate prompt loses → discard, increase anomaly detection threshold to prevent repeated failed optimization
- Track "optimization attempts per intent per month" — cap at 3 to prevent churn
Cold-start for new intents (e.g., product_discovery):
- Bootstrap with synthetic feedback: Generate 500 synthetic conversations for `product_discovery` using Bedrock. Have 3 internal annotators rate them (thumbs-up/down). Use as initial training signal.
- Transfer learning from similar intent: `product_discovery` is closest to `recommendation`. Initialize prompt template and feedback thresholds from `recommendation` parameters.
- Aggressive exploration: For the first 30 days, show the feedback widget to 100% of `product_discovery` users (vs. the standard sampling rate) to accelerate data collection.
- Lower confidence threshold: Accept optimization candidates with p < 0.10 (instead of 0.05) during the cold-start phase, reverting to stricter thresholds after 10K feedback events.
Sycophancy risk mitigation:
Optimizing purely for thumbs-up rate can lead to responses that tell users what they want to hear rather than what's accurate:
- Guardrail metrics: Track factual accuracy (via automated checks against product catalog in OpenSearch) alongside thumbs-up rate. Reject prompt candidates that improve thumbs-up but decrease factual accuracy.
- Diversity constraint: Ensure `recommendation` responses don't collapse to always recommending the top-5 most popular manga. Track recommendation diversity (entropy of recommended titles).
- Negative feedback value: Assign 2× weight to thumbs-down events in the optimization objective. This makes the system more sensitive to failures than successes.
- Human-in-the-loop gate: Any prompt change that would affect > 1M users/month requires human review before deployment, regardless of A/B results.
A12. Unified User Satisfaction Score (USS)
USS Formula:
$$USS = \frac{\sum_{i=1}^{10} w_i \cdot r_i \cdot c_i}{\sum_{i=1}^{10} w_i \cdot c_i}$$
Where:
- $w_i$ = business impact weight for intent $i$
- $r_i$ = thumbs-up rate for intent $i$ (0 to 1)
- $c_i$ = confidence factor for intent $i$ (0 to 1), based on sample size
Business impact weights:
| Intent | Weight ($w_i$) | Justification |
|---|---|---|
| recommendation | 1.00 | Directly drives manga sales (core business metric) |
| product_question | 0.85 | Pre-purchase decision support, high conversion impact |
| product_discovery | 0.80 | Top-of-funnel engagement, drives browse-to-cart |
| checkout_help | 0.90 | Prevents cart abandonment (high revenue impact per interaction) |
| order_tracking | 0.70 | Customer retention, but transactional (low margin for error) |
| return_request | 0.65 | Satisfaction recovery; poor handling causes churn |
| promotion | 0.60 | Drives incremental sales, but lower per-interaction value |
| faq | 0.50 | Deflects support tickets; quality matters but less revenue-direct |
| escalation | 0.40 | Success = smooth handoff; thumbs signal is noisy here |
| chitchat | 0.20 | Brand engagement, low business impact |
Confidence factor (handles sparse signal):
$$c_i = \min\left(1.0, \frac{n_i}{n_{min}}\right)$$
Where $n_i$ is the feedback count for intent $i$ in the measurement window, and $n_{min}$ = 100 (minimum sample for full confidence).
If product_discovery has only 40 feedback events this week, $c_i = 40/100 = 0.4$, effectively down-weighting its contribution to USS until more data accumulates.
Worked example:
| Intent | $w_i$ | $r_i$ | $n_i$ | $c_i$ | $w_i \cdot r_i \cdot c_i$ | $w_i \cdot c_i$ |
|---|---|---|---|---|---|---|
| recommendation | 1.00 | 0.72 | 5000 | 1.00 | 0.720 | 1.000 |
| product_question | 0.85 | 0.65 | 3000 | 1.00 | 0.553 | 0.850 |
| checkout_help | 0.90 | 0.55 | 1500 | 1.00 | 0.495 | 0.900 |
| product_discovery | 0.80 | 0.60 | 40 | 0.40 | 0.192 | 0.320 |
| chitchat | 0.20 | 0.80 | 2000 | 1.00 | 0.160 | 0.200 |
| ... | ... | ... | ... | ... | ... | ... |
USS = Σ(numerator) / Σ(denominator) ≈ 0.66 (66% weighted satisfaction)
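The USS formula can be recomputed over the five visible rows (the "..." rows are omitted here, so this reproduces only the partial aggregate, not the full 0.66):

```python
# (w_i, r_i, n_i) for the five intents shown in the worked example
rows = [
    (1.00, 0.72, 5000),  # recommendation
    (0.85, 0.65, 3000),  # product_question
    (0.90, 0.55, 1500),  # checkout_help
    (0.80, 0.60, 40),    # product_discovery (sparse → down-weighted)
    (0.20, 0.80, 2000),  # chitchat
]

def uss(rows, n_min: int = 100) -> float:
    """Confidence-weighted satisfaction score per the USS formula."""
    num = den = 0.0
    for w, r, n in rows:
        c = min(1.0, n / n_min)  # confidence factor c_i
        num += w * r * c
        den += w * c
    return num / den

score = uss(rows)  # ≈ 0.648 over these five intents alone
```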
QuickSight dashboard design:
┌─────────────────────────────────────────────────────────┐
│ MangaAssist User Satisfaction Dashboard │
├─────────────────────────────────────────────────────────┤
│ │
│ [USS Score: 66.2%] ▲ +1.3% vs last week │
│ ═══════════════════════════════════ │
│ │
│ ┌─── Intent Breakdown (Treemap) ────────────────┐ │
│ │ ┌──────────┐ ┌────────┐ ┌──────┐ ┌────┐ │ │
│ │ │ recom. │ │checkout│ │prod_q│ │faq │ │ │
│ │ │ 72% ▲ │ │ 55% ▼ │ │ 65% │ │58% │ │ │
│ │ └──────────┘ └────────┘ └──────┘ └────┘ │ │
│ └───────────────────────────────────────────────┘ │
│ │
│ ┌─── USS Time Series (30 days) ─────────────────┐ │
│ │ 68%│ ╱╲ │ │
│ │ 66%│───╱──╲──────╱╲────── │ │
│ │ 64%│ ╱ ╲ ╱ ╲ │ │
│ │ 62%│─╱──────╲──╱────╲───── ← Prime Day dip │ │
│ │ └────────────────────── time → │ │
│ └───────────────────────────────────────────────┘ │
│ │
│ ┌─── Alerts ────────────────────────────────────┐ │
│ │ ⚠ checkout_help down 8% (flash sale impact) │ │
│ │ ✅ recommendation up 2% (new prompt deployed) │ │
│ └───────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
Dashboard layers:
1. Executive view: Single USS number, trend line, week-over-week delta
2. Intent drill-down: Click any intent in treemap → see its thumbs-up rate, volume, confidence factor, and top negative feedback categories
3. Time-series overlay: Toggle individual intent lines on/off, annotate with deployment events (prompt changes, model updates)
4. Anomaly panel: Automated highlights of statistically significant changes (z-score > 2.0), linked to root cause analysis
Data pipeline for QuickSight:
DynamoDB → S3 (Firehose, hourly Parquet) → Athena (daily aggregation view) → QuickSight SPICE dataset (refreshed every 4 hours)