Scenario 02 – Response Rating System

Parent: ../README.md · Skill: 5.1.3 – User-Centered Evaluation
System: MangaAssist – JP Manga Chatbot on Amazon.com
Stack: AWS Bedrock Claude 3.5 Sonnet · DynamoDB · OpenSearch Serverless · ElastiCache Redis · ECS Fargate


Context

While thumbs feedback provides high-volume binary signals, ordinal rating systems capture nuance: a 3-star response is qualitatively different from a 1-star one. MangaAssist uses multi-dimensional rating scales, NPS-style surveys, and per-attribute ratings to understand why users are satisfied or dissatisfied. The challenge lies in choosing the right scale, minimizing survey fatigue, and translating ratings into actionable model improvements.


Scenario Questions (12)

Easy (3)

Q1. MangaAssist wants to add a 5-star rating widget after recommendation and product_question responses. Design the rating UI flow: when does it appear, what does the user see, and how do you handle the case where a user rates multiple responses in a single session? Define the DynamoDB schema for storing star ratings alongside existing thumbs feedback.
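
A minimal sketch of one possible single-table item layout, assuming a hypothetical `manga_assist_feedback` table keyed by session; the sort-key prefix lets a star-rating item and a thumbs item for the same response coexist. All names and fields are illustrative, not the production schema.

```python
import boto3
from datetime import datetime, timezone

# Hypothetical table: PK = session_id, SK = "FEEDBACK#<message_id>#<type>",
# so thumbs and star items for the same response live side by side.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("manga_assist_feedback")  # illustrative table name

def put_star_rating(session_id: str, message_id: str, intent: str, stars: int) -> None:
    """Store a 1-5 star rating without overwriting an existing thumbs item."""
    assert 1 <= stars <= 5
    table.put_item(
        Item={
            "session_id": session_id,                      # partition key
            "feedback_sk": f"FEEDBACK#{message_id}#STAR",  # sort key
            "feedback_type": "star",
            "intent": intent,                              # recommendation | product_question
            "stars": stars,
            "created_at": datetime.now(timezone.utc).isoformat(),
        }
    )
```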

Q2. The team debates between a 5-point scale (1–5 stars), a 7-point Likert scale ("Strongly Disagree" to "Strongly Agree"), and a 0–10 NPS scale. Compare the three options for MangaAssist's use case, considering: (a) user cognitive load on a chatbot interface, (b) statistical discriminability, and (c) compatibility with Amazon's existing review ecosystem. Recommend one and justify.

Q3. After launching the 5-star rating system, you observe that 60% of ratings are either 1-star or 5-star (bimodal distribution). Explain why this "J-shaped" distribution is expected in chatbot feedback, how it differs from product review distributions, and what strategies you would use to encourage mid-range ratings for more granular signal.


Medium (3)

Q4. MangaAssist introduces per-dimension ratings for recommendation responses: (a) Relevance — "Was this manga relevant to your interests?", (b) Novelty — "Was this a new discovery for you?", (c) Explanation quality — "Was the recommendation reason helpful?". Design the data model, the aggregation pipeline that computes dimension-weighted composite scores, and explain how Bedrock Claude's prompt can be tuned differently based on which dimension scores lowest.
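
For the aggregation step, a dimension-weighted composite could be as simple as a weighted mean of per-dimension averages; the weights below are illustrative placeholders, not tuned values.

```python
# Sketch of a dimension-weighted composite score; weights are placeholders.
DIMENSION_WEIGHTS = {"relevance": 0.5, "novelty": 0.2, "explanation": 0.3}

def composite_score(dimension_means: dict[str, float]) -> float:
    """Weighted mean of per-dimension average ratings (each on a 1-5 scale)."""
    total = sum(DIMENSION_WEIGHTS.values())
    return sum(DIMENSION_WEIGHTS[d] * dimension_means[d] for d in DIMENSION_WEIGHTS) / total
```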

Q5. You want to implement a post-session NPS survey: "On a scale of 0–10, how likely are you to recommend MangaAssist to a friend?" Design the sampling strategy (which users see the survey, how often), the storage and calculation pipeline, and how you would segment NPS by intent mix (e.g., users who primarily used recommendation vs. order_tracking). Calculate the minimum sample size for a statistically reliable NPS score.
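
For the sample-size part, one common approach treats NPS as the difference of two multinomial proportions; the sketch below computes the minimum number of responses for a given margin of error, with purely illustrative promoter/detractor shares.

```python
import math

def nps_sample_size(p_promoters: float, p_detractors: float,
                    margin: float = 0.05, z: float = 1.96) -> int:
    """Minimum responses for an NPS estimate within +/- margin (as a proportion)
    at ~95% confidence, treating NPS as the difference of two multinomial shares."""
    nps = p_promoters - p_detractors
    variance = p_promoters + p_detractors - nps ** 2   # per-response variance of the score
    return math.ceil((z ** 2) * variance / (margin ** 2))

# Illustrative: 40% promoters, 25% detractors, +/- 5 NPS points -> 965 responses
print(nps_sample_size(0.40, 0.25))
```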

Q6. Star ratings on faq responses average 3.8/5, but the team doesn't know if this means "acceptable" or "needs improvement". Design a calibration study where you collect both star ratings and structured quality assessments from expert annotators on the same 500 responses. Define the correlation analysis you would perform to establish benchmarks (e.g., "3.5 stars from users ≈ 'Partially Correct' from experts").
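
For the correlation analysis, a rank correlation between user stars and expert labels (mapped to an assumed ordinal scale) is one reasonable starting point; the rubric mapping below is an assumption for illustration.

```python
from scipy.stats import spearmanr

# Expert rubric mapped to an ordinal scale (assumed mapping, for illustration only)
EXPERT_SCALE = {"Incorrect": 1, "Partially Correct": 2, "Correct": 3}

def rating_expert_correlation(user_stars: list[int], expert_labels: list[str]) -> float:
    """Spearman rank correlation between user star ratings and expert quality labels
    on the same set of calibrated responses."""
    expert_scores = [EXPERT_SCALE[label] for label in expert_labels]
    rho, p_value = spearmanr(user_stars, expert_scores)
    return rho
```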


Hard (3)

Q7. You discover a systematic rating bias: mobile users rate 0.4 stars lower than desktop users on identical responses (same query, same response, same intent). Investigate potential causes (screen real estate, interaction friction, user demographics). Design a normalization procedure that adjusts ratings by device type, and explain how you would validate that the normalization doesn't remove legitimate quality differences.
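
One candidate normalization estimates each device's mean offset on responses rated on both devices and subtracts it from raw ratings; a sketch assuming a pandas DataFrame with hypothetical response_id, device, and stars columns.

```python
import pandas as pd

def normalize_by_device(df: pd.DataFrame, baseline: str = "desktop") -> pd.DataFrame:
    """Shift each device's ratings by its mean offset from the baseline device,
    estimated only on responses that were rated on every device."""
    # df columns assumed: response_id, device, stars
    paired = df.pivot_table(index="response_id", columns="device", values="stars")
    paired = paired.dropna()                       # keep responses rated on all devices
    offsets = paired.mean() - paired[baseline].mean()
    df = df.copy()
    df["stars_adjusted"] = df["stars"] - df["device"].map(offsets)
    return df
```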

Q8. MangaAssist serves users in the US, UK, and Japan. Japanese users consistently rate 0.8 stars lower than US users (a cultural response bias: a tendency toward neutral/negative responses in East Asian survey patterns). Design a cross-cultural calibration framework: (a) detect the bias using statistical methods, (b) apply culture-adjusted normalization, (c) validate that normalization preserves within-culture quality signals. Include the statistical model and the risks of over-correction.
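
For the normalization step, one option is to z-score ratings within each locale and re-anchor them to the pooled distribution, which removes level differences while preserving within-locale ordering; the column names in this sketch are assumptions.

```python
import pandas as pd

def culture_adjust(df: pd.DataFrame) -> pd.DataFrame:
    """Z-score ratings within each locale, then re-anchor to the pooled mean/std,
    so cross-locale level differences are removed but within-locale ordering is kept."""
    # df columns assumed: locale, stars
    pooled_mean, pooled_std = df["stars"].mean(), df["stars"].std()

    def _z(group: pd.Series) -> pd.Series:
        return (group - group.mean()) / group.std()

    df = df.copy()
    df["stars_adjusted"] = df.groupby("locale")["stars"].transform(_z) * pooled_std + pooled_mean
    return df
```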

Q9. The team wants to predict star ratings from implicit signals (so they can estimate satisfaction for the 94% of interactions with no explicit feedback). Build a rating prediction model: define the features (conversation length, response latency, entity match rate from OpenSearch, follow-up questions asked, etc.), the model architecture (gradient boosted trees on SageMaker), the training/validation split strategy, and the expected accuracy. Discuss the ethical implications of imputing satisfaction scores.
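
A minimal local sketch of the predictor using scikit-learn's gradient boosting as a stand-in for a SageMaker training job; the feature columns are the implicit signals named in the question, the target is the 1–5 star rating, and the random split here is only for brevity.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

FEATURES = ["conversation_length", "response_latency_ms",
            "entity_match_rate", "follow_up_count"]   # implicit signals from the question

def train_rating_predictor(df: pd.DataFrame) -> GradientBoostingRegressor:
    """Fit a gradient-boosted regressor that predicts star ratings from implicit signals.
    A real pipeline should split by user (not by row) to avoid leakage."""
    X_train, X_val, y_train, y_val = train_test_split(
        df[FEATURES], df["stars"], test_size=0.2, random_state=42)
    model = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05, max_depth=3)
    model.fit(X_train, y_train)
    print("Validation MAE:", mean_absolute_error(y_val, model.predict(X_val)))
    return model
```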


Very Hard (3)

Q10. Design a Bayesian rating system for MangaAssist that handles sparse data gracefully. For intents with few ratings (e.g., product_discovery with 40 ratings/week), a simple average is unreliable. Implement a Bayesian model that uses a prior derived from all intents' ratings and updates with intent-specific observations. Define the prior distribution, the posterior update formula, how the system transitions from prior-dominated to data-dominated estimates, and how you would display confidence intervals in the dashboard.
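
A common shortcut for the prior/posterior mechanics is the Bayesian (shrunken) average, where the all-intents mean acts as a prior worth a fixed number of pseudo-ratings; the prior strength and star-scale standard deviation below are illustrative assumptions.

```python
def bayesian_average(intent_ratings: list[int],
                     global_mean: float, prior_strength: float = 25.0) -> tuple[float, float]:
    """Bayesian (shrunken) mean star rating for a sparse intent.

    The prior is the all-intents mean weighted as `prior_strength` pseudo-ratings;
    as real ratings accumulate, the estimate moves from prior- to data-dominated.
    Returns (posterior_mean, approx_95pct_halfwidth) for dashboard error bars.
    """
    n = len(intent_ratings)
    sample_mean = sum(intent_ratings) / n if n else global_mean
    posterior_mean = (prior_strength * global_mean + n * sample_mean) / (prior_strength + n)
    # Crude interval: treat the estimate as based on (prior_strength + n) observations
    # with a star-scale standard deviation of ~1.2 (illustrative assumption).
    halfwidth = 1.96 * 1.2 / (prior_strength + n) ** 0.5
    return posterior_mean, halfwidth
```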

Q11. You need to combine star ratings, thumbs feedback, NPS scores, and implicit signals into a single unified quality score per response. These signals have different scales, different noise levels, and different coverage rates. Design the fusion model: (a) normalization of each signal to a common scale, (b) weighting based on signal reliability and coverage, (c) handling of missing signals (e.g., response has thumbs but no stars), and (d) temporal decay (recent signals weighted more). Implement this as a DynamoDB + Lambda pipeline and define the math.
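
A sketch of the fusion math as a single Lambda-style function: each signal is normalized to [0, 1], weighted by an assumed reliability factor, and decayed with a 14-day half-life; missing signal types simply contribute nothing. All constants are illustrative placeholders.

```python
import time

# Illustrative normalization and reliability weights per signal type.
NORMALIZE = {
    "stars":    lambda v: (v - 1) / 4,            # 1-5 stars -> 0-1
    "thumbs":   lambda v: 1.0 if v > 0 else 0.0,  # up/down -> 1/0
    "nps":      lambda v: v / 10,                 # 0-10 -> 0-1
    "implicit": lambda v: v,                      # model output already in 0-1
}
RELIABILITY = {"stars": 1.0, "thumbs": 0.6, "nps": 0.8, "implicit": 0.4}
HALF_LIFE_DAYS = 14.0

def unified_quality_score(signals: list[dict], now: float | None = None) -> float | None:
    """Reliability- and recency-weighted mean over available signals, where each
    signal is {"type": ..., "value": ..., "ts": unix_seconds}."""
    now = now if now is not None else time.time()
    num = den = 0.0
    for s in signals:
        age_days = (now - s["ts"]) / 86400
        decay = 0.5 ** (age_days / HALF_LIFE_DAYS)
        weight = RELIABILITY[s["type"]] * decay
        num += weight * NORMALIZE[s["type"]](s["value"])
        den += weight
    return num / den if den else None
```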

Q12. Amazon leadership asks: "Is our rating system actually measuring chatbot quality, or is it measuring user mood?" Design a validity study for the MangaAssist rating system. Include: (a) construct validity — does the scale measure what we think it measures, (b) convergent validity — do ratings correlate with other quality measures (expert annotation, task completion), (c) discriminant validity — do ratings differ across objectively different response qualities, and (d) test-retest reliability — do users rate the same response consistently? Define the experimental protocols for each.


Answer Key

ANSWERS.md