
Answers – Scenario 02: Response Rating System

Questions: README.md
System: MangaAssist – JP Manga Chatbot on Amazon.com


Easy

A1. 5-Star Rating Widget Design and Schema

UI flow:

  1. Trigger condition: Star widget appears only for recommendation and product_question intents, 3 seconds after the response fully renders (allows reading time).
  2. Visual design: 5 hollow stars inline below the response. Stars fill on hover (desktop) or tap (mobile). A subtle label reads: "Rate this answer".
  3. Post-rating: Stars lock, display a checkmark animation, and fade to 50% opacity. No follow-up modal (minimize friction).
  4. Multi-rating in session: Each response gets its own independent rating widget. If a user rates 3 responses in one session, all 3 are stored as separate events. A session-level summary is computed asynchronously.

DynamoDB schema:

Table: MangaAssist_Ratings

| Attribute | Type | Role | Description |
|---|---|---|---|
| session_id | String | PK | Chat session identifier |
| timestamp#turn_id | String | SK | ISO-8601 timestamp + turn number |
| rating_type | String | — | stars_5 (distinguishes from thumbs, NPS) |
| rating_value | Number | — | 1–5 |
| intent | String | — | recommendation, product_question, etc. |
| response_id | String | — | Links to the specific chatbot response |
| user_id_hash | String | — | SHA-256 hashed customer ID |
| device_type | String | — | mobile, desktop, app |
| model_version | String | — | Bedrock model version |
| ttl | Number | — | 180-day TTL (longer than thumbs — richer signal) |

GSI: intent-rating-index — PK: intent, SK: rating_value#timestamp — enables queries like "all 1-star ratings on recommendation this week".
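
A minimal sketch of that GSI query with boto3. It assumes the composite sort key is materialized as a string attribute — named rating_sk here purely for illustration — holding values like "1#2025-03-15T10:30:00Z":

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('MangaAssist_Ratings')

def one_star_recommendations(week_start: str, week_end: str) -> list:
    """All 1-star ratings on the recommendation intent within [week_start, week_end)."""
    response = table.query(
        IndexName='intent-rating-index',
        KeyConditionExpression=(
            Key('intent').eq('recommendation') &
            # Composite SK sorts by rating first, then timestamp
            Key('rating_sk').between(f'1#{week_start}', f'1#{week_end}')
        )
    )
    return response['Items']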

Coexistence with thumbs: If a user gives both thumbs and stars on the same response, both are stored in their respective tables. A Lambda function reconciles: if thumbs_up + 1-star exists, flag as "contradictory" for data quality review.
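
A sketch of that reconciliation step, assuming hypothetical read helpers (get_thumbs, get_stars) and a data-quality table name that is illustrative, not prescribed:

import boto3

ddb = boto3.resource('dynamodb')

def reconcile_feedback(response_id: str) -> None:
    """Flag responses where thumbs and stars disagree (e.g., thumbs-up + 1 star)."""
    thumbs = get_thumbs(response_id)  # hypothetical helpers reading the two tables
    stars = get_stars(response_id)
    if thumbs is None or stars is None:
        return  # nothing to reconcile
    contradictory = (thumbs == 'up' and stars <= 2) or (thumbs == 'down' and stars >= 4)
    if contradictory:
        ddb.Table('MangaAssist_DataQuality').put_item(Item={
            'response_id': response_id,
            'issue': 'contradictory_feedback',
            'thumbs': thumbs,
            'stars': stars
        })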


A2. Scale Comparison: 5-Point vs. 7-Point Likert vs. 10-Point NPS

| Criterion | 5-Point Stars | 7-Point Likert | 10-Point NPS |
|---|---|---|---|
| Cognitive load | ⭐ Lowest — users instantly understand star ratings from Amazon product reviews | Medium — requires reading labels ("Somewhat Agree" vs. "Agree") | High — 0–10 scale is unusual in a chatbot context |
| Discriminability | Moderate — 5 levels capture good/neutral/bad | High — 7 levels distinguish nuance | High — but adjacent points (6 vs. 7) have no clear semantic difference |
| Amazon ecosystem fit | ⭐ Perfect — matches existing product review paradigm | Poor — not used anywhere on Amazon | Moderate — used in Amazon customer satisfaction surveys, but not product-facing |
| Response completion rate | ~8–12% of interactions | ~4–6% (label reading reduces rate) | ~5–7% |
| Analysis complexity | Simple means, medians | Requires careful ordinal statistics | Requires NPS calculation (Promoter–Detractor) with specific cutoffs |

Recommendation: 5-point star scale.

Justification: (1) Amazon customers already have a mental model for star ratings — they review products daily on the same platform. (2) The chatbot UI has limited vertical space; 5 clickable stars fit cleanly inline. (3) The lower cognitive load yields ~2× the completion rate of Likert scales, which matters more than the marginal discriminability gain from 7 levels. (4) 5-point data is backward-compatible with Amazon's internal quality metrics.

Caveat: Supplement with a separate NPS survey (sampled, not every interaction) for strategic-level tracking. See Q5.


A3. J-Shaped Rating Distribution in Chatbot Feedback

Why bimodal (J-shaped) is expected:

  1. Self-selection bias: Users who bother to rate have strong opinions. Satisfied users click 5 stars quickly; frustrated users click 1 star to "punish" the system. Mildly satisfied users (2–4) have no emotional impetus to rate.

  2. Task-completion framing: Chatbot interactions are task-oriented. Either the answer solved the problem (5 stars) or it didn't (1 star). Unlike product reviews where quality is continuous, chatbot quality is perceived as binary by most users.

  3. Speed of interaction: Users rate chatbot responses in 1–2 seconds (unlike product reviews where they spend 30+ seconds). Fast ratings gravitate to extremes.

Difference from product review distributions: Product reviews on Amazon also follow a J-shape, but with a heavier 4–5 star cluster (because buyers chose the product). Chatbot users didn't "choose" MangaAssist — they were presented with it — so there is no pre-selection bias toward positive ratings.

Strategies for mid-range granularity:

  1. Semantic anchoring: Label each star explicitly — 1 (Wrong), 2 (Unhelpful), 3 (OK), 4 (Helpful), 5 (Perfect). Visual labels prime users to consider mid-range options.

  2. Two-step rating: First ask "Was this helpful?" (Yes/Somewhat/No). If "Somewhat", expand to "What could be better?" with checkboxes. This naturally segments the 2–4 star range without forcing users to pick a number.

  3. Comparison framing: For recommendation, ask "Compared to a typical manga recommendation, this was: Much Worse / Worse / About the Same / Better / Much Better". Comparative judgments produce more normal distributions than absolute ratings.


Medium

A4. Per-Dimension Ratings for Recommendations

Dimensions (for recommendation intent):

| Dimension | Question | Scale |
|---|---|---|
| Relevance | "Was this manga relevant to your interests?" | 1–5 |
| Novelty | "Was this a new discovery for you?" | 1–5 |
| Explanation | "Was the recommendation reason helpful?" | 1–5 |

Data model (DynamoDB item):

{
    "session_id": "sess_abc123",
    "timestamp#turn_id": "2025-03-15T10:30:00Z#005",
    "rating_type": "multi_dimension",
    "dimensions": {
        "relevance": 4,
        "novelty": 2,
        "explanation": 5
    },
    "composite_score": 3.8,
    "intent": "recommendation",
    "response_id": "resp_xyz789"
}

Aggregation pipeline:

# Lambda function — triggered hourly by EventBridge
import statistics
from datetime import datetime, timedelta

def compute_composite_scores(intent: str = 'recommendation'):
    """Weighted composite score across dimensions."""

    # Dimension weights (tuned by product team)
    WEIGHTS = {
        'relevance': 0.50,   # Most important — did we recommend the right manga?
        'novelty': 0.25,     # Important — avoid "obvious" recommendations
        'explanation': 0.25  # Important — but secondary to actual relevance
    }

    # Query last 7 days of multi-dimension ratings
    now = datetime.utcnow()
    ratings = query_dynamodb(  # helper wrapping the boto3 GSI query
        table='MangaAssist_Ratings',
        index='intent-timestamp-index',
        pk=intent,
        sk_range=(now - timedelta(days=7), now),
        filter={'rating_type': 'multi_dimension'}
    )

    # Per-dimension aggregation
    dim_scores = {}
    for dim in ['relevance', 'novelty', 'explanation']:
        values = [r['dimensions'][dim] for r in ratings if dim in r.get('dimensions', {})]
        dim_scores[dim] = {
            'mean': statistics.mean(values),
            'median': statistics.median(values),
            'std': statistics.stdev(values) if len(values) > 1 else 0.0,
            'count': len(values)
        }

    # Weighted composite
    composite = sum(
        WEIGHTS[dim] * dim_scores[dim]['mean'] 
        for dim in WEIGHTS
    )

    return {'composite': composite, 'dimensions': dim_scores}

Prompt tuning based on lowest dimension:

| Lowest Dimension | Prompt Modification |
|---|---|
| Relevance low (< 3.0) | Increase the weight of the user's reading history in the Bedrock prompt; add a "Focus on the user's stated genre preferences" instruction; pull more signals from the OpenSearch user profile |
| Novelty low (< 3.0) | Add an "Avoid recommending titles from the top-100 most popular manga" instruction; increase the diversity parameter in the recommendation retrieval from OpenSearch |
| Explanation low (< 3.0) | Add "Explain WHY this manga matches the user's interests, referencing specific themes, art style, or author connections" to the Bedrock prompt |
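
The table above maps naturally to a dispatch step in prompt assembly. A minimal sketch, where the addendum strings come from the table but the function shape and base_prompt parameter are illustrative:

LOW_DIMENSION_ADDENDA = {
    'relevance': "Focus on the user's stated genre preferences and reading history.",
    'novelty': "Avoid recommending titles from the top-100 most popular manga.",
    'explanation': ("Explain WHY this manga matches the user's interests, "
                    "referencing specific themes, art style, or author connections.")
}

def tune_prompt(base_prompt: str, dim_scores: dict) -> str:
    """Append an instruction for every dimension whose 7-day mean falls below 3.0."""
    for dim, stats_ in dim_scores.items():
        if stats_['mean'] < 3.0 and dim in LOW_DIMENSION_ADDENDA:
            base_prompt += "\n" + LOW_DIMENSION_ADDENDA[dim]
    return base_prompt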

A5. Post-Session NPS Survey Design

Sampling strategy:

  • Who sees it: Users who completed ≥ 3 conversation turns in the session (filters out bounces).
  • Frequency cap: Maximum 1 NPS survey per user per 30 days (prevents survey fatigue).
  • Random sampling rate: 5% of eligible sessions.
  • Targeting: Stratified sampling ensuring proportional representation of all 10 intents. If a user's session touched recommendation + order_tracking, they count in both strata.

Survey UI: Full-screen modal shown after the session ends (the user closes the chat, or 5 minutes of inactivity elapse):

"On a scale of 0–10, how likely are you to recommend MangaAssist to a friend?"
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]
Optional follow-up: "What's the main reason for your score?" [text box]

NPS calculation:

$$\text{NPS} = \%\,\text{Promoters}\ (9\text{–}10) - \%\,\text{Detractors}\ (0\text{–}6)$$

Segmentation by intent mix:

def compute_segmented_nps():
    """NPS broken down by primary intent in session."""

    nps_responses = query_nps_table(date_range=last_30_days)

    segments = {}
    for response in nps_responses:
        # Determine primary intent (most turns)
        session_intents = get_session_intents(response['session_id'])
        primary_intent = max(session_intents, key=session_intents.get)

        if primary_intent not in segments:
            segments[primary_intent] = {'promoters': 0, 'passives': 0, 'detractors': 0, 'total': 0}

        score = response['nps_score']
        segments[primary_intent]['total'] += 1

        if score >= 9:
            segments[primary_intent]['promoters'] += 1
        elif score >= 7:
            segments[primary_intent]['passives'] += 1
        else:
            segments[primary_intent]['detractors'] += 1

    # Compute NPS per segment
    for intent, data in segments.items():
        data['nps'] = ((data['promoters'] - data['detractors']) / data['total']) * 100

    return segments

Minimum sample size:

For NPS with ±5 point margin of error at 95% confidence:

$$n = \frac{z^2 \cdot \sigma^2}{E^2} = \frac{1.96^2 \cdot 30^2}{5^2} = \frac{3.84 \cdot 900}{25} \approx 138$$

(σ ≈ 30 is typical NPS standard deviation)

For per-intent segmentation with 10 intents, need ~138 per segment = ~1,380 total NPS responses. At 5% sampling rate and 50% completion rate, need ~55,200 eligible sessions → achievable in ~3 days at MangaAssist's traffic volume.
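
The same arithmetic as a small reusable helper (a sketch; z = 1.96 for 95% confidence and σ = 30 per the assumption above):

import math

def nps_sample_size(margin: float = 5.0, sigma: float = 30.0, z: float = 1.96) -> int:
    """Minimum NPS responses for a given margin of error at the chosen confidence."""
    return math.ceil((z ** 2 * sigma ** 2) / margin ** 2)

print(nps_sample_size())       # 139 (ceil of 138.3 — the ~138 computed above)
print(nps_sample_size() * 10)  # ~1,390 responses across 10 intent segments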


A6. Calibration Study: Star Ratings vs. Expert Assessment

Study design:

  1. Sample: 500 randomly selected MangaAssist responses stratified by intent (50 per intent for the 10 intents) that have at least one user star rating.

  2. Expert annotation: 3 trained annotators rate each response on a structured rubric:

| Dimension | Expert Scale | Definition |
|---|---|---|
| Correctness | 1–4 | 1 = Wrong, 2 = Partially correct, 3 = Mostly correct, 4 = Fully correct |
| Completeness | 1–4 | 1 = Missing key info, 2 = Partial, 3 = Adequate, 4 = Comprehensive |
| Helpfulness | 1–4 | 1 = Unhelpful, 2 = Marginally helpful, 3 = Helpful, 4 = Very helpful |

  3. Annotation quality: Compute inter-annotator agreement (Cohen's Kappa) before the main study. Require κ > 0.65 on a 50-response pilot.
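
A sketch of that pilot agreement gate using scikit-learn's cohen_kappa_score, averaging pairwise kappa across the three annotators (the input format is illustrative):

from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def pilot_agreement(annotations: dict) -> float:
    """Mean pairwise Cohen's kappa.

    annotations: {annotator_id: [label, ...]} over the same 50 pilot responses.
    """
    pairs = list(combinations(annotations, 2))
    kappas = [cohen_kappa_score(annotations[a], annotations[b]) for a, b in pairs]
    return sum(kappas) / len(kappas)

# Gate the main study on the pilot:
# assert pilot_agreement(pilot_labels) > 0.65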

Correlation analysis:

import scipy.stats as stats
import pandas as pd

def calibration_analysis(user_ratings, expert_ratings):
    """Correlate user star ratings with expert quality assessments."""

    df = pd.merge(user_ratings, expert_ratings, on='response_id')

    # 1. Spearman rank correlation (ordinal data)
    for expert_dim in ['correctness', 'completeness', 'helpfulness']:
        rho, p_value = stats.spearmanr(df['star_rating'], df[expert_dim])
        print(f"Stars vs {expert_dim}: ρ={rho:.3f}, p={p_value:.4f}")

    # 2. Compute expert composite (mean of 3 dimensions)
    df['expert_composite'] = df[['correctness', 'completeness', 'helpfulness']].mean(axis=1)

    # 3. Build calibration table
    calibration = df.groupby('star_rating').agg({
        'expert_composite': ['mean', 'std', 'count'],
        'correctness': 'mean',
        'completeness': 'mean',
        'helpfulness': 'mean'
    }).round(2)

    return calibration

Expected calibration table:

| User Stars | Expert Composite (mean) | Correctness | Completeness | Helpfulness | n |
|---|---|---|---|---|---|
| 1 ⭐ | 1.4 ± 0.5 | 1.2 | 1.3 | 1.6 | 85 |
| 2 ⭐ | 2.0 ± 0.6 | 1.8 | 2.0 | 2.2 | 40 |
| 3 ⭐ | 2.6 ± 0.7 | 2.4 | 2.6 | 2.8 | 75 |
| 4 ⭐ | 3.2 ± 0.5 | 3.1 | 3.2 | 3.3 | 110 |
| 5 ⭐ | 3.6 ± 0.6 | 3.5 | 3.5 | 3.7 | 190 |

Benchmark interpretation:

  • 3.5 stars ≈ expert "Mostly Correct" and "Adequate" → acceptable baseline
  • < 3.0 stars ≈ expert "Partially Correct" → needs investigation
  • The 5-star ↔ expert 3.6 gap (not 4.0) confirms the positivity bias in user ratings


Hard

A7. Device-Type Rating Bias Investigation and Normalization

Investigation — potential causes of mobile derating:

  1. Fat-finger effect: On small screens, users tap 3 or 4 stars when aiming for 4 or 5. The touch target for individual stars may be smaller than 44 pt (Apple's recommended minimum touch-target size).
  2. Truncated responses: Mobile chat windows show fewer lines. If a recommendation explanation is truncated, users rate the visible (incomplete) response lower.
  3. Context switching cost: Mobile users are more likely to be multitasking. Interruptions between reading and rating create a negativity bias.
  4. User demographic differences: Mobile-heavy users may skew younger or more casual, with different quality expectations.

Normalization procedure:

Method: Z-score normalization within device type, then re-scaling:

$$r_{normalized} = \frac{r_{raw} - \mu_{device}}{\sigma_{device}} \cdot \sigma_{global} + \mu_{global}$$

def normalize_ratings_by_device(ratings_df):
    """Z-score normalization within device type."""

    # Compute per-device statistics
    device_stats = ratings_df.groupby('device_type')['rating_value'].agg(['mean', 'std'])
    global_mean = ratings_df['rating_value'].mean()
    global_std = ratings_df['rating_value'].std()

    # Normalize
    ratings_df['rating_normalized'] = ratings_df.apply(
        lambda row: (
            (row['rating_value'] - device_stats.loc[row['device_type'], 'mean']) /
            device_stats.loc[row['device_type'], 'std']
        ) * global_std + global_mean,
        axis=1
    )

    # Clip to valid range [1, 5]
    ratings_df['rating_normalized'] = ratings_df['rating_normalized'].clip(1.0, 5.0)

    return ratings_df

Validation — ensuring normalization doesn't remove legitimate differences:

  1. Controlled experiment: Serve identical responses to randomized users across device types. If the rating gap persists after controlling for response content, it's measurement bias (normalize it). If the gap disappears, it's legitimate (don't normalize).

  2. Intent-specific validation: If mobile users rate recommendation lower (reading manga previews requires screen space) but rate order_tracking equally (text-only), the recommendation gap may be legitimate while order_tracking gap is measurement noise.

  3. A/A test: Compare normalized ratings between two random mobile cohorts. If normalization introduces artificial differences between identical populations, it's overcorrecting.
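
A minimal sketch of the A/A check in item 3 above. It assumes an array of normalized mobile ratings; a Mann-Whitney U test stands in for whatever two-sample comparison the team prefers:

import numpy as np
from scipy.stats import mannwhitneyu

def aa_test(mobile_normalized: np.ndarray, seed: int = 42) -> float:
    """Split one mobile population into two random cohorts; a small p-value
    would mean the normalization is manufacturing differences that aren't there."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(mobile_normalized)
    half = len(shuffled) // 2
    _, p_value = mannwhitneyu(shuffled[:half], shuffled[half:])
    return p_value  # expect p >> 0.05 on identical populations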


A8. Cross-Cultural Rating Calibration

Detection:

import scipy.stats as stats

def detect_cultural_bias(ratings_df):
    """Kruskal-Wallis test for regional differences on identical responses."""

    # Filter to responses served identically across regions (cached responses)
    shared_responses = ratings_df.groupby('response_id').filter(
        lambda g: g['region'].nunique() >= 2
    )

    # Kruskal-Wallis H-test (non-parametric, suitable for ordinal data)
    us_ratings = shared_responses[shared_responses['region'] == 'US']['rating_value']
    uk_ratings = shared_responses[shared_responses['region'] == 'UK']['rating_value']
    jp_ratings = shared_responses[shared_responses['region'] == 'JP']['rating_value']

    h_stat, p_value = stats.kruskal(us_ratings, uk_ratings, jp_ratings)

    # Effect size: epsilon-squared, (H - k + 1) / (n - k) with k = 3 groups
    n = len(shared_responses)
    epsilon_sq = (h_stat - 2) / (n - 3)

    return {'h_stat': h_stat, 'p_value': p_value, 'epsilon_sq': epsilon_sq}

Culture-adjusted normalization:

Model: Mixed-effects ordinal regression

$$\text{logit}(P(Y \leq k)) = \alpha_k - (\beta_{quality} \cdot Q + \beta_{region} \cdot R + \beta_{intent} \cdot I)$$

Where:

  • $Y$ = observed star rating (1–5)
  • $Q$ = latent true quality (what we want to recover)
  • $R$ = region fixed effect (US = 0, UK = −0.15, JP = −0.40, estimated from data)
  • $I$ = intent random effect
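
A sketch of fitting a model of this shape with statsmodels' OrderedModel (available since statsmodels 0.12). Column names are illustrative, and the intent random effect is simplified here to a fixed effect via dummy coding:

import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

def fit_region_adjusted_model(df: pd.DataFrame):
    """df columns (illustrative): rating (1-5), quality_proxy, region, intent."""
    # Dummy-code region and intent; drop_first makes US/baseline the reference
    exog = pd.get_dummies(df[['region', 'intent']], drop_first=True).astype(float)
    exog['quality_proxy'] = df['quality_proxy']
    model = OrderedModel(df['rating'], exog, distr='logit')
    return model.fit(method='bfgs', disp=False)  # region coefficients = offsets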

Calibration approach:

  1. Anchor responses: Select 100 responses rated by annotators as "objectively high quality" (expert score > 3.5/4.0). Measure mean user rating per region on these anchors.
  2. Compute region offsets: US mean on anchors = 4.3, JP mean = 3.5 → offset = −0.8 stars.
  3. Apply additive adjustment: JP ratings += 0.8 (simplest), or use the ordinal regression model (more principled).
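
The additive variant in step 3 is a few lines of pandas. A sketch, assuming a DataFrame of ratings on the 100 anchor responses with region and rating_value columns:

import pandas as pd

def region_offsets(anchor_ratings: pd.DataFrame, reference: str = 'US') -> pd.Series:
    """Per-region additive offsets relative to the reference region, computed on
    anchor responses only (e.g., JP offset ≈ +0.8 in the worked numbers above)."""
    means = anchor_ratings.groupby('region')['rating_value'].mean()
    return means[reference] - means  # add this to each region's raw ratings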

Validation:

  1. Within-culture ranking preservation: After normalization, the ordering of responses within Japan should remain unchanged. If Response A was rated higher than Response B by Japanese users, it should still be higher after normalization.
  2. Cross-culture convergence: On anchor responses, normalized ratings across regions should converge (US ≈ UK ≈ JP). Measure by the reduction in between-region variance.
  3. Not overcorrecting: Japanese users may genuinely find some English-centric manga descriptions less helpful (localization issue). Segment analysis by manga language/origin to separate cultural bias from legitimate quality differences.

Risks: Over-normalization could mask real quality problems in the Japanese experience — e.g., if manga metadata in OpenSearch is less complete for JP-exclusive titles, lower JP ratings are legitimate signal, not cultural bias.


A9. Predicting Star Ratings from Implicit Signals

Feature engineering:

| Feature | Source | Rationale |
|---|---|---|
| conversation_length | DynamoDB session | Longer conversations may indicate difficulty or deep engagement |
| response_latency_ms | CloudWatch | Slow responses frustrate users |
| entity_match_rate | OpenSearch | % of user-mentioned manga titles found in catalog |
| follow_up_questions | DynamoDB turns | More follow-ups may indicate an incomplete answer |
| session_duration_sec | DynamoDB timestamps | Total time spent in chat |
| reformulation_count | NLU pipeline | How many times the user rephrased the same question |
| intent_switches | NLU pipeline | Switching intents mid-session may indicate confusion |
| escalation_triggered | escalation intent flag | Binary — strong negative signal |
| product_clicks_after | Clickstream (Kinesis) | For recommendation: did the user click any suggested product? |
| cart_add_after | Clickstream | For recommendation/promotion: ultimate conversion signal |
| cache_hit | ElastiCache Redis logs | Cached responses may have a different quality profile |
| time_of_day | Session timestamp | Late-night users may rate differently |
| device_type | Session metadata | Known bias factor (see Q7) |

Model architecture:

# SageMaker XGBoost training job
from sagemaker.xgboost import XGBoost

estimator = XGBoost(
    entry_point='train.py',
    role=sagemaker_role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    framework_version='1.7-1',
    hyperparameters={
        'objective': 'reg:squarederror',  # Treat 1-5 as continuous
        'num_round': 500,
        'max_depth': 6,
        'eta': 0.1,
        'subsample': 0.8,
        'colsample_bytree': 0.8,
        'eval_metric': 'rmse'
    }
)

Train/validation split:

  • Temporal split: Train on months 1–3, validate on month 4, test on month 5. This avoids data leakage from same-user interactions appearing in both sets.
  • No random split — user behavior patterns change over time, and the model must generalize forward.
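
A sketch of the temporal split (assumes a month column; the boundaries match the 5-month window described above):

def temporal_split(df):
    """Train on months 1-3, validate on month 4, test on month 5 — no shuffling."""
    train = df[df['month'] <= 3]
    val = df[df['month'] == 4]
    test = df[df['month'] == 5]
    return train, val, test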

Expected accuracy:

  • RMSE: ~0.8–1.0 stars (predicting within ±1 star of actual)
  • Rank correlation (Spearman): ~0.55–0.65
  • Classification accuracy (binned: 1–2 vs. 3 vs. 4–5): ~60–65%

Ethical implications:

  1. Consent: Users who chose not to rate may be opting out of quality judgment. Imputing scores violates their implicit choice not to give feedback.
  2. Algorithmic bias amplification: If the model learns that mobile users get lower predicted ratings (reflecting existing bias), it could deprioritize mobile-user experiences in optimization decisions.
  3. Transparency: If imputed scores influence which prompts/models are deployed, the team should disclose that "some quality metrics are estimated from behavioral signals" in internal documentation.
  4. Guardrail: Never expose imputed scores to users or use them for individual-level decisions. Only use them for aggregate trend analysis.


Very Hard

A10. Bayesian Rating System for Sparse Data

Model: Beta-Binomial on binarized ratings (good ≥ 4 stars, bad < 4 stars). This simplifies the 5-point scale to a "success rate" framework while preserving its ordinal nature through the choice of threshold.

Prior distribution:

$$\theta_i \sim \text{Beta}(\alpha_0, \beta_0)$$

Where $\theta_i$ is the true "good rating" probability for intent $i$.

Estimate prior parameters from the global rating distribution across all intents:

$$\alpha_0 = \mu_0 \cdot \kappa_0, \quad \beta_0 = (1 - \mu_0) \cdot \kappa_0$$

  • $\mu_0$ = global "good rating" rate ≈ 0.65 (65% of all ratings are ≥ 4 stars)
  • $\kappa_0$ = prior strength = 50 (equivalent to 50 pseudo-observations; tuned so that intents with 50+ real observations are data-dominated)

$$\alpha_0 = 0.65 \times 50 = 32.5, \quad \beta_0 = 0.35 \times 50 = 17.5$$

Posterior update:

Given $n_i$ total ratings and $s_i$ "good" ratings for intent $i$:

$$\theta_i | \text{data} \sim \text{Beta}(\alpha_0 + s_i, \beta_0 + n_i - s_i)$$

Posterior mean (Bayesian estimate):

$$E[\theta_i | \text{data}] = \frac{\alpha_0 + s_i}{\alpha_0 + \beta_0 + n_i} = \frac{32.5 + s_i}{50 + n_i}$$

Worked example for product_discovery (sparse: 40 ratings, 24 good):

$$E[\theta] = \frac{32.5 + 24}{50 + 40} = \frac{56.5}{90} = 0.628$$

Compare to naive estimate: $24/40 = 0.600$.

The Bayesian estimate (0.628) shrinks toward the global mean (0.65), reflecting uncertainty with only 40 observations.

95% credible interval:

from scipy.stats import beta

def bayesian_rating_estimate(good_count, total_count, alpha_prior=32.5, beta_prior=17.5):
    alpha_post = alpha_prior + good_count
    beta_post = beta_prior + (total_count - good_count)

    mean = alpha_post / (alpha_post + beta_post)
    ci_low, ci_high = beta.ppf([0.025, 0.975], alpha_post, beta_post)

    return {
        'mean': round(mean, 3),
        'ci_95': (round(ci_low, 3), round(ci_high, 3)),
        'effective_n': total_count,
        'prior_influence': round(50 / (50 + total_count), 3)
    }

# product_discovery: 40 ratings, 24 good
print(bayesian_rating_estimate(24, 40))
# {'mean': 0.628, 'ci_95': (0.524, 0.725), 'effective_n': 40, 'prior_influence': 0.556}

# recommendation: 5000 ratings, 3600 good
print(bayesian_rating_estimate(3600, 5000))
# {'mean': 0.719, 'ci_95': (0.706, 0.732), 'effective_n': 5000, 'prior_influence': 0.01}

Dashboard display:

Intent               Score    95% CI         Confidence
─────────────────────────────────────────────────────────
recommendation       71.9%    [70.6, 73.2]   ████████████ HIGH
product_question     65.2%    [63.1, 67.3]   ███████████  HIGH
product_discovery    62.8%    [52.4, 72.5]   █████        LOW
escalation           45.3%    [33.1, 58.0]   ███          VERY LOW

The wide CI for product_discovery and escalation visually communicates uncertainty to stakeholders.


A11. Multi-Signal Quality Fusion Model

Input signals and normalization:

| Signal | Raw Scale | Coverage | Reliability | Normalized Scale [0, 1] |
|---|---|---|---|---|
| Stars | 1–5 | ~8% of interactions | High (explicit) | $(stars - 1) / 4$ |
| Thumbs | {0, 1} | ~6% | Medium (binary) | Direct (0 or 1) |
| NPS | 0–10 | ~0.3% (sampled) | High (but session-level, not response-level) | $nps / 10$ |
| Implicit: cart-add | {0, 1} | ~100% for recommendation | Medium (noisy) | Direct |
| Implicit: dwell-time | seconds | ~100% | Low (confounded) | Sigmoid normalization: $\sigma((dwell - \mu) / \sigma_{dwell})$ |

Fusion formula:

$$Q_{response} = \frac{\sum_{s \in S} w_s \cdot d_s \cdot n_s \cdot v_s}{\sum_{s \in S} w_s \cdot d_s \cdot n_s}$$

Where:

  • $w_s$ = reliability weight for signal type $s$
  • $d_s$ = temporal decay: $e^{-\lambda(t_{now} - t_s)}$ with $\lambda = 0.01$ per day
  • $n_s$ = normalized signal value in [0, 1]
  • $v_s$ = binary indicator (1 if signal exists, 0 if missing)

Reliability weights:

| Signal | Weight | Justification |
|---|---|---|
| Star rating | 1.0 | Explicit, granular, response-level |
| Thumbs | 0.7 | Explicit but binary — less information |
| NPS | 0.4 | Session-level (not response-level), sampled |
| Cart-add | 0.6 | Behavioral, directly tied to business outcome, but only for purchase intents |
| Dwell time | 0.3 | Highly confounded (long dwell = engaged OR confused) |

Handling missing signals:

When a response has only thumbs (no stars, no implicit), the formula naturally reduces to:

$$Q_{response} = \frac{0.7 \cdot d_{thumbs} \cdot n_{thumbs}}{0.7 \cdot d_{thumbs}} = n_{thumbs}$$

For responses with no explicit feedback at all (86% of interactions), use the imputed rating from the XGBoost model (Q9) with weight 0.2 (lowest reliability).

Lambda implementation:

import math
import time

WEIGHTS = {
    'stars': 1.0, 'thumbs': 0.7, 'nps': 0.4, 
    'cart_add': 0.6, 'dwell_time': 0.3, 'imputed': 0.2
}
LAMBDA_DECAY = 0.01  # per day

def compute_quality_score(response_id: str) -> dict:
    signals = fetch_all_signals(response_id)  # DynamoDB multi-table query

    now = time.time()
    numerator = 0.0
    denominator = 0.0

    for signal in signals:
        signal_type = signal['type']
        w = WEIGHTS.get(signal_type, 0.1)

        # Temporal decay
        age_days = (now - signal['timestamp']) / 86400
        decay = math.exp(-LAMBDA_DECAY * age_days)

        # Normalized value
        normalized = normalize_signal(signal)

        numerator += w * decay * normalized
        denominator += w * decay

    quality_score = numerator / denominator if denominator > 0 else None

    return {
        'response_id': response_id,
        # "is not None" guard: a legitimate score of 0.0 must not be dropped
        'quality_score': round(quality_score, 3) if quality_score is not None else None,
        'signal_count': len(signals),
        'signal_types': [s['type'] for s in signals]
    }

A12. Validity Study for the MangaAssist Rating System

A. Construct validity — Does the scale measure what we think it measures?

Protocol: Factor analysis on per-dimension ratings (Relevance, Novelty, Explanation) from 2,000+ responses.

  • Confirmatory Factor Analysis (CFA): Fit a model where stars = f(relevance, novelty, explanation, error). If model fit is good (CFI > 0.90, RMSEA < 0.08), the scale captures the intended construct.
  • Think-aloud study: Recruit 20 MangaAssist users. Ask them to rate responses while narrating their thought process. Record what factors they actually consider.
  • Expected finding: Users primarily key on "Did I get a good manga recommendation?" (relevance dominates), with explanation quality as secondary. Novelty is rarely consciously evaluated.

B. Convergent validity — Do ratings correlate with other quality measures?

Protocol:

  1. Select 500 responses with both user ratings AND expert annotations (from the calibration study, Q6).
  2. Compute the Spearman correlation between user stars and the expert composite.
  3. Compute the correlation between user stars and task completion (for order_tracking: was the order actually found? For faq: did the user ask a follow-up question on the same topic?).

Expected results:

  • Stars ↔ Expert composite: ρ = 0.55–0.65 (moderate-strong)
  • Stars ↔ Task completion: ρ = 0.40–0.50 (moderate — ratings capture more than just task completion)
  • Convergent validity is confirmed if correlations are statistically significant and > 0.30.

C. Discriminant validity — Do ratings differ across objectively different qualities?

Protocol: Known-group comparison:

  1. Create 3 response groups:
     • Good: Responses reviewed by experts as correct, complete, and well-formatted (top quartile)
     • Mediocre: Partially correct, missing some information
     • Poor: Incorrect, irrelevant, or empty responses (generated by intentionally degraded prompts)
  2. Serve these to real users in a blind study (they don't know which group they're in).
  3. Compare mean star ratings across groups, as in the sketch below.

Expected results:

  • Good: 4.2 ± 0.6 stars
  • Mediocre: 3.1 ± 0.8 stars
  • Poor: 1.8 ± 0.7 stars
  • One-way ANOVA with p < 0.001 and η² > 0.30 confirms discriminant validity.
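
A sketch of the known-group comparison using scipy's f_oneway plus a hand-computed η² (the three arrays are the star ratings collected per group in the blind study):

import numpy as np
from scipy.stats import f_oneway

def known_group_anova(good, mediocre, poor):
    """One-way ANOVA across the three known-quality groups, with eta-squared."""
    f_stat, p_value = f_oneway(good, mediocre, poor)
    all_ratings = np.concatenate([good, mediocre, poor])
    grand_mean = all_ratings.mean()
    # eta-squared = between-group sum of squares / total sum of squares
    ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2
                     for g in (good, mediocre, poor))
    ss_total = ((all_ratings - grand_mean) ** 2).sum()
    return {'F': f_stat, 'p': p_value, 'eta_sq': ss_between / ss_total}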

D. Test-retest reliability — Do users rate the same response consistently?

Protocol:

  1. Select 100 neutral-quality responses (expert-rated 2.5–3.5/4.0).
  2. Show each response to 50 users at time T1.
  3. After 7 days, show the exact same response to the same users (embedded naturally in a new session) at T2.
  4. Compute the intra-class correlation (ICC) between T1 and T2 ratings.

Implementation challenge: Users must not recognize they're rating the same response. Mitigations:

  • Change the conversational context slightly (different preceding turns, same response)
  • Only test on faq and recommendation intents where identical responses are plausible
  • Exclude users who rate at T1 with extreme scores (ceiling/floor effects reduce reliability measurement)

Expected results:

  • ICC(2,1) = 0.50–0.65 (moderate). Chatbot ratings are inherently less stable than product ratings because user mood, context, and expectations change.
  • Acceptable threshold: ICC > 0.40 for chatbot feedback (literature benchmarks for subjective UX scales).
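
A sketch of the ICC computation with the pingouin package, whose intraclass_corr returns a table of ICC variants ('ICC2' corresponds to the single-rater ICC(2,1)). The long-format column names are illustrative:

import pingouin as pg

def test_retest_icc(df):
    """df (long format, illustrative columns): user_id, occasion ('T1'/'T2'), rating."""
    icc = pg.intraclass_corr(data=df, targets='user_id',
                             raters='occasion', ratings='rating')
    return icc.set_index('Type').loc['ICC2', 'ICC']  # ICC(2,1), single-rater form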

What happens if validity fails?

  • Low construct validity → redesign the rating dimensions; the current scale is measuring the wrong things.
  • Low convergent validity → users interpret stars differently than intended; add semantic labels.
  • Low discriminant validity → the scale cannot tell good from bad responses; increase scale granularity or switch to comparative judgments.
  • Low test-retest → the signal is too noisy for individual-level decisions; aggregate aggressively (minimum 50 ratings per response before acting on the data).