
03: User-Centered Evaluation Systems

AIP-C01 Mapping

Content Domain 5: Testing, Validation, and Troubleshooting
Task 5.1: Implement evaluation systems for GenAI
Skill 5.1.3: Determine adequate end user-centered evaluation systems (for example, user feedback interfaces, rating systems, and annotation workflows) and implement them.


User Story

As a MangaAssist product manager, I want to capture structured feedback from users about chatbot response quality — including thumbs up/down, dimension-specific ratings, and free-text corrections — So that we can identify quality gaps that automated metrics miss, build labeled datasets for future evaluation, and prioritize improvements based on real user pain.


Acceptance Criteria

  • Every chatbot response includes a non-intrusive thumbs up/down widget
  • Users who rate thumbs-down are prompted with optional dimension-specific feedback (relevance, accuracy, helpfulness)
  • Free-text correction field allows users to say what the correct answer should have been
  • Feedback is collected asynchronously — no impact on response latency
  • Feedback data flows to DynamoDB → Kinesis → Redshift analytics pipeline
  • Annotation workflow allows human reviewers to label batches of response pairs
  • Weekly feedback summary reports are auto-generated and surfaced in dashboards
  • Feedback loop: responses with >10% negative feedback trigger automatic quality review

Why User Feedback Completes the Evaluation Loop

Automated metrics (Skill 5.1.1) measure technical quality. Model evaluations (Skill 5.1.2) compare configurations. But neither tells you whether the user perceived the response as helpful.

Consider this MangaAssist response:

"Based on your interest in One Piece, you might enjoy Naruto (adventure/action), Bleach (supernatural battles), and Fairy Tail (friendship themes)."

  • Automated quality score: 0.88 (relevant, factual, fluent)
  • User feedback: 👎 "I've already read all of these. I wanted something new."

The automated system sees a high-quality response. The user sees a useless one. User-centered evaluation captures this gap — the "already consumed" problem that requires personalization beyond content similarity.

| Signal Source | What It Captures | What It Misses |
| --- | --- | --- |
| Automated Metrics | Technical correctness, consistency, fluency | User satisfaction, perceived helpfulness |
| A/B Test Business Metrics | Aggregate behavior shifts | Individual failure modes |
| User Feedback | Subjective experience, edge cases, expectations | Only from users who bother to rate |
| Annotation Workflows | Expert quality calibration, training labels | Expensive, slow, expert bias |

All four signals are needed for a complete evaluation picture.


High-Level Design

Feedback Collection Architecture

graph TD
    subgraph "User Interface Layer"
        A[Chatbot Response] --> B[Feedback Widget<br>👍 / 👎]
        B -->|Thumbs Down| C[Dimension Feedback Modal<br>Relevance / Accuracy / Helpfulness]
        C --> D[Free-text Correction<br>Optional]
        B -->|Thumbs Up| E[Positive Signal Recorded]
    end

    subgraph "Collection Layer"
        D --> F[Feedback API<br>Lambda Function]
        E --> F
        F --> G[DynamoDB<br>manga_user_feedback]
        F --> H[Kinesis Data Stream<br>feedback-events]
    end

    subgraph "Analytics Layer"
        H --> I[Kinesis Firehose]
        I --> J[S3 Feedback Lake]
        J --> K[Redshift Spectrum]
        K --> L[Feedback Dashboard]
        K --> M[Weekly Summary Report]
    end

    subgraph "Action Layer"
        K --> N{Negative Rate<br>> 10%?}
        N -->|Yes| O[Quality Review Trigger<br>SNS → Team Alert]
        N -->|No| P[Continue Monitoring]
        K --> Q[Labeled Dataset Builder<br>for Fine-tuning / Evaluation]
    end

Annotation Workflow Architecture

graph LR
    subgraph "Data Selection"
        A[Response Pool<br>All responses in Redshift] --> B[Sampling Strategy]
        B --> C1[Random Sample<br>200/week]
        B --> C2[Negative Feedback Sample<br>All thumbs-down]
        B --> C3[Low-Confidence Sample<br>Model uncertainty > 0.5]
        B --> C4[Edge Case Sample<br>New intents / rare products]
    end

    subgraph "Annotation Platform"
        C1 --> D[Annotation Queue]
        C2 --> D
        C3 --> D
        C4 --> D
        D --> E[Annotator Interface]
        E --> F[Quality Rating<br>1-5 per dimension]
        E --> G[Correct Response<br>Gold-standard answer]
        E --> H[Issue Tags<br>hallucination / irrelevant / off-brand]
    end

    subgraph "Quality Control"
        F --> I[Inter-Annotator Agreement<br>Cohen's Kappa]
        G --> I
        H --> I
        I --> J{Kappa >= 0.7?}
        J -->|Yes| K[Accept Labels<br>Add to Golden Dataset]
        J -->|No| L[Adjudication<br>Senior Reviewer]
        L --> K
    end

    subgraph "Outputs"
        K --> M[Golden Test Dataset<br>for Skill 5.1.1 evaluator]
        K --> N[Fine-tuning Data<br>for model improvement]
        K --> O[Quality Trend Reports]
    end

Feedback-to-Action Pipeline

sequenceDiagram
    participant User as Customer
    participant Widget as Feedback Widget
    participant API as Feedback Lambda
    participant DB as DynamoDB
    participant Stream as Kinesis
    participant Analytics as Redshift
    participant Alert as Quality Alert

    User->>Widget: 👎 on recommendation response
    Widget->>Widget: Show dimension modal
    User->>Widget: "Not relevant" + "I wanted seinen manga"
    Widget->>API: POST /feedback (async, non-blocking)
    API->>DB: Store feedback record
    API->>Stream: Publish feedback event

    Note over Stream,Analytics: Batch processing every 5 minutes

    Stream->>Analytics: Aggregate feedback metrics
    Analytics->>Analytics: Compute negative rate per intent

    alt Negative rate > 10% for recommendation intent
        Analytics->>Alert: Trigger quality review
        Alert->>Alert: Create JIRA ticket + Slack notification
    end

    Note over Analytics: Weekly batch
    Analytics->>Analytics: Build labeled dataset from feedback
    Analytics->>Analytics: Generate weekly summary report

Low-Level Design

Feedback Data Model

from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
import time
import uuid


class FeedbackRating(Enum):
    THUMBS_UP = "thumbs_up"
    THUMBS_DOWN = "thumbs_down"


class FeedbackDimension(Enum):
    RELEVANCE = "relevance"
    ACCURACY = "accuracy"
    HELPFULNESS = "helpfulness"
    TONE = "tone"
    COMPLETENESS = "completeness"


class IssueTag(Enum):
    HALLUCINATION = "hallucination"
    IRRELEVANT = "irrelevant"
    OFF_BRAND = "off_brand"
    OUTDATED = "outdated"
    WRONG_PRODUCT = "wrong_product"
    TOO_VERBOSE = "too_verbose"
    TOO_BRIEF = "too_brief"
    REPEATED_CONTENT = "repeated_content"


@dataclass
class UserFeedback:
    """A single feedback submission from a MangaAssist user."""
    feedback_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    session_id: str = ""
    turn_id: str = ""
    response_id: str = ""
    rating: FeedbackRating = FeedbackRating.THUMBS_UP
    dimension_ratings: dict[FeedbackDimension, int] = field(default_factory=dict)  # 1-5 scale
    issue_tags: list[IssueTag] = field(default_factory=list)
    correction_text: str = ""           # What the user thinks the correct answer is
    user_comment: str = ""              # Optional free-text comment
    intent: str = ""
    query: str = ""
    response_text: str = ""
    page_context: dict = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)


@dataclass
class AnnotationTask:
    """A task for a human annotator to review a chatbot response."""
    task_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    response_id: str = ""
    query: str = ""
    response_text: str = ""
    intent: str = ""
    conversation_history: list[dict] = field(default_factory=list)
    product_context: dict = field(default_factory=dict)
    source: str = ""                    # "random", "negative_feedback", "low_confidence", "edge_case"
    assigned_annotators: list[str] = field(default_factory=list)
    annotations: list[dict] = field(default_factory=list)
    status: str = "pending"             # pending, in_progress, completed, adjudication
    created_at: float = field(default_factory=time.time)


@dataclass
class AnnotationLabel:
    """A single annotation from a human reviewer."""
    annotator_id: str = ""
    task_id: str = ""
    quality_ratings: dict[str, int] = field(default_factory=dict)  # dimension -> 1-5
    correct_response: str = ""          # Gold-standard answer
    issue_tags: list[IssueTag] = field(default_factory=list)
    notes: str = ""
    time_spent_seconds: int = 0
    timestamp: float = field(default_factory=time.time)
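
For illustration, the thumbs-down interaction from the sequence diagram could be represented as a UserFeedback record like the one below; the IDs and field values are hypothetical.

# Hypothetical example: the thumbs-down from the sequence diagram, where the
# user wanted seinen manga instead of the recommended titles.
example_feedback = UserFeedback(
    session_id="sess-20240611-abc123",   # assumed ID format
    turn_id="turn-004",
    response_id="resp-9f2e",
    rating=FeedbackRating.THUMBS_DOWN,
    dimension_ratings={FeedbackDimension.RELEVANCE: 2},
    issue_tags=[IssueTag.IRRELEVANT],
    correction_text="I wanted seinen manga, not more shonen.",
    intent="recommendation",
    query="Recommend something like One Piece",
    response_text="Based on your interest in One Piece, you might enjoy Naruto...",
)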

Feedback Collection Service

import json
import logging
import time
import uuid
from typing import Optional

import boto3

logger = logging.getLogger(__name__)


class FeedbackCollectionService:
    """Collects, stores, and streams user feedback for MangaAssist responses.

    Design decisions:
    - Async collection: feedback API returns immediately, processing is async
    - DynamoDB for hot storage (recent 30 days), Redshift for analytics
    - Kinesis stream enables real-time alerting on negative feedback spikes
    """

    def __init__(
        self,
        feedback_table: str = "manga_user_feedback",
        kinesis_stream: str = "manga-feedback-events",
    ):
        self.dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
        self.table = self.dynamodb.Table(feedback_table)
        self.kinesis = boto3.client("kinesis", region_name="us-east-1")
        self.stream_name = kinesis_stream

    def submit_feedback(self, feedback: UserFeedback) -> str:
        """Store feedback in DynamoDB and publish to Kinesis stream."""
        # Validate — prevent empty or malformed feedback
        if not feedback.session_id or not feedback.response_id:
            raise ValueError("session_id and response_id are required")

        # Sanitize user input — limit lengths to prevent abuse
        feedback.correction_text = feedback.correction_text[:2000]
        feedback.user_comment = feedback.user_comment[:1000]

        # Store in DynamoDB
        item = {
            "pk": f"SESSION#{feedback.session_id}",
            "sk": f"FEEDBACK#{feedback.feedback_id}",
            "gsi1pk": f"INTENT#{feedback.intent}",
            "gsi1sk": f"TS#{int(feedback.timestamp)}",
            "rating": feedback.rating.value,
            "dimension_ratings": {
                dim.value: score for dim, score in feedback.dimension_ratings.items()
            },
            "issue_tags": [tag.value for tag in feedback.issue_tags],
            "correction_text": feedback.correction_text,
            "user_comment": feedback.user_comment,
            "intent": feedback.intent,
            "query": feedback.query,
            "response_text": feedback.response_text[:5000],  # Truncate for storage
            "response_id": feedback.response_id,
            "timestamp": int(feedback.timestamp),
            "ttl": int(feedback.timestamp) + (30 * 86400),  # 30-day TTL
        }
        self.table.put_item(Item=item)

        # Publish to Kinesis for real-time processing
        event = {
            "event_type": "user_feedback",
            "feedback_id": feedback.feedback_id,
            "session_id": feedback.session_id,
            "rating": feedback.rating.value,
            "intent": feedback.intent,
            "has_correction": bool(feedback.correction_text),
            "issue_tags": [tag.value for tag in feedback.issue_tags],
            "timestamp": int(feedback.timestamp),
        }
        self.kinesis.put_record(
            StreamName=self.stream_name,
            Data=json.dumps(event).encode("utf-8"),
            PartitionKey=feedback.session_id,
        )

        logger.info(
            "Feedback %s recorded: %s for intent=%s",
            feedback.feedback_id, feedback.rating.value, feedback.intent,
        )
        return feedback.feedback_id

    def get_feedback_summary(
        self, intent: str, hours_back: int = 24
    ) -> dict:
        """Get aggregated feedback metrics for an intent over a time window."""
        cutoff = int(time.time()) - (hours_back * 3600)

        response = self.table.query(
            IndexName="gsi1-index",
            KeyConditionExpression="gsi1pk = :pk AND gsi1sk > :cutoff",
            ExpressionAttributeValues={
                ":pk": f"INTENT#{intent}",
                ":cutoff": f"TS#{cutoff}",
            },
        )

        items = response.get("Items", [])
        if not items:
            return {"intent": intent, "total": 0, "negative_rate": 0.0}

        total = len(items)
        negative = sum(1 for i in items if i.get("rating") == "thumbs_down")
        negative_rate = negative / total

        # Aggregate dimension ratings from negative feedback
        dimension_counts: dict[str, list[int]] = {}
        issue_tag_counts: dict[str, int] = {}
        for item in items:
            if item.get("rating") == "thumbs_down":
                for dim, score in item.get("dimension_ratings", {}).items():
                    dimension_counts.setdefault(dim, []).append(int(score))
                for tag in item.get("issue_tags", []):
                    issue_tag_counts[tag] = issue_tag_counts.get(tag, 0) + 1

        dimension_averages = {
            dim: sum(scores) / len(scores)
            for dim, scores in dimension_counts.items()
        }

        return {
            "intent": intent,
            "total": total,
            "positive": total - negative,
            "negative": negative,
            "negative_rate": negative_rate,
            "dimension_averages": dimension_averages,
            "top_issues": sorted(issue_tag_counts.items(), key=lambda x: -x[1])[:5],
            "hours_back": hours_back,
        }
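
The Feedback API box in the collection-layer diagram is a Lambda function in front of this service. Below is a minimal handler sketch; the event shape (API Gateway proxy integration with a JSON body) and the field names are assumptions, not a finalized contract.

_service = FeedbackCollectionService()  # created once per container so AWS clients are reused


def lambda_handler(event, context):
    """Sketch of the POST /feedback handler (API Gateway proxy integration assumed)."""
    body = json.loads(event.get("body") or "{}")
    try:
        feedback = UserFeedback(
            session_id=body.get("session_id", ""),
            turn_id=body.get("turn_id", ""),
            response_id=body.get("response_id", ""),
            rating=FeedbackRating(body.get("rating", "thumbs_up")),
            dimension_ratings={
                FeedbackDimension(dim): int(score)
                for dim, score in body.get("dimension_ratings", {}).items()
            },
            issue_tags=[IssueTag(tag) for tag in body.get("issue_tags", [])],
            correction_text=body.get("correction_text", ""),
            user_comment=body.get("user_comment", ""),
            intent=body.get("intent", ""),
            query=body.get("query", ""),
            response_text=body.get("response_text", ""),
        )
        feedback_id = _service.submit_feedback(feedback)
    except ValueError as exc:
        # Covers unknown enum values and missing required IDs
        return {"statusCode": 400, "body": json.dumps({"error": str(exc)})}
    # Return immediately; analytics and alerting happen downstream of the Kinesis stream
    return {"statusCode": 202, "body": json.dumps({"feedback_id": feedback_id})}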

Negative Feedback Alert Handler

class NegativeFeedbackAlertHandler:
    """Monitors feedback streams and triggers alerts when negative rate spikes.

    MangaAssist thresholds:
    - > 10% negative rate (any intent): quality review alert
    - > 20% negative rate: escalation to engineering lead
    - > 3 hallucination tags in 1 hour: immediate investigation
    """

    ALERT_THRESHOLDS = {
        "quality_review": {"negative_rate": 0.10, "window_hours": 4},
        "engineering_escalation": {"negative_rate": 0.20, "window_hours": 2},
        "hallucination_spike": {"tag_count": 3, "tag": "hallucination", "window_hours": 1},
    }

    def __init__(self, sns_topic_arn: str, feedback_service: FeedbackCollectionService):
        self.sns = boto3.client("sns", region_name="us-east-1")
        self.topic_arn = sns_topic_arn
        self.feedback_service = feedback_service

    def check_and_alert(self, intent: str) -> list[str]:
        """Check all alert thresholds for an intent and fire alerts if needed."""
        fired_alerts = []

        # Check negative rate thresholds
        for alert_name, config in self.ALERT_THRESHOLDS.items():
            if "negative_rate" in config:
                summary = self.feedback_service.get_feedback_summary(
                    intent=intent, hours_back=config["window_hours"]
                )
                if summary["total"] >= 50 and summary["negative_rate"] > config["negative_rate"]:
                    self._fire_alert(alert_name, intent, summary)
                    fired_alerts.append(alert_name)

            elif "tag_count" in config:
                summary = self.feedback_service.get_feedback_summary(
                    intent=intent, hours_back=config["window_hours"]
                )
                tag = config["tag"]
                tag_count = sum(
                    count for t, count in summary.get("top_issues", []) if t == tag
                )
                if tag_count >= config["tag_count"]:
                    self._fire_alert(alert_name, intent, {
                        "tag": tag, "count": tag_count, **summary
                    })
                    fired_alerts.append(alert_name)

        return fired_alerts

    def _fire_alert(self, alert_name: str, intent: str, details: dict) -> None:
        """Publish alert to SNS topic."""
        message = {
            "alert": alert_name,
            "intent": intent,
            "details": details,
            "timestamp": int(time.time()),
            "action_required": (
                "Immediate investigation" if "hallucination" in alert_name
                else "Quality review within 4 hours"
            ),
        }
        self.sns.publish(
            TopicArn=self.topic_arn,
            Subject=f"[MangaAssist] {alert_name}{intent}",
            Message=json.dumps(message, indent=2, default=str),
        )
        logger.warning("Alert fired: %s for intent=%s", alert_name, intent)
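
One way to run these checks is a scheduled sweep (for example, an EventBridge-triggered Lambda every 15 minutes) over the intents being monitored. The intent list and wiring below are assumptions for illustration.

MONITORED_INTENTS = ["recommendation", "product_question", "faq", "order_status"]  # assumed set


def run_scheduled_alert_checks(handler: NegativeFeedbackAlertHandler) -> dict[str, list[str]]:
    """Sweep every monitored intent and return the alerts fired per intent."""
    fired: dict[str, list[str]] = {}
    for intent in MONITORED_INTENTS:
        alerts = handler.check_and_alert(intent)
        if alerts:
            fired[intent] = alerts
    return fired


# Example wiring (the topic ARN is a placeholder):
# handler = NegativeFeedbackAlertHandler(
#     sns_topic_arn="arn:aws:sns:us-east-1:123456789012:manga-quality-alerts",
#     feedback_service=FeedbackCollectionService(),
# )
# run_scheduled_alert_checks(handler)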

Annotation Workflow Manager

import random


class AnnotationWorkflowManager:
    """Manages the human annotation pipeline for MangaAssist response evaluation.

    Sampling strategy:
    - 200 random responses per week (stratified by intent)
    - All thumbs-down responses (typically 50-100/week)
    - Low-confidence model responses (confidence < 0.6)
    - Edge cases: rare intents, new product categories

    Quality control:
    - Each task assigned to 2 annotators
    - Cohen's Kappa >= 0.7 required for acceptance
    - Disagreements go to senior reviewer for adjudication
    """

    def __init__(self, tasks_table: str = "manga_annotation_tasks"):
        self.dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
        self.table = self.dynamodb.Table(tasks_table)

    def create_annotation_batch(
        self,
        responses: list[dict],
        source: str,
        annotator_pool: list[str],
        annotations_per_task: int = 2,
    ) -> list[AnnotationTask]:
        """Create a batch of annotation tasks from response data."""
        tasks = []
        for response_data in responses:
            # Assign annotators: randomly sampled here; a production system would
            # rotate assignments and exclude an annotator's own responses
            assigned = random.sample(
                annotator_pool, min(annotations_per_task, len(annotator_pool))
            )

            task = AnnotationTask(
                response_id=response_data["response_id"],
                query=response_data["query"],
                response_text=response_data["response_text"],
                intent=response_data["intent"],
                conversation_history=response_data.get("conversation_history", []),
                product_context=response_data.get("product_context", {}),
                source=source,
                assigned_annotators=assigned,
            )

            self.table.put_item(Item={
                "pk": f"BATCH#{task.task_id[:8]}",
                "sk": f"TASK#{task.task_id}",
                "response_id": task.response_id,
                "query": task.query,
                "response_text": task.response_text[:5000],
                "intent": task.intent,
                "source": source,
                "assigned_annotators": assigned,
                "status": "pending",
                "annotations": [],
                "created_at": int(task.created_at),
            })
            tasks.append(task)

        logger.info("Created %d annotation tasks from source=%s", len(tasks), source)
        return tasks

    def submit_annotation(self, task_id: str, annotation: AnnotationLabel) -> dict:
        """Submit an annotation for a task and check if task is complete."""
        # Retrieve the task directly by its key (pk=BATCH#<prefix>, sk=TASK#<id>);
        # a query with begins_with on the sort key alone is not a valid DynamoDB key condition
        existing = self.table.get_item(
            Key={"pk": f"BATCH#{task_id[:8]}", "sk": f"TASK#{task_id}"}
        )
        if "Item" not in existing:
            raise ValueError(f"Annotation task {task_id} not found")

        annotation_data = {
            "annotator_id": annotation.annotator_id,
            "quality_ratings": annotation.quality_ratings,
            "correct_response": annotation.correct_response[:5000],
            "issue_tags": [tag.value for tag in annotation.issue_tags],
            "notes": annotation.notes[:1000],
            "time_spent_seconds": annotation.time_spent_seconds,
            "timestamp": int(annotation.timestamp),
        }

        # Append annotation to task
        self.table.update_item(
            Key={"pk": f"BATCH#{task_id[:8]}", "sk": f"TASK#{task_id}"},
            UpdateExpression="SET annotations = list_append(annotations, :ann)",
            ExpressionAttributeValues={":ann": [annotation_data]},
        )

        return {"task_id": task_id, "annotator": annotation.annotator_id, "submitted": True}

    def compute_inter_annotator_agreement(self, task_id: str) -> dict:
        """Compute Cohen's Kappa for a completed task with 2 annotations."""
        response = self.table.get_item(
            Key={"pk": f"BATCH#{task_id[:8]}", "sk": f"TASK#{task_id}"}
        )
        item = response.get("Item", {})
        annotations = item.get("annotations", [])

        if len(annotations) < 2:
            return {"kappa": None, "status": "insufficient_annotations"}

        ann1 = annotations[0]
        ann2 = annotations[1]

        # Compare quality ratings across dimensions
        shared_dims = set(ann1.get("quality_ratings", {}).keys()) & set(
            ann2.get("quality_ratings", {}).keys()
        )
        if not shared_dims:
            return {"kappa": 0.0, "status": "no_shared_dimensions"}

        # Simplified agreement: percentage of dimensions within 1 point
        agreements = 0
        for dim in shared_dims:
            if abs(int(ann1["quality_ratings"][dim]) - int(ann2["quality_ratings"][dim])) <= 1:
                agreements += 1

        agreement_rate = agreements / len(shared_dims)

        # Rough Cohen's Kappa approximation. 0.20 is the chance of an exact match on a
        # 5-point scale; the true chance of agreeing within 1 point is higher (~0.52),
        # so this heuristic overstates kappa. Use a proper implementation
        # (e.g., sklearn.metrics.cohen_kappa_score) in production.
        p_observed = agreement_rate
        p_chance = 0.20
        kappa = (p_observed - p_chance) / (1 - p_chance) if p_chance < 1 else 0

        needs_adjudication = kappa < 0.70

        return {
            "task_id": task_id,
            "kappa": round(kappa, 3),
            "agreement_rate": round(agreement_rate, 3),
            "shared_dimensions": len(shared_dims),
            "needs_adjudication": needs_adjudication,
            "status": "adjudication_needed" if needs_adjudication else "accepted",
        }

    def build_golden_dataset(self, batch_prefix: str) -> list[dict]:
        """Export accepted annotations as golden dataset entries for evaluation."""
        response = self.table.query(
            KeyConditionExpression="pk = :pk",
            ExpressionAttributeValues={":pk": f"BATCH#{batch_prefix}"},
        )

        golden_entries = []
        for item in response.get("Items", []):
            if item.get("status") != "completed":
                continue
            annotations = item.get("annotations", [])
            if not annotations:
                continue

            # Use the gold-standard response from the most experienced annotator
            best_annotation = annotations[0]  # In production, rank by annotator seniority
            for ann in annotations:
                if ann.get("correct_response"):
                    best_annotation = ann
                    break

            golden_entries.append({
                "query": item["query"],
                "intent": item["intent"],
                "expected_response": best_annotation.get("correct_response", ""),
                "quality_ratings": best_annotation.get("quality_ratings", {}),
                "issue_tags": best_annotation.get("issue_tags", []),
                "source": item.get("source", "annotation"),
            })

        logger.info("Built %d golden dataset entries from batch %s", len(golden_entries), batch_prefix)
        return golden_entries
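
To make the sampling strategy from the class docstring concrete, a weekly driver might pull each sample set from the analytics layer and enqueue it as its own batch. The fetch_responses callable and the per-source limits below are placeholders for whatever Redshift query layer the team uses.

def run_weekly_annotation_batches(
    manager: AnnotationWorkflowManager,
    annotator_pool: list[str],
    fetch_responses,  # callable(source: str, limit: int) -> list[dict]; placeholder query layer
) -> dict[str, int]:
    """Create one annotation batch per sampling source, mirroring the docstring strategy."""
    plan = {
        "random": 200,              # stratified random sample per week
        "negative_feedback": 100,   # all thumbs-down responses (typically 50-100/week)
        "low_confidence": 150,      # responses below the model-confidence threshold
        "edge_case": 50,            # rare intents / new product categories
    }
    created = {}
    for source, limit in plan.items():
        responses = fetch_responses(source, limit)
        tasks = manager.create_annotation_batch(
            responses=responses,
            source=source,
            annotator_pool=annotator_pool,
            annotations_per_task=2,
        )
        created[source] = len(tasks)
    return created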

MangaAssist Scenarios

Scenario A: Feedback Reveals "Already Read" Blind Spot in Recommendations

Context: Automated quality scores for the recommendation intent averaged 0.84 — well above the 0.75 quality gate. But user feedback told a different story.

What Happened: The weekly feedback summary for the recommendation intent showed:

  • Total feedback: 2,400 (3.2% of recommendation responses)
  • Negative rate: 18% (vs. 8% for other intents)
  • Top issue tag: irrelevant (62% of thumbs-down)
  • Top correction pattern: "I already read this" appeared in 43% of correction texts

How Caught: The NegativeFeedbackAlertHandler fired a quality_review alert when recommendation negative rate exceeded 10% for 3 consecutive days. The team analyzed the free-text corrections and found the pattern.

Root Cause: The recommendation engine used content-based similarity (genres, authors, themes) but did not exclude manga the user had previously viewed or purchased. The automated evaluator scored these as "relevant" because the genres matched.

Fix: Added a post-filter step in the recommendation pipeline that excludes ASINs the user had viewed/purchased (from DynamoDB session data and order history API). Re-evaluated: negative rate dropped from 18% to 7%.
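
A minimal sketch of such a post-filter, assuming the recommendation items carry an "asin" field and that helper lookups return the user's previously viewed and purchased ASINs (hypothetical names):

def filter_already_consumed(
    recommendations: list[dict],
    viewed_asins: set[str],
    purchased_asins: set[str],
) -> list[dict]:
    """Drop recommendations the user has already viewed or purchased.

    The viewed/purchased sets would come from DynamoDB session data and the
    order-history API, as described in the fix above.
    """
    consumed = viewed_asins | purchased_asins
    return [rec for rec in recommendations if rec.get("asin") not in consumed]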

Metric Signal: MangaAssist/Feedback.NegativeRate for recommendation intent, MangaAssist/Feedback.IssueTagCount for irrelevant tag.

Scenario B: Annotation Workflow Discovers Systematic Hallucination Pattern

Context: The weekly annotation batch included 80 responses flagged as thumbs-down by users. Two annotators independently reviewed each response.

What Happened:

  • Inter-annotator agreement (Cohen's Kappa): 0.82 (strong agreement)
  • Both annotators tagged 12 responses as hallucination
  • All 12 hallucinations were in the product_question intent
  • Pattern: the model invented release dates for upcoming manga volumes
  • Example: "One Piece Volume 108 releases on March 15, 2025" — no such date existed in the product catalog
  • The RAG pipeline retrieved the product page, but the page had no release date field → the model generated a plausible-sounding date

How Caught: The annotation workflow's golden dataset export included 12 entries tagged hallucination. These were added to the evaluation test suite as regression tests. The Skill 5.1.1 FactualAccuracyChecker was updated to specifically flag date claims not grounded in the RAG context.

Fix: Added a guardrail: if the user asks about a release date and the RAG context has no release_date field, respond with "Release date not yet announced" instead of generating one. Also added "date_claim" as a tracked entity type in the hallucination detector.
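
A hedged sketch of that guardrail; the "release_date" field name and the shape of the RAG product context are assumptions:

def guard_release_date(product_context: dict) -> str:
    """Guardrail for release-date questions: only answer from grounded data.

    Called when intent classification detects a release-date question. The
    "release_date" field name in the retrieved product context is assumed.
    """
    release_date = product_context.get("release_date")
    if release_date:
        return f"The release date is {release_date}."
    # No grounded date in the retrieved context: never let the model invent one.
    return "The release date has not been announced yet."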

Scenario C: Rating Fatigue — Optimizing the Feedback Widget

Context: Initially, the feedback widget asked for thumbs up/down on every response. Feedback collection rate was 4.1%, but the team noticed it was dropping month-over-month.

What Happened:

  • Month 1: 4.1% feedback rate
  • Month 2: 3.2% feedback rate
  • Month 3: 2.5% feedback rate
  • Of users who gave feedback, 92% only used thumbs up/down — never the dimension ratings or correction text

Root Cause: Feedback fatigue. Users were shown the widget on every turn, including turns where no feedback was useful (e.g., initial greeting, follow-up clarifications). The dimension rating modal felt like a survey.

Fix: Changed the feedback strategy:

  1. Only show thumbs up/down on substantive responses (skip greetings, clarifications, system messages)
  2. Rate-limit to a maximum of 3 feedback prompts per session (sketched below)
  3. Show dimension feedback only to users already in a thumbs-down flow (not proactively)
  4. Added a "Was this helpful?" single-question variant for mobile users
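
A sketch of the widget-gating logic behind items 1 and 2; the turn-type labels and the per-session counter are illustrative assumptions:

SKIP_TURN_TYPES = {"greeting", "clarification", "system_message"}  # assumed turn-type labels
MAX_PROMPTS_PER_SESSION = 3


def should_show_feedback_widget(turn_type: str, prompts_shown_this_session: int) -> bool:
    """Gate the thumbs up/down widget: substantive responses only, capped per session."""
    if turn_type in SKIP_TURN_TYPES:
        return False
    return prompts_shown_this_session < MAX_PROMPTS_PER_SESSION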

Result: Feedback rate stabilized at 3.5% with higher quality signals — correction text usage increased from 8% to 19% of negative feedback.

Scenario D: Annotator Disagreement Reveals Ambiguous Quality Standard

Context: An annotation batch for faq intent had unusually low inter-annotator agreement (Kappa = 0.52).

What Happened:

  • Annotator A consistently rated responses as 4/5 quality
  • Annotator B rated the same responses as 2/5
  • The disagreement centered on response length:
    • Q: "What is your return policy?"
    • Response: 3-sentence summary of the return policy
    • Annotator A: 4/5 (concise, correct, helpful)
    • Annotator B: 2/5 (missing details about international returns, missing exceptions)

Root Cause: No clear annotation guideline for "completeness." Annotator A valued conciseness; Annotator B valued exhaustiveness. Both were valid perspectives, but without a standard, their ratings diverged.

Fix: The team created an annotation rubric specific to each intent:

  • faq: "Complete" means the response covers the most common interpretation. Edge cases (international returns) should be mentioned if directly asked, but are not required in general FAQ responses.
  • Published the rubric to all annotators. Re-annotated the batch with the rubric — Kappa improved to 0.78.


Intuition Gained

Automated Metrics and User Feedback Measure Different Things

Automated quality scores tell you if the response is technically correct. User feedback tells you if the response is useful. A response can be factually accurate (high automated score) but unhelpful (high negative feedback) because it does not address what the user actually needs. Both signals are necessary for a complete evaluation picture.

Feedback UI Design Directly Impacts Signal Quality

A feedback widget that appears on every response causes fatigue and low-quality ratings. A widget that only appears on substantive responses, with progressive disclosure (thumbs first, then dimensions, then free text), yields fewer but more informative data points. Optimize for signal quality, not volume.

Annotation Workflows Need Explicit Rubrics

Without a shared rubric, annotators bring different mental models of "quality." This creates noisy labels that are worse than no labels. Before starting annotation, define what "good" means for each intent type, with examples of 1/5 through 5/5 responses. Calibrate annotators on 20 shared examples before independent work.

