# 03: User-Centered Evaluation Systems

## AIP-C01 Mapping
- Content Domain 5: Testing, Validation, and Troubleshooting
- Task 5.1: Implement evaluation systems for GenAI
- Skill 5.1.3: Determine adequate end user-centered evaluation systems (for example, user feedback interfaces, rating systems, and annotation workflows) and implement them.
## User Story

As a MangaAssist product manager, I want to capture structured feedback from users about chatbot response quality — including thumbs up/down, dimension-specific ratings, and free-text corrections — so that we can identify quality gaps that automated metrics miss, build labeled datasets for future evaluation, and prioritize improvements based on real user pain.
## Acceptance Criteria
- Every chatbot response includes a non-intrusive thumbs up/down widget
- Users who rate thumbs-down are prompted with optional dimension-specific feedback (relevance, accuracy, helpfulness)
- Free-text correction field allows users to say what the correct answer should have been
- Feedback is collected asynchronously — no impact to response latency
- Feedback data flows to DynamoDB → Kinesis → Redshift analytics pipeline
- Annotation workflow allows human reviewers to label batches of response pairs
- Weekly feedback summary reports are auto-generated and surfaced in dashboards
- Feedback loop: responses with >10% negative feedback trigger automatic quality review
## Why User Feedback Completes the Evaluation Loop
Automated metrics (Skill 5.1.1) measure technical quality. Model evaluations (Skill 5.1.2) compare configurations. But neither tells you whether the user perceived the response as helpful.
Consider this MangaAssist response:

> "Based on your interest in One Piece, you might enjoy Naruto (adventure/action), Bleach (supernatural battles), and Fairy Tail (friendship themes)."
- Automated quality score: 0.88 (relevant, factual, fluent)
- User feedback: 👎 "I've already read all of these. I wanted something new."
The automated system sees a high-quality response. The user sees a useless one. User-centered evaluation captures this gap — the "already consumed" problem that requires personalization beyond content similarity.
| Signal Source | What It Captures | What It Misses |
|---|---|---|
| Automated Metrics | Technical correctness, consistency, fluency | User satisfaction, perceived helpfulness |
| A/B Test Business Metrics | Aggregate behavior shifts | Individual failure modes |
| User Feedback | Subjective experience, edge cases, expectations | Only from users who bother to rate |
| Annotation Workflows | Expert quality calibration, training labels | Expensive, slow, expert bias |
All four signals are needed for a complete evaluation picture.
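The gap between automated and user-perceived quality can itself be monitored. A minimal sketch, with the input shape assumed and thresholds mirroring the 0.75 quality gate and 10% alert threshold used in this doc:

```python
def find_perception_gaps(intent_stats: list[dict],
                         min_auto_score: float = 0.75,
                         max_negative_rate: float = 0.10) -> list[str]:
    """Flag intents that pass automated quality gates but fail users.

    Each entry: {"intent": str, "auto_score": float, "negative_rate": float}.
    """
    return [
        e["intent"] for e in intent_stats
        if e["auto_score"] >= min_auto_score and e["negative_rate"] > max_negative_rate
    ]


gaps = find_perception_gaps([
    {"intent": "recommendation", "auto_score": 0.84, "negative_rate": 0.18},
    {"intent": "faq", "auto_score": 0.81, "negative_rate": 0.06},
])
# gaps == ["recommendation"]: high automated score, high user dissatisfaction
```

Intents on this list are exactly the ones where automated metrics are blind, so they are the best candidates for annotation review.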
## High-Level Design

### Feedback Collection Architecture

```mermaid
graph TD
    subgraph "User Interface Layer"
        A[Chatbot Response] --> B[Feedback Widget<br>👍 / 👎]
        B -->|Thumbs Down| C[Dimension Feedback Modal<br>Relevance / Accuracy / Helpfulness]
        C --> D[Free-text Correction<br>Optional]
        B -->|Thumbs Up| E[Positive Signal Recorded]
    end
    subgraph "Collection Layer"
        D --> F[Feedback API<br>Lambda Function]
        E --> F
        F --> G[DynamoDB<br>manga_user_feedback]
        F --> H[Kinesis Data Stream<br>feedback-events]
    end
    subgraph "Analytics Layer"
        H --> I[Kinesis Firehose]
        I --> J[S3 Feedback Lake]
        J --> K[Redshift Spectrum]
        K --> L[Feedback Dashboard]
        K --> M[Weekly Summary Report]
    end
    subgraph "Action Layer"
        K --> N{Negative Rate<br>> 10%?}
        N -->|Yes| O[Quality Review Trigger<br>SNS → Team Alert]
        N -->|No| P[Continue Monitoring]
        K --> Q[Labeled Dataset Builder<br>for Fine-tuning / Evaluation]
    end
```
### Annotation Workflow Architecture

```mermaid
graph LR
    subgraph "Data Selection"
        A[Response Pool<br>All responses in Redshift] --> B[Sampling Strategy]
        B --> C1[Random Sample<br>200/week]
        B --> C2[Negative Feedback Sample<br>All thumbs-down]
        B --> C3[Low-Confidence Sample<br>Model uncertainty > 0.5]
        B --> C4[Edge Case Sample<br>New intents / rare products]
    end
    subgraph "Annotation Platform"
        C1 --> D[Annotation Queue]
        C2 --> D
        C3 --> D
        C4 --> D
        D --> E[Annotator Interface]
        E --> F[Quality Rating<br>1-5 per dimension]
        E --> G[Correct Response<br>Gold-standard answer]
        E --> H[Issue Tags<br>hallucination / irrelevant / off-brand]
    end
    subgraph "Quality Control"
        F --> I[Inter-Annotator Agreement<br>Cohen's Kappa]
        G --> I
        H --> I
        I --> J{Kappa >= 0.7?}
        J -->|Yes| K[Accept Labels<br>Add to Golden Dataset]
        J -->|No| L[Adjudication<br>Senior Reviewer]
        L --> K
    end
    subgraph "Outputs"
        K --> M[Golden Test Dataset<br>for Skill 5.1.1 evaluator]
        K --> N[Fine-tuning Data<br>for model improvement]
        K --> O[Quality Trend Reports]
    end
```
### Feedback-to-Action Pipeline

```mermaid
sequenceDiagram
    participant User as Customer
    participant Widget as Feedback Widget
    participant API as Feedback Lambda
    participant DB as DynamoDB
    participant Stream as Kinesis
    participant Analytics as Redshift
    participant Alert as Quality Alert
    User->>Widget: 👎 on recommendation response
    Widget->>Widget: Show dimension modal
    User->>Widget: "Not relevant" + "I wanted seinen manga"
    Widget->>API: POST /feedback (async, non-blocking)
    API->>DB: Store feedback record
    API->>Stream: Publish feedback event
    Note over Stream,Analytics: Batch processing every 5 minutes
    Stream->>Analytics: Aggregate feedback metrics
    Analytics->>Analytics: Compute negative rate per intent
    alt Negative rate > 10% for recommendation intent
        Analytics->>Alert: Trigger quality review
        Alert->>Alert: Create JIRA ticket + Slack notification
    end
    Note over Analytics: Weekly batch
    Analytics->>Analytics: Build labeled dataset from feedback
    Analytics->>Analytics: Generate weekly summary report
```
## Low-Level Design

### Feedback Data Model

```python
from dataclasses import dataclass, field
from enum import Enum
import time
import uuid


class FeedbackRating(Enum):
    THUMBS_UP = "thumbs_up"
    THUMBS_DOWN = "thumbs_down"


class FeedbackDimension(Enum):
    RELEVANCE = "relevance"
    ACCURACY = "accuracy"
    HELPFULNESS = "helpfulness"
    TONE = "tone"
    COMPLETENESS = "completeness"


class IssueTag(Enum):
    HALLUCINATION = "hallucination"
    IRRELEVANT = "irrelevant"
    OFF_BRAND = "off_brand"
    OUTDATED = "outdated"
    WRONG_PRODUCT = "wrong_product"
    TOO_VERBOSE = "too_verbose"
    TOO_BRIEF = "too_brief"
    REPEATED_CONTENT = "repeated_content"


@dataclass
class UserFeedback:
    """A single feedback submission from a MangaAssist user."""
    feedback_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    session_id: str = ""
    turn_id: str = ""
    response_id: str = ""
    rating: FeedbackRating = FeedbackRating.THUMBS_UP
    dimension_ratings: dict[FeedbackDimension, int] = field(default_factory=dict)  # 1-5 scale
    issue_tags: list[IssueTag] = field(default_factory=list)
    correction_text: str = ""  # What the user thinks the correct answer is
    user_comment: str = ""  # Optional free-text comment
    intent: str = ""
    query: str = ""
    response_text: str = ""
    page_context: dict = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)


@dataclass
class AnnotationTask:
    """A task for a human annotator to review a chatbot response."""
    task_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    response_id: str = ""
    query: str = ""
    response_text: str = ""
    intent: str = ""
    conversation_history: list[dict] = field(default_factory=list)
    product_context: dict = field(default_factory=dict)
    source: str = ""  # "random", "negative_feedback", "low_confidence", "edge_case"
    assigned_annotators: list[str] = field(default_factory=list)
    annotations: list[dict] = field(default_factory=list)
    status: str = "pending"  # pending, in_progress, completed, adjudication
    created_at: float = field(default_factory=time.time)


@dataclass
class AnnotationLabel:
    """A single annotation from a human reviewer."""
    annotator_id: str = ""
    task_id: str = ""
    quality_ratings: dict[str, int] = field(default_factory=dict)  # dimension -> 1-5
    correct_response: str = ""  # Gold-standard answer
    issue_tags: list[IssueTag] = field(default_factory=list)
    notes: str = ""
    time_spent_seconds: int = 0
    timestamp: float = field(default_factory=time.time)
```
### Feedback Collection Service

```python
import json
import logging
import time

import boto3

logger = logging.getLogger(__name__)


class FeedbackCollectionService:
    """Collects, stores, and streams user feedback for MangaAssist responses.

    Design decisions:
    - Async collection: feedback API returns immediately, processing is async
    - DynamoDB for hot storage (recent 30 days), Redshift for analytics
    - Kinesis stream enables real-time alerting on negative feedback spikes
    """

    def __init__(
        self,
        feedback_table: str = "manga_user_feedback",
        kinesis_stream: str = "manga-feedback-events",
    ):
        self.dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
        self.table = self.dynamodb.Table(feedback_table)
        self.kinesis = boto3.client("kinesis", region_name="us-east-1")
        self.stream_name = kinesis_stream

    def submit_feedback(self, feedback: UserFeedback) -> str:
        """Store feedback in DynamoDB and publish to Kinesis stream."""
        # Validate — prevent empty or malformed feedback
        if not feedback.session_id or not feedback.response_id:
            raise ValueError("session_id and response_id are required")

        # Sanitize user input — limit lengths to prevent abuse
        feedback.correction_text = feedback.correction_text[:2000]
        feedback.user_comment = feedback.user_comment[:1000]

        # Store in DynamoDB
        item = {
            "pk": f"SESSION#{feedback.session_id}",
            "sk": f"FEEDBACK#{feedback.feedback_id}",
            "gsi1pk": f"INTENT#{feedback.intent}",
            "gsi1sk": f"TS#{int(feedback.timestamp)}",
            "rating": feedback.rating.value,
            "dimension_ratings": {
                dim.value: score for dim, score in feedback.dimension_ratings.items()
            },
            "issue_tags": [tag.value for tag in feedback.issue_tags],
            "correction_text": feedback.correction_text,
            "user_comment": feedback.user_comment,
            "intent": feedback.intent,
            "query": feedback.query,
            "response_text": feedback.response_text[:5000],  # Truncate for storage
            "response_id": feedback.response_id,
            "timestamp": int(feedback.timestamp),
            "ttl": int(feedback.timestamp) + (30 * 86400),  # 30-day TTL
        }
        self.table.put_item(Item=item)

        # Publish to Kinesis for real-time processing
        event = {
            "event_type": "user_feedback",
            "feedback_id": feedback.feedback_id,
            "session_id": feedback.session_id,
            "rating": feedback.rating.value,
            "intent": feedback.intent,
            "has_correction": bool(feedback.correction_text),
            "issue_tags": [tag.value for tag in feedback.issue_tags],
            "timestamp": int(feedback.timestamp),
        }
        self.kinesis.put_record(
            StreamName=self.stream_name,
            Data=json.dumps(event).encode("utf-8"),
            PartitionKey=feedback.session_id,
        )
        logger.info(
            "Feedback %s recorded: %s for intent=%s",
            feedback.feedback_id, feedback.rating.value, feedback.intent,
        )
        return feedback.feedback_id

    def get_feedback_summary(self, intent: str, hours_back: int = 24) -> dict:
        """Get aggregated feedback metrics for an intent over a time window."""
        cutoff = int(time.time()) - (hours_back * 3600)
        response = self.table.query(
            IndexName="gsi1-index",
            KeyConditionExpression="gsi1pk = :pk AND gsi1sk > :cutoff",
            ExpressionAttributeValues={
                ":pk": f"INTENT#{intent}",
                ":cutoff": f"TS#{cutoff}",
            },
        )
        items = response.get("Items", [])
        if not items:
            return {"intent": intent, "total": 0, "negative_rate": 0.0}

        total = len(items)
        negative = sum(1 for i in items if i.get("rating") == "thumbs_down")
        negative_rate = negative / total

        # Aggregate dimension ratings from negative feedback
        dimension_counts: dict[str, list[int]] = {}
        issue_tag_counts: dict[str, int] = {}
        for item in items:
            if item.get("rating") == "thumbs_down":
                for dim, score in item.get("dimension_ratings", {}).items():
                    dimension_counts.setdefault(dim, []).append(int(score))
                for tag in item.get("issue_tags", []):
                    issue_tag_counts[tag] = issue_tag_counts.get(tag, 0) + 1
        dimension_averages = {
            dim: sum(scores) / len(scores)
            for dim, scores in dimension_counts.items()
        }
        return {
            "intent": intent,
            "total": total,
            "positive": total - negative,
            "negative": negative,
            "negative_rate": negative_rate,
            "dimension_averages": dimension_averages,
            "top_issues": sorted(issue_tag_counts.items(), key=lambda x: -x[1])[:5],
            "hours_back": hours_back,
        }
```
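The "async collection" design decision implies a thin API front door: validate, hand off, return immediately. A hypothetical handler sketch with the storage call stubbed out; the real one would invoke `FeedbackCollectionService.submit_feedback`:

```python
import json


def feedback_handler(event: dict, context: object) -> dict:
    """API Gateway Lambda sketch: validate the payload and return 202 fast.

    Storage and streaming are stubbed here; in the real handler they run at
    the marked line, with failures logged but never surfaced, so a feedback
    outage can never degrade the chat experience itself.
    """
    try:
        body = json.loads(event.get("body") or "{}")
    except json.JSONDecodeError:
        return {"statusCode": 400, "body": json.dumps({"error": "invalid JSON"})}

    if not body.get("session_id") or not body.get("response_id"):
        return {"statusCode": 400,
                "body": json.dumps({"error": "session_id and response_id are required"})}

    # FeedbackCollectionService.submit_feedback(...) would be called here
    return {"statusCode": 202, "body": json.dumps({"accepted": True})}
```

Returning 202 Accepted rather than 200 makes the contract explicit: the feedback is queued, not yet durably processed.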
### Negative Feedback Alert Handler

```python
class NegativeFeedbackAlertHandler:
    """Monitors feedback streams and triggers alerts when negative rate spikes.

    MangaAssist thresholds:
    - > 10% negative rate (any intent): quality review alert
    - > 20% negative rate: escalation to engineering lead
    - >= 3 hallucination tags in 1 hour: immediate investigation
    """

    ALERT_THRESHOLDS = {
        "quality_review": {"negative_rate": 0.10, "window_hours": 4},
        "engineering_escalation": {"negative_rate": 0.20, "window_hours": 2},
        "hallucination_spike": {"tag_count": 3, "tag": "hallucination", "window_hours": 1},
    }

    def __init__(self, sns_topic_arn: str, feedback_service: FeedbackCollectionService):
        self.sns = boto3.client("sns", region_name="us-east-1")
        self.topic_arn = sns_topic_arn
        self.feedback_service = feedback_service

    def check_and_alert(self, intent: str) -> list[str]:
        """Check all alert thresholds for an intent and fire alerts if needed."""
        fired_alerts = []
        for alert_name, config in self.ALERT_THRESHOLDS.items():
            summary = self.feedback_service.get_feedback_summary(
                intent=intent, hours_back=config["window_hours"]
            )
            if "negative_rate" in config:
                # Require a minimum sample so a bad hour on low-traffic
                # intents does not page anyone
                if summary["total"] >= 50 and summary["negative_rate"] > config["negative_rate"]:
                    self._fire_alert(alert_name, intent, summary)
                    fired_alerts.append(alert_name)
            elif "tag_count" in config:
                tag = config["tag"]
                tag_count = sum(
                    count for t, count in summary.get("top_issues", []) if t == tag
                )
                if tag_count >= config["tag_count"]:
                    self._fire_alert(alert_name, intent, {
                        "tag": tag, "count": tag_count, **summary
                    })
                    fired_alerts.append(alert_name)
        return fired_alerts

    def _fire_alert(self, alert_name: str, intent: str, details: dict) -> None:
        """Publish alert to SNS topic."""
        message = {
            "alert": alert_name,
            "intent": intent,
            "details": details,
            "timestamp": int(time.time()),
            "action_required": (
                "Immediate investigation" if "hallucination" in alert_name
                else "Quality review within 4 hours"
            ),
        }
        self.sns.publish(
            TopicArn=self.topic_arn,
            Subject=f"[MangaAssist] {alert_name} — {intent}",
            Message=json.dumps(message, indent=2, default=str),
        )
        logger.warning("Alert fired: %s for intent=%s", alert_name, intent)
```
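The tier logic reduces to a pure function over a feedback summary, which makes the thresholds easy to unit-test without AWS dependencies. A condensed sketch (constants mirror ALERT_THRESHOLDS; the per-tier time windows and the 50-feedback minimum sample are kept, window fetching omitted):

```python
def alert_tiers(negative_rate: float, total: int,
                hallucination_count: int, min_sample: int = 50) -> list[str]:
    """Return the alert tiers a feedback summary should fire (sketch)."""
    fired = []
    if total >= min_sample:
        if negative_rate > 0.20:
            fired.append("engineering_escalation")
        if negative_rate > 0.10:
            fired.append("quality_review")
    if hallucination_count >= 3:
        fired.append("hallucination_spike")
    return fired
```

Note that a 25% negative rate fires both rate tiers, matching the class behavior where each threshold is checked independently.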
### Annotation Workflow Manager

```python
import random


class AnnotationWorkflowManager:
    """Manages the human annotation pipeline for MangaAssist response evaluation.

    Sampling strategy:
    - 200 random responses per week (stratified by intent)
    - All thumbs-down responses (typically 50-100/week)
    - Low-confidence model responses (confidence < 0.6)
    - Edge cases: rare intents, new product categories

    Quality control:
    - Each task assigned to 2 annotators
    - Cohen's Kappa >= 0.7 required for acceptance
    - Disagreements go to senior reviewer for adjudication
    """

    def __init__(self, tasks_table: str = "manga_annotation_tasks"):
        self.dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
        self.table = self.dynamodb.Table(tasks_table)

    def create_annotation_batch(
        self,
        responses: list[dict],
        source: str,
        annotator_pool: list[str],
        annotations_per_task: int = 2,
    ) -> list[AnnotationTask]:
        """Create a batch of annotation tasks from response data."""
        tasks = []
        for response_data in responses:
            # Assign annotators (random sample from the pool; production code
            # should round-robin for load balance and exclude self-review)
            assigned = random.sample(
                annotator_pool, min(annotations_per_task, len(annotator_pool))
            )
            task = AnnotationTask(
                response_id=response_data["response_id"],
                query=response_data["query"],
                response_text=response_data["response_text"],
                intent=response_data["intent"],
                conversation_history=response_data.get("conversation_history", []),
                product_context=response_data.get("product_context", {}),
                source=source,
                assigned_annotators=assigned,
            )
            # NOTE: the partition key derives from task_id, so each task gets
            # its own partition; a production schema would key all tasks in a
            # batch under a shared batch_id so batch queries return the set
            self.table.put_item(Item={
                "pk": f"BATCH#{task.task_id[:8]}",
                "sk": f"TASK#{task.task_id}",
                "response_id": task.response_id,
                "query": task.query,
                "response_text": task.response_text[:5000],
                "intent": task.intent,
                "source": source,
                "assigned_annotators": assigned,
                "status": "pending",
                "annotations": [],
                "created_at": int(task.created_at),
            })
            tasks.append(task)
        logger.info("Created %d annotation tasks from source=%s", len(tasks), source)
        return tasks

    def submit_annotation(self, task_id: str, annotation: AnnotationLabel) -> dict:
        """Submit an annotation for a task and append it to the task record."""
        annotation_data = {
            "annotator_id": annotation.annotator_id,
            "quality_ratings": annotation.quality_ratings,
            "correct_response": annotation.correct_response[:5000],
            "issue_tags": [tag.value for tag in annotation.issue_tags],
            "notes": annotation.notes[:1000],
            "time_spent_seconds": annotation.time_spent_seconds,
            "timestamp": int(annotation.timestamp),
        }
        # The key structure is known (pk derives from task_id), so the item
        # can be updated directly without a lookup query
        self.table.update_item(
            Key={"pk": f"BATCH#{task_id[:8]}", "sk": f"TASK#{task_id}"},
            UpdateExpression="SET annotations = list_append(annotations, :ann)",
            ExpressionAttributeValues={":ann": [annotation_data]},
        )
        return {"task_id": task_id, "annotator": annotation.annotator_id, "submitted": True}

    def compute_inter_annotator_agreement(self, task_id: str) -> dict:
        """Compute an agreement score for a completed task with 2 annotations."""
        response = self.table.get_item(
            Key={"pk": f"BATCH#{task_id[:8]}", "sk": f"TASK#{task_id}"}
        )
        item = response.get("Item", {})
        annotations = item.get("annotations", [])
        if len(annotations) < 2:
            return {"kappa": None, "status": "insufficient_annotations"}

        ann1, ann2 = annotations[0], annotations[1]
        # Compare quality ratings across dimensions both annotators rated
        shared_dims = set(ann1.get("quality_ratings", {}).keys()) & set(
            ann2.get("quality_ratings", {}).keys()
        )
        if not shared_dims:
            return {"kappa": 0.0, "status": "no_shared_dimensions"}

        # Simplified agreement: percentage of dimensions within 1 point
        agreements = 0
        for dim in shared_dims:
            if abs(int(ann1["quality_ratings"][dim]) - int(ann2["quality_ratings"][dim])) <= 1:
                agreements += 1
        agreement_rate = agreements / len(shared_dims)

        # Fixed-chance approximation of Cohen's Kappa: 13 of the 25 possible
        # rating pairs on a 5-point scale lie within 1 point, so chance
        # agreement under uniform random rating is 13/25 = 0.52
        p_observed = agreement_rate
        p_chance = 0.52
        kappa = (p_observed - p_chance) / (1 - p_chance)

        needs_adjudication = kappa < 0.70
        return {
            "task_id": task_id,
            "kappa": round(kappa, 3),
            "agreement_rate": round(agreement_rate, 3),
            "shared_dimensions": len(shared_dims),
            "needs_adjudication": needs_adjudication,
            "status": "adjudication_needed" if needs_adjudication else "accepted",
        }

    def build_golden_dataset(self, batch_prefix: str) -> list[dict]:
        """Export accepted annotations as golden dataset entries for evaluation.

        With the partition scheme above, batch_prefix is a task_id prefix;
        a shared batch_id key would let this query return a full batch.
        """
        response = self.table.query(
            KeyConditionExpression="pk = :pk",
            ExpressionAttributeValues={":pk": f"BATCH#{batch_prefix}"},
        )
        golden_entries = []
        for item in response.get("Items", []):
            # Tasks are flipped to "completed" elsewhere once all assigned
            # annotations have arrived and agreement passes
            if item.get("status") != "completed":
                continue
            annotations = item.get("annotations", [])
            if not annotations:
                continue
            # Prefer the first annotation that supplies a gold-standard
            # response (production code would rank by annotator seniority)
            best_annotation = annotations[0]
            for ann in annotations:
                if ann.get("correct_response"):
                    best_annotation = ann
                    break
            golden_entries.append({
                "query": item["query"],
                "intent": item["intent"],
                "expected_response": best_annotation.get("correct_response", ""),
                "quality_ratings": best_annotation.get("quality_ratings", {}),
                "issue_tags": best_annotation.get("issue_tags", []),
                "source": item.get("source", "annotation"),
            })
        logger.info("Built %d golden dataset entries from batch %s", len(golden_entries), batch_prefix)
        return golden_entries
```
## MangaAssist Scenarios

### Scenario A: Feedback Reveals "Already Read" Blind Spot in Recommendations

Context: Automated quality scores for the recommendation intent averaged 0.84 — well above the 0.75 quality gate. But user feedback told a different story.

What Happened:
- Weekly feedback summary for recommendation intent:
  - Total feedback: 2,400 (3.2% of recommendation responses)
  - Negative rate: 18% (vs. 8% for other intents)
  - Top issue tag: irrelevant (62% of thumbs-down)
  - Top correction pattern: "I already read this" appeared in 43% of correction texts

How Caught: The NegativeFeedbackAlertHandler fired a quality_review alert when the recommendation negative rate exceeded 10% for 3 consecutive days. The team analyzed the free-text corrections and found the pattern.

Root Cause: The recommendation engine used content-based similarity (genres, authors, themes) but did not exclude manga the user had previously viewed or purchased. The automated evaluator scored these as "relevant" because the genres matched.

Fix: Added a post-filter step in the recommendation pipeline that excludes ASINs the user has viewed or purchased (from DynamoDB session data and the order-history API). Re-evaluated: the negative rate dropped from 18% to 7%.

Metric Signal: MangaAssist/Feedback.NegativeRate for the recommendation intent, MangaAssist/Feedback.IssueTagCount for the irrelevant tag.
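The fix is a small post-filter at the end of the recommendation pipeline. A sketch, assuming candidates carry an `asin` key and the exclusion sets come from session data and the order-history API:

```python
def exclude_already_consumed(candidates: list[dict],
                             viewed_asins: set[str],
                             purchased_asins: set[str]) -> list[dict]:
    """Drop recommendations the user has already viewed or purchased."""
    consumed = viewed_asins | purchased_asins
    return [c for c in candidates if c["asin"] not in consumed]


recs = exclude_already_consumed(
    [{"asin": "B001", "title": "Naruto Vol. 1"},
     {"asin": "B002", "title": "Vinland Saga Vol. 1"}],
    viewed_asins={"B001"},
    purchased_asins=set(),
)
# recs keeps only the unseen title
```

Filtering after ranking keeps the similarity model untouched; the engine still scores everything, and personalization is applied as a cheap set-membership check.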
### Scenario B: Annotation Workflow Discovers Systematic Hallucination Pattern
Context: The weekly annotation batch included 80 responses flagged as thumbs-down by users. Two annotators independently reviewed each response.
What Happened:
- Inter-annotator agreement (Cohen's Kappa): 0.82 (strong agreement)
- Both annotators tagged 12 responses as hallucination
- All 12 hallucinations were in the product_question intent
- Pattern: the model invented release dates for upcoming manga volumes
- Example: "One Piece Volume 108 releases on March 15, 2025" — no such date existed in the product catalog
- The RAG pipeline retrieved the product page, but the page had no release date field → the model generated a plausible-sounding date
How Caught: The annotation workflow's golden dataset export included 12 entries tagged hallucination. These were added to the evaluation test suite as regression tests. The Skill 5.1.1 FactualAccuracyChecker was updated to specifically flag date claims not grounded in the RAG context.
Fix: Added a guardrail: if the user asks about a release date and the RAG context has no release_date field, respond with "Release date not yet announced" instead of generating one. Also added "date_claim" as a tracked entity type in the hallucination detector.
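The guardrail can be sketched as a deterministic check that runs before generation. Assumes the retrieved product record is a dict with an optional `release_date` field (field name illustrative):

```python
def release_date_answer(product_record: dict, title: str) -> str:
    """Never state a release date the RAG context does not contain."""
    release_date = product_record.get("release_date")
    if not release_date:
        return (f"The release date for {title} has not been announced yet. "
                "Check back soon!")
    return f"{title} releases on {release_date}."
```

Because the check runs on the retrieved record rather than the generated text, the model never gets the chance to invent a plausible-sounding date in the first place.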
### Scenario C: Rating Fatigue — Optimizing the Feedback Widget

Context: Initially, the feedback widget asked for thumbs up/down on every response. Feedback collection rate was 4.1%, but the team noticed it was dropping month-over-month.

What Happened:
- Month 1: 4.1% feedback rate
- Month 2: 3.2% feedback rate
- Month 3: 2.5% feedback rate
- Of users who gave feedback, 92% only used thumbs up/down — never the dimension ratings or correction text

Root Cause: Feedback fatigue. Users were shown the widget on every turn, including turns where no feedback was useful (e.g., initial greeting, follow-up clarifications). The dimension rating modal felt like a survey.

Fix: Changed the feedback strategy:
1. Only show thumbs up/down on substantive responses (skip greetings, clarifications, system messages)
2. Rate-limit to a maximum of 3 feedback prompts per session
3. Show dimension feedback only to users already in a thumbs-down flow (not proactively)
4. Added a "Was this helpful?" single-question variant for mobile users

Result: The feedback rate stabilized at 3.5% with higher-quality signals — correction-text usage increased from 8% to 19% of negative feedback.
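The revised strategy is mostly a gating function deciding when the widget appears. A sketch of the display rules (the turn-type labels are illustrative):

```python
SKIP_TURN_TYPES = {"greeting", "clarification", "system"}


def should_show_feedback_widget(turn_type: str, prompts_this_session: int,
                                max_prompts_per_session: int = 3) -> bool:
    """Only prompt on substantive responses, at most 3 times per session."""
    if turn_type in SKIP_TURN_TYPES:
        return False
    return prompts_this_session < max_prompts_per_session
```

Keeping this as a pure function makes the widget policy trivially testable and lets the skip list and rate limit be tuned without touching the frontend.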
### Scenario D: Annotator Disagreement Reveals Ambiguous Quality Standard

Context: An annotation batch for the faq intent had unusually low inter-annotator agreement (Kappa = 0.52).

What Happened:
- Annotator A consistently rated responses as 4/5 quality
- Annotator B rated the same responses as 2/5
- The disagreement centered on response length:
  - Q: "What is your return policy?"
  - Response: 3-sentence summary of the return policy
  - Annotator A: 4/5 (concise, correct, helpful)
  - Annotator B: 2/5 (missing details about international returns, missing exceptions)

Root Cause: No clear annotation guideline for "completeness." Annotator A valued conciseness; Annotator B valued exhaustiveness. Both were valid perspectives, but without a standard, their ratings diverged.

Fix: The team created an annotation rubric specific to each intent:
- faq: "Complete" means covering the most common interpretation. Edge cases (international returns) should be mentioned if directly asked about, but are not required in general FAQ responses.
- Published the rubric to all annotators. Re-annotated the batch with the rubric in hand — Kappa improved to 0.78.
## Intuition Gained

### Automated Metrics and User Feedback Measure Different Things
Automated quality scores tell you if the response is technically correct. User feedback tells you if the response is useful. A response can be factually accurate (high automated score) but unhelpful (high negative feedback) because it does not address what the user actually needs. Both signals are necessary for a complete evaluation picture.
### Feedback UI Design Directly Impacts Signal Quality
A feedback widget that appears on every response causes fatigue and low-quality ratings. A widget that only appears on substantive responses, with progressive disclosure (thumbs first, then dimensions, then free text), yields fewer but more informative data points. Optimize for signal quality, not volume.
### Annotation Workflows Need Explicit Rubrics

Without a shared rubric, annotators bring different mental models of "quality." This creates noisy labels that are worse than no labels. Before starting annotation, define what "good" means for each intent type, with examples of 1/5 through 5/5 responses. Calibrate annotators on 20 shared examples before independent work.
## References
- MangaAssist Architecture HLD — Frontend integration and API contracts
- Website Integration — Chat widget design and JS API
- FM Output Quality Assessment — Automated scoring used alongside feedback
- Model Evaluation Framework Deep Dive — Human evaluation sections
- Metrics and KPIs — Business metrics tied to user satisfaction
- Security and Privacy — PII handling in feedback data