
Scenario 03 – Annotation Workflow & Quality

Parent: ../README.md · Skill: 5.1.3 – User-Centered Evaluation
System: MangaAssist – JP Manga Chatbot on Amazon.com
Stack: AWS Bedrock Claude 3.5 Sonnet · SageMaker · DynamoDB · OpenSearch Serverless · S3


Context

Human annotation is the ground truth for evaluating MangaAssist's response quality. Unlike user feedback (noisy, biased, sparse), expert annotations provide controlled, rubric-based quality assessments. This scenario covers designing annotation workflows, ensuring inter-annotator agreement, building golden datasets, and scaling annotation for 10 intents. Annotation data feeds into SageMaker model training, Bedrock prompt evaluation, and offline test suite construction.


Scenario Questions (12)

Easy (3)

Q1. MangaAssist needs a human annotation rubric to evaluate chatbot responses across all 10 intents. Design a rubric with 4 quality dimensions (Correctness, Completeness, Relevance, Tone), each scored 1–4. Provide anchor descriptions for each score level, and explain how the rubric adapts for different intents (e.g., recommendation needs a different "Correctness" anchor than order_tracking).

Q2. You are setting up an annotation team of 5 annotators to label 2,000 MangaAssist responses per week. Describe the onboarding process: training materials, calibration exercises, pilot annotation round, and the criteria for certifying an annotator as "production-ready". How do you handle annotators who are manga fans (domain expertise) vs. those who are not?

Q3. The annotation workflow stores labeled data in S3 as JSONL files. Design the JSONL schema for a single annotated response, including all fields needed to trace the annotation back to the original conversation, the annotator identity, the rubric scores, and any free-text comments. Explain the S3 bucket structure and naming convention.
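One possible record shape, sketched in Python for illustration (every field name and the S3 key layout in the comment are assumptions, not a prescribed schema):

```python
import json
from datetime import datetime, timezone

# Illustrative annotation record; the S3 layout is likewise an assumption, e.g.
# s3://mangaassist-annotations/year=2024/week=29/intent=recommendation/batch-001.jsonl
record = {
    "annotation_id": "ann-000123",                    # unique per annotation
    "conversation_id": "conv-8f3a9c",                 # links back to the stored conversation
    "turn_index": 4,                                  # which assistant turn was judged
    "intent": "recommendation",
    "model_version": "claude-3-5-sonnet",             # model/prompt version that produced the response
    "response_text": "...",
    "annotator_id": "annotator-03",
    "rubric_version": "v1.2",
    "scores": {"correctness": 4, "completeness": 3, "relevance": 4, "tone": 4},
    "comments": "Titles are real and genre-appropriate; availability info missing.",
    "annotated_at": datetime.now(timezone.utc).isoformat(),
}
print(json.dumps(record, ensure_ascii=False))         # one line per record in the JSONL file
```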


Medium (3)

Q4. Two annotators disagree on whether a recommendation response is "Relevant" (Annotator A scores 4, Annotator B scores 2). Compute Cohen's kappa for the Relevance dimension over a batch of 200 responses annotated by both annotators. Walk through the full calculation, interpret the result, and describe your remediation strategy if kappa falls below 0.60.
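A worked sketch of the kappa computation (the example scores are made up; the real calculation would use the 200 paired Relevance scores):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items (categorical labels)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators gave the same score.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's marginal score distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((freq_a[k] / n) * (freq_b[k] / n) for k in set(labels_a) | set(labels_b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: Relevance scores (1-4) from two annotators on 10 responses.
a = [4, 3, 2, 4, 4, 1, 3, 3, 2, 4]
b = [4, 3, 3, 4, 2, 1, 3, 2, 2, 4]
print(round(cohens_kappa(a, b), 3))   # observed agreement 0.70, chance agreement 0.28
```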

Q5. MangaAssist has 10 intents, but annotation volume is limited — you can only annotate 500 responses/week. Design a sampling strategy that allocates annotations across intents based on: (a) traffic volume per intent, (b) current quality uncertainty (intents with high annotator disagreement need more samples), and (c) business priority. Provide the allocation formula and a worked example.
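One possible allocation rule, sketched below; the blend weights, the per-intent floor, and all intent figures are hypothetical:

```python
def allocate_annotations(intents, budget=500,
                         w_traffic=0.5, w_uncertainty=0.3, w_priority=0.2, floor=20):
    """intents: {name: {"traffic": weekly volume, "uncertainty": e.g. 1 - kappa, "priority": 1-5}}.

    Each intent gets a blended weight of its normalized traffic share, quality
    uncertainty, and business priority; a fixed floor keeps low-traffic intents sampled.
    """
    def norm(key):
        total = sum(v[key] for v in intents.values())
        return {k: v[key] / total for k, v in intents.items()}
    t, u, p = norm("traffic"), norm("uncertainty"), norm("priority")
    weight = {k: w_traffic * t[k] + w_uncertainty * u[k] + w_priority * p[k] for k in intents}
    remaining = budget - floor * len(intents)
    return {k: floor + round(remaining * w) for k, w in weight.items()}

# Hypothetical worked example with 3 of the 10 intents.
print(allocate_annotations({
    "recommendation": {"traffic": 6000, "uncertainty": 0.45, "priority": 5},
    "order_tracking": {"traffic": 3000, "uncertainty": 0.20, "priority": 4},
    "spoiler_policy": {"traffic": 500,  "uncertainty": 0.60, "priority": 2},
}, budget=500))
```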

Q6. You want to use annotation data to evaluate Bedrock Claude 3.5 Sonnet prompt changes. Design an A/B annotation study: Prompt A (current) and Prompt B (candidate) each generate responses to the same 200 user queries. Annotators blind-score all 400 responses. Define the statistical test to determine whether Prompt B is significantly better, the minimum effect size you care about, and how you handle the paired nature of the data (same query, different responses).
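A sketch of the paired analysis, assuming each response's rubric scores are collapsed into one number per (query, prompt) pair; the data below is simulated purely for illustration:

```python
import numpy as np
from scipy.stats import wilcoxon

# Simulated per-query mean rubric scores for the same 200 queries under both prompts.
rng = np.random.default_rng(0)
scores_a = rng.normal(3.0, 0.5, 200)                 # Prompt A (current)
scores_b = scores_a + rng.normal(0.15, 0.3, 200)     # Prompt B (candidate)

# Paired design: test the per-query differences, not two independent groups.
diff = scores_b - scores_a
stat, p_value = wilcoxon(diff)                       # non-parametric paired test
print(f"median improvement = {np.median(diff):+.2f}, p = {p_value:.4f}")
```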


Hard (3)

Q7. Build a golden dataset of 1,000 "reference-quality" annotated MangaAssist responses for offline evaluation. Define: (a) the selection criteria for which responses enter the golden set, (b) the annotation process (how many annotators per response, disagreement resolution), (c) the quality bar (minimum agreement level), (d) versioning strategy (how the golden set evolves as the chatbot improves), and (e) how you use this dataset to compute automated metrics that correlate with human judgment.

Q8. Annotation is expensive ($0.50/response at scale). Design a hybrid human-AI annotation pipeline where Bedrock Claude 3.5 Sonnet pre-labels responses and human annotators only review cases where the AI's confidence is below a threshold. Define: the AI pre-labeling prompt, the confidence threshold calibration, the expected cost savings, and how you ensure the hybrid pipeline doesn't introduce systematic bias compared to pure-human annotation.
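One way to calibrate the review threshold, as a sketch: on a held-out set that is fully human-annotated, find the lowest AI self-confidence above which AI pre-labels agree with humans at a target rate (the function name, record layout, and targets are all assumptions):

```python
def pick_confidence_threshold(records, target_agreement=0.95):
    """records: list of (ai_confidence, ai_label, human_label) from a calibration set."""
    for threshold in sorted({conf for conf, _, _ in records}):
        kept = [(ai, human) for conf, ai, human in records if conf >= threshold]
        if kept and sum(ai == human for ai, human in kept) / len(kept) >= target_agreement:
            return threshold    # lowest threshold meeting the agreement target
    return None                 # no threshold reaches the target; keep full human review

# Hypothetical calibration data: (AI confidence, AI label, human label).
calibration = [(0.95, "good", "good"), (0.90, "good", "good"), (0.80, "bad", "good"),
               (0.70, "good", "good"), (0.60, "bad", "bad"), (0.55, "good", "bad")]
print(pick_confidence_threshold(calibration, target_agreement=0.8))
```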

Q9. You discover that Annotator #3 has been drifting — their scores have shifted 0.5 points lower over the past month compared to their initial calibration. Design a continuous quality monitoring system for annotators: define the metrics you track (agreement with gold-standard, within-annotator consistency, inter-annotator drift), the alert thresholds, and the retraining/recalibration protocol. Implement the drift detection as a Lambda function.
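A minimal sketch of the drift-check handler; it assumes the triggering event already carries the annotator's calibration baseline and recent gold-standard scores (in practice these would be queried from the annotation store), and it only returns an alert flag rather than publishing a notification:

```python
import json
import statistics

DRIFT_THRESHOLD = 0.3   # alert if the mean score shifts by more than 0.3 points
WINDOW = 50             # compare the latest 50 annotations with the calibration baseline

def lambda_handler(event, context):
    annotator_id = event["annotator_id"]
    baseline_mean = event["calibration_mean"]           # mean score at certification time
    recent_scores = event["recent_scores"][-WINDOW:]    # most recent rubric scores

    recent_mean = statistics.mean(recent_scores)
    drift = recent_mean - baseline_mean

    result = {
        "annotator_id": annotator_id,
        "recent_mean": round(recent_mean, 2),
        "drift": round(drift, 2),
        "alert": abs(drift) > DRIFT_THRESHOLD,
    }
    # A production pipeline would publish alerts (e.g. to SNS); here we just return the result.
    return {"statusCode": 200, "body": json.dumps(result)}
```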


Very Hard (3)

Q10. Design a multi-stage annotation pipeline for MangaAssist that handles the full spectrum from simple binary labels to complex rubric-based assessment. Stage 1: Crowdsource binary quality labels (good/bad) via Amazon Mechanical Turk. Stage 2: Expert annotators provide rubric scores on the "borderline" cases. Stage 3: Senior annotators adjudicate disagreements. Define the inter-stage data flow, cost model, quality gates between stages, and how the final labels achieve higher quality than any single stage alone.

Q11. The annotation rubric needs different evaluation criteria for each of MangaAssist's 10 intents. For recommendation, "Correctness" means suggesting real manga titles that match the user's genre preferences. For order_tracking, "Correctness" means providing accurate order status. Design an intent-adaptive annotation framework: the intent-specific rubric extensions, how annotators switch between intent rubrics efficiently, how you train annotators on all 10 intent rubrics without cognitive overload, and how you ensure cross-intent comparability of scores.

Q12. You want to measure whether your annotation labels actually predict user satisfaction (external validity). Design a validation study that compares annotator quality scores with user star ratings on the same set of 500 responses. Define the analysis plan, the expected correlation magnitude, how you would handle cases where annotators rate a response as "high quality" but users give it 2 stars (or vice versa), and what these discrepancies reveal about the gap between objective quality and subjective satisfaction. Discuss how this informs your overall evaluation strategy.


Answer Key

ANSWERS.md