01 — Bedrock Model Evaluation: Sonnet vs Haiku
Evaluating Claude 3.5 Sonnet and Claude 3 Haiku across MangaAssist's 10 intents to determine per-intent model assignment.
Context
MangaAssist routes customer queries across 10 intents. Not every intent demands the reasoning depth of Claude 3.5 Sonnet — some are best served by the faster, cheaper Claude 3 Haiku. This scenario covers how to run Bedrock Model Evaluation jobs, define per-intent quality thresholds, and build a decision matrix that balances quality, latency, and cost.
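As a point of reference for the questions below, the per-intent decision matrix can be thought of as a small table keyed by intent. A minimal sketch in Python follows; the intent names beyond those mentioned in this scenario, the thresholds, and the assignments are illustrative assumptions, not measured results.

    # Illustrative per-intent decision matrix: quality threshold plus current
    # assignment. Thresholds and assignments below are placeholders to be filled
    # in from evaluation results, not recommendations.
    SONNET = "anthropic.claude-3-5-sonnet-20240620-v1:0"
    HAIKU = "anthropic.claude-3-haiku-20240307-v1:0"

    DECISION_MATRIX = {
        "faq":            {"threshold": 0.85, "model": HAIKU,  "fallback": SONNET},
        "chitchat":       {"threshold": 0.80, "model": HAIKU,  "fallback": None},
        "order_tracking": {"threshold": 0.88, "model": SONNET, "fallback": None},
        "recommendation": {"threshold": 0.90, "model": SONNET, "fallback": None},
        # ...remaining intents assigned the same way once their evaluations are run
    }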
Questions (12)
Easy (1–3)
Q1. What is an Amazon Bedrock Model Evaluation job and how would you configure one to compare Claude 3.5 Sonnet against Claude 3 Haiku for MangaAssist's faq intent? Describe the input dataset format and the metrics you would select.
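A rough sketch of launching such an evaluation with boto3 is below. The role ARN, bucket names, and metric selection are placeholders; note that automatic evaluation jobs take a single model each, so a Sonnet-vs-Haiku comparison is two jobs over the same prompt dataset.

    # Hypothetical helper: one automatic evaluation job per model over the same
    # golden dataset stored in S3 as JSONL prompt records.
    import boto3

    bedrock = boto3.client("bedrock")

    MODELS = {
        "sonnet": "anthropic.claude-3-5-sonnet-20240620-v1:0",
        "haiku": "anthropic.claude-3-haiku-20240307-v1:0",
    }

    def launch_faq_eval(model_id: str, suffix: str) -> str:
        resp = bedrock.create_evaluation_job(
            jobName=f"mangaassist-faq-eval-{suffix}",
            roleArn="arn:aws:iam::123456789012:role/BedrockEvalRole",  # placeholder
            evaluationConfig={
                "automated": {
                    "datasetMetricConfigs": [{
                        "taskType": "QuestionAndAnswer",  # Q&A-style prompts for the faq intent
                        "dataset": {
                            "name": "faq-golden-set",
                            "datasetLocation": {"s3Uri": "s3://mangaassist-eval/faq.jsonl"},
                        },
                        "metricNames": ["Builtin.Accuracy", "Builtin.Robustness", "Builtin.Toxicity"],
                    }]
                }
            },
            inferenceConfig={"models": [{"bedrockModel": {"modelIdentifier": model_id}}]},
            outputDataConfig={"s3Uri": "s3://mangaassist-eval/results/"},
        )
        return resp["jobArn"]

    job_arns = {name: launch_faq_eval(mid, name) for name, mid in MODELS.items()}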
Q2. For MangaAssist, explain what evaluation metrics (e.g., ROUGE, BERTScore, human preference) are most relevant when comparing Sonnet and Haiku on the recommendation intent. Why might lexical overlap metrics be insufficient for manga recommendation quality?
Q3. Describe how you would create a golden dataset of 200 question-answer pairs for MangaAssist's product_question intent. What fields should each record contain, and how do you handle multi-turn context in the evaluation dataset?
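One way a golden-dataset record might be built is sketched below, assuming prior turns are flattened into the prompt so the evaluated model sees the same context it would receive at inference time. The internal field names (prior_turns, expected_answer) and the sample content are hypothetical; only prompt, referenceResponse, and category are meant as evaluation-dataset fields.

    import json

    # One reviewed sample; in practice 200 of these are curated from logs and
    # checked by a human before being written to the JSONL dataset.
    sample = {
        "prior_turns": [
            {"role": "user", "text": "Do you have the hardcover edition of this series?"},
            {"role": "assistant", "text": "Yes, the hardcover edition is available."},
        ],
        "question": "Does it ship with the slipcase?",
        "expected_answer": "Yes, the hardcover edition listed ships with the slipcase.",
    }

    def to_eval_record(sample: dict) -> dict:
        # Flatten the multi-turn history into the prompt field.
        history = "\n".join(f"{t['role']}: {t['text']}" for t in sample["prior_turns"])
        prompt = f"{history}\nuser: {sample['question']}" if history else sample["question"]
        return {
            "prompt": prompt,
            "referenceResponse": sample["expected_answer"],
            "category": "product_question",
        }

    with open("product_question_golden.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(to_eval_record(sample), ensure_ascii=False) + "\n")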
Medium (4–6)
Q4. MangaAssist uses an intent classifier (SageMaker endpoint) to route queries before LLM invocation. Design a per-intent model assignment strategy where some intents use Sonnet and others use Haiku. Justify your assignment for each of the 10 intents with quality and cost reasoning.
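At runtime, the assignment reduces to a lookup on the classified intent followed by a Bedrock Converse call. A minimal sketch, reusing a table like the one under Context; the SageMaker classifier call is elided, and defaulting unknown intents to the stronger model is one possible policy, not a requirement.

    import boto3

    runtime = boto3.client("bedrock-runtime")

    SONNET = "anthropic.claude-3-5-sonnet-20240620-v1:0"
    HAIKU = "anthropic.claude-3-haiku-20240307-v1:0"

    # Illustrative assignment; the real table comes out of the evaluation results.
    INTENT_TO_MODEL = {
        "faq": HAIKU, "chitchat": HAIKU,
        "recommendation": SONNET, "order_tracking": SONNET, "return_request": SONNET,
    }

    def answer(intent: str, user_text: str) -> str:
        model_id = INTENT_TO_MODEL.get(intent, SONNET)  # unknown intents fall back to Sonnet
        resp = runtime.converse(
            modelId=model_id,
            messages=[{"role": "user", "content": [{"text": user_text}]}],
        )
        return resp["output"]["message"]["content"][0]["text"]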
Q5. You run a Bedrock evaluation job and find that Haiku scores 0.82 BERTScore on order_tracking while Sonnet scores 0.91. The business threshold for order-related intents is 0.88. Walk through how you decide which model to assign, factoring in the 12× cost difference and fallback strategies.
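The arithmetic behind that trade-off can be made explicit. In the sketch below, the scores and threshold come from the question; the escalation rate and the quality of the Haiku-with-escalation path are assumptions that would have to be measured on the golden set, not given facts.

    HAIKU_SCORE, SONNET_SCORE, THRESHOLD = 0.82, 0.91, 0.88
    COST_RATIO = 12.0        # Sonnet cost relative to Haiku in this scenario
    ESCALATION_RATE = 0.30   # assumed share of order_tracking turns escalated to Sonnet
    BLENDED_SCORE = 0.89     # assumed quality of the escalation path, re-measured on the golden set

    if HAIKU_SCORE >= THRESHOLD:
        decision = "haiku"
    elif BLENDED_SCORE >= THRESHOLD:
        # Blended cost in units of Haiku cost: 0.7 * 1 + 0.3 * 12 = 4.3x here.
        blended_cost = (1 - ESCALATION_RATE) * 1.0 + ESCALATION_RATE * COST_RATIO
        decision = f"haiku_with_sonnet_fallback (~{blended_cost:.1f}x Haiku cost vs 12x for Sonnet-only)"
    else:
        decision = "sonnet"  # quality threshold wins; revisit when a cheaper model improves
    print(decision)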
Q6. Explain how to automate Bedrock Model Evaluation as part of MangaAssist's CI/CD pipeline. When a new model version (e.g., Claude 3.5 Sonnet v2) becomes available, how does the pipeline evaluate it against the current production model before promoting?
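One promotion gate such a pipeline might apply after the evaluation jobs finish is sketched below; loading per-intent scores from the job's S3 output is elided, and the regression margin is an assumption.

    REGRESSION_MARGIN = 0.02  # assumed tolerance before a regression blocks promotion

    def promotion_gate(prod_scores: dict[str, float], cand_scores: dict[str, float]) -> bool:
        # Block promotion if any intent regresses beyond the margin vs. production.
        regressions = {
            intent: (prod_scores[intent], cand_scores[intent])
            for intent in prod_scores
            if cand_scores.get(intent, 0.0) < prod_scores[intent] - REGRESSION_MARGIN
        }
        if regressions:
            print("promotion blocked:", regressions)
            return False
        return True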
Hard (7–9)
Q7. MangaAssist's recommendation intent requires grounding in a product catalog stored in OpenSearch Serverless. How do you evaluate RAG-augmented outputs in a Bedrock evaluation job? Describe how to measure faithfulness (no hallucinated manga titles), relevance, and completeness when the context window includes the catalog passages retrieved via vector search.
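A string-level faithfulness proxy for the "no hallucinated titles" part might look like the sketch below; it assumes a known catalog title list and would complement, not replace, a judge-model faithfulness metric.

    def faithfulness(answer: str, retrieved_passages: list[str], catalog_titles: set[str]) -> float:
        # Fraction of catalog titles named in the answer that also appear in the
        # retrieved context; 1.0 means every asserted title is grounded.
        context = " ".join(retrieved_passages)
        mentioned = [t for t in catalog_titles if t in answer]
        if not mentioned:
            return 1.0  # no titles asserted, nothing to hallucinate
        grounded = [t for t in mentioned if t in context]
        return len(grounded) / len(mentioned)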
Q8. Design a custom evaluation metric for MangaAssist that captures "manga-domain accuracy" — ensuring the model doesn't confuse genres (shōnen vs. seinen), misattribute authors, or hallucinate volume numbers. How would you implement this as a judge model evaluation in Bedrock, and what rubric would you provide to the judge?
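A rubric handed to the judge model could be as simple as the prompt below; the criteria and 0-2 scale are illustrative, and the judge would be prompted once per (question, reference, candidate) triple.

    JUDGE_RUBRIC = """\
    You are grading a manga-domain assistant answer against a reference answer.
    Score each criterion 0-2 and return JSON {"genre": _, "author": _, "volumes": _}:
    - genre: 2 if demographic/genre labels (e.g. shōnen vs. seinen) match the reference,
      1 if vague, 0 if any label is wrong.
    - author: 2 if every author/artist attribution is correct, 0 if any is wrong.
    - volumes: 2 if volume counts or numbers match the reference or are omitted,
      0 if any stated number contradicts the reference.
    """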
Q9. After running evaluations, you discover that Sonnet outperforms Haiku on return_request for English queries but underperforms on Japanese-language queries due to Sonnet's occasional over-translation of product names. How do you structure a bilingual evaluation pipeline, and what language-specific metrics would you add?
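Two small pieces such a bilingual pipeline might add are sketched here: a per-language score split, and a product-name preservation check that penalizes over-translation by requiring reference titles to appear verbatim in the output. Field names are assumptions about the evaluation output format.

    from collections import defaultdict

    def name_preservation(output: str, reference_titles: list[str]) -> float:
        # Share of reference titles kept verbatim (untranslated) in the output.
        if not reference_titles:
            return 1.0
        kept = sum(1 for t in reference_titles if t in output)
        return kept / len(reference_titles)

    def scores_by_language(records: list[dict]) -> dict[str, float]:
        # each record: {"lang": "en" | "ja", "score": float}
        buckets = defaultdict(list)
        for r in records:
            buckets[r["lang"]].append(r["score"])
        return {lang: sum(v) / len(v) for lang, v in buckets.items()}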
Very Hard (10–12)
Q10. MangaAssist handles multi-turn conversations where the model assignment may need to change mid-conversation (e.g., a chitchat turn followed by a recommendation turn). Design an evaluation framework that measures cross-model consistency when the conversation switches between Haiku and Sonnet mid-session. How do you evaluate coherence degradation at model transition points?
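One way to quantify degradation at switch points is to compare per-turn coherence scores at transition turns against same-model turns, as sketched below; obtaining the per-turn coherence score itself (e.g. from a judge model) is elided.

    def transition_gap(turns: list[dict]) -> float:
        # each turn: {"model": "haiku" | "sonnet", "coherence": float}
        trans, same = [], []
        for prev, cur in zip(turns, turns[1:]):
            (trans if cur["model"] != prev["model"] else same).append(cur["coherence"])
        if not trans or not same:
            return 0.0
        # Positive values mean coherence drops at model switches.
        return (sum(same) / len(same)) - (sum(trans) / len(trans))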
Q11. You need to evaluate whether a fine-tuned Haiku model (customized on MangaAssist conversation logs via a Bedrock model customization job) can replace Sonnet for the product_discovery intent. Design the complete evaluation pipeline: training data preparation, running the fine-tuning job, deploying the fine-tuned model, running Bedrock evaluation against both Sonnet and base Haiku, and defining the promotion criteria. What statistical tests ensure the fine-tuned model genuinely outperforms rather than overfitting to the evaluation set?
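For the statistical piece, a paired test on per-example scores from a held-out set (never used for fine-tuning or threshold selection) is one option. The sketch below uses scipy's Wilcoxon signed-rank test to avoid assuming normally distributed score differences; alpha and the minimum-gain margin are assumptions.

    from scipy.stats import wilcoxon

    def should_promote(finetuned: list[float], sonnet: list[float],
                       alpha: float = 0.05, min_gain: float = 0.0) -> bool:
        # One-sided test: fine-tuned scores exceed Sonnet's on paired examples.
        stat, p = wilcoxon(finetuned, sonnet, alternative="greater")
        mean_gain = sum(f - s for f, s in zip(finetuned, sonnet)) / len(finetuned)
        return p < alpha and mean_gain > min_gain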
Q12. The MangaAssist team wants to implement a continuous evaluation system that automatically re-evaluates model assignments weekly as conversation patterns shift (e.g., seasonal manga releases change recommendation distributions). Design this system end-to-end: data sampling from DynamoDB conversation logs, automated golden dataset refresh, scheduled Bedrock evaluation jobs, drift detection, and automatic model reassignment with human-in-the-loop approval. What guardrails prevent a bad automatic reassignment from reaching production?
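For the drift-detection step, a weekly population stability index (PSI) over the intent distribution sampled from the logs is one common choice; the sketch below flags drift and queues a reassignment proposal for human approval rather than applying it automatically. The 0.2 threshold is a rule of thumb, and the dataset refresh, evaluation scheduling, and approval workflow are elided.

    import math

    def psi(baseline: dict[str, float], current: dict[str, float], eps: float = 1e-6) -> float:
        # baseline/current: intent -> share of traffic; both should sum to ~1.0.
        score = 0.0
        for intent in set(baseline) | set(current):
            b = max(baseline.get(intent, 0.0), eps)
            c = max(current.get(intent, 0.0), eps)
            score += (c - b) * math.log(c / b)
        return score

    DRIFT_THRESHOLD = 0.2  # rule of thumb; tune for MangaAssist's traffic volume

    def weekly_check(baseline: dict[str, float], current: dict[str, float]) -> None:
        if psi(baseline, current) > DRIFT_THRESHOLD:
            print("drift detected: refresh golden datasets, re-run evaluation jobs, "
                  "and queue a model-reassignment proposal for human approval")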