# 02 — Model Evaluation & Optimal Configuration
AIP-C01 Skill 5.1.2 — Evaluate foundation model outputs and select optimal configurations for production workloads.
## Navigation
| # | Scenario | Focus Area |
|---|---|---|
| 01 | Bedrock Model Evaluation — Sonnet vs Haiku | Per-intent model selection, quality thresholds, Bedrock evaluation jobs |
| 02 | A/B Testing & Canary Deployment | Staged rollouts (5%→25%→50%→100%), statistical significance, rollback criteria |
| 03 | Cost–Performance & Token Efficiency | Token budgets, cost-per-conversation (sketched below this table), Pareto-optimal model configs |
| 04 | Latency–Quality Ratio Analysis | P50/P99 latency targets, cold starts, streaming vs non-streaming tradeoffs |
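Scenario 03 rests on simple token arithmetic: convert per-1K-token prices into a cost per conversation and compare model configurations on that basis. The sketch below shows the calculation; the prices and token counts are illustrative placeholders, so substitute current Amazon Bedrock pricing and your own observed usage.

```python
# Cost-per-conversation sketch (scenario 03). Prices are illustrative
# placeholders in USD per 1K tokens; check current Amazon Bedrock pricing.
PRICE_PER_1K = {
    "claude-3-5-sonnet": {"input": 0.003, "output": 0.015},
    "claude-3-haiku": {"input": 0.00025, "output": 0.00125},
}

def cost_per_conversation(model: str, turns: int,
                          input_tokens_per_turn: int,
                          output_tokens_per_turn: int) -> float:
    """Estimate the cost of one conversation for a given model."""
    price = PRICE_PER_1K[model]
    per_turn = (
        (input_tokens_per_turn / 1000) * price["input"]
        + (output_tokens_per_turn / 1000) * price["output"]
    )
    return turns * per_turn

# Example: a 6-turn conversation averaging ~800 input / ~300 output tokens per turn.
for model in PRICE_PER_1K:
    print(f"{model}: ${cost_per_conversation(model, 6, 800, 300):.4f}")
```

With a per-token price gap of roughly an order of magnitude between the two tiers, routing even a fraction of high-volume intents to the cheaper model moves the blended cost-per-conversation noticeably.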
## Context — MangaAssist JP Manga Chatbot
MangaAssist is a production chatbot on Amazon.com that helps customers discover, purchase, and get support for Japanese manga products. The system serves 10 intents: recommendation, product_question, faq, order_tracking, return_request, promotion, checkout_help, chitchat, escalation, and product_discovery.
### Core Tech Stack
| Component | Service |
|---|---|
| Primary LLM | Amazon Bedrock — Claude 3.5 Sonnet |
| Cost-Optimized LLM | Amazon Bedrock — Claude 3 Haiku |
| ML Training & Inference | Amazon SageMaker |
| Session & Conversation Store | Amazon DynamoDB |
| Vector / Semantic Search | Amazon OpenSearch Serverless |
| Response Cache | Amazon ElastiCache for Redis |
| Compute | Amazon ECS on AWS Fargate |
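One plausible way these components fit together on a single chat turn is sketched below, assuming the ElastiCache for Redis layer fronts the model call and DynamoDB holds the running conversation. The endpoint, table name, TTL, and model ID are hypothetical placeholders, not production values; per-intent model selection is sketched separately under Why Model Evaluation Matters below.

```python
import hashlib

import boto3
import redis

# Hypothetical resource names and model ID; substitute your own values.
cache = redis.Redis(host="manga-assist-cache.internal", port=6379)
sessions = boto3.resource("dynamodb").Table("manga-assist-sessions")
bedrock = boto3.client("bedrock-runtime")
MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"

def handle_turn(session_id: str, intent: str, user_message: str) -> str:
    """Answer one chat turn: cache lookup, session fetch, model call, write-back."""
    # 1. Repeatable intents (faq, product_question) can be served from the cache.
    cache_key = "resp:" + hashlib.sha256(f"{intent}|{user_message}".encode()).hexdigest()
    cached = cache.get(cache_key)
    if cached:
        return cached.decode()

    # 2. Pull prior turns from DynamoDB so the model sees the conversation so far.
    item = sessions.get_item(Key={"session_id": session_id}).get("Item", {})
    history = item.get("messages", [])

    # 3. Call the model with the accumulated history plus the new user turn.
    messages = history + [{"role": "user", "content": [{"text": user_message}]}]
    response = bedrock.converse(modelId=MODEL_ID, messages=messages)
    reply = response["output"]["message"]["content"][0]["text"]

    # 4. Cache the reply for an hour and persist the updated history.
    cache.setex(cache_key, 3600, reply)
    messages.append({"role": "assistant", "content": [{"text": reply}]})
    sessions.put_item(Item={"session_id": session_id, "messages": messages})
    return reply
```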
## Why Model Evaluation Matters
Not every intent needs the same model. Routing faq and chitchat through Sonnet wastes budget, while routing recommendation through Haiku loses quality. This skill area covers the evaluation framework that drives those decisions — from offline benchmarks through live A/B tests to continuous latency–quality monitoring.
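As a minimal sketch of what the resulting configuration looks like in code, the snippet below maps each intent to a Bedrock model ID and calls the selected model through the Converse API. The intent-to-model assignments are illustrative defaults rather than the tuned production mapping, and the model IDs should be checked against what is enabled in your Region.

```python
import boto3

SONNET = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # quality-critical intents
HAIKU = "anthropic.claude-3-haiku-20240307-v1:0"       # cost-optimized intents

# Illustrative assignments; the real mapping falls out of the evaluation jobs,
# A/B tests, and latency-quality monitoring covered in scenarios 01-04.
MODEL_BY_INTENT = {
    "recommendation": SONNET,
    "product_discovery": SONNET,
    "escalation": SONNET,
    "product_question": HAIKU,
    "faq": HAIKU,
    "order_tracking": HAIKU,
    "return_request": HAIKU,
    "promotion": HAIKU,
    "checkout_help": HAIKU,
    "chitchat": HAIKU,
}

bedrock = boto3.client("bedrock-runtime")

def answer(intent: str, user_message: str) -> str:
    """Route a classified intent to its assigned model and return the reply."""
    model_id = MODEL_BY_INTENT.get(intent, HAIKU)  # cost-optimized fallback
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": user_message}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0.3},
    )
    return response["output"]["message"]["content"][0]["text"]
```

Keeping the mapping as data (a dict here, more likely a parameter store or table entry in production) lets the staged rollouts and rollbacks in scenario 02 change assignments without touching routing code.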
## Parent
← Model Evaluation & Optimal Configuration (Overview)
Prepared for AIP-C01 certification — Skill 5.1.2