
# 02 — Model Evaluation & Optimal Configuration

AIP-C01 Skill 5.1.2 — Evaluate foundation model outputs and select optimal configurations for production workloads.

## Scenario Focus Areas
1. **Bedrock Model Evaluation — Sonnet vs. Haiku**: per-intent model selection, quality thresholds, Bedrock evaluation jobs
2. **A/B Testing & Canary Deployment**: staged rollouts (5% → 25% → 50% → 100%), statistical significance, rollback criteria
3. **Cost–Performance & Token Efficiency**: token budgets, cost-per-conversation, Pareto-optimal model configs
4. **Latency–Quality Ratio Analysis**: P50/P99 latency targets, cold starts, streaming vs. non-streaming tradeoffs
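The staged canary rollout in focus area 2 can be sketched as a simple gating function. This is a minimal illustration, not the scenario's actual implementation: the stage fractions come from the rollout plan above, while the quality metric and the 2-point rollback threshold are assumptions chosen for the example.

```python
# Staged canary rollout: traffic moves 5% -> 25% -> 50% -> 100%
# only while the candidate model's quality holds up against the
# baseline; otherwise we signal a rollback.

STAGES = [0.05, 0.25, 0.50, 1.00]   # fraction of traffic on the candidate
MAX_QUALITY_DROP = 0.02             # assumed rollback criterion (2 points)


def next_stage(current: float, baseline_score: float, candidate_score: float):
    """Return the next traffic fraction, or None to signal rollback.

    Scores are assumed to be on a 0-1 quality scale (e.g. an offline
    eval or online satisfaction metric).
    """
    if baseline_score - candidate_score > MAX_QUALITY_DROP:
        return None  # candidate is measurably worse: roll back
    later = [s for s in STAGES if s > current]
    return later[0] if later else current  # hold once fully rolled out
```

For example, a candidate within tolerance at 5% advances to 25%, while one scoring 5 points below baseline triggers a rollback. A production version would also check statistical significance before promoting a stage.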

## Context — MangaAssist JP Manga Chatbot

MangaAssist is a production chatbot on Amazon.com that helps customers discover, purchase, and get support for Japanese manga products. The system serves 10 intents: recommendation, product_question, faq, order_tracking, return_request, promotion, checkout_help, chitchat, escalation, and product_discovery.

### Core Tech Stack

| Component | Service |
| --- | --- |
| Primary LLM | Amazon Bedrock — Claude 3.5 Sonnet |
| Cost-optimized LLM | Amazon Bedrock — Claude 3 Haiku |
| ML training & inference | Amazon SageMaker |
| Session & conversation store | Amazon DynamoDB |
| Vector / semantic search | Amazon OpenSearch Serverless |
| Response cache | Amazon ElastiCache for Redis |
| Compute | Amazon ECS on Fargate |

## Why Model Evaluation Matters

Not every intent needs the same model. Routing faq and chitchat through Sonnet wastes budget, while routing recommendation through Haiku loses quality. This skill area covers the evaluation framework that drives those decisions — from offline benchmarks through live A/B tests to continuous latency–quality monitoring.
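The per-intent routing decision above can be sketched as a lookup table. The intent names come from the MangaAssist scenario; the Sonnet/Haiku split shown here and the Bedrock model IDs are assumptions for illustration only — the evaluation framework in this skill area is what would actually determine the split.

```python
# Hypothetical per-intent model routing for MangaAssist.
# Model IDs follow Bedrock's naming convention but should be
# verified against the current Bedrock model catalog.

SONNET = "anthropic.claude-3-5-sonnet-20240620-v1:0"
HAIKU = "anthropic.claude-3-haiku-20240307-v1:0"

# Assumed split: quality-sensitive intents -> Sonnet;
# high-volume, low-complexity intents -> Haiku.
INTENT_MODEL_MAP = {
    "recommendation": SONNET,
    "product_question": SONNET,
    "return_request": SONNET,
    "escalation": SONNET,
    "product_discovery": SONNET,
    "faq": HAIKU,
    "order_tracking": HAIKU,
    "promotion": HAIKU,
    "checkout_help": HAIKU,
    "chitchat": HAIKU,
}


def model_for(intent: str) -> str:
    # Unrecognized intents fall back to the higher-quality model,
    # trading cost for safety.
    return INTENT_MODEL_MAP.get(intent, SONNET)
```

The fallback direction is a design choice: defaulting unknown intents to Sonnet costs more per call but avoids quality regressions on traffic the classifier has not seen.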



Prepared for AIP-C01 certification — Skill 5.1.2