# 02 — Model Evaluation & Optimal Configuration
AIP-C01 Skill 5.1.2 — Evaluate foundation model outputs and select optimal configurations for production workloads.
## Navigation
| # | Scenario | Focus Area |
|---|---|---|
| 01 | Bedrock Model Evaluation — Sonnet vs Haiku | Per-intent model selection, quality thresholds, Bedrock evaluation jobs |
| 02 | A/B Testing & Canary Deployment | Staged rollouts (5%→25%→50%→100%), statistical significance, rollback criteria |
| 03 | Cost–Performance & Token Efficiency | Token budgets, cost-per-conversation (sketched below this table), Pareto-optimal model configs |
| 04 | Latency–Quality Ratio Analysis | P50/P99 latency targets, cold starts, streaming vs non-streaming tradeoffs |
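Scenario 03 rests on simple token arithmetic: convert per-1K-token prices into a cost per conversation and compare model configurations on that basis. The sketch below shows the calculation; the prices and token counts are illustrative placeholders, so substitute current Amazon Bedrock pricing and your own observed usage.

```python
# Cost-per-conversation sketch (scenario 03). Prices are illustrative
# placeholders in USD per 1K tokens; check current Amazon Bedrock pricing.
PRICE_PER_1K = {
    "claude-3-5-sonnet": {"input": 0.003, "output": 0.015},
    "claude-3-haiku": {"input": 0.00025, "output": 0.00125},
}

def cost_per_conversation(model: str, turns: int,
                          input_tokens_per_turn: int,
                          output_tokens_per_turn: int) -> float:
    """Estimate the cost of one conversation for a given model."""
    price = PRICE_PER_1K[model]
    per_turn = (
        (input_tokens_per_turn / 1000) * price["input"]
        + (output_tokens_per_turn / 1000) * price["output"]
    )
    return turns * per_turn

# Example: a 6-turn conversation averaging ~800 input / ~300 output tokens per turn.
for model in PRICE_PER_1K:
    print(f"{model}: ${cost_per_conversation(model, 6, 800, 300):.4f}")
```

With a per-token price gap of roughly an order of magnitude between the two tiers, routing even a fraction of high-volume intents to the cheaper model moves the blended cost-per-conversation noticeably.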
## Context — MangaAssist JP Manga Chatbot
MangaAssist is a production chatbot on Amazon.com that helps customers discover, purchase, and get support for Japanese manga products. The system serves 10 intents: recommendation, product_question, faq, order_tracking, return_request, promotion, checkout_help, chitchat, escalation, and product_discovery.
### Core Tech Stack
| Component | Service |
|---|---|
| Primary LLM | Amazon Bedrock — Claude 3.5 Sonnet |
| Cost-Optimized LLM | Amazon Bedrock — Claude 3 Haiku |
| ML Training & Inference | Amazon SageMaker |
| Session & Conversation Store | Amazon DynamoDB |
| Vector / Semantic Search | Amazon OpenSearch Serverless |
| Response Cache | Amazon ElastiCache for Redis |
| Compute | Amazon ECS on AWS Fargate |
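One plausible way these components fit together on a single chat turn is sketched below, assuming the ElastiCache for Redis layer fronts the model call and DynamoDB holds the running conversation. The endpoint, table name, TTL, and model ID are hypothetical placeholders, not production values; per-intent model selection is sketched separately under Why Model Evaluation Matters below.

```python
import hashlib

import boto3
import redis

# Hypothetical resource names and model ID; substitute your own values.
cache = redis.Redis(host="manga-assist-cache.internal", port=6379)
sessions = boto3.resource("dynamodb").Table("manga-assist-sessions")
bedrock = boto3.client("bedrock-runtime")
MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"

def handle_turn(session_id: str, intent: str, user_message: str) -> str:
    """Answer one chat turn: cache lookup, session fetch, model call, write-back."""
    # 1. Repeatable intents (faq, product_question) can be served from the cache.
    cache_key = "resp:" + hashlib.sha256(f"{intent}|{user_message}".encode()).hexdigest()
    cached = cache.get(cache_key)
    if cached:
        return cached.decode()

    # 2. Pull prior turns from DynamoDB so the model sees the conversation so far.
    item = sessions.get_item(Key={"session_id": session_id}).get("Item", {})
    history = item.get("messages", [])

    # 3. Call the model with the accumulated history plus the new user turn.
    messages = history + [{"role": "user", "content": [{"text": user_message}]}]
    response = bedrock.converse(modelId=MODEL_ID, messages=messages)
    reply = response["output"]["message"]["content"][0]["text"]

    # 4. Cache the reply for an hour and persist the updated history.
    cache.setex(cache_key, 3600, reply)
    messages.append({"role": "assistant", "content": [{"text": reply}]})
    sessions.put_item(Item={"session_id": session_id, "messages": messages})
    return reply
```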
## Why Model Evaluation Matters
Not every intent needs the same model. Routing faq and chitchat through Sonnet wastes budget, while routing recommendation through Haiku loses quality. This skill area covers the evaluation framework that drives those decisions — from offline benchmarks through live A/B tests to continuous latency–quality monitoring.
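As a minimal sketch of what the resulting configuration looks like in code, the snippet below maps each intent to a Bedrock model ID and calls the selected model through the Converse API. The intent-to-model assignments are illustrative defaults rather than the tuned production mapping, and the model IDs should be checked against what is enabled in your Region.

```python
import boto3

SONNET = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # quality-critical intents
HAIKU = "anthropic.claude-3-haiku-20240307-v1:0"       # cost-optimized intents

# Illustrative assignments; the real mapping falls out of the evaluation jobs,
# A/B tests, and latency-quality monitoring covered in scenarios 01-04.
MODEL_BY_INTENT = {
    "recommendation": SONNET,
    "product_discovery": SONNET,
    "escalation": SONNET,
    "product_question": HAIKU,
    "faq": HAIKU,
    "order_tracking": HAIKU,
    "return_request": HAIKU,
    "promotion": HAIKU,
    "checkout_help": HAIKU,
    "chitchat": HAIKU,
}

bedrock = boto3.client("bedrock-runtime")

def answer(intent: str, user_message: str) -> str:
    """Route a classified intent to its assigned model and return the reply."""
    model_id = MODEL_BY_INTENT.get(intent, HAIKU)  # cost-optimized fallback
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": user_message}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0.3},
    )
    return response["output"]["message"]["content"][0]["text"]
```

Keeping the mapping as data (a dict here, more likely a parameter store or table entry in production) lets the staged rollouts and rollbacks in scenario 02 change assignments without touching routing code.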
## Parent
← Model Evaluation & Optimal Configuration (Overview)
Prepared for AIP-C01 certification — Skill 5.1.2