01 — Bedrock Model Evaluation: Sonnet vs Haiku — Answers
Easy
A1. Bedrock Model Evaluation Job for MangaAssist FAQ Intent
An Amazon Bedrock Model Evaluation job is a managed capability that lets you compare foundation model outputs against reference answers using automated metrics or a judge model.
Configuration for the MangaAssist `faq` intent:

- Input dataset — JSONL file stored in S3 with the following schema:

  ```json
  {
    "prompt": "What is the return policy for manga pre-orders?",
    "referenceResponse": "Manga pre-orders can be cancelled any time before the release date at no charge. After shipping, standard Amazon return policy applies — you have 30 days to initiate a return.",
    "category": "faq",
    "conversationId": "eval-faq-001",
    "metadata": { "language": "en", "source": "golden_dataset_v3" }
  }
  ```

- Model configuration — two model entries:
  - `anthropic.claude-3-5-sonnet-20241022-v2:0` with temperature=0.3, max_tokens=512
  - `anthropic.claude-3-haiku-20240307-v1:0` with identical inference parameters
- Metrics selected:
  - Automatic: ROUGE-L, BERTScore, exact match rate
  - Judge model: use Claude 3.5 Sonnet as a judge to rate responses on accuracy (1–5), helpfulness (1–5), and safety (pass/fail)
- Evaluation task type: text generation (single-turn for FAQ)
- Output: results written to S3 with per-sample scores and aggregate metrics per model.
The job runs both models against the same 200-sample FAQ dataset and produces a side-by-side comparison report.
A2. Evaluation Metrics for the Recommendation Intent
Relevant metrics for recommendation:
| Metric | Why It Matters |
|---|---|
| BERTScore | Captures semantic similarity — essential because two valid manga recommendations can be worded completely differently |
| Human preference (judge model) | The gold standard for subjective quality — "did the recommendation actually help the customer?" |
| Faithfulness (RAG-specific) | Ensures recommended titles exist in the catalog and metadata (author, genre, volume count) is accurate |
| Diversity score | Measures whether the model recommends varied titles vs. repeating the same popular series |
| Relevance to stated preferences | Custom metric: does the recommendation match the genres/themes the user expressed interest in? |
Why lexical overlap metrics (ROUGE) are insufficient:
Manga recommendations are inherently open-ended. Consider:
- Reference: "You might enjoy Attack on Titan — it's a dark fantasy action series with intense world-building."
- Model output: "Based on your interest in dark shōnen series, I'd recommend Jujutsu Kaisen — it shares the intense battle sequences and deep lore you enjoyed."
Both are excellent recommendations, but ROUGE-L would score near zero because they share almost no n-grams. The recommendation intent demands semantic and preference-alignment metrics rather than surface-level text overlap.
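To make this concrete, here is a minimal sketch (assuming the `rouge-score` Python package) that scores the example pair above; the near-zero ROUGE-L F1 is exactly the failure mode that motivates semantic and preference-alignment metrics.

```python
# Minimal sketch: ROUGE-L on two valid but differently worded recommendations.
from rouge_score import rouge_scorer

reference = ("You might enjoy Attack on Titan — it's a dark fantasy action series "
             "with intense world-building.")
candidate = ("Based on your interest in dark shōnen series, I'd recommend Jujutsu Kaisen — "
             "it shares the intense battle sequences and deep lore you enjoyed.")

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
score = scorer.score(reference, candidate)["rougeL"]

# Both answers are good recommendations, yet the lexical overlap is tiny.
print(f"ROUGE-L F1: {score.fmeasure:.2f}")
```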
A3. Creating a Golden Dataset for Product Question Intent
Dataset specification — 200 QA pairs for product_question:
Fields per record:
| Field | Type | Description |
|---|---|---|
| `eval_id` | string | Unique identifier (e.g., pq-001) |
| `prompt` | string | The customer question |
| `conversation_history` | array | Previous turns (for multi-turn context) |
| `reference_response` | string | Expert-written ideal answer |
| `product_asin` | string | The product the question relates to |
| `grounding_context` | string | Product catalog data from OpenSearch that should be used |
| `acceptable_alternatives` | array | Other valid answer phrasings |
| `difficulty` | enum | simple, requires_lookup, multi_product_comparison |
| `language` | string | en or ja |
Multi-turn handling:
For multi-turn conversations, include the conversation_history as an array of {role, content} objects:
{
"eval_id": "pq-042",
"conversation_history": [
{"role": "user", "content": "I'm looking for manga similar to Naruto"},
{"role": "assistant", "content": "Great choice! Some popular shōnen series similar to Naruto include..."},
{"role": "user", "content": "How many volumes does the second one have?"}
],
"prompt": "How many volumes does the second one have?",
"reference_response": "My Hero Academia currently has 40 volumes...",
"product_asin": "B09XYZ1234"
}
Creation process:
- Sample 500 real `product_question` conversations from DynamoDB (last 90 days)
- Filter to those with positive CSAT ratings (4–5 stars)
- Deduplicate by semantic similarity (cosine > 0.92 = duplicate; see the sketch after this list)
- Expert review — manga domain experts write/refine reference answers
- Stratify — ensure balanced coverage: 40% simple lookups, 35% comparison questions, 25% multi-turn dependent
- Final selection — 200 pairs with inter-annotator agreement ≥ 0.85 (Cohen's kappa)
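A minimal sketch of the semantic-deduplication step, assuming a sentence-transformers embedding model (the specific model name is illustrative, not part of the original design):

```python
# Semantic deduplication sketch: drop any candidate question whose cosine similarity to an
# already-kept sample exceeds 0.92. The embedding model choice is an assumption.
import numpy as np
from sentence_transformers import SentenceTransformer

def deduplicate(questions: list[str], threshold: float = 0.92) -> list[str]:
    model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
    embeddings = model.encode(questions, normalize_embeddings=True)

    kept_indices: list[int] = []
    for i, emb in enumerate(embeddings):
        # Cosine similarity reduces to a dot product on normalized vectors.
        if all(np.dot(emb, embeddings[j]) <= threshold for j in kept_indices):
            kept_indices.append(i)
    return [questions[i] for i in kept_indices]
```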
Medium
A4. Per-Intent Model Assignment Strategy
Decision matrix for MangaAssist's 10 intents:
| Intent | Assigned Model | Reasoning |
|---|---|---|
| `recommendation` | Sonnet | Requires nuanced genre understanding, cultural context, preference reasoning. Quality-critical — directly impacts conversion. |
| `product_question` | Sonnet | Needs accurate product detail extraction from RAG context. Hallucination risk is high with cheaper models. |
| `product_discovery` | Sonnet | Open-ended exploration requires strong reasoning to narrow from 100K+ catalog items. |
| `faq` | Haiku | Answers are well-defined and mostly template-based. Haiku achieves 0.94 BERTScore — above the 0.85 threshold. |
| `order_tracking` | Haiku | Structured data retrieval (order status from DynamoDB). Low creative reasoning needed. |
| `return_request` | Haiku | Follows a decision tree: eligible/ineligible, initiate return. Haiku handles this reliably. |
| `promotion` | Haiku | Promotional information is retrieved from a fixed dataset. Formatting is straightforward. |
| `checkout_help` | Haiku | Step-by-step guidance with minimal creative generation. |
| `chitchat` | Haiku | Low-stakes, conversational. Speed matters more than depth. |
| `escalation` | Haiku | The model's job is to summarize and hand off to a human agent — minimal generation needed. |
Result: 3 intents on Sonnet, 7 on Haiku → estimated 65% cost reduction vs. routing everything through Sonnet.
Quality thresholds per tier:
- Sonnet intents (quality-critical): BERTScore ≥ 0.88, faithfulness ≥ 0.92, human preference win rate ≥ 70%
- Haiku intents (efficiency-focused): BERTScore ≥ 0.82, faithfulness ≥ 0.88, task completion rate ≥ 95%
A5. Model Decision for Order Tracking at the Quality Threshold
Given data:
- Haiku BERTScore on order_tracking: 0.82
- Sonnet BERTScore on order_tracking: 0.91
- Business threshold for order-related intents: 0.88
- Cost: Sonnet is ~12× more expensive than Haiku
Decision framework:

1. Threshold check: Haiku (0.82) is below the 0.88 threshold. Sonnet (0.91) is above. Direct assignment: Sonnet.

2. Cost impact analysis:
   - `order_tracking` represents ~12% of total traffic (~36K conversations/day)
   - At Sonnet pricing: ~$432/day for this intent
   - At Haiku pricing: ~$36/day
   - Delta: $396/day ($11,880/month)

3. Before committing to Sonnet, explore mitigation strategies:

   Option A — Prompt optimization for Haiku:
   - Add structured output templates to Haiku's prompt
   - Include 3–5 few-shot examples of `order_tracking` responses
   - Re-evaluate: if BERTScore rises to ≥ 0.88, use Haiku

   Option B — Hybrid routing with fallback (see the routing sketch after this list):
   - Route to Haiku first
   - Run a lightweight quality check on the response (regex for order number format, status field presence)
   - If the quality check fails → re-invoke with Sonnet
   - Expected fallback rate: ~15% → blended cost = Haiku cost + 15% × Sonnet cost ≈ $36 + 0.15 × $432 ≈ $101/day

   Option C — Fine-tuned Haiku:
   - Fine-tune Haiku on 5K `order_tracking` examples via SageMaker
   - Typically yields +0.04–0.08 BERTScore improvement
   - If fine-tuned Haiku reaches 0.88 → use it at ~$45/day

4. Recommended path: Start with Option B (hybrid routing) for immediate deployment, and invest in Option C (fine-tuning) as a medium-term cost reduction. Only fall back to full Sonnet if both options fail to meet the 0.88 threshold.
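A minimal sketch of Option B's fallback routing, using the boto3 Bedrock Runtime Converse API; the quality-check heuristics and helper names are illustrative assumptions rather than the production rules:

```python
# Hybrid routing sketch for order_tracking: try Haiku first, fall back to Sonnet when a
# cheap heuristic quality check fails. Model IDs match the section; regex patterns and
# status keywords are hypothetical placeholders.
import re
import boto3

bedrock = boto3.client("bedrock-runtime")

HAIKU = "anthropic.claude-3-haiku-20240307-v1:0"
SONNET = "anthropic.claude-3-5-sonnet-20241022-v2:0"

def invoke(model_id: str, prompt: str) -> str:
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0.3, "maxTokens": 512},
    )
    return response["output"]["message"]["content"][0]["text"]

def passes_quality_check(text: str) -> bool:
    # Illustrative checks only: an Amazon-style order number and an explicit status word.
    has_order_number = re.search(r"\b\d{3}-\d{7}-\d{7}\b", text) is not None
    has_status = any(w in text.lower() for w in ("shipped", "delivered", "in transit", "preparing"))
    return has_order_number and has_status

def answer_order_tracking(prompt: str) -> str:
    draft = invoke(HAIKU, prompt)
    if passes_quality_check(draft):
        return draft
    return invoke(SONNET, prompt)  # ~15% of requests expected to take this path
```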
A6. Automating Bedrock Evaluation in CI/CD
Pipeline design for MangaAssist:
New Model Version Available (e.g., Sonnet v2)
│
▼
┌─────────────────────┐
│ 1. Trigger │ EventBridge detects new model version in Bedrock
│ (Automated) │ or manual trigger via CI/CD webhook
└──────────┬──────────┘
▼
┌─────────────────────┐
│ 2. Fetch Golden │ Pull latest evaluation datasets from S3
│ Datasets │ (10 datasets, one per intent)
└──────────┬──────────┘
▼
┌─────────────────────┐
│ 3. Run Bedrock │ For each intent:
│ Evaluation Jobs │ - Current production model
│ (Parallel) │ - New candidate model
│ │ Same inference params, same dataset
└──────────┬──────────┘
▼
┌─────────────────────┐
│ 4. Aggregate Results │ Compute per-intent deltas:
│ │ ΔBERTScore, ΔFaithfulness,
│ │ ΔLatency, ΔCost
└──────────┬──────────┘
▼
┌─────────────────────┐
│ 5. Gate Check │ Promotion criteria:
│ │ - No intent regresses > 2%
│ │ - At least 3 intents improve > 1%
│ │ - Latency P99 doesn't increase > 10%
│ │ - Cost doesn't increase > 5%
└──────────┬──────────┘
▼
Pass? ──No──► Notify team, blocked
│
Yes
▼
┌─────────────────────┐
│ 6. Promote to │ Update model ID in Parameter Store
│ Canary Stage │ Deploy to 5% traffic via ECS
└─────────────────────┘
Implementation details:
- CodePipeline orchestrates the stages
- Step Functions manages parallel Bedrock evaluation jobs (one per intent × two models = 20 jobs)
- Lambda aggregates results and applies gate logic
- Golden datasets are versioned in S3 with lifecycle policies (refresh monthly)
- Results are stored in DynamoDB for trend analysis
- SNS notifications for pass/fail with detailed report links
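A minimal sketch of the step-5 gate logic the aggregation Lambda could apply; the input shape (per-intent relative deltas plus overall latency/cost deltas) is an assumption:

```python
# Promotion gate sketch: applies the four step-5 criteria to per-intent deltas.
# `deltas` maps intent -> relative change of candidate vs. production model,
# expressed as fractions (e.g., 0.02 == +2%). The data shape is an assumption.
def gate_check(deltas: dict[str, dict[str, float]],
               latency_p99_delta: float,
               cost_delta: float) -> tuple[bool, list[str]]:
    reasons: list[str] = []

    quality = [d["bertscore"] for d in deltas.values()]
    if any(q < -0.02 for q in quality):
        reasons.append("an intent regressed by more than 2%")
    if sum(1 for q in quality if q > 0.01) < 3:
        reasons.append("fewer than 3 intents improved by more than 1%")
    if latency_p99_delta > 0.10:
        reasons.append("P99 latency increased by more than 10%")
    if cost_delta > 0.05:
        reasons.append("cost increased by more than 5%")

    return (len(reasons) == 0, reasons)

# Example: this candidate passes because all four criteria hold.
ok, why_not = gate_check(
    deltas={"faq": {"bertscore": 0.015}, "recommendation": {"bertscore": 0.02},
            "order_tracking": {"bertscore": 0.012}, "chitchat": {"bertscore": -0.005}},
    latency_p99_delta=0.04,
    cost_delta=0.01,
)
```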
Hard
A7. Evaluating RAG-Augmented Outputs for Recommendation Intent
Challenge: The recommendation intent retrieves product embeddings from OpenSearch Serverless and injects them into the prompt context. Standard evaluation ignores whether the model used the retrieved context correctly.
RAG-specific evaluation dimensions:
| Dimension | What It Measures | Metric |
|---|---|---|
| Faithfulness | Does the response ONLY contain manga titles/details that appear in the retrieved context? | Hallucination detection — entity-level precision against retrieved documents |
| Relevance | Does the response address the user's stated preferences using the retrieved products? | Custom judge rubric: preference-alignment score (1–5) |
| Completeness | Does the response use the most relevant retrieved products, or does it cherry-pick? | Recall of top-3 relevant products from the retrieval set |
| Context utilization | How much of the retrieved context was useful? | Ratio of cited products to total retrieved products |
Implementation in Bedrock evaluation:
- Augmented evaluation dataset:
{
"prompt": "I love dark fantasy manga with complex magic systems. What should I read?",
"retrievedContext": [
{"asin": "B001", "title": "Berserk Vol 1", "genre": "dark_fantasy", "author": "Kentaro Miura"},
{"asin": "B002", "title": "Fullmetal Alchemist Vol 1", "genre": "dark_fantasy", "author": "Hiromu Arakawa"},
{"asin": "B003", "title": "One Piece Vol 1", "genre": "adventure", "author": "Eiichiro Oda"}
],
"referenceResponse": "Based on your interest in dark fantasy, I'd recommend Berserk by Kentaro Miura and Fullmetal Alchemist by Hiromu Arakawa...",
"expectedCitations": ["B001", "B002"],
"hallucination_traps": ["Claymore", "Dorohedoro"]
}
- Judge model prompt (for faithfulness):
You are evaluating a manga recommendation chatbot.
Given the RETRIEVED PRODUCTS and the MODEL RESPONSE:
- List every manga title mentioned in the response
- For each title, mark it as GROUNDED (appears in retrieved products) or HALLUCINATED
- Score faithfulness: (grounded titles) / (total titles mentioned) × 100
A score below 90% is FAIL.
- Automated hallucination detection: Post-process model output through a verification Lambda that:
  - Extracts manga titles via NER
  - Looks up each title in the OpenSearch catalog
  - Flags any title not found as a potential hallucination
  - Checks ASINs mentioned against the retrieval set
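A minimal sketch of that verification step; title extraction is simplified to a catalog lookup instead of a real NER model, and the ASIN pattern is deliberately loose, both assumptions for illustration:

```python
# Entity-level grounding check sketch: flag any title or ASIN in the response that is not
# present in the retrieved context. Title extraction is a naive substring match against
# known catalog titles rather than NER (an assumption for brevity).
import re

def grounding_report(response: str, retrieved: list[dict], catalog_titles: set[str]) -> dict:
    retrieved_titles = {doc["title"] for doc in retrieved}
    retrieved_asins = {doc["asin"] for doc in retrieved}

    # Titles the model mentioned (any known catalog title appearing verbatim).
    mentioned = {t for t in catalog_titles if t.lower() in response.lower()}
    hallucinated = {t for t in mentioned if t not in retrieved_titles}

    # ASINs must come from the retrieval set. Loose pattern; real ASINs are 10 characters.
    mentioned_asins = set(re.findall(r"\bB0\w+\b", response))
    ungrounded_asins = mentioned_asins - retrieved_asins

    faithfulness = (len(mentioned - hallucinated) / len(mentioned) * 100) if mentioned else 100.0
    return {
        "mentioned_titles": sorted(mentioned),
        "hallucinated_titles": sorted(hallucinated),
        "ungrounded_asins": sorted(ungrounded_asins),
        "faithfulness_pct": faithfulness,
        "pass": faithfulness >= 90 and not ungrounded_asins,  # 90% threshold from the judge rubric
    }
```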
A8. Custom "Manga-Domain Accuracy" Metric
Metric definition — MangaDomainAccuracy (MDA):
A composite score measuring domain-specific factual correctness across four sub-dimensions:
| Sub-dimension | Weight | What It Catches |
|---|---|---|
| Genre accuracy | 25% | Confusing shōnen (e.g., Naruto) with seinen (e.g., Berserk), or josei with shōjo |
| Author attribution | 25% | Attributing One Piece to anyone other than Eiichiro Oda |
| Volume/chapter accuracy | 25% | Stating Naruto has 80 volumes when it has 72 |
| Publication metadata | 25% | Wrong publisher, serialization magazine, or release dates |
Judge model implementation in Bedrock:
1. Create a judge model evaluation task using Claude 3.5 Sonnet as the judge.

2. Judge rubric (provided as system prompt):
You are a manga domain expert evaluating a chatbot response.
EVALUATION RUBRIC — Score each dimension 0-5:
GENRE ACCURACY (0-5):
5 = All genre classifications correct with proper subcategories
3 = Genres mostly correct but imprecise (e.g., "action" instead of "shōnen")
1 = Genre misattributions present (calling seinen manga "for kids")
0 = Major genre confusion that could mislead buyers
AUTHOR ATTRIBUTION (0-5):
5 = All authors correctly attributed
3 = Minor errors (misspelled names but correct person)
0 = Wrong author attributed to a work
VOLUME/CHAPTER ACCURACY (0-5):
5 = Exact counts or correctly stated "approximately" with reasonable range
3 = Within 10% of actual count
0 = Wildly wrong numbers or fabricated counts
PUBLICATION METADATA (0-5):
5 = Publisher, magazine, and dates all correct
3 = Mostly correct with minor errors
0 = Fabricated publication details
Output JSON: {"genre": X, "author": X, "volume": X, "publication": X, "mda_score": avg}
3. Supplementary ground-truth data: the judge receives a `facts` field with verified metadata from the product catalog so it can check claims against reality, not just plausibility.

4. Threshold: MDA ≥ 4.0 required for Sonnet intents; MDA ≥ 3.5 for Haiku intents.
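A minimal sketch of turning the judge's JSON output into a per-tier pass/fail decision; it simply restates the equal 25% weighting and the thresholds above, and the function name is illustrative:

```python
# MDA scoring sketch: parse the judge's JSON output, compute the equally weighted composite
# (25% per sub-dimension, i.e. a simple average of the four 0-5 scores), and compare it
# against the per-tier threshold.
import json

TIER_THRESHOLDS = {"sonnet": 4.0, "haiku": 3.5}

def evaluate_mda(judge_output: str, tier: str) -> dict:
    scores = json.loads(judge_output)  # e.g. {"genre": 5, "author": 4, "volume": 3, "publication": 5}
    dims = ["genre", "author", "volume", "publication"]
    mda = sum(scores[d] for d in dims) / len(dims)
    return {"mda_score": mda, "pass": mda >= TIER_THRESHOLDS[tier]}

# Example: a response with one volume-count slip still passes the Haiku tier.
print(evaluate_mda('{"genre": 5, "author": 5, "volume": 3, "publication": 4}', tier="haiku"))
# -> {'mda_score': 4.25, 'pass': True}
```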
A9. Bilingual Evaluation Pipeline for Return Request Intent
Problem: Sonnet occasionally over-translates Japanese product names (e.g., translating "進撃の巨人" to "Attack on Titan" when the customer asked about the Japanese edition specifically).
Bilingual evaluation pipeline design:
Evaluation Dataset (200 samples)
├── 100 English queries
│ ├── EN ground truth
│ └── EN-specific metrics
└── 100 Japanese queries
├── JA ground truth
└── JA-specific metrics
├── Product name preservation score
├── Honorific handling accuracy
└── Script consistency (kanji/katakana/romaji)
Language-specific metrics:
| Metric | Language | Description |
|---|---|---|
| Product name preservation | JA | Does the model keep Japanese product names in their original script when the customer used Japanese? Score: % of product names preserved correctly |
| Script consistency | JA | If the customer writes in kanji, does the model respond in kanji (not romaji or English)? |
| Honorific handling | JA | Correct use of お客様 (okyakusama) and polite forms in return instructions |
| Return policy accuracy | Both | Are the return eligibility rules correct per Amazon JP policy? |
| Actionability | Both | Does the response include the correct return initiation steps? |
| Code-switch detection | Both | Penalize unnecessary language switching mid-response |
Implementation:
- Separate evaluation jobs per language — don't mix languages in one job (metrics are not comparable cross-language)
- Language-specific judge prompts — The JA judge is instructed in Japanese and evaluates in Japanese context
- Product name verification Lambda: Post-processes responses to check if Japanese product names were preserved when they should have been:
  ```python
  def check_name_preservation(response: str, expected_names_ja: list[str]) -> float:
      # Fraction of expected Japanese product names that appear verbatim in the response.
      if not expected_names_ja:
          return 1.0
      preserved = sum(1 for name in expected_names_ja if name in response)
      return preserved / len(expected_names_ja)
  ```

- Aggregate reporting: Per-intent, per-language scorecard with separate pass/fail thresholds.
Decision rule: The model must pass thresholds in both languages independently. Passing English but failing Japanese blocks promotion for that intent.
Very Hard
A10. Cross-Model Consistency in Multi-Turn Conversations
Problem: A conversation starts with chitchat (Haiku) then shifts to recommendation (Sonnet). The model switch can cause coherence breaks — Sonnet may not maintain the conversational tone established by Haiku, or may contradict something Haiku said.
Evaluation framework design:
1. Transition point taxonomy:
| Transition Type | Example | Risk Level |
|---|---|---|
| Haiku → Sonnet (quality upgrade) | `chitchat` → `recommendation` | Medium — tone shift |
| Sonnet → Haiku (cost downgrade) | `product_question` → `faq` | High — quality/detail regression |
| Haiku → Haiku (same model) | `faq` → `order_tracking` | Low — baseline |
| Sonnet → Sonnet (same model) | `recommendation` → `product_discovery` | Low — baseline |
2. Evaluation dataset construction:
Create 100 multi-turn conversations (5–8 turns each) with at least one model transition per conversation:
{
"conversation_id": "mt-eval-027",
"turns": [
{"role": "user", "content": "Hey! I just started reading manga", "intent": "chitchat", "model": "haiku"},
{"role": "assistant", "content": "That's awesome! Welcome to the manga world...", "model": "haiku"},
{"role": "user", "content": "Can you recommend something dark and mature?", "intent": "recommendation", "model": "sonnet"},
{"role": "assistant", "content": "[EVALUATE THIS TURN]", "model": "sonnet"}
],
"transition_points": [2],
"evaluation_focus": "Does the Sonnet response maintain the friendly, enthusiastic tone Haiku established while delivering a high-quality recommendation?"
}
3. Metrics for transition quality:
| Metric | Measurement |
|---|---|
| Tone consistency | Judge model rates tone match between pre- and post-transition turns (1–5) |
| Entity continuity | Any entities mentioned before the transition must be correctly referenced after |
| No contradiction | Post-transition model must not contradict facts stated by pre-transition model |
| Conversation flow naturalness | Human evaluators rate whether the transition feels jarring (blind test — they don't know models switched) |
4. Implementation:
- Run the full multi-turn conversation through the orchestrator (ECS Fargate) with actual model routing
- Capture the conversation in DynamoDB with model IDs per turn
- Post-process: extract transition points and evaluate each transition
- Compare against baseline: the same conversations run entirely on Sonnet (no transitions)
- Coherence degradation = BaselineScore − TransitionScore (a positive value means the model switch hurt coherence)
5. Acceptance criteria:
- Average coherence degradation < 5%
- No single transition with degradation > 15%
- Entity continuity: 100% (hard requirement — contradicting order numbers or product names is unacceptable)
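A minimal sketch of the acceptance check, assuming per-transition judge scores on a 1–5 scale for both the routed run and the all-Sonnet baseline (the input shape is an assumption):

```python
# Transition-quality gate sketch: compute per-transition coherence degradation as
# baseline_score - transition_score (normalized to the 1-5 judge scale) and apply the
# acceptance criteria. Input lists are paired by conversation/transition.
def transition_gate(transition_scores: list[float],
                    baseline_scores: list[float],
                    entity_continuity_ok: list[bool]) -> bool:
    # Degradation as a fraction of the 5-point scale so thresholds read as percentages.
    degradations = [(b - t) / 5.0 for t, b in zip(transition_scores, baseline_scores)]

    avg_ok = sum(degradations) / len(degradations) < 0.05   # average degradation < 5%
    worst_ok = max(degradations) <= 0.15                    # no single transition > 15%
    entities_ok = all(entity_continuity_ok)                 # 100% entity continuity

    return avg_ok and worst_ok and entities_ok
```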
A11. Evaluating Fine-Tuned Haiku vs Sonnet for Product Discovery
End-to-end pipeline:
Phase 1 — Training data preparation:
- Extract 50K `product_discovery` conversations from DynamoDB (last 6 months)
- Filter to high-quality interactions:
  - CSAT ≥ 4 stars
  - Conversation led to a product click or purchase
  - No escalation triggered
- Format as instruction-tuning pairs:

  ```json
  {
    "system": "You are MangaAssist, a manga discovery assistant...",
    "messages": [
      {"role": "user", "content": "I want to explore horror manga"},
      {"role": "assistant", "content": "Horror manga has some incredible titles..."}
    ]
  }
  ```

- Split: 80% train (40K), 10% validation (5K), 10% holdout test (5K)
- Decontaminate: Ensure no overlap between training data and the Bedrock evaluation golden dataset
Phase 2 — Fine-tuning on SageMaker:
- Base model: Claude 3 Haiku (via Bedrock custom model training or SageMaker JumpStart)
- Training config: learning rate 1e-5, 3 epochs, batch size 8
- Track training loss + validation loss for overfitting detection
- Early stopping if validation loss increases for 2 consecutive checkpoints
Phase 3 — Triple evaluation:
| Model | ID |
|---|---|
| Claude 3.5 Sonnet (production) | model_A |
| Claude 3 Haiku (base) | model_B |
| Fine-tuned Haiku | model_C |
Run Bedrock evaluation for all three models with the product_discovery golden dataset (200 samples — separate from training data).
Phase 4 — Statistical rigor:
1. Paired bootstrap test (n = 10,000 resamples): compare `model_C` vs `model_A` BERTScores (see the sketch after this list)
   - Null hypothesis: `model_C` is not better than `model_A`
   - Reject if p < 0.05

2. McNemar's test: for binary outcomes (did the model produce a helpful response? yes/no)
   - Compares the disagreements between models on the same samples

3. Cross-validation on the golden dataset: run 5-fold CV to check whether results are stable across data splits

4. Overfitting detection:
   - Compare performance on training data vs. holdout test vs. golden dataset
   - If training exceeds holdout by more than 10%, the fine-tuned model is overfitting
   - Specifically test on manga titles released after the training data cutoff (temporal generalization)
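A minimal paired-bootstrap sketch (NumPy); the inputs are assumed to be per-sample BERTScore arrays for `model_C` and `model_A`, aligned by evaluation example:

```python
# Paired bootstrap sketch: resample per-sample BERTScore differences with replacement and
# estimate the probability that fine-tuned Haiku (model_C) is not better than production
# Sonnet (model_A). Score arrays are paired by evaluation sample.
import numpy as np

def paired_bootstrap_p(scores_c: np.ndarray, scores_a: np.ndarray,
                       n_resamples: int = 10_000, seed: int = 42) -> float:
    rng = np.random.default_rng(seed)
    diffs = scores_c - scores_a          # positive diff means model_C scored higher
    n = len(diffs)
    resampled_means = np.array([
        diffs[rng.integers(0, n, size=n)].mean() for _ in range(n_resamples)
    ])
    # One-sided p-value: fraction of resamples where model_C does NOT beat model_A.
    return float((resampled_means <= 0).mean())

# Reject the null hypothesis (model_C not better) if the returned p-value is < 0.05.
# p = paired_bootstrap_p(bertscores_model_c, bertscores_model_a)
```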
Phase 5 — Promotion criteria:
| Criterion | Threshold | Rationale |
|---|---|---|
| BERTScore vs Sonnet | ≥ 95% of Sonnet's score | Allow slight quality trade for cost |
| Faithfulness | ≥ Sonnet's score | No regression on hallucination |
| MangaDomainAccuracy | ≥ 3.8 | Domain correctness is non-negotiable |
| Latency P99 | ≤ 80% of Sonnet | Fine-tuned Haiku should be faster |
| Cost | ≤ 30% of Sonnet | Must justify the fine-tuning investment |
| All statistical tests | p < 0.05 | Results are not due to chance |
A12. Continuous Weekly Re-Evaluation System
End-to-end architecture:
┌──────────────────────────────────────────────────────────┐
│ Weekly Evaluation Pipeline │
├──────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌────────────┐ │
│ │ DynamoDB │───►│ Data Sampling │───►│ Golden DS │ │
│ │ Conv. Logs │ │ Lambda │ │ Refresh │ │
│ └─────────────┘ └──────────────┘ │ (S3) │ │
│ └─────┬──────┘ │
│ │ │
│ ┌─────────────┐ ┌──────────────┐ ┌──────▼──────┐ │
│ │ EventBridge │───►│ Step Function │───►│ Bedrock │ │
│ │ Weekly Cron │ │ Orchestrator │ │ Eval Jobs │ │
│ └─────────────┘ └──────────────┘ │ (10 intents)│ │
│ └─────┬──────┘ │
│ │ │
│ ┌─────────────┐ ┌──────────────┐ ┌──────▼──────┐ │
│ │ SNS/Slack │◄───│ Human Review │◄───│ Drift │ │
│ │ Alerts │ │ Gate │ │ Detection │ │
│ └─────────────┘ └──────────────┘ └─────┬──────┘ │
│ │ │ │
│ ┌────▼────┐ ┌─────▼──────┐ │
│ │ Approve/ │ │ Model │ │
│ │ Reject │ │ Reassign │ │
│ └────┬────┘ │ Proposal │ │
│ │ └────────────┘ │
│ ┌────▼────┐ │
│ │ SSM │ Update model routing │
│ │ Param │ in Parameter Store │
│ │ Store │ │
│ └─────────┘ │
└──────────────────────────────────────────────────────────┘
Component details:
1. Data sampling (Lambda — weekly trigger):
- Query DynamoDB for conversations from the past 7 days
- Stratified sample: 50 conversations per intent (500 total)
- Filter for conversations with CSAT scores (to derive ground truth)
- Generate candidate golden dataset entries by pairing high-CSAT responses with their prompts
- Human reviewers validate 20% of new samples; the rest use automated quality filters
2. Golden dataset refresh strategy:
- Rolling window: Keep last 12 weeks of evaluated samples (2,400 per intent max)
- Seasonal awareness: Weight recent samples higher during seasonal events (manga convention season, holiday promotions)
- Distribution monitoring: Track intent distribution shift — if product_discovery spikes from 15% to 25% of traffic (new manga season), increase that intent's evaluation weight
3. Drift detection (Lambda post-evaluation):
import numpy as np

# The 10 MangaAssist intents (see A4).
INTENTS = ["recommendation", "product_question", "product_discovery", "faq", "order_tracking",
           "return_request", "promotion", "checkout_help", "chitchat", "escalation"]

def detect_drift(current_scores: dict, historical_scores: list[dict]) -> dict:
drift_report = {}
for intent in INTENTS:
current = current_scores[intent]['bertscore']
historical_mean = np.mean([s[intent]['bertscore'] for s in historical_scores[-8:]])
historical_std = np.std([s[intent]['bertscore'] for s in historical_scores[-8:]])
z_score = (current - historical_mean) / historical_std if historical_std > 0 else 0
drift_report[intent] = {
'current': current,
'historical_mean': historical_mean,
'z_score': z_score,
'drift_detected': abs(z_score) > 2.0, # 95% confidence
'direction': 'degradation' if z_score < -2 else 'improvement' if z_score > 2 else 'stable'
}
return drift_report
4. Automatic model reassignment proposal:
When drift is detected (a minimal proposal sketch follows below):
- If an intent on Haiku degrades below threshold → propose upgrade to Sonnet
- If an intent on Sonnet shows stable quality for 4+ weeks at levels Haiku could serve → propose downgrade to save cost
- Proposal includes: projected cost impact, quality delta, confidence interval
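A minimal sketch of the proposal logic, consuming the drift report produced by detect_drift above; the stability-window bookkeeping and the proposal dict shape are assumptions:

```python
# Reassignment proposal sketch: turn a drift report plus the current routing table into
# upgrade/downgrade proposals. `stable_weeks` tracks how long each intent has been stable;
# the proposal shape is illustrative.
def propose_reassignments(drift_report: dict, current_routing: dict[str, str],
                          stable_weeks: dict[str, int]) -> list[dict]:
    proposals = []
    for intent, report in drift_report.items():
        model = current_routing[intent]
        if model == "haiku" and report["direction"] == "degradation":
            proposals.append({"intent": intent, "from": "haiku", "to": "sonnet",
                              "reason": f"quality drift (z={report['z_score']:.1f})"})
        elif model == "sonnet" and report["direction"] == "stable" and stable_weeks.get(intent, 0) >= 4:
            proposals.append({"intent": intent, "from": "sonnet", "to": "haiku",
                              "reason": "4+ weeks of stable quality at Haiku-serviceable levels"})
    return proposals
```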
5. Human-in-the-loop approval:
- Proposals are sent to Slack and create Jira tickets
- Reassignment requires explicit approval from an ML engineer
- Auto-approval is only allowed for Haiku → Sonnet upgrades (quality protection), never for downgrades
- Approval has a 48-hour timeout; if not acted upon, the change is deferred to the next weekly cycle
6. Guardrails against bad reassignment:
| Guardrail | Implementation |
|---|---|
| No more than 2 intent changes per week | Prevents cascading changes from a single anomalous week |
| Rollback window | Changes are deployed as canaries (5% → 25% → 100% over 72 hours) |
| Real-time CSAT monitoring | CloudWatch alarm: if CSAT drops > 5% for a reassigned intent within 24 hours → auto-rollback |
| Change freeze periods | No automatic reassignments during Prime Day, holiday shopping season |
| Minimum sample size | Drift detection requires at least 100 conversations per intent in the evaluation window |
| Evaluation dataset integrity | Automated checks ensure the golden dataset hasn't been corrupted or accidentally overwritten |
| Audit trail | Every reassignment (proposed, approved, deployed, rolled back) is logged in DynamoDB with full context |