01 — Bedrock Model Evaluation: Sonnet vs Haiku — Answers
Easy
A1. Bedrock Model Evaluation Job for MangaAssist FAQ Intent
An Amazon Bedrock Model Evaluation job is a managed capability that lets you compare foundation model outputs against reference answers using automated metrics or a judge model.
Configuration for the MangaAssist `faq` intent:

- Input dataset — JSONL file stored in S3 with the following schema:

  ```json
  {
    "prompt": "What is the return policy for manga pre-orders?",
    "referenceResponse": "Manga pre-orders can be cancelled any time before the release date at no charge. After shipping, standard Amazon return policy applies — you have 30 days to initiate a return.",
    "category": "faq",
    "conversationId": "eval-faq-001",
    "metadata": { "language": "en", "source": "golden_dataset_v3" }
  }
  ```

- Model configuration — two model entries:
  - `anthropic.claude-3-5-sonnet-20241022-v2:0` with temperature=0.3, max_tokens=512
  - `anthropic.claude-3-haiku-20240307-v1:0` with identical inference parameters
- Metrics selected:
  - Automatic: ROUGE-L, BERTScore, exact match rate
  - Judge model: use Claude 3.5 Sonnet as a judge to rate responses on accuracy (1–5), helpfulness (1–5), and safety (pass/fail)
- Evaluation task type: text generation (single-turn for FAQ)
- Output: results written to S3 with per-sample scores and aggregate metrics per model.
The job runs both models against the same 200-sample FAQ dataset and produces a side-by-side comparison report.
A2. Evaluation Metrics for the Recommendation Intent
Relevant metrics for recommendation:
| Metric | Why It Matters |
|---|---|
| BERTScore | Captures semantic similarity — essential because two valid manga recommendations can be worded completely differently |
| Human preference (judge model) | The gold standard for subjective quality — "did the recommendation actually help the customer?" |
| Faithfulness (RAG-specific) | Ensures recommended titles exist in the catalog and metadata (author, genre, volume count) is accurate |
| Diversity score | Measures whether the model recommends varied titles vs. repeating the same popular series |
| Relevance to stated preferences | Custom metric: does the recommendation match the genres/themes the user expressed interest in? |
Why lexical overlap metrics (ROUGE) are insufficient:
Manga recommendations are inherently open-ended. Consider:
- Reference: "You might enjoy Attack on Titan — it's a dark fantasy action series with intense world-building."
- Model output: "Based on your interest in dark shōnen series, I'd recommend Jujutsu Kaisen — it shares the intense battle sequences and deep lore you enjoyed."
Both are excellent recommendations, but ROUGE-L would score near zero because they share almost no n-grams. The recommendation intent demands semantic and preference-alignment metrics rather than surface-level text overlap.
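To make this concrete, here is a minimal sketch (assuming the `rouge-score` Python package) that scores the example pair above; the near-zero ROUGE-L F1 is exactly the failure mode that motivates semantic and preference-alignment metrics.

```python
# Minimal sketch: ROUGE-L on two valid but differently worded recommendations.
from rouge_score import rouge_scorer

reference = ("You might enjoy Attack on Titan — it's a dark fantasy action series "
             "with intense world-building.")
candidate = ("Based on your interest in dark shōnen series, I'd recommend Jujutsu Kaisen — "
             "it shares the intense battle sequences and deep lore you enjoyed.")

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
score = scorer.score(reference, candidate)["rougeL"]

# Both answers are good recommendations, yet the lexical overlap is tiny.
print(f"ROUGE-L F1: {score.fmeasure:.2f}")
```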
A3. Creating a Golden Dataset for Product Question Intent
Dataset specification — 200 QA pairs for product_question:
Fields per record:
| Field | Type | Description |
|---|---|---|
| `eval_id` | string | Unique identifier (e.g., pq-001) |
| `prompt` | string | The customer question |
| `conversation_history` | array | Previous turns (for multi-turn context) |
| `reference_response` | string | Expert-written ideal answer |
| `product_asin` | string | The product the question relates to |
| `grounding_context` | string | Product catalog data from OpenSearch that should be used |
| `acceptable_alternatives` | array | Other valid answer phrasings |
| `difficulty` | enum | simple, requires_lookup, multi_product_comparison |
| `language` | string | en or ja |
Multi-turn handling:
For multi-turn conversations, include the conversation_history as an array of {role, content} objects:
{
"eval_id": "pq-042",
"conversation_history": [
{"role": "user", "content": "I'm looking for manga similar to Naruto"},
{"role": "assistant", "content": "Great choice! Some popular shōnen series similar to Naruto include..."},
{"role": "user", "content": "How many volumes does the second one have?"}
],
"prompt": "How many volumes does the second one have?",
"reference_response": "My Hero Academia currently has 40 volumes...",
"product_asin": "B09XYZ1234"
}
Creation process:
- Sample 500 real `product_question` conversations from DynamoDB (last 90 days)
- Filter to those with positive CSAT ratings (4–5 stars)
- Deduplicate by semantic similarity (cosine > 0.92 = duplicate; see the sketch after this list)
- Expert review — manga domain experts write/refine reference answers
- Stratify — ensure balanced coverage: 40% simple lookups, 35% comparison questions, 25% multi-turn dependent
- Final selection — 200 pairs with inter-annotator agreement ≥ 0.85 (Cohen's kappa)
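A minimal sketch of the semantic-deduplication step, assuming a sentence-transformers embedding model (the specific model name is illustrative, not part of the original design):

```python
# Semantic deduplication sketch: drop any candidate question whose cosine similarity to an
# already-kept sample exceeds 0.92. The embedding model choice is an assumption.
import numpy as np
from sentence_transformers import SentenceTransformer

def deduplicate(questions: list[str], threshold: float = 0.92) -> list[str]:
    model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
    embeddings = model.encode(questions, normalize_embeddings=True)

    kept_indices: list[int] = []
    for i, emb in enumerate(embeddings):
        # Cosine similarity reduces to a dot product on normalized vectors.
        if all(np.dot(emb, embeddings[j]) <= threshold for j in kept_indices):
            kept_indices.append(i)
    return [questions[i] for i in kept_indices]
```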
Medium
A4. Per-Intent Model Assignment Strategy
Decision matrix for MangaAssist's 10 intents:
| Intent | Assigned Model | Reasoning |
|---|---|---|
| `recommendation` | Sonnet | Requires nuanced genre understanding, cultural context, preference reasoning. Quality-critical — directly impacts conversion. |
| `product_question` | Sonnet | Needs accurate product detail extraction from RAG context. Hallucination risk is high with cheaper models. |
| `product_discovery` | Sonnet | Open-ended exploration requires strong reasoning to narrow from 100K+ catalog items. |
| `faq` | Haiku | Answers are well-defined and mostly template-based. Haiku achieves 0.94 BERTScore — above the 0.85 threshold. |
| `order_tracking` | Haiku | Structured data retrieval (order status from DynamoDB). Low creative reasoning needed. |
| `return_request` | Haiku | Follows a decision tree: eligible/ineligible, initiate return. Haiku handles this reliably. |
| `promotion` | Haiku | Promotional information is retrieved from a fixed dataset. Formatting is straightforward. |
| `checkout_help` | Haiku | Step-by-step guidance with minimal creative generation. |
| `chitchat` | Haiku | Low-stakes, conversational. Speed matters more than depth. |
| `escalation` | Haiku | The model's job is to summarize and hand off to a human agent — minimal generation needed. |
Result: 3 intents on Sonnet, 7 on Haiku → estimated 65% cost reduction vs. routing everything through Sonnet.
Quality thresholds per tier:
- Sonnet intents (quality-critical): BERTScore ≥ 0.88, faithfulness ≥ 0.92, human preference win rate ≥ 70%
- Haiku intents (efficiency-focused): BERTScore ≥ 0.82, faithfulness ≥ 0.88, task completion rate ≥ 95%
A5. Model Decision for Order Tracking at the Quality Threshold
Given data:
- Haiku BERTScore on order_tracking: 0.82
- Sonnet BERTScore on order_tracking: 0.91
- Business threshold for order-related intents: 0.88
- Cost: Sonnet is ~12× more expensive than Haiku
Decision framework:

1. Threshold check: Haiku (0.82) is below the 0.88 threshold. Sonnet (0.91) is above. Direct assignment: Sonnet.

2. Cost impact analysis:
   - `order_tracking` represents ~12% of total traffic (~36K conversations/day)
   - At Sonnet pricing: ~$432/day for this intent
   - At Haiku pricing: ~$36/day
   - Delta: $396/day ($11,880/month)

3. Before committing to Sonnet, explore mitigation strategies:

   Option A — Prompt optimization for Haiku:
   - Add structured output templates to Haiku's prompt
   - Include 3–5 few-shot examples of `order_tracking` responses
   - Re-evaluate: if BERTScore rises to ≥ 0.88, use Haiku

   Option B — Hybrid routing with fallback (see the routing sketch after this list):
   - Route to Haiku first
   - Run a lightweight quality check on the response (regex for order number format, status field presence)
   - If the quality check fails → re-invoke with Sonnet
   - Expected fallback rate: ~15% → blended cost = Haiku cost + 15% × Sonnet cost ≈ $36 + 0.15 × $432 ≈ $101/day

   Option C — Fine-tuned Haiku:
   - Fine-tune Haiku on 5K `order_tracking` examples via SageMaker
   - Typically yields +0.04–0.08 BERTScore improvement
   - If fine-tuned Haiku reaches 0.88 → use it at ~$45/day

4. Recommended path: Start with Option B (hybrid routing) for immediate deployment, and invest in Option C (fine-tuning) as a medium-term cost reduction. Only fall back to full Sonnet if both options fail to meet the 0.88 threshold.
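A minimal sketch of Option B's fallback routing, using the boto3 Bedrock Runtime Converse API; the quality-check heuristics and helper names are illustrative assumptions rather than the production rules:

```python
# Hybrid routing sketch for order_tracking: try Haiku first, fall back to Sonnet when a
# cheap heuristic quality check fails. Model IDs match the section; regex patterns and
# status keywords are hypothetical placeholders.
import re
import boto3

bedrock = boto3.client("bedrock-runtime")

HAIKU = "anthropic.claude-3-haiku-20240307-v1:0"
SONNET = "anthropic.claude-3-5-sonnet-20241022-v2:0"

def invoke(model_id: str, prompt: str) -> str:
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0.3, "maxTokens": 512},
    )
    return response["output"]["message"]["content"][0]["text"]

def passes_quality_check(text: str) -> bool:
    # Illustrative checks only: an Amazon-style order number and an explicit status word.
    has_order_number = re.search(r"\b\d{3}-\d{7}-\d{7}\b", text) is not None
    has_status = any(w in text.lower() for w in ("shipped", "delivered", "in transit", "preparing"))
    return has_order_number and has_status

def answer_order_tracking(prompt: str) -> str:
    draft = invoke(HAIKU, prompt)
    if passes_quality_check(draft):
        return draft
    return invoke(SONNET, prompt)  # ~15% of requests expected to take this path
```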
A6. Automating Bedrock Evaluation in CI/CD
Pipeline design for MangaAssist:
New Model Version Available (e.g., Sonnet v2)
│
▼
┌─────────────────────┐
│ 1. Trigger │ EventBridge detects new model version in Bedrock
│ (Automated) │ or manual trigger via CI/CD webhook
└──────────┬──────────┘
▼
┌─────────────────────┐
│ 2. Fetch Golden │ Pull latest evaluation datasets from S3
│ Datasets │ (10 datasets, one per intent)
└──────────┬──────────┘
▼
┌─────────────────────┐
│ 3. Run Bedrock │ For each intent:
│ Evaluation Jobs │ - Current production model
│ (Parallel) │ - New candidate model
│ │ Same inference params, same dataset
└──────────┬──────────┘
▼
┌─────────────────────┐
│ 4. Aggregate Results │ Compute per-intent deltas:
│ │ ΔBERTScore, ΔFaithfulness,
│ │ ΔLatency, ΔCost
└──────────┬──────────┘
▼
┌─────────────────────┐
│ 5. Gate Check │ Promotion criteria:
│ │ - No intent regresses > 2%
│ │ - At least 3 intents improve > 1%
│ │ - Latency P99 doesn't increase > 10%
│ │ - Cost doesn't increase > 5%
└──────────┬──────────┘
▼
Pass? ──No──► Notify team, blocked
│
Yes
▼
┌─────────────────────┐
│ 6. Promote to │ Update model ID in Parameter Store
│ Canary Stage │ Deploy to 5% traffic via ECS
└─────────────────────┘
Implementation details:
- CodePipeline orchestrates the stages
- Step Functions manages parallel Bedrock evaluation jobs (one per intent × two models = 20 jobs)
- Lambda aggregates results and applies gate logic
- Golden datasets are versioned in S3 with lifecycle policies (refresh monthly)
- Results are stored in DynamoDB for trend analysis
- SNS notifications for pass/fail with detailed report links
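A minimal sketch of the step-5 gate logic the aggregation Lambda could apply; the input shape (per-intent relative deltas plus overall latency/cost deltas) is an assumption:

```python
# Promotion gate sketch: applies the four step-5 criteria to per-intent deltas.
# `deltas` maps intent -> relative change of candidate vs. production model,
# expressed as fractions (e.g., 0.02 == +2%). The data shape is an assumption.
def gate_check(deltas: dict[str, dict[str, float]],
               latency_p99_delta: float,
               cost_delta: float) -> tuple[bool, list[str]]:
    reasons: list[str] = []

    quality = [d["bertscore"] for d in deltas.values()]
    if any(q < -0.02 for q in quality):
        reasons.append("an intent regressed by more than 2%")
    if sum(1 for q in quality if q > 0.01) < 3:
        reasons.append("fewer than 3 intents improved by more than 1%")
    if latency_p99_delta > 0.10:
        reasons.append("P99 latency increased by more than 10%")
    if cost_delta > 0.05:
        reasons.append("cost increased by more than 5%")

    return (len(reasons) == 0, reasons)

# Example: this candidate passes because all four criteria hold.
ok, why_not = gate_check(
    deltas={"faq": {"bertscore": 0.015}, "recommendation": {"bertscore": 0.02},
            "order_tracking": {"bertscore": 0.012}, "chitchat": {"bertscore": -0.005}},
    latency_p99_delta=0.04,
    cost_delta=0.01,
)
```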
Hard
A7. Evaluating RAG-Augmented Outputs for Recommendation Intent
Challenge: The recommendation intent retrieves product embeddings from OpenSearch Serverless and injects them into the prompt context. Standard evaluation ignores whether the model used the retrieved context correctly.
RAG-specific evaluation dimensions:
| Dimension | What It Measures | Metric |
|---|---|---|
| Faithfulness | Does the response ONLY contain manga titles/details that appear in the retrieved context? | Hallucination detection — entity-level precision against retrieved documents |
| Relevance | Does the response address the user's stated preferences using the retrieved products? | Custom judge rubric: preference-alignment score (1–5) |
| Completeness | Does the response use the most relevant retrieved products, or does it cherry-pick? | Recall of top-3 relevant products from the retrieval set |
| Context utilization | How much of the retrieved context was useful? | Ratio of cited products to total retrieved products |
Implementation in Bedrock evaluation:
- Augmented evaluation dataset:
{
"prompt": "I love dark fantasy manga with complex magic systems. What should I read?",
"retrievedContext": [
{"asin": "B001", "title": "Berserk Vol 1", "genre": "dark_fantasy", "author": "Kentaro Miura"},
{"asin": "B002", "title": "Fullmetal Alchemist Vol 1", "genre": "dark_fantasy", "author": "Hiromu Arakawa"},
{"asin": "B003", "title": "One Piece Vol 1", "genre": "adventure", "author": "Eiichiro Oda"}
],
"referenceResponse": "Based on your interest in dark fantasy, I'd recommend Berserk by Kentaro Miura and Fullmetal Alchemist by Hiromu Arakawa...",
"expectedCitations": ["B001", "B002"],
"hallucination_traps": ["Claymore", "Dorohedoro"]
}
- Judge model prompt (for faithfulness):
You are evaluating a manga recommendation chatbot.
Given the RETRIEVED PRODUCTS and the MODEL RESPONSE:
- List every manga title mentioned in the response
- For each title, mark it as GROUNDED (appears in retrieved products) or HALLUCINATED
- Score faithfulness: (grounded titles) / (total titles mentioned) × 100
A score below 90% is FAIL.
- Automated hallucination detection: Post-process model output through a verification Lambda that:
  - Extracts manga titles via NER
  - Looks up each title in the OpenSearch catalog
  - Flags any title not found as a potential hallucination
  - Checks ASINs mentioned against the retrieval set
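A minimal sketch of that verification step; title extraction is simplified to a catalog lookup instead of a real NER model, and the ASIN pattern is deliberately loose, both assumptions for illustration:

```python
# Entity-level grounding check sketch: flag any title or ASIN in the response that is not
# present in the retrieved context. Title extraction is a naive substring match against
# known catalog titles rather than NER (an assumption for brevity).
import re

def grounding_report(response: str, retrieved: list[dict], catalog_titles: set[str]) -> dict:
    retrieved_titles = {doc["title"] for doc in retrieved}
    retrieved_asins = {doc["asin"] for doc in retrieved}

    # Titles the model mentioned (any known catalog title appearing verbatim).
    mentioned = {t for t in catalog_titles if t.lower() in response.lower()}
    hallucinated = {t for t in mentioned if t not in retrieved_titles}

    # ASINs must come from the retrieval set. Loose pattern; real ASINs are 10 characters.
    mentioned_asins = set(re.findall(r"\bB0\w+\b", response))
    ungrounded_asins = mentioned_asins - retrieved_asins

    faithfulness = (len(mentioned - hallucinated) / len(mentioned) * 100) if mentioned else 100.0
    return {
        "mentioned_titles": sorted(mentioned),
        "hallucinated_titles": sorted(hallucinated),
        "ungrounded_asins": sorted(ungrounded_asins),
        "faithfulness_pct": faithfulness,
        "pass": faithfulness >= 90 and not ungrounded_asins,  # 90% threshold from the judge rubric
    }
```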
A8. Custom "Manga-Domain Accuracy" Metric
Metric definition — MangaDomainAccuracy (MDA):
A composite score measuring domain-specific factual correctness across four sub-dimensions:
| Sub-dimension | Weight | What It Catches |
|---|---|---|
| Genre accuracy | 25% | Confusing shōnen (e.g., Naruto) with seinen (e.g., Berserk), or josei with shōjo |
| Author attribution | 25% | Attributing One Piece to anyone other than Eiichiro Oda |
| Volume/chapter accuracy | 25% | Stating Naruto has 80 volumes when it has 72 |
| Publication metadata | 25% | Wrong publisher, serialization magazine, or release dates |
Judge model implementation in Bedrock:
1. Create a judge model evaluation task using Claude 3.5 Sonnet as the judge.

2. Judge rubric (provided as system prompt):
You are a manga domain expert evaluating a chatbot response.
EVALUATION RUBRIC — Score each dimension 0-5:
GENRE ACCURACY (0-5):
5 = All genre classifications correct with proper subcategories
3 = Genres mostly correct but imprecise (e.g., "action" instead of "shōnen")
1 = Genre misattributions present (calling seinen manga "for kids")
0 = Major genre confusion that could mislead buyers
AUTHOR ATTRIBUTION (0-5):
5 = All authors correctly attributed
3 = Minor errors (misspelled names but correct person)
0 = Wrong author attributed to a work
VOLUME/CHAPTER ACCURACY (0-5):
5 = Exact counts or correctly stated "approximately" with reasonable range
3 = Within 10% of actual count
0 = Wildly wrong numbers or fabricated counts
PUBLICATION METADATA (0-5):
5 = Publisher, magazine, and dates all correct
3 = Mostly correct with minor errors
0 = Fabricated publication details
Output JSON: {"genre": X, "author": X, "volume": X, "publication": X, "mda_score": avg}
3. Supplementary ground-truth data: the judge receives a `facts` field with verified metadata from the product catalog so it can check claims against reality, not just plausibility.

4. Threshold: MDA ≥ 4.0 required for Sonnet intents; MDA ≥ 3.5 for Haiku intents.
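A minimal sketch of turning the judge's JSON output into a per-tier pass/fail decision; it simply restates the equal 25% weighting and the thresholds above, and the function name is illustrative:

```python
# MDA scoring sketch: parse the judge's JSON output, compute the equally weighted composite
# (25% per sub-dimension, i.e. a simple average of the four 0-5 scores), and compare it
# against the per-tier threshold.
import json

TIER_THRESHOLDS = {"sonnet": 4.0, "haiku": 3.5}

def evaluate_mda(judge_output: str, tier: str) -> dict:
    scores = json.loads(judge_output)  # e.g. {"genre": 5, "author": 4, "volume": 3, "publication": 5}
    dims = ["genre", "author", "volume", "publication"]
    mda = sum(scores[d] for d in dims) / len(dims)
    return {"mda_score": mda, "pass": mda >= TIER_THRESHOLDS[tier]}

# Example: a response with one volume-count slip still passes the Haiku tier.
print(evaluate_mda('{"genre": 5, "author": 5, "volume": 3, "publication": 4}', tier="haiku"))
# -> {'mda_score': 4.25, 'pass': True}
```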
A9. Bilingual Evaluation Pipeline for Return Request Intent
Problem: Sonnet occasionally over-translates Japanese product names (e.g., translating "進撃の巨人" to "Attack on Titan" when the customer asked about the Japanese edition specifically).
Bilingual evaluation pipeline design:
Evaluation Dataset (200 samples)
├── 100 English queries
│ ├── EN ground truth
│ └── EN-specific metrics
└── 100 Japanese queries
├── JA ground truth
└── JA-specific metrics
├── Product name preservation score
├── Honorific handling accuracy
└── Script consistency (kanji/katakana/romaji)
Language-specific metrics:
| Metric | Language | Description |
|---|---|---|
| Product name preservation | JA | Does the model keep Japanese product names in their original script when the customer used Japanese? Score: % of product names preserved correctly |
| Script consistency | JA | If the customer writes in kanji, does the model respond in kanji (not romaji or English)? |
| Honorific handling | JA | Correct use of お客様 (okyakusama) and polite forms in return instructions |
| Return policy accuracy | Both | Are the return eligibility rules correct per Amazon JP policy? |
| Actionability | Both | Does the response include the correct return initiation steps? |
| Code-switch detection | Both | Penalize unnecessary language switching mid-response |
Implementation:
- Separate evaluation jobs per language — don't mix languages in one job (metrics are not comparable cross-language)
- Language-specific judge prompts — The JA judge is instructed in Japanese and evaluates in Japanese context
- Product name verification Lambda: Post-processes responses to check if Japanese product names were preserved when they should have been:
  ```python
  def check_name_preservation(response: str, expected_names_ja: list[str]) -> float:
      # Fraction of expected Japanese product names that appear verbatim in the response.
      if not expected_names_ja:
          return 1.0
      preserved = sum(1 for name in expected_names_ja if name in response)
      return preserved / len(expected_names_ja)
  ```

- Aggregate reporting: Per-intent, per-language scorecard with separate pass/fail thresholds.
Decision rule: The model must pass thresholds in both languages independently. Passing English but failing Japanese blocks promotion for that intent.
Very Hard
A10. Cross-Model Consistency in Multi-Turn Conversations
Problem: A conversation starts with chitchat (Haiku) then shifts to recommendation (Sonnet). The model switch can cause coherence breaks — Sonnet may not maintain the conversational tone established by Haiku, or may contradict something Haiku said.
Evaluation framework design:
1. Transition point taxonomy:
| Transition Type | Example | Risk Level |
|---|---|---|
| Haiku → Sonnet (quality upgrade) | `chitchat` → `recommendation` | Medium — tone shift |
| Sonnet → Haiku (cost downgrade) | `product_question` → `faq` | High — quality/detail regression |
| Haiku → Haiku (same model) | `faq` → `order_tracking` | Low — baseline |
| Sonnet → Sonnet (same model) | `recommendation` → `product_discovery` | Low — baseline |
2. Evaluation dataset construction:
Create 100 multi-turn conversations (5–8 turns each) with at least one model transition per conversation:
{
"conversation_id": "mt-eval-027",
"turns": [
{"role": "user", "content": "Hey! I just started reading manga", "intent": "chitchat", "model": "haiku"},
{"role": "assistant", "content": "That's awesome! Welcome to the manga world...", "model": "haiku"},
{"role": "user", "content": "Can you recommend something dark and mature?", "intent": "recommendation", "model": "sonnet"},
{"role": "assistant", "content": "[EVALUATE THIS TURN]", "model": "sonnet"}
],
"transition_points": [2],
"evaluation_focus": "Does the Sonnet response maintain the friendly, enthusiastic tone Haiku established while delivering a high-quality recommendation?"
}
3. Metrics for transition quality:
| Metric | Measurement |
|---|---|
| Tone consistency | Judge model rates tone match between pre- and post-transition turns (1–5) |
| Entity continuity | Any entities mentioned before the transition must be correctly referenced after |
| No contradiction | Post-transition model must not contradict facts stated by pre-transition model |
| Conversation flow naturalness | Human evaluators rate whether the transition feels jarring (blind test — they don't know models switched) |
4. Implementation:
- Run the full multi-turn conversation through the orchestrator (ECS Fargate) with actual model routing
- Capture the conversation in DynamoDB with model IDs per turn
- Post-process: extract transition points and evaluate each transition
- Compare against baseline: the same conversations run entirely on Sonnet (no transitions)
- Coherence degradation = BaselineScore − TransitionScore (a positive value means the model switch hurt coherence)
5. Acceptance criteria:
- Average coherence degradation < 5%
- No single transition with degradation > 15%
- Entity continuity: 100% (hard requirement — contradicting order numbers or product names is unacceptable)
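A minimal sketch of the acceptance check, assuming per-transition judge scores on a 1–5 scale for both the routed run and the all-Sonnet baseline (the input shape is an assumption):

```python
# Transition-quality gate sketch: compute per-transition coherence degradation as
# baseline_score - transition_score (normalized to the 1-5 judge scale) and apply the
# acceptance criteria. Input lists are paired by conversation/transition.
def transition_gate(transition_scores: list[float],
                    baseline_scores: list[float],
                    entity_continuity_ok: list[bool]) -> bool:
    # Degradation as a fraction of the 5-point scale so thresholds read as percentages.
    degradations = [(b - t) / 5.0 for t, b in zip(transition_scores, baseline_scores)]

    avg_ok = sum(degradations) / len(degradations) < 0.05   # average degradation < 5%
    worst_ok = max(degradations) <= 0.15                    # no single transition > 15%
    entities_ok = all(entity_continuity_ok)                 # 100% entity continuity

    return avg_ok and worst_ok and entities_ok
```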
A11. Evaluating Fine-Tuned Haiku vs Sonnet for Product Discovery
End-to-end pipeline:
Phase 1 — Training data preparation:
- Extract 50K `product_discovery` conversations from DynamoDB (last 6 months)
- Filter to high-quality interactions:
  - CSAT ≥ 4 stars
  - Conversation led to a product click or purchase
  - No escalation triggered
- Format as instruction-tuning pairs:

  ```json
  {
    "system": "You are MangaAssist, a manga discovery assistant...",
    "messages": [
      {"role": "user", "content": "I want to explore horror manga"},
      {"role": "assistant", "content": "Horror manga has some incredible titles..."}
    ]
  }
  ```

- Split: 80% train (40K), 10% validation (5K), 10% holdout test (5K)
- Decontaminate: Ensure no overlap between training data and the Bedrock evaluation golden dataset
Phase 2 — Fine-tuning on SageMaker:
- Base model: Claude 3 Haiku (via Bedrock custom model training or SageMaker JumpStart)
- Training config: learning rate 1e-5, 3 epochs, batch size 8
- Track training loss + validation loss for overfitting detection
- Early stopping if validation loss increases for 2 consecutive checkpoints
Phase 3 — Triple evaluation:
| Model | ID |
|---|---|
| Claude 3.5 Sonnet (production) | model_A |
| Claude 3 Haiku (base) | model_B |
| Fine-tuned Haiku | model_C |
Run Bedrock evaluation for all three models with the product_discovery golden dataset (200 samples — separate from training data).
Phase 4 — Statistical rigor:
1. Paired bootstrap test (n = 10,000 resamples): compare `model_C` vs `model_A` BERTScores (see the sketch after this list)
   - Null hypothesis: `model_C` is not better than `model_A`
   - Reject if p < 0.05

2. McNemar's test: for binary outcomes (did the model produce a helpful response? yes/no)
   - Compares the disagreements between models on the same samples

3. Cross-validation on the golden dataset: run 5-fold CV to check whether results are stable across data splits

4. Overfitting detection:
   - Compare performance on training data vs. holdout test vs. golden dataset
   - If training exceeds holdout by more than 10%, the fine-tuned model is overfitting
   - Specifically test on manga titles released after the training data cutoff (temporal generalization)
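A minimal paired-bootstrap sketch (NumPy); the inputs are assumed to be per-sample BERTScore arrays for `model_C` and `model_A`, aligned by evaluation example:

```python
# Paired bootstrap sketch: resample per-sample BERTScore differences with replacement and
# estimate the probability that fine-tuned Haiku (model_C) is not better than production
# Sonnet (model_A). Score arrays are paired by evaluation sample.
import numpy as np

def paired_bootstrap_p(scores_c: np.ndarray, scores_a: np.ndarray,
                       n_resamples: int = 10_000, seed: int = 42) -> float:
    rng = np.random.default_rng(seed)
    diffs = scores_c - scores_a          # positive diff means model_C scored higher
    n = len(diffs)
    resampled_means = np.array([
        diffs[rng.integers(0, n, size=n)].mean() for _ in range(n_resamples)
    ])
    # One-sided p-value: fraction of resamples where model_C does NOT beat model_A.
    return float((resampled_means <= 0).mean())

# Reject the null hypothesis (model_C not better) if the returned p-value is < 0.05.
# p = paired_bootstrap_p(bertscores_model_c, bertscores_model_a)
```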
Phase 5 — Promotion criteria:
| Criterion | Threshold | Rationale |
|---|---|---|
| BERTScore vs Sonnet | ≥ 95% of Sonnet's score | Allow slight quality trade for cost |
| Faithfulness | ≥ Sonnet's score | No regression on hallucination |
| MangaDomainAccuracy | ≥ 3.8 | Domain correctness is non-negotiable |
| Latency P99 | ≤ 80% of Sonnet | Fine-tuned Haiku should be faster |
| Cost | ≤ 30% of Sonnet | Must justify the fine-tuning investment |
| All statistical tests | p < 0.05 | Results are not due to chance |
A12. Continuous Weekly Re-Evaluation System
End-to-end architecture:
┌──────────────────────────────────────────────────────────┐
│ Weekly Evaluation Pipeline │
├──────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌────────────┐ │
│ │ DynamoDB │───►│ Data Sampling │───►│ Golden DS │ │
│ │ Conv. Logs │ │ Lambda │ │ Refresh │ │
│ └─────────────┘ └──────────────┘ │ (S3) │ │
│ └─────┬──────┘ │
│ │ │
│ ┌─────────────┐ ┌──────────────┐ ┌──────▼──────┐ │
│ │ EventBridge │───►│ Step Function │───►│ Bedrock │ │
│ │ Weekly Cron │ │ Orchestrator │ │ Eval Jobs │ │
│ └─────────────┘ └──────────────┘ │ (10 intents)│ │
│ └─────┬──────┘ │
│ │ │
│ ┌─────────────┐ ┌──────────────┐ ┌──────▼──────┐ │
│ │ SNS/Slack │◄───│ Human Review │◄───│ Drift │ │
│ │ Alerts │ │ Gate │ │ Detection │ │
│ └─────────────┘ └──────────────┘ └─────┬──────┘ │
│ │ │ │
│ ┌────▼────┐ ┌─────▼──────┐ │
│ │ Approve/ │ │ Model │ │
│ │ Reject │ │ Reassign │ │
│ └────┬────┘ │ Proposal │ │
│ │ └────────────┘ │
│ ┌────▼────┐ │
│ │ SSM │ Update model routing │
│ │ Param │ in Parameter Store │
│ │ Store │ │
│ └─────────┘ │
└──────────────────────────────────────────────────────────┘
Component details:
1. Data sampling (Lambda — weekly trigger):
- Query DynamoDB for conversations from the past 7 days
- Stratified sample: 50 conversations per intent (500 total)
- Filter for conversations with CSAT scores (to derive ground truth)
- Generate candidate golden dataset entries by pairing high-CSAT responses with their prompts
- Human reviewers validate 20% of new samples; the rest use automated quality filters
2. Golden dataset refresh strategy:
- Rolling window: Keep last 12 weeks of evaluated samples (2,400 per intent max)
- Seasonal awareness: Weight recent samples higher during seasonal events (manga convention season, holiday promotions)
- Distribution monitoring: Track intent distribution shift — if product_discovery spikes from 15% to 25% of traffic (new manga season), increase that intent's evaluation weight
3. Drift detection (Lambda post-evaluation):
import numpy as np

# The 10 MangaAssist intents (see A4).
INTENTS = ["recommendation", "product_question", "product_discovery", "faq", "order_tracking",
           "return_request", "promotion", "checkout_help", "chitchat", "escalation"]

def detect_drift(current_scores: dict, historical_scores: list[dict]) -> dict:
drift_report = {}
for intent in INTENTS:
current = current_scores[intent]['bertscore']
historical_mean = np.mean([s[intent]['bertscore'] for s in historical_scores[-8:]])
historical_std = np.std([s[intent]['bertscore'] for s in historical_scores[-8:]])
z_score = (current - historical_mean) / historical_std if historical_std > 0 else 0
drift_report[intent] = {
'current': current,
'historical_mean': historical_mean,
'z_score': z_score,
'drift_detected': abs(z_score) > 2.0, # 95% confidence
'direction': 'degradation' if z_score < -2 else 'improvement' if z_score > 2 else 'stable'
}
return drift_report
4. Automatic model reassignment proposal:
When drift is detected (a minimal proposal sketch follows below):
- If an intent on Haiku degrades below threshold → propose upgrade to Sonnet
- If an intent on Sonnet shows stable quality for 4+ weeks at levels Haiku could serve → propose downgrade to save cost
- Proposal includes: projected cost impact, quality delta, confidence interval
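A minimal sketch of the proposal logic, consuming the drift report produced by detect_drift above; the stability-window bookkeeping and the proposal dict shape are assumptions:

```python
# Reassignment proposal sketch: turn a drift report plus the current routing table into
# upgrade/downgrade proposals. `stable_weeks` tracks how long each intent has been stable;
# the proposal shape is illustrative.
def propose_reassignments(drift_report: dict, current_routing: dict[str, str],
                          stable_weeks: dict[str, int]) -> list[dict]:
    proposals = []
    for intent, report in drift_report.items():
        model = current_routing[intent]
        if model == "haiku" and report["direction"] == "degradation":
            proposals.append({"intent": intent, "from": "haiku", "to": "sonnet",
                              "reason": f"quality drift (z={report['z_score']:.1f})"})
        elif model == "sonnet" and report["direction"] == "stable" and stable_weeks.get(intent, 0) >= 4:
            proposals.append({"intent": intent, "from": "sonnet", "to": "haiku",
                              "reason": "4+ weeks of stable quality at Haiku-serviceable levels"})
    return proposals
```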
5. Human-in-the-loop approval:
- Proposals are sent to Slack and create Jira tickets
- Reassignment requires explicit approval from an ML engineer
- Auto-approval is only allowed for Haiku → Sonnet upgrades (quality protection), never for downgrades
- Approval has a 48-hour timeout; if not acted upon, the change is deferred to the next weekly cycle
6. Guardrails against bad reassignment:
| Guardrail | Implementation |
|---|---|
| No more than 2 intent changes per week | Prevents cascading changes from a single anomalous week |
| Rollback window | Changes are deployed as canaries (5% → 25% → 100% over 72 hours) |
| Real-time CSAT monitoring | CloudWatch alarm: if CSAT drops > 5% for a reassigned intent within 24 hours → auto-rollback |
| Change freeze periods | No automatic reassignments during Prime Day, holiday shopping season |
| Minimum sample size | Drift detection requires at least 100 conversations per intent in the evaluation window |
| Evaluation dataset integrity | Automated checks ensure the golden dataset hasn't been corrupted or accidentally overwritten |
| Audit trail | Every reassignment (proposed, approved, deployed, rolled back) is logged in DynamoDB with full context |