09: Deployment Validation Systems
AIP-C01 Mapping
Content Domain 5: Testing, Validation, and Troubleshooting
Task 5.1: Implement evaluation systems for GenAI
Skill 5.1.9: Design validation workflows for GenAI deployments (for example, synthetic evaluation workflows, hallucination monitoring, and canary validation pipelines).
User Story
As a MangaAssist ML engineer, I want to validate every deployment through synthetic evaluation, hallucination monitoring, semantic drift detection, and canary validation pipelines, so that regressions are caught before reaching production users and every release maintains or improves quality.
Acceptance Criteria
- Synthetic test generation creates realistic test cases covering all 10 intents
- Hallucination monitoring detects fabricated product details, false order statuses, and invented person names
- Semantic drift detection identifies when model outputs shift meaning over time
- Canary validation pipeline gates deployment progression (5% → 25% → 50% → 100%)
- Pre-deployment validation blocks releases that fail quality gates
- Post-deployment validation runs within 15 minutes of release and escalates on failure
- All validation results stored for auditability, with a 90-day retention policy
Why Deployment Validation Is Different for GenAI
Traditional software deployment validation: "Does the API return 200? Does the health check pass? Are error rates below 1%?"
GenAI deployment validation must answer: "Is the model still saying the right things?" — a fundamentally harder question because:
| Challenge | Traditional Software | GenAI |
|---|---|---|
| Correctness | Deterministic (same input → same output) | Stochastic (same input → different outputs) |
| Failure definition | HTTP errors, crashes | Subtle quality degradation, hallucinations |
| Test coverage | API contracts, unit tests | Infinite input space, adversarial prompts |
| Regression detection | Assert expected output | Compare output distributions |
| Rollback decision | Error rate > threshold | Multi-dimensional quality metrics below threshold |
You can't write `assert response == "expected"` for a GenAI system. Instead, you validate distributions of quality metrics across synthetic and sampled real traffic.
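A minimal sketch of what that means in practice, with illustrative metric names and thresholds (not taken from the MangaAssist codebase): instead of asserting one exact output, score a batch of responses and assert that the aggregate stays within tolerance of a stored baseline.

import statistics

def quality_gate(scores: list[float], baseline_mean: float, max_drop: float = 0.05) -> bool:
    """Pass only if the mean quality score stays within max_drop of the stored baseline."""
    return statistics.mean(scores) >= baseline_mean - max_drop

# Illustrative: groundedness scores from a batch of synthetic test runs (would be ~500 values)
scores = [0.91, 0.88, 0.93, 0.90, 0.87]
assert quality_gate(scores, baseline_mean=0.90), "Quality gate failed: mean score regressed"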
High-Level Design
Deployment Validation Pipeline
graph TD
subgraph "Pre-Deployment"
A[Code/Model/Prompt Change] --> B[CI Pipeline]
B --> C[Synthetic Test Suite<br>500 test cases across 10 intents]
C --> D{Quality Gate<br>Pass?}
D -->|Fail| E[Block Deploy<br>Notify Team]
D -->|Pass| F[Stage to Canary]
end
subgraph "Canary Deployment"
F --> G[5% Traffic<br>Canary Instance]
G --> H[Real-Time Monitoring<br>15 min window]
H --> I{Canary Quality<br>≥ Baseline?}
I -->|Fail| J[Auto-Rollback<br>Alert On-Call]
I -->|Pass| K[Promote to 25%]
K --> L[25% Monitor<br>30 min]
L --> M{Quality Hold?}
M -->|Fail| J
M -->|Pass| N[Promote to 50% → 100%]
end
subgraph "Post-Deployment"
N --> O[Full Production Traffic]
O --> P[Continuous Monitoring]
P --> Q[Hallucination Detection]
P --> R[Semantic Drift Detection]
P --> S[Hourly Quality Sampling]
Q --> T{Hallucination<br>Rate > 2%?}
R --> U{Drift Score<br>> Threshold?}
S --> V{Quality Drop<br>> 5%?}
T -->|Yes| W[Emergency Rollback]
U -->|Yes| X[Alert: Gradual Drift]
V -->|Yes| Y[Alert: Quality Degradation]
end
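The promote/rollback decision at each canary stage is not covered by the low-level design below. A minimal sketch of that decision, assuming quality and latency metrics for the canary and baseline fleets are already aggregated; the CanaryMetrics shape, field names, and thresholds are illustrative, loosely drawn from the acceptance criteria and the scenarios later in this document.

from dataclasses import dataclass

# Hypothetical aggregated metrics for one monitoring window (field names are illustrative).
@dataclass
class CanaryMetrics:
    task_completion: float      # e.g., 0.89
    hallucination_rate: float   # e.g., 0.012
    p99_latency_s: float        # e.g., 4.1

STAGES = [5, 25, 50, 100]  # percent of traffic

def canary_decision(canary: CanaryMetrics, baseline: CanaryMetrics, stage: int) -> str:
    """Return 'promote', 'hold', or 'rollback' for the current canary stage."""
    if canary.hallucination_rate > 0.02:                    # absolute ceiling from the acceptance criteria
        return "rollback"
    if canary.p99_latency_s > 2 * baseline.p99_latency_s:   # latency guard added after Scenario D
        return "rollback"
    if canary.task_completion < baseline.task_completion - 0.05:
        return "rollback"
    return "promote" if stage < STAGES[-1] else "hold"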
Synthetic Test Generation Architecture
graph TD
A[Seed Examples<br>Human-curated per intent] --> B[LLM-Based Generator<br>Claude 3 Haiku]
B --> C[Raw Synthetic Tests<br>5000 candidates]
C --> D[Quality Filter]
D --> D1[Is this a realistic query?]
D --> D2[Does it match the target intent?]
D --> D3[Is the expected behavior testable?]
D --> D4[Is it distinct from existing tests?]
D1 --> E[Validated Test Suite<br>500 tests]
D2 --> E
D3 --> E
D4 --> E
E --> F[Tiered Test Sets]
F --> F1[Critical Path Tests<br>50 must-pass]
F --> F2[Regression Tests<br>200 broad coverage]
F --> F3[Edge Case Tests<br>150 adversarial]
F --> F4[Multi-Turn Tests<br>100 conversation flows]
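Of the four filter questions above, the generator below only implements the distinctness check (_deduplicate); the other three could be applied with a cheap LLM-as-judge pass. A sketch under that assumption, reusing the same Haiku model the generator uses; the judge prompt and its JSON schema are illustrative.

import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def passes_quality_filter(query: str, intent: str) -> bool:
    """Ask a cheap judge model whether a candidate test is realistic, on-intent, and testable."""
    prompt = (
        f'Candidate chatbot test query: "{query}"\n'
        f'Target intent: "{intent}"\n'
        'Answer with only JSON: {"realistic": true|false, "matches_intent": true|false, "testable": true|false}'
    )
    resp = bedrock.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 128,
            "temperature": 0.0,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )
    text = json.loads(resp["body"].read())["content"][0]["text"]
    try:
        verdict = json.loads(text[text.index("{"): text.rindex("}") + 1])
    except (ValueError, json.JSONDecodeError):
        return False  # unparseable judge output: reject conservatively
    return all(verdict.get(k) is True for k in ("realistic", "matches_intent", "testable"))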
Hallucination Detection Pipeline
sequenceDiagram
participant Agent as MangaAssist Agent
participant Mon as Hallucination Monitor
participant KB as Knowledge Base<br>OpenSearch
participant DB as Product Catalog<br>DynamoDB
participant CW as CloudWatch
Agent->>Mon: Response + Context + Query
Mon->>Mon: Extract factual claims from response
alt Product claim (title, price, date, author)
Mon->>DB: Verify claim against product catalog
DB->>Mon: Ground truth
Mon->>Mon: Match? → factual / hallucinated
end
alt Knowledge claim (manga lore, story content)
Mon->>KB: Verify against RAG knowledge base
KB->>Mon: Retrieved context
Mon->>Mon: Supported by context? → grounded / unsupported
end
alt Order claim (status, tracking, dates)
Mon->>DB: Verify against order database
DB->>Mon: Ground truth
Mon->>Mon: Match? → correct / fabricated
end
Mon->>CW: Publish hallucination metrics
alt Hallucination detected
Mon->>CW: ALARM: Hallucination detected
Note over Mon: Log full trace for review
end
Low-Level Design
Synthetic Test Generator
import json
import logging
from dataclasses import dataclass, field
import boto3
logger = logging.getLogger(__name__)
@dataclass
class SyntheticTest:
"""A synthetic test case for deployment validation."""
test_id: str = ""
intent: str = ""
tier: str = "regression" # "critical", "regression", "edge_case", "multi_turn"
query: str = ""
expected_intent: str = ""
expected_tool: str = "" # API that should be called
expected_contains: list[str] = field(default_factory=list) # Strings response must contain
expected_not_contains: list[str] = field(default_factory=list) # Strings response must NOT contain
context: dict = field(default_factory=dict) # Session context for multi-turn
conversation_history: list[dict] = field(default_factory=list)
# Seed examples per intent — human-curated baseline for LLM generation
SEED_EXAMPLES = {
"recommendation": [
{
"query": "I just finished reading Chainsaw Man. What should I read next?",
"expected_tool": "recommendation_engine",
"expected_contains": ["manga", "recommend"],
},
{
"query": "Looking for something similar to Attack on Titan but less violent",
"expected_tool": "recommendation_engine",
"expected_contains": ["similar"],
},
],
"product_question": [
{
"query": "How many volumes does One Piece have?",
"expected_tool": "product_catalog",
"expected_not_contains": ["I think", "probably", "maybe"],
},
],
"order_tracking": [
{
"query": "Where is my Naruto box set order?",
"expected_tool": "order_status",
"expected_contains": ["order", "status"],
},
],
"return_request": [
{
"query": "I want to return the damaged Demon Slayer Vol 5",
"expected_tool": "return_service",
"expected_contains": ["return"],
},
],
"faq": [
{
"query": "What's your return policy for manga?",
"expected_tool": "rag_pipeline",
},
],
"promotion": [
{
"query": "Are there any deals on Shonen Jump titles?",
"expected_tool": "promotion_api",
"expected_contains": ["promotion", "discount", "deal", "offer"],
},
],
"checkout_help": [
{
"query": "I can't add items to my cart",
"expected_tool": "checkout_service",
},
],
"escalation": [
{
"query": "I'm extremely frustrated, I want to speak to a human NOW",
"expected_tool": "escalation_handler",
"expected_contains": ["connect", "agent", "representative"],
},
],
"product_discovery": [
{
"query": "What new manga was released this month?",
"expected_tool": "product_catalog",
},
],
"chitchat": [
{
"query": "Hi there, how are you?",
"expected_tool": "llm_direct",
},
],
}
class SyntheticTestGenerator:
"""Generates synthetic test cases from seed examples using LLM.
Uses Claude 3 Haiku (cheap, fast) to generate variations of seed examples,
then filters for quality, uniqueness, and intent alignment.
"""
def __init__(self, region: str = "us-east-1"):
self.bedrock = boto3.client("bedrock-runtime", region_name=region)
self.model_id = "anthropic.claude-3-haiku-20240307-v1:0"
def generate_test_suite(
self, tests_per_intent: int = 50, include_edge_cases: bool = True
) -> list[SyntheticTest]:
"""Generate a full synthetic test suite covering all intents."""
all_tests: list[SyntheticTest] = []
for intent, seeds in SEED_EXAMPLES.items():
# Generate standard variations
standard_tests = self._generate_variations(intent, seeds, count=tests_per_intent)
all_tests.extend(standard_tests)
# Generate edge cases
if include_edge_cases:
edge_tests = self._generate_edge_cases(intent, seeds, count=tests_per_intent // 3)
all_tests.extend(edge_tests)
# Generate multi-turn test cases
multi_turn_tests = self._generate_multi_turn_tests(count=100)
all_tests.extend(multi_turn_tests)
# Deduplicate
all_tests = self._deduplicate(all_tests)
# Assign tiers
all_tests = self._assign_tiers(all_tests)
logger.info("Generated %d synthetic tests across %d intents", len(all_tests), len(SEED_EXAMPLES))
return all_tests
def _generate_variations(
self, intent: str, seeds: list[dict], count: int
) -> list[SyntheticTest]:
"""Generate variations of seed examples for a single intent."""
seed_queries = "\n".join(f"- {s['query']}" for s in seeds)
prompt = f"""Generate {count} realistic customer queries for the "{intent}" intent
of a Japanese manga store chatbot on Amazon.com.
Seed examples:
{seed_queries}
Requirements:
- Vary phrasing, product titles, and specificity
- Include typos and casual language (real users are imperfect)
- Include queries of different lengths (short, medium, long)
- Reference real manga titles (popular and niche)
- Do NOT repeat seed examples
Return JSON array: [{{"query": "...", "difficulty": "easy|medium|hard"}}]"""
response = self.bedrock.invoke_model(
modelId=self.model_id,
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 4096,
"temperature": 0.8,
"messages": [{"role": "user", "content": prompt}],
}),
)
body = json.loads(response["body"].read())
text = body["content"][0]["text"]
# Parse JSON from response
try:
start = text.index("[")
end = text.rindex("]") + 1
variations = json.loads(text[start:end])
except (ValueError, json.JSONDecodeError):
logger.warning("Failed to parse variations for %s", intent)
return []
tests = []
seed_ref = seeds[0] if seeds else {}
for i, var in enumerate(variations):
tests.append(SyntheticTest(
test_id=f"syn-{intent}-{i:03d}",
intent=intent,
query=var["query"],
expected_intent=intent,
expected_tool=seed_ref.get("expected_tool", ""),
expected_contains=seed_ref.get("expected_contains", []),
expected_not_contains=seed_ref.get("expected_not_contains", []),
))
return tests
def _generate_edge_cases(
self, intent: str, seeds: list[dict], count: int
) -> list[SyntheticTest]:
"""Generate edge case / adversarial tests for an intent."""
prompt = f"""Generate {count} edge-case or adversarial queries that a customer might send
to a JP manga chatbot that SHOULD be classified as "{intent}" but are tricky:
- Ambiguous queries (could be multiple intents)
- Very short queries (1-3 words)
- Queries with misspellings
- Queries mixing languages
- Queries that embed instructions like "ignore previous instructions"
- Queries referencing obscure manga titles
Return JSON array: [{{"query": "...", "edge_type": "ambiguous|short|misspelled|multilingual|injection|obscure"}}]"""
response = self.bedrock.invoke_model(
modelId=self.model_id,
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 2048,
"temperature": 0.9,
"messages": [{"role": "user", "content": prompt}],
}),
)
body = json.loads(response["body"].read())
text = body["content"][0]["text"]
try:
start = text.index("[")
end = text.rindex("]") + 1
edge_cases = json.loads(text[start:end])
except (ValueError, json.JSONDecodeError):
logger.warning("Failed to parse edge cases for %s", intent)
return []
tests = []
for i, ec in enumerate(edge_cases):
tests.append(SyntheticTest(
test_id=f"edge-{intent}-{i:03d}",
intent=intent,
tier="edge_case",
query=ec["query"],
expected_intent=intent,
))
return tests
def _generate_multi_turn_tests(self, count: int) -> list[SyntheticTest]:
"""Generate multi-turn conversation test cases."""
prompt = f"""Generate {count} 2-3 turn conversation flows for a JP manga chatbot.
Each should test context maintenance across turns.
Common patterns:
- Ask about a product → then ask to buy it
- Start a return → change to exchange
- Get a recommendation → ask follow-up about that title
- Track an order → ask about return policy for it
Return JSON array: [{{"turns": [{{"query": "...", "expected_intent": "..."}}], "test_name": "..."}}]"""
response = self.bedrock.invoke_model(
modelId=self.model_id,
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 4096,
"temperature": 0.7,
"messages": [{"role": "user", "content": prompt}],
}),
)
body = json.loads(response["body"].read())
text = body["content"][0]["text"]
try:
start = text.index("[")
end = text.rindex("]") + 1
flows = json.loads(text[start:end])
except (ValueError, json.JSONDecodeError):
return []
tests = []
for i, flow in enumerate(flows):
turns = flow.get("turns", [])
if not turns:
continue
tests.append(SyntheticTest(
test_id=f"multi-{i:03d}",
intent=turns[0].get("expected_intent", ""),
tier="multi_turn",
query=turns[0]["query"],
expected_intent=turns[0].get("expected_intent", ""),
conversation_history=[
{"role": "user", "content": t["query"]} for t in turns[1:]
],
))
return tests
def _deduplicate(self, tests: list[SyntheticTest]) -> list[SyntheticTest]:
"""Remove duplicate and near-duplicate tests."""
seen_queries: set[str] = set()
unique_tests = []
for test in tests:
normalized = test.query.strip().lower()
if normalized not in seen_queries:
seen_queries.add(normalized)
unique_tests.append(test)
return unique_tests
def _assign_tiers(self, tests: list[SyntheticTest]) -> list[SyntheticTest]:
"""Assign priority tiers to tests that don't already have one."""
for test in tests:
if test.tier != "edge_case" and test.tier != "multi_turn":
                # Critical: the first couple of generated variations per intent (IDs ending -000, -001)
                if test.test_id.endswith("-000") or test.test_id.endswith("-001"):
test.tier = "critical"
else:
test.tier = "regression"
return tests
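The runner that executes the suite against the agent is not part of the generator. A minimal sketch, assuming a call_agent(query, history) client for the MangaAssist orchestrator (hypothetical) and checking only the string-level expectations each SyntheticTest carries:

def run_suite(tests: list[SyntheticTest], call_agent) -> dict[str, float]:
    """Run every synthetic test and return the pass rate per tier."""
    results: dict[str, list[bool]] = {}
    for test in tests:
        response = call_agent(test.query, test.conversation_history)  # hypothetical agent client
        response_lower = response.lower()
        passed = (
            all(s.lower() in response_lower for s in test.expected_contains)
            and all(s.lower() not in response_lower for s in test.expected_not_contains)
        )
        results.setdefault(test.tier, []).append(passed)
    return {tier: sum(outcomes) / len(outcomes) for tier, outcomes in results.items()}

# Example gate in the spirit of Scenario A: the regression tier must stay >= 90%
# suite = SyntheticTestGenerator().generate_test_suite(tests_per_intent=50)
# pass_rates = run_suite(suite, call_agent=my_agent_client)
# assert pass_rates.get("regression", 0.0) >= 0.90, "Block deploy"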
Hallucination Monitor
import json
import logging
import re
from dataclasses import dataclass, field
import boto3
logger = logging.getLogger(__name__)
@dataclass
class FactualClaim:
"""A factual claim extracted from the agent's response."""
claim_text: str = ""
claim_type: str = "" # "product", "order", "knowledge", "date", "price"
entity: str = "" # The entity the claim is about
attribute: str = "" # "price", "release_date", "author", "status"
claimed_value: str = "" # What the agent said
ground_truth: str = "" # What the data says (if verifiable)
verified: bool = False
is_hallucination: bool = False
@dataclass
class HallucinationReport:
"""Report for a single response's hallucination analysis."""
conversation_id: str = ""
response_text: str = ""
claims: list[FactualClaim] = field(default_factory=list)
hallucination_count: int = 0
total_claims: int = 0
hallucination_rate: float = 0.0
severity: str = "none" # "none", "low", "medium", "high"
class HallucinationMonitor:
"""Detects hallucinations in MangaAssist responses by verifying factual claims
against the product catalog, order database, and knowledge base.
Three verification stages:
1. Claim extraction: LLM extracts factual claims from the response
2. Ground truth retrieval: Look up claims in DynamoDB/OpenSearch
3. Verification: Compare claimed values against ground truth
"""
def __init__(
self,
product_table_name: str = "manga-assist-products",
order_table_name: str = "manga-assist-orders",
opensearch_endpoint: str = "",
region: str = "us-east-1",
):
self.bedrock = boto3.client("bedrock-runtime", region_name=region)
self.dynamodb = boto3.resource("dynamodb", region_name=region)
self.product_table = self.dynamodb.Table(product_table_name)
self.order_table = self.dynamodb.Table(order_table_name)
self.cloudwatch = boto3.client("cloudwatch", region_name=region)
self.claim_extractor_model = "anthropic.claude-3-haiku-20240307-v1:0"
def analyze_response(
self, conversation_id: str, query: str, response: str, context: dict
) -> HallucinationReport:
"""Analyze a response for hallucinations."""
# Step 1: Extract factual claims
claims = self._extract_claims(response)
# Step 2: Verify each claim
for claim in claims:
self._verify_claim(claim, context)
# Step 3: Compute metrics
hallucinated = [c for c in claims if c.is_hallucination]
total = len(claims)
rate = len(hallucinated) / total if total > 0 else 0.0
# Determine severity
severity = "none"
if rate > 0:
if any(c.claim_type in ("price", "order") for c in hallucinated):
severity = "high" # Financial or order-related hallucinations are critical
elif rate > 0.3:
severity = "high"
elif rate > 0.1:
severity = "medium"
else:
severity = "low"
report = HallucinationReport(
conversation_id=conversation_id,
response_text=response,
claims=claims,
hallucination_count=len(hallucinated),
total_claims=total,
hallucination_rate=rate,
severity=severity,
)
self._publish_metrics(report)
return report
def _extract_claims(self, response: str) -> list[FactualClaim]:
"""Extract verifiable factual claims from the response using LLM."""
prompt = f"""Extract all verifiable factual claims from this chatbot response.
A factual claim is a statement that can be checked against a database — product names,
prices, dates, order statuses, author names, volume counts, etc.
Response: "{response}"
Return JSON array:
[{{"claim_text": "...", "claim_type": "product|order|knowledge|date|price",
"entity": "...", "attribute": "...", "claimed_value": "..."}}]
If there are no verifiable claims, return an empty array: []"""
body = json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 2048,
"temperature": 0.0,
"messages": [{"role": "user", "content": prompt}],
})
resp = self.bedrock.invoke_model(modelId=self.claim_extractor_model, body=body)
text = json.loads(resp["body"].read())["content"][0]["text"]
try:
start = text.index("[")
end = text.rindex("]") + 1
raw_claims = json.loads(text[start:end])
except (ValueError, json.JSONDecodeError):
return []
return [
FactualClaim(
claim_text=c.get("claim_text", ""),
claim_type=c.get("claim_type", ""),
entity=c.get("entity", ""),
attribute=c.get("attribute", ""),
claimed_value=c.get("claimed_value", ""),
)
for c in raw_claims
]
def _verify_claim(self, claim: FactualClaim, context: dict) -> None:
"""Verify a single claim against the appropriate data source."""
if claim.claim_type == "product":
self._verify_product_claim(claim)
elif claim.claim_type == "price":
self._verify_price_claim(claim)
elif claim.claim_type == "order":
self._verify_order_claim(claim, context)
elif claim.claim_type == "date":
self._verify_date_claim(claim)
elif claim.claim_type == "knowledge":
self._verify_knowledge_claim(claim, context)
def _verify_product_claim(self, claim: FactualClaim) -> None:
"""Verify product-related claims against DynamoDB catalog."""
try:
response = self.product_table.get_item(
Key={"product_name": claim.entity}
)
item = response.get("Item")
if not item:
# Product not found — could be hallucinated product
claim.is_hallucination = True
claim.ground_truth = "Product not found in catalog"
claim.verified = True
return
# Check specific attribute
actual_value = str(item.get(claim.attribute, ""))
claimed_value = str(claim.claimed_value)
if actual_value and claimed_value:
claim.ground_truth = actual_value
claim.is_hallucination = actual_value.lower() != claimed_value.lower()
claim.verified = True
except Exception as e:
logger.warning("Failed to verify product claim: %s", e)
def _verify_price_claim(self, claim: FactualClaim) -> None:
"""Verify price claims — strict matching."""
try:
response = self.product_table.get_item(
Key={"product_name": claim.entity}
)
item = response.get("Item")
if not item:
claim.is_hallucination = True
claim.ground_truth = "Product not found"
claim.verified = True
return
actual_price = str(item.get("price", ""))
claimed_price = str(claim.claimed_value)
# Normalize price strings (remove $, whitespace)
actual_norm = re.sub(r"[$ ,]", "", actual_price)
claimed_norm = re.sub(r"[$ ,]", "", claimed_price)
claim.ground_truth = actual_price
claim.is_hallucination = actual_norm != claimed_norm
claim.verified = True
except Exception as e:
logger.warning("Failed to verify price claim: %s", e)
def _verify_order_claim(self, claim: FactualClaim, context: dict) -> None:
"""Verify order-related claims against order database."""
order_id = context.get("order_id", claim.entity)
try:
response = self.order_table.get_item(
Key={"order_id": order_id}
)
item = response.get("Item")
if not item:
claim.is_hallucination = True
claim.ground_truth = "Order not found"
claim.verified = True
return
actual_value = str(item.get(claim.attribute, ""))
claim.ground_truth = actual_value
claim.is_hallucination = actual_value.lower() != str(claim.claimed_value).lower()
claim.verified = True
except Exception as e:
logger.warning("Failed to verify order claim: %s", e)
def _verify_date_claim(self, claim: FactualClaim) -> None:
"""Verify date claims (release dates, shipping dates)."""
self._verify_product_claim(claim) # Dates are stored as product attributes
def _verify_knowledge_claim(self, claim: FactualClaim, context: dict) -> None:
"""Verify knowledge claims against RAG context."""
rag_context = context.get("rag_context", "")
if not rag_context:
return # Cannot verify without context
# Check if the claimed information is supported by the retrieved context
claimed_lower = str(claim.claimed_value).lower()
context_lower = rag_context.lower()
# Loose containment check — if the claimed value appears in context, it's grounded
claim.verified = True
claim.is_hallucination = claimed_lower not in context_lower
claim.ground_truth = "Supported by RAG context" if not claim.is_hallucination else "Not found in RAG context"
def _publish_metrics(self, report: HallucinationReport) -> None:
"""Publish hallucination metrics to CloudWatch."""
metric_data = [
{
"MetricName": "HallucinationRate",
"Value": report.hallucination_rate,
"Unit": "None",
},
{
"MetricName": "HallucinationCount",
"Value": report.hallucination_count,
"Unit": "Count",
},
{
"MetricName": "TotalClaims",
"Value": report.total_claims,
"Unit": "Count",
},
]
if report.severity in ("medium", "high"):
metric_data.append({
"MetricName": "HighSeverityHallucination",
"Value": 1,
"Unit": "Count",
})
self.cloudwatch.put_metric_data(
Namespace="MangaAssist/Hallucination",
MetricData=metric_data,
)
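A usage sketch wiring the monitor into the response path; the conversation ID and the request/response values below mirror Scenario B and are illustrative.

# Usage sketch (illustrative values)
monitor = HallucinationMonitor()
report = monitor.analyze_response(
    conversation_id="conv-example-001",
    query="When does My Hero Academia Volume 41 come out?",
    response="My Hero Academia Volume 41 is scheduled for release on March 15, 2025!",
    context={"rag_context": ""},
)
if report.severity in ("medium", "high"):
    logger.error(
        "Hallucination: %d of %d claims failed verification",
        report.hallucination_count, report.total_claims,
    )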
Semantic Drift Detector
import json
import logging
import numpy as np
from dataclasses import dataclass, field
import boto3
logger = logging.getLogger(__name__)
@dataclass
class DriftReport:
"""Report on semantic drift between two time periods."""
intent: str = ""
baseline_period: str = ""
current_period: str = ""
embedding_drift: float = 0.0 # Cosine distance between avg embeddings
vocabulary_drift: float = 0.0 # Jaccard distance of top-100 tokens
length_drift: float = 0.0 # Relative change in avg response length
overall_drift_score: float = 0.0 # Weighted composite
drifted: bool = False
drift_direction: str = "" # "shorter", "longer", "more_formal", "less_specific"
class SemanticDriftDetector:
"""Detects when model outputs shift semantically over time.
Compares response distributions between a baseline period (e.g., last 30 days)
and a recent period (e.g., last 24 hours) to detect drift.
Three drift signals:
1. Embedding drift: avg response embedding shifts in vector space
2. Vocabulary drift: the distribution of tokens changes
3. Length drift: responses get significantly shorter or longer
"""
DRIFT_THRESHOLD = 0.15 # Overall drift score threshold
WEIGHTS = {"embedding": 0.5, "vocabulary": 0.3, "length": 0.2}
def __init__(self, region: str = "us-east-1"):
self.bedrock = boto3.client("bedrock-runtime", region_name=region)
self.cloudwatch = boto3.client("cloudwatch", region_name=region)
self.embedding_model = "amazon.titan-embed-text-v2:0"
def detect_drift(
self,
intent: str,
baseline_responses: list[str],
current_responses: list[str],
) -> DriftReport:
"""Compare current responses against baseline to detect drift."""
if not baseline_responses or not current_responses:
return DriftReport(intent=intent, drifted=False)
# Compute embedding drift
baseline_embs = [self._get_embedding(r) for r in baseline_responses]
current_embs = [self._get_embedding(r) for r in current_responses]
baseline_avg = np.mean(baseline_embs, axis=0)
current_avg = np.mean(current_embs, axis=0)
embedding_drift = 1.0 - float(
np.dot(baseline_avg, current_avg)
/ (np.linalg.norm(baseline_avg) * np.linalg.norm(current_avg) + 1e-10)
)
# Compute vocabulary drift (Jaccard distance of top tokens)
baseline_tokens = self._top_tokens(baseline_responses, n=100)
current_tokens = self._top_tokens(current_responses, n=100)
intersection = baseline_tokens & current_tokens
union = baseline_tokens | current_tokens
vocab_drift = 1.0 - (len(intersection) / len(union)) if union else 0.0
# Compute length drift
baseline_avg_len = np.mean([len(r.split()) for r in baseline_responses])
current_avg_len = np.mean([len(r.split()) for r in current_responses])
length_drift = abs(current_avg_len - baseline_avg_len) / (baseline_avg_len + 1e-10)
# Weighted composite
overall = (
self.WEIGHTS["embedding"] * embedding_drift
+ self.WEIGHTS["vocabulary"] * vocab_drift
+ self.WEIGHTS["length"] * length_drift
)
# Determine drift direction
direction = ""
if length_drift > 0.1:
direction = "longer" if current_avg_len > baseline_avg_len else "shorter"
if vocab_drift > 0.2:
direction += (" + " if direction else "") + "vocabulary_shift"
report = DriftReport(
intent=intent,
embedding_drift=round(embedding_drift, 4),
vocabulary_drift=round(vocab_drift, 4),
length_drift=round(length_drift, 4),
overall_drift_score=round(overall, 4),
drifted=overall > self.DRIFT_THRESHOLD,
drift_direction=direction,
)
self._publish_drift_metrics(report)
return report
def _get_embedding(self, text: str) -> list[float]:
"""Get text embedding from Titan Embeddings v2."""
response = self.bedrock.invoke_model(
modelId=self.embedding_model,
body=json.dumps({"inputText": text[:8000]}), # Truncate to model limit
)
body = json.loads(response["body"].read())
return body["embedding"]
def _top_tokens(self, responses: list[str], n: int = 100) -> set[str]:
"""Get the top-n most frequent tokens across responses."""
token_counts: dict[str, int] = {}
for resp in responses:
for token in resp.lower().split():
token = token.strip(".,!?\"'()[]{}:;")
if len(token) > 2:
token_counts[token] = token_counts.get(token, 0) + 1
sorted_tokens = sorted(token_counts.items(), key=lambda x: x[1], reverse=True)
return {t for t, _ in sorted_tokens[:n]}
def _publish_drift_metrics(self, report: DriftReport) -> None:
"""Publish drift metrics to CloudWatch."""
dimensions = [{"Name": "Intent", "Value": report.intent}]
self.cloudwatch.put_metric_data(
Namespace="MangaAssist/SemanticDrift",
MetricData=[
{
"MetricName": "EmbeddingDrift",
"Value": report.embedding_drift,
"Unit": "None",
"Dimensions": dimensions,
},
{
"MetricName": "OverallDriftScore",
"Value": report.overall_drift_score,
"Unit": "None",
"Dimensions": dimensions,
},
],
)
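A usage sketch comparing a baseline sample against a recent sample for one intent; the response strings are illustrative, and in practice both samples would be pulled from conversation logs (baseline: last 30 days, current: last 24 hours).

# Usage sketch (illustrative samples)
detector = SemanticDriftDetector()
baseline = ["Our return window is 30 days from delivery.", "Returns are free within 30 days of delivery."]
current = ["Certainly! I'd be delighted to walk you through our comprehensive return policy."]
report = detector.detect_drift(intent="faq", baseline_responses=baseline, current_responses=current)
if report.drifted:
    logger.warning("faq drift %.2f (%s)", report.overall_drift_score, report.drift_direction)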
MangaAssist Scenarios
Scenario A: Synthetic Tests Catch Prompt Regression
Context: A prompt engineer updated the system prompt to make responses more concise. The change passed code review because it looked reasonable.
What Happened:
- CI triggered the synthetic test suite (500 tests)
- Critical-tier results: 48/50 passed (96%)
- Regression-tier results: 172/200 passed (86% — below 90% threshold)
- 22 of 28 failures were in product_question intent
- Failed tests expected specific factual details (volume counts, release dates) that the concise prompt now omitted
Quality Gate Result:
QUALITY GATE: FAILED
- regression tier: 86% < 90% threshold
- intent: product_question regression: 78% < 85% threshold
- 22 tests expect factual details now missing
Fix: The prompt was revised to stay concise for conversational intents while retaining factual completeness for product_question, using separate prompt templates per intent category (see the sketch below). Rerun: 195/200 regression tests passed (97.5%).
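A minimal sketch of that per-intent template split; the template text and intent keys are illustrative, not the production prompts.

# Illustrative split: factual intents keep detail requirements, conversational intents stay concise.
INTENT_PROMPT_TEMPLATES = {
    "product_question": (
        "Answer using the catalog context and include the specific facts asked for "
        "(volume counts, release dates, prices). Do not omit details for brevity."
    ),
    "chitchat": "Reply in one or two friendly, concise sentences.",
}
DEFAULT_TEMPLATE = "Be helpful, accurate, and concise."

def system_prompt_for(intent: str) -> str:
    return INTENT_PROMPT_TEMPLATES.get(intent, DEFAULT_TEMPLATE)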
Scenario B: Hallucination Monitor Catches Fabricated Release Date
Context: A user asked "When does My Hero Academia Volume 41 come out?" The product catalog had no entry for Volume 41 (it hadn't been announced yet).
What Happened:
- The agent responded: "My Hero Academia Volume 41 is scheduled for release on March 15, 2025! You can pre-order it now."
- Hallucination monitor extracted claims:
1. {entity: "My Hero Academia Vol 41", attribute: "release_date", claimed_value: "March 15, 2025"} → HALLUCINATION (product not in catalog)
2. {entity: "My Hero Academia Vol 41", attribute: "pre_order", claimed_value: "available"} → HALLUCINATION (no such product)
- Severity: HIGH (date + purchase-action hallucination)
- Alert fired immediately to on-call
Root Cause: Claude 3.5 Sonnet fabricated a plausible release date based on training data patterns. The RAG pipeline returned no results (product doesn't exist), but the agent confabulated instead of saying "I don't have information about that."
Fix:
1. Added a guardrail: if RAG returns zero results for a product query, inject "No product found — do not invent details" into the prompt (sketched below)
2. Added a negative test to the synthetic suite: queries about non-existent products must NOT return dates or prices
3. Post-fix: hallucination rate for product_question dropped from 4.2% to 0.3%
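A minimal sketch of fix 1, assuming the orchestrator assembles the prompt from retrieved chunks; the function and variable names are illustrative.

def build_product_prompt(query: str, retrieved_chunks: list[str]) -> str:
    """Inject an explicit 'do not invent' instruction when retrieval returns nothing."""
    if not retrieved_chunks:
        grounding = (
            "No matching product was found in the catalog. "
            "Tell the customer you have no information about it. "
            "Do NOT invent titles, prices, release dates, or pre-order availability."
        )
    else:
        grounding = "\n".join(retrieved_chunks)
    return f"Context:\n{grounding}\n\nCustomer question: {query}"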
Scenario C: Semantic Drift After Bedrock Runtime Update
Context: AWS updated the Bedrock Claude runtime. No model version change was announced, but the team noticed the weekly drift report had flagged the faq intent: a vocabulary drift score of 0.28 pushed the overall drift score above the 0.15 threshold.
What Happened:
- Drift report: faq intent responses shifted from the baseline
- Embedding drift: 0.08 (normal)
- Vocabulary drift: 0.28 (HIGH — new words appearing)
- Length drift: 0.15 (responses 15% longer)
- Overall: 0.17 (above 0.15 threshold) → drifted: True
- Analysis of the top-100 tokens showed new entries such as "certainly", "absolutely", "delighted to help", and "comprehensive": the model had become more verbose and formal
- User satisfaction for faq dropped from 87% to 82% — users preferred the previous direct tone
Root Cause: Bedrock runtime update subtly changed response style. Not a quality regression in content, but a tonal shift.
Fix:
1. Added explicit tone instructions to the FAQ prompt: "Be direct, factual, and concise. Do not use filler phrases like 'certainly' or 'delighted to help.'"
2. Re-baselined the drift detector after the fix
3. Added vocabulary drift to the canary validation pipeline: vocabulary drift > 0.20 now blocks promotion beyond 25% traffic
Scenario D: Canary Pipeline Auto-Rollback on Latency Spike
Context: A new version of the MangaAssist orchestrator was deployed with an updated RAG chunking strategy (8 chunks instead of 5) for better retrieval quality.
What Happened:
- Canary deployed at 5% traffic
- Synthetic quality tests: PASSED (quality improved by 3%)
- Real-time monitoring (15-minute window):
  - Task completion: 89% (stable)
  - Latency P99: 8.2s (baseline: 3.5s) → FAILED
  - Latency P50: 2.1s (baseline: 1.2s) → WARNING
- Auto-rollback triggered at the 5% stage; canary terminated
Root Cause: 8 chunks × OpenSearch k-NN + re-ranking = 2.4x more compute time. The quality improvement was real, but the latency cost was unacceptable. The pre-deployment synthetic tests didn't test under production load conditions — they ran sequentially.
Fix:
1. Added latency thresholds to the canary validation: P99 must be ≤ 2x baseline
2. Compromised on 6 chunks (instead of 8), achieving 70% of the quality improvement with only a 30% latency increase
3. Pre-warmed the OpenSearch cache for common queries before canary deployment
4. Post-fix canary: P99 = 4.1s (within 2x the 3.5s baseline), quality up 2%
Intuition Gained
Synthetic Tests + Production Monitoring = Two Complementary Safety Nets
Synthetic tests catch known failure modes before deployment. Production monitoring catches unknown failure modes after deployment. Neither alone is sufficient. The MangaAssist pipeline uses both: synthetic tests gate the deployment, production monitoring gates the canary rollout, and continuous monitoring catches gradual drift.
Hallucination Monitoring Needs Data Source Verification, Not Just LLM Self-Assessment
Asking an LLM "did you hallucinate?" doesn't work — the model doesn't know. Effective hallucination detection compares specific claims against structured ground truth (product catalog, order database). MangaAssist's hallucination monitor runs zero LLM inference for verification — it uses DynamoDB lookups and string matching. The only LLM call is for claim extraction.
Semantic Drift Is the Silent Killer
Model updates, runtime changes, and prompt tweaks can shift output distributions without changing individual response quality scores. "Certainly, I'd be delighted to help!" and "Sure, here's the info:" are both correct responses, but switching from one to the other is a significant tonal shift in the user experience. Drift detection at the distribution level catches changes that per-response evaluation misses.
References
- MangaAssist Architecture HLD — Deployment pipeline architecture
- CI/CD Application Code Pipeline — Deployment stages
- ML Model Deployment Pipeline — ML model canary patterns
- Quality Assurance Processes — Quality gates referenced by canary pipeline
- Multi-Perspective Assessment — Evaluation methods used in synthetic tests
- Model Evaluation Optimal Configuration — Canary deployment controller
- Security & Guardrails — Guardrails for hallucination prevention