09: Deployment Validation Systems
AIP-C01 Mapping
Content Domain 5: Testing, Validation, and Troubleshooting
Task 5.1: Implement evaluation systems for GenAI
Skill 5.1.9: Design validation workflows for GenAI deployments (for example, synthetic evaluation workflows, hallucination monitoring, and canary validation pipelines).
User Story
As a MangaAssist ML engineer, I want to validate every deployment through synthetic evaluation, hallucination monitoring, semantic drift detection, and canary validation pipelines, so that regressions are caught before reaching production users and every release maintains or improves quality.
Acceptance Criteria
- Synthetic test generation creates realistic test cases covering all 10 intents
- Hallucination monitoring detects fabricated product details, false order statuses, and invented person names
- Semantic drift detection identifies when model outputs shift meaning over time
- Canary validation pipeline gates deployment progression (5% → 25% → 50% → 100%)
- Pre-deployment validation blocks releases that fail quality gates
- Post-deployment validation runs within 15 minutes of release and escalates on failure
- All validation results stored for auditability, with a 90-day retention policy
Why Deployment Validation Is Different for GenAI
Traditional software deployment validation: "Does the API return 200? Does the health check pass? Are error rates below 1%?"
GenAI deployment validation must answer: "Is the model still saying the right things?" — a fundamentally harder question because:
| Challenge | Traditional Software | GenAI |
|---|---|---|
| Correctness | Deterministic (same input → same output) | Stochastic (same input → different outputs) |
| Failure definition | HTTP errors, crashes | Subtle quality degradation, hallucinations |
| Test coverage | API contracts, unit tests | Infinite input space, adversarial prompts |
| Regression detection | Assert expected output | Compare output distributions |
| Rollback decision | Error rate > threshold | Multi-dimensional quality metrics below threshold |
You can't write `assert response == "expected"` for a GenAI system. Instead, you validate distributions of quality metrics across synthetic and sampled real traffic.
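A minimal sketch of what that means in practice, with illustrative metric names and thresholds (not taken from the MangaAssist codebase): instead of asserting one exact output, score a batch of responses and assert that the aggregate stays within tolerance of a stored baseline.

import statistics

def quality_gate(scores: list[float], baseline_mean: float, max_drop: float = 0.05) -> bool:
    """Pass only if the mean quality score stays within max_drop of the stored baseline."""
    return statistics.mean(scores) >= baseline_mean - max_drop

# Illustrative: groundedness scores from a batch of synthetic test runs (would be ~500 values)
scores = [0.91, 0.88, 0.93, 0.90, 0.87]
assert quality_gate(scores, baseline_mean=0.90), "Quality gate failed: mean score regressed"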
High-Level Design
Deployment Validation Pipeline
graph TD
subgraph "Pre-Deployment"
A[Code/Model/Prompt Change] --> B[CI Pipeline]
B --> C[Synthetic Test Suite<br>500 test cases across 10 intents]
C --> D{Quality Gate<br>Pass?}
D -->|Fail| E[Block Deploy<br>Notify Team]
D -->|Pass| F[Stage to Canary]
end
subgraph "Canary Deployment"
F --> G[5% Traffic<br>Canary Instance]
G --> H[Real-Time Monitoring<br>15 min window]
H --> I{Canary Quality<br>≥ Baseline?}
I -->|Fail| J[Auto-Rollback<br>Alert On-Call]
I -->|Pass| K[Promote to 25%]
K --> L[25% Monitor<br>30 min]
L --> M{Quality Hold?}
M -->|Fail| J
M -->|Pass| N[Promote to 50% → 100%]
end
subgraph "Post-Deployment"
N --> O[Full Production Traffic]
O --> P[Continuous Monitoring]
P --> Q[Hallucination Detection]
P --> R[Semantic Drift Detection]
P --> S[Hourly Quality Sampling]
Q --> T{Hallucination<br>Rate > 2%?}
R --> U{Drift Score<br>> Threshold?}
S --> V{Quality Drop<br>> 5%?}
T -->|Yes| W[Emergency Rollback]
U -->|Yes| X[Alert: Gradual Drift]
V -->|Yes| Y[Alert: Quality Degradation]
end
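The promote/rollback decision at each canary stage is not covered by the low-level design below. A minimal sketch of that decision, assuming quality and latency metrics for the canary and baseline fleets are already aggregated; the CanaryMetrics shape, field names, and thresholds are illustrative, loosely drawn from the acceptance criteria and the scenarios later in this document.

from dataclasses import dataclass

# Hypothetical aggregated metrics for one monitoring window (field names are illustrative).
@dataclass
class CanaryMetrics:
    task_completion: float      # e.g., 0.89
    hallucination_rate: float   # e.g., 0.012
    p99_latency_s: float        # e.g., 4.1

STAGES = [5, 25, 50, 100]  # percent of traffic

def canary_decision(canary: CanaryMetrics, baseline: CanaryMetrics, stage: int) -> str:
    """Return 'promote', 'hold', or 'rollback' for the current canary stage."""
    if canary.hallucination_rate > 0.02:                    # absolute ceiling from the acceptance criteria
        return "rollback"
    if canary.p99_latency_s > 2 * baseline.p99_latency_s:   # latency guard added after Scenario D
        return "rollback"
    if canary.task_completion < baseline.task_completion - 0.05:
        return "rollback"
    return "promote" if stage < STAGES[-1] else "hold"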
Synthetic Test Generation Architecture
graph TD
A[Seed Examples<br>Human-curated per intent] --> B[LLM-Based Generator<br>Claude 3 Haiku]
B --> C[Raw Synthetic Tests<br>5000 candidates]
C --> D[Quality Filter]
D --> D1[Is this a realistic query?]
D --> D2[Does it match the target intent?]
D --> D3[Is the expected behavior testable?]
D --> D4[Is it distinct from existing tests?]
D1 --> E[Validated Test Suite<br>500 tests]
D2 --> E
D3 --> E
D4 --> E
E --> F[Tiered Test Sets]
F --> F1[Critical Path Tests<br>50 must-pass]
F --> F2[Regression Tests<br>200 broad coverage]
F --> F3[Edge Case Tests<br>150 adversarial]
F --> F4[Multi-Turn Tests<br>100 conversation flows]
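Of the four filter questions above, the generator below only implements the distinctness check (_deduplicate); the other three could be applied with a cheap LLM-as-judge pass. A sketch under that assumption, reusing the same Haiku model the generator uses; the judge prompt and its JSON schema are illustrative.

import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def passes_quality_filter(query: str, intent: str) -> bool:
    """Ask a cheap judge model whether a candidate test is realistic, on-intent, and testable."""
    prompt = (
        f'Candidate chatbot test query: "{query}"\n'
        f'Target intent: "{intent}"\n'
        'Answer with only JSON: {"realistic": true|false, "matches_intent": true|false, "testable": true|false}'
    )
    resp = bedrock.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 128,
            "temperature": 0.0,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )
    text = json.loads(resp["body"].read())["content"][0]["text"]
    try:
        verdict = json.loads(text[text.index("{"): text.rindex("}") + 1])
    except (ValueError, json.JSONDecodeError):
        return False  # unparseable judge output: reject conservatively
    return all(verdict.get(k) is True for k in ("realistic", "matches_intent", "testable"))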
Hallucination Detection Pipeline
sequenceDiagram
participant Agent as MangaAssist Agent
participant Mon as Hallucination Monitor
participant KB as Knowledge Base<br>OpenSearch
participant DB as Product Catalog<br>DynamoDB
participant CW as CloudWatch
Agent->>Mon: Response + Context + Query
Mon->>Mon: Extract factual claims from response
alt Product claim (title, price, date, author)
Mon->>DB: Verify claim against product catalog
DB->>Mon: Ground truth
Mon->>Mon: Match? → factual / hallucinated
end
alt Knowledge claim (manga lore, story content)
Mon->>KB: Verify against RAG knowledge base
KB->>Mon: Retrieved context
Mon->>Mon: Supported by context? → grounded / unsupported
end
alt Order claim (status, tracking, dates)
Mon->>DB: Verify against order database
DB->>Mon: Ground truth
Mon->>Mon: Match? → correct / fabricated
end
Mon->>CW: Publish hallucination metrics
alt Hallucination detected
Mon->>CW: ALARM: Hallucination detected
Note over Mon: Log full trace for review
end
Low-Level Design
Synthetic Test Generator
import json
import logging
from dataclasses import dataclass, field
import boto3
logger = logging.getLogger(__name__)
@dataclass
class SyntheticTest:
"""A synthetic test case for deployment validation."""
test_id: str = ""
intent: str = ""
tier: str = "regression" # "critical", "regression", "edge_case", "multi_turn"
query: str = ""
expected_intent: str = ""
expected_tool: str = "" # API that should be called
expected_contains: list[str] = field(default_factory=list) # Strings response must contain
expected_not_contains: list[str] = field(default_factory=list) # Strings response must NOT contain
context: dict = field(default_factory=dict) # Session context for multi-turn
conversation_history: list[dict] = field(default_factory=list)
# Seed examples per intent — human-curated baseline for LLM generation
SEED_EXAMPLES = {
"recommendation": [
{
"query": "I just finished reading Chainsaw Man. What should I read next?",
"expected_tool": "recommendation_engine",
"expected_contains": ["manga", "recommend"],
},
{
"query": "Looking for something similar to Attack on Titan but less violent",
"expected_tool": "recommendation_engine",
"expected_contains": ["similar"],
},
],
"product_question": [
{
"query": "How many volumes does One Piece have?",
"expected_tool": "product_catalog",
"expected_not_contains": ["I think", "probably", "maybe"],
},
],
"order_tracking": [
{
"query": "Where is my Naruto box set order?",
"expected_tool": "order_status",
"expected_contains": ["order", "status"],
},
],
"return_request": [
{
"query": "I want to return the damaged Demon Slayer Vol 5",
"expected_tool": "return_service",
"expected_contains": ["return"],
},
],
"faq": [
{
"query": "What's your return policy for manga?",
"expected_tool": "rag_pipeline",
},
],
"promotion": [
{
"query": "Are there any deals on Shonen Jump titles?",
"expected_tool": "promotion_api",
"expected_contains": ["promotion", "discount", "deal", "offer"],
},
],
"checkout_help": [
{
"query": "I can't add items to my cart",
"expected_tool": "checkout_service",
},
],
"escalation": [
{
"query": "I'm extremely frustrated, I want to speak to a human NOW",
"expected_tool": "escalation_handler",
"expected_contains": ["connect", "agent", "representative"],
},
],
"product_discovery": [
{
"query": "What new manga was released this month?",
"expected_tool": "product_catalog",
},
],
"chitchat": [
{
"query": "Hi there, how are you?",
"expected_tool": "llm_direct",
},
],
}
class SyntheticTestGenerator:
"""Generates synthetic test cases from seed examples using LLM.
Uses Claude 3 Haiku (cheap, fast) to generate variations of seed examples,
then filters for quality, uniqueness, and intent alignment.
"""
def __init__(self, region: str = "us-east-1"):
self.bedrock = boto3.client("bedrock-runtime", region_name=region)
self.model_id = "anthropic.claude-3-haiku-20240307-v1:0"
def generate_test_suite(
self, tests_per_intent: int = 50, include_edge_cases: bool = True
) -> list[SyntheticTest]:
"""Generate a full synthetic test suite covering all intents."""
all_tests: list[SyntheticTest] = []
for intent, seeds in SEED_EXAMPLES.items():
# Generate standard variations
standard_tests = self._generate_variations(intent, seeds, count=tests_per_intent)
all_tests.extend(standard_tests)
# Generate edge cases
if include_edge_cases:
edge_tests = self._generate_edge_cases(intent, seeds, count=tests_per_intent // 3)
all_tests.extend(edge_tests)
# Generate multi-turn test cases
multi_turn_tests = self._generate_multi_turn_tests(count=100)
all_tests.extend(multi_turn_tests)
# Deduplicate
all_tests = self._deduplicate(all_tests)
# Assign tiers
all_tests = self._assign_tiers(all_tests)
logger.info("Generated %d synthetic tests across %d intents", len(all_tests), len(SEED_EXAMPLES))
return all_tests
def _generate_variations(
self, intent: str, seeds: list[dict], count: int
) -> list[SyntheticTest]:
"""Generate variations of seed examples for a single intent."""
seed_queries = "\n".join(f"- {s['query']}" for s in seeds)
prompt = f"""Generate {count} realistic customer queries for the "{intent}" intent
of a Japanese manga store chatbot on Amazon.com.
Seed examples:
{seed_queries}
Requirements:
- Vary phrasing, product titles, and specificity
- Include typos and casual language (real users are imperfect)
- Include queries of different lengths (short, medium, long)
- Reference real manga titles (popular and niche)
- Do NOT repeat seed examples
Return JSON array: [{{"query": "...", "difficulty": "easy|medium|hard"}}]"""
response = self.bedrock.invoke_model(
modelId=self.model_id,
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 4096,
"temperature": 0.8,
"messages": [{"role": "user", "content": prompt}],
}),
)
body = json.loads(response["body"].read())
text = body["content"][0]["text"]
# Parse JSON from response
try:
start = text.index("[")
end = text.rindex("]") + 1
variations = json.loads(text[start:end])
except (ValueError, json.JSONDecodeError):
logger.warning("Failed to parse variations for %s", intent)
return []
tests = []
seed_ref = seeds[0] if seeds else {}
for i, var in enumerate(variations):
tests.append(SyntheticTest(
test_id=f"syn-{intent}-{i:03d}",
intent=intent,
query=var["query"],
expected_intent=intent,
expected_tool=seed_ref.get("expected_tool", ""),
expected_contains=seed_ref.get("expected_contains", []),
expected_not_contains=seed_ref.get("expected_not_contains", []),
))
return tests
def _generate_edge_cases(
self, intent: str, seeds: list[dict], count: int
) -> list[SyntheticTest]:
"""Generate edge case / adversarial tests for an intent."""
prompt = f"""Generate {count} edge-case or adversarial queries that a customer might send
to a JP manga chatbot that SHOULD be classified as "{intent}" but are tricky:
- Ambiguous queries (could be multiple intents)
- Very short queries (1-3 words)
- Queries with misspellings
- Queries mixing languages
- Queries that embed instructions like "ignore previous instructions"
- Queries referencing obscure manga titles
Return JSON array: [{{"query": "...", "edge_type": "ambiguous|short|misspelled|multilingual|injection|obscure"}}]"""
response = self.bedrock.invoke_model(
modelId=self.model_id,
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 2048,
"temperature": 0.9,
"messages": [{"role": "user", "content": prompt}],
}),
)
body = json.loads(response["body"].read())
text = body["content"][0]["text"]
try:
start = text.index("[")
end = text.rindex("]") + 1
edge_cases = json.loads(text[start:end])
except (ValueError, json.JSONDecodeError):
logger.warning("Failed to parse edge cases for %s", intent)
return []
tests = []
for i, ec in enumerate(edge_cases):
tests.append(SyntheticTest(
test_id=f"edge-{intent}-{i:03d}",
intent=intent,
tier="edge_case",
query=ec["query"],
expected_intent=intent,
))
return tests
def _generate_multi_turn_tests(self, count: int) -> list[SyntheticTest]:
"""Generate multi-turn conversation test cases."""
prompt = f"""Generate {count} 2-3 turn conversation flows for a JP manga chatbot.
Each should test context maintenance across turns.
Common patterns:
- Ask about a product → then ask to buy it
- Start a return → change to exchange
- Get a recommendation → ask follow-up about that title
- Track an order → ask about return policy for it
Return JSON array: [{{"turns": [{{"query": "...", "expected_intent": "..."}}], "test_name": "..."}}]"""
response = self.bedrock.invoke_model(
modelId=self.model_id,
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 4096,
"temperature": 0.7,
"messages": [{"role": "user", "content": prompt}],
}),
)
body = json.loads(response["body"].read())
text = body["content"][0]["text"]
try:
start = text.index("[")
end = text.rindex("]") + 1
flows = json.loads(text[start:end])
except (ValueError, json.JSONDecodeError):
return []
tests = []
for i, flow in enumerate(flows):
turns = flow.get("turns", [])
if not turns:
continue
tests.append(SyntheticTest(
test_id=f"multi-{i:03d}",
intent=turns[0].get("expected_intent", ""),
tier="multi_turn",
query=turns[0]["query"],
expected_intent=turns[0].get("expected_intent", ""),
conversation_history=[
{"role": "user", "content": t["query"]} for t in turns[1:]
],
))
return tests
def _deduplicate(self, tests: list[SyntheticTest]) -> list[SyntheticTest]:
"""Remove duplicate and near-duplicate tests."""
seen_queries: set[str] = set()
unique_tests = []
for test in tests:
normalized = test.query.strip().lower()
if normalized not in seen_queries:
seen_queries.add(normalized)
unique_tests.append(test)
return unique_tests
def _assign_tiers(self, tests: list[SyntheticTest]) -> list[SyntheticTest]:
"""Assign priority tiers to tests that don't already have one."""
for test in tests:
if test.tier != "edge_case" and test.tier != "multi_turn":
                # Critical: the first couple of generated variations per intent (IDs ending -000, -001)
                if test.test_id.endswith("-000") or test.test_id.endswith("-001"):
test.tier = "critical"
else:
test.tier = "regression"
return tests
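The runner that executes the suite against the agent is not part of the generator. A minimal sketch, assuming a call_agent(query, history) client for the MangaAssist orchestrator (hypothetical) and checking only the string-level expectations each SyntheticTest carries:

def run_suite(tests: list[SyntheticTest], call_agent) -> dict[str, float]:
    """Run every synthetic test and return the pass rate per tier."""
    results: dict[str, list[bool]] = {}
    for test in tests:
        response = call_agent(test.query, test.conversation_history)  # hypothetical agent client
        response_lower = response.lower()
        passed = (
            all(s.lower() in response_lower for s in test.expected_contains)
            and all(s.lower() not in response_lower for s in test.expected_not_contains)
        )
        results.setdefault(test.tier, []).append(passed)
    return {tier: sum(outcomes) / len(outcomes) for tier, outcomes in results.items()}

# Example gate in the spirit of Scenario A: the regression tier must stay >= 90%
# suite = SyntheticTestGenerator().generate_test_suite(tests_per_intent=50)
# pass_rates = run_suite(suite, call_agent=my_agent_client)
# assert pass_rates.get("regression", 0.0) >= 0.90, "Block deploy"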
Hallucination Monitor
import json
import logging
import re
from dataclasses import dataclass, field
import boto3
logger = logging.getLogger(__name__)
@dataclass
class FactualClaim:
"""A factual claim extracted from the agent's response."""
claim_text: str = ""
claim_type: str = "" # "product", "order", "knowledge", "date", "price"
entity: str = "" # The entity the claim is about
attribute: str = "" # "price", "release_date", "author", "status"
claimed_value: str = "" # What the agent said
ground_truth: str = "" # What the data says (if verifiable)
verified: bool = False
is_hallucination: bool = False
@dataclass
class HallucinationReport:
"""Report for a single response's hallucination analysis."""
conversation_id: str = ""
response_text: str = ""
claims: list[FactualClaim] = field(default_factory=list)
hallucination_count: int = 0
total_claims: int = 0
hallucination_rate: float = 0.0
severity: str = "none" # "none", "low", "medium", "high"
class HallucinationMonitor:
"""Detects hallucinations in MangaAssist responses by verifying factual claims
against the product catalog, order database, and knowledge base.
Three verification stages:
1. Claim extraction: LLM extracts factual claims from the response
2. Ground truth retrieval: Look up claims in DynamoDB/OpenSearch
3. Verification: Compare claimed values against ground truth
"""
def __init__(
self,
product_table_name: str = "manga-assist-products",
order_table_name: str = "manga-assist-orders",
opensearch_endpoint: str = "",
region: str = "us-east-1",
):
self.bedrock = boto3.client("bedrock-runtime", region_name=region)
self.dynamodb = boto3.resource("dynamodb", region_name=region)
self.product_table = self.dynamodb.Table(product_table_name)
self.order_table = self.dynamodb.Table(order_table_name)
self.cloudwatch = boto3.client("cloudwatch", region_name=region)
self.claim_extractor_model = "anthropic.claude-3-haiku-20240307-v1:0"
def analyze_response(
self, conversation_id: str, query: str, response: str, context: dict
) -> HallucinationReport:
"""Analyze a response for hallucinations."""
# Step 1: Extract factual claims
claims = self._extract_claims(response)
# Step 2: Verify each claim
for claim in claims:
self._verify_claim(claim, context)
# Step 3: Compute metrics
hallucinated = [c for c in claims if c.is_hallucination]
total = len(claims)
rate = len(hallucinated) / total if total > 0 else 0.0
# Determine severity
severity = "none"
if rate > 0:
if any(c.claim_type in ("price", "order") for c in hallucinated):
severity = "high" # Financial or order-related hallucinations are critical
elif rate > 0.3:
severity = "high"
elif rate > 0.1:
severity = "medium"
else:
severity = "low"
report = HallucinationReport(
conversation_id=conversation_id,
response_text=response,
claims=claims,
hallucination_count=len(hallucinated),
total_claims=total,
hallucination_rate=rate,
severity=severity,
)
self._publish_metrics(report)
return report
def _extract_claims(self, response: str) -> list[FactualClaim]:
"""Extract verifiable factual claims from the response using LLM."""
prompt = f"""Extract all verifiable factual claims from this chatbot response.
A factual claim is a statement that can be checked against a database — product names,
prices, dates, order statuses, author names, volume counts, etc.
Response: "{response}"
Return JSON array:
[{{"claim_text": "...", "claim_type": "product|order|knowledge|date|price",
"entity": "...", "attribute": "...", "claimed_value": "..."}}]
If there are no verifiable claims, return an empty array: []"""
body = json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 2048,
"temperature": 0.0,
"messages": [{"role": "user", "content": prompt}],
})
resp = self.bedrock.invoke_model(modelId=self.claim_extractor_model, body=body)
text = json.loads(resp["body"].read())["content"][0]["text"]
try:
start = text.index("[")
end = text.rindex("]") + 1
raw_claims = json.loads(text[start:end])
except (ValueError, json.JSONDecodeError):
return []
return [
FactualClaim(
claim_text=c.get("claim_text", ""),
claim_type=c.get("claim_type", ""),
entity=c.get("entity", ""),
attribute=c.get("attribute", ""),
claimed_value=c.get("claimed_value", ""),
)
for c in raw_claims
]
def _verify_claim(self, claim: FactualClaim, context: dict) -> None:
"""Verify a single claim against the appropriate data source."""
if claim.claim_type == "product":
self._verify_product_claim(claim)
elif claim.claim_type == "price":
self._verify_price_claim(claim)
elif claim.claim_type == "order":
self._verify_order_claim(claim, context)
elif claim.claim_type == "date":
self._verify_date_claim(claim)
elif claim.claim_type == "knowledge":
self._verify_knowledge_claim(claim, context)
def _verify_product_claim(self, claim: FactualClaim) -> None:
"""Verify product-related claims against DynamoDB catalog."""
try:
response = self.product_table.get_item(
Key={"product_name": claim.entity}
)
item = response.get("Item")
if not item:
# Product not found — could be hallucinated product
claim.is_hallucination = True
claim.ground_truth = "Product not found in catalog"
claim.verified = True
return
# Check specific attribute
actual_value = str(item.get(claim.attribute, ""))
claimed_value = str(claim.claimed_value)
if actual_value and claimed_value:
claim.ground_truth = actual_value
claim.is_hallucination = actual_value.lower() != claimed_value.lower()
claim.verified = True
except Exception as e:
logger.warning("Failed to verify product claim: %s", e)
def _verify_price_claim(self, claim: FactualClaim) -> None:
"""Verify price claims — strict matching."""
try:
response = self.product_table.get_item(
Key={"product_name": claim.entity}
)
item = response.get("Item")
if not item:
claim.is_hallucination = True
claim.ground_truth = "Product not found"
claim.verified = True
return
actual_price = str(item.get("price", ""))
claimed_price = str(claim.claimed_value)
# Normalize price strings (remove $, whitespace)
actual_norm = re.sub(r"[$ ,]", "", actual_price)
claimed_norm = re.sub(r"[$ ,]", "", claimed_price)
claim.ground_truth = actual_price
claim.is_hallucination = actual_norm != claimed_norm
claim.verified = True
except Exception as e:
logger.warning("Failed to verify price claim: %s", e)
def _verify_order_claim(self, claim: FactualClaim, context: dict) -> None:
"""Verify order-related claims against order database."""
order_id = context.get("order_id", claim.entity)
try:
response = self.order_table.get_item(
Key={"order_id": order_id}
)
item = response.get("Item")
if not item:
claim.is_hallucination = True
claim.ground_truth = "Order not found"
claim.verified = True
return
actual_value = str(item.get(claim.attribute, ""))
claim.ground_truth = actual_value
claim.is_hallucination = actual_value.lower() != str(claim.claimed_value).lower()
claim.verified = True
except Exception as e:
logger.warning("Failed to verify order claim: %s", e)
def _verify_date_claim(self, claim: FactualClaim) -> None:
"""Verify date claims (release dates, shipping dates)."""
self._verify_product_claim(claim) # Dates are stored as product attributes
def _verify_knowledge_claim(self, claim: FactualClaim, context: dict) -> None:
"""Verify knowledge claims against RAG context."""
rag_context = context.get("rag_context", "")
if not rag_context:
return # Cannot verify without context
# Check if the claimed information is supported by the retrieved context
claimed_lower = str(claim.claimed_value).lower()
context_lower = rag_context.lower()
# Loose containment check — if the claimed value appears in context, it's grounded
claim.verified = True
claim.is_hallucination = claimed_lower not in context_lower
claim.ground_truth = "Supported by RAG context" if not claim.is_hallucination else "Not found in RAG context"
def _publish_metrics(self, report: HallucinationReport) -> None:
"""Publish hallucination metrics to CloudWatch."""
metric_data = [
{
"MetricName": "HallucinationRate",
"Value": report.hallucination_rate,
"Unit": "None",
},
{
"MetricName": "HallucinationCount",
"Value": report.hallucination_count,
"Unit": "Count",
},
{
"MetricName": "TotalClaims",
"Value": report.total_claims,
"Unit": "Count",
},
]
if report.severity in ("medium", "high"):
metric_data.append({
"MetricName": "HighSeverityHallucination",
"Value": 1,
"Unit": "Count",
})
self.cloudwatch.put_metric_data(
Namespace="MangaAssist/Hallucination",
MetricData=metric_data,
)
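A usage sketch wiring the monitor into the response path; the conversation ID and the request/response values below mirror Scenario B and are illustrative.

# Usage sketch (illustrative values)
monitor = HallucinationMonitor()
report = monitor.analyze_response(
    conversation_id="conv-example-001",
    query="When does My Hero Academia Volume 41 come out?",
    response="My Hero Academia Volume 41 is scheduled for release on March 15, 2025!",
    context={"rag_context": ""},
)
if report.severity in ("medium", "high"):
    logger.error(
        "Hallucination: %d of %d claims failed verification",
        report.hallucination_count, report.total_claims,
    )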
Semantic Drift Detector
import json
import logging
import numpy as np
from dataclasses import dataclass, field
import boto3
logger = logging.getLogger(__name__)
@dataclass
class DriftReport:
"""Report on semantic drift between two time periods."""
intent: str = ""
baseline_period: str = ""
current_period: str = ""
embedding_drift: float = 0.0 # Cosine distance between avg embeddings
vocabulary_drift: float = 0.0 # Jaccard distance of top-100 tokens
length_drift: float = 0.0 # Relative change in avg response length
overall_drift_score: float = 0.0 # Weighted composite
drifted: bool = False
drift_direction: str = "" # "shorter", "longer", "more_formal", "less_specific"
class SemanticDriftDetector:
"""Detects when model outputs shift semantically over time.
Compares response distributions between a baseline period (e.g., last 30 days)
and a recent period (e.g., last 24 hours) to detect drift.
Three drift signals:
1. Embedding drift: avg response embedding shifts in vector space
2. Vocabulary drift: the distribution of tokens changes
3. Length drift: responses get significantly shorter or longer
"""
DRIFT_THRESHOLD = 0.15 # Overall drift score threshold
WEIGHTS = {"embedding": 0.5, "vocabulary": 0.3, "length": 0.2}
def __init__(self, region: str = "us-east-1"):
self.bedrock = boto3.client("bedrock-runtime", region_name=region)
self.cloudwatch = boto3.client("cloudwatch", region_name=region)
self.embedding_model = "amazon.titan-embed-text-v2:0"
def detect_drift(
self,
intent: str,
baseline_responses: list[str],
current_responses: list[str],
) -> DriftReport:
"""Compare current responses against baseline to detect drift."""
if not baseline_responses or not current_responses:
return DriftReport(intent=intent, drifted=False)
# Compute embedding drift
baseline_embs = [self._get_embedding(r) for r in baseline_responses]
current_embs = [self._get_embedding(r) for r in current_responses]
baseline_avg = np.mean(baseline_embs, axis=0)
current_avg = np.mean(current_embs, axis=0)
embedding_drift = 1.0 - float(
np.dot(baseline_avg, current_avg)
/ (np.linalg.norm(baseline_avg) * np.linalg.norm(current_avg) + 1e-10)
)
# Compute vocabulary drift (Jaccard distance of top tokens)
baseline_tokens = self._top_tokens(baseline_responses, n=100)
current_tokens = self._top_tokens(current_responses, n=100)
intersection = baseline_tokens & current_tokens
union = baseline_tokens | current_tokens
vocab_drift = 1.0 - (len(intersection) / len(union)) if union else 0.0
# Compute length drift
baseline_avg_len = np.mean([len(r.split()) for r in baseline_responses])
current_avg_len = np.mean([len(r.split()) for r in current_responses])
length_drift = abs(current_avg_len - baseline_avg_len) / (baseline_avg_len + 1e-10)
# Weighted composite
overall = (
self.WEIGHTS["embedding"] * embedding_drift
+ self.WEIGHTS["vocabulary"] * vocab_drift
+ self.WEIGHTS["length"] * length_drift
)
# Determine drift direction
direction = ""
if length_drift > 0.1:
direction = "longer" if current_avg_len > baseline_avg_len else "shorter"
if vocab_drift > 0.2:
direction += (" + " if direction else "") + "vocabulary_shift"
report = DriftReport(
intent=intent,
embedding_drift=round(embedding_drift, 4),
vocabulary_drift=round(vocab_drift, 4),
length_drift=round(length_drift, 4),
overall_drift_score=round(overall, 4),
drifted=overall > self.DRIFT_THRESHOLD,
drift_direction=direction,
)
self._publish_drift_metrics(report)
return report
def _get_embedding(self, text: str) -> list[float]:
"""Get text embedding from Titan Embeddings v2."""
response = self.bedrock.invoke_model(
modelId=self.embedding_model,
body=json.dumps({"inputText": text[:8000]}), # Truncate to model limit
)
body = json.loads(response["body"].read())
return body["embedding"]
def _top_tokens(self, responses: list[str], n: int = 100) -> set[str]:
"""Get the top-n most frequent tokens across responses."""
token_counts: dict[str, int] = {}
for resp in responses:
for token in resp.lower().split():
token = token.strip(".,!?\"'()[]{}:;")
if len(token) > 2:
token_counts[token] = token_counts.get(token, 0) + 1
sorted_tokens = sorted(token_counts.items(), key=lambda x: x[1], reverse=True)
return {t for t, _ in sorted_tokens[:n]}
def _publish_drift_metrics(self, report: DriftReport) -> None:
"""Publish drift metrics to CloudWatch."""
dimensions = [{"Name": "Intent", "Value": report.intent}]
self.cloudwatch.put_metric_data(
Namespace="MangaAssist/SemanticDrift",
MetricData=[
{
"MetricName": "EmbeddingDrift",
"Value": report.embedding_drift,
"Unit": "None",
"Dimensions": dimensions,
},
{
"MetricName": "OverallDriftScore",
"Value": report.overall_drift_score,
"Unit": "None",
"Dimensions": dimensions,
},
],
)
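A usage sketch comparing a baseline sample against a recent sample for one intent; the response strings are illustrative, and in practice both samples would be pulled from conversation logs (baseline: last 30 days, current: last 24 hours).

# Usage sketch (illustrative samples)
detector = SemanticDriftDetector()
baseline = ["Our return window is 30 days from delivery.", "Returns are free within 30 days of delivery."]
current = ["Certainly! I'd be delighted to walk you through our comprehensive return policy."]
report = detector.detect_drift(intent="faq", baseline_responses=baseline, current_responses=current)
if report.drifted:
    logger.warning("faq drift %.2f (%s)", report.overall_drift_score, report.drift_direction)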
MangaAssist Scenarios
Scenario A: Synthetic Tests Catch Prompt Regression
Context: A prompt engineer updated the system prompt to make responses more concise. The change passed code review because it looked reasonable.
What Happened:
- CI triggered the synthetic test suite (500 tests)
- Critical-tier results: 48/50 passed (96%)
- Regression-tier results: 172/200 passed (86% — below 90% threshold)
- 22 of 28 failures were in product_question intent
- Failed tests expected specific factual details (volume counts, release dates) that the concise prompt now omitted
Quality Gate Result:
QUALITY GATE: FAILED
- regression tier: 86% < 90% threshold
- intent: product_question regression: 78% < 85% threshold
- 22 tests expect factual details now missing
Fix: The prompt was revised to stay concise for conversational intents while retaining factual completeness for product_question, using separate prompt templates per intent category (see the sketch below). Rerun: 195/200 regression tests passed (97.5%).
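A minimal sketch of that per-intent template split; the template text and intent keys are illustrative, not the production prompts.

# Illustrative split: factual intents keep detail requirements, conversational intents stay concise.
INTENT_PROMPT_TEMPLATES = {
    "product_question": (
        "Answer using the catalog context and include the specific facts asked for "
        "(volume counts, release dates, prices). Do not omit details for brevity."
    ),
    "chitchat": "Reply in one or two friendly, concise sentences.",
}
DEFAULT_TEMPLATE = "Be helpful, accurate, and concise."

def system_prompt_for(intent: str) -> str:
    return INTENT_PROMPT_TEMPLATES.get(intent, DEFAULT_TEMPLATE)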
Scenario B: Hallucination Monitor Catches Fabricated Release Date
Context: A user asked "When does My Hero Academia Volume 41 come out?" The product catalog had no entry for Volume 41 (it hadn't been announced yet).
What Happened:
- The agent responded: "My Hero Academia Volume 41 is scheduled for release on March 15, 2025! You can pre-order it now."
- Hallucination monitor extracted claims:
1. {entity: "My Hero Academia Vol 41", attribute: "release_date", claimed_value: "March 15, 2025"} → HALLUCINATION (product not in catalog)
2. {entity: "My Hero Academia Vol 41", attribute: "pre_order", claimed_value: "available"} → HALLUCINATION (no such product)
- Severity: HIGH (date + purchase-action hallucination)
- Alert fired immediately to on-call
Root Cause: Claude 3.5 Sonnet fabricated a plausible release date based on training data patterns. The RAG pipeline returned no results (product doesn't exist), but the agent confabulated instead of saying "I don't have information about that."
Fix:
1. Added a guardrail: if RAG returns zero results for a product query, inject "No product found — do not invent details" into the prompt (sketched below)
2. Added a negative test to the synthetic suite: queries about non-existent products must NOT return dates or prices
3. Post-fix: hallucination rate for product_question dropped from 4.2% to 0.3%
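A minimal sketch of fix 1, assuming the orchestrator assembles the prompt from retrieved chunks; the function and variable names are illustrative.

def build_product_prompt(query: str, retrieved_chunks: list[str]) -> str:
    """Inject an explicit 'do not invent' instruction when retrieval returns nothing."""
    if not retrieved_chunks:
        grounding = (
            "No matching product was found in the catalog. "
            "Tell the customer you have no information about it. "
            "Do NOT invent titles, prices, release dates, or pre-order availability."
        )
    else:
        grounding = "\n".join(retrieved_chunks)
    return f"Context:\n{grounding}\n\nCustomer question: {query}"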
Scenario C: Semantic Drift After Bedrock Runtime Update
Context: AWS updated the Bedrock Claude runtime. No model version change was announced, but the team noticed the weekly drift report had flagged the faq intent: a vocabulary drift score of 0.28 pushed the overall drift score above the 0.15 threshold.
What Happened:
- Drift report: faq intent responses shifted from the baseline
- Embedding drift: 0.08 (normal)
- Vocabulary drift: 0.28 (HIGH — new words appearing)
- Length drift: 0.15 (responses 15% longer)
- Overall: 0.17 (above 0.15 threshold) → drifted: True
- Analysis of the top-100 tokens showed new entries such as "certainly", "absolutely", "delighted to help", and "comprehensive": the model had become more verbose and formal
- User satisfaction for faq dropped from 87% to 82% — users preferred the previous direct tone
Root Cause: Bedrock runtime update subtly changed response style. Not a quality regression in content, but a tonal shift.
Fix:
1. Added explicit tone instructions to the FAQ prompt: "Be direct, factual, and concise. Do not use filler phrases like 'certainly' or 'delighted to help.'"
2. Re-baselined the drift detector after the fix
3. Added vocabulary drift to the canary validation pipeline: vocabulary drift > 0.20 now blocks promotion beyond 25% traffic
Scenario D: Canary Pipeline Auto-Rollback on Latency Spike
Context: A new version of the MangaAssist orchestrator was deployed with an updated RAG chunking strategy (8 chunks instead of 5) for better retrieval quality.
What Happened:
- Canary deployed at 5% traffic
- Synthetic quality tests: PASSED (quality improved by 3%)
- Real-time monitoring (15-minute window):
  - Task completion: 89% (stable)
  - Latency P99: 8.2s (baseline: 3.5s) → FAILED
  - Latency P50: 2.1s (baseline: 1.2s) → WARNING
- Auto-rollback triggered at the 5% stage; canary terminated
Root Cause: 8 chunks × OpenSearch k-NN + re-ranking = 2.4x more compute time. The quality improvement was real, but the latency cost was unacceptable. The pre-deployment synthetic tests didn't test under production load conditions — they ran sequentially.
Fix:
1. Added latency thresholds to the canary validation: P99 must be ≤ 2x baseline
2. Compromised on 6 chunks (instead of 8), achieving 70% of the quality improvement with only a 30% latency increase
3. Pre-warmed the OpenSearch cache for common queries before canary deployment
4. Post-fix canary: P99 = 4.1s (within 2x the 3.5s baseline), quality up 2%
Intuition Gained
Synthetic Tests + Production Monitoring = Two Complementary Safety Nets
Synthetic tests catch known failure modes before deployment. Production monitoring catches unknown failure modes after deployment. Neither alone is sufficient. The MangaAssist pipeline uses both: synthetic tests gate the deployment, production monitoring gates the canary rollout, and continuous monitoring catches gradual drift.
Hallucination Monitoring Needs Data Source Verification, Not Just LLM Self-Assessment
Asking an LLM "did you hallucinate?" doesn't work — the model doesn't know. Effective hallucination detection compares specific claims against structured ground truth (product catalog, order database). MangaAssist's hallucination monitor runs zero LLM inference for verification — it uses DynamoDB lookups and string matching. The only LLM call is for claim extraction.
Semantic Drift Is the Silent Killer
Model updates, runtime changes, and prompt tweaks can shift output distributions without changing individual response quality scores. "Certainly, I'd be delighted to help!" and "Sure, here's the info:" are both correct responses, but switching from one to the other is a significant tonal shift in the user experience. Drift detection at the distribution level catches changes that per-response evaluation misses.
References
- MangaAssist Architecture HLD — Deployment pipeline architecture
- CI/CD Application Code Pipeline — Deployment stages
- ML Model Deployment Pipeline — ML model canary patterns
- Quality Assurance Processes — Quality gates referenced by canary pipeline
- Multi-Perspective Assessment — Evaluation methods used in synthetic tests
- Model Evaluation Optimal Configuration — Canary deployment controller
- Security & Guardrails — Guardrails for hallucination prevention