
01: FM Output Quality Assessment Framework

AIP-C01 Mapping

  • Content Domain 5: Testing, Validation, and Troubleshooting
  • Task 5.1: Implement evaluation systems for GenAI
  • Skill 5.1.1: Develop comprehensive assessment frameworks to evaluate the quality and effectiveness of FM outputs beyond traditional ML evaluation approaches (for example, by using metrics for relevance, factual accuracy, consistency, and fluency).


User Story

As a MangaAssist evaluation engineer, I want to assess the quality of every LLM-generated response across relevance, factual accuracy, consistency, and fluency dimensions, so that we catch quality regressions before they reach customers and continuously improve the chatbot's response quality beyond what traditional ML metrics (accuracy, F1) can measure.


Acceptance Criteria

  • Every LLM response is scored on at least four quality dimensions: relevance, factual accuracy, consistency, and fluency
  • Evaluation runs automatically on every prompt or model change before deployment
  • Scores are tracked over time with alerting when any dimension drops below threshold
  • The framework handles MangaAssist-specific evaluation needs: product recommendation quality, Japanese content handling, and multi-turn coherence
  • Evaluation results feed into a dashboard that the team reviews weekly
  • False positive rate for quality gates is below 5% (does not block good changes unnecessarily)

Why Traditional ML Metrics Fall Short for GenAI

Traditional ML evaluation (accuracy, precision, recall, F1, AUC) works when there is one correct answer per input. GenAI outputs are open-ended: a recommendation response can be relevant in many different ways, and two equally good responses can use completely different wording.

| Traditional ML Metric | Why It Fails for GenAI | What We Need Instead |
|---|---|---|
| Accuracy | No single correct answer for "Recommend manga like Naruto" | Relevance scoring against user intent and profile |
| Precision / Recall | Binary classification metrics don't capture response quality nuance | Multi-dimensional scoring (relevance + accuracy + fluency + safety) |
| F1 Score | Assumes discrete classes; LLM outputs are continuous text | Semantic similarity + factual grounding checks |
| AUC | Measures ranking of a binary classifier | BERTScore, ROUGE-L, factual consistency scores |
| BLEU / ROUGE alone | Captures lexical overlap, not semantic quality | Human-correlated composite scores |
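
As a concrete illustration of the "what we need instead" column, reference-based semantic metrics score meaning overlap rather than exact-match correctness. A minimal sketch, assuming the open-source bert-score and rouge-score packages are installed; the candidate and reference strings are made up:

from bert_score import score as bertscore     # pip install bert-score
from rouge_score import rouge_scorer          # pip install rouge-score

reference = "Try Uzumaki and Gyo by Junji Ito; both are horror classics."
candidate = "For horror fans, Junji Ito's Uzumaki and Gyo are great picks."

# BERTScore: embedding-based similarity, robust to paraphrase.
P, R, F1 = bertscore([candidate], [reference], lang="en")
print(f"BERTScore F1: {F1.item():.3f}")

# ROUGE-L: longest-common-subsequence overlap, closer to lexical matching.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
print(f"ROUGE-L F1: {scorer.score(reference, candidate)['rougeL'].fmeasure:.3f}")

The two sentences share little exact wording, so lexical metrics under-score them while embedding-based similarity stays high, which is exactly the gap the framework's composite scoring is designed to close.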

High-Level Design

Quality Dimension Taxonomy

graph TD
    A[FM Output Quality] --> B[Relevance]
    A --> C[Factual Accuracy]
    A --> D[Consistency]
    A --> E[Fluency]

    B --> B1[Intent Relevance<br>Does the response address<br>what the user asked?]
    B --> B2[Product Relevance<br>Are recommended products<br>appropriate for the query?]
    B --> B3[Context Relevance<br>Does the response use<br>page context and history?]

    C --> C1[Price Accuracy<br>Do prices match the<br>Product Catalog?]
    C --> C2[Availability Accuracy<br>Is stock status correct?]
    C --> C3[Policy Accuracy<br>Are return/shipping policies<br>stated correctly?]
    C --> C4[Product Attribute Accuracy<br>Are author, volume, format<br>details correct?]

    D --> D1[Intra-Turn Consistency<br>No contradictions within<br>a single response]
    D --> D2[Cross-Turn Consistency<br>No contradictions across<br>conversation turns]
    D --> D3[Persona Consistency<br>Stays in MangaAssist<br>assistant character]

    E --> E1[Grammar and Syntax<br>Correct English or Japanese]
    E --> E2[Tone Appropriateness<br>Friendly, helpful, not robotic]
    E --> E3[Formatting Quality<br>Proper use of product cards,<br>links, action buttons]

Evaluation Pipeline Architecture

graph LR
    subgraph "Input"
        A[Test Case<br>query + context + expected_intent]
    end

    subgraph "MangaAssist Pipeline"
        A --> B[Intent Classifier]
        B --> C[Orchestrator]
        C --> D[Service Calls<br>Catalog, Reco, RAG]
        D --> E[LLM Generation<br>Bedrock Claude]
        E --> F[Guardrails]
        F --> G[Final Response]
    end

    subgraph "Evaluation Engine"
        G --> H[Relevance Scorer]
        G --> I[Factual Accuracy Checker]
        G --> J[Consistency Checker]
        G --> K[Fluency Scorer]
        H --> L[Composite Score<br>Weighted Average]
        I --> L
        J --> L
        K --> L
    end

    subgraph "Output"
        L --> M[Score Report]
        L --> N[Pass / Fail Gate]
        L --> O[Metric Store<br>CloudWatch + Redshift]
    end

Scoring Thresholds by Intent

Not every intent needs the same quality bar. High-stakes intents (order tracking, return requests) need near-perfect factual accuracy, while discovery intents (recommendations) prioritize relevance over exact wording.

| Intent | Relevance Weight | Factual Accuracy Weight | Consistency Weight | Fluency Weight | Pass Threshold |
|---|---|---|---|---|---|
| recommendation | 0.40 | 0.20 | 0.20 | 0.20 | 0.75 |
| product_question | 0.25 | 0.40 | 0.20 | 0.15 | 0.80 |
| faq | 0.20 | 0.40 | 0.25 | 0.15 | 0.85 |
| order_tracking | 0.15 | 0.50 | 0.25 | 0.10 | 0.90 |
| return_request | 0.15 | 0.50 | 0.25 | 0.10 | 0.90 |
| promotion | 0.30 | 0.35 | 0.20 | 0.15 | 0.80 |
| checkout_help | 0.20 | 0.40 | 0.25 | 0.15 | 0.85 |
| chitchat | 0.30 | 0.10 | 0.30 | 0.30 | 0.70 |
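
A quick worked example of how the table is applied; the dimension scores are hypothetical, and the weights and threshold come from the recommendation row above:

# Hypothetical dimension scores for one recommendation response
scores  = {"relevance": 0.80, "factual_accuracy": 0.90, "consistency": 0.85, "fluency": 0.70}
weights = {"relevance": 0.40, "factual_accuracy": 0.20, "consistency": 0.20, "fluency": 0.20}

composite = sum(scores[d] * weights[d] for d in weights)   # 0.32 + 0.18 + 0.17 + 0.14 = 0.81
print(composite >= 0.75)  # True: passes the recommendation threshold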

Evaluation Trigger Points

graph TD
    subgraph "When Evaluation Runs"
        A[Prompt Change<br>CD-06 Pipeline] -->|golden dataset| E[Offline Evaluation]
        B[Model Version Upgrade<br>Claude 3.5 → 4] -->|full suite| E
        C[RAG Index Refresh<br>Knowledge Base Update] -->|retrieval + accuracy| E
        D[Weekly Scheduled<br>Regression Check] -->|sampled production traffic| E
    end

    E --> F{All dimensions<br>above threshold?}
    F -->|Yes| G[Deploy / Continue]
    F -->|No| H[Block + Alert + Report]
    H --> I[Human Review Queue]
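
One way to wire the pass/fail gate into the CD-06 pipeline is a small driver script that runs the golden dataset and exits non-zero on failure. This is a hedged sketch: the golden-dataset path, load_golden_dataset, and generate_response are hypothetical stand-ins for MangaAssist's actual entry points, and FMOutputQualityEvaluator is defined in the Low-Level Design below.

import sys

def run_quality_gate(golden_path: str = "golden/recommendation_v3.jsonl") -> int:
    """Run offline evaluation and return a process exit code for the CD pipeline."""
    evaluator = FMOutputQualityEvaluator()                        # defined in Low-Level Design
    test_cases = load_golden_dataset(golden_path)                 # hypothetical loader
    batch = [(tc, generate_response(tc)) for tc in test_cases]    # hypothetical pipeline call
    report = evaluator.evaluate_batch(batch)

    if report.failed_cases > 0:
        print(f"Quality gate FAILED: {report.failed_cases}/{report.total_cases} cases below threshold")
        return 1
    print(f"Quality gate passed: {report.passed_cases}/{report.total_cases}")
    return 0

if __name__ == "__main__":
    sys.exit(run_quality_gate())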

Low-Level Design

Core Data Model

from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
import time


class QualityDimension(Enum):
    RELEVANCE = "relevance"
    FACTUAL_ACCURACY = "factual_accuracy"
    CONSISTENCY = "consistency"
    FLUENCY = "fluency"


class IntentType(Enum):
    RECOMMENDATION = "recommendation"
    PRODUCT_QUESTION = "product_question"
    FAQ = "faq"
    ORDER_TRACKING = "order_tracking"
    RETURN_REQUEST = "return_request"
    PROMOTION = "promotion"
    CHECKOUT_HELP = "checkout_help"
    CHITCHAT = "chitchat"


@dataclass
class EvaluationTestCase:
    """A single test case for FM output quality evaluation."""
    test_id: str
    query: str
    intent: IntentType
    page_context: dict
    conversation_history: list[dict]
    expected_products: list[str]          # ASINs that should appear
    expected_facts: list[str]             # Facts that must be accurate
    forbidden_content: list[str]          # Content that must not appear
    reference_response: Optional[str]     # Gold-standard response for comparison
    metadata: dict = field(default_factory=dict)


@dataclass
class DimensionScore:
    """Score for a single quality dimension."""
    dimension: QualityDimension
    score: float                          # 0.0 to 1.0
    confidence: float                     # How confident the scorer is
    evidence: list[str]                   # Why this score was given
    sub_scores: dict[str, float] = field(default_factory=dict)


@dataclass
class EvaluationResult:
    """Complete evaluation result for a single test case."""
    test_id: str
    intent: IntentType
    response_text: str
    dimension_scores: list[DimensionScore]
    composite_score: float
    passed: bool
    latency_ms: float
    timestamp: float = field(default_factory=time.time)
    failure_reasons: list[str] = field(default_factory=list)


@dataclass
class EvaluationReport:
    """Aggregate report across all test cases in a run."""
    run_id: str
    trigger: str                          # "prompt_change", "model_upgrade", "scheduled"
    total_cases: int
    passed_cases: int
    failed_cases: int
    dimension_averages: dict[str, float]
    intent_breakdown: dict[str, dict]     # Scores by intent type
    regression_flags: list[str]           # Dimensions that dropped vs. baseline
    timestamp: float = field(default_factory=time.time)
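
An example of how a golden-dataset entry maps onto this data model; all values, including the ASINs, are illustrative:

example_case = EvaluationTestCase(
    test_id="rec-horror-001",
    query="Recommend horror manga similar to Junji Ito's work",
    intent=IntentType.RECOMMENDATION,
    page_context={"store_section": "horror-manga", "current_asin": None},
    conversation_history=[],
    expected_products=["B0EXAMPLE01", "B0EXAMPLE02"],   # illustrative ASINs
    expected_facts=[],
    forbidden_content=["spoilers for unreleased volumes"],
    reference_response=None,
)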

Relevance Scorer

import json
import logging
from typing import Optional

import boto3

logger = logging.getLogger(__name__)


class RelevanceScorer:
    """Scores how relevant the FM response is to the user's query and context.

    Combines an LLM-as-Judge call for intent relevance with deterministic
    checks for product coverage and context utilization; a reference-based
    semantic-similarity metric (for example, BERTScore against the
    gold-standard response) can be layered in when one is available.
    """

    def __init__(self, bedrock_client=None, model_id: str = "anthropic.claude-3-5-sonnet-20241022-v2:0"):
        self.bedrock = bedrock_client or boto3.client("bedrock-runtime", region_name="us-east-1")
        self.model_id = model_id

    def score(
        self,
        query: str,
        response: str,
        intent: IntentType,
        page_context: dict,
        conversation_history: list[dict],
        expected_products: list[str],
    ) -> DimensionScore:
        sub_scores = {}

        # Sub-score 1: Intent relevance via LLM-as-Judge
        sub_scores["intent_relevance"] = self._score_intent_relevance(query, response, intent)

        # Sub-score 2: Product relevance (for product-related intents)
        if intent in (IntentType.RECOMMENDATION, IntentType.PRODUCT_QUESTION, IntentType.PROMOTION):
            sub_scores["product_relevance"] = self._score_product_relevance(
                response, expected_products
            )
        else:
            sub_scores["product_relevance"] = 1.0  # Not applicable, full marks

        # Sub-score 3: Context utilization
        sub_scores["context_relevance"] = self._score_context_utilization(
            response, page_context, conversation_history
        )

        # Weighted combination
        weights = {"intent_relevance": 0.50, "product_relevance": 0.30, "context_relevance": 0.20}
        composite = sum(sub_scores[k] * weights[k] for k in weights)

        evidence = []
        for k, v in sub_scores.items():
            if v < 0.7:
                evidence.append(f"{k} scored low ({v:.2f})")

        return DimensionScore(
            dimension=QualityDimension.RELEVANCE,
            score=composite,
            confidence=0.85,
            evidence=evidence,
            sub_scores=sub_scores,
        )

    def _score_intent_relevance(self, query: str, response: str, intent: IntentType) -> float:
        """Uses LLM-as-Judge to assess whether the response addresses the user's intent."""
        prompt = f"""You are evaluating a chatbot response for the MangaAssist JP Manga store assistant.

User query: {query}
Detected intent: {intent.value}
Assistant response: {response}

Rate how well the response addresses the user's intent on a scale of 0.0 to 1.0:
- 1.0: Perfectly addresses the intent with actionable information
- 0.8: Addresses the intent well but misses minor details
- 0.6: Partially addresses the intent
- 0.4: Tangentially related but does not answer the question
- 0.2: Mostly irrelevant
- 0.0: Completely off-topic

Return ONLY a JSON object: {{"score": <float>, "reason": "<brief explanation>"}}"""

        try:
            body = json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 150,
                "temperature": 0.0,
                "messages": [{"role": "user", "content": prompt}],
            })
            resp = self.bedrock.invoke_model(modelId=self.model_id, body=body)
            result = json.loads(json.loads(resp["body"].read())["content"][0]["text"])
            return max(0.0, min(1.0, result["score"]))
        except Exception as e:
            logger.warning("Intent relevance scoring failed, defaulting to 0.5: %s", e)
            return 0.5

    def _score_product_relevance(self, response: str, expected_products: list[str]) -> float:
        """Checks whether expected product ASINs appear in the response."""
        if not expected_products:
            return 1.0
        found = sum(1 for asin in expected_products if asin in response)
        return found / len(expected_products)

    def _score_context_utilization(
        self,
        response: str,
        page_context: dict,
        conversation_history: list[dict],
    ) -> float:
        """Checks whether the response incorporates relevant context signals."""
        signals_used = 0
        signals_available = 0

        # Check if current ASIN is referenced when relevant
        current_asin = page_context.get("current_asin")
        if current_asin:
            signals_available += 1
            if current_asin in response:
                signals_used += 1

        # Check if browsing context is reflected
        store_section = page_context.get("store_section", "")
        if store_section:
            signals_available += 1
            if store_section.replace("-", " ").lower() in response.lower():
                signals_used += 1

        # Check if conversation history is acknowledged in multi-turn
        if len(conversation_history) > 1:
            signals_available += 1
            last_user_msg = ""
            for turn in reversed(conversation_history):
                if turn.get("role") == "user":
                    last_user_msg = turn.get("content", "")
                    break
            # Simple check: does response build on previous context?
            if last_user_msg and any(
                word in response.lower()
                for word in last_user_msg.lower().split()
                if len(word) > 4
            ):
                signals_used += 1

        if signals_available == 0:
            return 1.0
        return signals_used / signals_available
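
Because every scorer accepts an injected bedrock_client, the LLM-as-Judge path can be exercised offline with a test double. A minimal sketch; the stub class, query, and response text are illustrative only and not part of the framework:

import io
import json

class StubBedrock:
    """Test double for bedrock-runtime that always returns a fixed judge verdict."""
    def invoke_model(self, modelId, body):
        judge_json = json.dumps({"score": 0.9, "reason": "stubbed verdict"})
        payload = json.dumps({"content": [{"text": judge_json}]})
        return {"body": io.BytesIO(payload.encode("utf-8"))}

scorer = RelevanceScorer(bedrock_client=StubBedrock())
result = scorer.score(
    query="Recommend horror manga similar to Junji Ito's work",
    response="For horror manga in the spirit of Junji Ito, try Kazuo Umezu and Hideshi Hino.",
    intent=IntentType.RECOMMENDATION,
    page_context={"store_section": "horror-manga"},
    conversation_history=[],
    expected_products=[],
)
print(result.score, result.sub_scores)   # composite relevance plus per-sub-dimension scores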

Factual Accuracy Checker

class FactualAccuracyChecker:
    """Verifies that FM responses contain accurate facts by cross-referencing
    against the Product Catalog, Knowledge Base, and Order Service.

    This is critical for MangaAssist because incorrect prices, availability,
    or return policies damage customer trust immediately.
    """

    def __init__(self, catalog_client=None, knowledge_base_client=None, bedrock_client=None):
        self.catalog = catalog_client
        self.knowledge_base = knowledge_base_client
        self.bedrock = bedrock_client or boto3.client("bedrock-runtime", region_name="us-east-1")

    def score(
        self,
        response: str,
        expected_facts: list[str],
        intent: IntentType,
        retrieved_chunks: Optional[list[str]] = None,
    ) -> DimensionScore:
        sub_scores = {}

        # Sub-score 1: Expected fact coverage
        sub_scores["fact_coverage"] = self._check_fact_coverage(response, expected_facts)

        # Sub-score 2: No hallucinated facts
        sub_scores["no_hallucination"] = self._check_no_hallucination(response, intent, retrieved_chunks)

        # Sub-score 3: Price accuracy (never cached, always live)
        if intent in (IntentType.PRODUCT_QUESTION, IntentType.RECOMMENDATION, IntentType.PROMOTION):
            sub_scores["price_accuracy"] = self._check_price_accuracy(response)
        else:
            sub_scores["price_accuracy"] = 1.0

        weights = {"fact_coverage": 0.40, "no_hallucination": 0.40, "price_accuracy": 0.20}
        composite = sum(sub_scores[k] * weights[k] for k in weights)

        evidence = []
        for k, v in sub_scores.items():
            if v < 0.8:
                evidence.append(f"{k} flagged ({v:.2f})")

        return DimensionScore(
            dimension=QualityDimension.FACTUAL_ACCURACY,
            score=composite,
            confidence=0.90,
            evidence=evidence,
            sub_scores=sub_scores,
        )

    def _check_fact_coverage(self, response: str, expected_facts: list[str]) -> float:
        """Verifies that all expected facts appear in the response."""
        if not expected_facts:
            return 1.0
        covered = 0
        for fact in expected_facts:
            # Semantic check: does the response convey this fact?
            if fact.lower() in response.lower():
                covered += 1
            else:
                # Fallback: use LLM to check semantic equivalence
                if self._semantic_fact_check(response, fact):
                    covered += 1
        return covered / len(expected_facts)

    def _check_no_hallucination(
        self,
        response: str,
        intent: IntentType,
        retrieved_chunks: Optional[list[str]],
    ) -> float:
        """Uses LLM-as-Judge to detect hallucinated content not grounded in source data."""
        if not retrieved_chunks:
            return 0.7  # Without source data, assume moderate risk

        context = "\n---\n".join(retrieved_chunks)
        prompt = f"""You are a fact-checker for a manga store chatbot. Given the retrieved source data
and the assistant's response, identify any claims in the response that are NOT supported by the source data.

SOURCE DATA:
{context}

ASSISTANT RESPONSE:
{response}

Return ONLY a JSON object:
{{
    "supported_claims": <int>,
    "unsupported_claims": <int>,
    "hallucinated_details": ["<list of specific hallucinated claims>"],
    "score": <float 0.0 to 1.0 where 1.0 means no hallucination>
}}"""

        try:
            body = json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 300,
                "temperature": 0.0,
                "messages": [{"role": "user", "content": prompt}],
            })
            resp = self.bedrock.invoke_model(
                modelId="anthropic.claude-3-5-sonnet-20241022-v2:0", body=body
            )
            result = json.loads(json.loads(resp["body"].read())["content"][0]["text"])
            return max(0.0, min(1.0, result["score"]))
        except Exception as e:
            logger.warning("Hallucination check failed, defaulting to 0.5: %s", e)
            return 0.5

    def _check_price_accuracy(self, response: str) -> float:
        """Extracts price mentions from the response and validates against live catalog.

        In MangaAssist, prices are NEVER cached (see architecture HLD). This checker
        confirms the response does not fabricate prices.
        """
        # Extract price patterns like $12.99 or ¥1,200
        import re
        prices = re.findall(r'[\$¥]\s*[\d,]+\.?\d*', response)
        if not prices:
            return 1.0  # No prices mentioned, no accuracy concern

        if not self.catalog:
            return 0.7  # Cannot verify without catalog client

        # In production: query catalog API for each mentioned ASIN and compare prices
        # For evaluation: this is done against a snapshot of the catalog at test time
        return 1.0  # Placeholder — production implementation validates each price

    def _semantic_fact_check(self, response: str, fact: str) -> bool:
        """Uses LLM to check if a specific fact is semantically present in the response."""
        prompt = f"""Does the following response convey this fact? Answer YES or NO only.

Fact: {fact}
Response: {response}"""

        try:
            body = json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 10,
                "temperature": 0.0,
                "messages": [{"role": "user", "content": prompt}],
            })
            resp = self.bedrock.invoke_model(
                modelId="anthropic.claude-3-5-sonnet-20241022-v2:0", body=body
            )
            answer = json.loads(resp["body"].read())["content"][0]["text"].strip().upper()
            return answer.startswith("YES")
        except Exception:
            return False
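
The _check_price_accuracy placeholder above always returns 1.0. A hedged sketch of what the production check might look like, assuming a catalog client that exposes a get_item(asin) call returning a dict with a "price" field; that interface is an assumption for illustration, not the real Product Catalog API:

import re

def check_price_accuracy(response: str, mentioned_asins: list[str], catalog) -> float:
    """Compare each price quoted in the response against the catalog snapshot."""
    quoted_prices = [
        float(p.replace(",", "").lstrip("$¥").strip())
        for p in re.findall(r"[\$¥]\s*[\d,]+\.?\d*", response)
    ]
    if not quoted_prices or not mentioned_asins:
        return 1.0  # nothing to verify

    catalog_prices = {
        round(float(catalog.get_item(asin)["price"]), 2)   # assumed client interface
        for asin in mentioned_asins
    }
    verified = sum(1 for p in quoted_prices if round(p, 2) in catalog_prices)
    return verified / len(quoted_prices)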

Consistency Checker

class ConsistencyChecker:
    """Checks that the FM response is internally consistent and consistent
    with previous conversation turns.

    MangaAssist scenario: if the assistant recommended "Demon Slayer" in turn 2,
    and the user says "tell me more about the first one", the assistant must not
    suddenly reference a different series.
    """

    def __init__(self, bedrock_client=None):
        self.bedrock = bedrock_client or boto3.client("bedrock-runtime", region_name="us-east-1")

    def score(
        self,
        response: str,
        conversation_history: list[dict],
        intent: IntentType,
    ) -> DimensionScore:
        sub_scores = {}

        # Sub-score 1: Intra-turn consistency (no self-contradictions)
        sub_scores["intra_turn"] = self._check_intra_turn(response)

        # Sub-score 2: Cross-turn consistency (matches conversation history)
        if len(conversation_history) > 1:
            sub_scores["cross_turn"] = self._check_cross_turn(response, conversation_history)
        else:
            sub_scores["cross_turn"] = 1.0

        # Sub-score 3: Persona consistency (stays in MangaAssist character)
        sub_scores["persona"] = self._check_persona(response)

        weights = {"intra_turn": 0.35, "cross_turn": 0.40, "persona": 0.25}
        composite = sum(sub_scores[k] * weights[k] for k in weights)

        evidence = []
        for k, v in sub_scores.items():
            if v < 0.7:
                evidence.append(f"{k} consistency issue ({v:.2f})")

        return DimensionScore(
            dimension=QualityDimension.CONSISTENCY,
            score=composite,
            confidence=0.80,
            evidence=evidence,
            sub_scores=sub_scores,
        )

    def _check_intra_turn(self, response: str) -> float:
        """Uses LLM to detect self-contradictions within a single response."""
        prompt = f"""Analyze this chatbot response for internal contradictions.
A contradiction is when the response says two things that cannot both be true.

Response: {response}

Return ONLY a JSON object:
{{
    "has_contradiction": <bool>,
    "contradictions": ["<list of contradictions found>"],
    "score": <float 0.0 to 1.0 where 1.0 means no contradictions>
}}"""

        try:
            body = json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 200,
                "temperature": 0.0,
                "messages": [{"role": "user", "content": prompt}],
            })
            resp = self.bedrock.invoke_model(
                modelId="anthropic.claude-3-5-sonnet-20241022-v2:0", body=body
            )
            result = json.loads(json.loads(resp["body"].read())["content"][0]["text"])
            return max(0.0, min(1.0, result["score"]))
        except Exception:
            return 0.7

    def _check_cross_turn(self, response: str, conversation_history: list[dict]) -> float:
        """Checks whether the current response contradicts claims made in earlier turns."""
        # Build a summary of assistant's previous claims
        previous_claims = []
        for turn in conversation_history:
            if turn.get("role") == "assistant":
                previous_claims.append(turn["content"])

        if not previous_claims:
            return 1.0

        recent_context = "\n---\n".join(previous_claims[-3:])  # Last 3 assistant turns

        prompt = f"""Compare the new response against the assistant's previous responses.
Flag any contradictions where the new response conflicts with what was said before.

PREVIOUS ASSISTANT RESPONSES:
{recent_context}

NEW RESPONSE:
{response}

Return ONLY a JSON object:
{{
    "contradicts_previous": <bool>,
    "contradictions": ["<list>"],
    "score": <float 0.0 to 1.0 where 1.0 means fully consistent>
}}"""

        try:
            body = json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 200,
                "temperature": 0.0,
                "messages": [{"role": "user", "content": prompt}],
            })
            resp = self.bedrock.invoke_model(
                modelId="anthropic.claude-3-5-sonnet-20241022-v2:0", body=body
            )
            result = json.loads(json.loads(resp["body"].read())["content"][0]["text"])
            return max(0.0, min(1.0, result["score"]))
        except Exception:
            return 0.7

    def _check_persona(self, response: str) -> float:
        """Checks whether the response stays in MangaAssist assistant persona."""
        violations = []
        response_lower = response.lower()

        # Must not claim to be a different assistant
        competitor_names = ["alexa", "siri", "google assistant", "chatgpt"]
        for name in competitor_names:
            if name in response_lower:
                violations.append(f"Mentioned competitor: {name}")

        # Must not break character
        meta_phrases = [
            "as an ai", "as a language model", "i cannot", "i'm just a",
            "my training data", "i was trained"
        ]
        for phrase in meta_phrases:
            if phrase in response_lower:
                violations.append(f"Meta-AI disclosure: {phrase}")

        # Must stay on manga/Amazon topic
        off_topic_signals = ["political", "religious", "sexual"]
        for signal in off_topic_signals:
            if signal in response_lower:
                violations.append(f"Off-topic content: {signal}")

        if not violations:
            return 1.0
        return max(0.0, 1.0 - (len(violations) * 0.3))

Fluency Scorer

class FluencyScorer:
    """Scores the linguistic quality of the FM response.

    For MangaAssist, fluency includes proper English grammar, appropriate
    conversational tone, and correct formatting of product cards and action buttons.
    """

    def __init__(self, bedrock_client=None):
        self.bedrock = bedrock_client or boto3.client("bedrock-runtime", region_name="us-east-1")

    def score(self, response: str, intent: IntentType) -> DimensionScore:
        sub_scores = {}

        # Sub-score 1: Grammar and syntax
        sub_scores["grammar"] = self._score_grammar(response)

        # Sub-score 2: Tone appropriateness
        sub_scores["tone"] = self._score_tone(response, intent)

        # Sub-score 3: Formatting quality
        sub_scores["formatting"] = self._score_formatting(response, intent)

        weights = {"grammar": 0.40, "tone": 0.35, "formatting": 0.25}
        composite = sum(sub_scores[k] * weights[k] for k in weights)

        evidence = []
        for k, v in sub_scores.items():
            if v < 0.7:
                evidence.append(f"{k} issue ({v:.2f})")

        return DimensionScore(
            dimension=QualityDimension.FLUENCY,
            score=composite,
            confidence=0.85,
            evidence=evidence,
            sub_scores=sub_scores,
        )

    def _score_grammar(self, response: str) -> float:
        """Uses LLM to assess grammatical correctness."""
        prompt = f"""Rate the grammatical correctness of this chatbot response.
Consider: sentence structure, subject-verb agreement, punctuation, spelling.
Ignore formatting elements like markdown or product cards.

Response: {response}

Return ONLY a JSON object: {{"score": <float 0.0 to 1.0>, "issues": ["<list of grammar issues>"]}}"""

        try:
            body = json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 200,
                "temperature": 0.0,
                "messages": [{"role": "user", "content": prompt}],
            })
            resp = self.bedrock.invoke_model(
                modelId="anthropic.claude-3-5-sonnet-20241022-v2:0", body=body
            )
            result = json.loads(json.loads(resp["body"].read())["content"][0]["text"])
            return max(0.0, min(1.0, result["score"]))
        except Exception:
            return 0.7

    def _score_tone(self, response: str, intent: IntentType) -> float:
        """Checks whether the tone matches the expected register for the intent."""
        # Order tracking and returns need professional, empathetic tone
        # Recommendations can be enthusiastic and conversational
        # Chitchat should be warm and brief
        tone_guidance = {
            IntentType.RECOMMENDATION: "enthusiastic, knowledgeable, conversational",
            IntentType.PRODUCT_QUESTION: "helpful, informative, concise",
            IntentType.FAQ: "clear, authoritative, reassuring",
            IntentType.ORDER_TRACKING: "professional, empathetic, action-oriented",
            IntentType.RETURN_REQUEST: "empathetic, solution-focused, patient",
            IntentType.PROMOTION: "excited but not pushy, informative",
            IntentType.CHECKOUT_HELP: "helpful, step-by-step, encouraging",
            IntentType.CHITCHAT: "warm, brief, natural",
        }

        expected_tone = tone_guidance.get(intent, "helpful and professional")

        prompt = f"""Rate whether this chatbot response has the appropriate tone.
Expected tone: {expected_tone}

Response: {response}

Return ONLY a JSON object: {{"score": <float 0.0 to 1.0>, "actual_tone": "<description>"}}"""

        try:
            body = json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 100,
                "temperature": 0.0,
                "messages": [{"role": "user", "content": prompt}],
            })
            resp = self.bedrock.invoke_model(
                modelId="anthropic.claude-3-5-sonnet-20241022-v2:0", body=body
            )
            result = json.loads(json.loads(resp["body"].read())["content"][0]["text"])
            return max(0.0, min(1.0, result["score"]))
        except Exception:
            return 0.7

    def _score_formatting(self, response: str, intent: IntentType) -> float:
        """Checks whether the response uses appropriate formatting elements."""
        score = 1.0
        penalties = []

        # Recommendations should include product information
        if intent == IntentType.RECOMMENDATION:
            if not any(marker in response for marker in ["ASIN", "http", "amazon.com", "$", "¥"]):
                score -= 0.3
                penalties.append("Recommendation missing product links or prices")

        # Response should not be excessively long for simple intents
        word_count = len(response.split())
        if intent == IntentType.CHITCHAT and word_count > 80:
            score -= 0.2
            penalties.append(f"Chitchat too verbose ({word_count} words)")

        if intent == IntentType.ORDER_TRACKING and word_count > 150:
            score -= 0.1
            penalties.append(f"Order tracking response too long ({word_count} words)")

        return max(0.0, score)

Composite Evaluator (Orchestrates All Dimensions)

class FMOutputQualityEvaluator:
    """Orchestrates all four quality dimension scorers and produces
    a composite evaluation result for a single test case.

    This is the main entry point for the evaluation pipeline.
    """

    # Intent-specific weights (from the scoring thresholds table above)
    INTENT_WEIGHTS: dict[IntentType, dict[str, float]] = {
        IntentType.RECOMMENDATION:   {"relevance": 0.40, "factual_accuracy": 0.20, "consistency": 0.20, "fluency": 0.20},
        IntentType.PRODUCT_QUESTION: {"relevance": 0.25, "factual_accuracy": 0.40, "consistency": 0.20, "fluency": 0.15},
        IntentType.FAQ:              {"relevance": 0.20, "factual_accuracy": 0.40, "consistency": 0.25, "fluency": 0.15},
        IntentType.ORDER_TRACKING:   {"relevance": 0.15, "factual_accuracy": 0.50, "consistency": 0.25, "fluency": 0.10},
        IntentType.RETURN_REQUEST:   {"relevance": 0.15, "factual_accuracy": 0.50, "consistency": 0.25, "fluency": 0.10},
        IntentType.PROMOTION:        {"relevance": 0.30, "factual_accuracy": 0.35, "consistency": 0.20, "fluency": 0.15},
        IntentType.CHECKOUT_HELP:    {"relevance": 0.20, "factual_accuracy": 0.40, "consistency": 0.25, "fluency": 0.15},
        IntentType.CHITCHAT:         {"relevance": 0.30, "factual_accuracy": 0.10, "consistency": 0.30, "fluency": 0.30},
    }

    PASS_THRESHOLDS: dict[IntentType, float] = {
        IntentType.RECOMMENDATION:   0.75,
        IntentType.PRODUCT_QUESTION: 0.80,
        IntentType.FAQ:              0.85,
        IntentType.ORDER_TRACKING:   0.90,
        IntentType.RETURN_REQUEST:   0.90,
        IntentType.PROMOTION:        0.80,
        IntentType.CHECKOUT_HELP:    0.85,
        IntentType.CHITCHAT:         0.70,
    }

    def __init__(self):
        self.relevance_scorer = RelevanceScorer()
        self.factual_checker = FactualAccuracyChecker()
        self.consistency_checker = ConsistencyChecker()
        self.fluency_scorer = FluencyScorer()

    def evaluate(self, test_case: EvaluationTestCase, response: str) -> EvaluationResult:
        """Evaluate a single FM response against all quality dimensions."""
        start_time = time.time()

        # Score each dimension
        relevance = self.relevance_scorer.score(
            query=test_case.query,
            response=response,
            intent=test_case.intent,
            page_context=test_case.page_context,
            conversation_history=test_case.conversation_history,
            expected_products=test_case.expected_products,
        )

        factual = self.factual_checker.score(
            response=response,
            expected_facts=test_case.expected_facts,
            intent=test_case.intent,
        )

        consistency = self.consistency_checker.score(
            response=response,
            conversation_history=test_case.conversation_history,
            intent=test_case.intent,
        )

        fluency = self.fluency_scorer.score(
            response=response,
            intent=test_case.intent,
        )

        dimension_scores = [relevance, factual, consistency, fluency]

        # Compute weighted composite score
        weights = self.INTENT_WEIGHTS.get(
            test_case.intent,
            {"relevance": 0.25, "factual_accuracy": 0.25, "consistency": 0.25, "fluency": 0.25},
        )
        composite = (
            relevance.score * weights["relevance"]
            + factual.score * weights["factual_accuracy"]
            + consistency.score * weights["consistency"]
            + fluency.score * weights["fluency"]
        )

        threshold = self.PASS_THRESHOLDS.get(test_case.intent, 0.80)
        passed = composite >= threshold

        failure_reasons = []
        if not passed:
            for ds in dimension_scores:
                if ds.evidence:
                    failure_reasons.extend(ds.evidence)

        latency_ms = (time.time() - start_time) * 1000

        return EvaluationResult(
            test_id=test_case.test_id,
            intent=test_case.intent,
            response_text=response,
            dimension_scores=dimension_scores,
            composite_score=composite,
            passed=passed,
            latency_ms=latency_ms,
            failure_reasons=failure_reasons,
        )

    def evaluate_batch(self, test_cases: list[tuple[EvaluationTestCase, str]]) -> EvaluationReport:
        """Evaluate a batch of test cases and produce an aggregate report."""
        results = []
        for test_case, response in test_cases:
            result = self.evaluate(test_case, response)
            results.append(result)

        passed = [r for r in results if r.passed]
        failed = [r for r in results if not r.passed]

        # Aggregate dimension averages
        dim_totals: dict[str, list[float]] = {}
        for r in results:
            for ds in r.dimension_scores:
                dim_totals.setdefault(ds.dimension.value, []).append(ds.score)
        dim_averages = {k: sum(v) / len(v) for k, v in dim_totals.items()}

        # Breakdown by intent
        intent_breakdown: dict[str, dict] = {}
        for r in results:
            key = r.intent.value
            if key not in intent_breakdown:
                intent_breakdown[key] = {"total": 0, "passed": 0, "avg_score": 0.0, "scores": []}
            intent_breakdown[key]["total"] += 1
            intent_breakdown[key]["scores"].append(r.composite_score)
            if r.passed:
                intent_breakdown[key]["passed"] += 1

        for key, data in intent_breakdown.items():
            data["avg_score"] = sum(data["scores"]) / len(data["scores"])
            del data["scores"]

        import uuid
        return EvaluationReport(
            run_id=str(uuid.uuid4()),
            trigger="batch_evaluation",
            total_cases=len(results),
            passed_cases=len(passed),
            failed_cases=len(failed),
            dimension_averages=dim_averages,
            intent_breakdown=intent_breakdown,
            regression_flags=[],  # Populated by comparing against baseline
        )

CloudWatch Metric Publishing

class EvaluationMetricsPublisher:
    """Publishes evaluation results to CloudWatch for dashboarding and alerting."""

    def __init__(self, namespace: str = "MangaAssist/Evaluation"):
        self.cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
        self.namespace = namespace

    def publish_result(self, result: EvaluationResult) -> None:
        """Publishes dimension scores and composite score for a single evaluation."""
        metric_data = [
            {
                "MetricName": "CompositeScore",
                "Dimensions": [
                    {"Name": "Intent", "Value": result.intent.value},
                ],
                "Value": result.composite_score,
                "Unit": "None",
            },
            {
                "MetricName": "EvaluationLatency",
                "Dimensions": [
                    {"Name": "Intent", "Value": result.intent.value},
                ],
                "Value": result.latency_ms,
                "Unit": "Milliseconds",
            },
        ]

        for ds in result.dimension_scores:
            metric_data.append({
                "MetricName": f"DimensionScore_{ds.dimension.value}",
                "Dimensions": [
                    {"Name": "Intent", "Value": result.intent.value},
                ],
                "Value": ds.score,
                "Unit": "None",
            })

        # Pass/fail as a binary metric for alarming
        metric_data.append({
            "MetricName": "QualityGateResult",
            "Dimensions": [
                {"Name": "Intent", "Value": result.intent.value},
            ],
            "Value": 1.0 if result.passed else 0.0,
            "Unit": "None",
        })

        self.cloudwatch.put_metric_data(Namespace=self.namespace, MetricData=metric_data)

    def publish_report(self, report: EvaluationReport) -> None:
        """Publishes aggregate report metrics."""
        metric_data = [
            {
                "MetricName": "BatchPassRate",
                "Value": report.passed_cases / max(report.total_cases, 1),
                "Unit": "None",
            },
            {
                "MetricName": "BatchTotalCases",
                "Value": float(report.total_cases),
                "Unit": "Count",
            },
        ]

        for dim, avg in report.dimension_averages.items():
            metric_data.append({
                "MetricName": f"BatchAvg_{dim}",
                "Value": avg,
                "Unit": "None",
            })

        self.cloudwatch.put_metric_data(Namespace=self.namespace, MetricData=metric_data)
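
Putting the pieces together, a short sketch of one evaluate-and-publish cycle. example_case is the illustrative test case from the data model section, the response string is made up, and running this for real requires AWS credentials because the scorers call Bedrock and the publisher calls CloudWatch:

evaluator = FMOutputQualityEvaluator()
publisher = EvaluationMetricsPublisher()

# Illustrative response; in the pipeline this comes from the MangaAssist LLM generation step.
response = "For horror manga like Junji Ito's, try Kazuo Umezu's The Drifting Classroom (ASIN B0EXAMPLE01, $19.99)."

result = evaluator.evaluate(example_case, response)
print(f"{result.intent.value}: composite={result.composite_score:.2f} passed={result.passed}")
for ds in result.dimension_scores:
    print(f"  {ds.dimension.value}: {ds.score:.2f} {ds.evidence}")

publisher.publish_result(result)   # feeds the CloudWatch dashboard and quality-gate alarms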

MangaAssist Scenarios

Scenario A: Recommendation Relevance Drops After Prompt Change

Context: The prompt engineering team updated the system prompt to add a new instruction about highlighting promotions. After deployment, the recommendation intent's relevance score dropped from 0.82 to 0.68.

What Happened:
  • The new promotion instruction consumed attention budget, causing the LLM to mention promotions even when the user asked for genre-specific recommendations.
  • User asks: "Recommend horror manga similar to Junji Ito's work."
  • Before: Response focused on horror manga by similar artists (Kazuo Umezu, Hideshi Hino) — high relevance.
  • After: Response opened with a promotion for a shonen bundle before listing horror titles — relevance dropped because the first third of the response was off-intent.

How Evaluation Caught It:
  • The intent_relevance sub-score dropped from 0.88 to 0.62 because the LLM-as-Judge saw that promotion content was irrelevant to a horror recommendation query.
  • The product_relevance sub-score stayed at 0.85 (the horror titles were still there, just buried).
  • The composite relevance score dropped below the 0.75 threshold for recommendation intents.
  • The automated quality gate in the CD-06 pipeline blocked the deployment.

Fix: The team restructured the system prompt to apply promotion nudges only when the intent classifier returns promotion or when the user's cart is non-empty, not on every response.

Metric Signal:

CloudWatch Alarm: MangaAssist/Evaluation → DimensionScore_relevance
  Intent=recommendation < 0.75 for 3 consecutive evaluation runs
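
A hedged sketch of how that alarm might be created with boto3; the alarm name and SNS topic ARN are placeholders, while the namespace, metric name, and dimension match the EvaluationMetricsPublisher above:

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
cloudwatch.put_metric_alarm(
    AlarmName="mangaassist-eval-relevance-recommendation",                     # placeholder name
    Namespace="MangaAssist/Evaluation",
    MetricName="DimensionScore_relevance",
    Dimensions=[{"Name": "Intent", "Value": "recommendation"}],
    Statistic="Average",
    Period=3600,                       # assumes roughly one evaluation run per hour
    EvaluationPeriods=3,               # three consecutive runs below threshold
    Threshold=0.75,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:eval-quality-alerts"],   # placeholder SNS topic
)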

Scenario B: FAQ Factual Accuracy Regression After Knowledge Base Update

Context: The operations team updated the return policy in the knowledge base: the return window changed from 30 days to 15 days for opened manga volumes. The RAG pipeline re-indexed, but some old chunks were not properly invalidated.

What Happened:
  • User asks: "Can I return Demon Slayer Volume 23 if I already read it?"
  • The RAG pipeline retrieved two chunks: one from the new policy (15 days) and one stale chunk from the old policy (30 days).
  • The LLM generated: "You can return opened manga within 30 days of delivery."
  • This was factually incorrect — the new policy says 15 days.

How Evaluation Caught It:
  • The golden dataset included a test case with expected_facts: ["15 days return window for opened manga"].
  • The fact_coverage sub-score returned 0.0 because "30 days" appeared instead of "15 days."
  • The no_hallucination sub-score returned 0.4 because the LLM cited a stale source chunk rather than fabricating entirely.
  • The composite factual accuracy score dropped to 0.32, far below the 0.85 threshold for FAQ intents.

Fix:
  1. The stale chunk was invalidated by correcting the last_updated timestamp check in the RAG indexing pipeline.
  2. The team added a policy-specific test case to the golden dataset that checks the exact return window number.
  3. A chunk freshness filter was added to the retrieval step: chunks about return policies older than 24 hours are excluded.

Metric Signal:

CloudWatch Alarm: MangaAssist/Evaluation → DimensionScore_factual_accuracy
  Intent=faq < 0.85 for 1 evaluation run (zero tolerance for policy accuracy)

Scenario C: Multi-Turn Consistency Failure During Session

Context: A user is having a multi-turn conversation about manga recommendations. The assistant recommends three series in turn 2, and in turn 4 the user says "Tell me more about the second one."

What Happened:
  • Turn 2: Assistant recommends [Chainsaw Man, Spy×Family, Jujutsu Kaisen].
  • Turn 3: User asks about the art style of Chainsaw Man (the first one).
  • Turn 4: User says "Now tell me about the second one."
  • The conversation memory correctly stored all turns, but the LLM's context window was compressed (older turns were summarized to save tokens), and the summary lost the ordered list.
  • The LLM responded with information about Jujutsu Kaisen (the third one) instead of Spy×Family (the second one).

How Evaluation Caught It:
  • The cross_turn consistency sub-score detected the mismatch: the response referenced Jujutsu Kaisen, but the conversation history showed that "the second one" should resolve to Spy×Family.
  • The consistency composite dropped to 0.45, well below the 0.75 threshold.

Fix:
  1. The conversation memory summarizer was updated to preserve ordered lists explicitly when summarizing.
  2. A new rule was added: when the summary step encounters product lists, it keeps them as structured data rather than compressing into prose.
  3. A dedicated "ordinal reference resolution" test case was added, verifying that "the first/second/third one" resolves correctly after summarization.

Scenario D: Japanese Content Fluency Degradation

Context: MangaAssist serves the JP Manga store, and some responses include Japanese titles, author names, and genre terms. After a model version upgrade (Claude 3.5 → 3.5 v2), the fluency of responses containing mixed English-Japanese content degraded.

What Happened:
  • User asks: "What volumes of 鬼滅の刃 are available in English?"
  • Before upgrade: "Demon Slayer (鬼滅の刃 / Kimetsu no Yaiba) is available in English..."
  • After upgrade: "鬼滅の刃 (Demon Slayer) volumes 1 through 23 available in English format..." — the response dropped the romanization and shifted to less natural phrasing.

How Evaluation Caught It:
  • The grammar sub-score was still 0.9 (grammatically correct).
  • The tone sub-score dropped to 0.6 because the response felt more mechanical and less conversational.
  • The formatting sub-score stayed at 0.8.
  • The fluency composite was 0.73, below the 0.75 threshold for recommendation responses involving Japanese content.

Fix:
  1. The evaluation golden dataset was expanded with Japanese-mixed test cases.
  2. A "Japanese content handling" rubric was added to the system prompt: always include the original title, romanization, and English title.
  3. The model upgrade was rolled back, and the team re-evaluated after adding the rubric — the upgraded model scored 0.82 with the updated prompt.


Intuition Gained

Quality Dimension Intuition

Traditional ML evaluation asks "Was the prediction correct?" — a binary question. FM output evaluation asks "Was the response good?" — a multidimensional question. The four dimensions (relevance, accuracy, consistency, fluency) decompose "good" into measurable sub-properties that teams can independently improve. When a composite score drops, you know which dimension to investigate first. When a new intent is added, you customize the dimension weights rather than building a new evaluation system.

Threshold Calibration Intuition

Setting thresholds too high blocks good changes (false positives in the quality gate). Setting them too low lets regressions through (false negatives). The right approach: start with thresholds based on the best current scores minus one standard deviation, then adjust based on the false positive rate of the quality gate. If more than 5% of deployments are blocked that humans judge as acceptable, the threshold is too aggressive.
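
A minimal sketch of that starting point, using a made-up history of recent composite scores for one intent:

import statistics

recent_scores = [0.82, 0.85, 0.80, 0.84, 0.83, 0.81]   # hypothetical recent composite scores

# Start the gate at (recent mean) minus one standard deviation, then tune it
# against the observed false positive rate of the quality gate.
initial_threshold = round(statistics.mean(recent_scores) - statistics.stdev(recent_scores), 2)
print(initial_threshold)   # e.g. 0.81 for this sample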

Cost-Quality Tradeoff in Evaluation

Every evaluation run calls the LLM multiple times per test case (at least once for each dimension that uses LLM-as-Judge, plus extra calls for sub-checks such as hallucination detection). For a golden dataset of 200 test cases with roughly 4 LLM-based scoring calls each, that is about 800 LLM invocations per evaluation run. At ~$0.003 per invocation, that is roughly $2.40 per run. Running hourly costs ~$57/day. The tradeoff: evaluate frequently enough to catch regressions early but not so frequently that evaluation itself becomes a significant cost center. Recommendation: run full evaluation on deployment triggers and weekly schedules; run lightweight deterministic checks (price accuracy, ASIN validation, length checks) continuously.
