01: FM Output Quality Assessment Framework
AIP-C01 Mapping
Content Domain 5: Testing, Validation, and Troubleshooting
Task 5.1: Implement evaluation systems for GenAI
Skill 5.1.1: Develop comprehensive assessment frameworks to evaluate the quality and effectiveness of FM outputs beyond traditional ML evaluation approaches (for example, by using metrics for relevance, factual accuracy, consistency, and fluency).
User Story
As a MangaAssist evaluation engineer, I want to assess the quality of every LLM-generated response across relevance, factual accuracy, consistency, and fluency dimensions, so that we catch quality regressions before they reach customers and continuously improve the chatbot's response quality beyond what traditional ML metrics (accuracy, F1) can measure.
Acceptance Criteria
- Every LLM response is scored on at least four quality dimensions: relevance, factual accuracy, consistency, and fluency
- Evaluation runs automatically on every prompt or model change before deployment
- Scores are tracked over time with alerting when any dimension drops below threshold
- The framework handles MangaAssist-specific evaluation needs: product recommendation quality, Japanese content handling, and multi-turn coherence
- Evaluation results feed into a dashboard that the team reviews weekly
- False positive rate for quality gates is below 5% (does not block good changes unnecessarily)
Why Traditional ML Metrics Fall Short for GenAI
Traditional ML evaluation (accuracy, precision, recall, F1, AUC) works when there is one correct answer per input. GenAI outputs are open-ended: a recommendation response can be relevant in many different ways, and two equally good responses can use completely different wording.
| Traditional ML Metric | Why It Fails for GenAI | What We Need Instead |
|---|---|---|
| Accuracy | No single correct answer for "Recommend manga like Naruto" | Relevance scoring against user intent and profile |
| Precision / Recall | Binary classification metrics don't capture response quality nuance | Multi-dimensional scoring (relevance + accuracy + fluency + safety) |
| F1 Score | Assumes discrete classes; LLM outputs are continuous text | Semantic similarity + factual grounding checks |
| AUC | Measures ranking of a binary classifier | BERTScore, ROUGE-L, factual consistency scores |
| BLEU / ROUGE alone | Captures lexical overlap, not semantic quality | Human-correlated composite scores |
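To make the last row concrete, here is a minimal pure-Python ROUGE-L (longest-common-subsequence) computation on two equally helpful recommendation responses; the example sentences are invented for illustration:

```python
def lcs_len(x: list[str], y: list[str]) -> int:
    """Length of the longest common subsequence of two token lists (basis of ROUGE-L)."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            dp[i + 1][j + 1] = dp[i][j] + 1 if xi == yj else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

# Two equally good answers to "Recommend manga like Naruto" with almost no shared wording.
a = "You might enjoy Bleach since it blends action with supernatural themes".split()
b = "Fans of Naruto often love Bleach for its supernatural battles and humor".split()

lcs = lcs_len(a, b)          # only "Bleach" and "supernatural" align
recall = lcs / len(a)
precision = lcs / len(b)
f1 = 2 * precision * recall / (precision + recall)
print(f"ROUGE-L F1: {f1:.2f}")  # ~0.17, low despite both responses being good
```

Both responses would satisfy the user, yet the lexical score is near the floor. This is exactly the gap the semantic and LLM-as-Judge scorers below are designed to close.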
High-Level Design
Quality Dimension Taxonomy
graph TD
A[FM Output Quality] --> B[Relevance]
A --> C[Factual Accuracy]
A --> D[Consistency]
A --> E[Fluency]
B --> B1[Intent Relevance<br>Does the response address<br>what the user asked?]
B --> B2[Product Relevance<br>Are recommended products<br>appropriate for the query?]
B --> B3[Context Relevance<br>Does the response use<br>page context and history?]
C --> C1[Price Accuracy<br>Do prices match the<br>Product Catalog?]
C --> C2[Availability Accuracy<br>Is stock status correct?]
C --> C3[Policy Accuracy<br>Are return/shipping policies<br>stated correctly?]
C --> C4[Product Attribute Accuracy<br>Are author, volume, format<br>details correct?]
D --> D1[Intra-Turn Consistency<br>No contradictions within<br>a single response]
D --> D2[Cross-Turn Consistency<br>No contradictions across<br>conversation turns]
D --> D3[Persona Consistency<br>Stays in MangaAssist<br>assistant character]
E --> E1[Grammar and Syntax<br>Correct English or Japanese]
E --> E2[Tone Appropriateness<br>Friendly, helpful, not robotic]
E --> E3[Formatting Quality<br>Proper use of product cards,<br>links, action buttons]
Evaluation Pipeline Architecture
graph LR
subgraph "Input"
A[Test Case<br>query + context + expected_intent]
end
subgraph "MangaAssist Pipeline"
A --> B[Intent Classifier]
B --> C[Orchestrator]
C --> D[Service Calls<br>Catalog, Reco, RAG]
D --> E[LLM Generation<br>Bedrock Claude]
E --> F[Guardrails]
F --> G[Final Response]
end
subgraph "Evaluation Engine"
G --> H[Relevance Scorer]
G --> I[Factual Accuracy Checker]
G --> J[Consistency Checker]
G --> K[Fluency Scorer]
H --> L[Composite Score<br>Weighted Average]
I --> L
J --> L
K --> L
end
subgraph "Output"
L --> M[Score Report]
L --> N[Pass / Fail Gate]
L --> O[Metric Store<br>CloudWatch + Redshift]
end
Scoring Thresholds by Intent
Not every intent needs the same quality bar. High-stakes intents (order tracking, return requests) need near-perfect factual accuracy, while discovery intents (recommendations) prioritize relevance over exact wording.
| Intent | Relevance Weight | Factual Accuracy Weight | Consistency Weight | Fluency Weight | Pass Threshold |
|---|---|---|---|---|---|
| recommendation | 0.40 | 0.20 | 0.20 | 0.20 | 0.75 |
| product_question | 0.25 | 0.40 | 0.20 | 0.15 | 0.80 |
| faq | 0.20 | 0.40 | 0.25 | 0.15 | 0.85 |
| order_tracking | 0.15 | 0.50 | 0.25 | 0.10 | 0.90 |
| return_request | 0.15 | 0.50 | 0.25 | 0.10 | 0.90 |
| promotion | 0.30 | 0.35 | 0.20 | 0.15 | 0.80 |
| checkout_help | 0.20 | 0.40 | 0.25 | 0.15 | 0.85 |
| chitchat | 0.30 | 0.10 | 0.30 | 0.30 | 0.70 |
Evaluation Trigger Points
graph TD
subgraph "When Evaluation Runs"
A[Prompt Change<br>CD-06 Pipeline] -->|golden dataset| E[Offline Evaluation]
B[Model Version Upgrade<br>Claude 3.5 → 4] -->|full suite| E
C[RAG Index Refresh<br>Knowledge Base Update] -->|retrieval + accuracy| E
D[Weekly Scheduled<br>Regression Check] -->|sampled production traffic| E
end
E --> F{All dimensions<br>above threshold?}
F -->|Yes| G[Deploy / Continue]
F -->|No| H[Block + Alert + Report]
H --> I[Human Review Queue]
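The decision node in the diagram above can be sketched as a small gate function that blocks deployment and names the failing dimensions; the per-dimension floors here are illustrative, not production values:

```python
def quality_gate(
    dimension_scores: dict[str, float],
    floors: dict[str, float],
) -> tuple[bool, list[str]]:
    """Return (deploy_ok, failing_dimensions) for the threshold check node."""
    failing = [d for d, floor in floors.items() if dimension_scores.get(d, 0.0) < floor]
    return (not failing, failing)

floors = {"relevance": 0.75, "factual_accuracy": 0.80, "consistency": 0.70, "fluency": 0.70}
ok, failing = quality_gate(
    {"relevance": 0.82, "factual_accuracy": 0.72, "consistency": 0.90, "fluency": 0.85},
    floors,
)
# factual_accuracy is below its floor, so the gate blocks and reports it for human review
```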
Low-Level Design
Core Data Model
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
import time
class QualityDimension(Enum):
RELEVANCE = "relevance"
FACTUAL_ACCURACY = "factual_accuracy"
CONSISTENCY = "consistency"
FLUENCY = "fluency"
class IntentType(Enum):
RECOMMENDATION = "recommendation"
PRODUCT_QUESTION = "product_question"
FAQ = "faq"
ORDER_TRACKING = "order_tracking"
RETURN_REQUEST = "return_request"
PROMOTION = "promotion"
CHECKOUT_HELP = "checkout_help"
CHITCHAT = "chitchat"
@dataclass
class EvaluationTestCase:
"""A single test case for FM output quality evaluation."""
test_id: str
query: str
intent: IntentType
page_context: dict
conversation_history: list[dict]
expected_products: list[str] # ASINs that should appear
expected_facts: list[str] # Facts that must be accurate
forbidden_content: list[str] # Content that must not appear
reference_response: Optional[str] # Gold-standard response for comparison
metadata: dict = field(default_factory=dict)
@dataclass
class DimensionScore:
"""Score for a single quality dimension."""
dimension: QualityDimension
score: float # 0.0 to 1.0
confidence: float # How confident the scorer is
evidence: list[str] # Why this score was given
sub_scores: dict[str, float] = field(default_factory=dict)
@dataclass
class EvaluationResult:
"""Complete evaluation result for a single test case."""
test_id: str
intent: IntentType
response_text: str
dimension_scores: list[DimensionScore]
composite_score: float
passed: bool
latency_ms: float
timestamp: float = field(default_factory=time.time)
failure_reasons: list[str] = field(default_factory=list)
@dataclass
class EvaluationReport:
"""Aggregate report across all test cases in a run."""
run_id: str
trigger: str # "prompt_change", "model_upgrade", "scheduled"
total_cases: int
passed_cases: int
failed_cases: int
dimension_averages: dict[str, float]
intent_breakdown: dict[str, dict] # Scores by intent type
regression_flags: list[str] # Dimensions that dropped vs. baseline
timestamp: float = field(default_factory=time.time)
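A golden-dataset entry mirroring EvaluationTestCase might look like the following; it is shown as a plain dict so the sketch stands alone, and the test ID, ASINs, and metadata are hypothetical:

```python
# One golden test case for the recommendation intent (field names mirror EvaluationTestCase).
golden_case = {
    "test_id": "reco-horror-001",  # hypothetical ID
    "query": "Recommend horror manga similar to Junji Ito's work",
    "intent": "recommendation",
    "page_context": {"store_section": "horror-manga", "current_asin": None},
    "conversation_history": [],
    "expected_products": ["B0XXXXXXX1", "B0XXXXXXX2"],  # placeholder ASINs
    "expected_facts": [],
    "forbidden_content": ["shonen bundle promotion"],
    "reference_response": None,
    "metadata": {"suite": "golden-recommendations"},
}
```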
Relevance Scorer
import json
import logging
from typing import Optional
import boto3
logger = logging.getLogger(__name__)
class RelevanceScorer:
"""Scores how relevant the FM response is to the user's query and context.
Uses BERTScore for semantic similarity and an LLM-as-Judge call
for intent-specific relevance assessment.
"""
def __init__(self, bedrock_client=None, model_id: str = "anthropic.claude-3-5-sonnet-20241022-v2:0"):
self.bedrock = bedrock_client or boto3.client("bedrock-runtime", region_name="us-east-1")
self.model_id = model_id
def score(
self,
query: str,
response: str,
intent: IntentType,
page_context: dict,
conversation_history: list[dict],
expected_products: list[str],
) -> DimensionScore:
sub_scores = {}
# Sub-score 1: Intent relevance via LLM-as-Judge
sub_scores["intent_relevance"] = self._score_intent_relevance(query, response, intent)
# Sub-score 2: Product relevance (for product-related intents)
if intent in (IntentType.RECOMMENDATION, IntentType.PRODUCT_QUESTION, IntentType.PROMOTION):
sub_scores["product_relevance"] = self._score_product_relevance(
response, expected_products
)
else:
sub_scores["product_relevance"] = 1.0 # Not applicable, full marks
# Sub-score 3: Context utilization
sub_scores["context_relevance"] = self._score_context_utilization(
response, page_context, conversation_history
)
# Weighted combination
weights = {"intent_relevance": 0.50, "product_relevance": 0.30, "context_relevance": 0.20}
composite = sum(sub_scores[k] * weights[k] for k in weights)
evidence = []
for k, v in sub_scores.items():
if v < 0.7:
evidence.append(f"{k} scored low ({v:.2f})")
return DimensionScore(
dimension=QualityDimension.RELEVANCE,
score=composite,
confidence=0.85,
evidence=evidence,
sub_scores=sub_scores,
)
def _score_intent_relevance(self, query: str, response: str, intent: IntentType) -> float:
"""Uses LLM-as-Judge to assess whether the response addresses the user's intent."""
prompt = f"""You are evaluating a chatbot response for the MangaAssist JP Manga store assistant.
User query: {query}
Detected intent: {intent.value}
Assistant response: {response}
Rate how well the response addresses the user's intent on a scale of 0.0 to 1.0:
- 1.0: Perfectly addresses the intent with actionable information
- 0.8: Addresses the intent well but misses minor details
- 0.6: Partially addresses the intent
- 0.4: Tangentially related but does not answer the question
- 0.2: Mostly irrelevant
- 0.0: Completely off-topic
Return ONLY a JSON object: {{"score": <float>, "reason": "<brief explanation>"}}"""
try:
body = json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 150,
"temperature": 0.0,
"messages": [{"role": "user", "content": prompt}],
})
resp = self.bedrock.invoke_model(modelId=self.model_id, body=body)
result = json.loads(json.loads(resp["body"].read())["content"][0]["text"])
return max(0.0, min(1.0, result["score"]))
except Exception as e:
logger.warning("Intent relevance scoring failed, defaulting to 0.5: %s", e)
return 0.5
def _score_product_relevance(self, response: str, expected_products: list[str]) -> float:
"""Checks whether expected product ASINs appear in the response."""
if not expected_products:
return 1.0
found = sum(1 for asin in expected_products if asin in response)
return found / len(expected_products)
def _score_context_utilization(
self,
response: str,
page_context: dict,
conversation_history: list[dict],
) -> float:
"""Checks whether the response incorporates relevant context signals."""
signals_used = 0
signals_available = 0
# Check if current ASIN is referenced when relevant
current_asin = page_context.get("current_asin")
if current_asin:
signals_available += 1
if current_asin in response:
signals_used += 1
# Check if browsing context is reflected
store_section = page_context.get("store_section", "")
if store_section:
signals_available += 1
if store_section.replace("-", " ").lower() in response.lower():
signals_used += 1
# Check if conversation history is acknowledged in multi-turn
if len(conversation_history) > 1:
signals_available += 1
last_user_msg = ""
for turn in reversed(conversation_history):
if turn.get("role") == "user":
last_user_msg = turn.get("content", "")
break
# Simple check: does response build on previous context?
if last_user_msg and any(
word in response.lower()
for word in last_user_msg.lower().split()
if len(word) > 4
):
signals_used += 1
if signals_available == 0:
return 1.0
return signals_used / signals_available
Factual Accuracy Checker
class FactualAccuracyChecker:
"""Verifies that FM responses contain accurate facts by cross-referencing
against the Product Catalog, Knowledge Base, and Order Service.
This is critical for MangaAssist because incorrect prices, availability,
or return policies damage customer trust immediately.
"""
def __init__(self, catalog_client=None, knowledge_base_client=None, bedrock_client=None):
self.catalog = catalog_client
self.knowledge_base = knowledge_base_client
self.bedrock = bedrock_client or boto3.client("bedrock-runtime", region_name="us-east-1")
def score(
self,
response: str,
expected_facts: list[str],
intent: IntentType,
retrieved_chunks: Optional[list[str]] = None,
) -> DimensionScore:
sub_scores = {}
# Sub-score 1: Expected fact coverage
sub_scores["fact_coverage"] = self._check_fact_coverage(response, expected_facts)
# Sub-score 2: No hallucinated facts
sub_scores["no_hallucination"] = self._check_no_hallucination(response, intent, retrieved_chunks)
# Sub-score 3: Price accuracy (never cached, always live)
if intent in (IntentType.PRODUCT_QUESTION, IntentType.RECOMMENDATION, IntentType.PROMOTION):
sub_scores["price_accuracy"] = self._check_price_accuracy(response)
else:
sub_scores["price_accuracy"] = 1.0
weights = {"fact_coverage": 0.40, "no_hallucination": 0.40, "price_accuracy": 0.20}
composite = sum(sub_scores[k] * weights[k] for k in weights)
evidence = []
for k, v in sub_scores.items():
if v < 0.8:
evidence.append(f"{k} flagged ({v:.2f})")
return DimensionScore(
dimension=QualityDimension.FACTUAL_ACCURACY,
score=composite,
confidence=0.90,
evidence=evidence,
sub_scores=sub_scores,
)
def _check_fact_coverage(self, response: str, expected_facts: list[str]) -> float:
"""Verifies that all expected facts appear in the response."""
if not expected_facts:
return 1.0
covered = 0
for fact in expected_facts:
# Semantic check: does the response convey this fact?
if fact.lower() in response.lower():
covered += 1
else:
# Fallback: use LLM to check semantic equivalence
if self._semantic_fact_check(response, fact):
covered += 1
return covered / len(expected_facts)
def _check_no_hallucination(
self,
response: str,
intent: IntentType,
retrieved_chunks: Optional[list[str]],
) -> float:
"""Uses LLM-as-Judge to detect hallucinated content not grounded in source data."""
if not retrieved_chunks:
return 0.7 # Without source data, assume moderate risk
context = "\n---\n".join(retrieved_chunks)
prompt = f"""You are a fact-checker for a manga store chatbot. Given the retrieved source data
and the assistant's response, identify any claims in the response that are NOT supported by the source data.
SOURCE DATA:
{context}
ASSISTANT RESPONSE:
{response}
Return ONLY a JSON object:
{{
"supported_claims": <int>,
"unsupported_claims": <int>,
"hallucinated_details": ["<list of specific hallucinated claims>"],
"score": <float 0.0 to 1.0 where 1.0 means no hallucination>
}}"""
try:
body = json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 300,
"temperature": 0.0,
"messages": [{"role": "user", "content": prompt}],
})
resp = self.bedrock.invoke_model(
modelId="anthropic.claude-3-5-sonnet-20241022-v2:0", body=body
)
result = json.loads(json.loads(resp["body"].read())["content"][0]["text"])
return max(0.0, min(1.0, result["score"]))
except Exception as e:
logger.warning("Hallucination check failed, defaulting to 0.5: %s", e)
return 0.5
def _check_price_accuracy(self, response: str) -> float:
"""Extracts price mentions from the response and validates against live catalog.
In MangaAssist, prices are NEVER cached (see architecture HLD). This checker
confirms the response does not fabricate prices.
"""
# Extract price patterns like $12.99 or ¥1,200
import re
prices = re.findall(r'[\$¥]\s*[\d,]+\.?\d*', response)
if not prices:
return 1.0 # No prices mentioned, no accuracy concern
if not self.catalog:
return 0.7 # Cannot verify without catalog client
# In production: query catalog API for each mentioned ASIN and compare prices
# For evaluation: this is done against a snapshot of the catalog at test time
return 1.0 # Placeholder — production implementation validates each price
def _semantic_fact_check(self, response: str, fact: str) -> bool:
"""Uses LLM to check if a specific fact is semantically present in the response."""
prompt = f"""Does the following response convey this fact? Answer YES or NO only.
Fact: {fact}
Response: {response}"""
try:
body = json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 10,
"temperature": 0.0,
"messages": [{"role": "user", "content": prompt}],
})
resp = self.bedrock.invoke_model(
modelId="anthropic.claude-3-5-sonnet-20241022-v2:0", body=body
)
answer = json.loads(resp["body"].read())["content"][0]["text"].strip().upper()
return answer.startswith("YES")
except Exception:
return False
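The price-extraction pattern in _check_price_accuracy handles both USD and JPY formats, including thousands separators. A quick standalone check of the same regex:

```python
import re

# Same pattern as _check_price_accuracy: $ or ¥, optional space, digits/commas, optional decimals.
PRICE_PATTERN = r'[\$¥]\s*[\d,]+\.?\d*'

text = "Volume 1 is $12.99, or get the box set for ¥4,980 with free shipping."
prices = re.findall(PRICE_PATTERN, text)
print(prices)  # ['$12.99', '¥4,980']
```

Each extracted string would then be normalized and compared against the live catalog price for the ASIN mentioned alongside it.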
Consistency Checker
class ConsistencyChecker:
"""Checks that the FM response is internally consistent and consistent
with previous conversation turns.
MangaAssist scenario: if the assistant recommended "Demon Slayer" in turn 2,
and the user says "tell me more about the first one", the assistant must not
suddenly reference a different series.
"""
def __init__(self, bedrock_client=None):
self.bedrock = bedrock_client or boto3.client("bedrock-runtime", region_name="us-east-1")
def score(
self,
response: str,
conversation_history: list[dict],
intent: IntentType,
) -> DimensionScore:
sub_scores = {}
# Sub-score 1: Intra-turn consistency (no self-contradictions)
sub_scores["intra_turn"] = self._check_intra_turn(response)
# Sub-score 2: Cross-turn consistency (matches conversation history)
if len(conversation_history) > 1:
sub_scores["cross_turn"] = self._check_cross_turn(response, conversation_history)
else:
sub_scores["cross_turn"] = 1.0
# Sub-score 3: Persona consistency (stays in MangaAssist character)
sub_scores["persona"] = self._check_persona(response)
weights = {"intra_turn": 0.35, "cross_turn": 0.40, "persona": 0.25}
composite = sum(sub_scores[k] * weights[k] for k in weights)
evidence = []
for k, v in sub_scores.items():
if v < 0.7:
evidence.append(f"{k} consistency issue ({v:.2f})")
return DimensionScore(
dimension=QualityDimension.CONSISTENCY,
score=composite,
confidence=0.80,
evidence=evidence,
sub_scores=sub_scores,
)
def _check_intra_turn(self, response: str) -> float:
"""Uses LLM to detect self-contradictions within a single response."""
prompt = f"""Analyze this chatbot response for internal contradictions.
A contradiction is when the response says two things that cannot both be true.
Response: {response}
Return ONLY a JSON object:
{{
"has_contradiction": <bool>,
"contradictions": ["<list of contradictions found>"],
"score": <float 0.0 to 1.0 where 1.0 means no contradictions>
}}"""
try:
body = json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 200,
"temperature": 0.0,
"messages": [{"role": "user", "content": prompt}],
})
resp = self.bedrock.invoke_model(
modelId="anthropic.claude-3-5-sonnet-20241022-v2:0", body=body
)
result = json.loads(json.loads(resp["body"].read())["content"][0]["text"])
return max(0.0, min(1.0, result["score"]))
except Exception:
return 0.7
def _check_cross_turn(self, response: str, conversation_history: list[dict]) -> float:
"""Checks whether the current response contradicts claims made in earlier turns."""
# Build a summary of assistant's previous claims
previous_claims = []
for turn in conversation_history:
if turn.get("role") == "assistant":
previous_claims.append(turn["content"])
if not previous_claims:
return 1.0
recent_context = "\n---\n".join(previous_claims[-3:]) # Last 3 assistant turns
prompt = f"""Compare the new response against the assistant's previous responses.
Flag any contradictions where the new response conflicts with what was said before.
PREVIOUS ASSISTANT RESPONSES:
{recent_context}
NEW RESPONSE:
{response}
Return ONLY a JSON object:
{{
"contradicts_previous": <bool>,
"contradictions": ["<list>"],
"score": <float 0.0 to 1.0 where 1.0 means fully consistent>
}}"""
try:
body = json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 200,
"temperature": 0.0,
"messages": [{"role": "user", "content": prompt}],
})
resp = self.bedrock.invoke_model(
modelId="anthropic.claude-3-5-sonnet-20241022-v2:0", body=body
)
result = json.loads(json.loads(resp["body"].read())["content"][0]["text"])
return max(0.0, min(1.0, result["score"]))
except Exception:
return 0.7
def _check_persona(self, response: str) -> float:
"""Checks whether the response stays in MangaAssist assistant persona."""
violations = []
response_lower = response.lower()
# Must not claim to be a different assistant
competitor_names = ["alexa", "siri", "google assistant", "chatgpt"]
for name in competitor_names:
if name in response_lower:
violations.append(f"Mentioned competitor: {name}")
# Must not break character
meta_phrases = [
"as an ai", "as a language model", "i cannot", "i'm just a",
"my training data", "i was trained"
]
for phrase in meta_phrases:
if phrase in response_lower:
violations.append(f"Meta-AI disclosure: {phrase}")
# Must stay on manga/Amazon topic
off_topic_signals = ["political", "religious", "sexual"]
for signal in off_topic_signals:
if signal in response_lower:
violations.append(f"Off-topic content: {signal}")
if not violations:
return 1.0
return max(0.0, 1.0 - (len(violations) * 0.3))
Fluency Scorer
class FluencyScorer:
"""Scores the linguistic quality of the FM response.
For MangaAssist, fluency includes proper English grammar, appropriate
conversational tone, and correct formatting of product cards and action buttons.
"""
def __init__(self, bedrock_client=None):
self.bedrock = bedrock_client or boto3.client("bedrock-runtime", region_name="us-east-1")
def score(self, response: str, intent: IntentType) -> DimensionScore:
sub_scores = {}
# Sub-score 1: Grammar and syntax
sub_scores["grammar"] = self._score_grammar(response)
# Sub-score 2: Tone appropriateness
sub_scores["tone"] = self._score_tone(response, intent)
# Sub-score 3: Formatting quality
sub_scores["formatting"] = self._score_formatting(response, intent)
weights = {"grammar": 0.40, "tone": 0.35, "formatting": 0.25}
composite = sum(sub_scores[k] * weights[k] for k in weights)
evidence = []
for k, v in sub_scores.items():
if v < 0.7:
evidence.append(f"{k} issue ({v:.2f})")
return DimensionScore(
dimension=QualityDimension.FLUENCY,
score=composite,
confidence=0.85,
evidence=evidence,
sub_scores=sub_scores,
)
def _score_grammar(self, response: str) -> float:
"""Uses LLM to assess grammatical correctness."""
prompt = f"""Rate the grammatical correctness of this chatbot response.
Consider: sentence structure, subject-verb agreement, punctuation, spelling.
Ignore formatting elements like markdown or product cards.
Response: {response}
Return ONLY a JSON object: {{"score": <float 0.0 to 1.0>, "issues": ["<list of grammar issues>"]}}"""
try:
body = json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 200,
"temperature": 0.0,
"messages": [{"role": "user", "content": prompt}],
})
resp = self.bedrock.invoke_model(
modelId="anthropic.claude-3-5-sonnet-20241022-v2:0", body=body
)
result = json.loads(json.loads(resp["body"].read())["content"][0]["text"])
return max(0.0, min(1.0, result["score"]))
except Exception:
return 0.7
def _score_tone(self, response: str, intent: IntentType) -> float:
"""Checks whether the tone matches the expected register for the intent."""
# Order tracking and returns need professional, empathetic tone
# Recommendations can be enthusiastic and conversational
# Chitchat should be warm and brief
tone_guidance = {
IntentType.RECOMMENDATION: "enthusiastic, knowledgeable, conversational",
IntentType.PRODUCT_QUESTION: "helpful, informative, concise",
IntentType.FAQ: "clear, authoritative, reassuring",
IntentType.ORDER_TRACKING: "professional, empathetic, action-oriented",
IntentType.RETURN_REQUEST: "empathetic, solution-focused, patient",
IntentType.PROMOTION: "excited but not pushy, informative",
IntentType.CHECKOUT_HELP: "helpful, step-by-step, encouraging",
IntentType.CHITCHAT: "warm, brief, natural",
}
expected_tone = tone_guidance.get(intent, "helpful and professional")
prompt = f"""Rate whether this chatbot response has the appropriate tone.
Expected tone: {expected_tone}
Response: {response}
Return ONLY a JSON object: {{"score": <float 0.0 to 1.0>, "actual_tone": "<description>"}}"""
try:
body = json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 100,
"temperature": 0.0,
"messages": [{"role": "user", "content": prompt}],
})
resp = self.bedrock.invoke_model(
modelId="anthropic.claude-3-5-sonnet-20241022-v2:0", body=body
)
result = json.loads(json.loads(resp["body"].read())["content"][0]["text"])
return max(0.0, min(1.0, result["score"]))
except Exception:
return 0.7
def _score_formatting(self, response: str, intent: IntentType) -> float:
"""Checks whether the response uses appropriate formatting elements."""
score = 1.0
penalties = []
# Recommendations should include product information
if intent == IntentType.RECOMMENDATION:
if not any(marker in response for marker in ["ASIN", "http", "amazon.com", "$", "¥"]):
score -= 0.3
penalties.append("Recommendation missing product links or prices")
# Response should not be excessively long for simple intents
word_count = len(response.split())
if intent == IntentType.CHITCHAT and word_count > 80:
score -= 0.2
penalties.append(f"Chitchat too verbose ({word_count} words)")
if intent == IntentType.ORDER_TRACKING and word_count > 150:
score -= 0.1
penalties.append(f"Order tracking response too long ({word_count} words)")
return max(0.0, score)
Composite Evaluator (Orchestrates All Dimensions)
class FMOutputQualityEvaluator:
"""Orchestrates all four quality dimension scorers and produces
a composite evaluation result for a single test case.
This is the main entry point for the evaluation pipeline.
"""
# Intent-specific weights (from the scoring thresholds table above)
INTENT_WEIGHTS: dict[IntentType, dict[str, float]] = {
IntentType.RECOMMENDATION: {"relevance": 0.40, "factual_accuracy": 0.20, "consistency": 0.20, "fluency": 0.20},
IntentType.PRODUCT_QUESTION: {"relevance": 0.25, "factual_accuracy": 0.40, "consistency": 0.20, "fluency": 0.15},
IntentType.FAQ: {"relevance": 0.20, "factual_accuracy": 0.40, "consistency": 0.25, "fluency": 0.15},
IntentType.ORDER_TRACKING: {"relevance": 0.15, "factual_accuracy": 0.50, "consistency": 0.25, "fluency": 0.10},
IntentType.RETURN_REQUEST: {"relevance": 0.15, "factual_accuracy": 0.50, "consistency": 0.25, "fluency": 0.10},
IntentType.PROMOTION: {"relevance": 0.30, "factual_accuracy": 0.35, "consistency": 0.20, "fluency": 0.15},
IntentType.CHECKOUT_HELP: {"relevance": 0.20, "factual_accuracy": 0.40, "consistency": 0.25, "fluency": 0.15},
IntentType.CHITCHAT: {"relevance": 0.30, "factual_accuracy": 0.10, "consistency": 0.30, "fluency": 0.30},
}
PASS_THRESHOLDS: dict[IntentType, float] = {
IntentType.RECOMMENDATION: 0.75,
IntentType.PRODUCT_QUESTION: 0.80,
IntentType.FAQ: 0.85,
IntentType.ORDER_TRACKING: 0.90,
IntentType.RETURN_REQUEST: 0.90,
IntentType.PROMOTION: 0.80,
IntentType.CHECKOUT_HELP: 0.85,
IntentType.CHITCHAT: 0.70,
}
def __init__(self):
self.relevance_scorer = RelevanceScorer()
self.factual_checker = FactualAccuracyChecker()
self.consistency_checker = ConsistencyChecker()
self.fluency_scorer = FluencyScorer()
def evaluate(self, test_case: EvaluationTestCase, response: str) -> EvaluationResult:
"""Evaluate a single FM response against all quality dimensions."""
start_time = time.time()
# Score each dimension
relevance = self.relevance_scorer.score(
query=test_case.query,
response=response,
intent=test_case.intent,
page_context=test_case.page_context,
conversation_history=test_case.conversation_history,
expected_products=test_case.expected_products,
)
factual = self.factual_checker.score(
response=response,
expected_facts=test_case.expected_facts,
intent=test_case.intent,
)
consistency = self.consistency_checker.score(
response=response,
conversation_history=test_case.conversation_history,
intent=test_case.intent,
)
fluency = self.fluency_scorer.score(
response=response,
intent=test_case.intent,
)
dimension_scores = [relevance, factual, consistency, fluency]
# Compute weighted composite score
weights = self.INTENT_WEIGHTS.get(
test_case.intent,
{"relevance": 0.25, "factual_accuracy": 0.25, "consistency": 0.25, "fluency": 0.25},
)
composite = (
relevance.score * weights["relevance"]
+ factual.score * weights["factual_accuracy"]
+ consistency.score * weights["consistency"]
+ fluency.score * weights["fluency"]
)
threshold = self.PASS_THRESHOLDS.get(test_case.intent, 0.80)
passed = composite >= threshold
failure_reasons = []
if not passed:
for ds in dimension_scores:
if ds.evidence:
failure_reasons.extend(ds.evidence)
latency_ms = (time.time() - start_time) * 1000
return EvaluationResult(
test_id=test_case.test_id,
intent=test_case.intent,
response_text=response,
dimension_scores=dimension_scores,
composite_score=composite,
passed=passed,
latency_ms=latency_ms,
failure_reasons=failure_reasons,
)
def evaluate_batch(self, test_cases: list[tuple[EvaluationTestCase, str]]) -> EvaluationReport:
"""Evaluate a batch of test cases and produce an aggregate report."""
results = []
for test_case, response in test_cases:
result = self.evaluate(test_case, response)
results.append(result)
passed = [r for r in results if r.passed]
failed = [r for r in results if not r.passed]
# Aggregate dimension averages
dim_totals: dict[str, list[float]] = {}
for r in results:
for ds in r.dimension_scores:
dim_totals.setdefault(ds.dimension.value, []).append(ds.score)
dim_averages = {k: sum(v) / len(v) for k, v in dim_totals.items()}
# Breakdown by intent
intent_breakdown: dict[str, dict] = {}
for r in results:
key = r.intent.value
if key not in intent_breakdown:
intent_breakdown[key] = {"total": 0, "passed": 0, "avg_score": 0.0, "scores": []}
intent_breakdown[key]["total"] += 1
intent_breakdown[key]["scores"].append(r.composite_score)
if r.passed:
intent_breakdown[key]["passed"] += 1
for key, data in intent_breakdown.items():
data["avg_score"] = sum(data["scores"]) / len(data["scores"])
del data["scores"]
import uuid
return EvaluationReport(
run_id=str(uuid.uuid4()),
trigger="batch_evaluation",
total_cases=len(results),
passed_cases=len(passed),
failed_cases=len(failed),
dimension_averages=dim_averages,
intent_breakdown=intent_breakdown,
regression_flags=[], # Populated by comparing against baseline
)
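To make the report shape concrete, the same aggregation logic can be walked through on a toy batch, with plain dicts standing in for EvaluationResult:

```python
# Synthetic results: only the fields the aggregation reads (intent, composite_score, passed).
results = [
    {"intent": "recommendation", "composite_score": 0.82, "passed": True},
    {"intent": "recommendation", "composite_score": 0.68, "passed": False},
    {"intent": "faq", "composite_score": 0.91, "passed": True},
]

breakdown: dict[str, dict] = {}
for r in results:
    b = breakdown.setdefault(r["intent"], {"total": 0, "passed": 0, "scores": []})
    b["total"] += 1
    b["scores"].append(r["composite_score"])
    if r["passed"]:
        b["passed"] += 1
for b in breakdown.values():
    b["avg_score"] = sum(b["scores"]) / len(b["scores"])
    del b["scores"]  # raw scores are dropped from the report, as in evaluate_batch

pass_rate = sum(r["passed"] for r in results) / len(results)
# recommendation: 1/2 passed, avg 0.75; faq: 1/1 passed; overall pass rate 2/3
```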
CloudWatch Metric Publishing
```python
import boto3


class EvaluationMetricsPublisher:
    """Publishes evaluation results to CloudWatch for dashboarding and alerting."""

    def __init__(self, namespace: str = "MangaAssist/Evaluation"):
        self.cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
        self.namespace = namespace

    def publish_result(self, result: EvaluationResult) -> None:
        """Publishes dimension scores and composite score for a single evaluation."""
        metric_data = [
            {
                "MetricName": "CompositeScore",
                "Dimensions": [
                    {"Name": "Intent", "Value": result.intent.value},
                ],
                "Value": result.composite_score,
                "Unit": "None",
            },
            {
                "MetricName": "EvaluationLatency",
                "Dimensions": [
                    {"Name": "Intent", "Value": result.intent.value},
                ],
                "Value": result.latency_ms,
                "Unit": "Milliseconds",
            },
        ]
        for ds in result.dimension_scores:
            metric_data.append({
                "MetricName": f"DimensionScore_{ds.dimension.value}",
                "Dimensions": [
                    {"Name": "Intent", "Value": result.intent.value},
                ],
                "Value": ds.score,
                "Unit": "None",
            })
        # Pass/fail as a binary metric for alarming
        metric_data.append({
            "MetricName": "QualityGateResult",
            "Dimensions": [
                {"Name": "Intent", "Value": result.intent.value},
            ],
            "Value": 1.0 if result.passed else 0.0,
            "Unit": "None",
        })
        self.cloudwatch.put_metric_data(Namespace=self.namespace, MetricData=metric_data)

    def publish_report(self, report: EvaluationReport) -> None:
        """Publishes aggregate report metrics."""
        metric_data = [
            {
                "MetricName": "BatchPassRate",
                "Value": report.passed_cases / max(report.total_cases, 1),
                "Unit": "None",
            },
            {
                "MetricName": "BatchTotalCases",
                "Value": float(report.total_cases),
                "Unit": "Count",
            },
        ]
        for dim, avg in report.dimension_averages.items():
            metric_data.append({
                "MetricName": f"BatchAvg_{dim}",
                "Value": avg,
                "Unit": "None",
            })
        self.cloudwatch.put_metric_data(Namespace=self.namespace, MetricData=metric_data)
```
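One operational caveat: `PutMetricData` caps the number of metric datapoints per request (1,000 at the time of writing; verify against current AWS limits). A large golden dataset can exceed that in a single batch publish, so a chunking helper is worth having. A minimal sketch (the helper name is an assumption):

```python
def chunk_metric_data(metric_data: list, batch_size: int = 1000):
    """Yields successive slices small enough for a single PutMetricData call."""
    for i in range(0, len(metric_data), batch_size):
        yield metric_data[i:i + batch_size]

# Usage inside publish_result / publish_report:
#     for batch in chunk_metric_data(metric_data):
#         self.cloudwatch.put_metric_data(Namespace=self.namespace, MetricData=batch)
print(len(list(chunk_metric_data([{"MetricName": "X"}] * 2500))))  # 3 batches
```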
MangaAssist Scenarios
Scenario A: Recommendation Relevance Drops After Prompt Change
Context: The prompt engineering team updated the system prompt to add a new instruction about highlighting promotions. After deployment, the recommendation intent's relevance score dropped from 0.82 to 0.68.
What Happened:
- The new promotion instruction consumed attention budget, causing the LLM to mention promotions even when the user asked for genre-specific recommendations.
- User asks: "Recommend horror manga similar to Junji Ito's work."
- Before: the response focused on horror manga by similar artists (Kazuo Umezu, Hideshi Hino) — high relevance.
- After: the response opened with a promotion for a shonen bundle before listing horror titles — relevance dropped because the first third of the response was off-intent.
How Evaluation Caught It:
- The intent_relevance sub-score dropped from 0.88 to 0.62 because the LLM-as-Judge saw that promotion content was irrelevant to a horror recommendation query.
- The product_relevance sub-score stayed at 0.85 (the horror titles were still there, just buried).
- The composite relevance score dropped below the 0.75 threshold for recommendation intents.
- The automated quality gate in the CD-06 pipeline blocked the deployment.
Fix: The team restructured the system prompt to apply promotion nudges only when the intent classifier returns promotion or when the user's cart is non-empty, not on every response.
Metric Signal:
CloudWatch Alarm: MangaAssist/Evaluation → DimensionScore_relevance
Intent=recommendation < 0.75 for 3 consecutive evaluation runs
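The alarm described above maps directly onto CloudWatch's `put_metric_alarm` parameters. A sketch of the definition — the alarm name and hourly period are assumptions, and the actual call (commented out) needs AWS credentials:

```python
# Alarm definition mirroring the metric signal above. The alarm name and
# Period are placeholders, not values from the MangaAssist stack.
alarm_kwargs = dict(
    AlarmName="mangaassist-recommendation-relevance",
    Namespace="MangaAssist/Evaluation",
    MetricName="DimensionScore_relevance",
    Dimensions=[{"Name": "Intent", "Value": "recommendation"}],
    Statistic="Average",
    Period=3600,             # assumes one evaluation run per hour
    EvaluationPeriods=3,     # "3 consecutive evaluation runs"
    Threshold=0.75,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="notBreaching",
)
# boto3.client("cloudwatch").put_metric_alarm(**alarm_kwargs)  # requires AWS credentials
print(alarm_kwargs["MetricName"], alarm_kwargs["Threshold"])
```

`TreatMissingData="notBreaching"` keeps the alarm quiet between evaluation runs, since the metric is only published when an evaluation actually executes.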
Scenario B: FAQ Factual Accuracy Regression After Knowledge Base Update
Context: The operations team updated the return policy in the knowledge base: the return window changed from 30 days to 15 days for opened manga volumes. The RAG pipeline re-indexed, but some old chunks were not properly invalidated.
What Happened:
- User asks: "Can I return Demon Slayer Volume 23 if I already read it?"
- The RAG pipeline retrieved two chunks: one from the new policy (15 days) and one stale chunk from the old policy (30 days).
- The LLM generated: "You can return opened manga within 30 days of delivery."
- This was factually incorrect — the new policy says 15 days.
How Evaluation Caught It:
- The golden dataset included a test case with expected_facts: ["15 days return window for opened manga"].
- The fact_coverage sub-score returned 0.0 because "30 days" appeared instead of "15 days."
- The no_hallucination sub-score returned 0.4 because the LLM cited a stale source chunk rather than fabricating entirely.
- The composite factual accuracy score dropped to 0.32, far below the 0.85 threshold for FAQ intents.
Fix:
1. The stale chunk was invalidated by correcting the last_updated timestamp check in the RAG indexing pipeline.
2. The team added a policy-specific test case to the golden dataset that checks the exact return window number.
3. A chunk freshness filter was added to the retrieval step: chunks about return policies older than 24 hours are excluded.
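The freshness filter in step 3 can be a simple post-retrieval predicate: drop return-policy chunks whose `last_updated` timestamp is older than the cutoff, and pass everything else through. A sketch under assumed chunk field names (`topic`, `last_updated`):

```python
from datetime import datetime, timedelta, timezone

def filter_stale_policy_chunks(chunks: list, max_age_hours: int = 24) -> list:
    """Excludes return-policy chunks older than the freshness cutoff;
    non-policy chunks pass through untouched."""
    cutoff = datetime.now(timezone.utc) - timedelta(hours=max_age_hours)
    return [
        c for c in chunks
        if c["topic"] != "return_policy" or c["last_updated"] >= cutoff
    ]

now = datetime.now(timezone.utc)
chunks = [
    {"id": "new-policy", "topic": "return_policy", "last_updated": now},
    {"id": "stale-policy", "topic": "return_policy", "last_updated": now - timedelta(days=30)},
    {"id": "genre-guide", "topic": "catalog", "last_updated": now - timedelta(days=90)},
]
print([c["id"] for c in filter_stale_policy_chunks(chunks)])  # ['new-policy', 'genre-guide']
```

Scoping the filter to policy topics matters: evergreen catalog content should not be discarded just because it has not been re-indexed recently.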
Metric Signal:
CloudWatch Alarm: MangaAssist/Evaluation → DimensionScore_factual_accuracy
Intent=faq < 0.85 for 1 evaluation run (zero tolerance for policy accuracy)
Scenario C: Multi-Turn Consistency Failure During Session
Context: A user is having a multi-turn conversation about manga recommendations. The assistant recommends three series in turn 2, and in turn 4 the user says "Tell me more about the second one."
What Happened:
- Turn 2: Assistant recommends [Chainsaw Man, Spy×Family, Jujutsu Kaisen].
- Turn 3: User asks about the art style of Chainsaw Man (the first one).
- Turn 4: User says "Now tell me about the second one."
- The conversation memory correctly stored all turns, but the LLM's context window was compressed (older turns were summarized to save tokens), and the summary lost the ordered list.
- The LLM responded with information about Jujutsu Kaisen (the third one) instead of Spy×Family (the second one).
How Evaluation Caught It:
- The cross_turn consistency sub-score detected the mismatch: the response referenced Jujutsu Kaisen but the conversation history showed "the second one" should resolve to Spy×Family.
- The consistency composite dropped to 0.45, well below the 0.75 threshold.
Fix:
1. The conversation memory summarizer was updated to preserve ordered lists explicitly when summarizing.
2. A new rule was added: when the summary step encounters product lists, it keeps them as structured data rather than compressing them into prose.
3. A dedicated test case was added: "ordinal reference resolution" — verifying that "the first/second/third one" resolves correctly after summarization.
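Once the recommended list survives summarization as structured data, the "ordinal reference resolution" check becomes deterministic. A sketch — the helper name and the three-ordinal scope are assumptions:

```python
import re

ORDINALS = {"first": 0, "second": 1, "third": 2}

def resolve_ordinal(user_message: str, recommended: list):
    """Maps 'the first/second/third one' onto the stored ordered list,
    returning None when no ordinal reference is present."""
    match = re.search(r"\b(first|second|third)\b", user_message.lower())
    return recommended[ORDINALS[match.group(1)]] if match else None

recommended = ["Chainsaw Man", "Spy×Family", "Jujutsu Kaisen"]
print(resolve_ordinal("Now tell me about the second one.", recommended))  # Spy×Family
```

The evaluation test case then asserts that the response mentions the resolved title — exactly the mismatch the cross_turn sub-score caught in this scenario.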
Scenario D: Japanese Content Fluency Degradation
Context: MangaAssist serves the JP Manga store, and some responses include Japanese titles, author names, and genre terms. After a model version upgrade (Claude 3.5 → 3.5 v2), the fluency of responses containing mixed English-Japanese content degraded.
What Happened:
- User asks: "What volumes of 鬼滅の刃 are available in English?"
- Before the upgrade: "Demon Slayer (鬼滅の刃 / Kimetsu no Yaiba) is available in English..."
- After the upgrade: "鬼滅の刃 (Demon Slayer) volumes 1 through 23 available in English format..." — the response dropped the romanization and shifted to less natural sentence structure.
How Evaluation Caught It:
- The grammar sub-score was still 0.9 (grammatically correct).
- The tone sub-score dropped to 0.6 because the response felt more mechanical and less conversational.
- The formatting sub-score stayed at 0.8.
- The fluency composite was 0.73, below the 0.75 threshold for recommendation responses involving Japanese content.
Fix:
1. The evaluation golden dataset was expanded with Japanese-mixed test cases.
2. A "Japanese content handling" rubric was added to the system prompt: always include the original title, the romanization, and the English title.
3. The model upgrade was rolled back, and the team re-evaluated after adding the rubric — the upgraded model scored 0.82 with the updated prompt.
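The rubric in step 2 lends itself to a cheap deterministic check that can run alongside the LLM-judged fluency score: verify all three title forms appear in the response. A sketch, with an assumed test-case shape:

```python
def check_title_triple(response: str, case: dict) -> float:
    """Returns the fraction of required title forms present in the response."""
    required = [case["original_title"], case["romanization"], case["english_title"]]
    return sum(1 for t in required if t in response) / len(required)

case = {
    "original_title": "鬼滅の刃",
    "romanization": "Kimetsu no Yaiba",
    "english_title": "Demon Slayer",
}
good = "Demon Slayer (鬼滅の刃 / Kimetsu no Yaiba) is available in English..."
bad = "鬼滅の刃 (Demon Slayer) volumes 1 through 23 available in English format..."
print(check_title_triple(good, case), round(check_title_triple(bad, case), 2))  # 1.0 0.67
```

A check like this would have flagged the post-upgrade response directly (missing romanization) without waiting for the softer tone sub-score to drift.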
Intuition Gained
Quality Dimension Intuition
Traditional ML evaluation asks "Was the prediction correct?" — a binary question. FM output evaluation asks "Was the response good?" — a multidimensional question. The four dimensions (relevance, accuracy, consistency, fluency) decompose "good" into measurable sub-properties that teams can independently improve. When a composite score drops, you know which dimension to investigate first. When a new intent is added, you customize the dimension weights rather than building a new evaluation system.
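Per-intent dimension weighting, as described above, reduces to a lookup table plus a weighted sum. A sketch with illustrative weights (the specific numbers are assumptions, not MangaAssist's production values):

```python
# Illustrative per-intent weights: FAQ leans on factual accuracy,
# recommendation leans on relevance. Each intent's weights sum to 1.0.
INTENT_WEIGHTS = {
    "faq":            {"relevance": 0.2, "factual_accuracy": 0.5, "consistency": 0.1, "fluency": 0.2},
    "recommendation": {"relevance": 0.5, "factual_accuracy": 0.2, "consistency": 0.1, "fluency": 0.2},
}

def composite_score(intent: str, dimension_scores: dict) -> float:
    """Weighted sum of dimension scores using the intent's weight profile."""
    weights = INTENT_WEIGHTS[intent]
    return sum(weights[d] * s for d, s in dimension_scores.items())

# The same dimension scores yield different composites per intent.
scores = {"relevance": 0.9, "factual_accuracy": 0.6, "consistency": 0.8, "fluency": 0.85}
print(round(composite_score("faq", scores), 3))             # 0.73
print(round(composite_score("recommendation", scores), 3))  # 0.82
```

The example makes the point concrete: a factually shaky response is a bigger problem for an FAQ query than for a recommendation, and the weights encode exactly that.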
Threshold Calibration Intuition
Setting thresholds too high blocks good changes (false positives in the quality gate). Setting them too low lets regressions through. The right approach: start with thresholds based on the best current scores minus one standard deviation, then adjust based on the false positive rate of the quality gate. If more than 5% of deployments are blocked despite being judged acceptable by humans, the threshold is too aggressive.
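The calibration heuristic above can be sketched in a few lines — the function names and sample score history are assumptions for illustration:

```python
from statistics import pstdev

def initial_threshold(historical_scores: list) -> float:
    """Starting point per the heuristic: best observed score minus
    one standard deviation of the recent score history."""
    return max(historical_scores) - pstdev(historical_scores)

def gate_too_aggressive(blocked_but_acceptable: int, total_deployments: int) -> bool:
    """True when more than 5% of deployments were blocked despite
    humans judging them acceptable (the acceptance-criteria ceiling)."""
    return blocked_but_acceptable / total_deployments > 0.05

recent_relevance = [0.82, 0.79, 0.84, 0.80, 0.81, 0.83]
print(round(initial_threshold(recent_relevance), 3))                        # ~0.823
print(gate_too_aggressive(blocked_but_acceptable=2, total_deployments=20))  # True
```

When `gate_too_aggressive` returns True, loosen the threshold slightly and re-measure rather than overriding the gate case by case.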
Cost-Quality Tradeoff in Evaluation
Every evaluation run calls the LLM multiple times per test case (once for each dimension that uses LLM-as-Judge). For a golden dataset of 200 test cases with 4 LLM-based scoring calls each, that is 800 LLM invocations per evaluation run. At ~$0.003 per invocation, that is $2.40 per run. Running hourly costs ~$57/day. The tradeoff: evaluate frequently enough to catch regressions early but not so frequently that evaluation itself becomes a significant cost center. Recommendation: Run full evaluation on deployment triggers and weekly schedules; run lightweight deterministic checks (price accuracy, ASIN validation, length checks) continuously.
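The arithmetic above generalizes to a small cost model. The ~$0.003 per-invocation figure is the document's own estimate and will vary by model and token counts; the "2 runs/day" figure for the deployment-plus-weekly cadence is an assumption:

```python
def daily_eval_cost(cases: int, llm_calls_per_case: int,
                    cost_per_call: float, runs_per_day: int) -> float:
    """Daily cost of LLM-as-Judge evaluation at a given run cadence."""
    return cases * llm_calls_per_case * cost_per_call * runs_per_day

# The document's hourly scenario: 200 cases x 4 judge calls x $0.003 x 24 runs
print(f"${daily_eval_cost(200, 4, 0.003, 24):.2f}/day")  # $57.60/day

# Deployment-triggered + weekly cadence, assumed ~2 runs/day on average
print(f"${daily_eval_cost(200, 4, 0.003, 2):.2f}/day")   # $4.80/day
```

Putting the cadence in a formula makes the recommendation easy to defend: moving from hourly to trigger-based evaluation cuts the judge-call spend by roughly an order of magnitude while the continuous deterministic checks remain essentially free.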
References
- MangaAssist Architecture HLD — System components and data flow
- MangaAssist Architecture LLD — Service contracts and schemas
- Model Evaluation Framework — Foundational evaluation architecture
- Model Evaluation Deep Dive — Production evaluation platform design
- Offline Testing Strategy — Cost-efficient evaluation approach
- Prompt Engineering Troubleshooting — Prompt testing and refinement systems