04: Quality Assurance Processes
AIP-C01 Mapping
- Content Domain 5: Testing, Validation, and Troubleshooting
- Task 5.1: Implement evaluation systems for GenAI
- Skill 5.1.4: Plan quality assurance processes (for example, continuous evaluation, regression testing, and quality gates for production deployments)
User Story
As a MangaAssist engineering lead, I want to enforce automated quality gates at every stage of the deployment pipeline — from code merge through canary release — with continuous evaluation in production and regression tests that catch quality degradation before customers see it, So that no model, prompt, or RAG content change reaches 100% traffic without passing objective quality thresholds.
Acceptance Criteria
- Quality gate in CI/CD blocks deployment if evaluation score drops below threshold per intent
- Regression test suite of 500+ golden test cases runs on every model/prompt change
- Continuous evaluation samples 1% of production traffic every hour for quality monitoring
- Quality trend dashboard shows rolling 24h quality scores per intent with alerting
- Pre-production evaluation runs against staging environment before canary promotion
- Prompt changes go through the same quality gate as model changes
- RAG knowledge base updates trigger targeted regression tests for affected intents
- Quality gate failures generate actionable reports with failing test cases and score deltas
Why Quality Assurance Is Different for GenAI
Traditional software QA: deterministic inputs produce deterministic outputs. If the test passes, the code is correct.
GenAI QA: the same input can produce different outputs depending on model version, temperature, prompt wording, RAG context freshness, and random seed. A "passing" evaluation today can "fail" tomorrow with no code change — because the underlying model was updated, or the knowledge base drifted, or the conversation context triggered a different generation path.
| Quality Risk | Traditional Software Mitigation | MangaAssist GenAI Challenge |
|---|---|---|
| Code change regression | Unit tests catch it | Model output is non-deterministic |
| Dependency update | Version pinning | Bedrock model versions change behavior |
| Data drift | Schema validation | RAG knowledge base content changes weekly |
| Configuration drift | Config-as-code | Prompt templates are code AND data |
| Environment parity | Docker containers | Model behavior differs at scale vs. single request |
This means MangaAssist needs continuous, probabilistic quality assurance — not just point-in-time tests.
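One practical consequence: assertions about GenAI behavior should be statistical, not exact-match. The sketch below is illustrative only; generate_response and score_response are assumed helpers (the second would be the composite scorer from Skill 5.1.1), and the run count and rates are placeholders.

from typing import Callable


def probabilistic_pass(
    query: str,
    expected: str,
    generate_response: Callable[[str], str],      # assumed wrapper around the Bedrock call
    score_response: Callable[[str, str], float],  # assumed 0-1 quality scorer (Skill 5.1.1)
    runs: int = 5,
    score_threshold: float = 0.75,
    min_pass_rate: float = 0.8,
) -> bool:
    """Gate on the pass rate across repeated generations, not on a single sample."""
    passes = sum(
        1 for _ in range(runs)
        if score_response(generate_response(query), expected) >= score_threshold
    )
    return passes / runs >= min_pass_rate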
High-Level Design
Quality Gate Architecture
graph TD
subgraph "Development"
A[Code / Prompt / RAG Change] --> B[Pull Request]
B --> C[CI Pipeline]
end
subgraph "CI Quality Gate"
C --> D[Run Regression Suite<br>500+ golden test cases]
D --> E[Score Against Thresholds]
E --> F{All Intents Pass<br>Quality Gate?}
F -->|Yes| G[Merge Allowed]
F -->|No| H[Block Merge<br>Quality Report Generated]
end
subgraph "CD Quality Gate"
G --> I[Deploy to Staging]
I --> J[Full Evaluation Suite<br>Against Staging]
J --> K{Staging Pass?}
K -->|Yes| L[Canary Deploy 5%]
K -->|No| M[Block Deploy<br>Staging Report]
end
subgraph "Production Quality Gate"
L --> N[Canary Monitor<br>30 min window]
N --> O{Canary Pass?}
O -->|Yes| P[Progressive Rollout<br>25% → 50% → 100%]
O -->|No| Q[Auto-Rollback]
end
subgraph "Continuous Evaluation"
P --> R[1% Traffic Sampling<br>Every hour]
R --> S[Quality Score Computation]
S --> T{Score Drift<br>Detected?}
T -->|Yes| U[Degradation Alert]
T -->|No| V[Continue Monitoring]
end
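The "Block Merge" box in the CI stage is usually realized as a script whose exit code fails the pipeline. A minimal sketch, assuming the QualityGateEvaluator and RegressionTestRunner classes defined in the Low-Level Design below, plus a hypothetical build_quality_evaluator() factory for the Skill 5.1.1 evaluator:

import sys


def ci_quality_gate(change_type: str, change_description: str, affected_intents=None) -> int:
    """Run the regression suite for the change; a non-zero exit code blocks the merge."""
    evaluator = QualityGateEvaluator(quality_evaluator=build_quality_evaluator())  # assumed factory
    runner = RegressionTestRunner()
    result = runner.run_regression(
        change_type=change_type,
        change_description=change_description,
        affected_intents=affected_intents,
        quality_gate_evaluator=evaluator,
        stage=GateStage.CI,
    )
    print(result.recommendation)  # surfaced in the CI job log as the quality report
    return 0 if result.passed else 1


if __name__ == "__main__":
    sys.exit(ci_quality_gate(sys.argv[1], " ".join(sys.argv[2:])))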
Regression Test Suite Structure
graph TD
subgraph "Golden Dataset"
A[Test Cases by Intent] --> A1[recommendation: 80 cases]
A --> A2[product_question: 70 cases]
A --> A3[faq: 60 cases]
A --> A4[order_tracking: 50 cases]
A --> A5[return_request: 40 cases]
A --> A6[promotion: 40 cases]
A --> A7[checkout_help: 35 cases]
A --> A8[chitchat: 30 cases]
A --> A9[escalation: 25 cases]
A --> A10[edge_cases: 70 cases<br>Multi-turn, ambiguous, adversarial]
end
subgraph "Test Categories"
B1[Core Regression<br>Must-pass cases<br>200 cases] --> C[Priority 1]
B2[Extended Regression<br>Comprehensive coverage<br>300 cases] --> D[Priority 2]
B3[Edge Case Suite<br>Known failure modes<br>70 cases] --> E[Priority 3]
end
subgraph "Triggers"
F1[Model Version Change] --> G[Full Suite: P1 + P2 + P3]
F2[Prompt Template Change] --> H[Affected Intent: P1 + P2]
F3[RAG KB Update] --> I[Affected Content: P1 + P3]
F4[Nightly Scheduled] --> J[Full Suite: P1 + P2 + P3]
end
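The golden dataset is stored as JSONL under golden-dataset/{priority}/ (see the regression runner below). The record schema is not prescribed in this section; the example below is an illustrative shape consistent with the fields the gate evaluator reads (test_id, intent, query, expected_response), with priority and tags as assumed extra fields.

import json

example_case = {
    "test_id": "faq-return-policy-001",   # illustrative ID
    "intent": "faq",
    "priority": "P1",                     # assumed field mirroring the S3 prefix
    "query": "What is your return policy for damaged volumes?",
    "expected_response": "Reference answer describing the return policy for damaged volumes ...",
    "tags": ["return_policy"],            # assumed field used for targeted RAG regressions
}

# One JSON object per line, appended to the priority's test_cases.jsonl file.
with open("test_cases.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example_case) + "\n")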
Continuous Evaluation Pipeline
sequenceDiagram
participant Prod as Production Traffic
participant Sampler as 1% Sampler
participant Evaluator as Quality Evaluator
participant Store as Redshift
participant Monitor as CloudWatch
participant Alert as Alert System
loop Every hour
Prod->>Sampler: Sample 1% of responses
Sampler->>Evaluator: Batch evaluate samples
Evaluator->>Evaluator: Score each sample<br>(relevance, accuracy, consistency, fluency)
Evaluator->>Store: Store detailed scores
Evaluator->>Monitor: Publish hourly aggregates
Monitor->>Monitor: Compare to rolling baseline
alt Score drop > 3% vs 7-day average
Monitor->>Alert: Quality degradation alert
Alert->>Alert: Page on-call engineer
end
alt Score drop > 5% for 3 consecutive hours
Monitor->>Alert: Sustained degradation — auto-investigation
end
end
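The sustained-degradation branch maps naturally onto a CloudWatch alarm over the hourly metric that the continuous evaluator publishes. A hedged sketch, assuming the MangaAssist/ContinuousEval namespace and HourlyQualityScore metric used later in this section; the alarm below uses an absolute score floor over three hourly periods (the relative drop-vs-7-day-baseline check stays in code in _check_degradation), and the SNS topic ARN is hypothetical.

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="manga-continuous-eval-recommendation-sustained-degradation",
    Namespace="MangaAssist/ContinuousEval",
    MetricName="HourlyQualityScore",
    Dimensions=[{"Name": "Intent", "Value": "recommendation"}],
    Statistic="Average",
    Period=3600,                       # one datapoint per hour
    EvaluationPeriods=3,               # must breach for 3 consecutive hours
    Threshold=0.75,                    # production score floor for the recommendation intent
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="notBreaching",   # quiet hours should not page anyone
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:manga-quality-alerts"],  # hypothetical topic
)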
Low-Level Design
Quality Gate Configuration
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
import time
class GateStage(Enum):
CI = "ci" # Pre-merge
STAGING = "staging" # Pre-canary
CANARY = "canary" # During canary rollout
PRODUCTION = "production" # Continuous monitoring
class GateAction(Enum):
BLOCK = "block" # Fail the pipeline
WARN = "warn" # Allow but notify
LOG = "log" # Record only
@dataclass
class QualityThreshold:
"""Quality threshold for a specific intent at a specific stage."""
intent: str
stage: GateStage
min_composite_score: float
    min_pass_rate: float  # Fraction of test cases that must pass (0.80 = 80%)
max_regression_delta: float # Maximum allowed drop from baseline
action_on_fail: GateAction = GateAction.BLOCK
min_test_cases: int = 20 # Minimum cases needed for valid evaluation
# MangaAssist quality thresholds: CI is the most permissive stage; staging and canary raise the score floors, and production tightens the allowed regression delta
QUALITY_THRESHOLDS: dict[str, dict[str, QualityThreshold]] = {
"recommendation": {
"ci": QualityThreshold("recommendation", GateStage.CI, 0.75, 0.80, 0.05),
"staging": QualityThreshold("recommendation", GateStage.STAGING, 0.78, 0.85, 0.03),
"canary": QualityThreshold("recommendation", GateStage.CANARY, 0.78, 0.85, 0.05),
"production": QualityThreshold("recommendation", GateStage.PRODUCTION, 0.75, 0.80, 0.03),
},
"product_question": {
"ci": QualityThreshold("product_question", GateStage.CI, 0.78, 0.85, 0.05),
"staging": QualityThreshold("product_question", GateStage.STAGING, 0.80, 0.88, 0.03),
"canary": QualityThreshold("product_question", GateStage.CANARY, 0.80, 0.88, 0.05),
"production": QualityThreshold("product_question", GateStage.PRODUCTION, 0.78, 0.85, 0.03),
},
"faq": {
"ci": QualityThreshold("faq", GateStage.CI, 0.80, 0.88, 0.03),
"staging": QualityThreshold("faq", GateStage.STAGING, 0.82, 0.90, 0.02),
"canary": QualityThreshold("faq", GateStage.CANARY, 0.82, 0.90, 0.03),
"production": QualityThreshold("faq", GateStage.PRODUCTION, 0.80, 0.88, 0.02),
},
"order_tracking": {
"ci": QualityThreshold("order_tracking", GateStage.CI, 0.85, 0.90, 0.03),
"staging": QualityThreshold("order_tracking", GateStage.STAGING, 0.88, 0.92, 0.02),
"canary": QualityThreshold("order_tracking", GateStage.CANARY, 0.88, 0.92, 0.03),
"production": QualityThreshold("order_tracking", GateStage.PRODUCTION, 0.85, 0.90, 0.02),
},
"chitchat": {
"ci": QualityThreshold("chitchat", GateStage.CI, 0.65, 0.75, 0.10),
"staging": QualityThreshold("chitchat", GateStage.STAGING, 0.68, 0.78, 0.08),
"canary": QualityThreshold("chitchat", GateStage.CANARY, 0.68, 0.78, 0.10),
"production": QualityThreshold("chitchat", GateStage.PRODUCTION, 0.65, 0.75, 0.08),
},
}
@dataclass
class QualityGateResult:
"""Result of running a quality gate evaluation."""
gate_id: str
stage: GateStage
passed: bool
overall_score: float
overall_pass_rate: float
intent_results: dict[str, dict] # intent -> {score, pass_rate, passed, delta}
failing_intents: list[str]
failing_test_cases: list[dict] # Detailed failures for debugging
baseline_comparison: dict # Comparison with previous baseline
recommendation: str
timestamp: float = field(default_factory=time.time)
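Because the threshold table is hand-maintained, a small consistency check in the unit test suite helps catch typos, for example a staging floor accidentally set below the CI floor. A minimal sketch against the QUALITY_THRESHOLDS table above:

def test_threshold_table_is_consistent():
    """Staging and canary gates should never be looser than the CI gate."""
    for intent, by_stage in QUALITY_THRESHOLDS.items():
        ci = by_stage["ci"]
        for stage_name in ("staging", "canary"):
            t = by_stage[stage_name]
            assert t.min_composite_score >= ci.min_composite_score, (intent, stage_name)
            assert t.min_pass_rate >= ci.min_pass_rate, (intent, stage_name)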
Quality Gate Evaluator
import json
import logging
import time
import uuid
from typing import Optional
import boto3
logger = logging.getLogger(__name__)
class QualityGateEvaluator:
"""Evaluates model/prompt/RAG changes against quality thresholds.
Used at every stage of the MangaAssist deployment pipeline:
1. CI: blocks merge if quality drops below threshold
2. Staging: blocks canary deployment if staging eval fails
3. Canary: auto-rollback if canary quality degrades
4. Production: continuous quality monitoring with alerting
"""
def __init__(
self,
quality_evaluator=None, # FMOutputQualityEvaluator from Skill 5.1.1
baseline_store: str = "manga_quality_baselines",
):
self.quality_evaluator = quality_evaluator
self.dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
self.baseline_table = self.dynamodb.Table(baseline_store)
def run_quality_gate(
self,
stage: GateStage,
test_cases: list,
change_description: str = "",
) -> QualityGateResult:
"""Run the quality gate for a specific deployment stage."""
gate_id = str(uuid.uuid4())
logger.info("Running quality gate %s at stage=%s (%d test cases)",
gate_id, stage.value, len(test_cases))
# Group test cases by intent
by_intent: dict[str, list] = {}
for tc in test_cases:
intent = tc.intent.value if hasattr(tc.intent, 'value') else tc.intent
by_intent.setdefault(intent, []).append(tc)
# Evaluate each intent against its threshold
intent_results = {}
failing_intents = []
failing_test_cases = []
all_scores = []
for intent, cases in by_intent.items():
threshold = self._get_threshold(intent, stage)
if threshold is None:
logger.warning("No threshold defined for intent=%s stage=%s, using defaults", intent, stage.value)
threshold = QualityThreshold(intent, stage, 0.70, 0.75, 0.10, GateAction.WARN)
if len(cases) < threshold.min_test_cases:
logger.warning("Only %d test cases for %s (minimum: %d)",
len(cases), intent, threshold.min_test_cases)
# Evaluate all cases for this intent
intent_scores = []
intent_passes = 0
for tc in cases:
if self.quality_evaluator:
result = self.quality_evaluator.evaluate(tc, tc.expected_response)
score = result.composite_score
passed = result.passed
else:
score = 0.80 # Placeholder
passed = True
intent_scores.append(score)
all_scores.append(score)
if passed:
intent_passes += 1
else:
failing_test_cases.append({
"test_id": tc.test_id,
"intent": intent,
"query": tc.query,
"score": score,
"threshold": threshold.min_composite_score,
})
avg_score = sum(intent_scores) / len(intent_scores) if intent_scores else 0
pass_rate = intent_passes / len(cases) if cases else 0
# Compare with baseline
baseline = self._get_baseline(intent)
delta = avg_score - baseline if baseline is not None else 0
intent_passed = (
avg_score >= threshold.min_composite_score
and pass_rate >= threshold.min_pass_rate
and (baseline is None or delta >= -threshold.max_regression_delta)
)
intent_results[intent] = {
"avg_score": round(avg_score, 4),
"pass_rate": round(pass_rate, 4),
"total_cases": len(cases),
"passed_cases": intent_passes,
"baseline": baseline,
"delta": round(delta, 4),
"threshold_score": threshold.min_composite_score,
"threshold_pass_rate": threshold.min_pass_rate,
"max_regression": threshold.max_regression_delta,
"passed": intent_passed,
"action": threshold.action_on_fail.value,
}
if not intent_passed:
failing_intents.append(intent)
# Overall gate decision
overall_score = sum(all_scores) / len(all_scores) if all_scores else 0
        passed_count = sum(1 for s in all_scores if s >= 0.75)  # fixed 0.75 floor for the aggregate rate; per-intent gating uses the thresholds above
overall_pass_rate = passed_count / len(all_scores) if all_scores else 0
# Gate passes only if all BLOCK-action intents pass
blocking_failures = [
i for i in failing_intents
if intent_results[i]["action"] == "block"
]
gate_passed = len(blocking_failures) == 0
recommendation = self._generate_recommendation(
gate_passed, failing_intents, intent_results, change_description
)
result = QualityGateResult(
gate_id=gate_id,
stage=stage,
passed=gate_passed,
overall_score=round(overall_score, 4),
overall_pass_rate=round(overall_pass_rate, 4),
intent_results=intent_results,
failing_intents=failing_intents,
failing_test_cases=failing_test_cases[:20], # Top 20 failures
baseline_comparison={
intent: {"baseline": r.get("baseline"), "delta": r.get("delta")}
for intent, r in intent_results.items()
},
recommendation=recommendation,
)
logger.info("Quality gate %s: %s (score=%.4f, pass_rate=%.2f%%)",
gate_id, "PASSED" if gate_passed else "FAILED",
overall_score, overall_pass_rate * 100)
return result
def _get_threshold(self, intent: str, stage: GateStage) -> Optional[QualityThreshold]:
"""Retrieve the quality threshold for an intent and stage."""
intent_thresholds = QUALITY_THRESHOLDS.get(intent, {})
return intent_thresholds.get(stage.value)
def _get_baseline(self, intent: str) -> Optional[float]:
"""Get the current production baseline score for an intent."""
try:
response = self.baseline_table.get_item(
Key={"pk": f"BASELINE#{intent}", "sk": "CURRENT"}
)
item = response.get("Item")
return float(item["score"]) if item else None
except Exception:
return None
def update_baseline(self, intent: str, score: float) -> None:
"""Update the baseline score after a successful production deployment."""
self.baseline_table.put_item(Item={
"pk": f"BASELINE#{intent}",
"sk": "CURRENT",
"score": str(score),
"updated_at": int(time.time()),
})
# Also store historical baseline
self.baseline_table.put_item(Item={
"pk": f"BASELINE#{intent}",
"sk": f"HISTORY#{int(time.time())}",
"score": str(score),
})
def _generate_recommendation(
self,
passed: bool,
failing_intents: list[str],
intent_results: dict,
change_description: str,
) -> str:
"""Generate actionable recommendation based on gate results."""
if passed:
return f"Quality gate PASSED. All intents within thresholds. Safe to proceed with: {change_description}"
lines = [f"Quality gate FAILED for change: {change_description}\n"]
lines.append("Failing intents:\n")
for intent in failing_intents:
r = intent_results[intent]
lines.append(
f" - {intent}: score={r['avg_score']:.3f} (need {r['threshold_score']}), "
f"pass_rate={r['pass_rate']:.1%} (need {r['threshold_pass_rate']:.1%}), "
f"delta={r['delta']:+.3f} (max regression: {r['max_regression']})"
)
lines.append("\nRecommended actions:")
lines.append(" 1. Review the failing test cases in the quality report")
lines.append(" 2. Check if the change affected the specific intents listed above")
lines.append(" 3. If the change is correct, update the golden dataset and re-run")
return "\n".join(lines)
Regression Test Runner
class RegressionTestRunner:
"""Runs regression tests against golden dataset and generates quality reports.
Test suite is organized by priority:
- P1 (Core): 200 must-pass cases covering common user journeys
- P2 (Extended): 300 additional cases for comprehensive coverage
- P3 (Edge): 70 edge cases from known failure modes and bug reports
Triggers:
- Model version change: P1 + P2 + P3 (full suite)
- Prompt change: P1 + P2 for affected intents
- RAG KB update: P1 + P3 for affected content areas
- Nightly scheduled: P1 + P2 + P3 (full suite)
"""
def __init__(self, test_data_bucket: str = "manga-evaluation-data"):
self.s3 = boto3.client("s3", region_name="us-east-1")
self.bucket = test_data_bucket
def load_test_suite(
self,
priorities: list[str] = None,
intents: list[str] = None,
) -> list:
"""Load golden test cases from S3, optionally filtered by priority and intent."""
if priorities is None:
priorities = ["P1", "P2", "P3"]
all_cases = []
for priority in priorities:
key = f"golden-dataset/{priority}/test_cases.jsonl"
try:
response = self.s3.get_object(Bucket=self.bucket, Key=key)
for line in response["Body"].read().decode("utf-8").strip().split("\n"):
case = json.loads(line)
if intents is None or case.get("intent") in intents:
all_cases.append(case)
except Exception as e:
logger.error("Failed to load test suite %s: %s", key, e)
logger.info("Loaded %d test cases (priorities=%s, intents=%s)",
len(all_cases), priorities, intents)
return all_cases
def run_regression(
self,
change_type: str,
change_description: str,
affected_intents: list[str] = None,
quality_gate_evaluator: QualityGateEvaluator = None,
stage: GateStage = GateStage.CI,
) -> QualityGateResult:
"""Run regression tests appropriate for the change type."""
# Determine which test priorities to run
if change_type == "model_version":
priorities = ["P1", "P2", "P3"]
intents = None # All intents
elif change_type == "prompt_change":
priorities = ["P1", "P2"]
intents = affected_intents
elif change_type == "rag_update":
priorities = ["P1", "P3"]
intents = affected_intents
elif change_type == "nightly":
priorities = ["P1", "P2", "P3"]
intents = None
else:
priorities = ["P1"]
intents = affected_intents
test_cases = self.load_test_suite(priorities=priorities, intents=intents)
if not test_cases:
logger.warning("No test cases loaded for change_type=%s", change_type)
return QualityGateResult(
gate_id=str(uuid.uuid4()), stage=stage, passed=True,
overall_score=0.0, overall_pass_rate=0.0,
intent_results={}, failing_intents=[], failing_test_cases=[],
baseline_comparison={}, recommendation="No test cases available",
)
return quality_gate_evaluator.run_quality_gate(
stage=stage, test_cases=test_cases, change_description=change_description
)
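The acceptance criterion that RAG knowledge base updates trigger targeted regression tests can be wired on top of run_regression. A sketch under the assumption of a hand-maintained mapping from KB content areas to the intents that consume them; the mapping entries below are illustrative.

# Illustrative mapping from knowledge base content areas to consuming intents.
KB_AREA_TO_INTENTS = {
    "return_policy": ["faq", "return_request"],
    "shipping": ["faq", "order_tracking"],
    "promotions": ["promotion"],
}


def on_kb_update(changed_areas: list[str], runner: RegressionTestRunner,
                 gate_evaluator: QualityGateEvaluator):
    """Run the targeted P1 + P3 regression for intents affected by a KB update."""
    affected = sorted({i for area in changed_areas for i in KB_AREA_TO_INTENTS.get(area, [])})
    if not affected:
        return None  # no mapped intents, nothing to re-test
    return runner.run_regression(
        change_type="rag_update",
        change_description=f"KB update: {', '.join(changed_areas)}",
        affected_intents=affected,
        quality_gate_evaluator=gate_evaluator,
        stage=GateStage.STAGING,  # threshold set is a choice; staging floors used here
    )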
Continuous Production Evaluator
class ContinuousProductionEvaluator:
"""Samples production traffic and runs quality evaluation continuously.
Every hour:
1. Sample 1% of responses from the last hour (via Kinesis → S3)
2. Run quality evaluation on the sample
3. Publish metrics to CloudWatch
4. Alert if quality drops below rolling 7-day baseline
This catches:
- Model behavior drift (Bedrock updates the underlying model)
- RAG content degradation (stale or incorrect knowledge base entries)
- Traffic pattern shifts (new types of queries the model handles poorly)
"""
def __init__(
self,
sample_bucket: str = "manga-production-samples",
cloudwatch_namespace: str = "MangaAssist/ContinuousEval",
):
self.s3 = boto3.client("s3", region_name="us-east-1")
self.cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
self.bucket = sample_bucket
self.namespace = cloudwatch_namespace
def run_hourly_evaluation(self, quality_evaluator=None) -> dict:
"""Run quality evaluation on the last hour's production samples."""
samples = self._load_recent_samples(hours_back=1)
if not samples:
logger.info("No production samples available for evaluation")
return {"status": "no_samples"}
# Evaluate each sample
results_by_intent: dict[str, list[float]] = {}
for sample in samples:
intent = sample.get("intent", "unknown")
if quality_evaluator:
result = quality_evaluator.evaluate_from_dict(sample)
score = result.composite_score
else:
score = sample.get("pre_computed_score", 0.80)
results_by_intent.setdefault(intent, []).append(score)
# Publish per-intent metrics
metric_data = []
for intent, scores in results_by_intent.items():
avg_score = sum(scores) / len(scores)
metric_data.append({
"MetricName": "HourlyQualityScore",
"Dimensions": [{"Name": "Intent", "Value": intent}],
"Value": avg_score,
"Unit": "None",
})
metric_data.append({
"MetricName": "HourlySampleCount",
"Dimensions": [{"Name": "Intent", "Value": intent}],
"Value": len(scores),
"Unit": "Count",
})
self.cloudwatch.put_metric_data(
Namespace=self.namespace, MetricData=metric_data
)
# Check for degradation vs. rolling baseline
alerts = self._check_degradation(results_by_intent)
return {
"status": "evaluated",
"total_samples": len(samples),
"intents_evaluated": len(results_by_intent),
"alerts": alerts,
}
def _load_recent_samples(self, hours_back: int = 1) -> list[dict]:
"""Load production response samples from S3 (written by Kinesis Firehose)."""
now = time.time()
cutoff = now - (hours_back * 3600)
# S3 prefix follows Kinesis Firehose date partitioning
from datetime import datetime, timezone
ts = datetime.fromtimestamp(cutoff, tz=timezone.utc)
prefix = f"production-samples/{ts.strftime('%Y/%m/%d/%H')}/"
samples = []
try:
response = self.s3.list_objects_v2(Bucket=self.bucket, Prefix=prefix, MaxKeys=1000)
for obj in response.get("Contents", []):
data = self.s3.get_object(Bucket=self.bucket, Key=obj["Key"])
for line in data["Body"].read().decode("utf-8").strip().split("\n"):
if line:
samples.append(json.loads(line))
except Exception as e:
logger.error("Failed to load production samples: %s", e)
        # Cap the evaluation batch at 5,000 responses when hourly volume is very high
import random
if len(samples) > 5000:
samples = random.sample(samples, 5000)
return samples
def _check_degradation(self, results_by_intent: dict[str, list[float]]) -> list[str]:
"""Check if current scores are significantly below rolling 7-day baseline."""
alerts = []
for intent, scores in results_by_intent.items():
current_avg = sum(scores) / len(scores)
# Get 7-day rolling average from CloudWatch
try:
response = self.cloudwatch.get_metric_statistics(
Namespace=self.namespace,
MetricName="HourlyQualityScore",
Dimensions=[{"Name": "Intent", "Value": intent}],
StartTime=time.time() - (7 * 86400),
EndTime=time.time() - 3600, # Exclude current hour
Period=86400,
Statistics=["Average"],
)
datapoints = response.get("Datapoints", [])
if datapoints:
baseline_avg = sum(d["Average"] for d in datapoints) / len(datapoints)
delta = current_avg - baseline_avg
if delta < -0.03: # More than 3% drop
alerts.append(
f"DEGRADATION: {intent} current={current_avg:.3f} "
f"baseline={baseline_avg:.3f} delta={delta:+.3f}"
)
except Exception as e:
logger.error("Failed to retrieve baseline for %s: %s", intent, e)
return alerts
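The hourly loop from the sequence diagram is typically driven by a scheduler rather than a long-running process. A minimal sketch of a Lambda handler behind an hourly EventBridge rule; the handler wiring is an assumption, not part of the MangaAssist codebase.

continuous_evaluator = ContinuousProductionEvaluator()


def lambda_handler(event, context):
    """Invoked hourly by an EventBridge schedule; see run_hourly_evaluation above."""
    summary = continuous_evaluator.run_hourly_evaluation()
    if summary.get("alerts"):
        logger.warning("Continuous evaluation alerts: %s", summary["alerts"])
    return summary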
MangaAssist Scenarios
Scenario A: CI Quality Gate Blocks a Prompt Regression
Context: A developer modified the recommendation system prompt to be more concise, reducing token usage by 15%. The CI pipeline ran the regression suite before merge.
What Happened:
- CI quality gate evaluation results:
- recommendation: score=0.71 (threshold: 0.75) — FAILED
- product_question: score=0.83 (threshold: 0.78) — PASSED
- All other intents: PASSED (unaffected by prompt change)
- The concise prompt dropped the relevance sub-score from 0.86 to 0.68 — the model stopped including specific product attributes (author, volume count, genre tags) in recommendations
How Caught: The CI quality gate blocked the merge. The quality report showed 14 failing test cases, all in the recommendation intent, all with low relevance scores.
Fix: The developer kept the conciseness optimization but added explicit instructions to include product attributes. Second CI run: recommendation score=0.78 (passed). Token usage still reduced by 9%.
Metric Signal: MangaAssist/QualityGate.Result with dimensions Stage=ci, Intent=recommendation, Passed=false
Scenario B: Nightly Regression Detects RAG Knowledge Base Drift
Context: The nightly full regression suite (P1 + P2 + P3) ran at 2 AM. No code or prompt changes had been made in 3 days.
What Happened:
- faq intent: score dropped from 0.88 (baseline) to 0.81 (current) — delta = -0.07
- The regression was in 8 test cases related to return policy
- Root cause: the product catalog team updated the return policy page, and the RAG pipeline ingested the new content. But the new page had a formatting change that caused the chunking algorithm to split a key sentence across two chunks, making it unretrievable as a single fact
How Caught: The nightly regression checked against stored baselines. The -0.07 delta exceeded the max_regression_delta of 0.02 for the faq intent at production stage.
Fix: The RAG pipeline's chunking strategy was updated to preserve paragraph boundaries. The return policy content was re-indexed. Re-evaluation: faq score returned to 0.87.
Metric Signal: MangaAssist/QualityGate.IntentDelta with Intent=faq, value=-0.07; MangaAssist/QualityGate.Result with Passed=false
Scenario C: Canary Quality Gate Catches Model Behavior Shift
Context: Amazon Bedrock updated the underlying Claude 3.5 Sonnet model (minor version). MangaAssist pins model versions, but the team wanted to test the new version via canary.
What Happened:
- Canary at 5% traffic for 30 minutes:
- Quality score: 0.86 (baseline: 0.84) — improved
- Pass rate: 89% (baseline: 87%) — improved
- Latency P95: 2,050ms (baseline: 2,100ms) — improved
- Canary promoted to 25%:
- Quality score: 0.85 — stable
- But: order_tracking intent pass rate dropped to 78% (baseline: 92%)
- At 5% traffic, there were only 12 order_tracking requests — too few to detect the problem
How Caught: The canary quality gate at 25% had enough samples (150 order_tracking requests) to detect the regression. The gate fired a rollback alarm.
Root Cause: The new model version formatted order status differently — it used a narrative format instead of the structured bullet-point format. The quality evaluator's formatting sub-score caught this because the golden dataset expected bullet points.
Fix: Added format instructions to the order_tracking system prompt explicitly requesting bullet-point format. Re-deployed the canary with the updated prompt. Passed all stages.
Scenario D: Continuous Evaluation Catches Gradual Quality Erosion
Context: No deployments for 2 weeks. But continuous evaluation detected a slow quality decline.
What Happened:
- Week 1: recommendation quality avg = 0.84
- Week 2: recommendation quality avg = 0.82
- Week 3: recommendation quality avg = 0.79
- The 3% weekly drop was below the hourly alert threshold but showed a clear trend
How Caught: The weekly quality summary report (generated every Monday) flagged the 3-week downward trend. The report compared current week averages against the 4-week rolling average.
Root Cause: New manga titles were being added to the catalog, but the recommendation engine's product embedding index was only refreshed monthly. As the catalog grew, the embeddings became less representative of available products. Recommendations were still technically relevant to the user's interests but referenced older titles instead of new releases.
Fix: Changed the product embedding refresh from monthly to weekly. Added a "catalog freshness" metric: the percentage of recommendations that include products added in the last 30 days. Set a minimum freshness threshold of 20%.
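The catalog-freshness metric from this fix can be computed directly from the sampled recommendation responses. A hedged sketch, assuming each sample records the product IDs it surfaced (the field name is hypothetical) and that catalog add dates are available as a lookup:

from datetime import datetime, timedelta, timezone

MIN_FRESHNESS = 0.20  # alert if fewer than 20% of recommendations include recent titles


def catalog_freshness(recommendation_samples: list[dict],
                      added_at_by_product: dict[str, datetime],
                      window_days: int = 30) -> float:
    """Fraction of recommendation responses that include at least one recently added title."""
    if not recommendation_samples:
        return 0.0
    cutoff = datetime.now(timezone.utc) - timedelta(days=window_days)
    epoch = datetime.min.replace(tzinfo=timezone.utc)
    fresh = sum(
        1 for sample in recommendation_samples
        if any(added_at_by_product.get(pid, epoch) >= cutoff
               for pid in sample.get("recommended_product_ids", []))  # hypothetical field
    )
    return fresh / len(recommendation_samples)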
Intuition Gained
Quality Gates Must Match Change Scope
Running the full 500-case regression suite for every single-line prompt change is wasteful (45-minute CI run). Running only P1 for a model version change is dangerous (misses edge cases). The trigger-based test selection strategy — full suite for model changes, targeted suite for prompt/RAG changes — balances thoroughness with CI speed.
Continuous Evaluation Catches What Point-in-Time Tests Miss
Regression tests validate a snapshot. Production traffic changes continuously: new products, seasonal trends, new user behaviors. The MangaAssist catalog adds 200+ titles per week. A regression suite from 3 months ago does not test whether the model handles those new titles well. Continuous evaluation on live traffic is the only way to catch this class of drift.
Baselines Must Be Rolling, Not Fixed
Fixed baselines become stale as the system improves. If the production baseline is set at 0.80 and the team improves quality to 0.88, a regression to 0.83 still passes the gate. Rolling baselines (7-day average) ensure that quality gates adapt to the current performance level and catch any regression from the new normal.
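One way to make the stored baseline rolling rather than fixed is to derive it from the HISTORY# items that update_baseline already writes, instead of reading the single CURRENT item. A sketch under that assumption (7-day window, simple mean); the sort-key comparison is lexicographic, which matches numeric order because epoch-second timestamps are fixed-width.

import time
from typing import Optional

from boto3.dynamodb.conditions import Key


def rolling_baseline(baseline_table, intent: str, days: int = 7) -> Optional[float]:
    """Average of the baseline history entries written in the last `days` days, if any."""
    cutoff = int(time.time()) - days * 86400
    response = baseline_table.query(
        KeyConditionExpression=(
            Key("pk").eq(f"BASELINE#{intent}") & Key("sk").gt(f"HISTORY#{cutoff}")
        )
    )
    scores = [float(item["score"]) for item in response.get("Items", [])]
    return sum(scores) / len(scores) if scores else None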
References
- MangaAssist CI/CD Pipelines — Pipeline architecture
- Application Code Deployment — Deployment pipeline stages
- ML Model Deployment — Model deployment with canary
- FM Output Quality Assessment — Quality scoring used by gates
- Model Evaluation and Configuration — A/B and canary testing
- Monitoring and Observability — CloudWatch metrics pipeline