
04: Quality Assurance Processes

AIP-C01 Mapping

Content Domain 5: Testing, Validation, and Troubleshooting
Task 5.1: Implement evaluation systems for GenAI
Skill 5.1.4: Plan quality assurance processes (for example, continuous evaluation, regression testing, and quality gates for production deployments).


User Story

As a MangaAssist engineering lead, I want to enforce automated quality gates at every stage of the deployment pipeline — from code merge through canary release — with continuous evaluation in production and regression tests that catch quality degradation before customers see it, So that no model, prompt, or RAG content change reaches 100% traffic without passing objective quality thresholds.


Acceptance Criteria

  • Quality gate in CI/CD blocks deployment if evaluation score drops below threshold per intent
  • Regression test suite of 500+ golden test cases runs on every model/prompt change
  • Continuous evaluation samples 1% of production traffic every hour for quality monitoring
  • Quality trend dashboard shows rolling 24h quality scores per intent with alerting
  • Pre-production evaluation runs against staging environment before canary promotion
  • Prompt changes go through the same quality gate as model changes
  • RAG knowledge base updates trigger targeted regression tests for affected intents
  • Quality gate failures generate actionable reports with failing test cases and score deltas

Why Quality Assurance Is Different for GenAI

Traditional software QA: deterministic inputs produce deterministic outputs. If the test passes, the code is correct.

GenAI QA: the same input can produce different outputs depending on model version, temperature, prompt wording, RAG context freshness, and random seed. A "passing" evaluation today can "fail" tomorrow with no code change — because the underlying model was updated, or the knowledge base drifted, or the conversation context triggered a different generation path.

| Quality Risk | Traditional Software | MangaAssist GenAI |
| --- | --- | --- |
| Code change regression | Unit tests catch it | Model output is non-deterministic |
| Dependency update | Version pinning | Bedrock model versions change behavior |
| Data drift | Schema validation | RAG knowledge base content changes weekly |
| Configuration drift | Config-as-code | Prompt templates are code AND data |
| Environment parity | Docker containers | Model behavior differs at scale vs. single request |

This means MangaAssist needs continuous, probabilistic quality assurance — not just point-in-time tests.
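To make "probabilistic" concrete: instead of gating on a single generation, a gate can score the same input several times and require the distribution, not one lucky sample, to clear the bar. A minimal sketch, assuming a hypothetical score_response callable that runs one generation and returns a composite score:

import statistics

def probabilistic_gate(score_response, query: str, k: int = 5,
                       threshold: float = 0.75) -> bool:
    """Score the same query k times and gate on the distribution,
    not a single sample, since GenAI outputs are non-deterministic."""
    scores = [score_response(query) for _ in range(k)]
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores) if k > 1 else 0.0
    # Require the pessimistic estimate (mean minus one standard
    # deviation) to clear the threshold, not just the average run
    return (mean - stdev) >= threshold

Higher k gives a tighter estimate but multiplies evaluation cost, so repeated sampling is typically reserved for high-stakes intents.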


High-Level Design

Quality Gate Architecture

graph TD
    subgraph "Development"
        A[Code / Prompt / RAG Change] --> B[Pull Request]
        B --> C[CI Pipeline]
    end

    subgraph "CI Quality Gate"
        C --> D[Run Regression Suite<br>500+ golden test cases]
        D --> E[Score Against Thresholds]
        E --> F{All Intents Pass<br>Quality Gate?}
        F -->|Yes| G[Merge Allowed]
        F -->|No| H[Block Merge<br>Quality Report Generated]
    end

    subgraph "CD Quality Gate"
        G --> I[Deploy to Staging]
        I --> J[Full Evaluation Suite<br>Against Staging]
        J --> K{Staging Pass?}
        K -->|Yes| L[Canary Deploy 5%]
        K -->|No| M[Block Deploy<br>Staging Report]
    end

    subgraph "Production Quality Gate"
        L --> N[Canary Monitor<br>30 min window]
        N --> O{Canary Pass?}
        O -->|Yes| P[Progressive Rollout<br>25% → 50% → 100%]
        O -->|No| Q[Auto-Rollback]
    end

    subgraph "Continuous Evaluation"
        P --> R[1% Traffic Sampling<br>Every hour]
        R --> S[Quality Score Computation]
        S --> T{Score Drift<br>Detected?}
        T -->|Yes| U[Degradation Alert]
        T -->|No| V[Continue Monitoring]
    end
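The production stage of this diagram is mechanical enough to sketch. The following is a hedged illustration of the progressive rollout loop; set_traffic_weight, check_canary_quality, and rollback are hypothetical integration points, not real MangaAssist APIs:

import time

ROLLOUT_STEPS = [5, 25, 50, 100]   # traffic percentages from the diagram
CANARY_WINDOW_SECONDS = 30 * 60    # 30-minute observation window per step

def progressive_rollout(set_traffic_weight, check_canary_quality, rollback) -> bool:
    """Walk the 5% -> 25% -> 50% -> 100% rollout, gating each step.

    The three callables are hypothetical hooks:
    - set_traffic_weight(pct): shift traffic to the new version
    - check_canary_quality(pct): run the quality gate on canary traffic
    - rollback(): restore 100% traffic to the previous version
    """
    for pct in ROLLOUT_STEPS:
        set_traffic_weight(pct)
        time.sleep(CANARY_WINDOW_SECONDS)  # let the monitoring window fill
        if not check_canary_quality(pct):
            rollback()                     # auto-rollback on gate failure
            return False
    return True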

Regression Test Suite Structure

graph TD
    subgraph "Golden Dataset"
        A[Test Cases by Intent] --> A1[recommendation: 80 cases]
        A --> A2[product_question: 70 cases]
        A --> A3[faq: 60 cases]
        A --> A4[order_tracking: 50 cases]
        A --> A5[return_request: 40 cases]
        A --> A6[promotion: 40 cases]
        A --> A7[checkout_help: 35 cases]
        A --> A8[chitchat: 30 cases]
        A --> A9[escalation: 25 cases]
        A --> A10[edge_cases: 70 cases<br>Multi-turn, ambiguous, adversarial]
    end

    subgraph "Test Categories"
        B1[Core Regression<br>Must-pass cases<br>200 cases] --> C[Priority 1]
        B2[Extended Regression<br>Comprehensive coverage<br>300 cases] --> D[Priority 2]
        B3[Edge Case Suite<br>Known failure modes<br>70 cases] --> E[Priority 3]
    end

    subgraph "Triggers"
        F1[Model Version Change] --> G[Full Suite: P1 + P2 + P3]
        F2[Prompt Template Change] --> H[Affected Intent: P1 + P2]
        F3[RAG KB Update] --> I[Affected Content: P1 + P3]
        F4[Nightly Scheduled] --> J[Full Suite: P1 + P2 + P3]
    end

Continuous Evaluation Pipeline

sequenceDiagram
    participant Prod as Production Traffic
    participant Sampler as 1% Sampler
    participant Evaluator as Quality Evaluator
    participant Store as Redshift
    participant Monitor as CloudWatch
    participant Alert as Alert System

    loop Every hour
        Prod->>Sampler: Sample 1% of responses
        Sampler->>Evaluator: Batch evaluate samples
        Evaluator->>Evaluator: Score each sample<br>(relevance, accuracy, consistency, fluency)
        Evaluator->>Store: Store detailed scores
        Evaluator->>Monitor: Publish hourly aggregates

        Monitor->>Monitor: Compare to rolling baseline

        alt Score drop > 3% vs 7-day average
            Monitor->>Alert: Quality degradation alert
            Alert->>Alert: Page on-call engineer
        end

        alt Score drop > 5% for 3 consecutive hours
            Monitor->>Alert: Sustained degradation — auto-investigation
        end
    end
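The sustained-degradation branch (a drop of more than 5% for 3 consecutive hours) is not implemented in the low-level design below, so here is a hedged sketch of how it could be checked against the CloudWatch metrics the continuous evaluator publishes; the baseline value is assumed to be supplied by the caller:

import time
import boto3

def sustained_degradation(intent: str, baseline: float,
                          namespace: str = "MangaAssist/ContinuousEval") -> bool:
    """Return True if the hourly quality score sat more than 0.05 (5%)
    below baseline for each of the last 3 consecutive hours."""
    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
    response = cloudwatch.get_metric_statistics(
        Namespace=namespace,
        MetricName="HourlyQualityScore",
        Dimensions=[{"Name": "Intent", "Value": intent}],
        StartTime=time.time() - 3 * 3600,
        EndTime=time.time(),
        Period=3600,
        Statistics=["Average"],
    )
    points = sorted(response.get("Datapoints", []), key=lambda d: d["Timestamp"])
    # All three hourly datapoints must exist and each must be more
    # than 0.05 below the rolling baseline
    return (len(points) >= 3
            and all(p["Average"] < baseline - 0.05 for p in points[-3:]))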

Low-Level Design

Quality Gate Configuration

from dataclasses import dataclass, field
from enum import Enum
import time


class GateStage(Enum):
    CI = "ci"                      # Pre-merge
    STAGING = "staging"            # Pre-canary
    CANARY = "canary"              # During canary rollout
    PRODUCTION = "production"      # Continuous monitoring


class GateAction(Enum):
    BLOCK = "block"                # Fail the pipeline
    WARN = "warn"                  # Allow but notify
    LOG = "log"                    # Record only


@dataclass
class QualityThreshold:
    """Quality threshold for a specific intent at a specific stage."""
    intent: str
    stage: GateStage
    min_composite_score: float
    min_pass_rate: float           # % of test cases that must pass
    max_regression_delta: float    # Maximum allowed drop from baseline
    action_on_fail: GateAction = GateAction.BLOCK
    min_test_cases: int = 20       # Minimum cases needed for valid evaluation


# MangaAssist quality thresholds per intent and stage: CI is the most relaxed,
# staging/canary set the highest score bars, and production enforces the
# tightest regression deltas
QUALITY_THRESHOLDS: dict[str, dict[str, QualityThreshold]] = {
    "recommendation": {
        "ci": QualityThreshold("recommendation", GateStage.CI, 0.75, 0.80, 0.05),
        "staging": QualityThreshold("recommendation", GateStage.STAGING, 0.78, 0.85, 0.03),
        "canary": QualityThreshold("recommendation", GateStage.CANARY, 0.78, 0.85, 0.05),
        "production": QualityThreshold("recommendation", GateStage.PRODUCTION, 0.75, 0.80, 0.03),
    },
    "product_question": {
        "ci": QualityThreshold("product_question", GateStage.CI, 0.78, 0.85, 0.05),
        "staging": QualityThreshold("product_question", GateStage.STAGING, 0.80, 0.88, 0.03),
        "canary": QualityThreshold("product_question", GateStage.CANARY, 0.80, 0.88, 0.05),
        "production": QualityThreshold("product_question", GateStage.PRODUCTION, 0.78, 0.85, 0.03),
    },
    "faq": {
        "ci": QualityThreshold("faq", GateStage.CI, 0.80, 0.88, 0.03),
        "staging": QualityThreshold("faq", GateStage.STAGING, 0.82, 0.90, 0.02),
        "canary": QualityThreshold("faq", GateStage.CANARY, 0.82, 0.90, 0.03),
        "production": QualityThreshold("faq", GateStage.PRODUCTION, 0.80, 0.88, 0.02),
    },
    "order_tracking": {
        "ci": QualityThreshold("order_tracking", GateStage.CI, 0.85, 0.90, 0.03),
        "staging": QualityThreshold("order_tracking", GateStage.STAGING, 0.88, 0.92, 0.02),
        "canary": QualityThreshold("order_tracking", GateStage.CANARY, 0.88, 0.92, 0.03),
        "production": QualityThreshold("order_tracking", GateStage.PRODUCTION, 0.85, 0.90, 0.02),
    },
    "chitchat": {
        "ci": QualityThreshold("chitchat", GateStage.CI, 0.65, 0.75, 0.10),
        "staging": QualityThreshold("chitchat", GateStage.STAGING, 0.68, 0.78, 0.08),
        "canary": QualityThreshold("chitchat", GateStage.CANARY, 0.68, 0.78, 0.10),
        "production": QualityThreshold("chitchat", GateStage.PRODUCTION, 0.65, 0.75, 0.08),
    },
}


@dataclass
class QualityGateResult:
    """Result of running a quality gate evaluation."""
    gate_id: str
    stage: GateStage
    passed: bool
    overall_score: float
    overall_pass_rate: float
    intent_results: dict[str, dict]     # intent -> {score, pass_rate, passed, delta}
    failing_intents: list[str]
    failing_test_cases: list[dict]      # Detailed failures for debugging
    baseline_comparison: dict           # Comparison with previous baseline
    recommendation: str
    timestamp: float = field(default_factory=time.time)

Quality Gate Evaluator

import json
import logging
import time
import uuid
from typing import Optional

import boto3

logger = logging.getLogger(__name__)


class QualityGateEvaluator:
    """Evaluates model/prompt/RAG changes against quality thresholds.

    Used at every stage of the MangaAssist deployment pipeline:
    1. CI: blocks merge if quality drops below threshold
    2. Staging: blocks canary deployment if staging eval fails
    3. Canary: auto-rollback if canary quality degrades
    4. Production: continuous quality monitoring with alerting
    """

    def __init__(
        self,
        quality_evaluator=None,     # FMOutputQualityEvaluator from Skill 5.1.1
        baseline_store: str = "manga_quality_baselines",
    ):
        self.quality_evaluator = quality_evaluator
        self.dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
        self.baseline_table = self.dynamodb.Table(baseline_store)

    def run_quality_gate(
        self,
        stage: GateStage,
        test_cases: list,
        change_description: str = "",
    ) -> QualityGateResult:
        """Run the quality gate for a specific deployment stage."""
        gate_id = str(uuid.uuid4())
        logger.info("Running quality gate %s at stage=%s (%d test cases)",
                     gate_id, stage.value, len(test_cases))

        # Group test cases by intent
        by_intent: dict[str, list] = {}
        for tc in test_cases:
            intent = tc.intent.value if hasattr(tc.intent, 'value') else tc.intent
            by_intent.setdefault(intent, []).append(tc)

        # Evaluate each intent against its threshold
        intent_results = {}
        failing_intents = []
        failing_test_cases = []
        all_scores = []

        for intent, cases in by_intent.items():
            threshold = self._get_threshold(intent, stage)
            if threshold is None:
                logger.warning("No threshold defined for intent=%s stage=%s, using defaults", intent, stage.value)
                threshold = QualityThreshold(intent, stage, 0.70, 0.75, 0.10, GateAction.WARN)

            if len(cases) < threshold.min_test_cases:
                logger.warning("Only %d test cases for %s (minimum: %d)",
                             len(cases), intent, threshold.min_test_cases)

            # Evaluate all cases for this intent
            intent_scores = []
            intent_passes = 0
            for tc in cases:
                if self.quality_evaluator:
                    result = self.quality_evaluator.evaluate(tc, tc.expected_response)
                    score = result.composite_score
                    passed = result.passed
                else:
                    score = 0.80  # Placeholder used when no evaluator is injected (dry runs)
                    passed = True

                intent_scores.append(score)
                all_scores.append(score)
                if passed:
                    intent_passes += 1
                else:
                    failing_test_cases.append({
                        "test_id": tc.test_id,
                        "intent": intent,
                        "query": tc.query,
                        "score": score,
                        "threshold": threshold.min_composite_score,
                    })

            avg_score = sum(intent_scores) / len(intent_scores) if intent_scores else 0
            pass_rate = intent_passes / len(cases) if cases else 0

            # Compare with baseline
            baseline = self._get_baseline(intent)
            delta = avg_score - baseline if baseline is not None else 0

            intent_passed = (
                avg_score >= threshold.min_composite_score
                and pass_rate >= threshold.min_pass_rate
                and (baseline is None or delta >= -threshold.max_regression_delta)
            )

            intent_results[intent] = {
                "avg_score": round(avg_score, 4),
                "pass_rate": round(pass_rate, 4),
                "total_cases": len(cases),
                "passed_cases": intent_passes,
                "baseline": baseline,
                "delta": round(delta, 4),
                "threshold_score": threshold.min_composite_score,
                "threshold_pass_rate": threshold.min_pass_rate,
                "max_regression": threshold.max_regression_delta,
                "passed": intent_passed,
                "action": threshold.action_on_fail.value,
            }

            if not intent_passed:
                failing_intents.append(intent)

        # Overall gate decision
        overall_score = sum(all_scores) / len(all_scores) if all_scores else 0
        # Informational headline numbers only: 0.75 is a fixed report-level
        # reference; the actual gate decision uses per-intent thresholds below
        passed_count = sum(1 for s in all_scores if s >= 0.75)
        overall_pass_rate = passed_count / len(all_scores) if all_scores else 0

        # Gate passes only if all BLOCK-action intents pass
        blocking_failures = [
            i for i in failing_intents
            if intent_results[i]["action"] == "block"
        ]
        gate_passed = len(blocking_failures) == 0

        recommendation = self._generate_recommendation(
            gate_passed, failing_intents, intent_results, change_description
        )

        result = QualityGateResult(
            gate_id=gate_id,
            stage=stage,
            passed=gate_passed,
            overall_score=round(overall_score, 4),
            overall_pass_rate=round(overall_pass_rate, 4),
            intent_results=intent_results,
            failing_intents=failing_intents,
            failing_test_cases=failing_test_cases[:20],  # First 20 failures (report truncated)
            baseline_comparison={
                intent: {"baseline": r.get("baseline"), "delta": r.get("delta")}
                for intent, r in intent_results.items()
            },
            recommendation=recommendation,
        )

        logger.info("Quality gate %s: %s (score=%.4f, pass_rate=%.2f%%)",
                     gate_id, "PASSED" if gate_passed else "FAILED",
                     overall_score, overall_pass_rate * 100)

        return result

    def _get_threshold(self, intent: str, stage: GateStage) -> Optional[QualityThreshold]:
        """Retrieve the quality threshold for an intent and stage."""
        intent_thresholds = QUALITY_THRESHOLDS.get(intent, {})
        return intent_thresholds.get(stage.value)

    def _get_baseline(self, intent: str) -> Optional[float]:
        """Get the current production baseline score for an intent."""
        try:
            response = self.baseline_table.get_item(
                Key={"pk": f"BASELINE#{intent}", "sk": "CURRENT"}
            )
            item = response.get("Item")
            return float(item["score"]) if item else None
        except Exception as e:
            logger.warning("Baseline lookup failed for %s: %s", intent, e)
            return None

    def update_baseline(self, intent: str, score: float) -> None:
        """Update the baseline score after a successful production deployment."""
        self.baseline_table.put_item(Item={
            "pk": f"BASELINE#{intent}",
            "sk": "CURRENT",
            "score": str(score),
            "updated_at": int(time.time()),
        })
        # Also store historical baseline
        self.baseline_table.put_item(Item={
            "pk": f"BASELINE#{intent}",
            "sk": f"HISTORY#{int(time.time())}",
            "score": str(score),
        })

    def _generate_recommendation(
        self,
        passed: bool,
        failing_intents: list[str],
        intent_results: dict,
        change_description: str,
    ) -> str:
        """Generate actionable recommendation based on gate results."""
        if passed:
            return f"Quality gate PASSED. All intents within thresholds. Safe to proceed with: {change_description}"

        lines = [f"Quality gate FAILED for change: {change_description}\n"]
        lines.append("Failing intents:\n")
        for intent in failing_intents:
            r = intent_results[intent]
            lines.append(
                f"  - {intent}: score={r['avg_score']:.3f} (need {r['threshold_score']}), "
                f"pass_rate={r['pass_rate']:.1%} (need {r['threshold_pass_rate']:.1%}), "
                f"delta={r['delta']:+.3f} (max regression: {r['max_regression']})"
            )
        lines.append("\nRecommended actions:")
        lines.append("  1. Review the failing test cases in the quality report")
        lines.append("  2. Check if the change affected the specific intents listed above")
        lines.append("  3. If the change is correct, update the golden dataset and re-run")
        return "\n".join(lines)

Regression Test Runner

from types import SimpleNamespace
from typing import Optional


class RegressionTestRunner:
    """Runs regression tests against golden dataset and generates quality reports.

    Test suite is organized by priority:
    - P1 (Core): 200 must-pass cases covering common user journeys
    - P2 (Extended): 300 additional cases for comprehensive coverage
    - P3 (Edge): 70 edge cases from known failure modes and bug reports

    Triggers:
    - Model version change: P1 + P2 + P3 (full suite)
    - Prompt change: P1 + P2 for affected intents
    - RAG KB update: P1 + P3 for affected content areas
    - Nightly scheduled: P1 + P2 + P3 (full suite)
    """

    def __init__(self, test_data_bucket: str = "manga-evaluation-data"):
        self.s3 = boto3.client("s3", region_name="us-east-1")
        self.bucket = test_data_bucket

    def load_test_suite(
        self,
        priorities: Optional[list[str]] = None,
        intents: Optional[list[str]] = None,
    ) -> list:
        """Load golden test cases from S3, optionally filtered by priority and intent."""
        if priorities is None:
            priorities = ["P1", "P2", "P3"]

        all_cases = []
        for priority in priorities:
            key = f"golden-dataset/{priority}/test_cases.jsonl"
            try:
                response = self.s3.get_object(Bucket=self.bucket, Key=key)
                for line in response["Body"].read().decode("utf-8").strip().split("\n"):
                    case = json.loads(line)
                    if intents is None or case.get("intent") in intents:
                        # Wrap records so the gate evaluator's attribute access
                        # (tc.intent, tc.query, tc.expected_response) works on JSONL rows
                        all_cases.append(SimpleNamespace(**case))
            except Exception as e:
                logger.error("Failed to load test suite %s: %s", key, e)

        logger.info("Loaded %d test cases (priorities=%s, intents=%s)",
                     len(all_cases), priorities, intents)
        return all_cases

    def run_regression(
        self,
        change_type: str,
        change_description: str,
        affected_intents: Optional[list[str]] = None,
        quality_gate_evaluator: Optional[QualityGateEvaluator] = None,
        stage: GateStage = GateStage.CI,
    ) -> QualityGateResult:
        """Run regression tests appropriate for the change type."""
        # Determine which test priorities to run
        if change_type == "model_version":
            priorities = ["P1", "P2", "P3"]
            intents = None  # All intents
        elif change_type == "prompt_change":
            priorities = ["P1", "P2"]
            intents = affected_intents
        elif change_type == "rag_update":
            priorities = ["P1", "P3"]
            intents = affected_intents
        elif change_type == "nightly":
            priorities = ["P1", "P2", "P3"]
            intents = None
        else:
            priorities = ["P1"]
            intents = affected_intents

        test_cases = self.load_test_suite(priorities=priorities, intents=intents)

        if not test_cases:
            logger.warning("No test cases loaded for change_type=%s", change_type)
            return QualityGateResult(
                gate_id=str(uuid.uuid4()), stage=stage, passed=True,
                overall_score=0.0, overall_pass_rate=0.0,
                intent_results={}, failing_intents=[], failing_test_cases=[],
                baseline_comparison={}, recommendation="No test cases available",
            )

        return quality_gate_evaluator.run_quality_gate(
            stage=stage, test_cases=test_cases, change_description=change_description
        )
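To show how the runner and the gate evaluator fit into CI, here is a minimal entrypoint sketch. The environment variable names (CHANGE_TYPE, CHANGE_DESCRIPTION, AFFECTED_INTENTS) are assumptions rather than an established contract, and a real pipeline would inject FMOutputQualityEvaluator instead of None:

import os
import sys


def main() -> int:
    """CI entrypoint sketch: run the regression suite for the current
    change and fail the build (non-zero exit) when the gate blocks."""
    intents_env = os.environ.get("AFFECTED_INTENTS", "")  # assumed env var
    affected = [i for i in intents_env.split(",") if i] or None

    runner = RegressionTestRunner()
    gate = QualityGateEvaluator(quality_evaluator=None)  # wire the real evaluator here

    result = runner.run_regression(
        change_type=os.environ.get("CHANGE_TYPE", "prompt_change"),
        change_description=os.environ.get("CHANGE_DESCRIPTION", ""),
        affected_intents=affected,
        quality_gate_evaluator=gate,
        stage=GateStage.CI,
    )
    print(result.recommendation)
    return 0 if result.passed else 1


if __name__ == "__main__":
    sys.exit(main())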

Continuous Production Evaluator

class ContinuousProductionEvaluator:
    """Samples production traffic and runs quality evaluation continuously.

    Every hour:
    1. Sample 1% of responses from the last hour (via Kinesis → S3)
    2. Run quality evaluation on the sample
    3. Publish metrics to CloudWatch
    4. Alert if quality drops below rolling 7-day baseline

    This catches:
    - Model behavior drift (Bedrock updates the underlying model)
    - RAG content degradation (stale or incorrect knowledge base entries)
    - Traffic pattern shifts (new types of queries the model handles poorly)
    """

    def __init__(
        self,
        sample_bucket: str = "manga-production-samples",
        cloudwatch_namespace: str = "MangaAssist/ContinuousEval",
    ):
        self.s3 = boto3.client("s3", region_name="us-east-1")
        self.cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
        self.bucket = sample_bucket
        self.namespace = cloudwatch_namespace

    def run_hourly_evaluation(self, quality_evaluator=None) -> dict:
        """Run quality evaluation on the last hour's production samples."""
        samples = self._load_recent_samples(hours_back=1)
        if not samples:
            logger.info("No production samples available for evaluation")
            return {"status": "no_samples"}

        # Evaluate each sample
        results_by_intent: dict[str, list[float]] = {}
        for sample in samples:
            intent = sample.get("intent", "unknown")
            if quality_evaluator:
                result = quality_evaluator.evaluate_from_dict(sample)
                score = result.composite_score
            else:
                score = sample.get("pre_computed_score", 0.80)

            results_by_intent.setdefault(intent, []).append(score)

        # Publish per-intent metrics
        metric_data = []
        for intent, scores in results_by_intent.items():
            avg_score = sum(scores) / len(scores)
            metric_data.append({
                "MetricName": "HourlyQualityScore",
                "Dimensions": [{"Name": "Intent", "Value": intent}],
                "Value": avg_score,
                "Unit": "None",
            })
            metric_data.append({
                "MetricName": "HourlySampleCount",
                "Dimensions": [{"Name": "Intent", "Value": intent}],
                "Value": len(scores),
                "Unit": "Count",
            })

        self.cloudwatch.put_metric_data(
            Namespace=self.namespace, MetricData=metric_data
        )

        # Check for degradation vs. rolling baseline
        alerts = self._check_degradation(results_by_intent)

        return {
            "status": "evaluated",
            "total_samples": len(samples),
            "intents_evaluated": len(results_by_intent),
            "alerts": alerts,
        }

    def _load_recent_samples(self, hours_back: int = 1) -> list[dict]:
        """Load production response samples from S3 (written by Kinesis Firehose)."""
        now = time.time()
        cutoff = now - (hours_back * 3600)

        # S3 prefix follows Kinesis Firehose date partitioning
        from datetime import datetime, timezone
        ts = datetime.fromtimestamp(cutoff, tz=timezone.utc)
        prefix = f"production-samples/{ts.strftime('%Y/%m/%d/%H')}/"

        samples = []
        try:
            response = self.s3.list_objects_v2(Bucket=self.bucket, Prefix=prefix, MaxKeys=1000)
            for obj in response.get("Contents", []):
                data = self.s3.get_object(Bucket=self.bucket, Key=obj["Key"])
                for line in data["Body"].read().decode("utf-8").strip().split("\n"):
                    if line:
                        samples.append(json.loads(line))
        except Exception as e:
            logger.error("Failed to load production samples: %s", e)

        # The 1% sampling happens upstream (Kinesis -> S3, per the class
        # docstring); this cap only bounds evaluation cost at very high volume
        import random
        if len(samples) > 5000:
            samples = random.sample(samples, 5000)

        return samples

    def _check_degradation(self, results_by_intent: dict[str, list[float]]) -> list[str]:
        """Check if current scores are significantly below rolling 7-day baseline."""
        alerts = []

        for intent, scores in results_by_intent.items():
            current_avg = sum(scores) / len(scores)

            # Get 7-day rolling average from CloudWatch
            try:
                response = self.cloudwatch.get_metric_statistics(
                    Namespace=self.namespace,
                    MetricName="HourlyQualityScore",
                    Dimensions=[{"Name": "Intent", "Value": intent}],
                    StartTime=time.time() - (7 * 86400),
                    EndTime=time.time() - 3600,  # Exclude current hour
                    Period=86400,
                    Statistics=["Average"],
                )
                datapoints = response.get("Datapoints", [])
                if datapoints:
                    baseline_avg = sum(d["Average"] for d in datapoints) / len(datapoints)
                    delta = current_avg - baseline_avg
                    if delta < -0.03:  # More than 3% drop
                        alerts.append(
                            f"DEGRADATION: {intent} current={current_avg:.3f} "
                            f"baseline={baseline_avg:.3f} delta={delta:+.3f}"
                        )
            except Exception as e:
                logger.error("Failed to retrieve baseline for %s: %s", intent, e)

        return alerts
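Something has to invoke this on the hour. One common pattern, sketched below as an assumption rather than taken from the MangaAssist codebase, is an EventBridge rule on a rate(1 hour) schedule invoking a Lambda handler:

# Module-level so client connections are reused across warm invocations
evaluator = ContinuousProductionEvaluator()


def lambda_handler(event, context):
    """Hourly entrypoint, e.g. triggered by an EventBridge rate(1 hour) rule."""
    result = evaluator.run_hourly_evaluation(quality_evaluator=None)
    for alert in result.get("alerts", []):
        logger.warning(alert)  # paging is driven by CloudWatch alarms, not here
    return result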

MangaAssist Scenarios

Scenario A: CI Quality Gate Blocks a Prompt Regression

Context: A developer modified the recommendation system prompt to be more concise, reducing token usage by 15%. The CI pipeline ran the regression suite before merge.

What Happened:

  • CI quality gate evaluation results:
      • recommendation: score=0.71 (threshold: 0.75) — FAILED
      • product_question: score=0.83 (threshold: 0.78) — PASSED
      • All other intents: PASSED (unaffected by prompt change)
  • The concise prompt dropped the relevance sub-score from 0.86 to 0.68 — the model stopped including specific product attributes (author, volume count, genre tags) in recommendations

How Caught: The CI quality gate blocked the merge. The quality report showed 14 failing test cases, all in the recommendation intent, all with low relevance scores.

Fix: The developer kept the conciseness optimization but added explicit instructions to include product attributes. Second CI run: recommendation score=0.78 (passed). Token usage still reduced by 9%.

Metric Signal: MangaAssist/QualityGate.Result with dimension Stage=ci, Intent=recommendation, Passed=false
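The low-level design above never shows where these Metric Signal datapoints are emitted. One plausible sketch, with the dimension encoding as an assumption:

import boto3


def publish_gate_result(result: QualityGateResult) -> None:
    """Publish one datapoint per intent so signals like
    MangaAssist/QualityGate.Result (Stage/Intent/Passed) exist in CloudWatch."""
    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
    cloudwatch.put_metric_data(
        Namespace="MangaAssist/QualityGate",
        MetricData=[
            {
                "MetricName": "Result",
                "Dimensions": [
                    {"Name": "Stage", "Value": result.stage.value},
                    {"Name": "Intent", "Value": intent},
                    {"Name": "Passed", "Value": str(r["passed"]).lower()},
                ],
                "Value": r["avg_score"],
                "Unit": "None",
            }
            for intent, r in result.intent_results.items()
        ],
    )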

Scenario B: Nightly Regression Detects RAG Knowledge Base Drift

Context: The nightly full regression suite (P1 + P2 + P3) ran at 2 AM. No code or prompt changes had been made in 3 days.

What Happened:

  • faq intent: score dropped from 0.88 (baseline) to 0.81 (current) — delta = -0.07
  • The regression was in 8 test cases related to return policy
  • Root cause: the product catalog team updated the return policy page, and the RAG pipeline ingested the new content. But the new page had a formatting change that caused the chunking algorithm to split a key sentence across two chunks, making it unretrievable as a single fact

How Caught: The nightly regression checked against stored baselines. The -0.07 delta exceeded the max_regression_delta of 0.02 for the faq intent at production stage.

Fix: The RAG pipeline's chunking strategy was updated to preserve paragraph boundaries. The return policy content was re-indexed. Re-evaluation: faq score returned to 0.87.
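For illustration, a simplified sketch of a paragraph-preserving chunker along the lines of this fix; the real pipeline's chunker and size limits are not shown in this section:

def chunk_preserving_paragraphs(text: str, max_chars: int = 1200) -> list[str]:
    """Split on blank lines and pack whole paragraphs into chunks, so a
    policy sentence is never split across two retrieval units."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # Start a new chunk rather than splitting a paragraph mid-sentence
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks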

Metric Signal: MangaAssist/QualityGate.IntentDelta with Intent=faq, value=-0.07; MangaAssist/QualityGate.Result with Passed=false

Scenario C: Canary Quality Gate Catches Model Behavior Shift

Context: Amazon Bedrock updated the underlying Claude 3.5 Sonnet model (minor version). MangaAssist pins model versions, but the team wanted to test the new version via canary.

What Happened:

  • Canary at 5% traffic for 30 minutes:
      • Quality score: 0.86 (baseline: 0.84) — improved
      • Pass rate: 89% (baseline: 87%) — improved
      • Latency P95: 2,050ms (baseline: 2,100ms) — improved
  • Canary promoted to 25%:
      • Quality score: 0.85 — stable
      • But: order_tracking intent pass rate dropped to 78% (baseline: 92%)
  • At 5% traffic, there were only 12 order_tracking requests — too few to detect the problem

How Caught: The canary quality gate at 25% had enough samples (150 order_tracking requests) to detect the regression. The gate fired a rollback alarm.

Root Cause: The new model version formatted order status differently — it used a narrative format instead of the structured bullet-point format. The quality evaluator's formatting sub-score caught this because the golden dataset expected bullet points.

Fix: Added format instructions to the order_tracking system prompt explicitly requesting bullet-point format. Re-deployed the canary with the updated prompt. Passed all stages.

Scenario D: Continuous Evaluation Catches Gradual Quality Erosion

Context: No deployments for 2 weeks. But continuous evaluation detected a slow quality decline.

What Happened:

  • Week 1: recommendation quality avg = 0.84
  • Week 2: recommendation quality avg = 0.82
  • Week 3: recommendation quality avg = 0.79
  • The roughly 3% weekly drop was below the hourly alert threshold but showed a clear trend

How Caught: The weekly quality summary report (generated every Monday) flagged the 3-week downward trend. The report compared current week averages against the 4-week rolling average.

Root Cause: New manga titles were being added to the catalog, but the recommendation engine's product embedding index was only refreshed monthly. As the catalog grew, the embeddings became less representative of available products. Recommendations were still technically relevant to the user's interests but referenced older titles instead of new releases.

Fix: Changed the product embedding refresh from monthly to weekly. Added a "catalog freshness" metric: the percentage of recommendations that include products added in the last 30 days. Set a minimum freshness threshold of 20%.
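A hedged sketch of that freshness metric; the added_at field and product schema are assumptions for illustration:

from datetime import datetime, timedelta, timezone

FRESHNESS_WINDOW_DAYS = 30
MIN_FRESHNESS = 0.20   # at least 20% of recommended titles should be recent


def catalog_freshness(recommended: list[dict]) -> float:
    """Fraction of recommended products added in the last 30 days.
    Each product dict is assumed to carry a timezone-aware ISO
    'added_at' timestamp (hypothetical schema)."""
    if not recommended:
        return 0.0
    cutoff = datetime.now(timezone.utc) - timedelta(days=FRESHNESS_WINDOW_DAYS)
    recent = sum(
        1 for p in recommended
        if datetime.fromisoformat(p["added_at"]) >= cutoff
    )
    return recent / len(recommended)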


Intuition Gained

Quality Gates Must Match Change Scope

Running the full 500-case regression suite for every single-line prompt change is wasteful (45-minute CI run). Running only P1 for a model version change is dangerous (misses edge cases). The trigger-based test selection strategy — full suite for model changes, targeted suite for prompt/RAG changes — balances thoroughness with CI speed.

Continuous Evaluation Catches What Point-in-Time Tests Miss

Regression tests validate a snapshot. Production traffic changes continuously: new products, seasonal trends, new user behaviors. The MangaAssist catalog adds 200+ titles per week. A regression suite from 3 months ago does not test whether the model handles those new titles well. Continuous evaluation on live traffic is the only way to catch this class of drift.

Baselines Must Be Rolling, Not Fixed

Fixed baselines become stale as the system improves. If the production baseline is set at 0.80 and the team improves quality to 0.88, a regression to 0.83 still passes the gate. Rolling baselines (7-day average) ensure that quality gates adapt to the current performance level and catch any regression from the new normal.

