
03: Prompt Engineering Troubleshooting

AIP-C01 Mapping

Task 5.2 → Skill 5.2.3: Systematically evaluate and fix prompt engineering problems using testing frameworks, version comparison, and iterative refinement workflows.


User Story

As a prompt engineer on the MangaAssist team, I want to systematically diagnose and fix prompt quality issues through automated testing, version comparison, and data-driven refinement, So that prompt changes are validated before production deployment, regressions are caught early, and prompt quality improves measurably over time.


Acceptance Criteria

  • Every prompt template has a golden test set of ≥ 20 input/expected-output pairs per intent
  • Prompt changes trigger an automated evaluation pipeline that scores against the golden set
  • Score regression > 5% on any metric blocks deployment with a detailed diff report
  • A/B testing framework compares two prompt versions with statistical significance (p < 0.05)
  • Prompt refinement follows a documented hypothesis → test → measure → iterate cycle
  • All prompt versions are stored with metadata (author, change reason, test scores) for audit
  • Mean time to diagnose prompt quality issues < 30 minutes using the diagnostic workflow

High-Level Design

Prompt Quality Failure Taxonomy

graph TD
    A[Prompt Quality<br>Failure] --> B[Hallucination]
    A --> C[Format Drift]
    A --> D[Instruction Following<br>Degradation]
    A --> E[Tone Inconsistency]
    A --> F[Context Sensitivity<br>Failures]

    B --> B1[Fabricated ASINs<br>or product details]
    B --> B2[Invented policy<br>rules or coupons]
    B --> B3[Confident wrong<br>answer to FAQ]

    C --> C1[JSON schema<br>changes unprompted]
    C --> C2[Markdown formatting<br>breaks UI parser]
    C --> C3[Extra fields appear<br>in structured output]

    D --> D1[Ignores system guardrails<br>e.g. competitor mentions]
    D --> D2[Stops using product<br>recommendation format]
    D --> D3[Fails to escalate<br>when instructed]

    E --> E1[Overly casual<br>for support context]
    E --> E2[Robotic tone<br>for conversational intent]
    E --> E3[Language mixing<br>JP/EN inconsistency]

    F --> F1[Works for short<br>context, fails for long]
    F --> F2[Sensitive to<br>conversation history order]
    F --> F3[Different quality<br>for different intents]

Prompt Testing and Refinement Pipeline

flowchart TD
    A[Prompt Change<br>Proposed] --> B[Unit Tests:<br>Golden Test Set]
    B -->|Pass| C[Comparative Eval:<br>New vs Current]
    B -->|Fail| D[Block + Report<br>regression details]

    C --> E{Score Δ?}
    E -->|Positive > +5%| F[Auto-promote<br>to staging]
    E -->|Neutral, within ±5%| G[Manual review<br>+ edge case check]
    E -->|Negative < -5%| D

    F --> H[A/B Test in<br>Staging: 50/50]
    G --> H

    H --> I{Statistical<br>significance?}
    I -->|p < 0.05<br>new wins| J[Promote to<br>production]
    I -->|p < 0.05<br>current wins| K[Reject + analyze<br>failure cases]
    I -->|p ≥ 0.05| L[Extend test<br>or add samples]

    J --> M[Archive old version<br>+ update golden set]
    K --> N[Hypothesis refinement<br>cycle restart]

Version Comparison Workflow

sequenceDiagram
    participant Eng as Prompt Engineer
    participant Runner as TestRunner
    participant FM as Bedrock FM
    participant Store as PromptVersionStore
    participant Metrics as CloudWatch

    Eng->>Store: Submit new prompt version (v2)
    Store->>Store: Store with metadata (author, reason, timestamp)
    Eng->>Runner: Run comparison: v1 vs v2

    loop Each golden test case
        Runner->>FM: Invoke with v1 prompt + test input
        FM-->>Runner: v1 response
        Runner->>FM: Invoke with v2 prompt + test input
        FM-->>Runner: v2 response
        Runner->>Runner: Score both responses
    end

    Runner->>Store: Store evaluation results
    Runner->>Metrics: Emit comparison metrics
    Runner-->>Eng: Comparison report (per-case + aggregate)

Low-Level Design

1. Golden Test Case Management

Golden test sets are the foundation. Without them, every prompt change is a guess. Each test case pairs an input scenario with scoring criteria — not a single "correct answer" but a set of properties the response must satisfy.

import json
import hashlib
from dataclasses import dataclass, field, asdict
from typing import Optional
from enum import Enum
from datetime import datetime


class Intent(Enum):
    PRODUCT_RECOMMENDATION = "product_recommendation"
    ORDER_STATUS = "order_status"
    RETURN_REFUND = "return_refund"
    PRODUCT_FAQ = "product_faq"
    MANGA_SERIES_INFO = "manga_series_info"
    GENERAL_SUPPORT = "general_support"


class ScoreDimension(Enum):
    RELEVANCE = "relevance"         # Is the response about the right thing?
    ACCURACY = "accuracy"           # Are factual claims correct?
    FORMAT_COMPLIANCE = "format"    # Does output match expected schema?
    TONE = "tone"                   # Professional, friendly, on-brand?
    COMPLETENESS = "completeness"   # Does it address all parts of the query?
    SAFETY = "safety"               # No policy violations, no competitor mentions?


@dataclass
class GoldenTestCase:
    """A single test case in the golden evaluation set."""
    test_id: str
    intent: Intent
    description: str

    # Input
    user_message: str
    conversation_history: list = field(default_factory=list)
    context_documents: list = field(default_factory=list)  # RAG results to inject

    # Expected behavior (not exact match — criteria-based)
    must_contain: list = field(default_factory=list)       # Keywords/ASINs that must appear
    must_not_contain: list = field(default_factory=list)    # Forbidden content
    expected_format: Optional[str] = None                   # "json", "markdown", "plain"
    expected_json_schema: Optional[dict] = None             # JSON Schema for structured output
    min_score: dict = field(default_factory=dict)           # dimension -> minimum score

    # Metadata
    created_by: str = ""
    created_at: str = field(default_factory=lambda: datetime.utcnow().isoformat())
    tags: list = field(default_factory=list)
    difficulty: str = "medium"  # easy, medium, hard, adversarial

    def content_hash(self) -> str:
        """Deterministic hash for cache-busting when test case changes."""
        content = json.dumps({
            "user_message": self.user_message,
            "conversation_history": self.conversation_history,
            "context_documents": self.context_documents,
        }, sort_keys=True)
        return hashlib.sha256(content.encode()).hexdigest()[:12]


@dataclass
class GoldenTestSuite:
    """Complete test suite for one prompt template."""
    suite_id: str
    prompt_template_name: str
    test_cases: list = field(default_factory=list)
    minimum_pass_rate: float = 0.9  # 90% of test cases must pass

    def by_intent(self, intent: Intent) -> list:
        return [tc for tc in self.test_cases if tc.intent == intent]

    def by_difficulty(self, difficulty: str) -> list:
        return [tc for tc in self.test_cases if tc.difficulty == difficulty]

    def coverage_report(self) -> dict:
        """Check that every intent has sufficient test coverage."""
        coverage = {}
        for intent in Intent:
            cases = self.by_intent(intent)
            coverage[intent.value] = {
                "count": len(cases),
                "sufficient": len(cases) >= 3,
                "difficulties": list({tc.difficulty for tc in cases}),
            }
        return coverage
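
The sketch below shows how a suite might be assembled and audited for coverage. The ASINs, author address, and threshold values are illustrative placeholders, not real catalog data.

# Illustrative usage: one golden case for the recommendation intent, added to
# a suite and checked for coverage. All identifiers here are placeholders.
recommendation_case = GoldenTestCase(
    test_id="REC-001",
    intent=Intent.PRODUCT_RECOMMENDATION,
    description="Shonen follow-up recommendation for a returning reader",
    user_message="I loved Fullmetal Alchemist. What should I read next?",
    must_contain=["ASIN"],
    must_not_contain=["competitor"],
    expected_format="markdown",
    min_score={"accuracy": 0.8},
    created_by="prompt-eng@example.com",
    tags=["recommendation", "shonen"],
    difficulty="easy",
)

suite = GoldenTestSuite(
    suite_id="mangaassist-support-v1",
    prompt_template_name="customer_support",
    test_cases=[recommendation_case],
)

# Intents with fewer than 3 cases show "sufficient": False in the report.
print(suite.coverage_report())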

2. Prompt Scoring Engine

Scoring is multi-dimensional. A prompt can produce relevant content in the wrong format, or perfectly formatted hallucinations. Each dimension is scored independently to pinpoint exactly what went wrong.

import re
import json
import logging
from dataclasses import dataclass, field
from typing import Optional

logger = logging.getLogger("mangaassist.prompt_testing")


@dataclass
class DimensionScore:
    dimension: str
    score: float          # 0.0 to 1.0
    passed: bool
    details: str = ""


@dataclass
class TestCaseResult:
    test_id: str
    prompt_version: str
    response_text: str
    latency_ms: float
    dimension_scores: list = field(default_factory=list)
    overall_passed: bool = False
    overall_score: float = 0.0

    def compute_overall(self):
        if not self.dimension_scores:
            return
        self.overall_score = sum(d.score for d in self.dimension_scores) / len(self.dimension_scores)
        self.overall_passed = all(d.passed for d in self.dimension_scores)


class PromptScorer:
    """Scores FM responses against golden test case criteria.

    Scoring strategy:
    - Deterministic checks (must_contain, must_not_contain, format) are binary pass/fail
    - Quality checks (relevance, tone, completeness) use heuristic scoring
    - For production, quality checks can be upgraded to LLM-as-judge scoring
    """

    def score(self, response_text: str, test_case: GoldenTestCase) -> list:
        scores = []

        # 1. Must-contain check (accuracy/completeness)
        if test_case.must_contain:
            scores.append(self._score_must_contain(response_text, test_case))

        # 2. Must-not-contain check (safety)
        if test_case.must_not_contain:
            scores.append(self._score_must_not_contain(response_text, test_case))

        # 3. Format compliance
        if test_case.expected_format:
            scores.append(self._score_format(response_text, test_case))

        # 4. JSON schema validation
        if test_case.expected_json_schema:
            scores.append(self._score_json_schema(response_text, test_case))

        # 5. Response length reasonableness
        scores.append(self._score_length(response_text, test_case))

        # 6. Hallucination signals
        scores.append(self._score_hallucination_signals(response_text))

        return scores

    def _score_must_contain(self, response: str, test_case: GoldenTestCase) -> DimensionScore:
        response_lower = response.lower()
        found = [kw for kw in test_case.must_contain if kw.lower() in response_lower]
        missing = [kw for kw in test_case.must_contain if kw.lower() not in response_lower]
        ratio = len(found) / len(test_case.must_contain)
        min_score = test_case.min_score.get("accuracy", 0.8)

        return DimensionScore(
            dimension="accuracy",
            score=ratio,
            passed=ratio >= min_score,
            details=f"Found {len(found)}/{len(test_case.must_contain)}. Missing: {missing}" if missing else "All keywords found",
        )

    def _score_must_not_contain(self, response: str, test_case: GoldenTestCase) -> DimensionScore:
        response_lower = response.lower()
        violations = [kw for kw in test_case.must_not_contain if kw.lower() in response_lower]
        score = 1.0 if not violations else 0.0

        return DimensionScore(
            dimension="safety",
            score=score,
            passed=score >= 1.0,
            details=f"Violations found: {violations}" if violations else "No violations",
        )

    def _score_format(self, response: str, test_case: GoldenTestCase) -> DimensionScore:
        fmt = test_case.expected_format

        if fmt == "json":
            try:
                json.loads(response)
                return DimensionScore(dimension="format", score=1.0, passed=True, details="Valid JSON")
            except json.JSONDecodeError:
                # Check for JSON in code fences
                match = re.search(r'```(?:json)?\s*([\s\S]*?)\s*```', response)
                if match:
                    try:
                        json.loads(match.group(1))
                        return DimensionScore(
                            dimension="format", score=0.7, passed=True,
                            details="JSON found inside code fence (acceptable but not ideal)",
                        )
                    except json.JSONDecodeError:
                        pass
                return DimensionScore(dimension="format", score=0.0, passed=False, details="Expected JSON, got invalid format")

        elif fmt == "markdown":
            has_headers = bool(re.search(r'^#{1,3}\s', response, re.MULTILINE))
            has_lists = bool(re.search(r'^[\-\*]\s', response, re.MULTILINE))
            score = 0.5 * has_headers + 0.5 * has_lists
            return DimensionScore(
                dimension="format", score=score, passed=score >= 0.5,
                details=f"Headers: {has_headers}, Lists: {has_lists}",
            )

        return DimensionScore(dimension="format", score=1.0, passed=True, details="Plain text — no format check")

    def _score_json_schema(self, response: str, test_case: GoldenTestCase) -> DimensionScore:
        """Validate response against a JSON schema definition."""
        try:
            # Extract JSON
            parsed = None
            try:
                parsed = json.loads(response)
            except json.JSONDecodeError:
                match = re.search(r'```(?:json)?\s*([\s\S]*?)\s*```', response)
                if match:
                    parsed = json.loads(match.group(1))

            if parsed is None:
                return DimensionScore(dimension="schema", score=0.0, passed=False, details="No parseable JSON")

            schema = test_case.expected_json_schema
            required_keys = schema.get("required", [])

            missing_keys = [k for k in required_keys if k not in parsed]
            extra_keys = [k for k in parsed.keys() if k not in schema.get("properties", {}).keys()]

            key_score = 1.0 - (len(missing_keys) / max(len(required_keys), 1))
            extra_penalty = min(0.1 * len(extra_keys), 0.3)
            score = max(0.0, key_score - extra_penalty)

            details = []
            if missing_keys:
                details.append(f"Missing required: {missing_keys}")
            if extra_keys:
                details.append(f"Unexpected keys: {extra_keys}")

            return DimensionScore(
                dimension="schema",
                score=score,
                passed=score >= 0.8,
                details="; ".join(details) if details else "Schema compliant",
            )
        except (json.JSONDecodeError, TypeError) as e:
            return DimensionScore(dimension="schema", score=0.0, passed=False, details=f"Parse error: {e}")

    def _score_length(self, response: str, test_case: GoldenTestCase) -> DimensionScore:
        """Check response length is reasonable for the intent."""
        word_count = len(response.split())
        intent_ranges = {
            Intent.PRODUCT_RECOMMENDATION: (50, 500),
            Intent.ORDER_STATUS: (20, 200),
            Intent.RETURN_REFUND: (30, 300),
            Intent.PRODUCT_FAQ: (50, 500),
            Intent.MANGA_SERIES_INFO: (100, 800),
            Intent.GENERAL_SUPPORT: (20, 300),
        }
        min_words, max_words = intent_ranges.get(test_case.intent, (20, 500))

        if min_words <= word_count <= max_words:
            return DimensionScore(dimension="completeness", score=1.0, passed=True, details=f"{word_count} words (in range)")
        elif word_count < min_words:
            ratio = word_count / min_words
            return DimensionScore(
                dimension="completeness", score=ratio, passed=ratio >= 0.6,
                details=f"{word_count} words (below minimum {min_words})",
            )
        else:
            return DimensionScore(
                dimension="completeness", score=0.7, passed=True,
                details=f"{word_count} words (above maximum {max_words} — verbose but acceptable)",
            )

    def _score_hallucination_signals(self, response: str) -> DimensionScore:
        """Detect common hallucination patterns in MangaAssist context."""
        signals = []

        # Fabricated ASIN pattern (valid format but not in our catalog)
        asin_matches = re.findall(r'B0[A-Z0-9]{8}', response)
        if len(asin_matches) > 5:
            signals.append(f"Suspicious number of ASINs ({len(asin_matches)}) — possible fabrication")

        # Invented URLs
        url_matches = re.findall(r'https?://[^\s]+', response)
        for url in url_matches:
            if 'amazon.com' not in url and 'manga' in url.lower():
                signals.append(f"Suspicious URL: {url}")

        # Confidence language with specific numbers (often hallucinated)
        if re.search(r'(?:exactly|precisely)\s+\d+\.?\d*%', response):
            signals.append("Exact percentage claims — possible hallucination")

        score = max(0.0, 1.0 - 0.3 * len(signals))
        return DimensionScore(
            dimension="accuracy_signals",
            score=score,
            passed=len(signals) == 0,
            details="; ".join(signals) if signals else "No hallucination signals detected",
        )
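
A quick, illustrative sanity check of the scorer against a canned response, reusing the recommendation_case sketched in section 1; the response text is made up for demonstration.

# Score a hand-written response against the illustrative golden case from
# section 1; each DimensionScore pinpoints what passed or failed and why.
scorer = PromptScorer()
sample_response = (
    "Based on your love of Fullmetal Alchemist, here are two picks:\n"
    "- Vinland Saga Vol. 1 (ASIN B0EXAMPLE01, $12.99): mature historical drama\n"
    "- Attack on Titan Vol. 1 (ASIN B0EXAMPLE02, $10.99): high-stakes action\n"
)
for dim in scorer.score(sample_response, recommendation_case):
    print(f"{dim.dimension}: score={dim.score:.2f} passed={dim.passed} ({dim.details})")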

3. Prompt Test Runner

The test runner orchestrates evaluation: it takes a prompt version, runs it against the golden set, scores every response, and produces a detailed report. This is what blocks bad prompts from reaching production.

import time
import json
import logging
from dataclasses import dataclass, field
from typing import Optional

logger = logging.getLogger("mangaassist.prompt_testing")


@dataclass
class PromptVersion:
    """A versioned prompt template with metadata."""
    version_id: str
    template_name: str
    system_prompt: str
    version_number: int
    author: str
    change_reason: str
    created_at: str = ""
    test_scores: Optional[dict] = None


@dataclass
class EvaluationReport:
    """Complete evaluation report for one prompt version against a test suite."""
    prompt_version: str
    suite_id: str
    total_cases: int = 0
    passed_cases: int = 0
    failed_cases: int = 0
    pass_rate: float = 0.0
    dimension_averages: dict = field(default_factory=dict)
    failed_case_details: list = field(default_factory=list)
    total_latency_ms: float = 0.0
    avg_latency_ms: float = 0.0

    def is_deployment_ready(self, min_pass_rate: float = 0.9) -> bool:
        return self.pass_rate >= min_pass_rate


class PromptTestRunner:
    """Runs golden test suites against prompt versions.

    Usage:
        runner = PromptTestRunner(bedrock_client)
        report = runner.run_suite(prompt_version, test_suite)
        if report.is_deployment_ready():
            deploy(prompt_version)
    """

    def __init__(self, bedrock_client, scorer: PromptScorer = None):
        self.bedrock_client = bedrock_client
        self.scorer = scorer or PromptScorer()

    def run_suite(
        self,
        prompt_version: PromptVersion,
        suite: GoldenTestSuite,
        max_concurrent: int = 5,  # not yet used; evaluation currently runs sequentially
    ) -> EvaluationReport:
        """Run all test cases in the suite against the prompt version."""

        report = EvaluationReport(
            prompt_version=prompt_version.version_id,
            suite_id=suite.suite_id,
            total_cases=len(suite.test_cases),
        )

        results = []
        for test_case in suite.test_cases:
            result = self._run_single_test(prompt_version, test_case)
            results.append(result)

        # Aggregate results
        for result in results:
            report.total_latency_ms += result.latency_ms
            if result.overall_passed:
                report.passed_cases += 1
            else:
                report.failed_cases += 1
                report.failed_case_details.append({
                    "test_id": result.test_id,
                    "prompt_version": result.prompt_version,
                    "failed_dimensions": [
                        {"dimension": d.dimension, "score": d.score, "details": d.details}
                        for d in result.dimension_scores if not d.passed
                    ],
                    "response_preview": result.response_text[:300],
                })

        report.pass_rate = report.passed_cases / max(report.total_cases, 1)
        report.avg_latency_ms = report.total_latency_ms / max(report.total_cases, 1)

        # Compute dimension averages
        dimension_sums = {}
        dimension_counts = {}
        for result in results:
            for d in result.dimension_scores:
                dimension_sums[d.dimension] = dimension_sums.get(d.dimension, 0) + d.score
                dimension_counts[d.dimension] = dimension_counts.get(d.dimension, 0) + 1

        report.dimension_averages = {
            dim: round(dimension_sums[dim] / dimension_counts[dim], 3)
            for dim in dimension_sums
        }

        logger.info(
            "Evaluation complete: %s — pass_rate=%.1f%% (%d/%d)",
            prompt_version.version_id,
            report.pass_rate * 100,
            report.passed_cases,
            report.total_cases,
        )

        return report

    def _run_single_test(self, prompt_version: PromptVersion, test_case: GoldenTestCase) -> TestCaseResult:
        """Run a single test case and score the result."""

        # Build messages from test case
        messages = list(test_case.conversation_history)

        # Inject RAG context if present
        user_content = test_case.user_message
        if test_case.context_documents:
            context_block = "\n\n---\nRelevant documents:\n" + "\n---\n".join(test_case.context_documents)
            user_content = context_block + "\n\n" + user_content

        messages.append({"role": "user", "content": user_content})

        start = time.time()
        try:
            response = self.bedrock_client.invoke(
                messages=messages,
                system_prompt=prompt_version.system_prompt,
                session_id=f"test-{test_case.test_id}",
                intent=test_case.intent.value,
            )
            response_text = response.get("text", "")
            latency = (time.time() - start) * 1000
        except Exception as e:
            logger.error("Test case %s failed with error: %s", test_case.test_id, e)
            return TestCaseResult(
                test_id=test_case.test_id,
                prompt_version=prompt_version.version_id,
                response_text=f"ERROR: {e}",
                latency_ms=(time.time() - start) * 1000,
                dimension_scores=[DimensionScore(dimension="execution", score=0.0, passed=False, details=str(e))],
                overall_passed=False,
            )

        # Score the response
        scores = self.scorer.score(response_text, test_case)

        result = TestCaseResult(
            test_id=test_case.test_id,
            prompt_version=prompt_version.version_id,
            response_text=response_text,
            latency_ms=latency,
            dimension_scores=scores,
        )
        result.compute_overall()

        return result

    def compare_versions(
        self,
        version_a: PromptVersion,
        version_b: PromptVersion,
        suite: GoldenTestSuite,
    ) -> dict:
        """Run the same suite against two versions and produce a comparison report."""

        report_a = self.run_suite(version_a, suite)
        report_b = self.run_suite(version_b, suite)

        comparison = {
            "version_a": version_a.version_id,
            "version_b": version_b.version_id,
            "pass_rate_a": report_a.pass_rate,
            "pass_rate_b": report_b.pass_rate,
            "pass_rate_delta": report_b.pass_rate - report_a.pass_rate,
            "avg_latency_a": report_a.avg_latency_ms,
            "avg_latency_b": report_b.avg_latency_ms,
            "dimension_comparison": {},
            "recommendation": "",
        }

        # Per-dimension comparison
        all_dims = set(report_a.dimension_averages.keys()) | set(report_b.dimension_averages.keys())
        for dim in all_dims:
            score_a = report_a.dimension_averages.get(dim, 0)
            score_b = report_b.dimension_averages.get(dim, 0)
            comparison["dimension_comparison"][dim] = {
                "version_a": score_a,
                "version_b": score_b,
                "delta": round(score_b - score_a, 3),
            }

        # Recommendation logic
        delta = comparison["pass_rate_delta"]
        if delta > 0.05:
            comparison["recommendation"] = "PROMOTE_B"
        elif delta < -0.05:
            comparison["recommendation"] = "KEEP_A"
        else:
            comparison["recommendation"] = "MANUAL_REVIEW"
            # Check if any dimension improved significantly with none degrading
            improving_dims = [d for d, v in comparison["dimension_comparison"].items() if v["delta"] > 0.05]
            degrading_dims = [d for d, v in comparison["dimension_comparison"].items() if v["delta"] < -0.05]
            if improving_dims and not degrading_dims:
                comparison["recommendation"] = "PROMOTE_B_CAUTIOUSLY"

        return comparison
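
The pipeline diagram gates promotion on statistical significance (p < 0.05) during the staging A/B test, which compare_versions does not compute. Below is a minimal sketch of that check using a two-proportion z-test on per-session pass counts; the function name and sample numbers are illustrative, and a production setup may prefer scipy or a sequential-testing library.

import math

def two_proportion_p_value(wins_a: int, n_a: int, wins_b: int, n_b: int) -> float:
    """Two-sided p-value for a two-proportion z-test on A/B pass counts."""
    p_a, p_b = wins_a / n_a, wins_b / n_b
    pooled = (wins_a + wins_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (p_b - p_a) / se
    # Convert |z| to a two-sided p-value via the normal survival function
    return math.erfc(abs(z) / math.sqrt(2))

# Example: 110/200 staged sessions pass under version A, 132/200 under version B
p_value = two_proportion_p_value(110, 200, 132, 200)
print("significant" if p_value < 0.05 else "extend the test or add samples")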

4. Prompt Refinement Pipeline

The refinement pipeline implements the hypothesis → test → measure → iterate cycle. Every prompt change starts with a written hypothesis about what will improve and why. This prevents aimless prompt tinkering.

import json
import logging
from dataclasses import dataclass, field
from typing import Optional
from datetime import datetime
from enum import Enum

logger = logging.getLogger("mangaassist.prompt_refinement")


class RefinementStatus(Enum):
    HYPOTHESIS = "hypothesis"   # Change proposed, not yet tested
    TESTING = "testing"         # Running against golden set
    EVALUATED = "evaluated"     # Scores available, pending decision
    PROMOTED = "promoted"       # Deployed to production
    REJECTED = "rejected"       # Did not meet quality bar
    ROLLED_BACK = "rolled_back" # Promoted but reverted


@dataclass
class RefinementHypothesis:
    """A documented hypothesis for why a prompt change will help."""
    hypothesis_id: str
    prompt_template: str

    # What changed
    change_description: str
    diff_summary: str  # e.g., "Added explicit instruction to include ASIN in recommendations"

    # Why
    root_cause: str            # What problem was observed
    expected_improvement: str  # What metric should improve and by how much
    target_dimensions: list = field(default_factory=list)  # Which score dimensions

    # Tracking
    status: RefinementStatus = RefinementStatus.HYPOTHESIS
    test_report: Optional[dict] = None
    decision: Optional[str] = None
    decision_reason: Optional[str] = None

    # Timestamps
    created_at: str = field(default_factory=lambda: datetime.utcnow().isoformat())
    tested_at: Optional[str] = None
    decided_at: Optional[str] = None


class PromptRefinementPipeline:
    """Manages the hypothesis → test → measure → iterate cycle.

    Workflow:
    1. Engineer observes a quality issue (via metrics, user feedback, or sampling)
    2. Writes a hypothesis: "Changing X will improve dimension Y by Z%"
    3. Creates a new prompt version implementing the hypothesis
    4. Pipeline runs evaluation, compares with current production
    5. Based on results: promote, reject, or refine further
    """

    def __init__(self, test_runner: PromptTestRunner, version_store: dict = None):
        self.test_runner = test_runner
        self.version_store = version_store or {}
        self.hypotheses: list = []

    def create_hypothesis(
        self,
        prompt_template: str,
        change_description: str,
        diff_summary: str,
        root_cause: str,
        expected_improvement: str,
        target_dimensions: list,
    ) -> RefinementHypothesis:
        """Step 1: Document the hypothesis before making any prompt changes."""
        h = RefinementHypothesis(
            hypothesis_id=f"H-{len(self.hypotheses)+1:04d}",
            prompt_template=prompt_template,
            change_description=change_description,
            diff_summary=diff_summary,
            root_cause=root_cause,
            expected_improvement=expected_improvement,
            target_dimensions=target_dimensions,
        )
        self.hypotheses.append(h)
        logger.info(
            "Hypothesis %s created: %s (expects improvement in %s)",
            h.hypothesis_id, h.change_description, h.target_dimensions,
        )
        return h

    def test_hypothesis(
        self,
        hypothesis: RefinementHypothesis,
        current_version: PromptVersion,
        new_version: PromptVersion,
        test_suite: GoldenTestSuite,
    ) -> dict:
        """Step 2-3: Test the hypothesis by running both versions."""
        hypothesis.status = RefinementStatus.TESTING
        hypothesis.tested_at = datetime.utcnow().isoformat()

        comparison = self.test_runner.compare_versions(
            current_version, new_version, test_suite,
        )

        hypothesis.test_report = comparison
        hypothesis.status = RefinementStatus.EVALUATED

        # Check if target dimensions improved
        target_results = {}
        for dim in hypothesis.target_dimensions:
            dim_data = comparison["dimension_comparison"].get(dim, {})
            target_results[dim] = {
                "delta": dim_data.get("delta", 0),
                "improved": dim_data.get("delta", 0) > 0.02,
            }

        comparison["target_dimension_results"] = target_results
        comparison["hypothesis_supported"] = all(
            r["improved"] for r in target_results.values()
        )

        logger.info(
            "Hypothesis %s evaluated: supported=%s, recommendation=%s",
            hypothesis.hypothesis_id,
            comparison["hypothesis_supported"],
            comparison["recommendation"],
        )

        return comparison

    def decide(self, hypothesis: RefinementHypothesis, decision: str, reason: str):
        """Step 4: Record the decision based on evaluation results."""
        valid_decisions = {"promote", "reject", "refine_further"}
        if decision not in valid_decisions:
            raise ValueError(f"Decision must be one of {valid_decisions}")

        hypothesis.decision = decision
        hypothesis.decision_reason = reason
        hypothesis.decided_at = datetime.utcnow().isoformat()

        if decision == "promote":
            hypothesis.status = RefinementStatus.PROMOTED
        elif decision == "reject":
            hypothesis.status = RefinementStatus.REJECTED
        else:
            hypothesis.status = RefinementStatus.HYPOTHESIS  # Back to start

        logger.info(
            "Hypothesis %s: decision=%s reason='%s'",
            hypothesis.hypothesis_id, decision, reason,
        )

    def get_refinement_history(self, prompt_template: str) -> list:
        """Get all hypotheses for a given prompt template, ordered by creation."""
        return [
            h for h in self.hypotheses
            if h.prompt_template == prompt_template
        ]
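
A condensed, illustrative walk-through of one refinement cycle. It assumes bedrock_client is the project's FM invoke wrapper, v_current and v_candidate are PromptVersion instances, and suite is the golden suite from section 1; none of these names are defined by the pipeline itself.

# Illustrative end-to-end refinement cycle; bedrock_client, v_current,
# v_candidate, and suite are assumed to exist (see sections 1 and 3).
runner = PromptTestRunner(bedrock_client)
pipeline = PromptRefinementPipeline(test_runner=runner)

hypothesis = pipeline.create_hypothesis(
    prompt_template="customer_support",
    change_description="Move recommendation format rules to the end of the system prompt",
    diff_summary="Reordered system prompt sections; no wording changes",
    root_cause="Format scores dropped after the greeting instruction was added",
    expected_improvement="format dimension average +0.10",
    target_dimensions=["format"],
)

comparison = pipeline.test_hypothesis(hypothesis, v_current, v_candidate, suite)

if comparison["hypothesis_supported"] and comparison["recommendation"].startswith("PROMOTE"):
    pipeline.decide(hypothesis, "promote", "Format improved with no regressions elsewhere")
else:
    pipeline.decide(hypothesis, "refine_further", "Target dimension did not improve as expected")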

5. MangaAssist Scenarios

Scenario A: Recommendation Quality Drop After Prompt Update

Context: A prompt engineer updates the system prompt to add a new instruction: "Always include the customer's name in the greeting." After deployment, the product recommendation quality score drops from 0.85 to 0.72 in daily monitoring.

Root Cause: The new greeting instruction pushed the recommendation format instructions further from the end of the system prompt, where the model pays strongest attention. The recommendation structure (ASIN, title, price, reason) instructions are now diluted.

Diagnosis Workflow:

  1. Run the golden test suite against the new version → format dimension scores drop by 15%
  2. Compare failing test cases: responses are friendly but recommendations lack ASIN/price structure
  3. Hypothesis: "System prompt instruction ordering matters — recommendation format should be near the end"

Fix: Restructure system prompt to keep format-critical instructions in the final section. Rerun golden suite — pass rate recovers to 0.87 (higher than before due to better organization).
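
A hedged sketch of how the fix might be captured as a new version for audit; the prompt text is abbreviated and every identifier is illustrative.

# Illustrative version record for the Scenario A fix; the system prompt body
# is abbreviated and the identifiers are placeholders.
v3 = PromptVersion(
    version_id="customer_support-v3",
    template_name="customer_support",
    system_prompt=(
        "You are MangaAssist, a shopping assistant for manga on Amazon.\n"
        "[safety guardrails first; greeting and tone guidance in the middle]\n"
        "FORMAT (keep last): every recommendation must include ASIN, title, "
        "price, and a one-sentence reason."
    ),
    version_number=3,
    author="prompt-eng@example.com",
    change_reason="Keep format-critical instructions in the final prompt section",
)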

Scenario B: JSON Schema Drift in Structured Outputs

Context: MangaAssist returns structured JSON for product recommendations that the frontend parses to render product cards. After a model version upgrade (Claude 3 Sonnet → Claude 3.5 Sonnet), the model starts adding an extra "confidence" field to the JSON output that the frontend doesn't expect.

Detection: _score_json_schema() catches the extra field in the nightly golden test run. Schema compliance score drops from 1.0 to 0.85.

Root Cause: The new model version is more "helpful" and infers that a confidence score would be useful. This is benign but breaks strict schema validation on the frontend.

Fix: Two changes:

  1. Add explicit instruction: "Return ONLY the fields specified in the schema. Do not add extra fields."
  2. Make the frontend schema parser lenient for unexpected fields (defense in depth, sketched below)
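
A minimal Python sketch of the second change, shown here as a normalization step that strips undeclared fields before strict parsing; the real frontend would apply the same tolerance in its own code, and the field names below are illustrative.

def normalize_to_schema(parsed: dict, schema: dict) -> dict:
    """Drop keys the schema does not declare so an unexpected field such as
    'confidence' never reaches a strict downstream parser."""
    allowed = set(schema.get("properties", {}).keys())
    return {k: v for k, v in parsed.items() if k in allowed}

schema = {"properties": {"asin": {}, "title": {}, "price": {}}, "required": ["asin", "title", "price"]}
raw = {"asin": "B0EXAMPLE01", "title": "Vinland Saga Vol. 1", "price": "12.99", "confidence": 0.92}
print(normalize_to_schema(raw, schema))  # the 'confidence' key is dropped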

Scenario C: Manga Terminology Confusion Across Languages

Context: Japanese manga titles and character names cause the model to occasionally mix Japanese and English mid-response, especially when the user's question references a Japanese title but the conversation is in English.

Detection: Golden test cases with Japanese titles show tone dimension scores of 0.6 (below the 0.8 threshold). Manual review reveals code-switching mid-sentence.

Hypothesis: "Adding an explicit language consistency instruction will reduce code-switching."

New instruction added: "Always respond in the same language as the user's most recent message. If referencing Japanese titles, keep them in their original form but provide English context in the surrounding text."

Result: tone score recovers to 0.88. Two edge cases still fail (titles with mixed JP/EN in the original) — flagged for the next refinement cycle.
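
The remaining edge cases can be pinned down as adversarial golden test cases so the next cycle has a concrete target. The example below is illustrative, and the must_not_contain heuristic (Japanese connectives that signal mid-sentence code-switching) is a rough proxy rather than a full language detector.

# Illustrative adversarial case for JP/EN consistency; thresholds and the
# code-switching heuristic are assumptions, not part of the production suite.
jp_title_case = GoldenTestCase(
    test_id="LANG-007",
    intent=Intent.MANGA_SERIES_INFO,
    description="English question referencing a Japanese title",
    user_message="Can you tell me about 鬼滅の刃? Is the English edition complete?",
    must_contain=["Demon Slayer"],           # English context must accompany the JP title
    must_not_contain=["について", "ですが"],   # JP connectives suggest mid-response code-switching
    min_score={"tone": 0.8},
    tags=["language_consistency"],
    difficulty="adversarial",
)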


6. CloudWatch Dashboard and Alerts

Metric | Source | Alarm Threshold | Action
PromptTestPassRate | Test pipeline | < 90% for new version | Block deployment
PromptDimensionScore_{dim} | Test pipeline | Drop > 10% vs baseline | Alert prompt engineer
GoldenSetCoverage | Test suite audit | < 3 cases per intent | Warn: need more tests
PromptVersionAge | Version store | > 30 days since eval | Schedule re-evaluation
HallucinationSignalRate | Production scoring | > 3% of responses | Page: investigate prompt
FormatComplianceRate | Production scoring | < 95% | Warn: schema drift likely
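
A minimal sketch of how the evaluation-side metrics in this table could be published with boto3; the namespace and dimension names are assumptions, and production metrics such as HallucinationSignalRate would be emitted from the serving path instead.

import boto3

cloudwatch = boto3.client("cloudwatch")

def emit_prompt_eval_metrics(report: EvaluationReport) -> None:
    """Publish suite-level metrics so the alarms above have data to evaluate."""
    metric_data = [{
        "MetricName": "PromptTestPassRate",
        "Dimensions": [{"Name": "PromptVersion", "Value": report.prompt_version}],
        "Value": report.pass_rate * 100,
        "Unit": "Percent",
    }]
    for dim, score in report.dimension_averages.items():
        metric_data.append({
            "MetricName": f"PromptDimensionScore_{dim}",
            "Dimensions": [{"Name": "PromptVersion", "Value": report.prompt_version}],
            "Value": score,
            "Unit": "None",
        })
    cloudwatch.put_metric_data(
        Namespace="MangaAssist/PromptQuality",  # assumed namespace
        MetricData=metric_data,
    )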

CloudWatch Logs Insights Queries

Prompt version score comparison over time:

fields @timestamp, prompt_version, pass_rate, dimension_averages
| filter log_type = "prompt_evaluation"
| sort @timestamp desc
| limit 20

Failed test cases by dimension:

fields @timestamp, test_id, prompt_version
| filter log_type = "test_case_result" and overall_passed = 0
| stats count(*) as failures by test_id
| sort failures desc
| limit 20


Intuition Gained

What Mental Model You Build

Prompt engineering troubleshooting develops a layered diagnostic sense:

1. Wording vs. Context vs. Model Limitation: When a prompt produces bad output, 70% of the time it is a wording issue (ambiguous instruction, wrong emphasis), 20% a context issue (missing information, too much noise), and 10% a model limitation (task too complex, language gap). You learn to test each layer independently: try rewording first (cheapest), then adjust context, and only blame the model after the other two are ruled out.

2. The Regression Instinct: You develop an allergic reaction to prompt changes without test results. Even a one-word change can drop accuracy by 15% because model sensitivity to instruction phrasing is non-linear. The golden test set becomes your immune system — any change that does not pass the suite does not ship.

3. The Instruction Attention Gradient: You learn that models pay more attention to the beginning and end of the system prompt, and less to the middle. Critical format instructions go at the end. Safety guardrails go at the beginning. Everything else fills the middle. This spatial intuition informs every prompt edit.

How This Intuition Guides Future Decisions

  • When onboarding a new model version: You run the golden suite first, read the comparison report, and look at per-dimension deltas before making any prompt adjustments. You never deploy a model upgrade without prompt regression testing.
  • When a stakeholder says "just add this one instruction to the prompt": You create a hypothesis, test it, and show the before/after scores. You know that each additional instruction slightly dilutes the rest, and you have the data to prove whether the tradeoff is worth it.
  • When building a new feature: You write the golden test cases before writing the prompt. This forces you to define "what does good look like?" before you start crafting the prompt — test-driven prompt development.
  • When debugging a reported quality issue: You isolate the dimension first. Is it accuracy (wrong facts), format (wrong structure), tone (wrong register), or safety (policy violation)? Each dimension has a different fix path, and you triage instantly.