
05: Prompt Maintenance Troubleshooting

AIP-C01 Mapping

Task 5.2 → Skill 5.2.5: Maintain and improve prompt performance over time using template testing, X-Ray observability, schema validation, systematic refinement, and prompt confusion diagnosis.


User Story

As an SRE on the MangaAssist team, I want to continuously monitor, validate, and maintain prompt templates in production so that degradation is detected early, schema drift is caught automatically, and prompt refinement follows a measured, repeatable process, keeping prompt quality stable over time even as the FM, data sources, and user behavior evolve.


Acceptance Criteria

  • Every prompt template execution is traced end-to-end with X-Ray, including prompt assembly, FM invocation, and response parsing spans
  • Schema validation runs on every FM response in production, catching format drift before it reaches the user
  • Prompt health metrics (format compliance, hallucination rate, latency, error rate) are dashboarded per template per day
  • Template regression is detected within 4 hours using statistical process control on response quality metrics
  • Prompt confusion (ambiguous classification between intents) is tracked and resolved through targeted prompt refinement
  • Seasonal prompt maintenance (holidays, sales events, new product launches) follows a documented pre-launch checklist

High-Level Design

Prompt Maintenance Lifecycle

graph LR
    A[Deploy Prompt<br>v1.0] --> B[Monitor<br>Production Metrics]
    B --> C{Drift<br>Detected?}
    C -->|No| B
    C -->|Yes| D[Diagnose:<br>What changed?]

    D --> E{Root Cause?}
    E -->|FM behavior shift| F[Prompt refinement<br>cycle]
    E -->|Data distribution shift| G[Context/RAG<br>adjustment]
    E -->|Schema drift| H[Output format<br>fix + validation]
    E -->|Seasonal pattern| I[Scheduled prompt<br>variant activation]

    F --> J[Test + Compare]
    G --> J
    H --> J
    I --> J

    J --> K{Pass?}
    K -->|Yes| L[Deploy prompt<br>v1.1]
    K -->|No| F

    L --> B

Prompt Observability Architecture

flowchart TD
    A[User Request] --> B[Orchestrator]

    subgraph XRay["X-Ray Trace (single request)"]
        B --> C[Span: Intent<br>Classification]
        C --> D[Span: Prompt<br>Assembly]
        D --> E[Span: Context<br>Retrieval]
        E --> F[Span: FM<br>Invocation]
        F --> G[Span: Response<br>Validation]
        G --> H[Span: Schema<br>Check]
    end

    H --> I[Response]

    subgraph Metrics["CloudWatch Metrics"]
        J[PromptAssemblyLatency]
        K[FMInvocationLatency]
        L[SchemaValidationPass/Fail]
        M[PromptHealthScore]
    end

    D -.-> J
    F -.-> K
    H -.-> L
    G -.-> M

Prompt Confusion Detection Flow

stateDiagram-v2
    [*] --> IntentClassification
    IntentClassification --> HighConfidence: confidence ≥ 0.85
    IntentClassification --> LowConfidence: 0.5 ≤ confidence < 0.85
    IntentClassification --> Ambiguous: confidence < 0.5

    HighConfidence --> CorrectPrompt: Use intent-specific template
    LowConfidence --> ConfusionLog: Log for analysis
    LowConfidence --> FallbackPrompt: Use general template
    Ambiguous --> ConfusionLog
    Ambiguous --> FallbackPrompt

    ConfusionLog --> WeeklyAnalysis: Aggregate confusion patterns
    WeeklyAnalysis --> PromptRefinement: Fix top-3 confused pairs
    PromptRefinement --> IntentClassification: Retrain / update
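The routing in the state diagram above reduces to a small dispatch function. This is a minimal sketch: the thresholds (0.85 and 0.5) come from the diagram, while the template names are illustrative placeholders.

```python
# Sketch of the confidence-based routing in the state diagram above.
# Template naming convention here is hypothetical.
def route_intent(intent: str, confidence: float) -> dict:
    """Map classifier confidence to a template choice and a logging decision."""
    if confidence >= 0.85:
        # High confidence: use the intent-specific template, no confusion log
        return {"template": f"{intent}_template", "log_confusion": False}
    if confidence >= 0.5:
        # Low confidence: answer with the general template, log for analysis
        return {"template": "general_template", "log_confusion": True}
    # Ambiguous: same fallback, flagged for the weekly confusion review
    return {"template": "general_template", "log_confusion": True, "ambiguous": True}
```

Note that both low-confidence and ambiguous queries land in the confusion log; the weekly analysis distinguishes them by the `ambiguous` flag.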

Low-Level Design

1. X-Ray Prompt Observability Pipeline

Every prompt execution is a distributed trace. X-Ray segments give timing breakdowns, and annotations make traces searchable by intent, template version, and quality signals.

import time
import json
import logging
from dataclasses import dataclass, field
from typing import Optional
from functools import wraps

logger = logging.getLogger("mangaassist.prompt_maintenance")


# Lightweight X-Ray-style tracing: mirrors the aws-xray-sdk segment/subsegment
# patterns without requiring the SDK for this reference code

@dataclass
class SpanRecord:
    """Record of a single span within a prompt execution trace."""
    name: str
    start_time: float
    end_time: Optional[float] = None
    duration_ms: Optional[float] = None
    annotations: dict = field(default_factory=dict)
    metadata: dict = field(default_factory=dict)
    error: Optional[str] = None

    def close(self, error: Optional[str] = None):
        self.end_time = time.time()
        self.duration_ms = (self.end_time - self.start_time) * 1000
        self.error = error


@dataclass
class PromptExecutionTrace:
    """Complete trace of a single prompt execution, from assembly to response validation."""
    trace_id: str
    session_id: str
    intent: str
    template_name: str
    template_version: str

    spans: list = field(default_factory=list)
    total_duration_ms: float = 0.0

    # Quality annotations (searchable in X-Ray)
    schema_valid: Optional[bool] = None
    hallucination_detected: bool = False
    prompt_confusion: bool = False
    response_truncated: bool = False

    def add_span(self, name: str) -> SpanRecord:
        span = SpanRecord(name=name, start_time=time.time())
        self.spans.append(span)
        return span

    def close(self):
        if self.spans:
            self.total_duration_ms = sum(
                s.duration_ms for s in self.spans if s.duration_ms is not None
            )

    def to_xray_annotations(self) -> dict:
        """Annotations that make this trace searchable in X-Ray console."""
        return {
            "intent": self.intent,
            "template_name": self.template_name,
            "template_version": self.template_version,
            "schema_valid": self.schema_valid,
            "hallucination_detected": self.hallucination_detected,
            "prompt_confusion": self.prompt_confusion,
        }

    def to_xray_metadata(self) -> dict:
        """Metadata for detailed investigation (not searchable but visible)."""
        return {
            "span_timings": {
                s.name: {"duration_ms": s.duration_ms, "error": s.error}
                for s in self.spans
            },
            "total_duration_ms": self.total_duration_ms,
        }


class PromptObservabilityPipeline:
    """Instruments prompt execution for X-Ray tracing.

    Usage:
        pipeline = PromptObservabilityPipeline()
        trace = pipeline.start_trace(session_id, intent, template_name, template_version)

        with pipeline.span(trace, "prompt_assembly"):
            prompt = assemble_prompt(...)

        with pipeline.span(trace, "fm_invocation"):
            response = invoke_fm(prompt)

        with pipeline.span(trace, "schema_validation"):
            valid = validate_schema(response)
            trace.schema_valid = valid

        pipeline.finish_trace(trace)
    """

    def __init__(self, cloudwatch_client=None):
        self.cloudwatch_client = cloudwatch_client
        self._active_traces: dict = {}

    def start_trace(
        self,
        trace_id: str,
        session_id: str,
        intent: str,
        template_name: str,
        template_version: str,
    ) -> PromptExecutionTrace:
        trace = PromptExecutionTrace(
            trace_id=trace_id,
            session_id=session_id,
            intent=intent,
            template_name=template_name,
            template_version=template_version,
        )
        self._active_traces[trace_id] = trace
        return trace

    class _SpanContext:
        """Context manager for tracing a span."""
        def __init__(self, trace: PromptExecutionTrace, span_name: str):
            self.trace = trace
            self.span_name = span_name
            self.span: Optional[SpanRecord] = None

        def __enter__(self):
            self.span = self.trace.add_span(self.span_name)
            return self.span

        def __exit__(self, exc_type, exc_val, exc_tb):
            error = str(exc_val) if exc_val else None
            self.span.close(error=error)
            return False  # Don't suppress exceptions

    def span(self, trace: PromptExecutionTrace, span_name: str):
        return self._SpanContext(trace, span_name)

    def finish_trace(self, trace: PromptExecutionTrace):
        trace.close()
        self._active_traces.pop(trace.trace_id, None)

        # Emit CloudWatch metrics
        self._emit_metrics(trace)

        # Log structured trace data
        logger.info(json.dumps({
            "log_type": "prompt_trace",
            "trace_id": trace.trace_id,
            "intent": trace.intent,
            "template_name": trace.template_name,
            "template_version": trace.template_version,
            "total_duration_ms": round(trace.total_duration_ms, 2),
            "schema_valid": trace.schema_valid,
            "hallucination_detected": trace.hallucination_detected,
            "prompt_confusion": trace.prompt_confusion,
            "annotations": trace.to_xray_annotations(),
        }))

    def _emit_metrics(self, trace: PromptExecutionTrace):
        if not self.cloudwatch_client:
            return

        dimensions = [
            {"Name": "Intent", "Value": trace.intent},
            {"Name": "TemplateName", "Value": trace.template_name},
        ]

        metric_data = [
            {
                "MetricName": "PromptExecutionDuration",
                "Value": trace.total_duration_ms,
                "Unit": "Milliseconds",
                "Dimensions": dimensions,
            },
        ]

        # Report schema validation only when it actually ran
        # (schema_valid is None when validation was skipped)
        if trace.schema_valid is not None:
            metric_data.append({
                "MetricName": "SchemaValidationResult",
                "Value": 1.0 if trace.schema_valid else 0.0,
                "Unit": "None",
                "Dimensions": dimensions,
            })

        if trace.hallucination_detected:
            metric_data.append({
                "MetricName": "HallucinationDetected",
                "Value": 1.0,
                "Unit": "Count",
                "Dimensions": dimensions,
            })

        if trace.prompt_confusion:
            metric_data.append({
                "MetricName": "PromptConfusion",
                "Value": 1.0,
                "Unit": "Count",
                "Dimensions": dimensions,
            })

        # Span-level latency
        for s in trace.spans:
            if s.duration_ms is not None:
                metric_data.append({
                    "MetricName": f"Span_{s.name}_Duration",
                    "Value": s.duration_ms,
                    "Unit": "Milliseconds",
                    "Dimensions": dimensions,
                })

        self.cloudwatch_client.put_metric_data(
            Namespace="MangaAssist/PromptMaintenance",
            MetricData=metric_data,
        )
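The `_SpanContext` pattern the pipeline uses condenses to a few lines with `contextlib`. This standalone sketch records per-phase wall-clock timings into a plain dict, which is the essence of what each span contributes to the trace:

```python
import time
from contextlib import contextmanager

# Condensed version of the span pattern above: time a named phase and
# record its duration (in ms) into a shared dict.
@contextmanager
def span(timings: dict, name: str):
    start = time.time()
    try:
        yield
    finally:
        timings[name] = (time.time() - start) * 1000  # milliseconds

timings = {}
with span(timings, "prompt_assembly"):
    prompt = "System: ...\nUser: recommend a manga"
with span(timings, "schema_validation"):
    valid = prompt.startswith("System:")
```

The `try/finally` matters: the duration is recorded even if the phase raises, which is how the full pipeline attaches errors to spans without losing timing data.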

2. Production Schema Validator

Validates every FM response against the expected output schema in production. This is not the test-time validator from file 03 — this runs on every live request and must be fast and resilient.

import json
import re
import time
import logging
from dataclasses import dataclass, field
from typing import Optional

logger = logging.getLogger("mangaassist.prompt_maintenance")


@dataclass
class SchemaValidationResult:
    valid: bool
    template_name: str
    errors: list = field(default_factory=list)
    warnings: list = field(default_factory=list)
    repair_applied: bool = False
    validation_time_ms: float = 0.0


class ProductionSchemaValidator:
    """Fast, resilient schema validation for production FM responses.

    Design decisions:
    - Validates structure, not content (content quality is a separate concern)
    - Repairs common issues (trailing commas, code fences) rather than rejecting
    - Logs violations but does not block user-visible response (defense in depth)
    - Validation must complete in < 5ms to not impact latency
    """

    # Schema definitions per template
    TEMPLATE_SCHEMAS = {
        "product_recommendation": {
            "required_fields": ["response_text"],
            "optional_fields": ["products", "follow_up_suggestions"],
            "product_schema": {
                "required": ["asin", "title"],
                "optional": ["price", "rating", "reason"],
            },
        },
        "order_status": {
            "required_fields": ["response_text"],
            "optional_fields": ["order_details", "actions"],
        },
        "return_refund": {
            "required_fields": ["response_text"],
            "optional_fields": ["return_steps", "policy_reference"],
        },
        "manga_series_info": {
            "required_fields": ["response_text"],
            "optional_fields": ["series_info", "volumes", "related_series"],
        },
        "general_support": {
            "required_fields": ["response_text"],
            "optional_fields": ["actions", "escalation_needed"],
        },
    }

    def validate(self, response_text: str, template_name: str) -> SchemaValidationResult:
        """Validate FM response against the expected schema for its template."""
        start = time.time()

        result = SchemaValidationResult(valid=True, template_name=template_name)

        schema = self.TEMPLATE_SCHEMAS.get(template_name)
        if not schema:
            result.warnings.append(f"No schema defined for template '{template_name}'")
            result.validation_time_ms = (time.time() - start) * 1000
            return result

        # Attempt to parse as JSON
        parsed = self._try_parse_json(response_text)

        if parsed is None:
            # Not JSON — check if template expects JSON
            if schema.get("required_fields"):
                result.warnings.append(
                    "Response is plain text, not structured JSON. "
                    "Schema validation skipped for non-JSON response."
                )
            result.validation_time_ms = (time.time() - start) * 1000
            return result

        # Check required fields
        for field_name in schema.get("required_fields", []):
            if field_name not in parsed:
                result.errors.append(f"Missing required field: '{field_name}'")
                result.valid = False
            elif not parsed[field_name]:
                result.warnings.append(f"Required field '{field_name}' is empty")

        # Check product schema if applicable
        products = parsed.get("products", [])
        if products and "product_schema" in schema:
            p_schema = schema["product_schema"]
            for i, product in enumerate(products):
                if not isinstance(product, dict):
                    result.errors.append(f"products[{i}] is not an object")
                    result.valid = False
                    continue
                for req_field in p_schema.get("required", []):
                    if req_field not in product:
                        result.errors.append(f"products[{i}] missing required field '{req_field}'")
                        result.valid = False

        # Check for unexpected fields (schema drift signal)
        all_known_fields = set(schema.get("required_fields", []) + schema.get("optional_fields", []))
        unexpected = set(parsed.keys()) - all_known_fields
        if unexpected:
            result.warnings.append(f"Unexpected fields detected (schema drift?): {unexpected}")

        result.validation_time_ms = (time.time() - start) * 1000
        return result

    def _try_parse_json(self, text: str) -> Optional[dict]:
        """Try to parse JSON from response text, repairing common issues."""
        try:
            return json.loads(text)
        except (json.JSONDecodeError, TypeError):
            pass

        # Strip markdown code fences if present, then retry on the inner text
        match = re.search(r'```(?:json)?\s*([\s\S]*?)\s*```', text)
        if match:
            text = match.group(1)
            try:
                return json.loads(text)
            except (json.JSONDecodeError, TypeError):
                pass

        # Remove trailing commas (applied after fence stripping so the
        # two repairs can combine)
        cleaned = re.sub(r',\s*([}\]])', r'\1', text)
        try:
            return json.loads(cleaned)
        except (json.JSONDecodeError, TypeError):
            return None
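The repair path can be exercised in isolation. This is a condensed, standalone sketch of the same strategy (fence stripping, then trailing-comma removal):

```python
import json
import re

# Standalone sketch of the repair strategy in _try_parse_json:
# strip markdown code fences if present, remove trailing commas, then parse.
def repair_and_parse(text: str):
    try:
        return json.loads(text)  # fast path: already valid JSON
    except (json.JSONDecodeError, TypeError):
        pass
    match = re.search(r'```(?:json)?\s*([\s\S]*?)\s*```', text)
    if match:
        text = match.group(1)  # repair 1: strip markdown fences
    cleaned = re.sub(r',\s*([}\]])', r'\1', text)  # repair 2: trailing commas
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None
```

Applying the comma repair after fence stripping lets the two fixes stack, which matters because FMs often emit both artifacts in the same response.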

3. Prompt Health Checker

Aggregates multiple quality signals into a single health score per prompt template. Used for dashboarding and automated regression detection.

import time
import json
import logging
from dataclasses import dataclass, field
from typing import Optional
from datetime import datetime

logger = logging.getLogger("mangaassist.prompt_maintenance")


@dataclass
class PromptHealthSnapshot:
    """Health metrics for one prompt template at a point in time."""
    template_name: str
    template_version: str
    timestamp: str

    # Volume
    request_count: int = 0

    # Quality metrics (0.0 to 1.0)
    schema_compliance_rate: float = 1.0
    hallucination_rate: float = 0.0
    confusion_rate: float = 0.0
    error_rate: float = 0.0

    # Performance metrics
    avg_latency_ms: float = 0.0
    p95_latency_ms: float = 0.0

    # Composite health score
    health_score: float = 1.0
    status: str = "healthy"  # healthy, degraded, critical
    issues: list = field(default_factory=list)

    def compute_health(self):
        """Compute composite health score from individual metrics."""
        # Latency gets full credit under a 3 s p95, scaling linearly to zero at 10 s
        latency_score = (
            1.0 if self.p95_latency_ms < 3000
            else max(0.0, 1.0 - (self.p95_latency_ms - 3000) / 7000)
        )
        weights = {
            "schema_compliance": (self.schema_compliance_rate, 0.25),
            "hallucination": (1.0 - self.hallucination_rate, 0.30),
            "confusion": (1.0 - self.confusion_rate, 0.20),
            "error": (1.0 - self.error_rate, 0.15),
            "latency": (latency_score, 0.10),
        }

        self.health_score = sum(score * weight for score, weight in weights.values())

        if self.health_score >= 0.9:
            self.status = "healthy"
        elif self.health_score >= 0.7:
            self.status = "degraded"
        else:
            self.status = "critical"

        # Issue diagnosis
        if self.schema_compliance_rate < 0.95:
            self.issues.append(f"Schema compliance {self.schema_compliance_rate:.1%} — check for format drift")
        if self.hallucination_rate > 0.03:
            self.issues.append(f"Hallucination rate {self.hallucination_rate:.1%} — review grounding instructions")
        if self.confusion_rate > 0.05:
            self.issues.append(f"Confusion rate {self.confusion_rate:.1%} — check intent disambiguation")
        if self.error_rate > 0.02:
            self.issues.append(f"Error rate {self.error_rate:.1%} — check FM integration health")


class PromptHealthChecker:
    """Monitors prompt template health using statistical process control.

    Uses rolling window statistics to detect regression:
    - If current metric is > 2 standard deviations from the rolling mean, flag as anomaly
    - If anomaly persists for > 4 hours, escalate to "degraded"
    - If health_score drops below 0.7, escalate to "critical"
    """

    def __init__(self, window_size: int = 24):  # 24 hourly snapshots
        self.window_size = window_size
        self.history: dict = {}  # template_name -> list of PromptHealthSnapshot

    def record_snapshot(self, snapshot: PromptHealthSnapshot):
        """Record a health snapshot and check for regression."""
        snapshot.compute_health()

        if snapshot.template_name not in self.history:
            self.history[snapshot.template_name] = []

        history = self.history[snapshot.template_name]
        history.append(snapshot)

        # Keep only the rolling window
        if len(history) > self.window_size:
            history[:] = history[-self.window_size:]

        # Check for regression
        regression = self._check_regression(snapshot.template_name)
        if regression:
            snapshot.issues.extend(regression)
            logger.warning(
                "Prompt regression detected for '%s': %s",
                snapshot.template_name, regression,
            )

        logger.info(json.dumps({
            "log_type": "prompt_health",
            "template_name": snapshot.template_name,
            "template_version": snapshot.template_version,
            "health_score": round(snapshot.health_score, 3),
            "status": snapshot.status,
            "schema_compliance_rate": snapshot.schema_compliance_rate,
            "hallucination_rate": snapshot.hallucination_rate,
            "confusion_rate": snapshot.confusion_rate,
            "issues": snapshot.issues,
        }))

        return snapshot

    def _check_regression(self, template_name: str) -> list:
        """Check if current metrics are outside normal range using SPC."""
        history = self.history.get(template_name, [])
        if len(history) < 5:
            return []  # Not enough data for statistical comparison

        current = history[-1]
        baseline = history[:-1]  # Everything except current

        regressions = []

        # Check each metric against baseline
        metrics_to_check = [
            ("health_score", [h.health_score for h in baseline], current.health_score, "lower"),
            ("schema_compliance_rate", [h.schema_compliance_rate for h in baseline], current.schema_compliance_rate, "lower"),
            ("hallucination_rate", [h.hallucination_rate for h in baseline], current.hallucination_rate, "higher"),
            ("error_rate", [h.error_rate for h in baseline], current.error_rate, "higher"),
        ]

        for metric_name, baseline_values, current_value, direction in metrics_to_check:
            mean = sum(baseline_values) / len(baseline_values)
            variance = sum((v - mean) ** 2 for v in baseline_values) / len(baseline_values)
            std_dev = variance ** 0.5

            if std_dev == 0:
                continue  # No variance — cannot detect anomaly

            z_score = (current_value - mean) / std_dev

            if direction == "lower" and z_score < -2.0:
                regressions.append(
                    f"{metric_name} dropped to {current_value:.3f} "
                    f"(baseline mean={mean:.3f}, z={z_score:.1f})"
                )
            elif direction == "higher" and z_score > 2.0:
                regressions.append(
                    f"{metric_name} spiked to {current_value:.3f} "
                    f"(baseline mean={mean:.3f}, z={z_score:.1f})"
                )

        return regressions

    def get_health_report(self, template_name: str) -> dict:
        """Get current health status and recent trend for a template."""
        history = self.history.get(template_name, [])
        if not history:
            return {"template_name": template_name, "status": "unknown", "message": "No data"}

        current = history[-1]

        # Trend over last N snapshots
        recent = history[-min(5, len(history)):]
        trend = {
            "health_scores": [round(h.health_score, 3) for h in recent],
            "trending": "stable",
        }

        if len(recent) >= 3:
            first_half = sum(h.health_score for h in recent[:len(recent)//2]) / (len(recent)//2)
            second_half = sum(h.health_score for h in recent[len(recent)//2:]) / (len(recent) - len(recent)//2)

            if second_half < first_half - 0.05:
                trend["trending"] = "declining"
            elif second_half > first_half + 0.05:
                trend["trending"] = "improving"

        return {
            "template_name": template_name,
            "current_status": current.status,
            "current_health_score": round(current.health_score, 3),
            "issues": current.issues,
            "trend": trend,
        }
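The 2σ rule that `_check_regression` applies per metric can be tried standalone. A minimal sketch:

```python
# Standalone sketch of the SPC rule in _check_regression: flag a metric
# when it moves more than 2 standard deviations from the baseline mean,
# in the direction that indicates a problem for that metric.
def is_regression(baseline: list, current: float, direction: str) -> bool:
    mean = sum(baseline) / len(baseline)
    std = (sum((v - mean) ** 2 for v in baseline) / len(baseline)) ** 0.5
    if std == 0:
        return False  # no variance in baseline, cannot detect an anomaly
    z = (current - mean) / std
    return z < -2.0 if direction == "lower" else z > 2.0
```

A tight baseline (e.g. schema compliance hovering around 95-96%) makes even a small absolute drop statistically loud, which is exactly the "slowly, then suddenly" degradation this check is meant to catch early.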

4. Prompt Confusion Detector

Detects when the intent classifier sends queries to the wrong prompt template, or when templates produce overlapping/contradictory responses for similar intents.

import json
import logging
from dataclasses import dataclass, field
from typing import Optional
from collections import defaultdict

logger = logging.getLogger("mangaassist.prompt_maintenance")


@dataclass
class ConfusionEntry:
    """A single instance of prompt confusion."""
    query: str
    classified_intent: str
    confidence: float
    second_intent: Optional[str] = None
    second_confidence: Optional[float] = None
    user_feedback: Optional[str] = None  # "helpful" or "not_helpful"


@dataclass
class ConfusionPair:
    """Two intents that are frequently confused."""
    intent_a: str
    intent_b: str
    confusion_count: int = 0
    sample_queries: list = field(default_factory=list)
    avg_confidence_gap: float = 0.0


class PromptConfusionDetector:
    """Detects and analyzes intent classification confusion.

    Confusion happens when:
    1. Classifier confidence is low (< 0.85) — model is uncertain
    2. Two intents have close confidence scores (gap < 0.2) — model can't decide
    3. User gives negative feedback after response — wrong template was used

    This data drives targeted prompt refinement: which templates need clearer differentiation.
    """

    CONFIDENCE_THRESHOLD = 0.85
    CONFIDENCE_GAP_THRESHOLD = 0.20

    def __init__(self):
        self.entries: list = []
        self._confusion_pairs: dict = defaultdict(lambda: ConfusionPair("", ""))

    def record(
        self,
        query: str,
        classified_intent: str,
        confidence: float,
        all_intents: Optional[dict] = None,  # intent -> confidence
        user_feedback: Optional[str] = None,
    ):
        """Record an intent classification event for confusion analysis."""

        entry = ConfusionEntry(
            query=query,
            classified_intent=classified_intent,
            confidence=confidence,
            user_feedback=user_feedback,
        )

        # Find second-best intent
        if all_intents:
            sorted_intents = sorted(all_intents.items(), key=lambda x: x[1], reverse=True)
            if len(sorted_intents) >= 2:
                entry.second_intent = sorted_intents[1][0]
                entry.second_confidence = sorted_intents[1][1]

        # Determine if this is a confusion case
        is_confused = False

        if confidence < self.CONFIDENCE_THRESHOLD:
            is_confused = True

        if entry.second_confidence is not None:
            gap = confidence - entry.second_confidence
            if gap < self.CONFIDENCE_GAP_THRESHOLD:
                is_confused = True

        if user_feedback == "not_helpful":
            is_confused = True

        if is_confused:
            self.entries.append(entry)

            # Track confusion pair
            if entry.second_intent:
                pair_key = tuple(sorted([classified_intent, entry.second_intent]))
                pair = self._confusion_pairs[pair_key]
                pair.intent_a = pair_key[0]
                pair.intent_b = pair_key[1]
                pair.confusion_count += 1
                if len(pair.sample_queries) < 10:
                    pair.sample_queries.append(query)

    def get_confusion_report(self, min_count: int = 5) -> dict:
        """Generate a confusion analysis report for prompt refinement."""

        # Top confused pairs
        all_pairs = sorted(
            self._confusion_pairs.values(),
            key=lambda p: p.confusion_count,
            reverse=True,
        )
        top_pairs = [p for p in all_pairs if p.confusion_count >= min_count]

        # Confused-query count by classified intent (the detector records
        # only confused cases, so a true per-intent confusion *rate* needs
        # total volume from the classifier's own metrics)
        confused_by_intent = defaultdict(int)
        for entry in self.entries:
            confused_by_intent[entry.classified_intent] += 1

        report = {
            "total_confused_queries": len(self.entries),
            "confused_by_intent": dict(confused_by_intent),
            "top_confused_pairs": [
                {
                    "intent_a": p.intent_a,
                    "intent_b": p.intent_b,
                    "count": p.confusion_count,
                    "sample_queries": p.sample_queries[:3],
                }
                for p in top_pairs[:5]
            ],
            "recommendations": [],
        }

        # Generate actionable recommendations
        for pair in top_pairs[:3]:
            report["recommendations"].append(
                f"Intents '{pair.intent_a}' and '{pair.intent_b}' are confused "
                f"{pair.confusion_count} times. Action: Add discriminating examples to classifier "
                f"training data and add explicit disambiguation to both prompt templates."
            )

        return report
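The three confusion rules condense into a single predicate. A standalone sketch whose default thresholds mirror `CONFIDENCE_THRESHOLD` and `CONFIDENCE_GAP_THRESHOLD`:

```python
from typing import Optional

# The three confusion signals from PromptConfusionDetector as one predicate.
def is_confused(confidence: float,
                second_confidence: Optional[float] = None,
                user_feedback: Optional[str] = None,
                threshold: float = 0.85,
                gap: float = 0.20) -> bool:
    if confidence < threshold:
        return True  # rule 1: classifier is uncertain
    if second_confidence is not None and confidence - second_confidence < gap:
        return True  # rule 2: two intents are nearly tied
    return user_feedback == "not_helpful"  # rule 3: wrong template, per the user
```

Note that rule 2 can fire even at high absolute confidence: a 0.90 vs 0.80 split still means the classifier nearly chose the other template.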

5. MangaAssist Scenarios

Scenario A: Seasonal Prompt Drift During Anime Season

Context: During the fall anime season, user queries shift from "recommend manga like X" to "what manga is the anime Y based on?" The existing recommendation prompt doesn't handle anime-to-manga mapping well.

Detection: Prompt health checker detects confusion_rate spike to 12% (baseline: 3%). Confusion detector shows high confusion between manga_series_info and product_recommendation intents.

Root Cause: The intent classifier was trained on manga-centric queries. Anime-related queries overlap both intents. The product recommendation prompt tries to recommend manga volumes when the user wants series relationship information.

Resolution:

  1. Short-term: Add explicit disambiguation to both prompts: "If the user mentions an anime title, check if they want the source manga (series_info) or to buy volumes (recommendation)"
  2. Medium-term: Add anime-to-manga mapping data to the RAG knowledge base
  3. Long-term: Create a seasonal prompt variant scheduler that activates anime-season-specific prompt tweaks in October

Scenario B: New Product Category Handling

Context: Amazon starts selling manhwa (Korean comics) alongside manga. The existing prompts are manga-specific and produce awkward responses for manhwa queries ("This manga series..." when the user asked about a manhwa).

Detection: Schema validator catches an unexpected content_type: "manhwa" field appearing in 8% of responses. Health checker flags schema compliance drop.

Root Cause: The FM correctly identifies manhwa content but the prompt template forces manga terminology.

Resolution: Update the system prompt to be medium-aware:

  • Before: "You are MangaAssist, an expert on Japanese manga..."
  • After: "You are MangaAssist, an expert on manga, manhwa, and related graphic novel formats. Identify the content type accurately and use appropriate terminology."
  • Run the golden test suite with 10 new manhwa test cases → pass rate 95%

Scenario C: Multilingual Prompt Maintenance

Context: MangaAssist serves both English and Japanese-speaking customers. The Japanese system prompt was translated from English by a contractor 6 months ago. Japanese response quality has degraded — schema compliance for Japanese responses is 87% vs 97% for English.

Detection: Health checker shows diverging health scores between EN and JP prompt variants. Schema validator reports higher error rates for JP responses.

Root Cause: Two compounding issues:

  1. The Japanese prompt uses formal keigo (敬語) phrasing that Claude interprets differently than the casual register expected
  2. The Japanese JSON field names use inconsistent romanization, confusing the schema validator

Resolution:

  1. Rewrite the Japanese prompt with native speaker review, matching the instructional clarity of the English version without direct translation
  2. Standardize the JSON schema to use English field names for both languages (the response text is in the user's language, but the JSON structure is language-neutral)
  3. Add JP-specific golden test cases (15 cases covering honorifics, series title formats, and mixed JP/EN content)


6. CloudWatch Dashboard and Alerts

Metric | Alarm Threshold | Action
------ | --------------- | ------
PromptHealthScore | < 0.7 for 2 consecutive hours | Page: prompt is degraded
SchemaComplianceRate | < 95% for 1 hour | Alert: format drift
HallucinationRate | > 3% for 30 min | Alert: grounding failure
ConfusionRate | > 5% for 4 hours | Warn: intent disambiguation needed
PromptExecutionDuration_P95 | > 4000 ms | Warn: latency regression
PromptHealthTrend | "declining" for 3+ snapshots | Warn: slow degradation
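The PromptHealthScore alarm can be expressed as keyword arguments for boto3's `put_metric_alarm`. This is a sketch assuming the health checker also publishes `PromptHealthScore` to CloudWatch hourly; the SNS topic ARN is a placeholder:

```python
# Build the kwargs for one alarm row: PromptHealthScore < 0.7 for
# 2 consecutive hourly periods. Pass the result to
# boto3.client("cloudwatch").put_metric_alarm(**kwargs).
def health_score_alarm(template_name: str, topic_arn: str) -> dict:
    return {
        "AlarmName": f"PromptHealthScore-{template_name}-degraded",
        "Namespace": "MangaAssist/PromptMaintenance",
        "MetricName": "PromptHealthScore",
        "Dimensions": [{"Name": "TemplateName", "Value": template_name}],
        "Statistic": "Average",
        "Period": 3600,                   # one hourly snapshot per period
        "EvaluationPeriods": 2,           # 2 consecutive hours, per the table
        "Threshold": 0.7,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "notBreaching",  # quiet hours should not page
        "AlarmActions": [topic_arn],
    }
```

`TreatMissingData: notBreaching` is a deliberate choice here: low-traffic templates with no hourly snapshot should not page anyone.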

CloudWatch Logs Insights Queries

Prompt health trend by template:

fields @timestamp, template_name, health_score, status, issues
| filter log_type = "prompt_health"
| sort @timestamp desc
| limit 50

Schema violations by template:

fields @timestamp, template_name, error_count, warnings
| filter log_type = "schema_validation" and valid = 0
| stats count(*) as violations by template_name
| sort violations desc

Intent confusion pairs:

fields @timestamp, classified_intent, second_intent, confidence, second_confidence, query
| filter log_type = "intent_confusion" and confidence < 0.85
| stats count(*) as confused_count by classified_intent, second_intent
| sort confused_count desc
| limit 10

X-Ray trace latency breakdown by span:

fields @timestamp, trace_id, intent, template_name
| filter log_type = "prompt_trace"
| stats avg(total_duration_ms) as avg_total,
        avg(span_prompt_assembly_ms) as avg_assembly,
        avg(span_fm_invocation_ms) as avg_fm,
        avg(span_schema_validation_ms) as avg_validation
  by template_name


Intuition Gained

What Mental Model You Build

Prompt maintenance troubleshooting teaches you to treat prompts as living production artifacts — not static configuration files, but systems that degrade, drift, and require operational discipline.

1. The Drift Detection Instinct: You learn that prompt quality degrades slowly, then suddenly. A 1% schema compliance drop this week is a 15% drop next month if ignored. You build the habit of watching trend lines, not just current values. Statistical process control becomes second nature — if a metric moves > 2σ from baseline, you investigate before it becomes a user-visible problem.

2. The Multi-Language Operational Instinct: You learn that each language variant of a prompt is an independent system. English quality says nothing about Japanese quality. Each variant needs its own golden test set, its own health metrics, and its own maintenance cycle. You stop treating localization as "just translation" and start treating it as "parallel deployment."

3. The Seasonal Awareness Instinct: You develop a calendar sense for when prompts will break. Anime season, holiday shopping, new product launches — these are not just marketing events, they are prompt failure risk windows. You proactively schedule prompt reviews before these events, not after users complain.

How This Intuition Guides Future Decisions

  • When launching a new prompt template: You set up health monitoring, schema validation, and confusion detection before writing the first version of the prompt. The observability pipeline is prerequisite infrastructure, not a post-launch enhancement.
  • When someone reports "the chatbot feels different": You pull the prompt health dashboard and check the trend. Vague quality complaints often correspond to slow metric drift that is invisible in individual conversations but obvious in aggregate.
  • When planning a model upgrade: You know that the prompt maintenance work begins after the model swap, not before. New model versions shift response distributions, and every prompt template needs re-validation. You budget time for prompt tuning in every model upgrade plan.
  • When building multi-language support: You design the prompt architecture for independent language maintenance from day one. Shared structure (schema, tool definitions), independent content (system prompt, few-shot examples), and per-language golden test suites.