05: Prompt Maintenance Troubleshooting
AIP-C01 Mapping
Task 5.2 → Skill 5.2.5: Maintain and improve prompt performance over time using template testing, X-Ray observability, schema validation, systematic refinement, and prompt confusion diagnosis.
User Story
As an SRE on the MangaAssist team, I want to continuously monitor, validate, and maintain prompt templates in production, so that degradation is detected early, schema drift is caught automatically, prompt refinement follows a measured, repeatable process, and prompt quality stays stable over time even as the FM, data sources, and user behavior evolve.
Acceptance Criteria
- Every prompt template execution is traced end-to-end with X-Ray, including prompt assembly, FM invocation, and response parsing spans
- Schema validation runs on every FM response in production, catching format drift before it reaches the user
- Prompt health metrics (format compliance, hallucination rate, latency, error rate) are dashboarded per template per day
- Template regression is detected within 4 hours using statistical process control on response quality metrics
- Prompt confusion (ambiguous classification between intents) is tracked and resolved through targeted prompt refinement
- Seasonal prompt maintenance (holidays, sales events, new product launches) follows a documented pre-launch checklist
High-Level Design
Prompt Maintenance Lifecycle
graph LR
A[Deploy Prompt<br>v1.0] --> B[Monitor<br>Production Metrics]
B --> C{Drift<br>Detected?}
C -->|No| B
C -->|Yes| D[Diagnose:<br>What changed?]
D --> E{Root Cause?}
E -->|FM behavior shift| F[Prompt refinement<br>cycle]
E -->|Data distribution shift| G[Context/RAG<br>adjustment]
E -->|Schema drift| H[Output format<br>fix + validation]
E -->|Seasonal pattern| I[Scheduled prompt<br>variant activation]
F --> J[Test + Compare]
G --> J
H --> J
I --> J
J --> K{Pass?}
K -->|Yes| L[Deploy prompt<br>v1.1]
K -->|No| F
L --> B
Prompt Observability Architecture
flowchart TD
A[User Request] --> B[Orchestrator]
subgraph XRay["X-Ray Trace (single request)"]
B --> C[Span: Intent<br>Classification]
C --> D[Span: Prompt<br>Assembly]
D --> E[Span: Context<br>Retrieval]
E --> F[Span: FM<br>Invocation]
F --> G[Span: Response<br>Validation]
G --> H[Span: Schema<br>Check]
end
H --> I[Response]
subgraph Metrics["CloudWatch Metrics"]
J[PromptAssemblyLatency]
K[FMInvocationLatency]
L[SchemaValidationPass/Fail]
M[PromptHealthScore]
end
D -.-> J
F -.-> K
H -.-> L
G -.-> M
Prompt Confusion Detection Flow
stateDiagram-v2
[*] --> IntentClassification
IntentClassification --> HighConfidence: confidence ≥ 0.85
IntentClassification --> LowConfidence: 0.5 ≤ confidence < 0.85
IntentClassification --> Ambiguous: confidence < 0.5
HighConfidence --> CorrectPrompt: Use intent-specific template
LowConfidence --> ConfusionLog: Log for analysis
LowConfidence --> FallbackPrompt: Use general template
Ambiguous --> ConfusionLog
Ambiguous --> FallbackPrompt
ConfusionLog --> WeeklyAnalysis: Aggregate confusion patterns
WeeklyAnalysis --> PromptRefinement: Fix top-3 confused pairs
PromptRefinement --> IntentClassification: Retrain / update
Low-Level Design
1. X-Ray Prompt Observability Pipeline
Every prompt execution is a distributed trace. X-Ray segments give timing breakdowns, and annotations make traces searchable by intent, template version, and quality signals.
import time
import json
import logging
from dataclasses import dataclass, field
from typing import Optional
logger = logging.getLogger("mangaassist.prompt_maintenance")
# Lightweight X-Ray integration — wraps boto3 X-Ray SDK patterns
# without requiring the full SDK import for this reference code
@dataclass
class SpanRecord:
"""Record of a single span within a prompt execution trace."""
name: str
start_time: float
end_time: Optional[float] = None
duration_ms: Optional[float] = None
annotations: dict = field(default_factory=dict)
metadata: dict = field(default_factory=dict)
error: Optional[str] = None
    def close(self, error: Optional[str] = None):
self.end_time = time.time()
self.duration_ms = (self.end_time - self.start_time) * 1000
self.error = error
@dataclass
class PromptExecutionTrace:
"""Complete trace of a single prompt execution, from assembly to response validation."""
trace_id: str
session_id: str
intent: str
template_name: str
template_version: str
spans: list = field(default_factory=list)
total_duration_ms: float = 0.0
# Quality annotations (searchable in X-Ray)
schema_valid: Optional[bool] = None
hallucination_detected: bool = False
prompt_confusion: bool = False
response_truncated: bool = False
def add_span(self, name: str) -> SpanRecord:
span = SpanRecord(name=name, start_time=time.time())
self.spans.append(span)
return span
def close(self):
if self.spans:
self.total_duration_ms = sum(
s.duration_ms for s in self.spans if s.duration_ms is not None
)
def to_xray_annotations(self) -> dict:
"""Annotations that make this trace searchable in X-Ray console."""
return {
"intent": self.intent,
"template_name": self.template_name,
"template_version": self.template_version,
"schema_valid": self.schema_valid,
"hallucination_detected": self.hallucination_detected,
"prompt_confusion": self.prompt_confusion,
}
def to_xray_metadata(self) -> dict:
"""Metadata for detailed investigation (not searchable but visible)."""
return {
"span_timings": {
s.name: {"duration_ms": s.duration_ms, "error": s.error}
for s in self.spans
},
"total_duration_ms": self.total_duration_ms,
}
class PromptObservabilityPipeline:
"""Instruments prompt execution for X-Ray tracing.
Usage:
pipeline = PromptObservabilityPipeline()
trace = pipeline.start_trace(session_id, intent, template_name, template_version)
with pipeline.span(trace, "prompt_assembly"):
prompt = assemble_prompt(...)
with pipeline.span(trace, "fm_invocation"):
response = invoke_fm(prompt)
with pipeline.span(trace, "schema_validation"):
valid = validate_schema(response)
trace.schema_valid = valid
pipeline.finish_trace(trace)
"""
def __init__(self, cloudwatch_client=None):
self.cloudwatch_client = cloudwatch_client
self._active_traces: dict = {}
def start_trace(
self,
trace_id: str,
session_id: str,
intent: str,
template_name: str,
template_version: str,
) -> PromptExecutionTrace:
trace = PromptExecutionTrace(
trace_id=trace_id,
session_id=session_id,
intent=intent,
template_name=template_name,
template_version=template_version,
)
self._active_traces[trace_id] = trace
return trace
class _SpanContext:
"""Context manager for tracing a span."""
def __init__(self, trace: PromptExecutionTrace, span_name: str):
self.trace = trace
self.span_name = span_name
self.span: Optional[SpanRecord] = None
def __enter__(self):
self.span = self.trace.add_span(self.span_name)
return self.span
def __exit__(self, exc_type, exc_val, exc_tb):
error = str(exc_val) if exc_val else None
self.span.close(error=error)
return False # Don't suppress exceptions
def span(self, trace: PromptExecutionTrace, span_name: str):
return self._SpanContext(trace, span_name)
def finish_trace(self, trace: PromptExecutionTrace):
trace.close()
self._active_traces.pop(trace.trace_id, None)
# Emit CloudWatch metrics
self._emit_metrics(trace)
# Log structured trace data
logger.info(json.dumps({
"log_type": "prompt_trace",
"trace_id": trace.trace_id,
"intent": trace.intent,
"template_name": trace.template_name,
"template_version": trace.template_version,
"total_duration_ms": round(trace.total_duration_ms, 2),
"schema_valid": trace.schema_valid,
"hallucination_detected": trace.hallucination_detected,
"prompt_confusion": trace.prompt_confusion,
"annotations": trace.to_xray_annotations(),
}))
def _emit_metrics(self, trace: PromptExecutionTrace):
if not self.cloudwatch_client:
return
dimensions = [
{"Name": "Intent", "Value": trace.intent},
{"Name": "TemplateName", "Value": trace.template_name},
]
metric_data = [
{
"MetricName": "PromptExecutionDuration",
"Value": trace.total_duration_ms,
"Unit": "Milliseconds",
"Dimensions": dimensions,
            },
        ]
        # Only report schema validation when it actually ran for this trace
        if trace.schema_valid is not None:
            metric_data.append({
                "MetricName": "SchemaValidationResult",
                "Value": 1.0 if trace.schema_valid else 0.0,
                "Unit": "None",
                "Dimensions": dimensions,
            })
if trace.hallucination_detected:
metric_data.append({
"MetricName": "HallucinationDetected",
"Value": 1.0,
"Unit": "Count",
"Dimensions": dimensions,
})
if trace.prompt_confusion:
metric_data.append({
"MetricName": "PromptConfusion",
"Value": 1.0,
"Unit": "Count",
"Dimensions": dimensions,
})
# Span-level latency
for s in trace.spans:
if s.duration_ms is not None:
metric_data.append({
"MetricName": f"Span_{s.name}_Duration",
"Value": s.duration_ms,
"Unit": "Milliseconds",
"Dimensions": dimensions,
})
self.cloudwatch_client.put_metric_data(
Namespace="MangaAssist/PromptMaintenance",
MetricData=metric_data,
)
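The pipeline above deliberately avoids a hard dependency on the X-Ray SDK. Where the SDK is available, the same spans and annotations map directly onto subsegments. A minimal sketch, assuming the aws_xray_sdk package, an active segment (automatic in Lambda), and hypothetical assemble_prompt / invoke_fm helpers:
# Sketch only: mapping the span/annotation model above onto the real X-Ray SDK.
# Assumes aws_xray_sdk is installed and an active segment exists (automatic in
# Lambda); assemble_prompt() and invoke_fm() are hypothetical helpers.
from aws_xray_sdk.core import xray_recorder

def traced_prompt_execution(intent, template_name, template_version, user_query):
    subsegment = xray_recorder.begin_subsegment("prompt_execution")
    try:
        # Searchable annotations (same keys as PromptExecutionTrace.to_xray_annotations)
        subsegment.put_annotation("intent", intent)
        subsegment.put_annotation("template_name", template_name)
        subsegment.put_annotation("template_version", template_version)

        xray_recorder.begin_subsegment("prompt_assembly")
        prompt = assemble_prompt(intent, template_name, user_query)  # hypothetical
        xray_recorder.end_subsegment()

        xray_recorder.begin_subsegment("fm_invocation")
        response = invoke_fm(prompt)  # hypothetical
        xray_recorder.end_subsegment()

        return response
    finally:
        xray_recorder.end_subsegment()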
2. Production Schema Validator
Validates every FM response against the expected output schema in production. This is not the test-time validator from file 03 — this runs on every live request and must be fast and resilient.
import json
import re
import time
import logging
from dataclasses import dataclass, field
from typing import Optional
logger = logging.getLogger("mangaassist.prompt_maintenance")
@dataclass
class SchemaValidationResult:
valid: bool
template_name: str
errors: list = field(default_factory=list)
warnings: list = field(default_factory=list)
repair_applied: bool = False
validation_time_ms: float = 0.0
class ProductionSchemaValidator:
"""Fast, resilient schema validation for production FM responses.
Design decisions:
- Validates structure, not content (content quality is a separate concern)
- Repairs common issues (trailing commas, code fences) rather than rejecting
- Logs violations but does not block user-visible response (defense in depth)
- Validation must complete in < 5ms to not impact latency
"""
# Schema definitions per template
TEMPLATE_SCHEMAS = {
"product_recommendation": {
"required_fields": ["response_text"],
"optional_fields": ["products", "follow_up_suggestions"],
"product_schema": {
"required": ["asin", "title"],
"optional": ["price", "rating", "reason"],
},
},
"order_status": {
"required_fields": ["response_text"],
"optional_fields": ["order_details", "actions"],
},
"return_refund": {
"required_fields": ["response_text"],
"optional_fields": ["return_steps", "policy_reference"],
},
"manga_series_info": {
"required_fields": ["response_text"],
"optional_fields": ["series_info", "volumes", "related_series"],
},
"general_support": {
"required_fields": ["response_text"],
"optional_fields": ["actions", "escalation_needed"],
},
}
def validate(self, response_text: str, template_name: str) -> SchemaValidationResult:
"""Validate FM response against the expected schema for its template."""
start = time.time()
result = SchemaValidationResult(valid=True, template_name=template_name)
schema = self.TEMPLATE_SCHEMAS.get(template_name)
if not schema:
result.warnings.append(f"No schema defined for template '{template_name}'")
result.validation_time_ms = (time.time() - start) * 1000
return result
# Attempt to parse as JSON
parsed = self._try_parse_json(response_text)
if parsed is None:
# Not JSON — check if template expects JSON
if schema.get("required_fields"):
result.warnings.append(
"Response is plain text, not structured JSON. "
"Schema validation skipped for non-JSON response."
)
result.validation_time_ms = (time.time() - start) * 1000
return result
# Check required fields
for field_name in schema.get("required_fields", []):
if field_name not in parsed:
result.errors.append(f"Missing required field: '{field_name}'")
result.valid = False
elif not parsed[field_name]:
result.warnings.append(f"Required field '{field_name}' is empty")
# Check product schema if applicable
products = parsed.get("products", [])
if products and "product_schema" in schema:
p_schema = schema["product_schema"]
for i, product in enumerate(products):
if not isinstance(product, dict):
result.errors.append(f"products[{i}] is not an object")
result.valid = False
continue
for req_field in p_schema.get("required", []):
if req_field not in product:
result.errors.append(f"products[{i}] missing required field '{req_field}'")
result.valid = False
# Check for unexpected fields (schema drift signal)
all_known_fields = set(schema.get("required_fields", []) + schema.get("optional_fields", []))
unexpected = set(parsed.keys()) - all_known_fields
if unexpected:
result.warnings.append(f"Unexpected fields detected (schema drift?): {unexpected}")
result.validation_time_ms = (time.time() - start) * 1000
return result
def _try_parse_json(self, text: str) -> Optional[dict]:
"""Try to parse JSON from response text."""
try:
return json.loads(text)
except (json.JSONDecodeError, TypeError):
pass
# Try code fences
match = re.search(r'```(?:json)?\s*([\s\S]*?)\s*```', text)
if match:
try:
return json.loads(match.group(1))
except (json.JSONDecodeError, TypeError):
pass
# Try trailing comma removal
cleaned = re.sub(r',\s*([}\]])', r'\1', text)
try:
return json.loads(cleaned)
except (json.JSONDecodeError, TypeError):
return None
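A minimal wiring sketch of how the validator slots into the per-request trace from section 1, assuming the PromptObservabilityPipeline above and hypothetical assemble_prompt / invoke_fm helpers:
# Sketch: per-request wiring of validator + observability pipeline.
# assemble_prompt() and invoke_fm() are hypothetical helpers; template names
# are assumed to mirror intent names.
import uuid
import boto3

pipeline = PromptObservabilityPipeline(cloudwatch_client=boto3.client("cloudwatch"))
validator = ProductionSchemaValidator()

def handle_request(session_id: str, intent: str, user_query: str) -> str:
    trace = pipeline.start_trace(
        trace_id=str(uuid.uuid4()),
        session_id=session_id,
        intent=intent,
        template_name=intent,
        template_version="v1.3",
    )
    with pipeline.span(trace, "prompt_assembly"):
        prompt = assemble_prompt(intent, user_query)   # hypothetical
    with pipeline.span(trace, "fm_invocation"):
        response_text = invoke_fm(prompt)              # hypothetical
    with pipeline.span(trace, "schema_validation"):
        result = validator.validate(response_text, template_name=intent)
        trace.schema_valid = result.valid
    pipeline.finish_trace(trace)
    return response_text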
3. Prompt Health Checker
Aggregates multiple quality signals into a single health score per prompt template. Used for dashboarding and automated regression detection.
import time
import json
import logging
from dataclasses import dataclass, field
from typing import Optional
from datetime import datetime
logger = logging.getLogger("mangaassist.prompt_maintenance")
@dataclass
class PromptHealthSnapshot:
"""Health metrics for one prompt template at a point in time."""
template_name: str
template_version: str
timestamp: str
# Volume
request_count: int = 0
# Quality metrics (0.0 to 1.0)
schema_compliance_rate: float = 1.0
hallucination_rate: float = 0.0
confusion_rate: float = 0.0
error_rate: float = 0.0
# Performance metrics
avg_latency_ms: float = 0.0
p95_latency_ms: float = 0.0
# Composite health score
health_score: float = 1.0
status: str = "healthy" # healthy, degraded, critical
issues: list = field(default_factory=list)
def compute_health(self):
"""Compute composite health score from individual metrics."""
weights = {
"schema_compliance": (self.schema_compliance_rate, 0.25),
"hallucination": (1.0 - self.hallucination_rate, 0.30),
"confusion": (1.0 - self.confusion_rate, 0.20),
"error": (1.0 - self.error_rate, 0.15),
"latency": (1.0 if self.p95_latency_ms < 3000 else max(0, 1.0 - (self.p95_latency_ms - 3000) / 7000), 0.10),
}
self.health_score = sum(score * weight for score, weight in weights.values())
if self.health_score >= 0.9:
self.status = "healthy"
elif self.health_score >= 0.7:
self.status = "degraded"
else:
self.status = "critical"
# Issue diagnosis
if self.schema_compliance_rate < 0.95:
self.issues.append(f"Schema compliance {self.schema_compliance_rate:.1%} — check for format drift")
if self.hallucination_rate > 0.03:
self.issues.append(f"Hallucination rate {self.hallucination_rate:.1%} — review grounding instructions")
if self.confusion_rate > 0.05:
self.issues.append(f"Confusion rate {self.confusion_rate:.1%} — check intent disambiguation")
if self.error_rate > 0.02:
self.issues.append(f"Error rate {self.error_rate:.1%} — check FM integration health")
class PromptHealthChecker:
"""Monitors prompt template health using statistical process control.
Uses rolling window statistics to detect regression:
- If current metric is > 2 standard deviations from the rolling mean, flag as anomaly
- If anomaly persists for > 4 hours, escalate to "degraded"
- If health_score drops below 0.7, escalate to "critical"
"""
def __init__(self, window_size: int = 24): # 24 hourly snapshots
self.window_size = window_size
self.history: dict = {} # template_name -> list of PromptHealthSnapshot
def record_snapshot(self, snapshot: PromptHealthSnapshot):
"""Record a health snapshot and check for regression."""
snapshot.compute_health()
if snapshot.template_name not in self.history:
self.history[snapshot.template_name] = []
history = self.history[snapshot.template_name]
history.append(snapshot)
# Keep only the rolling window
if len(history) > self.window_size:
history[:] = history[-self.window_size:]
# Check for regression
regression = self._check_regression(snapshot.template_name)
if regression:
snapshot.issues.extend(regression)
logger.warning(
"Prompt regression detected for '%s': %s",
snapshot.template_name, regression,
)
logger.info(json.dumps({
"log_type": "prompt_health",
"template_name": snapshot.template_name,
"template_version": snapshot.template_version,
"health_score": round(snapshot.health_score, 3),
"status": snapshot.status,
"schema_compliance_rate": snapshot.schema_compliance_rate,
"hallucination_rate": snapshot.hallucination_rate,
"confusion_rate": snapshot.confusion_rate,
"issues": snapshot.issues,
}))
return snapshot
def _check_regression(self, template_name: str) -> list:
"""Check if current metrics are outside normal range using SPC."""
history = self.history.get(template_name, [])
if len(history) < 5:
return [] # Not enough data for statistical comparison
current = history[-1]
baseline = history[:-1] # Everything except current
regressions = []
# Check each metric against baseline
metrics_to_check = [
("health_score", [h.health_score for h in baseline], current.health_score, "lower"),
("schema_compliance_rate", [h.schema_compliance_rate for h in baseline], current.schema_compliance_rate, "lower"),
("hallucination_rate", [h.hallucination_rate for h in baseline], current.hallucination_rate, "higher"),
("error_rate", [h.error_rate for h in baseline], current.error_rate, "higher"),
]
for metric_name, baseline_values, current_value, direction in metrics_to_check:
mean = sum(baseline_values) / len(baseline_values)
variance = sum((v - mean) ** 2 for v in baseline_values) / len(baseline_values)
std_dev = variance ** 0.5
if std_dev == 0:
continue # No variance — cannot detect anomaly
z_score = (current_value - mean) / std_dev
if direction == "lower" and z_score < -2.0:
regressions.append(
f"{metric_name} dropped to {current_value:.3f} "
f"(baseline mean={mean:.3f}, z={z_score:.1f})"
)
elif direction == "higher" and z_score > 2.0:
regressions.append(
f"{metric_name} spiked to {current_value:.3f} "
f"(baseline mean={mean:.3f}, z={z_score:.1f})"
)
return regressions
def get_health_report(self, template_name: str) -> dict:
"""Get current health status and recent trend for a template."""
history = self.history.get(template_name, [])
if not history:
return {"template_name": template_name, "status": "unknown", "message": "No data"}
current = history[-1]
# Trend over last N snapshots
recent = history[-min(5, len(history)):]
trend = {
"health_scores": [round(h.health_score, 3) for h in recent],
"trending": "stable",
}
if len(recent) >= 3:
first_half = sum(h.health_score for h in recent[:len(recent)//2]) / (len(recent)//2)
second_half = sum(h.health_score for h in recent[len(recent)//2:]) / (len(recent) - len(recent)//2)
if second_half < first_half - 0.05:
trend["trending"] = "declining"
elif second_half > first_half + 0.05:
trend["trending"] = "improving"
return {
"template_name": template_name,
"current_status": current.status,
"current_health_score": round(current.health_score, 3),
"issues": current.issues,
"trend": trend,
}
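A short usage sketch for the checker; the metric values below are illustrative and would normally come from hourly CloudWatch aggregations:
# Sketch: recording an hourly snapshot for one template. The metric values are
# illustrative; in production they come from aggregated CloudWatch statistics.
from datetime import datetime, timezone

checker = PromptHealthChecker(window_size=24)

snapshot = PromptHealthSnapshot(
    template_name="product_recommendation",
    template_version="v1.3",
    timestamp=datetime.now(timezone.utc).isoformat(),
    request_count=1850,
    schema_compliance_rate=0.97,
    hallucination_rate=0.02,
    confusion_rate=0.04,
    error_rate=0.01,
    avg_latency_ms=1400.0,
    p95_latency_ms=2800.0,
)
checker.record_snapshot(snapshot)  # computes health_score and checks for regression
print(checker.get_health_report("product_recommendation"))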
4. Prompt Confusion Detector
Detects when the intent classifier sends queries to the wrong prompt template, or when templates produce overlapping/contradictory responses for similar intents.
import json
import logging
from dataclasses import dataclass, field
from typing import Optional
from collections import defaultdict
logger = logging.getLogger("mangaassist.prompt_maintenance")
@dataclass
class ConfusionEntry:
"""A single instance of prompt confusion."""
query: str
classified_intent: str
confidence: float
second_intent: Optional[str] = None
second_confidence: Optional[float] = None
user_feedback: Optional[str] = None # "helpful" or "not_helpful"
@dataclass
class ConfusionPair:
"""Two intents that are frequently confused."""
intent_a: str
intent_b: str
confusion_count: int = 0
sample_queries: list = field(default_factory=list)
avg_confidence_gap: float = 0.0
class PromptConfusionDetector:
"""Detects and analyzes intent classification confusion.
Confusion happens when:
1. Classifier confidence is low (< 0.85) — model is uncertain
2. Two intents have close confidence scores (gap < 0.2) — model can't decide
3. User gives negative feedback after response — wrong template was used
This data drives targeted prompt refinement: which templates need clearer differentiation.
"""
CONFIDENCE_THRESHOLD = 0.85
CONFIDENCE_GAP_THRESHOLD = 0.20
def __init__(self):
self.entries: list = []
self._confusion_pairs: dict = defaultdict(lambda: ConfusionPair("", ""))
def record(
self,
query: str,
classified_intent: str,
confidence: float,
        all_intents: Optional[dict] = None,  # intent -> confidence
        user_feedback: Optional[str] = None,
):
"""Record an intent classification event for confusion analysis."""
entry = ConfusionEntry(
query=query,
classified_intent=classified_intent,
confidence=confidence,
user_feedback=user_feedback,
)
# Find second-best intent
if all_intents:
sorted_intents = sorted(all_intents.items(), key=lambda x: x[1], reverse=True)
if len(sorted_intents) >= 2:
entry.second_intent = sorted_intents[1][0]
entry.second_confidence = sorted_intents[1][1]
# Determine if this is a confusion case
is_confused = False
if confidence < self.CONFIDENCE_THRESHOLD:
is_confused = True
if entry.second_confidence is not None:
gap = confidence - entry.second_confidence
if gap < self.CONFIDENCE_GAP_THRESHOLD:
is_confused = True
if user_feedback == "not_helpful":
is_confused = True
if is_confused:
self.entries.append(entry)
# Track confusion pair
if entry.second_intent:
pair_key = tuple(sorted([classified_intent, entry.second_intent]))
pair = self._confusion_pairs[pair_key]
pair.intent_a = pair_key[0]
pair.intent_b = pair_key[1]
pair.confusion_count += 1
if len(pair.sample_queries) < 10:
pair.sample_queries.append(query)
def get_confusion_report(self, min_count: int = 5) -> dict:
"""Generate a confusion analysis report for prompt refinement."""
# Top confused pairs
all_pairs = sorted(
self._confusion_pairs.values(),
key=lambda p: p.confusion_count,
reverse=True,
)
top_pairs = [p for p in all_pairs if p.confusion_count >= min_count]
        # Confused-query count by classified intent (surfaced in the report below)
        intent_confusion = defaultdict(int)
        for entry in self.entries:
            intent_confusion[entry.classified_intent] += 1
# Overall confusion rate
total_confused = len(self.entries)
report = {
"total_confused_queries": total_confused,
"top_confused_pairs": [
{
"intent_a": p.intent_a,
"intent_b": p.intent_b,
"count": p.confusion_count,
"sample_queries": p.sample_queries[:3],
}
for p in top_pairs[:5]
],
"recommendations": [],
}
# Generate actionable recommendations
for pair in top_pairs[:3]:
report["recommendations"].append(
f"Intents '{pair.intent_a}' and '{pair.intent_b}' are confused "
f"{pair.confusion_count} times. Action: Add discriminating examples to classifier "
f"training data and add explicit disambiguation to both prompt templates."
)
return report
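A short usage sketch; the query and classifier scores below are hypothetical:
# Sketch: feeding classification events into the detector and generating the
# weekly report. The query and scores are hypothetical.
detector = PromptConfusionDetector()

detector.record(
    query="what manga is that new isekai anime based on?",
    classified_intent="product_recommendation",
    confidence=0.62,
    all_intents={
        "product_recommendation": 0.62,
        "manga_series_info": 0.55,
        "general_support": 0.10,
    },
)
# ... repeated for each production classification event ...

report = detector.get_confusion_report(min_count=5)
for recommendation in report["recommendations"]:
    print(recommendation)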
5. MangaAssist Scenarios
Scenario A: Seasonal Prompt Drift During Anime Season
Context: During the fall anime season, user queries shift from "recommend manga like X" to "what manga is the anime Y based on?" The existing recommendation prompt doesn't handle anime-to-manga mapping well.
Detection: Prompt health checker detects confusion_rate spike to 12% (baseline: 3%). Confusion detector shows high confusion between manga_series_info and product_recommendation intents.
Root Cause: The intent classifier was trained on manga-centric queries. Anime-related queries overlap both intents. The product recommendation prompt tries to recommend manga volumes when the user wants series relationship information.
Resolution:
1. Short-term: Add explicit disambiguation to both prompts: "If the user mentions an anime title, check if they want the source manga (series_info) or to buy volumes (recommendation)"
2. Medium-term: Add anime-to-manga mapping data to the RAG knowledge base
3. Long-term: Create a seasonal prompt variant scheduler that activates anime-season-specific prompt tweaks in October (a sketch follows below)
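A minimal sketch of what the long-term seasonal variant scheduler could look like; the date windows and variant suffixes are assumptions, not production values:
# Sketch: date-windowed prompt variant selection. Windows and variant suffixes
# are assumptions for illustration; the first matching window wins.
from datetime import date

SEASONAL_VARIANTS = [
    # (start_month, start_day, end_month, end_day, variant_suffix)
    (10, 1, 12, 20, "anime_season"),
    (11, 20, 12, 26, "holiday_sales"),
]

def select_template_variant(base_template: str, today: date = None) -> str:
    """Return the template variant to load for today's date."""
    today = today or date.today()
    for start_m, start_d, end_m, end_d, suffix in SEASONAL_VARIANTS:
        if (start_m, start_d) <= (today.month, today.day) <= (end_m, end_d):
            return f"{base_template}__{suffix}"
    return base_template

# select_template_variant("product_recommendation", date(2024, 10, 15))
# -> "product_recommendation__anime_season"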
Scenario B: New Product Category Handling
Context: Amazon starts selling manhwa (Korean comics) alongside manga. The existing prompts are manga-specific and produce awkward responses for manhwa queries ("This manga series..." when the user asked about a manhwa).
Detection: Schema validator catches an unexpected content_type: "manhwa" field appearing in 8% of responses. Health checker flags schema compliance drop.
Root Cause: The FM correctly identifies manhwa content but the prompt template forces manga terminology.
Resolution: Update the system prompt to be medium-aware:
- Before: "You are MangaAssist, an expert on Japanese manga..."
- After: "You are MangaAssist, an expert on manga, manhwa, and related graphic novel formats. Identify the content type accurately and use appropriate terminology."
- Run the golden test suite with 10 new manhwa test cases → pass rate 95%
Scenario C: Multilingual Prompt Maintenance
Context: MangaAssist serves both English and Japanese-speaking customers. The Japanese system prompt was translated from English by a contractor 6 months ago. Japanese response quality has degraded — schema compliance for Japanese responses is 87% vs 97% for English.
Detection: Health checker shows diverging health scores between EN and JP prompt variants. Schema validator reports higher error rates for JP responses.
Root Cause: Two compounding issues:
1. The Japanese prompt uses formal keigo (敬語) phrasing that Claude interprets differently than the casual register expected
2. The Japanese JSON field names use inconsistent romanization, confusing the schema validator
Resolution:
1. Rewrite the Japanese prompt with native-speaker review, matching the instructional clarity of the English version without direct translation
2. Standardize the JSON schema to use English field names for both languages (the response text is in the user's language, but the JSON structure is language-neutral); see the sketch below
3. Add JP-specific golden test cases (15 cases covering honorifics, series title formats, and mixed JP/EN content)
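A sketch of the resulting layout: per-language system prompts and few-shot examples, one language-neutral schema. The registry structure and file paths are assumptions:
# Sketch: language variants share one JSON schema but carry independent system
# prompts and few-shot examples. Registry layout and paths are assumptions.
PROMPT_VARIANTS = {
    ("manga_series_info", "en"): {
        "system_prompt_path": "prompts/manga_series_info.en.txt",
        "few_shot_examples_path": "examples/manga_series_info.en.json",
        "golden_test_suite": "golden/manga_series_info.en.jsonl",
    },
    ("manga_series_info", "ja"): {
        "system_prompt_path": "prompts/manga_series_info.ja.txt",   # natively authored, not translated
        "few_shot_examples_path": "examples/manga_series_info.ja.json",
        "golden_test_suite": "golden/manga_series_info.ja.jsonl",   # 15 JP-specific cases
    },
}

# One schema for both languages: field names stay English, only response_text
# is written in the user's language.
SHARED_SCHEMA = ProductionSchemaValidator.TEMPLATE_SCHEMAS["manga_series_info"]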
6. CloudWatch Dashboard and Alerts
| Metric | Alarm Threshold | Action |
|---|---|---|
| PromptHealthScore | < 0.7 for 2 consecutive hours | Page: prompt is degraded |
| SchemaComplianceRate | < 95% for 1 hour | Alert: format drift |
| HallucinationRate | > 3% for 30 min | Alert: grounding failure |
| ConfusionRate | > 5% for 4 hours | Warn: intent disambiguation needed |
| PromptExecutionDuration_P95 | > 4000ms | Warn: latency regression |
| PromptHealthTrend | "declining" for 3+ snapshots | Warn: slow degradation |
CloudWatch Logs Insights Queries
Prompt health trend by template:
fields @timestamp, template_name, health_score, status, issues
| filter log_type = "prompt_health"
| sort @timestamp desc
| limit 50
Schema violations by template:
fields @timestamp, template_name, error_count, warnings
| filter log_type = "schema_validation" and valid = 0
| stats count(*) as violations by template_name
| sort violations desc
Intent confusion pairs:
fields @timestamp, classified_intent, second_intent, confidence, second_confidence, query
| filter log_type = "intent_confusion" and confidence < 0.85
| stats count(*) as confused_count by classified_intent, second_intent
| sort confused_count desc
| limit 10
X-Ray trace latency breakdown by span:
fields @timestamp, trace_id, intent, template_name
| filter log_type = "prompt_trace"
| stats avg(total_duration_ms) as avg_total,
avg(span_prompt_assembly_ms) as avg_assembly,
avg(span_fm_invocation_ms) as avg_fm,
avg(span_schema_validation_ms) as avg_validation
by template_name
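These queries can also run on a schedule; a minimal sketch with the CloudWatch Logs Insights API, where the log group name is a hypothetical placeholder:
# Sketch: running the prompt-health query on a schedule via the Logs Insights
# API. The log group name is a hypothetical placeholder.
import time
import boto3

logs = boto3.client("logs")

HEALTH_QUERY = (
    'fields @timestamp, template_name, health_score, status, issues '
    '| filter log_type = "prompt_health" '
    '| sort @timestamp desc '
    '| limit 50'
)

def run_health_query(hours_back: int = 24) -> list:
    now = int(time.time())
    started = logs.start_query(
        logGroupName="/mangaassist/orchestrator",   # hypothetical log group
        startTime=now - hours_back * 3600,
        endTime=now,
        queryString=HEALTH_QUERY,
    )
    while True:
        result = logs.get_query_results(queryId=started["queryId"])
        if result["status"] in ("Complete", "Failed", "Cancelled"):
            return result.get("results", [])
        time.sleep(1)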
Intuition Gained
What Mental Model You Build
Prompt maintenance troubleshooting teaches you to treat prompts as living production artifacts — not static configuration files, but systems that degrade, drift, and require operational discipline.
1. The Drift Detection Instinct: You learn that prompt quality degrades slowly, then suddenly. A 1% schema compliance drop this week is a 15% drop next month if ignored. You build the habit of watching trend lines, not just current values. Statistical process control becomes second nature — if a metric moves > 2σ from baseline, you investigate before it becomes a user-visible problem.
2. The Multilingual Operational Instinct: You learn that each language variant of a prompt is an independent system. English quality says nothing about Japanese quality. Each variant needs its own golden test set, its own health metrics, and its own maintenance cycle. You stop treating localization as "just translation" and start treating it as "parallel deployment."
3. The Seasonal Awareness Instinct: You develop a calendar sense for when prompts will break. Anime season, holiday shopping, new product launches — these are not just marketing events, they are prompt failure risk windows. You proactively schedule prompt reviews before these events, not after users complain.
How This Intuition Guides Future Decisions
- When launching a new prompt template: You set up health monitoring, schema validation, and confusion detection before writing the first version of the prompt. The observability pipeline is prerequisite infrastructure, not a post-launch enhancement.
- When someone reports "the chatbot feels different": You pull the prompt health dashboard and check the trend. Vague quality complaints often correspond to slow metric drift that is invisible in individual conversations but obvious in aggregate.
- When planning a model upgrade: You know that the prompt maintenance work begins after the model swap, not before. New model versions shift response distributions, and every prompt template needs re-validation. You budget time for prompt tuning in every model upgrade plan.
- When building multi-language support: You design the prompt architecture for independent language maintenance from day one. Shared structure (schema, tool definitions), independent content (system prompt, few-shot examples), and per-language golden test suites.