
08: Best Practices and Patterns

AIP-C01 Mapping

Task 5.2 → All Skills (5.2.1–5.2.5): Cross-cutting best practices, anti-patterns, monitoring standards, and operational patterns for troubleshooting GenAI applications in production.


1. Anti-Patterns and Corrections

Anti-Pattern 1: Silent Token Truncation (Skill 5.2.1)

| Aspect | Anti-Pattern | Correction |
|---|---|---|
| Symptom | Prompt is assembled without checking length; FM silently drops trailing context | Pre-flight token budget check before every invocation |
| MangaAssist Example | 50-volume manga FAQ gets truncated; user gets incomplete answer with no error | TokenBudgetManager allocates fixed sections first, compresses history, logs utilization |
| Detection | Compare input_tokens in Bedrock response metadata against expected count | CloudWatch metric: BudgetUtilization > 90% alarm |
| Cost of Inaction | Users see wrong answers, blame the chatbot, lose trust — no alert fires | ~2-5% of all requests silently degraded during peak |
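The pre-flight budget check can be sketched as follows. The class name mirrors the TokenBudgetManager mentioned above, but the API and the 4-characters-per-token estimate are assumptions; a real implementation would use the model's tokenizer or the provider's token-counting endpoint.

```python
# Sketch of a pre-flight token budget check (hypothetical TokenBudgetManager API).
# Token counts are approximated at ~4 characters per token; a production system
# would use the actual model tokenizer instead.

def approx_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

class TokenBudgetManager:
    def __init__(self, context_window: int, reserved_output: int):
        self.budget = context_window - reserved_output  # tokens left for input

    def build_prompt(self, system: str, question: str, history: list,
                     utilization_alarm: float = 0.9) -> dict:
        # Fixed sections (system prompt, current question) are allocated first.
        fixed = approx_tokens(system) + approx_tokens(question)
        remaining = self.budget - fixed
        if remaining < 0:
            raise ValueError("Fixed sections alone exceed the token budget")

        # History is kept newest-first until the remainder is exhausted.
        kept = []
        for turn in reversed(history):
            cost = approx_tokens(turn)
            if cost > remaining:
                break  # older turns are dropped loudly, not silently
            kept.insert(0, turn)
            remaining -= cost

        used = self.budget - remaining
        utilization = used / self.budget
        return {
            "prompt": "\n".join([system, *kept, question]),
            "utilization": utilization,          # emit as BudgetUtilization
            "dropped_turns": len(history) - len(kept),
            "over_alarm": utilization > utilization_alarm,
        }
```

Logging `utilization` and `dropped_turns` on every call is what turns silent truncation into a visible metric.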

Anti-Pattern 2: Retry Without Circuit Breaker (Skill 5.2.2)

| Aspect | Anti-Pattern | Correction |
|---|---|---|
| Symptom | Unlimited retries on Bedrock 429/503 errors; adds latency and cost, worsens throttling | Circuit breaker (CLOSED → OPEN after N failures → HALF_OPEN on cool-down) |
| MangaAssist Example | Prime Day spike: retries amplify 429s, orchestrator hits timeout, all requests fail | CircuitBreaker opens after 5 failures in 60s; returns cached/fallback response |
| Detection | Latency spikes correlated with retry count metrics | RetryCount > 3 per request → alarm |
| Cost of Inaction | Cascading failure: throttling → retries → more throttling → total outage | Circuit breaker limits blast radius to ~30s recovery |
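The correction column describes a small state machine, which can be sketched like this. Thresholds follow the MangaAssist example (open after 5 failures in 60s, ~30s cool-down); the injectable clock is an assumption added for testability, not part of any stated design.

```python
import time

# Minimal circuit-breaker sketch for FM calls: CLOSED -> OPEN -> HALF_OPEN.

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, window_s: float = 60.0,
                 cooldown_s: float = 30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = []          # timestamps of recent failures
        self.state = "CLOSED"
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.cooldown_s:
                self.state = "HALF_OPEN"   # let one probe request through
                return True
            return False                   # serve cached/fallback instead
        return True

    def record_success(self):
        self.failures.clear()
        self.state = "CLOSED"

    def record_failure(self):
        now = self.clock()
        # Only count failures inside the sliding window.
        self.failures = [t for t in self.failures if now - t < self.window_s]
        self.failures.append(now)
        if self.state == "HALF_OPEN" or len(self.failures) >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = now
```

Emitting `state` as the CircuitBreakerState metric gives the instant-evaluation alarm described in the severity matrix below.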

Anti-Pattern 3: Testing Prompts Only Manually (Skill 5.2.3)

| Aspect | Anti-Pattern | Correction |
|---|---|---|
| Symptom | Prompt changes deployed after "looks good" spot-check; regressions discovered by users | Automated golden test suite run in CI before every prompt deployment |
| MangaAssist Example | Instruction reorder improved recommendations but broke Japanese output format | PromptTestRunner with 50+ golden test cases catches regressions before deploy |
| Detection | User complaint rate spikes after prompt deployment (lagging indicator) | Pre-deploy: GoldenTestSuite pass/fail gate. Post-deploy: PromptHealthChecker SPC |
| Cost of Inaction | Mean time to detect regression: hours to days (user complaint lag) | Automated: caught before production in CI pipeline |
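A minimal sketch of the CI gate idea, assuming each golden case pairs a prompt with a checker function. `GoldenCase`, `run_golden_suite`, and the `invoke` callback are illustrative names, not the actual PromptTestRunner API.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class GoldenCase:
    name: str
    prompt: str
    check: Callable[[str], bool]   # True if the FM output is acceptable

def run_golden_suite(cases: List[GoldenCase],
                     invoke: Callable[[str], str],
                     gate: float = 0.95) -> dict:
    """invoke() wraps the FM call; in CI it can hit a cheap model or a stub.
    Deployment is blocked when the pass rate falls below the gate."""
    failures = [c.name for c in cases if not c.check(invoke(c.prompt))]
    pass_rate = 1 - len(failures) / len(cases)
    return {"pass_rate": pass_rate,
            "failures": failures,
            "deploy_allowed": pass_rate >= gate}
```

Keeping checks as plain predicates (format, language, required fields) makes the suite cheap to extend every time an incident surfaces a new failure mode.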

Anti-Pattern 4: Treating RAG as "Set and Forget" (Skill 5.2.4)

| Aspect | Anti-Pattern | Correction |
|---|---|---|
| Symptom | Documents ingested once; no freshness monitoring; stale data in production for weeks | Scheduled ingestion Lambda + embedding drift monitor + freshness alerts |
| MangaAssist Example | New manga catalog published weekly; old prices shown for 2 weeks after update | Daily delta ingestion via S3 event trigger + EmbeddingDriftMonitor (POC 4) |
| Detection | User reports "wrong price" → manual investigation → discover stale data | DocumentStaleness > 24h alarm on CloudWatch |
| Cost of Inaction | Incorrect product info → escalation → refund → trust loss | Freshness SLA: <4 hours post-catalog-update |
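The freshness check itself is small; a sketch under the thresholds above (>4h warning, >24h critical). In production the timestamp would come from an ingestion-state store (e.g. a DynamoDB item, an assumption here) and the hours value would be emitted as the DocumentStalenessHrs metric.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

# Staleness check sketch: hours since the last successful ingestion,
# classified against the alert thresholds in this document.

def staleness_status(last_ingested: datetime,
                     now: Optional[datetime] = None) -> Tuple[float, str]:
    now = now or datetime.now(timezone.utc)
    hours = (now - last_ingested).total_seconds() / 3600
    if hours > 24:
        return hours, "CRITICAL"   # escalate: stale data alarm threshold
    if hours > 4:
        return hours, "WARNING"    # freshness SLA breached, notify ops
    return hours, "OK"
```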

Anti-Pattern 5: No Observability on Prompt Templates (Skill 5.2.5)

| Aspect | Anti-Pattern | Correction |
|---|---|---|
| Symptom | Prompt templates stored in code; no version tracking, no quality metrics per template | Version-controlled templates with per-version CloudWatch metrics and SPC health checks |
| MangaAssist Example | Template edited directly in production; quality drops; no way to identify which change caused it | PromptObservabilityPipeline tags every trace with template name + version |
| Detection | "Something changed" — manual diff of git commits to find prompt change | PromptHealthChecker detects z-score anomaly within 1 hour of deployment |
| Cost of Inaction | MTTR for prompt-related issues: 4-8 hours (investigation time) | With observability: <30 minutes (pinpoint version that caused drop) |
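The z-score anomaly detection attributed to PromptHealthChecker can be sketched as a standard SPC check: compare the latest window's mean quality score against a baseline window, and flag |z| > 3 (the classic control-chart signal). The function names and the 3-sigma threshold are assumptions.

```python
import statistics

# SPC-style anomaly check sketch for per-template quality scores.

def quality_zscore(baseline, current_mean: float) -> float:
    """How many baseline standard deviations the current mean has shifted."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return (current_mean - mu) / sigma if sigma else 0.0

def is_anomalous(baseline, current_mean: float,
                 threshold: float = 3.0) -> bool:
    return abs(quality_zscore(baseline, current_mean)) > threshold
```

Running this per template name and version is what lets the runbook below pinpoint the offending version within an hour instead of diffing git history by hand.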

2. Monitoring and Alerting Best Practices

Metric Taxonomy

graph TD
    A[MangaAssist Metrics] --> B[Content Handling<br>5.2.1]
    A --> C[FM Integration<br>5.2.2]
    A --> D[Prompt Quality<br>5.2.3]
    A --> E[Retrieval Health<br>5.2.4]
    A --> F[Prompt Ops<br>5.2.5]

    B --> B1[BudgetUtilization %]
    B --> B2[TruncationCount]
    B --> B3[CompressionRatio]

    C --> C1[InvocationLatency P95]
    C --> C2[ErrorRate %]
    C --> C3[ThrottleCount]
    C --> C4[CircuitBreakerState]

    D --> D1[GoldenTestPassRate %]
    D --> D2[QualityScoreMean]
    D --> D3[HallucinationRate %]

    E --> E1[EmbeddingDriftP95]
    E --> E2[RetrievalMRR]
    E --> E3[DocumentStalenessHrs]

    F --> F1[TemplateHealthScore]
    F --> F2[SchemaViolationRate %]
    F --> F3[ConfusionPairCount]

Alert Severity Matrix

| Metric | Warning Threshold | Critical Threshold | Evaluation Period | Action |
|---|---|---|---|---|
| BudgetUtilization | >80% | >95% | 1 min | Slack / PagerDuty |
| InvocationLatency P95 | >3s | >8s | 5 min | Auto-scale / PagerDuty |
| ErrorRate | >2% | >10% | 5 min | PagerDuty + auto-rollback |
| ThrottleCount | >10/min | >50/min | 1 min | Auto-scale / check quota |
| CircuitBreakerState | HALF_OPEN | OPEN | Instant | PagerDuty + fallback |
| GoldenTestPassRate | <95% | <85% | Per deploy | Block deploy / rollback |
| EmbeddingDriftP95 | >0.10 | >0.15 | Daily | Trigger re-embedding |
| DocumentStaleness | >4h | >24h | 15 min | Slack ops / escalate |
| TemplateHealthScore | <0.8 | <0.6 | 1h | Investigate / rollback |
| SchemaViolationRate | >5% | >15% | 5 min | PagerDuty + fallback |
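One way to keep this matrix a "living document" is to encode it as data, so the same thresholds drive both alarm creation and local classification. A sketch with a few rows; the `direction` field (whether higher or lower values are worse) and the dictionary layout are assumptions, not an existing config format.

```python
# Severity matrix as data: numeric rows only (state-based rows like
# CircuitBreakerState are handled separately).

MATRIX = {
    "BudgetUtilization":   {"warn": 80,  "crit": 95,  "direction": "high"},
    "ErrorRate":           {"warn": 2,   "crit": 10,  "direction": "high"},
    "GoldenTestPassRate":  {"warn": 95,  "crit": 85,  "direction": "low"},
    "TemplateHealthScore": {"warn": 0.8, "crit": 0.6, "direction": "low"},
}

def classify(metric: str, value: float) -> str:
    """Map an observed metric value to OK / WARNING / CRITICAL."""
    row = MATRIX[metric]
    if row["direction"] == "high":       # higher is worse
        if value > row["crit"]:
            return "CRITICAL"
        if value > row["warn"]:
            return "WARNING"
    else:                                # lower is worse
        if value < row["crit"]:
            return "CRITICAL"
        if value < row["warn"]:
            return "WARNING"
    return "OK"
```

The same dictionary can feed CloudWatch `put_metric_alarm` calls during deployment, so the alert thresholds never drift from the documented matrix.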

CloudWatch Dashboard Layout

┌──────────────────────────────────────────────────────────┐
│  Row 1: FM Health (Skill 5.2.2)                         │
│  [Error Rate %]  [Latency P50/P95/P99]  [Circuit State] │
├──────────────────────────────────────────────────────────┤
│  Row 2: Content & Retrieval (Skills 5.2.1 + 5.2.4)     │
│  [Budget Util %]  [Embedding Drift]  [Doc Staleness]    │
├──────────────────────────────────────────────────────────┤
│  Row 3: Prompt Quality (Skills 5.2.3 + 5.2.5)          │
│  [Quality Score]  [Health Score]  [Schema Violations]    │
├──────────────────────────────────────────────────────────┤
│  Row 4: Cost Tracking                                    │
│  [Token Usage]  [Bedrock Cost]  [Cache Hit Rate]         │
└──────────────────────────────────────────────────────────┘

3. Incident Response Patterns for GenAI Systems

Triage Decision Tree

flowchart TD
    A[User reports<br>bad output] --> B{Output<br>empty?}
    B -->|Yes| C{Bedrock<br>error in logs?}
    C -->|Yes| D[Skill 5.2.2:<br>FM Integration]
    C -->|No| E{Budget > 95%?}
    E -->|Yes| F[Skill 5.2.1:<br>Content Overflow]
    E -->|No| G[Check template<br>rendering logs]

    B -->|No| H{Output<br>incorrect?}
    H -->|Yes| I{RAG context<br>relevant?}
    I -->|No| J[Skill 5.2.4:<br>Retrieval Issue]
    I -->|Yes| K{Prompt version<br>changed recently?}
    K -->|Yes| L[Skill 5.2.3/5.2.5:<br>Prompt Regression]
    K -->|No| M[Skill 5.2.3:<br>Prompt Quality]

    H -->|No| N{Output<br>slow?}
    N -->|Yes| O[Check X-Ray spans<br>for bottleneck]
    O --> P{Bedrock span<br>slow?}
    P -->|Yes| D
    P -->|No| Q[Check retrieval<br>or preprocessing]

Incident Response Playbook Template

incident_type: prompt_quality_regression
severity: P2
detection:
  signal: TemplateHealthScore dropped below 0.7
  alert: CloudWatch Alarm "MangaAssist-PromptHealth-Critical"
  time_to_detect: ~15 minutes (SPC anomaly detection)

immediate_actions:
  - Check last deployment: "Was a prompt template deployed in the last 2 hours?"
  - Check PromptObservabilityPipeline logs: filter by template_name and version
  - Compare quality_score between current and previous version
  - If confirmed regression: rollback template version via config update

investigation:
  - Pull golden test results for the new version
  - Run PromptTestRunner.compare_versions(old, new)
  - Identify which test cases regressed
  - Check if model version also changed (compound root cause)

resolution:
  - Rollback prompt template
  - Add failing cases to golden test suite
  - Update CI gate to prevent similar regression
  - Post-incident: update prompt change process checklist

prevention:
  - Mandatory golden test pass before prompt deploy
  - Canary deployment: 5% traffic for 1 hour before full rollout
  - PromptHealthChecker SPC window reduced from 24h to 6h for tighter detection

4. Defense-in-Depth for Prompt and Retrieval Systems

Layer Model

┌─────────────────────────────────────────────┐
│ Layer 5: Business Rules                     │
│ • Compliance checks, content policy         │
│ • Guardrails (Bedrock Guardrails)           │
├─────────────────────────────────────────────┤
│ Layer 4: Observability                      │
│ • X-Ray traces, CloudWatch metrics          │
│ • Structured logging, anomaly detection     │
├─────────────────────────────────────────────┤
│ Layer 3: Validation                         │
│ • Schema validation on responses            │
│ • Token budget pre-flight checks            │
│ • Hallucination detection                   │
├─────────────────────────────────────────────┤
│ Layer 2: Resilience                         │
│ • Circuit breakers, retries with backoff    │
│ • Model tiering (fallback Sonnet → Haiku)   │
│ • Response caching (ElastiCache)            │
├─────────────────────────────────────────────┤
│ Layer 1: Infrastructure                     │
│ • ECS auto-scaling, Bedrock provisioned     │
│ • OpenSearch replica shards                 │
│ • DynamoDB on-demand capacity               │
└─────────────────────────────────────────────┘
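The Layer 2 model-tiering idea (fallback Sonnet → Haiku) can be sketched as a routing function. The model IDs are standard Bedrock identifiers, but the routing conditions and latency limit are assumptions for illustration.

```python
# Model-tiering sketch: route to the cheaper/faster tier when the primary
# tier is unhealthy (circuit open/probing, or latency past the critical
# threshold from the alert matrix).

PRIMARY = "anthropic.claude-3-sonnet-20240229-v1:0"
FALLBACK = "anthropic.claude-3-haiku-20240307-v1:0"

def choose_model(circuit_state: str, p95_latency_ms: float,
                 latency_limit_ms: float = 8000) -> str:
    if circuit_state in ("OPEN", "HALF_OPEN"):
        return FALLBACK            # primary tier is failing; degrade gracefully
    if p95_latency_ms > latency_limit_ms:
        return FALLBACK            # primary tier is too slow
    return PRIMARY
```

Degrading to a cheaper model keeps answers flowing during an incident, which is usually better than returning errors, at the cost of some answer quality.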

Defense-in-Depth Checklist

| Layer | Check | Applied In MangaAssist |
|---|---|---|
| Infrastructure | Auto-scaling configured for compute | ECS Fargate target tracking on CPU/memory |
| Infrastructure | Database capacity handles peak | DynamoDB on-demand mode for conversations |
| Resilience | FM calls have retry + circuit breaker | BedrockClientWrapper (file 02) |
| Resilience | Fallback model configured | Sonnet → Haiku fallback on circuit open |
| Resilience | Response cache reduces FM load | ElastiCache for frequent product queries |
| Validation | Token budget checked pre-invocation | TokenBudgetManager (file 01) |
| Validation | Response schema validated | ProductionSchemaValidator (file 05) |
| Validation | Hallucination signals checked | Ungrounded claim detection (file 04) |
| Observability | Every FM call traced | X-Ray spans + structured logging (file 05) |
| Observability | Per-template metrics emitted | PromptObservabilityPipeline (file 05) |
| Observability | Alerts configured for all tiers | Alert matrix above |
| Business Rules | Content policy enforced | Bedrock Guardrails for toxicity, PII |
| Business Rules | Prompt injection defenses | Input sanitization (Security-Privacy folder) |

5. Cost-Aware Troubleshooting

Principle: Debug Cheap, Fix Precisely

flowchart LR
    A[Issue Reported] --> B[Cheap: Check logs<br>$0 — CloudWatch]
    B --> C[Cheap: Check metrics<br>$0 — CloudWatch]
    C --> D[Medium: Replay<br>with Haiku<br>~$0.001/test]
    D --> E[Expensive: Replay<br>with Sonnet<br>~$0.01/test]
    E --> F[Most Expensive:<br>Full regression<br>suite ~$2-5]

Cost-Saving Debugging Practices

| Practice | How It Saves Money | When to Use |
|---|---|---|
| Log-first investigation | CloudWatch Logs are nearly free; avoid re-running FM calls for diagnosis | Always — start every investigation here |
| Use Haiku for reproduction | 20× cheaper than Sonnet for confirming a bug exists | When you need to verify a prompt issue is reproducible |
| Cache test results | Don't re-run identical golden tests if prompt hasn't changed | CI pipeline — skip unchanged templates |
| Sample-based drift checks | Check 200 of 100K docs instead of all | Daily drift monitor — statistically valid at 200 |
| Batch metric queries | Logs Insights query across time range instead of point queries | Investigation phase — one query covers hours |
| Structured logs reduce parsing | JSON logs → direct Logs Insights queries; no regex parsing needed | All environments — reduces investigation time |
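The sample-based drift check can be sketched as follows: re-embed a random sample of documents, compare new versus stored vectors by cosine similarity, and report the P95 of (1 − cosine) as the EmbeddingDriftP95 metric. The function signature and the caller-supplied `embed_now` hook are assumptions to keep the sketch provider-neutral.

```python
import math
import random

def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def drift_p95(doc_ids, stored, embed_now, sample_size=200, seed=42) -> float:
    """stored: doc_id -> original vector; embed_now(doc_id) -> fresh vector.
    Samples at most sample_size docs instead of re-embedding the full corpus."""
    sample = random.Random(seed).sample(doc_ids, min(sample_size, len(doc_ids)))
    drifts = sorted(1 - cosine(stored[d], embed_now(d)) for d in sample)
    return drifts[int(0.95 * (len(drifts) - 1))]   # P95 by nearest rank
```

At ~200 embeddings per run instead of 100K, the daily check costs cents; full re-embedding is triggered only when the sampled P95 crosses the 0.15 critical threshold.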

Cost Impact by Investigation Type

| Investigation | Estimated Cost | Duration | When Justified |
|---|---|---|---|
| CloudWatch Logs + Metrics review | ~$0 | 15 min | Always first step |
| Single prompt reproduction (Haiku) | ~$0.001 | 2 min | Confirming bug exists |
| Golden test suite (50 cases, Haiku) | ~$0.05 | 5 min | Pre-deploy validation |
| Full regression suite (50 cases, Sonnet) | ~$2.00 | 10 min | Major version changes |
| Embedding drift scan (200 docs) | ~$0.10 | 3 min | Daily scheduled |
| Full re-embedding (100K docs) | ~$50 | 2 hours | Only after confirmed drift |

6. Operational Runbook Templates

Runbook: FM Latency Spike

TRIGGER: InvocationLatency P95 > 5s for 5 minutes
SEVERITY: P2

STEP 1: Check Bedrock service health
  → AWS Health Dashboard → Bedrock → ap-northeast-1
  → If service degradation: nothing to do, wait for AWS resolution

STEP 2: Check throttling
  → CloudWatch: MangaAssist/BedrockHealth → ThrottleCount
  → If high: check Bedrock quota usage (Service Quotas console)
  → If near quota: request increase or enable model tiering fallback

STEP 3: Check circuit breaker state
  → CloudWatch: MangaAssist/BedrockHealth → CircuitBreakerState
  → If OPEN: system is self-protecting, check underlying cause
  → If CLOSED but slow: issue is latency, not errors

STEP 4: Check input size
  → Logs Insights:
    fields @timestamp, input_tokens, latency_ms
    | filter log_type = "bedrock_call"
    | stats avg(input_tokens), avg(latency_ms) by bin(5m)
  → If input_tokens increased: check TokenBudgetManager logs

STEP 5: Mitigate
  → If traffic spike: no action if auto-scaling handles it
  → If input bloat: enable aggressive history compression
  → If Bedrock issue: switch to Haiku fallback via feature flag

Runbook: Prompt Quality Drop

TRIGGER: TemplateHealthScore < 0.7 for 1 hour
SEVERITY: P2

STEP 1: Identify affected template
  → CloudWatch: MangaAssist/PromptObservability → filter by Template dimension
  → Note template_name and template_version

STEP 2: Check for recent deployments
  → git log --since="2h ago" -- prompts/
  → If template was changed: this is the likely root cause

STEP 3: Compare versions
  → Run PromptTestRunner.compare_versions(prev_version, current_version)
  → Identify which quality dimensions regressed

STEP 4: Rollback if confirmed
  → Update prompt template config to previous version
  → Monitor TemplateHealthScore recovery (expect 15-30 min)

STEP 5: Root cause analysis
  → Review the prompt change: what was the intent?
  → Add failing cases to golden test suite
  → Update CI pipeline if test gap identified

7. Cross-Skill Integration Patterns

Pattern: Cascading Failure Detection

When one subsystem fails, multiple metrics shift simultaneously. Trained operators learn to read the pattern of metric changes rather than individual alarms.

| Root Cause | Content (5.2.1) | FM (5.2.2) | Prompt (5.2.3) | Retrieval (5.2.4) | Maintenance (5.2.5) |
|---|---|---|---|---|---|
| Bedrock throttling | Budget OK | ↑ Throttle, ↑ Latency | Quality drops (timeouts) | OK | Health score drops |
| Stale RAG data | OK | OK | Quality drops (wrong facts) | ↑ Staleness | OK |
| Prompt regression | OK | OK | ↑ Test failures | OK | ↑ Schema violations |
| Token overflow | ↑ Budget util | OK (or error if limit hit) | Quality drops (truncation) | OK | OK |
| Embedding drift | OK | OK | Quality drops (irrelevant context) | ↑ Drift metric, ↓ MRR | OK |
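Reading the table mechanically amounts to matching the set of firing signals against each row. A sketch of that idea, where the signal names and the overlap-score heuristic are both illustrative assumptions, not a documented triage tool:

```python
# Pattern-based triage sketch: rank candidate root causes by how fully the
# observed signal set matches each known failure pattern.

PATTERNS = {
    "bedrock_throttling": {"throttle_up", "latency_up", "quality_down", "health_down"},
    "stale_rag_data":     {"quality_down", "staleness_up"},
    "prompt_regression":  {"test_failures_up", "schema_violations_up"},
    "token_overflow":     {"budget_util_up", "quality_down"},
    "embedding_drift":    {"quality_down", "drift_up", "mrr_down"},
}

def rank_root_causes(active_signals):
    """Return (cause, match_fraction) pairs, best match first."""
    scored = [(cause, len(active_signals & sig) / len(sig))
              for cause, sig in PATTERNS.items()]
    return sorted((s for s in scored if s[1] > 0), key=lambda s: -s[1])
```

Note that `quality_down` alone matches several rows, which is exactly the point of the table: a quality drop is a symptom, not a diagnosis.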

Key insight: A quality drop in Skill 5.2.3 is rarely a prompt-only problem. Always check 5.2.1 (was context truncated?), 5.2.4 (was context relevant?), and 5.2.2 (did the FM call succeed cleanly?) before blaming the prompt.

Pattern: Observability Stack per Request

Every MangaAssist request should produce:

  1. Trace ID — correlates all spans and logs for one request
  2. Budget check log — token utilization before FM call
  3. Bedrock call log — input/output tokens, latency, status, model ID
  4. Retrieval log — query, top-K results, relevance scores
  5. Validation log — schema pass/fail, quality score
  6. Intent log — classified intent, confidence

This enables any investigation to start from the trace ID and follow the full request lifecycle.
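The per-request record set can be sketched as one JSON object per line, all carrying the same trace ID so Logs Insights can reconstruct the lifecycle with a single filter. Field names here are illustrative, not a fixed schema.

```python
import json
import time
import uuid

# Structured-logging sketch: every stage of one request emits a JSON line
# sharing the same trace_id.

def make_log(trace_id: str, log_type: str, **fields) -> str:
    record = {"trace_id": trace_id, "log_type": log_type,
              "ts": time.time(), **fields}
    return json.dumps(record)   # one JSON object per line -> queryable directly

trace_id = str(uuid.uuid4())
lines = [
    make_log(trace_id, "budget_check", utilization=0.72),
    make_log(trace_id, "bedrock_call", input_tokens=1800, output_tokens=240,
             latency_ms=1450),
    make_log(trace_id, "retrieval", query="one piece price", top_k=5),
    make_log(trace_id, "validation", schema_ok=True, quality_score=0.91),
    make_log(trace_id, "intent", intent="product_question", confidence=0.97),
]
```

With this shape, the Logs Insights query in the latency runbook above (`filter log_type = "bedrock_call"`) needs no regex parsing at all.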


Intuition Gained

| Instinct | What It Means |
|---|---|
| Anti-pattern recognition | Experienced engineers see silent failures (truncation, stale data, missing validation) before they manifest as user complaints — because they've built the mental model of what CAN go wrong |
| Alert fatigue avoidance | The right thresholds and evaluation periods prevent alert storms while catching real issues. The severity matrix is a living document, tuned per incident review |
| Cost-conscious debugging | Always start with the cheapest investigation (logs → metrics → Haiku replay → full suite). Never jump to the most expensive debugging path first |
| Pattern-based triage | A quality drop is not "one problem" — it's a symptom with multiple potential root causes. The cascading failure table teaches you to read metric patterns holistically |
| Defense-in-depth habit | No single layer prevents all failures. The layered model ensures that when one defense fails, the next catches it — and observability records everything for post-incident learning |