
08: Best Practices and Patterns

AIP-C01 Mapping

Task 5.2 → All Skills (5.2.1–5.2.5): Cross-cutting best practices, anti-patterns, monitoring standards, and operational patterns for troubleshooting GenAI applications in production.


1. Anti-Patterns and Corrections

Anti-Pattern 1: Silent Token Truncation (Skill 5.2.1)

| Aspect | Anti-Pattern | Correction |
|---|---|---|
| Symptom | Prompt is assembled without checking length; FM silently drops trailing context | Pre-flight token budget check before every invocation |
| MangaAssist Example | 50-volume manga FAQ gets truncated; user gets incomplete answer with no error | TokenBudgetManager allocates fixed sections first, compresses history, logs utilization |
| Detection | Compare input_tokens in Bedrock response metadata against expected count | CloudWatch metric: BudgetUtilization > 90% alarm |
| Cost of Inaction | Users see wrong answers, blame the chatbot, lose trust — no alert fires | ~2-5% of all requests silently degraded during peak |
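The pre-flight budget check can be sketched as follows. The class name mirrors the TokenBudgetManager mentioned above, but the API and the 4-characters-per-token estimate are assumptions; a real implementation would use the model's tokenizer or the provider's token-counting endpoint.

```python
# Sketch of a pre-flight token budget check (hypothetical TokenBudgetManager API).
# Token counts are approximated at ~4 characters per token; a production system
# would use the actual model tokenizer instead.

def approx_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

class TokenBudgetManager:
    def __init__(self, context_window: int, reserved_output: int):
        self.budget = context_window - reserved_output  # tokens left for input

    def build_prompt(self, system: str, question: str, history: list,
                     utilization_alarm: float = 0.9) -> dict:
        # Fixed sections (system prompt, current question) are allocated first.
        fixed = approx_tokens(system) + approx_tokens(question)
        remaining = self.budget - fixed
        if remaining < 0:
            raise ValueError("Fixed sections alone exceed the token budget")

        # History is kept newest-first until the remainder is exhausted.
        kept = []
        for turn in reversed(history):
            cost = approx_tokens(turn)
            if cost > remaining:
                break  # older turns are dropped loudly, not silently
            kept.insert(0, turn)
            remaining -= cost

        used = self.budget - remaining
        utilization = used / self.budget
        return {
            "prompt": "\n".join([system, *kept, question]),
            "utilization": utilization,          # emit as BudgetUtilization
            "dropped_turns": len(history) - len(kept),
            "over_alarm": utilization > utilization_alarm,
        }
```

Logging `utilization` and `dropped_turns` on every call is what turns silent truncation into a visible metric.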

Anti-Pattern 2: Retry Without Circuit Breaker (Skill 5.2.2)

| Aspect | Anti-Pattern | Correction |
|---|---|---|
| Symptom | Unlimited retries on Bedrock 429/503 errors; adds latency and cost, worsens throttling | Circuit breaker (CLOSED → OPEN after N failures → HALF_OPEN on cool-down) |
| MangaAssist Example | Prime Day spike: retries amplify 429s, orchestrator hits timeout, all requests fail | CircuitBreaker opens after 5 failures in 60s; returns cached/fallback response |
| Detection | Latency spikes correlated with retry count metrics | RetryCount > 3 per request → alarm |
| Cost of Inaction | Cascading failure: throttling → retries → more throttling → total outage | Circuit breaker limits blast radius to ~30s recovery |
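The correction column describes a small state machine, which can be sketched like this. Thresholds follow the MangaAssist example (open after 5 failures in 60s, ~30s cool-down); the injectable clock is an assumption added for testability, not part of any stated design.

```python
import time

# Minimal circuit-breaker sketch for FM calls: CLOSED -> OPEN -> HALF_OPEN.

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, window_s: float = 60.0,
                 cooldown_s: float = 30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = []          # timestamps of recent failures
        self.state = "CLOSED"
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.cooldown_s:
                self.state = "HALF_OPEN"   # let one probe request through
                return True
            return False                   # serve cached/fallback instead
        return True

    def record_success(self):
        self.failures.clear()
        self.state = "CLOSED"

    def record_failure(self):
        now = self.clock()
        # Only count failures inside the sliding window.
        self.failures = [t for t in self.failures if now - t < self.window_s]
        self.failures.append(now)
        if self.state == "HALF_OPEN" or len(self.failures) >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = now
```

Emitting `state` as the CircuitBreakerState metric gives the instant-evaluation alarm described in the severity matrix below.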

Anti-Pattern 3: Testing Prompts Only Manually (Skill 5.2.3)

| Aspect | Anti-Pattern | Correction |
|---|---|---|
| Symptom | Prompt changes deployed after "looks good" spot-check; regressions discovered by users | Automated golden test suite run in CI before every prompt deployment |
| MangaAssist Example | Instruction reorder improved recommendations but broke Japanese output format | PromptTestRunner with 50+ golden test cases catches regressions before deploy |
| Detection | User complaint rate spikes after prompt deployment (lagging indicator) | Pre-deploy: GoldenTestSuite pass/fail gate. Post-deploy: PromptHealthChecker SPC |
| Cost of Inaction | Mean time to detect regression: hours to days (user complaint lag) | Automated: caught before production in CI pipeline |
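A minimal sketch of the CI gate idea, assuming each golden case pairs a prompt with a checker function. `GoldenCase`, `run_golden_suite`, and the `invoke` callback are illustrative names, not the actual PromptTestRunner API.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class GoldenCase:
    name: str
    prompt: str
    check: Callable[[str], bool]   # True if the FM output is acceptable

def run_golden_suite(cases: List[GoldenCase],
                     invoke: Callable[[str], str],
                     gate: float = 0.95) -> dict:
    """invoke() wraps the FM call; in CI it can hit a cheap model or a stub.
    Deployment is blocked when the pass rate falls below the gate."""
    failures = [c.name for c in cases if not c.check(invoke(c.prompt))]
    pass_rate = 1 - len(failures) / len(cases)
    return {"pass_rate": pass_rate,
            "failures": failures,
            "deploy_allowed": pass_rate >= gate}
```

Keeping checks as plain predicates (format, language, required fields) makes the suite cheap to extend every time an incident surfaces a new failure mode.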

Anti-Pattern 4: Treating RAG as "Set and Forget" (Skill 5.2.4)

| Aspect | Anti-Pattern | Correction |
|---|---|---|
| Symptom | Documents ingested once; no freshness monitoring; stale data in production for weeks | Scheduled ingestion Lambda + embedding drift monitor + freshness alerts |
| MangaAssist Example | New manga catalog published weekly; old prices shown for 2 weeks after update | Daily delta ingestion via S3 event trigger + EmbeddingDriftMonitor (POC 4) |
| Detection | User reports "wrong price" → manual investigation → discover stale data | DocumentStaleness > 24h alarm on CloudWatch |
| Cost of Inaction | Incorrect product info → escalation → refund → trust loss | Freshness SLA: <4 hours post-catalog-update |
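The freshness check itself is small; a sketch under the thresholds above (>4h warning, >24h critical). In production the timestamp would come from an ingestion-state store (e.g. a DynamoDB item, an assumption here) and the hours value would be emitted as the DocumentStalenessHrs metric.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

# Staleness check sketch: hours since the last successful ingestion,
# classified against the alert thresholds in this document.

def staleness_status(last_ingested: datetime,
                     now: Optional[datetime] = None) -> Tuple[float, str]:
    now = now or datetime.now(timezone.utc)
    hours = (now - last_ingested).total_seconds() / 3600
    if hours > 24:
        return hours, "CRITICAL"   # escalate: stale data alarm threshold
    if hours > 4:
        return hours, "WARNING"    # freshness SLA breached, notify ops
    return hours, "OK"
```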

Anti-Pattern 5: No Observability on Prompt Templates (Skill 5.2.5)

| Aspect | Anti-Pattern | Correction |
|---|---|---|
| Symptom | Prompt templates stored in code; no version tracking, no quality metrics per template | Version-controlled templates with per-version CloudWatch metrics and SPC health checks |
| MangaAssist Example | Template edited directly in production; quality drops; no way to identify which change caused it | PromptObservabilityPipeline tags every trace with template name + version |
| Detection | "Something changed" — manual diff of git commits to find prompt change | PromptHealthChecker detects z-score anomaly within 1 hour of deployment |
| Cost of Inaction | MTTR for prompt-related issues: 4-8 hours (investigation time) | With observability: <30 minutes (pinpoint version that caused drop) |
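The z-score anomaly detection attributed to PromptHealthChecker can be sketched as a standard SPC check: compare the latest window's mean quality score against a baseline window, and flag |z| > 3 (the classic control-chart signal). The function names and the 3-sigma threshold are assumptions.

```python
import statistics

# SPC-style anomaly check sketch for per-template quality scores.

def quality_zscore(baseline, current_mean: float) -> float:
    """How many baseline standard deviations the current mean has shifted."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return (current_mean - mu) / sigma if sigma else 0.0

def is_anomalous(baseline, current_mean: float,
                 threshold: float = 3.0) -> bool:
    return abs(quality_zscore(baseline, current_mean)) > threshold
```

Running this per template name and version is what lets the runbook below pinpoint the offending version within an hour instead of diffing git history by hand.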

2. Monitoring and Alerting Best Practices

Metric Taxonomy

graph TD
    A[MangaAssist Metrics] --> B[Content Handling<br>5.2.1]
    A --> C[FM Integration<br>5.2.2]
    A --> D[Prompt Quality<br>5.2.3]
    A --> E[Retrieval Health<br>5.2.4]
    A --> F[Prompt Ops<br>5.2.5]

    B --> B1[BudgetUtilization %]
    B --> B2[TruncationCount]
    B --> B3[CompressionRatio]

    C --> C1[InvocationLatency P95]
    C --> C2[ErrorRate %]
    C --> C3[ThrottleCount]
    C --> C4[CircuitBreakerState]

    D --> D1[GoldenTestPassRate %]
    D --> D2[QualityScoreMean]
    D --> D3[HallucinationRate %]

    E --> E1[EmbeddingDriftP95]
    E --> E2[RetrievalMRR]
    E --> E3[DocumentStalenessHrs]

    F --> F1[TemplateHealthScore]
    F --> F2[SchemaViolationRate %]
    F --> F3[ConfusionPairCount]

Alert Severity Matrix

| Metric | Warning Threshold | Critical Threshold | Evaluation Period | Action |
|---|---|---|---|---|
| BudgetUtilization | >80% | >95% | 1 min | Slack / PagerDuty |
| InvocationLatency P95 | >3s | >8s | 5 min | Auto-scale / PagerDuty |
| ErrorRate | >2% | >10% | 5 min | PagerDuty + auto-rollback |
| ThrottleCount | >10/min | >50/min | 1 min | Auto-scale / check quota |
| CircuitBreakerState | HALF_OPEN | OPEN | Instant | PagerDuty + fallback |
| GoldenTestPassRate | <95% | <85% | Per deploy | Block deploy / rollback |
| EmbeddingDriftP95 | >0.10 | >0.15 | Daily | Trigger re-embedding |
| DocumentStaleness | >4h | >24h | 15 min | Slack ops / escalate |
| TemplateHealthScore | <0.8 | <0.6 | 1h | Investigate / rollback |
| SchemaViolationRate | >5% | >15% | 5 min | PagerDuty + fallback |
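One way to keep this matrix a "living document" is to encode it as data, so the same thresholds drive both alarm creation and local classification. A sketch with a few rows; the `direction` field (whether higher or lower values are worse) and the dictionary layout are assumptions, not an existing config format.

```python
# Severity matrix as data: numeric rows only (state-based rows like
# CircuitBreakerState are handled separately).

MATRIX = {
    "BudgetUtilization":   {"warn": 80,  "crit": 95,  "direction": "high"},
    "ErrorRate":           {"warn": 2,   "crit": 10,  "direction": "high"},
    "GoldenTestPassRate":  {"warn": 95,  "crit": 85,  "direction": "low"},
    "TemplateHealthScore": {"warn": 0.8, "crit": 0.6, "direction": "low"},
}

def classify(metric: str, value: float) -> str:
    """Map an observed metric value to OK / WARNING / CRITICAL."""
    row = MATRIX[metric]
    if row["direction"] == "high":       # higher is worse
        if value > row["crit"]:
            return "CRITICAL"
        if value > row["warn"]:
            return "WARNING"
    else:                                # lower is worse
        if value < row["crit"]:
            return "CRITICAL"
        if value < row["warn"]:
            return "WARNING"
    return "OK"
```

The same dictionary can feed CloudWatch `put_metric_alarm` calls during deployment, so the alert thresholds never drift from the documented matrix.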

CloudWatch Dashboard Layout

┌──────────────────────────────────────────────────────────┐
│  Row 1: FM Health (Skill 5.2.2)                         │
│  [Error Rate %]  [Latency P50/P95/P99]  [Circuit State] │
├──────────────────────────────────────────────────────────┤
│  Row 2: Content & Retrieval (Skills 5.2.1 + 5.2.4)     │
│  [Budget Util %]  [Embedding Drift]  [Doc Staleness]    │
├──────────────────────────────────────────────────────────┤
│  Row 3: Prompt Quality (Skills 5.2.3 + 5.2.5)          │
│  [Quality Score]  [Health Score]  [Schema Violations]    │
├──────────────────────────────────────────────────────────┤
│  Row 4: Cost Tracking                                    │
│  [Token Usage]  [Bedrock Cost]  [Cache Hit Rate]         │
└──────────────────────────────────────────────────────────┘

3. Incident Response Patterns for GenAI Systems

Triage Decision Tree

flowchart TD
    A[User reports<br>bad output] --> B{Output<br>empty?}
    B -->|Yes| C{Bedrock<br>error in logs?}
    C -->|Yes| D[Skill 5.2.2:<br>FM Integration]
    C -->|No| E{Budget > 95%?}
    E -->|Yes| F[Skill 5.2.1:<br>Content Overflow]
    E -->|No| G[Check template<br>rendering logs]

    B -->|No| H{Output<br>incorrect?}
    H -->|Yes| I{RAG context<br>relevant?}
    I -->|No| J[Skill 5.2.4:<br>Retrieval Issue]
    I -->|Yes| K{Prompt version<br>changed recently?}
    K -->|Yes| L[Skill 5.2.3/5.2.5:<br>Prompt Regression]
    K -->|No| M[Skill 5.2.3:<br>Prompt Quality]

    H -->|No| N{Output<br>slow?}
    N -->|Yes| O[Check X-Ray spans<br>for bottleneck]
    O --> P{Bedrock span<br>slow?}
    P -->|Yes| D
    P -->|No| Q[Check retrieval<br>or preprocessing]

Incident Response Playbook Template

incident_type: prompt_quality_regression
severity: P2
detection:
  signal: TemplateHealthScore dropped below 0.7
  alert: CloudWatch Alarm "MangaAssist-PromptHealth-Critical"
  time_to_detect: ~15 minutes (SPC anomaly detection)

immediate_actions:
  - Check last deployment: "Was a prompt template deployed in the last 2 hours?"
  - Check PromptObservabilityPipeline logs: filter by template_name and version
  - Compare quality_score between current and previous version
  - If confirmed regression: rollback template version via config update

investigation:
  - Pull golden test results for the new version
  - Run PromptTestRunner.compare_versions(old, new)
  - Identify which test cases regressed
  - Check if model version also changed (compound root cause)

resolution:
  - Rollback prompt template
  - Add failing cases to golden test suite
  - Update CI gate to prevent similar regression
  - Post-incident: update prompt change process checklist

prevention:
  - Mandatory golden test pass before prompt deploy
  - Canary deployment: 5% traffic for 1 hour before full rollout
  - PromptHealthChecker SPC window reduced from 24h to 6h for tighter detection

4. Defense-in-Depth for Prompt and Retrieval Systems

Layer Model

┌─────────────────────────────────────────────┐
│ Layer 5: Business Rules                     │
│ • Compliance checks, content policy         │
│ • Guardrails (Bedrock Guardrails)           │
├─────────────────────────────────────────────┤
│ Layer 4: Observability                      │
│ • X-Ray traces, CloudWatch metrics          │
│ • Structured logging, anomaly detection     │
├─────────────────────────────────────────────┤
│ Layer 3: Validation                         │
│ • Schema validation on responses            │
│ • Token budget pre-flight checks            │
│ • Hallucination detection                   │
├─────────────────────────────────────────────┤
│ Layer 2: Resilience                         │
│ • Circuit breakers, retries with backoff    │
│ • Model tiering (fallback Sonnet → Haiku)   │
│ • Response caching (ElastiCache)            │
├─────────────────────────────────────────────┤
│ Layer 1: Infrastructure                     │
│ • ECS auto-scaling, Bedrock provisioned     │
│ • OpenSearch replica shards                 │
│ • DynamoDB on-demand capacity               │
└─────────────────────────────────────────────┘
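The Layer 2 model-tiering idea (fallback Sonnet → Haiku) can be sketched as a routing function. The model IDs are standard Bedrock identifiers, but the routing conditions and latency limit are assumptions for illustration.

```python
# Model-tiering sketch: route to the cheaper/faster tier when the primary
# tier is unhealthy (circuit open/probing, or latency past the critical
# threshold from the alert matrix).

PRIMARY = "anthropic.claude-3-sonnet-20240229-v1:0"
FALLBACK = "anthropic.claude-3-haiku-20240307-v1:0"

def choose_model(circuit_state: str, p95_latency_ms: float,
                 latency_limit_ms: float = 8000) -> str:
    if circuit_state in ("OPEN", "HALF_OPEN"):
        return FALLBACK            # primary tier is failing; degrade gracefully
    if p95_latency_ms > latency_limit_ms:
        return FALLBACK            # primary tier is too slow
    return PRIMARY
```

Degrading to a cheaper model keeps answers flowing during an incident, which is usually better than returning errors, at the cost of some answer quality.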

Defense-in-Depth Checklist

| Layer | Check | Applied In MangaAssist |
|---|---|---|
| Infrastructure | Auto-scaling configured for compute | ECS Fargate target tracking on CPU/memory |
| Infrastructure | Database capacity handles peak | DynamoDB on-demand mode for conversations |
| Resilience | FM calls have retry + circuit breaker | BedrockClientWrapper (file 02) |
| Resilience | Fallback model configured | Sonnet → Haiku fallback on circuit open |
| Resilience | Response cache reduces FM load | ElastiCache for frequent product queries |
| Validation | Token budget checked pre-invocation | TokenBudgetManager (file 01) |
| Validation | Response schema validated | ProductionSchemaValidator (file 05) |
| Validation | Hallucination signals checked | Ungrounded claim detection (file 04) |
| Observability | Every FM call traced | X-Ray spans + structured logging (file 05) |
| Observability | Per-template metrics emitted | PromptObservabilityPipeline (file 05) |
| Observability | Alerts configured for all tiers | Alert matrix above |
| Business Rules | Content policy enforced | Bedrock Guardrails for toxicity, PII |
| Business Rules | Prompt injection defenses | Input sanitization (Security-Privacy folder) |

5. Cost-Aware Troubleshooting

Principle: Debug Cheap, Fix Precisely

flowchart LR
    A[Issue Reported] --> B[Cheap: Check logs<br>$0 — CloudWatch]
    B --> C[Cheap: Check metrics<br>$0 — CloudWatch]
    C --> D[Medium: Replay<br>with Haiku<br>~$0.001/test]
    D --> E[Expensive: Replay<br>with Sonnet<br>~$0.01/test]
    E --> F[Most Expensive:<br>Full regression<br>suite ~$2-5]

Cost-Saving Debugging Practices

| Practice | How It Saves Money | When to Use |
|---|---|---|
| Log-first investigation | CloudWatch Logs are nearly free; avoid re-running FM calls for diagnosis | Always — start every investigation here |
| Use Haiku for reproduction | 20× cheaper than Sonnet for confirming a bug exists | When you need to verify a prompt issue is reproducible |
| Cache test results | Don't re-run identical golden tests if prompt hasn't changed | CI pipeline — skip unchanged templates |
| Sample-based drift checks | Check 200 of 100K docs instead of all | Daily drift monitor — statistically valid at 200 |
| Batch metric queries | Logs Insights query across time range instead of point queries | Investigation phase — one query covers hours |
| Structured logs reduce parsing | JSON logs → direct Logs Insights queries; no regex parsing needed | All environments — reduces investigation time |
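The sample-based drift check can be sketched as follows: re-embed a random sample of documents, compare new versus stored vectors by cosine similarity, and report the P95 of (1 − cosine) as the EmbeddingDriftP95 metric. The function signature and the caller-supplied `embed_now` hook are assumptions to keep the sketch provider-neutral.

```python
import math
import random

def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def drift_p95(doc_ids, stored, embed_now, sample_size=200, seed=42) -> float:
    """stored: doc_id -> original vector; embed_now(doc_id) -> fresh vector.
    Samples at most sample_size docs instead of re-embedding the full corpus."""
    sample = random.Random(seed).sample(doc_ids, min(sample_size, len(doc_ids)))
    drifts = sorted(1 - cosine(stored[d], embed_now(d)) for d in sample)
    return drifts[int(0.95 * (len(drifts) - 1))]   # P95 by nearest rank
```

At ~200 embeddings per run instead of 100K, the daily check costs cents; full re-embedding is triggered only when the sampled P95 crosses the 0.15 critical threshold.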

Cost Impact by Investigation Type

| Investigation | Estimated Cost | Duration | When Justified |
|---|---|---|---|
| CloudWatch Logs + Metrics review | ~$0 | 15 min | Always first step |
| Single prompt reproduction (Haiku) | ~$0.001 | 2 min | Confirming bug exists |
| Golden test suite (50 cases, Haiku) | ~$0.05 | 5 min | Pre-deploy validation |
| Full regression suite (50 cases, Sonnet) | ~$2.00 | 10 min | Major version changes |
| Embedding drift scan (200 docs) | ~$0.10 | 3 min | Daily scheduled |
| Full re-embedding (100K docs) | ~$50 | 2 hours | Only after confirmed drift |

6. Operational Runbook Templates

Runbook: FM Latency Spike

TRIGGER: InvocationLatency P95 > 5s for 5 minutes
SEVERITY: P2

STEP 1: Check Bedrock service health
  → AWS Health Dashboard → Bedrock → ap-northeast-1
  → If service degradation: nothing to do, wait for AWS resolution

STEP 2: Check throttling
  → CloudWatch: MangaAssist/BedrockHealth → ThrottleCount
  → If high: check Bedrock quota usage (Service Quotas console)
  → If near quota: request increase or enable model tiering fallback

STEP 3: Check circuit breaker state
  → CloudWatch: MangaAssist/BedrockHealth → CircuitBreakerState
  → If OPEN: system is self-protecting, check underlying cause
  → If CLOSED but slow: issue is latency, not errors

STEP 4: Check input size
  → Logs Insights:
    fields @timestamp, input_tokens, latency_ms
    | filter log_type = "bedrock_call"
    | stats avg(input_tokens), avg(latency_ms) by bin(5m)
  → If input_tokens increased: check TokenBudgetManager logs

STEP 5: Mitigate
  → If traffic spike: no action if auto-scaling handles it
  → If input bloat: enable aggressive history compression
  → If Bedrock issue: switch to Haiku fallback via feature flag

Runbook: Prompt Quality Drop

TRIGGER: TemplateHealthScore < 0.7 for 1 hour
SEVERITY: P2

STEP 1: Identify affected template
  → CloudWatch: MangaAssist/PromptObservability → filter by Template dimension
  → Note template_name and template_version

STEP 2: Check for recent deployments
  → git log --since="2h ago" -- prompts/
  → If template was changed: this is the likely root cause

STEP 3: Compare versions
  → Run PromptTestRunner.compare_versions(prev_version, current_version)
  → Identify which quality dimensions regressed

STEP 4: Rollback if confirmed
  → Update prompt template config to previous version
  → Monitor TemplateHealthScore recovery (expect 15-30 min)

STEP 5: Root cause analysis
  → Review the prompt change: what was the intent?
  → Add failing cases to golden test suite
  → Update CI pipeline if test gap identified

7. Cross-Skill Integration Patterns

Pattern: Cascading Failure Detection

When one subsystem fails, multiple metrics shift simultaneously. Trained operators learn to read the pattern of metric changes rather than individual alarms.

| Root Cause | Content (5.2.1) | FM (5.2.2) | Prompt (5.2.3) | Retrieval (5.2.4) | Maintenance (5.2.5) |
|---|---|---|---|---|---|
| Bedrock throttling | Budget OK | ↑ Throttle, ↑ Latency | Quality drops (timeouts) | OK | Health score drops |
| Stale RAG data | OK | OK | Quality drops (wrong facts) | ↑ Staleness | OK |
| Prompt regression | OK | OK | ↑ Test failures | OK | ↑ Schema violations |
| Token overflow | ↑ Budget util | OK (or error if limit hit) | Quality drops (truncation) | OK | OK |
| Embedding drift | OK | OK | Quality drops (irrelevant context) | ↑ Drift metric, ↓ MRR | OK |
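Reading the table mechanically amounts to matching the set of firing signals against each row. A sketch of that idea, where the signal names and the overlap-score heuristic are both illustrative assumptions, not a documented triage tool:

```python
# Pattern-based triage sketch: rank candidate root causes by how fully the
# observed signal set matches each known failure pattern.

PATTERNS = {
    "bedrock_throttling": {"throttle_up", "latency_up", "quality_down", "health_down"},
    "stale_rag_data":     {"quality_down", "staleness_up"},
    "prompt_regression":  {"test_failures_up", "schema_violations_up"},
    "token_overflow":     {"budget_util_up", "quality_down"},
    "embedding_drift":    {"quality_down", "drift_up", "mrr_down"},
}

def rank_root_causes(active_signals):
    """Return (cause, match_fraction) pairs, best match first."""
    scored = [(cause, len(active_signals & sig) / len(sig))
              for cause, sig in PATTERNS.items()]
    return sorted((s for s in scored if s[1] > 0), key=lambda s: -s[1])
```

Note that `quality_down` alone matches several rows, which is exactly the point of the table: a quality drop is a symptom, not a diagnosis.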

Key insight: A quality drop in Skill 5.2.3 is rarely a prompt-only problem. Always check 5.2.1 (was context truncated?), 5.2.4 (was context relevant?), and 5.2.2 (did the FM call succeed cleanly?) before blaming the prompt.

Pattern: Observability Stack per Request

Every MangaAssist request should produce:

  1. Trace ID — correlates all spans and logs for one request
  2. Budget check log — token utilization before FM call
  3. Bedrock call log — input/output tokens, latency, status, model ID
  4. Retrieval log — query, top-K results, relevance scores
  5. Validation log — schema pass/fail, quality score
  6. Intent log — classified intent, confidence

This enables any investigation to start from the trace ID and follow the full request lifecycle.
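The per-request record set can be sketched as one JSON object per line, all carrying the same trace ID so Logs Insights can reconstruct the lifecycle with a single filter. Field names here are illustrative, not a fixed schema.

```python
import json
import time
import uuid

# Structured-logging sketch: every stage of one request emits a JSON line
# sharing the same trace_id.

def make_log(trace_id: str, log_type: str, **fields) -> str:
    record = {"trace_id": trace_id, "log_type": log_type,
              "ts": time.time(), **fields}
    return json.dumps(record)   # one JSON object per line -> queryable directly

trace_id = str(uuid.uuid4())
lines = [
    make_log(trace_id, "budget_check", utilization=0.72),
    make_log(trace_id, "bedrock_call", input_tokens=1800, output_tokens=240,
             latency_ms=1450),
    make_log(trace_id, "retrieval", query="one piece price", top_k=5),
    make_log(trace_id, "validation", schema_ok=True, quality_score=0.91),
    make_log(trace_id, "intent", intent="product_question", confidence=0.97),
]
```

With this shape, the Logs Insights query in the latency runbook above (`filter log_type = "bedrock_call"`) needs no regex parsing at all.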


Intuition Gained

| Instinct | What It Means |
|---|---|
| Anti-pattern recognition | Experienced engineers see silent failures (truncation, stale data, missing validation) before they manifest as user complaints — because they've built the mental model of what CAN go wrong |
| Alert fatigue avoidance | The right thresholds and evaluation periods prevent alert storms while catching real issues. The severity matrix is a living document, tuned per incident review |
| Cost-conscious debugging | Always start with the cheapest investigation (logs → metrics → Haiku replay → full suite). Never jump to the most expensive debugging path first |
| Pattern-based triage | A quality drop is not "one problem" — it's a symptom with multiple potential root causes. The cascading failure table teaches you to read metric patterns holistically |
| Defense-in-depth habit | No single layer prevents all failures. The layered model ensures that when one defense fails, the next catches it — and observability records everything for post-incident learning |