08: Best Practices and Patterns
AIP-C01 Mapping
Task 5.2 → All Skills (5.2.1–5.2.5): Cross-cutting best practices, anti-patterns, monitoring standards, and operational patterns for troubleshooting GenAI applications in production.
1. Anti-Patterns and Corrections
Anti-Pattern 1: Silent Token Truncation (Skill 5.2.1)
| Aspect | Anti-Pattern | Correction |
|---|---|---|
| Symptom | Prompt is assembled without checking length; FM silently drops trailing context | Pre-flight token budget check before every invocation |
| MangaAssist Example | 50-volume manga FAQ gets truncated; user gets incomplete answer with no error | TokenBudgetManager allocates fixed sections first, compresses history, logs utilization |
| Detection | Compare input_tokens in Bedrock response metadata against expected count | CloudWatch metric: BudgetUtilization > 90% alarm |
| Cost of Inaction | Users see wrong answers, blame the chatbot, lose trust — no alert fires | ~2-5% of all requests silently degraded during peak |
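The pre-flight check above can be sketched as follows. This is a minimal illustration, not the actual TokenBudgetManager: the 4-characters-per-token estimate, the section names, and the 8,000-token default budget are all assumptions.

```python
# Sketch of a pre-flight token budget check: allocate fixed sections first,
# then trim conversation history oldest-first until it fits the budget.
# The ~4 chars/token heuristic is a rough assumption, not a real tokenizer.

def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def build_prompt(system: str, context: str, history: list[str],
                 question: str, budget: int = 8000) -> tuple[str, float]:
    fixed = (estimate_tokens(system) + estimate_tokens(context)
             + estimate_tokens(question))
    if fixed > budget:
        # Fail loudly instead of letting the FM truncate silently.
        raise ValueError(f"Fixed sections ({fixed} tokens) exceed budget ({budget})")
    remaining = budget - fixed
    kept: list[str] = []
    for turn in reversed(history):          # keep the most recent turns first
        cost = estimate_tokens(turn)
        if cost > remaining:
            break
        kept.insert(0, turn)
        remaining -= cost
    prompt = "\n\n".join([system, context, *kept, question])
    utilization = (budget - remaining) / budget  # emit as BudgetUtilization
    return prompt, utilization
```

Logging `utilization` on every call is what makes the `BudgetUtilization > 90%` alarm possible.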
Anti-Pattern 2: Retry Without Circuit Breaker (Skill 5.2.2)
| Aspect | Anti-Pattern | Correction |
|---|---|---|
| Symptom | Unlimited retries on Bedrock 429/503 errors; adds latency and cost, worsens throttling | Circuit breaker (CLOSED → OPEN after N failures → HALF_OPEN on cool-down) |
| MangaAssist Example | Prime Day spike: retries amplify 429s, orchestrator hits timeout, all requests fail | CircuitBreaker opens after 5 failures in 60s; returns cached/fallback response |
| Detection | Latency spikes correlated with retry count metrics | RetryCount > 3 per request → alarm |
| Cost of Inaction | Cascading failure: throttling → retries → more throttling → total outage | Circuit breaker limits blast radius to ~30s recovery |
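The CLOSED → OPEN → HALF_OPEN cycle can be sketched as below. The thresholds mirror the MangaAssist example (5 failures in 60s, ~30s cool-down), but this class is illustrative, not the production BedrockClientWrapper.

```python
import time

class CircuitBreaker:
    """Sketch of a failure-counting circuit breaker (assumed thresholds)."""

    def __init__(self, max_failures=5, window_s=60, cooldown_s=30,
                 clock=time.monotonic):
        self.max_failures, self.window_s, self.cooldown_s = (
            max_failures, window_s, cooldown_s)
        self.clock = clock
        self.failures: list[float] = []   # timestamps of recent failures
        self.state = "CLOSED"
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.cooldown_s:
                self.state = "HALF_OPEN"  # let one probe request through
                return True
            return False                  # caller serves cached/fallback response
        return True

    def record_success(self):
        self.state = "CLOSED"
        self.failures.clear()

    def record_failure(self):
        now = self.clock()
        # Only count failures inside the sliding window.
        self.failures = [t for t in self.failures if now - t < self.window_s]
        self.failures.append(now)
        if self.state == "HALF_OPEN" or len(self.failures) >= self.max_failures:
            self.state = "OPEN"
            self.opened_at = now
```

The injected `clock` makes the state machine testable without real waits.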
Anti-Pattern 3: Testing Prompts Only Manually (Skill 5.2.3)
| Aspect | Anti-Pattern | Correction |
|---|---|---|
| Symptom | Prompt changes deployed after "looks good" spot-check; regressions discovered by users | Automated golden test suite run in CI before every prompt deployment |
| MangaAssist Example | Instruction reorder improved recommendations but broke Japanese output format | PromptTestRunner with 50+ golden test cases catches regressions before deploy |
| Detection | User complaint rate spikes after prompt deployment (lagging indicator) | Pre-deploy: GoldenTestSuite pass/fail gate. Post-deploy: PromptHealthChecker SPC |
| Cost of Inaction | Mean time to detect regression: hours to days (user complaint lag) | Automated: caught before production in CI pipeline |
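A golden-test gate can be sketched as a small function. PromptTestRunner and GoldenTestSuite are named in the table above, but their real interfaces are not shown in this file; the signature below is an assumption.

```python
# Sketch of a CI golden-test gate: each case pairs an input with a check
# function over the model output; the gate blocks deploy below a pass rate.

from typing import Callable

def run_golden_suite(render: Callable[[str], str],
                     cases: list[tuple[str, Callable[[str], bool]]],
                     pass_threshold: float = 0.95) -> tuple[float, bool]:
    passed = sum(1 for inp, check in cases if check(render(inp)))
    pass_rate = passed / len(cases)
    return pass_rate, pass_rate >= pass_threshold  # False blocks the deploy
```

In CI, `render` would call the candidate prompt version against the FM; the same threshold feeds the GoldenTestPassRate alarm in the severity matrix below.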
Anti-Pattern 4: Treating RAG as "Set and Forget" (Skill 5.2.4)
| Aspect | Anti-Pattern | Correction |
|---|---|---|
| Symptom | Documents ingested once; no freshness monitoring; stale data in production for weeks | Scheduled ingestion Lambda + embedding drift monitor + freshness alerts |
| MangaAssist Example | New manga catalog published weekly; old prices shown for 2 weeks after update | Daily delta ingestion via S3 event trigger + EmbeddingDriftMonitor (POC 4) |
| Detection | User reports "wrong price" → manual investigation → discover stale data | DocumentStaleness > 24h alarm on CloudWatch |
| Cost of Inaction | Incorrect product info → escalation → refund → trust loss | Freshness SLA: <4 hours post-catalog-update |
Anti-Pattern 5: No Observability on Prompt Templates (Skill 5.2.5)
| Aspect | Anti-Pattern | Correction |
|---|---|---|
| Symptom | Prompt templates stored in code; no version tracking, no quality metrics per template | Version-controlled templates with per-version CloudWatch metrics and SPC health checks |
| MangaAssist Example | Template edited directly in production; quality drops; no way to identify which change caused it | PromptObservabilityPipeline tags every trace with template name + version |
| Detection | "Something changed": manual diff of git commits to find the prompt change | PromptHealthChecker detects z-score anomaly within 1 hour of deployment |
| Cost of Inaction | MTTR for prompt-related issues: 4-8 hours (investigation time) | With observability: <30 minutes (pinpoint version that caused drop) |
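The SPC-style z-score check mentioned in the Detection row can be sketched as below. PromptHealthChecker's actual windowing is not shown in this file; the 3-sigma control limit and window contents here are assumptions.

```python
import statistics

# Sketch of an SPC health check: compare the latest window's mean quality
# score against a historical baseline using a z-score. A template version
# whose z-score leaves the control limits is flagged as anomalous.

def health_check(baseline: list[float], recent: list[float],
                 z_limit: float = 3.0) -> tuple[float, bool]:
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    z = (statistics.mean(recent) - mu) / sigma if sigma else 0.0
    return z, abs(z) > z_limit   # anomalous if outside control limits
```

Running this per (template_name, template_version) pair is what lets the pipeline pinpoint which version caused a drop.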
2. Monitoring and Alerting Best Practices
Metric Taxonomy
```mermaid
graph TD
    A[MangaAssist Metrics] --> B[Content Handling<br>5.2.1]
    A --> C[FM Integration<br>5.2.2]
    A --> D[Prompt Quality<br>5.2.3]
    A --> E[Retrieval Health<br>5.2.4]
    A --> F[Prompt Ops<br>5.2.5]
    B --> B1[BudgetUtilization %]
    B --> B2[TruncationCount]
    B --> B3[CompressionRatio]
    C --> C1[InvocationLatency P95]
    C --> C2[ErrorRate %]
    C --> C3[ThrottleCount]
    C --> C4[CircuitBreakerState]
    D --> D1[GoldenTestPassRate %]
    D --> D2[QualityScoreMean]
    D --> D3[HallucinationRate %]
    E --> E1[EmbeddingDriftP95]
    E --> E2[RetrievalMRR]
    E --> E3[DocumentStalenessHrs]
    F --> F1[TemplateHealthScore]
    F --> F2[SchemaViolationRate %]
    F --> F3[ConfusionPairCount]
```
Alert Severity Matrix
| Metric | Warning Threshold | Critical Threshold | Evaluation Period | Action |
|---|---|---|---|---|
| BudgetUtilization | >80% | >95% | 1 min | Slack / PagerDuty |
| InvocationLatency P95 | >3s | >8s | 5 min | Auto-scale / PagerDuty |
| ErrorRate | >2% | >10% | 5 min | PagerDuty + auto-rollback |
| ThrottleCount | >10/min | >50/min | 1 min | Auto-scale / check quota |
| CircuitBreakerState | HALF_OPEN | OPEN | Instant | PagerDuty + fallback |
| GoldenTestPassRate | <95% | <85% | Per deploy | Block deploy / rollback |
| EmbeddingDriftP95 | >0.10 | >0.15 | Daily | Trigger re-embedding |
| DocumentStaleness | >4h | >24h | 15 min | Slack ops / escalate |
| TemplateHealthScore | <0.8 | <0.6 | 1h | Investigate / rollback |
| SchemaViolationRate | >5% | >15% | 5 min | PagerDuty + fallback |
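The matrix above can be encoded so that services classify readings consistently before routing alerts. The dict layout and function below are an assumption; in practice each row would map to a pair of CloudWatch alarms with the listed evaluation periods.

```python
# Sketch: a subset of the severity matrix encoded as data. Thresholds are
# copied from the table; the "gt"/"lt" direction encodes whether higher or
# lower values are worse for that metric.

THRESHOLDS = {
    "BudgetUtilization":   {"warning": ("gt", 0.80), "critical": ("gt", 0.95)},
    "ErrorRate":           {"warning": ("gt", 0.02), "critical": ("gt", 0.10)},
    "GoldenTestPassRate":  {"warning": ("lt", 0.95), "critical": ("lt", 0.85)},
    "TemplateHealthScore": {"warning": ("lt", 0.80), "critical": ("lt", 0.60)},
}

def severity(metric: str, value: float) -> str:
    rules = THRESHOLDS[metric]
    for level in ("critical", "warning"):   # check the stricter tier first
        op, limit = rules[level]
        if (op == "gt" and value > limit) or (op == "lt" and value < limit):
            return level
    return "ok"
```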
CloudWatch Dashboard Layout
```
┌──────────────────────────────────────────────────────────┐
│ Row 1: FM Health (Skill 5.2.2)                           │
│ [Error Rate %]  [Latency P50/P95/P99]  [Circuit State]   │
├──────────────────────────────────────────────────────────┤
│ Row 2: Content & Retrieval (Skills 5.2.1 + 5.2.4)        │
│ [Budget Util %]  [Embedding Drift]  [Doc Staleness]      │
├──────────────────────────────────────────────────────────┤
│ Row 3: Prompt Quality (Skills 5.2.3 + 5.2.5)             │
│ [Quality Score]  [Health Score]  [Schema Violations]     │
├──────────────────────────────────────────────────────────┤
│ Row 4: Cost Tracking                                     │
│ [Token Usage]  [Bedrock Cost]  [Cache Hit Rate]          │
└──────────────────────────────────────────────────────────┘
```
3. Incident Response Patterns for GenAI Systems
Triage Decision Tree
```mermaid
flowchart TD
    A[User reports<br>bad output] --> B{Output<br>empty?}
    B -->|Yes| C{Bedrock<br>error in logs?}
    C -->|Yes| D[Skill 5.2.2:<br>FM Integration]
    C -->|No| E{Budget > 95%?}
    E -->|Yes| F[Skill 5.2.1:<br>Content Overflow]
    E -->|No| G[Check template<br>rendering logs]
    B -->|No| H{Output<br>incorrect?}
    H -->|Yes| I{RAG context<br>relevant?}
    I -->|No| J[Skill 5.2.4:<br>Retrieval Issue]
    I -->|Yes| K{Prompt version<br>changed recently?}
    K -->|Yes| L[Skill 5.2.3/5.2.5:<br>Prompt Regression]
    K -->|No| M[Skill 5.2.3:<br>Prompt Quality]
    H -->|No| N{Output<br>slow?}
    N -->|Yes| O[Check X-Ray spans<br>for bottleneck]
    O --> P{Bedrock span<br>slow?}
    P -->|Yes| D
    P -->|No| Q[Check retrieval<br>or preprocessing]
```
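The decision tree above can also be encoded as a small triage helper so that the same branching is applied consistently (and unit-tested). The `Signals` field names are assumed shorthand for what an operator reads off logs and metrics.

```python
from dataclasses import dataclass

# Sketch: the triage decision tree as code. Each field answers one of the
# diamond questions in the flowchart above.

@dataclass
class Signals:
    output_empty: bool = False
    bedrock_error: bool = False
    budget_over_95: bool = False
    output_incorrect: bool = False
    rag_context_relevant: bool = True
    prompt_recently_changed: bool = False
    output_slow: bool = False
    bedrock_span_slow: bool = False

def triage(s: Signals) -> str:
    if s.output_empty:
        if s.bedrock_error:
            return "5.2.2 FM integration"
        if s.budget_over_95:
            return "5.2.1 content overflow"
        return "check template rendering logs"
    if s.output_incorrect:
        if not s.rag_context_relevant:
            return "5.2.4 retrieval issue"
        if s.prompt_recently_changed:
            return "5.2.3/5.2.5 prompt regression"
        return "5.2.3 prompt quality"
    if s.output_slow:
        return ("5.2.2 FM integration" if s.bedrock_span_slow
                else "check retrieval or preprocessing")
    return "no matching branch: escalate"
```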
Incident Response Playbook Template
```yaml
incident_type: prompt_quality_regression
severity: P2
detection:
  signal: TemplateHealthScore dropped below 0.7
  alert: CloudWatch Alarm "MangaAssist-PromptHealth-Critical"
  time_to_detect: ~15 minutes (SPC anomaly detection)
immediate_actions:
  - 'Check last deployment: "Was a prompt template deployed in the last 2 hours?"'
  - "Check PromptObservabilityPipeline logs: filter by template_name and version"
  - Compare quality_score between current and previous version
  - "If confirmed regression: rollback template version via config update"
investigation:
  - Pull golden test results for the new version
  - Run PromptTestRunner.compare_versions(old, new)
  - Identify which test cases regressed
  - Check if model version also changed (compound root cause)
resolution:
  - Rollback prompt template
  - Add failing cases to golden test suite
  - Update CI gate to prevent similar regression
  - "Post-incident: update prompt change process checklist"
prevention:
  - Mandatory golden test pass before prompt deploy
  - "Canary deployment: 5% traffic for 1 hour before full rollout"
  - PromptHealthChecker SPC window reduced from 24h to 6h for tighter detection
```
4. Defense-in-Depth for Prompt and Retrieval Systems
Layer Model
```
┌─────────────────────────────────────────────┐
│ Layer 5: Business Rules                     │
│  • Compliance checks, content policy        │
│  • Guardrails (Bedrock Guardrails)          │
├─────────────────────────────────────────────┤
│ Layer 4: Observability                      │
│  • X-Ray traces, CloudWatch metrics         │
│  • Structured logging, anomaly detection    │
├─────────────────────────────────────────────┤
│ Layer 3: Validation                         │
│  • Schema validation on responses           │
│  • Token budget pre-flight checks           │
│  • Hallucination detection                  │
├─────────────────────────────────────────────┤
│ Layer 2: Resilience                         │
│  • Circuit breakers, retries with backoff   │
│  • Model tiering (fallback Sonnet → Haiku)  │
│  • Response caching (ElastiCache)           │
├─────────────────────────────────────────────┤
│ Layer 1: Infrastructure                     │
│  • ECS auto-scaling, Bedrock provisioned    │
│  • OpenSearch replica shards                │
│  • DynamoDB on-demand capacity              │
└─────────────────────────────────────────────┘
```
Defense-in-Depth Checklist
| Layer | Check | Applied In MangaAssist |
|---|---|---|
| Infrastructure | Auto-scaling configured for compute | ECS Fargate target tracking on CPU/memory |
| Infrastructure | Database capacity handles peak | DynamoDB on-demand mode for conversations |
| Resilience | FM calls have retry + circuit breaker | BedrockClientWrapper (file 02) |
| Resilience | Fallback model configured | Sonnet → Haiku fallback on circuit open |
| Resilience | Response cache reduces FM load | ElastiCache for frequent product queries |
| Validation | Token budget checked pre-invocation | TokenBudgetManager (file 01) |
| Validation | Response schema validated | ProductionSchemaValidator (file 05) |
| Validation | Hallucination signals checked | Ungrounded claim detection (file 04) |
| Observability | Every FM call traced | X-Ray spans + structured logging (file 05) |
| Observability | Per-template metrics emitted | PromptObservabilityPipeline (file 05) |
| Observability | Alerts configured for all tiers | Alert matrix above |
| Business Rules | Content policy enforced | Bedrock Guardrails for toxicity, PII |
| Business Rules | Prompt injection defenses | Input sanitization (Security-Privacy folder) |
5. Cost-Aware Troubleshooting
Principle: Debug Cheap, Fix Precisely
```mermaid
flowchart LR
    A[Issue Reported] --> B[Cheap: Check logs<br>$0 — CloudWatch]
    B --> C[Cheap: Check metrics<br>$0 — CloudWatch]
    C --> D[Medium: Replay<br>with Haiku<br>~$0.001/test]
    D --> E[Expensive: Replay<br>with Sonnet<br>~$0.01/test]
    E --> F[Most Expensive:<br>Full regression<br>suite ~$2-5]
```
Cost-Saving Debugging Practices
| Practice | How It Saves Money | When to Use |
|---|---|---|
| Log-first investigation | CloudWatch Logs are nearly free; avoid re-running FM calls for diagnosis | Always — start every investigation here |
| Use Haiku for reproduction | 20× cheaper than Sonnet for confirming a bug exists | When you need to verify a prompt issue is reproducible |
| Cache test results | Don't re-run identical golden tests if prompt hasn't changed | CI pipeline — skip unchanged templates |
| Sample-based drift checks | Check 200 of 100K docs instead of all | Daily drift monitor — statistically valid at 200 |
| Batch metric queries | Logs Insights query across time range instead of point queries | Investigation phase — one query covers hours |
| Structured logs reduce parsing | JSON logs → direct Logs Insights queries; no regex parsing needed | All environments — reduces investigation time |
Cost Impact by Investigation Type
| Investigation | Estimated Cost | Duration | When Justified |
|---|---|---|---|
| CloudWatch Logs + Metrics review | ~$0 | 15 min | Always first step |
| Single prompt reproduction (Haiku) | ~$0.001 | 2 min | Confirming bug exists |
| Golden test suite (50 cases, Haiku) | ~$0.05 | 5 min | Pre-deploy validation |
| Full regression suite (50 cases, Sonnet) | ~$2.00 | 10 min | Major version changes |
| Embedding drift scan (200 docs) | ~$0.10 | 3 min | Daily scheduled |
| Full re-embedding (100K docs) | ~$50 | 2 hours | Only after confirmed drift |
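The sample-based drift check from the cost-saving table can be sketched as follows: re-embed a random sample of stored documents and compute the P95 cosine distance between stored and fresh vectors. The `embed_fn` callable and the nearest-rank P95 are assumptions; the EmbeddingDriftP95 thresholds come from the alert matrix above.

```python
import math
import random

# Sketch of a sample-based embedding drift check: rather than re-embedding
# 100K docs, sample 200, re-embed only those, and report P95 cosine distance.

def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def drift_p95(stored: dict[str, list[float]], embed_fn,
              sample_size: int = 200, seed: int = 42) -> float:
    """embed_fn(doc_id) returns a freshly computed embedding for that doc."""
    ids = random.Random(seed).sample(list(stored), min(sample_size, len(stored)))
    dists = sorted(cosine_distance(stored[i], embed_fn(i)) for i in ids)
    return dists[int(0.95 * (len(dists) - 1))]   # P95 by nearest rank
```

If the returned value crosses 0.15, the daily job would trigger the full (and expensive) re-embedding listed in the last table row.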
6. Operational Runbook Templates
Runbook: FM Latency Spike
```
TRIGGER: InvocationLatency P95 > 5s for 5 minutes
SEVERITY: P2

STEP 1: Check Bedrock service health
  → AWS Health Dashboard → Bedrock → ap-northeast-1
  → If service degradation: nothing to do, wait for AWS resolution

STEP 2: Check throttling
  → CloudWatch: MangaAssist/BedrockHealth → ThrottleCount
  → If high: check Bedrock quota usage (Service Quotas console)
  → If near quota: request increase or enable model tiering fallback

STEP 3: Check circuit breaker state
  → CloudWatch: MangaAssist/BedrockHealth → CircuitBreakerState
  → If OPEN: system is self-protecting, check underlying cause
  → If CLOSED but slow: issue is latency, not errors

STEP 4: Check input size
  → Logs Insights:
      fields @timestamp, input_tokens, latency_ms
      | filter log_type = "bedrock_call"
      | stats avg(input_tokens), avg(latency_ms) by bin(5m)
  → If input_tokens increased: check TokenBudgetManager logs

STEP 5: Mitigate
  → If traffic spike: no action if auto-scaling handles it
  → If input bloat: enable aggressive history compression
  → If Bedrock issue: switch to Haiku fallback via feature flag
```
Runbook: Prompt Quality Drop
```
TRIGGER: TemplateHealthScore < 0.7 for 1 hour
SEVERITY: P2

STEP 1: Identify affected template
  → CloudWatch: MangaAssist/PromptObservability → filter by Template dimension
  → Note template_name and template_version

STEP 2: Check for recent deployments
  → git log --since="2h ago" -- prompts/
  → If template was changed: this is the likely root cause

STEP 3: Compare versions
  → Run PromptTestRunner.compare_versions(prev_version, current_version)
  → Identify which quality dimensions regressed

STEP 4: Rollback if confirmed
  → Update prompt template config to previous version
  → Monitor TemplateHealthScore recovery (expect 15-30 min)

STEP 5: Root cause analysis
  → Review the prompt change: what was the intent?
  → Add failing cases to golden test suite
  → Update CI pipeline if test gap identified
```
7. Cross-Skill Integration Patterns
Pattern: Cascading Failure Detection
When one subsystem fails, multiple metrics shift simultaneously. Trained operators learn to read the pattern of metric changes rather than individual alarms.
| Root Cause | Content (5.2.1) | FM (5.2.2) | Prompt (5.2.3) | Retrieval (5.2.4) | Maintenance (5.2.5) |
|---|---|---|---|---|---|
| Bedrock throttling | Budget OK | ↑ Throttle, ↑ Latency | Quality drops (timeouts) | OK | Health score drops |
| Stale RAG data | OK | OK | Quality drops (wrong facts) | ↑ Staleness | OK |
| Prompt regression | OK | OK | ↑ Test failures | OK | ↑ Schema violations |
| Token overflow | ↑ Budget util | OK (or error if limit hit) | Quality drops (truncation) | OK | OK |
| Embedding drift | OK | OK | Quality drops (irrelevant context) | ↑ Drift metric, ↓ MRR | OK |
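Reading the pattern of shifted metrics can itself be sketched as a signature match. The signature sets below are assumptions derived from the table rows; metric names follow the taxonomy in section 2.

```python
# Sketch of pattern-based triage: rank likely root causes by how much of
# each cause's metric signature overlaps with the metrics that shifted.

SIGNATURES = {
    "bedrock_throttling": {"ThrottleCount", "InvocationLatency",
                           "QualityScoreMean", "TemplateHealthScore"},
    "stale_rag_data":     {"QualityScoreMean", "DocumentStalenessHrs"},
    "prompt_regression":  {"GoldenTestPassRate", "SchemaViolationRate"},
    "token_overflow":     {"BudgetUtilization", "QualityScoreMean"},
    "embedding_drift":    {"QualityScoreMean", "EmbeddingDriftP95",
                           "RetrievalMRR"},
}

def likely_root_causes(shifted: set[str]) -> list[str]:
    """Return candidate root causes, best overlap first."""
    scores = {cause: len(sig & shifted) / len(sig)
              for cause, sig in SIGNATURES.items()}
    return [c for c, s in sorted(scores.items(), key=lambda kv: -kv[1]) if s > 0]
```

Note that QualityScoreMean appears in most signatures, which is exactly the point of the key insight below: a quality drop alone does not identify the root cause.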
Key insight: A quality drop in Skill 5.2.3 is rarely a prompt-only problem. Always check 5.2.1 (was context truncated?), 5.2.4 (was context relevant?), and 5.2.2 (did the FM call succeed cleanly?) before blaming the prompt.
Pattern: Observability Stack per Request
Every MangaAssist request should produce:
- Trace ID — correlates all spans and logs for one request
- Budget check log — token utilization before FM call
- Bedrock call log — input/output tokens, latency, status, model ID
- Retrieval log — query, top-K results, relevance scores
- Validation log — schema pass/fail, quality score
- Intent log — classified intent, confidence
This enables any investigation to start from the trace ID and follow the full request lifecycle.
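The per-request record described above could be assembled as one JSON line, which is what makes Logs Insights queries over its fields possible. The field names mirror the bullet list, but the exact schema is an assumption.

```python
import json
import time
import uuid

# Sketch of a per-request structured log record; each nested dict corresponds
# to one bullet in the list above.

def make_trace_record(intent: str, confidence: float, budget_util: float,
                      bedrock: dict, retrieval: dict, validation: dict) -> dict:
    return {
        "trace_id": str(uuid.uuid4()),    # correlates all spans and logs
        "timestamp": time.time(),
        "budget_check": {"utilization": budget_util},
        "bedrock_call": bedrock,          # tokens, latency_ms, status, model ID
        "retrieval": retrieval,           # query, top-K results, relevance scores
        "validation": validation,         # schema pass/fail, quality score
        "intent": {"label": intent, "confidence": confidence},
    }

# Emit as a single JSON line so fields are queryable without regex parsing:
# print(json.dumps(make_trace_record(...)))
```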
Intuition Gained
| Instinct | What It Means |
|---|---|
| Anti-pattern recognition | Experienced engineers see silent failures (truncation, stale data, missing validation) before they manifest as user complaints — because they've built the mental model of what CAN go wrong |
| Alert fatigue avoidance | The right thresholds and evaluation periods prevent alert storms while catching real issues. The severity matrix is a living document, tuned per incident review |
| Cost-conscious debugging | Always start with the cheapest investigation (logs → metrics → Haiku replay → full suite). Never jump to the most expensive debugging path first |
| Pattern-based triage | A quality drop is not "one problem" — it's a symptom with multiple potential root causes. The cascading failure table teaches you to read metric patterns holistically |
| Defense-in-depth habit | No single layer prevents all failures. The layered model ensures that when one defense fails, the next catches it — and observability records everything for post-incident learning |