Holistic Observability Architecture for GenAI Applications
AWS AIP-C01 Task 4.3 — Skill 4.3.1: Create holistic observability systems for FM application performance
Context: MangaAssist e-commerce chatbot — Bedrock Claude 3, OpenSearch Serverless, DynamoDB, ECS Fargate, API Gateway WebSocket
Skill Mapping
| Certification | Domain | Task | Skill |
|---|---|---|---|
| AWS AIP-C01 | Domain 4 — Responsible AI & Monitoring | Task 4.3 — Monitor FM applications | Skill 4.3.1 — Create holistic observability systems for FM application performance |
Skill scope: Design and implement end-to-end observability covering metrics, traces, logs, and events across every layer of a GenAI application — from infrastructure through FM invocation to business outcomes.
Mind Map — Holistic Observability Topology
mindmap
root((Holistic<br/>Observability))
Metrics
Infrastructure
ECS CPU & Memory
Lambda Concurrency
API Gateway Request Count
OpenSearch IOPS
Service
Orchestrator Latency
Cache Hit Rate
Circuit Breaker State
Queue Depth
FM-Specific
Tokens per Second
Model Latency P50/P95/P99
Throttle Rate
Cost per Request
Guardrail Trigger Rate
Business
Conversion Rate
CSAT Score
Revenue per Session
Cart Additions via Chat
Traces
Distributed Traces
X-Ray Segments
Cross-Service Spans
Async Correlation
FM Interaction Traces
Prompt Assembly
RAG Retrieval
Model Invocation
Guardrail Check
Response Formatting
Business Correlation
Trace-to-Session Link
Trace-to-Outcome Link
Revenue Attribution
Logs
Structured Application Logs
JSON with Correlation IDs
Error Context
Request/Response Pairs
Bedrock Invocation Logs
Input/Output Tokens
Model Latency
Model ID & Version
Prompt Text (sampled)
Guardrail Trace Logs
Filter Actions
Blocked Content
Policy Violations
PII Detection Events
Events
State Changes
Deployment Events
Config Updates
Prompt Version Changes
Anomalies
Drift Detection
Threshold Breaches
Cost Spikes
Business Events
Flash Sale Start/End
Catalog Updates
Inventory Changes
The 4 Pillars of GenAI Observability
Pillar 1 — Metrics: Quantitative Measurements
Metrics are numeric values collected at regular intervals. They are the cheapest form of observability data and the first thing you check during an incident. In GenAI, metrics extend far beyond CPU and memory.
Infrastructure Metrics (the foundation)
| Metric | Source | CloudWatch Namespace | Why It Matters for GenAI |
|---|---|---|---|
| ECS CPU Utilization | ECS Agent | AWS/ECS | Embedding generation and prompt assembly are CPU-intensive |
| ECS Memory Utilization | ECS Agent | AWS/ECS | Large context windows require substantial memory for prompt building |
| Lambda Concurrent Executions | Lambda Service | AWS/Lambda | Throttling here blocks guardrail checks or post-processing |
| API Gateway WebSocket Connections | API GW | AWS/ApiGateway | Active user count; WebSocket disconnects degrade UX |
| OpenSearch Search Latency | OS Serverless | AWS/AOSS | Slow vector search = slow RAG = slow response |
| DynamoDB Read/Write Capacity | DynamoDB | AWS/DynamoDB | Session history reads scale with conversation length |
Service Metrics (the middleware)
| Metric | Type | Description |
|---|---|---|
| `orchestrator.latency_ms` | Histogram | End-to-end time from request receipt to response dispatch |
| `cache.hit_rate` | Gauge | Percentage of requests served from semantic cache |
| `cache.semantic_similarity` | Histogram | Cosine similarity scores for cache matches |
| `circuit_breaker.state` | Gauge | 0=closed, 0.5=half-open, 1=open per downstream service |
| `circuit_breaker.trip_count` | Counter | Number of times circuit opened in the window |
| `retry.count` | Counter | Retries to Bedrock, OpenSearch, DynamoDB |
| `queue.depth` | Gauge | Pending async requests (for streaming responses) |
FM-Specific Metrics (the differentiator)
These metrics do not exist in traditional applications. They are the core of GenAI observability:
| Metric | Type | Unit | Alert Threshold | Description |
|---|---|---|---|---|
| `fm.input_tokens` | Counter | Count | — | Tokens sent to model per invocation |
| `fm.output_tokens` | Counter | Count | — | Tokens received from model per invocation |
| `fm.total_tokens` | Counter | Count | >4000/request | Combined token count (cost driver) |
| `fm.latency_ms` | Histogram | ms | P95 > 5000 | Model invocation latency |
| `fm.tokens_per_second` | Gauge | Count/s | <20 tps | Generation speed (UX quality) |
| `fm.throttle_count` | Counter | Count | >5/min | Bedrock throttling events |
| `fm.cost_per_request` | Gauge | USD | >$0.05 | Estimated cost per invocation |
| `fm.cost_per_hour` | Gauge | USD | >$50/hr | Rolling hourly spend |
| `fm.guardrail_block_rate` | Gauge | Percent | >10% | Percentage of responses blocked by guardrails |
| `fm.quality_score` | Gauge | 0-1 | <0.7 | Automated relevance/quality score |
| `fm.hallucination_rate` | Gauge | Percent | >5% | Responses containing ungrounded claims |
| `fm.cache_hit_rate` | Gauge | Percent | — | Semantic cache effectiveness |
Business Metrics (the purpose)
| Metric | Calculation | Target | Observability Source |
|---|---|---|---|
| Conversion Rate | Orders via chat / total chat sessions | >3% | DynamoDB session + order events |
| CSAT | Post-chat survey score | >4.2/5 | Survey Lambda → CloudWatch |
| Revenue per Chat Session | Total revenue / chat sessions | >$2.50 | DynamoDB aggregation |
| Cart Addition Rate | Add-to-cart events via chat / recommendations shown | >15% | Frontend events → Kinesis |
| Deflection Rate | Issues resolved without human / total issues | >80% | Escalation service metrics |
| Time to Resolution | Median turns to fulfill user intent | <4 turns | Session log analysis |
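Ratios like conversion rate don't need a separate aggregation job; CloudWatch metric math can derive them from two counters. A minimal sketch, assuming hypothetical ChatOrders and ChatSessions metrics in a MangaAssist/Business namespace:
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

# Metric math: conversion rate = orders via chat / total chat sessions.
# "ChatOrders" and "ChatSessions" are hypothetical metric names.
response = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {
            "Id": "orders",
            "MetricStat": {
                "Metric": {"Namespace": "MangaAssist/Business", "MetricName": "ChatOrders"},
                "Period": 3600,
                "Stat": "Sum",
            },
            "ReturnData": False,
        },
        {
            "Id": "sessions",
            "MetricStat": {
                "Metric": {"Namespace": "MangaAssist/Business", "MetricName": "ChatSessions"},
                "Period": 3600,
                "Stat": "Sum",
            },
            "ReturnData": False,
        },
        # Expression queries may reference the Ids above
        {"Id": "conversion", "Expression": "100 * orders / sessions", "Label": "ConversionRatePct"},
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=24),
    EndTime=datetime.now(timezone.utc),
)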
Pillar 2 — Traces: Request Path Tracking
Traces reveal the full journey of a single request through all services. For GenAI, traces must capture both infrastructure hops and the internal FM interaction pipeline.
Distributed Traces via AWS X-Ray
[User WebSocket Message]
└─ API Gateway (segment: 12ms)
└─ ECS Orchestrator (segment: 4200ms)
├─ DynamoDB GetItem — session history (subsegment: 8ms)
├─ Intent Classifier — Bedrock Haiku (subsegment: 180ms)
├─ OpenSearch kNN — vector retrieval (subsegment: 95ms)
├─ Prompt Assembly (subsegment: 5ms)
├─ Bedrock InvokeModel — Claude 3 Sonnet (subsegment: 3400ms) ★ dominant
├─ Guardrail Check (subsegment: 45ms)
├─ DynamoDB PutItem — save turn (subsegment: 6ms)
└─ Response Formatting (subsegment: 3ms)
└─ API Gateway Response (segment: 8ms)
Key X-Ray annotations for FM traces:
| Annotation Key | Example Value | Purpose |
|---|---|---|
| `intent` | `product_search` | Filter traces by user intent |
| `model_id` | `anthropic.claude-3-sonnet-20240229-v1:0` | Filter by model version |
| `input_tokens` | `1847` | Identify expensive invocations |
| `cache_hit` | `true` | Measure cache effectiveness |
| `guardrail_action` | `pass` | Find blocked requests |
| `quality_score` | `0.85` | Correlate quality with latency/cost |
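Annotations are attached to the relevant subsegment at invocation time. A sketch using the X-Ray SDK for Python (aws_xray_sdk); it assumes a segment is already open (e.g., via the SDK's middleware), and the values mirror the trace examples in this section:
from aws_xray_sdk.core import xray_recorder

# Annotate the Bedrock subsegment so traces are filterable by the keys above
subsegment = xray_recorder.begin_subsegment("bedrock.invoke_model")
try:
    # response = bedrock_runtime.invoke_model(...)  # actual invocation elided
    subsegment.put_annotation("intent", "product_search")
    subsegment.put_annotation("model_id", "anthropic.claude-3-sonnet-20240229-v1:0")
    subsegment.put_annotation("input_tokens", 1847)
    subsegment.put_annotation("cache_hit", False)
    subsegment.put_annotation("guardrail_action", "pass")
finally:
    xray_recorder.end_subsegment()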
FM Interaction Trace — Internal Pipeline
Unlike traditional traces that track service-to-service calls, FM traces track the reasoning pipeline:
[FM Interaction Trace]
├─ 1. Context Loading (8ms)
│ Load 5 prior turns from DynamoDB
├─ 2. Intent Classification (180ms)
│ Bedrock Haiku → "product_search" (confidence: 0.94)
├─ 3. Query Understanding (12ms)
│ Extract: genre=similar_to_one_piece, price<20, format=manga
├─ 4. RAG Retrieval (95ms)
│ OpenSearch kNN: 8 chunks, max_score=0.89
├─ 5. Prompt Assembly (5ms)
│ System prompt + context (5 turns) + RAG chunks (8) + user query
│ Total tokens: 1,847 input
├─ 6. Model Invocation (3,400ms) ← 80% of total time
│ Bedrock Claude 3 Sonnet
│ Output tokens: 342
├─ 7. Guardrail Check (45ms)
│ Content filter: pass, PII check: pass, Topic guard: pass
└─ 8. Response Delivery (3ms)
Format + stream to WebSocket
Business Correlation — Trace → Outcome
trace_id: abc-123-def
→ session_id: sess-456
→ user_id: user-789
→ outcome: purchase ($18.99)
→ satisfaction: 5/5
→ turns_to_convert: 3
This correlation lets you answer: "Requests that go through the slow path (P95 latency) — do they convert less?"
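One way to enable that join is to stamp the trace ID onto the session record when the request completes, so Athena or Logs Insights can later connect traces to outcomes. A sketch with assumed table and attribute names:
import boto3

dynamodb = boto3.resource("dynamodb")
# "mangaassist-sessions" and the attribute names are illustrative
table = dynamodb.Table("mangaassist-sessions")

table.update_item(
    Key={"session_id": "sess-456"},
    UpdateExpression="SET last_trace_id = :t, outcome = :o, turns_to_convert = :n",
    ExpressionAttributeValues={":t": "abc-123-def", ":o": "purchase", ":n": 3},
)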
Pillar 3 — Logs: Structured Event Records
Application Logs — JSON Structured
{
"timestamp": "2025-03-15T14:22:33.456Z",
"level": "INFO",
"service": "mangaassist-orchestrator",
"correlation_id": "corr-abc-123",
"session_id": "sess-456",
"trace_id": "1-abc-def",
"span_id": "span-789",
"event": "fm_invocation_complete",
"model_id": "anthropic.claude-3-sonnet-20240229-v1:0",
"intent": "product_search",
"input_tokens": 1847,
"output_tokens": 342,
"latency_ms": 3400,
"cache_hit": false,
"guardrail_action": "pass",
"quality_score": 0.85,
"cost_usd": 0.01068,
"rag_chunks_retrieved": 8,
"rag_max_similarity": 0.89
}
Critical rule: Every log line includes correlation_id, session_id, and trace_id. This is non-negotiable for GenAI debugging.
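A simple way to enforce the rule is a formatter that emits every line as JSON and surfaces missing IDs loudly instead of dropping them. A minimal sketch using the standard logging module; the formatter class and field fallbacks are assumptions, not a prescribed library:
import json
import logging

class CorrelatedJsonFormatter(logging.Formatter):
    """Emit every log line as JSON carrying the correlation fields."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": "mangaassist-orchestrator",
            # Fall back loudly rather than dropping the IDs silently
            "correlation_id": getattr(record, "correlation_id", "MISSING"),
            "session_id": getattr(record, "session_id", "MISSING"),
            "trace_id": getattr(record, "trace_id", "MISSING"),
            "event": record.getMessage(),
        }
        entry.update(getattr(record, "attributes", {}))
        return json.dumps(entry)

logger = logging.getLogger("mangaassist")
handler = logging.StreamHandler()
handler.setFormatter(CorrelatedJsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("fm_invocation_complete", extra={
    "correlation_id": "corr-abc-123",
    "session_id": "sess-456",
    "trace_id": "1-abc-def",
    "attributes": {"input_tokens": 1847, "latency_ms": 3400},
})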
Bedrock Model Invocation Logs
Enabled via Bedrock → Model Invocation Logging → S3 + CloudWatch:
{
"schemaType": "ModelInvocationLog",
"schemaVersion": "1.0",
"timestamp": "2025-03-15T14:22:33.000Z",
"accountId": "123456789012",
"identity": { "arn": "arn:aws:sts::123456789012:assumed-role/MangaAssistECSRole/..." },
"region": "us-east-1",
"requestId": "req-abc-123",
"operation": "InvokeModel",
"modelId": "anthropic.claude-3-sonnet-20240229-v1:0",
"input": {
"inputContentType": "application/json",
"inputTokenCount": 1847,
"inputBodyJson": { "...": "prompt text (if logging enabled)" }
},
"output": {
"outputContentType": "application/json",
"outputTokenCount": 342,
"outputBodyJson": { "...": "response text (if logging enabled)" }
}
}
Note: Input/output body logging is optional and should be sampled (e.g., 5%) in production to control costs and avoid storing PII.
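Invocation logging is configured account-wide through the Bedrock control-plane API. A sketch with boto3; the log group, bucket, and role names are placeholders, and the role must allow Bedrock to write to both destinations:
import boto3

bedrock = boto3.client("bedrock")

bedrock.put_model_invocation_logging_configuration(
    loggingConfig={
        "cloudWatchConfig": {
            "logGroupName": "/mangaassist/bedrock/invocations",
            "roleArn": "arn:aws:iam::123456789012:role/BedrockLoggingRole",
        },
        "s3Config": {
            "bucketName": "mangaassist-observability-logs",
            "keyPrefix": "bedrock-invocations/",
        },
        # Keep body logging off here; sample prompt/response text at the
        # application layer instead (see the note above on PII and cost)
        "textDataDeliveryEnabled": False,
        "imageDataDeliveryEnabled": False,
        "embeddingDataDeliveryEnabled": False,
    }
)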
Guardrail Trace Logs
{
"timestamp": "2025-03-15T14:22:33.500Z",
"guardrail_id": "guard-manga-001",
"action": "BLOCKED",
"trace": {
"contentFilter": { "hate": "NONE", "violence": "LOW", "sexual": "NONE" },
"topicPolicy": { "topics": ["competitor_promotion"], "action": "BLOCKED" },
"wordPolicy": { "matchedWords": [], "action": "NONE" },
"piiDetection": { "entities": ["EMAIL"], "action": "ANONYMIZED" }
},
"input_assessment": "PASS",
"output_assessment": "BLOCKED",
"correlation_id": "corr-abc-123"
}
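Guardrail logs become actionable once aggregated into metrics. A sketch of a Lambda processor fed by a CloudWatch Logs subscription filter on the guardrail log group; the namespace and metric name are illustrative:
import base64
import gzip
import json

import boto3

cloudwatch = boto3.client("cloudwatch")

def handler(event, context):
    # CloudWatch Logs subscriptions deliver base64-encoded, gzipped payloads
    payload = json.loads(gzip.decompress(base64.b64decode(event["awslogs"]["data"])))
    blocked = sum(
        1
        for log_event in payload["logEvents"]
        if json.loads(log_event["message"]).get("action") == "BLOCKED"
    )
    if blocked:
        cloudwatch.put_metric_data(
            Namespace="MangaAssist/Guardrails",
            MetricData=[{"MetricName": "BlockedResponses", "Value": blocked, "Unit": "Count"}],
        )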
Pillar 4 — Events: State Changes and Signals
Events are discrete occurrences that provide context for metric anomalies.
Deployment Events
{
"event_type": "deployment",
"timestamp": "2025-03-15T10:00:00Z",
"details": {
"service": "mangaassist-orchestrator",
"version": "v2.3.1",
"change_type": "prompt_update",
"description": "Updated product search system prompt — added price comparison instructions",
"deployer": "ci-pipeline",
"rollback_id": "v2.3.0"
}
}
Anomaly Events
{
"event_type": "anomaly",
"timestamp": "2025-03-15T14:30:00Z",
"details": {
"metric": "fm.cost_per_hour",
"expected_value": 32.50,
"actual_value": 78.20,
"deviation_band": 2.4,
"detection_model": "CloudWatch Anomaly Detection",
"probable_cause": "New prompt version increased average token count by 40%"
}
}
Business Events
{
"event_type": "business",
"timestamp": "2025-03-15T08:00:00Z",
"details": {
"event": "flash_sale_start",
"category": "manga_shonen",
"expected_traffic_multiplier": 3.5,
"duration_hours": 4,
"auto_scaling_triggered": true
}
}
Overlay deployment and business events on metric dashboards so operators can immediately correlate "cost spiked when we deployed the new prompt" or "latency increased when the flash sale started."
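A lightweight way to capture and distribute these events is EventBridge. A sketch of publishing the deployment event above; the Source and DetailType values are naming conventions assumed here, not AWS-defined:
import json
import boto3

events = boto3.client("events")

events.put_events(
    Entries=[{
        "Source": "mangaassist.deployments",
        "DetailType": "prompt_update",
        "Detail": json.dumps({
            "service": "mangaassist-orchestrator",
            "version": "v2.3.1",
            "description": "Updated product search system prompt",
            "rollback_id": "v2.3.0",
        }),
    }]
)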
How GenAI Observability Differs from Traditional App Monitoring
| Dimension | Traditional Application | GenAI Application | Why It Matters |
|---|---|---|---|
| Determinism | Same input → same output (usually) | Same input → different output each time | Cannot test with simple assertion-based monitoring; need statistical quality checks |
| Cost Model | Fixed infra cost, predictable | Per-token variable cost, can spike 10x | Must monitor cost in real-time; a single bad prompt can cost $100s |
| Quality Measurement | HTTP 200 = success | HTTP 200 but answer is wrong/hallucinated | Need content-level quality metrics (relevance, groundedness, faithfulness) |
| Failure Modes | Crash, timeout, error code | Silent degradation, hallucination, drift | Traditional alerting misses GenAI-specific failures; need semantic monitoring |
| Debugging Approach | Read stack trace, reproduce bug | Analyze prompt, context, retrieval quality, model behavior | Logs must capture full prompt chain, RAG context, and model reasoning |
| Latency Profile | ms range, predictable | Seconds range, high variance (P50 vs P99 can differ 5x) | Streaming UX critical; must monitor time-to-first-token separately |
| Scaling Behavior | CPU/memory proportional | Token throughput limited by model quotas | Scaling is constrained by Bedrock TPM/RPM limits, not just compute |
| Security Concerns | SQL injection, XSS | Prompt injection, data exfiltration via prompts, PII leakage | Guardrail monitoring is a first-class observability concern |
| Compliance | Data retention, access logs | Model input/output auditing, bias monitoring, content filtering | Must log model interactions for audit; retention policies differ |
| User Experience | Page load time, click-through | Response relevance, conversation coherence, personality consistency | UX metrics require NLP-based evaluation, not just timing |
| Dependency Chain | DB → Cache → API | DB → Cache → Vector Store → Embedding Model → LLM → Guardrails | Longer chain = more failure points; each link needs distinct monitoring |
| Capacity Planning | Forecast from request volume | Forecast from token volume × model pricing | Capacity planning requires predicting token consumption patterns |
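To make the last row concrete, a back-of-envelope forecast using this document's own numbers (about 200K requests/day at 1,847 input and 342 output tokens on Claude 3 Sonnet):
# Token-based capacity forecast, using figures quoted in this document
requests_per_day = 200_000
input_tokens, output_tokens = 1847, 342
# Claude 3 Sonnet on-demand pricing per 1K tokens (us-east-1)
cost_per_request = (input_tokens * 0.003 + output_tokens * 0.015) / 1000  # ≈ $0.01068
daily_cost = requests_per_day * cost_per_request
print(f"≈ ${daily_cost:,.0f}/day before semantic caching and Haiku routing")  # ≈ $2,134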
Architecture — Data Collection to Action
graph TB
subgraph Sources["Data Sources"]
style Sources fill:#f59e0b,stroke:#d97706,color:#000
ECS["ECS Fargate<br/>Orchestrator"]
BDK["Bedrock<br/>Claude 3"]
OSS["OpenSearch<br/>Serverless"]
DDB["DynamoDB<br/>Sessions"]
APIGW["API Gateway<br/>WebSocket"]
GRL["Bedrock<br/>Guardrails"]
FE["Frontend<br/>React App"]
end
subgraph Collection["Collection Layer"]
CWAgent["CloudWatch Agent<br/>(ECS sidecar)"]
XRaySDK["X-Ray SDK<br/>(instrumented code)"]
BDKLog["Bedrock Invocation<br/>Logging (native)"]
CustPub["Custom Metric<br/>Publisher (Python)"]
KDS["Kinesis Data<br/>Stream"]
FEBeacon["Frontend<br/>Beacon API"]
end
subgraph Aggregation["Aggregation Layer"]
CWMetrics["CloudWatch<br/>Metrics"]
CWLogs["CloudWatch<br/>Logs"]
XRayTraces["X-Ray<br/>Traces"]
S3Raw["S3 Raw Logs<br/>(Bedrock invocations)"]
S3Events["S3 Events<br/>(business signals)"]
end
subgraph Processing["Processing Layer"]
CWInsights["CloudWatch<br/>Logs Insights"]
Athena["Athena<br/>(S3 query)"]
LambdaProc["Lambda<br/>Processors"]
CWAnomaly["CloudWatch<br/>Anomaly Detection"]
MetricMath["CloudWatch<br/>Metric Math"]
end
subgraph Visualization["Visualization Layer"]
CWDash["CloudWatch<br/>Dashboards"]
Grafana["Grafana<br/>(optional)"]
QS["QuickSight<br/>(business reports)"]
end
subgraph Actions["Action Layer"]
style Actions fill:#ef4444,stroke:#dc2626,color:#fff
CWAlarm["CloudWatch<br/>Alarms"]
SNS["SNS<br/>Notifications"]
LambdaAuto["Lambda<br/>Auto-Remediation"]
PD["PagerDuty /<br/>OpsGenie"]
ASG["Auto Scaling<br/>Actions"]
Runbook["SSM<br/>Runbooks"]
end
%% Source → Collection
ECS --> CWAgent
ECS --> XRaySDK
ECS --> CustPub
BDK --> BDKLog
BDK --> XRaySDK
OSS --> CWAgent
DDB --> CWAgent
APIGW --> CWAgent
GRL --> BDKLog
FE --> FEBeacon
%% Collection → Aggregation
CWAgent --> CWMetrics
CWAgent --> CWLogs
XRaySDK --> XRayTraces
BDKLog --> CWLogs
BDKLog --> S3Raw
CustPub --> CWMetrics
KDS --> S3Events
FEBeacon --> KDS
%% Aggregation → Processing
CWLogs --> CWInsights
S3Raw --> Athena
S3Events --> Athena
CWMetrics --> CWAnomaly
CWMetrics --> MetricMath
CWLogs --> LambdaProc
%% Processing → Visualization
CWMetrics --> CWDash
CWInsights --> CWDash
XRayTraces --> CWDash
CWMetrics --> Grafana
Athena --> QS
MetricMath --> CWDash
%% Processing → Actions
CWAnomaly --> CWAlarm
MetricMath --> CWAlarm
CWAlarm --> SNS
SNS --> PD
SNS --> LambdaAuto
LambdaAuto --> ASG
LambdaAuto --> Runbook
MangaAssist Observability Stack Mapping
sequenceDiagram
actor User
participant APIGW as API Gateway<br/>WebSocket
participant ECS as ECS Orchestrator
participant DDB as DynamoDB
participant INTENT as Bedrock Haiku<br/>(Intent Classifier)
participant OSS as OpenSearch<br/>Serverless
participant BDK as Bedrock Claude 3<br/>Sonnet
participant GRL as Bedrock Guardrails
participant CW as CloudWatch
User->>APIGW: "Find me manga similar to One Piece under $20"
Note over APIGW: 📊 Metric: connection_count++<br/>📝 Log: access_log (requestId, IP)
APIGW->>ECS: Route message (12ms)
Note over ECS: 🔍 Trace: X-Ray segment starts<br/>📝 Log: request_received (correlation_id)
ECS->>DDB: GetItem — session history (8ms)
Note over DDB: 📊 Metric: read_capacity_units<br/>🔍 Trace: DDB subsegment (8ms)<br/>📝 Log: context_loaded (5 prior turns)
ECS->>INTENT: Classify intent (180ms)
Note over INTENT: 📊 Metric: intent_latency=180ms<br/>📊 Metric: input_tokens=120, output_tokens=15<br/>🔍 Trace: intent subsegment<br/>📝 Log: intent=product_search (conf=0.94)
ECS->>OSS: kNN vector search (95ms)
Note over OSS: 📊 Metric: search_latency=95ms<br/>📊 Metric: results_count=8<br/>🔍 Trace: OSS subsegment<br/>📝 Log: rag_retrieval (8 chunks, max_score=0.89)
ECS->>ECS: Prompt assembly (5ms)
Note over ECS: 📝 Log: prompt_assembled (1847 input tokens)<br/>🔍 Trace: assembly subsegment
ECS->>BDK: InvokeModel — Claude 3 Sonnet (3400ms)
Note over BDK: 📊 Metric: fm.latency=3400ms<br/>📊 Metric: fm.input_tokens=1847<br/>📊 Metric: fm.output_tokens=342<br/>📊 Metric: fm.cost=$0.01068<br/>🔍 Trace: bedrock subsegment (dominant)<br/>📝 Log: invocation_log (model_id, tokens)
BDK->>GRL: Check response (45ms)
Note over GRL: 📊 Metric: guardrail_latency=45ms<br/>📊 Metric: guardrail_action=pass<br/>🔍 Trace: guardrail subsegment<br/>📝 Log: guardrail_trace (all policy results)
GRL-->>ECS: Response approved
ECS->>DDB: PutItem — save turn (6ms)
Note over DDB: 📊 Metric: write_capacity_units<br/>📝 Log: turn_saved
ECS->>APIGW: Formatted response (3ms)
APIGW->>User: Manga recommendations displayed
Note over ECS: 📊 Metric: e2e_latency=3754ms<br/>📊 Metric: quality_score=0.85<br/>🔍 Trace: segment complete<br/>📝 Log: request_complete<br/>🎯 Event: publish all metrics to CW
ECS->>CW: Publish metrics batch
Note over CW: 📊 All metrics aggregated<br/>🚨 Anomaly detection evaluates<br/>📈 Dashboards update (60s refresh)
Total observability signals for a single request: ~15 metrics, 1 distributed trace (8 spans), ~10 structured log entries, 1 Bedrock invocation log.
Observability Maturity Model
graph LR
L0["Level 0<br/>🚫 No Observability<br/>Flying blind"]
L1["Level 1<br/>📝 Basic Logs<br/>Console output only"]
L2["Level 2<br/>📊 Metrics + Alerts<br/>CloudWatch basics"]
L3["Level 3<br/>🔍 Distributed Tracing<br/>X-Ray across services"]
L4["Level 4<br/>🤖 FM-Specific Monitoring<br/>Token, cost, quality metrics"]
L5["Level 5<br/>🧠 Predictive & Self-Healing<br/>Anomaly detection + auto-remediation"]
L0 --> L1 --> L2 --> L3 --> L4 --> L5
style L0 fill:#dc2626,color:#fff
style L1 fill:#ea580c,color:#fff
style L2 fill:#d97706,color:#fff
style L3 fill:#65a30d,color:#fff
style L4 fill:#0891b2,color:#fff
style L5 fill:#7c3aed,color:#fff
| Level | Capabilities | AWS Tools Used | Gaps at This Level |
|---|---|---|---|
| L0 — None | No monitoring; learn about issues from user complaints | None | Everything — complete blind spot |
| L1 — Basic Logs | Console.log / print statements; basic CloudWatch log groups | CloudWatch Logs | No metrics, no alerting, no correlation; reactive debugging only |
| L2 — Metrics + Alerts | CloudWatch metrics for infra (CPU, memory); basic alarms (5xx rate, high CPU) | CloudWatch Metrics, Alarms, SNS | No request-level visibility; cannot trace a single user journey |
| L3 — Distributed Tracing | X-Ray traces across ECS, API Gateway, DynamoDB, OpenSearch; can follow a single request | X-Ray, CloudWatch ServiceLens | No FM-specific insights; model latency is a black box; no cost tracking |
| L4 — FM Monitoring | Token counting, model latency percentiles, cost per request, quality scores, guardrail metrics; Bedrock invocation logging | Bedrock Logging, Custom CloudWatch Metrics, S3, Athena | Reactive — alerts fire after the problem; no prediction; manual remediation |
| L5 — Predictive | Anomaly detection on cost/latency/quality; auto-remediation (switch models, enable cache, scale); drift detection; business correlation | CloudWatch Anomaly Detection, Lambda auto-remediation, EventBridge, ML-based alerting | Continuous improvement needed; model reliability depends on historical data quality |
MangaAssist target: Level 4 at MVP, Level 5 within 6 months.
HLD: Observability Data Model
from dataclasses import dataclass, field
from typing import Optional, Dict, List
from datetime import datetime
from enum import Enum
class ObservabilityPillar(Enum):
METRICS = "metrics"
TRACES = "traces"
LOGS = "logs"
EVENTS = "events"
class AlertSeverity(Enum):
CRITICAL = "critical" # Page on-call immediately
HIGH = "high" # Notify team within 15 min
MEDIUM = "medium" # Investigate within 1 hour
LOW = "low" # Review in next business day
@dataclass
class ObservabilitySignal:
"""Base class for all observability data.
Every signal emitted by MangaAssist carries these fields,
enabling cross-pillar correlation (join a metric spike to
the trace that caused it, to the log that explains it).
"""
timestamp: datetime
source_service: str # e.g., "mangaassist-orchestrator"
correlation_id: str # unique per user request
session_id: str # unique per chat session
pillar: ObservabilityPillar
environment: str = "production"
metadata: Dict[str, str] = field(default_factory=dict)
@dataclass
class FMMetricSignal(ObservabilitySignal):
"""FM-specific metric signal — the core of GenAI observability.
Captures everything needed to understand cost, performance,
and quality of a single model invocation.
"""
model_id: str = ""
intent: str = ""
input_tokens: int = 0
output_tokens: int = 0
latency_ms: float = 0.0
cost_usd: float = 0.0
quality_score: float = 0.0 # 0.0 - 1.0, from automated evaluator
cache_hit: bool = False
guardrail_action: str = "pass" # pass | block | anonymize
rag_chunks_retrieved: int = 0
rag_max_similarity: float = 0.0
time_to_first_token_ms: float = 0.0 # streaming UX metric
@dataclass
class TraceSpan(ObservabilitySignal):
"""Individual span within a distributed trace.
Maps to an X-Ray subsegment. The span tree reconstructs
the full request path through MangaAssist services.
"""
trace_id: str = ""
span_id: str = ""
parent_span_id: Optional[str] = None
operation_name: str = "" # e.g., "bedrock.invoke_model"
duration_ms: float = 0.0
status: str = "OK" # OK | ERROR | THROTTLED
annotations: Dict[str, str] = field(default_factory=dict)
# annotations carry indexed, filterable key-value pairs
# e.g., {"intent": "product_search", "model_id": "claude-3-sonnet"}
@dataclass
class StructuredLogEntry(ObservabilitySignal):
"""Structured log entry — always JSON, always correlated."""
level: str = "INFO" # DEBUG | INFO | WARN | ERROR
event: str = "" # e.g., "fm_invocation_complete"
message: str = ""
trace_id: str = ""
span_id: str = ""
error_type: Optional[str] = None
error_message: Optional[str] = None
attributes: Dict[str, str] = field(default_factory=dict)
@dataclass
class ObservabilityEvent(ObservabilitySignal):
"""Discrete event — deployment, anomaly, or business signal."""
event_type: str = "" # deployment | anomaly | business
severity: AlertSeverity = AlertSeverity.LOW
details: Dict[str, str] = field(default_factory=dict)
related_metric: Optional[str] = None
auto_remediation_triggered: bool = False
@dataclass
class ObservabilityConfig:
"""Configuration for the MangaAssist observability stack.
These values control cost vs. granularity trade-offs.
"""
metrics_namespace: str = "MangaAssist"
trace_sampling_rate: float = 0.05 # 5% in production (cost control)
log_retention_days: int = 30 # CloudWatch Logs retention
s3_log_retention_days: int = 90 # Raw Bedrock invocation logs
alarm_evaluation_periods: int = 3 # 3 consecutive breaches before alarm
alarm_datapoints_to_alarm: int = 2 # 2 of 3 periods must breach
dashboard_refresh_seconds: int = 60 # CloudWatch dashboard auto-refresh
anomaly_detection_band: int = 2 # Standard deviations for anomaly band
enable_invocation_logging: bool = True # Bedrock invocation logging on/off
log_prompt_body: bool = False # Log full prompt text (cost + privacy)
prompt_body_sample_rate: float = 0.05 # If logging bodies, sample 5%
s3_log_bucket: str = "mangaassist-observability-logs"
cost_alert_hourly_threshold: float = 50.0 # USD
cost_alert_daily_threshold: float = 800.0 # USD
LLD: CloudWatch Metric Publisher
import boto3
import time
import logging
from datetime import datetime, timezone
from typing import Dict, List, Optional
from contextlib import contextmanager
logger = logging.getLogger(__name__)
class FMObservabilityPublisher:
"""Publishes FM observability signals to CloudWatch.
Design decisions:
- Buffers up to 20 metric data points (CloudWatch PutMetricData limit)
- Auto-flushes when buffer is full or on explicit flush()
- Uses 2 dimensions (ModelId, Intent) to balance cardinality vs. queryability
- Estimates cost based on known Bedrock pricing tiers
Usage:
publisher = FMObservabilityPublisher()
publisher.record_invocation(
model_id="anthropic.claude-3-sonnet-20240229-v1:0",
intent="product_search",
input_tokens=1847,
output_tokens=342,
latency_ms=3400.0,
quality_score=0.85
)
publisher.flush() # ensure remaining buffer is sent
"""
CLOUDWATCH_BATCH_LIMIT = 20
# Bedrock pricing per 1K tokens (us-east-1, on-demand, as of 2025)
PRICING: Dict[str, Dict[str, float]] = {
"anthropic.claude-3-sonnet": {"input": 0.003, "output": 0.015},
"anthropic.claude-3-haiku": {"input": 0.00025, "output": 0.00125},
"anthropic.claude-3-opus": {"input": 0.015, "output": 0.075},
"amazon.titan-embed-text": {"input": 0.0001, "output": 0.0},
}
def __init__(self, namespace: str = "MangaAssist/FM"):
self.cloudwatch = boto3.client("cloudwatch")
self.namespace = namespace
self._buffer: List[dict] = []
def record_invocation(
self,
model_id: str,
intent: str,
input_tokens: int,
output_tokens: int,
latency_ms: float,
quality_score: float,
cache_hit: bool = False,
guardrail_action: str = "pass",
time_to_first_token_ms: Optional[float] = None,
) -> None:
"""Record a single FM invocation with all observability dimensions."""
timestamp = datetime.now(timezone.utc)
dimensions = [
{"Name": "ModelId", "Value": self._short_model_id(model_id)},
{"Name": "Intent", "Value": intent},
]
metrics = [
{"MetricName": "InputTokens", "Value": input_tokens, "Unit": "Count"},
{"MetricName": "OutputTokens", "Value": output_tokens, "Unit": "Count"},
{"MetricName": "TotalTokens", "Value": input_tokens + output_tokens, "Unit": "Count"},
{"MetricName": "InvocationLatency", "Value": latency_ms, "Unit": "Milliseconds"},
{"MetricName": "QualityScore", "Value": quality_score, "Unit": "None"},
{"MetricName": "CacheHit", "Value": 1 if cache_hit else 0, "Unit": "Count"},
{"MetricName": "GuardrailBlock", "Value": 1 if guardrail_action == "block" else 0, "Unit": "Count"},
{"MetricName": "InvocationCount", "Value": 1, "Unit": "Count"},
]
if time_to_first_token_ms is not None:
metrics.append({
"MetricName": "TimeToFirstToken",
"Value": time_to_first_token_ms,
"Unit": "Milliseconds",
})
# Cost estimation
cost = self._estimate_cost(model_id, input_tokens, output_tokens)
metrics.append({"MetricName": "EstimatedCostMicro", "Value": cost * 1_000_000, "Unit": "Count"})
# Store as micro-dollars to avoid CloudWatch floating-point precision issues
for metric in metrics:
self._buffer.append({
"MetricName": metric["MetricName"],
"Timestamp": timestamp,
"Value": metric["Value"],
"Unit": metric["Unit"],
"Dimensions": dimensions,
})
if len(self._buffer) >= self.CLOUDWATCH_BATCH_LIMIT:
self._flush()
@contextmanager
def invocation_timer(self, model_id: str, intent: str, **kwargs):
"""Context manager to automatically time and record an invocation.
Usage:
with publisher.invocation_timer("anthropic.claude-3-sonnet...", "product_search") as ctx:
response = bedrock.invoke_model(...)
ctx["input_tokens"] = response["usage"]["input_tokens"]
ctx["output_tokens"] = response["usage"]["output_tokens"]
ctx["quality_score"] = evaluate(response)
"""
ctx = {"input_tokens": 0, "output_tokens": 0, "quality_score": 0.0, **kwargs}
start = time.monotonic()
try:
yield ctx
finally:
elapsed_ms = (time.monotonic() - start) * 1000
self.record_invocation(
model_id=model_id,
intent=intent,
input_tokens=ctx["input_tokens"],
output_tokens=ctx["output_tokens"],
latency_ms=elapsed_ms,
quality_score=ctx["quality_score"],
cache_hit=ctx.get("cache_hit", False),
guardrail_action=ctx.get("guardrail_action", "pass"),
time_to_first_token_ms=ctx.get("time_to_first_token_ms"),
)
def _estimate_cost(self, model_id: str, input_tokens: int, output_tokens: int) -> float:
"""Estimate invocation cost in USD based on known pricing."""
model_key = next(
(k for k in self.PRICING if k in model_id),
"anthropic.claude-3-haiku", # conservative default
)
rates = self.PRICING[model_key]
return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1000
    def _short_model_id(self, model_id: str) -> str:
        """Shorten model ID for CloudWatch dimension (reduce cardinality)."""
        # "anthropic.claude-3-sonnet-20240229-v1:0" → "claude-3-sonnet"
        parts = model_id.split(".")
        if len(parts) > 1:
            name_parts = parts[1].split("-")
            kept = []
            for part in name_parts:
                # Stop at the version date (e.g., "20240229") or version tag (e.g., "v1:0")
                if (part.isdigit() and len(part) >= 8) or (part.startswith("v") and part[1:2].isdigit()):
                    break
                kept.append(part)
            return "-".join(kept) or model_id[:50]
        return model_id[:50]
def flush(self) -> None:
"""Flush remaining buffered metrics to CloudWatch."""
self._flush()
def _flush(self) -> None:
"""Internal flush — sends buffered metrics in batches of 20."""
if not self._buffer:
return
for i in range(0, len(self._buffer), self.CLOUDWATCH_BATCH_LIMIT):
batch = self._buffer[i : i + self.CLOUDWATCH_BATCH_LIMIT]
try:
self.cloudwatch.put_metric_data(
Namespace=self.namespace,
MetricData=batch,
)
except Exception:
logger.exception("Failed to publish metrics batch to CloudWatch")
# In production: push to a dead-letter queue or local file
# Do NOT raise — observability failures must not break the application
self._buffer.clear()
Key Design Decisions
| # | Decision | Choice | Rationale | Trade-off |
|---|---|---|---|---|
| 1 | Metric granularity | Per-intent + per-model dimensions | Enables drill-down by use case (product_search vs. order_status); keeps cardinality manageable (~20 intent × 3 model = 60 dimension combos) | Higher cardinality = higher CloudWatch cost; mitigated by limiting to 2 dimensions |
| 2 | Trace sampling rate | 5% in production, 100% in staging | Cost control — X-Ray charges per trace recorded; 5% gives statistical significance at MangaAssist scale (~200K requests/day → 10K traces/day) | May miss rare edge-case traces; compensate by force-sampling error paths at 100% |
| 3 | Log retention | 30 days CloudWatch, 90 days S3 | CloudWatch is expensive for long retention; S3 + Athena provides cheap long-term query; 90 days covers monthly business reviews | Querying S3 via Athena is slower than CloudWatch Logs Insights; acceptable for non-urgent analysis |
| 4 | Alert thresholds | 3 evaluation periods, 2-of-3 datapoints | Avoids alert fatigue from transient spikes; 3-period window (3 × 60s = 3 min) balances responsiveness with stability | Delays detection by up to 3 minutes; acceptable because GenAI issues rarely need sub-minute response |
| 5 | Dashboard refresh | 60 seconds | Balances real-time visibility with CloudWatch API cost; during incidents, operators can manually refresh | Not real-time; for live debugging, operators use CloudWatch Logs Insights with Live Tail |
| 6 | Anomaly detection | CloudWatch Anomaly Detection (2-band) | Native AWS service, no ML infrastructure to manage; 2-band catches significant deviations without false positives | Less flexible than custom models; supplement with static thresholds for known hard limits |
| 7 | Cost monitoring | Micro-dollar metric + hourly aggregate | CloudWatch doesn't support float precision well below $0.01; storing as micro-dollars (×1,000,000) preserves precision; hourly aggregate for alerting | Requires Metric Math to convert back to USD for display; small added complexity |
| 8 | Multi-account strategy | Single account with namespace isolation | MangaAssist is a single-team product; multi-account adds operational overhead not justified at current scale | If MangaAssist grows to multi-team, migrate to AWS Organizations with cross-account CloudWatch |
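Decisions #4 and #7 meet in a single alarm definition: metric math converts micro-dollars back to USD, and the 2-of-3 datapoint rule suppresses transient spikes. A hedged sketch; the alarm name, SNS topic ARN, and hourly period are assumptions:
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="mangaassist-fm-hourly-cost",
    Metrics=[
        {
            "Id": "micro",
            "MetricStat": {
                "Metric": {"Namespace": "MangaAssist/FM", "MetricName": "EstimatedCostMicro"},
                "Period": 3600,  # hourly spend, per decision #7
                "Stat": "Sum",
            },
            "ReturnData": False,
        },
        {
            # Metric math: micro-dollars back to USD for the threshold
            "Id": "usd",
            "Expression": "micro / 1000000",
            "Label": "HourlyCostUSD",
            "ReturnData": True,
        },
    ],
    EvaluationPeriods=3,
    DatapointsToAlarm=2,  # 2-of-3 rule from decision #4
    Threshold=50.0,  # matches cost_alert_hourly_threshold
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:mangaassist-alerts"],
)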
Cross-References
| Related Document | Topic | Connection |
|---|---|---|
| Debugging/01-bedrock-logging.md | Bedrock invocation logging setup | Detailed Bedrock logging configuration; this doc references those logs as data source |
| Debugging/02-application-logging.md | Structured logging patterns | Application log format and correlation ID patterns used in Pillar 3 |
| 13-metrics.md | Business metric definitions | Business metrics referenced in Pillar 1 are defined in detail here |
| LLMOps/llmops-user-stories.md | LLMOps lifecycle monitoring | Observability feeds into the LLMOps feedback loop for model updates |
| Skill 4.3.2 — CloudWatch Implementation | CloudWatch setup | Hands-on implementation of the CloudWatch metrics/alarms designed here |
| Skill 4.3.3 — Distributed Tracing | X-Ray deep dive | Detailed X-Ray instrumentation code for the traces described here |
| Skill 4.3.4 — Cost Monitoring | Cost observability | Deep dive into the cost metric pipeline sketched in this architecture |
| Skill 4.3.5 — Quality Monitoring | Quality metrics | Automated quality evaluation that feeds the quality_score metric |
| Skill 4.3.6 — Alerting & Remediation | Alert design | Alert rules and auto-remediation Lambda code for the action layer |
Key Takeaways
- GenAI observability requires 4 pillars: Metrics, traces, logs, and events work together. No single pillar is sufficient — a cost spike (metric) needs a trace to find the expensive request, a log to see the prompt, and an event to identify the deployment that caused it.
- FM-specific metrics don't exist in traditional monitoring: Token counts, cost-per-request, quality scores, guardrail trigger rates, and hallucination rates are entirely new metric categories that must be explicitly instrumented.
- Correlation IDs are non-negotiable: Every observability signal — metric, trace, log, event — must carry `correlation_id` and `session_id`. Without them, debugging a GenAI issue is impossible because you need to reconstruct the full prompt chain.
- Cost is a first-class operational metric: Unlike traditional apps where infra costs are relatively fixed, GenAI costs scale with token volume and can spike unexpectedly. Real-time cost monitoring with anomaly detection prevents billing surprises.
- Quality monitoring replaces simple health checks: A GenAI app can return HTTP 200 with a completely wrong answer. Observability must include content-level quality evaluation, not just availability and latency.
- Observability must not break the application: The metric publisher swallows exceptions, uses buffering, and never blocks the request path. A failure in observability should be logged but must never degrade user experience.
- Maturity is a journey — start at Level 2, target Level 4: Don't attempt to build Level 5 (predictive) before you have solid Level 3 (tracing). Each level provides compounding value and reduces the mean-time-to-resolution for the next class of issues.