08: Reporting and Visualization Systems
AIP-C01 Mapping
Content Domain 5: Testing, Validation, and Troubleshooting
Task 5.1: Implement evaluation systems for GenAI
Skill 5.1.8: Create evaluation reports and dashboards to communicate quality metrics and performance trends.
User Story
As a MangaAssist product stakeholder, I want to see real-time dashboards showing chatbot quality metrics, performance trends, and comparative model evaluations in a clear visual format, So that I can make data-driven decisions about model deployments, identify degradation early, and communicate chatbot health to leadership.
Acceptance Criteria
- Real-time quality dashboard showing intent-level and aggregate metrics with CloudWatch
- Automated weekly evaluation reports generated and sent to Slack/Email
- Model comparison views showing side-by-side performance across candidates
- Trend analysis showing metric trajectories over 7-day, 30-day, and 90-day windows
- Anomaly alerts surface directly on dashboards with root-cause suggestions
- Executive-level summary report with business KPIs (task completion, user satisfaction, cost)
- Drill-down capability from aggregate metrics to individual conversation traces
- Dashboard loads in < 3 seconds even with 90-day data range
Why Reporting Matters in GenAI Systems
Traditional software has binary metrics: requests succeed or fail, latency is fast or slow. GenAI systems have continuous quality dimensions — relevance can be 0.72 or 0.85, reasoning quality can degrade by 3% over two weeks, and a model upgrade might improve one intent while regressing another.
Without proper visualization:
| Failure Mode | What Gets Missed |
|---|---|
| Slow degradation | 0.5%/week quality drop is invisible until users complain |
| Intent-level variance | Aggregate 88% quality hides that recommendation is at 72% |
| Cost-quality tradeoff | Model A is 15% cheaper but only 2% worse — invisible without side-by-side |
| Seasonal patterns | Holiday traffic shifts intent distribution, but nobody connects the dots |
| Alert fatigue | 50 CloudWatch alarms → team ignores them without prioritized views |
Dashboards are not decoration — they are the feedback loop that makes all other evaluation systems actionable.
High-Level Design
Reporting Architecture
graph TD
subgraph "Data Sources"
A1[CloudWatch Metrics<br>Real-time agent metrics]
A2[DynamoDB<br>Evaluation results]
A3[Kinesis Data Stream<br>User feedback events]
A4[S3<br>Historical evaluation runs]
A5[Redshift<br>Aggregated analytics]
end
subgraph "Data Pipeline"
A1 --> B1[CloudWatch Metric Math<br>Derived metrics]
A2 --> B2[Lambda<br>Evaluation aggregator]
A3 --> B3[Kinesis Data Firehose<br>Batch to S3/Redshift]
A4 --> B2
A5 --> B4[Redshift Spectrum<br>Historical queries]
end
subgraph "Reporting Layer"
B1 --> C1[CloudWatch Dashboard<br>Operational real-time]
B2 --> C2[QuickSight<br>Business analytics]
B4 --> C2
B3 --> C2
B2 --> C3[Lambda<br>Automated report generator]
C3 --> C4[SNS → Slack/Email<br>Weekly reports]
end
subgraph "Consumers"
C1 --> D1[Engineering Team<br>Real-time ops]
C2 --> D2[Product/Leadership<br>Business decisions]
C4 --> D3[All Stakeholders<br>Weekly digest]
end
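The pipeline above relies on CloudWatch Metric Math for derived metrics. As a rough illustration of what one derived series could look like, a composite quality widget can average the individual quality series; the expression ids, labels, and equal-weight averaging below are assumptions rather than MangaAssist configuration.
# Sketch of a Metric Math widget deriving a composite quality score.
# Expression ids and labels are illustrative assumptions.
composite_quality_widget = {
    "type": "metric",
    "width": 12,
    "height": 6,
    "properties": {
        "metrics": [
            [{"expression": "AVG(METRICS())", "label": "CompositeQuality", "id": "e1"}],
            ["MangaAssist/Evaluation", "TaskCompletionRate", {"id": "m1", "visible": False}],
            ["MangaAssist/Evaluation", "RelevanceScore", {"id": "m2", "visible": False}],
            ["MangaAssist/Evaluation", "UserSatisfaction", {"id": "m3", "visible": False}],
        ],
        "view": "timeSeries",
        "stat": "Average",
        "period": 3600,
        "region": "us-east-1",
        "title": "Composite Quality (Metric Math)",
    },
}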
Dashboard Hierarchy
graph TD
A[Executive Summary<br>1-page business KPIs] --> B[Quality Overview<br>Aggregate + intent breakdown]
B --> C1[Model Comparison<br>Side-by-side candidates]
B --> C2[Trend Analysis<br>7/30/90-day windows]
B --> C3[User Feedback<br>Satisfaction + annotation]
C1 --> D1[Individual Model Detail<br>Per-metric deep dive]
C2 --> D2[Anomaly Investigation<br>Drill to traces]
C3 --> D3[Feedback Detail<br>Individual comments]
D2 --> E[Conversation Trace<br>Turn-by-turn replay]
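The bottom of this hierarchy, drilling from an anomaly to a turn-by-turn conversation trace, is typically backed by a CloudWatch Logs Insights query. A minimal sketch, assuming the agent writes structured fields (conversation_id, turn_index, intent, tool_called) to an assumed log group /aws/lambda/manga-assist-agent:
import time

import boto3

logs = boto3.client("logs", region_name="us-east-1")

def fetch_conversation_trace(conversation_id: str, start_time: int, end_time: int) -> list:
    """Pull one conversation's turns for the trace-replay view (field names assumed)."""
    query = (
        "fields @timestamp, turn_index, intent, tool_called, @message "
        f"| filter conversation_id = '{conversation_id}' "
        "| sort turn_index asc | limit 100"
    )
    query_id = logs.start_query(
        logGroupName="/aws/lambda/manga-assist-agent",  # assumed log group
        startTime=start_time,
        endTime=end_time,
        queryString=query,
    )["queryId"]
    # Poll until the query finishes (simplified; bound retries in production)
    while True:
        result = logs.get_query_results(queryId=query_id)
        if result["status"] in ("Complete", "Failed", "Cancelled"):
            return result.get("results", [])
        time.sleep(1)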
Report Generation Pipeline
sequenceDiagram
participant CW as CloudWatch<br>EventBridge
participant LR as Lambda<br>Report Generator
participant DDB as DynamoDB<br>Eval Results
participant RS as Redshift<br>Historical
participant S3 as S3<br>Report Storage
participant SNS as SNS<br>Distribution
CW->>LR: Weekly trigger (Sunday 00:00 UTC)
LR->>DDB: Query last 7 days evaluation results
DDB->>LR: 15,000 evaluation records
LR->>RS: Query 30/90 day baselines
RS->>LR: Aggregated baselines
LR->>LR: Compute metrics, trends, anomalies
LR->>LR: Generate Markdown + HTML report
LR->>S3: Store report (s3://manga-assist-reports/weekly/2024-01-14.html)
LR->>SNS: Publish report notification
SNS->>SNS: Fan out to Slack webhook + Email
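The weekly trigger at the top of this sequence can be expressed as an EventBridge schedule rule targeting the report-generator Lambda. A sketch under assumed names (the rule name, target id, and function ARN are placeholders):
import boto3

events = boto3.client("events", region_name="us-east-1")

# Weekly schedule matching the diagram: Sunday 00:00 UTC.
events.put_rule(
    Name="manga-assist-weekly-report",
    ScheduleExpression="cron(0 0 ? * SUN *)",
    State="ENABLED",
)
events.put_targets(
    Rule="manga-assist-weekly-report",
    Targets=[{
        "Id": "weekly-report-generator",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:manga-assist-report-generator",
    }],
)
# The Lambda also needs a resource policy allowing events.amazonaws.com to invoke it.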
Low-Level Design
Dashboard Metric Definitions
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
import time
class MetricGranularity(Enum):
REAL_TIME = "real_time" # 1-minute resolution
HOURLY = "hourly"
DAILY = "daily"
WEEKLY = "weekly"
class TrendDirection(Enum):
IMPROVING = "improving"
STABLE = "stable"
DEGRADING = "degrading"
@dataclass
class DashboardMetric:
"""A single metric displayed on the dashboard."""
name: str
value: float
unit: str = "None" # None, Percent, Count, Milliseconds
intent: str = "aggregate" # "aggregate" or specific intent
timestamp: float = field(default_factory=time.time)
trend_7d: TrendDirection = TrendDirection.STABLE
trend_30d: TrendDirection = TrendDirection.STABLE
baseline_7d: float = 0.0
baseline_30d: float = 0.0
threshold_warning: float = 0.0
threshold_critical: float = 0.0
anomaly_detected: bool = False
@dataclass
class ModelComparisonRow:
"""One row in a model comparison table."""
model_id: str
model_name: str
metrics: dict[str, float] = field(default_factory=dict) # metric_name → value
cost_per_1k_requests: float = 0.0
avg_latency_ms: float = 0.0
quality_score: float = 0.0 # Composite quality
recommended: bool = False
@dataclass
class WeeklyReportData:
"""All data needed to generate the weekly report."""
report_date: str = ""
period_start: str = ""
period_end: str = ""
total_conversations: int = 0
total_evaluations: int = 0
aggregate_metrics: list[DashboardMetric] = field(default_factory=list)
intent_metrics: dict[str, list[DashboardMetric]] = field(default_factory=dict)
model_comparisons: list[ModelComparisonRow] = field(default_factory=list)
anomalies: list[dict] = field(default_factory=list)
top_failure_categories: list[tuple[str, int]] = field(default_factory=list)
recommendations: list[str] = field(default_factory=list)
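DashboardMetric carries 7-day and 30-day trend directions, but this section does not pin down how they are computed. One plausible helper classifies the trend from the current value against a baseline; the 2% tolerance band is an assumption, not a MangaAssist setting:
def classify_trend(current: float, baseline: float, tolerance: float = 0.02) -> TrendDirection:
    """Classify a metric's trend relative to a baseline (tolerance band is assumed)."""
    if baseline <= 0:
        return TrendDirection.STABLE
    delta = (current - baseline) / baseline
    if delta > tolerance:
        return TrendDirection.IMPROVING
    if delta < -tolerance:
        return TrendDirection.DEGRADING
    return TrendDirection.STABLE

# Example: 0.84 against a 7-day baseline of 0.88 classifies as DEGRADING
metric = DashboardMetric(name="TaskCompletionRate", value=0.84, baseline_7d=0.88)
metric.trend_7d = classify_trend(metric.value, metric.baseline_7d)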
CloudWatch Dashboard Builder
import json
import logging
from typing import Optional
import boto3
logger = logging.getLogger(__name__)
# MangaAssist intents for dashboard breakdown
INTENTS = [
"recommendation", "product_question", "faq", "order_tracking",
"return_request", "promotion", "checkout_help", "chitchat",
"escalation", "product_discovery",
]
# Key evaluation metrics to display
QUALITY_METRICS = [
"TaskCompletionRate", "IntentAccuracy", "RelevanceScore",
"FactualAccuracy", "ToolSelectionAccuracy", "UserSatisfaction",
]
class CloudWatchDashboardBuilder:
"""Builds and updates CloudWatch dashboards for MangaAssist evaluation metrics."""
def __init__(
self,
dashboard_name: str = "MangaAssist-Quality",
namespace: str = "MangaAssist/Evaluation",
region: str = "us-east-1",
):
self.cloudwatch = boto3.client("cloudwatch", region_name=region)
self.dashboard_name = dashboard_name
self.namespace = namespace
self.region = region
def build_operational_dashboard(self) -> None:
"""Build the real-time operational dashboard for engineering."""
widgets = []
y_pos = 0
# Row 1: Aggregate quality headline metrics
widgets.append(self._build_headline_row(y_pos))
y_pos += 3
# Row 2: Quality trend over 7 days
widgets.append(self._build_quality_trend_widget(y_pos))
y_pos += 6
# Row 3: Intent breakdown heatmap
widgets.append(self._build_intent_breakdown_widget(y_pos))
y_pos += 6
# Row 4: Failure category distribution
widgets.append(self._build_failure_distribution_widget(y_pos))
y_pos += 6
# Row 5: Latency by pipeline stage
widgets.append(self._build_latency_widget(y_pos))
y_pos += 6
# Row 6: Anomaly log
widgets.append(self._build_anomaly_log_widget(y_pos))
# Flatten widget list (some builders return lists)
flat_widgets = []
for w in widgets:
if isinstance(w, list):
flat_widgets.extend(w)
else:
flat_widgets.append(w)
dashboard_body = json.dumps({"widgets": flat_widgets})
self.cloudwatch.put_dashboard(
DashboardName=self.dashboard_name,
DashboardBody=dashboard_body,
)
logger.info("Dashboard %s updated with %d widgets", self.dashboard_name, len(flat_widgets))
def _build_headline_row(self, y_pos: int) -> list[dict]:
"""Build single-number headline widgets for key metrics."""
widgets = []
x_pos = 0
width = 4 # 24 / 6 metrics = 4 columns each
for metric in QUALITY_METRICS:
widgets.append({
"type": "metric",
"x": x_pos,
"y": y_pos,
"width": width,
"height": 3,
"properties": {
"metrics": [[self.namespace, metric]],
"view": "singleValue",
"region": self.region,
"stat": "Average",
"period": 3600,
"title": metric.replace("Rate", " Rate").replace("Score", " Score"),
"sparkline": True,
},
})
x_pos += width
return widgets
def _build_quality_trend_widget(self, y_pos: int) -> dict:
"""Build a line chart showing quality metrics over time."""
metrics = [
[self.namespace, m, {"stat": "Average", "period": 3600}]
for m in QUALITY_METRICS
]
return {
"type": "metric",
"x": 0,
"y": y_pos,
"width": 24,
"height": 6,
"properties": {
"metrics": metrics,
"view": "timeSeries",
"stacked": False,
"region": self.region,
"title": "Quality Metrics Trend (Hourly)",
"period": 3600,
"yAxis": {"left": {"min": 0, "max": 1}},
"annotations": {
"horizontal": [
{"label": "Warning", "value": 0.80, "color": "#ff9900"},
{"label": "Critical", "value": 0.70, "color": "#d13212"},
]
},
},
}
def _build_intent_breakdown_widget(self, y_pos: int) -> dict:
"""Build per-intent quality breakdown."""
metrics = []
for intent in INTENTS:
metrics.append([
self.namespace, "TaskCompletionRate",
"Intent", intent,
{"stat": "Average", "period": 86400},
])
return {
"type": "metric",
"x": 0,
"y": y_pos,
"width": 24,
"height": 6,
"properties": {
"metrics": metrics,
"view": "bar",
"region": self.region,
"title": "Task Completion Rate by Intent (Daily)",
"yAxis": {"left": {"min": 0, "max": 1}},
},
}
def _build_failure_distribution_widget(self, y_pos: int) -> dict:
"""Build failure category pie chart."""
return {
"type": "metric",
"x": 0,
"y": y_pos,
"width": 12,
"height": 6,
"properties": {
"metrics": [
[self.namespace, "FailureCount", "Category", "intent_misclassification"],
[self.namespace, "FailureCount", "Category", "wrong_tool"],
[self.namespace, "FailureCount", "Category", "wrong_parameters"],
[self.namespace, "FailureCount", "Category", "reasoning_error"],
[self.namespace, "FailureCount", "Category", "missed_escalation"],
[self.namespace, "FailureCount", "Category", "context_lost"],
],
"view": "pie",
"region": self.region,
"title": "Failure Category Distribution (7 Days)",
"stat": "Sum",
"period": 604800,
},
}
def _build_latency_widget(self, y_pos: int) -> dict:
"""Build latency breakdown by pipeline stage."""
return {
"type": "metric",
"x": 0,
"y": y_pos,
"width": 24,
"height": 6,
"properties": {
"metrics": [
[self.namespace, "Latency", "Stage", "intent_classification", {"stat": "p99"}],
[self.namespace, "Latency", "Stage", "rag_retrieval", {"stat": "p99"}],
[self.namespace, "Latency", "Stage", "llm_generation", {"stat": "p99"}],
[self.namespace, "Latency", "Stage", "total", {"stat": "p99"}],
],
"view": "timeSeries",
"region": self.region,
"title": "P99 Latency by Pipeline Stage (ms)",
"period": 300,
"yAxis": {"left": {"min": 0}},
},
}
def _build_anomaly_log_widget(self, y_pos: int) -> dict:
"""Build a log widget showing recent anomaly detections."""
return {
"type": "log",
"x": 0,
"y": y_pos,
"width": 24,
"height": 6,
"properties": {
"query": (
"SOURCE '/aws/lambda/manga-assist-evaluator' "
"| fields @timestamp, @message "
"| filter @message like /ANOMALY|DEGRADATION|ALERT/ "
"| sort @timestamp desc "
"| limit 50"
),
"region": self.region,
"title": "Recent Anomaly Detections",
},
}
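The widgets above assume evaluation results are already published into the MangaAssist/Evaluation namespace with Intent, Category, and Stage dimensions. A minimal sketch of that publishing side (the helper name and argument shape are assumptions):
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def publish_evaluation_metrics(intent: str, task_completion: float, relevance: float) -> None:
    """Publish per-intent and aggregate series consumed by the dashboard widgets."""
    cloudwatch.put_metric_data(
        Namespace="MangaAssist/Evaluation",
        MetricData=[
            {
                "MetricName": "TaskCompletionRate",
                "Dimensions": [{"Name": "Intent", "Value": intent}],
                "Value": task_completion,
                "Unit": "None",
            },
            {
                "MetricName": "RelevanceScore",
                "Dimensions": [{"Name": "Intent", "Value": intent}],
                "Value": relevance,
                "Unit": "None",
            },
            # Series without the Intent dimension feed the aggregate headline row.
            {"MetricName": "TaskCompletionRate", "Value": task_completion, "Unit": "None"},
        ],
    )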
Automated Report Generator
import json
import logging
from datetime import datetime, timedelta
from decimal import Decimal
from typing import Optional
import boto3
from boto3.dynamodb.conditions import Key, Attr
logger = logging.getLogger(__name__)
class WeeklyReportGenerator:
"""Generates automated weekly evaluation reports for MangaAssist.
Triggered by EventBridge rule every Sunday at 00:00 UTC.
Queries DynamoDB for evaluation results, Redshift for baselines,
generates Markdown + HTML, stores in S3, distributes via SNS.
"""
def __init__(
self,
eval_table_name: str = "manga-assist-evaluations",
report_bucket: str = "manga-assist-reports",
sns_topic_arn: str = "",
region: str = "us-east-1",
):
self.dynamodb = boto3.resource("dynamodb", region_name=region)
self.eval_table = self.dynamodb.Table(eval_table_name)
self.s3 = boto3.client("s3", region_name=region)
self.sns = boto3.client("sns", region_name=region)
self.report_bucket = report_bucket
self.sns_topic_arn = sns_topic_arn
def generate_weekly_report(self) -> str:
"""Generate and distribute the weekly report. Returns S3 key."""
now = datetime.utcnow()
period_end = now
period_start = now - timedelta(days=7)
report_data = self._collect_report_data(period_start, period_end)
markdown = self._render_markdown(report_data)
# Store in S3
s3_key = f"weekly/{now.strftime('%Y-%m-%d')}.md"
self.s3.put_object(
Bucket=self.report_bucket,
Key=s3_key,
Body=markdown.encode("utf-8"),
ContentType="text/markdown",
)
# Distribute via SNS
if self.sns_topic_arn:
self.sns.publish(
TopicArn=self.sns_topic_arn,
Subject=f"MangaAssist Weekly Quality Report — {now.strftime('%Y-%m-%d')}",
Message=self._render_summary(report_data),
)
logger.info("Weekly report generated: s3://%s/%s", self.report_bucket, s3_key)
return s3_key
def _collect_report_data(
self, period_start: datetime, period_end: datetime
) -> WeeklyReportData:
"""Collect all data needed for the report."""
start_ts = period_start.isoformat()
end_ts = period_end.isoformat()
        # Query evaluation results from DynamoDB, paginating because a week of
        # results can exceed the 1 MB query page limit
        items: list[dict] = []
        query_kwargs: dict = {
            "KeyConditionExpression": Key("pk").eq("EVAL#WEEKLY")
            & Key("sk").between(start_ts, end_ts)
        }
        while True:
            response = self.eval_table.query(**query_kwargs)
            items.extend(response.get("Items", []))
            if "LastEvaluatedKey" not in response:
                break
            query_kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]
# Aggregate metrics
aggregate = self._aggregate_metrics(items)
# Per-intent breakdown
intent_metrics = {}
for intent in INTENTS:
intent_items = [i for i in items if i.get("intent") == intent]
if intent_items:
intent_metrics[intent] = self._aggregate_metrics(intent_items)
        # Detect anomalies (metrics deviating more than 10% from their 30-day baseline)
anomalies = self._detect_anomalies(aggregate)
# Top failure categories
failure_counts: dict[str, int] = {}
for item in items:
for cat in item.get("failure_categories", []):
failure_counts[cat] = failure_counts.get(cat, 0) + 1
top_failures = sorted(failure_counts.items(), key=lambda x: x[1], reverse=True)[:5]
# Generate recommendations
recommendations = self._generate_recommendations(aggregate, intent_metrics, top_failures)
return WeeklyReportData(
report_date=period_end.strftime("%Y-%m-%d"),
period_start=period_start.strftime("%Y-%m-%d"),
period_end=period_end.strftime("%Y-%m-%d"),
total_conversations=len(items),
total_evaluations=len(items),
aggregate_metrics=aggregate,
intent_metrics=intent_metrics,
anomalies=anomalies,
top_failure_categories=top_failures,
recommendations=recommendations,
)
def _aggregate_metrics(self, items: list[dict]) -> list[DashboardMetric]:
"""Aggregate metric values from evaluation items."""
metric_values: dict[str, list[float]] = {}
for item in items:
for metric_name in QUALITY_METRICS:
val = item.get(metric_name)
if val is not None:
if metric_name not in metric_values:
metric_values[metric_name] = []
metric_values[metric_name].append(float(val))
result = []
for name, values in metric_values.items():
avg = sum(values) / len(values)
result.append(DashboardMetric(
name=name,
value=round(avg, 4),
unit="None",
))
return result
def _detect_anomalies(self, aggregate: list[DashboardMetric]) -> list[dict]:
"""Detect metrics that significantly deviate from baseline."""
anomalies = []
for metric in aggregate:
if metric.baseline_30d > 0:
deviation = abs(metric.value - metric.baseline_30d) / metric.baseline_30d
if deviation > 0.10: # >10% deviation
anomalies.append({
"metric": metric.name,
"current": metric.value,
"baseline": metric.baseline_30d,
"deviation_pct": round(deviation * 100, 1),
"direction": "degraded" if metric.value < metric.baseline_30d else "improved",
})
return anomalies
def _generate_recommendations(
self,
aggregate: list[DashboardMetric],
intent_metrics: dict,
top_failures: list[tuple[str, int]],
) -> list[str]:
"""Generate human-readable recommendations based on the data."""
recommendations = []
# Check for low-performing intents
for intent, metrics in intent_metrics.items():
for m in metrics:
if m.name == "TaskCompletionRate" and m.value < 0.80:
recommendations.append(
f"⚠️ {intent} task completion rate is {m.value:.1%} — "
f"investigate top failure modes for this intent."
)
# Check for dominant failure categories
if top_failures:
top_cat, top_count = top_failures[0]
total = sum(c for _, c in top_failures)
if top_count > total * 0.4:
recommendations.append(
f"🔴 '{top_cat}' accounts for {top_count/total:.0%} of all failures — "
f"prioritize fixing this category."
)
return recommendations
def _render_markdown(self, data: WeeklyReportData) -> str:
"""Render the report as Markdown."""
lines = [
f"# MangaAssist Weekly Quality Report",
f"**Period:** {data.period_start} — {data.period_end}",
f"**Generated:** {data.report_date}",
f"**Conversations Evaluated:** {data.total_evaluations:,}",
"",
"## Summary Metrics",
"",
"| Metric | Value | 7-Day Trend | 30-Day Baseline |",
"|--------|-------|-------------|-----------------|",
]
for m in data.aggregate_metrics:
lines.append(
f"| {m.name} | {m.value:.2%} | {m.trend_7d.value} | {m.baseline_30d:.2%} |"
)
lines.extend([
"",
"## Top Failure Categories",
"",
"| Category | Count | % of Total |",
"|----------|-------|------------|",
])
total_failures = sum(c for _, c in data.top_failure_categories)
for cat, count in data.top_failure_categories:
pct = count / total_failures if total_failures > 0 else 0
lines.append(f"| {cat} | {count} | {pct:.1%} |")
if data.anomalies:
lines.extend(["", "## Anomalies Detected", ""])
for anomaly in data.anomalies:
lines.append(
f"- **{anomaly['metric']}**: {anomaly['current']:.2%} "
f"({anomaly['direction']} {anomaly['deviation_pct']}% from baseline {anomaly['baseline']:.2%})"
)
if data.recommendations:
lines.extend(["", "## Recommendations", ""])
for rec in data.recommendations:
lines.append(f"- {rec}")
return "\n".join(lines)
def _render_summary(self, data: WeeklyReportData) -> str:
"""Render a short summary for Slack/Email notification."""
lines = [
f"📊 MangaAssist Weekly Quality Report ({data.period_start} → {data.period_end})",
f"Conversations: {data.total_evaluations:,}",
"",
]
for m in data.aggregate_metrics:
indicator = "✅" if m.value >= 0.85 else "⚠️" if m.value >= 0.75 else "🔴"
lines.append(f"{indicator} {m.name}: {m.value:.1%}")
if data.anomalies:
lines.append(f"\n🚨 {len(data.anomalies)} anomalies detected — see full report.")
return "\n".join(lines)
MangaAssist Scenarios
Scenario A: Dashboard Reveals Hidden Intent-Level Regression
Context: Aggregate TaskCompletionRate was 86% — stable for 2 weeks. No alerts fired. But the product team noticed more escalation requests from users asking about promotions.
What Happened:
- The CloudWatch aggregate dashboard showed 86% task completion with a stable trend
- Drilling into the intent breakdown bar chart revealed:
- promotion: 62% task completion (down from 88% two weeks prior)
- recommendation: 93% (up from 89%)
- The aggregate was held up by recommendation improvements masking promotion degradation
- The failure category pie chart showed 40% of promotion failures were missing_tool_call
Root Cause: A deployment two weeks ago changed the orchestrator routing rules. Promotion queries now routed to the RAG pipeline (general knowledge) instead of the Promotion API (live coupon codes). The RAG pipeline couldn't surface active promotions — it had stale catalog data.
Fix: Restored Promotion API routing. Added intent-level threshold alerts: any single intent dropping >10% from 7-day baseline triggers a Slack alert.
Dashboard Change: Added per-intent sparklines to the headline row — each intent now has its own single-value card showing current value with 7-day trend.
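A static-threshold simplification of that intent-level alerting might look like the sketch below. Alarm names, the 0.80 threshold, and the SNS topic ARN are assumptions; a faithful "drop >10% from 7-day baseline" rule would instead use CloudWatch anomaly detection or metric math against a baseline series.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# INTENTS is the intent list defined earlier in this section.
for intent in INTENTS:
    cloudwatch.put_metric_alarm(
        AlarmName=f"manga-assist-{intent}-task-completion",
        Namespace="MangaAssist/Evaluation",
        MetricName="TaskCompletionRate",
        Dimensions=[{"Name": "Intent", "Value": intent}],
        Statistic="Average",
        Period=3600,
        EvaluationPeriods=6,  # six consecutive hours below threshold
        Threshold=0.80,
        ComparisonOperator="LessThanThreshold",
        TreatMissingData="notBreaching",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:manga-assist-quality-alerts"],
    )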
Scenario B: Weekly Report Catches Gradual User Satisfaction Decline
Context: Weekly automated report showed UserSatisfaction at 78%, down from 84% four weeks prior. The drop was gradual: 84% → 82% → 80% → 78%. No single-week alert triggered because each weekly drop was <5%.
What Happened:
- The recommendations section auto-generated: "⚠️ UserSatisfaction is 78% — investigate top failure modes"
- Drilling into the report's intent breakdown: product_question satisfaction dropped from 90% → 71%
- Failure category: reasoning_error increased 3x over the period
- Cross-referencing with the model comparison view: a Bedrock runtime update had been applied 4 weeks ago
Root Cause: Bedrock runtime update changed Claude 3.5 Sonnet's default temperature behavior. Product questions requiring precise factual answers about manga (release dates, volume counts, price) were generating slightly more "creative" responses.
Fix: Pinned the Claude invocation to temperature: 0.1 for product_question intent (was previously unset, relying on model default). UserSatisfaction recovered to 85% within one week.
Report Improvement: Added 4-week trend column alongside 7-day and 30-day baselines. Gradual shifts now surface as a "degrading" trend arrow even if each individual week is within tolerance.
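Applied in isolation, the temperature pin from this fix could be set per intent at invocation time. A sketch using the Bedrock Converse API; the per-intent config map and routing below are assumptions, not the actual MangaAssist orchestrator code:
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Only product_question gets a pinned temperature; other intents keep the model default.
INFERENCE_CONFIG_BY_INTENT = {
    "product_question": {"temperature": 0.1, "maxTokens": 1024},
}

def invoke_model(intent: str, user_text: str) -> str:
    config = INFERENCE_CONFIG_BY_INTENT.get(intent)
    kwargs = {"inferenceConfig": config} if config else {}
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        messages=[{"role": "user", "content": [{"text": user_text}]}],
        **kwargs,
    )
    return response["output"]["message"]["content"][0]["text"]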
Scenario C: Model Comparison Dashboard Drives Cost Decision
Context: The team was evaluating whether to route chitchat and faq intents to Claude 3 Haiku instead of Claude 3.5 Sonnet to reduce costs.
What Happened:
- Model comparison view showed side-by-side:
| Metric | Sonnet (chitchat) | Haiku (chitchat) | Sonnet (faq) | Haiku (faq) |
|---|---|---|---|---|
| TaskCompletionRate | 0.90 | 0.87 | 0.92 | 0.88 |
| RelevanceScore | 0.88 | 0.85 | 0.91 | 0.86 |
| FactualAccuracy | 0.82 | 0.80 | 0.94 | 0.89 |
| UserSatisfaction | 0.85 | 0.82 | 0.88 | 0.81 |
| Cost/1K requests | $4.20 | $0.42 | $4.20 | $0.42 |
| P99 Latency | 2.8s | 0.9s | 2.8s | 0.9s |
- For chitchat: Haiku was 90% cheaper with only a 3% quality drop — clearly acceptable
- For faq: Haiku had a 5% FactualAccuracy drop — FAQ accuracy is critical, so Haiku fails the bar
Decision: Route chitchat to Haiku (saves $3.78/1K requests). Keep faq on Sonnet (accuracy matters). Monthly savings: ~$1,800 with negligible quality impact.
Dashboard Enhancement: Added "Cost Efficiency Score" = QualityScore / CostPer1KRequests. Higher is better. Haiku for chitchat scored 2.02 vs Sonnet's 0.21 — 10x more cost-efficient.
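Expressed against the ModelComparisonRow dataclass defined earlier, the Cost Efficiency Score is a simple ratio. The quality and cost values below come from the scenario table; how the composite quality score is weighted is not specified in this section:
def cost_efficiency(row: ModelComparisonRow) -> float:
    """QualityScore / CostPer1KRequests; higher means more quality per dollar."""
    if row.cost_per_1k_requests <= 0:
        return 0.0
    return round(row.quality_score / row.cost_per_1k_requests, 2)

haiku_chitchat = ModelComparisonRow(
    model_id="anthropic.claude-3-haiku-20240307-v1:0",
    model_name="Claude 3 Haiku",
    quality_score=0.85,
    cost_per_1k_requests=0.42,
)
print(cost_efficiency(haiku_chitchat))  # ≈ 2.02, matching the scenario above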
Scenario D: Executive Report Misinterpretation Prevention
Context: Leadership received the weekly report showing TaskCompletionRate at 84%. The VP asked: "Why is 16% of our chatbot failing?"
What Happened:
- The original executive report showed raw metrics without context
- 84% TaskCompletionRate included all intents, including escalation (which by definition routes to humans — not a "failure")
- The chitchat intent had inherently lower "completion" because casual conversation has no clear success criteria
- Adjusted calculation: excluding escalation and chitchat, task completion was 91%
Fix: Executive summary now shows:
1. "Actionable Task Completion" = TaskCompletionRate for transactional intents only (order_tracking, return_request, checkout_help, promotion)
2. "Knowledge Task Completion" = TaskCompletionRate for knowledge intents (product_question, recommendation, faq, product_discovery)
3. "Escalation Rate" = a separate metric treated as a routing metric, not a failure metric
4. "Engagement Quality" = for chitchat, measured by conversation length and sentiment, not completion
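A minimal sketch of that regrouping, assuming per-intent task-completion rates are already available as a dict (the helper and constant names are illustrative):
# Intent groupings mirror the executive-summary definitions above.
TRANSACTIONAL_INTENTS = ["order_tracking", "return_request", "checkout_help", "promotion"]
KNOWLEDGE_INTENTS = ["product_question", "recommendation", "faq", "product_discovery"]

def executive_completion_metrics(per_intent_completion: dict[str, float]) -> dict[str, float]:
    def avg(intents: list[str]) -> float:
        vals = [per_intent_completion[i] for i in intents if i in per_intent_completion]
        return round(sum(vals) / len(vals), 3) if vals else 0.0
    return {
        "actionable_task_completion": avg(TRANSACTIONAL_INTENTS),
        "knowledge_task_completion": avg(KNOWLEDGE_INTENTS),
        # escalation and chitchat are reported separately, not counted as completion failures
    }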
Key Lesson: Reports for different audiences need different metric definitions. Engineering sees raw pipeline metrics. Leadership sees business-outcome metrics. Same data, different framing.
Intuition Gained
Aggregation Hides Problems
An 86% aggregate quality score can hide a 62% catastrophe in one intent. Always show intent-level breakdowns by default, not just on drill-down. The MangaAssist dashboard now opens to the intent-level view, with aggregate as a secondary summary.
Automated Reports Catch Gradual Drift
Human dashboard watchers miss gradual degradation — a 1%/week decline is invisible day-to-day but devastating over a quarter. Automated reports with 4-week trend analysis surface these patterns. The weekly report at MangaAssist is the most reliable degradation detection mechanism, more so than real-time alerts.
Different Audiences Need Different Metrics
Engineering needs pipeline-stage metrics (intent accuracy, tool selection, latency by stage). Product needs user-facing metrics (satisfaction, task completion, escalation rate). Leadership needs business metrics (cost per conversation, resolution rate, customer impact). One dashboard cannot serve all three — build a hierarchy.
References
- MangaAssist Architecture HLD — Overall system architecture
- Metrics Overview — Business and technical metrics
- Analytics Pipeline Cost Optimization — Analytics infra
- FM Output Quality Assessment — Quality metrics definitions
- Model Evaluation Optimal Configuration — Model comparison framework
- Agent Performance Framework — Agent metrics powering dashboards