08: Reporting and Visualization Systems
AIP-C01 Mapping
Content Domain 5: Testing, Validation, and Troubleshooting
Task 5.1: Implement evaluation systems for GenAI
Skill 5.1.8: Create evaluation reports and dashboards to communicate quality metrics and performance trends.
User Story
As a MangaAssist product stakeholder, I want to see real-time dashboards showing chatbot quality metrics, performance trends, and comparative model evaluations in a clear visual format, So that I can make data-driven decisions about model deployments, identify degradation early, and communicate chatbot health to leadership.
Acceptance Criteria
- Real-time quality dashboard showing intent-level and aggregate metrics with CloudWatch
- Automated weekly evaluation reports generated and sent to Slack/Email
- Model comparison views showing side-by-side performance across candidates
- Trend analysis showing metric trajectories over 7-day, 30-day, and 90-day windows
- Anomaly alerts surface directly on dashboards with root-cause suggestions
- Executive-level summary report with business KPIs (task completion, user satisfaction, cost)
- Drill-down capability from aggregate metrics to individual conversation traces
- Dashboard loads in < 3 seconds even with 90-day data range
Why Reporting Matters in GenAI Systems
Traditional software has binary metrics: requests succeed or fail, latency is fast or slow. GenAI systems have continuous quality dimensions — relevance can be 0.72 or 0.85, reasoning quality can degrade by 3% over two weeks, and a model upgrade might improve one intent while regressing another.
Without proper visualization:
| Failure Mode | What Gets Missed |
|---|---|
| Slow degradation | 0.5%/week quality drop is invisible until users complain |
| Intent-level variance | Aggregate 88% quality hides that recommendation is at 72% |
| Cost-quality tradeoff | Model A is 15% cheaper but only 2% worse — invisible without side-by-side |
| Seasonal patterns | Holiday traffic shifts intent distribution, but nobody connects the dots |
| Alert fatigue | 50 CloudWatch alarms → team ignores them without prioritized views |
Dashboards are not decoration — they are the feedback loop that makes all other evaluation systems actionable.
High-Level Design
Reporting Architecture
graph TD
subgraph "Data Sources"
A1[CloudWatch Metrics<br>Real-time agent metrics]
A2[DynamoDB<br>Evaluation results]
A3[Kinesis Data Stream<br>User feedback events]
A4[S3<br>Historical evaluation runs]
A5[Redshift<br>Aggregated analytics]
end
subgraph "Data Pipeline"
A1 --> B1[CloudWatch Metric Math<br>Derived metrics]
A2 --> B2[Lambda<br>Evaluation aggregator]
A3 --> B3[Kinesis Data Firehose<br>Batch to S3/Redshift]
A4 --> B2
A5 --> B4[Redshift Spectrum<br>Historical queries]
end
subgraph "Reporting Layer"
B1 --> C1[CloudWatch Dashboard<br>Operational real-time]
B2 --> C2[QuickSight<br>Business analytics]
B4 --> C2
B3 --> C2
B2 --> C3[Lambda<br>Automated report generator]
C3 --> C4[SNS → Slack/Email<br>Weekly reports]
end
subgraph "Consumers"
C1 --> D1[Engineering Team<br>Real-time ops]
C2 --> D2[Product/Leadership<br>Business decisions]
C4 --> D3[All Stakeholders<br>Weekly digest]
end
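The pipeline above relies on CloudWatch Metric Math for derived metrics. As a rough illustration of what one derived series could look like, a composite quality widget can average the individual quality series; the expression ids, labels, and equal-weight averaging below are assumptions rather than MangaAssist configuration.
# Sketch of a Metric Math widget deriving a composite quality score.
# Expression ids and labels are illustrative assumptions.
composite_quality_widget = {
    "type": "metric",
    "width": 12,
    "height": 6,
    "properties": {
        "metrics": [
            [{"expression": "AVG(METRICS())", "label": "CompositeQuality", "id": "e1"}],
            ["MangaAssist/Evaluation", "TaskCompletionRate", {"id": "m1", "visible": False}],
            ["MangaAssist/Evaluation", "RelevanceScore", {"id": "m2", "visible": False}],
            ["MangaAssist/Evaluation", "UserSatisfaction", {"id": "m3", "visible": False}],
        ],
        "view": "timeSeries",
        "stat": "Average",
        "period": 3600,
        "region": "us-east-1",
        "title": "Composite Quality (Metric Math)",
    },
}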
Dashboard Hierarchy
graph TD
A[Executive Summary<br>1-page business KPIs] --> B[Quality Overview<br>Aggregate + intent breakdown]
B --> C1[Model Comparison<br>Side-by-side candidates]
B --> C2[Trend Analysis<br>7/30/90-day windows]
B --> C3[User Feedback<br>Satisfaction + annotation]
C1 --> D1[Individual Model Detail<br>Per-metric deep dive]
C2 --> D2[Anomaly Investigation<br>Drill to traces]
C3 --> D3[Feedback Detail<br>Individual comments]
D2 --> E[Conversation Trace<br>Turn-by-turn replay]
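The bottom of this hierarchy, drilling from an anomaly to a turn-by-turn conversation trace, is typically backed by a CloudWatch Logs Insights query. A minimal sketch, assuming the agent writes structured fields (conversation_id, turn_index, intent, tool_called) to an assumed log group /aws/lambda/manga-assist-agent:
import time

import boto3

logs = boto3.client("logs", region_name="us-east-1")

def fetch_conversation_trace(conversation_id: str, start_time: int, end_time: int) -> list:
    """Pull one conversation's turns for the trace-replay view (field names assumed)."""
    query = (
        "fields @timestamp, turn_index, intent, tool_called, @message "
        f"| filter conversation_id = '{conversation_id}' "
        "| sort turn_index asc | limit 100"
    )
    query_id = logs.start_query(
        logGroupName="/aws/lambda/manga-assist-agent",  # assumed log group
        startTime=start_time,
        endTime=end_time,
        queryString=query,
    )["queryId"]
    # Poll until the query finishes (simplified; bound retries in production)
    while True:
        result = logs.get_query_results(queryId=query_id)
        if result["status"] in ("Complete", "Failed", "Cancelled"):
            return result.get("results", [])
        time.sleep(1)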
Report Generation Pipeline
sequenceDiagram
participant CW as CloudWatch<br>EventBridge
participant LR as Lambda<br>Report Generator
participant DDB as DynamoDB<br>Eval Results
participant RS as Redshift<br>Historical
participant S3 as S3<br>Report Storage
participant SNS as SNS<br>Distribution
CW->>LR: Weekly trigger (Sunday 00:00 UTC)
LR->>DDB: Query last 7 days evaluation results
DDB->>LR: 15,000 evaluation records
LR->>RS: Query 30/90 day baselines
RS->>LR: Aggregated baselines
LR->>LR: Compute metrics, trends, anomalies
LR->>LR: Generate Markdown + HTML report
LR->>S3: Store report (s3://manga-assist-reports/weekly/2024-01-14.html)
LR->>SNS: Publish report notification
SNS->>SNS: Fan out to Slack webhook + Email
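The weekly trigger at the top of this sequence can be expressed as an EventBridge schedule rule targeting the report-generator Lambda. A sketch under assumed names (the rule name, target id, and function ARN are placeholders):
import boto3

events = boto3.client("events", region_name="us-east-1")

# Weekly schedule matching the diagram: Sunday 00:00 UTC.
events.put_rule(
    Name="manga-assist-weekly-report",
    ScheduleExpression="cron(0 0 ? * SUN *)",
    State="ENABLED",
)
events.put_targets(
    Rule="manga-assist-weekly-report",
    Targets=[{
        "Id": "weekly-report-generator",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:manga-assist-report-generator",
    }],
)
# The Lambda also needs a resource policy allowing events.amazonaws.com to invoke it.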
Low-Level Design
Dashboard Metric Definitions
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
import time
class MetricGranularity(Enum):
REAL_TIME = "real_time" # 1-minute resolution
HOURLY = "hourly"
DAILY = "daily"
WEEKLY = "weekly"
class TrendDirection(Enum):
IMPROVING = "improving"
STABLE = "stable"
DEGRADING = "degrading"
@dataclass
class DashboardMetric:
"""A single metric displayed on the dashboard."""
name: str
value: float
unit: str = "None" # None, Percent, Count, Milliseconds
intent: str = "aggregate" # "aggregate" or specific intent
timestamp: float = field(default_factory=time.time)
trend_7d: TrendDirection = TrendDirection.STABLE
trend_30d: TrendDirection = TrendDirection.STABLE
baseline_7d: float = 0.0
baseline_30d: float = 0.0
threshold_warning: float = 0.0
threshold_critical: float = 0.0
anomaly_detected: bool = False
@dataclass
class ModelComparisonRow:
"""One row in a model comparison table."""
model_id: str
model_name: str
metrics: dict[str, float] = field(default_factory=dict) # metric_name → value
cost_per_1k_requests: float = 0.0
avg_latency_ms: float = 0.0
quality_score: float = 0.0 # Composite quality
recommended: bool = False
@dataclass
class WeeklyReportData:
"""All data needed to generate the weekly report."""
report_date: str = ""
period_start: str = ""
period_end: str = ""
total_conversations: int = 0
total_evaluations: int = 0
aggregate_metrics: list[DashboardMetric] = field(default_factory=list)
intent_metrics: dict[str, list[DashboardMetric]] = field(default_factory=dict)
model_comparisons: list[ModelComparisonRow] = field(default_factory=list)
anomalies: list[dict] = field(default_factory=list)
top_failure_categories: list[tuple[str, int]] = field(default_factory=list)
recommendations: list[str] = field(default_factory=list)
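DashboardMetric carries 7-day and 30-day trend directions, but this section does not pin down how they are computed. One plausible helper classifies the trend from the current value against a baseline; the 2% tolerance band is an assumption, not a MangaAssist setting:
def classify_trend(current: float, baseline: float, tolerance: float = 0.02) -> TrendDirection:
    """Classify a metric's trend relative to a baseline (tolerance band is assumed)."""
    if baseline <= 0:
        return TrendDirection.STABLE
    delta = (current - baseline) / baseline
    if delta > tolerance:
        return TrendDirection.IMPROVING
    if delta < -tolerance:
        return TrendDirection.DEGRADING
    return TrendDirection.STABLE

# Example: 0.84 against a 7-day baseline of 0.88 classifies as DEGRADING
metric = DashboardMetric(name="TaskCompletionRate", value=0.84, baseline_7d=0.88)
metric.trend_7d = classify_trend(metric.value, metric.baseline_7d)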
CloudWatch Dashboard Builder
import json
import logging
from typing import Optional
import boto3
logger = logging.getLogger(__name__)
# MangaAssist intents for dashboard breakdown
INTENTS = [
"recommendation", "product_question", "faq", "order_tracking",
"return_request", "promotion", "checkout_help", "chitchat",
"escalation", "product_discovery",
]
# Key evaluation metrics to display
QUALITY_METRICS = [
"TaskCompletionRate", "IntentAccuracy", "RelevanceScore",
"FactualAccuracy", "ToolSelectionAccuracy", "UserSatisfaction",
]
class CloudWatchDashboardBuilder:
"""Builds and updates CloudWatch dashboards for MangaAssist evaluation metrics."""
def __init__(
self,
dashboard_name: str = "MangaAssist-Quality",
namespace: str = "MangaAssist/Evaluation",
region: str = "us-east-1",
):
self.cloudwatch = boto3.client("cloudwatch", region_name=region)
self.dashboard_name = dashboard_name
self.namespace = namespace
self.region = region
def build_operational_dashboard(self) -> None:
"""Build the real-time operational dashboard for engineering."""
widgets = []
y_pos = 0
# Row 1: Aggregate quality headline metrics
widgets.append(self._build_headline_row(y_pos))
y_pos += 3
# Row 2: Quality trend over 7 days
widgets.append(self._build_quality_trend_widget(y_pos))
y_pos += 6
# Row 3: Intent breakdown heatmap
widgets.append(self._build_intent_breakdown_widget(y_pos))
y_pos += 6
# Row 4: Failure category distribution
widgets.append(self._build_failure_distribution_widget(y_pos))
y_pos += 6
# Row 5: Latency by pipeline stage
widgets.append(self._build_latency_widget(y_pos))
y_pos += 6
# Row 6: Anomaly log
widgets.append(self._build_anomaly_log_widget(y_pos))
# Flatten widget list (some builders return lists)
flat_widgets = []
for w in widgets:
if isinstance(w, list):
flat_widgets.extend(w)
else:
flat_widgets.append(w)
dashboard_body = json.dumps({"widgets": flat_widgets})
self.cloudwatch.put_dashboard(
DashboardName=self.dashboard_name,
DashboardBody=dashboard_body,
)
logger.info("Dashboard %s updated with %d widgets", self.dashboard_name, len(flat_widgets))
def _build_headline_row(self, y_pos: int) -> list[dict]:
"""Build single-number headline widgets for key metrics."""
widgets = []
x_pos = 0
width = 4 # 24 / 6 metrics = 4 columns each
for metric in QUALITY_METRICS:
widgets.append({
"type": "metric",
"x": x_pos,
"y": y_pos,
"width": width,
"height": 3,
"properties": {
"metrics": [[self.namespace, metric]],
"view": "singleValue",
"region": self.region,
"stat": "Average",
"period": 3600,
"title": metric.replace("Rate", " Rate").replace("Score", " Score"),
"sparkline": True,
},
})
x_pos += width
return widgets
def _build_quality_trend_widget(self, y_pos: int) -> dict:
"""Build a line chart showing quality metrics over time."""
metrics = [
[self.namespace, m, {"stat": "Average", "period": 3600}]
for m in QUALITY_METRICS
]
return {
"type": "metric",
"x": 0,
"y": y_pos,
"width": 24,
"height": 6,
"properties": {
"metrics": metrics,
"view": "timeSeries",
"stacked": False,
"region": self.region,
"title": "Quality Metrics Trend (Hourly)",
"period": 3600,
"yAxis": {"left": {"min": 0, "max": 1}},
"annotations": {
"horizontal": [
{"label": "Warning", "value": 0.80, "color": "#ff9900"},
{"label": "Critical", "value": 0.70, "color": "#d13212"},
]
},
},
}
def _build_intent_breakdown_widget(self, y_pos: int) -> dict:
"""Build per-intent quality breakdown."""
metrics = []
for intent in INTENTS:
metrics.append([
self.namespace, "TaskCompletionRate",
"Intent", intent,
{"stat": "Average", "period": 86400},
])
return {
"type": "metric",
"x": 0,
"y": y_pos,
"width": 24,
"height": 6,
"properties": {
"metrics": metrics,
"view": "bar",
"region": self.region,
"title": "Task Completion Rate by Intent (Daily)",
"yAxis": {"left": {"min": 0, "max": 1}},
},
}
def _build_failure_distribution_widget(self, y_pos: int) -> dict:
"""Build failure category pie chart."""
return {
"type": "metric",
"x": 0,
"y": y_pos,
"width": 12,
"height": 6,
"properties": {
"metrics": [
[self.namespace, "FailureCount", "Category", "intent_misclassification"],
[self.namespace, "FailureCount", "Category", "wrong_tool"],
[self.namespace, "FailureCount", "Category", "wrong_parameters"],
[self.namespace, "FailureCount", "Category", "reasoning_error"],
[self.namespace, "FailureCount", "Category", "missed_escalation"],
[self.namespace, "FailureCount", "Category", "context_lost"],
],
"view": "pie",
"region": self.region,
"title": "Failure Category Distribution (7 Days)",
"stat": "Sum",
"period": 604800,
},
}
def _build_latency_widget(self, y_pos: int) -> dict:
"""Build latency breakdown by pipeline stage."""
return {
"type": "metric",
"x": 0,
"y": y_pos,
"width": 24,
"height": 6,
"properties": {
"metrics": [
[self.namespace, "Latency", "Stage", "intent_classification", {"stat": "p99"}],
[self.namespace, "Latency", "Stage", "rag_retrieval", {"stat": "p99"}],
[self.namespace, "Latency", "Stage", "llm_generation", {"stat": "p99"}],
[self.namespace, "Latency", "Stage", "total", {"stat": "p99"}],
],
"view": "timeSeries",
"region": self.region,
"title": "P99 Latency by Pipeline Stage (ms)",
"period": 300,
"yAxis": {"left": {"min": 0}},
},
}
def _build_anomaly_log_widget(self, y_pos: int) -> dict:
"""Build a log widget showing recent anomaly detections."""
return {
"type": "log",
"x": 0,
"y": y_pos,
"width": 24,
"height": 6,
"properties": {
"query": (
"SOURCE '/aws/lambda/manga-assist-evaluator' "
"| fields @timestamp, @message "
"| filter @message like /ANOMALY|DEGRADATION|ALERT/ "
"| sort @timestamp desc "
"| limit 50"
),
"region": self.region,
"title": "Recent Anomaly Detections",
},
}
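The widgets above assume evaluation results are already published into the MangaAssist/Evaluation namespace with Intent, Category, and Stage dimensions. A minimal sketch of that publishing side (the helper name and argument shape are assumptions):
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def publish_evaluation_metrics(intent: str, task_completion: float, relevance: float) -> None:
    """Publish per-intent and aggregate series consumed by the dashboard widgets."""
    cloudwatch.put_metric_data(
        Namespace="MangaAssist/Evaluation",
        MetricData=[
            {
                "MetricName": "TaskCompletionRate",
                "Dimensions": [{"Name": "Intent", "Value": intent}],
                "Value": task_completion,
                "Unit": "None",
            },
            {
                "MetricName": "RelevanceScore",
                "Dimensions": [{"Name": "Intent", "Value": intent}],
                "Value": relevance,
                "Unit": "None",
            },
            # Series without the Intent dimension feed the aggregate headline row.
            {"MetricName": "TaskCompletionRate", "Value": task_completion, "Unit": "None"},
        ],
    )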
Automated Report Generator
import json
import logging
from datetime import datetime, timedelta
from decimal import Decimal
from typing import Optional
import boto3
from boto3.dynamodb.conditions import Key, Attr
logger = logging.getLogger(__name__)
class WeeklyReportGenerator:
"""Generates automated weekly evaluation reports for MangaAssist.
Triggered by EventBridge rule every Sunday at 00:00 UTC.
Queries DynamoDB for evaluation results, Redshift for baselines,
generates Markdown + HTML, stores in S3, distributes via SNS.
"""
def __init__(
self,
eval_table_name: str = "manga-assist-evaluations",
report_bucket: str = "manga-assist-reports",
sns_topic_arn: str = "",
region: str = "us-east-1",
):
self.dynamodb = boto3.resource("dynamodb", region_name=region)
self.eval_table = self.dynamodb.Table(eval_table_name)
self.s3 = boto3.client("s3", region_name=region)
self.sns = boto3.client("sns", region_name=region)
self.report_bucket = report_bucket
self.sns_topic_arn = sns_topic_arn
def generate_weekly_report(self) -> str:
"""Generate and distribute the weekly report. Returns S3 key."""
now = datetime.utcnow()
period_end = now
period_start = now - timedelta(days=7)
report_data = self._collect_report_data(period_start, period_end)
markdown = self._render_markdown(report_data)
# Store in S3
s3_key = f"weekly/{now.strftime('%Y-%m-%d')}.md"
self.s3.put_object(
Bucket=self.report_bucket,
Key=s3_key,
Body=markdown.encode("utf-8"),
ContentType="text/markdown",
)
# Distribute via SNS
if self.sns_topic_arn:
self.sns.publish(
TopicArn=self.sns_topic_arn,
Subject=f"MangaAssist Weekly Quality Report — {now.strftime('%Y-%m-%d')}",
Message=self._render_summary(report_data),
)
logger.info("Weekly report generated: s3://%s/%s", self.report_bucket, s3_key)
return s3_key
def _collect_report_data(
self, period_start: datetime, period_end: datetime
) -> WeeklyReportData:
"""Collect all data needed for the report."""
start_ts = period_start.isoformat()
end_ts = period_end.isoformat()
        # Query evaluation results from DynamoDB, paginating because a week of
        # results can exceed the 1 MB query page limit
        items: list[dict] = []
        query_kwargs: dict = {
            "KeyConditionExpression": Key("pk").eq("EVAL#WEEKLY")
            & Key("sk").between(start_ts, end_ts)
        }
        while True:
            response = self.eval_table.query(**query_kwargs)
            items.extend(response.get("Items", []))
            if "LastEvaluatedKey" not in response:
                break
            query_kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]
# Aggregate metrics
aggregate = self._aggregate_metrics(items)
# Per-intent breakdown
intent_metrics = {}
for intent in INTENTS:
intent_items = [i for i in items if i.get("intent") == intent]
if intent_items:
intent_metrics[intent] = self._aggregate_metrics(intent_items)
        # Detect anomalies (metrics deviating more than 10% from their 30-day baseline)
anomalies = self._detect_anomalies(aggregate)
# Top failure categories
failure_counts: dict[str, int] = {}
for item in items:
for cat in item.get("failure_categories", []):
failure_counts[cat] = failure_counts.get(cat, 0) + 1
top_failures = sorted(failure_counts.items(), key=lambda x: x[1], reverse=True)[:5]
# Generate recommendations
recommendations = self._generate_recommendations(aggregate, intent_metrics, top_failures)
return WeeklyReportData(
report_date=period_end.strftime("%Y-%m-%d"),
period_start=period_start.strftime("%Y-%m-%d"),
period_end=period_end.strftime("%Y-%m-%d"),
total_conversations=len(items),
total_evaluations=len(items),
aggregate_metrics=aggregate,
intent_metrics=intent_metrics,
anomalies=anomalies,
top_failure_categories=top_failures,
recommendations=recommendations,
)
def _aggregate_metrics(self, items: list[dict]) -> list[DashboardMetric]:
"""Aggregate metric values from evaluation items."""
metric_values: dict[str, list[float]] = {}
for item in items:
for metric_name in QUALITY_METRICS:
val = item.get(metric_name)
if val is not None:
if metric_name not in metric_values:
metric_values[metric_name] = []
metric_values[metric_name].append(float(val))
result = []
for name, values in metric_values.items():
avg = sum(values) / len(values)
result.append(DashboardMetric(
name=name,
value=round(avg, 4),
unit="None",
))
return result
def _detect_anomalies(self, aggregate: list[DashboardMetric]) -> list[dict]:
"""Detect metrics that significantly deviate from baseline."""
anomalies = []
for metric in aggregate:
if metric.baseline_30d > 0:
deviation = abs(metric.value - metric.baseline_30d) / metric.baseline_30d
if deviation > 0.10: # >10% deviation
anomalies.append({
"metric": metric.name,
"current": metric.value,
"baseline": metric.baseline_30d,
"deviation_pct": round(deviation * 100, 1),
"direction": "degraded" if metric.value < metric.baseline_30d else "improved",
})
return anomalies
def _generate_recommendations(
self,
aggregate: list[DashboardMetric],
intent_metrics: dict,
top_failures: list[tuple[str, int]],
) -> list[str]:
"""Generate human-readable recommendations based on the data."""
recommendations = []
# Check for low-performing intents
for intent, metrics in intent_metrics.items():
for m in metrics:
if m.name == "TaskCompletionRate" and m.value < 0.80:
recommendations.append(
f"⚠️ {intent} task completion rate is {m.value:.1%} — "
f"investigate top failure modes for this intent."
)
# Check for dominant failure categories
if top_failures:
top_cat, top_count = top_failures[0]
total = sum(c for _, c in top_failures)
if top_count > total * 0.4:
recommendations.append(
f"🔴 '{top_cat}' accounts for {top_count/total:.0%} of all failures — "
f"prioritize fixing this category."
)
return recommendations
def _render_markdown(self, data: WeeklyReportData) -> str:
"""Render the report as Markdown."""
lines = [
f"# MangaAssist Weekly Quality Report",
f"**Period:** {data.period_start} — {data.period_end}",
f"**Generated:** {data.report_date}",
f"**Conversations Evaluated:** {data.total_evaluations:,}",
"",
"## Summary Metrics",
"",
"| Metric | Value | 7-Day Trend | 30-Day Baseline |",
"|--------|-------|-------------|-----------------|",
]
for m in data.aggregate_metrics:
lines.append(
f"| {m.name} | {m.value:.2%} | {m.trend_7d.value} | {m.baseline_30d:.2%} |"
)
lines.extend([
"",
"## Top Failure Categories",
"",
"| Category | Count | % of Total |",
"|----------|-------|------------|",
])
total_failures = sum(c for _, c in data.top_failure_categories)
for cat, count in data.top_failure_categories:
pct = count / total_failures if total_failures > 0 else 0
lines.append(f"| {cat} | {count} | {pct:.1%} |")
if data.anomalies:
lines.extend(["", "## Anomalies Detected", ""])
for anomaly in data.anomalies:
lines.append(
f"- **{anomaly['metric']}**: {anomaly['current']:.2%} "
f"({anomaly['direction']} {anomaly['deviation_pct']}% from baseline {anomaly['baseline']:.2%})"
)
if data.recommendations:
lines.extend(["", "## Recommendations", ""])
for rec in data.recommendations:
lines.append(f"- {rec}")
return "\n".join(lines)
def _render_summary(self, data: WeeklyReportData) -> str:
"""Render a short summary for Slack/Email notification."""
lines = [
f"📊 MangaAssist Weekly Quality Report ({data.period_start} → {data.period_end})",
f"Conversations: {data.total_evaluations:,}",
"",
]
for m in data.aggregate_metrics:
indicator = "✅" if m.value >= 0.85 else "⚠️" if m.value >= 0.75 else "🔴"
lines.append(f"{indicator} {m.name}: {m.value:.1%}")
if data.anomalies:
lines.append(f"\n🚨 {len(data.anomalies)} anomalies detected — see full report.")
return "\n".join(lines)
MangaAssist Scenarios
Scenario A: Dashboard Reveals Hidden Intent-Level Regression
Context: Aggregate TaskCompletionRate was 86% — stable for 2 weeks. No alerts fired. But the product team noticed more escalation requests from users asking about promotions.
What Happened:
- The CloudWatch aggregate dashboard showed 86% task completion with a stable trend
- Drilling into the intent breakdown bar chart revealed:
- promotion: 62% task completion (down from 88% two weeks prior)
- recommendation: 93% (up from 89%)
- The aggregate was held up by recommendation improvements masking promotion degradation
- The failure category pie chart showed 40% of promotion failures were missing_tool_call
Root Cause: A deployment two weeks ago changed the orchestrator routing rules. Promotion queries now routed to the RAG pipeline (general knowledge) instead of the Promotion API (live coupon codes). The RAG pipeline couldn't surface active promotions — it had stale catalog data.
Fix: Restored Promotion API routing. Added intent-level threshold alerts: any single intent dropping >10% from 7-day baseline triggers a Slack alert.
Dashboard Change: Added per-intent sparklines to the headline row — each intent now has its own single-value card showing current value with 7-day trend.
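A static-threshold simplification of that intent-level alerting might look like the sketch below. Alarm names, the 0.80 threshold, and the SNS topic ARN are assumptions; a faithful "drop >10% from 7-day baseline" rule would instead use CloudWatch anomaly detection or metric math against a baseline series.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# INTENTS is the intent list defined earlier in this section.
for intent in INTENTS:
    cloudwatch.put_metric_alarm(
        AlarmName=f"manga-assist-{intent}-task-completion",
        Namespace="MangaAssist/Evaluation",
        MetricName="TaskCompletionRate",
        Dimensions=[{"Name": "Intent", "Value": intent}],
        Statistic="Average",
        Period=3600,
        EvaluationPeriods=6,  # six consecutive hours below threshold
        Threshold=0.80,
        ComparisonOperator="LessThanThreshold",
        TreatMissingData="notBreaching",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:manga-assist-quality-alerts"],
    )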
Scenario B: Weekly Report Catches Gradual User Satisfaction Decline
Context: Weekly automated report showed UserSatisfaction at 78%, down from 84% four weeks prior. The drop was gradual: 84% → 82% → 80% → 78%. No single-week alert triggered because each weekly drop was <5%.
What Happened:
- The recommendations section auto-generated: "⚠️ UserSatisfaction is 78% — investigate top failure modes"
- Drilling into the report's intent breakdown: product_question satisfaction dropped from 90% → 71%
- Failure category: reasoning_error increased 3x over the period
- Cross-referencing with the model comparison view: a Bedrock runtime update had been applied 4 weeks ago
Root Cause: Bedrock runtime update changed Claude 3.5 Sonnet's default temperature behavior. Product questions requiring precise factual answers about manga (release dates, volume counts, price) were generating slightly more "creative" responses.
Fix: Pinned the Claude invocation to temperature: 0.1 for product_question intent (was previously unset, relying on model default). UserSatisfaction recovered to 85% within one week.
Report Improvement: Added 4-week trend column alongside 7-day and 30-day baselines. Gradual shifts now surface as a "degrading" trend arrow even if each individual week is within tolerance.
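Applied in isolation, the temperature pin from this fix could be set per intent at invocation time. A sketch using the Bedrock Converse API; the per-intent config map and routing below are assumptions, not the actual MangaAssist orchestrator code:
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Only product_question gets a pinned temperature; other intents keep the model default.
INFERENCE_CONFIG_BY_INTENT = {
    "product_question": {"temperature": 0.1, "maxTokens": 1024},
}

def invoke_model(intent: str, user_text: str) -> str:
    config = INFERENCE_CONFIG_BY_INTENT.get(intent)
    kwargs = {"inferenceConfig": config} if config else {}
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        messages=[{"role": "user", "content": [{"text": user_text}]}],
        **kwargs,
    )
    return response["output"]["message"]["content"][0]["text"]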
Scenario C: Model Comparison Dashboard Drives Cost Decision
Context: The team was evaluating whether to route chitchat and faq intents to Claude 3 Haiku instead of Claude 3.5 Sonnet to reduce costs.
What Happened:
- Model comparison view showed side-by-side:
| Metric | Sonnet (chitchat) | Haiku (chitchat) | Sonnet (faq) | Haiku (faq) |
|---|---|---|---|---|
| TaskCompletionRate | 0.90 | 0.87 | 0.92 | 0.88 |
| RelevanceScore | 0.88 | 0.85 | 0.91 | 0.86 |
| FactualAccuracy | 0.82 | 0.80 | 0.94 | 0.89 |
| UserSatisfaction | 0.85 | 0.82 | 0.88 | 0.81 |
| Cost/1K requests | $4.20 | $0.42 | $4.20 | $0.42 |
| P99 Latency | 2.8s | 0.9s | 2.8s | 0.9s |
- For chitchat: Haiku was 90% cheaper with only a 3% quality drop — clearly acceptable
- For faq: Haiku had a 5% FactualAccuracy drop — FAQ accuracy is critical, so Haiku fails the bar
Decision: Route chitchat to Haiku (saves $3.78/1K requests). Keep faq on Sonnet (accuracy matters). Monthly savings: ~$1,800 with negligible quality impact.
Dashboard Enhancement: Added "Cost Efficiency Score" = QualityScore / CostPer1KRequests. Higher is better. Haiku for chitchat scored 2.02 vs Sonnet's 0.21 — 10x more cost-efficient.
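Expressed against the ModelComparisonRow dataclass defined earlier, the Cost Efficiency Score is a simple ratio. The quality and cost values below come from the scenario table; how the composite quality score is weighted is not specified in this section:
def cost_efficiency(row: ModelComparisonRow) -> float:
    """QualityScore / CostPer1KRequests; higher means more quality per dollar."""
    if row.cost_per_1k_requests <= 0:
        return 0.0
    return round(row.quality_score / row.cost_per_1k_requests, 2)

haiku_chitchat = ModelComparisonRow(
    model_id="anthropic.claude-3-haiku-20240307-v1:0",
    model_name="Claude 3 Haiku",
    quality_score=0.85,
    cost_per_1k_requests=0.42,
)
print(cost_efficiency(haiku_chitchat))  # ≈ 2.02, matching the scenario above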
Scenario D: Executive Report Misinterpretation Prevention
Context: Leadership received the weekly report showing TaskCompletionRate at 84%. The VP asked: "Why is 16% of our chatbot failing?"
What Happened:
- The original executive report showed raw metrics without context
- 84% TaskCompletionRate included all intents, including escalation (which by definition routes to humans — not a "failure")
- The chitchat intent had inherently lower "completion" because casual conversation has no clear success criteria
- Adjusted calculation: excluding escalation and chitchat, task completion was 91%
Fix: Executive summary now shows:
1. "Actionable Task Completion" = TaskCompletionRate for transactional intents only (order_tracking, return_request, checkout_help, promotion)
2. "Knowledge Task Completion" = TaskCompletionRate for knowledge intents (product_question, recommendation, faq, product_discovery)
3. "Escalation Rate" = a separate metric treated as a routing metric, not a failure metric
4. "Engagement Quality" = for chitchat, measured by conversation length and sentiment, not completion
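A minimal sketch of that regrouping, assuming per-intent task-completion rates are already available as a dict (the helper and constant names are illustrative):
# Intent groupings mirror the executive-summary definitions above.
TRANSACTIONAL_INTENTS = ["order_tracking", "return_request", "checkout_help", "promotion"]
KNOWLEDGE_INTENTS = ["product_question", "recommendation", "faq", "product_discovery"]

def executive_completion_metrics(per_intent_completion: dict[str, float]) -> dict[str, float]:
    def avg(intents: list[str]) -> float:
        vals = [per_intent_completion[i] for i in intents if i in per_intent_completion]
        return round(sum(vals) / len(vals), 3) if vals else 0.0
    return {
        "actionable_task_completion": avg(TRANSACTIONAL_INTENTS),
        "knowledge_task_completion": avg(KNOWLEDGE_INTENTS),
        # escalation and chitchat are reported separately, not counted as completion failures
    }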
Key Lesson: Reports for different audiences need different metric definitions. Engineering sees raw pipeline metrics. Leadership sees business-outcome metrics. Same data, different framing.
Intuition Gained
Aggregation Hides Problems
An 86% aggregate quality score can hide a 62% catastrophe in one intent. Always show intent-level breakdowns by default, not just on drill-down. The MangaAssist dashboard now opens to the intent-level view, with aggregate as a secondary summary.
Automated Reports Catch Gradual Drift
Human dashboard watchers miss gradual degradation — a 1%/week decline is invisible day-to-day but devastating over a quarter. Automated reports with 4-week trend analysis surface these patterns. The weekly report at MangaAssist is the most reliable degradation detection mechanism, more so than real-time alerts.
Different Audiences Need Different Metrics
Engineering needs pipeline-stage metrics (intent accuracy, tool selection, latency by stage). Product needs user-facing metrics (satisfaction, task completion, escalation rate). Leadership needs business metrics (cost per conversation, resolution rate, customer impact). One dashboard cannot serve all three — build a hierarchy.
References
- MangaAssist Architecture HLD — Overall system architecture
- Metrics Overview — Business and technical metrics
- Analytics Pipeline Cost Optimization — Analytics infra
- FM Output Quality Assessment — Quality metrics definitions
- Model Evaluation Optimal Configuration — Model comparison framework
- Agent Performance Framework — Agent metrics powering dashboards