
US-10: Unified Optimization Decision Dashboard

User Story

As the engineering leadership team (Cost, Performance, and Inference leads), I want a single dashboard that shows how every optimization decision affects all three dimensions simultaneously, so that we can make informed tradeoff decisions in monthly reviews and respond quickly when a reversal trigger fires.

Why This Dashboard Exists

graph TD
    A["Without Unified Dashboard"] --> B["Cost Team watches<br/>AWS billing dashboard"]
    A --> C["Performance Team watches<br/>CloudWatch latency metrics"]
    A --> D["Inference Team watches<br/>quality evaluation pipeline"]

    B --> E["Each team optimizes<br/>their metrics independently"]
    E --> F["Local optimizations<br/>create global pessimum"]

    G["With Unified Dashboard"] --> H["All three dimensions<br/>visible in one place"]
    H --> I["QACPI trend shows<br/>global system health"]
    I --> J["Tradeoff decisions are<br/>data-driven, not political"]

    style F fill:#eb3b5a,stroke:#333,color:#fff
    style J fill:#2d8659,stroke:#333,color:#fff

Acceptance Criteria

  • QACPI is computed and displayed in near-real time at 5-minute granularity.
  • All reversal triggers from US-01 through US-09 are monitored with automated alerts.
  • Each dashboard panel links to the relevant user story for context.
  • Monthly tradeoff review has a standardized report generated from dashboard data.
  • Any team member can see the projected impact of a proposed change (via simulation mode) before it is deployed.

Dashboard Architecture

graph TD
    subgraph "Data Sources"
        CW["CloudWatch<br/>Latency, throughput,<br/>error rates"]
        BR["Bedrock Metrics<br/>Token counts, model usage,<br/>provisioned utilization"]
        AB["AWS Billing<br/>Daily spend by service"]
        QE["Quality Evaluator<br/>Hallucination rate, CSAT,<br/>groundedness scores"]
        AN["Analytics Pipeline<br/>Intent distribution,<br/>cache hit rates,<br/>escalation rates"]
    end

    subgraph "Processing"
        KIN["Kinesis Data Stream<br/>Real-time aggregation"]
        RS["Redshift<br/>Historical analysis"]
    end

    subgraph "Dashboard Layer"
        DASH["Unified Dashboard<br/>(CloudWatch Dashboard + QuickSight)"]
    end

    CW --> KIN
    BR --> KIN
    AB --> RS
    QE --> KIN
    AN --> KIN

    KIN --> DASH
    RS --> DASH

    style DASH fill:#54a0ff,stroke:#333,color:#000
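
For concreteness, here is a minimal sketch of how any of the real-time sources above could publish into the Kinesis aggregation stream. The stream name and record envelope are assumptions for illustration, not part of the original design:

```python
import json
import time

import boto3  # AWS SDK for Python; assumes credentials/region are configured

kinesis = boto3.client("kinesis")

def emit_metric_event(source: str, metrics: dict) -> None:
    """Publish one metric snapshot onto the shared aggregation stream.

    Any of the real-time sources (CloudWatch export, Bedrock metrics,
    quality evaluator, analytics pipeline) would emit the same envelope.
    """
    record = {
        "source": source,           # e.g. "quality_evaluator"
        "emitted_at": time.time(),  # epoch seconds
        "metrics": metrics,         # flat name -> value map
    }
    kinesis.put_record(
        StreamName="unified-dashboard-metrics",  # hypothetical stream name
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=source,  # keeps each source's records ordered
    )

# Example: the quality evaluator reporting a 5-minute window.
emit_metric_event("quality_evaluator", {
    "quality_score": 0.87,
    "hallucination_rate": 0.038,
    "csat": 4.1,
})
```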

Dashboard Panels

Panel 1: QACPI Composite Score (Hero Metric)

graph LR
    subgraph "QACPI Trend (30 days)"
        direction TB
        QACPI["Current QACPI: 188,889<br/>7d trend: +2.3%<br/>30d trend: +8.7%"]
        TARGET["Target: > 150,000<br/>Alert: < 120,000"]
    end

    subgraph "QACPI Components"
        Q["Quality: 0.87<br/>(target > 0.85)"]
        T["Throughput: 1,200 RPS<br/>(target > 1,000)"]
        CPR["Cost/Req: $0.006<br/>(target < $0.008)"]
        LAT["p95 Latency: 0.9s<br/>(target < 2.0s)"]
    end

    QACPI --> Q
    QACPI --> T
    QACPI --> CPR
    QACPI --> LAT

    style QACPI fill:#2d8659,stroke:#333,color:#fff
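
The panel combines four components into one composite score. The exact QACPI formula is defined in US-01, not here, so the sketch below uses one plausible combination (quality and throughput in the numerator, cost and latency in the denominator) purely to show how the panel's thresholds would be applied; it does not reproduce the 188,889 figure:

```python
def qacpi(quality: float, throughput_rps: float,
          cost_per_request: float, p95_latency_s: float) -> float:
    """Hypothetical composite: reward quality and throughput, penalize
    cost and latency. The canonical QACPI definition lives in US-01;
    this is only one plausible way to combine the four components."""
    return (quality * throughput_rps) / (cost_per_request * p95_latency_s)

def qacpi_status(score: float) -> str:
    """Panel thresholds: target > 150,000, alert < 120,000."""
    if score < 120_000:
        return "alert"
    return "on target" if score > 150_000 else "watch"

# Illustrative only: this hypothetical formula yields ~145,000 with the
# panel's component values, not 188,889 -- the real weighting is in US-01.
score = qacpi(0.87, 1200, 0.006, 1.2)
print(f"{score:,.0f} -> {qacpi_status(score)}")
```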

Panel 2: The Three Dimensions (At a Glance)

graph TD
    subgraph "Cost Health"
        C1["Daily Spend: $5,200<br/>Budget: $6,000/day<br/>Utilization: 87%"]
        C2["LLM: $3,100<br/>Compute: $1,400<br/>Storage: $450<br/>Other: $250"]
    end

    subgraph "Performance Health"
        P1["p50 Latency: 420ms<br/>p95 Latency: 1,200ms<br/>p99 Latency: 1,850ms"]
        P2["Throughput: 1,200 RPS<br/>Error Rate: 0.02%<br/>Cache Hit Rate: 58%"]
    end

    subgraph "Inference Health"
        I1["Quality Score: 0.87<br/>Hallucination Rate: 3.2%<br/>CSAT: 4.1/5.0"]
        I2["Guardrail Block Rate: 2.8%<br/>False Positive Rate: 0.8%<br/>Escalation Rate: 4.5%"]
    end

    style C1 fill:#f9d71c,stroke:#333,color:#000
    style P1 fill:#4ecdc4,stroke:#333,color:#000
    style I1 fill:#ff6b6b,stroke:#333,color:#000

Panel 3: Model Tiering Effectiveness (US-02)

graph TD
    subgraph "Traffic Distribution by Tier"
        T1["Template: 32%<br/>(target: 30%)"]
        T2["Haiku: 38%<br/>(target: 40%)"]
        T3["Sonnet: 30%<br/>(target: 30%)"]
    end

    subgraph "Quality by Tier"
        Q1["Template: N/A<br/>(deterministic)"]
        Q2["Haiku avg quality: 0.79<br/>(floor: 0.72)"]
        Q3["Sonnet avg quality: 0.92<br/>(floor: 0.88)"]
    end

    subgraph "Cost by Tier"
        CO1["Template: $0/day"]
        CO2["Haiku: $400/day"]
        CO3["Sonnet: $4,500/day"]
    end

    style T1 fill:#2d8659,stroke:#333,color:#fff
    style T2 fill:#fd9644,stroke:#333,color:#000
    style T3 fill:#eb3b5a,stroke:#333,color:#fff
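
A small sketch of the drift check behind the traffic-distribution sub-panel, assuming the panel's targets and a hypothetical 5-point tolerance:

```python
# Target traffic shares from the panel; the tolerance is an assumption.
TIER_TARGETS = {"template": 0.30, "haiku": 0.40, "sonnet": 0.30}

def tier_drift(observed: dict[str, float], tolerance: float = 0.05) -> dict[str, str]:
    """Flag any tier whose observed traffic share drifts more than
    `tolerance` from its target; the intent router owns the fix."""
    return {
        tier: "drift" if abs(observed[tier] - target) > tolerance else "ok"
        for tier, target in TIER_TARGETS.items()
    }

print(tier_drift({"template": 0.32, "haiku": 0.38, "sonnet": 0.30}))
# -> {'template': 'ok', 'haiku': 'ok', 'sonnet': 'ok'}
```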

Panel 4: Latency Budget Compliance (US-03)

graph LR
    subgraph "Pipeline Stage Latency (p95)"
        E["Edge: 35ms<br/>(budget: 40ms) ✅"]
        S["Session: 42ms<br/>(budget: 50ms) ✅"]
        IC["Intent: 38ms<br/>(budget: 50ms) ✅"]
        RAG["RAG: 230ms<br/>(budget: 250ms) ⚠️"]
        SVC["Services: 180ms<br/>(budget: 200ms) ✅"]
        LLM["LLM TTFT: 380ms<br/>(budget: 400ms) ⚠️"]
        GR["Guardrails: 85ms<br/>(budget: 100ms) ✅"]
    end

    style RAG fill:#fd9644,stroke:#333,color:#000
    style LLM fill:#fd9644,stroke:#333,color:#000

Panel 5: RAG and Caching Health (US-05, US-06)

| Metric | Current | Target | Status |
|---|---|---|---|
| RAG recall@3 | 0.86 | ≥ 0.85 | ✅ |
| RAG hallucination rate | 3.8% | < 4% | ⚠️ |
| RAG avg latency | 115ms | < 250ms | ✅ |
| Cache hit rate (product) | 72% | ≥ 70% | ✅ |
| Cache hit rate (FAQ semantic) | 38% | ≥ 40% | ⚠️ |
| Cache hit rate (recommendations) | 11% | ≥ 10% | ✅ |
| Stale data complaints | 0.06% | < 0.1% | ✅ |
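
Each hit-rate row above is a simple ratio maintained by the corresponding cache layer. A minimal TTL-cache sketch (the structure and TTL handling are illustrative assumptions, not the US-06 implementation) shows the tradeoff the panel tracks: longer TTLs raise the hit rate but also the staleness risk measured by the complaints row:

```python
import time

class TTLCache:
    """Minimal TTL cache that tracks its own hit rate."""

    def __init__(self, ttl_s: float):
        self.ttl_s = ttl_s
        self.store: dict[str, tuple[float, object]] = {}
        self.hits = self.misses = 0

    def get(self, key: str):
        entry = self.store.get(key)
        if entry and time.time() - entry[0] < self.ttl_s:
            self.hits += 1
            return entry[1]
        self.misses += 1  # expired entries count as misses
        return None

    def put(self, key: str, value) -> None:
        self.store[key] = (time.time(), value)

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```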

Panel 6: Guardrail Health (US-07)

graph TD
    subgraph "Guardrail Metrics"
        G1["Total Block Rate: 2.8%<br/>(target: 2-4%)"]
        G2["False Positive Rate: 0.8%<br/>(target: < 2%)"]
        G3["PII Detections: 1,240/day<br/>(all blocked ✅)"]
        G4["Price Corrections: 89/day<br/>(all corrected ✅)"]
        G5["Async Retractions: 12/day<br/>(hallucinations caught post-delivery)"]
    end

    style G1 fill:#2d8659,stroke:#333,color:#fff
    style G2 fill:#2d8659,stroke:#333,color:#fff

Panel 7: Autoscaling Health (US-08)

graph TD
    subgraph "Scaling Events (Last 24h)"
        SE1["Predictive scale-ups: 4<br/>(all on schedule)"]
        SE2["Reactive scale-ups: 2<br/>(unexpected traffic)"]
        SE3["Lambda overflow events: 1<br/>(handled 8K requests)"]
        SE4["Scale-downs: 6<br/>(all smooth)"]
    end

    subgraph "Capacity Utilization"
        CU1["Current tasks: 45<br/>(capacity: 60)"]
        CU2["Headroom: 25%"]
        CU3["Idle cost: $380/day<br/>(23% of compute)"]
    end

    style CU3 fill:#fd9644,stroke:#333,color:#000

Panel 8: Token Budget Health (US-09)

| Metric | Current | Target | Status |
|---|---|---|---|
| Avg input tokens | 1,780 | < 2,000 | ✅ |
| p95 input tokens | 2,850 | < 3,000 | ⚠️ |
| Avg output tokens | 280 | < 300 | ✅ |
| Prompt cache hit rate | 68% | ≥ 60% | ✅ |
| History summarization triggers/day | 185K | monitor only | – |
| Token budget overflows/day | 2,300 | < 5,000 | ✅ |
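
A sketch of how the overflow counter in the last row might be produced, assuming a hypothetical per-section budget partition (US-09 owns the real one) and a caller-supplied token counter:

```python
# Hypothetical per-section budget partition; US-09 owns the real split.
SECTION_BUDGETS = {"system": 400, "rag_context": 900, "history": 500, "query": 200}

def enforce_budget(sections: dict[str, list[str]], count_tokens):
    """Trim each prompt section to its token budget, dropping the oldest
    items first. Returns the trimmed sections plus an overflow flag; the
    'token budget overflows/day' metric counts requests where it is set."""
    overflowed = False
    trimmed: dict[str, list[str]] = {}
    for name, items in sections.items():
        budget = SECTION_BUDGETS[name]
        kept = list(items)
        while kept and sum(count_tokens(chunk) for chunk in kept) > budget:
            kept.pop(0)  # oldest item in the section goes first
            overflowed = True
        trimmed[name] = kept
    return trimmed, overflowed

# A crude whitespace tokenizer stands in for a real, model-specific counter.
sections = {"system": ["You are a support assistant."],
            "rag_context": ["chunk one ...", "chunk two ..."],
            "history": ["turn 1", "turn 2"], "query": ["where is my order?"]}
print(enforce_budget(sections, count_tokens=lambda s: len(s.split())))
```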

2026 Update: Turn the Dashboard into a Decision Cockpit

Treat everything above this section as the baseline dashboard architecture. This update preserves that monitoring design and shows how it evolves into a trace-level decision cockpit.

A useful optimization dashboard now needs to operate at trace level, not just as a static KPI board.

  • Add serving-path metrics such as queue time, TTFT, ITL/TPOT, prefill time, decode time, waiting requests, prompt cache hit rate, prefix cache hit rate, and invalidation lag.
  • Segment every major panel by intent, route class, model class, traffic cohort, and experiment ID. Blended averages hide which tradeoff is actually failing.
  • Join offline evals, online judge scores, human review outcomes, and business metrics in one experiment record so every change has an attributable impact trail.
  • Annotate releases and use trace replay / what-if simulation before rollout. The dashboard should compare expected vs actual impact by decision ID, not only after-the-fact trends.
  • Keep QACPI, but pair it with a constraint board and Pareto frontier view. Leadership needs to see when a "better" composite score violates a hard safety or policy threshold.

Recent references: AWS Bedrock model invocation logging, CloudWatch generative AI observability, Bedrock Guardrails CloudWatch metrics, vLLM metrics, Anthropic evaluation guidance.
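
As a concrete starting point, the per-request record below sketches the trace-level fields named above; every field name is an assumption, but capturing them per request is what makes segmentation by intent, route, model class, cohort, and experiment possible:

```python
from dataclasses import dataclass

@dataclass
class ServingTrace:
    """One per-request trace record (field names are assumptions)."""
    request_id: str
    intent: str                # e.g. "faq", "order_status"
    route_class: str           # e.g. "template", "haiku", "sonnet"
    model_class: str
    cohort: str                # traffic cohort for segmentation
    experiment_id: str | None  # attribution trail for changes
    queue_time_ms: float
    ttft_ms: float             # time to first token
    itl_ms: float              # inter-token latency (TPOT)
    prefill_ms: float
    decode_ms: float
    prompt_cache_hit: bool
    prefix_cache_hit: bool
```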

Reversal Trigger Monitoring

graph TD
    subgraph "Active Reversal Triggers (0 firing, 2 warning)"
        RT1["US-02: Haiku quality < 0.72<br/>Status: 0.79 ✅"]
        RT2["US-03: End-to-end p95 > 2s<br/>Status: 1.2s ✅"]
        RT3["US-05: Hallucination > 5%<br/>Status: 3.8% ⚠️ (watching)"]
        RT4["US-06: Stale complaints > 0.1%<br/>Status: 0.06% ✅"]
        RT5["US-07: False positive > 3%<br/>Status: 0.8% ✅"]
        RT6["US-08: Idle cost > 30%<br/>Status: 23% ✅"]
        RT7["US-09: Quality < 0.82 on multi-turn<br/>Status: 0.84 ⚠️ (watching)"]
    end

    style RT3 fill:#fd9644,stroke:#333,color:#000
    style RT7 fill:#fd9644,stroke:#333,color:#000

Automated Alert Rules

| Alert Level | Condition | Notification | Response |
|---|---|---|---|
| Info | Metric within 10% of trigger threshold | Slack message to team channel | Awareness; no action required |
| Warning | Metric within 5% of trigger threshold | Slack + email to team leads | Investigate; prepare rollback |
| Critical | Reversal trigger fires | PagerDuty + Slack + auto-generated Jira ticket | Execute reversal playbook within 2 hours |
| Emergency | PII leak or price error detected | PagerDuty (P1) + automatic traffic shift to safe mode | Immediate response; safe mode until root-caused |
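
The proximity-based levels (Info, Warning, Critical) reduce to a small classifier; Emergency is event-triggered (PII leak, price error) and bypasses this logic entirely. A sketch:

```python
def alert_level(metric: float, threshold: float, direction: str) -> str:
    """Map a metric's distance from its reversal-trigger threshold to
    the alert levels in the table above. `direction` says which side of
    the threshold is unhealthy ('above' or 'below')."""
    if direction == "above":
        if metric >= threshold:
            return "critical"
        gap = (threshold - metric) / threshold
    else:
        if metric <= threshold:
            return "critical"
        gap = (metric - threshold) / threshold
    if gap <= 0.05:
        return "warning"
    if gap <= 0.10:
        return "info"
    return "ok"

# US-09 trigger: quality < 0.82 reverses; current multi-turn quality is 0.84.
print(alert_level(0.84, 0.82, direction="below"))  # -> 'warning' (~2.4% away)
```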

Monthly Tradeoff Review Report

The dashboard auto-generates a monthly report for the engineering leadership meeting:

graph TD
    subgraph "Monthly Review Template"
        M1["1. QACPI Trend<br/>Is the system getting better<br/>or worse overall?"]
        M2["2. Dimension Health<br/>Which dimension improved?<br/>Which degraded?"]
        M3["3. Tradeoff Decisions Made<br/>What changed this month?<br/>What was the measured impact?"]
        M4["4. Reversal Triggers Status<br/>Any close to firing?<br/>Any fired and resolved?"]
        M5["5. Next Month Proposals<br/>What optimization is proposed?<br/>Expected impact on all 3 dims?"]
    end

    M1 --> M2 --> M3 --> M4 --> M5

    style M1 fill:#54a0ff,stroke:#333,color:#000
    style M5 fill:#ff9f43,stroke:#333,color:#000

Example Monthly Summary

| Category | January | February | Trend |
|---|---|---|---|
| QACPI | 175,000 | 188,889 | ↑ +7.9% |
| Monthly Spend | $168,000 | $156,000 | ↓ -7.1% |
| p95 Latency | 1.4s | 1.2s | ↓ -14.3% |
| Quality Score | 0.86 | 0.87 | ↑ +1.2% |
| Hallucination Rate | 4.1% | 3.8% | ↓ -7.3% |
| Cache Hit Rate | 52% | 58% | ↑ +11.5% |
| Escalation Rate | 5.2% | 4.5% | ↓ -13.5% |
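
The Trend column is plain month-over-month arithmetic; a one-function sketch:

```python
def month_over_month(prev: float, curr: float) -> str:
    """Format the Trend column: direction arrow plus signed percent change."""
    pct = (curr - prev) / prev * 100
    arrow = "↑" if pct > 0 else ("↓" if pct < 0 else "→")
    return f"{arrow} {pct:+.1f}%"

print(month_over_month(175_000, 188_889))  # -> '↑ +7.9%'
print(month_over_month(168_000, 156_000))  # -> '↓ -7.1%'
```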

Simulation Mode: "What If" Analysis

The dashboard supports simulating any tradeoff decision before it is made:

sequenceDiagram
    participant Lead as Engineering Lead
    participant Sim as Simulation Engine
    participant Dash as Dashboard

    Lead->>Sim: "What if we route FAQ to Haiku instead of Sonnet?"
    Sim->>Sim: Calculate impact on cost, latency, quality
    Sim-->>Dash: Projected QACPI: 195,000 (+3.2%)
    Sim-->>Dash: Projected cost: -$12K/month
    Sim-->>Dash: Projected FAQ quality: -0.08 (0.84→0.76)
    Sim-->>Dash: Projected FAQ latency: -400ms

    Lead->>Lead: Quality drop acceptable for FAQ?<br/>Consult Inference Team.

    Note over Lead: Decision: Approve if FAQ quality stays > 0.72
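
A minimal sketch of the projection step, reusing the hypothetical qacpi() combination from the Panel 1 sketch; the deltas below are illustrative inputs, not the engine's real model:

```python
def qacpi(quality, throughput_rps, cost_per_request, p95_latency_s):
    # Same hypothetical composite as the Panel 1 sketch (US-01 owns the real one).
    return (quality * throughput_rps) / (cost_per_request * p95_latency_s)

BASELINE = {"quality": 0.87, "throughput_rps": 1200,
            "cost_per_request": 0.006, "p95_latency_s": 1.2}

def what_if(**deltas: float) -> dict[str, float]:
    """Apply additive deltas to the baseline components and report the
    projected composite. A real simulation engine would model traffic
    mix and per-intent effects; this only recombines the global score."""
    projected = {k: v + deltas.get(k, 0.0) for k, v in BASELINE.items()}
    return {"qacpi": qacpi(**projected), **projected}

# "Route FAQ to Haiku": cheaper and faster, slightly lower quality.
print(what_if(cost_per_request=-0.0005, p95_latency_s=-0.1, quality=-0.01))
```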

Simulation Inputs

| Parameter | Options |
|---|---|
| Model tier change | Move intent X from Sonnet → Haiku, or Haiku → Template |
| RAG configuration | Change chunk count, add/remove reranker |
| Cache TTL change | Increase/decrease TTL for data type X |
| Scaling change | Adjust provisioned capacity, change predictive schedule |
| Token budget change | Reallocate tokens between sections |
| Guardrail change | Loosen/tighten threshold for guardrail X |

Summary: How All 10 User Stories Connect

graph TD
    US01["US-01: Trilemma Framework<br/>(Decision methodology)"] --> US02["US-02: Model Tiering<br/>(Which model for which query?)"]
    US01 --> US03["US-03: Latency Budget<br/>(How much time per stage?)"]
    US01 --> US04["US-04: Real-Time vs Batch<br/>(What to pre-compute?)"]
    US01 --> US05["US-05: RAG Depth<br/>(How many chunks?)"]
    US01 --> US06["US-06: Caching<br/>(What to cache, how long?)"]
    US01 --> US07["US-07: Guardrails<br/>(How strict?)"]
    US01 --> US08["US-08: Autoscaling<br/>(How much headroom?)"]
    US01 --> US09["US-09: Token Budget<br/>(How to partition context?)"]

    US02 --> US10["US-10: Unified Dashboard<br/>(Track everything together)"]
    US03 --> US10
    US04 --> US10
    US05 --> US10
    US06 --> US10
    US07 --> US10
    US08 --> US10
    US09 --> US10

    style US01 fill:#ff9f43,stroke:#333,color:#000
    style US10 fill:#5f27cd,stroke:#333,color:#fff

Every decision in US-02 through US-09 feeds metrics into this dashboard. The dashboard enables the monthly review cycle that keeps all three optimization dimensions balanced. No team optimizes in isolation. Every tradeoff is visible, measured, and reversible.