
US-01: The Optimization Trilemma — Decision Framework

User Story

As a senior engineering lead responsible for MangaAssist, I want to establish a systematic framework for making tradeoff decisions between cost, performance, and inference quality, So that every optimization choice is explicit, measurable, and reversible — and no team silently degrades another team's metrics.

The Problem: Three Teams, One System, Conflicting Goals

Three teams work on MangaAssist, each with a legitimate mandate that directly conflicts with the other two:

graph TD
    subgraph "Cost Team Says"
        C1["Use Haiku everywhere.<br/>Cache aggressively.<br/>Minimize LLM calls.<br/>Scale down during off-peak."]
    end

    subgraph "Performance Team Says"
        P1["Provision everything.<br/>Pre-compute results.<br/>Over-provision capacity.<br/>Minimize hops in the pipeline."]
    end

    subgraph "Inference Team Says"
        I1["Use Sonnet for everything.<br/>Retrieve 10 RAG chunks.<br/>Run full guardrail pipeline.<br/>Keep 20 turns of history."]
    end

    C1 -->|"conflicts with"| P1
    P1 -->|"conflicts with"| I1
    I1 -->|"conflicts with"| C1

    style C1 fill:#f9d71c,stroke:#333,color:#000
    style P1 fill:#4ecdc4,stroke:#333,color:#000
    style I1 fill:#ff6b6b,stroke:#333,color:#000

The Fundamental Conflicts

graph LR
    subgraph "Cost vs Performance"
        CP1["Provisioned throughput<br>reduces latency"]
        CP2["But provisioned throughput<br>costs money even when idle"]
    end

    subgraph "Performance vs Inference"
        PI1["More RAG chunks<br>improve grounding"]
        PI2["But more chunks<br>increase latency"]
    end

    subgraph "Inference vs Cost"
        IC1["Sonnet produces<br>better responses"]
        IC2["But Sonnet costs<br>15x more than Haiku"]
    end

Acceptance Criteria

  • Every optimization decision is documented with a tradeoff matrix showing impact on all three dimensions.
  • A composite metric (Quality-Adjusted Cost-Performance Index) is defined and tracked.
  • Reversal triggers are set for every decision — conditions under which the team revisits the choice.
  • No team can ship a change that degrades another dimension by more than 10% without explicit sign-off.
  • A monthly tradeoff review meeting uses the unified dashboard (US-10) to rebalance.

The Decision Framework

Step-by-Step Process

graph TD
    A["1. Identify the Tradeoff<br/>Which dimensions are in tension?"] --> B["2. Quantify Each Option<br/>Measure impact on cost, latency, quality"]
    B --> C["3. Define Decision Metric<br/>A single composite number"]
    C --> D["4. Run Experiment<br/>A/B test, shadow mode, offline benchmark"]
    D --> E["5. Make the Call<br/>Choose the option; document rationale"]
    E --> F["6. Set Reversal Trigger<br/>Under what conditions do we revisit?"]
    F --> G["7. Monitor Continuously<br/>Track composite metric over time"]
    G -->|"Trigger hit"| A

    style A fill:#ff9f43,stroke:#333,color:#000
    style D fill:#54a0ff,stroke:#333,color:#000
    style F fill:#ee5a24,stroke:#333,color:#fff
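
Steps 5-7 are easiest to enforce when each decision is captured as a structured record rather than a wiki note. Below is a minimal Python sketch of such a record; the class and field names (TradeoffDecision, ReversalTrigger) and the example impact numbers are illustrative, not part of the framework itself.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Callable

@dataclass
class ReversalTrigger:
    """Condition under which the decision must be revisited (step 6)."""
    description: str                    # e.g. "FAQ satisfaction < 3.5/5"
    is_fired: Callable[[dict], bool]    # evaluated against current metrics (step 7)

@dataclass
class TradeoffDecision:
    """One documented tradeoff decision (step 5)."""
    title: str
    rationale: str
    impact: dict                        # measured deltas on cost, latency, quality
    triggers: list[ReversalTrigger] = field(default_factory=list)
    decided_on: date = field(default_factory=date.today)

    def needs_revisit(self, current_metrics: dict) -> bool:
        """The monitoring loop (step 7) calls this; a fired trigger sends us back to step 1."""
        return any(t.is_fired(current_metrics) for t in self.triggers)

# Hypothetical example, using the FAQ-to-Haiku decision from the governance section below.
faq_decision = TradeoffDecision(
    title="Route FAQ intent to Haiku",
    rationale="Saves ~$40K/mo; quality stays above the agreed floor",
    impact={"cost_delta_pct": -65, "p95_latency_delta_pct": -67, "quality_delta_pct": -8.7},
    triggers=[ReversalTrigger(
        description="FAQ satisfaction < 3.5/5",
        is_fired=lambda m: m.get("faq_satisfaction", 5.0) < 3.5,
    )],
)

print(faq_decision.needs_revisit({"faq_satisfaction": 3.2}))  # True -> back to step 1
```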

The Composite Metric: QACPI

Quality-Adjusted Cost-Performance Index — a single number that captures the tradeoff:

$$QACPI = \frac{\text{Quality Score} \times \text{Throughput}}{\text{Cost per Request} \times \text{P95 Latency}}$$

Where:

  • Quality Score (0-1): weighted combination of accuracy, hallucination rate, and user satisfaction
  • Throughput (requests/sec): sustained request handling capacity
  • Cost per Request ($): total infrastructure cost divided by request volume
  • P95 Latency (seconds): 95th percentile end-to-end latency

A higher QACPI is better. The metric penalizes you for spending more, being slower, or producing worse responses.

Example QACPI Calculations

| Configuration | Quality | Throughput (req/s) | Cost/Req | P95 Latency | QACPI |
|---|---|---|---|---|---|
| All-Sonnet, no cache, full RAG | 0.92 | 500 | $0.018 | 2.1s | 12,169 |
| Tiered (Haiku + Sonnet), cache, 3 chunks | 0.85 | 1,200 | $0.006 | 0.9s | 188,889 |
| All-template (no LLM) | 0.40 | 10,000 | $0.0002 | 0.05s | 400,000,000 |
| Tiered + aggressive cache + reranking | 0.88 | 1,000 | $0.007 | 1.1s | 114,286 |

The all-template approach has the highest QACPI but only handles 30% of queries. The tiered + cache approach is the realistic optimum for the full traffic mix.
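
The table rows follow directly from the formula. A minimal sketch, assuming the values above, that reproduces the QACPI column:

```python
def qacpi(quality: float, throughput_rps: float,
          cost_per_request: float, p95_latency_s: float) -> float:
    """Quality-Adjusted Cost-Performance Index: higher is better."""
    return (quality * throughput_rps) / (cost_per_request * p95_latency_s)

configs = {
    "All-Sonnet, no cache, full RAG":           (0.92,    500, 0.018,  2.1),
    "Tiered (Haiku + Sonnet), cache, 3 chunks": (0.85,  1_200, 0.006,  0.9),
    "All-template (no LLM)":                    (0.40, 10_000, 0.0002, 0.05),
    "Tiered + aggressive cache + reranking":    (0.88,  1_000, 0.007,  1.1),
}

for name, params in configs.items():
    print(f"{name}: QACPI = {qacpi(*params):,.0f}")
```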


Tradeoff Zones: Where Each Team "Wins"

graph TD
    subgraph "Zone 1: Cost Wins"
        Z1["Off-peak hours<br/>Low-complexity queries<br/>Repeated/cacheable queries<br/>Guest users (lower SLA)"]
    end

    subgraph "Zone 2: Performance Wins"
        Z2["Peak traffic hours<br/>First message in session<br/>Streaming-sensitive flows<br/>Cart/checkout context"]
    end

    subgraph "Zone 3: Inference Wins"
        Z3["Recommendation queries<br/>Complex multi-turn reasoning<br/>High-value authenticated users<br/>Post-escalation follow-ups"]
    end

    subgraph "Negotiation Zone"
        NZ["Most traffic lives here.<br/>Every request needs a<br/>balanced compromise."]
    end

    Z1 --> NZ
    Z2 --> NZ
    Z3 --> NZ

    style Z1 fill:#f9d71c,stroke:#333,color:#000
    style Z2 fill:#4ecdc4,stroke:#333,color:#000
    style Z3 fill:#ff6b6b,stroke:#333,color:#000
    style NZ fill:#dfe6e9,stroke:#333,color:#000

Traffic Distribution Across Zones

pie title "Request Volume by Optimization Zone"
    "Cost Zone (templates, cache hits)" : 35
    "Performance Zone (fast path, streaming)" : 20
    "Inference Zone (Sonnet, full RAG)" : 15
    "Negotiation Zone (tiered, balanced)" : 30

The Master Tradeoff Matrix

Every decision in US-02 through US-09 maps to cells in this matrix:

| Decision Area | Cost Impact | Performance Impact | Inference Impact | Primary Tension |
|---|---|---|---|---|
| Model selection (US-02) | Haiku is 15x cheaper | Haiku is 3x faster | Sonnet is measurably better | Cost ↔ Inference |
| Latency budget (US-03) | Tighter budget → cheaper | Directly controls UX | Less time → less quality | Performance ↔ Inference |
| Pre-computation (US-04) | Batch is cheaper per unit | Pre-computed is faster | May be stale | Cost ↔ Performance |
| RAG depth (US-05) | More chunks → more tokens → more cost | More chunks → slower | More chunks → better grounding | All three |
| Caching (US-06) | Cache hits are nearly free | Cache hits are fast | Cached answers may be stale | Cost+Perf ↔ Inference |
| Guardrails (US-07) | More checks → more compute | More checks → more latency | More checks → safer | Performance ↔ Inference |
| Autoscaling (US-08) | Over-provision → waste | Under-provision → latency spikes | Latency spikes → timeouts → failures | Cost ↔ Performance |
| Token budget (US-09) | More tokens → more LLM cost | More tokens → slower prefill | More context → better answers | All three |

Decision Governance Model

sequenceDiagram
    participant CostTeam as Cost Team
    participant PerfTeam as Performance Team
    participant InfTeam as Inference Team
    participant Lead as Engineering Lead
    participant Dashboard as QACPI Dashboard

    CostTeam->>Lead: Propose: Switch FAQ to Haiku (saves $40K/mo)
    Lead->>InfTeam: Impact assessment on quality?
    InfTeam->>Lead: FAQ quality drops 0.92 → 0.84 for complex FAQs
    Lead->>PerfTeam: Impact on latency?
    PerfTeam->>Lead: FAQ latency improves 1.2s → 0.4s
    Lead->>Dashboard: Simulate QACPI change
    Dashboard->>Lead: QACPI improves 15% overall
    Lead->>CostTeam: Approved with reversal trigger
    Lead->>InfTeam: Set alert: if FAQ satisfaction < 3.5/5, revert
    Note over Lead: Decision logged with rationale + reversal trigger
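
A minimal sketch of the check the engineering lead runs before approving. Metric names and per-request costs are illustrative (the sequence above only states monthly savings); the 10% threshold comes from the acceptance criteria.

```python
DEGRADATION_THRESHOLD = 0.10  # >10% degradation requires sign-off from the affected team

def review_proposal(baseline: dict, proposed: dict) -> dict:
    """Compare a proposed configuration against the baseline on all three dimensions."""
    # Sign convention: positive delta means "better" on every axis.
    deltas = {
        "quality": (proposed["quality"] - baseline["quality"]) / baseline["quality"],
        "latency": (baseline["p95_latency"] - proposed["p95_latency"]) / baseline["p95_latency"],
        "cost":    (baseline["cost_per_req"] - proposed["cost_per_req"]) / baseline["cost_per_req"],
    }
    needs_signoff = [dim for dim, d in deltas.items() if d < -DEGRADATION_THRESHOLD]
    return {"deltas": deltas, "sign_off_required_from": needs_signoff}

# Hypothetical numbers for the FAQ-to-Haiku proposal in the sequence diagram:
baseline = {"quality": 0.92, "p95_latency": 1.2, "cost_per_req": 0.012}
proposed = {"quality": 0.84, "p95_latency": 0.4, "cost_per_req": 0.002}
print(review_proposal(baseline, proposed))
# Quality drops ~8.7%: under the 10% bar, so approved, but still logged with a reversal trigger.
```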

Guardrails for the Decision Process

| Rule | Rationale |
|---|---|
| No silent degradation | A change in one dimension that worsens another by >10% requires sign-off from the affected team |
| Every decision has a reversal trigger | Prevents "set and forget"; forces re-evaluation when assumptions change |
| Experiment before commit | A/B test or shadow mode before production rollout for any decision affecting >20% of traffic |
| Monthly tradeoff review | Review QACPI trends, revisit decisions whose triggers are close to firing |
| Intent-level granularity | Tradeoff decisions can vary by intent — one size does not fit all |

Practical Example: Recommendation Query Lifecycle

A single recommendation query touches every tradeoff dimension:

graph TD
    A["User: 'Something like One Piece?'"] --> B["Intent Classifier<br/>(Cost: cheap, Perf: fast, Inf: good enough)"]
    B --> C["Cache Check<br/>(Cost: free if hit, Perf: fast, Inf: may be stale)"]
    C -->|miss| D["Recommendation Engine<br/>(Cost: moderate, Perf: variable, Inf: personalized)"]
    D --> E["RAG Retrieval<br/>(Cost: per-chunk, Perf: adds 100-200ms, Inf: grounds response)"]
    E --> F["Model Selection<br/>(Cost: Haiku=$0.001, Perf: Haiku=fast, Inf: Sonnet=better)"]
    F --> G["Prompt Assembly<br/>(Cost: every token costs, Perf: more tokens=slower, Inf: more context=better)"]
    G --> H["Guardrails<br/>(Cost: validation compute, Perf: adds 50-100ms, Inf: ensures safety)"]
    H --> I["Response Delivered"]

    style B fill:#f9d71c,stroke:#333,color:#000
    style C fill:#4ecdc4,stroke:#333,color:#000
    style F fill:#ff6b6b,stroke:#333,color:#000
    style G fill:#dfe6e9,stroke:#333,color:#000

At every step, a tradeoff decision has been made. The user stories that follow (US-02 through US-10) deep-dive into each one.
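
To make those per-step tradeoffs concrete, here is a minimal sketch of a per-stage ledger for this query. Every number is a hypothetical placeholder, not a measured value.

```python
# Illustrative per-stage tradeoff ledger for the recommendation query above.
PIPELINE = [
    # (stage,             est. cost $, est. latency ms, tradeoff note)
    ("intent_classifier", 0.0001,  30, "route cheaply before spending tokens"),
    ("cache_check",       0.0000,   5, "free on a hit, but answers may be stale"),
    ("rag_retrieval",     0.0005, 150, "grounding; cost and latency scale per chunk"),
    ("llm_generation",    0.0040, 600, "Haiku vs Sonnet is the dominant cost lever"),
    ("guardrails",        0.0002,  75, "safety checks add compute and latency"),
]

total_cost = sum(cost for _, cost, _, _ in PIPELINE)
total_latency_ms = sum(ms for _, _, ms, _ in PIPELINE)
print(f"estimated cost/request: ${total_cost:.4f}, estimated latency: {total_latency_ms} ms")
```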


Anti-Patterns to Avoid

| Anti-Pattern | Why It's Dangerous | What to Do Instead |
|---|---|---|
| "Optimize cost, worry about quality later" | Quality debt compounds — users leave before you fix it | Set minimum quality floors before optimizing cost |
| "Use the best model everywhere" | Budget blowout at scale; 70% of queries don't need it | Tier models by query complexity |
| "Cache everything aggressively" | Stale recommendations destroy trust | Cache only data with defined staleness tolerance |
| "Provision for peak at all times" | 70% of capacity sits idle during off-peak | Use autoscaling with predictive ramp-up |
| "Let each team optimize independently" | Local optima create global pessimum | Use QACPI and cross-team governance |
| "One configuration for all intents" | Simple FAQ and complex recommendation have different needs | Per-intent optimization profiles |

2026 Update: Govern the Trilemma as Constrained Optimization

Treat everything above this section as the baseline architecture and original decision model. This section preserves that baseline and explains how the framework evolves for the current target architecture.

The current framework still works, but production teams are increasingly treating this as constrained multi-objective optimization, not just a single scalar-score contest.

  • Keep QACPI as the hero metric, but pair it with hard constraints such as zero PII leaks, zero incorrect prices, latency SLOs, and per-intent quality floors. This prevents "good composite score, bad real-world policy outcome" decisions.
  • Add a Pareto frontier view to monthly reviews. Leadership should compare only configurations that satisfy hard constraints, then choose among the remaining cost-latency-quality tradeoffs.
  • Run shadow traffic and trace replay before live rollout for material changes. A/B tests remain important, but replaying recent production traces is now a standard way to estimate counterfactual cost and latency before exposure.
  • Calibrate quality with a hybrid eval stack: task-specific offline evals, human-reviewed golden sets, LLM-as-judge, and production outcomes such as escalations and CSAT.
  • Express policies in terms of capability tiers and route classes, not just specific model names. Bedrock model catalogs, inference profiles, and pricing envelopes change often enough that quarterly re-benchmarking should be part of governance.
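
A minimal sketch of the constrained view described above: drop configurations that violate hard constraints, then compare only the Pareto-optimal survivors on cost, latency, and quality. Constraint thresholds and field names are illustrative.

```python
def satisfies_hard_constraints(cfg: dict) -> bool:
    """Hard constraints are non-negotiable; thresholds here are illustrative only."""
    return (cfg["pii_leaks"] == 0
            and cfg["incorrect_prices"] == 0
            and cfg["p95_latency"] <= 2.0        # latency SLO in seconds
            and cfg["quality"] >= 0.80)          # per-intent quality floor

def dominates(a: dict, b: dict) -> bool:
    """a dominates b if it is no worse on every axis and strictly better on at least one."""
    axes = [("cost_per_req", min), ("p95_latency", min), ("quality", max)]
    no_worse = all((a[k] <= b[k]) if goal is min else (a[k] >= b[k]) for k, goal in axes)
    strictly_better = any((a[k] < b[k]) if goal is min else (a[k] > b[k]) for k, goal in axes)
    return no_worse and strictly_better

def pareto_frontier(configs: list[dict]) -> list[dict]:
    """Keep only feasible, non-dominated configurations for the monthly review."""
    feasible = [c for c in configs if satisfies_hard_constraints(c)]
    return [c for c in feasible if not any(dominates(o, c) for o in feasible if o is not c)]
```

QACPI then acts as the tie-breaker among the frontier points rather than as the sole objective.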

Recent references: Anthropic success criteria, Anthropic test design, AWS Bedrock model invocation logging, CloudWatch generative AI observability, Cascade routing for LLMs.

Next Steps

This framework applies to every subsequent user story:

  • US-02: How to decide which LLM model handles which query
  • US-03: How to split the latency budget across pipeline stages
  • US-04: What to pre-compute vs compute on-the-fly
  • US-05: How deep the RAG retrieval should go
  • US-06: How aggressively to cache
  • US-07: How strict the guardrails should be
  • US-08: How much headroom to provision
  • US-09: How to partition the token budget
  • US-10: How to track it all in one place