US-01: The Optimization Trilemma — Decision Framework
User Story
As a senior engineering lead responsible for MangaAssist, I want to establish a systematic framework for making tradeoff decisions between cost, performance, and inference quality, So that every optimization choice is explicit, measurable, and reversible — and no team silently degrades another team's metrics.
The Problem: Three Teams, One System, Conflicting Goals
Three teams work on MangaAssist, each with a legitimate mandate that directly conflicts with the other two:
graph TD
subgraph "Cost Team Says"
C1["Use Haiku everywhere.<br/>Cache aggressively.<br/>Minimize LLM calls.<br/>Scale down during off-peak."]
end
subgraph "Performance Team Says"
P1["Provision everything.<br/>Pre-compute results.<br/>Over-provision capacity.<br/>Minimize hops in the pipeline."]
end
subgraph "Inference Team Says"
I1["Use Sonnet for everything.<br/>Retrieve 10 RAG chunks.<br/>Run full guardrail pipeline.<br/>Keep 20 turns of history."]
end
C1 -->|"conflicts with"| P1
P1 -->|"conflicts with"| I1
I1 -->|"conflicts with"| C1
style C1 fill:#f9d71c,stroke:#333,color:#000
style P1 fill:#4ecdc4,stroke:#333,color:#000
style I1 fill:#ff6b6b,stroke:#333,color:#000
The Fundamental Conflicts
graph LR
subgraph "Cost vs Performance"
CP1["Provisioned throughput<br>reduces latency"]
CP2["But provisioned throughput<br>costs money even when idle"]
end
subgraph "Performance vs Inference"
PI1["More RAG chunks<br>improve grounding"]
PI2["But more chunks<br>increase latency"]
end
subgraph "Inference vs Cost"
IC1["Sonnet produces<br>better responses"]
IC2["But Sonnet costs<br>15x more than Haiku"]
end
Acceptance Criteria
- Every optimization decision is documented with a tradeoff matrix showing impact on all three dimensions.
- A composite metric (Quality-Adjusted Cost-Performance Index) is defined and tracked.
- Reversal triggers are set for every decision — conditions under which the team revisits the choice.
- No team can ship a change that degrades another dimension by more than 10% without explicit sign-off.
- A monthly tradeoff review meeting uses the unified dashboard (US-10) to rebalance.
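To make the >10% rule concrete, here is a minimal sketch of how a proposal could be checked against baseline metrics before rollout; the metric names and the `ChangeProposal` structure are illustrative, not an existing MangaAssist API.

```python
# Minimal sketch of the "no silent degradation" acceptance criterion: a
# proposed change is compared against current baseline metrics, and any
# dimension that regresses by more than 10% requires sign-off from the
# affected team. Metric names and the ChangeProposal shape are illustrative.
from dataclasses import dataclass

DEGRADATION_THRESHOLD = 0.10  # 10% regression requires explicit sign-off

# For each dimension, True means "higher is better".
DIMENSIONS = {"quality_score": True, "throughput_rps": True,
              "cost_per_request": False, "p95_latency_s": False}

@dataclass
class ChangeProposal:
    baseline: dict   # current production metrics
    candidate: dict  # metrics measured for the proposed change

def dimensions_requiring_signoff(change: ChangeProposal) -> list[str]:
    """Return the dimensions that regress by more than the threshold."""
    flagged = []
    for metric, higher_is_better in DIMENSIONS.items():
        before, after = change.baseline[metric], change.candidate[metric]
        delta = (after - before) / before
        regression = -delta if higher_is_better else delta
        if regression > DEGRADATION_THRESHOLD:
            flagged.append(metric)
    return flagged
```

Applied to the governance example later in this section, a FAQ quality drop from 0.92 to 0.84 is roughly an 8.7% regression, so it clears the 10% bar without mandatory sign-off, though it still goes through the lead and gets a reversal trigger.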
The Decision Framework
Step-by-Step Process
graph TD
A["1. Identify the Tradeoff<br/>Which dimensions are in tension?"] --> B["2. Quantify Each Option<br/>Measure impact on cost, latency, quality"]
B --> C["3. Define Decision Metric<br/>A single composite number"]
C --> D["4. Run Experiment<br/>A/B test, shadow mode, offline benchmark"]
D --> E["5. Make the Call<br/>Choose the option; document rationale"]
E --> F["6. Set Reversal Trigger<br/>Under what conditions do we revisit?"]
F --> G["7. Monitor Continuously<br/>Track composite metric over time"]
G -->|"Trigger hit"| A
style A fill:#ff9f43,stroke:#333,color:#000
style D fill:#54a0ff,stroke:#333,color:#000
style F fill:#ee5a24,stroke:#333,color:#fff
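Steps 5 through 7 hinge on recording each decision together with its reversal trigger. A minimal sketch of such a decision record follows; field names are illustrative, and the QACPI numbers are placeholders, while the savings and satisfaction-threshold figures mirror the governance example later in this section.

```python
# Sketch of a decision record for steps 5-7 of the framework: every choice
# is logged with its rationale and a reversal trigger that is evaluated
# continuously against live metrics. Field names are illustrative.
from dataclasses import dataclass
from typing import Callable

@dataclass
class TradeoffDecision:
    name: str                                 # e.g. "Route FAQ intent to Haiku"
    rationale: str                            # why this option won the experiment
    qacpi_before: float
    qacpi_after: float
    reversal_trigger: Callable[[dict], bool]  # step 6: when to revisit
    owner: str                                # team accountable for monitoring

def check_reversal(decision: TradeoffDecision, live_metrics: dict) -> bool:
    """Step 7: evaluate the trigger against current metrics; True means revisit."""
    return decision.reversal_trigger(live_metrics)

# Example mirroring the governance flow later in this document: revert the
# FAQ -> Haiku routing if FAQ satisfaction drops below 3.5/5.
faq_to_haiku = TradeoffDecision(
    name="Route FAQ intent to Haiku",
    rationale="QACPI improves ~15% overall; saves ~$40K/mo",
    qacpi_before=120_000, qacpi_after=138_000,   # illustrative numbers
    reversal_trigger=lambda m: m["faq_satisfaction"] < 3.5,
    owner="Cost Team",
)
```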
The Composite Metric: QACPI
Quality-Adjusted Cost-Performance Index — a single number that captures the tradeoff:
$$\text{QACPI} = \frac{\text{Quality Score} \times \text{Throughput}}{\text{Cost Per Request} \times \text{P95 Latency}}$$
Where:
- Quality Score (0-1): weighted combination of accuracy, hallucination rate, and user satisfaction
- Throughput (requests/sec): sustained request handling capacity
- Cost Per Request ($): total infrastructure cost divided by request volume
- P95 Latency (seconds): 95th percentile end-to-end latency
A higher QACPI is better. The metric penalizes you for spending more, being slower, or producing worse responses.
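The formula translates directly into a one-line helper; the minimal sketch below reproduces the tiered-configuration row of the table that follows.

```python
# Direct transcription of the QACPI formula, assuming the four inputs are
# already measured. The example values reproduce the "Tiered (Haiku +
# Sonnet), cache, 3 chunks" row of the table below.
def qacpi(quality_score: float, throughput_rps: float,
          cost_per_request: float, p95_latency_s: float) -> float:
    """Quality-Adjusted Cost-Performance Index: higher is better."""
    return (quality_score * throughput_rps) / (cost_per_request * p95_latency_s)

# 0.85 * 1200 / (0.006 * 0.9) ≈ 188,889
print(round(qacpi(0.85, 1_200, 0.006, 0.9)))  # -> 188889
```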
Example QACPI Calculations
| Configuration | Quality (0-1) | Throughput (req/s) | Cost/Req | P95 Latency | QACPI |
|---|---|---|---|---|---|
| All-Sonnet, no cache, full RAG | 0.92 | 500 | $0.018 | 2.1s | 12,169 |
| Tiered (Haiku + Sonnet), cache, 3 chunks | 0.85 | 1,200 | $0.006 | 0.9s | 188,889 |
| All-template (no LLM) | 0.40 | 10,000 | $0.0002 | 0.05s | 400,000,000 |
| Tiered + aggressive cache + reranking | 0.88 | 1,000 | $0.007 | 1.1s | 114,286 |
The all-template approach has the highest QACPI but only handles 30% of queries. The tiered + cache approach is the realistic optimum for the full traffic mix.
Tradeoff Zones: Where Each Team "Wins"
graph TD
subgraph "Zone 1: Cost Wins"
Z1["Off-peak hours<br/>Low-complexity queries<br/>Repeated/cacheable queries<br/>Guest users (lower SLA)"]
end
subgraph "Zone 2: Performance Wins"
Z2["Peak traffic hours<br/>First message in session<br/>Streaming-sensitive flows<br/>Cart/checkout context"]
end
subgraph "Zone 3: Inference Wins"
Z3["Recommendation queries<br/>Complex multi-turn reasoning<br/>High-value authenticated users<br/>Post-escalation follow-ups"]
end
subgraph "Negotiation Zone"
NZ["Most traffic lives here.<br/>Every request needs a<br/>balanced compromise."]
end
Z1 --> NZ
Z2 --> NZ
Z3 --> NZ
style Z1 fill:#f9d71c,stroke:#333,color:#000
style Z2 fill:#4ecdc4,stroke:#333,color:#000
style Z3 fill:#ff6b6b,stroke:#333,color:#000
style NZ fill:#dfe6e9,stroke:#333,color:#000
Traffic Distribution Across Zones
pie title "Request Volume by Optimization Zone"
"Cost Zone (templates, cache hits)" : 35
"Performance Zone (fast path, streaming)" : 20
"Inference Zone (Sonnet, full RAG)" : 15
"Negotiation Zone (tiered, balanced)" : 30
The Master Tradeoff Matrix
Every decision in US-02 through US-09 maps to cells in this matrix:
| Decision Area | Cost Impact | Performance Impact | Inference Impact | Primary Tension |
|---|---|---|---|---|
| Model selection (US-02) | Haiku is 15x cheaper | Haiku is 3x faster | Sonnet is measurably better | Cost ↔ Inference |
| Latency budget (US-03) | Tighter budget → cheaper | Directly controls UX | Less time → less quality | Performance ↔ Inference |
| Pre-computation (US-04) | Batch is cheaper per unit | Pre-computed is faster | May be stale | Cost ↔ Performance |
| RAG depth (US-05) | More chunks → more tokens → more cost | More chunks → slower | More chunks → better grounding | All three |
| Caching (US-06) | Cache hits are nearly free | Cache hits are fast | Cached answers may be stale | Cost+Perf ↔ Inference |
| Guardrails (US-07) | More checks → more compute | More checks → more latency | More checks → safer | Performance ↔ Inference |
| Autoscaling (US-08) | Over-provision → waste | Under-provision → latency spikes | Latency spikes → timeouts → failures | Cost ↔ Performance |
| Token budget (US-09) | More tokens → more LLM cost | More tokens → slower prefill | More context → better answers | All three |
Decision Governance Model
sequenceDiagram
participant CostTeam as Cost Team
participant PerfTeam as Performance Team
participant InfTeam as Inference Team
participant Lead as Engineering Lead
participant Dashboard as QACPI Dashboard
CostTeam->>Lead: Propose: Switch FAQ to Haiku (saves $40K/mo)
Lead->>InfTeam: Impact assessment on quality?
InfTeam->>Lead: FAQ quality drops 0.92 → 0.84 for complex FAQs
Lead->>PerfTeam: Impact on latency?
PerfTeam->>Lead: FAQ latency improves 1.2s → 0.4s
Lead->>Dashboard: Simulate QACPI change
Dashboard->>Lead: QACPI improves 15% overall
Lead->>CostTeam: Approved with reversal trigger
Lead->>InfTeam: Set alert: if FAQ satisfaction < 3.5/5, revert
Note over Lead: Decision logged with rationale + reversal trigger
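The "Simulate QACPI change" step can be approximated offline. In the sketch below, the quality and latency deltas come from the sequence above; the FAQ throughput and cost-per-request figures are hypothetical placeholders, and the real dashboard would weight the FAQ segment against the rest of the traffic mix to arrive at the overall ~15% figure.

```python
# Rough sketch of the dashboard's "Simulate QACPI change" step for the FAQ
# segment only. Quality and latency deltas come from the sequence above
# (0.92 -> 0.84, 1.2s -> 0.4s); throughput and cost-per-request values are
# hypothetical placeholders.
def qacpi(quality, throughput_rps, cost_per_request, p95_latency_s):
    return (quality * throughput_rps) / (cost_per_request * p95_latency_s)

faq_before = qacpi(0.92, 300, 0.004, 1.2)    # Sonnet-backed FAQ (illustrative)
faq_after  = qacpi(0.84, 300, 0.0015, 0.4)   # Haiku-backed FAQ (illustrative)
print(f"FAQ-segment QACPI: {faq_before:,.0f} -> {faq_after:,.0f}")
```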
Guardrails for the Decision Process
| Rule | Rationale |
|---|---|
| No silent degradation | A change in one dimension that worsens another by >10% requires sign-off from the affected team |
| Every decision has a reversal trigger | Prevents "set and forget"; forces re-evaluation when assumptions change |
| Experiment before commit | A/B test or shadow mode before production rollout for any decision affecting >20% of traffic |
| Monthly tradeoff review | Review QACPI trends, revisit decisions whose triggers are close to firing |
| Intent-level granularity | Tradeoff decisions can vary by intent — one size does not fit all |
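The intent-level granularity rule is easiest to enforce when profiles are explicit configuration rather than constants scattered through the pipeline. A minimal sketch follows; all values are illustrative, not recommendations.

```python
# Sketch of per-intent optimization profiles, following the "intent-level
# granularity" rule above: each intent gets its own model tier, RAG depth,
# cache TTL, and latency budget instead of one global configuration.
from dataclasses import dataclass

@dataclass(frozen=True)
class IntentProfile:
    model_tier: str         # "template", "haiku", or "sonnet"
    rag_chunks: int         # retrieval depth (US-05)
    cache_ttl_s: int        # 0 disables caching (US-06)
    latency_budget_ms: int  # per-intent share of the end-to-end budget (US-03)

PROFILES = {
    "simple_faq":     IntentProfile("template", 0,  86_400,   300),
    "order_status":   IntentProfile("haiku",    2,       0,   800),
    "recommendation": IntentProfile("sonnet",   5,   3_600, 2_000),
}

def profile_for(intent: str) -> IntentProfile:
    # Fall back to a balanced "negotiation zone" default for unknown intents.
    return PROFILES.get(intent, IntentProfile("haiku", 3, 600, 1_200))
```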
Practical Example: Recommendation Query Lifecycle
A single recommendation query touches every tradeoff dimension:
graph TD
A["User: 'Something like One Piece?'"] --> B["Intent Classifier<br/>(Cost: cheap, Perf: fast, Inf: good enough)"]
B --> C["Cache Check<br/>(Cost: free if hit, Perf: fast, Inf: may be stale)"]
C -->|miss| D["Recommendation Engine<br/>(Cost: moderate, Perf: variable, Inf: personalized)"]
D --> E["RAG Retrieval<br/>(Cost: per-chunk, Perf: adds 100-200ms, Inf: grounds response)"]
E --> F["Model Selection<br/>(Cost: Haiku=$0.001, Perf: Haiku=fast, Inf: Sonnet=better)"]
F --> G["Prompt Assembly<br/>(Cost: every token costs, Perf: more tokens=slower, Inf: more context=better)"]
G --> H["Guardrails<br/>(Cost: validation compute, Perf: adds 50-100ms, Inf: ensures safety)"]
H --> I["Response Delivered"]
style B fill:#f9d71c,stroke:#333,color:#000
style C fill:#4ecdc4,stroke:#333,color:#000
style F fill:#ff6b6b,stroke:#333,color:#000
style G fill:#dfe6e9,stroke:#333,color:#000
At every step, a tradeoff decision has been made. The user stories that follow (US-02 through US-10) deep-dive into each one.
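One way to keep these per-stage tradeoffs visible is to track how each stage consumes the end-to-end latency budget. The sketch below uses the rough figures from the diagram for RAG retrieval and guardrails; the other stage timings and the 2s budget are hypothetical.

```python
# Illustrative walk through the lifecycle above, tracking how each stage
# consumes the end-to-end latency budget. RAG (100-200ms) and guardrails
# (50-100ms) follow the diagram; the remaining timings are placeholders.
STAGES_MS = {
    "intent_classifier": 30,
    "cache_check": 5,
    "recommendation_engine": 250,
    "rag_retrieval": 150,
    "model_generation": 900,   # dominant term; depends on Haiku vs Sonnet
    "guardrails": 75,
}

BUDGET_MS = 2_000  # hypothetical end-to-end target for this intent

spent = 0
for stage, ms in STAGES_MS.items():
    spent += ms
    print(f"{stage:22s} +{ms:4d}ms  cumulative {spent:5d}ms / {BUDGET_MS}ms")
print("within budget" if spent <= BUDGET_MS else "over budget")
```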
Anti-Patterns to Avoid
| Anti-Pattern | Why It's Dangerous | What to Do Instead |
|---|---|---|
| "Optimize cost, worry about quality later" | Quality debt compounds — users leave before you fix it | Set minimum quality floors before optimizing cost |
| "Use the best model everywhere" | Budget blowout at scale; 70% of queries don't need it | Tier models by query complexity |
| "Cache everything aggressively" | Stale recommendations destroy trust | Cache only data with defined staleness tolerance |
| "Provision for peak at all times" | 70% of capacity sits idle during off-peak | Use autoscaling with predictive ramp-up |
| "Let each team optimize independently" | Local optima create global pessimum | Use QACPI and cross-team governance |
| "One configuration for all intents" | Simple FAQ and complex recommendation have different needs | Per-intent optimization profiles |
2026 Update: Govern the Trilemma as Constrained Optimization
Everything above this section is the baseline architecture and original decision model; this update keeps that baseline intact and describes how the framework evolves toward the current target architecture.
The current framework still works, but production teams are increasingly treating this as constrained multi-objective optimization, not just a single scalar-score contest.
- Keep QACPI as the hero metric, but pair it with hard constraints such as zero PII leaks, zero incorrect prices, latency SLOs, and per-intent quality floors. This prevents "good composite score, bad real-world policy outcome" decisions.
- Add a Pareto frontier view to monthly reviews. Leadership should compare only configurations that satisfy hard constraints, then choose among the remaining cost-latency-quality tradeoffs.
- Run shadow traffic and trace replay before live rollout for material changes. A/B tests remain important, but replaying recent production traces is now a standard way to estimate counterfactual cost and latency before exposure.
- Calibrate quality with a hybrid eval stack: task-specific offline evals, human-reviewed golden sets, LLM-as-judge, and production outcomes such as escalations and CSAT.
- Express policies in terms of capability tiers and route classes, not just specific model names. Bedrock model catalogs, inference profiles, and pricing envelopes change often enough that quarterly re-benchmarking should be part of governance.
Recent references: Anthropic success criteria, Anthropic test design, AWS Bedrock model invocation logging, CloudWatch generative AI observability, Cascade routing for LLMs.
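A minimal sketch of the constrained, Pareto-style selection described above: infeasible configurations are dropped first, then dominated ones. The specific constraint thresholds and field names are examples, not policy.

```python
# Constrained multi-objective selection: drop any configuration that violates
# a hard constraint, then keep only configurations that are not dominated on
# (cost, latency, quality). Thresholds and config fields are illustrative.
def satisfies_hard_constraints(cfg: dict) -> bool:
    return (cfg["pii_leaks"] == 0
            and cfg["incorrect_prices"] == 0
            and cfg["p95_latency_s"] <= 2.0      # example latency SLO
            and cfg["quality"] >= 0.80)          # example per-intent quality floor

def dominates(a: dict, b: dict) -> bool:
    """a dominates b if it is no worse on every objective and better on one."""
    objectives = [("cost_per_req", 1), ("p95_latency_s", 1), ("quality", -1)]
    no_worse = all(sign * a[k] <= sign * b[k] for k, sign in objectives)
    better = any(sign * a[k] < sign * b[k] for k, sign in objectives)
    return no_worse and better

def pareto_frontier(configs: list[dict]) -> list[dict]:
    feasible = [c for c in configs if satisfies_hard_constraints(c)]
    return [c for c in feasible
            if not any(dominates(other, c) for other in feasible if other is not c)]
```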
Next Steps
This framework applies to every subsequent user story:
- US-02: How to decide which LLM model handles which query
- US-03: How to split the latency budget across pipeline stages
- US-04: What to pre-compute vs compute on-the-fly
- US-05: How deep the RAG retrieval should go
- US-06: How aggressively to cache
- US-07: How strict the guardrails should be
- US-08: How much headroom to provision
- US-09: How to partition the token budget
- US-10: How to track it all in one place