# Optimization Tradeoffs User Stories - MangaAssist Chatbot

## The Core Problem: You Cannot Optimize Everything Simultaneously
MangaAssist operates under three competing optimization pressures. Improving any two necessarily degrades the third. Every engineering decision in this system is a negotiation between these forces.
```mermaid
graph TD
    subgraph "The Optimization Trilemma"
        COST["💰 Cost Optimization<br/>Reduce spend on compute,<br/>LLM tokens, storage, bandwidth"]
        PERF["⚡ Performance Optimization<br/>Minimize latency, maximize<br/>throughput, improve availability"]
        INF["🧠 Inference Optimization<br/>Maximize response quality,<br/>accuracy, grounding, safety"]
    end
    COST <-->|"Tension"| PERF
    PERF <-->|"Tension"| INF
    INF <-->|"Tension"| COST
    style COST fill:#f9d71c,stroke:#333,color:#000
    style PERF fill:#4ecdc4,stroke:#333,color:#000
    style INF fill:#ff6b6b,stroke:#333,color:#000
```
### Why These Three Cannot Be Optimized Together
| If you optimize... | ...you sacrifice... | Because... |
|---|---|---|
| Cost + Performance | Inference Quality | Cheaper/faster models produce worse answers; less RAG context means more hallucination |
| Cost + Inference Quality | Performance | Better models cost more per token but are slower; deeper RAG retrieval adds latency |
| Performance + Inference Quality | Cost | Fast, high-quality responses require provisioned throughput, larger models, more compute |
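One way to make the trilemma concrete is to score each candidate configuration with a single utility function that rewards quality and penalizes latency and cost. The sketch below is illustrative only: the weights, quality scores, latencies, and per-1K-request costs are hypothetical numbers, not measurements from this system.

```python
# Hypothetical decision-scoring sketch for the trilemma.
# All weights and metric values are illustrative assumptions.

def utility(quality: float, p95_latency_ms: float, cost_per_1k: float,
            w_quality: float = 1.0, w_latency: float = 0.002,
            w_cost: float = 0.05) -> float:
    """Score a configuration: reward quality, penalize latency and cost."""
    return w_quality * quality - w_latency * p95_latency_ms - w_cost * cost_per_1k

# Three example configurations, each optimizing two corners of the trilemma.
configs = {
    "cheap_fast":    utility(quality=0.70, p95_latency_ms=400,  cost_per_1k=1.0),
    "cheap_quality": utility(quality=0.90, p95_latency_ms=2500, cost_per_1k=2.0),
    "fast_quality":  utility(quality=0.90, p95_latency_ms=500,  cost_per_1k=8.0),
}
best = max(configs, key=configs.get)
```

The interesting part is not the winner but the weights: shifting `w_latency` or `w_cost` flips which corner of the trilemma wins, which is exactly the negotiation the teams below are having.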
## The Three Teams (Perspectives)
Each user story is written from the perspective of three competing stakeholders:
| Team | Role | Primary Metric | Fear |
|---|---|---|---|
| Cost Team | Finance / Platform Engineering | Monthly cloud spend ($) | Runaway costs that blow the budget |
| Performance Team | SRE / Frontend Engineering | p95 latency (ms), throughput (RPS) | Users abandoning a slow chatbot |
| Inference Team | ML Engineering / Data Science | Response quality score, hallucination rate | Users getting wrong answers, trust erosion |
## User Stories
| # | User Story | Primary Tension | Critical Decision |
|---|---|---|---|
| US-01 | The Optimization Trilemma — Decision Framework | All three | How to structure tradeoff decisions systematically |
| US-02 | LLM Model Tiering — Quality vs Cost vs Latency | Cost ↔ Inference ↔ Performance | When to use Sonnet vs Haiku vs templates |
| US-03 | Latency Budget Allocation Across the Pipeline | Performance ↔ Inference | How to split 2 seconds across pipeline stages |
| US-04 | Real-Time vs Pre-Computed Inference | Cost ↔ Performance | What to compute on demand vs pre-compute |
| US-05 | RAG Retrieval Depth vs Speed vs Cost | All three | How many chunks, whether to rerank, retrieval budget |
| US-06 | Cache Aggressiveness — Freshness vs Speed vs Cost | Cost ↔ Performance ↔ Inference | TTL strategy, what to cache, staleness tolerance |
| US-07 | Guardrail Strictness — Safety vs Latency vs UX | Performance ↔ Inference | How strict the safety pipeline should be |
| US-08 | Autoscaling Strategy — Cost vs Performance Headroom | Cost ↔ Performance | Scale-up aggressiveness, provisioned vs on-demand |
| US-09 | Token Budget Allocation — Context Window Partitioning | Cost ↔ Inference | How to partition tokens between history, RAG, and system prompt |
| US-10 | Unified Optimization Decision Dashboard | All three | Composite metrics and ongoing negotiation |
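US-03's question ("how to split 2 seconds across pipeline stages") can be expressed as an explicit budget table that the pipeline checks against. The stage names and millisecond figures below are assumptions for illustration, not the allocations the stories actually choose.

```python
# Illustrative latency-budget split for a 2-second p95 target (cf. US-03).
# Stage names and numbers are hypothetical, not measured values.

P95_BUDGET_MS = 2000

budget_ms = {
    "guardrails_input": 100,
    "cache_lookup": 20,
    "rag_retrieval": 380,
    "llm_first_token": 800,   # TTFT share of the budget
    "llm_completion": 600,
    "guardrails_output": 100,
}

assert sum(budget_ms.values()) == P95_BUDGET_MS, "stages must fit the budget"

def remaining(spent_ms: dict[str, int]) -> int:
    """Budget left after the stages that have already run."""
    return P95_BUDGET_MS - sum(spent_ms.values())
```

Making the split explicit forces the tradeoff into the open: every millisecond granted to retrieval or guardrails is taken from the model's token budget, and `remaining()` gives downstream stages a hard number to adapt to.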
## How to Read These Stories
```mermaid
graph LR
    A["US-01<br/>Framework"] --> B["US-02 to US-09<br/>Dimension-Specific<br/>Tradeoffs"]
    B --> C["US-10<br/>Unified Dashboard"]
    style A fill:#ff9f43,stroke:#333,color:#000
    style B fill:#54a0ff,stroke:#333,color:#000
    style C fill:#5f27cd,stroke:#333,color:#fff
```
- Start with US-01 to understand the decision framework and the trilemma model.
- Read US-02 through US-09 for specific tradeoff dimensions — each one is self-contained.
- End with US-10 to see how all tradeoffs roll up into a unified monitoring and decision system.
## How to Read the Evolution
These documents now preserve both the earlier architecture guidance and the current-state refresh:
- The original sections in each file describe the baseline or previous architecture and the reasoning that led to those earlier tradeoffs.
- The "2026 Update" section in each file shows the current architecture direction and explains what should change now based on newer platform capabilities and techniques.
- Read each document top to bottom if you want the transition path from the previous architecture to the current one.
- Treat the newer sections as an evolution layer, not a replacement of the original material. This keeps the migration path and historical decision context visible.
## 2026 Refresh Highlights
These stories were refreshed on March 25, 2026 against current AWS Bedrock, Anthropic, vLLM, KServe, and recent research on routing and adaptive RAG. The core tradeoff model still holds, and the original content has been intentionally preserved as the baseline architecture, but the frontier has moved in a few important ways:
- Routing has matured from static tiering to utility-aware routing and cascades. Teams should benchmark current model classes regularly and route by expected quality lift, latency impact, and uncertainty rather than hard-coding a single "cheap" and "smart" model forever.
- RAG depth is no longer just "pick a fixed top-K." Hybrid retrieval, contextual retrieval, adaptive retrieval, and corrective retrieval now outperform static chunk-count rules in many production systems.
- Caching is now multi-layer. The important distinction is not only Redis TTLs, but also prompt caching, prefix/KV caching, deterministic tool-result caches, and tightly scoped semantic caches.
- Latency work should separate time to first token (TTFT) from the rest of completion time. Managed stacks can use Bedrock prompt caching, latency-optimized inference, and cross-Region inference; self-hosted stacks can tune chunked prefill, disaggregated prefill, and request-queue-aware autoscaling.
- Guardrails are becoming layered control planes, not just blocklists. Deterministic validation, structured outputs, detect-and-repair flows, and policy validation such as automated reasoning checks are now more practical than "block everything synchronously."
- Dashboards need trace-level observability. Per-intent cost, TTFT, queue time, prefill/decode split, cache hit rates, intervention types, and experiment annotations matter more than a single blended average.
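The "utility-aware routing and cascades" point above can be sketched as a cheap-first cascade that escalates when the cheap tier's confidence is low. The model names, costs, and confidence interface here are hypothetical stand-ins; a production router would use calibrated uncertainty estimates and regularly re-benchmarked model classes.

```python
# Sketch of a utility-aware model cascade. Model names, per-call costs,
# and the (text, confidence) interface are illustrative assumptions.

from dataclasses import dataclass
from typing import Callable

@dataclass
class ModelTier:
    name: str
    cost_per_call: float
    answer: Callable[[str], tuple[str, float]]  # returns (text, confidence)

def cascade(query: str, tiers: list[ModelTier],
            confidence_floor: float = 0.8) -> tuple[str, str, float]:
    """Try cheap tiers first; escalate while confidence is below the floor."""
    spent = 0.0
    for tier in tiers:
        text, confidence = tier.answer(query)
        spent += tier.cost_per_call
        if confidence >= confidence_floor:
            return text, tier.name, spent
    # No tier cleared the floor: return the last (strongest) tier's answer.
    return text, tier.name, spent

# Stub models standing in for a cheap tier and a stronger tier.
cheap = ModelTier("haiku-class", 0.001, lambda q: ("short answer", 0.6))
smart = ModelTier("sonnet-class", 0.010, lambda q: ("grounded answer", 0.95))
text, model, cost = cascade("Who wrote One Piece?", [cheap, smart])
```

The tradeoff is visible in the return value: escalation buys quality but pays both tiers' costs and both tiers' latency, which is why the confidence floor itself becomes a tunable cost/quality knob.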
Representative references: AWS Bedrock prompt caching, AWS Bedrock cross-Region inference, AWS Bedrock latency-optimized inference, Anthropic contextual retrieval, vLLM optimization and tuning, KServe autoscaling with LLM metrics.
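The multi-layer caching highlight above can be reduced to a minimal layered lookup: check an exact-match layer with a staleness bound, then fall through to computation. The layer names and TTL are illustrative; a production system would back these layers with Redis, provider-side prompt caching, and a tightly scoped semantic cache.

```python
# Minimal multi-layer cache lookup sketch. Layer ordering, TTL, and the
# in-process dict backing store are illustrative assumptions.

import time
from typing import Callable

exact_cache: dict[str, tuple[str, float]] = {}  # normalized query -> (answer, stored_at)
EXACT_TTL_S = 300.0  # staleness tolerance for exact-match hits

def normalize(query: str) -> str:
    """Collapse whitespace and case so trivially different queries share a key."""
    return " ".join(query.lower().split())

def cached_answer(query: str, compute: Callable[[str], str]) -> tuple[str, str]:
    """Return (answer, layer) where layer names the level that served the hit."""
    key = normalize(query)
    hit = exact_cache.get(key)
    if hit is not None and time.time() - hit[1] < EXACT_TTL_S:
        return hit[0], "exact-cache"
    # A semantic-cache or prompt-cache lookup would slot in between these layers.
    answer = compute(query)
    exact_cache[key] = (answer, time.time())
    return answer, "computed"
```

Even this toy version exposes the three-way tension: a longer TTL cuts cost and latency but widens the window for stale (lower-quality) answers, which is precisely the knob US-06 negotiates.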
## Relationship to Other Documents
| Document | Relationship |
|---|---|
| Cost-Optimization-User-Stories/ | Pure cost optimization — these tradeoff stories explain what you sacrifice for those savings |
| Performance-Optimization-User-Stories/ | Pure performance optimization — these stories explain the cost and quality price |
| Model-Inference/03-tradeoffs-decisions.md | Inference-specific tradeoffs — these stories broaden to system-wide impact |
| 15-tradeoffs-challenges.md | High-level overview — these stories go deep with implementation detail |
| 04-architecture-hld.md | Architecture context for all components discussed |