
# Optimization Tradeoffs User Stories - MangaAssist Chatbot

## The Core Problem: You Cannot Optimize Everything Simultaneously

MangaAssist operates under three competing optimization pressures. Improving any two necessarily degrades the third. Every engineering decision in this system is a negotiation between these forces.

```mermaid
graph TD
    subgraph "The Optimization Trilemma"
        COST["💰 Cost Optimization<br/>Reduce spend on compute,<br/>LLM tokens, storage, bandwidth"]
        PERF["⚡ Performance Optimization<br/>Minimize latency, maximize<br/>throughput, improve availability"]
        INF["🧠 Inference Optimization<br/>Maximize response quality,<br/>accuracy, grounding, safety"]
    end

    COST <-->|"Tension"| PERF
    PERF <-->|"Tension"| INF
    INF <-->|"Tension"| COST

    style COST fill:#f9d71c,stroke:#333,color:#000
    style PERF fill:#4ecdc4,stroke:#333,color:#000
    style INF fill:#ff6b6b,stroke:#333,color:#000
```

## Why These Three Cannot Be Optimized Together

| If you optimize... | ...you sacrifice... | Because... |
|---|---|---|
| Cost + Performance | Inference Quality | Cheaper/faster models produce worse answers; less RAG context means more hallucination |
| Cost + Inference Quality | Performance | Better models cost more per token but are slower; deeper RAG retrieval adds latency |
| Performance + Inference Quality | Cost | Fast, high-quality responses require provisioned throughput, larger models, more compute |
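
US-01 formalizes this as a decision framework. As a rough illustration of what a systematic, weighted tradeoff decision can look like in code, the sketch below scores candidate configurations against all three dimensions at once. It is a minimal, hypothetical example: the class names, budgets, and numbers are assumptions for this page, not MangaAssist's actual scoring logic.

```python
# Hypothetical tradeoff-scoring sketch; names, budgets, and numbers are illustrative.
from dataclasses import dataclass

@dataclass
class CandidateConfig:
    name: str
    monthly_cost_usd: float    # cost dimension
    p95_latency_ms: float      # performance dimension
    quality_score: float       # inference dimension, 0.0-1.0 from offline evals

@dataclass
class TradeoffWeights:
    cost: float = 1.0
    latency: float = 1.0
    quality: float = 1.0

def score(cfg: CandidateConfig, w: TradeoffWeights,
          cost_budget_usd: float = 5000.0, latency_budget_ms: float = 2000.0) -> float:
    """Higher is better. Each dimension is normalized against its budget so the
    weights make explicit which corner of the trilemma is being sacrificed."""
    cost_term = 1.0 - min(cfg.monthly_cost_usd / cost_budget_usd, 1.0)
    latency_term = 1.0 - min(cfg.p95_latency_ms / latency_budget_ms, 1.0)
    quality_term = cfg.quality_score
    total_weight = w.cost + w.latency + w.quality
    return (w.cost * cost_term + w.latency * latency_term + w.quality * quality_term) / total_weight

# Weighting cost and latency heavily favors the cheaper/faster configuration even
# though its quality score is lower: the sacrificed dimension becomes explicit.
fast_cheap = CandidateConfig("haiku-shallow-rag", 1200, 800, 0.72)
slow_smart = CandidateConfig("sonnet-deep-rag", 4800, 1900, 0.91)
weights = TradeoffWeights(cost=2.0, latency=2.0, quality=1.0)
print(f"{fast_cheap.name}: {score(fast_cheap, weights):.2f}")
print(f"{slow_smart.name}: {score(slow_smart, weights):.2f}")
```

Changing the weights (for example, raising `quality` above `cost` and `latency`) flips the decision toward the slower, more expensive configuration, which is exactly the negotiation the user stories below walk through.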

## The Three Teams (Perspectives)

Each user story is written from the perspective of three competing stakeholders:

| Team | Role | Primary Metric | Fear |
|---|---|---|---|
| Cost Team | Finance / Platform Engineering | Monthly cloud spend ($) | Runaway costs that blow the budget |
| Performance Team | SRE / Frontend Engineering | p95 latency (ms), throughput (RPS) | Users abandoning a slow chatbot |
| Inference Team | ML Engineering / Data Science | Response quality score, hallucination rate | Users getting wrong answers, trust erosion |

## User Stories

| # | User Story | Primary Tension | Critical Decision |
|---|---|---|---|
| US-01 | The Optimization Trilemma — Decision Framework | All three | How to structure tradeoff decisions systematically |
| US-02 | LLM Model Tiering — Quality vs Cost vs Latency | Cost ↔ Inference ↔ Performance | When to use Sonnet vs Haiku vs templates |
| US-03 | Latency Budget Allocation Across the Pipeline | Performance ↔ Inference | How to split 2 seconds across pipeline stages |
| US-04 | Real-Time vs Pre-Computed Inference | Cost ↔ Performance | What to compute on the fly vs pre-compute ahead of time |
| US-05 | RAG Retrieval Depth vs Speed vs Cost | All three | How many chunks, whether to rerank, retrieval budget |
| US-06 | Cache Aggressiveness — Freshness vs Speed vs Cost | Cost ↔ Performance ↔ Inference | TTL strategy, what to cache, staleness tolerance |
| US-07 | Guardrail Strictness — Safety vs Latency vs UX | Performance ↔ Inference | How strict the safety pipeline should be |
| US-08 | Autoscaling Strategy — Cost vs Performance Headroom | Cost ↔ Performance | Scale-up aggressiveness, provisioned vs on-demand |
| US-09 | Token Budget Allocation — Context Window Partitioning | Cost ↔ Inference | How to partition tokens between history, RAG, and system prompt |
| US-10 | Unified Optimization Decision Dashboard | All three | Composite metrics and ongoing negotiation |

## How to Read These Stories

```mermaid
graph LR
    A["US-01<br/>Framework"] --> B["US-02 to US-09<br/>Dimension-Specific<br/>Tradeoffs"] 
    B --> C["US-10<br/>Unified Dashboard"]

    style A fill:#ff9f43,stroke:#333,color:#000
    style B fill:#54a0ff,stroke:#333,color:#000
    style C fill:#5f27cd,stroke:#333,color:#fff
```

  1. Start with US-01 to understand the decision framework and the trilemma model.
  2. Read US-02 through US-09 for specific tradeoff dimensions — each one is self-contained.
  3. End with US-10 to see how all tradeoffs roll up into a unified monitoring and decision system.

## How to Read the Evolution

These documents now preserve both the earlier architecture guidance and the current-state refresh:

- The original sections in each file describe the baseline or previous architecture and the reasoning that led to those earlier tradeoffs.
- The 2026 Update section in each file shows the current architecture direction and explains what should change now based on newer platform capabilities and techniques.
- Read each document top to bottom if you want the transition path from the previous architecture to the current one.
- Treat the newer sections as an evolution layer, not a replacement of the original material. This keeps the migration path and historical decision context visible.

## 2026 Refresh Highlights

These stories were refreshed on March 25, 2026 against current AWS Bedrock, Anthropic, vLLM, and KServe capabilities, along with recent research on routing and adaptive RAG. The core tradeoff model still holds, and the original content has been intentionally preserved as the baseline architecture, but the frontier has moved in a few important ways:

- Routing has matured from static tiering to utility-aware routing and cascades. Teams should benchmark current model classes regularly and route by expected quality lift, latency impact, and uncertainty rather than hard-coding a single "cheap" and "smart" model forever (see the routing sketch after this list).
- RAG depth is no longer just "pick a fixed top-K." Hybrid retrieval, contextual retrieval, adaptive retrieval, and corrective retrieval now outperform static chunk-count rules in many production systems.
- Caching is now multi-layer. The important distinction is not only Redis TTLs, but also prompt caching, prefix/KV caching, deterministic tool-result caches, and tightly scoped semantic caches.
- Latency work should separate time to first token (TTFT) from the rest of completion time. Managed stacks can use Bedrock prompt caching, latency-optimized inference, and cross-Region inference; self-hosted stacks can tune chunked prefill, disaggregated prefill, and request-queue-aware autoscaling.
- Guardrails are becoming layered control planes, not just blocklists. Deterministic validation, structured outputs, detect-and-repair flows, and policy validation such as automated reasoning checks are now more practical than "block everything synchronously."
- Dashboards need trace-level observability. Per-intent cost, TTFT, queue time, prefill/decode split, cache hit rates, intervention types, and experiment annotations matter more than a single blended average.
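
To make the first bullet concrete, here is a minimal, hypothetical sketch of utility-aware routing. The model names, prices, quality numbers, and the `route` helper are illustrative assumptions, not the MangaAssist router or any specific vendor API; a production cascade would also fold in uncertainty estimates and escalate on low confidence.

```python
# Hypothetical utility-aware routing sketch (illustrative names and numbers only).
from dataclasses import dataclass

@dataclass
class ModelTier:
    name: str
    cost_per_1k_output_tokens: float    # relative price signal
    typical_ttft_ms: float              # time to first token under normal load
    expected_quality: dict[str, float]  # intent -> expected quality from offline evals

# Re-benchmark these numbers regularly; they drift as model classes evolve.
TIERS = [
    ModelTier("small-fast-model", 0.004, 250,
              {"faq": 0.90, "recommendation": 0.70, "plot_analysis": 0.55}),
    ModelTier("large-smart-model", 0.024, 600,
              {"faq": 0.93, "recommendation": 0.88, "plot_analysis": 0.90}),
]

def route(intent: str, latency_budget_ms: float, min_quality: float) -> ModelTier:
    """Pick the cheapest tier that clears the quality bar within the TTFT budget;
    escalate only when the expected quality lift justifies the extra cost and latency."""
    feasible = [t for t in TIERS
                if t.expected_quality.get(intent, 0.0) >= min_quality
                and t.typical_ttft_ms <= latency_budget_ms]
    if not feasible:
        # Nothing satisfies both constraints: fall back to the highest-quality tier
        # and surface the breached latency budget to the caller.
        return max(TIERS, key=lambda t: t.expected_quality.get(intent, 0.0))
    return min(feasible, key=lambda t: t.cost_per_1k_output_tokens)

print(route("faq", latency_budget_ms=400, min_quality=0.85).name)            # small-fast-model
print(route("plot_analysis", latency_budget_ms=800, min_quality=0.85).name)  # large-smart-model
```

US-02 and US-05 cover how routing thresholds like these interact with model tiering and retrieval depth decisions.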

Representative references: AWS Bedrock prompt caching, AWS Bedrock cross-Region inference, AWS Bedrock latency-optimized inference, Anthropic contextual retrieval, vLLM optimization and tuning, KServe autoscaling with LLM metrics.

## Relationship to Other Documents

| Document | Relationship |
|---|---|
| Cost-Optimization-User-Stories/ | Pure cost optimization — these tradeoff stories explain what you sacrifice for those savings |
| Performance-Optimization-User-Stories/ | Pure performance optimization — these stories explain the cost and quality price |
| Model-Inference/03-tradeoffs-decisions.md | Inference-specific tradeoffs — these stories broaden to system-wide impact |
| 15-tradeoffs-challenges.md | High-level overview — these stories go deep with implementation detail |
| 04-architecture-hld.md | Architecture context for all components discussed |