
# Optimization Tradeoffs User Stories - MangaAssist Chatbot

## The Core Problem: You Cannot Optimize Everything Simultaneously

MangaAssist operates under three competing optimization pressures. Improving any two necessarily degrades the third. Every engineering decision in this system is a negotiation between these forces.

```mermaid
graph TD
    subgraph "The Optimization Trilemma"
        COST["💰 Cost Optimization<br/>Reduce spend on compute,<br/>LLM tokens, storage, bandwidth"]
        PERF["⚡ Performance Optimization<br/>Minimize latency, maximize<br/>throughput, improve availability"]
        INF["🧠 Inference Optimization<br/>Maximize response quality,<br/>accuracy, grounding, safety"]
    end

    COST <-->|"Tension"| PERF
    PERF <-->|"Tension"| INF
    INF <-->|"Tension"| COST

    style COST fill:#f9d71c,stroke:#333,color:#000
    style PERF fill:#4ecdc4,stroke:#333,color:#000
    style INF fill:#ff6b6b,stroke:#333,color:#000
```

## Why These Three Cannot Be Optimized Together

| If you optimize... | ...you sacrifice... | Because... |
|---|---|---|
| Cost + Performance | Inference Quality | Cheaper/faster models produce worse answers; less RAG context means more hallucination |
| Cost + Inference Quality | Performance | Better models cost more per token but are slower; deeper RAG retrieval adds latency |
| Performance + Inference Quality | Cost | Fast, high-quality responses require provisioned throughput, larger models, more compute |
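
US-01 formalizes this as a decision framework. As a rough illustration of what a systematic, weighted tradeoff decision can look like in code, the sketch below scores candidate configurations against all three dimensions at once. It is a minimal, hypothetical example: the class names, budgets, and numbers are assumptions for this page, not MangaAssist's actual scoring logic.

```python
# Hypothetical tradeoff-scoring sketch; names, budgets, and numbers are illustrative.
from dataclasses import dataclass

@dataclass
class CandidateConfig:
    name: str
    monthly_cost_usd: float    # cost dimension
    p95_latency_ms: float      # performance dimension
    quality_score: float       # inference dimension, 0.0-1.0 from offline evals

@dataclass
class TradeoffWeights:
    cost: float = 1.0
    latency: float = 1.0
    quality: float = 1.0

def score(cfg: CandidateConfig, w: TradeoffWeights,
          cost_budget_usd: float = 5000.0, latency_budget_ms: float = 2000.0) -> float:
    """Higher is better. Each dimension is normalized against its budget so the
    weights make explicit which corner of the trilemma is being sacrificed."""
    cost_term = 1.0 - min(cfg.monthly_cost_usd / cost_budget_usd, 1.0)
    latency_term = 1.0 - min(cfg.p95_latency_ms / latency_budget_ms, 1.0)
    quality_term = cfg.quality_score
    total_weight = w.cost + w.latency + w.quality
    return (w.cost * cost_term + w.latency * latency_term + w.quality * quality_term) / total_weight

# Weighting cost and latency heavily favors the cheaper/faster configuration even
# though its quality score is lower: the sacrificed dimension becomes explicit.
fast_cheap = CandidateConfig("haiku-shallow-rag", 1200, 800, 0.72)
slow_smart = CandidateConfig("sonnet-deep-rag", 4800, 1900, 0.91)
weights = TradeoffWeights(cost=2.0, latency=2.0, quality=1.0)
print(f"{fast_cheap.name}: {score(fast_cheap, weights):.2f}")
print(f"{slow_smart.name}: {score(slow_smart, weights):.2f}")
```

Changing the weights (for example, raising `quality` above `cost` and `latency`) flips the decision toward the slower, more expensive configuration, which is exactly the negotiation the user stories below walk through.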

## The Three Teams (Perspectives)

Each user story is written from the perspective of three competing stakeholders:

| Team | Role | Primary Metric | Fear |
|---|---|---|---|
| Cost Team | Finance / Platform Engineering | Monthly cloud spend ($) | Runaway costs that blow the budget |
| Performance Team | SRE / Frontend Engineering | p95 latency (ms), throughput (RPS) | Users abandoning a slow chatbot |
| Inference Team | ML Engineering / Data Science | Response quality score, hallucination rate | Users getting wrong answers, trust erosion |

## User Stories

| # | User Story | Primary Tension | Critical Decision |
|---|---|---|---|
| US-01 | The Optimization Trilemma — Decision Framework | All three | How to structure tradeoff decisions systematically |
| US-02 | LLM Model Tiering — Quality vs Cost vs Latency | Cost ↔ Inference ↔ Performance | When to use Sonnet vs Haiku vs templates |
| US-03 | Latency Budget Allocation Across the Pipeline | Performance ↔ Inference | How to split 2 seconds across pipeline stages |
| US-04 | Real-Time vs Pre-Computed Inference | Cost ↔ Performance | What to compute on the fly vs pre-compute ahead of time |
| US-05 | RAG Retrieval Depth vs Speed vs Cost | All three | How many chunks, whether to rerank, retrieval budget |
| US-06 | Cache Aggressiveness — Freshness vs Speed vs Cost | Cost ↔ Performance ↔ Inference | TTL strategy, what to cache, staleness tolerance |
| US-07 | Guardrail Strictness — Safety vs Latency vs UX | Performance ↔ Inference | How strict the safety pipeline should be |
| US-08 | Autoscaling Strategy — Cost vs Performance Headroom | Cost ↔ Performance | Scale-up aggressiveness, provisioned vs on-demand |
| US-09 | Token Budget Allocation — Context Window Partitioning | Cost ↔ Inference | How to partition tokens between history, RAG, and system prompt |
| US-10 | Unified Optimization Decision Dashboard | All three | Composite metrics and ongoing negotiation |

## How to Read These Stories

```mermaid
graph LR
    A["US-01<br/>Framework"] --> B["US-02 to US-09<br/>Dimension-Specific<br/>Tradeoffs"] 
    B --> C["US-10<br/>Unified Dashboard"]

    style A fill:#ff9f43,stroke:#333,color:#000
    style B fill:#54a0ff,stroke:#333,color:#000
    style C fill:#5f27cd,stroke:#333,color:#fff
```

  1. Start with US-01 to understand the decision framework and the trilemma model.
  2. Read US-02 through US-09 for specific tradeoff dimensions — each one is self-contained.
  3. End with US-10 to see how all tradeoffs roll up into a unified monitoring and decision system.

## How to Read the Evolution

These documents now preserve both the earlier architecture guidance and the current-state refresh:

- The original sections in each file describe the baseline or previous architecture and the reasoning that led to those earlier tradeoffs.
- The 2026 Update section in each file shows the current architecture direction and explains what should change now based on newer platform capabilities and techniques.
- Read each document top to bottom if you want the transition path from the previous architecture to the current one.
- Treat the newer sections as an evolution layer, not a replacement of the original material. This keeps the migration path and historical decision context visible.

## 2026 Refresh Highlights

These stories were refreshed on March 25, 2026 against current AWS Bedrock, Anthropic, vLLM, and KServe capabilities, along with recent research on routing and adaptive RAG. The core tradeoff model still holds, and the original content has been intentionally preserved as the baseline architecture, but the frontier has moved in a few important ways:

- Routing has matured from static tiering to utility-aware routing and cascades. Teams should benchmark current model classes regularly and route by expected quality lift, latency impact, and uncertainty rather than hard-coding a single "cheap" and "smart" model forever (see the routing sketch after this list).
- RAG depth is no longer just "pick a fixed top-K." Hybrid retrieval, contextual retrieval, adaptive retrieval, and corrective retrieval now outperform static chunk-count rules in many production systems.
- Caching is now multi-layer. The important distinction is not only Redis TTLs, but also prompt caching, prefix/KV caching, deterministic tool-result caches, and tightly scoped semantic caches.
- Latency work should separate time to first token (TTFT) from the rest of completion time. Managed stacks can use Bedrock prompt caching, latency-optimized inference, and cross-Region inference; self-hosted stacks can tune chunked prefill, disaggregated prefill, and request-queue-aware autoscaling.
- Guardrails are becoming layered control planes, not just blocklists. Deterministic validation, structured outputs, detect-and-repair flows, and policy validation such as automated reasoning checks are now more practical than "block everything synchronously."
- Dashboards need trace-level observability. Per-intent cost, TTFT, queue time, prefill/decode split, cache hit rates, intervention types, and experiment annotations matter more than a single blended average.
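
To make the first bullet concrete, here is a minimal, hypothetical sketch of utility-aware routing. The model names, prices, quality numbers, and the `route` helper are illustrative assumptions, not the MangaAssist router or any specific vendor API; a production cascade would also fold in uncertainty estimates and escalate on low confidence.

```python
# Hypothetical utility-aware routing sketch (illustrative names and numbers only).
from dataclasses import dataclass

@dataclass
class ModelTier:
    name: str
    cost_per_1k_output_tokens: float    # relative price signal
    typical_ttft_ms: float              # time to first token under normal load
    expected_quality: dict[str, float]  # intent -> expected quality from offline evals

# Re-benchmark these numbers regularly; they drift as model classes evolve.
TIERS = [
    ModelTier("small-fast-model", 0.004, 250,
              {"faq": 0.90, "recommendation": 0.70, "plot_analysis": 0.55}),
    ModelTier("large-smart-model", 0.024, 600,
              {"faq": 0.93, "recommendation": 0.88, "plot_analysis": 0.90}),
]

def route(intent: str, latency_budget_ms: float, min_quality: float) -> ModelTier:
    """Pick the cheapest tier that clears the quality bar within the TTFT budget;
    escalate only when the expected quality lift justifies the extra cost and latency."""
    feasible = [t for t in TIERS
                if t.expected_quality.get(intent, 0.0) >= min_quality
                and t.typical_ttft_ms <= latency_budget_ms]
    if not feasible:
        # Nothing satisfies both constraints: fall back to the highest-quality tier
        # and surface the breached latency budget to the caller.
        return max(TIERS, key=lambda t: t.expected_quality.get(intent, 0.0))
    return min(feasible, key=lambda t: t.cost_per_1k_output_tokens)

print(route("faq", latency_budget_ms=400, min_quality=0.85).name)            # small-fast-model
print(route("plot_analysis", latency_budget_ms=800, min_quality=0.85).name)  # large-smart-model
```

US-02 and US-05 cover how routing thresholds like these interact with model tiering and retrieval depth decisions.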

Representative references: AWS Bedrock prompt caching, AWS Bedrock cross-Region inference, AWS Bedrock latency-optimized inference, Anthropic contextual retrieval, vLLM optimization and tuning, KServe autoscaling with LLM metrics.

## Relationship to Other Documents

| Document | Relationship |
|---|---|
| Cost-Optimization-User-Stories/ | Pure cost optimization — these tradeoff stories explain what you sacrifice for those savings |
| Performance-Optimization-User-Stories/ | Pure performance optimization — these stories explain the cost and quality price |
| Model-Inference/03-tradeoffs-decisions.md | Inference-specific tradeoffs — these stories broaden to system-wide impact |
| 15-tradeoffs-challenges.md | High-level overview — these stories go deep with implementation detail |
| 04-architecture-hld.md | Architecture context for all components discussed |