
US-02: LLM Model Tiering — Quality vs Cost vs Latency

User Story

As a platform engineering lead,
I want to route queries to different LLM models (or bypass the LLM entirely) based on query complexity,
so that simple queries are fast and cheap while complex queries still get high-quality generation.

The Debate

graph TD
    subgraph "Cost Team Position"
        C["Use Haiku for everything.<br/>Sonnet is 15x more expensive.<br/>At 1M messages/day, that's<br/>$250K/month difference."]
    end

    subgraph "Performance Team Position"
        P["Haiku is 3x faster TTFT.<br/>Sonnet prefill kills our<br/>sub-second target.<br/>Template path is instant."]
    end

    subgraph "Inference Team Position"
        I["Sonnet's quality is measurably<br/>superior for recommendations<br/>and multi-turn reasoning.<br/>Haiku hallucinates 3x more<br/>on complex queries."]
    end

    C ---|"Agree on<br/>light queries"| P
    P ---|"Disagree on<br/>complex queries"| I
    I ---|"Disagree on<br/>budget"| C

    style C fill:#f9d71c,stroke:#333,color:#000
    style P fill:#4ecdc4,stroke:#333,color:#000
    style I fill:#ff6b6b,stroke:#333,color:#000

Acceptance Criteria

  • A complexity classifier routes each query to the appropriate tier with >90% accuracy.
  • Template tier handles ≥30% of all messages with zero LLM cost.
  • Haiku tier handles ≥35% of remaining messages at 85% cost reduction vs Sonnet.
  • Sonnet tier handles only genuinely complex queries (≤35% of remaining).
  • Quality score does not drop below 0.80 on any intent category.

The Three Tiers

graph TD
    A["User Message"] --> B["Intent Classifier"]
    B --> C{"Complexity<br/>Classifier"}

    C -->|"Tier 1: Trivial<br/>Greeting, thanks, order status,<br/>simple promo lookup"| T1["Template Response<br/>⚡ 10-50ms | 💰 $0 | 🧠 N/A"]

    C -->|"Tier 2: Simple<br/>FAQ, product lookup,<br/>single-turn factual"| T2["Claude 3.5 Haiku<br/>⚡ 200-400ms | 💰 $0.001 | 🧠 0.78"]

    C -->|"Tier 3: Complex<br/>Recommendation, multi-turn,<br/>reasoning, comparison"| T3["Claude 3.5 Sonnet<br/>⚡ 800-1500ms | 💰 $0.015 | 🧠 0.92"]

    T1 --> D["Response"]
    T2 --> D
    T3 --> D

    style T1 fill:#2d8659,stroke:#333,color:#fff
    style T2 fill:#fd9644,stroke:#333,color:#000
    style T3 fill:#eb3b5a,stroke:#333,color:#fff

Cost Comparison at Scale (1M messages/day)

| Tier | Traffic Share | Daily Volume | Cost/Request | Daily Cost | Monthly Cost |
| --- | --- | --- | --- | --- | --- |
| Template | 30% | 300,000 | $0.000 | $0 | $0 |
| Haiku | 40% | 400,000 | $0.001 | $400 | $12,000 |
| Sonnet | 30% | 300,000 | $0.015 | $4,500 | $135,000 |
| Tiered Total | 100% | 1,000,000 | | $4,900 | $147,000 |
| All-Sonnet Baseline | 100% | 1,000,000 | $0.015 | $15,000 | $450,000 |
| Savings | | | | | $303,000/mo (67%) |
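
The table's arithmetic can be reproduced in a few lines. Tier shares and per-request prices are taken directly from the table above; a 30-day month is assumed.

```python
# Reproduce the cost table. Shares and per-request prices come from the
# table above; a 30-day billing month is assumed.
DAILY_MESSAGES = 1_000_000

tiers = {
    "template": {"share": 0.30, "cost_per_request": 0.000},
    "haiku":    {"share": 0.40, "cost_per_request": 0.001},
    "sonnet":   {"share": 0.30, "cost_per_request": 0.015},
}

def daily_cost(share: float, cost_per_request: float) -> float:
    return DAILY_MESSAGES * share * cost_per_request

tiered_daily = sum(daily_cost(t["share"], t["cost_per_request"])
                   for t in tiers.values())
baseline_daily = daily_cost(1.0, 0.015)      # all-Sonnet baseline

tiered_monthly = tiered_daily * 30           # $147,000
baseline_monthly = baseline_daily * 30       # $450,000
savings = baseline_monthly - tiered_monthly  # $303,000 (~67%)
print(f"tiered ${tiered_monthly:,.0f}/mo, "
      f"savings ${savings:,.0f}/mo ({savings / baseline_monthly:.0%})")
```

Changing a tier's share or price here immediately shows the budget impact of a routing-threshold change.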

Complexity Classifier Design

The Hard Decision: How to Classify Complexity

This is the highest-stakes routing decision in the system. Misrouting a complex query to Haiku produces a bad answer. Misrouting a simple query to Sonnet wastes money. The classifier itself must be cheap and fast.

graph LR
    A["User Message<br/>+ Intent<br/>+ Entities"] --> B{"Rule-Based<br/>Pre-filter"}
    B -->|"High confidence<br/>trivial"| T1["→ Template Tier"]
    B -->|"Uncertain"| C{"Lightweight<br/>Complexity Model"}
    C -->|"score < 0.4"| T2["→ Haiku Tier"]
    C -->|"0.4 ≤ score < 0.7"| T2B["→ Haiku Tier<br/>(with quality monitoring)"]
    C -->|"score ≥ 0.7"| T3["→ Sonnet Tier"]

    style T1 fill:#2d8659,stroke:#333,color:#fff
    style T2 fill:#fd9644,stroke:#333,color:#000
    style T2B fill:#fd9644,stroke:#333,color:#000
    style T3 fill:#eb3b5a,stroke:#333,color:#fff

Rule-Based Pre-filter (Stage 1)

| Signal | Routes To | Confidence |
| --- | --- | --- |
| Intent = chitchat, escalation | Template | 0.99 |
| Intent = order_tracking + order ID resolved | Template | 0.95 |
| Intent = promotion + promos found in cache | Template | 0.90 |
| Turn count = 1 + intent = faq + single entity | Haiku | 0.85 |
| Turn count > 3 + intent = recommendation | Sonnet | 0.90 |
| User references previous assistant message | Sonnet | 0.85 |
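
A minimal sketch of the Stage-1 rules, assuming the field names (`intent`, `turn_count`, and so on) produced by the upstream intent classifier; these names are illustrative, not a real API.

```python
# Sketch of the Stage-1 rule-based pre-filter. Field names are
# illustrative assumptions about the upstream classifier's output.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class QueryContext:
    intent: str
    turn_count: int = 1
    entities: list = field(default_factory=list)
    order_id_resolved: bool = False
    promos_cached: bool = False
    references_previous_reply: bool = False

def prefilter(q: QueryContext) -> Optional[tuple[str, float]]:
    """Return (tier, confidence) for high-confidence cases, or None
    to fall through to the Stage-2 complexity model."""
    if q.intent in ("chitchat", "escalation"):
        return ("template", 0.99)
    if q.intent == "order_tracking" and q.order_id_resolved:
        return ("template", 0.95)
    if q.intent == "promotion" and q.promos_cached:
        return ("template", 0.90)
    if q.turn_count > 3 and q.intent == "recommendation":
        return ("sonnet", 0.90)
    if q.references_previous_reply:
        return ("sonnet", 0.85)
    if q.turn_count == 1 and q.intent == "faq" and len(q.entities) == 1:
        return ("haiku", 0.85)
    return None  # uncertain: defer to Stage 2
```

Returning `None` rather than guessing keeps the pre-filter honest: anything it cannot route with high confidence goes to the weighted scorer.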

Complexity Signals (Stage 2)

| Signal | Weight | Why |
| --- | --- | --- |
| Turn count in session | 0.15 | More turns → more context needed → more complex |
| Number of entities extracted | 0.10 | Multiple entities suggest comparison or nuance |
| Intent type | 0.25 | Recommendation/comparison inherently complex |
| Query length (token count) | 0.10 | Longer queries tend to be more nuanced |
| Presence of comparative language | 0.15 | "which is better", "compare", "difference" |
| Unresolved anaphora | 0.10 | "the second one", "that series" needs context |
| User is authenticated + has history | 0.15 | Personalized responses need more reasoning |
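
The Stage-2 score is a weighted sum of these signals, thresholded at the 0.4 and 0.7 boundaries shown in the routing diagram. The weights come from the table; how each raw signal is normalized into [0, 1] is an assumption left to feature engineering.

```python
# Sketch of the Stage-2 weighted complexity score. Weights are from the
# table above (they sum to 1.0); each signal is assumed to be
# pre-normalized to the [0, 1] range.
WEIGHTS = {
    "turn_count":      0.15,
    "entity_count":    0.10,
    "intent_type":     0.25,
    "query_length":    0.10,
    "comparative":     0.15,
    "anaphora":        0.10,
    "personalization": 0.15,
}

def complexity_score(signals: dict) -> float:
    """`signals` maps signal name -> normalized value in [0, 1];
    missing signals count as 0."""
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)

def route(score: float) -> str:
    """Thresholds from the routing diagram above."""
    if score < 0.4:
        return "haiku"
    if score < 0.7:
        return "haiku_monitored"  # Haiku with quality monitoring
    return "sonnet"
```

Because the weights sum to 1.0, the score stays in [0, 1] and the 0.4/0.7 thresholds can be tuned independently of the feature set.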

The Critical Tradeoff: The Gray Zone

graph TD
    subgraph "Clear Template (30%)"
        CT["Greetings, order lookups,<br/>simple promos<br/>✅ No debate"]
    end

    subgraph "Clear Sonnet (20%)"
        CS["Multi-turn recommendations,<br/>complex comparisons,<br/>reasoning chains<br/>✅ No debate"]
    end

    subgraph "Gray Zone (50%)"
        GZ["Single-turn FAQ<br/>Product questions<br/>Simple recommendations<br/>⚠️ THIS IS WHERE THE<br/>TRADEOFF LIVES"]
    end

    GZ --> D{"Decision<br/>Framework"}
    D -->|"Default: Haiku"| H["Save money, accept<br/>slight quality drop"]
    D -->|"Upgrade: Sonnet"| S["Spend more, guarantee<br/>quality"]

    style CT fill:#2d8659,stroke:#333,color:#fff
    style CS fill:#eb3b5a,stroke:#333,color:#fff
    style GZ fill:#f9d71c,stroke:#333,color:#000

Gray Zone Decision Rules

| Scenario | Decision | Rationale |
| --- | --- | --- |
| FAQ with a single, clear entity | Haiku | Factual retrieval doesn't need deep reasoning |
| Product question about format/availability | Haiku | Structured data lookup, not creative |
| "Recommend something like X" (single reference) | Haiku + monitoring | Simple similarity; escalate to Sonnet if quality score < 0.75 |
| "Compare X and Y" | Sonnet | Comparison requires weighing multiple attributes |
| Any query from a user who previously escalated | Sonnet | High-risk user; don't degrade their experience |
| Any query touching pricing or availability | Haiku | Factual, but validate against catalog (guardrail handles accuracy) |

Quality Monitoring and Automatic Escalation

sequenceDiagram
    participant User
    participant Router as Complexity Router
    participant Haiku as Claude Haiku
    participant QM as Quality Monitor
    participant Sonnet as Claude Sonnet

    User->>Router: "What's a good manga for beginners?"
    Router->>Haiku: Tier 2 (simple recommendation)
    Haiku-->>Router: Response (quality_estimate=0.72)
    Router->>QM: Log response + quality estimate

    Note over QM: quality_estimate < 0.75 threshold

    QM->>Router: Trigger re-generation with Sonnet
    Router->>Sonnet: Same query + context
    Sonnet-->>Router: Higher quality response
    Router-->>User: Sonnet response delivered

    Note over QM: Latency penalty: +800ms<br/>Cost penalty: +$0.014<br/>Quality gain: +0.18
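
The escalation loop in the sequence diagram can be sketched as follows. The model-call and quality-estimation functions are placeholders (not a real SDK); the 0.75 threshold comes from the gray-zone rules above.

```python
# Sketch of the auto-escalation loop from the sequence diagram.
# call_haiku, call_sonnet, and estimate_quality are injected
# placeholders, not real SDK functions.
QUALITY_THRESHOLD = 0.75  # from the gray-zone rules above

def answer(query: str, context: dict,
           call_haiku, call_sonnet, estimate_quality) -> dict:
    response = call_haiku(query, context)
    quality = estimate_quality(response, query, context)
    if quality >= QUALITY_THRESHOLD:
        return {"response": response, "tier": "haiku", "quality": quality}
    # Below threshold: re-generate with the stronger model, accepting the
    # latency (~+800ms) and cost (~+$0.014) penalties noted above.
    response = call_sonnet(query, context)
    quality = estimate_quality(response, query, context)
    return {"response": response, "tier": "sonnet", "quality": quality}
```

Note that the user waits for both calls on the escalation path, which is why the threshold must be tuned so escalation stays rare.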

Quality Estimation Without Ground Truth

The system estimates response quality in real time, without human labels:

| Signal | What It Measures | How |
| --- | --- | --- |
| Self-consistency | Does the response contradict itself? | Check if key claims are consistent |
| Groundedness | Are claims supported by retrieved data? | NLI check against RAG chunks |
| Completeness | Does it answer the question? | Check if intent entities are addressed |
| Length appropriateness | Is the length right for the query type? | Compare to expected range per intent |
| User follow-up | Did the user ask about the same topic again? | Indicates the first answer was insufficient |
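
One way to fold the per-signal checks into the single `quality_estimate` used by the router. Equal weighting and the hard cap on ungrounded answers are assumptions; the individual checks are stubbed.

```python
# Sketch of combining the label-free quality signals into one estimate.
# Equal weighting and the groundedness cap are assumptions, not the
# system's confirmed formula; individual checks are computed upstream.
SIGNALS = ("self_consistency", "groundedness", "completeness", "length_fit")

def quality_estimate(checks: dict) -> float:
    """`checks` maps signal name -> score in [0, 1]; missing signals
    (e.g. user follow-up, which arrives late) are skipped."""
    scores = [checks[s] for s in SIGNALS if s in checks]
    if not scores:
        raise ValueError("no quality signals available")
    estimate = sum(scores) / len(scores)
    # An ungrounded answer is capped regardless of other signals: a
    # fluent but unsupported response must not look high-quality.
    if checks.get("groundedness", 1.0) < 0.5:
        estimate = min(estimate, 0.5)
    return estimate
```

The cap ensures averaging cannot hide a groundedness failure behind high fluency scores.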

2026 Update: Move from Static Tiering to Utility-Aware Routing

Treat everything above this section as the baseline routing architecture. This update keeps that earlier design in place for context and shows how to evolve it into a utility-aware router.

The latest production pattern is to treat routing as an online decision problem rather than a fixed lookup table.

  • Replace hard-coded "Haiku vs Sonnet" assumptions with benchmarked route classes such as template, fast_model, reasoning_model, and safe_fallback. Recompute the mapping quarterly against the current Bedrock catalog.
  • Route on expected utility, not just estimated complexity. A useful router predicts whether the higher-tier model is likely to add enough quality to justify its added cost and latency.
  • Introduce uncertainty-aware escalation. Low-confidence cases should either ask a clarifying question, trigger a second-pass verification flow, or escalate directly instead of forcing a brittle cheap-model answer.
  • Combine routing and cascading where it pays off: let a cheaper model answer first on selected intents, then escalate only when a grader, validator, or uncertainty signal says the answer is weak.
  • Log the full route decision for every request: selected tier, uncertainty, latency, spend, downstream feedback, and any manual override. Without this telemetry, the router cannot improve.
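
The expected-utility idea in the bullets above can be sketched as follows. The utility function (quality minus weighted cost and latency) and the per-route benchmark numbers are illustrative assumptions, to be recomputed from offline evaluations.

```python
# Sketch of expected-utility routing over benchmarked route classes.
# The utility weights (lambda_cost, mu_latency) and per-route stats are
# illustrative assumptions, refreshed from quarterly benchmarks.
ROUTES = {
    # route class: (mean quality, cost per request ($), p95 latency (s))
    "fast_model":      (0.78, 0.001, 0.4),
    "reasoning_model": (0.92, 0.015, 1.5),
}

def pick_route(predicted_quality: dict,
               lambda_cost: float = 5.0,
               mu_latency: float = 0.05) -> str:
    """`predicted_quality` maps route -> per-query quality prediction
    from a small learned model; absent routes fall back to the
    benchmark mean. Picks the route with the highest expected utility."""
    def utility(route: str) -> float:
        mean_q, cost, latency = ROUTES[route]
        q = predicted_quality.get(route, mean_q)
        return q - lambda_cost * cost - mu_latency * latency
    return max(ROUTES, key=utility)
```

The key shift from static tiering: when the per-query prediction says the stronger model adds little quality, the cheap route wins even for a "complex-looking" query.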

Recent references: A Unified Approach to Routing and Cascading for LLMs, SATER: token-efficient routing and cascading, AWS Bedrock latency-optimized inference, Anthropic empirical evaluations.

Reversal Triggers

| Trigger Condition | Action |
| --- | --- |
| Haiku quality score drops below 0.72 on any intent category for 7 consecutive days | Move that intent to Sonnet tier |
| A new model offers Sonnet quality at Haiku price | Re-evaluate all tiers |
| Template responses generate >5% "not helpful" feedback | Reduce template coverage, move to Haiku |
| Monthly LLM spend exceeds budget by >15% | Increase Haiku share, tighten Sonnet routing |
| User escalation rate increases >20% after a tiering change | Revert the change |

Implementation Sequence

gantt
    title Model Tiering Rollout
    dateFormat  YYYY-MM-DD
    section Phase 1
    Implement template tier for chitchat + order     :a1, 2026-01-01, 14d
    Measure baseline quality per intent              :a2, 2026-01-01, 14d
    section Phase 2
    Deploy complexity classifier                     :b1, after a1, 14d
    Route clear-simple queries to Haiku              :b2, after b1, 7d
    section Phase 3
    Expand Haiku to gray zone (with monitoring)      :c1, after b2, 14d
    Implement auto-escalation pipeline               :c2, after b2, 14d
    section Phase 4
    Tune thresholds based on QACPI                   :d1, after c1, 21d
    Ongoing monthly review                           :d2, after d1, 30d

Impact on Trilemma

| Dimension | Before Tiering | After Tiering | Change |
| --- | --- | --- | --- |
| Cost (monthly LLM) | $450,000 | $147,000 | -67% |
| p95 Latency (avg) | 1.8s | 0.8s | -56% |
| Quality Score (avg) | 0.92 | 0.87 | -5.4% |
| QACPI | 12,169 | 188,889 | +1,452% |

The 5.4% quality drop is concentrated in simple queries where users don't notice the difference. Complex queries (recommendations, comparisons) still get Sonnet and maintain 0.92 quality.