US-02: LLM Model Tiering — Quality vs Cost vs Latency
User Story
As a platform engineering lead, I want to route queries to different LLM models (or bypass the LLM entirely) based on query complexity, so that simple queries are fast and cheap while complex queries still get high-quality generation.
The Debate
graph TD
subgraph "Cost Team Position"
C["Use Haiku for everything.<br/>Sonnet is 15x more expensive.<br/>At 1M messages/day, that's<br/>$250K/month difference."]
end
subgraph "Performance Team Position"
P["Haiku is 3x faster TTFT.<br/>Sonnet prefill kills our<br/>sub-second target.<br/>Template path is instant."]
end
subgraph "Inference Team Position"
I["Sonnet's quality is measurably<br/>superior for recommendations<br/>and multi-turn reasoning.<br/>Haiku hallucinates 3x more<br/>on complex queries."]
end
C ---|"Agree on<br/>light queries"| P
P ---|"Disagree on<br/>complex queries"| I
I ---|"Disagree on<br/>budget"| C
style C fill:#f9d71c,stroke:#333,color:#000
style P fill:#4ecdc4,stroke:#333,color:#000
style I fill:#ff6b6b,stroke:#333,color:#000
Acceptance Criteria
- A complexity classifier routes each query to the appropriate tier with >90% accuracy.
- Template tier handles ≥30% of all messages with zero LLM cost.
- Haiku tier handles ≥40% of all messages at roughly 93% lower per-request cost than Sonnet ($0.001 vs $0.015).
- Sonnet tier handles only genuinely complex queries (≤30% of all messages).
- Quality score does not drop below 0.80 on any intent category.
The Three Tiers
graph TD
A["User Message"] --> B["Intent Classifier"]
B --> C{"Complexity<br/>Classifier"}
C -->|"Tier 1: Trivial<br/>Greeting, thanks, order status,<br/>simple promo lookup"| T1["Template Response<br/>⚡ 10-50ms | 💰 $0 | 🧠 N/A"]
C -->|"Tier 2: Simple<br/>FAQ, product lookup,<br/>single-turn factual"| T2["Claude 3.5 Haiku<br/>⚡ 200-400ms | 💰 $0.001 | 🧠 0.78"]
C -->|"Tier 3: Complex<br/>Recommendation, multi-turn,<br/>reasoning, comparison"| T3["Claude 3.5 Sonnet<br/>⚡ 800-1500ms | 💰 $0.015 | 🧠 0.92"]
T1 --> D["Response"]
T2 --> D
T3 --> D
style T1 fill:#2d8659,stroke:#333,color:#fff
style T2 fill:#fd9644,stroke:#333,color:#000
style T3 fill:#eb3b5a,stroke:#333,color:#fff
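To make the tier boundaries concrete, here is a minimal configuration sketch in Python. The backend identifiers, latency budgets, per-request costs, and quality figures simply mirror the diagram above and are illustrative assumptions, not production values.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Tier:
    name: str
    backend: str                       # "template" or a model identifier (illustrative)
    p95_latency_budget_ms: int
    cost_per_request_usd: float
    expected_quality: Optional[float]  # offline-evaluated score; None for the template path

# Values mirror the tier diagram above; real model IDs, budgets, and costs
# must come from your own benchmarks and the current model catalog.
TIERS = {
    "template": Tier("template", "template-engine", 50, 0.000, None),
    "haiku":    Tier("haiku", "claude-3-5-haiku", 400, 0.001, 0.78),
    "sonnet":   Tier("sonnet", "claude-3-5-sonnet", 1500, 0.015, 0.92),
}
```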
Cost Comparison at Scale (1M messages/day)
| Tier | Traffic Share | Daily Volume | Cost/Request | Daily Cost | Monthly Cost |
|---|---|---|---|---|---|
| Template | 30% | 300,000 | $0.000 | $0 | $0 |
| Haiku | 40% | 400,000 | $0.001 | $400 | $12,000 |
| Sonnet | 30% | 300,000 | $0.015 | $4,500 | $135,000 |
| Tiered Total | 100% | 1,000,000 | — | $4,900 | $147,000 |
| All-Sonnet Baseline | 100% | 1,000,000 | $0.015 | $15,000 | $450,000 |
| Savings vs All-Sonnet | — | — | — | $10,100 | $303,000 (67%) |
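The monthly figures follow directly from traffic share × volume × per-request cost over an assumed 30-day month; a quick sketch of the arithmetic:

```python
DAILY_VOLUME = 1_000_000
SHARES = {"template": 0.30, "haiku": 0.40, "sonnet": 0.30}               # traffic split
COST_PER_REQUEST = {"template": 0.000, "haiku": 0.001, "sonnet": 0.015}  # USD

daily_tiered = sum(DAILY_VOLUME * SHARES[t] * COST_PER_REQUEST[t] for t in SHARES)
monthly_tiered = daily_tiered * 30              # 30-day month assumed: $147,000
monthly_all_sonnet = DAILY_VOLUME * 0.015 * 30  # $450,000
savings = monthly_all_sonnet - monthly_tiered   # $303,000, ~67%
print(monthly_tiered, monthly_all_sonnet, savings)
```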
Complexity Classifier Design
The Hard Decision: How to Classify Complexity
This is the highest-stakes routing decision in the system. Misrouting a complex query to Haiku produces a bad answer. Misrouting a simple query to Sonnet wastes money. The classifier itself must be cheap and fast.
graph LR
A["User Message<br/>+ Intent<br/>+ Entities"] --> B{"Rule-Based<br/>Pre-filter"}
B -->|"High confidence<br/>trivial"| T1["→ Template Tier"]
B -->|"Uncertain"| C{"Lightweight<br/>Complexity Model"}
C -->|"score < 0.4"| T2["→ Haiku Tier"]
C -->|"0.4 ≤ score < 0.7"| T2B["→ Haiku Tier<br/>(with quality monitoring)"]
C -->|"score ≥ 0.7"| T3["→ Sonnet Tier"]
style T1 fill:#2d8659,stroke:#333,color:#fff
style T2 fill:#fd9644,stroke:#333,color:#000
style T2B fill:#fd9644,stroke:#333,color:#000
style T3 fill:#eb3b5a,stroke:#333,color:#fff
Rule-Based Pre-filter (Stage 1)
| Signal | Routes To | Confidence |
|---|---|---|
| Intent = chitchat, escalation | Template | 0.99 |
| Intent = order_tracking + order ID resolved | Template | 0.95 |
| Intent = promotion + promos found in cache | Template | 0.90 |
| Turn count = 1 + intent = faq + single entity | Haiku | 0.85 |
| Turn count > 3 + intent = recommendation | Sonnet | 0.90 |
| User references previous assistant message | Sonnet | 0.85 |
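A sketch of how Stage 1 could be expressed as an ordered rule list. The field names on the message object (`intent`, `order_id_resolved`, `turn_count`, and so on) are assumptions for illustration; returning `None` hands the query to the Stage 2 complexity model.

```python
from typing import Optional

def prefilter_route(msg: dict) -> Optional[tuple[str, float]]:
    """Stage 1: high-confidence routing rules from the table above.

    Returns (tier, confidence), or None to fall through to Stage 2.
    """
    intent = msg.get("intent")
    if intent in ("chitchat", "escalation"):
        return ("template", 0.99)
    if intent == "order_tracking" and msg.get("order_id_resolved"):
        return ("template", 0.95)
    if intent == "promotion" and msg.get("promos_in_cache"):
        return ("template", 0.90)
    if msg.get("turn_count", 1) == 1 and intent == "faq" and len(msg.get("entities", [])) == 1:
        return ("haiku", 0.85)
    if msg.get("turn_count", 1) > 3 and intent == "recommendation":
        return ("sonnet", 0.90)
    if msg.get("references_previous_assistant_message"):
        return ("sonnet", 0.85)
    return None  # uncertain → Stage 2 complexity model
```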
Complexity Signals (Stage 2)
| Signal | Weight | Why |
|---|---|---|
| Turn count in session | 0.15 | More turns → more context needed → more complex |
| Number of entities extracted | 0.10 | Multiple entities suggest comparison or nuance |
| Intent type | 0.25 | Recommendation/comparison inherently complex |
| Query length (token count) | 0.10 | Longer queries tend to be more nuanced |
| Presence of comparative language | 0.15 | "which is better", "compare", "difference" |
| Unresolved anaphora | 0.10 | "the second one", "that series" needs context |
| User is authenticated + has history | 0.15 | Personalized responses need more reasoning |
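One way to turn the signal table into a Stage 2 score: normalize each raw signal into [0, 1], take the weighted sum (the weights above total 1.0), and apply the 0.4 / 0.7 thresholds from the routing diagram. The normalization caps and field names are assumptions to be tuned against labeled traffic.

```python
WEIGHTS = {
    "turn_count": 0.15,
    "entity_count": 0.10,
    "intent_type": 0.25,
    "query_length": 0.10,
    "comparative_language": 0.15,
    "unresolved_anaphora": 0.10,
    "personalized_context": 0.15,
}

COMPLEX_INTENTS = {"recommendation", "comparison"}  # assumed intent taxonomy

def complexity_score(msg: dict) -> float:
    # Normalize each signal into [0, 1]; the caps (5 turns, 4 entities,
    # 60 tokens) are assumptions to be tuned on real traffic.
    signals = {
        "turn_count": min(msg.get("turn_count", 1) / 5.0, 1.0),
        "entity_count": min(len(msg.get("entities", [])) / 4.0, 1.0),
        "intent_type": 1.0 if msg.get("intent") in COMPLEX_INTENTS else 0.0,
        "query_length": min(msg.get("token_count", 0) / 60.0, 1.0),
        "comparative_language": 1.0 if msg.get("has_comparative_language") else 0.0,
        "unresolved_anaphora": 1.0 if msg.get("has_unresolved_anaphora") else 0.0,
        "personalized_context": 1.0 if msg.get("authenticated_with_history") else 0.0,
    }
    return sum(WEIGHTS[k] * signals[k] for k in WEIGHTS)

def stage2_route(msg: dict) -> str:
    score = complexity_score(msg)
    if score < 0.4:
        return "haiku"
    if score < 0.7:
        return "haiku_monitored"  # Haiku with quality monitoring
    return "sonnet"
```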
The Critical Tradeoff: The Gray Zone
graph TD
subgraph "Clear Template (30%)"
CT["Greetings, order lookups,<br/>simple promos<br/>✅ No debate"]
end
subgraph "Clear Sonnet (20%)"
CS["Multi-turn recommendations,<br/>complex comparisons,<br/>reasoning chains<br/>✅ No debate"]
end
subgraph "Gray Zone (50%)"
GZ["Single-turn FAQ<br/>Product questions<br/>Simple recommendations<br/>⚠️ THIS IS WHERE THE<br/>TRADEOFF LIVES"]
end
GZ --> D{"Decision<br/>Framework"}
D -->|"Default: Haiku"| H["Save money, accept<br/>slight quality drop"]
D -->|"Upgrade: Sonnet"| S["Spend more, guarantee<br/>quality"]
style CT fill:#2d8659,stroke:#333,color:#fff
style CS fill:#eb3b5a,stroke:#333,color:#fff
style GZ fill:#f9d71c,stroke:#333,color:#000
Gray Zone Decision Rules
| Scenario | Decision | Rationale |
|---|---|---|
| FAQ with a single, clear entity | Haiku | Factual retrieval doesn't need deep reasoning |
| Product question about format/availability | Haiku | Structured data lookup, not creative |
| "Recommend something like X" (single reference) | Haiku + monitoring | Simple similarity; escalate to Sonnet if quality score < 0.75 |
| "Compare X and Y" | Sonnet | Comparison requires weighing multiple attributes |
| Any query from a user who previously escalated | Sonnet | High-risk user; don't degrade their experience |
| Any query touching pricing or availability | Haiku | Factual, but validate against catalog (guardrail handles accuracy) |
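These rules can sit as a thin policy layer on top of whatever tier the classifier proposes. A sketch, with assumed field names on the message object:

```python
def apply_gray_zone_overrides(msg: dict, tier: str) -> str:
    """Policy overrides from the gray-zone table above; field names are illustrative."""
    if msg.get("user_previously_escalated"):
        return "sonnet"           # high-risk user: don't degrade their experience
    if msg.get("intent") == "comparison":
        return "sonnet"           # "Compare X and Y" needs multi-attribute reasoning
    if msg.get("intent") == "recommendation" and len(msg.get("entities", [])) == 1:
        return "haiku_monitored"  # "something like X": escalate if quality < 0.75
    return tier
```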
Quality Monitoring and Automatic Escalation
sequenceDiagram
participant User
participant Router as Complexity Router
participant Haiku as Claude Haiku
participant QM as Quality Monitor
participant Sonnet as Claude Sonnet
User->>Router: "What's a good manga for beginners?"
Router->>Haiku: Tier 2 (simple recommendation)
Haiku-->>Router: Response (quality_estimate=0.72)
Router->>QM: Log response + quality estimate
Note over QM: quality_estimate < 0.75 threshold
QM->>Router: Trigger re-generation with Sonnet
Router->>Sonnet: Same query + context
Sonnet-->>Router: Higher quality response
Router-->>User: Sonnet response delivered
Note over QM: Latency penalty: +800ms<br/>Cost penalty: +$0.014<br/>Quality gain: +0.18
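A sketch of the escalation loop in the sequence above. `generate(model, query, context)` and `estimate_quality(response, context)` are placeholder callables standing in for the model client and the quality monitor.

```python
QUALITY_FLOOR = 0.75  # per the sequence above

def answer_with_escalation(query: str, context: dict, generate, estimate_quality) -> dict:
    """Answer with Haiku first; regenerate with Sonnet when the online
    quality estimate falls below the floor."""
    response = generate("haiku", query, context)
    quality = estimate_quality(response, context)
    if quality >= QUALITY_FLOOR:
        return {"response": response, "tier": "haiku", "quality_estimate": quality}

    # Escalation costs roughly +800ms and +$0.014 per the sequence diagram,
    # so it should remain a rare path, not the default.
    escalated = generate("sonnet", query, context)
    return {
        "response": escalated,
        "tier": "sonnet_escalated",
        "quality_estimate": estimate_quality(escalated, context),
    }
```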
Quality Estimation Without Ground Truth
The system estimates response quality in real-time without human labels:
| Signal | What It Measures | How |
|---|---|---|
| Self-consistency | Does the response contradict itself? | Check if key claims are consistent |
| Groundedness | Are claims supported by retrieved data? | NLI check against RAG chunks |
| Completeness | Does it answer the question? | Check if intent entities are addressed |
| Length appropriateness | Is the length right for the query type? | Compare to expected range per intent |
| User follow-up | Did the user ask about the same topic again? | Indicates the first answer was insufficient |
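A minimal sketch of how these signals could be blended into a single estimate. The weights are illustrative assumptions, and the follow-up signal only arrives after the fact, so in practice it feeds offline calibration rather than the real-time gate.

```python
def estimate_quality(signals: dict) -> float:
    """Blend the label-free signals above into a single 0-1 estimate.

    Each signal is pre-scored to [0, 1]; missing signals (e.g. the user
    follow-up, which only arrives later) are left out of the average.
    Weights are assumptions to be calibrated against human-rated samples.
    """
    weights = {
        "self_consistency": 0.25,
        "groundedness": 0.30,
        "completeness": 0.25,
        "length_appropriateness": 0.10,
        "no_repeat_follow_up": 0.10,  # 1.0 if the user did not re-ask the same question
    }
    present = {k: w for k, w in weights.items() if k in signals}
    if not present:
        return 0.0
    return sum(w * float(signals[k]) for k, w in present.items()) / sum(present.values())
```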
2026 Update: Move from Static Tiering to Utility-Aware Routing
Treat everything above this section as the baseline. This update keeps that design in place for context and shows how to evolve it into the current routing architecture.
The latest production pattern is to treat routing as an online decision problem rather than a fixed lookup table.
- Replace hard-coded "Haiku vs Sonnet" assumptions with benchmarked route classes such as `template`, `fast_model`, `reasoning_model`, and `safe_fallback`. Recompute the mapping quarterly against the current Bedrock catalog.
- Route on expected utility, not just estimated complexity. A useful router predicts whether the higher-tier model is likely to add enough quality to justify its added cost and latency.
- Introduce uncertainty-aware escalation. Low-confidence cases should either ask a clarifying question, trigger a second-pass verification flow, or escalate directly instead of forcing a brittle cheap-model answer.
- Combine routing and cascading where it pays off: let a cheaper model answer first on selected intents, then escalate only when a grader, validator, or uncertainty signal says the answer is weak.
- Log the full route decision for every request: selected tier, uncertainty, latency, spend, downstream feedback, and any manual override. Without this telemetry, the router cannot improve.
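As a sketch of what utility-aware routing plus full decision logging could look like (the route-class names, penalty weights, and uncertainty threshold are assumptions, not benchmarked values):

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class RouteClass:
    name: str                 # e.g. "template", "fast_model", "reasoning_model"
    predicted_quality: float  # router's quality prediction for this query, 0..1
    cost_usd: float
    latency_ms: float

def expected_utility(r: RouteClass, cost_weight: float = 20.0,
                     latency_weight: float = 0.0002) -> float:
    # Utility = predicted quality minus cost and latency penalties.
    # The penalty weights are assumptions; tune them to your budget and SLO.
    return r.predicted_quality - cost_weight * r.cost_usd - latency_weight * r.latency_ms

def route(candidates: list[RouteClass], uncertainty: float, log) -> str:
    if uncertainty > 0.5:
        choice = "safe_fallback"  # clarify, verify, or escalate instead of guessing
    else:
        choice = max(candidates, key=expected_utility).name
    log(json.dumps({              # full decision telemetry, per the last bullet
        "ts": time.time(),
        "choice": choice,
        "uncertainty": uncertainty,
        "candidates": [asdict(c) for c in candidates],
    }))
    return choice
```

For example, `route([...], uncertainty=0.2, log=print)` picks the candidate with the highest expected utility and emits the decision record that later threshold tuning depends on.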
Recent references: A Unified Approach to Routing and Cascading for LLMs, SATER: token-efficient routing and cascading, AWS Bedrock latency-optimized inference, Anthropic empirical evaluations.
Reversal Triggers
| Trigger Condition | Action |
|---|---|
| Haiku quality score drops below 0.72 on any intent category for 7 consecutive days | Move that intent to Sonnet tier |
| A new model offers Sonnet quality at Haiku price | Re-evaluate all tiers |
| Template responses generate >5% "not helpful" feedback | Reduce template coverage, move to Haiku |
| Monthly LLM spend exceeds budget by >15% | Increase Haiku share, tighten Sonnet routing |
| User escalation rate increases >20% after a tiering change | Revert the change |
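The first trigger lends itself to a simple scheduled check. A sketch, assuming a metrics store that can return the last seven daily Haiku quality scores per intent:

```python
def check_haiku_quality_trigger(daily_scores_by_intent: dict[str, list[float]],
                                floor: float = 0.72, window_days: int = 7) -> list[str]:
    """Return intents that breached the quality floor for the full window
    and should move to the Sonnet tier. How `daily_scores_by_intent` is
    populated (most recent score last) is left to your metrics store."""
    breached = []
    for intent, scores in daily_scores_by_intent.items():
        recent = scores[-window_days:]
        if len(recent) == window_days and all(s < floor for s in recent):
            breached.append(intent)
    return breached
```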
Implementation Sequence
gantt
title Model Tiering Rollout
dateFormat YYYY-MM-DD
section Phase 1
Implement template tier for chitchat + order :a1, 2026-01-01, 14d
Measure baseline quality per intent :a2, 2026-01-01, 14d
section Phase 2
Deploy complexity classifier :b1, after a1, 14d
Route clear-simple queries to Haiku :b2, after b1, 7d
section Phase 3
Expand Haiku to gray zone (with monitoring) :c1, after b2, 14d
Implement auto-escalation pipeline :c2, after b2, 14d
section Phase 4
Tune thresholds based on QACPI :d1, after c1, 21d
Ongoing monthly review :d2, after d1, 30d
Impact on Trilemma
| Dimension | Before Tiering | After Tiering | Change |
|---|---|---|---|
| Cost (monthly LLM) | $450,000 | $147,000 | -67% |
| p95 Latency (avg) | 1.8s | 0.8s | -56% |
| Quality Score (avg) | 0.92 | 0.87 | -5.4% |
| QACPI | 12,169 | 188,889 | +1,452% |
The 5.4% quality drop is concentrated in simple queries where users don't notice the difference. Complex queries (recommendations, comparisons) still get Sonnet and maintain 0.92 quality.