Constraint Scenario 03 — Cost Budget Halved
Trigger: A finance review cuts MangaAssist's daily compute + LLM budget from $14M to $7M. Same traffic. Pillars stressed: all three. This is the hardest scenario in the folder.
TL;DR
A 50% cost cut at constant traffic is not a knob you turn; it's a multi-pronged campaign. The naive answer ("use Haiku for everything") fails the eval gate. The right answer composes ~12 levers, each cutting 1-8% of the budget, prioritized by quality_loss / cost_saved ratio. The eval framework picks the order; the harness implements the levers; the LLD makes each switch a config flip.
The change
| Metric | Before | Target |
|---|---|---|
| Daily compute + LLM budget | $14M/day | $7M/day |
| Cost/turn (avg) | $0.0182 | $0.0091 |
| Acceptable eval drop | — | <3% aggregate, <5% per cell |
| Acceptable latency drop | — | none |
| Time to comply | — | 60 days |
You can lose ~3% quality. You cannot lose latency. You have 60 days. Go.
The lever menu — ranked by ROI
The eval framework has measured the marginal cost-vs-quality tradeoff of each lever over the past quarter. The migration playbook applies them in ascending order of quality_loss_per_dollar_saved; a sketch of the ordering rule follows the table.
| # | Lever | Cost delta (% of budget) | Quality cost | Risk | Effort |
|---|---|---|---|---|---|
| 1 | Inference cache eligibility expanded | -8% | ~0% | Low | 1 wk |
| 2 | Sub-agent input distillation tightened | -6% | -0.5% | Low | 2 wk |
| 3 | Speculative cancellation on intent-classified turns | -5% | -0.3% | Low | 1 wk |
| 4 | Race → delegate for low-latency-sensitive surfaces | -4% | -0.5% | Low | 0.5 wk |
| 5 | Tool fan-out cap reduced (5→3 max) | -3% | -1.0% | Med | 1 wk |
| 6 | Per-tier model routing (free tier → Haiku planner) | -7% | -2.0% on free | Med | 2 wk |
| 7 | Eval sample 1% → 0.5% | -3% | (eval coverage) | Med | 0.5 wk |
| 8 | Smaller embedding model for retrieval | -2% | -0.5% | Med | 2 wk |
| 9 | Async batch eval at off-peak hours only | -2% | (eval timeliness) | Low | 1 wk |
| 10 | Compress system prompts (token count -15%) | -3% | -0.7% | High | 3 wk |
| 11 | Selective Haiku planner for low-stakes turns | -4% | -1.5% | Med | 2 wk |
| 12 | DDB row TTL 7d → 3d | -1% | (rare late resumes) | Low | 0.5 wk |
Cumulative: ~48% saved, ~6.5% quality drop in raw aggregate. The rubric-weighted drop is ~3% because the per-cell impact is bounded.
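A minimal sketch of the ordering rule, assuming each lever's cost and quality numbers come from the eval framework's quarterly measurements; the Lever type and the three example levers are illustrative, not the production schema:

```python
from dataclasses import dataclass

@dataclass
class Lever:
    name: str
    cost_saved_pct: float    # % of daily budget saved when activated
    quality_loss_pct: float  # measured aggregate eval drop

def roi_order(levers: list[Lever]) -> list[Lever]:
    # Ascending quality_loss_per_dollar_saved: spend the quality
    # budget where it buys the most cost. Zero-loss levers sort first.
    return sorted(levers, key=lambda lv: lv.quality_loss_pct / lv.cost_saved_pct)

levers = [
    Lever("cache_expansion", 8.0, 0.0),
    Lever("fanout_cap", 3.0, 1.0),
    Lever("tier_routing", 7.0, 2.0),
]
print([lv.name for lv in roi_order(levers)])
# ['cache_expansion', 'tier_routing', 'fanout_cap']
```

The production ordering also folds in risk and effort (which is why the table above doesn't sort purely by this ratio), but the ratio is the primary key.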
The architecture lens — where the savings live
```mermaid
flowchart TB
    subgraph Levers[12 cost levers]
        L1[Cache expansion]
        L2[Distillation]
        L3[Speculation cancel]
        L4[Race to delegate]
        L5[Fanout cap]
        L6[Tier routing]
        L7[Eval sampling]
        L8[Embedding swap]
        L9[Eval scheduling]
        L10[Prompt compression]
        L11[Haiku planner]
        L12[TTL]
    end
    subgraph Pillars[Pillar each lever touches]
        P1A[P1 Workflow]
        P2A[P2 Harness]
        P3A[P3 LLD]
    end
    L1 --> P2A
    L2 --> P1A
    L3 --> P1A
    L4 --> P1A
    L5 --> P1A
    L6 --> P3A
    L7 --> P2A
    L8 --> P3A
    L9 --> P2A
    L10 --> P1A
    L11 --> P1A
    L11 --> P3A
    L12 --> P2A
```
Note: cost optimization touches all three pillars. A team that built only one pillar deeply has fewer levers and ends up over-pulling the ones it has, which blows through the quality budget.
The two highest-leverage levers, in detail
Lever 1 — Inference cache eligibility expanded
Today: cache eligibility = (eval_score >= 0.85) AND (deterministic_intent == true). About 38% hit rate on planner.
Change: add a second tier — (eval_score >= 0.80) AND (intent in [recommend, summarize, faq]) with shorter TTL (1h instead of 4h) and per-locale partitioning. Pushes hit rate to 51%.
Why it works:
- The added tier covers the most common intents.
- The shorter TTL bounds the blast radius of quality drift.
- Per-locale partitioning prevents en-US from dominating the cache, so other locales benefit too.
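In LLD terms the two-tier rule is a few lines. A sketch, assuming an eligibility function at the cache layer; the names (deterministic, CACHEABLE_INTENTS) are illustrative:

```python
from datetime import timedelta

CACHEABLE_INTENTS = {"recommend", "summarize", "faq"}

def cache_policy(eval_score: float, intent: str,
                 deterministic: bool, locale: str):
    """Return (cache_partition, ttl) if cacheable, else None."""
    # Tier 1 (pre-existing rule): high-confidence deterministic turns.
    if eval_score >= 0.85 and deterministic:
        return (locale, timedelta(hours=4))
    # Tier 2 (new): slightly lower bar, common intents only, shorter
    # TTL to bound quality drift, partitioned per locale so en-US
    # cannot crowd other locales out of the cache.
    if eval_score >= 0.80 and intent in CACHEABLE_INTENTS:
        return (locale, timedelta(hours=1))
    return None
```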
Quality measurement: a 5% A/B with the new cache rule for one week; eval delta -0.3% on aggregate and -0.6% on the long tail (acceptable).
Lever 6 — Per-tier model routing (free tier → Haiku planner)
Today: all users see Sonnet planner.
Change: free-tier users get Haiku planner; Prime users keep Sonnet. Free tier is ~28% of traffic.
Why it works:
- The free tier has a different query distribution (less complex on average).
- Haiku is calibrated for it; eval shows -2% on the free-tier rubric (not the global rubric: the free tier has its own rubric, calibrated to its expected complexity).
- It makes for a strong differentiator narrative: Prime users get the better experience.
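A sketch of the routing table, assuming per-tier config that carries both the planner model and the tier's own eval rubric; the model IDs and rubric names are placeholders:

```python
TIER_CONFIG = {
    # Each tier carries its own planner model AND its own eval
    # rubric: the quality bar is explicit per tier, not shared.
    "free":  {"planner": "haiku",  "rubric": "free_tier_rubric"},
    "prime": {"planner": "sonnet", "rubric": "prime_tier_rubric"},
}

def route_planner(user_tier: str) -> dict:
    # Unknown tiers fall back to the stronger (Prime) configuration.
    return TIER_CONFIG.get(user_tier, TIER_CONFIG["prime"])
```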
Quality measurement: the per-tier eval rubric is set explicitly during this lever's rollout. We do NOT pretend free-tier and Prime-tier have the same quality bar. Both have eval; both have drift detection; both have user satisfaction tracking.
Risk: if a free-tier user perceives a quality gap, that's a marketing problem. We measure perception via session-end surveys; the gap so far is small enough that user complaints are within noise.
The phased rollout
```mermaid
gantt
    title 60-day cost-cut rollout
    dateFormat YYYY-MM-DD
    section Week 1-2
    L1 Cache expansion :a1, 2026-04-29, 7d
    L4 Race to delegate :a2, 2026-04-29, 4d
    L7 Eval sampling 1pct to 0.5pct :a3, 2026-04-29, 4d
    L12 TTL :a4, 2026-04-29, 4d
    section Week 3-4
    L2 Distillation tightening :b1, 2026-05-13, 14d
    L3 Speculation cancellation :b2, 2026-05-13, 7d
    L9 Eval scheduling :b3, 2026-05-13, 7d
    section Week 5-6
    L5 Fanout cap :c1, 2026-05-27, 7d
    L8 Embedding swap :c2, 2026-05-27, 14d
    L11 Selective Haiku planner :c3, 2026-05-27, 14d
    section Week 7-8
    L6 Per-tier routing :d1, 2026-06-10, 14d
    L10 Prompt compression :d2, 2026-06-10, 14d
    section Buffer
    Stabilize and finalize :e1, 2026-06-24, 7d
```
Order is deliberate: lowest-risk, fastest levers first. Each lever is canaried and watched for eval drift. If three levers in a row show worse-than-modeled regression, the rollout pauses for re-planning; the pause rule is sketched below.
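The pause rule is mechanical enough to encode. A sketch, assuming `observed` and `modeled` are the per-lever quality drops recorded by the eval gate, in rollout order:

```python
def should_pause(observed: list[float], modeled: list[float],
                 streak: int = 3) -> bool:
    """Pause the rollout if the last `streak` levers all regressed
    quality by more than their modeled quality cost."""
    if len(observed) < streak:
        return False
    recent = zip(observed[-streak:], modeled[-streak:])
    return all(obs > model for obs, model in recent)
```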
What the harness contributes
| Harness piece | Role in cost cut |
|---|---|
| Cost dashboard partitioning (story 08) | Measures impact of each lever, weekly |
| Capability flags (story 02) | Per-tier, per-cohort, per-feature switches |
| Eval framework (story 04) | Quality regression gate per lever |
| Inference cache (story 06) | Lever 1 lives here |
| Skill registry (story 02) | Lever 5 (fanout cap) is a registry policy |
| Quota manager (story 08) | Lever 6 (per-tier routing) routes through quota classes |
| Replay harness (story 04) | Pre-deploy quality test for each lever |
The dangerous shortcuts (and why they fail)
| Shortcut | Why it fails |
|---|---|
| "Just use Haiku for everything" | Eval drops 8-15%, exceeds budget; user complaints spike |
| "Cache more aggressively" | Cache poisoning incidents; quality drift hard to attribute |
| "Drop all fan-out — every turn uses one tool" | Recommendations collapse; user satisfaction drops 20% |
| "Stop running eval (it's just monitoring)" | No leading indicator for the next regression — flying blind |
| "Compress prompts hard" | LLM behavior changes nonlinearly with prompt length; subtle regressions; longest-pole engineering work for least visible savings |
The pattern: any shortcut that ignores eval is a near-term saving and a long-term liability. The harness's eval-gated approach is exactly what makes the methodical 12-lever path tractable.
Q&A drill — opening question
Q: Couldn't you negotiate a discount with Anthropic / AWS instead of cutting 12 different things?
You should. Negotiation is always lever 0, run in parallel with the eval-gated levers. Provider negotiation typically delivers 10-25% savings on enterprise volume; assume 18% for planning. Even with that, the remaining gap is 32 points of the original budget (50% - 18%): still substantial, still requiring the lever menu.
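The point arithmetic, spelled out (the 18% discount is the planning assumption from above):

```python
budget = 14_000_000            # $/day before the cut
target = 7_000_000             # $/day after (50% cut)
negotiated = 0.18 * budget     # assumed provider discount
engineering_gap = budget - target - negotiated
print(engineering_gap / budget)  # 0.32 -> 32 points of the original budget
```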
The lesson: negotiation and engineering are complements, not substitutes. The eval-driven lever rollout makes the engineering side defensible; the negotiation makes it tractable by shrinking the gap it has to close.
Grilling — Round 1
Q1. Eval sampling 1% → 0.5% saves money but reduces detection — isn't that risky?
Detection latency goes from 30 min to 60 min at the same statistical power. Re-running the math: 0.5% of 2.4B = 12M judgments/day, about 8K/min. Even at 0.5%, per-cell stratified sampling gives sufficient power for the cells we care about, and the marginal detection-latency increase is acceptable for the savings. We do NOT go below 0.3%; that is where stratified power breaks.
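The power math, sketched under a two-proportion normal approximation. The pass rate p = 0.85, the 2-point effect size, and the per-cell judgment rates are assumptions for illustration, not measured values:

```python
def min_samples(p: float, delta: float,
                z_alpha: float = 1.96, z_beta: float = 0.84) -> float:
    # Two-proportion normal approximation: judgments per arm needed
    # to detect an absolute drop `delta` in a pass-rate metric `p`
    # at ~95% confidence and ~80% power.
    return 2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / delta ** 2

def detection_latency_min(judgments_per_min: float,
                          p: float = 0.85, delta: float = 0.02) -> float:
    return min_samples(p, delta) / judgments_per_min

# A cell receiving ~166 judgments/min at 1% sampling detects a
# 2-point drop in ~30 min; halving the sample rate to 0.5%
# (~83/min) doubles that to ~60 min, matching the figures above.
print(detection_latency_min(83))   # ~60 minutes
```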
Q2. Lever 11 (selective Haiku planner) — how do you decide which turns are "low-stakes"?
A small classifier (Haiku itself, with a tight rubric) labels the turn intent and a "stakes" score. Low-stakes = factual lookups, simple recommendations on common titles, FAQ-style. High-stakes = order changes, complex multi-title comparisons, ambiguous intent.
The classifier is itself eval'd for stakes-classification accuracy. False positives (labeling a turn low-stakes when it isn't) are bounded at 4% by design.
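A sketch of the dispatch rule, assuming the classifier emits an intent label and a stakes score in [0, 1]; the intent set and threshold are illustrative:

```python
LOW_STAKES_INTENTS = {"factual_lookup", "simple_recommendation", "faq"}

def select_planner(intent: str, stakes_score: float,
                   threshold: float = 0.25) -> str:
    # intent and stakes_score come from the small Haiku classifier;
    # threshold is tuned so the false-positive rate (low-stakes label
    # on a high-stakes turn) stays under the 4% design bound.
    if intent in LOW_STAKES_INTENTS and stakes_score < threshold:
        return "haiku"
    return "sonnet"  # default: when in doubt, the stronger planner
```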
Q3. How do you keep the team from regressing the cost gains over the next quarter?
Three durable mechanisms:
- Cost-per-feature SLO. Each feature has a $/use ceiling baked into ownership. Owners get paged if their feature drifts up.
- Cost-aware eval. New prompts/skills must report cost_delta in the PR description; deviation from baseline > 5% needs justification (see the CI-gate sketch after this list).
- Quarterly cost review. Top-N cost contributors are re-examined with fresh data. Levers are revisited; some are unwound when traffic patterns shift.
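The cost-aware eval mechanism can be a plain CI check. A sketch, assuming PR descriptions carry a `cost_delta: <pct>%` line and an optional `cost_justification:` field; both field names are hypothetical:

```python
import re
import sys

COST_DELTA_RE = re.compile(r"cost_delta:\s*([+-]?\d+(?:\.\d+)?)%")

def check_pr(description: str, budget_pct: float = 5.0) -> None:
    # CI gate: every prompt/skill PR must declare its cost_delta;
    # anything beyond the +/-5% baseline deviation needs a written
    # justification (reviewed by the feature's cost-SLO owner).
    match = COST_DELTA_RE.search(description)
    if match is None:
        sys.exit("PR description missing required cost_delta field")
    delta = float(match.group(1))
    if abs(delta) > budget_pct and "cost_justification:" not in description:
        sys.exit(f"cost_delta {delta:+.1f}% exceeds {budget_pct}% "
                 "baseline deviation without justification")

check_pr("Tighten planner prompt.\ncost_delta: -2.3%")  # passes
```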
Grilling — Round 2 (architect-level)
Q4. Lever 6 (tier routing) creates a quality differentiation between free and Prime users. Is that compliant with our terms of service?
This needs legal review before rollout. The free-tier ToS should explicitly state that we "may use lower-cost models in service of free access"; if ours doesn't say that, we update it before launching. Marketing communications also need vetting: we cannot frame Prime as "premium quality AI" without supporting evidence.
The architectural piece is straightforward; the legal and marketing pieces are the long pole.
Q5. Lever 10 (prompt compression) is the longest-effort and lowest-yielding. Why include it?
Two reasons:
- Compounding savings. A 15% prompt-token reduction compounds with the cache hit rate (saved tokens × cached fraction), so the effective savings are larger than face value.
- Maintainability. Prompt review surfaces dead and legacy instructions that clutter the prompt; the compression project is also a cleanup project, and the non-cost benefit is reduced prompt drift.
We acknowledge the engineering ROI is marginal; we run it in the background through the rollout, not on critical path.
Q6. What's the rollback story if 6 weeks in, we discover lever 5 (fanout cap) caused a slow recommendation-quality regression that the eval missed?
The eval should not have missed it; if it did, the postmortem revises the rubric (say, the recommendations rubric was incomplete on the diversity dimension).
Rollback is a config flip: the fanout_cap_active capability flag flips to false (sketched below). Cost rises ~3%; quality recovers. We re-budget by pulling another lever from the menu; we maintained spare levers (13 and 14, not yet activated) for exactly this case.
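What the config flip looks like at the tool-dispatch layer, sketched; the flag name is from above and the cap values are the ones in the lever table:

```python
def effective_fanout_cap(flags: dict) -> int:
    # Capability-flag check at tool dispatch. Rolling back lever 5
    # means setting {"fanout_cap_active": False} in the flag store;
    # no deploy needed, cost rises ~3%, quality recovers.
    if flags.get("fanout_cap_active", False):
        return 3   # lever 5: reduced cap
    return 5       # pre-lever behavior
```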
The architectural lesson: carry margin in the lever menu. Don't activate exactly 12 levers for 50% savings; activate 10 and keep 2 in reserve. We pursue 55-60% target savings and accept that ~5-10% will need to roll back.
Intuition gained
- There's no single 50% lever. A dozen 1-8% levers, eval-gated, ordered by ROI.
- Negotiation runs in parallel — provider discount is lever 0.
- Tier-based service quality is a real architectural pattern; legal/marketing alignment matters as much as the LLD.
- The eval framework chooses the order of levers; without it, you're guessing.
- Carry margin in the lever menu — plan for 60% savings to comfortably hit 50%.
See also
- 01-10x-user-surge.md — temporary version of the cost squeeze
- 04-latency-sla-tightened.md — the opposite constraint (latency, not cost)
- 08-quota-revoked.md — quota is a forced cost cut you didn't choose
- User stories 02, 04, 06, 08