
Constraint Scenario 03 — Cost Budget Halved

Trigger: Finance review cuts MangaAssist's daily compute + LLM budget from $14M/day to $7M/day. Same traffic. Pillars stressed: all three. This is the hardest scenario in the folder.


TL;DR

50% cost cut at constant traffic is not a knob you turn — it's a multi-pronged campaign. The naive answer ("use Haiku for everything") fails the eval gate. The right answer composes ~12 levers, each cutting 4-15%, prioritized by quality_loss / cost_saved ratio. The eval framework picks the order; the harness implements; the LLD makes each switch a config flip.


The change

| Metric | Before | Target |
|---|---|---|
| Daily compute + LLM budget | $14M/day | $7M/day |
| Cost/turn (avg) | $0.0182 | $0.0091 |
| Acceptable eval drop | – | <3% aggregate, <5% per cell |
| Acceptable latency drop | – | none |
| Time to comply | – | 60 days |

You can lose ~3% quality. You cannot lose latency. You have 60 days. Go.


The lever menu — ranked by ROI

The eval framework has measured the marginal cost-vs-quality of each lever over the past quarter. The migration playbook is to apply them in ascending order of quality_loss_per_dollar_saved.

| # | Lever | $ saved | Quality cost | Risk | Effort |
|---|---|---|---|---|---|
| 1 | Inference cache eligibility expanded | -8% | ~0% | Low | 1 wk |
| 2 | Sub-agent input distillation tightened | -6% | -0.5% | Low | 2 wk |
| 3 | Speculative cancellation on intent-classified turns | -5% | -0.3% | Low | 1 wk |
| 4 | Race → delegate on surfaces with low latency sensitivity | -4% | -0.5% | Low | 0.5 wk |
| 5 | Tool fan-out cap reduced (5 → 3 max) | -3% | -1.0% | Med | 1 wk |
| 6 | Per-tier model routing (free tier → Haiku planner) | -7% | -2.0% on free tier | Med | 2 wk |
| 7 | Eval sample 1% → 0.5% | -3% | (eval coverage) | Med | 0.5 wk |
| 8 | Smaller embedding model for retrieval | -2% | -0.5% | Med | 2 wk |
| 9 | Async batch eval at off-peak hours only | -2% | (eval timeliness) | Low | 1 wk |
| 10 | Compress system prompts (token count -15%) | -3% | -0.7% | High | 3 wk |
| 11 | Selective Haiku planner for low-stakes turns | -4% | -1.5% | Med | 2 wk |
| 12 | DDB row TTL 7d → 3d | -1% | (rare resume cases) | Low | 0.5 wk |

Cumulative: ~48% saved, ~6.5% quality drop in raw aggregate. The rubric-weighted drop is ~3% because the per-cell impact is bounded.
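
A minimal sketch of the selection logic the table implies; the Lever dataclass, function names, and greedy strategy are illustrative rather than the production planner:

```python
from dataclasses import dataclass

@dataclass
class Lever:
    name: str
    cost_saved_pct: float    # % of daily spend saved if activated
    quality_cost_pct: float  # measured aggregate eval drop, in points

def rank_levers(levers):
    # Ascending quality_loss_per_dollar_saved: spend quality budget where it buys the most savings.
    return sorted(levers, key=lambda l: l.quality_cost_pct / l.cost_saved_pct)

def plan(levers, target_saved_pct, quality_budget_pct):
    """Greedily activate levers until the savings target is met or the quality budget is exhausted."""
    chosen, saved, lost = [], 0.0, 0.0
    for lever in rank_levers(levers):
        if saved >= target_saved_pct:
            break
        if lost + lever.quality_cost_pct > quality_budget_pct:
            continue  # skip any lever that would blow the quality budget
        chosen.append(lever)
        saved += lever.cost_saved_pct
        lost += lever.quality_cost_pct
    return chosen

# Tiny example with two levers from the table above:
menu = [Lever("tier_routing", 7.0, 2.0), Lever("cache_expansion", 8.0, 0.0)]
print([l.name for l in plan(menu, target_saved_pct=50.0, quality_budget_pct=3.0)])
# ['cache_expansion', 'tier_routing']
```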


The architecture lens — where the savings live

```mermaid
flowchart TB
  subgraph Levers[12 cost levers]
    L1[Cache expansion]
    L2[Distillation]
    L3[Speculation cancel]
    L4[Race to delegate]
    L5[Fanout cap]
    L6[Tier routing]
    L7[Eval sampling]
    L8[Embedding swap]
    L9[Eval scheduling]
    L10[Prompt compression]
    L11[Haiku planner]
    L12[TTL]
  end

  subgraph Pillars[Pillar each lever touches]
    P1A[P1 Workflow]
    P2A[P2 Harness]
    P3A[P3 LLD]
  end

  L1 --> P2A
  L2 --> P1A
  L3 --> P1A
  L4 --> P1A
  L5 --> P1A
  L6 --> P3A
  L7 --> P2A
  L8 --> P3A
  L9 --> P2A
  L10 --> P1A
  L11 --> P1A
  L11 --> P3A
  L12 --> P2A
```

Note: cost optimization touches all three pillars. A team that built only one pillar deeply has fewer levers and ends up over-pulling on the ones they have, which exceeds the quality budget.


The two highest-leverage levers, in detail

Lever 1 — Inference cache eligibility expanded

Today: cache eligibility = (eval_score >= 0.85) AND (deterministic_intent == true). About 38% hit rate on planner.

Change: add a second tier — (eval_score >= 0.80) AND (intent in [recommend, summarize, faq]) with shorter TTL (1h instead of 4h) and per-locale partitioning. Pushes hit rate to 51%.

Why it works:
  • The added tier covers the most common intents.
  • The shorter TTL bounds the blast radius of any quality drift.
  • Per-locale partitioning prevents en-US from dominating the cache and ensures other locales benefit too.
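
As a sketch, the two-tier rule could look like the following; the CacheDecision shape, field names, and function name are hypothetical, and only the thresholds, intents, and TTLs come from the text above:

```python
from dataclasses import dataclass
from typing import Optional

CACHEABLE_INTENTS = {"recommend", "summarize", "faq"}

@dataclass
class CacheDecision:
    eligible: bool
    ttl_seconds: int = 0
    partition_key: Optional[str] = None

def cache_eligibility(eval_score: float, intent: str, deterministic_intent: bool, locale: str) -> CacheDecision:
    # Tier 1 (existing rule): high-score, deterministic-intent turns, 4h TTL.
    if eval_score >= 0.85 and deterministic_intent:
        return CacheDecision(True, ttl_seconds=4 * 3600)
    # Tier 2 (new rule): slightly lower bar, common intents only, shorter 1h TTL,
    # partitioned per locale so en-US traffic cannot crowd out other locales.
    if eval_score >= 0.80 and intent in CACHEABLE_INTENTS:
        return CacheDecision(True, ttl_seconds=3600, partition_key=locale)
    return CacheDecision(False)
```

The per-locale partition key is shown here as part of the cache decision; the real keying scheme presumably lives in the inference cache (story 06).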

Quality measurement: A/B test on 5% of traffic with the new cache rule for one week; eval delta was -0.3% aggregate and -0.6% on the long tail (acceptable).

Lever 6 — Per-tier model routing (free tier → Haiku planner)

Today: all users see Sonnet planner.

Change: free-tier users get Haiku planner; Prime users keep Sonnet. Free tier is ~28% of traffic.

Why it works:
  • Free tier has a different distribution of queries (less complex on average).
  • Haiku is calibrated for that traffic; eval shows -2% on the free-tier rubric (not the global rubric — free tier has its own rubric calibrated to its expected complexity).
  • Strong differentiator narrative — Prime users get the better experience.
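
A minimal sketch of the routing flip; tier names, model identifiers, and the flag lookup are placeholders standing in for the real capability-flag service (story 02):

```python
PLANNER_BY_TIER = {
    "free": "haiku-planner",    # ~28% of traffic moves to the cheaper planner
    "prime": "sonnet-planner",  # Prime keeps the stronger planner
}

def planner_for(user_tier: str, flags) -> str:
    # Flag off means nobody is routed differently; rollback is a config flip.
    if not flags.enabled("per_tier_planner_routing"):
        return "sonnet-planner"
    return PLANNER_BY_TIER.get(user_tier, "sonnet-planner")  # unknown tiers default to the safe choice
```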

Quality measurement: the per-tier eval rubric is set explicitly during this lever's rollout. We do NOT pretend free-tier and Prime-tier have the same quality bar. Both have eval; both have drift detection; both have user satisfaction tracking.

Risk: if a free-tier user perceives a quality gap, that's a marketing problem. We measure perception via session-end surveys; the gap so far is small enough that user complaints are within noise.


The phased rollout

```mermaid
gantt
    title 60-day cost-cut rollout
    dateFormat YYYY-MM-DD
    section Week 1-2
    L1 Cache expansion           :a1, 2026-04-29, 7d
    L4 Race to delegate          :a2, 2026-04-29, 4d
    L7 Eval sampling 1pct to 0.5pct :a3, 2026-04-29, 4d
    L12 TTL                      :a4, 2026-04-29, 4d
    section Week 3-4
    L2 Distillation tightening   :b1, 2026-05-13, 14d
    L3 Speculation cancellation  :b2, 2026-05-13, 7d
    L9 Eval scheduling           :b3, 2026-05-13, 7d
    section Week 5-6
    L5 Fanout cap                :c1, 2026-05-27, 7d
    L8 Embedding swap            :c2, 2026-05-27, 14d
    L11 Selective Haiku planner  :c3, 2026-05-27, 14d
    section Week 7-8
    L6 Per-tier routing          :d1, 2026-06-10, 14d
    L10 Prompt compression       :d2, 2026-06-10, 14d
    section Buffer
    Stabilize and finalize       :e1, 2026-06-24, 7d
```

Order is deliberate: lowest-risk, fastest levers first. Each lever is canaried and its eval drift watched. If three levers in a row show worse-than-modeled regression, the rollout pauses for re-planning.
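
A sketch of that pause rule, assuming each canaried lever records its modeled and measured quality drop; the three-in-a-row threshold mirrors the text, everything else (field names, the helper) is illustrative:

```python
def should_pause_rollout(canary_results: list) -> bool:
    """Pause and re-plan if the last three canaried levers all regressed worse than modeled."""
    last_three = canary_results[-3:]
    return len(last_three) == 3 and all(
        r["measured_quality_drop_pct"] > r["modeled_quality_drop_pct"] for r in last_three
    )

history = [
    {"lever": "L1",  "modeled_quality_drop_pct": 0.0, "measured_quality_drop_pct": 0.1},
    {"lever": "L4",  "modeled_quality_drop_pct": 0.5, "measured_quality_drop_pct": 0.9},
    {"lever": "L12", "modeled_quality_drop_pct": 0.0, "measured_quality_drop_pct": 0.2},
]
print(should_pause_rollout(history))  # True -> stop the rollout and re-plan
```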


What the harness contributes

| Harness piece | Role in cost cut |
|---|---|
| Cost dashboard partitioning (story 08) | Measures impact of each lever, weekly |
| Capability flags (story 02) | Per-tier, per-cohort, per-feature switches |
| Eval framework (story 04) | Quality regression gate per lever |
| Inference cache (story 06) | Lever 1 lives here |
| Skill registry (story 02) | Lever 5 (fanout cap) is a registry policy |
| Quota manager (story 08) | Lever 6 (per-tier routing) routes through quota classes |
| Replay harness (story 04) | Pre-deploy quality test for each lever |

The dangerous shortcuts (and why they fail)

| Shortcut | Why it fails |
|---|---|
| "Just use Haiku for everything" | Eval drops 8-15%, exceeds budget; user complaints spike |
| "Cache more aggressively" | Cache poisoning incidents; quality drift hard to attribute |
| "Drop all fan-out — every turn uses one tool" | Recommendations collapse; user satisfaction drops 20% |
| "Stop running eval (it's just monitoring)" | No leading indicator for the next regression — flying blind |
| "Compress prompts hard" | LLM behavior changes nonlinearly with prompt length; subtle regressions; longest-pole engineering work for the least visible savings |

The pattern: any shortcut that ignores eval is a near-term saving and a long-term liability. The harness's eval-gated approach is exactly what makes the methodical 12-lever path tractable.


Q&A drill — opening question

Q: Couldn't you negotiate a discount with Anthropic / AWS instead of cutting 12 different things?

You should. Negotiation is always lever 0, run in parallel with the eval-gated levers. Provider negotiation typically delivers 10-25% savings on enterprise volume; assume 18% for planning. Even with that, the remaining gap is 32 points (50% - 18%) — still substantial, still requiring the lever menu.
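
Back-of-envelope on that gap, using the same planning numbers (the 18% figure is the planning assumption above, not a committed discount):

```python
target_cut = 0.50   # finance target: halve the spend
negotiated = 0.18   # assumed provider discount
points_left = target_cut - negotiated                  # fraction of the original spend still to cut
share_of_discounted = points_left / (1 - negotiated)   # same cut, measured against the post-discount bill
print(round(points_left, 2), round(share_of_discounted, 2))  # 0.32 0.39
```

In other words, the engineering levers still have to take roughly 39% out of the post-discount bill, which is why the full lever menu stays in play.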

The lesson: negotiation and engineering are complements, not substitutes. The eval-driven lever rollout makes the engineering side defensible; the negotiation shrinks the remaining gap enough to keep that engineering tractable.


Grilling — Round 1

Q1. Eval sampling 1% → 0.5% saves money but reduces detection — isn't that risky?

Detection latency goes from 30 min to 60 min for the same statistical power. Re-running the math: 0.5% of 2.4B turns/day is 12M judgments/day, or roughly 8K/min. Even at 0.5%, per-cell stratified sampling gives sufficient power for the cells we care about. The marginal increase in detection latency is acceptable for the savings. We do NOT go below 0.3% — that's where stratified power breaks.
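
The arithmetic, spelled out (the traffic figure is the one quoted above):

```python
turns_per_day = 2_400_000_000          # daily turns, from the answer above
sample_rate = 0.005                    # proposed 0.5% eval sample
judgments_per_day = turns_per_day * sample_rate
judgments_per_minute = judgments_per_day / (24 * 60)
print(f"{judgments_per_day:,.0f} judgments/day ≈ {judgments_per_minute:,.0f}/min")
# 12,000,000 judgments/day ≈ 8,333/min
```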

Q2. Lever 11 (selective Haiku planner) — how do you decide which turns are "low-stakes"?

A small classifier (Haiku itself, with a tight rubric) labels the turn intent and a "stakes" score. Low-stakes = factual lookups, simple recommendations on common titles, FAQ-style. High-stakes = order changes, complex multi-title comparisons, ambiguous intent.

The classifier is itself eval'd for stakes-classification accuracy. False positives (calling something low-stakes that isn't) are bounded at 4% by design.
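
A sketch of how the stakes label could gate the planner choice; the intent labels, threshold, and model names are assumptions, and the real classifier carries its own eval and the 4% false-positive bound:

```python
HIGH_STAKES_INTENTS = {"order_change", "multi_title_comparison", "ambiguous"}

def planner_for_turn(intent: str, stakes_score: float, low_stakes_threshold: float = 0.3) -> str:
    # Fail safe: anything high-stakes or uncertain keeps the stronger planner.
    if intent in HIGH_STAKES_INTENTS or stakes_score >= low_stakes_threshold:
        return "sonnet-planner"
    return "haiku-planner"  # factual lookups, simple recs on common titles, FAQ-style turns
```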

Q3. How do you keep the team from regressing the cost gains over the next quarter?

Three durable mechanisms:
  • Cost-per-feature SLO. Each feature has a $/use ceiling baked into ownership; owners get paged if their feature drifts up.
  • Cost-aware eval. New prompts/skills must report cost_delta in the PR description; deviation from baseline > 5% needs justification.
  • Quarterly cost review. Top-N cost contributors are re-examined with fresh data; levers are revisited, and some are unwound when traffic patterns shift.
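
An illustrative check for the cost-aware eval mechanism; the function, the CI-style enforcement, and the example numbers are assumptions, since the text only requires that cost_delta be reported and large deviations justified:

```python
from typing import Optional

def cost_delta_check(baseline_cost_per_use: float, new_cost_per_use: float,
                     justification: Optional[str] = None, max_drift: float = 0.05) -> bool:
    """Pass if cost/use drifts <= 5% from baseline, or if a written justification is provided."""
    drift = (new_cost_per_use - baseline_cost_per_use) / baseline_cost_per_use
    if drift <= max_drift:
        return True
    return bool(justification and justification.strip())

assert cost_delta_check(0.0091, 0.0093)        # +2.2% drift: fine
assert not cost_delta_check(0.0091, 0.0102)    # +12% drift, no justification: blocked
```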


Grilling — Round 2 (architect-level)

Q4. Lever 6 (tier routing) creates a quality differentiation between free and Prime users. Is that compliant with our terms of service?

This needs a legal review before rollout. The free-tier ToS should explicitly state something like "may use lower-cost models in service of free access"; if ours doesn't say that, we update it before launching. Marketing communications also need vetting — we cannot frame Prime as "premium quality AI" without supporting evidence.

The architectural piece is straightforward; the legal/marketing pieces are the long pole.

Q5. Lever 10 (prompt compression) is the longest-effort, highest-risk lever for a modest yield. Why include it?

Two reasons:
  • Compounding savings. A 15% prompt-token reduction compounds with the cache hit rate (saved tokens × cache fraction), so the effective savings are larger than the face value.
  • Maintainability. Prompt review reveals dead and legacy instructions cluttering the prompt; the compression project is also a cleanup project, and the non-cost benefit is reduced prompt drift.

We acknowledge the engineering ROI is marginal; we run it in the background through the rollout, not on the critical path.

Q6. What's the rollback story if 6 weeks in, we discover lever 5 (fanout cap) caused a slow recommendation-quality regression that the eval missed?

The eval should not have missed it; if it did, the postmortem revises the rubric (say, the recommendations rubric was incomplete on the diversity dimension).

Rollback is a config flip — the capability flag fanout_cap_active flips to false. Cost rises ~3%; quality recovers. We re-budget by pulling another lever from the menu (we keep spare levers in the menu for exactly this case — levers 13 and 14 are not yet activated).
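
The config-flip shape, as a sketch; the flag name comes from the answer above, while the flag-service interface and the cap values (3 and 5, from the lever table) are otherwise illustrative:

```python
def max_tool_fanout(flags) -> int:
    # fanout_cap_active on: lever 5 cap of 3 parallel tools; off: the original cap of 5.
    return 3 if flags.enabled("fanout_cap_active") else 5
```

Because the planner reads the cap at plan time, rolling the lever back is a flag change rather than a deploy.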

The architectural lesson: carry margin in the lever menu. Don't plan exactly 12 levers for exactly 50% savings; keep a couple of levers in reserve. We pursue 55-60% in target savings and accept that ~5-10 points will need to roll back.


Intuition gained

  • There's no single 50% lever. A dozen 4-15% levers, eval-gated, ordered by ROI.
  • Negotiation runs in parallel — provider discount is lever 0.
  • Tier-based service quality is a real architectural pattern; legal/marketing alignment matters as much as the LLD.
  • The eval framework chooses the order of levers; without it, you're guessing.
  • Carry margin in the lever menu — plan for 60% savings to comfortably hit 50%.

See also

  • 01-10x-user-surge.md — temporary version of the cost squeeze
  • 04-latency-sla-tightened.md — the opposite constraint (latency, not cost)
  • 08-quota-revoked.md — quota is a forced cost cut you didn't choose
  • User stories 02, 04, 06, 08