Story 02 — Why I refused "use Claude for everything"
One-line: The first architecture proposal routed every customer turn through Claude Sonnet. The math said it would cost ~$0.07/session against a $0.03/session budget and miss the 1.5s P99 target by ~300ms. I redesigned to a hybrid execution model that bypassed the LLM for ~30% of traffic, and shipped within both budgets.
Situation
Early architecture review. The team's instinct (and the demo's instinct, and frankly the industry's instinct in 2024) was: foundation models are good enough now, just route everything through one. The proposed architecture had a single Claude Sonnet endpoint behind every customer message — order tracking, FAQ, recommendations, escalations, billing questions. Clean diagram. Easy to explain. Disastrous economics.
I worked the numbers on a whiteboard: average tokens per turn (input + output) × Sonnet pricing × turns/session × sessions/day. Came out to ~$0.07/session against a $0.03/session budget, with P99 in the 1.7–1.8s range against the 1.5s target. Neither number was close.
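A minimal sketch of that whiteboard math, assuming placeholder token counts, turns per session, and per-token prices; these are illustrative stand-ins, not the project's traffic model or Anthropic's actual price sheet. The point is the shape of the calculation, which with these inputs lands near the ~$0.07/session figure.

```python
# Back-of-envelope unit economics for the "Sonnet for everything" proposal.
# All inputs are illustrative placeholders, not the project's traffic model
# or real list prices -- swap in current numbers before relying on the output.

AVG_INPUT_TOKENS_PER_TURN = 1_800     # system prompt + history + injected context
AVG_OUTPUT_TOKENS_PER_TURN = 350
TURNS_PER_SESSION = 7
SESSIONS_PER_DAY = 250_000

PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000    # $/token, placeholder Sonnet-class rate
PRICE_PER_OUTPUT_TOKEN = 15.00 / 1_000_000  # $/token, placeholder Sonnet-class rate

cost_per_turn = (AVG_INPUT_TOKENS_PER_TURN * PRICE_PER_INPUT_TOKEN
                 + AVG_OUTPUT_TOKENS_PER_TURN * PRICE_PER_OUTPUT_TOKEN)
cost_per_session = cost_per_turn * TURNS_PER_SESSION
cost_per_day = cost_per_session * SESSIONS_PER_DAY

print(f"cost/turn:    ${cost_per_turn:.4f}")
print(f"cost/session: ${cost_per_session:.4f}")   # ~ $0.07 with these inputs
print(f"cost/day:     ${cost_per_day:,.0f}")
```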
Task
Find an architecture that hits the cost and latency budgets — without giving up on the AI quality the project was promising.
Action
1. Decomposed the 8 use cases by what each one actually needs. I went back to 03-use-cases.md and tagged each use case with the answer to one question: what is the prior on the right response, given the customer's input?
- "Where is my order #ABC123" — prior is sharp. The right response is determined almost entirely by an internal API call. An LLM here adds latency, cost, and hallucination risk for zero quality gain.
- "Tell me about Vinland Saga volume 12" — prior is sharp on the catalog data, broad on the framing. RAG with a templated answer beats free-form LLM here.
- "I'm looking for something like Berserk but less violent" — prior is broad. This actually needs the LLM, because the customer is doing semantic exploration the catalog metadata can't answer alone.
- "Help me return this, my husband bought the wrong volume" — prior is sharp on the policy + state machine, but the tone matters. Templated response with LLM-rewritten warmth on top.
2. Designed the hybrid execution model. Documented in 10-ai-llm-design.md. Four routing tiers: templates → internal APIs → RAG with templated answers → full LLM. Routing decided by a 2-stage intent classifier (rules for unambiguous intents, DistilBERT for the rest); a routing sketch follows this list.
3. Re-ran the unit economics. ~30% of traffic routed to template/API tiers (~$0.001/turn). ~40% to RAG-with-template (~$0.012/turn — Haiku, not Sonnet). ~30% to full LLM (~$0.05/turn — Sonnet). Weighted average: ~$0.022/session. Under budget with headroom.
4. Re-ran latency. Template tier P99 ~120ms. RAG tier P99 ~900ms. LLM tier P99 ~1.4s. Blended P99 across the traffic mix: ~1.2s, dominated by the full-LLM tier's tail. Under target.
5. Defended the model tiering. Documented the projected $209K/month savings at scale in Cost-Optimization-User-Stories/US-01-llm-token-cost-optimization.md.
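A condensed sketch of steps 2–4, assuming hypothetical regex rules and a stubbed-in classifier; the tier shares and per-turn costs mirror step 3, and everything else is illustrative rather than the actual 10-ai-llm-design.md implementation. The per-turn weighted average lands near $0.02 here; the ~$0.022/session figure above also reflects turns-per-session assumptions this sketch omits.

```python
import re
from dataclasses import dataclass

# Tiers of the hybrid execution model, cheapest and fastest first.
TEMPLATE, API, RAG_TEMPLATE, FULL_LLM = "template", "api", "rag_template", "full_llm"

# Stage 1: rules for unambiguous, sharp-prior intents (patterns are illustrative).
RULE_ROUTES = [
    (re.compile(r"\bwhere\s+is\s+my\s+order\b", re.I), API),       # order tracking -> internal API
    (re.compile(r"\b(return|refund|exchange)\b", re.I), TEMPLATE), # policy/state machine + warm rewrite
    (re.compile(r"\b(shipping|delivery)\s+(cost|time|policy)\b", re.I), TEMPLATE),
]

def route(message: str, classify_intent=None) -> str:
    """Two-stage router: regex rules first, small classifier (DistilBERT-class) as fallback.
    `classify_intent` is a stand-in for the fine-tuned intent model."""
    for pattern, tier in RULE_ROUTES:
        if pattern.search(message):
            return tier
    intent = classify_intent(message) if classify_intent else "unknown"
    if intent in {"catalog_lookup", "product_question"}:
        return RAG_TEMPLATE   # sharp prior on catalog data, broad only on framing
    return FULL_LLM           # broad prior: recommendations, semantic exploration

# Re-costed unit economics with the tier shares and per-turn costs from step 3.
@dataclass
class Tier:
    share: float           # fraction of traffic
    cost_per_turn: float   # USD

TIER_ECONOMICS = {
    "template/api": Tier(share=0.30, cost_per_turn=0.001),
    "rag_template": Tier(share=0.40, cost_per_turn=0.012),   # Haiku, not Sonnet
    "full_llm":     Tier(share=0.30, cost_per_turn=0.05),    # Sonnet
}

weighted_cost = sum(t.share * t.cost_per_turn for t in TIER_ECONOMICS.values())
print(f"weighted cost per turn: ${weighted_cost:.4f}")       # ~ $0.02 with these figures
```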
The math/algorithmic depth that mattered
This decision is unglamorous but it's the most important architect-level call in the whole project. The math is information-theoretic, not deep-learning theory:
- Bayesian posterior strength as a routing signal. When P(intent | customer input) is sharp (e.g., regex matches "where is my order"), a model adds variance to a process that should be deterministic. Calling an LLM there is negative expected value: small chance of better tone, real chance of hallucinated tracking number.
- Model capacity vs. task complexity. Information-theoretic argument: the entropy of the answer for "track order #X" is tiny. A frontier-scale model like Sonnet is wasted capacity on that task; even DistilBERT (66M params) is more than the intent classification needs there, and a regex handles the routing.
- Cost scaling under traffic shape. At 50K concurrent → 500K concurrent (11-scalability-reliability.md), a 2× difference in cost-per-session scales linearly with user growth. The architect's job is to set the slope right, not just the intercept.
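Two of these arguments are just arithmetic, so here is a toy version of both: the expected value of calling an LLM on a sharp-prior intent, and the linear cost slope under traffic growth. Every probability, dollar value, and session count below is invented for illustration; only the structure of the argument comes from the bullets above.

```python
# (1) Toy expected value of routing a sharp-prior intent ("where is my order")
#     through an LLM instead of the deterministic API + template path.
llm_cost, api_cost = 0.05, 0.001                # $/turn
p_better_tone, tone_value = 0.20, 0.01          # small chance of nicer phrasing, small value
p_hallucination, incident_cost = 0.02, 5.00     # confidently wrong tracking number -> escalation

ev_llm_over_api = (p_better_tone * tone_value
                   - p_hallucination * incident_cost
                   - (llm_cost - api_cost))
print(f"EV of LLM over API path: ${ev_llm_over_api:+.3f}/turn")   # negative with these inputs

# (2) Cost slope under traffic growth: a per-session cost difference scales
#     linearly with session volume, so the slope matters more than today's bill.
#     Session volumes are placeholders, not the project's capacity model.
for sessions_per_month in (5_000_000, 50_000_000):                # ~10x growth
    for label, cost_per_session in (("monolithic", 0.07), ("hybrid", 0.022)):
        monthly = sessions_per_month * cost_per_session
        print(f"{sessions_per_month:>11,} sessions/mo, {label:>10}: ${monthly:>12,.0f}/mo")
```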
The leadership move
Senior engineers in the room had built their careers on "use the most powerful model" intuition. Pushing back on Claude Sonnet for everything risked sounding like "I don't trust the new tech." I framed it the opposite way: we're using Claude Sonnet where it adds asymmetric value, and we're not using it where it adds cost and risk for no quality gain. Same trust in the model — different deployment surface.
Then I made the framework reusable. The hybrid execution model isn't just "templates vs LLM" — it's a decision rubric the team can re-apply for every new use case. New feature comes in? Tag it by prior strength, route it to the right tier. The framework outlives the original 8 use cases.
This is also where the 5-persona group-discussion templates came in (Priya/Marcus/Aiko/Jordan/Sam — see story 06). Every routing decision was argued from all five seats: ML, Architecture, DS, MLOps, PM. Made it impossible for any single voice to dominate.
Result
- Cost-per-session: ~$0.022 vs. ~$0.07 in the original proposal, roughly 3× cheaper and comfortably under the $0.03/session budget.
- P99 latency: 1.2s vs. 1.7s in the original proposal — under the 1.5s target.
- ~30% of traffic served without hitting an LLM at all.
- Projected $209K/month savings at scale (Cost-Optimization-User-Stories/US-01-llm-token-cost-optimization.md).
- The hybrid execution model became the architectural North Star — every subsequent feature routed through the same rubric instead of "let's use an LLM."
What I'd want a future ML lead to take away
The exciting part of an LLM project is the LLM. The architect's job is to figure out where the LLM doesn't belong. The cost and latency numbers aren't a constraint to fight — they're a forcing function that pushes you toward the right architecture. If the unit economics don't pencil out at MVP scale, they will be catastrophic at GA scale.