03. Tradeoffs & Decision Frameworks — How Metrics Drove Every Choice
"In production ML, there are no 'right' answers — only tradeoffs with different cost functions. My job was to make those tradeoffs explicit, measurable, and reversible. Every decision I made was backed by a metric, not an opinion."
Decision Framework: How I Structured Tradeoff Analysis
For every significant model inference decision, I used a consistent framework:
```mermaid
graph TD
    A[Identify the Tradeoff] --> B[Define Options<br>2-4 concrete alternatives]
    B --> C[Select Decision Metric<br>What number drives the call?]
    C --> D[Run Experiment<br>A/B test, shadow mode, or offline benchmark]
    D --> E[Measure Outcome<br>Quantify each option]
    E --> F[Make Decision<br>Document rationale]
    F --> G[Set Reversal Trigger<br>Under what conditions do we revisit?]
```
The reversal trigger was key — every decision had a documented condition under which we'd revisit it. This prevented both premature optimization and "set and forget" technical debt.
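The six steps above can be captured in a lightweight decision record. This is a hedged sketch — the field names are illustrative, not the team's actual schema:

```python
from dataclasses import dataclass

@dataclass
class DecisionRecord:
    """One entry in the decision log, mirroring the six-step framework."""
    tradeoff: str               # what is being traded (e.g. "latency vs quality")
    options: list[str]          # 2-4 concrete alternatives
    decision_metric: str        # the single number that drives the call
    experiment: str             # A/B test, shadow mode, or offline benchmark
    outcome: dict[str, float]   # measured decision-metric value per option
    decision: str               # chosen option, with rationale documented elsewhere
    reversal_trigger: str       # measurable condition under which we revisit

    def winner(self) -> str:
        """Option with the highest decision-metric value."""
        return max(self.outcome, key=self.outcome.get)
```

Keeping `reversal_trigger` as a required field is what enforces the "no set-and-forget" rule: a record without one won't construct.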
All Reversal Triggers Dashboard
One-stop reference: when do we revisit each decision?
| # | Decision | Current Choice | Revisit When |
|---|---|---|---|
| 1 | Latency vs Quality | Model tiering (template + Haiku + Sonnet) | Haiku quality < 3.5/5.0 on routed queries, or a new model offers Sonnet quality at Haiku latency |
| 2 | Classifier model | DistilBERT + rule-based fast path | Misclassification-driven escalation > 3%, or Inferentia supports RoBERTa (cost parity) |
| 3 | Guardrail thresholds | Strict ASIN, 0.5/0.7 hallucination | Block rate > 5% of responses, or "can't help" complaints increase |
| 4 | Fine-tune vs prompt | Prompt engineering + RAG | Prompt quality plateaus < 90% satisfaction, or Bedrock enables easy fine-tuning |
| 5 | RAG chunk size | Variable by source type | New embedding model handles long chunks better, or context windows grow enough for 5+ chunks |
| 6 | Real-time vs batch | Async hallucination scoring only | Real-time hallucination scoring latency drops below 20ms |
| 7 | Temperature | Intent-specific (0.1 – 0.7) | New model has better built-in calibration, or hallucination rate on any intent > 3% |
Reusable Decision Template
Use this framework for any new inference tradeoff:
1. IDENTIFY THE TRADEOFF
What are we trading? (latency vs quality, cost vs accuracy, etc.)
2. DEFINE 2-4 CONCRETE OPTIONS
Each with measurable pros/cons.
3. SELECT A DECISION METRIC
A single compound number that captures the tradeoff.
E.g., Quality-adjusted latency, Cost per marginal accuracy point.
4. RUN AN EXPERIMENT
A/B test, shadow mode, or offline benchmark. Never decide on intuition.
5. MEASURE AND DECIDE
Show the math. Document the rationale.
6. SET A REVERSAL TRIGGER
Under what specific, measurable condition do we revisit?
Tradeoff 1: Latency vs. Quality (Model Size)
The Options
| Option | Model | Inference Latency | Response Quality | Monthly Cost |
|---|---|---|---|---|
| A | Claude 3.5 Sonnet for all | ~500ms TTFT | High (4.2/5.0) | ~$143K |
| B | Haiku for simple + Sonnet for complex | ~100ms / ~500ms | Medium/High (3.8 / 4.2) | ~$95K |
| C | Haiku for all | ~100ms TTFT | Medium (3.6/5.0) | ~$48K |
Decision Metric
Quality-adjusted latency: (Response Quality Score × 20) - (P99 Latency in seconds × 10)
This penalized both low quality AND high latency, forcing a balance.
| Option | Quality Term (Score × 20) | Latency Penalty (P99 × 10) | Quality-Adjusted Latency |
|---|---|---|---|
| A (Sonnet all) | 4.2 × 20 = 84 | 1.5s × 10 = 15 | 69 |
| B (Tiered) | 4.0 × 20 = 80 | 0.8s × 10 = 8 | 72 ← Winner |
| C (Haiku all) | 3.6 × 20 = 72 | 0.5s × 10 = 5 | 67 |
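The quality-adjusted-latency scoring above can be reproduced in a few lines (option names and numbers taken directly from the table):

```python
def quality_adjusted_latency(quality: float, p99_latency_s: float) -> float:
    """Quality-adjusted latency: (quality score x 20) - (P99 seconds x 10)."""
    return quality * 20 - p99_latency_s * 10

# (quality score, P99 latency in seconds) per option
options = {
    "A (Sonnet all)": (4.2, 1.5),
    "B (Tiered)":     (4.0, 0.8),
    "C (Haiku all)":  (3.6, 0.5),
}
scores = {name: quality_adjusted_latency(q, lat) for name, (q, lat) in options.items()}
best = max(scores, key=scores.get)   # "B (Tiered)" wins with a score of 72
```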
Decision
Option B — Model tiering. Route 40% of messages to templates (no LLM), 15% to Haiku, 45% to Sonnet based on intent complexity.
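A minimal sketch of the tiered router; the intent buckets below are illustrative placeholders, since the actual routing table isn't shown here — only the 40/15/45 split is:

```python
# Hedged sketch: which intents fall in each bucket is an assumption.
TEMPLATE_INTENTS = {"order_status", "greeting"}   # ~40% of traffic, no LLM call
SIMPLE_INTENTS   = {"faq", "policy"}              # ~15%, routed to Haiku
                                                  # everything else (~45%) -> Sonnet

def route(intent: str) -> str:
    if intent in TEMPLATE_INTENTS:
        return "template"   # canned response, no model inference
    if intent in SIMPLE_INTENTS:
        return "haiku"      # ~100ms TTFT tier
    return "sonnet"         # ~500ms TTFT tier for complex intents
```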
Reversal Trigger
Revisit if: (a) Haiku quality drops below 3.5/5.0 on routed queries, or (b) a new model class offers Sonnet quality at Haiku latency.
Tradeoff 2: Cost vs. Accuracy (Classifier Model Selection)
The Options
| Option | Model | Accuracy | P99 Latency | Monthly Infra Cost |
|---|---|---|---|---|
| A | RoBERTa | 94.8% | 110ms | $28K |
| B | DistilBERT | 92.1% | 50ms | $18K |
| C | TinyBERT | 89.5% | 25ms | $8K |
| D | Rule-based only | 78.0% | 2ms | ~$0 |
Decision Metric
Cost per marginal accuracy point (CMA): For each model upgrade, how much does one additional percentage point of accuracy cost?
| Upgrade Path | Accuracy Gain | Cost Increase | CMA ($/point/month) |
|---|---|---|---|
| Rules → TinyBERT | +11.5% | +$8K | $696/point |
| TinyBERT → DistilBERT | +2.6% | +$10K | $3,846/point |
| DistilBERT → RoBERTa | +2.7% | +$10K | $3,704/point |
| Rules → DistilBERT | +14.1% | +$18K | $1,277/point |
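The CMA figures in the table can be reproduced directly from the accuracy gains and cost deltas:

```python
def cost_per_marginal_accuracy(acc_gain_pts: float, cost_increase: float) -> float:
    """Monthly dollars per additional percentage point of accuracy."""
    return cost_increase / acc_gain_pts

# (accuracy gain in points, monthly cost increase in dollars)
upgrades = {
    "Rules -> TinyBERT":      (11.5,  8_000),
    "TinyBERT -> DistilBERT": (2.6,  10_000),
    "DistilBERT -> RoBERTa":  (2.7,  10_000),
    "Rules -> DistilBERT":    (14.1, 18_000),
}
cma = {path: round(cost_per_marginal_accuracy(gain, cost))
       for path, (gain, cost) in upgrades.items()}
# cma["DistilBERT -> RoBERTa"] -> 3704: the step we declined to pay for
```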
Decision
DistilBERT (Option B) paired with rule-based fast path. The jump from rules to DistilBERT was highly cost-effective ($1,277/point). The marginal gain from DistilBERT to RoBERTa ($3,704/point) wasn't justified given our budget.
What the Accuracy Gap Actually Meant in Production
The 2.7% accuracy gap between DistilBERT (92.1%) and RoBERTa (94.8%) translated to:
500K messages/day × (100% - 40% rule-handled) = 300K BERT-classified messages
300K × 2.7% accuracy gap = ~8,100 additional misclassifications/day
Of those 8,100 misclassifications:
- ~60% still produced acceptable responses (wrong intent but LLM compensated)
- ~25% produced degraded but harmless responses
- ~15% produced noticeably wrong responses = ~1,215 bad experiences/day
Cost to fix: $10K/month additional infra
Cost of bad experiences: ~1,215 × $0.50 estimated support cost = ~$607/day = ~$18K/month
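The back-of-envelope impact math above, as executable arithmetic (the $0.50 per bad experience is the same estimate used in the text, not a measured number):

```python
daily_messages = 500_000
bert_share     = 0.60    # 40% of traffic is handled by rules, never reaches BERT
accuracy_gap   = 0.027   # RoBERTa 94.8% - DistilBERT 92.1%
bad_share      = 0.15    # fraction of misclassifications that are noticeably wrong
support_cost   = 0.50    # assumed dollars per bad experience

misclassified   = daily_messages * bert_share * accuracy_gap  # ~8,100 / day
bad_experiences = misclassified * bad_share                   # ~1,215 / day
monthly_cost    = bad_experiences * support_cost * 30         # ~$18,225 / month
```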
The math showed that the bad-experience cost (~$18K/month) actually exceeded the additional infra cost (~$10K/month). But the DistilBERT misclassifications were largely caught by the LLM layer (which could reason about intent independently), making the effective impact smaller than the naive estimate. We decided the gap was acceptable.
Reversal Trigger
Revisit if: (a) Misclassification-driven escalation rate exceeds 3%, or (b) Inferentia support for RoBERTa reduces cost gap.
Tradeoff 3: Precision vs. Recall for Guardrails
The Problem
The post-generation guardrails pipeline validated LLM responses before sending them to users. But every guardrail check had a precision-recall tradeoff:
- High recall (catch all bad responses) → many false positives → good responses blocked → user sees "I can't help with that" more often → poor UX.
- High precision (only block truly bad responses) → some bad responses slip through → user sees hallucinated prices, fabricated products → trust damage.
The Options by Guardrail Type
ASIN Validation Guardrail:
| Threshold | Precision | Recall | Block Rate | User Impact |
|---|---|---|---|---|
| Strict (block if any ASIN invalid) | 99% | 98% | 8% | Some multi-product responses blocked due to one stale ASIN |
| Moderate (block if >50% ASINs invalid) | 95% | 85% | 3% | Occasional single invalid ASIN slips through |
| Lenient (block only if all ASINs invalid) | 88% | 70% | 1% | Multiple invalid ASINs in edge cases |
Hallucination Score Guardrail:
| Threshold | Precision | Recall | Block Rate | User Impact |
|---|---|---|---|---|
| 0.3 (aggressive) | 62% | 94% | 12% | Many correct responses blocked → bad UX |
| 0.5 (balanced) | 78% | 82% | 5% | Moderate false positives |
| 0.7 (conservative) | 91% | 65% | 2% | Some hallucinations slip through |
Decision Metric
Net quality impact: (Recall × Severity of missed bad responses) - (False Positive Rate × Cost of blocking good responses)
For the hallucination guardrail:
- Severity of a hallucination reaching users: High (trust damage, potential financial liability for wrong prices)
- Cost of blocking a good response: Medium (user sees fallback, slightly worse UX but no harm)
This asymmetry favored higher recall: it's worse to let a hallucinated price through than to occasionally block a correct response.
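The net-quality-impact formula can be sketched as a scoring function. The severity and blocking-cost weights are judgment calls encoding the asymmetry above, not measured values:

```python
def net_quality_impact(recall: float, precision: float,
                       severity: float, block_cost: float) -> float:
    """(recall x severity of missed bad responses)
       - (false-positive rate x cost of blocking good responses)."""
    false_positive_rate = 1.0 - precision
    return recall * severity - false_positive_rate * block_cost
```

With severity weighted well above block cost (e.g. 10 vs 1), the function rewards recall heavily — which is exactly the asymmetry the hallucinated-price scenario demands.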
Decision
- ASIN Validation: Strict (block if any ASIN invalid). Then surgically remove only the invalid ASIN from the response and regenerate that portion — rather than blocking the entire response.
- Hallucination Score: 0.5 threshold for alerting, 0.7 threshold for automated blocking. The 0.5 threshold feeds into a weekly review queue where a human validates and either confirms or dismisses.
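The two-threshold policy can be sketched as follows (threshold values come from the decision above; the action names are illustrative):

```python
def hallucination_action(score: float) -> str:
    """Two-threshold policy: auto-block at >= 0.7, human review queue at >= 0.5."""
    if score >= 0.7:
        return "block"             # serve the fallback instead of the response
    if score >= 0.5:
        return "send_and_review"   # response goes out, lands in the weekly review queue
    return "send"
```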
Reversal Trigger
Revisit if: (a) guardrail block rate exceeds 5% of all responses, or (b) user complaints about "I can't help with that" responses increase.
Tradeoff 4: Fine-Tuned Model vs. Prompt Engineering
The Problem
For improving the LLM's manga-specific knowledge, we had two approaches:
| Approach | How | Expected Quality Gain | Effort | Risk |
|---|---|---|---|---|
| Fine-tuning | Fine-tune Claude on manga Q&A dataset | High (+8-12% on domain tasks) | 4-6 weeks DS + engineering | Overfitting; loss of general capability; can't fine-tune Bedrock models easily |
| Prompt engineering + RAG | Better prompts + better retrieval | Moderate (+5-8% on domain tasks) | 1-2 weeks engineering | Prompt fragility; token cost increase |
Decision Metric
Time-to-value adjusted quality: Quality improvement per week of investment.
| Approach | Quality Gain | Time Investment | Quality/Week | Reversibility |
|---|---|---|---|---|
| Fine-tuning | +10% | 5 weeks | 2%/week | Hard — requires retraining to revert |
| Prompt + RAG | +6% | 1.5 weeks | 4%/week | Easy — revert prompt in seconds |
Decision
Prompt engineering + RAG for the MVP and V2. Reasons:
1. 2x better quality-per-week of investment.
2. Fully reversible — a bad prompt can be rolled back in seconds; a bad fine-tune requires retraining.
3. Bedrock limitations — fine-tuning Claude on Bedrock was not yet available at our scale.
4. RAG improvements compound — better retrieval improves all intents, not just the fine-tuned ones.
We reserved fine-tuning as a V3 option if prompt engineering plateaued.
Reversal Trigger
Revisit if: (a) prompt engineering quality plateaus below 90% user satisfaction, or (b) Bedrock supports easy fine-tuning with fast iteration cycles.
Tradeoff 5: RAG Chunk Size vs. Retrieval Quality
The Options
| Chunk Size | Recall@3 | Precision@3 | Context Noise | Token Cost Impact |
|---|---|---|---|---|
| 128 tokens | 65% | 82% | Low | Lower |
| 256 tokens | 78% | 75% | Medium | Medium |
| 512 tokens | 85% | 64% | High | Higher |
| Variable by source type | 82% | 79% | Medium | Medium |
Decision Metric
Effective retrieval quality: Recall@3 × Precision@3 — a single number that penalizes both missed documents and noisy documents.
| Chunk Size | Recall@3 × Precision@3 | Effective Quality |
|---|---|---|
| 128 tokens | 0.65 × 0.82 | 0.533 |
| 256 tokens | 0.78 × 0.75 | 0.585 |
| 512 tokens | 0.85 × 0.64 | 0.544 |
| Variable | 0.82 × 0.79 | 0.648 ← Winner |
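The effective-quality computation, reproduced from the table:

```python
# (Recall@3, Precision@3) per chunking strategy
strategies = {
    "128 tokens": (0.65, 0.82),
    "256 tokens": (0.78, 0.75),
    "512 tokens": (0.85, 0.64),
    "variable":   (0.82, 0.79),
}
effective = {name: round(r * p, 3) for name, (r, p) in strategies.items()}
best = max(effective, key=effective.get)   # "variable" wins at 0.648
```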
Decision
Variable chunking by source type. Product descriptions: 256 tokens, policies: 512 tokens, review summaries: 128 tokens. Different content types have different information density — one size doesn't fit all.
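A minimal sketch of source-type-aware chunking. The whitespace tokenizer is a deliberate simplification — a production pipeline would count tokens with the embedding model's own tokenizer:

```python
# Chunk sizes per source type, from the decision above.
CHUNK_SIZE_BY_SOURCE = {
    "product_description": 256,
    "policy": 512,
    "review_summary": 128,
}

def chunk(text: str, source_type: str, default_size: int = 256) -> list[str]:
    """Split text into fixed-size token windows chosen by source type."""
    size = CHUNK_SIZE_BY_SOURCE.get(source_type, default_size)
    tokens = text.split()  # naive tokenizer, for illustration only
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]
```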
Reversal Trigger
Revisit if: (a) a new embedding model handles long chunks better (reduces precision penalty for large chunks), or (b) context window sizes grow enough that chunk count can increase from 3 to 5+.
Tradeoff 6: Real-Time vs. Batch Inference
The Problem
Not every model needed real-time inference. Some components could run asynchronously without impacting user experience.
| Component | Real-Time Required? | Justification |
|---|---|---|
| Intent classification | Yes | On the critical path — determines everything downstream |
| Query embedding | Yes | On the critical path — RAG retrieval depends on it |
| Cross-encoder reranking | Mostly yes | On the critical path, but could degrade to raw vector results |
| LLM generation | Yes | The user is waiting for the response |
| Hallucination scoring | No | Async — score after the response is sent |
| Embedding re-indexing | No | Batch — run when product catalog updates |
| Classifier retraining | No | Batch — monthly retraining cycle |
| A/B test analysis | No | Batch — daily/weekly aggregation |
Decision
Hallucination scoring moved to an async pipeline (SQS → Lambda), saving ~50-100ms from the critical path. This was the only component where real-time vs. batch made a meaningful latency difference without sacrificing quality.
The risk: if a hallucinated response was sent and scored after the fact, the user already saw it. Mitigation: for high-severity hallucinations (wrong prices), the validation was still synchronous (price cross-check was fast, ~10ms). Only the sophisticated entailment-based scoring was async.
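A hedged sketch of this split validation path, with Python's in-process `queue` standing in for SQS and all field names illustrative:

```python
import queue

# Stand-in for the SQS queue feeding the async entailment scorer.
scoring_queue: "queue.Queue[dict]" = queue.Queue()

def validate_and_send(response: dict, catalog_prices: dict) -> str:
    """Fast synchronous price cross-check; expensive scoring is deferred."""
    # Synchronous (~10ms-class): block if any quoted price disagrees with catalog.
    for asin, quoted in response.get("prices", {}).items():
        if catalog_prices.get(asin) != quoted:
            return "blocked_price_mismatch"
    # Asynchronous: enqueue for entailment-based scoring after the user sees it.
    scoring_queue.put(response)
    return "sent"
```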
Tradeoff 7: Temperature Tuning — Creativity vs. Accuracy
The Options
| Temperature | Factual Accuracy | Response Creativity | Repetitiveness | Risk |
|---|---|---|---|---|
| 0.0 | Highest | Lowest | Very high | Robotic, repetitive responses |
| 0.1 | Very high | Low | Low | Slightly stilted |
| 0.3 | High | Moderate | Low | Balanced |
| 0.5 | Moderate | High | Very low | Occasional hallucinations |
| 0.7 | Lower | Very high | Very low | More hallucinations |
Decision
Intent-specific temperature — because different intents had different accuracy requirements:
| Intent | Temperature | Rationale |
|---|---|---|
| product_question | 0.1 | Factual answers — minimize creativity, maximize accuracy |
| faq / policy | 0.2 | Policy answers need precision, slight warmth |
| recommendation | 0.5 | Descriptions benefit from creative, engaging language |
| chitchat | 0.7 | Friendly, varied, human-like greetings |
| escalation summary | 0.1 | Accurate summary for human agent, no embellishment |
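The intent-to-temperature mapping as a lookup table; the intent keys are assumed identifiers mirroring the table above:

```python
# Assumed intent keys; the default of 0.3 matches the pre-tuning baseline.
TEMPERATURE_BY_INTENT = {
    "product_question": 0.1,
    "faq": 0.2,
    "policy": 0.2,
    "recommendation": 0.5,
    "chitchat": 0.7,
    "escalation_summary": 0.1,
}

def temperature_for(intent: str, default: float = 0.3) -> float:
    return TEMPERATURE_BY_INTENT.get(intent, default)
```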
Metrics That Validated This Decision
| Intent | Temperature | Hallucination Rate | Thumbs Up Rate | Outcome |
|---|---|---|---|---|
| product_question | 0.3 → 0.1 | 3.2% → 1.1% | 58% → 64% | Significant improvement |
| recommendation | 0.3 → 0.5 | 1.5% → 2.1% | 62% → 71% | Quality improved, hallucination slightly up but acceptable |
| chitchat | 0.3 → 0.7 | N/A (no factual claims) | 55% → 68% | Users liked the variety |
Summary: Decision Log
| # | Tradeoff | Decision | Key Metric | Monthly Cost Impact |
|---|---|---|---|---|
| 1 | Latency vs Quality (model size) | Model tiering (templates + Haiku + Sonnet) | Quality-adjusted latency | -$48K |
| 2 | Cost vs Accuracy (classifier) | DistilBERT + rule-based fast path | Cost per marginal accuracy | -$10K vs RoBERTa |
| 3 | Precision vs Recall (guardrails) | Asymmetric: strict for validation, balanced for hallucination | Net quality impact | Neutral |
| 4 | Fine-tuning vs Prompting | Prompt engineering + RAG | Time-to-value adjusted quality | -$0 (avoided fine-tuning cost) |
| 5 | RAG chunk size | Variable by source type | Effective retrieval quality (R@3 × P@3) | Neutral |
| 6 | Real-time vs Batch | Async hallucination scoring only | Latency savings on critical path | -50ms off P99 |
| 7 | Temperature | Intent-specific temperatures | Hallucination rate per intent | Neutral |
Key Takeaways for Interviews
- "Every decision was backed by a metric" — I didn't choose DistilBERT because it "felt right." I chose it because the cost per marginal accuracy point showed diminishing returns for more expensive models.
- "I defined reversal triggers" — Every decision had a documented condition under which we'd revisit it. This shows intellectual honesty: I acknowledge decisions might be wrong and plan for that.
- "Asymmetric costs drive asymmetric thresholds" — Guardrail precision vs. recall isn't 50/50. The cost of a hallucinated price reaching a user is 10x the cost of blocking a correct response. This asymmetry should be reflected in the threshold.
- "Compound metrics over single metrics" — Quality-adjusted latency, effective retrieval quality (R@3 × P@3), cost per marginal accuracy point — these compound metrics capture tradeoffs that single metrics miss.
- "Not every component needs real-time inference" — Moving hallucination scoring to async saved 50-100ms. Knowing which components can tolerate latency is an architectural insight.
Related Documents
- 02-data-scientist-collaboration.md — Area 9 (Cost-Quality Tradeoff Analysis) — the CPQ framework
- Challenges/real-world-challenges.md §9 — Cost Management — Detailed cost breakdown and optimization strategies
- 15-tradeoffs-challenges.md — High-level tradeoff summary