03. Tradeoffs & Decision Frameworks — How Metrics Drove Every Choice
"In production ML, there are no 'right' answers — only tradeoffs with different cost functions. My job was to make those tradeoffs explicit, measurable, and reversible. Every decision I made was backed by a metric, not an opinion."
Decision Framework: How I Structured Tradeoff Analysis
For every significant model inference decision, I used a consistent framework:
```mermaid
graph TD
    A[Identify the Tradeoff] --> B[Define Options<br>2-4 concrete alternatives]
    B --> C[Select Decision Metric<br>What number drives the call?]
    C --> D[Run Experiment<br>A/B test, shadow mode, or offline benchmark]
    D --> E[Measure Outcome<br>Quantify each option]
    E --> F[Make Decision<br>Document rationale]
    F --> G[Set Reversal Trigger<br>Under what conditions do we revisit?]
```
The reversal trigger was key — every decision had a documented condition under which we'd revisit it. This prevented both premature optimization and "set and forget" technical debt.
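The six steps above can be captured in a lightweight decision record. This is a hedged sketch — the field names are illustrative, not the team's actual schema:

```python
from dataclasses import dataclass

@dataclass
class DecisionRecord:
    """One entry in the decision log, mirroring the six-step framework."""
    tradeoff: str               # what is being traded (e.g. "latency vs quality")
    options: list[str]          # 2-4 concrete alternatives
    decision_metric: str        # the single number that drives the call
    experiment: str             # A/B test, shadow mode, or offline benchmark
    outcome: dict[str, float]   # measured decision-metric value per option
    decision: str               # chosen option, with rationale documented elsewhere
    reversal_trigger: str       # measurable condition under which we revisit

    def winner(self) -> str:
        """Option with the highest decision-metric value."""
        return max(self.outcome, key=self.outcome.get)
```

Keeping `reversal_trigger` as a required field is what enforces the "no set-and-forget" rule: a record without one won't construct.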
All Reversal Triggers Dashboard
One-stop reference: when do we revisit each decision?
| # | Decision | Current Choice | Revisit When |
|---|---|---|---|
| 1 | Latency vs Quality | Model tiering (template + Haiku + Sonnet) | Haiku quality < 3.5/5.0 on routed queries, or a new model offers Sonnet quality at Haiku latency |
| 2 | Classifier model | DistilBERT + rule-based fast path | Misclassification-driven escalation > 3%, or Inferentia supports RoBERTa (cost parity) |
| 3 | Guardrail thresholds | Strict ASIN, 0.5/0.7 hallucination | Block rate > 5% of responses, or "can't help" complaints increase |
| 4 | Fine-tune vs prompt | Prompt engineering + RAG | Prompt quality plateaus < 90% satisfaction, or Bedrock enables easy fine-tuning |
| 5 | RAG chunk size | Variable by source type | New embedding model handles long chunks better, or context windows grow enough for 5+ chunks |
| 6 | Real-time vs batch | Async hallucination scoring only | Real-time hallucination scoring latency drops below 20ms |
| 7 | Temperature | Intent-specific (0.1 – 0.7) | New model has better built-in calibration, or hallucination rate on any intent > 3% |
Reusable Decision Template
Use this framework for any new inference tradeoff:
1. IDENTIFY THE TRADEOFF
What are we trading? (latency vs quality, cost vs accuracy, etc.)
2. DEFINE 2-4 CONCRETE OPTIONS
Each with measurable pros/cons.
3. SELECT A DECISION METRIC
A single compound number that captures the tradeoff.
E.g., Quality-adjusted latency, Cost per marginal accuracy point.
4. RUN AN EXPERIMENT
A/B test, shadow mode, or offline benchmark. Never decide on intuition.
5. MEASURE AND DECIDE
Show the math. Document the rationale.
6. SET A REVERSAL TRIGGER
Under what specific, measurable condition do we revisit?
Tradeoff 1: Latency vs. Quality (Model Size)
The Options
| Option | Model | Inference Latency | Response Quality | Monthly Cost |
|---|---|---|---|---|
| A | Claude 3.5 Sonnet for all | ~500ms TTFT | High (4.2/5.0) | ~$143K |
| B | Haiku for simple + Sonnet for complex | ~100ms / ~500ms | Medium/High (3.8 / 4.2) | ~$95K |
| C | Haiku for all | ~100ms TTFT | Medium (3.6/5.0) | ~$48K |
Decision Metric
Quality-adjusted latency: (Response Quality Score × 20) - (P99 Latency in seconds × 10)
This penalized both low quality AND high latency, forcing a balance.
| Option | Quality Term (Score × 20) | Latency Penalty (P99 × 10) | Quality-Adjusted Latency |
|---|---|---|---|
| A (Sonnet all) | 4.2 × 20 = 84 | 1.5s × 10 = 15 | 69 |
| B (Tiered) | 4.0 × 20 = 80 | 0.8s × 10 = 8 | 72 ← Winner |
| C (Haiku all) | 3.6 × 20 = 72 | 0.5s × 10 = 5 | 67 |
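The quality-adjusted-latency scoring above can be reproduced in a few lines (option names and numbers taken directly from the table):

```python
def quality_adjusted_latency(quality: float, p99_latency_s: float) -> float:
    """Quality-adjusted latency: (quality score x 20) - (P99 seconds x 10)."""
    return quality * 20 - p99_latency_s * 10

# (quality score, P99 latency in seconds) per option
options = {
    "A (Sonnet all)": (4.2, 1.5),
    "B (Tiered)":     (4.0, 0.8),
    "C (Haiku all)":  (3.6, 0.5),
}
scores = {name: quality_adjusted_latency(q, lat) for name, (q, lat) in options.items()}
best = max(scores, key=scores.get)   # "B (Tiered)" wins with a score of 72
```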
Decision
Option B — Model tiering. Route 40% of messages to templates (no LLM), 15% to Haiku, 45% to Sonnet based on intent complexity.
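A minimal sketch of the tiered router; the intent buckets below are illustrative placeholders, since the actual routing table isn't shown here — only the 40/15/45 split is:

```python
# Hedged sketch: which intents fall in each bucket is an assumption.
TEMPLATE_INTENTS = {"order_status", "greeting"}   # ~40% of traffic, no LLM call
SIMPLE_INTENTS   = {"faq", "policy"}              # ~15%, routed to Haiku
                                                  # everything else (~45%) -> Sonnet

def route(intent: str) -> str:
    if intent in TEMPLATE_INTENTS:
        return "template"   # canned response, no model inference
    if intent in SIMPLE_INTENTS:
        return "haiku"      # ~100ms TTFT tier
    return "sonnet"         # ~500ms TTFT tier for complex intents
```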
Reversal Trigger
Revisit if: (a) Haiku quality drops below 3.5/5.0 on routed queries, or (b) a new model class offers Sonnet quality at Haiku latency.
Tradeoff 2: Cost vs. Accuracy (Classifier Model Selection)
The Options
| Option | Model | Accuracy | P99 Latency | Monthly Infra Cost |
|---|---|---|---|---|
| A | RoBERTa | 94.8% | 110ms | $28K |
| B | DistilBERT | 92.1% | 50ms | $18K |
| C | TinyBERT | 89.5% | 25ms | $8K |
| D | Rule-based only | 78.0% | 2ms | ~$0 |
Decision Metric
Cost per marginal accuracy point (CMA): For each model upgrade, how much does one additional percentage point of accuracy cost?
| Upgrade Path | Accuracy Gain | Cost Increase | CMA ($/point/month) |
|---|---|---|---|
| Rules → TinyBERT | +11.5% | +$8K | $696/point |
| TinyBERT → DistilBERT | +2.6% | +$10K | $3,846/point |
| DistilBERT → RoBERTa | +2.7% | +$10K | $3,704/point |
| Rules → DistilBERT | +14.1% | +$18K | $1,277/point |
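The CMA figures in the table can be reproduced directly from the accuracy gains and cost deltas:

```python
def cost_per_marginal_accuracy(acc_gain_pts: float, cost_increase: float) -> float:
    """Monthly dollars per additional percentage point of accuracy."""
    return cost_increase / acc_gain_pts

# (accuracy gain in points, monthly cost increase in dollars)
upgrades = {
    "Rules -> TinyBERT":      (11.5,  8_000),
    "TinyBERT -> DistilBERT": (2.6,  10_000),
    "DistilBERT -> RoBERTa":  (2.7,  10_000),
    "Rules -> DistilBERT":    (14.1, 18_000),
}
cma = {path: round(cost_per_marginal_accuracy(gain, cost))
       for path, (gain, cost) in upgrades.items()}
# cma["DistilBERT -> RoBERTa"] -> 3704: the step we declined to pay for
```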
Decision
DistilBERT (Option B) paired with rule-based fast path. The jump from rules to DistilBERT was highly cost-effective ($1,277/point). The marginal gain from DistilBERT to RoBERTa ($3,704/point) wasn't justified given our budget.
What the Accuracy Gap Actually Meant in Production
The 2.7% accuracy gap between DistilBERT (92.1%) and RoBERTa (94.8%) translated to:
500K messages/day × (100% - 40% rule-handled) = 300K BERT-classified messages
300K × 2.7% accuracy gap = ~8,100 additional misclassifications/day
Of those 8,100 misclassifications:
- ~60% still produced acceptable responses (wrong intent but LLM compensated)
- ~25% produced degraded but harmless responses
- ~15% produced noticeably wrong responses = ~1,215 bad experiences/day
Cost to fix: $10K/month additional infra
Cost of bad experiences: ~1,215 × $0.50 estimated support cost = ~$607/day = ~$18K/month
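The back-of-envelope impact math above, as executable arithmetic (the $0.50 per bad experience is the same estimate used in the text, not a measured number):

```python
daily_messages = 500_000
bert_share     = 0.60    # 40% of traffic is handled by rules, never reaches BERT
accuracy_gap   = 0.027   # RoBERTa 94.8% - DistilBERT 92.1%
bad_share      = 0.15    # fraction of misclassifications that are noticeably wrong
support_cost   = 0.50    # assumed dollars per bad experience

misclassified   = daily_messages * bert_share * accuracy_gap  # ~8,100 / day
bad_experiences = misclassified * bad_share                   # ~1,215 / day
monthly_cost    = bad_experiences * support_cost * 30         # ~$18,225 / month
```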
The math showed that the bad-experience cost (~$18K/month) actually exceeded the additional infra cost (~$10K/month). But the DistilBERT misclassifications were largely caught by the LLM layer (which could reason about intent independently), making the effective impact smaller than the naive estimate. We decided the gap was acceptable.
Reversal Trigger
Revisit if: (a) Misclassification-driven escalation rate exceeds 3%, or (b) Inferentia support for RoBERTa reduces cost gap.
Tradeoff 3: Precision vs. Recall for Guardrails
The Problem
The post-generation guardrails pipeline validated LLM responses before sending them to users. But every guardrail check had a precision-recall tradeoff:
- High recall (catch all bad responses) → many false positives → good responses blocked → user sees "I can't help with that" more often → poor UX.
- High precision (only block truly bad responses) → some bad responses slip through → user sees hallucinated prices, fabricated products → trust damage.
The Options by Guardrail Type
ASIN Validation Guardrail:
| Threshold | Precision | Recall | Block Rate | User Impact |
|---|---|---|---|---|
| Strict (block if any ASIN invalid) | 99% | 98% | 8% | Some multi-product responses blocked due to one stale ASIN |
| Moderate (block if >50% ASINs invalid) | 95% | 85% | 3% | Occasional single invalid ASIN slips through |
| Lenient (block only if all ASINs invalid) | 88% | 70% | 1% | Multiple invalid ASINs in edge cases |
Hallucination Score Guardrail:
| Threshold | Precision | Recall | Block Rate | User Impact |
|---|---|---|---|---|
| 0.3 (aggressive) | 62% | 94% | 12% | Many correct responses blocked → bad UX |
| 0.5 (balanced) | 78% | 82% | 5% | Moderate false positives |
| 0.7 (conservative) | 91% | 65% | 2% | Some hallucinations slip through |
Decision Metric
Net quality impact: (Recall × Severity of missed bad responses) - (False Positive Rate × Cost of blocking good responses)
For the hallucination guardrail:
- Severity of a hallucination reaching users: High (trust damage, potential financial liability for wrong prices)
- Cost of blocking a good response: Medium (user sees fallback, slightly worse UX but no harm)
This asymmetry favored higher recall: it's worse to let a hallucinated price through than to occasionally block a correct response.
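The net-quality-impact formula can be sketched as a scoring function. The severity and blocking-cost weights are judgment calls encoding the asymmetry above, not measured values:

```python
def net_quality_impact(recall: float, precision: float,
                       severity: float, block_cost: float) -> float:
    """(recall x severity of missed bad responses)
       - (false-positive rate x cost of blocking good responses)."""
    false_positive_rate = 1.0 - precision
    return recall * severity - false_positive_rate * block_cost
```

With severity weighted well above block cost (e.g. 10 vs 1), the function rewards recall heavily — which is exactly the asymmetry the hallucinated-price scenario demands.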
Decision
- ASIN Validation: Strict (block if any ASIN invalid). Then surgically remove only the invalid ASIN from the response and regenerate that portion — rather than blocking the entire response.
- Hallucination Score: 0.5 threshold for alerting, 0.7 threshold for automated blocking. The 0.5 threshold feeds into a weekly review queue where a human validates and either confirms or dismisses.
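The two-threshold policy can be sketched as follows (threshold values come from the decision above; the action names are illustrative):

```python
def hallucination_action(score: float) -> str:
    """Two-threshold policy: auto-block at >= 0.7, human review queue at >= 0.5."""
    if score >= 0.7:
        return "block"             # serve the fallback instead of the response
    if score >= 0.5:
        return "send_and_review"   # response goes out, lands in the weekly review queue
    return "send"
```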
Reversal Trigger
Revisit if: (a) guardrail block rate exceeds 5% of all responses, or (b) user complaints about "I can't help with that" responses increase.
Tradeoff 4: Fine-Tuned Model vs. Prompt Engineering
The Problem
For improving the LLM's manga-specific knowledge, we had two approaches:
| Approach | How | Expected Quality Gain | Effort | Risk |
|---|---|---|---|---|
| Fine-tuning | Fine-tune Claude on manga Q&A dataset | High (+8-12% on domain tasks) | 4-6 weeks DS + engineering | Overfitting; loss of general capability; can't fine-tune Bedrock models easily |
| Prompt engineering + RAG | Better prompts + better retrieval | Moderate (+5-8% on domain tasks) | 1-2 weeks engineering | Prompt fragility; token cost increase |
Decision Metric
Time-to-value adjusted quality: Quality improvement per week of investment.
| Approach | Quality Gain | Time Investment | Quality/Week | Reversibility |
|---|---|---|---|---|
| Fine-tuning | +10% | 5 weeks | 2%/week | Hard — requires retraining to revert |
| Prompt + RAG | +6% | 1.5 weeks | 4%/week | Easy — revert prompt in seconds |
Decision
Prompt engineering + RAG for the MVP and V2. Reasons:
1. 2x better quality-per-week of investment.
2. Fully reversible — a bad prompt can be rolled back in seconds; a bad fine-tune requires retraining.
3. Bedrock limitations — fine-tuning Claude on Bedrock was not yet available at our scale.
4. RAG improvements compound — better retrieval improves all intents, not just the fine-tuned ones.
We reserved fine-tuning as a V3 option if prompt engineering plateaued.
Reversal Trigger
Revisit if: (a) prompt engineering quality plateaus below 90% user satisfaction, or (b) Bedrock supports easy fine-tuning with fast iteration cycles.
Tradeoff 5: RAG Chunk Size vs. Retrieval Quality
The Options
| Chunk Size | Recall@3 | Precision@3 | Context Noise | Token Cost Impact |
|---|---|---|---|---|
| 128 tokens | 65% | 82% | Low | Lower |
| 256 tokens | 78% | 75% | Medium | Medium |
| 512 tokens | 85% | 64% | High | Higher |
| Variable by source type | 82% | 79% | Medium | Medium |
Decision Metric
Effective retrieval quality: Recall@3 × Precision@3 — a single number that penalizes both missed documents and noisy documents.
| Chunk Size | Recall@3 × Precision@3 | Effective Quality |
|---|---|---|
| 128 tokens | 0.65 × 0.82 | 0.533 |
| 256 tokens | 0.78 × 0.75 | 0.585 |
| 512 tokens | 0.85 × 0.64 | 0.544 |
| Variable | 0.82 × 0.79 | 0.648 ← Winner |
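The effective-quality computation, reproduced from the table:

```python
# (Recall@3, Precision@3) per chunking strategy
strategies = {
    "128 tokens": (0.65, 0.82),
    "256 tokens": (0.78, 0.75),
    "512 tokens": (0.85, 0.64),
    "variable":   (0.82, 0.79),
}
effective = {name: round(r * p, 3) for name, (r, p) in strategies.items()}
best = max(effective, key=effective.get)   # "variable" wins at 0.648
```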
Decision
Variable chunking by source type. Product descriptions: 256 tokens, policies: 512 tokens, review summaries: 128 tokens. Different content types have different information density — one size doesn't fit all.
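A minimal sketch of source-type-aware chunking. The whitespace tokenizer is a deliberate simplification — a production pipeline would count tokens with the embedding model's own tokenizer:

```python
# Chunk sizes per source type, from the decision above.
CHUNK_SIZE_BY_SOURCE = {
    "product_description": 256,
    "policy": 512,
    "review_summary": 128,
}

def chunk(text: str, source_type: str, default_size: int = 256) -> list[str]:
    """Split text into fixed-size token windows chosen by source type."""
    size = CHUNK_SIZE_BY_SOURCE.get(source_type, default_size)
    tokens = text.split()  # naive tokenizer, for illustration only
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]
```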
Reversal Trigger
Revisit if: (a) a new embedding model handles long chunks better (reduces precision penalty for large chunks), or (b) context window sizes grow enough that chunk count can increase from 3 to 5+.
Tradeoff 6: Real-Time vs. Batch Inference
The Problem
Not every model needed real-time inference. Some components could run asynchronously without impacting user experience.
| Component | Real-Time Required? | Justification |
|---|---|---|
| Intent classification | Yes | On the critical path — determines everything downstream |
| Query embedding | Yes | On the critical path — RAG retrieval depends on it |
| Cross-encoder reranking | Mostly yes | On the critical path, but could degrade to raw vector results |
| LLM generation | Yes | The user is waiting for the response |
| Hallucination scoring | No | Async — score after the response is sent |
| Embedding re-indexing | No | Batch — run when product catalog updates |
| Classifier retraining | No | Batch — monthly retraining cycle |
| A/B test analysis | No | Batch — daily/weekly aggregation |
Decision
Hallucination scoring moved to an async pipeline (SQS → Lambda), saving ~50-100ms from the critical path. This was the only component where real-time vs. batch made a meaningful latency difference without sacrificing quality.
The risk: if a hallucinated response was sent and scored after the fact, the user already saw it. Mitigation: for high-severity hallucinations (wrong prices), the validation was still synchronous (price cross-check was fast, ~10ms). Only the sophisticated entailment-based scoring was async.
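A hedged sketch of this split validation path, with Python's in-process `queue` standing in for SQS and all field names illustrative:

```python
import queue

# Stand-in for the SQS queue feeding the async entailment scorer.
scoring_queue: "queue.Queue[dict]" = queue.Queue()

def validate_and_send(response: dict, catalog_prices: dict) -> str:
    """Fast synchronous price cross-check; expensive scoring is deferred."""
    # Synchronous (~10ms-class): block if any quoted price disagrees with catalog.
    for asin, quoted in response.get("prices", {}).items():
        if catalog_prices.get(asin) != quoted:
            return "blocked_price_mismatch"
    # Asynchronous: enqueue for entailment-based scoring after the user sees it.
    scoring_queue.put(response)
    return "sent"
```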
Tradeoff 7: Temperature Tuning — Creativity vs. Accuracy
The Options
| Temperature | Factual Accuracy | Response Creativity | Repetitiveness | Risk |
|---|---|---|---|---|
| 0.0 | Highest | Lowest | Very high | Robotic, repetitive responses |
| 0.1 | Very high | Low | Low | Slightly stilted |
| 0.3 | High | Moderate | Low | Balanced |
| 0.5 | Moderate | High | Very low | Occasional hallucinations |
| 0.7 | Lower | Very high | Very low | More hallucinations |
Decision
Intent-specific temperature — because different intents had different accuracy requirements:
| Intent | Temperature | Rationale |
|---|---|---|
| product_question | 0.1 | Factual answers — minimize creativity, maximize accuracy |
| faq / policy | 0.2 | Policy answers need precision, slight warmth |
| recommendation | 0.5 | Descriptions benefit from creative, engaging language |
| chitchat | 0.7 | Friendly, varied, human-like greetings |
| escalation summary | 0.1 | Accurate summary for human agent, no embellishment |
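The intent-to-temperature mapping as a lookup table; the intent keys are assumed identifiers mirroring the table above:

```python
# Assumed intent keys; the default of 0.3 matches the pre-tuning baseline.
TEMPERATURE_BY_INTENT = {
    "product_question": 0.1,
    "faq": 0.2,
    "policy": 0.2,
    "recommendation": 0.5,
    "chitchat": 0.7,
    "escalation_summary": 0.1,
}

def temperature_for(intent: str, default: float = 0.3) -> float:
    return TEMPERATURE_BY_INTENT.get(intent, default)
```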
Metrics That Validated This Decision
| Intent | Temperature | Hallucination Rate | Thumbs Up Rate | Outcome |
|---|---|---|---|---|
| product_question | 0.3 → 0.1 | 3.2% → 1.1% | 58% → 64% | Significant improvement |
| recommendation | 0.3 → 0.5 | 1.5% → 2.1% | 62% → 71% | Quality improved, hallucination slightly up but acceptable |
| chitchat | 0.3 → 0.7 | N/A (no factual claims) | 55% → 68% | Users liked the variety |
Summary: Decision Log
| # | Tradeoff | Decision | Key Metric | Monthly Cost Impact |
|---|---|---|---|---|
| 1 | Latency vs Quality (model size) | Model tiering (templates + Haiku + Sonnet) | Quality-adjusted latency | -$48K |
| 2 | Cost vs Accuracy (classifier) | DistilBERT + rule-based fast path | Cost per marginal accuracy | -$10K vs RoBERTa |
| 3 | Precision vs Recall (guardrails) | Asymmetric: strict for validation, balanced for hallucination | Net quality impact | Neutral |
| 4 | Fine-tuning vs Prompting | Prompt engineering + RAG | Time-to-value adjusted quality | -$0 (avoided fine-tuning cost) |
| 5 | RAG chunk size | Variable by source type | Effective retrieval quality (R@3 × P@3) | Neutral |
| 6 | Real-time vs Batch | Async hallucination scoring only | Latency savings on critical path | -50ms off P99 |
| 7 | Temperature | Intent-specific temperatures | Hallucination rate per intent | Neutral |
Key Takeaways for Interviews
- "Every decision was backed by a metric" — I didn't choose DistilBERT because it "felt right." I chose it because the cost per marginal accuracy point showed diminishing returns for more expensive models.
- "I defined reversal triggers" — Every decision had a documented condition under which we'd revisit it. This shows intellectual honesty: I acknowledge decisions might be wrong and plan for that.
- "Asymmetric costs drive asymmetric thresholds" — Guardrail precision vs. recall isn't 50/50. The cost of a hallucinated price reaching a user is 10x the cost of blocking a correct response. This asymmetry should be reflected in the threshold.
- "Compound metrics over single metrics" — Quality-adjusted latency, effective retrieval quality (R@3 × P@3), cost per marginal accuracy point — these compound metrics capture tradeoffs that single metrics miss.
- "Not every component needs real-time inference" — Moving hallucination scoring to async saved 50-100ms. Knowing which components can tolerate latency is an architectural insight.
Related Documents
- 02-data-scientist-collaboration.md — Area 9 (Cost-Quality Tradeoff Analysis) — the CPQ framework
- Challenges/real-world-challenges.md §9 — Cost Management — Detailed cost breakdown and optimization strategies
- 15-tradeoffs-challenges.md — High-level tradeoff summary