
03. Tradeoffs & Decision Frameworks — How Metrics Drove Every Choice

"In production ML, there are no 'right' answers — only tradeoffs with different cost functions. My job was to make those tradeoffs explicit, measurable, and reversible. Every decision I made was backed by a metric, not an opinion."


Decision Framework: How I Structured Tradeoff Analysis

For every significant model inference decision, I used a consistent framework:

graph TD
    A[Identify the Tradeoff] --> B[Define Options<br>2-4 concrete alternatives]
    B --> C[Select Decision Metric<br>What number drives the call?]
    C --> D[Run Experiment<br>A/B test, shadow mode, or offline benchmark]
    D --> E[Measure Outcome<br>Quantify each option]
    E --> F[Make Decision<br>Document rationale]
    F --> G[Set Reversal Trigger<br>Under what conditions do we revisit?]

The reversal trigger was key — every decision had a documented condition under which we'd revisit it. This prevented both premature optimization and "set and forget" technical debt.


All Reversal Triggers Dashboard

One-stop reference: when do we revisit each decision?

| # | Decision | Current Choice | Revisit When |
|---|----------|----------------|--------------|
| 1 | Latency vs Quality | Model tiering (template + Haiku + Sonnet) | Haiku quality < 3.5/5.0 on routed queries, or a new model offers Sonnet quality at Haiku latency |
| 2 | Classifier model | DistilBERT + rule-based fast path | Misclassification-driven escalation > 3%, or Inferentia supports RoBERTa (cost parity) |
| 3 | Guardrail thresholds | Strict ASIN, 0.5/0.7 hallucination | Block rate > 5% of responses, or "can't help" complaints increase |
| 4 | Fine-tune vs prompt | Prompt engineering + RAG | Prompt quality plateaus < 90% satisfaction, or Bedrock enables easy fine-tuning |
| 5 | RAG chunk size | Variable by source type | New embedding model handles long chunks better, or context windows grow enough for 5+ chunks |
| 6 | Real-time vs batch | Async hallucination scoring only | Real-time hallucination scoring latency drops below 20ms |
| 7 | Temperature | Intent-specific (0.1 – 0.7) | New model has better built-in calibration, or hallucination rate on any intent > 3% |
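
As a minimal sketch, the numerically checkable triggers above can be evaluated against a weekly metrics snapshot. The metric names and the source of `weekly_metrics` are illustrative (in practice they would come from the monitoring pipeline); the thresholds restate the table.

```python
# Sketch: reversal triggers as automated checks over a weekly metrics dict.
REVERSAL_TRIGGERS = {
    "latency_vs_quality":   lambda m: m["haiku_quality"] < 3.5,
    "classifier_model":     lambda m: m["misclassification_escalation_rate"] > 0.03,
    "guardrail_thresholds": lambda m: m["guardrail_block_rate"] > 0.05,
    "fine_tune_vs_prompt":  lambda m: m["user_satisfaction"] < 0.90,
    "temperature":          lambda m: m["max_intent_hallucination_rate"] > 0.03,
}

def decisions_to_revisit(weekly_metrics: dict) -> list[str]:
    """Return the decisions whose reversal condition fired this week."""
    return [name for name, check in REVERSAL_TRIGGERS.items() if check(weekly_metrics)]

# Example week in which Haiku quality dips below the 3.5/5.0 floor.
print(decisions_to_revisit({
    "haiku_quality": 3.4,
    "misclassification_escalation_rate": 0.021,
    "guardrail_block_rate": 0.032,
    "user_satisfaction": 0.93,
    "max_intent_hallucination_rate": 0.018,
}))  # -> ['latency_vs_quality']
```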

Reusable Decision Template

Use this framework for any new inference tradeoff:

1. IDENTIFY THE TRADEOFF
   What are we trading? (latency vs quality, cost vs accuracy, etc.)

2. DEFINE 2-4 CONCRETE OPTIONS
   Each with measurable pros/cons.

3. SELECT A DECISION METRIC
   A single compound number that captures the tradeoff.
   E.g., Quality-adjusted latency, Cost per marginal accuracy point.

4. RUN AN EXPERIMENT
   A/B test, shadow mode, or offline benchmark. Never decide on intuition.

5. MEASURE AND DECIDE
   Show the math. Document the rationale.

6. SET A REVERSAL TRIGGER
   Under what specific, measurable condition do we revisit?
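
As a sketch, the six steps map naturally onto a small structured decision record; the field names and `winner()` helper below are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass

# Sketch: the six-step template as a structured decision record, so every
# tradeoff gets documented the same way. Field names are illustrative.
@dataclass
class TradeoffDecision:
    tradeoff: str                   # 1. what is being traded
    options: list[str]              # 2. the 2-4 concrete alternatives
    decision_metric: str            # 3. the single compound number used to decide
    experiment: str                 # 4. A/B test, shadow mode, or offline benchmark
    measurements: dict[str, float]  # 5. decision-metric value per option
    rationale: str                  # 5. why the winner won
    reversal_trigger: str           # 6. measurable condition under which we revisit

    def winner(self) -> str:
        """Option with the best metric value (assumes higher is better)."""
        return max(self.measurements, key=self.measurements.get)
```

Tradeoff 1 below, for instance, would carry measurements {"A": 69, "B": 72, "C": 67} and a reversal trigger tied to Haiku quality.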

Tradeoff 1: Latency vs. Quality (Model Size)

The Options

| Option | Model | Inference Latency | Response Quality | Monthly Cost |
|--------|-------|-------------------|------------------|--------------|
| A | Claude 3.5 Sonnet for all | ~500ms TTFT | High (4.2/5.0) | ~$143K |
| B | Haiku for simple + Sonnet for complex | ~100ms / ~500ms | Medium/High (3.8 / 4.2) | ~$95K |
| C | Haiku for all | ~100ms TTFT | Medium (3.6/5.0) | ~$48K |

Decision Metric

Quality-adjusted latency: (Response Quality Score × 20) - (P99 Latency in seconds × 10)

This penalized both low quality AND high latency, forcing a balance.

| Option | Quality Score | P99 Latency | Quality-Adjusted Latency |
|--------|---------------|-------------|--------------------------|
| A (Sonnet all) | 4.2 × 20 = 84 | 1.5s × 10 = 15 | 69 |
| B (Tiered) | 4.0 × 20 = 80 | 0.8s × 10 = 8 | 72 ← Winner |
| C (Haiku all) | 3.6 × 20 = 72 | 0.5s × 10 = 5 | 67 |
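
The metric is simple enough to sanity-check in a few lines; this sketch just restates the quality scores and P99 latencies from the tables above.

```python
# Sketch: quality-adjusted latency = (quality x 20) - (P99 latency in s x 10).
def quality_adjusted_latency(quality: float, p99_latency_s: float) -> float:
    """Higher is better: penalizes both low quality and high latency."""
    return quality * 20 - p99_latency_s * 10

options = {
    "A (Sonnet all)": (4.2, 1.5),
    "B (Tiered)":     (4.0, 0.8),
    "C (Haiku all)":  (3.6, 0.5),
}
for name, (quality, latency) in options.items():
    print(f"{name}: {quality_adjusted_latency(quality, latency):.0f}")
# A: 69, B: 72, C: 67 -> Option B wins
```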

Decision

Option B — Model tiering. Route 40% of messages to templates (no LLM), 15% to Haiku, 45% to Sonnet based on intent complexity.

Reversal Trigger

Revisit if: (a) Haiku quality drops below 3.5/5.0 on routed queries, or (b) a new model class offers Sonnet quality at Haiku latency.


Tradeoff 2: Cost vs. Accuracy (Classifier Model Selection)

The Options

| Option | Model | Accuracy | P99 Latency | Monthly Infra Cost |
|--------|-------|----------|-------------|--------------------|
| A | RoBERTa | 94.8% | 110ms | $28K |
| B | DistilBERT | 92.1% | 50ms | $18K |
| C | TinyBERT | 89.5% | 25ms | $8K |
| D | Rule-based only | 78.0% | 2ms | ~$0 |

Decision Metric

Cost per marginal accuracy point (CMA): For each model upgrade, how much does one additional percentage point of accuracy cost?

| Upgrade Path | Accuracy Gain | Cost Increase | CMA ($/point/month) |
|--------------|---------------|---------------|---------------------|
| Rules → TinyBERT | +11.5% | +$8K | $696/point |
| TinyBERT → DistilBERT | +2.6% | +$10K | $3,846/point |
| DistilBERT → RoBERTa | +2.7% | +$10K | $3,704/point |
| Rules → DistilBERT | +14.1% | +$18K | $1,277/point |
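
The same CMA numbers fall out of a small helper over the options table; this is a sketch whose model/cost dict simply restates the figures above.

```python
# Sketch: cost per marginal accuracy point (CMA) for each upgrade path.
MODELS = {  # model: (accuracy %, monthly infra cost $)
    "rules":      (78.0, 0),
    "tinybert":   (89.5, 8_000),
    "distilbert": (92.1, 18_000),
    "roberta":    (94.8, 28_000),
}

def cma(from_model: str, to_model: str) -> float:
    """Monthly cost per additional percentage point of accuracy."""
    acc_from, cost_from = MODELS[from_model]
    acc_to, cost_to = MODELS[to_model]
    return (cost_to - cost_from) / (acc_to - acc_from)

print(round(cma("rules", "distilbert")))    # ~1277 $/point
print(round(cma("distilbert", "roberta")))  # ~3704 $/point
```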

Decision

DistilBERT (Option B) paired with rule-based fast path. The jump from rules to DistilBERT was highly cost-effective ($1,277/point). The marginal gain from DistilBERT to RoBERTa ($3,704/point) wasn't justified given our budget.

What the Accuracy Gap Actually Meant in Production

The 2.7% accuracy gap between DistilBERT (92.1%) and RoBERTa (94.8%) translated to:

500K messages/day × (100% - 40% rule-handled) = 300K BERT-classified messages
300K × 2.7% accuracy gap = ~8,100 additional misclassifications/day

Of those 8,100 misclassifications:
- ~60% still produced acceptable responses (wrong intent but LLM compensated)
- ~25% produced degraded but harmless responses
- ~15% produced noticeably wrong responses = ~1,215 bad experiences/day

Cost to fix: $10K/month additional infra
Cost of bad experiences: ~1,215 × $0.50 estimated support cost = ~$607/day = ~$18K/month

On paper, the bad-experience cost (~$18K/month) actually exceeded the extra infra cost (~$10K/month), which would seem to argue for RoBERTa. But the estimate overstated the real impact: most DistilBERT misclassifications were caught by the LLM layer (which could reason about intent independently of the classifier label), so the effective cost was meaningfully below the naive ~$18K figure. We decided the gap was acceptable.
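
Written out as code, the back-of-envelope math looks like this; the figures restate the estimates above, and the 15% share and $0.50 support cost are the same assumptions used in the prose.

```python
# Sketch: estimated production impact of the 2.7% accuracy gap.
daily_messages = 500_000
rule_handled = 0.40
accuracy_gap = 0.027            # RoBERTa 94.8% vs DistilBERT 92.1%
noticeably_wrong_share = 0.15   # share of misclassifications users actually feel
support_cost_per_bad = 0.50     # estimated $ per bad experience

bert_classified = daily_messages * (1 - rule_handled)          # 300,000/day
extra_misclassified = bert_classified * accuracy_gap            # ~8,100/day
bad_experiences = extra_misclassified * noticeably_wrong_share  # ~1,215/day
monthly_bad_cost = bad_experiences * support_cost_per_bad * 30  # ~$18K/month

print(f"{extra_misclassified:,.0f} extra misclassifications/day, "
      f"~${monthly_bad_cost:,.0f}/month bad-experience cost vs $10K/month infra")
```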

Reversal Trigger

Revisit if: (a) misclassification-driven escalation rate exceeds 3%, or (b) Inferentia support for RoBERTa narrows the cost gap.


Tradeoff 3: Precision vs. Recall for Guardrails

The Problem

The post-generation guardrails pipeline validated LLM responses before sending them to users. But every guardrail check had a precision-recall tradeoff:

  • High recall (catch all bad responses) → many false positives → good responses blocked → user sees "I can't help with that" more often → poor UX.
  • High precision (only block truly bad responses) → some bad responses slip through → user sees hallucinated prices, fabricated products → trust damage.

The Options by Guardrail Type

ASIN Validation Guardrail:

| Threshold | Precision | Recall | Block Rate | User Impact |
|-----------|-----------|--------|------------|-------------|
| Strict (block if any ASIN invalid) | 99% | 98% | 8% | Some multi-product responses blocked due to one stale ASIN |
| Moderate (block if >50% ASINs invalid) | 95% | 85% | 3% | Occasional single invalid ASIN slips through |
| Lenient (block only if all ASINs invalid) | 88% | 70% | 1% | Multiple invalid ASINs in edge cases |

Hallucination Score Guardrail:

| Threshold | Precision | Recall | Block Rate | User Impact |
|-----------|-----------|--------|------------|-------------|
| 0.3 (aggressive) | 62% | 94% | 12% | Many correct responses blocked → bad UX |
| 0.5 (balanced) | 78% | 82% | 5% | Moderate false positives |
| 0.7 (conservative) | 91% | 65% | 2% | Some hallucinations slip through |

Decision Metric

Net quality impact: (Recall × Severity of missed bad responses) - (False Positive Rate × Cost of blocking good responses)

For the hallucination guardrail:

  • Severity of a hallucination reaching users: High (trust damage, potential financial liability for wrong prices)
  • Cost of blocking a good response: Medium (user sees fallback, slightly worse UX but no harm)

This asymmetry favored higher recall: it's worse to let a hallucinated price through than to occasionally block a correct response.

Decision

  • ASIN Validation: Strict (block if any ASIN invalid). Then surgically remove only the invalid ASIN from the response and regenerate that portion — rather than blocking the entire response.
  • Hallucination Score: 0.5 threshold for alerting, 0.7 threshold for automated blocking. The 0.5 threshold feeds into a weekly review queue where a human validates and either confirms or dismisses.
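
A minimal sketch of the two-threshold hallucination policy described above; the function name and the review-queue hook are illustrative, not the production implementation.

```python
# Sketch: 0.5 routes to a weekly human-review queue, 0.7 blocks outright.
ALERT_THRESHOLD = 0.5
BLOCK_THRESHOLD = 0.7

def apply_hallucination_guardrail(response: str, hallucination_score: float,
                                  enqueue_for_review) -> str | None:
    """Return the response to send, or None if it must be blocked."""
    if hallucination_score >= BLOCK_THRESHOLD:
        return None                                         # fall back to a safe template
    if hallucination_score >= ALERT_THRESHOLD:
        enqueue_for_review(response, hallucination_score)    # weekly human validation
    return response
```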

Reversal Trigger

Revisit if: (a) guardrail block rate exceeds 5% of all responses, or (b) user complaints about "I can't help with that" responses increase.


Tradeoff 4: Fine-Tuned Model vs. Prompt Engineering

The Problem

For improving the LLM's manga-specific knowledge, we had two approaches:

| Approach | How | Expected Quality Gain | Effort | Risk |
|----------|-----|-----------------------|--------|------|
| Fine-tuning | Fine-tune Claude on manga Q&A dataset | High (+8-12% on domain tasks) | 4-6 weeks DS + engineering | Overfitting; loss of general capability; can't fine-tune Bedrock models easily |
| Prompt engineering + RAG | Better prompts + better retrieval | Moderate (+5-8% on domain tasks) | 1-2 weeks engineering | Prompt fragility; token cost increase |

Decision Metric

Time-to-value adjusted quality: Quality improvement per week of investment.

| Approach | Quality Gain | Time Investment | Quality/Week | Reversibility |
|----------|--------------|-----------------|--------------|---------------|
| Fine-tuning | +10% | 5 weeks | 2%/week | Hard — requires retraining to revert |
| Prompt + RAG | +6% | 1.5 weeks | 4%/week | Easy — revert prompt in seconds |

Decision

Prompt engineering + RAG for the MVP and V2. Reasons:

  1. 2x better quality-per-week of investment.
  2. Fully reversible — a bad prompt can be rolled back in seconds; a bad fine-tune requires retraining.
  3. Bedrock limitations — fine-tuning Claude on Bedrock was not yet available at our scale.
  4. RAG improvements compound — better retrieval improves all intents, not just the fine-tuned ones.

We reserved fine-tuning as a V3 option if prompt engineering plateaued.

Reversal Trigger

Revisit if: (a) prompt engineering quality plateaus below 90% user satisfaction, or (b) Bedrock supports easy fine-tuning with fast iteration cycles.


Tradeoff 5: RAG Chunk Size vs. Retrieval Quality

The Options

| Chunk Size | Recall@3 | Precision@3 | Context Noise | Token Cost Impact |
|------------|----------|-------------|---------------|-------------------|
| 128 tokens | 65% | 82% | Low | Lower |
| 256 tokens | 78% | 75% | Medium | Medium |
| 512 tokens | 85% | 64% | High | Higher |
| Variable by source type | 82% | 79% | Medium | Medium |

Decision Metric

Effective retrieval quality: Recall@3 × Precision@3 — a single number that penalizes both missed documents and noisy documents.

| Chunk Size | Recall@3 × Precision@3 | Effective Quality |
|------------|------------------------|-------------------|
| 128 tokens | 0.65 × 0.82 | 0.533 |
| 256 tokens | 0.78 × 0.75 | 0.585 |
| 512 tokens | 0.85 × 0.64 | 0.544 |
| Variable | 0.82 × 0.79 | 0.648 ← Winner |

Decision

Variable chunking by source type. Product descriptions: 256 tokens, policies: 512 tokens, review summaries: 128 tokens. Different content types have different information density — one size doesn't fit all.
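
A sketch of what variable chunking by source type might look like, assuming a naive whitespace splitter as a stand-in for real tokenization and boundary-aware chunking.

```python
# Sketch: chunk sizes keyed by source type, per the decision above.
CHUNK_SIZE_BY_SOURCE = {
    "product_description": 256,
    "policy": 512,
    "review_summary": 128,
}

def chunk_document(text: str, source_type: str) -> list[str]:
    """Split a document into chunks sized for its source type."""
    size = CHUNK_SIZE_BY_SOURCE.get(source_type, 256)
    tokens = text.split()  # stand-in for a real tokenizer
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]
```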

Reversal Trigger

Revisit if: (a) a new embedding model handles long chunks better (reduces precision penalty for large chunks), or (b) context window sizes grow enough that chunk count can increase from 3 to 5+.


Tradeoff 6: Real-Time vs. Batch Inference

The Problem

Not every model needed real-time inference. Some components could run asynchronously without impacting user experience.

| Component | Real-Time Required? | Justification |
|-----------|---------------------|---------------|
| Intent classification | Yes | On the critical path — determines everything downstream |
| Query embedding | Yes | On the critical path — RAG retrieval depends on it |
| Cross-encoder reranking | Mostly yes | On the critical path, but could degrade to raw vector results |
| LLM generation | Yes | The user is waiting for the response |
| Hallucination scoring | No | Async — score after the response is sent |
| Embedding re-indexing | No | Batch — run when product catalog updates |
| Classifier retraining | No | Batch — monthly retraining cycle |
| A/B test analysis | No | Batch — daily/weekly aggregation |

Decision

Hallucination scoring moved to an async pipeline (SQS → Lambda), saving ~50-100ms from the critical path. This was the only component where real-time vs. batch made a meaningful latency difference without sacrificing quality.

The risk: if a hallucinated response was sent and scored after the fact, the user already saw it. Mitigation: for high-severity hallucinations (wrong prices), the validation was still synchronous (price cross-check was fast, ~10ms). Only the sophisticated entailment-based scoring was async.
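
A sketch of that split, assuming a boto3 SQS client for the async leg; the queue URL, helper names, and price-extraction regex are illustrative.

```python
import json
import re

import boto3

# Sketch: fast price cross-check stays synchronous on the critical path;
# entailment-based hallucination scoring is enqueued to SQS for a Lambda consumer.
sqs = boto3.client("sqs")
SCORING_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/hallucination-scoring"  # placeholder

def prices_match_catalog(text: str, catalog_prices: dict) -> bool:
    """Illustrative stand-in for the fast (~10ms) price cross-check."""
    quoted = {float(p) for p in re.findall(r"\$(\d+(?:\.\d{2})?)", text)}
    return quoted.issubset({round(p, 2) for p in catalog_prices.values()})

def validate_and_send(response_id: str, response_text: str, catalog_prices: dict) -> bool:
    # Synchronous: block immediately if a quoted price contradicts the catalog.
    if not prices_match_catalog(response_text, catalog_prices):
        return False
    # Asynchronous: the heavier entailment-based scoring happens after the
    # response is sent, via the SQS -> Lambda pipeline.
    sqs.send_message(
        QueueUrl=SCORING_QUEUE_URL,
        MessageBody=json.dumps({"response_id": response_id, "text": response_text}),
    )
    return True
```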


Tradeoff 7: Temperature Tuning — Creativity vs. Accuracy

The Options

| Temperature | Factual Accuracy | Response Creativity | Repetitiveness | Risk |
|-------------|------------------|---------------------|----------------|------|
| 0.0 | Highest | Lowest | Very high | Robotic, repetitive responses |
| 0.1 | Very high | Low | Low | Slightly stilted |
| 0.3 | High | Moderate | Low | Balanced |
| 0.5 | Moderate | High | Very low | Occasional hallucinations |
| 0.7 | Lower | Very high | Very low | More hallucinations |

Decision

Intent-specific temperature — because different intents had different accuracy requirements:

| Intent | Temperature | Rationale |
|--------|-------------|-----------|
| product_question | 0.1 | Factual answers — minimize creativity, maximize accuracy |
| faq / policy | 0.2 | Policy answers need precision, slight warmth |
| recommendation | 0.5 | Descriptions benefit from creative, engaging language |
| chitchat | 0.7 | Friendly, varied, human-like greetings |
| escalation summary | 0.1 | Accurate summary for human agent, no embellishment |
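
A sketch of how the intent-to-temperature mapping might be applied on a Bedrock call, assuming the boto3 `bedrock-runtime` client and Anthropic's messages request format; the model ID and prompt wiring are illustrative, while the temperature map mirrors the table above.

```python
import json

import boto3

# Sketch: pick the sampling temperature from the classified intent.
TEMPERATURE_BY_INTENT = {
    "product_question": 0.1,
    "faq": 0.2,
    "policy": 0.2,
    "recommendation": 0.5,
    "chitchat": 0.7,
    "escalation_summary": 0.1,
}

bedrock = boto3.client("bedrock-runtime")

def generate(intent: str, prompt: str) -> str:
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "temperature": TEMPERATURE_BY_INTENT.get(intent, 0.3),  # conservative default
        "messages": [{"role": "user", "content": prompt}],
    }
    resp = bedrock.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # illustrative model ID
        body=json.dumps(body),
    )
    return json.loads(resp["body"].read())["content"][0]["text"]
```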

Metrics That Validated This Decision

| Intent | Temp Change | Hallucination Rate | Thumbs Up Rate | Outcome |
|--------|-------------|--------------------|----------------|---------|
| product_question | 0.3 → 0.1 | 3.2% → 1.1% | 58% → 64% | Significant improvement |
| recommendation | 0.3 → 0.5 | 1.5% → 2.1% | 62% → 71% | Quality improved, hallucination slightly up but acceptable |
| chitchat | 0.3 → 0.7 | N/A (no factual claims) | 55% → 68% | Users liked the variety |

Summary: Decision Log

| # | Tradeoff | Decision | Key Metric | Monthly Cost Impact |
|---|----------|----------|------------|---------------------|
| 1 | Latency vs Quality (model size) | Model tiering (templates + Haiku + Sonnet) | Quality-adjusted latency | -$48K |
| 2 | Cost vs Accuracy (classifier) | DistilBERT + rule-based fast path | Cost per marginal accuracy | -$10K vs RoBERTa |
| 3 | Precision vs Recall (guardrails) | Asymmetric: strict for validation, balanced for hallucination | Net quality impact | Neutral |
| 4 | Fine-tuning vs Prompting | Prompt engineering + RAG | Time-to-value adjusted quality | -$0 (avoided fine-tuning cost) |
| 5 | RAG chunk size | Variable by source type | Effective retrieval quality (R@3 × P@3) | Neutral |
| 6 | Real-time vs Batch | Async hallucination scoring only | Latency savings on critical path | -50ms off P99 |
| 7 | Temperature | Intent-specific temperatures | Hallucination rate per intent | Neutral |

Key Takeaways for Interviews

  1. "Every decision was backed by a metric" — I didn't choose DistilBERT because it "felt right." I chose it because the cost per marginal accuracy point showed diminishing returns for more expensive models.

  2. "I defined reversal triggers" — Every decision had a documented condition under which we'd revisit. This shows intellectual honesty: I acknowledge decisions might be wrong and plan for that.

  3. "Asymmetric costs drive asymmetric thresholds" — Guardrail precision vs. recall isn't 50/50. The cost of a hallucinated price reaching a user is 10x the cost of blocking a correct response. This asymmetry should be reflected in the threshold.

  4. "Compound metrics over single metrics" — Quality-adjusted latency, effective retrieval quality (R@3 × P@3), cost per quality point — these compound metrics capture tradeoffs that single metrics miss.

  5. "Not every component needs real-time inference" — Moving hallucination scoring to async saved 50-100ms. Knowing which components can tolerate latency is an architectural insight.