02. Data Scientist Collaboration — Simplifying Production Inference Together
"The hardest part of model inference wasn't the models — it was bridging the gap between what worked in a Jupyter notebook and what survived production traffic at 50K concurrent sessions. I partnered with data scientists across 9 distinct areas, and we had to build shared ownership of the full lifecycle."
All 9 Collaboration Areas at a Glance
| # | Area | Who Led | Key Outcome | Interview Sound Bite |
|---|---|---|---|---|
| 1 | Model Selection & Benchmarking | Joint | Chose DistilBERT over RoBERTa | "2.7% accuracy gap wasn't worth $170K/year — we used cost per marginal accuracy point." |
| 2 | Fine-Tuning Intent Classifier | DS | +3.8% accuracy with augmented data | "Manga-specific accuracy jumped from 71% to 89% with synthetic training data." |
| 3 | Evaluation Metrics Definition | Joint | Shared metric thresholds | "DS metrics alone were necessary but not sufficient — we needed end-to-end shared metrics." |
| 4 | Embedding Fine-Tuning for RAG | DS | +14% Recall@3 | "Fine-tuned embeddings doubled the cosine similarity separation gap." |
| 5 | Prompt Optimization & A/B Testing | Joint | Chose structured output ($630/day vs $1,740/day) | "Best quality-per-dollar, not best absolute quality." |
| 6 | Hallucination Detection & Scoring | DS built, I productionized | 0.5 alert / 0.7 block thresholds | "I couldn't validate subtle hallucinations programmatically — needed NLP entailment models." |
| 7 | LLM Evaluation Framework | Joint | 7-dimension evaluation with golden dataset | "Traditional ML eval doesn't apply to LLMs — we built multi-dimensional scoring." |
| 8 | Model Drift Monitoring | DS detected, I retrained | Monthly retraining flywheel | "First question was always: model problem or data problem?" |
| 9 | Cost-Quality Tradeoff Analysis | Joint | CPQ framework ($3K/point threshold) | "CPQ turned subjective debates into math problems." |
How We Structured the Partnership
At Amazon, engineering and data science are separate roles with different skill sets. For MangaAssist, I was the senior engineer owning the production pipeline; the DS team (2 data scientists, 1 applied scientist) owned model quality. The overlap — where most of the hard problems lived — was jointly owned.
graph TD
subgraph "Engineer Owned (Me)"
E1[Inference infrastructure<br>SageMaker, Bedrock, scaling]
E2[API design, latency optimization]
E3[Monitoring, alerting, dashboards]
E4[Deployment pipelines<br>CI/CD, canary, rollback]
E5[Cost tracking & optimization]
E6[Guardrails & validation pipeline]
end
subgraph "Jointly Owned"
J1[Model evaluation framework]
J2[Metric definitions & thresholds]
J3[A/B test design & analysis]
J4[Production incident triage<br>Is it a model issue or infra issue?]
J5[Feature engineering for classifier]
J6[Prompt engineering & testing]
end
subgraph "Data Science Owned"
D1[Model architecture selection]
D2[Training data curation & labeling]
D3[Hyperparameter tuning]
D4[Model training & experimentation]
D5[Offline evaluation & benchmarks]
D6[Embedding fine-tuning]
end
style J1 fill:#FFD700
style J2 fill:#FFD700
style J3 fill:#FFD700
style J4 fill:#FFD700
style J5 fill:#FFD700
style J6 fill:#FFD700
Area 1: Model Selection & Benchmarking
The Challenge
For the intent classifier, the DS team evaluated 5 architectures. But their benchmark results — run on a single GPU with batch inference — didn't reflect production reality. I needed to ground their selection in production constraints.
What the DS Team Proposed
| Model | Offline Accuracy | Offline F1 | Training Time |
|---|---|---|---|
| BERT-base | 94.2% | 0.93 | 8 hours |
| DistilBERT | 92.1% | 0.91 | 3 hours |
| RoBERTa | 94.8% | 0.94 | 12 hours |
| TinyBERT | 89.5% | 0.88 | 2 hours |
| Rule-based (regex) | 78.0% | 0.75 | N/A |
The DS team recommended RoBERTa — highest accuracy. I pushed back. Here's why.
What I Brought to the Table
I benchmarked each model on production infrastructure (SageMaker real-time endpoint, ml.g4dn.xlarge instance):
| Model | Avg Latency | P99 Latency | Model Size | Instance Cost | Accuracy |
|---|---|---|---|---|---|
| BERT-base | 35ms | 90ms | 440MB | $0.736/hr | 94.2% |
| DistilBERT | 15ms | 50ms | 260MB | $0.736/hr | 92.1% |
| RoBERTa | 40ms | 110ms | 500MB | $1.204/hr (needs larger GPU) | 94.8% |
| TinyBERT | 8ms | 25ms | 56MB | $0.340/hr (CPU OK) | 89.5% |
| Rule-based | <1ms | 2ms | N/A | $0 | 78.0% |
The Decision
We chose DistilBERT — not the most accurate, but the best cost-accuracy-latency tradeoff:
- 2.7% less accurate than RoBERTa but 2.6x faster and 40% cheaper on infrastructure.
- The 2.6% accuracy gap above TinyBERT was worth the extra latency because misclassifications cascaded through the entire pipeline (wrong context assembly → wrong RAG chunks → poor LLM response).
- We paired it with the rule-based fast path (handles 40% of messages at <1ms), so DistilBERT only processed the remaining 60% — further reducing its cost impact.
How We Resolved the Disagreement
The DS team initially resisted — "we're leaving accuracy on the table." I showed them the production math:
RoBERTa path: 40ms classifier + 500ms LLM = 540ms minimum
DistilBERT path: 15ms classifier + 500ms LLM = 515ms minimum
Difference per request: 25ms
At 500K messages/day: 25ms × 500K ≈ 3.5 additional compute-hours/day (~1,270 hours/year)
Plus: larger GPU instance = $468/day more in SageMaker costs
Accuracy difference: 2.7% (94.8% vs 92.1%)
= ~13,500 additional correctly classified messages/day
Cost per additional correct classification: $0.035/classification
The DS agreed that 2.7% accuracy wasn't worth $170K/year. We made the decision based on cost per marginal accuracy point — a metric we defined together.
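A minimal sketch of that marginal-accuracy math, using the figures from the tables above (the function and constant names are mine):

```python
# Hypothetical sketch: cost per marginal accuracy point, using the numbers above.
MESSAGES_PER_DAY = 500_000

def marginal_accuracy_cost(extra_infra_cost_per_day: float,
                           acc_candidate: float, acc_baseline: float) -> dict:
    """Cost of the extra accuracy a larger model buys, in production terms."""
    extra_correct_per_day = (acc_candidate - acc_baseline) * MESSAGES_PER_DAY
    return {
        "extra_correct_per_day": extra_correct_per_day,                               # ~13,500
        "cost_per_extra_correct": extra_infra_cost_per_day / extra_correct_per_day,   # ~$0.035
        "extra_cost_per_year": extra_infra_cost_per_day * 365,                        # ~$170K
    }

# RoBERTa (94.8%) vs DistilBERT (92.1%), +$468/day in SageMaker costs
print(marginal_accuracy_cost(468, 0.948, 0.921))
```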
Area 2: Fine-Tuning the Intent Classifier
The Challenge
The DS team trained DistilBERT on ~50,000 labeled examples. But production performance lagged offline benchmarks by 4-6%. The gap came from distribution mismatch: training data was curated from historical Amazon customer service logs, but MangaAssist users wrote differently (more casual, more manga-specific jargon, more emoji).
How We Collaborated
Step 1 — I provided production data, DS analyzed distribution gaps:
I built a pipeline that sampled 500 low-confidence classifications per week from production and sent them to a labeling queue (a minimal sketch of this sampler follows the findings below). The DS team analyzed these and found:
Distribution gaps:
- 22% of misclassifications involved manga-specific terms
("tankōbon", "shōnen jump", "seinen", "mangaka")
- 15% involved colloquial/slang queries
("Is this peak?", "W manga", "goated", "mid")
- 18% involved multi-intent messages
("I want to return this and also recommend something else")
- 12% involved Japanese text mixed with English
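A minimal sketch of the weekly low-confidence sampler described above, assuming classification records arrive as plain dicts and the labeling queue is a pluggable callback (the 0.6 confidence cutoff is my assumption, not from the original):

```python
import random
from typing import Callable, Iterable

SAMPLE_SIZE = 500          # samples per week, per the pipeline described above
CONFIDENCE_CUTOFF = 0.6    # assumed threshold for "low confidence"

def sample_low_confidence(records: Iterable[dict],
                          send_to_labeling_queue: Callable[[dict], None]) -> int:
    """Route a random sample of low-confidence classifications to human labelers."""
    low_conf = [r for r in records if r["confidence"] < CONFIDENCE_CUTOFF]
    batch = random.sample(low_conf, min(SAMPLE_SIZE, len(low_conf)))
    for record in batch:
        # Keep only what labelers need: the message text and the model's guess.
        send_to_labeling_queue({
            "message_text": record["text"],
            "predicted_intent": record["intent"],
            "confidence": record["confidence"],
        })
    return len(batch)
```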
Step 2 — DS generated synthetic training data, I validated in production:
The DS team used Claude to generate 5,000 synthetic manga-specific training examples:
Prompt: "Generate a customer message to a manga store chatbot
with intent={intent}. Use casual language, manga terminology,
and occasionally include Japanese words."
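A minimal sketch of how such a generation call might look against Claude on Amazon Bedrock — the model ID, sampling settings, and response parsing here are my assumptions, not necessarily the DS team's actual pipeline:

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

PROMPT_TEMPLATE = (
    "Generate a customer message to a manga store chatbot with intent={intent}. "
    "Use casual language, manga terminology, and occasionally include Japanese words."
)

def generate_synthetic_example(intent: str) -> str:
    # Model ID and temperature are illustrative choices, not from the original.
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 200,
        "temperature": 0.9,   # high temperature for varied synthetic phrasing
        "messages": [{"role": "user",
                      "content": PROMPT_TEMPLATE.format(intent=intent)}],
    }
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        body=json.dumps(body),
    )
    payload = json.loads(response["body"].read())
    return payload["content"][0]["text"]
```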
These were human-validated (intern reviewers) and added to the training set. After retraining:
| Metric | Before Augmentation | After Augmentation | Change |
|---|---|---|---|
| Overall Accuracy | 88.3% (production) | 92.1% (production) | +3.8% |
| Manga-specific Accuracy | 71.2% | 89.5% | +18.3% |
| Multi-intent Detection | 65.0% | 82.3% | +17.3% |
| Slang/Colloquial | 73.4% | 86.7% | +13.3% |
Step 3 — Joint retraining cadence:
We established a monthly retraining cycle (a minimal sketch of the go/no-go gate follows the list):
1. I exported the month's low-confidence samples (automated pipeline).
2. The DS team labeled and augmented the training data.
3. DS trained the new model and ran the offline evaluation suite.
4. I deployed to shadow mode and compared production metrics.
5. Joint go/no-go decision based on: accuracy ≥ 90%, no intent regression >2%, escalation rate stable.
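A minimal sketch of the step-5 gate, assuming offline metrics arrive as simple dicts (the ±1% escalation-rate tolerance is my reading of "stable", not a stated number):

```python
# Hypothetical sketch of the joint go/no-go gate for a retrained classifier.
def retraining_go_no_go(new_accuracy: float,
                        new_f1_by_intent: dict[str, float],
                        old_f1_by_intent: dict[str, float],
                        escalation_rate_delta: float) -> bool:
    if new_accuracy < 0.90:                                        # accuracy floor
        return False
    for intent, old_f1 in old_f1_by_intent.items():
        if old_f1 - new_f1_by_intent.get(intent, 0.0) > 0.02:      # no intent regression >2%
            return False
    if abs(escalation_rate_delta) > 0.01:                          # escalation rate stays stable
        return False
    return True
```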
Area 3: Defining Evaluation Metrics
The Challenge
The DS team and I initially disagreed on what "good" meant. DS optimized for accuracy on a test set. I cared about end-to-end user impact. We needed shared metrics that both sides trusted.
The Metrics Framework We Designed Together
graph TD
subgraph "DS-Centric Metrics (Offline)"
M1[Accuracy]
M2[Precision / Recall / F1<br>per intent class]
M3[Confusion Matrix]
M4[AUC-ROC]
end
subgraph "Engineering-Centric Metrics (Online)"
M5[P50/P99 Latency]
M6[Throughput<br>inferences/second]
M7[Error Rate]
M8[Cost per Inference]
end
subgraph "Shared Metrics (End-to-End)"
M9[Escalation Rate]
M10[Thumbs Up Rate]
M11[Hallucination Rate]
M12[Conversion Rate]
end
M1 --> M9
M2 --> M9
M5 --> M12
M7 --> M10
style M9 fill:#FFD700
style M10 fill:#FFD700
style M11 fill:#FFD700
style M12 fill:#FFD700
Key insight: DS metrics (accuracy, F1) were necessary but not sufficient. A model with 95% accuracy but 200ms latency could yield worse business outcomes than a 90% accuracy model with 15ms latency — because the faster model allowed more context assembly time, which improved LLM response quality.
Metric Thresholds We Agreed On
| Metric | Threshold | Owner | Measurement |
|---|---|---|---|
| Intent classification accuracy | ≥ 90% | DS | Weekly offline evaluation (500 labeled samples) |
| Per-class F1 | ≥ 0.85 for all classes | DS | Same evaluation set |
| No class regression | ≤ 2% drop on any class | DS | Comparison with previous model |
| P99 latency | ≤ 50ms | Engineering (me) | CloudWatch real-time |
| Throughput | ≥ 1,000 inferences/sec/instance | Engineering (me) | Load test before deploy |
| Escalation rate impact | ≤ +1% change | Joint | A/B or canary deployment |
| Thumbs up rate impact | ≥ -2% change (no more than a 2% drop) | Joint | A/B or canary deployment |
Area 4: Embedding Model Fine-Tuning for RAG
The Challenge
The Titan Embeddings V2 model produced decent general-purpose embeddings but struggled with manga-specific vocabulary. "Shōnen" and "action manga" should be semantically close — but the base embeddings placed them far apart because "shōnen" was a rare token.
How We Collaborated
DS owned: Training a contrastive learning adapter on manga-specific query-document pairs. They used ~2,000 hand-curated pairs:
Query: "dark fantasy manga like Berserk"
Positive Document: "Vinland Saga - a dark historical manga with brutal action..."
Negative Document: "My Neighbor Totoro art book - wholesome Ghibli illustrations..."
I owned: Productionizing the fine-tuned embeddings (a minimal sketch of the query-time path follows this list):
- Deploying the adapter as a lightweight Lambda layer that transforms Titan embeddings before the OpenSearch kNN search.
- Rebuilding the OpenSearch index with the new embeddings (35M chunks re-embedded, ~4 hours).
- A/B testing the fine-tuned retrieval against the baseline.
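A minimal sketch of that query-time path — Titan embedding, adapter transform, then OpenSearch kNN — assuming the adapter is a linear projection matrix shipped with the Lambda layer (the endpoint, index, field name, and adapter path are placeholders):

```python
import json
import boto3
import numpy as np
from opensearchpy import OpenSearch

bedrock = boto3.client("bedrock-runtime")
opensearch = OpenSearch(hosts=["https://example-opensearch-endpoint:443"])  # placeholder endpoint

# Assumed adapter: a learned projection matrix bundled with the Lambda layer.
ADAPTER = np.load("/opt/adapter/projection.npy")   # hypothetical path

def embed_query(query: str) -> list[float]:
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": query}),
    )
    base = np.array(json.loads(response["body"].read())["embedding"])
    adapted = ADAPTER @ base                  # transform before the kNN search
    adapted /= np.linalg.norm(adapted)        # re-normalize for cosine similarity
    return adapted.tolist()

def retrieve_chunks(query: str, k: int = 3) -> list[dict]:
    body = {"size": k,
            "query": {"knn": {"chunk_embedding": {          # assumed vector field name
                "vector": embed_query(query), "k": k}}}}
    hits = opensearch.search(index="manga-chunks", body=body)["hits"]["hits"]
    return [h["_source"] for h in hits]
```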
Results:
| Metric | Base Titan | Fine-tuned Adapter | Improvement |
|---|---|---|---|
| Recall@3 | 72% | 86% | +14% |
| Recall@5 | 81% | 92% | +11% |
| MRR (Mean Reciprocal Rank) | 0.68 | 0.81 | +0.13 |
| Manga-specific query Recall@3 | 58% | 83% | +25% |
The +14% Recall@3 improvement translated to measurably better LLM responses — the LLM had better source material to work with.
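For reference, the retrieval metrics themselves are small functions; a minimal sketch, assuming each evaluation query has a single known relevant document:

```python
# Hypothetical sketch: Recall@K and MRR over an offline retrieval evaluation set.
def recall_at_k(ranked_ids: list[str], relevant_id: str, k: int) -> float:
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def reciprocal_rank(ranked_ids: list[str], relevant_id: str) -> float:
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

def evaluate(queries: list[dict]) -> dict:
    """Each query dict: {"ranked_ids": [...], "relevant_id": "..."}."""
    n = len(queries)
    return {
        "recall@3": sum(recall_at_k(q["ranked_ids"], q["relevant_id"], 3) for q in queries) / n,
        "recall@5": sum(recall_at_k(q["ranked_ids"], q["relevant_id"], 5) for q in queries) / n,
        "mrr": sum(reciprocal_rank(q["ranked_ids"], q["relevant_id"]) for q in queries) / n,
    }
```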
Area 5: Prompt Optimization & A/B Testing
The Challenge
Prompt engineering lived in an awkward middle ground: DS brought NLP expertise (understanding token probabilities, temperature effects, instruction-following patterns), while I brought production constraints (latency budgets, token costs, format requirements for downstream parsing).
How We Collaborated
DS-led experiments:
- Testing different instruction styles (imperative vs. conversational)
- Optimizing temperature per intent type
- Few-shot example selection (which examples improved quality most?)
- Chain-of-thought vs. direct answer for complex queries
Engineering-led experiments (me):
- Prompt compression (same quality with fewer tokens)
- Token budget optimization (how much context can we cut before quality degrades?)
- Format instructions that survive streaming (valid JSON at any truncation point)
- Prompt caching compatibility (structuring prompts so the cacheable prefix is maximized)
Joint experiment: We ran a 2-week A/B test on recommendation prompt variants:
| Variant | Prompt Style | Avg Response Quality (human-rated) | Avg Output Tokens | Cost/Response |
|---|---|---|---|---|
| A (baseline) | Direct instruction | 3.8 / 5.0 | 180 tokens | $0.0027 |
| B (few-shot) | 3 examples + instruction | 4.2 / 5.0 | 210 tokens | $0.0035 |
| C (chain-of-thought) | Reasoning + answer | 4.4 / 5.0 | 350 tokens | $0.0058 |
| D (structured output) | JSON template + instruction | 4.0 / 5.0 | 140 tokens | $0.0021 |
Decision: We chose Variant D — not the highest quality, but the best quality-per-dollar. At 300K LLM calls/day, the cost difference between C ($1,740/day) and D ($630/day) was $1,110/day ($33K/month). The 0.4 quality point difference didn't justify it.
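The quality-per-dollar comparison behind that decision is plain arithmetic; a minimal sketch using the table above and the 300K calls/day volume:

```python
# Hypothetical sketch: rank prompt variants by quality per dollar at production volume.
LLM_CALLS_PER_DAY = 300_000

variants = {
    "A (baseline)":          {"quality": 3.8, "cost_per_response": 0.0027},
    "B (few-shot)":          {"quality": 4.2, "cost_per_response": 0.0035},
    "C (chain-of-thought)":  {"quality": 4.4, "cost_per_response": 0.0058},
    "D (structured output)": {"quality": 4.0, "cost_per_response": 0.0021},
}

for name, v in variants.items():
    daily_cost = v["cost_per_response"] * LLM_CALLS_PER_DAY
    print(f"{name}: ${daily_cost:,.0f}/day, "
          f"{v['quality'] / daily_cost * 1000:.2f} quality points per $1K/day")
# Variant D leads on quality per dollar (~6.3 points per $1K/day vs. ~2.5 for C).
```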
Area 6: Hallucination Detection & Scoring Pipeline
The Challenge
DS expertise was critical for building automated hallucination detection. I could validate prices and ASINs programmatically, but detecting subtler hallucinations ("the deluxe edition includes exclusive author commentary" — fabricated but plausible) required NLP techniques.
How We Collaborated
DS built the hallucination scoring model:
1. A claim extraction pipeline (using a smaller LLM): extract all factual claims from the response.
2. An entailment classifier: for each claim, check if it's entailed by the source context (RAG chunks + product data).
3. Score: 0 (fully grounded) to 1 (completely fabricated).
I built the production infrastructure:
1. Async scoring pipeline (didn't add latency to the response path).
2. CloudWatch metric for the daily hallucination score average.
3. Alert if the daily average exceeded 0.03.
4. Quarterly review of 500 flagged responses to calibrate the scoring model.
Joint calibration:
The entailment classifier had its own precision-recall tradeoff:
| Threshold | Precision (% of flagged responses that are actual hallucinations) | Recall (% of actual hallucinations caught) | False Positive Rate |
|---|---|---|---|
| 0.3 (aggressive) | 62% | 94% | 12% |
| 0.5 (balanced) | 78% | 82% | 5% |
| 0.7 (conservative) | 91% | 65% | 2% |
We chose 0.5 for alerting (catch most hallucinations, tolerate some false positives in the alert stream) and 0.7 for automated blocking (only block responses we're very confident are hallucinated — can't afford false positives that block good responses).
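A minimal sketch of how the score and the two thresholds fit together — the claim extractor and entailment model are passed in as stand-ins, since those were DS-built components, and the aggregation here is my assumption:

```python
from typing import Callable

# Thresholds agreed in the calibration above.
ALERT_THRESHOLD = 0.5   # flag for the review/alert stream
BLOCK_THRESHOLD = 0.7   # block the response outright

def hallucination_score(response: str, source_context: str,
                        extract_claims: Callable[[str], list[str]],
                        entailment_prob: Callable[[str, str], float]) -> float:
    """0 = fully grounded, 1 = completely fabricated (per the DS scoring design)."""
    claims = extract_claims(response)
    if not claims:
        return 0.0
    # A claim's "fabrication" is 1 minus how strongly the source context entails it.
    fabrication = [1.0 - entailment_prob(source_context, claim) for claim in claims]
    return sum(fabrication) / len(fabrication)

def triage(score: float) -> str:
    if score >= BLOCK_THRESHOLD:
        return "block"
    if score >= ALERT_THRESHOLD:
        return "alert"
    return "pass"
```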
Area 7: LLM Evaluation Framework
The Challenge
Traditional ML evaluation (accuracy on a test set) doesn't apply to LLM outputs. There's no single "correct answer" — a response can be factually correct but poorly formatted, or well-written but missing key information. We needed a multi-dimensional evaluation framework.
How We Built It Together
DS designed the evaluation dimensions:
| Dimension | What It Measures | Scoring Method |
|---|---|---|
| Factual Correctness | Are all claims accurate? | Automated (ASIN/price validation) + human audit |
| Completeness | Did the response address the user's question fully? | Human rating (1-5) |
| Relevance | Is the response about the right topic? | Automated (intent match) + human rating |
| Fluency | Is the response well-written and natural? | BLEU/ROUGE against golden responses + human rating |
| Helpfulness | Would a user find this useful? | Human rating (1-5) + thumbs up/down proxy |
| Safety | No toxicity, PII, competitor mentions? | Automated guardrails |
| Format Compliance | Correct JSON structure, markdown rendering? | Automated schema validation |
I built the evaluation pipeline:
graph TD
A[Model/Prompt Change] --> B[Trigger Evaluation Pipeline]
B --> C[Run 500 Golden Queries]
C --> D[Automated Scoring<br>BLEU, ROUGE, format, factual]
D --> E{All automated<br>checks pass?}
E -->|No| F[Block Deployment<br>Alert DS + Engineering]
E -->|Yes| G[Sample 50 for<br>Human Evaluation]
G --> H[Human Raters Score<br>Completeness, Helpfulness]
H --> I{Human scores<br>above threshold?}
I -->|No| F
I -->|Yes| J[Approve for<br>Canary Deployment]
Golden dataset curation was a joint effort:
- I selected 500 representative queries from production (stratified by intent, complexity, and edge cases).
- The DS team wrote reference responses and scoring rubrics.
- We reviewed and revised quarterly — removing stale questions, adding new edge cases discovered in production.
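A minimal sketch of the automated half of that pipeline — run the golden queries, apply the automated checks, and either block or hand a 50-response sample to human raters (the check functions are placeholders):

```python
import random
from typing import Callable

# Hypothetical sketch of the automated gate before human evaluation.
def run_automated_gate(golden_queries: list[dict],
                       generate_response: Callable[[str], str],
                       automated_checks: dict[str, Callable[[str, dict], bool]]) -> dict:
    failures, responses = [], []
    for item in golden_queries:                                 # the 500-query golden dataset
        response = generate_response(item["query"])
        responses.append({"item": item, "response": response})
        for check_name, check in automated_checks.items():      # format, factual, safety, ...
            if not check(response, item):
                failures.append((item["query"], check_name))
    if failures:
        return {"status": "block_deployment", "failures": failures}
    # All automated checks passed: sample 50 responses for human raters.
    return {"status": "human_eval",
            "sample": random.sample(responses, min(50, len(responses)))}
```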
Area 8: Model Drift Monitoring & Retraining Pipelines
The Challenge
Both the intent classifier and the LLM's behavior drifted over time. DS needed to detect drift; I needed to retrain and redeploy safely.
How We Collaborated
DS owned drift detection (a minimal sketch of these checks follows the list):
- Intent distribution monitoring (alert if any intent's share shifts by >5% week-over-week)
- Classification confidence distribution monitoring (alert if mean confidence drops below 0.85)
- Embedding drift detection (compare query embedding distributions month-over-month using KL divergence)
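A minimal sketch of those three checks, with the alert thresholds listed above (the KL smoothing constant and the binned-distribution representation are my assumptions):

```python
import math

# Hypothetical sketch of the DS-owned drift checks described above.
def intent_share_drift(last_week: dict[str, float], this_week: dict[str, float]) -> list[str]:
    """Intents whose traffic share moved by more than 5 points week-over-week."""
    return [i for i in this_week
            if abs(this_week[i] - last_week.get(i, 0.0)) > 0.05]

def confidence_drift(confidences: list[float]) -> bool:
    """Alert if mean classification confidence drops below 0.85."""
    return sum(confidences) / len(confidences) < 0.85

def kl_divergence(p: dict[str, float], q: dict[str, float], eps: float = 1e-9) -> float:
    """KL(P || Q) over binned query-embedding distributions, month-over-month."""
    return sum(p[k] * math.log((p[k] + eps) / (q.get(k, 0.0) + eps))
               for k in p if p[k] > 0)
```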
I owned the retraining pipeline:
graph LR
A[Drift Alert<br>from DS monitoring] --> B[Sample Low-Confidence<br>Production Data]
B --> C[Human Labeling Queue<br>200 samples/week]
C --> D[DS Retrains Model<br>Updated training set]
D --> E[Offline Evaluation<br>Golden dataset]
E --> F[Shadow Deployment<br>1 week parallel run]
F --> G[Canary Deployment<br>1% traffic, 24 hours]
G --> H[Full Rollout<br>100% traffic]
Joint incident response:
When we detected drift, the first question was always: "Is this a model problem or a data problem?" We built a decision tree:
| Signal | Model Problem | Data Problem |
|---|---|---|
| Accuracy drops on golden dataset | Yes — model degraded | No — golden dataset still works |
| Accuracy drops on production data only | No — production data shifted | Yes — distribution changed |
| New intent pattern emerges | Partially — model doesn't know it | Yes — need new training data |
| Confidence distribution shifts | Yes — model uncertain | Maybe — ambiguous queries increased |
This saved us from knee-jerk retraining when the real issue was a seasonal data shift (holiday traffic pattern) that required new training data, not a new model.
Area 9: Cost-Quality Tradeoff Analysis
The Challenge
Every model decision involved a cost-quality tradeoff. DS naturally optimized for quality; I naturally optimized for cost and latency. We needed a shared framework to make these decisions objectively.
The Framework We Developed
Cost per Quality Point (CPQ): We defined a composite quality score (0-100) combining accuracy, helpfulness, safety, and fluency. Then we measured the cost to improve it by one point:
Quality Score = 0.4 × Accuracy + 0.3 × Helpfulness + 0.2 × Safety + 0.1 × Fluency
| Proposal | Quality Δ | Cost Δ (monthly) | CPQ ($/point) | Decision |
|---|---|---|---|---|
| RoBERTa vs DistilBERT | +2.7 | +$14K | $5,185/point | Reject — too expensive per point |
| Fine-tuned embeddings vs base | +8.5 | +$2K | $235/point | Accept — great ROI |
| Chain-of-thought prompting | +4.0 | +$33K | $8,250/point | Reject — not cost-justified |
| Structured output prompting | +2.0 | -$10K | Saves money AND improves quality | Accept — no brainer |
| Cross-encoder reranker | +6.2 | +$5K | $806/point | Accept — good ROI |
| Human-in-the-loop for low confidence | +3.1 | +$8K | $2,580/point | Accept — for now; revisit at scale |
Decision rule: Accept improvements with CPQ < $3,000/point. Reject those above unless there's a safety or compliance reason.
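A minimal sketch of the CPQ computation and decision rule, using the weights and the $3,000/point threshold defined above:

```python
# Hypothetical sketch of the Cost-per-Quality-Point (CPQ) decision rule.
CPQ_THRESHOLD = 3_000   # dollars per quality point, per month

def quality_score(accuracy: float, helpfulness: float, safety: float, fluency: float) -> float:
    """Composite 0-100 score with the weights agreed above."""
    return 0.4 * accuracy + 0.3 * helpfulness + 0.2 * safety + 0.1 * fluency

def cpq_decision(quality_delta: float, monthly_cost_delta: float,
                 safety_or_compliance: bool = False) -> str:
    if monthly_cost_delta <= 0 and quality_delta >= 0:
        return "accept"                         # saves money without hurting quality
    if quality_delta <= 0:
        return "reject"                         # costs more without improving quality
    cpq = monthly_cost_delta / quality_delta    # dollars per quality point
    return "accept" if (cpq < CPQ_THRESHOLD or safety_or_compliance) else "reject"

# Example from the table: chain-of-thought prompting (+4.0 quality, +$33K/month)
print(cpq_decision(4.0, 33_000))   # -> "reject" (CPQ = $8,250/point)
```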
How This Changed Our Conversations
Before CPQ, DS and engineering had subjective debates: "this model is better" vs. "this model is too expensive." After CPQ, every proposal came with a quantified cost-quality tradeoff. It turned disagreements into math problems.
Collaboration Anti-Patterns We Avoided
| Anti-Pattern | What Happens | How We Avoided It |
|---|---|---|
| "Throw it over the wall" | DS trains model, engineer deploys without understanding it | Joint ownership of evaluation, shared metric thresholds |
| "Research-first" | DS optimizes for accuracy, ignoring latency/cost | Production benchmarking before model selection |
| "Engineering-first" | Engineer picks fastest model, ignoring quality | Quality floor (F1 ≥ 0.85 per class) enforced by DS |
| "Retraining panic" | Knee-jerk retraining on every drift signal | Decision tree: model problem vs. data problem |
| "Metric disagreement" | DS and engineering argue about "what's good enough" | CPQ framework: objective cost-per-quality-point threshold |
Key Takeaways for Interviews
- "I didn't just consume models — I shaped them" — Show that you influenced model selection, evaluation criteria, and deployment strategy. Not just "DS gave me a model and I deployed it."
- "We defined shared metrics" — The hardest part of cross-functional collaboration is agreeing on what "good" means. DS metrics (F1, accuracy) and engineering metrics (latency, cost) are both necessary, but neither is sufficient alone.
- "We had a retraining flywheel" — Monthly retraining with production data, automated evaluation, shadow → canary → full rollout. This is production ML maturity.
- "Cost per Quality Point resolved disagreements" — When DS and engineering disagreed, CPQ turned subjective arguments into objective math.
- "Production != offline benchmarks" — The DS team's top model (RoBERTa at 94.8%) wasn't the right production choice. Real-world constraints (latency, cost, scaling) dominated the decision.
- "I built the infrastructure for DS to iterate" — Labeling queues, evaluation pipelines, shadow deployment, A/B test infrastructure — this is the engineer's job in the partnership.
Related Documents
- Challenges/real-world-challenges.md §4 — Model Drift — Detailed drift scenarios and recovery stories
- Challenges/real-world-challenges.md §19 — Evaluation — Measuring true impact beyond ML metrics
- 10-ai-llm-design.md — Model selection table and intent classification design
- 13-metrics.md — Full metrics framework including business and UX metrics