02. Data Scientist Collaboration — Simplifying Production Inference Together
"The hardest part of model inference wasn't the models — it was bridging the gap between what worked in a Jupyter notebook and what survived production traffic at 50K concurrent sessions. I partnered with data scientists across 9 distinct areas, and we had to build shared ownership of the full lifecycle."
All 9 Collaboration Areas at a Glance
| # | Area | Who Led | Key Outcome | Interview Sound Bite |
|---|---|---|---|---|
| 1 | Model Selection & Benchmarking | Joint | Chose DistilBERT over RoBERTa | "2.7% accuracy gap wasn't worth $170K/year — we used cost per marginal accuracy point." |
| 2 | Fine-Tuning Intent Classifier | DS | +3.8% accuracy with augmented data | "Manga-specific accuracy jumped from 71% to 89% with synthetic training data." |
| 3 | Evaluation Metrics Definition | Joint | Shared metric thresholds | "DS metrics alone were necessary but not sufficient — we needed end-to-end shared metrics." |
| 4 | Embedding Fine-Tuning for RAG | DS | +14% Recall@3 | "Fine-tuned embeddings doubled the cosine similarity separation gap." |
| 5 | Prompt Optimization & A/B Testing | Joint | Chose structured output ($630/day vs $1,740/day) | "Best quality-per-dollar, not best absolute quality." |
| 6 | Hallucination Detection & Scoring | DS built, I productionized | 0.5 alert / 0.7 block thresholds | "I couldn't validate subtle hallucinations programmatically — needed NLP entailment models." |
| 7 | LLM Evaluation Framework | Joint | 7-dimension evaluation with golden dataset | "Traditional ML eval doesn't apply to LLMs — we built multi-dimensional scoring." |
| 8 | Model Drift Monitoring | DS detected, I retrained | Monthly retraining flywheel | "First question was always: model problem or data problem?" |
| 9 | Cost-Quality Tradeoff Analysis | Joint | CPQ framework ($3K/point threshold) | "CPQ turned subjective debates into math problems." |
How We Structured the Partnership
At Amazon, engineering and data science are separate roles with different skill sets. For MangaAssist, I was the senior engineer owning the production pipeline; the DS team (2 data scientists, 1 applied scientist) owned model quality. The overlap — where most of the hard problems lived — was jointly owned.
graph TD
subgraph "Engineer Owned (Me)"
E1[Inference infrastructure<br>SageMaker, Bedrock, scaling]
E2[API design, latency optimization]
E3[Monitoring, alerting, dashboards]
E4[Deployment pipelines<br>CI/CD, canary, rollback]
E5[Cost tracking & optimization]
E6[Guardrails & validation pipeline]
end
subgraph "Jointly Owned"
J1[Model evaluation framework]
J2[Metric definitions & thresholds]
J3[A/B test design & analysis]
J4[Production incident triage<br>Is it a model issue or infra issue?]
J5[Feature engineering for classifier]
J6[Prompt engineering & testing]
end
subgraph "Data Science Owned"
D1[Model architecture selection]
D2[Training data curation & labeling]
D3[Hyperparameter tuning]
D4[Model training & experimentation]
D5[Offline evaluation & benchmarks]
D6[Embedding fine-tuning]
end
style J1 fill:#FFD700
style J2 fill:#FFD700
style J3 fill:#FFD700
style J4 fill:#FFD700
style J5 fill:#FFD700
style J6 fill:#FFD700
Area 1: Model Selection & Benchmarking
The Challenge
For the intent classifier, the DS team evaluated 5 architectures. But their benchmark results — run on a single GPU with batch inference — didn't reflect production reality. I needed to ground their selection in production constraints.
What the DS Team Proposed
| Model | Offline Accuracy | Offline F1 | Training Time |
|---|---|---|---|
| BERT-base | 94.2% | 0.93 | 8 hours |
| DistilBERT | 92.1% | 0.91 | 3 hours |
| RoBERTa | 94.8% | 0.94 | 12 hours |
| TinyBERT | 89.5% | 0.88 | 2 hours |
| Rule-based (regex) | 78.0% | 0.75 | N/A |
The DS team recommended RoBERTa — highest accuracy. I pushed back. Here's why.
What I Brought to the Table
I benchmarked each model on production infrastructure (SageMaker real-time endpoint, ml.g4dn.xlarge instance):
| Model | Avg Latency | P99 Latency | Model Size | Instance Cost | Accuracy |
|---|---|---|---|---|---|
| BERT-base | 35ms | 90ms | 440MB | $0.736/hr | 94.2% |
| DistilBERT | 15ms | 50ms | 260MB | $0.736/hr | 92.1% |
| RoBERTa | 40ms | 110ms | 500MB | $1.204/hr (needs larger GPU) | 94.8% |
| TinyBERT | 8ms | 25ms | 56MB | $0.340/hr (CPU OK) | 89.5% |
| Rule-based | <1ms | 2ms | N/A | $0 | 78.0% |
The Decision
We chose DistilBERT — not the most accurate, but the best cost-accuracy-latency tradeoff:
- 2.7% less accurate than RoBERTa but 2.6x faster and 40% cheaper on infrastructure.
- The 2.6% accuracy gap above TinyBERT was worth the extra latency because misclassifications cascaded through the entire pipeline (wrong context assembly → wrong RAG chunks → poor LLM response).
- We paired it with the rule-based fast path (handles 40% of messages at <1ms), so DistilBERT only processed the remaining 60% — further reducing its cost impact.
How We Resolved the Disagreement
The DS team initially resisted — "we're leaving accuracy on the table." I showed them the production math:
RoBERTa path: 40ms classifier + 500ms LLM = 540ms minimum
DistilBERT path: 15ms classifier + 500ms LLM = 515ms minimum
Difference per request: 25ms
At 500K messages/day: 25ms × 500K ≈ 3.5 additional compute-hours/day (~1,270 hours/year)
Plus: larger GPU instance = $468/day more in SageMaker costs
Accuracy difference: 2.7% (94.8% vs 92.1%)
= ~13,500 additional correctly classified messages/day
Cost per additional correct classification: $0.035/classification
The DS agreed that 2.7% accuracy wasn't worth $170K/year. We made the decision based on cost per marginal accuracy point — a metric we defined together.
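A minimal sketch of that marginal-accuracy math, using the figures from the tables above (the function and constant names are mine):

```python
# Hypothetical sketch: cost per marginal accuracy point, using the numbers above.
MESSAGES_PER_DAY = 500_000

def marginal_accuracy_cost(extra_infra_cost_per_day: float,
                           acc_candidate: float, acc_baseline: float) -> dict:
    """Cost of the extra accuracy a larger model buys, in production terms."""
    extra_correct_per_day = (acc_candidate - acc_baseline) * MESSAGES_PER_DAY
    return {
        "extra_correct_per_day": extra_correct_per_day,                               # ~13,500
        "cost_per_extra_correct": extra_infra_cost_per_day / extra_correct_per_day,   # ~$0.035
        "extra_cost_per_year": extra_infra_cost_per_day * 365,                        # ~$170K
    }

# RoBERTa (94.8%) vs DistilBERT (92.1%), +$468/day in SageMaker costs
print(marginal_accuracy_cost(468, 0.948, 0.921))
```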
Area 2: Fine-Tuning the Intent Classifier
The Challenge
The DS team trained DistilBERT on ~50,000 labeled examples. But production performance lagged offline benchmarks by 4-6%. The gap came from distribution mismatch: training data was curated from historical Amazon customer service logs, but MangaAssist users wrote differently (more casual, more manga-specific jargon, more emoji).
How We Collaborated
Step 1 — I provided production data, DS analyzed distribution gaps:
I built a pipeline that sampled 500 low-confidence classifications per week from production and sent them to a labeling queue (a minimal sketch of this sampler follows the findings below). The DS team analyzed these and found:
Distribution gaps:
- 22% of misclassifications involved manga-specific terms
("tankōbon", "shōnen jump", "seinen", "mangaka")
- 15% involved colloquial/slang queries
("Is this peak?", "W manga", "goated", "mid")
- 18% involved multi-intent messages
("I want to return this and also recommend something else")
- 12% involved Japanese text mixed with English
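A minimal sketch of the weekly low-confidence sampler described above, assuming classification records arrive as plain dicts and the labeling queue is a pluggable callback (the 0.6 confidence cutoff is my assumption, not from the original):

```python
import random
from typing import Callable, Iterable

SAMPLE_SIZE = 500          # samples per week, per the pipeline described above
CONFIDENCE_CUTOFF = 0.6    # assumed threshold for "low confidence"

def sample_low_confidence(records: Iterable[dict],
                          send_to_labeling_queue: Callable[[dict], None]) -> int:
    """Route a random sample of low-confidence classifications to human labelers."""
    low_conf = [r for r in records if r["confidence"] < CONFIDENCE_CUTOFF]
    batch = random.sample(low_conf, min(SAMPLE_SIZE, len(low_conf)))
    for record in batch:
        # Keep only what labelers need: the message text and the model's guess.
        send_to_labeling_queue({
            "message_text": record["text"],
            "predicted_intent": record["intent"],
            "confidence": record["confidence"],
        })
    return len(batch)
```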
Step 2 — DS generated synthetic training data, I validated in production:
The DS team used Claude to generate 5,000 synthetic manga-specific training examples:
Prompt: "Generate a customer message to a manga store chatbot
with intent={intent}. Use casual language, manga terminology,
and occasionally include Japanese words."
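A minimal sketch of how such a generation call might look against Claude on Amazon Bedrock — the model ID, sampling settings, and response parsing here are my assumptions, not necessarily the DS team's actual pipeline:

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

PROMPT_TEMPLATE = (
    "Generate a customer message to a manga store chatbot with intent={intent}. "
    "Use casual language, manga terminology, and occasionally include Japanese words."
)

def generate_synthetic_example(intent: str) -> str:
    # Model ID and temperature are illustrative choices, not from the original.
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 200,
        "temperature": 0.9,   # high temperature for varied synthetic phrasing
        "messages": [{"role": "user",
                      "content": PROMPT_TEMPLATE.format(intent=intent)}],
    }
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        body=json.dumps(body),
    )
    payload = json.loads(response["body"].read())
    return payload["content"][0]["text"]
```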
These were human-validated (intern reviewers) and added to the training set. After retraining:
| Metric | Before Augmentation | After Augmentation | Change |
|---|---|---|---|
| Overall Accuracy | 88.3% (production) | 92.1% (production) | +3.8% |
| Manga-specific Accuracy | 71.2% | 89.5% | +18.3% |
| Multi-intent Detection | 65.0% | 82.3% | +17.3% |
| Slang/Colloquial | 73.4% | 86.7% | +13.3% |
Step 3 — Joint retraining cadence:
We established a monthly retraining cycle (a minimal sketch of the go/no-go gate follows the list):
1. I exported the month's low-confidence samples (automated pipeline).
2. The DS team labeled and augmented the training data.
3. DS trained the new model and ran the offline evaluation suite.
4. I deployed to shadow mode and compared production metrics.
5. Joint go/no-go decision based on: accuracy ≥ 90%, no intent regression >2%, escalation rate stable.
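A minimal sketch of the step-5 gate, assuming offline metrics arrive as simple dicts (the ±1% escalation-rate tolerance is my reading of "stable", not a stated number):

```python
# Hypothetical sketch of the joint go/no-go gate for a retrained classifier.
def retraining_go_no_go(new_accuracy: float,
                        new_f1_by_intent: dict[str, float],
                        old_f1_by_intent: dict[str, float],
                        escalation_rate_delta: float) -> bool:
    if new_accuracy < 0.90:                                        # accuracy floor
        return False
    for intent, old_f1 in old_f1_by_intent.items():
        if old_f1 - new_f1_by_intent.get(intent, 0.0) > 0.02:      # no intent regression >2%
            return False
    if abs(escalation_rate_delta) > 0.01:                          # escalation rate stays stable
        return False
    return True
```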
Area 3: Defining Evaluation Metrics
The Challenge
The DS team and I initially disagreed on what "good" meant. DS optimized for accuracy on a test set. I cared about end-to-end user impact. We needed shared metrics that both sides trusted.
The Metrics Framework We Designed Together
graph TD
subgraph "DS-Centric Metrics (Offline)"
M1[Accuracy]
M2[Precision / Recall / F1<br>per intent class]
M3[Confusion Matrix]
M4[AUC-ROC]
end
subgraph "Engineering-Centric Metrics (Online)"
M5[P50/P99 Latency]
M6[Throughput<br>inferences/second]
M7[Error Rate]
M8[Cost per Inference]
end
subgraph "Shared Metrics (End-to-End)"
M9[Escalation Rate]
M10[Thumbs Up Rate]
M11[Hallucination Rate]
M12[Conversion Rate]
end
M1 --> M9
M2 --> M9
M5 --> M12
M7 --> M10
style M9 fill:#FFD700
style M10 fill:#FFD700
style M11 fill:#FFD700
style M12 fill:#FFD700
Key insight: DS metrics (accuracy, F1) were necessary but not sufficient. A model with 95% accuracy but 200ms latency could yield worse business outcomes than a 90% accuracy model with 15ms latency — because the faster model allowed more context assembly time, which improved LLM response quality.
Metric Thresholds We Agreed On
| Metric | Threshold | Owner | Measurement |
|---|---|---|---|
| Intent classification accuracy | ≥ 90% | DS | Weekly offline evaluation (500 labeled samples) |
| Per-class F1 | ≥ 0.85 for all classes | DS | Same evaluation set |
| No class regression | ≤ 2% drop on any class | DS | Comparison with previous model |
| P99 latency | ≤ 50ms | Engineering (me) | CloudWatch real-time |
| Throughput | ≥ 1,000 inferences/sec/instance | Engineering (me) | Load test before deploy |
| Escalation rate impact | ≤ +1% change | Joint | A/B or canary deployment |
| Thumbs up rate impact | ≥ -2% change (no more than a 2% drop) | Joint | A/B or canary deployment |
Area 4: Embedding Model Fine-Tuning for RAG
The Challenge
The Titan Embeddings V2 model produced decent general-purpose embeddings but struggled with manga-specific vocabulary. "Shōnen" and "action manga" should be semantically close — but the base embeddings placed them far apart because "shōnen" was a rare token.
How We Collaborated
DS owned: Training a contrastive learning adapter on manga-specific query-document pairs. They used ~2,000 hand-curated pairs:
Query: "dark fantasy manga like Berserk"
Positive Document: "Vinland Saga - a dark historical manga with brutal action..."
Negative Document: "My Neighbor Totoro art book - wholesome Ghibli illustrations..."
I owned: Productionizing the fine-tuned embeddings (a minimal sketch of the query-time path follows this list):
- Deploying the adapter as a lightweight Lambda layer that transforms Titan embeddings before the OpenSearch kNN search.
- Rebuilding the OpenSearch index with the new embeddings (35M chunks re-embedded, ~4 hours).
- A/B testing the fine-tuned retrieval against the baseline.
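A minimal sketch of that query-time path — Titan embedding, adapter transform, then OpenSearch kNN — assuming the adapter is a linear projection matrix shipped with the Lambda layer (the endpoint, index, field name, and adapter path are placeholders):

```python
import json
import boto3
import numpy as np
from opensearchpy import OpenSearch

bedrock = boto3.client("bedrock-runtime")
opensearch = OpenSearch(hosts=["https://example-opensearch-endpoint:443"])  # placeholder endpoint

# Assumed adapter: a learned projection matrix bundled with the Lambda layer.
ADAPTER = np.load("/opt/adapter/projection.npy")   # hypothetical path

def embed_query(query: str) -> list[float]:
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": query}),
    )
    base = np.array(json.loads(response["body"].read())["embedding"])
    adapted = ADAPTER @ base                  # transform before the kNN search
    adapted /= np.linalg.norm(adapted)        # re-normalize for cosine similarity
    return adapted.tolist()

def retrieve_chunks(query: str, k: int = 3) -> list[dict]:
    body = {"size": k,
            "query": {"knn": {"chunk_embedding": {          # assumed vector field name
                "vector": embed_query(query), "k": k}}}}
    hits = opensearch.search(index="manga-chunks", body=body)["hits"]["hits"]
    return [h["_source"] for h in hits]
```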
Results:
| Metric | Base Titan | Fine-tuned Adapter | Improvement |
|---|---|---|---|
| Recall@3 | 72% | 86% | +14% |
| Recall@5 | 81% | 92% | +11% |
| MRR (Mean Reciprocal Rank) | 0.68 | 0.81 | +0.13 |
| Manga-specific query Recall@3 | 58% | 83% | +25% |
The +14% Recall@3 improvement translated to measurably better LLM responses — the LLM had better source material to work with.
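For reference, the retrieval metrics themselves are small functions; a minimal sketch, assuming each evaluation query has a single known relevant document:

```python
# Hypothetical sketch: Recall@K and MRR over an offline retrieval evaluation set.
def recall_at_k(ranked_ids: list[str], relevant_id: str, k: int) -> float:
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def reciprocal_rank(ranked_ids: list[str], relevant_id: str) -> float:
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

def evaluate(queries: list[dict]) -> dict:
    """Each query dict: {"ranked_ids": [...], "relevant_id": "..."}."""
    n = len(queries)
    return {
        "recall@3": sum(recall_at_k(q["ranked_ids"], q["relevant_id"], 3) for q in queries) / n,
        "recall@5": sum(recall_at_k(q["ranked_ids"], q["relevant_id"], 5) for q in queries) / n,
        "mrr": sum(reciprocal_rank(q["ranked_ids"], q["relevant_id"]) for q in queries) / n,
    }
```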
Area 5: Prompt Optimization & A/B Testing
The Challenge
Prompt engineering lived in an awkward middle ground: DS brought NLP expertise (understanding token probabilities, temperature effects, instruction-following patterns), while I brought production constraints (latency budgets, token costs, format requirements for downstream parsing).
How We Collaborated
DS-led experiments:
- Testing different instruction styles (imperative vs. conversational)
- Optimizing temperature per intent type
- Few-shot example selection (which examples improved quality most?)
- Chain-of-thought vs. direct answer for complex queries
Engineering-led experiments (me):
- Prompt compression (same quality with fewer tokens)
- Token budget optimization (how much context can we cut before quality degrades?)
- Format instructions that survive streaming (valid JSON at any truncation point)
- Prompt caching compatibility (structuring prompts so the cacheable prefix is maximized)
Joint experiment: We ran a 2-week A/B test on recommendation prompt variants:
| Variant | Prompt Style | Avg Response Quality (human-rated) | Avg Output Tokens | Cost/Response |
|---|---|---|---|---|
| A (baseline) | Direct instruction | 3.8 / 5.0 | 180 tokens | $0.0027 |
| B (few-shot) | 3 examples + instruction | 4.2 / 5.0 | 210 tokens | $0.0035 |
| C (chain-of-thought) | Reasoning + answer | 4.4 / 5.0 | 350 tokens | $0.0058 |
| D (structured output) | JSON template + instruction | 4.0 / 5.0 | 140 tokens | $0.0021 |
Decision: We chose Variant D — not the highest quality, but the best quality-per-dollar. At 300K LLM calls/day, the cost difference between C ($1,740/day) and D ($630/day) was $1,110/day ($33K/month). The 0.4 quality point difference didn't justify it.
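The quality-per-dollar comparison behind that decision is plain arithmetic; a minimal sketch using the table above and the 300K calls/day volume:

```python
# Hypothetical sketch: rank prompt variants by quality per dollar at production volume.
LLM_CALLS_PER_DAY = 300_000

variants = {
    "A (baseline)":          {"quality": 3.8, "cost_per_response": 0.0027},
    "B (few-shot)":          {"quality": 4.2, "cost_per_response": 0.0035},
    "C (chain-of-thought)":  {"quality": 4.4, "cost_per_response": 0.0058},
    "D (structured output)": {"quality": 4.0, "cost_per_response": 0.0021},
}

for name, v in variants.items():
    daily_cost = v["cost_per_response"] * LLM_CALLS_PER_DAY
    print(f"{name}: ${daily_cost:,.0f}/day, "
          f"{v['quality'] / daily_cost * 1000:.2f} quality points per $1K/day")
# Variant D leads on quality per dollar (~6.3 points per $1K/day vs. ~2.5 for C).
```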
Area 6: Hallucination Detection & Scoring Pipeline
The Challenge
DS expertise was critical for building automated hallucination detection. I could validate prices and ASINs programmatically, but detecting subtler hallucinations ("the deluxe edition includes exclusive author commentary" — fabricated but plausible) required NLP techniques.
How We Collaborated
DS built the hallucination scoring model:
1. A claim extraction pipeline (using a smaller LLM): extract all factual claims from the response.
2. An entailment classifier: for each claim, check if it's entailed by the source context (RAG chunks + product data).
3. Score: 0 (fully grounded) to 1 (completely fabricated).
I built the production infrastructure:
1. Async scoring pipeline (didn't add latency to the response path).
2. CloudWatch metric for the daily hallucination score average.
3. Alert if the daily average exceeded 0.03.
4. Quarterly review of 500 flagged responses to calibrate the scoring model.
Joint calibration:
The entailment classifier had its own precision-recall tradeoff:
| Threshold | Precision (% of flagged responses that are actual hallucinations) | Recall (% of actual hallucinations caught) | False Positive Rate |
|---|---|---|---|
| 0.3 (aggressive) | 62% | 94% | 12% |
| 0.5 (balanced) | 78% | 82% | 5% |
| 0.7 (conservative) | 91% | 65% | 2% |
We chose 0.5 for alerting (catch most hallucinations, tolerate some false positives in the alert stream) and 0.7 for automated blocking (only block responses we're very confident are hallucinated — can't afford false positives that block good responses).
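A minimal sketch of how the score and the two thresholds fit together — the claim extractor and entailment model are passed in as stand-ins, since those were DS-built components, and the aggregation here is my assumption:

```python
from typing import Callable

# Thresholds agreed in the calibration above.
ALERT_THRESHOLD = 0.5   # flag for the review/alert stream
BLOCK_THRESHOLD = 0.7   # block the response outright

def hallucination_score(response: str, source_context: str,
                        extract_claims: Callable[[str], list[str]],
                        entailment_prob: Callable[[str, str], float]) -> float:
    """0 = fully grounded, 1 = completely fabricated (per the DS scoring design)."""
    claims = extract_claims(response)
    if not claims:
        return 0.0
    # A claim's "fabrication" is 1 minus how strongly the source context entails it.
    fabrication = [1.0 - entailment_prob(source_context, claim) for claim in claims]
    return sum(fabrication) / len(fabrication)

def triage(score: float) -> str:
    if score >= BLOCK_THRESHOLD:
        return "block"
    if score >= ALERT_THRESHOLD:
        return "alert"
    return "pass"
```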
Area 7: LLM Evaluation Framework
The Challenge
Traditional ML evaluation (accuracy on a test set) doesn't apply to LLM outputs. There's no single "correct answer" — a response can be factually correct but poorly formatted, or well-written but missing key information. We needed a multi-dimensional evaluation framework.
How We Built It Together
DS designed the evaluation dimensions:
| Dimension | What It Measures | Scoring Method |
|---|---|---|
| Factual Correctness | Are all claims accurate? | Automated (ASIN/price validation) + human audit |
| Completeness | Did the response address the user's question fully? | Human rating (1-5) |
| Relevance | Is the response about the right topic? | Automated (intent match) + human rating |
| Fluency | Is the response well-written and natural? | BLEU/ROUGE against golden responses + human rating |
| Helpfulness | Would a user find this useful? | Human rating (1-5) + thumbs up/down proxy |
| Safety | No toxicity, PII, competitor mentions? | Automated guardrails |
| Format Compliance | Correct JSON structure, markdown rendering? | Automated schema validation |
I built the evaluation pipeline:
graph TD
A[Model/Prompt Change] --> B[Trigger Evaluation Pipeline]
B --> C[Run 500 Golden Queries]
C --> D[Automated Scoring<br>BLEU, ROUGE, format, factual]
D --> E{All automated<br>checks pass?}
E -->|No| F[Block Deployment<br>Alert DS + Engineering]
E -->|Yes| G[Sample 50 for<br>Human Evaluation]
G --> H[Human Raters Score<br>Completeness, Helpfulness]
H --> I{Human scores<br>above threshold?}
I -->|No| F
I -->|Yes| J[Approve for<br>Canary Deployment]
Golden dataset curation was a joint effort:
- I selected 500 representative queries from production (stratified by intent, complexity, and edge cases).
- The DS team wrote reference responses and scoring rubrics.
- We reviewed and revised quarterly — removing stale questions, adding new edge cases discovered in production.
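A minimal sketch of the automated half of that pipeline — run the golden queries, apply the automated checks, and either block or hand a 50-response sample to human raters (the check functions are placeholders):

```python
import random
from typing import Callable

# Hypothetical sketch of the automated gate before human evaluation.
def run_automated_gate(golden_queries: list[dict],
                       generate_response: Callable[[str], str],
                       automated_checks: dict[str, Callable[[str, dict], bool]]) -> dict:
    failures, responses = [], []
    for item in golden_queries:                                 # the 500-query golden dataset
        response = generate_response(item["query"])
        responses.append({"item": item, "response": response})
        for check_name, check in automated_checks.items():      # format, factual, safety, ...
            if not check(response, item):
                failures.append((item["query"], check_name))
    if failures:
        return {"status": "block_deployment", "failures": failures}
    # All automated checks passed: sample 50 responses for human raters.
    return {"status": "human_eval",
            "sample": random.sample(responses, min(50, len(responses)))}
```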
Area 8: Model Drift Monitoring & Retraining Pipelines
The Challenge
Both the intent classifier and the LLM's behavior drifted over time. DS needed to detect drift; I needed to retrain and redeploy safely.
How We Collaborated
DS owned drift detection (a minimal sketch of these checks follows the list):
- Intent distribution monitoring (alert if any intent's share shifts by >5% week-over-week)
- Classification confidence distribution monitoring (alert if mean confidence drops below 0.85)
- Embedding drift detection (compare query embedding distributions month-over-month using KL divergence)
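A minimal sketch of those three checks, with the alert thresholds listed above (the KL smoothing constant and the binned-distribution representation are my assumptions):

```python
import math

# Hypothetical sketch of the DS-owned drift checks described above.
def intent_share_drift(last_week: dict[str, float], this_week: dict[str, float]) -> list[str]:
    """Intents whose traffic share moved by more than 5 points week-over-week."""
    return [i for i in this_week
            if abs(this_week[i] - last_week.get(i, 0.0)) > 0.05]

def confidence_drift(confidences: list[float]) -> bool:
    """Alert if mean classification confidence drops below 0.85."""
    return sum(confidences) / len(confidences) < 0.85

def kl_divergence(p: dict[str, float], q: dict[str, float], eps: float = 1e-9) -> float:
    """KL(P || Q) over binned query-embedding distributions, month-over-month."""
    return sum(p[k] * math.log((p[k] + eps) / (q.get(k, 0.0) + eps))
               for k in p if p[k] > 0)
```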
I owned the retraining pipeline:
graph LR
A[Drift Alert<br>from DS monitoring] --> B[Sample Low-Confidence<br>Production Data]
B --> C[Human Labeling Queue<br>200 samples/week]
C --> D[DS Retrains Model<br>Updated training set]
D --> E[Offline Evaluation<br>Golden dataset]
E --> F[Shadow Deployment<br>1 week parallel run]
F --> G[Canary Deployment<br>1% traffic, 24 hours]
G --> H[Full Rollout<br>100% traffic]
Joint incident response:
When we detected drift, the first question was always: "Is this a model problem or a data problem?" We built a decision tree:
| Signal | Model Problem | Data Problem |
|---|---|---|
| Accuracy drops on golden dataset | Yes — model degraded | No — golden dataset still works |
| Accuracy drops on production data only | No — production data shifted | Yes — distribution changed |
| New intent pattern emerges | Partially — model doesn't know it | Yes — need new training data |
| Confidence distribution shifts | Yes — model uncertain | Maybe — ambiguous queries increased |
This saved us from knee-jerk retraining when the real issue was a seasonal data shift (holiday traffic pattern) that required new training data, not a new model.
Area 9: Cost-Quality Tradeoff Analysis
The Challenge
Every model decision involved a cost-quality tradeoff. DS naturally optimized for quality; I naturally optimized for cost and latency. We needed a shared framework to make these decisions objectively.
The Framework We Developed
Cost per Quality Point (CPQ): We defined a composite quality score (0-100) combining accuracy, helpfulness, safety, and fluency. Then we measured the cost to improve it by one point:
Quality Score = 0.4 × Accuracy + 0.3 × Helpfulness + 0.2 × Safety + 0.1 × Fluency
| Proposal | Quality Δ | Cost Δ (monthly) | CPQ ($/point) | Decision |
|---|---|---|---|---|
| RoBERTa vs DistilBERT | +2.7 | +$14K | $5,185/point | Reject — too expensive per point |
| Fine-tuned embeddings vs base | +8.5 | +$2K | $235/point | Accept — great ROI |
| Chain-of-thought prompting | +4.0 | +$33K | $8,250/point | Reject — not cost-justified |
| Structured output prompting | +2.0 | -$10K | Saves money AND improves quality | Accept — no brainer |
| Cross-encoder reranker | +6.2 | +$5K | $806/point | Accept — good ROI |
| Human-in-the-loop for low confidence | +3.1 | +$8K | $2,580/point | Accept — for now; revisit at scale |
Decision rule: Accept improvements with CPQ < $3,000/point. Reject those above unless there's a safety or compliance reason.
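A minimal sketch of the CPQ computation and decision rule, using the weights and the $3,000/point threshold defined above:

```python
# Hypothetical sketch of the Cost-per-Quality-Point (CPQ) decision rule.
CPQ_THRESHOLD = 3_000   # dollars per quality point, per month

def quality_score(accuracy: float, helpfulness: float, safety: float, fluency: float) -> float:
    """Composite 0-100 score with the weights agreed above."""
    return 0.4 * accuracy + 0.3 * helpfulness + 0.2 * safety + 0.1 * fluency

def cpq_decision(quality_delta: float, monthly_cost_delta: float,
                 safety_or_compliance: bool = False) -> str:
    if monthly_cost_delta <= 0 and quality_delta >= 0:
        return "accept"                         # saves money without hurting quality
    if quality_delta <= 0:
        return "reject"                         # costs more without improving quality
    cpq = monthly_cost_delta / quality_delta    # dollars per quality point
    return "accept" if (cpq < CPQ_THRESHOLD or safety_or_compliance) else "reject"

# Example from the table: chain-of-thought prompting (+4.0 quality, +$33K/month)
print(cpq_decision(4.0, 33_000))   # -> "reject" (CPQ = $8,250/point)
```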
How This Changed Our Conversations
Before CPQ, DS and engineering had subjective debates: "this model is better" vs. "this model is too expensive." After CPQ, every proposal came with a quantified cost-quality tradeoff. It turned disagreements into math problems.
Collaboration Anti-Patterns We Avoided
| Anti-Pattern | What Happens | How We Avoided It |
|---|---|---|
| "Throw it over the wall" | DS trains model, engineer deploys without understanding it | Joint ownership of evaluation, shared metric thresholds |
| "Research-first" | DS optimizes for accuracy, ignoring latency/cost | Production benchmarking before model selection |
| "Engineering-first" | Engineer picks fastest model, ignoring quality | Quality floor (F1 ≥ 0.85 per class) enforced by DS |
| "Retraining panic" | Knee-jerk retraining on every drift signal | Decision tree: model problem vs. data problem |
| "Metric disagreement" | DS and engineering argue about "what's good enough" | CPQ framework: objective cost-per-quality-point threshold |
Key Takeaways for Interviews
- "I didn't just consume models — I shaped them" — Show that you influenced model selection, evaluation criteria, and deployment strategy. Not just "DS gave me a model and I deployed it."
- "We defined shared metrics" — The hardest part of cross-functional collaboration is agreeing on what "good" means. DS metrics (F1, accuracy) and engineering metrics (latency, cost) are both necessary, but neither is sufficient alone.
- "We had a retraining flywheel" — Monthly retraining with production data, automated evaluation, shadow → canary → full rollout. This is production ML maturity.
- "Cost per Quality Point resolved disagreements" — When DS and engineering disagreed, CPQ turned subjective arguments into objective math.
- "Production != offline benchmarks" — The DS team's top model (RoBERTa at 94.8%) wasn't the right production choice. Real-world constraints (latency, cost, scaling) dominated the decision.
- "I built the infrastructure for DS to iterate" — Labeling queues, evaluation pipelines, shadow deployment, A/B test infrastructure — this is the engineer's job in the partnership.
Related Documents
- Challenges/real-world-challenges.md §4 — Model Drift — Detailed drift scenarios and recovery stories
- Challenges/real-world-challenges.md §19 — Evaluation — Measuring true impact beyond ML metrics
- 10-ai-llm-design.md — Model selection table and intent classification design
- 13-metrics.md — Full metrics framework including business and UX metrics