
02. Data Scientist Collaboration — Simplifying Production Inference Together

"The hardest part of model inference wasn't the models — it was bridging the gap between what worked in a Jupyter notebook and what survived production traffic at 50K concurrent sessions. I partnered with data scientists across 9 distinct areas, and we had to build shared ownership of the full lifecycle."


All 9 Collaboration Areas at a Glance

| # | Area | Who Led | Key Outcome | Interview Sound Bite |
|---|------|---------|-------------|----------------------|
| 1 | Model Selection & Benchmarking | Joint | Chose DistilBERT over RoBERTa | "2.7% accuracy gap wasn't worth $170K/year — we used cost per marginal accuracy point." |
| 2 | Fine-Tuning Intent Classifier | DS | +3.8% accuracy with augmented data | "Manga-specific accuracy jumped from 71% to 89% with synthetic training data." |
| 3 | Evaluation Metrics Definition | Joint | Shared metric thresholds | "DS metrics alone were necessary but not sufficient — we needed end-to-end shared metrics." |
| 4 | Embedding Fine-Tuning for RAG | DS | +14% Recall@3 | "Fine-tuned embeddings doubled the cosine similarity separation gap." |
| 5 | Prompt Optimization & A/B Testing | Joint | Chose structured output ($630/day vs $1,740/day) | "Best quality-per-dollar, not best absolute quality." |
| 6 | Hallucination Detection & Scoring | DS built, I productionized | 0.5 alert / 0.7 block thresholds | "I couldn't validate subtle hallucinations programmatically — needed NLP entailment models." |
| 7 | LLM Evaluation Framework | Joint | 7-dimension evaluation with golden dataset | "Traditional ML eval doesn't apply to LLMs — we built multi-dimensional scoring." |
| 8 | Model Drift Monitoring | DS detected, I retrained | Monthly retraining flywheel | "First question was always: model problem or data problem?" |
| 9 | Cost-Quality Tradeoff Analysis | Joint | CPQ framework ($3K/point threshold) | "CPQ turned subjective debates into math problems." |

How We Structured the Partnership

At Amazon, engineering and data science are separate roles with different skill sets. For MangaAssist, I was the senior engineer owning the production pipeline; the DS team (2 data scientists, 1 applied scientist) owned model quality. The overlap — where most of the hard problems lived — was jointly owned.

```mermaid
graph TD
    subgraph "Engineer Owned (Me)"
        E1[Inference infrastructure<br>SageMaker, Bedrock, scaling]
        E2[API design, latency optimization]
        E3[Monitoring, alerting, dashboards]
        E4[Deployment pipelines<br>CI/CD, canary, rollback]
        E5[Cost tracking & optimization]
        E6[Guardrails & validation pipeline]
    end

    subgraph "Jointly Owned"
        J1[Model evaluation framework]
        J2[Metric definitions & thresholds]
        J3[A/B test design & analysis]
        J4[Production incident triage<br>Is it a model issue or infra issue?]
        J5[Feature engineering for classifier]
        J6[Prompt engineering & testing]
    end

    subgraph "Data Science Owned"
        D1[Model architecture selection]
        D2[Training data curation & labeling]
        D3[Hyperparameter tuning]
        D4[Model training & experimentation]
        D5[Offline evaluation & benchmarks]
        D6[Embedding fine-tuning]
    end

    style J1 fill:#FFD700
    style J2 fill:#FFD700
    style J3 fill:#FFD700
    style J4 fill:#FFD700
    style J5 fill:#FFD700
    style J6 fill:#FFD700
```

Area 1: Model Selection & Benchmarking

The Challenge

For the intent classifier, the DS team evaluated 5 architectures. But their benchmark results — run on a single GPU with batch inference — didn't reflect production reality. I needed to ground their selection in production constraints.

What the DS Team Proposed

| Model | Offline Accuracy | Offline F1 | Training Time |
|-------|------------------|------------|---------------|
| BERT-base | 94.2% | 0.93 | 8 hours |
| DistilBERT | 92.1% | 0.91 | 3 hours |
| RoBERTa | 94.8% | 0.94 | 12 hours |
| TinyBERT | 89.5% | 0.88 | 2 hours |
| Rule-based (regex) | 78.0% | 0.75 | N/A |

The DS team recommended RoBERTa — highest accuracy. I pushed back. Here's why.

What I Brought to the Table

I benchmarked each model on production infrastructure (SageMaker real-time endpoint, ml.g4dn.xlarge instance):

| Model | Avg Latency | P99 Latency | Model Size | Instance Cost | Accuracy |
|-------|-------------|-------------|------------|---------------|----------|
| BERT-base | 35ms | 90ms | 440MB | $0.736/hr | 94.2% |
| DistilBERT | 15ms | 50ms | 260MB | $0.736/hr | 92.1% |
| RoBERTa | 40ms | 110ms | 500MB | $1.204/hr (needs larger GPU) | 94.8% |
| TinyBERT | 8ms | 25ms | 56MB | $0.340/hr (CPU OK) | 89.5% |
| Rule-based | <1ms | 2ms | N/A | $0 | 78.0% |

The Decision

We chose DistilBERT — not the most accurate, but the best cost-accuracy-latency tradeoff:

  • 2.7% less accurate than RoBERTa but 2.6x faster and 40% cheaper on infrastructure.
  • The 2.1% accuracy gap above TinyBERT was worth the extra latency because misclassifications cascaded through the entire pipeline (wrong context assembly → wrong RAG chunks → poor LLM response).
  • We paired it with the rule-based fast path (handles 40% of messages at <1ms), so DistilBERT only processed the remaining 60% — further reducing its cost impact.

How We Resolved the Disagreement

The DS team initially resisted — "we're leaving accuracy on the table." I showed them the production math:

RoBERTa path: 40ms classifier + 500ms LLM = 540ms minimum
DistilBERT path: 15ms classifier + 500ms LLM = 515ms minimum
Difference per request: 25ms

At 500K messages/day: 25ms × 500K ≈ 3.5 additional compute-hours/day (~1,270 hours/year)
Plus: larger GPU instance = $468/day more in SageMaker costs

Accuracy difference: 2.7% (94.8% vs 92.1%)
= ~13,500 additional correctly classified messages/day

Cost per additional correct classification: $468 ÷ 13,500 ≈ $0.035

The DS team agreed that 2.7% more accuracy wasn't worth ~$170K/year. We made the decision based on cost per marginal accuracy point — a metric we defined together.
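The production math above fits in a few lines of code. This is a minimal sketch using the figures from the tables in this section; the function name is illustrative, not part of any real tooling.

```python
# Back-of-envelope comparison of RoBERTa vs DistilBERT in production,
# using the accuracy and cost figures quoted above.

MESSAGES_PER_DAY = 500_000

def marginal_cost_per_correct(acc_high: float, acc_low: float,
                              extra_cost_per_day: float) -> float:
    """Dollar cost of each additional correctly classified message
    when choosing the more accurate (and more expensive) model."""
    extra_correct_per_day = (acc_high - acc_low) * MESSAGES_PER_DAY
    return extra_cost_per_day / extra_correct_per_day

# RoBERTa (94.8%) vs DistilBERT (92.1%), +$468/day in SageMaker costs
cpc = marginal_cost_per_correct(0.948, 0.921, 468.0)
print(f"${cpc:.3f} per additional correct classification")  # $0.035
```

Framing the decision this way made the tradeoff concrete: the question stopped being "which model is better" and became "is each extra correct classification worth 3.5 cents".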


Area 2: Fine-Tuning the Intent Classifier

The Challenge

The DS team trained DistilBERT on ~50,000 labeled examples. But production performance lagged offline benchmarks by 4-6%. The gap came from distribution mismatch: training data was curated from historical Amazon customer service logs, but MangaAssist users wrote differently (more casual, more manga-specific jargon, more emoji).

How We Collaborated

Step 1 — I provided production data, DS analyzed distribution gaps:

I built a pipeline that sampled 500 low-confidence classifications per week from production and sent them to a labeling queue. The DS team analyzed these and found:

Distribution gaps:
- 22% of misclassifications involved manga-specific terms
  ("tankōbon", "shōnen jump", "seinen", "mangaka")
- 15% involved colloquial/slang queries
  ("Is this peak?", "W manga", "goated", "mid")
- 18% involved multi-intent messages
  ("I want to return this and also recommend something else")
- 12% involved Japanese text mixed with English

Step 2 — DS generated synthetic training data, I validated in production:

The DS team used Claude to generate 5,000 synthetic manga-specific training examples:

Prompt: "Generate a customer message to a manga store chatbot 
with intent={intent}. Use casual language, manga terminology, 
and occasionally include Japanese words."

These were human-validated (intern reviewers) and added to the training set. After retraining:

| Metric | Before Augmentation | After Augmentation | Change |
|--------|---------------------|--------------------|--------|
| Overall Accuracy | 88.3% (production) | 92.1% (production) | +3.8% |
| Manga-specific Accuracy | 71.2% | 89.5% | +18.3% |
| Multi-intent Detection | 65.0% | 82.3% | +17.3% |
| Slang/Colloquial | 73.4% | 86.7% | +13.3% |

Step 3 — Joint retraining cadence:

We established a monthly retraining cycle:

1. I exported the month's low-confidence samples (automated pipeline).
2. The DS team labeled and augmented the training data.
3. DS trained the new model and ran the offline evaluation suite.
4. I deployed to shadow mode and compared production metrics.
5. Joint go/no-go decision based on: accuracy ≥ 90%, no intent regression >2%, escalation rate stable.
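The joint go/no-go check in step 5 can be sketched as a single function. This is an illustrative shape, not the actual tooling; the field names are assumptions, but the thresholds mirror the criteria listed above.

```python
# Minimal sketch of the monthly retraining go/no-go gate.
# Thresholds come from the criteria above; input shapes are assumed.

def retraining_go_no_go(new_accuracy: float,
                        per_intent_regression: dict[str, float],
                        escalation_rate_delta: float) -> bool:
    """Return True only if all three agreed criteria hold."""
    if new_accuracy < 0.90:                        # accuracy floor
        return False
    if any(drop > 0.02 for drop in per_intent_regression.values()):
        return False                               # no intent regresses >2%
    if abs(escalation_rate_delta) > 0.01:          # escalation rate stable
        return False
    return True

# Example: 92.1% accuracy, worst intent regressed 1%, escalation stable
print(retraining_go_no_go(0.921, {"returns": 0.01, "recs": -0.005}, 0.002))
# True
```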


Area 3: Defining Evaluation Metrics

The Challenge

The DS team and I initially disagreed on what "good" meant. DS optimized for accuracy on a test set. I cared about end-to-end user impact. We needed shared metrics that both sides trusted.

The Metrics Framework We Designed Together

```mermaid
graph TD
    subgraph "DS-Centric Metrics (Offline)"
        M1[Accuracy]
        M2[Precision / Recall / F1<br>per intent class]
        M3[Confusion Matrix]
        M4[AUC-ROC]
    end

    subgraph "Engineering-Centric Metrics (Online)"
        M5[P50/P99 Latency]
        M6[Throughput<br>inferences/second]
        M7[Error Rate]
        M8[Cost per Inference]
    end

    subgraph "Shared Metrics (End-to-End)"
        M9[Escalation Rate]
        M10[Thumbs Up Rate]
        M11[Hallucination Rate]
        M12[Conversion Rate]
    end

    M1 --> M9
    M2 --> M9
    M5 --> M12
    M7 --> M10

    style M9 fill:#FFD700
    style M10 fill:#FFD700
    style M11 fill:#FFD700
    style M12 fill:#FFD700
```

Key insight: DS metrics (accuracy, F1) were necessary but not sufficient. A model with 95% accuracy but 200ms latency could yield worse business outcomes than a 90% accuracy model with 15ms latency — because the faster model allowed more context assembly time, which improved LLM response quality.

Metric Thresholds We Agreed On

| Metric | Threshold | Owner | Measurement |
|--------|-----------|-------|-------------|
| Intent classification accuracy | ≥ 90% | DS | Weekly offline evaluation (500 labeled samples) |
| Per-class F1 | ≥ 0.85 for all classes | DS | Same evaluation set |
| No class regression | ≤ 2% drop on any class | DS | Comparison with previous model |
| P99 latency | ≤ 50ms | Engineering (me) | CloudWatch real-time |
| Throughput | ≥ 1,000 inferences/sec/instance | Engineering (me) | Load test before deploy |
| Escalation rate impact | ≤ +1% change | Joint | A/B or canary deployment |
| Thumbs up rate impact | ≤ -2% change | Joint | A/B or canary deployment |

Area 4: Embedding Model Fine-Tuning for RAG

The Challenge

The Titan Embeddings V2 model produced decent general-purpose embeddings but struggled with manga-specific vocabulary. "Shōnen" and "action manga" should be semantically close — but the base embeddings placed them far apart because "shōnen" was a rare token.

How We Collaborated

DS owned: Training a contrastive learning adapter on manga-specific query-document pairs. They used ~2,000 hand-curated pairs:

Query: "dark fantasy manga like Berserk"
Positive Document: "Vinland Saga - a dark historical manga with brutal action..."
Negative Document: "My Neighbor Totoro art book - wholesome Ghibli illustrations..."

I owned: Productionizing the fine-tuned embeddings:

- Deploying the adapter as a lightweight Lambda layer that transforms Titan embeddings before the OpenSearch kNN search.
- Rebuilding the OpenSearch index with the new embeddings (35M chunks re-embedded, ~4 hours).
- A/B testing the fine-tuned retrieval against the baseline.
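The adapter step itself is just a projection plus re-normalization. Here is a minimal sketch of what the Lambda-layer transform looks like; the real adapter matrix came from the DS team's contrastive training, so a random near-identity matrix stands in below, and the 1024 dimension is an assumption about the Titan output size.

```python
import numpy as np

# Illustrative adapter transform applied to a base embedding before
# the OpenSearch kNN query. The trained adapter weights are stubbed
# with a random near-identity matrix for demonstration only.

DIM = 1024  # assumed embedding dimension

rng = np.random.default_rng(0)
adapter_weights = np.eye(DIM) + 0.02 * rng.standard_normal((DIM, DIM))

def adapt_embedding(base_embedding: np.ndarray) -> np.ndarray:
    """Project a base embedding into the fine-tuned space, then
    re-normalize so cosine similarity in the index still behaves."""
    projected = adapter_weights @ base_embedding
    return projected / np.linalg.norm(projected)

query_vec = rng.standard_normal(DIM)
adapted = adapt_embedding(query_vec)
print(adapted.shape, round(float(np.linalg.norm(adapted)), 6))  # (1024,) 1.0
```

Keeping the adapter as a cheap matrix multiply in the request path (rather than a second model endpoint) is what made the Lambda-layer deployment viable.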

Results:

| Metric | Base Titan | Fine-tuned Adapter | Improvement |
|--------|------------|--------------------|-------------|
| Recall@3 | 72% | 86% | +14% |
| Recall@5 | 81% | 92% | +11% |
| MRR (Mean Reciprocal Rank) | 0.68 | 0.81 | +0.13 |
| Manga-specific query Recall@3 | 58% | 83% | +25% |

The +14% Recall@3 improvement translated to measurably better LLM responses — the LLM had better source material to work with.


Area 5: Prompt Optimization & A/B Testing

The Challenge

Prompt engineering lived in an awkward middle ground: DS brought NLP expertise (understanding token probabilities, temperature effects, instruction-following patterns), while I brought production constraints (latency budgets, token costs, format requirements for downstream parsing).

How We Collaborated

DS-led experiments:

- Testing different instruction styles (imperative vs. conversational)
- Optimizing temperature per intent type
- Few-shot example selection (which examples improved quality most?)
- Chain-of-thought vs. direct answer for complex queries

Engineer-led experiments (mine):

- Prompt compression (same quality with fewer tokens)
- Token budget optimization (how much context can we cut before quality degrades?)
- Format instructions that survive streaming (valid JSON at any truncation point)
- Prompt caching compatibility (structuring prompts so the cacheable prefix is maximized)

Joint experiment: We ran a 2-week A/B test on recommendation prompt variants:

| Variant | Prompt Style | Avg Response Quality (human-rated) | Avg Output Tokens | Cost/Response |
|---------|--------------|------------------------------------|-------------------|---------------|
| A (baseline) | Direct instruction | 3.8 / 5.0 | 180 tokens | $0.0027 |
| B (few-shot) | 3 examples + instruction | 4.2 / 5.0 | 210 tokens | $0.0035 |
| C (chain-of-thought) | Reasoning + answer | 4.4 / 5.0 | 350 tokens | $0.0058 |
| D (structured output) | JSON template + instruction | 4.0 / 5.0 | 140 tokens | $0.0021 |

Decision: We chose Variant D — not the highest quality, but the best quality-per-dollar. At 300K LLM calls/day, the cost difference between C ($1,740/day) and D ($630/day) was $1,110/day ($33K/month). The 0.4 quality point difference didn't justify it.
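The "quality-per-dollar" ranking behind that decision can be reproduced directly from the A/B table. This is a minimal sketch using the numbers above; the dictionary layout is just for illustration.

```python
# Rank the four prompt variants by quality points per dollar,
# using the human-rated quality and cost-per-response figures above.

variants = {
    "A direct":           (3.8, 0.0027),
    "B few-shot":         (4.2, 0.0035),
    "C chain-of-thought": (4.4, 0.0058),
    "D structured":       (4.0, 0.0021),
}

ranked = sorted(variants.items(),
                key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
for name, (quality, cost) in ranked:
    print(f"{name}: {quality / cost:.0f} quality points per dollar")
# D structured ranks first (~1905 points/$), C chain-of-thought last (~759)
```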


Area 6: Hallucination Detection & Scoring Pipeline

The Challenge

DS expertise was critical for building automated hallucination detection. I could validate prices and ASINs programmatically, but detecting subtler hallucinations ("the deluxe edition includes exclusive author commentary" — fabricated but plausible) required NLP techniques.

How We Collaborated

DS built the hallucination scoring model:

1. A claim extraction pipeline (using a smaller LLM): extract all factual claims from the response.
2. An entailment classifier: for each claim, check if it's entailed by the source context (RAG chunks + product data).
3. A score from 0 (fully grounded) to 1 (completely fabricated).

I built the production infrastructure:

1. Async scoring pipeline (didn't add latency to the response path).
2. CloudWatch metric for the daily hallucination score average.
3. Alert if the daily average exceeded 0.03.
4. Quarterly review of 500 flagged responses to calibrate the scoring model.

Joint calibration:

The entailment classifier had its own precision-recall tradeoff:

| Threshold | Precision (flagged responses that are actual hallucinations) | Recall (actual hallucinations caught) | False Positive Rate |
|-----------|--------------------------------------------------------------|---------------------------------------|---------------------|
| 0.3 (aggressive) | 62% | 94% | 12% |
| 0.5 (balanced) | 78% | 82% | 5% |
| 0.7 (conservative) | 91% | 65% | 2% |

We chose 0.5 for alerting (catch most hallucinations, tolerate some false positives in the alert stream) and 0.7 for automated blocking (only block responses we're very confident are hallucinated — can't afford false positives that block good responses).
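The two-threshold policy reduces to a small decision function. This is a sketch of the routing logic only — the function name is illustrative, and the entailment scoring itself is the DS team's model, not shown here.

```python
# Two-threshold policy from the calibration above: 0.5 feeds the
# alert stream, 0.7 gates automated blocking.

ALERT_THRESHOLD = 0.5
BLOCK_THRESHOLD = 0.7

def hallucination_action(score: float) -> str:
    """Map an entailment-based hallucination score
    (0 = fully grounded, 1 = fabricated) to a pipeline action."""
    if score >= BLOCK_THRESHOLD:
        return "block"   # high confidence: suppress the response
    if score >= ALERT_THRESHOLD:
        return "alert"   # flag for human review, but let it through
    return "pass"

print([hallucination_action(s) for s in (0.1, 0.55, 0.9)])
# ['pass', 'alert', 'block']
```

Splitting the thresholds this way let the alert stream run hot (more recall) while the blocking path stayed conservative (more precision), matching the asymmetric cost of each mistake.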


Area 7: LLM Evaluation Framework

The Challenge

Traditional ML evaluation (accuracy on a test set) doesn't apply to LLM outputs. There's no single "correct answer" — a response can be factually correct but poorly formatted, or well-written but missing key information. We needed a multi-dimensional evaluation framework.

How We Built It Together

DS designed the evaluation dimensions:

| Dimension | What It Measures | Scoring Method |
|-----------|------------------|----------------|
| Factual Correctness | Are all claims accurate? | Automated (ASIN/price validation) + human audit |
| Completeness | Did the response address the user's question fully? | Human rating (1-5) |
| Relevance | Is the response about the right topic? | Automated (intent match) + human rating |
| Fluency | Is the response well-written and natural? | BLEU/ROUGE against golden responses + human rating |
| Helpfulness | Would a user find this useful? | Human rating (1-5) + thumbs up/down proxy |
| Safety | No toxicity, PII, competitor mentions? | Automated guardrails |
| Format Compliance | Correct JSON structure, markdown rendering? | Automated schema validation |

I built the evaluation pipeline:

```mermaid
graph TD
    A[Model/Prompt Change] --> B[Trigger Evaluation Pipeline]
    B --> C[Run 500 Golden Queries]
    C --> D[Automated Scoring<br>BLEU, ROUGE, format, factual]
    D --> E{All automated<br>checks pass?}
    E -->|No| F[Block Deployment<br>Alert DS + Engineering]
    E -->|Yes| G[Sample 50 for<br>Human Evaluation]
    G --> H[Human Raters Score<br>Completeness, Helpfulness]
    H --> I{Human scores<br>above threshold?}
    I -->|No| F
    I -->|Yes| J[Approve for<br>Canary Deployment]
```

Golden dataset curation was a joint effort:

- I selected 500 representative queries from production (stratified by intent, complexity, and edge cases).
- The DS team wrote reference responses and scoring rubrics.
- We reviewed and revised quarterly — removing stale questions, adding new edge cases discovered in production.
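The automated gate in the pipeline diagram can be sketched as follows. The specific metric names and floors below are illustrative assumptions (the real checks included BLEU/ROUGE, format validation, and factual checks, as described above); only the pass/fail shape of the gate is the point.

```python
# Illustrative automated gate: block deployment unless every
# automated metric on the 500 golden queries clears its floor.
# Metric names and thresholds here are assumed for demonstration.

AUTOMATED_CHECKS = {
    "format_valid_rate": 0.99,   # JSON/schema compliance across golden set
    "factual_pass_rate": 0.97,   # ASIN/price validation pass rate
    "rouge_l_vs_golden": 0.35,   # fluency proxy vs reference responses
}

def automated_gate(scores: dict[str, float]) -> bool:
    """True iff every automated metric clears its threshold."""
    return all(scores.get(name, 0.0) >= floor
               for name, floor in AUTOMATED_CHECKS.items())

run = {"format_valid_rate": 0.995,
       "factual_pass_rate": 0.981,
       "rouge_l_vs_golden": 0.41}
print("proceed to human eval" if automated_gate(run)
      else "block deployment, alert DS + engineering")
# proceed to human eval
```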


Area 8: Model Drift Monitoring & Retraining Pipelines

The Challenge

Both the intent classifier and the LLM's behavior drifted over time. DS needed to detect drift; I needed to retrain and redeploy safely.

How We Collaborated

DS owned drift detection:

- Intent distribution monitoring (alert if any intent's share shifts by >5% week-over-week)
- Classification confidence distribution monitoring (alert if mean confidence drops below 0.85)
- Embedding drift detection (compare query embedding distributions month-over-month using KL divergence)
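The distribution checks above can be sketched in a few lines. This is a simplified illustration — the example intent shares and the implied alert behavior are made up; only the 5% share-shift rule and the use of KL divergence come from the monitoring described here.

```python
import math

# Sketch of intent-distribution drift detection: KL divergence plus
# the >5% week-over-week share-shift rule. Example shares are invented.

def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(P || Q) in nats, assuming both sum to 1 and q has no zero bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

last_period = [0.40, 0.25, 0.20, 0.15]   # intent shares (e.g. recs, returns, ...)
this_period = [0.30, 0.25, 0.25, 0.20]

drift = kl_divergence(this_period, last_period)
share_shift = max(abs(a - b) for a, b in zip(this_period, last_period))
print(f"KL={drift:.4f}, max share shift={share_shift:.0%}")
if share_shift > 0.05:
    print("drift alert: an intent's share moved more than 5 points")
```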

I owned the retraining pipeline:

```mermaid
graph LR
    A[Drift Alert<br>from DS monitoring] --> B[Sample Low-Confidence<br>Production Data]
    B --> C[Human Labeling Queue<br>200 samples/week]
    C --> D[DS Retrains Model<br>Updated training set]
    D --> E[Offline Evaluation<br>Golden dataset]
    E --> F[Shadow Deployment<br>1 week parallel run]
    F --> G[Canary Deployment<br>1% traffic, 24 hours]
    G --> H[Full Rollout<br>100% traffic]
```

Joint incident response:

When we detected drift, the first question was always: "Is this a model problem or a data problem?" We built a decision tree:

| Signal | Model Problem | Data Problem |
|--------|---------------|--------------|
| Accuracy drops on golden dataset | Yes — model degraded | No — golden dataset still works |
| Accuracy drops on production data only | No — production data shifted | Yes — distribution changed |
| New intent pattern emerges | Partially — model doesn't know it | Yes — need new training data |
| Confidence distribution shifts | Yes — model uncertain | Maybe — ambiguous queries increased |

This saved us from knee-jerk retraining when the real issue was a seasonal data shift (holiday traffic pattern) that required new training data, not a new model.
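The decision tree above can be written out as a small triage function. This is a sketch; the boolean inputs and return strings are illustrative simplifications of the signals in the table.

```python
# The model-problem vs data-problem decision tree as a triage function.
# Inputs are boolean signals from the golden-set and production dashboards.

def triage_drift(golden_accuracy_dropped: bool,
                 production_accuracy_dropped: bool,
                 new_intent_pattern: bool) -> str:
    if golden_accuracy_dropped:
        return "model problem: retrain or roll back the model"
    if new_intent_pattern:
        return "data problem: collect and label the new pattern"
    if production_accuracy_dropped:
        return "data problem: production distribution shifted"
    return "no action: signals within normal range"

# Holiday-season case from the text: golden set still passes, but
# production accuracy dips and a seasonal query pattern appears.
print(triage_drift(False, True, True))
# data problem: collect and label the new pattern
```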


Area 9: Cost-Quality Tradeoff Analysis

The Challenge

Every model decision involved a cost-quality tradeoff. DS naturally optimized for quality; I naturally optimized for cost and latency. We needed a shared framework to make these decisions objectively.

The Framework We Developed

Cost per Quality Point (CPQ): We defined a composite quality score (0-100) combining accuracy, helpfulness, and safety. Then we measured the cost to improve it by one point:

Quality Score = 0.4 × Accuracy + 0.3 × Helpfulness + 0.2 × Safety + 0.1 × Fluency

| Proposal | Quality Δ | Cost Δ (monthly) | CPQ ($/point) | Decision |
|----------|-----------|------------------|---------------|----------|
| RoBERTa vs DistilBERT | +2.7 | +$14K | $5,185/point | Reject — too expensive per point |
| Fine-tuned embeddings vs base | +8.5 | +$2K | $235/point | Accept — great ROI |
| Chain-of-thought prompting | +4.0 | +$33K | $8,250/point | Reject — not cost-justified |
| Structured output prompting | +2.0 | -$10K | Saves money AND improves quality | Accept — no brainer |
| Cross-encoder reranker | +6.2 | +$5K | $806/point | Accept — good ROI |
| Human-in-the-loop for low confidence | +3.1 | +$8K | $2,580/point | Accept — for now; revisit at scale |

Decision rule: Accept improvements with CPQ < $3,000/point. Reject those above unless there's a safety or compliance reason.
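The CPQ framework fits in two small functions. This is a minimal sketch using the weights and the $3K/point rule stated above; the function names are illustrative.

```python
# Cost per Quality Point (CPQ): composite quality score plus the
# $3K/point accept/reject rule described above.

def quality_score(accuracy: float, helpfulness: float,
                  safety: float, fluency: float) -> float:
    """Composite 0-100 quality score (all inputs on a 0-100 scale)."""
    return 0.4 * accuracy + 0.3 * helpfulness + 0.2 * safety + 0.1 * fluency

def cpq_decision(quality_delta: float, monthly_cost_delta: float,
                 threshold: float = 3000.0) -> str:
    """Accept if cost per marginal quality point is under the threshold.
    Anything that saves money while improving quality is an automatic accept."""
    if monthly_cost_delta <= 0 and quality_delta > 0:
        return "accept"
    cpq = monthly_cost_delta / quality_delta
    return "accept" if cpq < threshold else "reject"

print(cpq_decision(2.7, 14_000))    # reject  (~$5,185/point)
print(cpq_decision(8.5, 2_000))     # accept  (~$235/point)
print(cpq_decision(2.0, -10_000))   # accept  (saves money and improves quality)
```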

How This Changed Our Conversations

Before CPQ, DS and engineering had subjective debates: "this model is better" vs. "this model is too expensive." After CPQ, every proposal came with a quantified cost-quality tradeoff. It turned disagreements into math problems.


Collaboration Anti-Patterns We Avoided

| Anti-Pattern | What Happens | How We Avoided It |
|--------------|--------------|-------------------|
| "Throw it over the wall" | DS trains model, engineer deploys without understanding it | Joint ownership of evaluation, shared metric thresholds |
| "Research-first" | DS optimizes for accuracy, ignoring latency/cost | Production benchmarking before model selection |
| "Engineering-first" | Engineer picks fastest model, ignoring quality | Quality floor (F1 ≥ 0.85 per class) enforced by DS |
| "Retraining panic" | Knee-jerk retraining on every drift signal | Decision tree: model problem vs. data problem |
| "Metric disagreement" | DS and engineering argue about "what's good enough" | CPQ framework: objective cost-per-quality-point threshold |

Key Takeaways for Interviews

  1. "I didn't just consume models — I shaped them" — Show that you influenced model selection, evaluation criteria, and deployment strategy. Not just "DS gave me a model and I deployed it."

  2. "We defined shared metrics" — The hardest part of cross-functional collaboration is agreeing on what "good" means. DS metrics (F1, accuracy) and engineering metrics (latency, cost) are both necessary but neither is sufficient alone.

  3. "We had a retraining flywheel" — Monthly retraining with production data, automated evaluation, shadow → canary → full rollout. This is production ML maturity.

  4. "Cost per Quality Point resolved disagreements" — When DS and engineering disagreed, CPQ turned subjective arguments into objective math.

  5. "Production != offline benchmarks" — The DS team's top model (RoBERTa at 94.8%) wasn't the right production choice. Real-world constraints (latency, cost, scaling) dominated the decision.

  6. "I built the infrastructure for DS to iterate" — Labeling queues, evaluation pipelines, shadow deployment, A/B test infrastructure — this is the engineer's job in the partnership.