BMS — AI Platform Engineer: Real-World Interview Questions
Target Role: AI Platform Engineer at Bristol Myers Squibb (BMS)
Focus Areas: RAG Systems, LLM Pipelines, Platform Thinking, Evaluation, Leadership
What This Question Set Actually Tests
These questions are not random; they probe:
- End-to-end ownership (idea → production)
- RAG depth (not surface-level familiarity)
- Platform thinking (reusable, scalable systems)
- Failure handling + debugging mindset
- Context-aware and agentic reasoning
- Evaluation + responsible AI
- Leadership + mentoring
- Real production experience (not just theory)
Core Resume / Intro
1. Tell me about yourself.
- Easy: What is your current role and what does your team own?
- Medium: What is the most impactful system you have shipped and why?
- Hard: If you had to distill everything you have built in the last 3 years into a single platform capability, what would it be and how would you sell it to a hiring committee?
2. Why this company?
- Easy: What do you know about BMS's use of AI in drug discovery?
- Medium: How does your experience align with the AI Platform team's goals at BMS?
- Hard: If you were to identify one gap in how pharma companies adopt AI platforms today, what would it be and how would you address it at BMS?
3. Why this role?
- Easy: What about "AI Platform" specifically interests you over a pure ML or data engineering role?
- Medium: How does this role differ from what you have done before, and how will you bridge that gap?
- Hard: What would you change about the way most pharma companies build internal AI platforms, and how would you drive that change from this role?
4. Any questions before starting?
- Easy: Is there anything about the role or process you would like to clarify?
- Medium: What does success look like for this role in the first 6 months?
- Hard: What are the biggest technical bets the AI Platform team is making this year, and where do you see the highest execution risk?
Project Deep Dive — RAG Chatbot
5. Walk me through your RAG-based chatbot.
- Easy: What problem was the chatbot solving and who were its users?
- Medium: What were the major components — ingestion, retrieval, generation — and how did data flow between them?
- Hard: If you had to rebuild this system today with lessons learned, what would you change architecturally and why?
6. What does the low-level RAG pipeline look like?
- Easy: Can you draw or describe the sequence of steps from a user query to a final answer?
- Medium: How did you handle chunking strategy, embedding generation, and index updates in the pipeline?
- Hard: How did you balance latency vs. retrieval accuracy, and what tradeoffs did you make at each stage of the pipeline?
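A minimal sketch of the query path the Easy variant asks you to narrate; `embed`, `search`, and `generate` are placeholders for the real embedding model, OpenSearch query, and LLM call, not a prescribed implementation.

```python
from typing import Callable

def answer_query(
    query: str,
    embed: Callable[[str], list[float]],               # embedding model call (placeholder)
    search: Callable[[list[float], int], list[dict]],  # vector/hybrid search against the chunk index (placeholder)
    generate: Callable[[str], str],                     # LLM completion call (placeholder)
    top_k: int = 5,
) -> str:
    """Minimal RAG query path: embed -> retrieve -> assemble prompt -> generate."""
    query_vector = embed(query)
    chunks = search(query_vector, top_k)
    context = "\n\n".join(c["text"] for c in chunks)
    prompt = (
        "Answer using only the context below. If it is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    return generate(prompt)
```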
7. How are documents processed and indexed into OpenSearch?
- Easy: What document formats did you support and how were they parsed?
- Medium: What chunking and metadata enrichment strategy did you apply before indexing, and how did document structure affect it?
- Hard: How did you handle schema evolution in your index — for example, when you needed to add new metadata fields without reindexing everything?
8. What triggers the ingestion worker?
- Easy: Is ingestion event-driven, scheduled, or manually triggered?
- Medium: How did you ensure idempotency so that re-running ingestion does not create duplicate chunks?
- Hard: How would you design a near-real-time ingestion pipeline that handles updates, deletions, and version conflicts at scale?
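One common answer to the Medium question is deterministic chunk IDs, so re-running ingestion upserts rather than duplicates. This sketch assumes an opensearch-py client; the index layout and field names are illustrative.

```python
import hashlib

def chunk_id(doc_id: str, chunk_index: int, text: str) -> str:
    """Deterministic chunk ID: the same document, position, and content always map to the
    same ID, so re-running ingestion overwrites existing chunks instead of duplicating them."""
    digest = hashlib.sha256(f"{doc_id}:{chunk_index}:{text}".encode()).hexdigest()
    return f"{doc_id}--{chunk_index}--{digest[:16]}"

def index_chunks(client, index: str, doc_id: str, chunks: list[str]) -> None:
    # client is assumed to be an opensearch-py OpenSearch instance;
    # indexing with an explicit id makes each write an idempotent upsert.
    for i, text in enumerate(chunks):
        client.index(
            index=index,
            id=chunk_id(doc_id, i, text),
            body={"doc_id": doc_id, "chunk_index": i, "text": text},
        )
```

If a document's content changes, the content hash yields new IDs, so a real pipeline would also delete the document's previous chunks (for example, by `doc_id`) before reindexing.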
9. What kind of retrieval strategy did you use?
- Easy: Did you use dense retrieval (vector search), sparse retrieval (BM25), or a hybrid approach?
- Medium: How did you tune top-K and similarity thresholds, and what signals told you those values were right?
- Hard: How would you design a retrieval strategy that adapts dynamically to query type — factual vs. analytical vs. multi-hop — without requiring reindexing?
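For the hybrid option in the Easy and Medium variants, a minimal weighted-fusion sketch with min-max normalization; `alpha` is the knob you would tune against a labeled retrieval set, and all names here are illustrative.

```python
def normalize(scores: dict[str, float]) -> dict[str, float]:
    """Min-max normalize scores to [0, 1] so BM25 and cosine scores become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid_scores(
    dense: dict[str, float], sparse: dict[str, float], alpha: float = 0.6
) -> list[tuple[str, float]]:
    """Weighted fusion: alpha * dense + (1 - alpha) * sparse, on normalized scores."""
    d, s = normalize(dense), normalize(sparse)
    fused = {
        doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)
        for doc in set(d) | set(s)
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```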
10. What was the scale — documents vs. chunks?
- Easy: Roughly how many documents and chunks were in the index at peak?
- Medium: How did index size affect retrieval latency, and what optimizations did you apply?
- Hard: At what point does a single OpenSearch cluster become a bottleneck for a RAG system, and how would you design around it?
Retrieval & Evaluation
11. How did you validate retrieval techniques?
- Easy: What evaluation dataset did you use to benchmark retrieval quality?
- Medium: How did you measure precision, recall, and ranking quality for retrieved chunks?
- Hard: How do you build a continuously updating retrieval evaluation pipeline that keeps pace with new document additions and query distribution shifts?
12. How did you validate metadata design vs. query domain?
- Easy: What metadata fields did you attach to chunks and why?
- Medium: How did you test whether metadata filters improved or hurt retrieval quality for different query types?
- Hard: How would you automatically infer optimal metadata schemas from a new document corpus without manual labeling?
13. What is Hit@K / retrieval hit rate?
- Easy: How is Hit@K defined and what does it measure?
- Medium: What K value did you optimize for and how did it relate to your LLM context window budget?
- Hard: How do you balance Hit@K (coverage) against MRR (ranking quality), and which matters more when the LLM can synthesize across multiple chunks?
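Since the Easy prompt asks for the definition, here is a minimal sketch of Hit@K alongside MRR, assuming you have relevance labels per query (chunk IDs marked relevant by annotators).

```python
def hit_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """1.0 if any relevant chunk appears in the top-k results, else 0.0."""
    return float(any(doc in relevant for doc in retrieved[:k]))

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant result (0.0 if none was retrieved)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

# Averaged over an eval set, Hit@K measures coverage ("did anything useful make it
# into the context window?") while MRR measures how high the useful chunk ranked.
```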
14. What metrics did you use for evaluation?
- Easy: Name the top 3 metrics you tracked for RAG quality.
- Medium: How did you decompose end-to-end answer quality into retrieval quality vs. generation quality?
- Hard: How do you design a metric that captures hallucination risk specifically caused by poor retrieval rather than LLM generation failure?
15. Did you experiment with non-embedding methods like BM25?
- Easy: What is BM25 and when does it outperform dense vector search?
- Medium: How did you implement hybrid retrieval and how did you weight BM25 vs. vector scores?
- Hard: What is Reciprocal Rank Fusion (RRF) and how would you tune the fusion strategy based on query type rather than a static weight?
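A minimal sketch of Reciprocal Rank Fusion for the Hard follow-up; k = 60 is the commonly cited default constant, and the chunk IDs in the usage line are purely illustrative.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Each ranked list contributes 1 / (k + rank) per document; k damps the
    influence of top ranks so no single retriever dominates the fused order."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return [doc for doc, _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)]

# Example: fuse a BM25 ranking with a dense-vector ranking.
fused = reciprocal_rank_fusion([["c3", "c1", "c7"], ["c1", "c9", "c3"]])
```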
Failure Cases & Improvements
16. What are scenarios where retrieval fails?
- Easy: Give two common examples where a RAG system returns irrelevant chunks.
- Medium: How did you detect retrieval failures in production — what signals or logs did you use?
- Hard: How would you build an automated failure detection system that distinguishes retrieval failures from generation failures without human review?
17. How do you handle multi-intent queries?
- Easy: What is a multi-intent query? Give an example.
- Medium: How did you detect multi-intent queries before retrieval, and what did you do with them?
- Hard: Design a query decomposition + parallel retrieval system that handles multi-intent queries, merges results coherently, and does not hallucinate across sub-answers.
Query Orchestration / Intelligence
18. How do you break complex queries into sub-queries?
- Easy: What technique did you use — prompt-based decomposition, a classifier, or a planning LLM?
- Medium: How did you handle dependencies between sub-queries — for example, when answer A is needed before query B can be formed?
- Hard: How would you build a self-correcting query decomposition system that re-plans if intermediate retrieval results are insufficient?
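A minimal prompt-based decomposition sketch for the Easy option; the prompt wording and the `llm` callable are placeholders, and a planner variant would also emit dependencies between sub-queries rather than treating them as independent.

```python
import json
from typing import Callable

DECOMPOSE_PROMPT = """Break the user question into independent sub-questions.
Return a JSON list of strings. If the question is already atomic, return it as a single-item list.

Question: {question}
Sub-questions:"""

def decompose(question: str, llm: Callable[[str], str]) -> list[str]:
    """Prompt-based decomposition with a safe fallback to the original query."""
    raw = llm(DECOMPOSE_PROMPT.format(question=question))
    try:
        subs = json.loads(raw)
    except json.JSONDecodeError:
        subs = [question]  # fall back if the model does not return valid JSON
    return subs if isinstance(subs, list) and subs else [question]
```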
19. How do you decide intent and decomposition logic?
- Easy: Is intent classification rule-based, model-based, or LLM-based in your system?
- Medium: How did you build and maintain the intent taxonomy, and how did you handle queries that fell outside defined intents?
- Hard: How do you ensure intent classification remains accurate as new document domains are added to the platform without retraining the classifier every time?
20. How do you orchestrate parallel retrieval and merge results?
- Easy: Did you use async calls for parallel retrieval, and what framework managed them?
- Medium: How did you handle the case where one sub-query retrieval returns no results — did you fail fast or merge partial results?
- Hard: How would you design a result-merging strategy that reconciles the incompatible relevance scores produced when dense and sparse retrievers return results for different sub-queries?
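A minimal asyncio fan-out for the Easy and Medium variants, keeping partial results rather than failing the whole request when one sub-query retrieval errors; `retrieve` is a placeholder for your async retrieval call.

```python
import asyncio
from typing import Awaitable, Callable

async def retrieve_all(
    sub_queries: list[str],
    retrieve: Callable[[str], Awaitable[list[dict]]],  # async retrieval per sub-query (placeholder)
) -> list[list[dict]]:
    """Run retrieval for all sub-queries concurrently; a failed sub-query yields an
    empty result list instead of aborting the whole request."""
    results = await asyncio.gather(*(retrieve(q) for q in sub_queries), return_exceptions=True)
    return [r if isinstance(r, list) else [] for r in results]
```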
Conversational AI / Context Handling
21. Does your system handle multi-turn conversations?
- Easy: How is conversation history stored and passed to the LLM?
- Medium: How do you decide what to keep and what to drop from conversation history as context grows?
- Hard: How would you design a context compression system that stays within token limits while preserving reasoning-critical history, so that compression itself does not degrade answer quality?
22. How do you make queries context-aware?
- Easy: Do you inject raw history into the prompt or rewrite the current query?
- Medium: How does the system decide which prior turns are relevant to rewrite the current query?
- Hard: How would you train or fine-tune a lightweight query rewriter that generalizes well across different conversation domains?
23. How does context → query rewrite → retrieval work end to end?
- Easy: Walk through the exact prompt structure you use for query rewriting.
- Medium: How did you measure whether query rewriting improved or degraded retrieval quality?
- Hard: How would you build a self-evaluating rewrite pipeline that detects when rewriting changed the query's semantic intent and falls back gracefully?
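An illustrative rewrite prompt and wrapper for the Easy variant; the prompt wording and the `llm` callable are assumptions rather than a prescribed structure.

```python
REWRITE_PROMPT = """Given the conversation so far, rewrite the latest user message as a
standalone search query. Keep entities and constraints explicit; do not answer the question.

Conversation:
{history}

Latest message: {message}
Standalone query:"""

def rewrite_query(history: list[tuple[str, str]], message: str, llm) -> str:
    """history is a list of (role, text) turns; `llm` is a placeholder completion call."""
    formatted = "\n".join(f"{role}: {text}" for role, text in history)
    rewritten = llm(REWRITE_PROMPT.format(history=formatted, message=message)).strip()
    return rewritten or message  # fall back to the original message if rewriting returns nothing
```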
24. Do you use a sliding window for context?
- Easy: What is a sliding window in the context of conversation history management?
- Medium: How did you choose the window size, and what happened at the boundary when old turns were evicted?
- Hard: How would you replace a fixed sliding window with a dynamic context selection system that prioritizes semantically relevant turns instead of recency?
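A minimal token-budget sliding window for the Easy and Medium variants; `count_tokens` stands in for whatever tokenizer you use, and the eviction behavior at the boundary is the part worth discussing.

```python
from typing import Callable

def sliding_window(turns: list[str], max_tokens: int, count_tokens: Callable[[str], int]) -> list[str]:
    """Keep the most recent turns whose combined token count fits the budget."""
    kept: list[str] = []
    used = 0
    for turn in reversed(turns):      # walk backwards from the newest turn
        cost = count_tokens(turn)
        if used + cost > max_tokens:
            break                      # older turns are evicted once the budget is exhausted
        kept.append(turn)
        used += cost
    return list(reversed(kept))        # restore chronological order
```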
25. How do you use an LLM for query understanding?
- Easy: What tasks does the LLM perform before retrieval — classification, rewriting, decomposition?
- Medium: How do you ensure the LLM query understanding step does not add unacceptable latency to the overall response time?
- Hard: How would you distill a large LLM query understanding step into a smaller, faster model without significantly degrading quality?
State & Persistence
26. What persistence layer did you use for conversation?
- Easy: Why did you choose DynamoDB over a relational database for conversation state?
- Medium: How did you structure the DynamoDB schema to support efficient reads for the most recent N turns?
- Hard: How would you design a persistence layer that supports both real-time conversation retrieval and long-term pattern analysis across millions of sessions?
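A minimal sketch of the access pattern the Medium question targets, assuming a DynamoDB table with `session_id` as partition key and a timestamp sort key; the table and attribute names are illustrative, not a prescribed schema.

```python
import time
import boto3
from boto3.dynamodb.conditions import Key

# Illustrative table: partition key = session_id, sort key = turn_ts (milliseconds).
table = boto3.resource("dynamodb").Table("conversation_turns")

def save_turn(session_id: str, role: str, text: str) -> None:
    table.put_item(Item={
        "session_id": session_id,            # partition key
        "turn_ts": int(time.time() * 1000),  # sort key keeps turns in order
        "role": role,
        "text": text,
    })

def last_n_turns(session_id: str, n: int = 10) -> list[dict]:
    """Query newest-first (ScanIndexForward=False), then restore chronological order."""
    resp = table.query(
        KeyConditionExpression=Key("session_id").eq(session_id),
        ScanIndexForward=False,
        Limit=n,
    )
    return list(reversed(resp["Items"]))
```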
27. Did you hit DynamoDB size limits? How did you handle it?
- Easy: What is the DynamoDB item size limit and how does it apply to conversation storage?
- Medium: What strategy did you use — summarization, external storage references, or chunked records — and why?
- Hard: How would you implement a seamless history archival system where older turns are compressed and moved to cold storage but still retrievable mid-conversation if the LLM needs them?
Monitoring, Evaluation & Feedback
28. Do you have monitoring pipelines?
- Easy: What metrics does your monitoring pipeline track — latency, error rate, retrieval quality?
- Medium: How are monitoring alerts structured and at what thresholds do they fire?
- Hard: How would you design a monitoring system that detects semantic degradation — where answers become less accurate even when latency and error rate remain stable?
29. How do you collect user feedback?
- Easy: What UI mechanism captures feedback — thumbs up/down, star rating, freeform text?
- Medium: How is feedback stored and linked back to the specific query, retrieved chunks, and generated response?
- Hard: How do you handle selection bias in user feedback — where only extreme responses (very good or very bad) get rated — and still extract reliable signal?
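One illustrative shape for a feedback record that stays joinable to the full RAG trace, relevant to the Medium question above; the field names are assumptions, not a prescribed schema.

```python
import time
from dataclasses import asdict, dataclass, field

@dataclass
class FeedbackEvent:
    """One feedback record, keyed so it can be joined back to the exact query/response pair."""
    session_id: str
    request_id: str                  # ties feedback to a specific RAG request
    query: str
    retrieved_chunk_ids: list[str]   # chunks that were in context when the answer was produced
    response: str
    rating: int                      # e.g. +1 / -1 for thumbs up / down
    comment: str = ""
    created_at: float = field(default_factory=time.time)

# asdict(event) yields a JSON-serializable record for the feedback store / eval pipeline.
```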
30. What does feedback look like — a thumbs up / thumbs down system?
- Easy: What data is captured alongside the thumbs signal — session ID, query, response?
- Medium: How do you use thumbs-down signals to trigger human review vs. automated reprocessing?
- Hard: How would you design a multi-signal feedback system that captures implicit signals (dwell time, follow-up queries, copy actions) alongside explicit thumbs ratings?
31. Do users provide the correct answer for thumbs down?
- Easy: Is providing a correction optional or prompted?
- Medium: How do you validate user-provided corrections — do you accept them as ground truth directly?
- Hard: How would you build a correction verification workflow that uses LLM-as-judge to assess whether user corrections are actually better than the original response before adding them to your evaluation set?
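A hedged sketch of the LLM-as-judge gate the Hard variant describes; the prompt wording and `llm` callable are placeholders, and a production gate would sample several judgments and require agreement rather than trusting one.

```python
JUDGE_PROMPT = """You are comparing two candidate answers against the retrieved context.
Context:
{context}

Question: {question}
Answer A (original system response): {original}
Answer B (user-provided correction): {correction}

Which answer is better supported by the context? Reply with exactly "A" or "B"."""

def correction_is_better(question: str, context: str, original: str, correction: str, llm) -> bool:
    """Gate a user correction before it enters the evaluation set."""
    verdict = llm(JUDGE_PROMPT.format(
        context=context, question=question, original=original, correction=correction,
    )).strip().upper()
    return verdict.startswith("B")
```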
32. How do you build an automated feedback → evaluation pipeline?
- Easy: How does raw feedback data flow into your evaluation dataset?
- Medium: How do you ensure feedback-derived examples are representative and not skewed by noisy or adversarial inputs?
- Hard: Design a closed-loop pipeline where production feedback continuously fine-tunes retrieval and generation components with minimal human intervention and verifiable quality gates.
Leadership / Mentoring
33. Tell me about a project with junior team members.
- Easy: How many junior engineers were on the team and what were their backgrounds?
- Medium: How did you divide responsibilities and ensure quality without creating a bottleneck on yourself?
- Hard: Tell me about a time a junior engineer proposed a significantly different approach than yours — how did you evaluate it and what was the outcome?
34. How did you mentor and guide them?
- Easy: What did your regular cadence with junior engineers look like — 1:1s, code reviews, design reviews?
- Medium: How did you calibrate how much guidance to give vs. letting engineers figure things out independently?
- Hard: How do you measure whether your mentoring is actually accelerating someone's growth vs. creating dependency?
End-to-End Ownership
35. Tell me about a project that went from idea to production.
- Easy: What was the original idea and who was the stakeholder that motivated it?
- Medium: What were the key milestones and what went differently than planned?
- Hard: At what point were you most tempted to scope down or cut a feature, and how did you decide whether to do it?
36. What were your key decisions and tradeoffs?
- Easy: Name the single most consequential technical decision you made on this project.
- Medium: What was a decision you made that turned out to be wrong, and how did you recover?
- Hard: How do you structure decision-making on a platform project when different stakeholders — scientists, engineers, product — have conflicting priorities and you have to choose?
Tech Awareness / Growth
37. How do you keep up with GenAI trends?
- Easy: What are two or three sources you regularly read or follow?
- Medium: Describe a recent development in GenAI — such as a new model capability or evaluation technique — that you found most relevant to your work.
- Hard: How do you evaluate whether a new GenAI trend is genuinely useful for your platform or just hype, and how do you build that judgment into a team's decision process?
Frontend / Full Stack
38. Are you familiar with React and TypeScript?
- Easy: Have you built or maintained any React/TypeScript frontend as part of an AI platform?
- Medium: How did you design the frontend component that surfaced RAG responses — what state management, streaming, and feedback UI did you implement?
- Hard: How would you architect a streaming LLM response UI that handles partial renders, error states, and inline source citations without creating a poor user experience?
Questions for the Team
39. How do you define success for Accelerator → Scale?
- Easy: What does "Accelerator" mean in this context — is it a prototyping environment?
- Medium: What are the criteria that determine when a use case graduates from Accelerator to Scale?
- Hard: What are the biggest risks that prevent AI use cases from graduating from Accelerator to Scale, and how does the platform team actively mitigate them?
40. What are current gaps in GenAI platform components?
- Easy: Which platform components are most mature and which are still being built?
- Medium: Are there specific RAG, orchestration, or evaluation components that teams commonly rebuild from scratch instead of reusing?
- Hard: If you could eliminate one category of redundant work that individual teams are currently doing on top of the platform, what would it be?
41. How are pods structured?
- Easy: What does a typical pod look like — size, roles, embedded ML scientists vs. engineers?
- Medium: How do pods coordinate on shared platform components to avoid divergence?
- Hard: How does the platform team handle it when a pod needs a platform capability faster than the platform team can deliver it?
42. How do you handle responsible AI and evaluation?
- Easy: What evaluation criteria exist today for safety, bias, and hallucination?
- Medium: Is responsible AI evaluation embedded into CI/CD for model updates, or is it a separate gating process?
- Hard: How do you balance moving fast on new GenAI capabilities with thorough responsible AI review in a regulated industry like pharma?
43. What does success look like in the first 90 days?
- Easy: What onboarding resources or documentation exist to ramp up quickly?
- Medium: Are there specific projects or deliverables expected within the first 90 days?
- Hard: What is a decision or contribution in the first 90 days that would signal to this team that they made the right hire?
Last updated: March 2026