
LLD Interview Questions - MangaAssist Component Deep Dives

A comprehensive set of questions organized by difficulty level, simulating a panel of interviewers drilling into the Low-Level Design of each component.


How to Use This Question Set

  • Use Easy to warm up on schemas, entities, and the role of each service.
  • Use Medium to rehearse request flow, retrieval design, and guardrail sequencing.
  • Use Hard, Very Hard, and the architect-focused section when you want deeper discussion around production constraints, data-model limits, and platform evolution.

Topic Deep-Dive Packs


Easy (Junior / Entry-Level Engineers)

These questions test basic understanding of individual components and their schemas.

Interviewer: Engineering Manager

  1. What is the role of the Chatbot Orchestrator? What are its main responsibilities? - Expected: The Orchestrator is the central coordinator. It loads conversation history, calls the Intent Classifier, routes to appropriate services based on intent, aggregates data, passes it to the LLM, runs guardrails, and returns the final response.

  2. What is a PageContext and why is it included in the chat request? - Expected: PageContext contains information about what the user is currently viewing - current ASIN, store section, cart items, and browsing history. It helps the chatbot give contextually relevant responses.

  3. What fields are in a ChatRequest? Explain each one briefly. - Expected: sessionId (unique session identifier), customerId (optional, for logged-in users), message (the user's text), pageContext (current page info), timestamp (when the message was sent). A minimal dataclass sketch of these shapes appears after this list.

  4. Why does the ChatResponse include a products array and an actions array? - Expected: products shows product cards when recommendations or search results are relevant. actions provides interactive buttons like "Add to Cart" or "See More Like This" so users can take action directly from the chat.

  5. What is a Turn in conversation memory? What fields does it have? - Expected: A Turn is one message in the conversation (either from the user or the assistant). Fields: role (user/assistant), content (the message text), timestamp, and optionally the intent that was classified.
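
To make the request and memory shapes from questions 2, 3, and 5 concrete, here is a minimal Python sketch. Field names follow the expected answers above; the defaults and the Literal role type are illustrative assumptions, not the actual MangaAssist code.

    # Minimal sketch of the chat-request and conversation-memory shapes.
    from dataclasses import dataclass, field
    from typing import Literal, Optional

    @dataclass
    class PageContext:
        current_asin: Optional[str] = None       # ASIN the user is viewing, if any
        store_section: Optional[str] = None      # e.g., "manga"
        cart_items: list[str] = field(default_factory=list)
        browsing_history: list[str] = field(default_factory=list)

    @dataclass
    class ChatRequest:
        session_id: str                          # unique session identifier
        message: str                             # the user's text
        page_context: PageContext                # what the user is currently viewing
        timestamp: str                           # when the message was sent
        customer_id: Optional[str] = None        # present only for logged-in users

    @dataclass
    class Turn:
        role: Literal["user", "assistant"]
        content: str                             # the message text
        timestamp: str
        intent: Optional[str] = None             # classified intent, if available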

Interviewer: Senior Engineer

  1. In the Intent Classifier, what is the two-stage system and why does it exist? - Expected: Stage 1 is a fast rule-based pre-filter using regex patterns (cheap, fast). Stage 2 is a BERT classifier on SageMaker (more accurate but slower). Stage 2 only runs if Stage 1 confidence is below 0.8. This saves cost and reduces latency for obvious intents. (A minimal two-stage sketch follows this list.)

  2. What entities does the Intent Classifier extract? Give an example. - Expected: Entities include series_name, volume_number, attribute, and asin. Example: "Is Demon Slayer Vol 12 in English?" -> {series_name: "Demon Slayer", volume_number: "12", attribute: "language"}.

  3. What is the DynamoDB schema for conversation sessions? What is the partition key? - Expected: PK is session_id. Also has a GSI on customer_id. Attributes include turns (list), page_context (map), created_at, updated_at, ttl, turn_count, last_intent.

  4. What is the TTL set to for conversation sessions, and why? - Expected: 24 hours (86400 seconds). Sessions expire after a day because most shopping interactions are short-lived. Keeps storage costs down and avoids stale context.

  5. What are follow_up_suggestions in the response format? Why are they useful? - Expected: Pre-generated suggested follow-up questions like "Tell me about the art style" or "Is there a box set?". They guide the user, reduce friction, and help users discover features they might not know to ask about.
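
A minimal sketch of the two-stage classifier from question 1, assuming a regex pre-filter and a SageMaker-hosted BERT fallback. The patterns, endpoint name, and payload format are illustrative assumptions.

    import json
    import re
    import boto3

    # Stage 1: cheap regex pre-filter; a pattern hit is treated as high confidence.
    RULE_PATTERNS = {
        "order_status": re.compile(r"\b(where is|track|status of) my order\b", re.I),
        "return_policy": re.compile(r"\breturn (policy|window)\b", re.I),
    }

    sagemaker = boto3.client("sagemaker-runtime")

    def classify(message: str) -> dict:
        for intent, pattern in RULE_PATTERNS.items():
            if pattern.search(message):
                return {"intent": intent, "confidence": 0.95, "stage": "rules"}

        # Stage 2: rules were not confident (below the 0.8 threshold), so pay for BERT.
        resp = sagemaker.invoke_endpoint(
            EndpointName="mangaassist-intent-bert",   # hypothetical endpoint name
            ContentType="application/json",
            Body=json.dumps({"text": message}),
        )
        result = json.loads(resp["Body"].read())
        return {"intent": result["intent"], "confidence": result["confidence"], "stage": "bert"}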

Medium (Mid-Level Engineers, 2-5 Years Experience)

These questions test component interactions, design decisions, and implementation details.

Interviewer: Senior Engineer

  1. Walk through the Orchestrator's state machine. What happens between ReceiveMessage and ReturnResponse?

    • Expected: ReceiveMessage -> LoadContext -> ClassifyIntent -> RouteToService -> AggregateData -> GenerateResponse -> ApplyGuardrails -> SaveTurn -> ReturnResponse.
  2. In the RAG indexing pipeline, what chunk size is used and why is there overlap?

    • Expected: 512 tokens with 50-token overlap. 512 is large enough to contain meaningful context but small enough for precise retrieval. The 50-token overlap prevents information from being split across chunk boundaries, ensuring continuity. (A chunking sketch follows this list.)
  3. What embedding model is used for the RAG pipeline? What is the vector dimension?

    • Expected: Amazon Titan Embeddings V2 with 1536-dimensional vectors. It's used because it's natively available in AWS, integrates well with Bedrock, and produces high-quality embeddings.
  4. Explain the retrieval flow in the RAG pipeline step by step.

    • Expected: (1) User query is embedded into a 1536-dim vector. (2) KNN search on OpenSearch returns top 10 candidate chunks. (3) A reranker (cross-encoder) reranks the 10 chunks by relevance. (4) Top 3 chunks are selected. (5) These 3 chunks, along with system prompt and user query, are sent to the LLM.
  5. What metadata is attached to each RAG chunk, and how is it useful during retrieval?

    • Expected: source_type (faq, product, policy), asin (if product-specific), category (manga), last_updated (freshness). Metadata can be used for filtered search (e.g., only retrieve FAQ chunks when intent is faq), freshness ranking, and source attribution in responses.
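
A minimal sketch of the fixed-size chunker with overlap from question 2: 512-token windows stepping forward by 462 tokens, so consecutive chunks share 50 tokens. Whitespace splitting stands in for the real tokenizer.

    def chunk_tokens(tokens: list[str], chunk_size: int = 512, overlap: int = 50) -> list[list[str]]:
        step = chunk_size - overlap
        chunks = []
        for start in range(0, len(tokens), step):
            chunks.append(tokens[start:start + chunk_size])
            if start + chunk_size >= len(tokens):
                break   # the last window already covers the tail of the document
        return chunks

    # Example: a 1,000-token document yields windows covering tokens 0-511, 462-973, 924-999.
    chunks = chunk_tokens("long product description or FAQ article ...".split())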

Interviewer: Staff Engineer

  1. The memory management summarizes after 20 turns instead of truncating. Why is summarization better than truncation?

    • Expected: Summarization preserves key context - what the user was looking for, what was recommended, what they liked/disliked. Truncation discards the oldest turns completely, potentially losing important context like the user's initial preferences. Summarization reduces token count for LLM prompts while retaining semantic value.
  2. In the Guardrails pipeline, 6 checks run sequentially. What is the order and why does it matter?

    • Expected: PII Filter -> Price Validator -> Toxicity Filter -> Competitor Filter -> Hallucination (ASIN) Check -> Scope Check as the final gate. Order matters: PII should be caught first (highest risk). Price validation runs before toxicity because wrong prices are more common than toxic content. The hallucination check comes late because it requires catalog lookups (the most expensive check).
  3. The API has three endpoints: /chat/message, /chat/feedback, and /chat/escalate. Why are these separate endpoints instead of a single unified endpoint?

    • Expected: Separation of concerns. Message handling is complex (requires orchestration, LLM, etc.). Feedback is a simple write (fire-and-forget to a queue). Escalation triggers a different workflow (Amazon Connect). Different latency requirements, different processing pipelines, different scaling characteristics.
  4. How does the ASIN Validation guardrail work? What happens when an invalid ASIN is detected?

    • Expected: After the LLM generates a response containing product ASINs, the Hallucination Check guardrail queries the Product Catalog to verify each ASIN exists and is currently available. If an ASIN is invalid, the product is removed from the response. The response text is adjusted or regenerated if removing the product makes it incoherent.
  5. In the analytics schema, message_text is stored as PII-scrubbed. How would you implement PII scrubbing before writing to Kinesis?

    • Expected: Apply regex-based detection for common PII patterns (email, phone, SSN, credit card numbers) plus named entity recognition for names/addresses. Replace detected PII with tokens like [EMAIL], [PHONE]. Run scrubbing before events enter the Kinesis stream, not after - data should never hit the analytics pipeline with PII intact. (A minimal scrubber sketch follows this list.)
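
A minimal regex-based scrubber along the lines of question 5, applied before events are written to Kinesis. The patterns are simplified illustrations; a real implementation would add NER for names and addresses plus locale-specific formats.

    import re

    PII_PATTERNS = {
        "[EMAIL]": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
        "[PHONE]": re.compile(r"\b(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
        "[SSN]":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
        "[CARD]":  re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    }

    def scrub_pii(text: str) -> str:
        for token, pattern in PII_PATTERNS.items():
            text = pattern.sub(token, text)
        return text

    # scrub_pii("email me at fan@example.com or 555-123-4567")
    # -> "email me at [EMAIL] or [PHONE]"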

Hard (Senior Engineers, 5-10 Years Experience)

These questions test deep implementation choices, edge cases, and production concerns.

Interviewer: Staff Engineer

  1. The Intent Classifier uses a BERT model on SageMaker as fallback. How would you handle model drift - e.g., the classifier's accuracy degrades over 6 months?

    • Expected: (1) Monitor classification confidence distribution over time - if average confidence drops, drift is occurring. (2) Sample low-confidence classifications and have humans label them (data flywheel). (3) Periodically retrain on updated labeled data (monthly or when drift is detected). (4) Shadow evaluation - run new model in shadow mode against production traffic, compare accuracy before promoting. (5) A/B test model versions.
  2. The DynamoDB session table uses session_id as the partition key. What happens if a single session generates very high read/write throughput? How would you mitigate hot partitions?

    • Expected: A single session won't generate enough traffic for a hot partition (one user = one session). The real risk is many sessions landing on the same partition. DynamoDB handles this with adaptive capacity. But if we add a GSI on customer_id, and a single customer has many sessions, the GSI could have hotspots. Mitigation: write sharding on the GSI, or accept eventual consistency on the GSI.
  3. In the system prompt template, {{rag_chunks}} injects retrieved documents. What happens if the retrieved chunks are contradictory (e.g., one says 30-day return policy, another says 14-day)?

    • Expected: (1) This is a data quality issue - chunks should be deduplicated and versioned during indexing. (2) Attach last_updated metadata and instruct the LLM to prefer the most recent information. (3) In the system prompt, add a rule: "If sources conflict, cite the most recent source and mention that the information may have been recently updated." (4) Flag contradictions for human review via the analytics pipeline.
  4. The chat response includes a metadata.latency_ms field. How would you measure this accurately, and what latency breakdown would you want?

    • Expected: Measure wall-clock time from request received to response sent. Break down into: (1) Auth + session load (DynamoDB read): ~10-50ms. (2) Intent classification (rule-based: <5ms, BERT: ~50-100ms). (3) Service fan-out (varies by intent): ~100-500ms. (4) RAG retrieval (embedding + search + reranking): ~200-400ms. (5) LLM generation (Bedrock): ~500-2000ms. (6) Guardrails: ~50-100ms. Instrument each step with tracing (X-Ray or OpenTelemetry). (A per-stage timing sketch follows this list.)
  5. How would you handle the scenario where the Bedrock LLM returns a partial response (e.g., network interruption during streaming)?

    • Expected: (1) Detect incomplete response via a sentinel token or length heuristic. (2) Retry with the same prompt (idempotent request). (3) If retry fails, return a graceful degradation response: provide the partial information if it passed guardrails, or a fallback "I encountered an issue" message. (4) Don't save incomplete turns to conversation memory. (5) Log the incident for monitoring. (6) Use Bedrock's built-in retry configuration.
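
A per-stage timing sketch for question 4. A real deployment would emit these as X-Ray or OpenTelemetry spans; here the stage durations are simply collected into the metadata.latency_ms breakdown returned with the response. The helper and stage names are illustrative.

    import time
    from contextlib import contextmanager

    timings_ms: dict[str, float] = {}

    @contextmanager
    def timed(stage: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            timings_ms[stage] = round((time.perf_counter() - start) * 1000, 1)

    # Usage inside the Orchestrator (service calls are placeholders):
    # with timed("session_load"):
    #     session = load_session(session_id)
    # with timed("intent_classification"):
    #     intent = classify(message)
    # metadata = {"latency_ms": round(sum(timings_ms.values())), "breakdown": timings_ms}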

Interviewer: Principal Engineer

  1. The OpenSearch RAG index uses HNSW with nmslib engine. How does HNSW work, and what are the trade-offs of the configuration choices (M, ef_construction, ef_search)?

    • Expected: HNSW (Hierarchical Navigable Small World) builds a multi-layer graph where each node connects to M nearest neighbors. During search, it starts at an entry point in the top layer and greedily descends toward the query's nearest neighbors. M (connections per node): higher = better recall, more memory and slower indexing. ef_construction (candidate list during build): higher = better index quality, slower build. ef_search (candidate list during query): higher = better recall, higher latency. For the RAG use case, optimize for recall (accuracy) over latency since the LLM call dominates latency anyway. Typical values: M=16, ef_construction=256, ef_search=128. (An index-mapping sketch follows this list.)
  2. The Price Validator guardrail cross-checks prices with the catalog. At scale, this means a catalog lookup for every product in every response. How would you make this performant?

    • Expected: (1) Batch the ASIN lookups - collect all ASINs from the response and make a single batch-get request. (2) Cache prices in an in-memory cache (ElastiCache/Redis) with a TTL of ~5 minutes. Prices don't change every second. (3) Only validate if the response actually contains price information (most FAQ responses won't). (4) Accept eventual consistency - if a price changed in the last 5 minutes, the cache is "close enough." For critical accuracy, add a disclaimer: "Prices may vary, see product page for current price."
  3. The conversation memory schema stores turns as a DynamoDB List. As conversations grow, this list gets large. What are the DynamoDB item size constraints, and how would you handle them?

    • Expected: DynamoDB has a 400KB item size limit. Each turn might be ~500-2000 bytes (message + metadata). At 20 turns before summarization, that's ~10-40KB - well within limits. But if summarization fails or turns are very long, you could approach the limit. Mitigation: (1) Enforce max message length at the API gateway. (2) If item approaches 350KB, force summarization regardless of turn count. (3) Alternative: Store turns in a separate table with session_id + turn_number as composite key and keep only the summary + last 5 turns in the session item.
  4. The system supports both WebSocket (streaming) and HTTPS fallback. How would you implement the streaming response generation?

    • Expected: (1) Bedrock API supports streaming responses (returns a stream of token chunks). (2) The Orchestrator reads from this stream and forwards each chunk through the WebSocket connection. (3) Guardrails must run in two phases: pre-guardrails (check full prompt before sending to LLM) and post-guardrails (buffer a sliding window of generated text to check for PII/toxicity). (4) Can't do full ASIN validation until the response is complete, so stream the text first, then append product cards after validation. (5) HTTPS fallback waits for the full response before returning.
  5. How would you design the reranker stage in the RAG pipeline? What model would you use, and where would it run?

    • Expected: Use a cross-encoder model (e.g., ms-marco-MiniLM or a fine-tuned model) deployed on SageMaker. The reranker takes each (query, chunk) pair and scores relevance more accurately than vector similarity alone. It's slower (runs 10 forward passes for 10 chunks) but more accurate. At 10 chunks x ~5ms/pair = ~50ms additional latency. Worth it because top-3 selection quality directly impacts LLM response quality. Alternative: Use Bedrock's built-in reranking if available, or Cohere Rerank API.
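
To make question 1 concrete, here is what the chunk index body could look like with the values discussed above (M=16, ef_construction=256, ef_search=128, 1536-dimensional vectors) plus the chunk metadata fields from the Medium section. The cosine space_type and the field names are assumptions.

    rag_index_body = {
        "settings": {
            "index": {
                "knn": True,
                "knn.algo_param.ef_search": 128,   # candidate list size at query time
            }
        },
        "mappings": {
            "properties": {
                "embedding": {
                    "type": "knn_vector",
                    "dimension": 1536,             # Titan Embeddings V2 vectors
                    "method": {
                        "name": "hnsw",
                        "engine": "nmslib",
                        "space_type": "cosinesimil",
                        "parameters": {"m": 16, "ef_construction": 256},
                    },
                },
                "content": {"type": "text"},
                "source_type": {"type": "keyword"},   # faq | product | policy
                "asin": {"type": "keyword"},
                "category": {"type": "keyword"},
                "last_updated": {"type": "date"},
            }
        },
    }
    # opensearch_client.indices.create(index="mangaassist-rag", body=rag_index_body)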

Very Hard (Staff / Principal Engineers, 10+ Years Experience)

These questions test cross-component design, operational excellence, and edge cases at scale.

Interviewer: Principal Engineer

  1. Walk through a complete failure scenario: the Intent Classifier is returning incorrect intents for 10% of requests due to a bad model deployment. How do you detect, mitigate, and recover?

    • Expected:
    • Detection: Monitor intent confidence distribution - a bad model might show unusual confidence patterns. Track downstream metrics: escalation rate spikes, user feedback thumbs-down increases, resolution rate drops. Alert on statistical anomalies.
    • Mitigation: Automated rollback based on canary metrics (if new model's metrics are significantly worse than baseline within first 15 minutes). Manual killswitch to revert to previous model version.
    • Recovery: Roll back SageMaker endpoint to previous model version. Analyze misclassified intents. Fix training data or model. Redeploy with longer canary phase.
    • Prevention: Shadow testing before production. Automated regression tests against a golden dataset before promoting any model.
  2. The LLD shows entity extraction returning structured data like {series_name, volume_number, attribute}. How would you handle ambiguous entity resolution? For example, "attack on titan" could match the manga, the anime, or a video game.

    • Expected: (1) Context matters - in the manga store, default to the manga interpretation. Use PageContext.storeSection to disambiguate. (2) If the entity maps to multiple ASINs (e.g., multiple editions of Attack on Titan), return all versions and let the user clarify. (3) Use the conversation history - if the user was just looking at manga, "attack on titan" means the manga. (4) Fuzzy matching with ranking: prefer exact title matches over partial matches, prefer more popular items. (5) When confidence is low, ask a clarifying question: "Did you mean the manga series or the artbook?"
  3. Design a comprehensive testing strategy for the Guardrails pipeline. How do you ensure each of the 6 guardrails works correctly in isolation and together?

    • Expected:
    • Unit tests per guardrail: PII filter with known patterns (SSN, phone, email variations), Price Validator with correct/incorrect prices, Toxicity filter with adversarial examples, Competitor filter with edge cases ("Barnes and Noble" vs. "Barnes noble"), ASIN Validator with valid/invalid/retired ASINs, Scope Check with off-topic responses.
    • Integration tests: Full pipeline with responses that trigger multiple guardrails simultaneously (e.g., a response with PII AND wrong price AND a competitor mention).
    • Regression suite: Maintain a corpus of 1000+ test responses that should pass and 500+ that should fail. Run on every deployment.
    • Red team testing: Quarterly adversarial testing where engineers try to bypass guardrails.
    • Production monitoring: Track guardrail trigger rates. If any guardrail triggers >5% of responses, investigate - either the LLM quality degraded or the guardrail is too aggressive.
  4. The RAG knowledge base needs to stay updated as product information changes (new products, price changes, discontinued items). Design the real-time indexing pipeline.

    • Expected: (1) Product catalog changes emit events to an SNS topic (or DynamoDB Streams). (2) A Lambda function consumes these events. (3) For new/updated products: chunk the product description, generate embeddings via Titan, upsert into OpenSearch. (4) For discontinued products: delete chunks from OpenSearch. (5) For price changes: update the price in chunk metadata (or don't store prices in RAG - always fetch live from the catalog). (6) Rate limit the indexing to avoid overwhelming OpenSearch. (7) Use a dead-letter queue for failed indexing events. (8) Measure index freshness: average time between catalog change and index update should be <5 minutes.
  5. The system prompt includes {{conversation_turns}} as history. What happens when the total prompt (system + RAG chunks + history + user message) exceeds the LLM's context window?

    • Expected: (1) Calculate token budget: allocate fixed budgets for system prompt (~500 tokens), RAG chunks (~1500 tokens), user message (~200 tokens), output (~500 tokens). Remaining budget goes to conversation history. (2) If history exceeds its budget, summarize older turns (the memory management already does this at 20 turns). (3) If RAG chunks exceed their budget, reduce from 3 chunks to 2, or truncate the longest chunk. (4) If the user's single message is very long, truncate or reject with a "message too long" warning. (5) Always reserve enough output tokens - don't let a huge prompt starve the output. (A budget-allocation sketch follows this list.)
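
A budget-allocation sketch for question 5, using the fixed budgets from the expected answer. The 8,000-token context window is an assumed number for illustration, and count_tokens() is a stand-in for the model's real tokenizer.

    CONTEXT_WINDOW = 8_000                       # assumed model context window
    BUDGETS = {"system": 500, "rag": 1_500, "user": 200, "output": 500}

    def count_tokens(text: str) -> int:
        return len(text.split())                 # placeholder; use the model tokenizer in practice

    def fit_history(turns: list[str]) -> list[str]:
        history_budget = CONTEXT_WINDOW - sum(BUDGETS.values())
        kept, used = [], 0
        for turn in reversed(turns):             # keep the most recent turns first
            cost = count_tokens(turn)
            if used + cost > history_budget:
                break                            # older turns fall back to the summary
            kept.append(turn)
            used += cost
        return list(reversed(kept))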

Interviewer: Solutions Architect

  1. Compare the current DynamoDB-based conversation memory with an alternative design using Redis (ElastiCache). When would you choose one over the other?

    • Expected:
    • DynamoDB: Persistent, durable, auto-scaled, supports TTL, but higher latency (~5-10ms read). Good for: sessions that might be resumed after hours, audit/compliance needs, simplicity (managed, no cluster to size).
    • Redis: Sub-millisecond latency, but volatile (data lost on restart unless using Redis with AOF/RDB), requires capacity planning, more expensive per GB.
    • Hybrid approach (best): Use Redis as a write-through cache for active sessions (last 30 minutes). Persist to DynamoDB on every write but read from Redis first. If Redis misses (session older than 30 min or cache evicted), fall back to DynamoDB.
    • Choose DynamoDB only when: latency is acceptable, ops simplicity is valued, sessions are long-lived.
    • Choose Redis only when: sub-millisecond memory access is required, sessions are short-lived, data loss on restart is acceptable.
  2. The analytics events are streamed via Kinesis to Redshift. Design the schema evolution strategy - what happens when you need to add a new field (e.g., user_satisfaction_score) to the events schema?

    • Expected: (1) Use a schema registry (AWS Glue Schema Registry) to version event schemas. (2) Make all schema changes backward-compatible: new fields are nullable/optional. (3) Kinesis consumers (Firehose or Lambda) should handle both old and new schema versions. (4) Redshift: use ALTER TABLE ADD COLUMN for new fields with a default of NULL. (5) Backfill strategy: run a one-time job to populate the new field for historical events if possible. (6) Never remove or rename existing fields - deprecate them instead. (7) Document schema changes with a version log.
  3. The Orchestrator fans out to multiple services in parallel (e.g., Recommendations + Catalog). How would you implement timeout handling and partial result aggregation?

    • Expected: (1) Each service call has an independent timeout (e.g., Catalog: 500ms, Recommendations: 800ms, Promotions: 300ms). (2) Use a scatter-gather pattern: launch all calls concurrently, wait for all to complete or timeout. (3) If a critical service times out (e.g., Catalog), retry once. If it still fails, fail the request. (4) If a non-critical service times out (e.g., Promotions), proceed without it - the response just won't include promotions. (5) Pass the available partial results to the LLM with metadata indicating what's missing. (6) The LLM prompt can say: "Promotion data is currently unavailable. Do not mention promotions in your response." (A scatter-gather sketch follows this list.)
  4. Design the data model for supporting conversation branching - where a user says "Actually, go back to what you said about manga X" and the conversation forks.

    • Expected: (1) Instead of a flat list of turns, use a tree structure where each turn has a parent_turn_id. (2) The "active branch" is the path from root to the current turn. (3) When the user references a past turn, find the referenced turn and create a new branch from it. (4) The LLM receives the linear path (root -> branching point -> new branch) as history. (5) Practical concern: this adds complexity. For MVP, a simpler approach is to detect "go back" intents and reconstruct context by including the referenced turn in the current prompt. Full branching can be a future enhancement.
  5. The system prompt instructs the LLM with 7 rules. How would you A/B test different system prompts across users?

    • Expected: (1) Store multiple prompt versions in a config store (DynamoDB or SSM Parameter Store) with version IDs. (2) Assign users to prompt variants using consistent hashing on customer_id (ensures the same user always sees the same variant). (3) Log the prompt version ID in analytics events alongside response_id. (4) Run for sufficient statistical power (~2 weeks, thousands of conversations). (5) Measure: resolution rate, feedback score, escalation rate, conversion rate per variant. (6) Ensure guardrails are identical across variants - only test content/style differences, not safety changes.
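
A scatter-gather sketch for question 3: fan out with independent per-service timeouts and keep whatever partial results come back. The timeout values mirror the expected answer; the fetch_* functions are placeholders for the real service clients.

    import asyncio

    TIMEOUTS_S = {"catalog": 0.5, "recommendations": 0.8, "promotions": 0.3}

    async def fetch_catalog(asin: str) -> dict:
        await asyncio.sleep(0.05)                       # stand-in for the Catalog call
        return {"asin": asin, "price": 9.99}

    async def fetch_recs(asin: str) -> list:
        await asyncio.sleep(0.05)
        return ["B000EXAMPLE1", "B000EXAMPLE2"]

    async def fetch_promos(asin: str) -> list:
        await asyncio.sleep(0.05)
        return []

    async def call_with_timeout(name: str, coro, critical: bool):
        try:
            return name, await asyncio.wait_for(coro, timeout=TIMEOUTS_S[name])
        except asyncio.TimeoutError:
            if critical:
                raise                                   # caller retries once, then fails the request
            return name, None                           # non-critical: proceed without this data

    async def gather_context(asin: str) -> dict:
        results = await asyncio.gather(
            call_with_timeout("catalog", fetch_catalog(asin), critical=True),
            call_with_timeout("recommendations", fetch_recs(asin), critical=False),
            call_with_timeout("promotions", fetch_promos(asin), critical=False),
        )
        return {name: data for name, data in results if data is not None}

    # asyncio.run(gather_context("B000EXAMPLE0"))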

👑 Architect Level (Distinguished Engineer / VP of Engineering)

These questions test system evolution, cross-cutting concerns, and strategic technical decisions.

Interviewer: Distinguished Engineer

  1. If you needed to replace the entire LLM backend (e.g., moving from Claude on Bedrock to an internally trained model), what does the LLD tell you about the blast radius of that change?

    • Expected: The LLD shows the LLM is used in two places: (1) Response generation (Orchestrator -> LLM), (2) potentially in the Intent Classifier (BERT fallback). The response format contract is the key abstraction - as long as the new model produces {response_text, products, actions, follow_up_suggestions}, the downstream components (Guardrails, Formatter, Frontend) are unaffected. Changes needed: (1) New Bedrock model endpoint or custom SageMaker endpoint. (2) Prompt engineering - the system prompt template may need rewriting for the new model's style. (3) Guardrails thresholds may need recalibration. (4) Performance testing - latency/throughput characteristics will differ. The Orchestrator itself doesn't change.
  2. The LLD defines 6 guardrails. In production, these will have false positives (blocking good responses) and false negatives (missing bad responses). How would you design a feedback loop to continuously tune guardrail thresholds?

    • Expected: (1) Log every guardrail trigger with: the original response, which rule triggered, what action was taken. (2) Randomly sample triggered responses for human review (is this a true positive or false positive?). (3) Track false positive rate per guardrail - if PII filter is blocking 5% of responses but 80% of those are false positives, tighten the regex. (4) Track false negatives via user feedback - if a user thumbs-down a response that contains wrong information that should have been caught, it's a guardrail miss. (5) Quarterly guardrail review: update regex patterns, retrain classifiers, adjust thresholds. (6) Dashboard showing guardrail trigger rates, false positive rates, and coverage metrics.
  3. Looking at the complete LLD, identify the three most likely single points of failure and design redundancy for each.

    • Expected:
    • SPOF 1: Bedrock LLM. If Bedrock is unavailable, no responses can be generated. Redundancy: maintain a secondary region Bedrock endpoint. Ultimate fallback: template-based responses for common intents (FAQ answers from RAG chunks without LLM generation).
    • SPOF 2: DynamoDB (conversation memory). If DynamoDB is down, can't load conversation history. Redundancy: DynamoDB is inherently HA within a region. Add a Redis cache layer. If both fail, run in stateless mode (respond to current message without history).
    • SPOF 3: Intent Classifier (SageMaker endpoint). If the classifier is down, can't route messages. Redundancy: multi-instance SageMaker endpoint with auto-scaling. If fully down, fall back to rule-based classifier only (Stage 1) which runs in the Orchestrator process - no external dependency.
  4. The LLD shows the chat API uses WebSocket with streamed JSON responses. Design the exact WebSocket protocol - message types, error handling, reconnection, and heartbeat.

    • Expected:
    • Message types: chat_message (user -> server), response_start (server -> client, includes response_id), response_chunk (server -> client, token text), response_end (server -> client, includes full metadata, products, actions), error (server -> client), typing_indicator (server -> client), ping/pong (heartbeat).
    • Reconnection: Client auto-reconnects with exponential backoff (1s, 2s, 4s, max 30s). On reconnect, client sends session_id to resume. Server checks if a response was in-flight and either resends or acknowledges completion.
    • Heartbeat: Server sends ping every 30 seconds. If client doesn't receive ping for 60 seconds, it reconnects. This detects silent connection drops.
    • Error handling: Server sends error message with error code and user-friendly message. Client displays the message and offers retry.
    • Ordering guarantee: WebSocket messages are ordered by default (TCP). Include a sequence number for safety. (Example frames are sketched after this list.)
  5. How would you design the system to support multi-language conversations (e.g., user writes in Japanese, system responds in Japanese) without duplicating the entire pipeline?

    • Expected: (1) Language detection: Add a language detection step before intent classification (use a lightweight model or Bedrock/Comprehend). (2) Intent Classifier: Retrain BERT model on multilingual data (or use multilingual BERT). Rule-based stage needs patterns for each language. (3) RAG pipeline: Maintain language-specific indexes (Japanese chunks, English chunks). Query the index matching the detected language. Alternatively, embed everything with a multilingual embedding model. (4) LLM: Claude supports multiple languages natively. Add a language instruction to the system prompt: "Respond in {{detected_language}}." (5) Guardrails: PII patterns are language-specific (Japanese phone numbers vs. US). Toxicity detection needs multilingual support. (6) Don't duplicate orchestration - the orchestrator, memory, and API contracts stay language-agnostic. Only the NLP components and content need language awareness.
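
Example frames for the protocol in question 4, shown as plain dictionaries. The message type names come from the answer above; the seq, code, and session_id fields on individual frames are illustrative assumptions.

    example_frames = [
        {"type": "chat_message", "session_id": "abc123", "seq": 1,
         "text": "Is Demon Slayer Vol 12 in English?"},
        {"type": "response_start", "response_id": "r-789", "seq": 2},
        {"type": "response_chunk", "response_id": "r-789", "seq": 3, "text": "Yes, "},
        {"type": "response_chunk", "response_id": "r-789", "seq": 4,
         "text": "Volume 12 is available in English."},
        {"type": "response_end", "response_id": "r-789", "seq": 5,
         "metadata": {"latency_ms": 1840}, "products": [], "actions": []},
        {"type": "ping", "seq": 6},                      # heartbeat every 30 seconds
        {"type": "error", "seq": 7, "code": "LLM_TIMEOUT",
         "message": "Something went wrong. Want to try again?"},
    ]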

Interviewer: VP of Engineering

  1. The LLD defines the Orchestrator as a class with methods like handleMessage, routeByIntent, and buildLLMPrompt. Critique this design - what would you change for production at Amazon scale?

    • Expected: (1) A single class with all logic is a code-level monolith. At Amazon scale, split into separate services: Intent Routing Service, Prompt Builder Service, Response Assembly Service. (2) routeByIntent using a switch/map is fine for 10 intents but becomes unmaintainable at 50. Use a plugin/registry pattern where each intent handler registers itself (a registry sketch follows at the end of this list). (3) buildLLMPrompt should be a separate concern with prompt versioning, A/B testing support, and template management. (4) The class diagram doesn't show error handling, retries, or circuit breakers - production code needs these. (5) However, for an MVP, this design is perfectly appropriate. Premature splitting increases operational complexity without proven need.
  2. Looking at the analytics schema, the chatbot_events table stores individual events. Design the aggregation queries needed for a daily executive dashboard.

    • Expected:
      -- Daily conversation volume
      SELECT DATE(created_at), COUNT(DISTINCT session_id) FROM chatbot_events GROUP BY 1;
      
      -- Intent distribution
      SELECT DATE(created_at), intent, COUNT(*) FROM chatbot_events WHERE event_type='message' GROUP BY 1,2;
      
      -- Resolution rate (sessions without escalation)
      SELECT DATE(created_at),
        1.0 - (COUNT(DISTINCT CASE WHEN event_type='escalation' THEN session_id END)::FLOAT / 
               COUNT(DISTINCT session_id)) as resolution_rate
      FROM chatbot_events GROUP BY 1;
      
      -- Average latency by intent
      SELECT intent, AVG(latency_ms), PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY latency_ms) as p95
      FROM chatbot_events WHERE event_type='response' GROUP BY 1;
      
      -- User satisfaction
      SELECT DATE(created_at),
        SUM(CASE WHEN feedback='thumbs_up' THEN 1 ELSE 0 END)::FLOAT / 
        COUNT(feedback) as satisfaction_rate
      FROM chatbot_events WHERE feedback IS NOT NULL GROUP BY 1;
      
    • Also build materialized views or pre-aggregated summary tables for dashboard performance.
  3. The Guardrails pipeline is synchronous today (runs after every LLM response). Design an asynchronous audit guardrails system that runs in the background and catches issues the real-time pipeline misses.

    • Expected: (1) Every response (pre-guardrail and post-guardrail versions) is published to an SQS queue. (2) A Lambda consumer runs a more thorough/expensive analysis: deeper PII detection (NER model, not just regex), semantic similarity checks against known bad outputs, factual accuracy verification against a reference database, style/tone analysis. (3) Results are written to a review queue for human moderators. (4) If a severe issue is found (e.g., leaked PII that real-time missed), trigger an incident: notify on-call, optionally send a correction message to the user, update the real-time guardrails to catch the new pattern. (5) Generate a weekly guardrail effectiveness report: how many issues were caught by real-time vs. async.
  4. Design a complete observability strategy for the LLD components. What traces, metrics, and logs would you emit from each component?

    • Expected:
    • Traces (X-Ray/OpenTelemetry): End-to-end trace from API Gateway through every component. Each span: Orchestrator, Intent Classifier, each service call, RAG retrieval, LLM call, each guardrail. Correlate with session_id and response_id.
    • Metrics (CloudWatch):
      • Orchestrator: request count, latency (p50/p95/p99), error rate, concurrent sessions.
      • Intent Classifier: classification latency, confidence distribution, fallback-to-BERT rate.
      • RAG: retrieval latency, chunks retrieved, reranker latency, cache hit rate.
      • LLM: token count (input/output), generation latency, throttling events.
      • Guardrails: trigger rate per guardrail type, false positive rate.
      • Memory: DynamoDB read/write latency, summarization trigger rate.
    • Logs (CloudWatch Logs): Structured JSON logs per request. Include: session_id, intent, entities, services called, guardrail results, model_id, token counts. PII-scrubbed. Retention: 30 days hot, 1 year cold (S3).
    • Alarms: Latency > 3s for >5 min, error rate > 1% for >2 min, guardrail trigger rate > 10% for >10 min, LLM throttling events > 0.
  5. If the MangaAssist team asked you to review this LLD and give 5 specific improvements before production launch, what would they be?

    • Expected:
      1. Add caching layer: Identical or near-identical queries (e.g., "what's the return policy?") should be served from cache, not re-run through the full pipeline every time. Design a semantic cache keyed on query embedding similarity.
      2. Define SLAs per component: The LLD shows components but doesn't specify latency SLAs. Each component needs a latency budget: Intent Classifier <100ms, RAG <400ms, LLM <2s, Guardrails <100ms, total <3s.
      3. Add circuit breaker patterns: The state machine goes from RouteToService to AggregateData, but doesn't show what happens on failure. Add circuit breakers with fallback behavior for each downstream service.
      4. Prompt versioning and management: The system prompt is hardcoded in the LLD. In production, prompts should be versioned, stored in a config service, and A/B testable without code deployment.
      5. Load testing results: The LLD should include expected throughput numbers and the results of load testing each component. What's the max QPS each component can handle? Where is the bottleneck?
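
A registry-pattern sketch for question 1 above: each intent handler registers itself, so routeByIntent becomes a dictionary lookup instead of a growing switch statement. Handler names, signatures, and return shapes are illustrative assumptions.

    from typing import Callable

    INTENT_HANDLERS: dict[str, Callable[[dict], dict]] = {}

    def intent_handler(intent: str):
        """Decorator that registers a handler for one intent."""
        def register(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
            INTENT_HANDLERS[intent] = fn
            return fn
        return register

    @intent_handler("product_question")
    def handle_product_question(ctx: dict) -> dict:
        return {"service_calls": ["catalog"], "entities": ctx.get("entities", {})}

    @intent_handler("recommendation")
    def handle_recommendation(ctx: dict) -> dict:
        return {"service_calls": ["recommendations", "catalog"]}

    def route_by_intent(intent: str, ctx: dict) -> dict:
        handler = INTENT_HANDLERS.get(intent)
        if handler is None:
            return {"service_calls": [], "fallback": True}   # unknown intent -> generic path
        return handler(ctx)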