HLD Interview Questions - MangaAssist Chatbot Architecture

A comprehensive set of questions organized by difficulty level, simulating a panel of interviewers (Engineering Manager, Senior Engineer, Staff Engineer, Principal Engineer, and Solutions Architect).

How to Use This Question Set

Start with the Easy and Medium sections for a standard interview rehearsal flow.
Use the Hard and Very Hard sections to pressure-test tradeoffs, failure modes, and scaling decisions.
Save the final architect-focused section for mock interviews where you need platform thinking, org-level tradeoffs, or long-term extensibility.

Easy (Junior / Entry-Level Engineers)

These questions test fundamental understanding of the architecture and its components.

Interviewer: Engineering Manager

Can you walk us through the high-level architecture of MangaAssist? What are the main layers? - Expected: Client Layer -> Edge & Auth -> Orchestration -> Intelligence -> Data -> Safety & Output -> Observability -> Fallback.
Why is WebSocket used instead of plain HTTP for the chat interface? - Expected: WebSocket enables token-by-token streaming of LLM responses, making the chatbot feel faster and more responsive to users.
What role does the API Gateway play in this architecture? - Expected: Single entry point, TLS termination, request routing, throttling, request validation. Decouples frontend from backend.
What is the difference between how authenticated users and guest users are handled? - Expected: Authenticated users get personalized features (order tracking, recommendations via customer ID). Guest users get a temporary session ID and can still use discovery and FAQ features.
Why does the system have a rate limiter? What limits are applied? - Expected: Prevents abuse, controls LLM cost, protects downstream services. ~30 messages/minute per user, separate limits for authenticated vs. guest.

Interviewer: Senior Engineer

Name at least 5 intents that the Intent Classifier can detect. - Expected: product_discovery, product_question, faq, order_tracking, return_request, promotion_inquiry, recommendation, escalation, chitchat.
What database is used for conversation memory, and why? - Expected: DynamoDB - low-latency key-value store, supports TTL for auto-expiry (24 hours), scales well for session data.
What is RAG and why is it used in this architecture? - Expected: Retrieval-Augmented Generation. It retrieves relevant documents (FAQ, policies, product data) and passes them as context to the LLM, grounding responses in real data and reducing hallucination.
What does Amazon Bedrock provide in this system? - Expected: Managed LLM service hosting Claude/Titan for response generation, plus Bedrock Guardrails for content moderation.
Why does the architecture reuse Amazon Personalize instead of building a custom recommendation engine?
- Expected: Amazon already has one of the best recommendation engines. Reusing it saves development time, leverages existing user signals, and is battle-tested at scale.

Medium (Mid-Level Engineers, 2-5 Years Experience)

These questions test deeper understanding and design reasoning.

Interviewer: Senior Engineer

Why is the Intent Classifier a separate component from the LLM? Why not just send every message directly to the LLM?
- Expected: Intent classification is faster and cheaper than LLM inference. Deterministic routing avoids unnecessary LLM calls for simple intents (chitchat, order tracking). Reduces latency and cost.
Explain the data flow from user message to final response. What happens at each step? - Expected: User -> Frontend -> Gateway -> Orchestrator -> Intent Classifier -> Route to appropriate service(s) -> Aggregate data -> LLM generation -> Guardrails validation -> Response back through Gateway -> Frontend -> User.
How does the conversation memory handle multi-turn context? Give an example of why this is important. - Expected: Stores last N turns per session in DynamoDB. Example: User says "What about the second one you mentioned?" - that requires memory of previous recommendations. Without memory, the chatbot cannot resolve references.
What happens when the Guardrails pipeline detects a problem with the LLM response?
- Expected: The response is blocked/modified. Problems include PII leakage, toxic content, off-topic responses, competitor mentions, hallucinated prices. A fallback safe response is returned instead.
Why is Kinesis used for analytics instead of writing directly to Redshift?
- Expected: Kinesis acts as a buffer for event streaming, decoupling the chatbot's real-time path from the analytics database write path. Direct writes to Redshift would be slow and could cause backpressure on the chat service.

Interviewer: Staff Engineer

If the chatbot needs to query both the Recommendation Engine AND the Product Catalog for a single response, how does the Orchestrator handle this?
- Expected: The Orchestrator fans out parallel requests to both services, aggregates the results, then passes the combined data to the LLM for response generation.
What happens if the Order Service is down and a user asks "Where is my order?"
- Expected: The Orchestrator should detect the failure (timeout/error), return a graceful degradation response ("I'm having trouble accessing order information right now"), and offer to escalate to a human agent. Should also trigger a CloudWatch alarm.
Why is OpenSearch Serverless chosen as the vector store instead of a dedicated vector database like Pinecone?
- Expected: OpenSearch Serverless is a managed AWS service, integrates natively with Bedrock, avoids introducing an external vendor dependency, and supports both text search and vector search (hybrid retrieval).
How would you handle a spike in traffic during a major manga release (e.g., new One Piece volume)?
- Expected: API Gateway throttling, Lambda auto-scaling, DynamoDB on-demand capacity, Bedrock provisioned throughput. Rate limiter prevents individual abuse while allowing overall traffic increase. Could pre-warm caches for the specific product.
What metrics would you track from Day 1 to know if the chatbot is successful?
- Expected: Latency (p50, p95, p99), intent distribution, resolution rate (did the user's question get answered without escalation), escalation rate, thumbs up/down ratio, conversion rate (did the user add a recommended product to cart), session length.

Hard (Senior Engineers, 5-10 Years Experience)

These questions test architectural trade-offs, failure modes, and system design depth.

Interviewer: Staff Engineer

The architecture uses Lambda + Step Functions for orchestration. What are the trade-offs vs. a long-running ECS/Fargate service?
- Expected: Lambda: auto-scaling, pay-per-invocation, no server management, but cold start latency (~100-300ms), 15-min timeout, stateless. ECS/Fargate: lower latency (no cold starts), can maintain in-process state, but requires capacity management, costs run continuously. For a chatbot with bursty traffic, Lambda is a good fit if cold starts are mitigated (provisioned concurrency).
How would you design the system to handle a scenario where the LLM hallucinates a product that doesn't exist?
- Expected: Guardrails pipeline includes ASIN Validation - every product ASIN in the response is cross-checked against the live catalog. If an ASIN doesn't exist, it's removed from the response. Additionally, the system prompt explicitly instructs the LLM to only reference products from the provided context.
What would happen if DynamoDB (conversation memory) experiences elevated latency? How would you protect the user experience?
- Expected: Implement circuit breaker pattern. If DynamoDB reads are slow, fall back to a "stateless" mode where the chatbot processes the current message without history. Degrade gracefully - better to respond without context than to time out. Cache recent turns in an in-memory layer (ElastiCache) as a hot path.
How do you ensure the RAG pipeline returns relevant results and not stale or irrelevant content?
- Expected: (1) Metadata filtering - filter by source_type, category, last_updated. (2) Chunk quality - proper chunking strategy with overlap. (3) Reranking - use a cross-encoder reranker after initial retrieval. (4) Freshness - periodic re-indexing pipeline with a freshness score. (5) Evaluation - measure retrieval precision/recall with labeled test sets.
The architecture shows a single LLM (Bedrock Claude). How would you design for model flexibility - e.g., switching to a different model or using multiple models?
- Expected: Abstract the LLM behind an interface/adapter pattern. The Orchestrator calls a "ResponseGenerator" service, which internally calls Bedrock with a configurable model ID. For A/B testing, route a percentage of traffic to Model A vs. Model B. Use feature flags to control rollout. Log model ID in analytics for comparison.

Interviewer: Principal Engineer

If you had to reduce end-to-end latency from ~2 seconds to under 1 second, what changes would you make?
- Expected: (1) Use a smaller/faster LLM for simple intents (Titan Lite for FAQ, Claude for recommendations). (2) Cache frequent queries (e.g., "what's the return policy?"). (3) Parallelize Intent Classification with RAG retrieval (speculative execution). (4) Use provisioned concurrency on Lambda to eliminate cold starts. (5) Pre-compute embeddings for common queries. (6) Stream response tokens as they generate rather than waiting for full completion.
How would you design the system to support multiple storefronts (e.g., JP Manga Store, Comics Store, Kindle Store) without duplicating the entire architecture?
- Expected: Multi-tenant design. The Orchestrator accepts a store_id parameter. RAG knowledge bases are partitioned by store. System prompts are templated per store. Product catalog queries are scoped by category. Recommendation models can be shared or per-store. The core infrastructure (Gateway, Lambda, DynamoDB) is shared.
Walk me through how you would handle a data privacy incident where conversation logs were found to contain PII that should have been scrubbed.
- Expected: (1) Immediate: Quarantine affected logs, assess scope. (2) Root cause: Identify where the PII scrubbing pipeline failed (before logging? regex missed a pattern?). (3) Fix: Patch the scrubbing logic, add new PII patterns. (4) Remediate: Run a retroactive scrub job on stored logs. (5) Prevent: Add PII detection tests to CI/CD, add a secondary PII scan on the log pipeline, alert on PII detection in production logs.
The current design uses synchronous request-response for most service calls. Where would you introduce asynchronous patterns and why?
- Expected: (1) Analytics logging - already async via Kinesis. (2) Feedback processing - fire-and-forget to an SQS queue. (3) RAG re-indexing - async pipeline triggered by catalog changes. (4) Human handoff - async queue via Amazon Connect. (5) If response generation takes >3s, consider async with a "typing indicator" and push notification when ready.
How would you prevent prompt injection attacks where a user tries to manipulate the LLM?
- Expected: (1) Input sanitization - strip known injection patterns. (2) System prompt hardening - clear role boundaries and "ignore previous instructions" defenses. (3) Guardrails pipeline checks output for off-topic content. (4) Separate the system prompt from user input in the LLM API call. (5) Monitor for anomalous outputs that indicate successful injection. (6) Rate limit to prevent automated injection probing.

Very Hard (Staff / Principal Engineers, 10+ Years Experience)

These questions test system-wide thinking, cross-functional concerns, and production readiness.

Interviewer: Principal Engineer

How would you design a canary deployment strategy for releasing a new LLM model version without impacting all users?
- Expected: (1) Deploy new model alongside existing one. (2) Route 1% of traffic to new model via feature flag. (3) Monitor key metrics: latency, guardrail failure rate, user feedback (thumbs down), escalation rate. (4) Gradually increase traffic if metrics are stable. (5) Automated rollback if any metric breaches a threshold. (6) Shadow mode - run both models, only serve the old one, compare outputs offline.
If Amazon leadership asked you to add voice support (Alexa integration) to MangaAssist, what architectural changes would be needed?
- Expected: (1) New client adapter for Alexa Skills Kit that converts speech-to-text -> chat request and response-text -> speech (Polly). (2) The Orchestrator API stays the same - it already accepts text messages. (3) Response format needs adaptation - voice can't show product cards, so responses need an "audio_summary" field. (4) Session management needs to bridge Alexa sessions with web sessions. (5) Latency budget is tighter for voice (users expect faster responses).
How would you design a feedback loop that actually improves the chatbot over time?
- Expected: (1) Capture explicit feedback (thumbs up/down) and implicit signals (did user click recommended product, did user escalate). (2) Build a labeled dataset from feedback - positive examples become training data, negative examples become hard negatives. (3) Periodically fine-tune the intent classifier on new data. (4) Use negative feedback to improve RAG chunks (identify gaps in knowledge base). (5) A/B test prompt changes using the analytics pipeline. (6) Monthly review of escalation transcripts to find new automation opportunities.
The system depends on ~8 downstream services (Catalog, Orders, Returns, Recommendations, etc.). How would you design for partial degradation when 2-3 of these services go down simultaneously?
- Expected: (1) Categorize services by criticality - Catalog is critical, Promotions is nice-to-have. (2) Circuit breakers per service with independent timeout configs. (3) Graceful degradation matrix: if Catalog is down -> show text-only response without product cards; if Recommendations is down -> show popular/trending instead; if Orders is down -> tell user to check order page and offer escalation. (4) Health check dashboard showing service dependency status. (5) Orchestrator has a "capability map" that adjusts behavior based on available services.
How would you estimate the infrastructure cost for MangaAssist at launch (100K conversations/day) and at scale (10M conversations/day)?
- Expected: Break down by component:
- Bedrock LLM costs: Input/output tokens x price per token x messages per conversation x conversations per day. Claude Sonnet ~$3/M input tokens, ~$15/M output tokens. At 100K convos x 5 turns x ~500 tokens = ~250M tokens/day ≈ $750-4000/day.
- Lambda: Invocations x duration x memory. Mostly negligible.
- DynamoDB: Read/write capacity units x sessions. On-demand pricing. ~$50-200/day at 100K.
- OpenSearch Serverless: OCU-based pricing. ~$700-2000/month.
- Supporting services: CloudWatch, Kinesis, API Gateway - smaller costs.
- Total estimate: ~$30K-100K/month at 100K convos, ~$500K-2M/month at 10M convos.
- At scale, the LLM cost dominates. Optimization strategies: caching, smaller models for simple intents, response length limits.

Interviewer: Solutions Architect

How would you ensure this system meets Amazon's internal SLA of 99.95% availability?
- Expected: (1) Multi-AZ deployment for all stateful components. (2) DynamoDB global tables for cross-region resilience. (3) Bedrock has built-in high availability. (4) Circuit breakers prevent cascading failures. (5) Health checks with automated recovery. (6) Runbook for each failure mode. (7) Dependency on external services (LLM) is the biggest risk - maintain fallback responses for when Bedrock is unavailable. (8) Chaos engineering - regularly test failure scenarios.
If a new regulation requires that all customer conversation data be deletable within 24 hours of a GDPR deletion request, how would you implement this across all storage layers?
- Expected: Audit all storage layers: DynamoDB (sessions - already 24h TTL, good), Kinesis (streaming, transient), Redshift (analytics - need customer_id index to delete rows), CloudWatch logs (need PII-free logging or deletable log groups), OpenSearch (if any customer data is indexed). Implement a "right to be forgotten" Lambda triggered by deletion events from Amazon's GDPR pipeline. Must also handle backups and any downstream consumers.
How would you test this entire system end-to-end before launch?
- Expected: (1) Unit tests per service. (2) Integration tests with mocked downstream services. (3) Contract tests between services (API schema validation). (4) Load tests simulating peak traffic (locust/k6). (5) Chaos tests (kill services, inject latency). (6) LLM evaluation suite - golden set of 500+ query-response pairs scored by human raters. (7) Red team testing - try to break guardrails, inject prompts, extract PII. (8) Shadow launch - run in production alongside existing support, compare outcomes. (9) Beta launch with Amazon employees first.
The architecture shows separate Lambda functions for different intents. What happens when you need to add a new intent (e.g., "gift_wrapping")? Walk through the full change.
- Expected: (1) Add new intent to Intent Classifier training data and retrain. (2) Add new routing rule in Orchestrator. (3) Create/integrate with Gift Wrapping Service (or use existing API). (4) Add RAG chunks for gift wrapping FAQ/policies. (5) Update system prompt to include gift wrapping context. (6) Add guardrail rules specific to gift wrapping. (7) Update analytics schema for new intent tracking. (8) Test with golden dataset. (9) Feature-flag rollout. The architecture is extensible by design - adding a new intent shouldn't require changing the core orchestration logic.
Compare this architecture to a simpler design where you just dump everything into a single large-context LLM call with all product data. Why is the microservices approach better?
- Expected: (1) Cost: Large context = more tokens = more cost per call. At scale, this is prohibitive. (2) Latency: Larger prompts take longer. (3) Accuracy: RAG retrieves only relevant data, reducing noise. Dumping everything causes the LLM to hallucinate or miss relevant info. (4) Freshness: Real-time data (prices, availability, order status) must come from live services, not a static dump. (5) Maintainability: Microservices can be updated independently. (6) Observability: You can measure and optimize each component independently. (7) The simple approach might work for a prototype but fails at Amazon scale.

👑 Architect Level (Distinguished Engineer / VP of Engineering)

These questions test strategic thinking, business alignment, and system evolution.

Interviewer: VP of Engineering / Distinguished Engineer

How does this chatbot fit into Amazon's broader customer experience strategy? What flywheel effects do you expect?
- Expected: MangaAssist reduces friction in the buying journey -> higher conversion -> more sales -> more investment in manga content -> more customers -> more data for recommendations -> better chatbot -> lower support costs. It also feeds data back into Amazon's broader recommendation and search systems. The chatbot is a wedge - if successful in manga, it can be templated to other verticals (books, electronics, fashion).
If you were starting over, would you build this as a chatbot or as an enhanced search/browse experience? Defend your choice.
- Expected: This is a "both" answer. The chatbot excels at: multi-turn discovery ("I want something dark but not too scary"), support tasks (order tracking, returns), and hand-holding new users. Enhanced search excels at: known-item search, filtering, comparison. The ideal architecture supports both - the chatbot can trigger search/browse actions, and the search page can surface a "need help?" chatbot entry point. Don't force users into a single interaction paradigm.
What's the biggest risk to this project, and how would you mitigate it?
- Expected: Risk 1 - User trust: If the chatbot gives wrong information (wrong price, wrong product, bad recommendation), users will stop using it. Mitigation: aggressive guardrails, human review of flagged responses, conservative rollout. Risk 2 - Cost at scale: LLM costs can escalate quickly. Mitigation: caching, model tiering, response length limits, monitoring cost per conversation. Risk 3 - Adoption: Users may prefer existing browse/search. Mitigation: A/B test chatbot vs. no chatbot, optimize placement, make it genuinely useful not just a gimmick.
How would you measure ROI of this project to justify continued investment?
- Expected: (1) Revenue impact: Conversion rate for users who interact with chatbot vs. those who don't (A/B test). Average order value comparison. (2) Cost savings: Reduction in customer support tickets routed to human agents. (3) Engagement: Time on site, pages viewed, return visit rate. (4) Customer satisfaction: CSAT/NPS for chatbot interactions vs. baseline. (5) Present as: "For every $1 spent on chatbot infra, we generate $X in incremental revenue and save $Y in support costs."
In 3 years, how would you evolve this architecture?
- Expected: (1) Proactive assistance: Don't wait for users to ask - proactively suggest during browsing ("You're looking at Vol 5, but you haven't read Vol 4 yet"). (2) Multi-modal: Image-based queries ("I saw this manga cover somewhere, what is it?"). (3) Voice/Alexa: Extend to voice shopping. (4) Cross-store: Expand beyond manga to all Amazon categories. (5) Agent capabilities: Let the chatbot take actions (add to cart, start a return) not just answer questions. (6) Personalized LLM: Fine-tuned model that understands individual user preferences. (7) Social features: "Manga fans who liked X also discussed Y."
If a competitor (e.g., a manga-specific retailer) launches a similar AI shopping assistant, how would this architecture give Amazon a defensible advantage?
- Expected: (1) Data moat: Amazon has purchase history, browsing data, and reviews at unmatched scale. (2) Recommendation engine: Amazon's collaborative filtering is trained on billions of interactions. (3) Infrastructure: AWS services give Amazon cost and latency advantages. (4) Distribution: The chatbot is embedded in the world's largest e-commerce platform - no need to acquire users separately. (5) Full lifecycle: Amazon can handle discovery -> purchase -> delivery -> returns in one chatbot, competitors can't. (6) However: A niche competitor might win on depth of manga knowledge and community - Amazon should invest in editorial content and community features to counter.
How would you handle the organizational challenge of this project - it touches frontend, backend, ML, data, support teams, and business stakeholders?
- Expected: (1) Ownership model: A dedicated "MangaAssist" team owns the Orchestrator, Intent Classifier, and RAG pipeline. They have service-level agreements with dependent teams (Catalog, Orders, Recommendations). (2) Working backwards: Start with the customer experience (PR/FAQ doc), align all teams on the vision. (3) API contracts first: Define interfaces between teams before building. (4) Weekly cross-team sync with escalation path to VP. (5) Phased delivery: MVP with a small team, then scale up.
If you could only launch with 3 of the 10 intents, which 3 would you pick and why?
- Expected: (1) product_discovery / recommendation - this is the primary value proposition and differentiator. (2) product_question - directly supports purchase decisions. (3) faq - handles common questions and reduces support load. Why not order_tracking? Because Amazon already has "Where's My Stuff" which works well. The chatbot's unique value is in pre-purchase assistance, not post-purchase (where existing tools already serve users).
How would you evaluate whether to build vs. buy key components (e.g., build a custom orchestrator vs. use an existing agent framework like LangChain or Amazon Bedrock Agents)?
- Expected: Evaluate on: (1) Control: Custom gives maximum control over routing, prompt engineering, and optimization. Frameworks may black-box critical decisions. (2) Speed to market: Frameworks accelerate MVP but may limit customization later. (3) Operational ownership: Using Bedrock Agents means AWS manages the orchestration infra. (4) Vendor lock-in: LangChain is open-source but adds a dependency. Bedrock Agents is AWS-native. (5) Recommendation: For Amazon, build custom - they have the engineering talent, and the chatbot is a strategic differentiator. For a startup, use Bedrock Agents to ship quickly.
What would make you decide to shut this project down?
- Expected: (1) If after 6 months post-launch, conversion rate for chatbot users is equal to or lower than non-chatbot users (i.e., no revenue impact). (2) If customer satisfaction scores are consistently negative despite iterations. (3) If LLM costs per conversation never fall below the cost of a human support interaction. (4) If guardrail failures remain above 1% despite improvements (brand risk). (5) However, before shutting down, explore pivots: change the scope (simpler assistant), change the model (cheaper/faster), change the placement (different entry point). Don't kill a promising concept over a fixable execution problem.