HLD Deep Dive: Architecture Overview & Core Components
Questions covered: Q1–Q5 (Easy), Q21 (Hard), Q40 (Very Hard)
Interviewer level: Engineering Manager → Staff Engineer
Q1. Walk us through the high-level architecture of MangaAssist. What are the main layers?
Short Answer
Eight layers: Client → Edge & Auth → Orchestration → Intelligence → Data → Safety & Output → Observability → Fallback.
Deep Dive
User (Web/Mobile)
│
▼
┌─────────────────────────────────┐
│ CLIENT LAYER │
│ React chat widget │
│ WebSocket (streaming) / HTTPS │
└─────────────┬───────────────────┘
│
▼
┌─────────────────────────────────┐
│ EDGE & AUTH LAYER │
│ CloudFront + ALB / API Gateway │
│ Auth Service (session tokens) │
│ Rate Limiter (token-bucket) │
└─────────────┬───────────────────┘
│
▼
┌─────────────────────────────────┐
│ ORCHESTRATION LAYER │
│ Chatbot Orchestrator (ECS) │
│ Intent Classifier (SageMaker) │
│ Conversation Memory (DynamoDB) │
└─────────────┬───────────────────┘
│
▼
┌─────────────────────────────────┐
│ INTELLIGENCE LAYER │
│ RAG Pipeline + OpenSearch │
│ Recommendation Engine │
│ Product Q&A Service │
│ Bedrock LLM (Claude 3.5) │
└─────────────┬───────────────────┘
│
▼
┌─────────────────────────────────┐
│ DATA LAYER │
│ Product Catalog (DDB + ES) │
│ Order / Returns / Checkout │
│ User Profile & History │
│ Promotions Service │
│ ElastiCache (Redis) hot cache │
└─────────────┬───────────────────┘
│
▼
┌─────────────────────────────────┐
│ SAFETY & OUTPUT LAYER │
│ Bedrock Guardrails │
│ Response Formatter │
└─────────────┬───────────────────┘
│
▼
┌─────────────────────────────────┐
│ OBSERVABILITY LAYER │
│ CloudWatch · Kinesis · Redshift│
│ Feedback Capture │
└─────────────────────────────────┘
│
▼
┌─────────────────────────────────┐
│ FALLBACK LAYER │
│ Amazon Connect (Human Handoff) │
└─────────────────────────────────┘
Why layered?
Each layer has a single responsibility. Layers can scale, fail, or be upgraded independently. The Data Layer can be swapped without touching the Intelligence Layer.
Interview tip: Walk it top-down. Start with "a user types a message…" and trace the path through each layer. Interviewers want to see you understand why each layer exists, not just that it exists.
Q2. Why is WebSocket used instead of plain HTTP?
Short Answer
WebSocket enables token-by-token streaming of LLM responses, making the chatbot feel faster and more responsive.
Deep Dive
The problem with HTTP (request-response):
With plain HTTP, the client sends a request and waits for the entire LLM response to be generated before receiving anything. For a response that takes 3 seconds to generate, the user stares at a blank screen for 3 seconds, then the full text appears instantly. This feels slow and unnatural.
What WebSocket gives you:
HTTP: Request ──────────────────────────► Full Response (3s wait)
WebSocket: Request ► Token1 ► Token2 ► Token3 ► ... ► TokenN (tokens stream as generated)
- Perceived latency drops from ~3s to ~300ms (time-to-first-token).
- The user sees the response "typing out" in real time, which feels conversational.
- WebSocket is a persistent, bidirectional TCP connection — no overhead of opening a new connection per message.
Implementation details:
- Session starts with POST /chat/init (HTTPS) to validate user and create session.
- After init, the frontend upgrades to WebSocket.
- Heartbeat pings every 30 seconds keep the connection alive.
- Idle connections are closed after 5 minutes.
- HTTPS fallback exists for environments that block WebSocket (corporate firewalls, older proxies).
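A minimal client-side sketch of this flow in Python, assuming a hypothetical wss:// endpoint returned by /chat/init and a JSON frame shape of {"type": "token"|"done", "text": ...} (both illustrative, not the real wire format):

```python
import asyncio
import json

import websockets  # pip install websockets

async def chat(ws_url: str, message: str) -> None:
    # ping_interval implements the 30-second heartbeat described above;
    # the library closes the socket if pongs stop coming back.
    async with websockets.connect(ws_url, ping_interval=30) as ws:
        await ws.send(json.dumps({"action": "sendMessage", "text": message}))
        async for frame in ws:  # tokens arrive as they are generated
            event = json.loads(frame)
            if event.get("type") == "token":
                print(event["text"], end="", flush=True)
            elif event.get("type") == "done":
                break

# The real URL would come from the POST /chat/init response.
asyncio.run(chat("wss://chat.example.com/ws?session=abc123", "Track my order"))
```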
API Gateway WebSocket support:
API Gateway v2 natively supports WebSocket APIs with route keys ($connect, $disconnect, $default). Each incoming WebSocket message triggers a Lambda or routes to an ECS service.
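On the server side, a $default handler can't stream over its own response; it pushes frames back through the API Gateway connection management API. A sketch, where the endpoint URL, frame shape, and generate_tokens helper are all assumptions:

```python
import json

import boto3

# Endpoint is deployment-specific:
# https://{api-id}.execute-api.{region}.amazonaws.com/{stage}
apigw = boto3.client(
    "apigatewaymanagementapi",
    endpoint_url="https://abc123.execute-api.ap-northeast-1.amazonaws.com/prod",
)

def handler(event, context):
    connection_id = event["requestContext"]["connectionId"]
    message = json.loads(event["body"])["text"]
    for token in generate_tokens(message):  # stand-in for the streaming LLM call
        apigw.post_to_connection(
            ConnectionId=connection_id,
            Data=json.dumps({"type": "token", "text": token}).encode(),
        )
    apigw.post_to_connection(ConnectionId=connection_id,
                             Data=json.dumps({"type": "done"}).encode())
    return {"statusCode": 200}
```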
Trade-offs of WebSocket vs. Server-Sent Events (SSE):

| | WebSocket | SSE |
|---|---|---|
| Direction | Bidirectional | Server → Client only |
| Protocol | TCP upgrade | Plain HTTP |
| Reconnection | Manual | Automatic |
| Complexity | Higher | Lower |
For a chatbot where the client also sends messages, WebSocket is the right choice. SSE would work but requires a second HTTP channel for client → server messages.
Q3. What role does API Gateway play?
Short Answer
Single entry point: TLS termination, request routing, throttling, validation. Decouples frontend from backend.
Deep Dive
API Gateway sits at the boundary between the internet and your internal services. It handles everything that would otherwise need to be replicated in every backend service:
Internet ──► [API Gateway] ──► Internal Services
│
├── TLS termination (HTTPS → HTTP internally)
├── Auth validation (verify session tokens)
├── Rate limiting (token-bucket per user)
├── Request validation (schema enforcement)
├── Request routing (which Lambda/ECS to hit)
├── Request/response logging
└── CORS headers
Why this matters architecturally:
1. Backend services don't need to implement TLS — they run HTTP internally within the VPC.
2. Single place to change routing — if you rename a service or move it to a different port, you change one config in API Gateway, not every client.
3. Throttling enforced at the edge — malicious traffic never reaches backend services.
4. Decoupling — the frontend calls /api/chat, not internal-orchestrator-service-v2.ap-northeast-1.elb.amazonaws.com. Backend refactors are transparent to the client.
API Gateway types in AWS:

| Type | Use Case |
|---|---|
| REST API | Standard HTTP APIs with stages, usage plans |
| HTTP API | Lower latency, lower cost, fewer features |
| WebSocket API | Persistent connections, used here for chat |
MangaAssist uses WebSocket API for the chat stream, and potentially a REST API for non-streaming endpoints (session init, feedback submission).
Common interview follow-up: What's the difference between API Gateway throttling and the application-level rate limiter?
API Gateway throttling sets a global limit (e.g., 10,000 req/sec for the service). The application-level rate limiter enforces per-user limits (e.g., 30 messages/minute per user). Both are needed.
Q4. Authenticated users vs. guest users — what's the difference?
Short Answer
Authenticated users get personalized features via customer ID. Guest users get a temporary session ID and can use discovery and FAQ.
Deep Dive
Authentication flow:
Authenticated User:
Browser → Amazon session cookie → Auth Service validates → customer_id extracted
→ Full feature set
Guest User:
Browser → No session cookie → Auth Service detects guest → session_id generated (UUID)
→ Limited feature set
Feature matrix:
| Feature | Authenticated | Guest |
|---|---|---|
| Product discovery | ✅ | ✅ |
| FAQ / Policy questions | ✅ | ✅ |
| Personalized recommendations | ✅ | ❌ |
| Order tracking | ✅ | ❌ |
| Return requests | ✅ | ❌ |
| Wishlist / Cart integration | ✅ | ❌ |
| Purchase history context | ✅ | ❌ |
| In-session recommendations | ✅ (uses history) | ✅ (uses page context) |
Why allow guests at all?
Not every user is logged in when they discover a product. Blocking non-authenticated users from the chatbot would lose potential customers at the top of the funnel. The chatbot can still provide value (discovery, FAQ) and serve as a nudge to log in for personalized recommendations.
Session ID lifecycle:
- Guest session ID is stored in a browser cookie (HttpOnly, Secure, SameSite=Strict).
- TTL: 24 hours in DynamoDB.
- If the user logs in mid-session, the session is upgraded and associated with their customer_id; previous turns are preserved (see the sketch below).
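A sketch of that upgrade path, assuming a DynamoDB table named chat-sessions keyed on session_id (table name and attribute names are illustrative):

```python
import time

import boto3

sessions = boto3.resource("dynamodb").Table("chat-sessions")

def upgrade_guest_session(session_id: str, customer_id: str) -> None:
    """Attach the logged-in customer to the existing guest session.

    The item keeps its conversation turns, so context survives login.
    """
    sessions.update_item(
        Key={"session_id": session_id},
        UpdateExpression="SET customer_id = :cid, user_type = :typ, expires_at = :ttl",
        ExpressionAttributeValues={
            ":cid": customer_id,
            ":typ": "authenticated",
            ":ttl": int(time.time()) + 24 * 3600,  # refresh the 24-hour TTL
        },
    )
```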
Q5. Why does the system have a rate limiter? What limits are applied?
Short Answer
Prevents abuse, controls LLM cost, protects downstream services. ~30 messages/minute per user; separate limits for authenticated vs. guest.
Deep Dive
Rate limiting serves three distinct goals:
1. Cost control (most important for LLM systems)
Without rate limiter:
1 user sends 1000 messages/minute
→ 1000 LLM calls/minute
→ ~$100/minute per single abusive user
→ $144,000/day from one bad actor
With rate limiter:
User capped at 30 messages/minute
→ Bounded cost per user
2. Abuse prevention
- Malicious users probing for prompt-injection vulnerabilities would need hundreds of requests to test systematically. Rate limiting makes this impractical.
- Scraping bots generate high-frequency request patterns. Rate limiting identifies and throttles them.
3. Downstream service protection
Even if the LLM can handle the load, downstream services (Order Service, Catalog) have their own capacity limits. The rate limiter ensures no single user consumes a disproportionate share.
Token bucket algorithm:
Each user has a "bucket" with:
- Capacity: 30 tokens
- Refill rate: 1 token/2 seconds (30/min)
Each message consumes 1 token.
If bucket is empty → 429 Too Many Requests.
The token bucket is preferred over a fixed window (e.g., "30 per minute") because it handles bursts more gracefully — a user can send 5 messages in rapid succession without being blocked, as long as the overall rate stays within 30/min.
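A minimal in-memory sketch of the algorithm; a production limiter would keep bucket state in Redis (or lean on API Gateway) so all instances share it:

```python
import time

class TokenBucket:
    def __init__(self, capacity: int = 30, refill_per_sec: float = 0.5):
        self.capacity = capacity              # burst size: 30 tokens
        self.refill_per_sec = refill_per_sec  # 1 token / 2s = 30/min
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller responds with 429 Too Many Requests
```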
Rate limit tiers:
| User Type | Limit | Rationale |
|---|---|---|
| Authenticated | 30 messages/min | Full experience |
| Guest | 10 messages/min | Prevents anonymous abuse |
| API / Developer | Configurable | Via API Gateway usage plans |
Rate limit headers returned to client:
X-RateLimit-Limit: 30
X-RateLimit-Remaining: 24
X-RateLimit-Reset: 1711345678
Retry-After: 8 (if 429)
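Clients should honor these headers instead of retrying blindly. A sketch of the polite-retry pattern (endpoint and payload are illustrative):

```python
import time

import requests

def send_message(url: str, payload: dict) -> requests.Response:
    while True:
        resp = requests.post(url, json=payload, timeout=10)
        if resp.status_code != 429:
            return resp
        # Sleep for the server-suggested interval before retrying (default 1s).
        time.sleep(int(resp.headers.get("Retry-After", "1")))
```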
Q21. Lambda + Step Functions vs. ECS/Fargate — what are the trade-offs?
Short Answer
Lambda: auto-scaling, pay-per-invocation, cold-start risk. ECS/Fargate: no cold starts, holds in-process state, costs run continuously. MangaAssist uses a hybrid: ECS/Fargate for the long-lived WebSocket and orchestration tiers, Lambda with provisioned concurrency for bursty overflow.
Deep Dive
Lambda characteristics:
- Auto-scales to zero — if there are no requests, there are no charges.
- Pay per invocation — ~$0.0000002 per request + $0.0000166667/GB-second of duration.
- Cold starts — typically ~100–300ms for Python/Node.js; JVM runtimes can take considerably longer. Eliminated with provisioned concurrency.
- 15-minute maximum timeout — fine for chatbot responses (~2–5s), problematic for batch jobs.
- Stateless — each invocation is independent, so session state must live in an external store (DynamoDB).
- Max concurrency — default 1,000 per region; can be raised via a quota increase.
ECS/Fargate characteristics:
- Always-on containers — never a cold start.
- Can hold in-process state — WebSocket connections, connection pools, in-memory caches.
- Continuous billing — you pay for running tasks even when idle.
- Manual scaling triggers needed — Auto Scaling policies based on CPU/memory.
- Streaming-friendly — persistent processes can hold open WebSocket connections.
MangaAssist hybrid model:
WebSocket Handler:
[ECS Fargate, sticky sessions]
- Must be long-lived to hold WebSocket connections
- Auto-scales 2–20 tasks based on connection count
Orchestrator:
[ECS Fargate, 10–100 tasks]
- Core coordination logic; benefits from always-on warm instances
- Stable traffic pattern → justifies ECS over Lambda
Lambda Burst Workers:
[Lambda @ concurrency 1000]
- Handles overflow during traffic spikes
- ECS can't scale fast enough for sudden viral events
- Pay-per-use for the bursty portion
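A sketch of wiring the WebSocket handler's 2–20 task range to a connection-count metric via Application Auto Scaling; the cluster/service names, metric namespace, and the 500-connections-per-task target are all assumptions:

```python
import boto3

aas = boto3.client("application-autoscaling")

aas.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/chat-cluster/websocket-handler",  # hypothetical names
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=20,
)

aas.put_scaling_policy(
    PolicyName="connections-target-tracking",
    ServiceNamespace="ecs",
    ResourceId="service/chat-cluster/websocket-handler",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 500.0,  # assumed target: ~500 connections per task
        "CustomizedMetricSpecification": {
            "MetricName": "ActiveWebSocketConnections",  # custom metric the
            "Namespace": "MangaAssist",                  # handler would publish
            "Statistic": "Average",
        },
        "ScaleOutCooldown": 30,   # scale out fast for connection surges
        "ScaleInCooldown": 120,   # scale in slowly to avoid dropping sockets
    },
)
```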
Cold start mitigation for Lambda:
Provisioned Concurrency: Keep N Lambda instances warm at all times.
Cost: ~$0.000004646/GB-second to keep instances provisioned, plus $0.0000097/GB-second of duration while serving requests.
Trade-off: You pay for idle instances. Set provisioned concurrency equal to p95 baseline traffic.
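Configuring the warm pool is one API call against a published alias; the function name, alias, and pool size here are illustrative:

```python
import boto3

lam = boto3.client("lambda")

# Keep 50 execution environments initialized for the burst workers,
# sized from the p95 baseline as described above.
lam.put_provisioned_concurrency_config(
    FunctionName="chat-burst-worker",  # hypothetical function name
    Qualifier="live",                  # the alias that traffic targets
    ProvisionedConcurrentExecutions=50,
)
```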
Decision framework:
| Factor | Choose Lambda | Choose ECS/Fargate |
|---|---|---|
| Traffic pattern | Bursty, unpredictable | Steady, predictable |
| Response time | <200ms acceptable | Need <50ms always |
| State | Stateless OK | Need in-process state |
| Cost model | Low baseline traffic | High consistent traffic |
| Streaming | Not required | Required (WebSocket) |
Q40. Why not just dump everything into one big LLM call?
Short Answer
Cost, latency, accuracy, freshness, maintainability, and observability all fail at scale with a monolithic LLM approach.
Deep Dive
The "naive" approach:
```python
# Naive: Fetch everything, stuff it in one prompt
prompt = f"""
You are MangaAssist. Here is ALL product data: {all_50k_products}
Here is ALL FAQ data: {all_faq_docs}
Here is user order history: {all_orders}
Answer: {user_message}
"""
response = llm.generate(prompt)
```
Why this breaks at scale:
1. Cost
Token calculation per call:
50K products × 200 tokens/product = 10M tokens input
2,000 FAQ docs × 500 tokens/doc = 1M tokens
Total input per call: ~11M tokens
Claude Sonnet cost: $3/1M input tokens
Cost per call: ~$33
At 100K calls/day: $3.3M/day
At MangaAssist scale: IMPOSSIBLE
RAG approach:
Top-5 retrieved chunks × 500 tokens = 2,500 tokens
Cost per call: ~$0.0075
Difference: 4,400x cheaper
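The RAG version retrieves only the top-k relevant chunks before calling the LLM. A sketch against an OpenSearch k-NN index; the index name, field names, and embed() helper are assumptions:

```python
from opensearchpy import OpenSearch  # pip install opensearch-py

client = OpenSearch(hosts=[{"host": "search-internal", "port": 9200}])

def retrieve_context(user_message: str, k: int = 5) -> str:
    query_vector = embed(user_message)  # hypothetical embedding call
    hits = client.search(
        index="product-chunks",
        body={
            "size": k,
            "query": {"knn": {"embedding": {"vector": query_vector, "k": k}}},
        },
    )["hits"]["hits"]
    # ~5 chunks × 500 tokens ≈ 2,500 input tokens instead of ~11M
    return "\n\n".join(hit["_source"]["text"] for hit in hits)
```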
2. Latency
Larger prompts mean more processing time for the LLM. An 11M-token context would take minutes, not seconds, and Claude's context window doesn't support 11M tokens anyway (max ~200K tokens).
3. Accuracy
More context = more noise. If you hand the LLM 50,000 products when the user asked about one specific series, the model has to find the relevant content in a haystack. RAG retrieves only what's relevant, reducing noise and improving answer quality.
4. Freshness
Prices change hourly. Inventory changes by the minute. You cannot "dump everything" — you must query live services for real-time data. A static dump would return wrong prices and out-of-stock products.
5. Maintainability
With microservices, you can update the Product Catalog independently of the RAG pipeline. You can change the LLM prompt without touching the Order Service. With a monolith, everything is coupled.
6. Observability
With the microservices approach:
- "The RAG retrieval step is taking 400ms — we need to optimize the OpenSearch query"
- "Intent classification is mis-routing 5% of order queries — retrain the classifier"
- "LLM response length is 40% longer than needed — tighten the prompt"
With a monolith, all you know is "the LLM response was slow." You can't isolate which step is the problem.
When the simple approach works:
Prototyping, internal tools, very small datasets, when you don't care about cost. For a production system serving millions of Amazon users, the microservices architecture is non-negotiable.