
01. Types of APIs I Built for MangaAssist

"MangaAssist wasn't one API — it was six different API surfaces stitched together behind a single chat widget. WebSocket for streaming, REST for control actions, gRPC for internal service calls, ML inference endpoints, vector search, and a multi-stage guardrails pipeline. Each one had different latency profiles, failure modes, and scaling behavior."


Where This Doc Fits

Each section below ends with a Design Lens sub-section showing which stakeholder forced which choice. If you only have time to skim, read the Design Lens callouts.


API Landscape at a Glance

| # | API Type | Protocol | Purpose | Latency Target |
|---|----------|----------|---------|----------------|
| 1 | WebSocket API | WebSocket (wss://) | Stream LLM tokens to the user in real time | First token < 1.5s |
| 2 | External REST APIs | HTTPS | Chat message, feedback, escalation endpoints | < 200ms (non-LLM) |
| 3 | Internal Downstream APIs | REST / gRPC | Product Catalog, Orders, Returns, Promotions, etc. | < 100-200ms per call |
| 4 | ML Inference APIs | HTTPS (SageMaker + Bedrock) | Intent classifier, embeddings, reranker, LLM generation | 15ms – 1.5s depending on model |
| 5 | Vector Search / RAG API | HTTPS (OpenSearch) | Retrieve knowledge chunks for grounded answers | < 50ms |
| 6 | Guardrails / Safety API | Internal pipeline | PII detection, toxicity filter, price validation, ASIN check | < 100ms total |

1. WebSocket API — Real-Time Streaming

I built the primary user-facing transport on Amazon API Gateway WebSocket. The key insight was that a chatbot generating 200+ tokens feels painfully slow if the user waits for the full response. Streaming token-by-token via WebSocket made responses feel instant even when full generation took 1-2 seconds.

Routes

| Route | Purpose |
|-------|---------|
| $connect | Authenticate user (Amazon session cookie), create session, enforce rate limits |
| $disconnect | Clean up session state, flush analytics events |
| $default | All chat messages — classify intent, orchestrate, stream response back |

Key Design Decisions

  • Auth at connect time: I validated the Amazon session token during $connect so every subsequent message was pre-authenticated, with no per-message auth overhead (see the sketch after this list)
  • HTTPS REST fallback: Some corporate networks and older browsers block WebSocket. I built a parallel REST path (POST /chat/message) that returns the full response in one shot — same orchestrator, different transport
  • Sticky sessions via ALB: WebSocket connections are long-lived. I used Application Load Balancer sticky sessions so reconnects land on the same orchestrator instance, preserving in-memory conversation context
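
A minimal sketch of the $connect path described above, assuming a Lambda-style handler. The helper functions and the chat_sessions table name are illustrative stand-ins, not the production code:

```python
import time
import boto3

dynamodb = boto3.resource("dynamodb")
sessions_table = dynamodb.Table("chat_sessions")     # hypothetical table name


def validate_session_token(token):
    """Placeholder: verify the Amazon session token and return a customer_id (or None)."""
    return "cust_123" if token else None


def check_rate_limit(customer_id):
    """Placeholder: per-customer connect-rate check (production used a token bucket)."""
    return True


def on_connect(event, context):
    """$connect route: authenticate once so every later $default message is pre-authorized."""
    params = event.get("queryStringParameters") or {}
    connection_id = event["requestContext"]["connectionId"]

    customer_id = validate_session_token(params.get("token"))
    if customer_id is None:
        return {"statusCode": 401}                   # rejecting here refuses the WebSocket upgrade
    if not check_rate_limit(customer_id):
        return {"statusCode": 429}

    session_id = params.get("session_id") or f"sess_{connection_id}"
    sessions_table.put_item(Item={
        "session_id": session_id,                    # durable key (survives reconnects)
        "connection_id": connection_id,              # ephemeral, replaced on every reconnect
        "customer_id": customer_id,
        "ttl": int(time.time()) + 24 * 3600,         # matches the 24h session TTL
    })
    return {"statusCode": 200}                       # API Gateway completes the connection
```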

Response Streaming Format

{"type": "token", "content": "Based"}
{"type": "token", "content": " on"}
{"type": "token", "content": " your"}
{"type": "products", "items": [{"asin": "B08X1YRSTR", "title": "Chainsaw Man Vol 1", "price": "$9.99"}]}
{"type": "actions", "buttons": [{"label": "Add to Cart", "type": "add_to_cart", "asin": "B08X1YRSTR"}]}
{"type": "done", "metadata": {"latency_ms": 1842, "intent": "recommendation"}}

Each frame is a self-contained JSON object so the frontend can render progressively — text tokens appear as they generate, then product cards and action buttons drop in at the end.

LLD — Connection State Machine

sequenceDiagram
    participant C as Client
    participant S as Server (APIGW + Orchestrator)
    participant DB as DynamoDB

    Note over C,S: Initial connection
    C->>S: $connect (wss:// + Cognito token)
    activate S
    S->>S: validate token
    S->>S: rate-limit check
    S->>DB: create session
    S-->>C: 101 Switching Protocols
    deactivate S

    Note over C,S: Active conversation
    C->>S: user message ($default)
    S-->>C: frame stream<br/>(token, products, done)

    Note over C,S: Heartbeat (every 30s)
    C->>S: ping
    S-->>C: pong

    Note over C,S: 10 min idle timeout
    S->>S: idle disconnect
    S->>DB: session persists (TTL 24h)
    S-->>C: $disconnect

    Note over C,S: Reconnect with same session_id
    C->>S: $connect (resume, session_id)
    S->>DB: load session
    DB-->>S: turns + summary
    S-->>C: context restored

Key LLD invariants:

  • Connection identity ≠ session identity. The session_id is durable in DynamoDB; the WebSocket connection_id is ephemeral.
  • Sticky sessions (ALB) keep reconnects on the same orchestrator task when possible — but the session is recoverable from DynamoDB if the task dies.
  • See Frame Schema in 00-hld-lld-architecture.md, Section 2.3 for the full frame discriminated-union.

Design Lens — WebSocket as primary transport

| Lens | Position |
|------|----------|
| 🎨 Frontend | Drove the REST fallback requirement (corporate networks block WebSocket) |
| 📊 PM | Drove streaming itself (perceived-quality A/B data) |
| 🛡️ Security | Liked auth-once-at-connect; demanded origin validation + token expiry handling on long sessions |
| ⚙️ SRE | Cautious — WebSocket connections are stateful. Required deploy-time connection draining (60s grace) |
| 💰 Cost | Modeled the per-connection-minute cost; acceptable at ~$0.0008/conversation |

Full multi-stakeholder analysis: 00-hld-lld-architecture.md, Decision D-1.


2. External REST APIs — Control Plane

Three REST endpoints powered the non-streaming interactions:

POST /chat/message

The main chat endpoint (used as fallback when WebSocket is unavailable).

// Request
{
  "session_id": "sess_abc123",
  "message": "Recommend something like Naruto",
  "page_context": {
    "current_asin": null,
    "store_section": "manga-home",
    "cart_asins": ["B09XYZ"],
    "url": "/stores/page/jp-manga"
  }
}

// Response
{
  "session_id": "sess_abc123",
  "response_id": "resp_def456",
  "response_text": "If you loved Naruto, you'll enjoy these...",
  "products": [...],
  "actions": [...],
  "follow_up_suggestions": ["Tell me about the art style", "Is there a box set?"],
  "metadata": {
    "intent": "recommendation",
    "latency_ms": 1842,
    "model": "claude-3-sonnet",
    "sources": ["recommendation_engine", "product_catalog"]
  }
}

POST /chat/feedback

{
  "response_id": "resp_def456",
  "feedback": "thumbs_up",
  "comment": null
}

POST /chat/escalate

{
  "session_id": "sess_abc123",
  "reason": "user_requested",
  "summary": "User wants to return damaged Naruto Vol 5, order #112-3456789"
}

Rate Limiting

| User Type | Limit | Enforcement |
|-----------|-------|-------------|
| Authenticated | 30 messages/min | Token bucket per customer ID |
| Guest | 10 messages/min | Token bucket per session ID |
| Global LLM path | Configurable cap | Protects Bedrock quota from abuse |

Rate limiting was enforced at the API Gateway level using usage plans + Lambda authorizer. Exceeding the limit returns 429 Too Many Requests with a human-friendly message.
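
For illustration, a minimal in-process sketch of the token-bucket math behind the 30 messages/min limit; the real enforcement lived in the Lambda authorizer with shared state rather than a per-process dict, so treat this as the algorithm, not the deployment:

```python
import time
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    capacity: float = 30.0                 # burst size: 30 messages
    refill_per_sec: float = 30.0 / 60.0    # steady rate: 30 messages/minute
    tokens: float = 30.0                   # start full
    last_refill: float = field(default_factory=time.monotonic)

    def allow(self) -> bool:
        """Return True if a message may pass, consuming one token."""
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                       # caller maps this to 429 Too Many Requests


buckets: dict[str, TokenBucket] = {}       # keyed by customer ID or session ID


def allow_message(key: str) -> bool:
    return buckets.setdefault(key, TokenBucket()).allow()
```

Unlike a fixed window, the bucket refills continuously, so the boundary-burst problem Security flagged (two full windows back to back) cannot occur.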

LLD — Idempotency on Feedback and Escalation

The non-streaming endpoints have idempotency requirements that streaming doesn't:

  • POST /chat/feedback is keyed on (response_id, customer_id). Duplicate clicks (network retry, double-tap on mobile) must not produce double feedback. Implementation: DynamoDB conditional write if not exists.
  • POST /chat/escalate is keyed on (session_id, escalation_reason_hash) within a 5-minute window. Prevents accidentally creating multiple human-handoff tickets.
  • POST /chat/message (REST fallback) uses an Idempotency-Key header. Required for mobile clients where flaky networks retry. Server caches the response for 5 min keyed on (customer_id, idempotency_key).
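
A sketch of the feedback guard using a boto3 conditional write, assuming a chat_feedback table keyed on (response_id, customer_id); the table and attribute names are illustrative:

```python
import boto3
from botocore.exceptions import ClientError

feedback_table = boto3.resource("dynamodb").Table("chat_feedback")   # hypothetical table


def record_feedback(response_id: str, customer_id: str, feedback: str) -> bool:
    """Write feedback once; duplicate clicks and network retries become no-ops."""
    try:
        feedback_table.put_item(
            Item={
                "response_id": response_id,      # partition key
                "customer_id": customer_id,      # sort key
                "feedback": feedback,
            },
            # The write only succeeds if no item already exists for this (response_id, customer_id)
            ConditionExpression="attribute_not_exists(response_id) AND attribute_not_exists(customer_id)",
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False                         # duplicate: already recorded, safe to ignore
        raise
```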

Design Lens — Token bucket vs. fixed window rate limiting

| Lens | Position |
|------|----------|
| 🛡️ Security | Drove the choice — fixed window has the boundary-burst problem (30 msgs at 11:59:59 + 30 at 12:00:01) |
| 🔧 Backend | Token bucket is straightforward in Redis with INCR + TTL |
| ⚙️ SRE | Wanted a global LLM-path cap as a safety valve independent of per-user limits |
| 💰 Cost | Rate limiting is a Bedrock-cost-protection mechanism, not just abuse prevention |
| 📊 PM | Negotiated the 30/min number — calibrated against P99 of legitimate beta usage (15/min) for 2x headroom |

3. Internal Downstream APIs — Service Integrations

The orchestrator called 10 different backend services via internal REST and gRPC APIs. Each integration had its own contract, latency profile, and failure handling.

| Service | Protocol | Data Returned | Latency | Auth Required |
|---------|----------|---------------|---------|---------------|
| Product Catalog | gRPC | ASIN, title, author, format, language, page count | < 100ms | No |
| Pricing Service | REST | Current price, list price, discounts, Prime pricing | < 50ms | No |
| Inventory Service | REST | Stock status, restock date, warehouse location | < 50ms | No |
| Reviews & Ratings | REST | Avg rating, review count, top review snippets | < 100ms | No |
| Recommendation Engine | REST (Personalize) | Ranked list of recommended ASINs | < 200ms | Optional |
| Customer Profile | REST | Name, Prime status, locale, preferred language | < 50ms | Yes |
| Order Service | REST | Order status, items, dates | < 200ms | Yes |
| Shipping & Delivery | REST | Carrier, tracking, ETA | < 200ms | Yes |
| Returns Service | REST | Eligibility, return window, refund status | < 200ms | Yes |
| Promotions Service | REST | Active deals, coupons, bundle offers | < 100ms | No |

Key Design Decisions

  • Parallel fan-out: After intent classification, I dispatched downstream calls in parallel rather than sequentially (see the sketch after this list). This cut 300ms+ from the critical path (3 calls at 100ms each = 300ms sequential vs 100ms parallel)
  • Price is never cached: Product prices were always fetched live. Showing a stale price is a legal and trust issue. Every other data type used ElastiCache with appropriate TTLs
  • Auth-gated vs. open: Product discovery, FAQ, and recommendations worked for guest users. Order tracking, returns, and profile data required authentication. The orchestrator enforced this boundary
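
A sketch of the parallel fan-out with per-service timeouts and fallbacks. The downstream client functions here are hypothetical stand-ins for the real gRPC/REST wrappers:

```python
import asyncio


# Hypothetical downstream clients, standing in for the real gRPC/REST wrappers.
async def fetch_catalog(asin: str): return {"asin": asin, "title": "Chainsaw Man Vol 1"}
async def fetch_pricing(asin: str): return {"price": "$9.99"}
async def fetch_reviews(asin: str): return {"avg_rating": 4.8}


async def call_with_fallback(coro, timeout_s: float, fallback):
    """Bound every downstream call by its timeout; degrade instead of failing the turn."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        return fallback
    except Exception:
        return fallback                    # connection errors degrade the same way


async def gather_context(asin: str) -> dict:
    # All three calls start at once, so wall-clock cost is the slowest call, not the sum.
    catalog, pricing, reviews = await asyncio.gather(
        call_with_fallback(fetch_catalog(asin), 0.2, fallback=None),   # skip product card
        call_with_fallback(fetch_pricing(asin), 0.2, fallback=None),   # strip price mentions
        call_with_fallback(fetch_reviews(asin), 0.3, fallback=None),   # skip review snippet
    )
    return {"catalog": catalog, "pricing": pricing, "reviews": reviews}


# Usage: asyncio.run(gather_context("B08X1YRSTR"))
```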

LLD — Per-Service Failure Policy

Each downstream service has an explicit failure policy. The policy is hard-coded, not configured at runtime, because failure semantics are part of correctness — not tunable parameters.

| Service | Timeout | Retry | Circuit Breaker | Fallback Behavior |
|---------|---------|-------|-----------------|-------------------|
| Catalog (gRPC) | 200ms | | Open at 30% error in 60s | Skip product card; return prose-only response |
| Pricing | 200ms | 0× (NOT idempotent semantics for pricing freshness) | Open at 20% | Strip all price mentions; "check product page" |
| Inventory | 150ms | | Open at 30% | Show product without availability badge |
| Reviews | 300ms | | Open at 50% | Skip review snippet |
| Recommendation Engine | 500ms | | Open at 40% | Fall back to vector-search top-N |
| Customer Profile | 200ms | | Open at 30% | Treat user as guest tier |
| Order Service | 500ms | | Open at 30% | "I couldn't load your order — let me connect you with support" |
| Returns | 500ms | | Open at 30% | Direct user to returns center URL |
| Promotions | 200ms | | Open at 50% | Don't surface promotions; chat continues |

Why "0 retries" on Pricing: A retry implies the first call may succeed at the second attempt. For pricing, the first call's latency is also a freshness signal — a slow pricing call may be returning stale data from a degraded path. We'd rather strip prices than risk stale ones.

Design Lens — gRPC for Catalog, REST for everything else

| Lens | Position |
|------|----------|
| 🔧 Backend | "Catalog is the highest-volume call. gRPC saves 25% serialization CPU at our scale." |
| 💰 Cost | Endorsed — direct $X/month savings on serialization + bandwidth |
| ⚙️ SRE | Concerned about two debugging tools; required X-Ray instrumentation parity |
| 📊 PM | Neutral — invisible to user |
| Hidden driver | The Catalog team already had a gRPC service. Adopting their existing contract was easier than asking them to build a REST facade. Decisions like this are bent by what other teams have already shipped. |

Full multi-stakeholder analysis: 00-hld-lld-architecture.md, Decision D-8.


4. ML Inference APIs — The Intelligence Layer

Every user message triggered a chain of 4 ML model invocations — not one API call, but an orchestrated pipeline.

| Model | Hosting | Endpoint Type / Instance | Avg Latency | P99 Latency | Purpose |
|-------|---------|--------------------------|-------------|-------------|---------|
| DistilBERT (fine-tuned) | SageMaker Real-time Endpoint | ml.inf1.xlarge (Inferentia) | ~15ms | ~50ms | Intent classification (10 intents) |
| Titan Embeddings V2 | Amazon Bedrock | On-demand | ~30ms | ~80ms | Query embedding for RAG retrieval |
| Cross-Encoder Reranker | SageMaker Real-time Endpoint | ml.g4dn.xlarge | ~50ms | ~120ms | Rerank top-50 RAG results to top-5 |
| Claude 3.5 Sonnet | Amazon Bedrock | On-demand / Provisioned | ~500ms (TTFT) | ~1.5s (TTFT) | Natural language response generation |

Two-Stage Intent Classification

I designed the intent classifier as a two-stage system to minimize cost:

  1. Stage 1 — Rule-based pre-filter: Regex patterns caught high-confidence intents cheaply ("where is my order" → order_tracking at 0.95 confidence). Handled ~40% of messages with zero ML inference cost
  2. Stage 2 — DistilBERT classifier: Only fired when Stage 1 confidence < 0.8. Fine-tuned on 55K labeled Amazon customer service conversations + manga-specific training data
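
A sketch of the two-stage flow: the regex pre-filter answers high-confidence intents directly, and the DistilBERT endpoint is invoked only otherwise. The patterns and the classify_with_distilbert wrapper are illustrative, not the production rule set:

```python
import re

# Stage 1: cheap, high-precision patterns mapped to (intent, confidence).
RULES = [
    (re.compile(r"\bwhere('?s| is) my order\b", re.I), ("order_tracking", 0.95)),
    (re.compile(r"\b(return|refund)\b.*\b(order|volume|book)\b", re.I), ("return_request", 0.90)),
    (re.compile(r"^(hi|hello|thanks|thank you)\b", re.I), ("chitchat", 0.98)),
]

CONFIDENCE_THRESHOLD = 0.8


def classify_with_distilbert(message: str) -> tuple[str, float]:
    """Placeholder for the SageMaker real-time endpoint call."""
    return "recommendation", 0.85


def classify_intent(message: str) -> tuple[str, float]:
    # Stage 1: rule-based pre-filter (about 40% of traffic stops here, zero ML inference cost).
    for pattern, (intent, confidence) in RULES:
        if pattern.search(message) and confidence >= CONFIDENCE_THRESHOLD:
            return intent, confidence
    # Stage 2: fine-tuned DistilBERT, only when the rules can't answer confidently.
    return classify_with_distilbert(message)
```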

Intelligent Model Routing

Not every message needed Claude 3.5 Sonnet. I built a cost-aware router:

| Intent | Model Used | Why |
|--------|------------|-----|
| Chitchat (hello, thanks) | Template responses | No LLM needed at all |
| Order status formatting | Claude Haiku | Simple formatting task; 10x cheaper |
| Product recommendations | Claude 3.5 Sonnet | Complex reasoning with product context |
| Multi-turn Q&A | Claude 3.5 Sonnet | Needs conversation history understanding |

This routing saved ~$18,000/month in LLM costs.
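
A sketch of that routing table as code; the model labels are placeholders rather than exact Bedrock model IDs:

```python
# Map each intent to the cheapest model that meets its quality bar.
ROUTES = {
    "chitchat": "template",            # canned responses, no LLM call
    "order_status": "haiku",           # simple formatting task, ~10x cheaper
    "recommendation": "sonnet",        # complex reasoning over product context
    "multi_turn_qa": "sonnet",         # needs conversation-history understanding
}


def route_model(intent: str) -> str:
    """Unknown intents default to the premium model: quality beats cost on tail intents."""
    return ROUTES.get(intent, "sonnet")
```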

LLD — Latency Budget Allocation

The 1.5s P99 first-token target is decomposed across the inference chain:

Total P99 budget: 1500ms TTFT
├── Edge + auth + validate                    50ms   (3%)
├── Stage 1 regex / Stage 2 DistilBERT       50ms   (3%)
├── Embedding (Titan)                          80ms   (5%)
├── KNN search (OpenSearch)                    50ms   (3%)
├── Reranker (Cross-Encoder)                  120ms   (8%)
├── Service fan-out (parallel, max wait)     200ms  (13%)
├── Prompt assembly                            10ms   (1%)
├── Bedrock TTFT                              900ms  (60%)
├── Inline guardrail (per chunk)               <1ms   (–)
└── Headroom                                   40ms   (3%)

Critical observation: Bedrock TTFT dominates. 60% of the budget is in one component we don't control. This shapes everything else — every other component must come in under its allocation, because LLM TTFT can spike unexpectedly and there's no way to shrink it from the outside.

Speculative execution recovers ~150ms on average: Embedding + KNN run in parallel with intent classification, not after it. ~70% of intents need RAG anyway. Wasted compute when intent doesn't need RAG: ~30ms. Net gain: 150-300ms saved on the 70% majority case.
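
A sketch of the speculative-execution pattern with placeholder pipeline stages: retrieval starts alongside intent classification and is simply discarded when the intent turns out not to need RAG:

```python
import asyncio


# Hypothetical stand-ins for the real pipeline stages.
async def embed_and_search(message): await asyncio.sleep(0.08); return ["chunk-1", "chunk-2"]
async def classify_intent_async(message): await asyncio.sleep(0.05); return "recommendation", 0.9
def intent_needs_rag(intent): return intent not in {"chitchat", "order_tracking"}
def build_prompt(intent, message, chunks): return {"intent": intent, "message": message, "chunks": chunks}


async def handle_turn(message: str):
    # Start retrieval speculatively rather than waiting for the intent to be known.
    retrieval_task = asyncio.create_task(embed_and_search(message))   # Titan embed + KNN
    intent, _confidence = await classify_intent_async(message)

    if intent_needs_rag(intent):                 # ~70% of traffic
        chunks = await retrieval_task            # usually already finished by now
    else:                                        # ~30% of traffic: the speculative work is wasted
        retrieval_task.cancel()
        chunks = []

    return build_prompt(intent, message, chunks)


# Usage: asyncio.run(handle_turn("Recommend something like Naruto"))
```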

Design Lens — Provisioned Throughput vs. on-demand for Bedrock

Lens Position
💰 Cost Drove the decision — 35% saving with tiered Provisioned + on-demand
⚙️ SRE Strong supporter — eliminates throttle risk for baseline traffic
🤖 ML Cautious — routing splits eval surface; needed offline evaluation showing Haiku is acceptable for routed-down intents
📊 PM Required quality evaluation showing no measurable difference for routed categories
⚖️ Legal Neutral — both models in existing Bedrock agreement

Full analysis: 00-hld-lld-architecture.md, Decision D-4.


5. Vector Search / RAG API — Knowledge Retrieval

The RAG pipeline powered FAQ, policy, and product knowledge answers. I used OpenSearch Serverless with HNSW vector indexing.

Retrieval Flow

  1. Embed query → Titan Embeddings V2 (1024-dim vector)
  2. KNN search → OpenSearch returns top-10 candidate chunks
  3. Rerank → Cross-Encoder Reranker scores all 10, returns top-3
  4. Augment prompt → Top-3 chunks injected into the LLM prompt as grounding context
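
A sketch of the KNN step using opensearch-py, assuming an index named manga_knowledge with the schema shown in the next subsection; the endpoint, auth/signing, and the upstream embedding call are omitted or illustrative:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "search.example.internal", "port": 443}], use_ssl=True)


def retrieve_chunks(query_embedding: list[float], k: int = 10) -> list[dict]:
    """Top-k approximate nearest neighbours from the HNSW index (reranked afterwards)."""
    body = {
        "size": k,
        "query": {
            "knn": {
                "embedding": {                  # matches the index schema's embedding field
                    "vector": query_embedding,  # 1024-dim Titan Embeddings V2 vector
                    "k": k,
                }
            }
        },
        "_source": ["chunk_id", "content", "source_type", "asin"],
    }
    response = client.search(index="manga_knowledge", body=body)   # index name is illustrative
    return [hit["_source"] for hit in response["hits"]["hits"]]
```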

Index Schema

{
  "chunk_id": "faq-return-policy-003",
  "content": "Manga volumes can be returned within 30 days of delivery if they are in original condition.",
  "source_type": "faq",
  "asin": null,
  "category": "manga",
  "embedding": [0.012, -0.034, ...],
  "last_updated": "2025-12-01"
}

Data Sources Indexed

  • Product descriptions (by ASIN)
  • FAQ pages (return policy, shipping info, payment options)
  • Editorial content (manga guides, genre explainers)
  • Review summaries (pre-generated sentiment snippets)

Chunking strategy: 512 tokens with 50-token overlap. This balance kept chunks large enough for context but small enough for precise retrieval.
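
A sketch of the sliding-window chunking (512 tokens, 50-token overlap), using whitespace tokens as a stand-in for the real tokenizer:

```python
def chunk_text(text: str, chunk_tokens: int = 512, overlap_tokens: int = 50) -> list[str]:
    """Sliding window: each chunk shares its last 50 tokens with the next chunk's start."""
    tokens = text.split()                       # stand-in; production used the model tokenizer
    step = chunk_tokens - overlap_tokens        # 462-token stride
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_tokens]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_tokens >= len(tokens):
            break                               # the last window already reached the end
    return chunks
```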

LLD — Index Update Pipeline

Knowledge sources update at very different cadences. The index must absorb updates without thrashing:

| Source | Update Cadence | Trigger | SLA |
|--------|----------------|---------|-----|
| Product descriptions | Daily | Catalog change-data-capture stream | New ASIN searchable < 2 hours |
| FAQ / Policy | Weekly | Manual edit + approval workflow | Updated chunk searchable < 30 minutes |
| Editorial content | Per release | Editorial CMS webhook | Same-day |
| Review summaries | Hourly | Sentiment pipeline rerun | Next hourly cycle |

Why CDC for product descriptions: Catalog has millions of items. Polling for changes is wasteful; the stream lets us re-embed only the deltas. Each change event triggers: (1) re-embed via Titan, (2) update OpenSearch document atomically, (3) emit a metric.

Atomic update strategy: OpenSearch supports partial document updates. We update the embedding and the metadata in a single request. This avoids the failure mode where the metadata is fresh but the embedding is stale (or vice versa) — which would silently degrade retrieval quality.
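
A sketch of one CDC-driven update as a single partial-update request, so content, embedding, and last_updated change together; the index name and the Titan wrapper are illustrative:

```python
from datetime import date


def embed_with_titan(text: str) -> list[float]:
    """Placeholder for the Titan Embeddings V2 call (returns a 1024-dim vector)."""
    return [0.0] * 1024


def apply_catalog_change(client, chunk_id: str, new_description: str) -> None:
    """Handle one change-data-capture event for a product description chunk."""
    embedding = embed_with_titan(new_description)

    # Single partial-update request: content, embedding, and metadata cannot drift apart.
    client.update(
        index="manga_knowledge",                         # illustrative index name
        id=chunk_id,
        body={
            "doc": {
                "content": new_description,
                "embedding": embedding,
                "last_updated": date.today().isoformat(),
            }
        },
    )
    # emit_metric("knowledge_index.chunk_updated", 1)    # placeholder for step (3), the metric emit
```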

Design Lens — HNSW vs. flat / IVF index

| Lens | Position |
|------|----------|
| 🤖 ML | Drove the choice — HNSW gives best recall/latency trade-off for our query volume |
| 🔧 Backend | OpenSearch HNSW is managed; no custom index code |
| ⚙️ SRE | Slight concern about HNSW memory footprint; right-sized cluster for 18K vectors @ 1024-dim |
| 💰 Cost | OpenSearch Serverless OCU pricing was modeled and acceptable |
| 📊 PM | Neutral — outcome measured by retrieval quality, not algorithm choice |

Alternative considered: flat brute-force search. Rejected because at 18K+ vectors and 50K+ queries/sec at peak, brute-force latency would exceed budget.

Alternative considered: IVF-PQ for memory efficiency. Rejected because the recall drop (5-8% vs. HNSW) was meaningful for FAQ retrieval where missing a policy chunk leads to LLM hallucination.


6. Guardrails / Safety API — Trust Pipeline

Every LLM response passed through a 6-stage validation pipeline before reaching the user. This was not optional — for a customer-facing shopping assistant, trust is the product.

Pipeline Stages

| Stage | What It Checks | Tool | Action on Failure |
|-------|----------------|------|-------------------|
| 1. PII Redaction | SSN, credit card, phone, email in output | Amazon Comprehend + regex | Redact and log |
| 2. Price Accuracy | Prices in response match live catalog data | Custom validator (API call) | Replace with correct price |
| 3. Toxicity Filter | Offensive or inappropriate content | Bedrock Guardrails | Block response, return safe fallback |
| 4. Competitor Mention | Names of competitors (Barnes & Noble, etc.) | Keyword filter | Remove mention |
| 5. ASIN Validation | Product IDs mentioned actually exist in catalog | Catalog API batch check | Remove invalid product cards |
| 6. Scope Check | Response stays on topic (manga/Amazon) | Rule-based + LLM classifier | Redirect to on-topic response |

Input-Side Safety

Before the user message even reached the LLM, I ran:

  • PII detection: Replaced emails, phone numbers, addresses with [REDACTED] tokens so PII never entered model logs
  • Prompt injection defense: Pattern matching for "ignore previous instructions", "you are now", encoded/obfuscated attempts — blocked and logged
  • Payload size limit: Messages capped at 2000 characters to prevent abuse
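
A sketch of the input-side checks: regex PII redaction plus a prompt-injection pattern list. The patterns shown are a small illustrative subset, not the production rule set:

```python
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b(?:\+?1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),
    re.compile(r"you are now\b", re.I),
    re.compile(r"system prompt", re.I),
]

MAX_MESSAGE_CHARS = 2000


def sanitize_input(message: str) -> tuple[str, bool]:
    """Return (redacted_message, blocked). PII never reaches the model or its logs."""
    if len(message) > MAX_MESSAGE_CHARS:
        return message[:MAX_MESSAGE_CHARS], True            # oversized payloads are rejected

    if any(p.search(message) for p in INJECTION_PATTERNS):
        return message, True                                 # blocked and logged upstream

    redacted = message
    for pattern in PII_PATTERNS.values():
        redacted = pattern.sub("[REDACTED]", redacted)
    return redacted, False
```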

LLD — Two-Tier Guardrails Mechanics

The 6 stages don't all run at the same point. They're split between inline (per-chunk during streaming) and post-stream (full pipeline on complete response).

flowchart LR
    LLM[("LLM<br/>token chunks")]

    subgraph inline["⚡ Inline checks (per chunk, &lt;1ms)"]
        direction TB
        I1["regex PII<br/>(SSN, CC, phone, email)"]
        I2["keyword<br/>competitor list"]
    end

    FE["🎨 Frontend<br/>(streams to user)"]

    subgraph post["🔒 Post-stream pipeline (sequential, ~100ms)"]
        direction TB
        P1["1. PII Redaction"]
        P2["2. Price Accuracy"]
        P3["3. Toxicity Filter"]
        P4["4. Competitor Mention"]
        P5["5. ASIN Validation"]
        P6["6. Scope Check"]
        P1 --> P2 --> P3 --> P4 --> P5 --> P6
    end

    CF["📢 Correction Frame<br/>(replace_seq → original)"]

    LLM --> inline
    inline --> FE
    LLM -. "after [done]" .-> post
    post -- "issue inline missed" --> CF
    CF --> FE

    classDef llm fill:#e8f5e9,stroke:#2e7d32
    classDef chk fill:#fff3e0,stroke:#e65100
    classDef out fill:#e3f2fd,stroke:#1565c0
    classDef warn fill:#ffebee,stroke:#c62828
    class LLM llm
    class I1,I2,P1,P2,P3,P4,P5,P6 chk
    class FE out
    class CF warn

Each chunk passes through inline checks (~99.5% pass cleanly). After streaming completes, the full 6-stage pipeline runs on the assembled response. Any issue inline didn't catch triggers a Correction Frame referencing the original token's seq number.

Correction frame mechanics:

  • The correction frame includes replace_seq pointing at the original frame's seq number.
  • The frontend replaces the offending text in-place (e.g., wrong price → corrected price).
  • The audit log captures: original text, corrected text, stage that fired, timestamp, customer_id.
  • The rate of correction frames is monitored — a sustained increase is a leading indicator that the LLM is regressing on a guardrail-relevant pattern (or that an attacker is probing).
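
A sketch of how a post-stream finding becomes a correction frame. The replace_seq field follows the mechanics above; the frame's type value, the audit sink, and the field names beyond that are illustrative:

```python
import json
import time


def audit_log(entry: dict) -> None:
    """Placeholder for the durable audit log Legal required (original, correction, stage, timestamp)."""
    print(json.dumps(entry))


def build_correction_frame(original_seq: int, original_text: str, corrected_text: str,
                           stage: str, customer_id: str) -> str:
    """Frame telling the frontend to replace an already-streamed span in place."""
    frame = {
        "type": "correction",
        "replace_seq": original_seq,          # points at the seq of the frame being corrected
        "content": corrected_text,
    }
    audit_log({
        "stage": stage,                       # e.g. "price_accuracy"
        "original": original_text,
        "corrected": corrected_text,
        "customer_id": customer_id,
        "ts": time.time(),
    })
    return json.dumps(frame)
```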

Why inline is regex-only, not ML-based: A small LLM-based inline guardrail would add ~15ms per token chunk → streaming becomes useless. Full analysis: 06-decision-records-perspectives.md, ADR-004.

Design Lens — Two-tier guardrails (the most contested decision)

| Lens | Position |
|------|----------|
| ⚖️ Legal | Reluctant supporter. Conditioned on: (1) full audit log, (2) prices NEVER stream from LLM (cards inject from catalog), (3) wrong-price complaints retrievable in <1 hour |
| 🛡️ Security | Reluctant supporter. Preferred pre-stream but accepted on data showing inline catches 90% and post-stream catches the rest |
| 📊 PM | Strong supporter — streaming is the product |
| 🎨 Frontend | Cautious — correction frames are awkward to render. Backend committed to <0.5% rate |
| ⚙️ SRE | Supporter — correction_frame_emitted rate is a leading indicator |

Full analysis: 00-hld-lld-architecture.md, Decision D-2.


Cross-API Design Tensions

The 6 API types don't exist in isolation. Each pair has a tension that surfaces during incidents or feature requests. These tensions are worth knowing because they drive judgment calls when the obvious answer in one API conflicts with constraints in another.

Tension 1: WebSocket streaming ↔ Guardrails

  • Conflict: Streaming wants tokens out fast. Guardrails want full validation before exposure.
  • Resolution: Two-tier guardrails (D-2). Inline regex during stream; full pipeline after; correction frame for the rare miss.
  • Who owns the tension: Backend + Security + Frontend, jointly.
  • Failure mode if mishandled: Either UX feels slow (Legal/Security wins too hard) or wrong content leaks to users (Product wins too hard).

Tension 2: Internal service fan-out ↔ Latency budget

  • Conflict: More service calls = richer responses. But each call eats budget. P99 budget is 1500ms TTFT and Bedrock owns 60% of it.
  • Resolution: Parallel fan-out with explicit per-service timeouts. Speculative execution starts RAG before intent is known. Each service has a fallback so a slow service degrades gracefully instead of blocking.
  • Who owns the tension: Backend + Product.
  • Failure mode if mishandled: Either responses are too thin (over-aggressive timeouts) or chat feels slow (under-aggressive timeouts).

Tension 3: ML routing ↔ Cost ↔ Quality

  • Conflict: Sonnet is high quality but expensive. Haiku is cheap but lower quality on complex intents. Caching is cheapest but degrades freshness/personalization.
  • Resolution: Cost-aware router — chitchat to templates (free), simple intents to Haiku (cheap), complex to Sonnet (premium). Semantic cache for user-independent queries only.
  • Who owns the tension: ML + Cost + Product.
  • Failure mode if mishandled: Either costs blow up (over-route to Sonnet) or quality regresses (over-route to Haiku/cache).

Tension 4: RAG retrieval precision ↔ Recommendation diversity

  • Conflict: Higher Precision@K means fewer irrelevant chunks. But for recommendation queries, diversity drives add-to-cart rate (a counter-intuitive finding from offline-online correlation analysis).
  • Resolution: Different K and ranker per intent. FAQ uses K=3 with high precision. Recommendation uses K=5 with MMR for diversity.
  • Who owns the tension: ML + Product.
  • Failure mode if mishandled: Either users get wrong info (low precision) or get bored by same-result responses (high precision, low diversity).

Tension 5: Guardrails strictness ↔ User helpfulness

  • Conflict: Stricter guardrails block more (good for safety, bad for helpfulness). Looser guardrails answer more (good for UX, bad for safety).
  • Resolution: Per-intent thresholds. order_tracking and return_request have looser scope checks (users in these flows are committed; redirecting them frustrates). Recommendation has tighter ones (we'd rather say "I don't know" than recommend something off-genre).
  • Who owns the tension: Security + Product + ML.
  • Failure mode if mishandled: Either users hit "I can't help with that" too often or the chatbot recommends counterfeit listings.


Interview Q&A — API Types

Q: Walk me through the different types of APIs you built for this chatbot system.

  • Easy: I built 6 API types: WebSocket for streaming, REST for control actions, internal service APIs for data fetching, ML inference endpoints (SageMaker + Bedrock), vector search for RAG, and a guardrails pipeline. Each had different latency and reliability requirements
  • Medium: The critical design insight was that these aren't independent APIs — they form a dependency chain. A single user message hits the WebSocket API, triggers intent classification (SageMaker), fans out to 2-3 internal service APIs in parallel, calls the embedding model + vector search for RAG, feeds everything to the LLM (Bedrock), then runs through the guardrails pipeline. Managing the latency budget across this chain — allocating milliseconds to each stage to stay under 3 seconds total — was the central architectural challenge
  • Hard: The hardest part was failure isolation across API boundaries. When the reranker SageMaker endpoint timed out, I couldn't fail the entire request. I built graceful degradation at every API boundary: reranker down → fall back to raw cosine similarity, embedding timeout → skip RAG and use product catalog data only, LLM timeout → return cached response for common queries. Each fallback was a deliberate design decision with measured quality tradeoffs, not a generic error handler

Q: Why WebSocket instead of Server-Sent Events or polling?

  • Easy: WebSocket gives bidirectional communication — I needed it for both streaming responses and sending real-time context updates from the frontend (typing indicators, page navigation events)
  • Medium: SSE would have worked for streaming responses but doesn't support upstream messages without a separate channel. Long polling adds latency overhead (connection setup per poll) that compounds at 50K concurrent sessions. WebSocket's persistent connection was the right tradeoff — higher memory per connection but lower latency per message
  • Hard: The real challenge was handling WebSocket at scale. API Gateway has a 125K concurrent connection limit. During load testing I hit 90% of that limit. The solution was ALB sticky sessions for connection distribution, a connection pooling strategy on the client, and requesting a limit increase from AWS. For future scale, the architecture supports multi-region routing via Route 53

Q: How did you handle the fact that prices should never be cached?

  • Easy: Prices were always fetched live from the Pricing Service. Showing a stale price is a legal and trust issue
  • Medium: Everything else — product details, recommendations, promotions, reviews — used ElastiCache Redis with appropriate TTLs (5 min for product details, 15 min for recommendations, 1 hour for reviews). Prices were the explicit exception: zero caching, always real-time
  • Hard: The guardrails pipeline had a Price Accuracy stage that cross-checked any price mentioned in the LLM response against the live Pricing Service API. If the LLM hallucinated a price (which happened in ~2% of responses during early testing), the guardrail replaced it with the correct value before the response reached the user. This caught a class of errors that prompt engineering alone couldn't eliminate