
01. Types of APIs I Built for MangaAssist

"MangaAssist wasn't one API — it was six different API surfaces stitched together behind a single chat widget. WebSocket for streaming, REST for control actions, gRPC for internal service calls, ML inference endpoints, vector search, and a multi-stage guardrails pipeline. Each one had different latency profiles, failure modes, and scaling behavior."


Where This Doc Fits

Each section below ends with a Design Lens sub-section showing which stakeholder forced which choice. If you only have time to skim, read the Design Lens callouts.


API Landscape at a Glance

| # | API Type | Protocol | Purpose | Latency Target |
|---|----------|----------|---------|----------------|
| 1 | WebSocket API | WebSocket (wss://) | Stream LLM tokens to the user in real time | First token < 1.5s |
| 2 | External REST APIs | HTTPS | Chat message, feedback, escalation endpoints | < 200ms (non-LLM) |
| 3 | Internal Downstream APIs | REST / gRPC | Product Catalog, Orders, Returns, Promotions, etc. | < 100-200ms per call |
| 4 | ML Inference APIs | HTTPS (SageMaker + Bedrock) | Intent classifier, embeddings, reranker, LLM generation | 15ms – 1.5s depending on model |
| 5 | Vector Search / RAG API | HTTPS (OpenSearch) | Retrieve knowledge chunks for grounded answers | < 50ms |
| 6 | Guardrails / Safety API | Internal pipeline | PII detection, toxicity filter, price validation, ASIN check | < 100ms total |

1. WebSocket API — Real-Time Streaming

I built the primary user-facing transport on Amazon API Gateway WebSocket. The key insight was that a chatbot generating 200+ tokens feels painfully slow if the user waits for the full response. Streaming token-by-token via WebSocket made responses feel instant even when full generation took 1-2 seconds.

Routes

| Route | Purpose |
|-------|---------|
| $connect | Authenticate user (Amazon session cookie), create session, enforce rate limits |
| $disconnect | Clean up session state, flush analytics events |
| $default | All chat messages — classify intent, orchestrate, stream response back |

Key Design Decisions

  • Auth at connect time: I validated the Amazon session token during $connect so every subsequent message was pre-authenticated, with no per-message auth overhead (see the sketch after this list)
  • HTTPS REST fallback: Some corporate networks and older browsers block WebSocket. I built a parallel REST path (POST /chat/message) that returns the full response in one shot — same orchestrator, different transport
  • Sticky sessions via ALB: WebSocket connections are long-lived. I used Application Load Balancer sticky sessions so reconnects land on the same orchestrator instance, preserving in-memory conversation context
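
A minimal sketch of the $connect path described above, assuming a Lambda-style handler. The helper functions and the chat_sessions table name are illustrative stand-ins, not the production code:

```python
import time
import boto3

dynamodb = boto3.resource("dynamodb")
sessions_table = dynamodb.Table("chat_sessions")     # hypothetical table name


def validate_session_token(token):
    """Placeholder: verify the Amazon session token and return a customer_id (or None)."""
    return "cust_123" if token else None


def check_rate_limit(customer_id):
    """Placeholder: per-customer connect-rate check (production used a token bucket)."""
    return True


def on_connect(event, context):
    """$connect route: authenticate once so every later $default message is pre-authorized."""
    params = event.get("queryStringParameters") or {}
    connection_id = event["requestContext"]["connectionId"]

    customer_id = validate_session_token(params.get("token"))
    if customer_id is None:
        return {"statusCode": 401}                   # rejecting here refuses the WebSocket upgrade
    if not check_rate_limit(customer_id):
        return {"statusCode": 429}

    session_id = params.get("session_id") or f"sess_{connection_id}"
    sessions_table.put_item(Item={
        "session_id": session_id,                    # durable key (survives reconnects)
        "connection_id": connection_id,              # ephemeral, replaced on every reconnect
        "customer_id": customer_id,
        "ttl": int(time.time()) + 24 * 3600,         # matches the 24h session TTL
    })
    return {"statusCode": 200}                       # API Gateway completes the connection
```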

Response Streaming Format

{"type": "token", "content": "Based"}
{"type": "token", "content": " on"}
{"type": "token", "content": " your"}
{"type": "products", "items": [{"asin": "B08X1YRSTR", "title": "Chainsaw Man Vol 1", "price": "$9.99"}]}
{"type": "actions", "buttons": [{"label": "Add to Cart", "type": "add_to_cart", "asin": "B08X1YRSTR"}]}
{"type": "done", "metadata": {"latency_ms": 1842, "intent": "recommendation"}}

Each frame is a self-contained JSON object so the frontend can render progressively — text tokens appear as they generate, then product cards and action buttons drop in at the end.

LLD — Connection State Machine

sequenceDiagram
    participant C as Client
    participant S as Server (APIGW + Orchestrator)
    participant DB as DynamoDB

    Note over C,S: Initial connection
    C->>S: $connect (wss:// + Cognito token)
    activate S
    S->>S: validate token
    S->>S: rate-limit check
    S->>DB: create session
    S-->>C: 101 Switching Protocols
    deactivate S

    Note over C,S: Active conversation
    C->>S: user message ($default)
    S-->>C: frame stream<br/>(token, products, done)

    Note over C,S: Heartbeat (every 30s)
    C->>S: ping
    S-->>C: pong

    Note over C,S: 10 min idle timeout
    S->>S: idle disconnect
    S->>DB: session persists (TTL 24h)
    S-->>C: $disconnect

    Note over C,S: Reconnect with same session_id
    C->>S: $connect (resume, session_id)
    S->>DB: load session
    DB-->>S: turns + summary
    S-->>C: context restored

Key LLD invariants:

  • Connection identity ≠ session identity. The session_id is durable in DynamoDB; the WebSocket connection_id is ephemeral.
  • Sticky sessions (ALB) keep reconnects on the same orchestrator task when possible — but the session is recoverable from DynamoDB if the task dies.
  • See Frame Schema in 00-hld-lld-architecture.md, Section 2.3 for the full frame discriminated-union.

Design Lens — WebSocket as primary transport

| Lens | Position |
|------|----------|
| 🎨 Frontend | Drove the REST fallback requirement (corporate networks block WebSocket) |
| 📊 PM | Drove streaming itself (perceived-quality A/B data) |
| 🛡️ Security | Liked auth-once-at-connect; demanded origin validation + token expiry handling on long sessions |
| ⚙️ SRE | Cautious — WebSocket connections are stateful. Required deploy-time connection draining (60s grace) |
| 💰 Cost | Modeled the per-connection-minute cost; acceptable at ~$0.0008/conversation |

Full multi-stakeholder analysis: 00-hld-lld-architecture.md, Decision D-1.


2. External REST APIs — Control Plane

Three REST endpoints powered the non-streaming interactions:

POST /chat/message

The main chat endpoint (used as fallback when WebSocket is unavailable).

// Request
{
  "session_id": "sess_abc123",
  "message": "Recommend something like Naruto",
  "page_context": {
    "current_asin": null,
    "store_section": "manga-home",
    "cart_asins": ["B09XYZ"],
    "url": "/stores/page/jp-manga"
  }
}

// Response
{
  "session_id": "sess_abc123",
  "response_id": "resp_def456",
  "response_text": "If you loved Naruto, you'll enjoy these...",
  "products": [...],
  "actions": [...],
  "follow_up_suggestions": ["Tell me about the art style", "Is there a box set?"],
  "metadata": {
    "intent": "recommendation",
    "latency_ms": 1842,
    "model": "claude-3-sonnet",
    "sources": ["recommendation_engine", "product_catalog"]
  }
}

POST /chat/feedback

{
  "response_id": "resp_def456",
  "feedback": "thumbs_up",
  "comment": null
}

POST /chat/escalate

{
  "session_id": "sess_abc123",
  "reason": "user_requested",
  "summary": "User wants to return damaged Naruto Vol 5, order #112-3456789"
}

Rate Limiting

| User Type | Limit | Enforcement |
|-----------|-------|-------------|
| Authenticated | 30 messages/min | Token bucket per customer ID |
| Guest | 10 messages/min | Token bucket per session ID |
| Global LLM path | Configurable cap | Protects Bedrock quota from abuse |

Rate limiting was enforced at the API Gateway level using usage plans + Lambda authorizer. Exceeding the limit returns 429 Too Many Requests with a human-friendly message.
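
For illustration, a minimal in-process sketch of the token-bucket math behind the 30 messages/min limit; the real enforcement lived in the Lambda authorizer with shared state rather than a per-process dict, so treat this as the algorithm, not the deployment:

```python
import time
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    capacity: float = 30.0                 # burst size: 30 messages
    refill_per_sec: float = 30.0 / 60.0    # steady rate: 30 messages/minute
    tokens: float = 30.0                   # start full
    last_refill: float = field(default_factory=time.monotonic)

    def allow(self) -> bool:
        """Return True if a message may pass, consuming one token."""
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                       # caller maps this to 429 Too Many Requests


buckets: dict[str, TokenBucket] = {}       # keyed by customer ID or session ID


def allow_message(key: str) -> bool:
    return buckets.setdefault(key, TokenBucket()).allow()
```

Unlike a fixed window, the bucket refills continuously, so the boundary-burst problem Security flagged (two full windows back to back) cannot occur.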

LLD — Idempotency on Feedback and Escalation

The non-streaming endpoints have idempotency requirements that streaming doesn't:

  • POST /chat/feedback is keyed on (response_id, customer_id). Duplicate clicks (network retry, double-tap on mobile) must not produce double feedback. Implementation: DynamoDB conditional write if not exists.
  • POST /chat/escalate is keyed on (session_id, escalation_reason_hash) within a 5-minute window. Prevents accidentally creating multiple human-handoff tickets.
  • POST /chat/message (REST fallback) uses an Idempotency-Key header. Required for mobile clients where flaky networks retry. Server caches the response for 5 min keyed on (customer_id, idempotency_key).
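
A sketch of the feedback guard using a boto3 conditional write, assuming a chat_feedback table keyed on (response_id, customer_id); the table and attribute names are illustrative:

```python
import boto3
from botocore.exceptions import ClientError

feedback_table = boto3.resource("dynamodb").Table("chat_feedback")   # hypothetical table


def record_feedback(response_id: str, customer_id: str, feedback: str) -> bool:
    """Write feedback once; duplicate clicks and network retries become no-ops."""
    try:
        feedback_table.put_item(
            Item={
                "response_id": response_id,      # partition key
                "customer_id": customer_id,      # sort key
                "feedback": feedback,
            },
            # The write only succeeds if no item already exists for this (response_id, customer_id)
            ConditionExpression="attribute_not_exists(response_id) AND attribute_not_exists(customer_id)",
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False                         # duplicate: already recorded, safe to ignore
        raise
```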

Design Lens — Token bucket vs. fixed window rate limiting

| Lens | Position |
|------|----------|
| 🛡️ Security | Drove the choice — fixed window has the boundary-burst problem (30 msgs at 11:59:59 + 30 at 12:00:01) |
| 🔧 Backend | Token bucket is straightforward in Redis with INCR + TTL |
| ⚙️ SRE | Wanted a global LLM-path cap as a safety valve independent of per-user limits |
| 💰 Cost | Rate limiting is a Bedrock-cost-protection mechanism, not just abuse prevention |
| 📊 PM | Negotiated the 30/min number — calibrated against P99 of legitimate beta usage (15/min) for 2x headroom |

3. Internal Downstream APIs — Service Integrations

The orchestrator called 10 different backend services via internal REST and gRPC APIs. Each integration had its own contract, latency profile, and failure handling.

| Service | Protocol | Data Returned | Latency | Auth Required |
|---------|----------|---------------|---------|---------------|
| Product Catalog | gRPC | ASIN, title, author, format, language, page count | < 100ms | No |
| Pricing Service | REST | Current price, list price, discounts, Prime pricing | < 50ms | No |
| Inventory Service | REST | Stock status, restock date, warehouse location | < 50ms | No |
| Reviews & Ratings | REST | Avg rating, review count, top review snippets | < 100ms | No |
| Recommendation Engine | REST (Personalize) | Ranked list of recommended ASINs | < 200ms | Optional |
| Customer Profile | REST | Name, Prime status, locale, preferred language | < 50ms | Yes |
| Order Service | REST | Order status, items, dates | < 200ms | Yes |
| Shipping & Delivery | REST | Carrier, tracking, ETA | < 200ms | Yes |
| Returns Service | REST | Eligibility, return window, refund status | < 200ms | Yes |
| Promotions Service | REST | Active deals, coupons, bundle offers | < 100ms | No |

Key Design Decisions

  • Parallel fan-out: After intent classification, I dispatched downstream calls in parallel rather than sequentially (see the sketch after this list). This cut 300ms+ from the critical path (3 calls at 100ms each = 300ms sequential vs 100ms parallel)
  • Price is never cached: Product prices were always fetched live. Showing a stale price is a legal and trust issue. Every other data type used ElastiCache with appropriate TTLs
  • Auth-gated vs. open: Product discovery, FAQ, and recommendations worked for guest users. Order tracking, returns, and profile data required authentication. The orchestrator enforced this boundary
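
A sketch of the parallel fan-out with per-service timeouts and fallbacks. The downstream client functions here are hypothetical stand-ins for the real gRPC/REST wrappers:

```python
import asyncio


# Hypothetical downstream clients, standing in for the real gRPC/REST wrappers.
async def fetch_catalog(asin: str): return {"asin": asin, "title": "Chainsaw Man Vol 1"}
async def fetch_pricing(asin: str): return {"price": "$9.99"}
async def fetch_reviews(asin: str): return {"avg_rating": 4.8}


async def call_with_fallback(coro, timeout_s: float, fallback):
    """Bound every downstream call by its timeout; degrade instead of failing the turn."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        return fallback
    except Exception:
        return fallback                    # connection errors degrade the same way


async def gather_context(asin: str) -> dict:
    # All three calls start at once, so wall-clock cost is the slowest call, not the sum.
    catalog, pricing, reviews = await asyncio.gather(
        call_with_fallback(fetch_catalog(asin), 0.2, fallback=None),   # skip product card
        call_with_fallback(fetch_pricing(asin), 0.2, fallback=None),   # strip price mentions
        call_with_fallback(fetch_reviews(asin), 0.3, fallback=None),   # skip review snippet
    )
    return {"catalog": catalog, "pricing": pricing, "reviews": reviews}


# Usage: asyncio.run(gather_context("B08X1YRSTR"))
```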

LLD — Per-Service Failure Policy

Each downstream service has an explicit failure policy. The policy is hard-coded, not configured at runtime, because failure semantics are part of correctness — not tunable parameters.

| Service | Timeout | Retry | Circuit Breaker | Fallback Behavior |
|---------|---------|-------|-----------------|-------------------|
| Catalog (gRPC) | 200ms | | Open at 30% error in 60s | Skip product card; return prose-only response |
| Pricing | 200ms | 0× (NOT idempotent semantics for pricing freshness) | Open at 20% | Strip all price mentions; "check product page" |
| Inventory | 150ms | | Open at 30% | Show product without availability badge |
| Reviews | 300ms | | Open at 50% | Skip review snippet |
| Recommendation Engine | 500ms | | Open at 40% | Fall back to vector-search top-N |
| Customer Profile | 200ms | | Open at 30% | Treat user as guest tier |
| Order Service | 500ms | | Open at 30% | "I couldn't load your order — let me connect you with support" |
| Returns | 500ms | | Open at 30% | Direct user to returns center URL |
| Promotions | 200ms | | Open at 50% | Don't surface promotions; chat continues |

Why "0 retries" on Pricing: A retry implies the first call may succeed at the second attempt. For pricing, the first call's latency is also a freshness signal — a slow pricing call may be returning stale data from a degraded path. We'd rather strip prices than risk stale ones.

Design Lens — gRPC for Catalog, REST for everything else

| Lens | Position |
|------|----------|
| 🔧 Backend | "Catalog is the highest-volume call. gRPC saves 25% serialization CPU at our scale." |
| 💰 Cost | Endorsed — direct $X/month savings on serialization + bandwidth |
| ⚙️ SRE | Concerned about two debugging tools; required X-Ray instrumentation parity |
| 📊 PM | Neutral — invisible to user |
| Hidden driver | The Catalog team already had a gRPC service. Adopting their existing contract was easier than asking them to build a REST facade. Decisions like this are bent by what other teams have already shipped. |

Full multi-stakeholder analysis: 00-hld-lld-architecture.md, Decision D-8.


4. ML Inference APIs — The Intelligence Layer

Every user message triggered a chain of 4 ML model invocations — not one API call, but an orchestrated pipeline.

| Model | Hosting | Endpoint Type / Instance | Avg Latency | P99 Latency | Purpose |
|-------|---------|--------------------------|-------------|-------------|---------|
| DistilBERT (fine-tuned) | SageMaker Real-time Endpoint | ml.inf1.xlarge (Inferentia) | ~15ms | ~50ms | Intent classification (10 intents) |
| Titan Embeddings V2 | Amazon Bedrock | On-demand | ~30ms | ~80ms | Query embedding for RAG retrieval |
| Cross-Encoder Reranker | SageMaker Real-time Endpoint | ml.g4dn.xlarge | ~50ms | ~120ms | Rerank top-50 RAG results to top-5 |
| Claude 3.5 Sonnet | Amazon Bedrock | On-demand / Provisioned | ~500ms (TTFT) | ~1.5s (TTFT) | Natural language response generation |

Two-Stage Intent Classification

I designed the intent classifier as a two-stage system to minimize cost:

  1. Stage 1 — Rule-based pre-filter: Regex patterns caught high-confidence intents cheaply ("where is my order" → order_tracking at 0.95 confidence). Handled ~40% of messages with zero ML inference cost
  2. Stage 2 — DistilBERT classifier: Only fired when Stage 1 confidence < 0.8. Fine-tuned on 55K labeled Amazon customer service conversations + manga-specific training data
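
A sketch of the two-stage flow: the regex pre-filter answers high-confidence intents directly, and the DistilBERT endpoint is invoked only otherwise. The patterns and the classify_with_distilbert wrapper are illustrative, not the production rule set:

```python
import re

# Stage 1: cheap, high-precision patterns mapped to (intent, confidence).
RULES = [
    (re.compile(r"\bwhere('?s| is) my order\b", re.I), ("order_tracking", 0.95)),
    (re.compile(r"\b(return|refund)\b.*\b(order|volume|book)\b", re.I), ("return_request", 0.90)),
    (re.compile(r"^(hi|hello|thanks|thank you)\b", re.I), ("chitchat", 0.98)),
]

CONFIDENCE_THRESHOLD = 0.8


def classify_with_distilbert(message: str) -> tuple[str, float]:
    """Placeholder for the SageMaker real-time endpoint call."""
    return "recommendation", 0.85


def classify_intent(message: str) -> tuple[str, float]:
    # Stage 1: rule-based pre-filter (about 40% of traffic stops here, zero ML inference cost).
    for pattern, (intent, confidence) in RULES:
        if pattern.search(message) and confidence >= CONFIDENCE_THRESHOLD:
            return intent, confidence
    # Stage 2: fine-tuned DistilBERT, only when the rules can't answer confidently.
    return classify_with_distilbert(message)
```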

Intelligent Model Routing

Not every message needed Claude 3.5 Sonnet. I built a cost-aware router:

| Intent | Model Used | Why |
|--------|------------|-----|
| Chitchat (hello, thanks) | Template responses | No LLM needed at all |
| Order status formatting | Claude Haiku | Simple formatting task; 10x cheaper |
| Product recommendations | Claude 3.5 Sonnet | Complex reasoning with product context |
| Multi-turn Q&A | Claude 3.5 Sonnet | Needs conversation history understanding |

This routing saved ~$18,000/month in LLM costs.
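
A sketch of that routing table as code; the model labels are placeholders rather than exact Bedrock model IDs:

```python
# Map each intent to the cheapest model that meets its quality bar.
ROUTES = {
    "chitchat": "template",            # canned responses, no LLM call
    "order_status": "haiku",           # simple formatting task, ~10x cheaper
    "recommendation": "sonnet",        # complex reasoning over product context
    "multi_turn_qa": "sonnet",         # needs conversation-history understanding
}


def route_model(intent: str) -> str:
    """Unknown intents default to the premium model: quality beats cost on tail intents."""
    return ROUTES.get(intent, "sonnet")
```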

LLD — Latency Budget Allocation

The 1.5s P99 first-token target is decomposed across the inference chain:

Total P99 budget: 1500ms TTFT
├── Edge + auth + validate                    50ms   (3%)
├── Stage 1 regex / Stage 2 DistilBERT       50ms   (3%)
├── Embedding (Titan)                          80ms   (5%)
├── KNN search (OpenSearch)                    50ms   (3%)
├── Reranker (Cross-Encoder)                  120ms   (8%)
├── Service fan-out (parallel, max wait)     200ms  (13%)
├── Prompt assembly                            10ms   (1%)
├── Bedrock TTFT                              900ms  (60%)
├── Inline guardrail (per chunk)               <1ms   (–)
└── Headroom                                   40ms   (3%)

Critical observation: Bedrock TTFT dominates. 60% of the budget is in one component we don't control. This shapes everything else — every other component must come in under its allocation, because LLM TTFT can spike unexpectedly and there's no way to shrink it from the outside.

Speculative execution recovers ~150ms on average: Embedding + KNN run in parallel with intent classification, not after it. ~70% of intents need RAG anyway. Wasted compute when intent doesn't need RAG: ~30ms. Net gain: 150-300ms saved on the 70% majority case.
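
A sketch of the speculative-execution pattern with placeholder pipeline stages: retrieval starts alongside intent classification and is simply discarded when the intent turns out not to need RAG:

```python
import asyncio


# Hypothetical stand-ins for the real pipeline stages.
async def embed_and_search(message): await asyncio.sleep(0.08); return ["chunk-1", "chunk-2"]
async def classify_intent_async(message): await asyncio.sleep(0.05); return "recommendation", 0.9
def intent_needs_rag(intent): return intent not in {"chitchat", "order_tracking"}
def build_prompt(intent, message, chunks): return {"intent": intent, "message": message, "chunks": chunks}


async def handle_turn(message: str):
    # Start retrieval speculatively rather than waiting for the intent to be known.
    retrieval_task = asyncio.create_task(embed_and_search(message))   # Titan embed + KNN
    intent, _confidence = await classify_intent_async(message)

    if intent_needs_rag(intent):                 # ~70% of traffic
        chunks = await retrieval_task            # usually already finished by now
    else:                                        # ~30% of traffic: the speculative work is wasted
        retrieval_task.cancel()
        chunks = []

    return build_prompt(intent, message, chunks)


# Usage: asyncio.run(handle_turn("Recommend something like Naruto"))
```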

Design Lens — Provisioned Throughput vs. on-demand for Bedrock

Lens Position
💰 Cost Drove the decision — 35% saving with tiered Provisioned + on-demand
⚙️ SRE Strong supporter — eliminates throttle risk for baseline traffic
🤖 ML Cautious — routing splits eval surface; needed offline evaluation showing Haiku is acceptable for routed-down intents
📊 PM Required quality evaluation showing no measurable difference for routed categories
⚖️ Legal Neutral — both models in existing Bedrock agreement

Full analysis: 00-hld-lld-architecture.md, Decision D-4.


5. Vector Search / RAG API — Knowledge Retrieval

The RAG pipeline powered FAQ, policy, and product knowledge answers. I used OpenSearch Serverless with HNSW vector indexing.

Retrieval Flow

  1. Embed query → Titan Embeddings V2 (1024-dim vector)
  2. KNN search → OpenSearch returns top-10 candidate chunks
  3. Rerank → Cross-Encoder Reranker scores all 10, returns top-3
  4. Augment prompt → Top-3 chunks injected into the LLM prompt as grounding context
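
A sketch of the KNN step using opensearch-py, assuming an index named manga_knowledge with the schema shown in the next subsection; the endpoint, auth/signing, and the upstream embedding call are omitted or illustrative:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "search.example.internal", "port": 443}], use_ssl=True)


def retrieve_chunks(query_embedding: list[float], k: int = 10) -> list[dict]:
    """Top-k approximate nearest neighbours from the HNSW index (reranked afterwards)."""
    body = {
        "size": k,
        "query": {
            "knn": {
                "embedding": {                  # matches the index schema's embedding field
                    "vector": query_embedding,  # 1024-dim Titan Embeddings V2 vector
                    "k": k,
                }
            }
        },
        "_source": ["chunk_id", "content", "source_type", "asin"],
    }
    response = client.search(index="manga_knowledge", body=body)   # index name is illustrative
    return [hit["_source"] for hit in response["hits"]["hits"]]
```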

Index Schema

{
  "chunk_id": "faq-return-policy-003",
  "content": "Manga volumes can be returned within 30 days of delivery if they are in original condition.",
  "source_type": "faq",
  "asin": null,
  "category": "manga",
  "embedding": [0.012, -0.034, ...],
  "last_updated": "2025-12-01"
}

Data Sources Indexed

  • Product descriptions (by ASIN)
  • FAQ pages (return policy, shipping info, payment options)
  • Editorial content (manga guides, genre explainers)
  • Review summaries (pre-generated sentiment snippets)

Chunking strategy: 512 tokens with 50-token overlap. This balance kept chunks large enough for context but small enough for precise retrieval.
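
A sketch of the sliding-window chunking (512 tokens, 50-token overlap), using whitespace tokens as a stand-in for the real tokenizer:

```python
def chunk_text(text: str, chunk_tokens: int = 512, overlap_tokens: int = 50) -> list[str]:
    """Sliding window: each chunk shares its last 50 tokens with the next chunk's start."""
    tokens = text.split()                       # stand-in; production used the model tokenizer
    step = chunk_tokens - overlap_tokens        # 462-token stride
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_tokens]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_tokens >= len(tokens):
            break                               # the last window already reached the end
    return chunks
```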

LLD — Index Update Pipeline

Knowledge sources update at very different cadences. The index must absorb updates without thrashing:

| Source | Update Cadence | Trigger | SLA |
|--------|----------------|---------|-----|
| Product descriptions | Daily | Catalog change-data-capture stream | New ASIN searchable < 2 hours |
| FAQ / Policy | Weekly | Manual edit + approval workflow | Updated chunk searchable < 30 minutes |
| Editorial content | Per release | Editorial CMS webhook | Same-day |
| Review summaries | Hourly | Sentiment pipeline rerun | Next hourly cycle |

Why CDC for product descriptions: Catalog has millions of items. Polling for changes is wasteful; the stream lets us re-embed only the deltas. Each change event triggers: (1) re-embed via Titan, (2) update OpenSearch document atomically, (3) emit a metric.

Atomic update strategy: OpenSearch supports partial document updates. We update the embedding and the metadata in a single request. This avoids the failure mode where the metadata is fresh but the embedding is stale (or vice versa) — which would silently degrade retrieval quality.
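
A sketch of one CDC-driven update as a single partial-update request, so content, embedding, and last_updated change together; the index name and the Titan wrapper are illustrative:

```python
from datetime import date


def embed_with_titan(text: str) -> list[float]:
    """Placeholder for the Titan Embeddings V2 call (returns a 1024-dim vector)."""
    return [0.0] * 1024


def apply_catalog_change(client, chunk_id: str, new_description: str) -> None:
    """Handle one change-data-capture event for a product description chunk."""
    embedding = embed_with_titan(new_description)

    # Single partial-update request: content, embedding, and metadata cannot drift apart.
    client.update(
        index="manga_knowledge",                         # illustrative index name
        id=chunk_id,
        body={
            "doc": {
                "content": new_description,
                "embedding": embedding,
                "last_updated": date.today().isoformat(),
            }
        },
    )
    # emit_metric("knowledge_index.chunk_updated", 1)    # placeholder for step (3), the metric emit
```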

Design Lens — HNSW vs. flat / IVF index

| Lens | Position |
|------|----------|
| 🤖 ML | Drove the choice — HNSW gives best recall/latency trade-off for our query volume |
| 🔧 Backend | OpenSearch HNSW is managed; no custom index code |
| ⚙️ SRE | Slight concern about HNSW memory footprint; right-sized cluster for 18K vectors @ 1024-dim |
| 💰 Cost | OpenSearch Serverless OCU pricing was modeled and acceptable |
| 📊 PM | Neutral — outcome measured by retrieval quality, not algorithm choice |

Alternative considered: flat brute-force search. Rejected because at 18K+ vectors and 50K+ queries/sec at peak, brute-force latency would exceed budget.

Alternative considered: IVF-PQ for memory efficiency. Rejected because the recall drop (5-8% vs. HNSW) was meaningful for FAQ retrieval where missing a policy chunk leads to LLM hallucination.


6. Guardrails / Safety API — Trust Pipeline

Every LLM response passed through a 6-stage validation pipeline before reaching the user. This was not optional — for a customer-facing shopping assistant, trust is the product.

Pipeline Stages

| Stage | What It Checks | Tool | Action on Failure |
|-------|----------------|------|-------------------|
| 1. PII Redaction | SSN, credit card, phone, email in output | Amazon Comprehend + regex | Redact and log |
| 2. Price Accuracy | Prices in response match live catalog data | Custom validator (API call) | Replace with correct price |
| 3. Toxicity Filter | Offensive or inappropriate content | Bedrock Guardrails | Block response, return safe fallback |
| 4. Competitor Mention | Names of competitors (Barnes & Noble, etc.) | Keyword filter | Remove mention |
| 5. ASIN Validation | Product IDs mentioned actually exist in catalog | Catalog API batch check | Remove invalid product cards |
| 6. Scope Check | Response stays on topic (manga/Amazon) | Rule-based + LLM classifier | Redirect to on-topic response |

Input-Side Safety

Before the user message even reached the LLM, I ran:

  • PII detection: Replaced emails, phone numbers, addresses with [REDACTED] tokens so PII never entered model logs
  • Prompt injection defense: Pattern matching for "ignore previous instructions", "you are now", encoded/obfuscated attempts — blocked and logged
  • Payload size limit: Messages capped at 2000 characters to prevent abuse
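
A sketch of the input-side checks: regex PII redaction plus a prompt-injection pattern list. The patterns shown are a small illustrative subset, not the production rule set:

```python
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b(?:\+?1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),
    re.compile(r"you are now\b", re.I),
    re.compile(r"system prompt", re.I),
]

MAX_MESSAGE_CHARS = 2000


def sanitize_input(message: str) -> tuple[str, bool]:
    """Return (redacted_message, blocked). PII never reaches the model or its logs."""
    if len(message) > MAX_MESSAGE_CHARS:
        return message[:MAX_MESSAGE_CHARS], True            # oversized payloads are rejected

    if any(p.search(message) for p in INJECTION_PATTERNS):
        return message, True                                 # blocked and logged upstream

    redacted = message
    for pattern in PII_PATTERNS.values():
        redacted = pattern.sub("[REDACTED]", redacted)
    return redacted, False
```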

LLD — Two-Tier Guardrails Mechanics

The 6 stages don't all run at the same point. They're split between inline (per-chunk during streaming) and post-stream (full pipeline on complete response).

flowchart LR
    LLM[("LLM<br/>token chunks")]

    subgraph inline["⚡ Inline checks (per chunk, &lt;1ms)"]
        direction TB
        I1["regex PII<br/>(SSN, CC, phone, email)"]
        I2["keyword<br/>competitor list"]
    end

    FE["🎨 Frontend<br/>(streams to user)"]

    subgraph post["🔒 Post-stream pipeline (sequential, ~100ms)"]
        direction TB
        P1["1. PII Redaction"]
        P2["2. Price Accuracy"]
        P3["3. Toxicity Filter"]
        P4["4. Competitor Mention"]
        P5["5. ASIN Validation"]
        P6["6. Scope Check"]
        P1 --> P2 --> P3 --> P4 --> P5 --> P6
    end

    CF["📢 Correction Frame<br/>(replace_seq → original)"]

    LLM --> inline
    inline --> FE
    LLM -. "after [done]" .-> post
    post -- "issue inline missed" --> CF
    CF --> FE

    classDef llm fill:#e8f5e9,stroke:#2e7d32
    classDef chk fill:#fff3e0,stroke:#e65100
    classDef out fill:#e3f2fd,stroke:#1565c0
    classDef warn fill:#ffebee,stroke:#c62828
    class LLM llm
    class I1,I2,P1,P2,P3,P4,P5,P6 chk
    class FE out
    class CF warn

Each chunk passes through inline checks (~99.5% pass cleanly). After streaming completes, the full 6-stage pipeline runs on the assembled response. Any issue inline didn't catch triggers a Correction Frame referencing the original token's seq number.

Correction frame mechanics:

  • The correction frame includes replace_seq pointing at the original frame's seq number.
  • The frontend replaces the offending text in-place (e.g., wrong price → corrected price).
  • The audit log captures: original text, corrected text, stage that fired, timestamp, customer_id.
  • The rate of correction frames is monitored — a sustained increase is a leading indicator that the LLM is regressing on a guardrail-relevant pattern (or that an attacker is probing).
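
A sketch of how a post-stream finding becomes a correction frame. The replace_seq field follows the mechanics above; the frame's type value, the audit sink, and the field names beyond that are illustrative:

```python
import json
import time


def audit_log(entry: dict) -> None:
    """Placeholder for the durable audit log Legal required (original, correction, stage, timestamp)."""
    print(json.dumps(entry))


def build_correction_frame(original_seq: int, original_text: str, corrected_text: str,
                           stage: str, customer_id: str) -> str:
    """Frame telling the frontend to replace an already-streamed span in place."""
    frame = {
        "type": "correction",
        "replace_seq": original_seq,          # points at the seq of the frame being corrected
        "content": corrected_text,
    }
    audit_log({
        "stage": stage,                       # e.g. "price_accuracy"
        "original": original_text,
        "corrected": corrected_text,
        "customer_id": customer_id,
        "ts": time.time(),
    })
    return json.dumps(frame)
```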

Why inline is regex-only, not ML-based: A small LLM-based inline guardrail would add ~15ms per token chunk → streaming becomes useless. Full analysis: 06-decision-records-perspectives.md, ADR-004.

Design Lens — Two-tier guardrails (the most contested decision)

| Lens | Position |
|------|----------|
| ⚖️ Legal | Reluctant supporter. Conditioned on: (1) full audit log, (2) prices NEVER stream from LLM (cards inject from catalog), (3) wrong-price complaints retrievable in <1 hour |
| 🛡️ Security | Reluctant supporter. Preferred pre-stream but accepted on data showing inline catches 90% and post-stream catches the rest |
| 📊 PM | Strong supporter — streaming is the product |
| 🎨 Frontend | Cautious — correction frames are awkward to render. Backend committed to <0.5% rate |
| ⚙️ SRE | Supporter — correction_frame_emitted rate is a leading indicator |

Full analysis: 00-hld-lld-architecture.md, Decision D-2.


Cross-API Design Tensions

The 6 API types don't exist in isolation. Each pair has a tension that surfaces during incidents or feature requests. These tensions are worth knowing because they drive judgment calls when the obvious answer in one API conflicts with constraints in another.

Tension 1: WebSocket streaming ↔ Guardrails

  • Conflict: Streaming wants tokens out fast. Guardrails want full validation before exposure.
  • Resolution: Two-tier guardrails (D-2). Inline regex during stream; full pipeline after; correction frame for the rare miss.
  • Who owns the tension: Backend + Security + Frontend, jointly.
  • Failure mode if mishandled: Either UX feels slow (Legal/Security wins too hard) or wrong content leaks to users (Product wins too hard).

Tension 2: Internal service fan-out ↔ Latency budget

  • Conflict: More service calls = richer responses. But each call eats budget. P99 budget is 1500ms TTFT and Bedrock owns 60% of it.
  • Resolution: Parallel fan-out with explicit per-service timeouts. Speculative execution starts RAG before intent is known. Each service has a fallback so a slow service degrades gracefully instead of blocking.
  • Who owns the tension: Backend + Product.
  • Failure mode if mishandled: Either responses are too thin (over-aggressive timeouts) or chat feels slow (under-aggressive timeouts).

Tension 3: ML routing ↔ Cost ↔ Quality

  • Conflict: Sonnet is high quality but expensive. Haiku is cheap but lower quality on complex intents. Caching is cheapest but degrades freshness/personalization.
  • Resolution: Cost-aware router — chitchat to templates (free), simple intents to Haiku (cheap), complex to Sonnet (premium). Semantic cache for user-independent queries only.
  • Who owns the tension: ML + Cost + Product.
  • Failure mode if mishandled: Either costs blow up (over-route to Sonnet) or quality regresses (over-route to Haiku/cache).

Tension 4: RAG retrieval precision ↔ Recommendation diversity

  • Conflict: Higher Precision@K means fewer irrelevant chunks. But for recommendation queries, diversity drives add-to-cart rate (a counter-intuitive finding from offline-online correlation analysis).
  • Resolution: Different K and ranker per intent. FAQ uses K=3 with high precision. Recommendation uses K=5 with MMR for diversity.
  • Who owns the tension: ML + Product.
  • Failure mode if mishandled: Either users get wrong info (low precision) or get bored by same-result responses (high precision, low diversity).

Tension 5: Guardrails strictness ↔ User helpfulness

  • Conflict: Stricter guardrails block more (good for safety, bad for helpfulness). Looser guardrails answer more (good for UX, bad for safety).
  • Resolution: Per-intent thresholds. order_tracking and return_request have looser scope checks (users in these flows are committed; redirecting them frustrates). Recommendation has tighter ones (we'd rather say "I don't know" than recommend something off-genre).
  • Who owns the tension: Security + Product + ML.
  • Failure mode if mishandled: Either users hit "I can't help with that" too often or the chatbot recommends counterfeit listings.


Interview Q&A — API Types

Q: Walk me through the different types of APIs you built for this chatbot system.

  • Easy: I built 6 API types: WebSocket for streaming, REST for control actions, internal service APIs for data fetching, ML inference endpoints (SageMaker + Bedrock), vector search for RAG, and a guardrails pipeline. Each had different latency and reliability requirements
  • Medium: The critical design insight was that these aren't independent APIs — they form a dependency chain. A single user message hits the WebSocket API, triggers intent classification (SageMaker), fans out to 2-3 internal service APIs in parallel, calls the embedding model + vector search for RAG, feeds everything to the LLM (Bedrock), then runs through the guardrails pipeline. Managing the latency budget across this chain — allocating milliseconds to each stage to stay under 3 seconds total — was the central architectural challenge
  • Hard: The hardest part was failure isolation across API boundaries. When the reranker SageMaker endpoint timed out, I couldn't fail the entire request. I built graceful degradation at every API boundary: reranker down → fall back to raw cosine similarity, embedding timeout → skip RAG and use product catalog data only, LLM timeout → return cached response for common queries. Each fallback was a deliberate design decision with measured quality tradeoffs, not a generic error handler

Q: Why WebSocket instead of Server-Sent Events or polling?

  • Easy: WebSocket gives bidirectional communication — I needed it for both streaming responses and sending real-time context updates from the frontend (typing indicators, page navigation events)
  • Medium: SSE would have worked for streaming responses but doesn't support upstream messages without a separate channel. Long polling adds latency overhead (connection setup per poll) that compounds at 50K concurrent sessions. WebSocket's persistent connection was the right tradeoff — higher memory per connection but lower latency per message
  • Hard: The real challenge was handling WebSocket at scale. API Gateway has a 125K concurrent connection limit. During load testing I hit 90% of that limit. The solution was ALB sticky sessions for connection distribution, a connection pooling strategy on the client, and requesting a limit increase from AWS. For future scale, the architecture supports multi-region routing via Route 53

Q: How did you handle the fact that prices should never be cached?

  • Easy: Prices were always fetched live from the Pricing Service. Showing a stale price is a legal and trust issue
  • Medium: Everything else — product details, recommendations, promotions, reviews — used ElastiCache Redis with appropriate TTLs (5 min for product details, 15 min for recommendations, 1 hour for reviews). Prices were the explicit exception: zero caching, always real-time
  • Hard: The guardrails pipeline had a Price Accuracy stage that cross-checked any price mentioned in the LLM response against the live Pricing Service API. If the LLM hallucinated a price (which happened in ~2% of responses during early testing), the guardrail replaced it with the correct value before the response reached the user. This caught a class of errors that prompt engineering alone couldn't eliminate