00. HLD & LLD — MangaAssist Architecture (Foundation)
"Before any of the API, testing, or scale documents make sense, you need the HLD/LLD scaffolding. This file is the architectural anchor — the same reference an SRE pulls up during an incident, a security engineer reads before a threat-model review, and a new joiner reads on day one. Everything in the other documents references back to here."
How to Read This Document
This document is split into three parts:
| Part | Audience | What's Inside |
|---|---|---|
| Part 1: HLD | Anyone joining the system, interviewers, architects | Context, layered architecture, physical AWS topology, NFRs, capacity model, failure domains |
| Part 2: LLD | Engineers writing code, on-call SREs, security reviewers | Sequence diagrams, state machines, schemas, indexes, concurrency model, observability schema |
| Part 3: Decision Lens | Anyone reviewing or extending the design | Each major decision viewed through 8 stakeholder perspectives — what trade-off was accepted, by whom, against what alternative |
If you want the what, read Part 1. If you want the how, read Part 2. If you want the why, read Part 3.
Part 1: High-Level Design (HLD)
1.1 System Context — Where MangaAssist Lives
flowchart TB
User["👤 Amazon Customer (Web / Mobile)<br/>Browses manga store → opens chat widget → asks Q"]
MA["🧩 <b>MangaAssist</b> (this system)<br/>Conversational shopping assistant for the manga vertical<br/>Inputs: NL + page context + auth state<br/>Outputs: streamed text + product cards + actions"]
User -- "WebSocket / HTTPS" --> MA
subgraph downstream["Downstream domains (owned by 6+ teams across Amazon)"]
Catalog["📚 Catalog<br/>Pricing<br/>Inventory"]
Orders["📦 Orders<br/>Returns<br/>Shipping"]
ML["🤖 ML Platform<br/>Bedrock + SageMaker"]
Promo["🎯 Promotions<br/>Reviews<br/>Personalize"]
Trust["🛡️ Trust & Safety<br/>Legal"]
end
MA --> Catalog
MA --> Orders
MA --> ML
MA --> Promo
MA --> Trust
classDef user fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
classDef system fill:#fff3e0,stroke:#e65100,stroke-width:3px
classDef domain fill:#f3e5f5,stroke:#6a1b9a,stroke-width:1px
class User user
class MA system
class Catalog,Orders,ML,Promo,Trust domain
MangaAssist is a thin orchestration layer. It owns conversation state, intent routing, prompt assembly, response streaming, and the safety pipeline. Every domain it talks about (products, prices, orders, returns) lives in another team's system. The architecture is fundamentally about integration, not data ownership.
1.2 Layered Architecture — Logical View
The system splits cleanly into 5 horizontal layers. Each layer has its own scaling characteristics, failure modes, and ownership.
flowchart TB
L5["<b>Layer 5: Safety / Trust</b><br/>Input validators · Output guardrails (6-stage) · PII vault · Audit log<br/>Latency: <100ms · Failure mode: fail-closed"]
L4["<b>Layer 4: Intelligence (ML Pipeline)</b><br/>DistilBERT → Titan Embeddings → OpenSearch KNN → Cross-Encoder Reranker → Claude LLM<br/>Latency: 600ms–1.5s · Failure mode: per-stage circuit breaker + fallback"]
L3["<b>Layer 3: Orchestration</b><br/>Conversation manager · Prompt assembler · Service fan-out · Memory summarizer · Streaming pump · Correction-frame emitter<br/>Latency: <10ms compute · Failure mode: stateless retry"]
L2["<b>Layer 2: Edge / Transport</b><br/>API Gateway WebSocket · API Gateway REST · ALB sticky sessions · Cognito auth · Token-bucket rate limiter<br/>Latency: <20ms · Failure mode: TCP retry, reconnection"]
L1["<b>Layer 1: State / Data</b><br/>DynamoDB (sessions, audit) · ElastiCache Redis (semantic cache, locks) · DAX (hot sessions) · OpenSearch (vector index)<br/>Latency: 1–50ms · Failure mode: DAX fall-through, on-demand mode"]
L5 --- L4
L4 --- L3
L3 --- L2
L2 --- L1
classDef safety fill:#ffebee,stroke:#c62828,stroke-width:2px
classDef intel fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px
classDef orch fill:#fff3e0,stroke:#e65100,stroke-width:2px
classDef edge fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
classDef data fill:#f3e5f5,stroke:#6a1b9a,stroke-width:2px
class L5 safety
class L4 intel
class L3 orch
class L2 edge
class L1 data
Why this layering matters for everything else in this folder:
- The testing pyramid in 02-api-testing-strategy.md maps onto these layers — unit tests cover Layer 3 and Layer 5 logic, integration tests cross Layers 2↔3↔4, contract tests gate the ML and downstream-service boundaries, E2E tests traverse all 5.
- The scale scenarios in 03-scale-testing-scenarios.md are categorized by which layer broke first (DynamoDB hot partition = Layer 1, Bedrock throttling = Layer 4, WebSocket ceiling = Layer 2).
- The 6 API types in 01-api-types-overview.md are the interfaces between these layers, plus the external surface at Layer 2.
1.3 Physical Architecture — AWS Topology
flowchart TB
R53["🌐 Route 53<br/>(latency-based routing)"]
subgraph east["us-east-1 (primary)"]
direction TB
APIGW_WS["API Gateway<br/>WebSocket"]
APIGW_REST["API Gateway<br/>REST"]
WAF["WAF + Cognito<br/>Authorizer"]
ALB["ALB<br/>(sticky sessions)"]
ECS["ECS Fargate<br/>(orchestrator)"]
Lambda["Lambda<br/>(burst overflow)"]
DDB[("DynamoDB<br/>+ DAX")]
Redis[("ElastiCache<br/>Redis")]
OS[("OpenSearch<br/>Serverless")]
ML["SageMaker + Bedrock<br/>DistilBERT · Titan · Reranker · Claude"]
end
subgraph west["us-west-2 (mirror, active)"]
WestMirror["(same topology)"]
end
Kinesis["Kinesis"]
Firehose["Firehose"]
Redshift[("Redshift<br/>(analytics, PII scrubbed)")]
R53 --> east
R53 --> west
APIGW_WS --> WAF
APIGW_REST --> WAF
WAF --> ALB
ALB --> ECS
ALB --> Lambda
ECS --> DDB
ECS --> Redis
ECS --> OS
ECS --> ML
Lambda --> DDB
Lambda --> Redis
Lambda --> ML
ECS --> Kinesis
Kinesis --> Firehose
Firehose --> Redshift
DDB <-. "Global Tables<br/>replication" .-> WestMirror
classDef edge fill:#e3f2fd,stroke:#1565c0
classDef compute fill:#fff3e0,stroke:#e65100
classDef data fill:#f3e5f5,stroke:#6a1b9a
classDef ml fill:#e8f5e9,stroke:#2e7d32
classDef analytics fill:#fce4ec,stroke:#ad1457
class APIGW_WS,APIGW_REST,WAF,ALB,R53 edge
class ECS,Lambda compute
class DDB,Redis,OS data
class ML ml
class Kinesis,Firehose,Redshift analytics
Multi-region: Active-active via Route 53 latency routing. Session state replicates via DynamoDB Global Tables (eventual consistency, <2s replication lag) — acceptable because session lookup is sticky to the connecting region 99%+ of the time.
Compute split — ECS + Lambda: Steady traffic runs on ECS Fargate (predictable cost, fewer cold starts). Burst overflow above ECS auto-scale rate spills to Lambda. The two share the same orchestrator code packaged differently.
1.4 Non-Functional Requirements (NFRs) — Quantified
These are the numerical targets every design choice was measured against. They are not aspirational — they are gating thresholds.
| Category | NFR | Target | How It's Enforced |
|---|---|---|---|
| Latency | P50 first token | < 800ms | CloudWatch alarm, deploy gate |
| | P99 first token | < 1.5s | Per-model latency budget allocation |
| | P99 full response | < 3s | E2E timer in every request span |
| Availability | Steady-state | 99.9% | Multi-AZ + multi-region |
| | Prime Day | 99.9% (no degradation) | Provisioned capacity, scheduled scale-up |
| Scale | Concurrent WS connections | 200K normal, 500K peak | API Gateway limit increase + multi-region |
| | Messages/sec | 5K normal, 50K peak | Per-component capacity model |
| Quality | LLM hallucination rate | < 1% | Offline gate + runtime guardrails |
| | Intent accuracy | ≥ 90% (macro F1) | Pre-deploy gate |
| | Guardrail pass rate | ≥ 98% on adversarial set | Security CI |
| Cost | Cost per resolved conversation | < $0.025 | Weekly cost dashboard |
| | LLM share of cost | < 60% | Routing + caching + Haiku fallback |
| Compliance | PII in analytics pipeline | Zero | Comprehend scan on every record |
| | Right-to-be-forgotten SLA | < 30 days | Automated deletion pipeline |
| Security | Prompt injection block rate | 100% on test suite | Adversarial CI on every commit |
Trade-off principle: When NFRs conflict (e.g., latency vs. quality, cost vs. availability), the priority order is Safety > Availability > Latency > Quality > Cost. This is documented because real incidents required leaning on this order — see Decision D-2 in Part 3.
1.5 Capacity Model
| Tier | Concurrent Sessions | Msgs/sec | LLM Calls/sec | Reasoning |
|---|---|---|---|---|
| Normal weekday | 50K | 5K | 3K | 60% intents need LLM (40% short-circuited by templates / cache) |
| Peak weekday | 100K | 12K | 7K | Evening browse traffic |
| Prime Day | 500K | 50K | 30K | 10x spike with bursty distribution |
| Degraded mode | 500K | 50K | 5K | LLM throttled — Haiku fallback + cache + escalation |
Sizing rule: Provision for 75th percentile of daily peak; rely on auto-scaling + fallback for the rest. Provisioning for the absolute peak wastes ~$80K/month in idle capacity at off-peak hours.
1.6 Failure Domain Map
A failure domain is the blast radius of a single component breaking. The architecture is designed so no single domain takes down everything.
flowchart LR
subgraph A["Domain A: Edge"]
A1["<b>Failure</b><br/>WebSocket connections drop"]
A2["<b>Mitigation</b><br/>REST fallback<br/>multi-region failover"]
A3["<b>Worst case</b><br/>chat unavailable<br/>for one region ~30s"]
A1 --> A2 --> A3
end
subgraph B["Domain B: Bedrock"]
B1["<b>Failure</b><br/>LLM throttling or outage"]
B2["<b>Mitigation</b><br/>Haiku fallback · cached responses · escalation"]
B3["<b>Worst case</b><br/>chat works but<br/>quality degrades ~2h"]
B1 --> B2 --> B3
end
subgraph C["Domain C: SageMaker"]
C1["<b>Failure</b><br/>classifier or reranker down"]
C2["<b>Mitigation</b><br/>Stage-1 regex (40% of intents)<br/>raw cosine sim fallback"]
C3["<b>Worst case</b><br/>10–15% routing<br/>accuracy drop"]
C1 --> C2 --> C3
end
subgraph D["Domain D: DynamoDB"]
D1["<b>Failure</b><br/>session table throttle / down"]
D2["<b>Mitigation</b><br/>DAX hot-reads · on-demand mode"]
D3["<b>Worst case</b><br/>new sessions can't<br/>be created ~1 min"]
D1 --> D2 --> D3
end
subgraph E["Domain E: Catalog / Orders / etc."]
E1["<b>Failure</b><br/>downstream service down"]
E2["<b>Mitigation</b><br/>per-service circuit breaker<br/>graceful degradation"]
E3["<b>Worst case</b><br/>that intent path<br/>returns partial answer"]
E1 --> E2 --> E3
end
subgraph F["Domain F: Guardrails"]
F1["<b>Failure</b><br/>a guardrail stage errors"]
F2["<b>Mitigation</b><br/>fail-closed: block response<br/>never fail-open on safety"]
F3["<b>Worst case</b><br/>response blocked,<br/>user sees safe message"]
F1 --> F2 --> F3
end
classDef fail fill:#ffebee,stroke:#c62828
classDef mit fill:#fff3e0,stroke:#e65100
classDef worst fill:#f3e5f5,stroke:#6a1b9a
class A1,B1,C1,D1,E1,F1 fail
class A2,B2,C2,D2,E2,F2 mit
class A3,B3,C3,D3,E3,F3 worst
Cross-domain rule: No two domains share a single point of failure. The orchestrator is the only component in the critical path of every request, and it's deployed redundantly across AZs and regions with stateless instances behind ALB sticky sessions.
Part 2: Low-Level Design (LLD)
2.1 Request Lifecycle — Sequence Diagram
The canonical streaming chat flow:
sequenceDiagram
autonumber
actor U as User
participant FE as Frontend
participant WS as APIGW-WS
participant O as Orchestrator
participant DB as DynamoDB
participant C as Classifier
participant E as Embed
participant K as KNN (OpenSearch)
participant R as Reranker
participant CAT as Catalog
participant BR as Bedrock
participant G as Guardrails
U->>FE: types question
FE->>WS: ws.send(message)
WS->>O: $default
O->>DB: load session (DAX)
DB-->>O: session + history
Note over O: input validate (PII, injection, size) — 5ms
Note over O: Stage 1: regex intent (zero-cost)
par parallel fan-out (T+5ms)
O->>C: classify
C-->>O: intent (T+20ms)
and
O->>E: embed
E-->>O: vector (T+35ms)
and
O->>CAT: catalog batch
CAT-->>O: products
end
O->>K: KNN search (T+35ms)
K-->>O: top-10 chunks (T+75ms)
O->>R: rerank top-10 → top-3
R-->>O: top-3 (T+125ms)
Note over O: Assemble prompt (sys + history + RAG + products) — 5ms
O->>BR: invoke (streaming)
Note over BR,O: TTFT ~500ms
loop streaming tokens
BR-->>O: token chunk
Note over O: inline guardrail<br/>(regex PII + competitor)
O-->>WS: frame
WS-->>FE: token N
FE-->>U: render
end
BR-->>O: [done]
O->>G: full guardrails pipeline (6 stages)
G-->>O: pass / correction
Note over O: products[] injected from live Catalog<br/>(NOT from LLM)
O->>DB: persist turn
O-->>WS: products + done frame
WS-->>FE: render cards
FE-->>U: complete response
Key invariant: prices and ASINs in product cards never come from the LLM. The LLM generates descriptive prose. Structured data (ASIN, price, in_stock, title in the card) comes from catalog APIs at response-assembly time. This makes the price hallucination problem (described in 04-offline-testing-quality-strategies.md) tractable — we only need to defend against inline mentions in the prose, not card data.
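A minimal sketch of that invariant at response-assembly time, assuming hypothetical `catalog_client` and `pricing_client` helpers: candidate ASINs come from the retrieval/context stage, and every structured card field is hydrated from live Catalog and Pricing calls rather than parsed out of the LLM prose.

```python
from dataclasses import dataclass

@dataclass
class ProductCard:
    asin: str
    title: str
    author: str
    price_current: float
    price_list: float
    currency: str
    in_stock: bool
    url: str

async def build_product_cards(candidate_asins, catalog_client, pricing_client):
    """Hydrate every structured field from live services; none of it comes from the LLM."""
    cards = []
    for asin in candidate_asins:                      # ASINs from retrieval/context, not prose
        item = await catalog_client.get_item(asin)    # hypothetical catalog call
        price = await pricing_client.get_price(asin)  # hypothetical pricing call
        if item is None or price is None:
            continue                                  # drop the card rather than guess
        cards.append(ProductCard(
            asin=asin,
            title=item["title"],
            author=item["author"],
            price_current=price["current"],
            price_list=price["list"],
            currency=price["currency"],
            in_stock=item["in_stock"],
            url=f"/dp/{asin}?ref=manga_assist",
        ))
    return cards
```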
2.2 Orchestrator State Machine
stateDiagram-v2
[*] --> IDLE
IDLE --> VALIDATING: message arrives ($default)
VALIDATING --> REJECTED: validation fails<br/>(PII, injection, size)
REJECTED --> [*]: safe response returned
VALIDATING --> CLASSIFYING: pass
CLASSIFYING --> ROUTED: Stage 1 high<br/>confidence (regex)
CLASSIFYING --> DistilBERT: Stage 1 low conf<br/>run Stage 2
DistilBERT --> ROUTED: classified
ROUTED --> FANNING_OUT: parallel<br/>service calls
FANNING_OUT --> DEGRADING: any non-critical<br/>service fails
DEGRADING --> ASSEMBLING: with fallbacks
FANNING_OUT --> ASSEMBLING: all results in<br/>(or timeouts hit)
ASSEMBLING --> STREAMING: prompt ready,<br/>invoke Bedrock
STREAMING --> FAILOVER: Bedrock fails
FAILOVER --> GUARDED: cached / Haiku / escalation
STREAMING --> GUARDED: stream complete
GUARDED --> PERSISTED: pass or<br/>correction frame emitted
PERSISTED --> [*]: emit metrics,<br/>return to IDLE
note right of FANNING_OUT
Each state has a timeout
that triggers transition
to DEGRADING or FAILOVER
end note
Each state has a timeout that triggers transition to DEGRADING or FAILOVER. The state machine is in-memory (per-request), but the state of the conversation (turns, intents, ASINs referenced) is in DynamoDB.
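A minimal sketch of the per-state timeout behaviour, using `asyncio.wait_for` so a stage that overruns its budget hands control to the DEGRADING/FAILOVER path instead of hanging; timeout values and stage coroutines are illustrative, not the production numbers.

```python
import asyncio

# Per-state budgets in milliseconds (illustrative values, not the production numbers).
STATE_TIMEOUTS_MS = {"FANNING_OUT": 300, "STREAMING": 3000}

async def run_stage(state: str, work, on_timeout):
    """Run one state's work; on overrun, hand control to the DEGRADING/FAILOVER path."""
    budget_s = STATE_TIMEOUTS_MS.get(state, 1000) / 1000
    try:
        return await asyncio.wait_for(work, timeout=budget_s)
    except asyncio.TimeoutError:
        return await on_timeout(state)   # e.g. DEGRADING for fan-out, FAILOVER for streaming
```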
2.3 Streaming Protocol — Frame Schema
Every frame on the WebSocket conforms to this discriminated union:
type Frame =
| { type: "token"; seq: number; content: string }
| { type: "products"; seq: number; items: ProductCard[] }
| { type: "actions"; seq: number; buttons: ActionButton[] }
| { type: "correction"; seq: number; replace_seq: number; content: string }
| { type: "follow_up"; seq: number; suggestions: string[] }
| { type: "error"; seq: number; code: string; message: string; recoverable: boolean }
| { type: "done"; seq: number; metadata: ResponseMetadata }
interface ProductCard {
asin: string // validated against catalog
title: string // from catalog, not LLM
author: string // from catalog
format: "Paperback" | "Hardcover" | "Kindle" | "Audible"
price: { current: number, list: number, currency: string } // live pricing API
in_stock: boolean // live inventory
image_url: string
url: string // /dp/{asin}?ref=manga_assist
trust_signals: { listing_score: number, seller_verified: boolean } // post-fraud-incident
}
interface ResponseMetadata {
intent: string
intent_confidence: number
models_used: string[] // ["distilbert", "claude-3-5-sonnet", ...]
rag_chunks_used: number
guardrail_actions: string[] // ["pii_redacted", "scope_redirected", ...]
total_latency_ms: number
ttft_ms: number
cache_hit: boolean
request_id: string // threads through X-Ray
}
Why a discriminated union and not separate WebSocket routes? Frontends are simpler when there's one inbound stream per session. The type field lets the client switch on each frame. This was a frontend-engineer-driven preference and survived three redesign cycles.
Why seq numbers? Discussed in Part 3, Decision D-5 — to detect dropped frames and reconcile correction frames against the original token stream.
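A sketch of the server-side emitter, assuming a per-session monotonic counter: every frame takes the next `seq`, and a correction frame carries the `replace_seq` of the token frame it retracts. Names here are illustrative, not the production module.

```python
import itertools
import json

class FrameEmitter:
    """Emits discriminated-union frames with a per-session monotonic seq."""

    def __init__(self, send):            # `send` pushes one serialized frame to the WebSocket
        self._seq = itertools.count()
        self._send = send

    async def emit(self, frame_type: str, **fields) -> int:
        frame = {"type": frame_type, "seq": next(self._seq), **fields}
        await self._send(json.dumps(frame))
        return frame["seq"]

    async def emit_correction(self, replace_seq: int, content: str) -> int:
        # Retract a previously streamed token frame by its seq.
        return await self.emit("correction", replace_seq=replace_seq, content=content)
```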
2.4 Session Schema (DynamoDB)
Table: manga_assist_sessions · PK = session_id · SK = sk (composite discriminator) · TTL = expires_at (24h rolling) · GSI = customer_id-index (PK customer_id, SK last_active)
erDiagram
META ||--o{ TURN : "has 0..N turns"
META ||--o| PII_VAULT : "has 0..1 vault"
META }o--|| CUSTOMER : "indexed via GSI"
META {
string session_id PK
string sk "= META"
string customer_id "nullable for guest"
string created_at "ISO8601"
string last_active "ISO8601"
string tier "prime|authenticated|guest"
string locale
string ws_connection_id "null when disconnected"
string summary "populated after 20 turns"
number expires_at "TTL"
}
TURN {
string session_id PK
string sk "= TURN#NNNN"
string role "user|assistant"
string content "PII-redacted"
string timestamp
string intent
list asins_mentioned "entity preservation"
list models_used
string response_id "for feedback correlation"
}
PII_VAULT {
string session_id PK
string sk "= PII_VAULT"
bytes kms_encrypted_blob "separate KMS key, audited access"
}
CUSTOMER {
string customer_id PK "GSI partition key"
string last_active "GSI sort key"
}
GSI use case: "show me all my chats" feature + GDPR right-to-be-forgotten (find all sessions for a customer_id, delete them).
Access patterns:
| Pattern | Operation | Frequency | Notes |
|---|---|---|---|
| Load session for next turn | Query PK=session_id, SK begins_with "META" or "TURN#" | Every message | DAX cached |
| Persist new turn | PutItem SK=TURN#{n} | Every response | Idempotent on response_id |
| Summarize on turn 20+ | BatchWriteItem to overwrite TURN#0001..0020 with one TURN#SUMMARY | Per session | See Pillar 5 in offline-testing |
| GDPR delete | Query GSI then BatchWriteItem deletes | On request | <30 day SLA |
| Customer history | Query GSI customer_id-index | Customer view feature | Limited to last 30 days |
Why composite sort key with TURN#NNNN? Read pattern is "give me last N turns in order." Range queries on a sort key that sorts lexically on TURN#0042 work cleanly. Numeric prefixes pad to 4 digits — sufficient because we summarize before reaching 1000 turns.
Why a separate PII vault item? The redacted content field is what the LLM sees; the original PII (an email, address, etc.) is needed only for compliance retrieval. By isolating PII into a separate KMS-encrypted item, the LLM context loading path never touches the encryption key. Reduces blast radius if the orchestrator is ever compromised.
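A sketch of the two hottest access patterns against the schema above, using boto3 with the table and key names from this section; the zero-padded sort key is what makes the range query return turns in order.

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("manga_assist_sessions")

def load_recent_turns(session_id: str, n: int = 10):
    """Last N turns in order: range query over the lexically sorted TURN#NNNN keys."""
    resp = table.query(
        KeyConditionExpression=Key("session_id").eq(session_id)
        & Key("sk").begins_with("TURN#"),
        ScanIndexForward=False,   # newest first
        Limit=n,
    )
    return list(reversed(resp["Items"]))

def persist_turn(session_id: str, turn_no: int, role: str, redacted_content: str,
                 response_id: str, asins_mentioned: list):
    """Write one turn; content is already PII-redacted, raw PII lives in the vault item."""
    table.put_item(Item={
        "session_id": session_id,
        "sk": f"TURN#{turn_no:04d}",   # zero-padded so lexical order matches numeric order
        "role": role,
        "content": redacted_content,
        "response_id": response_id,    # idempotency key for retries
        "asins_mentioned": asins_mentioned,
    })
```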
2.5 Concurrency & Threading Model
Per-request: The orchestrator uses an asyncio event loop. Within a single request:
- Parallel fan-out is asyncio.gather() over service calls.
- Streaming from Bedrock is an async iterator.
- Inline guardrail checks run synchronously per chunk (regex is fast — <1ms per chunk).
- Final guardrail pipeline runs sequentially (see Q6 in interview QA for ordering rationale).
Per-instance: Each ECS task can hold ~500 concurrent WebSocket connections, limited by:
- File descriptor count (we set ulimit to 65K).
- Per-connection memory: ~80KB for buffers + state. 500 conns × 80KB = 40MB — well within limits.
- Bedrock streaming concurrency limit per instance — we allow up to 100 concurrent Bedrock streams per task.
Per-region: Sticky sessions on ALB ensure WebSocket reconnects land on the same task. If a task dies, sticky session is broken, the client reconnects, ALB routes to a new task, the new task loads session from DynamoDB. No state lost.
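A minimal sketch of the per-request fan-out described above: `asyncio.gather` over the parallel calls, each bounded by its own timeout, with non-critical failures degraded to fallback values instead of failing the turn. The service coroutines on `clients` are hypothetical placeholders.

```python
import asyncio

async def bounded(call, timeout_s: float, fallback):
    """Bound one downstream call; a timeout or error degrades to a fallback value."""
    try:
        return await asyncio.wait_for(call, timeout=timeout_s)
    except Exception:
        return fallback    # non-critical failure: degrade this input, keep the turn alive

async def fan_out(message: str, clients):
    """Parallel fan-out for one turn: classifier, embedder, and catalog batch."""
    intent, vector, products = await asyncio.gather(
        bounded(clients.classify(message), 0.15, fallback="unknown"),
        bounded(clients.embed(message), 0.20, fallback=None),
        bounded(clients.catalog_batch(message), 0.30, fallback=[]),
    )
    return intent, vector, products
```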
2.6 Observability Schema
Every request emits a structured event with this shape:
{
"request_id": "req_xyz", // X-Ray trace ID
"session_id": "sess_abc",
"customer_id": "amzn1...",
"timestamp": "2026-04-27T12:34:56Z",
"intent": "recommendation",
"intent_confidence": 0.92,
"intent_source": "distilbert", // or "regex_stage_1"
"rag": {
"chunks_retrieved": 10,
"chunks_used": 3,
"recall_at_3_observed": null // populated only in shadow eval
},
"models": {
"classifier_ms": 15,
"embed_ms": 30,
"knn_ms": 40,
"rerank_ms": 50,
"llm_ttft_ms": 480,
"llm_total_ms": 1100
},
"guardrails": {
"stages_triggered": ["pii_redact"],
"correction_frame_emitted": false,
"block_reason": null
},
"downstream_calls": [
{ "service": "catalog", "ms": 45, "status": 200 },
{ "service": "personalize", "ms": 180, "status": 200 }
],
"cache": { "lookup_hit": false, "stored": true },
"cost_usd": 0.0042,
"outcome": "delivered" // or "blocked", "error", "fallback"
}
Why this shape:
- One row per request → easy aggregation in Redshift (GROUP BY intent, models_used).
- Per-stage timings → P99 alarms can be set per component.
- outcome field → distinguishes user-visible failures from silent fallbacks.
- request_id threads X-Ray, CloudWatch logs, and the Redshift event — single ID for full trace.
Part 3: Decision Lens — Multi-Stakeholder Perspectives
This part is where we think like a group of people. Each major architectural decision is examined through 8 lenses. At Amazon, no design survives one perspective alone — these are the voices that have to align before the design is durable.
The lenses:
| Lens | Question they ask |
|---|---|
| 🔧 Backend Engineer | Is this clean code? Can I test it? What breaks if traffic doubles? |
| ⚙️ SRE / Ops | What pages me at 3am? What's the runbook? How do I roll back? |
| 🛡️ Security Engineer | What's the threat model? What's the blast radius if compromised? |
| 🤖 ML Engineer | Does this constrain or enable model improvements? How do I A/B test? |
| 📊 Product Manager | Does this hurt user experience? Is the trade-off worth the customer impact? |
| 🎨 Frontend Engineer | How does my client handle this? Edge cases on slow networks? |
| 💰 Cost / Finance | What's the unit cost? How does it scale? What's the burn at peak? |
| ⚖️ Legal / Compliance | PII? Liability? Regulatory obligations? |
D-1 — WebSocket as the primary transport (with REST fallback)
HLD impact: Edge layer (Layer 2) is bifurcated — two transports, one orchestrator.
LLD impact: Frame protocol must work over both; seq numbers, idempotency on response_id, REST returns batched frames.
| Lens | Position | What they pushed for / against |
|---|---|---|
| 🔧 Backend | Supportive | Clean separation — one orchestrator, two transport adapters. Wanted to reject HTTP/2 SSE because of the upstream-message use case |
| ⚙️ SRE | Cautious supporter | WebSocket connections are stateful and harder to drain on deploy. Required us to design connection draining: stop accepting new conns 60s before deploy, let in-flight finish, force-close after grace period |
| 🛡️ Security | Strong supporter | Long-lived authenticated connection means auth happens once at $connect, not per-message. Cuts auth surface — but required strict origin validation and token expiry handling on long-lived sessions |
| 🤖 ML | Neutral | Doesn't affect model serving. Mild preference for WebSocket because TTFT is more visible to user and improves perceived quality of streaming models |
| 📊 PM | Strong supporter | Streaming is the product. Survey data showed users perceive WebSocket-streamed responses as "smarter" even when content is identical |
| 🎨 Frontend | Cautious supporter | More complex client code (reconnect logic, frame ordering). Asked for the REST fallback as their safety net for corporate networks blocking WebSocket |
| 💰 Cost | Neutral | API Gateway WebSocket charges per message and per connection-minute. Modeled at scale: ~$0.0008/conversation. Acceptable |
| ⚖️ Legal | Neutral | No new compliance issues vs. REST |
Alternatives considered & rejected:
- Pure REST (no streaming): Rejected on UX. Users abandon at >1.5s perceived latency. Polling adds even more overhead.
- Server-Sent Events: Rejected because we need upstream events from the client (page navigation, typing indicator) on the same channel.
- gRPC streaming on the edge: Rejected because browser support is poor; gRPC-Web requires a proxy and we'd lose the bidirectional benefit.
How it would have failed differently:
- Pure REST: scale would have been easier (no connection ceiling), but A/B simulation projected a ~12% conversion-rate drop due to perceived latency.
- SSE: 90% of the win, but the upstream channel would have been a separate REST endpoint — state split across two transports, harder debugging.
D-2 — Streaming guardrails are two-tier (inline + post-stream), not full pre-stream
HLD impact: Safety layer is split — inline runs during streaming, full pipeline runs after. Correction frame mechanism is required.
LLD impact: Two guardrail invocation points; correction frame in the protocol; sequence numbers on tokens.
This is the most contested decision in the system. It pits Safety (Legal, Security) against UX (Product, Frontend).
| Lens | Position | Quote |
|---|---|---|
| 🔧 Backend | Pragmatic supporter | "Buffering until guardrails finish kills the streaming UX. Two-tier is a clean compromise — fast checks inline, slow checks after." |
| ⚙️ SRE | Supporter | "I get to alarm on correction_frame_emitted rate. If it spikes, I know guardrails are blocking content the LLM is increasingly generating. That's a leading indicator." |
| 🛡️ Security | Reluctant supporter | "I'd prefer pre-stream. But inline regex catches 90% of what worries me (PII, competitors). The remaining 10% — price hallucination, scope drift — fires <0.5%, and the audit log captures it. I accepted this with the condition that ASIN/price corrections trigger a hard correction frame, not a soft retraction." |
| 🤖 ML | Supporter | "Pre-stream means I can't show the user any tokens until the LLM is fully done — that's 1500ms of buffering. Two-tier preserves the streaming benefit." |
| 📊 PM | Strong supporter | "Without streaming, the chatbot loses its 'feels smart' edge. We measured this in user research." |
| 🎨 Frontend | Cautious | "Correction frames are awkward to render. I asked for them to be infrequent enough that we don't need a complex 'retract' animation. Backend committed to <0.5% rate." |
| 💰 Cost | Neutral | "Roughly cost-neutral — same number of guardrail invocations." |
| ⚖️ Legal | Reluctant supporter | "Reviewed and accepted on the condition that: (1) every correction frame is logged with full audit trail, (2) wrong prices are NEVER streamed inline (LLM is prompted not to mention prices in prose; product cards inject from catalog), (3) any user complaint about a wrong price is retrievable from logs within 1 hour." |
Alternatives rejected:
- Full pre-stream guardrails: Rejected on UX (3s perceived latency for first token).
- No inline guardrails (only post-stream): Rejected on safety — Legal would not accept PII potentially streaming to user even briefly.
- Buffering only price/PII tokens: Rejected as unimplementable — token boundaries don't align with semantic boundaries; you can't know "$" is a price prefix until the next token arrives.
Hidden constraint that shaped the decision: Product cards (ASIN, price, title) are injected from live catalog after the prose stream, never generated by the LLM. This is what made the two-tier acceptable to Legal — the highest-risk fields (price, ASIN) bypass the LLM entirely.
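A sketch of the two-tier shape this decision describes, reusing the frame-emitter idea from Section 2.3: a cheap regex pass runs inline on every chunk as it streams, and the full six-stage pipeline runs once on the complete text, emitting a hard correction frame only when it catches something the inline pass missed. The patterns and the pipeline's verdict shape are illustrative.

```python
import re

INLINE_PATTERNS = [
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),    # crude email/PII catch (illustrative)
    re.compile(r"\bsome-competitor\b", re.I),      # placeholder competitor-mention check
]

def inline_check(chunk: str) -> str:
    """Tier 1: fast per-chunk redaction while tokens stream (regex, <1ms)."""
    for pattern in INLINE_PATTERNS:
        chunk = pattern.sub("[redacted]", chunk)
    return chunk

async def stream_with_guardrails(token_stream, emitter, full_pipeline):
    """Tier 2 runs once on the full text; a failure triggers a hard correction frame."""
    seqs, parts = [], []
    async for chunk in token_stream:               # tokens from the LLM stream
        safe = inline_check(chunk)
        seqs.append(await emitter.emit("token", content=safe))
        parts.append(safe)
    verdict = await full_pipeline("".join(parts))  # six-stage post-stream pass
    if not verdict.passed:
        await emitter.emit_correction(
            replace_seq=seqs[verdict.offending_chunk],  # which token frame to retract
            content=verdict.replacement,
        )
```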
D-3 — Sequential, not parallel, guardrails pipeline
HLD impact: Safety layer is a pipe, not a fan-out.
LLD impact: Each stage takes the previous stage's output as input. Total latency is sum, not max.
| Lens | Position | Reasoning |
|---|---|---|
| 🔧 Backend | Strong supporter | "Stages mutate the response. PII redaction changes text. If toxicity ran in parallel on the original text, results don't compose. Sequential = correctness." |
| ⚙️ SRE | Mild concern | "100ms total is in the latency budget. I asked for per-stage timings in the metrics so I can tell which stage is the slow one if total spikes." |
| 🛡️ Security | Strong supporter | "PII MUST run first. If I run it last, anything that side-effects (logging, caching) sees PII. Order is a security control." |
| 🤖 ML | Neutral | "Doesn't affect model behavior." |
| 📊 PM | Neutral | "Acceptable as long as 100ms total is real." |
| 🎨 Frontend | Neutral | "Doesn't affect me." |
| 💰 Cost | Neutral | "Cost-equivalent to parallel — same number of API calls and CPU cycles." |
| ⚖️ Legal | Strong supporter | "PII redaction first is non-negotiable — same reasoning as Security." |
Alternative rejected: Parallel execution with a final reconciliation step. Would save ~60ms but introduces a "reconciliation correctness" problem (e.g., did the toxicity filter act on text that PII later removed?). The 60ms saving wasn't worth the verification cost.
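A sketch of the composition property Backend and Security are describing: each stage consumes the previous stage's output, so running PII redaction first guarantees that no later stage, and nothing a later stage logs or caches, ever sees raw PII. The stage bodies are placeholders.

```python
from typing import Callable

Stage = Callable[[str], str]

def run_guardrails(text: str, stages: list) -> str:
    """Sequential pipe: the output of stage i is the input of stage i+1."""
    for stage in stages:
        text = stage(text)   # each stage may mutate the text (redact, rewrite, block)
    return text

# Order is a security control: PII redaction runs before anything that can log or cache text.
PIPELINE = [
    lambda t: t.replace("user@example.com", "[pii]"),  # pii_redact (illustrative stand-in)
    lambda t: t,                                       # toxicity check (placeholder)
    lambda t: t,                                       # scope / competitor check (placeholder)
]
```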
D-4 — Provisioned Throughput for Sonnet, on-demand for Haiku
HLD impact: Intelligence layer has a tiered Bedrock plan, not a single model.
LLD impact: Routing logic in orchestrator picks model by intent + user tier. Cost dashboards split provisioned vs. on-demand spend.
| Lens | Position | Reasoning |
|---|---|---|
| 🔧 Backend | Supportive | "Routing logic is straightforward — intent → model map plus a fallback rule." |
| ⚙️ SRE | Strong supporter | "Provisioned eliminates throttling for our baseline. I sleep better." |
| 🛡️ Security | Neutral | "Same model, same security posture. Routing decisions are logged for audit." |
| 🤖 ML | Cautious | "Routing changes the eval surface — I now have to evaluate Haiku on simple intents AND Sonnet on complex intents AND verify the boundary. I asked for offline evaluation showing Haiku is acceptable for routed-down intents (chitchat, simple FAQ). It is — but the gate is now a per-model gate." |
| 📊 PM | Cautious supporter | "I worried about quality drop on routed-down intents. ML showed no measurable user-perceived difference for the routed categories. Approved." |
| 🎨 Frontend | Neutral | "Doesn't affect me." |
| 💰 Cost | Strong driver | "This is a $35K/month saving. Provisioned baseline + on-demand burst is 35% cheaper than pure on-demand at our volume. It's also smoother — fewer billing surprises." |
| ⚖️ Legal | Neutral | "Both models are in our existing Bedrock agreement." |
Alternative rejected: Pure on-demand across both models. Rejected on cost ($35K/month) and reliability (no guaranteed throughput during AWS-wide pressure events).
Open risk: Provisioned has a 1-month minimum commitment. If traffic drops permanently (unlikely but possible), we eat the cost until expiry. Mitigation: weekly utilization review, conservative provisioning at P75 of daily peak.
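A sketch of the routing rule this decision implies: an intent-to-model map with a throttle fallback to Haiku. Intent names and model identifiers are illustrative and would need to match the real intent taxonomy and Bedrock model IDs.

```python
SONNET = "anthropic.claude-3-5-sonnet"   # provisioned-throughput baseline (illustrative ID)
HAIKU = "anthropic.claude-3-haiku"       # on-demand burst / fallback (illustrative ID)

SIMPLE_INTENTS = {"chitchat", "simple_faq", "order_status"}   # illustrative routed-down set

def pick_model(intent: str, provisioned_saturated: bool) -> str:
    """Route simple intents to Haiku; fall back to Haiku when Sonnet capacity is saturated."""
    if intent in SIMPLE_INTENTS:
        return HAIKU
    return HAIKU if provisioned_saturated else SONNET
```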
D-5 — Sequence numbers on every WebSocket frame
HLD impact: Protocol-level — every frame is identifiable.
LLD impact: Client must track expected seq; backend must guarantee monotonic increase per session.
| Lens | Position | Reasoning |
|---|---|---|
| 🔧 Backend | Supportive | "Easy to implement. Counter per session." |
| ⚙️ SRE | Strong supporter | "When users report 'response cut off,' I can trace by seq gaps. Without seq, I'd have no signal." |
| 🛡️ Security | Mild supporter | "Helps with replay-attack analysis on connection logs." |
| 🤖 ML | Neutral | "Doesn't affect model serving." |
| 📊 PM | Neutral | "Doesn't affect product behavior unless something goes wrong." |
| 🎨 Frontend | Driver | "I asked for this. Without seq, I can't reconcile correction frames against the original stream — which token am I retracting? Also, on flaky networks, I detect dropped frames and can request resend of structured data." |
| 💰 Cost | Neutral | "4 extra bytes per frame. Negligible." |
| ⚖️ Legal | Neutral | "Useful for audit retrieval — 'show me the exact frames sent at this timestamp.'" |
Alternative rejected: Implicit ordering (TCP guarantees order). Rejected because TCP order doesn't help when the client loses a frame due to a network glitch or browser tab pause.
D-6 — DAX in front of DynamoDB for sessions
HLD impact: Data layer has a cache that's part of the read path, not just an optimization.
LLD impact: Reads go through the DAX client; writes go through the same client as write-through, which keeps the item cache coherent (DAX does not see writes made directly to DynamoDB).
| Lens | Position | Reasoning |
|---|---|---|
| 🔧 Backend | Supportive | "DAX client is API-compatible with DynamoDB SDK. Drop-in." |
| ⚙️ SRE | Cautious supporter | "Adds a failure domain. I asked for DAX-down runbook: app falls through to DynamoDB direct, performance degrades but doesn't break. Verified in chaos testing." |
| 🛡️ Security | Neutral | "DAX is in our VPC, encrypted in transit. Same posture as DynamoDB." |
| 🤖 ML | Neutral | "Doesn't affect model serving." |
| 📊 PM | Neutral | "Invisible to user." |
| 🎨 Frontend | Neutral | "Invisible to user." |
| 💰 Cost | Cautious | "+$200/month. Acceptable for the engineering cost it eliminates (no more capacity planning emergencies)." |
| ⚖️ Legal | Neutral | "PII vault is on DynamoDB direct, not cached in DAX. Confirmed." |
Alternative rejected: Application-level cache in Redis. Rejected because it duplicates state across two systems and complicates invalidation. DAX is purpose-built for the DynamoDB read pattern.
Caveat captured during decision: SRE flagged that DAX is a band-aid for a partition-key design issue (popular sessions create hotspots). The proper fix is write-sharding the session table. We accepted DAX as a near-term fix with an open follow-up — the follow-up is still not tackled.
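A sketch of the read path with the fall-through the SRE runbook requires: session reads prefer DAX, and a DAX failure degrades to a direct DynamoDB read rather than erroring. The DAX table object is assumed to be API-compatible with the DynamoDB resource, as noted above.

```python
import boto3
from boto3.dynamodb.conditions import Key

ddb_table = boto3.resource("dynamodb").Table("manga_assist_sessions")

def read_session(session_id: str, dax_table=None):
    """Prefer the DAX-cached read; fall through to DynamoDB direct if DAX is unavailable."""
    key_cond = Key("session_id").eq(session_id)
    if dax_table is not None:
        try:
            return dax_table.query(KeyConditionExpression=key_cond)["Items"]
        except Exception:
            pass   # DAX down: degrade to slower reads, don't fail the request
    return ddb_table.query(KeyConditionExpression=key_cond)["Items"]
```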
D-7 — Memory summarizer compresses at 20 turns, preserves entities explicitly
HLD impact: Conversation state isn't unbounded — there's a compression boundary.
LLD impact: Summarizer is a separate Bedrock call; entities (ASINs, series names) are preserved as structured metadata, not just embedded in summary prose.
| Lens | Position | Reasoning |
|---|---|---|
| 🔧 Backend | Supportive | "Without compression, prompts grow unbounded. We had a session hit 60 turns and the prompt was 8K tokens — too expensive." |
| ⚙️ SRE | Cautious | "Summarizer is another LLM call in the critical path. I asked for: only triggers async after turn 20; turn 21 doesn't wait. Backend implemented as background fire-and-forget, with a fallback to summarize synchronously if the async hasn't completed by turn 25." |
| 🛡️ Security | Neutral | "Summarized text is still PII-redacted — same security boundary as raw turns." |
| 🤖 ML | Driver | "I designed this. The key insight: lossy compression is fine for prose but lossy on entities is fatal. The bug we caught (ASINs getting stripped at turn 20) proved this. Now entities are preserved as structured metadata asins_mentioned, separate from summary text." |
| 📊 PM | Cautious | "User stated preferences (genre, format) must survive summarization. Asked for explicit testing on 'I prefer physical not digital' style preferences across the 20-turn boundary." |
| 🎨 Frontend | Neutral | "Doesn't see the summarizer; just sees that long sessions still feel coherent." |
| 💰 Cost | Supportive | "Without compression, average prompt size grows linearly. At 50 turns, prompt cost dominates. Compression saves ~$8K/month at scale." |
| ⚖️ Legal | Neutral | "Compression doesn't change retention obligations — TTL is still on the session, not per-turn." |
Alternative rejected: Sliding window (only last N turns sent to LLM). Rejected because it loses early-conversation context that users frequently reference ("the first one you mentioned").
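A sketch of the trigger logic: the summarizer fires as background work after turn 20 so turn 21 never waits on it, and entities are carried as structured metadata alongside the summary prose rather than trusted to survive inside it. The session object, summarizer call, and persistence helper are hypothetical.

```python
import asyncio

SUMMARIZE_AFTER = 20   # compression boundary in turns

async def maybe_summarize(session, summarize_llm, persist_summary):
    """Fire-and-forget compression at the 20-turn boundary; entities preserved explicitly."""
    if len(session.turns) < SUMMARIZE_AFTER or session.summary_pending:
        return
    session.summary_pending = True

    async def _work():
        prose = await summarize_llm(session.turns)   # lossy on prose is acceptable
        # Lossy on entities is not: carry ASINs as structured metadata, not summary text.
        entities = sorted({a for turn in session.turns for a in turn.asins_mentioned})
        await persist_summary(session.id, summary=prose, asins_mentioned=entities)

    asyncio.create_task(_work())   # turn 21 does not wait on this
```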
D-8 — gRPC for Catalog, REST for everything else
HLD impact: Mixed protocols inside the orchestrator's downstream surface.
LLD impact: Two HTTP clients, two contract-test frameworks (Pact for REST, Buf for protos).
| Lens | Position | Reasoning |
|---|---|---|
| 🔧 Backend | Pragmatic supporter | "gRPC for the highest-volume call (catalog) saves CPU + bandwidth at scale. REST elsewhere keeps integration cost low." |
| ⚙️ SRE | Concerned | "Two protocols = two debugging tools. I asked for X-Ray instrumentation on both, with the same request_id correlation. Done." |
| 🛡️ Security | Neutral | "mTLS works for both. Equivalent security posture." |
| 🤖 ML | Neutral | "Doesn't affect model serving." |
| 📊 PM | Neutral | "Invisible to user." |
| 🎨 Frontend | Neutral | "Backend-only concern." |
| 💰 Cost | Supportive | "At 30K catalog lookups/sec, gRPC saves ~25% CPU on serialization vs. REST+JSON. Real money at scale." |
| ⚖️ Legal | Neutral | "No compliance difference." |
Alternatives considered:
- All-gRPC: Rejected because forcing 5 other teams to add gRPC servers had no business value; their REST APIs were stable and their contract testing already worked.
- All-REST: Rejected because catalog volume is high enough that the serialization cost matters at scale.
Real driver, not in the table: The Catalog team already had a gRPC service; we adopted their existing contract. Decisions like this are rarely greenfield — they bend to the realities of what other teams have built.
D-9 — Active-active multi-region with DynamoDB Global Tables
HLD impact: Two-region deployment, Route 53 latency-routing.
LLD impact: Session writes can occur in either region; eventual consistency on cross-region reads (rare in practice due to sticky routing).
| Lens | Position | Reasoning |
|---|---|---|
| 🔧 Backend | Cautious supporter | "Global tables introduce eventual consistency. We confirmed via testing that session writes in region A are readable in region A immediately (<5ms); cross-region read of fresh write is the only edge case." |
| ⚙️ SRE | Driver | "I pushed for active-active. Active-passive means failover is a manual surgery with risk. Active-active means a region failure auto-shifts via Route 53 health checks. Verified during a chaos test where us-east-1 was simulated as down." |
| 🛡️ Security | Cautious | "Two regions = two attack surfaces. Asked for identical IAM policies and WAF rules in both. Automated this via Terraform modules." |
| 🤖 ML | Cautious | "Bedrock model availability differs by region. We confirmed both us-east-1 and us-west-2 have Sonnet + Haiku + Titan available. If a future model launches us-east-1 only, the routing will need region-awareness." |
| 📊 PM | Strong supporter | "99.9% availability with single-region was achievable but tight. Active-active gives us 99.95+ headroom — the difference is real for users during AWS regional events." |
| 🎨 Frontend | Neutral | "Route 53 handles routing; client just hits a single hostname." |
| 💰 Cost | Concerned | "Roughly 1.7x infrastructure cost for 1.05x availability. Hard sell. Justified by the cost of one regional outage event = 4-6 hours of degraded customer experience during peak. We approved with an SLA-based ROI argument, not a unit-cost argument." |
| ⚖️ Legal | Supportive | "Multi-region is preferred for disaster recovery compliance posture." |
Alternatives rejected:
- Active-passive (warm standby in a second region): Rejected because failover requires manual or automated promotion that's brittle at the moment it's most needed (during a regional outage).
- Single region with multi-AZ only: Rejected because AZ-level failures are rare but region-level events do happen and we're customer-facing during Prime Day.
D-10 — Synchronous downstream fan-out, not event-driven
HLD impact: Orchestration layer holds resources during fan-out; latency is bounded by slowest service.
LLD impact: asyncio.gather with timeouts; circuit breaker on each service; no message bus in the request path.
This is the decision that's most likely to be revisited in a future redesign — see Q18 in 04-interview-qa-deep-dive.md.
| Lens | Position | Reasoning |
|---|---|---|
| 🔧 Backend | Pragmatic supporter | "Synchronous code is straightforward. Event-driven (Step Functions / EventBridge) would be more resilient but harder to reason about." |
| ⚙️ SRE | Concerned but supportive | "Synchronous means timeouts are first-class. I needed every downstream call to have an explicit timeout shorter than the request deadline. Implemented." |
| 🛡️ Security | Neutral | "No security difference." |
| 🤖 ML | Neutral | "Doesn't affect model serving." |
| 📊 PM | Strong supporter | "Synchronous keeps the response coherent — no 'partial response then update' surprises for the user." |
| 🎨 Frontend | Strong supporter | "Streaming is one continuous response. Event-driven would mean either buffering until all events arrive (kills streaming UX) or showing partial responses with deferred updates (confusing UX)." |
| 💰 Cost | Neutral | "Roughly cost-equivalent." |
| ⚖️ Legal | Neutral | "No compliance difference." |
Alternative considered for future redesign: Step Functions for the data-gathering phase, synchronous only for the LLM streaming phase. The decoupling would improve resilience but not improve user-perceived latency for the streaming case. We accepted this as a future investigation, not a current change.
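A sketch of the per-service circuit breaker that wraps each synchronous downstream call: after a run of failures the breaker opens and the orchestrator goes straight to the fallback for a cooldown window instead of spending its request deadline on a service known to be down. Thresholds are illustrative.

```python
import time

class CircuitBreaker:
    """Open after N consecutive failures; allow a half-open probe after a cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        if self.failures < self.failure_threshold:
            return True
        return (time.monotonic() - self.opened_at) > self.cooldown_s  # half-open probe

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()

async def call_with_breaker(breaker: CircuitBreaker, call, fallback):
    """Skip the call entirely while the breaker is open; otherwise record the outcome."""
    if not breaker.allow():
        return fallback
    try:
        result = await call()
        breaker.record(True)
        return result
    except Exception:
        breaker.record(False)
        return fallback
```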
Decision Lens — Summary Heatmap
| Decision | 🔧 BE | ⚙️ SRE | 🛡️ Sec | 🤖 ML | 📊 PM | 🎨 FE | 💰 Cost | ⚖️ Legal |
|---|---|---|---|---|---|---|---|---|
| D-1 WebSocket primary | ✓ | ⚠️ | ✓✓ | – | ✓✓ | ⚠️ | – | – |
| D-2 Two-tier guardrails | ✓ | ✓ | ⚠️ | ✓ | ✓✓ | ⚠️ | – | ⚠️ |
| D-3 Sequential guardrails | ✓✓ | ⚠️ | ✓✓ | – | – | – | – | ✓✓ |
| D-4 Tiered Bedrock | ✓ | ✓✓ | – | ⚠️ | ⚠️ | – | ✓✓✓ | – |
| D-5 Frame seq numbers | ✓ | ✓✓ | ✓ | – | – | ✓✓✓ | – | ✓ |
| D-6 DAX on sessions | ✓ | ⚠️ | – | – | – | – | ⚠️ | – |
| D-7 Memory summarizer | ✓ | ⚠️ | – | ✓✓✓ | ⚠️ | – | ✓ | – |
| D-8 gRPC for Catalog | ✓ | ⚠️ | – | – | – | – | ✓ | – |
| D-9 Active-active | ⚠️ | ✓✓✓ | ⚠️ | ⚠️ | ✓✓ | – | ❌→✓ | ✓ |
| D-10 Sync fan-out | ✓ | ⚠️ | – | – | ✓✓ | ✓✓ | – | – |
Legend: ✓✓✓ driver · ✓✓ strong supporter · ✓ supporter · ⚠️ cautious / had concerns addressed · ❌→✓ initially opposed, accepted with conditions · – neutral / not affected
How to Use This Document in Practice
During design review: Walk down the lens table for any new feature. If a single perspective is missing, you haven't done the review.
During incidents: Pull up the failure domain map (Section 1.6). The first question in the incident channel should be "which domain is this?" — not "what broke?"
During interviews: When asked about a specific decision, walk the perspectives lens. "I had to align Backend, SRE, Security, and Legal. Here's how each saw it…"
For new joiners: Read Part 1 in week 1. Read Part 2 before you write code. Read Part 3 before your first design review.
Related Documents
- 01-api-types-overview.md — The 6 API types built on top of this architecture
- 02-api-testing-strategy.md — How each layer is tested
- 03-scale-testing-scenarios.md — How each layer broke under load
- 04-interview-qa-deep-dive.md — Specific decisions probed in interview format
- 04-offline-testing-quality-strategies.md — Quality testing for the Intelligence layer
- 05-grilling-sessions.md — Hard follow-ups that test depth