00. HLD & LLD — MangaAssist Architecture (Foundation)
"Before any of the API, testing, or scale documents make sense, you need the HLD/LLD scaffolding. This file is the architectural anchor — the same reference an SRE pulls up during an incident, a security engineer reads before a threat-model review, and a new joiner reads on day one. Everything in the other documents references back to here."
How to Read This Document
This document is split into three parts:
| Part | Audience | What's Inside |
|---|---|---|
| Part 1: HLD | Anyone joining the system, interviewers, architects | Context, layered architecture, physical AWS topology, NFRs, capacity model, failure domains |
| Part 2: LLD | Engineers writing code, on-call SREs, security reviewers | Sequence diagrams, state machines, schemas, indexes, concurrency model, observability schema |
| Part 3: Decision Lens | Anyone reviewing or extending the design | Each major decision viewed through 8 stakeholder perspectives — what trade-off was accepted, by whom, against what alternative |
If you want the what, read Part 1. If you want the how, read Part 2. If you want the why, read Part 3.
Part 1: High-Level Design (HLD)
1.1 System Context — Where MangaAssist Lives
flowchart TB
User["👤 Amazon Customer (Web / Mobile)<br/>Browses manga store → opens chat widget → asks Q"]
MA["🧩 <b>MangaAssist</b> (this system)<br/>Conversational shopping assistant for the manga vertical<br/>Inputs: NL + page context + auth state<br/>Outputs: streamed text + product cards + actions"]
User -- "WebSocket / HTTPS" --> MA
subgraph downstream["Downstream domains (owned by 6+ teams across Amazon)"]
Catalog["📚 Catalog<br/>Pricing<br/>Inventory"]
Orders["📦 Orders<br/>Returns<br/>Shipping"]
ML["🤖 ML Platform<br/>Bedrock + SageMaker"]
Promo["🎯 Promotions<br/>Reviews<br/>Personalize"]
Trust["🛡️ Trust & Safety<br/>Legal"]
end
MA --> Catalog
MA --> Orders
MA --> ML
MA --> Promo
MA --> Trust
classDef user fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
classDef system fill:#fff3e0,stroke:#e65100,stroke-width:3px
classDef domain fill:#f3e5f5,stroke:#6a1b9a,stroke-width:1px
class User user
class MA system
class Catalog,Orders,ML,Promo,Trust domain
MangaAssist is a thin orchestration layer. It owns conversation state, intent routing, prompt assembly, response streaming, and the safety pipeline. Every domain it talks about (products, prices, orders, returns) lives in another team's system. The architecture is fundamentally about integration, not data ownership.
1.2 Layered Architecture — Logical View
The system splits cleanly into 5 horizontal layers. Each layer has its own scaling characteristics, failure modes, and ownership.
flowchart TB
L5["<b>Layer 5: Safety / Trust</b><br/>Input validators · Output guardrails (6-stage) · PII vault · Audit log<br/>Latency: <100ms · Failure mode: fail-closed"]
L4["<b>Layer 4: Intelligence (ML Pipeline)</b><br/>DistilBERT → Titan Embeddings → OpenSearch KNN → Cross-Encoder Reranker → Claude LLM<br/>Latency: 600ms–1.5s · Failure mode: per-stage circuit breaker + fallback"]
L3["<b>Layer 3: Orchestration</b><br/>Conversation manager · Prompt assembler · Service fan-out · Memory summarizer · Streaming pump · Correction-frame emitter<br/>Latency: <10ms compute · Failure mode: stateless retry"]
L2["<b>Layer 2: Edge / Transport</b><br/>API Gateway WebSocket · API Gateway REST · ALB sticky sessions · Cognito auth · Token-bucket rate limiter<br/>Latency: <20ms · Failure mode: TCP retry, reconnection"]
L1["<b>Layer 1: State / Data</b><br/>DynamoDB (sessions, audit) · ElastiCache Redis (semantic cache, locks) · DAX (hot sessions) · OpenSearch (vector index)<br/>Latency: 1–50ms · Failure mode: DAX fall-through, on-demand mode"]
L5 --- L4
L4 --- L3
L3 --- L2
L2 --- L1
classDef safety fill:#ffebee,stroke:#c62828,stroke-width:2px
classDef intel fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px
classDef orch fill:#fff3e0,stroke:#e65100,stroke-width:2px
classDef edge fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
classDef data fill:#f3e5f5,stroke:#6a1b9a,stroke-width:2px
class L5 safety
class L4 intel
class L3 orch
class L2 edge
class L1 data
Why this layering matters for everything else in this folder:
- The testing pyramid in 02-api-testing-strategy.md maps onto these layers — unit tests cover Layer 3 and Layer 5 logic, integration tests cross Layers 2↔3↔4, contract tests gate the ML and downstream-service boundaries, E2E tests traverse all 5.
- The scale scenarios in 03-scale-testing-scenarios.md are categorized by which layer broke first (DynamoDB hot partition = Layer 1, Bedrock throttling = Layer 4, WebSocket ceiling = Layer 2).
- The 6 API types in 01-api-types-overview.md are the interfaces between these layers, plus the external surface at Layer 2.
1.3 Physical Architecture — AWS Topology
flowchart TB
R53["🌐 Route 53<br/>(latency-based routing)"]
subgraph east["us-east-1 (primary)"]
direction TB
APIGW_WS["API Gateway<br/>WebSocket"]
APIGW_REST["API Gateway<br/>REST"]
WAF["WAF + Cognito<br/>Authorizer"]
ALB["ALB<br/>(sticky sessions)"]
ECS["ECS Fargate<br/>(orchestrator)"]
Lambda["Lambda<br/>(burst overflow)"]
DDB[("DynamoDB<br/>+ DAX")]
Redis[("ElastiCache<br/>Redis")]
OS[("OpenSearch<br/>Serverless")]
ML["SageMaker + Bedrock<br/>DistilBERT · Titan · Reranker · Claude"]
end
subgraph west["us-west-2 (mirror, active)"]
WestMirror["(same topology)"]
end
Kinesis["Kinesis"]
Firehose["Firehose"]
Redshift[("Redshift<br/>(analytics, PII scrubbed)")]
R53 --> east
R53 --> west
APIGW_WS --> WAF
APIGW_REST --> WAF
WAF --> ALB
ALB --> ECS
ALB --> Lambda
ECS --> DDB
ECS --> Redis
ECS --> OS
ECS --> ML
Lambda --> DDB
Lambda --> Redis
Lambda --> ML
ECS --> Kinesis
Kinesis --> Firehose
Firehose --> Redshift
DDB <-. "Global Tables<br/>replication" .-> WestMirror
classDef edge fill:#e3f2fd,stroke:#1565c0
classDef compute fill:#fff3e0,stroke:#e65100
classDef data fill:#f3e5f5,stroke:#6a1b9a
classDef ml fill:#e8f5e9,stroke:#2e7d32
classDef analytics fill:#fce4ec,stroke:#ad1457
class APIGW_WS,APIGW_REST,WAF,ALB,R53 edge
class ECS,Lambda compute
class DDB,Redis,OS data
class ML ml
class Kinesis,Firehose,Redshift analytics
Multi-region: Active-active via Route 53 latency routing. Session state replicates via DynamoDB Global Tables (eventual consistency, <2s replication lag) — acceptable because session lookup is sticky to the connecting region 99%+ of the time.
Compute split — ECS + Lambda: Steady traffic runs on ECS Fargate (predictable cost, fewer cold starts). Burst overflow above ECS auto-scale rate spills to Lambda. The two share the same orchestrator code packaged differently.
1.4 Non-Functional Requirements (NFRs) — Quantified
These are the numerical targets every design choice was measured against. They are not aspirational — they are gating thresholds.
| Category | NFR | Target | How It's Enforced |
|---|---|---|---|
| Latency | P50 first token | < 800ms | CloudWatch alarm, deploy gate |
| | P99 first token | < 1.5s | Per-model latency budget allocation |
| | P99 full response | < 3s | E2E timer in every request span |
| Availability | Steady-state | 99.9% | Multi-AZ + multi-region |
| | Prime Day | 99.9% (no degradation) | Provisioned capacity, scheduled scale-up |
| Scale | Concurrent WS connections | 200K normal, 500K peak | API Gateway limit increase + multi-region |
| | Messages/sec | 5K normal, 50K peak | Per-component capacity model |
| Quality | LLM hallucination rate | < 1% | Offline gate + runtime guardrails |
| | Intent accuracy | ≥ 90% (macro F1) | Pre-deploy gate |
| | Guardrail pass rate | ≥ 98% on adversarial set | Security CI |
| Cost | Cost per resolved conversation | < $0.025 | Weekly cost dashboard |
| | LLM share of cost | < 60% | Routing + caching + Haiku fallback |
| Compliance | PII in analytics pipeline | Zero | Comprehend scan on every record |
| | Right-to-be-forgotten SLA | < 30 days | Automated deletion pipeline |
| Security | Prompt injection block rate | 100% on test suite | Adversarial CI on every commit |
Trade-off principle: When NFRs conflict (e.g., latency vs. quality, cost vs. availability), the priority order is Safety > Availability > Latency > Quality > Cost. This is documented because real incidents required leaning on this order — see Decision D-2 in Part 3.
1.5 Capacity Model
| Tier | Concurrent Sessions | Msgs/sec | LLM Calls/sec | Reasoning |
|---|---|---|---|---|
| Normal weekday | 50K | 5K | 3K | 60% intents need LLM (40% short-circuited by templates / cache) |
| Peak weekday | 100K | 12K | 7K | Evening browse traffic |
| Prime Day | 500K | 50K | 30K | 10x spike with bursty distribution |
| Degraded mode | 500K | 50K | 5K | LLM throttled — Haiku fallback + cache + escalation |
Sizing rule: Provision for 75th percentile of daily peak; rely on auto-scaling + fallback for the rest. Provisioning for the absolute peak wastes ~$80K/month in idle capacity at off-peak hours.
1.6 Failure Domain Map
A failure domain is the blast radius of a single component breaking. The architecture is designed so no single domain takes down everything.
flowchart LR
subgraph A["Domain A: Edge"]
A1["<b>Failure</b><br/>WebSocket connections drop"]
A2["<b>Mitigation</b><br/>REST fallback<br/>multi-region failover"]
A3["<b>Worst case</b><br/>chat unavailable<br/>for one region ~30s"]
A1 --> A2 --> A3
end
subgraph B["Domain B: Bedrock"]
B1["<b>Failure</b><br/>LLM throttling or outage"]
B2["<b>Mitigation</b><br/>Haiku fallback · cached responses · escalation"]
B3["<b>Worst case</b><br/>chat works but<br/>quality degrades ~2h"]
B1 --> B2 --> B3
end
subgraph C["Domain C: SageMaker"]
C1["<b>Failure</b><br/>classifier or reranker down"]
C2["<b>Mitigation</b><br/>Stage-1 regex (40% of intents)<br/>raw cosine sim fallback"]
C3["<b>Worst case</b><br/>10–15% routing<br/>accuracy drop"]
C1 --> C2 --> C3
end
subgraph D["Domain D: DynamoDB"]
D1["<b>Failure</b><br/>session table throttle / down"]
D2["<b>Mitigation</b><br/>DAX hot-reads · on-demand mode"]
D3["<b>Worst case</b><br/>new sessions can't<br/>be created ~1 min"]
D1 --> D2 --> D3
end
subgraph E["Domain E: Catalog / Orders / etc."]
E1["<b>Failure</b><br/>downstream service down"]
E2["<b>Mitigation</b><br/>per-service circuit breaker<br/>graceful degradation"]
E3["<b>Worst case</b><br/>that intent path<br/>returns partial answer"]
E1 --> E2 --> E3
end
subgraph F["Domain F: Guardrails"]
F1["<b>Failure</b><br/>a guardrail stage errors"]
F2["<b>Mitigation</b><br/>fail-closed: block response<br/>never fail-open on safety"]
F3["<b>Worst case</b><br/>response blocked,<br/>user sees safe message"]
F1 --> F2 --> F3
end
classDef fail fill:#ffebee,stroke:#c62828
classDef mit fill:#fff3e0,stroke:#e65100
classDef worst fill:#f3e5f5,stroke:#6a1b9a
class A1,B1,C1,D1,E1,F1 fail
class A2,B2,C2,D2,E2,F2 mit
class A3,B3,C3,D3,E3,F3 worst
Cross-domain rule: No two domains share a single point of failure. The orchestrator is the only component in the critical path of every request, and it's deployed redundantly across AZs and regions with stateless instances behind ALB sticky sessions.
Part 2: Low-Level Design (LLD)
2.1 Request Lifecycle — Sequence Diagram
The canonical streaming chat flow:
sequenceDiagram
autonumber
actor U as User
participant FE as Frontend
participant WS as APIGW-WS
participant O as Orchestrator
participant DB as DynamoDB
participant C as Classifier
participant E as Embed
participant K as KNN (OpenSearch)
participant R as Reranker
participant CAT as Catalog
participant BR as Bedrock
participant G as Guardrails
U->>FE: types question
FE->>WS: ws.send(message)
WS->>O: $default
O->>DB: load session (DAX)
DB-->>O: session + history
Note over O: input validate (PII, injection, size) — 5ms
Note over O: Stage 1: regex intent (zero-cost)
par parallel fan-out (T+5ms)
O->>C: classify
C-->>O: intent (T+20ms)
and
O->>E: embed
E-->>O: vector (T+35ms)
and
O->>CAT: catalog batch
CAT-->>O: products
end
O->>K: KNN search (T+35ms)
K-->>O: top-10 chunks (T+75ms)
O->>R: rerank top-10 → top-3
R-->>O: top-3 (T+125ms)
Note over O: Assemble prompt (sys + history + RAG + products) — 5ms
O->>BR: invoke (streaming)
Note over BR,O: TTFT ~500ms
loop streaming tokens
BR-->>O: token chunk
Note over O: inline guardrail<br/>(regex PII + competitor)
O-->>WS: frame
WS-->>FE: token N
FE-->>U: render
end
BR-->>O: [done]
O->>G: full guardrails pipeline (6 stages)
G-->>O: pass / correction
Note over O: products[] injected from live Catalog<br/>(NOT from LLM)
O->>DB: persist turn
O-->>WS: products + done frame
WS-->>FE: render cards
FE-->>U: complete response
Key invariant: prices and ASINs in product cards never come from the LLM. The LLM generates descriptive prose. Structured data (ASIN, price, in_stock, title in the card) comes from catalog APIs at response-assembly time. This makes the price hallucination problem (described in 04-offline-testing-quality-strategies.md) tractable — we only need to defend against inline mentions in the prose, not card data.
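A minimal sketch of that invariant at response-assembly time, assuming hypothetical `catalog_client` and `pricing_client` helpers: candidate ASINs come from the retrieval/context stage, and every structured card field is hydrated from live Catalog and Pricing calls rather than parsed out of the LLM prose.

```python
from dataclasses import dataclass

@dataclass
class ProductCard:
    asin: str
    title: str
    author: str
    price_current: float
    price_list: float
    currency: str
    in_stock: bool
    url: str

async def build_product_cards(candidate_asins, catalog_client, pricing_client):
    """Hydrate every structured field from live services; none of it comes from the LLM."""
    cards = []
    for asin in candidate_asins:                      # ASINs from retrieval/context, not prose
        item = await catalog_client.get_item(asin)    # hypothetical catalog call
        price = await pricing_client.get_price(asin)  # hypothetical pricing call
        if item is None or price is None:
            continue                                  # drop the card rather than guess
        cards.append(ProductCard(
            asin=asin,
            title=item["title"],
            author=item["author"],
            price_current=price["current"],
            price_list=price["list"],
            currency=price["currency"],
            in_stock=item["in_stock"],
            url=f"/dp/{asin}?ref=manga_assist",
        ))
    return cards
```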
2.2 Orchestrator State Machine
stateDiagram-v2
[*] --> IDLE
IDLE --> VALIDATING: message arrives ($default)
VALIDATING --> REJECTED: validation fails<br/>(PII, injection, size)
REJECTED --> [*]: safe response returned
VALIDATING --> CLASSIFYING: pass
CLASSIFYING --> ROUTED: Stage 1 high<br/>confidence (regex)
CLASSIFYING --> DistilBERT: Stage 1 low conf<br/>run Stage 2
DistilBERT --> ROUTED: classified
ROUTED --> FANNING_OUT: parallel<br/>service calls
FANNING_OUT --> DEGRADING: any non-critical<br/>service fails
DEGRADING --> ASSEMBLING: with fallbacks
FANNING_OUT --> ASSEMBLING: all results in<br/>(or timeouts hit)
ASSEMBLING --> STREAMING: prompt ready,<br/>invoke Bedrock
STREAMING --> FAILOVER: Bedrock fails
FAILOVER --> GUARDED: cached / Haiku / escalation
STREAMING --> GUARDED: stream complete
GUARDED --> PERSISTED: pass or<br/>correction frame emitted
PERSISTED --> [*]: emit metrics,<br/>return to IDLE
note right of FANNING_OUT
Each state has a timeout
that triggers transition
to DEGRADING or FAILOVER
end note
Each state has a timeout that triggers transition to DEGRADING or FAILOVER. The state machine is in-memory (per-request), but the state of the conversation (turns, intents, ASINs referenced) is in DynamoDB.
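A minimal sketch of the per-state timeout behaviour, using `asyncio.wait_for` so a stage that overruns its budget hands control to the DEGRADING/FAILOVER path instead of hanging; timeout values and stage coroutines are illustrative, not the production numbers.

```python
import asyncio

# Per-state budgets in milliseconds (illustrative values, not the production numbers).
STATE_TIMEOUTS_MS = {"FANNING_OUT": 300, "STREAMING": 3000}

async def run_stage(state: str, work, on_timeout):
    """Run one state's work; on overrun, hand control to the DEGRADING/FAILOVER path."""
    budget_s = STATE_TIMEOUTS_MS.get(state, 1000) / 1000
    try:
        return await asyncio.wait_for(work, timeout=budget_s)
    except asyncio.TimeoutError:
        return await on_timeout(state)   # e.g. DEGRADING for fan-out, FAILOVER for streaming
```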
2.3 Streaming Protocol — Frame Schema
Every frame on the WebSocket conforms to this discriminated union:
type Frame =
| { type: "token"; seq: number; content: string }
| { type: "products"; seq: number; items: ProductCard[] }
| { type: "actions"; seq: number; buttons: ActionButton[] }
| { type: "correction"; seq: number; replace_seq: number; content: string }
| { type: "follow_up"; seq: number; suggestions: string[] }
| { type: "error"; seq: number; code: string; message: string; recoverable: boolean }
| { type: "done"; seq: number; metadata: ResponseMetadata }
interface ProductCard {
asin: string // validated against catalog
title: string // from catalog, not LLM
author: string // from catalog
format: "Paperback" | "Hardcover" | "Kindle" | "Audible"
price: { current: number, list: number, currency: string } // live pricing API
in_stock: boolean // live inventory
image_url: string
url: string // /dp/{asin}?ref=manga_assist
trust_signals: { listing_score: number, seller_verified: boolean } // post-fraud-incident
}
interface ResponseMetadata {
intent: string
intent_confidence: number
models_used: string[] // ["distilbert", "claude-3-5-sonnet", ...]
rag_chunks_used: number
guardrail_actions: string[] // ["pii_redacted", "scope_redirected", ...]
total_latency_ms: number
ttft_ms: number
cache_hit: boolean
request_id: string // threads through X-Ray
}
Why a discriminated union and not separate WebSocket routes? Frontends are simpler when there's one inbound stream per session. The type field lets the client switch on each frame. This was a frontend-engineer-driven preference and survived three redesign cycles.
Why seq numbers? Discussed in Part 3, Decision D-5 — to detect dropped frames and reconcile correction frames against the original token stream.
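A sketch of the server-side emitter, assuming a per-session monotonic counter: every frame takes the next `seq`, and a correction frame carries the `replace_seq` of the token frame it retracts. Names here are illustrative, not the production module.

```python
import itertools
import json

class FrameEmitter:
    """Emits discriminated-union frames with a per-session monotonic seq."""

    def __init__(self, send):            # `send` pushes one serialized frame to the WebSocket
        self._seq = itertools.count()
        self._send = send

    async def emit(self, frame_type: str, **fields) -> int:
        frame = {"type": frame_type, "seq": next(self._seq), **fields}
        await self._send(json.dumps(frame))
        return frame["seq"]

    async def emit_correction(self, replace_seq: int, content: str) -> int:
        # Retract a previously streamed token frame by its seq.
        return await self.emit("correction", replace_seq=replace_seq, content=content)
```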
2.4 Session Schema (DynamoDB)
Table: manga_assist_sessions · PK = session_id · SK = sk (composite discriminator) · TTL = expires_at (24h rolling) · GSI = customer_id-index (PK customer_id, SK last_active)
erDiagram
META ||--o{ TURN : "has 0..N turns"
META ||--o| PII_VAULT : "has 0..1 vault"
META }o--|| CUSTOMER : "indexed via GSI"
META {
string session_id PK
string sk "= META"
string customer_id "nullable for guest"
string created_at "ISO8601"
string last_active "ISO8601"
string tier "prime|authenticated|guest"
string locale
string ws_connection_id "null when disconnected"
string summary "populated after 20 turns"
number expires_at "TTL"
}
TURN {
string session_id PK
string sk "= TURN#NNNN"
string role "user|assistant"
string content "PII-redacted"
string timestamp
string intent
list asins_mentioned "entity preservation"
list models_used
string response_id "for feedback correlation"
}
PII_VAULT {
string session_id PK
string sk "= PII_VAULT"
bytes kms_encrypted_blob "separate KMS key, audited access"
}
CUSTOMER {
string customer_id PK "GSI partition key"
string last_active "GSI sort key"
}
GSI use case: "show me all my chats" feature + GDPR right-to-be-forgotten (find all sessions for a customer_id, delete them).
Access patterns:
| Pattern | Operation | Frequency | Notes |
|---|---|---|---|
| Load session for next turn | Query PK=session_id, SK begins_with "META" or "TURN#" | Every message | DAX cached |
| Persist new turn | PutItem SK=TURN#{n} | Every response | Idempotent on response_id |
| Summarize on turn 20+ | BatchWriteItem to overwrite TURN#0001..0020 with one TURN#SUMMARY | Per session | See Pillar 5 in offline-testing |
| GDPR delete | Query GSI then BatchWriteItem deletes | On request | <30 day SLA |
| Customer history | Query GSI customer_id-index | Customer view feature | Limited to last 30 days |
Why composite sort key with TURN#NNNN? Read pattern is "give me last N turns in order." Range queries on a sort key that sorts lexically on TURN#0042 work cleanly. Numeric prefixes pad to 4 digits — sufficient because we summarize before reaching 1000 turns.
Why a separate PII vault item? The redacted content field is what the LLM sees; the original PII (an email, address, etc.) is needed only for compliance retrieval. By isolating PII into a separate KMS-encrypted item, the LLM context loading path never touches the encryption key. Reduces blast radius if the orchestrator is ever compromised.
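A sketch of the two hottest access patterns against the schema above, using boto3 with the table and key names from this section; the zero-padded sort key is what makes the range query return turns in order.

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("manga_assist_sessions")

def load_recent_turns(session_id: str, n: int = 10):
    """Last N turns in order: range query over the lexically sorted TURN#NNNN keys."""
    resp = table.query(
        KeyConditionExpression=Key("session_id").eq(session_id)
        & Key("sk").begins_with("TURN#"),
        ScanIndexForward=False,   # newest first
        Limit=n,
    )
    return list(reversed(resp["Items"]))

def persist_turn(session_id: str, turn_no: int, role: str, redacted_content: str,
                 response_id: str, asins_mentioned: list):
    """Write one turn; content is already PII-redacted, raw PII lives in the vault item."""
    table.put_item(Item={
        "session_id": session_id,
        "sk": f"TURN#{turn_no:04d}",   # zero-padded so lexical order matches numeric order
        "role": role,
        "content": redacted_content,
        "response_id": response_id,    # idempotency key for retries
        "asins_mentioned": asins_mentioned,
    })
```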
2.5 Concurrency & Threading Model
Per-request: The orchestrator uses an asyncio event loop. Within a single request:
- Parallel fan-out is asyncio.gather() over service calls.
- Streaming from Bedrock is an async iterator.
- Inline guardrail checks run synchronously per chunk (regex is fast — <1ms per chunk).
- Final guardrail pipeline runs sequentially (see Q6 in interview QA for ordering rationale).
Per-instance: Each ECS task can hold ~500 concurrent WebSocket connections, limited by:
- File descriptor count (we set ulimit to 65K).
- Per-connection memory: ~80KB for buffers + state. 500 conns × 80KB = 40MB — well within limits.
- Bedrock streaming concurrency limit per instance — we allow up to 100 concurrent Bedrock streams per task.
Per-region: Sticky sessions on ALB ensure WebSocket reconnects land on the same task. If a task dies, sticky session is broken, the client reconnects, ALB routes to a new task, the new task loads session from DynamoDB. No state lost.
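A minimal sketch of the per-request fan-out described above: `asyncio.gather` over the parallel calls, each bounded by its own timeout, with non-critical failures degraded to fallback values instead of failing the turn. The service coroutines on `clients` are hypothetical placeholders.

```python
import asyncio

async def bounded(call, timeout_s: float, fallback):
    """Bound one downstream call; a timeout or error degrades to a fallback value."""
    try:
        return await asyncio.wait_for(call, timeout=timeout_s)
    except Exception:
        return fallback    # non-critical failure: degrade this input, keep the turn alive

async def fan_out(message: str, clients):
    """Parallel fan-out for one turn: classifier, embedder, and catalog batch."""
    intent, vector, products = await asyncio.gather(
        bounded(clients.classify(message), 0.15, fallback="unknown"),
        bounded(clients.embed(message), 0.20, fallback=None),
        bounded(clients.catalog_batch(message), 0.30, fallback=[]),
    )
    return intent, vector, products
```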
2.6 Observability Schema
Every request emits a structured event with this shape:
{
"request_id": "req_xyz", // X-Ray trace ID
"session_id": "sess_abc",
"customer_id": "amzn1...",
"timestamp": "2026-04-27T12:34:56Z",
"intent": "recommendation",
"intent_confidence": 0.92,
"intent_source": "distilbert", // or "regex_stage_1"
"rag": {
"chunks_retrieved": 10,
"chunks_used": 3,
"recall_at_3_observed": null // populated only in shadow eval
},
"models": {
"classifier_ms": 15,
"embed_ms": 30,
"knn_ms": 40,
"rerank_ms": 50,
"llm_ttft_ms": 480,
"llm_total_ms": 1100
},
"guardrails": {
"stages_triggered": ["pii_redact"],
"correction_frame_emitted": false,
"block_reason": null
},
"downstream_calls": [
{ "service": "catalog", "ms": 45, "status": 200 },
{ "service": "personalize", "ms": 180, "status": 200 }
],
"cache": { "lookup_hit": false, "stored": true },
"cost_usd": 0.0042,
"outcome": "delivered" // or "blocked", "error", "fallback"
}
Why this shape:
- One row per request → easy aggregation in Redshift (GROUP BY intent, models_used).
- Per-stage timings → P99 alarms can be set per component.
- outcome field → distinguishes user-visible failures from silent fallbacks.
- request_id threads X-Ray, CloudWatch logs, and the Redshift event — single ID for full trace.
Part 3: Decision Lens — Multi-Stakeholder Perspectives
This part is where we think like a group of people. Each major architectural decision is examined through 8 lenses. At Amazon, no design survives one perspective alone — these are the voices that have to align before the design is durable.
The lenses:
| Lens | Question they ask |
|---|---|
| 🔧 Backend Engineer | Is this clean code? Can I test it? What breaks if traffic doubles? |
| ⚙️ SRE / Ops | What pages me at 3am? What's the runbook? How do I roll back? |
| 🛡️ Security Engineer | What's the threat model? What's the blast radius if compromised? |
| 🤖 ML Engineer | Does this constrain or enable model improvements? How do I A/B test? |
| 📊 Product Manager | Does this hurt user experience? Is the trade-off worth the customer impact? |
| 🎨 Frontend Engineer | How does my client handle this? Edge cases on slow networks? |
| 💰 Cost / Finance | What's the unit cost? How does it scale? What's the burn at peak? |
| ⚖️ Legal / Compliance | PII? Liability? Regulatory obligations? |
D-1 — WebSocket as the primary transport (with REST fallback)
HLD impact: Edge layer (Layer 2) is bifurcated — two transports, one orchestrator.
LLD impact: Frame protocol must work over both; seq numbers, idempotency on response_id, REST returns batched frames.
| Lens | Position | What they pushed for / against |
|---|---|---|
| 🔧 Backend | Supportive | Clean separation — one orchestrator, two transport adapters. Wanted to reject HTTP/2 SSE because of the upstream-message use case |
| ⚙️ SRE | Cautious supporter | WebSocket connections are stateful and harder to drain on deploy. Required us to design connection draining: stop accepting new conns 60s before deploy, let in-flight finish, force-close after grace period |
| 🛡️ Security | Strong supporter | Long-lived authenticated connection means auth happens once at $connect, not per-message. Cuts auth surface — but required strict origin validation and token expiry handling on long-lived sessions |
| 🤖 ML | Neutral | Doesn't affect model serving. Mild preference for WebSocket because TTFT is more visible to user and improves perceived quality of streaming models |
| 📊 PM | Strong supporter | Streaming is the product. Survey data showed users perceive WebSocket-streamed responses as "smarter" even when content is identical |
| 🎨 Frontend | Cautious supporter | More complex client code (reconnect logic, frame ordering). Asked for the REST fallback as their safety net for corporate networks blocking WebSocket |
| 💰 Cost | Neutral | API Gateway WebSocket charges per message and per connection-minute. Modeled at scale: ~$0.0008/conversation. Acceptable |
| ⚖️ Legal | Neutral | No new compliance issues vs. REST |
Alternatives considered & rejected:
- Pure REST (no streaming): Rejected on UX. Users abandon at >1.5s perceived latency. Polling adds even more overhead.
- Server-Sent Events: Rejected because we need upstream events from the client (page navigation, typing indicator) on the same channel.
- gRPC streaming on the edge: Rejected because browser support is poor; gRPC-Web requires a proxy and we'd lose the bidirectional benefit.
How it would have failed differently:
- Pure REST: scale would have been easier (no connection ceiling), but A/B simulation projected a ~12% conversion-rate drop due to perceived latency.
- SSE: 90% of the win, but the upstream channel would have been a separate REST endpoint — state split across two transports, harder debugging.
D-2 — Streaming guardrails are two-tier (inline + post-stream), not full pre-stream
HLD impact: Safety layer is split — inline runs during streaming, full pipeline runs after. Correction frame mechanism is required.
LLD impact: Two guardrail invocation points; correction frame in the protocol; sequence numbers on tokens.
This is the most contested decision in the system. It pits Safety (Legal, Security) against UX (Product, Frontend).
| Lens | Position | Quote |
|---|---|---|
| 🔧 Backend | Pragmatic supporter | "Buffering until guardrails finish kills the streaming UX. Two-tier is a clean compromise — fast checks inline, slow checks after." |
| ⚙️ SRE | Supporter | "I get to alarm on correction_frame_emitted rate. If it spikes, I know guardrails are blocking content the LLM is increasingly generating. That's a leading indicator." |
| 🛡️ Security | Reluctant supporter | "I'd prefer pre-stream. But inline regex catches 90% of what worries me (PII, competitors). The remaining 10% — price hallucination, scope drift — fires <0.5%, and the audit log captures it. I accepted this with the condition that ASIN/price corrections trigger a hard correction frame, not a soft retraction." |
| 🤖 ML | Supporter | "Pre-stream means I can't show the user any tokens until the LLM is fully done — that's 1500ms of buffering. Two-tier preserves the streaming benefit." |
| 📊 PM | Strong supporter | "Without streaming, the chatbot loses its 'feels smart' edge. We measured this in user research." |
| 🎨 Frontend | Cautious | "Correction frames are awkward to render. I asked for them to be infrequent enough that we don't need a complex 'retract' animation. Backend committed to <0.5% rate." |
| 💰 Cost | Neutral | "Roughly cost-neutral — same number of guardrail invocations." |
| ⚖️ Legal | Reluctant supporter | "Reviewed and accepted on the condition that: (1) every correction frame is logged with full audit trail, (2) wrong prices are NEVER streamed inline (LLM is prompted not to mention prices in prose; product cards inject from catalog), (3) any user complaint about a wrong price is retrievable from logs within 1 hour." |
Alternatives rejected:
- Full pre-stream guardrails: Rejected on UX (3s perceived latency for first token).
- No inline guardrails (only post-stream): Rejected on safety — Legal would not accept PII potentially streaming to user even briefly.
- Buffering only price/PII tokens: Rejected as unimplementable — token boundaries don't align with semantic boundaries; you can't know "$" is a price prefix until the next token arrives.
Hidden constraint that shaped the decision: Product cards (ASIN, price, title) are injected from live catalog after the prose stream, never generated by the LLM. This is what made the two-tier acceptable to Legal — the highest-risk fields (price, ASIN) bypass the LLM entirely.
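A sketch of the two-tier shape this decision describes, reusing the frame-emitter idea from Section 2.3: a cheap regex pass runs inline on every chunk as it streams, and the full six-stage pipeline runs once on the complete text, emitting a hard correction frame only when it catches something the inline pass missed. The patterns and the pipeline's verdict shape are illustrative.

```python
import re

INLINE_PATTERNS = [
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),    # crude email/PII catch (illustrative)
    re.compile(r"\bsome-competitor\b", re.I),      # placeholder competitor-mention check
]

def inline_check(chunk: str) -> str:
    """Tier 1: fast per-chunk redaction while tokens stream (regex, <1ms)."""
    for pattern in INLINE_PATTERNS:
        chunk = pattern.sub("[redacted]", chunk)
    return chunk

async def stream_with_guardrails(token_stream, emitter, full_pipeline):
    """Tier 2 runs once on the full text; a failure triggers a hard correction frame."""
    seqs, parts = [], []
    async for chunk in token_stream:               # tokens from the LLM stream
        safe = inline_check(chunk)
        seqs.append(await emitter.emit("token", content=safe))
        parts.append(safe)
    verdict = await full_pipeline("".join(parts))  # six-stage post-stream pass
    if not verdict.passed:
        await emitter.emit_correction(
            replace_seq=seqs[verdict.offending_chunk],  # which token frame to retract
            content=verdict.replacement,
        )
```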
D-3 — Sequential, not parallel, guardrails pipeline
HLD impact: Safety layer is a pipe, not a fan-out.
LLD impact: Each stage takes the previous stage's output as input. Total latency is sum, not max.
| Lens | Position | Reasoning |
|---|---|---|
| 🔧 Backend | Strong supporter | "Stages mutate the response. PII redaction changes text. If toxicity ran in parallel on the original text, results don't compose. Sequential = correctness." |
| ⚙️ SRE | Mild concern | "100ms total is in the latency budget. I asked for per-stage timings in the metrics so I can tell which stage is the slow one if total spikes." |
| 🛡️ Security | Strong supporter | "PII MUST run first. If I run it last, anything that side-effects (logging, caching) sees PII. Order is a security control." |
| 🤖 ML | Neutral | "Doesn't affect model behavior." |
| 📊 PM | Neutral | "Acceptable as long as 100ms total is real." |
| 🎨 Frontend | Neutral | "Doesn't affect me." |
| 💰 Cost | Neutral | "Cost-equivalent to parallel — same number of API calls and CPU cycles." |
| ⚖️ Legal | Strong supporter | "PII redaction first is non-negotiable — same reasoning as Security." |
Alternative rejected: Parallel execution with a final reconciliation step. Would save ~60ms but introduces a "reconciliation correctness" problem (e.g., did the toxicity filter act on text that PII later removed?). The 60ms saving wasn't worth the verification cost.
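A sketch of the composition property Backend and Security are describing: each stage consumes the previous stage's output, so running PII redaction first guarantees that no later stage, and nothing a later stage logs or caches, ever sees raw PII. The stage bodies are placeholders.

```python
from typing import Callable

Stage = Callable[[str], str]

def run_guardrails(text: str, stages: list) -> str:
    """Sequential pipe: the output of stage i is the input of stage i+1."""
    for stage in stages:
        text = stage(text)   # each stage may mutate the text (redact, rewrite, block)
    return text

# Order is a security control: PII redaction runs before anything that can log or cache text.
PIPELINE = [
    lambda t: t.replace("user@example.com", "[pii]"),  # pii_redact (illustrative stand-in)
    lambda t: t,                                       # toxicity check (placeholder)
    lambda t: t,                                       # scope / competitor check (placeholder)
]
```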
D-4 — Provisioned Throughput for Sonnet, on-demand for Haiku
HLD impact: Intelligence layer has a tiered Bedrock plan, not a single model.
LLD impact: Routing logic in orchestrator picks model by intent + user tier. Cost dashboards split provisioned vs. on-demand spend.
| Lens | Position | Reasoning |
|---|---|---|
| 🔧 Backend | Supportive | "Routing logic is straightforward — intent → model map plus a fallback rule." |
| ⚙️ SRE | Strong supporter | "Provisioned eliminates throttling for our baseline. I sleep better." |
| 🛡️ Security | Neutral | "Same model, same security posture. Routing decisions are logged for audit." |
| 🤖 ML | Cautious | "Routing changes the eval surface — I now have to evaluate Haiku on simple intents AND Sonnet on complex intents AND verify the boundary. I asked for offline evaluation showing Haiku is acceptable for routed-down intents (chitchat, simple FAQ). It is — but the gate is now a per-model gate." |
| 📊 PM | Cautious supporter | "I worried about quality drop on routed-down intents. ML showed no measurable user-perceived difference for the routed categories. Approved." |
| 🎨 Frontend | Neutral | "Doesn't affect me." |
| 💰 Cost | Strong driver | "This is a $35K/month saving. Provisioned baseline + on-demand burst is 35% cheaper than pure on-demand at our volume. It's also smoother — fewer billing surprises." |
| ⚖️ Legal | Neutral | "Both models are in our existing Bedrock agreement." |
Alternative rejected: Pure on-demand across both models. Rejected on cost ($35K/month) and reliability (no guaranteed throughput during AWS-wide pressure events).
Open risk: Provisioned has a 1-month minimum commitment. If traffic drops permanently (unlikely but possible), we eat the cost until expiry. Mitigation: weekly utilization review, conservative provisioning at P75 of daily peak.
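A sketch of the routing rule this decision implies: an intent-to-model map with a throttle fallback to Haiku. Intent names and model identifiers are illustrative and would need to match the real intent taxonomy and Bedrock model IDs.

```python
SONNET = "anthropic.claude-3-5-sonnet"   # provisioned-throughput baseline (illustrative ID)
HAIKU = "anthropic.claude-3-haiku"       # on-demand burst / fallback (illustrative ID)

SIMPLE_INTENTS = {"chitchat", "simple_faq", "order_status"}   # illustrative routed-down set

def pick_model(intent: str, provisioned_saturated: bool) -> str:
    """Route simple intents to Haiku; fall back to Haiku when Sonnet capacity is saturated."""
    if intent in SIMPLE_INTENTS:
        return HAIKU
    return HAIKU if provisioned_saturated else SONNET
```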
D-5 — Sequence numbers on every WebSocket frame
HLD impact: Protocol-level — every frame is identifiable.
LLD impact: Client must track expected seq; backend must guarantee monotonic increase per session.
| Lens | Position | Reasoning |
|---|---|---|
| 🔧 Backend | Supportive | "Easy to implement. Counter per session." |
| ⚙️ SRE | Strong supporter | "When users report 'response cut off,' I can trace by seq gaps. Without seq, I'd have no signal." |
| 🛡️ Security | Mild supporter | "Helps with replay-attack analysis on connection logs." |
| 🤖 ML | Neutral | "Doesn't affect model serving." |
| 📊 PM | Neutral | "Doesn't affect product behavior unless something goes wrong." |
| 🎨 Frontend | Driver | "I asked for this. Without seq, I can't reconcile correction frames against the original stream — which token am I retracting? Also, on flaky networks, I detect dropped frames and can request resend of structured data." |
| 💰 Cost | Neutral | "4 extra bytes per frame. Negligible." |
| ⚖️ Legal | Neutral | "Useful for audit retrieval — 'show me the exact frames sent at this timestamp.'" |
Alternative rejected: Implicit ordering (TCP guarantees order). Rejected because TCP order doesn't help when the client loses a frame due to a network glitch or browser tab pause.
D-6 — DAX in front of DynamoDB for sessions
HLD impact: Data layer has a cache that's part of the read path, not just an optimization.
LLD impact: Reads go through the DAX client; writes go through the same client as write-through, which keeps the item cache coherent (DAX does not see writes made directly to DynamoDB).
| Lens | Position | Reasoning |
|---|---|---|
| 🔧 Backend | Supportive | "DAX client is API-compatible with DynamoDB SDK. Drop-in." |
| ⚙️ SRE | Cautious supporter | "Adds a failure domain. I asked for DAX-down runbook: app falls through to DynamoDB direct, performance degrades but doesn't break. Verified in chaos testing." |
| 🛡️ Security | Neutral | "DAX is in our VPC, encrypted in transit. Same posture as DynamoDB." |
| 🤖 ML | Neutral | "Doesn't affect model serving." |
| 📊 PM | Neutral | "Invisible to user." |
| 🎨 Frontend | Neutral | "Invisible to user." |
| 💰 Cost | Cautious | "+$200/month. Acceptable for the engineering cost it eliminates (no more capacity planning emergencies)." |
| ⚖️ Legal | Neutral | "PII vault is on DynamoDB direct, not cached in DAX. Confirmed." |
Alternative rejected: Application-level cache in Redis. Rejected because it duplicates state across two systems and complicates invalidation. DAX is purpose-built for the DynamoDB read pattern.
Caveat captured during decision: SRE flagged that DAX is a band-aid for a partition-key design issue (popular sessions create hotspots). The proper fix is write-sharding the session table. We accepted DAX as a near-term fix with an open follow-up — the follow-up is still not tackled.
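A sketch of the read path with the fall-through the SRE runbook requires: session reads prefer DAX, and a DAX failure degrades to a direct DynamoDB read rather than erroring. The DAX table object is assumed to be API-compatible with the DynamoDB resource, as noted above.

```python
import boto3
from boto3.dynamodb.conditions import Key

ddb_table = boto3.resource("dynamodb").Table("manga_assist_sessions")

def read_session(session_id: str, dax_table=None):
    """Prefer the DAX-cached read; fall through to DynamoDB direct if DAX is unavailable."""
    key_cond = Key("session_id").eq(session_id)
    if dax_table is not None:
        try:
            return dax_table.query(KeyConditionExpression=key_cond)["Items"]
        except Exception:
            pass   # DAX down: degrade to slower reads, don't fail the request
    return ddb_table.query(KeyConditionExpression=key_cond)["Items"]
```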
D-7 — Memory summarizer compresses at 20 turns, preserves entities explicitly
HLD impact: Conversation state isn't unbounded — there's a compression boundary.
LLD impact: Summarizer is a separate Bedrock call; entities (ASINs, series names) are preserved as structured metadata, not just embedded in summary prose.
| Lens | Position | Reasoning |
|---|---|---|
| 🔧 Backend | Supportive | "Without compression, prompts grow unbounded. We had a session hit 60 turns and the prompt was 8K tokens — too expensive." |
| ⚙️ SRE | Cautious | "Summarizer is another LLM call in the critical path. I asked for: only triggers async after turn 20; turn 21 doesn't wait. Backend implemented as background fire-and-forget, with a fallback to summarize synchronously if the async hasn't completed by turn 25." |
| 🛡️ Security | Neutral | "Summarized text is still PII-redacted — same security boundary as raw turns." |
| 🤖 ML | Driver | "I designed this. The key insight: lossy compression is fine for prose but lossy on entities is fatal. The bug we caught (ASINs getting stripped at turn 20) proved this. Now entities are preserved as structured metadata asins_mentioned, separate from summary text." |
| 📊 PM | Cautious | "User stated preferences (genre, format) must survive summarization. Asked for explicit testing on 'I prefer physical not digital' style preferences across the 20-turn boundary." |
| 🎨 Frontend | Neutral | "Doesn't see the summarizer; just sees that long sessions still feel coherent." |
| 💰 Cost | Supportive | "Without compression, average prompt size grows linearly. At 50 turns, prompt cost dominates. Compression saves ~$8K/month at scale." |
| ⚖️ Legal | Neutral | "Compression doesn't change retention obligations — TTL is still on the session, not per-turn." |
Alternative rejected: Sliding window (only last N turns sent to LLM). Rejected because it loses early-conversation context that users frequently reference ("the first one you mentioned").
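A sketch of the trigger logic: the summarizer fires as background work after turn 20 so turn 21 never waits on it, and entities are carried as structured metadata alongside the summary prose rather than trusted to survive inside it. The session object, summarizer call, and persistence helper are hypothetical.

```python
import asyncio

SUMMARIZE_AFTER = 20   # compression boundary in turns

async def maybe_summarize(session, summarize_llm, persist_summary):
    """Fire-and-forget compression at the 20-turn boundary; entities preserved explicitly."""
    if len(session.turns) < SUMMARIZE_AFTER or session.summary_pending:
        return
    session.summary_pending = True

    async def _work():
        prose = await summarize_llm(session.turns)   # lossy on prose is acceptable
        # Lossy on entities is not: carry ASINs as structured metadata, not summary text.
        entities = sorted({a for turn in session.turns for a in turn.asins_mentioned})
        await persist_summary(session.id, summary=prose, asins_mentioned=entities)

    asyncio.create_task(_work())   # turn 21 does not wait on this
```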
D-8 — gRPC for Catalog, REST for everything else
HLD impact: Mixed protocols inside the orchestrator's downstream surface.
LLD impact: Two HTTP clients, two contract-test frameworks (Pact for REST, Buf for protos).
| Lens | Position | Reasoning |
|---|---|---|
| 🔧 Backend | Pragmatic supporter | "gRPC for the highest-volume call (catalog) saves CPU + bandwidth at scale. REST elsewhere keeps integration cost low." |
| ⚙️ SRE | Concerned | "Two protocols = two debugging tools. I asked for X-Ray instrumentation on both, with the same request_id correlation. Done." |
| 🛡️ Security | Neutral | "mTLS works for both. Equivalent security posture." |
| 🤖 ML | Neutral | "Doesn't affect model serving." |
| 📊 PM | Neutral | "Invisible to user." |
| 🎨 Frontend | Neutral | "Backend-only concern." |
| 💰 Cost | Supportive | "At 30K catalog lookups/sec, gRPC saves ~25% CPU on serialization vs. REST+JSON. Real money at scale." |
| ⚖️ Legal | Neutral | "No compliance difference." |
Alternatives considered:
- All-gRPC: Rejected because forcing 5 other teams to add gRPC servers had no business value; their REST APIs were stable and their contract testing already worked.
- All-REST: Rejected because catalog volume is high enough that the serialization cost matters at scale.
Real driver, not in the table: The Catalog team already had a gRPC service; we adopted their existing contract. Decisions like this are rarely greenfield — they bend to the realities of what other teams have built.
D-9 — Active-active multi-region with DynamoDB Global Tables
HLD impact: Two-region deployment, Route 53 latency-routing.
LLD impact: Session writes can occur in either region; eventual consistency on cross-region reads (rare in practice due to sticky routing).
| Lens | Position | Reasoning |
|---|---|---|
| 🔧 Backend | Cautious supporter | "Global tables introduce eventual consistency. We confirmed via testing that session writes in region A are readable in region A immediately (<5ms); cross-region read of fresh write is the only edge case." |
| ⚙️ SRE | Driver | "I pushed for active-active. Active-passive means failover is a manual surgery with risk. Active-active means a region failure auto-shifts via Route 53 health checks. Verified during a chaos test where us-east-1 was simulated as down." |
| 🛡️ Security | Cautious | "Two regions = two attack surfaces. Asked for identical IAM policies and WAF rules in both. Automated this via Terraform modules." |
| 🤖 ML | Cautious | "Bedrock model availability differs by region. We confirmed both us-east-1 and us-west-2 have Sonnet + Haiku + Titan available. If a future model launches us-east-1 only, the routing will need region-awareness." |
| 📊 PM | Strong supporter | "99.9% availability with single-region was achievable but tight. Active-active gives us 99.95+ headroom — the difference is real for users during AWS regional events." |
| 🎨 Frontend | Neutral | "Route 53 handles routing; client just hits a single hostname." |
| 💰 Cost | Concerned | "Roughly 1.7x infrastructure cost for 1.05x availability. Hard sell. Justified by the cost of one regional outage event = 4-6 hours of degraded customer experience during peak. We approved with an SLA-based ROI argument, not a unit-cost argument." |
| ⚖️ Legal | Supportive | "Multi-region is preferred for disaster recovery compliance posture." |
Alternatives rejected:
- Active-passive (warm standby in a second region): Rejected because failover requires manual or automated promotion that's brittle at the moment it's most needed (during a regional outage).
- Single region with multi-AZ only: Rejected because AZ-level failures are rare but region-level events do happen and we're customer-facing during Prime Day.
D-10 — Synchronous downstream fan-out, not event-driven
HLD impact: Orchestration layer holds resources during fan-out; latency is bounded by slowest service.
LLD impact: asyncio.gather with timeouts; circuit breaker on each service; no message bus in the request path.
This is the decision that's most likely to be revisited in a future redesign — see Q18 in 04-interview-qa-deep-dive.md.
| Lens | Position | Reasoning |
|---|---|---|
| 🔧 Backend | Pragmatic supporter | "Synchronous code is straightforward. Event-driven (Step Functions / EventBridge) would be more resilient but harder to reason about." |
| ⚙️ SRE | Concerned but supportive | "Synchronous means timeouts are first-class. I needed every downstream call to have an explicit timeout shorter than the request deadline. Implemented." |
| 🛡️ Security | Neutral | "No security difference." |
| 🤖 ML | Neutral | "Doesn't affect model serving." |
| 📊 PM | Strong supporter | "Synchronous keeps the response coherent — no 'partial response then update' surprises for the user." |
| 🎨 Frontend | Strong supporter | "Streaming is one continuous response. Event-driven would mean either buffering until all events arrive (kills streaming UX) or showing partial responses with deferred updates (confusing UX)." |
| 💰 Cost | Neutral | "Roughly cost-equivalent." |
| ⚖️ Legal | Neutral | "No compliance difference." |
Alternative considered for future redesign: Step Functions for the data-gathering phase, synchronous only for the LLM streaming phase. The decoupling would improve resilience but not improve user-perceived latency for the streaming case. We accepted this as a future investigation, not a current change.
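A sketch of the per-service circuit breaker that wraps each synchronous downstream call: after a run of failures the breaker opens and the orchestrator goes straight to the fallback for a cooldown window instead of spending its request deadline on a service known to be down. Thresholds are illustrative.

```python
import time

class CircuitBreaker:
    """Open after N consecutive failures; allow a half-open probe after a cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        if self.failures < self.failure_threshold:
            return True
        return (time.monotonic() - self.opened_at) > self.cooldown_s  # half-open probe

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()

async def call_with_breaker(breaker: CircuitBreaker, call, fallback):
    """Skip the call entirely while the breaker is open; otherwise record the outcome."""
    if not breaker.allow():
        return fallback
    try:
        result = await call()
        breaker.record(True)
        return result
    except Exception:
        breaker.record(False)
        return fallback
```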
Decision Lens — Summary Heatmap
| Decision | 🔧 BE | ⚙️ SRE | 🛡️ Sec | 🤖 ML | 📊 PM | 🎨 FE | 💰 Cost | ⚖️ Legal |
|---|---|---|---|---|---|---|---|---|
| D-1 WebSocket primary | ✓ | ⚠️ | ✓✓ | – | ✓✓ | ⚠️ | – | – |
| D-2 Two-tier guardrails | ✓ | ✓ | ⚠️ | ✓ | ✓✓ | ⚠️ | – | ⚠️ |
| D-3 Sequential guardrails | ✓✓ | ⚠️ | ✓✓ | – | – | – | – | ✓✓ |
| D-4 Tiered Bedrock | ✓ | ✓✓ | – | ⚠️ | ⚠️ | – | ✓✓✓ | – |
| D-5 Frame seq numbers | ✓ | ✓✓ | ✓ | – | – | ✓✓✓ | – | ✓ |
| D-6 DAX on sessions | ✓ | ⚠️ | – | – | – | – | ⚠️ | – |
| D-7 Memory summarizer | ✓ | ⚠️ | – | ✓✓✓ | ⚠️ | – | ✓ | – |
| D-8 gRPC for Catalog | ✓ | ⚠️ | – | – | – | – | ✓ | – |
| D-9 Active-active | ⚠️ | ✓✓✓ | ⚠️ | ⚠️ | ✓✓ | – | ❌→✓ | ✓ |
| D-10 Sync fan-out | ✓ | ⚠️ | – | – | ✓✓ | ✓✓ | – | – |
Legend: ✓✓✓ driver · ✓✓ strong supporter · ✓ supporter · ⚠️ cautious / had concerns addressed · ❌→✓ initially opposed, accepted with conditions · – neutral / not affected
How to Use This Document in Practice
During design review: Walk down the lens table for any new feature. If a single perspective is missing, you haven't done the review.
During incidents: Pull up the failure domain map (Section 1.6). The first question in the incident channel should be "which domain is this?" — not "what broke?"
During interviews: When asked about a specific decision, walk the perspectives lens. "I had to align Backend, SRE, Security, and Legal. Here's how each saw it…"
For new joiners: Read Part 1 in week 1. Read Part 2 before you write code. Read Part 3 before your first design review.
Related Documents
- 01-api-types-overview.md — The 6 API types built on top of this architecture
- 02-api-testing-strategy.md — How each layer is tested
- 03-scale-testing-scenarios.md — How each layer broke under load
- 04-interview-qa-deep-dive.md — Specific decisions probed in interview format
- 04-offline-testing-quality-strategies.md — Quality testing for the Intelligence layer
- 05-grilling-sessions.md — Hard follow-ups that test depth