4c. WebSocket API - Design Space and Prototype Spec
This document explores the design space for the MangaAssist chat WebSocket API before implementation. It records the decisions needed up front, the limits of each direction, and what a laptop prototype can and cannot verify. Read this before writing any code.
Purpose
The HLD (§4) and LLD-7 specify a chat transport built on POST /chat/message + WebSocket delivery. This document expands that choice into its surrounding design space so the prototype is built with full awareness of the alternatives and their tradeoffs.
Axis 1: Transport Topology
Four ways to move a user message in and an LLM response out.
| Direction | How it works | Limits |
|---|---|---|
| A. Pure bidirectional WS | Client sends message frame over WS, server streams reply frames back on same socket | Blocked by some corporate proxies; WS drop equals lost in-flight request; auth must happen in first frame |
| B. HTTP submit + WS deliver (HLD choice) | POST /chat/message returns 202 + response_id; server pushes delta and completed events over WS keyed by session_id | Two round trips; needs correlation by response_id; requires GET /chat/message/{id} fallback path |
| C. HTTP submit + SSE deliver | Same as B but Server-Sent Events instead of WS | One-way only (fine for chat); no binary; some old proxies buffer SSE and break streaming |
| D. HTTP submit + long-poll | Client polls GET /chat/message/{id} until done | Higher latency, more requests; no real streaming UX, tokens arrive in batches |
Why the HLD picked B
- At scale, a meaningful share of client traffic traverses corporate proxies that mangle WS; the HTTP submit path always gets the message in.
- Auth is already an HTTP concern (Amazon session token, rate limits).
- The response_id pattern lets the LLM keep generating server-side even if the client's WS drops.
- GET /chat/message/{id} gives a clean fallback for WS-hostile networks without redesigning anything.
Axis 2: Streaming Granularity
How small should each streamed frame be?
| Direction | Spec | Limits |
|---|---|---|
| Token-level deltas | One frame per LLM token (~4 chars) | Many tiny frames; per-frame WS overhead matters; UI must handle 50+ msgs/sec |
| Chunk deltas | Batch tokens into ~20-50 char chunks | Slight perceived lag but much less overhead |
| Sentence or event deltas | One frame per semantic unit (sentence, product card) | Feels less "live" but easier to reason about; good for structured responses with products and actions |
| Full response only | No streaming | Defeats the purpose; users wait 2-5 seconds staring at nothing |
The HLD implies token-level (chat.response.delta with a delta string), which matches how Claude's SDK streams natively. For a prototype that mixes prose with structured product cards, a hybrid is worth considering: stream the response_text token-by-token, but emit products and actions once at the end inside chat.response.completed.
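The chunk-delta option above amounts to a small batching layer between the model's token stream and the WS frames. A minimal sketch, assuming an async iterator of token strings (the function names and the demo text are illustrative, not from the HLD):

```python
import asyncio
from typing import AsyncIterator

async def chunked_deltas(tokens: AsyncIterator[str], min_chars: int = 30):
    """Batch token-level deltas into ~min_chars chunks to cut per-frame
    WS overhead. Each yielded string is one chat.response.delta's worth."""
    buf = ""
    async for tok in tokens:
        buf += tok
        if len(buf) >= min_chars:
            yield buf
            buf = ""
    if buf:
        yield buf  # flush the tail at end of stream

async def demo():
    # Stand-in for a real LLM token stream.
    async def fake_tokens():
        for tok in ["Our ", "top ", "pick ", "is ", "volume ", "3 ", "of ", "the ", "series."]:
            yield tok
    return [chunk async for chunk in chunked_deltas(fake_tokens(), min_chars=10)]
```

Setting `min_chars=1` degenerates to token-level streaming, so the prototype can compare both granularities with one code path.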
Axis 3: Session and Connection Lifecycle
Who owns the WebSocket and for how long?
| Direction | Spec | Limits |
|---|---|---|
| One WS per session (HLD) | WS opens at /chat/init, persists across many POST /chat/message calls | Server needs a {session_id -> websocket} registry; sticky sessions required if multi-replica |
| One WS per request | New WS per message, closes when completed fires | No registry needed but higher connection churn, worse for mobile |
| Multiplexed WS across sessions | One WS carries multiple sessions | Overkill for a prototype |
For a laptop prototype running in a single process, the per-session model has zero cost — it is simply a dict[str, WebSocket] in memory.
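That in-memory registry is worth sketching, because the one subtlety is reconnect handling: a late close event from a stale socket must not evict the fresh one. A minimal sketch (`Any` stands in for the framework's WebSocket type; class and method names are illustrative):

```python
from typing import Any, Dict, Optional

class SocketRegistry:
    """Single-process {session_id -> websocket} registry."""

    def __init__(self) -> None:
        self._sockets: Dict[str, Any] = {}

    def attach(self, session_id: str, ws: Any) -> None:
        # A reconnect simply replaces the previous socket for the session.
        self._sockets[session_id] = ws

    def detach(self, session_id: str, ws: Any) -> None:
        # Only remove if this exact socket is still registered, so a
        # stale connection's close can't evict a newer one.
        if self._sockets.get(session_id) is ws:
            del self._sockets[session_id]

    def get(self, session_id: str) -> Optional[Any]:
        return self._sockets.get(session_id)
```

Multi-replica deployment is exactly where this dict stops working, which is why the right-hand column of the table below defers it.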
Axis 4: Reliability Contract
The axis most tutorials skip. The design must answer each of these questions.
| Scenario | Question the design must answer |
|---|---|
| WS drops mid-stream | Does the LLM keep generating? Where do buffered deltas go? How does the client resume? |
| Client sends 2nd message before 1st completes | Queue, reject, or interleave? |
| Server restarts mid-response | Is response_id durable? Or lost? |
| Network blocks WS entirely | Does GET /chat/message/{id} return the full buffered state? |
| Duplicate POST /chat/message (retry) | Idempotency key via response_id? Or dedupe by client-supplied ID? |
The HLD gives ingredients (response_id, GET fallback, 30s heartbeat, 5-minute idle timeout) but does not lock down resume semantics. This document picks defaults in the Prototype Spec section below.
What a Laptop Prototype Can and Cannot Teach
| Can verify on laptop | Cannot verify on laptop |
|---|---|
| Correlation between POST and WS delivery | Behavior through real corporate proxies |
| Reconnect + resume logic | Sticky-session routing under a load balancer |
| Heartbeat and idle timeout handling | Multi-replica session_id -> ws routing |
| Backpressure (throttle WS send with asyncio.sleep) | Kinesis/Redshift analytics fan-out |
| HTTPS fallback semantics (GET /chat/message/{id}) | WS behavior under real TLS termination at ALB |
| Concurrent message rejection or queueing | True autoscaling behavior |
| Event schema correctness (delta, completed, error) | Multi-region latency |
The left column is where the interesting design lives. The right column is infra and can be simulated later.
Decisions to Lock Down Before Coding
Pin each of these before writing Python.
- Transport direction. Recommend B (match HLD).
- Delta granularity. Recommend token-level via Claude's native stream. Hybrid: prose streams, structured fields arrive in completed.
- Resume policy. Simplest: buffer all deltas per response_id server-side; a reconnecting WS receives the full buffer then tails live.
- Concurrency per session. Simplest: reject the 2nd message until the 1st completes with HTTP 409 IN_PROGRESS.
- Idempotency. Skip for v1. Add a client_message_id field later if needed.
- Auth. Simplest: pass session_id in the WS URL path, validated against an in-memory session map created at /chat/init.
Prototype Spec (v1)
This is the minimum viable WebSocket API for a laptop prototype. It exercises every interesting behavior in the HLD without any AWS dependency.
Endpoints
POST /chat/init Create session, return session_id + ws URL
POST /chat/message Submit message, return response_id (202)
GET /chat/message/{response_id} HTTPS fallback, returns buffered state
WS /ws/{session_id} Stream deltas + completed events
State (in-memory for v1)
sessions: dict[session_id, SessionRecord]
responses: dict[response_id, ResponseRecord]
sockets: dict[session_id, WebSocket]
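A minimal sketch of those records as dataclasses, assuming the fields implied by the reliability contract (field names beyond session_id, response_id, and buffer are illustrative):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class ResponseRecord:
    """In-memory state for one response_id: current phase plus the
    delta buffer that backs resume and the GET fallback."""
    response_id: str
    session_id: str
    state: str = "generating"  # generating | completed | errored
    buffer: List[str] = field(default_factory=list)  # deltas streamed so far
    response_text: str = ""  # set once completed

@dataclass
class SessionRecord:
    session_id: str
    active_response_id: Optional[str] = None  # non-None while Generating

sessions: Dict[str, SessionRecord] = {}
responses: Dict[str, ResponseRecord] = {}
sockets: Dict[str, Any] = {}  # Any stands in for the WebSocket type
```

`active_response_id` is what the 409 IN_PROGRESS check reads; `buffer` is what both reconnect replay and GET /chat/message/{id} serve from.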
Event schema
Frames on the WebSocket are JSON with a type discriminator:
{ "type": "chat.response.delta", "session_id": "...", "response_id": "...", "delta": "..." }
{ "type": "chat.response.completed", "session_id": "...", "response_id": "...", "response_text": "...", "products": [], "actions": [] }
{ "type": "chat.response.error", "session_id": "...", "response_id": "...", "error": { "code": "...", "message": "..." } }
{ "type": "ping" }
{ "type": "pong" }
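Keeping these shapes in small builder functions makes the schema testable before any socket exists. A sketch of two of them (function names are illustrative; the JSON keys match the schema above):

```python
import json

def delta_frame(session_id: str, response_id: str, delta: str) -> str:
    """Serialize one chat.response.delta frame."""
    return json.dumps({
        "type": "chat.response.delta",
        "session_id": session_id,
        "response_id": response_id,
        "delta": delta,
    })

def completed_frame(session_id: str, response_id: str, response_text: str,
                    products=None, actions=None) -> str:
    """Serialize the terminal chat.response.completed frame, carrying
    the structured fields that are not streamed."""
    return json.dumps({
        "type": "chat.response.completed",
        "session_id": session_id,
        "response_id": response_id,
        "response_text": response_text,
        "products": products or [],
        "actions": actions or [],
    })
```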
Lifecycle state machine
```mermaid
stateDiagram-v2
    [*] --> Idle: POST /chat/init
    Idle --> Generating: POST /chat/message (202)
    Generating --> Generating: emit delta
    Generating --> Completed: emit completed
    Generating --> Errored: emit error
    Generating --> Generating: client reconnects -> replay buffered deltas
    Completed --> Idle: ready for next message
    Errored --> Idle
    Idle --> [*]: 5-min idle timeout
```
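The same transitions can be encoded as a table-driven guard so the prototype rejects illegal events explicitly instead of silently. A sketch (the event names are illustrative labels for the arrows above):

```python
from enum import Enum

class ChatState(Enum):
    IDLE = "idle"
    GENERATING = "generating"
    COMPLETED = "completed"
    ERRORED = "errored"

# Legal (state, event) -> next-state pairs from the diagram above.
TRANSITIONS = {
    (ChatState.IDLE, "message"): ChatState.GENERATING,
    (ChatState.GENERATING, "completed"): ChatState.COMPLETED,
    (ChatState.GENERATING, "error"): ChatState.ERRORED,
    (ChatState.COMPLETED, "reset"): ChatState.IDLE,
    (ChatState.ERRORED, "reset"): ChatState.IDLE,
}

def advance(state: ChatState, event: str) -> ChatState:
    nxt = TRANSITIONS.get((state, event))
    if nxt is None:
        # e.g. a second "message" while Generating -> surface as 409 upstream
        raise ValueError(f"illegal transition: {state.value} + {event}")
    return nxt
```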
Failure modes the prototype must handle
- WS disconnect mid-stream: LLM continues, deltas buffer under responses[response_id].buffer. New WS connection replays the buffer then tails.
- Second POST /chat/message while first is Generating: respond 409 { "code": "IN_PROGRESS" }.
- WS message with wrong session_id: close with code 4401.
- Idle WS for 5 minutes with no ping: server sends close frame.
- Server-generated ping every 30 seconds.
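The replay-then-tail contract from the first failure mode can be sketched transport-agnostically: `send` is any async callable (e.g. a WebSocket's send_text), and a None sentinel on the live queue marks completion. The names and the queue-based handoff are illustrative choices, not from the HLD:

```python
import asyncio
from typing import Awaitable, Callable, List, Optional

async def replay_then_tail(
    send: Callable[[str], Awaitable[None]],
    buffer: List[str],
    live_queue: "asyncio.Queue[Optional[str]]",
) -> None:
    """Resume a reconnecting client: replay every buffered delta,
    then forward live deltas until the completion sentinel (None)."""
    for delta in list(buffer):  # snapshot: the generator may still append
        await send(delta)
    while True:
        delta = await live_queue.get()
        if delta is None:  # completed/errored
            break
        await send(delta)
```

One known gap in this sketch: deltas appended between the snapshot and the first queue read could be delivered twice, so a real version would track a replay cursor into the buffer instead of snapshotting.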
Non-goals for v1
Explicitly deferred to keep the prototype focused on WS design:
- Real LLM calls (stub with asyncio.sleep(0.05) between fake tokens, or cheap Claude Haiku calls).
- Intent classification, RAG, guardrails, memory summarization.
- Persistence (DynamoDB, Redis). Everything is in-memory.
- Multi-replica concerns.
- Real auth (a string session_id is enough).
Build Order
Once the above decisions are approved:
- FastAPI skeleton with the four endpoints, in-memory state dicts.
- Fake token generator yielding chunks with small sleeps, wired to emit delta frames.
- A 20-line HTML client with new WebSocket(...) printing deltas as they arrive.
- WS disconnect + reconnect test: kill the client mid-stream, reconnect, verify buffered deltas replay.
- HTTPS fallback test: hit GET /chat/message/{id} mid-generation and at completion, verify both return sensible state.
- Concurrent-message test: POST twice quickly, verify the second returns 409.
- Swap the fake LLM for real Claude streaming, keep everything else identical.
At that point, all of the WS-design questions in Axis 4 have been answered by running code.
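The fake token generator from step 2 of the build order might look like the following (a stub per the non-goals; the word-level tokenization and function name are illustrative):

```python
import asyncio
from typing import AsyncIterator

async def fake_llm_stream(text: str, delay: float = 0.05) -> AsyncIterator[str]:
    """Stub LLM: yield word-ish tokens with a small sleep so streaming,
    buffering, and reconnect behavior can be exercised with no model call."""
    for word in text.split(" "):
        await asyncio.sleep(delay)
        yield word + " "
```

Because the real Claude swap in step 7 only replaces this generator, everything downstream (delta framing, buffering, replay) stays identical.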