
4c. WebSocket API - Design Space and Prototype Spec

This document explores the design space for the MangaAssist chat WebSocket API before implementation. It records the decisions needed up front, the limits of each direction, and what a laptop prototype can and cannot verify. Read this before writing any code.

Purpose

The HLD (§4) and LLD-7 specify a chat transport built on POST /chat/message + WebSocket delivery. This document expands that choice into its surrounding design space so the prototype is built with full awareness of the alternatives and their tradeoffs.

Axis 1: Transport Topology

Four ways to move a user message in and an LLM response out.

| Direction | How it works | Limits |
| --- | --- | --- |
| A. Pure bidirectional WS | Client sends the message frame over the WS; server streams reply frames back on the same socket | Blocked by some corporate proxies; a WS drop loses the in-flight request; auth must happen in the first frame |
| B. HTTP submit + WS deliver (HLD choice) | POST /chat/message returns 202 + response_id; server pushes delta and completed events over the WS keyed by session_id | Two round trips; needs correlation by response_id; requires a GET /chat/message/{id} fallback path |
| C. HTTP submit + SSE deliver | Same as B, but Server-Sent Events instead of WS | One-way only (fine for chat); no binary; some old proxies buffer SSE and break streaming |
| D. HTTP submit + long-poll | Client polls GET /chat/message/{id} until done | Higher latency and more requests; no real streaming UX, tokens arrive in batches |

Why the HLD picked B

  • Large-scale traffic traverses proxies that mangle WS; the HTTP submit path always works.
  • Auth is already an HTTP concern (Amazon session token, rate limits).
  • The response_id pattern lets the LLM keep generating server-side even if the client's WS drops (see the sketch after this list).
  • GET /chat/message/{id} gives a clean fallback for WS-hostile networks without redesigning anything.
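
The submit half of direction B is small. A minimal sketch, assuming FastAPI; generate_response is a hypothetical helper, stubbed here and fleshed out by the streaming sketches later in this document:

import asyncio
import uuid

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatMessage(BaseModel):
    session_id: str
    text: str

async def generate_response(session_id: str, response_id: str, text: str) -> None:
    ...  # stubbed; the Axis 2 and Prototype Spec sketches fill this in

@app.post("/chat/message", status_code=202)
async def submit_message(msg: ChatMessage) -> dict:
    response_id = str(uuid.uuid4())
    # Generation is detached from the request: it keeps running even if the
    # client's WS drops, and the 202 returns immediately with the correlation id.
    asyncio.create_task(generate_response(msg.session_id, response_id, msg.text))
    return {"response_id": response_id}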

Axis 2: Streaming Granularity

How small should each streamed frame be?

| Granularity | Spec | Limits |
| --- | --- | --- |
| Token-level deltas | One frame per LLM token (~4 chars) | Many tiny frames; per-frame WS overhead matters; the UI must handle 50+ msgs/sec |
| Chunk deltas | Batch tokens into ~20-50 char chunks | Slight perceived lag, but much less overhead |
| Sentence or event deltas | One frame per semantic unit (sentence, product card) | Feels less "live" but easier to reason about; good for structured responses with products and actions |
| Full response only | No streaming | Defeats the purpose; users wait 2-5 seconds staring at nothing |

The HLD implies token-level (chat.response.delta with a delta string), which matches how Claude's SDK streams natively. For a prototype that mixes prose with structured product cards, a hybrid is worth considering: stream the response_text token-by-token, but emit products and actions once at the end inside chat.response.completed.
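
A sketch of that hybrid, assuming a FastAPI/Starlette WebSocket; token_stream is a stand-in for Claude's native delta stream (or the fake generator used in the prototype):

from typing import AsyncIterator

from fastapi import WebSocket

async def stream_response(ws: WebSocket, session_id: str, response_id: str,
                          token_stream: AsyncIterator[str],
                          products: list, actions: list) -> None:
    parts: list[str] = []
    async for token in token_stream:
        parts.append(token)
        # Prose streams live, token by token (or chunk by chunk).
        await ws.send_json({"type": "chat.response.delta",
                            "session_id": session_id,
                            "response_id": response_id,
                            "delta": token})
    # Structured fields are emitted exactly once, at the end.
    await ws.send_json({"type": "chat.response.completed",
                        "session_id": session_id,
                        "response_id": response_id,
                        "response_text": "".join(parts),
                        "products": products,
                        "actions": actions})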

Axis 3: Session and Connection Lifecycle

Who owns the WebSocket and for how long?

| Model | Spec | Limits |
| --- | --- | --- |
| One WS per session (HLD) | WS opens at /chat/init and persists across many POST /chat/message calls | Server needs a {session_id -> websocket} registry; sticky sessions required if multi-replica |
| One WS per request | New WS per message, closed when completed fires | No registry needed, but higher connection churn; worse for mobile |
| Multiplexed WS across sessions | One WS carries multiple sessions | Overkill for a prototype |

For a laptop prototype running in a single process, the per-session model has zero cost — it is simply a dict[str, WebSocket] in memory.
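
For illustration, a sketch of that registry in FastAPI; the 4401 close code matches the failure-mode list in the Prototype Spec:

from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
sessions: dict[str, dict] = {}      # populated by POST /chat/init
sockets: dict[str, WebSocket] = {}  # the entire per-session registry

@app.websocket("/ws/{session_id}")
async def ws_endpoint(ws: WebSocket, session_id: str) -> None:
    await ws.accept()
    if session_id not in sessions:
        await ws.close(code=4401)   # unknown session: refuse the socket
        return
    sockets[session_id] = ws        # replaces any stale socket for this session
    try:
        while True:
            await ws.receive_json() # client traffic is only pings in v1
    except WebSocketDisconnect:
        sockets.pop(session_id, None)  # generation keeps running; deltas buffer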

Axis 4: Reliability Contract

The axis most tutorials skip. The design must answer each of these questions.

| Scenario | Question the design must answer |
| --- | --- |
| WS drops mid-stream | Does the LLM keep generating? Where do buffered deltas go? How does the client resume? |
| Client sends 2nd message before 1st completes | Queue, reject, or interleave? |
| Server restarts mid-response | Is response_id durable? Or lost? |
| Network blocks WS entirely | Does GET /chat/message/{id} return the full buffered state? |
| Duplicate POST /chat/message (retry) | Idempotency key via response_id? Or dedupe by client-supplied ID? |

The HLD gives ingredients (response_id, GET fallback, 30s heartbeat, 5-minute idle timeout) but does not lock down resume semantics. This document picks defaults in the Prototype Spec section below.

What a Laptop Prototype Can and Cannot Teach

| Can verify on laptop | Cannot verify on laptop |
| --- | --- |
| Correlation between POST and WS delivery | Behavior through real corporate proxies |
| Reconnect + resume logic | Sticky-session routing under a load balancer |
| Heartbeat and idle timeout handling | Multi-replica session_id -> ws routing |
| Backpressure (throttle WS send with asyncio.sleep) | Kinesis/Redshift analytics fan-out |
| HTTPS fallback semantics (GET /chat/message/{id}) | WS behavior under real TLS termination at ALB |
| Concurrent message rejection or queueing | True autoscaling behavior |
| Event schema correctness (delta, completed, error) | Multi-region latency |

The left column is where the interesting design lives. The right column is infra and can be simulated later.

Decisions to Lock Down Before Coding

Pin each of these before writing Python.

  1. Transport direction. Recommend B (match HLD).
  2. Delta granularity. Recommend token-level via Claude's native stream. Hybrid: prose streams, structured fields arrive in completed.
  3. Resume policy. Simplest: buffer all deltas per response_id server-side; a reconnecting WS receives the full buffer, then tails live (sketched after this list).
  4. Concurrency per session. Simplest: reject 2nd message until 1st completes with HTTP 409 IN_PROGRESS.
  5. Idempotency. Skip for v1. Add a client_message_id field later if needed.
  6. Auth. Simplest: pass session_id in the WS URL path, validated against an in-memory session map created at /chat/init.
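
A sketch of the recommended resume policy (decisions 3 and 4). The responses and sockets dicts and the ResponseRecord.buffer field mirror the State section of the Prototype Spec below; the helper names here are illustrative:

responses: dict = {}  # response_id -> record; see State in the Prototype Spec
sockets: dict = {}    # session_id -> live WebSocket

async def emit_delta(rec, delta: str) -> None:
    rec.buffer.append(delta)        # buffer first, so a replay never misses a frame
    ws = sockets.get(rec.session_id)
    if ws is not None:              # no live socket is fine: the buffer is the truth
        await ws.send_json(_delta_frame(rec, delta))

async def on_reconnect(ws, session_id: str) -> None:
    rec = next((r for r in responses.values()
                if r.session_id == session_id and r.status == "generating"), None)
    if rec is not None:
        i = 0
        while i < len(rec.buffer):  # index loop catches deltas appended mid-replay
            await ws.send_json(_delta_frame(rec, rec.buffer[i]))
            i += 1
    sockets[session_id] = ws        # no await between loop exit and registration,
                                    # so in one event loop no delta slips through

def _delta_frame(rec, delta: str) -> dict:
    return {"type": "chat.response.delta", "session_id": rec.session_id,
            "response_id": rec.response_id, "delta": delta}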

Prototype Spec (v1)

This is the minimum viable WebSocket API for a laptop prototype. It exercises every interesting behavior in the HLD without any AWS dependency.

Endpoints

POST /chat/init                     Create session, return session_id + ws URL
POST /chat/message                  Submit message, return response_id (202)
GET  /chat/message/{response_id}    HTTPS fallback, returns buffered state
WS   /ws/{session_id}               Stream deltas + completed events
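
A sketch of the fallback handler: it serves whatever is buffered, whether or not generation has finished, assuming the ResponseRecord shape in the State section below.

from fastapi import FastAPI, HTTPException

app = FastAPI()
responses: dict = {}  # response_id -> ResponseRecord, as in the State section

@app.get("/chat/message/{response_id}")
async def get_message(response_id: str) -> dict:
    rec = responses.get(response_id)
    if rec is None:
        raise HTTPException(status_code=404)
    return {"response_id": rec.response_id,
            "status": rec.status,                  # generating | completed | errored
            "response_text": "".join(rec.buffer),  # partial while generating, full at the end
            "products": rec.products,              # empty until completed
            "actions": rec.actions}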

State (in-memory for v1)

sessions:  dict[session_id, SessionRecord]
responses: dict[response_id, ResponseRecord]
sockets:   dict[session_id, WebSocket]
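
The HLD does not pin down the record shapes; one plausible set of fields as a sketch:

import time
from dataclasses import dataclass, field

@dataclass
class SessionRecord:
    session_id: str
    created_at: float = field(default_factory=time.monotonic)
    last_seen_at: float = field(default_factory=time.monotonic)  # drives the idle timeout

@dataclass
class ResponseRecord:
    response_id: str
    session_id: str
    status: str = "generating"                       # generating | completed | errored
    buffer: list[str] = field(default_factory=list)  # replayed on reconnect
    products: list = field(default_factory=list)     # filled at completion
    actions: list = field(default_factory=list)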

Event schema

Frames on the WebSocket are JSON with a type discriminator:

{ "type": "chat.response.delta",     "session_id": "...", "response_id": "...", "delta": "..." }
{ "type": "chat.response.completed", "session_id": "...", "response_id": "...", "response_text": "...", "products": [], "actions": [] }
{ "type": "chat.response.error",     "session_id": "...", "response_id": "...", "error": { "code": "...", "message": "..." } }
{ "type": "ping" }
{ "type": "pong" }

Lifecycle state machine

stateDiagram-v2
    [*] --> Idle: POST /chat/init
    Idle --> Generating: POST /chat/message (202)
    Generating --> Generating: emit delta
    Generating --> Completed: emit completed
    Generating --> Errored: emit error
    Generating --> Generating: client reconnects -> replay buffered deltas
    Completed --> Idle: ready for next message
    Errored --> Idle
    Idle --> [*]: 5-min idle timeout

Failure modes the prototype must handle

  • WS disconnect mid-stream: LLM continues, deltas buffer under responses[response_id].buffer. New WS connection replays buffer then tails.
  • Second POST /chat/message while first is Generating: respond 409 { "code": "IN_PROGRESS" }.
  • WS message with wrong session_id: close with code 4401.
  • Idle WS for 5 minutes with no ping: server sends close frame.
  • Server-generated ping every 30 seconds (both timers are sketched below).
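
Both timers in one sketch, assuming the SessionRecord shape above and a receive loop elsewhere that refreshes last_seen_at on every client frame:

import asyncio
import time

PING_INTERVAL = 30   # seconds between server pings (HLD)
IDLE_TIMEOUT = 300   # close after 5 minutes without client traffic (HLD)

async def heartbeat(ws, rec) -> None:
    while True:
        await asyncio.sleep(PING_INTERVAL)
        if time.monotonic() - rec.last_seen_at > IDLE_TIMEOUT:
            await ws.close(code=1000)  # idle for 5 minutes: clean close
            return
        await ws.send_json({"type": "ping"})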

Non-goals for v1

Explicitly deferred to keep the prototype focused on WS design:

  • Real LLM calls (stub with asyncio.sleep(0.05) between fake tokens, or cheap Claude Haiku calls).
  • Intent classification, RAG, guardrails, memory summarization.
  • Persistence (DynamoDB, Redis). Everything is in-memory.
  • Multi-replica concerns.
  • Real auth (a string session_id is enough).

Build Order

Once the above decisions are approved:

  1. FastAPI skeleton with the four endpoints, in-memory state dicts.
  2. Fake token generator yielding chunks with small sleeps, wired to emit delta frames (sketched after this list).
  3. A 20-line HTML client with new WebSocket(...) printing deltas as they arrive.
  4. WS disconnect + reconnect test: kill the client mid-stream, reconnect, verify buffered deltas replay.
  5. HTTPS fallback test: hit GET /chat/message/{id} mid-generation and at completion, verify both return sensible state.
  6. Concurrent-message test: POST twice quickly, verify second returns 409.
  7. Swap the fake LLM for real Claude streaming, keep everything else identical.
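
Step 2's generator, sketched; the pacing follows the asyncio.sleep(0.05) stub named in the non-goals:

import asyncio
from typing import AsyncIterator

async def fake_token_stream(reply: str) -> AsyncIterator[str]:
    for word in reply.split():
        await asyncio.sleep(0.05)  # paces deltas like a real model stream
        yield word + " "

Step 7 then replaces only this generator with Claude's streaming iterator; every frame, buffer, and fallback path stays identical.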

At that point, all of the WS-design questions in Axis 4 have been answered by running code.