
4c. WebSocket API - Design Space and Prototype Spec

This document explores the design space for the MangaAssist chat WebSocket API before implementation. It records the decisions needed up front, the limits of each direction, and what a laptop prototype can and cannot verify. Read this before writing any code.

Purpose

The HLD (§4) and LLD-7 specify a chat transport built on POST /chat/message + WebSocket delivery. This document expands that choice into its surrounding design space so the prototype is built with full awareness of the alternatives and their tradeoffs.

Axis 1: Transport Topology

Four ways to move a user message in and an LLM response out.

| Direction | How it works | Limits |
| --- | --- | --- |
| A. Pure bidirectional WS | Client sends the message frame over the WS; server streams reply frames back on the same socket | Blocked by some corporate proxies; a WS drop loses the in-flight request; auth must happen in the first frame |
| B. HTTP submit + WS deliver (HLD choice) | POST /chat/message returns 202 + response_id; server pushes delta and completed events over the WS keyed by session_id | Two round trips; needs correlation by response_id; requires a GET /chat/message/{id} fallback path |
| C. HTTP submit + SSE deliver | Same as B, but Server-Sent Events instead of WS | One-way only (fine for chat); no binary; some old proxies buffer SSE and break streaming |
| D. HTTP submit + long-poll | Client polls GET /chat/message/{id} until done | Higher latency and more requests; no real streaming UX, tokens arrive in batches |

Why the HLD picked B

  • Large-scale traffic traverses proxies that mangle WS; the HTTP submit path always works.
  • Auth is already an HTTP concern (Amazon session token, rate limits).
  • The response_id pattern lets the LLM keep generating server-side even if the client's WS drops (see the sketch after this list).
  • GET /chat/message/{id} gives a clean fallback for WS-hostile networks without redesigning anything.
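
The submit half of direction B is small. A minimal sketch, assuming FastAPI; generate_response is a hypothetical helper, stubbed here and fleshed out by the streaming sketches later in this document:

import asyncio
import uuid

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatMessage(BaseModel):
    session_id: str
    text: str

async def generate_response(session_id: str, response_id: str, text: str) -> None:
    ...  # stubbed; the Axis 2 and Prototype Spec sketches fill this in

@app.post("/chat/message", status_code=202)
async def submit_message(msg: ChatMessage) -> dict:
    response_id = str(uuid.uuid4())
    # Generation is detached from the request: it keeps running even if the
    # client's WS drops, and the 202 returns immediately with the correlation id.
    asyncio.create_task(generate_response(msg.session_id, response_id, msg.text))
    return {"response_id": response_id}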

Axis 2: Streaming Granularity

How small should each streamed frame be?

| Granularity | Spec | Limits |
| --- | --- | --- |
| Token-level deltas | One frame per LLM token (~4 chars) | Many tiny frames; per-frame WS overhead matters; the UI must handle 50+ msgs/sec |
| Chunk deltas | Batch tokens into ~20-50 char chunks | Slight perceived lag, but much less overhead |
| Sentence or event deltas | One frame per semantic unit (sentence, product card) | Feels less "live" but easier to reason about; good for structured responses with products and actions |
| Full response only | No streaming | Defeats the purpose; users wait 2-5 seconds staring at nothing |

The HLD implies token-level (chat.response.delta with a delta string), which matches how Claude's SDK streams natively. For a prototype that mixes prose with structured product cards, a hybrid is worth considering: stream the response_text token-by-token, but emit products and actions once at the end inside chat.response.completed.
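
A sketch of that hybrid, assuming a FastAPI/Starlette WebSocket; token_stream is a stand-in for Claude's native delta stream (or the fake generator used in the prototype):

from typing import AsyncIterator

from fastapi import WebSocket

async def stream_response(ws: WebSocket, session_id: str, response_id: str,
                          token_stream: AsyncIterator[str],
                          products: list, actions: list) -> None:
    parts: list[str] = []
    async for token in token_stream:
        parts.append(token)
        # Prose streams live, token by token (or chunk by chunk).
        await ws.send_json({"type": "chat.response.delta",
                            "session_id": session_id,
                            "response_id": response_id,
                            "delta": token})
    # Structured fields are emitted exactly once, at the end.
    await ws.send_json({"type": "chat.response.completed",
                        "session_id": session_id,
                        "response_id": response_id,
                        "response_text": "".join(parts),
                        "products": products,
                        "actions": actions})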

Axis 3: Session and Connection Lifecycle

Who owns the WebSocket and for how long?

| Model | Spec | Limits |
| --- | --- | --- |
| One WS per session (HLD) | WS opens at /chat/init and persists across many POST /chat/message calls | Server needs a {session_id -> websocket} registry; sticky sessions required if multi-replica |
| One WS per request | New WS per message, closed when completed fires | No registry needed, but higher connection churn; worse for mobile |
| Multiplexed WS across sessions | One WS carries multiple sessions | Overkill for a prototype |

For a laptop prototype running in a single process, the per-session model has zero cost — it is simply a dict[str, WebSocket] in memory.
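
For illustration, a sketch of that registry in FastAPI; the 4401 close code matches the failure-mode list in the Prototype Spec:

from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
sessions: dict[str, dict] = {}      # populated by POST /chat/init
sockets: dict[str, WebSocket] = {}  # the entire per-session registry

@app.websocket("/ws/{session_id}")
async def ws_endpoint(ws: WebSocket, session_id: str) -> None:
    await ws.accept()
    if session_id not in sessions:
        await ws.close(code=4401)   # unknown session: refuse the socket
        return
    sockets[session_id] = ws        # replaces any stale socket for this session
    try:
        while True:
            await ws.receive_json() # client traffic is only pings in v1
    except WebSocketDisconnect:
        sockets.pop(session_id, None)  # generation keeps running; deltas buffer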

Axis 4: Reliability Contract

The axis most tutorials skip. The design must answer each of these questions.

| Scenario | Question the design must answer |
| --- | --- |
| WS drops mid-stream | Does the LLM keep generating? Where do buffered deltas go? How does the client resume? |
| Client sends 2nd message before 1st completes | Queue, reject, or interleave? |
| Server restarts mid-response | Is response_id durable? Or lost? |
| Network blocks WS entirely | Does GET /chat/message/{id} return the full buffered state? |
| Duplicate POST /chat/message (retry) | Idempotency key via response_id? Or dedupe by client-supplied ID? |

The HLD gives ingredients (response_id, GET fallback, 30s heartbeat, 5-minute idle timeout) but does not lock down resume semantics. This document picks defaults in the Prototype Spec section below.

What a Laptop Prototype Can and Cannot Teach

| Can verify on laptop | Cannot verify on laptop |
| --- | --- |
| Correlation between POST and WS delivery | Behavior through real corporate proxies |
| Reconnect + resume logic | Sticky-session routing under a load balancer |
| Heartbeat and idle timeout handling | Multi-replica session_id -> ws routing |
| Backpressure (throttle WS send with asyncio.sleep) | Kinesis/Redshift analytics fan-out |
| HTTPS fallback semantics (GET /chat/message/{id}) | WS behavior under real TLS termination at ALB |
| Concurrent message rejection or queueing | True autoscaling behavior |
| Event schema correctness (delta, completed, error) | Multi-region latency |

The left column is where the interesting design lives. The right column is infra and can be simulated later.

Decisions to Lock Down Before Coding

Pin each of these before writing Python.

  1. Transport direction. Recommend B (match HLD).
  2. Delta granularity. Recommend token-level via Claude's native stream. Hybrid: prose streams, structured fields arrive in completed.
  3. Resume policy. Simplest: buffer all deltas per response_id server-side; a reconnecting WS receives the full buffer, then tails live (sketched after this list).
  4. Concurrency per session. Simplest: reject 2nd message until 1st completes with HTTP 409 IN_PROGRESS.
  5. Idempotency. Skip for v1. Add a client_message_id field later if needed.
  6. Auth. Simplest: pass session_id in the WS URL path, validated against an in-memory session map created at /chat/init.
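
A sketch of the recommended resume policy (decisions 3 and 4). The responses and sockets dicts and the ResponseRecord.buffer field mirror the State section of the Prototype Spec below; the helper names here are illustrative:

responses: dict = {}  # response_id -> record; see State in the Prototype Spec
sockets: dict = {}    # session_id -> live WebSocket

async def emit_delta(rec, delta: str) -> None:
    rec.buffer.append(delta)        # buffer first, so a replay never misses a frame
    ws = sockets.get(rec.session_id)
    if ws is not None:              # no live socket is fine: the buffer is the truth
        await ws.send_json(_delta_frame(rec, delta))

async def on_reconnect(ws, session_id: str) -> None:
    rec = next((r for r in responses.values()
                if r.session_id == session_id and r.status == "generating"), None)
    if rec is not None:
        i = 0
        while i < len(rec.buffer):  # index loop catches deltas appended mid-replay
            await ws.send_json(_delta_frame(rec, rec.buffer[i]))
            i += 1
    sockets[session_id] = ws        # no await between loop exit and registration,
                                    # so in one event loop no delta slips through

def _delta_frame(rec, delta: str) -> dict:
    return {"type": "chat.response.delta", "session_id": rec.session_id,
            "response_id": rec.response_id, "delta": delta}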

Prototype Spec (v1)

This is the minimum viable WebSocket API for a laptop prototype. It exercises every interesting behavior in the HLD without any AWS dependency.

Endpoints

POST /chat/init                     Create session, return session_id + ws URL
POST /chat/message                  Submit message, return response_id (202)
GET  /chat/message/{response_id}    HTTPS fallback, returns buffered state
WS   /ws/{session_id}               Stream deltas + completed events
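
A sketch of the fallback handler: it serves whatever is buffered, whether or not generation has finished, assuming the ResponseRecord shape in the State section below.

from fastapi import FastAPI, HTTPException

app = FastAPI()
responses: dict = {}  # response_id -> ResponseRecord, as in the State section

@app.get("/chat/message/{response_id}")
async def get_message(response_id: str) -> dict:
    rec = responses.get(response_id)
    if rec is None:
        raise HTTPException(status_code=404)
    return {"response_id": rec.response_id,
            "status": rec.status,                  # generating | completed | errored
            "response_text": "".join(rec.buffer),  # partial while generating, full at the end
            "products": rec.products,              # empty until completed
            "actions": rec.actions}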

State (in-memory for v1)

sessions:  dict[session_id, SessionRecord]
responses: dict[response_id, ResponseRecord]
sockets:   dict[session_id, WebSocket]
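
The HLD does not pin down the record shapes; one plausible set of fields as a sketch:

import time
from dataclasses import dataclass, field

@dataclass
class SessionRecord:
    session_id: str
    created_at: float = field(default_factory=time.monotonic)
    last_seen_at: float = field(default_factory=time.monotonic)  # drives the idle timeout

@dataclass
class ResponseRecord:
    response_id: str
    session_id: str
    status: str = "generating"                       # generating | completed | errored
    buffer: list[str] = field(default_factory=list)  # replayed on reconnect
    products: list = field(default_factory=list)     # filled at completion
    actions: list = field(default_factory=list)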

Event schema

Frames on the WebSocket are JSON with a type discriminator:

{ "type": "chat.response.delta",     "session_id": "...", "response_id": "...", "delta": "..." }
{ "type": "chat.response.completed", "session_id": "...", "response_id": "...", "response_text": "...", "products": [], "actions": [] }
{ "type": "chat.response.error",     "session_id": "...", "response_id": "...", "error": { "code": "...", "message": "..." } }
{ "type": "ping" }
{ "type": "pong" }

Lifecycle state machine

stateDiagram-v2
    [*] --> Idle: POST /chat/init
    Idle --> Generating: POST /chat/message (202)
    Generating --> Generating: emit delta
    Generating --> Completed: emit completed
    Generating --> Errored: emit error
    Generating --> Generating: client reconnects -> replay buffered deltas
    Completed --> Idle: ready for next message
    Errored --> Idle
    Idle --> [*]: 5-min idle timeout

Failure modes the prototype must handle

  • WS disconnect mid-stream: LLM continues, deltas buffer under responses[response_id].buffer. New WS connection replays buffer then tails.
  • Second POST /chat/message while first is Generating: respond 409 { "code": "IN_PROGRESS" }.
  • WS message with wrong session_id: close with code 4401.
  • Idle WS for 5 minutes with no ping: server sends close frame.
  • Server-generated ping every 30 seconds (both timers are sketched below).
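
Both timers in one sketch, assuming the SessionRecord shape above and a receive loop elsewhere that refreshes last_seen_at on every client frame:

import asyncio
import time

PING_INTERVAL = 30   # seconds between server pings (HLD)
IDLE_TIMEOUT = 300   # close after 5 minutes without client traffic (HLD)

async def heartbeat(ws, rec) -> None:
    while True:
        await asyncio.sleep(PING_INTERVAL)
        if time.monotonic() - rec.last_seen_at > IDLE_TIMEOUT:
            await ws.close(code=1000)  # idle for 5 minutes: clean close
            return
        await ws.send_json({"type": "ping"})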

Non-goals for v1

Explicitly deferred to keep the prototype focused on WS design:

  • Real LLM calls (stub with asyncio.sleep(0.05) between fake tokens, or cheap Claude Haiku calls).
  • Intent classification, RAG, guardrails, memory summarization.
  • Persistence (DynamoDB, Redis). Everything is in-memory.
  • Multi-replica concerns.
  • Real auth (a string session_id is enough).

Build Order

Once the above decisions are approved:

  1. FastAPI skeleton with the four endpoints, in-memory state dicts.
  2. Fake token generator yielding chunks with small sleeps, wired to emit delta frames (sketched after this list).
  3. A 20-line HTML client with new WebSocket(...) printing deltas as they arrive.
  4. WS disconnect + reconnect test: kill the client mid-stream, reconnect, verify buffered deltas replay.
  5. HTTPS fallback test: hit GET /chat/message/{id} mid-generation and at completion, verify both return sensible state.
  6. Concurrent-message test: POST twice quickly, verify second returns 409.
  7. Swap the fake LLM for real Claude streaming, keep everything else identical.
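
Step 2's generator, sketched; the pacing follows the asyncio.sleep(0.05) stub named in the non-goals:

import asyncio
from typing import AsyncIterator

async def fake_token_stream(reply: str) -> AsyncIterator[str]:
    for word in reply.split():
        await asyncio.sleep(0.05)  # paces deltas like a real model stream
        yield word + " "

Step 7 then replaces only this generator with Claude's streaming iterator; every frame, buffer, and fallback path stays identical.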

At that point, all of the WS-design questions in Axis 4 have been answered by running code.