
Scenarios and Runbooks — Real-Time AI Interaction

MangaAssist context: JP Manga store chatbot on AWS — Bedrock Claude 3 (Sonnet at $3/$15 per 1M tokens input/output, Haiku at $0.25/$1.25), OpenSearch Serverless (vector store), DynamoDB (sessions/products), ECS Fargate (orchestrator), API Gateway WebSocket, ElastiCache Redis. Target: useful answer in under 3 seconds, 1M messages/day scale.


Skill Mapping

| Field | Value |
| --- | --- |
| Certification | AWS Certified AI Practitioner (AIF-C01) |
| Domain | 2 — Implementation and Integration of Foundation Models |
| Task | 2.4 — Design and implement interaction mechanisms for FM-based applications |
| Skill | 2.4.2 — Develop real-time AI interaction systems to provide immediate feedback from FM |

Scenario Format Reference

Each scenario follows this structure:

| Section | Purpose |
| --- | --- |
| Scenario Title | Problem statement |
| Severity + Blast Radius | Impact classification |
| Architecture Diagram | Visual of failure point |
| Root Cause Analysis | Why this happens |
| Detection | How to spot the problem |
| Impact Analysis | What breaks and for whom |
| Resolution Steps | How to fix it |
| Prevention | How to stop it from happening again |
| Verification | How to confirm the fix works |
| Key Takeaway | One-sentence lesson |

Scenario 1: WebSocket Connection Dropping Mid-Stream Losing Partial Response

Severity: HIGH | Blast Radius: Single User Session

Problem Statement

A MangaAssist user asks "Recommend dark fantasy manga series similar to Berserk with detailed art." Bedrock Claude 3 Sonnet begins streaming the response. After 40 tokens (roughly "Based on your interest in Berserk, I'd recommend these dark fantasy manga series with similarly detailed art: 1. Vagabond by Takehiko Inou"), the WebSocket connection drops. The client receives a close frame with code 1006 (abnormal closure). The partial response is lost, and the user sees an incomplete, broken message.

Architecture Diagram

Timeline:
t=0.0s   User sends question via WebSocket
t=0.3s   Bedrock TTFT, first token arrives
t=0.3-1.2s  40 tokens streamed to client successfully
t=1.2s   ──── NETWORK INTERRUPTION ────
         │  User's mobile switches from WiFi to LTE
         │  TCP connection resets
         │  API Gateway receives TCP RST
         │  API Gateway fires $disconnect route
         └──────────────────────────────────
t=1.2s   Bedrock continues generating (unaware of disconnect)
t=1.3s   ECS tries POST @connections/{connectionId}
         → GoneException (connection no longer exists)
t=1.3s   ECS catches GoneException, stops reading Bedrock stream
t=1.3s   Partial response "Based on your interest in..." is lost
         (only existed in ECS task memory, not persisted)
t=3.0s   Client reconnects with new connectionId
t=3.1s   Client sends resume request with messageId
t=3.2s   Server has no record of partial response → must regenerate
┌──────────┐     ┌─────────────┐     ┌──────────┐     ┌─────────┐
│  Client  │────>│ API Gateway │────>│   ECS    │────>│ Bedrock │
│ (Mobile) │     │  WebSocket  │     │ Fargate  │     │ Claude3 │
└──────┬───┘     └──────┬──────┘     └────┬─────┘     └────┬────┘
       │                │                 │                 │
       │  40 tokens OK  │                 │                 │
       │<═══════════════│<════════════════│<════════════════│
       │                │                 │                 │
    ╔══╧════════════╗   │                 │                 │
    ║ WiFi → LTE    ║   │                 │                 │
    ║ TCP RST       ║   │                 │                 │
    ╚══╤════════════╝   │                 │                 │
       │                │                 │                 │
       X  Connection    │── $disconnect──>│                 │
          LOST          │                 │── GoneException │
                        │                 │   on next POST  │
                        │                 │                 │
                        │                 │── STOP stream──>│
                        │                 │                 │
                        │                 │  Partial text   │
                        │                 │  LOST (not saved)│

Root Cause Analysis

Primary cause: No intermediate persistence of streaming tokens.

The streaming pipeline processes tokens in-memory only:

1. Bedrock generates tokens and sends them as EventStream chunks
2. ECS Fargate receives each chunk, transforms it, and pushes to the WebSocket via @connections API
3. If the WebSocket connection drops, the ECS task catches GoneException and stops processing
4. The 40 tokens already sent to the client exist only in the client's DOM, which may be lost on reconnection
5. The remaining tokens (which Bedrock would have generated) are never produced because the stream was cancelled
6. No server-side record of the partial response exists

Contributing factors:

- Mobile network handoff (WiFi to cellular) causes TCP reset
- API Gateway has no built-in mechanism to buffer or persist WebSocket messages
- The ECS orchestrator does not save intermediate streaming state to Redis
- Client-side storage of the partial response depends on the reconnection implementation

Detection

CloudWatch Alarms:

| Metric | Namespace | Condition | Alarm |
| --- | --- | --- | --- |
| @connections GoneException count | MangaAssist/Streaming | > 50/minute | WARN |
| Stream cancellation rate | MangaAssist/Streaming | > 20% of streams | HIGH |
| $disconnect route invocations | AWS/ApiGateway | Spike > 3x baseline | WARN |
| Mid-stream disconnect ratio | MangaAssist/Streaming | > 10% | HIGH |
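
As a reference point, the first alarm in the table can be created with boto3 along these lines (a sketch; the metric name GoneExceptionCount and the alarm name are placeholders for whatever the orchestrator actually publishes under MangaAssist/Streaming):

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="mangaassist-goneexception-rate",
    AlarmDescription="WARN: more than 50 GoneExceptions/minute on @connections posts",
    Namespace="MangaAssist/Streaming",
    MetricName="GoneExceptionCount",   # placeholder custom metric emitted by the ECS orchestrator
    Statistic="Sum",
    Period=60,                         # 1-minute buckets
    EvaluationPeriods=1,
    Threshold=50,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)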

Log patterns to search:

# ECS Fargate application logs
filter @message like /GoneException/
| stats count() as gone_exceptions by bin(5m)

# Identify mid-stream disconnects specifically
filter @message like /GoneException/ and @message like /chunk_index > 0/
| stats count() as mid_stream_drops by bin(5m)

Client-side detection:

- WebSocket onclose event with code 1006 (abnormal closure)
- Accumulated text length > 0 at the time of disconnect (a partial response exists in the DOM)
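
A minimal client-side check along those lines (reportMidStreamDrop is an illustrative telemetry hook; accumulatedText, currentMessageId, and lastReceivedChunkIndex are the same client state used in Step 3 below):

websocket.onclose = (event) => {
    // Code 1006 = abnormal closure; accumulated text means we were mid-stream
    if (event.code === 1006 && accumulatedText.length > 0) {
        reportMidStreamDrop({
            messageId: currentMessageId,
            charsReceived: accumulatedText.length,
            lastChunkIndex: lastReceivedChunkIndex,
        });
    }
};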

Impact Analysis

| Aspect | Impact |
| --- | --- |
| User experience | User sees partial, broken text; must re-ask the question |
| Cost | Wasted Bedrock tokens (40 tokens generated but response incomplete) |
| Session state | Conversation history may have an incomplete assistant turn |
| Estimated frequency | ~2-5% of mobile sessions experience network handoff |
| At 1M msgs/day | 20,000-50,000 partial-response incidents daily |

Resolution Steps

Step 1: Implement server-side token caching in Redis

Add Redis buffering to the streaming pipeline so every token is persisted as it is generated:

# In the ECS streaming orchestrator, after each token:
async def handle_stream_token(self, connection_id: str, message_id: str, chunk: StreamChunk):
    """Process each streaming token with Redis persistence."""
    # 1. Append token to Redis list (survives disconnect)
    redis_key = f"stream:{message_id}:tokens"
    self._redis.rpush(redis_key, json.dumps({
        "text": chunk.text,
        "index": chunk.index,
        "timestamp": time.time(),
    }))
    # Set TTL for cleanup (5 minutes — enough for reconnection)
    self._redis.expire(redis_key, 300)

    # 2. Try to send to client via WebSocket
    try:
        await self.ws_handler.send_token(connection_id, chunk.text, chunk.index)
    except GoneException:
        # Connection is gone — but tokens are safe in Redis
        logger.info(f"Connection gone for {message_id}, tokens preserved in Redis")
        # Do NOT stop reading from Bedrock — continue generating
        # The full response will be available for reconnection

Step 2: Continue Bedrock stream even after disconnect

# Modified stream reading loop
async def stream_with_persistence(self, connection_id: str, message_id: str, prompt: str):
    connection_alive = True

    async for chunk in self.fm_client.stream_response(prompt, model="sonnet"):
        # Always persist to Redis
        await self.persist_token(message_id, chunk)

        # Try to deliver if connection is still alive
        if connection_alive:
            try:
                await self.ws_handler.send_token(connection_id, chunk.text)
            except GoneException:
                connection_alive = False
                logger.info(f"Continuing generation for {message_id} (client disconnected)")

    # Save final response to DynamoDB session
    full_response = self.assemble_from_redis(message_id)
    await self.save_session(message_id, full_response)

Step 3: Implement client-side reconnection with resume

// Client reconnection handler
websocket.onclose = async (event) => {
    if (event.code === 1006 && currentMessageId) {
        // Abnormal closure during active stream
        savePartialResponse(currentMessageId, accumulatedText);

        await reconnectWithBackoff();

        // Request resume from server
        websocket.send(JSON.stringify({
            action: "resume",
            data: {
                messageId: currentMessageId,
                lastChunkIndex: lastReceivedChunkIndex,
            }
        }));
    }
};

Step 4: Server-side resume handler

async def handle_resume(self, connection_id: str, message_id: str, last_index: int):
    """Resume streaming from where the client left off."""
    redis_key = f"stream:{message_id}:tokens"
    total_tokens = self._redis.llen(redis_key)

    if total_tokens == 0:
        # No cached tokens — must regenerate
        await self.ws_handler.send_error(
            connection_id, "STREAM_EXPIRED", "Regenerating response..."
        )
        return await self.regenerate(connection_id, message_id)

    # Replay tokens from last_index + 1 onwards
    for i in range(last_index + 1, total_tokens):
        token_json = self._redis.lindex(redis_key, i)
        token = json.loads(token_json)
        await self.ws_handler.send_token(
            connection_id, token["text"], token["index"]
        )

    # Check if stream is still generating
    stream_complete_key = f"stream:{message_id}:complete"
    if self._redis.exists(stream_complete_key):
        await self.ws_handler.send_complete(connection_id)
    else:
        # Stream still in progress — attach this connection to receive new tokens
        await self.attach_to_active_stream(connection_id, message_id)

Prevention

| Measure | Implementation | Effectiveness |
| --- | --- | --- |
| Redis token caching | RPUSH each token, 5-min TTL | Eliminates data loss |
| Continue generation on disconnect | Do not cancel Bedrock stream on GoneException | Full response always generated |
| Client-side partial response storage | IndexedDB/localStorage backup | Client-side recovery |
| Server-side resume protocol | Replay from Redis cache | Seamless continuation |
| Proactive heartbeat before long queries | Send heartbeat immediately before the Bedrock call (see the sketch below) | Reduces idle-timeout drops |
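
The heartbeat measure in the last row can be as simple as posting a tiny frame to the connection immediately before the (potentially long) Bedrock call, so an already-dead connection surfaces as GoneException before any tokens are paid for. A sketch, assuming send_heartbeat is a thin wrapper around the @connections POST:

async def invoke_with_preflight_heartbeat(self, connection_id: str, message_id: str, prompt: str):
    """Check that the connection is still alive right before starting generation."""
    try:
        # Assumed helper: POST a small {"action": "heartbeat"} payload to @connections/{connectionId}
        await self.ws_handler.send_heartbeat(connection_id)
    except GoneException:
        # Client already disconnected -- skip the Bedrock call entirely and save the cost
        logger.info(f"Connection {connection_id} gone before generation of {message_id}")
        return

    await self.stream_with_persistence(connection_id, message_id, prompt)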

Verification

# 1. Simulate mid-stream disconnect
# Connect via wscat, send a message, then kill the connection after 1 second

# 2. Check Redis for cached tokens
redis-cli LRANGE stream:msg-test-001:tokens 0 -1

# 3. Reconnect and request resume
# Send resume message, verify tokens are replayed

# 4. Monitor CloudWatch metrics
# Verify mid_stream_disconnect_with_recovery metric counts up
# Verify stream_data_loss metric stays at 0

Key Takeaway

Never rely solely on in-memory streaming pipelines for user-facing responses. Every token should be persisted to a fast cache (Redis) as it is generated, enabling server-side resume after disconnection without regeneration cost.


Scenario 2: Streaming Backpressure Causing Client Memory Bloat

Severity: MEDIUM | Blast Radius: Affected Client Devices

Problem Statement

During a MangaAssist session on a low-end Android device, a user asks a complex question that triggers a Claude 3 Sonnet response. Bedrock generates tokens at approximately 80 tokens/second. The ECS Fargate orchestrator pushes each token to the client via WebSocket. The client's JavaScript event loop cannot process and render the incoming tokens fast enough — DOM manipulation, Markdown rendering, and scroll management consume more time per token than the inter-token interval. The WebSocket message buffer grows in the browser's memory. After 2,000+ tokens (a long response), the browser tab consumes 500+ MB of RAM, the UI becomes unresponsive, and on some devices the browser kills the tab.

Architecture Diagram

Token Production vs Consumption Rate:

Production (Bedrock → ECS → WebSocket → Client):
  80 tokens/sec ════════════════════════════════>

Consumption (Client JS parse + render + scroll):
  ~30 tokens/sec ═══════════════>
                                  ↑
                        50 tokens/sec deficit
                        accumulates in browser
                        WebSocket message queue

Memory over time (2000 token response, ~25 seconds):
  RAM (MB)
  500 ┤                                          ╭──── Tab killed
      │                                     ╭────╯
  400 ┤                                ╭────╯
      │                           ╭────╯
  300 ┤                      ╭────╯
      │                 ╭────╯
  200 ┤            ╭────╯
      │       ╭────╯
  100 ┤  ╭────╯
      │──╯
   50 ┤─ Baseline
      └────┬────┬────┬────┬────┬────┬────┬────┬──
           5   10   15   20   25   30   35   40  seconds
┌─────────────────────────────────────────────────────────┐
│                    Client (Browser)                      │
│                                                         │
│  WebSocket        Message Queue        JS Main Thread   │
│  Receiver         (Growing!)           (Overloaded)     │
│  ┌──────┐        ┌──────────────┐     ┌──────────────┐ │
│  │ WS   │──────> │ token 41     │     │ Render token │ │
│  │ onmsg│        │ token 42     │     │ #15          │ │
│  │      │        │ token 43     │     │              │ │
│  │ 80/s │        │ ...          │     │ DOM update   │ │
│  │      │        │ token 120    │     │ MD parse     │ │
│  │      │        │ (80 queued!) │     │ Scroll       │ │
│  └──────┘        └──────────────┘     │ ~30/s        │ │
│                   ▲                    └──────────────┘ │
│                   │ 50 tokens/sec accumulating          │
│                   │ ~300 KB/sec memory growth            │
└─────────────────────────────────────────────────────────┘

Root Cause Analysis

Primary cause: Mismatch between server-side token production rate and client-side rendering capacity.

The rendering pipeline on low-end devices involves:

1. WebSocket onmessage callback: ~1ms (fast)
2. JSON parse: ~0.5ms
3. Append text to state: ~0.5ms
4. Re-render Markdown: ~15ms (expensive — full re-parse of accumulated text)
5. DOM update: ~10ms (layout recalculation, paint)
6. Scroll to bottom: ~5ms (triggers reflow)

Total per token: ~32ms = ~31 tokens/sec maximum rendering throughput

At 80 tokens/sec from Bedrock, the deficit is 49 tokens/sec, meaning ~1,225 unprocessed messages accumulate in the browser's WebSocket message queue over a 25-second response.

Contributing factors:

- No server-side awareness of client rendering capacity
- No flow control protocol in the WebSocket communication
- Markdown re-rendering of the entire accumulated text on each token (O(n) per token = O(n^2) total)
- No requestAnimationFrame batching for DOM updates
- No virtual rendering for long responses

Detection

Client-side detection:

| Signal | Threshold | Meaning |
| --- | --- | --- |
| performance.memory.usedJSHeapSize | Growing > 1 MB/sec during stream | Memory leak / queue growth |
| Time between onmessage and render | > 100ms | Main thread congestion |
| requestAnimationFrame callback delay | > 50ms | UI thread blocked |
| Queued messages count (custom counter) | > 50 | Falling behind |
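
One way to wire up the first and last rows in the browser (performance.memory is non-standard and Chromium-only, so the access is guarded; reportBackpressureRisk is an illustrative telemetry hook, and flowController is the StreamFlowController from Step 1 below):

function startStreamHealthMonitor(flowController) {
    let lastHeapMB = 0;
    const interval = setInterval(() => {
        // Custom queue-depth counter maintained by the flow controller
        if (flowController.queueSize > 50) {
            reportBackpressureRisk({ queueSize: flowController.queueSize });
        }
        // performance.memory is Chromium-only -- guard the access
        if (performance.memory) {
            const heapMB = performance.memory.usedJSHeapSize / (1024 * 1024);
            if (lastHeapMB && heapMB - lastHeapMB > 1) {   // > 1 MB growth per 1s tick
                reportBackpressureRisk({ heapGrowthMB: heapMB - lastHeapMB });
            }
            lastHeapMB = heapMB;
        }
    }, 1000);
    return () => clearInterval(interval);   // call this to stop monitoring when the stream ends
}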

Server-side detection:

| Signal | Threshold | Meaning |
| --- | --- | --- |
| No flow_control:pause from client | Despite long response | Client may not implement backpressure |
| Sudden GoneException after long stream | After 1000+ tokens | Tab may have been killed |
| Client heartbeat stops during stream | After 10+ seconds | Tab unresponsive |

Impact Analysis

| Aspect | Impact |
| --- | --- |
| User experience | UI freezes, text stutters, app feels broken |
| Device impact | Battery drain from CPU thrashing, potential tab crash |
| Lost responses | If the tab crashes, the partial response is lost |
| User segments | Primarily mobile users on budget devices (~30% of MangaAssist users) |
| At 1M msgs/day | ~50,000 long responses (>500 tokens) could trigger this |

Resolution Steps

Step 1: Implement client-side flow control signaling

class StreamFlowController {
    constructor(websocket, options = {}) {
        this.ws = websocket;
        this.maxQueueSize = options.maxQueueSize || 30;
        this.resumeThreshold = options.resumeThreshold || 10;
        this.isPaused = false;
        this.queueSize = 0;
    }

    onTokenReceived() {
        this.queueSize++;

        if (!this.isPaused && this.queueSize > this.maxQueueSize) {
            this.ws.send(JSON.stringify({
                action: "flow_control",
                data: { command: "pause" }
            }));
            this.isPaused = true;
        }
    }

    onTokenRendered() {
        this.queueSize = Math.max(0, this.queueSize - 1);

        if (this.isPaused && this.queueSize <= this.resumeThreshold) {
            this.ws.send(JSON.stringify({
                action: "flow_control",
                data: { command: "resume" }
            }));
            this.isPaused = false;
        }
    }
}

Step 2: Server-side batching when client is slow

async def adaptive_send(self, connection_id: str, tokens: list):
    """Batch multiple tokens into a single WebSocket frame when the client is slow."""
    if self.backpressure.state == FlowState.BUFFERING:
        # Client has signaled pause — accumulate tokens (self._pending starts as [] in __init__)
        self._pending.extend(tokens)
        return

    if self.backpressure.state == FlowState.DRAINING:
        # Send everything accumulated so far plus the new tokens, 10 tokens per frame
        pending = self._pending + tokens
        self._pending = []
        while pending:
            batch, pending = pending[:10], pending[10:]
            combined_text = "".join(t.text for t in batch)
            await self.ws_handler.send_token(
                connection_id,
                combined_text,
                batch[0].index,
            )

Step 3: Optimize client-side rendering

// BEFORE: O(n^2) — re-render all text on each token
onToken(text) {
    this.fullText += text;
    this.element.innerHTML = markdownToHtml(this.fullText); // EXPENSIVE
}

// AFTER: O(1) — append only new text, batch DOM updates
class StreamRenderer {
    constructor(container) {
        this.container = container;
        this.pendingText = "";
        this.rafScheduled = false;
    }

    onToken(text) {
        this.pendingText += text;
        if (!this.rafScheduled) {
            this.rafScheduled = true;
            requestAnimationFrame(() => this.flush());
        }
    }

    flush() {
        this.rafScheduled = false;
        if (!this.pendingText) return;

        // Append a new text node (not re-render everything)
        const textNode = document.createTextNode(this.pendingText);
        this.container.appendChild(textNode);
        this.pendingText = "";

        // Scroll to bottom (use scrollIntoView, not scrollTop assignment)
        this.container.lastChild.scrollIntoView({ behavior: "instant" });
    }
}

Step 4: Enable server-side adaptive buffering

Use the ChunkBufferManager with BufferStrategy.ADAPTIVE to batch tokens server-side when generation is fast, reducing the number of WebSocket frames per second.
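
The ChunkBufferManager interface is project-specific, but the ADAPTIVE strategy amounts to flushing a combined frame when either a token-count or an age threshold is reached. A rough, self-contained sketch of that idea (the names here are illustrative, not the real class):

import time

class AdaptiveChunkBuffer:
    """Accumulate tokens and flush on size OR age: fast generation yields fewer,
    larger frames; slow generation still feels live to the user."""

    def __init__(self, max_tokens: int = 5, max_age_ms: int = 100):
        self.max_tokens = max_tokens
        self.max_age_ms = max_age_ms
        self._tokens = []
        self._first_token_at = None

    def add(self, text: str):
        """Return a combined frame to send now, or None if still buffering."""
        if self._first_token_at is None:
            self._first_token_at = time.monotonic()
        self._tokens.append(text)

        age_ms = (time.monotonic() - self._first_token_at) * 1000
        if len(self._tokens) >= self.max_tokens or age_ms >= self.max_age_ms:
            frame = "".join(self._tokens)
            self._tokens = []
            self._first_token_at = None
            return frame
        return None

At ~80 tokens/sec this flushes roughly every 5 tokens (about 16 frames/sec), which is the order of reduction quoted in the prevention table below.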

Prevention

| Measure | Implementation | Effectiveness |
| --- | --- | --- |
| Server-side adaptive buffering | ChunkBufferManager(ADAPTIVE) | Reduces frame rate from 80/s to ~20/s |
| Client flow control protocol | flow_control:pause/resume messages | Explicit backpressure |
| requestAnimationFrame batching | Batch DOM updates to 60fps | Eliminates main thread blocking |
| Incremental Markdown rendering | Append-only text nodes | O(1) per token instead of O(n) |
| Memory monitoring | performance.memory API polling | Early detection before crash |
| Server-side response length limit | Max 2000 tokens for mobile clients (sketch below) | Hard cap on memory growth |
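
The response length cap in the last row maps directly onto the max_tokens field of the Anthropic messages body sent to Bedrock. A sketch, assuming the mobile/desktop distinction is known from connection metadata (the model ID shown is an example):

import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def start_stream(prompt: str, is_mobile_client: bool):
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2000 if is_mobile_client else 4096,   # hard cap for low-end devices
        "messages": [{"role": "user", "content": prompt}],
    }
    return bedrock.invoke_model_with_response_stream(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",   # example model ID
        body=json.dumps(body),
    )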

Verification

// Client-side performance test
const metrics = {
    tokensReceived: 0,
    tokensRendered: 0,
    maxQueueDepth: 0,
    peakMemoryMB: 0,
};

// After a 2000-token stream:
// tokensReceived === tokensRendered (no backlog)
// maxQueueDepth < 50 (backpressure working)
// peakMemoryMB < 100 (no memory bloat)

Key Takeaway

Always implement bidirectional flow control in streaming systems. The server should never assume the client can consume tokens as fast as the model generates them. Use adaptive buffering server-side and requestAnimationFrame batching client-side to match production rate to consumption capacity.


Scenario 3: API Gateway WebSocket Idle Timeout Disconnecting Long FM Calls

Severity: MEDIUM | Blast Radius: Users with Complex Queries

Problem Statement

A MangaAssist user asks: "Compare the art styles, narrative techniques, and thematic elements of Berserk, Vagabond, and Vinland Saga across their full publication history." This triggers a RAG pipeline that retrieves extensive context from OpenSearch (3 manga series, multiple volumes each), builds a large prompt (~3000 input tokens), and calls Claude 3 Sonnet. The RAG retrieval takes 4 seconds and Bedrock takes an additional 800ms for the first token (large prompt processing). During these 4.8 seconds, no data flows through the WebSocket connection. Meanwhile, the API Gateway WebSocket idle timeout is 10 minutes, which should not be an issue. However, the actual problem is different: the API Gateway integration timeout for the backend (Lambda/HTTP integration) is 29 seconds, and more critically, the client-side code has a 5-second response timeout that fires before any token arrives, causing the client to assume the request failed.

Architecture Diagram

Timeline:
t=0.0s   Client sends complex query via WebSocket
t=0.0s   API Gateway routes to ECS Fargate (sendMessage route)
t=0.1s   ECS begins RAG retrieval
         ──── NO TOKENS FLOWING ────
t=0.5s   OpenSearch embedding query sent
t=2.0s   OpenSearch returns 15 context documents
t=2.5s   DynamoDB product detail lookups complete
t=3.5s   Prompt construction (3000 tokens) complete
t=4.0s   Bedrock InvokeModelWithResponseStream called
t=4.0-4.8s  Bedrock processing large prompt (no output yet)
         ──── STILL NO TOKENS FLOWING ────
t=4.8s   First token generated by Bedrock
t=5.0s   ⚠ Client timeout fires! Client shows "Request timed out"
         Client may attempt to reconnect or re-send
t=5.1s   Server pushes first token via @connections
         Client receives token but has already shown error
Client          API Gateway       ECS Fargate        OpenSearch      Bedrock
  │                 │                 │                   │              │
  │── sendMessage──>│── Route ───────>│                   │              │
  │                 │                 │── embed query ───>│              │
  │   ⏳ waiting     │   ⏳ waiting     │                   │              │
  │                 │                 │<── 15 docs ───────│              │
  │   ⏳ waiting     │   ⏳ waiting     │── product lookup ─┤              │
  │                 │                 │                   │              │
  │   ⏳ 5s timeout  │                 │── build prompt ──>│              │
  │   approaching   │                 │                   │              │
  │                 │                 │── stream call ────┤────────────>│
  │                 │                 │                   │   ⏳ TTFT     │
  │   ⚠ TIMEOUT!   │                 │                   │   800ms      │
  │   "Request      │                 │                   │              │
  │    failed"      │                 │<─ first token ────┤──────────── │
  │                 │<─ @connections──│                   │              │
  │<── token ───────│  (but client    │                   │              │
  │   (too late)    │   already       │                   │              │
  │                 │   timed out)    │                   │              │

Root Cause Analysis

Primary cause: Missing intermediate progress signals during the pre-generation pipeline.

The 4.8 seconds between the user's message and the first token are consumed by:

- OpenSearch vector search: ~1.5s
- DynamoDB product lookups: ~0.5s
- Prompt construction: ~1.0s
- Bedrock TTFT (large prompt): ~0.8s
- Network overhead: ~1.0s

During this entire window, the server sends zero messages to the client. The client has no way to distinguish between "server is working on it" and "server never received the request."

Contributing factors:

- Client-side timeout set too aggressively (5 seconds)
- No progress/status messages sent during the RAG pipeline
- No "thinking" indicator pushed to the client before the Bedrock call
- API Gateway WebSocket route integration timeout (29s) is adequate, but the client does not wait that long
- Large prompts (3000+ tokens) increase Bedrock TTFT significantly

Detection

| Signal | Source | Threshold |
| --- | --- | --- |
| Client-side timeout events | Client telemetry | > 5% of requests |
| Time between sendMessage and first token | ECS metrics | p95 > 5 seconds |
| RAG pipeline duration | ECS metrics | p95 > 3 seconds |
| Bedrock TTFT for large prompts | StreamingFMClient metrics | p95 > 1 second |
| Duplicate message sends (retries after timeout) | API Gateway logs | > 100/hour |

Impact Analysis

| Aspect | Impact |
| --- | --- |
| User experience | "Request timed out" error followed by the delayed response appearing |
| Duplicate requests | Client retries, generating the same response twice (double cost) |
| Cost waste | Two Bedrock calls for one question: 2x Sonnet cost = $0.0075 |
| Frequency | ~15% of Sonnet requests have TTFT > 4 seconds with large context |
| At 1M msgs/day (20% Sonnet) | ~30,000 timeout-then-duplicate incidents daily |

Resolution Steps

Step 1: Send immediate acknowledgment

async def handle_send_message(self, connection_id: str, message: dict):
    """Handle incoming chat message with progress signals."""
    message_id = generate_message_id()

    # IMMEDIATELY acknowledge receipt (before any processing)
    await self.ws_handler._post_to_connection(connection_id, {
        "action": "message_received",
        "data": {
            "messageId": message_id,
            "status": "processing",
            "timestamp": int(time.time() * 1000),
        }
    })

    # Client now knows the server received the message
    # Client should reset its timeout to a longer "processing" timeout

Step 2: Send progress updates during RAG pipeline

async def rag_with_progress(self, connection_id: str, message_id: str, query: str):
    """RAG pipeline that sends progress updates to the client."""

    # Progress: searching knowledge base
    await self.send_progress(connection_id, message_id, "searching", 0.2)
    context_docs = await self.opensearch.vector_search(query)

    # Progress: retrieving product details
    await self.send_progress(connection_id, message_id, "retrieving", 0.4)
    products = await self.dynamodb.batch_get_products(context_docs)

    # Progress: building context
    await self.send_progress(connection_id, message_id, "preparing", 0.6)
    prompt = self.build_prompt(query, context_docs, products)

    # Progress: generating response
    await self.send_progress(connection_id, message_id, "generating", 0.8)
    # Now call Bedrock streaming — first token will arrive shortly

async def send_progress(self, connection_id, message_id, stage, progress):
    await self.ws_handler._post_to_connection(connection_id, {
        "action": "progress",
        "data": {
            "messageId": message_id,
            "stage": stage,
            "progress": progress,
            "timestamp": int(time.time() * 1000),
        }
    })

Step 3: Implement tiered client-side timeouts

const TIMEOUT_CONFIG = {
    ack_timeout: 3000,        // 3s to receive message_received
    processing_timeout: 30000, // 30s for RAG + generation to start
    streaming_timeout: 10000,  // 10s between tokens during streaming
    total_timeout: 120000,     // 2 min absolute maximum
};

class TimeoutManager {
    constructor(ws) {
        this.ws = ws;
        this.currentPhase = "waiting_ack";
        this.timer = null;
    }

    startAckTimeout() {
        this.currentPhase = "waiting_ack";
        this.timer = setTimeout(() => {
            this.onTimeout("No acknowledgment received");
        }, TIMEOUT_CONFIG.ack_timeout);
    }

    onMessageReceived() {
        clearTimeout(this.timer);
        this.currentPhase = "processing";
        this.timer = setTimeout(() => {
            this.onTimeout("Processing taking too long");
        }, TIMEOUT_CONFIG.processing_timeout);
    }

    onProgressUpdate() {
        clearTimeout(this.timer);
        // Reset processing timeout on each progress update
        this.timer = setTimeout(() => {
            this.onTimeout("Processing stalled");
        }, TIMEOUT_CONFIG.processing_timeout);
    }

    onFirstToken() {
        clearTimeout(this.timer);
        this.currentPhase = "streaming";
        this.resetStreamingTimeout();
    }

    onToken() {
        this.resetStreamingTimeout();
    }

    resetStreamingTimeout() {
        clearTimeout(this.timer);
        this.timer = setTimeout(() => {
            this.onTimeout("Streaming stalled");
        }, TIMEOUT_CONFIG.streaming_timeout);
    }
    onTimeout(reason) {
        // Application-specific handling: show an error, offer retry, cancel the stream
        console.warn(`Timeout during ${this.currentPhase}: ${reason}`);
    }
}

Prevention

| Measure | Implementation | Effectiveness |
| --- | --- | --- |
| Immediate ACK on message receipt | Send within 100ms of receiving message | Eliminates "did server get it?" uncertainty |
| Progress signals during RAG | Send 3-4 updates during pipeline | Client knows server is alive |
| Tiered client timeouts | 3s ACK / 30s processing / 10s inter-token | Matches actual server behavior |
| Parallel RAG steps | Run OpenSearch and DynamoDB concurrently (sketch below) | Reduce pipeline from 4s to ~2.5s |
| Prompt size optimization | Limit context to 5 docs instead of 15 | Reduce Bedrock TTFT from 800ms to 400ms |
| Pre-warm Bedrock | Keep-alive requests to reduce cold start | Reduce TTFT by ~100ms |
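
The "parallel RAG steps" row assumes that at least part of the DynamoDB work (for example, loading the session history) depends only on the incoming query, not on the retrieval results. Under that assumption, the pipeline from Step 2 can overlap the two calls with asyncio.gather, roughly like this (get_session_history is an assumed helper):

import asyncio

async def rag_retrieval_parallel(self, session_id: str, query: str):
    """Overlap the vector search with the session-history fetch; both need only the query."""
    context_docs, history = await asyncio.gather(
        self.opensearch.vector_search(query),            # ~1.5s
        self.dynamodb.get_session_history(session_id),   # ~0.5s, now hidden behind the search
    )
    # Product detail lookups still need the retrieved document IDs, so they stay sequential
    products = await self.dynamodb.batch_get_products(context_docs)
    return context_docs, history, products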

Verification

# Simulate a complex query with long RAG pipeline
# 1. Send complex query via WebSocket
# 2. Verify ACK received within 1 second
# 3. Verify 2+ progress updates received
# 4. Verify first token received within 6 seconds
# 5. Verify NO client timeout events in telemetry

# CloudWatch query for timeout incidents
filter @message like /client_timeout/
| stats count() by bin(1h)
# Should drop to near zero after fix

Key Takeaway

In real-time AI systems, the user must never wonder if their request was received. Send immediate acknowledgment, then pipeline progress updates before the first model token arrives. Structure client timeouts around these expected signals, not around a single monolithic deadline.


Scenario 4: SSE Connection Not Closing After FM Completes

Severity: LOW-MEDIUM | Blast Radius: Server Resource Exhaustion Over Time

Problem Statement

MangaAssist uses SSE as a fallback transport for clients behind WebSocket-blocking proxies. A user asks about a manga title, the response streams successfully via SSE, and the final event: done is sent. However, the client's EventSource connection remains open. The server (ECS Fargate) holds the HTTP connection, the associated goroutine/async handler, and the Redis session reference. Over hours, hundreds of "zombie" SSE connections accumulate, consuming file descriptors, memory, and connection pool slots on the ECS tasks. Eventually, new SSE requests fail with "connection refused" because the ECS task has exhausted its file descriptor limit.

Architecture Diagram

Normal SSE Lifecycle:
  Client ──GET /stream──> ECS ──stream tokens──> Client ──event:done──> Client closes
  ✓ Clean                                                               ✓ Resources freed

Broken SSE Lifecycle (this scenario):
  Client ──GET /stream──> ECS ──stream tokens──> Client ──event:done──> ???
  │                        │                                             │
  │  Connection stays open │  Handler stays alive                        │ EventSource does
  │  HTTP keep-alive       │  File descriptor held                       │ NOT auto-close on
  │                        │  Memory allocated                           │ "done" event
  │                        │  Redis refs active                          │
  │                        │                                             │
  │  ◄──── ZOMBIE CONNECTION ────────────────────────────────────►      │

Over time:
  ┌─────────────────────────────────────────────────────────┐
  │              ECS Fargate Task                            │
  │                                                         │
  │  Active SSE connections:     15                         │
  │  Zombie SSE connections:    247    ← growing hourly     │
  │  File descriptors used:    524/1024                     │
  │  Memory consumed by zombies: ~124 MB                    │
  │                                                         │
  │  ⚠ At this rate, FD limit reached in 4 hours           │
  └─────────────────────────────────────────────────────────┘

Root Cause Analysis

Primary cause: The SSE event: done is an application-level signal, not a transport-level close.

The SSE protocol (EventSource API) does not automatically close the connection when a specific event type is received. The EventSource is designed to be a persistent connection that receives events indefinitely. Sending event: done tells the application "the response is complete," but the HTTP connection remains open. The EventSource will even attempt to reconnect if the server closes the connection (that is its built-in behavior).

Contributing factors:

1. The server sends event: done but does not close the HTTP response stream
2. The client JavaScript handles event: done but does not call eventSource.close()
3. ALB keep-alive settings allow idle connections to persist for 60 seconds (default), but SSE connections are not truly "idle" from the ALB's perspective since the connection is still established
4. The ECS task does not have a post-completion timeout to force-close SSE connections
5. No monitoring on "zombie" connection count

Detection

| Signal | Source | Threshold |
| --- | --- | --- |
| Open SSE connections per task | ECS custom metric | > 50 (normal: ~10) |
| File descriptor usage | ECS task OS metrics | > 70% of ulimit |
| SSE connections older than 5 minutes | Application metric | > 0 |
| Memory growth without load increase | ECS CloudWatch | Steady increase |
| "Connection refused" errors | ALB access logs | Any occurrence |

Impact Analysis

| Aspect | Impact |
| --- | --- |
| Immediate | Wasted server resources (memory, FDs, Redis connections) |
| Progressive | New connections fail as FDs exhaust |
| Blast radius | All SSE users on the affected ECS task |
| Time to impact | ~4-6 hours of normal traffic before FD exhaustion |
| Recovery | ECS task restart (kills all zombie connections) |
| Frequency | Every SSE response creates a zombie (100% of SSE traffic) |

Resolution Steps

Step 1: Server-side — Close the HTTP response after done event

async def handle_sse_stream(request):
    """SSE endpoint that properly closes after stream completes."""
    session_id = request.query.get("session_id")
    sse_manager = SSEResponseManager()
    session = sse_manager.create_session(session_id)

    async def generate():
        # Send retry directive
        yield f"retry: 3000\n\n"

        # Stream tokens from Bedrock
        async for chunk in bedrock_stream:
            event = sse_manager.create_token_event(session, chunk.text)
            yield event.serialize()

        # Send done event
        done_event = sse_manager.create_done_event(session)
        yield done_event.serialize()

        # CRITICAL: Close the response stream after done
        # This terminates the HTTP response, signaling the client
        # that no more data will be sent on this connection.
        # Without this, the connection stays open as a zombie.

    response = web.StreamResponse()
    response.headers.update(sse_manager.get_sse_headers())
    await response.prepare(request)

    try:
        async for data in generate():
            await response.write(data.encode("utf-8"))
    finally:
        # Ensure response is finalized (sends zero-length chunk for chunked encoding)
        await response.write_eof()
        sse_manager.cleanup_session(session_id)
        logger.info(f"SSE connection closed for session {session_id}")

Step 2: Client-side — Close EventSource on done event

const eventSource = new EventSource(`/stream?session_id=${sessionId}`);

eventSource.addEventListener("done", (event) => {
    const data = JSON.parse(event.data);
    console.log(`Stream complete: ${data.total_chunks} chunks`);

    // CRITICAL: Explicitly close the EventSource
    // Without this, the browser will attempt to reconnect
    // when the server closes the connection
    eventSource.close();

    // Update UI to show response is complete
    updateStreamStatus("complete");
});

eventSource.addEventListener("error", (event) => {
    if (eventSource.readyState === EventSource.CLOSED) {
        // Server closed the connection — this is expected after "done"
        console.log("SSE connection closed by server (expected)");
        return;
    }
    // Genuine error
    handleSSEError(event);
});

Step 3: Server-side — Add connection reaper for zombies

class SSEConnectionReaper:
    """
    Periodically scans for and terminates zombie SSE connections
    that should have been closed after stream completion.
    """

    MAX_SSE_DURATION = 300  # 5 minutes absolute max for any SSE connection
    REAP_INTERVAL = 60      # Check every 60 seconds

    def __init__(self, sse_manager: SSEResponseManager):
        self.sse_manager = sse_manager
        self._active_connections: Dict[str, float] = {}  # session_id -> start_time

    def register(self, session_id: str):
        self._active_connections[session_id] = time.time()

    def deregister(self, session_id: str):
        self._active_connections.pop(session_id, None)

    async def reap_loop(self):
        """Run in background to clean up zombie connections."""
        while True:
            await asyncio.sleep(self.REAP_INTERVAL)
            now = time.time()
            zombies = [
                sid for sid, start in self._active_connections.items()
                if now - start > self.MAX_SSE_DURATION
            ]
            for sid in zombies:
                logger.warning(f"Reaping zombie SSE connection: {sid}")
                self.sse_manager.cleanup_session(sid)
                self.deregister(sid)
                # Force-close the response (implementation depends on web framework)

Step 4: ALB idle timeout configuration

# Set the ALB idle timeout to 120 seconds. Note that idle_timeout.timeout_seconds
# is a load balancer attribute (not a target group attribute), so it is set with
# modify-load-balancer-attributes. This ensures connections that are truly idle
# get cleaned up; normal SSE responses complete in <30 seconds.
aws elbv2 modify-load-balancer-attributes \
    --load-balancer-arn arn:aws:elasticloadbalancing:... \
    --attributes Key=idle_timeout.timeout_seconds,Value=120

Prevention

| Measure | Implementation | Effectiveness |
| --- | --- | --- |
| Server write_eof() after done | Close HTTP response explicitly | Eliminates server-side zombies |
| Client eventSource.close() on done | Close EventSource explicitly | Prevents reconnection attempts |
| Connection reaper background task | Kill connections > 5 min old | Safety net for edge cases |
| ALB idle timeout 120s | Auto-kill idle connections | Infrastructure-level cleanup |
| File descriptor monitoring | Alert on FD usage > 70% | Early warning |
| SSE connection count metric | Custom CloudWatch metric | Visibility into zombie accumulation |

Verification

# 1. Trigger 10 SSE streaming responses
# 2. After all complete, check open connections on ECS task:
ss -s  # Show socket summary
lsof -i -P | grep ESTABLISHED | wc -l

# Expected: Only baseline connections (Redis, DDB, etc.)
# NO lingering SSE connections

# 3. CloudWatch metric: sse_zombie_connection_count should be 0

Key Takeaway

SSE's event: done is an application-level signal; it does not close the HTTP connection. Both server (call write_eof()) and client (call eventSource.close()) must explicitly terminate the connection after the stream completes. Always add a background reaper as a safety net to catch any zombie connections that slip through.


Scenario 5: Chunked Encoding Breaking Unicode Characters at Chunk Boundaries

Severity: MEDIUM | Blast Radius: Japanese Content Responses

Problem Statement

MangaAssist serves Japanese-speaking users. When streaming a response about manga titles via chunked transfer encoding through the ALB, a chunk boundary falls in the middle of a multi-byte UTF-8 character. For example, the Japanese character "東" (East, Unicode U+6771) is encoded as 3 bytes in UTF-8: E6 9D B1. If the chunk boundary falls after E6 9D, the client receives an incomplete UTF-8 sequence, triggering a TextDecoder error. The user sees garbled text (a replacement character) instead of "東京グール" (Tokyo Ghoul).

Architecture Diagram

UTF-8 encoding of "東京グール" (Tokyo Ghoul):
  東 = E6 9D B1     (3 bytes)
  京 = E4 BA AC     (3 bytes)
  グ = E3 82 B0     (3 bytes)
  ー = E3 83 BC     (3 bytes)
  ル = E3 83 AB     (3 bytes)
  Total: 15 bytes

Chunked encoding with BAD chunk boundary:

  Chunk 1 (8 bytes):           Chunk 2 (7 bytes):
  E6 9D B1  E4 BA AC  E3 82   B0  E3 83 BC  E3 83 AB
  ├─ 東 ──┤ ├─ 京 ──┤ ├─ ??   ├──────────────────────┤
                       ↑
                  SPLIT HERE!
                  E3 82 is the start of グ (needs E3 82 B0)
                  but B0 is in the next chunk

  Client receives Chunk 1:
  "東京" + INVALID BYTE SEQUENCE (E3 82) → TextDecoder error or U+FFFD (�)

  Client receives Chunk 2:
  B0 E3 83 BC E3 83 AB → starts with continuation byte (B0) → MORE ERRORS
ECS Fargate                  ALB                      Client
    │                         │                         │
    │── HTTP Response ──────>│                         │
    │   Transfer-Encoding:   │                         │
    │   chunked              │                         │
    │                         │                         │
    │── chunk: "東京" + E382 ─>│── forward chunk 1 ────>│
    │   (8 bytes)            │                         │── TextDecoder
    │                         │                         │   ERROR on E3 82
    │                         │                         │   Shows: 東京�
    │── chunk: B0 + "ール" ──>│── forward chunk 2 ────>│
    │   (7 bytes)            │                         │── TextDecoder
    │                         │                         │   ERROR on B0
    │                         │                         │   Shows: 東京��ール
    │                         │                         │
    │                         │                         │   Expected: 東京グール
    │                         │                         │   Got:      東京��ール

Root Cause Analysis

Primary cause: The chunked response writer flushes at byte boundaries without regard for UTF-8 character boundaries.

UTF-8 is a variable-width encoding:

- ASCII (U+0000 to U+007F): 1 byte, starts with 0xxxxxxx
- Latin/Greek/Cyrillic (U+0080 to U+07FF): 2 bytes, starts with 110xxxxx
- CJK/Japanese/Korean (U+0800 to U+FFFF): 3 bytes, starts with 1110xxxx
- Emoji/rare (U+10000 to U+10FFFF): 4 bytes, starts with 11110xxx
- Continuation bytes: 10xxxxxx

When the server writes a fixed-size buffer (e.g., 8 KB) and flushes, the flush point may land between the leading byte and continuation bytes of a multi-byte character.

Contributing factors:

1. The ECS application uses a fixed buffer size for chunked response writing
2. The buffer flush does not check for UTF-8 character boundaries
3. The ALB does not reassemble or validate UTF-8 across chunks (it is a pass-through)
4. The browser's TextDecoder in strict mode rejects invalid sequences
5. For Japanese text (3 bytes per character), a byte-level flush point lands inside a character roughly 66% of the time, whereas it can never split a 1-byte ASCII character

Detection

| Signal | Source | Threshold |
| --- | --- | --- |
| TextDecoder errors in client | Client telemetry | Any occurrence |
| U+FFFD (replacement character) in rendered text | Client text analysis | Any occurrence |
| UnicodeDecodeError in server logs | ECS application logs | Any occurrence |
| Japanese responses with garbled characters | User complaints | Any report |

Client detection code:

const decoder = new TextDecoder("utf-8", { fatal: true }); // strict mode
try {
    const text = decoder.decode(chunk);
} catch (e) {
    // TypeError: The encoded data was not valid UTF-8
    reportUnicodeError(chunk);
}

Impact Analysis

| Aspect | Impact |
| --- | --- |
| User experience | Garbled manga titles, unreadable Japanese text |
| Content accuracy | Manga title "東京グール" appears as "東京??ール" — user cannot identify it |
| Frequency | ~1 in 3 messages containing Japanese text affected (a byte-level flush splits a 3-byte character at ~66% of chunk boundaries that fall inside CJK text) |
| Scope | All chunked HTTP responses with Japanese content (SSE and chunked paths) |
| Trust impact | Users may doubt the chatbot's Japanese language competency |
| At 1M msgs/day | ~200,000 messages contain Japanese text, ~66,000 could be affected |

Resolution Steps

Step 1: Implement UTF-8-safe chunk flushing

class Utf8SafeChunkedWriter:
    """
    Writes chunked HTTP responses with guaranteed UTF-8 character
    boundary alignment.
    """

    def __init__(self, response, target_chunk_size: int = 4096):
        self.response = response
        self.target_chunk_size = target_chunk_size
        self._buffer = bytearray()

    async def write(self, text: str) -> None:
        """Write text to the chunked response, UTF-8-safe."""
        encoded = text.encode("utf-8")
        self._buffer.extend(encoded)

        while len(self._buffer) >= self.target_chunk_size:
            # Find the last safe UTF-8 boundary at or before target_chunk_size
            boundary = self._find_utf8_boundary(
                self._buffer, self.target_chunk_size
            )
            # Flush up to the boundary
            chunk = bytes(self._buffer[:boundary])
            self._buffer = self._buffer[boundary:]
            await self.response.write(chunk)

    async def flush(self) -> None:
        """Flush remaining buffer (guaranteed complete UTF-8)."""
        if self._buffer:
            await self.response.write(bytes(self._buffer))
            self._buffer.clear()

    @staticmethod
    def _find_utf8_boundary(data: bytearray, max_pos: int) -> int:
        """
        Find the last valid UTF-8 character boundary at or before max_pos.

        UTF-8 byte patterns:
          0xxxxxxx  -> ASCII, single byte (always a boundary)
          110xxxxx  -> Start of 2-byte sequence
          1110xxxx  -> Start of 3-byte sequence
          11110xxx  -> Start of 4-byte sequence
          10xxxxxx  -> Continuation byte (NOT a boundary)
        """
        pos = min(max_pos, len(data))

        # Walk backward from pos until we find a character start
        while pos > 0:
            byte = data[pos - 1]

            # If this byte is ASCII (0xxxxxxx), boundary is after it
            if byte < 0x80:
                return pos

            # If this byte is a continuation byte (10xxxxxx), keep going back
            if (byte & 0xC0) == 0x80:
                pos -= 1
                continue

            # This byte is a start byte (11xxxxxx)
            # Determine the expected sequence length
            if (byte & 0xE0) == 0xC0:
                seq_len = 2
            elif (byte & 0xF0) == 0xE0:
                seq_len = 3
            elif (byte & 0xF8) == 0xF0:
                seq_len = 4
            else:
                # Invalid byte, treat as boundary
                return pos

            # Check if the full sequence fits before max_pos
            seq_start = pos - 1
            if seq_start + seq_len <= max_pos:
                # Full sequence fits, boundary is after it
                return seq_start + seq_len
            else:
                # Sequence would be split, boundary is before it
                return seq_start

        return 0

Step 2: Apply to the streaming pipeline

async def stream_chunked_response(request, bedrock_stream):
    """Stream Bedrock response with UTF-8-safe chunked encoding."""
    response = web.StreamResponse(
        headers={
            "Content-Type": "text/plain; charset=utf-8",
            "Transfer-Encoding": "chunked",
        }
    )
    await response.prepare(request)

    writer = Utf8SafeChunkedWriter(response, target_chunk_size=4096)

    async for chunk in bedrock_stream:
        if chunk.text:
            await writer.write(chunk.text)

    # Flush any remaining bytes (guaranteed complete)
    await writer.flush()
    await response.write_eof()

Step 3: Client-side fallback with lenient decoder

Even with server-side fixes, implement a resilient client-side decoder as defense in depth:

class ResilientStreamDecoder {
    constructor() {
        // Use non-fatal mode: replaces invalid sequences with U+FFFD
        // instead of throwing
        this.decoder = new TextDecoder("utf-8", { fatal: false });
        this.pendingBytes = new Uint8Array(0);
    }

    decode(chunk) {
        // Combine any pending bytes from previous chunk
        const combined = this._concat(this.pendingBytes, new Uint8Array(chunk));

        // Find the last safe UTF-8 boundary
        const boundary = this._findUtf8Boundary(combined);

        if (boundary === combined.length) {
            // All bytes form complete characters
            this.pendingBytes = new Uint8Array(0);
            return this.decoder.decode(combined, { stream: true });
        }

        // Split at boundary: decode complete chars, save incomplete
        const complete = combined.slice(0, boundary);
        this.pendingBytes = combined.slice(boundary);

        return this.decoder.decode(complete, { stream: true });
    }

    flush() {
        // Decode any remaining bytes at end of stream
        if (this.pendingBytes.length > 0) {
            const text = this.decoder.decode(this.pendingBytes);
            this.pendingBytes = new Uint8Array(0);
            return text;
        }
        return this.decoder.decode();
    }

    _findUtf8Boundary(bytes) {
        let pos = bytes.length;
        // Walk backward to find the start of the last character
        while (pos > 0 && pos > bytes.length - 4) {
            pos--;
            const byte = bytes[pos];
            if ((byte & 0x80) === 0) {
                // ASCII — boundary is after this byte
                return pos + 1;
            }
            if ((byte & 0xC0) !== 0x80) {
                // Start byte found — check if sequence is complete
                let expectedLen;
                if ((byte & 0xE0) === 0xC0) expectedLen = 2;
                else if ((byte & 0xF0) === 0xE0) expectedLen = 3;
                else if ((byte & 0xF8) === 0xF0) expectedLen = 4;
                else return pos; // invalid, skip

                if (pos + expectedLen <= bytes.length) {
                    return pos + expectedLen; // Sequence complete
                }
                return pos; // Sequence incomplete — boundary before it
            }
        }
        return bytes.length;
    }

    _concat(a, b) {
        const result = new Uint8Array(a.length + b.length);
        result.set(a, 0);
        result.set(b, a.length);
        return result;
    }
}

Step 4: Add SSE-specific UTF-8 safety

For SSE transport, the same issue can occur if the SSE event data contains incomplete UTF-8. The SSEResponseManager should ensure each data: line contains only complete UTF-8 characters:

def create_token_event(self, session, text):
    """Create a token event, ensuring text contains only complete UTF-8."""
    # Validate the text is valid UTF-8 (should be, since it came from Bedrock)
    try:
        text.encode("utf-8").decode("utf-8")
    except UnicodeError:
        logger.warning(f"Invalid UTF-8 in token text, cleaning: {repr(text)}")
        text = text.encode("utf-8", errors="replace").decode("utf-8")

    # Rest of event creation unchanged
    event_id = f"{session.session_id}-{session.chunk_index}"
    session.chunk_index += 1
    session.accumulated_text += text
    session.last_event_id = event_id
    session.last_sent_at = time.time()

    return SSEEvent(
        event_type=SSEEventType.TOKEN,
        data={"text": text, "index": session.chunk_index - 1},
        event_id=event_id,
    )

Prevention

| Measure | Implementation | Effectiveness |
| --- | --- | --- |
| Utf8SafeChunkedWriter | Always flush at character boundaries | Eliminates server-side splits |
| ResilientStreamDecoder (client) | Buffer incomplete bytes, decode only complete chars | Client-side defense in depth |
| TextDecoder({ fatal: false }) | Replace rather than throw on invalid sequences | Graceful degradation |
| SSE UTF-8 validation | Validate each event data payload | Ensures SSE events are clean |
| Integration test with CJK text | Test with all 3-byte and 4-byte character sets | Catches regressions |
| Chunk size = multiple of 12 | LCM of 1, 2, 3, 4 byte sequences | Reduces split probability |

Verification

# Server-side test
import asyncio

class _MockResponse:
    """Minimal stand-in for a StreamResponse that records written chunks."""
    def __init__(self):
        self.chunks = []

    async def write(self, data: bytes):
        self.chunks.append(data)


def test_utf8_safe_chunking():
    """Verify chunks never split multi-byte characters."""
    test_texts = [
        "東京グール",                    # All 3-byte CJK
        "Hello 東京 World",             # Mixed ASCII and CJK
        "🔥 ファイア 🔥",               # 4-byte emoji + 3-byte katakana
        "Berserk ベルセルク Vol.1",      # Mixed with punctuation
        "進撃の巨人 Attack on Titan",    # Full manga title
    ]

    async def run_case(text: str) -> list:
        mock_response = _MockResponse()
        # Use a tiny chunk size to force many splits
        writer = Utf8SafeChunkedWriter(mock_response, target_chunk_size=5)
        await writer.write(text)
        await writer.flush()
        return mock_response.chunks

    for text in test_texts:
        chunks = asyncio.run(run_case(text))

        # Verify each chunk is valid UTF-8
        for chunk in chunks:
            try:
                chunk.decode("utf-8")
            except UnicodeDecodeError:
                assert False, f"Chunk contains incomplete UTF-8: {chunk.hex()}"

        # Verify reassembled text matches the original
        reassembled = b"".join(chunks).decode("utf-8")
        assert reassembled == text

// Client-side test
function testResilientDecoder() {
    const decoder = new ResilientStreamDecoder();
    // Simulate "東" (E6 9D B1) split across two chunks
    const chunk1 = new Uint8Array([0xE6, 0x9D]);      // Incomplete "東"
    const chunk2 = new Uint8Array([0xB1, 0xE4, 0xBA, 0xAC]); // Complete "東" + "京"

    const text1 = decoder.decode(chunk1);
    console.assert(text1 === "", "Should buffer incomplete char");

    const text2 = decoder.decode(chunk2);
    console.assert(text2 === "東京", "Should decode complete chars including buffered");
}

Key Takeaway

When streaming text over chunked HTTP in a multi-language application, never flush at arbitrary byte boundaries. Implement a UTF-8-aware chunked writer that only flushes at character boundaries, and add a resilient decoder on the client side that buffers incomplete multi-byte sequences across chunk boundaries. This is especially critical for CJK languages where every character is 3 bytes and has a 66% probability of being split by a naive byte-level flush.


Cross-Scenario Summary

| # | Scenario | Root Cause | Key Fix | Prevention |
| --- | --- | --- | --- | --- |
| 1 | WebSocket drop mid-stream | No token persistence | Redis token caching + resume | Persist every token, continue generation on disconnect |
| 2 | Client memory bloat | Production/consumption rate mismatch | Bidirectional flow control | Adaptive buffering + requestAnimationFrame batching |
| 3 | Idle timeout on long FM calls | No progress signals during RAG | Tiered timeouts + progress updates | ACK immediately, send pipeline progress |
| 4 | SSE zombie connections | Done event is not transport close | Explicit close on both sides | Server write_eof() + client eventSource.close() + reaper |
| 5 | Unicode split at chunk boundary | Byte-level flush ignores UTF-8 | UTF-8-safe chunk writer | Flush at character boundaries, resilient client decoder |

Common Patterns Across All Scenarios

  1. Never assume the transport handles application semantics. WebSocket close is not the same as stream complete. SSE done event is not connection close. Chunked encoding is not UTF-8-aware.

  2. Always implement both server-side and client-side defenses. No single-side fix is sufficient. Server writes UTF-8-safe chunks AND client handles incomplete sequences. Server sends progress AND client has tiered timeouts.

  3. Persist intermediate state for resumability. Tokens in Redis, connection records in DynamoDB, partial responses in client storage. Every streaming pipeline should survive disconnection at any point.

  4. Monitor the gaps, not just the flows. Track zombie connections, split characters, mid-stream drops, and rate mismatches. The interesting failures happen in the spaces between components.

  5. Design for Japanese content from the start. Multi-byte characters, no word boundaries, and mixed-script content affect buffering, chunking, rendering, and search. Retrofitting CJK support is far harder than building it in.