Scenarios and Runbooks — Real-Time AI Interaction
MangaAssist context: JP Manga store chatbot on AWS — Bedrock Claude 3 (Sonnet at $3/$15 per 1M tokens input/output, Haiku at $0.25/$1.25), OpenSearch Serverless (vector store), DynamoDB (sessions/products), ECS Fargate (orchestrator), API Gateway WebSocket, ElastiCache Redis. Target: useful answer in under 3 seconds, 1M messages/day scale.
Skill Mapping
| Field | Value |
|---|---|
| Certification | AWS Certified AI Practitioner (AIP-C01) |
| Domain | 2 — Implementation and Integration of Foundation Models |
| Task | 2.4 — Design and implement interaction mechanisms for FM-based applications |
| Skill | 2.4.2 — Develop real-time AI interaction systems to provide immediate feedback from FM |
Scenario Format Reference
Each scenario follows this structure:
| Section | Purpose |
|---|---|
| Scenario Title | Problem statement |
| Severity + Blast Radius | Impact classification |
| Architecture Diagram | Visual of failure point |
| Root Cause Analysis | Why this happens |
| Detection | How to spot the problem |
| Impact Analysis | What breaks and for whom |
| Resolution Steps | How to fix it |
| Prevention | How to stop it from happening again |
| Verification | How to confirm the fix works |
| Key Takeaway | One-sentence lesson |
Scenario 1: WebSocket Connection Dropping Mid-Stream Losing Partial Response
Severity: HIGH | Blast Radius: Single User Session
Problem Statement
A MangaAssist user asks "Recommend dark fantasy manga series similar to Berserk with detailed art." Bedrock Claude 3 Sonnet begins streaming the response. After 40 tokens (roughly "Based on your interest in Berserk, I'd recommend these dark fantasy manga series with similarly detailed art: 1. Vagabond by Takehiko Inou"), the WebSocket connection drops. The client receives a close frame with code 1006 (abnormal closure). The partial response is lost, and the user sees an incomplete, broken message.
Architecture Diagram
Timeline:
t=0.0s User sends question via WebSocket
t=0.3s Bedrock TTFT, first token arrives
t=0.3-1.2s 40 tokens streamed to client successfully
t=1.2s ──── NETWORK INTERRUPTION ────
│ User's mobile switches from WiFi to LTE
│ TCP connection resets
│ API Gateway receives TCP RST
│ API Gateway fires $disconnect route
└──────────────────────────────────
t=1.2s Bedrock continues generating (unaware of disconnect)
t=1.3s ECS tries POST @connections/{connectionId}
→ GoneException (connection no longer exists)
t=1.3s ECS catches GoneException, stops reading Bedrock stream
t=1.3s Partial response "Based on your interest in..." is lost
(only existed in ECS task memory, not persisted)
t=3.0s Client reconnects with new connectionId
t=3.1s Client sends resume request with messageId
t=3.2s Server has no record of partial response → must regenerate
┌──────────┐ ┌─────────────┐ ┌──────────┐ ┌─────────┐
│ Client │────>│ API Gateway │────>│ ECS │────>│ Bedrock │
│ (Mobile) │ │ WebSocket │ │ Fargate │ │ Claude3 │
└──────┬───┘ └──────┬──────┘ └────┬─────┘ └────┬────┘
│ │ │ │
│ 40 tokens OK │ │ │
│<═══════════════│<════════════════│<════════════════│
│ │ │ │
╔══╧════════════╗ │ │ │
║ WiFi → LTE ║ │ │ │
║ TCP RST ║ │ │ │
╚══╤════════════╝ │ │ │
│ │ │ │
X Connection │── $disconnect──>│ │
LOST │ │── GoneException │
│ │ on next POST │
│ │ │
│ │── STOP stream──>│
│ │ │
│ │ Partial text │
│ │ LOST (not saved)│
Root Cause Analysis
Primary cause: No intermediate persistence of streaming tokens.
The streaming pipeline processes tokens in-memory only:
1. Bedrock generates tokens and sends them as EventStream chunks
2. ECS Fargate receives each chunk, transforms it, and pushes to the WebSocket via @connections API
3. If the WebSocket connection drops, the ECS task catches GoneException and stops processing
4. The 40 tokens already sent to the client exist only in the client's DOM, which may be lost on reconnection
5. The remaining tokens (which Bedrock would have generated) are never produced because the stream was cancelled
6. No server-side record of the partial response exists
Contributing factors:
- Mobile network handoff (WiFi to cellular) causes a TCP reset
- API Gateway has no built-in mechanism to buffer or persist WebSocket messages
- The ECS orchestrator does not save intermediate streaming state to Redis
- Client-side storage of the partial response depends on the reconnection implementation
Detection
CloudWatch Alarms:
| Metric | Namespace | Condition | Alarm |
|---|---|---|---|
| @connections GoneException count | MangaAssist/Streaming | > 50/minute | WARN |
| Stream cancellation rate | MangaAssist/Streaming | > 20% of streams | HIGH |
| $disconnect route invocations | AWS/ApiGateway | Spike > 3x baseline | WARN |
| Mid-stream disconnect ratio | MangaAssist/Streaming | > 10% | HIGH |
Log patterns to search:
# ECS Fargate application logs
filter @message like /GoneException/
| stats count() as gone_exceptions by bin(5m)
# Identify mid-stream disconnects specifically
filter @message like /GoneException/ and @message like /chunk_index > 0/
| stats count() as mid_stream_drops by bin(5m)
Client-side detection:
- WebSocket onclose event with code 1006 (abnormal closure)
- Accumulated text length > 0 at time of disconnect (partial response exists in DOM)
Impact Analysis
| Aspect | Impact |
|---|---|
| User experience | User sees partial, broken text; must re-ask the question |
| Cost | Wasted Bedrock tokens (40 tokens generated but response incomplete) |
| Session state | Conversation history may have incomplete assistant turn |
| Estimated frequency | ~2-5% of mobile sessions experience network handoff |
| At 1M msgs/day | 20,000-50,000 partial response incidents daily |
Resolution Steps
Step 1: Implement server-side token caching in Redis
Add Redis buffering to the streaming pipeline so every token is persisted as it is generated:
# In the ECS streaming orchestrator, after each token:
async def handle_stream_token(self, connection_id: str, message_id: str, chunk: StreamChunk):
"""Process each streaming token with Redis persistence."""
# 1. Append token to Redis list (survives disconnect)
redis_key = f"stream:{message_id}:tokens"
self._redis.rpush(redis_key, json.dumps({
"text": chunk.text,
"index": chunk.index,
"timestamp": time.time(),
}))
# Set TTL for cleanup (5 minutes — enough for reconnection)
self._redis.expire(redis_key, 300)
# 2. Try to send to client via WebSocket
try:
await self.ws_handler.send_token(connection_id, chunk.text, chunk.index)
except GoneException:
# Connection is gone — but tokens are safe in Redis
logger.info(f"Connection gone for {message_id}, tokens preserved in Redis")
# Do NOT stop reading from Bedrock — continue generating
# The full response will be available for reconnection
Step 2: Continue Bedrock stream even after disconnect
# Modified stream reading loop
async def stream_with_persistence(self, connection_id: str, message_id: str, prompt: str):
connection_alive = True
async for chunk in self.fm_client.stream_response(prompt, model="sonnet"):
# Always persist to Redis
await self.persist_token(message_id, chunk)
# Try to deliver if connection is still alive
if connection_alive:
try:
await self.ws_handler.send_token(connection_id, chunk.text)
except GoneException:
connection_alive = False
logger.info(f"Continuing generation for {message_id} (client disconnected)")
# Mark the stream complete so the resume handler knows no more tokens will arrive
self._redis.set(f"stream:{message_id}:complete", "1", ex=300)
# Save final response to DynamoDB session
full_response = self.assemble_from_redis(message_id)
await self.save_session(message_id, full_response)
Step 3: Implement client-side reconnection with resume
// Client reconnection handler
websocket.onclose = async (event) => {
if (event.code === 1006 && currentMessageId) {
// Abnormal closure during active stream
savePartialResponse(currentMessageId, accumulatedText);
await reconnectWithBackoff();
// Request resume from server
websocket.send(JSON.stringify({
action: "resume",
data: {
messageId: currentMessageId,
lastChunkIndex: lastReceivedChunkIndex,
}
}));
}
};
Step 4: Server-side resume handler
async def handle_resume(self, connection_id: str, message_id: str, last_index: int):
"""Resume streaming from where the client left off."""
redis_key = f"stream:{message_id}:tokens"
total_tokens = self._redis.llen(redis_key)
if total_tokens == 0:
# No cached tokens — must regenerate
await self.ws_handler.send_error(
connection_id, "STREAM_EXPIRED", "Regenerating response..."
)
return await self.regenerate(connection_id, message_id)
# Replay tokens from last_index + 1 onwards
for i in range(last_index + 1, total_tokens):
token_json = self._redis.lindex(redis_key, i)
token = json.loads(token_json)
await self.ws_handler.send_token(
connection_id, token["text"], token["index"]
)
# Check if stream is still generating
stream_complete_key = f"stream:{message_id}:complete"
if self._redis.exists(stream_complete_key):
await self.ws_handler.send_complete(connection_id)
else:
# Stream still in progress — attach this connection to receive new tokens
await self.attach_to_active_stream(connection_id, message_id)
Prevention
| Measure | Implementation | Effectiveness |
|---|---|---|
| Redis token caching | RPUSH each token, 5-min TTL | Eliminates data loss |
| Continue generation on disconnect | Do not cancel Bedrock stream on GoneException | Full response always generated |
| Client-side partial response storage | IndexedDB/localStorage backup | Client-side recovery |
| Server-side resume protocol | Replay from Redis cache | Seamless continuation |
| Proactive heartbeat before long queries | Send heartbeat immediately before Bedrock call | Reduces idle-timeout drops |
Verification
# 1. Simulate mid-stream disconnect
# Connect via wscat, send a message, then kill the connection after 1 second
# 2. Check Redis for cached tokens
redis-cli LRANGE stream:msg-test-001:tokens 0 -1
# 3. Reconnect and request resume
# Send resume message, verify tokens are replayed
# 4. Monitor CloudWatch metrics
# Verify mid_stream_disconnect_with_recovery metric counts up
# Verify stream_data_loss metric stays at 0
Key Takeaway
Never rely solely on in-memory streaming pipelines for user-facing responses. Every token should be persisted to a fast cache (Redis) as it is generated, enabling server-side resume after disconnection without regeneration cost.
Scenario 2: Streaming Backpressure Causing Client Memory Bloat
Severity: MEDIUM | Blast Radius: Affected Client Devices
Problem Statement
During a MangaAssist session on a low-end Android device, a user asks a complex question that triggers a Claude 3 Sonnet response. Bedrock generates tokens at approximately 80 tokens/second. The ECS Fargate orchestrator pushes each token to the client via WebSocket. The client's JavaScript event loop cannot process and render the incoming tokens fast enough — DOM manipulation, Markdown rendering, and scroll management consume more time per token than the inter-token interval. The WebSocket message buffer grows in the browser's memory. After 2,000+ tokens (a long response), the browser tab consumes 500+ MB of RAM, the UI becomes unresponsive, and on some devices the browser kills the tab.
Architecture Diagram
Token Production vs Consumption Rate:
Production (Bedrock → ECS → WebSocket → Client):
80 tokens/sec ════════════════════════════════>
Consumption (Client JS parse + render + scroll):
~30 tokens/sec ═══════════════>
↑
50 tokens/sec deficit
accumulates in browser
WebSocket message queue
Memory over time (2000 token response, ~25 seconds):
RAM (MB)
500 ┤ ╭──── Tab killed
│ ╭────╯
400 ┤ ╭────╯
│ ╭────╯
300 ┤ ╭────╯
│ ╭────╯
200 ┤ ╭────╯
│ ╭────╯
100 ┤ ╭────╯
│──╯
50 ┤─ Baseline
└────┬────┬────┬────┬────┬────┬────┬────┬──
5 10 15 20 25 30 35 40 seconds
┌─────────────────────────────────────────────────────────┐
│ Client (Browser) │
│ │
│ WebSocket Message Queue JS Main Thread │
│ Receiver (Growing!) (Overloaded) │
│ ┌──────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ WS │──────> │ token 41 │ │ Render token │ │
│ │ onmsg│ │ token 42 │ │ #15 │ │
│ │ │ │ token 43 │ │ │ │
│ │ 80/s │ │ ... │ │ DOM update │ │
│ │ │ │ token 120 │ │ MD parse │ │
│ │ │ │ (80 queued!) │ │ Scroll │ │
│ └──────┘ └──────────────┘ │ ~30/s │ │
│ ▲ └──────────────┘ │
│ │ 50 tokens/sec accumulating │
│ │ ~300 KB/sec memory growth │
└─────────────────────────────────────────────────────────┘
Root Cause Analysis
Primary cause: Mismatch between server-side token production rate and client-side rendering capacity.
The rendering pipeline on low-end devices involves:
1. WebSocket onmessage callback: ~1ms (fast)
2. JSON parse: ~0.5ms
3. Append text to state: ~0.5ms
4. Re-render Markdown: ~15ms (expensive — full re-parse of accumulated text)
5. DOM update: ~10ms (layout recalculation, paint)
6. Scroll to bottom: ~5ms (triggers reflow)
Total per token: ~32ms = ~31 tokens/sec maximum rendering throughput
At 80 tokens/sec from Bedrock, the deficit is 49 tokens/sec, meaning ~1,225 unprocessed messages accumulate in the browser's WebSocket message queue over a 25-second response.
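The arithmetic above is easy to sanity-check; the rates are the scenario's estimates, not measurements:

```python
production_rate = 80   # tokens/sec arriving from Bedrock (scenario estimate)
render_cost_ms = 32    # per-token client render cost from the breakdown above

consumption_rate = 1000 // render_cost_ms     # ~31 tokens/sec the client can render
deficit = production_rate - consumption_rate  # ~49 tokens/sec left unprocessed
backlog_after_25s = deficit * 25              # ~1,225 messages queued by stream end
```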
Contributing factors:
- No server-side awareness of client rendering capacity
- No flow control protocol in the WebSocket communication
- Markdown re-rendering of the entire accumulated text on each token (O(n) per token = O(n^2) total)
- No requestAnimationFrame batching for DOM updates
- No virtual rendering for long responses
Detection
Client-side detection:
| Signal | Threshold | Meaning |
|---|---|---|
| performance.memory.usedJSHeapSize | Growing > 1 MB/sec during stream | Memory leak / queue growth |
| Time between onmessage and render | > 100ms | Main thread congestion |
| requestAnimationFrame callback delay | > 50ms | UI thread blocked |
| Queued messages count (custom counter) | > 50 | Falling behind |
Server-side detection:
| Signal | Threshold | Meaning |
|---|---|---|
| No flow_control:pause from client | Despite long response | Client may not implement backpressure |
| Sudden GoneException after long stream | After 1000+ tokens | Tab may have been killed |
| Client heartbeat stops during stream | After 10+ seconds | Tab unresponsive |
Impact Analysis
| Aspect | Impact |
|---|---|
| User experience | UI freezes, text stutters, app feels broken |
| Device impact | Battery drain from CPU thrashing, potential tab crash |
| Lost responses | If tab crashes, partial response lost |
| User segments | Primarily mobile users on budget devices (~30% of MangaAssist users) |
| At 1M msgs/day | ~50,000 long responses (>500 tokens) could trigger this |
Resolution Steps
Step 1: Implement client-side flow control signaling
class StreamFlowController {
constructor(websocket, options = {}) {
this.ws = websocket;
this.maxQueueSize = options.maxQueueSize || 30;
this.resumeThreshold = options.resumeThreshold || 10;
this.isPaused = false;
this.queueSize = 0;
}
onTokenReceived() {
this.queueSize++;
if (!this.isPaused && this.queueSize > this.maxQueueSize) {
this.ws.send(JSON.stringify({
action: "flow_control",
data: { command: "pause" }
}));
this.isPaused = true;
}
}
onTokenRendered() {
this.queueSize = Math.max(0, this.queueSize - 1);
if (this.isPaused && this.queueSize <= this.resumeThreshold) {
this.ws.send(JSON.stringify({
action: "flow_control",
data: { command: "resume" }
}));
this.isPaused = false;
}
}
}
Step 2: Server-side batching when client is slow
async def adaptive_send(self, connection_id: str, tokens: list):
"""Batch multiple tokens into single WebSocket frame when client is slow."""
if self.backpressure.state == FlowState.BUFFERING:
# Client has signaled pause — accumulate tokens
return
if self.backpressure.state == FlowState.DRAINING:
# Send accumulated tokens in batches
batch = tokens[:10] # 10 tokens per frame
combined_text = "".join(t.text for t in batch)
await self.ws_handler.send_token(
connection_id,
combined_text,
batch[0].index,
)
Step 3: Optimize client-side rendering
// BEFORE: O(n^2) — re-render all text on each token
onToken(text) {
this.fullText += text;
this.element.innerHTML = markdownToHtml(this.fullText); // EXPENSIVE
}
// AFTER: O(1) — append only new text, batch DOM updates
class StreamRenderer {
constructor(container) {
this.container = container;
this.pendingText = "";
this.rafScheduled = false;
}
onToken(text) {
this.pendingText += text;
if (!this.rafScheduled) {
this.rafScheduled = true;
requestAnimationFrame(() => this.flush());
}
}
flush() {
this.rafScheduled = false;
if (!this.pendingText) return;
// Append a new text node (not re-render everything)
const textNode = document.createTextNode(this.pendingText);
this.container.appendChild(textNode);
this.pendingText = "";
// Scroll to bottom (use scrollIntoView, not scrollTop assignment)
this.container.lastChild.scrollIntoView({ behavior: "instant" });
}
}
Step 4: Enable server-side adaptive buffering
Use the ChunkBufferManager with BufferStrategy.ADAPTIVE to batch tokens server-side when generation is fast, reducing the number of WebSocket frames per second.
Prevention
| Measure | Implementation | Effectiveness |
|---|---|---|
| Server-side adaptive buffering | ChunkBufferManager(ADAPTIVE) | Reduces frame rate from 80/s to ~20/s |
| Client flow control protocol | flow_control:pause/resume messages | Explicit backpressure |
| requestAnimationFrame batching | Batch DOM updates to 60fps | Eliminates main thread blocking |
| Incremental Markdown rendering | Append-only text nodes | O(1) per token instead of O(n) |
| Memory monitoring | performance.memory API polling | Early detection before crash |
| Server-side response length limit | Max 2000 tokens for mobile clients | Hard cap on memory growth |
Verification
// Client-side performance test
const metrics = {
tokensReceived: 0,
tokensRendered: 0,
maxQueueDepth: 0,
peakMemoryMB: 0,
};
// After a 2000-token stream:
// tokensReceived === tokensRendered (no backlog)
// maxQueueDepth < 50 (backpressure working)
// peakMemoryMB < 100 (no memory bloat)
Key Takeaway
Always implement bidirectional flow control in streaming systems. The server should never assume the client can consume tokens as fast as the model generates them. Use adaptive buffering server-side and requestAnimationFrame batching client-side to match production rate to consumption capacity.
Scenario 3: API Gateway WebSocket Idle Timeout Disconnecting Long FM Calls
Severity: MEDIUM | Blast Radius: Users with Complex Queries
Problem Statement
A MangaAssist user asks: "Compare the art styles, narrative techniques, and thematic elements of Berserk, Vagabond, and Vinland Saga across their full publication history." This triggers a RAG pipeline that retrieves extensive context from OpenSearch (3 manga series, multiple volumes each), builds a large prompt (~3000 input tokens), and calls Claude 3 Sonnet. The RAG retrieval takes 4 seconds and Bedrock takes an additional 800ms for the first token (large prompt processing). During these 4.8 seconds, no data flows through the WebSocket connection. Meanwhile, the API Gateway WebSocket idle timeout is 10 minutes, which should not be an issue. However, the actual problem is different: the API Gateway integration timeout for the backend (Lambda/HTTP integration) is 29 seconds, and more critically, the client-side code has a 5-second response timeout that fires before any token arrives, causing the client to assume the request failed.
Architecture Diagram
Timeline:
t=0.0s Client sends complex query via WebSocket
t=0.0s API Gateway routes to ECS Fargate (sendMessage route)
t=0.1s ECS begins RAG retrieval
──── NO TOKENS FLOWING ────
t=0.5s OpenSearch embedding query sent
t=2.0s OpenSearch returns 15 context documents
t=2.5s DynamoDB product detail lookups complete
t=3.5s Prompt construction (3000 tokens) complete
t=4.0s Bedrock InvokeModelWithResponseStream called
t=4.0-4.8s Bedrock processing large prompt (no output yet)
──── STILL NO TOKENS FLOWING ────
t=4.8s First token generated by Bedrock
t=5.0s ⚠ Client timeout fires! Client shows "Request timed out"
Client may attempt to reconnect or re-send
t=5.1s Server pushes first token via @connections
Client receives token but has already shown error
Client API Gateway ECS Fargate OpenSearch Bedrock
│ │ │ │ │
│── sendMessage──>│── Route ───────>│ │ │
│ │ │── embed query ───>│ │
│ ⏳ waiting │ ⏳ waiting │ │ │
│ │ │<── 15 docs ───────│ │
│ ⏳ waiting │ ⏳ waiting │── product lookup ─┤ │
│ │ │ │ │
│ ⏳ 5s timeout │ │── build prompt ──>│ │
│ approaching │ │ │ │
│ │ │── stream call ────┤────────────>│
│ │ │ │ ⏳ TTFT │
│ ⚠ TIMEOUT! │ │ │ 800ms │
│ "Request │ │ │ │
│ failed" │ │<─ first token ────┤──────────── │
│ │<─ @connections──│ │ │
│<── token ───────│ (but client │ │ │
│ (too late) │ already │ │ │
│ │ timed out) │ │ │
Root Cause Analysis
Primary cause: Missing intermediate progress signals during the pre-generation pipeline.
The 4.8 seconds between the user's message and the first token are consumed by:
- OpenSearch vector search: ~1.5s
- DynamoDB product lookups: ~0.5s
- Prompt construction: ~1.0s
- Bedrock TTFT (large prompt): ~0.8s
- Network overhead: ~1.0s
During this entire window, the server sends zero messages to the client. The client has no way to distinguish between "server is working on it" and "server never received the request."
Contributing factors:
- Client-side timeout set too aggressively (5 seconds)
- No progress/status messages sent during the RAG pipeline
- No "thinking" indicator pushed to the client before the Bedrock call
- The API Gateway WebSocket route integration timeout (29s) is adequate, but the client does not wait that long
- Large prompts (3000+ tokens) significantly increase Bedrock TTFT
Detection
| Signal | Source | Threshold |
|---|---|---|
| Client-side timeout events | Client telemetry | > 5% of requests |
| Time between sendMessage and first token | ECS metrics | p95 > 5 seconds |
| RAG pipeline duration | ECS metrics | p95 > 3 seconds |
| Bedrock TTFT for large prompts | StreamingFMClient metrics | p95 > 1 second |
| Duplicate message sends (retries after timeout) | API Gateway logs | > 100/hour |
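The "time between sendMessage and first token" signal has to be emitted by the orchestrator itself. A sketch of one way to do that; `make_ttft_recorder`, the metric name, and the namespace are illustrative, and `cloudwatch` is assumed to be a boto3 CloudWatch client (or anything with the same `put_metric_data` signature):

```python
import time

def make_ttft_recorder(cloudwatch, namespace: str = "MangaAssist/Streaming"):
    """Return (mark_start, mark_first_token) closures that emit a
    time-to-first-token metric for each request."""
    start = {}

    def mark_start(message_id: str):
        start[message_id] = time.monotonic()

    def mark_first_token(message_id: str) -> float:
        elapsed_ms = (time.monotonic() - start.pop(message_id)) * 1000.0
        cloudwatch.put_metric_data(
            Namespace=namespace,
            MetricData=[{
                "MetricName": "TimeToFirstToken",
                "Value": elapsed_ms,
                "Unit": "Milliseconds",
            }],
        )
        return elapsed_ms

    return mark_start, mark_first_token
```

Call `mark_start` when the sendMessage route fires and `mark_first_token` on the first Bedrock chunk; the resulting metric feeds the p95 alarm in the table above.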
Impact Analysis
| Aspect | Impact |
|---|---|
| User experience | "Request timed out" error followed by delayed response appearing |
| Duplicate requests | Client retries, generating the same response twice (double cost) |
| Cost waste | Duplicate call doubles Sonnet spend; ~$0.0075 wasted per incident (≈500 output tokens at $15/1M) |
| Frequency | ~15% of Sonnet requests have TTFT > 4 seconds with large context |
| At 1M msgs/day (20% Sonnet) | ~30,000 timeout-then-duplicate incidents daily |
Resolution Steps
Step 1: Send immediate acknowledgment
async def handle_send_message(self, connection_id: str, message: dict):
"""Handle incoming chat message with progress signals."""
message_id = generate_message_id()
# IMMEDIATELY acknowledge receipt (before any processing)
await self.ws_handler._post_to_connection(connection_id, {
"action": "message_received",
"data": {
"messageId": message_id,
"status": "processing",
"timestamp": int(time.time() * 1000),
}
})
# Client now knows the server received the message
# Client should reset its timeout to a longer "processing" timeout
Step 2: Send progress updates during RAG pipeline
async def rag_with_progress(self, connection_id: str, message_id: str, query: str):
"""RAG pipeline that sends progress updates to the client."""
# Progress: searching knowledge base
await self.send_progress(connection_id, message_id, "searching", 0.2)
context_docs = await self.opensearch.vector_search(query)
# Progress: retrieving product details
await self.send_progress(connection_id, message_id, "retrieving", 0.4)
products = await self.dynamodb.batch_get_products(context_docs)
# Progress: building context
await self.send_progress(connection_id, message_id, "preparing", 0.6)
prompt = self.build_prompt(query, context_docs, products)
# Progress: generating response
await self.send_progress(connection_id, message_id, "generating", 0.8)
# Now call Bedrock streaming — first token will arrive shortly
async def send_progress(self, connection_id, message_id, stage, progress):
await self.ws_handler._post_to_connection(connection_id, {
"action": "progress",
"data": {
"messageId": message_id,
"stage": stage,
"progress": progress,
"timestamp": int(time.time() * 1000),
}
})
Step 3: Implement tiered client-side timeouts
const TIMEOUT_CONFIG = {
ack_timeout: 3000, // 3s to receive message_received
processing_timeout: 30000, // 30s for RAG + generation to start
streaming_timeout: 10000, // 10s between tokens during streaming
total_timeout: 120000, // 2 min absolute maximum
};
class TimeoutManager {
constructor(ws) {
this.ws = ws;
this.currentPhase = "waiting_ack";
this.timer = null;
}
startAckTimeout() {
this.currentPhase = "waiting_ack";
this.timer = setTimeout(() => {
this.onTimeout("No acknowledgment received");
}, TIMEOUT_CONFIG.ack_timeout);
}
onMessageReceived() {
clearTimeout(this.timer);
this.currentPhase = "processing";
this.timer = setTimeout(() => {
this.onTimeout("Processing taking too long");
}, TIMEOUT_CONFIG.processing_timeout);
}
onProgressUpdate() {
clearTimeout(this.timer);
// Reset processing timeout on each progress update
this.timer = setTimeout(() => {
this.onTimeout("Processing stalled");
}, TIMEOUT_CONFIG.processing_timeout);
}
onFirstToken() {
clearTimeout(this.timer);
this.currentPhase = "streaming";
this.resetStreamingTimeout();
}
onToken() {
this.resetStreamingTimeout();
}
resetStreamingTimeout() {
clearTimeout(this.timer);
this.timer = setTimeout(() => {
this.onTimeout("Streaming stalled");
}, TIMEOUT_CONFIG.streaming_timeout);
}
}
Prevention
| Measure | Implementation | Effectiveness |
|---|---|---|
| Immediate ACK on message receipt | Send within 100ms of receiving message | Eliminates "did server get it?" uncertainty |
| Progress signals during RAG | Send 3-4 updates during pipeline | Client knows server is alive |
| Tiered client timeouts | 3s ACK / 30s processing / 10s inter-token | Matches actual server behavior |
| Parallel RAG steps | Run OpenSearch and DynamoDB concurrently | Reduce pipeline from 4s to ~2.5s |
| Prompt size optimization | Limit context to 5 docs instead of 15 | Reduce Bedrock TTFT from 800ms to 400ms |
| Pre-warm Bedrock | Keep-alive requests to reduce cold start | Reduce TTFT by ~100ms |
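The "parallel RAG steps" measure amounts to overlapping the OpenSearch and DynamoDB calls instead of awaiting them sequentially. A sketch with `asyncio.gather`; note that in Step 2 the product lookup consumes the search results, so full parallelism assumes the lookups can be keyed off the query itself (e.g., IDs parsed from it). The function and parameter names are hypothetical:

```python
import asyncio

async def retrieve_context_parallel(search_docs, lookup_products, query: str):
    """Run vector search and product lookups concurrently.

    search_docs and lookup_products are async callables standing in for the
    OpenSearch and DynamoDB clients. Sequential awaits pay the sum of both
    latencies; gather() pays only the max of the two.
    """
    docs, products = await asyncio.gather(
        search_docs(query),
        lookup_products(query),
    )
    return docs, products
```

With ~1.5s search and ~0.5s lookups, overlapping them saves roughly the lookup's latency, consistent with the 4s-to-~2.5s figure in the table.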
Verification
# Simulate a complex query with long RAG pipeline
# 1. Send complex query via WebSocket
# 2. Verify ACK received within 1 second
# 3. Verify 2+ progress updates received
# 4. Verify first token received within 6 seconds
# 5. Verify NO client timeout events in telemetry
# CloudWatch query for timeout incidents
filter @message like /client_timeout/
| stats count() by bin(1h)
# Should drop to near zero after fix
Key Takeaway
In real-time AI systems, the user must never wonder if their request was received. Send immediate acknowledgment, then pipeline progress updates before the first model token arrives. Structure client timeouts around these expected signals, not around a single monolithic deadline.
Scenario 4: SSE Connection Not Closing After FM Completes
Severity: LOW-MEDIUM | Blast Radius: Server Resource Exhaustion Over Time
Problem Statement
MangaAssist uses SSE as a fallback transport for clients behind WebSocket-blocking proxies. A user asks about a manga title, the response streams successfully via SSE, and the final event: done is sent. However, the client's EventSource connection remains open. The server (ECS Fargate) holds the HTTP connection, the associated goroutine/async handler, and the Redis session reference. Over hours, hundreds of "zombie" SSE connections accumulate, consuming file descriptors, memory, and connection pool slots on the ECS tasks. Eventually, new SSE requests fail with "connection refused" because the ECS task has exhausted its file descriptor limit.
Architecture Diagram
Normal SSE Lifecycle:
Client ──GET /stream──> ECS ──stream tokens──> Client ──event:done──> Client closes
✓ Clean ✓ Resources freed
Broken SSE Lifecycle (this scenario):
Client ──GET /stream──> ECS ──stream tokens──> Client ──event:done──> ???
│ │ │
│ Connection stays open │ Handler stays alive │ EventSource does
│ HTTP keep-alive │ File descriptor held │ NOT auto-close on
│ │ Memory allocated │ "done" event
│ │ Redis refs active │
│ │ │
│ ◄──── ZOMBIE CONNECTION ────────────────────────────────────► │
Over time:
┌─────────────────────────────────────────────────────────┐
│ ECS Fargate Task │
│ │
│ Active SSE connections: 15 │
│ Zombie SSE connections: 247 ← growing hourly │
│ File descriptors used: 524/1024 │
│ Memory consumed by zombies: ~124 MB │
│ │
│ ⚠ At this rate, FD limit reached in 4 hours │
└─────────────────────────────────────────────────────────┘
Root Cause Analysis
Primary cause: The SSE event: done is an application-level signal, not a transport-level close.
The SSE protocol (EventSource API) does not automatically close the connection when a specific event type is received. The EventSource is designed to be a persistent connection that receives events indefinitely. Sending event: done tells the application "the response is complete," but the HTTP connection remains open. The EventSource will even attempt to reconnect if the server closes the connection (that is its built-in behavior).
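The wire format makes this concrete: per the SSE specification, an event is just `event:`/`data:` lines terminated by a blank line, and nothing in that framing carries close semantics. A minimal serializer (the helper name is illustrative):

```python
import json

def sse_event(event_type: str, data: dict) -> str:
    """Serialize one Server-Sent Events frame.

    "done" is an ordinary named event like any other: the framing below
    says nothing about closing the connection, which is exactly why both
    server and client must close explicitly.
    """
    return f"event: {event_type}\ndata: {json.dumps(data)}\n\n"
```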
Contributing factors:
1. Server sends event: done but does not close the HTTP response stream
2. Client JavaScript handles event: done but does not call eventSource.close()
3. ALB keep-alive settings allow idle connections to persist for 60 seconds (default), but SSE connections are not truly "idle" from the ALB's perspective since the connection is still established
4. ECS task does not have a post-completion timeout to force-close SSE connections
5. No monitoring on "zombie" connection count
Detection
| Signal | Source | Threshold |
|---|---|---|
| Open SSE connections per task | ECS custom metric | > 50 (normal: ~10) |
| File descriptor usage | ECS task OS metrics | > 70% of ulimit |
| SSE connections older than 5 minutes | Application metric | > 0 |
| Memory growth without load increase | ECS CloudWatch | Steady increase |
| "Connection refused" errors | ALB access logs | Any occurrence |
Impact Analysis
| Aspect | Impact |
|---|---|
| Immediate | Wasted server resources (memory, FDs, Redis connections) |
| Progressive | New connections fail as FDs exhaust |
| Blast radius | All SSE users on the affected ECS task |
| Time to impact | ~4-6 hours of normal traffic before FD exhaustion |
| Recovery | ECS task restart (kills all zombie connections) |
| Frequency | Every SSE response creates a zombie (100% of SSE traffic) |
Resolution Steps
Step 1: Server-side — Close the HTTP response after done event
async def handle_sse_stream(request):
"""SSE endpoint that properly closes after stream completes."""
session_id = request.query.get("session_id")
sse_manager = SSEResponseManager()
session = sse_manager.create_session(session_id)
async def generate():
# Send retry directive
yield "retry: 3000\n\n"
# Stream tokens from Bedrock (bedrock_stream: the async chunk iterator
# returned by InvokeModelWithResponseStream, obtained by the caller)
async for chunk in bedrock_stream:
event = sse_manager.create_token_event(session, chunk.text)
yield event.serialize()
# Send done event
done_event = sse_manager.create_done_event(session)
yield done_event.serialize()
# CRITICAL: Close the response stream after done
# This terminates the HTTP response, signaling the client
# that no more data will be sent on this connection.
# Without this, the connection stays open as a zombie.
response = web.StreamResponse()
response.headers.update(sse_manager.get_sse_headers())
await response.prepare(request)
try:
async for data in generate():
await response.write(data.encode("utf-8"))
finally:
# Ensure response is finalized (sends zero-length chunk for chunked encoding)
await response.write_eof()
sse_manager.cleanup_session(session_id)
logger.info(f"SSE connection closed for session {session_id}")
Step 2: Client-side — Close EventSource on done event
const eventSource = new EventSource(`/stream?session_id=${sessionId}`);

eventSource.addEventListener("done", (event) => {
  const data = JSON.parse(event.data);
  console.log(`Stream complete: ${data.total_chunks} chunks`);
  // CRITICAL: Explicitly close the EventSource.
  // Without this, the browser will attempt to reconnect
  // when the server closes the connection.
  eventSource.close();
  // Update UI to show response is complete
  updateStreamStatus("complete");
});

eventSource.addEventListener("error", (event) => {
  if (eventSource.readyState === EventSource.CLOSED) {
    // Server closed the connection — this is expected after "done"
    console.log("SSE connection closed by server (expected)");
    return;
  }
  // Genuine error
  handleSSEError(event);
});
Step 3: Server-side — Add connection reaper for zombies
import asyncio
import logging
import time
from typing import Dict

logger = logging.getLogger(__name__)

class SSEConnectionReaper:
    """
    Periodically scans for and terminates zombie SSE connections
    that should have been closed after stream completion.
    """
    MAX_SSE_DURATION = 300  # 5 minutes absolute max for any SSE connection
    REAP_INTERVAL = 60      # Check every 60 seconds

    def __init__(self, sse_manager: SSEResponseManager):
        self.sse_manager = sse_manager
        self._active_connections: Dict[str, float] = {}  # session_id -> start_time

    def register(self, session_id: str):
        self._active_connections[session_id] = time.time()

    def deregister(self, session_id: str):
        self._active_connections.pop(session_id, None)

    async def reap_loop(self):
        """Run in background to clean up zombie connections."""
        while True:
            await asyncio.sleep(self.REAP_INTERVAL)
            now = time.time()
            zombies = [
                sid for sid, start in self._active_connections.items()
                if now - start > self.MAX_SSE_DURATION
            ]
            for sid in zombies:
                logger.warning(f"Reaping zombie SSE connection: {sid}")
                self.sse_manager.cleanup_session(sid)
                self.deregister(sid)
                # Force-close the response (implementation depends on web framework)
Step 4: ALB idle timeout configuration
# Set ALB idle timeout to 120 seconds
# (idle timeout is a load-balancer attribute, not a target-group attribute)
# This ensures connections that are truly idle get cleaned up
# Normal SSE responses complete in <30 seconds
aws elbv2 modify-load-balancer-attributes \
    --load-balancer-arn arn:aws:elasticloadbalancing:... \
    --attributes Key=idle_timeout.timeout_seconds,Value=120
Prevention
| Measure | Implementation | Effectiveness |
|---|---|---|
| Server `write_eof()` after done | Close HTTP response explicitly | Eliminates server-side zombies |
| Client `eventSource.close()` on done | Close EventSource explicitly | Prevents reconnection attempts |
| Connection reaper background task | Kill connections > 5 min old | Safety net for edge cases |
| ALB idle timeout 120s | Auto-kill idle connections | Infrastructure-level cleanup |
| File descriptor monitoring | Alert on FD usage > 70% | Early warning |
| SSE connection count metric | Custom CloudWatch metric | Visibility into zombie accumulation |
Verification
# 1. Trigger 10 SSE streaming responses
# 2. After all complete, check open connections on ECS task:
ss -s # Show socket summary
lsof -i -P | grep ESTABLISHED | wc -l
# Expected: Only baseline connections (Redis, DDB, etc.)
# NO lingering SSE connections
# 3. CloudWatch metric: sse_zombie_connection_count should be 0
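The FD check can also run in-process to feed the file-descriptor and zombie-connection metrics mentioned above. A Linux-only sketch using `/proc/self/fd` and the standard `resource` module (metric names and the 70% threshold are illustrative):

```python
import os
import resource

def open_fd_count() -> int:
    """File descriptors currently open in this process (Linux only)."""
    return len(os.listdir("/proc/self/fd"))

def fd_soft_limit() -> int:
    """Soft RLIMIT_NOFILE for this process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft

# Publish periodically (e.g. alongside sse_zombie_connection_count)
# and alert when open_fd_count() exceeds ~70% of fd_soft_limit().
```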
Key Takeaway
SSE's `event: done` is an application-level signal; it does not close the HTTP connection. Both server (call `write_eof()`) and client (call `eventSource.close()`) must explicitly terminate the connection after the stream completes. Always add a background reaper as a safety net to catch any zombie connections that slip through.
Scenario 5: Chunked Encoding Breaking Unicode Characters at Chunk Boundaries
Severity: MEDIUM | Blast Radius: Japanese Content Responses
Problem Statement
MangaAssist serves Japanese-speaking users. When streaming a response about manga titles via chunked transfer encoding through the ALB, a chunk boundary falls in the middle of a multi-byte UTF-8 character. For example, the Japanese character "東" (East, Unicode U+6771) is encoded as 3 bytes in UTF-8: E6 9D B1. If the chunk boundary falls after E6 9D, the client receives an incomplete UTF-8 sequence, triggering a TextDecoder error. The user sees garbled text (a replacement character) instead of "東京グール" (Tokyo Ghoul).
Architecture Diagram
UTF-8 encoding of "東京グール" (Tokyo Ghoul):
東 = E6 9D B1 (3 bytes)
京 = E4 BA AC (3 bytes)
グ = E3 82 B0 (3 bytes)
ー = E3 83 BC (3 bytes)
ル = E3 83 AB (3 bytes)
Total: 15 bytes
Chunked encoding with a BAD chunk boundary (split after byte 8):

Chunk 1 (8 bytes):           Chunk 2 (7 bytes):
E6 9D B1 E4 BA AC E3 82      B0 E3 83 BC E3 83 AB
├─ 東 ──┤├─ 京 ──┤├ ?? ─     ─ ? ┤├─ ー ─┤├─ ル ─┤
                      ↑
                 SPLIT HERE!
E3 82 is the start of グ (needs E3 82 B0)
but B0 is in the next chunk

Client receives Chunk 1:
"東京" + INVALID BYTE SEQUENCE (E3 82) → TextDecoder error or U+FFFD (�)
Client receives Chunk 2:
B0 E3 83 BC E3 83 AB → starts with continuation byte (B0) → MORE ERRORS
ECS Fargate ALB Client
│ │ │
│── HTTP Response ──────>│ │
│ Transfer-Encoding: │ │
│ chunked │ │
│ │ │
│── chunk: "東京" + E382 ─>│── forward chunk 1 ────>│
│ (8 bytes) │ │── TextDecoder
│ │ │ ERROR on E3 82
│ │ │ Shows: 東京�
│── chunk: B0 + "ール" ──>│── forward chunk 2 ────>│
│ (7 bytes) │ │── TextDecoder
│ │ │ ERROR on B0
│ │ │ Shows: 東京��ール
│ │ │
│ │ │ Expected: 東京グール
│ │ │ Got: 東京��ール
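The failure in the diagram can be reproduced in a few lines of Python; `bytes.decode` in strict mode fails exactly where a `TextDecoder` with `fatal: true` would:

```python
text = "東京グール"
data = text.encode("utf-8")          # 15 bytes, 3 per character
chunk1, chunk2 = data[:8], data[8:]  # boundary falls inside グ (E3 82 | B0)

try:
    chunk1.decode("utf-8")           # strict decode, like TextDecoder fatal: true
except UnicodeDecodeError as exc:
    print(f"chunk 1 fails: {exc.reason}")

# Lenient decode shows what the user sees: 東京 plus a replacement character
print(chunk1.decode("utf-8", errors="replace"))  # 東京�
```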
Root Cause Analysis
Primary cause: The chunked response writer flushes at byte boundaries without regard for UTF-8 character boundaries.
UTF-8 is a variable-width encoding:
- ASCII (U+0000 to U+007F): 1 byte, starts with 0xxxxxxx
- Latin/Greek/Cyrillic (U+0080 to U+07FF): 2 bytes, starts with 110xxxxx
- CJK/Japanese/Korean (U+0800 to U+FFFF): 3 bytes, starts with 1110xxxx
- Emoji/rare (U+10000 to U+10FFFF): 4 bytes, starts with 11110xxx
- Continuation bytes: 10xxxxxx
When the server writes a fixed-size buffer (e.g., 8 KB) and flushes, the flush point may land between the leading byte and continuation bytes of a multi-byte character.
Contributing factors:
1. ECS application uses a fixed buffer size for chunked response writing
2. The buffer flush does not check for UTF-8 character boundaries
3. ALB does not reassemble or validate UTF-8 across chunks (it is a pass-through)
4. Browser's TextDecoder in strict mode rejects invalid sequences
5. Japanese text (3 bytes per character) gives a chunk boundary a ~2-in-3 (66%) chance of falling inside a character, whereas pure ASCII (1 byte per character) can never be split mid-character
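The 2-in-3 figure can be checked empirically: a byte boundary splits a character exactly when the byte after it is a UTF-8 continuation byte (`10xxxxxx`):

```python
text = "東京グール" * 100          # pure 3-byte CJK text
data = text.encode("utf-8")

# A boundary before byte i splits a character iff data[i] is a continuation byte
splits = sum(1 for i in range(1, len(data)) if (data[i] & 0xC0) == 0x80)
fraction = splits / (len(data) - 1)
print(f"{fraction:.3f}")  # ≈ 0.667 — two of every three boundaries split a char
```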
Detection
| Signal | Source | Threshold |
|---|---|---|
| TextDecoder errors in client | Client telemetry | Any occurrence |
| U+FFFD (replacement character) in rendered text | Client text analysis | Any occurrence |
| `UnicodeDecodeError` in server logs | ECS application logs | Any occurrence |
| Japanese responses with garbled characters | User complaints | Any report |
Client detection code:
const decoder = new TextDecoder("utf-8", { fatal: true }); // strict mode
try {
  const text = decoder.decode(chunk);
} catch (e) {
  // TypeError: The encoded data was not valid UTF-8
  reportUnicodeError(chunk);
}
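On the Python side, the standard library's incremental UTF-8 decoder (`codecs.getincrementaldecoder`) performs the same cross-chunk buffering the client code above implements by hand, which makes it useful both for detection and as a building block for a fix:

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()

# "東" (E6 9D B1) split across two chunks
part1 = decoder.decode(b"\xe6\x9d")          # incomplete: buffered, returns ""
part2 = decoder.decode(b"\xb1\xe4\xba\xac")  # completes 東, then decodes 京
print(repr(part1), repr(part2))              # '' '東京'
```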
Impact Analysis
| Aspect | Impact |
|---|---|
| User experience | Garbled manga titles, unreadable Japanese text |
| Content accuracy | Manga title "東京グール" appears as "東京??ール" — user cannot identify it |
| Frequency | ~2 in 3 chunk boundaries that land inside Japanese text split a character |
| Scope | All chunked HTTP responses with Japanese content (SSE and chunked paths) |
| Trust impact | Users may doubt the chatbot's Japanese language competency |
| At 1M msgs/day | ~200,000 messages contain Japanese text, ~66,000 could be affected |
Resolution Steps
Step 1: Implement UTF-8-safe chunk flushing
class Utf8SafeChunkedWriter:
    """
    Writes chunked HTTP responses with guaranteed UTF-8 character
    boundary alignment.
    """

    def __init__(self, response, target_chunk_size: int = 4096):
        self.response = response
        self.target_chunk_size = target_chunk_size
        self._buffer = bytearray()

    async def write(self, text: str) -> None:
        """Write text to the chunked response, UTF-8-safe."""
        encoded = text.encode("utf-8")
        self._buffer.extend(encoded)
        while len(self._buffer) >= self.target_chunk_size:
            # Find the last safe UTF-8 boundary at or before target_chunk_size
            boundary = self._find_utf8_boundary(
                self._buffer, self.target_chunk_size
            )
            # Flush up to the boundary
            chunk = bytes(self._buffer[:boundary])
            self._buffer = self._buffer[boundary:]
            await self.response.write(chunk)

    async def flush(self) -> None:
        """Flush remaining buffer (guaranteed complete UTF-8)."""
        if self._buffer:
            await self.response.write(bytes(self._buffer))
            self._buffer.clear()

    @staticmethod
    def _find_utf8_boundary(data: bytearray, max_pos: int) -> int:
        """
        Find the last valid UTF-8 character boundary at or before max_pos.

        UTF-8 byte patterns:
            0xxxxxxx -> ASCII, single byte (always a boundary)
            110xxxxx -> Start of 2-byte sequence
            1110xxxx -> Start of 3-byte sequence
            11110xxx -> Start of 4-byte sequence
            10xxxxxx -> Continuation byte (NOT a boundary)
        """
        pos = min(max_pos, len(data))
        # Walk backward from pos until we find a character start
        while pos > 0:
            byte = data[pos - 1]
            # If this byte is ASCII (0xxxxxxx), boundary is after it
            if byte < 0x80:
                return pos
            # If this byte is a continuation byte (10xxxxxx), keep going back
            if (byte & 0xC0) == 0x80:
                pos -= 1
                continue
            # This byte is a start byte (11xxxxxx).
            # Determine the expected sequence length.
            if (byte & 0xE0) == 0xC0:
                seq_len = 2
            elif (byte & 0xF0) == 0xE0:
                seq_len = 3
            elif (byte & 0xF8) == 0xF0:
                seq_len = 4
            else:
                # Invalid byte, treat as boundary
                return pos
            # Check if the full sequence fits before max_pos
            seq_start = pos - 1
            if seq_start + seq_len <= max_pos:
                # Full sequence fits, boundary is after it
                return seq_start + seq_len
            else:
                # Sequence would be split, boundary is before it
                return seq_start
        return 0
Step 2: Apply to the streaming pipeline
async def stream_chunked_response(request, bedrock_stream):
    """Stream Bedrock response with UTF-8-safe chunked encoding."""
    response = web.StreamResponse(
        headers={"Content-Type": "text/plain; charset=utf-8"}
    )
    # aiohttp forbids setting the Transfer-Encoding header manually;
    # request chunked encoding explicitly instead
    response.enable_chunked_encoding()
    await response.prepare(request)
    writer = Utf8SafeChunkedWriter(response, target_chunk_size=4096)
    async for chunk in bedrock_stream:
        if chunk.text:
            await writer.write(chunk.text)
    # Flush any remaining bytes (guaranteed complete)
    await writer.flush()
    await response.write_eof()
Step 3: Client-side fallback with lenient decoder
Even with server-side fixes, implement a resilient client-side decoder as defense in depth:
class ResilientStreamDecoder {
  constructor() {
    // Use non-fatal mode: replaces invalid sequences with U+FFFD
    // instead of throwing
    this.decoder = new TextDecoder("utf-8", { fatal: false });
    this.pendingBytes = new Uint8Array(0);
  }

  decode(chunk) {
    // Combine any pending bytes from previous chunk
    const combined = this._concat(this.pendingBytes, new Uint8Array(chunk));
    // Find the last safe UTF-8 boundary
    const boundary = this._findUtf8Boundary(combined);
    if (boundary === combined.length) {
      // All bytes form complete characters
      this.pendingBytes = new Uint8Array(0);
      return this.decoder.decode(combined, { stream: true });
    }
    // Split at boundary: decode complete chars, save incomplete
    const complete = combined.slice(0, boundary);
    this.pendingBytes = combined.slice(boundary);
    return this.decoder.decode(complete, { stream: true });
  }

  flush() {
    // Decode any remaining bytes at end of stream
    if (this.pendingBytes.length > 0) {
      const text = this.decoder.decode(this.pendingBytes);
      this.pendingBytes = new Uint8Array(0);
      return text;
    }
    return this.decoder.decode();
  }

  _findUtf8Boundary(bytes) {
    let pos = bytes.length;
    // Walk backward to find the start of the last character
    while (pos > 0 && pos > bytes.length - 4) {
      pos--;
      const byte = bytes[pos];
      if ((byte & 0x80) === 0) {
        // ASCII — boundary is after this byte
        return pos + 1;
      }
      if ((byte & 0xC0) !== 0x80) {
        // Start byte found — check if sequence is complete
        let expectedLen;
        if ((byte & 0xE0) === 0xC0) expectedLen = 2;
        else if ((byte & 0xF0) === 0xE0) expectedLen = 3;
        else if ((byte & 0xF8) === 0xF0) expectedLen = 4;
        else return pos; // invalid, skip
        if (pos + expectedLen <= bytes.length) {
          return pos + expectedLen; // Sequence complete
        }
        return pos; // Sequence incomplete — boundary before it
      }
    }
    return bytes.length;
  }

  _concat(a, b) {
    const result = new Uint8Array(a.length + b.length);
    result.set(a, 0);
    result.set(b, a.length);
    return result;
  }
}
Step 4: Add SSE-specific UTF-8 safety
For SSE transport, the same issue can occur if the SSE event data contains incomplete UTF-8. The SSEResponseManager should ensure each data: line contains only complete UTF-8 characters:
def create_token_event(self, session, text):
    """Create a token event, ensuring text contains only complete UTF-8."""
    # Validate the text is valid UTF-8 (should be, since it came from Bedrock)
    try:
        text.encode("utf-8").decode("utf-8")
    except UnicodeError:
        logger.warning(f"Invalid UTF-8 in token text, cleaning: {repr(text)}")
        text = text.encode("utf-8", errors="replace").decode("utf-8")
    # Rest of event creation unchanged
    event_id = f"{session.session_id}-{session.chunk_index}"
    session.chunk_index += 1
    session.accumulated_text += text
    session.last_event_id = event_id
    session.last_sent_at = time.time()
    return SSEEvent(
        event_type=SSEEventType.TOKEN,
        data={"text": text, "index": session.chunk_index - 1},
        event_id=event_id,
    )
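For illustration, the `errors="replace"` path above behaves as follows when the token text contains a lone surrogate, which is the usual way a Python `str` fails to round-trip through UTF-8:

```python
text = "ベルセルク\ud800"  # lone surrogate appended (not encodable as UTF-8)

try:
    text.encode("utf-8")
except UnicodeEncodeError:
    # errors="replace" substitutes '?' for the unencodable code point
    cleaned = text.encode("utf-8", errors="replace").decode("utf-8")
    print(cleaned)  # ベルセルク?
```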
Prevention
| Measure | Implementation | Effectiveness |
|---|---|---|
| `Utf8SafeChunkedWriter` | Always flush at character boundaries | Eliminates server-side splits |
| `ResilientStreamDecoder` (client) | Buffer incomplete bytes, decode only complete chars | Client-side defense in depth |
| `TextDecoder({ fatal: false })` | Replace rather than throw on invalid sequences | Graceful degradation |
| SSE UTF-8 validation | Validate each event data payload | Ensures SSE events are clean |
| Integration test with CJK text | Test with all 3-byte and 4-byte character sets | Catches regressions |
| Chunk size = multiple of 12 | LCM of 1, 2, 3, 4-byte sequences | Reduces split probability |
Verification
# Server-side test
import asyncio

class MockResponse:
    """Collects written chunks for inspection."""
    def __init__(self):
        self.chunks = []

    async def write(self, chunk):
        self.chunks.append(chunk)

def test_utf8_safe_chunking():
    """Verify chunks never split multi-byte characters."""
    test_texts = [
        "東京グール",                 # All 3-byte CJK
        "Hello 東京 World",           # Mixed ASCII and CJK
        "🔥 ファイア 🔥",             # 4-byte emoji + 3-byte katakana
        "Berserk ベルセルク Vol.1",   # Mixed with punctuation
        "進撃の巨人 Attack on Titan", # Full manga title
    ]

    async def run(text):
        mock_response = MockResponse()
        # Use tiny chunk size to force many splits
        writer = Utf8SafeChunkedWriter(mock_response, target_chunk_size=5)
        await writer.write(text)
        await writer.flush()
        return mock_response.chunks

    for text in test_texts:
        chunks = asyncio.run(run(text))
        # Verify each chunk is valid UTF-8
        for chunk in chunks:
            try:
                chunk.decode("utf-8")
            except UnicodeDecodeError:
                assert False, f"Chunk contains incomplete UTF-8: {chunk.hex()}"
        # Verify reassembled text matches original
        reassembled = b"".join(chunks).decode("utf-8")
        assert reassembled == text
// Client-side test
function testResilientDecoder() {
  const decoder = new ResilientStreamDecoder();
  // Simulate "東" (E6 9D B1) split across two chunks
  const chunk1 = new Uint8Array([0xE6, 0x9D]);             // Incomplete "東"
  const chunk2 = new Uint8Array([0xB1, 0xE4, 0xBA, 0xAC]); // Rest of "東" + "京"
  const text1 = decoder.decode(chunk1);
  console.assert(text1 === "", "Should buffer incomplete char");
  const text2 = decoder.decode(chunk2);
  console.assert(text2 === "東京", "Should decode complete chars including buffered");
}
Key Takeaway
When streaming text over chunked HTTP in a multi-language application, never flush at arbitrary byte boundaries. Implement a UTF-8-aware chunked writer that only flushes at character boundaries, and add a resilient decoder on the client side that buffers incomplete multi-byte sequences across chunk boundaries. This is especially critical for CJK languages, where every character is 3 bytes and roughly two of every three byte boundaries would fall mid-character under a naive byte-level flush.
Cross-Scenario Summary
| # | Scenario | Root Cause | Key Fix | Prevention |
|---|---|---|---|---|
| 1 | WebSocket drop mid-stream | No token persistence | Redis token caching + resume | Persist every token, continue generation on disconnect |
| 2 | Client memory bloat | Production/consumption rate mismatch | Bidirectional flow control | Adaptive buffering + requestAnimationFrame batching |
| 3 | Idle timeout on long FM calls | No progress signals during RAG | Tiered timeouts + progress updates | ACK immediately, send pipeline progress |
| 4 | SSE zombie connections | Done event is not transport close | Explicit close on both sides | Server write_eof() + client eventSource.close() + reaper |
| 5 | Unicode split at chunk boundary | Byte-level flush ignores UTF-8 | UTF-8-safe chunk writer | Flush at character boundaries, resilient client decoder |
Common Patterns Across All Scenarios
- Never assume the transport handles application semantics. WebSocket close is not the same as stream complete. The SSE done event is not connection close. Chunked encoding is not UTF-8-aware.
- Always implement both server-side and client-side defenses. No single-side fix is sufficient. The server writes UTF-8-safe chunks AND the client handles incomplete sequences. The server sends progress AND the client has tiered timeouts.
- Persist intermediate state for resumability. Tokens in Redis, connection records in DynamoDB, partial responses in client storage. Every streaming pipeline should survive disconnection at any point.
- Monitor the gaps, not just the flows. Track zombie connections, split characters, mid-stream drops, and rate mismatches. The interesting failures happen in the spaces between components.
- Design for Japanese content from the start. Multi-byte characters, no word boundaries, and mixed-script content affect buffering, chunking, rendering, and search. Retrofitting CJK support is far harder than building it in.