Streaming and Latency Optimization for GenAI Applications
AWS AIP-C01 Task 4.2 — Skill 4.2.1: Deep-dive into response streaming, first-token optimization, and pre-computation pipelines
Context: MangaAssist e-commerce chatbot — Bedrock Claude 3 Sonnet/Haiku, OpenSearch Serverless, DynamoDB, ECS Fargate, API Gateway WebSocket, ElastiCache Redis
North Star: First token to screen < 400ms, full response streaming complete < 3s
Skill Mapping
| Certification | Domain | Task | Skill |
|---|---|---|---|
| AWS AIP-C01 | Domain 4 — Operational Efficiency | Task 4.2 — Optimize FM application performance | Skill 4.2.1 — Identify techniques to create responsive AI systems (e.g., pre-computation, latency-optimized model selection, parallel request processing, response streaming, performance benchmarking) |
File scope: This file focuses on the streaming and latency optimization dimensions of Skill 4.2.1 — how MangaAssist delivers sub-400ms first-token latency through Bedrock streaming, WebSocket delivery, progressive rendering, pre-computation pipelines, and automated benchmark frameworks.
Mind Map — Streaming and Latency Optimization
mindmap
root((Streaming &<br/>Latency<br/>Optimization))
WebSocket Streaming Architecture
API Gateway WebSocket
Connection Lifecycle
Route Selection Keys
Binary Frame Support
ECS Orchestrator
Stream Proxy
Chunk Aggregation
Error Recovery
Bedrock Stream API
invoke_model_with_response_stream
content_block_delta Events
message_stop Signal
First-Token Latency
Warm Connections
Connection Pooling
Keep-Alive to Bedrock
Pre-authenticated Sessions
Pre-Loaded System Prompts
Cached Prompt Templates
Avoid Per-Request Assembly
Template Versioning
Inference Warm-Up
Periodic Health Pings
Model Warm Invocations
Cold Start Mitigation
Progressive Rendering
Skeleton UI
Typing Indicator
Placeholder Cards
Shimmer Effects
Status Messages
Looking Up Manga
Checking Your Order
Generating Recommendation
Incremental Content
Stream Text Tokens
Progressive Image Load
Recommendation Card Build
Pre-Computation Pipeline
Follow-Up Prediction
Intent Transition Model
Next-Query Probability
Background Generation
Background Generation
Trending Batch Jobs
Catalog Update Hooks
Scheduled Refreshes
Cache Warming
Popular Query Replay
Segment-Based Pre-Fetch
Time-of-Day Patterns
Benchmark Framework
Automated Measurement
Continuous Test Harness
Synthetic Query Replay
Production Sampling
Statistical Testing
Mann-Whitney U Test
Bootstrap Confidence Intervals
Effect Size Measurement
Regression Detection
Baseline Comparison
Automated Alerting
Auto-Rollback Triggers
WebSocket Streaming Architecture
MangaAssist uses API Gateway WebSocket to maintain a persistent, bidirectional connection with the client. This eliminates HTTP request/response overhead and enables real-time token-by-token delivery.
End-to-End Streaming Data Flow
sequenceDiagram
participant Client as Browser Client
participant APIGW as API Gateway<br/>WebSocket
participant ECS as ECS Fargate<br/>Orchestrator
participant Redis as ElastiCache Redis
participant OS as OpenSearch<br/>Serverless
participant DDB as DynamoDB
participant Bedrock as Bedrock<br/>Claude 3
Note over Client,APIGW: WebSocket already established (connection reuse)
Client->>APIGW: {"action":"query","text":"Recommend manga like One Piece"}
APIGW->>ECS: Route via $default route
Note over ECS: Phase 1: Immediate Status (0ms)
ECS-->>APIGW: {"type":"status","msg":"Looking up manga for you..."}
APIGW-->>Client: Display typing indicator
Note over ECS: Phase 2: Parallel Fan-Out (0-650ms)
par Concurrent Data Fetches
ECS->>Redis: Check semantic cache
Redis-->>ECS: Cache MISS (45ms)
and
ECS->>OS: KNN vector search "manga like One Piece"
OS-->>ECS: Top 5 results (620ms)
and
ECS->>DDB: Get session + user profile
DDB-->>ECS: Session history + preferences (180ms)
end
Note over ECS: Phase 3: Prompt Assembly (650-680ms)
ECS->>ECS: Build prompt with RAG context
Note over ECS: Phase 4: Streaming Invocation (680ms+)
ECS->>Bedrock: invoke_model_with_response_stream()
Note over Bedrock: First token generated (~350ms after invocation)
loop Token Streaming (~1030ms to ~2400ms)
Bedrock-->>ECS: content_block_delta: "Based on your love of..."
ECS-->>APIGW: WebSocket frame (batched 3 chunks)
APIGW-->>Client: Render text incrementally
end
Bedrock-->>ECS: message_stop
ECS-->>APIGW: {"type":"done","metadata":{...}}
APIGW-->>Client: Finalize + show recommendation cards
Note over Client: First token: ~1030ms from send<br/>Full response: ~2400ms<br/>Perceived: instant feedback (status at 0ms, text from ~1s)
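The Phase 2 fan-out above is what keeps the Sonnet-path budget near the slowest dependency (the ~620ms vector search) rather than the sum of all three lookups. A minimal sketch of that pattern with asyncio.gather, assuming the caller passes in already-built Redis, OpenSearch, and DynamoDB coroutines; the function name and signature are illustrative, not the production orchestrator API:
import asyncio
from typing import Any, Awaitable
async def fan_out_context_fetch(
    cache_lookup: Awaitable[Any],     # ElastiCache Redis semantic cache check
    knn_search: Awaitable[Any],       # OpenSearch Serverless KNN query
    session_load: Awaitable[Any],     # DynamoDB session + profile read
) -> tuple[Any, Any, Any]:
    """
    Run the three Phase-2 lookups concurrently so the fan-out costs
    roughly max(45, 620, 180) ms instead of the ~845ms waterfall sum.
    return_exceptions=True keeps one slow or failed dependency from
    sinking the whole request; failures degrade to empty context.
    """
    results = await asyncio.gather(
        cache_lookup, knn_search, session_load, return_exceptions=True
    )
    cache_hit, rag_results, session = (
        None if isinstance(r, Exception) else r for r in results
    )
    return cache_hit, rag_results, session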
WebSocket Connection Lifecycle
stateDiagram-v2
[*] --> Connecting: Client initiates
Connecting --> Connected: $connect route succeeds
Connected --> Authenticated: JWT validated
Authenticated --> Active: Ready for queries
Active --> Streaming: Query received
Streaming --> Active: Response complete
Active --> Streaming: Another query
Active --> Idle: No activity 5min
Idle --> Active: New query
Idle --> Disconnected: 10min timeout
Streaming --> Reconnecting: Network error
Reconnecting --> Active: Reconnected (resume session)
Reconnecting --> Disconnected: 3 retries failed
Disconnected --> [*]
Connection Management Configuration
| Parameter | Value | Rationale |
|---|---|---|
| Idle timeout | 10 minutes | Matches typical manga browsing session length |
| Route selection expression | $request.body.action | Enables query, ping, disconnect actions |
| Maximum message size | 128 KB | Sufficient for longest manga description + context |
| Throttling (per connection) | 100 msg/sec | Prevents abuse; streaming chunks stay well under |
| Throttling (account) | 10,000 connections | Peak evening hours JP timezone capacity |
| Binary support | Enabled | Future: image thumbnails in stream |
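On the return path, the orchestrator pushes frames to a connected client through the API Gateway Management API. A minimal sketch of the websocket_sender callable that BedrockStreamManager (below) expects, assuming a stage-specific callback endpoint; the endpoint URL is a placeholder and the wrapper itself is illustrative, not the production module:
import asyncio
import json
import boto3
from botocore.exceptions import ClientError
# Stage-specific callback endpoint of the deployed WebSocket API (placeholder).
_apigw = boto3.client(
    "apigatewaymanagementapi",
    endpoint_url="https://<api-id>.execute-api.us-west-2.amazonaws.com/prod",
)
async def websocket_sender(connection_id: str, payload: dict) -> None:
    """
    Send one JSON frame to a connected WebSocket client.
    post_to_connection is a blocking boto3 call, so it runs in a worker
    thread to keep the streaming event loop responsive. A GoneException
    (client already disconnected) is surfaced as ConnectionError, which
    the stream manager treats as a cancellation.
    """
    try:
        await asyncio.to_thread(
            _apigw.post_to_connection,
            ConnectionId=connection_id,
            Data=json.dumps(payload).encode("utf-8"),
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "GoneException":
            raise ConnectionError("client disconnected") from err
        raise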
First-Token Latency Optimization
First-token latency is the time from when the user sends a query to when the first generated token appears on screen. This is the most important perceived latency metric — users tolerate slow generation if they see the response starting quickly.
Latency Budget Breakdown
gantt
title First-Token Latency Budget — Target: < 400ms (Haiku) / < 1100ms (Sonnet)
dateFormat X
axisFormat %L ms
section Network
WebSocket Frame Receive :0, 5
TLS Already Established :5, 5
section Routing
Intent Classification :5, 55
Model Selection :55, 60
section Cache (Haiku path - simple query)
Redis Lookup :60, 105
section Generation (Haiku)
Bedrock Haiku First Token :105, 380
WebSocket Frame Send :380, 385
section Parallel I/O (Sonnet path)
RAG + Session + Profile :60, 680
section Generation (Sonnet)
Prompt Assembly :680, 700
Bedrock Sonnet First Token :700, 1050
WebSocket Frame Send :1050, 1055
Optimization Techniques
1. Warm Connection Pooling
Cold connections to Bedrock require TLS handshake + authentication. MangaAssist maintains a warm connection pool.
import asyncio
import time
from contextlib import asynccontextmanager
import boto3
from botocore.config import Config
class BedrockConnectionPool:
"""
Maintains warm connections to Bedrock to eliminate cold-start
TLS and authentication overhead on the critical path.
Each ECS task maintains a pool of pre-authenticated boto3 clients.
A background task sends periodic health pings to keep connections warm.
"""
def __init__(
self,
pool_size: int = 5,
keep_alive_interval: float = 30.0,
region: str = "us-west-2",
):
self.pool_size = pool_size
self.keep_alive_interval = keep_alive_interval
self.region = region
self._pool: asyncio.Queue = asyncio.Queue(maxsize=pool_size)
self._keep_alive_task: asyncio.Task | None = None
config = Config(
region_name=region,
retries={"max_attempts": 2, "mode": "adaptive"},
connect_timeout=5,
read_timeout=60,
max_pool_connections=pool_size,
)
# Pre-create clients and fill the pool
for _ in range(pool_size):
client = boto3.client("bedrock-runtime", config=config)
self._pool.put_nowait(client)
async def start(self) -> None:
"""Start the background keep-alive task."""
self._keep_alive_task = asyncio.create_task(
self._keep_alive_loop()
)
async def stop(self) -> None:
"""Stop the keep-alive task and close all clients."""
if self._keep_alive_task:
self._keep_alive_task.cancel()
@asynccontextmanager
async def acquire(self):
"""
Acquire a warm Bedrock client from the pool.
Returns it after use.
"""
client = await asyncio.wait_for(
self._pool.get(), timeout=5.0
)
try:
yield client
finally:
await self._pool.put(client)
async def _keep_alive_loop(self) -> None:
"""
Periodically invoke a minimal request to keep
connections warm and TLS sessions alive.
"""
while True:
await asyncio.sleep(self.keep_alive_interval)
for _ in range(self._pool.qsize()):
client = await self._pool.get()
                try:
                    # Minimal invocation to keep the connection warm.
                    # boto3 calls block, so run them in a worker thread
                    # so the event loop is not stalled during pings.
                    await asyncio.to_thread(
                        client.invoke_model,
                        modelId="anthropic.claude-3-haiku-20240307-v1:0",
                        body='{"anthropic_version":"bedrock-2023-05-31",'
                        '"max_tokens":1,"messages":[{"role":"user",'
                        '"content":"ping"}]}',
                    )
except Exception:
pass # Connection will be refreshed by boto3
finally:
await self._pool.put(client)
2. Pre-Loaded System Prompts
System prompts are assembled once at deployment and cached in memory, not rebuilt per request.
import json
class SystemPromptManager:
"""
Pre-loads and caches system prompts at ECS task startup.
Prompts are versioned and stored in DynamoDB. The manager
loads all active versions at init and refreshes on a schedule.
This eliminates per-request DynamoDB reads for prompt templates.
"""
def __init__(self, dynamodb_table, refresh_interval: int = 300):
self.table = dynamodb_table
self.refresh_interval = refresh_interval
self._prompts: dict[str, str] = {}
self._load_all_prompts()
    def _load_all_prompts(self) -> None:
        """Load all active prompt versions into memory (follows scan pagination)."""
        scan_kwargs = {
            "FilterExpression": "active = :true",
            "ExpressionAttributeValues": {":true": True},
        }
        while True:
            response = self.table.scan(**scan_kwargs)
            for item in response.get("Items", []):
                key = f"{item['intent']}:{item['version']}"
                self._prompts[key] = item["prompt_text"]
            last_key = response.get("LastEvaluatedKey")
            if not last_key:
                break
            scan_kwargs["ExclusiveStartKey"] = last_key
    def get_prompt(self, intent: str, version: str = "latest") -> str:
        """
        Retrieve a pre-loaded system prompt from the in-memory dict.
        Zero I/O on the hot path. (No lru_cache here: memoizing an
        instance method would keep serving stale prompts after a refresh.)
        """
key = f"{intent}:{version}"
if key in self._prompts:
return self._prompts[key]
# Fallback to default if specific version not found
default_key = f"{intent}:default"
return self._prompts.get(default_key, self._default_prompt())
@staticmethod
def _default_prompt() -> str:
return (
"You are MangaAssist, a helpful assistant for a Japanese manga "
"e-commerce store. Be concise, friendly, and respond in the "
"customer's language."
)
3. Model Inference Warm-Up
Cold model instances on Bedrock can add 200-500ms to the first invocation. MangaAssist sends periodic warm-up requests.
| Warm-Up Strategy | Frequency | Target Model | Payload | Added Cost |
|---|---|---|---|---|
| Health ping (1 token) | Every 30s per ECS task | Haiku | "ping" → 1 token | ~$0.30/month |
| Warm invocation (short) | Every 5 min | Sonnet | 10-token FAQ response | ~$2.10/month |
| Full warm-up (streaming) | Every 15 min | Sonnet | 100-token streamed response | ~$4.50/month |
| Total warm-up cost | | | | ~$6.90/month |
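The 1-token health ping already lives in the connection pool's keep-alive loop; the longer Sonnet warm-ups can run from a scheduled task (for example, an EventBridge rule similar to the benchmark scheduler later in this file). A hedged sketch of the 5-minute warm invocation, assuming the standard Bedrock model identifier for Claude 3 Sonnet; the prompt text and token cap are illustrative:
import json
import boto3
bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")
def warm_sonnet(max_tokens: int = 10) -> None:
    """
    Scheduled warm invocation that keeps the Sonnet inference path warm.
    Roughly 10 output tokens per call keeps the added cost in the
    ~$2/month range estimated in the table above.
    """
    bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "messages": [
                {"role": "user", "content": "Reply with a one-word greeting."}
            ],
        }),
    )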
Progressive Response Rendering
Progressive rendering keeps the user engaged while backend processing happens. MangaAssist uses a three-phase approach.
Three-Phase Rendering Strategy
flowchart LR
subgraph "Phase 1: Instant (0-100ms)"
A1[Typing Indicator]
A2[Skeleton Card]
A3[Status: Thinking...]
end
subgraph "Phase 2: Context (100-700ms)"
B1[Status: Looking up manga...]
B2[Show Category Tag]
B3[Display Product Thumbnails<br/>from cache]
end
subgraph "Phase 3: Streaming (700ms+)"
C1[First Token Appears]
C2[Text Streams In]
C3[Recommendation Cards<br/>Build Progressively]
end
A1 --> B1 --> C1
A2 --> B2 --> C2
A3 --> B3 --> C3
Client-Side Status Messages
| Backend Phase | Duration | Client Displays | UX Purpose |
|---|---|---|---|
| Intent classification | 0-50ms | Typing indicator animation | Immediate feedback |
| Cache check | 50-100ms | "Let me check..." | Acknowledgment |
| RAG retrieval running | 100-650ms | "Looking up manga for you..." | Context-aware status |
| DynamoDB session load | 100-200ms | (no change — parallel) | N/A |
| Prompt assembly | 650-700ms | "Preparing your answer..." | Transition signal |
| Bedrock first token | 700-1050ms | First word appears | Key perceived moment |
| Streaming generation | 1050-2500ms | Text streams word by word | Continuous engagement |
| Completion | 2500ms | Recommendation cards render | Rich content payoff |
Pre-Computation Pipeline
Follow-Up Query Prediction
MangaAssist predicts what the user will ask next and pre-generates responses in the background.
graph TB
subgraph "Current Interaction"
QUERY["User asks: 'Tell me about One Piece'"]
RESPOND[Stream response about One Piece]
end
subgraph "Background Pre-Computation (parallel)"
direction TB
PREDICT["Intent Transition Model<br/>P(next query | current)"]
PRED1["P=0.35: 'Where can I buy it?'<br/>→ Pre-generate purchase link"]
PRED2["P=0.25: 'Similar manga?'<br/>→ Pre-generate recommendations"]
PRED3["P=0.20: 'Latest volume?'<br/>→ Pre-fetch release info"]
PRED4["P=0.10: 'Is it in stock?'<br/>→ Pre-check inventory"]
end
subgraph "Pre-Computed Cache"
REDIS[ElastiCache Redis<br/>TTL: 5 minutes<br/>Scoped to session]
end
QUERY --> RESPOND
RESPOND --> PREDICT
PREDICT --> PRED1 & PRED2 & PRED3 & PRED4
PRED1 & PRED2 & PRED3 & PRED4 --> REDIS
style REDIS fill:#2ecc71,color:#000
style PREDICT fill:#9b59b6,color:#fff
Pre-Computation Hit Rate by Query Pattern
| Trigger Query Pattern | Predicted Follow-Up | Pre-Compute Model | Hit Rate | Latency When Hit |
|---|---|---|---|---|
| Product inquiry | Purchase/availability | Haiku (template) | 38% | < 100ms (cache) |
| Recommendation result | "Tell me more about X" | Haiku (product summary) | 31% | < 100ms (cache) |
| Order status check | Shipping ETA details | Haiku (template) | 45% | < 100ms (cache) |
| Category browse | Specific title from list | Pre-fetched from OpenSearch | 27% | < 150ms (cache) |
| Weighted average | | | 34% | < 110ms |
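A minimal sketch of the background pre-computation step shown above: once the main response has streamed, the orchestrator pre-generates the most likely follow-up answers with Haiku and writes them to session-scoped Redis keys with the 5-minute TTL. The transition table, helper callables, and key scheme are illustrative assumptions, not the production implementation:
import asyncio
import json
# Hypothetical intent-transition table: P(next intent | current intent).
TRANSITIONS = {
    "product_inquiry": [("availability", 0.35), ("recommendation", 0.25)],
    "recommendation": [("product_detail", 0.31)],
    "order_status": [("shipping_eta", 0.45)],
}
async def precompute_follow_ups(
    redis_client,                 # e.g. redis.asyncio.Redis
    generate_answer,              # async callable: (intent, context) -> str, Haiku-backed
    session_id: str,
    current_intent: str,
    context: dict,
    min_probability: float = 0.2,
    ttl_seconds: int = 300,       # 5-minute TTL, scoped to the session
) -> None:
    """Pre-generate likely follow-up answers in the background (best effort)."""
    candidates = [
        intent
        for intent, p in TRANSITIONS.get(current_intent, [])
        if p >= min_probability
    ]
    answers = await asyncio.gather(
        *(generate_answer(intent, context) for intent in candidates),
        return_exceptions=True,
    )
    for intent, answer in zip(candidates, answers):
        if isinstance(answer, Exception):
            continue  # pre-computation is best effort; skip failures silently
        await redis_client.set(
            f"precomputed:{session_id}:{intent}",
            json.dumps({"text": answer}),
            ex=ttl_seconds,
        )
On the next turn, the router checks the pre-computed key for the classified intent before invoking Bedrock; a hit is served straight from cache, which is what produces the sub-150ms latencies in the table above.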
BedrockStreamManager — Complete Implementation
import asyncio
import json
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import AsyncIterator, Callable
from aws_lambda_powertools import Logger, Metrics, Tracer
logger = Logger(service="mangaassist-stream-manager")
tracer = Tracer(service="mangaassist-stream-manager")
metrics = Metrics(namespace="MangaAssist/Streaming")
class StreamState(Enum):
"""States of a streaming response lifecycle."""
INITIALIZING = "initializing"
WAITING_FIRST_TOKEN = "waiting_first_token"
STREAMING = "streaming"
COMPLETED = "completed"
ERROR = "error"
CANCELLED = "cancelled"
@dataclass
class StreamMetrics:
"""Metrics collected during a single streaming session."""
first_token_latency_ms: float | None = None
total_duration_ms: float = 0.0
tokens_generated: int = 0
chunks_sent: int = 0
bytes_sent: int = 0
errors: list[str] = field(default_factory=list)
class BedrockStreamManager:
"""
Manages the full lifecycle of a Bedrock streaming response
for MangaAssist, including:
- Connection pool acquisition
- Streaming invocation with timeout
- Chunk batching and WebSocket delivery
- First-token latency tracking
- Error recovery and client notification
- Graceful cancellation on client disconnect
"""
def __init__(
self,
connection_pool,
websocket_sender: Callable,
system_prompt_manager,
chunk_batch_size: int = 3,
stream_timeout: float = 30.0,
):
self.pool = connection_pool
self.ws_send = websocket_sender
self.prompt_mgr = system_prompt_manager
self.chunk_batch_size = chunk_batch_size
self.stream_timeout = stream_timeout
@tracer.capture_method
async def stream_query(
self,
model_id: str,
intent: str,
messages: list[dict],
connection_id: str,
max_tokens: int = 1024,
) -> StreamMetrics:
"""
Execute a streaming query against Bedrock and deliver
chunks to the client via WebSocket.
Steps:
1. Acquire warm connection from pool
2. Send immediate status message to client
3. Invoke Bedrock with streaming
4. Track first-token latency
5. Batch and deliver chunks via WebSocket
6. Handle errors and disconnects gracefully
"""
stream_metrics = StreamMetrics()
state = StreamState.INITIALIZING
start_time = time.monotonic()
# Phase 1: Immediate status feedback
await self.ws_send(connection_id, {
"type": "status",
"state": "processing",
"message": self._status_message_for_intent(intent),
})
try:
async with self.pool.acquire() as bedrock_client:
state = StreamState.WAITING_FIRST_TOKEN
# Phase 2: Invoke Bedrock streaming
system_prompt = self.prompt_mgr.get_prompt(intent)
response = bedrock_client.invoke_model_with_response_stream(
modelId=model_id,
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": max_tokens,
"system": system_prompt,
"messages": messages,
}),
)
                # Phase 3: Process stream with chunk batching.
                # boto3's EventStream is synchronous; iterating it here blocks
                # the event loop between chunks, which is acceptable for a
                # dedicated per-request task (or wrap reads in asyncio.to_thread).
chunk_buffer = []
full_text = []
for event in response.get("body", []):
chunk = event.get("chunk")
if not chunk:
continue
payload = json.loads(chunk.get("bytes", b"{}"))
# Handle content delta events
if payload.get("type") == "content_block_delta":
delta = payload.get("delta", {})
if delta.get("type") == "text_delta":
text = delta.get("text", "")
if not text:
continue
# Track first token
if stream_metrics.first_token_latency_ms is None:
state = StreamState.STREAMING
ftl = (time.monotonic() - start_time) * 1000
stream_metrics.first_token_latency_ms = ftl
metrics.add_metric(
name="first_token_latency_ms",
unit="Milliseconds",
value=ftl,
)
logger.info(
"First token delivered",
first_token_ms=round(ftl, 1),
model=model_id,
intent=intent,
)
stream_metrics.tokens_generated += 1
full_text.append(text)
chunk_buffer.append(text)
# Batch chunks to reduce WebSocket overhead
if len(chunk_buffer) >= self.chunk_batch_size:
batched = "".join(chunk_buffer)
await self.ws_send(connection_id, {
"type": "chunk",
"text": batched,
})
stream_metrics.chunks_sent += 1
stream_metrics.bytes_sent += len(
batched.encode("utf-8")
)
chunk_buffer = []
# Handle message_stop
elif payload.get("type") == "message_stop":
break
# Flush remaining buffer
if chunk_buffer:
batched = "".join(chunk_buffer)
await self.ws_send(connection_id, {
"type": "chunk",
"text": batched,
})
stream_metrics.chunks_sent += 1
stream_metrics.bytes_sent += len(
batched.encode("utf-8")
)
# Send completion signal
state = StreamState.COMPLETED
stream_metrics.total_duration_ms = (
(time.monotonic() - start_time) * 1000
)
await self.ws_send(connection_id, {
"type": "done",
"metadata": {
"first_token_ms": round(
stream_metrics.first_token_latency_ms or 0, 1
),
"total_ms": round(
stream_metrics.total_duration_ms, 1
),
"tokens": stream_metrics.tokens_generated,
},
})
except ConnectionError:
state = StreamState.CANCELLED
logger.warning(
"Client disconnected during streaming",
connection_id=connection_id,
)
stream_metrics.errors.append("client_disconnected")
except Exception as e:
state = StreamState.ERROR
logger.exception("Stream error", error=str(e))
stream_metrics.errors.append(str(e))
try:
await self.ws_send(connection_id, {
"type": "error",
"message": (
"Sorry, I ran into an issue. Please try again."
),
})
except Exception:
pass # Client may already be gone
# Publish final metrics
self._publish_metrics(stream_metrics, model_id, intent, state)
return stream_metrics
def _status_message_for_intent(self, intent: str) -> str:
"""Return a MangaAssist-specific status message per intent."""
messages = {
"recommendation": "Looking up manga recommendations for you...",
"product_search": "Searching our manga catalog...",
"order_status": "Checking your order status...",
"comparison": "Analyzing both titles for you...",
"complaint": "I understand your concern. Looking into this...",
"faq": "Let me find that answer...",
"greeting": "",
}
return messages.get(intent, "Working on your request...")
def _publish_metrics(
self,
stream_metrics: StreamMetrics,
model_id: str,
intent: str,
state: StreamState,
) -> None:
"""Publish comprehensive streaming metrics to CloudWatch."""
metrics.add_metric(
name="streaming_total_ms",
unit="Milliseconds",
value=stream_metrics.total_duration_ms,
)
metrics.add_metric(
name="streaming_tokens",
unit="Count",
value=stream_metrics.tokens_generated,
)
metrics.add_metric(
name="streaming_chunks_sent",
unit="Count",
value=stream_metrics.chunks_sent,
)
if stream_metrics.tokens_generated > 0:
tokens_per_second = (
stream_metrics.tokens_generated
/ (stream_metrics.total_duration_ms / 1000)
)
metrics.add_metric(
name="tokens_per_second",
unit="Count",
value=tokens_per_second,
)
if state == StreamState.ERROR:
metrics.add_metric(
name="stream_errors", unit="Count", value=1
)
elif state == StreamState.CANCELLED:
metrics.add_metric(
name="stream_cancellations", unit="Count", value=1
)
MangaAssist-Specific Streaming Patterns
Streaming Manga Descriptions
When a user asks about a specific manga title, the response streams the description while simultaneously loading product images and purchase options.
gantt
title Manga Description Streaming Timeline
dateFormat X
axisFormat %L ms
section Client Display
Typing indicator :0, 100
"Looking up manga..." :100, 700
Title + author appear :700, 900
Description streams in :900, 2000
Product card builds :2000, 2300
Purchase button active :2300, 2400
Progressive Recommendation Lists
For recommendation queries, MangaAssist streams titles one at a time so the user can start browsing before the full list is ready.
| Stream Phase | Content Delivered | Time (ms) | User Action Possible |
|---|---|---|---|
| 1 | "Based on your interest in One Piece, here are my top picks:" | 700-1000 | Read introduction |
| 2 | "1. Naruto by Masashi Kishimoto — A young ninja's journey..." | 1000-1400 | Click title #1 |
| 3 | "2. Bleach by Tite Kubo — Soul reapers and sword battles..." | 1400-1800 | Click title #1 or #2 |
| 4 | "3. My Hero Academia by Kohei Horikoshi — Superheroes in..." | 1800-2200 | Click any of first 3 |
| 5 | Product cards with images and prices | 2200-2500 | Add to cart |
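A sketch of the progressive card build behind phases 2-5, run each time a new chunk arrives: it scans the accumulated text for completed numbered entries ("1. Naruto by Masashi Kishimoto ...") and pushes a product card frame as soon as each title is fully received, instead of waiting for the final done message. The regex, catalog shape, and helper names are illustrative assumptions:
import re
def build_card(title: str, catalog: dict) -> dict | None:
    """Look up a streamed title in a pre-fetched catalog dict (hypothetical shape)."""
    item = catalog.get(title)
    if not item:
        return None
    return {"type": "card", "title": title, **item}  # image_url, price, product_id
async def emit_cards_progressively(
    ws_send, connection_id: str, text_so_far: str, catalog: dict, sent: set
) -> None:
    """Push a card frame for every completed "<n>. <Title> by <Author>" entry."""
    for match in re.finditer(r"\d+\.\s+(.+?)\s+by\s+", text_so_far):
        title = match.group(1).strip()
        if title in sent:
            continue
        card = build_card(title, catalog)
        if card:
            await ws_send(connection_id, card)
            sent.add(title)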
Benchmark Framework — Automated Latency Measurement
Test Harness Architecture
graph TB
subgraph "Benchmark Scheduler"
EB[EventBridge Rule<br/>Every 15 min]
end
subgraph "Test Runner — ECS Task"
RUNNER[Benchmark Runner<br/>50 queries per intent]
SYNTHETIC[Synthetic Query<br/>Generator]
end
subgraph "MangaAssist Stack (Production)"
APIGW[API Gateway<br/>WebSocket]
ECS[ECS Fargate<br/>Orchestrator]
BEDROCK[Bedrock]
end
subgraph "Metrics & Analysis"
CW[CloudWatch Metrics<br/>p50/p95/p99 per intent]
STATS[Statistical Analysis<br/>Lambda]
ALARM[CloudWatch Alarms<br/>Regression Detection]
DASH[Grafana Dashboard<br/>Latency Trends]
end
EB --> RUNNER
RUNNER --> SYNTHETIC --> APIGW --> ECS --> BEDROCK
BEDROCK --> ECS --> APIGW --> RUNNER
RUNNER --> CW
CW --> STATS --> ALARM
CW --> DASH
style EB fill:#ff9900,color:#000
style CW fill:#ff9900,color:#000
style ALARM fill:#e74c3c,color:#fff
Statistical Significance Testing
MangaAssist uses the Mann-Whitney U test to detect regressions because latency distributions are non-normal (right-skewed).
| Test Parameter | Value | Rationale |
|---|---|---|
| Sample size per intent | 50 queries | Sufficient power at alpha=0.05, effect size d=0.5 |
| Significance level (alpha) | 0.05 | Industry standard |
| Effect size threshold | Cohen's d >= 0.5 (medium) | Ignore trivially small regressions |
| Comparison window | Current run vs last 7-day rolling baseline | Captures seasonal patterns |
| Test frequency | Every 15 minutes | Balance between detection speed and cost |
| Warm-up exclusion | First 5 queries discarded | Avoid cold-start contamination |
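A hedged sketch of the comparison the analysis Lambda can run over each benchmark batch: current-run latencies against the 7-day rolling baseline, Mann-Whitney U for significance plus a pooled-standard-deviation Cohen's d as the effect-size gate. scipy/numpy availability in the Lambda layer is assumed; the function name and return shape are illustrative:
import numpy as np
from scipy import stats
def detect_regression(
    baseline_ms: list[float],
    current_ms: list[float],
    alpha: float = 0.05,
    min_effect_size: float = 0.5,
    warmup_exclusion: int = 5,
) -> dict:
    """
    Flag a latency regression only if the shift is both statistically
    significant (one-sided Mann-Whitney U: current slower than baseline)
    and practically meaningful (Cohen's d >= 0.5).
    """
    current = np.asarray(current_ms[warmup_exclusion:])  # drop cold-start queries
    baseline = np.asarray(baseline_ms)
    # Non-parametric test because latency distributions are right-skewed.
    _, p_value = stats.mannwhitneyu(current, baseline, alternative="greater")
    # Cohen's d with pooled standard deviation.
    pooled_sd = np.sqrt((current.var(ddof=1) + baseline.var(ddof=1)) / 2)
    cohens_d = float((current.mean() - baseline.mean()) / pooled_sd) if pooled_sd else 0.0
    baseline_p95 = np.percentile(baseline, 95)
    p95_increase_pct = float(
        (np.percentile(current, 95) - baseline_p95) / baseline_p95 * 100
    )
    return {
        "regression": bool(p_value < alpha and cohens_d >= min_effect_size),
        "p_value": float(p_value),
        "cohens_d": cohens_d,
        "p95_increase_pct": p95_increase_pct,
    }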
Regression Severity Classification
| Metric | SEV-3 (Warning) | SEV-2 (Alert) | SEV-1 (Critical) |
|---|---|---|---|
| p95 increase | 10-25% above baseline | 25-50% above baseline | > 50% above baseline |
| First-token p95 increase | 15-30% above target | 30-60% above target | > 60% above target |
| Error rate | 1-3% of requests | 3-10% of requests | > 10% of requests |
| Action | Log + dashboard flag | PagerDuty alert to on-call | Auto-rollback + page SRE team |
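These severities map to CloudWatch alarms on the metrics that BedrockStreamManager publishes. A minimal sketch of the SEV-2 first-token alarm (p95 more than 30% above the 400ms Haiku target, i.e. ~520ms); the alarm name, SNS topic ARN, and evaluation settings are illustrative placeholders, while the metric name and namespace come from the code above:
import boto3
cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")
cloudwatch.put_metric_alarm(
    AlarmName="mangaassist-first-token-p95-sev2",
    Namespace="MangaAssist/Streaming",
    MetricName="first_token_latency_ms",
    ExtendedStatistic="p95",
    Period=900,                      # matches the 15-minute benchmark cadence
    EvaluationPeriods=2,             # require two consecutive breaches
    Threshold=520.0,                 # 400ms target * 1.3
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=[
        # Placeholder SNS topic wired to PagerDuty / the rollback automation
        "arn:aws:sns:us-west-2:123456789012:mangaassist-latency-sev2",
    ],
)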
Latency Optimization Checklist
| # | Optimization | Layer | Expected Savings | Implementation Complexity |
|---|---|---|---|---|
| 1 | Warm connection pool to Bedrock | Network | -150ms cold start | Low |
| 2 | Pre-loaded system prompts | Application | -50ms per request | Low |
| 3 | Model warm-up pings | Inference | -200ms on cold instances | Low |
| 4 | Parallel fan-out (asyncio.gather) | Orchestration | -400ms waterfall | Medium |
| 5 | Chunk batching (3 per frame) | Streaming | -15% WebSocket overhead | Low |
| 6 | Follow-up query prediction | Pre-computation | -1.5s for 34% of follow-ups | High |
| 7 | Progressive status messages | UX | Perceived -500ms | Low |
| 8 | Intent-based model routing | Model selection | -600ms for simple queries | Medium |
| 9 | Redis semantic cache | Caching | -2s for cache hits | Medium |
| 10 | Pre-computed trending responses | Pre-computation | -1.2s for trending queries | Medium |
Key Exam Takeaways
- WebSocket streaming is the foundation of responsive GenAI UX — it turns a 3-second wait into a 1-second "starts typing" experience
- First-token latency is the single most important metric for perceived responsiveness — optimize everything upstream of the first generated token
- Connection pooling and warm-up pings are cheap investments ($7/month) that eliminate 150-500ms of cold-start latency
- Pre-computation works best for predictable patterns — trending titles, common follow-ups, template-based responses
- Statistical benchmarking (Mann-Whitney U, not simple averages) prevents both false alarms and missed regressions
- Progressive rendering (typing indicators, status messages, skeleton UI) improves perceived latency even when actual latency is unchanged