
Streaming and Latency Optimization for GenAI Applications

AWS AIP-C01 Task 4.2 — Skill 4.2.1: Deep-dive into response streaming, first-token optimization, and pre-computation pipelines
Context: MangaAssist e-commerce chatbot — Bedrock Claude 3 Sonnet/Haiku, OpenSearch Serverless, DynamoDB, ECS Fargate, API Gateway WebSocket, ElastiCache Redis
North Star: First token to screen < 400ms, full response streaming complete < 3s


Skill Mapping

| Certification | Domain | Task | Skill |
|---|---|---|---|
| AWS AIP-C01 | Domain 4 — Operational Efficiency | Task 4.2 — Optimize FM application performance | Skill 4.2.1 — Identify techniques to create responsive AI systems (e.g., pre-computation, latency-optimized model selection, parallel request processing, response streaming, performance benchmarking) |

File scope: This file focuses on the streaming and latency optimization dimensions of Skill 4.2.1 — how MangaAssist delivers sub-400ms first-token latency through Bedrock streaming, WebSocket delivery, progressive rendering, pre-computation pipelines, and automated benchmark frameworks.


Mind Map — Streaming and Latency Optimization

mindmap
  root((Streaming &<br/>Latency<br/>Optimization))
    WebSocket Streaming Architecture
      API Gateway WebSocket
        Connection Lifecycle
        Route Selection Keys
        Binary Frame Support
      ECS Orchestrator
        Stream Proxy
        Chunk Aggregation
        Error Recovery
      Bedrock Stream API
        invoke_model_with_response_stream
        content_block_delta Events
        message_stop Signal
    First-Token Latency
      Warm Connections
        Connection Pooling
        Keep-Alive to Bedrock
        Pre-authenticated Sessions
      Pre-Loaded System Prompts
        Cached Prompt Templates
        Avoid Per-Request Assembly
        Template Versioning
      Inference Warm-Up
        Periodic Health Pings
        Model Warm Invocations
        Cold Start Mitigation
    Progressive Rendering
      Skeleton UI
        Typing Indicator
        Placeholder Cards
        Shimmer Effects
      Status Messages
        Looking Up Manga
        Checking Your Order
        Generating Recommendation
      Incremental Content
        Stream Text Tokens
        Progressive Image Load
        Recommendation Card Build
    Pre-Computation Pipeline
      Follow-Up Prediction
        Intent Transition Model
        Next-Query Probability
        Background Generation
      Background Generation
        Trending Batch Jobs
        Catalog Update Hooks
        Scheduled Refreshes
      Cache Warming
        Popular Query Replay
        Segment-Based Pre-Fetch
        Time-of-Day Patterns
    Benchmark Framework
      Automated Measurement
        Continuous Test Harness
        Synthetic Query Replay
        Production Sampling
      Statistical Testing
        Mann-Whitney U Test
        Bootstrap Confidence Intervals
        Effect Size Measurement
      Regression Detection
        Baseline Comparison
        Automated Alerting
        Auto-Rollback Triggers

WebSocket Streaming Architecture

MangaAssist uses API Gateway WebSocket to maintain a persistent, bidirectional connection with the client. This eliminates HTTP request/response overhead and enables real-time token-by-token delivery.

End-to-End Streaming Data Flow

sequenceDiagram
    participant Client as Browser Client
    participant APIGW as API Gateway<br/>WebSocket
    participant ECS as ECS Fargate<br/>Orchestrator
    participant Redis as ElastiCache Redis
    participant OS as OpenSearch<br/>Serverless
    participant DDB as DynamoDB
    participant Bedrock as Bedrock<br/>Claude 3

    Note over Client,APIGW: WebSocket already established (connection reuse)

    Client->>APIGW: {"action":"query","text":"Recommend manga like One Piece"}
    APIGW->>ECS: Route via $default route

    Note over ECS: Phase 1: Immediate Status (0ms)
    ECS-->>APIGW: {"type":"status","message":"Looking up manga for you..."}
    APIGW-->>Client: Display typing indicator

    Note over ECS: Phase 2: Parallel Fan-Out (0-650ms)
    par Concurrent Data Fetches
        ECS->>Redis: Check semantic cache
        Redis-->>ECS: Cache MISS (45ms)
    and
        ECS->>OS: KNN vector search "manga like One Piece"
        OS-->>ECS: Top 5 results (620ms)
    and
        ECS->>DDB: Get session + user profile
        DDB-->>ECS: Session history + preferences (180ms)
    end

    Note over ECS: Phase 3: Prompt Assembly (650-680ms)
    ECS->>ECS: Build prompt with RAG context

    Note over ECS: Phase 4: Streaming Invocation (680ms+)
    ECS->>Bedrock: invoke_model_with_response_stream()

    Note over Bedrock: First token generated (~350ms after invocation)

    loop Token Streaming (~1030ms to ~2400ms)
        Bedrock-->>ECS: content_block_delta: "Based on your love of..."
        ECS-->>APIGW: WebSocket frame (batched 3 chunks)
        APIGW-->>Client: Render text incrementally
    end

    Bedrock-->>ECS: message_stop
    ECS-->>APIGW: {"type":"done","metadata":{...}}
    APIGW-->>Client: Finalize + show recommendation cards

    Note over Client: First token: ~1030ms from send<br/>Full response: ~2400ms<br/>Perceived: Instant (typing at 1s)
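
The frames on the wire are small JSON objects keyed by type (status, chunk, done, error). Below is a minimal client-side consumer sketch, assuming the third-party websockets library and a placeholder endpoint URL; it is not the production front end.

import asyncio
import json

import websockets  # assumed client library; any WebSocket client works


async def consume_stream(url: str, query: str) -> str:
    """Send one query and render the streamed response incrementally."""
    async with websockets.connect(url) as ws:
        await ws.send(json.dumps({"action": "query", "text": query}))
        parts: list[str] = []
        async for raw in ws:
            msg = json.loads(raw)
            if msg["type"] == "status":
                print(f"[status] {msg['message']}")  # drives the typing indicator
            elif msg["type"] == "chunk":
                parts.append(msg["text"])            # progressive rendering
                print(msg["text"], end="", flush=True)
            elif msg["type"] == "done":
                print(f"\n[done] {msg['metadata']}")
                break
            elif msg["type"] == "error":
                raise RuntimeError(msg["message"])
        return "".join(parts)

# asyncio.run(consume_stream("wss://example.execute-api.us-west-2.amazonaws.com/prod",
#                            "Recommend manga like One Piece"))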

WebSocket Connection Lifecycle

stateDiagram-v2
    [*] --> Connecting: Client initiates
    Connecting --> Connected: $connect route succeeds
    Connected --> Authenticated: JWT validated
    Authenticated --> Active: Ready for queries

    Active --> Streaming: Query received
    Streaming --> Active: Response complete
    Active --> Streaming: Another query

    Active --> Idle: No activity 5min
    Idle --> Active: New query
    Idle --> Disconnected: 10min timeout

    Streaming --> Reconnecting: Network error
    Reconnecting --> Active: Reconnected (resume session)
    Reconnecting --> Disconnected: 3 retries failed

    Disconnected --> [*]

Connection Management Configuration

| Parameter | Value | Rationale |
|---|---|---|
| Idle timeout | 10 minutes | Matches typical manga browsing session length |
| Route selection expression | $request.body.action | Enables query, ping, disconnect actions |
| Maximum message size | 128 KB | Sufficient for longest manga description + context |
| Throttling (per connection) | 100 msg/sec | Prevents abuse; streaming chunks stay well under |
| Throttling (account) | 10,000 connections | Peak evening hours JP timezone capacity |
| Binary support | Enabled | Future: image thumbnails in stream |
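
Because the route selection expression is $request.body.action, every inbound frame is dispatched on its action field. A minimal sketch of a $default-route dispatcher under that contract; the orchestrator object and its handle_query/close_session methods are hypothetical, not the production code.

import json


async def handle_default_route(event: dict, orchestrator) -> dict:
    """Dispatch an API Gateway WebSocket frame on its action field."""
    body = json.loads(event.get("body", "{}"))
    connection_id = event["requestContext"]["connectionId"]
    action = body.get("action", "")

    if action == "query":
        await orchestrator.handle_query(connection_id, body.get("text", ""))
    elif action == "ping":
        pass  # keep-alive frame; nothing to do
    elif action == "disconnect":
        await orchestrator.close_session(connection_id)
    else:
        return {"statusCode": 400, "body": "unknown action"}
    return {"statusCode": 200, "body": "ok"}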

First-Token Latency Optimization

First-token latency is the time from when the user sends a query to when the first generated token appears on screen. This is the most important perceived latency metric — users tolerate slow generation if they see the response starting quickly.

Latency Budget Breakdown

gantt
    title First-Token Latency Budget — Target: < 400ms (Haiku) / < 1100ms (Sonnet)
    dateFormat X
    axisFormat %L ms

    section Network
    WebSocket Frame Receive    :0, 5
    TLS Already Established    :5, 5

    section Routing
    Intent Classification      :5, 55
    Model Selection            :55, 60

    section Cache (Haiku path - simple query)
    Redis Lookup               :60, 105

    section Generation (Haiku)
    Bedrock Haiku First Token  :105, 380
    WebSocket Frame Send       :380, 385

    section Parallel I/O (Sonnet path)
    RAG + Session + Profile    :60, 680

    section Generation (Sonnet)
    Prompt Assembly            :680, 700
    Bedrock Sonnet First Token :700, 1050
    WebSocket Frame Send       :1050, 1055

Optimization Techniques

1. Warm Connection Pooling

Cold connections to Bedrock require TLS handshake + authentication. MangaAssist maintains a warm connection pool.

import asyncio
import json
from contextlib import asynccontextmanager

import boto3
from botocore.config import Config


class BedrockConnectionPool:
    """
    Maintains warm connections to Bedrock to eliminate cold-start
    TLS and authentication overhead on the critical path.

    Each ECS task maintains a pool of pre-authenticated boto3 clients.
    A background task sends periodic health pings to keep connections warm.
    """

    def __init__(
        self,
        pool_size: int = 5,
        keep_alive_interval: float = 30.0,
        region: str = "us-west-2",
    ):
        self.pool_size = pool_size
        self.keep_alive_interval = keep_alive_interval
        self.region = region
        self._pool: asyncio.Queue = asyncio.Queue(maxsize=pool_size)
        self._keep_alive_task: asyncio.Task | None = None

        config = Config(
            region_name=region,
            retries={"max_attempts": 2, "mode": "adaptive"},
            connect_timeout=5,
            read_timeout=60,
            max_pool_connections=pool_size,
        )

        # Pre-create clients and fill the pool
        for _ in range(pool_size):
            client = boto3.client("bedrock-runtime", config=config)
            self._pool.put_nowait(client)

    async def start(self) -> None:
        """Start the background keep-alive task."""
        self._keep_alive_task = asyncio.create_task(
            self._keep_alive_loop()
        )

    async def stop(self) -> None:
        """Cancel the background keep-alive task."""
        if self._keep_alive_task:
            self._keep_alive_task.cancel()
            try:
                await self._keep_alive_task
            except asyncio.CancelledError:
                pass

    @asynccontextmanager
    async def acquire(self):
        """
        Acquire a warm Bedrock client from the pool.
        Returns it after use.
        """
        client = await asyncio.wait_for(
            self._pool.get(), timeout=5.0
        )
        try:
            yield client
        finally:
            await self._pool.put(client)

    async def _keep_alive_loop(self) -> None:
        """
        Periodically invoke a minimal request to keep
        connections warm and TLS sessions alive.
        """
        while True:
            await asyncio.sleep(self.keep_alive_interval)
            for _ in range(self._pool.qsize()):
                client = await self._pool.get()
                try:
                    # Minimal 1-token invocation to keep the TLS session
                    # warm. boto3 calls block, so run off the event loop.
                    await asyncio.to_thread(
                        client.invoke_model,
                        modelId="anthropic.claude-3-haiku-20240307-v1:0",
                        body=json.dumps({
                            "anthropic_version": "bedrock-2023-05-31",
                            "max_tokens": 1,
                            "messages": [{"role": "user", "content": "ping"}],
                        }),
                    )
                except Exception:
                    pass  # Connection will be refreshed by boto3
                finally:
                    await self._pool.put(client)

2. Pre-Loaded System Prompts

System prompts are assembled once at deployment and cached in memory, not rebuilt per request.

class SystemPromptManager:
    """
    Pre-loads and caches system prompts at ECS task startup.

    Prompts are versioned and stored in DynamoDB. The manager
    loads all active versions at init and refreshes on a schedule.
    This eliminates per-request DynamoDB reads for prompt templates.
    """

    def __init__(self, dynamodb_table, refresh_interval: int = 300):
        self.table = dynamodb_table
        # Consumed by a background refresher (not shown) that re-calls
        # _load_all_prompts() on this interval.
        self.refresh_interval = refresh_interval
        self._prompts: dict[str, str] = {}
        self._load_all_prompts()

    def _load_all_prompts(self) -> None:
        """Load all active prompt versions into memory (paginated scan)."""
        prompts: dict[str, str] = {}
        kwargs = {
            "FilterExpression": "active = :true",
            "ExpressionAttributeValues": {":true": True},
        }
        while True:
            response = self.table.scan(**kwargs)
            for item in response.get("Items", []):
                prompts[f"{item['intent']}:{item['version']}"] = item["prompt_text"]
            if "LastEvaluatedKey" not in response:
                break
            kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]
        self._prompts = prompts  # Atomic swap; readers never see a partial load

    def get_prompt(self, intent: str, version: str = "latest") -> str:
        """
        Retrieve a pre-loaded system prompt. Zero I/O on the hot path:
        a plain dict lookup. (An lru_cache here would serve stale values
        after a refresh, so the dict itself is the cache.)
        """
        key = f"{intent}:{version}"
        if key in self._prompts:
            return self._prompts[key]

        # Fall back to the intent's default if the specific version is missing
        default_key = f"{intent}:default"
        return self._prompts.get(default_key, self._default_prompt())

    @staticmethod
    def _default_prompt() -> str:
        return (
            "You are MangaAssist, a helpful assistant for a Japanese manga "
            "e-commerce store. Be concise, friendly, and respond in the "
            "customer's language."
        )

3. Model Inference Warm-Up

Cold model instances on Bedrock can add 200-500ms to the first invocation. MangaAssist sends periodic warm-up requests.

| Warm-Up Strategy | Frequency | Target Model | Payload | Added Cost |
|---|---|---|---|---|
| Health ping (1 token) | Every 30s per ECS task | Haiku | "ping" → 1 token | ~$0.30/month |
| Warm invocation (short) | Every 5 min | Sonnet | 10-token FAQ response | ~$2.10/month |
| Full warm-up (streaming) | Every 15 min | Sonnet | 100-token streamed response | ~$4.50/month |
| Total warm-up cost | | | | ~$6.90/month |
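
As a sketch of the "warm invocation" row above, run behind a schedule such as an EventBridge rule; the handler name, prompt, and wiring are assumptions, not the production setup.

import json

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")


def warm_up_handler(event, context):
    """Scheduled warm invocation: a tiny Sonnet request keeps instances hot."""
    bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 10,  # keep the warm-up cheap
            "messages": [{"role": "user", "content": "What are your store hours?"}],
        }),
    )
    return {"warmed": True}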

Progressive Response Rendering

Progressive rendering keeps the user engaged while backend processing happens. MangaAssist uses a three-phase approach.

Three-Phase Rendering Strategy

flowchart LR
    subgraph "Phase 1: Instant (0-100ms)"
        A1[Typing Indicator]
        A2[Skeleton Card]
        A3[Status: Thinking...]
    end

    subgraph "Phase 2: Context (100-700ms)"
        B1[Status: Looking up manga...]
        B2[Show Category Tag]
        B3[Display Product Thumbnails<br/>from cache]
    end

    subgraph "Phase 3: Streaming (700ms+)"
        C1[First Token Appears]
        C2[Text Streams In]
        C3[Recommendation Cards<br/>Build Progressively]
    end

    A1 --> B1 --> C1
    A2 --> B2 --> C2
    A3 --> B3 --> C3

Client-Side Status Messages

| Backend Phase | Duration | Client Displays | UX Purpose |
|---|---|---|---|
| Intent classification | 0-50ms | Typing indicator animation | Immediate feedback |
| Cache check | 50-100ms | "Let me check..." | Acknowledgment |
| RAG retrieval running | 100-650ms | "Looking up manga for you..." | Context-aware status |
| DynamoDB session load | 100-200ms | (no change — parallel) | N/A |
| Prompt assembly | 650-700ms | "Preparing your answer..." | Transition signal |
| Bedrock first token | 700-1050ms | First word appears | Key perceived moment |
| Streaming generation | 1050-2500ms | Text streams word by word | Continuous engagement |
| Completion | 2500ms | Recommendation cards render | Rich content payoff |

Pre-Computation Pipeline

Follow-Up Query Prediction

MangaAssist predicts what the user will ask next and pre-generates responses in the background.

graph TB
    subgraph "Current Interaction"
        QUERY["User asks: 'Tell me about One Piece'"]
        RESPOND[Stream response about One Piece]
    end

    subgraph "Background Pre-Computation (parallel)"
        direction TB
        PREDICT["Intent Transition Model<br/>P(next query | current)"]
        PRED1["P=0.35: 'Where can I buy it?'<br/>→ Pre-generate purchase link"]
        PRED2["P=0.25: 'Similar manga?'<br/>→ Pre-generate recommendations"]
        PRED3["P=0.20: 'Latest volume?'<br/>→ Pre-fetch release info"]
        PRED4["P=0.10: 'Is it in stock?'<br/>→ Pre-check inventory"]
    end

    subgraph "Pre-Computed Cache"
        REDIS[ElastiCache Redis<br/>TTL: 5 minutes<br/>Scoped to session]
    end

    QUERY --> RESPOND
    RESPOND --> PREDICT
    PREDICT --> PRED1 & PRED2 & PRED3 & PRED4
    PRED1 & PRED2 & PRED3 & PRED4 --> REDIS

    style REDIS fill:#2ecc71,color:#000
    style PREDICT fill:#9b59b6,color:#fff
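
A sketch of the background pre-computation step, run after the main response completes. It assumes an async Redis client and a hypothetical pre_generate helper; the transition table and probability floor are illustrative values matching the diagram, and the Redis write mirrors the session-scoped 5-minute TTL shown above.

import asyncio
import json

# Hypothetical transition table: P(next intent | current intent).
FOLLOW_UP_TRANSITIONS = {
    "product_inquiry": [
        ("purchase_link", 0.35),
        ("similar_manga", 0.25),
        ("latest_volume", 0.20),
        ("stock_check", 0.10),
    ],
}


async def precompute_follow_ups(
    redis, session_id: str, current_intent: str,
    pre_generate, probability_floor: float = 0.15,
) -> None:
    """Pre-generate likely follow-up answers into a session-scoped cache."""
    candidates = FOLLOW_UP_TRANSITIONS.get(current_intent, [])
    tasks = {
        intent: asyncio.create_task(pre_generate(session_id, intent))
        for intent, p in candidates
        if p >= probability_floor  # skip long-tail predictions
    }
    for intent, task in tasks.items():
        answer = await task
        # Scoped to the session, 5-minute TTL per the diagram above.
        await redis.set(
            f"precomputed:{session_id}:{intent}", json.dumps(answer), ex=300
        )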

Pre-Computation Hit Rate by Query Pattern

| Trigger Query Pattern | Predicted Follow-Up | Pre-Compute Model | Hit Rate | Latency When Hit |
|---|---|---|---|---|
| Product inquiry | Purchase/availability | Haiku (template) | 38% | < 100ms (cache) |
| Recommendation result | "Tell me more about X" | Haiku (product summary) | 31% | < 100ms (cache) |
| Order status check | Shipping ETA details | Haiku (template) | 45% | < 100ms (cache) |
| Category browse | Specific title from list | Pre-fetched from OpenSearch | 27% | < 150ms (cache) |
| Weighted average | | | 34% | < 110ms |

BedrockStreamManager — Complete Implementation

import asyncio
import json
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable

from aws_lambda_powertools import Logger, Metrics, Tracer

logger = Logger(service="mangaassist-stream-manager")
tracer = Tracer(service="mangaassist-stream-manager")
metrics = Metrics(namespace="MangaAssist/Streaming")


class StreamState(Enum):
    """States of a streaming response lifecycle."""
    INITIALIZING = "initializing"
    WAITING_FIRST_TOKEN = "waiting_first_token"
    STREAMING = "streaming"
    COMPLETED = "completed"
    ERROR = "error"
    CANCELLED = "cancelled"


@dataclass
class StreamMetrics:
    """Metrics collected during a single streaming session."""
    first_token_latency_ms: float | None = None
    total_duration_ms: float = 0.0
    tokens_generated: int = 0
    chunks_sent: int = 0
    bytes_sent: int = 0
    errors: list[str] = field(default_factory=list)


class BedrockStreamManager:
    """
    Manages the full lifecycle of a Bedrock streaming response
    for MangaAssist, including:

    - Connection pool acquisition
    - Streaming invocation with timeout
    - Chunk batching and WebSocket delivery
    - First-token latency tracking
    - Error recovery and client notification
    - Graceful cancellation on client disconnect
    """

    def __init__(
        self,
        connection_pool,
        websocket_sender: Callable,
        system_prompt_manager,
        chunk_batch_size: int = 3,
        stream_timeout: float = 30.0,
    ):
        self.pool = connection_pool
        self.ws_send = websocket_sender
        self.prompt_mgr = system_prompt_manager
        self.chunk_batch_size = chunk_batch_size
        self.stream_timeout = stream_timeout

    @tracer.capture_method
    async def stream_query(
        self,
        model_id: str,
        intent: str,
        messages: list[dict],
        connection_id: str,
        max_tokens: int = 1024,
    ) -> StreamMetrics:
        """
        Execute a streaming query against Bedrock and deliver
        chunks to the client via WebSocket.

        Steps:
        1. Acquire warm connection from pool
        2. Send immediate status message to client
        3. Invoke Bedrock with streaming
        4. Track first-token latency
        5. Batch and deliver chunks via WebSocket
        6. Handle errors and disconnects gracefully
        """
        stream_metrics = StreamMetrics()
        state = StreamState.INITIALIZING
        start_time = time.monotonic()

        # Phase 1: Immediate status feedback
        await self.ws_send(connection_id, {
            "type": "status",
            "state": "processing",
            "message": self._status_message_for_intent(intent),
        })

        try:
            async with self.pool.acquire() as bedrock_client:
                state = StreamState.WAITING_FIRST_TOKEN

                # Phase 2: Invoke Bedrock streaming
                system_prompt = self.prompt_mgr.get_prompt(intent)
                # boto3 is synchronous; run the invocation off the event
                # loop so other connections keep streaming meanwhile.
                response = await asyncio.to_thread(
                    bedrock_client.invoke_model_with_response_stream,
                    modelId=model_id,
                    body=json.dumps({
                        "anthropic_version": "bedrock-2023-05-31",
                        "max_tokens": max_tokens,
                        "system": system_prompt,
                        "messages": messages,
                    }),
                )

                # Phase 3: Process stream with chunk batching
                chunk_buffer = []
                full_text = []

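                # The EventStream iterator below is synchronous; each step
                # blocks only for the gap until Bedrock's next chunk.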
                for event in response.get("body", []):
                    chunk = event.get("chunk")
                    if not chunk:
                        continue

                    payload = json.loads(chunk.get("bytes", b"{}"))

                    # Handle content delta events
                    if payload.get("type") == "content_block_delta":
                        delta = payload.get("delta", {})
                        if delta.get("type") == "text_delta":
                            text = delta.get("text", "")
                            if not text:
                                continue

                            # Track first token
                            if stream_metrics.first_token_latency_ms is None:
                                state = StreamState.STREAMING
                                ftl = (time.monotonic() - start_time) * 1000
                                stream_metrics.first_token_latency_ms = ftl
                                metrics.add_metric(
                                    name="first_token_latency_ms",
                                    unit="Milliseconds",
                                    value=ftl,
                                )
                                logger.info(
                                    "First token delivered",
                                    first_token_ms=round(ftl, 1),
                                    model=model_id,
                                    intent=intent,
                                )

                            stream_metrics.tokens_generated += 1
                            full_text.append(text)
                            chunk_buffer.append(text)

                            # Batch chunks to reduce WebSocket overhead
                            if len(chunk_buffer) >= self.chunk_batch_size:
                                batched = "".join(chunk_buffer)
                                await self.ws_send(connection_id, {
                                    "type": "chunk",
                                    "text": batched,
                                })
                                stream_metrics.chunks_sent += 1
                                stream_metrics.bytes_sent += len(
                                    batched.encode("utf-8")
                                )
                                chunk_buffer = []

                    # Handle message_stop
                    elif payload.get("type") == "message_stop":
                        break

                # Flush remaining buffer
                if chunk_buffer:
                    batched = "".join(chunk_buffer)
                    await self.ws_send(connection_id, {
                        "type": "chunk",
                        "text": batched,
                    })
                    stream_metrics.chunks_sent += 1
                    stream_metrics.bytes_sent += len(
                        batched.encode("utf-8")
                    )

                # Send completion signal
                state = StreamState.COMPLETED
                stream_metrics.total_duration_ms = (
                    (time.monotonic() - start_time) * 1000
                )

                await self.ws_send(connection_id, {
                    "type": "done",
                    "metadata": {
                        "first_token_ms": round(
                            stream_metrics.first_token_latency_ms or 0, 1
                        ),
                        "total_ms": round(
                            stream_metrics.total_duration_ms, 1
                        ),
                        "tokens": stream_metrics.tokens_generated,
                    },
                })

        except ConnectionError:
            state = StreamState.CANCELLED
            logger.warning(
                "Client disconnected during streaming",
                connection_id=connection_id,
            )
            stream_metrics.errors.append("client_disconnected")

        except Exception as e:
            state = StreamState.ERROR
            logger.exception("Stream error", error=str(e))
            stream_metrics.errors.append(str(e))

            try:
                await self.ws_send(connection_id, {
                    "type": "error",
                    "message": (
                        "Sorry, I ran into an issue. Please try again."
                    ),
                })
            except Exception:
                pass  # Client may already be gone

        # Publish final metrics
        self._publish_metrics(stream_metrics, model_id, intent, state)
        return stream_metrics

    def _status_message_for_intent(self, intent: str) -> str:
        """Return a MangaAssist-specific status message per intent."""
        messages = {
            "recommendation": "Looking up manga recommendations for you...",
            "product_search": "Searching our manga catalog...",
            "order_status": "Checking your order status...",
            "comparison": "Analyzing both titles for you...",
            "complaint": "I understand your concern. Looking into this...",
            "faq": "Let me find that answer...",
            "greeting": "",
        }
        return messages.get(intent, "Working on your request...")

    def _publish_metrics(
        self,
        stream_metrics: StreamMetrics,
        model_id: str,
        intent: str,
        state: StreamState,
    ) -> None:
        """Publish comprehensive streaming metrics to CloudWatch."""
        metrics.add_metric(
            name="streaming_total_ms",
            unit="Milliseconds",
            value=stream_metrics.total_duration_ms,
        )
        metrics.add_metric(
            name="streaming_tokens",
            unit="Count",
            value=stream_metrics.tokens_generated,
        )
        metrics.add_metric(
            name="streaming_chunks_sent",
            unit="Count",
            value=stream_metrics.chunks_sent,
        )

        if stream_metrics.tokens_generated > 0:
            tokens_per_second = (
                stream_metrics.tokens_generated
                / (stream_metrics.total_duration_ms / 1000)
            )
            metrics.add_metric(
                name="tokens_per_second",
                unit="Count",
                value=tokens_per_second,
            )

        if state == StreamState.ERROR:
            metrics.add_metric(
                name="stream_errors", unit="Count", value=1
            )
        elif state == StreamState.CANCELLED:
            metrics.add_metric(
                name="stream_cancellations", unit="Count", value=1
            )
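
The websocket_sender callable injected into the constructor is left abstract above. A minimal sketch using the API Gateway Management API follows; the endpoint URL is a placeholder, and mapping GoneException to ConnectionError is one way to feed the manager's CANCELLED path, not the production code.

import asyncio
import json

import boto3

# Placeholder: https://{api-id}.execute-api.{region}.amazonaws.com/{stage}
apigw = boto3.client(
    "apigatewaymanagementapi",
    endpoint_url="https://example.execute-api.us-west-2.amazonaws.com/prod",
)


async def websocket_sender(connection_id: str, message: dict) -> None:
    """Push one JSON frame to a connected client; raise if the client is gone."""
    try:
        await asyncio.to_thread(
            apigw.post_to_connection,
            ConnectionId=connection_id,
            Data=json.dumps(message).encode("utf-8"),
        )
    except apigw.exceptions.GoneException:
        # HTTP 410: the client disconnected mid-stream.
        raise ConnectionError(connection_id)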

MangaAssist-Specific Streaming Patterns

Streaming Manga Descriptions

When a user asks about a specific manga title, the response streams the description while simultaneously loading product images and purchase options.

gantt
    title Manga Description Streaming Timeline
    dateFormat X
    axisFormat %L ms

    section Client Display
    Typing indicator             :0, 100
    "Looking up manga..."        :100, 700
    Title + author appear        :700, 900
    Description streams in       :900, 2000
    Product card builds          :2000, 2300
    Purchase button active       :2300, 2400

Progressive Recommendation Lists

For recommendation queries, MangaAssist streams titles one at a time so the user can start browsing before the full list is ready.

| Stream Phase | Content Delivered | Time (ms) | User Action Possible |
|---|---|---|---|
| 1 | "Based on your interest in One Piece, here are my top picks:" | 700-1000 | Read introduction |
| 2 | "1. Naruto by Masashi Kishimoto — A young ninja's journey..." | 1000-1400 | Click title #1 |
| 3 | "2. Bleach by Tite Kubo — Soul reapers and sword battles..." | 1400-1800 | Click title #1 or #2 |
| 4 | "3. My Hero Academia by Kohei Horikoshi — Superheroes in..." | 1800-2200 | Click any of first 3 |
| 5 | Product cards with images and prices | 2200-2500 | Add to cart |

Benchmark Framework — Automated Latency Measurement

Test Harness Architecture

graph TB
    subgraph "Benchmark Scheduler"
        EB[EventBridge Rule<br/>Every 15 min]
    end

    subgraph "Test Runner — ECS Task"
        RUNNER[Benchmark Runner<br/>50 queries per intent]
        SYNTHETIC[Synthetic Query<br/>Generator]
    end

    subgraph "MangaAssist Stack (Production)"
        APIGW[API Gateway<br/>WebSocket]
        ECS[ECS Fargate<br/>Orchestrator]
        BEDROCK[Bedrock]
    end

    subgraph "Metrics & Analysis"
        CW[CloudWatch Metrics<br/>p50/p95/p99 per intent]
        STATS[Statistical Analysis<br/>Lambda]
        ALARM[CloudWatch Alarms<br/>Regression Detection]
        DASH[Grafana Dashboard<br/>Latency Trends]
    end

    EB --> RUNNER
    RUNNER --> SYNTHETIC --> APIGW --> ECS --> BEDROCK
    BEDROCK --> ECS --> APIGW --> RUNNER
    RUNNER --> CW
    CW --> STATS --> ALARM
    CW --> DASH

    style EB fill:#ff9900,color:#000
    style CW fill:#ff9900,color:#000
    style ALARM fill:#e74c3c,color:#fff

Statistical Significance Testing

MangaAssist uses the Mann-Whitney U test to detect regressions because latency distributions are non-normal (right-skewed).

| Test Parameter | Value | Rationale |
|---|---|---|
| Sample size per intent | 50 queries | Sufficient power at alpha=0.05, effect size d=0.5 |
| Significance level (alpha) | 0.05 | Industry standard |
| Effect size threshold | Cohen's d >= 0.5 (medium) | Ignore trivially small regressions |
| Comparison window | Current run vs last 7-day rolling baseline | Captures seasonal patterns |
| Test frequency | Every 15 minutes | Balance between detection speed and cost |
| Warm-up exclusion | First 5 queries discarded | Avoid cold-start contamination |
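
A sketch of the regression check under those parameters, using scipy's Mann-Whitney U test. Computing Cohen's d from pooled standard deviations is a simplification for skewed data, kept here because it matches the threshold in the table; a rank-based effect size would also fit.

import numpy as np
from scipy import stats


def detect_regression(
    current_ms: np.ndarray, baseline_ms: np.ndarray,
    alpha: float = 0.05, min_effect_d: float = 0.5, warmup: int = 5,
) -> bool:
    """True if current latencies are significantly worse than baseline."""
    current = current_ms[warmup:]  # discard cold-start samples

    # One-sided Mann-Whitney U: is the current run stochastically greater?
    _, p_value = stats.mannwhitneyu(current, baseline_ms, alternative="greater")

    # Cohen's d on the pooled standard deviation (simplified).
    pooled_sd = np.sqrt(
        (current.std(ddof=1) ** 2 + baseline_ms.std(ddof=1) ** 2) / 2
    )
    d = (current.mean() - baseline_ms.mean()) / pooled_sd

    return p_value < alpha and d >= min_effect_d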

Regression Severity Classification

| Metric | SEV-3 (Warning) | SEV-2 (Alert) | SEV-1 (Critical) |
|---|---|---|---|
| p95 increase | 10-25% above baseline | 25-50% above baseline | > 50% above baseline |
| First-token p95 increase | 15-30% above target | 30-60% above target | > 60% above target |
| Error rate | 1-3% of requests | 3-10% of requests | > 10% of requests |
| Action | Log + dashboard flag | PagerDuty alert to on-call | Auto-rollback + page SRE team |
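
The same p95 thresholds as a small classifier, for illustration; the return contract is an assumption.

def classify_p95_regression(current_p95_ms: float, baseline_p95_ms: float) -> str | None:
    """Map a p95 increase over baseline to the severity tiers above."""
    increase = (current_p95_ms - baseline_p95_ms) / baseline_p95_ms
    if increase > 0.50:
        return "SEV-1"  # auto-rollback + page SRE team
    if increase > 0.25:
        return "SEV-2"  # PagerDuty alert to on-call
    if increase > 0.10:
        return "SEV-3"  # log + dashboard flag
    return None  # within normal variance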

Latency Optimization Checklist

| # | Optimization | Layer | Expected Savings | Implementation Complexity |
|---|---|---|---|---|
| 1 | Warm connection pool to Bedrock | Network | -150ms cold start | Low |
| 2 | Pre-loaded system prompts | Application | -50ms per request | Low |
| 3 | Model warm-up pings | Inference | -200ms on cold instances | Low |
| 4 | Parallel fan-out (asyncio.gather; sketched below) | Orchestration | -400ms waterfall | Medium |
| 5 | Chunk batching (3 per frame) | Streaming | -15% WebSocket overhead | Low |
| 6 | Follow-up query prediction | Pre-computation | -1.5s for 34% of follow-ups | High |
| 7 | Progressive status messages | UX | Perceived -500ms | Low |
| 8 | Intent-based model routing | Model selection | -600ms for simple queries | Medium |
| 9 | Redis semantic cache | Caching | -2s for cache hits | Medium |
| 10 | Pre-computed trending responses | Pre-computation | -1.2s for trending queries | Medium |
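
For row 4, a sketch of the Phase 2 fan-out from the sequence diagram. The three fetch helpers are hypothetical; the point is that the phase costs max(45, 620, 180) ≈ 620ms when run concurrently instead of the ~845ms a sequential waterfall would take.

import asyncio


async def fan_out_context(query: str, session_id: str, cache, opensearch, dynamodb):
    """Run the cache check, vector search, and session load concurrently."""
    cache_hit, rag_results, session = await asyncio.gather(
        cache.semantic_lookup(query),        # ~45ms
        opensearch.knn_search(query, k=5),   # ~620ms — the critical path
        dynamodb.load_session(session_id),   # ~180ms
    )
    return cache_hit, rag_results, session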

Key Exam Takeaways

  1. WebSocket streaming is the foundation of responsive GenAI UX — it turns a 3-second wait into a 1-second "starts typing" experience
  2. First-token latency is the single most important metric for perceived responsiveness — optimize everything upstream of the first generated token
  3. Connection pooling and warm-up pings are cheap investments ($7/month) that eliminate 150-500ms of cold-start latency
  4. Pre-computation works best for predictable patterns — trending titles, common follow-ups, template-based responses
  5. Statistical benchmarking (Mann-Whitney U, not simple averages) prevents both false alarms and missed regressions
  6. Progressive rendering (typing indicators, status messages, skeleton UI) improves perceived latency even when actual latency is unchanged