
Performance Optimization User Stories - MangaAssist Chatbot

Overview

This directory contains detailed user stories for performance optimization across every major service in the MangaAssist chatbot architecture. Each user story includes high-level design, low-level implementation details, Mermaid diagrams, and code examples.

The goal is to meet the target latency budget of < 2 seconds end-to-end for 95th percentile (p95) responses, with streaming first-token latency under 400ms.
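The p95 target can be made concrete with a small measurement check. The sketch below computes a nearest-rank percentile over collected latency samples and asserts the 2-second budget; the sample data and function name are illustrative, not part of the actual monitoring stack:

```python
import math

def percentile(samples, p):
    """Nearest-rank p-th percentile of latency samples (ms)."""
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

# Illustrative samples: 96 requests under 2s plus 4 slow outliers.
samples = [500 + 15 * i for i in range(96)] + [2100, 2300, 2500, 3000]

p95 = percentile(samples, 95)
assert p95 < 2000, f"p95 latency {p95}ms exceeds the 2s budget"
```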

User Stories

| # | User Story | Primary Service | Target p95 Latency |
|---|------------|-----------------|--------------------|
| PO-01 | LLM Response Latency Optimization | Amazon Bedrock (Claude 3.5 Sonnet) | First token < 400ms |
| PO-02 | Intent Classifier Latency Optimization | SageMaker Endpoint + Rule Engine | < 50ms (rules), < 150ms (ML) |
| PO-03 | RAG Pipeline Retrieval Performance | OpenSearch Serverless + Embeddings | < 200ms total retrieval |
| PO-04 | DynamoDB Conversation Memory Performance | DynamoDB + DAX | < 10ms reads |
| PO-05 | Caching Layer Performance | ElastiCache Redis | < 2ms cache hits |
| PO-06 | WebSocket Streaming Performance | CloudFront + ALB + WebSocket | < 50ms delivery jitter |
| PO-07 | Orchestrator Concurrency and Throughput | ECS Fargate + Lambda | 10K concurrent sessions |
| PO-08 | End-to-End Latency Optimization | Cross-cutting | < 2s p95 total |
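The latency targets above could also be encoded as machine-checkable thresholds, e.g. for alerting. The values below come straight from the table; the dict shape and the `breaches` helper are illustrative assumptions, not an existing API (PO-07 is a throughput target rather than a latency one, so it is omitted):

```python
# p95 latency thresholds in milliseconds, keyed by user story.
# PO-07 targets throughput (10K concurrent sessions), not latency.
SLO_MS = {
    "PO-01": 400,   # LLM first token
    "PO-02": 150,   # ML intent path (rule-engine path: 50)
    "PO-03": 200,   # total RAG retrieval
    "PO-04": 10,    # DynamoDB reads via DAX
    "PO-05": 2,     # Redis cache hits
    "PO-06": 50,    # WebSocket delivery jitter
    "PO-08": 2000,  # end-to-end total
}

def breaches(story_id, observed_p95_ms):
    """True if an observed p95 exceeds its budget (hypothetical helper)."""
    return observed_p95_ms > SLO_MS[story_id]
```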

Latency Budget Breakdown

```mermaid
gantt
    title Request Latency Budget (p95 Target: < 2000ms)
    dateFormat X
    axisFormat %L ms

    section Edge
    TLS + Routing           :0, 30
    Auth + Rate Limit       :30, 50

    section Orchestrator
    Load Context (DynamoDB) :50, 60
    Intent Classification   :60, 110

    section Data Fetch
    Cache Check             :110, 115
    Service Calls (parallel):115, 315

    section Intelligence
    RAG Retrieval           :115, 315
    LLM First Token         :315, 715
    LLM Streaming           :715, 1700

    section Safety
    Guardrails              :1700, 1750

    section Delivery
    Format + Stream         :1750, 1800
```
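The stage windows in the budget can be sanity-checked in a few lines. The numbers below are copied from the Gantt chart (note that Service Calls and RAG Retrieval deliberately share the same window because they run in parallel); the dict itself is just an illustrative encoding:

```python
# Stage windows as (start_ms, end_ms), taken from the latency budget.
budget = {
    "TLS + Routing":         (0, 30),
    "Auth + Rate Limit":     (30, 50),
    "Load Context":          (50, 60),
    "Intent Classification": (60, 110),
    "Cache Check":           (110, 115),
    "Service Calls":         (115, 315),  # parallel with RAG Retrieval
    "RAG Retrieval":         (115, 315),
    "LLM First Token":       (315, 715),
    "LLM Streaming":         (715, 1700),
    "Guardrails":            (1700, 1750),
    "Format + Stream":       (1750, 1800),
}

# The budget ends at 1800ms, leaving 200ms of headroom under the 2s p95 cap.
total_ms = max(end for _, end in budget.values())
assert total_ms <= 2000

# The LLM's first-token window spans 400ms, matching the PO-01 target.
first_token_ms = budget["LLM First Token"][1] - budget["LLM First Token"][0]
assert first_token_ms <= 400
```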

Performance Tiers

```mermaid
pie title Response Time Distribution Target
    "< 500ms (template/cached)" : 35
    "500ms - 1s (simple LLM)" : 25
    "1s - 2s (standard LLM)" : 30
    "> 2s (complex multi-turn)" : 10
```

How to Use

  1. Start with PO-08 (End-to-End Latency) for the holistic view and critical path analysis.
  2. Read PO-01 (LLM Latency) next — it targets the largest latency contributor.
  3. Apply PO-02 through PO-07 based on your profiling data.
  4. Each user story is self-contained and can be implemented independently.

Relationship to Architecture

These user stories map directly to the components described in:

- 04-architecture-hld.md — High-level architecture
- 04b-architecture-lld.md — Low-level design
- Cost-Optimization-User-Stories/ — Complementary cost optimization strategies