
Detailed Technology Stack - MangaAssist

A complete inventory of every technology choice across the stack, with the rationale for each selection and the alternatives considered.


Stack Overview

```mermaid
graph TB
    subgraph Client["Client Layer"]
        A1[React Chat Widget]
        A2[WebSocket / HTTPS]
        A3[CloudFront CDN]
    end

    subgraph Gateway["Edge & Gateway"]
        B1[Amazon API Gateway]
        B2[AWS WAF]
        B3[Amazon Cognito]
    end

    subgraph Compute["Compute & Orchestration"]
        C1[ECS Fargate - baseline]
        C2[AWS Lambda - burst]
        C3[Step Functions - workflows]
    end

    subgraph Intelligence["AI & ML Layer"]
        D1[Amazon Bedrock - LLM]
        D2[SageMaker - custom models]
        D3[vLLM - self-hosted inference]
        D4[OpenSearch - vector store]
    end

    subgraph Data["Data Layer"]
        E1[DynamoDB - conversations]
        E2[ElastiCache Redis - caching]
        E3[Redshift - analytics]
        E4[Kinesis - streaming]
    end

    subgraph Observability["Observability"]
        F1[MLflow Tracing]
        F2[CloudWatch / X-Ray]
        F3[OpenTelemetry]
        F4[Prometheus / Grafana]
    end

    Client --> Gateway --> Compute --> Intelligence
    Intelligence --> Data
    Compute --> Observability
    Intelligence --> Observability
```

Layer-by-Layer Breakdown

1. Frontend & Client

| Component | Technology | Rationale |
|---|---|---|
| Chat Widget | React (Amazon internal framework) | Company standard; seamless integration with the Amazon JP storefront |
| Real-time Communication | WebSocket (primary) + HTTPS REST (fallback) | WebSocket for streaming token-by-token responses; REST fallback for environments that block WS |
| CDN | CloudFront | Global edge caching for static assets; already part of Amazon's infra |
| State Management | React Context + useReducer | Lightweight; no need for Redux given the single-widget scope |

Why not alternatives?
- Vue/Angular: Amazon's frontend ecosystem is built on React; switching adds integration burden with zero benefit.
- Server-Sent Events (SSE): WebSocket was chosen because we need bidirectional communication (user typing indicators, real-time context updates).


2. API Gateway & Edge Security

| Component | Technology | Rationale |
|---|---|---|
| API Gateway | Amazon API Gateway (WebSocket + REST) | Native WebSocket support, built-in throttling, IAM integration |
| Web Application Firewall | AWS WAF | SQL injection, XSS, rate limiting, geo-blocking for the Japan-specific deployment |
| Authentication | Amazon Cognito (guest + authenticated) | Supports anonymous browsing with seamless upgrade to authenticated sessions |
| TLS | TLS 1.3 | Mandatory for all traffic; 0-RTT resumption reduces handshake latency |

Why not alternatives?
- Kong/Nginx: API Gateway is fully managed; the operational burden of self-hosting is not justified at our scale.
- Auth0/Okta: Cognito integrates natively with all AWS services; external auth adds a network hop and a vendor dependency.


3. Compute & Orchestration

| Component | Technology | Rationale |
|---|---|---|
| Baseline Compute | ECS Fargate | Serverless containers; no EC2 management; auto-scales with demand |
| Burst Compute | AWS Lambda | Sub-second cold starts for lightweight operations (intent classification, cache lookups) |
| Workflow Orchestration | Step Functions | Visual state machines for multi-step chatbot flows; built-in retry/error handling |
| Container Registry | ECR | Standard AWS container registry; images scanned for vulnerabilities |

Scaling Model:

```
Normal:  ECS Fargate (10-50 tasks, predictable cost)
Spike:   Lambda (0 to 10,000 concurrent in seconds)
Peak:    ECS + Lambda hybrid (cost-optimized: Fargate for base, Lambda for overflow)
```
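
As a rough illustration of the baseline tier, here is a minimal boto3 sketch of how the 10-50-task target-tracking policy could be registered; the cluster and service names (and the 60% CPU target) are hypothetical:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")
SERVICE = "service/mangaassist-cluster/chat-orchestrator"  # hypothetical names

# Register the Fargate service as a scalable target (the 10-50 task band above).
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=SERVICE,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=10,
    MaxCapacity=50,
)

# Target-tracking policy: hold average CPU near 60%, scaling out faster than in.
autoscaling.put_scaling_policy(
    PolicyName="chat-orchestrator-cpu",
    ServiceNamespace="ecs",
    ResourceId=SERVICE,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 120,
    },
)
```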

Why not alternatives?
- EKS (Kubernetes): Overkill for our service count; Fargate gives us container benefits without cluster management.
- EC2: Manual scaling and patching; Fargate eliminates this entirely.
- Temporal/Airflow: Step Functions is native; simpler for AWS-only workflows.


4. AI & LLM Layer

This is the most critical layer, and the one where most of the technology innovation happened.

| Component | Technology | Rationale |
|---|---|---|
| Primary LLM | Claude 3.5 Sonnet (via Amazon Bedrock) | Best quality-to-latency ratio for conversational AI; native Bedrock integration |
| Lightweight LLM | Claude Haiku (via Bedrock) | 10x cheaper for simple tasks (greetings, order status formatting) |
| Self-Hosted Inference | vLLM on SageMaker endpoints | For fine-tuned models where Bedrock doesn't apply; see 02-open-source-libraries.md |
| Intent Classifier | Fine-tuned DistilBERT on SageMaker | Two-stage: rule-based -> ML classifier; DistilBERT is 40% smaller than BERT while retaining ~97% of its accuracy |
| Hardware Optimization | AWS Inferentia (ml.inf1.xlarge) | 70% cost reduction vs. GPU (ml.g4dn.xlarge) for intent classification after Neuron SDK compilation |
| Embeddings | Amazon Titan Embeddings V2 (via Bedrock) | 1024-dim vectors; optimized for Japanese text; fully managed |
| Reranker | ms-marco-MiniLM cross-encoder on SageMaker | Reranks top-50 retrieval results to top-5; 12x more accurate than embedding similarity alone |
| Vector Store | OpenSearch Serverless (HNSW w/ nmslib) | Serverless eliminates capacity planning; HNSW gives sub-50ms retrieval at 10M+ vectors |
| Guardrails | Amazon Bedrock Guardrails + custom pipeline | 6-stage validation: PII detection, prompt injection defense, content moderation, hallucination check, response length, format validation |
| Model Compilation | Neuron SDK, ONNX, TorchScript | Neuron for Inferentia; ONNX for cross-platform portability; TorchScript for production serialization |
| Recommendations | Amazon Personalize | Collaborative filtering trained on manga browsing/purchase history |
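
To make the Sonnet/Haiku split concrete, here is a minimal routing sketch using the Bedrock Converse API. The model IDs are real Bedrock identifiers, but the intent labels and routing table are hypothetical; in production the routing decision is driven by the intent classifier above.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="ap-northeast-1")

HAIKU = "anthropic.claude-3-haiku-20240307-v1:0"
SONNET = "anthropic.claude-3-5-sonnet-20240620-v1:0"
CHEAP_INTENTS = {"greeting", "order_status", "store_hours"}  # hypothetical labels

def route_and_invoke(intent: str, user_message: str) -> str:
    # Simple tasks go to Haiku (~10x cheaper); everything else gets Sonnet.
    model_id = HAIKU if intent in CHEAP_INTENTS else SONNET
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": user_message}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0.3},
    )
    return response["output"]["message"]["content"][0]["text"]
```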

Why not alternatives?
- GPT-4: Higher latency, no native AWS integration, data residency concerns for Amazon data.
- Open-source LLMs (Llama, Mistral): Evaluated, but Claude 3.5 Sonnet beat them on Japanese language quality; we do use vLLM for self-hosted fine-tuned models.
- Pinecone/Weaviate: OpenSearch is already in Amazon's ecosystem with zero egress costs.
- FAISS: No serverless option; requires managing infrastructure.
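
The retrieval path chains the vector store and reranker rows above. A sketch assuming opensearch-py and sentence-transformers, with a hypothetical index name and field schema (SigV4 auth and error handling omitted for brevity):

```python
from opensearchpy import OpenSearch
from sentence_transformers import CrossEncoder

client = OpenSearch(
    hosts=[{"host": "example.ap-northeast-1.aoss.amazonaws.com", "port": 443}],
    use_ssl=True,
)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(query: str, query_vector: list[float], k: int = 5) -> list[str]:
    # Stage 1: HNSW k-NN retrieval of the top-50 candidates.
    hits = client.search(
        index="manga-knowledge",  # hypothetical index
        body={"size": 50, "query": {"knn": {"embedding": {"vector": query_vector, "k": 50}}}},
    )["hits"]["hits"]
    passages = [h["_source"]["text"] for h in hits]
    # Stage 2: the cross-encoder scores every (query, passage) pair; keep the top-k.
    scores = reranker.predict([(query, p) for p in passages])
    ranked = sorted(zip(scores, passages), key=lambda pair: float(pair[0]), reverse=True)
    return [p for _, p in ranked[:k]]
```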


5. Data Layer

| Component | Technology | Rationale |
|---|---|---|
| Conversation Store | DynamoDB (on-demand mode) | Single-digit ms reads; 24-hour TTL auto-cleans expired sessions |
| Cache Accelerator | DynamoDB DAX | In-memory cache in front of DynamoDB; microsecond reads for hot conversations |
| Distributed Cache | ElastiCache Redis | L2 cache for LLM responses, product data, intent classifications |
| Analytics Warehouse | Amazon Redshift | OLAP queries across conversation logs, metrics, A/B test results |
| Event Streaming | Amazon Kinesis Data Streams | Real-time event pipeline: chat events -> analytics, monitoring, alerting |
| Data Lake | S3 (Parquet format) | Long-term storage for training data, conversation logs, embeddings |

Caching Strategy (3 layers):

```
L1: In-memory (application-level, per-container)        -> <1ms, small capacity
L2: ElastiCache Redis (shared across containers)        -> 1-5ms, medium capacity
L3: DynamoDB DAX (conversation-specific acceleration)   -> 1-3ms, large capacity
```
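
A read-through sketch of the L1/L2 layers (the DAX layer sits behind the DynamoDB client and is omitted here); the host name and TTL are illustrative:

```python
import json
import redis

# L1: per-container in-process dict; L2: shared ElastiCache Redis.
l1_cache: dict[str, dict] = {}
l2 = redis.Redis(host="mangaassist-cache.example.cache.amazonaws.com", port=6379)

def get_cached(key: str, loader, ttl_seconds: int = 300):
    # L1 hit: sub-millisecond, but local to this container.
    if key in l1_cache:
        return l1_cache[key]
    # L2 hit: 1-5 ms, shared across all containers.
    raw = l2.get(key)
    if raw is not None:
        value = json.loads(raw)
        l1_cache[key] = value
        return value
    # Miss: compute the value (e.g., an LLM call), then populate both layers.
    value = loader()
    l2.setex(key, ttl_seconds, json.dumps(value))
    l1_cache[key] = value
    return value
```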

Why not alternatives?
- PostgreSQL/Aurora: DynamoDB's single-digit-ms latency at any scale beats RDS for key-value access patterns.
- Memcached: Redis supports richer data structures (sorted sets for ranking, pub/sub for real-time updates).
- Snowflake: Redshift Serverless is cheaper within the AWS ecosystem; no data egress.


6. Observability & Monitoring

| Component | Technology | Rationale |
|---|---|---|
| LLM Tracing | MLflow Tracing | Open-source, OTel-compatible; traces every step of the LLM pipeline; see 03-mlflow-llm-observability.md |
| Distributed Tracing | AWS X-Ray / OpenTelemetry | End-to-end request tracing across all AWS services |
| Metrics & Dashboards | CloudWatch + Prometheus/Grafana | CloudWatch for AWS-native metrics; Grafana for custom ML dashboards |
| Logging | CloudWatch Logs (structured JSON) | Centralized logging with Insights querying |
| Alerting | CloudWatch Alarms -> SNS -> PagerDuty | Tiered alerting: P1 (pages), P2 (Slack), P3 (daily digest) |
| Audit Trail | CloudTrail | Immutable audit log of all API calls; compliance requirement |

Why MLflow over alternatives?
- Langfuse/LangSmith: MLflow is fully open-source, self-hosted (no data leaves AWS), and integrates with our existing MLflow experiment tracking.
- Datadog LLM Observability: Costly at our scale; vendor lock-in; MLflow gives us the same capabilities at zero license cost.
- Detailed comparison in 03-mlflow-llm-observability.md.
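
A minimal sketch of what per-step tracing looks like, assuming MLflow's tracing decorator (available in recent MLflow releases); the function bodies are placeholders:

```python
import mlflow

# Each decorated function becomes a span in the trace for a single chat turn.
@mlflow.trace(span_type="RETRIEVER")
def retrieve_context(query: str) -> list[str]:
    return ["placeholder passage"]  # production: OpenSearch k-NN + rerank

@mlflow.trace(span_type="LLM")
def generate_answer(query: str, context: list[str]) -> str:
    return "placeholder answer"  # production: Bedrock converse call

@mlflow.trace(name="chat_turn")
def handle_turn(query: str) -> str:
    return generate_answer(query, retrieve_context(query))
```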


7. Security & Compliance

| Component | Technology | Rationale |
|---|---|---|
| Encryption at Rest | AWS KMS (AES-256) | Managed key rotation; all data encrypted by default |
| Encryption in Transit | TLS 1.3 | End-to-end; certificates managed by ACM |
| Network Isolation | VPC + private subnets | All ML endpoints and databases in private subnets; no public internet access |
| Secrets Management | AWS Secrets Manager | Auto-rotating credentials for all integrations |
| PII Detection | Amazon Comprehend + custom regex | Detects and masks PII before LLM processing |
| IAM | IAM Roles (least privilege) | Per-service roles; no shared credentials; cross-account via STS |
| Compliance | GDPR, CCPA, COPPA, PCI-DSS | Japan-specific data residency in ap-northeast-1 |
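
A sketch of the two-stage PII masking, assuming boto3's Comprehend client. The Japanese phone-number regex is a hypothetical example of the custom layer; Comprehend's PII API targets English text, which is one reason a custom regex stage is needed:

```python
import re
import boto3

comprehend = boto3.client("comprehend", region_name="ap-northeast-1")

# Hypothetical Japan-specific pattern for the custom regex stage.
JP_PHONE = re.compile(r"0\d{1,4}-\d{1,4}-\d{3,4}")

def mask_pii(text: str) -> str:
    # Stage 1: managed detection.
    result = comprehend.detect_pii_entities(Text=text, LanguageCode="en")
    # Replace from the end of the string so earlier offsets stay valid.
    for entity in sorted(result["Entities"], key=lambda e: e["BeginOffset"], reverse=True):
        text = text[: entity["BeginOffset"]] + "[PII]" + text[entity["EndOffset"]:]
    # Stage 2: custom regex for locale-specific formats.
    return JP_PHONE.sub("[PHONE]", text)
```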

8. Infrastructure & DevOps

| Component | Technology | Rationale |
|---|---|---|
| Infrastructure as Code | AWS CDK (TypeScript) | Imperative IaC; better abstractions than raw CloudFormation |
| CI/CD | AWS CodePipeline + CodeBuild | Native integration; no external CI dependency |
| Container Builds | Docker (multi-stage) | Minimal prod images; separate build/runtime layers |
| Feature Flags | AWS AppConfig | Gradual feature rollouts; instant kill-switches for new LLM behaviors |
| A/B Testing | Custom framework on Kinesis + Redshift | Splits traffic by session (see the sketch below); measures conversion lift, CSAT, AI quality metrics |
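
Session splitting can be done with deterministic hashing, so every event in a session lands in the same variant; the experiment name below is hypothetical:

```python
import hashlib

def assign_variant(session_id: str, experiment: str, treatment_pct: int = 50) -> str:
    """Deterministically bucket a session so all its events see the same variant."""
    digest = hashlib.sha256(f"{experiment}:{session_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return "treatment" if bucket < treatment_pct else "control"

# Example: the assignment is emitted with every Kinesis event so Redshift
# can join conversions back to variants downstream.
variant = assign_variant("sess-123", "haiku-routing-v2")
```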

Cost Summary (100K conversations/day)

| Component | Monthly Cost | % of Total |
|---|---|---|
| LLM Inference (Bedrock) | $15,000 - $40,000 | 40-50% |
| SageMaker Endpoints (classifiers, rerankers) | $5,000 - $12,000 | 15-20% |
| DynamoDB + DAX | $3,000 - $6,000 | 8-10% |
| OpenSearch Serverless | $2,500 - $5,000 | 7-8% |
| ElastiCache Redis | $1,500 - $3,000 | 4-5% |
| Compute (Fargate + Lambda) | $2,000 - $5,000 | 6-8% |
| Observability (CloudWatch, MLflow infra) | $1,000 - $3,000 | 3-5% |
| Other (S3, Kinesis, CDN, etc.) | $1,000 - $3,000 | 3-5% |
| **Total** | $31,000 - $77,000 | 100% |

Cost optimizations that I drove (see 04-innovation-and-tradeoffs.md):
- Inferentia migration for classifiers: -$8,400/month
- Semantic caching of LLM responses (sketched below): -$12,000/month
- vLLM for self-hosted models: -$15,000/month (50% GPU reduction)
- Intelligent routing (Haiku vs Sonnet): -$18,000/month
- Prompt compression and optimization: -$6,000/month
- Total monthly savings: ~$59,400/month (~$713K/year)
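
For illustration, the core of semantic caching is a similarity check against embeddings of previously answered queries. The threshold and in-memory store below are assumptions; production would back this with the Redis/OpenSearch layers above:

```python
import numpy as np

# Illustrative in-memory store of (normalized embedding, cached response) pairs.
_cache: list[tuple[np.ndarray, str]] = []
SIMILARITY_THRESHOLD = 0.95  # assumed cutoff, tuned per intent in practice

def semantic_lookup(query_embedding: np.ndarray) -> str | None:
    q = query_embedding / np.linalg.norm(query_embedding)
    for cached_emb, cached_response in _cache:
        # Cosine similarity on normalized vectors is a dot product.
        if float(np.dot(q, cached_emb)) >= SIMILARITY_THRESHOLD:
            return cached_response  # near-duplicate question: skip the LLM call
    return None

def semantic_store(query_embedding: np.ndarray, response: str) -> None:
    _cache.append((query_embedding / np.linalg.norm(query_embedding), response))
```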


Technology Decision Framework

For every component, I applied this evaluation matrix:

| Criteria | Weight | How Measured |
|---|---|---|
| Performance | 30% | Benchmark latency (P50, P95, P99), throughput (req/sec) |
| Cost | 25% | $/request, $/month at projected scale |
| Operational Burden | 20% | Setup time, monitoring needs, on-call complexity |
| AWS Integration | 15% | Native service integration, IAM support, VPC compatibility |
| Community & Longevity | 10% | GitHub stars, contributor count, corporate backing, release cadence |
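
Expressed as code, the matrix is just a weighted sum; the example ratings below are hypothetical:

```python
WEIGHTS = {
    "performance": 0.30,
    "cost": 0.25,
    "operational_burden": 0.20,
    "aws_integration": 0.15,
    "community_longevity": 0.10,
}

def weighted_score(ratings: dict[str, float]) -> float:
    """ratings: criterion -> 0-10 score; returns the weighted total (max 10)."""
    return sum(weight * ratings[criterion] for criterion, weight in WEIGHTS.items())

# Hypothetical example: a fully managed option that scores high on integration.
print(weighted_score({
    "performance": 8, "cost": 7, "operational_burden": 9,
    "aws_integration": 10, "community_longevity": 8,
}))  # -> 8.25
```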

This framework is detailed further in 04-innovation-and-tradeoffs.md.