
Detailed Technology Stack - MangaAssist

A complete inventory of every technology choice across the stack, with the rationale for each selection and the alternatives considered.


Stack Overview

```mermaid
graph TB
    subgraph Client["Client Layer"]
        A1[React Chat Widget]
        A2[WebSocket / HTTPS]
        A3[CloudFront CDN]
    end

    subgraph Gateway["Edge & Gateway"]
        B1[Amazon API Gateway]
        B2[AWS WAF]
        B3[Amazon Cognito]
    end

    subgraph Compute["Compute & Orchestration"]
        C1[ECS Fargate - baseline]
        C2[AWS Lambda - burst]
        C3[Step Functions - workflows]
    end

    subgraph Intelligence["AI & ML Layer"]
        D1[Amazon Bedrock - LLM]
        D2[SageMaker - custom models]
        D3[vLLM - self-hosted inference]
        D4[OpenSearch - vector store]
    end

    subgraph Data["Data Layer"]
        E1[DynamoDB - conversations]
        E2[ElastiCache Redis - caching]
        E3[Redshift - analytics]
        E4[Kinesis - streaming]
    end

    subgraph Observability["Observability"]
        F1[MLflow Tracing]
        F2[CloudWatch / X-Ray]
        F3[OpenTelemetry]
        F4[Prometheus / Grafana]
    end

    Client --> Gateway --> Compute --> Intelligence
    Intelligence --> Data
    Compute --> Observability
    Intelligence --> Observability
```

Layer-by-Layer Breakdown

1. Frontend & Client

| Component | Technology | Rationale |
|---|---|---|
| Chat Widget | React (Amazon internal framework) | Company standard; seamless integration with the Amazon JP storefront |
| Real-time Communication | WebSocket (primary) + HTTPS REST (fallback) | WebSocket for streaming token-by-token responses; REST fallback for environments that block WS |
| CDN | CloudFront | Global edge caching for static assets; already part of Amazon's infra |
| State Management | React Context + useReducer | Lightweight; no need for Redux given the single-widget scope |

Why not alternatives?
- Vue/Angular: Amazon's frontend ecosystem is built on React; switching adds integration burden with zero benefit.
- Server-Sent Events (SSE): WebSocket was chosen because we need bidirectional communication (user typing indicators, real-time context updates).


2. API Gateway & Edge Security

| Component | Technology | Rationale |
|---|---|---|
| API Gateway | Amazon API Gateway (WebSocket + REST) | Native WebSocket support, built-in throttling, IAM integration |
| Web Application Firewall | AWS WAF | SQL injection, XSS, rate limiting, geo-blocking for the Japan-specific deployment |
| Authentication | Amazon Cognito (guest + authenticated) | Supports anonymous browsing with seamless upgrade to authenticated sessions |
| TLS | TLS 1.3 | Mandatory for all traffic; 0-RTT resumption reduces handshake latency |

Why not alternatives?
- Kong/Nginx: API Gateway is fully managed; the operational burden of self-hosting is not justified at our scale.
- Auth0/Okta: Cognito integrates natively with all AWS services; external auth adds a network hop and a vendor dependency.


3. Compute & Orchestration

| Component | Technology | Rationale |
|---|---|---|
| Baseline Compute | ECS Fargate | Serverless containers; no EC2 management; auto-scales with demand |
| Burst Compute | AWS Lambda | Sub-second cold starts for lightweight operations (intent classification, cache lookups) |
| Workflow Orchestration | Step Functions | Visual state machines for multi-step chatbot flows; built-in retry/error handling |
| Container Registry | ECR | Standard AWS container registry; images scanned for vulnerabilities |

Scaling Model:

```
Normal:  ECS Fargate (10-50 tasks, predictable cost)
Spike:   Lambda (0 to 10,000 concurrent in seconds)
Peak:    ECS + Lambda hybrid (cost-optimized: Fargate for base, Lambda for overflow)
```
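
As a rough illustration of the baseline tier, here is a minimal boto3 sketch of how the 10-50-task target-tracking policy could be registered; the cluster and service names (and the 60% CPU target) are hypothetical:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")
SERVICE = "service/mangaassist-cluster/chat-orchestrator"  # hypothetical names

# Register the Fargate service as a scalable target (the 10-50 task band above).
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=SERVICE,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=10,
    MaxCapacity=50,
)

# Target-tracking policy: hold average CPU near 60%, scaling out faster than in.
autoscaling.put_scaling_policy(
    PolicyName="chat-orchestrator-cpu",
    ServiceNamespace="ecs",
    ResourceId=SERVICE,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 120,
    },
)
```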

Why not alternatives?
- EKS (Kubernetes): Overkill for our service count; Fargate gives us container benefits without cluster management.
- EC2: Manual scaling and patching; Fargate eliminates this entirely.
- Temporal/Airflow: Step Functions is native; simpler for AWS-only workflows.


4. AI & LLM Layer

This is the most critical layer, and the one where most of the technology innovation happened.

| Component | Technology | Rationale |
|---|---|---|
| Primary LLM | Claude 3.5 Sonnet (via Amazon Bedrock) | Best quality-to-latency ratio for conversational AI; native Bedrock integration |
| Lightweight LLM | Claude Haiku (via Bedrock) | 10x cheaper for simple tasks (greetings, order status formatting) |
| Self-Hosted Inference | vLLM on SageMaker endpoints | For fine-tuned models where Bedrock doesn't apply; see 02-open-source-libraries.md |
| Intent Classifier | Fine-tuned DistilBERT on SageMaker | Two-stage: rule-based -> ML classifier; DistilBERT is 40% smaller than BERT while retaining ~97% of its accuracy |
| Hardware Optimization | AWS Inferentia (ml.inf1.xlarge) | 70% cost reduction vs. GPU (ml.g4dn.xlarge) for intent classification after Neuron SDK compilation |
| Embeddings | Amazon Titan Embeddings V2 (via Bedrock) | 1024-dim vectors; optimized for Japanese text; fully managed |
| Reranker | ms-marco-MiniLM cross-encoder on SageMaker | Reranks top-50 retrieval results to top-5; 12x more accurate than embedding similarity alone |
| Vector Store | OpenSearch Serverless (HNSW w/ nmslib) | Serverless eliminates capacity planning; HNSW gives sub-50ms retrieval at 10M+ vectors |
| Guardrails | Amazon Bedrock Guardrails + custom pipeline | 6-stage validation: PII detection, prompt injection defense, content moderation, hallucination check, response length, format validation |
| Model Compilation | Neuron SDK, ONNX, TorchScript | Neuron for Inferentia; ONNX for cross-platform portability; TorchScript for production serialization |
| Recommendations | Amazon Personalize | Collaborative filtering trained on manga browsing/purchase history |
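
To make the Sonnet/Haiku split concrete, here is a minimal routing sketch using the Bedrock Converse API. The model IDs are real Bedrock identifiers, but the intent labels and routing table are hypothetical; in production the routing decision is driven by the intent classifier above.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="ap-northeast-1")

HAIKU = "anthropic.claude-3-haiku-20240307-v1:0"
SONNET = "anthropic.claude-3-5-sonnet-20240620-v1:0"
CHEAP_INTENTS = {"greeting", "order_status", "store_hours"}  # hypothetical labels

def route_and_invoke(intent: str, user_message: str) -> str:
    # Simple tasks go to Haiku (~10x cheaper); everything else gets Sonnet.
    model_id = HAIKU if intent in CHEAP_INTENTS else SONNET
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": user_message}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0.3},
    )
    return response["output"]["message"]["content"][0]["text"]
```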

Why not alternatives?
- GPT-4: Higher latency, no native AWS integration, data residency concerns for Amazon data.
- Open-source LLMs (Llama, Mistral): Evaluated, but Claude 3.5 Sonnet beat them on Japanese language quality; we do use vLLM for self-hosted fine-tuned models.
- Pinecone/Weaviate: OpenSearch is already in Amazon's ecosystem with zero egress costs.
- FAISS: No serverless option; requires managing infrastructure.
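
The retrieval path chains the vector store and reranker rows above. A sketch assuming opensearch-py and sentence-transformers, with a hypothetical index name and field schema (SigV4 auth and error handling omitted for brevity):

```python
from opensearchpy import OpenSearch
from sentence_transformers import CrossEncoder

client = OpenSearch(
    hosts=[{"host": "example.ap-northeast-1.aoss.amazonaws.com", "port": 443}],
    use_ssl=True,
)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(query: str, query_vector: list[float], k: int = 5) -> list[str]:
    # Stage 1: HNSW k-NN retrieval of the top-50 candidates.
    hits = client.search(
        index="manga-knowledge",  # hypothetical index
        body={"size": 50, "query": {"knn": {"embedding": {"vector": query_vector, "k": 50}}}},
    )["hits"]["hits"]
    passages = [h["_source"]["text"] for h in hits]
    # Stage 2: the cross-encoder scores every (query, passage) pair; keep the top-k.
    scores = reranker.predict([(query, p) for p in passages])
    ranked = sorted(zip(scores, passages), key=lambda pair: float(pair[0]), reverse=True)
    return [p for _, p in ranked[:k]]
```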


5. Data Layer

| Component | Technology | Rationale |
|---|---|---|
| Conversation Store | DynamoDB (on-demand mode) | Single-digit ms reads; 24-hour TTL auto-cleans expired sessions |
| Cache Accelerator | DynamoDB DAX | In-memory cache in front of DynamoDB; microsecond reads for hot conversations |
| Distributed Cache | ElastiCache Redis | L2 cache for LLM responses, product data, intent classifications |
| Analytics Warehouse | Amazon Redshift | OLAP queries across conversation logs, metrics, A/B test results |
| Event Streaming | Amazon Kinesis Data Streams | Real-time event pipeline: chat events -> analytics, monitoring, alerting |
| Data Lake | S3 (Parquet format) | Long-term storage for training data, conversation logs, embeddings |

Caching Strategy (3 layers):

```
L1: In-memory (application-level, per-container)        -> <1ms, small capacity
L2: ElastiCache Redis (shared across containers)        -> 1-5ms, medium capacity
L3: DynamoDB DAX (conversation-specific acceleration)   -> 1-3ms, large capacity
```
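
A read-through sketch of the L1/L2 layers (the DAX layer sits behind the DynamoDB client and is omitted here); the host name and TTL are illustrative:

```python
import json
import redis

# L1: per-container in-process dict; L2: shared ElastiCache Redis.
l1_cache: dict[str, dict] = {}
l2 = redis.Redis(host="mangaassist-cache.example.cache.amazonaws.com", port=6379)

def get_cached(key: str, loader, ttl_seconds: int = 300):
    # L1 hit: sub-millisecond, but local to this container.
    if key in l1_cache:
        return l1_cache[key]
    # L2 hit: 1-5 ms, shared across all containers.
    raw = l2.get(key)
    if raw is not None:
        value = json.loads(raw)
        l1_cache[key] = value
        return value
    # Miss: compute the value (e.g., an LLM call), then populate both layers.
    value = loader()
    l2.setex(key, ttl_seconds, json.dumps(value))
    l1_cache[key] = value
    return value
```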

Why not alternatives?
- PostgreSQL/Aurora: DynamoDB's single-digit-ms latency at any scale beats RDS for key-value access patterns.
- Memcached: Redis supports richer data structures (sorted sets for ranking, pub/sub for real-time updates).
- Snowflake: Redshift Serverless is cheaper within the AWS ecosystem; no data egress.


6. Observability & Monitoring

| Component | Technology | Rationale |
|---|---|---|
| LLM Tracing | MLflow Tracing | Open-source, OTel-compatible; traces every step of the LLM pipeline; see 03-mlflow-llm-observability.md |
| Distributed Tracing | AWS X-Ray / OpenTelemetry | End-to-end request tracing across all AWS services |
| Metrics & Dashboards | CloudWatch + Prometheus/Grafana | CloudWatch for AWS-native metrics; Grafana for custom ML dashboards |
| Logging | CloudWatch Logs (structured JSON) | Centralized logging with Insights querying |
| Alerting | CloudWatch Alarms -> SNS -> PagerDuty | Tiered alerting: P1 (pages), P2 (Slack), P3 (daily digest) |
| Audit Trail | CloudTrail | Immutable audit log of all API calls; compliance requirement |

Why MLflow over alternatives?
- Langfuse/LangSmith: MLflow is fully open-source, self-hosted (no data leaves AWS), and integrates with our existing MLflow experiment tracking.
- Datadog LLM Observability: Costly at our scale; vendor lock-in; MLflow gives us the same capabilities at zero license cost.
- Detailed comparison in 03-mlflow-llm-observability.md.
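
A minimal sketch of what per-step tracing looks like, assuming MLflow's tracing decorator (available in recent MLflow releases); the function bodies are placeholders:

```python
import mlflow

# Each decorated function becomes a span in the trace for a single chat turn.
@mlflow.trace(span_type="RETRIEVER")
def retrieve_context(query: str) -> list[str]:
    return ["placeholder passage"]  # production: OpenSearch k-NN + rerank

@mlflow.trace(span_type="LLM")
def generate_answer(query: str, context: list[str]) -> str:
    return "placeholder answer"  # production: Bedrock converse call

@mlflow.trace(name="chat_turn")
def handle_turn(query: str) -> str:
    return generate_answer(query, retrieve_context(query))
```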


7. Security & Compliance

| Component | Technology | Rationale |
|---|---|---|
| Encryption at Rest | AWS KMS (AES-256) | Managed key rotation; all data encrypted by default |
| Encryption in Transit | TLS 1.3 | End-to-end; certificates managed by ACM |
| Network Isolation | VPC + private subnets | All ML endpoints and databases in private subnets; no public internet access |
| Secrets Management | AWS Secrets Manager | Auto-rotating credentials for all integrations |
| PII Detection | Amazon Comprehend + custom regex | Detects and masks PII before LLM processing |
| IAM | IAM Roles (least privilege) | Per-service roles; no shared credentials; cross-account via STS |
| Compliance | GDPR, CCPA, COPPA, PCI-DSS | Japan-specific data residency in ap-northeast-1 |
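
A sketch of the two-stage PII masking, assuming boto3's Comprehend client. The Japanese phone-number regex is a hypothetical example of the custom layer; Comprehend's PII API targets English text, which is one reason a custom regex stage is needed:

```python
import re
import boto3

comprehend = boto3.client("comprehend", region_name="ap-northeast-1")

# Hypothetical Japan-specific pattern for the custom regex stage.
JP_PHONE = re.compile(r"0\d{1,4}-\d{1,4}-\d{3,4}")

def mask_pii(text: str) -> str:
    # Stage 1: managed detection.
    result = comprehend.detect_pii_entities(Text=text, LanguageCode="en")
    # Replace from the end of the string so earlier offsets stay valid.
    for entity in sorted(result["Entities"], key=lambda e: e["BeginOffset"], reverse=True):
        text = text[: entity["BeginOffset"]] + "[PII]" + text[entity["EndOffset"]:]
    # Stage 2: custom regex for locale-specific formats.
    return JP_PHONE.sub("[PHONE]", text)
```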

8. Infrastructure & DevOps

| Component | Technology | Rationale |
|---|---|---|
| Infrastructure as Code | AWS CDK (TypeScript) | Imperative IaC; better abstractions than raw CloudFormation |
| CI/CD | AWS CodePipeline + CodeBuild | Native integration; no external CI dependency |
| Container Builds | Docker (multi-stage) | Minimal prod images; separate build/runtime layers |
| Feature Flags | AWS AppConfig | Gradual feature rollouts; instant kill-switches for new LLM behaviors |
| A/B Testing | Custom framework on Kinesis + Redshift | Splits traffic by session (see the sketch below); measures conversion lift, CSAT, AI quality metrics |
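
Session splitting can be done with deterministic hashing, so every event in a session lands in the same variant; the experiment name below is hypothetical:

```python
import hashlib

def assign_variant(session_id: str, experiment: str, treatment_pct: int = 50) -> str:
    """Deterministically bucket a session so all its events see the same variant."""
    digest = hashlib.sha256(f"{experiment}:{session_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return "treatment" if bucket < treatment_pct else "control"

# Example: the assignment is emitted with every Kinesis event so Redshift
# can join conversions back to variants downstream.
variant = assign_variant("sess-123", "haiku-routing-v2")
```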

Cost Summary (100K conversations/day)

| Component | Monthly Cost | % of Total |
|---|---|---|
| LLM Inference (Bedrock) | $15,000 - $40,000 | 40-50% |
| SageMaker Endpoints (classifiers, rerankers) | $5,000 - $12,000 | 15-20% |
| DynamoDB + DAX | $3,000 - $6,000 | 8-10% |
| OpenSearch Serverless | $2,500 - $5,000 | 7-8% |
| ElastiCache Redis | $1,500 - $3,000 | 4-5% |
| Compute (Fargate + Lambda) | $2,000 - $5,000 | 6-8% |
| Observability (CloudWatch, MLflow infra) | $1,000 - $3,000 | 3-5% |
| Other (S3, Kinesis, CDN, etc.) | $1,000 - $3,000 | 3-5% |
| **Total** | $31,000 - $77,000 | 100% |

Cost optimizations that I drove (see 04-innovation-and-tradeoffs.md):
- Inferentia migration for classifiers: -$8,400/month
- Semantic caching of LLM responses (sketched below): -$12,000/month
- vLLM for self-hosted models: -$15,000/month (50% GPU reduction)
- Intelligent routing (Haiku vs Sonnet): -$18,000/month
- Prompt compression and optimization: -$6,000/month
- Total monthly savings: ~$59,400/month (~$713K/year)
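
For illustration, the core of semantic caching is a similarity check against embeddings of previously answered queries. The threshold and in-memory store below are assumptions; production would back this with the Redis/OpenSearch layers above:

```python
import numpy as np

# Illustrative in-memory store of (normalized embedding, cached response) pairs.
_cache: list[tuple[np.ndarray, str]] = []
SIMILARITY_THRESHOLD = 0.95  # assumed cutoff, tuned per intent in practice

def semantic_lookup(query_embedding: np.ndarray) -> str | None:
    q = query_embedding / np.linalg.norm(query_embedding)
    for cached_emb, cached_response in _cache:
        # Cosine similarity on normalized vectors is a dot product.
        if float(np.dot(q, cached_emb)) >= SIMILARITY_THRESHOLD:
            return cached_response  # near-duplicate question: skip the LLM call
    return None

def semantic_store(query_embedding: np.ndarray, response: str) -> None:
    _cache.append((query_embedding / np.linalg.norm(query_embedding), response))
```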


Technology Decision Framework

For every component, I applied this evaluation matrix:

| Criteria | Weight | How Measured |
|---|---|---|
| Performance | 30% | Benchmark latency (P50, P95, P99), throughput (req/sec) |
| Cost | 25% | $/request, $/month at projected scale |
| Operational Burden | 20% | Setup time, monitoring needs, on-call complexity |
| AWS Integration | 15% | Native service integration, IAM support, VPC compatibility |
| Community & Longevity | 10% | GitHub stars, contributor count, corporate backing, release cadence |
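
Expressed as code, the matrix is just a weighted sum; the example ratings below are hypothetical:

```python
WEIGHTS = {
    "performance": 0.30,
    "cost": 0.25,
    "operational_burden": 0.20,
    "aws_integration": 0.15,
    "community_longevity": 0.10,
}

def weighted_score(ratings: dict[str, float]) -> float:
    """ratings: criterion -> 0-10 score; returns the weighted total (max 10)."""
    return sum(weight * ratings[criterion] for criterion, weight in WEIGHTS.items())

# Hypothetical example: a fully managed option that scores high on integration.
print(weighted_score({
    "performance": 8, "cost": 7, "operational_burden": 9,
    "aws_integration": 10, "community_longevity": 8,
}))  # -> 8.25
```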

This framework is detailed further in 04-innovation-and-tradeoffs.md.