# Detailed Technology Stack - MangaAssist

A complete inventory of every technology choice across the stack, with the rationale for each selection and the alternatives considered.
## How to Use This Document
- Read this file top-to-bottom if you want the full system stack in one place.
- Jump to 02-open-source-libraries.md for deeper discussion of the OSS runtime choices behind inference and serving.
- Jump to 03-mlflow-llm-observability.md or 04-innovation-and-tradeoffs.md if you want tooling and decision-making depth rather than layer-by-layer inventory.
## Stack Overview

```mermaid
graph TB
    subgraph Client["Client Layer"]
        A1[React Chat Widget]
        A2[WebSocket / HTTPS]
        A3[CloudFront CDN]
    end
    subgraph Gateway["Edge & Gateway"]
        B1[Amazon API Gateway]
        B2[AWS WAF]
        B3[Amazon Cognito]
    end
    subgraph Compute["Compute & Orchestration"]
        C1[ECS Fargate - baseline]
        C2[AWS Lambda - burst]
        C3[Step Functions - workflows]
    end
    subgraph Intelligence["AI & ML Layer"]
        D1[Amazon Bedrock - LLM]
        D2[SageMaker - custom models]
        D3[vLLM - self-hosted inference]
        D4[OpenSearch - vector store]
    end
    subgraph Data["Data Layer"]
        E1[DynamoDB - conversations]
        E2[ElastiCache Redis - caching]
        E3[Redshift - analytics]
        E4[Kinesis - streaming]
    end
    subgraph Observability["Observability"]
        F1[MLflow Tracing]
        F2[CloudWatch / X-Ray]
        F3[OpenTelemetry]
        F4[Prometheus / Grafana]
    end
    Client --> Gateway --> Compute --> Intelligence
    Intelligence --> Data
    Compute --> Observability
    Intelligence --> Observability
```
## Layer-by-Layer Breakdown
### 1. Frontend & Client
| Component | Technology | Rationale |
|---|---|---|
| Chat Widget | React (Amazon internal framework) | Company standard; seamless integration with Amazon JP storefront |
| Real-time Communication | WebSocket (primary) + HTTPS REST (fallback) | WebSocket for streaming token-by-token responses; REST fallback for environments that block WS |
| CDN | CloudFront | Global edge caching for static assets; already part of Amazon's infra |
| State Management | React Context + useReducer | Lightweight; no need for Redux given single-widget scope |
**Why not alternatives?**
- Vue/Angular: Amazon's frontend ecosystem is built on React; switching adds integration burden with no benefit
- Server-Sent Events (SSE): WebSocket chosen because we need bidirectional communication (user typing indicators, real-time context updates)
### 2. API Gateway & Edge Security
| Component | Technology | Rationale |
|---|---|---|
| API Gateway | Amazon API Gateway (WebSocket + REST) | Native WebSocket support, built-in throttling, IAM integration |
| Web Application Firewall | AWS WAF | SQL injection, XSS, rate limiting, geo-blocking for Japan-specific deployment |
| Authentication | Amazon Cognito (guest + authenticated) | Supports anonymous browsing with seamless upgrade to authenticated sessions |
| TLS | TLS 1.3 | Mandatory for all traffic; 0-RTT resumption reduces handshake latency |
**Why not alternatives?**
- Kong/Nginx: API Gateway is fully managed; the operational burden of self-hosting is not justified at our scale
- Auth0/Okta: Cognito integrates natively with all AWS services; external auth adds a network hop and a vendor dependency
### 3. Compute & Orchestration
| Component | Technology | Rationale |
|---|---|---|
| Baseline Compute | ECS Fargate | Serverless containers; no EC2 management; auto-scales with demand |
| Burst Compute | AWS Lambda | Sub-second cold starts for lightweight operations (intent classification, cache lookups) |
| Workflow Orchestration | Step Functions | Visual state machines for multi-step chatbot flows; built-in retry/error handling |
| Container Registry | ECR | Standard AWS container registry; scanned for vulnerabilities |
**Scaling model:**
- Normal: ECS Fargate (10-50 tasks, predictable cost)
- Spike: Lambda (0 to 10,000 concurrent in seconds)
- Peak: ECS + Lambda hybrid (cost-optimized: Fargate for base, Lambda for overflow)
**Why not alternatives?**
- EKS (Kubernetes): Overkill for our service count; Fargate gives us container benefits without cluster management
- EC2: Manual scaling and patching; Fargate eliminates this entirely
- Temporal/Airflow: Step Functions is native; simpler for AWS-only workflows
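The Fargate-base / Lambda-overflow split above can be sketched as a simple routing decision. This is an illustrative model only: the capacity constants and function name are assumptions for the example, not production values.

```python
# Sketch of the hybrid scaling decision: keep traffic on the Fargate
# fleet until it saturates, then overflow the remainder to Lambda.
# FARGATE_TASKS and REQUESTS_PER_TASK are illustrative assumptions.
FARGATE_TASKS = 50        # upper bound of the baseline fleet
REQUESTS_PER_TASK = 40    # assumed concurrent requests one task absorbs

def route_request(in_flight: int) -> str:
    """Return which compute tier should take the next request."""
    fargate_capacity = FARGATE_TASKS * REQUESTS_PER_TASK
    return "fargate" if in_flight < fargate_capacity else "lambda"
```

At 100 in-flight requests this routes to Fargate; at 5,000 (above the 2,000-request baseline capacity) it overflows to Lambda.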
### 4. AI & LLM Layer

This is the most critical layer, and the one where most of the technology innovation happened.
| Component | Technology | Rationale |
|---|---|---|
| Primary LLM | Claude 3.5 Sonnet (via Amazon Bedrock) | Best quality-to-latency ratio for conversational AI; native Bedrock integration |
| Lightweight LLM | Claude Haiku (via Bedrock) | 10x cheaper for simple tasks (greetings, order status formatting) |
| Self-Hosted Inference | vLLM on SageMaker endpoints | For fine-tuned models where Bedrock doesn't apply; see 02-open-source-libraries.md |
| Intent Classifier | Fine-tuned DistilBERT on SageMaker | Two-stage: rule-based -> ML classifier; DistilBERT is 60% smaller than BERT with 97% accuracy |
| Hardware Optimization | AWS Inferentia (ml.inf1.xlarge) | 70% cost reduction vs. GPU (ml.g4dn.xlarge) for intent classification after Neuron SDK compilation |
| Embeddings | Amazon Titan Embeddings V2 (via Bedrock) | 1024-dim vectors; optimized for Japanese text; fully managed |
| Reranker | ms-marco-MiniLM cross-encoder on SageMaker | Reranks top-50 retrieval results to top-5; 12x more accurate than embedding similarity alone |
| Vector Store | OpenSearch Serverless (HNSW w/ nmslib) | Serverless eliminates capacity planning; HNSW gives sub-50ms retrieval at 10M+ vectors |
| Guardrails | Amazon Bedrock Guardrails + custom pipeline | 6-stage validation: PII detection, prompt injection defense, content moderation, hallucination check, response length, format validation |
| Model Compilation | Neuron SDK, ONNX, TorchScript | Neuron for Inferentia; ONNX for cross-platform portability; TorchScript for production serialization |
| Recommendations | Amazon Personalize | Collaborative filtering trained on manga browsing/purchase history |
**Why not alternatives?**
- GPT-4: Higher latency, no native AWS integration, data residency concerns for Amazon data
- Open-source LLMs (Llama, Mistral): Evaluated, but Claude 3.5 Sonnet beat them on Japanese language quality; we do use vLLM for self-hosted fine-tuned models
- Pinecone/Weaviate: OpenSearch is already in Amazon's ecosystem with zero egress costs
- FAISS: No serverless option; requires managing infrastructure
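The two-stage intent classifier and the Haiku-vs-Sonnet routing it feeds can be sketched together. This is a minimal illustration of the control flow only: the regex rules, intent names, and model labels are assumptions for the example, and the DistilBERT stage is stubbed out.

```python
import re

# Stage 1: cheap regex rules catch obvious intents before any model runs.
# Patterns and intent names are illustrative, not the production rule set.
RULES = [
    (re.compile(r"\b(hi|hello|konnichiwa)\b", re.I), "greeting"),
    (re.compile(r"\border\s+status\b", re.I), "order_status"),
]
CHEAP_INTENTS = {"greeting", "order_status"}

def classify(text: str) -> str:
    for pattern, intent in RULES:
        if pattern.search(text):
            return intent
    # Stage 2: fall through to the ML classifier (a DistilBERT endpoint
    # in production; stubbed here as a default label).
    return "complex_query"

def pick_model(intent: str) -> str:
    """Route cheap intents to Haiku, everything else to Sonnet."""
    return "claude-haiku" if intent in CHEAP_INTENTS else "claude-3-5-sonnet"
```

A greeting resolves in stage 1 and never touches the expensive model; an open-ended recommendation question falls through to the classifier and is routed to Sonnet.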
### 5. Data Layer
| Component | Technology | Rationale |
|---|---|---|
| Conversation Store | DynamoDB (on-demand mode) | Single-digit ms reads; 24-hour TTL auto-cleans expired sessions |
| Cache Accelerator | DynamoDB DAX | In-memory cache in front of DynamoDB; microsecond reads for hot conversations |
| Distributed Cache | ElastiCache Redis | L2 cache for LLM responses, product data, intent classifications |
| Analytics Warehouse | Amazon Redshift | OLAP queries across conversation logs, metrics, A/B test results |
| Event Streaming | Amazon Kinesis Data Streams | Real-time event pipeline: chat events -> analytics, monitoring, alerting |
| Data Lake | S3 (Parquet format) | Long-term storage for training data, conversation logs, embeddings |
**Caching strategy (3 layers):**
- L1: In-memory (application-level, per-container) -> <1ms, small capacity
- L2: ElastiCache Redis (shared across containers) -> 1-5ms, medium capacity
- L3: DynamoDB DAX (conversation-specific acceleration) -> 1-3ms, large capacity
**Why not alternatives?**
- PostgreSQL/Aurora: DynamoDB's single-digit ms latency at any scale beats RDS for key-value access patterns
- Memcached: Redis supports richer data structures (sorted sets for ranking, pub/sub for real-time updates)
- Snowflake: Redshift Serverless is cheaper within the AWS ecosystem; no data egress
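The three-tier read path can be sketched as a read-through lookup with promotion. In this sketch the L2 and L3 tiers are plain dicts standing in for Redis and DAX; only the control flow (check fast tiers first, promote on hit) is the point, and the class name and capacity are illustrative.

```python
from collections import OrderedDict

class TieredCache:
    """Read-through sketch of the L1/L2/L3 hierarchy described above."""

    def __init__(self, l1_capacity: int = 128):
        self.l1 = OrderedDict()  # per-container in-memory tier (smallest, fastest)
        self.l1_capacity = l1_capacity
        self.l2 = {}             # stands in for ElastiCache Redis
        self.l3 = {}             # stands in for DynamoDB DAX

    def get(self, key):
        for tier in (self.l1, self.l2, self.l3):
            if key in tier:
                value = tier[key]
                self._promote(key, value)  # warm the faster tiers on a hit
                return value
        return None  # full miss: caller falls through to DynamoDB / LLM

    def _promote(self, key, value):
        self.l2[key] = value
        self.l1[key] = value
        self.l1.move_to_end(key)
        if len(self.l1) > self.l1_capacity:
            self.l1.popitem(last=False)  # evict least-recently used from L1
```

A value found only in L3 on the first read is served from L1 on the next, which is what keeps hot conversations at sub-millisecond latency.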
### 6. Observability & Monitoring
| Component | Technology | Rationale |
|---|---|---|
| LLM Tracing | MLflow Tracing | Open-source, OTel-compatible; traces every step of the LLM pipeline; see 03-mlflow-llm-observability.md |
| Distributed Tracing | AWS X-Ray / OpenTelemetry | End-to-end request tracing across all AWS services |
| Metrics & Dashboards | CloudWatch + Prometheus/Grafana | CloudWatch for AWS-native metrics; Grafana for custom ML dashboards |
| Logging | CloudWatch Logs (structured JSON) | Centralized logging with Insights querying |
| Alerting | CloudWatch Alarms -> SNS -> PagerDuty | Tiered alerting: P1 (pages), P2 (Slack), P3 (daily digest) |
| Audit Trail | CloudTrail | Immutable audit log for all API calls; compliance requirement |
**Why MLflow over alternatives?**
- Langfuse/LangSmith: MLflow is fully open-source, self-hosted (no data leaves AWS), and integrates with our existing MLflow experiment tracking
- Datadog LLM Observability: Costly at our scale; vendor lock-in; MLflow gives us the same capabilities at zero license cost
- Detailed comparison in 03-mlflow-llm-observability.md
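The shape of the data that LLM tracing produces, one named span per pipeline stage with its latency, can be shown with a tiny stand-in. MLflow Tracing and OpenTelemetry provide this for real; the context manager and span schema below are a simplified illustration, not either library's API.

```python
import time
from contextlib import contextmanager

# Each stage of a request (retrieval, rerank, LLM call, guardrails)
# is recorded as a named span with its wall-clock latency in ms.
SPANS = []

@contextmanager
def span(name: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        SPANS.append({"name": name, "ms": (time.perf_counter() - start) * 1000})

with span("retrieval"):
    pass  # OpenSearch k-NN query would run here
with span("llm_call"):
    pass  # Bedrock InvokeModel would run here
```

The resulting span list is what dashboards and P95-latency alerts are built on, regardless of which tracing backend records it.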
### 7. Security & Compliance
| Component | Technology | Rationale |
|---|---|---|
| Encryption at Rest | AWS KMS (AES-256) | Managed key rotation; all data encrypted by default |
| Encryption in Transit | TLS 1.3 | End-to-end; certificate managed by ACM |
| Network Isolation | VPC + private subnets | All ML endpoints and databases in private subnets; no public internet access |
| Secrets Management | AWS Secrets Manager | Auto-rotating credentials for all integrations |
| PII Detection | Amazon Comprehend + custom regex | Detects and masks PII before LLM processing |
| IAM | IAM Roles (least privilege) | Per-service roles; no shared credentials; cross-account via STS |
| Compliance | GDPR, CCPA, COPPA, PCI-DSS | Japan-specific data residency in ap-northeast-1 |
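The custom-regex half of the PII pipeline (Comprehend handles entity detection in production) amounts to masking known patterns before text reaches the LLM. The patterns below are deliberately simplified examples, not the production rule set.

```python
import re

# Simplified PII patterns; production uses a larger, audited set plus
# Amazon Comprehend for entity detection.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE_JP": re.compile(r"\b0\d{1,4}-\d{1,4}-\d{3,4}\b"),
}

def mask_pii(text: str) -> str:
    """Replace each detected PII span with a typed placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Masking before LLM processing means the model never sees the raw value, so it cannot echo it back or leak it into logs and traces.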
### 8. Infrastructure & DevOps
| Component | Technology | Rationale |
|---|---|---|
| Infrastructure as Code | AWS CDK (TypeScript) | Imperative IaC; better abstractions than raw CloudFormation |
| CI/CD | AWS CodePipeline + CodeBuild | Native integration; no external CI dependency |
| Container Builds | Docker (multi-stage) | Minimal prod images; separate build/runtime layers |
| Feature Flags | AWS AppConfig | Gradual feature rollouts; instant kill-switches for new LLM behaviors |
| A/B Testing | Custom framework on Kinesis + Redshift | Splits traffic by session; measures conversion lift, CSAT, AI quality metrics |
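Splitting traffic by session, as the A/B framework above does, is typically a deterministic hash of the session ID, so a user stays in the same arm for the whole conversation. A sketch under that assumption (the bucket count and arm names are illustrative):

```python
import hashlib

def assign_arm(session_id: str, treatment_pct: int = 10) -> str:
    """Deterministically bucket a session into treatment or control.

    Hashing the session ID gives a stable bucket in [0, 100), so
    re-evaluating mid-conversation never flips the assignment.
    """
    bucket = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % 100
    return "treatment" if bucket < treatment_pct else "control"
```

Because the assignment is a pure function of the session ID, the split needs no coordination or storage, and downstream Kinesis events can be joined to an arm by recomputing the same hash.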
## Cost Summary (100K conversations/day)
| Component | Monthly Cost | % of Total |
|---|---|---|
| LLM Inference (Bedrock) | $15,000 - $40,000 | 40-50% |
| SageMaker Endpoints (classifiers, rerankers) | $5,000 - $12,000 | 15-20% |
| DynamoDB + DAX | $3,000 - $6,000 | 8-10% |
| OpenSearch Serverless | $2,500 - $5,000 | 7-8% |
| ElastiCache Redis | $1,500 - $3,000 | 4-5% |
| Compute (Fargate + Lambda) | $2,000 - $5,000 | 6-8% |
| Observability (CloudWatch, MLflow infra) | $1,000 - $3,000 | 3-5% |
| Other (S3, Kinesis, CDN, etc.) | $1,000 - $3,000 | 3-5% |
| Total | $31,000 - $77,000 | 100% |
**Cost optimizations that I drove** (see 04-innovation-and-tradeoffs.md):
- Inferentia migration for classifiers: -$8,400/month
- Semantic caching of LLM responses: -$12,000/month
- vLLM for self-hosted models: -$15,000/month (50% GPU reduction)
- Intelligent routing (Haiku vs Sonnet): -$18,000/month
- Prompt compression and optimization: -$6,000/month
- Total monthly savings: ~$59,400/month (~$713K/year)
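The savings figures above sum as stated; a quick arithmetic check (dictionary keys are just labels for this example):

```python
# Per-optimization monthly savings, from the list above.
savings = {
    "inferentia_migration": 8_400,
    "semantic_caching": 12_000,
    "vllm_self_hosting": 15_000,
    "intelligent_routing": 18_000,
    "prompt_compression": 6_000,
}
monthly = sum(savings.values())   # $59,400/month
annual = monthly * 12             # $712,800/year, i.e. ~$713K
```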
## Technology Decision Framework
For every component, I applied this evaluation matrix:
| Criteria | Weight | How Measured |
|---|---|---|
| Performance | 30% | Benchmark latency (P50, P95, P99), throughput (req/sec) |
| Cost | 25% | $/request, $/month at projected scale |
| Operational Burden | 20% | Setup time, monitoring needs, on-call complexity |
| AWS Integration | 15% | Native service integration, IAM support, VPC compatibility |
| Community & Longevity | 10% | GitHub stars, contributor count, corporate backing, release cadence |
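A worked example of applying the matrix: the weights come from the table above, while the per-criterion scores (0-10) for the hypothetical candidate are made up purely to show the computation.

```python
# Weights from the evaluation matrix above (they sum to 1.0).
WEIGHTS = {
    "performance": 0.30,
    "cost": 0.25,
    "operational_burden": 0.20,
    "aws_integration": 0.15,
    "community": 0.10,
}

def weighted_score(scores: dict) -> float:
    """Combine 0-10 per-criterion scores into a single weighted score."""
    assert set(scores) == set(WEIGHTS), "score every criterion exactly once"
    return sum(scores[k] * WEIGHTS[k] for k in WEIGHTS)

# Hypothetical candidate scores, for illustration only.
opensearch = {"performance": 8, "cost": 7, "operational_burden": 9,
              "aws_integration": 10, "community": 8}
score = weighted_score(opensearch)  # 8.25 out of 10
```

Scoring each candidate the same way makes the comparison auditable: a disagreement about a choice becomes a disagreement about a specific score or weight.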
This framework is detailed further in 04-innovation-and-tradeoffs.md.