HLD Deep Dive: Scalability, Performance & Cost
Questions covered: Q19, Q26, Q27, Q35
Interviewer level: Staff Engineer → Principal Engineer
Q19. Handling a traffic spike (major manga release)
Short Answer
API Gateway throttling + Lambda auto-scaling + DynamoDB on-demand + Bedrock provisioned throughput + pre-warmed caches.
Deep Dive
Scenario: New One Piece omnibus announced, manga.amazon.co.jp gets 50x normal traffic at 12:00 AM JST
Normal traffic baseline:
~500 concurrent users
~5,000 messages/minute
~50 LLM calls/minute (~1% of messages need a fresh LLM call; the rest hit templates or cache)
Spike scenario:
~25,000 concurrent users (50x spike)
~250,000 messages/minute
~2,500 LLM calls/minute
Duration: ~2 hours
Each layer's response to the spike:
Layer 1: API Gateway — Rate Throttling
Per-user limit: 30 messages/minute (unchanged; prevents single-user abuse; enforcement sketch below)
Account-level limit: a fixed regional quota (default ~10,000 req/s), so request an increase ahead of a planned spike rather than assuming AWS scales it automatically
Lambda integration: API Gateway triggers Lambda asynchronously — no blocking
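The per-user limit is easiest to enforce in the application tier, since API Gateway usage plans don't apply to WebSocket APIs. A minimal sketch, assuming the same async redis client used elsewhere in this doc (the key scheme is illustrative):

import time

# Per-user fixed-window limit (30 msg/min), enforced in the orchestrator
async def allow_message(user_id: str, limit: int = 30) -> bool:
    window = int(time.time() // 60)      # current 1-minute window
    key = f"ratelimit:{user_id}:{window}"
    count = await redis.incr(key)        # atomic per-user counter
    if count == 1:
        await redis.expire(key, 60)      # window cleans itself up
    return count <= limit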
Layer 2: ECS Fargate Orchestrator — Auto-scaling
# ECS Service Auto Scaling policy
TargetTrackingScalingPolicy:
  MetricType: ECSServiceAverageCPUUtilization
  TargetValue: 60.0       # Scale out when CPU > 60%
  ScaleOutCooldown: 60    # seconds
  ScaleInCooldown: 300    # seconds (slower to scale in — avoid thrashing)
MinCapacity: 10           # Always-on baseline
MaxCapacity: 200          # Upper bound during spike
ECS scaling speed: ~1–2 minutes to bring a new task online. May miss the very first wave of a sudden spike.
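The same policy, expressed as Application Auto Scaling API calls (a sketch; cluster and service names are placeholders):

import boto3

aas = boto3.client("application-autoscaling")

# Register the ECS service as a scalable target
aas.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/manga-chatbot-cluster/orchestrator",  # placeholder
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=10,
    MaxCapacity=200,
)

# Attach the target-tracking policy from the YAML above
aas.put_scaling_policy(
    PolicyName="orchestrator-cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId="service/manga-chatbot-cluster/orchestrator",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
)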
Layer 3: Lambda Burst Workers — Instant elasticity
Lambda concurrency scales in seconds (not minutes like ECS).
The burst worker Lambda is triggered when Orchestrator queue depth exceeds threshold.
burst_concurrency = min(10 * spike_factor, 1000) # AWS default limit
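One way to wire that trigger, assuming the orchestrator backlog sits in SQS (queue URL, threshold, and function name are placeholders):

import boto3

sqs = boto3.client("sqs")
lam = boto3.client("lambda")

QUEUE_URL = "https://sqs.ap-northeast-1.amazonaws.com/123456789012/orchestrator-queue"  # placeholder
DEPTH_THRESHOLD = 1_000  # messages; tune to taste

def maybe_trigger_burst_workers():
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    depth = int(attrs["Attributes"]["ApproximateNumberOfMessages"])
    if depth <= DEPTH_THRESHOLD:
        return
    spike_factor = depth // DEPTH_THRESHOLD
    burst_concurrency = min(10 * spike_factor, 1000)  # AWS default account limit
    for _ in range(burst_concurrency):
        # Fire-and-forget; each worker drains a slice of the backlog
        lam.invoke(FunctionName="manga-chatbot-burst-worker",  # placeholder name
                   InvocationType="Event")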
Layer 4: DynamoDB — On-demand capacity
DynamoDB on-demand:
- Pay per read/write request actually consumed (no provisioning)
- Automatically handles 2x previous peak instantly
- Handles larger spikes with ~30-minute warm-up
- No capacity planning needed
For the One Piece spike:
Previous peak: ~50,000 writes/minute
Spike: ~2,500,000 writes/minute (50x)
That far exceeds the ~2x instant headroom, so the table must be pre-warmed before the release (see the sketch below).
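Because the instant headroom is only ~2x, a common pre-warming trick is to reset the table's "previous peak" ahead of time: temporarily switch to provisioned capacity at the target level, then switch back to on-demand. A sketch (table name is a placeholder):

import boto3

ddb = boto3.client("dynamodb")

def pre_warm_table(table_name: str, target_wcu: int, target_rcu: int):
    # Provisioning at the target level establishes a new "previous peak";
    # reverting to on-demand afterwards keeps that headroom.
    # Note: billing mode can only be switched once per 24h, so plan the revert.
    ddb.update_table(
        TableName=table_name,
        BillingMode="PROVISIONED",
        ProvisionedThroughput={
            "ReadCapacityUnits": target_rcu,
            "WriteCapacityUnits": target_wcu,
        },
    )
    # ...wait for the table to return to ACTIVE, then revert:
    # ddb.update_table(TableName=table_name, BillingMode="PAY_PER_REQUEST")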
Layer 5: Bedrock — Provisioned Throughput
Default (on-demand) Bedrock:
- Shared capacity with other AWS customers
- May throttle during global spikes
Provisioned Throughput (Bedrock):
- Reserve N Model Units (MU) = guaranteed LLM capacity
- Each MU provides a fixed tokens-per-minute throughput for the chosen model (exact numbers are published per model by AWS)
- Cost: Charged per hour regardless of usage
Strategy for spike:
1. Pre-purchase provisioned throughput 24h before major release
2. Fallback: on-demand for overflow
3. Model tiering: route simpler intents to Claude Haiku (12x cheaper, with far more on-demand headroom). A sketch of the step-2 fallback follows.
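Provisioned throughput is invoked by passing its ARN as modelId; on throttling, retry against the shared on-demand pool (the ARN below is a placeholder):

import json
import boto3
from botocore.exceptions import ClientError

bedrock = boto3.client("bedrock-runtime")

PROVISIONED_ARN = "arn:aws:bedrock:ap-northeast-1:123456789012:provisioned-model/abc123"  # placeholder
ON_DEMAND_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"

def invoke_with_fallback(body: dict) -> dict:
    for model_id in (PROVISIONED_ARN, ON_DEMAND_ID):
        try:
            resp = bedrock.invoke_model(modelId=model_id, body=json.dumps(body))
            return json.loads(resp["body"].read())
        except ClientError as e:
            # Only fall back on throttling (reserved capacity exhausted)
            if e.response["Error"]["Code"] != "ThrottlingException":
                raise
    raise RuntimeError("Both provisioned and on-demand capacity throttled")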
Layer 6: Cache Pre-warming
import json
from datetime import datetime

async def pre_warm_for_release(asin: str, release_datetime: datetime):
    """
    Called 1 hour before a major manga release.
    Populates cache with product data that will be in high demand.
    """
    # Pre-fetch and cache product details
    product_data = await catalog.get_product(asin)
    await redis.setex(f"product:{asin}", 7200, json.dumps(product_data))  # 2hr TTL

    # Pre-compute and cache top recommendations for this product
    similar = await personalize.get_similar_items(asin, num_results=20)
    await redis.setex(f"similar:{asin}", 3600, json.dumps(similar))

    # Pre-cache the FAQ for this product
    product_faq = await knowledge_base.get_product_faq(asin)
    await redis.setex(f"faq:product:{asin}", 7200, json.dumps(product_faq))

    logger.info(f"Pre-warmed cache for {asin} release at {release_datetime}")
Expected cache hit rate during One Piece spike: ~85% (most queries are about the new volume specifically → high hit rate for pre-warmed data).
Q26. How to reduce end-to-end latency from ~2s to <1s?
Short Answer
Model tiering + query caching + speculative execution + provisioned concurrency + token streaming. Attack every layer in parallel.
Deep Dive
Current latency breakdown (p50, ~2s total):
Auth / session load: ~50ms
Intent classification: ~20ms
DynamoDB memory load: ~10ms (Redis cache hit)
Service data fetch: ~200ms (parallel fan-out)
LLM generation: ~1,500ms ← DOMINATES
Guardrails validation: ~100ms
Response format: ~10ms
Network / serialization: ~50ms
─────────────────────────────────
Total: ~1,940ms
Optimization 1: Model tiering (highest impact)
# Route simple intents to a 3x faster model
FAST_INTENTS = {"chitchat", "simple_faq", "order_tracking_template"}
SMART_INTENTS = {"recommendation", "complex_product_question"}

model_id = (
    "anthropic.claude-3-haiku"          # ~300-400ms
    if intent in FAST_INTENTS
    else "anthropic.claude-3-5-sonnet"  # ~1,500ms
)
Optimization 2: Semantic query caching
async def get_or_generate_response(query: str, context_hash: str) -> str:
    # Cache key is based on semantic similarity, not exact text:
    # "What's your return policy?" and "Can I return something?" → same answer.
    # context_hash scopes lookups so answers are only reused in equivalent contexts.
    query_embedding = await embed(query)
    cache_key = find_similar_cached_query(query_embedding, context_hash, threshold=0.95)
    if cache_key:
        return await redis.get(cache_key)  # ~1ms instead of ~1,500ms

    response = await llm.generate(query)
    # Cache the response, keyed by (query embedding, context hash)
    await cache_response(query_embedding, context_hash, response, ttl=3600)
    return response
Optimization 3: Speculative execution / parallel prefetch
import asyncio

async def process_message_with_speculation(message: str, session_id: str):
    # Start BOTH intent classification AND RAG retrieval simultaneously
    # (even before we know the intent — speculative)
    intent_task = asyncio.create_task(classifier.classify(message))
    rag_task = asyncio.create_task(rag.retrieve(message, top_k=5))  # Speculative

    # Get intent first (fast)
    intent = await intent_task
    if intent in RAG_NEEDED_INTENTS:
        # RAG was already running in parallel — use the result
        rag_results = await rag_task  # Saved ~200ms vs sequential
    else:
        # Intent doesn't need RAG — cancel the prefetch task
        rag_task.cancel()
        rag_results = None
    return intent, rag_results
Optimization 4: Lambda provisioned concurrency
# Eliminate cold starts for the burst Lambda workers
# Cold start: ~300ms (Python Lambda with large dependencies)
# With provisioned concurrency: ~5ms
aws lambda put-provisioned-concurrency-config \
--function-name manga-chatbot-orchestrator \
--qualifier PROD \
--provisioned-concurrent-executions 50
Optimization 5: Token streaming (perceived latency)
Without streaming: User waits 1,500ms, then sees full response.
With streaming: User sees first token in ~300ms (time-to-first-token).
Even though total generation takes the same time,
perceived responsiveness is much higher.
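A minimal streaming sketch against Bedrock's response-stream API (Anthropic messages chunk format; the WebSocket send callback is assumed):

import json
import boto3

bedrock = boto3.client("bedrock-runtime")

async def stream_reply(prompt: str, send_to_client):
    # Chunks arrive as the model generates, so the first token reaches the
    # user in ~300ms instead of waiting ~1,500ms for the full response.
    # (Sync boto3 call shown for brevity; offload to a thread in production.)
    resp = bedrock.invoke_model_with_response_stream(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 500,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )
    for event in resp["body"]:
        chunk = json.loads(event["chunk"]["bytes"])
        if chunk.get("type") == "content_block_delta":
            await send_to_client(chunk["delta"]["text"])  # relay token over WebSocket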
Optimization 6: Pre-compute embeddings for common queries
import hashlib
import json

COMMON_QUERIES = [
    "what is the return policy",
    "where is my order",
    "do you have gift wrapping",
    "what payment methods do you accept",
    # ... top 100 most frequent queries
]

# Run offline as a scheduled job
async def precompute_common_query_embeddings():
    for query in COMMON_QUERIES:
        embedding = await embed(query)
        response = await rag.retrieve_and_generate(query)
        # sha256 gives a stable key; the built-in hash() is randomized per process
        key = hashlib.sha256(query.encode()).hexdigest()
        await redis.setex(f"precomputed:{key}", 86400, json.dumps({
            "embedding": embedding,
            "response": response,
        }))
Summary of latency gains:
| Optimization | Before | After | Savings |
|---|---|---|---|
| Model tiering (60% of traffic) | 1,500ms LLM | 400ms LLM | ~1,100ms |
| Query caching (30% of traffic) | 1,940ms total | <10ms total | ~1,930ms |
| Speculative RAG prefetch | Sequential 200ms | Parallel (free) | ~200ms |
| Provisioned concurrency | 300ms cold start | ~5ms | ~300ms |
| Token streaming | 1,940ms perceived | ~300ms first token | Huge perceived gain |
Q35. Infrastructure cost estimation: 100K vs 10M conversations/day
Short Answer
100K conversations/day: ~$55K–$75K/month. 10M conversations/day: ~$6M/month unoptimized, ~$1.5M/month with optimization. LLM costs dominate; the strategies below cut them by 70–80%.
Deep Dive
Assumptions:
- 1 conversation = 5 turns average
- 1 turn = 1 user message + 1 assistant response
- Input tokens per turn: ~1,500 (conversation history + context + message)
- Output tokens per turn: ~300
- 40% of turns require LLM (others use templates or cache)
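The arithmetic below can be sanity-checked with a small helper (token counts and prices per the assumptions above; Sonnet at $3/$15 per 1M input/output tokens):

def llm_cost_per_day(calls: int, in_tok: int = 1_500, out_tok: int = 300,
                     in_price: float = 3.0, out_price: float = 15.0) -> float:
    """Daily LLM cost in dollars; prices are per 1M tokens."""
    return calls * (in_tok * in_price + out_tok * out_price) / 1_000_000

llm_cost_per_day(200_000)  # -> 1800.0 ($/day, matching the breakdown below)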
At 100K conversations/day:
LLM (Bedrock Claude 3.5 Sonnet):
Turns needing LLM: 100K × 5 × 40% = 200,000 LLM calls/day
Input tokens: 200,000 × 1,500 = 300M tokens/day
Output tokens: 200,000 × 300 = 60M tokens/day
Input cost: 300M × $3/1M = $900/day
Output cost: 60M × $15/1M = $900/day
LLM total: $1,800/day = ~$54,000/month
ECS Fargate (Orchestrator):
10-task baseline × 1 vCPU × $0.04048/vCPU-hour × 24h = ~$10/day
~$300/month
DynamoDB (Conversation Memory):
100K sessions × 5 writes/session = 500K write requests/day
On-demand: $1.25 per million write request units → ~$0.63/day → ~$19/month
OpenSearch Serverless:
Minimum 2 OCU × $0.24/OCU-hour × 24h × 30 = ~$346/month
ElastiCache Redis (2 nodes, cache.r6g.large):
~$200/month
API Gateway (WebSocket):
~500K connection-minutes/day × $0.25/1M + ~1M messages/day × $1.00/1M ≈ $1/day → ~$35/month
Lambda (Burst Workers):
Minimal — only on overflow. ~$50/month
CloudWatch + Kinesis + SageMaker (Classifier):
~$500/month combined
─────────────────────────────────────────────────────────────────
Total at 100K conversations/day: ~$55,000–75,000/month
(LLM cost is ~75% of total)
At 10M conversations/day (100x scale):
LLM cost scales linearly: ~$5.4M/month (dominates)
Infrastructure scales sub-linearly thanks to:
- Better cache hit rates at scale
- Reserved capacity pricing
- Model tiering
Total (unoptimized): ~$6M/month
Total (with optimization): ~$1.5M/month
Cost optimization strategies at scale:
1. Model tiering (saves ~50% of LLM cost):
Route 60% of LLM calls to Claude Haiku:
Haiku: $0.25/1M input, $1.25/1M output (12x cheaper than Sonnet)
Per-call cost: Sonnet = 1,500 × $3/1M + 300 × $15/1M = $0.009; Haiku = $0.00075
Before: 200,000 calls/day × $0.009 (all Sonnet) = $1,800/day
After: 120,000 Haiku × $0.00075 + 80,000 Sonnet × $0.009
     = $90 + $720 ≈ $810/day (~55% reduction)
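Checked with the helper from the sanity-check above (Haiku at $0.25/$1.25 per 1M tokens):

haiku  = llm_cost_per_day(120_000, in_price=0.25, out_price=1.25)  # -> 90.0
sonnet = llm_cost_per_day(80_000)                                  # -> 720.0
haiku + sonnet  # -> 810.0 $/day vs. 1,800 before (~55% reduction)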
2. Semantic caching (saves ~30% by avoiding repeat LLM calls):
Assume 30% of queries are semantically similar to a cached result:
200,000 calls/day × 30% cache hit = 60,000 calls avoided
Savings: 60,000 × ~$0.009/call = $540/day
3. Response length limits:
MAX_TOKENS = {
    "recommendation": 500,    # List of 3-5 products
    "faq": 300,               # Concise answer
    "product_question": 400,  # Moderate detail
}
# Reduces output token cost by ~40%
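Applied at call time (a sketch; `llm.generate` is the same assumed client used elsewhere in this doc):

response = await llm.generate(
    prompt,
    max_tokens=MAX_TOKENS.get(intent, 300),  # fall back to the concise limit
)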
4. Bedrock provisioned throughput (saves ~20% at scale):
At 10M conversations/day:
Purchase reserved throughput for baseline capacity
~20% discount vs. on-demand pricing
Cost per conversation:
Unoptimized: ~$6M/month ÷ (10M/day × 30 days) ≈ $0.02/conversation
After optimization: ~$1.5M ÷ 300M ≈ $0.005/conversation
Customer support agent cost: ~$3–7/interaction
→ Even unoptimized, the chatbot is over 100x cheaper than a human agent
Q27. Multi-storefront design (JP Manga, Comics, Kindle)
Short Answer
Multi-tenant with store_id parameter. Shared infrastructure, partitioned knowledge bases, templated system prompts.
Deep Dive
What's partitioned per store:
store_id: "manga_jp" | "comics_us" | "kindle"
Partitioned:
✅ RAG knowledge base → separate OpenSearch index per store
✅ System prompts → templated, store-specific tone/policies
✅ Product catalog queries → scoped to store's category/marketplace
✅ Recommendation model → shared or per-store depending on overlap
Shared (costs split across stores):
✅ Orchestrator (single service, routes by store_id)
✅ API Gateway + Auth
✅ DynamoDB conversation memory (session_id includes store_id prefix)
✅ Lambda/ECS compute
✅ CloudWatch monitoring
Routing by store_id:
from dataclasses import dataclass

@dataclass(frozen=True)
class StoreConfig:
    rag_index: str
    system_prompt: str
    recommendation_campaign: str
    catalog_category: str
    locale: str

STORE_CONFIGS = {
    "manga_jp": StoreConfig(
        rag_index="manga-jp-knowledge-base",
        system_prompt=MANGA_JP_SYSTEM_PROMPT,
        recommendation_campaign="arn:aws:personalize:.../manga-jp",
        catalog_category="manga",
        locale="ja-JP",
    ),
    "comics_us": StoreConfig(
        rag_index="comics-us-knowledge-base",
        system_prompt=COMICS_US_SYSTEM_PROMPT,
        recommendation_campaign="arn:aws:personalize:.../comics-us",
        catalog_category="comics",
        locale="en-US",
    ),
}
async def handle_request(store_id: str, message: str, session_id: str):
    config = STORE_CONFIGS[store_id]

    # All downstream calls use store-specific config
    recommendations = await personalize.get_recommendations(
        campaign_arn=config.recommendation_campaign,
        ...
    )
    rag_context = await rag.retrieve(
        index=config.rag_index,
        query=message
    )
    response = await llm.generate(
        system_prompt=config.system_prompt,
        ...
    )
    return response
System prompt templating:
SYSTEM_PROMPT_TEMPLATE = """
You are {store_name}, Amazon's AI shopping assistant for {category}.
Your tone is {tone}.

You help customers with:
- Discovering {category} based on their preferences
- Answering questions about {category} products
- Tracking orders and handling returns per {marketplace} policies

IMPORTANT: Only reference products available in {marketplace}.
Respond in {language}.
"""

MANGA_JP_SYSTEM_PROMPT = SYSTEM_PROMPT_TEMPLATE.format(
    store_name="MangaAssist",
    category="manga",
    tone="enthusiastic and knowledgeable",
    marketplace="Amazon Japan",
    language="Japanese",
)
Cost allocation:
Each store's usage is tagged with store_id:
CloudWatch metrics: { "StoreId": "manga_jp", "MetricName": "LLMCalls" }
Kinesis events: { "store_id": "manga_jp", ... }
Monthly chargeback to each business unit:
Shared infrastructure costs split by usage percentage
LLM costs charged directly to each store's P&L
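Emitting the per-store metrics is straightforward with CloudWatch dimensions (a sketch; the namespace is a placeholder):

import boto3

cloudwatch = boto3.client("cloudwatch")

def record_llm_call(store_id: str, tokens: int):
    cloudwatch.put_metric_data(
        Namespace="MangaChatbot",  # placeholder namespace
        MetricData=[
            {
                "MetricName": "LLMCalls",
                "Dimensions": [{"Name": "StoreId", "Value": store_id}],
                "Value": 1,
                "Unit": "Count",
            },
            {
                "MetricName": "LLMTokens",
                "Dimensions": [{"Name": "StoreId", "Value": store_id}],
                "Value": tokens,
                "Unit": "Count",
            },
        ],
    )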