HLD Deep Dive: Scalability, Performance & Cost
Questions covered: Q19, Q26, Q27, Q35
Interviewer level: Staff Engineer → Principal Engineer
Q19. Handling a traffic spike (major manga release)
Short Answer
API Gateway throttling + Lambda auto-scaling + DynamoDB on-demand + Bedrock provisioned throughput + pre-warmed caches.
Deep Dive
Scenario: New One Piece omnibus announced, manga.amazon.co.jp gets 50x normal traffic at 12:00 AM JST
Normal traffic baseline:
~500 concurrent users
~5,000 messages/minute
~50 LLM calls/minute (~1% of messages need a fresh LLM call; the rest hit templates or cache)
Spike scenario:
~25,000 concurrent users (50x spike)
~250,000 messages/minute
~2,500 LLM calls/minute
Duration: ~2 hours
Each layer's response to the spike:
Layer 1: API Gateway — Rate Throttling
Per-user limit: 30 messages/minute (unchanged; prevents single-user abuse; enforcement sketch below)
Account-level limit: a fixed regional quota (default ~10,000 req/s), so request an increase ahead of a planned spike rather than assuming AWS scales it automatically
Lambda integration: API Gateway triggers Lambda asynchronously — no blocking
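The per-user limit is easiest to enforce in the application tier, since API Gateway usage plans don't apply to WebSocket APIs. A minimal sketch, assuming the same async redis client used elsewhere in this doc (the key scheme is illustrative):

import time

# Per-user fixed-window limit (30 msg/min), enforced in the orchestrator
async def allow_message(user_id: str, limit: int = 30) -> bool:
    window = int(time.time() // 60)      # current 1-minute window
    key = f"ratelimit:{user_id}:{window}"
    count = await redis.incr(key)        # atomic per-user counter
    if count == 1:
        await redis.expire(key, 60)      # window cleans itself up
    return count <= limit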
Layer 2: ECS Fargate Orchestrator — Auto-scaling
# ECS Service Auto Scaling policy
TargetTrackingScalingPolicy:
  MetricType: ECSServiceAverageCPUUtilization
  TargetValue: 60.0       # Scale out when CPU > 60%
  ScaleOutCooldown: 60    # seconds
  ScaleInCooldown: 300    # seconds (slower to scale in — avoid thrashing)
MinCapacity: 10           # Always-on baseline
MaxCapacity: 200          # Upper bound during spike
ECS scaling speed: ~1–2 minutes to bring a new task online. May miss the very first wave of a sudden spike.
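The same policy, expressed as Application Auto Scaling API calls (a sketch; cluster and service names are placeholders):

import boto3

aas = boto3.client("application-autoscaling")

# Register the ECS service as a scalable target
aas.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/manga-chatbot-cluster/orchestrator",  # placeholder
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=10,
    MaxCapacity=200,
)

# Attach the target-tracking policy from the YAML above
aas.put_scaling_policy(
    PolicyName="orchestrator-cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId="service/manga-chatbot-cluster/orchestrator",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
)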
Layer 3: Lambda Burst Workers — Instant elasticity
Lambda concurrency scales in seconds (not minutes like ECS).
The burst worker Lambda is triggered when Orchestrator queue depth exceeds threshold.
burst_concurrency = min(10 * spike_factor, 1000) # AWS default limit
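One way to wire that trigger, assuming the orchestrator backlog sits in SQS (queue URL, threshold, and function name are placeholders):

import boto3

sqs = boto3.client("sqs")
lam = boto3.client("lambda")

QUEUE_URL = "https://sqs.ap-northeast-1.amazonaws.com/123456789012/orchestrator-queue"  # placeholder
DEPTH_THRESHOLD = 1_000  # messages; tune to taste

def maybe_trigger_burst_workers():
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    depth = int(attrs["Attributes"]["ApproximateNumberOfMessages"])
    if depth <= DEPTH_THRESHOLD:
        return
    spike_factor = depth // DEPTH_THRESHOLD
    burst_concurrency = min(10 * spike_factor, 1000)  # AWS default account limit
    for _ in range(burst_concurrency):
        # Fire-and-forget; each worker drains a slice of the backlog
        lam.invoke(FunctionName="manga-chatbot-burst-worker",  # placeholder name
                   InvocationType="Event")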
Layer 4: DynamoDB — On-demand capacity
DynamoDB on-demand:
- Pay per read/write request actually consumed (no provisioning)
- Automatically handles 2x previous peak instantly
- Handles larger spikes with ~30-minute warm-up
- No capacity planning needed
For the One Piece spike:
Previous peak: ~50,000 writes/minute
Spike: ~2,500,000 writes/minute (50x)
That far exceeds the ~2x instant headroom, so the table must be pre-warmed before the release (see the sketch below).
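Because the instant headroom is only ~2x, a common pre-warming trick is to reset the table's "previous peak" ahead of time: temporarily switch to provisioned capacity at the target level, then switch back to on-demand. A sketch (table name is a placeholder):

import boto3

ddb = boto3.client("dynamodb")

def pre_warm_table(table_name: str, target_wcu: int, target_rcu: int):
    # Provisioning at the target level establishes a new "previous peak";
    # reverting to on-demand afterwards keeps that headroom.
    # Note: billing mode can only be switched once per 24h, so plan the revert.
    ddb.update_table(
        TableName=table_name,
        BillingMode="PROVISIONED",
        ProvisionedThroughput={
            "ReadCapacityUnits": target_rcu,
            "WriteCapacityUnits": target_wcu,
        },
    )
    # ...wait for the table to return to ACTIVE, then revert:
    # ddb.update_table(TableName=table_name, BillingMode="PAY_PER_REQUEST")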
Layer 5: Bedrock — Provisioned Throughput
Default (on-demand) Bedrock:
- Shared capacity with other AWS customers
- May throttle during global spikes
Provisioned Throughput (Bedrock):
- Reserve N Model Units (MU) = guaranteed LLM capacity
- Each MU provides a fixed tokens-per-minute throughput for the chosen model (exact numbers are published per model by AWS)
- Cost: Charged per hour regardless of usage
Strategy for spike:
1. Pre-purchase provisioned throughput 24h before major release
2. Fallback: on-demand for overflow
3. Model tiering: route simpler intents to Claude Haiku (12x cheaper, with far more on-demand headroom). A sketch of the step-2 fallback follows.
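Provisioned throughput is invoked by passing its ARN as modelId; on throttling, retry against the shared on-demand pool (the ARN below is a placeholder):

import json
import boto3
from botocore.exceptions import ClientError

bedrock = boto3.client("bedrock-runtime")

PROVISIONED_ARN = "arn:aws:bedrock:ap-northeast-1:123456789012:provisioned-model/abc123"  # placeholder
ON_DEMAND_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"

def invoke_with_fallback(body: dict) -> dict:
    for model_id in (PROVISIONED_ARN, ON_DEMAND_ID):
        try:
            resp = bedrock.invoke_model(modelId=model_id, body=json.dumps(body))
            return json.loads(resp["body"].read())
        except ClientError as e:
            # Only fall back on throttling (reserved capacity exhausted)
            if e.response["Error"]["Code"] != "ThrottlingException":
                raise
    raise RuntimeError("Both provisioned and on-demand capacity throttled")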
Layer 6: Cache Pre-warming
import json
from datetime import datetime

async def pre_warm_for_release(asin: str, release_datetime: datetime):
    """
    Called 1 hour before a major manga release.
    Populates cache with product data that will be in high demand.
    """
    # Pre-fetch and cache product details
    product_data = await catalog.get_product(asin)
    await redis.setex(f"product:{asin}", 7200, json.dumps(product_data))  # 2hr TTL

    # Pre-compute and cache top recommendations for this product
    similar = await personalize.get_similar_items(asin, num_results=20)
    await redis.setex(f"similar:{asin}", 3600, json.dumps(similar))

    # Pre-cache the FAQ for this product
    product_faq = await knowledge_base.get_product_faq(asin)
    await redis.setex(f"faq:product:{asin}", 7200, json.dumps(product_faq))

    logger.info(f"Pre-warmed cache for {asin} release at {release_datetime}")
Expected cache hit rate during One Piece spike: ~85% (most queries are about the new volume specifically → high hit rate for pre-warmed data).
Q26. How to reduce end-to-end latency from ~2s to <1s?
Short Answer
Model tiering + query caching + speculative execution + provisioned concurrency + token streaming. Attack every layer in parallel.
Deep Dive
Current latency breakdown (p50, ~2s total):
Auth / session load: ~50ms
Intent classification: ~20ms
DynamoDB memory load: ~10ms (Redis cache hit)
Service data fetch: ~200ms (parallel fan-out)
LLM generation: ~1,500ms ← DOMINATES
Guardrails validation: ~100ms
Response format: ~10ms
Network / serialization: ~50ms
─────────────────────────────────
Total: ~1,940ms
Optimization 1: Model tiering (highest impact)
# Route simple intents to a 3x faster model
FAST_INTENTS = {"chitchat", "simple_faq", "order_tracking_template"}
SMART_INTENTS = {"recommendation", "complex_product_question"}

model_id = (
    "anthropic.claude-3-haiku"          # ~300-400ms
    if intent in FAST_INTENTS
    else "anthropic.claude-3-5-sonnet"  # ~1,500ms
)
Optimization 2: Semantic query caching
async def get_or_generate_response(query: str, context_hash: str) -> str:
    # Cache key is based on semantic similarity, not exact text:
    # "What's your return policy?" and "Can I return something?" → same answer.
    # context_hash scopes lookups so answers are only reused in equivalent contexts.
    query_embedding = await embed(query)
    cache_key = find_similar_cached_query(query_embedding, context_hash, threshold=0.95)
    if cache_key:
        return await redis.get(cache_key)  # ~1ms instead of ~1,500ms

    response = await llm.generate(query)
    # Cache the response, keyed by (query embedding, context hash)
    await cache_response(query_embedding, context_hash, response, ttl=3600)
    return response
Optimization 3: Speculative execution / parallel prefetch
import asyncio

async def process_message_with_speculation(message: str, session_id: str):
    # Start BOTH intent classification AND RAG retrieval simultaneously
    # (even before we know the intent — speculative)
    intent_task = asyncio.create_task(classifier.classify(message))
    rag_task = asyncio.create_task(rag.retrieve(message, top_k=5))  # Speculative

    # Get intent first (fast)
    intent = await intent_task
    if intent in RAG_NEEDED_INTENTS:
        # RAG was already running in parallel — use the result
        rag_results = await rag_task  # Saved ~200ms vs sequential
    else:
        # Intent doesn't need RAG — cancel the prefetch task
        rag_task.cancel()
        rag_results = None
    return intent, rag_results
Optimization 4: Lambda provisioned concurrency
# Eliminate cold starts for the burst Lambda workers
# Cold start: ~300ms (Python Lambda with large dependencies)
# With provisioned concurrency: ~5ms
aws lambda put-provisioned-concurrency-config \
--function-name manga-chatbot-orchestrator \
--qualifier PROD \
--provisioned-concurrent-executions 50
Optimization 5: Token streaming (perceived latency)
Without streaming: User waits 1,500ms, then sees full response.
With streaming: User sees first token in ~300ms (time-to-first-token).
Even though total generation takes the same time,
perceived responsiveness is much higher.
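A minimal streaming sketch against Bedrock's response-stream API (Anthropic messages chunk format; the WebSocket send callback is assumed):

import json
import boto3

bedrock = boto3.client("bedrock-runtime")

async def stream_reply(prompt: str, send_to_client):
    # Chunks arrive as the model generates, so the first token reaches the
    # user in ~300ms instead of waiting ~1,500ms for the full response.
    # (Sync boto3 call shown for brevity; offload to a thread in production.)
    resp = bedrock.invoke_model_with_response_stream(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 500,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )
    for event in resp["body"]:
        chunk = json.loads(event["chunk"]["bytes"])
        if chunk.get("type") == "content_block_delta":
            await send_to_client(chunk["delta"]["text"])  # relay token over WebSocket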
Optimization 6: Pre-compute embeddings for common queries
import hashlib
import json

COMMON_QUERIES = [
    "what is the return policy",
    "where is my order",
    "do you have gift wrapping",
    "what payment methods do you accept",
    # ... top 100 most frequent queries
]

# Run offline as a scheduled job
async def precompute_common_query_embeddings():
    for query in COMMON_QUERIES:
        embedding = await embed(query)
        response = await rag.retrieve_and_generate(query)
        # sha256 gives a stable key; the built-in hash() is randomized per process
        key = hashlib.sha256(query.encode()).hexdigest()
        await redis.setex(f"precomputed:{key}", 86400, json.dumps({
            "embedding": embedding,
            "response": response,
        }))
Summary of latency gains:
| Optimization | Before | After | Savings |
|---|---|---|---|
| Model tiering (60% of traffic) | 1,500ms LLM | 400ms LLM | ~1,100ms |
| Query caching (30% of traffic) | 1,940ms total | <10ms total | ~1,930ms |
| Speculative RAG prefetch | Sequential 200ms | Parallel (free) | ~200ms |
| Provisioned concurrency | 300ms cold start | ~5ms | ~300ms |
| Token streaming | 1,940ms perceived | ~300ms first token | Huge perceived gain |
Q35. Infrastructure cost estimation: 100K vs 10M conversations/day
Short Answer
100K conversations/day: ~$55K–$75K/month. 10M conversations/day: ~$6M/month unoptimized, ~$1.5M/month with optimization. LLM costs dominate; the strategies below cut them by 70–80%.
Deep Dive
Assumptions:
- 1 conversation = 5 turns average
- 1 turn = 1 user message + 1 assistant response
- Input tokens per turn: ~1,500 (conversation history + context + message)
- Output tokens per turn: ~300
- 40% of turns require LLM (others use templates or cache)
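The arithmetic below can be sanity-checked with a small helper (token counts and prices per the assumptions above; Sonnet at $3/$15 per 1M input/output tokens):

def llm_cost_per_day(calls: int, in_tok: int = 1_500, out_tok: int = 300,
                     in_price: float = 3.0, out_price: float = 15.0) -> float:
    """Daily LLM cost in dollars; prices are per 1M tokens."""
    return calls * (in_tok * in_price + out_tok * out_price) / 1_000_000

llm_cost_per_day(200_000)  # -> 1800.0 ($/day, matching the breakdown below)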
At 100K conversations/day:
LLM (Bedrock Claude 3.5 Sonnet):
Turns needing LLM: 100K × 5 × 40% = 200,000 LLM calls/day
Input tokens: 200,000 × 1,500 = 300M tokens/day
Output tokens: 200,000 × 300 = 60M tokens/day
Input cost: 300M × $3/1M = $900/day
Output cost: 60M × $15/1M = $900/day
LLM total: $1,800/day = ~$54,000/month
ECS Fargate (Orchestrator):
10-task baseline × 1 vCPU × $0.04048/vCPU-hour × 24h = ~$10/day
~$300/month
DynamoDB (Conversation Memory):
100K sessions × 5 writes/session = 500K write requests/day
On-demand: $1.25 per million write request units → ~$0.63/day → ~$19/month
OpenSearch Serverless:
Minimum 2 OCU × $0.24/OCU-hour × 24h × 30 = ~$346/month
ElastiCache Redis (2 nodes, cache.r6g.large):
~$200/month
API Gateway (WebSocket):
~500K connection-minutes/day × $0.25/1M + ~1M messages/day × $1.00/1M ≈ $1/day → ~$35/month
Lambda (Burst Workers):
Minimal — only on overflow. ~$50/month
CloudWatch + Kinesis + SageMaker (Classifier):
~$500/month combined
─────────────────────────────────────────────────────────────────
Total at 100K conversations/day: ~$55,000–75,000/month
(LLM cost is ~75% of total)
At 10M conversations/day (100x scale):
LLM cost scales linearly: ~$5.4M/month (dominates)
Infrastructure scales sub-linearly thanks to:
- Better cache hit rates at scale
- Reserved capacity pricing
- Model tiering
Total (unoptimized): ~$6M/month
Total (with optimization): ~$1.5M/month
Cost optimization strategies at scale:
1. Model tiering (saves ~50% of LLM cost):
Route 60% of LLM calls to Claude Haiku:
Haiku: $0.25/1M input, $1.25/1M output (12x cheaper than Sonnet)
Per-call cost: Sonnet = 1,500 × $3/1M + 300 × $15/1M = $0.009; Haiku = $0.00075
Before: 200,000 calls/day × $0.009 (all Sonnet) = $1,800/day
After: 120,000 Haiku × $0.00075 + 80,000 Sonnet × $0.009
     = $90 + $720 ≈ $810/day (~55% reduction)
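Checked with the helper from the sanity-check above (Haiku at $0.25/$1.25 per 1M tokens):

haiku  = llm_cost_per_day(120_000, in_price=0.25, out_price=1.25)  # -> 90.0
sonnet = llm_cost_per_day(80_000)                                  # -> 720.0
haiku + sonnet  # -> 810.0 $/day vs. 1,800 before (~55% reduction)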
2. Semantic caching (saves ~30% by avoiding repeat LLM calls):
Assume 30% of queries are semantically similar to a cached result:
200,000 calls/day × 30% cache hit = 60,000 calls avoided
Savings: 60,000 × ~$0.009/call = $540/day
3. Response length limits:
MAX_TOKENS = {
    "recommendation": 500,    # List of 3-5 products
    "faq": 300,               # Concise answer
    "product_question": 400,  # Moderate detail
}
# Reduces output token cost by ~40%
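Applied at call time (a sketch; `llm.generate` is the same assumed client used elsewhere in this doc):

response = await llm.generate(
    prompt,
    max_tokens=MAX_TOKENS.get(intent, 300),  # fall back to the concise limit
)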
4. Bedrock provisioned throughput (saves ~20% at scale):
At 10M conversations/day:
Purchase reserved throughput for baseline capacity
~20% discount vs. on-demand pricing
Cost per conversation:
Unoptimized: ~$6M/month ÷ (10M/day × 30 days) ≈ $0.02/conversation
After optimization: ~$1.5M ÷ 300M ≈ $0.005/conversation
Customer support agent cost: ~$3–7/interaction
→ Even unoptimized, the chatbot is over 100x cheaper than a human agent
Q27. Multi-storefront design (JP Manga, Comics, Kindle)
Short Answer
Multi-tenant with store_id parameter. Shared infrastructure, partitioned knowledge bases, templated system prompts.
Deep Dive
What's partitioned per store:
store_id: "manga_jp" | "comics_us" | "kindle"
Partitioned:
✅ RAG knowledge base → separate OpenSearch index per store
✅ System prompts → templated, store-specific tone/policies
✅ Product catalog queries → scoped to store's category/marketplace
✅ Recommendation model → shared or per-store depending on overlap
Shared (costs split across stores):
✅ Orchestrator (single service, routes by store_id)
✅ API Gateway + Auth
✅ DynamoDB conversation memory (session_id includes store_id prefix)
✅ Lambda/ECS compute
✅ CloudWatch monitoring
Routing by store_id:
from dataclasses import dataclass

@dataclass(frozen=True)
class StoreConfig:
    rag_index: str
    system_prompt: str
    recommendation_campaign: str
    catalog_category: str
    locale: str

STORE_CONFIGS = {
    "manga_jp": StoreConfig(
        rag_index="manga-jp-knowledge-base",
        system_prompt=MANGA_JP_SYSTEM_PROMPT,
        recommendation_campaign="arn:aws:personalize:.../manga-jp",
        catalog_category="manga",
        locale="ja-JP",
    ),
    "comics_us": StoreConfig(
        rag_index="comics-us-knowledge-base",
        system_prompt=COMICS_US_SYSTEM_PROMPT,
        recommendation_campaign="arn:aws:personalize:.../comics-us",
        catalog_category="comics",
        locale="en-US",
    ),
}
async def handle_request(store_id: str, message: str, session_id: str):
    config = STORE_CONFIGS[store_id]

    # All downstream calls use store-specific config
    recommendations = await personalize.get_recommendations(
        campaign_arn=config.recommendation_campaign,
        ...
    )
    rag_context = await rag.retrieve(
        index=config.rag_index,
        query=message
    )
    response = await llm.generate(
        system_prompt=config.system_prompt,
        ...
    )
    return response
System prompt templating:
SYSTEM_PROMPT_TEMPLATE = """
You are {store_name}, Amazon's AI shopping assistant for {category}.
Your tone is {tone}.

You help customers with:
- Discovering {category} based on their preferences
- Answering questions about {category} products
- Tracking orders and handling returns per {marketplace} policies

IMPORTANT: Only reference products available in {marketplace}.
Respond in {language}.
"""

MANGA_JP_SYSTEM_PROMPT = SYSTEM_PROMPT_TEMPLATE.format(
    store_name="MangaAssist",
    category="manga",
    tone="enthusiastic and knowledgeable",
    marketplace="Amazon Japan",
    language="Japanese",
)
Cost allocation:
Each store's usage is tagged with store_id:
CloudWatch metrics: { "StoreId": "manga_jp", "MetricName": "LLMCalls" }
Kinesis events: { "store_id": "manga_jp", ... }
Monthly chargeback to each business unit:
Shared infrastructure costs split by usage percentage
LLM costs charged directly to each store's P&L
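Emitting the per-store metrics is straightforward with CloudWatch dimensions (a sketch; the namespace is a placeholder):

import boto3

cloudwatch = boto3.client("cloudwatch")

def record_llm_call(store_id: str, tokens: int):
    cloudwatch.put_metric_data(
        Namespace="MangaChatbot",  # placeholder namespace
        MetricData=[
            {
                "MetricName": "LLMCalls",
                "Dimensions": [{"Name": "StoreId", "Value": store_id}],
                "Value": 1,
                "Unit": "Count",
            },
            {
                "MetricName": "LLMTokens",
                "Dimensions": [{"Name": "StoreId", "Value": store_id}],
                "Value": tokens,
                "Unit": "Count",
            },
        ],
    )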