
Scenario File 3 — Response Cache Layer

Context in the Architecture

The cache sits between the Orchestrator and the downstream origin services (Product Catalog, Recommendation Engine, Promotions Service, Reviews Service).

  • Goal: Absorb repeated reads for the same product or promotion, reducing downstream service load and reducing P99 latency for the chatbot.
  • Pattern: Cache-aside (read-through is not used; the Orchestrator checks the cache first and populates it on a miss; see the sketch after this list).
  • Hot keys: Product detail pages for top 100 manga ASINs can receive thousands of cache reads per minute during flash sales.
  • Prices are never cached — always fetched live from the Product Catalog to avoid stale pricing.
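
A minimal cache-aside sketch for the product-detail path, assuming a redis-py client and an illustrative fetch_from_catalog callable (neither of which is the production code):

import json
import redis

r = redis.Redis(host="cache.internal", port=6379)    # assumed ElastiCache endpoint
PRODUCT_TTL_SECONDS = 300                             # matches the 5 min TTL in the key design below

def get_product(asin: str, fetch_from_catalog) -> dict:
    key = f"product:{asin}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)                     # hit: served in < 2ms
    product = fetch_from_catalog(asin)                # miss: ~120ms origin call
    product.pop("price_cents", None)                  # prices are never cached
    r.setex(key, PRODUCT_TTL_SECONDS, json.dumps(product))
    return product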

Cache Hit/Miss Economics

Scenario                      Origin Latency   Cache Latency   Saved Per Request
Product detail (ASIN)         120ms            < 2ms           118ms
Recommendation for user_id    350ms            < 2ms           348ms
Promotion for store section   80ms             < 2ms           78ms
Review summary for ASIN       200ms            < 2ms           198ms

Current Choice: ElastiCache Redis

Why it was chosen: Sub-millisecond reads, rich data structures (Hashes for structured objects, Sets for recommendation lists), TTL per key, pub/sub for cache invalidation events, and cluster mode for horizontal scaling. Deeply integrated into the AWS ecosystem.

Key design:

product:{asin}              TTL: 5 min   (invalidated by SNS catalog event)
reco:{user_id}:{seed_asin}  TTL: 15 min  (invalidated on new session)
promo:{store_section}       TTL: 15 min  (invalidated by SNS promo event)
review:{asin}               TTL: 1 hour  (scheduled refresh)
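
A hedged sketch of the invalidation subscriber, assuming the SNS topics fan out to a Lambda handler and assuming the message shape shown (both are illustrative):

import json
import redis

r = redis.Redis(host="cache.internal", port=6379)     # assumed ElastiCache endpoint

def handle_invalidation(event, context):
    # Lambda handler subscribed to the catalog and promo SNS topics.
    for record in event["Records"]:
        msg = json.loads(record["Sns"]["Message"])     # assumed message shape
        if msg.get("type") == "catalog_update":
            r.delete(f"product:{msg['asin']}")
        elif msg.get("type") == "promo_update":
            r.delete(f"promo:{msg['store_section']}")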


Alternative 1: Memcached

What Changes

Redis is replaced with Amazon ElastiCache Memcached. Memcached is a simpler, multi-threaded in-memory cache.

Best Case

  • Multi-threaded by design: Memcached uses all CPU cores out of the box, while Redis executes commands on a single thread per shard (Redis 6+ adds I/O threading, but core command execution remains single-threaded). For pure GET/SET workloads with large values (product JSON blobs), Memcached can achieve higher throughput per node.
  • Simpler operational model: no replication complexity, no cluster bus gossip protocol.
  • Easier to scale out (just add nodes; keys distribute via consistent hashing).

Failure Scenario — No Pub/Sub, No Event-Driven Invalidation

What happens: The Promotions team updates the "Golden Week Sale" promotion for the JP Manga store at 11am. The SNS event is published. The cache invalidation subscriber calls:

redis.delete("promo:manga-home")

But with Memcached, there is no pub/sub mechanism. The cache invalidation subscriber cannot exist. The team implements a workaround: a Lambda function periodically polls an SQS queue subscribed to the SNS topic, then deletes the key from Memcached via the ElastiCache endpoint.
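
A sketch of that workaround under the stated assumptions (an SQS queue subscribed to the topic, the pymemcache client, an illustrative queue URL and message shape):

import json
import boto3
from pymemcache.client.hash import HashClient

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/promo-invalidation"   # illustrative
cache = HashClient([("memcached.internal", 11211)])    # assumed ElastiCache Memcached endpoint

def handle_poll(event, context):
    # Runs on a 60-second schedule and drains any pending invalidation messages.
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=1)
        messages = resp.get("Messages", [])
        if not messages:
            break
        for m in messages:
            body = json.loads(m["Body"])                # SQS body wraps the SNS envelope
            promo = json.loads(body["Message"])
            cache.delete(f"promo:{promo['store_section']}")
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=m["ReceiptHandle"])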

The Lambda function has a 60-second poll interval. For 60 seconds after the promotion goes live, every user browsing the manga home section sees the old promotions (yesterday's sale). 40,000 users see an expired 20%-off coupon. When they try to use it at checkout, it fails. Support tickets spike.

The deeper problem: Without pub/sub, all cache invalidation must be pull-based (polling), or you accept stale-data windows as a permanent property of the design. For price-adjacent data such as promotions, a 60-second stale window is business-critical.

Failure Scenario 2 — No Persistence, Full Cold Start

What happens: A network partition causes one Memcached node to be drained from the cluster and replaced. Since Memcached has no replication, the new node starts completely empty.

The Product Catalog receives 15,000 requests/min for the 100 most popular ASINs (all cache misses on the new node). The catalog service was sized for 2,000 requests/min (95% cache hit rate assumed). It falls over. The chatbot cannot return product details for 8 minutes while the catalog service auto-scales.

What makes this different from Redis: Redis with replication (one replica per shard) means a failed primary's replica is promoted. The promoted replica already holds the keys, so cache warm-up time is effectively zero.

Grilling Questions

  1. Memcached uses consistent hashing for key distribution. You add a new node. What percentage of existing keys are remapped to the new node, and what is the resulting cache miss storm?
  2. You need to store a product recommendation list (ordered) for a user. Redis supports ZSET (sorted set with scores). Memcached only supports bytes. How do you store an ordered list in Memcached, and what is the operational cost of implementing this yourself?
  3. A Memcached node OOMs (eviction policy kicks in). How do you know which keys were evicted? How does this differ from Redis evicted_keys counters and keyspace notifications?

Decision Heuristic

Use Memcached when: (a) you only need simple key-value GET/SET with no pub/sub, (b) your keys are large binary blobs (>128KB) where Memcached's lower per-item overhead wins, and (c) a cache invalidation latency of 30-60s is acceptable. For the MangaAssist chatbot, event-driven promotion invalidation makes Redis non-negotiable.


Alternative 2: DynamoDB Accelerator (DAX) as Cache

What Changes

Instead of Redis, the Orchestrator calls the DAX endpoint (API-compatible with DynamoDB) to cache product details. Product data is fetched from DynamoDB, cached in DAX's in-memory layer, and served on subsequent reads.

This only works if product catalog data is stored in DynamoDB (not Elasticsearch). The current architecture uses DynamoDB + Elasticsearch for the catalog, so this requires restructuring.

Best Case

  • Zero code changes from the Orchestrator's perspective (DAX is DynamoDB-compatible).
  • No separate cache service to manage.
  • Works well for workloads where the same DynamoDB keys are read frequently (thousands of reads for the same ASIN).

Failure Scenario — Cache Granularity Mismatch

What happens: Product detail queries use the DynamoDB key ASIN#B08X1YRSTR. DAX caches at the item level (per-item, per-attribute-set). But the Orchestrator fetches different attribute subsets depending on the intent:

  • product_question intent: fetches {title, description, format, page_count, language}
  • recommendation intent: fetches {title, price_cents, image_url, review_score}

DynamoDB GetItem with a projection expression is cached by DAX as a separate cache entry per projection. Instead of one cache entry per ASIN, there are N cache entries per ASIN (one per unique projection). Cache memory efficiency drops 4×. DAX cluster hits its memory ceiling and starts evicting entries that were only 3 minutes old.
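
An illustrative sketch of the two access patterns above, using the low-level DynamoDB GetItem call that DAX mirrors (table name, key schema, and attribute placeholders are assumptions):

import boto3

ddb = boto3.client("dynamodb")    # a DAX client exposes the same get_item call
KEY = {"pk": {"S": "ASIN#B08X1YRSTR"}}

# product_question intent: one projection of the item
ddb.get_item(
    TableName="ProductCatalog",
    Key=KEY,
    ProjectionExpression="title, description, #fmt, page_count, #lang",
    ExpressionAttributeNames={"#fmt": "format", "#lang": "language"},
)

# recommendation intent: a different projection of the same key,
# which the scenario above treats as a separate DAX cache entry
ddb.get_item(
    TableName="ProductCatalog",
    Key=KEY,
    ProjectionExpression="title, price_cents, image_url, review_score",
)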

Result: Product detail cache hit rate on DAX is 40% instead of the expected 85%. The product catalog service sees 2× more requests than planned. Latency for the recommendation flow goes from 3ms (cache) to 120ms (catalog) for 60% of requests.

Grilling Questions

  1. DAX is a write-through cache. A catalog price change event arrives. DAX already has the product item cached. The update is written directly to DynamoDB. When does DAX serve the updated price? (Answer: only after the item-cache TTL expires, or if the write itself had been routed through DAX.) Does this meet the "never serve a stale price" requirement?
  2. DAX is deployed in your VPC and is not encrypted in transit by default in older SDK versions. What is the security implication for product PII (seller addresses, tax info) stored in DynamoDB and potentially flowing through DAX?
  3. You have 500K distinct ASINs. DAX memory per node is 13GB. At an average item size of 2KB, how many items can fit in one DAX node? What percentage of the catalog is perpetually a cache miss?

Decision Heuristic

DAX is effective when you have a small hot dataset (<100K items) and the access pattern is pure key-value (same key, same projection). It is ineffective when projection expressions vary or when the dataset is too large to have meaningful hit rates.


Alternative 3: In-Process LRU Cache (Per-Task)

What Changes

No external cache service. Each Orchestrator ECS task maintains an in-process LRU cache (Python functools.lru_cache or cachetools.TTLCache):

from cachetools import TTLCache
product_cache = TTLCache(maxsize=5000, ttl=300)  # 5min TTL, 5000 entries

Best Case

  • Zero network hop: cache reads are in-process memory, sub-microsecond.
  • No operational overhead: no Redis cluster, no DAX, no ElastiCache.
  • Zero cost beyond the task's existing RAM.

Failure Scenario — The Stale Promotion at Flash Sale Start

What happens: Each of 60 Orchestrator tasks has its own in-process cache. When the "Golden Week Sale" promotion activates at 10:00am, an SNS event is published.

There is no way to broadcast an invalidation signal to 60 isolated in-process caches simultaneously. Each task evicts the old promotion only when its TTL expires (up to 15 minutes later).

At 10:05am, 50 of 60 tasks still serve the old promotion. Users get different responses to the same question depending on which task handles their request: some see the new sale, some see no sale. This is a data consistency nightmare that is invisible in logs but highly visible when users post side-by-side screenshots on Twitter.

The amplification: Tasks restart during deployments and auto-scaling events and come up with empty local caches, which they warm by hitting the origin services. During a 10-task burst scale-out at flash sale start, 10 new tasks simultaneously populate their local caches from the Promotions Service: 10 tasks × 200 categories = 2,000 cold-start requests to the Promotions Service in 60 seconds.

Failure Scenario 2 — Memory Pressure Under High Concurrency

Each Orchestrator task is allocated 2GB RAM. The in-process cache adds up to 500MB for 5,000 product entries (at ~100KB average product JSON). Under high load, 500MB is consumed by the cache alone, competing with:

  • LLM response buffering (~200MB peak per task)
  • Python interpreter + libraries (~300MB baseline)

The task runs out of heap headroom and GC pressure causes P99 latency spikes (Python's GC halts the world for 20-80ms during major collections). The chatbot's P99 jumps from 1.9s to 2.4s, breaching the 2.0s SLA.

Grilling Questions

  1. You reduce the cache to 1,000 entries per task (~100MB at ~100KB per entry) to control memory. The top 100 ASINs each have 5 variant pages: 100 × 5 = 500 hot items, so each task's 1,000 entries looks fine in isolation. But each of the 60 tasks sees a slightly different request mix. What is the effective cluster-wide cache hit rate?
  2. A deployment rolls 60 tasks over 10 minutes. All cache entries are lost. Origin services see a multiple of their warm-baseline request rate immediately after each batch of 10 task replacements. How do you mitigate the cold-start thundering herd?
  3. In-process caches are not observable. You cannot run redis-cli monitor to see what's being cached. How do you debug a cache coherence issue where users are getting different promotion responses?

Decision Heuristic

Use in-process cache only for immutable or rarely-changing data (e.g., intent routing rules, prompt templates, static genre taxonomy). Never use it for data that can be invalidated by external events (promotions, catalog changes) or data that must be consistent across all tasks.


Alternative 4: No Cache — Direct Origin Calls

What Changes

The cache layer is removed entirely. Every product detail, recommendation, and promotion query hits the origin service directly.

Best Case

  • Always-fresh data: zero risk of serving stale product or promotion information.
  • Simplified architecture: no Redis cluster to manage, monitor, or pay for.
  • Correct for prices (which are already never cached). If everything should be live, remove the cache.

Failure Scenario — The Cascading Timeout During a Flash Sale

What happens: A new manga series launches. 50,000 users simultaneously ask the chatbot: "Is [New Series Vol 1] in stock?" and "What's the price?"

Without a cache, 50,000 concurrent requests hit the Product Catalog service. The catalog is sized for 5,000 concurrent requests (based on 10% cache-miss baseline). At 10× the planned load, the catalog service's connection pool saturates. Database connections are queued. P99 catalog latency grows from 120ms → 3,000ms.

The Orchestrator has a 2,000ms timeout on catalog calls. Every request times out. The Orchestrator returns 503 to the chatbot. Every user sees: "I'm having trouble finding that product right now. Please try again later."

50,000 simultaneous error responses on the biggest launch day of the year. The manga launch is a business disaster despite the product being live and in stock.

The cost comparison: ElastiCache cluster costs ~$400/month. The flash sale generated $2M in manga revenue. The cache is a 0.02% cost to protect 100% of revenue.

Grilling Questions

  1. You argue that the Product Catalog should have auto-scaling and can handle 50,000 RPS. True — but auto-scaling ECS tasks takes 90-180 seconds. The flash sale spike happens in 15 seconds. What is the traffic pattern during that 165-second gap?
  2. Without a cache, every chatbot response requires N origin service calls (product detail + recommendation + promotion = 3 calls in parallel). The slowest call, recommendations at ~350ms median, dominates, so P50 data-fetch time is ~350ms. With the cache, P50 is ~2ms. Does this change the UX of the chatbot significantly?
  3. Prices cannot be cached. But price range context (e.g., "prices start from $9.99 for this series") could be cached. Design a rule for what "safe to cache" means in terms of business risk.

Decision Heuristic

The "no cache" option is only valid during development/low traffic or for data categories where staleness causes legal or business harm (prices, order status). For product metadata and promotions at scale, a cache is not optional — it is the difference between the system working and not.


Alternative 5: Read-Through Cache (vs. Current Cache-Aside)

What Changes

Instead of the Orchestrator checking the cache and falling through to origin, a read-through cache proxy (e.g., a dedicated caching microservice fronting Redis) fills the cache automatically on every miss, so the Orchestrator never implements miss logic.

Best Case

  • Simpler Orchestrator code: no cache-miss logic, no "if cache miss, call origin, then populate cache" boilerplate.
  • Single source of truth for cache population logic.
  • Avoids thundering herd on cold start: cache proxy can use single-flight (only one pending origin call per cache key, all waiters share the result).

Failure Scenario — The Thundering Herd the Proxy Doesn't Protect Against

What happens: You implement the cache proxy but forget to add single-flight (also called request coalescing). A Redis key expires at 10:00:00.000. 500 Orchestrator tasks all simultaneously call the cache proxy for product:B08X1YRSTR. The proxy checks Redis: all 500 find a cache miss. All 500 call the origin service simultaneously.

The origin service receives 500 requests for the same ASIN in 10ms. This is the classic thundering herd. Without single-flight, read-through provides no protection against it.

Single-flight implementation: The proxy uses a mutex map: the first request for a missing key acquires the lock and calls origin. All other requesters block on the lock. When the first request returns, the result is written to cache and all waiters receive the same response. Only 1 origin call happens instead of 500.
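
A minimal in-process sketch of that mechanism in Python, assuming a cachetools.TTLCache and an illustrative fetch_origin callable; as the failure below shows, this coalesces calls within one process only:

import threading
from cachetools import TTLCache

cache = TTLCache(maxsize=5000, ttl=300)
_locks: dict[str, threading.Lock] = {}
_locks_guard = threading.Lock()

def get_single_flight(key: str, fetch_origin):
    value = cache.get(key)
    if value is not None:
        return value
    with _locks_guard:
        lock = _locks.setdefault(key, threading.Lock())
    with lock:
        value = cache.get(key)           # re-check: a prior waiter may have filled it
        if value is None:
            value = fetch_origin(key)    # exactly one origin call per key per process
            cache[key] = value
    return value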

The failure: single-flight was only implemented for the proxy's internal goroutines, not across the 60 distributed Orchestrator tasks. Cross-task thundering herd still occurs.

Decision Heuristic

Read-through simplifies Orchestrator code but moves complexity into the proxy service. The proxy must implement single-flight for cache stampede protection. If you implement cache-aside correctly with distributed locking on miss, the outcome is equivalent. Use read-through when you have multiple consumers of the same cache and want one place to own population logic.
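
For the cache-aside-with-distributed-locking variant mentioned above, a hedged sketch using a best-effort Redis lock (SET with NX and EX); key names, TTLs, and the fetch_origin callable are assumptions:

import json
import time
import redis

r = redis.Redis(host="cache.internal", port=6379)

def get_with_lock(key: str, fetch_origin, ttl=300, lock_ttl=5):
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    if r.set(f"lock:{key}", "1", nx=True, ex=lock_ttl):    # lock winner across all tasks
        try:
            value = fetch_origin(key)
            r.setex(key, ttl, json.dumps(value))
            return value
        finally:
            r.delete(f"lock:{key}")
    for _ in range(50):                                    # losers wait for the winner to populate
        time.sleep(0.1)
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)
    return fetch_origin(key)                               # fallback if the winner died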


Master Summary Table

Choice                        Latency    Invalidation                    Consistency                     Key Failure Risk
ElastiCache Redis (current)   <2ms       Event-driven (SNS pub/sub)      Strong (single source)          Node failure causes 30-60s promotion staleness
Memcached                     <2ms       Pull-based (polling)            Weak (60s staleness window)     Promo invalidation requires polling workaround
DAX                           <1ms       TTL or write-through update     Strong for DynamoDB-only data   Projection mismatch kills hit rate
In-process LRU                <0.1ms     None (per-task TTL only)        None (60 separate caches)       Users get different promo answers from different tasks
No cache                      80-350ms   N/A (always fresh)              Perfect                         Origin services collapse under flash sale load
Read-through                  <2ms       Event-driven (if implemented)   Strong                          Thundering herd if single-flight is missing