US-03: Caching Strategy for Cost Reduction
User Story
As a platform architect, I want to maximize ElastiCache hit rates and implement intelligent cache warming, So that downstream service calls (Product Catalog, Recommendations, Promotions) are reduced by 30-50%, lowering both latency and API costs.
Acceptance Criteria
- Overall cache hit rate for product data ≥ 70%.
- Recommendation cache hit rate ≥ 40% during peak hours.
- Cache warming pre-populates top 500 ASINs before peak traffic starts.
- Event-driven invalidation ensures stale data is never served for more than 60 seconds after a catalog update.
- ElastiCache cluster is right-sized with auto-scaling based on memory utilization.
- Total downstream API calls decrease by 30-50%.
High-Level Design
Cost Problem
Every Orchestrator request fans out to 2-4 downstream services. At 1M messages/day:
- Product Catalog: ~800K calls/day (most messages involve products)
- Recommendation Engine: ~300K calls/day
- Promotions Service: ~500K calls/day
- Reviews Service: ~200K calls/day
Each API call adds latency (50-200ms) and costs compute on the downstream service. Caching offloads this to sub-millisecond Redis reads.
Cache Architecture (Enhanced)
graph TD
subgraph "Request Path"
A[Orchestrator] --> B{Cache<br>Lookup}
B -->|HIT| C[Return Cached Data<br>< 1ms]
B -->|MISS| D[Call Origin Service<br>50-200ms]
D --> E[Populate Cache<br>with TTL]
E --> C
end
subgraph "Cache Warming"
F[Scheduled Warmer<br>Lambda - 8:30am JST] --> G[Fetch Top 500 ASINs]
G --> H[Pre-populate<br>Product Cache]
F --> I[Fetch Active Promos]
I --> J[Pre-populate<br>Promo Cache]
end
subgraph "Cache Invalidation"
K[Catalog Change Event<br>SNS] --> L[Invalidation Handler<br>Lambda]
L --> M[Delete Stale Keys]
N[Promo Change Event<br>SNS] --> L
end
subgraph "ElastiCache Redis Cluster"
C --> O[Product Cache<br>TTL: 5 min]
C --> P[Reco Cache<br>TTL: 15 min]
C --> Q[Promo Cache<br>TTL: 15 min]
C --> R[Review Cache<br>TTL: 1 hour]
end
style C fill:#2d8,stroke:#333
style D fill:#f66,stroke:#333
Cost Impact
| Scenario | Downstream Calls/Day | Monthly API Cost | Cache Cost | Net Savings |
|---|---|---|---|---|
| No cache | 1.8M | ~$2,700 | $0 | — |
| Basic cache (LLD-10) | 900K | ~$1,350 | ~$200 | ~$1,150/month |
| Optimized cache + warming | 550K | ~$825 | ~$250 | ~$1,625/month |
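The net-savings figures in the table can be reproduced with a back-of-envelope model. The sketch below is illustrative only: the ~$0.05 per 1K downstream calls unit cost is the assumption used later in Math Validation, and the cache-cost figures are taken from the table rather than from billing data.
DAYS_PER_MONTH = 30
COST_PER_1K_CALLS = 0.05  # USD; assumed commercial API tier (see Math Validation)

def monthly_total(calls_per_day: int, cache_cost: float) -> float:
    """Monthly downstream API spend plus cache infrastructure cost."""
    api_cost = calls_per_day * DAYS_PER_MONTH / 1_000 * COST_PER_1K_CALLS
    return api_cost + cache_cost

baseline = monthly_total(1_800_000, 0)      # ~$2,700
basic = monthly_total(900_000, 200)         # ~$1,550 (net savings ~$1,150)
optimized = monthly_total(550_000, 250)     # ~$1,075
print(round(baseline - optimized))          # ~1625, matching the table's net savings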
Low-Level Design
1. Multi-Layer Cache with Read-Through Pattern
graph LR
A[Request] --> B[L1: In-Process<br>LRU Cache<br>TTL: 30s]
B -->|miss| C[L2: ElastiCache<br>Redis Cluster<br>TTL: varies]
C -->|miss| D[L3: Origin<br>Service]
D --> E[Populate L2 + L1]
style B fill:#2d8,stroke:#333
style C fill:#fd2,stroke:#333
style D fill:#f66,stroke:#333
Code Example: Multi-Layer Cache Client
import json
import time
from collections import OrderedDict
from dataclasses import dataclass
from typing import Any, Callable, Optional
import redis
@dataclass
class CacheEntry:
value: Any
created_at: float
ttl_seconds: int
@property
def is_expired(self) -> bool:
return time.time() - self.created_at > self.ttl_seconds
class L1Cache:
"""In-process LRU cache for ultra-hot data."""
def __init__(self, max_size: int = 1000, default_ttl: int = 30):
self._store: OrderedDict[str, CacheEntry] = OrderedDict()
self._max_size = max_size
self._default_ttl = default_ttl
def get(self, key: str) -> Optional[Any]:
entry = self._store.get(key)
if entry is None:
return None
if entry.is_expired:
del self._store[key]
return None
self._store.move_to_end(key)
return entry.value
def put(self, key: str, value: Any, ttl: Optional[int] = None) -> None:
if len(self._store) >= self._max_size:
self._store.popitem(last=False)
self._store[key] = CacheEntry(
value=value,
created_at=time.time(),
ttl_seconds=ttl or self._default_ttl,
)
def invalidate(self, key: str) -> None:
self._store.pop(key, None)
class MultiLayerCache:
"""Two-layer cache: in-process LRU (L1) + Redis (L2)."""
def __init__(self, redis_client: redis.Redis, l1_max_size: int = 1000):
self._l1 = L1Cache(max_size=l1_max_size)
self._redis = redis_client
self._hits = {"l1": 0, "l2": 0, "miss": 0}
def get_or_fetch(
self,
key: str,
fetcher: Callable[[], Any],
l1_ttl: int = 30,
l2_ttl: int = 300,
) -> Any:
# L1 check
value = self._l1.get(key)
if value is not None:
self._hits["l1"] += 1
return value
# L2 check
raw = self._redis.get(key)
if raw is not None:
self._hits["l2"] += 1
value = json.loads(raw)
self._l1.put(key, value, ttl=l1_ttl)
return value
# Origin fetch
self._hits["miss"] += 1
value = fetcher()
self._redis.setex(key, l2_ttl, json.dumps(value))
self._l1.put(key, value, ttl=l1_ttl)
return value
def invalidate(self, key: str) -> None:
self._l1.invalidate(key)
self._redis.delete(key)
def get_stats(self) -> dict:
total = sum(self._hits.values()) or 1
return {
"l1_hit_rate": self._hits["l1"] / total,
"l2_hit_rate": self._hits["l2"] / total,
"miss_rate": self._hits["miss"] / total,
"total_requests": total,
}
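A usage sketch for the client above, continuing from the same imports. The catalog_client object and its get_product() call are placeholders for the internal Product Catalog API, which is not part of this story's code.
redis_client = redis.Redis(host="localhost", port=6379, decode_responses=True)  # placeholder endpoint
cache = MultiLayerCache(redis_client)

def get_product_cached(asin: str, catalog_client) -> dict:
    """Read-through lookup: L1, then Redis, then the origin catalog call."""
    return cache.get_or_fetch(
        key=f"product:{asin}",
        fetcher=lambda: catalog_client.get_product(asin),  # hypothetical origin call
        l1_ttl=30,    # in-process copy
        l2_ttl=300,   # Redis copy, 5-minute product TTL from the HLD
    )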
2. Cache Key Design
graph TD
subgraph "Cache Key Patterns"
A["product:{asin}<br>e.g. product:B08X1YRSTR"]
B["reco:{user_id}:{seed_asin}<br>e.g. reco:C123:B08X1YRSTR"]
C["reco:anon:{context_hash}<br>e.g. reco:anon:a3f8b2"]
D["promo:{store_section}<br>e.g. promo:manga-home"]
E["review:{asin}<br>e.g. review:B08X1YRSTR"]
F["product_batch:{hash}<br>e.g. product_batch:c8d2e1"]
end
Code Example: Cache Key Builder
import hashlib
class CacheKeyBuilder:
"""Consistent cache key generation for all cached data types."""
@staticmethod
def product(asin: str) -> str:
return f"product:{asin}"
@staticmethod
def recommendation(user_id: str, seed_asin: str) -> str:
if user_id:
return f"reco:{user_id}:{seed_asin}"
        # No user ID: key on the seed ASIN alone; use recommendation_anonymous()
        # when only browsing context is available.
        return f"reco:anon:{seed_asin}"
@staticmethod
def recommendation_anonymous(browsing_history: list[str]) -> str:
context = "|".join(sorted(browsing_history[-5:]))
context_hash = hashlib.sha256(context.encode()).hexdigest()[:8]
return f"reco:anon:{context_hash}"
@staticmethod
def promotion(store_section: str) -> str:
return f"promo:{store_section}"
@staticmethod
def review(asin: str) -> str:
return f"review:{asin}"
@staticmethod
def product_batch(asins: list[str]) -> str:
"""Key for batch product lookups (e.g., cart or recommendation results)."""
sorted_asins = sorted(asins)
batch_hash = hashlib.sha256(
"|".join(sorted_asins).encode()
).hexdigest()[:12]
return f"product_batch:{batch_hash}"
3. Cache Warming Strategy
Pre-populate the cache with high-demand data before peak hours.
sequenceDiagram
participant Scheduler as EventBridge<br>Rule (8:30am JST)
participant Warmer as Cache Warmer<br>Lambda
participant Analytics as Redshift
participant Catalog as Product Catalog
participant Promos as Promotions Service
participant Cache as ElastiCache
Scheduler->>Warmer: Trigger warm-up
Warmer->>Analytics: Query top 500 ASINs<br>(last 7 days by traffic)
Analytics-->>Warmer: ASIN list
loop For each ASIN batch (50 at a time)
Warmer->>Catalog: Batch get product details
Catalog-->>Warmer: Product data
Warmer->>Cache: MSET product:{asin} entries
end
Warmer->>Promos: Get all active promotions
Promos-->>Warmer: Promo list
Warmer->>Cache: SET promo:{section} entries
Warmer->>Warmer: Log: warmed {count} products, {count} promos
Code Example: Cache Warmer Lambda
import json
import logging
import time
from typing import Any
import boto3
import redis
logger = logging.getLogger()
logger.setLevel(logging.INFO)
# Initialize outside handler for connection reuse
redis_client = redis.Redis(
host="manga-cache.xxxxxx.ng.0001.apne1.cache.amazonaws.com",
port=6379,
ssl=True,
decode_responses=True,
)
redshift_client = boto3.client("redshift-data")
def handler(event: dict, context: Any) -> dict:
"""Cache warmer Lambda — triggered daily before peak traffic."""
# 1. Get top ASINs from analytics
top_asins = _get_top_asins(limit=500)
logger.info(f"Warming cache for {len(top_asins)} ASINs")
# 2. Batch-fetch product details and populate cache
products_warmed = 0
batch_size = 50
for i in range(0, len(top_asins), batch_size):
batch = top_asins[i : i + batch_size]
products = _batch_get_products(batch)
pipeline = redis_client.pipeline()
for asin, product_data in products.items():
key = f"product:{asin}"
            pipeline.setex(key, 300, json.dumps(product_data))  # 5 min TTL, matching the HLD product TTL cap
products_warmed += 1
pipeline.execute()
# 3. Warm promotions cache
promos = _get_active_promotions()
promo_pipeline = redis_client.pipeline()
for section, promo_list in promos.items():
key = f"promo:{section}"
promo_pipeline.setex(key, 900, json.dumps(promo_list)) # 15 min TTL
promo_pipeline.execute()
result = {
"products_warmed": products_warmed,
"promo_sections_warmed": len(promos),
"status": "success",
}
logger.info(f"Cache warming complete: {result}")
return result
def _get_top_asins(limit: int) -> list[str]:
"""Query Redshift for the most-accessed ASINs in the last 7 days."""
response = redshift_client.execute_statement(
ClusterIdentifier="manga-analytics",
Database="chatbot",
Sql=f"""
SELECT DISTINCT products_shown AS asin, COUNT(*) AS cnt
FROM chatbot_events
WHERE created_at > DATEADD(day, -7, GETDATE())
AND products_shown IS NOT NULL
GROUP BY products_shown
ORDER BY cnt DESC
LIMIT {limit}
""",
)
    # Poll until the query completes (the Redshift Data API has no boto3 waiters)
    statement_id = response["Id"]
    while True:
        status = redshift_client.describe_statement(Id=statement_id)["Status"]
        if status == "FINISHED":
            break
        if status in ("FAILED", "ABORTED"):
            raise RuntimeError(f"Top-ASIN query {status}: {statement_id}")
        time.sleep(1)
result = redshift_client.get_statement_result(Id=statement_id)
return [row[0]["stringValue"] for row in result["Records"]]
def _batch_get_products(asins: list[str]) -> dict:
"""Fetch product details from catalog service."""
# Simulated — in production this calls the internal Product Catalog API
dynamodb = boto3.resource("dynamodb")
products = {}
keys = [{"asin": asin} for asin in asins]
response = dynamodb.batch_get_item(
RequestItems={"product_catalog": {"Keys": keys}}
)
for item in response.get("Responses", {}).get("product_catalog", []):
products[item["asin"]] = {
"title": item.get("title"),
"price": item.get("price"),
"format": item.get("format"),
"availability": item.get("availability"),
}
return products
def _get_active_promotions() -> dict[str, list]:
"""Fetch all active promotions grouped by store section."""
# Simulated — in production this calls the Promotions Service API
return {
"manga-home": [
{"title": "Manga Sale", "discount": "20% off"},
],
}
4. Event-Driven Cache Invalidation
sequenceDiagram
participant CatalogService as Product Catalog
participant SNS as SNS Topic<br>catalog-changes
participant Lambda as Invalidation<br>Lambda
participant Cache as ElastiCache
CatalogService->>SNS: Publish {event: "product_updated", asin: "B08X1YRSTR"}
SNS->>Lambda: Trigger
Lambda->>Cache: DEL product:B08X1YRSTR
Lambda->>Cache: DEL review:B08X1YRSTR
Lambda->>Lambda: Log invalidation event
Code Example: Cache Invalidation Handler
import json
import logging
import redis
logger = logging.getLogger()
logger.setLevel(logging.INFO)
redis_client = redis.Redis(
host="manga-cache.xxxxxx.ng.0001.apne1.cache.amazonaws.com",
port=6379,
ssl=True,
decode_responses=True,
)
def handler(event: dict, context) -> dict:
"""SNS-triggered Lambda to invalidate stale cache entries."""
invalidated = 0
for record in event.get("Records", []):
message = json.loads(record["Sns"]["Message"])
event_type = message.get("event")
asin = message.get("asin")
section = message.get("store_section")
if event_type == "product_updated" and asin:
keys = [f"product:{asin}", f"review:{asin}"]
redis_client.delete(*keys)
# Also invalidate any batch keys containing this ASIN
_invalidate_batch_keys(asin)
invalidated += len(keys)
logger.info(f"Invalidated product cache for {asin}")
elif event_type == "promotion_changed" and section:
key = f"promo:{section}"
redis_client.delete(key)
invalidated += 1
logger.info(f"Invalidated promo cache for section {section}")
elif event_type == "product_deleted" and asin:
keys = [f"product:{asin}", f"review:{asin}"]
redis_client.delete(*keys)
invalidated += len(keys)
logger.info(f"Purged all cache for deleted ASIN {asin}")
return {"invalidated_keys": invalidated}
def _invalidate_batch_keys(asin: str) -> None:
"""Scan and remove batch keys that include the updated ASIN."""
# Use SCAN to find matching batch keys (bounded iteration)
cursor = 0
while True:
cursor, keys = redis_client.scan(
cursor=cursor, match="product_batch:*", count=100
)
for key in keys:
raw = redis_client.get(key)
if raw and asin in raw:
redis_client.delete(key)
if cursor == 0:
break
ElastiCache Right-Sizing
Sizing Decision Tree
graph TD
A[Estimate Working Set Size] --> B{< 6 GB?}
B -->|Yes| C[cache.r6g.large<br>13 GB, $0.166/hr<br>~$120/month]
B -->|No| D{< 13 GB?}
D -->|Yes| E[cache.r6g.xlarge<br>26 GB, $0.332/hr<br>~$240/month]
D -->|No| F[cache.r6g.2xlarge<br>52 GB, $0.664/hr<br>~$478/month]
G[Enable Auto-scaling] --> H[Scale based on<br>memory utilization]
H --> I{> 75% memory?}
I -->|Yes| J[Scale up node type]
I -->|No| K{< 30% memory?}
K -->|Yes| L[Scale down node type]
Working Set Estimate
| Data Type | Avg Entry Size | Max Entries | Total Size |
|---|---|---|---|
| Product details | 2 KB | 50,000 ASINs | 100 MB |
| Recommendations | 1 KB | 100,000 user×ASIN combos | 100 MB |
| Promotions | 500 bytes | 1,000 entries | 0.5 MB |
| Reviews | 500 bytes | 50,000 ASINs | 25 MB |
| Semantic LLM cache | 3 KB | 50,000 entries | 150 MB |
| Total | | | ~375 MB |
A cache.r6g.large (13 GB) is sufficient with room for growth.
Monitoring and Metrics
| Metric | Target | Alert |
|---|---|---|
| L1 cache hit rate | ≥ 30% | < 20% |
| L2 (Redis) cache hit rate | ≥ 50% | < 35% |
| Combined hit rate | ≥ 70% | < 55% |
| Cache eviction rate | < 1% of entries/hour | > 5% |
| Redis memory utilization | 40-70% | > 80% |
| Cache warming success | 100% daily | Any failure |
| Invalidation latency | < 5 seconds | > 30 seconds |
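To give these targets a data source, the counters from MultiLayerCache.get_stats() can be flushed to CloudWatch on a timer. A minimal sketch follows; the namespace and metric names are illustrative assumptions, not an agreed contract.
import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_cache_metrics(cache) -> None:
    """Publish L1/L2 hit rates so the alert thresholds above can be monitored."""
    stats = cache.get_stats()  # MultiLayerCache from the LLD above
    cloudwatch.put_metric_data(
        Namespace="MangaChatbot/Cache",  # illustrative namespace
        MetricData=[
            {"MetricName": "L1HitRate", "Value": stats["l1_hit_rate"]},
            {"MetricName": "L2HitRate", "Value": stats["l2_hit_rate"]},
            {"MetricName": "CombinedHitRate",
             "Value": stats["l1_hit_rate"] + stats["l2_hit_rate"]},
        ],
    )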
Risks and Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Cache stampede on popular ASIN invalidation | Many concurrent origin fetches | Use lock-based cache rebuild (only one request fetches; others wait) |
| Redis cluster failure | All requests hit origin services | Origin services must handle full load; monitor Redis health |
| Over-caching stale data | User sees wrong price/availability | Prices are NEVER cached; product TTL capped at 5 minutes |
| L1 cache inconsistency across ECS tasks | Different tasks serve different data | Short L1 TTL (30s) limits divergence window |
Deep Dive: Why This Works on a Manga Chatbot Workload
Caching is the highest-leverage optimization in this collection because it is the only one that scales sub-linearly with traffic — a 10× traffic spike does not produce 10× downstream cost if the cache hit rate is high. The reason this story projects 30–50% downstream API cost reduction is not that "cache is fast"; it's that manga-chatbot read patterns have three properties that make caching unusually effective.
Property 1: Read-to-write ratio is extreme. Manga catalog products are written by merchandising tooling on a daily-or-slower cadence (new volume releases, price updates, inventory sync). They are read by users millions of times per day. The effective read:write ratio for the catalog is on the order of 10⁵:1. Caching is overwhelmingly favorable at such ratios — the cost of a miss-and-fill is amortized over 100K+ subsequent hits. The 5-minute product TTL is not chosen because data changes that often; it is chosen because availability (in-stock status) does, and a 5-minute window is the maximum tolerable inventory staleness for chat responses. The architectural assumption is that prices are excluded from the cache (story line 538 explicitly forbids it) — pricing changes propagate through a different path with a hard contract.
Property 2: Access distribution follows a steep Zipf curve. The top 500 ASINs in any manga store account for the majority of chat queries on any given day (popular titles dominate; long-tail catalog browsing is rare in chat versus the storefront). This is why the cache-warming Lambda only pre-fetches 500 ASINs — that subset captures the bulk of cold-start misses. The two-tier (L1 in-process LRU + L2 Redis) design exists because Zipf is fractal: the top 50 ASINs are even hotter than the top 500, so a small in-process cache absorbs them at near-zero latency without paying the Redis network round-trip. The 30-second L1 TTL is chosen short enough that cross-task inconsistency (different ECS tasks serving different snapshots) cannot persist long enough to confuse a single user session.
Property 3: Cache-stampede risk is concentrated, not uniform. When a popular ASIN's cache entry expires or is invalidated, all in-flight requests for that ASIN miss simultaneously — and a single miss-and-fill can become hundreds of concurrent origin fetches against the catalog API. This is the thundering-herd / cache-stampede pattern (Mogul & Padmanabhan 1996). The lock-based rebuild pattern (only one request fetches; others wait on the lock) is non-optional for a chatbot workload because (a) the hot keys are well-known, and (b) origin services (catalog API, recommendations API) are themselves rate-limited. Without stampede protection, a single key invalidation during peak traffic can cascade into a downstream outage.
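The lock-based rebuild described above can be sketched with Redis SET NX as the rebuild lock. This is a minimal sketch, not the story's agreed implementation: the lock-key convention, timeouts, and the direct-fetch fallback on timeout are assumptions.
import json
import time

def get_with_stampede_protection(redis_client, key, fetcher, ttl=300,
                                 lock_ttl=10, poll_interval=0.05, max_wait=2.0):
    """Lock-based rebuild: one caller fetches from origin, the rest wait briefly."""
    raw = redis_client.get(key)
    if raw is not None:
        return json.loads(raw)
    lock_key = f"lock:{key}"  # hypothetical lock-key convention
    # SET NX EX: the first caller to acquire the lock performs the rebuild
    if redis_client.set(lock_key, "1", nx=True, ex=lock_ttl):
        try:
            value = fetcher()  # single origin fetch for this key
            redis_client.setex(key, ttl, json.dumps(value))
            return value
        finally:
            redis_client.delete(lock_key)
    # Another caller holds the lock: poll for the repopulated key
    deadline = time.time() + max_wait
    while time.time() < deadline:
        time.sleep(poll_interval)
        raw = redis_client.get(key)
        if raw is not None:
            return json.loads(raw)
    return fetcher()  # last resort: fetch directly rather than fail the chat turn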
Bottom line: the savings come from the multiplicative product of read amplification (10⁵:1 R:W) and access skew (top 500 ASINs cover most queries). The cache is not just a latency optimization — it is the load isolation layer that lets US-04's compute and US-05's DDB scale linearly with unique requests instead of total requests.
Real-World Validation
Industry Benchmarks & Case Studies
- Pinterest engineering blog: "Caching at Pinterest" — Reports >90% hit rate on product-catalog-style read paths with multi-tier (L1 process + L2 distributed) caching. This story's combined 70% target is conservative against Pinterest-class results, reflecting the higher write volatility on a manga catalog (new releases, language availability flips).
- Mogul & Padmanabhan (1996), "Performance issues in WWW servers" — Foundational paper on cache-stampede / thundering-herd. The lock-based rebuild pattern in this story (line 536) traces directly back to this work; lock-based mitigation is the textbook fix.
- AWS ElastiCache Well-Architected pillar (sizing guidance) — Recommends 30–60% steady-state memory utilization with headroom for traffic spikes and rehashing. The story's 40–70% target band aligns with AWS guidance.
- Discord engineering: "Storing billions of messages" / Twitter Pelikan — Both publicly document the L1+L2 pattern with similar TTL philosophies (short L1 to bound inconsistency, longer L2 for working-set capture). Validates the architectural choice.
- DoorDash engineering blog: "Eliminating thundering herd at scale" — Documents cache-rebuild stampede as a production incident; their fix (probabilistic early expiration + lock-based rebuild) extends the pattern in this story.
- Internal cross-reference: POC-to-Production-War-Story/02-seven-production-catastrophes.md — The "RAG recall collapse" and "WebSocket meltdown" catastrophes both had cache stampede as a contributing factor.
- Internal cross-reference: RAG-MCP-Integration/01-catalog-search-mcp.md — The catalog MCP server is the primary client of this cache; its hit-rate metrics flow into the dashboards here.
Math Validation
- ElastiCache cache.r6g.large (13 GB, 2 vCPU): ~$0.215/hr (us-east-1) × 730 hrs × 2 nodes (Multi-AZ) = ~$314/month. The story doesn't claim a baseline ElastiCache cost; the working-set estimate (375 MB) confirms r6g.large is correctly sized — the workload could even fit cache.t4g.medium (3.09 GB, ~$0.064/hr) at ~$94/month for two-node Multi-AZ if traffic projections hold. Worth re-evaluating once steady-state hit rates are measured.
- 1.8M downstream API calls/day at ~$0.05/1K calls (typical commercial API tier) = ~$2.7K/month — matches story baseline. ✅
- At 70% combined hit rate, 1.8M × 0.30 = 540K origin calls/day → ~$810/month. Plus cache infra (~$314) = ~$1,124/month total. The story's "$825/month optimized" excludes the cache cost itself; the combined-cost saving is ~58% (vs 70% raw API saving) — the story should clarify whether savings include or exclude cache infra.
Conservative vs Aggressive Savings Bounds
| Bound | Assumptions | Total monthly savings |
|---|---|---|
| Conservative | 50% combined hit rate, no L1, no warming | ~25% (~$675/month gross, ~$360 net of infra) |
| Aggressive | 85% combined hit rate (Pinterest-class), full warming, probabilistic early expiration | ~60% (~$1,620/month gross, ~$1,300 net) |
| Story's projection | Realistic mid-band; depends on Zipf curve steepness for this catalog | 30–50% (~$1,000–$1,400 gross) |
Cross-Story Interactions & Conflicts
This story is the shared infrastructure layer for several other stories. Keyspace and TTL coordination is centralized here.
- US-01 (LLM Tokens) — The semantic response cache from US-01 lives on this Redis tier under the reserved prefix llmresp:. Conflict mode: during memory pressure, default LRU eviction can evict large LLM-cache entries (~3 KB each) before smaller product-cache entries (~2 KB), even though LLM cache misses cost 100× more. Resolution: dedicate separate eviction behavior (see the S1 finding below on per-keyspace eviction) so llmresp: is never evicted while the product cache stays on allkeys-lru.
- US-02 (Intent Classifier) — Session intent cache from US-02 uses prefix intent:sess: with TTL 90s. Conflict mode: the session intent TTL must be shorter than the user-session TTL (US-05's 24h DDB session) so a stale intent never outlives its session. Resolution: centralize TTL constants in a shared config module that both stories import (see the sketch after this list).
- US-05 (DynamoDB) — This story is authoritative for the cache-DDB contract. Conflict mode: during an ElastiCache failover (~30–90 second window), all reads miss and stampede DynamoDB. If DDB is in on-demand mode (US-05) the spike is absorbed at higher cost; if it is in provisioned mode it throttles. Resolution: a circuit breaker between cache miss and DDB fallback (limit fallback concurrency to 100 req/s per origin); pre-provisioned DDB headroom for the failover scenario.
- US-06 (RAG) — Embedding cache from US-06 uses prefix emb: with a 1h TTL. Per-key size is 1.5–6 KB (1536-dim float32 = 6 KB; quantized int8 = 1.5 KB). At 50K cached embeddings × 6 KB = 300 MB, almost as large as this story's current working-set estimate of 375 MB. Resolution: extend the working-set table to include an emb: row; consider int8 quantization to halve memory; bump node sizing to cache.r6g.large minimum (no t4g) to absorb this.
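A minimal sketch of the shared TTL-constants module referenced in the US-02 item above; the module and constant names are hypothetical, but the values mirror the TTLs quoted in this story and its cross-references.
# cache_ttls.py (hypothetical shared module)
PRODUCT_TTL_S = 300            # product:{asin}   (5 min, HLD)
RECO_TTL_S = 900               # reco:*           (15 min)
PROMO_TTL_S = 900              # promo:{section}  (15 min)
REVIEW_TTL_S = 3600            # review:{asin}    (1 hour)
INTENT_SESSION_TTL_S = 90      # intent:sess:*    (US-02)
EMBEDDING_TTL_S = 3600         # emb:*            (US-06)
SESSION_TTL_S = 24 * 3600      # US-05 DynamoDB session TTL

# Invariant from the US-02 conflict mode: a stale intent must not outlive its session.
assert INTENT_SESSION_TTL_S < SESSION_TTL_S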
Rollback & Experimentation
Shadow-Mode Plan
- Run the L1+L2 cache in observe mode for 1 week: cache reads are served, but every cache hit also triggers a background fetch from origin and an equality comparison (see the sketch after this list). Log mismatches; promote to fully cache-served only after the mismatch rate stays < 0.5% for 72h.
- Run cache warming in dry-run for 3 days: log the 500 ASINs that would be pre-fetched, measure overlap with actual peak-hour misses; tune the warming list weekly thereafter.
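A minimal sketch of the shadow-audit hook from the first bullet. The naive JSON equality check and the thread-pool plumbing are assumptions; the real Orchestrator might instead sample a fraction of hits or run the comparison as an async task.
import json
import logging
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger("cache_shadow_audit")
_audit_pool = ThreadPoolExecutor(max_workers=4)

def shadow_audit(key: str, cached_value, origin_fetcher) -> None:
    """On a cache hit, fetch from origin in the background and log any mismatch."""
    def _compare():
        try:
            origin_value = origin_fetcher()
            same = (json.dumps(cached_value, sort_keys=True)
                    == json.dumps(origin_value, sort_keys=True))
            if not same:
                logger.warning("shadow mismatch key=%s", key)
        except Exception:
            logger.exception("shadow audit fetch failed key=%s", key)

    _audit_pool.submit(_compare)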
Canary Thresholds
- Per-keyspace rollout: enable product: first (lowest stampede risk), then reco:, then llmresp:, then emb: (largest, riskiest), with 48h between each.
- Abort criteria (any one trips): combined hit rate < 30% after 1 week, cache-stampede incident count > 0 in production, ElastiCache CPU sustained > 75%, shadow-audit mismatch rate > 0.5%.
Kill Switch
- Single feature flag: cache_enabled. When false, both L1 and L2 are bypassed and all reads hit origins. Critical: verify origin services can absorb the full 1.8M calls/day load before flipping. Coordinated with US-04's auto-scaler — origin services must pre-scale before kill-switch activation.
Quality Regression Criteria (story-specific)
- Combined hit rate floor: ≥ 55% (below this, cache infra cost approaches savings; revert and tune).
- Cache-stampede incidents: 0 per quarter (any incident triggers stampede pattern review).
- Stale-data incidents (user reports wrong price/availability): ≤ 2 per quarter from cache (vs origin-side staleness).
- Redis memory utilization: ≤ 80% sustained (above this, eviction storms degrade hit rate).
Multi-Reviewer Validation Findings & Resolutions
The cross-reviewer pass identified the following story-specific findings. README's "Multi-Reviewer Validation & Cross-Cutting Hardening" section covers concerns that span all stories.
S1 (must-fix before production)
Per-keyspace eviction policies are architecturally impossible on a single Redis cluster. The story proposes noeviction for llmresp: and allkeys-lru for product/reco. Redis applies a single maxmemory-policy per instance (via its parameter group); per-prefix policies do not exist, and logical databases share the same policy, so a SELECT-based split does not help either. Without a fix, the most expensive cache entries (LLM responses) are evicted first under memory pressure. Resolution: split the keyspaces across two ElastiCache replication groups: a small noeviction group for cost-critical state (llmresp:, plus rate: and cost: from US-08) and the main allkeys-lru group for product:, reco:, promo:, intent:sess:, and emb: (int8 quantized). Each story's client connects only to its assigned endpoint, and cross-keyspace writes are blocked with Redis ACL key-pattern rules.
Distributed-monolith SPOF — five stories down if Redis is down. US-01, US-02, US-03 (this story), US-06, US-08 all break at once during a Redis outage. Resolution: explicit per-story fallback, documented here as the canonical contract:
- US-01 (llmresp: miss) → call Bedrock; emit cache_unavailable_event to US-07.
- US-02 (intent:sess: miss) → call SageMaker; degrade rule coverage gracefully.
- US-03 (product:/reco: miss) → call origin services through a concurrency-limited circuit breaker (max 100 req/s per origin to prevent stampede; sketched after this list).
- US-06 (emb: miss) → fresh Titan call; cost spikes ~30% but the feature stays functional.
- US-08 (rate: miss) → fall back to an in-process token bucket (per-task counts, not shared); cost-ledger reads fall back to the authoritative DDB ledger (Redis is just a cache).
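A minimal sketch of the concurrency-limited fallback named in the US-03 bullet: a per-origin token bucket capped at roughly 100 req/s. The class name, burst size, and the degrade-to-None behaviour on rejection are assumptions to be agreed with each origin team.
import threading
import time
from typing import Callable, Optional

class OriginFallbackLimiter:
    """Token-bucket limiter for cache-miss fallback calls to a single origin."""

    def __init__(self, rate_per_s: float = 100.0, burst: float = 100.0):
        self._rate = rate_per_s
        self._capacity = burst
        self._tokens = burst
        self._last = time.monotonic()
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        with self._lock:
            now = time.monotonic()
            self._tokens = min(self._capacity,
                               self._tokens + (now - self._last) * self._rate)
            self._last = now
            if self._tokens >= 1:
                self._tokens -= 1
                return True
            return False

catalog_limiter = OriginFallbackLimiter(rate_per_s=100)

def fetch_product_fallback(asin: str, origin_call: Callable[[str], dict]) -> Optional[dict]:
    # During a Redis outage, hit the origin only while within the rate budget;
    # otherwise degrade (caller renders the reply without product enrichment).
    if catalog_limiter.try_acquire():
        return origin_call(asin)
    return None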
Cache invalidation handler stampede / DoS surface. The SNS-triggered Lambda calls SCAN across millions of keys and then DELs each match, which is slow and stampede-prone; a misconfigured upstream can flood the topic. Resolution: (a) restrict the SNS topic with a topic policy scoped to specific service ARNs; (b) replace SNS with SQS FIFO (deduplication, ordered) so duplicates collapse; (c) include the ASINs in the batch key name (product_batch:{asin1}:{asin2}:...) so invalidation becomes a direct DEL of a computed key, with no SCAN needed; (d) rate-limit the Lambda concurrency (max 5 concurrent invocations).
Cache invalidation lacks cryptographic origin check. Anyone publishing to the SNS topic can cause cache invalidation. Resolution: topic policy whitelist by source service IAM ARN; reject events without trusted-source SNS signing key.
S2 (fix before scale-up)
L1 in-process LRU may be over-engineering. The Architect reviewer flagged the L1+L2 design as marginal benefit for added complexity (cross-task inconsistency, double-invalidation logic). Resolution: keep L1 only for the top-50 ASIN slice (small footprint, high reuse) with 30s TTL; remove L1 for reco:, promo:, review: keyspaces where Redis-network latency is the dominant cost component. Measure L1 hit rate post-launch; remove L1 entirely if hit rate < 25%.
Cache warming race window. Pre-fetch at 8:30am JST but traffic starts at 8:15am — first 15 minutes hit cold cache and stampede the origin. Resolution: start warming at 8:00am; use probabilistic early expiration (cache entries at 80% TTL begin async refresh) to avoid hard expiry storms; soft-refresh top 50 ASIN keys at the top of every hour.
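A simplified sketch of the probabilistic early-refresh idea above. It wraps each value in a small envelope carrying its creation time (unlike the plain JSON values used elsewhere in this story) and refreshes inline rather than asynchronously; both are simplifying assumptions.
import json
import random
import time

EARLY_REFRESH_FRACTION = 0.8     # start refreshing once 80% of the TTL has elapsed
EARLY_REFRESH_PROBABILITY = 0.1  # fraction of readers past the threshold that rebuild

def get_with_early_refresh(redis_client, key, fetcher, ttl=600):
    """Refresh hot keys before they expire to avoid hard-expiry miss storms."""
    raw = redis_client.get(key)
    if raw is not None:
        entry = json.loads(raw)
        age = time.time() - entry["created_at"]
        past_threshold = age >= ttl * EARLY_REFRESH_FRACTION
        if not past_threshold or random.random() > EARLY_REFRESH_PROBABILITY:
            return entry["value"]
        # else: this reader volunteers to rebuild early, below
    value = fetcher()
    redis_client.setex(key, ttl, json.dumps({"value": value, "created_at": time.time()}))
    return value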
Working-set estimate omits embedding cache. US-06's emb: keyspace adds ~300 MB (50K embeddings × 6 KB float32). At 1.5 KB int8-quantized that drops to ~75 MB. Resolution: update the working-set table to include emb: row at 75 MB (assuming int8 quantization is mandated in US-06); upsize Redis node from r6g.large to r6g.xlarge (26 GB) for headroom.
Rate-limit log retention contains IPs (PII under GDPR). Resolution: hash IPs (SHA-256(IP + monthly-rotated salt)) before logging; cap log retention at 30 days.
S3 (acknowledged / future work)
- Cache-key namespace migration to mcrouter (per-task connection pooling) if L1 is removed.
- Cross-region replicated Redis (DR) — out of scope; single-region by current design.
Runbook: Redis Cluster Failure
Symptoms: ElastiCache CloudWatch alarm; sustained Redis connection errors > 10% across all five dependent stories.
Triage (in order):
- Confirm Multi-AZ failover triggered. If primary failed, replica should promote within 60–90s.
- During the 60–90s outage window, all five stories run in fallback mode (above). Validate each story is degrading gracefully — no 5xx storm:
  - US-01: Bedrock spend rate spikes (no LLM cache); the US-08 breaker may trip — that is the expected safety net.
  - US-03: origin services take the traffic; concurrency circuit breakers must be active.
  - US-08: in-process token-bucket fallback; rate-limit accuracy degrades but there is no fail-open.
- After failover, monitor for cache-stampede recovery; warm from US-07's recent-ASIN list to short-circuit cold start.
- If failover does not complete within 5 minutes, escalate; manual instance promotion via CLI as last resort.
Escalation: Origin services must pre-scale 2× headroom for the failover window — confirmed during quarterly DR drill, not during the incident.