US-03: Caching Strategy for Cost Reduction
User Story
As a platform architect, I want to maximize ElastiCache hit rates and implement intelligent cache warming, So that downstream service calls (Product Catalog, Recommendations, Promotions) are reduced by 30-50%, lowering both latency and API costs.
Acceptance Criteria
- Overall cache hit rate for product data ≥ 70%.
- Recommendation cache hit rate ≥ 40% during peak hours.
- Cache warming pre-populates top 500 ASINs before peak traffic starts.
- Event-driven invalidation ensures stale data is never served for more than 60 seconds after a catalog update.
- ElastiCache cluster is right-sized with auto-scaling based on memory utilization.
- Total downstream API calls decrease by 30-50%.
High-Level Design
Cost Problem
Every Orchestrator request fans out to 2-4 downstream services. At 1M messages/day:
- Product Catalog: ~800K calls/day (most messages involve products)
- Recommendation Engine: ~300K calls/day
- Promotions Service: ~500K calls/day
- Reviews Service: ~200K calls/day
Each API call adds latency (50-200ms) and costs compute on the downstream service. Caching offloads this to sub-millisecond Redis reads.
Cache Architecture (Enhanced)
graph TD
subgraph "Request Path"
A[Orchestrator] --> B{Cache<br>Lookup}
B -->|HIT| C[Return Cached Data<br>< 1ms]
B -->|MISS| D[Call Origin Service<br>50-200ms]
D --> E[Populate Cache<br>with TTL]
E --> C
end
subgraph "Cache Warming"
F[Scheduled Warmer<br>Lambda - 8:30am JST] --> G[Fetch Top 500 ASINs]
G --> H[Pre-populate<br>Product Cache]
F --> I[Fetch Active Promos]
I --> J[Pre-populate<br>Promo Cache]
end
subgraph "Cache Invalidation"
K[Catalog Change Event<br>SNS] --> L[Invalidation Handler<br>Lambda]
L --> M[Delete Stale Keys]
N[Promo Change Event<br>SNS] --> L
end
subgraph "ElastiCache Redis Cluster"
C --> O[Product Cache<br>TTL: 5 min]
C --> P[Reco Cache<br>TTL: 15 min]
C --> Q[Promo Cache<br>TTL: 15 min]
C --> R[Review Cache<br>TTL: 1 hour]
end
style C fill:#2d8,stroke:#333
style D fill:#f66,stroke:#333
Cost Impact
| Scenario | Downstream Calls/Day | Monthly API Cost | Cache Cost | Net Savings |
|---|---|---|---|---|
| No cache | 1.8M | ~$2,700 | $0 | — |
| Basic cache (LLD-10) | 900K | ~$1,350 | ~$200 | ~$1,150/month |
| Optimized cache + warming | 550K | ~$825 | ~$250 | ~$1,625/month |
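The net-savings figures in the table can be reproduced with a back-of-envelope model. The sketch below is illustrative only: the ~$0.05 per 1K downstream calls unit cost is the assumption used later in Math Validation, and the cache-cost figures are taken from the table rather than from billing data.
DAYS_PER_MONTH = 30
COST_PER_1K_CALLS = 0.05  # USD; assumed commercial API tier (see Math Validation)

def monthly_total(calls_per_day: int, cache_cost: float) -> float:
    """Monthly downstream API spend plus cache infrastructure cost."""
    api_cost = calls_per_day * DAYS_PER_MONTH / 1_000 * COST_PER_1K_CALLS
    return api_cost + cache_cost

baseline = monthly_total(1_800_000, 0)      # ~$2,700
basic = monthly_total(900_000, 200)         # ~$1,550 (net savings ~$1,150)
optimized = monthly_total(550_000, 250)     # ~$1,075
print(round(baseline - optimized))          # ~1625, matching the table's net savings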
Low-Level Design
1. Multi-Layer Cache with Read-Through Pattern
graph LR
A[Request] --> B[L1: In-Process<br>LRU Cache<br>TTL: 30s]
B -->|miss| C[L2: ElastiCache<br>Redis Cluster<br>TTL: varies]
C -->|miss| D[L3: Origin<br>Service]
D --> E[Populate L2 + L1]
style B fill:#2d8,stroke:#333
style C fill:#fd2,stroke:#333
style D fill:#f66,stroke:#333
Code Example: Multi-Layer Cache Client
import json
import time
from collections import OrderedDict
from dataclasses import dataclass
from typing import Any, Callable, Optional
import redis
@dataclass
class CacheEntry:
value: Any
created_at: float
ttl_seconds: int
@property
def is_expired(self) -> bool:
return time.time() - self.created_at > self.ttl_seconds
class L1Cache:
"""In-process LRU cache for ultra-hot data."""
def __init__(self, max_size: int = 1000, default_ttl: int = 30):
self._store: OrderedDict[str, CacheEntry] = OrderedDict()
self._max_size = max_size
self._default_ttl = default_ttl
def get(self, key: str) -> Optional[Any]:
entry = self._store.get(key)
if entry is None:
return None
if entry.is_expired:
del self._store[key]
return None
self._store.move_to_end(key)
return entry.value
def put(self, key: str, value: Any, ttl: Optional[int] = None) -> None:
if len(self._store) >= self._max_size:
self._store.popitem(last=False)
self._store[key] = CacheEntry(
value=value,
created_at=time.time(),
ttl_seconds=ttl or self._default_ttl,
)
def invalidate(self, key: str) -> None:
self._store.pop(key, None)
class MultiLayerCache:
"""Two-layer cache: in-process LRU (L1) + Redis (L2)."""
def __init__(self, redis_client: redis.Redis, l1_max_size: int = 1000):
self._l1 = L1Cache(max_size=l1_max_size)
self._redis = redis_client
self._hits = {"l1": 0, "l2": 0, "miss": 0}
def get_or_fetch(
self,
key: str,
fetcher: Callable[[], Any],
l1_ttl: int = 30,
l2_ttl: int = 300,
) -> Any:
# L1 check
value = self._l1.get(key)
if value is not None:
self._hits["l1"] += 1
return value
# L2 check
raw = self._redis.get(key)
if raw is not None:
self._hits["l2"] += 1
value = json.loads(raw)
self._l1.put(key, value, ttl=l1_ttl)
return value
# Origin fetch
self._hits["miss"] += 1
value = fetcher()
self._redis.setex(key, l2_ttl, json.dumps(value))
self._l1.put(key, value, ttl=l1_ttl)
return value
def invalidate(self, key: str) -> None:
self._l1.invalidate(key)
self._redis.delete(key)
def get_stats(self) -> dict:
total = sum(self._hits.values()) or 1
return {
"l1_hit_rate": self._hits["l1"] / total,
"l2_hit_rate": self._hits["l2"] / total,
"miss_rate": self._hits["miss"] / total,
"total_requests": total,
}
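A usage sketch for the client above, continuing from the same imports. The catalog_client object and its get_product() call are placeholders for the internal Product Catalog API, which is not part of this story's code.
redis_client = redis.Redis(host="localhost", port=6379, decode_responses=True)  # placeholder endpoint
cache = MultiLayerCache(redis_client)

def get_product_cached(asin: str, catalog_client) -> dict:
    """Read-through lookup: L1, then Redis, then the origin catalog call."""
    return cache.get_or_fetch(
        key=f"product:{asin}",
        fetcher=lambda: catalog_client.get_product(asin),  # hypothetical origin call
        l1_ttl=30,    # in-process copy
        l2_ttl=300,   # Redis copy, 5-minute product TTL from the HLD
    )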
2. Cache Key Design
graph TD
subgraph "Cache Key Patterns"
A["product:{asin}<br>e.g. product:B08X1YRSTR"]
B["reco:{user_id}:{seed_asin}<br>e.g. reco:C123:B08X1YRSTR"]
C["reco:anon:{context_hash}<br>e.g. reco:anon:a3f8b2"]
D["promo:{store_section}<br>e.g. promo:manga-home"]
E["review:{asin}<br>e.g. review:B08X1YRSTR"]
F["product_batch:{hash}<br>e.g. product_batch:c8d2e1"]
end
Code Example: Cache Key Builder
import hashlib
class CacheKeyBuilder:
"""Consistent cache key generation for all cached data types."""
@staticmethod
def product(asin: str) -> str:
return f"product:{asin}"
@staticmethod
def recommendation(user_id: str, seed_asin: str) -> str:
if user_id:
return f"reco:{user_id}:{seed_asin}"
        # No user ID: key on the seed ASIN alone; use recommendation_anonymous()
        # when only browsing context is available.
        return f"reco:anon:{seed_asin}"
@staticmethod
def recommendation_anonymous(browsing_history: list[str]) -> str:
context = "|".join(sorted(browsing_history[-5:]))
context_hash = hashlib.sha256(context.encode()).hexdigest()[:8]
return f"reco:anon:{context_hash}"
@staticmethod
def promotion(store_section: str) -> str:
return f"promo:{store_section}"
@staticmethod
def review(asin: str) -> str:
return f"review:{asin}"
@staticmethod
def product_batch(asins: list[str]) -> str:
"""Key for batch product lookups (e.g., cart or recommendation results)."""
sorted_asins = sorted(asins)
batch_hash = hashlib.sha256(
"|".join(sorted_asins).encode()
).hexdigest()[:12]
return f"product_batch:{batch_hash}"
3. Cache Warming Strategy
Pre-populate the cache with high-demand data before peak hours.
sequenceDiagram
participant Scheduler as EventBridge<br>Rule (8:30am JST)
participant Warmer as Cache Warmer<br>Lambda
participant Analytics as Redshift
participant Catalog as Product Catalog
participant Promos as Promotions Service
participant Cache as ElastiCache
Scheduler->>Warmer: Trigger warm-up
Warmer->>Analytics: Query top 500 ASINs<br>(last 7 days by traffic)
Analytics-->>Warmer: ASIN list
loop For each ASIN batch (50 at a time)
Warmer->>Catalog: Batch get product details
Catalog-->>Warmer: Product data
Warmer->>Cache: MSET product:{asin} entries
end
Warmer->>Promos: Get all active promotions
Promos-->>Warmer: Promo list
Warmer->>Cache: SET promo:{section} entries
Warmer->>Warmer: Log: warmed {count} products, {count} promos
Code Example: Cache Warmer Lambda
import json
import logging
import time
from typing import Any
import boto3
import redis
logger = logging.getLogger()
logger.setLevel(logging.INFO)
# Initialize outside handler for connection reuse
redis_client = redis.Redis(
host="manga-cache.xxxxxx.ng.0001.apne1.cache.amazonaws.com",
port=6379,
ssl=True,
decode_responses=True,
)
redshift_client = boto3.client("redshift-data")
def handler(event: dict, context: Any) -> dict:
"""Cache warmer Lambda — triggered daily before peak traffic."""
# 1. Get top ASINs from analytics
top_asins = _get_top_asins(limit=500)
logger.info(f"Warming cache for {len(top_asins)} ASINs")
# 2. Batch-fetch product details and populate cache
products_warmed = 0
batch_size = 50
for i in range(0, len(top_asins), batch_size):
batch = top_asins[i : i + batch_size]
products = _batch_get_products(batch)
pipeline = redis_client.pipeline()
for asin, product_data in products.items():
key = f"product:{asin}"
            pipeline.setex(key, 300, json.dumps(product_data))  # 5 min TTL, matching the HLD product TTL cap
products_warmed += 1
pipeline.execute()
# 3. Warm promotions cache
promos = _get_active_promotions()
promo_pipeline = redis_client.pipeline()
for section, promo_list in promos.items():
key = f"promo:{section}"
promo_pipeline.setex(key, 900, json.dumps(promo_list)) # 15 min TTL
promo_pipeline.execute()
result = {
"products_warmed": products_warmed,
"promo_sections_warmed": len(promos),
"status": "success",
}
logger.info(f"Cache warming complete: {result}")
return result
def _get_top_asins(limit: int) -> list[str]:
"""Query Redshift for the most-accessed ASINs in the last 7 days."""
response = redshift_client.execute_statement(
ClusterIdentifier="manga-analytics",
Database="chatbot",
Sql=f"""
SELECT DISTINCT products_shown AS asin, COUNT(*) AS cnt
FROM chatbot_events
WHERE created_at > DATEADD(day, -7, GETDATE())
AND products_shown IS NOT NULL
GROUP BY products_shown
ORDER BY cnt DESC
LIMIT {limit}
""",
)
    # Poll until the query completes (the Redshift Data API has no boto3 waiters)
    statement_id = response["Id"]
    while True:
        status = redshift_client.describe_statement(Id=statement_id)["Status"]
        if status == "FINISHED":
            break
        if status in ("FAILED", "ABORTED"):
            raise RuntimeError(f"Top-ASIN query {status}: {statement_id}")
        time.sleep(1)
result = redshift_client.get_statement_result(Id=statement_id)
return [row[0]["stringValue"] for row in result["Records"]]
def _batch_get_products(asins: list[str]) -> dict:
"""Fetch product details from catalog service."""
# Simulated — in production this calls the internal Product Catalog API
dynamodb = boto3.resource("dynamodb")
products = {}
keys = [{"asin": asin} for asin in asins]
response = dynamodb.batch_get_item(
RequestItems={"product_catalog": {"Keys": keys}}
)
for item in response.get("Responses", {}).get("product_catalog", []):
products[item["asin"]] = {
"title": item.get("title"),
"price": item.get("price"),
"format": item.get("format"),
"availability": item.get("availability"),
}
return products
def _get_active_promotions() -> dict[str, list]:
"""Fetch all active promotions grouped by store section."""
# Simulated — in production this calls the Promotions Service API
return {
"manga-home": [
{"title": "Manga Sale", "discount": "20% off"},
],
}
4. Event-Driven Cache Invalidation
sequenceDiagram
participant CatalogService as Product Catalog
participant SNS as SNS Topic<br>catalog-changes
participant Lambda as Invalidation<br>Lambda
participant Cache as ElastiCache
CatalogService->>SNS: Publish {event: "product_updated", asin: "B08X1YRSTR"}
SNS->>Lambda: Trigger
Lambda->>Cache: DEL product:B08X1YRSTR
Lambda->>Cache: DEL review:B08X1YRSTR
Lambda->>Lambda: Log invalidation event
Code Example: Cache Invalidation Handler
import json
import logging
import redis
logger = logging.getLogger()
logger.setLevel(logging.INFO)
redis_client = redis.Redis(
host="manga-cache.xxxxxx.ng.0001.apne1.cache.amazonaws.com",
port=6379,
ssl=True,
decode_responses=True,
)
def handler(event: dict, context) -> dict:
"""SNS-triggered Lambda to invalidate stale cache entries."""
invalidated = 0
for record in event.get("Records", []):
message = json.loads(record["Sns"]["Message"])
event_type = message.get("event")
asin = message.get("asin")
section = message.get("store_section")
if event_type == "product_updated" and asin:
keys = [f"product:{asin}", f"review:{asin}"]
redis_client.delete(*keys)
# Also invalidate any batch keys containing this ASIN
_invalidate_batch_keys(asin)
invalidated += len(keys)
logger.info(f"Invalidated product cache for {asin}")
elif event_type == "promotion_changed" and section:
key = f"promo:{section}"
redis_client.delete(key)
invalidated += 1
logger.info(f"Invalidated promo cache for section {section}")
elif event_type == "product_deleted" and asin:
keys = [f"product:{asin}", f"review:{asin}"]
redis_client.delete(*keys)
invalidated += len(keys)
logger.info(f"Purged all cache for deleted ASIN {asin}")
return {"invalidated_keys": invalidated}
def _invalidate_batch_keys(asin: str) -> None:
"""Scan and remove batch keys that include the updated ASIN."""
# Use SCAN to find matching batch keys (bounded iteration)
cursor = 0
while True:
cursor, keys = redis_client.scan(
cursor=cursor, match="product_batch:*", count=100
)
for key in keys:
raw = redis_client.get(key)
if raw and asin in raw:
redis_client.delete(key)
if cursor == 0:
break
ElastiCache Right-Sizing
Sizing Decision Tree
graph TD
A[Estimate Working Set Size] --> B{< 6 GB?}
B -->|Yes| C[cache.r6g.large<br>13 GB, $0.166/hr<br>~$120/month]
B -->|No| D{< 13 GB?}
D -->|Yes| E[cache.r6g.xlarge<br>26 GB, $0.332/hr<br>~$240/month]
D -->|No| F[cache.r6g.2xlarge<br>52 GB, $0.664/hr<br>~$478/month]
G[Enable Auto-scaling] --> H[Scale based on<br>memory utilization]
H --> I{> 75% memory?}
I -->|Yes| J[Scale up node type]
I -->|No| K{< 30% memory?}
K -->|Yes| L[Scale down node type]
Working Set Estimate
| Data Type | Avg Entry Size | Max Entries | Total Size |
|---|---|---|---|
| Product details | 2 KB | 50,000 ASINs | 100 MB |
| Recommendations | 1 KB | 100,000 user×ASIN combos | 100 MB |
| Promotions | 500 bytes | 1,000 entries | 0.5 MB |
| Reviews | 500 bytes | 50,000 ASINs | 25 MB |
| Semantic LLM cache | 3 KB | 50,000 entries | 150 MB |
| Total | | | ~375 MB |
A cache.r6g.large (13 GB) is sufficient with room for growth.
Monitoring and Metrics
| Metric | Target | Alert |
|---|---|---|
| L1 cache hit rate | ≥ 30% | < 20% |
| L2 (Redis) cache hit rate | ≥ 50% | < 35% |
| Combined hit rate | ≥ 70% | < 55% |
| Cache eviction rate | < 1% of entries/hour | > 5% |
| Redis memory utilization | 40-70% | > 80% |
| Cache warming success | 100% daily | Any failure |
| Invalidation latency | < 5 seconds | > 30 seconds |
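To give these targets a data source, the counters from MultiLayerCache.get_stats() can be flushed to CloudWatch on a timer. A minimal sketch follows; the namespace and metric names are illustrative assumptions, not an agreed contract.
import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_cache_metrics(cache) -> None:
    """Publish L1/L2 hit rates so the alert thresholds above can be monitored."""
    stats = cache.get_stats()  # MultiLayerCache from the LLD above
    cloudwatch.put_metric_data(
        Namespace="MangaChatbot/Cache",  # illustrative namespace
        MetricData=[
            {"MetricName": "L1HitRate", "Value": stats["l1_hit_rate"]},
            {"MetricName": "L2HitRate", "Value": stats["l2_hit_rate"]},
            {"MetricName": "CombinedHitRate",
             "Value": stats["l1_hit_rate"] + stats["l2_hit_rate"]},
        ],
    )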
Risks and Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Cache stampede on popular ASIN invalidation | Many concurrent origin fetches | Use lock-based cache rebuild (only one request fetches; others wait) |
| Redis cluster failure | All requests hit origin services | Origin services must handle full load; monitor Redis health |
| Over-caching stale data | User sees wrong price/availability | Prices are NEVER cached; product TTL capped at 5 minutes |
| L1 cache inconsistency across ECS tasks | Different tasks serve different data | Short L1 TTL (30s) limits divergence window |
Deep Dive: Why This Works on a Manga Chatbot Workload
Caching is the highest-leverage optimization in this collection because it is the only one that scales sub-linearly with traffic — a 10× traffic spike does not produce 10× downstream cost if the cache hit rate is high. The reason this story projects 30–50% downstream API cost reduction is not that "cache is fast"; it's that manga-chatbot read patterns have three properties that make caching unusually effective.
Property 1: Read-to-write ratio is extreme. Manga catalog products are written by merchandising tooling on a daily-or-slower cadence (new volume releases, price updates, inventory sync). They are read by users millions of times per day. The effective read:write ratio for the catalog is on the order of 10⁵:1. Caching is overwhelmingly favorable at such ratios — the cost of a miss-and-fill is amortized over 100K+ subsequent hits. The 5-minute product TTL is not chosen because data changes that often; it is chosen because availability (in-stock status) does, and a 5-minute window is the maximum tolerable inventory staleness for chat responses. The architectural assumption is that prices are excluded from the cache (story line 538 explicitly forbids it) — pricing changes propagate through a different path with a hard contract.
Property 2: Access distribution follows a steep Zipf curve. The top 500 ASINs in any manga store account for the majority of chat queries on any given day (popular titles dominate; long-tail catalog browsing is rare in chat versus the storefront). This is why the cache-warming Lambda only pre-fetches 500 ASINs — that subset captures the bulk of cold-start misses. The two-tier (L1 in-process LRU + L2 Redis) design exists because Zipf is fractal: the top 50 ASINs are even hotter than the top 500, so a small in-process cache absorbs them at near-zero latency without paying the Redis network round-trip. The 30-second L1 TTL is chosen short enough that cross-task inconsistency (different ECS tasks serving different snapshots) cannot persist long enough to confuse a single user session.
Property 3: Cache-stampede risk is concentrated, not uniform. When a popular ASIN's cache entry expires or is invalidated, all in-flight requests for that ASIN miss simultaneously — and a single miss-and-fill can become hundreds of concurrent origin fetches against the catalog API. This is the thundering-herd / cache-stampede pattern (Mogul & Padmanabhan 1996). The lock-based rebuild pattern (only one request fetches; others wait on the lock) is non-optional for a chatbot workload because (a) the hot keys are well-known, and (b) origin services (catalog API, recommendations API) are themselves rate-limited. Without stampede protection, a single key invalidation during peak traffic can cascade into a downstream outage.
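The lock-based rebuild described above can be sketched with Redis SET NX as the rebuild lock. This is a minimal sketch, not the story's agreed implementation: the lock-key convention, timeouts, and the direct-fetch fallback on timeout are assumptions.
import json
import time

def get_with_stampede_protection(redis_client, key, fetcher, ttl=300,
                                 lock_ttl=10, poll_interval=0.05, max_wait=2.0):
    """Lock-based rebuild: one caller fetches from origin, the rest wait briefly."""
    raw = redis_client.get(key)
    if raw is not None:
        return json.loads(raw)
    lock_key = f"lock:{key}"  # hypothetical lock-key convention
    # SET NX EX: the first caller to acquire the lock performs the rebuild
    if redis_client.set(lock_key, "1", nx=True, ex=lock_ttl):
        try:
            value = fetcher()  # single origin fetch for this key
            redis_client.setex(key, ttl, json.dumps(value))
            return value
        finally:
            redis_client.delete(lock_key)
    # Another caller holds the lock: poll for the repopulated key
    deadline = time.time() + max_wait
    while time.time() < deadline:
        time.sleep(poll_interval)
        raw = redis_client.get(key)
        if raw is not None:
            return json.loads(raw)
    return fetcher()  # last resort: fetch directly rather than fail the chat turn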
Bottom line: the savings come from the multiplicative product of read amplification (10⁵:1 R:W) and access skew (top 500 ASINs cover most queries). The cache is not just a latency optimization — it is the load isolation layer that lets US-04's compute and US-05's DDB scale linearly with unique requests instead of total requests.
Real-World Validation
Industry Benchmarks & Case Studies
- Pinterest engineering blog: "Caching at Pinterest" — Reports >90% hit rate on product-catalog-style read paths with multi-tier (L1 process + L2 distributed) caching. This story's combined 70% target is conservative against Pinterest-class results, reflecting the higher write volatility on a manga catalog (new releases, language availability flips).
- Mogul & Padmanabhan (1996), "Performance issues in WWW servers" — Foundational paper on cache-stampede / thundering-herd. The lock-based rebuild pattern in this story (line 536) traces directly back to this work; lock-based mitigation is the textbook fix.
- AWS ElastiCache Well-Architected pillar (sizing guidance) — Recommends 30–60% steady-state memory utilization with headroom for traffic spikes and rehashing. The story's 40–70% target band aligns with AWS guidance.
- Discord engineering: "Storing billions of messages" / Twitter Pelikan — Both publicly document the L1+L2 pattern with similar TTL philosophies (short L1 to bound inconsistency, longer L2 for working-set capture). Validates the architectural choice.
- DoorDash engineering blog: "Eliminating thundering herd at scale" — Documents cache-rebuild stampede as a production incident; their fix (probabilistic early expiration + lock-based rebuild) extends the pattern in this story.
- Internal cross-reference: POC-to-Production-War-Story/02-seven-production-catastrophes.md — The "RAG recall collapse" and "WebSocket meltdown" catastrophes both had cache stampede as a contributing factor.
- Internal cross-reference: RAG-MCP-Integration/01-catalog-search-mcp.md — The catalog MCP server is the primary client of this cache; its hit-rate metrics flow into the dashboards here.
Math Validation
- ElastiCache cache.r6g.large (13 GB, 2 vCPU): ~$0.215/hr (us-east-1) × 730 hrs × 2 nodes (Multi-AZ) = ~$314/month. The story doesn't claim a baseline ElastiCache cost; the working-set estimate (375 MB) confirms r6g.large is correctly sized — the workload could even fit cache.t4g.medium (3.09 GB, ~$0.064/hr) at ~$94/month for two-node Multi-AZ if traffic projections hold. Worth re-evaluating once steady-state hit rates are measured.
- 1.8M downstream API calls/day at ~$0.05/1K calls (typical commercial API tier) = ~$2.7K/month — matches story baseline. ✅
- At 70% combined hit rate, 1.8M × 0.30 = 540K origin calls/day → ~$810/month. Plus cache infra (~$314) = ~$1,124/month total. The story's "$825/month optimized" excludes the cache cost itself; the combined-cost saving is ~58% (vs 70% raw API saving) — the story should clarify whether savings include or exclude cache infra.
Conservative vs Aggressive Savings Bounds
| Bound | Assumptions | Total monthly savings |
|---|---|---|
| Conservative | 50% combined hit rate, no L1, no warming | ~25% (~$675/month gross, ~$360 net of infra) |
| Aggressive | 85% combined hit rate (Pinterest-class), full warming, probabilistic early expiration | ~60% (~$1,620/month gross, ~$1,300 net) |
| Story's projection | Realistic mid-band; depends on Zipf curve steepness for this catalog | 30–50% (~$1,000–$1,400 gross) |
Cross-Story Interactions & Conflicts
This story is the shared infrastructure layer for several other stories. Keyspace and TTL coordination is centralized here.
- US-01 (LLM Tokens) — The semantic response cache from US-01 lives on this Redis tier under the reserved prefix llmresp:. Conflict mode: during memory pressure, default LRU eviction can evict large LLM-cache entries (~3 KB each) before smaller product-cache entries (~2 KB), even though LLM cache misses cost 100× more. Resolution: dedicate separate eviction behavior (see the S1 finding below on per-keyspace eviction) so llmresp: is never evicted while the product cache stays on allkeys-lru.
- US-02 (Intent Classifier) — Session intent cache from US-02 uses prefix intent:sess: with TTL 90s. Conflict mode: the session intent TTL must be shorter than the user-session TTL (US-05's 24h DDB session) so a stale intent never outlives its session. Resolution: centralize TTL constants in a shared config module that both stories import (see the sketch after this list).
- US-05 (DynamoDB) — This story is authoritative for the cache-DDB contract. Conflict mode: during an ElastiCache failover (~30–90 second window), all reads miss and stampede DynamoDB. If DDB is in on-demand mode (US-05) the spike is absorbed at higher cost; if it is in provisioned mode it throttles. Resolution: a circuit breaker between cache miss and DDB fallback (limit fallback concurrency to 100 req/s per origin); pre-provisioned DDB headroom for the failover scenario.
- US-06 (RAG) — Embedding cache from US-06 uses prefix emb: with a 1h TTL. Per-key size is 1.5–6 KB (1536-dim float32 = 6 KB; quantized int8 = 1.5 KB). At 50K cached embeddings × 6 KB = 300 MB, almost as large as this story's current working-set estimate of 375 MB. Resolution: extend the working-set table to include an emb: row; consider int8 quantization to halve memory; bump node sizing to cache.r6g.large minimum (no t4g) to absorb this.
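A minimal sketch of the shared TTL-constants module referenced in the US-02 item above; the module and constant names are hypothetical, but the values mirror the TTLs quoted in this story and its cross-references.
# cache_ttls.py (hypothetical shared module)
PRODUCT_TTL_S = 300            # product:{asin}   (5 min, HLD)
RECO_TTL_S = 900               # reco:*           (15 min)
PROMO_TTL_S = 900              # promo:{section}  (15 min)
REVIEW_TTL_S = 3600            # review:{asin}    (1 hour)
INTENT_SESSION_TTL_S = 90      # intent:sess:*    (US-02)
EMBEDDING_TTL_S = 3600         # emb:*            (US-06)
SESSION_TTL_S = 24 * 3600      # US-05 DynamoDB session TTL

# Invariant from the US-02 conflict mode: a stale intent must not outlive its session.
assert INTENT_SESSION_TTL_S < SESSION_TTL_S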
Rollback & Experimentation
Shadow-Mode Plan
- Run the L1+L2 cache in observe mode for 1 week: cache reads are served, but every cache hit also triggers a background fetch from origin and an equality comparison (see the sketch after this list). Log mismatches; promote to fully cache-served only after the mismatch rate stays < 0.5% for 72h.
- Run cache warming in dry-run for 3 days: log the 500 ASINs that would be pre-fetched, measure overlap with actual peak-hour misses; tune the warming list weekly thereafter.
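A minimal sketch of the shadow-audit hook from the first bullet. The naive JSON equality check and the thread-pool plumbing are assumptions; the real Orchestrator might instead sample a fraction of hits or run the comparison as an async task.
import json
import logging
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger("cache_shadow_audit")
_audit_pool = ThreadPoolExecutor(max_workers=4)

def shadow_audit(key: str, cached_value, origin_fetcher) -> None:
    """On a cache hit, fetch from origin in the background and log any mismatch."""
    def _compare():
        try:
            origin_value = origin_fetcher()
            same = (json.dumps(cached_value, sort_keys=True)
                    == json.dumps(origin_value, sort_keys=True))
            if not same:
                logger.warning("shadow mismatch key=%s", key)
        except Exception:
            logger.exception("shadow audit fetch failed key=%s", key)

    _audit_pool.submit(_compare)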
Canary Thresholds
- Per-keyspace rollout: enable product: first (lowest stampede risk), then reco:, then llmresp:, then emb: (largest, riskiest), with 48h between each.
- Abort criteria (any one trips): combined hit rate < 30% after 1 week, cache-stampede incident count > 0 in production, ElastiCache CPU sustained > 75%, shadow-audit mismatch rate > 0.5%.
Kill Switch
- Single feature flag: cache_enabled. When false, both L1 and L2 are bypassed and all reads hit origins. Critical: verify origin services can absorb the full 1.8M calls/day load before flipping. Coordinated with US-04's auto-scaler — origin services must pre-scale before kill-switch activation.
Quality Regression Criteria (story-specific)
- Combined hit rate floor: ≥ 55% (below this, cache infra cost approaches savings; revert and tune).
- Cache-stampede incidents: 0 per quarter (any incident triggers stampede pattern review).
- Stale-data incidents (user reports wrong price/availability): ≤ 2 per quarter from cache (vs origin-side staleness).
- Redis memory utilization: ≤ 80% sustained (above this, eviction storms degrade hit rate).
Multi-Reviewer Validation Findings & Resolutions
The cross-reviewer pass identified the following story-specific findings. README's "Multi-Reviewer Validation & Cross-Cutting Hardening" section covers concerns that span all stories.
S1 (must-fix before production)
Per-keyspace eviction policies are architecturally impossible on a single Redis cluster. The story proposes noeviction for llmresp: and allkeys-lru for product/reco. Redis applies a single maxmemory-policy per instance (via its parameter group); per-prefix policies do not exist, and logical databases share the same policy, so a SELECT-based split does not help either. Without a fix, the most expensive cache entries (LLM responses) are evicted first under memory pressure. Resolution: split the keyspaces across two ElastiCache replication groups: a small noeviction group for cost-critical state (llmresp:, plus rate: and cost: from US-08) and the main allkeys-lru group for product:, reco:, promo:, intent:sess:, and emb: (int8 quantized). Each story's client connects only to its assigned endpoint, and cross-keyspace writes are blocked with Redis ACL key-pattern rules.
Distributed-monolith SPOF — five stories down if Redis is down. US-01, US-02, US-03 (this story), US-06, US-08 all break at once during a Redis outage. Resolution: explicit per-story fallback, documented here as the canonical contract:
- US-01 (llmresp: miss) → call Bedrock; emit cache_unavailable_event to US-07.
- US-02 (intent:sess: miss) → call SageMaker; degrade rule coverage gracefully.
- US-03 (product:/reco: miss) → call origin services through a concurrency-limited circuit breaker (max 100 req/s per origin to prevent stampede; sketched after this list).
- US-06 (emb: miss) → fresh Titan call; cost spikes ~30% but the feature stays functional.
- US-08 (rate: miss) → fall back to an in-process token bucket (per-task counts, not shared); cost-ledger reads fall back to the authoritative DDB ledger (Redis is just a cache).
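A minimal sketch of the concurrency-limited fallback named in the US-03 bullet: a per-origin token bucket capped at roughly 100 req/s. The class name, burst size, and the degrade-to-None behaviour on rejection are assumptions to be agreed with each origin team.
import threading
import time
from typing import Callable, Optional

class OriginFallbackLimiter:
    """Token-bucket limiter for cache-miss fallback calls to a single origin."""

    def __init__(self, rate_per_s: float = 100.0, burst: float = 100.0):
        self._rate = rate_per_s
        self._capacity = burst
        self._tokens = burst
        self._last = time.monotonic()
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        with self._lock:
            now = time.monotonic()
            self._tokens = min(self._capacity,
                               self._tokens + (now - self._last) * self._rate)
            self._last = now
            if self._tokens >= 1:
                self._tokens -= 1
                return True
            return False

catalog_limiter = OriginFallbackLimiter(rate_per_s=100)

def fetch_product_fallback(asin: str, origin_call: Callable[[str], dict]) -> Optional[dict]:
    # During a Redis outage, hit the origin only while within the rate budget;
    # otherwise degrade (caller renders the reply without product enrichment).
    if catalog_limiter.try_acquire():
        return origin_call(asin)
    return None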
Cache invalidation handler stampede / DoS surface. The SNS-triggered Lambda calls SCAN across millions of keys and then DELs each match, which is slow and stampede-prone; a misconfigured upstream can flood the topic. Resolution: (a) restrict the SNS topic with a topic policy scoped to specific service ARNs; (b) replace SNS with SQS FIFO (deduplication, ordered) so duplicates collapse; (c) include the ASINs in the batch key name (product_batch:{asin1}:{asin2}:...) so invalidation becomes a direct DEL of a computed key, with no SCAN needed; (d) rate-limit the Lambda concurrency (max 5 concurrent invocations).
Cache invalidation lacks cryptographic origin check. Anyone publishing to the SNS topic can cause cache invalidation. Resolution: topic policy whitelist by source service IAM ARN; reject events without trusted-source SNS signing key.
S2 (fix before scale-up)
L1 in-process LRU may be over-engineering. The Architect reviewer flagged the L1+L2 design as marginal benefit for added complexity (cross-task inconsistency, double-invalidation logic). Resolution: keep L1 only for the top-50 ASIN slice (small footprint, high reuse) with 30s TTL; remove L1 for reco:, promo:, review: keyspaces where Redis-network latency is the dominant cost component. Measure L1 hit rate post-launch; remove L1 entirely if hit rate < 25%.
Cache warming race window. Pre-fetch at 8:30am JST but traffic starts at 8:15am — first 15 minutes hit cold cache and stampede the origin. Resolution: start warming at 8:00am; use probabilistic early expiration (cache entries at 80% TTL begin async refresh) to avoid hard expiry storms; soft-refresh top 50 ASIN keys at the top of every hour.
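A simplified sketch of the probabilistic early-refresh idea above. It wraps each value in a small envelope carrying its creation time (unlike the plain JSON values used elsewhere in this story) and refreshes inline rather than asynchronously; both are simplifying assumptions.
import json
import random
import time

EARLY_REFRESH_FRACTION = 0.8     # start refreshing once 80% of the TTL has elapsed
EARLY_REFRESH_PROBABILITY = 0.1  # fraction of readers past the threshold that rebuild

def get_with_early_refresh(redis_client, key, fetcher, ttl=600):
    """Refresh hot keys before they expire to avoid hard-expiry miss storms."""
    raw = redis_client.get(key)
    if raw is not None:
        entry = json.loads(raw)
        age = time.time() - entry["created_at"]
        past_threshold = age >= ttl * EARLY_REFRESH_FRACTION
        if not past_threshold or random.random() > EARLY_REFRESH_PROBABILITY:
            return entry["value"]
        # else: this reader volunteers to rebuild early, below
    value = fetcher()
    redis_client.setex(key, ttl, json.dumps({"value": value, "created_at": time.time()}))
    return value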
Working-set estimate omits embedding cache. US-06's emb: keyspace adds ~300 MB (50K embeddings × 6 KB float32). At 1.5 KB int8-quantized that drops to ~75 MB. Resolution: update the working-set table to include emb: row at 75 MB (assuming int8 quantization is mandated in US-06); upsize Redis node from r6g.large to r6g.xlarge (26 GB) for headroom.
Rate-limit log retention contains IPs (PII under GDPR). Resolution: hash IPs (SHA-256(IP + monthly-rotated salt)) before logging; cap log retention at 30 days.
S3 (acknowledged / future work)
- Cache-key namespace migration to mcrouter (per-task connection pooling) if L1 is removed.
- Cross-region replicated Redis (DR) — out of scope; single-region by current design.
Runbook: Redis Cluster Failure
Symptoms: ElastiCache CloudWatch alarm; sustained Redis connection errors > 10% across all five dependent stories.
Triage (in order):
- Confirm Multi-AZ failover triggered. If primary failed, replica should promote within 60–90s.
- During the 60–90s outage window, all five stories run in fallback mode (above). Validate each story is degrading gracefully — no 5xx storm:
  - US-01: Bedrock spend rate spikes (no LLM cache); the US-08 breaker may trip — that is the expected safety net.
  - US-03: origin services take the traffic; concurrency circuit breakers must be active.
  - US-08: in-process token-bucket fallback; rate-limit accuracy degrades but there is no fail-open.
- After failover, monitor for cache-stampede recovery; warm from US-07's recent-ASIN list to short-circuit cold start.
- If failover does not complete within 5 minutes, escalate; manual instance promotion via CLI as last resort.
Escalation: Origin services must pre-scale 2× headroom for the failover window — confirmed during quarterly DR drill, not during the incident.