
US-06: Cache Aggressiveness — Freshness vs Speed vs Cost

User Story

As a platform engineering lead, I want to determine the optimal caching strategy for every data type in the MangaAssist pipeline, so that we maximize cache hit rates (reducing cost and latency) without serving stale data that erodes user trust.

The Debate

graph TD
    subgraph "Cost Team"
        C["Cache everything with long TTLs.<br/>Every cache hit saves an API call<br/>and an LLM invocation.<br/>At 60% hit rate, we save<br/>$120K/month."]
    end

    subgraph "Performance Team"
        P["We agree on caching for speed.<br/>But invalidation latency matters.<br/>If cache invalidation is slow,<br/>we serve stale data AND<br/>the user blames us."]
    end

    subgraph "Inference Team"
        I["Stale cached recommendations<br/>are WORSE than no cache.<br/>If someone bought Volume 1<br/>and we keep recommending it,<br/>we look stupid. Freshness<br/>is non-negotiable for<br/>personalized responses."]
    end

    C ---|"Freshness<br/>concern"| I
    I ---|"Invalidation<br/>complexity"| P
    P ---|"Cache cost vs<br/>savings"| C

    style C fill:#f9d71c,stroke:#333,color:#000
    style P fill:#4ecdc4,stroke:#333,color:#000
    style I fill:#ff6b6b,stroke:#333,color:#000

Acceptance Criteria

  • Cache hit rate ≥ 55% across all cacheable data types.
  • No stale price is ever served (prices are never cached).
  • Recommendation cache invalidates within 60 seconds of a new purchase.
  • Product detail cache invalidates within 5 minutes of a catalog change.
  • Cache infrastructure cost stays under $8,000/month.
  • Cache-related user complaints (stale data) stay under 0.1% of sessions.
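
These criteria are concrete enough to encode directly as alert thresholds. A minimal sketch (metric names and the check shape are assumptions, not from the source):

```python
# Acceptance criteria expressed as monitorable thresholds.
# Metric names are illustrative; wire them to real counters and gauges.
THRESHOLDS = {
    "cache_hit_rate_min": 0.55,             # across all cacheable types
    "reco_invalidation_lag_max_s": 60,      # purchase -> cache clear
    "product_invalidation_lag_max_s": 300,  # catalog change -> cache clear
    "monthly_infra_cost_max_usd": 8_000,
    "stale_complaint_session_rate_max": 0.001,
}

def violated_criteria(metrics: dict[str, float]) -> list[str]:
    """Return the acceptance criteria the current metrics violate."""
    checks = [
        ("cache hit rate below 55%",
         metrics["cache_hit_rate"] < THRESHOLDS["cache_hit_rate_min"]),
        ("recommendation invalidation slower than 60s",
         metrics["reco_invalidation_lag_s"] > THRESHOLDS["reco_invalidation_lag_max_s"]),
        ("stale-data complaints above 0.1% of sessions",
         metrics["stale_complaint_session_rate"] > THRESHOLDS["stale_complaint_session_rate_max"]),
    ]
    return [name for name, failed in checks if failed]
```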

The Caching Spectrum

graph LR
    subgraph "Never Cache"
        NC["Prices<br/>Inventory counts<br/>Cart state<br/>Payment info"]
    end

    subgraph "Short TTL (1-5 min)"
        ST["Product details<br/>Availability status<br/>Search results"]
    end

    subgraph "Medium TTL (15-60 min)"
        MT["Recommendations<br/>Promotions<br/>FAQ answers"]
    end

    subgraph "Long TTL (1-24 hours)"
        LT["Review summaries<br/>Trending lists<br/>Editorial content<br/>User preferences"]
    end

    NC -->|"Freshness<br/>critical"| ST
    ST -->|"Moderate<br/>freshness"| MT
    MT -->|"Low<br/>freshness need"| LT

    style NC fill:#eb3b5a,stroke:#333,color:#fff
    style ST fill:#fd9644,stroke:#333,color:#000
    style MT fill:#f9d71c,stroke:#333,color:#000
    style LT fill:#2d8659,stroke:#333,color:#fff

Full Cache Strategy Table

| Data Type | Cache? | TTL | Invalidation | Staleness Risk | Impact of Stale Data |
| --- | --- | --- | --- | --- | --- |
| Prices | NEVER | N/A | N/A | N/A | Customer charged wrong amount → legal issue |
| Inventory count | NEVER | N/A | N/A | N/A | "In stock" when out of stock → broken promise |
| Cart state | NEVER | N/A | N/A | N/A | Wrong items → checkout failure |
| Product details (title, author, format) | Yes | 5 min | Catalog change event (SNS) | Low | Minor: outdated description, rarely changes |
| Product availability (in stock / out of stock) | Yes | 1 min | Inventory event | Medium | "In stock" when out → frustration |
| Recommendations | Yes | 15 min | Purchase event, new session | Medium | Recommending already-purchased items |
| Promotions | Yes | 15 min | Promo change event (SNS) | Medium | Showing expired deals → mild frustration |
| FAQ answers | Yes | 1 hour | Policy change event | Low | Usually static; risky only for policy changes |
| Review summaries | Yes | 4 hours | Scheduled refresh | Low | Reviews change slowly; minor impact |
| Trending / popular lists | Yes | 1 hour | Scheduled refresh | Low | Staleness is expected for trend data |
| User preference embeddings | Yes | 6 hours | Purchase / browse event | Low | Preferences change slowly |
| Search results | Yes | 3 min | None (TTL only) | Medium | New products won't appear immediately |
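
This table is easiest to enforce if it lives in one place as configuration, so individual services never hard-code TTLs. A minimal sketch in Python (names and structure are assumptions, not from the source):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CachePolicy:
    ttl_seconds: int | None          # None = never cache
    invalidation_event: str | None   # None = TTL-only or scheduled refresh

# Encodes the strategy table above; event names are illustrative.
CACHE_POLICIES = {
    "price": CachePolicy(None, None),
    "inventory_count": CachePolicy(None, None),
    "cart_state": CachePolicy(None, None),
    "product_details": CachePolicy(300, "catalog.changed"),
    "product_availability": CachePolicy(60, "inventory.changed"),
    "recommendations": CachePolicy(900, "order.purchased"),
    "promotions": CachePolicy(900, "promo.changed"),
    "faq_answers": CachePolicy(3600, "policy.changed"),
    "review_summaries": CachePolicy(14400, None),   # scheduled refresh
    "trending_lists": CachePolicy(3600, None),      # scheduled refresh
    "user_preference_embeddings": CachePolicy(21600, "user.activity"),
    "search_results": CachePolicy(180, None),       # TTL only
}

def is_cacheable(data_type: str) -> bool:
    return CACHE_POLICIES[data_type].ttl_seconds is not None
```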

The Hardest Decision: Recommendation Cache

graph TD
    A["User asks: 'What should I read next?'"] --> B{"Cache Hit?"}
    B -->|"Hit"| C["Return cached reco<br/>⏱️ 5ms | 💰 $0"]
    B -->|"Miss"| D["Full pipeline<br/>⏱️ 1,500ms | 💰 $0.015"]

    C --> E{"Is it stale?"}
    E -->|"No — user hasn't<br/>bought anything new"| F["✅ Good response"]
    E -->|"Yes — user just bought<br/>Volume 1 of recommended series"| G["❌ Bad response<br/>Recommends what they just bought"]

    style F fill:#2d8659,stroke:#333,color:#fff
    style G fill:#eb3b5a,stroke:#333,color:#fff
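
The stakes in this decision can be made concrete with expected-value arithmetic over the hit rate. A sketch using the latency and cost figures from the diagram (the figures are the diagram's; the function is illustrative):

```python
def expected_latency_and_cost(hit_rate: float) -> tuple[float, float]:
    """Expected per-request latency (ms) and cost ($) at a given hit rate."""
    hit_ms, miss_ms = 5, 1500        # cache hit vs full pipeline
    hit_usd, miss_usd = 0.0, 0.015
    latency = hit_rate * hit_ms + (1 - hit_rate) * miss_ms
    cost = hit_rate * hit_usd + (1 - hit_rate) * miss_usd
    return latency, cost

# At the conservative 10% recommendation-tier hit rate:
# expected_latency_and_cost(0.10) -> (1350.5, 0.0135)
```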

Recommendation Cache Invalidation Strategy

sequenceDiagram
    participant User
    participant OrderService as Order Service
    participant SNS as SNS Topic
    participant Invalidator as Cache Invalidator
    participant Cache as ElastiCache

    User->>OrderService: Purchase Demon Slayer Vol 1
    OrderService->>SNS: Publish purchase event
    SNS->>Invalidator: Trigger invalidation
    Invalidator->>Cache: DELETE reco:{user_id}:*

    Note over Cache: Next recommendation request<br/>will compute fresh results<br/>that exclude the purchased item

    Note over Invalidator: Target: < 60 seconds<br/>from purchase to invalidation

Why 60 seconds? After a purchase, the user typically returns to browsing 30-120 seconds later. Invalidating within 60 seconds means fresh recommendations are ready before the user asks.
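
A minimal sketch of the invalidator, assuming an SNS-triggered Lambda and a redis-py client (the key pattern follows the diagram; everything else is an assumption):

```python
import json
import os

import redis  # redis-py client

# Assumed: ElastiCache endpoint provided via environment variable.
r = redis.Redis(host=os.environ["CACHE_HOST"], port=6379, decode_responses=True)

def handler(event, context):
    """SNS-triggered invalidator: clear a user's recommendation cache on purchase."""
    for record in event["Records"]:
        message = json.loads(record["Sns"]["Message"])
        user_id = message["user_id"]
        # Redis DEL does not accept wildcards, so SCAN for matching keys
        # and UNLINK them (non-blocking delete).
        for key in r.scan_iter(match=f"reco:{user_id}:*", count=100):
            r.unlink(key)
```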


Semantic Response Cache: The Most Controversial Cache

The Cost Team wants to cache entire LLM responses for semantically similar queries. This is the most contentious caching decision.

graph TD
    A["'What's a good horror manga?'"] --> B["Embed query"]
    B --> C{"Semantic similarity to<br/>cached query > 0.95?"}
    C -->|"Yes"| D["Return cached LLM response<br/>⏱️ 20ms | 💰 $0"]
    C -->|"No"| E["Full LLM pipeline<br/>⏱️ 1,500ms | 💰 $0.015"]

    F["'Recommend me some horror manga'"] --> B
    G["'Best horror manga series?'"] --> B

    style D fill:#2d8659,stroke:#333,color:#fff
    style E fill:#eb3b5a,stroke:#333,color:#fff
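
In code, the similarity gate is a nearest-neighbor lookup followed by a threshold check. A minimal sketch with brute-force cosine similarity (in production this lookup would be the ANN index described below; names are illustrative):

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.95  # from the diagram above

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def lookup_semantic_cache(query_embedding: np.ndarray,
                          cached_entries: list[tuple[np.ndarray, str]]) -> str | None:
    """Return the cached response for the most similar query, if close enough."""
    best_score, best_response = 0.0, None
    for embedding, response in cached_entries:
        score = cosine_similarity(query_embedding, embedding)
        if score > best_score:
            best_score, best_response = score, response
    if best_score > SIMILARITY_THRESHOLD:
        return best_response  # cache hit: ~20 ms, $0
    return None  # cache miss: run the full LLM pipeline
```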

The Arguments For and Against

| Team | Position | Argument |
| --- | --- | --- |
| Cost Team (For) | Cache it | Top 100 queries cover 25% of traffic. At $0.015/request, that's $112K/month saved |
| Inference Team (Against) | Don't cache | Recommendations should be personalized. A cached generic response defeats the purpose |
| Performance Team (Neutral) | Cache FAQ only | FAQ answers are universal; recommendation answers are personal |

The Compromise: Tiered Semantic Cache

graph TD
    subgraph "Tier 1: FAQ Cache (Aggressive)"
        T1["Generic, non-personalized<br/>Similarity threshold: 0.92<br/>TTL: 1 hour<br/>Expected hit rate: 40%"]
    end

    subgraph "Tier 2: Product Info Cache (Moderate)"
        T2["Product-specific, not user-specific<br/>Similarity threshold: 0.95<br/>TTL: 15 min<br/>Expected hit rate: 25%"]
    end

    subgraph "Tier 3: Recommendation Cache (Conservative)"
        T3["User-specific cache key<br/>Similarity threshold: 0.98<br/>TTL: 15 min<br/>Expected hit rate: 10%"]
    end

    subgraph "Never Cached"
        T4["Multi-turn conversations<br/>Cart-dependent queries<br/>Price/availability queries"]
    end

    style T1 fill:#2d8659,stroke:#333,color:#fff
    style T2 fill:#fd9644,stroke:#333,color:#000
    style T3 fill:#f9d71c,stroke:#333,color:#000
    style T4 fill:#eb3b5a,stroke:#333,color:#fff
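
The tiers differ only in threshold, TTL, and whether the cache key is user-scoped, which makes them easy to express as configuration. A sketch (intent names and key format are assumptions):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CacheTier:
    similarity_threshold: float
    ttl_seconds: int
    user_scoped: bool  # whether user_id is part of the cache key

# Tier parameters from the diagram; intent names are illustrative.
TIERS = {
    "faq": CacheTier(0.92, 3600, user_scoped=False),
    "product_info": CacheTier(0.95, 900, user_scoped=False),
    "recommendation": CacheTier(0.98, 900, user_scoped=True),
}
NEVER_CACHED = {"multi_turn", "cart_dependent", "price_availability"}

def cache_key(intent: str, query_hash: str, user_id: str) -> str | None:
    if intent in NEVER_CACHED:
        return None  # always run the full pipeline
    tier = TIERS[intent]
    scope = user_id if tier.user_scoped else "global"
    return f"sem:{intent}:{scope}:{query_hash}"
```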

Cache Cost vs Savings Analysis

Monthly Cache Infrastructure Cost

| Component | Configuration | Monthly Cost |
| --- | --- | --- |
| ElastiCache Redis (r6g.xlarge, 2 nodes) | 26 GB memory, multi-AZ | $4,800 |
| Semantic cache index (small OpenSearch) | For response similarity search | $1,200 |
| Cache invalidation Lambda | Event-driven, ~500K invocations/month | $200 |
| Total | | $6,200 |

Monthly Savings from Caching

| Cache Type | Hit Rate | Requests/Day Saved | Savings/Request | Monthly Savings |
| --- | --- | --- | --- | --- |
| Product detail cache | 70% | 700,000 | $0.002 (API cost avoided) | $42,000 |
| Recommendation cache | 10% | 40,000 | $0.015 (LLM cost avoided) | $18,000 |
| FAQ semantic cache | 40% | 120,000 | $0.012 (LLM + RAG avoided) | $43,200 |
| Promotion cache | 65% | 130,000 | $0.001 (API cost avoided) | $3,900 |
| Total savings | | | | $107,100 |

ROI: $107,100 savings / $6,200 cost ≈ 17.3x return
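
Both tables and the ROI figure can be reproduced with straightforward arithmetic. A sketch, assuming a 30-day month:

```python
DAYS_PER_MONTH = 30  # assumption used to reproduce the table's figures

caches = {
    # name: (requests/day saved, savings per request in $)
    "product_detail": (700_000, 0.002),
    "recommendation": (40_000, 0.015),
    "faq_semantic": (120_000, 0.012),
    "promotion": (130_000, 0.001),
}

monthly_savings = sum(reqs * per_req * DAYS_PER_MONTH
                      for reqs, per_req in caches.values())
monthly_cost = 4_800 + 1_200 + 200  # ElastiCache + OpenSearch + Lambda

print(f"savings=${monthly_savings:,.0f} cost=${monthly_cost:,} "
      f"roi={monthly_savings / monthly_cost:.1f}x")
# -> savings=$107,100 cost=$6,200 roi=17.3x
```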


Cache Failure Modes and Mitigations

graph TD
    A["Cache Failure Modes"] --> B["Cache Stampede"]
    A --> C["Stale Serve"]
    A --> D["Cache Pollution"]
    A --> E["Memory Pressure"]

    B --> B1["Mitigation: Probabilistic<br/>early expiry + mutex lock<br/>for cache rebuild"]
    C --> C1["Mitigation: Event-driven<br/>invalidation + version tags<br/>+ TTL as safety net"]
    D --> D1["Mitigation: Eviction policy<br/>(allkeys-lru) + cache<br/>quality scoring"]
    E --> E1["Mitigation: Memory alerts<br/>at 70% + tiered eviction<br/>(evict low-value first)"]

    style B fill:#eb3b5a,stroke:#333,color:#fff
    style C fill:#fd9644,stroke:#333,color:#000
    style D fill:#f9d71c,stroke:#333,color:#000
    style E fill:#ff6b6b,stroke:#333,color:#000
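
Of these, probabilistic early expiry is the least obvious mitigation. A minimal sketch of the standard "XFetch" rule (Vattani et al., VLDB 2015; the function name is illustrative):

```python
import math
import random
import time

def should_recompute_early(expiry_ts: float, recompute_seconds: float,
                           beta: float = 1.0) -> bool:
    """Probabilistic early expiry ("XFetch").

    Each reader independently decides to rebuild the entry slightly before
    it actually expires. The probability of an early rebuild rises as expiry
    approaches, so under heavy concurrency one request refreshes the value
    in time and the thundering herd at expiry never forms.
    """
    # 1 - random() lies in (0, 1], so log() is finite and <= 0;
    # the jitter below therefore pushes the effective clock forward.
    jitter = -recompute_seconds * beta * math.log(1.0 - random.random())
    return time.time() + jitter >= expiry_ts
```

Callers that get True rebuild the entry and reset its TTL; everyone else serves the cached value as usual. Raising beta makes early refreshes more aggressive.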

Cache Stampede Prevention

sequenceDiagram
    participant R1 as Request 1
    participant R2 as Request 2
    participant R3 as Request 3
    participant Cache as ElastiCache
    participant Origin as Origin Service

    R1->>Cache: GET product:ASIN123
    Cache-->>R1: MISS
    R1->>Cache: SET lock:product:ASIN123 NX (TTL=5s)
    R1->>Origin: Fetch product data

    R2->>Cache: GET product:ASIN123
    Cache-->>R2: MISS
    R2->>Cache: SET lock:product:ASIN123 NX
    Cache-->>R2: SET fails (lock already held)
    Note over R2: Wait 50ms, retry cache

    Origin-->>R1: Product data
    R1->>Cache: SET product:ASIN123 (data, TTL=5min)

    R2->>Cache: GET product:ASIN123
    Cache-->>R2: HIT (fresh data)

    Note over R1,R3: Only one request hits origin.<br/>Others wait for cache population.
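
The sequence above maps naturally onto a read-through helper built on Redis's SET NX. A minimal sketch with redis-py (helper names are assumptions):

```python
import time

import redis

r = redis.Redis(decode_responses=True)

def get_product(asin: str, fetch_from_origin, ttl: int = 300):
    """Read-through cache with a mutex lock to prevent stampedes.

    Only the request that wins the lock hits the origin; the others
    poll the cache until it is repopulated.
    """
    key, lock = f"product:{asin}", f"lock:product:{asin}"
    while True:
        value = r.get(key)
        if value is not None:
            return value
        # SET NX succeeds only for the first requester; the 5s expiry
        # guarantees the lock is released even if that requester dies.
        if r.set(lock, "1", nx=True, ex=5):
            try:
                value = fetch_from_origin(asin)
                r.set(key, value, ex=ttl)
                return value
            finally:
                r.delete(lock)
        time.sleep(0.05)  # lock held elsewhere: wait 50 ms, retry cache
```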

2026 Update: Treat Caching as a Hierarchy, Not One Redis Policy

Treat everything above this section as the baseline caching architecture: the earlier application-cache decisions still stand, and this update only adds newer cache layers around them.

The original document focuses mostly on application caches. In practice, modern LLM systems stack multiple cache layers, each with different freshness and risk characteristics:

  • Add prompt caching and prefix/KV caching beneath the application cache layer. These reduce latency and input-token cost for repeated prefixes without serving stale business data.
  • Prefer exact-match caches, deterministic tool-result caches, and versioned feature caches before semantic response caching. They are easier to invalidate, reason about, and audit.
  • Keep semantic response caching tightly scoped to low-risk intents and include policy version, corpus version, user segment, and personalization scope in the cache key (see the sketch after this list).
  • Use stale-while-revalidate or refresh-ahead for medium-risk content, but continue to use event-driven invalidation for policy, promo, and catalog changes.
  • Monitor cache layers separately: exact-hit rate, semantic-hit rate, prefix-cache hit rate, invalidation lag, stale-serve rate, and cache read/write cost.
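
As referenced above, a versioned, scoped cache key makes semantic entries self-invalidating whenever policy or corpus changes. A minimal sketch (version constants and key layout are assumptions):

```python
import hashlib

# Bumping any version constant implicitly invalidates every entry
# minted under the old value. The constants are illustrative.
POLICY_VERSION = "policy-v12"
CORPUS_VERSION = "corpus-2026-02-01"

def semantic_cache_key(intent: str, user_segment: str,
                       personalization_scope: str, normalized_query: str) -> str:
    """Cache key carrying every dimension that should force a miss on change."""
    digest = hashlib.sha256(normalized_query.encode()).hexdigest()[:16]
    return ":".join(["sem", intent, POLICY_VERSION, CORPUS_VERSION,
                     user_segment, personalization_scope, digest])

# e.g. sem:faq:policy-v12:corpus-2026-02-01:returning:global:3f7a9c...
```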

Recent references: AWS Bedrock prompt caching, vLLM automatic prefix caching, vLLM metrics.

Reversal Triggers

| Trigger | Action |
| --- | --- |
| Stale-data complaints exceed 0.1% of sessions | Reduce TTL for the offending data type |
| Cache hit rate drops below 40% for any tier | Investigate key pattern; may need different cache key strategy |
| ElastiCache memory usage exceeds 80% | Add capacity or tighten eviction / reduce TTL |
| A product shows wrong availability from cache | Reduce availability cache TTL to 30 seconds or remove caching entirely |
| Semantic cache returns irrelevant answers (similarity too loose) | Tighten similarity threshold (e.g., 0.92 → 0.95) |

Impact on Trilemma

| Dimension | No Caching | Aggressive Caching | Balanced (Decision) |
| --- | --- | --- | --- |
| Cost | $$$$$ (every request hits origin + LLM) | $ (most requests served from cache) | $$ (tiered, smart invalidation) |
| Performance | Slow (origin latency every time) | Fast (5-20ms cache hits) | Fast (55%+ hit rate) |
| Inference Quality | Best (always fresh data) | Risky (stale data possible) | Good (event-driven invalidation, never cache prices) |
| QACPI | Low (cost and latency drag) | Medium (quality risk) | Highest |