US-06: Cache Aggressiveness — Freshness vs Speed vs Cost
User Story
As a platform engineering lead, I want to determine the optimal caching strategy for every data type in the MangaAssist pipeline, so that we maximize cache hit rates (reducing cost and latency) without serving stale data that erodes user trust.
The Debate
```mermaid
graph TD
    subgraph "Cost Team"
        C["Cache everything with long TTLs.<br/>Every cache hit saves an API call<br/>and an LLM invocation.<br/>At 60% hit rate, we save<br/>$120K/month."]
    end
    subgraph "Performance Team"
        P["We agree on caching for speed.<br/>But invalidation latency matters.<br/>If cache invalidation is slow,<br/>we serve stale data AND<br/>the user blames us."]
    end
    subgraph "Inference Team"
        I["Stale cached recommendations<br/>are WORSE than no cache.<br/>If someone bought Volume 1<br/>and we keep recommending it,<br/>we look stupid. Freshness<br/>is non-negotiable for<br/>personalized responses."]
    end
    C ---|"Freshness<br/>concern"| I
    I ---|"Invalidation<br/>complexity"| P
    P ---|"Cache cost vs<br/>savings"| C
    style C fill:#f9d71c,stroke:#333,color:#000
    style P fill:#4ecdc4,stroke:#333,color:#000
    style I fill:#ff6b6b,stroke:#333,color:#000
```
Acceptance Criteria
- Cache hit rate ≥ 55% across all cacheable data types.
- No stale price is ever served (prices are never cached).
- Recommendation cache invalidates within 60 seconds of a new purchase.
- Product detail cache invalidates within 5 minutes of a catalog change.
- Cache infrastructure cost stays under $8,000/month.
- Cache-related user complaints (stale data) stay under 0.1% of sessions.
The Caching Spectrum
```mermaid
graph LR
    subgraph "Never Cache"
        NC["Prices<br/>Inventory counts<br/>Cart state<br/>Payment info"]
    end
    subgraph "Short TTL (1-5 min)"
        ST["Product details<br/>Availability status<br/>Search results"]
    end
    subgraph "Medium TTL (15-60 min)"
        MT["Recommendations<br/>Promotions<br/>FAQ answers"]
    end
    subgraph "Long TTL (1-24 hours)"
        LT["Review summaries<br/>Trending lists<br/>Editorial content<br/>User preferences"]
    end
    NC -->|"Freshness<br/>critical"| ST
    ST -->|"Moderate<br/>freshness"| MT
    MT -->|"Low<br/>freshness need"| LT
    style NC fill:#eb3b5a,stroke:#333,color:#fff
    style ST fill:#fd9644,stroke:#333,color:#000
    style MT fill:#f9d71c,stroke:#333,color:#000
    style LT fill:#2d8659,stroke:#333,color:#fff
```
Full Cache Strategy Table
| Data Type | Cache? | TTL | Invalidation | Staleness Risk | Impact of Stale Data |
|---|---|---|---|---|---|
| Prices | NEVER | N/A | N/A | N/A | Customer charged wrong amount → legal issue |
| Inventory count | NEVER | N/A | N/A | N/A | "In stock" when out of stock → broken promise |
| Cart state | NEVER | N/A | N/A | N/A | Wrong items → checkout failure |
| Product details (title, author, format) | Yes | 5 min | Catalog change event (SNS) | Low | Minor: outdated description, rarely changes |
| Product availability (in stock / out of stock) | Yes | 1 min | Inventory event | Medium | "In stock" when out → frustration |
| Recommendations | Yes | 15 min | Purchase event, new session | Medium | Recommending already-purchased items |
| Promotions | Yes | 15 min | Promo change event (SNS) | Medium | Showing expired deals → mild frustration |
| FAQ answers | Yes | 1 hour | Policy change event | Low | Usually static; risky only for policy changes |
| Review summaries | Yes | 4 hours | Scheduled refresh | Low | Reviews change slowly; minor impact |
| Trending / popular lists | Yes | 1 hour | Scheduled refresh | Low | Staleness is expected for trend data |
| User preference embeddings | Yes | 6 hours | Purchase / browse event | Low | Preferences change slowly |
| Search results | Yes | 3 min | None (TTL only) | Medium | New products won't appear immediately |
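The strategy table above collapses into a small policy map that the gateway can consult before every read. A minimal sketch; the type names and field layout are illustrative, not a production schema:

```python
# Policy map mirroring the strategy table. Data types absent from the map
# (prices, inventory counts, cart state) are never cached by design.
CACHE_POLICY = {
    "product_details":  {"ttl_s": 300,   "invalidation": "catalog_event"},
    "availability":     {"ttl_s": 60,    "invalidation": "inventory_event"},
    "recommendations":  {"ttl_s": 900,   "invalidation": "purchase_event"},
    "promotions":       {"ttl_s": 900,   "invalidation": "promo_event"},
    "faq_answers":      {"ttl_s": 3600,  "invalidation": "policy_event"},
    "review_summaries": {"ttl_s": 14400, "invalidation": "scheduled"},
    "trending":         {"ttl_s": 3600,  "invalidation": "scheduled"},
    "user_preferences": {"ttl_s": 21600, "invalidation": "purchase_event"},
    "search_results":   {"ttl_s": 180,   "invalidation": None},  # TTL only
}

def ttl_for(data_type: str):
    """Return the TTL in seconds, or None for never-cached data types."""
    policy = CACHE_POLICY.get(data_type)
    return policy["ttl_s"] if policy else None
```

Centralizing the map keeps the "never cache" rule enforceable: a lookup miss means bypass the cache, rather than each call site hard-coding its own TTL.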
The Hardest Decision: Recommendation Cache
```mermaid
graph TD
    A["User asks: 'What should I read next?'"] --> B{"Cache Hit?"}
    B -->|"Hit"| C["Return cached reco<br/>⏱️ 5ms | 💰 $0"]
    B -->|"Miss"| D["Full pipeline<br/>⏱️ 1,500ms | 💰 $0.015"]
    C --> E{"Is it stale?"}
    E -->|"No — user hasn't<br/>bought anything new"| F["✅ Good response"]
    E -->|"Yes — user just bought<br/>Volume 1 of recommended series"| G["❌ Bad response<br/>Recommends what they just bought"]
    style F fill:#2d8659,stroke:#333,color:#fff
    style G fill:#eb3b5a,stroke:#333,color:#fff
```
Recommendation Cache Invalidation Strategy
```mermaid
sequenceDiagram
    participant User
    participant OrderService as Order Service
    participant SNS as SNS Topic
    participant Invalidator as Cache Invalidator
    participant Cache as ElastiCache
    User->>OrderService: Purchase Demon Slayer Vol 1
    OrderService->>SNS: Publish purchase event
    SNS->>Invalidator: Trigger invalidation
    Invalidator->>Cache: DELETE reco:{user_id}:*
    Note over Cache: Next recommendation request<br/>will compute fresh results<br/>that exclude the purchased item
    Note over Invalidator: Target: < 60 seconds<br/>from purchase to invalidation
```
Why 60 seconds? After a purchase, the user typically returns to browsing 30-120 seconds later. Invalidating within 60 seconds means fresh recommendations are ready before the user asks.
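The invalidator's core logic is a pattern delete keyed on the purchasing user. A hedged sketch: production code would run SCAN + DEL against ElastiCache via redis-py, but an in-memory dict stands in here so the logic is self-contained, and the event shape (`{"user_id": ...}`) is an assumption, not the production schema:

```python
import fnmatch

def invalidate_recommendations(cache: dict, purchase_event: dict) -> int:
    """Delete every reco:{user_id}:* entry; return how many were removed."""
    pattern = f"reco:{purchase_event['user_id']}:*"
    doomed = [key for key in cache if fnmatch.fnmatch(key, pattern)]
    for key in doomed:
        del cache[key]
    return len(doomed)
```

Scoping the delete to one user keeps invalidation cheap: other users' warm recommendation entries survive the purchase event untouched.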
Semantic Response Cache: The Most Controversial Cache
The Cost Team wants to cache entire LLM responses for semantically similar queries. This is the most contentious caching decision.
```mermaid
graph TD
    A["'What's a good horror manga?'"] --> B["Embed query"]
    B --> C{"Semantic similarity to<br/>cached query > 0.95?"}
    C -->|"Yes"| D["Return cached LLM response<br/>⏱️ 20ms | 💰 $0"]
    C -->|"No"| E["Full LLM pipeline<br/>⏱️ 1,500ms | 💰 $0.015"]
    F["'Recommend me some horror manga'"] --> B
    G["'Best horror manga series?'"] --> B
    style D fill:#2d8659,stroke:#333,color:#fff
    style E fill:#eb3b5a,stroke:#333,color:#fff
```
The Arguments For and Against
| Team | Position | Argument |
|---|---|---|
| Cost Team (For) | Cache it | Top 100 queries cover 25% of traffic. At $0.015/request, that's $112K/month saved |
| Inference Team (Against) | Don't cache | Recommendations should be personalized. A cached generic response defeats the purpose |
| Performance Team (Neutral) | Cache FAQ only | FAQ answers are universal; recommendation answers are personal |
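The similarity gate at the heart of this debate is a nearest-neighbor check over cached query embeddings. A toy sketch: the 3-d vectors stand in for real embeddings, and `semantic_lookup` is an illustrative name, not a real API:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def semantic_lookup(query_vec, cached, threshold=0.95):
    """Return the cached response for the most similar cached query,
    or None (fall through to the full pipeline) if nothing clears
    the threshold."""
    best = max(cached, key=lambda e: cosine(query_vec, e["vec"]), default=None)
    if best and cosine(query_vec, best["vec"]) >= threshold:
        return best["response"]
    return None
```

The threshold is the whole argument in one number: at 0.92 "What's a good horror manga?" and "Best horror manga series?" collide into one cached answer; at 0.98 almost nothing does.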
The Compromise: Tiered Semantic Cache
```mermaid
graph TD
    subgraph "Tier 1: FAQ Cache (Aggressive)"
        T1["Generic, non-personalized<br/>Similarity threshold: 0.92<br/>TTL: 1 hour<br/>Expected hit rate: 40%"]
    end
    subgraph "Tier 2: Product Info Cache (Moderate)"
        T2["Product-specific, not user-specific<br/>Similarity threshold: 0.95<br/>TTL: 15 min<br/>Expected hit rate: 25%"]
    end
    subgraph "Tier 3: Recommendation Cache (Conservative)"
        T3["User-specific cache key<br/>Similarity threshold: 0.98<br/>TTL: 15 min<br/>Expected hit rate: 10%"]
    end
    subgraph "Never Cached"
        T4["Multi-turn conversations<br/>Cart-dependent queries<br/>Price/availability queries"]
    end
    style T1 fill:#2d8659,stroke:#333,color:#fff
    style T2 fill:#fd9644,stroke:#333,color:#000
    style T3 fill:#f9d71c,stroke:#333,color:#000
    style T4 fill:#eb3b5a,stroke:#333,color:#fff
```
Cache Cost vs Savings Analysis
Monthly Cache Infrastructure Cost
| Component | Configuration | Monthly Cost |
|---|---|---|
| ElastiCache Redis (r6g.xlarge, 2 nodes) | 26 GB memory, multi-AZ | $4,800 |
| Semantic cache index (small OpenSearch) | For response similarity search | $1,200 |
| Cache invalidation Lambda | Event-driven, ~500K invocations/month | $200 |
| Total | | $6,200 |
Monthly Savings from Caching
| Cache Type | Hit Rate | Requests/Day Saved | Savings/Request | Monthly Savings |
|---|---|---|---|---|
| Product detail cache | 70% | 700,000 | $0.002 (API cost avoided) | $42,000 |
| Recommendation cache | 10% | 40,000 | $0.015 (LLM cost avoided) | $18,000 |
| FAQ semantic cache | 40% | 120,000 | $0.012 (LLM + RAG avoided) | $43,200 |
| Promotion cache | 65% | 130,000 | $0.001 (API cost avoided) | $3,900 |
| Total savings | | | | $107,100 |
ROI: $107,100 savings / $6,200 cost = 17.3x return
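The figures in the two tables reduce to a few lines of arithmetic (a 30-day month is assumed, matching the per-day to per-month conversion above):

```python
# Monthly infrastructure cost from the first table.
costs = {"elasticache": 4800, "opensearch": 1200, "lambda": 200}

# Per-day savings from the second table: requests saved x cost avoided.
savings_per_day = {
    "product_detail": 700_000 * 0.002,   # $1,400/day
    "recommendation":  40_000 * 0.015,   # $600/day
    "faq_semantic":   120_000 * 0.012,   # $1,440/day
    "promotion":      130_000 * 0.001,   # $130/day
}

monthly_cost = sum(costs.values())                 # $6,200
monthly_savings = sum(savings_per_day.values()) * 30   # $107,100
roi = monthly_savings / monthly_cost               # ~17.3x
```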
Cache Failure Modes and Mitigations
```mermaid
graph TD
    A["Cache Failure Modes"] --> B["Cache Stampede"]
    A --> C["Stale Serve"]
    A --> D["Cache Pollution"]
    A --> E["Memory Pressure"]
    B --> B1["Mitigation: Probabilistic<br/>early expiry + mutex lock<br/>for cache rebuild"]
    C --> C1["Mitigation: Event-driven<br/>invalidation + version tags<br/>+ TTL as safety net"]
    D --> D1["Mitigation: Eviction policy<br/>(allkeys-lru) + cache<br/>quality scoring"]
    E --> E1["Mitigation: Memory alerts<br/>at 70% + tiered eviction<br/>(evict low-value first)"]
    style B fill:#eb3b5a,stroke:#333,color:#fff
    style C fill:#fd9644,stroke:#333,color:#000
    style D fill:#f9d71c,stroke:#333,color:#000
    style E fill:#ff6b6b,stroke:#333,color:#000
```
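The "probabilistic early expiry" mitigation can be sketched with the XFetch formula: each request recomputes early with a probability that rises as expiry approaches, so a hot key is refreshed by one request before the herd ever sees a miss. A hedged sketch; `delta_s` (the observed rebuild time) and the function name are illustrative:

```python
import math
import random
import time

def should_refresh_early(delta_s, expiry_ts, beta=1.0, now=None):
    """XFetch-style early-expiry check.

    delta_s:   how long the last cache rebuild took, in seconds.
    expiry_ts: the entry's expiry timestamp.
    beta:      aggressiveness; values > 1 favor earlier recomputation.
    Returns True when this request should rebuild even though the
    entry has not yet expired.  -log(U) is a positive random factor
    that grows unbounded as U -> 0, so refresh probability rises
    smoothly as `now` approaches `expiry_ts`.
    """
    now = time.time() if now is None else now
    return now - delta_s * beta * math.log(random.random()) >= expiry_ts
```

Slow-to-rebuild entries (large `delta_s`) refresh earlier, which is exactly what stampede prevention wants: the expensive rebuilds are the ones you least want the whole herd to trigger at once.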
Cache Stampede Prevention
```mermaid
sequenceDiagram
    participant R1 as Request 1
    participant R2 as Request 2
    participant Cache as ElastiCache
    participant Origin as Origin Service
    R1->>Cache: GET product:ASIN123
    Cache-->>R1: MISS
    R1->>Cache: SET lock:product:ASIN123 NX (TTL=5s)
    Cache-->>R1: Lock acquired
    R1->>Origin: Fetch product data
    R2->>Cache: GET product:ASIN123
    Cache-->>R2: MISS
    R2->>Cache: SET lock:product:ASIN123 NX
    Cache-->>R2: Lock already held
    Note over R2: Wait 50ms, retry cache
    Origin-->>R1: Product data
    R1->>Cache: SET product:ASIN123 (data, TTL=5min)
    R2->>Cache: GET product:ASIN123
    Cache-->>R2: HIT (fresh data)
    Note over R1,R2: Only one request hits origin.<br/>Others wait for cache population.
```
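The lock-and-wait flow above, sketched with an in-memory dict standing in for Redis (where acquiring the lock would be an atomic `SET key value NX EX 5`); `get_or_rebuild` is an illustrative name, not a library API:

```python
import time

def get_or_rebuild(cache: dict, key: str, fetch_origin, retries=20):
    """Return the cached value; on a miss, only the lock holder calls
    origin, while other requests poll until the cache is populated."""
    lock_key = f"lock:{key}"
    for _ in range(retries):
        if key in cache:
            return cache[key]                  # HIT
        if lock_key not in cache:              # SET ... NX in real Redis
            cache[lock_key] = True             # acquire rebuild lock
            try:
                cache[key] = fetch_origin()    # the single origin call
            finally:
                del cache[lock_key]            # release even on failure
            return cache[key]
        time.sleep(0.001)                      # ~50ms backoff in production
    return fetch_origin()                      # lock holder died: fall through
```

The dict's check-then-set is not atomic the way Redis `SET NX` is, so treat this purely as a shape of the control flow. The final fall-through matters: if the lock holder crashes, waiting requests eventually hit origin themselves rather than spinning forever.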
2026 Update: Treat Caching as a Hierarchy, Not One Redis Policy
Everything above this section remains the baseline caching architecture; this update preserves those application-cache decisions and layers newer cache types on top.
The baseline focuses mostly on application caches. Modern LLM systems now stack multiple cache layers with different freshness and risk characteristics:
- Add prompt caching and prefix/KV caching beneath the application cache layer. These reduce latency and input-token cost for repeated prefixes without serving stale business data.
- Prefer exact-match caches, deterministic tool-result caches, and versioned feature caches before semantic response caching. They are easier to invalidate, reason about, and audit.
- Keep semantic response caching tightly scoped to low-risk intents and include policy version, corpus version, user segment, and personalization scope in the cache key.
- Use stale-while-revalidate or refresh-ahead for medium-risk content, but continue to use event-driven invalidation for policy, promo, and catalog changes.
- Monitor cache layers separately: exact-hit rate, semantic-hit rate, prefix-cache hit rate, invalidation lag, stale-serve rate, and cache read/write cost.
Recent references: AWS Bedrock prompt caching, vLLM automatic prefix caching, vLLM metrics.
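The versioned-key recommendation can be made concrete: bake every version and scope into the cache key, so that bumping `policy_version` orphans (and thereby invalidates) an entire tier without issuing explicit deletes. Field names here are illustrative assumptions, not a production schema:

```python
import hashlib

def semantic_cache_key(intent: str, query_hash: str, *, policy_version: str,
                       corpus_version: str, segment: str, scope: str) -> str:
    """Build a semantic-cache key whose identity includes policy version,
    corpus version, user segment, and personalization scope: changing
    any one of them yields a different key, so old entries simply stop
    being hit and age out via TTL/eviction."""
    raw = "|".join([intent, query_hash, policy_version,
                    corpus_version, segment, scope])
    return "sem:" + hashlib.sha256(raw.encode()).hexdigest()[:16]
```

Version-in-key invalidation trades memory (stale entries linger until evicted) for operational simplicity: no invalidation fan-out is needed when a policy document or RAG corpus changes.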
Reversal Triggers
| Trigger | Action |
|---|---|
| Stale-data complaints exceed 0.1% of sessions | Reduce TTL for the offending data type |
| Cache hit rate drops below 40% for any tier | Investigate key pattern; may need different cache key strategy |
| ElastiCache memory usage exceeds 80% | Add capacity or tighten eviction / reduce TTL |
| A product shows wrong availability from cache | Reduce availability cache TTL to 30 seconds or remove caching entirely |
| Semantic cache returns irrelevant answers (similarity too loose) | Tighten similarity threshold (e.g., 0.92 → 0.95) |
Impact on Trilemma
| Dimension | No Caching | Aggressive Caching | Balanced (Decision) |
|---|---|---|---|
| Cost | $$$$$ (every request hits origin + LLM) | $ (most requests served from cache) | $$ (tiered, smart invalidation) |
| Performance | Slow (origin latency every time) | Fast (5-20ms cache hits) | Fast (55%+ hit rate) |
| Inference Quality | Best (always fresh data) | Risky (stale data possible) | Good (event-driven invalidation, never cache prices) |
| QACPI | Low (cost and latency drag) | Medium (quality risk) | Highest |