Intelligent Caching — Scenarios and Runbooks
AWS AIP-C01 Task 4.1 — Skill 4.1.4: Design intelligent caching systems for FM applications
Context: MangaAssist e-commerce chatbot — Bedrock Claude 3 (Sonnet/Haiku), OpenSearch Serverless, DynamoDB, ECS Fargate, API Gateway WebSocket, ElastiCache Redis. 1M messages/day.
Skill Mapping
| Certification | Domain | Task | Skill |
|---|---|---|---|
| AWS AIP-C01 | Domain 4 — Operational Efficiency | Task 4.1 — Optimize FM applications | Skill 4.1.4 — Diagnose and resolve caching failures, false positives, stampedes, staleness, and resource exhaustion in FM caching systems |
Skill scope: Five production scenarios covering the failure modes of intelligent caching in a high-traffic GenAI chatbot — each with detection, root cause analysis, resolution steps, and prevention measures.
Scenario 1 — Semantic Cache Returns Wrong Manga Recommendation
Problem Statement
A customer asks: "Can you recommend something like Dragon Ball?"
The semantic cache returns a previously cached response for "Can you recommend something like Dragon Ball Super?" — which recommends Dragon Ball Super's sequel arcs, not titles similar to the original Dragon Ball series. The cosine similarity between the two queries is 0.94, above the 0.92 threshold, so the cache treats them as equivalent.
Business impact: Customer receives irrelevant recommendations, reducing trust and conversion rate. At MangaAssist's scale, even a 0.5% false positive rate means ~1,400 incorrect responses per day.
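The ~1,400/day figure follows from numbers quoted elsewhere in these runbooks (1M messages/day, and the ~28% steady-state L2 hit rate from Scenario 5); a quick sanity check:

```python
# Back-of-envelope check of the "~1,400 incorrect responses per day" estimate.
# Assumptions taken from the runbooks' own numbers: 1M messages/day and a
# ~28% steady-state L2 semantic-cache hit rate.
MESSAGES_PER_DAY = 1_000_000
L2_HIT_RATE = 0.28           # share of traffic answered from the semantic cache
FALSE_POSITIVE_RATE = 0.005  # 0.5% of L2 hits are wrong matches

wrong_per_day = round(MESSAGES_PER_DAY * L2_HIT_RATE * FALSE_POSITIVE_RATE)
print(wrong_per_day)  # → 1400
```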
Detection
graph TD
A[User submits negative feedback<br/>'This isn't what I asked for'] --> B{Feedback classifier}
B -->|Cache-related| C[Check if response<br/>was served from cache]
C -->|source = L2_SEMANTIC| D[Log: semantic_cache_false_positive]
D --> E[CloudWatch Metric:<br/>cache.false_positive_count]
E --> F{Rate > 0.5%<br/>of L2 hits?}
F -->|Yes| G[CloudWatch Alarm:<br/>CacheFalsePositiveRate]
G --> H[PagerDuty / SNS Alert]
B -->|Not cache-related| I[Route to standard<br/>feedback pipeline]
C -->|source = BEDROCK| J[Not a cache issue —<br/>model quality problem]
style G fill:#e76f51,stroke:#f4a261,color:#fff
Key metrics to monitor:
| Metric | Normal | Alarm Threshold | Source |
|---|---|---|---|
| `cache.false_positive_count` | < 50/hr | > 200/hr | User feedback + automated checks |
| `cache.false_positive_rate` | < 0.5% | > 1.0% | false_positives / l2_hits |
| `cache.similarity_score_distribution` | Peaks at 0.93–0.97 | Shift toward 0.92 boundary | Redis search results |
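A minimal sketch of how the two alarm conditions in the table could be evaluated together; `false_positive_alarm` and its parameter names are illustrative, not an existing helper:

```python
def false_positive_alarm(
    false_positives: int,
    l2_hits: int,
    count_threshold_per_hr: int = 200,  # count alarm threshold from the table
    rate_threshold: float = 0.01,       # 1.0% rate alarm threshold from the table
) -> dict:
    """Evaluate the count-based and rate-based alarm conditions above."""
    rate = false_positives / l2_hits if l2_hits else 0.0
    return {
        "false_positive_rate": rate,
        "alarm": false_positives > count_threshold_per_hr or rate > rate_threshold,
    }
```

For example, 150 false positives against 10,000 L2 hits stays under the hourly count threshold but is a 1.5% rate, which still trips the alarm.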
Root Cause Analysis
graph TD
FP[False Positive<br/>Detected] --> Q1{Was the cached query<br/>actually different intent?}
Q1 -->|Yes| RC1[Root Cause: Intent classifier<br/>assigned same intent to<br/>semantically different queries]
Q1 -->|No, same intent| Q2{Is cosine similarity<br/>barely above threshold?}
Q2 -->|Yes, sim = 0.920–0.930| RC2[Root Cause: Threshold too low<br/>for this intent category]
Q2 -->|No, sim > 0.93| Q3{Are the queries about<br/>related but distinct titles?}
Q3 -->|Yes| RC3[Root Cause: Franchise titles<br/>cluster too tightly in<br/>embedding space]
Q3 -->|No| RC4[Root Cause: Embedding model<br/>lacks domain specificity<br/>for manga titles]
RC1 --> FIX1[Add intent sub-classification<br/>dragon_ball vs dragon_ball_super]
RC2 --> FIX2[Raise threshold for<br/>recommendation intent to 0.96]
RC3 --> FIX3[Include title entity<br/>as hard filter in cache key]
RC4 --> FIX4[Fine-tune embedding model<br/>on manga title pairs]
style FP fill:#e76f51,stroke:#f4a261,color:#fff
style RC3 fill:#264653,stroke:#2a9d8f,color:#fff
In this scenario: Root Cause 3 is most likely. "Dragon Ball" and "Dragon Ball Super" share significant token overlap, producing embeddings within cosine distance 0.06. The semantic cache cannot distinguish them by embedding alone.
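The failure and the fix reduce to a toy decision function (illustrative only; `cache_match` is not part of the real codebase): at sim = 0.94, a 0.92 threshold alone accepts the wrong entry, while an exact-entity requirement rejects it.

```python
def cache_match(similarity: float, threshold: float,
                query_entity: str, cached_entity: str,
                require_entity_match: bool = True) -> bool:
    """Serve a cached entry only if similarity clears the threshold
    and (optionally) the primary title entity matches exactly."""
    if similarity < threshold:
        return False
    if require_entity_match and query_entity.lower() != cached_entity.lower():
        return False
    return True

# Similarity alone: 0.94 > 0.92, so the false positive is served
assert cache_match(0.94, 0.92, "Dragon Ball", "Dragon Ball Super",
                   require_entity_match=False)
# With the entity hard filter the same pair is correctly rejected
assert not cache_match(0.94, 0.92, "Dragon Ball", "Dragon Ball Super")
```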
Resolution
Step 1 — Immediate mitigation (5 minutes)
# Emergency: raise recommendation threshold to reject borderline matches
# Deploy via environment variable — no code deploy needed
# ECS task definition environment variable:
CACHE_THRESHOLD_RECOMMENDATION = "0.97"
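A sketch of the application side of this override, assuming per-intent thresholds are resolved from environment variables named `CACHE_THRESHOLD_<INTENT>` (the exact naming convention is an assumption):

```python
import os

DEFAULT_THRESHOLD = 0.92  # baseline similarity threshold

def threshold_for_intent(intent: str) -> float:
    """Resolve the similarity threshold for an intent from an environment
    variable such as CACHE_THRESHOLD_RECOMMENDATION, falling back to the
    baseline. This lets on-call raise a threshold without a code deploy."""
    raw = os.environ.get(f"CACHE_THRESHOLD_{intent.upper()}")
    return float(raw) if raw else DEFAULT_THRESHOLD
```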
Step 2 — Invalidate affected entries (10 minutes)
import redis
import logging
logger = logging.getLogger(__name__)
def invalidate_recommendation_cache(
redis_client: redis.Redis,
index_name: str = "mangaassist_cache_idx",
) -> int:
"""
Remove all recommendation-intent entries from the semantic cache.
Called as emergency response to false positive spike.
"""
cursor = 0
removed = 0
while True:
cursor, keys = redis_client.scan(cursor, match="sc:*", count=500)
for key in keys:
intent = redis_client.hget(key, "intent")
if intent and intent.decode() == "recommendation":
redis_client.delete(key)
removed += 1
if cursor == 0:
break
logger.info("Invalidated %d recommendation cache entries", removed)
return removed
Step 3 — Add entity-based hard filter (1–2 hours)
def recommendation_cache_search_with_entity_filter(
redis_client: redis.Redis,
embedding: list[float],
primary_entity: str,
intent: str = "recommendation",
) -> dict | None:
"""
Enhanced search that requires the primary manga title entity
to match exactly, in addition to vector similarity.
This prevents 'Dragon Ball' from matching 'Dragon Ball Super'.
"""
import numpy as np
vec_bytes = np.array(embedding, dtype=np.float32).tobytes()
    # Hard filter: intent TAG must match AND the cached query text must
    # contain the exact title. Quote the entity so multi-word titles
    # ("Dragon Ball") are matched as a phrase, not as separate terms.
    filter_expr = f'@intent:{{{intent}}} @query:"{primary_entity}"'
query = f"({filter_expr})=>[KNN 3 @vec $blob AS dist]"
results = redis_client.execute_command(
"FT.SEARCH", "mangaassist_cache_idx", query,
"PARAMS", "2", "blob", vec_bytes,
"SORTBY", "dist", "ASC",
"LIMIT", "0", "1",
"RETURN", "3", "response", "query", "dist",
"DIALECT", "2",
)
if results[0] == 0:
return None
    # Parse the FT.SEARCH reply and apply the intent-specific threshold
    # check before returning (standard threshold logic elided here)
    return None  # Placeholder — the full implementation returns the parsed hit
Prevention
| Prevention Measure | Implementation | Effort |
|---|---|---|
| Entity-aware cache keys | Include primary entity (manga title) as a hard TAG filter in Redis search | Medium |
| Intent-specific thresholds | Raise recommendation threshold to 0.96 (already documented in threshold map) | Low |
| Franchise disambiguation | Maintain a franchise-alias table: "Dragon Ball" ≠ "Dragon Ball Super" ≠ "Dragon Ball GT" | Medium |
| Automated false positive detection | Compare cached response entities vs query entities; flag mismatches | High |
| User feedback loop | "Was this helpful?" button writes to feedback queue; high negative rate triggers threshold auto-adjustment | Medium |
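The franchise disambiguation measure can be as simple as a lookup table mapping normalized titles to canonical franchise entries. This sketch (hypothetical table and helper names) treats unknown titles conservatively and never claims a match for them:

```python
# Canonical franchise entries: related titles deliberately map to
# distinct values so the cache never conflates them.
FRANCHISE_CANONICAL = {
    "dragon ball": "dragon_ball",
    "dragon ball super": "dragon_ball_super",
    "dragon ball gt": "dragon_ball_gt",
}

def same_franchise_entry(title_a: str, title_b: str) -> bool:
    """True only when both titles resolve to the same canonical entry.
    Unknown titles return False (safe default: treat as a cache miss)."""
    a = FRANCHISE_CANONICAL.get(title_a.strip().lower())
    b = FRANCHISE_CANONICAL.get(title_b.strip().lower())
    return a is not None and a == b
```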
Scenario 2 — Cache Stampede During New Manga Release
Problem Statement
A highly anticipated manga volume (e.g., Jujutsu Kaisen final volume) releases at midnight JST. Within 60 seconds, 15,000 users simultaneously ask: "Is Jujutsu Kaisen final volume available?"
The cache has no entry for this query (new product, never asked before). All 15,000 requests miss the cache simultaneously and hit Bedrock in parallel. This creates a thundering herd / cache stampede:
- Bedrock throttle limit hit (requests rejected with `ThrottlingException`)
- RAG pipeline overloaded (15,000 parallel OpenSearch queries)
- p99 latency spikes from 2.8s to 25s+
- Some users receive errors ("Service temporarily unavailable")
Business impact: The highest-traffic moment (new release) coincides with the worst user experience. Customers leave, revenue drops, social media complaints spike.
Detection
graph TD
A[Bedrock ThrottlingException<br/>rate > 100/min] --> ALARM1[CloudWatch Alarm:<br/>BedrockThrottleRate]
B[Cache miss rate spikes<br/>to > 95% for 5 min] --> ALARM2[CloudWatch Alarm:<br/>CacheMissRateSpike]
C[OpenSearch latency<br/>p99 > 5000ms] --> ALARM3[CloudWatch Alarm:<br/>RAGLatencyHigh]
D[ECS task CPU > 90%<br/>for > 2 min] --> ALARM4[CloudWatch Alarm:<br/>ECSCPUHigh]
ALARM1 --> COMP[CloudWatch Composite Alarm:<br/>CacheStampedeDetected]
ALARM2 --> COMP
ALARM3 --> COMP
ALARM4 --> COMP
COMP --> SNS[SNS → PagerDuty<br/>P1 Incident]
COMP --> AUTO[Auto-Remediation Lambda]
style COMP fill:#e76f51,stroke:#f4a261,color:#fff
| Metric | Normal (steady state) | Stampede Indicator | Source |
|---|---|---|---|
| `cache.miss_rate` | ~60% | > 95% sustained for 5 min | Application metrics |
| `bedrock.throttle_count` | < 5/min | > 100/min | CloudWatch AWS/Bedrock |
| `opensearch.search_latency_p99` | 150ms | > 5,000ms | CloudWatch AWS/AOSS |
| `ecs.cpu_utilization` | 45% | > 90% | CloudWatch AWS/ECS |
| Concurrent identical queries | < 10 | > 1,000 | Application-level dedup counter |
Root Cause Analysis
graph TD
STAMP[Cache Stampede<br/>Detected] --> Q1{Is this a<br/>new product/event?}
Q1 -->|Yes| RC1[Root Cause: No cache warming<br/>for the new release.<br/>Cold cache + sudden demand spike.]
Q1 -->|No| Q2{Did cache recently<br/>get invalidated?}
Q2 -->|Yes| RC2[Root Cause: Mass invalidation<br/>flushed entries that were<br/>immediately re-requested.]
Q2 -->|No| Q3{Did TTLs expire<br/>simultaneously?}
Q3 -->|Yes| RC3[Root Cause: Synchronized TTL<br/>expiry. All entries for<br/>an intent expired at once.]
Q3 -->|No| RC4[Root Cause: Infrastructure issue<br/>Redis restart or connectivity<br/>loss emptied the cache.]
RC1 --> FIX1[Implement pre-release<br/>cache warming pipeline]
RC2 --> FIX2[Add stampede lock<br/>single-flight pattern]
RC3 --> FIX3[Add TTL jitter<br/>±10% randomization]
RC4 --> FIX4[Redis cluster mode<br/>+ AOF persistence]
style STAMP fill:#e76f51,stroke:#f4a261,color:#fff
style RC1 fill:#264653,stroke:#2a9d8f,color:#fff
Resolution
Step 1 — Request coalescing / single-flight pattern (immediate)
When multiple identical requests arrive simultaneously, only the first one invokes Bedrock. All others wait for the first response, which is then shared.
import asyncio
import hashlib
import logging
import time
from typing import Optional
logger = logging.getLogger(__name__)
class StampedeProtection:
"""
Single-flight / request coalescing pattern.
When N concurrent requests arrive for the same cache key,
only 1 invokes Bedrock. The other N-1 wait for the result.
Uses Redis distributed locks to coordinate across ECS containers.
"""
LOCK_PREFIX = "lock:"
LOCK_TTL = 30 # seconds — max time to wait for Bedrock response
def __init__(self, redis_client, semantic_cache, bedrock_invoker):
self.redis = redis_client
self.cache = semantic_cache
self.invoker = bedrock_invoker
self._local_waiters: dict[str, asyncio.Event] = {}
async def get_or_invoke(
self,
query: str,
intent: str,
entities: dict,
rag_context: str,
session_history: list,
) -> dict:
"""
Try cache → if miss, acquire lock → invoke Bedrock → store → release.
Concurrent requests for the same query wait on the lock.
"""
# Step 1: Try cache
cached = self.cache.lookup(query=query, intent=intent)
if cached:
return cached
# Step 2: Compute coalescing key
normalized = self.cache._normalize(query)
coalesce_key = hashlib.md5(f"{normalized}|{intent}".encode()).hexdigest()
lock_key = f"{self.LOCK_PREFIX}{coalesce_key}"
# Step 3: Try to acquire distributed lock
acquired = self.redis.set(lock_key, "1", nx=True, ex=self.LOCK_TTL)
if acquired:
# This container is the leader — invoke Bedrock
try:
response = self.invoker.invoke(
user_message=query,
rag_context=rag_context,
session_history=session_history,
)
# Store in cache for all waiters
self.cache.store(
query=query,
response=response["response_text"],
intent=intent,
entities=entities,
)
return {
"response": response["response_text"],
"source": "BEDROCK_LEADER",
"similarity": 1.0,
}
finally:
self.redis.delete(lock_key)
else:
# Another container is the leader — poll cache until result appears
logger.info("Waiting for leader to populate cache for key=%s", coalesce_key)
for _ in range(60): # Wait up to 30 seconds (60 x 0.5s)
await asyncio.sleep(0.5)
cached = self.cache.lookup(query=query, intent=intent)
if cached:
cached["source"] = "COALESCED_WAIT"
return cached
# Timeout — fall through to direct invocation
logger.warning("Coalescing timeout for key=%s, invoking directly", coalesce_key)
response = self.invoker.invoke(
user_message=query,
rag_context=rag_context,
session_history=session_history,
)
return {
"response": response["response_text"],
"source": "BEDROCK_TIMEOUT_FALLBACK",
"similarity": 1.0,
}
Step 2 — Pre-release cache warming (preventive, triggered by catalog event)
async def warm_new_release_cache(
title: str,
volume: str,
cache_warmer, # CacheWarmer instance
) -> dict:
"""
Triggered by EventBridge when a new manga release is added to the catalog.
Pre-populates cache 1 hour before the release goes live.
"""
result = cache_warmer.warm_new_release(title=title, volume=volume)
logger.info("Pre-warmed cache for %s vol %s: %s", title, volume, result)
return result
Step 3 — TTL jitter to prevent synchronized expiry
import random
def ttl_with_jitter(base_ttl: int, jitter_pct: float = 0.10) -> int:
"""
Add ±10% random jitter to TTL to prevent synchronized expiry.
base_ttl=3600 → returns 3240–3960.
"""
jitter = int(base_ttl * jitter_pct)
return base_ttl + random.randint(-jitter, jitter)
Prevention
| Prevention Measure | Implementation | Effort |
|---|---|---|
| Single-flight / request coalescing | Redis distributed lock; only 1 request per unique query hits Bedrock | Medium |
| Pre-release cache warming | EventBridge rule triggers CacheWarmer on catalog update | Medium |
| TTL jitter | Add ±10% randomization to all TTLs | Low |
| Bedrock provisioned throughput | Reserve model units for anticipated spikes | Medium (cost) |
| Graceful degradation | When throttled, serve stale cache (bypass TTL) with "may be outdated" disclaimer | Medium |
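The graceful degradation row deserves a sketch: on a Bedrock throttle, serving a stale entry past its TTL (with a disclaimer) beats serving an error. This is an in-memory illustration; production would keep the entry in Redis with a longer hard-expiry horizon, and `StaleOnThrottleCache` is a hypothetical name:

```python
import time

class StaleOnThrottleCache:
    """Serve-stale fallback: fresh entries are returned normally; expired
    entries can still be returned when the caller opts in (e.g. after a
    ThrottlingException), flagged so a disclaimer can be attached."""

    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def put(self, key, value, ttl_s: float):
        self._store[key] = (value, time.time() + ttl_s)

    def get(self, key, allow_stale: bool = False):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() <= expires_at:
            return {"value": value, "stale": False}
        if allow_stale:
            # Caller should append a "may be outdated" disclaimer
            return {"value": value, "stale": True}
        return None
```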
Scenario 3 — Stale Pricing in Cached Response After Flash Sale
Problem Statement
MangaAssist runs a flash sale: Demon Slayer Complete Box Set drops from ¥12,800 to ¥8,980. The event-driven invalidation system is supposed to flush all product_info cache entries for "Demon Slayer" — but a bug in the EventBridge rule filter means the invalidation Lambda is never triggered.
For the next hour (until TTL expires), customers asking about Demon Slayer pricing receive the cached response with the old price (¥12,800). Some customers see the correct sale price on the website but the wrong price in the chatbot. Complaints flood in.
Business impact: Price inconsistency between channels erodes trust. Customers who purchased at the chatbot-quoted price may demand refunds. Legal/compliance risk if chatbot price is considered a binding offer.
Detection
graph TD
A[Customer complaint:<br/>'Chatbot says 12800 but<br/>website says 8980'] --> B[Support ticket created]
B --> C{Check cache source<br/>in response metadata}
C -->|source = L2_SEMANTIC| D[Confirm stale cache]
E[EventBridge event:<br/>price_changed for<br/>Demon Slayer] --> F{Was invalidation<br/>Lambda invoked?}
F -->|No invocation in<br/>CloudWatch Logs| G[Confirm: EventBridge<br/>rule did not trigger]
H[Automated price<br/>consistency check] --> I{Chatbot response price<br/>== catalog API price?}
I -->|Mismatch| J[CloudWatch Metric:<br/>cache.price_consistency_error]
J --> K[CloudWatch Alarm:<br/>StalePriceDetected]
K --> L[PagerDuty P1 Alert]
style K fill:#e76f51,stroke:#f4a261,color:#fff
style G fill:#e76f51,stroke:#f4a261,color:#fff
| Metric | Normal | Alert Threshold | Source |
|---|---|---|---|
| `invalidation.lambda.invocation_count` | Matches EventBridge event count | Diverges by > 0 | CloudWatch Lambda metrics |
| `cache.price_consistency_errors` | 0 | > 0 | Automated consistency checker |
| `invalidation.latency_ms` | < 500ms | > 5,000ms | Lambda execution duration |
| `eventbridge.failed_invocations` | 0 | > 0 | CloudWatch EventBridge metrics |
Root Cause Analysis
graph TD
STALE[Stale Price in<br/>Cached Response] --> Q1{Was price_changed<br/>event emitted?}
Q1 -->|No| RC1[Root Cause: Catalog CMS<br/>did not emit event.<br/>Manual price update bypassed<br/>event pipeline.]
Q1 -->|Yes| Q2{Did EventBridge<br/>rule match the event?}
Q2 -->|No| RC2[Root Cause: EventBridge rule<br/>filter pattern mismatch.<br/>Event schema changed but<br/>rule was not updated.]
Q2 -->|Yes| Q3{Did Lambda execute<br/>successfully?}
Q3 -->|No| RC3[Root Cause: Lambda timeout<br/>or permission error.<br/>Could not connect to Redis<br/>or CloudFront.]
Q3 -->|Yes| Q4{Were correct Redis<br/>keys deleted?}
Q4 -->|No| RC4[Root Cause: Key scan pattern<br/>did not match. Entity extraction<br/>for 'Demon Slayer' returned<br/>'demon slayer' (case mismatch).]
Q4 -->|Yes| RC5[Root Cause: Race condition.<br/>Cache entry re-populated<br/>before CloudFront invalidation<br/>completed.]
RC2 --> FIX[Fix EventBridge rule filter<br/>to match current event schema]
style STALE fill:#e76f51,stroke:#f4a261,color:#fff
style RC2 fill:#264653,stroke:#2a9d8f,color:#fff
In this scenario: Root Cause 2 — The catalog team changed the event schema from {"detail-type": "PriceChanged"} to {"detail-type": "product.price.changed"}, but the EventBridge rule still filtered on the old pattern.
Resolution
Step 1 — Emergency manual cache flush (5 minutes)
import redis
import boto3
import logging
logger = logging.getLogger(__name__)
def emergency_price_cache_flush(
redis_url: str,
title: str,
cloudfront_distribution_id: str,
) -> dict:
"""
Emergency flush of all cache entries for a specific product.
Called manually by on-call engineer when price staleness is detected.
"""
r = redis.Redis.from_url(redis_url, decode_responses=True)
cf = boto3.client("cloudfront")
# Flush Redis (L2)
removed = 0
cursor = 0
while True:
cursor, keys = r.scan(cursor, match="sc:*", count=500)
for key in keys:
query_text = r.hget(key, "query")
entities_json = r.hget(key, "ents")
if query_text and title.lower() in query_text.lower():
r.delete(key)
removed += 1
elif entities_json and title.lower() in entities_json.lower():
r.delete(key)
removed += 1
if cursor == 0:
break
# Flush CloudFront (L4)
invalidation = cf.create_invalidation(
DistributionId=cloudfront_distribution_id,
InvalidationBatch={
"Paths": {
"Quantity": 1,
"Items": [f"/api/products/*demon-slayer*"],
},
"CallerReference": f"emergency-{title}-{__import__('time').time()}",
},
)
result = {
"redis_entries_removed": removed,
"cloudfront_invalidation_id": invalidation["Invalidation"]["Id"],
}
logger.info("Emergency cache flush for '%s': %s", title, result)
return result
Step 2 — Fix EventBridge rule filter (30 minutes)
{
"source": ["com.mangaassist.catalog"],
"detail-type": ["product.price.changed"],
"detail": {
"change_type": ["price_update", "flash_sale_start", "flash_sale_end"]
}
}
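The prevention table below recommends a CI/CD compatibility test for rule filters. As a sketch, a deliberately simplified local matcher (flat fields plus one level of `detail` keys; real EventBridge semantics are richer, and the authoritative check is the `TestEventPattern` API) catches exactly the schema drift from this incident:

```python
def rule_matches(rule: dict, event: dict) -> bool:
    """Simplified EventBridge pattern check for CI tests. Supports only
    exact-value membership on top-level fields and one level of 'detail'
    keys, which is all the rule above uses."""
    for field, allowed in rule.items():
        if field == "detail":
            detail = event.get("detail", {})
            for key, values in allowed.items():
                if detail.get(key) not in values:
                    return False
        elif event.get(field) not in allowed:
            return False
    return True

RULE = {
    "source": ["com.mangaassist.catalog"],
    "detail-type": ["product.price.changed"],
    "detail": {"change_type": ["price_update", "flash_sale_start", "flash_sale_end"]},
}
# Old-schema event no longer matches; new-schema event does
assert not rule_matches(RULE, {"source": "com.mangaassist.catalog",
                               "detail-type": "PriceChanged",
                               "detail": {"change_type": "price_update"}})
assert rule_matches(RULE, {"source": "com.mangaassist.catalog",
                           "detail-type": "product.price.changed",
                           "detail": {"change_type": "flash_sale_start"}})
```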
Step 3 — Add automated price consistency checker (1–2 hours)
import asyncio
import logging
import random
logger = logging.getLogger(__name__)
class PriceConsistencyChecker:
"""
Periodically samples cached product_info responses and compares
the price in the response against the live catalog API.
Runs as a background task in each ECS container.
"""
def __init__(self, semantic_cache, catalog_api_client, check_interval: int = 60):
self.cache = semantic_cache
self.catalog = catalog_api_client
self.interval = check_interval
self._inconsistencies = 0
async def run(self) -> None:
"""Run continuous price consistency checks."""
while True:
await asyncio.sleep(self.interval)
try:
await self._check_sample()
except Exception as e:
logger.error("Price consistency check failed: %s", e)
async def _check_sample(self) -> None:
"""Sample 10 random product_info cache entries and verify prices."""
# Scan for product_info entries
cursor = 0
candidates = []
while len(candidates) < 50:
cursor, keys = self.cache.redis.scan(cursor, match="sc:*", count=100)
for key in keys:
intent = self.cache.redis.hget(key, "intent")
if intent and intent.decode() == "product_info":
candidates.append(key)
if cursor == 0:
break
# Sample 10
sample = random.sample(candidates, min(10, len(candidates)))
for key in sample:
response_text = self.cache.redis.hget(key, "response")
entities_json = self.cache.redis.hget(key, "ents")
if not response_text or not entities_json:
continue
response_text = response_text.decode()
entities = __import__("json").loads(entities_json.decode())
title = entities.get("title")
if not title:
continue
# Get live price from catalog
live_price = await self.catalog.get_price(title)
if live_price is None:
continue
# Check if cached response contains the correct price
if str(live_price) not in response_text:
self._inconsistencies += 1
logger.warning(
"Price inconsistency: title='%s', live_price=%s, "
"cached response does not contain live price. Key=%s",
title, live_price, key.decode(),
)
# Auto-invalidate the stale entry
self.cache.redis.delete(key)
logger.info("Auto-invalidated stale cache entry: %s", key.decode())
Prevention
| Prevention Measure | Implementation | Effort |
|---|---|---|
| EventBridge rule schema validation | CI/CD check: event schema + rule filter compatibility test | Medium |
| Dead letter queue for failed invalidations | SQS DLQ on the invalidation Lambda; alarm on DLQ depth > 0 | Low |
| Automated price consistency checker | Background task sampling cached product_info responses | Medium |
| Double-write invalidation | Both the CMS AND the cache warmer can trigger invalidation (redundancy) | Low |
| Shorter TTL for price-sensitive intents | Reduce product_info TTL from 1hr to 15min during sale periods | Low |
Scenario 4 — Prompt Cache Miss Rate Spikes After System Prompt Update
Problem Statement
The MangaAssist team updates the system prompt to add a new guardrail: "Do not recommend manga with graphic violence to users under 18." The prompt version changes from v2.3 to v2.4. After deployment:
- Bedrock prompt cache hit rate drops from 95% to 0%
- Time-to-first-token (TTFT) increases from 120ms to 450ms
- Input token cost spikes because the system prompt (2,200 tokens) is fully processed for every request
- For roughly the next 5 minutes, every request pays the full input token cost: at 1M messages/day that is ~694 requests/minute (1,000,000 ÷ (24 × 60))
Business impact: Temporary latency regression and cost spike. If the team deploys prompt changes multiple times per day during an iteration cycle, the cumulative cost impact grows.
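To put a rough dollar figure on the cold window, using the numbers above and an assumed input price of $3.00 per million tokens (illustrative only; check current Bedrock pricing for the model in use):

```python
# Extra input tokens processed at full price while the prompt cache is cold.
PROMPT_TOKENS = 2_200   # system prompt size from the scenario
REQS_PER_MIN = 694      # ~1M messages/day / 1,440 minutes
COLD_WINDOW_MIN = 5
PRICE_PER_MTOK = 3.00   # assumed $/1M input tokens, for illustration

extra_tokens = PROMPT_TOKENS * REQS_PER_MIN * COLD_WINDOW_MIN
extra_cost = extra_tokens / 1_000_000 * PRICE_PER_MTOK
print(extra_tokens, round(extra_cost, 2))
```

About 7.6M extra input tokens, roughly $23 per cold window at the assumed price: small per incident, but it compounds when prompts are redeployed several times a day.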
Detection
graph TD
A[Bedrock usage metrics] --> B{cacheReadInputTokens<br/>dropped to 0?}
B -->|Yes| C[CloudWatch Metric:<br/>bedrock.prompt_cache_miss_rate = 100%]
C --> D[CloudWatch Alarm:<br/>PromptCacheMissSpike]
D --> E{Correlate with<br/>deployment event}
E -->|Deployment within<br/>last 10 min| F[Diagnosis: Prompt change<br/>invalidated Bedrock cache]
E -->|No deployment| G[Diagnosis: Cache eviction<br/>due to 5-min inactivity<br/>or provider-side issue]
style D fill:#e76f51,stroke:#f4a261,color:#fff
| Metric | Normal | Alarm Threshold | Source |
|---|---|---|---|
| `bedrock.cache_read_tokens` | ~2,200 per request | 0 for > 2 min | Bedrock Converse response |
| `bedrock.cache_write_tokens` | ~0 (after first req) | ~2,200 sustained | Bedrock Converse response |
| `bedrock.ttft_ms` | ~120ms | > 400ms sustained for 5 min | Application latency tracing |
| `prompt.version` | Stable | Changed in last 10 min | Deployment metadata |
Root Cause Analysis
graph TD
MISS[Prompt Cache Miss<br/>Rate = 100%] --> Q1{Was system prompt<br/>text changed?}
Q1 -->|Yes| RC1[Root Cause: Any change to<br/>cached prefix invalidates<br/>the Bedrock prompt cache.<br/>Even a single character.]
Q1 -->|No| Q2{Was model ID<br/>changed?}
Q2 -->|Yes| RC2[Root Cause: Different model =<br/>different cache namespace.<br/>Cache is per-model.]
Q2 -->|No| Q3{Was there a gap<br/>> 5 min with no traffic?}
Q3 -->|Yes| RC3[Root Cause: Bedrock prompt<br/>cache evicts after 5 min<br/>of inactivity.]
Q3 -->|No| RC4[Root Cause: Provider-side<br/>cache eviction. Not<br/>controllable by customer.]
RC1 --> FIX1[Schedule prompt changes<br/>during low-traffic windows.<br/>Monitor TTFT recovery.]
style MISS fill:#e76f51,stroke:#f4a261,color:#fff
style RC1 fill:#264653,stroke:#2a9d8f,color:#fff
Resolution
Step 1 — Verify the cause (5 minutes)
import boto3
from datetime import datetime, timedelta
def check_prompt_cache_status(region: str = "us-east-1") -> dict:
"""
Query recent Bedrock invocations to check prompt cache behavior.
Compares cache_read vs cache_write token counts.
"""
cw = boto3.client("cloudwatch", region_name=region)
# Check custom metric for prompt cache hits
response = cw.get_metric_statistics(
Namespace="MangaAssist/Cache",
MetricName="PromptCacheReadTokens",
StartTime=datetime.utcnow() - timedelta(minutes=30),
EndTime=datetime.utcnow(),
Period=60,
Statistics=["Sum", "SampleCount"],
)
datapoints = sorted(response["Datapoints"], key=lambda x: x["Timestamp"])
if not datapoints:
return {"status": "NO_DATA"}
recent = datapoints[-1]
earlier = datapoints[0] if len(datapoints) > 1 else None
return {
"current_cache_read_tokens": recent["Sum"],
"current_sample_count": recent["SampleCount"],
"earlier_cache_read_tokens": earlier["Sum"] if earlier else None,
"cache_active": recent["Sum"] > 0,
"diagnosis": (
"Prompt cache is active"
if recent["Sum"] > 0
else "Prompt cache is NOT active — likely invalidated by prompt change"
),
}
Step 2 — Accept and wait (if expected)
Bedrock prompt caching re-warms automatically. The first request after a prompt change pays the full cache-write cost. Subsequent requests (within 5 minutes) hit the cache. No manual action is needed — the cache self-heals.
Step 3 — Reduce future impact with deployment scheduling
# deployment_config.yaml — schedule prompt changes during low-traffic
prompt_deployment:
preferred_window: "03:00-05:00 JST" # 3-5 AM JST — lowest traffic
pre_warm_requests: 5 # Send 5 dummy requests to warm the cache
monitoring:
watch_metric: "bedrock.prompt_cache_hit_rate"
recovery_threshold: 0.90 # Alert if not recovered within 5 min
alert_channel: "#mangaassist-ops"
Prevention
| Prevention Measure | Implementation | Effort |
|---|---|---|
| Deploy prompt changes during low-traffic windows | CI/CD scheduled deployment at 3 AM JST | Low |
| Pre-warm with synthetic requests | Send 5 dummy requests after prompt deploy to populate cache | Low |
| Monitor TTFT as a deployment health check | If TTFT > 400ms for > 5 min post-deploy, alert | Low |
| Batch prompt changes | Accumulate changes and deploy once/day instead of multiple times | Process |
| Keep static prefix unchanged | Move dynamic instructions to user message prefix; keep system prompt stable | Medium |
Scenario 5 — Redis Memory Exhaustion from Unbounded Cache Growth
Problem Statement
Over 3 months of operation, the ElastiCache Redis cluster (r6g.xlarge, 26 GB memory) gradually fills with cache entries. Key observations:
- Memory utilization reached 92% (alarm threshold: 85%)
- Eviction policy (`allkeys-lru`) starts evicting entries — including frequently accessed ones
- Cache hit rate drops from 28% to 14% as high-value entries are evicted to make room for low-value ones
- Some entries have TTL=86400 (24hr) but are never accessed after the first hit
- Recommendation entries (TTL=14400, 4hr) consume 40% of cache memory but have only 8% hit rate
Business impact: The cache becomes less effective over time. Cost savings decrease as hit rate drops. Eventually, Redis maxmemory is hit and LRU evictions degrade performance unpredictably.
Detection
graph TD
A[CloudWatch Metric:<br/>ElastiCache BytesUsedForCache] --> B{> 85% of<br/>maxmemory?}
B -->|Yes| C[CloudWatch Alarm:<br/>RedisCacheMemoryHigh]
C --> D[SNS → Ops Channel]
E[Application Metric:<br/>cache.hit_rate] --> F{Dropped > 5% from<br/>7-day moving average?}
F -->|Yes| G[CloudWatch Alarm:<br/>CacheHitRateDegradation]
G --> D
H[ElastiCache Metric:<br/>Evictions] --> I{Evictions > 0?}
I -->|Yes| J[CloudWatch Alarm:<br/>RedisCacheEvictions]
J --> D
D --> K[Composite Alarm:<br/>CacheMemoryExhaustion]
K --> L[PagerDuty P2 Alert]
style K fill:#e76f51,stroke:#f4a261,color:#fff
| Metric | Normal | Warning | Critical | Source |
|---|---|---|---|---|
| `BytesUsedForCache` | < 70% | > 85% | > 95% | CloudWatch AWS/ElastiCache |
| `CacheHits / (CacheHits + CacheMisses)` | ~28% | < 20% | < 15% | CloudWatch AWS/ElastiCache |
| `Evictions` | 0 | > 0 | > 100/min | CloudWatch AWS/ElastiCache |
| `cache.entry_count` | < 500K | > 750K | > 1M | Application metric (FT.INFO) |
| Memory per intent category | Balanced | Any category > 50% | — | Custom metric |
Root Cause Analysis
graph TD
MEM[Redis Memory<br/>Exhaustion] --> Q1{Are TTLs being<br/>set on all entries?}
Q1 -->|No| RC1[Root Cause: Some code paths<br/>store entries without TTL.<br/>These entries never expire.]
Q1 -->|Yes| Q2{Is a single intent<br/>consuming > 40% memory?}
Q2 -->|Yes| RC2[Root Cause: Low hit-rate intents<br/>consuming disproportionate memory.<br/>Recommendations: 40% memory,<br/>8% hit rate.]
Q2 -->|No| Q3{Are entries being<br/>re-created after TTL expiry?}
Q3 -->|Yes| RC3[Root Cause: Cache churn —<br/>entries expire, get re-created,<br/>and accumulate unique keys<br/>due to timestamp in key.]
Q3 -->|No| RC4[Root Cause: Traffic growth<br/>exceeded original memory<br/>sizing. Need to scale<br/>the cluster.]
RC2 --> FIX[Implement memory budgets<br/>per intent category +<br/>reduce recommendation TTL]
style MEM fill:#e76f51,stroke:#f4a261,color:#fff
style RC2 fill:#264653,stroke:#2a9d8f,color:#fff
Resolution
Step 1 — Immediate memory relief (10 minutes)
import redis
import json
import logging
logger = logging.getLogger(__name__)
def emergency_memory_cleanup(
redis_url: str,
low_value_intents: list[str] = None,
max_entries_to_remove: int = 50_000,
) -> dict:
"""
Emergency memory cleanup: remove low-value cache entries.
Targets entries with low hit counts and low-value intents.
"""
if low_value_intents is None:
low_value_intents = ["greeting", "recommendation"]
r = redis.Redis.from_url(redis_url, decode_responses=True)
removed = 0
cursor = 0
while removed < max_entries_to_remove:
cursor, keys = r.scan(cursor, match="sc:*", count=500)
for key in keys:
intent = r.hget(key, "intent")
hits = r.hget(key, "hits")
should_remove = False
# Remove low-value intents
if intent in low_value_intents:
should_remove = True
# Remove entries with 0 hits that are older than 1 hour
elif hits and int(hits) == 0:
ts = r.hget(key, "ts")
if ts and ((__import__("time").time() - float(ts)) > 3600):
should_remove = True
if should_remove:
r.delete(key)
removed += 1
if cursor == 0:
break
result = {
"entries_removed": removed,
"memory_before": r.info("memory")["used_memory_human"],
}
logger.info("Emergency cleanup: %s", result)
return result
Step 2 — Implement memory budgets per intent (1–2 hours)
import logging
import time
logger = logging.getLogger(__name__)
# Memory budget: percentage of total Redis memory per intent category
INTENT_MEMORY_BUDGET = {
"faq": 0.25, # 25% — high hit rate, high value
"product_info": 0.25, # 25% — high hit rate, business-critical
"shipping_info": 0.15, # 15% — moderate volume, stable answers
"manga_release_date": 0.15,# 15% — moderate volume, time-sensitive
"recommendation": 0.10, # 10% — low hit rate, reduce from 40%
"greeting": 0.02, # 2% — minimal variation
"manga_search": 0.05, # 5% — moderate hit rate
"_other": 0.03, # 3% — catch-all
}
class MemoryBudgetEnforcer:
"""
Enforces per-intent memory budgets in the Redis cache.
When an intent exceeds its budget, the least-recently-hit
entries for that intent are evicted.
"""
def __init__(self, redis_client, max_memory_bytes: int):
self.redis = redis_client
self.max_memory = max_memory_bytes
def enforce(self) -> dict:
"""
Scan all cache entries, compute per-intent memory usage,
and evict entries from over-budget intents.
"""
# Phase 1: Inventory
intent_entries: dict[str, list[dict]] = {}
cursor = 0
while True:
cursor, keys = self.redis.scan(cursor, match="sc:*", count=500)
for key in keys:
intent = self.redis.hget(key, "intent")
hits = self.redis.hget(key, "hits")
ts = self.redis.hget(key, "ts")
mem = self.redis.memory_usage(key) or 0
intent_str = intent.decode() if intent else "_other"
if intent_str not in intent_entries:
intent_entries[intent_str] = []
intent_entries[intent_str].append({
"key": key,
"hits": int(hits) if hits else 0,
"ts": float(ts) if ts else 0,
"memory": mem,
})
if cursor == 0:
break
# Phase 2: Check budgets and evict
eviction_report = {}
for intent, entries in intent_entries.items():
budget_pct = INTENT_MEMORY_BUDGET.get(intent, INTENT_MEMORY_BUDGET["_other"])
budget_bytes = int(self.max_memory * budget_pct)
current_usage = sum(e["memory"] for e in entries)
if current_usage <= budget_bytes:
eviction_report[intent] = {
"status": "within_budget",
"usage_bytes": current_usage,
"budget_bytes": budget_bytes,
"evicted": 0,
}
continue
# Sort by hit count (ascending) then timestamp (ascending)
# → evict least-hit, oldest entries first
entries.sort(key=lambda e: (e["hits"], e["ts"]))
evicted = 0
for entry in entries:
if current_usage <= budget_bytes:
break
self.redis.delete(entry["key"])
current_usage -= entry["memory"]
evicted += 1
eviction_report[intent] = {
"status": "evicted",
"usage_bytes": current_usage,
"budget_bytes": budget_bytes,
"evicted": evicted,
}
logger.info("Memory budget enforcement: %s", eviction_report)
return eviction_report
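A quick sanity check on the budget map (restated here so the snippet is self-contained): the shares must sum to exactly 100%, and converting them to bytes against the 26 GB node from the problem statement gives each intent's ceiling:

```python
INTENT_MEMORY_BUDGET = {
    "faq": 0.25, "product_info": 0.25, "shipping_info": 0.15,
    "manga_release_date": 0.15, "recommendation": 0.10,
    "greeting": 0.02, "manga_search": 0.05, "_other": 0.03,
}
MAX_MEMORY_BYTES = 26 * 1024**3  # r6g.xlarge node from the scenario

# Shares must cover exactly 100% of maxmemory, no more and no less
assert abs(sum(INTENT_MEMORY_BUDGET.values()) - 1.0) < 1e-9

byte_budgets = {i: int(MAX_MEMORY_BYTES * p) for i, p in INTENT_MEMORY_BUDGET.items()}
print(byte_budgets["faq"])  # 25% of 26 GB
```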
Step 3 — Schedule regular cleanup as an ECS scheduled task
# EventBridge scheduled rule — runs every 6 hours
{
"schedule": "rate(6 hours)",
"target": {
"arn": "arn:aws:ecs:us-east-1:123456789:cluster/manga-cluster",
"taskDefinition": "cache-maintenance:latest",
"overrides": {
"containerOverrides": [{
"name": "cache-maintenance",
"command": ["python", "-m", "cache_maintenance.enforce_budgets"]
}]
}
}
}
Prevention
| Prevention Measure | Implementation | Effort |
|---|---|---|
| Per-intent memory budgets | MemoryBudgetEnforcer runs every 6 hours via scheduled task | Medium |
| Reduce recommendation TTL | Drop from 4hr to 1hr — low hit rate does not justify long TTL | Low |
| Max entry count limit | Cap total entries at 500K; reject new writes when at capacity | Low |
| ElastiCache scaling alarm | If `BytesUsedForCache` > 70% for 24hr, auto-scale to next instance size | Medium |
| Cache value scoring | score = hit_count / memory_bytes; evict lowest-score entries first | High |
| TTL audit | Weekly report of entries by TTL bucket; flag intents with TTL >> access frequency | Low |
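The cache value scoring row can be sketched as a sort key; the entry dict shape mirrors the inventory built by `MemoryBudgetEnforcer` above, but the helper itself is hypothetical:

```python
def eviction_order(entries: list[dict]) -> list[str]:
    """Return cache keys ordered for eviction: lowest value score first,
    where score = hit_count / memory_bytes (hits earned per byte held)."""
    return [
        e["key"]
        for e in sorted(entries, key=lambda e: e["hits"] / max(e["memory"], 1))
    ]

entries = [
    {"key": "a", "hits": 10, "memory": 100},   # score 0.10
    {"key": "b", "hits": 1,  "memory": 1000},  # score 0.001, evict first
    {"key": "c", "hits": 50, "memory": 100},   # score 0.50, keep longest
]
assert eviction_order(entries) == ["b", "a", "c"]
```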
Scenario Cross-Reference
| # | Scenario | Primary Failure Mode | Detection Speed | Resolution Time | Severity |
|---|---|---|---|---|---|
| 1 | Wrong manga recommendation | Semantic false positive | Minutes (user feedback) | Hours (threshold tuning + entity filter) | P2 — Incorrect answers |
| 2 | Cache stampede on new release | Thundering herd | Seconds (throttle alarm) | Minutes (single-flight deploy) | P1 — Service degradation |
| 3 | Stale pricing after flash sale | Invalidation failure | Minutes (consistency check) | Minutes (emergency flush) | P1 — Business/legal risk |
| 4 | Prompt cache miss after update | Expected cache cold start | Minutes (TTFT spike) | Self-healing (5 min) | P3 — Temporary cost spike |
| 5 | Redis memory exhaustion | Unbounded growth | Hours (gradual metric shift) | Hours (budget enforcement) | P2 — Degraded hit rate |