
Intelligent Caching — Scenarios and Runbooks

AWS AIP-C01 Task 4.1 — Skill 4.1.4: Design intelligent caching systems for FM applications

Context: MangaAssist e-commerce chatbot — Bedrock Claude 3 (Sonnet/Haiku), OpenSearch Serverless, DynamoDB, ECS Fargate, API Gateway WebSocket, ElastiCache Redis. 1M messages/day.


Skill Mapping

| Certification | Domain | Task | Skill |
| --- | --- | --- | --- |
| AWS AIP-C01 | Domain 4 — Operational Efficiency | Task 4.1 — Optimize FM applications | Skill 4.1.4 — Diagnose and resolve caching failures, false positives, stampedes, staleness, and resource exhaustion in FM caching systems |

Skill scope: Five production scenarios covering the failure modes of intelligent caching in a high-traffic GenAI chatbot — each with detection, root cause analysis, resolution steps, and prevention measures.


Scenario 1 — Semantic Cache Returns Wrong Manga Recommendation

Problem Statement

A customer asks: "Can you recommend something like Dragon Ball?"

The semantic cache returns a previously cached response for "Can you recommend something like Dragon Ball Super?" — which recommends Dragon Ball Super's sequel arcs, not titles similar to the original Dragon Ball series. The cosine similarity between the two queries is 0.94, above the 0.92 threshold, so the cache treats them as equivalent.

Business impact: Customer receives irrelevant recommendations, reducing trust and conversion rate. At MangaAssist's scale (roughly 280K daily L2 cache hits at the 28% steady-state hit rate noted in Scenario 5), even a 0.5% false positive rate means ~1,400 incorrect responses per day.

Detection

graph TD
    A[User submits negative feedback<br/>'This isn't what I asked for'] --> B{Feedback classifier}
    B -->|Cache-related| C[Check if response<br/>was served from cache]
    C -->|source = L2_SEMANTIC| D[Log: semantic_cache_false_positive]
    D --> E[CloudWatch Metric:<br/>cache.false_positive_count]
    E --> F{Rate > 0.5%<br/>of L2 hits?}
    F -->|Yes| G[CloudWatch Alarm:<br/>CacheFalsePositiveRate]
    G --> H[PagerDuty / SNS Alert]

    B -->|Not cache-related| I[Route to standard<br/>feedback pipeline]
    C -->|source = BEDROCK| J[Not a cache issue —<br/>model quality problem]

    style G fill:#e76f51,stroke:#f4a261,color:#fff

Key metrics to monitor:

| Metric | Normal | Alarm Threshold | Source |
| --- | --- | --- | --- |
| cache.false_positive_count | < 50/hr | > 200/hr | User feedback + automated checks (publisher sketch below) |
| cache.false_positive_rate | < 0.5% | > 1.0% | false_positives / l2_hits |
| cache.similarity_score_distribution | Peaks at 0.93–0.97 | Shift toward 0.92 boundary | Redis search results |
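
A lightweight publisher for these metrics, as a minimal sketch: it assumes the application emits them as custom metrics under the MangaAssist/Cache namespace used elsewhere in this runbook, and emit_false_positive_metric is a hypothetical helper name.

import boto3

cloudwatch = boto3.client("cloudwatch")


def emit_false_positive_metric(l2_hits: int, false_positives: int) -> None:
    """Publish false-positive count and rate so the alarms above have data."""
    rate_pct = (false_positives / l2_hits * 100) if l2_hits else 0.0
    cloudwatch.put_metric_data(
        Namespace="MangaAssist/Cache",
        MetricData=[
            {"MetricName": "cache.false_positive_count",
             "Value": float(false_positives), "Unit": "Count"},
            {"MetricName": "cache.false_positive_rate",
             "Value": rate_pct, "Unit": "Percent"},
        ],
    )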

Root Cause Analysis

graph TD
    FP[False Positive<br/>Detected] --> Q1{Was the cached query<br/>actually different intent?}
    Q1 -->|Yes| RC1[Root Cause: Intent classifier<br/>assigned same intent to<br/>semantically different queries]
    Q1 -->|No, same intent| Q2{Is cosine similarity<br/>barely above threshold?}
    Q2 -->|Yes, sim = 0.920–0.930| RC2[Root Cause: Threshold too low<br/>for this intent category]
    Q2 -->|No, sim > 0.93| Q3{Are the queries about<br/>related but distinct titles?}
    Q3 -->|Yes| RC3[Root Cause: Franchise titles<br/>cluster too tightly in<br/>embedding space]
    Q3 -->|No| RC4[Root Cause: Embedding model<br/>lacks domain specificity<br/>for manga titles]

    RC1 --> FIX1[Add intent sub-classification<br/>dragon_ball vs dragon_ball_super]
    RC2 --> FIX2[Raise threshold for<br/>recommendation intent to 0.96]
    RC3 --> FIX3[Include title entity<br/>as hard filter in cache key]
    RC4 --> FIX4[Fine-tune embedding model<br/>on manga title pairs]

    style FP fill:#e76f51,stroke:#f4a261,color:#fff
    style RC3 fill:#264653,stroke:#2a9d8f,color:#fff

In this scenario: Root Cause 3 is most likely. "Dragon Ball" and "Dragon Ball Super" share significant token overlap, producing embeddings within cosine distance 0.06. The semantic cache cannot distinguish them by embedding alone.
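
To make the failure concrete, here is the similarity check the L2 lookup effectively performs; a sketch where embed() is a hypothetical embedding call, and the 0.94/0.92 figures come from the problem statement.

import numpy as np


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    va = np.asarray(a, dtype=np.float32)
    vb = np.asarray(b, dtype=np.float32)
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))


# sim = cosine_similarity(embed("...like Dragon Ball?"),
#                         embed("...like Dragon Ball Super?"))
# sim ≈ 0.94 > 0.92 threshold, so the cache treats it as a hit: a false positive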

Resolution

Step 1 — Immediate mitigation (5 minutes)

# Emergency: raise recommendation threshold to reject borderline matches.
# 0.97 is deliberately stricter than the long-term 0.96 target documented
# below; relax it once the entity filter (Step 3) is in place.
# Deploy via environment variable — no code deploy needed
# ECS task definition environment variable:
CACHE_THRESHOLD_RECOMMENDATION = "0.97"

Step 2 — Invalidate affected entries (10 minutes)

import redis
import logging

logger = logging.getLogger(__name__)


def invalidate_recommendation_cache(
    redis_client: redis.Redis,
    index_name: str = "mangaassist_cache_idx",
) -> int:
    """
    Remove all recommendation-intent entries from the semantic cache.
    Called as emergency response to false positive spike.
    """
    cursor = 0
    removed = 0
    while True:
        cursor, keys = redis_client.scan(cursor, match="sc:*", count=500)
        for key in keys:
            intent = redis_client.hget(key, "intent")
            if intent and intent.decode() == "recommendation":
                redis_client.delete(key)
                removed += 1
        if cursor == 0:
            break
    logger.info("Invalidated %d recommendation cache entries", removed)
    return removed

Step 3 — Add entity-based hard filter (1–2 hours)

def recommendation_cache_search_with_entity_filter(
    redis_client: redis.Redis,
    embedding: list[float],
    primary_entity: str,
    intent: str = "recommendation",
) -> dict | None:
    """
    Enhanced search that requires the primary manga title entity
    to match exactly, in addition to vector similarity.
    This prevents 'Dragon Ball' from matching 'Dragon Ball Super'.
    """
    import numpy as np

    vec_bytes = np.array(embedding, dtype=np.float32).tobytes()

    # Hard filter: intent AND the primary entity must both match exactly.
    # A TAG filter matches whole values, so 'dragon ball' cannot match
    # 'dragon ball super' (a TEXT filter on the query field would, since
    # TEXT search matches on tokens). Assumes the index schema includes an
    # 'entity' TAG field populated at store time; TAG special characters
    # such as spaces and hyphens must be escaped.
    entity_tag = primary_entity.lower().replace(" ", "\\ ").replace("-", "\\-")
    filter_expr = f"@intent:{{{intent}}} @entity:{{{entity_tag}}}"
    query = f"({filter_expr})=>[KNN 3 @vec $blob AS dist]"

    results = redis_client.execute_command(
        "FT.SEARCH", "mangaassist_cache_idx", query,
        "PARAMS", "2", "blob", vec_bytes,
        "SORTBY", "dist", "ASC",
        "LIMIT", "0", "1",
        "RETURN", "3", "response", "query", "dist",
        "DIALECT", "2",
    )

    if results[0] == 0:
        return None

    # Parse and apply threshold check
    # ... (standard threshold logic)
    return None  # Placeholder

Prevention

| Prevention Measure | Implementation | Effort |
| --- | --- | --- |
| Entity-aware cache keys | Include primary entity (manga title) as a hard TAG filter in Redis search | Medium |
| Intent-specific thresholds | Raise recommendation threshold to 0.96 (already documented in threshold map) | Low |
| Franchise disambiguation | Maintain a franchise-alias table: "Dragon Ball" ≠ "Dragon Ball Super" ≠ "Dragon Ball GT" (sketch below) | Medium |
| Automated false positive detection | Compare cached response entities vs query entities; flag mismatches | High |
| User feedback loop | "Was this helpful?" button writes to feedback queue; high negative rate triggers threshold auto-adjustment | Medium |
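
The franchise-alias table from the prevention list can be a small in-process lookup. A sketch with hypothetical names and one illustrative franchise entry:

# Titles in the same franchise that must never be treated as interchangeable
FRANCHISE_DISTINCT_TITLES = {
    "dragon ball": {"dragon ball", "dragon ball super", "dragon ball gt"},
}


def same_franchise_but_distinct(title_a: str, title_b: str) -> bool:
    """True when two titles belong to one franchise but are distinct works;
    a semantic cache hit between them would be a false positive."""
    a, b = title_a.lower(), title_b.lower()
    if a == b:
        return False
    return any(a in members and b in members
               for members in FRANCHISE_DISTINCT_TITLES.values())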

Scenario 2 — Cache Stampede During New Manga Release

Problem Statement

A highly anticipated manga volume (e.g., Jujutsu Kaisen final volume) releases at midnight JST. Within 60 seconds, 15,000 users simultaneously ask: "Is Jujutsu Kaisen final volume available?"

The cache has no entry for this query (new product, never asked before). All 15,000 requests miss the cache simultaneously and hit Bedrock in parallel. This creates a thundering herd / cache stampede:

  • Bedrock throttle limit hit (requests rejected with ThrottlingException)
  • RAG pipeline overloaded (15,000 parallel OpenSearch queries)
  • p99 latency spikes from 2.8s to 25s+
  • Some users receive errors ("Service temporarily unavailable")

Business impact: The highest-traffic moment (new release) coincides with the worst user experience. Customers leave, revenue drops, social media complaints spike.

Detection

graph TD
    A[Bedrock ThrottlingException<br/>rate > 100/min] --> ALARM1[CloudWatch Alarm:<br/>BedrockThrottleRate]
    B[Cache miss rate spikes<br/>to > 95% for 5 min] --> ALARM2[CloudWatch Alarm:<br/>CacheMissRateSpike]
    C[OpenSearch latency<br/>p99 > 5000ms] --> ALARM3[CloudWatch Alarm:<br/>RAGLatencyHigh]
    D[ECS task CPU > 90%<br/>for > 2 min] --> ALARM4[CloudWatch Alarm:<br/>ECSCPUHigh]

    ALARM1 --> COMP[CloudWatch Composite Alarm:<br/>CacheStampedeDetected]
    ALARM2 --> COMP
    ALARM3 --> COMP
    ALARM4 --> COMP

    COMP --> SNS[SNS → PagerDuty<br/>P1 Incident]
    COMP --> AUTO[Auto-Remediation Lambda]

    style COMP fill:#e76f51,stroke:#f4a261,color:#fff

| Metric | Normal (steady state) | Stampede Indicator | Source |
| --- | --- | --- | --- |
| cache.miss_rate | ~60% | > 95% sustained for 5 min | Application metrics |
| bedrock.throttle_count | < 5/min | > 100/min | CloudWatch AWS/Bedrock |
| opensearch.search_latency_p99 | 150ms | > 5,000ms | CloudWatch AWS/AOSS |
| ecs.cpu_utilization | 45% | > 90% | CloudWatch AWS/ECS |
| Concurrent identical queries | < 10 | > 1,000 | Application-level dedup counter (sketch below) |
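
The application-level dedup counter in the last row can be a short-lived Redis counter per normalized query. A sketch, where coalesce_key is the same hash computed in the single-flight code of Resolution Step 1:

def count_concurrent_identical(redis_client, coalesce_key: str,
                               window_s: int = 10) -> int:
    """Increment a per-query counter inside a fixed window so a spike in
    concurrent identical queries (> 1,000) can be detected and alarmed on."""
    counter_key = f"dedup:{coalesce_key}"
    count = redis_client.incr(counter_key)
    if count == 1:
        # First request in this window: start the window timer
        redis_client.expire(counter_key, window_s)
    return count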

Root Cause Analysis

graph TD
    STAMP[Cache Stampede<br/>Detected] --> Q1{Is this a<br/>new product/event?}
    Q1 -->|Yes| RC1[Root Cause: No cache warming<br/>for the new release.<br/>Cold cache + sudden demand spike.]
    Q1 -->|No| Q2{Did cache recently<br/>get invalidated?}
    Q2 -->|Yes| RC2[Root Cause: Mass invalidation<br/>flushed entries that were<br/>immediately re-requested.]
    Q2 -->|No| Q3{Did TTLs expire<br/>simultaneously?}
    Q3 -->|Yes| RC3[Root Cause: Synchronized TTL<br/>expiry. All entries for<br/>an intent expired at once.]
    Q3 -->|No| RC4[Root Cause: Infrastructure issue<br/>Redis restart or connectivity<br/>loss emptied the cache.]

    RC1 --> FIX1[Implement pre-release<br/>cache warming pipeline]
    RC2 --> FIX2[Add stampede lock<br/>single-flight pattern]
    RC3 --> FIX3[Add TTL jitter<br/>±10% randomization]
    RC4 --> FIX4[Redis cluster mode<br/>+ AOF persistence]

    style STAMP fill:#e76f51,stroke:#f4a261,color:#fff
    style RC1 fill:#264653,stroke:#2a9d8f,color:#fff

Resolution

Step 1 — Request coalescing / single-flight pattern (immediate)

When multiple identical requests arrive simultaneously, only the first one invokes Bedrock. All others wait for the first response, which is then shared.

import asyncio
import hashlib
import logging

logger = logging.getLogger(__name__)


class StampedeProtection:
    """
    Single-flight / request coalescing pattern.
    When N concurrent requests arrive for the same cache key,
    only 1 invokes Bedrock. The other N-1 wait for the result.

    Uses Redis distributed locks to coordinate across ECS containers.
    """

    LOCK_PREFIX = "lock:"
    LOCK_TTL = 30  # seconds — max time to wait for Bedrock response

    def __init__(self, redis_client, semantic_cache, bedrock_invoker):
        self.redis = redis_client
        self.cache = semantic_cache
        self.invoker = bedrock_invoker

    async def get_or_invoke(
        self,
        query: str,
        intent: str,
        entities: dict,
        rag_context: str,
        session_history: list,
    ) -> dict:
        """
        Try cache → if miss, acquire lock → invoke Bedrock → store → release.
        Concurrent requests for the same query wait on the lock.
        """
        # Step 1: Try cache
        cached = self.cache.lookup(query=query, intent=intent)
        if cached:
            return cached

        # Step 2: Compute coalescing key
        normalized = self.cache._normalize(query)
        coalesce_key = hashlib.md5(f"{normalized}|{intent}".encode()).hexdigest()
        lock_key = f"{self.LOCK_PREFIX}{coalesce_key}"

        # Step 3: Try to acquire distributed lock
        acquired = self.redis.set(lock_key, "1", nx=True, ex=self.LOCK_TTL)

        if acquired:
            # This container is the leader — invoke Bedrock
            try:
                response = self.invoker.invoke(
                    user_message=query,
                    rag_context=rag_context,
                    session_history=session_history,
                )
                # Store in cache for all waiters
                self.cache.store(
                    query=query,
                    response=response["response_text"],
                    intent=intent,
                    entities=entities,
                )
                return {
                    "response": response["response_text"],
                    "source": "BEDROCK_LEADER",
                    "similarity": 1.0,
                }
            finally:
                self.redis.delete(lock_key)
        else:
            # Another container is the leader — poll cache until result appears
            logger.info("Waiting for leader to populate cache for key=%s", coalesce_key)
            for _ in range(60):  # Wait up to 30 seconds (60 x 0.5s)
                await asyncio.sleep(0.5)
                cached = self.cache.lookup(query=query, intent=intent)
                if cached:
                    cached["source"] = "COALESCED_WAIT"
                    return cached

            # Timeout — fall through to direct invocation
            logger.warning("Coalescing timeout for key=%s, invoking directly", coalesce_key)
            response = self.invoker.invoke(
                user_message=query,
                rag_context=rag_context,
                session_history=session_history,
            )
            return {
                "response": response["response_text"],
                "source": "BEDROCK_TIMEOUT_FALLBACK",
                "similarity": 1.0,
            }

Step 2 — Pre-release cache warming (preventive, triggered by catalog event)

import logging

logger = logging.getLogger(__name__)


async def warm_new_release_cache(
    title: str,
    volume: str,
    cache_warmer,  # CacheWarmer instance
) -> dict:
    """
    Triggered by EventBridge when a new manga release is added to the catalog.
    Pre-populates cache 1 hour before the release goes live.
    """
    result = cache_warmer.warm_new_release(title=title, volume=volume)
    logger.info("Pre-warmed cache for %s vol %s: %s", title, volume, result)
    return result
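
For completeness, a minimal sketch of the Lambda entry point EventBridge would invoke; the detail field names (title, volume) and the build_cache_warmer factory are assumptions, not the actual catalog event schema:

import asyncio


def handler(event, context):
    """Hypothetical Lambda handler for the catalog new-release rule."""
    detail = event.get("detail", {})
    return asyncio.run(
        warm_new_release_cache(
            title=detail["title"],
            volume=detail["volume"],
            cache_warmer=build_cache_warmer(),  # hypothetical factory
        )
    )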

Step 3 — TTL jitter to prevent synchronized expiry

import random

def ttl_with_jitter(base_ttl: int, jitter_pct: float = 0.10) -> int:
    """
    Add ±10% random jitter to TTL to prevent synchronized expiry.
    base_ttl=3600 → returns 3240–3960.
    """
    jitter = int(base_ttl * jitter_pct)
    return base_ttl + random.randint(-jitter, jitter)
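
Applying it at store time is a one-liner; since cache entries are Redis hashes, expiry goes on the key itself (redis_client and coalesce_key are assumed from the surrounding code):

# Entries written in the same burst now expire across a ±10% window
redis_client.expire(f"sc:{coalesce_key}", ttl_with_jitter(3600))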

Prevention

| Prevention Measure | Implementation | Effort |
| --- | --- | --- |
| Single-flight / request coalescing | Redis distributed lock; only 1 request per unique query hits Bedrock | Medium |
| Pre-release cache warming | EventBridge rule triggers CacheWarmer on catalog update | Medium |
| TTL jitter | Add ±10% randomization to all TTLs | Low |
| Bedrock provisioned throughput | Reserve model units for anticipated spikes | Medium (cost) |
| Graceful degradation | When throttled, serve stale cache (bypass TTL) with "may be outdated" disclaimer (sketch below) | Medium |
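
The graceful-degradation row deserves a sketch: on throttling, accept a logically expired entry and label it. The ignore_ttl flag is an assumed cache-interface parameter, not an existing one:

def lookup_allow_stale(cache, query: str, intent: str, throttled: bool) -> dict | None:
    """Serve stale-but-present cache entries while Bedrock is throttled.
    Assumes physical Redis TTLs are set longer than the logical TTL, so
    logically expired entries are still retrievable."""
    # ignore_ttl is an assumed flag on the cache lookup interface
    entry = cache.lookup(query=query, intent=intent, ignore_ttl=throttled)
    if entry and throttled:
        entry["disclaimer"] = "This information may be outdated."
    return entry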

Scenario 3 — Stale Pricing in Cached Response After Flash Sale

Problem Statement

MangaAssist runs a flash sale: Demon Slayer Complete Box Set drops from ¥12,800 to ¥8,980. The event-driven invalidation system is supposed to flush all product_info cache entries for "Demon Slayer" — but a bug in the EventBridge rule filter means the invalidation Lambda is never triggered.

For the next hour (until TTL expires), customers asking about Demon Slayer pricing receive the cached response with the old price (¥12,800). Some customers see the correct sale price on the website but the wrong price in the chatbot. Complaints flood in.

Business impact: Price inconsistency between channels erodes trust. Customers who purchased at the chatbot-quoted price may demand refunds. Legal/compliance risk if chatbot price is considered a binding offer.

Detection

graph TD
    A[Customer complaint:<br/>'Chatbot says 12800 but<br/>website says 8980'] --> B[Support ticket created]
    B --> C{Check cache source<br/>in response metadata}
    C -->|source = L2_SEMANTIC| D[Confirm stale cache]

    E[EventBridge event:<br/>price_changed for<br/>Demon Slayer] --> F{Was invalidation<br/>Lambda invoked?}
    F -->|No invocation in<br/>CloudWatch Logs| G[Confirm: EventBridge<br/>rule did not trigger]

    H[Automated price<br/>consistency check] --> I{Chatbot response price<br/>== catalog API price?}
    I -->|Mismatch| J[CloudWatch Metric:<br/>cache.price_consistency_error]
    J --> K[CloudWatch Alarm:<br/>StalePriceDetected]
    K --> L[PagerDuty P1 Alert]

    style K fill:#e76f51,stroke:#f4a261,color:#fff
    style G fill:#e76f51,stroke:#f4a261,color:#fff

| Metric | Normal | Alert Threshold | Source |
| --- | --- | --- | --- |
| invalidation.lambda.invocation_count | Matches EventBridge event count | Diverges by > 0 | CloudWatch Lambda metrics |
| cache.price_consistency_errors | 0 | > 0 | Automated consistency checker |
| invalidation.latency_ms | < 500ms | > 5,000ms | Lambda execution duration |
| eventbridge.failed_invocations | 0 | > 0 | CloudWatch EventBridge metrics |

Root Cause Analysis

graph TD
    STALE[Stale Price in<br/>Cached Response] --> Q1{Was price_changed<br/>event emitted?}
    Q1 -->|No| RC1[Root Cause: Catalog CMS<br/>did not emit event.<br/>Manual price update bypassed<br/>event pipeline.]
    Q1 -->|Yes| Q2{Did EventBridge<br/>rule match the event?}
    Q2 -->|No| RC2[Root Cause: EventBridge rule<br/>filter pattern mismatch.<br/>Event schema changed but<br/>rule was not updated.]
    Q2 -->|Yes| Q3{Did Lambda execute<br/>successfully?}
    Q3 -->|No| RC3[Root Cause: Lambda timeout<br/>or permission error.<br/>Could not connect to Redis<br/>or CloudFront.]
    Q3 -->|Yes| Q4{Were correct Redis<br/>keys deleted?}
    Q4 -->|No| RC4[Root Cause: Key scan pattern<br/>did not match. Entity extraction<br/>for 'Demon Slayer' returned<br/>'demon slayer' (case mismatch).]
    Q4 -->|Yes| RC5[Root Cause: Race condition.<br/>Cache entry re-populated<br/>before CloudFront invalidation<br/>completed.]

    RC2 --> FIX[Fix EventBridge rule filter<br/>to match current event schema]

    style STALE fill:#e76f51,stroke:#f4a261,color:#fff
    style RC2 fill:#264653,stroke:#2a9d8f,color:#fff

In this scenario: Root Cause 2 — The catalog team changed the event schema from {"detail-type": "PriceChanged"} to {"detail-type": "product.price.changed"}, but the EventBridge rule still filtered on the old pattern.

Resolution

Step 1 — Emergency manual cache flush (5 minutes)

import logging
import time

import boto3
import redis

logger = logging.getLogger(__name__)


def emergency_price_cache_flush(
    redis_url: str,
    title: str,
    cloudfront_distribution_id: str,
) -> dict:
    """
    Emergency flush of all cache entries for a specific product.
    Called manually by on-call engineer when price staleness is detected.
    """
    r = redis.Redis.from_url(redis_url, decode_responses=True)
    cf = boto3.client("cloudfront")

    # Flush Redis (L2)
    removed = 0
    cursor = 0
    while True:
        cursor, keys = r.scan(cursor, match="sc:*", count=500)
        for key in keys:
            query_text = r.hget(key, "query")
            entities_json = r.hget(key, "ents")
            if query_text and title.lower() in query_text.lower():
                r.delete(key)
                removed += 1
            elif entities_json and title.lower() in entities_json.lower():
                r.delete(key)
                removed += 1
        if cursor == 0:
            break

    # Flush CloudFront (L4). Derive the URL slug from the title so this
    # works for any product, not just Demon Slayer.
    slug = title.lower().replace(" ", "-")
    invalidation = cf.create_invalidation(
        DistributionId=cloudfront_distribution_id,
        InvalidationBatch={
            "Paths": {
                "Quantity": 1,
                "Items": [f"/api/products/*{slug}*"],
            },
            "CallerReference": f"emergency-{slug}-{time.time()}",
        },
    )
        },
    )

    result = {
        "redis_entries_removed": removed,
        "cloudfront_invalidation_id": invalidation["Invalidation"]["Id"],
    }
    logger.info("Emergency cache flush for '%s': %s", title, result)
    return result

Step 2 — Fix EventBridge rule filter (30 minutes)

{
    "source": ["com.mangaassist.catalog"],
    "detail-type": ["product.price.changed"],
    "detail": {
        "change_type": ["price_update", "flash_sale_start", "flash_sale_end"]
    }
}

Step 3 — Add automated price consistency checker (1–2 hours)

import asyncio
import json
import logging
import random

logger = logging.getLogger(__name__)


class PriceConsistencyChecker:
    """
    Periodically samples cached product_info responses and compares
    the price in the response against the live catalog API.
    Runs as a background task in each ECS container.
    """

    def __init__(self, semantic_cache, catalog_api_client, check_interval: int = 60):
        self.cache = semantic_cache
        self.catalog = catalog_api_client
        self.interval = check_interval
        self._inconsistencies = 0

    async def run(self) -> None:
        """Run continuous price consistency checks."""
        while True:
            await asyncio.sleep(self.interval)
            try:
                await self._check_sample()
            except Exception as e:
                logger.error("Price consistency check failed: %s", e)

    async def _check_sample(self) -> None:
        """Sample 10 random product_info cache entries and verify prices."""
        # Scan for product_info entries
        cursor = 0
        candidates = []
        while len(candidates) < 50:
            cursor, keys = self.cache.redis.scan(cursor, match="sc:*", count=100)
            for key in keys:
                intent = self.cache.redis.hget(key, "intent")
                if intent and intent.decode() == "product_info":
                    candidates.append(key)
            if cursor == 0:
                break

        # Sample 10
        sample = random.sample(candidates, min(10, len(candidates)))
        for key in sample:
            response_text = self.cache.redis.hget(key, "response")
            entities_json = self.cache.redis.hget(key, "ents")
            if not response_text or not entities_json:
                continue

            response_text = response_text.decode()
            entities = json.loads(entities_json.decode())
            title = entities.get("title")
            if not title:
                continue

            # Get live price from catalog
            live_price = await self.catalog.get_price(title)
            if live_price is None:
                continue

            # Check if the cached response contains the correct price.
            # Naive substring match: assumes prices render with consistent
            # formatting in both the cached response and the catalog API.
            if str(live_price) not in response_text:
                self._inconsistencies += 1
                logger.warning(
                    "Price inconsistency: title='%s', live_price=%s, "
                    "cached response does not contain live price. Key=%s",
                    title, live_price, key.decode(),
                )
                # Auto-invalidate the stale entry
                self.cache.redis.delete(key)
                logger.info("Auto-invalidated stale cache entry: %s", key.decode())

Prevention

| Prevention Measure | Implementation | Effort |
| --- | --- | --- |
| EventBridge rule schema validation | CI/CD check: event schema + rule filter compatibility test (sketch below) | Medium |
| Dead letter queue for failed invalidations | SQS DLQ on the invalidation Lambda; alarm on DLQ depth > 0 | Low |
| Automated price consistency checker | Background task sampling cached product_info responses | Medium |
| Double-write invalidation | Both the CMS AND the cache warmer can trigger invalidation (redundancy) | Low |
| Shorter TTL for price-sensitive intents | Reduce product_info TTL from 1hr to 15min during sale periods | Low |
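
The schema-validation CI check can use EventBridge's TestEventPattern API directly. A sketch, where the sample event body is an assumed instance of the catalog's current schema:

import json

import boto3


def test_price_changed_rule_matches_current_schema():
    """Fail CI if the invalidation rule's filter no longer matches catalog events."""
    events = boto3.client("events")
    rule_pattern = {
        "source": ["com.mangaassist.catalog"],
        "detail-type": ["product.price.changed"],
        "detail": {"change_type": ["price_update", "flash_sale_start", "flash_sale_end"]},
    }
    sample_event = {
        "id": "1", "detail-type": "product.price.changed",
        "source": "com.mangaassist.catalog", "account": "123456789012",
        "time": "2024-01-01T00:00:00Z", "region": "us-east-1", "resources": [],
        "detail": {"change_type": "flash_sale_start", "title": "Demon Slayer"},
    }
    result = events.test_event_pattern(
        EventPattern=json.dumps(rule_pattern), Event=json.dumps(sample_event)
    )
    assert result["Result"], "EventBridge rule no longer matches catalog events"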

Scenario 4 — Prompt Cache Miss Rate Spikes After System Prompt Update

Problem Statement

The MangaAssist team updates the system prompt to add a new guardrail: "Do not recommend manga with graphic violence to users under 18." The prompt version changes from v2.3 to v2.4. After deployment:

  • Bedrock prompt cache hit rate drops from 95% to 0%
  • Time-to-first-token (TTFT) increases from 120ms to 450ms
  • Input token cost spikes because the system prompt (2,200 tokens) is fully processed for every request
  • Until the prompt cache re-warms, requests arriving at ~694/minute (1M/day ÷ 1,440 minutes) pay the full input token cost

Business impact: Temporary latency regression and cost spike. If the team deploys prompt changes multiple times per day during an iteration cycle, the cumulative cost impact grows.

Detection

graph TD
    A[Bedrock usage metrics] --> B{cacheReadInputTokens<br/>dropped to 0?}
    B -->|Yes| C[CloudWatch Metric:<br/>bedrock.prompt_cache_miss_rate = 100%]
    C --> D[CloudWatch Alarm:<br/>PromptCacheMissSpike]
    D --> E{Correlate with<br/>deployment event}
    E -->|Deployment within<br/>last 10 min| F[Diagnosis: Prompt change<br/>invalidated Bedrock cache]
    E -->|No deployment| G[Diagnosis: Cache eviction<br/>due to 5-min inactivity<br/>or provider-side issue]

    style D fill:#e76f51,stroke:#f4a261,color:#fff

| Metric | Normal | Alarm Threshold | Source |
| --- | --- | --- | --- |
| bedrock.cache_read_tokens | ~2,200 per request | 0 for > 2 min | Bedrock Converse response (publisher sketch below) |
| bedrock.cache_write_tokens | ~0 (after first req) | ~2,200 sustained | Bedrock Converse response |
| bedrock.ttft_ms | ~120ms | > 400ms sustained for 5 min | Application latency tracing |
| prompt.version | Stable | Changed in last 10 min | Deployment metadata |
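
The two Converse-sourced rows come from the usage block of each Bedrock response. A sketch of the application-side publisher, assuming the cacheReadInputTokens / cacheWriteInputTokens usage fields that Converse reports when prompt caching is active; this feeds the PromptCacheReadTokens metric that Step 1 below queries:

import boto3

cloudwatch = boto3.client("cloudwatch")


def record_prompt_cache_usage(converse_response: dict) -> None:
    """Publish per-request prompt-cache token counts (0 when cache inactive)."""
    usage = converse_response.get("usage", {})
    cloudwatch.put_metric_data(
        Namespace="MangaAssist/Cache",
        MetricData=[
            {"MetricName": "PromptCacheReadTokens",
             "Value": float(usage.get("cacheReadInputTokens", 0)), "Unit": "Count"},
            {"MetricName": "PromptCacheWriteTokens",
             "Value": float(usage.get("cacheWriteInputTokens", 0)), "Unit": "Count"},
        ],
    )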

Root Cause Analysis

graph TD
    MISS[Prompt Cache Miss<br/>Rate = 100%] --> Q1{Was system prompt<br/>text changed?}
    Q1 -->|Yes| RC1[Root Cause: Any change to<br/>cached prefix invalidates<br/>the Bedrock prompt cache.<br/>Even a single character.]
    Q1 -->|No| Q2{Was model ID<br/>changed?}
    Q2 -->|Yes| RC2[Root Cause: Different model =<br/>different cache namespace.<br/>Cache is per-model.]
    Q2 -->|No| Q3{Was there a gap<br/>> 5 min with no traffic?}
    Q3 -->|Yes| RC3[Root Cause: Bedrock prompt<br/>cache evicts after 5 min<br/>of inactivity.]
    Q3 -->|No| RC4[Root Cause: Provider-side<br/>cache eviction. Not<br/>controllable by customer.]

    RC1 --> FIX1[Schedule prompt changes<br/>during low-traffic windows.<br/>Monitor TTFT recovery.]

    style MISS fill:#e76f51,stroke:#f4a261,color:#fff
    style RC1 fill:#264653,stroke:#2a9d8f,color:#fff

Resolution

Step 1 — Verify the cause (5 minutes)

import boto3
from datetime import datetime, timedelta


def check_prompt_cache_status(region: str = "us-east-1") -> dict:
    """
    Query recent Bedrock invocations to check prompt cache behavior.
    Compares cache_read vs cache_write token counts.
    """
    cw = boto3.client("cloudwatch", region_name=region)

    # Check custom metric for prompt cache hits
    response = cw.get_metric_statistics(
        Namespace="MangaAssist/Cache",
        MetricName="PromptCacheReadTokens",
        StartTime=datetime.utcnow() - timedelta(minutes=30),
        EndTime=datetime.utcnow(),
        Period=60,
        Statistics=["Sum", "SampleCount"],
    )

    datapoints = sorted(response["Datapoints"], key=lambda x: x["Timestamp"])
    if not datapoints:
        return {"status": "NO_DATA"}

    recent = datapoints[-1]
    earlier = datapoints[0] if len(datapoints) > 1 else None

    return {
        "current_cache_read_tokens": recent["Sum"],
        "current_sample_count": recent["SampleCount"],
        "earlier_cache_read_tokens": earlier["Sum"] if earlier else None,
        "cache_active": recent["Sum"] > 0,
        "diagnosis": (
            "Prompt cache is active"
            if recent["Sum"] > 0
            else "Prompt cache is NOT active — likely invalidated by prompt change"
        ),
    }

Step 2 — Accept and wait (if expected)

Bedrock prompt caching re-warms automatically. The first request after a prompt change pays the full cache-write cost. Subsequent requests (within 5 minutes) hit the cache. No manual action is needed — the cache self-heals.

Step 3 — Reduce future impact with deployment scheduling

# deployment_config.yaml — schedule prompt changes during low-traffic
prompt_deployment:
    preferred_window: "03:00-05:00 JST"  # 3-5 AM JST — lowest traffic
    pre_warm_requests: 5  # Send 5 dummy requests to warm the cache (sketch below)
    monitoring:
        watch_metric: "bedrock.prompt_cache_hit_rate"
        recovery_threshold: 0.90  # Alert if not recovered within 5 min
        alert_channel: "#mangaassist-ops"
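
A minimal sketch of the pre_warm_requests step, assuming prompt caching is enabled through a cachePoint block after the static system prompt in the Converse API call; the model ID and "ping" message are placeholders:

import boto3


def pre_warm_prompt_cache(model_id: str, system_prompt: str, n: int = 5) -> None:
    """Send n synthetic requests after a prompt deploy so the first real
    user request reads the prompt cache instead of paying the write cost."""
    bedrock = boto3.client("bedrock-runtime")
    for _ in range(n):
        bedrock.converse(
            modelId=model_id,
            system=[{"text": system_prompt}, {"cachePoint": {"type": "default"}}],
            messages=[{"role": "user", "content": [{"text": "ping"}]}],
            inferenceConfig={"maxTokens": 1},
        )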

Prevention

| Prevention Measure | Implementation | Effort |
| --- | --- | --- |
| Deploy prompt changes during low-traffic windows | CI/CD scheduled deployment at 3 AM JST | Low |
| Pre-warm with synthetic requests | Send 5 dummy requests after prompt deploy to populate cache | Low |
| Monitor TTFT as a deployment health check | If TTFT > 400ms for > 5 min post-deploy, alert | Low |
| Batch prompt changes | Accumulate changes and deploy once/day instead of multiple times | Process |
| Keep static prefix unchanged | Move dynamic instructions to user message prefix; keep system prompt stable | Medium |

Scenario 5 — Redis Memory Exhaustion from Unbounded Cache Growth

Problem Statement

Over 3 months of operation, the ElastiCache Redis cluster (r6g.xlarge, 26 GB memory) gradually fills with cache entries. Key observations:

  • Memory utilization reached 92% (alarm threshold: 85%)
  • Eviction policy (allkeys-lru) starts evicting entries — including frequently accessed ones
  • Cache hit rate drops from 28% to 14% as high-value entries are evicted to make room for low-value ones
  • Some entries have TTL=86400 (24hr) but are never accessed after the first hit
  • Recommendation entries (TTL=14400, 4hr) consume 40% of cache memory but have only 8% hit rate

Business impact: The cache becomes less effective over time. Cost savings decrease as hit rate drops. Eventually, Redis maxmemory is hit and LRU evictions degrade performance unpredictably.

Detection

graph TD
    A[CloudWatch Metric:<br/>ElastiCache BytesUsedForCache] --> B{> 85% of<br/>maxmemory?}
    B -->|Yes| C[CloudWatch Alarm:<br/>RedisCacheMemoryHigh]
    C --> D[SNS → Ops Channel]

    E[Application Metric:<br/>cache.hit_rate] --> F{Dropped > 5% from<br/>7-day moving average?}
    F -->|Yes| G[CloudWatch Alarm:<br/>CacheHitRateDegradation]
    G --> D

    H[ElastiCache Metric:<br/>Evictions] --> I{Evictions > 0?}
    I -->|Yes| J[CloudWatch Alarm:<br/>RedisCacheEvictions]
    J --> D

    D --> K[Composite Alarm:<br/>CacheMemoryExhaustion]
    K --> L[PagerDuty P2 Alert]

    style K fill:#e76f51,stroke:#f4a261,color:#fff

| Metric | Normal | Warning | Critical | Source |
| --- | --- | --- | --- | --- |
| BytesUsedForCache | < 70% | > 85% | > 95% | CloudWatch AWS/ElastiCache |
| CacheHits / (CacheHits + CacheMisses) | ~28% | < 20% | < 15% | CloudWatch AWS/ElastiCache |
| Evictions | 0 | > 0 | > 100/min | CloudWatch AWS/ElastiCache |
| cache.entry_count | < 500K | > 750K | > 1M | Application metric (FT.INFO) |
| Memory per intent category | Balanced | Any category > 50% | — | Custom metric |

Root Cause Analysis

graph TD
    MEM[Redis Memory<br/>Exhaustion] --> Q1{Are TTLs being<br/>set on all entries?}
    Q1 -->|No| RC1[Root Cause: Some code paths<br/>store entries without TTL.<br/>These entries never expire.]
    Q1 -->|Yes| Q2{Is a single intent<br/>consuming > 40% memory?}
    Q2 -->|Yes| RC2[Root Cause: Low hit-rate intents<br/>consuming disproportionate memory.<br/>Recommendations: 40% memory,<br/>8% hit rate.]
    Q2 -->|No| Q3{Are entries being<br/>re-created after TTL expiry?}
    Q3 -->|Yes| RC3[Root Cause: Cache churn —<br/>entries expire, get re-created,<br/>and accumulate unique keys<br/>due to timestamp in key.]
    Q3 -->|No| RC4[Root Cause: Traffic growth<br/>exceeded original memory<br/>sizing. Need to scale<br/>the cluster.]

    RC2 --> FIX[Implement memory budgets<br/>per intent category +<br/>reduce recommendation TTL]

    style MEM fill:#e76f51,stroke:#f4a261,color:#fff
    style RC2 fill:#264653,stroke:#2a9d8f,color:#fff

Resolution

Step 1 — Immediate memory relief (10 minutes)

import logging
import time

import redis

logger = logging.getLogger(__name__)


def emergency_memory_cleanup(
    redis_url: str,
    low_value_intents: list[str] | None = None,
    max_entries_to_remove: int = 50_000,
) -> dict:
    """
    Emergency memory cleanup: remove low-value cache entries.
    Targets entries with low hit counts and low-value intents.
    """
    if low_value_intents is None:
        low_value_intents = ["greeting", "recommendation"]

    r = redis.Redis.from_url(redis_url, decode_responses=True)

    removed = 0
    cursor = 0
    while removed < max_entries_to_remove:
        cursor, keys = r.scan(cursor, match="sc:*", count=500)
        for key in keys:
            intent = r.hget(key, "intent")
            hits = r.hget(key, "hits")

            should_remove = False
            # Remove low-value intents
            if intent in low_value_intents:
                should_remove = True
            # Remove entries with 0 hits that are older than 1 hour
            elif hits and int(hits) == 0:
                ts = r.hget(key, "ts")
                if ts and (time.time() - float(ts)) > 3600:
                    should_remove = True

            if should_remove:
                r.delete(key)
                removed += 1

        if cursor == 0:
            break

    result = {
        "entries_removed": removed,
        "memory_after": r.info("memory")["used_memory_human"],
    }
    logger.info("Emergency cleanup: %s", result)
    return result

Step 2 — Implement memory budgets per intent (1–2 hours)

import logging
import time

logger = logging.getLogger(__name__)

# Memory budget: percentage of total Redis memory per intent category
INTENT_MEMORY_BUDGET = {
    "faq": 0.25,               # 25% — high hit rate, high value
    "product_info": 0.25,      # 25% — high hit rate, business-critical
    "shipping_info": 0.15,     # 15% — moderate volume, stable answers
    "manga_release_date": 0.15,# 15% — moderate volume, time-sensitive
    "recommendation": 0.10,    # 10% — low hit rate, reduce from 40%
    "greeting": 0.02,          # 2% — minimal variation
    "manga_search": 0.05,      # 5% — moderate hit rate
    "_other": 0.03,            # 3% — catch-all
}


class MemoryBudgetEnforcer:
    """
    Enforces per-intent memory budgets in the Redis cache.
    When an intent exceeds its budget, the least-recently-hit
    entries for that intent are evicted.
    """

    def __init__(self, redis_client, max_memory_bytes: int):
        self.redis = redis_client
        self.max_memory = max_memory_bytes

    def enforce(self) -> dict:
        """
        Scan all cache entries, compute per-intent memory usage,
        and evict entries from over-budget intents.
        """
        # Phase 1: Inventory
        intent_entries: dict[str, list[dict]] = {}
        cursor = 0
        while True:
            cursor, keys = self.redis.scan(cursor, match="sc:*", count=500)
            for key in keys:
                intent = self.redis.hget(key, "intent")
                hits = self.redis.hget(key, "hits")
                ts = self.redis.hget(key, "ts")
                mem = self.redis.memory_usage(key) or 0

                intent_str = intent.decode() if intent else "_other"
                if intent_str not in intent_entries:
                    intent_entries[intent_str] = []
                intent_entries[intent_str].append({
                    "key": key,
                    "hits": int(hits) if hits else 0,
                    "ts": float(ts) if ts else 0,
                    "memory": mem,
                })
            if cursor == 0:
                break

        # Phase 2: Check budgets and evict
        eviction_report = {}
        for intent, entries in intent_entries.items():
            budget_pct = INTENT_MEMORY_BUDGET.get(intent, INTENT_MEMORY_BUDGET["_other"])
            budget_bytes = int(self.max_memory * budget_pct)
            current_usage = sum(e["memory"] for e in entries)

            if current_usage <= budget_bytes:
                eviction_report[intent] = {
                    "status": "within_budget",
                    "usage_bytes": current_usage,
                    "budget_bytes": budget_bytes,
                    "evicted": 0,
                }
                continue

            # Sort by hit count (ascending) then timestamp (ascending)
            # → evict least-hit, oldest entries first
            entries.sort(key=lambda e: (e["hits"], e["ts"]))

            evicted = 0
            for entry in entries:
                if current_usage <= budget_bytes:
                    break
                self.redis.delete(entry["key"])
                current_usage -= entry["memory"]
                evicted += 1

            eviction_report[intent] = {
                "status": "evicted",
                "usage_bytes": current_usage,
                "budget_bytes": budget_bytes,
                "evicted": evicted,
            }

        logger.info("Memory budget enforcement: %s", eviction_report)
        return eviction_report

Step 3 — Schedule regular cleanup as an ECS scheduled task

# EventBridge scheduled rule (illustrative shape) — runs every 6 hours
{
    "schedule": "rate(6 hours)",
    "target": {
        "arn": "arn:aws:ecs:us-east-1:123456789:cluster/manga-cluster",
        "taskDefinition": "cache-maintenance:latest",
        "overrides": {
            "containerOverrides": [{
                "name": "cache-maintenance",
                "command": ["python", "-m", "cache_maintenance.enforce_budgets"]
            }]
        }
    }
}

Prevention

| Prevention Measure | Implementation | Effort |
| --- | --- | --- |
| Per-intent memory budgets | MemoryBudgetEnforcer runs every 6 hours via scheduled task | Medium |
| Reduce recommendation TTL | Drop from 4hr to 1hr — low hit rate does not justify long TTL | Low |
| Max entry count limit | Cap total entries at 500K; reject new writes when at capacity (guard sketch below) | Low |
| ElastiCache scaling alarm | If BytesUsedForCache > 70% for 24hr, auto-scale to next instance size | Medium |
| Cache value scoring | score = hit_count / memory_bytes; evict lowest-score entries first | High |
| TTL audit | Weekly report of entries by TTL bucket; flag intents with TTL >> access frequency | Low |
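
The max-entry-count guard can read the index size from FT.INFO before each write. A sketch: the 500K cap comes from the table above, and the bytes/str handling covers both Redis client decode modes.

MAX_CACHE_ENTRIES = 500_000


def cache_has_capacity(redis_client, index_name: str = "mangaassist_cache_idx") -> bool:
    """Refuse new cache writes once the vector index reaches the entry cap."""
    info = redis_client.execute_command("FT.INFO", index_name)
    # FT.INFO returns a flat [name, value, ...] list: pair it into a dict
    fields = dict(zip(info[::2], info[1::2]))
    num_docs = int(fields.get("num_docs") or fields.get(b"num_docs") or 0)
    return num_docs < MAX_CACHE_ENTRIES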

Scenario Cross-Reference

| # | Scenario | Primary Failure Mode | Detection Speed | Resolution Time | Severity |
| --- | --- | --- | --- | --- | --- |
| 1 | Wrong manga recommendation | Semantic false positive | Minutes (user feedback) | Hours (threshold tuning + entity filter) | P2 — Incorrect answers |
| 2 | Cache stampede on new release | Thundering herd | Seconds (throttle alarm) | Minutes (single-flight deploy) | P1 — Service degradation |
| 3 | Stale pricing after flash sale | Invalidation failure | Minutes (consistency check) | Minutes (emergency flush) | P1 — Business/legal risk |
| 4 | Prompt cache miss after update | Expected cache cold start | Minutes (TTFT spike) | Self-healing (5 min) | P3 — Temporary cost spike |
| 5 | Redis memory exhaustion | Unbounded growth | Hours (gradual metric shift) | Hours (budget enforcement) | P2 — Degraded hit rate |