US-01: LLM Token Cost Optimization

User Story

As a platform engineering lead, I want to minimize Bedrock LLM token consumption without degrading response quality, So that the MangaAssist chatbot operates within budget at scale while maintaining a natural conversational experience.

Acceptance Criteria

Template-first routing bypasses the LLM for at least 30% of all messages (chitchat, simple order lookups, greetings).
Prompt compression reduces average input token count by 40% or more.
Semantic response cache achieves a 15-25% hit rate for repeated/similar queries.
Model tiering routes simple queries to a cheaper model (Haiku) and complex queries to Sonnet.
Total Bedrock spend decreases by 40-60% compared to the "send everything to Sonnet" baseline.

High-Level Design

Cost Problem

Bedrock charges per input and output token. Claude 3.5 Sonnet pricing: - Input: $3.00 / 1M tokens - Output: $15.00 / 1M tokens

At 1M messages/day with an average prompt of 2,000 input tokens and 300 output tokens: - Daily input cost: 2B tokens × $3/1M = $6,000/day - Daily output cost: 300M tokens × $15/1M = $4,500/day - Baseline: ~$10,500/day = ~$315,000/month

Optimization Strategy Overview

graph TD
    A[User Message] --> B{Template-First<br>Router}
    B -->|chitchat, greeting,<br>simple order status| C[Template Response<br>Zero LLM cost]
    B -->|needs generation| D{Semantic Cache<br>Check}
    D -->|cache hit| E[Return Cached Response<br>Zero LLM cost]
    D -->|cache miss| F{Complexity<br>Classifier}
    F -->|simple FAQ,<br>product lookup| G[Claude Haiku<br>Low cost]
    F -->|recommendation,<br>multi-turn reasoning| H[Claude Sonnet<br>Full cost]
    G --> I[Prompt Compressor]
    H --> I
    I --> J[Compressed Prompt<br>to Bedrock]
    J --> K[Cache Response<br>for future hits]

    style C fill:#2d8,stroke:#333
    style E fill:#2d8,stroke:#333
    style G fill:#fd2,stroke:#333
    style H fill:#f66,stroke:#333

Projected Savings Breakdown

Technique	Traffic Affected	Token Reduction	Monthly Savings
Template-first routing	~30% of messages	100% (no LLM call)	~$94,500
Semantic response cache	~18% of remaining	100% (cached)	~$39,690
Model tiering (Haiku for simple)	~35% of remaining	85% cost reduction	~$50,000
Prompt compression	All LLM calls	40% fewer input tokens	~$25,000
Total estimated savings			~$209,190/month (66%)

Low-Level Design

1. Template-First Router

The Orchestrator checks whether the classified intent can be fully resolved without an LLM call.

graph LR
    A[Intent + Entities] --> B{Intent in<br>template_eligible?}
    B -->|Yes| C{All required<br>entities present?}
    C -->|Yes| D[Render Template<br>with data]
    C -->|No| E[Fall through<br>to LLM]
    B -->|No| E
    D --> F[Return Response]
    E --> G[LLM Pipeline]

Template-Eligible Intents

Intent	Template Condition	Template Example
`chitchat` (greeting)	Always	"Welcome to JP Manga Store! How can I help?"
`chitchat` (thanks)	Always	"You're welcome! Anything else I can help with?"
`order_tracking`	Order ID resolved	"Your order #{{order_id}} is {{status}}. Expected delivery: {{date}}."
`promotion`	Active promos found	"We have {{count}} active deals: {{promo_list}}"
`escalation`	Always	"Connecting you with a support agent..."

Code Example: Template Router

from dataclasses import dataclass
from enum import Enum
from typing import Optional

class IntentType(Enum):
    CHITCHAT = "chitchat"
    ORDER_TRACKING = "order_tracking"
    PROMOTION = "promotion"
    ESCALATION = "escalation"
    PRODUCT_QUESTION = "product_question"
    RECOMMENDATION = "recommendation"
    FAQ = "faq"
    RETURN_REQUEST = "return_request"
    CHECKOUT_HELP = "checkout_help"
    PRODUCT_DISCOVERY = "product_discovery"

@dataclass
class Intent:
    type: IntentType
    confidence: float
    entities: dict

@dataclass
class TemplateResult:
    response_text: str
    bypassed_llm: bool
    template_id: str

TEMPLATE_ELIGIBLE_INTENTS = {
    IntentType.CHITCHAT,
    IntentType.ORDER_TRACKING,
    IntentType.PROMOTION,
    IntentType.ESCALATION,
}

TEMPLATES = {
    "chitchat_greeting": "Welcome to the JP Manga Store! How can I help you today?",
    "chitchat_thanks": "You're welcome! Is there anything else I can help with?",
    "chitchat_goodbye": "Thanks for visiting! Happy reading!",
    "order_tracking": "Your order #{order_id} is currently {status}. "
                      "Expected delivery: {delivery_date}.",
    "promotion_list": "Great news! We have {count} active manga deals:\n{promo_list}",
    "escalation": "I'll connect you with a support agent right away. "
                  "Estimated wait: ~{wait_time}.",
}


class TemplateRouter:
    """Routes eligible intents to pre-built templates, bypassing the LLM."""

    def try_template(
        self,
        intent: Intent,
        service_data: dict,
    ) -> Optional[TemplateResult]:
        if intent.type not in TEMPLATE_ELIGIBLE_INTENTS:
            return None

        if intent.type == IntentType.CHITCHAT:
            return self._handle_chitchat(intent)

        if intent.type == IntentType.ORDER_TRACKING:
            return self._handle_order_tracking(intent, service_data)

        if intent.type == IntentType.PROMOTION:
            return self._handle_promotion(service_data)

        if intent.type == IntentType.ESCALATION:
            return self._handle_escalation(service_data)

        return None

    def _handle_chitchat(self, intent: Intent) -> TemplateResult:
        sub_intent = intent.entities.get("sub_intent", "greeting")
        template_key = f"chitchat_{sub_intent}"
        text = TEMPLATES.get(template_key, TEMPLATES["chitchat_greeting"])
        return TemplateResult(
            response_text=text,
            bypassed_llm=True,
            template_id=template_key,
        )

    def _handle_order_tracking(
        self, intent: Intent, service_data: dict
    ) -> Optional[TemplateResult]:
        order = service_data.get("order")
        if not order or not order.get("order_id"):
            return None  # Fall through to LLM for clarification
        text = TEMPLATES["order_tracking"].format(
            order_id=order["order_id"],
            status=order.get("status", "processing"),
            delivery_date=order.get("delivery_date", "soon"),
        )
        return TemplateResult(
            response_text=text,
            bypassed_llm=True,
            template_id="order_tracking",
        )

    def _handle_promotion(self, service_data: dict) -> Optional[TemplateResult]:
        promos = service_data.get("promotions", [])
        if not promos:
            return None
        promo_lines = [f"• {p['title']} — {p['discount']}" for p in promos]
        text = TEMPLATES["promotion_list"].format(
            count=len(promos),
            promo_list="\n".join(promo_lines),
        )
        return TemplateResult(
            response_text=text,
            bypassed_llm=True,
            template_id="promotion_list",
        )

    def _handle_escalation(self, service_data: dict) -> TemplateResult:
        wait = service_data.get("estimated_wait_seconds", 120)
        wait_min = max(1, wait // 60)
        text = TEMPLATES["escalation"].format(wait_time=f"{wait_min} minutes")
        return TemplateResult(
            response_text=text,
            bypassed_llm=True,
            template_id="escalation",
        )

2. Semantic Response Cache

Caches LLM responses keyed by a semantic hash of the query + context, so near-identical questions reuse prior answers.

sequenceDiagram
    participant Orchestrator
    participant Embedder
    participant CacheIndex as Redis + Vector Cache
    participant Bedrock

    Orchestrator->>Embedder: Embed (query + intent + key_context)
    Embedder-->>Orchestrator: Query embedding
    Orchestrator->>CacheIndex: Search cached embeddings<br>(cosine similarity > 0.95)
    alt Cache Hit
        CacheIndex-->>Orchestrator: Cached response + metadata
        Orchestrator->>Orchestrator: Validate freshness (TTL check)
        Orchestrator-->>Orchestrator: Return cached response
    else Cache Miss
        CacheIndex-->>Orchestrator: No match
        Orchestrator->>Bedrock: Generate response
        Bedrock-->>Orchestrator: LLM response
        Orchestrator->>CacheIndex: Store embedding + response<br>(TTL by intent type)
    end

Cache TTL by Intent Type

Intent	Cache TTL	Rationale
`faq`	24 hours	Policy/FAQ content changes infrequently
`product_question`	1 hour	Product attributes rarely change mid-day
`recommendation`	30 min	Personalized; shorter TTL
`product_discovery`	30 min	Trends shift but not rapidly
`return_request`	Never cached	User-specific; must be live
`order_tracking`	Never cached	Real-time data required

Code Example: Semantic Cache

import hashlib
import json
import time
from dataclasses import dataclass
from typing import Optional

import numpy as np
import redis


@dataclass
class CachedResponse:
    response_text: str
    products: list
    intent: str
    created_at: float
    ttl_seconds: int


INTENT_CACHE_TTL = {
    "faq": 86400,           # 24 hours
    "product_question": 3600, # 1 hour
    "recommendation": 1800,  # 30 min
    "product_discovery": 1800,
}

NON_CACHEABLE_INTENTS = {"order_tracking", "return_request", "escalation", "chitchat"}


class SemanticResponseCache:
    """Cache LLM responses by semantic similarity of the query."""

    SIMILARITY_THRESHOLD = 0.95

    def __init__(self, redis_client: redis.Redis, embedder):
        self.redis = redis_client
        self.embedder = embedder

    def _make_cache_key(self, intent: str, embedding: np.ndarray) -> str:
        """Deterministic key from intent + discretized embedding."""
        quantized = np.round(embedding, decimals=4)
        raw = f"{intent}:{quantized.tobytes().hex()}"
        return f"llm_cache:{hashlib.sha256(raw.encode()).hexdigest()[:32]}"

    def get(
        self, query: str, intent: str, context_hash: str
    ) -> Optional[CachedResponse]:
        if intent in NON_CACHEABLE_INTENTS:
            return None

        embedding = self.embedder.embed(f"{intent}:{query}:{context_hash}")
        cache_key = self._make_cache_key(intent, embedding)

        raw = self.redis.get(cache_key)
        if raw is None:
            return None

        data = json.loads(raw)
        cached = CachedResponse(**data)

        # Check if still within TTL
        if time.time() - cached.created_at > cached.ttl_seconds:
            self.redis.delete(cache_key)
            return None

        return cached

    def put(
        self,
        query: str,
        intent: str,
        context_hash: str,
        response_text: str,
        products: list,
    ) -> None:
        if intent in NON_CACHEABLE_INTENTS:
            return

        ttl = INTENT_CACHE_TTL.get(intent)
        if ttl is None:
            return

        embedding = self.embedder.embed(f"{intent}:{query}:{context_hash}")
        cache_key = self._make_cache_key(intent, embedding)

        cached = CachedResponse(
            response_text=response_text,
            products=products,
            intent=intent,
            created_at=time.time(),
            ttl_seconds=ttl,
        )
        self.redis.setex(cache_key, ttl, json.dumps(cached.__dict__))

3. Model Tiering

Route requests to the cheapest model that can handle the task.

graph TD
    A[LLM Request] --> B{Complexity<br>Score}
    B -->|score <= 0.4<br>Simple FAQ, single entity| C[Claude 3.5 Haiku<br>$0.25/$1.25 per 1M tokens]
    B -->|0.4 < score <= 0.7<br>Multi-entity, comparison| D[Claude 3.5 Sonnet<br>$3/$15 per 1M tokens]
    B -->|score > 0.7<br>Complex reasoning,<br>multi-turn synthesis| E[Claude 3.5 Sonnet<br>with extended context]

    C --> F[Response]
    D --> F
    E --> F

    style C fill:#2d8,stroke:#333
    style D fill:#fd2,stroke:#333
    style E fill:#f66,stroke:#333

Complexity Scoring Rules

Factor	Weight	Low (0)	Medium (0.5)	High (1.0)
Entity count	0.2	0-1 entities	2-3 entities	4+ entities
Conversation turns	0.2	1-3 turns	4-8 turns	9+ turns
RAG chunks needed	0.2	0-1 chunks	2-3 chunks	4+ chunks
Intent ambiguity	0.2	confidence > 0.9	0.7-0.9	< 0.7
Requires comparison	0.2	No	Implicit	Explicit

Code Example: Model Tier Selector

from dataclasses import dataclass
from enum import Enum


class ModelTier(Enum):
    HAIKU = "anthropic.claude-3-5-haiku-20241022-v1:0"
    SONNET = "anthropic.claude-3-5-sonnet-20241022-v2:0"


@dataclass
class TierDecision:
    model_id: str
    tier: ModelTier
    complexity_score: float
    reason: str


class ModelTierSelector:
    """Select the cheapest model capable of handling the request."""

    HAIKU_THRESHOLD = 0.4
    SONNET_THRESHOLD = 0.7

    def select(
        self,
        intent: str,
        confidence: float,
        entity_count: int,
        turn_count: int,
        rag_chunk_count: int,
        requires_comparison: bool,
    ) -> TierDecision:
        scores = {
            "entity": self._score_entities(entity_count),
            "turns": self._score_turns(turn_count),
            "rag": self._score_rag(rag_chunk_count),
            "ambiguity": self._score_ambiguity(confidence),
            "comparison": 1.0 if requires_comparison else 0.0,
        }
        weights = {
            "entity": 0.2,
            "turns": 0.2,
            "rag": 0.2,
            "ambiguity": 0.2,
            "comparison": 0.2,
        }
        complexity = sum(scores[k] * weights[k] for k in scores)

        if complexity <= self.HAIKU_THRESHOLD:
            tier = ModelTier.HAIKU
            reason = "Low complexity — single entity, high confidence, minimal context"
        else:
            tier = ModelTier.SONNET
            reason = f"Complexity {complexity:.2f} requires Sonnet"

        return TierDecision(
            model_id=tier.value,
            tier=tier,
            complexity_score=complexity,
            reason=reason,
        )

    def _score_entities(self, count: int) -> float:
        if count <= 1:
            return 0.0
        if count <= 3:
            return 0.5
        return 1.0

    def _score_turns(self, count: int) -> float:
        if count <= 3:
            return 0.0
        if count <= 8:
            return 0.5
        return 1.0

    def _score_rag(self, count: int) -> float:
        if count <= 1:
            return 0.0
        if count <= 3:
            return 0.5
        return 1.0

    def _score_ambiguity(self, confidence: float) -> float:
        if confidence > 0.9:
            return 0.0
        if confidence > 0.7:
            return 0.5
        return 1.0

4. Prompt Compression

Reduces input tokens by compressing conversation history and trimming redundant context.

graph LR
    A[Full Prompt<br>~2000 tokens] --> B[Trim Browsing History<br>Keep last 5 items]
    B --> C[Compress Conv History<br>Summarize older turns]
    C --> D[Deduplicate RAG Chunks<br>Remove overlapping content]
    D --> E[Strip Empty Context<br>Remove null fields]
    E --> F[Compressed Prompt<br>~1200 tokens]

    style A fill:#f66,stroke:#333
    style F fill:#2d8,stroke:#333

Code Example: Prompt Compressor

from typing import Optional


class PromptCompressor:
    """Reduce input token count while preserving essential context."""

    MAX_BROWSING_HISTORY = 5
    MAX_RECENT_TURNS = 6       # 3 user + 3 assistant
    MAX_RAG_CHUNKS = 3
    MAX_CHUNK_TOKENS = 200

    def compress(
        self,
        system_prompt: str,
        browsing_history: list[str],
        conversation_turns: list[dict],
        rag_chunks: list[dict],
        current_product: Optional[dict],
        active_promos: list[dict],
    ) -> str:
        parts = [system_prompt]

        # 1. Trim browsing history to recent items
        trimmed_history = browsing_history[-self.MAX_BROWSING_HISTORY :]
        if trimmed_history:
            parts.append(f"Recent browsing: {', '.join(trimmed_history)}")

        # 2. Compress conversation: keep recent turns, summarize older
        if len(conversation_turns) > self.MAX_RECENT_TURNS:
            older = conversation_turns[: -self.MAX_RECENT_TURNS]
            summary = self._summarize_turns(older)
            recent = conversation_turns[-self.MAX_RECENT_TURNS :]
            parts.append(f"Earlier context: {summary}")
            for turn in recent:
                parts.append(f"{turn['role']}: {turn['content']}")
        else:
            for turn in conversation_turns:
                parts.append(f"{turn['role']}: {turn['content']}")

        # 3. Deduplicate and trim RAG chunks
        unique_chunks = self._deduplicate_chunks(rag_chunks)
        for chunk in unique_chunks[: self.MAX_RAG_CHUNKS]:
            truncated = self._truncate_tokens(chunk["content"], self.MAX_CHUNK_TOKENS)
            parts.append(f"[{chunk['source_type']}] {truncated}")

        # 4. Include product context only if present
        if current_product:
            parts.append(self._compact_product(current_product))

        # 5. Active promos — one line each
        if active_promos:
            promo_lines = [f"- {p['title']}: {p['discount']}" for p in active_promos[:3]]
            parts.append("Active deals:\n" + "\n".join(promo_lines))

        return "\n\n".join(parts)

    def _summarize_turns(self, turns: list[dict]) -> str:
        """Lightweight extractive summary of older turns."""
        user_msgs = [t["content"] for t in turns if t["role"] == "user"]
        return f"User previously asked about: {'; '.join(user_msgs[-3:])}"

    def _deduplicate_chunks(self, chunks: list[dict]) -> list[dict]:
        """Remove chunks with >80% content overlap."""
        seen_hashes: set[str] = set()
        unique: list[dict] = []
        for chunk in chunks:
            # Simple dedup by first 100 chars
            sig = chunk["content"][:100]
            if sig not in seen_hashes:
                seen_hashes.add(sig)
                unique.append(chunk)
        return unique

    def _truncate_tokens(self, text: str, max_tokens: int) -> str:
        words = text.split()
        # Rough approximation: 1 token ≈ 0.75 words
        max_words = int(max_tokens * 0.75)
        if len(words) <= max_words:
            return text
        return " ".join(words[:max_words]) + "..."

    def _compact_product(self, product: dict) -> str:
        fields = ["title", "price", "format", "availability"]
        parts = [f"{k}: {product[k]}" for k in fields if product.get(k)]
        return "Current product: " + " | ".join(parts)

Monitoring and Metrics

Cost Dashboard Metrics

graph TD
    subgraph "CloudWatch Metrics"
        A[TemplateBypassRate<br>Target: >= 30%]
        B[SemanticCacheHitRate<br>Target: >= 15%]
        C[HaikuRoutingRate<br>Target: >= 35% of LLM calls]
        D[AvgInputTokens<br>Target: <= 1200]
        E[AvgOutputTokens<br>Target: <= 250]
        F[DailyBedrockSpend<br>Target: <= $4,200]
    end

    subgraph "Alarms"
        A --> G{< 25%?}
        G -->|Yes| H[Alert: Template coverage low]
        F --> I{> $5,000?}
        I -->|Yes| J[Alert: Cost spike]
    end

Key Metrics to Track

Metric	Formula	Target	Alert Threshold
Template bypass rate	template_responses / total_messages	≥ 30%	< 25%
Cache hit rate	cache_hits / llm_eligible_messages	≥ 15%	< 10%
Haiku routing %	haiku_calls / total_llm_calls	≥ 35%	< 25%
Avg input tokens	sum(input_tokens) / llm_calls	≤ 1,200	> 1,500
Daily Bedrock spend	sum(daily_cost)	≤ $4,200	> $5,000
Quality score (CSAT)	positive_feedback / total_feedback	≥ 4.⅖	< 3.8/5

Risks and Mitigations

Risk	Impact	Mitigation
Over-aggressive caching returns stale data	Wrong product info shown to user	Event-driven invalidation + TTL guards; never cache prices
Haiku produces lower quality recommendations	User satisfaction drops	A/B test with quality scoring; auto-promote to Sonnet if feedback < threshold
Prompt compression loses critical context	LLM hallucinates or gives irrelevant answer	Always keep last 3 user turns intact; compress only older history
Template responses feel robotic	Lower engagement	Vary templates with 3-5 variants per intent; add personalization tokens

Deep Dive: Why This Works on a Manga Chatbot Workload

The four techniques in this story (template-first, semantic cache, model tiering, prompt compression) are not equally powerful — they work together because manga-chatbot traffic has three exploitable properties that LLM-only architectures ignore.

Property 1: Intent distribution is heavy-tailed, not uniform. On a manga store, the top ~10 intents (greeting, order status, "where is my package", "is volume N out yet", "more like this title", price check, return policy, language availability, account login, store hours) cover the majority of all messages. These are exactly the intents whose answer is either a constant string or a SQL-shaped lookup over RDS/DynamoDB. Sending them to Sonnet is paying $0.008–$0.012 per message to format an answer the system already knows. The template-first router exploits this distribution: it does not "compete" with the LLM, it removes the LLM from the path entirely for the queries the LLM was never adding value to. The architectural assumption is that the intent classifier (US-02) is precise enough — intent_precision >= 0.92 — that template-render-with-wrong-data is rarer than LLM-cost waste. If intent precision drifts, this technique inverts: it generates wrong answers cheaply instead of right answers expensively.

Property 2: Manga catalog questions are highly repetitive across users. Two users asking "is Chainsaw Man Volume 15 in English available" produce different prompts (different session history, different timestamps, different user IDs) but should produce the same factual answer. The semantic cache works because the answer-determining part of the prompt — the user question — collapses into a small number of equivalence classes per ASIN, while the answer-irrelevant part (session, persona, history) is what makes each prompt unique-looking. Classic exact-match caching cannot see through that noise; embedding-based semantic match can. The 0.95 cosine threshold is chosen so two questions only share a cache entry when they would have identical correct answers; lower the threshold and you start serving "is X available" answers for "is X in stock" questions, which can be different facts.

Property 3: Most manga questions are single-step retrieval, not multi-step reasoning. "Recommend me 5 titles like Berserk" is a recommendation problem (Sonnet-shaped). "What is the price of volume 1" is a retrieval problem (Haiku-shaped). The complexity classifier exists because Sonnet's strength — chain-of-thought, comparison, synthesis — is wasted on retrieval prompts where any LLM that can format JSON will produce identical output. Haiku at ~12× cheaper input and ~5× cheaper output is genuinely interchangeable on retrieval-shaped intents; the savings are real, not a quality compromise. The failure signal is response-quality regression measured at the intent-tier level, not aggregated CSAT: aggregated CSAT will mask Haiku failures hidden in 35% of traffic.

The prompt-compressor is the only technique here that is content-agnostic — it works because conversational LLM context windows are filled with low-information-density text (system preambles repeated every turn, prior assistant responses re-summarized, RAG chunks with formatting boilerplate). The compression ratio is highest where information density is lowest: chat history compresses well, fresh user questions compress almost not at all. This is why the strategy is "compress old turns, never the latest turn" — it preserves every bit of signal where signal density is high.

Bottom line: the savings stack multiplicatively because each technique attacks a different inefficiency: template kills calls, cache kills duplicate calls, tiering kills oversized calls, compression kills oversized prompts. Pull any one out and the others still work, but you leave 8–15 percentage points of savings on the table per technique removed.

Real-World Validation

Industry Benchmarks & Case Studies

Anthropic prompt caching (official docs) — Anthropic's prompt caching reduces cost on cache-hit reads by up to 90% (cache reads billed at 10% of base input price; cache writes at 125%). For chat workloads with stable system prompts, public Anthropic case studies report 2–10× effective cost reduction once cache is warm. This story does not currently exploit Bedrock's native prompt caching — see "Math Validation" below for the gap.
Notion AI / Cursor / Lindy public talks — Multi-tier model routing (Haiku-for-simple, Sonnet-for-complex, Opus-for-edge-cases) is now standard in production AI products. Cursor's published cost-per-completion analysis and Notion's 2024 AI cost retrospectives report 40–65% cost reduction from tier routing, consistent with the 35% Haiku-routing target in this story.
GPTCache / Redis Labs semantic cache benchmarks — Public benchmarks on chat-style workloads show 15–25% semantic cache hit rates with thresholds in the 0.92–0.97 cosine-similarity range. The 15–25% target in this story sits in the realistic band; teams reporting > 35% are usually conflating exact-match and semantic-match hits.
Microsoft LLMLingua paper (Jiang et al., 2023, EMNLP) — Demonstrates 2–20× prompt compression with task-quality preservation on summarization and QA. The 40% input-token reduction targeted here is on the very conservative end of the LLMLingua results; the gap exists because this story uses heuristic compression (truncate older turns, strip formatting), not learned compression — heuristic methods reliably hit 30–50% compression but plateau there.
Internal cross-reference: POC-to-Production-War-Story/02-seven-production-catastrophes.md — The "cost explosion" catastrophe in the war story was exactly the failure mode this story prevents (uncontrolled token growth as prompts accumulated prior turns). The fixes documented there (turn-window limits, summary-instead-of-full-history) are codified into the prompt compressor here.
Internal cross-reference: Optimization-Tradeoffs-User-Stories/ — Covers the cost-vs-quality trade-off curve at a higher abstraction; this story is the cost-leaning operating point on that curve.

Math Validation

Re-derive the headline cost number against current AWS Bedrock published pricing (validate before each release; pricing is volatile):

Claude 3.5 Sonnet: $3.00 / 1M input, $15.00 / 1M output (matches story line 24-25, ✅).
Claude 3 Haiku: $0.25 / 1M input, $1.25 / 1M output (12× / 12× cheaper, matches the "85% cost reduction" claim on line 60, ✅).
1M messages/day × 2,000 input tokens = 2B input tokens/day × $3 / 1M = $6,000/day input ✅.
1M messages/day × 300 output tokens = 300M output tokens/day × $15 / 1M = $4,500/day output ✅.
Baseline $315K/month is internally consistent.

Gap flagged: The story does not currently use Bedrock's native prompt caching feature (released Aug 2024 for Anthropic on Bedrock). For a chatbot with a stable system prompt of ~1,500 tokens repeated on every call, native prompt caching would deliver an additional 15–25% input-cost reduction on top of the techniques here, with near-zero engineering cost. Recommend folding into the next iteration of US-01.

Conservative vs Aggressive Savings Bounds

Bound	Source	Total monthly savings
Conservative ("typically observed")	GPTCache 15% hit + Notion-style 30% Haiku routing + LLMLingua 30% compression + 20% template bypass	~45% (~$140K/month)
Aggressive ("best published")	Native prompt caching layered on top + 25% semantic cache + 40% Haiku + 50% LLMLingua + 35% template	~70% (~$220K/month)
Story's projected savings	Story claims 66% (~$209K)	Sits at the top of the realistic band — assumes everything works as specified.

Cross-Story Interactions & Conflicts

US-02 (Intent Classifier) — Authoritative side: US-02. The template-first router on line 70 depends on the intent label produced by US-02. Conflict mode: if US-02's intent precision drops below 0.92, template-first generates wrong answers cheaply; cost goes down, CSAT goes down faster. Resolution: template router must read the intent confidence score, not just the label, and fall through to LLM if confidence < 0.85. Ownership of the precision floor lives in US-02's monitoring.
US-03 (Caching) — Authoritative side: US-03. The semantic response cache (line 39) is a logical layer on the Redis tier described in US-03. Conflict mode: without keyspace discipline, response cache entries (large, ~3 KB each) compete with product/recommendation cache entries during eviction. Resolution: reserved keyspace prefix llmresp: with its own LRU partition and noeviction policy under memory pressure (LLM cache misses are 100× more expensive than product cache misses).
US-08 (Traffic-Based) — Authoritative side: US-08. The cost circuit breaker described in US-08 reads daily Bedrock spend (line 606 metric DailyBedrockSpend) and forces tier-down. Conflict mode: US-01's complexity classifier may still route a query to Sonnet while US-08's breaker is in DEGRADED state. Resolution: model tier selector reads model_tier_floor from US-08; when model_tier_floor = haiku, Sonnet routing is suppressed regardless of complexity score. The shared config flag is the single integration point.

Rollback & Experimentation

Shadow-Mode Plan

Run template router and semantic cache in observe-only mode for 2 weeks: log "would have routed to template" and "would have served from cache" decisions, but always serve via LLM. Compare predicted-template-response against actual-LLM-response using LLM-as-judge scoring; abort rollout if disagreement rate > 8% on a sampled 5K-message audit.
Run model tiering in shadow for 1 week: send Haiku and Sonnet for the same prompt for 5% of "simple" classified traffic; serve Sonnet to user, log Haiku response; promote Haiku to live only if blind-rated quality deltas < 5%.

Canary Thresholds

10% of traffic for 48 hours, then 25% for 72 hours, then full.
Abort criteria (any one trips): CSAT proxy drops > 3 points, template-misroute rate > 2%, semantic-cache wrong-answer rate (sampled audit) > 1%, p99 response latency rises > 15%.

Kill Switch

Single feature flag: llm_cost_optimization_enabled. When false, all four techniques are bypassed and 100% of traffic flows to Sonnet via the original path. Flag readable from SSM Parameter Store with 30-second client cache; full rollback achievable in < 5 minutes.

Quality Regression Criteria (story-specific)

Template coverage floor: ≥ 25% (below this, the template router is paying its operational cost without ROI; revert and re-tune).
Haiku-tier CSAT delta vs Sonnet-tier on the same intents: |Δ| ≤ 0.15 on a 5-point scale.
Semantic-cache audit-sample wrong-answer rate: ≤ 0.5% (raise threshold to 0.97 if exceeded).
Compression-induced hallucination rate (sampled audit on compressed-history responses): ≤ 1%.

Multi-Reviewer Validation Findings & Resolutions

The cross-reviewer pass identified the following story-specific findings. README's "Multi-Reviewer Validation & Cross-Cutting Hardening" section covers concerns that span all stories.

S1 (must-fix before production)

Semantic cache cross-tenant collision risk. The cache key currently hashes embedded query + context. If customer_id is not in the hash, two users asking semantically identical questions can share a cache entry — User A's PII-laden answer (order details, address) is served to User B. Resolution: prepend customer_id to every llmresp: cache key; reject on customer_id mismatch at read; per-customer logical Redis DB if memory permits. Audit-sample 0.5% of cache hits for cross-customer leakage.

Semantic cache threshold inconsistent with declared hit rate. Threshold 0.95 + claimed 15–25% hit rate are mutually inconsistent (industry benchmarks show 0.95 alone yields 8–12%; 15–25% requires 0.92). Resolution: lower default threshold to 0.92 with per-intent overrides — product_availability and price_check use 0.97 (high precision required, fact answers must match exactly), general_chat uses 0.88. Add cache_hit_audit_accuracy metric (1% sample comparing cached vs uncached answer); alert at < 95%.

ModelTierSelector ignores model_tier_floor. US-08 declares this contract but US-01's selector reads only the complexity score. Resolution: before any routing decision, read model_tier_floor through the central feature-flag evaluator (per README precedence rules). When set to haiku, suppress Sonnet routing regardless of complexity score.

Prompt-injection-via-cached-response surface. Adversarial query → adversarial response → cached → served to subsequent users. Resolution: before insert into llmresp:, run response through a heuristic safety filter (no URLs not on allowlist, no instructions-to-user, no suspicious patterns). Tag the cached response with the originating intent label and customer_id; reject on read if request intent ≠ cached intent.

S2 (fix before scale-up)

Pricing baseline missing region uplift. Baseline math at line 30 uses us-east-1 list price; production runs ap-northeast-1 with ~5–10% uplift on Bedrock. Resolution: restate baseline at ap-northeast-1 pricing (~$330–346K/month vs $315K) so post-optimization targets reconcile against the real bill. Explicit pricing constants in code, sourced from a versioned pricing module (not hardcoded), so quarterly AWS price adjustments propagate cleanly.

No quality benchmark for Haiku on actual MangaAssist intents. Story claims "Haiku is interchangeable for retrieval-shaped intents" with no manga-specific evidence and no Japanese-language coverage. Resolution: before Haiku routing exceeds 5% live traffic, run blind-rated A/B against Sonnet on a 1K-prompt manga-store golden set stratified by intent and language (English / Japanese). Acceptance: per-intent rated-quality delta ≤ 5%; per-language delta ≤ 8%.

Prompt compression risks dropping entity references. Heuristic compression can strip entity mentions (product names, ASIN, genre) the latest user turn refers back to. Resolution: entity-tag pass before compression — [ENT:product]Berserk[/ENT] markers; compression rule never strips a sentence containing tagged entities. Add compression_entity_loss_rate metric (sampled).

Cost attribution missing in Bedrock invocation path. US-08's circuit breaker reads aggregated daily spend without per-request breakdown. Resolution: every Bedrock call emits an event to US-07 with request_id, customer_id (hashed), model_id, input_tokens, output_tokens, cached_input_tokens (when prompt caching is on), intent_label, intent_confidence, tier_used. Schema versioned per README cross-cutting concerns.

S3 (acknowledged / future work)

Native Bedrock prompt caching not yet exploited. Estimated additional 15–25% input-cost reduction on the ~1.5K-token stable system prompt. Backlog item — see README Bedrock Provisioned Throughput section.
Multi-region failure mode — single-region (ap-northeast-1) by current scope; multi-region failover requires replicated Redis + DynamoDB Global Tables.
Per-customer cost attribution — useful for chargeback but adds storage cost; deferred until cost data shapes are stable.

Runbook: Cost Spike Detected at 3am JST

Symptoms: US-07 daily-spend metric > 110% of budget within first 4 hours of UTC day; Bedrock invocation rate > 2× rolling 7-day average.

Triage (in order):

Set degradation_active=true via the feature-flag evaluator (effective globally in ≤ 30s). All stories tier-down per README precedence rules.
Read cache_hit_audit_accuracy from CloudWatch — if < 90%, semantic cache may be poisoned. Set llm_cost_optimization_enabled=false to bypass the cache entirely.
Read intent_precision from US-02's weekly audit — if < 0.90, template router is misrouting; force fall-through to LLM by raising router's confidence threshold to 0.99 via SSM.
Verify the cost breaker actually tripped: query DDB cost ledger (per US-08 immutable ledger), not Redis. If Redis says tripped but DDB says not, suspect Redis tampering — escalate immediately to security.
Page FinOps lead if not auto-resolved within 15 minutes.

Escalation: if cost continues to climb after step 1, manually scale Bedrock invocation concurrency to zero via the orchestrator kill switch (bedrock_invocation_enabled=false). All chat traffic returns "service temporarily degraded" template responses; cost goes to zero; user impact bounded for ≤ 1 hour while root cause is found.