US-01: LLM Token Cost Optimization
User Story
As a platform engineering lead, I want to minimize Bedrock LLM token consumption without degrading response quality, So that the MangaAssist chatbot operates within budget at scale while maintaining a natural conversational experience.
Acceptance Criteria
- Template-first routing bypasses the LLM for at least 30% of all messages (chitchat, simple order lookups, greetings).
- Prompt compression reduces average input token count by 40% or more.
- Semantic response cache achieves a 15-25% hit rate for repeated/similar queries.
- Model tiering routes simple queries to a cheaper model (Haiku) and complex queries to Sonnet.
- Total Bedrock spend decreases by 40-60% compared to the "send everything to Sonnet" baseline.
High-Level Design
Cost Problem
Bedrock charges per input and output token. Claude 3.5 Sonnet pricing: - Input: $3.00 / 1M tokens - Output: $15.00 / 1M tokens
At 1M messages/day with an average prompt of 2,000 input tokens and 300 output tokens: - Daily input cost: 2B tokens × $3/1M = $6,000/day - Daily output cost: 300M tokens × $15/1M = $4,500/day - Baseline: ~$10,500/day = ~$315,000/month
Optimization Strategy Overview
graph TD
A[User Message] --> B{Template-First<br>Router}
B -->|chitchat, greeting,<br>simple order status| C[Template Response<br>Zero LLM cost]
B -->|needs generation| D{Semantic Cache<br>Check}
D -->|cache hit| E[Return Cached Response<br>Zero LLM cost]
D -->|cache miss| F{Complexity<br>Classifier}
F -->|simple FAQ,<br>product lookup| G[Claude Haiku<br>Low cost]
F -->|recommendation,<br>multi-turn reasoning| H[Claude Sonnet<br>Full cost]
G --> I[Prompt Compressor]
H --> I
I --> J[Compressed Prompt<br>to Bedrock]
J --> K[Cache Response<br>for future hits]
style C fill:#2d8,stroke:#333
style E fill:#2d8,stroke:#333
style G fill:#fd2,stroke:#333
style H fill:#f66,stroke:#333
Projected Savings Breakdown
| Technique | Traffic Affected | Token Reduction | Monthly Savings |
|---|---|---|---|
| Template-first routing | ~30% of messages | 100% (no LLM call) | ~$94,500 |
| Semantic response cache | ~18% of remaining | 100% (cached) | ~$39,690 |
| Model tiering (Haiku for simple) | ~35% of remaining | 85% cost reduction | ~$50,000 |
| Prompt compression | All LLM calls | 40% fewer input tokens | ~$25,000 |
| Total estimated savings | ~$209,190/month (66%) |
Low-Level Design
1. Template-First Router
The Orchestrator checks whether the classified intent can be fully resolved without an LLM call.
graph LR
A[Intent + Entities] --> B{Intent in<br>template_eligible?}
B -->|Yes| C{All required<br>entities present?}
C -->|Yes| D[Render Template<br>with data]
C -->|No| E[Fall through<br>to LLM]
B -->|No| E
D --> F[Return Response]
E --> G[LLM Pipeline]
Template-Eligible Intents
| Intent | Template Condition | Template Example |
|---|---|---|
chitchat (greeting) |
Always | "Welcome to JP Manga Store! How can I help?" |
chitchat (thanks) |
Always | "You're welcome! Anything else I can help with?" |
order_tracking |
Order ID resolved | "Your order #{{order_id}} is {{status}}. Expected delivery: {{date}}." |
promotion |
Active promos found | "We have {{count}} active deals: {{promo_list}}" |
escalation |
Always | "Connecting you with a support agent..." |
Code Example: Template Router
from dataclasses import dataclass
from enum import Enum
from typing import Optional
class IntentType(Enum):
CHITCHAT = "chitchat"
ORDER_TRACKING = "order_tracking"
PROMOTION = "promotion"
ESCALATION = "escalation"
PRODUCT_QUESTION = "product_question"
RECOMMENDATION = "recommendation"
FAQ = "faq"
RETURN_REQUEST = "return_request"
CHECKOUT_HELP = "checkout_help"
PRODUCT_DISCOVERY = "product_discovery"
@dataclass
class Intent:
type: IntentType
confidence: float
entities: dict
@dataclass
class TemplateResult:
response_text: str
bypassed_llm: bool
template_id: str
TEMPLATE_ELIGIBLE_INTENTS = {
IntentType.CHITCHAT,
IntentType.ORDER_TRACKING,
IntentType.PROMOTION,
IntentType.ESCALATION,
}
TEMPLATES = {
"chitchat_greeting": "Welcome to the JP Manga Store! How can I help you today?",
"chitchat_thanks": "You're welcome! Is there anything else I can help with?",
"chitchat_goodbye": "Thanks for visiting! Happy reading!",
"order_tracking": "Your order #{order_id} is currently {status}. "
"Expected delivery: {delivery_date}.",
"promotion_list": "Great news! We have {count} active manga deals:\n{promo_list}",
"escalation": "I'll connect you with a support agent right away. "
"Estimated wait: ~{wait_time}.",
}
class TemplateRouter:
"""Routes eligible intents to pre-built templates, bypassing the LLM."""
def try_template(
self,
intent: Intent,
service_data: dict,
) -> Optional[TemplateResult]:
if intent.type not in TEMPLATE_ELIGIBLE_INTENTS:
return None
if intent.type == IntentType.CHITCHAT:
return self._handle_chitchat(intent)
if intent.type == IntentType.ORDER_TRACKING:
return self._handle_order_tracking(intent, service_data)
if intent.type == IntentType.PROMOTION:
return self._handle_promotion(service_data)
if intent.type == IntentType.ESCALATION:
return self._handle_escalation(service_data)
return None
def _handle_chitchat(self, intent: Intent) -> TemplateResult:
sub_intent = intent.entities.get("sub_intent", "greeting")
template_key = f"chitchat_{sub_intent}"
text = TEMPLATES.get(template_key, TEMPLATES["chitchat_greeting"])
return TemplateResult(
response_text=text,
bypassed_llm=True,
template_id=template_key,
)
def _handle_order_tracking(
self, intent: Intent, service_data: dict
) -> Optional[TemplateResult]:
order = service_data.get("order")
if not order or not order.get("order_id"):
return None # Fall through to LLM for clarification
text = TEMPLATES["order_tracking"].format(
order_id=order["order_id"],
status=order.get("status", "processing"),
delivery_date=order.get("delivery_date", "soon"),
)
return TemplateResult(
response_text=text,
bypassed_llm=True,
template_id="order_tracking",
)
def _handle_promotion(self, service_data: dict) -> Optional[TemplateResult]:
promos = service_data.get("promotions", [])
if not promos:
return None
promo_lines = [f"• {p['title']} — {p['discount']}" for p in promos]
text = TEMPLATES["promotion_list"].format(
count=len(promos),
promo_list="\n".join(promo_lines),
)
return TemplateResult(
response_text=text,
bypassed_llm=True,
template_id="promotion_list",
)
def _handle_escalation(self, service_data: dict) -> TemplateResult:
wait = service_data.get("estimated_wait_seconds", 120)
wait_min = max(1, wait // 60)
text = TEMPLATES["escalation"].format(wait_time=f"{wait_min} minutes")
return TemplateResult(
response_text=text,
bypassed_llm=True,
template_id="escalation",
)
2. Semantic Response Cache
Caches LLM responses keyed by a semantic hash of the query + context, so near-identical questions reuse prior answers.
sequenceDiagram
participant Orchestrator
participant Embedder
participant CacheIndex as Redis + Vector Cache
participant Bedrock
Orchestrator->>Embedder: Embed (query + intent + key_context)
Embedder-->>Orchestrator: Query embedding
Orchestrator->>CacheIndex: Search cached embeddings<br>(cosine similarity > 0.95)
alt Cache Hit
CacheIndex-->>Orchestrator: Cached response + metadata
Orchestrator->>Orchestrator: Validate freshness (TTL check)
Orchestrator-->>Orchestrator: Return cached response
else Cache Miss
CacheIndex-->>Orchestrator: No match
Orchestrator->>Bedrock: Generate response
Bedrock-->>Orchestrator: LLM response
Orchestrator->>CacheIndex: Store embedding + response<br>(TTL by intent type)
end
Cache TTL by Intent Type
| Intent | Cache TTL | Rationale |
|---|---|---|
faq |
24 hours | Policy/FAQ content changes infrequently |
product_question |
1 hour | Product attributes rarely change mid-day |
recommendation |
30 min | Personalized; shorter TTL |
product_discovery |
30 min | Trends shift but not rapidly |
return_request |
Never cached | User-specific; must be live |
order_tracking |
Never cached | Real-time data required |
Code Example: Semantic Cache
import hashlib
import json
import time
from dataclasses import dataclass
from typing import Optional
import numpy as np
import redis
@dataclass
class CachedResponse:
response_text: str
products: list
intent: str
created_at: float
ttl_seconds: int
INTENT_CACHE_TTL = {
"faq": 86400, # 24 hours
"product_question": 3600, # 1 hour
"recommendation": 1800, # 30 min
"product_discovery": 1800,
}
NON_CACHEABLE_INTENTS = {"order_tracking", "return_request", "escalation", "chitchat"}
class SemanticResponseCache:
"""Cache LLM responses by semantic similarity of the query."""
SIMILARITY_THRESHOLD = 0.95
def __init__(self, redis_client: redis.Redis, embedder):
self.redis = redis_client
self.embedder = embedder
def _make_cache_key(self, intent: str, embedding: np.ndarray) -> str:
"""Deterministic key from intent + discretized embedding."""
quantized = np.round(embedding, decimals=4)
raw = f"{intent}:{quantized.tobytes().hex()}"
return f"llm_cache:{hashlib.sha256(raw.encode()).hexdigest()[:32]}"
def get(
self, query: str, intent: str, context_hash: str
) -> Optional[CachedResponse]:
if intent in NON_CACHEABLE_INTENTS:
return None
embedding = self.embedder.embed(f"{intent}:{query}:{context_hash}")
cache_key = self._make_cache_key(intent, embedding)
raw = self.redis.get(cache_key)
if raw is None:
return None
data = json.loads(raw)
cached = CachedResponse(**data)
# Check if still within TTL
if time.time() - cached.created_at > cached.ttl_seconds:
self.redis.delete(cache_key)
return None
return cached
def put(
self,
query: str,
intent: str,
context_hash: str,
response_text: str,
products: list,
) -> None:
if intent in NON_CACHEABLE_INTENTS:
return
ttl = INTENT_CACHE_TTL.get(intent)
if ttl is None:
return
embedding = self.embedder.embed(f"{intent}:{query}:{context_hash}")
cache_key = self._make_cache_key(intent, embedding)
cached = CachedResponse(
response_text=response_text,
products=products,
intent=intent,
created_at=time.time(),
ttl_seconds=ttl,
)
self.redis.setex(cache_key, ttl, json.dumps(cached.__dict__))
3. Model Tiering
Route requests to the cheapest model that can handle the task.
graph TD
A[LLM Request] --> B{Complexity<br>Score}
B -->|score <= 0.4<br>Simple FAQ, single entity| C[Claude 3.5 Haiku<br>$0.25/$1.25 per 1M tokens]
B -->|0.4 < score <= 0.7<br>Multi-entity, comparison| D[Claude 3.5 Sonnet<br>$3/$15 per 1M tokens]
B -->|score > 0.7<br>Complex reasoning,<br>multi-turn synthesis| E[Claude 3.5 Sonnet<br>with extended context]
C --> F[Response]
D --> F
E --> F
style C fill:#2d8,stroke:#333
style D fill:#fd2,stroke:#333
style E fill:#f66,stroke:#333
Complexity Scoring Rules
| Factor | Weight | Low (0) | Medium (0.5) | High (1.0) |
|---|---|---|---|---|
| Entity count | 0.2 | 0-1 entities | 2-3 entities | 4+ entities |
| Conversation turns | 0.2 | 1-3 turns | 4-8 turns | 9+ turns |
| RAG chunks needed | 0.2 | 0-1 chunks | 2-3 chunks | 4+ chunks |
| Intent ambiguity | 0.2 | confidence > 0.9 | 0.7-0.9 | < 0.7 |
| Requires comparison | 0.2 | No | Implicit | Explicit |
Code Example: Model Tier Selector
from dataclasses import dataclass
from enum import Enum
class ModelTier(Enum):
HAIKU = "anthropic.claude-3-5-haiku-20241022-v1:0"
SONNET = "anthropic.claude-3-5-sonnet-20241022-v2:0"
@dataclass
class TierDecision:
model_id: str
tier: ModelTier
complexity_score: float
reason: str
class ModelTierSelector:
"""Select the cheapest model capable of handling the request."""
HAIKU_THRESHOLD = 0.4
SONNET_THRESHOLD = 0.7
def select(
self,
intent: str,
confidence: float,
entity_count: int,
turn_count: int,
rag_chunk_count: int,
requires_comparison: bool,
) -> TierDecision:
scores = {
"entity": self._score_entities(entity_count),
"turns": self._score_turns(turn_count),
"rag": self._score_rag(rag_chunk_count),
"ambiguity": self._score_ambiguity(confidence),
"comparison": 1.0 if requires_comparison else 0.0,
}
weights = {
"entity": 0.2,
"turns": 0.2,
"rag": 0.2,
"ambiguity": 0.2,
"comparison": 0.2,
}
complexity = sum(scores[k] * weights[k] for k in scores)
if complexity <= self.HAIKU_THRESHOLD:
tier = ModelTier.HAIKU
reason = "Low complexity — single entity, high confidence, minimal context"
else:
tier = ModelTier.SONNET
reason = f"Complexity {complexity:.2f} requires Sonnet"
return TierDecision(
model_id=tier.value,
tier=tier,
complexity_score=complexity,
reason=reason,
)
def _score_entities(self, count: int) -> float:
if count <= 1:
return 0.0
if count <= 3:
return 0.5
return 1.0
def _score_turns(self, count: int) -> float:
if count <= 3:
return 0.0
if count <= 8:
return 0.5
return 1.0
def _score_rag(self, count: int) -> float:
if count <= 1:
return 0.0
if count <= 3:
return 0.5
return 1.0
def _score_ambiguity(self, confidence: float) -> float:
if confidence > 0.9:
return 0.0
if confidence > 0.7:
return 0.5
return 1.0
4. Prompt Compression
Reduces input tokens by compressing conversation history and trimming redundant context.
graph LR
A[Full Prompt<br>~2000 tokens] --> B[Trim Browsing History<br>Keep last 5 items]
B --> C[Compress Conv History<br>Summarize older turns]
C --> D[Deduplicate RAG Chunks<br>Remove overlapping content]
D --> E[Strip Empty Context<br>Remove null fields]
E --> F[Compressed Prompt<br>~1200 tokens]
style A fill:#f66,stroke:#333
style F fill:#2d8,stroke:#333
Code Example: Prompt Compressor
from typing import Optional
class PromptCompressor:
"""Reduce input token count while preserving essential context."""
MAX_BROWSING_HISTORY = 5
MAX_RECENT_TURNS = 6 # 3 user + 3 assistant
MAX_RAG_CHUNKS = 3
MAX_CHUNK_TOKENS = 200
def compress(
self,
system_prompt: str,
browsing_history: list[str],
conversation_turns: list[dict],
rag_chunks: list[dict],
current_product: Optional[dict],
active_promos: list[dict],
) -> str:
parts = [system_prompt]
# 1. Trim browsing history to recent items
trimmed_history = browsing_history[-self.MAX_BROWSING_HISTORY :]
if trimmed_history:
parts.append(f"Recent browsing: {', '.join(trimmed_history)}")
# 2. Compress conversation: keep recent turns, summarize older
if len(conversation_turns) > self.MAX_RECENT_TURNS:
older = conversation_turns[: -self.MAX_RECENT_TURNS]
summary = self._summarize_turns(older)
recent = conversation_turns[-self.MAX_RECENT_TURNS :]
parts.append(f"Earlier context: {summary}")
for turn in recent:
parts.append(f"{turn['role']}: {turn['content']}")
else:
for turn in conversation_turns:
parts.append(f"{turn['role']}: {turn['content']}")
# 3. Deduplicate and trim RAG chunks
unique_chunks = self._deduplicate_chunks(rag_chunks)
for chunk in unique_chunks[: self.MAX_RAG_CHUNKS]:
truncated = self._truncate_tokens(chunk["content"], self.MAX_CHUNK_TOKENS)
parts.append(f"[{chunk['source_type']}] {truncated}")
# 4. Include product context only if present
if current_product:
parts.append(self._compact_product(current_product))
# 5. Active promos — one line each
if active_promos:
promo_lines = [f"- {p['title']}: {p['discount']}" for p in active_promos[:3]]
parts.append("Active deals:\n" + "\n".join(promo_lines))
return "\n\n".join(parts)
def _summarize_turns(self, turns: list[dict]) -> str:
"""Lightweight extractive summary of older turns."""
user_msgs = [t["content"] for t in turns if t["role"] == "user"]
return f"User previously asked about: {'; '.join(user_msgs[-3:])}"
def _deduplicate_chunks(self, chunks: list[dict]) -> list[dict]:
"""Remove chunks with >80% content overlap."""
seen_hashes: set[str] = set()
unique: list[dict] = []
for chunk in chunks:
# Simple dedup by first 100 chars
sig = chunk["content"][:100]
if sig not in seen_hashes:
seen_hashes.add(sig)
unique.append(chunk)
return unique
def _truncate_tokens(self, text: str, max_tokens: int) -> str:
words = text.split()
# Rough approximation: 1 token ≈ 0.75 words
max_words = int(max_tokens * 0.75)
if len(words) <= max_words:
return text
return " ".join(words[:max_words]) + "..."
def _compact_product(self, product: dict) -> str:
fields = ["title", "price", "format", "availability"]
parts = [f"{k}: {product[k]}" for k in fields if product.get(k)]
return "Current product: " + " | ".join(parts)
Monitoring and Metrics
Cost Dashboard Metrics
graph TD
subgraph "CloudWatch Metrics"
A[TemplateBypassRate<br>Target: >= 30%]
B[SemanticCacheHitRate<br>Target: >= 15%]
C[HaikuRoutingRate<br>Target: >= 35% of LLM calls]
D[AvgInputTokens<br>Target: <= 1200]
E[AvgOutputTokens<br>Target: <= 250]
F[DailyBedrockSpend<br>Target: <= $4,200]
end
subgraph "Alarms"
A --> G{< 25%?}
G -->|Yes| H[Alert: Template coverage low]
F --> I{> $5,000?}
I -->|Yes| J[Alert: Cost spike]
end
Key Metrics to Track
| Metric | Formula | Target | Alert Threshold |
|---|---|---|---|
| Template bypass rate | template_responses / total_messages | ≥ 30% | < 25% |
| Cache hit rate | cache_hits / llm_eligible_messages | ≥ 15% | < 10% |
| Haiku routing % | haiku_calls / total_llm_calls | ≥ 35% | < 25% |
| Avg input tokens | sum(input_tokens) / llm_calls | ≤ 1,200 | > 1,500 |
| Daily Bedrock spend | sum(daily_cost) | ≤ $4,200 | > $5,000 |
| Quality score (CSAT) | positive_feedback / total_feedback | ≥ 4.⅖ | < 3.8/5 |
Risks and Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| Over-aggressive caching returns stale data | Wrong product info shown to user | Event-driven invalidation + TTL guards; never cache prices |
| Haiku produces lower quality recommendations | User satisfaction drops | A/B test with quality scoring; auto-promote to Sonnet if feedback < threshold |
| Prompt compression loses critical context | LLM hallucinates or gives irrelevant answer | Always keep last 3 user turns intact; compress only older history |
| Template responses feel robotic | Lower engagement | Vary templates with 3-5 variants per intent; add personalization tokens |
Deep Dive: Why This Works on a Manga Chatbot Workload
The four techniques in this story (template-first, semantic cache, model tiering, prompt compression) are not equally powerful — they work together because manga-chatbot traffic has three exploitable properties that LLM-only architectures ignore.
Property 1: Intent distribution is heavy-tailed, not uniform. On a manga store, the top ~10 intents (greeting, order status, "where is my package", "is volume N out yet", "more like this title", price check, return policy, language availability, account login, store hours) cover the majority of all messages. These are exactly the intents whose answer is either a constant string or a SQL-shaped lookup over RDS/DynamoDB. Sending them to Sonnet is paying $0.008–$0.012 per message to format an answer the system already knows. The template-first router exploits this distribution: it does not "compete" with the LLM, it removes the LLM from the path entirely for the queries the LLM was never adding value to. The architectural assumption is that the intent classifier (US-02) is precise enough — intent_precision >= 0.92 — that template-render-with-wrong-data is rarer than LLM-cost waste. If intent precision drifts, this technique inverts: it generates wrong answers cheaply instead of right answers expensively.
Property 2: Manga catalog questions are highly repetitive across users. Two users asking "is Chainsaw Man Volume 15 in English available" produce different prompts (different session history, different timestamps, different user IDs) but should produce the same factual answer. The semantic cache works because the answer-determining part of the prompt — the user question — collapses into a small number of equivalence classes per ASIN, while the answer-irrelevant part (session, persona, history) is what makes each prompt unique-looking. Classic exact-match caching cannot see through that noise; embedding-based semantic match can. The 0.95 cosine threshold is chosen so two questions only share a cache entry when they would have identical correct answers; lower the threshold and you start serving "is X available" answers for "is X in stock" questions, which can be different facts.
Property 3: Most manga questions are single-step retrieval, not multi-step reasoning. "Recommend me 5 titles like Berserk" is a recommendation problem (Sonnet-shaped). "What is the price of volume 1" is a retrieval problem (Haiku-shaped). The complexity classifier exists because Sonnet's strength — chain-of-thought, comparison, synthesis — is wasted on retrieval prompts where any LLM that can format JSON will produce identical output. Haiku at ~12× cheaper input and ~5× cheaper output is genuinely interchangeable on retrieval-shaped intents; the savings are real, not a quality compromise. The failure signal is response-quality regression measured at the intent-tier level, not aggregated CSAT: aggregated CSAT will mask Haiku failures hidden in 35% of traffic.
The prompt-compressor is the only technique here that is content-agnostic — it works because conversational LLM context windows are filled with low-information-density text (system preambles repeated every turn, prior assistant responses re-summarized, RAG chunks with formatting boilerplate). The compression ratio is highest where information density is lowest: chat history compresses well, fresh user questions compress almost not at all. This is why the strategy is "compress old turns, never the latest turn" — it preserves every bit of signal where signal density is high.
Bottom line: the savings stack multiplicatively because each technique attacks a different inefficiency: template kills calls, cache kills duplicate calls, tiering kills oversized calls, compression kills oversized prompts. Pull any one out and the others still work, but you leave 8–15 percentage points of savings on the table per technique removed.
Real-World Validation
Industry Benchmarks & Case Studies
- Anthropic prompt caching (official docs) — Anthropic's prompt caching reduces cost on cache-hit reads by up to 90% (cache reads billed at 10% of base input price; cache writes at 125%). For chat workloads with stable system prompts, public Anthropic case studies report 2–10× effective cost reduction once cache is warm. This story does not currently exploit Bedrock's native prompt caching — see "Math Validation" below for the gap.
- Notion AI / Cursor / Lindy public talks — Multi-tier model routing (Haiku-for-simple, Sonnet-for-complex, Opus-for-edge-cases) is now standard in production AI products. Cursor's published cost-per-completion analysis and Notion's 2024 AI cost retrospectives report 40–65% cost reduction from tier routing, consistent with the 35% Haiku-routing target in this story.
- GPTCache / Redis Labs semantic cache benchmarks — Public benchmarks on chat-style workloads show 15–25% semantic cache hit rates with thresholds in the 0.92–0.97 cosine-similarity range. The 15–25% target in this story sits in the realistic band; teams reporting > 35% are usually conflating exact-match and semantic-match hits.
- Microsoft LLMLingua paper (Jiang et al., 2023, EMNLP) — Demonstrates 2–20× prompt compression with task-quality preservation on summarization and QA. The 40% input-token reduction targeted here is on the very conservative end of the LLMLingua results; the gap exists because this story uses heuristic compression (truncate older turns, strip formatting), not learned compression — heuristic methods reliably hit 30–50% compression but plateau there.
- Internal cross-reference:
POC-to-Production-War-Story/02-seven-production-catastrophes.md— The "cost explosion" catastrophe in the war story was exactly the failure mode this story prevents (uncontrolled token growth as prompts accumulated prior turns). The fixes documented there (turn-window limits, summary-instead-of-full-history) are codified into the prompt compressor here. - Internal cross-reference:
Optimization-Tradeoffs-User-Stories/— Covers the cost-vs-quality trade-off curve at a higher abstraction; this story is the cost-leaning operating point on that curve.
Math Validation
Re-derive the headline cost number against current AWS Bedrock published pricing (validate before each release; pricing is volatile):
- Claude 3.5 Sonnet: $3.00 / 1M input, $15.00 / 1M output (matches story line 24-25, ✅).
- Claude 3 Haiku: $0.25 / 1M input, $1.25 / 1M output (12× / 12× cheaper, matches the "85% cost reduction" claim on line 60, ✅).
- 1M messages/day × 2,000 input tokens = 2B input tokens/day × $3 / 1M = $6,000/day input ✅.
- 1M messages/day × 300 output tokens = 300M output tokens/day × $15 / 1M = $4,500/day output ✅.
- Baseline $315K/month is internally consistent.
Gap flagged: The story does not currently use Bedrock's native prompt caching feature (released Aug 2024 for Anthropic on Bedrock). For a chatbot with a stable system prompt of ~1,500 tokens repeated on every call, native prompt caching would deliver an additional 15–25% input-cost reduction on top of the techniques here, with near-zero engineering cost. Recommend folding into the next iteration of US-01.
Conservative vs Aggressive Savings Bounds
| Bound | Source | Total monthly savings |
|---|---|---|
| Conservative ("typically observed") | GPTCache 15% hit + Notion-style 30% Haiku routing + LLMLingua 30% compression + 20% template bypass | ~45% (~$140K/month) |
| Aggressive ("best published") | Native prompt caching layered on top + 25% semantic cache + 40% Haiku + 50% LLMLingua + 35% template | ~70% (~$220K/month) |
| Story's projected savings | Story claims 66% (~$209K) | Sits at the top of the realistic band — assumes everything works as specified. |
Cross-Story Interactions & Conflicts
- US-02 (Intent Classifier) — Authoritative side: US-02. The template-first router on line 70 depends on the intent label produced by US-02. Conflict mode: if US-02's intent precision drops below 0.92, template-first generates wrong answers cheaply; cost goes down, CSAT goes down faster. Resolution: template router must read the intent confidence score, not just the label, and fall through to LLM if confidence < 0.85. Ownership of the precision floor lives in US-02's monitoring.
- US-03 (Caching) — Authoritative side: US-03. The semantic response cache (line 39) is a logical layer on the Redis tier described in US-03. Conflict mode: without keyspace discipline, response cache entries (large, ~3 KB each) compete with product/recommendation cache entries during eviction. Resolution: reserved keyspace prefix
llmresp:with its own LRU partition andnoevictionpolicy under memory pressure (LLM cache misses are 100× more expensive than product cache misses). - US-08 (Traffic-Based) — Authoritative side: US-08. The cost circuit breaker described in US-08 reads daily Bedrock spend (line 606 metric
DailyBedrockSpend) and forces tier-down. Conflict mode: US-01's complexity classifier may still route a query to Sonnet while US-08's breaker is in DEGRADED state. Resolution: model tier selector readsmodel_tier_floorfrom US-08; whenmodel_tier_floor = haiku, Sonnet routing is suppressed regardless of complexity score. The shared config flag is the single integration point.
Rollback & Experimentation
Shadow-Mode Plan
- Run template router and semantic cache in observe-only mode for 2 weeks: log "would have routed to template" and "would have served from cache" decisions, but always serve via LLM. Compare predicted-template-response against actual-LLM-response using LLM-as-judge scoring; abort rollout if disagreement rate > 8% on a sampled 5K-message audit.
- Run model tiering in shadow for 1 week: send Haiku and Sonnet for the same prompt for 5% of "simple" classified traffic; serve Sonnet to user, log Haiku response; promote Haiku to live only if blind-rated quality deltas < 5%.
Canary Thresholds
- 10% of traffic for 48 hours, then 25% for 72 hours, then full.
- Abort criteria (any one trips): CSAT proxy drops > 3 points, template-misroute rate > 2%, semantic-cache wrong-answer rate (sampled audit) > 1%, p99 response latency rises > 15%.
Kill Switch
- Single feature flag:
llm_cost_optimization_enabled. When false, all four techniques are bypassed and 100% of traffic flows to Sonnet via the original path. Flag readable from SSM Parameter Store with 30-second client cache; full rollback achievable in < 5 minutes.
Quality Regression Criteria (story-specific)
- Template coverage floor: ≥ 25% (below this, the template router is paying its operational cost without ROI; revert and re-tune).
- Haiku-tier CSAT delta vs Sonnet-tier on the same intents: |Δ| ≤ 0.15 on a 5-point scale.
- Semantic-cache audit-sample wrong-answer rate: ≤ 0.5% (raise threshold to 0.97 if exceeded).
- Compression-induced hallucination rate (sampled audit on compressed-history responses): ≤ 1%.
Multi-Reviewer Validation Findings & Resolutions
The cross-reviewer pass identified the following story-specific findings. README's "Multi-Reviewer Validation & Cross-Cutting Hardening" section covers concerns that span all stories.
S1 (must-fix before production)
Semantic cache cross-tenant collision risk. The cache key currently hashes embedded query + context. If customer_id is not in the hash, two users asking semantically identical questions can share a cache entry — User A's PII-laden answer (order details, address) is served to User B. Resolution: prepend customer_id to every llmresp: cache key; reject on customer_id mismatch at read; per-customer logical Redis DB if memory permits. Audit-sample 0.5% of cache hits for cross-customer leakage.
Semantic cache threshold inconsistent with declared hit rate. Threshold 0.95 + claimed 15–25% hit rate are mutually inconsistent (industry benchmarks show 0.95 alone yields 8–12%; 15–25% requires 0.92). Resolution: lower default threshold to 0.92 with per-intent overrides — product_availability and price_check use 0.97 (high precision required, fact answers must match exactly), general_chat uses 0.88. Add cache_hit_audit_accuracy metric (1% sample comparing cached vs uncached answer); alert at < 95%.
ModelTierSelector ignores model_tier_floor. US-08 declares this contract but US-01's selector reads only the complexity score. Resolution: before any routing decision, read model_tier_floor through the central feature-flag evaluator (per README precedence rules). When set to haiku, suppress Sonnet routing regardless of complexity score.
Prompt-injection-via-cached-response surface. Adversarial query → adversarial response → cached → served to subsequent users. Resolution: before insert into llmresp:, run response through a heuristic safety filter (no URLs not on allowlist, no instructions-to-user, no suspicious patterns). Tag the cached response with the originating intent label and customer_id; reject on read if request intent ≠ cached intent.
S2 (fix before scale-up)
Pricing baseline missing region uplift. Baseline math at line 30 uses us-east-1 list price; production runs ap-northeast-1 with ~5–10% uplift on Bedrock. Resolution: restate baseline at ap-northeast-1 pricing (~$330–346K/month vs $315K) so post-optimization targets reconcile against the real bill. Explicit pricing constants in code, sourced from a versioned pricing module (not hardcoded), so quarterly AWS price adjustments propagate cleanly.
No quality benchmark for Haiku on actual MangaAssist intents. Story claims "Haiku is interchangeable for retrieval-shaped intents" with no manga-specific evidence and no Japanese-language coverage. Resolution: before Haiku routing exceeds 5% live traffic, run blind-rated A/B against Sonnet on a 1K-prompt manga-store golden set stratified by intent and language (English / Japanese). Acceptance: per-intent rated-quality delta ≤ 5%; per-language delta ≤ 8%.
Prompt compression risks dropping entity references. Heuristic compression can strip entity mentions (product names, ASIN, genre) the latest user turn refers back to. Resolution: entity-tag pass before compression — [ENT:product]Berserk[/ENT] markers; compression rule never strips a sentence containing tagged entities. Add compression_entity_loss_rate metric (sampled).
Cost attribution missing in Bedrock invocation path. US-08's circuit breaker reads aggregated daily spend without per-request breakdown. Resolution: every Bedrock call emits an event to US-07 with request_id, customer_id (hashed), model_id, input_tokens, output_tokens, cached_input_tokens (when prompt caching is on), intent_label, intent_confidence, tier_used. Schema versioned per README cross-cutting concerns.
S3 (acknowledged / future work)
- Native Bedrock prompt caching not yet exploited. Estimated additional 15–25% input-cost reduction on the ~1.5K-token stable system prompt. Backlog item — see README Bedrock Provisioned Throughput section.
- Multi-region failure mode — single-region (ap-northeast-1) by current scope; multi-region failover requires replicated Redis + DynamoDB Global Tables.
- Per-customer cost attribution — useful for chargeback but adds storage cost; deferred until cost data shapes are stable.
Runbook: Cost Spike Detected at 3am JST
Symptoms: US-07 daily-spend metric > 110% of budget within first 4 hours of UTC day; Bedrock invocation rate > 2× rolling 7-day average.
Triage (in order):
- Set
degradation_active=truevia the feature-flag evaluator (effective globally in ≤ 30s). All stories tier-down per README precedence rules. - Read
cache_hit_audit_accuracyfrom CloudWatch — if < 90%, semantic cache may be poisoned. Setllm_cost_optimization_enabled=falseto bypass the cache entirely. - Read
intent_precisionfrom US-02's weekly audit — if < 0.90, template router is misrouting; force fall-through to LLM by raising router's confidence threshold to 0.99 via SSM. - Verify the cost breaker actually tripped: query DDB cost ledger (per US-08 immutable ledger), not Redis. If Redis says tripped but DDB says not, suspect Redis tampering — escalate immediately to security.
- Page FinOps lead if not auto-resolved within 15 minutes.
Escalation: if cost continues to climb after step 1, manually scale Bedrock invocation concurrency to zero via the orchestrator kill switch (bedrock_invocation_enabled=false). All chat traffic returns "service temporarily degraded" template responses; cost goes to zero; user impact bounded for ≤ 1 hour while root cause is found.