Token Efficiency Scenarios and Runbooks
MangaAssist context: JP Manga store chatbot on AWS — Bedrock Claude 3 (Sonnet at $3/$15 per 1M tokens input/output, Haiku at $0.25/$1.25), OpenSearch Serverless (vector store), DynamoDB (sessions/products), ECS Fargate (orchestrator), API Gateway WebSocket, ElastiCache Redis. Target: useful answer in under 3 seconds, 1M messages/day scale.
Skill Mapping
| Certification | Task | Skill | This File |
|---|---|---|---|
| AWS AIP-C01 | Task 4.1 — Optimize cost and performance of FM applications | Skill 4.1.1 — Design token efficiency systems for FM-powered applications | 5 production scenarios with problem detection, root cause analysis, resolution, and prevention runbooks |
Skill scope: Operational runbooks for token efficiency failures in production MangaAssist. Each scenario follows a structured Problem, Detection, Root Cause, Resolution, Prevention format with Mermaid decision trees for incident response.
Scenario Overview
| # | Scenario | Severity | Blast Radius | Token Layer Affected |
|---|---|---|---|---|
| 1 | Token budget exceeded during manga recommendation conversation | High | Per-session cost spike | Context window management |
| 2 | Prompt compression causing quality degradation for Japanese content | Critical | Answer quality for JP queries | Prompt compression |
| 3 | Context pruning removing relevant manga details from RAG results | High | Incorrect product recommendations | RAG chunk pruning |
| 4 | Response size controls truncating critical shipping/pricing info | Critical | User trust and order accuracy | Response size control |
| 5 | Token tracking drift between estimation and actual Bedrock consumption | Medium | Budget forecasting accuracy | Estimation calibration |
Scenario 1: Token Budget Exceeded During Manga Recommendation Conversation
Problem
A MangaAssist user has been browsing manga for 18 turns. They started asking about shonen titles, then shifted to seinen, then asked for a specific recommendation combining elements from multiple series. By turn 18, the conversation history alone consumes 3,200 input tokens. Combined with the system prompt (180 tokens), RAG context from OpenSearch (800 tokens for 5 recommendation chunks), and the user's latest message (250 tokens), the total prompt reaches 4,430 tokens — well above the recommendation intent budget of 2,000 input tokens.
The TokenBudgetManager triggers compression, but even after aggressive history summarization and RAG pruning, the prompt is still at 2,800 tokens. The system downgrades the user from Sonnet to Haiku to stay within cost bounds, but Haiku produces a noticeably lower-quality recommendation that misses the nuanced cross-genre preferences the user expressed over 18 turns.
Detection
flowchart TD
A[CloudWatch Alarm: InputTokens > 2x intent budget] --> B{Which intent?}
B -->|recommendation| C[Check session turn count]
C --> D{Turns > 10?}
D -->|Yes| E[Multi-turn budget overrun]
D -->|No| F[Single-turn prompt bloat — check RAG chunks]
E --> G[Check compression ratio metric]
G --> H{Compression achieved target?}
H -->|Yes| I[Budget itself is too low for long conversations]
H -->|No| J[Compression pipeline failure]
J --> K[Check which compression stage failed]
K --> L{History summarization effective?}
L -->|No| M["Root cause: summarization not reducing enough"]
L -->|Yes| N["Root cause: RAG context too large after pruning"]
style E fill:#fff3e0
style M fill:#ffcdd2
style N fill:#ffcdd2
style I fill:#fff9c4
CloudWatch metrics to monitor:
- MangaAssist/Tokens/InputTokens with dimension Intent=recommendation — alarm at > 2,500
- MangaAssist/Tokens/SessionTurnCount — alarm at > 12 turns per session
- MangaAssist/Tokens/CompressionRatio — alarm when ratio > 0.8 (compression ineffective)
- MangaAssist/Tokens/ModelDowngradeCount — alert on any Sonnet-to-Haiku downgrade for recommendation intent
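The four thresholds above can be captured in one place and unit-tested before wiring them into CloudWatch alarms. A minimal sketch; the threshold map and helper name are illustrative, not part of the deployed system:

```python
# Hypothetical mirror of the CloudWatch alarm thresholds listed above,
# useful for unit-testing alarm logic before emitting metrics via boto3.
ALARM_THRESHOLDS = {
    "InputTokens": 2500,       # Intent=recommendation
    "SessionTurnCount": 12,
    "CompressionRatio": 0.8,   # ratio above 0.8 = compression ineffective
    "ModelDowngradeCount": 0,  # any downgrade for recommendation alerts
}

def triggered_alarms(metrics: dict) -> list:
    """Return the names of alarms whose thresholds are exceeded."""
    return [name for name, threshold in ALARM_THRESHOLDS.items()
            if metrics.get(name, 0) > threshold]
```

For example, the turn-18 incident described above (4,430 input tokens, 18 turns) trips the first two alarms.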
Root Cause
The conversation history grows linearly with turn count, but the summarization stage has a floor: it cannot compress 18 turns of nuanced preference discussion into fewer than ~400 tokens without losing critical context. The user expressed preferences like "I want something with the dark atmosphere of Berserk but the humor of One Punch Man and the world-building of One Piece" — these cross-series references resist summarization because each entity mention is load-bearing.
The core issue: the per-intent token budget of 2,000 was calibrated for conversations of 5-8 turns. Long recommendation sessions need a different budget tier.
Resolution
Immediate (during the incident):
# Emergency: increase recommendation budget for long sessions
def get_dynamic_budget(intent: str, turn_count: int) -> int:
    """Return a dynamic token budget based on conversation length.

    For recommendation intent, we allow budget to scale with turn count
    up to a hard ceiling. This prevents quality degradation in long
    browsing sessions while still capping maximum cost.
    """
    base_budgets = {
        "recommendation": 2000,
        "product_search": 1200,
        "order_status": 600,
        "manga_qa": 1500,
        "chitchat": 400,
    }
    base = base_budgets.get(intent, 1200)
    if intent == "recommendation" and turn_count > 8:
        # Scale: +200 tokens per 4 turns beyond 8, capped at 3,200
        extra_turns = turn_count - 8
        extra_budget = (extra_turns // 4) * 200
        return min(base + extra_budget, 3200)
    return base
Short-term fix:
1. Implement tiered budgets: recommendation_short (< 8 turns, 2,000 tokens) and recommendation_long (8+ turns, 2,800 tokens).
2. At turn 8, trigger a proactive summarization pass that uses Haiku to produce a structured "preference profile" from the conversation so far. This replaces raw history with a compact profile (~200 tokens).
3. Keep the Sonnet model for recommendation — do not downgrade. The cost of a Haiku-quality recommendation in a long session is worse than the extra tokens: the user may abandon without purchasing.
Preference profile extraction (proactive at turn 8):
PREFERENCE_EXTRACTION_PROMPT = """
Analyze this conversation and extract the user's manga preferences
into a structured profile. Be concise.
Conversation:
{history}
Output format:
- Preferred genres: [list]
- Liked series: [list with what they liked about each]
- Disliked elements: [list]
- Reading level: [beginner/intermediate/advanced]
- Language preference: [JP only / EN only / bilingual]
- Price sensitivity: [low/medium/high]
- Key request: [one sentence summary of what they want]
"""
Prevention
flowchart TD
A[Prevention Measures] --> B[Dynamic Budget Tiers]
A --> C[Proactive Preference Extraction]
A --> D[Session Length Guardrails]
A --> E[Monitoring & Alerting]
B --> B1["recommendation_short: 2,000 tokens (< 8 turns)"]
B --> B2["recommendation_long: 2,800 tokens (8-15 turns)"]
B --> B3["recommendation_extended: 3,200 tokens (15+ turns)"]
C --> C1[At turn 8: extract preference profile with Haiku]
C --> C2[Replace raw history with 200-token structured profile]
C --> C3[Continue adding new turns to profile, not raw history]
D --> D1[At turn 15: suggest the user check their cart]
D --> D2[At turn 20: offer to email recommendations]
D --> D3[Never hard-cut a session — always offer a graceful exit]
E --> E1["CloudWatch alarm: SessionTurnCount > 12"]
E --> E2["CloudWatch alarm: ModelDowngrade for recommendation"]
E --> E3[Weekly review of long-session cost distribution]
style B1 fill:#c8e6c9
style C1 fill:#bbdefb
style D1 fill:#fff9c4
Scenario 2: Prompt Compression Causing Quality Degradation for Japanese Content
Problem
After deploying LLMLingua-style token compression to production, the quality monitoring dashboard shows a sharp drop in answer accuracy for queries containing Japanese text. The compression algorithm, trained primarily on English text perplexity, is removing Japanese particles (は, が, を, に) and honorific markers (さん, 様) that it perceives as low-perplexity (predictable) tokens. In English, removing predictable words like "the" or "is" rarely changes meaning. In Japanese, removing particles fundamentally alters sentence structure and meaning.
Specific failures observed:
- 「鬼滅の刃」 (Kimetsu no Yaiba / Demon Slayer) compressed to 「鬼滅刃」 — the の removal makes the title unrecognizable to the model
- この漫画はおすすめですか? (Is this manga recommended?) compressed to この漫画おすすめですか — dropping the topic marker は leaves colloquial, ambiguous phrasing that no longer clearly ties the question to this specific manga
- Customer name 田中様 (Tanaka-sama) compressed to 田中 — honorific removal is culturally inappropriate in Japanese customer service
Detection
flowchart TD
A[Quality Monitoring Dashboard] --> B{Accuracy drop detected?}
B -->|Yes| C[Segment by language]
C --> D{JP queries affected more than EN?}
D -->|Yes| E[Language-specific compression issue]
D -->|No| F[General compression issue — check all components]
E --> G[Check compression log: what tokens were removed?]
G --> H{Japanese particles removed?}
H -->|Yes| I["Root cause: perplexity model not calibrated for CJK"]
H -->|No| J[Check entity preservation]
J --> K{Product names corrupted?}
K -->|Yes| L["Root cause: entity protection not applied to JP titles"]
K -->|No| M[Investigate specific failing queries]
style I fill:#ffcdd2
style L fill:#ffcdd2
Detection signals:
- MangaAssist/Quality/AccuracyScore drops below 0.85 (baseline: 0.92) with dimension Language=ja
- MangaAssist/Compression/EntityPreservationRate drops below 1.0
- Customer feedback: "The chatbot didn't understand my question" tickets spike for JP-language users
- Bedrock invocation logs show garbled Japanese in prompts
Root Cause
The LLMLingua compression algorithm uses a small language model (e.g., GPT-2 or a distilled LLaMA) to compute per-token perplexity. This model was trained predominantly on English text. Japanese tokens that are grammatically essential (particles, connectors) appear "predictable" to the model because they appear frequently — but their removal in Japanese is not semantically safe the way removing English articles is.
| Token Type | English Equivalent | Perplexity | Safe to Remove? |
|---|---|---|---|
| Japanese particle は (wa) | "is" / topic marker | Low | No — changes sentence meaning |
| Japanese particle の (no) | "of" / possessive | Low | No — changes noun relationships |
| Japanese particle を (wo) | object marker | Low | No — removes object reference |
| Japanese honorific 様 (sama) | "Mr./Ms." (formal) | Low | No — culturally inappropriate |
| English article "the" | — | Low | Yes — usually safe to remove |
| English copula "is" | — | Low | Yes — often inferable from context |
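As the table shows, safe compression for Japanese reduces to presentation-level cleanup. A minimal CJK-safe rule-based pass might normalize whitespace and decorative punctuation while never deleting content characters, particles, or honorifics. A sketch, with an illustrative function name:

```python
import re

def rule_based_compress_jp(text: str) -> str:
    """Conservative compression safe for Japanese: normalize whitespace
    and collapse decorative repetition, but never remove particles,
    honorifics, or any other content characters."""
    compressed = re.sub(r'[ \t]+', ' ', text)           # collapse runs of spaces
    compressed = re.sub(r'\n{3,}', '\n\n', compressed)  # collapse blank lines
    compressed = re.sub(r'([!?！？])\1+', r'\1', compressed)  # "？？？" -> "？"
    return compressed.strip()
```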
Resolution
Immediate (during the incident):
- Disable LLMLingua compression for Japanese text entirely. Roll back to rule-based compression only for prompts containing Japanese characters.
import re
def should_apply_llmlingua(text: str) -> bool:
    """Check if LLMLingua compression is safe for this text.

    Returns False if the text contains CJK characters, meaning we
    should use only rule-based compression.
    """
    # CJK Unified Ideographs + Hiragana + Katakana ranges
    cjk_pattern = re.compile(
        r'[\u4e00-\u9fff\u3040-\u309f\u30a0-\u30ff]'
    )
    cjk_char_count = len(cjk_pattern.findall(text))
    total_chars = len(text)
    if total_chars == 0:
        return True
    # If more than 10% of characters are CJK, skip LLMLingua
    cjk_ratio = cjk_char_count / total_chars
    return cjk_ratio < 0.10
- Add entity protection: Before any compression, extract and protect entities (product names, customer names, prices, order IDs) from modification.
def protect_entities(text: str, entities: list[str]) -> tuple[str, dict]:
    """Replace entities with placeholder tokens before compression.

    Returns the modified text and a mapping to restore entities after.
    Longer entities are replaced first so that an entity that is a
    substring of another is not partially replaced.
    """
    placeholder_map = {}
    protected_text = text
    for i, entity in enumerate(sorted(entities, key=len, reverse=True)):
        placeholder = f"<<ENTITY_{i}>>"
        placeholder_map[placeholder] = entity
        protected_text = protected_text.replace(entity, placeholder)
    return protected_text, placeholder_map

def restore_entities(text: str, placeholder_map: dict) -> str:
    """Restore entities from placeholders after compression."""
    restored = text
    for placeholder, entity in placeholder_map.items():
        restored = restored.replace(placeholder, entity)
    return restored
Long-term fix:
1. Train a Japanese-aware perplexity model for compression scoring, or use a multilingual model (e.g., XLM-R) that correctly assigns high importance to Japanese grammatical particles.
2. Implement a language-specific compression policy: English components use LLMLingua aggressively; Japanese components use only whitespace cleanup and structural compression (template variables).
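One way to route text through such a language-specific policy is a simple dispatcher. Segmentation and the two compressors are assumed to exist elsewhere; this sketch only shows the routing:

```python
import re

CJK_RE = re.compile(r'[\u4e00-\u9fff\u3040-\u309f\u30a0-\u30ff]')

def compress_by_language(segments, compress_en, compress_jp):
    """Apply aggressive compression to English segments and conservative
    compression to Japanese ones. `compress_en` and `compress_jp` are
    injected callables (e.g., LLMLingua vs. rule-based cleanup)."""
    out = []
    for seg in segments:
        if CJK_RE.search(seg):
            out.append(compress_jp(seg))  # conservative: rule-based only
        else:
            out.append(compress_en(seg))  # aggressive: LLMLingua + rules
    return out
```

Injecting the compressors as parameters also makes the policy trivial to A/B test per language.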
Prevention
flowchart TD
A[Prevention Strategy] --> B[Language-Aware Compression Policy]
A --> C[Entity Protection Layer]
A --> D[Quality Gate Before Deployment]
A --> E[Monitoring]
B --> B1["English text: LLMLingua + rule-based (aggressive)"]
B --> B2["Japanese text: rule-based only (conservative)"]
B --> B3["Mixed text: segment by language, compress separately"]
C --> C1[Extract entities before compression]
C --> C2[Replace with placeholders]
C --> C3[Compress placeholder-protected text]
C --> C4[Restore entities after compression]
D --> D1["Golden dataset: 200 JP queries, 200 EN queries"]
D --> D2["Pre-deploy eval: compression must preserve >95% accuracy on both"]
D --> D3["Canary deployment: 5% traffic for 24 hours before full rollout"]
E --> E1["CloudWatch: AccuracyScore segmented by Language"]
E --> E2["Automated alert: JP accuracy drops >5% below EN accuracy"]
E --> E3["Weekly compression audit: sample 100 JP prompts, verify entity preservation"]
style B2 fill:#c8e6c9
style C1 fill:#bbdefb
style D1 fill:#fff9c4
Scenario 3: Context Pruning Removing Relevant Manga Details from RAG Results
Problem
A user asks: "Is the limited edition box set of Attack on Titan available, and does it include the color pages?" The OpenSearch vector search returns 6 chunks:
| Chunk | Content | Vector Score | Entity Score | Composite Score |
|---|---|---|---|---|
| 1 | Attack on Titan standard edition volumes 1-34 | 0.91 | 0.8 | 0.82 |
| 2 | Attack on Titan limited edition box set — availability and contents | 0.88 | 1.0 | 0.85 |
| 3 | Attack on Titan color edition — page details and pricing | 0.84 | 0.7 | 0.74 |
| 4 | Attack on Titan merchandise and figurines | 0.75 | 0.3 | 0.48 |
| 5 | General box set comparison (One Piece, Naruto, AoT) | 0.72 | 0.5 | 0.55 |
| 6 | Shipping info for oversized items (box sets) | 0.65 | 0.2 | 0.39 |
The ContextPruner with a token budget of 500 tokens and relevance threshold of 0.45 keeps chunks 1, 2, and 3 (highest composite scores). However, chunk 3 ("color edition — page details") is critical for answering the "color pages" part of the question. If the token budget only allows 2 chunks, chunk 3 gets pruned — and the model produces an answer that omits color page information, or worse, hallucinates.
The user gets: "Yes, the limited edition box set is available at 15,800 JPY" but no information about color pages. The user follows up, frustrated.
Detection
flowchart TD
A[User Feedback: Answer incomplete] --> B[Check query vs response coverage]
B --> C{All query entities addressed in response?}
C -->|No| D[Identify missing entity]
D --> E[Check RAG chunks returned by OpenSearch]
E --> F{Was relevant chunk retrieved?}
F -->|No| G["Root cause: OpenSearch retrieval failure (not a pruning issue)"]
F -->|Yes| H[Was relevant chunk kept after pruning?]
H -->|No| I["Root cause: ContextPruner removed relevant chunk"]
H -->|Yes| J["Root cause: Model ignored available context"]
I --> K[Check composite score of pruned chunk]
K --> L{Score above threshold?}
L -->|Yes| M["Token budget too small — chunk cut by budget cap"]
L -->|No| N["Relevance scoring underweighted this chunk"]
style I fill:#ffcdd2
style M fill:#fff3e0
style N fill:#ffcdd2
Detection signals:
- MangaAssist/Quality/CompletenessScore drops below 0.80 for product_search intent
- MangaAssist/Pruning/ChunksDiscardedAboveThreshold > 0 (chunks above relevance threshold cut by budget)
- User follow-up rate after product queries increases (users asking the same question differently)
- Manual audit of pruned chunks reveals relevant content being discarded
Root Cause
Two compounding issues:
1. Token budget too tight for multi-faceted queries: The user asked about (a) availability, (b) the limited edition box set, and (c) color pages — three distinct information needs. The 500-token RAG budget was calibrated for single-faceted queries. Multi-faceted queries need proportionally more RAG context.
2. Entity overlap scoring missed "color pages": The entity extractor identified "Attack on Titan," "limited edition," and "box set" as query entities. It did not extract "color pages" as a separate entity because it is not a product name. Chunk 3's entity score was 0.7 instead of 1.0 because the entity overlap calculation missed this aspect.
Resolution
Immediate:
import re

def detect_multi_faceted_query(query: str, intent: str) -> int:
    """Detect how many distinct information needs a query contains.

    Multi-faceted queries get a proportionally larger RAG budget.
    Returns the number of detected facets (minimum 1).
    """
    # Simple heuristic: count question-indicating patterns
    facet_indicators = [
        # Conjunctions joining distinct questions
        r'\band\b.*\?',
        r'\balso\b',
        r'\badditionally\b',
        # Multiple question words
        r'(does|is|can|will|how|what|when|where)',
        # Comma-separated aspects
        r',\s*(does|is|and)',
    ]
    facet_count = 1  # always at least 1
    query_lower = query.lower()
    for pattern in facet_indicators:
        matches = re.findall(pattern, query_lower)
        facet_count += len(matches)
    # Cap at 4 to prevent budget explosion
    return min(facet_count, 4)

def get_dynamic_rag_budget(intent: str, query: str,
                           base_budget: int = 500) -> int:
    """Scale RAG token budget based on query complexity."""
    facets = detect_multi_faceted_query(query, intent)
    # Each additional facet adds 200 tokens to the RAG budget
    return base_budget + (facets - 1) * 200
Short-term fix:
1. Expand entity extraction to include descriptive phrases ("color pages", "hardcover", "first printing"), not just proper nouns. These are product attributes that are equally important for retrieval relevance.
2. Add a "query coverage check" after pruning: verify that every noun phrase in the user's query has at least one matching chunk in the kept set. If not, pull in the highest-scoring discarded chunk that covers the gap.
import re

def verify_query_coverage(query: str, kept_chunks: list,
                          discarded_chunks: list,
                          estimator) -> list:
    """Verify that kept chunks cover all query aspects.

    If a query aspect is uncovered, pull in the best discarded chunk
    that addresses it.
    """
    # Extract noun phrases (simplified)
    query_phrases = set(re.findall(r'[A-Za-z]+(?:\s+[A-Za-z]+)*', query.lower()))
    # Check which phrases are covered by kept chunks
    kept_text = " ".join(c.text.lower() for c in kept_chunks)
    uncovered = [p for p in query_phrases
                 if p not in kept_text and len(p) > 3]
    if not uncovered:
        return kept_chunks
    # Find discarded chunks that cover the gap
    for phrase in uncovered:
        best_match = None
        best_score = 0
        for chunk in discarded_chunks:
            if phrase in chunk.text.lower():
                if chunk.vector_score > best_score:
                    best_match = chunk
                    best_score = chunk.vector_score
        if best_match:
            kept_chunks.append(best_match)
            discarded_chunks.remove(best_match)
    return kept_chunks
Prevention
flowchart TD
A[Prevention] --> B[Dynamic RAG Budget]
A --> C[Expanded Entity Extraction]
A --> D[Query Coverage Verification]
A --> E[Monitoring]
B --> B1[Detect query facets: 1-4 information needs]
B --> B2["Budget = base + (facets - 1) x 200 tokens"]
B --> B3[Cap at 1,100 tokens for 4-facet queries]
C --> C1[Extract proper nouns: titles, authors, ISBNs]
C --> C2[Extract descriptive attributes: color pages, hardcover, limited]
C --> C3[Extract product categories: box set, collection, series]
D --> D1[After pruning: check every query noun phrase]
D --> D2[If uncovered: pull best discarded chunk]
D --> D3[Log coverage gaps for model improvement]
E --> E1["Alert: ChunksDiscardedAboveThreshold > 2"]
E --> E2[Weekly audit: sample 50 multi-faceted queries]
E --> E3["Track: user follow-up rate after product_search"]
style B2 fill:#c8e6c9
style D2 fill:#bbdefb
Scenario 4: Response Size Controls Truncating Critical Shipping/Pricing Information
Problem
A user asks: "I want to order the Naruto complete box set. What's the total price with shipping to Osaka, and when will it arrive?" The model generates a response that includes:
- Product details and price (130 tokens)
- Shipping options and costs (100 tokens)
- Estimated delivery dates for each option (80 tokens)
- Tax calculation (60 tokens)
- Total cost summary (40 tokens)
- Return policy note (50 tokens)
Total natural response: 460 tokens. But the order_status intent (misclassified — this is actually a product_search with shipping inquiry) has max_tokens=200. The ResponseSizeController truncates the response after the shipping options section. The user sees the product price and shipping options but never sees the delivery dates, tax calculation, or total cost — the most critical information for a purchase decision.
The truncation message says: "For full order details, visit your order history page." This is misleading — the user has not placed an order yet. They are trying to decide whether to order.
Detection
flowchart TD
A[User Complaint: Incomplete pricing info] --> B[Check response truncation logs]
B --> C{Was response truncated?}
C -->|Yes| D[Check intent classification]
D --> E{Correct intent assigned?}
E -->|No| F["Root cause: intent misclassification → wrong max_tokens"]
E -->|Yes| G[Check if natural response exceeds budget]
G --> H{Natural response > max_tokens by > 50%?}
H -->|Yes| I["Root cause: budget too low for this query type"]
H -->|No| J["Root cause: model being verbose — prompt needs tightening"]
F --> K[Fix: retrain intent classifier]
I --> L[Fix: increase budget or add sub-intent]
J --> M[Fix: add conciseness instruction to prompt]
C -->|No| N[Different issue — check model output quality]
style F fill:#ffcdd2
style I fill:#fff3e0
Detection signals:
- MangaAssist/Response/TruncationRate spikes above 15% for any intent
- MangaAssist/Response/TruncatedCriticalInfo — custom metric that checks if truncated responses are missing prices, dates, or order IDs
- Customer feedback: "The chatbot didn't tell me the total price" or "When will it arrive?"
- Conversion rate drops for users who received truncated responses (A/B comparison)
Root Cause
Three issues combined:
1. Intent misclassification: The query "I want to order... What's the total price with shipping?" was classified as order_status because it contains the word "order." But this is a pre-purchase inquiry — it should be product_search (max_tokens=400) or a new sub-intent like purchase_inquiry (max_tokens=500).
2. Truncation message mismatch: The order_status truncation message references "order history page," which is irrelevant for a pre-purchase query. Each intent's truncation message assumes the classification is correct.
3. Critical information at the end: The model naturally structures responses from general (product details) to specific (total cost with tax). The most actionable information — the total cost — comes last and is the first thing truncated.
Resolution
Immediate (during the incident):
# Add a "critical information" detection pass before truncation
CRITICAL_PATTERNS = {
"pricing": [
r'total[:\s]*[\d,]+\s*(JPY|USD|円|ドル)',
r'合計[:\s]*[\d,]+',
r'shipping[:\s]*[\d,]+',
r'送料[:\s]*[\d,]+',
r'tax[:\s]*[\d,]+',
],
"delivery": [
r'(arrive|delivery|届く|到着)[:\s]*(by|on|within)',
r'\d{1,2}[/-]\d{1,2}',
r'(business days|営業日)',
],
"order_id": [
r'ORD-\d{4}-\d{5}',
r'注文番号[:\s]*\w+',
],
}
def contains_critical_info(text: str) -> bool:
"""Check if text contains critical information that must not be truncated."""
import re
for category, patterns in CRITICAL_PATTERNS.items():
for pattern in patterns:
if re.search(pattern, text, re.IGNORECASE):
return True
return False
async def stream_with_critical_protection(
bedrock_stream,
intent: str,
config,
encoding,
):
"""Stream with truncation protection for critical information.
If the response is approaching the token budget and we detect
critical information (prices, dates, order IDs), we extend the
budget by 50% to ensure this information is delivered.
"""
token_count = 0
buffer_chunks = []
in_critical_zone = False
async for event in bedrock_stream:
chunk_text = extract_text(event)
if not chunk_text:
continue
chunk_tokens = len(encoding.encode(chunk_text))
token_count += chunk_tokens
# Check if we're in the critical zone (80-100% of budget)
if token_count >= config.max_output_tokens * 0.8:
if contains_critical_info(chunk_text):
in_critical_zone = True
# If in critical zone, extend budget by 50%
effective_budget = (
int(config.max_output_tokens * 1.5)
if in_critical_zone
else config.max_output_tokens
)
if token_count > effective_budget:
yield chunk_text
yield config.truncation_message
return
yield chunk_text
Short-term fix:
1. Add a purchase_inquiry sub-intent with max_tokens=500 and an appropriate truncation message: "Want me to add this to your cart?"
2. Retrain the intent classifier to distinguish order_status (existing order) from purchase_inquiry (pre-purchase question).
3. Add a "critical info first" instruction to the system prompt for purchase-related intents:
When answering purchase or pricing questions, lead with the total cost
and delivery date. Put details and disclaimers after the key numbers.
4. Fix the truncation message to be context-aware:
TRUNCATION_MESSAGES = {
    "purchase_inquiry": (
        "\n\n---\n*Want me to add this to your cart, "
        "or do you have more questions?*"
    ),
    "product_search": (
        "\n\n---\n*I can provide more details. "
        "Which aspect interests you most?*"
    ),
    "order_status": (
        "\n\n---\n*For full tracking details, "
        "check your order history page.*"
    ),
}
Prevention
flowchart TD
A[Prevention] --> B[Intent Classifier Improvement]
A --> C[Critical Info Protection]
A --> D[Prompt Structure Guidance]
A --> E[Truncation Message Alignment]
B --> B1["Add purchase_inquiry sub-intent"]
B --> B2["Training data: 500 pre-purchase queries"]
B --> B3["Confusion matrix monitoring: order_status vs purchase_inquiry"]
C --> C1["Detect critical patterns: prices, dates, IDs"]
C --> C2["Auto-extend budget 50% when critical info is in-flight"]
C --> C3["Never truncate mid-price or mid-date"]
D --> D1["System prompt: lead with key numbers"]
D --> D2["Structured output format: summary first, details second"]
D --> D3["Test with truncation simulation: verify key info in first 60% of response"]
E --> E1["Each intent gets a context-appropriate truncation message"]
E --> E2["Truncation message never references actions the user cannot take"]
E --> E3["A/B test truncation messages for user satisfaction"]
style C2 fill:#c8e6c9
style D1 fill:#bbdefb
Scenario 5: Token Tracking Drift Between Estimation and Actual Bedrock Consumption
Problem
Over the past two weeks, the EstimationDriftPercent CloudWatch metric has been steadily increasing. Initial deployment showed a consistent 3-5% drift (estimations slightly above actuals, which is the safe direction). Now the drift has reached -18% — meaning actual token consumption is 18% higher than estimates. This means:
- Budget gates are passing requests that actually exceed their token budgets
- Cost forecasting underestimates actual spend by 18%
- Session budget caps are being hit later than expected, allowing more expensive invocations than intended
- Monthly cost is $20,400 over forecast
The drift is correlated with a Bedrock model update that happened 12 days ago: Claude 3 Sonnet received a minor version update that changed its tokenization slightly.
Detection
flowchart TD
A["CloudWatch: EstimationDriftPercent > 10%"] --> B[Check drift direction]
B --> C{Positive drift — over-estimating?}
C -->|Yes| D["Safe direction but wasteful — budgets too conservative"]
B --> E{Negative drift — under-estimating?}
E -->|Yes| F["Dangerous — actual cost exceeds estimates"]
F --> G[Check when drift started increasing]
G --> H{Correlated with model update?}
H -->|Yes| I["Root cause: tokenizer mismatch after model version change"]
H -->|No| J[Check prompt template changes]
J --> K{New prompt templates deployed?}
K -->|Yes| L["Root cause: new templates have different token characteristics"]
K -->|No| M[Check input data distribution shift]
M --> N["Root cause: user behavior changed (longer queries, more JP text)"]
style I fill:#ffcdd2
style F fill:#fff3e0
Detection signals:
- MangaAssist/Tokens/EstimationDriftPercent — 7-day rolling average crosses -10%
- MangaAssist/Tokens/ActualCostVsForecast — daily actual exceeds forecast by > 15%
- MangaAssist/Tokens/BudgetGateBypassRate — requests that passed budget check but actually exceeded budget
- Bedrock model version change notification in AWS Health Dashboard
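For clarity, the drift convention used throughout this scenario (negative drift means actuals exceed estimates, the dangerous direction) can be computed as follows; the function name is illustrative:

```python
def estimation_drift_percent(estimated_tokens: int, actual_tokens: int) -> float:
    """Negative drift: actual consumption exceeded the estimate (dangerous).
    Positive drift: we over-estimated (safe but wasteful)."""
    if estimated_tokens == 0:
        return 0.0
    return round((estimated_tokens - actual_tokens) / estimated_tokens * 100, 2)
```

With this convention, actuals running 18% above estimates yields a drift of -18%, matching the incident description.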
Root Cause
The TokenEstimator uses tiktoken's cl100k_base encoding, which is a close approximation to Claude 3's tokenizer but not identical. When Anthropic updates the model version, subtle tokenization changes can occur:
- New vocabulary tokens: The model update added tokens for recently popular manga titles and Japanese slang. These map to fewer tokens in the new tokenizer but more tokens in the old cl100k_base used for estimation.
- Changed byte-pair encoding merges: The model's BPE merge table was updated, changing how multi-byte Japanese characters are split into tokens. For example, a common kanji sequence might now be 1 token instead of 2 — but our estimator still counts it as 2.
- System prompt overhead: Bedrock's internal system prompt handling changed slightly, adding ~50 tokens of overhead that our estimator does not account for.
The 5% calibration buffer (multiply by 1.05) was sufficient for the original model version but insufficient for the updated version.
Resolution
Immediate (same day):
- Increase calibration buffer from 1.05 to 1.20 while investigating the root cause. This over-estimates by 20% — wasteful but safe.
# Emergency config change — no code deployment needed
# Update in DynamoDB config table or SSM Parameter Store
import boto3

ssm = boto3.client('ssm')
ssm.put_parameter(
    Name='/mangaassist/token-estimator/calibration-buffer',
    Value='1.20',
    Type='String',
    Overwrite=True,
)
- Enable Bedrock invocation logging (if not already) to capture exact token counts for every call. Use these as ground truth for recalibration.
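Exact counts are also available directly in each Converse API response's `usage` block, which can feed recalibration without parsing invocation logs. A sketch; the helper name is illustrative:

```python
def actual_tokens_from_converse(response: dict) -> tuple:
    """Extract ground-truth token counts from a Bedrock Converse response.

    The Converse API returns a `usage` object with inputTokens and
    outputTokens for every invocation.
    """
    usage = response.get("usage", {})
    return usage.get("inputTokens", 0), usage.get("outputTokens", 0)
```

Each (estimated, actual) pair from this helper becomes one observation for the recalibration loop.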
Short-term fix (within 1 week):
Implement an auto-calibration system that continuously adjusts the calibration buffer based on observed drift:
```python
import numpy as np
from collections import deque


class AdaptiveCalibrationBuffer:
    """Auto-calibrates the token estimation buffer using recent observations.

    Maintains a rolling window of (estimated, actual) token pairs and
    computes the optimal calibration buffer to minimize under-estimation
    while avoiding excessive over-estimation.

    MangaAssist design: the buffer is updated every 1,000 invocations
    (approximately every 1.4 minutes at 1M messages/day). This means
    the system self-corrects within minutes of a model update.
    """

    def __init__(self, window_size: int = 5000,
                 min_buffer: float = 1.02,
                 max_buffer: float = 1.30,
                 target_underestimate_rate: float = 0.05):
        """
        Args:
            window_size: Number of recent observations to consider
            min_buffer: Minimum calibration buffer (2% over)
            max_buffer: Maximum calibration buffer (30% over)
            target_underestimate_rate: Target rate of underestimates
                (we want <=5% of estimates to be below actual)
        """
        self.observations = deque(maxlen=window_size)
        self.min_buffer = min_buffer
        self.max_buffer = max_buffer
        self.target_rate = target_underestimate_rate
        self.current_buffer = 1.10  # start conservative

    def record(self, estimated_tokens: int, actual_tokens: int):
        """Record an (estimated, actual) observation pair."""
        if estimated_tokens > 0 and actual_tokens > 0:
            ratio = actual_tokens / estimated_tokens
            self.observations.append(ratio)

    def recalibrate(self) -> float:
        """Recompute the optimal calibration buffer.

        Uses the (1 - target_rate) percentile of actual/estimated ratios
        to set the buffer such that only target_rate of estimates fall
        below actual values.
        """
        if len(self.observations) < 100:
            return self.current_buffer  # not enough data
        ratios = np.array(self.observations)
        # We want the 95th percentile of (actual/estimated) ratios:
        # 95% of the time, estimated * buffer >= actual.
        target_percentile = 1.0 - self.target_rate
        p95_ratio = np.percentile(ratios, target_percentile * 100)
        # The buffer should be at least p95_ratio, clamped to sane bounds
        new_buffer = float(np.clip(p95_ratio, self.min_buffer, self.max_buffer))
        # Smooth the update to avoid oscillation
        self.current_buffer = 0.7 * self.current_buffer + 0.3 * new_buffer
        return self.current_buffer

    def get_buffer(self) -> float:
        """Get the current calibration buffer."""
        return self.current_buffer

    def get_stats(self) -> dict:
        """Get drift statistics for monitoring."""
        if len(self.observations) < 10:
            return {"status": "insufficient_data"}
        ratios = np.array(self.observations)
        return {
            "current_buffer": round(self.current_buffer, 4),
            "mean_ratio": round(float(np.mean(ratios)), 4),
            "p50_ratio": round(float(np.median(ratios)), 4),
            "p95_ratio": round(float(np.percentile(ratios, 95)), 4),
            "p99_ratio": round(float(np.percentile(ratios, 99)), 4),
            "underestimate_rate": round(
                float(np.mean(ratios > self.current_buffer)), 4
            ),
            "observation_count": len(self.observations),
        }
```
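Before wiring the recalibration logic into production, it is worth sanity-checking the percentile math in isolation. The standalone simulation below uses synthetic ratios (not production data) to model a hypothetical model update that inflated actual token counts by roughly 8% on average, and verifies that a 95th-percentile buffer caps the underestimate rate near the 5% target:

```python
import numpy as np

# Simulate 5,000 (actual / estimated) ratios after a hypothetical model
# update: actuals now run ~8% above estimates, with some per-request noise.
rng = np.random.default_rng(42)
ratios = rng.normal(loc=1.08, scale=0.05, size=5000)

# Percentile-based buffer: estimated * buffer should cover actual 95% of
# the time, clamped to the same [1.02, 1.30] bounds as the class above.
buffer = float(np.clip(np.percentile(ratios, 95), 1.02, 1.30))

# Fraction of requests that would still be underestimated with this buffer
# (by construction of the 95th percentile, this is ~5%).
underestimate_rate = float(np.mean(ratios > buffer))
print(round(buffer, 3), round(underestimate_rate, 3))
```

Note the trade-off this makes explicit: a mean-based buffer (~1.08 here) would underestimate roughly half of all requests, while the percentile-based buffer trades a few extra points of over-provisioning for a bounded underestimate rate.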
Long-term fix:
- Subscribe to Bedrock model update notifications via AWS Health Dashboard API. When a model update is detected, automatically increase the calibration buffer to 1.25 and trigger a 24-hour recalibration period.
- Evaluate switching from tiktoken cl100k_base to Anthropic's official token counting API (if/when available) for exact counts.
- Implement a shadow-mode estimator that runs both the old and new calibration in parallel and alerts when they diverge by more than 5%.
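The shadow-mode estimator in the last bullet can be sketched as a small comparator that records per-request estimates from both calibrations and trips only on sustained divergence, so a handful of outlier requests does not page anyone. `ShadowComparator`, its window size, and the minimum-sample guard are illustrative choices, not an existing MangaAssist component; the 5% threshold mirrors the long-term fix above.

```python
from collections import deque


class ShadowComparator:
    """Runs the live and candidate calibrations side by side and flags
    sustained divergence in their token estimates.
    """

    def __init__(self, window_size: int = 1000, threshold_pct: float = 5.0):
        self.rel_diffs = deque(maxlen=window_size)
        self.threshold_pct = threshold_pct

    def record(self, live_estimate: int, shadow_estimate: int):
        """Record one request's estimates from both calibrations."""
        if live_estimate > 0:
            self.rel_diffs.append(
                abs(shadow_estimate - live_estimate) / live_estimate
            )

    def diverged(self, min_samples: int = 100) -> bool:
        """True when mean divergence over the window exceeds the threshold."""
        if len(self.rel_diffs) < min_samples:
            return False  # avoid alerting on a few noisy requests
        mean_pct = 100.0 * sum(self.rel_diffs) / len(self.rel_diffs)
        return mean_pct > self.threshold_pct


# Demo: sustained 8% divergence over 200 requests should trip the alert.
cmp = ShadowComparator()
for _ in range(200):
    cmp.record(live_estimate=1000, shadow_estimate=1080)
```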
Prevention
```mermaid
flowchart TD
    A[Prevention] --> B[Auto-Calibration System]
    A --> C[Model Update Response]
    A --> D[Drift Monitoring]
    A --> E[Budgeting Safety]
    B --> B1["Rolling window of 5,000 (estimated, actual) pairs"]
    B --> B2["Recalibrate every 1,000 invocations"]
    B --> B3["Target: < 5% underestimation rate"]
    B --> B4["Buffer range: 1.02 to 1.30"]
    C --> C1["Subscribe to Bedrock model update notifications"]
    C --> C2["On update: auto-increase buffer to 1.25"]
    C --> C3["24-hour intensive recalibration window"]
    C --> C4["Alert on-call engineer for manual review"]
    D --> D1["CloudWatch alarm: EstimationDriftPercent > +/-10%"]
    D --> D2["Daily drift report to Slack"]
    D --> D3["Weekly drift trend analysis"]
    D --> D4["Monthly calibration buffer audit"]
    E --> E1["Budget gates use calibrated estimate, not raw count"]
    E --> E2["Session budgets include 15% safety margin"]
    E --> E3["Monthly cost forecast uses P95 estimate, not mean"]
    style B2 fill:#c8e6c9
    style C2 fill:#bbdefb
    style D1 fill:#fff9c4
```
Cross-Scenario Decision Tree — Token Efficiency Incident Response
When a token efficiency alert fires, use this decision tree to identify which scenario applies:
```mermaid
flowchart TD
    A["Token Efficiency Alert Fired"] --> B{What type of alert?}
    B -->|Cost spike| C{Per-session or system-wide?}
    C -->|Per-session| D["Scenario 1: Budget exceeded in long conversation"]
    C -->|System-wide| E{Correlated with model update?}
    E -->|Yes| F["Scenario 5: Estimation drift"]
    E -->|No| G{Correlated with new deployment?}
    G -->|Yes| H["Check compression or pruning changes"]
    G -->|No| I["Traffic pattern change — investigate intent distribution"]
    B -->|Quality drop| J{Language-specific?}
    J -->|JP queries affected| K["Scenario 2: Compression degrading Japanese content"]
    J -->|All languages| L{Completeness or accuracy?}
    L -->|Completeness| M["Scenario 3: Context pruning too aggressive"]
    L -->|Accuracy| N["Check if RAG retrieval quality changed"]
    B -->|Truncation spike| O{Which intent?}
    O -->|order_status| P{Misclassified intent?}
    P -->|Yes| Q["Scenario 4: Truncation of critical info (misclassification)"]
    P -->|No| R["Budget too low for this query type"]
    O -->|recommendation| S["Scenario 1: Long conversation + truncation"]
    O -->|other| T["Review intent-specific max_tokens config"]
    B -->|Drift alarm| F
    style D fill:#fff3e0
    style K fill:#ffcdd2
    style M fill:#ffcdd2
    style Q fill:#ffcdd2
    style F fill:#fff9c4
```
Runbook Summary Table
| Scenario | First Response (< 5 min) | Short-term Fix (< 1 week) | Long-term Prevention |
|---|---|---|---|
| 1. Budget exceeded in long conversation | Increase recommendation budget for sessions > 8 turns | Implement tiered budgets + preference profile extraction at turn 8 | Dynamic budget scaling, proactive session summarization, session length guardrails |
| 2. JP content quality degradation | Disable LLMLingua for Japanese text immediately | Add entity protection layer, language-aware compression policy | Japanese-calibrated perplexity model, bilingual golden dataset, canary deployments |
| 3. RAG pruning removes relevant chunks | Lower relevance threshold by 0.1 for affected intent | Add query coverage verification after pruning, expand entity extraction | Dynamic RAG budget based on query facets, weekly pruning audits, user follow-up rate tracking |
| 4. Critical info truncated | Add critical info detection, extend budget 50% for prices/dates | Add purchase_inquiry sub-intent, retrain classifier, fix truncation messages | Prompt-level "key info first" instruction, truncation simulation testing, A/B test truncation UX |
| 5. Estimation drift | Increase calibration buffer to 1.20 | Deploy auto-calibration system with rolling window recalibration | Model update auto-response, shadow estimator, monthly calibration audit |
Key Takeaways
- Token efficiency failures are never just about tokens: they cascade into quality, UX, and revenue impacts. A truncated price is a lost sale.
- Japanese content is a special case everywhere: compression, pruning, and estimation all need language-aware handling for CJK text.
- Multi-turn conversations are the primary budget risk: a single long browsing session can cost 10x a typical session. Proactive summarization at turn 8 is the key mitigation.
- Intent misclassification amplifies token issues: wrong intent means wrong budget, wrong max_tokens, and wrong truncation message. The intent classifier is the foundation of the entire token efficiency system.
- Auto-calibration beats static buffers: model updates will happen. A self-adjusting calibration buffer that recalibrates from production data protects against silent drift.