
Token Efficiency Scenarios and Runbooks

MangaAssist context: JP Manga store chatbot on AWS — Bedrock Claude 3 (Sonnet at $3/$15 per 1M tokens input/output, Haiku at $0.25/$1.25), OpenSearch Serverless (vector store), DynamoDB (sessions/products), ECS Fargate (orchestrator), API Gateway WebSocket, ElastiCache Redis. Target: useful answer in under 3 seconds, 1M messages/day scale.


Skill Mapping

| Certification | Task | Skill | This File |
|---|---|---|---|
| AWS AIP-C01 | Task 4.1 — Optimize cost and performance of FM applications | Skill 4.1.1 — Design token efficiency systems for FM-powered applications | 5 production scenarios with problem detection, root cause analysis, resolution, and prevention runbooks |

Skill scope: Operational runbooks for token efficiency failures in production MangaAssist. Each scenario follows a structured Problem, Detection, Root Cause, Resolution, Prevention format with Mermaid decision trees for incident response.


Scenario Overview

| # | Scenario | Severity | Blast Radius | Token Layer Affected |
|---|---|---|---|---|
| 1 | Token budget exceeded during manga recommendation conversation | High | Per-session cost spike | Context window management |
| 2 | Prompt compression causing quality degradation for Japanese content | Critical | Answer quality for JP queries | Prompt compression |
| 3 | Context pruning removing relevant manga details from RAG results | High | Incorrect product recommendations | RAG chunk pruning |
| 4 | Response size controls truncating critical shipping/pricing info | Critical | User trust and order accuracy | Response size control |
| 5 | Token tracking drift between estimation and actual Bedrock consumption | Medium | Budget forecasting accuracy | Estimation calibration |

Scenario 1: Token Budget Exceeded During Manga Recommendation Conversation

Problem

A MangaAssist user has been browsing manga for 18 turns. They started asking about shonen titles, then shifted to seinen, then asked for a specific recommendation combining elements from multiple series. By turn 18, the conversation history alone consumes 3,200 input tokens. Combined with the system prompt (180 tokens), RAG context from OpenSearch (800 tokens for 5 recommendation chunks), and the user's latest message (250 tokens), the total prompt reaches 4,430 tokens — well above the recommendation intent budget of 2,000 input tokens.

The TokenBudgetManager triggers compression, but even after aggressive history summarization and RAG pruning, the prompt is still at 2,800 tokens. The system downgrades the user from Sonnet to Haiku to stay within cost bounds, but Haiku produces a noticeably lower-quality recommendation that misses the nuanced cross-genre preferences the user expressed over 18 turns.

Detection

flowchart TD
    A[CloudWatch Alarm: InputTokens > 2x intent budget] --> B{Which intent?}
    B -->|recommendation| C[Check session turn count]
    C --> D{Turns > 10?}
    D -->|Yes| E[Multi-turn budget overrun]
    D -->|No| F[Single-turn prompt bloat — check RAG chunks]

    E --> G[Check compression ratio metric]
    G --> H{Compression achieved target?}
    H -->|Yes| I[Budget itself is too low for long conversations]
    H -->|No| J[Compression pipeline failure]

    J --> K[Check which compression stage failed]
    K --> L{History summarization effective?}
    L -->|No| M["Root cause: summarization not reducing enough"]
    L -->|Yes| N["Root cause: RAG context too large after pruning"]

    style E fill:#fff3e0
    style M fill:#ffcdd2
    style N fill:#ffcdd2
    style I fill:#fff9c4

CloudWatch metrics to monitor:

  • MangaAssist/Tokens/InputTokens with dimension Intent=recommendation — alarm at > 2,500
  • MangaAssist/Tokens/SessionTurnCount — alarm at > 12 turns per session
  • MangaAssist/Tokens/CompressionRatio — alarm when ratio > 0.8 (compression ineffective)
  • MangaAssist/Tokens/ModelDowngradeCount — alert on any Sonnet-to-Haiku downgrade for recommendation intent
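
As a concrete starting point, the first alarm can be created with a boto3 call along these lines (a sketch, assuming the orchestrator publishes these custom metrics to CloudWatch under the namespace listed above):

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when average input tokens for the recommendation intent
# exceed 2,500 for three consecutive 5-minute windows.
cloudwatch.put_metric_alarm(
    AlarmName="mangaassist-recommendation-input-tokens-high",
    Namespace="MangaAssist/Tokens",
    MetricName="InputTokens",
    Dimensions=[{"Name": "Intent", "Value": "recommendation"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=2500,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=[],  # add an SNS topic ARN to page on-call
)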

Root Cause

The conversation history grows linearly with turn count, but the summarization stage has a floor: it cannot compress 18 turns of nuanced preference discussion into fewer than ~400 tokens without losing critical context. The user expressed preferences like "I want something with the dark atmosphere of Berserk but the humor of One Punch Man and the world-building of One Piece" — these cross-series references resist summarization because each entity mention is load-bearing.

The core issue: the per-intent token budget of 2,000 was calibrated for conversations of 5-8 turns. Long recommendation sessions need a different budget tier.

Resolution

Immediate (during the incident):

# Emergency: increase recommendation budget for long sessions
def get_dynamic_budget(intent: str, turn_count: int) -> int:
    """Return a dynamic token budget based on conversation length.

    For recommendation intent, we allow budget to scale with turn count
    up to a hard ceiling. This prevents quality degradation in long
    browsing sessions while still capping maximum cost.
    """
    base_budgets = {
        "recommendation": 2000,
        "product_search": 1200,
        "order_status": 600,
        "manga_qa": 1500,
        "chitchat": 400,
    }
    base = base_budgets.get(intent, 1200)

    if intent == "recommendation" and turn_count > 8:
        # Scale: +200 tokens per 4 turns beyond 8, capped at 3,200
        extra_turns = turn_count - 8
        extra_budget = (extra_turns // 4) * 200
        return min(base + extra_budget, 3200)

    return base
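
With this scaling, the 18-turn session from the incident gets min(2000 + (10 // 4) × 200, 3200) = 2,400 input tokens for the recommendation intent, a meaningful bump that still sits well under the 3,200 ceiling.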

Short-term fix:

  1. Implement tiered budgets: recommendation_short (< 8 turns, 2,000 tokens) and recommendation_long (8+ turns, 2,800 tokens).
  2. At turn 8, trigger a proactive summarization pass that uses Haiku to produce a structured "preference profile" from the conversation so far. This replaces raw history with a compact profile (~200 tokens); a sketch of the Haiku call follows the prompt below.
  3. Keep the Sonnet model for recommendation — do not downgrade. The cost of a Haiku-quality recommendation in a long session is worse than the extra tokens: the user may abandon without purchasing.

Preference profile extraction (proactive at turn 8):

PREFERENCE_EXTRACTION_PROMPT = """
Analyze this conversation and extract the user's manga preferences
into a structured profile. Be concise.

Conversation:
{history}

Output format:
- Preferred genres: [list]
- Liked series: [list with what they liked about each]
- Disliked elements: [list]
- Reading level: [beginner/intermediate/advanced]
- Language preference: [JP only / EN only / bilingual]
- Price sensitivity: [low/medium/high]
- Key request: [one sentence summary of what they want]
"""

Prevention

flowchart TD
    A[Prevention Measures] --> B[Dynamic Budget Tiers]
    A --> C[Proactive Preference Extraction]
    A --> D[Session Length Guardrails]
    A --> E[Monitoring & Alerting]

    B --> B1["recommendation_short: 2,000 tokens (< 8 turns)"]
    B --> B2["recommendation_long: 2,800 tokens (8-15 turns)"]
    B --> B3["recommendation_extended: 3,200 tokens (15+ turns)"]

    C --> C1[At turn 8: extract preference profile with Haiku]
    C --> C2[Replace raw history with 200-token structured profile]
    C --> C3[Continue adding new turns to profile, not raw history]

    D --> D1[At turn 15: suggest the user check their cart]
    D --> D2[At turn 20: offer to email recommendations]
    D --> D3[Never hard-cut a session — always offer a graceful exit]

    E --> E1["CloudWatch alarm: SessionTurnCount > 12"]
    E --> E2["CloudWatch alarm: ModelDowngrade for recommendation"]
    E --> E3[Weekly review of long-session cost distribution]

    style B1 fill:#c8e6c9
    style C1 fill:#bbdefb
    style D1 fill:#fff9c4

Scenario 2: Prompt Compression Causing Quality Degradation for Japanese Content

Problem

After deploying LLMLingua-style token compression to production, the quality monitoring dashboard shows a sharp drop in answer accuracy for queries containing Japanese text. The compression algorithm, trained primarily on English text perplexity, is removing Japanese particles (は, が, を, に) and honorific markers (さん, 様) that it perceives as low-perplexity (predictable) tokens. In English, removing predictable words like "the" or "is" rarely changes meaning. In Japanese, removing particles fundamentally alters sentence structure and meaning.

Specific failures observed:

  • 「鬼滅の刃」 (Kimetsu no Yaiba / Demon Slayer) compressed to 「鬼滅刃」 — removing the particle の makes the title unrecognizable to the model
  • この漫画はおすすめですか? (Is this manga recommended?) compressed to この漫画おすすめですか — removing the topic particle は changes the sentence from a question about a specific manga to a general statement
  • Customer name 田中様 (Tanaka-sama) compressed to 田中 — honorific removal is culturally inappropriate in Japanese customer service

Detection

flowchart TD
    A[Quality Monitoring Dashboard] --> B{Accuracy drop detected?}
    B -->|Yes| C[Segment by language]
    C --> D{JP queries affected more than EN?}
    D -->|Yes| E[Language-specific compression issue]
    D -->|No| F[General compression issue — check all components]

    E --> G[Check compression log: what tokens were removed?]
    G --> H{Japanese particles removed?}
    H -->|Yes| I["Root cause: perplexity model not calibrated for CJK"]
    H -->|No| J[Check entity preservation]
    J --> K{Product names corrupted?}
    K -->|Yes| L["Root cause: entity protection not applied to JP titles"]
    K -->|No| M[Investigate specific failing queries]

    style I fill:#ffcdd2
    style L fill:#ffcdd2

Detection signals:

  • MangaAssist/Quality/AccuracyScore drops below 0.85 (baseline: 0.92) with dimension Language=ja
  • MangaAssist/Compression/EntityPreservationRate drops below 1.0
  • Customer feedback: "The chatbot didn't understand my question" tickets spike for JP-language users
  • Bedrock invocation logs show garbled Japanese in prompts

Root Cause

The LLMLingua compression algorithm uses a small language model (e.g., GPT-2 or a distilled LLaMA) to compute per-token perplexity. This model was trained predominantly on English text. Japanese tokens that are grammatically essential (particles, connectors) appear "predictable" to the model because they appear frequently — but their removal in Japanese is not semantically safe the way removing English articles is.

| Token Type | English Equivalent | Perplexity | Safe to Remove? |
|---|---|---|---|
| Japanese particle は (wa) | "is" / topic marker | Low | No — changes sentence meaning |
| Japanese particle の (no) | "of" / possessive | Low | No — changes noun relationships |
| Japanese particle を (wo) | object marker | Low | No — removes object reference |
| Japanese honorific 様 (sama) | "Mr./Ms." (formal) | Low | No — culturally inappropriate |
| English article "the" | — | Low | Yes — usually safe to remove |
| English copula "is" | — | Low | Yes — often inferable from context |

Resolution

Immediate (during the incident):

  1. Disable LLMLingua compression for Japanese text entirely. Roll back to rule-based compression only for prompts containing Japanese characters.
import re


def should_apply_llmlingua(text: str) -> bool:
    """Check if LLMLingua compression is safe for this text.

    Returns False if the text contains CJK characters, meaning we
    should use only rule-based compression.
    """
    # CJK Unified Ideographs + Hiragana + Katakana ranges
    cjk_pattern = re.compile(
        r'[\u4e00-\u9fff\u3040-\u309f\u30a0-\u30ff]'
    )
    cjk_char_count = len(cjk_pattern.findall(text))
    total_chars = len(text)

    if total_chars == 0:
        return True

    # If more than 10% of characters are CJK, skip LLMLingua
    cjk_ratio = cjk_char_count / total_chars
    return cjk_ratio < 0.10
  2. Add entity protection: Before any compression, extract and protect entities (product names, customer names, prices, order IDs) from modification.
def protect_entities(text: str, entities: list[str]) -> tuple[str, dict]:
    """Replace entities with placeholder tokens before compression.

    Returns the modified text and a mapping to restore entities after.
    """
    placeholder_map = {}
    protected_text = text

    for i, entity in enumerate(entities):
        placeholder = f"<<ENTITY_{i}>>"
        placeholder_map[placeholder] = entity
        protected_text = protected_text.replace(entity, placeholder)

    return protected_text, placeholder_map


def restore_entities(text: str, placeholder_map: dict) -> str:
    """Restore entities from placeholders after compression."""
    restored = text
    for placeholder, entity in placeholder_map.items():
        restored = restored.replace(placeholder, entity)
    return restored
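
Tying the two pieces together, one way to wire the language check and entity protection into a single compression entry point looks like this (a sketch: llmlingua_compress stands in for whatever LLMLingua wrapper the pipeline uses, and the caller supplies the entity list from its own extractor):

import re


def compress_prompt_safely(text: str, entities: list[str]) -> str:
    """Compress with entity protection; rule-based only for CJK-heavy text."""
    protected, mapping = protect_entities(text, entities)

    if should_apply_llmlingua(protected):
        compressed = llmlingua_compress(protected)  # hypothetical wrapper
    else:
        # Conservative path for Japanese: collapse runs of spaces/tabs,
        # never drop particles or honorifics.
        compressed = re.sub(r"[ \t]+", " ", protected).strip()

    return restore_entities(compressed, mapping)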

Long-term fix:

  • Train a Japanese-aware perplexity model for compression scoring, or use a multilingual model (e.g., XLM-R) that correctly assigns high importance to Japanese grammatical particles.
  • Implement a language-specific compression policy: English components use LLMLingua aggressively; Japanese components use only whitespace cleanup and structural compression (template variables).

Prevention

flowchart TD
    A[Prevention Strategy] --> B[Language-Aware Compression Policy]
    A --> C[Entity Protection Layer]
    A --> D[Quality Gate Before Deployment]
    A --> E[Monitoring]

    B --> B1["English text: LLMLingua + rule-based (aggressive)"]
    B --> B2["Japanese text: rule-based only (conservative)"]
    B --> B3["Mixed text: segment by language, compress separately"]

    C --> C1[Extract entities before compression]
    C --> C2[Replace with placeholders]
    C --> C3[Compress placeholder-protected text]
    C --> C4[Restore entities after compression]

    D --> D1["Golden dataset: 200 JP queries, 200 EN queries"]
    D --> D2["Pre-deploy eval: compression must preserve >95% accuracy on both"]
    D --> D3["Canary deployment: 5% traffic for 24 hours before full rollout"]

    E --> E1["CloudWatch: AccuracyScore segmented by Language"]
    E --> E2["Automated alert: JP accuracy drops >5% below EN accuracy"]
    E --> E3["Weekly compression audit: sample 100 JP prompts, verify entity preservation"]

    style B2 fill:#c8e6c9
    style C1 fill:#bbdefb
    style D1 fill:#fff9c4

Scenario 3: Context Pruning Removing Relevant Manga Details from RAG Results

Problem

A user asks: "Is the limited edition box set of Attack on Titan available, and does it include the color pages?" The OpenSearch vector search returns 6 chunks:

| Chunk | Content | Vector Score | Entity Score | Composite Score |
|---|---|---|---|---|
| 1 | Attack on Titan standard edition volumes 1-34 | 0.91 | 0.8 | 0.82 |
| 2 | Attack on Titan limited edition box set — availability and contents | 0.88 | 1.0 | 0.85 |
| 3 | Attack on Titan color edition — page details and pricing | 0.84 | 0.7 | 0.74 |
| 4 | Attack on Titan merchandise and figurines | 0.75 | 0.3 | 0.48 |
| 5 | General box set comparison (One Piece, Naruto, AoT) | 0.72 | 0.5 | 0.55 |
| 6 | Shipping info for oversized items (box sets) | 0.65 | 0.2 | 0.39 |

The ContextPruner with a token budget of 500 tokens and relevance threshold of 0.45 keeps chunks 1, 2, and 3 (highest composite scores). However, chunk 3 ("color edition — page details") is critical for answering the "color pages" part of the question. If the token budget only allows 2 chunks, chunk 3 gets pruned — and the model produces an answer that omits color page information, or worse, hallucinates.

The user gets: "Yes, the limited edition box set is available at 15,800 JPY" but no information about color pages. The user follows up, frustrated.

Detection

flowchart TD
    A[User Feedback: Answer incomplete] --> B[Check query vs response coverage]
    B --> C{All query entities addressed in response?}
    C -->|No| D[Identify missing entity]
    D --> E[Check RAG chunks returned by OpenSearch]
    E --> F{Was relevant chunk retrieved?}
    F -->|No| G["Root cause: OpenSearch retrieval failure (not a pruning issue)"]
    F -->|Yes| H[Was relevant chunk kept after pruning?]
    H -->|No| I["Root cause: ContextPruner removed relevant chunk"]
    H -->|Yes| J["Root cause: Model ignored available context"]

    I --> K[Check composite score of pruned chunk]
    K --> L{Score above threshold?}
    L -->|Yes| M["Token budget too small — chunk cut by budget cap"]
    L -->|No| N["Relevance scoring underweighted this chunk"]

    style I fill:#ffcdd2
    style M fill:#fff3e0
    style N fill:#ffcdd2

Detection signals:

  • MangaAssist/Quality/CompletenessScore drops below 0.80 for product_search intent
  • MangaAssist/Pruning/ChunksDiscardedAboveThreshold > 0 (chunks above relevance threshold cut by budget)
  • User follow-up rate after product queries increases (users asking the same question differently)
  • Manual audit of pruned chunks reveals relevant content being discarded

Root Cause

Two compounding issues:

  1. Token budget too tight for multi-faceted queries: The user asked about (a) availability, (b) limited edition box set, and (c) color pages — three distinct information needs. The 500-token RAG budget was calibrated for single-faceted queries. Multi-faceted queries need proportionally more RAG context.

  2. Entity overlap scoring missed "color pages": The entity extractor identified "Attack on Titan," "limited edition," and "box set" as query entities. It did not extract "color pages" as a separate entity because it is not a product name. Chunk 3's entity score was 0.7 instead of 1.0 because the entity overlap calculation missed this aspect.

Resolution

Immediate:

import re


def detect_multi_faceted_query(query: str, intent: str) -> int:
    """Detect how many distinct information needs a query contains.

    Multi-faceted queries get a proportionally larger RAG budget.
    The intent argument is currently unused; it is kept so the budget
    heuristic can later be tuned per intent.

    Returns the number of detected facets (minimum 1).
    """
    # Simple heuristic: count question-indicating patterns
    facet_indicators = [
        # Conjunctions joining distinct questions
        r'\band\b.*\?',
        r'\balso\b',
        r'\badditionally\b',
        # Multiple question words
        r'(does|is|can|will|how|what|when|where)',
        # Comma-separated aspects
        r',\s*(does|is|and)',
    ]

    facet_count = 1  # always at least 1
    query_lower = query.lower()

    for pattern in facet_indicators:
        matches = re.findall(pattern, query_lower)
        facet_count += len(matches)

    # Cap at 4 to prevent budget explosion
    return min(facet_count, 4)


def get_dynamic_rag_budget(intent: str, query: str,
                           base_budget: int = 500) -> int:
    """Scale RAG token budget based on query complexity."""
    facets = detect_multi_faceted_query(query, intent)
    # Each additional facet adds 200 tokens to the RAG budget
    return base_budget + (facets - 1) * 200
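
As a sanity check against the Scenario 3 query: "Is the limited edition box set of Attack on Titan available, and does it include the color pages?" trips the "and ... ?" pattern once, the question-word pattern twice ("is", "does"), and the ", and" pattern once, so the count is capped at 4 facets and the RAG budget becomes 500 + 3 × 200 = 1,100 tokens, enough to keep chunk 3.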

Short-term fix:

  • Expand entity extraction to include descriptive phrases ("color pages", "hardcover", "first printing"), not just proper nouns. These product attributes are equally important for retrieval relevance.
  • Add a "query coverage check" after pruning: verify that every noun phrase in the user's query has at least one matching chunk in the kept set. If not, pull in the highest-scoring discarded chunk that covers the gap.

import re


def verify_query_coverage(query: str, kept_chunks: list,
                          discarded_chunks: list,
                          estimator) -> list:
    """Verify that kept chunks cover all query aspects.

    If a query aspect is uncovered, pull in the best discarded chunk
    that addresses it. The estimator argument is currently unused here;
    it is kept so a later revision can re-check the token budget after
    pulling a chunk back in.
    """
    # Extract candidate phrases (simplified: contiguous runs of words)
    query_phrases = set(re.findall(r'[A-Za-z]+(?:\s+[A-Za-z]+)*', query.lower()))

    # Check which phrases are covered by kept chunks
    kept_text = " ".join(c.text.lower() for c in kept_chunks)
    uncovered = [p for p in query_phrases
                 if p not in kept_text and len(p) > 3]

    if not uncovered:
        return kept_chunks

    # Find discarded chunks that cover the gap
    for phrase in uncovered:
        best_match = None
        best_score = 0
        for chunk in discarded_chunks:
            if phrase in chunk.text.lower():
                if chunk.vector_score > best_score:
                    best_match = chunk
                    best_score = chunk.vector_score

        if best_match:
            kept_chunks.append(best_match)
            discarded_chunks.remove(best_match)

    return kept_chunks

Prevention

flowchart TD
    A[Prevention] --> B[Dynamic RAG Budget]
    A --> C[Expanded Entity Extraction]
    A --> D[Query Coverage Verification]
    A --> E[Monitoring]

    B --> B1[Detect query facets: 1-4 information needs]
    B --> B2["Budget = base + (facets - 1) x 200 tokens"]
    B --> B3[Cap at 1,100 tokens for 4-facet queries]

    C --> C1[Extract proper nouns: titles, authors, ISBNs]
    C --> C2[Extract descriptive attributes: color pages, hardcover, limited]
    C --> C3[Extract product categories: box set, collection, series]

    D --> D1[After pruning: check every query noun phrase]
    D --> D2[If uncovered: pull best discarded chunk]
    D --> D3[Log coverage gaps for model improvement]

    E --> E1["Alert: ChunksDiscardedAboveThreshold > 2"]
    E --> E2[Weekly audit: sample 50 multi-faceted queries]
    E --> E3["Track: user follow-up rate after product_search"]

    style B2 fill:#c8e6c9
    style D2 fill:#bbdefb

Scenario 4: Response Size Controls Truncating Critical Shipping/Pricing Information

Problem

A user asks: "I want to order the Naruto complete box set. What's the total price with shipping to Osaka, and when will it arrive?" The model generates a response that includes:

  1. Product details and price (130 tokens)
  2. Shipping options and costs (100 tokens)
  3. Estimated delivery dates for each option (80 tokens)
  4. Tax calculation (60 tokens)
  5. Total cost summary (40 tokens)
  6. Return policy note (50 tokens)

Total natural response: 460 tokens. But the order_status intent (misclassified — this is actually a product_search with shipping inquiry) has max_tokens=200. The ResponseSizeController truncates the response after the shipping options section. The user sees the product price and shipping options but never sees the delivery dates, tax calculation, or total cost — the most critical information for a purchase decision.

The truncation message says: "For full order details, visit your order history page." This is misleading — the user has not placed an order yet. They are trying to decide whether to order.

Detection

flowchart TD
    A[User Complaint: Incomplete pricing info] --> B[Check response truncation logs]
    B --> C{Was response truncated?}
    C -->|Yes| D[Check intent classification]
    D --> E{Correct intent assigned?}
    E -->|No| F["Root cause: intent misclassification → wrong max_tokens"]
    E -->|Yes| G[Check if natural response exceeds budget]
    G --> H{Natural response > max_tokens by > 50%?}
    H -->|Yes| I["Root cause: budget too low for this query type"]
    H -->|No| J["Root cause: model being verbose — prompt needs tightening"]

    F --> K[Fix: retrain intent classifier]
    I --> L[Fix: increase budget or add sub-intent]
    J --> M[Fix: add conciseness instruction to prompt]

    C -->|No| N[Different issue — check model output quality]

    style F fill:#ffcdd2
    style I fill:#fff3e0

Detection signals:

  • MangaAssist/Response/TruncationRate spikes above 15% for any intent
  • MangaAssist/Response/TruncatedCriticalInfo — a custom metric that checks whether truncated responses are missing prices, dates, or order IDs
  • Customer feedback: "The chatbot didn't tell me the total price" or "When will it arrive?"
  • Conversion rate drops for users who received truncated responses (A/B comparison)

Root Cause

Three issues combined:

  1. Intent misclassification: The query "I want to order... What's the total price with shipping?" was classified as order_status because it contains the word "order." But this is a pre-purchase inquiry — it should be product_search (max_tokens=400) or a new sub-intent like purchase_inquiry (max_tokens=500).

  2. Truncation message mismatch: The order_status truncation message references "order history page," which is irrelevant for a pre-purchase query. Each intent's truncation message assumes the classification is correct.

  3. Critical information at the end: The model naturally structures responses from general (product details) to specific (total cost with tax). The most actionable information — the total cost — comes last and is the first thing truncated.

Resolution

Immediate (during the incident):

# Add a "critical information" detection pass before truncation
import re

CRITICAL_PATTERNS = {
    "pricing": [
        r'total[:\s]*[\d,]+\s*(JPY|USD|円|ドル)',
        r'合計[:\s]*[\d,]+',
        r'shipping[:\s]*[\d,]+',
        r'送料[:\s]*[\d,]+',
        r'tax[:\s]*[\d,]+',
    ],
    "delivery": [
        r'(arrive|delivery|届く|到着)[:\s]*(by|on|within)',
        r'\d{1,2}[/-]\d{1,2}',
        r'(business days|営業日)',
    ],
    "order_id": [
        r'ORD-\d{4}-\d{5}',
        r'注文番号[:\s]*\w+',
    ],
}


def contains_critical_info(text: str) -> bool:
    """Check if text contains critical information that must not be truncated."""
    for category, patterns in CRITICAL_PATTERNS.items():
        for pattern in patterns:
            if re.search(pattern, text, re.IGNORECASE):
                return True
    return False


async def stream_with_critical_protection(
    bedrock_stream,
    intent: str,
    config,
    encoding,
):
    """Stream with truncation protection for critical information.

    If the response is approaching the token budget and we detect
    critical information (prices, dates, order IDs), we extend the
    budget by 50% to ensure this information is delivered.
    """
    token_count = 0
    in_critical_zone = False

    async for event in bedrock_stream:
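        # extract_text: assumed MangaAssist helper (defined elsewhere)
        # that pulls the text delta out of a Bedrock streaming event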
        chunk_text = extract_text(event)
        if not chunk_text:
            continue

        chunk_tokens = len(encoding.encode(chunk_text))
        token_count += chunk_tokens

        # Check if we're in the critical zone (80-100% of budget)
        if token_count >= config.max_output_tokens * 0.8:
            if contains_critical_info(chunk_text):
                in_critical_zone = True

        # If in critical zone, extend budget by 50%
        effective_budget = (
            int(config.max_output_tokens * 1.5)
            if in_critical_zone
            else config.max_output_tokens
        )

        if token_count > effective_budget:
            yield chunk_text
            yield config.truncation_message
            return

        yield chunk_text

Short-term fix:

  1. Add a purchase_inquiry sub-intent with max_tokens=500 and an appropriate truncation message: "Want me to add this to your cart?"
  2. Retrain the intent classifier to distinguish order_status (existing order) from purchase_inquiry (pre-purchase question).
  3. Add a "critical info first" instruction to the system prompt for purchase-related intents:

When answering purchase or pricing questions, lead with the total cost
and delivery date. Put details and disclaimers after the key numbers.
  4. Fix the truncation message to be context-aware:
TRUNCATION_MESSAGES = {
    "purchase_inquiry": (
        "\n\n---\n*Want me to add this to your cart, "
        "or do you have more questions?*"
    ),
    "product_search": (
        "\n\n---\n*I can provide more details. "
        "Which aspect interests you most?*"
    ),
    "order_status": (
        "\n\n---\n*For full tracking details, "
        "check your order history page.*"
    ),
}

Prevention

flowchart TD
    A[Prevention] --> B[Intent Classifier Improvement]
    A --> C[Critical Info Protection]
    A --> D[Prompt Structure Guidance]
    A --> E[Truncation Message Alignment]

    B --> B1["Add purchase_inquiry sub-intent"]
    B --> B2["Training data: 500 pre-purchase queries"]
    B --> B3["Confusion matrix monitoring: order_status vs purchase_inquiry"]

    C --> C1["Detect critical patterns: prices, dates, IDs"]
    C --> C2["Auto-extend budget 50% when critical info is in-flight"]
    C --> C3["Never truncate mid-price or mid-date"]

    D --> D1["System prompt: lead with key numbers"]
    D --> D2["Structured output format: summary first, details second"]
    D --> D3["Test with truncation simulation: verify key info in first 60% of response"]

    E --> E1["Each intent gets a context-appropriate truncation message"]
    E --> E2["Truncation message never references actions the user cannot take"]
    E --> E3["A/B test truncation messages for user satisfaction"]

    style C2 fill:#c8e6c9
    style D1 fill:#bbdefb

Scenario 5: Token Tracking Drift Between Estimation and Actual Bedrock Consumption

Problem

Over the past two weeks, the EstimationDriftPercent CloudWatch metric has been steadily increasing. Initial deployment showed a consistent 3-5% drift (estimations slightly above actuals, which is the safe direction). Now the drift has reached -18% — meaning actual token consumption is 18% higher than estimates. This means:

  • Budget gates are passing requests that actually exceed their token budgets
  • Cost forecasting underestimates actual spend by 18%
  • Session budget caps are being hit later than expected, allowing more expensive invocations than intended
  • Monthly cost is $20,400 over forecast

The drift is correlated with a Bedrock model update that happened 12 days ago: Claude 3 Sonnet received a minor version update that changed its tokenization slightly.

Detection

flowchart TD
    A["CloudWatch: EstimationDriftPercent > 10%"] --> B[Check drift direction]
    B --> C{Positive drift — over-estimating?}
    C -->|Yes| D["Safe direction but wasteful — budgets too conservative"]
    B --> E{Negative drift — under-estimating?}
    E -->|Yes| F["Dangerous — actual cost exceeds estimates"]

    F --> G[Check when drift started increasing]
    G --> H{Correlated with model update?}
    H -->|Yes| I["Root cause: tokenizer mismatch after model version change"]
    H -->|No| J[Check prompt template changes]
    J --> K{New prompt templates deployed?}
    K -->|Yes| L["Root cause: new templates have different token characteristics"]
    K -->|No| M[Check input data distribution shift]
    M --> N["Root cause: user behavior changed (longer queries, more JP text)"]

    style I fill:#ffcdd2
    style F fill:#fff3e0

Detection signals:

  • MangaAssist/Tokens/EstimationDriftPercent — 7-day rolling average crosses -10%
  • MangaAssist/Tokens/ActualCostVsForecast — daily actual exceeds forecast by > 15%
  • MangaAssist/Tokens/BudgetGateBypassRate — requests that passed the budget check but actually exceeded their budget
  • Bedrock model version change notification in the AWS Health Dashboard

Root Cause

The TokenEstimator uses tiktoken's cl100k_base encoding, which is a close approximation to Claude 3's tokenizer but not identical. When Anthropic updates the model version, subtle tokenization changes can occur:

  1. New vocabulary tokens: The model update added tokens for recently popular manga titles and Japanese slang. These map to fewer tokens in the new tokenizer but more tokens in the old cl100k_base used for estimation.
  2. Changed byte-pair encoding merges: The model's BPE merge table was updated, changing how multi-byte Japanese characters are split into tokens. For example, a common kanji sequence might now be 1 token instead of 2 — but our estimator still counts it as 2.
  3. System prompt overhead: Bedrock's internal system prompt handling changed slightly, adding ~50 tokens of overhead that our estimator does not account for.

The 5% calibration buffer (multiply by 1.05) was sufficient for the original model version but insufficient for the updated version.

Resolution

Immediate (same day):

  1. Increase calibration buffer from 1.05 to 1.20 while investigating the root cause. This over-estimates by 20% — wasteful but safe.
# Emergency config change — no code deployment needed
# Update in DynamoDB config table or SSM Parameter Store

import boto3

ssm = boto3.client('ssm')
ssm.put_parameter(
    Name='/mangaassist/token-estimator/calibration-buffer',
    Value='1.20',
    Type='String',
    Overwrite=True,
)
  2. Enable Bedrock invocation logging (if not already enabled) to capture exact token counts for every call, and use these as ground truth for recalibration (a capture sketch follows).
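
Invocation logging aside, for Anthropic models on Bedrock the invoke_model response body itself reports exact token counts in a usage field, so ground truth can also be captured inline. A minimal sketch, assuming the orchestrator already parses the response body:

import json


def actual_token_counts(bedrock_response) -> tuple[int, int]:
    """Read exact input/output token counts from a Claude-on-Bedrock response.

    The Anthropic messages API returns a "usage" object alongside the
    generated content; these are the ground-truth numbers to compare
    against the estimator's pre-invocation count.
    """
    body = json.loads(bedrock_response["body"].read())
    usage = body["usage"]
    return usage["input_tokens"], usage["output_tokens"]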

Short-term fix (within 1 week):

Implement an auto-calibration system that continuously adjusts the calibration buffer based on observed drift:

import numpy as np
from collections import deque


class AdaptiveCalibrationBuffer:
    """Auto-calibrates the token estimation buffer using recent observations.

    Maintains a rolling window of (estimated, actual) token pairs and
    computes the optimal calibration buffer to minimize under-estimation
    while avoiding excessive over-estimation.

    MangaAssist design: the buffer is updated every 1,000 invocations
    (approximately every 1.4 minutes at 1M messages/day). This means
    the system self-corrects within minutes of a model update.
    """

    def __init__(self, window_size: int = 5000,
                 min_buffer: float = 1.02,
                 max_buffer: float = 1.30,
                 target_underestimate_rate: float = 0.05):
        """
        Args:
            window_size: Number of recent observations to consider
            min_buffer: Minimum calibration buffer (2% over)
            max_buffer: Maximum calibration buffer (30% over)
            target_underestimate_rate: Target rate of underestimates
                (we want <=5% of estimates to be below actual)
        """
        self.observations = deque(maxlen=window_size)
        self.min_buffer = min_buffer
        self.max_buffer = max_buffer
        self.target_rate = target_underestimate_rate
        self.current_buffer = 1.10  # start conservative

    def record(self, estimated_tokens: int, actual_tokens: int):
        """Record an (estimated, actual) observation pair."""
        if estimated_tokens > 0 and actual_tokens > 0:
            ratio = actual_tokens / estimated_tokens
            self.observations.append(ratio)

    def recalibrate(self) -> float:
        """Recompute the optimal calibration buffer.

        Uses the (1 - target_rate) percentile of actual/estimated ratios
        to set the buffer such that only target_rate of estimates fall
        below actual values.
        """
        if len(self.observations) < 100:
            return self.current_buffer  # not enough data

        ratios = np.array(self.observations)
        # We want 95th percentile of (actual/estimated) ratios
        # This means 95% of the time, estimated * buffer >= actual
        target_percentile = 1.0 - self.target_rate
        p95_ratio = np.percentile(ratios, target_percentile * 100)

        # The buffer should be at least p95_ratio
        new_buffer = np.clip(p95_ratio, self.min_buffer, self.max_buffer)

        # Smooth the update to avoid oscillation
        self.current_buffer = 0.7 * self.current_buffer + 0.3 * new_buffer
        return self.current_buffer

    def get_buffer(self) -> float:
        """Get the current calibration buffer."""
        return self.current_buffer

    def get_stats(self) -> dict:
        """Get drift statistics for monitoring."""
        if len(self.observations) < 10:
            return {"status": "insufficient_data"}

        ratios = np.array(self.observations)
        return {
            "current_buffer": round(self.current_buffer, 4),
            "mean_ratio": round(float(np.mean(ratios)), 4),
            "p50_ratio": round(float(np.median(ratios)), 4),
            "p95_ratio": round(float(np.percentile(ratios, 95)), 4),
            "p99_ratio": round(float(np.percentile(ratios, 99)), 4),
            "underestimate_rate": round(
                float(np.mean(ratios > self.current_buffer)), 4
            ),
            "observation_count": len(self.observations),
        }
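
Wiring this into the orchestrator is straightforward; a sketch of the per-invocation loop, assuming raw_estimate comes from the existing tiktoken-based TokenEstimator and actual counts come from the response usage field shown earlier (note: a module-level counter is per-process, so each Fargate task calibrates independently):

calibrator = AdaptiveCalibrationBuffer()
invocation_count = 0


def calibrated_estimate(raw_estimate: int) -> int:
    """Apply the current adaptive buffer to a raw token estimate."""
    return int(raw_estimate * calibrator.get_buffer())


def after_invocation(raw_estimate: int, actual_input_tokens: int) -> None:
    """Record ground truth and recalibrate every 1,000 invocations."""
    global invocation_count
    calibrator.record(raw_estimate, actual_input_tokens)
    invocation_count += 1
    if invocation_count % 1000 == 0:
        calibrator.recalibrate()
        # Publish calibrator.get_stats() to the MangaAssist/Tokens
        # namespace here so the drift alarms see the new buffer.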

Long-term fix:

  • Subscribe to Bedrock model update notifications via the AWS Health Dashboard API. When a model update is detected, automatically increase the calibration buffer to 1.25 and trigger a 24-hour recalibration period.
  • Evaluate switching from tiktoken's cl100k_base to Anthropic's official token counting API (if/when available) for exact counts.
  • Implement a shadow-mode estimator that runs the old and new calibration in parallel and alerts when they diverge by more than 5%.

Prevention

flowchart TD
    A[Prevention] --> B[Auto-Calibration System]
    A --> C[Model Update Response]
    A --> D[Drift Monitoring]
    A --> E[Budgeting Safety]

    B --> B1["Rolling window of 5,000 (estimated, actual) pairs"]
    B --> B2["Recalibrate every 1,000 invocations"]
    B --> B3["Target: < 5% underestimation rate"]
    B --> B4["Buffer range: 1.02 to 1.30"]

    C --> C1["Subscribe to Bedrock model update notifications"]
    C --> C2["On update: auto-increase buffer to 1.25"]
    C --> C3["24-hour intensive recalibration window"]
    C --> C4["Alert on-call engineer for manual review"]

    D --> D1["CloudWatch alarm: EstimationDriftPercent > +/-10%"]
    D --> D2["Daily drift report to Slack"]
    D --> D3["Weekly drift trend analysis"]
    D --> D4["Monthly calibration buffer audit"]

    E --> E1["Budget gates use calibrated estimate, not raw count"]
    E --> E2["Session budgets include 15% safety margin"]
    E --> E3["Monthly cost forecast uses P95 estimate, not mean"]

    style B2 fill:#c8e6c9
    style C2 fill:#bbdefb
    style D1 fill:#fff9c4

Cross-Scenario Decision Tree — Token Efficiency Incident Response

When a token efficiency alert fires, use this decision tree to identify which scenario applies:

flowchart TD
    A["Token Efficiency Alert Fired"] --> B{What type of alert?}

    B -->|Cost spike| C{Per-session or system-wide?}
    C -->|Per-session| D["Scenario 1: Budget exceeded in long conversation"]
    C -->|System-wide| E{Correlated with model update?}
    E -->|Yes| F["Scenario 5: Estimation drift"]
    E -->|No| G{Correlated with new deployment?}
    G -->|Yes| H["Check compression or pruning changes"]
    G -->|No| I["Traffic pattern change — investigate intent distribution"]

    B -->|Quality drop| J{Language-specific?}
    J -->|JP queries affected| K["Scenario 2: Compression degrading Japanese content"]
    J -->|All languages| L{Completeness or accuracy?}
    L -->|Completeness| M["Scenario 3: Context pruning too aggressive"]
    L -->|Accuracy| N["Check if RAG retrieval quality changed"]

    B -->|Truncation spike| O{Which intent?}
    O -->|order_status| P{Misclassified intent?}
    P -->|Yes| Q["Scenario 4: Truncation of critical info (misclassification)"]
    P -->|No| R["Budget too low for this query type"]
    O -->|recommendation| S["Scenario 1: Long conversation + truncation"]
    O -->|other| T["Review intent-specific max_tokens config"]

    B -->|Drift alarm| F

    style D fill:#fff3e0
    style K fill:#ffcdd2
    style M fill:#ffcdd2
    style Q fill:#ffcdd2
    style F fill:#fff9c4

Runbook Summary Table

| Scenario | First Response (< 5 min) | Short-term Fix (< 1 week) | Long-term Prevention |
|---|---|---|---|
| 1. Budget exceeded in long conversation | Increase recommendation budget for sessions > 8 turns | Implement tiered budgets + preference profile extraction at turn 8 | Dynamic budget scaling, proactive session summarization, session length guardrails |
| 2. JP content quality degradation | Disable LLMLingua for Japanese text immediately | Add entity protection layer, language-aware compression policy | Japanese-calibrated perplexity model, bilingual golden dataset, canary deployments |
| 3. RAG pruning removes relevant chunks | Lower relevance threshold by 0.1 for affected intent | Add query coverage verification after pruning, expand entity extraction | Dynamic RAG budget based on query facets, weekly pruning audits, user follow-up rate tracking |
| 4. Critical info truncated | Add critical info detection, extend budget 50% for prices/dates | Add purchase_inquiry sub-intent, retrain classifier, fix truncation messages | Prompt-level "key info first" instruction, truncation simulation testing, A/B test truncation UX |
| 5. Estimation drift | Increase calibration buffer to 1.20 | Deploy auto-calibration system with rolling window recalibration | Model update auto-response, shadow estimator, monthly calibration audit |

Key Takeaways

  1. Token efficiency failures are never just about tokens: they cascade into quality, UX, and revenue impacts. A truncated price is a lost sale.
  2. Japanese content is a special case everywhere: compression, pruning, and estimation all need language-aware handling for CJK text.
  3. Multi-turn conversations are the primary budget risk: a single long browsing session can cost 10x a typical session. Proactive summarization at turn 8 is the key mitigation.
  4. Intent misclassification amplifies token issues: wrong intent means wrong budget, wrong max_tokens, and wrong truncation message. The intent classifier is the foundation of the entire token efficiency system.
  5. Auto-calibration beats static buffers: model updates will happen. A self-adjusting calibration buffer that recalibrates from production data protects against silent drift.