01: Content Handling Troubleshooting
AIP-C01 Mapping
Task 5.2 → Skill 5.2.1: Resolve content handling issues to ensure that necessary information is processed completely in FM interactions (context window overflow diagnostics, dynamic chunking strategies, prompt design optimization, truncation-related error analysis).
User Story
As a senior ML engineer on the MangaAssist team, I want to detect, diagnose, and resolve content handling failures in FM interactions, So that the chatbot processes complete information without silent truncation, delivers accurate responses grounded in full context, and maintains reliability as conversation lengths and catalog sizes grow.
Acceptance Criteria
- Context window usage is tracked per request with token-level granularity; overflow is detected before submission to the FM
- Dynamic chunking adapts chunk size by content type (product descriptions vs FAQ vs editorial) to maximize information density within token budgets
- Prompt design uses a priority-based allocation system that preserves critical context (system rules, grounding data) and compresses lower-priority sections (history, page context) under budget pressure
- Truncation errors are caught with automated alerting; silent truncation rate < 0.1% of requests
- CloudWatch dashboard shows token budget utilization, overflow events, and truncation incidents per intent type
High-Level Design
The Content Handling Problem
MangaAssist operates under a hard constraint: Claude 3.5 Sonnet on Bedrock has a 200K token context window, but the practical budget is much smaller. The prompt must fit several components into a fraction of that window to keep latency and cost acceptable.
Why 200K is not the real budget:
| Constraint | Practical Limit | Why |
|---|---|---|
| Latency SLA | ~4,000 input tokens | Each additional 1K input tokens adds ~50-100ms to prefill latency; P95 target is < 3 seconds |
| Cost control | ~4,000 input tokens | At $3.00/1M input tokens, 4K input tokens per request is $0.012/request, about $12,000/day at 1M requests/day; 20K tokens would be about $60,000/day |
| Output reservation | ~1,000 tokens | Must reserve space for the response; recommendation responses with product cards need 400-800 tokens |
| Effective input budget | ~4,000 tokens | After reserving output tokens, the working input budget is roughly 4,000 tokens |
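The cost row is simple arithmetic, and it is worth keeping the calculation handy when budgets are renegotiated. A minimal sketch using the pricing and traffic figures assumed in the table above:

# Back-of-the-envelope input-token cost, using the assumptions from the table above.
PRICE_PER_1M_INPUT_TOKENS = 3.00   # USD, Claude 3.5 Sonnet input pricing on Bedrock
REQUESTS_PER_DAY = 1_000_000       # assumed traffic volume

def daily_input_cost(input_tokens_per_request: int) -> float:
    """Daily spend on input tokens alone at the assumed request volume."""
    cost_per_request = input_tokens_per_request / 1_000_000 * PRICE_PER_1M_INPUT_TOKENS
    return cost_per_request * REQUESTS_PER_DAY

print(daily_input_cost(4_000))    # 12000.0 -> ~$12,000/day at 4K input tokens per request
print(daily_input_cost(20_000))   # 60000.0 -> ~$60,000/day at 20K input tokens per request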
Token Budget Allocation
graph TD
subgraph "Total Budget: ~5,000 tokens"
A[System Prompt<br>250-400 tokens<br>FIXED] --> B[RAG Chunks<br>800-1,500 tokens<br>VARIABLE]
B --> C[Conversation History<br>300-700 tokens<br>COMPRESSIBLE]
C --> D[Page Context<br>100-200 tokens<br>DROPPABLE]
D --> E[User Message<br>50-200 tokens<br>FIXED]
E --> F[Output Reservation<br>800-1,000 tokens<br>RESERVED]
end
style A fill:#e74c3c,color:#fff
style B fill:#f39c12,color:#fff
style C fill:#3498db,color:#fff
style D fill:#2ecc71,color:#fff
style E fill:#e74c3c,color:#fff
style F fill:#95a5a6,color:#fff
Priority tiers:
1. FIXED (System Prompt + User Message): Never truncated. These define the task.
2. VARIABLE (RAG Chunks): Sized based on intent and available budget. The main information carrier.
3. COMPRESSIBLE (Conversation History): Summarized if budget is tight. Older turns compressed first.
4. DROPPABLE (Page Context): Omitted entirely under extreme budget pressure.
Content Handling Failure Taxonomy
graph LR
A[Content Handling<br>Failure] --> B[Silent Truncation]
A --> C[Budget Overflow]
A --> D[Chunking Mismatch]
A --> E[Compression Loss]
B --> B1[FM truncates input<br>without error]
B --> B2[History dropped<br>mid-sentence]
B --> B3[RAG chunk cut off<br>at token boundary]
C --> C1[Total prompt exceeds<br>model limit]
C --> C2[Single section<br>exceeds allocation]
C --> C3[Dynamic content<br>unexpectedly large]
D --> D1[Chunk too large<br>for slot]
D --> D2[Chunk splits key<br>information]
D --> D3[Wrong chunk count<br>for intent]
E --> E1[Summary loses<br>key preferences]
E --> E2[Context compressed<br>below usefulness]
E --> E3[Metadata stripped<br>during compression]
Troubleshooting Decision Flow
flowchart TD
A[User reports incomplete<br>or wrong answer] --> B{Check token<br>budget metrics}
B -->|Budget exceeded| C{Which section<br>overflowed?}
B -->|Budget OK| D{Check chunk<br>quality}
C -->|RAG chunks| E[Too many chunks retrieved<br>or chunks too large]
C -->|History| F[Long conversation<br>not summarized]
C -->|Dynamic content| G[Product data or promos<br>unexpectedly large]
D -->|Low relevance scores| H[Embedding drift or<br>stale index]
D -->|Good relevance| I{Check compression<br>quality}
I -->|Summary lost context| J[Summarization prompt<br>needs tuning]
I -->|Summary OK| K[Issue is in prompt<br>design, not content]
E --> L[Adjust chunk count<br>or max chunk size]
F --> M[Enable aggressive<br>summarization earlier]
G --> N[Add content size<br>validation before assembly]
H --> O[See file 04:<br>Retrieval Troubleshooting]
J --> P[See file 03:<br>Prompt Troubleshooting]
K --> P
Low-Level Design
1. Context Window Overflow Diagnostics
The first line of defense: know exactly how much of the context window each request uses, and detect overflow before the FM sees it.
Token Budget Manager
import time
import logging
from dataclasses import dataclass, field
from typing import Optional
from enum import Enum
logger = logging.getLogger("mangaassist.content_handling")
class BudgetPriority(Enum):
FIXED = "fixed" # Never truncated (system prompt, user message)
VARIABLE = "variable" # Sized dynamically (RAG chunks)
COMPRESSIBLE = "compressible" # Summarized under pressure (history)
DROPPABLE = "droppable" # Omitted if needed (page context)
@dataclass
class TokenAllocation:
section: str
priority: BudgetPriority
min_tokens: int
max_tokens: int
actual_tokens: int = 0
was_truncated: bool = False
was_compressed: bool = False
was_dropped: bool = False
@dataclass
class BudgetReport:
total_budget: int
total_used: int
utilization_pct: float
sections: list
overflow_detected: bool
truncation_events: list
compression_events: list
drop_events: list
assembly_time_ms: float
@dataclass
class TokenBudgetManager:
"""Manages token allocation across prompt sections with priority-based overflow handling."""
model_id: str = "anthropic.claude-3-5-sonnet-20241022-v2:0"
total_budget: int = 5000
output_reservation: int = 1000
# Section budgets by priority
budgets: dict = field(default_factory=lambda: {
"system_prompt": {"priority": BudgetPriority.FIXED, "min": 250, "max": 400},
"user_message": {"priority": BudgetPriority.FIXED, "min": 50, "max": 300},
"rag_chunks": {"priority": BudgetPriority.VARIABLE, "min": 400, "max": 1500},
"conversation_history":{"priority": BudgetPriority.COMPRESSIBLE, "min": 100, "max": 700},
"page_context": {"priority": BudgetPriority.DROPPABLE, "min": 0, "max": 200},
})
def allocate(self, sections: dict) -> BudgetReport:
"""
Given raw content for each section, allocate tokens within budget.
Returns a BudgetReport with allocation decisions and overflow diagnostics.
"""
start_time = time.monotonic()
input_budget = self.total_budget - self.output_reservation
allocations = []
truncation_events = []
compression_events = []
drop_events = []
# Phase 1: Count actual tokens for each section
section_tokens = {}
for name, content in sections.items():
token_count = self._count_tokens(content)
section_tokens[name] = token_count
# Phase 2: Allocate FIXED sections first (non-negotiable)
remaining = input_budget
for name, config in self.budgets.items():
if config["priority"] == BudgetPriority.FIXED and name in section_tokens:
actual = min(section_tokens[name], config["max"])
remaining -= actual
allocations.append(TokenAllocation(
section=name, priority=config["priority"],
min_tokens=config["min"], max_tokens=config["max"],
actual_tokens=actual,
))
# Phase 3: Allocate VARIABLE sections (sized to fit)
for name, config in self.budgets.items():
if config["priority"] == BudgetPriority.VARIABLE and name in section_tokens:
available = min(remaining, config["max"])
actual = min(section_tokens[name], available)
was_truncated = section_tokens[name] > available
if was_truncated:
truncation_events.append({
"section": name,
"requested_tokens": section_tokens[name],
"allocated_tokens": actual,
"dropped_tokens": section_tokens[name] - actual,
})
logger.warning(
"Token truncation in %s: requested=%d, allocated=%d, dropped=%d",
name, section_tokens[name], actual, section_tokens[name] - actual,
extra={"section": name, "overflow_type": "variable_truncation"}
)
remaining -= actual
allocations.append(TokenAllocation(
section=name, priority=config["priority"],
min_tokens=config["min"], max_tokens=config["max"],
actual_tokens=actual, was_truncated=was_truncated,
))
# Phase 4: Allocate COMPRESSIBLE sections (compress if over budget)
for name, config in self.budgets.items():
if config["priority"] == BudgetPriority.COMPRESSIBLE and name in section_tokens:
available = min(remaining, config["max"])
actual = min(section_tokens[name], available)
                was_compressed = section_tokens[name] > available and actual >= config["min"]
                was_truncated = section_tokens[name] > available and actual < config["min"]
if was_compressed:
compression_events.append({
"section": name,
"original_tokens": section_tokens[name],
"compressed_to": actual,
})
remaining -= actual
allocations.append(TokenAllocation(
section=name, priority=config["priority"],
min_tokens=config["min"], max_tokens=config["max"],
actual_tokens=actual,
was_compressed=was_compressed,
was_truncated=was_truncated,
))
# Phase 5: Allocate DROPPABLE sections (only if budget remains)
for name, config in self.budgets.items():
if config["priority"] == BudgetPriority.DROPPABLE and name in section_tokens:
if remaining <= 0:
drop_events.append({"section": name, "reason": "budget_exhausted"})
allocations.append(TokenAllocation(
section=name, priority=config["priority"],
min_tokens=config["min"], max_tokens=config["max"],
actual_tokens=0, was_dropped=True,
))
else:
actual = min(section_tokens[name], remaining, config["max"])
remaining -= actual
allocations.append(TokenAllocation(
section=name, priority=config["priority"],
min_tokens=config["min"], max_tokens=config["max"],
actual_tokens=actual,
))
total_used = sum(a.actual_tokens for a in allocations) + self.output_reservation
assembly_time_ms = (time.monotonic() - start_time) * 1000
return BudgetReport(
total_budget=self.total_budget,
total_used=total_used,
utilization_pct=round(total_used / self.total_budget * 100, 1),
sections=allocations,
overflow_detected=len(truncation_events) > 0 or len(drop_events) > 0,
truncation_events=truncation_events,
compression_events=compression_events,
drop_events=drop_events,
assembly_time_ms=round(assembly_time_ms, 2),
)
def _count_tokens(self, text: str) -> int:
"""Approximate token count. In production, use tiktoken or the model's tokenizer."""
if not text:
return 0
# Rough approximation: 1 token ≈ 4 characters for English, ~2 for Japanese
return len(text) // 4
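The len // 4 heuristic is convenient but drifts for non-English text (see Scenario C below). A sketch of a closer estimate using tiktoken, with the caveat that cl100k_base approximates but is not Claude's actual tokenizer, and the heuristic kept as a fallback:

def count_tokens_precise(text: str) -> int:
    """Estimate tokens with tiktoken when installed; fall back to the len // 4 heuristic.
    cl100k_base is not Claude's tokenizer, so treat the result as an estimate as well."""
    if not text:
        return 0
    try:
        import tiktoken
        encoding = tiktoken.get_encoding("cl100k_base")
        return len(encoding.encode(text))
    except ImportError:
        return len(text) // 4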
CloudWatch Metrics Emission
import boto3
cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")
def emit_budget_metrics(report: BudgetReport, intent: str, session_id: str):
"""Emit token budget metrics to CloudWatch for monitoring and alerting."""
metrics = [
{
"MetricName": "TokenBudgetUtilization",
"Value": report.utilization_pct,
"Unit": "Percent",
"Dimensions": [
{"Name": "Intent", "Value": intent},
{"Name": "Service", "Value": "MangaAssist"},
],
},
{
"MetricName": "TokenBudgetOverflow",
"Value": 1 if report.overflow_detected else 0,
"Unit": "Count",
"Dimensions": [
{"Name": "Intent", "Value": intent},
{"Name": "Service", "Value": "MangaAssist"},
],
},
]
# Per-section utilization
for section in report.sections:
metrics.append({
"MetricName": "SectionTokenUsage",
"Value": section.actual_tokens,
"Unit": "Count",
"Dimensions": [
{"Name": "Section", "Value": section.section},
{"Name": "Intent", "Value": intent},
],
})
# Truncation events
for event in report.truncation_events:
metrics.append({
"MetricName": "TokenTruncationDropped",
"Value": event["dropped_tokens"],
"Unit": "Count",
"Dimensions": [
{"Name": "Section", "Value": event["section"]},
{"Name": "Intent", "Value": intent},
],
})
cloudwatch.put_metric_data(Namespace="MangaAssist/ContentHandling", MetricData=metrics)
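A short usage sketch tying the two pieces together, assuming the orchestrator has already gathered raw content for each section; the strings, intent, and session ID are illustrative, and the CloudWatch call assumes valid AWS credentials:

manager = TokenBudgetManager()
sections = {
    "system_prompt": "You are MangaAssist, a shopping assistant for manga readers. ...",
    "user_message": "Which volume should I read after Vinland Saga vol. 5?",
    "rag_chunks": "Vinland Saga Vol. 6 continues the farm arc after the events of ...",  # retrieved chunks, joined
    "conversation_history": "user: I liked the early war arcs.\nassistant: Noted, thanks.",
    "page_context": "Currently viewing: Vinland Saga Vol. 5 product page",
}

report = manager.allocate(sections)
emit_budget_metrics(report, intent="recommendation", session_id="sess-123")

if report.overflow_detected:
    logger.warning("Overflow during prompt assembly: %s", report.truncation_events)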
2. Dynamic Chunking Strategies
Not all content should be chunked the same way. Product descriptions, FAQ articles, and editorial content have different information densities and different failure modes when chunked incorrectly.
Chunking Decision Flow
flowchart TD
A[Content arrives<br>for indexing] --> B{Content type?}
B -->|Product description| C[Chunk: 256 tokens<br>Overlap: 25 tokens]
B -->|FAQ article| D[Chunk: 512 tokens<br>Overlap: 50 tokens]
B -->|Policy document| E[Chunk: 512 tokens<br>Overlap: 50 tokens]
B -->|Editorial/review| F[Chunk: 512 tokens<br>Overlap: 50 tokens]
B -->|Review summary| G[Chunk: 128 tokens<br>Overlap: 0]
C --> H{Token budget<br>pressure?}
D --> H
E --> H
F --> H
G --> H
H -->|Normal| I[Use standard<br>chunk size]
H -->|High pressure| J[Reduce to min<br>chunk size]
H -->|Critical| K[Use single<br>best chunk only]
I --> L[Attach metadata:<br>source, ASIN, category,<br>last_updated]
J --> L
K --> L
Adaptive Chunker
import re
from dataclasses import dataclass
from typing import Optional
from enum import Enum
class ContentType(Enum):
PRODUCT_DESCRIPTION = "product_description"
FAQ_ARTICLE = "faq_article"
POLICY_DOCUMENT = "policy_document"
EDITORIAL = "editorial"
REVIEW_SUMMARY = "review_summary"
@dataclass
class ChunkConfig:
target_tokens: int
min_tokens: int
max_tokens: int
overlap_tokens: int
split_on_sentences: bool
preserve_metadata: bool
# Chunk configurations by content type — derived from production observations
CHUNK_CONFIGS = {
ContentType.PRODUCT_DESCRIPTION: ChunkConfig(
target_tokens=256, min_tokens=128, max_tokens=384,
overlap_tokens=25, split_on_sentences=True, preserve_metadata=True,
),
ContentType.FAQ_ARTICLE: ChunkConfig(
target_tokens=512, min_tokens=256, max_tokens=768,
overlap_tokens=50, split_on_sentences=True, preserve_metadata=True,
),
ContentType.POLICY_DOCUMENT: ChunkConfig(
target_tokens=512, min_tokens=256, max_tokens=768,
overlap_tokens=50, split_on_sentences=True, preserve_metadata=True,
),
ContentType.EDITORIAL: ChunkConfig(
target_tokens=512, min_tokens=256, max_tokens=768,
overlap_tokens=50, split_on_sentences=True, preserve_metadata=True,
),
ContentType.REVIEW_SUMMARY: ChunkConfig(
target_tokens=128, min_tokens=64, max_tokens=192,
overlap_tokens=0, split_on_sentences=False, preserve_metadata=True,
),
}
@dataclass
class Chunk:
content: str
token_count: int
chunk_index: int
total_chunks: int
content_type: ContentType
metadata: dict
class DynamicChunker:
"""Content-type aware chunker with adaptive sizing under token budget pressure."""
def chunk(
self,
text: str,
content_type: ContentType,
metadata: Optional[dict] = None,
budget_pressure: float = 0.0, # 0.0 = normal, 1.0 = critical
) -> list:
config = CHUNK_CONFIGS[content_type]
metadata = metadata or {}
# Adjust chunk size based on budget pressure
adjusted_target = self._adjust_for_pressure(config, budget_pressure)
if config.split_on_sentences:
chunks = self._sentence_aware_split(text, adjusted_target, config.overlap_tokens)
else:
chunks = self._token_split(text, adjusted_target, config.overlap_tokens)
        # Filter out fragments too small to be useful, but never drop the document
        # entirely when it produced only a single short chunk
        if len(chunks) > 1:
            chunks = [c for c in chunks if self._count_tokens(c) >= config.min_tokens]
return [
Chunk(
content=chunk_text,
token_count=self._count_tokens(chunk_text),
chunk_index=i,
total_chunks=len(chunks),
content_type=content_type,
metadata={**metadata, "chunk_index": i, "total_chunks": len(chunks)},
)
for i, chunk_text in enumerate(chunks)
]
def _adjust_for_pressure(self, config: ChunkConfig, pressure: float) -> int:
"""Reduce target chunk size under budget pressure."""
if pressure <= 0.0:
return config.target_tokens
if pressure >= 1.0:
return config.min_tokens
# Linear interpolation between target and min
return int(config.target_tokens - (config.target_tokens - config.min_tokens) * pressure)
def _sentence_aware_split(self, text: str, target_tokens: int, overlap_tokens: int) -> list:
"""Split on sentence boundaries, respecting target token count."""
sentences = re.split(r'(?<=[.!?])\s+', text)
chunks = []
current_chunk = []
current_tokens = 0
for sentence in sentences:
sentence_tokens = self._count_tokens(sentence)
if current_tokens + sentence_tokens > target_tokens and current_chunk:
chunks.append(" ".join(current_chunk))
# Overlap: keep last N tokens worth of sentences
if overlap_tokens > 0:
overlap_chunk = []
overlap_count = 0
for s in reversed(current_chunk):
s_tokens = self._count_tokens(s)
if overlap_count + s_tokens > overlap_tokens:
break
overlap_chunk.insert(0, s)
overlap_count += s_tokens
current_chunk = overlap_chunk
current_tokens = overlap_count
else:
current_chunk = []
current_tokens = 0
current_chunk.append(sentence)
current_tokens += sentence_tokens
if current_chunk:
chunks.append(" ".join(current_chunk))
return chunks
def _token_split(self, text: str, target_tokens: int, overlap_tokens: int) -> list:
"""Simple token-boundary split for content that doesn't benefit from sentence splitting."""
words = text.split()
chunks = []
i = 0
target_words = target_tokens # Approximate: 1 token ≈ 1 word for this rough split
        while i < len(words):
            end = min(i + target_words, len(words))
            chunks.append(" ".join(words[i:end]))
            if end == len(words):
                break  # stop at the tail; stepping back for overlap here would loop forever
            i = end - overlap_tokens if overlap_tokens > 0 else end
return chunks
def _count_tokens(self, text: str) -> int:
if not text:
return 0
return len(text) // 4
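A usage sketch for indexing a single product description under moderate budget pressure; the description, ASIN, and category values are illustrative:

chunker = DynamicChunker()
description = (
    "Vinland Saga Deluxe Edition Vol. 1 collects the opening arc of Makoto Yukimura's historical epic. "
    "Thorfinn grows up on an Icelandic farm listening to tales of Vinland, a warm and fertile land far to the west. "
    "After tragedy strikes his family, he spends his youth among Askeladd's war band chasing a duel he can never win. "
    "This oversized hardcover edition collects the first three volumes of the original release in one book. "
    "It includes more than 700 pages of story plus bonus art, creator interviews, and translation notes. "
    "Recommended for readers who enjoy grounded historical drama with long-form character development."
)
chunks = chunker.chunk(
    text=description,
    content_type=ContentType.PRODUCT_DESCRIPTION,
    metadata={"asin": "B0EXAMPLE1", "category": "seinen"},  # illustrative values
    budget_pressure=0.5,  # halfway between the 256-token target and the 128-token minimum
)
for c in chunks:
    print(c.chunk_index, c.token_count, c.metadata["asin"])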
3. Prompt Design Optimization for Token Efficiency
When the token budget is tight, the prompt itself must be engineered for density. This is not about writing shorter prompts — it is about structuring prompts so that the most important information occupies the most visible positions.
Prompt Compression Pipeline
graph LR
A[Raw Prompt<br>Components] --> B[Priority<br>Sorter]
B --> C[History<br>Compressor]
C --> D[Chunk<br>Selector]
D --> E[Context<br>Trimmer]
E --> F[Budget<br>Validator]
F -->|Within budget| G[Final Prompt]
F -->|Over budget| H[Escalate:<br>drop DROPPABLE,<br>compress more]
H --> E
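One way to realize the escalate-and-retry loop from the diagram is to re-run allocation with increasing budget pressure until the validator passes. A rough sketch, where rebuild_sections_at_pressure stands in for caller-provided re-chunking and re-compression logic and is not part of the modules in this file:

def assemble_with_escalation(manager, sections: dict, max_attempts: int = 3):
    """Retry prompt assembly, tightening VARIABLE and COMPRESSIBLE sections each round.
    Sketch only: rebuild_sections_at_pressure is a hypothetical caller-provided helper."""
    pressure = 0.0
    report = manager.allocate(sections)
    for _ in range(max_attempts):
        if not report.overflow_detected:
            return sections, report
        # Escalate: shrink RAG chunks and compress history harder before retrying.
        pressure = min(1.0, pressure + 0.5)
        sections = rebuild_sections_at_pressure(sections, pressure)  # hypothetical helper
        report = manager.allocate(sections)
    return sections, report  # still over budget; the caller should degrade gracefully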
Conversation History Compressor
import re
import logging
from dataclasses import dataclass
logger = logging.getLogger("mangaassist.content_handling")
@dataclass
class CompressedHistory:
text: str
token_count: int
original_turns: int
summarized_turns: int
preserved_turns: int
compression_ratio: float
class HistoryCompressor:
"""Compresses conversation history to fit within token budget.
Strategy:
1. Always preserve the most recent 2 turns (user + assistant) — these are the immediate context.
2. If history exceeds budget, summarize older turns into a condensed summary.
3. Preserve any turns that contain product ASINs, order IDs, or explicit preferences.
"""
def __init__(self, max_tokens: int = 700, preserve_recent: int = 2):
self.max_tokens = max_tokens
self.preserve_recent = preserve_recent
def compress(self, turns: list, budget_tokens: int) -> CompressedHistory:
"""Compress conversation history to fit within budget_tokens."""
effective_budget = min(budget_tokens, self.max_tokens)
# Calculate total tokens in full history
full_text = self._format_turns(turns)
full_tokens = self._count_tokens(full_text)
# If it fits, return as-is
if full_tokens <= effective_budget:
return CompressedHistory(
text=full_text,
token_count=full_tokens,
original_turns=len(turns),
summarized_turns=0,
preserved_turns=len(turns),
compression_ratio=1.0,
)
# Split into recent (preserved) and older (to be summarized)
recent_turns = turns[-self.preserve_recent:] if len(turns) > self.preserve_recent else turns
older_turns = turns[:-self.preserve_recent] if len(turns) > self.preserve_recent else []
recent_text = self._format_turns(recent_turns)
recent_tokens = self._count_tokens(recent_text)
# Budget remaining for summary of older turns
summary_budget = effective_budget - recent_tokens
if summary_budget <= 0:
# Even recent turns exceed budget — truncate to most recent turn only
last_turn = turns[-1:]
last_text = self._format_turns(last_turn)
logger.warning(
"History budget critical: only preserving last turn",
extra={"original_turns": len(turns), "budget": effective_budget},
)
return CompressedHistory(
text=last_text,
token_count=self._count_tokens(last_text),
original_turns=len(turns),
summarized_turns=len(turns) - 1,
preserved_turns=1,
compression_ratio=self._count_tokens(last_text) / full_tokens,
)
# Summarize older turns
summary = self._summarize_turns(older_turns, summary_budget)
combined_text = f"[Previous conversation summary]: {summary}\n\n{recent_text}"
combined_tokens = self._count_tokens(combined_text)
return CompressedHistory(
text=combined_text,
token_count=combined_tokens,
original_turns=len(turns),
summarized_turns=len(older_turns),
preserved_turns=len(recent_turns),
compression_ratio=round(combined_tokens / full_tokens, 2),
)
def _summarize_turns(self, turns: list, max_tokens: int) -> str:
"""Extract key information from older turns.
In production, this would call a fast summarization model or use the FM itself
with a summarization prompt. For cost, we use extractive summarization:
keep turns that contain ASINs, order IDs, or preference keywords.
"""
key_patterns = [
r'B0[A-Z0-9]{8}', # ASIN pattern
r'#?\d{3}-\d{7}', # Order ID pattern
r'(?:prefer|like|want|love|hate|looking for)', # Preference signals
]
important_turns = []
        for turn in turns:
            content = turn.get("content", "")
if any(re.search(pattern, content, re.IGNORECASE) for pattern in key_patterns):
important_turns.append(turn)
if important_turns:
summary_text = " | ".join(
f"{t['role']}: {t['content'][:100]}" for t in important_turns
)
else:
# No key signals — just note how many turns were summarized
summary_text = f"User discussed manga topics over {len(turns)} messages."
# Trim to budget
while self._count_tokens(summary_text) > max_tokens and len(summary_text) > 20:
summary_text = summary_text[:len(summary_text) * 3 // 4] # Trim 25% iteratively
return summary_text
def _format_turns(self, turns: list) -> str:
return "\n".join(f"{t['role']}: {t['content']}" for t in turns)
def _count_tokens(self, text: str) -> int:
if not text:
return 0
return len(text) // 4
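A usage sketch for a multi-turn return conversation, assuming turns come from the session store as role/content dictionaries; the messages and order number are illustrative:

compressor = HistoryCompressor(max_tokens=700, preserve_recent=2)
turns = [
    {"role": "user", "content": "My copy of Berserk Deluxe Vol. 3 arrived with a torn cover."},
    {"role": "assistant", "content": "Sorry to hear that. Could you share your order number?"},
    {"role": "user", "content": "It's order #123-4567890."},
    {"role": "assistant", "content": "Thanks, I can see that order. Would you prefer a replacement or a refund?"},
    {"role": "user", "content": "A replacement, please, if it ships this week."},
]

compressed = compressor.compress(turns, budget_tokens=60)
print(compressed.compression_ratio, compressed.summarized_turns, compressed.preserved_turns)
# The order-ID turn survives in the summary; the damage-description turn does not,
# which is exactly the failure mode examined in Scenario B below.
print(compressed.text)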
4. Truncation-Related Error Analysis
Silent truncation is the most dangerous content handling failure because no error is raised. The FM simply does not see the truncated content and produces a plausible-looking but incomplete answer.
Truncation Detection Pipeline
sequenceDiagram
participant Orchestrator
participant BudgetManager
participant TruncationDetector
participant CloudWatch
participant PagerDuty
Orchestrator->>BudgetManager: Allocate tokens for sections
BudgetManager-->>Orchestrator: BudgetReport
Orchestrator->>TruncationDetector: Validate report
alt No truncation
TruncationDetector-->>Orchestrator: PASS
else Soft truncation (COMPRESSIBLE/DROPPABLE)
TruncationDetector->>CloudWatch: Emit warning metric
TruncationDetector-->>Orchestrator: PASS with warning
else Hard truncation (FIXED/VARIABLE)
TruncationDetector->>CloudWatch: Emit error metric
TruncationDetector->>PagerDuty: Alert if rate > threshold
TruncationDetector-->>Orchestrator: FAIL — content incomplete
Orchestrator->>Orchestrator: Degrade gracefully:<br>use template response or<br>reduce scope
end
Truncation Detector
import logging
from dataclasses import dataclass
from enum import Enum
# Note: BudgetReport, consumed by analyze() below, is the dataclass returned by
# TokenBudgetManager.allocate() in section 1.
logger = logging.getLogger("mangaassist.content_handling")
class TruncationSeverity(Enum):
NONE = "none"
SOFT = "soft" # Compressible or droppable section affected
HARD = "hard" # Fixed or variable section affected — response quality impacted
@dataclass
class TruncationReport:
severity: TruncationSeverity
affected_sections: list
total_tokens_lost: int
recommendation: str
should_degrade: bool
class TruncationDetector:
"""Detects and classifies truncation events in prompt assembly.
Key insight: Not all truncation is equal.
- Dropping page context when it's not needed: harmless.
- Truncating RAG chunks that contain the answer: catastrophic.
- Compressing history in a product discovery flow: usually fine.
- Compressing history in a multi-turn return flow: dangerous (loses order context).
"""
# Intent-specific sensitivity: how much truncation matters for each section
SENSITIVITY_MAP = {
"product_discovery": {
"rag_chunks": "high",
"conversation_history": "low",
"page_context": "medium",
},
"product_question": {
"rag_chunks": "high",
"conversation_history": "medium",
"page_context": "high",
},
"recommendation": {
"rag_chunks": "high",
"conversation_history": "high", # User preferences are in history
"page_context": "medium",
},
"faq": {
"rag_chunks": "high",
"conversation_history": "low",
"page_context": "low",
},
"order_tracking": {
"rag_chunks": "low",
"conversation_history": "high", # Order context is in history
"page_context": "low",
},
"return_request": {
"rag_chunks": "medium",
"conversation_history": "high", # Order + issue context in history
"page_context": "low",
},
}
def analyze(self, budget_report: BudgetReport, intent: str) -> TruncationReport:
"""Analyze a BudgetReport for truncation severity based on intent context."""
sensitivity = self.SENSITIVITY_MAP.get(intent, {})
affected_sections = []
total_lost = 0
max_severity = TruncationSeverity.NONE
# Check truncation events
for event in budget_report.truncation_events:
section = event["section"]
dropped = event["dropped_tokens"]
section_sensitivity = sensitivity.get(section, "medium")
if section_sensitivity == "high":
severity = TruncationSeverity.HARD
elif section_sensitivity == "medium":
severity = TruncationSeverity.SOFT
else:
severity = TruncationSeverity.SOFT
affected_sections.append({
"section": section,
"tokens_lost": dropped,
"sensitivity": section_sensitivity,
"severity": severity.value,
})
total_lost += dropped
if severity.value == "hard":
max_severity = TruncationSeverity.HARD
elif severity.value == "soft" and max_severity == TruncationSeverity.NONE:
max_severity = TruncationSeverity.SOFT
# Check drop events
for event in budget_report.drop_events:
section = event["section"]
section_sensitivity = sensitivity.get(section, "low")
if section_sensitivity == "high":
max_severity = TruncationSeverity.HARD
affected_sections.append({
"section": section,
"tokens_lost": "all",
"sensitivity": section_sensitivity,
"severity": "hard",
})
# Generate recommendation
if max_severity == TruncationSeverity.HARD:
recommendation = (
f"HARD truncation detected for intent '{intent}'. "
f"High-sensitivity sections affected: {[s['section'] for s in affected_sections if s['sensitivity'] == 'high']}. "
f"Consider: (1) reducing chunk count, (2) using a more aggressive summarization, "
f"(3) falling back to template response for this intent."
)
elif max_severity == TruncationSeverity.SOFT:
recommendation = (
f"Soft truncation in low-sensitivity sections. "
f"Response quality likely unaffected for intent '{intent}'."
)
else:
recommendation = "No truncation detected. All sections within budget."
return TruncationReport(
severity=max_severity,
affected_sections=affected_sections,
total_tokens_lost=total_lost,
recommendation=recommendation,
should_degrade=max_severity == TruncationSeverity.HARD,
)
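A sketch of how the orchestrator might act on the analysis, assuming report is the BudgetReport from TokenBudgetManager.allocate() and invoke_model is a placeholder for the Bedrock call made elsewhere in the pipeline:

detector = TruncationDetector()
truncation = detector.analyze(budget_report=report, intent="return_request")

if truncation.should_degrade:
    # HARD truncation: do not send an incomplete prompt to the FM; fall back to a safe template.
    logger.error("Degrading return_request response: %s", truncation.recommendation)
    response = (
        "I want to make sure I have your complete order details before continuing. "
        "Could you confirm your order number?"
    )
else:
    response = invoke_model(assembled_prompt)  # placeholder for the Bedrock InvokeModel wrapper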
5. MangaAssist Scenarios
Scenario A: Long Manga Series FAQ Overflows Context Window
Context: A user asks "Tell me about all the volumes of One Piece." The product catalog has 100+ volumes. The RAG pipeline retrieves descriptions for the top 10 volumes, each ~300 tokens = 3,000 tokens for RAG chunks alone, blowing the 1,500-token RAG budget.
Detection: TokenBudgetManager.allocate() detects that rag_chunks requested 3,000 tokens but only 1,500 were available. TruncationDetector flags HARD truncation because rag_chunks has high sensitivity for product_question intent.
Resolution:
1. Immediate: Reduce retrieved chunks from 10 to 3 (selecting the most relevant volumes based on the query)
2. Systemic: Add a pre-retrieval filter that limits chunk count by intent: product_question → max 3 chunks, recommendation → max 4 chunks, faq → max 2 chunks (sketched below)
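A minimal sketch of that pre-retrieval filter, assuming the cap is passed to the retriever as its result limit; the per-intent limits are the ones listed above and the default is an assumption:

# Per-intent chunk caps applied before retrieval, so oversized result sets
# never reach prompt assembly in the first place.
MAX_CHUNKS_BY_INTENT = {
    "product_question": 3,
    "recommendation": 4,
    "faq": 2,
}
DEFAULT_MAX_CHUNKS = 3  # assumed fallback for intents not listed above

def max_chunks_for(intent: str) -> int:
    return MAX_CHUNKS_BY_INTENT.get(intent, DEFAULT_MAX_CHUNKS)

top_k = max_chunks_for("product_question")  # -> 3, passed to the retriever as its result limit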
Scenario B: Multi-Turn Return Conversation Exhausts History Budget
Context: A customer has been going back and forth about a damaged manga volume for 15 turns. The conversation history is 2,100 tokens. The history budget is 700 tokens. The HistoryCompressor summarizes the older 13 turns, but the summary drops the order ID and the specific damage description.
Detection: HistoryCompressor returns compression_ratio=0.33, and the subsequent FM response asks the user for their order number again, even though they already provided it in turn 3.
Resolution:
1. Immediate: Update HistoryCompressor._summarize_turns() to always preserve turns containing order IDs (regex pattern #?\d{3}-\d{7})
2. Systemic: For return_request and order_tracking intents, extract structured entities (order ID, ASIN, issue type) into a separate "key facts" section that always gets FIXED priority, as sketched below
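A hedged sketch of that key-facts extraction, reusing the ASIN and order-ID patterns from HistoryCompressor; the issue-keyword list and the key_facts section name are illustrative choices:

import re

def extract_key_facts(turns: list) -> str:
    """Collect order IDs, ASINs, and issue keywords from the full history so they can be
    placed in a FIXED-priority key_facts section that survives any history compression."""
    order_ids, asins, issues = set(), set(), set()
    for turn in turns:
        content = turn.get("content", "")
        order_ids.update(re.findall(r'#?\d{3}-\d{7}', content))
        asins.update(re.findall(r'B0[A-Z0-9]{8}', content))
        issues.update(  # illustrative keyword list; tune to the support taxonomy
            m.lower() for m in re.findall(r'damaged|torn|missing|wrong item|defective', content, re.IGNORECASE)
        )
    facts = []
    if order_ids:
        facts.append("Order IDs: " + ", ".join(sorted(order_ids)))
    if asins:
        facts.append("ASINs: " + ", ".join(sorted(asins)))
    if issues:
        facts.append("Reported issues: " + ", ".join(sorted(issues)))
    return "\n".join(facts)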
Scenario C: Japanese Content Tokenization Mismatch
Context: MangaAssist handles Japanese language content. Japanese text uses ~2 characters per token (compared to ~4 for English), so a product description that looks short in characters is actually 2x more tokens than expected.
Detection: TokenBudgetManager consistently reports overflow for Japanese-heavy RAG chunks. The _count_tokens approximation (len/4) underestimates Japanese content by 50%.
Resolution:
1. Immediate: Use a proper tokenizer (tiktoken or Anthropic's tokenizer) instead of character-based approximation
2. Systemic: Add a locale parameter to DynamicChunker that adjusts chunk sizes based on expected tokenization ratio (see the sketch below)
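A small sketch of the locale-aware counting adjustment, using the ~4 characters/token (English) and ~2 characters/token (Japanese) ratios described above; the ratios and locale keys are approximations:

# Approximate characters-per-token ratios; actual values vary by tokenizer and content mix.
CHARS_PER_TOKEN_BY_LOCALE = {
    "en": 4,
    "ja": 2,
}

def count_tokens_by_locale(text: str, locale: str = "en") -> int:
    """Character-based token estimate adjusted for locale; prefer a real tokenizer when available."""
    if not text:
        return 0
    return len(text) // CHARS_PER_TOKEN_BY_LOCALE.get(locale, 4)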
6. CloudWatch Dashboard and Alerts
Key Metrics
| Metric | Namespace | Threshold | Alert Action |
|---|---|---|---|
| TokenBudgetUtilization | MangaAssist/ContentHandling | P95 > 90% | Warn: investigate prompt size growth |
| TokenBudgetOverflow | MangaAssist/ContentHandling | > 1% of requests | Page: content regularly truncated |
| TokenTruncationDropped | MangaAssist/ContentHandling | P95 > 200 tokens | Warn: significant information loss |
| SectionTokenUsage (by section) | MangaAssist/ContentHandling | Sudden change > 20% | Warn: section size drift |
| HistoryCompressionRatio | MangaAssist/ContentHandling | P95 < 0.25 | Warn: history heavily compressed |
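As an example of wiring one of these thresholds into an alarm, a sketch for TokenBudgetOverflow using boto3; the alarm name, intent dimension, evaluation window, and SNS topic ARN are assumptions:

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")

# The metric is emitted per request as 0/1 with Intent and Service dimensions,
# so the Average statistic approximates the fraction of requests with overflow.
# One alarm is needed per Intent value because alarms match exact dimension sets.
cloudwatch.put_metric_alarm(
    AlarmName="mangaassist-token-budget-overflow-recommendation",  # illustrative name
    Namespace="MangaAssist/ContentHandling",
    MetricName="TokenBudgetOverflow",
    Dimensions=[
        {"Name": "Intent", "Value": "recommendation"},
        {"Name": "Service", "Value": "MangaAssist"},
    ],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=0.01,  # > 1% of requests overflowing
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:ap-northeast-1:123456789012:mangaassist-oncall"],  # placeholder ARN
)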
CloudWatch Logs Insights Queries
Find requests where RAG chunks were truncated:
fields @timestamp, @message
| filter section = "rag_chunks" and overflow_type = "variable_truncation"
| stats count(*) as truncation_count by intent
| sort truncation_count desc
| limit 20
Track token budget utilization by intent over time:
fields @timestamp, intent, utilization_pct
| filter metric_name = "TokenBudgetUtilization"
| stats avg(utilization_pct) as avg_util, max(utilization_pct) as max_util, pct(utilization_pct, 95) as p95_util by intent, bin(1h) as hour
| sort hour desc
Identify sessions with heavy history compression:
fields @timestamp, session_id, compression_ratio, original_turns
| filter metric_name = "HistoryCompression" and compression_ratio < 0.3
| sort compression_ratio asc
| limit 50
Intuition Gained
What Mental Model You Build
Working through content handling troubleshooting teaches you to think in token budgets the way a systems engineer thinks in memory budgets. Every piece of information has a cost, and the art is fitting the most useful information into a fixed-size window.
You develop three core instincts:
1. The Budget Intuition: You start to "feel" when a prompt is too large before you count tokens. When you see a long conversation history, a complex product query, and multiple RAG chunks — you know immediately that something will be dropped. This intuition lets you design systems that prevent overflow rather than react to it.
2. The Priority Intuition: Not all context is equally important, and importance depends on intent. A product question needs the product data more than the conversation history. A return request needs the conversation history (with the order context) more than the page context. You learn to build systems that make these priority decisions automatically.
3. The Silent Failure Intuition: You develop a healthy paranoia about "invisible" failures. The FM will not tell you it did not see the full context. It will generate a plausible response from whatever it received. The only way to detect silent truncation is to measure it proactively. This instinct carries over to every system you build: you learn to instrument the things that fail silently, not just the things that throw errors.
How This Intuition Guides Future Decisions
- When designing a new GenAI feature: You start by calculating the token budget before writing a single prompt. You ask "How many tokens does each context source need?" before "What should the prompt say?"
- When evaluating new models: You compare not just quality but context window efficiency. A model with a 128K window but terrible long-context accuracy is worse than a model with 32K that uses every token effectively.
- When debugging user complaints: Your first question is "What did the model actually see?" — not "What did the model say?" This traces the problem to the input, not the output, and is faster.
- When scaling to new domains: You know that token budgets are domain-dependent. Medical documents are denser than product descriptions. Legal text needs more context than FAQs. You build configurable budgets, not hardcoded ones.