01: Content Handling Troubleshooting
AIP-C01 Mapping
Task 5.2 → Skill 5.2.1: Resolve content handling issues to ensure that necessary information is processed completely in FM interactions (context window overflow diagnostics, dynamic chunking strategies, prompt design optimization, truncation-related error analysis).
User Story
As a senior ML engineer on the MangaAssist team, I want to detect, diagnose, and resolve content handling failures in FM interactions, So that the chatbot processes complete information without silent truncation, delivers accurate responses grounded in full context, and maintains reliability as conversation lengths and catalog sizes grow.
Acceptance Criteria
- Context window usage is tracked per request with token-level granularity; overflow is detected before submission to the FM
- Dynamic chunking adapts chunk size by content type (product descriptions vs FAQ vs editorial) to maximize information density within token budgets
- Prompt design uses a priority-based allocation system that preserves critical context (system rules, grounding data) and compresses lower-priority sections (history, page context) under budget pressure
- Truncation errors are caught with automated alerting; silent truncation rate < 0.1% of requests
- CloudWatch dashboard shows token budget utilization, overflow events, and truncation incidents per intent type
High-Level Design
The Content Handling Problem
MangaAssist operates under a hard constraint: Claude 3.5 Sonnet on Bedrock has a 200K token context window, but the practical budget is much smaller. The prompt must fit several components into a fraction of that window to keep latency and cost acceptable.
Why 200K is not the real budget:
| Constraint | Practical Limit | Why |
|---|---|---|
| Latency SLA | ~4,000 input tokens | Each additional 1K input tokens adds ~50-100ms to prefill latency; P95 target is < 3 seconds |
| Cost control | ~4,000 input tokens | At $3.00/1M input tokens, 4K input tokens per request is $0.012/request, about $12,000/day at 1M requests/day; 20K tokens would be about $60,000/day |
| Output reservation | ~1,000 tokens | Must reserve space for the response; recommendation responses with product cards need 400-800 tokens |
| Effective input budget | ~4,000 tokens | After reserving output tokens, the working input budget is roughly 4,000 tokens |
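The cost row is simple arithmetic, and it is worth keeping the calculation handy when budgets are renegotiated. A minimal sketch using the pricing and traffic figures assumed in the table above:

# Back-of-the-envelope input-token cost, using the assumptions from the table above.
PRICE_PER_1M_INPUT_TOKENS = 3.00   # USD, Claude 3.5 Sonnet input pricing on Bedrock
REQUESTS_PER_DAY = 1_000_000       # assumed traffic volume

def daily_input_cost(input_tokens_per_request: int) -> float:
    """Daily spend on input tokens alone at the assumed request volume."""
    cost_per_request = input_tokens_per_request / 1_000_000 * PRICE_PER_1M_INPUT_TOKENS
    return cost_per_request * REQUESTS_PER_DAY

print(daily_input_cost(4_000))    # 12000.0 -> ~$12,000/day at 4K input tokens per request
print(daily_input_cost(20_000))   # 60000.0 -> ~$60,000/day at 20K input tokens per request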
Token Budget Allocation
graph TD
subgraph "Total Budget: ~5,000 tokens"
A[System Prompt<br>250-400 tokens<br>FIXED] --> B[RAG Chunks<br>800-1,500 tokens<br>VARIABLE]
B --> C[Conversation History<br>300-700 tokens<br>COMPRESSIBLE]
C --> D[Page Context<br>100-200 tokens<br>DROPPABLE]
D --> E[User Message<br>50-200 tokens<br>FIXED]
E --> F[Output Reservation<br>800-1,000 tokens<br>RESERVED]
end
style A fill:#e74c3c,color:#fff
style B fill:#f39c12,color:#fff
style C fill:#3498db,color:#fff
style D fill:#2ecc71,color:#fff
style E fill:#e74c3c,color:#fff
style F fill:#95a5a6,color:#fff
Priority tiers:
1. FIXED (System Prompt + User Message): Never truncated. These define the task.
2. VARIABLE (RAG Chunks): Sized based on intent and available budget. The main information carrier.
3. COMPRESSIBLE (Conversation History): Summarized if budget is tight. Older turns compressed first.
4. DROPPABLE (Page Context): Omitted entirely under extreme budget pressure.
Content Handling Failure Taxonomy
graph LR
A[Content Handling<br>Failure] --> B[Silent Truncation]
A --> C[Budget Overflow]
A --> D[Chunking Mismatch]
A --> E[Compression Loss]
B --> B1[FM truncates input<br>without error]
B --> B2[History dropped<br>mid-sentence]
B --> B3[RAG chunk cut off<br>at token boundary]
C --> C1[Total prompt exceeds<br>model limit]
C --> C2[Single section<br>exceeds allocation]
C --> C3[Dynamic content<br>unexpectedly large]
D --> D1[Chunk too large<br>for slot]
D --> D2[Chunk splits key<br>information]
D --> D3[Wrong chunk count<br>for intent]
E --> E1[Summary loses<br>key preferences]
E --> E2[Context compressed<br>below usefulness]
E --> E3[Metadata stripped<br>during compression]
Troubleshooting Decision Flow
flowchart TD
A[User reports incomplete<br>or wrong answer] --> B{Check token<br>budget metrics}
B -->|Budget exceeded| C{Which section<br>overflowed?}
B -->|Budget OK| D{Check chunk<br>quality}
C -->|RAG chunks| E[Too many chunks retrieved<br>or chunks too large]
C -->|History| F[Long conversation<br>not summarized]
C -->|Dynamic content| G[Product data or promos<br>unexpectedly large]
D -->|Low relevance scores| H[Embedding drift or<br>stale index]
D -->|Good relevance| I{Check compression<br>quality}
I -->|Summary lost context| J[Summarization prompt<br>needs tuning]
I -->|Summary OK| K[Issue is in prompt<br>design, not content]
E --> L[Adjust chunk count<br>or max chunk size]
F --> M[Enable aggressive<br>summarization earlier]
G --> N[Add content size<br>validation before assembly]
H --> O[See file 04:<br>Retrieval Troubleshooting]
J --> P[See file 03:<br>Prompt Troubleshooting]
K --> P
Low-Level Design
1. Context Window Overflow Diagnostics
The first line of defense: know exactly how much of the context window each request uses, and detect overflow before the FM sees it.
Token Budget Manager
import time
import logging
from dataclasses import dataclass, field
from typing import Optional
from enum import Enum
logger = logging.getLogger("mangaassist.content_handling")
class BudgetPriority(Enum):
FIXED = "fixed" # Never truncated (system prompt, user message)
VARIABLE = "variable" # Sized dynamically (RAG chunks)
COMPRESSIBLE = "compressible" # Summarized under pressure (history)
DROPPABLE = "droppable" # Omitted if needed (page context)
@dataclass
class TokenAllocation:
section: str
priority: BudgetPriority
min_tokens: int
max_tokens: int
actual_tokens: int = 0
was_truncated: bool = False
was_compressed: bool = False
was_dropped: bool = False
@dataclass
class BudgetReport:
total_budget: int
total_used: int
utilization_pct: float
sections: list
overflow_detected: bool
truncation_events: list
compression_events: list
drop_events: list
assembly_time_ms: float
@dataclass
class TokenBudgetManager:
"""Manages token allocation across prompt sections with priority-based overflow handling."""
model_id: str = "anthropic.claude-3-5-sonnet-20241022-v2:0"
total_budget: int = 5000
output_reservation: int = 1000
# Section budgets by priority
budgets: dict = field(default_factory=lambda: {
"system_prompt": {"priority": BudgetPriority.FIXED, "min": 250, "max": 400},
"user_message": {"priority": BudgetPriority.FIXED, "min": 50, "max": 300},
"rag_chunks": {"priority": BudgetPriority.VARIABLE, "min": 400, "max": 1500},
"conversation_history":{"priority": BudgetPriority.COMPRESSIBLE, "min": 100, "max": 700},
"page_context": {"priority": BudgetPriority.DROPPABLE, "min": 0, "max": 200},
})
def allocate(self, sections: dict) -> BudgetReport:
"""
Given raw content for each section, allocate tokens within budget.
Returns a BudgetReport with allocation decisions and overflow diagnostics.
"""
start_time = time.monotonic()
input_budget = self.total_budget - self.output_reservation
allocations = []
truncation_events = []
compression_events = []
drop_events = []
# Phase 1: Count actual tokens for each section
section_tokens = {}
for name, content in sections.items():
token_count = self._count_tokens(content)
section_tokens[name] = token_count
# Phase 2: Allocate FIXED sections first (non-negotiable)
remaining = input_budget
for name, config in self.budgets.items():
if config["priority"] == BudgetPriority.FIXED and name in section_tokens:
actual = min(section_tokens[name], config["max"])
remaining -= actual
allocations.append(TokenAllocation(
section=name, priority=config["priority"],
min_tokens=config["min"], max_tokens=config["max"],
actual_tokens=actual,
))
# Phase 3: Allocate VARIABLE sections (sized to fit)
for name, config in self.budgets.items():
if config["priority"] == BudgetPriority.VARIABLE and name in section_tokens:
available = min(remaining, config["max"])
actual = min(section_tokens[name], available)
was_truncated = section_tokens[name] > available
if was_truncated:
truncation_events.append({
"section": name,
"requested_tokens": section_tokens[name],
"allocated_tokens": actual,
"dropped_tokens": section_tokens[name] - actual,
})
logger.warning(
"Token truncation in %s: requested=%d, allocated=%d, dropped=%d",
name, section_tokens[name], actual, section_tokens[name] - actual,
extra={"section": name, "overflow_type": "variable_truncation"}
)
remaining -= actual
allocations.append(TokenAllocation(
section=name, priority=config["priority"],
min_tokens=config["min"], max_tokens=config["max"],
actual_tokens=actual, was_truncated=was_truncated,
))
# Phase 4: Allocate COMPRESSIBLE sections (compress if over budget)
for name, config in self.budgets.items():
if config["priority"] == BudgetPriority.COMPRESSIBLE and name in section_tokens:
available = min(remaining, config["max"])
actual = min(section_tokens[name], available)
                was_compressed = section_tokens[name] > available and actual >= config["min"]
                was_truncated = section_tokens[name] > available and actual < config["min"]
if was_compressed:
compression_events.append({
"section": name,
"original_tokens": section_tokens[name],
"compressed_to": actual,
})
remaining -= actual
allocations.append(TokenAllocation(
section=name, priority=config["priority"],
min_tokens=config["min"], max_tokens=config["max"],
actual_tokens=actual,
was_compressed=was_compressed,
was_truncated=was_truncated,
))
# Phase 5: Allocate DROPPABLE sections (only if budget remains)
for name, config in self.budgets.items():
if config["priority"] == BudgetPriority.DROPPABLE and name in section_tokens:
if remaining <= 0:
drop_events.append({"section": name, "reason": "budget_exhausted"})
allocations.append(TokenAllocation(
section=name, priority=config["priority"],
min_tokens=config["min"], max_tokens=config["max"],
actual_tokens=0, was_dropped=True,
))
else:
actual = min(section_tokens[name], remaining, config["max"])
remaining -= actual
allocations.append(TokenAllocation(
section=name, priority=config["priority"],
min_tokens=config["min"], max_tokens=config["max"],
actual_tokens=actual,
))
total_used = sum(a.actual_tokens for a in allocations) + self.output_reservation
assembly_time_ms = (time.monotonic() - start_time) * 1000
return BudgetReport(
total_budget=self.total_budget,
total_used=total_used,
utilization_pct=round(total_used / self.total_budget * 100, 1),
sections=allocations,
overflow_detected=len(truncation_events) > 0 or len(drop_events) > 0,
truncation_events=truncation_events,
compression_events=compression_events,
drop_events=drop_events,
assembly_time_ms=round(assembly_time_ms, 2),
)
def _count_tokens(self, text: str) -> int:
"""Approximate token count. In production, use tiktoken or the model's tokenizer."""
if not text:
return 0
# Rough approximation: 1 token ≈ 4 characters for English, ~2 for Japanese
return len(text) // 4
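The len // 4 heuristic is convenient but drifts for non-English text (see Scenario C below). A sketch of a closer estimate using tiktoken, with the caveat that cl100k_base approximates but is not Claude's actual tokenizer, and the heuristic kept as a fallback:

def count_tokens_precise(text: str) -> int:
    """Estimate tokens with tiktoken when installed; fall back to the len // 4 heuristic.
    cl100k_base is not Claude's tokenizer, so treat the result as an estimate as well."""
    if not text:
        return 0
    try:
        import tiktoken
        encoding = tiktoken.get_encoding("cl100k_base")
        return len(encoding.encode(text))
    except ImportError:
        return len(text) // 4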
CloudWatch Metrics Emission
import boto3
cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")
def emit_budget_metrics(report: BudgetReport, intent: str, session_id: str):
"""Emit token budget metrics to CloudWatch for monitoring and alerting."""
metrics = [
{
"MetricName": "TokenBudgetUtilization",
"Value": report.utilization_pct,
"Unit": "Percent",
"Dimensions": [
{"Name": "Intent", "Value": intent},
{"Name": "Service", "Value": "MangaAssist"},
],
},
{
"MetricName": "TokenBudgetOverflow",
"Value": 1 if report.overflow_detected else 0,
"Unit": "Count",
"Dimensions": [
{"Name": "Intent", "Value": intent},
{"Name": "Service", "Value": "MangaAssist"},
],
},
]
# Per-section utilization
for section in report.sections:
metrics.append({
"MetricName": "SectionTokenUsage",
"Value": section.actual_tokens,
"Unit": "Count",
"Dimensions": [
{"Name": "Section", "Value": section.section},
{"Name": "Intent", "Value": intent},
],
})
# Truncation events
for event in report.truncation_events:
metrics.append({
"MetricName": "TokenTruncationDropped",
"Value": event["dropped_tokens"],
"Unit": "Count",
"Dimensions": [
{"Name": "Section", "Value": event["section"]},
{"Name": "Intent", "Value": intent},
],
})
cloudwatch.put_metric_data(Namespace="MangaAssist/ContentHandling", MetricData=metrics)
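A short usage sketch tying the two pieces together, assuming the orchestrator has already gathered raw content for each section; the strings, intent, and session ID are illustrative, and the CloudWatch call assumes valid AWS credentials:

manager = TokenBudgetManager()
sections = {
    "system_prompt": "You are MangaAssist, a shopping assistant for manga readers. ...",
    "user_message": "Which volume should I read after Vinland Saga vol. 5?",
    "rag_chunks": "Vinland Saga Vol. 6 continues the farm arc after the events of ...",  # retrieved chunks, joined
    "conversation_history": "user: I liked the early war arcs.\nassistant: Noted, thanks.",
    "page_context": "Currently viewing: Vinland Saga Vol. 5 product page",
}

report = manager.allocate(sections)
emit_budget_metrics(report, intent="recommendation", session_id="sess-123")

if report.overflow_detected:
    logger.warning("Overflow during prompt assembly: %s", report.truncation_events)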
2. Dynamic Chunking Strategies
Not all content should be chunked the same way. Product descriptions, FAQ articles, and editorial content have different information densities and different failure modes when chunked incorrectly.
Chunking Decision Flow
flowchart TD
A[Content arrives<br>for indexing] --> B{Content type?}
B -->|Product description| C[Chunk: 256 tokens<br>Overlap: 25 tokens]
B -->|FAQ article| D[Chunk: 512 tokens<br>Overlap: 50 tokens]
B -->|Policy document| E[Chunk: 512 tokens<br>Overlap: 50 tokens]
B -->|Editorial/review| F[Chunk: 512 tokens<br>Overlap: 50 tokens]
B -->|Review summary| G[Chunk: 128 tokens<br>Overlap: 0]
C --> H{Token budget<br>pressure?}
D --> H
E --> H
F --> H
G --> H
H -->|Normal| I[Use standard<br>chunk size]
H -->|High pressure| J[Reduce to min<br>chunk size]
H -->|Critical| K[Use single<br>best chunk only]
I --> L[Attach metadata:<br>source, ASIN, category,<br>last_updated]
J --> L
K --> L
Adaptive Chunker
import re
from dataclasses import dataclass
from typing import Optional
from enum import Enum
class ContentType(Enum):
PRODUCT_DESCRIPTION = "product_description"
FAQ_ARTICLE = "faq_article"
POLICY_DOCUMENT = "policy_document"
EDITORIAL = "editorial"
REVIEW_SUMMARY = "review_summary"
@dataclass
class ChunkConfig:
target_tokens: int
min_tokens: int
max_tokens: int
overlap_tokens: int
split_on_sentences: bool
preserve_metadata: bool
# Chunk configurations by content type — derived from production observations
CHUNK_CONFIGS = {
ContentType.PRODUCT_DESCRIPTION: ChunkConfig(
target_tokens=256, min_tokens=128, max_tokens=384,
overlap_tokens=25, split_on_sentences=True, preserve_metadata=True,
),
ContentType.FAQ_ARTICLE: ChunkConfig(
target_tokens=512, min_tokens=256, max_tokens=768,
overlap_tokens=50, split_on_sentences=True, preserve_metadata=True,
),
ContentType.POLICY_DOCUMENT: ChunkConfig(
target_tokens=512, min_tokens=256, max_tokens=768,
overlap_tokens=50, split_on_sentences=True, preserve_metadata=True,
),
ContentType.EDITORIAL: ChunkConfig(
target_tokens=512, min_tokens=256, max_tokens=768,
overlap_tokens=50, split_on_sentences=True, preserve_metadata=True,
),
ContentType.REVIEW_SUMMARY: ChunkConfig(
target_tokens=128, min_tokens=64, max_tokens=192,
overlap_tokens=0, split_on_sentences=False, preserve_metadata=True,
),
}
@dataclass
class Chunk:
content: str
token_count: int
chunk_index: int
total_chunks: int
content_type: ContentType
metadata: dict
class DynamicChunker:
"""Content-type aware chunker with adaptive sizing under token budget pressure."""
def chunk(
self,
text: str,
content_type: ContentType,
metadata: Optional[dict] = None,
budget_pressure: float = 0.0, # 0.0 = normal, 1.0 = critical
) -> list:
config = CHUNK_CONFIGS[content_type]
metadata = metadata or {}
# Adjust chunk size based on budget pressure
adjusted_target = self._adjust_for_pressure(config, budget_pressure)
if config.split_on_sentences:
chunks = self._sentence_aware_split(text, adjusted_target, config.overlap_tokens)
else:
chunks = self._token_split(text, adjusted_target, config.overlap_tokens)
        # Filter out fragments too small to be useful, but never drop the document
        # entirely when it produced only a single short chunk
        if len(chunks) > 1:
            chunks = [c for c in chunks if self._count_tokens(c) >= config.min_tokens]
return [
Chunk(
content=chunk_text,
token_count=self._count_tokens(chunk_text),
chunk_index=i,
total_chunks=len(chunks),
content_type=content_type,
metadata={**metadata, "chunk_index": i, "total_chunks": len(chunks)},
)
for i, chunk_text in enumerate(chunks)
]
def _adjust_for_pressure(self, config: ChunkConfig, pressure: float) -> int:
"""Reduce target chunk size under budget pressure."""
if pressure <= 0.0:
return config.target_tokens
if pressure >= 1.0:
return config.min_tokens
# Linear interpolation between target and min
return int(config.target_tokens - (config.target_tokens - config.min_tokens) * pressure)
def _sentence_aware_split(self, text: str, target_tokens: int, overlap_tokens: int) -> list:
"""Split on sentence boundaries, respecting target token count."""
sentences = re.split(r'(?<=[.!?])\s+', text)
chunks = []
current_chunk = []
current_tokens = 0
for sentence in sentences:
sentence_tokens = self._count_tokens(sentence)
if current_tokens + sentence_tokens > target_tokens and current_chunk:
chunks.append(" ".join(current_chunk))
# Overlap: keep last N tokens worth of sentences
if overlap_tokens > 0:
overlap_chunk = []
overlap_count = 0
for s in reversed(current_chunk):
s_tokens = self._count_tokens(s)
if overlap_count + s_tokens > overlap_tokens:
break
overlap_chunk.insert(0, s)
overlap_count += s_tokens
current_chunk = overlap_chunk
current_tokens = overlap_count
else:
current_chunk = []
current_tokens = 0
current_chunk.append(sentence)
current_tokens += sentence_tokens
if current_chunk:
chunks.append(" ".join(current_chunk))
return chunks
def _token_split(self, text: str, target_tokens: int, overlap_tokens: int) -> list:
"""Simple token-boundary split for content that doesn't benefit from sentence splitting."""
words = text.split()
chunks = []
i = 0
target_words = target_tokens # Approximate: 1 token ≈ 1 word for this rough split
        while i < len(words):
            end = min(i + target_words, len(words))
            chunks.append(" ".join(words[i:end]))
            if end == len(words):
                break  # stop at the tail; stepping back for overlap here would loop forever
            i = end - overlap_tokens if overlap_tokens > 0 else end
return chunks
def _count_tokens(self, text: str) -> int:
if not text:
return 0
return len(text) // 4
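A usage sketch for indexing a single product description under moderate budget pressure; the description, ASIN, and category values are illustrative:

chunker = DynamicChunker()
description = (
    "Vinland Saga Deluxe Edition Vol. 1 collects the opening arc of Makoto Yukimura's historical epic. "
    "Thorfinn grows up on an Icelandic farm listening to tales of Vinland, a warm and fertile land far to the west. "
    "After tragedy strikes his family, he spends his youth among Askeladd's war band chasing a duel he can never win. "
    "This oversized hardcover edition collects the first three volumes of the original release in one book. "
    "It includes more than 700 pages of story plus bonus art, creator interviews, and translation notes. "
    "Recommended for readers who enjoy grounded historical drama with long-form character development."
)
chunks = chunker.chunk(
    text=description,
    content_type=ContentType.PRODUCT_DESCRIPTION,
    metadata={"asin": "B0EXAMPLE1", "category": "seinen"},  # illustrative values
    budget_pressure=0.5,  # halfway between the 256-token target and the 128-token minimum
)
for c in chunks:
    print(c.chunk_index, c.token_count, c.metadata["asin"])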
3. Prompt Design Optimization for Token Efficiency
When the token budget is tight, the prompt itself must be engineered for density. This is not about writing shorter prompts — it is about structuring prompts so that the most important information occupies the most visible positions.
Prompt Compression Pipeline
graph LR
A[Raw Prompt<br>Components] --> B[Priority<br>Sorter]
B --> C[History<br>Compressor]
C --> D[Chunk<br>Selector]
D --> E[Context<br>Trimmer]
E --> F[Budget<br>Validator]
F -->|Within budget| G[Final Prompt]
F -->|Over budget| H[Escalate:<br>drop DROPPABLE,<br>compress more]
H --> E
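One way to realize the escalate-and-retry loop from the diagram is to re-run allocation with increasing budget pressure until the validator passes. A rough sketch, where rebuild_sections_at_pressure stands in for caller-provided re-chunking and re-compression logic and is not part of the modules in this file:

def assemble_with_escalation(manager, sections: dict, max_attempts: int = 3):
    """Retry prompt assembly, tightening VARIABLE and COMPRESSIBLE sections each round.
    Sketch only: rebuild_sections_at_pressure is a hypothetical caller-provided helper."""
    pressure = 0.0
    report = manager.allocate(sections)
    for _ in range(max_attempts):
        if not report.overflow_detected:
            return sections, report
        # Escalate: shrink RAG chunks and compress history harder before retrying.
        pressure = min(1.0, pressure + 0.5)
        sections = rebuild_sections_at_pressure(sections, pressure)  # hypothetical helper
        report = manager.allocate(sections)
    return sections, report  # still over budget; the caller should degrade gracefully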
Conversation History Compressor
import re
import logging
from dataclasses import dataclass
logger = logging.getLogger("mangaassist.content_handling")
@dataclass
class CompressedHistory:
text: str
token_count: int
original_turns: int
summarized_turns: int
preserved_turns: int
compression_ratio: float
class HistoryCompressor:
"""Compresses conversation history to fit within token budget.
Strategy:
1. Always preserve the most recent 2 turns (user + assistant) — these are the immediate context.
2. If history exceeds budget, summarize older turns into a condensed summary.
3. Preserve any turns that contain product ASINs, order IDs, or explicit preferences.
"""
def __init__(self, max_tokens: int = 700, preserve_recent: int = 2):
self.max_tokens = max_tokens
self.preserve_recent = preserve_recent
def compress(self, turns: list, budget_tokens: int) -> CompressedHistory:
"""Compress conversation history to fit within budget_tokens."""
effective_budget = min(budget_tokens, self.max_tokens)
# Calculate total tokens in full history
full_text = self._format_turns(turns)
full_tokens = self._count_tokens(full_text)
# If it fits, return as-is
if full_tokens <= effective_budget:
return CompressedHistory(
text=full_text,
token_count=full_tokens,
original_turns=len(turns),
summarized_turns=0,
preserved_turns=len(turns),
compression_ratio=1.0,
)
# Split into recent (preserved) and older (to be summarized)
recent_turns = turns[-self.preserve_recent:] if len(turns) > self.preserve_recent else turns
older_turns = turns[:-self.preserve_recent] if len(turns) > self.preserve_recent else []
recent_text = self._format_turns(recent_turns)
recent_tokens = self._count_tokens(recent_text)
# Budget remaining for summary of older turns
summary_budget = effective_budget - recent_tokens
if summary_budget <= 0:
# Even recent turns exceed budget — truncate to most recent turn only
last_turn = turns[-1:]
last_text = self._format_turns(last_turn)
logger.warning(
"History budget critical: only preserving last turn",
extra={"original_turns": len(turns), "budget": effective_budget},
)
return CompressedHistory(
text=last_text,
token_count=self._count_tokens(last_text),
original_turns=len(turns),
summarized_turns=len(turns) - 1,
preserved_turns=1,
compression_ratio=self._count_tokens(last_text) / full_tokens,
)
# Summarize older turns
summary = self._summarize_turns(older_turns, summary_budget)
combined_text = f"[Previous conversation summary]: {summary}\n\n{recent_text}"
combined_tokens = self._count_tokens(combined_text)
return CompressedHistory(
text=combined_text,
token_count=combined_tokens,
original_turns=len(turns),
summarized_turns=len(older_turns),
preserved_turns=len(recent_turns),
compression_ratio=round(combined_tokens / full_tokens, 2),
)
def _summarize_turns(self, turns: list, max_tokens: int) -> str:
"""Extract key information from older turns.
In production, this would call a fast summarization model or use the FM itself
with a summarization prompt. For cost, we use extractive summarization:
keep turns that contain ASINs, order IDs, or preference keywords.
"""
key_patterns = [
r'B0[A-Z0-9]{8}', # ASIN pattern
r'#?\d{3}-\d{7}', # Order ID pattern
r'(?:prefer|like|want|love|hate|looking for)', # Preference signals
]
important_turns = []
        for turn in turns:
            content = turn.get("content", "")
if any(re.search(pattern, content, re.IGNORECASE) for pattern in key_patterns):
important_turns.append(turn)
if important_turns:
summary_text = " | ".join(
f"{t['role']}: {t['content'][:100]}" for t in important_turns
)
else:
# No key signals — just note how many turns were summarized
summary_text = f"User discussed manga topics over {len(turns)} messages."
# Trim to budget
while self._count_tokens(summary_text) > max_tokens and len(summary_text) > 20:
summary_text = summary_text[:len(summary_text) * 3 // 4] # Trim 25% iteratively
return summary_text
def _format_turns(self, turns: list) -> str:
return "\n".join(f"{t['role']}: {t['content']}" for t in turns)
def _count_tokens(self, text: str) -> int:
if not text:
return 0
return len(text) // 4
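A usage sketch for a multi-turn return conversation, assuming turns come from the session store as role/content dictionaries; the messages and order number are illustrative:

compressor = HistoryCompressor(max_tokens=700, preserve_recent=2)
turns = [
    {"role": "user", "content": "My copy of Berserk Deluxe Vol. 3 arrived with a torn cover."},
    {"role": "assistant", "content": "Sorry to hear that. Could you share your order number?"},
    {"role": "user", "content": "It's order #123-4567890."},
    {"role": "assistant", "content": "Thanks, I can see that order. Would you prefer a replacement or a refund?"},
    {"role": "user", "content": "A replacement, please, if it ships this week."},
]

compressed = compressor.compress(turns, budget_tokens=60)
print(compressed.compression_ratio, compressed.summarized_turns, compressed.preserved_turns)
# The order-ID turn survives in the summary; the damage-description turn does not,
# which is exactly the failure mode examined in Scenario B below.
print(compressed.text)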
4. Truncation-Related Error Analysis
Silent truncation is the most dangerous content handling failure because no error is raised. The FM simply does not see the truncated content and produces a plausible-looking but incomplete answer.
Truncation Detection Pipeline
sequenceDiagram
participant Orchestrator
participant BudgetManager
participant TruncationDetector
participant CloudWatch
participant PagerDuty
Orchestrator->>BudgetManager: Allocate tokens for sections
BudgetManager-->>Orchestrator: BudgetReport
Orchestrator->>TruncationDetector: Validate report
alt No truncation
TruncationDetector-->>Orchestrator: PASS
else Soft truncation (COMPRESSIBLE/DROPPABLE)
TruncationDetector->>CloudWatch: Emit warning metric
TruncationDetector-->>Orchestrator: PASS with warning
else Hard truncation (FIXED/VARIABLE)
TruncationDetector->>CloudWatch: Emit error metric
TruncationDetector->>PagerDuty: Alert if rate > threshold
TruncationDetector-->>Orchestrator: FAIL — content incomplete
Orchestrator->>Orchestrator: Degrade gracefully:<br>use template response or<br>reduce scope
end
Truncation Detector
import logging
from dataclasses import dataclass
from enum import Enum
# Note: BudgetReport, consumed by analyze() below, is the dataclass returned by
# TokenBudgetManager.allocate() in section 1.
logger = logging.getLogger("mangaassist.content_handling")
class TruncationSeverity(Enum):
NONE = "none"
SOFT = "soft" # Compressible or droppable section affected
HARD = "hard" # Fixed or variable section affected — response quality impacted
@dataclass
class TruncationReport:
severity: TruncationSeverity
affected_sections: list
total_tokens_lost: int
recommendation: str
should_degrade: bool
class TruncationDetector:
"""Detects and classifies truncation events in prompt assembly.
Key insight: Not all truncation is equal.
- Dropping page context when it's not needed: harmless.
- Truncating RAG chunks that contain the answer: catastrophic.
- Compressing history in a product discovery flow: usually fine.
- Compressing history in a multi-turn return flow: dangerous (loses order context).
"""
# Intent-specific sensitivity: how much truncation matters for each section
SENSITIVITY_MAP = {
"product_discovery": {
"rag_chunks": "high",
"conversation_history": "low",
"page_context": "medium",
},
"product_question": {
"rag_chunks": "high",
"conversation_history": "medium",
"page_context": "high",
},
"recommendation": {
"rag_chunks": "high",
"conversation_history": "high", # User preferences are in history
"page_context": "medium",
},
"faq": {
"rag_chunks": "high",
"conversation_history": "low",
"page_context": "low",
},
"order_tracking": {
"rag_chunks": "low",
"conversation_history": "high", # Order context is in history
"page_context": "low",
},
"return_request": {
"rag_chunks": "medium",
"conversation_history": "high", # Order + issue context in history
"page_context": "low",
},
}
def analyze(self, budget_report: BudgetReport, intent: str) -> TruncationReport:
"""Analyze a BudgetReport for truncation severity based on intent context."""
sensitivity = self.SENSITIVITY_MAP.get(intent, {})
affected_sections = []
total_lost = 0
max_severity = TruncationSeverity.NONE
# Check truncation events
for event in budget_report.truncation_events:
section = event["section"]
dropped = event["dropped_tokens"]
section_sensitivity = sensitivity.get(section, "medium")
if section_sensitivity == "high":
severity = TruncationSeverity.HARD
elif section_sensitivity == "medium":
severity = TruncationSeverity.SOFT
else:
severity = TruncationSeverity.SOFT
affected_sections.append({
"section": section,
"tokens_lost": dropped,
"sensitivity": section_sensitivity,
"severity": severity.value,
})
total_lost += dropped
if severity.value == "hard":
max_severity = TruncationSeverity.HARD
elif severity.value == "soft" and max_severity == TruncationSeverity.NONE:
max_severity = TruncationSeverity.SOFT
# Check drop events
for event in budget_report.drop_events:
section = event["section"]
section_sensitivity = sensitivity.get(section, "low")
if section_sensitivity == "high":
max_severity = TruncationSeverity.HARD
affected_sections.append({
"section": section,
"tokens_lost": "all",
"sensitivity": section_sensitivity,
"severity": "hard",
})
# Generate recommendation
if max_severity == TruncationSeverity.HARD:
recommendation = (
f"HARD truncation detected for intent '{intent}'. "
f"High-sensitivity sections affected: {[s['section'] for s in affected_sections if s['sensitivity'] == 'high']}. "
f"Consider: (1) reducing chunk count, (2) using a more aggressive summarization, "
f"(3) falling back to template response for this intent."
)
elif max_severity == TruncationSeverity.SOFT:
recommendation = (
f"Soft truncation in low-sensitivity sections. "
f"Response quality likely unaffected for intent '{intent}'."
)
else:
recommendation = "No truncation detected. All sections within budget."
return TruncationReport(
severity=max_severity,
affected_sections=affected_sections,
total_tokens_lost=total_lost,
recommendation=recommendation,
should_degrade=max_severity == TruncationSeverity.HARD,
)
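A sketch of how the orchestrator might act on the analysis, assuming report is the BudgetReport from TokenBudgetManager.allocate() and invoke_model is a placeholder for the Bedrock call made elsewhere in the pipeline:

detector = TruncationDetector()
truncation = detector.analyze(budget_report=report, intent="return_request")

if truncation.should_degrade:
    # HARD truncation: do not send an incomplete prompt to the FM; fall back to a safe template.
    logger.error("Degrading return_request response: %s", truncation.recommendation)
    response = (
        "I want to make sure I have your complete order details before continuing. "
        "Could you confirm your order number?"
    )
else:
    response = invoke_model(assembled_prompt)  # placeholder for the Bedrock InvokeModel wrapper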
5. MangaAssist Scenarios
Scenario A: Long Manga Series FAQ Overflows Context Window
Context: A user asks "Tell me about all the volumes of One Piece." The product catalog has 100+ volumes. The RAG pipeline retrieves descriptions for the top 10 volumes, each ~300 tokens = 3,000 tokens for RAG chunks alone, blowing the 1,500-token RAG budget.
Detection: TokenBudgetManager.allocate() detects that rag_chunks requested 3,000 tokens but only 1,500 were available. TruncationDetector flags HARD truncation because rag_chunks has high sensitivity for product_question intent.
Resolution:
1. Immediate: Reduce retrieved chunks from 10 to 3 (selecting the most relevant volumes based on the query)
2. Systemic: Add a pre-retrieval filter that limits chunk count by intent: product_question → max 3 chunks, recommendation → max 4 chunks, faq → max 2 chunks (sketched below)
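A minimal sketch of that pre-retrieval filter, assuming the cap is passed to the retriever as its result limit; the per-intent limits are the ones listed above and the default is an assumption:

# Per-intent chunk caps applied before retrieval, so oversized result sets
# never reach prompt assembly in the first place.
MAX_CHUNKS_BY_INTENT = {
    "product_question": 3,
    "recommendation": 4,
    "faq": 2,
}
DEFAULT_MAX_CHUNKS = 3  # assumed fallback for intents not listed above

def max_chunks_for(intent: str) -> int:
    return MAX_CHUNKS_BY_INTENT.get(intent, DEFAULT_MAX_CHUNKS)

top_k = max_chunks_for("product_question")  # -> 3, passed to the retriever as its result limit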
Scenario B: Multi-Turn Return Conversation Exhausts History Budget
Context: A customer has been going back and forth about a damaged manga volume for 15 turns. The conversation history is 2,100 tokens. The history budget is 700 tokens. The HistoryCompressor summarizes the older 13 turns, but the summary drops the order ID and the specific damage description.
Detection: HistoryCompressor returns compression_ratio=0.33, and the subsequent FM response asks the user for their order number again, even though they already provided it in turn 3.
Resolution:
1. Immediate: Update HistoryCompressor._summarize_turns() to always preserve turns containing order IDs (regex pattern #?\d{3}-\d{7})
2. Systemic: For return_request and order_tracking intents, extract structured entities (order ID, ASIN, issue type) into a separate "key facts" section that always gets FIXED priority, as sketched below
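A hedged sketch of that key-facts extraction, reusing the ASIN and order-ID patterns from HistoryCompressor; the issue-keyword list and the key_facts section name are illustrative choices:

import re

def extract_key_facts(turns: list) -> str:
    """Collect order IDs, ASINs, and issue keywords from the full history so they can be
    placed in a FIXED-priority key_facts section that survives any history compression."""
    order_ids, asins, issues = set(), set(), set()
    for turn in turns:
        content = turn.get("content", "")
        order_ids.update(re.findall(r'#?\d{3}-\d{7}', content))
        asins.update(re.findall(r'B0[A-Z0-9]{8}', content))
        issues.update(  # illustrative keyword list; tune to the support taxonomy
            m.lower() for m in re.findall(r'damaged|torn|missing|wrong item|defective', content, re.IGNORECASE)
        )
    facts = []
    if order_ids:
        facts.append("Order IDs: " + ", ".join(sorted(order_ids)))
    if asins:
        facts.append("ASINs: " + ", ".join(sorted(asins)))
    if issues:
        facts.append("Reported issues: " + ", ".join(sorted(issues)))
    return "\n".join(facts)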
Scenario C: Japanese Content Tokenization Mismatch
Context: MangaAssist handles Japanese language content. Japanese text uses ~2 characters per token (compared to ~4 for English), so a product description that looks short in characters is actually 2x more tokens than expected.
Detection: TokenBudgetManager consistently reports overflow for Japanese-heavy RAG chunks. The _count_tokens approximation (len/4) underestimates Japanese content by 50%.
Resolution:
1. Immediate: Use a proper tokenizer (tiktoken or Anthropic's tokenizer) instead of character-based approximation
2. Systemic: Add a locale parameter to DynamicChunker that adjusts chunk sizes based on expected tokenization ratio (see the sketch below)
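A small sketch of the locale-aware counting adjustment, using the ~4 characters/token (English) and ~2 characters/token (Japanese) ratios described above; the ratios and locale keys are approximations:

# Approximate characters-per-token ratios; actual values vary by tokenizer and content mix.
CHARS_PER_TOKEN_BY_LOCALE = {
    "en": 4,
    "ja": 2,
}

def count_tokens_by_locale(text: str, locale: str = "en") -> int:
    """Character-based token estimate adjusted for locale; prefer a real tokenizer when available."""
    if not text:
        return 0
    return len(text) // CHARS_PER_TOKEN_BY_LOCALE.get(locale, 4)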
6. CloudWatch Dashboard and Alerts
Key Metrics
| Metric | Namespace | Threshold | Alert Action |
|---|---|---|---|
| TokenBudgetUtilization | MangaAssist/ContentHandling | P95 > 90% | Warn: investigate prompt size growth |
| TokenBudgetOverflow | MangaAssist/ContentHandling | > 1% of requests | Page: content regularly truncated |
| TokenTruncationDropped | MangaAssist/ContentHandling | P95 > 200 tokens | Warn: significant information loss |
| SectionTokenUsage (by section) | MangaAssist/ContentHandling | Sudden change > 20% | Warn: section size drift |
| HistoryCompressionRatio | MangaAssist/ContentHandling | P95 < 0.25 | Warn: history heavily compressed |
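As an example of wiring one of these thresholds into an alarm, a sketch for TokenBudgetOverflow using boto3; the alarm name, intent dimension, evaluation window, and SNS topic ARN are assumptions:

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")

# The metric is emitted per request as 0/1 with Intent and Service dimensions,
# so the Average statistic approximates the fraction of requests with overflow.
# One alarm is needed per Intent value because alarms match exact dimension sets.
cloudwatch.put_metric_alarm(
    AlarmName="mangaassist-token-budget-overflow-recommendation",  # illustrative name
    Namespace="MangaAssist/ContentHandling",
    MetricName="TokenBudgetOverflow",
    Dimensions=[
        {"Name": "Intent", "Value": "recommendation"},
        {"Name": "Service", "Value": "MangaAssist"},
    ],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=0.01,  # > 1% of requests overflowing
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:ap-northeast-1:123456789012:mangaassist-oncall"],  # placeholder ARN
)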
CloudWatch Logs Insights Queries
Find requests where RAG chunks were truncated:
fields @timestamp, @message
| filter section = "rag_chunks" and overflow_type = "variable_truncation"
| stats count(*) as truncation_count by intent
| sort truncation_count desc
| limit 20
Track token budget utilization by intent over time:
fields @timestamp, intent, utilization_pct
| filter metric_name = "TokenBudgetUtilization"
| stats avg(utilization_pct) as avg_util, max(utilization_pct) as max_util, pct(utilization_pct, 95) as p95_util by intent, bin(1h) as hour
| sort hour desc
Identify sessions with heavy history compression:
fields @timestamp, session_id, compression_ratio, original_turns
| filter metric_name = "HistoryCompression" and compression_ratio < 0.3
| sort compression_ratio asc
| limit 50
Intuition Gained
What Mental Model You Build
Working through content handling troubleshooting teaches you to think in token budgets the way a systems engineer thinks in memory budgets. Every piece of information has a cost, and the art is fitting the most useful information into a fixed-size window.
You develop three core instincts:
1. The Budget Intuition: You start to "feel" when a prompt is too large before you count tokens. When you see a long conversation history, a complex product query, and multiple RAG chunks — you know immediately that something will be dropped. This intuition lets you design systems that prevent overflow rather than react to it.
2. The Priority Intuition: Not all context is equally important, and importance depends on intent. A product question needs the product data more than the conversation history. A return request needs the conversation history (with the order context) more than the page context. You learn to build systems that make these priority decisions automatically.
3. The Silent Failure Intuition: You develop a healthy paranoia about "invisible" failures. The FM will not tell you it did not see the full context. It will generate a plausible response from whatever it received. The only way to detect silent truncation is to measure it proactively. This instinct carries over to every system you build: you learn to instrument the things that fail silently, not just the things that throw errors.
How This Intuition Guides Future Decisions
- When designing a new GenAI feature: You start by calculating the token budget before writing a single prompt. You ask "How many tokens does each context source need?" before "What should the prompt say?"
- When evaluating new models: You compare not just quality but context window efficiency. A model with a 128K window but terrible long-context accuracy is worse than a model with 32K that uses every token effectively.
- When debugging user complaints: Your first question is "What did the model actually see?" — not "What did the model say?" This traces the problem to the input, not the output, and is faster.
- When scaling to new domains: You know that token budgets are domain-dependent. Medical documents are denser than product descriptions. Legal text needs more context than FAQs. You build configurable budgets, not hardcoded ones.