Prompt Compression and Context Pruning for GenAI Applications
MangaAssist context: JP Manga store chatbot on AWS — Bedrock Claude 3 (Sonnet at $3/$15 per 1M tokens input/output, Haiku at $0.25/$1.25), OpenSearch Serverless (vector store), DynamoDB (sessions/products), ECS Fargate (orchestrator), API Gateway WebSocket, ElastiCache Redis. Target: useful answer in under 3 seconds, 1M messages/day scale.
Skill Mapping
| Certification | Task | Skill | This File |
|---|---|---|---|
| AWS AIP-C01 | Task 4.1 — Optimize cost and performance of FM applications | Skill 4.1.1 — Design token efficiency systems for FM-powered applications | Deep-dive into prompt compression algorithms, context pruning strategies, response size controls, MangaAssist-specific examples |
Skill scope: Detailed implementation of the compression and pruning subsystems introduced in 01-token-efficiency-architecture.md. This file covers the algorithms, before/after examples, and production code for reducing input and output tokens without degrading answer quality.
Mind Map — Compression and Pruning Taxonomy
mindmap
  root((Compression<br/>& Pruning))
    Prompt Compression
      Rule-Based
        Whitespace normalization
        Instruction deduplication
        Boilerplate removal
      Structural
        Template variable injection
        JSON compaction
        Few-shot selection
      Algorithmic
        LLMLingua perplexity pruning
        Token-level importance scoring
        Entropy-based filtering
      Japanese-Specific
        CJK character density
        Furigana handling
        Mixed-script optimization
    Context Pruning
      RAG Chunk Pruning
        Relevance score threshold
        Token budget cap
        Diversity filtering
        Chunk deduplication
      Conversation History
        Sliding window
        Summarization of old turns
        Entity extraction retention
        Recency weighting
    Response Size Control
      Input Controls
        Per-intent max_tokens config
        Dynamic budget based on query complexity
      Output Controls
        Streaming token counter
        Graceful truncation
        Follow-up prompt injection
      Feedback Loop
        Actual vs budget tracking
        Auto-calibration of limits
Prompt Compression Algorithms — Deep Dive
The Compression Pipeline
Every MangaAssist prompt passes through a multi-stage compression pipeline before reaching Bedrock. The pipeline is ordered from cheapest (zero-cost, rule-based) to most expensive (requires model call). It short-circuits as soon as the token target is met.
sequenceDiagram
participant O as Orchestrator
participant E as TokenEstimator
participant C as PromptCompressor
participant P as ContextPruner
participant R as ResponseSizeController
participant B as Bedrock
O->>E: estimate_prompt(system, user, history, rag)
E-->>O: estimated_tokens = 2,400
O->>O: check_budget(recommendation) → max 2,000
Note over O: Over budget by 400 tokens
O->>C: compress(system_prompt, target=200)
Note over C: Stage 1: whitespace cleanup → saves 30
Note over C: Stage 2: instruction dedup → saves 80
Note over C: Stage 3: template variable injection → saves 120
C-->>O: system_prompt compressed (230 tokens saved)
O->>P: prune_rag_chunks(chunks, budget=600)
Note over P: 8 chunks → relevance filter → 4 chunks
Note over P: Token cap → 3 chunks kept
P-->>O: rag_context pruned (170 tokens saved)
O->>P: trim_history(history, budget=400)
Note over P: 12 turns → keep last 3 verbatim
Note over P: Summarize turns 1-9 → 80 tokens
P-->>O: history trimmed
O->>E: re-estimate → 1,950 tokens (within budget)
O->>R: configure(intent=recommendation)
R-->>O: max_output_tokens = 600
O->>B: invoke_model(prompt, max_tokens=600)
B-->>O: stream response chunks
O->>R: enforce_budget(stream, max=600)
R-->>O: controlled stream to client
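The ordering and short-circuit behavior above can be sketched as a small driver loop. This is an illustrative stand-in, not the production pipeline: the two stage functions and the 4-characters-per-token estimate are assumptions for the sketch.

```python
# Minimal sketch of the short-circuiting compression pipeline: stages
# are ordered cheapest-first, and the loop stops as soon as the token
# target is met, so pricier stages run only when actually needed.

def whitespace_cleanup(text: str) -> str:
    """Stage 1: collapse runs of whitespace (zero-cost, rule-based)."""
    return " ".join(text.split())

def instruction_dedup(text: str) -> str:
    """Stage 2: drop repeated sentences, keeping the first occurrence."""
    seen, kept = set(), []
    for sentence in text.split(". "):
        key = sentence.strip().lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(sentence.strip())
    return ". ".join(kept)

def estimate_tokens(text: str) -> int:
    """Crude proxy: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def run_pipeline(text: str, target_tokens: int) -> str:
    """Run stages cheapest-first; short-circuit once the target is met."""
    stages = [whitespace_cleanup, instruction_dedup]  # ordered by cost
    for stage in stages:
        if estimate_tokens(text) <= target_tokens:
            break  # target met: skip the remaining (pricier) stages
        text = stage(text)
    return text

prompt = "Be  concise.  Be concise. Always   cite prices in JPY."
print(run_pipeline(prompt, target_tokens=8))
# → Be concise. Always cite prices in JPY.
```

In production the stage list would continue into the LLM-based methods described next, but the control flow is the same: each stage runs only if the estimate is still over target.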
LLMLingua-Style Compression for MangaAssist
LLMLingua is a prompt compression technique that removes low-perplexity tokens — tokens a small language model can easily predict from the surrounding context. The insight: if the model would predict a token anyway, removing it preserves most of the meaning while saving cost.
How It Works
flowchart TD
A[Original Prompt Text] --> B[Tokenize with small LM]
B --> C[Compute per-token perplexity]
C --> D[Rank tokens by perplexity]
D --> E{Token perplexity > threshold?}
E -->|High perplexity — informative| F[Keep token]
E -->|Low perplexity — predictable| G[Remove token]
F --> H[Reassemble compressed prompt]
G --> H
H --> I[Verify token count meets target]
I -->|Yes| J[Return compressed prompt]
I -->|No| K[Lower threshold and repeat]
K --> E
style F fill:#c8e6c9
style G fill:#ffcdd2
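The keep/remove decision above can be sketched without a neural model. Real LLMLingua scores tokens with a small causal LM; in this sketch a unigram-frequency surprisal over a toy reference corpus stands in for per-token perplexity, so the reference text, smoothing constant, and `keep_ratio` are all illustrative assumptions.

```python
import math
from collections import Counter

# Toy "language model": word frequencies from a reference corpus.
REFERENCE = ("the a is are and to of in for with manga volume " * 50).split()
FREQS = Counter(REFERENCE)
TOTAL = sum(FREQS.values())

def surprisal(word: str) -> float:
    """-log p(word) under the unigram model; unseen words score highest."""
    p = FREQS.get(word.lower(), 0.5) / TOTAL  # 0.5 = smoothing for unseen
    return -math.log(p)

def prune_tokens(words: list[str], keep_ratio: float) -> list[str]:
    """Keep the most informative fraction of tokens, preserving order."""
    k = max(1, int(len(words) * keep_ratio))
    # sorted() is stable, so ties keep their original order
    ranked = sorted(range(len(words)),
                    key=lambda i: surprisal(words[i]), reverse=True)
    keep = set(ranked[:k])
    return [w for i, w in enumerate(words) if i in keep]

text = "the latest volume of Chainsaw Man is in stock for 528 JPY".split()
print(" ".join(prune_tokens(text, keep_ratio=0.6)))
```

Content-bearing words (title, price) survive because the toy model finds them surprising; frequent function words are dropped first — the same principle the perplexity-based method applies with a real LM.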
MangaAssist Application
In MangaAssist, LLMLingua-style compression is applied only to specific prompt components where token removal is safe:
| Component | Apply LLMLingua? | Rationale |
|---|---|---|
| System instructions | Yes — cautiously | Instructions are mostly predictable boilerplate |
| RAG manga descriptions | Yes — aggressively | Descriptions have high redundancy (genre, publisher patterns) |
| Conversation history (English) | Yes — moderately | English conversational text has predictable filler |
| Conversation history (Japanese) | No | Japanese tokens carry more semantic density per token; removing even one kanji can change meaning |
| User message | Never | Sacred text — never compress the user's actual question |
| Product names / ISBNs / prices | Never | Entity corruption is catastrophic for e-commerce |
Japanese Content — Why Compression Is Dangerous
Japanese text has fundamentally different compression characteristics than English:
| Property | English | Japanese |
|---|---|---|
| Characters per token | ~4.5 | ~1-2 (each kanji carries far more information) |
| Token predictability | High for function words (the, is, a) | Low — most tokens are content-bearing |
| Safe removal candidates | Determiners, articles, filler | Very few — even particles (は, が, の) change meaning |
| Compression ratio achievable | 40-60% | 10-20% at best (quality degrades quickly) |
| Risk of semantic corruption | Moderate | High — removing one kanji can change the title entirely |
Example of dangerous Japanese compression:
| Original | 「鬼滅の刃」の最新巻はいつ発売されますか? |
|---|---|
| Meaning | "When will the latest volume of Demon Slayer be released?" |
| Naive compression | 「鬼滅刃」最新巻いつ発売? |
| Problem | 鬼滅の刃 (Kimetsu no Yaiba) becomes 鬼滅刃 — the の particle is not redundant; the model may misidentify the title |
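A cheap guard against exactly this failure is to measure the share of CJK characters and skip compression when it is high. A minimal sketch, where the 30% threshold is an illustrative choice rather than a measured one:

```python
def cjk_ratio(text: str) -> float:
    """Fraction of characters in CJK ranges (kanji plus kana)."""
    if not text:
        return 0.0
    cjk = sum(
        1 for ch in text
        if "\u3040" <= ch <= "\u30ff"   # hiragana + katakana
        or "\u4e00" <= ch <= "\u9fff"   # CJK unified ideographs
    )
    return cjk / len(text)

def safe_to_compress(text: str, max_cjk_ratio: float = 0.3) -> bool:
    """Only compress text that is mostly non-CJK."""
    return cjk_ratio(text) < max_cjk_ratio

print(safe_to_compress("Please summarize the latest shipping policy."))  # True
print(safe_to_compress("「鬼滅の刃」の最新巻はいつ発売されますか?"))  # False
```

This is why the component table above routes Japanese conversation history around the compressor entirely: the guard makes that policy enforceable per-string rather than per-field.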
Context Pruning Strategies
RAG Chunk Relevance Scoring
OpenSearch returns chunks sorted by vector similarity, but raw similarity scores are not sufficient for token-efficient pruning. MangaAssist uses a composite relevance score.
flowchart TD
subgraph Scoring["Composite Relevance Score"]
A[OpenSearch Chunk] --> B[Vector Similarity Score]
A --> C[Keyword Match Score]
A --> D[Recency Score]
A --> E[Entity Overlap Score]
B --> F["weight: 0.4"]
C --> G["weight: 0.2"]
D --> H["weight: 0.1"]
E --> I["weight: 0.3"]
F --> J[Composite Score]
G --> J
H --> J
I --> J
end
J --> K{Score > threshold?}
K -->|Yes| L{Cumulative tokens < budget?}
K -->|No| M[Discard chunk]
L -->|Yes| N[Include in prompt]
L -->|No| O[Stop — budget exhausted]
style N fill:#c8e6c9
style M fill:#ffcdd2
style O fill:#fff9c4
ContextPruner — Production Class
import time

import numpy as np
from dataclasses import dataclass
from typing import Optional


@dataclass
class RAGChunk:
    """A single chunk retrieved from OpenSearch."""
    chunk_id: str
    text: str
    vector_score: float         # cosine similarity from OpenSearch
    source_document: str        # e.g., "one_piece_vol104_desc"
    timestamp: Optional[float]  # when the source was last updated
    entities: list[str]         # extracted entities: titles, authors, ISBNs


@dataclass
class PruneResult:
    """Output of the pruning operation."""
    kept_chunks: list[RAGChunk]
    discarded_chunks: list[RAGChunk]
    kept_token_count: int
    discarded_token_count: int
    total_original_tokens: int


class ContextPruner:
    """Prunes RAG chunks and conversation history to fit token budgets.

    MangaAssist-specific design decisions:
    - Entity overlap is weighted heavily (0.3) because product names
      and order IDs are critical for correct answers.
    - Recency is weighted lightly (0.1) because manga catalog data
      changes slowly (weekly catalog updates).
    - Duplicate content from the same source document is removed
      to prevent the model from over-indexing on one product.
    """

    def __init__(self, estimator: "TokenEstimator",
                 relevance_threshold: float = 0.45,
                 weights: Optional[dict] = None):
        self.estimator = estimator
        self.relevance_threshold = relevance_threshold
        self.weights = weights or {
            "vector": 0.4,
            "keyword": 0.2,
            "recency": 0.1,
            "entity": 0.3,
        }

    def score_chunk(self, chunk: RAGChunk, query_text: str,
                    query_entities: list[str],
                    current_time: float) -> float:
        """Compute composite relevance score for a single chunk.

        Returns a float in [0.0, 1.0].
        """
        # Component 1: Vector similarity (already 0-1 from OpenSearch)
        vector_score = chunk.vector_score

        # Component 2: Keyword match (exact token overlap ratio)
        query_tokens = set(query_text.lower().split())
        chunk_tokens = set(chunk.text.lower().split())
        if query_tokens:
            keyword_score = len(query_tokens & chunk_tokens) / len(query_tokens)
        else:
            keyword_score = 0.0

        # Component 3: Recency (exponential decay, half-life = 7 days)
        if chunk.timestamp:
            age_days = (current_time - chunk.timestamp) / 86400
            recency_score = np.exp(-0.099 * age_days)  # ln(2)/7 ≈ 0.099
        else:
            recency_score = 0.5  # default for chunks without timestamps

        # Component 4: Entity overlap (critical for MangaAssist)
        if query_entities:
            query_ent_set = set(e.lower() for e in query_entities)
            chunk_ent_set = set(e.lower() for e in chunk.entities)
            entity_score = len(query_ent_set & chunk_ent_set) / len(query_ent_set)
        else:
            entity_score = 0.0

        # Weighted composite
        composite = (
            self.weights["vector"] * vector_score +
            self.weights["keyword"] * keyword_score +
            self.weights["recency"] * recency_score +
            self.weights["entity"] * entity_score
        )
        return min(composite, 1.0)

    def prune_rag_chunks(self, chunks: list[RAGChunk],
                         query_text: str,
                         query_entities: list[str],
                         token_budget: int,
                         current_time: Optional[float] = None) -> PruneResult:
        """Prune RAG chunks to fit within token budget.

        Strategy:
        1. Score all chunks with composite relevance
        2. Filter by relevance threshold
        3. Deduplicate by source document (keep highest-scoring)
        4. Fill token budget in score order (greedy knapsack)
        """
        if current_time is None:
            current_time = time.time()

        # Step 1: Score all chunks
        scored = []
        for chunk in chunks:
            score = self.score_chunk(chunk, query_text, query_entities,
                                     current_time)
            scored.append((chunk, score))

        # Step 2: Filter by threshold
        above_threshold = [(c, s) for c, s in scored
                           if s >= self.relevance_threshold]
        below_threshold = [c for c, s in scored
                           if s < self.relevance_threshold]

        # Step 3: Deduplicate by source document
        best_per_source: dict[str, tuple[RAGChunk, float]] = {}
        dedup_discarded = []
        for chunk, score in above_threshold:
            source = chunk.source_document
            if source not in best_per_source or score > best_per_source[source][1]:
                if source in best_per_source:
                    dedup_discarded.append(best_per_source[source][0])
                best_per_source[source] = (chunk, score)
            else:
                dedup_discarded.append(chunk)

        # Step 4: Sort by score descending, fill budget
        sorted_chunks = sorted(best_per_source.values(),
                               key=lambda x: x[1], reverse=True)
        kept = []
        discarded = list(below_threshold) + dedup_discarded
        running_tokens = 0
        for chunk, score in sorted_chunks:
            chunk_tokens = self.estimator.count_tokens(chunk.text)
            if running_tokens + chunk_tokens <= token_budget:
                kept.append(chunk)
                running_tokens += chunk_tokens
            else:
                discarded.append(chunk)

        # Calculate total original tokens
        total_original = sum(
            self.estimator.count_tokens(c.text) for c in chunks
        )
        discarded_tokens = sum(
            self.estimator.count_tokens(c.text) for c in discarded
        )
        return PruneResult(
            kept_chunks=kept,
            discarded_chunks=discarded,
            kept_token_count=running_tokens,
            discarded_token_count=discarded_tokens,
            total_original_tokens=total_original,
        )

    def prune_conversation_history(self, history: list[dict],
                                   token_budget: int,
                                   keep_recent: int = 3,
                                   entity_keywords: Optional[list[str]] = None
                                   ) -> list[dict]:
        """Prune conversation history with recency weighting.

        Strategy:
        - Always keep the last `keep_recent` turns verbatim
        - For older turns, keep only those containing entity keywords
          (product names, order IDs, prices)
        - Summarize the remaining old turns into a compact block

        Args:
            history: List of {"role": str, "content": str} turn dicts
            token_budget: Maximum total tokens for the history
            keep_recent: Number of recent turns to keep verbatim
            entity_keywords: Keywords that mark a turn as worth keeping
                (e.g., ["One Piece", "ORD-12345", "Vol."])
        """
        if not history:
            return []
        entity_keywords = entity_keywords or []
        entity_set = set(kw.lower() for kw in entity_keywords)

        # Split into recent (keep) and older (prune candidates)
        recent = history[-keep_recent:] if len(history) > keep_recent else history
        older = history[:-keep_recent] if len(history) > keep_recent else []

        # Check if recent turns alone exceed budget
        recent_tokens = sum(
            self.estimator.count_tokens(t.get("content", ""))
            for t in recent
        )
        if recent_tokens >= token_budget:
            # Even recent turns are too large — keep only last 2
            return history[-2:]
        remaining_budget = token_budget - recent_tokens

        # From older turns, extract entity-bearing turns
        entity_turns = []
        summary_turns = []
        for turn in older:
            content_lower = turn.get("content", "").lower()
            if any(kw in content_lower for kw in entity_set):
                entity_turns.append(turn)
            else:
                summary_turns.append(turn)

        # Fit entity turns within remaining budget (reserve 40% for summary)
        kept_entity_turns = []
        entity_tokens_used = 0
        for turn in reversed(entity_turns):  # most recent entity turns first
            turn_tokens = self.estimator.count_tokens(turn.get("content", ""))
            if entity_tokens_used + turn_tokens <= remaining_budget * 0.6:
                kept_entity_turns.insert(0, turn)
                entity_tokens_used += turn_tokens

        # Summarize the rest
        summary_budget = remaining_budget - entity_tokens_used
        if summary_turns and summary_budget > 50:
            topics = []
            for turn in summary_turns:
                content = turn.get("content", "")
                if len(content) > 40:
                    topics.append(content[:40].strip() + "...")
                elif content:
                    topics.append(content.strip())
            summary_text = (
                f"[Earlier: {len(summary_turns)} turns discussing: "
                f"{'; '.join(topics[:3])}]"
            )
            summary_turn = {"role": "system", "content": summary_text}
            return [summary_turn] + kept_entity_turns + recent
        else:
            return kept_entity_turns + recent
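Step 4 of `prune_rag_chunks` (the greedy budget fill) is the part most worth internalizing, so here it is isolated as a self-contained sketch. The whitespace-split token count is a stand-in for the real `TokenEstimator`, and the sample chunks are invented:

```python
def greedy_fill(scored_chunks: list[tuple[str, float]],
                token_budget: int) -> tuple[list[str], int]:
    """Fill the token budget in descending score order.

    Returns (kept texts, tokens used). Chunks that do not fit are
    skipped, not truncated — a partial chunk is rarely useful context.
    """
    kept, used = [], 0
    for text, _score in sorted(scored_chunks, key=lambda x: x[1], reverse=True):
        tokens = len(text.split())  # stand-in for TokenEstimator.count_tokens
        if used + tokens <= token_budget:
            kept.append(text)
            used += tokens
        # else: skip and keep scanning — a smaller chunk may still fit
    return kept, used

chunks = [
    ("Jujutsu Kaisen vol 26 in stock 528 JPY", 0.92),
    ("Chainsaw Man part two ongoing 528 JPY", 0.89),
    ("Publishing industry overview for 2024", 0.38),
]
kept, used = greedy_fill(chunks, token_budget=15)
print(len(kept), used)  # → 2 15
```

Note the loop continues past a chunk that overflows the budget: like the production method, it may still admit a smaller, lower-scoring chunk that fits.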
Response Size Controller — Production Class
import json
import logging
from dataclasses import dataclass
from typing import AsyncIterator, Optional

import tiktoken

logger = logging.getLogger("mangaassist.response_control")


@dataclass
class ResponseBudgetConfig:
    """Per-intent response size configuration."""
    max_output_tokens: int
    hard_ceiling: int               # absolute maximum, even if intent config is wrong
    truncation_message: str         # appended when truncating
    allow_graceful_extension: bool  # allow 10% over for sentence completion
    min_tokens_for_useful: int      # below this, the response is useless


# Per-intent response configurations for MangaAssist
RESPONSE_CONFIGS = {
    "product_search": ResponseBudgetConfig(
        max_output_tokens=400,
        hard_ceiling=500,
        truncation_message=(
            "\n\n---\n*More results available. "
            "Ask me to narrow your search.*"
        ),
        allow_graceful_extension=True,
        min_tokens_for_useful=80,
    ),
    "order_status": ResponseBudgetConfig(
        max_output_tokens=200,
        hard_ceiling=250,
        truncation_message=(
            "\n\n---\n*For full order details, "
            "visit your order history page.*"
        ),
        allow_graceful_extension=False,  # order info should be concise
        min_tokens_for_useful=40,
    ),
    "recommendation": ResponseBudgetConfig(
        max_output_tokens=600,
        hard_ceiling=750,
        truncation_message=(
            "\n\n---\n*I have more recommendations! "
            "Tell me which genre interests you most.*"
        ),
        allow_graceful_extension=True,
        min_tokens_for_useful=150,
    ),
    "manga_qa": ResponseBudgetConfig(
        max_output_tokens=500,
        hard_ceiling=600,
        truncation_message=(
            "\n\n---\n*There's more to say on this topic. "
            "Ask a follow-up question for details.*"
        ),
        allow_graceful_extension=True,
        min_tokens_for_useful=100,
    ),
    "chitchat": ResponseBudgetConfig(
        max_output_tokens=150,
        hard_ceiling=200,
        truncation_message="",  # chitchat truncation is invisible
        allow_graceful_extension=False,
        min_tokens_for_useful=20,
    ),
}


@dataclass
class StreamingStats:
    """Statistics collected during response streaming."""
    total_output_tokens: int
    was_truncated: bool
    truncation_point_tokens: int
    chunks_delivered: int
    chunks_dropped: int


class ResponseSizeController:
    """Controls response size during streaming from Bedrock.

    MangaAssist uses WebSocket streaming for real-time delivery. This
    controller sits between the Bedrock stream and the WebSocket, counting
    tokens in real time and enforcing per-intent output budgets.

    Key design decisions:
    - Graceful extension: if allow_graceful_extension is True, we allow
      up to 10% over budget to finish the current sentence. This prevents
      mid-word truncation.
    - Truncation message: intent-specific messages guide the user to ask
      follow-ups rather than leaving them with an abrupt cutoff.
    - Hard ceiling: absolute maximum that prevents any runaway generation,
      even if configuration is wrong.
    """

    def __init__(self):
        self.encoding = tiktoken.get_encoding("cl100k_base")
        self.last_stats: Optional[StreamingStats] = None

    def get_config(self, intent: str) -> ResponseBudgetConfig:
        """Get response budget config for an intent, with fallback."""
        return RESPONSE_CONFIGS.get(
            intent,
            ResponseBudgetConfig(
                max_output_tokens=400,
                hard_ceiling=500,
                truncation_message="\n\n---\n*Response trimmed.*",
                allow_graceful_extension=True,
                min_tokens_for_useful=80,
            ),
        )

    async def stream_with_budget(
        self,
        bedrock_stream: AsyncIterator[dict],
        intent: str,
        session_id: str,
    ) -> AsyncIterator[str]:
        """Stream Bedrock response chunks while enforcing token budget.

        Yields:
            str: text chunks to forward to the WebSocket client

        The caller should collect StreamingStats from self.last_stats
        after the stream completes.
        """
        config = self.get_config(intent)
        token_count = 0
        chunk_count = 0
        dropped_count = 0
        truncated = False
        budget = config.max_output_tokens
        grace_budget = int(budget * 1.1) if config.allow_graceful_extension else budget

        async for event in bedrock_stream:
            chunk_text = self._extract_text(event)
            if not chunk_text:
                continue
            chunk_tokens = len(self.encoding.encode(chunk_text))

            # Hard ceiling — never exceed this
            if token_count + chunk_tokens > config.hard_ceiling:
                truncated = True
                dropped_count += 1
                logger.warning(
                    "Hard ceiling hit",
                    extra={
                        "session_id": session_id,
                        "intent": intent,
                        "tokens_at_ceiling": token_count,
                        "hard_ceiling": config.hard_ceiling,
                    },
                )
                if config.truncation_message:
                    yield config.truncation_message
                break

            # Soft budget — allow graceful extension for sentence completion
            if token_count >= budget:
                if token_count + chunk_tokens <= grace_budget:
                    # In grace zone — deliver, then stop at a sentence boundary
                    yield chunk_text
                    token_count += chunk_tokens
                    chunk_count += 1
                    if self._ends_sentence(chunk_text):
                        truncated = True
                        if config.truncation_message:
                            yield config.truncation_message
                        break
                else:
                    truncated = True
                    dropped_count += 1
                    if config.truncation_message:
                        yield config.truncation_message
                    break
            else:
                yield chunk_text
                token_count += chunk_tokens
                chunk_count += 1

        self.last_stats = StreamingStats(
            total_output_tokens=token_count,
            was_truncated=truncated,
            truncation_point_tokens=token_count if truncated else 0,
            chunks_delivered=chunk_count,
            chunks_dropped=dropped_count,
        )

    def _extract_text(self, event: dict) -> str:
        """Extract text from a Bedrock streaming event."""
        # Bedrock converse stream format
        if "contentBlockDelta" in event:
            delta = event["contentBlockDelta"].get("delta", {})
            return delta.get("text", "")
        # Bedrock invoke_model_with_response_stream format
        chunk = event.get("chunk", {})
        if "bytes" in chunk:
            body = json.loads(chunk["bytes"])
            return body.get("delta", {}).get("text", "")
        return ""

    def _ends_sentence(self, text: str) -> bool:
        """Check if text ends at a natural sentence boundary."""
        if text.endswith("\n"):  # check before rstrip() strips the newline
            return True
        stripped = text.rstrip()
        return stripped.endswith((".", "!", "?", "。", "!", "?"))
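The soft-budget and grace-zone behavior is easiest to see in isolation. The following is a stripped-down, dependency-free demo: whitespace tokens stand in for tiktoken, plain strings stand in for Bedrock events, and the budget numbers are invented for the example.

```python
import asyncio

async def fake_stream(chunks):
    """Stand-in for a Bedrock response stream: yields plain text chunks."""
    for c in chunks:
        yield c

async def stream_with_budget(stream, budget, grace_factor=1.3,
                             truncation_msg="[truncated]"):
    """Simplified budget enforcement: whitespace tokens, sentence-aware stop."""
    grace = int(budget * grace_factor)
    count, out = 0, []
    async for chunk in stream:
        tokens = len(chunk.split())
        if count >= budget:
            # Grace zone: deliver only if it fits, then stop at a sentence end
            if count + tokens <= grace:
                out.append(chunk)
                count += tokens
                if chunk.rstrip().endswith((".", "!", "?", "。")):
                    out.append(truncation_msg)
                    break
            else:
                out.append(truncation_msg)
                break
        else:
            out.append(chunk)
            count += tokens
    return out

chunks = ["One Piece is ", "a long running ", "series with ",
          "many volumes.", "Extra text beyond the budget."]
result = asyncio.run(stream_with_budget(fake_stream(chunks), budget=8))
print(result)
# → ['One Piece is ', 'a long running ', 'series with ', 'many volumes.', '[truncated]']
```

The chunk that completes the sentence is admitted under the grace allowance; everything after the sentence boundary is cut, which is exactly the "finish the thought, then stop" behavior the production class implements.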
MangaAssist-Specific Compression Examples
Example 1: Manga Recommendation Query
User query: "I loved Demon Slayer and My Hero Academia. Can you recommend similar shonen manga?"
Before Compression (2,400 tokens total)
System Prompt (380 tokens):
You are MangaAssist, a helpful, friendly, and knowledgeable assistant
for a Japanese manga e-commerce store. Your role is to help customers
find manga they will love, check order status, track shipments, and
provide personalized recommendations based on their reading history and
preferences.
Always respond in a friendly and approachable manner. Make sure to include
prices in both JPY and USD. Always include manga titles in both Japanese
and English translations. If you are unsure about any information, be
honest and say so. Never fabricate information about manga titles, pricing,
availability, or shipping details.
When making recommendations, consider the customer's stated preferences,
their purchase history, and popular titles in similar genres. Aim to
suggest 3-5 titles with brief justifications for each recommendation.
Format your response with clear headers and bullet points for readability.
Remember to check availability before recommending any title.
Be concise but thorough in your explanations.
RAG Context — 8 chunks from OpenSearch (1,200 tokens):
Chunk 1 (score: 0.92): "Jujutsu Kaisen (呪術廻戦) by Gege Akutami,
published by Shueisha. Genre: Shonen, Dark Fantasy, Action. 26 volumes,
ongoing. Price: 528 JPY per volume. In stock: Yes (all volumes). Rating:
4.7/5 from 3,421 reviews. Synopsis: Yuji Itadori, a high school student,
joins a secret organization of sorcerers to kill a powerful curse after
he swallows a cursed object..."
Chunk 2 (score: 0.89): "Chainsaw Man (チェンソーマン) by Tatsuki
Fujimoto, published by Shueisha. Genre: Shonen, Dark Fantasy, Action.
Part 1: 11 volumes complete, Part 2: ongoing. Price: 528 JPY per volume.
In stock: Yes. Rating: 4.6/5 from 2,891 reviews. Synopsis: Denji is a
young man trapped in poverty who becomes a devil hunter after merging
with his pet devil-dog Pochita..."
Chunk 3 (score: 0.85): "Black Clover (ブラッククローバー) by Yuki
Tabata, published by Shueisha. Genre: Shonen, Fantasy, Action.
35 volumes, ongoing..."
[... 5 more chunks with decreasing relevance scores ...]
Chunk 7 (score: 0.41): "Cooking manga guide: Best recipes inspired
by manga series. Includes recipes from Shokugeki no Soma..."
Chunk 8 (score: 0.38): "Manga publishing industry overview 2024.
Shueisha reported revenue of..."
Conversation History — 12 turns (620 tokens):
Turn 1 (user): "Hi there!"
Turn 2 (assistant): "Welcome to MangaAssist! I'm here to help you
find your next great manga read. What are you looking for today?"
Turn 3 (user): "I'm just browsing around"
Turn 4 (assistant): "No problem! Take your time. I can help with
recommendations, checking if specific titles are in stock, or answering
any questions about manga series. What genres do you enjoy?"
Turn 5 (user): "I like action manga mostly"
Turn 6 (assistant): "Great choice! Action manga is our most popular
category. Some current bestsellers include..."
[... turns 7-10: user browsed several product pages ...]
Turn 11 (user): "What about the prices for these?"
Turn 12 (assistant): "Most standard tankobon volumes are 528 JPY..."
User Message (200 tokens):
I loved Demon Slayer and My Hero Academia. Can you recommend similar
shonen manga?
After Compression (980 tokens — 59% reduction)
Compressed System Prompt (180 tokens):
You are MangaAssist (JP manga e-commerce). Tasks: find manga, recommendations, order/shipping help.
Rules: friendly tone, prices in JPY+USD, titles in JP+EN, honest when unsure, no fabrication, verify stock.
For recommendations: 3-5 titles, brief justifications, consider user history + genre. Use bullet format.
Pruned RAG Context — 3 chunks kept (480 tokens):
[1] Jujutsu Kaisen (呪術廻戦) — Gege Akutami — Shonen/Dark Fantasy/Action — 26 vols ongoing — 528 JPY — 4.7/5 (3421) — In stock
Synopsis: High school student joins sorcerer organization to fight curses.
[2] Chainsaw Man (チェンソーマン) — Tatsuki Fujimoto — Shonen/Dark Fantasy/Action — Pt1: 11v complete, Pt2: ongoing — 528 JPY — 4.6/5 (2891) — In stock
Synopsis: Young man merges with devil-dog, becomes devil hunter.
[3] Black Clover (ブラッククローバー) — Yuki Tabata — Shonen/Fantasy/Action — 35 vols ongoing — In stock
Chunks 4-8 discarded: Chunks 7-8 below relevance threshold (0.41, 0.38). Chunks 4-6 cut by token budget. Saved: 720 tokens.
Pruned Conversation History (120 tokens):
[Earlier: 10 turns. User browsed action manga, asked about prices. Standard volumes are 528 JPY.]
Turn 11 (user): "What about the prices for these?"
Turn 12 (assistant): "Most standard tankobon volumes are 528 JPY..."
Turns 1-10 summarized. Recent 2 turns kept verbatim. Saved: 500 tokens.
User Message (200 tokens — unchanged):
I loved Demon Slayer and My Hero Academia. Can you recommend similar
shonen manga?
Compression Summary
| Component | Before | After | Saved | Technique |
|---|---|---|---|---|
| System Prompt | 380 | 180 | 200 (53%) | Dedup + compaction |
| RAG Context | 1,200 | 480 | 720 (60%) | Relevance pruning + template vars |
| History | 620 | 120 | 500 (81%) | Sliding window + summarization |
| User Message | 200 | 200 | 0 (0%) | Never compressed |
| Total | 2,400 | 980 | 1,420 (59%) | — |
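At the Sonnet pricing given at the top of this file ($3 per 1M input tokens) and the 1M messages/day scale target, the 1,420-token saving compounds quickly. A back-of-envelope calculation, assuming every message saw savings similar to this example:

```python
saved_tokens_per_msg = 1_420           # from the summary table above
messages_per_day = 1_000_000           # MangaAssist scale target
price_per_input_token = 3 / 1_000_000  # Sonnet: $3 per 1M input tokens

daily_savings = saved_tokens_per_msg * messages_per_day * price_per_input_token
print(f"${daily_savings:,.0f}/day")  # → $4,260/day
```

Real traffic is a mix of intents with varying savings, so treat this as an upper-bound illustration, not a forecast.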
Example 2: Order Status Query (Simpler Case)
User query: "Where is my order ORD-2024-78432?"
Before Compression (850 tokens)
| Component | Tokens | Content |
|---|---|---|
| System prompt | 380 | Full verbose prompt |
| RAG context | 0 | None needed — order data from DynamoDB |
| History | 270 | 6 turns of previous browsing |
| User message | 200 | With order ID and context |
After Compression (420 tokens — 51% reduction)
| Component | Before | After | Technique |
|---|---|---|---|
| System prompt | 380 | 120 | Stripped to order-status-only instructions |
| History | 270 | 100 | Keep only turns mentioning order ID |
| User message | 200 | 200 | Unchanged |
| Total | 850 | 420 | 51% reduction |
For order_status intent, the system prompt template switches entirely:
You are MangaAssist order assistant. Provide order status concisely.
Format: Order ID, Status, ETA, Tracking link. Prices in JPY+USD.
Compression Pipeline — Complete Sequence Diagram
sequenceDiagram
participant U as User (WebSocket)
participant GW as API Gateway
participant O as ECS Orchestrator
participant IC as Intent Classifier
participant TE as TokenEstimator
participant TBM as TokenBudgetManager
participant PC as PromptCompressor
participant OS as OpenSearch
participant CP as ContextPruner
participant RSC as ResponseSizeController
participant B as Bedrock Claude 3
participant TT as TokenTracker
participant CW as CloudWatch
U->>GW: "Recommend manga like Demon Slayer"
GW->>O: WebSocket message
O->>IC: classify(message)
IC-->>O: intent=recommendation
O->>OS: vector_search(query_embedding, k=8)
OS-->>O: 8 RAG chunks
O->>TE: estimate_prompt(system, user, history, rag)
TE-->>O: estimated=2,400 tokens
O->>TBM: evaluate(session, recommendation, ...)
TBM-->>O: over_budget, needs_compression, target=2,000
rect rgb(232, 245, 233)
Note over PC: Compression Phase
O->>PC: compress(system_prompt)
PC-->>O: 380→180 tokens (saved 200)
end
rect rgb(227, 242, 253)
Note over CP: RAG Pruning Phase
O->>CP: prune_rag_chunks(8 chunks, budget=600)
CP->>CP: Score chunks (composite relevance)
CP->>CP: Filter: threshold=0.45 → discard 2
CP->>CP: Dedup by source → discard 1
CP->>CP: Fill budget → keep 3, discard 2
CP-->>O: 3 chunks, 480 tokens (saved 720)
end
rect rgb(255, 243, 224)
Note over CP: History Pruning Phase
O->>CP: prune_history(12 turns, budget=400)
CP->>CP: Keep last 3 turns verbatim
CP->>CP: Extract entity turns (order IDs, product names)
CP->>CP: Summarize remaining turns
CP-->>O: summary + 3 turns, 120 tokens (saved 500)
end
O->>TE: re-estimate → 980 tokens (within budget)
O->>RSC: get_config(recommendation) → max_tokens=600
O->>B: invoke_model(compressed_prompt, max_tokens=600)
loop Streaming Response
B-->>O: chunk
O->>RSC: enforce_budget(chunk)
RSC-->>O: pass/truncate
O-->>GW: chunk
GW-->>U: chunk
end
O->>TT: record_invocation(actual_tokens)
TT->>CW: emit metrics
TT->>TT: update session total
Before/After Quality Validation
After compression, MangaAssist runs a lightweight quality check to ensure the compressed prompt still produces acceptable answers. This happens in the background using Haiku (cheap) to verify the compressed prompt against the original.
| Quality Metric | Threshold | How Measured |
|---|---|---|
| Entity preservation | 100% | All product names, prices, order IDs present in compressed prompt |
| Intent preservation | 100% | Re-classify the compressed prompt — must match original intent |
| Semantic similarity | > 0.90 | Cosine similarity between compressed and original prompt embeddings |
| Key instruction coverage | > 0.95 | Check that critical instructions (no fabrication, price format) survive |
If any threshold is violated, the compression is rolled back and the original prompt is used (at higher token cost but guaranteed quality).
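The entity-preservation gate (the strictest row in the table) needs no model call at all. A minimal sketch, assuming the entity list comes from the extraction step run on the original prompt:

```python
def entities_preserved(original_entities: list[str],
                       compressed_prompt: str) -> bool:
    """True only if every entity survives compression verbatim.

    Any miss should trigger a rollback to the uncompressed prompt: a
    cheaper prompt is never worth a corrupted product name or order ID.
    """
    missing = [e for e in original_entities if e not in compressed_prompt]
    return not missing

compressed = "[1] Jujutsu Kaisen (呪術廻戦) — 528 JPY — In stock"
print(entities_preserved(["Jujutsu Kaisen", "528 JPY"], compressed))  # True
print(entities_preserved(["Chainsaw Man"], compressed))               # False
```

Because this check is exact substring matching, it is cheap enough to run inline on every request, unlike the embedding-similarity check, which runs in the background.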
Key Takeaways
- Compression is a pipeline, not a single technique: apply cheap rules first, expensive LLM-based methods only when needed.
- Japanese content resists compression: CJK tokens carry high semantic density. Compress English boilerplate aggressively, but leave Japanese content nearly intact.
- RAG pruning has the highest ROI: OpenSearch returns many marginally-relevant chunks. Composite scoring with entity overlap emphasis cuts 40-60% of RAG tokens.
- History pruning is the second highest ROI: Multi-turn conversations grow linearly. Sliding window with entity-aware summarization keeps context useful while cutting 60-80% of history tokens.
- Response size control is about UX, not just cost: Graceful truncation with intent-specific follow-up prompts maintains user experience even when budget is enforced.
- Always validate after compression: A background quality check prevents compression from silently degrading answer quality.