
Prompt Compression and Context Pruning for GenAI Applications

MangaAssist context: JP Manga store chatbot on AWS — Bedrock Claude 3 (Sonnet at $3/$15 per 1M tokens input/output, Haiku at $0.25/$1.25), OpenSearch Serverless (vector store), DynamoDB (sessions/products), ECS Fargate (orchestrator), API Gateway WebSocket, ElastiCache Redis. Target: useful answer in under 3 seconds, 1M messages/day scale.


Skill Mapping

| Certification | Task | Skill | This File |
|---|---|---|---|
| AWS AIP-C01 | Task 4.1 — Optimize cost and performance of FM applications | Skill 4.1.1 — Design token efficiency systems for FM-powered applications | Deep-dive into prompt compression algorithms, context pruning strategies, response size controls, MangaAssist-specific examples |

Skill scope: Detailed implementation of the compression and pruning subsystems introduced in 01-token-efficiency-architecture.md. This file covers the algorithms, before/after examples, and production code for reducing input and output tokens without degrading answer quality.


Mind Map — Compression and Pruning Taxonomy

mindmap
  root((Compression<br/>& Pruning))
    Prompt Compression
      Rule-Based
        Whitespace normalization
        Instruction deduplication
        Boilerplate removal
      Structural
        Template variable injection
        JSON compaction
        Few-shot selection
      Algorithmic
        LLMLingua perplexity pruning
        Token-level importance scoring
        Entropy-based filtering
      Japanese-Specific
        CJK character density
        Furigana handling
        Mixed-script optimization
    Context Pruning
      RAG Chunk Pruning
        Relevance score threshold
        Token budget cap
        Diversity filtering
        Chunk deduplication
      Conversation History
        Sliding window
        Summarization of old turns
        Entity extraction retention
        Recency weighting
    Response Size Control
      Input Controls
        Per-intent max_tokens config
        Dynamic budget based on query complexity
      Output Controls
        Streaming token counter
        Graceful truncation
        Follow-up prompt injection
      Feedback Loop
        Actual vs budget tracking
        Auto-calibration of limits

Prompt Compression Algorithms — Deep Dive

The Compression Pipeline

Every MangaAssist prompt passes through a multi-stage compression pipeline before reaching Bedrock. The pipeline is ordered from cheapest (zero-cost, rule-based) to most expensive (requires model call). It short-circuits as soon as the token target is met.

sequenceDiagram
    participant O as Orchestrator
    participant E as TokenEstimator
    participant C as PromptCompressor
    participant P as ContextPruner
    participant R as ResponseSizeController
    participant B as Bedrock

    O->>E: estimate_prompt(system, user, history, rag)
    E-->>O: estimated_tokens = 2,400

    O->>O: check_budget(recommendation) → max 2,000

    Note over O: Over budget by 400 tokens

    O->>C: compress(system_prompt, target=200)
    Note over C: Stage 1: whitespace cleanup → saves 30
    Note over C: Stage 2: instruction dedup → saves 80
    Note over C: Stage 3: template variable injection → saves 120
    C-->>O: system_prompt compressed (230 tokens saved)

    O->>P: prune_rag_chunks(chunks, budget=600)
    Note over P: 8 chunks → relevance filter → 4 chunks
    Note over P: Token cap → 3 chunks kept
    P-->>O: rag_context pruned (170 tokens saved)

    O->>P: trim_history(history, budget=400)
    Note over P: 12 turns → keep last 3 verbatim
    Note over P: Summarize turns 1-9 → 80 tokens
    P-->>O: history trimmed

    O->>E: re-estimate → 1,950 tokens (within budget)

    O->>R: configure(intent=recommendation)
    R-->>O: max_output_tokens = 600

    O->>B: invoke_model(prompt, max_tokens=600)
    B-->>O: stream response chunks

    O->>R: enforce_budget(stream, max=600)
    R-->>O: controlled stream to client
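The stage ordering and short-circuit behavior can be sketched as a loop over stage functions. This is an illustrative sketch, not the production pipeline: the stage lambdas and the whitespace token counter are stand-ins for the real compression stages and TokenEstimator.

```python
from typing import Callable, List, Tuple

def count_tokens(text: str) -> int:
    """Crude whitespace-token estimate; the real pipeline uses a tokenizer."""
    return len(text.split())

def compress_until_target(prompt: str, target_tokens: int,
                          stages: List[Tuple[str, Callable[[str], str]]]) -> str:
    """Run stages cheapest-first, stopping as soon as the target is met."""
    for _name, stage in stages:
        if count_tokens(prompt) <= target_tokens:
            break  # short-circuit: skip remaining (more expensive) stages
        prompt = stage(prompt)
    return prompt

# Illustrative stages, ordered cheap -> expensive (deliberately naive)
stages = [
    ("whitespace", lambda p: " ".join(p.split())),
    ("dedup", lambda p: " ".join(dict.fromkeys(p.split()))),  # word-level dedup
]

raw = "  please   please respond  respond in   JSON  "
out = compress_until_target(raw, target_tokens=4, stages=stages)
```

In the production pipeline the stages are the whitespace, dedup, and template-injection passes from the diagram, and the estimate is re-run after the loop exits.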

LLMLingua-Style Compression for MangaAssist

LLMLingua is a prompt compression technique that removes tokens with low perplexity (tokens the model can easily predict from surrounding context). The insight: if the model would predict a token anyway, removing it preserves meaning while saving cost.

How It Works

flowchart TD
    A[Original Prompt Text] --> B[Tokenize with small LM]
    B --> C[Compute per-token perplexity]
    C --> D[Rank tokens by perplexity]
    D --> E{Token perplexity > threshold?}
    E -->|High perplexity — informative| F[Keep token]
    E -->|Low perplexity — predictable| G[Remove token]
    F --> H[Reassemble compressed prompt]
    G --> H
    H --> I[Verify token count meets target]
    I -->|Yes| J[Return compressed prompt]
    I -->|No| K[Raise threshold and repeat]
    K --> E

    style F fill:#c8e6c9
    style G fill:#ffcdd2
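The keep/remove loop in the flowchart can be sketched with fabricated per-token perplexity scores standing in for a real scoring LM (LLMLingua itself uses a small causal language model to compute them; the scores and names below are invented for illustration):

```python
from typing import Dict, List

def prune_by_perplexity(tokens: List[str], ppl: Dict[str, float],
                        max_tokens: int, threshold: float = 2.0) -> List[str]:
    """Keep tokens the scoring LM finds surprising (high perplexity);
    raise the threshold until the result fits the token budget."""
    kept = list(tokens)
    while len(kept) > max_tokens:
        kept = [t for t in tokens if ppl.get(t, 99.0) > threshold]
        threshold += 1.0  # still too long: prune more aggressively
    return kept

# Fabricated per-token perplexities for illustration only
ppl = {"the": 1.1, "is": 1.3, "manga": 7.2,
       "publisher": 6.5, "Shueisha": 9.8}
tokens = "the manga publisher is Shueisha".split()
compressed = prune_by_perplexity(tokens, ppl, max_tokens=3)
```

Predictable function words ("the", "is") score low and are dropped first; content-bearing tokens survive.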

MangaAssist Application

In MangaAssist, LLMLingua-style compression is applied only to specific prompt components where token removal is safe:

| Component | Apply LLMLingua? | Rationale |
|---|---|---|
| System instructions | Yes — cautiously | Instructions are mostly predictable boilerplate |
| RAG manga descriptions | Yes — aggressively | Descriptions have high redundancy (genre, publisher patterns) |
| Conversation history (English) | Yes — moderately | English conversational text has predictable filler |
| Conversation history (Japanese) | No | Japanese tokens carry more semantic density per token; removing even one kanji can change meaning |
| User message | Never | Sacred text — never compress the user's actual question |
| Product names / ISBNs / prices | Never | Entity corruption is catastrophic for e-commerce |

Japanese Content — Why Compression Is Dangerous

Japanese text has fundamentally different compression characteristics than English:

| Property | English | Japanese |
|---|---|---|
| Characters per token | ~4.5 | ~1-2 (kanji = high semantic density) |
| Token predictability | High for function words (the, is, a) | Low — most tokens are content-bearing |
| Safe removal candidates | Determiners, articles, filler | Very few — even particles (は, が, の) change meaning |
| Achievable compression ratio | 40-60% | 10-20% at best (quality degrades quickly) |
| Risk of semantic corruption | Moderate | High — removing one kanji can change the title entirely |

Example of dangerous Japanese compression:

| | |
|---|---|
| Original | 「鬼滅の刃」の最新巻はいつ発売されますか? |
| Meaning | "When will the latest volume of Demon Slayer be released?" |
| Naive compression | 「鬼滅刃」最新巻いつ発売? |
| Problem | 鬼滅の刃 (Kimetsu no Yaiba) becomes 鬼滅刃 — the particle の is not redundant; the model may misidentify the title |
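A defensive pattern is to measure CJK character density and refuse lossy compression above a threshold. The sketch below is illustrative, not production code; the function names and the 0.3 cutoff are assumptions.

```python
import unicodedata

def cjk_ratio(text: str) -> float:
    """Fraction of characters in CJK / kana Unicode ranges."""
    if not text:
        return 0.0
    cjk = sum(
        1 for ch in text
        if "CJK" in unicodedata.name(ch, "")
        or "HIRAGANA" in unicodedata.name(ch, "")
        or "KATAKANA" in unicodedata.name(ch, "")
    )
    return cjk / len(text)

def safe_to_compress(text: str, max_cjk_ratio: float = 0.3) -> bool:
    """Only allow lossy compression on predominantly non-CJK text."""
    return cjk_ratio(text) < max_cjk_ratio
```

With this guard, the dangerous example above would never reach the perplexity pruner, while English boilerplate still would.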

Context Pruning Strategies

RAG Chunk Relevance Scoring

OpenSearch returns chunks sorted by vector similarity, but raw similarity scores are not sufficient for token-efficient pruning. MangaAssist uses a composite relevance score.

flowchart TD
    subgraph Scoring["Composite Relevance Score"]
        A[OpenSearch Chunk] --> B[Vector Similarity Score]
        A --> C[Keyword Match Score]
        A --> D[Recency Score]
        A --> E[Entity Overlap Score]

        B --> F["weight: 0.4"]
        C --> G["weight: 0.2"]
        D --> H["weight: 0.1"]
        E --> I["weight: 0.3"]

        F --> J[Composite Score]
        G --> J
        H --> J
        I --> J
    end

    J --> K{Score > threshold?}
    K -->|Yes| L{Cumulative tokens < budget?}
    K -->|No| M[Discard chunk]
    L -->|Yes| N[Include in prompt]
    L -->|No| O[Stop — budget exhausted]

    style N fill:#c8e6c9
    style M fill:#ffcdd2
    style O fill:#fff9c4

ContextPruner — Production Class

import numpy as np
from dataclasses import dataclass
from typing import Optional


@dataclass
class RAGChunk:
    """A single chunk retrieved from OpenSearch."""
    chunk_id: str
    text: str
    vector_score: float          # cosine similarity from OpenSearch
    source_document: str         # e.g., "one_piece_vol104_desc"
    timestamp: Optional[float]   # when the source was last updated
    entities: list[str]          # extracted entities: titles, authors, ISBNs


@dataclass
class PruneResult:
    """Output of the pruning operation."""
    kept_chunks: list[RAGChunk]
    discarded_chunks: list[RAGChunk]
    kept_token_count: int
    discarded_token_count: int
    total_original_tokens: int


class ContextPruner:
    """Prunes RAG chunks and conversation history to fit token budgets.

    MangaAssist-specific design decisions:
    - Entity overlap is weighted heavily (0.3) because product names
      and order IDs are critical for correct answers.
    - Recency is weighted lightly (0.1) because manga catalog data
      changes slowly (weekly catalog updates).
    - Duplicate content from the same source document is removed
      to prevent the model from over-indexing on one product.
    """

    def __init__(self, estimator: "TokenEstimator",
                 relevance_threshold: float = 0.45,
                 weights: Optional[dict] = None):
        self.estimator = estimator
        self.relevance_threshold = relevance_threshold
        self.weights = weights or {
            "vector": 0.4,
            "keyword": 0.2,
            "recency": 0.1,
            "entity": 0.3,
        }

    def score_chunk(self, chunk: RAGChunk, query_text: str,
                    query_entities: list[str],
                    current_time: float) -> float:
        """Compute composite relevance score for a single chunk.

        Returns a float in [0.0, 1.0].
        """
        # Component 1: Vector similarity (already 0-1 from OpenSearch)
        vector_score = chunk.vector_score

        # Component 2: Keyword match (exact token overlap ratio)
        query_tokens = set(query_text.lower().split())
        chunk_tokens = set(chunk.text.lower().split())
        if query_tokens:
            keyword_score = len(query_tokens & chunk_tokens) / len(query_tokens)
        else:
            keyword_score = 0.0

        # Component 3: Recency (exponential decay, half-life = 7 days)
        if chunk.timestamp:
            age_days = (current_time - chunk.timestamp) / 86400
            recency_score = np.exp(-0.099 * age_days)  # ln(2)/7 ≈ 0.099
        else:
            recency_score = 0.5  # default for chunks without timestamps

        # Component 4: Entity overlap (critical for MangaAssist)
        if query_entities:
            query_ent_set = set(e.lower() for e in query_entities)
            chunk_ent_set = set(e.lower() for e in chunk.entities)
            entity_score = len(query_ent_set & chunk_ent_set) / len(query_ent_set)
        else:
            entity_score = 0.0

        # Weighted composite
        composite = (
            self.weights["vector"] * vector_score +
            self.weights["keyword"] * keyword_score +
            self.weights["recency"] * recency_score +
            self.weights["entity"] * entity_score
        )
        return min(composite, 1.0)

    def prune_rag_chunks(self, chunks: list[RAGChunk],
                         query_text: str,
                         query_entities: list[str],
                         token_budget: int,
                         current_time: Optional[float] = None) -> PruneResult:
        """Prune RAG chunks to fit within token budget.

        Strategy:
        1. Score all chunks with composite relevance
        2. Filter by relevance threshold
        3. Deduplicate by source document (keep highest-scoring)
        4. Fill token budget in score order (greedy knapsack)
        """
        import time as time_module
        current_time = current_time or time_module.time()

        # Step 1: Score all chunks
        scored = []
        for chunk in chunks:
            score = self.score_chunk(chunk, query_text, query_entities,
                                     current_time)
            scored.append((chunk, score))

        # Step 2: Filter by threshold
        above_threshold = [(c, s) for c, s in scored
                           if s >= self.relevance_threshold]
        below_threshold = [c for c, s in scored
                           if s < self.relevance_threshold]

        # Step 3: Deduplicate by source document
        best_per_source: dict[str, tuple[RAGChunk, float]] = {}
        dedup_discarded = []
        for chunk, score in above_threshold:
            source = chunk.source_document
            if source not in best_per_source or score > best_per_source[source][1]:
                if source in best_per_source:
                    dedup_discarded.append(best_per_source[source][0])
                best_per_source[source] = (chunk, score)
            else:
                dedup_discarded.append(chunk)

        # Step 4: Sort by score descending, fill budget
        sorted_chunks = sorted(best_per_source.values(),
                               key=lambda x: x[1], reverse=True)

        kept = []
        discarded = list(below_threshold) + dedup_discarded
        running_tokens = 0

        for chunk, score in sorted_chunks:
            chunk_tokens = self.estimator.count_tokens(chunk.text)
            if running_tokens + chunk_tokens <= token_budget:
                kept.append(chunk)
                running_tokens += chunk_tokens
            else:
                discarded.append(chunk)

        # Calculate total original tokens
        total_original = sum(
            self.estimator.count_tokens(c.text) for c in chunks
        )
        discarded_tokens = sum(
            self.estimator.count_tokens(c.text) for c in discarded
        )

        return PruneResult(
            kept_chunks=kept,
            discarded_chunks=discarded,
            kept_token_count=running_tokens,
            discarded_token_count=discarded_tokens,
            total_original_tokens=total_original,
        )

    def prune_conversation_history(self, history: list[dict],
                                   token_budget: int,
                                   keep_recent: int = 3,
                                   entity_keywords: Optional[list[str]] = None
                                   ) -> list[dict]:
        """Prune conversation history with recency weighting.

        Strategy:
        - Always keep the last `keep_recent` turns verbatim
        - For older turns, keep only those containing entity keywords
          (product names, order IDs, prices)
        - Summarize the remaining old turns into a compact block

        Args:
            history: List of {"role": str, "content": str} turn dicts
            token_budget: Maximum total tokens for the history
            keep_recent: Number of recent turns to keep verbatim
            entity_keywords: Keywords that mark a turn as worth keeping
                             (e.g., ["One Piece", "ORD-12345", "Vol."])
        """
        if not history:
            return []

        entity_keywords = entity_keywords or []
        entity_set = set(kw.lower() for kw in entity_keywords)

        # Split into recent (keep) and older (prune candidates)
        recent = history[-keep_recent:] if len(history) > keep_recent else history
        older = history[:-keep_recent] if len(history) > keep_recent else []

        # Check if recent turns alone exceed budget
        recent_tokens = sum(
            self.estimator.count_tokens(t.get("content", ""))
            for t in recent
        )
        if recent_tokens >= token_budget:
            # Even recent turns are too large — keep only last 2
            return history[-2:]

        remaining_budget = token_budget - recent_tokens

        # From older turns, extract entity-bearing turns
        entity_turns = []
        summary_turns = []
        for turn in older:
            content_lower = turn.get("content", "").lower()
            if any(kw in content_lower for kw in entity_set):
                entity_turns.append(turn)
            else:
                summary_turns.append(turn)

        # Fit entity turns within remaining budget
        kept_entity_turns = []
        entity_tokens_used = 0
        for turn in reversed(entity_turns):  # most recent entity turns first
            turn_tokens = self.estimator.count_tokens(turn.get("content", ""))
            if entity_tokens_used + turn_tokens <= remaining_budget * 0.6:
                kept_entity_turns.insert(0, turn)
                entity_tokens_used += turn_tokens

        # Summarize the rest
        summary_budget = remaining_budget - entity_tokens_used
        if summary_turns and summary_budget > 50:
            topics = []
            for turn in summary_turns:
                content = turn.get("content", "")
                if len(content) > 40:
                    topics.append(content[:40].strip() + "...")
                elif content:
                    topics.append(content.strip())

            summary_text = (
                f"[Earlier: {len(summary_turns)} turns discussing: "
                f"{'; '.join(topics[:3])}]"
            )
            summary_turn = {"role": "system", "content": summary_text}
            return [summary_turn] + kept_entity_turns + recent
        else:
            return kept_entity_turns + recent

Response Size Controller — Production Class

import tiktoken
import logging
from dataclasses import dataclass
from typing import AsyncIterator, Optional

logger = logging.getLogger("mangaassist.response_control")


@dataclass
class ResponseBudgetConfig:
    """Per-intent response size configuration."""
    max_output_tokens: int
    hard_ceiling: int              # absolute maximum, even if intent config is wrong
    truncation_message: str        # appended when truncating
    allow_graceful_extension: bool # allow 10% over for sentence completion
    min_tokens_for_useful: int     # below this, the response is useless


# Per-intent response configurations for MangaAssist
RESPONSE_CONFIGS = {
    "product_search": ResponseBudgetConfig(
        max_output_tokens=400,
        hard_ceiling=500,
        truncation_message=(
            "\n\n---\n*More results available. "
            "Ask me to narrow your search.*"
        ),
        allow_graceful_extension=True,
        min_tokens_for_useful=80,
    ),
    "order_status": ResponseBudgetConfig(
        max_output_tokens=200,
        hard_ceiling=250,
        truncation_message=(
            "\n\n---\n*For full order details, "
            "visit your order history page.*"
        ),
        allow_graceful_extension=False,  # order info should be concise
        min_tokens_for_useful=40,
    ),
    "recommendation": ResponseBudgetConfig(
        max_output_tokens=600,
        hard_ceiling=750,
        truncation_message=(
            "\n\n---\n*I have more recommendations! "
            "Tell me which genre interests you most.*"
        ),
        allow_graceful_extension=True,
        min_tokens_for_useful=150,
    ),
    "manga_qa": ResponseBudgetConfig(
        max_output_tokens=500,
        hard_ceiling=600,
        truncation_message=(
            "\n\n---\n*There's more to say on this topic. "
            "Ask a follow-up question for details.*"
        ),
        allow_graceful_extension=True,
        min_tokens_for_useful=100,
    ),
    "chitchat": ResponseBudgetConfig(
        max_output_tokens=150,
        hard_ceiling=200,
        truncation_message="",  # chitchat truncation is invisible
        allow_graceful_extension=False,
        min_tokens_for_useful=20,
    ),
}


@dataclass
class StreamingStats:
    """Statistics collected during response streaming."""
    total_output_tokens: int
    was_truncated: bool
    truncation_point_tokens: int
    chunks_delivered: int
    chunks_dropped: int


class ResponseSizeController:
    """Controls response size during streaming from Bedrock.

    MangaAssist uses WebSocket streaming for real-time delivery. This
    controller sits between the Bedrock stream and the WebSocket, counting
    tokens in real time and enforcing per-intent output budgets.

    Key design decisions:
    - Graceful extension: if allow_graceful_extension is True, we allow
      up to 10% over budget to finish the current sentence. This prevents
      mid-word truncation.
    - Truncation message: intent-specific messages guide the user to ask
      follow-ups rather than leaving them with an abrupt cutoff.
    - Hard ceiling: absolute maximum that prevents any runaway generation,
      even if configuration is wrong.
    """

    def __init__(self):
        self.encoding = tiktoken.get_encoding("cl100k_base")
        # Populated after each stream completes via stream_with_budget.
        self.last_stats: Optional[StreamingStats] = None

    def get_config(self, intent: str) -> ResponseBudgetConfig:
        """Get response budget config for an intent, with fallback."""
        return RESPONSE_CONFIGS.get(
            intent,
            ResponseBudgetConfig(
                max_output_tokens=400,
                hard_ceiling=500,
                truncation_message="\n\n---\n*Response trimmed.*",
                allow_graceful_extension=True,
                min_tokens_for_useful=80,
            ),
        )

    async def stream_with_budget(
        self,
        bedrock_stream: AsyncIterator[dict],
        intent: str,
        session_id: str,
    ) -> AsyncIterator[str]:
        """Stream Bedrock response chunks while enforcing token budget.

        Yields:
            str: text chunks to forward to the WebSocket client

        The caller should collect StreamingStats from self.last_stats
        after the stream completes.
        """
        config = self.get_config(intent)
        token_count = 0
        chunk_count = 0
        dropped_count = 0
        truncated = False
        budget = config.max_output_tokens
        grace_budget = int(budget * 1.1) if config.allow_graceful_extension else budget

        async for event in bedrock_stream:
            chunk_text = self._extract_text(event)
            if not chunk_text:
                continue

            chunk_tokens = len(self.encoding.encode(chunk_text))

            # Hard ceiling — never exceed this
            if token_count + chunk_tokens > config.hard_ceiling:
                truncated = True
                dropped_count += 1  # the current chunk is never delivered
                logger.warning(
                    "Hard ceiling hit",
                    extra={
                        "session_id": session_id,
                        "intent": intent,
                        "tokens_at_ceiling": token_count,
                        "hard_ceiling": config.hard_ceiling,
                    },
                )
                if config.truncation_message:
                    yield config.truncation_message
                break

            # Soft budget — allow graceful extension for sentence completion
            if token_count >= budget:
                if token_count + chunk_tokens <= grace_budget:
                    # In grace zone — check if chunk ends a sentence
                    yield chunk_text
                    token_count += chunk_tokens
                    chunk_count += 1

                    if self._ends_sentence(chunk_text):
                        truncated = True
                        if config.truncation_message:
                            yield config.truncation_message
                        break
                else:
                    truncated = True
                    dropped_count += 1  # chunk would overshoot even the grace zone
                    if config.truncation_message:
                        yield config.truncation_message
                    break
            else:
                yield chunk_text
                token_count += chunk_tokens
                chunk_count += 1

        self.last_stats = StreamingStats(
            total_output_tokens=token_count,
            was_truncated=truncated,
            truncation_point_tokens=token_count if truncated else 0,
            chunks_delivered=chunk_count,
            chunks_dropped=dropped_count,
        )

    def _extract_text(self, event: dict) -> str:
        """Extract text from a Bedrock streaming event."""
        # Bedrock converse stream format
        if "contentBlockDelta" in event:
            delta = event["contentBlockDelta"].get("delta", {})
            return delta.get("text", "")
        # Bedrock invoke_model_with_response_stream format
        chunk = event.get("chunk", {})
        if "bytes" in chunk:
            import json
            body = json.loads(chunk["bytes"])
            return body.get("delta", {}).get("text", "")
        return ""

    def _ends_sentence(self, text: str) -> bool:
        """Check if text ends at a natural sentence boundary."""
        stripped = text.rstrip()
        return stripped.endswith(('.', '!', '?', '。', '!', '?', '\n'))
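The soft-budget and grace-zone branching above can be factored into a pure function, which makes the streaming logic unit-testable. This is a sketch of the decision logic, not part of the production class; the return values are illustrative labels.

```python
def budget_decision(tokens_so_far: int, chunk_tokens: int, budget: int,
                    hard_ceiling: int, grace: bool) -> str:
    """Return 'pass', 'grace' (deliver, then stop at a sentence end), or 'stop'."""
    grace_budget = int(budget * 1.1) if grace else budget
    if tokens_so_far + chunk_tokens > hard_ceiling:
        return "stop"          # hard ceiling: never exceed
    if tokens_so_far >= budget:
        if tokens_so_far + chunk_tokens <= grace_budget:
            return "grace"     # over soft budget, within 10% grace zone
        return "stop"
    return "pass"
```

For the recommendation intent (budget 600, ceiling 750), a chunk arriving at exactly 600 tokens lands in the grace zone, while one that would push past 660 is cut.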

MangaAssist-Specific Compression Examples

Example 1: Manga Recommendation Query

User query: "I loved Demon Slayer and My Hero Academia. Can you recommend similar shonen manga?"

Before Compression (2,400 tokens total)

System Prompt (380 tokens):

You are MangaAssist, a helpful, friendly, and knowledgeable assistant
for a Japanese manga e-commerce store. Your role is to help customers
find manga they will love, check order status, track shipments, and
provide personalized recommendations based on their reading history and
preferences.

Always respond in a friendly and approachable manner. Make sure to include
prices in both JPY and USD. Always include manga titles in both Japanese
and English translations. If you are unsure about any information, be
honest and say so. Never fabricate information about manga titles, pricing,
availability, or shipping details.

When making recommendations, consider the customer's stated preferences,
their purchase history, and popular titles in similar genres. Aim to
suggest 3-5 titles with brief justifications for each recommendation.
Format your response with clear headers and bullet points for readability.

Remember to check availability before recommending any title.
Be concise but thorough in your explanations.

RAG Context — 8 chunks from OpenSearch (1,200 tokens):

Chunk 1 (score: 0.92): "Jujutsu Kaisen (呪術廻戦) by Gege Akutami,
published by Shueisha. Genre: Shonen, Dark Fantasy, Action. 26 volumes,
ongoing. Price: 528 JPY per volume. In stock: Yes (all volumes). Rating:
4.7/5 from 3,421 reviews. Synopsis: Yuji Itadori, a high school student,
joins a secret organization of sorcerers to kill a powerful curse after
he swallows a cursed object..."

Chunk 2 (score: 0.89): "Chainsaw Man (チェンソーマン) by Tatsuki
Fujimoto, published by Shueisha. Genre: Shonen, Dark Fantasy, Action.
Part 1: 11 volumes complete, Part 2: ongoing. Price: 528 JPY per volume.
In stock: Yes. Rating: 4.6/5 from 2,891 reviews. Synopsis: Denji is a
young man trapped in poverty who becomes a devil hunter after merging
with his pet devil-dog Pochita..."

Chunk 3 (score: 0.85): "Black Clover (ブラッククローバー) by Yuki
Tabata, published by Shueisha. Genre: Shonen, Fantasy, Action.
35 volumes, ongoing..."

[... 5 more chunks with decreasing relevance scores ...]

Chunk 7 (score: 0.41): "Cooking manga guide: Best recipes inspired
by manga series. Includes recipes from Shokugeki no Soma..."

Chunk 8 (score: 0.38): "Manga publishing industry overview 2024.
Shueisha reported revenue of..."

Conversation History — 12 turns (620 tokens):

Turn 1 (user): "Hi there!"
Turn 2 (assistant): "Welcome to MangaAssist! I'm here to help you
find your next great manga read. What are you looking for today?"
Turn 3 (user): "I'm just browsing around"
Turn 4 (assistant): "No problem! Take your time. I can help with
recommendations, checking if specific titles are in stock, or answering
any questions about manga series. What genres do you enjoy?"
Turn 5 (user): "I like action manga mostly"
Turn 6 (assistant): "Great choice! Action manga is our most popular
category. Some current bestsellers include..."
[... turns 7-10: user browsed several product pages ...]
Turn 11 (user): "What about the prices for these?"
Turn 12 (assistant): "Most standard tankobon volumes are 528 JPY..."

User Message (200 tokens):

I loved Demon Slayer and My Hero Academia. Can you recommend similar
shonen manga?

After Compression (980 tokens — 59% reduction)

Compressed System Prompt (180 tokens):

You are MangaAssist (JP manga e-commerce). Tasks: find manga, recommendations, order/shipping help.
Rules: friendly tone, prices in JPY+USD, titles in JP+EN, honest when unsure, no fabrication, verify stock.
For recommendations: 3-5 titles, brief justifications, consider user history + genre. Use bullet format.

Pruned RAG Context — 3 chunks kept (480 tokens):

[1] Jujutsu Kaisen (呪術廻戦) — Gege Akutami — Shonen/Dark Fantasy/Action — 26 vols ongoing — 528 JPY — 4.7/5 (3421) — In stock
Synopsis: High school student joins sorcerer organization to fight curses.

[2] Chainsaw Man (チェンソーマン) — Tatsuki Fujimoto — Shonen/Dark Fantasy/Action — Pt1: 11v complete, Pt2: ongoing — 528 JPY — 4.6/5 (2891) — In stock
Synopsis: Young man merges with devil-dog, becomes devil hunter.

[3] Black Clover (ブラッククローバー) — Yuki Tabata — Shonen/Fantasy/Action — 35 vols ongoing — In stock

Chunks 4-8 discarded: chunks 7-8 fell below the relevance threshold (0.41, 0.38); chunk 4 was removed as a duplicate of a kept source document; chunks 5-6 were cut by the token budget. Saved: 720 tokens.

Pruned Conversation History (120 tokens):

[Earlier: 10 turns. User browsed action manga, asked about prices. Standard volumes are 528 JPY.]

Turn 11 (user): "What about the prices for these?"
Turn 12 (assistant): "Most standard tankobon volumes are 528 JPY..."

Turns 1-10 summarized. Recent 2 turns kept verbatim. Saved: 500 tokens.

User Message (200 tokens — unchanged):

I loved Demon Slayer and My Hero Academia. Can you recommend similar
shonen manga?

Compression Summary

| Component | Before | After | Saved | Technique |
|---|---|---|---|---|
| System Prompt | 380 | 180 | 200 (53%) | Dedup + compaction |
| RAG Context | 1,200 | 480 | 720 (60%) | Relevance pruning + template vars |
| History | 620 | 120 | 500 (81%) | Sliding window + summarization |
| User Message | 200 | 200 | 0 (0%) | Never compressed |
| Total | 2,400 | 980 | 1,420 (59%) | |
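Using the Sonnet input price stated at the top of this file ($3 per 1M input tokens) and the 1M messages/day target, the per-request saving can be sanity-checked with quick arithmetic. This assumes every request looks like Example 1, which is an oversimplification; real traffic mixes intents.

```python
SONNET_INPUT_PRICE = 3.00 / 1_000_000   # USD per input token
tokens_saved_per_request = 2_400 - 980  # from the summary table above
requests_per_day = 1_000_000

daily_savings = tokens_saved_per_request * requests_per_day * SONNET_INPUT_PRICE
# 1,420 tokens x 1M requests x $3/1M tokens = roughly $4,260/day in input cost alone
```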

Example 2: Order Status Query (Simpler Case)

User query: "Where is my order ORD-2024-78432?"

Before Compression (850 tokens)

| Component | Tokens | Content |
|---|---|---|
| System prompt | 380 | Full verbose prompt |
| RAG context | 0 | None needed — order data from DynamoDB |
| History | 270 | 6 turns of previous browsing |
| User message | 200 | With order ID and context |

After Compression (420 tokens — 51% reduction)

| Component | Before | After | Technique |
|---|---|---|---|
| System prompt | 380 | 120 | Stripped to order-status-only instructions |
| History | 270 | 100 | Keep only turns mentioning order ID |
| User message | 200 | 200 | Unchanged |
| Total | 850 | 420 | 51% reduction |

For order_status intent, the system prompt template switches entirely:

You are MangaAssist order assistant. Provide order status concisely.
Format: Order ID, Status, ETA, Tracking link. Prices in JPY+USD.


Compression Pipeline — Complete Sequence Diagram

sequenceDiagram
    participant U as User (WebSocket)
    participant GW as API Gateway
    participant O as ECS Orchestrator
    participant IC as Intent Classifier
    participant TE as TokenEstimator
    participant TBM as TokenBudgetManager
    participant PC as PromptCompressor
    participant OS as OpenSearch
    participant CP as ContextPruner
    participant RSC as ResponseSizeController
    participant B as Bedrock Claude 3
    participant TT as TokenTracker
    participant CW as CloudWatch

    U->>GW: "Recommend manga like Demon Slayer"
    GW->>O: WebSocket message

    O->>IC: classify(message)
    IC-->>O: intent=recommendation

    O->>OS: vector_search(query_embedding, k=8)
    OS-->>O: 8 RAG chunks

    O->>TE: estimate_prompt(system, user, history, rag)
    TE-->>O: estimated=2,400 tokens

    O->>TBM: evaluate(session, recommendation, ...)
    TBM-->>O: over_budget, needs_compression, target=2,000

    rect rgb(232, 245, 233)
        Note over PC: Compression Phase
        O->>PC: compress(system_prompt)
        PC-->>O: 380→180 tokens (saved 200)
    end

    rect rgb(227, 242, 253)
        Note over CP: RAG Pruning Phase
        O->>CP: prune_rag_chunks(8 chunks, budget=600)
        CP->>CP: Score chunks (composite relevance)
        CP->>CP: Filter: threshold=0.45 → discard 2
        CP->>CP: Dedup by source → discard 1
        CP->>CP: Fill budget → keep 3, discard 2
        CP-->>O: 3 chunks, 480 tokens (saved 720)
    end

    rect rgb(255, 243, 224)
        Note over CP: History Pruning Phase
        O->>CP: prune_history(12 turns, budget=400)
        CP->>CP: Keep last 3 turns verbatim
        CP->>CP: Extract entity turns (order IDs, product names)
        CP->>CP: Summarize remaining turns
        CP-->>O: summary + 3 turns, 120 tokens (saved 500)
    end

    O->>TE: re-estimate → 980 tokens (within budget)

    O->>RSC: get_config(recommendation) → max_tokens=600
    O->>B: invoke_model(compressed_prompt, max_tokens=600)

    loop Streaming Response
        B-->>O: chunk
        O->>RSC: enforce_budget(chunk)
        RSC-->>O: pass/truncate
        O-->>GW: chunk
        GW-->>U: chunk
    end

    O->>TT: record_invocation(actual_tokens)
    TT->>CW: emit metrics
    TT->>TT: update session total

Before/After Quality Validation

After compression, MangaAssist runs a lightweight quality check to ensure the compressed prompt still produces acceptable answers. This happens in the background using Haiku (cheap) to verify the compressed prompt against the original.

| Quality Metric | Threshold | How Measured |
|---|---|---|
| Entity preservation | 100% | All product names, prices, order IDs present in compressed prompt |
| Intent preservation | 100% | Re-classify the compressed prompt — must match original intent |
| Semantic similarity | > 0.90 | Cosine similarity between compressed and original prompt embeddings |
| Key instruction coverage | > 0.95 | Check that critical instructions (no fabrication, price format) survive |

If any threshold is violated, the compression is rolled back and the original prompt is used (at higher token cost but guaranteed quality).
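The entity-preservation gate is the only check that needs no model call. A minimal sketch (function names are illustrative, not the production implementation):

```python
from typing import List

def entities_preserved(compressed_prompt: str, entities: List[str]) -> bool:
    """Fail the compression if any critical entity string was lost."""
    return all(e in compressed_prompt for e in entities)

def choose_prompt(original: str, compressed: str, entities: List[str]) -> str:
    """Roll back to the original prompt when the gate fails."""
    return compressed if entities_preserved(compressed, entities) else original

original = "Recommend titles like 鬼滅の刃 (528 JPY per volume)."
bad = "Recommend titles like 鬼滅刃 (528 JPY)."  # particle lost in compression
chosen = choose_prompt(original, bad, entities=["鬼滅の刃", "528 JPY"])
```

Because the gate is exact substring matching, it catches the particle-dropping failure mode from the Japanese compression example: the corrupted prompt is rejected and the original is sent instead.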


Key Takeaways

  1. Compression is a pipeline, not a single technique: apply cheap rules first, expensive LLM-based methods only when needed.
  2. Japanese content resists compression: CJK tokens carry high semantic density. Compress English boilerplate aggressively, but leave Japanese content nearly intact.
  3. RAG pruning has the highest ROI: OpenSearch returns many marginally-relevant chunks. Composite scoring with entity overlap emphasis cuts 40-60% of RAG tokens.
  4. History pruning is the second highest ROI: Multi-turn conversations grow linearly. Sliding window with entity-aware summarization keeps context useful while cutting 60-80% of history tokens.
  5. Response size control is about UX, not just cost: Graceful truncation with intent-specific follow-up prompts maintains user experience even when budget is enforced.
  6. Always validate after compression: A background quality check prevents compression from silently degrading answer quality.