
Scenarios and Runbooks — Skill 1.5.5: Query Handling Systems

MangaAssist context: JP Manga store chatbot on AWS — Bedrock Claude 3 (Sonnet at $3/$15 per 1M tokens input/output, Haiku at $0.25/$1.25), OpenSearch Serverless (vector store), DynamoDB (sessions/products), ECS Fargate (orchestrator), API Gateway WebSocket, ElastiCache Redis. Target: useful answer in under 3 seconds, 1M messages/day scale.


Skill Mapping

| Dimension | Detail |
| --- | --- |
| Certification | AWS AIF-C01 — AI Practitioner |
| Domain | 1 — Foundation Model Integration, Data Management, and Compliance |
| Task | 1.5 — Design retrieval mechanisms for FM augmentation |
| Skill | 1.5.5 — Design query handling systems (for example, query transformation, query expansion, HyDE hypothetical document embedding, step-back prompting, multi-hop retrieval, intent classification) to improve retrieval relevance before vector search in the MangaAssist RAG pipeline |
| This File | Five production scenarios with detection flowcharts, root cause analysis, resolution code, and prevention strategies |

Skill Scope Statement

Skill 1.5.5 governs the query pre-processing layer between the user's raw input and the OpenSearch vector search call. For MangaAssist, users phrase queries colloquially ("something like Berserk but less gory"), with abbreviations ("OP vol 1"), or require multi-hop reasoning ("shipping cost for pre-order items over ¥5000 to Osaka"). Without query handling — intent classification, query rewriting, expansion, or HyDE — the dense retrieval layer receives a query that is semantically distant from the catalog chunks it needs to surface, producing low-recall results. Query handling adds latency and LLM cost, so each technique must improve retrieval quality measurably enough to justify the overhead.
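
As orientation, here is a minimal sketch of the ordering this layer implies. All function bodies are illustrative stubs (not the production MangaAssist code); Scenarios 1 through 5 below give production-shaped versions of each step:

def classify_intent(q: str) -> str:          # Scenario 5
    return "FAQ" if "return" in q.lower() else "CATALOG"

def rewrite_query(q: str) -> str:            # Scenario 1
    return q   # stub: LLM rewrite goes here

def transform_for_retrieval(q: str) -> str:  # Scenarios 2-4 (HyDE, multi-hop, step-back)
    return q   # stub: optional transformation before vector search

def handle_query(raw_query: str) -> dict:
    intent = classify_intent(raw_query)
    query  = transform_for_retrieval(rewrite_query(raw_query))
    return {"intent": intent, "retrieval_query": query}

print(handle_query("What's the return policy for damaged volumes?"))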


Mind Map — Query Handling Failure Modes

mindmap
  root((Query Handling Failures))
    QueryTransformation
      RewriteChangesIntent
      OverlyLongRewrite
      LoopOnRewriteError
    HyDEFailures
      HypDocWrongGenre
      HypDocTooLong
      HyDENotCached
    MultiHopRetrieval
      InfiniteHopLoop
      HopContextLost
      TooManyLLMCallsPerQuery
    IntentClassification
      WrongIntentRoute
      AmbiguousQueryNotClarified
      NoFallbackForUnknownIntent
    OperationalGaps
      NoQueryTransformLatencyMetric
      TransformCacheMissing
      FailedRewriteNotLogged

Scenario Overview

| # | Scenario | Severity | Blast Radius | Typical Detection Time |
| --- | --- | --- | --- | --- |
| 1 | Query rewriter LLM call adds 900 ms to every query; total latency exceeds 3 s SLA | P2 High | All queries using query rewriting breach SLA; user experience degrades | < 5 min via API Gateway latency alarm |
| 2 | HyDE generates a hypothetical document in the wrong language (English for a Japanese title query), causing near-zero cosine match with JP catalog chunks | P2 High | Japanese-title queries return wrong or empty results when HyDE is active | 2–6 hours via JP query recall metric |
| 3 | Multi-hop retrieval enters an infinite loop for ambiguous intent queries, exhausting ECS task CPU and memory | P1 Critical | Affected ECS tasks become unresponsive; cascading 503s | < 2 min via ECS CPU alarm |
| 4 | Step-back prompting generates an overly generic question ("Tell me about manga") that floods retrieval with unrelated chunks for niche queries | P3 Medium | Niche-title queries return popular titles instead of requested niche content | 4–12 hours via answer relevance CSAT |
| 5 | Intent classifier routes product-detail queries to the FAQ retrieval path, fetching policy context for product questions and vice versa | P2 High | Product queries get policy answers; FAQ queries get catalog answers | 1–4 hours via query type accuracy metric |

Scenario 1: Query Rewriter Adds 900 ms SLA-Breaking Latency

Problem

MangaAssist rewrites colloquial user queries into more retrieval-optimized forms with an LLM call (e.g., "something like Berserk but cheerful" → "upbeat adventure manga similar to Berserk, lighter tone"). The rewriter call averages 900 ms at peak load. Combined with retrieval (300 ms), re-ranking (200 ms), and Claude 3 Sonnet answer inference (900 ms), average end-to-end latency is already 2.3 s, and P99 spikes to 3.5 s when the rewriter fires under load — breaching the 3-second SLA for 30% of queries.

Detection

flowchart TD
    A["API Gateway P99 latency\nalarm fires (> 2800ms)"] --> B{"X-Ray: which segment\nhas highest latency?"}
    B --> C["'query_rewrite' segment\ntakes > 700ms?"]
    C -->|Yes| D["CONFIRM: query rewriter\nbreaching latency budget"]
    D --> E["Is the rewrite result\ncacheable for this query?"]
    E -->|Yes| F["Add Redis cache to\nquery rewriter"]
    E -->|No| G["Switch rewriter to\nClaude 3 Haiku with lower max_tokens"]
    D --> H["Measure: does rewrite\nactually improve MRR for this query type?"]
    H -->|No| I["Disable rewriter for\nthis query intent class"]

Root Cause

  1. The query rewriter used Claude 3 Sonnet (900 ms avg) instead of the faster Claude 3 Haiku (150 ms avg).
  2. No Redis cache was applied to rewritten queries; every unique query hit the LLM even for very similar (paraphrased) inputs.
  3. The rewriter was applied unconditionally — even short, already-optimal queries like "Naruto volume 1" were rewritten at extra cost.

Resolution

"""
Runbook: Low-latency query rewriter with selective activation and caching.
"""

import boto3
import json
import hashlib
import time
import redis

REGION        = "us-east-1"
HAIKU_ID      = "anthropic.claude-3-haiku-20240307-v1:0"
CACHE_TTL     = 600   # 10 min — rewritten queries are session-independent
REWRITE_MAX_TOKENS = 60   # Keep rewrites short

bedrock_rt = boto3.client("bedrock-runtime", region_name=REGION)
r          = redis.Redis(host="mangaassist-cache.abc.ng.0001.use1.cache.amazonaws.com",
                         port=6379, ssl=True, decode_responses=True)


# ── Query complexity classifier ───────────────────────────────────────────────
def needs_rewrite(query: str) -> bool:
    """
    Only rewrite queries that benefit: colloquial, comparative, or vague queries.
    Skip rewrites for short exact queries, ISBNs, or ASINs.
    """
    words = query.strip().split()
    if len(words) <= 3:
        return False   # too short to benefit
    indicators = ["like", "similar", "something", "recommend", "kind of",
                  "any", "best", "good", "show me", "what about"]
    return any(ind in query.lower() for ind in indicators)


# ── Rewriter with cache ────────────────────────────────────────────────────────
def rewrite_query(query: str) -> str:
    """
    Rewrite query using Claude 3 Haiku with Redis caching.
    Returns original query if rewrite is not needed or fails.
    """
    if not needs_rewrite(query):
        return query   # bypass rewrite entirely

    cache_key = f"qrewrite:{hashlib.sha256(query.encode()).hexdigest()}"
    cached    = r.get(cache_key)
    if cached:
        return cached

    t0 = time.time()
    try:
        resp = bedrock_rt.invoke_model(
            modelId=HAIKU_ID,
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": REWRITE_MAX_TOKENS,
                "messages": [{
                    "role": "user",
                    "content": (
                        f"Rewrite this manga shop query for semantic search. "
                        f"Keep it concise (< 15 words). Return only the rewritten query.\n\n"
                        f"Query: {query}"
                    ),
                }],
            }),
            accept="application/json", contentType="application/json",
        )
        rewritten = json.loads(resp["body"].read())["content"][0]["text"].strip()
        elapsed   = (time.time() - t0) * 1000
        print(f"[REWRITE] '{query}' → '{rewritten}' ({elapsed:.0f}ms)")

        r.setex(cache_key, CACHE_TTL, rewritten)
        return rewritten

    except Exception as e:
        print(f"[REWRITE] Failed: {e} — using original query")
        return query   # fail open


# ── Budget-aware query pipeline ────────────────────────────────────────────────
LATENCY_BUDGET = {
    "rewrite":   150,   # ms budget (Haiku)
    "retrieve":  300,
    "rerank":    200,
    "inference": 1500,
}

def query_with_budget(raw_query: str) -> dict:
    timings = {}
    t = time.time()
    rewritten = rewrite_query(raw_query)
    timings["rewrite"] = (time.time() - t) * 1000
    if timings["rewrite"] > LATENCY_BUDGET["rewrite"] * 2:
        print(f"[BUDGET] Query rewrite overran budget: {timings['rewrite']:.0f}ms")
    return {"rewritten_query": rewritten, "timings": timings}

Prevention Steps

  1. Haiku-only rewriter: Always use Claude 3 Haiku (not Sonnet) for query rewriting; at the prices above, Haiku is 12× cheaper ($0.25/$1.25 vs $3/$15 per 1M tokens) and roughly 6× faster (150 ms vs 900 ms average).
  2. Selective activation: Gate rewriting behind needs_rewrite() — skip for short or exact-match queries (estimated 40% of traffic).
  3. Rewrite cache: Cache rewritten queries in Redis with 10-minute TTL; typical cache hit rate for repeat queries is 20–30%.
  4. Latency budget enforcement: Publish per-step latency to CloudWatch; alarm if the rewrite segment exceeds 300 ms at P95 (a publishing sketch follows this list).
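
A minimal sketch of step 4. The metric namespace and dimension names are assumptions for illustration, not existing MangaAssist resources:

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def publish_step_latency(step: str, elapsed_ms: float) -> None:
    # Publish one datapoint per pipeline step; a P95 alarm on
    # Step=rewrite then enforces the 300 ms budget.
    cloudwatch.put_metric_data(
        Namespace="MangaAssist/QueryPipeline",   # hypothetical namespace
        MetricData=[{
            "MetricName": "StepLatency",
            "Dimensions": [{"Name": "Step", "Value": step}],
            "Value": elapsed_ms,
            "Unit": "Milliseconds",
        }],
    )

# Usage inside query_with_budget():
#   publish_step_latency("rewrite", timings["rewrite"])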

Scenario 2: HyDE Generates Wrong-Language Hypothetical Document

Problem

MangaAssist enables HyDE (Hypothetical Document Embedding) to improve retrieval for niche queries: Claude 3 Haiku generates a plausible catalog description, which is then embedded and used as the retrieval query instead of the raw user query. For Japanese-language queries like "鬼滅の刃のあらすじ教えて" (Tell me the synopsis of Demon Slayer), Haiku generates an English hypothetical description. The English embedding is far in vector space from the Japanese catalog chunks, causing retrieval to return unrelated English-language titles.

Detection

flowchart TD
    A["Japanese queries return\nunrelated English results"] --> B{"Is HyDE enabled\nfor this query flow?"}
    B -->|Yes| C["Log the hypothetical\ndocument generated by HyDE"]
    C --> D["Hypothetical doc is in\nEnglish for JP query?"]
    D -->|Yes| E["CONFIRM: language mismatch\nin HyDE generation"]
    E --> F["Add language detection\nand language-aware HyDE prompt"]
    B -->|No| G["Check chunking and\nembedding quality for JP text"]
    D -->|No| H["Check cosine similarity\nbetween hyp doc and catalog chunks"]

Root Cause

  1. The HyDE prompt was in English and instructed Claude 3 to generate an English catalog description regardless of query language.
  2. No language detection was applied to user queries before HyDE generation.
  3. The MangaAssist catalog has mixed JP/EN content; JP and EN chunks occupy well-separated regions of the embedding space, so an English HyDE document matches only the English subset.

Resolution

"""
Runbook: Language-aware HyDE implementation for MangaAssist bilingual catalog.
"""

import boto3
import json

REGION   = "us-east-1"
HAIKU_ID = "anthropic.claude-3-haiku-20240307-v1:0"

bedrock_rt = boto3.client("bedrock-runtime", region_name=REGION)


# ── Language detection ─────────────────────────────────────────────────────────
def detect_language(text: str) -> str:
    """
    Simple language detection based on character class.
    Returns 'ja' for Japanese-dominant queries, 'en' otherwise.
    """
    # Hiragana/katakana (U+3040–U+30FF) plus CJK ideographs (U+4E00–U+9FFF)
    ja_chars    = sum(1 for c in text
                      if "\u3040" <= c <= "\u30FF" or "\u4E00" <= c <= "\u9FFF")
    total_chars = max(1, len(text.strip()))
    return "ja" if ja_chars / total_chars > 0.2 else "en"


# ── Language-aware HyDE ───────────────────────────────────────────────────────
HYDE_PROMPTS = {
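    # The JA prompt reads, roughly: "For the question below, generate a
    # one-paragraph (100 characters or fewer) Japanese catalog product
    # description; include real titles and keep it natural and specific."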
    "ja": (
        "以下の質問に対して、マンガカタログの商品説明文を日本語で1段落(100文字以内)で生成してください。"
        "実在のタイトルを含め、自然で具体的な商品説明にしてください。\n\n質問: {query}"
    ),
    "en": (
        "Generate a one-paragraph (under 100 words) manga catalog product description "
        "in English that would answer this query. Include specific manga titles, genres, "
        "and themes.\n\nQuery: {query}"
    ),
}

def generate_hypothetical_document(query: str) -> str:
    """
    Generate a language-matched hypothetical catalog document for HyDE.
    Falls back to original query if generation fails.
    """
    lang   = detect_language(query)
    prompt = HYDE_PROMPTS[lang].format(query=query)

    try:
        resp = bedrock_rt.invoke_model(
            modelId=HAIKU_ID,
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 150,
                "messages": [{"role": "user", "content": prompt}],
            }),
            accept="application/json", contentType="application/json",
        )
        hyp_doc = json.loads(resp["body"].read())["content"][0]["text"].strip()
        print(f"[HYDE] Language={lang}, hypothesis: '{hyp_doc[:60]}...'")
        return hyp_doc
    except Exception as e:
        print(f"[HYDE] Failed: {e} — using original query")
        return query


# ── Selective HyDE: only activate for ambiguous queries ──────────────────────
def should_use_hyde(query: str) -> bool:
    """
    HyDE helps most for vague/broad queries. Skip for exact or short queries.
    Whitespace word counts are meaningless for Japanese, so JP queries are
    gated on character length instead.
    """
    vague_signals = ["recommend", "similar", "good", "best", "suggest",
                     "おすすめ", "似た", "面白い", "教えて"]
    if not any(sig in query.lower() for sig in vague_signals):
        return False
    if detect_language(query) == "ja":
        return len(query) > 6          # JP queries have no word spaces
    return len(query.split()) > 4      # short exact queries don't benefit


# ── HyDE-enabled retrieval ────────────────────────────────────────────────────
def retrieve_with_hyde(query: str, kb_id: str, top_k: int = 10) -> list[dict]:
    bedrock_agent_rt = boto3.client("bedrock-agent-runtime", region_name=REGION)
    retrieval_query  = generate_hypothetical_document(query) if should_use_hyde(query) else query

    resp = bedrock_agent_rt.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": retrieval_query},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": top_k}},
    )
    return resp.get("retrievalResults", [])

Prevention Steps

  1. Language-aware HyDE prompts: Always detect query language and match the hypothetical document language to the dominant catalog language for that user.
  2. Selective HyDE: Only activate HyDE for vague/broad queries (should_use_hyde()); disable it for short, exact-match, or ISBN queries.
  3. HyDE quality check: Log the generated hypothetical document alongside the retrieval score; alert if avg cosine similarity for JP queries drops below 0.5 after HyDE.
  4. Cohere Multilingual alternative: Consider cohere.embed-multilingual-v3 as the embedding model to reduce the language-space separation issue (a query-embedding sketch follows this list).
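
A sketch of step 4, assuming the Cohere Embed request/response shape on Bedrock ({"texts": ..., "input_type": ...} in, {"embeddings": ...} out); verify against the current model documentation before relying on it:

import boto3
import json

bedrock_rt = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed_multilingual(text: str, input_type: str = "search_query") -> list[float]:
    # One shared vector space for JP and EN text reduces the
    # language-separation problem that broke HyDE above.
    resp = bedrock_rt.invoke_model(
        modelId="cohere.embed-multilingual-v3",
        body=json.dumps({"texts": [text], "input_type": input_type}),
        accept="application/json", contentType="application/json",
    )
    return json.loads(resp["body"].read())["embeddings"][0]

# Catalog chunks would be embedded with input_type="search_document" at
# indexing time; queries use "search_query" at request time.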

Scenario 3: Multi-Hop Retrieval Infinite Loop for Ambiguous Queries

Problem

MangaAssist implements multi-hop retrieval: if the first retrieval pass returns low-confidence results (max score < 0.5), the system re-queries with a reformulated question. A bug causes the reformulation to produce the same question when the original query is ambiguous ("show me manga"), creating an infinite hop loop. Each hop makes two Bedrock calls (reformulation plus knowledge-base retrieval), consuming ECS task CPU and quickly exhausting the OpenSearch OCU budget. The stuck task hangs until the request times out upstream, returning a 504 error.

Detection

flowchart TD
    A["ECS task unresponsive:\n504 Gateway Timeout"] --> B{"Check ECS CloudWatch Logs:\nrepeated 'hop' log entries?"}
    B --> C["Same query logged\n> 3 times in one request?"]
    C -->|Yes| D["CONFIRM: multi-hop\ninfinite loop detected"]
    D --> E["Kill the stuck task\n(ECS task stop or restart)"]
    E --> F["Add max_hops=3 guard\nto multi-hop loop"]
    D --> G["Add circuit breaker:\nexit on same reformulation 2× in a row"]
    C -->|No| H["Check for other\nblocking I/O in ECS task"]

Root Cause

  1. The loop termination condition was max_score < 0.5: if reformulation re-produces the same underperforming query, the loop never terminates.
  2. No maximum hop count guard existed.
  3. No deduplication check compared the current reformulated query to previous hop queries.

Resolution

"""
Runbook: Safe multi-hop retrieval with loop guard for MangaAssist.
"""

import boto3
import json

REGION            = "us-east-1"
KNOWLEDGE_BASE_ID = "kb-mangaassist-prod"
HAIKU_ID          = "anthropic.claude-3-haiku-20240307-v1:0"

bedrock_rt   = boto3.client("bedrock-runtime",       region_name=REGION)
bedrock_agrt = boto3.client("bedrock-agent-runtime", region_name=REGION)


def _retrieve(query: str, top_k: int = 10) -> tuple[list[dict], float]:
    resp   = bedrock_agrt.retrieve(
        knowledgeBaseId=KNOWLEDGE_BASE_ID,
        retrievalQuery={"text": query},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": top_k}},
    )
    results   = resp.get("retrievalResults", [])
    max_score = max((r.get("score", 0) for r in results), default=0)
    return results, max_score


def _reformulate(query: str, context_snippets: list[str]) -> str:
    ctx = "\n".join(context_snippets[:3])
    resp = bedrock_rt.invoke_model(
        modelId=HAIKU_ID,
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 60,
            "messages": [{
                "role": "user",
                "content": (
                    f"Rewrite this query to be more specific for a manga catalog search.\n"
                    f"Previous partial context:\n{ctx}\n\nOriginal query: {query}\n"
                    f"Rewritten query (return ONLY the query, < 15 words):"
                ),
            }],
        }),
        accept="application/json", contentType="application/json",
    )
    return json.loads(resp["body"].read())["content"][0]["text"].strip()


def multi_hop_retrieve(
    query: str,
    max_hops: int = 3,
    score_threshold: float = 0.5,
    top_k: int = 10,
) -> list[dict]:
    """
    Safe multi-hop retrieval with:
    - max_hops hard limit
    - Deduplication against previous queries
    - Score threshold to stop when good enough
    """
    seen_queries: set[str] = set()
    current_query = query
    results: list[dict] = []   # defined even if the loop exits on the first hop

    for hop in range(max_hops):
        # ── Deduplication guard ────────────────────────────────────────────
        normalized = current_query.strip().lower()
        if normalized in seen_queries:
            print(f"[MULTIHOP] Hop {hop}: reformulation repeated a previous query — stopping")
            return results
        seen_queries.add(normalized)

        # ── Retrieve ───────────────────────────────────────────────────────
        results, max_score = _retrieve(current_query, top_k=top_k)
        print(f"[MULTIHOP] Hop {hop}: query='{current_query}' max_score={max_score:.3f}")

        # ── Early exit if confident ────────────────────────────────────────
        if max_score >= score_threshold:
            print(f"[MULTIHOP] Score {max_score:.3f} >= threshold — stopping")
            return results

        # ── Prepare next hop ──────────────────────────────────────────────
        if hop < max_hops - 1:
            snippets      = [r["content"]["text"][:200] for r in results[:3]]
            current_query = _reformulate(current_query, snippets)

    print(f"[MULTIHOP] Max hops ({max_hops}) reached — returning best results")
    return results


# ── Integration guard: timeboxed wrapper ─────────────────────────────────────
import concurrent.futures

def safe_multi_hop_retrieve(query: str, timeout_sec: float = 2.0) -> list[dict]:
    """Wrap multi_hop_retrieve with a hard wall-clock timeout."""
    # Don't use the executor as a context manager here: its __exit__ calls
    # shutdown(wait=True) and would block until the stuck call finishes,
    # defeating the timeout.
    ex     = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = ex.submit(multi_hop_retrieve, query)
    try:
        return future.result(timeout=timeout_sec)
    except concurrent.futures.TimeoutError:
        print(f"[MULTIHOP] Wall-clock timeout ({timeout_sec}s) — returning []")
        return []
    finally:
        ex.shutdown(wait=False)   # let the worker thread finish in the background

Prevention Steps

  1. max_hops=3: Hard-code a maximum hop count of 3 in all multi-hop implementations; never rely on score threshold alone to terminate.
  2. Query deduplication: Check every reformulated query against the set of previous hop queries before executing; stop on a repeat (a regression-test sketch follows this list).
  3. Wall-clock timeout: Wrap multi_hop_retrieve() in a concurrent.futures timeout of 2 seconds to prevent ECS task hangs.
  4. Hop count metric: Publish HopCount per request to CloudWatch; alarm if P99 hop count reaches 3, which means the loop guard is triggering frequently.
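
A regression-test sketch for the loop guard in steps 1 and 2, assuming the resolution code above lives in a module named multihop (the module name and pytest monkeypatch usage are illustrative):

import multihop

def test_loop_guard_terminates(monkeypatch):
    calls = {"retrieve": 0}

    def fake_retrieve(query, top_k=10):
        # Always low confidence, so the loop would never stop on score alone
        calls["retrieve"] += 1
        return [{"content": {"text": "stub"}, "score": 0.1}], 0.1

    monkeypatch.setattr(multihop, "_retrieve", fake_retrieve)
    # Degenerate reformulator: returns the query unchanged every time
    monkeypatch.setattr(multihop, "_reformulate", lambda q, ctx: q)

    multihop.multi_hop_retrieve("show me manga", max_hops=3)
    # Hop 0 retrieves; hop 1 hits the deduplication guard and stops
    assert calls["retrieve"] == 1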

Scenario 4: Step-Back Prompting Returns Overly Generic Retrieval Context

Problem

MangaAssist uses step-back prompting for complex queries: Claude 3 Haiku generates a more abstract "step-back" question to retrieve broader context before the specific retrieval. For the query "Is Jujutsu Kaisen volume 23 available for international shipping to Germany?", the step-back question generated is "What manga does MangaAssist carry?" — far too vague. The retrieval for the step-back question floods the context with popular titles (One Piece, Naruto, My Hero Academia) rather than shipping policy or Jujutsu Kaisen availability information.

Detection

flowchart TD
    A["Complex product/shipping queries\nget irrelevant popular-title answers"] --> B{"Is step-back prompting\nactive for these queries?"}
    B -->|Yes| C["Log the step-back question\ngenerated by Haiku"]
    C --> D["Step-back question too generic:\n< 5 specific words?"]
    D -->|Yes| E["CONFIRM: step-back over-generalizes\nfor specific entity queries"]
    E --> F["Restrict step-back to semantic\nbroad queries only"]
    E --> G["Improve step-back prompt:\npreserve key entities"]
    B -->|No| H["Check query intent routing\nor retrieval config"]

Root Cause

  1. The step-back prompt was too permissive, allowing Claude 3 to strip all specific entities (title, country, volume) from the abstracted question.
  2. Step-back was applied universally to all query types, including specific entity-lookup and policy queries that don't benefit from abstraction.
  3. No minimum information-preservation check ensured the step-back question retained key named entities.

Resolution

"""
Runbook: Entity-preserving step-back prompting for MangaAssist.
"""

import boto3
import json
import re

REGION   = "us-east-1"
HAIKU_ID = "anthropic.claude-3-haiku-20240307-v1:0"

bedrock_rt = boto3.client("bedrock-runtime", region_name=REGION)


# ── Determine if step-back is appropriate ─────────────────────────────────────
def should_use_stepback(query: str) -> bool:
    """
    Step-back helps for broad/conceptual queries.
    Avoid for: specific entity lookups, exact shipping/policy questions.
    """
    specific_signals = [
        r"volume\s+\d+", r"\bvol\.?\s*\d+", r"isbn", r"asin",
        r"\bprice\b", r"\bshipping\b", r"\bdelivery\b",
        r"\bavailable\b", r"\bin stock\b",
        r"978[-\s]?\d",  # ISBN-13 prefix pattern
    ]
    query_lower = query.lower()
    return not any(re.search(p, query_lower) for p in specific_signals)


# ── Entity-preserving step-back ───────────────────────────────────────────────
def generate_stepback_question(query: str) -> str:
    """
    Generate a step-back question that abstracts the query while
    preserving key named entities (manga titles, genres, publishers).
    Falls back to the original query if generation fails.
    """
    try:
        resp = bedrock_rt.invoke_model(
            modelId=HAIKU_ID,
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 80,
                "messages": [{
                    "role": "user",
                    "content": (
                        "Generate a slightly more general version of this manga query "
                        "for background context retrieval. "
                        "PRESERVE all manga titles, genres, and author names. "
                        "Only broaden the specific question (e.g., 'volume 5' → 'series'). "
                        "Return ONLY the step-back question.\n\n"
                        f"Original query: {query}"
                    ),
                }],
            }),
            accept="application/json", contentType="application/json",
        )
        stepback = json.loads(resp["body"].read())["content"][0]["text"].strip()
    except Exception as e:
        print(f"[STEPBACK] Failed: {e} — using original query")
        return query   # fail open, like the other rewriters

    # Validate: step-back should not be shorter than 40% of original
    if len(stepback) < len(query) * 0.4:
        print(f"[STEPBACK] Over-generalized — using original query")
        return query

    return stepback


# ── Step-back enhanced retrieval ──────────────────────────────────────────────
def retrieve_with_stepback(query: str, kb_id: str, top_k: int = 10) -> list[dict]:
    bedrock_agrt = boto3.client("bedrock-agent-runtime", region_name=REGION)

    if not should_use_stepback(query):
        # Skip step-back for specific entity queries
        resp = bedrock_agrt.retrieve(
            knowledgeBaseId=kb_id,
            retrievalQuery={"text": query},
            retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": top_k}},
        )
        return resp.get("retrievalResults", [])

    # Generate step-back and retrieve broader context
    stepback = generate_stepback_question(query)
    print(f"[STEPBACK] '{query}' → '{stepback}'")

    # Retrieve from both original and step-back queries; merge
    results_orig     = bedrock_agrt.retrieve(
        knowledgeBaseId=kb_id, retrievalQuery={"text": query},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": top_k // 2}},
    ).get("retrievalResults", [])
    results_stepback = bedrock_agrt.retrieve(
        knowledgeBaseId=kb_id, retrievalQuery={"text": stepback},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": top_k // 2}},
    ).get("retrievalResults", [])

    # Dedup by content hash and return combined
    seen    = set()
    merged  = []
    for r in results_orig + results_stepback:
        content_key = r["content"]["text"][:100]
        if content_key not in seen:
            seen.add(content_key)
            merged.append(r)
    return merged[:top_k]

Prevention Steps

  1. Selective step-back: Gate step-back on should_use_stepback(); never apply to specific entity or policy queries.
  2. Entity preservation validation: Post-generation, check that key nouns from the original query appear in the step-back question; fall back to the original query if not (a heuristic sketch follows this list).
  3. Step-back golden set: Include 10 entity-specific queries in the evaluation golden set and assert that step-back does NOT degrade precision for those queries.
  4. Dual retrieval: Always merge step-back results with original-query results rather than replacing; step-back adds background context, not the primary answer.
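
A heuristic sketch of the entity-preservation check in step 2. The capitalized-token heuristic is a stand-in for proper named-entity recognition, which a production check would likely use instead:

import re

def preserves_entities(original: str, stepback: str) -> bool:
    # Treat capitalized tokens as candidate entities (titles, authors, places)
    entities = set(re.findall(r"\b[A-Z][a-zA-Z]+\b", original))
    if not entities:
        return True   # nothing to preserve
    kept = sum(1 for e in entities if e in stepback)
    return kept / len(entities) >= 0.5

# e.g. the Scenario 4 failure case:
# preserves_entities(
#     "Is Jujutsu Kaisen volume 23 available for shipping to Germany?",
#     "What manga does MangaAssist carry?")
# → False, so the pipeline falls back to the original query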

Scenario 5: Intent Classifier Routes Product Queries to FAQ Index

Problem

MangaAssist's intent classifier routes queries to either CATALOG (product descriptions) or FAQ (policies, shipping, returns) retrieval. The classifier, a zero-shot Claude 3 Haiku call, misclassifies the query "What's the return policy if a manga volume is damaged on arrival?" as CATALOG intent (because "manga volume" is detected as a product entity). The query is routed to the catalog index, returning product descriptions instead of the return policy, causing Claude 3 to invent a policy answer.

Detection

flowchart TD
    A["Policy queries answered\nwith product descriptions"] --> B{"Check orchestrator logs:\nintent classification output"}
    B --> C["'damaged' / 'return' / 'policy'\nqueries classified as CATALOG?"]
    C -->|Yes| D["CONFIRM: intent classifier\nrouting error"]
    D --> E["Inspect classifier prompt:\nare policy keywords in few-shot examples?"]
    E --> F["Add policy-specific few-shot\nexamples to classifier prompt"]
    D --> G["Add keyword-override rules\nfor explicit policy terms"]
    C -->|No| H["Check index content:\nFAQ index may be missing policy chunks"]

Root Cause

  1. The zero-shot classifier only saw product-centric examples; policy queries containing product nouns were ambiguous.
  2. No keyword-override rules existed for policy-critical terms (return, refund, damaged, shipping cost, delivery time).
  3. Intent classification accuracy was not measured on a held-out test set before deployment.

Resolution

"""
Runbook: Robust intent classification with keyword override for MangaAssist.
"""

import boto3
import json
import re

REGION   = "us-east-1"
HAIKU_ID = "anthropic.claude-3-haiku-20240307-v1:0"

bedrock_rt = boto3.client("bedrock-runtime", region_name=REGION)


# ── Layer 1: Keyword override (highest priority) ──────────────────────────────
FAQ_KEYWORDS = [r"\breturn\b", r"\brefund\b", r"\bshipping\b", r"\bdelivery\b",
                r"\bpolicy\b", r"\bdamaged\b", r"\bcancel\b", r"\btrack\b",
                r"\bwarranty\b", r"\bcustomer service\b", r"\bcontact\b",
                r"返品", r"配送", r"送料", r"キャンセル"]   # Japanese policy terms

CATALOG_KEYWORDS = [r"\bseries\b", r"\bvolume\b", r"\bauthor\b",
                    r"\bgenre\b", r"\bplot\b", r"\bcharacter\b",
                    r"あらすじ", r"漫画", r"巻", r"著者"]   # Japanese catalog terms


def keyword_classify(query: str) -> str | None:
    """
    High-confidence keyword-based pre-classification.
    Returns 'FAQ', 'CATALOG', or None (ambiguous — use LLM).
    """
    q = query.lower()
    faq_hits     = sum(1 for p in FAQ_KEYWORDS     if re.search(p, q))
    catalog_hits = sum(1 for p in CATALOG_KEYWORDS if re.search(p, q))
    if faq_hits > catalog_hits:
        return "FAQ"
    if catalog_hits > faq_hits:
        return "CATALOG"
    return None  # ambiguous


# ── Layer 2: LLM classifier with few-shot examples ───────────────────────────
FEW_SHOT = """
Query: "What is your return policy for used manga?"
Intent: FAQ

Query: "Can I get a refund if my order arrives damaged?"
Intent: FAQ

Query: "How long does shipping to the US take?"
Intent: FAQ

Query: "Tell me about the plot of Berserk volume 1"
Intent: CATALOG

Query: "Recommend seinen manga similar to Monster"
Intent: CATALOG

Query: "What manga does Hirohiko Araki write?"
Intent: CATALOG
"""

def llm_classify(query: str) -> str:
    resp = bedrock_rt.invoke_model(
        modelId=HAIKU_ID,
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 10,
            "messages": [{
                "role": "user",
                "content": (
                    f"Classify this query as FAQ (shipping/returns/policy) or CATALOG (product/manga info).\n"
                    f"{FEW_SHOT}\nQuery: \"{query}\"\nIntent:"
                ),
            }],
        }),
        accept="application/json", contentType="application/json",
    )
    raw    = json.loads(resp["body"].read())["content"][0]["text"].upper()
    intent = "FAQ" if "FAQ" in raw else "CATALOG"
    return intent


# ── Combined classifier ────────────────────────────────────────────────────────
def classify_intent(query: str) -> str:
    """Keyword override first; LLM only for ambiguous queries."""
    kw_result = keyword_classify(query)
    if kw_result:
        print(f"[INTENT] Keyword classified: {kw_result}")
        return kw_result
    llm_result = llm_classify(query)
    print(f"[INTENT] LLM classified: {llm_result}")
    return llm_result


# ── Accuracy evaluation on golden test set ────────────────────────────────────
GOLDEN_INTENTS = [
    ("What is your return policy?", "FAQ"),
    ("Can I cancel my manga preorder?", "FAQ"),
    ("damaged manga on arrival refund", "FAQ"),
    ("Best shonen manga for beginners", "CATALOG"),
    ("One Piece synopsis volume 1", "CATALOG"),
    ("Recommend manga similar to Naruto", "CATALOG"),
]

def evaluate_classifier(classifier_fn=classify_intent) -> float:
    hits = sum(1 for q, expected in GOLDEN_INTENTS
               if classifier_fn(q) == expected)
    accuracy = hits / len(GOLDEN_INTENTS)
    print(f"Classifier accuracy: {accuracy:.0%} ({hits}/{len(GOLDEN_INTENTS)})")
    assert accuracy >= 0.9, f"Classifier below 90%: {accuracy:.0%}"
    return accuracy

Prevention Steps

  1. Keyword override layer: Deploy keyword_classify() as the first classification pass; it's free, deterministic, and handles the most common ambiguous cases.
  2. Few-shot examples: Ensure LLM classifier few-shot examples cover the specific edge cases (product noun + policy verb) that cause misclassification.
  3. Accuracy gate: Run evaluate_classifier() with 30+ golden examples before deploying any classifier prompt change; assert ≥ 90% accuracy.
  4. Fallback routing: On ambiguous intent, retrieve from both FAQ and CATALOG (top 3 each) and let Claude 3 determine which context is most relevant (a sketch follows this list).
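
A sketch of step 4, reusing the knowledge-base retrieve pattern from the resolution code above; the FAQ knowledge base ID is a hypothetical placeholder:

import boto3

REGION = "us-east-1"
KB_IDS = {"FAQ": "kb-mangaassist-faq", "CATALOG": "kb-mangaassist-prod"}  # FAQ ID is hypothetical

bedrock_agrt = boto3.client("bedrock-agent-runtime", region_name=REGION)

def retrieve_with_fallback(query: str, intent: str | None) -> list[dict]:
    # On ambiguous intent (None), pull top 3 from both indexes and let
    # Claude 3 decide which context is relevant during answer generation.
    targets = [intent] if intent in KB_IDS else ["FAQ", "CATALOG"]
    results: list[dict] = []
    for t in targets:
        resp = bedrock_agrt.retrieve(
            knowledgeBaseId=KB_IDS[t],
            retrievalQuery={"text": query},
            retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 3}},
        )
        results.extend(resp.get("retrievalResults", []))
    return results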