
02 — ProductSearchAgent

Catalog discovery for 5M+ multilingual manga titles. Thin domain wrapper over the Catalog Search MCP.

The ProductSearchAgent is the sub-agent the Orchestrator calls whenever a user message contains a search-shaped intent: title lookup, genre browsing, author discovery, ISBN/ASIN resolution. It is not a separate Claude instance — it is a logical handler exposed to the Orchestrator as a set of MCP tools, plus the small amount of domain logic that lives between the tools and the catalog index.


What it is

A logical sub-agent that owns three responsibilities:

  1. Query understanding — locale detection (English / Japanese / JP romaji), entity extraction (series name, volume number, edition).
  2. Retrieval orchestration — calls the Catalog Search MCP with the right parameters and post-processes results.
  3. Result shaping — enriches raw catalog hits with stock signals, price, and availability before returning to the Orchestrator.
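The latency budget later in this document describes query understanding as "regex + small lookup, in-process". A minimal sketch of the entity-extraction piece, assuming that shape — `VOLUME_RE` and `extract_volume` are illustrative names, not the agent's actual implementation, and a production extractor would also cover Japanese volume forms such as 第12巻:

```python
import re

# Illustrative volume-number extractor. The production version also
# handles kana/kanji volume forms and edition qualifiers.
VOLUME_RE = re.compile(r"\bvol(?:ume)?\.?\s*(\d+)\b", re.IGNORECASE)

def extract_volume(query: str):
    """Return the volume number mentioned in a query, or None."""
    match = VOLUME_RE.search(query)
    return int(match.group(1)) if match else None
```

A cheap regex pass like this keeps the stage inside its 20ms budget; anything it misses falls through to retrieval unharmed.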

It is backed by the Catalog Search MCP (../RAG-MCP-Integration/01-catalog-search-mcp.md), which itself wraps an OpenSearch Serverless index of 5M+ titles.


Tools exposed to the Orchestrator

Tool                                 | Purpose                             | Typical use
-------------------------------------|-------------------------------------|------------
search_manga(query, filters)         | Hybrid retrieval over the catalog   | "Find dark fantasy manga"
get_manga_details(asin)              | Structured fetch of a single title  | "Tell me about Berserk volume 42"
find_by_author(author, lang)         | Author-scoped listing               | "Show me everything by Naoki Urasawa"
find_by_series(series_name, volume?) | Series + volume resolution          | "Vinland Saga volume 12 in Japanese"

Each tool description is engineered to disambiguate from similar tools in other MCPs — see 06-tool-dispatch-and-routing.md for tool description engineering.
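To make the disambiguation concrete, here is a sketch of how search_manga might be declared. The description text and filter fields are illustrative, not the schema this repo actually ships; only the name/description/inputSchema shape follows the MCP tool convention:

```python
# Illustrative MCP tool declaration. Note the negative guidance in the
# description: it steers the model away from sibling tools.
SEARCH_MANGA_TOOL = {
    "name": "search_manga",
    "description": (
        "Hybrid keyword + semantic search over the manga catalog. "
        "Use for open-ended discovery queries ('dark fantasy manga'). "
        "Do NOT use when the user supplies an ASIN (use get_manga_details) "
        "or asks for an author's works (use find_by_author)."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Free-text search query"},
            "filters": {
                "type": "object",
                "properties": {
                    "genre": {"type": "string"},
                    "language": {"type": "string", "enum": ["en", "ja"]},
                },
            },
        },
        "required": ["query"],
    },
}
```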


Retrieval pipeline

User query → Locale detection → Entity extraction
          → Hybrid retrieval (BM25 + Titan dense)
          → RRF fusion (top-50 candidates)
          → BGE-reranker (top-3)
          → Stock + price enrichment
          → Return to Orchestrator

Hybrid retrieval

Pure BM25 misses semantic matches ("dark fantasy" → "Berserk" requires no shared keywords). Pure dense retrieval misses exact-title matches in the long tail. We run both and fuse with Reciprocal Rank Fusion:

from collections import defaultdict

def rrf(rank_lists, k=60):
    """Reciprocal Rank Fusion: merge several ranked lists of doc IDs."""
    scores = defaultdict(float)
    for rank_list in rank_lists:
        for rank, doc_id in enumerate(rank_list):
            # Earlier ranks contribute more; k damps the gap between
            # the head of a list and its tail.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda x: -x[1])

k=60 is empirically chosen. The constant controls how steeply a document's contribution decays with rank: too low and each list's top hit dominates the fused score, so a single strong exact match drowns out semantic agreement deeper in the lists; too high and rank differences flatten out until fusion degenerates toward a simple vote count.

Multilingual handling

Titan Embeddings v2 has decent cross-lingual alignment for English/Japanese, but Japanese romaji ("Naruto" written in romaji vs. ナルト in kana) is a known weak spot. Mitigation: a small lookup table normalizes common romaji titles to their canonical Japanese form before embedding.
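The normalization step is a straight table lookup before the embedding call. A minimal sketch — the table entries here are examples, and the real table lives alongside the catalog index; exact-string matching also means romaji titles embedded inside longer queries are not rewritten, which is a deliberate simplification:

```python
# Illustrative romaji -> canonical-Japanese lookup. The production
# table is larger and maintained with the catalog index.
ROMAJI_CANONICAL = {
    "naruto": "ナルト",
    "one piece": "ワンピース",
    "shingeki no kyojin": "進撃の巨人",
}

def normalize_for_embedding(query: str) -> str:
    """Swap a known romaji title for its canonical Japanese form so the
    dense leg matches Japanese-language catalog metadata."""
    key = query.strip().lower()
    return ROMAJI_CANONICAL.get(key, query)
```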


State management

Stateless per call. The agent owns no persistent state of its own — everything it needs is in the call payload or in the Catalog MCP's index.

Two caches sit in the path:

Cache           | Store       | TTL       | What it caches
----------------|-------------|-----------|---------------
Embedding cache | ElastiCache | 1 hour    | Query → embedding vector (dedupes identical queries)
Result cache    | ElastiCache | 5 minutes | Query+filter hash → top-3 results

Result-cache TTL is short because catalog stock and price change frequently; we don't want a 1-hour-stale "in stock" lying to a customer who clicks through.
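A sketch of how the two cache keys might be derived, assuming hashed keys in a shared ElastiCache cluster — the key prefixes, hashing scheme, and TTL constants are illustrative, not the deployed implementation:

```python
import hashlib
import json

EMBEDDING_TTL_S = 3600  # 1 hour: the embedding of a given string never changes
RESULT_TTL_S = 300      # 5 minutes: stock and price go stale quickly

def embedding_cache_key(query):
    # Normalize before hashing so trivially different spellings of the
    # same query hit the same cached vector.
    digest = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    return f"emb:{digest}"

def result_cache_key(query, filters=None):
    # sort_keys makes {"genre": g, "lang": l} and {"lang": l, "genre": g}
    # hash to the same entry.
    payload = json.dumps({"q": query, "f": filters or {}}, sort_keys=True)
    return f"res:{hashlib.sha256(payload.encode()).hexdigest()}"
```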


Failure handling

Failure                        | Detection                   | Recovery
-------------------------------|-----------------------------|---------
Empty result set               | len(hits) == 0              | Broaden filters → drop genre constraint → semantic fallback with lower threshold → structured no_results with suggested alternatives
OpenSearch timeout             | 800ms exceeded              | One retry → return cached result if available → fail with structured error
Reranker SageMaker cold start  | First call after 15min idle | Skip rerank; return RRF top-3 directly with a reranked: false flag
Malformed input (bad ASIN format) | Schema validation        | Return 400 with a reason; no retry
Locale mis-detection           | Manual heuristic flag       | Run both English and Japanese pipelines and merge results

Critical principle: never return silent empty. The Orchestrator's downstream behavior depends on knowing whether "no results" means "we searched and found nothing" vs. "the search failed."
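The distinction can be pinned down as two explicitly different payload shapes. Field names below are illustrative; the point is that "searched, found nothing" and "search failed" are never the same object:

```python
# Sketch of the structured contract the Orchestrator depends on.
def no_results_response(query, attempted_fallbacks, suggestions):
    return {
        "status": "no_results",                  # we searched; nothing matched
        "query": query,
        "fallbacks_tried": attempted_fallbacks,  # e.g. ["drop_genre", "semantic_low_threshold"]
        "suggestions": suggestions,              # nearest-miss titles, if any
    }

def search_error_response(query, reason):
    return {
        "status": "error",                       # the search itself failed
        "query": query,
        "reason": reason,                        # e.g. "opensearch_timeout"
    }
```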


Latency budget

Target: P99 < 800ms per tool call.

Stage              | Budgeted | Notes
-------------------|----------|------
Locale + entity    |    20ms  | Regex + small lookup, in-process
Embedding          |    50ms  | Titan call, often cache hit (~5ms)
OpenSearch hybrid  |   200ms  | Two parallel queries (BM25 + KNN), fused
Reranker           |   100ms  | BGE on SageMaker, warm endpoint
Enrichment         |    30ms  | Stock + price from ElastiCache
Format + return    |    10ms  | XML wrap
-------------------|----------|------
Total              |   410ms  | leaves 390ms for network + cold start
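The 200ms hybrid line only holds if the two legs run concurrently and a slow leg degrades the result rather than stalling the call. A sketch under that assumption — `hybrid_retrieve`, the leg-as-awaitable interface, and the deadline value are illustrative, not the actual client code:

```python
import asyncio

async def hybrid_retrieve(legs, fuse, deadline_s=0.2):
    """Run retrieval legs (awaitables returning ranked doc-id lists)
    concurrently. A leg that misses the deadline is cancelled, so one
    slow leg degrades fusion quality instead of stalling the call."""
    tasks = [asyncio.create_task(leg) for leg in legs]
    done, pending = await asyncio.wait(tasks, timeout=deadline_s)
    for task in pending:
        task.cancel()  # drop the late leg
    rank_lists = [t.result() for t in done if t.exception() is None]
    if not rank_lists:
        raise TimeoutError("all retrieval legs missed the deadline")
    return fuse(rank_lists)
```

In the real pipeline the BM25 and KNN queries would be the two legs and rrf the fuse function; the degraded single-leg case should be flagged in the response the same way a skipped rerank is.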

Why this shape

Alternative                                               | Why we rejected it
----------------------------------------------------------|-------------------
Single retrieval method (BM25 only)                       | Misses semantic matches in long-tail genre queries
Single retrieval method (dense only)                      | Misses exact-title matches; BM25 dominates exact-match precision
Direct OpenSearch from the Orchestrator (no MCP boundary) | Couples LLM context to the query DSL; hard to evolve
Pre-filtering by genre before retrieval                   | Cuts recall on cross-genre titles; the reranker handles this better
A single OpenSearch query that does both                  | OpenSearch supports this, but RRF fusion outside the index gives better tunability

Validation: Constraint Sanity Check

Claimed metric            | Verdict                 | Why
--------------------------|-------------------------|----
P99 < 800ms per tool call | Aggressive under load   | Component sum of medians is 410ms; P99 of components stacks to ~1.1–1.4s during concurrent traffic. Holds at low concurrency, breaks at peak.
OpenSearch hybrid 200ms   | Optimistic              | OpenSearch Serverless KNN over 5M vectors with HNSW: P50 ~80–150ms with ef_search=64, P99 ~300–500ms under concurrent load. The two parallel queries (BM25 + KNN) take the max of both, so 200ms P99 requires aggressive HNSW tuning.
Reranker 100ms            | Cold-start failure mode | BGE-reranker-v2-m3 on a warm SageMaker endpoint: 60–100ms. Cold (after autoscale-down): 800ms–2s. The "skip rerank" fallback is the right call but degrades quality silently — needs a dashboard tracking rerank skip rate.
Embedding 50ms            | Cache-hit dependent     | Titan call P99 over the network: 80–150ms. The 50ms target only holds on a cache hit (which dedupes identical queries). Fresh queries pay the full cost.
5M+ titles, multilingual  | Quality claim, not validated here | Cross-lingual retrieval quality (English query → Japanese results and vice versa) needs an offline eval set. The romaji lookup table is a band-aid; properly handling JP romaji probably needs a dedicated tokenizer or a fine-tuned embedding. No eval data quoted.
Result-cache TTL 5 min    | Inventory freshness conflict | Stock changes within minutes during sales, so a 5-min cache means up to 5 minutes of "in stock" lies. Either drop the TTL to 30–60s or invalidate on stock-change events from the inventory service.
RRF fusion k=60           | Empirically chosen — undocumented | Says "empirically chosen." If true, link the experiment. If not, it's a hyperparameter someone copied from a paper.
"Never return silent empty" principle | Realistic if enforced | Depends on the actual implementation. Easy to regress — needs a unit test that explicit-empty results never bypass the structured-no-results path.

The biggest issue: P99 latency under concurrency

The 800ms budget assumes the components run in their isolated steady-state. Under realistic concurrent load:

  • OpenSearch P99 for KNN on 5M vectors with ef_search=64 and 50 concurrent queries: ~400–600ms (not 200ms).
  • SageMaker reranker endpoint queue depth grows when QPS spikes — reranker P99 can hit 500ms during busy periods.
  • Titan API rate limits can introduce queueing on the embedding side.

Realistic P99 at production load: 1.2–1.8s. The 800ms target holds at P95, possibly P97. Restate the SLA accordingly.

The freshness/cache contradiction

The 5-minute result cache and the "real-time stock" guarantee in the problem statement are in tension. Two paths to reconcile:

  1. Cache only the retrieval result (top-3 ASINs), not the enrichment (price/stock). Always re-fetch enrichment from the inventory service. This shortens the lie window to whatever the inventory cache TTL is.
  2. Listen to inventory-change events on a Kinesis stream and invalidate result cache entries proactively when their ASINs change. Engineering cost is real; do this only if Path 1 isn't enough.

The current architecture as written caches the whole result for 5 minutes. That's the lie window.
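Path 1 is a small change in where the cache boundary sits. A sketch, assuming injected stand-ins for the ElastiCache client (cache), the inventory service (fetch_inventory), and the hybrid pipeline (retrieve) — none of these names are the real clients:

```python
RESULT_TTL_S = 300  # the retrieval *ranking* may be 5 minutes stale

def search_with_fresh_enrichment(query, filters, cache, fetch_inventory, retrieve):
    key = (query, tuple(sorted((filters or {}).items())))
    asins = cache.get(key)
    if asins is None:
        asins = retrieve(query, filters)  # expensive hybrid retrieval
        cache[key] = asins                # TTL bookkeeping elided
    # Enrichment is never cached here: price/stock are read fresh on
    # every call, so the "in stock" lie window collapses to one
    # inventory read instead of the full result-cache TTL.
    return [{"asin": asin, **fetch_inventory(asin)} for asin in asins]
```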