# 02 — ProductSearchAgent
Catalog discovery for 5M+ multilingual manga titles. Thin domain wrapper over the Catalog Search MCP.
The ProductSearchAgent is the sub-agent the Orchestrator calls whenever a user message contains a search-shaped intent: title lookup, genre browsing, author discovery, ISBN/ASIN resolution. It is not a separate Claude instance — it is a logical handler exposed to the Orchestrator as a set of MCP tools, plus the small amount of domain logic that lives between the tools and the catalog index.
## What it is
A logical sub-agent that owns three responsibilities:
- Query understanding — locale detection (English / Japanese / JP romaji), entity extraction (series name, volume number, edition).
- Retrieval orchestration — calls the Catalog Search MCP with the right parameters and post-processes results.
- Result shaping — enriches raw catalog hits with stock signals, price, and availability before returning to the Orchestrator.
It is backed by the Catalog Search MCP (../RAG-MCP-Integration/01-catalog-search-mcp.md), which itself wraps an OpenSearch Serverless index of 5M+ titles.
## Tools exposed to the Orchestrator
| Tool | Purpose | Typical use |
|---|---|---|
| search_manga(query, filters) | Hybrid retrieval over the catalog | "Find dark fantasy manga" |
| get_manga_details(asin) | Structured fetch of a single title | "Tell me about Berserk volume 42" |
| find_by_author(author, lang) | Author-scoped listing | "Show me everything by Naoki Urasawa" |
| find_by_series(series_name, volume?) | Series + volume resolution | "Vinland Saga volume 12 in Japanese" |
Each tool description is engineered to disambiguate from similar tools in other MCPs — see 06-tool-dispatch-and-routing.md for tool description engineering.
## Retrieval pipeline
```
User query → Locale detection → Entity extraction
           → Hybrid retrieval (BM25 + Titan dense)
           → RRF fusion (top-50 candidates)
           → BGE-reranker (top-3)
           → Stock + price enrichment
           → Return to Orchestrator
```
### Hybrid retrieval
Pure BM25 misses semantic matches ("dark fantasy" → "Berserk" requires no shared keywords). Pure dense retrieval misses exact-title matches in the long tail. We run both and fuse with Reciprocal Rank Fusion:
```python
from collections import defaultdict

def rrf(rank_lists, k=60):
    # Each document's score is the sum of reciprocal ranks across lists.
    scores = defaultdict(float)
    for rank_list in rank_lists:
        for rank, doc_id in enumerate(rank_list):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])
```
k=60 is empirically chosen. Lower k gives outsized weight to the top-ranked hit in each list, which sharpens exact-title matches at the cost of semantic coverage; higher k flattens the contribution so deeper-ranked candidates from either list count for more.
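A toy fusion run makes the behavior concrete (the `rrf` definition is repeated so the snippet runs standalone; the document titles are made up):

```python
from collections import defaultdict

def rrf(rank_lists, k=60):  # as defined above
    scores = defaultdict(float)
    for rank_list in rank_lists:
        for rank, doc_id in enumerate(rank_list):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])

# The two retrieval legs disagree on ordering and coverage.
bm25_ranking = ["berserk_v1", "claymore_v1", "vinland_v1"]
dense_ranking = ["claymore_v1", "berserk_v1", "made_in_abyss_v1"]

fused = [doc_id for doc_id, _ in rrf([bm25_ranking, dense_ranking])]
print(fused)  # documents present in both lists outrank single-list candidates
```

This is the property the fusion buys: a title that both legs agree on beats a title that only one leg surfaced, regardless of absolute BM25 or cosine scores.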
### Multilingual handling
Titan Embeddings v2 has decent cross-lingual alignment for English/Japanese, but Japanese romaji ("Naruto" written romaji vs. ナルト in kana) is a known weak spot. Mitigation: a small lookup table normalizes common romaji titles to their canonical Japanese form before embedding.
## State management
Stateless per call. The agent owns no persistent state of its own — everything it needs is in the call payload or in the Catalog MCP's index.
Two caches sit in the path:
| Cache | Store | TTL | What it caches |
|---|---|---|---|
| Embedding cache | ElastiCache | 1 hour | Query → embedding vector (dedupe identical queries) |
| Result cache | ElastiCache | 5 minutes | Query+filter hash → top-3 results |
Result-cache TTL is short because catalog stock and price change frequently; we don't want a 1-hour-stale "in stock" lying to a customer who clicks through.
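The two cache keys can be sketched as follows. This assumes a Redis-style keyspace; the prefixes, hashing choice, and helper names are illustrative:

```python
import hashlib
import json

# TTLs mirror the table above.
EMBED_TTL_S = 3600   # 1 hour: the embedding of a given query never changes
RESULT_TTL_S = 300   # 5 minutes: stock and price go stale quickly

def embedding_cache_key(query: str) -> str:
    """Identical query text -> identical key, deduping Titan calls."""
    return "emb:" + hashlib.sha256(query.encode("utf-8")).hexdigest()

def result_cache_key(query: str, filters: dict) -> str:
    """Hash query + filters; sort keys so logically-equal requests collide."""
    payload = json.dumps({"q": query, "f": filters}, sort_keys=True)
    return "res:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Sorting the filter keys before hashing matters: `{"genre": "x", "lang": "ja"}` and `{"lang": "ja", "genre": "x"}` must hit the same cache entry.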
## Failure handling
| Failure | Detection | Recovery |
|---|---|---|
| Empty result set | len(hits) == 0 | Broaden filters → drop genre constraint → semantic fallback with lower threshold → structured no_results with suggested alternatives |
| OpenSearch timeout | 800ms exceeded | One retry → return cached if available → fail with structured error |
| Reranker SageMaker cold | First call after 15min idle | Skip rerank, return RRF top-3 directly with reranked: false flag |
| Malformed input (bad ASIN format) | Schema validation | Return 400 with reason, no retry |
| Locale mis-detection | Manual heuristic flag | Run both English and Japanese pipelines, merge results |
Critical principle: never return silent empty. The Orchestrator's downstream behavior depends on knowing whether "no results" means "we searched and found nothing" vs. "the search failed."
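A sketch of what that contract could look like on the wire. The field names are assumptions; the point is that the two outcomes are structurally distinct, so the Orchestrator can never confuse them:

```python
# Hypothetical response shapes enforcing "never return silent empty".
def no_results_response(query: str, suggestions: list[str]) -> dict:
    """We searched successfully and found nothing."""
    return {
        "status": "no_results",
        "query": query,
        "suggested_alternatives": suggestions,
    }

def search_failed_response(reason: str) -> dict:
    """The search itself did not complete; results are unknown."""
    return {
        "status": "search_failed",
        "reason": reason,
        "retryable": True,
    }
```

A unit test asserting that an empty hit list always flows through `no_results_response` (never an empty list or `None`) is the cheap way to keep this principle from regressing.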
## Latency budget
Target: P99 < 800ms per tool call.
| Stage | Budgeted | Notes |
|---|---|---|
| Locale + entity | 20ms | Regex + small lookup, in-process |
| Embedding | 50ms | Titan call, often cache hit (~5ms) |
| OpenSearch hybrid | 200ms | Two parallel queries (BM25 + KNN), fused |
| Reranker | 100ms | BGE on SageMaker, warm endpoint |
| Enrichment | 30ms | Stock + price from ElastiCache |
| Format + return | 10ms | XML wrap |
| Total | 410ms | Leaves 390ms for network + cold start |
## Why this shape
| Alternative | Why we rejected it |
|---|---|
| Single retrieval method (BM25 only) | Misses semantic matches in long-tail genre queries |
| Single retrieval method (dense only) | Misses exact-title matches; BM25 dominates exact-match precision |
| Direct OpenSearch from Orchestrator (no MCP boundary) | Couples LLM context to query DSL; hard to evolve |
| Pre-filter by genre before retrieval | Cuts recall on cross-genre titles; reranker handles this better |
| Use a single OpenSearch query that does both | OpenSearch supports this but RRF fusion outside the index gives better tunability |
## Validation: Constraint Sanity Check
| Claimed metric | Verdict | Why |
|---|---|---|
| P99 < 800ms per tool call | Aggressive under load | Component sum of medians is 410ms; P99 of components stacks to ~1.1–1.4s during concurrent traffic. Holds at low concurrency, breaks at peak. |
| OpenSearch hybrid 200ms | Optimistic | OpenSearch Serverless KNN over 5M vectors with HNSW: P50 ~80–150ms with ef_search=64, P99 ~300–500ms under concurrent load. Two parallel queries (BM25 + KNN) take max of both, so 200ms P99 requires aggressive HNSW tuning. |
| Reranker 100ms | Cold-start failure mode | BGE-reranker-v2-m3 on SageMaker endpoint warm: 60–100ms. Cold (after autoscale-down): 800ms–2s. The fallback to "skip rerank" is the right call but degrades quality silently — needs a dashboard to track rerank skip rate. |
| Embedding 50ms | Cache-hit dependent | Titan call P99 over network: 80–150ms. The 50ms target only holds with cache hit (which dedupes identical queries). Fresh queries pay full cost. |
| 5M+ titles, multilingual | Quality claim, not validated here | Cross-lingual retrieval quality (English query → Japanese results and vice versa) needs an offline eval set. The romaji lookup table is a band-aid; properly handling JP romaji probably needs a dedicated tokenizer or fine-tuned embedding. No eval data quoted. |
| Result-cache TTL 5 min | Inventory freshness conflict | Stock changes within minutes during sales. 5-min cache means up to 5 minutes of "in stock" lies. Either drop the cache to 30–60s, or invalidate on stock-change events from the inventory service. |
| RRF fusion k=60 | Empirically chosen — undocumented | Says "empirically chosen." If true, link the experiment. If not, it's a hyperparameter someone copied from a paper. |
| "Never return silent empty" principle | Realistic if enforced | Depends on actual implementation. Easy to regress — needs a unit test that explicit-empty results never bypass the structured-no-results path. |
### The biggest issue: P99 latency under concurrency
The 800ms budget assumes the components run in their isolated steady-state. Under realistic concurrent load:
- OpenSearch P99 for KNN on 5M vectors with ef_search=64 and 50 concurrent queries: ~400–600ms (not 200ms).
- SageMaker reranker endpoint queue depth grows when QPS spikes; reranker P99 can hit 500ms during busy periods.
- Titan API rate limits can introduce queueing on the embedding side.
Realistic P99 at production load: 1.2–1.8s. The 800ms target holds at P95, possibly P97. Restate the SLA accordingly.
### The freshness/cache contradiction
The 5-minute result cache and the "real-time stock" guarantee in the problem statement are in tension. Two paths to reconcile:
1. Cache only the retrieval result (top-3 ASINs), not the enrichment (price/stock). Always re-fetch enrichment from the inventory service. This shortens the lie window to whatever the inventory cache TTL is.
2. Listen to inventory-change events on a Kinesis stream and invalidate result-cache entries proactively when their ASINs change. Engineering cost is real; do this only if Path 1 isn't enough.
The current architecture as written caches the whole result for 5 minutes. That's the lie window.
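Path 1 is a small structural change. A sketch, with the cache modeled as a plain dict and the retrieval and inventory calls passed in as hypothetical callables (names are illustrative):

```python
# Path 1 sketch: cache only the ranked ASINs; enrichment is always live.
def cached_search(query, filters, cache, retrieve, fetch_live_stock_and_price):
    key = (query, tuple(sorted(filters.items())))
    asins = cache.get(key)
    if asins is None:
        asins = retrieve(query, filters)   # expensive hybrid retrieval
        cache[key] = asins                 # safe to cache: ranking is stable
    # Stock and price are re-fetched on every call, so they are never
    # staler than the inventory service's own freshness.
    return [fetch_live_stock_and_price(asin) for asin in asins]
```

The trade: the expensive part (retrieval, fusion, rerank) stays cached, while the cheap ElastiCache enrichment runs every time, shrinking the lie window from 5 minutes to near zero at a cost of ~30ms per call.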
## Related documents
- 01-orchestrator-agent.md — How the Orchestrator dispatches to this agent
- 06-tool-dispatch-and-routing.md — Tool description engineering
- ../RAG-MCP-Integration/01-catalog-search-mcp.md — Catalog MCP internals
- ../RAG-MCP-Integration/09-rag-retrieval-pipeline-deep-dive.md — Shared retrieval pipeline