05 — MangaQAAgent
Editorial Q&A, review summarization, and policy answers. Backed by S3 (policies) and OpenSearch (50M+ reviews + editorial chunks).
The MangaQAAgent handles the questions that need language, not lookups: "What's the return policy?", "What do readers think of Vinland Saga?", "What's this series about?". It is the most hallucination-prone sub-agent because its job is summarization over free-form text — and that's where LLMs fail loudest.
What it is
A logical sub-agent that owns text-grounded Q&A. Three responsibilities:
- Policy / FAQ answers — return policy, shipping rules, billing FAQs
- Review summarization — aggregate sentiment, common praise/complaints, spoilers flag
- Editorial summaries — series synopsis, themes, content warnings
Backed by two MCP servers:
- Support & Policy MCP (../RAG-MCP-Integration/05-support-policy-mcp.md) — S3 + OpenSearch over legal-approved policy docs
- Review & Sentiment MCP (../RAG-MCP-Integration/04-review-sentiment-mcp.md) — OpenSearch over 50M+ reviews + sentiment aggregates
Tools exposed to the Orchestrator
| Tool | Purpose | Source |
|---|---|---|
| answer_faq(query) | Policy / shipping / FAQ answer | S3 + OpenSearch (policy index) |
| get_return_policy(region, category) | Structured return policy lookup | S3 (versioned) |
| get_sentiment_summary(asin, top_n) | Aggregated review sentiment | OpenSearch (review index) |
| get_editorial_summary(asin) | Series synopsis + themes | OpenSearch (editorial index) |
The split between answer_faq (free-form) and get_return_policy (structured) is deliberate. Free-form is for exploratory questions; structured is for cases where the Orchestrator already knows the user wants the canonical policy. Structured returns the exact policy text and version, eliminating any LLM rephrasing risk.
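The structured half of that split can be made concrete with a minimal sketch. This is illustrative only: POLICY_VERSIONS is a stand-in for the versioned S3 store, and the field names are assumptions, not the actual MCP contract.

```python
# Illustrative stand-in for the versioned policy store (really S3).
POLICY_VERSIONS = {
    ("US", "books"): ("returns_policy_v2024.10",
                      "Returns accepted within 30 days of delivery."),
}

def get_return_policy(region, category):
    """Structured lookup: return the exact approved text plus its version,
    so there is nothing for the LLM to rephrase."""
    entry = POLICY_VERSIONS.get((region, category))
    if entry is None:
        # Mirrors the failure-handling table: escalate rather than guess.
        return {"error": "no_policy", "escalation_suggested": True}
    version, text = entry
    return {"version": version, "text": text, "verbatim": True}
```

Because the return value carries the canonical text and version verbatim, the Orchestrator can surface it without any generation step at all.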
Source-of-truth tagging
Every chunk returned to the Orchestrator carries metadata that tells Claude what kind of source it is. This is the linchpin of hallucination control:
<tool_result name="answer_faq">
<source type="policy" doc="returns_policy_v2024.10" version="2024-10-15"
url="s3://policies/returns_v2024.10.pdf" authority="legal_approved">
[policy text chunk]
</source>
<source type="review" asin="B07X1234" review_count="1247"
sentiment="0.72" authority="user_generated">
[aggregated review summary]
</source>
</tool_result>
The system prompt instructs Claude:
- For type=policy, quote verbatim and cite the version + URL
- For type=review, attribute to "readers" and use hedged language ("readers report...", "reviewers note...")
- Never blend the two
This separation matters. Policy is authoritative; reviews are subjective. Conflating them is a serious error.
Retrieval pipeline (per source type)
Policy / FAQ path
Query → Embed (Titan v2)
→ OpenSearch over policy index (filter: type=policy, region=user_region)
→ Cross-encoder rerank top-3
→ Verify version metadata is current (S3 versioning)
→ Return with verbatim text + citation
The version-verification step is critical. Policy docs change. If OpenSearch returns a chunk from a stale version, we'd cite outdated policy. The agent re-checks the S3 object version before returning.
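A minimal sketch of that re-check, with the S3 HEAD request abstracted behind a callable so the logic is testable. In production the callable would wrap something like boto3's `head_object` and read its `VersionId`; the chunk field names here are assumptions.

```python
def verify_policy_chunk(chunk, head_version):
    """Re-check the live S3 object version before returning a policy chunk.

    head_version is a callable wrapping an S3 HEAD request, e.g.
    lambda key: s3.head_object(Bucket=BUCKET, Key=key)["VersionId"].
    """
    live = head_version(chunk["s3_key"])
    if chunk["version"] == live:
        return dict(chunk, version_verified=True)
    # Stale chunk: refuse to cite it. The failure-handling table covers
    # the retry / "policy under update" path from here.
    return {"error": "policy_version_mismatch",
            "indexed": chunk["version"], "live": live}
```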
Review / sentiment path
ASIN → OpenSearch over review index (filter: asin=X)
→ Pre-computed sentiment aggregation (avg score, top themes)
→ Top-N representative reviews via diversity sampling
→ Return as structured aggregate
We do not return raw reviews to the LLM. We return aggregated summaries already computed by an offline batch job. The LLM is trusted to formulate the answer in natural language, but the facts (sentiment scores, common themes) are pre-computed.
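The "diversity sampling" step above can be sketched as a greedy pass: take the most-helpful review per distinct theme until N are chosen. The `theme` and `helpful_votes` fields are assumptions about the offline job's output, not the actual schema.

```python
def diverse_sample(reviews, top_n):
    """Greedy diversity sampling: keep the most-helpful review for each
    distinct theme, so one loud complaint can't dominate the summary."""
    chosen, seen_themes = [], set()
    for r in sorted(reviews, key=lambda r: -r["helpful_votes"]):
        if r["theme"] in seen_themes:
            continue
        chosen.append(r)
        seen_themes.add(r["theme"])
        if len(chosen) == top_n:
            break
    return chosen
```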
Editorial path
ASIN → OpenSearch over editorial index (filter: asin=X)
→ Top-1 editorial chunk (these are written by humans, no rerank needed)
→ Return with content-warning metadata
State management
Stateless per call. Backing stores:
- S3 — policy docs, versioned. Source of truth.
- OpenSearch — three separate indexes: policy, review, editorial.
- Pre-computed sentiment aggregates — written nightly by a batch job, stored in OpenSearch as document fields.
The agent owns no mutable state.
Failure handling
| Failure | Detection | Recovery |
|---|---|---|
| No policy match for query | Empty result | Return "I'll connect you with support" + escalation_suggested=true |
| Policy version mismatch (S3 changed) | Re-check fails | Refresh OpenSearch entry, retry; if still fails, fail with "policy under update" |
| Empty review corpus for niche title | review_count < 5 | Return "Not enough reviews yet to summarize" — never invent |
| Review sentiment computation stale (>24h) | Aggregate timestamp check | Use stale + flag with last_updated |
| Reranker SageMaker cold | First call after idle | Skip rerank, return RRF top-3 with reranked: false flag |
| OpenSearch index unavailable | Connection error | Fail fast with structured error, no fallback (would risk hallucination) |
| Citation mismatch (LLM cites wrong source) | Post-gen validation | Regenerate with stricter prompt; if persists, return template |
The "fail fast, no fallback" rule for OpenSearch unavailability is intentional. Better to admit "I can't answer right now" than to fall back to LLM training data and produce an authoritative-sounding wrong answer about a return policy.
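The fail-fast rule translates into a structured refusal rather than a degraded answer. A sketch, with the retrieval call injected as a callable; the error shape is an assumption, not the actual MCP response format.

```python
def answer_faq(query, search_policy_index):
    """Fail fast when retrieval is down: return a structured refusal,
    never a fallback to model knowledge for policy questions."""
    try:
        hits = search_policy_index(query)
    except ConnectionError:
        return {"error": "retrieval_unavailable",
                "user_message": "I can't answer that right now.",
                "escalation_suggested": True}
    if not hits:
        # No policy match: escalate, per the failure-handling table.
        return {"user_message": "I'll connect you with support",
                "escalation_suggested": True}
    return {"answer": hits[0]["text"], "citation": hits[0]["doc"]}
```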
Latency budget
Target: P99 < 800ms per tool call.
answer_faq:
Embed query 50ms (often cache hit ~5ms)
OpenSearch retrieve 200ms
Rerank 100ms
S3 version check 80ms (HEAD request)
Format 10ms
─────
Total 440ms (P50)
900ms (P99)
get_sentiment_summary:
OpenSearch lookup 100ms (filter by ASIN, no embedding needed)
Format 20ms
─────
Total 120ms (P50)
300ms (P99)
get_editorial_summary:
OpenSearch lookup 80ms
Format 10ms
─────
Total 90ms (P50)
200ms (P99)
answer_faq is the slowest because it goes through the full RAG pipeline. Sentiment and editorial summaries are fast because they're keyed by ASIN — no embedding, no rerank.
Why this shape
| Alternative | Why we rejected it |
|---|---|
| Single Q&A index (all source types together) | Loses source-type filtering; conflates policy with reviews |
| Compute sentiment on-demand | LLM summarizing 1000+ reviews per query is slow, expensive, and inconsistent |
| Skip the version check on policy | Policy drift caused real incidents — the version check is non-negotiable |
| Let the LLM rephrase policy answers | Legal review required for every customer-facing policy text; verbatim quoting is the safe path |
| Use raw reviews instead of aggregates | LLM cherry-picks; aggregates are pre-computed and stable |
Validation: Constraint Sanity Check
| Claimed metric | Verdict | Why |
|---|---|---|
| P99 < 800ms for answer_faq | Aggressive — version check kills the budget | The 80ms S3 HEAD check has long-tail variance; S3 P99 latency over us-east is 100–500ms. Realistic P99 for answer_faq is closer to 1.0–1.3s. |
| 50M+ reviews indexed | Numerically large, freshness unclear | When does a new review become queryable? Indexing latency on OpenSearch with 50M docs and high write throughput is typically minutes-to-hours. The doc doesn't quote a freshness SLA. |
| Sentiment aggregates updated nightly | Stale during fast review accumulation | For a title that just released and gathers 1,000 reviews in its first 24 hours, nightly aggregates show "0 reviews" for the entire first day. Mitigation: real-time aggregation for high-velocity ASINs. Not in the architecture. |
| Policy version check on every read | Necessary, costly | The version check adds 80ms minimum to every policy-shaped query. Alternative: subscribe to S3 event notifications and update OpenSearch in near-real-time, eliminating the per-read check. Not done. |
| OpenSearch unavailability → fail fast | Right call, but no fallback hurts UX | "I can't answer right now" is correct for policy questions; for review summaries, a degraded mode (cached last-known) would be acceptable and isn't documented. |
| Citation correctness | Hardest validation problem | Claude can quote a chunk's content but attribute it to a wrong document — this is "citation hallucination." The doc says post-gen validation handles it, but doesn't specify how. A real check would: (a) verify quoted text appears in the cited chunk, (b) verify the chunk's metadata matches the citation. Implementing this is non-trivial. |
| 5M editorial chunks index | Arithmetic doesn't close, unverified | Editorial content is human-written; if all 5M titles carried ~10–20 chunks each, that would be 50–100M chunks, not 5M. Either only top titles are editorialized or the figure is wrong; the scope is unclear. |
| "Never invent reviews" enforcement | Prompt-layer only | The system prompt tells Claude not to invent. There is no automated check that flagged review summaries actually cite real reviews. Enforcement is best-effort. |
| Rerank skip during cold start | Quality silently degrades | Skipping rerank reduces precision. There's no quoted impact ("rerank lifts top-3 precision by X%"). Without that number, we don't know how bad the degraded mode is. |
| Region-filtered policy retrieval | Sound, depends on accurate region detection | If the user's region is misdetected (VPN, browser locale mismatch), they may get the wrong region's policy. Region detection accuracy isn't quoted. |
The biggest risk: citation hallucination
The agent's whole value proposition is grounded answers with sources. The hardest failure mode is the LLM producing text that sounds like it's from the cited source but isn't. Examples we've seen in similar systems:
- Quote A is from Source 1 but cited as Source 2
- Quote A is fabricated (no chunk says this) but cited to a real source
- Quote A blends Source 1 and Source 2 into a single sentence with one citation
Detecting these requires substring matching every quoted span against the cited chunk. The doc says "post-gen validation handles it" but doesn't describe the implementation. This is the single most important post-processing step in the system, and it's underspecified.
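One plausible shape for that validation: extract every quoted span and its citation from the generated answer, then substring-match the quote against the cited chunk. The citation syntax here ("quote" (doc_id)) is a convention invented for this sketch, not the production format, and a real check would also normalize whitespace and handle paraphrase thresholds.

```python
import re

def find_citation_violations(answer, chunks):
    """Substring-match every quoted span against the chunk it cites.

    chunks maps doc_id -> chunk text. Returns (quote, doc_id) pairs that
    do NOT appear verbatim in the cited chunk, covering all three failure
    modes above: wrong attribution, fabrication, and blended sources.
    """
    violations = []
    for quote, doc_id in re.findall(r'"([^"]+)"\s*\(([\w.\-]+)\)', answer):
        if quote not in chunks.get(doc_id, ""):
            violations.append((quote, doc_id))
    return violations
```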
Policy freshness has two clocks
Policy docs change in S3. The OpenSearch index is rebuilt periodically (nightly?). The S3 version check on every read protects against the index being stale. But:
- If the index was rebuilt 12 hours ago and the policy changed 6 hours ago, the index has the old chunk text but the version check catches that the latest version is newer. The agent then... what? Re-fetches from S3 directly? Falls back to "policy under update"?
The fallback path isn't specified. In practice, the right answer is probably:
1. Fetch the latest version directly from S3
2. Re-chunk on the fly
3. Return that chunk
4. Trigger an async re-index
Without this path implemented, the agent will return "policy under update" any time the index is stale, which is bad UX.
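A sketch of what that recovery path could look like, with S3 access and the re-index trigger injected as stand-ins. The truncation-based "re-chunk" is deliberately naive; a real implementation would reuse the indexer's chunker, and all names here are assumptions.

```python
def serve_policy_chunk(chunk, fetch_latest, reindex_queue):
    """Stale-index recovery: if S3 holds a newer version than the indexed
    chunk, serve freshly fetched text and queue an async re-index instead
    of returning "policy under update".

    fetch_latest(key) -> (version, full_text) stands in for an S3 GET;
    reindex_queue stands in for an async re-index trigger (e.g. SQS).
    """
    live_version, live_text = fetch_latest(chunk["s3_key"])
    if chunk["version"] == live_version:
        return chunk  # index is current, nothing to do
    reindex_queue.append(chunk["s3_key"])  # fire-and-forget re-index
    return {"s3_key": chunk["s3_key"], "version": live_version,
            "text": live_text[:2000],  # naive on-the-fly "re-chunk"
            "freshly_fetched": True}
```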
Sentiment aggregation has a freshness/cost tradeoff
Nightly aggregation is cheap and adequate for established titles. It fails for new releases that gather thousands of reviews per day. The architecture would benefit from a tiered approach:
- High-velocity ASINs (e.g., last 7 days, >100 reviews/day): real-time streaming aggregation via Kinesis
- Long tail: nightly batch
This isn't expensive engineering but isn't in the doc. As written, the architecture serves stale sentiment to exactly the queries (new releases) where users care most.
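The routing decision for that tiered approach is a few lines. The thresholds below mirror the numbers suggested above but are illustrative, not tuned, and the function name is hypothetical.

```python
def aggregation_tier(days_since_release, reviews_per_day,
                     window_days=7, velocity_threshold=100):
    """Route an ASIN to real-time streaming aggregation (high-velocity
    new releases) or the nightly batch (long tail)."""
    if days_since_release <= window_days and reviews_per_day > velocity_threshold:
        return "realtime"
    return "nightly"
```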
"Never invent reviews" is enforced by prompt only
The system prompt says don't invent. The MCP returns aggregated facts. But the LLM still composes a paragraph that interprets the aggregate. If the aggregate says "70% positive, common theme: pacing issues," and the LLM produces "Most readers loved the pacing," that's wrong. There's no automated check that catches this.
Real defenses:
- Constrain output format to structured templates (e.g., always quote the score)
- Validate post-generation that score ranges in the output match the aggregate input
- Use a smaller verifier model
None are documented. The current defense is "trust the prompt," which has known failure modes.
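The second defense (post-generation score validation) is cheap to sketch: pull any percentage the LLM quoted and check it against the pre-computed aggregate. This assumes the aggregate exposes a positive_ratio field and only checks percentages, so it would need extending to catch purely qualitative drift like "most readers loved the pacing".

```python
import re

def sentiment_claims_consistent(answer, aggregate, tol=0.05):
    """Post-generation check: every percentage the LLM quotes must match
    the pre-computed positive ratio within tolerance. Catches a claim of
    '90% positive' when the aggregate says 0.72."""
    for pct in re.findall(r'(\d{1,3})\s*%', answer):
        if abs(int(pct) / 100 - aggregate["positive_ratio"]) > tol:
            return False
    return True
```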
Related documents
- 01-orchestrator-agent.md — Calling pattern from supervisor
- 07-failure-handling.md — Hallucination guardrails in detail
- ../RAG-MCP-Integration/04-review-sentiment-mcp.md — Review MCP internals
- ../RAG-MCP-Integration/05-support-policy-mcp.md — Policy MCP internals
- ../Security-Privacy-Guardrails/ — Hallucination defense practices