05 — MangaQAAgent

Editorial Q&A, review summarization, and policy answers. Backed by S3 (policies) and OpenSearch (50M+ reviews + editorial chunks).

The MangaQAAgent handles the questions that need language, not lookups: "What's the return policy?", "What do readers think of Vinland Saga?", "What's this series about?". It is the most hallucination-prone sub-agent because its job is summarization over free-form text — and that's where LLMs fail loudest.


What it is

A logical sub-agent that owns text-grounded Q&A. Three responsibilities:

  1. Policy / FAQ answers — return policy, shipping rules, billing FAQs
  2. Review summarization — aggregate sentiment, common praise/complaints, spoilers flag
  3. Editorial summaries — series synopsis, themes, content warnings

Backed by two MCP servers:

- [Support & Policy MCP](../RAG-MCP-Integration/05-support-policy-mcp.md) — S3 + OpenSearch over legal-approved policy docs
- [Review & Sentiment MCP](../RAG-MCP-Integration/04-review-sentiment-mcp.md) — OpenSearch over 50M+ reviews + sentiment aggregates


Tools exposed to the Orchestrator

| Tool | Purpose | Source |
| --- | --- | --- |
| `answer_faq(query)` | Policy / shipping / FAQ answer | S3 + OpenSearch (policy index) |
| `get_return_policy(region, category)` | Structured return-policy lookup | S3 (versioned) |
| `get_sentiment_summary(asin, top_n)` | Aggregated review sentiment | OpenSearch (review index) |
| `get_editorial_summary(asin)` | Series synopsis + themes | OpenSearch (editorial index) |

The split between `answer_faq` (free-form) and `get_return_policy` (structured) is deliberate. Free-form serves exploratory questions; structured is for cases where the Orchestrator already knows the user wants the canonical policy. The structured tool returns the exact policy text and version, eliminating any LLM rephrasing risk.
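To make the distinction concrete, here is a minimal sketch of the structured path in Python, assuming boto3 and an illustrative bucket/key layout (`policies/returns/{region}/{category}.txt` is not the real schema):

```python
import boto3

s3 = boto3.client("s3")

def get_return_policy(region: str, category: str) -> dict:
    """Structured lookup: return the canonical policy text verbatim.

    The bucket and key layout are illustrative. The point is that no
    LLM sits between the stored text and the response.
    """
    key = f"policies/returns/{region}/{category}.txt"
    obj = s3.get_object(Bucket="policies", Key=key)
    return {
        "text": obj["Body"].read().decode("utf-8"),  # verbatim, never rephrased
        "version": obj["VersionId"],                 # cite this exact version
        "source": f"s3://policies/{key}",
    }
```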


Source-of-truth tagging

Every chunk returned to the Orchestrator carries metadata that tells Claude what kind of source it is. This is the linchpin of hallucination control:

<tool_result name="answer_faq">
  <source type="policy" doc="returns_policy_v2024.10" version="2024-10-15"
          url="s3://policies/returns_v2024.10.pdf" authority="legal_approved">
    [policy text chunk]
  </source>
  <source type="review" asin="B07X1234" review_count="1247"
          sentiment="0.72" authority="user_generated">
    [aggregated review summary]
  </source>
</tool_result>

The system prompt instructs Claude:

- For `type=policy`, quote verbatim and cite the version + URL
- For `type=review`, attribute to "readers" and use hedged language ("readers report...", "reviewers note...")
- Never blend the two

This separation matters. Policy is authoritative; reviews are subjective. Conflating them is a serious error.


Retrieval pipeline (per source type)

Policy / FAQ path

Query → Embed (Titan v2)
      → OpenSearch over policy index (filter: type=policy, region=user_region)
      → Cross-encoder rerank top-3
      → Verify version metadata is current (S3 versioning)
      → Return with verbatim text + citation

The version-verification step is critical. Policy docs change. If OpenSearch returns a chunk from a stale version, we'd cite outdated policy. The agent re-checks the S3 object version before returning.
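A sketch of that re-check, assuming each indexed chunk carries the S3 key and object version it was built from (`s3_key` and `s3_version` are hypothetical metadata field names):

```python
import boto3

s3 = boto3.client("s3")

def is_chunk_current(chunk: dict) -> bool:
    """True if the chunk was built from the latest version of its policy doc.

    A HEAD request returns the live VersionId without downloading the
    document; this is the ~80ms line item in the latency budget below.
    """
    head = s3.head_object(Bucket="policies", Key=chunk["s3_key"])
    return head["VersionId"] == chunk["s3_version"]
```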

Review / sentiment path

ASIN → OpenSearch over review index (filter: asin=X)
     → Pre-computed sentiment aggregation (avg score, top themes)
     → Top-N representative reviews via diversity sampling
     → Return as structured aggregate

We do not return raw reviews to the LLM. We return aggregated summaries already computed by an offline batch job. The LLM is trusted to formulate the answer in natural language, but the facts (sentiment scores, common themes) are pre-computed.
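The doc doesn't name the diversity-sampling method; one plausible implementation is greedy farthest-point selection over review embeddings, sketched here with numpy:

```python
import numpy as np

def diverse_sample(embeddings: np.ndarray, n: int) -> list[int]:
    """Pick n review indices that span the opinion space.

    Greedy max-min: start near the centroid, then repeatedly take the
    review farthest from everything already selected, so the sample
    doesn't cluster on one dominant opinion.
    """
    first = int(np.linalg.norm(embeddings - embeddings.mean(axis=0), axis=1).argmin())
    selected = [first]
    while len(selected) < min(n, len(embeddings)):
        # Distance from every review to its nearest already-selected review
        dists = np.linalg.norm(
            embeddings[:, None, :] - embeddings[selected], axis=2
        ).min(axis=1)
        dists[selected] = -1.0  # never re-pick a selected review
        selected.append(int(dists.argmax()))
    return selected
```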

Editorial path

ASIN → OpenSearch over editorial index (filter: asin=X)
     → Top-1 editorial chunk (these are written by humans, no rerank needed)
     → Return with content-warning metadata

State management

Stateless per call. Backing stores:

- S3 — policy docs, versioned. Source of truth.
- OpenSearch — three separate indexes: policy, review, editorial.
- Pre-computed sentiment aggregates — written nightly by a batch job, stored in OpenSearch as document fields.

The agent owns no mutable state.


Failure handling

| Failure | Detection | Recovery |
| --- | --- | --- |
| No policy match for query | Empty result | Return "I'll connect you with support" + `escalation_suggested=true` |
| Policy version mismatch (S3 changed) | Re-check fails | Refresh the OpenSearch entry and retry; if it still fails, fail with "policy under update" |
| Empty review corpus for niche title | `review_count < 5` | Return "Not enough reviews yet to summarize" — never invent |
| Review sentiment computation stale (>24h) | Aggregate timestamp check | Use the stale aggregate and flag it with `last_updated` |
| Reranker SageMaker cold start | First call after idle | Skip rerank, return RRF top-3 with `reranked: false` flag |
| OpenSearch index unavailable | Connection error | Fail fast with a structured error, no fallback (would risk hallucination) |
| Citation mismatch (LLM cites wrong source) | Post-generation validation | Regenerate with a stricter prompt; if it persists, return a template |

The "fail fast, no fallback" rule for OpenSearch unavailability is intentional. Better to admit "I can't answer right now" than to fall back to LLM training data and produce an authoritative-sounding wrong answer about a return policy.


Latency budget

Target: P99 < 800ms per tool call.

answer_faq:
  Embed query             50ms (often cache hit ~5ms)
  OpenSearch retrieve    200ms
  Rerank                 100ms
  S3 version check        80ms (HEAD request)
  Format                  10ms
  ─────
  Total                  440ms (P50)
                         900ms (P99)

get_sentiment_summary:
  OpenSearch lookup      100ms (filter by ASIN, no embedding needed)
  Format                  20ms
  ─────
  Total                  120ms (P50)
                         300ms (P99)

get_editorial_summary:
  OpenSearch lookup       80ms
  Format                  10ms
  ─────
  Total                   90ms (P50)
                         200ms (P99)

answer_faq is the slowest because it goes through the full RAG pipeline. Sentiment and editorial summaries are fast because they're keyed by ASIN — no embedding, no rerank.


Why this shape

| Alternative | Why we rejected it |
| --- | --- |
| Single Q&A index (all source types together) | Loses source-type filtering; conflates policy with reviews |
| Compute sentiment on-demand | An LLM summarizing 1,000+ reviews per query is slow, expensive, and inconsistent |
| Skip the version check on policy | Policy drift caused real incidents — the version check is non-negotiable |
| Let the LLM rephrase policy answers | Legal review is required for every customer-facing policy text; verbatim quoting is the safe path |
| Use raw reviews instead of aggregates | The LLM cherry-picks; aggregates are pre-computed and stable |

Validation: Constraint Sanity Check

| Claimed metric | Verdict | Why |
| --- | --- | --- |
| P99 < 800ms for `answer_faq` | Aggressive — the version check kills the budget | The 80ms S3 HEAD check has long-tail variance; S3 P99 latency in us-east is 100–500ms. Realistic P99 for `answer_faq` is closer to 1.0–1.3s. |
| 50M+ reviews indexed | Numerically large, freshness unclear | When does a new review become queryable? Indexing latency on OpenSearch with 50M docs and high write throughput is typically minutes to hours. The doc doesn't quote a freshness SLA. |
| Sentiment aggregates updated nightly | Stale during fast review accumulation | A title that just released and gathers 1,000 reviews in 24 hours: nightly aggregates show "0 reviews" for the first day. Mitigation: real-time aggregation for high-velocity ASINs. Not in the architecture. |
| Policy version check on every read | Necessary, but costly | The version check adds at least 80ms to every policy-shaped query. Alternative: subscribe to S3 event notifications and update OpenSearch in near-real-time, eliminating the per-read check. Not done. |
| OpenSearch unavailability → fail fast | Right call, but no fallback hurts UX | "I can't answer right now" is correct for policy questions; for review summaries, a degraded mode (cached last-known summary) would be acceptable, and isn't documented. |
| Citation correctness | Hardest validation problem | Claude can quote a chunk's content but attribute it to the wrong document — "citation hallucination." The doc says post-generation validation handles it but doesn't specify how. A real check would (a) verify the quoted text appears in the cited chunk and (b) verify the chunk's metadata matches the citation. Implementing this is non-trivial. |
| 5M editorial chunks indexed | Probably undersized, unverified | Editorial content is human-written; 5M titles at ~10–20 chunks each is 50–100M editorial chunks. Or are only top titles editorialized? The scope is unclear. |
| "Never invent reviews" enforcement | Prompt-layer only | The system prompt tells Claude not to invent. There is no automated check that flagged review summaries actually cite real reviews. Enforcement is best-effort. |
| Rerank skip during cold start | Quality silently degrades | Skipping rerank reduces precision, and there's no quoted impact ("rerank lifts top-3 precision by X%"). Without that number, we don't know how bad the degraded mode is. |
| Region-filtered policy retrieval | Sound, but depends on accurate region detection | If the user's region is misdetected (VPN, browser-locale mismatch), they may get the wrong region's policy. Region-detection accuracy isn't quoted. |

The biggest risk: citation hallucination

The agent's whole value proposition is grounded answers with sources. The hardest failure mode is the LLM producing text that sounds like it's from the cited source but isn't. Examples we've seen in similar systems:

  • Quote A is from Source 1 but cited as Source 2
  • Quote A is fabricated (no chunk says this) but cited to a real source
  • Quote A blends Source 1 and Source 2 into a single sentence with one citation

Detecting these requires substring matching every quoted span against the cited chunk. The doc says "post-gen validation handles it" but doesn't describe the implementation. This is the single most important post-processing step in the system, and it's underspecified.
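A workable version of that check, under two assumptions not in the doc: the generator emits its quotes as discrete spans (or marks them with quotation marks), and each citation names a retrieved chunk ID:

```python
import re

def validate_citations(citations: dict[str, str], chunks: dict[str, str]) -> list[str]:
    """Return the quoted spans that fail grounding.

    `citations` maps each quoted span to the chunk ID the answer attributes
    it to; `chunks` maps chunk IDs to their retrieved text. A span fails if
    it doesn't appear (whitespace-normalized) in the chunk it cites; this
    one check catches fabrication, blending, and mis-attribution alike.
    """
    def norm(s: str) -> str:
        return re.sub(r"\s+", " ", s).strip().lower()

    return [
        quote for quote, chunk_id in citations.items()
        if norm(quote) not in norm(chunks.get(chunk_id, ""))
    ]
```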

Policy freshness has two clocks

Policy docs change in S3. The OpenSearch index is rebuilt periodically (nightly?). The S3 version check on every read protects against the index being stale. But:

  • If the index was rebuilt 12 hours ago and the policy changed 6 hours ago, the index has the old chunk text but the version check catches that the latest version is newer. The agent then... what? Re-fetches from S3 directly? Falls back to "policy under update"?

The fallback path isn't specified. In practice, the right answer is probably:

  1. Fetch the latest version directly from S3
  2. Re-chunk on the fly
  3. Return that chunk
  4. Trigger an async re-index

Without this path implemented, the agent will return "policy under update" any time the index is stale, which is bad UX.
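That recovery path might look like the following sketch; the bucket name, queue URL, and the shortcut of returning the whole document instead of re-chunking on the fly are all assumptions:

```python
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

def serve_fresh_policy(chunk: dict) -> dict:
    """Fallback when the indexed chunk is stale: answer from S3 directly,
    then queue an async re-index rather than refusing the user."""
    obj = s3.get_object(Bucket="policies", Key=chunk["s3_key"])  # latest version
    sqs.send_message(QueueUrl="<reindex-queue-url>",             # placeholder
                     MessageBody=chunk["s3_key"])
    return {
        "text": obj["Body"].read().decode("utf-8"),  # whole doc; re-chunking elided
        "version": obj["VersionId"],
        "stale_index": True,  # surface how often this path fires
    }
```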

Sentiment aggregation has a freshness/cost tradeoff

Nightly aggregation is cheap and adequate for established titles. It fails for new releases that gather thousands of reviews per day. The architecture would benefit from a tiered approach:

  • High-velocity ASINs (e.g., last 7 days, >100 reviews/day): real-time streaming aggregation via Kinesis
  • Long tail: nightly batch

This isn't expensive engineering but isn't in the doc. As written, the architecture serves stale sentiment to exactly the queries (new releases) where users care most.
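The routing decision itself is cheap; a sketch with placeholder thresholds taken from the criterion above:

```python
def aggregation_tier(days_since_release: int, reviews_last_24h: int) -> str:
    """Route an ASIN to streaming or batch sentiment aggregation.

    Thresholds mirror the 'last 7 days, >100 reviews/day' criterion;
    tune them against real traffic.
    """
    if days_since_release <= 7 and reviews_last_24h > 100:
        return "realtime"  # Kinesis streaming aggregation
    return "nightly"       # batch job; adequate for the long tail
```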

"Never invent reviews" is enforced by prompt only

The system prompt says don't invent. The MCP returns aggregated facts. But the LLM still composes a paragraph that interprets the aggregate. If the aggregate says "70% positive, common theme: pacing issues," and the LLM produces "Most readers loved the pacing," that's wrong. There's no automated check that catches this.

Real defenses:

- Constrain output format to structured templates (e.g., always quote the score)
- Validate post-generation that score ranges in the output match the aggregate input
- Use a smaller verifier model

None are documented. The current defense is "trust the prompt," which has known failure modes.
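Of the three, the post-generation score check is the easiest to sketch. Assuming the aggregate carries a 0-1 sentiment score (as in the `sentiment="0.72"` example earlier) and the template requires quoting it as a percentage:

```python
import re

def quotes_the_score(answer: str, sentiment: float, tolerance: float = 0.05) -> bool:
    """True if some 'NN%' in the answer matches the aggregate's sentiment.

    Catches a '70% positive' aggregate being narrated as 'most readers
    loved it' with no number, or with the wrong number.
    """
    stated = [int(m) / 100 for m in re.findall(r"(\d{1,3})\s*%", answer)]
    return any(abs(s - sentiment) <= tolerance for s in stated)
```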