05 — MangaQAAgent

Editorial Q&A, review summarization, and policy answers. Backed by S3 (policies) and OpenSearch (50M+ reviews + editorial chunks).

The MangaQAAgent handles the questions that need language, not lookups: "What's the return policy?", "What do readers think of Vinland Saga?", "What's this series about?". It is the most hallucination-prone sub-agent because its job is summarization over free-form text — and that's where LLMs fail loudest.


What it is

A logical sub-agent that owns text-grounded Q&A. Three responsibilities:

  1. Policy / FAQ answers — return policy, shipping rules, billing FAQs
  2. Review summarization — aggregate sentiment, common praise/complaints, spoilers flag
  3. Editorial summaries — series synopsis, themes, content warnings

Backed by two MCP servers:

- [Support & Policy MCP](../RAG-MCP-Integration/05-support-policy-mcp.md) — S3 + OpenSearch over legal-approved policy docs
- [Review & Sentiment MCP](../RAG-MCP-Integration/04-review-sentiment-mcp.md) — OpenSearch over 50M+ reviews + sentiment aggregates


Tools exposed to the Orchestrator

| Tool | Purpose | Source |
| --- | --- | --- |
| `answer_faq(query)` | Policy / shipping / FAQ answer | S3 + OpenSearch (policy index) |
| `get_return_policy(region, category)` | Structured return-policy lookup | S3 (versioned) |
| `get_sentiment_summary(asin, top_n)` | Aggregated review sentiment | OpenSearch (review index) |
| `get_editorial_summary(asin)` | Series synopsis + themes | OpenSearch (editorial index) |

The split between `answer_faq` (free-form) and `get_return_policy` (structured) is deliberate. Free-form serves exploratory questions; structured is for cases where the Orchestrator already knows the user wants the canonical policy. The structured tool returns the exact policy text and version, eliminating any LLM rephrasing risk.
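To make the distinction concrete, here is a minimal sketch of the structured path in Python, assuming boto3 and an illustrative bucket/key layout (`policies/returns/{region}/{category}.txt` is not the real schema):

```python
import boto3

s3 = boto3.client("s3")

def get_return_policy(region: str, category: str) -> dict:
    """Structured lookup: return the canonical policy text verbatim.

    The bucket and key layout are illustrative. The point is that no
    LLM sits between the stored text and the response.
    """
    key = f"policies/returns/{region}/{category}.txt"
    obj = s3.get_object(Bucket="policies", Key=key)
    return {
        "text": obj["Body"].read().decode("utf-8"),  # verbatim, never rephrased
        "version": obj["VersionId"],                 # cite this exact version
        "source": f"s3://policies/{key}",
    }
```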


Source-of-truth tagging

Every chunk returned to the Orchestrator carries metadata that tells Claude what kind of source it is. This is the linchpin of hallucination control:

<tool_result name="answer_faq">
  <source type="policy" doc="returns_policy_v2024.10" version="2024-10-15"
          url="s3://policies/returns_v2024.10.pdf" authority="legal_approved">
    [policy text chunk]
  </source>
  <source type="review" asin="B07X1234" review_count="1247"
          sentiment="0.72" authority="user_generated">
    [aggregated review summary]
  </source>
</tool_result>

The system prompt instructs Claude:

- For `type=policy`, quote verbatim and cite the version + URL
- For `type=review`, attribute to "readers" and use hedged language ("readers report...", "reviewers note...")
- Never blend the two

This separation matters. Policy is authoritative; reviews are subjective. Conflating them is a serious error.


Retrieval pipeline (per source type)

Policy / FAQ path

Query → Embed (Titan v2)
      → OpenSearch over policy index (filter: type=policy, region=user_region)
      → Cross-encoder rerank top-3
      → Verify version metadata is current (S3 versioning)
      → Return with verbatim text + citation

The version-verification step is critical. Policy docs change. If OpenSearch returns a chunk from a stale version, we'd cite outdated policy. The agent re-checks the S3 object version before returning.
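A sketch of that re-check, assuming each indexed chunk carries the S3 key and object version it was built from (`s3_key` and `s3_version` are hypothetical metadata field names):

```python
import boto3

s3 = boto3.client("s3")

def is_chunk_current(chunk: dict) -> bool:
    """True if the chunk was built from the latest version of its policy doc.

    A HEAD request returns the live VersionId without downloading the
    document; this is the ~80ms line item in the latency budget below.
    """
    head = s3.head_object(Bucket="policies", Key=chunk["s3_key"])
    return head["VersionId"] == chunk["s3_version"]
```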

Review / sentiment path

ASIN → OpenSearch over review index (filter: asin=X)
     → Pre-computed sentiment aggregation (avg score, top themes)
     → Top-N representative reviews via diversity sampling
     → Return as structured aggregate

We do not return raw reviews to the LLM. We return aggregated summaries already computed by an offline batch job. The LLM is trusted to formulate the answer in natural language, but the facts (sentiment scores, common themes) are pre-computed.
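The doc doesn't name the diversity-sampling method; one plausible implementation is greedy farthest-point selection over review embeddings, sketched here with numpy:

```python
import numpy as np

def diverse_sample(embeddings: np.ndarray, n: int) -> list[int]:
    """Pick n review indices that span the opinion space.

    Greedy max-min: start near the centroid, then repeatedly take the
    review farthest from everything already selected, so the sample
    doesn't cluster on one dominant opinion.
    """
    first = int(np.linalg.norm(embeddings - embeddings.mean(axis=0), axis=1).argmin())
    selected = [first]
    while len(selected) < min(n, len(embeddings)):
        # Distance from every review to its nearest already-selected review
        dists = np.linalg.norm(
            embeddings[:, None, :] - embeddings[selected], axis=2
        ).min(axis=1)
        dists[selected] = -1.0  # never re-pick a selected review
        selected.append(int(dists.argmax()))
    return selected
```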

Editorial path

ASIN → OpenSearch over editorial index (filter: asin=X)
     → Top-1 editorial chunk (these are written by humans, no rerank needed)
     → Return with content-warning metadata

State management

Stateless per call. Backing stores:

- S3 — policy docs, versioned. Source of truth.
- OpenSearch — three separate indexes: policy, review, editorial.
- Pre-computed sentiment aggregates — written nightly by a batch job, stored in OpenSearch as document fields.

The agent owns no mutable state.


Failure handling

| Failure | Detection | Recovery |
| --- | --- | --- |
| No policy match for query | Empty result | Return "I'll connect you with support" + `escalation_suggested=true` |
| Policy version mismatch (S3 changed) | Re-check fails | Refresh the OpenSearch entry and retry; if it still fails, fail with "policy under update" |
| Empty review corpus for niche title | `review_count < 5` | Return "Not enough reviews yet to summarize" — never invent |
| Review sentiment computation stale (>24h) | Aggregate timestamp check | Use the stale aggregate and flag it with `last_updated` |
| Reranker SageMaker cold start | First call after idle | Skip rerank, return RRF top-3 with `reranked: false` flag |
| OpenSearch index unavailable | Connection error | Fail fast with a structured error, no fallback (would risk hallucination) |
| Citation mismatch (LLM cites wrong source) | Post-generation validation | Regenerate with a stricter prompt; if it persists, return a template |

The "fail fast, no fallback" rule for OpenSearch unavailability is intentional. Better to admit "I can't answer right now" than to fall back to LLM training data and produce an authoritative-sounding wrong answer about a return policy.


Latency budget

Target: P99 < 800ms per tool call.

answer_faq:
  Embed query             50ms (often cache hit ~5ms)
  OpenSearch retrieve    200ms
  Rerank                 100ms
  S3 version check        80ms (HEAD request)
  Format                  10ms
  ─────
  Total                  440ms (P50)
                         900ms (P99)

get_sentiment_summary:
  OpenSearch lookup      100ms (filter by ASIN, no embedding needed)
  Format                  20ms
  ─────
  Total                  120ms (P50)
                         300ms (P99)

get_editorial_summary:
  OpenSearch lookup       80ms
  Format                  10ms
  ─────
  Total                   90ms (P50)
                         200ms (P99)

answer_faq is the slowest because it goes through the full RAG pipeline. Sentiment and editorial summaries are fast because they're keyed by ASIN — no embedding, no rerank.


Why this shape

| Alternative | Why we rejected it |
| --- | --- |
| Single Q&A index (all source types together) | Loses source-type filtering; conflates policy with reviews |
| Compute sentiment on-demand | An LLM summarizing 1,000+ reviews per query is slow, expensive, and inconsistent |
| Skip the version check on policy | Policy drift caused real incidents — the version check is non-negotiable |
| Let the LLM rephrase policy answers | Legal review is required for every customer-facing policy text; verbatim quoting is the safe path |
| Use raw reviews instead of aggregates | The LLM cherry-picks; aggregates are pre-computed and stable |

Validation: Constraint Sanity Check

| Claimed metric | Verdict | Why |
| --- | --- | --- |
| P99 < 800ms for `answer_faq` | Aggressive — the version check kills the budget | The 80ms S3 HEAD check has long-tail variance; S3 P99 latency in us-east is 100–500ms. Realistic P99 for `answer_faq` is closer to 1.0–1.3s. |
| 50M+ reviews indexed | Numerically large, freshness unclear | When does a new review become queryable? Indexing latency on OpenSearch with 50M docs and high write throughput is typically minutes to hours. The doc doesn't quote a freshness SLA. |
| Sentiment aggregates updated nightly | Stale during fast review accumulation | A title that just released and gathers 1,000 reviews in 24 hours: nightly aggregates show "0 reviews" for the first day. Mitigation: real-time aggregation for high-velocity ASINs. Not in the architecture. |
| Policy version check on every read | Necessary, but costly | The version check adds at least 80ms to every policy-shaped query. Alternative: subscribe to S3 event notifications and update OpenSearch in near-real-time, eliminating the per-read check. Not done. |
| OpenSearch unavailability → fail fast | Right call, but no fallback hurts UX | "I can't answer right now" is correct for policy questions; for review summaries, a degraded mode (cached last-known summary) would be acceptable, and isn't documented. |
| Citation correctness | Hardest validation problem | Claude can quote a chunk's content but attribute it to the wrong document — "citation hallucination." The doc says post-generation validation handles it but doesn't specify how. A real check would (a) verify the quoted text appears in the cited chunk and (b) verify the chunk's metadata matches the citation. Implementing this is non-trivial. |
| 5M editorial chunks indexed | Probably undersized, unverified | Editorial content is human-written; 5M titles at ~10–20 chunks each is 50–100M editorial chunks. Or are only top titles editorialized? The scope is unclear. |
| "Never invent reviews" enforcement | Prompt-layer only | The system prompt tells Claude not to invent. There is no automated check that flagged review summaries actually cite real reviews. Enforcement is best-effort. |
| Rerank skip during cold start | Quality silently degrades | Skipping rerank reduces precision, and there's no quoted impact ("rerank lifts top-3 precision by X%"). Without that number, we don't know how bad the degraded mode is. |
| Region-filtered policy retrieval | Sound, but depends on accurate region detection | If the user's region is misdetected (VPN, browser-locale mismatch), they may get the wrong region's policy. Region-detection accuracy isn't quoted. |

The biggest risk: citation hallucination

The agent's whole value proposition is grounded answers with sources. The hardest failure mode is the LLM producing text that sounds like it's from the cited source but isn't. Examples we've seen in similar systems:

  • Quote A is from Source 1 but cited as Source 2
  • Quote A is fabricated (no chunk says this) but cited to a real source
  • Quote A blends Source 1 and Source 2 into a single sentence with one citation

Detecting these requires substring matching every quoted span against the cited chunk. The doc says "post-gen validation handles it" but doesn't describe the implementation. This is the single most important post-processing step in the system, and it's underspecified.
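A workable version of that check, under two assumptions not in the doc: the generator emits its quotes as discrete spans (or marks them with quotation marks), and each citation names a retrieved chunk ID:

```python
import re

def validate_citations(citations: dict[str, str], chunks: dict[str, str]) -> list[str]:
    """Return the quoted spans that fail grounding.

    `citations` maps each quoted span to the chunk ID the answer attributes
    it to; `chunks` maps chunk IDs to their retrieved text. A span fails if
    it doesn't appear (whitespace-normalized) in the chunk it cites; this
    one check catches fabrication, blending, and mis-attribution alike.
    """
    def norm(s: str) -> str:
        return re.sub(r"\s+", " ", s).strip().lower()

    return [
        quote for quote, chunk_id in citations.items()
        if norm(quote) not in norm(chunks.get(chunk_id, ""))
    ]
```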

Policy freshness has two clocks

Policy docs change in S3. The OpenSearch index is rebuilt periodically (nightly?). The S3 version check on every read protects against the index being stale. But:

  • If the index was rebuilt 12 hours ago and the policy changed 6 hours ago, the index has the old chunk text but the version check catches that the latest version is newer. The agent then... what? Re-fetches from S3 directly? Falls back to "policy under update"?

The fallback path isn't specified. In practice, the right answer is probably:

  1. Fetch the latest version directly from S3
  2. Re-chunk on the fly
  3. Return that chunk
  4. Trigger an async re-index

Without this path implemented, the agent will return "policy under update" any time the index is stale, which is bad UX.
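That recovery path might look like the following sketch; the bucket name, queue URL, and the shortcut of returning the whole document instead of re-chunking on the fly are all assumptions:

```python
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

def serve_fresh_policy(chunk: dict) -> dict:
    """Fallback when the indexed chunk is stale: answer from S3 directly,
    then queue an async re-index rather than refusing the user."""
    obj = s3.get_object(Bucket="policies", Key=chunk["s3_key"])  # latest version
    sqs.send_message(QueueUrl="<reindex-queue-url>",             # placeholder
                     MessageBody=chunk["s3_key"])
    return {
        "text": obj["Body"].read().decode("utf-8"),  # whole doc; re-chunking elided
        "version": obj["VersionId"],
        "stale_index": True,  # surface how often this path fires
    }
```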

Sentiment aggregation has a freshness/cost tradeoff

Nightly aggregation is cheap and adequate for established titles. It fails for new releases that gather thousands of reviews per day. The architecture would benefit from a tiered approach:

  • High-velocity ASINs (e.g., last 7 days, >100 reviews/day): real-time streaming aggregation via Kinesis
  • Long tail: nightly batch

This isn't expensive engineering but isn't in the doc. As written, the architecture serves stale sentiment to exactly the queries (new releases) where users care most.
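The routing decision itself is cheap; a sketch with placeholder thresholds taken from the criterion above:

```python
def aggregation_tier(days_since_release: int, reviews_last_24h: int) -> str:
    """Route an ASIN to streaming or batch sentiment aggregation.

    Thresholds mirror the 'last 7 days, >100 reviews/day' criterion;
    tune them against real traffic.
    """
    if days_since_release <= 7 and reviews_last_24h > 100:
        return "realtime"  # Kinesis streaming aggregation
    return "nightly"       # batch job; adequate for the long tail
```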

"Never invent reviews" is enforced by prompt only

The system prompt says don't invent. The MCP returns aggregated facts. But the LLM still composes a paragraph that interprets the aggregate. If the aggregate says "70% positive, common theme: pacing issues," and the LLM produces "Most readers loved the pacing," that's wrong. There's no automated check that catches this.

Real defenses:

- Constrain output format to structured templates (e.g., always quote the score)
- Validate post-generation that score ranges in the output match the aggregate input
- Use a smaller verifier model

None are documented. The current defense is "trust the prompt," which has known failure modes.
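Of the three, the post-generation score check is the easiest to sketch. Assuming the aggregate carries a 0-1 sentiment score (as in the `sentiment="0.72"` example earlier) and the template requires quoting it as a percentage:

```python
import re

def quotes_the_score(answer: str, sentiment: float, tolerance: float = 0.05) -> bool:
    """True if some 'NN%' in the answer matches the aggregate's sentiment.

    Catches a '70% positive' aggregate being narrated as 'most readers
    loved it' with no number, or with the wrong number.
    """
    stated = [int(m) / 100 for m in re.findall(r"(\d{1,3})\s*%", answer)]
    return any(abs(s - sentiment) <= tolerance for s in stated)
```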