RAG-Based MCP Integration — MangaAssist Amazon Chatbot
What Is MCP (Model Context Protocol)?
MCP is Anthropic's open standard that lets an LLM call typed, schema-validated tools backed by external servers. Each MCP server exposes:
- A tool manifest (name, description, JSON-Schema inputs/outputs)
- A transport (stdio, HTTP/SSE, or WebSocket)
- Security (OAuth 2.0 or an API key per server)
The LLM decides at inference time which tool to call — no hardcoded routing.
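Concretely, a manifest entry is what the server returns from capability discovery. The tool below is a hypothetical entry for a catalog-search tool, shaped like a `tools/list` response item:

```python
# Hypothetical manifest entry, in the shape an MCP server returns
# from tools/list. The tool name and fields are illustrative.
search_manga_tool = {
    "name": "search_manga",
    "description": (
        "Full-text and semantic search over the manga catalog. "
        "Use for title, genre, or author lookups."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Free-text search query"},
            "genre": {"type": "string", "description": "Optional genre filter"},
            "limit": {"type": "integer", "default": 10},
        },
        "required": ["query"],
    },
}
```

The `description` string is doing the routing work: the LLM reads it at inference time to decide whether this tool fits the user's intent.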
What Is RAG-Based MCP?
A plain MCP tool is a live API call (database lookup, REST endpoint). A RAG-based MCP tool runs a retrieval pipeline inside the server — embed, retrieve, rerank, format — before returning a result to the LLM.
Key distinction: The MCP server IS the RAG system. The LLM doesn't hold knowledge — it holds reasoning; the MCP servers hold knowledge.
```mermaid
flowchart TD
    A([User Query]) --> B[MCP Router\nLLM decides tool]
    B --> C[MCP Server]
    C --> D[1 · Embed Query\nTitan Embed v2]
    D --> E[2 · Retrieve\nOpenSearch / DynamoDB / S3]
    E --> F[3 · Rerank\nCross-encoder / RRF fusion]
    F --> G[4 · Format\nStructured context block]
    G --> H[Tool Result returned to LLM]
    H --> I([LLM synthesises\nfinal answer])
    style A fill:#4A90D9,color:#fff
    style I fill:#27AE60,color:#fff
    style C fill:#8E44AD,color:#fff
```
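The four stages in the flowchart above can be sketched end to end. The helpers here are toy stand-ins — a bag-of-words "embedding" and Jaccard overlap in place of Titan Embed v2, OpenSearch, and the cross-encoder — so the control flow, not the models, is what's illustrated:

```python
from dataclasses import dataclass

@dataclass
class Doc:
    doc_id: str
    text: str

CORPUS = [
    Doc("m1", "Berserk dark fantasy seinen manga"),
    Doc("m2", "One Piece adventure shonen manga"),
    Doc("m3", "Vinland Saga dark historical seinen manga"),
]

def embed(text: str) -> set[str]:
    # Stage 1 stand-in: bag-of-words instead of a 1024-dim Titan vector
    return set(text.lower().split())

def retrieve(query_vec: set[str], k: int = 3) -> list[tuple[float, Doc]]:
    # Stage 2 stand-in: Jaccard overlap instead of OpenSearch k-NN
    scored = [
        (len(query_vec & embed(d.text)) / len(query_vec | embed(d.text)), d)
        for d in CORPUS
    ]
    return sorted(scored, key=lambda s: s[0], reverse=True)[:k]

def rerank(candidates: list[tuple[float, Doc]], top_n: int = 2):
    # Stage 3 stand-in: a real server would call a cross-encoder here
    return candidates[:top_n]

def run_pipeline(query: str) -> str:
    # Stage 4: format a numbered, structured context block for the LLM
    hits = rerank(retrieve(embed(query)))
    return "\n".join(
        f"[{i + 1}] ({score:.2f}) {d.text}" for i, (score, d) in enumerate(hits)
    )

print(run_pipeline("dark fantasy manga"))
```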
Chatbot MCP Landscape — All Seven Servers
| # | MCP Server | RAG Source | Primary Purpose | Critical Scale Factor |
|---|---|---|---|---|
| 1 | Catalog Search MCP | OpenSearch (manga metadata) | Title/genre/author lookup | 5M+ manga titles, multilingual |
| 2 | User Preference MCP | DynamoDB + Personalize vectors | Personalized recs | 10M users × sparse preference vectors |
| 3 | Order & Inventory MCP | RDS + ElastiCache (hybrid RAG) | Order status, stock check | Real-time freshness requirement |
| 4 | Review & Sentiment MCP | OpenSearch (review corpus) | Community opinion synthesis | 50M+ reviews, sentiment aggregation |
| 5 | Support & Policy MCP | S3 + OpenSearch (doc RAG) | FAQ, returns, billing help | Policy doc freshness, version control |
| 6 | Trending & Discovery MCP | DynamoDB Streams + Kinesis | New releases, bestsellers | Sub-5s time-to-trend requirement |
| 7 | Cross-Title Link MCP | Neptune (graph) + OpenSearch | "If you liked X, try Y" | Multi-hop graph traversal + RAG |
High-Level Architecture
```mermaid
flowchart TD
    User([User]) --> APIGW[API Gateway]
    APIGW --> Lambda[Lambda Handler]
    Lambda --> Claude[Claude claude-sonnet-4-6\nMCP Client]
    Claude -->|Tool dispatch\nLLM decides| MCP1[Catalog\nSearch MCP]
    Claude --> MCP2[User Preference\nMCP]
    Claude --> MCP3[Order &\nInventory MCP]
    Claude --> MCP4[Review &\nSentiment MCP]
    Claude --> MCP5[Support &\nPolicy MCP]
    Claude --> MCP6[Trending &\nDiscovery MCP]
    Claude --> MCP7[Cross-Title\nLink MCP]
    MCP1 --> DS1[(OpenSearch\nManga Index)]
    MCP2 --> DS2[(DynamoDB +\nPersonalize)]
    MCP3 --> DS3[(RDS +\nElastiCache)]
    MCP4 --> DS4[(OpenSearch\nReview Corpus)]
    MCP5 --> DS5[(S3 +\nOpenSearch Docs)]
    MCP6 --> DS6[(Kinesis +\nDynamoDB Streams)]
    MCP7 --> DS7[(Neptune Graph\n+ OpenSearch)]
    style User fill:#4A90D9,color:#fff
    style Claude fill:#8E44AD,color:#fff
    style MCP1 fill:#E67E22,color:#fff
    style MCP2 fill:#E67E22,color:#fff
    style MCP3 fill:#E67E22,color:#fff
    style MCP4 fill:#E67E22,color:#fff
    style MCP5 fill:#E67E22,color:#fff
    style MCP6 fill:#E67E22,color:#fff
    style MCP7 fill:#E67E22,color:#fff
```
Shared Infrastructure
Embedding Layer (All MCP Servers)
```python
# All RAG-based MCP servers share the same embedding service
embedding_config = {
    "model": "amazon.titan-embed-text-v2:0",
    "dimensions": 1024,
    "normalize": True,
    "batch_size": 100,
}
# Deployed as an ECS Fargate service, behind ElastiCache for deduplication
```
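The dedup idea can be sketched as a cache keyed on a content hash, with the actual Bedrock call injected as a callable. In production `embed_fn` would invoke `amazon.titan-embed-text-v2:0` via `bedrock-runtime` and the cache would be ElastiCache rather than a local dict; both stand-ins are assumptions for illustration:

```python
import hashlib
import json

class EmbeddingService:
    """Dedup wrapper: identical texts hit the cache, not Bedrock."""

    def __init__(self, embed_fn, cache=None):
        # embed_fn: callable(text) -> list[float]. A local dict stands in
        # for ElastiCache (Redis) here.
        self.embed_fn = embed_fn
        self.cache = cache if cache is not None else {}
        self.misses = 0

    def embed(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = self.embed_fn(text)
        return self.cache[key]

# The request body Titan Embed v2 expects, matching embedding_config above:
titan_body = json.dumps(
    {"inputText": "Berserk", "dimensions": 1024, "normalize": True}
)
```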
MCP Server Transport
All servers use HTTP/SSE transport on Amazon ECS Fargate:
- POST /tools/list — capability discovery
- POST /tools/call — tool execution
- JWT-authenticated via Cognito → API Gateway → ECS
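Behind `POST /tools/call`, each server dispatches on tool name and wraps the outcome in the MCP result shape (`content` blocks plus an `isError` flag). A minimal sketch, with the web framework and JWT check omitted and the handlers as hypothetical stand-ins for the real RAG pipelines:

```python
# Hypothetical tool handlers standing in for the real retrieval pipelines.
TOOL_HANDLERS = {
    "search_manga": lambda args: {"hits": [f"result for {args['query']}"]},
    "get_manga_details": lambda args: {"title": args["title"]},
}

def handle_tools_call(request: dict) -> dict:
    name, args = request["name"], request.get("arguments", {})
    if name not in TOOL_HANDLERS:
        return {
            "isError": True,
            "content": [{"type": "text", "text": f"unknown tool: {name}"}],
        }
    result = TOOL_HANDLERS[name](args)
    # Tool results go back as typed content blocks
    return {
        "isError": False,
        "content": [{"type": "text", "text": str(result)}],
    }
```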
Tool Invocation Flow (Claude SDK)
```python
import anthropic

client = anthropic.Anthropic()

# Anthropic's MCP connector takes remote servers via `mcp_servers`;
# the URLs below are illustrative placeholders for the Fargate endpoints.
response = client.beta.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    mcp_servers=[
        {"type": "url", "name": "catalog-mcp", "url": "https://catalog-mcp.example.internal/sse",
         "tool_configuration": {"allowed_tools": ["search_manga", "get_manga_details"]}},
        {"type": "url", "name": "user-pref-mcp", "url": "https://user-pref-mcp.example.internal/sse",
         "tool_configuration": {"allowed_tools": ["get_recommendations"]}},
        {"type": "url", "name": "order-mcp", "url": "https://order-mcp.example.internal/sse",
         "tool_configuration": {"allowed_tools": ["get_order_status", "check_stock"]}},
        {"type": "url", "name": "review-mcp", "url": "https://review-mcp.example.internal/sse",
         "tool_configuration": {"allowed_tools": ["get_sentiment_summary"]}},
        {"type": "url", "name": "support-mcp", "url": "https://support-mcp.example.internal/sse",
         "tool_configuration": {"allowed_tools": ["answer_faq", "get_return_policy"]}},
        {"type": "url", "name": "trending-mcp", "url": "https://trending-mcp.example.internal/sse",
         "tool_configuration": {"allowed_tools": ["get_trending", "get_new_releases"]}},
        {"type": "url", "name": "graph-mcp", "url": "https://graph-mcp.example.internal/sse",
         "tool_configuration": {"allowed_tools": ["get_similar_titles"]}},
    ],
    messages=[{"role": "user", "content": user_message}],
    betas=["mcp-client-2025-04-04"],
)
```
RAG Pipeline — Shared Stages Across All MCPs
Stage 1: Query Understanding (Pre-Retrieval)
```mermaid
flowchart LR
    Q([Raw Query\n'dark fantasy manga\nlike Berserk for adults']) --> IC[Intent Classifier]
    IC --> QR[Query Rewriter]
    QR --> SD[Sub-query\nDecomposer]
    SD --> S1[manga similar to Berserk\n→ Cross-Title MCP]
    SD --> S2[dark fantasy genre filter\n→ Catalog MCP]
    SD --> S3[adult / mature themes\n→ User Preference MCP]
    style Q fill:#4A90D9,color:#fff
    style S1 fill:#27AE60,color:#fff
    style S2 fill:#27AE60,color:#fff
    style S3 fill:#27AE60,color:#fff
```
Stage 2: Hybrid Retrieval
```mermaid
flowchart LR
    Q([Query]) --> DE[Dense Embedding\nTitan Embed v2]
    Q --> BM[BM25 Keyword\nSparse Match]
    DE --> SF[Score Fusion\nRRF / Weighted Sum]
    BM --> SF
    SF --> TK[Top-K Candidates]
    style Q fill:#4A90D9,color:#fff
    style TK fill:#27AE60,color:#fff
```
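Reciprocal Rank Fusion combines the dense and BM25 rank lists without needing their raw scores to be comparable: each document earns `1 / (k + rank)` from every list it appears in. A minimal sketch with the conventional `k = 60`:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each ranking is a list of doc IDs, best first. Documents ranked
    # well by BOTH retrievers accumulate the highest fused score.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["berserk", "vinland_saga", "claymore"]
bm25 = ["claymore", "berserk", "gantz"]
print(rrf_fuse([dense, bm25]))
# → ['berserk', 'claymore', 'vinland_saga', 'gantz']
```

`berserk` wins because it ranks highly in both lists, even though neither retriever put `claymore` last.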
Stage 3: Reranking → Context Assembly
```mermaid
flowchart LR
    TK[Top-K Candidates] --> RR[Cross-Encoder Reranker\nBGE-reranker-v2-m3\non SageMaker]
    RR --> T3[Top-3 Results]
    T3 --> CA[Context Assembly\nStructured XML block]
    CA --> TR([Tool Result\nreturned to Claude])
    style TK fill:#E67E22,color:#fff
    style TR fill:#27AE60,color:#fff
```
Stage 4: Context Assembly Code
```python
def assemble_context(results: list["RetrievedDoc"], tool_name: str) -> str:
    # Number each retrieved doc and wrap the batch in a <tool_result> block.
    # RetrievedDoc is the server's internal retrieval record type.
    body = "\n".join(f"[{i + 1}] {r.to_structured_xml()}" for i, r in enumerate(results))
    return f'<tool_result name="{tool_name}">\n{body}\n</tool_result>'
```
MCP Selection Decision Tree
```mermaid
flowchart TD
    IN([User Intent Detected]) --> Q1{Intent type?}
    Q1 -->|Browse / Search| MCP1[Catalog Search MCP]
    Q1 -->|Personalised Rec| MCP2A[User Preference MCP]
    MCP2A --> MCP2B[+ Catalog MCP]
    Q1 -->|Order Status| MCP3[Order & Inventory MCP]
    Q1 -->|In stock?| MCP3
    Q1 -->|Reader opinions| MCP4[Review & Sentiment MCP]
    Q1 -->|Return policy\nCancel / Billing| MCP5[Support & Policy MCP]
    Q1 -->|What's new?\nTrending?| MCP6[Trending & Discovery MCP]
    Q1 -->|More like this\nSame author| MCP7[Cross-Title Link MCP]
    Q1 -->|2+ intents| PAR[Multi-MCP\nParallel Call]
    style IN fill:#4A90D9,color:#fff
    style PAR fill:#C0392B,color:#fff
```
Multi-MCP Parallel Tool Calls
```mermaid
sequenceDiagram
    actor User
    participant Claude
    participant CatalogMCP
    participant TrendingMCP
    participant GraphMCP
    User->>Claude: "I loved Berserk — what's trending\nsimilar to it, and can I order vol 42?"
    par Parallel dispatch
        Claude->>CatalogMCP: search_manga("Berserk volume 42")
    and
        Claude->>TrendingMCP: get_trending(genre="dark_fantasy")
    and
        Claude->>GraphMCP: get_similar_titles("Berserk")
    end
    CatalogMCP-->>Claude: Stock: in stock, ¥1,200
    TrendingMCP-->>Claude: [Vagabond, Vinland Saga, Claymore]
    GraphMCP-->>Claude: [Vinland Saga, Claymore, Gantz]
    Claude->>User: Synthesised answer combining\nstock status + trending + similar titles
```
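When the model requests several tool calls in one turn, the client can execute them concurrently rather than serially. A sketch with `asyncio.gather`, using a fake `call_mcp` in place of real network dispatch:

```python
import asyncio

async def call_mcp(server: str, tool: str, **kwargs) -> tuple[str, str]:
    # Fake executor standing in for an HTTP/SSE round-trip to one MCP server
    await asyncio.sleep(0.01)  # simulated network latency
    return server, f"{tool}({kwargs})"

async def dispatch_parallel() -> dict[str, str]:
    # One gather = one "par" block in the sequence diagram above
    results = await asyncio.gather(
        call_mcp("catalog-mcp", "search_manga", query="Berserk volume 42"),
        call_mcp("trending-mcp", "get_trending", genre="dark_fantasy"),
        call_mcp("graph-mcp", "get_similar_titles", title="Berserk"),
    )
    return dict(results)

results = asyncio.run(dispatch_parallel())
```

With serial dispatch the turn would cost the sum of the three calls' latencies; with `gather` it costs roughly the slowest one.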
Key Architecture Decisions + Trade-offs
| Decision | Choice | Why |
|---|---|---|
| MCP transport | HTTP/SSE | Streaming support; stdio only for local dev |
| Embedding model | Titan Embed v2 | AWS-native, no egress cost, 1024-dim |
| Retrieval store | OpenSearch Serverless | Auto-scaling, FAISS under the hood, no cluster management |
| Reranker | BGE-reranker on SageMaker | Better precision than BM25 alone, <50ms |
| MCP server host | ECS Fargate | Per-MCP isolation, independent scaling |
| Auth | Cognito → JWT per MCP | Fine-grained per-server permissions |
| Cache layer | ElastiCache (Redis) | Embedding dedup, result caching for trending |
Interview Grill: Quick-Fire
Q: Why not just put all tools in one MCP server?
A: A single server is a single blast radius: if the Review MCP goes down, Catalog still works. It also allows independent scaling — Trending spikes on Mondays, Order spikes on sale days.
Q: How does the LLM know which MCP to call?
A: Tool descriptions are the API. Each tool carries a rich description string; Claude reads it at inference time, so there is no routing code in the application layer.
Q: What happens if a RAG retrieval returns nothing?
A: Each MCP server implements a fallback_strategy: (1) broaden filters, (2) semantic fallback with lower threshold, (3) return structured no_results object with suggested alternatives — never silent empty.
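That escalation can be sketched as a chain of retrievers tried strictest-first, where the retrievers and the `no_results` shape are illustrative stand-ins:

```python
def retrieve_with_fallback(query: str, retrievers: list) -> dict:
    # retrievers: list of (strategy_name, callable(query) -> list) pairs,
    # ordered strictest-first: strict filters, broadened filters,
    # low-threshold semantic match.
    for strategy, retriever in retrievers:
        hits = retriever(query)
        if hits:
            return {"status": "ok", "strategy": strategy, "hits": hits}
    # Never return a silent empty result
    return {
        "status": "no_results",
        "suggestion": "Try a broader genre or a different title spelling.",
    }
```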
Q: How do you prevent prompt injection in MCP tool results?
A: Tool results are wrapped in <tool_result> XML tags. Claude treats content inside these tags as data, not instructions. Additionally, the MCP server strips markdown headings and code fences from user-generated content before returning.
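The stripping step can be sketched with two regex passes — fenced code blocks first, then heading lines — over user-generated content before it is returned:

```python
import re

def sanitize_tool_content(text: str) -> str:
    # Remove fenced code blocks, then markdown heading lines, so review
    # text cannot smuggle instruction-shaped structure into the context.
    text = re.sub(r"```.*?```", "", text, flags=re.DOTALL)
    text = re.sub(r"^#{1,6}\s.*$", "", text, flags=re.MULTILINE)
    return text.strip()
```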
Q: What's the latency budget per MCP call?
A: P99 < 800 ms per tool call. Breakdown: embed (50 ms) + retrieve (200 ms) + rerank (100 ms) + format (10 ms) = 360 ms, leaving a 440 ms buffer for network and cold starts.