10. AI / LLM Design - Intelligence Behind MangaAssist
Architecture Decision: Hybrid Approach
MangaAssist does not send every user message directly to an LLM. That would be slow, expensive, and prone to hallucination. Instead, it uses a hybrid design where different techniques handle different parts of the problem.
```mermaid
graph TD
    A[User Message] --> B[Intent Classifier<br>Rule-based + BERT]
    B -->|greeting / chitchat| C[Template Engine<br>No LLM]
    B -->|order_tracking| D[API Call<br>Structured response]
    B -->|product_question| E[Catalog Lookup<br>+ Light formatting]
    B -->|recommendation| F[Recommendation Engine<br>+ LLM explanation]
    B -->|faq / policy| G[RAG Pipeline<br>Retrieve + Generate]
    B -->|complex / ambiguous| H[Full LLM Path<br>Multi-step reasoning]
```
When to Use What
| Technique | Use Case | Example | Latency | Cost |
|---|---|---|---|---|
| Template | Greetings, confirmations, simple answers | "Hi! Welcome to the JP Manga store." | < 10ms | Free |
| API + Template | Order tracking, price lookup, stock check | "Your order shipped on March 10 and arrives March 14." | < 200ms | API cost only |
| Intent Classification | Routing each message to the right handler | Detecting return vs. recommendation | < 50ms | Minimal |
| RAG + LLM | FAQ, policy questions, editorial recommendations | "What's the return policy?" | < 1.5s | Moderate |
| Recommendation Engine + LLM | "Recommend something like X" | Reco engine picks titles and LLM explains why | < 1s | Moderate |
| Full LLM Path | Complex multi-turn requests | "I bought manga for my nephew but he didn't like it, what should I do?" | < 3s | Higher |
Intent Classification Design
Two-Stage Pipeline
Stage 1: Rule-Based Fast Path
```python
RULES = {
    r"(where|track|status).*(order|package|delivery)": "order_tracking",
    r"(return|refund|exchange|damaged)": "return_request",
    r"(recommend|suggest|similar|like).*(manga|book|read)": "recommendation",
    r"(price|cost|how much|deal|sale|discount|coupon)": "promotion",
    r"(hello|hi|hey|thanks|bye)": "chitchat",
    r"(talk to|human|agent|representative)": "escalation",
}
```
If a rule matches with high confidence, skip the ML model entirely. This handles the obvious intents cheaply and reduces latency.
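Applied in order, these rules reduce to a small matcher. The sketch below repeats the rule table so it is self-contained; `fast_path_intent` is a hypothetical helper, and the production confidence scoring is not shown:

```python
import re
from typing import Optional

# Same rules as above, repeated so this snippet runs on its own.
RULES = {
    r"(where|track|status).*(order|package|delivery)": "order_tracking",
    r"(return|refund|exchange|damaged)": "return_request",
    r"(recommend|suggest|similar|like).*(manga|book|read)": "recommendation",
    r"(price|cost|how much|deal|sale|discount|coupon)": "promotion",
    r"(hello|hi|hey|thanks|bye)": "chitchat",
    r"(talk to|human|agent|representative)": "escalation",
}
COMPILED = [(re.compile(p, re.IGNORECASE), intent) for p, intent in RULES.items()]

def fast_path_intent(message: str) -> Optional[str]:
    """Return an intent if any rule matches, else None to fall through to BERT."""
    for pattern, intent in COMPILED:
        if pattern.search(message):
            return intent
    return None
```

Note that the patterns have no word boundaries, so short tokens like "hi" can over-match inside longer words; the real fast path would need anchoring or a confidence threshold before skipping the model.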
Stage 2: BERT Classifier
For messages that do not match rules clearly, a fine-tuned DistilBERT model classifies the intent on SageMaker.
Training data: about 50,000 labeled examples from Amazon customer service conversations plus 5,000 manga-specific synthetic examples that were human-validated.
RAG Pipeline
Why RAG?
The LLM does not know Amazon's return policy, current manga catalog, or today's deals. RAG solves this by:
1. Retrieving relevant information from our knowledge base at query time.
2. Augmenting the LLM prompt with that information.
3. Generating a response grounded in real data.
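The three steps can be sketched as a pipeline with the retriever and generator injected as callables (OpenSearch and Bedrock in the real system; `answer_with_rag` and the chunk shape are illustrative, not production code):

```python
def answer_with_rag(question, retrieve, generate):
    """Run the three RAG steps with pluggable backends."""
    chunks = retrieve(question)  # 1. retrieve from the knowledge base
    knowledge = "\n\n".join(c["text"] for c in chunks)
    prompt = (                   # 2. augment the prompt with that knowledge
        "Answer using ONLY the sources below. "
        "If they do not contain the answer, say you do not know.\n\n"
        f"{knowledge}\n\nQuestion: {question}"
    )
    return generate(prompt)      # 3. generate a grounded response
```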
Chunk Strategy
| Content Type | Chunk Size | Overlap | Metadata |
|---|---|---|---|
| Product descriptions | 256 tokens | 25 tokens | ASIN, category, format |
| FAQ articles | 512 tokens | 50 tokens | topic, last_updated |
| Return and shipping policies | 512 tokens | 50 tokens | policy_type, region |
| Editorial content | 512 tokens | 50 tokens | genre, author |
| Review summaries | 128 tokens | 0 | ASIN, sentiment |
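A minimal token-window chunker matching the size/overlap scheme above (illustrative only; production chunking would also respect sentence and section boundaries rather than cutting at fixed token offsets):

```python
def chunk_tokens(tokens, size, overlap):
    """Split a token sequence into fixed-size chunks, each sharing
    `overlap` tokens with the previous chunk."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the final chunk absorbed the remainder
    return chunks
```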
Retrieval Strategy
```mermaid
graph LR
    A[User Query] --> B[Embed Query<br>Titan Embeddings V2]
    B --> C[KNN Search<br>OpenSearch<br>Top 10 chunks]
    C --> D[Metadata Filter<br>Match source_type to intent]
    D --> E[Cross-Encoder Reranker<br>Top 3 chunks]
    E --> F[Inject into Prompt]
```
If intent = faq, filter chunks to source_type in ('faq', 'policy'). If intent = recommendation, filter to source_type in ('editorial', 'product_description').
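This intent-to-filter mapping might translate into an OpenSearch k-NN query body along these lines (the index mapping, the `embedding` and `source_type` field names, and the use of `post_filter` are assumptions):

```python
# Intent -> allowed source_type values, as described above.
INTENT_FILTERS = {
    "faq": ["faq", "policy"],
    "recommendation": ["editorial", "product_description"],
}

def build_knn_query(query_vector, intent):
    """Build an OpenSearch k-NN query, filtering chunks by intent."""
    query = {
        "size": 10,
        "query": {"knn": {"embedding": {"vector": query_vector, "k": 10}}},
    }
    allowed = INTENT_FILTERS.get(intent)
    if allowed:  # intents without a mapping search the whole index
        query["post_filter"] = {"terms": {"source_type": allowed}}
    return query
```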
Prompt Engineering
Prompt Structure
```
SYSTEM PROMPT
- Persona: MangaAssist
- Hard rules: no hallucination, no competitor mentions
- Output format instructions

CONTEXT BLOCK
- Current page context (ASIN, section, locale)
- User profile (Prime, locale)
- Active promotions

RETRIEVED KNOWLEDGE
- Top 3 relevant chunks
- Source attribution for each chunk

PRODUCT DATA
- Structured JSON for relevant products from catalog or recommendations

CONVERSATION HISTORY
- Last 5 to 10 turns

USER MESSAGE
- The current question
```
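Assembling the sections in that order might look like the following sketch; the section labels and delimiter style are illustrative, and empty sections (for example, no retrieved knowledge on the template path) are simply dropped:

```python
def assemble_prompt(system, context, knowledge, products, history, user_msg):
    """Concatenate the prompt sections in the order shown above,
    skipping any section that is empty."""
    sections = [
        ("SYSTEM PROMPT", system),
        ("CONTEXT BLOCK", context),
        ("RETRIEVED KNOWLEDGE", knowledge),
        ("PRODUCT DATA", products),
        ("CONVERSATION HISTORY", history),
        ("USER MESSAGE", user_msg),
    ]
    return "\n\n".join(f"## {label}\n{body}" for label, body in sections if body)
```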
Anti-Hallucination Measures
```mermaid
graph TD
    A[Hallucination Risk] --> B[Grounding]
    A --> C[Constraints]
    A --> D[Validation]
    B --> B1[Only reference products from provided catalog data]
    B --> B2[Only cite policies from retrieved RAG chunks]
    B --> B3[Never generate prices - use provided prices]
    C --> C1[System prompt: If you do not know, say so]
    C --> C2[System prompt: Never invent product details]
    C --> C3[Temperature = 0.3]
    D --> D1[Post-generation ASIN check]
    D --> D2[Price check against the catalog]
    D --> D3[Link check for valid URLs]
```
The key design choice is to give the LLM structured product data and tell it to only use what is provided. It formats and explains; it does not invent.
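The post-generation validation branch could be sketched as below. The ASIN pattern (10 alphanumeric characters starting with B), the price format, and the catalog shape are all assumptions for illustration:

```python
import re

def validate_response(text, catalog):
    """Post-generation checks: every ASIN mentioned must exist in the
    provided catalog, and every quoted price must match a catalog price.
    Returns a list of issues; empty means the response passes."""
    issues = []
    for asin in re.findall(r"\bB[0-9A-Z]{9}\b", text):
        if asin not in catalog:
            issues.append(f"unknown ASIN {asin}")
    known_prices = {item["price"] for item in catalog.values()}
    for price in re.findall(r"\$\d+\.\d{2}", text):
        if price not in known_prices:
            issues.append(f"unverified price {price}")
    return issues
```

A failing check would trigger regeneration or a fallback template rather than shipping the response as-is.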
Template vs. Free-Form Decision
| Scenario | Approach | Why |
|---|---|---|
| "Where is my order?" | Template | Answer is structured |
| "What's the return policy?" | RAG + light generation | Needs natural language but must stay grounded |
| "Recommend something like One Piece" | Recommendation engine + LLM explanation | Reco engine picks titles; LLM explains why |
| "I'm new to manga, what should I read?" | Full LLM path | Needs open-ended conversation |
| "Hello!" / "Thanks!" | Template | No intelligence needed |
When to Escalate to a Human
```mermaid
graph TD
    A[Should we escalate?] --> B{User explicitly asked for human?}
    B -->|Yes| Z[Escalate]
    B -->|No| C{Billing or payment dispute?}
    C -->|Yes| Z
    C -->|No| D{3+ failed attempts to resolve?}
    D -->|Yes| Z
    D -->|No| E{Sensitive issue?<br>fraud, harassment, legal}
    E -->|Yes| Z
    E -->|No| F{User sentiment very negative?}
    F -->|Yes| G[Offer escalation]
    F -->|No| H[Continue chatbot]
```
Escalation triggers:
1. User says "talk to a human".
2. Billing or payment issue.
3. Chatbot fails 3 times on the same question.
4. Sensitive topics such as fraud, harassment, or legal issues.
5. User sentiment drops significantly.
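The decision tree maps directly onto a small function. The field names and the sentiment threshold below are assumptions, not the production values:

```python
def should_escalate(turn):
    """Walk the escalation decision tree for one conversation turn.
    Returns 'escalate', 'offer_escalation', or 'continue'."""
    if turn.get("asked_for_human"):
        return "escalate"
    if turn.get("intent") in ("billing_dispute", "payment_dispute"):
        return "escalate"
    if turn.get("failed_attempts", 0) >= 3:
        return "escalate"
    if turn.get("topic") in ("fraud", "harassment", "legal"):
        return "escalate"
    if turn.get("sentiment", 0.0) < -0.8:  # threshold is an assumption
        return "offer_escalation"
    return "continue"
```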
Model Selection
| Component | Model | Why |
|---|---|---|
| Intent Classifier | DistilBERT (fine-tuned) | Fast and small |
| Embeddings (RAG) | Amazon Titan Embeddings V2 | Native Bedrock integration |
| Reranker | Cross-encoder (ms-marco-MiniLM) | High reranking accuracy |
| Response Generation | Claude 3.5 Sonnet via Bedrock | Best quality/speed/cost balance |
| Sentiment Detection | DistilBERT (fine-tuned) | Same infra as intent classifier |
Cost Optimization
| Strategy | How | Savings |
|---|---|---|
| Template responses for simple intents | Many messages never hit the LLM | Large LLM cost reduction |
| Streaming | User sees response while it is generated | Better perceived latency |
| Prompt caching | Reuse the system prompt and common prefixes | Token savings on repeated context |
| Smaller model for simple tasks | Use a cheaper model for formatting tasks | Lower per-token cost |
| Batch RAG indexing | Index during off-peak hours | Lower compute cost |