10. AI / LLM Design - Intelligence Behind MangaAssist
Architecture Decision: Hybrid Approach
MangaAssist does not send every user message directly to an LLM. That would be slow, expensive, and prone to hallucination. Instead, it uses a hybrid design where different techniques handle different parts of the problem.
```mermaid
graph TD
    A[User Message] --> B[Intent Classifier<br>Rule-based + BERT]
    B -->|greeting / chitchat| C[Template Engine<br>No LLM]
    B -->|order_tracking| D[API Call<br>Structured response]
    B -->|product_question| E[Catalog Lookup<br>+ Light formatting]
    B -->|recommendation| F[Recommendation Engine<br>+ LLM explanation]
    B -->|faq / policy| G[RAG Pipeline<br>Retrieve + Generate]
    B -->|complex / ambiguous| H[Full LLM Path<br>Multi-step reasoning]
```
When to Use What
| Technique | Use Case | Example | Latency | Cost |
|---|---|---|---|---|
| Template | Greetings, confirmations, simple answers | "Hi! Welcome to the JP Manga store." | < 10ms | Free |
| API + Template | Order tracking, price lookup, stock check | "Your order shipped on March 10 and arrives March 14." | < 200ms | API cost only |
| Intent Classification | Routing each message to the right handler | Detecting return vs. recommendation | < 50ms | Minimal |
| RAG + LLM | FAQ, policy questions, editorial recommendations | "What's the return policy?" | < 1.5s | Moderate |
| Recommendation Engine + LLM | "Recommend something like X" | Reco engine picks titles and LLM explains why | < 1s | Moderate |
| Full LLM Path | Complex multi-turn requests | "I bought manga for my nephew but he didn't like it, what should I do?" | < 3s | Higher |
Intent Classification Design
Two-Stage Pipeline
Stage 1: Rule-Based Fast Path
```python
RULES = {
    r"(where|track|status).*(order|package|delivery)": "order_tracking",
    r"(return|refund|exchange|damaged)": "return_request",
    r"(recommend|suggest|similar|like).*(manga|book|read)": "recommendation",
    r"(price|cost|how much|deal|sale|discount|coupon)": "promotion",
    r"(hello|hi|hey|thanks|bye)": "chitchat",
    r"(talk to|human|agent|representative)": "escalation",
}
```
If a rule matches with high confidence, skip the ML model entirely. This handles the obvious intents cheaply and reduces latency.
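Applied in order, these rules reduce to a small matcher. The sketch below repeats the rule table so it is self-contained; `fast_path_intent` is a hypothetical helper, and the production confidence scoring is not shown:

```python
import re
from typing import Optional

# Same rules as above, repeated so this snippet runs on its own.
RULES = {
    r"(where|track|status).*(order|package|delivery)": "order_tracking",
    r"(return|refund|exchange|damaged)": "return_request",
    r"(recommend|suggest|similar|like).*(manga|book|read)": "recommendation",
    r"(price|cost|how much|deal|sale|discount|coupon)": "promotion",
    r"(hello|hi|hey|thanks|bye)": "chitchat",
    r"(talk to|human|agent|representative)": "escalation",
}
COMPILED = [(re.compile(p, re.IGNORECASE), intent) for p, intent in RULES.items()]

def fast_path_intent(message: str) -> Optional[str]:
    """Return an intent if any rule matches, else None to fall through to BERT."""
    for pattern, intent in COMPILED:
        if pattern.search(message):
            return intent
    return None
```

Note that the patterns have no word boundaries, so short tokens like "hi" can over-match inside longer words; the real fast path would need anchoring or a confidence threshold before skipping the model.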
Stage 2: BERT Classifier
For messages that do not match rules clearly, a fine-tuned DistilBERT model classifies the intent on SageMaker.
Training data: about 50,000 labeled examples from Amazon customer service conversations plus 5,000 manga-specific synthetic examples that were human-validated.
RAG Pipeline
Why RAG?
The LLM does not know Amazon's return policy, current manga catalog, or today's deals. RAG solves this by:
1. Retrieving relevant information from our knowledge base at query time.
2. Augmenting the LLM prompt with that information.
3. Generating a response grounded in real data.
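The three steps can be sketched as a pipeline with the retriever and generator injected as callables (OpenSearch and Bedrock in the real system; `answer_with_rag` and the chunk shape are illustrative, not production code):

```python
def answer_with_rag(question, retrieve, generate):
    """Run the three RAG steps with pluggable backends."""
    chunks = retrieve(question)  # 1. retrieve from the knowledge base
    knowledge = "\n\n".join(c["text"] for c in chunks)
    prompt = (                   # 2. augment the prompt with that knowledge
        "Answer using ONLY the sources below. "
        "If they do not contain the answer, say you do not know.\n\n"
        f"{knowledge}\n\nQuestion: {question}"
    )
    return generate(prompt)      # 3. generate a grounded response
```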
Chunk Strategy
| Content Type | Chunk Size | Overlap | Metadata |
|---|---|---|---|
| Product descriptions | 256 tokens | 25 tokens | ASIN, category, format |
| FAQ articles | 512 tokens | 50 tokens | topic, last_updated |
| Return and shipping policies | 512 tokens | 50 tokens | policy_type, region |
| Editorial content | 512 tokens | 50 tokens | genre, author |
| Review summaries | 128 tokens | 0 | ASIN, sentiment |
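A minimal token-window chunker matching the size/overlap scheme above (illustrative only; production chunking would also respect sentence and section boundaries rather than cutting at fixed token offsets):

```python
def chunk_tokens(tokens, size, overlap):
    """Split a token sequence into fixed-size chunks, each sharing
    `overlap` tokens with the previous chunk."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the final chunk absorbed the remainder
    return chunks
```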
Retrieval Strategy
```mermaid
graph LR
    A[User Query] --> B[Embed Query<br>Titan Embeddings V2]
    B --> C[KNN Search<br>OpenSearch<br>Top 10 chunks]
    C --> D[Metadata Filter<br>Match source_type to intent]
    D --> E[Cross-Encoder Reranker<br>Top 3 chunks]
    E --> F[Inject into Prompt]
```
If intent = faq, filter chunks to source_type in ('faq', 'policy'). If intent = recommendation, filter to source_type in ('editorial', 'product_description').
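This intent-to-filter mapping might translate into an OpenSearch k-NN query body along these lines (the index mapping, the `embedding` and `source_type` field names, and the use of `post_filter` are assumptions):

```python
# Intent -> allowed source_type values, as described above.
INTENT_FILTERS = {
    "faq": ["faq", "policy"],
    "recommendation": ["editorial", "product_description"],
}

def build_knn_query(query_vector, intent):
    """Build an OpenSearch k-NN query, filtering chunks by intent."""
    query = {
        "size": 10,
        "query": {"knn": {"embedding": {"vector": query_vector, "k": 10}}},
    }
    allowed = INTENT_FILTERS.get(intent)
    if allowed:  # intents without a mapping search the whole index
        query["post_filter"] = {"terms": {"source_type": allowed}}
    return query
```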
Prompt Engineering
Prompt Structure
```
SYSTEM PROMPT
- Persona: MangaAssist
- Hard rules: no hallucination, no competitor mentions
- Output format instructions

CONTEXT BLOCK
- Current page context (ASIN, section, locale)
- User profile (Prime, locale)
- Active promotions

RETRIEVED KNOWLEDGE
- Top 3 relevant chunks
- Source attribution for each chunk

PRODUCT DATA
- Structured JSON for relevant products from catalog or recommendations

CONVERSATION HISTORY
- Last 5 to 10 turns

USER MESSAGE
- The current question
```
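Assembling the sections in that order might look like the following sketch; the section labels and delimiter style are illustrative, and empty sections (for example, no retrieved knowledge on the template path) are simply dropped:

```python
def assemble_prompt(system, context, knowledge, products, history, user_msg):
    """Concatenate the prompt sections in the order shown above,
    skipping any section that is empty."""
    sections = [
        ("SYSTEM PROMPT", system),
        ("CONTEXT BLOCK", context),
        ("RETRIEVED KNOWLEDGE", knowledge),
        ("PRODUCT DATA", products),
        ("CONVERSATION HISTORY", history),
        ("USER MESSAGE", user_msg),
    ]
    return "\n\n".join(f"## {label}\n{body}" for label, body in sections if body)
```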
Anti-Hallucination Measures
```mermaid
graph TD
    A[Hallucination Risk] --> B[Grounding]
    A --> C[Constraints]
    A --> D[Validation]
    B --> B1[Only reference products from provided catalog data]
    B --> B2[Only cite policies from retrieved RAG chunks]
    B --> B3[Never generate prices - use provided prices]
    C --> C1[System prompt: If you do not know, say so]
    C --> C2[System prompt: Never invent product details]
    C --> C3[Temperature = 0.3]
    D --> D1[Post-generation ASIN check]
    D --> D2[Price check against the catalog]
    D --> D3[Link check for valid URLs]
```
The key design choice is to give the LLM structured product data and tell it to only use what is provided. It formats and explains; it does not invent.
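The post-generation validation branch could be sketched as below. The ASIN pattern (10 alphanumeric characters starting with B), the price format, and the catalog shape are all assumptions for illustration:

```python
import re

def validate_response(text, catalog):
    """Post-generation checks: every ASIN mentioned must exist in the
    provided catalog, and every quoted price must match a catalog price.
    Returns a list of issues; empty means the response passes."""
    issues = []
    for asin in re.findall(r"\bB[0-9A-Z]{9}\b", text):
        if asin not in catalog:
            issues.append(f"unknown ASIN {asin}")
    known_prices = {item["price"] for item in catalog.values()}
    for price in re.findall(r"\$\d+\.\d{2}", text):
        if price not in known_prices:
            issues.append(f"unverified price {price}")
    return issues
```

A failing check would trigger regeneration or a fallback template rather than shipping the response as-is.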
Template vs. Free-Form Decision
| Scenario | Approach | Why |
|---|---|---|
| "Where is my order?" | Template | Answer is structured |
| "What's the return policy?" | RAG + light generation | Needs natural language but must stay grounded |
| "Recommend something like One Piece" | Recommendation engine + LLM explanation | Reco engine picks titles; LLM explains why |
| "I'm new to manga, what should I read?" | Full LLM path | Needs open-ended conversation |
| "Hello!" / "Thanks!" | Template | No intelligence needed |
When to Escalate to a Human
```mermaid
graph TD
    A[Should we escalate?] --> B{User explicitly asked for human?}
    B -->|Yes| Z[Escalate]
    B -->|No| C{Billing or payment dispute?}
    C -->|Yes| Z
    C -->|No| D{3+ failed attempts to resolve?}
    D -->|Yes| Z
    D -->|No| E{Sensitive issue?<br>fraud, harassment, legal}
    E -->|Yes| Z
    E -->|No| F{User sentiment very negative?}
    F -->|Yes| G[Offer escalation]
    F -->|No| H[Continue chatbot]
```
Escalation triggers:
1. User says "talk to a human".
2. Billing or payment issue.
3. Chatbot fails 3 times on the same question.
4. Sensitive topics such as fraud, harassment, or legal issues.
5. User sentiment drops significantly.
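The decision tree maps directly onto a small function. The field names and the sentiment threshold below are assumptions, not the production values:

```python
def should_escalate(turn):
    """Walk the escalation decision tree for one conversation turn.
    Returns 'escalate', 'offer_escalation', or 'continue'."""
    if turn.get("asked_for_human"):
        return "escalate"
    if turn.get("intent") in ("billing_dispute", "payment_dispute"):
        return "escalate"
    if turn.get("failed_attempts", 0) >= 3:
        return "escalate"
    if turn.get("topic") in ("fraud", "harassment", "legal"):
        return "escalate"
    if turn.get("sentiment", 0.0) < -0.8:  # threshold is an assumption
        return "offer_escalation"
    return "continue"
```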
Model Selection
| Component | Model | Why |
|---|---|---|
| Intent Classifier | DistilBERT (fine-tuned) | Fast and small |
| Embeddings (RAG) | Amazon Titan Embeddings V2 | Native Bedrock integration |
| Reranker | Cross-encoder (ms-marco-MiniLM) | High reranking accuracy |
| Response Generation | Claude 3.5 Sonnet via Bedrock | Best quality/speed/cost balance |
| Sentiment Detection | DistilBERT (fine-tuned) | Same infra as intent classifier |
Cost Optimization
| Strategy | How | Savings |
|---|---|---|
| Template responses for simple intents | Many messages never hit the LLM | Large LLM cost reduction |
| Streaming | User sees response while it is generated | Better perceived latency |
| Prompt caching | Reuse the system prompt and common prefixes | Token savings on repeated context |
| Smaller model for simple tasks | Use a cheaper model for formatting tasks | Lower per-token cost |
| Batch RAG indexing | Index during off-peak hours | Lower compute cost |