
10. AI / LLM Design - Intelligence Behind MangaAssist

Architecture Decision: Hybrid Approach

MangaAssist does not send every user message directly to an LLM. That would be slow, expensive, and prone to hallucination. Instead, it uses a hybrid design where different techniques handle different parts of the problem.

```mermaid
graph TD
    A[User Message] --> B[Intent Classifier<br>Rule-based + BERT]

    B -->|greeting / chitchat| C[Template Engine<br>No LLM]
    B -->|order_tracking| D[API Call<br>Structured response]
    B -->|product_question| E[Catalog Lookup<br>+ Light formatting]
    B -->|recommendation| F[Recommendation Engine<br>+ LLM explanation]
    B -->|faq / policy| G[RAG Pipeline<br>Retrieve + Generate]
    B -->|complex / ambiguous| H[Full LLM Path<br>Multi-step reasoning]
```

When to Use What

| Technique | Use Case | Example | Latency | Cost |
|---|---|---|---|---|
| Template | Greetings, confirmations, simple answers | "Hi! Welcome to the JP Manga store." | < 10ms | Free |
| API + Template | Order tracking, price lookup, stock check | "Your order shipped on March 10 and arrives March 14." | < 200ms | API cost only |
| Intent Classification | Routing each message to the right handler | Detecting return vs. recommendation | < 50ms | Minimal |
| RAG + LLM | FAQ, policy questions, editorial recommendations | "What's the return policy?" | < 1.5s | Moderate |
| Recommendation Engine + LLM | "Recommend something like X" | Reco engine picks titles and LLM explains why | < 1s | Moderate |
| Full LLM Path | Complex multi-turn requests | "I bought manga for my nephew but he didn't like it, what should I do?" | < 3s | Higher |
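The routing in the diagram above can be sketched as a simple dispatch table. The handler names and their return values below are illustrative stubs, not the production implementation:

```python
# Hypothetical dispatch table mapping classified intents to handlers.
# Handler names and canned responses are illustrative stubs.

def handle_chitchat(message: str) -> str:
    # Template engine path: canned response, no LLM call.
    return "Hi! Welcome to the JP Manga store."

def handle_order_tracking(message: str) -> str:
    # Would call the orders API and fill a template; stubbed here.
    return "Let me look up your order."

def handle_full_llm(message: str) -> str:
    # Fallback: complex or ambiguous messages go to the full LLM path.
    return "LLM response (stub)"

HANDLERS = {
    "chitchat": handle_chitchat,
    "order_tracking": handle_order_tracking,
    # ... one entry per intent in the diagram
}

def route(intent: str, message: str) -> str:
    # Unknown intents fall through to the full LLM path.
    return HANDLERS.get(intent, handle_full_llm)(message)
```

The important property is that the cheap paths are the common paths; only messages the table cannot place reach the most expensive handler.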

Intent Classification Design

Two-Stage Pipeline

Stage 1: Rule-Based Fast Path

```python
RULES = {
    r"(where|track|status).*(order|package|delivery)": "order_tracking",
    r"(return|refund|exchange|damaged)": "return_request",
    r"(recommend|suggest|similar|like).*(manga|book|read)": "recommendation",
    r"(price|cost|how much|deal|sale|discount|coupon)": "promotion",
    r"(hello|hi|hey|thanks|bye)": "chitchat",
    r"(talk to|human|agent|representative)": "escalation",
}
```

If a rule matches cleanly, the pipeline skips the ML model entirely. This handles the obvious intents cheaply and keeps latency low.

Stage 2: BERT Classifier

For messages that no rule matches cleanly, a fine-tuned DistilBERT model hosted on SageMaker classifies the intent.

Training data: about 50,000 labeled examples from Amazon customer service conversations plus 5,000 manga-specific synthetic examples that were human-validated.

RAG Pipeline

Why RAG?

The LLM does not know Amazon's return policy, the current manga catalog, or today's deals. RAG solves this by:

1. Retrieving relevant information from our knowledge base at query time.
2. Augmenting the LLM prompt with that information.
3. Generating a response grounded in real data.

Chunk Strategy

| Content Type | Chunk Size | Overlap | Metadata |
|---|---|---|---|
| Product descriptions | 256 tokens | 25 tokens | ASIN, category, format |
| FAQ articles | 512 tokens | 50 tokens | topic, last_updated |
| Return and shipping policies | 512 tokens | 50 tokens | policy_type, region |
| Editorial content | 512 tokens | 50 tokens | genre, author |
| Review summaries | 128 tokens | 0 | ASIN, sentiment |
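The size/overlap scheme in the table can be implemented with a simple sliding window. This sketch operates on a pre-tokenized sequence; real chunking would use the embedding model's own tokenizer rather than the plain lists shown here:

```python
def chunk_tokens(tokens: list, size: int, overlap: int) -> list[list]:
    """Split a token sequence into fixed-size chunks with overlap,
    e.g. size=512, overlap=50 for FAQ articles per the table above."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window already covers the tail
    return chunks
```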

Retrieval Strategy

```mermaid
graph LR
    A[User Query] --> B[Embed Query<br>Titan Embeddings V2]
    B --> C[KNN Search<br>OpenSearch<br>Top 10 chunks]
    C --> D[Metadata Filter<br>Match source_type to intent]
    D --> E[Cross-Encoder Reranker<br>Top 3 chunks]
    E --> F[Inject into Prompt]
```

If intent = faq, filter chunks to source_type in ('faq', 'policy'). If intent = recommendation, filter to source_type in ('editorial', 'product_description').
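The filter-then-rerank step can be sketched as follows. The chunk dictionaries, scores, and the intent-to-source mapping are illustrative, and a sort on a precomputed relevance score stands in for the cross-encoder:

```python
# Intent-aware metadata filter applied between KNN search and reranking.
ALLOWED_SOURCES = {
    "faq": {"faq", "policy"},
    "recommendation": {"editorial", "product_description"},
}

def filter_and_rerank(chunks: list[dict], intent: str, k: int = 3) -> list[dict]:
    allowed = ALLOWED_SOURCES.get(intent)
    if allowed is not None:
        chunks = [c for c in chunks if c["source_type"] in allowed]
    # Stand-in for the cross-encoder: keep the k highest-scoring chunks.
    return sorted(chunks, key=lambda c: c["score"], reverse=True)[:k]
```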

Prompt Engineering

Prompt Structure

SYSTEM PROMPT
- Persona: MangaAssist
- Hard rules: no hallucination, no competitor mentions
- Output format instructions

CONTEXT BLOCK
- Current page context (ASIN, section, locale)
- User profile (Prime, locale)
- Active promotions

RETRIEVED KNOWLEDGE
- Top 3 relevant chunks
- Source attribution for each chunk

PRODUCT DATA
- Structured JSON for relevant products from catalog or recommendations

CONVERSATION HISTORY
- Last 5 to 10 turns

USER MESSAGE
- The current question
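Assembling the layers above is plain string construction. In this sketch the section order mirrors the structure listed; the field names (`source`, `text`, `role`) and the `##` delimiters are assumptions, not the production format:

```python
import json

def build_prompt(system: str, context: str, knowledge: list[dict],
                 products: list[dict], history: list[dict],
                 user_message: str) -> str:
    """Assemble the layered prompt in the order described above.
    `knowledge` is the top-3 reranked chunks, each carrying a `source`
    for attribution; `history` is the last 5-10 turns."""
    sections = [
        ("SYSTEM PROMPT", system),
        ("CONTEXT", context),
        ("RETRIEVED KNOWLEDGE",
         "\n".join(f"[{c['source']}] {c['text']}" for c in knowledge)),
        ("PRODUCT DATA", json.dumps(products, ensure_ascii=False)),
        ("CONVERSATION HISTORY",
         "\n".join(f"{t['role']}: {t['text']}" for t in history)),
        ("USER MESSAGE", user_message),
    ]
    return "\n\n".join(f"## {name}\n{body}" for name, body in sections)
```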

Anti-Hallucination Measures

```mermaid
graph TD
    A[Hallucination Risk] --> B[Grounding]
    A --> C[Constraints]
    A --> D[Validation]

    B --> B1[Only reference products from provided catalog data]
    B --> B2[Only cite policies from retrieved RAG chunks]
    B --> B3[Never generate prices - use provided prices]

    C --> C1[System prompt: If you do not know, say so]
    C --> C2[System prompt: Never invent product details]
    C --> C3[Temperature = 0.3]

    D --> D1[Post-generation ASIN check]
    D --> D2[Price check against the catalog]
    D --> D3[Link check for valid URLs]
```

The key design choice is to give the LLM structured product data and tell it to only use what is provided. It formats and explains; it does not invent.
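The validation branch can be enforced mechanically after generation. A minimal sketch of the ASIN and price checks, assuming a simplified ASIN format (`B0` plus 8 alphanumerics) and `$X.XX` prices; the URL check is omitted:

```python
import re

def validate_response(text: str, catalog: dict[str, dict]) -> list[str]:
    """Post-generation checks: every ASIN and price the model mentions
    must exist in the structured catalog data it was given.
    Returns a list of problems; empty means the response passes."""
    problems = []
    for asin in re.findall(r"\bB0[A-Z0-9]{8}\b", text):
        if asin not in catalog:
            problems.append(f"unknown ASIN: {asin}")
    known_prices = {p["price"] for p in catalog.values()}
    for price in re.findall(r"\$\d+\.\d{2}", text):
        if price not in known_prices:
            problems.append(f"price not in catalog: {price}")
    return problems
```

A response that fails validation would be regenerated or routed to a safe fallback rather than shown to the user.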

Template vs. Free-Form Decision

| Scenario | Approach | Why |
|---|---|---|
| "Where is my order?" | Template | Answer is structured |
| "What's the return policy?" | RAG + light generation | Needs natural language but must stay grounded |
| "Recommend something like One Piece" | Recommendation engine + LLM explanation | Reco engine picks titles; LLM explains why |
| "I'm new to manga, what should I read?" | Full LLM path | Needs open-ended conversation |
| "Hello!" / "Thanks!" | Template | No intelligence needed |

When to Escalate to a Human

```mermaid
graph TD
    A[Should we escalate?] --> B{User explicitly asked for human?}
    B -->|Yes| Z[Escalate]
    B -->|No| C{Billing or payment dispute?}
    C -->|Yes| Z
    C -->|No| D{3+ failed attempts to resolve?}
    D -->|Yes| Z
    D -->|No| E{Sensitive issue?<br>fraud, harassment, legal}
    E -->|Yes| Z
    E -->|No| F{User sentiment very negative?}
    F -->|Yes| G[Offer escalation]
    F -->|No| H[Continue chatbot]
```

Escalation triggers:

1. User says "talk to a human".
2. Billing or payment issue.
3. Chatbot fails 3 times on the same question.
4. Sensitive topics such as fraud, harassment, or legal issues.
5. User sentiment drops significantly.
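The decision tree reduces to a short ordered check. The thresholds here (3 failed attempts, sentiment below -0.5) follow the diagram where stated and are otherwise illustrative assumptions:

```python
def should_escalate(asked_for_human: bool, billing_dispute: bool,
                    failed_attempts: int, sensitive_topic: bool,
                    sentiment: float) -> str:
    """Sketch of the escalation decision tree above.
    Returns 'escalate', 'offer_escalation', or 'continue'."""
    if asked_for_human or billing_dispute or sensitive_topic:
        return "escalate"
    if failed_attempts >= 3:
        return "escalate"
    if sentiment < -0.5:  # illustrative threshold for "very negative"
        return "offer_escalation"
    return "continue"
```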

Model Selection

| Component | Model | Why |
|---|---|---|
| Intent Classifier | DistilBERT (fine-tuned) | Fast and small |
| Embeddings (RAG) | Amazon Titan Embeddings V2 | Native Bedrock integration |
| Reranker | Cross-encoder (ms-marco-MiniLM) | High reranking accuracy |
| Response Generation | Claude 3.5 Sonnet via Bedrock | Best quality/speed/cost balance |
| Sentiment Detection | DistilBERT (fine-tuned) | Same infra as intent classifier |

Cost Optimization

| Strategy | How | Savings |
|---|---|---|
| Template responses for simple intents | Many messages never hit the LLM | Large LLM cost reduction |
| Streaming | User sees response while it is generated | Better perceived latency |
| Prompt caching | Reuse the system prompt and common prefixes | Token savings on repeated context |
| Smaller model for simple tasks | Use a cheaper model for formatting tasks | Lower per-token cost |
| Batch RAG indexing | Index during off-peak hours | Lower compute cost |
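Provider-side prompt caching is a platform feature, but the same idea applies at the application level. A minimal sketch, assuming repeated queries are frequent enough to make an embedding cache worthwhile; the "vector" returned here is fake and stands in for a Titan Embeddings V2 call:

```python
from functools import lru_cache

@lru_cache(maxsize=10_000)
def embed_query(query: str) -> tuple[float, ...]:
    # Stand-in for the real embedding call; the cache means a repeated
    # query never pays the embedding latency or cost twice.
    return tuple(float(ord(c)) for c in query.lower()[:8])  # fake vector
```

Normalizing the query (lowercasing, trimming) before the lookup raises the hit rate for near-identical messages.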