
HLD Deep Dive: Intent Classification & Orchestration

Questions covered: Q6, Q11, Q16, Q39
Interviewer level: Senior Engineer → Principal Engineer


Q6. Name at least 5 intents the Intent Classifier can detect

Full Intent Catalog

Intent              Example User Message                          Routing Target
product_discovery   "Show me dark fantasy manga under $15"        Recommendation Engine + Catalog
product_question    "Does Berserk have a digital edition?"        Product Q&A Service
faq                 "What's your return policy?"                  RAG Pipeline (FAQ knowledge base)
order_tracking      "Where is my order?"                          Order Service
return_request      "I want to return Vol 4"                      Returns Service
promotion_inquiry   "Any discounts on One Piece?"                 Promotions Service
recommendation      "What should I read after Attack on Titan?"   Recommendation Engine + LLM
checkout_help       "How do I apply a gift card?"                 Checkout Service + RAG
escalation          "I need to speak to a human"                  Amazon Connect
chitchat            "Hello!" / "Thanks!"                          Template response (no LLM)

Deep Dive: How the Classifier Works

Architecture: Lightweight NLP model hosted on SageMaker

User Message ──► [Tokenizer] ──► [Embedding Layer] ──► [Classification Head]
                                                              │
                                         ┌────────────────────┴───────────────────────┐
                                         │ product_discovery: 0.72                    │
                                         │ recommendation:    0.15                    │
                                         │ product_question:  0.08                    │
                                         │ chitchat:          0.03                    │
                                         │ ...                                         │
                                         └────────────────────┬───────────────────────┘
                                                              │
                                              argmax ──► product_discovery
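
The argmax step in the diagram can be sketched in a few lines of plain Python (the scores are the illustrative ones from the diagram):

```python
def top_intent(scores):
    """Return the highest-scoring intent and its confidence."""
    intent = max(scores, key=scores.get)
    return intent, scores[intent]

scores = {
    "product_discovery": 0.72,
    "recommendation": 0.15,
    "product_question": 0.08,
    "chitchat": 0.03,
}
intent, confidence = top_intent(scores)  # ("product_discovery", 0.72)
```

The confidence value is what feeds the threshold logic below.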

Model choices for the classifier:

Option                              Latency   Cost       Accuracy
Fine-tuned DistilBERT               ~20ms     Very low   High
Fine-tuned BERT-base                ~50ms     Low        Higher
LLM-based classification (Claude)   ~500ms    High       Highest
Rule-based regex                    ~1ms      Zero       Low

MangaAssist choice: Fine-tuned DistilBERT on SageMaker.
Accuracy is sufficient (95%+), latency is negligible (~20ms), and cost per call is fractions of a cent — 25x cheaper than using the LLM for classification.

Confidence thresholds:

if confidence >= 0.85:
    route_to_intent(top_intent)
elif confidence >= 0.60:
    # Ask clarifying question
    return "Are you asking about [X] or [Y]?"
else:
    # Low confidence — fall back to LLM for freeform understanding
    route_to_llm_general_handler()


Q11. Why is Intent Classifier separate from the LLM?

Short Answer

Routing with a small classifier is orders of magnitude cheaper (~200x end-to-end for simple intents) and roughly 7x faster than sending every message to the LLM. Deterministic routing avoids unnecessary LLM calls for simple intents.

Deep Dive

Cost comparison (per request):

Intent Classifier (DistilBERT on SageMaker):
  ~$0.0001 per classification
  Latency: ~20ms

LLM (Claude 3.5 Sonnet via Bedrock):
  ~$0.003–0.05 per call (depending on token length)
  Latency: ~1,500–3,000ms

For "Where is my order?":
  Classifier: routes to Order Service → template response → $0.0001, ~300ms total
  LLM:        generates a natural language response → $0.02, ~2,000ms total

Cost difference: 200x
Latency difference: 7x

Intent distribution in a real manga chatbot (estimated):

chitchat:          15%  ─── Never needs LLM
order_tracking:    20%  ─── Needs Order Service, not LLM
faq:               25%  ─── RAG + simple template is sufficient for 70% of FAQs
product_question:  15%  ─── May need LLM for nuanced answers
recommendation:    15%  ─── Needs LLM
product_discovery: 10%  ─── May need LLM for complex queries

~60% of requests never need LLM generation. Sending all requests to the LLM would mean paying for LLM inference on "thanks!" and "where is my order?" messages — pure waste.
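
A quick back-of-the-envelope check using the illustrative per-request costs above (the exact dollar figures are assumptions from the comparison, not measured values):

```python
# Illustrative costs from the comparison above
CLASSIFIER_COST = 0.0001   # per classification (DistilBERT on SageMaker)
LLM_COST = 0.02            # per LLM-generated response (mid-range)

# Fraction of traffic a template or service response fully handles
NON_LLM_FRACTION = 0.60

# Every request pays for classification; only ~40% also pay for the LLM
with_classifier = CLASSIFIER_COST + (1 - NON_LLM_FRACTION) * LLM_COST
all_llm = LLM_COST

savings = 1 - with_classifier / all_llm
print(f"blended cost: ${with_classifier:.4f} vs ${all_llm:.4f} per request ({savings:.0%} cheaper)")
```

Even charging every request the classifier fee, the blended cost is well under half the all-LLM cost.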

The architectural principle: Use the cheapest tool that solves the problem.

Message → Classifier → Order tracking?
                      └─► YES: Template: "Your order #12345 is in transit, arriving Thursday."
                                No LLM needed. Done.

Message → Classifier → Recommendation?
                      └─► YES: Need LLM. Retrieve context, generate personalized response.

What happens when the classifier gets it wrong?

The Orchestrator has a fallback path:
1. Classifier routes to Order Service with confidence = 0.62.
2. Order Service returns no results ("no match for this query").
3. Orchestrator escalates to the LLM general handler.
4. The failed routing is logged for classifier retraining.

This creates a feedback loop: misclassified messages (those that hit the LLM fallback) are labeled and used to improve the classifier in the next training cycle.


Q16. Fan-out — Orchestrator handles multiple service calls for one response

Short Answer

The Orchestrator fans out parallel requests to multiple services, aggregates results, then sends combined data to the LLM.

Deep Dive

Scenario: User asks "Can you recommend dark fantasy manga under $15?"

Without parallelism (sequential):

Orchestrator:
  1. Call Recommendation Engine ──── 200ms
  2. Wait for results
  3. Call Product Catalog ──────────  150ms (for each of 5 ASINs)
  4. Wait for results
  Total: ~950ms just for data fetching

With parallelism (fan-out):

Orchestrator:
  ┌─► Call Recommendation Engine ──── 200ms ┐
  │                                           ├─► Aggregate ──► LLM
  └─► Call Promotions Service ─────── 80ms  ┘
  Total: ~200ms (max of parallel calls)

Implementation using Python asyncio:

import asyncio

async def handle_recommendation_intent(user_message, customer_id, session_id):
    # Fan out parallel data fetches
    recommendation_task = asyncio.create_task(
        recommendation_service.get_recommendations(customer_id, query=user_message)
    )
    promotions_task = asyncio.create_task(
        promotions_service.get_active_promotions(category="manga")
    )

    # Wait for all to complete (or timeout)
    results = await asyncio.gather(
        recommendation_task,
        promotions_task,
        return_exceptions=True  # Don't fail if one service is down
    )

    recommendations, promotions = results

    # Handle partial failures gracefully
    if isinstance(recommendations, Exception):
        recommendations = get_fallback_recommendations()  # Popular/trending
    if isinstance(promotions, Exception):
        promotions = []  # No promotions is safe to omit

    # Aggregate context for LLM
    context = build_llm_context(recommendations, promotions)
    return await llm_service.generate(user_message, context)

Fan-out pattern rules:
1. Parallel only for independent calls: if Call B depends on the output of Call A, they must be sequential.
2. Set per-service timeouts: don't let a slow service block the entire response. Typical timeouts: 300ms for cached services, 1s for live services.
3. return_exceptions=True: one service failure should not abort the entire response.
4. Aggregate gracefully: build the best response possible with whatever data you have.
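
Rule 2 (per-service timeouts) layers onto the same asyncio pattern. A sketch of the helper; the 300ms budget and empty fallback are assumptions matching the rules above:

```python
import asyncio

async def call_with_timeout(coro, timeout_s, fallback):
    """Bound a single service call so a slow dependency can't block the response."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        return fallback

# Usage inside the fan-out (same hypothetical services as above):
#   promotions = await call_with_timeout(
#       promotions_service.get_active_promotions(category="manga"),
#       timeout_s=0.3,  # cached service: 300ms budget
#       fallback=[],    # missing promotions are safe to omit
#   )
```

Wrapping each task this way turns a slow service into a graceful degradation instead of a latency spike.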

Step Functions alternative for complex flows: For workflows with conditional branching ("if user has Prime, also check Prime Reading"), AWS Step Functions provides a visual state machine. However, for the chatbot's real-time path, Step Functions adds ~100ms overhead — not appropriate for low-latency needs. Step Functions is better suited for async workflows (returns processing, RAG re-indexing).


Q39. Adding a new intent ("gift_wrapping") — walk through the full change

Short Answer

8 steps: training data & classifier retrain → Orchestrator routing → service integration → RAG chunks → system prompt → guardrails → analytics → feature flag rollout.

Deep Dive

Full change checklist:

Step 1: Intent Classifier Training Data

# Add new training examples to the labeled dataset
new_examples = [
    {"text": "Can I add gift wrapping?", "label": "gift_wrapping"},
    {"text": "I want to send this as a gift", "label": "gift_wrapping"},
    {"text": "Do you offer gift packaging?", "label": "gift_wrapping"},
    {"text": "Add a gift message to my order", "label": "gift_wrapping"},
    # ... 50+ diverse examples
]
Retrain the classifier with the new class. Validate that existing intent accuracy doesn't degrade (regression test).
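
The regression check can be as simple as comparing per-intent accuracy before and after retraining; the 1-point tolerance below is an assumption, not a stated requirement:

```python
def intent_regressions(old_acc, new_acc, tolerance=0.01):
    """Return intents whose accuracy dropped by more than `tolerance`
    after retraining with the new gift_wrapping class."""
    return {
        intent: (old_acc[intent], acc)
        for intent, acc in new_acc.items()
        if intent in old_acc and acc < old_acc[intent] - tolerance
    }
```

A non-empty result blocks the deployment of the retrained model.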

Step 2: Orchestrator Routing Rule

# In Orchestrator routing config
intent_routes = {
    "product_discovery": ProductDiscoveryHandler,
    "order_tracking": OrderTrackingHandler,
    # ... existing routes ...
    "gift_wrapping": GiftWrappingHandler,  # NEW
}

Step 3: Service Integration

class GiftWrappingHandler:
    async def handle(self, message, customer_id, cart_context):
        # Check if gift wrapping is available for items in cart
        wrapping_options = await gift_service.get_options(cart_context.asin_list)

        if not wrapping_options:
            return template_response("GIFT_NOT_AVAILABLE")

        # Pass options to LLM for natural language presentation
        return await llm_service.generate(
            intent="gift_wrapping",
            context={"options": wrapping_options},
            user_message=message
        )

Step 4: RAG Knowledge Base — Add FAQ Chunks

Q: How much does gift wrapping cost?
A: Gift wrapping is available for $4.99 per item. Premium gift boxes are $8.99.

Q: Can I add a message with gift wrapping?
A: Yes, you can include a personalized message (up to 150 characters) at checkout.

Q: Is gift wrapping available for digital items?
A: No, gift wrapping is only available for physical products.
Re-index the knowledge base so RAG can retrieve these chunks.
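
One simple way to turn the Q/A text above into retrievable chunks, one per question (the splitting convention is an assumption about how the knowledge base is formatted):

```python
def split_faq(text):
    """Split FAQ text into one chunk per Q/A pair, keyed on the question."""
    chunks = []
    for block in text.strip().split("\n\n"):
        lines = block.strip().splitlines()
        question = lines[0].removeprefix("Q: ").strip()
        answer = " ".join(l.removeprefix("A: ").strip() for l in lines[1:])
        chunks.append({"question": question, "answer": answer})
    return chunks
```

Each chunk is then embedded and indexed individually, so a user question retrieves exactly the matching Q/A pair.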

Step 5: System Prompt Update

You are MangaAssist. You help users with:
- Finding and discovering manga
- Product questions and recommendations
- Order tracking and returns
- Gift wrapping options and pricing  ← ADD THIS
...
When discussing gift wrapping, always mention:
- Available options and pricing
- Message character limit
- Availability limitations (physical items only)

Step 6: Guardrail Rules

# Add gift wrapping specific guardrails
guardrails.add_rule(
    category="price_accuracy",
    check="response mentions gift wrapping price",
    action="validate_against_gift_service_api"  # Prevent hallucinated prices
)
guardrails.add_rule(
    category="availability_accuracy",
    check="response claims gift wrapping available",
    action="verify_product_supports_gift_wrapping"
)
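
The price_accuracy rule could, for example, extract dollar amounts from the generated response and check them against the gift service's price list (the regex and helper names are illustrative, not the actual guardrail API):

```python
import re

def extract_prices(text):
    """Pull dollar amounts like $4.99 out of a response."""
    return {float(m) for m in re.findall(r"\$(\d+(?:\.\d{2})?)", text)}

def prices_are_valid(response, allowed_prices):
    """True if every price the model mentioned exists in the service's price list."""
    return extract_prices(response) <= set(allowed_prices)
```

Any response mentioning a price not returned by the gift service is blocked or regenerated, which is what prevents hallucinated prices.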

Step 7: Analytics Schema Update

-- Add new intent to analytics tracking
ALTER TABLE chatbot_events ADD COLUMN IF NOT EXISTS gift_wrapping_option_selected VARCHAR(50);

-- New metric: gift wrapping adoption rate
SELECT
    1.0 * COUNT(CASE WHEN intent = 'gift_wrapping' THEN 1 END) / COUNT(*) AS gift_wrapping_rate,
    COUNT(CASE WHEN intent = 'gift_wrapping' AND converted = true THEN 1 END) AS gift_wrapping_conversions
FROM chatbot_sessions;

Step 8: Feature Flag Rollout

# Launch with 1% traffic
feature_flags = {
    "gift_wrapping_intent": {
        "enabled": True,
        "rollout_percentage": 1,
        "rollout_groups": ["internal_employees"],  # Test with employees first
    }
}
- Day 1–3: Internal employees only.
- Day 4–7: 1% of production traffic. Monitor accuracy.
- Day 8–14: 10% if metrics are healthy.
- Day 15+: Full rollout.
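
Percentage rollout is typically a deterministic hash bucket, so a given customer sees consistent behavior across sessions. A sketch whose flag shape matches the config above (the bucketing scheme itself is an assumption):

```python
import hashlib

def flag_enabled(customer_id, flag):
    """Deterministically bucket a customer into 0-99 and compare to the rollout %."""
    if not flag["enabled"]:
        return False
    bucket = int(hashlib.sha256(customer_id.encode()).hexdigest(), 16) % 100
    return bucket < flag["rollout_percentage"]
```

Raising rollout_percentage from 1 to 10 to 100 only adds customers; anyone already in the rollout stays in it.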

What makes this architecture extensible?
The core Orchestrator doesn't change — it reads intent_routes from a config map. Adding a new intent is a configuration + handler addition, not a core refactor. The system was designed for this.

Interview point: When an interviewer asks "how extensible is this system?", this walkthrough demonstrates that the answer is "very extensible", with a well-defined process that any engineer can follow.