HLD Deep Dive: Intent Classification & Orchestration
Questions covered: Q6, Q11, Q16, Q39
Interviewer level: Senior Engineer → Principal Engineer
Q6. Name at least 5 intents the Intent Classifier can detect
Full Intent Catalog
| Intent | Example User Message | Routing Target |
|---|---|---|
| product_discovery | "Show me dark fantasy manga under $15" | Recommendation Engine + Catalog |
| product_question | "Does Berserk have a digital edition?" | Product Q&A Service |
| faq | "What's your return policy?" | RAG Pipeline (FAQ knowledge base) |
| order_tracking | "Where is my order?" | Order Service |
| return_request | "I want to return Vol 4" | Returns Service |
| promotion_inquiry | "Any discounts on One Piece?" | Promotions Service |
| recommendation | "What should I read after Attack on Titan?" | Recommendation Engine + LLM |
| checkout_help | "How do I apply a gift card?" | Checkout Service + RAG |
| escalation | "I need to speak to a human" | Amazon Connect |
| chitchat | "Hello!" / "Thanks!" | Template response (no LLM) |
Deep Dive: How the Classifier Works
Architecture: Lightweight NLP model hosted on SageMaker
User Message ──► [Tokenizer] ──► [Embedding Layer] ──► [Classification Head]
│
┌────────────────────┴───────────────────────┐
│ product_discovery: 0.72 │
│ recommendation: 0.15 │
│ product_question: 0.08 │
│ chitchat: 0.03 │
│ ... │
└────────────────────┬───────────────────────┘
│
argmax ──► product_discovery
Model choices for the classifier:
| Option | Latency | Cost | Accuracy |
|---|---|---|---|
| Fine-tuned DistilBERT | ~20ms | Very low | High |
| Fine-tuned BERT-base | ~50ms | Low | Higher |
| LLM-based classification (Claude) | ~500ms | High | Highest |
| Rule-based regex | ~1ms | Zero | Low |
MangaAssist choice: Fine-tuned DistilBERT on SageMaker.
Accuracy is sufficient (95%+), latency is negligible (~20ms), and cost per call is fractions of a cent — roughly 30x cheaper than even the cheapest LLM call (see the cost comparison under Q11).
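What the Orchestrator's call to this endpoint might look like, as a minimal sketch using boto3's SageMaker runtime (the endpoint name and response schema are assumptions):

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

def classify_intent(message: str) -> tuple[str, float]:
    # Invoke the hosted DistilBERT endpoint (endpoint name is illustrative)
    response = runtime.invoke_endpoint(
        EndpointName="mangaassist-intent-classifier",
        ContentType="application/json",
        Body=json.dumps({"text": message}),
    )
    # Assumed response schema: {"intents": [{"label": ..., "score": ...}, ...]}
    scores = json.loads(response["Body"].read())["intents"]
    top = max(scores, key=lambda s: s["score"])
    return top["label"], top["score"]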
Confidence thresholds:

def route_by_confidence(top_intent, runner_up, confidence):
    if confidence >= 0.85:
        return route_to_intent(top_intent)
    elif confidence >= 0.60:
        # Ask a clarifying question between the two most likely intents
        return f"Are you asking about {top_intent} or {runner_up}?"
    else:
        # Low confidence — fall back to LLM for freeform understanding
        return route_to_llm_general_handler()
Q11. Why is Intent Classifier separate from the LLM?
Short Answer
Classification is 30–500x cheaper (~$0.0001 vs ~$0.003–0.05 per call) and returns in ~20ms instead of seconds. Deterministic routing avoids unnecessary LLM calls for simple intents.
Deep Dive
Cost comparison (per request):
| | Cost | Latency |
|---|---|---|
| Intent Classifier (DistilBERT on SageMaker) | ~$0.0001 per classification | ~20ms |
| LLM (Claude 3.5 Sonnet via Bedrock) | ~$0.003–0.05 per call (depending on token length) | ~1,500–3,000ms |
For "Where is my order?":
Classifier: routes to Order Service → template response → $0.0001, ~300ms total
LLM: generates a natural language response → $0.02, ~2,000ms total
Cost difference: 200x
Latency difference: 7x
Intent distribution in a real manga chatbot (estimated):
chitchat: 15% ─── Never needs LLM
order_tracking: 20% ─── Needs Order Service, not LLM
faq: 25% ─── RAG + simple template is sufficient for 70% of FAQs
product_question: 15% ─── May need LLM for nuanced answers
recommendation: 15% ─── Needs LLM
product_discovery: 10% ─── May need LLM for complex queries
~50–60% of requests never need LLM generation (all chitchat and order tracking, ~70% of FAQs, plus the simpler product queries). Sending every request to the LLM would mean paying for LLM inference on "thanks!" and "where is my order?" messages — pure waste.
The architectural principle: Use the cheapest tool that solves the problem.
Message → Classifier → Order tracking?
└─► YES: Template: "Your order #12345 is in transit, arriving Thursday."
No LLM needed. Done.
Message → Classifier → Recommendation?
└─► YES: Need LLM. Retrieve context, generate personalized response.
What happens when the classifier gets it wrong?
The Orchestrator has a fallback path:
1. Classifier routes to Order Service with confidence = 0.62.
2. Order Service returns no results ("no match for this query").
3. Orchestrator escalates to LLM general handler.
4. The failed routing is logged for classifier retraining.
This creates a feedback loop: misclassified messages (those that hit the LLM fallback) are labeled and used to improve the classifier in the next training cycle.
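A sketch of that logging step (the event schema and retraining_queue client are assumptions; labeling happens offline):

import json
import time

def log_misroute(message, predicted_intent, confidence, session_id):
    # Emitted when a routed service returns no match and the request falls
    # back to the LLM; these events become candidate training examples
    event = {
        "ts": time.time(),
        "session_id": session_id,
        "text": message,
        "predicted_intent": predicted_intent,
        "confidence": confidence,
        "outcome": "llm_fallback",
    }
    retraining_queue.send(json.dumps(event))  # hypothetical queue client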
Q16. Fan-out — Orchestrator handles multiple service calls for one response
Short Answer
The Orchestrator fans out parallel requests to multiple services, aggregates results, then sends combined data to the LLM.
Deep Dive
Scenario: User asks "Can you recommend dark fantasy manga under $15?"
Without parallelism (sequential):
Orchestrator:
1. Call Recommendation Engine ──── 200ms
2. Wait for results
3. Call Product Catalog ────────── 150ms × 5 ASINs = 750ms
4. Wait for results
Total: ~950ms just for data fetching
With parallelism (fan-out):
Orchestrator:
┌─► Call Recommendation Engine ──── 200ms ┐
│ ├─► Aggregate ──► LLM
└─► Call Promotions Service ─────── 80ms ┘
Total: ~200ms (max of parallel calls)
Implementation using Python asyncio:
import asyncio

async def handle_recommendation_intent(user_message, customer_id, session_id):
    # Fan out parallel data fetches
    recommendation_task = asyncio.create_task(
        recommendation_service.get_recommendations(customer_id, query=user_message)
    )
    promotions_task = asyncio.create_task(
        promotions_service.get_active_promotions(category="manga")
    )
    # Wait for all to complete (per-service timeouts are sketched after the rules below)
    results = await asyncio.gather(
        recommendation_task,
        promotions_task,
        return_exceptions=True  # Don't fail if one service is down
    )
recommendations, promotions = results
# Handle partial failures gracefully
if isinstance(recommendations, Exception):
recommendations = get_fallback_recommendations() # Popular/trending
if isinstance(promotions, Exception):
promotions = [] # No promotions is safe to omit
# Aggregate context for LLM
context = build_llm_context(recommendations, promotions)
return await llm_service.generate(user_message, context)
Fan-out pattern rules:
1. Parallel only for independent calls — if Call B depends on the output of Call A, they must be sequential.
2. Set per-service timeouts — don't let a slow service block the entire response. Typical timeouts: 300ms for cached services, 1s for live services (see the sketch below).
3. return_exceptions=True — one service failure should not abort the entire response.
4. Aggregate gracefully — build the best response possible with whatever data you have.
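Rule 2 is not shown in the handler above. One way to add per-service timeouts is a small wrapper around asyncio.wait_for; this is a sketch, with the helper name and timeout values taken from the rules of thumb above:

import asyncio

async def call_with_timeout(coro, timeout_s, fallback):
    # Degrade to a fallback value instead of letting one slow dependency
    # block the entire response
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        return fallback

# Inside the handler: ~1s budget for the live recommendation call,
# ~300ms for the cached promotions call
recommendations, promotions = await asyncio.gather(
    call_with_timeout(
        recommendation_service.get_recommendations(customer_id, query=user_message),
        timeout_s=1.0, fallback=get_fallback_recommendations()),
    call_with_timeout(
        promotions_service.get_active_promotions(category="manga"),
        timeout_s=0.3, fallback=[]),
)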
Step Functions alternative for complex flows: For workflows with conditional branching ("if user has Prime, also check Prime Reading"), AWS Step Functions provides a visual state machine. However, for the chatbot's real-time path, Step Functions adds ~100ms overhead — not appropriate for low-latency needs. Step Functions is better suited for async workflows (returns processing, RAG re-indexing).
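For those async paths, the Orchestrator can kick off an execution with a single boto3 call. A minimal sketch (the state machine ARN and payload are illustrative):

import json
import boto3

sfn = boto3.client("stepfunctions")

def start_return_workflow(order_id, asin):
    # Fire-and-forget: the chatbot responds immediately while the
    # multi-step returns workflow runs asynchronously
    sfn.start_execution(
        stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:ReturnsProcessing",
        input=json.dumps({"order_id": order_id, "asin": asin}),
    )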
Q39. Adding a new intent ("gift_wrapping") — walk through the full change
Short Answer
8 steps: training data + classifier retrain → Orchestrator routing → service integration → RAG chunks → system prompt → guardrails → analytics → feature flag rollout.
Deep Dive
Full change checklist:
Step 1: Intent Classifier Training Data
# Add new training examples to the labeled dataset
new_examples = [
{"text": "Can I add gift wrapping?", "label": "gift_wrapping"},
{"text": "I want to send this as a gift", "label": "gift_wrapping"},
{"text": "Do you offer gift packaging?", "label": "gift_wrapping"},
{"text": "Add a gift message to my order", "label": "gift_wrapping"},
# ... 50+ diverse examples
]
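After retraining, it's worth confirming the new label doesn't cannibalize neighboring intents (checkout_help and promotion_inquiry are plausible confusion targets). A sketch of a holdout check using scikit-learn, where classify() and holdout_set are assumed stand-ins for the retrained model and a labeled holdout split:

from sklearn.metrics import classification_report

y_true = [ex["label"] for ex in holdout_set]
y_pred = [classify(ex["text"]) for ex in holdout_set]

# Per-intent precision/recall; watch gift_wrapping recall and any
# precision drop on the intents it overlaps with
print(classification_report(y_true, y_pred))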
Step 2: Orchestrator Routing Rule
# In Orchestrator routing config
intent_routes = {
"product_discovery": ProductDiscoveryHandler,
"order_tracking": OrderTrackingHandler,
# ... existing routes ...
"gift_wrapping": GiftWrappingHandler, # NEW
}
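The Orchestrator's dispatch loop itself does not change. A sketch of how it might consume this map (handler interface as in Step 3 below, LLM fallback per Q11):

async def dispatch(intent, message, customer_id, context):
    handler_cls = intent_routes.get(intent)
    if handler_cls is None:
        # No registered handler: freeform LLM fallback (see Q11)
        return await llm_general_handler.handle(message, customer_id, context)
    return await handler_cls().handle(message, customer_id, context)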
Step 3: Service Integration
class GiftWrappingHandler:
async def handle(self, message, customer_id, cart_context):
# Check if gift wrapping is available for items in cart
wrapping_options = await gift_service.get_options(cart_context.asin_list)
if not wrapping_options:
return template_response("GIFT_NOT_AVAILABLE")
# Pass options to LLM for natural language presentation
return await llm_service.generate(
intent="gift_wrapping",
context={"options": wrapping_options},
user_message=message
)
Step 4: RAG Knowledge Base — Add FAQ Chunks
Q: How much does gift wrapping cost?
A: Gift wrapping is available for $4.99 per item. Premium gift boxes are $8.99.
Q: Can I add a message with gift wrapping?
A: Yes, you can include a personalized message (up to 150 characters) at checkout.
Q: Is gift wrapping available for digital items?
A: No, gift wrapping is only available for physical products.
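These chunks also have to be embedded and indexed before the RAG pipeline can retrieve them. A sketch of the ingestion step (embed() and vector_store are hypothetical stand-ins for the pipeline's actual embedding model and index):

gift_wrapping_chunks = [
    "Gift wrapping is available for $4.99 per item. Premium gift boxes are $8.99.",
    "You can include a personalized message (up to 150 characters) at checkout.",
    "Gift wrapping is only available for physical products, not digital items.",
]

for i, chunk in enumerate(gift_wrapping_chunks):
    vector_store.upsert(
        id=f"faq-gift-wrapping-{i}",
        vector=embed(chunk),  # hypothetical embedding call
        metadata={"source": "faq", "topic": "gift_wrapping"},
    )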
Step 5: System Prompt Update
You are MangaAssist. You help users with:
- Finding and discovering manga
- Product questions and recommendations
- Order tracking and returns
- Gift wrapping options and pricing ← ADD THIS
...
When discussing gift wrapping, always mention:
- Available options and pricing
- Message character limit
- Availability limitations (physical items only)
Step 6: Guardrail Rules
# Add gift wrapping specific guardrails
guardrails.add_rule(
category="price_accuracy",
check="response mentions gift wrapping price",
action="validate_against_gift_service_api" # Prevent hallucinated prices
)
guardrails.add_rule(
category="availability_accuracy",
check="response claims gift wrapping available",
action="verify_product_supports_gift_wrapping"
)
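The price_accuracy action could boil down to extracting dollar amounts from the draft response and checking each against the gift service. A sketch (the function name and rule wiring are assumptions):

import re

def prices_are_grounded(response_text: str, valid_prices: set[str]) -> bool:
    # Every dollar amount the LLM mentions must match a price returned
    # by the gift service; anything else is treated as a hallucination
    mentioned = set(re.findall(r"\$\d+(?:\.\d{2})?", response_text))
    return mentioned <= valid_prices

# prices_are_grounded("Wrapping is $4.99, boxes are $8.99", {"$4.99", "$8.99"}) -> True
# prices_are_grounded("Wrapping is $3.99", {"$4.99", "$8.99"}) -> False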
Step 7: Analytics Schema Update
-- Add new intent to analytics tracking
ALTER TABLE chatbot_events ADD COLUMN IF NOT EXISTS gift_wrapping_option_selected VARCHAR(50);
-- New metric: gift wrapping adoption rate
SELECT
    COUNT(CASE WHEN intent = 'gift_wrapping' THEN 1 END) * 1.0 / COUNT(*) AS gift_wrapping_rate,  -- * 1.0 avoids integer division
    COUNT(CASE WHEN intent = 'gift_wrapping' AND converted = true THEN 1 END) AS gift_wrapping_conversions
FROM chatbot_sessions;
Step 8: Feature Flag Rollout
# Launch with 1% traffic
feature_flags = {
"gift_wrapping_intent": {
"enabled": True,
"rollout_percentage": 1,
"rollout_groups": ["internal_employees"], # Test with employees first
}
}
- Day 1–3: internal employees only (the rollout_groups gate above). Validate end to end.
- Day 4–7: 1% of production traffic. Monitor classification accuracy and fallback rate.
- Day 8–14: 10% if metrics are healthy.
- Day 15+: Full rollout.
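For the percentage gate to be stable (a customer should not flip in and out of the feature mid-session), the flag check is typically a deterministic hash bucket. A minimal sketch against the flag config above (the helper is illustrative):

import hashlib

def flag_enabled(flag_name, customer_id, customer_groups=()):
    cfg = feature_flags.get(flag_name, {})
    if not cfg.get("enabled"):
        return False
    # Allowlisted groups (e.g. internal_employees) bypass the percentage dial
    if any(g in cfg.get("rollout_groups", []) for g in customer_groups):
        return True
    # Same customer always lands in the same bucket in [0, 100)
    digest = hashlib.sha256(f"{flag_name}:{customer_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < cfg.get("rollout_percentage", 0)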
What makes this architecture extensible?
The core Orchestrator doesn't change — it reads intent_routes from a config map. Adding a new intent is a configuration + handler addition, not a core refactor. The system was designed for this.
Interview point: When an interviewer asks "how extensible is this system?", this walkthrough demonstrates the answer: very. Adding an intent is a well-defined, repeatable process that any engineer can follow.