HLD Deep Dive: Fault Tolerance & Reliability
Questions covered: Q17, Q23, Q34, Q36
Interviewer level: Staff Engineer → Solutions Architect
Q17. Order Service is down — user asks "Where is my order?"
Short Answer
Detect failure via timeout/error → return graceful degradation message → offer escalation → trigger CloudWatch alarm. Never show a raw error to the user.
Deep Dive
Detection: How does the Orchestrator know the Order Service is down?
import asyncio

# Assumes module-level clients: order_service (internal API client) and
# cloudwatch (thin metrics wrapper) are initialized elsewhere.
async def query_order_service(customer_id: str, timeout_ms: int = 1000) -> OrderResult:
    try:
        result = await asyncio.wait_for(
            order_service.get_orders(customer_id),
            timeout=timeout_ms / 1000
        )
        return result
    except asyncio.TimeoutError:
        # Service is slow or frozen
        cloudwatch.put_metric("OrderServiceTimeout", 1)
        raise ServiceUnavailableError("order_service_timeout")
    except ServiceException as e:
        if e.status_code in (500, 502, 503):
            cloudwatch.put_metric("OrderServiceError", 1)
            raise ServiceUnavailableError("order_service_error")
        raise  # Re-raise 4xx errors (client error, not service outage)
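The cloudwatch.put_metric helper above is not defined in this doc; a minimal sketch of how it could map onto boto3's put_metric_data (the namespace is taken from the alarm spec below, everything else is illustrative):
import boto3

_cw = boto3.client("cloudwatch")

def put_metric(metric_name: str, value: float) -> None:
    # Emit a single count datapoint into the namespace the availability
    # alarm below watches.
    _cw.put_metric_data(
        Namespace="MangaAssist/ServiceHealth",
        MetricData=[{"MetricName": metric_name, "Value": value, "Unit": "Count"}],
    )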
Orchestrator response when Order Service is unavailable:
async def handle_order_tracking(customer_id: str, message: str) -> Response:
    try:
        orders = await query_order_service(customer_id, timeout_ms=1000)
        return format_order_response(orders)
    except ServiceUnavailableError:
        # Log for ops visibility
        logger.warning("order_service_unavailable", extra={
            "customer_id": customer_id,
            "intent": "order_tracking"
        })
        # Graceful degradation — helpful, not an error page
        return Response(
            message=(
                "I'm having trouble accessing your order information right now. "
                "You can check your order status directly at: "
                "amazon.co.jp/orders. "
                "Or I can connect you with a support agent who can help immediately."
            ),
            actions=[
                {"type": "link", "label": "Check Orders Page",
                 "url": "https://amazon.co.jp/gp/css/order-history"},
                {"type": "escalation", "label": "Talk to Support Agent"}
            ],
            degraded=True,
            degraded_reason="order_service_unavailable"
        )
What NOT to do:
❌ Show raw error: "Error 503: Service Unavailable"
❌ Show nothing: Silent failure
❌ Retry forever: Causes latency to balloon to 30s+ while user waits
❌ Fabricate data: "Your order shipped on March 20th" — completely wrong
CloudWatch alarm that fires:
AlarmName: OrderServiceAvailability
Metric: SUM(OrderServiceError) + SUM(OrderServiceTimeout)   # metric-math expression over both metrics
Namespace: MangaAssist/ServiceHealth
Period: 60 seconds
Threshold: 10
ComparisonOperator: GreaterThanThreshold
EvaluationPeriods: 1
Actions:
  - PagerDuty: on-call engineer notification
  - SNS: "#manga-chatbot-ops" Slack channel
Q34. Partial degradation — 2-3 services down simultaneously
Short Answer
Circuit breakers per service + criticality tiers + capability map that adjusts behavior based on available services.
Deep Dive
Service criticality matrix:
TIER 1 - CRITICAL (core capability lost; chatbot must fail gracefully):
  Product Catalog → Cannot show product info, but can still answer FAQ
  LLM / Bedrock → Cannot generate responses; use templates only
TIER 2 - IMPORTANT (degraded experience if down):
  Recommendation Engine → Show popular/trending instead
  Order Service → Direct user to orders page
TIER 3 - NICE-TO-HAVE (minor feature loss if down):
  Promotions Service → Skip promotion nudges
  Returns Service → Direct user to returns help page
  User Profile → Treat as guest (no personalization)
TIER 4 - ASYNC (no real-time user impact if down):
  Analytics / Kinesis → Log locally, retry later
  Feedback Service → Lose some feedback data
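One way to make the matrix executable is a plain lookup table that drives alert severity and fallback choice; the names and paging policy here are illustrative, not part of the design:
# Tier per downstream service (keys match the circuit-breaker map below,
# plus the Tier 3/4 services it doesn't track).
SERVICE_TIERS = {
    "catalog": 1, "llm": 1,
    "recommendations": 2, "order_service": 2,
    "promotions": 3, "returns_service": 3, "user_profile": 3,
    "analytics": 4, "feedback": 4,
}

# Illustrative paging policy: page on-call only for Tier 1/2 outages.
def page_severity(service: str) -> str:
    tier = SERVICE_TIERS.get(service, 3)
    return {1: "P1", 2: "P2", 3: "P3", 4: "P4"}[tier]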
Circuit breaker per service (independent state):
class ServiceHealthManager:
    def __init__(self):
        self.circuit_breakers = {
            "catalog": CircuitBreaker(threshold=5, timeout_s=30),
            "recommendations": CircuitBreaker(threshold=5, timeout_s=30),
            "order_service": CircuitBreaker(threshold=3, timeout_s=60),
            "returns_service": CircuitBreaker(threshold=5, timeout_s=30),
            "promotions": CircuitBreaker(threshold=10, timeout_s=15),
            "llm": CircuitBreaker(threshold=3, timeout_s=120),
        }

    def get_capability_map(self) -> dict:
        """
        Returns a map of what the Orchestrator CAN do right now.
        Used to route requests to the best available path.
        """
        # OPEN and HALF_OPEN circuits both count as "unavailable" for routing;
        # half-open probe traffic happens inside the breaker itself.
        return {
            service: cb.state == "CLOSED"
            for service, cb in self.circuit_breakers.items()
        }
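The CircuitBreaker class referenced above is not shown in this doc; a minimal async sketch of the usual CLOSED → OPEN → HALF_OPEN lifecycle (constructor parameters match the usage above, everything else is an assumption):
import time

class CircuitBreaker:
    def __init__(self, threshold: int, timeout_s: int):
        self.threshold = threshold    # consecutive failures before opening
        self.timeout_s = timeout_s    # how long to stay OPEN before probing
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    async def call(self, coro_fn, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.timeout_s:
                self.state = "HALF_OPEN"   # allow a single probe request
            else:
                raise ServiceUnavailableError("circuit_open")
        try:
            result = await coro_fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.threshold:
                self.state = "OPEN"
                self.opened_at = time.monotonic()
            raise
        else:
            self.failures = 0
            self.state = "CLOSED"
            return result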
Capability-aware Orchestrator:
async def handle_message(intent: str, message: str, customer_id: str) -> Response:
    capabilities = health_manager.get_capability_map()

    if intent == "recommendation":
        if capabilities["recommendations"] and capabilities["catalog"]:
            # Full personalized recommendations
            return await full_recommendation_flow(customer_id, message)
        elif capabilities["catalog"]:
            # Recommendations down, but catalog OK → show popular items
            popular = await catalog.get_popular_manga(category="manga", limit=5)
            return format_popular_items_response(popular, degraded=True)
        else:
            # Both down
            return Response(
                message="I'm having trouble accessing our catalog right now. "
                        "Try browsing manga directly at amazon.co.jp/manga"
            )

    elif intent == "order_tracking":
        if capabilities["order_service"]:
            return await full_order_tracking_flow(customer_id)
        else:
            return order_service_degraded_response()  # Link to orders page

    elif intent == "faq":
        if capabilities["llm"]:
            return await rag_faq_flow(message)
        else:
            # LLM down → use pre-computed FAQ templates (sketched below)
            return await template_faq_flow(message)
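template_faq_flow is not defined anywhere in this doc; a minimal keyword-matching sketch, where the template table and matching rule are assumptions rather than the actual implementation:
# Pre-computed answers for common FAQ topics, served when the LLM circuit is
# open. Keys and wording here are illustrative.
FAQ_TEMPLATES = {
    "shipping": "Standard shipping within Japan usually takes 1-3 business days. "
                "See amazon.co.jp/help for details.",
    "payment": "We accept credit cards, convenience-store payment, and Amazon gift cards.",
    "return": "Most items can be returned within 30 days. See the returns help page.",
}

async def template_faq_flow(message: str) -> Response:
    text = message.lower()
    for keyword, answer in FAQ_TEMPLATES.items():
        if keyword in text:
            return Response(message=answer, degraded=True,
                            degraded_reason="llm_unavailable")
    # No template matches → be honest rather than guessing
    return Response(
        message="I can't answer that right now. Would you like to talk to a support agent?",
        actions=[{"type": "escalation", "label": "Talk to Support Agent"}],
        degraded=True, degraded_reason="llm_unavailable",
    )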
Health check dashboard:
MangaAssist Service Health (updated every 30s)
──────────────────────────────────────────────────────────
Service            Status     Latency    Error Rate   Circuit
──────────────────────────────────────────────────────────
Product Catalog    🟢 OK       45ms       0.1%         CLOSED
Order Service      🔴 DOWN     ---        100%         OPEN
Returns Service    🟡 WARN     890ms      5%           CLOSED
Recommendations    🟢 OK       180ms      0.2%         CLOSED
Promotions         🟢 OK       30ms       0.0%         CLOSED
Bedrock LLM        🟢 OK       1,250ms    0.3%         CLOSED
DynamoDB           🟢 OK       5ms        0.0%         CLOSED
──────────────────────────────────────────────────────────
Capability Map:
  Order tracking:     ❌ Degraded (using fallback)
  Personalized recs:  ✅ Available
  FAQ:                ✅ Available
  Return requests:    ⚠️ Degraded (slow)
Graceful degradation matrix:
| Service Down | User Impact | Fallback Behavior |
|---|---|---|
| Recommendations only | Can't personalize | Show popular/trending manga |
| Catalog only | No product details | Text-only response + link |
| Order Service only | Can't track | Link to order page + escalation |
| Returns Service only | Can't process returns | Link to returns help + escalation |
| LLM / Bedrock | Limited NLU | Templates for known intents, "I can't help with that" for open-ended |
| Recommendations + Catalog | Product flow broken | Full fallback to search + FAQ |
| LLM + Catalog | Severely degraded | FAQ + order tracking only (both by template) |
Q36. Meeting 99.95% availability SLA
Short Answer
Multi-AZ deployment + DynamoDB global tables + circuit breakers + chaos engineering + runbooks per failure mode. Bedrock dependency is the biggest single risk.
Deep Dive
What 99.95% means:
99.95% uptime per month:
  100% - 99.95% = 0.05% allowed downtime
  0.05% × 30 days × 24 h × 60 min = 21.6 minutes of downtime per month, maximum
1. Multi-AZ deployment:
ECS Fargate:
  Tasks spread across 3 AZs: ap-northeast-1a, 1b, 1c
  ALB routes to healthy tasks only
  If one AZ fails: the remaining 2 AZs serve all traffic (capacity is pre-provisioned for this)
DynamoDB:
  Native multi-AZ replication (same region)
  3 replicas — survives a single-AZ failure with zero impact
OpenSearch Serverless:
  Managed by AWS — multi-AZ by default
ElastiCache Redis:
  Multi-AZ with automatic failover
  Replica in each AZ — a replica promotes to primary in <1 minute
2. DynamoDB Global Tables for cross-region resilience:
Regions: ap-northeast-1 (primary), us-east-1 (DR)
Global Tables replicate writes cross-region, typically with under a second of lag.
If ap-northeast-1 has a full regional outage (extremely rare):
  Route 53 health check detects the failure in ~30s
  DNS switches to us-east-1 within ~60s
  Total failover time: ~2 minutes
A ~2-minute failover fits comfortably inside the monthly 21.6-minute budget;
regional DR is insurance against multi-hour regional outages, which would
otherwise blow that budget entirely.
3. The Bedrock dependency — biggest risk:
Bedrock is a shared AWS service. An AWS-wide Bedrock outage would impact us.
Historical AWS service availability for managed services: >99.9%.
Mitigation:
1. Fall back to template responses for known intents
   (covers ~60% of traffic without the LLM)
2. Pre-generate "canned" responses for the top-100 FAQ questions
3. If the LLM is down >5 minutes: proactively show a maintenance banner
   so users don't think the chatbot is broken — it's partially available
   (see the sketch after this list)
4. Monitor Bedrock's own status page via a Lambda pinger
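Item 3 above can hang off the circuit breaker state; a small sketch, assuming the breaker records when it opened (as in the CircuitBreaker sketch earlier) and that the chat UI already knows how to render a banner flag:
import time

LLM_BANNER_AFTER_S = 300  # show banner once the LLM circuit has been open 5+ minutes

def should_show_llm_banner(health_manager) -> bool:
    cb = health_manager.circuit_breakers["llm"]
    return (
        cb.state == "OPEN"
        and time.monotonic() - cb.opened_at >= LLM_BANNER_AFTER_S
    )

# The Orchestrator attaches this flag to every response; the chat UI renders
# "Some features are temporarily limited" while it is true (UI behavior assumed).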
4. Circuit breakers prevent cascading failures:
Without circuit breakers:
  Order Service is slow (3s responses vs the normal 1s)
  Every Orchestrator worker waits 3s for Order Service
  Thread pool exhaustion → the Orchestrator itself becomes unavailable
  Result: one slow downstream service takes down the entire chatbot.
With circuit breakers:
  Order Service is slow → circuit opens after 3 consecutive failures (its configured threshold)
  Orchestrator routes around Order Service (degraded mode, fails fast)
  No thread exhaustion
  Chatbot stays available at 99.95%+, just with degraded order tracking
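Concretely, the fast-fail path is just the breaker wrapping every downstream call; a usage sketch combining the CircuitBreaker sketch above with query_order_service (assumed wiring, not the actual Orchestrator code):
order_breaker = health_manager.circuit_breakers["order_service"]

async def tracked_order_lookup(customer_id: str) -> OrderResult:
    # When the circuit is OPEN this raises immediately instead of tying up a
    # worker for the full 1s timeout (or a 3s slow response).
    return await order_breaker.call(query_order_service, customer_id, timeout_ms=1000)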
5. Health checks with automated recovery:
# ALB health check configuration
HealthCheckProtocol: HTTP
HealthCheckPath: /health
HealthCheckIntervalSeconds: 10
HealthyThresholdCount: 2 # Mark healthy after 2 successful checks
UnhealthyThresholdCount: 3 # Mark unhealthy after 3 failures (30s)
# Health check endpoint (FastAPI)
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/health")
async def health_check():
    # Check critical dependencies
    dynamo_ok = await check_dynamo_latency()
    redis_ok = await check_redis_connection()
    if dynamo_ok and redis_ok:
        return {"status": "healthy"}
    else:
        return JSONResponse(
            status_code=503,
            content={"status": "unhealthy", "dynamo": dynamo_ok, "redis": redis_ok}
        )

# ALB automatically removes unhealthy tasks from rotation
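check_dynamo_latency and check_redis_connection aren't defined here; lightweight sketches, where the table name, client objects, and thresholds are all assumptions:
import time

async def check_dynamo_latency(max_ms: int = 50) -> bool:
    # Cheap read of a known key; healthy if DynamoDB answers within max_ms.
    start = time.monotonic()
    try:
        await dynamo.get_item(TableName="chat_sessions",
                              Key={"pk": {"S": "health#probe"}})
    except Exception:
        return False
    return (time.monotonic() - start) * 1000 <= max_ms

async def check_redis_connection() -> bool:
    try:
        return bool(await redis_client.ping())
    except Exception:
        return False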
6. Chaos engineering:
Monthly game days:
- Kill one ECS task → verify auto-recovery within 60s
- Inject DynamoDB latency (200ms) → verify circuit breaker activates
- Block Order Service → verify graceful degradation
- Kill one AZ → verify multi-AZ failover
- Saturate LLM concurrency → verify queue + backoff behavior
Tools: AWS Fault Injection Simulator (FIS)
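The first game-day check ("kill one ECS task, verify auto-recovery within 60s") can also be scripted directly; a sketch using boto3 (cluster and service names are placeholders):
import time
import boto3

ecs = boto3.client("ecs")
CLUSTER, SERVICE = "manga-chatbot", "orchestrator"  # placeholder names

def kill_one_task_and_wait(recovery_s: int = 60) -> bool:
    tasks = ecs.list_tasks(cluster=CLUSTER, serviceName=SERVICE)["taskArns"]
    desired = len(tasks)
    ecs.stop_task(cluster=CLUSTER, task=tasks[0], reason="chaos game day")

    deadline = time.time() + recovery_s
    while time.time() < deadline:
        running = ecs.list_tasks(cluster=CLUSTER, serviceName=SERVICE,
                                 desiredStatus="RUNNING")["taskArns"]
        if len(running) >= desired:
            return True   # a replacement task is running within the window
        time.sleep(5)
    return False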
7. Runbooks for each failure mode:
Runbook: Order Service Unavailable
Severity: P2
Detection: OrderServiceCircuitOpen CloudWatch alarm
Impact: Users cannot track orders (~20% of traffic affected)
Immediate action:
1. Verify order_service circuit is open (dashboard link)
2. Check Order Service team's status page
3. If outage expected >15min: post banner on chatbot
4. Page Order Service on-call if no acknowledgment
Recovery:
1. Order Service recovers → circuit auto-closes after probe
2. Verify order tracking works with a test account
3. Post resolution in incident channel
Availability budget allocation:
| Component | Target | Comment |
|---|---|---|
| API Gateway | 99.99% | AWS managed |
| ECS Orchestrator | 99.95% | Multi-AZ, auto-recovery |
| DynamoDB | 99.99% | AWS multi-AZ |
| Bedrock LLM | 99.9% | AWS managed, fallback available |
| OpenSearch | 99.9% | AWS managed |
| End-to-end system | 99.95% | Circuit breakers hide individual failures |
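A rough sanity check on the end-to-end row (back-of-envelope, assuming independent failures):
If every component were a hard dependency:
  99.99% × 99.95% × 99.99% × 99.9% × 99.9% ≈ 99.73%  → misses the SLA badly
With Bedrock and OpenSearch as soft dependencies (template fallbacks keep responses flowing):
  99.99% × 99.95% × 99.99% ≈ 99.93%
The remaining gap is covered by the fact that the listed targets are floors (the AWS-managed pieces historically run well above them) and that degraded-mode responses still count as "available" for the SLA.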