
HLD Deep Dive: Fault Tolerance & Reliability

Questions covered: Q17, Q34, Q36
Interviewer level: Staff Engineer → Solutions Architect


Q17. Order Service is down — user asks "Where is my order?"

Short Answer

Detect failure via timeout/error → return graceful degradation message → offer escalation → trigger CloudWatch alarm. Never show a raw error to the user.

Deep Dive

Detection: How does the Orchestrator know the Order Service is down?

import asyncio

async def query_order_service(customer_id: str, timeout_ms: int = 1000) -> OrderResult:
    try:
        result = await asyncio.wait_for(
            order_service.get_orders(customer_id),
            timeout=timeout_ms / 1000
        )
        return result

    except asyncio.TimeoutError:
        # Service is slow or frozen
        cloudwatch.put_metric("OrderServiceTimeout", 1)
        raise ServiceUnavailableError("order_service_timeout")

    except ServiceException as e:
        if e.status_code in [500, 502, 503]:
            cloudwatch.put_metric("OrderServiceError", 1)
            raise ServiceUnavailableError("order_service_error")
        raise  # Re-raise 4xx errors (client error, not service outage)

Orchestrator response when Order Service is unavailable:

async def handle_order_tracking(customer_id: str, message: str) -> Response:
    try:
        orders = await query_order_service(customer_id, timeout_ms=1000)
        return format_order_response(orders)

    except ServiceUnavailableError:
        # Log for ops visibility
        logger.warning("order_service_unavailable", extra={
            "customer_id": customer_id,
            "intent": "order_tracking"
        })

        # Graceful degradation — helpful, not an error page
        return Response(
            message=(
                "I'm having trouble accessing your order information right now. "
                "You can check your order status directly at: "
                "amazon.co.jp/orders. "
                "Or I can connect you with a support agent who can help immediately."
            ),
            actions=[
                {"type": "link", "label": "Check Orders Page", 
                 "url": "https://amazon.co.jp/gp/css/order-history"},
                {"type": "escalation", "label": "Talk to Support Agent"}
            ],
            degraded=True,
            degraded_reason="order_service_unavailable"
        )

What NOT to do:

❌ Show raw error:  "Error 503: Service Unavailable"
❌ Show nothing:   Silent failure
❌ Retry forever:  Latency balloons to 30s+ while the user waits (see bounded-retry sketch below)
❌ Fabricate data: "Your order shipped on March 20th" — completely wrong
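
The third pitfall deserves a concrete alternative. A minimal sketch, reusing query_order_service from above (the helper name is hypothetical): at most one jittered retry, then fail fast so the caller can degrade.

import asyncio
import random

async def query_with_bounded_retry(customer_id: str) -> OrderResult:
    for attempt in range(2):  # at most one retry, never an unbounded loop
        try:
            return await query_order_service(customer_id, timeout_ms=1000)
        except ServiceUnavailableError:
            if attempt == 0:
                # Short jittered backoff before the single retry
                await asyncio.sleep(0.1 + random.random() * 0.2)
    # Both attempts failed → let the caller return the degraded response
    raise ServiceUnavailableError("order_service_retries_exhausted")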

CloudWatch alarm that fires:

AlarmName: OrderServiceAvailability
MetricName: OrderServiceError + OrderServiceTimeout (sum)
Namespace: MangaAssist/ServiceHealth
Threshold: > 10 errors in 1 minute
Comparison: GreaterThanThreshold
EvaluationPeriods: 1
Actions:
  - PagerDuty: on-call engineer notification
  - SNS: "#manga-chatbot-ops" Slack channel
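
As a sketch, this alarm could be created with boto3 using metric math (the account ID and SNS topic ARN below are placeholders; PagerDuty and Slack both hang off the SNS topic):

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="OrderServiceAvailability",
    # Metric math: combine error and timeout counts into one evaluated series
    Metrics=[
        {"Id": "errors", "ReturnData": False, "MetricStat": {
            "Metric": {"Namespace": "MangaAssist/ServiceHealth",
                       "MetricName": "OrderServiceError"},
            "Period": 60, "Stat": "Sum"}},
        {"Id": "timeouts", "ReturnData": False, "MetricStat": {
            "Metric": {"Namespace": "MangaAssist/ServiceHealth",
                       "MetricName": "OrderServiceTimeout"},
            "Period": 60, "Stat": "Sum"}},
        {"Id": "total", "Expression": "errors + timeouts", "ReturnData": True},
    ],
    EvaluationPeriods=1,
    Threshold=10,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:ap-northeast-1:123456789012:manga-chatbot-ops"],
)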


Q34. Partial degradation — 2-3 services down simultaneously

Short Answer

Circuit breakers per service + criticality tiers + capability map that adjusts behavior based on available services.

Deep Dive

Service criticality matrix:

TIER 1 - CRITICAL (chatbot fails gracefully if down):
  Product Catalog       → Cannot show product info, but can still answer FAQ
  LLM / Bedrock         → Cannot generate responses, use templates only

TIER 2 - IMPORTANT (degraded experience if down):
  Recommendation Engine → Show popular/trending instead
  Order Service         → Direct user to orders page

TIER 3 - NICE-TO-HAVE (minor feature loss if down):
  Promotions Service    → Skip promotion nudges
  Returns Service       → Direct user to returns help page
  User Profile          → Treat as guest (no personalization)

TIER 4 - ASYNC (no real-time impact if down):
  Analytics / Kinesis   → Log locally, retry later
  Feedback Service      → Lose some feedback data
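
One way to make the tiers actionable in code: a config dict the Orchestrator can consult, rather than scattering criticality decisions through if-statements. A minimal sketch (service keys are illustrative, matching the circuit-breaker names below):

SERVICE_TIERS = {
    "catalog": 1, "llm": 1,                                      # critical
    "recommendations": 2, "order_service": 2,                    # important
    "promotions": 3, "returns_service": 3, "user_profile": 3,    # nice-to-have
    "analytics": 4, "feedback": 4,                               # async
}

def is_critical(service: str) -> bool:
    # Unknown services default to the lowest tier
    return SERVICE_TIERS.get(service, 4) == 1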

Circuit breaker per service (independent state):

class ServiceHealthManager:
    def __init__(self):
        self.circuit_breakers = {
            "catalog":          CircuitBreaker(threshold=5, timeout_s=30),
            "recommendations":  CircuitBreaker(threshold=5, timeout_s=30),
            "order_service":    CircuitBreaker(threshold=3, timeout_s=60),
            "returns_service":  CircuitBreaker(threshold=5, timeout_s=30),
            "promotions":       CircuitBreaker(threshold=10, timeout_s=15),
            "llm":              CircuitBreaker(threshold=3, timeout_s=120),
        }

    def get_capability_map(self) -> dict:
        """
        Returns a map of what the Orchestrator CAN do right now.
        Used to route requests to the best available path.
        """
        return {
            service: cb.state == "CLOSED"
            for service, cb in self.circuit_breakers.items()
        }
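
The CircuitBreaker class itself isn't shown in this section. A minimal sketch of the interface assumed above (CLOSED → OPEN → HALF_OPEN; not production-grade, e.g. no thread safety):

import time

class CircuitBreaker:
    def __init__(self, threshold: int, timeout_s: int):
        self.threshold = threshold   # consecutive failures before opening
        self.timeout_s = timeout_s   # how long to stay open before probing
        self.failures = 0
        self.opened_at = 0.0
        self.state = "CLOSED"

    def record_success(self):
        self.failures = 0
        self.state = "CLOSED"

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.state = "OPEN"
            self.opened_at = time.monotonic()

    def allow_request(self) -> bool:
        if self.state == "OPEN" and time.monotonic() - self.opened_at >= self.timeout_s:
            self.state = "HALF_OPEN"   # allow one probe request through
        return self.state != "OPEN"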

Capability-aware Orchestrator:

async def handle_message(intent: str, message: str, customer_id: str) -> Response:
    capabilities = health_manager.get_capability_map()

    if intent == "recommendation":
        if capabilities["recommendations"] and capabilities["catalog"]:
            # Full personalized recommendations
            return await full_recommendation_flow(customer_id, message)

        elif capabilities["catalog"]:
            # Recommendations down, but catalog OK → show popular items
            popular = await catalog.get_popular_manga(category="manga", limit=5)
            return format_popular_items_response(popular, degraded=True)

        else:
            # Both down
            return Response(
                message="I'm having trouble accessing our catalog right now. "
                        "Try browsing manga directly at amazon.co.jp/manga"
            )

    elif intent == "order_tracking":
        if capabilities["order_service"]:
            return await full_order_tracking_flow(customer_id)
        else:
            return order_service_degraded_response()  # Link to orders page

    elif intent == "faq":
        if capabilities["llm"]:
            return await rag_faq_flow(message)
        else:
            # LLM down → use pre-computed FAQ templates
            return await template_faq_flow(message)

Health check dashboard:

MangaAssist Service Health (updated every 30s)
─────────────────────────────────────────────────────────
Service              Status    Latency  Error Rate  Circuit
─────────────────────────────────────────────────────────
Product Catalog      🟢 OK     45ms     0.1%        CLOSED
Order Service        🔴 DOWN   ---      100%        OPEN
Returns Service      🟡 WARN   890ms    5%          CLOSED
Recommendations      🟢 OK     180ms    0.2%        CLOSED
Promotions           🟢 OK     30ms     0.0%        CLOSED
Bedrock LLM          🟢 OK     1,250ms  0.3%        CLOSED
DynamoDB             🟢 OK     5ms      0.0%        CLOSED
─────────────────────────────────────────────────────────
Capability Map:
  Order tracking:          ❌ Degraded (using fallback)
  Personalized recs:       ✅ Available
  FAQ:                     ✅ Available
  Return requests:         ⚠️  Degraded (slow)

Graceful degradation matrix:

Service Down                User Impact             Fallback Behavior
──────────────────────────────────────────────────────────────────────────
Recommendations only        Can't personalize       Show popular/trending manga
Catalog only                No product details      Text-only response + link
Order Service only          Can't track             Link to order page + escalation
Returns Service only        Can't process returns   Link to returns help + escalation
LLM / Bedrock               Limited NLU             Templates for known intents;
                                                    "I can't help with that" for open-ended
Recommendations + Catalog   Product flow broken     Full fallback to search + FAQ
LLM + Catalog               Severely degraded       FAQ + order tracking only (both by template)

Q36. Meeting 99.95% availability SLA

Short Answer

Multi-AZ deployment + DynamoDB global tables + circuit breakers + chaos engineering + runbooks per failure mode. Bedrock dependency is the biggest single risk.

Deep Dive

What 99.95% means:

99.95% uptime per month:
  100% − 99.95% = 0.05% allowed downtime
  0.05% × (30 days × 24 h × 60 min) = 0.0005 × 43,200 min = 21.6 minutes downtime/month maximum
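
The same arithmetic for neighboring SLA tiers shows how much each extra nine costs:

# Downtime budgets for a 30-day month at common SLA tiers
for sla in (0.999, 0.9995, 0.9999):
    minutes = 30 * 24 * 60 * (1 - sla)
    print(f"{sla:.2%} → {minutes:.1f} min/month")
# 99.90% → 43.2 min/month
# 99.95% → 21.6 min/month
# 99.99% → 4.3 min/month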

1. Multi-AZ deployment:

ECS Fargate:
  Tasks spread across 3 AZs: ap-northeast-1a, 1b, 1c
  ALB routes to healthy tasks only
  If one AZ fails: remaining 2 AZs serve all traffic (capacity pre-provisioned for this)

DynamoDB:
  Native multi-AZ replication (same region)
  3 replicas — survives 1 AZ failure with zero impact

OpenSearch Serverless:
  Managed by AWS — multi-AZ by default

ElastiCache Redis:
  Multi-AZ with automatic failover
  Replica in each AZ — promotes to primary in <1 minute

2. DynamoDB Global Tables for cross-region resilience:

Regions: ap-northeast-1 (primary), us-east-1 (DR)

Global Tables replicate writes with ~100ms cross-region lag.
If ap-northeast-1 has a full regional outage (extremely rare):
  Route 53 health check detects failure in ~30s
  DNS switches to us-east-1 within ~60s
  Total failover time: ~2 minutes

Regional DR is not how the monthly 21.6-minute budget is met;
it is insurance against the rare multi-hour regional outage.
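
A sketch of the Route 53 health check behind that failover, via boto3 (the caller reference and domain are placeholders): a 10-second interval with a failure threshold of 3 gives the ~30s detection quoted above.

import boto3

route53 = boto3.client("route53")

route53.create_health_check(
    CallerReference="manga-chatbot-primary-check",   # must be unique per request
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "chat.example.com",   # placeholder endpoint
        "ResourcePath": "/health",
        "RequestInterval": 10,    # seconds between checks
        "FailureThreshold": 3,    # 3 misses ≈ 30s to mark unhealthy
    },
)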

3. The Bedrock dependency — biggest risk:

Bedrock is a shared AWS service. An AWS-wide Bedrock outage would impact us.
Historical AWS service availability for managed services: >99.9%.

Mitigation:
  1. Fallback to template responses for known intents
     (covers ~60% of traffic without LLM; sketched below)
  2. Pre-generate "canned" responses for top-100 FAQ questions
  3. If LLM is down >5 minutes: proactively show maintenance banner
     so users don't think the chatbot is broken — it's partially available
  4. Monitor Bedrock's own status page via Lambda pinger
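
A minimal sketch of the template fallback from item 1 (intent keys and response copy are illustrative):

TEMPLATE_RESPONSES = {
    "order_tracking": "You can check your order status at amazon.co.jp/orders.",
    "return_policy": ("Most items can be returned within 30 days. "
                      "See amazon.co.jp/returns for details."),
}

def template_response(intent: str) -> str:
    # Known intent → canned answer; anything open-ended → honest handoff
    return TEMPLATE_RESPONSES.get(
        intent,
        "I can't help with that right now. Would you like me to "
        "connect you with a support agent?",
    )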

4. Circuit breakers prevent cascading failures:

Without circuit breakers:
  Order Service is slow (3s response vs 1s normal)
  Every Orchestrator thread waits 3s for Order Service
  Thread pool exhaustion → Orchestrator itself becomes unavailable

  Result: One slow downstream service takes down the entire chatbot.

With circuit breakers:
  Order Service is slow → circuit opens after 5 failures
  Orchestrator routes around Order Service (degraded mode, instant)
  No thread exhaustion
  Chatbot stays available at 99.95%+, just with degraded order tracking
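
In code, "routes around" is a guard before every downstream call. A sketch reusing the CircuitBreaker interface from Q34 (the helper itself is hypothetical):

import asyncio

async def call_with_breaker(name: str, coro_factory, fallback, timeout_s: float = 1.0):
    cb = health_manager.circuit_breakers[name]
    if not cb.allow_request():
        return fallback()   # circuit open → instant degraded response, no waiting
    try:
        result = await asyncio.wait_for(coro_factory(), timeout=timeout_s)
        cb.record_success()
        return result
    except (asyncio.TimeoutError, ServiceUnavailableError):
        cb.record_failure()
        return fallback()   # fail fast instead of tying up the task pool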

5. Health checks with automated recovery:

# ALB health check configuration
HealthCheckProtocol: HTTP
HealthCheckPath: /health
HealthCheckIntervalSeconds: 10
HealthyThresholdCount: 2      # Mark healthy after 2 successful checks
UnhealthyThresholdCount: 3    # Mark unhealthy after 3 failures (30s)

# Health check endpoint (FastAPI)
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/health")
async def health_check():
    # Check critical dependencies
    dynamo_ok = await check_dynamo_latency()
    redis_ok = await check_redis_connection()

    if dynamo_ok and redis_ok:
        return {"status": "healthy"}
    else:
        return JSONResponse(
            status_code=503,
            content={"status": "unhealthy", "dynamo": dynamo_ok, "redis": redis_ok}
        )
# ALB automatically removes unhealthy tasks from rotation

6. Chaos engineering:

Monthly game days:
  - Kill one ECS task → verify auto-recovery within 60s (FIS sketch below)
  - Inject DynamoDB latency (200ms) → verify circuit breaker activates
  - Block Order Service → verify graceful degradation
  - Kill one AZ → verify multi-AZ failover
  - Saturate LLM concurrency → verify queue + backoff behavior

Tools: AWS Fault Injection Simulator (FIS)
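
As an illustration, the "kill one ECS task" game day could be defined as an FIS experiment template via boto3 (role ARN, tags, and names are placeholders; the exact target/action fields should be checked against the FIS docs):

import boto3

fis = boto3.client("fis")

fis.create_experiment_template(
    clientToken="manga-gameday-stop-task",
    description="Game day: stop one chatbot task, verify auto-recovery within 60s",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",
    targets={
        "chatbot-tasks": {
            "resourceType": "aws:ecs:task",
            "resourceTags": {"service": "manga-chatbot"},
            "selectionMode": "COUNT(1)",   # pick exactly one task
        }
    },
    actions={
        "stop-one-task": {
            "actionId": "aws:ecs:stop-task",
            "targets": {"Tasks": "chatbot-tasks"},
        }
    },
    stopConditions=[{"source": "none"}],   # game day is manually supervised
)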

7. Runbooks for each failure mode:

Runbook: Order Service Unavailable
  Severity: P2
  Detection: OrderServiceCircuitOpen CloudWatch alarm
  Impact: Users cannot track orders (~20% of traffic affected)
  Immediate action:
    1. Verify order_service circuit is open (dashboard link)
    2. Check Order Service team's status page
    3. If outage expected >15min: post banner on chatbot
    4. Page Order Service on-call if no acknowledgment
  Recovery:
    1. Order Service recovers → circuit auto-closes after probe
    2. Verify order tracking working with test account
    3. Post resolution in incident channel

Availability budget allocation:

Component           Target   Comment
──────────────────────────────────────────────────────────────────
API Gateway         99.99%   AWS managed
ECS Orchestrator    99.95%   Multi-AZ, auto-recovery
DynamoDB            99.99%   AWS multi-AZ
Bedrock LLM         99.9%    AWS managed, fallback available
OpenSearch          99.9%    AWS managed
End-to-end system   99.95%   Circuit breakers hide individual failures
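
Why the last row isn't a contradiction: composing all five component targets serially would land well below 99.95%. The fallbacks are what keep the soft dependencies out of the serial chain:

# Naive serial composition of the five component targets
targets = [0.9999, 0.9995, 0.9999, 0.999, 0.999]
availability = 1.0
for a in targets:
    availability *= a
print(f"{availability:.4%}")   # ≈ 99.73%, below the SLA if every outage counted

# Circuit breakers + fallbacks remove Bedrock and OpenSearch from the
# serial chain: their outages degrade features instead of availability.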