LOCAL PREVIEW View on GitHub

Scale, Failures, And Architecture Evolution

Covers Q31, Q41, Q43, Q45, Q48, Q50.

What The Interviewer Is Testing

  • Whether you can reason about blast radius, redundancy, and degraded modes.
  • Whether you can evolve the system without rewriting every layer.
  • Whether you can critique the design before launch with specific improvements.

Deep Dive

Failure Thinking

The mature answer is never "make everything HA." It is:

  • identify the most likely failure domains
  • define degraded but usable behavior
  • reduce blast radius through abstraction boundaries
  • instrument the system to detect and recover fast

Likely Critical Failure Domains

  • LLM backend availability
  • conversation state store availability or latency
  • intent-routing dependency availability
  • stale or broken retrieval index
  • guardrail regressions that block good traffic or miss bad traffic

Backend Swap Blast Radius

If the system changes from Claude on Bedrock to an internal model, the biggest concerns are:

  • prompt compatibility
  • latency and throughput differences
  • output-format reliability
  • guardrail retuning

The clean contract to protect is the structured response shape. If that remains stable, the frontend and many downstream systems stay insulated.

Multilingual Evolution

The smart architectural move is to keep orchestration, session state, and API contracts language-neutral while making NLP, retrieval, and safety components language-aware.

Pre-Launch Review Lens

When asked for improvements before launch, prioritize:

  • component latency budgets
  • fallback and circuit-breaker behavior
  • prompt versioning
  • semantic or exact-match caching where safe
  • explicit load-test evidence

Strong Answer Pattern

  • "I would define graceful degradation before adding more redundancy."
  • "I want abstraction around the model boundary so backend changes do not leak everywhere."
  • "Before launch, I want evidence for latency, scale, and fallback behavior."

Scenario 1: Bedrock Region Outage

Primary Prompt

The Bedrock region serving generation is unavailable. How should the system respond in the first five minutes?

Follow-Up 1

What is the fastest degraded mode that still helps the user?

Follow-Up 2

Would you fail over cross-region automatically or gate it behind a feature flag?

Follow-Up 3

What metrics and alerts would confirm the failover is actually working?

Strong Answer Markers

  • Uses regional failover or controlled fallback.
  • Describes a non-LLM degraded mode for common intents.
  • Talks about error rate, latency, and completion-rate confirmation.

Scenario 2: Replace The LLM Backend

Primary Prompt

Leadership wants to move from Bedrock-hosted Claude to an internally trained model. What changes and what stays stable?

Follow-Up 1

Which abstraction protects the frontend and guardrails from major churn?

Follow-Up 2

What revalidation is mandatory before cutover?

Follow-Up 3

How would you phase the migration to reduce blast radius?

Strong Answer Markers

  • Anchors on structured output contract stability.
  • Mentions prompt retuning, latency benchmarking, and guardrail recalibration.
  • Uses shadow traffic or limited rollout before full cutover.

Scenario 3: Five Improvements Before Launch

Primary Prompt

You are reviewing the LLD one week before launch. What are the five highest-value improvements?

Follow-Up 1

Which of those are must-have blockers versus good follow-on work?

Follow-Up 2

How would you justify the ranking to engineering leadership?

Follow-Up 3

What would you explicitly defer from the roadmap?

Strong Answer Markers

  • Gives concrete, not generic, improvements.
  • Prioritizes based on user harm, launch risk, and operational readiness.
  • Defers nice-to-have complexity instead of bloating the launch plan.

Red Flags

  • Saying redundancy alone solves availability.
  • Treating a model swap as only an endpoint change.
  • Offering vague launch improvements like "more testing" without specifying what kind.
  • Ignoring degraded-mode UX.

Two-Minute Whiteboard Version

Draw three columns:

  1. Failure domain.
  2. Immediate fallback.
  3. Longer-term architectural improvement.