
07 — Failure Handling

Most of the architecture is about what happens when things go wrong. Happy paths are short; failure paths are most of the engineering. This document is the failure taxonomy: it inventories the ways MangaAssist fails — in production, under load, during deploys, during outages — and the defense mechanisms that catch each class.


Failure taxonomy

Failures cluster into seven categories. Each has a different detection signal, a different recovery path, and a different blast radius.

| Category | What goes wrong | Detection signal | Worst-case impact |
| --- | --- | --- | --- |
| Tool unavailability | MCP server down or unreachable | HTTP 5xx / connection error | One tool unusable; degraded answer |
| Tool malformed response | MCP returns garbage | JSON parse fails / schema mismatch | One tool result dropped |
| Tool empty result | MCP returns no data | Empty list in response | LLM falls back to "I don't know" |
| Inference failure | Bedrock 5xx / timeout | Bedrock client error | No answer; full template fallback |
| Inference garbage | Claude produces nonsense | Guardrail post-validation fails | Bad answer if not caught |
| Loop runaway | Agent loops without converging | Iteration count or wall-clock | User waits; eventually killed by a stopping condition |
| Session corruption | DynamoDB / Redis state inconsistent | Schema validation on read | Wrong context loaded; wrong answer |

Defense layer 1: Circuit breakers

Each MCP server is fronted by a circuit breaker maintained by the Orchestrator's Lambda runtime.

        Closed
        ↓ (5 failures in 60s)
        Open  ────── (30s cooldown) ──────►  Half-Open
        ↑                                      ↓
        └────── (1 success closes) ◄───── (probe request)

States:

- Closed — normal operation, all requests pass through
- Open — all requests to this MCP short-circuit immediately; agent skips the tool and degrades
- Half-Open — exactly one probe request allowed; success closes, failure re-opens

The 5-failures-in-60s threshold is configurable per MCP. Catalog Search (high QPS, mostly cached) gets a tighter threshold. Order Inventory (lower QPS, transactional) gets a looser one.
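
A minimal sketch of that state machine, assuming a single-threaded, in-memory implementation per Lambda runtime (names are illustrative, not the production code):

import time

class CircuitBreaker:
    """Per-MCP breaker: Closed -> Open -> Half-Open, as diagrammed above."""

    def __init__(self, failure_threshold=5, window_s=60.0, cooldown_s=30.0):
        self.failure_threshold = failure_threshold   # configurable per MCP
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.failures = []        # timestamps of recent infra failures
        self.state = "closed"
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.cooldown_s:
                self.state = "half_open"   # cooldown elapsed: allow one probe
                return True
            return False                   # short-circuit: skip this tool
        return True                        # closed (or the half-open probe)

    def record_success(self) -> None:
        self.failures.clear()
        self.state = "closed"              # one success closes

    def record_failure(self) -> None:
        now = time.monotonic()
        if self.state == "half_open":
            self.state, self.opened_at = "open", now   # probe failed: re-open
            return
        self.failures = [t for t in self.failures if now - t < self.window_s]
        self.failures.append(now)
        if len(self.failures) >= self.failure_threshold:
            self.state, self.opened_at = "open", now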

Failure attribution

The breaker counts only infrastructure failures (5xx, timeouts, connection refused). It does not count:

- 4xx errors (bad input — caller's fault)
- Empty results (legitimate "no match")
- Schema validation failures on output (might be a bad chunk, not server outage)

This separation matters in both directions. If 4xx errors counted, a wave of bad inputs (the caller's fault) could trip breakers against a perfectly healthy MCP; if output-schema failures counted, a deploy that introduces a new schema-validation bug could trip every breaker simultaneously and brown out the whole chatbot.
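
A sketch of that attribution rule (the outcome type is illustrative):

from dataclasses import dataclass

@dataclass
class ToolOutcome:
    kind: str       # "http", "timeout", "connection_refused", "schema", "empty"
    status: int = 0

def counts_toward_breaker(outcome: ToolOutcome) -> bool:
    """Only infrastructure failures feed the breaker's failure window."""
    if outcome.kind == "http":
        return outcome.status >= 500       # 4xx is the caller's fault
    # "schema" and "empty" fall outside the breaker: a bad chunk or a
    # legitimate no-match is not evidence of a server outage.
    return outcome.kind in ("timeout", "connection_refused")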


Defense layer 2: Retries

Limited and shaped:

| Operation | Max retries | Backoff | Idempotent? |
| --- | --- | --- | --- |
| MCP read tool | 2 | 100ms → 200ms exponential | Yes (read-only) |
| MCP write tool (initiate_return) | 1 | 500ms | No, but uses idempotency key |
| Bedrock inference | 1 | 500ms | Yes (deterministic-ish at temp 0.3) |
| DynamoDB read | 3 | SDK default | Yes |
| ElastiCache read | 1 | 50ms | Yes |
| ElastiCache write | 0 | — | Yes, but failure is non-fatal |

Two retries on read tools is intentional. More than two and you stack latency without reliability gain — by the third retry, you've usually crossed the wall-clock budget anyway. The first retry catches transient network blips; the second catches one node restarting; beyond that it's a real outage.
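
The read-tool retry shape, sketched (assuming only infrastructure errors are retried; `call` is any zero-argument coroutine factory):

import asyncio

async def call_read_tool(call, max_retries=2, base_delay_s=0.1):
    """Retry a read-only MCP call: 100ms, then 200ms, then give up."""
    for attempt in range(max_retries + 1):
        try:
            return await call()
        except (TimeoutError, ConnectionError):
            if attempt == max_retries:
                raise                      # real outage: surface it to the breaker
            await asyncio.sleep(base_delay_s * 2 ** attempt)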

Idempotency on writes

initiate_return is the only write path. It's protected by an idempotency key (UUID generated on the first attempt, sent on retry):

return_key = f"return:{order_id}:{asin}:{idempotency_uuid}"
# RDS unique constraint on this key prevents duplicate RMAs

Without the idempotency key, a network hiccup at the wrong moment could create two RMAs for one return.
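
A sketch of the intended usage — the key is generated once, before the first attempt, so the retry carries the same key (client and parameter names are illustrative):

import time
import uuid

def initiate_return_idempotently(client, order_id, asin):
    # Generate ONCE; both attempts send the same key, so the RDS unique
    # constraint collapses them into a single RMA.
    key = f"return:{order_id}:{asin}:{uuid.uuid4()}"
    for attempt in range(2):               # 1 retry, per the table above
        try:
            return client.initiate_return(order_id, asin, idempotency_key=key)
        except ConnectionError:
            if attempt == 1:
                raise
            time.sleep(0.5)                # 500ms backoff, per the table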


Defense layer 3: Stopping conditions

Three independent kill switches in the agent loop. Any one trips → the loop stops.

| Switch | Threshold | Action on trip |
| --- | --- | --- |
| Iteration count | 10 tool-call rounds | Best-effort answer with disclaimer + escalation offer |
| Wall-clock | 8 seconds | Cancel pending tool calls (asyncio.wait_for), template fallback |
| Token budget | Hard cap on context window | Force summarization, drop oldest tool results |

The wall-clock is the user-facing safety net. Iteration count and token budget are about preventing pathological cases (a model that loops on its own indecision, or one that accumulates so much tool-result context that synthesis becomes prohibitively slow).
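
How the three switches compose, sketched (the helpers are injected callables, not real module functions; the token budget number is a placeholder):

import asyncio
import time

MAX_ITERATIONS = 10       # switch 1: iteration count
WALL_CLOCK_S = 8.0        # switch 2: wall-clock
TOKEN_BUDGET = 150_000    # switch 3: placeholder value

async def agent_loop(run_one_round, context_tokens, compact_context,
                     template_fallback, best_effort_answer):
    """run_one_round() -> (done, answer). Any switch tripping stops the loop."""
    start = time.monotonic()
    for _ in range(MAX_ITERATIONS):
        remaining = WALL_CLOCK_S - (time.monotonic() - start)
        if remaining <= 0:
            return template_fallback()
        try:
            # wait_for cancels the pending tool calls when the budget expires
            done, answer = await asyncio.wait_for(run_one_round(),
                                                  timeout=remaining)
        except asyncio.TimeoutError:
            return template_fallback()
        if context_tokens() > TOKEN_BUDGET:
            compact_context()     # force summarization, drop oldest tool results
        if done:
            return answer
    return best_effort_answer()   # iteration cap: disclaimer + escalation offer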


Defense layer 4: Post-generation guardrails

The final defense against bad LLM output. Runs after Claude finishes generating but before tokens reach the user.

| Check | What it validates | Recovery |
| --- | --- | --- |
| ASIN check | Every cited ASIN exists in catalog | Strip the citation; if answer depends on it, regenerate |
| Price sanity | Quoted price matches catalog within 0.5% | Force template price |
| Link validation | URLs resolve (cached HEAD checks) | Strip broken links |
| Citation match | Quoted text appears in cited chunk | Regenerate with stricter prompt |
| Forbidden patterns | Competitor mentions, PII leakage, profanity | Reject and regenerate |
| Length guard | Response within token bounds | Truncate at sentence boundary |

Guardrail failure → at most one regeneration → if regeneration also fails, fall back to a template ("I'm having trouble answering that — can I connect you with a human?").

The single regeneration cap is deliberate. Multiple regens stack latency and rarely fix the underlying issue (if the model produced bad output once with a good prompt, it'll usually do it again).
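
The control flow, sketched (`checks` is a list of (name, predicate) pairs standing in for the table above):

def finalize_response(draft, checks, regenerate, template_fallback):
    """Run every guardrail; allow exactly one regeneration; then fall back."""
    failed = [name for name, passes in checks if not passes(draft)]
    if not failed:
        return draft
    second = regenerate(failed)            # one regen, with a stricter prompt
    if all(passes(second) for _, passes in checks):
        return second
    return template_fallback()             # "...connect you with a human?"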


Defense layer 5: Graceful degradation

When tools are unavailable, the agent still answers — just less well.

User: "Show me Berserk volume 42, is it in stock, and what's the return policy?"

Tools state:
  catalog_mcp:    UP
  order_mcp:      OPEN (circuit breaker tripped)
  support_mcp:    UP

Agent behavior:
  - catalog_mcp.search_manga → returns Berserk vol 42
  - order_mcp.check_stock → SKIPPED (breaker open)
  - support_mcp.get_return_policy → returns policy

Response:
  "Berserk volume 42 is available. I'm not able to check stock right
   now — please refresh in a moment. The return policy is..."

The key property is transparency. The user is told what we couldn't do. We do not silently omit a piece of information, and we do not lie ("it's in stock" without checking).

This requires the Orchestrator to know which tool failed. Tool errors come back to the Orchestrator as structured <tool_error> blocks with a code (circuit_open, timeout, bad_input), and the system prompt instructs Claude how to phrase the degradation message based on the code.
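
A sketch of what such a block might look like on the wire (the exact shape is an assumption; only the codes are specified above):

def tool_error_block(tool_name: str, code: str) -> str:
    """code is one of: circuit_open, timeout, bad_input."""
    return (f'<tool_error tool="{tool_name}" code="{code}">'
            f'No result is available from this tool; disclose that to the user.'
            f'</tool_error>')

# e.g. tool_error_block("order_mcp.check_stock", "circuit_open")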


Failure handling per layer

Lambda / API Gateway

  • 5xx from Lambda → API Gateway returns 502, websocket closes
  • Cold start > 3s → user sees a loading indicator longer; no functional failure
  • Lambda OOM → exception, full request fails

Orchestrator (Bedrock)

  • Bedrock timeout → 1 retry → template fallback
  • Bedrock rate limit → exponential backoff up to 5s, then template
  • Bedrock returns malformed tool calls → drop the malformed call, continue with valid ones

MCP server (ECS Fargate)

  • Container crash → ECS replaces task; circuit breaker absorbs during transition
  • ECS service degraded (e.g., 50% of tasks unhealthy) → ALB drops bad targets; latency spikes; circuit may trip
  • ALB unhealthy → static failover page

Backing data stores

  • OpenSearch shard unavailable → query may return partial results; agent flags as partial_results: true
  • DynamoDB throttle → SDK retries with backoff; if persistent, the tool fails open
  • RDS replica lag spike → read may be stale; order tool returns with stale: true flag
  • Redis evicted → read miss → fall through to source of truth

Observability

Every failure is observable.

CloudWatch metrics (per tool, per hour):

- Total calls
- Success rate
- 5xx rate
- Timeout rate
- Circuit breaker state changes
- Guardrail rejection count
- Fallback engagement count

X-Ray traces:

- Every request has a trace; every tool call is a sub-segment
- Failed tools show as red segments with error metadata

Structured logs:

- Request ID, session ID, intent, tools attempted, tools succeeded, final response source (live / template / error)

The single most useful aggregate metric: % of responses that used a fallback path. When this rises, something is wrong upstream — usually before the user-facing latency degrades. It's a leading indicator.
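
Emitting the fallback counter might look like this with boto3 (the namespace and dimension names are assumptions):

import boto3

cloudwatch = boto3.client("cloudwatch")

def record_fallback(tool: str, source: str) -> None:
    """source: 'template' or 'error' — anything that wasn't a live answer."""
    cloudwatch.put_metric_data(
        Namespace="MangaAssist",                    # assumed namespace
        MetricData=[{
            "MetricName": "FallbackEngaged",
            "Dimensions": [{"Name": "Tool", "Value": tool},
                           {"Name": "Source", "Value": source}],
            "Value": 1.0,
            "Unit": "Count",
        }],
    )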


Validation: Constraint Sanity Check

| Claimed metric / mechanism | Verdict | Why |
| --- | --- | --- |
| Circuit breaker threshold: 5 failures in 60s | Too sensitive on noisy MCPs | A 99.9% reliable MCP at 100 QPS sees ~6 failures per minute as baseline. The breaker would trip false-positive every minute. The threshold should be relative (e.g., failure rate > 5%), not absolute. |
| Circuit breaker state in Lambda memory | No coordination across instances | Lambda is multi-instance. Each instance has its own breaker state. One instance trips; others keep retrying the failing MCP and tripping their own breakers — recovery is uncoordinated. The real solution is shared state in Redis or DynamoDB, but that adds latency. The tradeoff isn't called out. |
| 2 retries with exponential backoff | Doesn't help with sustained outages | If an MCP is down for 10 seconds, 2 retries (100ms + 200ms) still fail. The breaker eventually catches it, but the first ~5 affected requests pay the full wall-clock cost. Acceptable, but document it. |
| 8-second wall-clock | See Orchestrator validation | Inconsistent with the 10-iteration cap. The real budget is 4–5 iterations. Document accordingly. |
| Single regeneration on guardrail fail | Right policy, weak detection | Regen fixes some failure modes (formatting). It does not fix others (consistent hallucination — the model has the wrong belief). The regen policy doesn't differentiate. A regen that fails the same check should escalate, not retry the same prompt. |
| ASIN check, price sanity, link validation as guardrails | Real but partial | These check structured facts but not semantic accuracy. "The return window is 60 days" passes the ASIN check, price check, and link check — and is wrong (it's 30 days). Semantic validation requires re-running RAG on the answer, which isn't done. |
| Citation match (quoted text in cited chunk) | Hardest, unspecified | The doc lists this as a guardrail. The implementation is non-trivial: substring match? Fuzzy match? Embedding similarity? The implementation choice changes behavior dramatically. Underspecified. |
| Idempotency key on initiate_return | Correct, but needs server-side enforcement | The key is generated client-side. If the client retries with a different idempotency UUID for the same logical operation, dedup fails. Server-side dedup needs a stricter key (e.g., a hash of order_id + asin within a 24h window). Not described. |
| Graceful degradation transparency | Easy to regress | "Tell the user what we couldn't do" sounds simple but depends on the LLM following the system prompt. We've seen Claude omit the disclaimer and answer with stale info confidently. Needs explicit testing. |
| Fallback rate as leading indicator | Right metric, no alerting threshold | The doc names the metric but doesn't say at what fallback rate we page someone. 5%? 20%? Without a threshold this is a graph someone looks at, not an alert. |

The biggest issue: circuit breaker state isn't shared

Lambda runs as many concurrent instances. Each instance has independent in-memory state for circuit breakers. Consequences:

  • One instance trips the breaker; nine others continue hammering the failing MCP.
  • A single bad request goes to instance A, trips it; the same query routed to instance B succeeds (because B's breaker hasn't seen the failure).
  • "Half-open" probe requests fan out across instances; under load you get many simultaneous probes, defeating the purpose.

Real solutions:

1. Shared circuit state in DynamoDB with TTL — adds 10–30ms per check, but coherent
2. Move circuit breakers to a sidecar / mesh layer (App Mesh) — operationally heavier, fully consistent

The current design is "best-effort circuit breaking," and that's a non-trivial caveat — yet the doc treats the breakers as authoritative.
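
A sketch of option 1 — shared breaker state in DynamoDB with a TTL (the table and attribute names are assumptions, and the table's TTL feature must be enabled on the `ttl` attribute):

import time
import boto3

table = boto3.resource("dynamodb").Table("circuit-breaker-state")  # assumed name

def breaker_is_open(mcp_name: str) -> bool:
    """Coherent across Lambda instances, at the cost of ~10-30ms per check."""
    item = table.get_item(Key={"pk": f"breaker#{mcp_name}"}).get("Item")
    return bool(item) and item["state"] == "open"

def trip_breaker(mcp_name: str, cooldown_s: int = 30) -> None:
    table.put_item(Item={
        "pk": f"breaker#{mcp_name}",
        "state": "open",
        "ttl": int(time.time()) + cooldown_s,   # DynamoDB TTL expires the trip
    })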

Sustained outage handling is undocumented

What happens if Order MCP is down for 30 minutes? The breaker opens, then half-opens every 30s and probes. The probe fails. Re-opens. Repeat. Meanwhile, every user asking about orders gets a degraded response.

Should we:

- Page someone after N minutes in open state? (yes, but the threshold isn't documented)
- Switch to a static "we're having issues" message after K minutes? (probably, not documented)
- Disable order-shaped intents entirely after some duration? (defensible, not documented)

Without this, the system limps along indefinitely on broken state. The "fail fast" philosophy needs an "escalate to human operators" companion.
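
The paging half could be as small as this (the five-minute threshold is a placeholder, precisely because no value is documented):

OPEN_ALERT_AFTER_S = 5 * 60   # placeholder threshold, not a documented value

def should_page(state: str, opened_at: float, now: float,
                threshold_s: float = OPEN_ALERT_AFTER_S) -> bool:
    """True once a breaker has sat in 'open' longer than the threshold."""
    return state == "open" and (now - opened_at) > threshold_s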

Guardrails catch structured errors but not factual ones

The guardrail layer catches:

- Made-up ASINs (good)
- Wrong prices (good)
- Broken links (good)

It does not catch:

- "Returns are accepted within 60 days" (false; actual is 30)
- "This title was published in 2018" (false; it was 2015)
- "Author X's other works include Y" (where Y is fabricated)

These are grounded in the LLM's training data, not in retrieved chunks. The only defense is: instruct the model to never assert facts that aren't in retrieved chunks, and validate output against retrieved chunks. The validation step is the missing piece. Without it, factual hallucinations escape.

Idempotency key is client-side

The doc says initiate_return uses an idempotency key. But the key is provided by the client (the orchestrator's Lambda). If the orchestrator retries with a fresh UUID — which is what happens on a retry across separate Lambda invocations — the dedup fails and we create a duplicate RMA.

The fix: the idempotency key for returns should be derived from the operation, not generated freshly. E.g., sha256(order_id + asin + DATE()) so retries within the same day always produce the same key. This needs to be specified server-side, not just trusted client-side.
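
The derived key, sketched — this is the proposal above, not the current behavior:

import hashlib
from datetime import datetime, timezone

def derived_return_key(order_id: str, asin: str) -> str:
    """Same logical return on the same UTC day -> same key, so retries
    across separate Lambda invocations still dedupe server-side."""
    day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    digest = hashlib.sha256(f"{order_id}:{asin}:{day}".encode()).hexdigest()
    return f"return:{digest}"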

"Graceful degradation transparency" is prompt-dependent

The agent telling the user "I couldn't check stock right now" depends entirely on Claude obeying the system prompt instruction to disclose tool failures. We've observed cases where Claude:

- Omits the disclaimer and answers with cached/stale info
- Hallucinates a substitute answer ("looks like it's in stock based on your previous order")
- Reroutes to a different tool that wasn't intended

Defending this requires either structural enforcement (the response template explicitly has an unavailable_tools: [] field that the system prompt forces to be populated) or a post-hoc check (verify all unavailable tools are mentioned in the response). Neither is documented.
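
The post-hoc check is the cheaper of the two to sketch. It assumes each tool has a registered disclosure phrase to look for — a naive substring match, which is itself a design decision:

def degradation_disclosed(response: str, unavailable_tools: list[str],
                          disclosure_phrases: dict[str, str]) -> bool:
    """Verify every failed tool is acknowledged somewhere in the answer."""
    return all(disclosure_phrases[tool].lower() in response.lower()
               for tool in unavailable_tools)

# e.g. disclosure_phrases = {"order_mcp.check_stock": "check stock"}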