
07 — Failure Handling

Most of the architecture is about what happens when things go wrong. Happy paths are short; failure paths are most of the engineering. This document is the failure taxonomy: it inventories the ways MangaAssist fails — in production, under load, during deploys, during outages — and the defense mechanisms that catch each class.


Failure taxonomy

Failures cluster into seven categories. Each has a different detection signal, a different recovery path, and a different blast radius.

| Category | What goes wrong | Detection signal | Worst-case impact |
| --- | --- | --- | --- |
| Tool unavailability | MCP server down or unreachable | HTTP 5xx / connection error | One tool unusable; degraded answer |
| Tool malformed response | MCP returns garbage | JSON parse fails / schema mismatch | One tool result dropped |
| Tool empty result | MCP returns no data | Empty list in response | LLM falls back to "I don't know" |
| Inference failure | Bedrock 5xx / timeout | Bedrock client error | No answer; full template fallback |
| Inference garbage | Claude produces nonsense | Guardrail post-validation fails | Bad answer if not caught |
| Loop runaway | Agent loops without converging | Iteration count or wall-clock | User waits; eventually killed by a stopping condition |
| Session corruption | DynamoDB / Redis state inconsistent | Schema validation on read | Wrong context loaded; wrong answer |

Defense layer 1: Circuit breakers

Each MCP server is fronted by a circuit breaker maintained by the Orchestrator's Lambda runtime.

        Closed
        ↓ (5 failures in 60s)
        Open  ────── (30s cooldown) ──────►  Half-Open
        ↑                                      ↓
        └────── (1 success closes) ◄───── (probe request)

States:

- Closed — normal operation, all requests pass through
- Open — all requests to this MCP short-circuit immediately; agent skips the tool and degrades
- Half-Open — exactly one probe request allowed; success closes, failure re-opens

The 5-failures-in-60s threshold is configurable per MCP. Catalog Search (high QPS, mostly cached) gets a tighter threshold. Order Inventory (lower QPS, transactional) gets a looser one.
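
A minimal sketch of that state machine, assuming a single-threaded, in-memory implementation per Lambda runtime (names are illustrative, not the production code):

import time

class CircuitBreaker:
    """Per-MCP breaker: Closed -> Open -> Half-Open, as diagrammed above."""

    def __init__(self, failure_threshold=5, window_s=60.0, cooldown_s=30.0):
        self.failure_threshold = failure_threshold   # configurable per MCP
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.failures = []        # timestamps of recent infra failures
        self.state = "closed"
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.cooldown_s:
                self.state = "half_open"   # cooldown elapsed: allow one probe
                return True
            return False                   # short-circuit: skip this tool
        return True                        # closed (or the half-open probe)

    def record_success(self) -> None:
        self.failures.clear()
        self.state = "closed"              # one success closes

    def record_failure(self) -> None:
        now = time.monotonic()
        if self.state == "half_open":
            self.state, self.opened_at = "open", now   # probe failed: re-open
            return
        self.failures = [t for t in self.failures if now - t < self.window_s]
        self.failures.append(now)
        if len(self.failures) >= self.failure_threshold:
            self.state, self.opened_at = "open", now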

Failure attribution

The breaker counts only infrastructure failures (5xx, timeouts, connection refused). It does not count:

- 4xx errors (bad input — caller's fault)
- Empty results (legitimate "no match")
- Schema validation failures on output (might be a bad chunk, not server outage)

This separation matters in both directions. If 4xx errors counted, a wave of bad inputs (the caller's fault) could trip breakers against a perfectly healthy MCP; if output-schema failures counted, a deploy that introduces a new schema-validation bug could trip every breaker simultaneously and brown out the whole chatbot.
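
A sketch of that attribution rule (the outcome type is illustrative):

from dataclasses import dataclass

@dataclass
class ToolOutcome:
    kind: str       # "http", "timeout", "connection_refused", "schema", "empty"
    status: int = 0

def counts_toward_breaker(outcome: ToolOutcome) -> bool:
    """Only infrastructure failures feed the breaker's failure window."""
    if outcome.kind == "http":
        return outcome.status >= 500       # 4xx is the caller's fault
    # "schema" and "empty" fall outside the breaker: a bad chunk or a
    # legitimate no-match is not evidence of a server outage.
    return outcome.kind in ("timeout", "connection_refused")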


Defense layer 2: Retries

Limited and shaped:

| Operation | Max retries | Backoff | Idempotent? |
| --- | --- | --- | --- |
| MCP read tool | 2 | 100ms → 200ms exponential | Yes (read-only) |
| MCP write tool (initiate_return) | 1 | 500ms | No, but uses idempotency key |
| Bedrock inference | 1 | 500ms | Yes (deterministic-ish at temp 0.3) |
| DynamoDB read | 3 | SDK default | Yes |
| ElastiCache read | 1 | 50ms | Yes |
| ElastiCache write | 0 | — | Yes, but failure is non-fatal |

Two retries on read tools is intentional. More than two and you stack latency without reliability gain — by the third retry, you've usually crossed the wall-clock budget anyway. The first retry catches transient network blips; the second catches one node restarting; beyond that it's a real outage.
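
The read-tool retry shape, sketched (assuming only infrastructure errors are retried; `call` is any zero-argument coroutine factory):

import asyncio

async def call_read_tool(call, max_retries=2, base_delay_s=0.1):
    """Retry a read-only MCP call: 100ms, then 200ms, then give up."""
    for attempt in range(max_retries + 1):
        try:
            return await call()
        except (TimeoutError, ConnectionError):
            if attempt == max_retries:
                raise                      # real outage: surface it to the breaker
            await asyncio.sleep(base_delay_s * 2 ** attempt)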

Idempotency on writes

initiate_return is the only write path. It's protected by an idempotency key (UUID generated on the first attempt, sent on retry):

return_key = f"return:{order_id}:{asin}:{idempotency_uuid}"
# RDS unique constraint on this key prevents duplicate RMAs

Without the idempotency key, a network hiccup at the wrong moment could create two RMAs for one return.
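
A sketch of the intended usage — the key is generated once, before the first attempt, so the retry carries the same key (client and parameter names are illustrative):

import time
import uuid

def initiate_return_idempotently(client, order_id, asin):
    # Generate ONCE; both attempts send the same key, so the RDS unique
    # constraint collapses them into a single RMA.
    key = f"return:{order_id}:{asin}:{uuid.uuid4()}"
    for attempt in range(2):               # 1 retry, per the table above
        try:
            return client.initiate_return(order_id, asin, idempotency_key=key)
        except ConnectionError:
            if attempt == 1:
                raise
            time.sleep(0.5)                # 500ms backoff, per the table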


Defense layer 3: Stopping conditions

Three independent kill switches in the agent loop. Any one trips → the loop stops.

| Switch | Threshold | Action on trip |
| --- | --- | --- |
| Iteration count | 10 tool-call rounds | Best-effort answer with disclaimer + escalation offer |
| Wall-clock | 8 seconds | Cancel pending tool calls (asyncio.wait_for), template fallback |
| Token budget | Hard cap on context window | Force summarization, drop oldest tool results |

The wall-clock is the user-facing safety net. Iteration count and token budget are about preventing pathological cases (a model that loops on its own indecision, or one that accumulates so much tool-result context that synthesis becomes prohibitively slow).
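
How the three switches compose, sketched (the helpers are injected callables, not real module functions; the token budget number is a placeholder):

import asyncio
import time

MAX_ITERATIONS = 10       # switch 1: iteration count
WALL_CLOCK_S = 8.0        # switch 2: wall-clock
TOKEN_BUDGET = 150_000    # switch 3: placeholder value

async def agent_loop(run_one_round, context_tokens, compact_context,
                     template_fallback, best_effort_answer):
    """run_one_round() -> (done, answer). Any switch tripping stops the loop."""
    start = time.monotonic()
    for _ in range(MAX_ITERATIONS):
        remaining = WALL_CLOCK_S - (time.monotonic() - start)
        if remaining <= 0:
            return template_fallback()
        try:
            # wait_for cancels the pending tool calls when the budget expires
            done, answer = await asyncio.wait_for(run_one_round(),
                                                  timeout=remaining)
        except asyncio.TimeoutError:
            return template_fallback()
        if context_tokens() > TOKEN_BUDGET:
            compact_context()     # force summarization, drop oldest tool results
        if done:
            return answer
    return best_effort_answer()   # iteration cap: disclaimer + escalation offer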


Defense layer 4: Post-generation guardrails

The final defense against bad LLM output. Runs after Claude finishes generating but before tokens reach the user.

| Check | What it validates | Recovery |
| --- | --- | --- |
| ASIN check | Every cited ASIN exists in catalog | Strip the citation; if answer depends on it, regenerate |
| Price sanity | Quoted price matches catalog within 0.5% | Force template price |
| Link validation | URLs resolve (cached HEAD checks) | Strip broken links |
| Citation match | Quoted text appears in cited chunk | Regenerate with stricter prompt |
| Forbidden patterns | Competitor mentions, PII leakage, profanity | Reject and regenerate |
| Length guard | Response within token bounds | Truncate at sentence boundary |

Guardrail failure → at most one regeneration → if regeneration also fails, fall back to a template ("I'm having trouble answering that — can I connect you with a human?").

The single regeneration cap is deliberate. Multiple regens stack latency and rarely fix the underlying issue (if the model produced bad output once with a good prompt, it'll usually do it again).
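
The control flow, sketched (`checks` is a list of (name, predicate) pairs standing in for the table above):

def finalize_response(draft, checks, regenerate, template_fallback):
    """Run every guardrail; allow exactly one regeneration; then fall back."""
    failed = [name for name, passes in checks if not passes(draft)]
    if not failed:
        return draft
    second = regenerate(failed)            # one regen, with a stricter prompt
    if all(passes(second) for _, passes in checks):
        return second
    return template_fallback()             # "...connect you with a human?"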


Defense layer 5: Graceful degradation

When tools are unavailable, the agent still answers — just less well.

User: "Show me Berserk volume 42, is it in stock, and what's the return policy?"

Tools state:
  catalog_mcp:    UP
  order_mcp:      OPEN (circuit breaker tripped)
  support_mcp:    UP

Agent behavior:
  - catalog_mcp.search_manga → returns Berserk vol 42
  - order_mcp.check_stock → SKIPPED (breaker open)
  - support_mcp.get_return_policy → returns policy

Response:
  "Berserk volume 42 is available. I'm not able to check stock right
   now — please refresh in a moment. The return policy is..."

The key property is transparency. The user is told what we couldn't do. We do not silently omit a piece of information, and we do not lie ("it's in stock" without checking).

This requires the Orchestrator to know which tool failed. Tool errors come back to the Orchestrator as structured <tool_error> blocks with a code (circuit_open, timeout, bad_input), and the system prompt instructs Claude how to phrase the degradation message based on the code.
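
A sketch of what such a block might look like on the wire (the exact shape is an assumption; only the codes are specified above):

def tool_error_block(tool_name: str, code: str) -> str:
    """code is one of: circuit_open, timeout, bad_input."""
    return (f'<tool_error tool="{tool_name}" code="{code}">'
            f'No result is available from this tool; disclose that to the user.'
            f'</tool_error>')

# e.g. tool_error_block("order_mcp.check_stock", "circuit_open")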


Failure handling per layer

Lambda / API Gateway

  • 5xx from Lambda → API Gateway returns 502, websocket closes
  • Cold start > 3s → user sees a loading indicator longer; no functional failure
  • Lambda OOM → exception, full request fails

Orchestrator (Bedrock)

  • Bedrock timeout → 1 retry → template fallback
  • Bedrock rate limit → exponential backoff up to 5s, then template
  • Bedrock returns malformed tool calls → drop the malformed call, continue with valid ones

MCP server (ECS Fargate)

  • Container crash → ECS replaces task; circuit breaker absorbs during transition
  • ECS service degraded (e.g., 50% of tasks unhealthy) → ALB drops bad targets; latency spikes; circuit may trip
  • ALB unhealthy → static failover page

Backing data stores

  • OpenSearch shard unavailable → query may return partial results; agent flags as partial_results: true
  • DynamoDB throttle → SDK retries with backoff; if persistent, the tool fails open
  • RDS replica lag spike → read may be stale; order tool returns with stale: true flag
  • Redis evicted → read miss → fall through to source of truth

Observability

Every failure is observable.

CloudWatch metrics (per tool, per hour):

- Total calls
- Success rate
- 5xx rate
- Timeout rate
- Circuit breaker state changes
- Guardrail rejection count
- Fallback engagement count

X-Ray traces:

- Every request has a trace; every tool call is a sub-segment
- Failed tools show as red segments with error metadata

Structured logs:

- Request ID, session ID, intent, tools attempted, tools succeeded, final response source (live / template / error)

The single most useful aggregate metric: % of responses that used a fallback path. When this rises, something is wrong upstream — usually before the user-facing latency degrades. It's a leading indicator.
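
Emitting the fallback counter might look like this with boto3 (the namespace and dimension names are assumptions):

import boto3

cloudwatch = boto3.client("cloudwatch")

def record_fallback(tool: str, source: str) -> None:
    """source: 'template' or 'error' — anything that wasn't a live answer."""
    cloudwatch.put_metric_data(
        Namespace="MangaAssist",                    # assumed namespace
        MetricData=[{
            "MetricName": "FallbackEngaged",
            "Dimensions": [{"Name": "Tool", "Value": tool},
                           {"Name": "Source", "Value": source}],
            "Value": 1.0,
            "Unit": "Count",
        }],
    )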


Validation: Constraint Sanity Check

| Claimed metric / mechanism | Verdict | Why |
| --- | --- | --- |
| Circuit breaker threshold: 5 failures in 60s | Too sensitive on noisy MCPs | A 99.9% reliable MCP at 100 QPS sees ~6 failures per minute as baseline. The breaker would trip false-positive every minute. The threshold should be relative (e.g., failure rate > 5%), not absolute. |
| Circuit breaker state in Lambda memory | No coordination across instances | Lambda is multi-instance. Each instance has its own breaker state. One instance trips; others keep retrying the failing MCP and tripping their own breakers — recovery is uncoordinated. The real solution is shared state in Redis or DynamoDB, but that adds latency. The tradeoff isn't called out. |
| 2 retries with exponential backoff | Doesn't help with sustained outages | If an MCP is down for 10 seconds, 2 retries (100ms + 200ms) still fail. The breaker eventually catches it, but the first ~5 affected requests pay the full wall-clock cost. Acceptable, but document it. |
| 8-second wall-clock | See Orchestrator validation | Inconsistent with the 10-iteration cap. The real budget is 4–5 iterations. Document accordingly. |
| Single regeneration on guardrail fail | Right policy, weak detection | Regen fixes some failure modes (formatting). It does not fix others (consistent hallucination — the model has the wrong belief). The regen policy doesn't differentiate. A regen that fails the same check should escalate, not retry the same prompt. |
| ASIN check, price sanity, link validation as guardrails | Real but partial | These check structured facts but not semantic accuracy. "The return window is 60 days" passes the ASIN check, price check, and link check — and is wrong (it's 30 days). Semantic validation requires re-running RAG on the answer, which isn't done. |
| Citation match (quoted text in cited chunk) | Hardest, unspecified | The doc lists this as a guardrail. The implementation is non-trivial: substring match? Fuzzy match? Embedding similarity? The implementation choice changes behavior dramatically. Underspecified. |
| Idempotency key on initiate_return | Correct, but needs server-side enforcement | The key is generated client-side. If the client retries with a different idempotency UUID for the same logical operation, dedup fails. Server-side dedup needs a stricter key (e.g., a hash of order_id + asin within a 24h window). Not described. |
| Graceful degradation transparency | Easy to regress | "Tell the user what we couldn't do" sounds simple but depends on the LLM following the system prompt. We've seen Claude omit the disclaimer and answer with stale info confidently. Needs explicit testing. |
| Fallback rate as leading indicator | Right metric, no alerting threshold | The doc names the metric but doesn't say at what fallback rate we page someone. 5%? 20%? Without a threshold this is a graph someone looks at, not an alert. |

The biggest issue: circuit breaker state isn't shared

Lambda runs as many concurrent instances. Each instance has independent in-memory state for circuit breakers. Consequences:

  • One instance trips the breaker; nine others continue hammering the failing MCP.
  • A single bad request goes to instance A, trips it; the same query routed to instance B succeeds (because B's breaker hasn't seen the failure).
  • "Half-open" probe requests fan out across instances; under load you get many simultaneous probes, defeating the purpose.

Real solutions:

1. Shared circuit state in DynamoDB with TTL — adds 10–30ms per check, but coherent
2. Move circuit breakers to a sidecar / mesh layer (App Mesh) — operationally heavier, fully consistent

The current design is "best-effort circuit breaking," and that's a non-trivial caveat — yet the doc treats the breakers as authoritative.
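
A sketch of option 1 — shared breaker state in DynamoDB with a TTL (the table and attribute names are assumptions, and the table's TTL feature must be enabled on the `ttl` attribute):

import time
import boto3

table = boto3.resource("dynamodb").Table("circuit-breaker-state")  # assumed name

def breaker_is_open(mcp_name: str) -> bool:
    """Coherent across Lambda instances, at the cost of ~10-30ms per check."""
    item = table.get_item(Key={"pk": f"breaker#{mcp_name}"}).get("Item")
    return bool(item) and item["state"] == "open"

def trip_breaker(mcp_name: str, cooldown_s: int = 30) -> None:
    table.put_item(Item={
        "pk": f"breaker#{mcp_name}",
        "state": "open",
        "ttl": int(time.time()) + cooldown_s,   # DynamoDB TTL expires the trip
    })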

Sustained outage handling is undocumented

What happens if Order MCP is down for 30 minutes? The breaker opens, then half-opens every 30s and probes. The probe fails. Re-opens. Repeat. Meanwhile, every user asking about orders gets a degraded response.

Should we:

- Page someone after N minutes in open state? (yes, but the threshold isn't documented)
- Switch to a static "we're having issues" message after K minutes? (probably, not documented)
- Disable order-shaped intents entirely after some duration? (defensible, not documented)

Without this, the system limps along indefinitely on broken state. The "fail fast" philosophy needs an "escalate to human operators" companion.
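
The paging half could be as small as this (the five-minute threshold is a placeholder, precisely because no value is documented):

OPEN_ALERT_AFTER_S = 5 * 60   # placeholder threshold, not a documented value

def should_page(state: str, opened_at: float, now: float,
                threshold_s: float = OPEN_ALERT_AFTER_S) -> bool:
    """True once a breaker has sat in 'open' longer than the threshold."""
    return state == "open" and (now - opened_at) > threshold_s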

Guardrails catch structured errors but not factual ones

The guardrail layer catches:

- Made-up ASINs (good)
- Wrong prices (good)
- Broken links (good)

It does not catch:

- "Returns are accepted within 60 days" (false; actual is 30)
- "This title was published in 2018" (false; it was 2015)
- "Author X's other works include Y" (where Y is fabricated)

These are grounded in the LLM's training data, not in retrieved chunks. The only defense is: instruct the model to never assert facts that aren't in retrieved chunks, and validate output against retrieved chunks. The validation step is the missing piece. Without it, factual hallucinations escape.

Idempotency key is client-side

The doc says initiate_return uses an idempotency key. But the key is provided by the client (the orchestrator's Lambda). If the orchestrator retries with a fresh UUID — which is what happens on a retry across separate Lambda invocations — the dedup fails and we create a duplicate RMA.

The fix: the idempotency key for returns should be derived from the operation, not generated freshly. E.g., sha256(order_id + asin + DATE()) so retries within the same day always produce the same key. This needs to be specified server-side, not just trusted client-side.
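
The derived key, sketched — this is the proposal above, not the current behavior:

import hashlib
from datetime import datetime, timezone

def derived_return_key(order_id: str, asin: str) -> str:
    """Same logical return on the same UTC day -> same key, so retries
    across separate Lambda invocations still dedupe server-side."""
    day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    digest = hashlib.sha256(f"{order_id}:{asin}:{day}".encode()).hexdigest()
    return f"return:{digest}"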

"Graceful degradation transparency" is prompt-dependent

The agent telling the user "I couldn't check stock right now" depends entirely on Claude obeying the system prompt instruction to disclose tool failures. We've observed cases where Claude:

- Omits the disclaimer and answers with cached/stale info
- Hallucinates a substitute answer ("looks like it's in stock based on your previous order")
- Reroutes to a different tool that wasn't intended

Defending this requires either structural enforcement (the response template explicitly has an unavailable_tools: [] field that the system prompt forces to be populated) or a post-hoc check (verify all unavailable tools are mentioned in the response). Neither is documented.
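
The post-hoc check is the cheaper of the two to sketch. It assumes each tool has a registered disclosure phrase to look for — a naive substring match, which is itself a design decision:

def degradation_disclosed(response: str, unavailable_tools: list[str],
                          disclosure_phrases: dict[str, str]) -> bool:
    """Verify every failed tool is acknowledged somewhere in the answer."""
    return all(disclosure_phrases[tool].lower() in response.lower()
               for tool in unavailable_tools)

# e.g. disclosure_phrases = {"order_mcp.check_stock": "check stock"}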