07 — Failure Handling
Most of the architecture is about what happens when things go wrong. Happy paths are short; failure paths are most of the engineering. This document is the failure taxonomy: it inventories the ways MangaAssist fails — in production, under load, during deploys, during outages — and the defense mechanisms that catch each class of failure.
Failure taxonomy
Failures cluster into seven categories. Each has a different detection signal, a different recovery path, and a different blast radius.
| Category | What goes wrong | Detection signal | Worst-case impact |
|---|---|---|---|
| Tool unavailability | MCP server down or unreachable | HTTP 5xx / connection error | One tool unusable; degraded answer |
| Tool malformed response | MCP returns garbage | JSON parse fails / schema mismatch | One tool result dropped |
| Tool empty result | MCP returns no data | Empty list in response | LLM falls back to "I don't know" |
| Inference failure | Bedrock 5xx / timeout | Bedrock client error | No answer; full template fallback |
| Inference garbage | Claude produces nonsense | Guardrail post-validation fails | Bad answer if not caught |
| Loop runaway | Agent loops without converging | Iteration count or wall-clock | User waits; eventually killed by stop |
| Session corruption | DynamoDB / Redis state inconsistent | Schema validation on read | Wrong context loaded; wrong answer |
Defense layer 1: Circuit breakers
Each MCP server is fronted by a circuit breaker maintained by the Orchestrator's Lambda runtime.
Closed ──(5 failures in 60s)──► Open ──(30s cooldown)──► Half-Open
   ▲                               ▲                         │
   │                               └───── (probe fails) ─────┤
   └───────────────── (probe succeeds) ──────────────────────┘
States:
- Closed — normal operation, all requests pass through
- Open — all requests to this MCP short-circuit immediately; the agent skips the tool and degrades
- Half-Open — exactly one probe request allowed; success closes the breaker, failure re-opens it
The 5-failures-in-60s threshold is configurable per MCP. Catalog Search (high QPS, mostly cached) gets a tighter threshold. Order Inventory (lower QPS, transactional) gets a looser one.
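The state machine above can be sketched as a small class. This is a minimal single-process illustration, not the production implementation (which, as noted later, lives in Lambda memory with all the coordination caveats that implies); the class and method names are ours.

```python
import time

class CircuitBreaker:
    """Sketch of the per-MCP breaker: trips after `max_failures`
    infrastructure failures inside a `window`-second sliding window;
    while open, calls short-circuit until `cooldown` elapses, after
    which exactly one probe is allowed (half-open)."""

    def __init__(self, max_failures=5, window=60.0, cooldown=30.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.window = window
        self.cooldown = cooldown
        self.clock = clock           # injectable for testing
        self.failures = []           # timestamps of recent infra failures
        self.state = "closed"
        self.opened_at = None

    def allow_request(self):
        now = self.clock()
        if self.state == "open":
            if now - self.opened_at >= self.cooldown:
                self.state = "half_open"
                return True          # the single probe request
            return False             # short-circuit: skip the tool
        if self.state == "half_open":
            return False             # probe already in flight
        return True                  # closed: pass through

    def record_success(self):
        self.failures.clear()
        self.state = "closed"        # one success closes

    def record_failure(self):
        now = self.clock()
        if self.state == "half_open":
            self.state, self.opened_at = "open", now   # probe failed
            return
        # keep only failures inside the sliding window
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)
        if len(self.failures) >= self.max_failures:
            self.state, self.opened_at = "open", now
```

The injectable clock makes the 60s window and 30s cooldown testable without real waits, which matters for a component whose bugs only show up under timing pressure.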
Failure attribution
The breaker counts only infrastructure failures (5xx, timeouts, connection refused). It does not count:
- 4xx errors (bad input — caller's fault)
- Empty results (legitimate "no match")
- Schema validation failures on output (might be a bad chunk, not a server outage)
This separation matters. Without it, a flaky MCP server could be hidden by a wave of 4xx errors that don't trip the breaker, or conversely, a deploy that introduces a new schema-validation bug could trip every breaker simultaneously and brown out the whole chatbot.
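The attribution rule above reduces to a small predicate. A sketch, with illustrative parameter names; the exception types stand in for whatever the real MCP client raises:

```python
def counts_toward_breaker(status_code=None, exc=None,
                          schema_invalid=False, empty_result=False):
    """Return True only for infrastructure failures, which are the
    only failures the circuit breaker should count."""
    if exc is not None:
        # Timeouts and connection errors are infra failures.
        # ConnectionRefusedError is a ConnectionError subclass.
        return isinstance(exc, (TimeoutError, ConnectionError, OSError))
    if status_code is not None and 500 <= status_code < 600:
        return True
    # 4xx (caller's fault), empty results, and schema-validation
    # failures on an otherwise-healthy response are all excluded.
    return False
```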
Defense layer 2: Retries
Limited and shaped:
| Operation | Max retries | Backoff | Idempotent? |
|---|---|---|---|
| MCP read tool | 2 | 100ms → 200ms exponential | Yes (read-only) |
| MCP write tool (initiate_return) | 1 | 500ms | No, but uses idempotency key |
| Bedrock inference | 1 | 500ms | Yes (deterministic-ish at temp 0.3) |
| DynamoDB read | 3 | SDK default | Yes |
| ElastiCache read | 1 | 50ms | Yes |
| ElastiCache write | 0 | — | Yes, but failure is non-fatal |
Two retries on read tools is intentional. More than two and you stack latency without reliability gain — by the third retry, you've usually crossed the wall-clock budget anyway. The first retry catches transient network blips; the second catches one node restarting; beyond that it's a real outage.
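The read-tool retry shape can be sketched as a wrapper. A sketch under assumptions: `tool_fn` is a placeholder for the real MCP client call, and the jitter term is our addition (the table only specifies 100ms → 200ms exponential backoff):

```python
import asyncio
import random

async def call_with_retries(tool_fn, *args, max_retries=2, base_delay=0.1):
    """Up to `max_retries` retries with exponential backoff for
    read-only MCP tools. Only infra failures are retried."""
    attempt = 0
    while True:
        try:
            return await tool_fn(*args)
        except (TimeoutError, ConnectionError):
            if attempt >= max_retries:
                raise                        # beyond two: a real outage
            delay = base_delay * (2 ** attempt)   # 100ms, then 200ms
            await asyncio.sleep(delay + random.uniform(0, delay / 4))
            attempt += 1
```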
Idempotency on writes
initiate_return is the only write path. It's protected by an idempotency key (UUID generated on the first attempt, sent on retry):
return_key = f"return:{order_id}:{asin}:{idempotency_uuid}"
# RDS unique constraint on this key prevents duplicate RMAs
Without the idempotency key, a network hiccup at the wrong moment could create two RMAs for one return.
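The dedup behaviour relies on the database rejecting a second insert with the same key. A sketch using SQLite to stand in for the RDS unique constraint; table and column names are illustrative:

```python
import sqlite3

def create_rma(conn, return_key, order_id, asin):
    """Insert an RMA row; a retry with the same return_key hits the
    unique constraint and is treated as an idempotent no-op instead
    of creating a second RMA."""
    try:
        conn.execute(
            "INSERT INTO rmas (return_key, order_id, asin) VALUES (?, ?, ?)",
            (return_key, order_id, asin),
        )
        conn.commit()
        return "created"
    except sqlite3.IntegrityError:
        return "duplicate"   # retry after a network hiccup: safe
```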
Defense layer 3: Stopping conditions
Three independent kill switches in the agent loop. Any one trips → the loop stops.
| Switch | Threshold | Action on trip |
|---|---|---|
| Iteration count | 10 tool-call rounds | Best-effort answer with disclaimer + escalation offer |
| Wall-clock | 8 seconds | Cancel pending tool calls (asyncio.wait_for), template fallback |
| Token budget | Hard cap on context window | Force summarization, drop oldest tool results |
The wall-clock is the user-facing safety net. Iteration count and token budget are about preventing pathological cases (a model that loops on its own indecision, or one that accumulates so much tool-result context that synthesis becomes prohibitively slow).
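Two of the three kill switches compose naturally around the loop. A sketch, assuming `step` is a coroutine representing one tool-call round that returns `(done, answer)`; the token-budget switch is omitted for brevity, and the return values are placeholders for the real fallback paths:

```python
import asyncio
import time

async def agent_loop(step, max_iterations=10, wall_clock_s=8.0):
    """Run tool-call rounds until done, the iteration cap, or the
    wall-clock budget — whichever trips first."""
    deadline = time.monotonic() + wall_clock_s
    for _ in range(max_iterations):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return "template_fallback"       # wall-clock tripped
        try:
            # wait_for cancels the pending round on timeout
            done, answer = await asyncio.wait_for(step(), timeout=remaining)
        except asyncio.TimeoutError:
            return "template_fallback"       # pending call cancelled
        if done:
            return answer
    return "best_effort_with_disclaimer"     # iteration cap tripped
```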
Defense layer 4: Post-generation guardrails
The final defense against bad LLM output. Runs after Claude finishes generating but before tokens reach the user.
| Check | What it validates | Recovery |
|---|---|---|
| ASIN check | Every cited ASIN exists in catalog | Strip the citation; if answer depends on it, regenerate |
| Price sanity | Quoted price matches catalog within 0.5% | Force template price |
| Link validation | URLs resolve (cached HEAD checks) | Strip broken links |
| Citation match | Quoted text appears in cited chunk | Regenerate with stricter prompt |
| Forbidden patterns | Competitor mentions, PII leakage, profanity | Reject and regenerate |
| Length guard | Response within token bounds | Truncate at sentence boundary |
Guardrail failure → at most one regeneration → if regeneration also fails, fall back to a template ("I'm having trouble answering that — can I connect you with a human?").
The single regeneration cap is deliberate. Multiple regens stack latency and rarely fix the underlying issue (if the model produced bad output once with a good prompt, it'll usually do it again).
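The regenerate-at-most-once policy is small enough to show in full. A sketch with placeholder callables: `generate` stands in for the Bedrock call (with a stricter prompt on the regeneration), `checks` for the guardrail validators:

```python
def answer_with_guardrails(generate, checks):
    """Run the guardrail checks; regenerate at most once with a
    stricter prompt; fall back to the template if that also fails."""
    TEMPLATE = ("I'm having trouble answering that — "
                "can I connect you with a human?")
    for attempt in range(2):          # initial attempt + one regen
        draft = generate(strict=(attempt == 1))
        if all(check(draft) for check in checks):
            return draft
    return TEMPLATE                   # cap reached: template fallback
```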
Defense layer 5: Graceful degradation
When tools are unavailable, the agent still answers — just less well.
User: "Show me Berserk volume 42, is it in stock, and what's the return policy?"
Tools state:
catalog_mcp: UP
order_mcp: OPEN (circuit breaker tripped)
support_mcp: UP
Agent behavior:
- catalog_mcp.search_manga → returns Berserk vol 42
- order_mcp.check_stock → SKIPPED (breaker open)
- support_mcp.get_return_policy → returns policy
Response:
"Berserk volume 42 is available. I'm not able to check stock right
now — please refresh in a moment. The return policy is..."
The key word is transparent. The user is told what we couldn't do. We do not silently omit a piece of information; we do not lie ("it's in stock" without checking).
This requires the Orchestrator to know which tool failed. Tool errors come back to the Orchestrator as structured <tool_error> blocks with a code (circuit_open, timeout, bad_input), and the system prompt instructs Claude how to phrase the degradation message based on the code.
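A sketch of how the Orchestrator might render those blocks before handing results back to the model. The tag shape and helper name are illustrative; only the codes (circuit_open, timeout, bad_input) come from the text:

```python
def render_tool_result(name, result=None, error_code=None):
    """Render a tool outcome as the structured block fed back to the
    model: a <tool_error> with a machine-readable code on failure, a
    <tool_result> on success."""
    if error_code is not None:
        return f'<tool_error tool="{name}" code="{error_code}"/>'
    return f'<tool_result tool="{name}">{result}</tool_result>'
```

Keeping the error code machine-readable is what lets the system prompt map `circuit_open` to "I'm not able to check stock right now" rather than a generic apology.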
Failure handling per layer
Lambda / API Gateway
- 5xx from Lambda → API Gateway returns 502, websocket closes
- Cold start > 3s → the user sees the loading indicator for longer; no functional failure
- Lambda OOM → exception, full request fails
Orchestrator (Bedrock)
- Bedrock timeout → 1 retry → template fallback
- Bedrock rate limit → exponential backoff up to 5s, then template
- Bedrock returns malformed tool calls → drop the malformed call, continue with valid ones
MCP server (ECS Fargate)
- Container crash → ECS replaces task; circuit breaker absorbs during transition
- ECS service degraded (e.g., 50% of tasks unhealthy) → ALB drops bad targets; latency spikes; circuit may trip
- ALB unhealthy → static failover page
Backing data stores
- OpenSearch shard unavailable → query may return partial results; agent flags as partial_results: true
- DynamoDB throttle → SDK retries with backoff; if persistent, the tool fails open
- RDS replica lag spike → read may be stale; order tool returns with a stale: true flag
- Redis evicted → read miss → fall through to source of truth
Observability
Every failure is observable.
CloudWatch metrics (per tool, per hour):
- Total calls
- Success rate
- 5xx rate
- Timeout rate
- Circuit breaker state changes
- Guardrail rejection count
- Fallback engagement count
X-Ray traces:
- Every request has a trace; every tool call is a sub-segment
- Failed tools show as red segments with error metadata
Structured logs:
- Request ID, session ID, intent, tools attempted, tools succeeded, final response source (live / template / error)
The single most useful aggregate metric: % of responses that used a fallback path. When this rises, something is wrong upstream — usually before the user-facing latency degrades. It's a leading indicator.
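The aggregate is a one-liner over the structured-log field. A sketch; `responses` is a list of `final response source` values as logged above:

```python
def fallback_rate(responses):
    """Fraction of responses that did not come from the live path —
    the leading indicator described above."""
    if not responses:
        return 0.0
    return sum(1 for src in responses if src != "live") / len(responses)
```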
Validation: Constraint Sanity Check
| Claimed metric / mechanism | Verdict | Why |
|---|---|---|
| Circuit breaker threshold: 5 failures in 60s | Too sensitive on noisy MCPs | A 99% reliable MCP at 100 QPS sees ~6 failures per minute as baseline. The breaker would trip false-positive every minute. Threshold should be relative (e.g., failure rate > 5%), not absolute. |
| Circuit breaker state in Lambda memory | No coordination across instances | Lambda is multi-instance. Each instance has its own breaker state. One instance trips, others keep retrying the failing MCP and tripping their own breakers — recovery is uncoordinated. Real solution: shared state in Redis or DynamoDB, but that adds latency. Tradeoff isn't called out. |
| 2 retries with exponential backoff | Doesn't help with sustained outages | If MCP is down for 10 seconds, 2 retries (200ms + 400ms) still fail. The breaker eventually catches it but the first ~5 affected requests pay the full wall-clock cost. Acceptable, but document it. |
| 8-second wall-clock | See Orchestrator validation | Inconsistent with 10-iteration cap. Real budget is 4–5 iterations. Document accordingly. |
| Single regeneration on guardrail fail | Right policy, weak detection | Regen fixes some failure modes (formatting). It does not fix others (consistent hallucination — model has the wrong belief). The regen policy doesn't differentiate. A regen that fails the same check should escalate, not retry the same prompt. |
| ASIN check, price sanity, link validation as guardrails | Real but partial | These check structured facts but not semantic accuracy. "The return window is 60 days" passes ASIN check, price check, link check — and is wrong (it's 30 days). Semantic validation requires re-running RAG on the answer, which isn't done. |
| Citation match (quoted text in cited chunk) | Hardest, unspecified | The doc lists this as a guardrail. Implementation is non-trivial: substring match? Fuzzy match? Embedding similarity? The implementation choice changes behavior dramatically. Underspecified. |
| Idempotency key on initiate_return | Correct, but needs server-side enforcement | The key is generated client-side. If the client retries with a different idempotency UUID for the same logical operation, dedup fails. Server-side dedup needs a stricter key (e.g., hash of order_id + asin within a 24h window). Not described. |
| Graceful degradation transparency | Easy to regress | "Tell the user what we couldn't do" sounds simple but depends on the LLM following the system prompt. We've seen Claude omit the disclaimer and answer with stale info confidently. Needs explicit testing. |
| Fallback rate as leading indicator | Right metric, no alerting threshold | The doc names the metric but doesn't say at what fallback rate we page someone. 5%? 20%? Without a threshold this is a graph someone looks at, not an alert. |
The biggest issue: circuit breaker state isn't shared
Lambda runs as many concurrent instances. Each instance has independent in-memory state for circuit breakers. Consequences:
- One instance trips the breaker; nine others continue hammering the failing MCP.
- A single bad request goes to instance A, trips it; the same query routed to instance B succeeds (because B's breaker hasn't seen the failure).
- "Half-open" probe requests fan out across instances; under load you get many simultaneous probes, defeating the purpose.
Real solutions:
1. Shared circuit state in DynamoDB with TTL — adds 10–30ms per check, but coherent
2. Move circuit breakers to a sidecar / mesh layer (App Mesh) — operationally heavier, fully consistent
The current design is "best-effort circuit breaking," and that's a non-trivial caveat. The doc treats the breakers as authoritative when they are not.
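Option 1 can be sketched with an in-memory dict standing in for the DynamoDB table (a real implementation would use a PutItem with a TTL attribute; class and method names are ours):

```python
import time

class SharedBreakerState:
    """Sketch of shared open/closed state keyed per MCP. Every Lambda
    instance reads this before calling the tool, so one trip is
    visible to all instances. The dict stands in for a DynamoDB
    table whose TTL attribute expires the 'open' record."""

    def __init__(self, clock=time.monotonic):
        self.table = {}       # mcp_name -> expiry timestamp (the TTL)
        self.clock = clock

    def trip(self, mcp_name, cooldown=30.0):
        # DynamoDB equivalent: PutItem with ttl = now + cooldown
        self.table[mcp_name] = self.clock() + cooldown

    def is_open(self, mcp_name):
        expiry = self.table.get(mcp_name)
        return expiry is not None and self.clock() < expiry
```

The TTL doubles as the cooldown: expiry naturally transitions every instance to "try again," which also serializes probing better than per-instance half-open timers.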
Sustained outage handling is undocumented
What happens if Order MCP is down for 30 minutes? The breaker opens, then half-opens every 30s and probes. The probe fails. Re-opens. Repeat. Meanwhile, every user asking about orders gets a degraded response.
Should we:
- Page someone after N minutes in open state? (yes, but the threshold isn't documented)
- Switch to a static "we're having issues" message after K minutes? (probably, not documented)
- Disable order-shaped intents entirely after some duration? (defensible, not documented)
Without this, the system limps along indefinitely on broken state. The "fail fast" philosophy needs an "escalate to human operators" companion.
Guardrails catch structured errors but not factual ones
The guardrail layer catches:
- Made-up ASINs (good)
- Wrong prices (good)
- Broken links (good)
It does not catch:
- "Returns are accepted within 60 days" (false; actual is 30)
- "This title was published in 2018" (false; it was 2015)
- "Author X's other works include Y" (where Y is fabricated)
These are grounded in the LLM's training data, not in retrieved chunks. The only defense is: instruct the model to never assert facts that aren't in retrieved chunks, and validate output against retrieved chunks. The validation step is the missing piece. Without it, factual hallucinations escape.
Idempotency key is client-side
The doc says initiate_return uses an idempotency key. But the key is provided by the client (the orchestrator's Lambda). If the orchestrator retries with a fresh UUID — which is what happens on a retry across separate Lambda invocations — the dedup fails and we create a duplicate RMA.
The fix: the idempotency key for returns should be derived from the operation, not generated freshly. E.g., sha256(order_id + asin + DATE()) so retries within the same day always produce the same key. This needs to be specified server-side, not just trusted client-side.
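The derived-key fix sketched above, in code. A sketch under the text's own assumption of a 24h dedup window; the key prefix mirrors the earlier `return:` format:

```python
import hashlib
import datetime

def derived_return_key(order_id, asin, day=None):
    """Derive the idempotency key from the operation itself, so any
    retry of the same logical return — even across separate Lambda
    invocations — produces the same key."""
    day = day or datetime.date.today().isoformat()   # the 24h window
    digest = hashlib.sha256(f"{order_id}:{asin}:{day}".encode()).hexdigest()
    return f"return:{order_id}:{asin}:{digest[:16]}"
```

The server should recompute and enforce this key itself rather than trust whatever the client sends; otherwise a client bug reintroduces the duplicate-RMA window.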
"Graceful degradation transparency" is prompt-dependent
The agent telling the user "I couldn't check stock right now" depends entirely on Claude obeying the system prompt instruction to disclose tool failures. We've observed cases where Claude:
- Omits the disclaimer and answers with cached/stale info
- Hallucinates a substitute answer ("looks like it's in stock based on your previous order")
- Reroutes to a different tool that wasn't intended
Defending this requires either structural enforcement (the response template explicitly has an unavailable_tools: [] field that the system prompt forces to be populated) or post-hoc check (verify all unavailable tools are mentioned in the response). Neither is documented.
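The post-hoc variant can be sketched as a phrase-presence check. A sketch under assumptions: `phrases` maps each tool name to acceptable disclosure wordings, which in practice would need to be curated (a brittle spot this simple substring match doesn't hide):

```python
def disclosure_missing(response_text, unavailable_tools, phrases):
    """Return the unavailable tools whose failure the response never
    discloses. A non-empty result means regenerate or fall back."""
    text = response_text.lower()
    missing = []
    for tool in unavailable_tools:
        accepted = phrases.get(tool, [])
        if not any(p.lower() in text for p in accepted):
            missing.append(tool)
    return missing
```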
Related documents
- 01-orchestrator-agent.md — Stopping conditions in the agent loop
- 03-order-status-agent.md — Saga compensation for return writes
- 05-manga-qa-agent.md — Hallucination defense for free-form text
- 09-escalation-workflow.md — When degradation isn't enough
- ../Monitoring-GenAI-Systems/ — Observability practices