
09 — Escalation Workflow

When the chatbot can't help, how it hands off to a human. The handoff is the most-watched and least-tested path in the system.

Most user issues are handled by the chatbot. The ones that aren't are usually the most important: a billing dispute, a fraudulent order, a frustrated customer about to churn. Escalation is the safety valve for cases the chatbot shouldn't try to handle alone — and the quality of the handoff is what separates "the chatbot couldn't help me" from "I had to repeat my entire problem to the human."


When to escalate

Five trigger conditions. Any of them fires → escalation flow.

Trigger | Detection | Confidence
User explicit request | Phrase match: "talk to a human", "agent", "representative" | High
Billing or payment dispute | Intent classifier or keyword match: "charged twice", "fraud", "dispute" | High
Repeated tool failures | 2+ consecutive failed tool calls in same intent | Medium
Sensitive topic | Keyword + topic classifier: harassment, legal, safety | Medium
Confidence drop | Low confidence on intent classification (< 0.6) | Low

Notice the spread of detection methods. Explicit user requests are highest confidence (the user is literally asking). Confidence-drop triggers are the most fragile — they assume we have a calibrated confidence number, which we mostly don't.
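A minimal sketch of how this trigger check could sit in front of the orchestrator, assuming a plain rule layer evaluated once per turn. The names (TurnState, evaluate_triggers) and the phrase lists are illustrative, not from the codebase:

from dataclasses import dataclass
from typing import Optional

# Phrase and keyword lists are stand-ins for the real matchers.
EXPLICIT_PHRASES = ("talk to a human", "agent", "representative")
DISPUTE_KEYWORDS = ("charged twice", "fraud", "dispute")

@dataclass
class TurnState:
    user_message: str
    intent: str
    intent_confidence: float           # from the intent classifier, not the LLM
    consecutive_tool_failures: int     # within the current intent
    sensitive_topic: bool              # from the keyword + topic classifier

def evaluate_triggers(turn: TurnState) -> Optional[str]:
    """Return the trigger type that fired, or None. Checked in confidence order,
    so an explicit request wins over a shaky classifier score."""
    text = turn.user_message.lower()
    if any(p in text for p in EXPLICIT_PHRASES):
        return "user_explicit_request"             # high confidence
    if turn.intent == "billing_dispute" or any(k in text for k in DISPUTE_KEYWORDS):
        return "billing_dispute"                   # high confidence
    if turn.consecutive_tool_failures >= 2:
        return "tool_failure"                      # medium confidence
    if turn.sensitive_topic:
        return "sensitive_topic"                   # medium confidence
    if turn.intent_confidence < 0.6:
        return "low_confidence"                    # low confidence; see the sanity check below
    return None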


The handoff sequence

1. Trigger fires (any of the five above)
       ↓
2. Orchestrator emits final user-facing message:
   "I'll connect you with a person. They'll have our conversation
    so far, so you won't need to repeat yourself."
       ↓
3. Context snapshot generation (synchronous, ~200ms)
       ↓
4. Snapshot persisted to DynamoDB (escalation_id keyed)
       ↓
5. SNS publish: { escalation_id, queue, priority, summary }
       ↓
6. Amazon Connect picks up event → routes to agent queue
       ↓
7. Human agent's Connect screen loads:
   - Full conversation transcript
   - Auto-generated summary
   - Extracted entities (order ID, ASIN, etc.)
   - Sentiment / urgency tags
       ↓
8. Human agent takes over the conversation
       ↓
9. Chatbot session marked "escalated"; subsequent user messages
   in this session route to the Connect agent, not the bot

The target for the whole sequence, from trigger to "a human is reading my problem," is ≤ 30 seconds. The bottleneck is queue depth (how long until a human is available), not the technical handoff.
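Steps 4 and 5 in code form, a rough boto3 sketch. The table name, topic ARN, and key layout are placeholders, not the real resources:

import json
import time

import boto3

# Resource names below are placeholders.
dynamodb = boto3.resource("dynamodb")
sns = boto3.client("sns")
ESCALATION_TABLE = dynamodb.Table("chatbot-escalations")
ESCALATION_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:chatbot-escalations"

def hand_off(escalation_id: str, snapshot: dict, queue: str, priority: str) -> None:
    """Steps 4-5: persist the snapshot keyed by escalation_id, then publish the
    SNS event that Connect's routing layer consumes."""
    ESCALATION_TABLE.put_item(Item={
        "escalation_id": escalation_id,        # assumed partition key
        "created_at": int(time.time()),
        "snapshot": json.dumps(snapshot),
    })
    sns.publish(
        TopicArn=ESCALATION_TOPIC_ARN,
        Message=json.dumps({
            "escalation_id": escalation_id,
            "queue": queue,
            "priority": priority,
            "summary": snapshot["conversation"]["summary"],
        }),
    )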


The context snapshot

The most important artifact in the entire flow. It's what lets the human agent skip the "tell me your problem from the beginning" step that customers hate.

Snapshot structure:

{
  "escalation_id": "esc_abc123",
  "session_id": "sess_xyz789",
  "user": {
    "user_id": "...",
    "language_pref": "en",
    "prime_member": true,
    "tenure_years": 4,
    "previous_escalations_30d": 0
  },
  "conversation": {
    "summary": "User is asking about a delayed order (ORD-12345) for Berserk Vol 42. Tracking shows the package stuck in transit for 5 days. User has requested a refund or expedited replacement. Chatbot offered standard replacement but couldn't confirm expedite eligibility.",
    "key_entities": {
      "order_id": "ORD-12345",
      "asin": "B07X1234",
      "issue_type": "delayed_delivery"
    },
    "sentiment_trend": ["neutral", "neutral", "frustrated", "frustrated"],
    "tools_attempted": [
      {"tool": "get_order_status", "success": true},
      {"tool": "check_refund_eligibility", "success": true},
      {"tool": "expedite_shipping", "success": false, "error": "policy_unclear"}
    ],
    "transcript": [...] // full last 20 turns
  },
  "trigger": {
    "type": "tool_failure",
    "details": "expedite_shipping returned policy_unclear after 2 retries"
  },
  "priority": "medium",
  "suggested_actions": [
    "Verify expedite eligibility manually",
    "Offer goodwill credit if expedite not possible"
  ]
}

The summary, key entities, and suggested actions are LLM-generated. The transcript and tool history are factual. Both are surfaced to the agent.
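A sketch of how the snapshot could be assembled so the LLM-generated fields stay clearly separated from the factual ones. The session attributes and the llm_summarize interface are assumptions, not the actual code:

import uuid

def build_snapshot(session, trigger: dict, priority: str, llm_summarize) -> dict:
    """Assemble the context snapshot. Factual fields come straight from session
    state; summary, key_entities, and suggested_actions come from one LLM call."""
    generated = llm_summarize(session.transcript)        # single Bedrock call (assumed interface)
    return {
        "escalation_id": f"esc_{uuid.uuid4().hex[:8]}",
        "session_id": session.session_id,
        "user": session.user_profile,                    # factual
        "conversation": {
            "summary": generated["summary"],             # LLM-generated
            "key_entities": generated["key_entities"],   # LLM-generated
            "sentiment_trend": session.sentiment_trend,  # factual (per-turn classifier)
            "tools_attempted": session.tool_history,     # factual
            "transcript": session.transcript[-20:],      # factual, last 20 turns
        },
        "trigger": trigger,
        "priority": priority,                            # rule-based; see "Priority and routing"
        "suggested_actions": generated["suggested_actions"],  # LLM-generated
    }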


Priority and routing

Escalations don't all go to the same queue:

Priority | Examples | Target pickup
Urgent | Fraud claims, account access issues, safety concerns | < 1 minute
High | Billing disputes, repeat escalations from same user, Prime member with multi-failure | < 3 minutes
Medium | Tool failures, frustrated sentiment, complex returns | < 10 minutes
Low | "I'd rather talk to a person" (everything else healthy) | < 30 minutes

Priority is set by the chatbot at handoff time and surfaced to Connect's routing layer. SLA misses on Urgent/High trigger paging.
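The rule-based mapping might look roughly like this; trigger names and thresholds mirror the table above but are illustrative:

URGENT_TRIGGERS = {"fraud_claim", "account_access", "safety_concern"}

def assign_priority(trigger: dict, user: dict, failed_tool_calls: int) -> str:
    """Rule-based priority, set by the chatbot at handoff time."""
    if trigger["type"] in URGENT_TRIGGERS:
        return "urgent"
    if trigger["type"] == "billing_dispute":
        return "high"
    if user.get("previous_escalations_30d", 0) > 0:
        return "high"        # repeat escalation from the same user
    if user.get("prime_member") and failed_tool_calls >= 2:
        return "high"        # Prime member with multiple failures
    if trigger["type"] in {"tool_failure", "frustrated_sentiment"}:
        return "medium"
    return "low"             # e.g. "I'd rather talk to a person" with an otherwise healthy session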


State during and after escalation

Once a session is escalated:

Chatbot session state: "escalated"
  ↓
Subsequent user messages → routed to Connect, NOT to bot
  ↓
Bot is silent until either:
  - Human agent ends the chat → session closed
  - Human agent passes back ("I'll let our bot help with the rest") → bot resumes

The "pass back" path is rare and operationally tricky — it means the bot has to read the human's resolution as conversation history and continue from there. Risk: bot contradicts the human's resolution. Mitigation: the bot is given a system prompt addendum on resume: "A human agent has been handling this. Their statements are authoritative. Do not contradict."


Post-escalation analysis

Every escalation is logged for review:

Metric | Why it matters
Trigger type distribution | Are we escalating for the right reasons?
Time-to-pickup by priority | Are SLAs being met?
Agent resolution time | How long does the human take to resolve?
Resolution outcome (resolved / refund / further escalation) | What's the success rate?
User satisfaction (post-chat survey) | The ground truth
"Could the bot have handled this?" rating (sample audited weekly) | Are we over-escalating?

The last metric is the hardest. A human auditor reviews 100 escalations per week and rates whether the bot could have handled them with the right tool. High "could have" rates indicate over-escalation (wasted human time); low rates mean we're escalating only what's truly needed.
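The per-escalation record behind those metrics might look like this; field names are illustrative:

from dataclasses import dataclass
from typing import Optional

@dataclass
class EscalationReviewRecord:
    """One row in the post-escalation analysis set."""
    escalation_id: str
    trigger_type: str                        # trigger type distribution
    priority: str
    time_to_pickup_s: float                  # SLA tracking by priority
    agent_resolution_time_s: float
    resolution_outcome: str                  # "resolved" | "refund" | "further_escalation"
    csat_score: Optional[int]                # post-chat survey, if answered
    bot_could_have_handled: Optional[bool]   # filled in only for the weekly audited sample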


Why this shape

Alternative | Why we rejected it
No escalation (bot handles everything) | Bad UX for sensitive issues; legal risk for billing/fraud
Always escalate to humans (no bot) | Defeats the purpose; cost-prohibitive
User pushes a button to escalate (no auto-trigger) | Most users tolerate problems too long before asking; satisfaction drops
Multi-stage escalation (junior → senior → expert) | Adds latency; rarely needed at this scope
LLM picks the priority | Inconsistent calibration; rule-based priority is more auditable
No context snapshot (just transcript) | Agent has to read the whole transcript; defeats the speed benefit

Validation: Constraint Sanity Check

Claimed metric / mechanism | Verdict | Why
Trigger: "confidence < 0.6" | Where does this number come from? | Claude doesn't natively expose calibrated confidence. If we're computing it from log-probs (token-level uncertainty), Bedrock doesn't reliably expose those. If it comes from intent-classifier confidence, that's a separate model with its own calibration issues. The 0.6 threshold is a number applied to a quantity whose meaning is unclear.
Trigger: 2 consecutive tool failures | Too aggressive | Two failures could be transient (circuit-breaker test, brief MCP blip). Escalating on this alone is operationally noisy. Better: escalate after 2 failures within the same intent and the breaker is open and the fallback also failed.
Time to handoff ≤ 30 seconds | Technical part fine; queue is the bottleneck | Snapshot generation is ~200ms; SNS publish is ~50ms; Connect routing is ~1s. The 30s budget is mostly queue depth, which depends on agent staffing — outside engineering control. The 30s is meaningful only when staff is available; during peak it could be 10+ minutes.
LLM-generated summary in snapshot | Hallucination risk | The summary is what the human agent reads first. If the summary says "user wants a refund" and the user actually asked for a replacement, the agent starts on the wrong footing. No mention of validation: does the summary mention every entity from the transcript? Does it accurately represent the user's request? It should be checked against the transcript.
Suggested actions (LLM-generated) | High risk if the agent follows blindly | Agents are time-pressured. If the bot suggests "offer goodwill credit," some agents will execute it without further analysis. Wrong suggestions cost money. Suggested actions should come with confidence scores and require explicit agent confirmation.
Priority levels with target pickup times | Aspirational without SLA enforcement | The doc lists targets but doesn't say what happens when they're missed. Are alerts wired? Is staffing adjusted? Without enforcement, "< 1 minute for Urgent" is a wish.
Context transcript length: "last 20 turns" | Probably too long | An agent reading 20 turns mid-call wastes time. The summary should be the primary surface; the transcript is the fallback. The UX should default to summary expanded, transcript collapsed.
Pass-back from human to bot | High contradiction risk | "Bot resumes after human leaves" is rare for good reason. Even with the system prompt addendum, the bot may contradict policy decisions made by the human (e.g., the human authorized a one-off refund; the bot, on a follow-up question, says it's against policy). Should be either fully disabled or carefully gated by intent type.
"Could the bot have handled this?" weekly audit | Right metric, expensive process | Manual review of 100 escalations/week is expensive (~10 hours/week of senior support time). The doc doesn't quantify the staffing cost. Without continuous funding, this drops first when budgets tighten.
Trigger distribution monitored | Implied but not specified | What's the alert if the "explicit user request" trigger spikes to 3x normal? Is that a sign the bot is failing more, or that users learned a workaround? The distinction matters.

"confidence < 0.6" is the trigger most likely to fire incorrectly

In practice, what often happens:

  • Claude is confident in a wrong answer → no escalation, user gets bad info
  • Claude is unconfident on a routine question → unnecessary escalation, queue depth grows
  • Claude's "confidence" is loosely defined → threshold is meaningless

A more principled approach:

  • Don't trigger on calibrated confidence directly
  • Trigger on observable behavior: user repeats the question, user expresses frustration ("that's not what I asked"), user requests escalation, retrieval returned no relevant chunks
  • These are concrete signals that don't require a calibrated probability

The doc lists "user explicit request" and "repeated failures" — those are good. The "confidence < 0.6" line should probably be deleted unless and until we can ground it in something measurable.

LLM-generated summaries introduce a new failure mode

The summary is created by Claude at handoff time, not pre-curated. If the model misrepresents the conversation, the agent starts wrong. Real defenses:

  1. Validate that the summary mentions every entity from extracted_entities. If it doesn't, regenerate or surface a warning.
  2. Show the agent a side-by-side: summary on the left, key transcript snippets on the right.
  3. Allow the agent to flag a bad summary — feedback loop for prompt improvement.

None of these are documented. The summary is the output of a single Bedrock call, served as gospel.
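Defense 1 is cheap to sketch. Exact string matching is crude (coded values like issue_type need a display-name lookup), but it catches the obvious misses:

def missing_entities(summary: str, key_entities: dict) -> list[str]:
    """Return entity values the LLM summary fails to mention (crude exact-match check)."""
    return [v for v in key_entities.values()
            if isinstance(v, str) and v.lower() not in summary.lower()]

# Using the example snapshot above: ORD-12345 appears in the summary, B07X1234 does not.
gaps = missing_entities(
    "User is asking about a delayed order (ORD-12345) for Berserk Vol 42. ...",
    {"order_id": "ORD-12345", "asin": "B07X1234", "issue_type": "delayed_delivery"},
)
# gaps == ["B07X1234", "delayed_delivery"]: the ASIN is a real miss worth a warning banner;
# the issue_type hit shows why coded values need a lookup rather than literal matching.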

Suggested actions are over-trusted

"Suggested actions: offer goodwill credit if expedite not possible" sounds like advice. To a busy agent, it sounds like a recommendation from the system. Some will follow it without thinking. If the suggestion is wrong (the user isn't eligible for credit), the agent has now committed to it on a customer call.

Better framing: "Possible actions to consider: …" with disclaimers, or no auto-suggested actions at all (let the human decide). The current architecture makes the bot's reasoning feel authoritative when it's just a guess.

Pass-back from human to bot should be disabled

The doc allows this path with a system-prompt addendum ("statements by human agent are authoritative"). But:

  • Customer asks follow-up: "What about the other order?"
  • Bot doesn't have context that the human resolved a specific issue
  • Bot may give answer that contradicts what the human just decided

Real solution: once escalated, stay escalated for the session. The pass-back path is rarely worth the contradiction risk. If the user has a new question, route it to the bot with a fresh session — they're past the escalated topic.

"≤ 30 second handoff" is misleading

The technical handoff is ~2 seconds. The "30 seconds" includes queue wait at light load. At peak load, queue wait can be 10+ minutes. Quoting "≤ 30 seconds" as a target sets customer expectations the system can't meet under normal busy conditions.

Honest framing: "Technical handoff: < 2 seconds. Time-to-agent: depends on queue depth, target P50 < 3 minutes for medium priority."

No "what if Connect is down" path

Connect itself can fail (regional outage, service degradation). What's the path?

  • Bot remains in escalated state
  • User keeps typing → no one is reading
  • Eventually times out

A real failover would route to a backup channel (email, callback request) when Connect is unavailable. The doc doesn't mention this. Single point of failure.
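A minimal sketch of what that failover could look like, assuming a health signal for Connect and a callback store; none of these names exist in the current design:

import json

def route_escalation(snapshot: dict, connect_healthy: bool, sns_client, callback_store) -> str:
    """If Connect is reachable, publish the normal handoff event; otherwise persist a
    callback request so the escalation isn't silently lost."""
    if connect_healthy:
        sns_client.publish(
            TopicArn="arn:aws:sns:us-east-1:123456789012:chatbot-escalations",  # placeholder
            Message=json.dumps({
                "escalation_id": snapshot["escalation_id"],
                "priority": snapshot["priority"],
                "summary": snapshot["conversation"]["summary"],
            }),
        )
        return "connect"
    # Backup channel: queue a callback / email follow-up and tell the user explicitly
    # that a person will reach out, instead of leaving them typing into a void.
    callback_store.put_callback_request(
        escalation_id=snapshot["escalation_id"],
        priority=snapshot["priority"],
    )
    return "callback_fallback"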