# 07: Agent Performance Framework

## AIP-C01 Mapping

- **Content Domain 5:** Testing, Validation, and Troubleshooting
- **Task 5.1:** Implement evaluation systems for GenAI
- **Skill 5.1.7:** Evaluate agent performance (for example, task completion rates, tool usage accuracy, and reasoning quality)
## User Story

**As a** MangaAssist engineering manager,
**I want** to measure how well the chatbot agent completes end-to-end tasks — correctly classifying intents, selecting the right tools/APIs, reasoning through multi-step queries, and reaching satisfactory resolutions —
**so that** we can identify bottlenecks in the agent pipeline and improve the weakest links in task completion.
## Acceptance Criteria
- Task completion rate tracked per intent and per conversation complexity level
- Tool usage accuracy measures whether the agent calls the correct APIs with correct parameters
- Reasoning quality evaluated for multi-step queries requiring chained logic
- Intent classification accuracy tracked as a separate metric from response quality
- Escalation appropriateness evaluated — correct escalations vs. unnecessary escalations vs. missed escalations
- Multi-turn coherence: agent maintains context and progresses toward resolution across turns
- Agent latency budget tracked per pipeline stage
- Failure mode taxonomy: categorized reasons for task incompletion
## Why Agent Performance Differs from Response Quality
Response quality (Skill 5.1.1) evaluates the text output. Agent performance evaluates the full pipeline: did the agent do the right thing, not just say the right thing?
Consider this MangaAssist interaction:
**User:** "I ordered One Piece Volume 108 last week. Where is it?"

**Agent:** "One Piece Volume 108 is a popular manga! It features the continuing adventure of Luffy and the Straw Hat Pirates in the Egghead Island arc."
The response is high-quality text — relevant, factual, fluent. But the agent failed:
- Intent should be order_tracking, but was misclassified as product_question
- Should have called the Order Status API, but instead called the Product Catalog API
- Did not resolve the user's actual task (finding their order)
| Dimension | Response Quality Score | Agent Performance Score |
|---|---|---|
| Text Relevance | 0.70 (mentioned the product) | N/A |
| Task Completion | N/A | 0.0 (wrong task performed) |
| Tool Selection | N/A | 0.0 (wrong API called) |
| Intent Accuracy | N/A | 0.0 (misclassified) |
Agent performance evaluation catches failures that response quality metrics miss.
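The gap between the two scoring views can be sketched with a minimal, dictionary-based scorer. The function and field names here are illustrative, not part of the MangaAssist codebase:

```python
# Minimal sketch: score pipeline correctness rather than text quality.
# score_agent_performance and all field names are illustrative assumptions.

def score_agent_performance(trace: dict, expected: dict) -> dict:
    """Compare what the agent actually did against the expected pipeline path."""
    return {
        "intent_accuracy": 1.0 if trace["intent"] == expected["intent"] else 0.0,
        "tool_accuracy": 1.0 if trace["api_called"] == expected["api"] else 0.0,
        "task_completion": 1.0 if trace["task_done"] == expected["task"] else 0.0,
    }

# The One Piece interaction above: fluent answer, wrong pipeline path.
trace = {"intent": "product_question", "api_called": "product_catalog", "task_done": "product_info"}
expected = {"intent": "order_tracking", "api": "order_status", "task": "order_lookup"}
scores = score_agent_performance(trace, expected)
print(scores)  # every agent-level dimension is 0.0
```

Every agent-level dimension scores 0.0 here even though a text-relevance metric would have given the same response partial credit.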
## High-Level Design

### Agent Performance Evaluation Architecture

```mermaid
graph TD
    subgraph "Agent Pipeline Under Test"
        A[User Query] --> B[Intent Classifier<br>SageMaker]
        B --> C[Tool Selector<br>Orchestrator]
        C --> D1[Product Catalog API]
        C --> D2[Order Status API]
        C --> D3[RAG Pipeline]
        C --> D4[Recommendation Engine]
        C --> D5[Escalation Handler]
        D1 --> E[Response Generator<br>Bedrock]
        D2 --> E
        D3 --> E
        D4 --> E
        D5 --> E
        E --> F[Response to User]
    end
    subgraph "Agent Performance Evaluator"
        A --> G[Expected Trace]
        F --> G
        G --> H1[Intent Accuracy<br>Predicted vs Expected]
        G --> H2["Tool Usage Accuracy<br>Correct API called?<br>Correct parameters?"]
        G --> H3[Reasoning Quality<br>Multi-step logic chain]
        G --> H4["Task Completion<br>Did user's need get met?"]
        G --> H5["Escalation Appropriateness<br>Should/shouldn't have escalated"]
    end
    subgraph "Aggregation"
        H1 --> I[Agent Performance Report]
        H2 --> I
        H3 --> I
        H4 --> I
        H5 --> I
        I --> J[CloudWatch Dashboard]
        I --> K[Failure Taxonomy]
    end
```
### Multi-Turn Task Execution Trace

```mermaid
sequenceDiagram
    participant U as User
    participant IC as Intent Classifier
    participant O as Orchestrator
    participant API as Backend APIs
    participant LLM as Bedrock Claude
    participant Eval as Performance Evaluator

    U->>IC: "I want to return my Naruto box set"
    IC->>IC: Classify: return_request (confidence: 0.92)
    IC->>O: intent=return_request
    O->>API: GET /orders?product=naruto+box+set&status=delivered
    API->>O: order_id=AZ-78901, delivered 3 days ago
    O->>LLM: Generate return initiation response
    LLM->>U: "I can help with that! Your Naruto Box Set (Order AZ-78901)..."
    U->>IC: "Actually, I want to exchange it for the deluxe edition"
    IC->>IC: Classify: return_request (confidence: 0.85)
    Note over IC: Should reclassify to exchange or maintain return context
    IC->>O: intent=return_request, sub_intent=exchange
    O->>API: GET /products?query=naruto+box+set+deluxe
    API->>O: product_id=B0123, price=$89.99
    O->>API: POST /returns/exchange {order_id, new_product_id}
    API->>O: exchange_request_id=EX-456
    O->>LLM: Generate exchange confirmation
    LLM->>U: "I've initiated an exchange..."
    Note over Eval: Evaluate the full trace
    Eval->>Eval: ✅ Intent: correct (return → exchange transition)
    Eval->>Eval: ✅ Tool: correct APIs called in correct order
    Eval->>Eval: ✅ Parameters: correct product lookup, correct order reference
    Eval->>Eval: ✅ Task: exchange initiated successfully
    Eval->>Eval: ⚠️ Reasoning: could have proactively mentioned price difference
```
### Failure Mode Taxonomy

```mermaid
graph TD
    A[Task Failure] --> B[Intent Failure]
    A --> C[Tool Failure]
    A --> D[Reasoning Failure]
    A --> E[Escalation Failure]
    B --> B1["Misclassification<br/>Wrong intent detected"]
    B --> B2["Low Confidence<br/>Correct but uncertain"]
    B --> B3["Intent Ambiguity<br/>Multiple valid intents"]
    C --> C1["Wrong Tool Selected<br/>Called wrong API"]
    C --> C2["Wrong Parameters<br/>Right API, wrong params"]
    C --> C3["Missing Tool Call<br/>Needed API but skipped"]
    C --> C4["Tool Error<br/>API returned error"]
    D --> D1["Single-Step Error<br/>Wrong conclusion from data"]
    D --> D2["Multi-Step Error<br/>Lost context between steps"]
    D --> D3["Coreference Failure<br/>Lost track of 'it' or 'that'"]
    E --> E1["Missed Escalation<br/>Should have escalated"]
    E --> E2["Premature Escalation<br/>Could have resolved"]
    E --> E3["Wrong Escalation Path<br/>Wrong team/channel"]
```
## Low-Level Design

### Agent Performance Data Model

```python
from dataclasses import dataclass, field
from enum import Enum
import time
import uuid


class TaskStatus(Enum):
    COMPLETED = "completed"
    PARTIALLY_COMPLETED = "partially_completed"
    FAILED = "failed"
    ESCALATED = "escalated"


class FailureCategory(Enum):
    INTENT_MISCLASSIFICATION = "intent_misclassification"
    WRONG_TOOL = "wrong_tool"
    WRONG_PARAMETERS = "wrong_parameters"
    MISSING_TOOL_CALL = "missing_tool_call"
    TOOL_ERROR = "tool_error"
    REASONING_ERROR = "reasoning_error"
    COREFERENCE_FAILURE = "coreference_failure"
    MISSED_ESCALATION = "missed_escalation"
    PREMATURE_ESCALATION = "premature_escalation"
    CONTEXT_LOST = "context_lost"


@dataclass
class AgentAction:
    """A single action taken by the agent (intent classification, API call, generation)."""
    action_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    action_type: str = ""  # "intent_classification", "api_call", "generation", "escalation"
    input_data: dict = field(default_factory=dict)
    output_data: dict = field(default_factory=dict)
    expected_output: dict = field(default_factory=dict)  # Ground truth
    latency_ms: float = 0.0
    correct: bool = True
    error_message: str = ""


@dataclass
class AgentTrace:
    """Full execution trace of the agent for one conversation turn or session."""
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    session_id: str = ""
    actions: list[AgentAction] = field(default_factory=list)
    total_turns: int = 0
    task_status: TaskStatus = TaskStatus.COMPLETED
    failure_categories: list[FailureCategory] = field(default_factory=list)
    timestamp: float = field(default_factory=time.time)


@dataclass
class AgentPerformanceMetrics:
    """Aggregated agent performance metrics."""
    task_completion_rate: float = 0.0     # % of tasks fully completed
    partial_completion_rate: float = 0.0  # % partially completed
    intent_accuracy: float = 0.0          # % of correct intent classifications
    tool_selection_accuracy: float = 0.0  # % of correct tool/API selections
    tool_parameter_accuracy: float = 0.0  # % of correct parameters when right tool
    reasoning_accuracy: float = 0.0       # % of correct multi-step reasoning
    escalation_precision: float = 0.0     # Correct escalations / total escalations
    escalation_recall: float = 0.0        # Correct escalations / should-have-escalated
    avg_turns_to_resolution: float = 0.0
    failure_distribution: dict[str, int] = field(default_factory=dict)
    total_traces: int = 0
```
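As a quick illustration, here is how a ground-truth trace and a captured trace for the order-tracking example might be constructed. The data classes are re-declared below in minimal form (a field subset) so the sketch runs standalone; the parameter values are illustrative:

```python
from dataclasses import dataclass, field


# Minimal re-declaration of the data model above (field subset only).
@dataclass
class AgentAction:
    action_type: str = ""
    output_data: dict = field(default_factory=dict)
    expected_output: dict = field(default_factory=dict)


@dataclass
class AgentTrace:
    actions: list = field(default_factory=list)


# Ground truth for "I ordered One Piece Volume 108 last week. Where is it?"
expected = AgentTrace(actions=[
    AgentAction("intent_classification", expected_output={"intent": "order_tracking"}),
    AgentAction("api_call", expected_output={"api_name": "order_status",
                                             "parameters": {"product": "One Piece Vol 108"}}),
])

# Trace captured from a correct agent run
actual = AgentTrace(actions=[
    AgentAction("intent_classification", output_data={"intent": "order_tracking"}),
    AgentAction("api_call", output_data={"api_name": "order_status",
                                         "parameters": {"product": "One Piece Vol 108"}}),
])
print(actual.actions[0].output_data["intent"])  # order_tracking
```

The evaluator's job is then purely a structural comparison between `actual` and `expected`.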
### Agent Performance Evaluator

```python
import logging

import boto3

logger = logging.getLogger(__name__)


class AgentPerformanceEvaluator:
    """Evaluates end-to-end agent performance across all pipeline stages.

    Evaluation flow:
    1. Replay test conversations against the agent
    2. Capture execution traces (intent, tool calls, parameters, response)
    3. Compare traces against expected traces (ground truth)
    4. Score each dimension and aggregate
    """

    def __init__(self, cloudwatch_namespace: str = "MangaAssist/AgentPerformance"):
        self.cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
        self.namespace = cloudwatch_namespace

    def evaluate_trace(self, trace: AgentTrace, expected_trace: AgentTrace) -> dict:
        """Evaluate a single agent trace against expected behavior."""
        results = {
            "trace_id": trace.trace_id,
            "intent_accuracy": self._evaluate_intents(trace, expected_trace),
            "tool_accuracy": self._evaluate_tool_usage(trace, expected_trace),
            "parameter_accuracy": self._evaluate_parameters(trace, expected_trace),
            "reasoning_quality": self._evaluate_reasoning(trace, expected_trace),
            "task_completion": self._evaluate_task_completion(trace, expected_trace),
            "escalation_quality": self._evaluate_escalation(trace, expected_trace),
            "failure_categories": [],
        }
        # Identify failure categories
        if results["intent_accuracy"] < 1.0:
            results["failure_categories"].append(FailureCategory.INTENT_MISCLASSIFICATION.value)
        if results["tool_accuracy"] < 1.0:
            results["failure_categories"].append(FailureCategory.WRONG_TOOL.value)
        if results["parameter_accuracy"] < 1.0:
            results["failure_categories"].append(FailureCategory.WRONG_PARAMETERS.value)
        return results

    def evaluate_batch(
        self, traces: list[tuple[AgentTrace, AgentTrace]]
    ) -> AgentPerformanceMetrics:
        """Evaluate a batch of agent traces and compute aggregate metrics."""
        all_results = [self.evaluate_trace(actual, expected) for actual, expected in traces]
        if not all_results:
            return AgentPerformanceMetrics()
        n = len(all_results)
        # Task completion
        completed = sum(1 for r in all_results if r["task_completion"] >= 1.0)
        partial = sum(1 for r in all_results if 0 < r["task_completion"] < 1.0)
        # Failure distribution
        failure_dist: dict[str, int] = {}
        for r in all_results:
            for cat in r["failure_categories"]:
                failure_dist[cat] = failure_dist.get(cat, 0) + 1
        metrics = AgentPerformanceMetrics(
            task_completion_rate=completed / n,
            partial_completion_rate=partial / n,
            intent_accuracy=sum(r["intent_accuracy"] for r in all_results) / n,
            tool_selection_accuracy=sum(r["tool_accuracy"] for r in all_results) / n,
            tool_parameter_accuracy=sum(r["parameter_accuracy"] for r in all_results) / n,
            reasoning_accuracy=sum(r["reasoning_quality"] for r in all_results) / n,
            failure_distribution=failure_dist,
            total_traces=n,
        )
        self._publish_metrics(metrics)
        return metrics

    def _evaluate_intents(self, actual: AgentTrace, expected: AgentTrace) -> float:
        """Evaluate intent classification accuracy across all turns."""
        actual_intents = [
            a for a in actual.actions if a.action_type == "intent_classification"
        ]
        expected_intents = [
            a for a in expected.actions if a.action_type == "intent_classification"
        ]
        if not expected_intents:
            return 1.0
        correct = 0
        for act, exp in zip(actual_intents, expected_intents):
            if act.output_data.get("intent") == exp.expected_output.get("intent"):
                correct += 1
        return correct / len(expected_intents)

    def _evaluate_tool_usage(self, actual: AgentTrace, expected: AgentTrace) -> float:
        """Evaluate whether the correct tools/APIs were called."""
        actual_calls = [a for a in actual.actions if a.action_type == "api_call"]
        expected_calls = [a for a in expected.actions if a.action_type == "api_call"]
        if not expected_calls:
            return 1.0 if not actual_calls else 0.5  # No calls expected, but agent made some
        # Allow flexible ordering but require the same set of APIs
        actual_apis = {a.output_data.get("api_name", "") for a in actual_calls}
        expected_apis = {a.expected_output.get("api_name", "") for a in expected_calls}
        if not expected_apis:
            return 1.0
        correct = len(expected_apis & actual_apis)
        missing = len(expected_apis - actual_apis)
        extra = len(actual_apis - expected_apis)
        # Missing calls count fully against the score; extra calls count half
        return correct / (correct + missing + extra * 0.5)

    def _evaluate_parameters(self, actual: AgentTrace, expected: AgentTrace) -> float:
        """Evaluate whether API call parameters are correct."""
        actual_calls = {
            a.output_data.get("api_name"): a
            for a in actual.actions
            if a.action_type == "api_call"
        }
        expected_calls = {
            a.expected_output.get("api_name"): a
            for a in expected.actions
            if a.action_type == "api_call"
        }
        if not expected_calls:
            return 1.0
        correct_params = 0
        total_params = 0
        for api_name, expected_action in expected_calls.items():
            expected_params = expected_action.expected_output.get("parameters", {})
            actual_action = actual_calls.get(api_name)
            if not actual_action:
                # API never called: all of its expected parameters count as missed
                total_params += len(expected_params)
                continue
            actual_params = actual_action.output_data.get("parameters", {})
            for key, expected_val in expected_params.items():
                total_params += 1
                if str(actual_params.get(key)).lower() == str(expected_val).lower():
                    correct_params += 1
        return correct_params / total_params if total_params > 0 else 1.0

    def _evaluate_reasoning(self, actual: AgentTrace, expected: AgentTrace) -> float:
        """Evaluate reasoning quality for multi-step tasks."""
        if actual.total_turns <= 1:
            return 1.0  # Single-turn tasks don't require complex reasoning
        actual_steps = [
            a for a in actual.actions if a.action_type in ("reasoning_step", "api_call")
        ]
        expected_steps = [
            a for a in expected.actions if a.action_type in ("reasoning_step", "api_call")
        ]
        if not expected_steps:
            return 1.0
        # Evaluate ordering: were the steps performed in a logical order?
        order_score = self._evaluate_step_ordering(actual_steps, expected_steps)
        # Evaluate context carry: did the agent maintain context between steps?
        context_score = 1.0
        for i, action in enumerate(actual_steps):
            if i > 0 and action.action_type == "api_call":
                # Check if the API call references data from the previous step
                prev_output = actual_steps[i - 1].output_data
                curr_input = action.input_data
                if prev_output and not any(
                    str(v) in str(curr_input) for v in prev_output.values() if v
                ):
                    context_score -= 0.2
        return min(max((order_score + max(context_score, 0)) / 2, 0), 1)

    def _evaluate_task_completion(self, actual: AgentTrace, expected: AgentTrace) -> float:
        """Evaluate whether the user's task was completed."""
        if expected.task_status == TaskStatus.COMPLETED:
            if actual.task_status == TaskStatus.COMPLETED:
                return 1.0
            if actual.task_status == TaskStatus.PARTIALLY_COMPLETED:
                return 0.5
            if actual.task_status == TaskStatus.ESCALATED:
                return 0.3  # Escalation means the agent did not complete the task
            return 0.0
        if expected.task_status == TaskStatus.ESCALATED:
            # Escalation was the correct action
            if actual.task_status == TaskStatus.ESCALATED:
                return 1.0
            return 0.0  # Should have escalated but didn't
        return 0.5

    def _evaluate_escalation(self, actual: AgentTrace, expected: AgentTrace) -> dict:
        """Evaluate escalation decisions."""
        actual_escalated = actual.task_status == TaskStatus.ESCALATED
        expected_escalated = expected.task_status == TaskStatus.ESCALATED
        if expected_escalated and actual_escalated:
            return {"correct": True, "type": "true_positive"}
        if not expected_escalated and not actual_escalated:
            return {"correct": True, "type": "true_negative"}
        if expected_escalated and not actual_escalated:
            return {"correct": False, "type": "missed_escalation"}
        return {"correct": False, "type": "premature_escalation"}

    def _evaluate_step_ordering(
        self, actual_steps: list[AgentAction], expected_steps: list[AgentAction]
    ) -> float:
        """Evaluate whether steps were performed in the expected order."""
        actual_keys = [
            s.action_type + ":" + s.output_data.get("api_name", "") for s in actual_steps
        ]
        expected_keys = [
            s.action_type + ":" + s.expected_output.get("api_name", "")
            for s in expected_steps
        ]
        # Longest common subsequence ratio
        lcs = self._lcs_length(actual_keys, expected_keys)
        return lcs / max(len(expected_keys), 1)

    def _lcs_length(self, a: list, b: list) -> int:
        """Compute longest common subsequence length."""
        m, n = len(a), len(b)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if a[i - 1] == b[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1] + 1
                else:
                    dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
        return dp[m][n]

    def _publish_metrics(self, metrics: AgentPerformanceMetrics) -> None:
        """Publish agent performance metrics to CloudWatch."""
        metric_data = [
            {"MetricName": "TaskCompletionRate", "Value": metrics.task_completion_rate, "Unit": "None"},
            {"MetricName": "IntentAccuracy", "Value": metrics.intent_accuracy, "Unit": "None"},
            {"MetricName": "ToolSelectionAccuracy", "Value": metrics.tool_selection_accuracy, "Unit": "None"},
            {"MetricName": "ToolParameterAccuracy", "Value": metrics.tool_parameter_accuracy, "Unit": "None"},
            {"MetricName": "ReasoningAccuracy", "Value": metrics.reasoning_accuracy, "Unit": "None"},
        ]
        self.cloudwatch.put_metric_data(Namespace=self.namespace, MetricData=metric_data)
```
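The step-ordering score in `_evaluate_step_ordering` reduces to an LCS ratio over `action_type:api_name` keys. A standalone sketch of that calculation (the API names are illustrative):

```python
# Standalone sketch of the step-ordering score: LCS length over
# "action_type:api_name" keys, divided by the expected sequence length.

def lcs_length(a: list, b: list) -> int:
    """Classic O(m*n) dynamic-programming longest common subsequence."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

expected = ["api_call:orders", "api_call:products", "api_call:exchange"]
actual = ["api_call:products", "api_call:orders", "api_call:exchange"]  # first two swapped
order_score = lcs_length(actual, expected) / len(expected)
print(round(order_score, 3))  # 0.667
```

One out-of-order step costs a third of the score here, which is why ordering errors show up as partial rather than total reasoning failures.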
## MangaAssist Scenarios

### Scenario A: Intent Misclassification Cascade

**Context:** A user asked "When will my Dragon Ball set be back?" The intent classifier returned product_question (confidence 0.72) instead of order_tracking (0.68).

**What Happened:**

- Intent classification: product_question → the agent called the Product Catalog API instead of the Order Status API
- The agent responded with product availability information instead of order tracking
- The user replied: "No, I mean MY order. I already bought it."
- Second turn: the intent was correctly classified as order_tracking
- The agent recovered, but the task took 3 turns instead of 1

**Agent Performance Score:**

- Intent accuracy: 0.50 (1 of 2 classifications correct)
- Tool selection: 0.50 (wrong tool on turn 1, correct on turn 2)
- Task completion: 0.50 (partially completed — resolved, but with extra turns)
- Failure category: intent_misclassification → wrong_tool (cascade)

**Root Cause:** The query was ambiguous — "back" could mean "back in stock" (product_question) or "back to me" (order_tracking). The intent classifier lacked the signal that the user said "MY" (possessive = order context).

**Fix:** Added possessive-pronoun features ("my", "mine") to the intent classifier's feature set. After retraining, order_tracking confidence for this pattern rose from 0.68 to 0.89.
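A minimal sketch of what such a feature extractor could look like. The names `POSSESSIVES` and `extract_features` are hypothetical, not the production classifier:

```python
# Hypothetical sketch of the added possessive-pronoun features.
POSSESSIVES = {"my", "mine", "our"}

def extract_features(query: str) -> dict:
    """Derive binary cues that disambiguate order vs product context."""
    tokens = query.lower().replace("?", "").split()
    return {
        "has_possessive": any(t in POSSESSIVES for t in tokens),  # "MY order" signal
        "mentions_back": "back" in tokens,  # the ambiguous cue from Scenario A
    }

print(extract_features("When will my Dragon Ball set be back?"))
# {'has_possessive': True, 'mentions_back': True}
```

With `has_possessive=True`, the ambiguous "back" query carries an extra signal pushing it toward order_tracking.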
### Scenario B: Tool Parameter Extraction Failure

**Context:** A user asked: "Compare the price of Jujutsu Kaisen volumes 20 and 21."

**What Happened:**

- Intent: product_question (correct)
- Tool selected: Product Catalog API (correct)
- Parameters extracted:
  - Actual: {product: "Jujutsu Kaisen volumes 20 and 21"} — a single string
  - Expected: {product_1: "Jujutsu Kaisen Vol 20", product_2: "Jujutsu Kaisen Vol 21"} — two separate lookups
- Result: the API returned no results for the combined string, and the agent said "I couldn't find that product."

**Agent Performance Score:**

- Intent accuracy: 1.0
- Tool selection: 1.0
- Parameter accuracy: 0.0 (parameters were wrong)
- Task completion: 0.0 (user's need unmet)
- Failure category: wrong_parameters

**Fix:** Updated the query decomposition step in the orchestrator. When a query contains comparison language ("compare", "vs", "difference between") and multiple product references, the orchestrator splits it into separate API calls. Post-fix, parameter accuracy for comparison queries improved from 45% to 92%.
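A rough sketch of the decomposition rule described above. The helper name `split_comparison_query` and the single regex are illustrative assumptions — a production version would cover far more phrasings:

```python
import re

# Illustrative sketch of comparison-query decomposition, not the
# production orchestrator logic.
COMPARISON_MARKERS = ("compare", " vs ", "difference between")

def split_comparison_query(query: str) -> list[str]:
    """Split 'volumes 20 and 21' comparisons into separate catalog lookups."""
    if not any(m in query.lower() for m in COMPARISON_MARKERS):
        return [query]  # not a comparison: single lookup
    m = re.search(r"(.+?)\s+volumes?\s+(\d+)\s+and\s+(\d+)", query, re.IGNORECASE)
    if m:
        # Strip leading verbiage like "Compare the price of"
        name = m.group(1).split(" of ", 1)[-1].strip()
        return [f"{name} Vol {m.group(2)}", f"{name} Vol {m.group(3)}"]
    return [query]

print(split_comparison_query("Compare the price of Jujutsu Kaisen volumes 20 and 21"))
# ['Jujutsu Kaisen Vol 20', 'Jujutsu Kaisen Vol 21']
```

Each element then becomes one Product Catalog API call with a clean, resolvable product string.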
### Scenario C: Multi-Step Reasoning Failure in Exchange Flow

**Context:** A user wanted to exchange a manga for a different edition, then asked about the price difference.

**What Happened:**

- Turn 1: "I want to exchange my Chainsaw Man Vol 1 for the special edition" → correct intent, correct API calls
- Turn 2: "How much more do I need to pay?" → the agent should reference the exchange from turn 1
- Instead, the agent treated turn 2 as a new product_question and looked up prices independently, losing the exchange context
- The response gave the full price of the special edition ($24.99) instead of the price difference ($24.99 - $9.99 = $15.00)

**Agent Performance Score:**

- Turn 1: intent=1.0, tool=1.0, params=1.0
- Turn 2: intent=0.5 (should have stayed in exchange context), tool=0.5, reasoning=0.0
- Task completion: 0.5 (exchange initiated, but price difference wrong)
- Failure categories: context_lost, reasoning_error

**Fix:** The orchestrator was not passing the exchange context to subsequent turns. Added session state that tracks in-progress transactions. When the user asks a follow-up in the same session with an active transaction, the orchestrator injects the transaction context into the prompt.
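One way such session state could be sketched. The class and field names are hypothetical, not the MangaAssist orchestrator API:

```python
# Hypothetical sketch of session-scoped transaction state.
class SessionState:
    def __init__(self):
        self.active_transaction: dict | None = None

    def start_transaction(self, kind: str, **details):
        """Record an in-progress transaction (e.g. an exchange)."""
        self.active_transaction = {"kind": kind, **details}

    def context_prompt(self) -> str:
        """Context block the orchestrator injects into follow-up prompts."""
        if not self.active_transaction:
            return ""
        t = self.active_transaction
        diff = t["new_price"] - t["old_price"]
        return (
            f"Active {t['kind']}: {t['old_item']} (${t['old_price']:.2f}) to "
            f"{t['new_item']} (${t['new_price']:.2f}); price difference ${diff:.2f}"
        )

session = SessionState()
session.start_transaction(
    "exchange",
    old_item="Chainsaw Man Vol 1", old_price=9.99,
    new_item="Chainsaw Man Vol 1 Special Edition", new_price=24.99,
)
print(session.context_prompt())
```

With this context injected into the turn-2 prompt, the model has the $15.00 difference available instead of re-deriving the answer from a fresh price lookup.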
### Scenario D: Escalation Calibration — Too Aggressive

**Context:** After a customer complaint incident, the team lowered the escalation confidence threshold from 0.80 to 0.50, so any query with a >50% probability of needing human help was escalated.

**What Happened:**

- The escalation rate jumped from 3% to 18% of conversations
- Agent performance evaluation showed:
  - Escalation precision: 0.35 (only 35% of escalations were necessary)
  - Escalation recall: 0.98 (almost no missed escalations)
- The task completion rate dropped from 85% to 72% (the agent gave up too early)
- Amazon Connect staffing costs increased 6x

**Metric Signal:** MangaAssist/AgentPerformance.EscalationPrecision dropped below the 0.50 alert threshold.

**Fix:** Raised the threshold back up to 0.70 and added two-stage escalation: first attempt to resolve, then offer escalation. If the agent detects frustration signals (repeated questions, negative sentiment), it escalates immediately; otherwise it attempts resolution first.

Post-fix: escalation precision = 0.72, recall = 0.91, task completion = 84%.
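The two-stage policy can be sketched as a small decision function. The function name, inputs, and the exact placement of the 0.70 threshold are illustrative:

```python
# Illustrative sketch of the two-stage escalation policy from Scenario D's fix.

def escalation_decision(confidence_needs_human: float,
                        frustration_signals: bool,
                        already_attempted: bool) -> str:
    if frustration_signals:
        return "escalate_now"        # repeated questions / negative sentiment
    if confidence_needs_human < 0.70:
        return "resolve"             # agent handles it directly
    if not already_attempted:
        return "attempt_then_offer"  # stage 1: try to resolve first
    return "offer_escalation"        # stage 2: offer a human handoff

print(escalation_decision(0.85, frustration_signals=False, already_attempted=False))
# attempt_then_offer
```

The key property is that a high "needs human" score no longer triggers an immediate handoff: the agent earns one resolution attempt first, unless frustration signals short-circuit the policy.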
## Intuition Gained

### Agent Performance Is a Pipeline Problem

Each stage of the agent pipeline (intent → tool → parameters → reasoning → response) has its own failure modes. A single bad stage cascades: wrong intent → wrong tool → wrong answer. Agent performance evaluation must measure each stage independently AND the full pipeline together. Fixing the weakest link (often intent classification) has the highest ROI.

### Task Completion Rate Is the North Star

Quality metrics, latency, and cost all matter. But the user's question is: "Did the chatbot solve my problem?" Task completion rate — measured by whether the agent performed the correct sequence of actions to resolution — is the most business-relevant metric. A chatbot with 95% text quality but 60% task completion is failing its users.

### Escalation Is a Precision-Recall Tradeoff

Too many escalations waste human agent time and make the chatbot feel useless. Too few escalations leave frustrated users without help. MangaAssist targets 80% precision (4 out of 5 escalations are genuinely needed) and 90% recall (only 1 in 10 users who need help are missed). The two-stage approach (try first, then offer escalation) achieves this balance.
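These targets can be checked with a few lines. The raw counts below are illustrative, chosen to land exactly on 80% precision and 90% recall:

```python
# Worked example: escalation precision/recall from escalation outcome counts.

def escalation_pr(tp: int, fp: int, fn: int) -> tuple[float, float]:
    precision = tp / (tp + fp)  # of escalations made, how many were needed
    recall = tp / (tp + fn)     # of needed escalations, how many were made
    return precision, recall

# 45 escalations made, 36 genuinely needed; 4 needed escalations missed.
p, r = escalation_pr(tp=36, fp=9, fn=4)
print(p, r)  # 0.8 0.9
```

Tuning the confidence threshold trades one number against the other, which is why Scenario D's 0.50 threshold pushed recall to 0.98 while collapsing precision to 0.35.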
## References
- MangaAssist Architecture HLD — Agent pipeline architecture
- MangaAssist Architecture LLD — Orchestrator and tool routing
- Intent Classifier Cost Optimization — Intent classification design
- FM Output Quality Assessment — Response quality evaluation
- Quality Assurance Processes — Quality gates for agent changes
- Troubleshooting GenAI Applications — Debugging agent failures in production