
07: Agent Performance Framework

AIP-C01 Mapping

Content Domain 5: Testing, Validation, and Troubleshooting
Task 5.1: Implement evaluation systems for GenAI
Skill 5.1.7: Evaluate agent performance (for example, task completion rates, tool usage accuracy, and reasoning quality).


User Story

As a MangaAssist engineering manager, I want to measure how well the chatbot agent completes end-to-end tasks (correctly classifying intents, selecting the right tools/APIs, reasoning through multi-step queries, and reaching satisfactory resolutions) so that we can identify bottlenecks in the agent pipeline and improve the weakest links in task completion.


Acceptance Criteria

  • Task completion rate tracked per intent and per conversation complexity level
  • Tool usage accuracy measures whether the agent calls the correct APIs with correct parameters
  • Reasoning quality evaluated for multi-step queries requiring chained logic
  • Intent classification accuracy tracked as a separate metric from response quality
  • Escalation appropriateness evaluated — correct escalations vs. unnecessary escalations vs. missed escalations
  • Multi-turn coherence: agent maintains context and progresses toward resolution across turns
  • Agent latency budget tracked per pipeline stage (a minimal budget-check sketch follows this list)
  • Failure mode taxonomy: categorized reasons for task incompletion
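
The latency-budget criterion can be made concrete with a small per-stage check. Below is a minimal sketch; the stage names and millisecond budgets are illustrative assumptions, not MangaAssist's actual values.

# Hypothetical per-stage latency budgets (ms); replace with real pipeline measurements.
LATENCY_BUDGET_MS = {
    "intent_classification": 150,
    "tool_selection": 50,
    "api_call": 400,
    "generation": 1200,
}


def stages_over_budget(stage_latencies_ms: dict[str, float]) -> dict[str, float]:
    """Return stages whose measured latency exceeds the budget, with the overage in ms."""
    return {
        stage: measured - LATENCY_BUDGET_MS[stage]
        for stage, measured in stage_latencies_ms.items()
        if stage in LATENCY_BUDGET_MS and measured > LATENCY_BUDGET_MS[stage]
    }


# Example: a trace where response generation blew its budget
print(stages_over_budget({"intent_classification": 90, "generation": 1450}))  # {'generation': 250}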

Why Agent Performance Differs from Response Quality

Response quality (Skill 5.1.1) evaluates the text output. Agent performance evaluates the full pipeline: did the agent do the right thing, not just say the right thing?

Consider this MangaAssist interaction:

User: "I ordered One Piece Volume 108 last week. Where is it?" Agent: "One Piece Volume 108 is a popular manga! It features the continuing adventure of Luffy and the Straw Hat Pirates in the Egghead Island arc."

The response is high-quality text: relevant, factual, and fluent. But the agent failed:

  • The intent should have been order_tracking, but it was misclassified as product_question
  • The agent should have called the Order Status API, but instead called the Product Catalog API
  • The user's actual task (finding their order) was not resolved

Dimension          Response Quality Score          Agent Performance Score
Text Relevance     0.70 (mentioned the product)    N/A
Task Completion    N/A                             0.0 (wrong task performed)
Tool Selection     N/A                             0.0 (wrong API called)
Intent Accuracy    N/A                             0.0 (misclassified)

Agent performance evaluation catches failures that response quality metrics miss.
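
To make the gap concrete, the sketch below scores this interaction on both dimensions using hypothetical numbers; the dictionaries are illustrative, and the real scoring functions are defined later in this document.

# Illustrative per-dimension scores for the One Piece interaction above.
response_quality = {"text_relevance": 0.70, "fluency": 1.0}   # the text itself reads well

agent_performance = {
    "intent_accuracy": 0.0,     # order_tracking misclassified as product_question
    "tool_selection": 0.0,      # Product Catalog API called instead of Order Status API
    "task_completion": 0.0,     # the user still does not know where their order is
}

# A dashboard averaging only response-quality numbers reports ~0.85
# for a conversation that completely failed the user's task.
print(sum(response_quality.values()) / len(response_quality))    # 0.85
print(sum(agent_performance.values()) / len(agent_performance))  # 0.0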


High-Level Design

Agent Performance Evaluation Architecture

graph TD
    subgraph "Agent Pipeline Under Test"
        A[User Query] --> B[Intent Classifier<br>SageMaker]
        B --> C[Tool Selector<br>Orchestrator]
        C --> D1[Product Catalog API]
        C --> D2[Order Status API]
        C --> D3[RAG Pipeline]
        C --> D4[Recommendation Engine]
        C --> D5[Escalation Handler]
        D1 --> E[Response Generator<br>Bedrock]
        D2 --> E
        D3 --> E
        D4 --> E
        D5 --> E
        E --> F[Response to User]
    end

    subgraph "Agent Performance Evaluator"
        A --> G[Expected Trace]
        F --> G
        G --> H1[Intent Accuracy<br>Predicted vs Expected]
        G --> H2[Tool Usage Accuracy<br>Correct API called?<br>Correct parameters?]
        G --> H3[Reasoning Quality<br>Multi-step logic chain]
        G --> H4[Task Completion<br>Did user's need get met?]
        G --> H5[Escalation Appropriateness<br>Should/shouldn't have escalated]
    end

    subgraph "Aggregation"
        H1 --> I[Agent Performance Report]
        H2 --> I
        H3 --> I
        H4 --> I
        H5 --> I
        I --> J[CloudWatch Dashboard]
        I --> K[Failure Taxonomy]
    end

Multi-Turn Task Execution Trace

sequenceDiagram
    participant U as User
    participant IC as Intent Classifier
    participant O as Orchestrator
    participant API as Backend APIs
    participant LLM as Bedrock Claude
    participant Eval as Performance Evaluator

    U->>IC: "I want to return my Naruto box set"
    IC->>IC: Classify: return_request (confidence: 0.92)
    IC->>O: intent=return_request
    O->>API: GET /orders?product=naruto+box+set&status=delivered
    API->>O: order_id=AZ-78901, delivered 3 days ago
    O->>LLM: Generate return initiation response
    LLM->>U: "I can help with that! Your Naruto Box Set (Order #AZ-78901)..."

    U->>IC: "Actually, I want to exchange it for the deluxe edition"
    IC->>IC: Classify: return_request (confidence: 0.85)
    Note over IC: Should reclassify to exchange or maintain return context
    IC->>O: intent=return_request, sub_intent=exchange
    O->>API: GET /products?query=naruto+box+set+deluxe
    API->>O: product_id=B0123, price=$89.99
    O->>API: POST /returns/exchange {order_id, new_product_id}
    API->>O: exchange_request_id=EX-456
    O->>LLM: Generate exchange confirmation
    LLM->>U: "I've initiated an exchange..."

    Note over Eval: Evaluate the full trace
    Eval->>Eval: ✅ Intent: correct (return → exchange transition)
    Eval->>Eval: ✅ Tool: correct APIs called in correct order
    Eval->>Eval: ✅ Parameters: correct product lookup, correct order reference
    Eval->>Eval: ✅ Task: exchange initiated successfully
    Eval->>Eval: ⚠️ Reasoning: could have proactively mentioned price difference

Failure Mode Taxonomy

graph TD
    A[Task Failure] --> B[Intent Failure]
    A --> C[Tool Failure]
    A --> D[Reasoning Failure]
    A --> E[Escalation Failure]

    B --> B1["Misclassification<br/>Wrong intent detected"]
    B --> B2["Low Confidence<br/>Correct but uncertain"]
    B --> B3["Intent Ambiguity<br/>Multiple valid intents"]

    C --> C1["Wrong Tool Selected<br/>Called wrong API"]
    C --> C2["Wrong Parameters<br/>Right API, wrong params"]
    C --> C3["Missing Tool Call<br/>Needed API but skipped"]
    C --> C4["Tool Error<br/>API returned error"]

    D --> D1["Single-Step Error<br/>Wrong conclusion from data"]
    D --> D2["Multi-Step Error<br/>Lost context between steps"]
    D --> D3["Coreference Failure<br/>Lost track of 'it' or 'that'"]

    E --> E1["Missed Escalation<br/>Should have escalated"]
    E --> E2["Premature Escalation<br/>Could have resolved"]
    E --> E3["Wrong Escalation Path<br/>Wrong team/channel"]

Low-Level Design

Agent Performance Data Model

from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
import time
import uuid


class TaskStatus(Enum):
    COMPLETED = "completed"
    PARTIALLY_COMPLETED = "partially_completed"
    FAILED = "failed"
    ESCALATED = "escalated"


class FailureCategory(Enum):
    INTENT_MISCLASSIFICATION = "intent_misclassification"
    WRONG_TOOL = "wrong_tool"
    WRONG_PARAMETERS = "wrong_parameters"
    MISSING_TOOL_CALL = "missing_tool_call"
    TOOL_ERROR = "tool_error"
    REASONING_ERROR = "reasoning_error"
    COREFERENCE_FAILURE = "coreference_failure"
    MISSED_ESCALATION = "missed_escalation"
    PREMATURE_ESCALATION = "premature_escalation"
    CONTEXT_LOST = "context_lost"


@dataclass
class AgentAction:
    """A single action taken by the agent (intent classification, API call, generation)."""
    action_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    action_type: str = ""                   # "intent_classification", "api_call", "generation", "escalation"
    input_data: dict = field(default_factory=dict)
    output_data: dict = field(default_factory=dict)
    expected_output: dict = field(default_factory=dict)  # Ground truth
    latency_ms: float = 0.0
    correct: bool = True
    error_message: str = ""


@dataclass
class AgentTrace:
    """Full execution trace of the agent for one conversation turn or session."""
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    session_id: str = ""
    actions: list[AgentAction] = field(default_factory=list)
    total_turns: int = 0
    task_status: TaskStatus = TaskStatus.COMPLETED
    failure_categories: list[FailureCategory] = field(default_factory=list)
    timestamp: float = field(default_factory=time.time)


@dataclass
class AgentPerformanceMetrics:
    """Aggregated agent performance metrics."""
    task_completion_rate: float = 0.0       # % of tasks fully completed
    partial_completion_rate: float = 0.0    # % partially completed
    intent_accuracy: float = 0.0            # % of correct intent classifications
    tool_selection_accuracy: float = 0.0    # % of correct tool/API selections
    tool_parameter_accuracy: float = 0.0    # % of correct parameters when right tool
    reasoning_accuracy: float = 0.0         # % of correct multi-step reasoning
    escalation_precision: float = 0.0       # Correct escalations / total escalations
    escalation_recall: float = 0.0          # Correct escalations / should-have-escalated
    avg_turns_to_resolution: float = 0.0
    failure_distribution: dict[str, int] = field(default_factory=dict)
    total_traces: int = 0
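
As a usage sketch, the misrouted One Piece query from earlier could be encoded as a trace like the one below; the intent labels, API names, and IDs are illustrative.

# Hypothetical encoding of the misrouted "Where is my One Piece Volume 108?" turn.
actual_trace = AgentTrace(
    session_id="sess-001",
    total_turns=1,
    task_status=TaskStatus.FAILED,
    actions=[
        AgentAction(
            action_type="intent_classification",
            input_data={"query": "I ordered One Piece Volume 108 last week. Where is it?"},
            output_data={"intent": "product_question", "confidence": 0.72},
            expected_output={"intent": "order_tracking"},
            correct=False,
        ),
        AgentAction(
            action_type="api_call",
            output_data={"api_name": "product_catalog",
                         "parameters": {"product": "One Piece Vol 108"}},
            expected_output={"api_name": "order_status",
                             "parameters": {"customer_id": "C-123"}},
            correct=False,
        ),
    ],
    failure_categories=[FailureCategory.INTENT_MISCLASSIFICATION, FailureCategory.WRONG_TOOL],
)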

Agent Performance Evaluator

import json
import logging
import time
from typing import Optional

import boto3

logger = logging.getLogger(__name__)


class AgentPerformanceEvaluator:
    """Evaluates end-to-end agent performance across all pipeline stages.

    Evaluation flow:
    1. Replay test conversations against the agent
    2. Capture execution traces (intent, tool calls, parameters, response)
    3. Compare traces against expected traces (ground truth)
    4. Score each dimension and aggregate
    """

    def __init__(self, cloudwatch_namespace: str = "MangaAssist/AgentPerformance"):
        self.cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
        self.namespace = cloudwatch_namespace

    def evaluate_trace(
        self, trace: AgentTrace, expected_trace: AgentTrace
    ) -> dict:
        """Evaluate a single agent trace against expected behavior."""
        results = {
            "trace_id": trace.trace_id,
            "intent_accuracy": self._evaluate_intents(trace, expected_trace),
            "tool_accuracy": self._evaluate_tool_usage(trace, expected_trace),
            "parameter_accuracy": self._evaluate_parameters(trace, expected_trace),
            "reasoning_quality": self._evaluate_reasoning(trace, expected_trace),
            "task_completion": self._evaluate_task_completion(trace, expected_trace),
            "escalation_quality": self._evaluate_escalation(trace, expected_trace),
            "failure_categories": [],
        }

        # Identify failure categories
        if results["intent_accuracy"] < 1.0:
            results["failure_categories"].append(FailureCategory.INTENT_MISCLASSIFICATION.value)
        if results["tool_accuracy"] < 1.0:
            results["failure_categories"].append(FailureCategory.WRONG_TOOL.value)
        if results["parameter_accuracy"] < 1.0:
            results["failure_categories"].append(FailureCategory.WRONG_PARAMETERS.value)

        return results

    def evaluate_batch(
        self, traces: list[tuple[AgentTrace, AgentTrace]]
    ) -> AgentPerformanceMetrics:
        """Evaluate a batch of agent traces and compute aggregate metrics."""
        all_results = []
        for actual, expected in traces:
            result = self.evaluate_trace(actual, expected)
            all_results.append(result)

        if not all_results:
            return AgentPerformanceMetrics()

        n = len(all_results)

        # Task completion
        completed = sum(1 for r in all_results if r["task_completion"] >= 1.0)
        partial = sum(1 for r in all_results if 0 < r["task_completion"] < 1.0)

        # Failure distribution
        failure_dist: dict[str, int] = {}
        for r in all_results:
            for cat in r["failure_categories"]:
                failure_dist[cat] = failure_dist.get(cat, 0) + 1

        metrics = AgentPerformanceMetrics(
            task_completion_rate=completed / n,
            partial_completion_rate=partial / n,
            intent_accuracy=sum(r["intent_accuracy"] for r in all_results) / n,
            tool_selection_accuracy=sum(r["tool_accuracy"] for r in all_results) / n,
            tool_parameter_accuracy=sum(r["parameter_accuracy"] for r in all_results) / n,
            reasoning_accuracy=sum(r["reasoning_quality"] for r in all_results) / n,
            failure_distribution=failure_dist,
            total_traces=n,
        )

        self._publish_metrics(metrics)
        return metrics

    def _evaluate_intents(self, actual: AgentTrace, expected: AgentTrace) -> float:
        """Evaluate intent classification accuracy across all turns."""
        actual_intents = [
            a for a in actual.actions if a.action_type == "intent_classification"
        ]
        expected_intents = [
            a for a in expected.actions if a.action_type == "intent_classification"
        ]

        if not expected_intents:
            return 1.0

        correct = 0
        for act, exp in zip(actual_intents, expected_intents):
            actual_intent = act.output_data.get("intent")
            expected_intent = exp.expected_output.get("intent")
            if actual_intent == expected_intent:
                correct += 1

        return correct / len(expected_intents)

    def _evaluate_tool_usage(self, actual: AgentTrace, expected: AgentTrace) -> float:
        """Evaluate whether the correct tools/APIs were called."""
        actual_calls = [a for a in actual.actions if a.action_type == "api_call"]
        expected_calls = [a for a in expected.actions if a.action_type == "api_call"]

        if not expected_calls:
            return 1.0 if not actual_calls else 0.5  # No calls expected, but agent made some

        # Check if the sequence of API calls matches
        actual_apis = [a.output_data.get("api_name", "") for a in actual_calls]
        expected_apis = [a.expected_output.get("api_name", "") for a in expected_calls]

        # Allow flexible ordering but require the same set of APIs
        expected_set = set(expected_apis)
        actual_set = set(actual_apis)

        if not expected_set:
            return 1.0

        correct = len(expected_set & actual_set)
        missing = len(expected_set - actual_set)
        extra = len(actual_set - expected_set)

        return correct / (correct + missing + extra * 0.5)

    def _evaluate_parameters(self, actual: AgentTrace, expected: AgentTrace) -> float:
        """Evaluate whether API call parameters are correct."""
        actual_calls = {
            a.output_data.get("api_name"): a for a in actual.actions if a.action_type == "api_call"
        }
        expected_calls = {
            a.expected_output.get("api_name"): a for a in expected.actions if a.action_type == "api_call"
        }

        if not expected_calls:
            return 1.0

        correct_params = 0
        total_params = 0

        for api_name, expected_action in expected_calls.items():
            actual_action = actual_calls.get(api_name)
            if not actual_action:
                total_params += len(expected_action.expected_output.get("parameters", {}))
                continue

            expected_params = expected_action.expected_output.get("parameters", {})
            actual_params = actual_action.output_data.get("parameters", {})

            for key, expected_val in expected_params.items():
                total_params += 1
                actual_val = actual_params.get(key)
                if str(actual_val).lower() == str(expected_val).lower():
                    correct_params += 1

        return correct_params / total_params if total_params > 0 else 1.0

    def _evaluate_reasoning(self, actual: AgentTrace, expected: AgentTrace) -> float:
        """Evaluate reasoning quality for multi-step tasks."""
        if actual.total_turns <= 1:
            return 1.0  # Single-turn tasks don't require complex reasoning

        # Check that intermediate conclusions are correct
        actual_steps = [
            a for a in actual.actions if a.action_type in ("reasoning_step", "api_call")
        ]
        expected_steps = [
            a for a in expected.actions if a.action_type in ("reasoning_step", "api_call")
        ]

        if not expected_steps:
            return 1.0

        # Evaluate ordering: were the steps performed in a logical order?
        order_score = self._evaluate_step_ordering(actual_steps, expected_steps)

        # Evaluate context carry: did the agent maintain context between steps?
        context_score = 1.0
        for i, action in enumerate(actual_steps):
            if i > 0 and action.action_type == "api_call":
                # Check if the API call references data from a previous step
                prev_output = actual_steps[i - 1].output_data
                curr_input = action.input_data
                if prev_output and not any(
                    str(v) in str(curr_input) for v in prev_output.values() if v
                ):
                    context_score -= 0.2

        return min(max((order_score + max(context_score, 0)) / 2, 0), 1)

    def _evaluate_task_completion(self, actual: AgentTrace, expected: AgentTrace) -> float:
        """Evaluate whether the user's task was completed."""
        if expected.task_status == TaskStatus.COMPLETED:
            if actual.task_status == TaskStatus.COMPLETED:
                return 1.0
            elif actual.task_status == TaskStatus.PARTIALLY_COMPLETED:
                return 0.5
            elif actual.task_status == TaskStatus.ESCALATED:
                return 0.3  # Escalation means task not completed by agent
            return 0.0
        elif expected.task_status == TaskStatus.ESCALATED:
            # Escalation was the correct action
            if actual.task_status == TaskStatus.ESCALATED:
                return 1.0
            return 0.0  # Should have escalated but didn't
        return 0.5

    def _evaluate_escalation(self, actual: AgentTrace, expected: AgentTrace) -> dict:
        """Evaluate escalation decisions."""
        actual_escalated = actual.task_status == TaskStatus.ESCALATED
        expected_escalated = expected.task_status == TaskStatus.ESCALATED

        if expected_escalated and actual_escalated:
            return {"correct": True, "type": "true_positive"}
        elif not expected_escalated and not actual_escalated:
            return {"correct": True, "type": "true_negative"}
        elif expected_escalated and not actual_escalated:
            return {"correct": False, "type": "missed_escalation"}
        else:
            return {"correct": False, "type": "premature_escalation"}

    def _evaluate_step_ordering(
        self, actual_steps: list[AgentAction], expected_steps: list[AgentAction]
    ) -> float:
        """Evaluate whether steps were performed in the expected order."""
        actual_types = [s.action_type + ":" + s.output_data.get("api_name", "") for s in actual_steps]
        expected_types = [s.action_type + ":" + s.expected_output.get("api_name", "") for s in expected_steps]

        # Longest common subsequence ratio
        lcs = self._lcs_length(actual_types, expected_types)
        return lcs / max(len(expected_types), 1)

    def _lcs_length(self, a: list, b: list) -> int:
        """Compute longest common subsequence length."""
        m, n = len(a), len(b)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if a[i - 1] == b[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1] + 1
                else:
                    dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
        return dp[m][n]

    def _publish_metrics(self, metrics: AgentPerformanceMetrics) -> None:
        """Publish agent performance metrics to CloudWatch."""
        metric_data = [
            {"MetricName": "TaskCompletionRate", "Value": metrics.task_completion_rate, "Unit": "None"},
            {"MetricName": "IntentAccuracy", "Value": metrics.intent_accuracy, "Unit": "None"},
            {"MetricName": "ToolSelectionAccuracy", "Value": metrics.tool_selection_accuracy, "Unit": "None"},
            {"MetricName": "ToolParameterAccuracy", "Value": metrics.tool_parameter_accuracy, "Unit": "None"},
            {"MetricName": "ReasoningAccuracy", "Value": metrics.reasoning_accuracy, "Unit": "None"},
        ]
        self.cloudwatch.put_metric_data(Namespace=self.namespace, MetricData=metric_data)
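
A batch evaluation run might look like the sketch below, reusing the actual_trace built in the data-model example and pairing it with a ground-truth trace. Note that evaluate_batch also publishes to CloudWatch, so AWS credentials must be configured.

# Ground-truth trace for the same turn: correct intent, correct API, task completed.
expected_trace = AgentTrace(
    session_id="sess-001",
    total_turns=1,
    task_status=TaskStatus.COMPLETED,
    actions=[
        AgentAction(action_type="intent_classification",
                    expected_output={"intent": "order_tracking"}),
        AgentAction(action_type="api_call",
                    expected_output={"api_name": "order_status",
                                     "parameters": {"customer_id": "C-123"}}),
    ],
)

evaluator = AgentPerformanceEvaluator()
metrics = evaluator.evaluate_batch([(actual_trace, expected_trace)])
print(metrics.task_completion_rate)   # 0.0 for this single failed trace
print(metrics.failure_distribution)   # counts per failure category, e.g. intent_misclassification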

MangaAssist Scenarios

Scenario A: Intent Misclassification Cascade

Context: A user asked "When will my Dragon Ball set be back?" The intent classifier returned product_question (confidence 0.72) instead of order_tracking (0.68).

What Happened:
  • Intent classification: product_question → called the Product Catalog API instead of the Order Status API
  • The agent responded with product availability information instead of order tracking
  • User replied: "No, I mean MY order. I already bought it."
  • Second turn: intent correctly classified as order_tracking
  • The agent recovered, but the task took 3 turns instead of 1

Agent Performance Score:
  • Intent accuracy: 0.50 (1 of 2 classifications correct)
  • Tool selection: 0.50 (wrong tool on turn 1, correct on turn 2)
  • Task completion: 0.50 (partially completed: resolved, but with extra turns)
  • Failure category: intent_misclassification → wrong_tool (cascade)

Root Cause: Ambiguous query — "back" could mean "back in stock" (product_question) or "back to me" (order_tracking). The intent classifier lacked the signal that the user phrased it as "MY" (possessive = order context).

Fix: Added possessive pronoun features ("my", "mine") to the intent classifier's feature set. After retraining, order_tracking confidence for this pattern increased from 0.68 to 0.89.
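
The sketch below shows the shape of that feature change; the feature names and the simple tokenization are assumptions for illustration, not the production SageMaker feature pipeline.

POSSESSIVES = {"my", "mine", "our", "ours"}


def intent_features(query: str) -> dict:
    """Hypothetical feature extractor that exposes possessive pronouns to the classifier."""
    tokens = [t.strip("?.,!") for t in query.lower().split()]
    return {
        "has_possessive": any(t in POSSESSIVES for t in tokens),
        "mentions_back": "back" in tokens,
        "token_count": len(tokens),
    }


print(intent_features("When will my Dragon Ball set be back?"))
# {'has_possessive': True, 'mentions_back': True, 'token_count': 8}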

Scenario B: Tool Parameter Extraction Failure

Context: User asked: "Compare the price of Jujutsu Kaisen volumes 20 and 21."

What Happened:
  • Intent: product_question (correct)
  • Tool selected: Product Catalog API (correct)
  • Parameters extracted:
      • Actual: {product: "Jujutsu Kaisen volumes 20 and 21"} (a single combined string)
      • Expected: {product_1: "Jujutsu Kaisen Vol 20", product_2: "Jujutsu Kaisen Vol 21"} (two separate lookups)
  • Result: the API returned no results for the combined string, and the agent said "I couldn't find that product."

Agent Performance Score:
  • Intent accuracy: 1.0
  • Tool selection: 1.0
  • Parameter accuracy: 0.0 (parameters were wrong)
  • Task completion: 0.0 (user's need unmet)
  • Failure category: wrong_parameters

Fix: Updated the query decomposition step in the orchestrator. When the query contains comparison language ("compare", "vs", "difference between") and multiple product references, the orchestrator splits into separate API calls. Post-fix: parameter accuracy for comparison queries improved from 45% to 92%.
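
A simplified sketch of that decomposition step follows; the trigger words and the volume-splitting regex are assumptions that cover only this example's phrasing, not the full orchestrator logic.

import re

COMPARISON_TRIGGERS = ("compare", " vs ", "difference between")


def decompose_comparison(query: str) -> list[dict]:
    """Split a comparison query into one Product Catalog lookup per referenced volume."""
    q = query.lower()
    if not any(t in q for t in COMPARISON_TRIGGERS):
        return [{"product": query}]
    # "volumes 20 and 21" -> separate lookups for volume 20 and volume 21
    match = re.search(r"(.+?)\s+volumes?\s+(\d+)\s+and\s+(\d+)", q)
    if match:
        title = match.group(1).replace("compare the price of", "").strip()
        return [
            {"product": f"{title} vol {match.group(2)}"},
            {"product": f"{title} vol {match.group(3)}"},
        ]
    return [{"product": query}]


print(decompose_comparison("Compare the price of Jujutsu Kaisen volumes 20 and 21."))
# [{'product': 'jujutsu kaisen vol 20'}, {'product': 'jujutsu kaisen vol 21'}]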

Scenario C: Multi-Step Reasoning Failure in Exchange Flow

Context: User wanted to exchange a manga for a different edition, then asked about the price difference.

What Happened:
  • Turn 1: "I want to exchange my Chainsaw Man Vol 1 for the special edition" → correct intent, correct API calls
  • Turn 2: "How much more do I need to pay?" → the agent should reference the exchange from turn 1
  • The agent treated turn 2 as a new product_question and looked up prices independently, losing the exchange context
  • The response gave the full price of the special edition ($24.99) instead of the price difference ($24.99 - $9.99 = $15.00)

Agent Performance Score:
  • Turn 1: intent=1.0, tool=1.0, params=1.0
  • Turn 2: intent=0.5 (should have stayed in the exchange context), tool=0.5, reasoning=0.0
  • Task completion: 0.5 (exchange initiated, but the price difference was wrong)
  • Failure category: context_lost, reasoning_error

Fix: The orchestrator was not passing the exchange context to subsequent turns. Added a session state that tracks in-progress transactions. When the user asks a follow-up in the same session with an active transaction, the orchestrator injects the transaction context into the prompt.
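
A minimal sketch of that session state and prompt injection follows; the field names and prompt wording are illustrative assumptions, not the actual orchestrator code.

from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionState:
    """Hypothetical per-session state tracked by the orchestrator."""
    session_id: str
    active_transaction: Optional[dict] = None   # e.g. an in-progress exchange


def build_prompt(user_message: str, state: SessionState) -> str:
    """Prepend any active transaction so follow-up turns keep the exchange context."""
    context = ""
    if state.active_transaction:
        tx = state.active_transaction
        context = (
            f"Active transaction: {tx['type']} of {tx['original_item']} (${tx['original_price']}) "
            f"for {tx['new_item']} (${tx['new_price']}).\n"
        )
    return f"{context}User: {user_message}"


state = SessionState(
    session_id="sess-042",
    active_transaction={
        "type": "exchange",
        "original_item": "Chainsaw Man Vol 1", "original_price": 9.99,
        "new_item": "Chainsaw Man Vol 1 Special Edition", "new_price": 24.99,
    },
)
print(build_prompt("How much more do I need to pay?", state))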

Scenario D: Escalation Calibration — Too Aggressive

Context: After a customer complaint incident, the team lowered the escalation confidence threshold from 0.80 to 0.50. This meant any query with >50% probability of needing human help was escalated.

What Happened:
  • Escalation rate jumped from 3% to 18% of conversations
  • Agent performance evaluation showed:
      • Escalation precision: 0.35 (only 35% of escalations were necessary)
      • Escalation recall: 0.98 (almost no missed escalations)
  • Task completion rate dropped from 85% to 72% (the agent gave up too early)
  • Amazon Connect staffing costs increased 6x

Metric Signal: MangaAssist/AgentPerformance.EscalationPrecision dropped below the 0.50 alert threshold.

Fix: Raised the escalation threshold back up to 0.70 and added a two-stage escalation policy: if the agent detects frustration signals (repeated questions, negative sentiment), it escalates immediately; otherwise it attempts resolution first and offers escalation only if that attempt fails.

Post-fix: Escalation precision=0.72, recall=0.91, task completion=84%.
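
The sketch below captures the two-stage decision logic; the threshold value matches the one described above, while the frustration-signal input is assumed to come from an upstream sentiment check.

ESCALATION_THRESHOLD = 0.70


def should_escalate(escalation_score: float, frustrated: bool, resolution_attempted: bool) -> bool:
    """Escalate immediately on frustration; otherwise only after a resolution attempt fails."""
    if frustrated:
        return True
    if escalation_score < ESCALATION_THRESHOLD:
        return False
    # High escalation score but calm user: try to resolve once before handing off.
    return resolution_attempted


print(should_escalate(0.85, frustrated=False, resolution_attempted=False))  # False: try to resolve first
print(should_escalate(0.85, frustrated=False, resolution_attempted=True))   # True: offer escalation
print(should_escalate(0.40, frustrated=True, resolution_attempted=False))   # True: frustrated user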


Intuition Gained

Agent Performance Is a Pipeline Problem

Each stage of the agent pipeline (intent → tool → parameters → reasoning → response) has its own failure modes. A single bad stage cascades: wrong intent → wrong tool → wrong answer. Agent performance evaluation must measure each stage independently AND the full pipeline together. Fixing the weakest link (often intent classification) has the highest ROI.

Task Completion Rate Is the North Star

Quality metrics, latency, and cost all matter. But the user's question is: "Did the chatbot solve my problem?" Task completion rate — measured by whether the agent performed the correct sequence of actions to resolution — is the most business-relevant metric. A chatbot with 95% text quality but 60% task completion is failing its users.

Escalation Is a Precision-Recall Tradeoff

Too many escalations waste human agent time and make the chatbot feel useless. Too few escalations leave frustrated users without help. MangaAssist targets 80% precision (4 out of 5 escalations are genuinely needed) and 90% recall (only 1 in 10 users who need help are missed). The two-stage approach (try first, then offer escalation) achieves this balance.
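
Those precision and recall targets can be computed directly from the per-trace outcomes that _evaluate_escalation returns. The evaluate_batch code above does not yet aggregate them, so the sketch below shows one way it could.

def escalation_precision_recall(outcomes: list[dict]) -> tuple[float, float]:
    """Aggregate per-trace escalation outcomes ('true_positive', 'premature_escalation', 'missed_escalation')."""
    tp = sum(1 for o in outcomes if o["type"] == "true_positive")
    premature = sum(1 for o in outcomes if o["type"] == "premature_escalation")
    missed = sum(1 for o in outcomes if o["type"] == "missed_escalation")
    precision = tp / (tp + premature) if (tp + premature) else 1.0
    recall = tp / (tp + missed) if (tp + missed) else 1.0
    return precision, recall


# 4 necessary escalations, 1 unnecessary, 1 missed -> precision 0.8, recall 0.8
outcomes = [{"type": "true_positive"}] * 4 + [{"type": "premature_escalation"}, {"type": "missed_escalation"}]
print(escalation_precision_recall(outcomes))  # (0.8, 0.8)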


References