
Scenarios and Runbooks — Intelligent Tool Integrations

MangaAssist context: JP Manga store chatbot on AWS — Bedrock Claude 3 (Sonnet at $3/$15 per 1M tokens input/output, Haiku at $0.25/$1.25), OpenSearch Serverless (vector store), DynamoDB (sessions/products), ECS Fargate (orchestrator), API Gateway WebSocket, ElastiCache Redis. Target: useful answer in under 3 seconds, 1M messages/day scale.


Skill Mapping

Certification: AWS AI Practitioner (AIP-C01)
Domain: 2 — Implementation and Integration of FM-Powered Applications
Task: 2.1 — Select and implement appropriate FM integration strategies
Skill: 2.1.6 — Implement intelligent tool integrations to extend FM capabilities and ensure reliable tool operations
Focus Areas: Strands API for custom behaviors, standardized function definitions, Lambda for error handling and parameter validation

Scenario Format

Each scenario follows this standard structure:

SCENARIO N: <Title>
├── Severity: P1/P2/P3/P4
├── Blast Radius: <what is affected>
├── Symptom: <what operators/users observe>
├── Root Cause: <why it happened>
├── Timeline: <detection to resolution>
├── Runbook: <step-by-step fix>
├── Prevention: <how to stop it recurring>
├── Code Fix: <before/after code>
└── Exam Relevance: <AIP-C01 mapping>


Scenario 1: FM Hallucinating Invalid Tool Parameters on Product Search

Severity

P2 — High — Customers receive irrelevant or empty search results. Not a total outage, but degrades experience for a large fraction of queries.

Blast Radius

  • Direct: All product search queries where the FM sends malformed parameters (estimated 8-12% of search traffic during the incident window).
  • Indirect: Recommendation engine receives bad seed titles from broken searches, producing poor recommendations downstream.
  • Revenue: Missed conversions from customers who see empty results and abandon the session.

Symptom

[2026-03-28 14:23:17 UTC] WARN  mangaassist.tools — product_search returned
0 results for query="進撃の巨人" with genre="action" (invalid enum value)

[2026-03-28 14:23:17 UTC] INFO  mangaassist.orchestrator — FM selected
product_search with params: {"query": "進撃の巨人", "genre": "action",
"max_results": "10", "in_stock_only": "yes"}

Observations:

  1. CloudWatch metric MangaAssist/Tools/product_search/ZeroResultRate spikes from baseline 3% to 22%.
  2. Customer complaints in the support queue: "I searched for Attack on Titan but nothing came up."
  3. The FM is sending genre: "action" instead of the valid enum value "shonen". The genre enum does not include "action" — it is a hallucinated category.
  4. The FM is sending max_results: "10" (string) instead of 10 (integer).
  5. The FM is sending in_stock_only: "yes" (string) instead of true (boolean).

Root Cause

Two compounding failures:

  1. Schema definition gap: The product_search tool description says "Filter results by manga genre" but does not explicitly list the valid values in the description text. The enum constraint exists in the JSON Schema but the FM does not always respect schema-level constraints — it relies heavily on the natural-language description. When a customer asks for "action manga," the FM maps it to genre: "action" because the description does not guide it to the correct enum value "shonen".

  2. Missing type coercion: The ParameterValidator was not deployed in the latest release due to a packaging error in the Lambda layer. Without coercion, string values "10" and "yes" are passed directly to OpenSearch, which either ignores them (silent failure) or throws a type error.

Why the FM hallucinates enum values:

  • Bedrock's tool-use implementation sends the JSON Schema to the FM, but the FM treats enum values as suggestions, not hard constraints.
  • If the natural-language description does not reinforce the enum values, the FM may invent its own.
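The kind of coercion the missing validator performs can be sketched as follows. This is a hypothetical reconstruction: the real ParameterValidator API is not shown in this document, and the class below is an illustration of the technique, not the deployed code.

```python
from typing import Any, Dict


class ParameterValidator:
    """Coerce common FM type mistakes before they reach the tool (sketch)."""

    TRUTHY = {"yes", "true", "1"}
    FALSY = {"no", "false", "0"}

    def __init__(self, properties: Dict[str, Any]) -> None:
        # `properties` is the tool's JSON Schema "properties" object
        self.properties = properties

    def coerce(self, params: Dict[str, Any]) -> Dict[str, Any]:
        fixed: Dict[str, Any] = {}
        for name, value in params.items():
            spec = self.properties.get(name, {})
            expected = spec.get("type")
            if expected == "integer" and isinstance(value, str) and value.isdigit():
                value = int(value)  # "10" -> 10
            elif expected == "boolean" and isinstance(value, str):
                low = value.strip().lower()
                if low in self.TRUTHY:
                    value = True    # "yes" -> True
                elif low in self.FALSY:
                    value = False   # "no" -> False
            # Drop hallucinated enum values instead of sending a bad filter
            if spec.get("enum") and value not in spec["enum"]:
                continue
            fixed[name] = value
        return fixed


schema = {
    "query": {"type": "string"},
    "genre": {"type": "string",
              "enum": ["shonen", "shojo", "seinen", "josei", "kodomomuke",
                       "isekai", "mecha", "slice_of_life", "horror", "sports"]},
    "max_results": {"type": "integer"},
    "in_stock_only": {"type": "boolean"},
}
validator = ParameterValidator(schema)
cleaned = validator.coerce({"query": "進撃の巨人", "genre": "action",
                            "max_results": "10", "in_stock_only": "yes"})
# cleaned: genre dropped, max_results coerced to 10, in_stock_only to True
```

Dropping the hallucinated genre (rather than rejecting the whole call) lets the search fall back to query text only, which matches the guidance added to the tool description in Step 6.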

Timeline

Time (UTC) Event
14:00 Lambda layer deployment with broken package (missing parameter_validator.py)
14:15 First malformed search queries appear in logs
14:23 CloudWatch alarm fires: ZeroResultRate > 15% for 5 minutes
14:25 On-call engineer acknowledges alarm
14:30 Engineer identifies missing validator in Lambda layer via aws lambda get-layer-version
14:35 Rollback Lambda layer to previous version
14:37 Validator restored; zero-result rate begins dropping
14:45 Rate returns to baseline 3%
14:50 Post-incident ticket created for description fix
Next day Tool description updated to list enum values explicitly

Runbook

Step 1: Confirm the symptom

# Check zero-result rate in CloudWatch
aws cloudwatch get-metric-statistics \
  --namespace "MangaAssist/Tools" \
  --metric-name "ZeroResultCount" \
  --dimensions Name=ToolName,Value=product_search \
  --start-time "2026-03-28T14:00:00Z" \
  --end-time "2026-03-28T15:00:00Z" \
  --period 300 \
  --statistics Sum

Step 2: Identify the parameter pattern

# Search CloudWatch Logs for recent product_search invocations
aws logs filter-log-events \
  --log-group-name "/aws/lambda/mangaassist-tool-gateway" \
  --filter-pattern '{ $.tool_name = "product_search" && $.result_count = 0 }' \
  --start-time 1711630800000 \
  --limit 50

Look for:

  • Parameters with string values where integers are expected
  • Genre values not in the defined enum
  • Boolean fields with string values like "yes" or "no"
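During triage, a short script can flag those patterns in the logged parameter payloads. The function below is an illustrative sketch; the genre list mirrors the enum from the tool schema, and the field names match the log excerpts above.

```python
from typing import Any, Dict, List

# Mirrors the product_search genre enum from the tool schema
VALID_GENRES = {"shonen", "shojo", "seinen", "josei", "kodomomuke",
                "isekai", "mecha", "slice_of_life", "horror", "sports"}


def flag_malformed(params: Dict[str, Any]) -> List[str]:
    """Return the suspicious patterns present in one logged invocation."""
    issues: List[str] = []
    genre = params.get("genre")
    if isinstance(genre, str) and genre not in VALID_GENRES:
        issues.append(f"genre '{genre}' is not in the enum")
    if isinstance(params.get("max_results"), str):
        issues.append("max_results is a string, expected integer")
    if params.get("in_stock_only") in ("yes", "no", "true", "false"):
        issues.append("in_stock_only is a string, expected boolean")
    return issues


bad = {"query": "進撃の巨人", "genre": "action",
       "max_results": "10", "in_stock_only": "yes"}
print(flag_malformed(bad))  # flags all three patterns from the incident
```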

Step 3: Check Lambda layer integrity

# List current layer version
aws lambda get-function-configuration \
  --function-name mangaassist-tool-gateway \
  --query 'Layers'

# Download and inspect layer contents
aws lambda get-layer-version \
  --layer-name mangaassist-tools-layer \
  --version-number 14 \
  --query 'Content.Location' --output text | xargs curl -o layer.zip

unzip -l layer.zip | grep parameter_validator

Step 4: Rollback if validator is missing

# Rollback to previous layer version
aws lambda update-function-configuration \
  --function-name mangaassist-tool-gateway \
  --layers "arn:aws:lambda:ap-northeast-1:123456789012:layer:mangaassist-tools-layer:13"

Step 5: Verify recovery

# Monitor zero-result rate after rollback
# Single quotes so $(date ...) is re-evaluated on each refresh;
# double quotes would expand the timestamps once, before watch starts
watch -n 30 'aws cloudwatch get-metric-statistics \
  --namespace "MangaAssist/Tools" \
  --metric-name "ZeroResultCount" \
  --dimensions Name=ToolName,Value=product_search \
  --start-time "$(date -u -d "10 minutes ago" +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 60 --statistics Sum'

Step 6: Fix the tool description (permanent fix)

# BEFORE — vague description, FM hallucinates enum values
"genre": {
    "type": "string",
    "description": "Filter results by manga genre.",
    "enum": ["shonen", "shojo", "seinen", "josei", "kodomomuke",
             "isekai", "mecha", "slice_of_life", "horror", "sports"]
}

# AFTER — explicit description reinforces valid values
"genre": {
    "type": "string",
    "description": (
        "Filter results by manga genre. MUST be one of these exact values: "
        "shonen (action/adventure like Naruto, One Piece), "
        "shojo (romance like Fruits Basket), "
        "seinen (mature like Berserk, Vagabond), "
        "josei (women's like Nana), "
        "kodomomuke (children's like Doraemon), "
        "isekai (transported to another world), "
        "mecha (giant robots like Gundam), "
        "slice_of_life (daily life stories), "
        "horror (scary/dark themes like Junji Ito), "
        "sports (athletic competition like Haikyuu). "
        "If the customer's request does not match any genre exactly, "
        "omit this parameter and search by query text only."
    ),
    "enum": ["shonen", "shojo", "seinen", "josei", "kodomomuke",
             "isekai", "mecha", "slice_of_life", "horror", "sports"]
}

Prevention

  1. CI/CD validation: Add a deployment gate that verifies all expected modules exist in the Lambda layer zip before publishing.
  2. Description best practice: Always list enum values with examples in the natural-language description, not just in the JSON Schema enum field.
  3. Canary deployment: Deploy Lambda changes to 10% of traffic first with automated rollback if ZeroResultRate exceeds 10%.
  4. Type coercion as defense-in-depth: Even if the FM sends wrong types, the coercion layer fixes the most common mistakes before they reach the tool.
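The CI/CD validation gate from item 1 can be sketched as a pre-publish check on the built layer zip. The module paths here are illustrative assumptions; match them to the layer's actual layout.

```python
import zipfile
from typing import List

# Illustrative paths; adjust to the real layer layout
REQUIRED_MODULES = [
    "python/parameter_validator.py",
    "python/tool_registry.py",
]


def missing_modules(layer_zip_path: str, required: List[str]) -> List[str]:
    """Return required module paths absent from the built layer zip."""
    with zipfile.ZipFile(layer_zip_path) as zf:
        names = set(zf.namelist())
    return [m for m in required if m not in names]


def gate(layer_zip_path: str) -> None:
    """Fail the pipeline before the layer version is published."""
    missing = missing_modules(layer_zip_path, REQUIRED_MODULES)
    if missing:
        raise SystemExit(f"Deploy blocked, layer is missing: {missing}")
```

Run as the last step of the layer build; a non-zero exit stops the pipeline before `aws lambda publish-layer-version`.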

Exam Relevance

How each AIP-C01 concept applied in this incident:

  • Tool definitions must be precise and descriptive: vague descriptions cause the FM to hallucinate parameter values.
  • Parameter validation prevents downstream failures: the missing validator let malformed queries reach OpenSearch.
  • Lambda layers require deployment validation: broken packaging removed critical validation code.
  • Monitoring tool success rates detects issues early: the ZeroResultRate alarm caught the problem within 23 minutes.


Scenario 2: Tool Chain Breaking Mid-Execution on Inventory Lookup

Severity

P2 — High — The "search and check stock" chain fails at step 2, leaving customers without inventory information for their search results.

Blast Radius

  • Direct: All queries that trigger the search_and_check_stock tool chain (~15% of daily traffic = 150K messages/day).
  • Indirect: Customers see product results but no stock status. Some proceed to order out-of-stock items, generating support tickets and cancellations.
  • Downstream: Order cancellation rate increases from 2% to 11% because customers order items shown as "available" that are actually out of stock.

Symptom

[2026-03-29 09:14:22 UTC] ERROR mangaassist.chains — Chain
'search_and_check_stock' step 2 (inventory_check) timed out after 1000ms

[2026-03-29 09:14:22 UTC] WARN  mangaassist.chains — Returning partial
result: product_search succeeded, inventory_check failed

[2026-03-29 09:14:23 UTC] INFO  mangaassist.orchestrator — FM received
partial chain result, generating response without inventory data

Observations:

  1. CloudWatch alarm: InventoryCheck/TimeoutRate > 30% sustained for 10 minutes.
  2. The product_search step succeeds consistently (OpenSearch responds in ~200ms).
  3. The inventory_check step times out at 1000ms. The DynamoDB inventory table is responding, but slowly.
  4. DynamoDB ConsumedReadCapacityUnits for the inventory table is at 100% of provisioned capacity.
  5. A batch inventory import job started at 09:00 is consuming all available read capacity.

Root Cause

Capacity contention on DynamoDB inventory table.

The inventory table uses provisioned capacity mode (not on-demand) to control costs. A scheduled batch import job runs daily at 09:00 to sync inventory from the warehouse management system. This job performs a full table scan with high read throughput, consuming all provisioned RCUs. When the tool chain's inventory_check step tries to read from the same table, it gets throttled by DynamoDB, causing timeouts.

Why the chain breaks instead of degrading gracefully:

The search_and_check_stock chain was configured with inventory_check as a required step (required=True). When it times out, the entire chain returns an error instead of returning the successful search results with a note that inventory is unavailable.
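The ToolChain and ChainStep types used in this runbook are never defined in the document, so the required/fallback semantics can be sketched with a minimal executor. The classes below are simplified stand-ins, not the production implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


class StepFailed(Exception):
    """Stand-in for a tool timeout or error."""


@dataclass
class ChainStep:
    tool_name: str
    run: Callable[[Dict[str, Any]], Dict[str, Any]]  # stand-in for the tool call
    required: bool = True
    fallback_value: Optional[Dict[str, Any]] = None


def run_chain(steps: List[ChainStep], ctx: Dict[str, Any]) -> Dict[str, Any]:
    for step in steps:
        try:
            ctx[step.tool_name] = step.run(ctx)
        except StepFailed:
            if step.required:
                raise  # required=True: the whole chain fails, as in the incident
            # required=False: degrade gracefully with the fallback value
            ctx[step.tool_name] = step.fallback_value or {"status": "unknown"}
    return ctx


def fake_search(ctx):
    return {"results": [{"product_id": "AOT-001"}]}


def fake_inventory(ctx):
    raise StepFailed("timed out after 1000ms")


out = run_chain(
    [
        ChainStep("product_search", fake_search),
        ChainStep("inventory_check", fake_inventory, required=False,
                  fallback_value={"status": "unknown"}),
    ],
    {"query": "進撃の巨人"},
)
# out keeps the search results; inventory_check reports status "unknown"
```

With required=True on inventory_check, the same StepFailed propagates and the customer sees an error instead of the search results that already succeeded.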

Timeline

Time (UTC) Event
09:00 Batch inventory import job starts
09:05 DynamoDB RCU consumption reaches 100%
09:10 First inventory_check timeouts appear
09:14 CloudWatch alarm fires: InventoryCheck/TimeoutRate > 30%
09:16 On-call engineer acknowledges
09:20 Engineer identifies DynamoDB throttling via CloudWatch metrics
09:22 Engineer reduces batch job throughput from 500 to 50 writes/sec
09:25 Inventory timeouts begin clearing
09:30 Timeout rate returns to baseline (< 1%)
09:45 Chain configuration updated to make inventory_check optional
10:00 DynamoDB table switched to on-demand capacity mode

Runbook

Step 1: Confirm DynamoDB throttling

# Check throttled read requests
aws cloudwatch get-metric-statistics \
  --namespace "AWS/DynamoDB" \
  --metric-name "ReadThrottleEvents" \
  --dimensions Name=TableName,Value=MangaAssist-Inventory \
  --start-time "2026-03-29T09:00:00Z" \
  --end-time "2026-03-29T10:00:00Z" \
  --period 60 \
  --statistics Sum

# Check consumed vs provisioned capacity
aws cloudwatch get-metric-statistics \
  --namespace "AWS/DynamoDB" \
  --metric-name "ConsumedReadCapacityUnits" \
  --dimensions Name=TableName,Value=MangaAssist-Inventory \
  --start-time "2026-03-29T09:00:00Z" \
  --end-time "2026-03-29T10:00:00Z" \
  --period 60 \
  --statistics Sum

Step 2: Identify the competing workload

# Check if the batch import job is running
aws ecs list-tasks \
  --cluster mangaassist-cluster \
  --family mangaassist-inventory-import \
  --desired-status RUNNING

# Check batch job CloudWatch logs for throughput
aws logs filter-log-events \
  --log-group-name "/ecs/mangaassist-inventory-import" \
  --filter-pattern "items_processed" \
  --start-time 1711699200000 \
  --limit 10

Step 3: Reduce batch job throughput (immediate mitigation)

# Update the batch job's rate limiter via SSM Parameter Store
aws ssm put-parameter \
  --name "/mangaassist/inventory-import/max-writes-per-sec" \
  --value "50" \
  --type String \
  --overwrite

# The batch job reads this parameter and throttles itself

Step 4: Fix the chain configuration (make inventory_check optional)

# BEFORE — inventory_check is required, chain fails completely
search_and_check_stock = ToolChain(
    name="search_and_check_stock",
    steps=[
        ChainStep(
            tool_name="product_search",
            param_mapper=lambda ctx: {"query": ctx["query"], "max_results": 1},
            timeout_ms=1500,
        ),
        ChainStep(
            tool_name="inventory_check",
            param_mapper=lambda ctx: {
                "product_id": ctx.get("results", [{}])[0].get("product_id", ""),
            },
            required=True,  # <-- BUG: chain fails if inventory_check fails
            timeout_ms=1000,
        ),
    ],
)

# AFTER — inventory_check is optional with a fallback value
search_and_check_stock = ToolChain(
    name="search_and_check_stock",
    steps=[
        ChainStep(
            tool_name="product_search",
            param_mapper=lambda ctx: {"query": ctx["query"], "max_results": 1},
            timeout_ms=1500,
        ),
        ChainStep(
            tool_name="inventory_check",
            param_mapper=lambda ctx: {
                "product_id": ctx.get("results", [{}])[0].get("product_id", ""),
            },
            required=False,  # <-- FIX: chain continues with partial results
            fallback_value={
                "status": "unknown",
                "message": "Inventory status is being updated. "
                "Please check the product page for current availability.",
            },
            timeout_ms=1000,
        ),
    ],
)

Step 5: Switch DynamoDB to on-demand (permanent fix)

aws dynamodb update-table \
  --table-name MangaAssist-Inventory \
  --billing-mode PAY_PER_REQUEST

Step 6: Verify recovery

# Monitor timeout rate
aws cloudwatch get-metric-statistics \
  --namespace "MangaAssist/Tools" \
  --metric-name "Error_timeout" \
  --dimensions Name=ToolName,Value=inventory_check \
  --start-time "$(date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 60 --statistics Sum

Prevention

  1. On-demand capacity for tables accessed by both real-time and batch workloads eliminates capacity contention.
  2. Chain design rule: Non-critical enrichment steps (inventory, pricing) should always be required=False with meaningful fallback values.
  3. Batch job scheduling: Run heavy batch imports during off-peak hours (e.g., 03:00 JST) when chatbot traffic is minimal.
  4. DynamoDB reserved capacity + burst: If staying with provisioned mode, configure auto-scaling with a burst buffer of 2x base capacity.

Exam Relevance

How each AIP-C01 concept applied in this incident:

  • Tool chains should degrade gracefully: making enrichment steps optional prevents total chain failure.
  • DynamoDB capacity planning for mixed workloads: batch + real-time contention requires on-demand or auto-scaling.
  • Fallback values maintain user experience: customers still see search results even when inventory is unavailable.
  • Timeout budgets across chain steps: each step must fit within the overall 3-second latency target.


Scenario 3: Strands Tool Returning Unexpected Schema Causing Agent Confusion

Severity

P2 — High — The agent enters a loop, repeatedly calling the same tool and burning through Sonnet tokens at $3/$15 per 1M, generating incoherent responses.

Blast Radius

  • Direct: ~5% of sessions where the recommendation_engine tool is invoked (~50K sessions/day).
  • Cost: Each looping session consumes 3-5x normal token usage. At Sonnet pricing, a single looping session costs ~$0.12 instead of $0.03. Across 50K affected sessions, that is ~$4,500/day in extra spend.
  • UX: Customers receive repetitive or nonsensical recommendation responses.
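The per-session figures can be reproduced from the Sonnet pricing given in the context. The token counts below are assumed averages chosen to match the ~$0.03 baseline, not measured values.

```python
# Sonnet pricing from the scenario: $3 input / $15 output per 1M tokens
SONNET_IN = 3.00 / 1_000_000
SONNET_OUT = 15.00 / 1_000_000


def session_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one session's FM usage."""
    return input_tokens * SONNET_IN + output_tokens * SONNET_OUT


# Assumed token counts matching the scenario's ~$0.03 baseline:
normal = session_cost(6_000, 800)
# A looping session re-sends the growing transcript each retry (~4x tokens):
looping = session_cost(24_000, 3_200)
print(f"normal ~${normal:.3f}, looping ~${looping:.3f}")
```

Input tokens dominate the loop cost because each retry re-sends the full conversation, including every prior tool result.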

Symptom

[2026-03-30 11:42:05 UTC] INFO  mangaassist.orchestrator — FM response
contains tool_use: recommendation_engine (attempt 1)
[2026-03-30 11:42:06 UTC] INFO  mangaassist.orchestrator — Tool result
returned to FM
[2026-03-30 11:42:07 UTC] INFO  mangaassist.orchestrator — FM response
contains tool_use: recommendation_engine (attempt 2, same params)
[2026-03-30 11:42:08 UTC] INFO  mangaassist.orchestrator — Tool result
returned to FM
[2026-03-30 11:42:09 UTC] INFO  mangaassist.orchestrator — FM response
contains tool_use: recommendation_engine (attempt 3, same params)
[2026-03-30 11:42:09 UTC] WARN  mangaassist.orchestrator — Tool loop
detected: recommendation_engine called 3 times with identical params

Observations:

  1. The FM keeps calling recommendation_engine with the same parameters.
  2. The tool returns results successfully each time (no error).
  3. The FM's text response between tool calls says: "Let me check the recommendations for you..." repeatedly.
  4. CloudWatch metric: ConvoLoop/ExcessiveToolCalls alarm fires.

Root Cause

Schema mismatch between tool output and FM expectations.

The recommendation_engine service was updated to v2, which changed its response schema:

# v1 response (what the FM expects based on training/system prompt)
{
    "recommendations": [
        {"title": "Naruto", "author": "Kishimoto", "score": 0.95},
        {"title": "Bleach", "author": "Kubo", "score": 0.88},
    ],
    "count": 2,
    "strategy": "collaborative"
}

# v2 response (what the tool now returns after the service update)
{
    "data": {
        "items": [
            {"product_name": "Naruto", "creator": "Kishimoto", "relevance": 95},
            {"product_name": "Bleach", "creator": "Kubo", "relevance": 88},
        ]
    },
    "metadata": {
        "total": 2,
        "algorithm": "collaborative_v2"
    }
}

The FM receives the v2 response but cannot find the expected recommendations key. It interprets this as an incomplete or failed result and retries the tool call, hoping for a different response. Since the tool is functioning correctly and returns the same v2 schema each time, the FM enters an infinite retry loop.

Timeline

Time (UTC) Event
10:00 Recommendation service v2 deployed to production
10:30 First tool loop incidents appear (low volume — only Sonnet sessions)
11:00 Loop incidents increase as traffic ramps up
11:42 CloudWatch alarm fires: ExcessiveToolCalls > 100 in 5 min
11:45 On-call engineer identifies schema mismatch in logs
11:50 Response adapter Lambda deployed to transform v2 -> v1 schema
11:55 Tool loops stop; normal operation resumes
12:30 Permanent fix: update Strands tool wrapper to handle both schemas

Runbook

Step 1: Detect the loop pattern

# Find sessions with excessive tool calls
aws logs filter-log-events \
  --log-group-name "/ecs/mangaassist-orchestrator" \
  --filter-pattern '{ $.event = "tool_loop_detected" }' \
  --start-time 1711792800000 \
  --limit 20

Step 2: Compare expected vs actual tool output

# Get a sample tool result from the logs
aws logs filter-log-events \
  --log-group-name "/aws/lambda/mangaassist-tool-gateway" \
  --filter-pattern '{ $.tool_name = "recommendation_engine" && $.status = "success" }' \
  --start-time 1711792800000 \
  --limit 5

# Compare the schema of the returned result against the expected schema
# documented in the tool's system prompt or tool description

Step 3: Deploy response adapter (immediate fix)

"""
response_adapter.py — Transforms tool responses to match expected schemas.

This adapter sits between the tool result and the FM, ensuring the
response schema matches what the FM was trained/prompted to expect.
"""

from typing import Any, Dict


def adapt_recommendation_response(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Transform recommendation_engine v2 response to v1 schema."""

    # Detect v2 format
    if "data" in raw and "items" in raw.get("data", {}):
        items = raw["data"]["items"]
        metadata = raw.get("metadata", {})

        adapted = {
            "recommendations": [
                {
                    "title": item.get("product_name", item.get("title", "")),
                    "author": item.get("creator", item.get("author", "")),
                    "score": item.get("relevance", 0) / 100.0
                    if item.get("relevance", 0) > 1
                    else item.get("relevance", 0),
                }
                for item in items
            ],
            "count": metadata.get("total", len(items)),
            "strategy": metadata.get("algorithm", "unknown").replace("_v2", ""),
        }
        return adapted

    # Already v1 format or unknown — return as-is
    return raw


# Register adapters for all tools that might change schemas
RESPONSE_ADAPTERS = {
    "recommendation_engine": adapt_recommendation_response,
    # Add others as needed
}


def adapt_tool_response(
    tool_name: str, raw_response: Dict[str, Any]
) -> Dict[str, Any]:
    """Apply the appropriate adapter for a tool's response."""
    adapter = RESPONSE_ADAPTERS.get(tool_name)
    if adapter:
        return adapter(raw_response)
    return raw_response

Step 4: Add loop detection and breaking to the orchestrator

"""
loop_detector.py — Detects and breaks FM tool-call loops.
"""

from collections import defaultdict
from typing import Any, Dict, List, Optional
import hashlib
import json


class ToolLoopDetector:
    """
    Detects when the FM repeatedly calls the same tool with identical params.

    Detection rules:
    - Same tool + same params called 3+ times -> loop detected
    - Any tool called 5+ times in one turn -> excessive tool use
    - Total tool calls exceed 10 in one turn -> force stop
    """

    def __init__(
        self,
        same_call_limit: int = 3,
        per_tool_limit: int = 5,
        total_limit: int = 10,
    ) -> None:
        self.same_call_limit = same_call_limit
        self.per_tool_limit = per_tool_limit
        self.total_limit = total_limit
        self._call_hashes: List[str] = []
        self._tool_counts: Dict[str, int] = defaultdict(int)

    def check(self, tool_name: str, params: Dict[str, Any]) -> Optional[str]:
        """
        Check if a tool call should be blocked.

        Returns None if OK, or a reason string if the call should be blocked.
        """
        # Track total calls
        self._tool_counts[tool_name] += 1
        total = sum(self._tool_counts.values())

        # Hash the call for duplicate detection
        call_hash = hashlib.md5(
            json.dumps({"tool": tool_name, "params": params}, sort_keys=True).encode()
        ).hexdigest()
        self._call_hashes.append(call_hash)

        # Check: same exact call repeated
        identical_count = self._call_hashes.count(call_hash)
        if identical_count >= self.same_call_limit:
            return (
                f"Loop detected: {tool_name} called {identical_count} times "
                f"with identical parameters"
            )

        # Check: per-tool limit
        if self._tool_counts[tool_name] >= self.per_tool_limit:
            return (
                f"Excessive use: {tool_name} called "
                f"{self._tool_counts[tool_name]} times in this turn"
            )

        # Check: total limit
        if total >= self.total_limit:
            return f"Total tool call limit reached: {total} calls in this turn"

        return None

    def reset(self) -> None:
        """Reset counters for a new conversation turn."""
        self._call_hashes.clear()
        self._tool_counts.clear()

Step 5: Inject a break message when loop is detected

# In the orchestrator's convoLoop, when loop_detector.check() returns a reason:
loop_break_message = {
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": (
                "SYSTEM NOTE: You have called the same tool multiple times "
                "with the same parameters. The tool is working correctly. "
                "Please use the most recent tool result to formulate your "
                "response to the customer. Do not call the tool again."
            ),
        }
    ],
}

Prevention

  1. Response schema contracts: Define and enforce output schemas for every tool. When a backend service changes its API, the tool wrapper must transform the response to match the contract.
  2. Loop detection: Always implement loop detection in the orchestrator with configurable limits.
  3. Tool versioning: Version tool definitions and pin the FM to a specific tool version. Deploy schema adapters when backend services upgrade.
  4. Integration tests: Automated tests that verify tool output schemas match FM expectations after every backend deployment.
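The integration-test idea in item 4 can be sketched as a plain contract assertion run after every backend deployment. The contract keys come from the v1 schema shown in the Root Cause section; the helper name is illustrative.

```python
from typing import Any, Dict, List

# Keys the FM is prompted to expect, per the v1 response schema
RECOMMENDATION_CONTRACT = ["recommendations", "count", "strategy"]


def assert_contract(response: Dict[str, Any], required_keys: List[str]) -> None:
    """Raise if a tool response drops keys the FM was prompted to expect."""
    missing = [k for k in required_keys if k not in response]
    if missing:
        raise AssertionError(f"tool response missing contract keys: {missing}")


# v1 passes the contract; the v2 shape from the incident would fail it
assert_contract({"recommendations": [], "count": 0, "strategy": "collaborative"},
                RECOMMENDATION_CONTRACT)
```

A check like this in the recommendation service's deploy pipeline would have failed on the v2 response before it reached production.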

Exam Relevance

How each AIP-C01 concept applied in this incident:

  • Standardized tool output schemas: schema changes break FM expectations and cause loops.
  • Token cost management: each loop iteration burns Sonnet tokens at $3/$15 per 1M.
  • ConvoLoop control: loop detection prevents runaway token consumption.
  • Tool versioning and adapters: backend changes must be isolated from FM-facing contracts.


Scenario 4: Lambda Cold Start Timeout on Tool Invocation

Severity

P3 — Medium — First request after idle period times out. Affects ~2-5% of sessions that arrive after a cold spell, but resolves after the first invocation warms the Lambda.

Blast Radius

  • Direct: First customer message in a session that hits a cold Lambda experiences a 5-8 second delay (exceeds 3-second target).
  • Indirect: If the orchestrator's tool timeout is set to 2000ms, the cold start causes a timeout error, and the customer gets an error message instead of results.
  • Scale: During low-traffic periods (02:00-06:00 JST), cold starts affect ~40% of incoming requests because all Lambda instances have been reclaimed.

Symptom

[2026-03-30 03:12:44 UTC] INFO  AWS Lambda — INIT_START
Runtime.Version = python:3.12
[2026-03-30 03:12:48 UTC] INFO  AWS Lambda — INIT_REPORT
Duration: 4200.55 ms   Init Duration: 4200.55 ms

[2026-03-30 03:12:48 UTC] ERROR mangaassist.chains — Tool
'product_search' timed out after 2000ms (Lambda init took 4200ms)

Observations:

  1. The Lambda function's INIT_REPORT shows 4200ms initialization time.
  2. The tool timeout is 2000ms, but the Lambda hasn't even finished initializing when the timeout fires.
  3. The heavy initialization includes: importing boto3, loading the ToolRegistry, initializing OpenSearch client connections, and loading the JSON Schema validator.
  4. After the first cold start, subsequent invocations complete in ~150ms (warm).

Root Cause

Lambda cold start exceeds tool timeout budget.

The mangaassist-tool-gateway Lambda function has a cold start time of ~4200ms because:

  1. Large deployment package: The Lambda layer includes the OpenSearch client, jsonschema, and several other dependencies (~45MB total).
  2. Eager initialization: The ToolRegistry singleton is created at module-load time (outside the handler), which triggers connection establishment to OpenSearch, DynamoDB, and Redis.
  3. Python runtime overhead: The Python 3.12 runtime itself adds ~800ms of init time for large packages.

The tool chain's per-step timeout of 2000ms does not account for cold start time. The Lambda invocation times out before the handler even begins executing.

Timeline

Time (UTC) Event
02:00 Traffic drops below threshold; Lambda instances begin reclaiming
02:45 All Lambda instances reclaimed (0 warm instances)
03:12 First customer message arrives; triggers cold start
03:12 Lambda INIT takes 4200ms; tool timeout fires at 2000ms
03:12 Customer receives "Search is taking longer than usual" error
03:12 Lambda finishes init; subsequent calls succeed in ~150ms
03:13 Second customer message succeeds normally
09:00 Engineering reviews cold start metrics during standup
09:30 Provisioned Concurrency configured for minimum 5 instances

Runbook

Step 1: Confirm cold start as the cause

# Check Lambda init durations
aws logs filter-log-events \
  --log-group-name "/aws/lambda/mangaassist-tool-gateway" \
  --filter-pattern "INIT_REPORT" \
  --start-time 1711760400000 \
  --limit 20

# Check for correlation with tool timeouts
aws logs filter-log-events \
  --log-group-name "/aws/lambda/mangaassist-tool-gateway" \
  --filter-pattern '{ $.error_code = "GATEWAY_TIMEOUT" }' \
  --start-time 1711760400000 \
  --limit 20

Step 2: Measure cold start breakdown

# Add timing instrumentation to the Lambda module-level code
import time
_init_start = time.time()

import boto3  # ~600ms
_after_boto3 = time.time()

import jsonschema  # ~200ms
_after_jsonschema = time.time()

from mangaassist.tools.tool_registry import ToolRegistry  # ~800ms
_after_registry = time.time()

# ... OpenSearch client init: ~1200ms
# ... Redis connection: ~400ms

print(f"Init breakdown: boto3={(_after_boto3 - _init_start) * 1000:.0f}ms, "
      f"jsonschema={(_after_jsonschema - _after_boto3) * 1000:.0f}ms, "
      f"registry={(_after_registry - _after_jsonschema) * 1000:.0f}ms")

Step 3: Configure Provisioned Concurrency (immediate fix)

# Set provisioned concurrency to keep 5 instances warm at all times
aws lambda put-provisioned-concurrency-config \
  --function-name mangaassist-tool-gateway \
  --qualifier prod \
  --provisioned-concurrent-executions 5

# Verify configuration
aws lambda get-provisioned-concurrency-config \
  --function-name mangaassist-tool-gateway \
  --qualifier prod

Step 4: Optimize Lambda initialization (permanent fix)

"""
Optimization strategies for reducing Lambda cold start time.
"""

# Strategy 1: Lazy initialization — don't connect at import time
class LazyOpenSearchClient:
    """Only establish connection on first use, not at module load."""

    def __init__(self):
        self._client = None

    @property
    def client(self):
        if self._client is None:
            # Imports deferred so module load stays fast
            import os
            from opensearchpy import OpenSearch, RequestsHttpConnection
            self._client = OpenSearch(
                hosts=[{"host": os.environ["OPENSEARCH_ENDPOINT"], "port": 443}],
                use_ssl=True,
                connection_class=RequestsHttpConnection,
            )
        return self._client


# Strategy 2: Reduce package size — use Lambda layers efficiently
# Move rarely-used dependencies into separate layers loaded on-demand

# Strategy 3: Use SnapStart equivalent for Python
# (As of 2026, AWS supports Lambda SnapStart for Python 3.12+)
# Configure in SAM template:
# Properties:
#   SnapStart:
#     ApplyOn: PublishedVersions

Step 5: Add cold-start-aware timeout in the orchestrator

# In the tool chain executor, detect likely cold start and extend timeout
async def invoke_with_cold_start_awareness(
    self, tool_name: str, params: dict, base_timeout_ms: int = 2000
) -> dict:
    """
    Invoke a tool with extended timeout for likely cold starts.

    If this is the first call to this tool in the session, allow
    extra time for potential Lambda cold start.
    """
    is_first_call = tool_name not in self._warm_tools

    timeout_ms = base_timeout_ms
    if is_first_call:
        timeout_ms = base_timeout_ms + 5000  # Extra 5s for cold start
        logger.info(
            f"First call to '{tool_name}' — extending timeout to "
            f"{timeout_ms}ms for potential cold start"
        )

    result = await asyncio.wait_for(
        self._invoke(tool_name, params),
        timeout=timeout_ms / 1000,
    )

    self._warm_tools.add(tool_name)
    return result
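The timeout-extension logic above can be exercised standalone. The sketch below (class and helper names are illustrative) asserts the budget before and after the first successful call; note that `asyncio.wait_for` raises `asyncio.TimeoutError` when the budget is exceeded, which the orchestrator's fallback path should catch:

```python
import asyncio


class ColdStartAwareExecutor:
    """Standalone sketch of the cold-start-aware timeout logic."""

    def __init__(self, base_timeout_ms=2000, cold_extra_ms=5000):
        self.base_timeout_ms = base_timeout_ms
        self.cold_extra_ms = cold_extra_ms
        self._warm_tools = set()

    def budget_ms(self, tool_name):
        # First call to a tool gets the cold-start allowance
        if tool_name in self._warm_tools:
            return self.base_timeout_ms
        return self.base_timeout_ms + self.cold_extra_ms

    async def invoke(self, tool_name, coro_factory):
        result = await asyncio.wait_for(
            coro_factory(), timeout=self.budget_ms(tool_name) / 1000
        )
        self._warm_tools.add(tool_name)   # only mark warm after success
        return result


async def fake_tool():
    await asyncio.sleep(0.01)
    return {"ok": True}


ex = ColdStartAwareExecutor()
print(ex.budget_ms("product_search"))                # -> 7000 (cold allowance)
asyncio.run(ex.invoke("product_search", fake_tool))
print(ex.budget_ms("product_search"))                # -> 2000 (now warm)
```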

Prevention

  1. Provisioned Concurrency: Always configure minimum warm instances for latency-sensitive tool Lambdas. Cost: ~$0.015/GB-hour for provisioned instances.
  2. Lazy initialization: Defer connection establishment to first use, not module load. Moves init cost from cold start to first invocation (which is already expected to be slower).
  3. Package optimization: Split dependencies into layers; only load what the specific tool needs.
  4. SnapStart: Enable Lambda SnapStart for Python runtimes to snapshot the initialized state and restore from it on cold start (~80% reduction in init time).
  5. Warm-up pings: CloudWatch Events rule that invokes the Lambda every 5 minutes to keep at least one instance warm.
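For item 5, the Lambda itself must recognize the ping and return early so warm-up invocations stay cheap. A sketch where the scheduled rule's constant input carries a `warmup` marker (the marker name and the `run_tool` helper are assumptions, not an AWS convention):

```python
def run_tool(event):
    """Placeholder for the real tool logic."""
    return {"result": f"searched for {event.get('query')}"}


def handler(event, context=None):
    # EventBridge scheduled pings carry a constant input such as
    # {"warmup": true}, configured on the rule itself
    if isinstance(event, dict) and event.get("warmup"):
        return {"warm": True}   # short-circuit: keep the instance alive
    return run_tool(event)      # normal tool invocation path


print(handler({"warmup": True}))          # -> {'warm': True}
print(handler({"query": "進撃の巨人"}))    # real work path
```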

Exam Relevance

| AIP-C01 Concept | Application |
|---|---|
| Lambda cold starts affect tool latency | 4200ms init exceeds the 2000ms tool timeout budget |
| Provisioned Concurrency ensures warm starts | Keeps N instances ready for immediate invocation |
| Tool timeout budgets must account for infrastructure | Network + compute + cold start all count against the 3s target |
| Cost vs latency tradeoff | Provisioned Concurrency costs ~$0.015/GB-hr but eliminates cold start errors |


Scenario 5: Circular Tool Dependency in Multi-Step Workflow

Severity

P1 — Critical — Agent enters infinite loop, consuming tokens without bound until max-turn limit kills the session. Affects all sessions that trigger the circular workflow.

Blast Radius

  • Direct: Any customer query that triggers the "find similar in stock" workflow (~8% of traffic = 80K sessions/day).
  • Cost: Each looping session burns 10-20x normal tokens before being killed. At Sonnet pricing: ~$0.30-$0.60 per affected session vs $0.03 normal. Projected daily overspend: $24,000-$48,000.
  • UX: Affected customers wait the full 30-second timeout, then receive a generic error message. Sessions are unusable.
  • Infrastructure: ECS Fargate task CPU spikes to 100% as multiple concurrent loops consume all available compute.

Symptom

[2026-03-31 08:45:12 UTC] INFO  FM -> tool_use: product_search(query="One Piece Vol 108")
[2026-03-31 08:45:13 UTC] INFO  FM -> tool_use: inventory_check(product_id="MNG-A1B2C3D4")
[2026-03-31 08:45:13 UTC] INFO  inventory_check returned: {"in_stock": false,
"restock_date": "2026-04-15"}
[2026-03-31 08:45:14 UTC] INFO  FM -> tool_use: recommendation_engine(
seed_title="One Piece Vol 108", exclude_owned=true)
[2026-03-31 08:45:15 UTC] INFO  recommendation_engine returned:
{"recommendations": [{"title": "One Piece Vol 107", "product_id": "MNG-E5F6G7H8"}]}
[2026-03-31 08:45:16 UTC] INFO  FM -> tool_use: inventory_check(product_id="MNG-E5F6G7H8")
[2026-03-31 08:45:16 UTC] INFO  inventory_check returned: {"in_stock": false}
[2026-03-31 08:45:17 UTC] INFO  FM -> tool_use: recommendation_engine(
seed_title="One Piece Vol 107", exclude_owned=true)
[2026-03-31 08:45:18 UTC] INFO  recommendation_engine returned:
{"recommendations": [{"title": "One Piece Vol 108", "product_id": "MNG-A1B2C3D4"}]}
[2026-03-31 08:45:19 UTC] INFO  FM -> tool_use: inventory_check(product_id="MNG-A1B2C3D4")
  ... [LOOP: Vol 108 -> Vol 107 -> Vol 108 -> Vol 107 -> ...]

Observations:

  1. The FM is caught in a cycle: check stock on Vol 108 -> out of stock -> recommend similar -> gets Vol 107 -> check stock on Vol 107 -> out of stock -> recommend similar -> gets Vol 108 -> repeat.
  2. The recommendation engine keeps suggesting the other volume in the series because they are the most similar titles.
  3. Neither volume is in stock, so the FM never finds a satisfactory answer and keeps searching.
  4. CloudWatch metric: ConvoLoop/TotalToolCalls spikes; average tool calls per session jumps from 2.1 to 14.7.
  5. ECS Fargate CPU utilization hits 95%.

Root Cause

Circular dependency between inventory_check and recommendation_engine when all similar items are out of stock.

The FM's implicit workflow is:

  1. Search for product
  2. Check inventory
  3. If out of stock -> recommend similar
  4. Check inventory on recommendation
  5. If out of stock -> recommend similar (goes back to step 3)

This creates a cycle when:

  • The recommendation engine returns titles that are in the same series
  • All titles in the series are out of stock
  • The FM lacks a stopping condition in its reasoning

The ToolLoopDetector from Scenario 3 would catch identical calls, but here the calls are not identical — the parameters differ each time (different product IDs). The cycle is at the workflow level, not the individual call level.

Timeline

| Time (UTC) | Event |
|---|---|
| 08:00 | One Piece Vol 107 and 108 both go out of stock (supply chain delay) |
| 08:30 | First circular workflow incidents appear |
| 08:45 | CloudWatch alarm: AvgToolCallsPerSession > 10 |
| 08:47 | ECS CPU alarm: CPUUtilization > 90% |
| 08:48 | On-call engineer acknowledges both alarms |
| 08:52 | Engineer identifies the circular pattern in logs |
| 08:55 | Emergency fix: deploy max tool calls per session limit (10) |
| 09:00 | Looping sessions are terminated at 10 tool calls instead of running indefinitely |
| 09:15 | UX improves — customers get "unavailable" message in <5 seconds |
| 09:30 | Permanent fix deployed: cycle detection + system prompt update |

Runbook

Step 1: Confirm the circular pattern

# Find sessions with high tool call counts
aws logs filter-log-events \
  --log-group-name "/ecs/mangaassist-orchestrator" \
  --filter-pattern '{ $.tool_call_count > 8 }' \
  --start-time 1711871100000 \
  --limit 20

# Extract the tool call sequence for a specific session
aws logs filter-log-events \
  --log-group-name "/ecs/mangaassist-orchestrator" \
  --filter-pattern '{ $.session_id = "sess_abc123" && $.event = "tool_use" }' \
  --start-time 1711871100000

Step 2: Deploy hard limit on tool calls per session (immediate fix)

# In the orchestrator's convoLoop
MAX_TOOL_CALLS_PER_TURN = 10

tool_call_count = 0
while True:
    response = bedrock.converse(
        messages=messages, modelId=model_id, toolConfig=tool_config
    )
    if response["stopReason"] == "tool_use":
        tool_call_count += 1
        if tool_call_count > MAX_TOOL_CALLS_PER_TURN:
            # Discard the pending tool_use and force a text answer.
            # (Converse API content blocks are {"text": ...} dicts.)
            messages.append({
                "role": "user",
                "content": [{
                    "text": (
                        "SYSTEM: Maximum tool calls reached. Please provide "
                        "the best answer you can with the information gathered "
                        "so far. If a product is out of stock, tell the customer "
                        "the expected restock date and suggest they check back later."
                    ),
                }],
            })
            # Omit toolConfig so the FM must respond with text
            response = bedrock.converse(
                messages=messages,
                modelId=model_id,
            )
            break
        # ... otherwise execute the requested tool and append the
        # assistant message + toolResult to messages as usual ...
    else:
        break

Step 3: Implement workflow-level cycle detection

"""
cycle_detector.py — Detects circular patterns in tool call workflows.

Unlike the ToolLoopDetector which catches identical calls, this detects
cycles in the workflow graph where different tools with different params
form a repeating pattern.
"""

from typing import Any, Dict, List, Optional, Tuple


class WorkflowCycleDetector:
    """
    Detects cyclic patterns in sequences of tool calls.

    A cycle is detected when a subsequence of tool calls repeats.
    For example: [A, B, C, A, B, C] has cycle [A, B, C] of length 3.

    Algorithm:
    - Maintain a sliding window of recent tool calls (name + key params)
    - After each call, check if the last N calls match the N calls before them
    - If so, a cycle of length N has been detected
    """

    def __init__(self, max_cycle_length: int = 5) -> None:
        self.max_cycle_length = max_cycle_length
        self._history: List[str] = []

    def record_and_check(
        self, tool_name: str, key_params: Dict[str, Any]
    ) -> Optional[Tuple[int, List[str]]]:
        """
        Record a tool call and check for cycles.

        Args:
            tool_name: Name of the tool being called.
            key_params: Key parameters that identify this specific call
                       (e.g., product_id, not timeout settings).

        Returns:
            None if no cycle, or (cycle_length, cycle_pattern) if detected.
        """
        # Create a fingerprint for this call
        fingerprint = f"{tool_name}:{sorted(key_params.items())}"
        self._history.append(fingerprint)

        # Check for cycles of various lengths
        for cycle_len in range(2, self.max_cycle_length + 1):
            if len(self._history) >= cycle_len * 2:
                recent = self._history[-cycle_len:]
                previous = self._history[-cycle_len * 2 : -cycle_len]
                if recent == previous:
                    pattern = [h.split(":")[0] for h in recent]
                    return (cycle_len, pattern)

        return None

    def reset(self) -> None:
        """Reset for a new conversation turn."""
        self._history.clear()
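Applied to the Vol 107/108 loop from the Symptom log, the detector fires on the 8th call with a cycle of length 4. The sliding-window check reduces to this standalone function (fingerprints abbreviated for readability):

```python
def detect_cycle(history, max_cycle_length=5):
    """Return (cycle_len, pattern) if the tail of history repeats, else None."""
    for cycle_len in range(2, max_cycle_length + 1):
        if len(history) >= cycle_len * 2:
            recent = history[-cycle_len:]
            if recent == history[-cycle_len * 2 : -cycle_len]:
                return cycle_len, recent
    return None


calls = [
    "inventory_check:MNG-A1B2C3D4",     # Vol 108 -> out of stock
    "recommendation_engine:Vol 108",    # -> suggests Vol 107
    "inventory_check:MNG-E5F6G7H8",     # Vol 107 -> out of stock
    "recommendation_engine:Vol 107",    # -> suggests Vol 108
] * 2                                   # the loop repeats

history = []
for n, call in enumerate(calls, start=1):
    history.append(call)
    hit = detect_cycle(history)
    if hit:
        print(f"cycle of length {hit[0]} detected at call {n}")
        break
# -> cycle of length 4 detected at call 8
```

Breaking on the second repetition (as here) bounds wasted calls at roughly twice the cycle length, well under the 10-call hard limit from Step 2.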

Step 4: Update system prompt to prevent circular reasoning

# Add to the MangaAssist system prompt:
ANTI_CYCLE_PROMPT_ADDITION = """
IMPORTANT RULES FOR TOOL USAGE:
- If you check inventory and a product is OUT OF STOCK, do NOT immediately
  search for alternatives and check their stock in a loop.
- Instead, tell the customer the restock date (if available) and ask if they
  would like recommendations for DIFFERENT types of manga (not the same series).
- Never call more than 6 tools total for a single customer question.
- If after 2 recommendation attempts you cannot find an in-stock alternative,
  inform the customer that the titles are temporarily out of stock and suggest
  they enable restock notifications.
"""

Step 5: Add "already checked" context to tool calls

# Track which product IDs have been checked in the session
# and pass this context to the recommendation engine
@tool
def recommendation_engine_v2(
    seed_title: Optional[str] = None,
    exclude_product_ids: Optional[List[str]] = None,
    # ... other params
) -> Dict[str, Any]:
    """
    Generate manga recommendations.

    Args:
        exclude_product_ids: List of product IDs already checked and found
            out of stock. Recommendations will not include these products.
    """
    # ... implementation that filters out already-checked products
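The filtering step inside recommendation_engine_v2 can be a simple set-based exclusion. A sketch (the candidate dict shape follows the Symptom log; the Vinland Saga entry and its product ID are illustrative):

```python
from typing import Any, Dict, List, Optional


def filter_recommendations(
    candidates: List[Dict[str, Any]],
    exclude_product_ids: Optional[List[str]] = None,
) -> List[Dict[str, Any]]:
    """Drop candidates already checked and found out of stock."""
    excluded = set(exclude_product_ids or [])
    return [c for c in candidates if c["product_id"] not in excluded]


candidates = [
    {"title": "One Piece Vol 107", "product_id": "MNG-E5F6G7H8"},
    {"title": "Vinland Saga Vol 1", "product_id": "MNG-J9K0L1M2"},
]
print(filter_recommendations(candidates, ["MNG-E5F6G7H8"]))
# -> [{'title': 'Vinland Saga Vol 1', 'product_id': 'MNG-J9K0L1M2'}]
```

With the out-of-stock Vol 107 excluded, the engine can no longer bounce the FM back and forth between the two volumes.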

Prevention

  1. Hard limit on tool calls: Every orchestrator must have a MAX_TOOL_CALLS_PER_TURN that cannot be exceeded. A value of 10 is reasonable for most chatbot workflows.
  2. Workflow-level cycle detection: Detect repeating patterns of tool calls, not just identical calls. Break cycles after the second repetition.
  3. System prompt guardrails: Explicitly instruct the FM not to loop on out-of-stock checks. Give it a clear exit strategy.
  4. Exclusion parameters: Tool definitions should accept lists of already-tried items to prevent revisiting them.
  5. Cost alerting: CloudWatch alarm on EstimatedTokenCost per session to catch runaway spending before it accumulates.
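For item 5, the EstimatedTokenCost metric can be computed per session from the model response's token usage counts at the Sonnet prices in the context header ($3/$15 per 1M input/output tokens). A minimal sketch (token counts in the example are illustrative):

```python
SONNET_USD_PER_MTOK_IN = 3.00    # $ per 1M input tokens
SONNET_USD_PER_MTOK_OUT = 15.00  # $ per 1M output tokens


def estimated_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Per-session cost estimate to publish as a CloudWatch metric."""
    return (
        input_tokens / 1_000_000 * SONNET_USD_PER_MTOK_IN
        + output_tokens / 1_000_000 * SONNET_USD_PER_MTOK_OUT
    )


# A normal 2-tool session vs a looping ~14-tool session
print(round(estimated_cost_usd(6_000, 800), 3))      # ~0.03 -> normal
print(round(estimated_cost_usd(90_000, 12_000), 3))  # ~0.45 -> looping, ~15x
```

Alarming when the per-session average crosses a small multiple of the normal $0.03 catches a runaway loop within minutes rather than at the daily bill.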

Exam Relevance

| AIP-C01 Concept | Application |
|---|---|
| Tool orchestration requires cycle prevention | Circular dependencies between tools cause infinite loops |
| System prompt engineering controls FM behavior | Explicit stopping rules prevent the FM from looping |
| Token cost management at scale | 80K affected sessions at $0.30-0.60 each = $24K-48K/day overspend |
| Max tool call limits are a safety requirement | Hard limits prevent runaway token consumption regardless of FM reasoning |
| Graceful degradation when tools cannot satisfy the request | FM must know when to stop trying and give the best available answer |


Cross-Scenario Summary

| # | Scenario | Root Cause | Key Fix | Severity |
|---|---|---|---|---|
| 1 | Malformed search params | Vague tool description + missing validator | Explicit enum descriptions + Lambda layer fix | P2 |
| 2 | Chain breaks on inventory | DynamoDB capacity contention | required=False + on-demand capacity | P2 |
| 3 | Unexpected schema causes loop | Backend API schema change | Response adapter + loop detection | P2 |
| 4 | Lambda cold start timeout | Heavy init exceeds tool timeout | Provisioned Concurrency + lazy init | P3 |
| 5 | Circular tool dependency | No cycle detection in workflow | Cycle detector + max tool call limit + prompt guardrails | P1 |

Operational Checklist

Use this checklist when deploying or modifying tool integrations:

  • Every tool has a JSON Schema with explicit enum descriptions in natural language
  • ParameterValidator is deployed and tested in the Lambda layer
  • Type coercion handles string-to-int, string-to-bool, float-to-int
  • Injection detection patterns are current and tested
  • Tool chains mark enrichment steps as required=False with fallback values
  • Chain total timeout fits within the 3-second latency budget
  • Circuit breaker configured per tool (threshold=5, cooldown=60s)
  • Response adapters exist for any tool backed by a versioned API
  • ToolLoopDetector is active (same-call limit=3, per-tool limit=5, total limit=10)
  • WorkflowCycleDetector is active with max cycle length=5
  • Lambda Provisioned Concurrency configured for latency-sensitive tools
  • Lambda cold start time measured and within timeout budget
  • System prompt includes explicit tool usage rules and stopping conditions
  • CloudWatch alarms configured: ZeroResultRate, TimeoutRate, ExcessiveToolCalls, EstimatedTokenCost
  • DynamoDB tables use on-demand capacity for mixed real-time/batch workloads
  • Fallback hierarchy tested: retry -> fallback tool -> cache -> static default -> user error

References