Scenarios and Runbooks — Intelligent Tool Integrations
MangaAssist context: JP Manga store chatbot on AWS — Bedrock Claude 3 (Sonnet at $3/$15 per 1M tokens input/output, Haiku at $0.25/$1.25), OpenSearch Serverless (vector store), DynamoDB (sessions/products), ECS Fargate (orchestrator), API Gateway WebSocket, ElastiCache Redis. Target: useful answer in under 3 seconds, 1M messages/day scale.
Skill Mapping
| Field | Value |
|---|---|
| Certification | AWS AI Practitioner (AIF-C01) |
| Domain | 2 — Implementation and Integration of FM-Powered Applications |
| Task | 2.1 — Select and implement appropriate FM integration strategies |
| Skill | 2.1.6 — Implement intelligent tool integrations to extend FM capabilities and ensure reliable tool operations |
| Focus Areas | Strands API for custom behaviors, standardized function definitions, Lambda for error handling and parameter validation |
Scenario Format
Each scenario follows this standard structure:
SCENARIO N: <Title>
├── Severity: P1/P2/P3/P4
├── Blast Radius: <what is affected>
├── Symptom: <what operators/users observe>
├── Root Cause: <why it happened>
├── Timeline: <detection to resolution>
├── Runbook: <step-by-step fix>
├── Prevention: <how to stop it recurring>
├── Code Fix: <before/after code>
└── Exam Relevance: <AIF-C01 mapping>
Scenario 1: Tool Parameter Validation Failure Causing Malformed Search
Severity
P2 — High — Customers receive irrelevant or empty search results. Not a total outage, but degrades experience for a large fraction of queries.
Blast Radius
- Direct: All product search queries where the FM sends malformed parameters (estimated 8-12% of search traffic during the incident window).
- Indirect: Recommendation engine receives bad seed titles from broken searches, producing poor recommendations downstream.
- Revenue: Missed conversions from customers who see empty results and abandon the session.
Symptom
[2026-03-28 14:23:17 UTC] WARN mangaassist.tools — product_search returned
0 results for query="進撃の巨人" with genre="action" (invalid enum value)
[2026-03-28 14:23:17 UTC] INFO mangaassist.orchestrator — FM selected
product_search with params: {"query": "進撃の巨人", "genre": "action",
"max_results": "10", "in_stock_only": "yes"}
Observations:
1. CloudWatch metric MangaAssist/Tools/product_search/ZeroResultRate spikes from baseline 3% to 22%.
2. Customer complaints in support queue: "I searched for Attack on Titan but nothing came up."
3. The FM is sending genre: "action" instead of the valid enum value "shonen". The genre enum does not include "action" — it is a hallucinated category.
4. The FM is sending max_results: "10" (string) instead of 10 (integer).
5. The FM is sending in_stock_only: "yes" (string) instead of true (boolean).
Root Cause
Two compounding failures:
- Schema definition gap: The `product_search` tool description says "Filter results by manga genre" but does not explicitly list the valid values in the description text. The enum constraint exists in the JSON Schema, but the FM does not always respect schema-level constraints; it relies heavily on the natural-language description. When a customer asks for "action manga," the FM maps it to `genre: "action"` because the description does not guide it to the correct enum value `"shonen"`.
- Missing type coercion: The `ParameterValidator` was not deployed in the latest release due to a packaging error in the Lambda layer. Without coercion, the string values `"10"` and `"yes"` are passed directly to OpenSearch, which either ignores them (silent failure) or throws a type error. (A sketch of the missing coercion pass follows below.)
Why the FM hallucinates enum values:
- Bedrock's tool-use implementation sends the JSON Schema to the FM, but the FM treats enum values as suggestions, not hard constraints.
- If the natural-language description does not reinforce the enum values, the FM may invent its own.
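For reference, a minimal sketch of the coercion pass a validator like `ParameterValidator` performs, assuming it runs on the FM's arguments before JSON Schema validation (the function names and helper sets here are illustrative, not the production module):
from typing import Any, Dict

TRUTHY = {"true", "yes", "y", "1"}
FALSY = {"false", "no", "n", "0"}

def coerce(value: Any, expected_type: str) -> Any:
    """Coerce the most common FM type mistakes before schema validation."""
    if expected_type == "integer" and isinstance(value, str):
        stripped = value.strip()
        if stripped.lstrip("-").isdigit():
            return int(stripped)
    if expected_type == "number" and isinstance(value, str):
        try:
            return float(value)
        except ValueError:
            pass
    if expected_type == "boolean" and isinstance(value, str):
        lowered = value.strip().lower()
        if lowered in TRUTHY:
            return True
        if lowered in FALSY:
            return False
    return value  # unknown shape: let JSON Schema validation reject it

def coerce_params(params: Dict[str, Any], schema: Dict[str, Any]) -> Dict[str, Any]:
    """Apply coercion to every parameter that declares a type in the schema."""
    props = schema.get("properties", {})
    return {
        name: coerce(value, props.get(name, {}).get("type", ""))
        for name, value in params.items()
    }
With this pass, {"max_results": "10", "in_stock_only": "yes"} becomes {"max_results": 10, "in_stock_only": True} before the query reaches OpenSearch. The hallucinated genre: "action" still fails enum validation, which is why the description fix in Step 6 is also required.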
Timeline
| Time (UTC) | Event |
|---|---|
| 14:00 | Lambda layer deployment with broken package (missing parameter_validator.py) |
| 14:15 | First malformed search queries appear in logs |
| 14:23 | CloudWatch alarm fires: ZeroResultRate > 15% for 5 minutes |
| 14:25 | On-call engineer acknowledges alarm |
| 14:30 | Engineer identifies missing validator in Lambda layer via aws lambda get-layer-version |
| 14:35 | Rollback Lambda layer to previous version |
| 14:37 | Validator restored; zero-result rate begins dropping |
| 14:45 | Rate returns to baseline 3% |
| 14:50 | Post-incident ticket created for description fix |
| Next day | Tool description updated to list enum values explicitly |
Runbook
Step 1: Confirm the symptom
# Check zero-result rate in CloudWatch
aws cloudwatch get-metric-statistics \
--namespace "MangaAssist/Tools" \
--metric-name "ZeroResultCount" \
--dimensions Name=ToolName,Value=product_search \
--start-time "2026-03-28T14:00:00Z" \
--end-time "2026-03-28T15:00:00Z" \
--period 300 \
--statistics Sum
Step 2: Identify the parameter pattern
# Search CloudWatch Logs for recent product_search invocations
aws logs filter-log-events \
--log-group-name "/aws/lambda/mangaassist-tool-gateway" \
--filter-pattern '{ $.tool_name = "product_search" && $.result_count = 0 }' \
--start-time 1711630800000 \
--limit 50
Look for:
- Parameters with string values where integers are expected
- Genre values not in the defined enum
- Boolean fields with string values like "yes" or "no"
Step 3: Check Lambda layer integrity
# List current layer version
aws lambda get-function-configuration \
--function-name mangaassist-tool-gateway \
--query 'Layers'
# Download and inspect layer contents
aws lambda get-layer-version \
--layer-name mangaassist-tools-layer \
--version-number 14 \
--query 'Content.Location' --output text | xargs curl -o layer.zip
unzip -l layer.zip | grep parameter_validator
Step 4: Rollback if validator is missing
# Rollback to previous layer version
aws lambda update-function-configuration \
--function-name mangaassist-tool-gateway \
--layers "arn:aws:lambda:ap-northeast-1:123456789012:layer:mangaassist-tools-layer:13"
Step 5: Verify recovery
# Monitor zero-result rate after rollback (outer single quotes so the
# date window is re-evaluated on every refresh, not just once)
watch -n 30 'aws cloudwatch get-metric-statistics \
  --namespace MangaAssist/Tools \
  --metric-name ZeroResultCount \
  --dimensions Name=ToolName,Value=product_search \
  --start-time "$(date -u -d "10 minutes ago" +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 60 --statistics Sum'
Step 6: Fix the tool description (permanent fix)
# BEFORE — vague description, FM hallucinates enum values
"genre": {
"type": "string",
"description": "Filter results by manga genre.",
"enum": ["shonen", "shojo", "seinen", "josei", "kodomomuke",
"isekai", "mecha", "slice_of_life", "horror", "sports"]
}
# AFTER — explicit description reinforces valid values
"genre": {
"type": "string",
"description": (
"Filter results by manga genre. MUST be one of these exact values: "
"shonen (action/adventure like Naruto, One Piece), "
"shojo (romance like Fruits Basket), "
"seinen (mature like Berserk, Vagabond), "
"josei (women's like Nana), "
"kodomomuke (children's like Doraemon), "
"isekai (transported to another world), "
"mecha (giant robots like Gundam), "
"slice_of_life (daily life stories), "
"horror (scary/dark themes like Junji Ito), "
"sports (athletic competition like Haikyuu). "
"If the customer's request does not match any genre exactly, "
"omit this parameter and search by query text only."
),
"enum": ["shonen", "shojo", "seinen", "josei", "kodomomuke",
"isekai", "mecha", "slice_of_life", "horror", "sports"]
}
Prevention
- CI/CD validation: Add a deployment gate that verifies all expected modules exist in the Lambda layer zip before publishing (see the sketch below).
- Description best practice: Always list enum values with examples in the natural-language description, not just in the JSON Schema `enum` field.
- Canary deployment: Deploy Lambda changes to 10% of traffic first, with automated rollback if `ZeroResultRate` exceeds 10%.
- Type coercion as defense-in-depth: Even if the FM sends wrong types, the coercion layer fixes the most common mistakes before they reach the tool.
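A minimal sketch of such a layer gate, assuming the zip follows the standard python/ prefix layout (the required-module list is illustrative):
import sys
import zipfile

# Modules that must be present in the built layer before publishing
REQUIRED_MODULES = [
    "python/parameter_validator.py",
    "python/tool_registry.py",
]

def verify_layer(zip_path: str) -> int:
    """Return 0 if every required module is in the zip, 1 otherwise."""
    with zipfile.ZipFile(zip_path) as zf:
        names = set(zf.namelist())
    missing = [m for m in REQUIRED_MODULES if m not in names]
    if missing:
        print(f"Layer gate FAILED, missing modules: {missing}")
        return 1
    print("Layer gate passed")
    return 0

if __name__ == "__main__":
    sys.exit(verify_layer(sys.argv[1]))
Run it in the pipeline after the zip is built and before aws lambda publish-layer-version; a non-zero exit fails the deploy.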
Exam Relevance
| AIF-C01 Concept | Application |
|---|---|
| Tool definitions must be precise and descriptive | Vague descriptions cause FM to hallucinate parameter values |
| Parameter validation prevents downstream failures | Missing validator led to malformed queries reaching OpenSearch |
| Lambda layers require deployment validation | Broken packaging removed critical validation code |
| Monitoring tool success rates detects issues early | ZeroResultRate alarm caught the problem within 23 minutes |
Scenario 2: Tool Chain Breaking Mid-Execution on Inventory Lookup
Severity
P2 — High — The "search and check stock" chain fails at step 2, leaving customers without inventory information for their search results.
Blast Radius
- Direct: All queries that trigger the `search_and_check_stock` tool chain (~15% of daily traffic, roughly 150K messages/day).
- Indirect: Customers see product results but no stock status. Some proceed to order out-of-stock items, generating support tickets and cancellations.
- Downstream: Order cancellation rate increases from 2% to 11% because customers order items shown as "available" that are actually out of stock.
Symptom
[2026-03-29 09:14:22 UTC] ERROR mangaassist.chains — Chain
'search_and_check_stock' step 2 (inventory_check) timed out after 1000ms
[2026-03-29 09:14:22 UTC] WARN mangaassist.chains — Returning partial
result: product_search succeeded, inventory_check failed
[2026-03-29 09:14:23 UTC] INFO mangaassist.orchestrator — FM received
partial chain result, generating response without inventory data
Observations:
1. CloudWatch alarm: InventoryCheck/TimeoutRate > 30% sustained for 10 minutes.
2. The product_search step succeeds consistently (OpenSearch responds in ~200ms).
3. The inventory_check step times out at 1000ms. The DynamoDB inventory table is responding, but slowly.
4. DynamoDB ConsumedReadCapacityUnits for the inventory table is at 100% of provisioned capacity.
5. A batch inventory import job started at 09:00 is consuming all available read capacity.
Root Cause
Capacity contention on DynamoDB inventory table.
The inventory table uses provisioned capacity mode (not on-demand) to control costs. A scheduled batch import job runs daily at 09:00 to sync inventory from the warehouse management system. This job performs a full table scan with high read throughput, consuming all provisioned RCUs. When the tool chain's inventory_check step tries to read from the same table, it gets throttled by DynamoDB, causing timeouts.
Why the chain breaks instead of degrading gracefully:
The search_and_check_stock chain was configured with inventory_check as a required step (required=True). When it times out, the entire chain returns an error instead of returning the successful search results with a note that inventory is unavailable.
Timeline
| Time (UTC) | Event |
|---|---|
| 09:00 | Batch inventory import job starts |
| 09:05 | DynamoDB RCU consumption reaches 100% |
| 09:10 | First inventory_check timeouts appear |
| 09:14 | CloudWatch alarm fires: InventoryCheck/TimeoutRate > 30% |
| 09:16 | On-call engineer acknowledges |
| 09:20 | Engineer identifies DynamoDB throttling via CloudWatch metrics |
| 09:22 | Engineer reduces batch job throughput from 500 to 50 writes/sec |
| 09:25 | Inventory timeouts begin clearing |
| 09:30 | Timeout rate returns to baseline (< 1%) |
| 09:45 | Chain configuration updated to make inventory_check optional |
| 10:00 | DynamoDB table switched to on-demand capacity mode |
Runbook
Step 1: Confirm DynamoDB throttling
# Check throttled read requests
aws cloudwatch get-metric-statistics \
--namespace "AWS/DynamoDB" \
--metric-name "ReadThrottleEvents" \
--dimensions Name=TableName,Value=MangaAssist-Inventory \
--start-time "2026-03-29T09:00:00Z" \
--end-time "2026-03-29T10:00:00Z" \
--period 60 \
--statistics Sum
# Check consumed vs provisioned capacity
aws cloudwatch get-metric-statistics \
--namespace "AWS/DynamoDB" \
--metric-name "ConsumedReadCapacityUnits" \
--dimensions Name=TableName,Value=MangaAssist-Inventory \
--start-time "2026-03-29T09:00:00Z" \
--end-time "2026-03-29T10:00:00Z" \
--period 60 \
--statistics Sum
Step 2: Identify the competing workload
# Check if the batch import job is running
aws ecs list-tasks \
--cluster mangaassist-cluster \
--family mangaassist-inventory-import \
--desired-status RUNNING
# Check batch job CloudWatch logs for throughput
aws logs filter-log-events \
--log-group-name "/ecs/mangaassist-inventory-import" \
--filter-pattern "items_processed" \
--start-time 1711699200000 \
--limit 10
Step 3: Reduce batch job throughput (immediate mitigation)
# Update the batch job's rate limiter via SSM Parameter Store
aws ssm put-parameter \
--name "/mangaassist/inventory-import/max-writes-per-sec" \
--value "50" \
--type String \
--overwrite
# The batch job reads this parameter and throttles itself
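A sketch of how the import job can honor that parameter with a one-second write window (the `write_item` helper and the 500/sec default are illustrative):
import time

import boto3

ssm = boto3.client("ssm")
RATE_PARAM = "/mangaassist/inventory-import/max-writes-per-sec"

def current_rate_limit(default: int = 500) -> int:
    """Read the operator-controlled limit, falling back to a default."""
    try:
        resp = ssm.get_parameter(Name=RATE_PARAM)
        return int(resp["Parameter"]["Value"])
    except Exception:
        return default

def write_item(item) -> None:
    """Stub: replace with the real DynamoDB put_item call."""

def import_items(items) -> None:
    """Write items, re-reading the rate limit once per one-second window."""
    written, window_start = 0, time.monotonic()
    limit = current_rate_limit()
    for item in items:
        if written >= limit:
            # Sleep out the remainder of the current window
            time.sleep(max(0.0, 1.0 - (time.monotonic() - window_start)))
            written, window_start = 0, time.monotonic()
            limit = current_rate_limit()  # picks up operator changes mid-run
        write_item(item)
        written += 1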
Step 4: Fix the chain configuration (make inventory_check optional)
# BEFORE — inventory_check is required, chain fails completely
search_and_check_stock = ToolChain(
name="search_and_check_stock",
steps=[
ChainStep(
tool_name="product_search",
param_mapper=lambda ctx: {"query": ctx["query"], "max_results": 1},
timeout_ms=1500,
),
ChainStep(
tool_name="inventory_check",
param_mapper=lambda ctx: {
"product_id": ctx.get("results", [{}])[0].get("product_id", ""),
},
required=True, # <-- BUG: chain fails if inventory_check fails
timeout_ms=1000,
),
],
)
# AFTER — inventory_check is optional with a fallback value
search_and_check_stock = ToolChain(
name="search_and_check_stock",
steps=[
ChainStep(
tool_name="product_search",
param_mapper=lambda ctx: {"query": ctx["query"], "max_results": 1},
timeout_ms=1500,
),
ChainStep(
tool_name="inventory_check",
param_mapper=lambda ctx: {
"product_id": ctx.get("results", [{}])[0].get("product_id", ""),
},
required=False, # <-- FIX: chain continues with partial results
fallback_value={
"status": "unknown",
"message": "Inventory status is being updated. "
"Please check the product page for current availability.",
},
timeout_ms=1000,
),
],
)
Step 5: Switch DynamoDB to on-demand (permanent fix)
aws dynamodb update-table \
--table-name MangaAssist-Inventory \
--billing-mode PAY_PER_REQUEST
Step 6: Verify recovery
# Monitor timeout rate
aws cloudwatch get-metric-statistics \
--namespace "MangaAssist/Tools" \
--metric-name "Error_timeout" \
--dimensions Name=ToolName,Value=inventory_check \
--start-time "$(date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
--end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
--period 60 --statistics Sum
Prevention
- On-demand capacity: For tables accessed by both real-time and batch workloads, on-demand mode eliminates capacity contention.
- Chain design rule: Non-critical enrichment steps (inventory, pricing) should always be `required=False` with meaningful fallback values.
- Batch job scheduling: Run heavy batch imports during off-peak hours (e.g., 03:00 JST) when chatbot traffic is minimal.
- Provisioned auto-scaling with burst headroom: If staying with provisioned mode, configure auto-scaling with a burst buffer of 2x base capacity (see the sketch below).
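A sketch of that auto-scaling setup with boto3 (the capacity bounds and the 70% target are illustrative starting points):
import boto3

aas = boto3.client("application-autoscaling")

# Register the table's read capacity as a scalable target
aas.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId="table/MangaAssist-Inventory",
    ScalableDimension="dynamodb:table:ReadCapacityUnits",
    MinCapacity=100,
    MaxCapacity=1000,
)

# Target tracking keeps utilization near 70%, leaving ~30% burst headroom
aas.put_scaling_policy(
    PolicyName="inventory-read-target-tracking",
    ServiceNamespace="dynamodb",
    ResourceId="table/MangaAssist-Inventory",
    ScalableDimension="dynamodb:table:ReadCapacityUnits",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBReadCapacityUtilization"
        },
    },
)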
Exam Relevance
| AIF-C01 Concept | Application |
|---|---|
| Tool chains should degrade gracefully | Making enrichment steps optional prevents total chain failure |
| DynamoDB capacity planning for mixed workloads | Batch + real-time contention requires on-demand or auto-scaling |
| Fallback values maintain user experience | Customers still see search results even when inventory is unavailable |
| Timeout budgets across chain steps | Each step must fit within the overall 3-second latency target |
Scenario 3: Strands Tool Returning Unexpected Schema Causing Agent Confusion
Severity
P2 — High — The agent enters a loop, repeatedly calling the same tool and burning through Sonnet tokens at $3/$15 per 1M, generating incoherent responses.
Blast Radius
- Direct: ~5% of sessions where the `recommendation_engine` tool is invoked (~50K sessions/day).
- Cost: Each looping session consumes 3-5x normal token usage. At Sonnet pricing, a single looping session costs ~$0.12 instead of ~$0.03, so 50K affected sessions add roughly $4,500/day in extra spend.
- UX: Customers receive repetitive or nonsensical recommendation responses.
Symptom
[2026-03-30 11:42:05 UTC] INFO mangaassist.orchestrator — FM response
contains tool_use: recommendation_engine (attempt 1)
[2026-03-30 11:42:06 UTC] INFO mangaassist.orchestrator — Tool result
returned to FM
[2026-03-30 11:42:07 UTC] INFO mangaassist.orchestrator — FM response
contains tool_use: recommendation_engine (attempt 2, same params)
[2026-03-30 11:42:08 UTC] INFO mangaassist.orchestrator — Tool result
returned to FM
[2026-03-30 11:42:09 UTC] INFO mangaassist.orchestrator — FM response
contains tool_use: recommendation_engine (attempt 3, same params)
[2026-03-30 11:42:09 UTC] WARN mangaassist.orchestrator — Tool loop
detected: recommendation_engine called 3 times with identical params
Observations:
1. The FM keeps calling recommendation_engine with the same parameters.
2. The tool returns results successfully each time (no error).
3. The FM's text response between tool calls says: "Let me check the recommendations for you..." repeatedly.
4. CloudWatch metric: ConvoLoop/ExcessiveToolCalls alarm fires.
Root Cause
Schema mismatch between tool output and FM expectations.
The recommendation_engine service was updated to v2, which changed its response schema:
# v1 response (what the FM expects based on training/system prompt)
{
"recommendations": [
{"title": "Naruto", "author": "Kishimoto", "score": 0.95},
{"title": "Bleach", "author": "Kubo", "score": 0.88},
],
"count": 2,
"strategy": "collaborative"
}
# v2 response (what the tool now returns after the service update)
{
"data": {
"items": [
{"product_name": "Naruto", "creator": "Kishimoto", "relevance": 95},
{"product_name": "Bleach", "creator": "Kubo", "relevance": 88},
]
},
"metadata": {
"total": 2,
"algorithm": "collaborative_v2"
}
}
The FM receives the v2 response but cannot find the expected recommendations key. It interprets this as an incomplete or failed result and retries the tool call, hoping for a different response. Since the tool is functioning correctly and returns the same v2 schema each time, the FM enters an infinite retry loop.
Timeline
| Time (UTC) | Event |
|---|---|
| 10:00 | Recommendation service v2 deployed to production |
| 10:30 | First tool loop incidents appear (low volume — only Sonnet sessions) |
| 11:00 | Loop incidents increase as traffic ramps up |
| 11:42 | CloudWatch alarm fires: ExcessiveToolCalls > 100 in 5 min |
| 11:45 | On-call engineer identifies schema mismatch in logs |
| 11:50 | Response adapter Lambda deployed to transform v2 -> v1 schema |
| 11:55 | Tool loops stop; normal operation resumes |
| 12:30 | Permanent fix: update Strands tool wrapper to handle both schemas |
Runbook
Step 1: Detect the loop pattern
# Find sessions with excessive tool calls
aws logs filter-log-events \
--log-group-name "/ecs/mangaassist-orchestrator" \
--filter-pattern '{ $.event = "tool_loop_detected" }' \
--start-time 1711792800000 \
--limit 20
Step 2: Compare expected vs actual tool output
# Get a sample tool result from the logs
aws logs filter-log-events \
--log-group-name "/aws/lambda/mangaassist-tool-gateway" \
--filter-pattern '{ $.tool_name = "recommendation_engine" && $.status = "success" }' \
--start-time 1711792800000 \
--limit 5
# Compare the schema of the returned result against the expected schema
# documented in the tool's system prompt or tool description
Step 3: Deploy response adapter (immediate fix)
"""
response_adapter.py — Transforms tool responses to match expected schemas.
This adapter sits between the tool result and the FM, ensuring the
response schema matches what the FM was trained/prompted to expect.
"""
from typing import Any, Dict
def adapt_recommendation_response(raw: Dict[str, Any]) -> Dict[str, Any]:
"""Transform recommendation_engine v2 response to v1 schema."""
# Detect v2 format
if "data" in raw and "items" in raw.get("data", {}):
items = raw["data"]["items"]
metadata = raw.get("metadata", {})
adapted = {
"recommendations": [
{
"title": item.get("product_name", item.get("title", "")),
"author": item.get("creator", item.get("author", "")),
"score": item.get("relevance", 0) / 100.0
if item.get("relevance", 0) > 1
else item.get("relevance", 0),
}
for item in items
],
"count": metadata.get("total", len(items)),
"strategy": metadata.get("algorithm", "unknown").replace("_v2", ""),
}
return adapted
# Already v1 format or unknown — return as-is
return raw
# Register adapters for all tools that might change schemas
RESPONSE_ADAPTERS = {
"recommendation_engine": adapt_recommendation_response,
# Add others as needed
}
def adapt_tool_response(
tool_name: str, raw_response: Dict[str, Any]
) -> Dict[str, Any]:
"""Apply the appropriate adapter for a tool's response."""
adapter = RESPONSE_ADAPTERS.get(tool_name)
if adapter:
return adapter(raw_response)
return raw_response
Step 4: Add loop detection and breaking to the orchestrator
"""
loop_detector.py — Detects and breaks FM tool-call loops.
"""
from collections import defaultdict
from typing import Any, Dict, List, Optional
import hashlib
import json
class ToolLoopDetector:
"""
Detects when the FM repeatedly calls the same tool with identical params.
Detection rules:
- Same tool + same params called 3+ times -> loop detected
- Any tool called 5+ times in one turn -> excessive tool use
- Total tool calls exceed 10 in one turn -> force stop
"""
def __init__(
self,
same_call_limit: int = 3,
per_tool_limit: int = 5,
total_limit: int = 10,
) -> None:
self.same_call_limit = same_call_limit
self.per_tool_limit = per_tool_limit
self.total_limit = total_limit
self._call_hashes: List[str] = []
self._tool_counts: Dict[str, int] = defaultdict(int)
def check(self, tool_name: str, params: Dict[str, Any]) -> Optional[str]:
"""
Check if a tool call should be blocked.
Returns None if OK, or a reason string if the call should be blocked.
"""
# Track total calls
self._tool_counts[tool_name] += 1
total = sum(self._tool_counts.values())
# Hash the call for duplicate detection
call_hash = hashlib.md5(
json.dumps({"tool": tool_name, "params": params}, sort_keys=True).encode()
).hexdigest()
self._call_hashes.append(call_hash)
# Check: same exact call repeated
identical_count = self._call_hashes.count(call_hash)
if identical_count >= self.same_call_limit:
return (
f"Loop detected: {tool_name} called {identical_count} times "
f"with identical parameters"
)
# Check: per-tool limit
if self._tool_counts[tool_name] >= self.per_tool_limit:
return (
f"Excessive use: {tool_name} called "
f"{self._tool_counts[tool_name]} times in this turn"
)
# Check: total limit
if total >= self.total_limit:
return f"Total tool call limit reached: {total} calls in this turn"
return None
def reset(self) -> None:
"""Reset counters for a new conversation turn."""
self._call_hashes.clear()
self._tool_counts.clear()
Step 5: Inject a break message when loop is detected
# In the orchestrator's convoLoop, when loop_detector.check() returns a reason:
# (content block uses the Bedrock Converse API shape: {"text": ...})
loop_break_message = {
    "role": "user",
    "content": [
        {
            "text": (
                "SYSTEM NOTE: You have called the same tool multiple times "
                "with the same parameters. The tool is working correctly. "
                "Please use the most recent tool result to formulate your "
                "response to the customer. Do not call the tool again."
            ),
        }
    ],
}
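Putting Steps 4 and 5 together, one way the orchestrator might wire the detector to the break message (the `allow_tool_call` helper and logger setup are illustrative):
import logging

logger = logging.getLogger("mangaassist.orchestrator")

detector = ToolLoopDetector()  # call detector.reset() at each new turn

def allow_tool_call(tool_name: str, params: dict, messages: list) -> bool:
    """Gate a tool_use request; inject the break message on a detected loop."""
    reason = detector.check(tool_name, params)
    if reason is None:
        return True
    logger.warning("Blocking tool call: %s", reason)
    messages.append(loop_break_message)  # defined above
    return False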
Prevention
- Response schema contracts: Define and enforce output schemas for every tool. When a backend service changes its API, the tool wrapper must transform the response to match the contract.
- Loop detection: Always implement loop detection in the orchestrator with configurable limits.
- Tool versioning: Version tool definitions and pin the FM to a specific tool version. Deploy schema adapters when backend services upgrade.
- Integration tests: Automated tests that verify tool output schemas match FM expectations after every backend deployment (see the contract-test sketch below).
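A minimal contract-test sketch in pytest style, assuming the adapter module from Step 3 (the v1 schema here is reconstructed from the example responses above):
import jsonschema

from response_adapter import adapt_recommendation_response

# Contract the FM expects, reconstructed from the v1 example
V1_SCHEMA = {
    "type": "object",
    "required": ["recommendations", "count", "strategy"],
    "properties": {
        "recommendations": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["title", "author", "score"],
            },
        },
        "count": {"type": "integer"},
        "strategy": {"type": "string"},
    },
}

def test_v2_response_is_adapted_to_v1_contract():
    v2 = {
        "data": {
            "items": [
                {"product_name": "Naruto", "creator": "Kishimoto", "relevance": 95}
            ]
        },
        "metadata": {"total": 1, "algorithm": "collaborative_v2"},
    }
    adapted = adapt_recommendation_response(v2)
    jsonschema.validate(adapted, V1_SCHEMA)  # raises on contract violation
    assert adapted["strategy"] == "collaborative"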
Exam Relevance
| AIF-C01 Concept | Application |
|---|---|
| Standardized tool output schemas | Schema changes break FM expectations and cause loops |
| Token cost management | Each loop iteration burns Sonnet tokens at $3/$15 per 1M |
| ConvoLoop control | Loop detection prevents runaway token consumption |
| Tool versioning and adapters | Backend changes must be isolated from FM-facing contracts |
Scenario 4: Lambda Cold Start Timeout on Tool Invocation
Severity
P3 — Medium — First request after idle period times out. Affects ~2-5% of sessions that arrive after a cold spell, but resolves after the first invocation warms the Lambda.
Blast Radius
- Direct: First customer message in a session that hits a cold Lambda experiences a 5-8 second delay (exceeds 3-second target).
- Indirect: If the orchestrator's tool timeout is set to 2000ms, the cold start causes a timeout error, and the customer gets an error message instead of results.
- Scale: During low-traffic periods (02:00-06:00 JST), cold starts affect ~40% of incoming requests because all Lambda instances have been reclaimed.
Symptom
[2026-03-30 03:12:44 UTC] INFO AWS Lambda — INIT_START
Runtime.Version = python:3.12
[2026-03-30 03:12:48 UTC] INFO AWS Lambda — INIT_REPORT
Duration: 4200.55 ms Init Duration: 4200.55 ms
[2026-03-30 03:12:48 UTC] ERROR mangaassist.chains — Tool
'product_search' timed out after 2000ms (Lambda init took 4200ms)
Observations:
1. The Lambda function's INIT_REPORT shows 4200ms initialization time.
2. The tool timeout is 2000ms, but the Lambda hasn't even finished initializing when the timeout fires.
3. The heavy initialization includes: importing boto3, loading the ToolRegistry, initializing OpenSearch client connections, and loading the JSON Schema validator.
4. After the first cold start, subsequent invocations complete in ~150ms (warm).
Root Cause
Lambda cold start exceeds tool timeout budget.
The mangaassist-tool-gateway Lambda function has a cold start time of ~4200ms because:
1. Large deployment package: The Lambda layer includes OpenSearch client, jsonschema, and several other dependencies (~45MB total).
2. Eager initialization: The ToolRegistry singleton is created at module-load time (outside the handler), which triggers connection establishment to OpenSearch, DynamoDB, and Redis.
3. Python runtime overhead: Python 3.12 runtime itself adds ~800ms of init time for large packages.
The tool chain's per-step timeout of 2000ms does not account for cold start time. The Lambda invocation times out before the handler even begins executing.
Timeline
| Time (UTC) | Event |
|---|---|
| 02:00 | Traffic drops below threshold; Lambda instances begin to be reclaimed |
| 02:45 | All Lambda instances reclaimed (0 warm instances) |
| 03:12 | First customer message arrives; triggers cold start |
| 03:12 | Lambda INIT takes 4200ms; tool timeout fires at 2000ms |
| 03:12 | Customer receives "Search is taking longer than usual" error |
| 03:12 | Lambda finishes init; subsequent calls succeed in ~150ms |
| 03:13 | Second customer message succeeds normally |
| 09:00 | Engineering reviews cold start metrics during standup |
| 09:30 | Provisioned Concurrency configured for minimum 5 instances |
Runbook
Step 1: Confirm cold start as the cause
# Check Lambda init durations
aws logs filter-log-events \
--log-group-name "/aws/lambda/mangaassist-tool-gateway" \
--filter-pattern "INIT_REPORT" \
--start-time 1711760400000 \
--limit 20
# Check for correlation with tool timeouts
aws logs filter-log-events \
--log-group-name "/aws/lambda/mangaassist-tool-gateway" \
--filter-pattern '{ $.error_code = "GATEWAY_TIMEOUT" }' \
--start-time 1711760400000 \
--limit 20
Step 2: Measure cold start breakdown
# Add timing instrumentation to the Lambda module-level code
import time
_init_start = time.time()
import boto3 # ~600ms
_after_boto3 = time.time()
import jsonschema # ~200ms
_after_jsonschema = time.time()
from mangaassist.tools.tool_registry import ToolRegistry # ~800ms
_after_registry = time.time()
# ... OpenSearch client init: ~1200ms
# ... Redis connection: ~400ms
print(f"Init breakdown: boto3={_after_boto3 - _init_start:.0f}ms, "
f"jsonschema={_after_jsonschema - _after_boto3:.0f}ms, "
f"registry={_after_registry - _after_jsonschema:.0f}ms")
Step 3: Configure Provisioned Concurrency (immediate fix)
# Set provisioned concurrency to keep 5 instances warm at all times
aws lambda put-provisioned-concurrency-config \
--function-name mangaassist-tool-gateway \
--qualifier prod \
--provisioned-concurrent-executions 5
# Verify configuration
aws lambda get-provisioned-concurrency-config \
--function-name mangaassist-tool-gateway \
--qualifier prod
Step 4: Optimize Lambda initialization (permanent fix)
"""
Optimization strategies for reducing Lambda cold start time.
"""
# Strategy 1: Lazy initialization — don't connect at import time
import os

class LazyOpenSearchClient:
    """Only establish the connection on first use, not at module load."""

    def __init__(self):
        self._client = None

    @property
    def client(self):
        if self._client is None:
            # Deferred import keeps opensearchpy's load cost out of cold start
            from opensearchpy import OpenSearch, RequestsHttpConnection

            self._client = OpenSearch(
                hosts=[{"host": os.environ["OPENSEARCH_ENDPOINT"], "port": 443}],
                use_ssl=True,
                connection_class=RequestsHttpConnection,
            )
        return self._client
# Strategy 2: Reduce package size — use Lambda layers efficiently
# Move rarely-used dependencies into separate layers loaded on-demand
# Strategy 3: Use SnapStart equivalent for Python
# (As of 2026, AWS supports Lambda SnapStart for Python 3.12+)
# Configure in SAM template:
# Properties:
# SnapStart:
# ApplyOn: PublishedVersions
Step 5: Add cold-start-aware timeout in the orchestrator
# In the tool chain executor, detect likely cold start and extend timeout
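# Assumes `self._warm_tools` is a set[str] initialized in the executor's
# __init__ and that `self._invoke` performs the actual Lambda invocation.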
async def invoke_with_cold_start_awareness(
self, tool_name: str, params: dict, base_timeout_ms: int = 2000
) -> dict:
"""
Invoke a tool with extended timeout for likely cold starts.
If this is the first call to this tool in the session, allow
extra time for potential Lambda cold start.
"""
is_first_call = tool_name not in self._warm_tools
timeout_ms = base_timeout_ms
if is_first_call:
timeout_ms = base_timeout_ms + 5000 # Extra 5s for cold start
logger.info(
f"First call to '{tool_name}' — extending timeout to "
f"{timeout_ms}ms for potential cold start"
)
result = await asyncio.wait_for(
self._invoke(tool_name, params),
timeout=timeout_ms / 1000,
)
self._warm_tools.add(tool_name)
return result
Prevention
- Provisioned Concurrency: Always configure minimum warm instances for latency-sensitive tool Lambdas. Cost: ~$0.015/GB-hour for provisioned instances.
- Lazy initialization: Defer connection establishment to first use, not module load. Moves init cost from cold start to first invocation (which is already expected to be slower).
- Package optimization: Split dependencies into layers; only load what the specific tool needs.
- SnapStart: Enable Lambda SnapStart for Python runtimes to snapshot the initialized state and restore from it on cold start (~80% reduction in init time).
- Warm-up pings: A scheduled EventBridge (CloudWatch Events) rule invokes the Lambda every 5 minutes to keep at least one instance warm (see the sketch below).
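A sketch of the warm-up schedule via boto3 (the rule name and the {"warmup": true} payload convention are illustrative; the handler should return immediately when it sees the flag):
import json

import boto3

events = boto3.client("events")

# Fire every 5 minutes
events.put_rule(
    Name="mangaassist-tool-gateway-warmup",
    ScheduleExpression="rate(5 minutes)",
    State="ENABLED",
)

events.put_targets(
    Rule="mangaassist-tool-gateway-warmup",
    Targets=[
        {
            "Id": "warmup",
            "Arn": "arn:aws:lambda:ap-northeast-1:123456789012:function:mangaassist-tool-gateway",
            "Input": json.dumps({"warmup": True}),
        }
    ],
)
# Note: the function also needs a resource-based policy allowing
# events.amazonaws.com to invoke it (aws lambda add-permission).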
Exam Relevance
| AIF-C01 Concept | Application |
|---|---|
| Lambda cold starts affect tool latency | 4200ms init exceeds the 2000ms tool timeout budget |
| Provisioned Concurrency ensures warm starts | Keeps N instances ready for immediate invocation |
| Tool timeout budgets must account for infrastructure | Network + compute + cold start all count against the 3s target |
| Cost vs latency tradeoff | Provisioned Concurrency costs ~$0.015/GB-hr but eliminates cold start errors |
Scenario 5: Circular Tool Dependency in Multi-Step Workflow
Severity
P1 — Critical — Agent enters infinite loop, consuming tokens without bound until max-turn limit kills the session. Affects all sessions that trigger the circular workflow.
Blast Radius
- Direct: Any customer query that triggers the "find similar in stock" workflow (~8% of traffic = 80K sessions/day).
- Cost: Each looping session burns 10-20x normal tokens before being killed. At Sonnet pricing: ~$0.30-$0.60 per affected session vs ~$0.03 normal. Projected daily overspend: roughly $22,000-$46,000.
- UX: Affected customers wait the full 30-second timeout, then receive a generic error message. Sessions are unusable.
- Infrastructure: ECS Fargate task CPU spikes to 100% as multiple concurrent loops consume all available compute.
Symptom
[2026-03-31 08:45:12 UTC] INFO FM -> tool_use: product_search(query="One Piece Vol 108")
[2026-03-31 08:45:13 UTC] INFO FM -> tool_use: inventory_check(product_id="MNG-A1B2C3D4")
[2026-03-31 08:45:13 UTC] INFO inventory_check returned: {"in_stock": false,
"restock_date": "2026-04-15"}
[2026-03-31 08:45:14 UTC] INFO FM -> tool_use: recommendation_engine(
seed_title="One Piece Vol 108", exclude_owned=true)
[2026-03-31 08:45:15 UTC] INFO recommendation_engine returned:
{"recommendations": [{"title": "One Piece Vol 107", "product_id": "MNG-E5F6G7H8"}]}
[2026-03-31 08:45:16 UTC] INFO FM -> tool_use: inventory_check(product_id="MNG-E5F6G7H8")
[2026-03-31 08:45:16 UTC] INFO inventory_check returned: {"in_stock": false}
[2026-03-31 08:45:17 UTC] INFO FM -> tool_use: recommendation_engine(
seed_title="One Piece Vol 107", exclude_owned=true)
[2026-03-31 08:45:18 UTC] INFO recommendation_engine returned:
{"recommendations": [{"title": "One Piece Vol 108", "product_id": "MNG-A1B2C3D4"}]}
[2026-03-31 08:45:19 UTC] INFO FM -> tool_use: inventory_check(product_id="MNG-A1B2C3D4")
... [LOOP: Vol 108 -> Vol 107 -> Vol 108 -> Vol 107 -> ...]
Observations:
1. The FM is caught in a cycle: check stock on Vol 108 -> out of stock -> recommend similar -> gets Vol 107 -> check stock on Vol 107 -> out of stock -> recommend similar -> gets Vol 108 -> repeat.
2. The recommendation engine keeps suggesting the other volume in the series because they are the most similar titles.
3. Neither volume is in stock, so the FM never finds a satisfactory answer and keeps searching.
4. CloudWatch metric: ConvoLoop/TotalToolCalls spikes; average tool calls per session jumps from 2.1 to 14.7.
5. ECS Fargate CPU utilization hits 95%.
Root Cause
Circular dependency between inventory_check and recommendation_engine when all similar items are out of stock.
The FM's implicit workflow is:
1. Search for product
2. Check inventory
3. If out of stock -> recommend similar
4. Check inventory on recommendation
5. If out of stock -> recommend similar (goes back to step 3)
This creates a cycle when:
- The recommendation engine returns titles that are in the same series
- All titles in the series are out of stock
- The FM lacks a stopping condition in its reasoning
The ToolLoopDetector from Scenario 3 would catch identical calls, but here the calls are not identical — the parameters differ each time (different product IDs). The cycle is at the workflow level, not the individual call level.
Timeline
| Time (UTC) | Event |
|---|---|
| 08:00 | One Piece Vol 107 and 108 both go out of stock (supply chain delay) |
| 08:30 | First circular workflow incidents appear |
| 08:45 | CloudWatch alarm: AvgToolCallsPerSession > 10 |
| 08:47 | ECS CPU alarm: CPUUtilization > 90% |
| 08:48 | On-call engineer acknowledges both alarms |
| 08:52 | Engineer identifies the circular pattern in logs |
| 08:55 | Emergency fix: deploy max tool calls per session limit (10) |
| 09:00 | Looping sessions are terminated at 10 tool calls instead of running indefinitely |
| 09:15 | UX improves — customers get "unavailable" message in <5 seconds |
| 09:30 | Permanent fix deployed: cycle detection + system prompt update |
Runbook
Step 1: Confirm the circular pattern
# Find sessions with high tool call counts
aws logs filter-log-events \
--log-group-name "/ecs/mangaassist-orchestrator" \
--filter-pattern '{ $.tool_call_count > 8 }' \
--start-time 1711871100000 \
--limit 20
# Extract the tool call sequence for a specific session
aws logs filter-log-events \
--log-group-name "/ecs/mangaassist-orchestrator" \
--filter-pattern '{ $.session_id = "sess_abc123" && $.event = "tool_use" }' \
--start-time 1711871100000
Step 2: Deploy hard limit on tool calls per session (immediate fix)
# In the orchestrator's convoLoop
MAX_TOOL_CALLS_PER_TURN = 10
tool_call_count = 0

while True:
    response = bedrock.converse(
        modelId=model_id, messages=messages, toolConfig=tool_config
    )
    if response["stopReason"] != "tool_use":
        break  # FM produced a final text answer

    tool_call_count += 1
    if tool_call_count > MAX_TOOL_CALLS_PER_TURN:
        # Force the FM to respond without more tools
        messages.append({
            "role": "user",
            "content": [{
                "text": (
                    "SYSTEM: Maximum tool calls reached. Please provide "
                    "the best answer you can with the information gathered "
                    "so far. If a product is out of stock, tell the customer "
                    "the expected restock date and suggest they check back later."
                ),
            }],
        })
        # Omit toolConfig -> FM must respond with text
        response = bedrock.converse(messages=messages, modelId=model_id)
        break

    # Under the limit: execute the requested tool, append the assistant
    # message and the toolResult block to `messages`, then continue the
    # loop for the FM's next step (tool execution elided in this excerpt)
Step 3: Implement workflow-level cycle detection
"""
cycle_detector.py — Detects circular patterns in tool call workflows.
Unlike the ToolLoopDetector which catches identical calls, this detects
cycles in the workflow graph where different tools with different params
form a repeating pattern.
"""
from typing import Any, Dict, List, Optional, Tuple
class WorkflowCycleDetector:
"""
Detects cyclic patterns in sequences of tool calls.
A cycle is detected when a subsequence of tool calls repeats.
For example: [A, B, C, A, B, C] has cycle [A, B, C] of length 3.
Algorithm:
- Maintain a sliding window of recent tool calls (name + key params)
- After each call, check if the last N calls match the N calls before them
- If so, a cycle of length N has been detected
"""
def __init__(self, max_cycle_length: int = 5) -> None:
self.max_cycle_length = max_cycle_length
self._history: List[str] = []
def record_and_check(
self, tool_name: str, key_params: Dict[str, Any]
) -> Optional[Tuple[int, List[str]]]:
"""
Record a tool call and check for cycles.
Args:
tool_name: Name of the tool being called.
key_params: Key parameters that identify this specific call
(e.g., product_id, not timeout settings).
Returns:
None if no cycle, or (cycle_length, cycle_pattern) if detected.
"""
# Create a fingerprint for this call
fingerprint = f"{tool_name}:{sorted(key_params.items())}"
self._history.append(fingerprint)
# Check for cycles of various lengths
for cycle_len in range(2, self.max_cycle_length + 1):
if len(self._history) >= cycle_len * 2:
recent = self._history[-cycle_len:]
previous = self._history[-cycle_len * 2 : -cycle_len]
if recent == previous:
pattern = [h.split(":")[0] for h in recent]
return (cycle_len, pattern)
return None
def reset(self) -> None:
"""Reset for a new conversation turn."""
self._history.clear()
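A quick usage example replaying the Vol 107/108 loop from the symptom log; the alternating product IDs form a length-4 cycle that the identical-call detector would miss:
detector = WorkflowCycleDetector()
calls = [
    ("inventory_check", {"product_id": "MNG-A1B2C3D4"}),
    ("recommendation_engine", {"seed_title": "One Piece Vol 108"}),
    ("inventory_check", {"product_id": "MNG-E5F6G7H8"}),
    ("recommendation_engine", {"seed_title": "One Piece Vol 107"}),
] * 2
for name, params in calls:
    hit = detector.record_and_check(name, params)
    if hit:
        cycle_len, pattern = hit
        print(f"Cycle of length {cycle_len}: {' -> '.join(pattern)}")
        break
# Prints: Cycle of length 4: inventory_check -> recommendation_engine ->
#         inventory_check -> recommendation_engine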
Step 4: Update system prompt to prevent circular reasoning
# Add to the MangaAssist system prompt:
ANTI_CYCLE_PROMPT_ADDITION = """
IMPORTANT RULES FOR TOOL USAGE:
- If you check inventory and a product is OUT OF STOCK, do NOT immediately
search for alternatives and check their stock in a loop.
- Instead, tell the customer the restock date (if available) and ask if they
would like recommendations for DIFFERENT types of manga (not the same series).
- Never call more than 6 tools total for a single customer question.
- If after 2 recommendation attempts you cannot find an in-stock alternative,
inform the customer that the titles are temporarily out of stock and suggest
they enable restock notifications.
"""
Step 5: Add "already checked" context to tool calls
# Track which product IDs have been checked in the session
# and pass this context to the recommendation engine
from typing import Any, Dict, List, Optional

from strands import tool  # assumes the Strands Agents SDK @tool decorator

@tool
def recommendation_engine_v2(
seed_title: Optional[str] = None,
exclude_product_ids: Optional[List[str]] = None,
# ... other params
) -> Dict[str, Any]:
"""
Generate manga recommendations.
Args:
exclude_product_ids: List of product IDs already checked and found
out of stock. Recommendations will not include these products.
"""
# ... implementation that filters out already-checked products
Prevention
- Hard limit on tool calls: Every orchestrator must have a `MAX_TOOL_CALLS_PER_TURN` that cannot be exceeded. A value of 10 is reasonable for most chatbot workflows.
- Workflow-level cycle detection: Detect repeating patterns of tool calls, not just identical calls. Break cycles after the second repetition.
- System prompt guardrails: Explicitly instruct the FM not to loop on out-of-stock checks. Give it a clear exit strategy.
- Exclusion parameters: Tool definitions should accept lists of already-tried items to prevent revisiting them.
- Cost alerting: CloudWatch alarm on `EstimatedTokenCost` per session to catch runaway spending before it accumulates.
Exam Relevance
| AIF-C01 Concept | Application |
|---|---|
| Tool orchestration requires cycle prevention | Circular dependencies between tools cause infinite loops |
| System prompt engineering controls FM behavior | Explicit stopping rules prevent the FM from looping |
| Token cost management at scale | 80K affected sessions at $0.30-$0.60 each vs $0.03 normal ≈ $22K-$46K/day overspend |
| Max tool call limits are a safety requirement | Hard limits prevent runaway token consumption regardless of FM reasoning |
| Graceful degradation when tools cannot satisfy | FM must know when to stop trying and give the best available answer |
Cross-Scenario Summary
| # | Scenario | Root Cause | Key Fix | Severity |
|---|---|---|---|---|
| 1 | Malformed search params | Vague tool description + missing validator | Explicit enum descriptions + Lambda layer fix | P2 |
| 2 | Chain breaks on inventory | DynamoDB capacity contention | `required=False` + on-demand capacity | P2 |
| 3 | Unexpected schema causes loop | Backend API schema change | Response adapter + loop detection | P2 |
| 4 | Lambda cold start timeout | Heavy init exceeds tool timeout | Provisioned Concurrency + lazy init | P3 |
| 5 | Circular tool dependency | No cycle detection in workflow | Cycle detector + max tool call limit + prompt guardrails | P1 |
Operational Checklist
Use this checklist when deploying or modifying tool integrations:
- Every tool has a JSON Schema with explicit enum descriptions in natural language
- `ParameterValidator` is deployed and tested in the Lambda layer
- Type coercion handles string-to-int, string-to-bool, float-to-int
- Injection detection patterns are current and tested
- Tool chains mark enrichment steps as `required=False` with fallback values
- Chain total timeout fits within the 3-second latency budget
- Circuit breaker configured per tool (threshold=5, cooldown=60s)
- Response adapters exist for any tool backed by a versioned API
- `ToolLoopDetector` is active (same-call limit=3, per-tool limit=5, total limit=10)
- `WorkflowCycleDetector` is active with max cycle length=5
- Lambda Provisioned Concurrency configured for latency-sensitive tools
- Lambda cold start time measured and within timeout budget
- System prompt includes explicit tool usage rules and stopping conditions
- CloudWatch alarms configured: ZeroResultRate, TimeoutRate, ExcessiveToolCalls, EstimatedTokenCost
- DynamoDB tables use on-demand capacity for mixed real-time/batch workloads
- Fallback hierarchy tested: retry -> fallback tool -> cache -> static default -> user error (see the sketch below)
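A sketch of that fallback hierarchy as a single entry point; all names here (`invoke_tool`, `FALLBACK_TOOLS`, `cache_get`, `STATIC_DEFAULTS`) are illustrative stand-ins for the gateway's real components:
import asyncio

# Illustrative config: which tool substitutes for which, and safe defaults
FALLBACK_TOOLS = {"inventory_check": "inventory_check_replica"}
STATIC_DEFAULTS = {"inventory_check": {"status": "unknown"}}

async def invoke_tool(name: str, params: dict) -> dict:
    """Stub: replace with the real tool-gateway invocation."""
    raise RuntimeError("unavailable")

async def cache_get(name: str, params: dict):
    """Stub: replace with the real Redis lookup (may return stale data)."""
    return None

async def invoke_with_fallbacks(tool_name: str, params: dict) -> dict:
    # 1. Retry the primary tool once with a short backoff
    for attempt in range(2):
        try:
            return await invoke_tool(tool_name, params)
        except Exception:
            await asyncio.sleep(0.1 * (attempt + 1))
    # 2. Designated fallback tool, if one is configured
    fallback = FALLBACK_TOOLS.get(tool_name)
    if fallback:
        try:
            return await invoke_tool(fallback, params)
        except Exception:
            pass
    # 3. Cached result, even if stale
    cached = await cache_get(tool_name, params)
    if cached is not None:
        return cached
    # 4. Static default keeps the FM's context well-formed
    if tool_name in STATIC_DEFAULTS:
        return STATIC_DEFAULTS[tool_name]
    # 5. Structured error the FM can relay to the user
    return {"error": "tool_unavailable", "tool": tool_name}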