Safeguarded AI Workflow Architecture

Skill 2.1.3 — Controlled FM Behavior Through Multi-Layer Safety

MangaAssist context: JP Manga store chatbot on AWS — Bedrock Claude 3 (Sonnet at $3/$15 per 1M tokens input/output, Haiku at $0.25/$1.25), OpenSearch Serverless (vector store), DynamoDB (sessions/products), ECS Fargate (orchestrator), API Gateway WebSocket, ElastiCache Redis. Target: useful answer in under 3 seconds, 1M messages/day scale.

Mind Map: Safeguarded AI Workflow Components

Safeguarded AI Workflows (Skill 2.1.3)
├── 1. Stopping Conditions
│   ├── Max Iteration Limits
│   │   ├── Step Functions loop counters
│   │   ├── Lambda invocation depth tracking
│   │   └── Agentic loop hard caps (e.g., max 5 tool calls)
│   ├── Token Budget Enforcement
│   │   ├── Input token ceiling per request (4,096 tokens)
│   │   ├── Output token ceiling per response (2,048 tokens)
│   │   ├── Session-level cumulative token tracking
│   │   └── Daily/monthly token quota per user tier
│   ├── Confidence Thresholds
│   │   ├── Minimum similarity score from OpenSearch (>0.75)
│   │   ├── Model confidence scoring for answers
│   │   └── Fallback to human agent below threshold
│   └── Cost Guardrails
│       ├── Per-request cost cap ($0.05 Sonnet / $0.005 Haiku)
│       ├── Per-session cost cap ($0.50)
│       └── Daily spend limit with CloudWatch alarms
│
├── 2. Timeout Mechanisms
│   ├── Lambda Timeouts
│   │   ├── Bedrock inference Lambda: 30s timeout
│   │   ├── RAG retrieval Lambda: 10s timeout
│   │   ├── Session management Lambda: 5s timeout
│   │   └── Graceful degradation on timeout
│   ├── Step Functions TimeoutSeconds
│   │   ├── Per-state timeout (TimeoutSeconds on each Task state)
│   │   ├── Workflow-level timeout (top-level TimeoutSeconds)
│   │   └── Activity task heartbeat monitoring (HeartbeatSeconds)
│   ├── API Gateway Timeout
│   │   ├── WebSocket idle timeout: 10 minutes
│   │   ├── Integration timeout: 29 seconds (hard limit)
│   │   └── Client-side timeout: 3 seconds for UX
│   └── Bedrock InvokeModel Timeout
│       ├── SDK-level read timeout: 25 seconds
│       ├── Connection timeout: 5 seconds
│       └── Streaming response chunk timeout
│
├── 3. IAM Boundaries
│   ├── Least Privilege for Bedrock Access
│   │   ├── Specific model ARN restrictions
│   │   ├── Deny access to unauthorized models
│   │   ├── Condition keys for request attributes
│   │   └── Service control policies (SCPs) at org level
│   ├── Resource-Based Policies
│   │   ├── DynamoDB table-level access (sessions only)
│   │   ├── OpenSearch collection-level policies
│   │   ├── S3 bucket policy for manga assets
│   │   └── KMS key policies for encryption
│   ├── Permission Boundaries
│   │   ├── Maximum permission envelope for Lambda roles
│   │   ├── Prevent privilege escalation
│   │   └── Separate roles per function concern
│   └── Cross-Service Access Controls
│       ├── VPC endpoints for private connectivity
│       ├── Security group rules for ECS Fargate
│       └── Secrets Manager for API keys
│
├── 4. Circuit Breakers
│   ├── Failure Rate Thresholds
│   │   ├── Bedrock API: open at >50% failure in 60s window
│   │   ├── OpenSearch: open at >30% failure in 30s window
│   │   ├── DynamoDB: open at >20% failure in 30s window
│   │   └── External API: open at >40% failure in 45s window
│   ├── Half-Open Testing
│   │   ├── Allow 1 request every 30 seconds
│   │   ├── Progressive recovery (1 → 3 → 10 requests)
│   │   └── Health check endpoint validation
│   ├── Fallback Responses
│   │   ├── Cached popular manga recommendations
│   │   ├── Static FAQ responses from ElastiCache
│   │   ├── "Please try again" with retry button
│   │   └── Route to human agent queue
│   └── State Management
│       ├── ElastiCache Redis for circuit state
│       ├── CloudWatch metrics for monitoring
│       └── SNS alerts on state transitions
│
└── 5. Guardrails (Bedrock Guardrails)
    ├── Content Filtering
    │   ├── Hate/violence/sexual content filters
    │   ├── Manga-appropriate content policies
    │   └── Custom denied topics
    ├── PII Detection
    │   ├── Credit card number masking
    │   ├── Email/phone anonymization
    │   └── Address redaction
    ├── Word/Phrase Filters
    │   ├── Competitor brand blocking
    │   ├── Profanity filtering (JP + EN)
    │   └── Custom regex patterns
    └── Contextual Grounding
        ├── Grounding score threshold (>0.7)
        ├── Relevance score threshold (>0.6)
        └── Hallucination prevention
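
The per-request cost caps in the Cost Guardrails branch can be sanity-checked against the token ceilings and the Claude 3 prices quoted in the MangaAssist context. The sketch below is illustrative only: `PRICES_PER_MILLION`, `COST_CAPS`, `request_cost`, and `within_cap` are hypothetical names, and it assumes a single model call that uses the full 4,096 input and 2,048 output tokens.

```python
# Illustrative sketch: worst-case cost of one model call versus the per-request caps.
# Prices, token ceilings, and caps come from the document; all names are hypothetical.

PRICES_PER_MILLION = {            # USD per 1M tokens: (input, output)
    "sonnet": (3.00, 15.00),
    "haiku": (0.25, 1.25),
}
COST_CAPS = {"sonnet": 0.05, "haiku": 0.005}   # USD per request
MAX_INPUT_TOKENS = 4096
MAX_OUTPUT_TOKENS = 2048


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one call for the given token counts."""
    in_price, out_price = PRICES_PER_MILLION[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def within_cap(model: str, input_tokens: int, output_tokens: int) -> bool:
    """True if the call stays at or below the per-request cost cap."""
    return request_cost(model, input_tokens, output_tokens) <= COST_CAPS[model]


if __name__ == "__main__":
    for model in PRICES_PER_MILLION:
        worst = request_cost(model, MAX_INPUT_TOKENS, MAX_OUTPUT_TOKENS)
        print(f"{model}: worst case ${worst:.6f} (cap ${COST_CAPS[model]})")
```

Under these assumptions a maxed-out call costs about $0.043 on Sonnet and $0.0036 on Haiku, so both caps leave a little headroom for a single call. An agentic loop that invokes the model several times per user message would need its costs summed before comparing against the cap.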

Architecture Diagram: Multi-Layer Safety in MangaAssist

┌─────────────────────────────────────────────────────────────────────────────┐
│                        MangaAssist Safety Architecture                      │
│                        (Multi-Layer Defense in Depth)                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌─────────────┐    LAYER 1: API GATEWAY                                    │
│  │   Client     │    ┌──────────────────────────────────────────┐           │
│  │  (Browser)   │───▶│  API Gateway WebSocket                   │           │
│  │  3s timeout  │    │  • Rate limiting: 100 req/s per user     │           │
│  └─────────────┘    │  • WAF rules: SQL injection, XSS         │           │
│                      │  • Integration timeout: 29s              │           │
│                      │  • Request validation (JSON schema)      │           │
│                      └──────────────┬───────────────────────────┘           │
│                                     │                                       │
│                      LAYER 2: ORCHESTRATION (ECS Fargate)                   │
│                      ┌──────────────▼───────────────────────────┐           │
│                      │  ECS Fargate Orchestrator                 │           │
│                      │  ┌─────────────────────────────────┐     │           │
│                      │  │  Circuit Breaker Manager         │     │           │
│                      │  │  • Bedrock CB (50% fail → open)  │     │           │
│                      │  │  • OpenSearch CB (30% → open)    │     │           │
│                      │  │  • DynamoDB CB (20% → open)      │     │           │
│                      │  └─────────────────────────────────┘     │           │
│                      │  ┌─────────────────────────────────┐     │           │
│                      │  │  Token Budget Tracker             │     │           │
│                      │  │  • Per-request: 4,096 in / 2,048 │     │           │
│                      │  │  • Per-session: 50,000 tokens     │     │           │
│                      │  │  • Per-user-day: 500,000 tokens   │     │           │
│                      │  └─────────────────────────────────┘     │           │
│                      └──────────────┬───────────────────────────┘           │
│                                     │                                       │
│                      LAYER 3: STEP FUNCTIONS WORKFLOW                       │
│                      ┌──────────────▼───────────────────────────┐           │
│                      │  Step Functions State Machine              │           │
│                      │  • Workflow timeout: 60 seconds           │           │
│                      │  • Max iterations: 5 (agentic loops)     │           │
│                      │  • Per-state timeout: 30 seconds         │           │
│                      │  • Retry with exponential backoff        │           │
│                      │  • Catch → fallback states               │           │
│                      └──────────────┬───────────────────────────┘           │
│                                     │                                       │
│                      LAYER 4: LAMBDA FUNCTIONS                              │
│                      ┌──────────────▼───────────────────────────┐           │
│                      │  Lambda Functions (per-concern roles)     │           │
│                      │                                           │           │
│                      │  ┌──────────┐ ┌──────────┐ ┌──────────┐ │           │
│                      │  │ Bedrock  │ │   RAG    │ │ Session  │ │           │
│                      │  │ Invoke   │ │ Retrieve │ │ Manager  │ │           │
│                      │  │ 30s TO   │ │ 10s TO   │ │  5s TO   │ │           │
│                      │  │ 512MB    │ │ 256MB    │ │ 128MB    │ │           │
│                      │  └────┬─────┘ └────┬─────┘ └────┬─────┘ │           │
│                      │       │IAM          │IAM         │IAM    │           │
│                      └───────┼─────────────┼────────────┼───────┘           │
│                              │             │            │                   │
│                      LAYER 5: AWS SERVICES WITH IAM BOUNDARIES              │
│                      ┌───────▼─────────────▼────────────▼───────┐           │
│                      │                                           │           │
│                      │  ┌──────────┐ ┌──────────┐ ┌──────────┐ │           │
│                      │  │ Bedrock  │ │OpenSearch│ │ DynamoDB │ │           │
│                      │  │ Claude 3 │ │Serverless│ │ Sessions │ │           │
│                      │  │+Guardrail│ │ (Vector) │ │+Products │ │           │
│                      │  └──────────┘ └──────────┘ └──────────┘ │           │
│                      │                                           │           │
│                      │  ┌──────────┐ ┌──────────┐               │           │
│                      │  │ElastiCach│ │CloudWatch│               │           │
│                      │  │  Redis   │ │  Alarms  │               │           │
│                      │  │(CB state)│ │(monitors)│               │           │
│                      │  └──────────┘ └──────────┘               │           │
│                      └───────────────────────────────────────────┘           │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
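
The Token Budget Tracker box in Layer 2 can be pictured as a small admission check that runs before any model call. The following is a minimal in-memory sketch, not the orchestrator's actual implementation: `TokenLimits` and `TokenBudgetTracker` are hypothetical names, plain dictionaries stand in for whatever shared store (for example Redis or DynamoDB) a real deployment would use, and the daily reset is left out.

```python
# Minimal sketch of a three-level token budget check (request, session, user-day).
# Limits mirror the diagram; class and method names are hypothetical.
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class TokenLimits:
    request_input: int = 4096
    request_output: int = 2048
    session_total: int = 50_000
    user_daily_total: int = 500_000


class TokenBudgetTracker:
    def __init__(self, limits: TokenLimits = TokenLimits()) -> None:
        self.limits = limits
        self.session_used: Dict[str, int] = {}   # session_id -> tokens used
        self.daily_used: Dict[str, int] = {}     # user_id -> tokens used today

    def check(self, session_id: str, user_id: str, input_tokens: int) -> str:
        """Return "OK" or the name of the first limit that blocks the request."""
        if input_tokens > self.limits.request_input:
            return "REQUEST_INPUT"
        if self.session_used.get(session_id, 0) >= self.limits.session_total:
            return "TOKEN_BUDGET"
        if self.daily_used.get(user_id, 0) >= self.limits.user_daily_total:
            return "DAILY_LIMIT"
        return "OK"

    def record(self, session_id: str, user_id: str,
               input_tokens: int, output_tokens: int) -> None:
        """Add the tokens consumed by a completed call to both running totals."""
        used = input_tokens + output_tokens
        self.session_used[session_id] = self.session_used.get(session_id, 0) + used
        self.daily_used[user_id] = self.daily_used.get(user_id, 0) + used


if __name__ == "__main__":
    tracker = TokenBudgetTracker()
    print(tracker.check("session-1", "user-1", input_tokens=1200))
    tracker.record("session-1", "user-1", input_tokens=1200, output_tokens=800)
    print(tracker.session_used["session-1"])
```

The `TOKEN_BUDGET` and `DAILY_LIMIT` return values reuse the `fallbackType` strings from the workflow definition so that the orchestrator and the state machine can report the same reason codes.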

Step Functions ASL: Stopping Conditions and Error Handling

{
  "Comment": "MangaAssist Safeguarded Chat Workflow — Skill 2.1.3",
  "StartAt": "InitializeRequest",
  "TimeoutSeconds": 60,
  "States": {
    "InitializeRequest": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:ap-northeast-1:123456789012:function:manga-init-request",
      "TimeoutSeconds": 5,
      "HeartbeatSeconds": 3,
      "ResultPath": "$.context",
      "Next": "CheckTokenBudget",
      "Retry": [
        {
          "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
          "IntervalSeconds": 1,
          "MaxAttempts": 2,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "ResultPath": "$.error",
          "Next": "FallbackResponse"
        }
      ]
    },

    "CheckTokenBudget": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.context.sessionTokensUsed",
          "NumericGreaterThanEquals": 50000,
          "Next": "TokenBudgetExceeded"
        },
        {
          "Variable": "$.context.dailyTokensUsed",
          "NumericGreaterThanEquals": 500000,
          "Next": "DailyLimitExceeded"
        }
      ],
      "Default": "CheckCircuitBreaker"
    },

    "TokenBudgetExceeded": {
      "Type": "Pass",
      "Result": {
        "statusCode": 429,
        "message": "Session token budget exceeded. Please start a new session.",
        "messageJa": "セッションのトークン予算を超えました。新しいセッションを開始してください。",
        "fallbackType": "TOKEN_BUDGET"
      },
      "End": true
    },

    "DailyLimitExceeded": {
      "Type": "Pass",
      "Result": {
        "statusCode": 429,
        "message": "Daily usage limit reached. Premium users get 5x more quota.",
        "messageJa": "1日の利用上限に達しました。プレミアムユーザーは5倍のクォータがあります。",
        "fallbackType": "DAILY_LIMIT"
      },
      "End": true
    },

    "CheckCircuitBreaker": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:ap-northeast-1:123456789012:function:manga-check-circuit-breaker",
      "TimeoutSeconds": 3,
      "ResultPath": "$.circuitState",
      "Next": "EvaluateCircuitState",
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "ResultPath": "$.error",
          "Next": "RetrieveContext"
        }
      ]
    },

    "EvaluateCircuitState": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.circuitState.bedrockCircuit",
          "StringEquals": "OPEN",
          "Next": "CircuitOpenFallback"
        }
      ],
      "Default": "RetrieveContext"
    },

    "CircuitOpenFallback": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:ap-northeast-1:123456789012:function:manga-cached-response",
      "TimeoutSeconds": 5,
      "Parameters": {
        "query.$": "$.userMessage",
        "fallbackType": "CIRCUIT_OPEN"
      },
      "End": true
    },

    "RetrieveContext": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:ap-northeast-1:123456789012:function:manga-rag-retrieve",
      "TimeoutSeconds": 10,
      "HeartbeatSeconds": 5,
      "Parameters": {
        "query.$": "$.userMessage",
        "maxResults": 5,
        "minScore": 0.75
      },
      "ResultPath": "$.ragContext",
      "Next": "InitializeAgentLoop",
      "Retry": [
        {
          "ErrorEquals": ["OpenSearchError"],
          "IntervalSeconds": 2,
          "MaxAttempts": 2,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.TaskFailed", "States.Timeout"],
          "ResultPath": "$.error",
          "Next": "InvokeBedrockWithoutContext"
        }
      ]
    },

    "InitializeAgentLoop": {
      "Type": "Pass",
      "Result": 0,
      "ResultPath": "$.iterationCount",
      "Next": "InitializeToolResult"
    },

    "InitializeToolResult": {
      "Type": "Pass",
      "Comment": "Seed an empty tool result so the first InvokeBedrock call can resolve $.toolResult",
      "Result": {},
      "ResultPath": "$.toolResult",
      "Next": "InvokeBedrock"
    },

    "InvokeBedrock": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:ap-northeast-1:123456789012:function:manga-bedrock-invoke",
      "TimeoutSeconds": 30,
      "HeartbeatSeconds": 10,
      "Parameters": {
        "userMessage.$": "$.userMessage",
        "ragContext.$": "$.ragContext",
        "conversationHistory.$": "$.context.history",
        "toolResult.$": "$.toolResult",
        "maxTokens": 2048,
        "guardrailId": "manga-content-guardrail",
        "guardrailVersion": "1"
      },
      "ResultPath": "$.bedrockResponse",
      "Next": "CheckAgentAction",
      "Retry": [
        {
          "ErrorEquals": ["ThrottlingException"],
          "IntervalSeconds": 2,
          "MaxAttempts": 3,
          "BackoffRate": 3.0
        },
        {
          "ErrorEquals": ["ModelTimeoutException"],
          "IntervalSeconds": 1,
          "MaxAttempts": 1,
          "BackoffRate": 1.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["GuardrailInterventionException"],
          "ResultPath": "$.error",
          "Next": "GuardrailBlockedResponse"
        },
        {
          "ErrorEquals": ["States.Timeout"],
          "ResultPath": "$.error",
          "Next": "TimeoutFallback"
        },
        {
          "ErrorEquals": ["States.ALL"],
          "ResultPath": "$.error",
          "Next": "FallbackResponse"
        }
      ]
    },

    "InvokeBedrockWithoutContext": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:ap-northeast-1:123456789012:function:manga-bedrock-invoke",
      "TimeoutSeconds": 30,
      "Parameters": {
        "userMessage.$": "$.userMessage",
        "ragContext": [],
        "maxTokens": 1024,
        "guardrailId": "manga-content-guardrail",
        "guardrailVersion": "1"
      },
      "ResultPath": "$.bedrockResponse",
      "Next": "FormatResponse",
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "ResultPath": "$.error",
          "Next": "FallbackResponse"
        }
      ]
    },

    "CheckAgentAction": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.bedrockResponse.actionRequired",
          "BooleanEquals": true,
          "Next": "IncrementIteration"
        }
      ],
      "Default": "FormatResponse"
    },

    "IncrementIteration": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:ap-northeast-1:123456789012:function:manga-increment-counter",
      "Parameters": {
        "currentCount.$": "$.iterationCount"
      },
      "ResultPath": "$.iterationCount",
      "Next": "CheckStoppingCondition"
    },

    "CheckStoppingCondition": {
      "Type": "Choice",
      "Comment": "Hard cap on agentic loops: iterationCount is incremented before this check, so Bedrock is invoked at most 5 times and at most 4 tool calls execute",
      "Choices": [
        {
          "Variable": "$.iterationCount",
          "NumericGreaterThanEquals": 5,
          "Next": "MaxIterationsReached"
        }
      ],
      "Default": "ExecuteAgentAction"
    },

    "ExecuteAgentAction": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:ap-northeast-1:123456789012:function:manga-execute-tool",
      "TimeoutSeconds": 15,
      "Parameters": {
        "action.$": "$.bedrockResponse.action",
        "parameters.$": "$.bedrockResponse.actionParams"
      },
      "ResultPath": "$.toolResult",
      "Next": "InvokeBedrock",
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "ResultPath": "$.error",
          "Next": "FallbackResponse"
        }
      ]
    },

    "MaxIterationsReached": {
      "Type": "Pass",
      "Result": {
        "statusCode": 200,
        "message": "I found some information but couldn't fully complete the search. Here's what I have so far.",
        "messageJa": "情報を見つけましたが、検索を完全に完了できませんでした。現時点での情報をお伝えします。",
        "partial": true
      },
      "ResultPath": "$.iterationWarning",
      "Next": "FormatResponse"
    },

    "GuardrailBlockedResponse": {
      "Type": "Pass",
      "Result": {
        "statusCode": 200,
        "message": "I can't help with that specific request. Can I help you find manga, check order status, or answer questions about our store?",
        "messageJa": "そのリクエストにはお応えできません。漫画の検索、注文状況の確認、またはストアに関する質問にお答えできます。",
        "fallbackType": "GUARDRAIL_BLOCK"
      },
      "End": true
    },

    "TimeoutFallback": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:ap-northeast-1:123456789012:function:manga-timeout-handler",
      "TimeoutSeconds": 5,
      "Parameters": {
        "query.$": "$.userMessage",
        "timeoutSource": "BEDROCK"
      },
      "End": true,
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "ResultPath": "$.error",
          "Next": "FallbackResponse"
        }
      ]
    },

    "FormatResponse": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:ap-northeast-1:123456789012:function:manga-format-response",
      "TimeoutSeconds": 5,
      "Parameters": {
        "response.$": "$.bedrockResponse",
        "sessionId.$": "$.context.sessionId"
      },
      "Next": "SaveSession"
    },

    "SaveSession": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:ap-northeast-1:123456789012:function:manga-save-session",
      "TimeoutSeconds": 5,
      "End": true,
      "Retry": [
        {
          "ErrorEquals": ["DynamoDB.ProvisionedThroughputExceededException"],
          "IntervalSeconds": 1,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "ResultPath": "$.error",
          "Next": "SessionSaveFailed"
        }
      ]
    },

    "SessionSaveFailed": {
      "Type": "Pass",
      "Comment": "A Catch entry must name a Next state; session persistence is best-effort, so end the workflow here",
      "End": true
    },

    "FallbackResponse": {
      "Type": "Pass",
      "Result": {
        "statusCode": 503,
        "message": "I'm having trouble right now. Please try again in a moment, or browse our popular manga picks!",
        "messageJa": "現在問題が発生しています。しばらくしてからもう一度お試しいただくか、人気の漫画をご覧ください！",
        "fallbackType": "GENERAL_ERROR",
        "showPopularManga": true
      },
      "End": true
    }
  }
}
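
The `Retry` blocks in the workflow above wait `IntervalSeconds` before the first retry and multiply that wait by `BackoffRate` for each later attempt. The sketch below works out the resulting schedules for the retriers defined in the state machine; `retry_delays` and `RETRIERS` are hypothetical names, and the calculation assumes no `MaxDelaySeconds` or jitter, since none is configured.

```python
# Worked arithmetic for Step Functions retry backoff.
# Values are copied from the state machine; function and variable names are hypothetical.
from typing import Dict, List, Tuple


def retry_delays(interval_seconds: float, max_attempts: int,
                 backoff_rate: float) -> List[float]:
    """Wait before each retry: interval * backoff_rate ** (attempt index)."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


# (IntervalSeconds, MaxAttempts, BackoffRate) per retrier
RETRIERS: Dict[str, Tuple[float, int, float]] = {
    "InitializeRequest": (1, 2, 2.0),
    "RetrieveContext": (2, 2, 2.0),
    "InvokeBedrock/ThrottlingException": (2, 3, 3.0),
    "SaveSession": (1, 3, 2.0),
}

if __name__ == "__main__":
    for name, (interval, attempts, rate) in RETRIERS.items():
        delays = retry_delays(interval, attempts, rate)
        print(f"{name}: waits {delays}, total {sum(delays)}s")
```

For the `ThrottlingException` retrier on `InvokeBedrock`, the waits are 2, 6, and 18 seconds, 26 seconds in total. The waits alone take up a large share of the 60-second workflow timeout before any 30-second task attempt is counted, which is worth keeping in mind when tuning `MaxAttempts` and `BackoffRate`.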

IAM Policy Examples

Bedrock Model Access — Least Privilege

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowSpecificBedrockModels",
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream"
      ],
      "Resource": [
        "arn:aws:bedrock:ap-northeast-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
        "arn:aws:bedrock:ap-northeast-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0"
      ],
      "Condition": {
        "StringEquals": {
          "aws:RequestedRegion": "ap-northeast-1"
        }
      }
    },
    {
      "Sid": "AllowGuardrailUsage",
      "Effect": "Allow",
      "Action": [
        "bedrock:ApplyGuardrail"
      ],
      "Resource": [
        "arn:aws:bedrock:ap-northeast-1:123456789012:guardrail/manga-content-guardrail"
      ]
    },
    {
      "Sid": "DenyAllOtherModels",
      "Effect": "Deny",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream"
      ],
      "NotResource": [
        "arn:aws:bedrock:ap-northeast-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
        "arn:aws:bedrock:ap-northeast-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0"
      ]
    },
    {
      "Sid": "DenyModelCustomization",
      "Effect": "Deny",
      "Action": [
        "bedrock:CreateModelCustomizationJob",
        "bedrock:CreateProvisionedModelThroughput",
        "bedrock:DeleteModelInvocationLoggingConfiguration"
      ],
      "Resource": "*"
    }
  ]
}
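
The `AllowSpecificBedrockModels` and `DenyAllOtherModels` statements only work as intended when their ARN lists are identical. One way to keep them in sync is to generate both from a single list of model IDs. This is a sketch under that assumption; `model_arn` and `build_bedrock_policy` are hypothetical helpers, and only the two invoke-related statements are reproduced.

```python
# Sketch: derive the Allow and Deny statements from one allow-list of model IDs
# so the two ARN lists cannot drift apart. Helper names are hypothetical.
import json
from typing import Dict, List

INVOKE_ACTIONS = ["bedrock:InvokeModel", "bedrock:InvokeModelWithResponseStream"]


def model_arn(region: str, model_id: str) -> str:
    """Foundation-model ARNs have an empty account field."""
    return f"arn:aws:bedrock:{region}::foundation-model/{model_id}"


def build_bedrock_policy(region: str, model_ids: List[str]) -> Dict:
    arns = [model_arn(region, model_id) for model_id in model_ids]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowSpecificBedrockModels",
                "Effect": "Allow",
                "Action": INVOKE_ACTIONS,
                "Resource": arns,
            },
            {
                "Sid": "DenyAllOtherModels",
                "Effect": "Deny",
                "Action": INVOKE_ACTIONS,
                "NotResource": arns,
            },
        ],
    }


if __name__ == "__main__":
    policy = build_bedrock_policy(
        "ap-northeast-1",
        [
            "anthropic.claude-3-sonnet-20240229-v1:0",
            "anthropic.claude-3-haiku-20240307-v1:0",
        ],
    )
    print(json.dumps(policy, indent=2))
```

Adding or retiring a model then becomes a one-line change to the list, and the explicit deny follows automatically.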

DynamoDB Session Access — Scoped to Table and Partition Key

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowSessionTableAccess",
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:UpdateItem",
        "dynamodb:Query"
      ],
      "Resource": [
        "arn:aws:dynamodb:ap-northeast-1:123456789012:table/manga-sessions",
        "arn:aws:dynamodb:ap-northeast-1:123456789012:table/manga-sessions/index/*"
      ],
      "Condition": {
        "ForAllValues:StringEquals": {
          "dynamodb:LeadingKeys": ["${aws:PrincipalTag/SessionPrefix}"]
        }
      }
    },
    {
      "Sid": "AllowProductTableReadOnly",
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:Query",
        "dynamodb:BatchGetItem"
      ],
      "Resource": [
        "arn:aws:dynamodb:ap-northeast-1:123456789012:table/manga-products",
        "arn:aws:dynamodb:ap-northeast-1:123456789012:table/manga-products/index/*"
      ]
    },
    {
      "Sid": "DenyTableDeletion",
      "Effect": "Deny",
      "Action": [
        "dynamodb:DeleteTable",
        "dynamodb:DeleteItem"
      ],
      "Resource": [
        "arn:aws:dynamodb:ap-northeast-1:123456789012:table/manga-products"
      ]
    }
  ]
}

Lambda Execution Role — Permission Boundary

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowedServices",
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream",
        "bedrock:ApplyGuardrail",
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:UpdateItem",
        "dynamodb:Query",
        "dynamodb:BatchGetItem",
        "aoss:APIAccessAll",
        "elasticache:Connect",
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "xray:PutTraceSegments",
        "xray:PutTelemetryRecords",
        "secretsmanager:GetSecretValue"
      ],
      "Resource": "*"
    },
    {
      "Sid": "DenyDangerousActions",
      "Effect": "Deny",
      "Action": [
        "iam:*",
        "organizations:*",
        "account:*",
        "s3:DeleteBucket",
        "dynamodb:DeleteTable",
        "ec2:*",
        "rds:*"
      ],
      "Resource": "*"
    }
  ]
}
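
A permission boundary does not grant anything by itself. An action is usable only if the role's identity policy and the boundary both allow it, and an explicit deny in either one always wins. The sketch below illustrates that intersection for action names only. It is a deliberate simplification: resources, conditions, SCPs, and resource-based policies are ignored, and `is_effectively_allowed` and the example `identity_allow` list are hypothetical.

```python
# Simplified model of identity policy + permission boundary evaluation.
# Only action names are compared; real IAM evaluation also considers
# resources, conditions, SCPs, and resource-based policies.
from fnmatch import fnmatchcase
from typing import List


def matches(action: str, patterns: List[str]) -> bool:
    """Case-insensitive match of an action against IAM-style wildcards."""
    return any(fnmatchcase(action.lower(), p.lower()) for p in patterns)


def is_effectively_allowed(action: str, identity_allow: List[str],
                           boundary_allow: List[str],
                           boundary_deny: List[str]) -> bool:
    if matches(action, boundary_deny):        # explicit deny always wins
        return False
    # Otherwise the action must be allowed by BOTH policies (intersection)
    return matches(action, identity_allow) and matches(action, boundary_allow)


# Subset of the boundary shown above
BOUNDARY_ALLOW = ["bedrock:InvokeModel", "dynamodb:GetItem", "dynamodb:PutItem",
                  "dynamodb:Query", "logs:PutLogEvents"]
BOUNDARY_DENY = ["iam:*", "ec2:*", "rds:*", "dynamodb:DeleteTable", "s3:DeleteBucket"]

# Hypothetical, overly broad identity policy attached to a Lambda role
identity_allow = ["dynamodb:*", "s3:GetObject"]

if __name__ == "__main__":
    for action in ["dynamodb:GetItem", "dynamodb:DeleteTable",
                   "s3:GetObject", "bedrock:InvokeModel"]:
        allowed = is_effectively_allowed(action, identity_allow,
                                         BOUNDARY_ALLOW, BOUNDARY_DENY)
        print(f"{action}: {'allowed' if allowed else 'denied'}")
```

In this example only `dynamodb:GetItem` passes: `dynamodb:DeleteTable` hits the explicit deny, `s3:GetObject` is outside the boundary, and `bedrock:InvokeModel` is inside the boundary but not granted by the identity policy.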

OpenSearch Serverless Data Access Policy

[
  {
    "Rules": [
      {
        "ResourceType": "collection",
        "Resource": ["collection/manga-vectors"],
        "Permission": [
          "aoss:DescribeCollectionItems"
        ]
      },
      {
        "ResourceType": "index",
        "Resource": ["index/manga-vectors/*"],
        "Permission": [
          "aoss:ReadDocument",
          "aoss:DescribeIndex"
        ]
      }
    ],
    "Principal": [
      "arn:aws:iam::123456789012:role/manga-rag-lambda-role"
    ],
    "Description": "Read-only access for RAG retrieval Lambda"
  },
  {
    "Rules": [
      {
        "ResourceType": "index",
        "Resource": ["index/manga-vectors/*"],
        "Permission": [
          "aoss:ReadDocument",
          "aoss:WriteDocument",
          "aoss:CreateIndex",
          "aoss:UpdateIndex",
          "aoss:DescribeIndex"
        ]
      }
    ],
    "Principal": [
      "arn:aws:iam::123456789012:role/manga-indexer-role"
    ],
    "Description": "Write access for manga catalog indexing pipeline"
  }
]

Production Python: Circuit Breaker Implementation

"""
MangaAssist Circuit Breaker Implementation
Manages failure detection, state transitions, and fallback responses
for all downstream service calls.
"""

import time
import json
import logging
import hashlib
from enum import Enum
from dataclasses import dataclass, field
from typing import Optional, Callable, Any, Dict
from functools import wraps

import boto3
import redis

logger = logging.getLogger("manga_circuit_breaker")
logger.setLevel(logging.INFO)


class CircuitState(Enum):
    """Circuit breaker states following the standard pattern."""
    CLOSED = "CLOSED"          # Normal operation — requests flow through
    OPEN = "OPEN"              # Failures exceeded threshold — requests blocked
    HALF_OPEN = "HALF_OPEN"    # Testing recovery — limited requests allowed


@dataclass
class CircuitBreakerConfig:
    """Configuration for a circuit breaker instance."""
    name: str                              # e.g., "bedrock", "opensearch", "dynamodb"
    failure_threshold: int = 5             # Failures before opening
    success_threshold: int = 3             # Successes in half-open before closing
    timeout_seconds: int = 60              # Time in open state before half-open
    window_seconds: int = 60               # Rolling window for failure counting
    half_open_max_requests: int = 3        # Max concurrent requests in half-open
    failure_rate_threshold: float = 0.5    # Failure rate (0-1) to trigger open


@dataclass
class CircuitBreakerState:
    """Current state of a circuit breaker, stored in Redis."""
    state: CircuitState = CircuitState.CLOSED
    failure_count: int = 0
    success_count: int = 0
    last_failure_time: float = 0.0
    last_state_change: float = 0.0
    total_requests: int = 0
    half_open_active_requests: int = 0


class CircuitBreaker:
    """
    Production circuit breaker for MangaAssist downstream services.

    Uses Redis (ElastiCache) for distributed state so all ECS Fargate
    tasks share the same circuit state. Together with the half-open
    request limit, this reduces thundering-herd load on a service that
    is recovering.

    Usage:
        cb = CircuitBreaker(redis_client, CircuitBreakerConfig(name="bedrock"))

        @cb.protect
        async def call_bedrock(prompt: str) -> str:
            # Your Bedrock call here
            ...

        # Or use directly:
        result = await cb.execute(call_bedrock, prompt="Hello")
    """

    REDIS_KEY_PREFIX = "manga:circuit:"

    def __init__(self, redis_client: redis.Redis, config: CircuitBreakerConfig):
        self.redis = redis_client
        self.config = config
        self.key = f"{self.REDIS_KEY_PREFIX}{config.name}"
        self._cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")
        self._sns = boto3.client("sns", region_name="ap-northeast-1")
        self._alert_topic = "arn:aws:sns:ap-northeast-1:123456789012:manga-circuit-alerts"

    def _get_state(self) -> CircuitBreakerState:
        """Load circuit state from Redis."""
        raw = self.redis.hgetall(self.key)
        if not raw:
            return CircuitBreakerState()

        return CircuitBreakerState(
            state=CircuitState(raw.get(b"state", b"CLOSED").decode()),
            failure_count=int(raw.get(b"failure_count", 0)),
            success_count=int(raw.get(b"success_count", 0)),
            last_failure_time=float(raw.get(b"last_failure_time", 0)),
            last_state_change=float(raw.get(b"last_state_change", 0)),
            total_requests=int(raw.get(b"total_requests", 0)),
            half_open_active_requests=int(raw.get(b"half_open_active_requests", 0)),
        )

    def _save_state(self, state: CircuitBreakerState) -> None:
        """Persist circuit state to Redis with TTL."""
        self.redis.hset(self.key, mapping={
            "state": state.state.value,
            "failure_count": state.failure_count,
            "success_count": state.success_count,
            "last_failure_time": state.last_failure_time,
            "last_state_change": state.last_state_change,
            "total_requests": state.total_requests,
            "half_open_active_requests": state.half_open_active_requests,
        })
        # Auto-expire after 24 hours to prevent stale state
        self.redis.expire(self.key, 86400)

    def _transition_to(self, state: CircuitBreakerState, new_state: CircuitState) -> None:
        """Handle state transition with logging and alerts."""
        old_state = state.state
        state.state = new_state
        state.last_state_change = time.time()

        if new_state == CircuitState.CLOSED:
            state.failure_count = 0
            state.success_count = 0
            state.half_open_active_requests = 0
        elif new_state == CircuitState.HALF_OPEN:
            state.success_count = 0
            state.half_open_active_requests = 0

        logger.warning(
            "Circuit breaker [%s] transition: %s -> %s",
            self.config.name, old_state.value, new_state.value
        )

        self._emit_metric(new_state)
        if new_state == CircuitState.OPEN:
            self._send_alert(old_state, new_state, state)

    def _emit_metric(self, state: CircuitState) -> None:
        """Emit CloudWatch metric for circuit state."""
        state_value = {"CLOSED": 0, "HALF_OPEN": 1, "OPEN": 2}
        try:
            self._cloudwatch.put_metric_data(
                Namespace="MangaAssist/CircuitBreaker",
                MetricData=[{
                    "MetricName": "CircuitState",
                    "Dimensions": [
                        {"Name": "ServiceName", "Value": self.config.name}
                    ],
                    "Value": state_value[state.value],
                    "Unit": "None"
                }]
            )
        except Exception as e:
            logger.error("Failed to emit CloudWatch metric: %s", e)

    def _send_alert(
        self,
        old_state: CircuitState,
        new_state: CircuitState,
        state: CircuitBreakerState,
    ) -> None:
        """Send SNS alert on circuit state change to OPEN."""
        try:
            self._sns.publish(
                TopicArn=self._alert_topic,
                Subject=f"ALERT: Circuit breaker [{self.config.name}] is OPEN",
                Message=json.dumps({
                    "circuit": self.config.name,
                    "transition": f"{old_state.value} -> {new_state.value}",
                    "failure_count": state.failure_count,
                    "total_requests": state.total_requests,
                    "timestamp": time.time(),
                }),
            )
        except Exception as e:
            logger.error("Failed to send SNS alert: %s", e)

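    def can_execute(self) -> bool:
        """Check if a request is allowed through the circuit.

        In HALF_OPEN, every admitted request claims a probe slot; the slot is
        released again by record_success() or record_failure().
        """
        state = self._get_state()
        now = time.time()

        if state.state == CircuitState.CLOSED:
            return True

        if state.state == CircuitState.OPEN:
            # Stay open until the cool-down period has elapsed
            if now - state.last_state_change < self.config.timeout_seconds:
                return False
            # Cool-down is over: move to half-open and fall through so this
            # request is counted as the first probe
            self._transition_to(state, CircuitState.HALF_OPEN)

        if state.state == CircuitState.HALF_OPEN:
            # Allow only a limited number of in-flight probe requests.
            # Note: this read-modify-write is not atomic across Fargate tasks.
            if state.half_open_active_requests >= self.config.half_open_max_requests:
                return False
            state.half_open_active_requests += 1
            self._save_state(state)
            return True

        return False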

    def record_success(self) -> None:
        """Record a successful request."""
        state = self._get_state()
        state.total_requests += 1

        if state.state == CircuitState.HALF_OPEN:
            state.success_count += 1
            state.half_open_active_requests = max(0, state.half_open_active_requests - 1)

            if state.success_count >= self.config.success_threshold:
                self._transition_to(state, CircuitState.CLOSED)

        self._save_state(state)

    def record_failure(self) -> None:
        """Record a failed request."""
        state = self._get_state()
        state.failure_count += 1
        state.total_requests += 1
        state.last_failure_time = time.time()

        if state.state == CircuitState.HALF_OPEN:
            state.half_open_active_requests = max(0, state.half_open_active_requests - 1)
            self._transition_to(state, CircuitState.OPEN)
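        elif state.state == CircuitState.CLOSED:
            failure_rate = (
                state.failure_count / state.total_requests
                if state.total_requests > 0 else 0
            )
            # Only trust the failure rate once enough requests have been seen;
            # otherwise a failed first request (1/1 = 100%) would open the circuit
            enough_samples = state.total_requests >= self.config.failure_threshold
            if (state.failure_count >= self.config.failure_threshold
                    or (enough_samples
                        and failure_rate >= self.config.failure_rate_threshold)):
                self._transition_to(state, CircuitState.OPEN)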

        self._save_state(state)

    def protect(self, func: Callable) -> Callable:
        """Decorator to protect a function with the circuit breaker."""
        @wraps(func)
        async def wrapper(*args, **kwargs):
            return await self.execute(func, *args, **kwargs)
        return wrapper

    async def execute(
        self,
        func: Callable,
        *args,
        fallback: Optional[Callable] = None,
        **kwargs,
    ) -> Any:
        """
        Execute a function through the circuit breaker.

        If the circuit rejects the call (open, or half-open at capacity),
        returns the fallback result, or raises CircuitOpenError when no
        fallback is given. Otherwise attempts the function and records
        success or failure.
        """
        if not self.can_execute():
            logger.warning("Circuit [%s] rejected the request (open or at half-open capacity)", self.config.name)
            if fallback:
                return await fallback(*args, **kwargs)
            raise CircuitOpenError(
                f"Circuit breaker [{self.config.name}] is open. "
                f"Retry after {self.config.timeout_seconds}s."
            )

        try:
            result = await func(*args, **kwargs)
            self.record_success()
            return result
        except Exception as e:
            self.record_failure()
            logger.error("Circuit [%s] recorded failure: %s", self.config.name, e)
            if fallback:
                return await fallback(*args, **kwargs)
            raise


class CircuitOpenError(Exception):
    """Raised when a circuit breaker is open and no fallback is provided."""
    pass
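
The state transitions used by the breaker above can be hard to follow through the Redis plumbing. The following is a minimal single-process sketch of the same closed, open, half-open cycle. `InMemoryBreaker`, its default thresholds, and the injected `clock` are hypothetical stand-ins for illustration, not part of the MangaAssist code.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF_OPEN"


class InMemoryBreaker:
    """Single-process stand-in for the Redis-backed breaker (illustrative only)."""

    def __init__(self, failure_threshold=3, success_threshold=2,
                 timeout_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.state = State.CLOSED
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def can_execute(self):
        # After the cool-down, let probe requests through in HALF_OPEN.
        if (self.state is State.OPEN
                and self.clock() - self.opened_at >= self.timeout_seconds):
            self.state = State.HALF_OPEN
            self.successes = 0
        return self.state is not State.OPEN

    def record_failure(self):
        self.failures += 1
        # Any failure while probing, or too many while closed, opens the circuit.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()

    def record_success(self):
        if self.state is State.HALF_OPEN:
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = State.CLOSED
                self.failures = 0


# Walk one full cycle with a fake clock so the run is deterministic.
now = [0.0]
breaker = InMemoryBreaker(clock=lambda: now[0])
for _ in range(3):
    breaker.record_failure()
states = [breaker.state.value]            # circuit opens on the third failure
blocked = breaker.can_execute()           # still inside the cool-down window
now[0] = 31.0
probe_allowed = breaker.can_execute()     # cool-down elapsed, probing begins
states.append(breaker.state.value)
breaker.record_success()
breaker.record_success()
states.append(breaker.state.value)
print(states)  # ['OPEN', 'HALF_OPEN', 'CLOSED']
```

Injecting the clock keeps the example testable without sleeping; the production class would keep the same transitions but persist them in Redis so every Fargate task sees one shared state.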

Production Python: Timeout Handler

"""
MangaAssist Timeout Handler
Implements graceful degradation when Bedrock or other services time out.
Used as a Lambda function invoked by Step Functions on timeout.
"""

import json
import time
import logging
from typing import Dict, Any, Optional

import boto3
from botocore.config import Config

logger = logging.getLogger("manga_timeout_handler")
logger.setLevel(logging.INFO)

# Configure boto3 with explicit timeouts
BEDROCK_CONFIG = Config(
    region_name="ap-northeast-1",
    read_timeout=25,        # 25s read timeout (< Lambda 30s timeout)
    connect_timeout=5,      # 5s connection timeout
    retries={"max_attempts": 0}  # No SDK retries — we handle retries in Step Functions
)

DYNAMODB_CONFIG = Config(
    region_name="ap-northeast-1",
    read_timeout=5,
    connect_timeout=3,
    retries={"max_attempts": 1}
)

dynamodb = boto3.resource("dynamodb", config=DYNAMODB_CONFIG)

# Redis connection (initialized outside handler for connection reuse)
import redis
redis_client = redis.Redis(
    host="manga-cache.xxxxx.apne1.cache.amazonaws.com",
    port=6379,
    ssl=True,
    decode_responses=True,
    socket_timeout=2,
    socket_connect_timeout=1,
)


def lambda_handler(event: Dict[str, Any], context) -> Dict[str, Any]:
    """
    Handle timeout events from Step Functions.

    Provides graceful degradation by returning cached or static responses
    when the primary path times out.

    Args:
        event: Contains query, timeoutSource, and optional context
        context: Lambda context with remaining time info

    Returns:
        Fallback response appropriate to the timeout source
    """
    query = event.get("query", "")
    timeout_source = event.get("timeoutSource", "UNKNOWN")
    session_id = event.get("sessionId", "")

    logger.info(
        "Timeout handler invoked — source=%s, query_length=%d, remaining_ms=%d",
        timeout_source,
        len(query),
        context.get_remaining_time_in_millis(),
    )

    # Attempt to find a cached response first
    cached = _try_cached_response(query)
    if cached:
        logger.info("Returning cached response for timeout")
        return {
            "statusCode": 200,
            "body": cached,
            "metadata": {
                "source": "cache",
                "timeoutSource": timeout_source,
                "latencyMs": 0,
            },
        }

    # Attempt to find similar FAQ answers
    faq_response = _try_faq_match(query)
    if faq_response:
        logger.info("Returning FAQ match for timeout")
        return {
            "statusCode": 200,
            "body": faq_response,
            "metadata": {
                "source": "faq",
                "timeoutSource": timeout_source,
            },
        }

    # Final fallback: static response with popular manga
    return _static_fallback(timeout_source)


def _try_cached_response(query: str) -> Optional[str]:
    """Check Redis for a cached response to the same query (case- and whitespace-normalized)."""
    try:
        import hashlib
        query_hash = hashlib.sha256(query.lower().strip().encode()).hexdigest()[:16]
        cached = redis_client.get(f"manga:response:{query_hash}")
        if cached:
            return cached
    except redis.RedisError as e:
        logger.warning("Redis cache lookup failed: %s", e)
    return None


def _try_faq_match(query: str) -> Optional[str]:
    """Attempt to match query against pre-loaded FAQ entries in Redis."""
    try:
        # Keys are looked up directly; scanning with KEYS would block Redis
        query_lower = query.lower()

        keyword_map = {
            "shipping": "manga:faq:shipping",
            "配送": "manga:faq:shipping",
            "return": "manga:faq:returns",
            "返品": "manga:faq:returns",
            "payment": "manga:faq:payment",
            "支払い": "manga:faq:payment",
            "recommend": "manga:faq:recommendations",
            "おすすめ": "manga:faq:recommendations",
        }

        for keyword, faq_key in keyword_map.items():
            if keyword in query_lower:
                faq_response = redis_client.get(faq_key)
                if faq_response:
                    return faq_response

    except redis.RedisError as e:
        logger.warning("FAQ lookup failed: %s", e)
    return None


def _static_fallback(timeout_source: str) -> Dict[str, Any]:
    """Return a static fallback response with popular manga suggestions."""
    messages = {
        "BEDROCK": {
            "en": (
                "I'm taking longer than usual to think. While I work on your question, "
                "check out our trending manga!"
            ),
            "ja": (
                "回答に通常より時間がかかっています。お待ちの間、"
                "人気の漫画をチェックしてみてください！"
            ),
        },
        "OPENSEARCH": {
            "en": "Search is temporarily slow. Here are our popular picks:",
            "ja": "検索が一時的に遅くなっています。人気作品はこちら：",
        },
        "UNKNOWN": {
            "en": "Something went wrong. Please try again!",
            "ja": "エラーが発生しました。もう一度お試しください！",
        },
    }

    msg = messages.get(timeout_source, messages["UNKNOWN"])

    return {
        "statusCode": 200,
        "body": {
            "message": msg["en"],
            "messageJa": msg["ja"],
            "popularManga": [
                {"title": "One Piece", "titleJa": "ワンピース", "id": "manga-001"},
                {"title": "Jujutsu Kaisen", "titleJa": "呪術廻戦", "id": "manga-002"},
                {"title": "Spy x Family", "titleJa": "スパイファミリー", "id": "manga-003"},
                {"title": "Chainsaw Man", "titleJa": "チェンソーマン", "id": "manga-004"},
                {"title": "My Hero Academia", "titleJa": "僕のヒーローアカデミア", "id": "manga-005"},
            ],
            "retryable": True,
        },
        "metadata": {
            "source": "static_fallback",
            "timeoutSource": timeout_source,
        },
    }
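
The handler above tries progressively cheaper sources until one answers. As a rough sketch of that pattern in isolation, the snippet below runs an ordered list of lookups and falls through to a static default. `first_available`, the in-memory `store` and `faq` dicts, and the sample strings are all hypothetical stand-ins for the Redis-backed lookups.

```python
import hashlib
from typing import Callable, List, Optional, Tuple


def cache_key(query: str) -> str:
    """Same normalization as the handler: lower-case, strip, hash, truncate."""
    digest = hashlib.sha256(query.lower().strip().encode()).hexdigest()[:16]
    return f"manga:response:{digest}"


def first_available(
    query: str,
    tiers: List[Tuple[str, Callable[[str], Optional[str]]]],
    default: str,
) -> Tuple[str, str]:
    """Return (source, answer) from the first tier that yields a result."""
    for name, lookup in tiers:
        try:
            result = lookup(query)
        except Exception:
            result = None  # a failing tier must never block the next one
        if result:
            return name, result
    return "static_fallback", default


# In-memory stand-ins for Redis (illustrative data only).
store = {cache_key("When does volume 12 ship?"): "Volume 12 ships next week."}
faq = {"shipping": "Standard shipping takes 2-4 days."}

tiers = [
    ("cache", lambda q: store.get(cache_key(q))),
    ("faq", lambda q: next((a for k, a in faq.items() if k in q.lower()), None)),
]

hit_cache = first_available("  when does volume 12 ship? ", tiers, "Popular picks")
hit_faq = first_available("How much is shipping?", tiers, "Popular picks")
hit_static = first_available("hello", tiers, "Popular picks")
print(hit_cache[0], hit_faq[0], hit_static[0])  # cache faq static_fallback
```

Because the cache key is a hash of the normalized query, only identical questions hit the cache; anything else falls through to the keyword FAQ tier.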

Production Python: Guardrail Middleware

"""
MangaAssist Guardrail Middleware
Pre- and post-processing safety checks for all FM interactions.
Complements Bedrock Guardrails with application-level checks.
"""

import re
import time
import json
import logging
from typing import Dict, Any, Optional, List, Tuple
from dataclasses import dataclass

import boto3

logger = logging.getLogger("manga_guardrail")
logger.setLevel(logging.INFO)

bedrock_runtime = boto3.client("bedrock-runtime", region_name="ap-northeast-1")


@dataclass
class GuardrailResult:
    """Result of a guardrail check."""
    allowed: bool
    reason: Optional[str] = None
    modified_content: Optional[str] = None
    violations: Optional[List[str]] = None


class MangaGuardrailMiddleware:
    """
    Application-level guardrail middleware for MangaAssist.

    This runs BEFORE and AFTER Bedrock Guardrails to provide:
    1. Input validation and sanitization
    2. Token budget enforcement
    3. PII pre-screening (faster than Bedrock for obvious patterns)
    4. Output format validation
    5. Response quality checks

    The Bedrock Guardrail (manga-content-guardrail) handles:
    - Content filtering (hate, violence thresholds)
    - Contextual grounding checks
    - Denied topic enforcement
    """

    # PII patterns for pre-screening
    PII_PATTERNS = {
        "credit_card": re.compile(r"\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b"),
        "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
        "phone_jp": re.compile(r"\b0\d{1,4}[-\s]?\d{1,4}[-\s]?\d{4}\b"),
        "phone_intl": re.compile(r"\+\d{1,3}[-\s]?\d{6,14}\b"),
        "postal_jp": re.compile(r"\b\d{3}[-]?\d{4}\b"),
    }

    # Injection patterns to detect prompt manipulation
    INJECTION_PATTERNS = [
        re.compile(r"ignore\s+(all\s+)?previous\s+instructions", re.IGNORECASE),
        re.compile(r"you\s+are\s+now\s+(?:a|an)\s+", re.IGNORECASE),
        re.compile(r"system\s*:\s*", re.IGNORECASE),
        re.compile(r"<\s*(?:system|admin|root)\s*>", re.IGNORECASE),
        re.compile(r"forget\s+(?:everything|all|your)\s+", re.IGNORECASE),
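        re.compile(r"\b(?:jailbreak|roleplay\s+as)\b", re.IGNORECASE),
        # Case-sensitive with word boundaries so titles such as "Dandadan" do not match
        re.compile(r"\bDAN\b"),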
    ]

    # Maximum lengths
    MAX_INPUT_CHARS = 10000
    MAX_INPUT_TOKENS_ESTIMATE = 4096
    MAX_OUTPUT_TOKENS = 2048

    def __init__(self, guardrail_id: str = "manga-content-guardrail", guardrail_version: str = "1"):
        self.guardrail_id = guardrail_id
        self.guardrail_version = guardrail_version

    def pre_process(self, user_input: str, session_context: Dict[str, Any]) -> GuardrailResult:
        """
        Run all pre-processing guardrail checks on user input.

        Checks run cheapest-first so a bad request fails fast:
        1. Length check (instant)
        2. Empty input check (instant)
        3. Injection detection (regex, fast)
        4. PII pre-screening (regex, fast)
        5. Token budget check (counter lookup)
        """
        violations = []

        # 1. Length check
        if len(user_input) > self.MAX_INPUT_CHARS:
            return GuardrailResult(
                allowed=False,
                reason="Input exceeds maximum length of 10,000 characters",
                violations=["INPUT_TOO_LONG"],
            )

        # 2. Empty input check
        if not user_input.strip():
            return GuardrailResult(
                allowed=False,
                reason="Empty input",
                violations=["EMPTY_INPUT"],
            )

        # 3. Injection detection
        injection_result = self._check_injection(user_input)
        if not injection_result.allowed:
            logger.warning("Injection attempt detected: %s", injection_result.reason)
            violations.extend(injection_result.violations or [])
            return injection_result

        # 4. PII pre-screening
        pii_result = self._screen_pii(user_input)
        if pii_result.modified_content:
            user_input = pii_result.modified_content
            violations.extend(pii_result.violations or [])
            logger.info("PII detected and masked in input")

        # 5. Token budget check
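        # Rough estimate: ~4 chars per token is an English heuristic. Japanese text
        # usually needs more tokens per character, so treat this as a lower bound.
        token_estimate = len(user_input) // 4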
        session_tokens = session_context.get("sessionTokensUsed", 0)
        if session_tokens + token_estimate > 50000:
            return GuardrailResult(
                allowed=False,
                reason="Session token budget would be exceeded",
                violations=["TOKEN_BUDGET_EXCEEDED"],
            )

        return GuardrailResult(
            allowed=True,
            modified_content=user_input,
            violations=violations if violations else None,
        )

    def post_process(self, model_output: str, original_query: str) -> GuardrailResult:
        """
        Run post-processing guardrail checks on model output.

        Checks:
        1. Output not empty
        2. Output PII screening
        3. Boilerplate disclaimer removal
        4. Output length cap
        """
        # 1. Empty output
        if not model_output or not model_output.strip():
            return GuardrailResult(
                allowed=False,
                reason="Model returned empty response",
                violations=["EMPTY_OUTPUT"],
            )

        # 2. PII in output
        pii_result = self._screen_pii(model_output)
        if pii_result.modified_content:
            model_output = pii_result.modified_content

        # 3. Strip boilerplate model disclaimers (matched case-insensitively)
        disclaimer_phrases = [
            "as an AI language model",
            "I don't have access to real-time",
            "I cannot browse the internet",
            "my training data only goes up to",
        ]
        for phrase in disclaimer_phrases:
            cleaned = re.sub(re.escape(phrase), "", model_output, flags=re.IGNORECASE)
            if cleaned != model_output:
                model_output = cleaned.strip()
                logger.info("Removed boilerplate disclaimer phrase from output")

        # 4. Verify output length is reasonable
        if len(model_output) > 50000:
            model_output = model_output[:50000] + "\n\n[Response truncated]"

        return GuardrailResult(
            allowed=True,
            modified_content=model_output,
        )

    def _check_injection(self, text: str) -> GuardrailResult:
        """Detect prompt injection attempts."""
        for pattern in self.INJECTION_PATTERNS:
            match = pattern.search(text)
            if match:
                return GuardrailResult(
                    allowed=False,
                    reason=f"Potential prompt injection detected: '{match.group()}'",
                    violations=["PROMPT_INJECTION"],
                )
        return GuardrailResult(allowed=True)

    def _screen_pii(self, text: str) -> GuardrailResult:
        """Screen and mask PII in text."""
        masked_text = text
        violations = []

        for pii_type, pattern in self.PII_PATTERNS.items():
            matches = pattern.findall(masked_text)
            if matches:
                violations.append(f"PII_DETECTED_{pii_type.upper()}")
                for match in matches:
                    mask = "[REDACTED]"
                    masked_text = masked_text.replace(match, mask)

        if violations:
            return GuardrailResult(
                allowed=True,
                modified_content=masked_text,
                violations=violations,
            )

        return GuardrailResult(allowed=True)

    def apply_bedrock_guardrail(self, content: str, source: str = "INPUT") -> GuardrailResult:
        """
        Apply Bedrock Guardrail to content.

        Args:
            content: Text to check
            source: "INPUT" for user input, "OUTPUT" for model output
        """
        try:
            response = bedrock_runtime.apply_guardrail(
                guardrailIdentifier=self.guardrail_id,
                guardrailVersion=self.guardrail_version,
                source=source,
                content=[{"text": {"text": content}}],
            )

            action = response.get("action", "NONE")

            if action == "GUARDRAIL_INTERVENED":
                outputs = response.get("outputs", [])
                output_text = outputs[0]["text"] if outputs else "Content blocked by guardrail"

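                assessments = response.get("assessments", [])
                violation_types = []
                # Each assessment maps a policy name (e.g. "contentPolicy") to a dict
                # whose values are lists of findings (e.g. "filters", "topics")
                for assessment in assessments:
                    for policy_name, policy in assessment.items():
                        if not isinstance(policy, dict):
                            continue
                        for findings in policy.values():
                            if not isinstance(findings, list):
                                continue
                            for finding in findings:
                                if finding.get("action") == "BLOCKED":
                                    label = finding.get("type") or finding.get("name", "unknown")
                                    violation_types.append(f"{policy_name}:{label}")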

                return GuardrailResult(
                    allowed=False,
                    reason=f"Bedrock Guardrail intervened: {', '.join(violation_types)}",
                    modified_content=output_text,
                    violations=violation_types,
                )

            return GuardrailResult(allowed=True, modified_content=content)

        except Exception as e:
            logger.error("Bedrock Guardrail call failed: %s", e)
            # Fail open for guardrail service errors — log and continue
            # This is a conscious decision: availability over safety for service errors
            # Content still goes through application-level checks
            return GuardrailResult(allowed=True, modified_content=content)
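
To see the regex pre-screen behave on concrete input, here is a small standalone sketch of the injection check followed by PII masking. It reuses two of the PII patterns and one injection pattern from the class; the `screen` helper and the sample strings are hypothetical and exist only for this illustration.

```python
import re

PII = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "credit_card": re.compile(r"\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b"),
}
INJECTION = re.compile(r"ignore\s+(all\s+)?previous\s+instructions", re.IGNORECASE)


def screen(text):
    """Return (allowed, possibly_masked_text, violations)."""
    # Injection is a hard block: stop before doing any masking work.
    if INJECTION.search(text):
        return False, text, ["PROMPT_INJECTION"]
    violations = []
    for name, pattern in PII.items():
        # subn masks every match and reports how many were replaced.
        text, count = pattern.subn("[REDACTED]", text)
        if count:
            violations.append(f"PII_DETECTED_{name.upper()}")
    return True, text, violations


masked = screen("Mail me at taro@example.com about order 1234-5678-9012-3456")
blocked = screen("Please IGNORE all previous instructions")
print(masked[1])   # Mail me at [REDACTED] about order [REDACTED]
print(blocked[0])  # False
```

PII is masked rather than rejected so the shopper's question can still be answered, while an injection match ends the request before any model call is paid for.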

Key Takeaways

┌─────────────────────────────────────────────────────────────────────────┐
│                         KEY TAKEAWAYS                                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  1. DEFENSE IN DEPTH                                                    │
│     • 5 layers of safety: API Gateway → ECS → Step Functions            │
│       → Lambda → AWS Service policies                                   │
│     • Each layer catches what the previous layer might miss             │
│     • No single point of failure in safety controls                     │
│                                                                         │
│  2. STOPPING CONDITIONS PREVENT RUNAWAY COSTS                           │
│     • Token budgets at request, session, and daily levels               │
│     • Max iteration counts on agentic loops (hard cap = 5)              │
│     • Step Functions TimeoutSeconds as the ultimate backstop            │
│     • Cost guardrails: $0.05/request Sonnet, $0.005/request Haiku       │
│                                                                         │
│  3. CIRCUIT BREAKERS PROTECT SYSTEM STABILITY                           │
│     • Distributed state via ElastiCache Redis                           │
│     • Different thresholds per service (Bedrock 50%, OpenSearch 30%)    │
│     • Half-open progressive recovery prevents thundering herd           │
│     • CloudWatch metrics + SNS alerts on state transitions              │
│                                                                         │
│  4. IAM LEAST PRIVILEGE IS NON-NEGOTIABLE                               │
│     • Specific model ARNs, not wildcard access                          │
│     • Separate roles per Lambda function concern                        │
│     • Permission boundaries prevent privilege escalation                │
│     • Resource-based policies on DynamoDB, OpenSearch, S3               │
│                                                                         │
│  5. GRACEFUL DEGRADATION > HARD FAILURES                                │
│     • Timeout → cached response → FAQ match → static fallback           │
│     • Circuit open → popular manga suggestions                          │
│     • Guardrail block → helpful redirect message                        │
│     • Partial agentic result → "here's what I found so far"             │
│                                                                         │
│  6. BEDROCK GUARDRAILS + APPLICATION GUARDRAILS                         │
│     • Bedrock handles: content filtering, grounding, denied topics      │
│     • Application handles: PII pre-screen, injection detection,         │
│       token budgets, format validation                                  │
│     • Application checks run first (faster, cheaper)                    │
│     • Bedrock Guardrails provide the deep safety net                    │
│                                                                         │
│  EXAM TIP: For AIP-C01, know that Step Functions provides stopping      │
│  conditions (TimeoutSeconds, iteration counters, Choice states),        │
│  Lambda provides timeout mechanisms, IAM provides resource boundaries,  │
│  and circuit breakers + Bedrock Guardrails mitigate failures.           │
│  Defense-in-depth is the architectural principle.                       │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
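
The per-request cost caps in takeaway 2 can be sanity-checked against the token ceilings. The sketch below assumes the list prices quoted in the MangaAssist context ($3/$15 per 1M tokens for Sonnet, $0.25/$1.25 for Haiku) and the 4,096 input / 2,048 output token ceilings; `request_cost` is a hypothetical helper name.

```python
# (input, output) price in USD per 1M tokens, as quoted in the MangaAssist context.
PRICES = {"sonnet": (3.00, 15.00), "haiku": (0.25, 1.25)}
# Per-request cost caps in USD from the takeaways.
CAPS = {"sonnet": 0.05, "haiku": 0.005}

MAX_INPUT_TOKENS = 4096
MAX_OUTPUT_TOKENS = 2048


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the given token counts."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Worst case: a request that uses both ceilings in full.
worst_case = {m: request_cost(m, MAX_INPUT_TOKENS, MAX_OUTPUT_TOKENS) for m in PRICES}
for model, cost in worst_case.items():
    print(f"{model}: ${cost:.6f} (cap ${CAPS[model]}, within cap: {cost <= CAPS[model]})")
```

With these numbers a maximal Sonnet request costs about $0.043 and a maximal Haiku request about $0.0036, so a single call cannot exceed its cap by itself; the cap only trips when an agentic loop issues several calls within one request.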

Quick Reference: Timeout Settings for MangaAssist

Component	Timeout	Purpose
API Gateway WebSocket idle	10 min	Drop idle connections
API Gateway integration	29 sec	Hard limit on backend calls
Client-side UX	3 sec	Show spinner, then partial response
Step Functions workflow	60 sec	Absolute cap on entire workflow
Step Functions per-state	30 sec	Cap on individual state execution
Lambda: Bedrock invoke	30 sec	Model inference timeout
Lambda: RAG retrieval	10 sec	OpenSearch vector search
Lambda: Session management	5 sec	DynamoDB read/write
Lambda: Timeout handler	5 sec	Fallback response generation
Bedrock SDK read timeout	25 sec	HTTP read (< Lambda 30s)
Bedrock SDK connect timeout	5 sec	TCP connection establishment
Redis socket timeout	2 sec	ElastiCache operations
Redis connect timeout	1 sec	ElastiCache connection
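
Nested timeouts only help if the inner timer fires before the outer one. The following sketch checks that property for a few rows of the table. The values come from the table above, but the `NESTING` pairs are an assumption about which component calls which, and `tight_pairs` is a hypothetical helper.

```python
# Timeouts in seconds, copied from the table above.
TIMEOUTS_S = {
    "bedrock_sdk_read": 25,
    "lambda_bedrock": 30,
    "lambda_timeout_handler": 5,
    "sfn_state": 30,
    "sfn_workflow": 60,
}

# Assumed call nesting as (inner, outer): the inner timer should fire first.
NESTING = [
    ("bedrock_sdk_read", "lambda_bedrock"),
    ("lambda_bedrock", "sfn_state"),
    ("lambda_timeout_handler", "sfn_state"),
    ("sfn_state", "sfn_workflow"),
]


def tight_pairs(timeouts, nesting):
    """Return the (inner, outer) pairs where the inner timeout leaves no margin."""
    return [(inner, outer) for inner, outer in nesting
            if timeouts[inner] >= timeouts[outer]]


tight = tight_pairs(TIMEOUTS_S, NESTING)
print(tight)  # [('lambda_bedrock', 'sfn_state')]
```

Under that assumed nesting, the only pair flagged is the 30-second Bedrock Lambda running inside a 30-second state; whether that needs adjusting depends on which of the two timers the workflow should treat as authoritative.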

Quick Reference: IAM Principle Summary

Principle	MangaAssist Implementation
Least Privilege	Specific Bedrock model ARNs, not `bedrock:*`
Separation of Duties	Separate Lambda roles per function
Permission Boundaries	Max permission envelope for all Lambda roles
Resource Scoping	DynamoDB leading key conditions
Deny Explicit	Deny model customization, table deletion
Condition Keys	Region restriction, request attribute checks
Encryption	KMS key policies for data at rest
Network	VPC endpoints for private service access
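
Several rows of the table can be combined in one policy document. The sketch below builds such a policy as a Python dict: an allow scoped to a single model ARN, an explicit deny on model customization and table deletion, and a region restriction via a condition key. The statement IDs and the chosen model ID are illustrative assumptions, not the MangaAssist policy itself.

```python
import json

REGION = "ap-northeast-1"
# Assumed model ID for illustration; substitute the model actually deployed.
MODEL_ID = "anthropic.claude-3-sonnet-20240229-v1:0"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Least privilege: one model ARN instead of bedrock:* on "*".
            "Sid": "InvokeSpecificModelOnly",
            "Effect": "Allow",
            "Action": ["bedrock:InvokeModel"],
            "Resource": f"arn:aws:bedrock:{REGION}::foundation-model/{MODEL_ID}",
        },
        {
            # Explicit deny wins over any allow granted elsewhere.
            "Sid": "DenyDestructiveActions",
            "Effect": "Deny",
            "Action": ["bedrock:CreateModelCustomizationJob", "dynamodb:DeleteTable"],
            "Resource": "*",
        },
        {
            # Condition key: block Bedrock calls made to any other region.
            "Sid": "DenyOutsideHomeRegion",
            "Effect": "Deny",
            "Action": "bedrock:*",
            "Resource": "*",
            "Condition": {"StringNotEquals": {"aws:RequestedRegion": REGION}},
        },
    ],
}

allow_resources = [s["Resource"] for s in policy["Statement"] if s["Effect"] == "Allow"]
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

Wildcards appear only in the deny statements, where a broad match is the intent; every allow names a concrete resource.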

End of File 1 — Safeguard Architecture