
FM Deployment Patterns Architecture

MangaAssist context: JP Manga store chatbot on AWS — Bedrock Claude 3 (Sonnet at $3/$15 per 1M tokens input/output, Haiku at $0.25/$1.25), OpenSearch Serverless (vector store), DynamoDB (sessions/products), ECS Fargate (orchestrator), API Gateway WebSocket, ElastiCache Redis. Target: useful answer in under 3 seconds, 1M messages/day scale.


Skill Mapping

| Dimension | Value |
| --- | --- |
| Certification | AWS Certified AI Practitioner (AIF-C01) |
| Task | 2.2 — Select and implement model deployment strategies |
| Skill | 2.2.1 — Deploy FMs using Lambda on-demand, Bedrock provisioned throughput, SageMaker hybrid |
| This File | Deployment patterns architecture, decision framework, cost comparison |

Skill Scope

This file covers the architecture of FM deployment patterns across three primary AWS services: Lambda on-demand invocation, Bedrock provisioned throughput, and SageMaker endpoint hosting. We examine when each pattern is appropriate, how they compare on latency, throughput, and cost, and how MangaAssist uses a hybrid approach to balance performance with budget at 1M messages/day scale. The decision framework presented here enables architects to select the right deployment pattern for any production GenAI workload.


Mind Map

mindmap
  root((FM Deployment Patterns))
    Lambda On-Demand
      Pay-per-invocation
      Zero idle cost
      Cold start latency
      15-minute timeout
      Bedrock API calls
      Bursty traffic
    Bedrock Provisioned Throughput
      Reserved model units
      Predictable latency
      Committed capacity
      No infrastructure mgmt
      Custom model support
      Hourly billing
    SageMaker Endpoints
      Real-time inference
      Custom containers
      Auto-scaling policies
      Multi-model endpoints
      GPU instance selection
      Full model control
    Hybrid Architecture
      Traffic-based routing
      Cost optimization
      Latency tiers
      Failover paths
      Peak vs off-peak
      Gradual migration
    Decision Framework
      Latency requirements
      Traffic predictability
      Cost constraints
      Operational complexity
      Model customization
      Scaling speed
    Cost Analysis
      Per-token pricing
      Provisioned commitments
      Instance hours
      Data transfer
      Total cost of ownership
      Break-even points

1. Deployment Pattern Overview

Foundation model deployment on AWS follows three primary patterns, each with distinct trade-offs across latency, throughput, cost, and operational complexity.

1.1 Pattern Comparison Matrix

| Dimension | Lambda + Bedrock On-Demand | Bedrock Provisioned Throughput | SageMaker Real-Time Endpoint |
| --- | --- | --- | --- |
| Latency (p50) | 800ms–2s | 400ms–1.2s | 300ms–1s (warm) |
| Latency (p99) | 3–8s (cold start) | 600ms–1.8s | 500ms–1.5s |
| Max throughput | Elastic (account limits) | Reserved model units | Instance-bound |
| Scaling speed | Seconds (warm) / minutes (cold) | Instant (within capacity) | 5–15 min (new instances) |
| Idle cost | $0 | Hourly commitment | Instance hours |
| Per-request cost | Highest per-token | Medium (committed) | Lowest at scale |
| Ops complexity | Lowest | Low | Highest |
| Model customization | Bedrock fine-tuned only | Bedrock fine-tuned only | Any model / framework |
| GPU control | None | None | Full (instance type) |

1.2 Architecture Diagram — Three Patterns

graph TB
    subgraph "Pattern 1: Lambda On-Demand"
        Client1[API Gateway<br/>WebSocket] --> Lambda1[Lambda Function<br/>256MB–10GB RAM]
        Lambda1 --> BR1[Bedrock API<br/>On-Demand]
        BR1 --> Claude1[Claude 3 Sonnet/Haiku<br/>Pay-per-token]
    end

    subgraph "Pattern 2: Bedrock Provisioned Throughput"
        Client2[API Gateway<br/>WebSocket] --> ECS2[ECS Fargate<br/>Orchestrator]
        ECS2 --> BR2[Bedrock API<br/>Provisioned]
        BR2 --> Claude2[Claude 3 Sonnet<br/>Reserved Model Units]
    end

    subgraph "Pattern 3: SageMaker Endpoint"
        Client3[API Gateway<br/>WebSocket] --> ECS3[ECS Fargate<br/>Orchestrator]
        ECS3 --> SM3[SageMaker Endpoint<br/>ml.g5.2xlarge]
        SM3 --> Model3[Custom/Fine-tuned Model<br/>GPU Instance]
    end

    style Client1 fill:#e1f5fe
    style Client2 fill:#e1f5fe
    style Client3 fill:#e1f5fe
    style Lambda1 fill:#fff3e0
    style ECS2 fill:#fff3e0
    style ECS3 fill:#fff3e0
    style BR1 fill:#e8f5e9
    style BR2 fill:#e8f5e9
    style SM3 fill:#fce4ec

2. Pattern 1 — Lambda On-Demand with Bedrock

Lambda on-demand is the simplest deployment pattern: a Lambda function calls the Bedrock API, paying only for tokens consumed. This pattern suits bursty, unpredictable workloads and development environments.

2.1 Architecture Deep Dive

"""
MangaAssist — Lambda on-demand deployment pattern.
Invokes Bedrock Claude models via Lambda for low-traffic or bursty workloads.
"""

import json
import time
import logging
import boto3
from typing import Any

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

# Initialize Bedrock client outside handler for connection reuse
bedrock_runtime = boto3.client("bedrock-runtime", region_name="ap-northeast-1")

# Model configuration
MODEL_CONFIG = {
    "haiku": {
        "model_id": "anthropic.claude-3-haiku-20240307-v1:0",
        "max_tokens": 1024,
        "cost_per_1k_input": 0.00025,
        "cost_per_1k_output": 0.00125,
    },
    "sonnet": {
        "model_id": "anthropic.claude-3-sonnet-20240229-v1:0",
        "max_tokens": 2048,
        "cost_per_1k_input": 0.003,
        "cost_per_1k_output": 0.015,
    },
}


def lambda_handler(event: dict, context: Any) -> dict:
    """
    Lambda handler for on-demand Bedrock invocation.
    Routes to Haiku or Sonnet based on query complexity.
    """
    start_time = time.time()
    remaining_ms = context.get_remaining_time_in_millis()

    body = json.loads(event.get("body", "{}"))
    user_message = body.get("message", "")
    session_id = body.get("session_id", "unknown")
    complexity = body.get("complexity", "simple")

    # Select model based on complexity
    model_key = "sonnet" if complexity == "complex" else "haiku"
    model_cfg = MODEL_CONFIG[model_key]

    logger.info(
        "Processing request",
        extra={
            "session_id": session_id,
            "model": model_key,
            "remaining_ms": remaining_ms,
        },
    )

    # Guard: if less than 5s remaining, use Haiku for faster response
    if remaining_ms < 5000 and model_key == "sonnet":
        model_key = "haiku"
        model_cfg = MODEL_CONFIG["haiku"]
        logger.warning("Falling back to Haiku due to time constraint")

    try:
        response = bedrock_runtime.invoke_model(
            modelId=model_cfg["model_id"],
            contentType="application/json",
            accept="application/json",
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": model_cfg["max_tokens"],
                "messages": [
                    {"role": "user", "content": user_message}
                ],
                "system": (
                    "You are MangaAssist, a helpful JP manga store assistant. "
                    "Answer in the user's language. Be concise and accurate."
                ),
            }),
        )

        result = json.loads(response["body"].read())
        output_text = result["content"][0]["text"]
        input_tokens = result["usage"]["input_tokens"]
        output_tokens = result["usage"]["output_tokens"]

        elapsed_ms = (time.time() - start_time) * 1000

        # Calculate cost for observability
        cost = (
            (input_tokens / 1000) * model_cfg["cost_per_1k_input"]
            + (output_tokens / 1000) * model_cfg["cost_per_1k_output"]
        )

        logger.info(
            "Request completed",
            extra={
                "session_id": session_id,
                "model": model_key,
                "input_tokens": input_tokens,
                "output_tokens": output_tokens,
                "cost_usd": round(cost, 6),
                "elapsed_ms": round(elapsed_ms, 1),
            },
        )

        return {
            "statusCode": 200,
            "body": json.dumps({
                "response": output_text,
                "metadata": {
                    "model": model_key,
                    "latency_ms": round(elapsed_ms, 1),
                    "tokens": {
                        "input": input_tokens,
                        "output": output_tokens,
                    },
                },
            }),
        }

    except bedrock_runtime.exceptions.ThrottlingException:
        logger.error("Bedrock throttling encountered", extra={"model": model_key})
        return {
            "statusCode": 429,
            "body": json.dumps({"error": "Service busy, please retry"}),
        }
    except Exception as e:
        logger.exception("Invocation failed")
        return {
            "statusCode": 500,
            "body": json.dumps({"error": str(e)}),
        }

2.2 Lambda Configuration for FM Workloads

| Setting | Recommended Value | Rationale |
| --- | --- | --- |
| Memory | 1024–3008 MB | More memory means more CPU; Bedrock calls are I/O-bound, but JSON parsing benefits from CPU |
| Timeout | 29s (API Gateway integration cap) | Sonnet can take 5–15s for long responses |
| Provisioned Concurrency | 10–50 for prod | Eliminates cold starts for baseline traffic |
| Reserved Concurrency | 500 | Prevents runaway costs; matches Bedrock account limits |
| Ephemeral Storage | 512 MB (default) | No model files needed; Bedrock handles model hosting |
| Architecture | arm64 | ~20% cheaper with comparable performance for API calls |
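
A minimal boto3 sketch applying the table's concurrency settings; the function name manga-assist-chat and the live alias are illustrative placeholders:

import boto3

lambda_client = boto3.client("lambda", region_name="ap-northeast-1")

# Cap total concurrency so a traffic spike cannot outrun Bedrock quotas.
lambda_client.put_function_concurrency(
    FunctionName="manga-assist-chat",  # hypothetical function name
    ReservedConcurrentExecutions=500,
)

# Keep a warm baseline; provisioned concurrency requires a version or alias.
lambda_client.put_provisioned_concurrency_config(
    FunctionName="manga-assist-chat",
    Qualifier="live",  # hypothetical alias
    ProvisionedConcurrentExecutions=25,  # mid-range of the 10–50 guidance
)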

2.3 Cold Start Mitigation

graph LR
    subgraph "Cold Start Timeline"
        A[Request Arrives] --> B[Init Runtime<br/>~300ms]
        B --> C[Load Handler<br/>~100ms]
        C --> D[Init boto3 Client<br/>~200ms]
        D --> E[Bedrock API Call<br/>~800ms–2s]
        E --> F[Response<br/>Total: 1.4–2.6s]
    end

    subgraph "Warm Invocation"
        G[Request Arrives] --> H[Bedrock API Call<br/>~800ms–2s]
        H --> I[Response<br/>Total: 0.8–2s]
    end

    style A fill:#ffcdd2
    style G fill:#c8e6c9
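
Beyond provisioned concurrency, a scheduled warm-up ping is a cheap way to keep a baseline of sandboxes initialized. A sketch, assuming an EventBridge rule invokes the function every few minutes with a {"warmup": true} payload (handle_chat_request is a hypothetical name for the normal path from section 2.1):

def lambda_handler(event: dict, context) -> dict:
    # Short-circuit warm-up pings before any Bedrock work; the invocation
    # then costs only a few milliseconds of compute.
    if event.get("warmup"):
        return {"statusCode": 200, "body": "warm"}
    return handle_chat_request(event, context)  # normal path from section 2.1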

3. Pattern 2 — Bedrock Provisioned Throughput

Provisioned Throughput reserves dedicated capacity for a model, guaranteeing consistent latency and throughput. This pattern is optimal for predictable, high-volume workloads like MangaAssist's peak hours.

3.1 Provisioned Throughput Architecture

"""
MangaAssist — Bedrock Provisioned Throughput deployment pattern.
Uses reserved model units for consistent latency at scale.
"""

import json
import time
import logging
import boto3
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)


@dataclass
class ProvisionedModelConfig:
    """Configuration for a provisioned throughput model."""
    provisioned_model_arn: str
    model_units: int
    commitment_duration: Optional[str]  # "OneMonth" | "SixMonths"; None = no commitment
    max_tokens: int
    estimated_tokens_per_minute: int


class ProvisionedThroughputManager:
    """
    Manages Bedrock provisioned throughput for MangaAssist.
    Handles capacity tracking, fallback to on-demand, and cost monitoring.
    """

    def __init__(self, region: str = "ap-northeast-1"):
        self.bedrock = boto3.client("bedrock", region_name=region)
        self.bedrock_runtime = boto3.client("bedrock-runtime", region_name=region)
        self.region = region

        # Provisioned models — set after creation
        self.provisioned_models: dict[str, ProvisionedModelConfig] = {}

        # Usage tracking for capacity monitoring
        self._request_count = 0
        self._token_count = 0
        self._window_start = time.time()

    def create_provisioned_throughput(
        self,
        model_id: str,
        model_units: int,
        commitment: Optional[str] = "OneMonth",
        name_suffix: str = "manga-assist",
    ) -> str:
        """
        Create a provisioned throughput reservation.
        Returns the provisioned model ARN.
        """
        kwargs = {
            "modelUnits": model_units,
            "provisionedModelName": f"{name_suffix}-{model_id.split('.')[-1]}",
            "modelId": model_id,
        }
        # commitmentDuration accepts "OneMonth" or "SixMonths"; omit the
        # parameter entirely for no-commitment (hourly) capacity.
        if commitment in ("OneMonth", "SixMonths"):
            kwargs["commitmentDuration"] = commitment

        response = self.bedrock.create_provisioned_model_throughput(**kwargs)

        provisioned_arn = response["provisionedModelArn"]
        logger.info(
            "Created provisioned throughput",
            extra={
                "model_id": model_id,
                "model_units": model_units,
                "commitment": commitment,
                "arn": provisioned_arn,
            },
        )
        return provisioned_arn

    def list_provisioned_models(self) -> list[dict]:
        """List all provisioned throughput models and their status."""
        response = self.bedrock.list_provisioned_model_throughputs()
        summaries = response.get("provisionedModelSummaries", [])

        models = []
        for summary in summaries:
            models.append({
                "name": summary["provisionedModelName"],
                "arn": summary["provisionedModelArn"],
                "status": summary["status"],
                "model_units": summary["modelUnits"],
                "model_id": summary["modelId"],
                "commitment": summary.get("commitmentDuration", "NoCommitment"),
                "created": str(summary.get("creationTime", "")),
            })
        return models

    def invoke_provisioned(
        self,
        provisioned_arn: str,
        messages: list[dict],
        system_prompt: str,
        max_tokens: int = 1024,
    ) -> dict:
        """
        Invoke a provisioned throughput model.
        Falls back to on-demand if provisioned fails.
        """
        start = time.time()

        try:
            response = self.bedrock_runtime.invoke_model(
                modelId=provisioned_arn,
                contentType="application/json",
                accept="application/json",
                body=json.dumps({
                    "anthropic_version": "bedrock-2023-05-31",
                    "max_tokens": max_tokens,
                    "messages": messages,
                    "system": system_prompt,
                }),
            )

            result = json.loads(response["body"].read())
            elapsed = time.time() - start

            self._request_count += 1
            self._token_count += result["usage"]["input_tokens"]
            self._token_count += result["usage"]["output_tokens"]

            return {
                "content": result["content"][0]["text"],
                "usage": result["usage"],
                "latency_ms": round(elapsed * 1000, 1),
                "source": "provisioned",
            }

        except Exception as e:
            logger.error(f"Provisioned invocation failed: {e}")
            return self._fallback_on_demand(messages, system_prompt, max_tokens)

    def _fallback_on_demand(
        self,
        messages: list[dict],
        system_prompt: str,
        max_tokens: int,
    ) -> dict:
        """Fall back to on-demand Bedrock when provisioned is unavailable."""
        start = time.time()
        logger.warning("Falling back to on-demand Bedrock")

        response = self.bedrock_runtime.invoke_model(
            modelId="anthropic.claude-3-haiku-20240307-v1:0",
            contentType="application/json",
            accept="application/json",
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": max_tokens,
                "messages": messages,
                "system": system_prompt,
            }),
        )

        result = json.loads(response["body"].read())
        elapsed = time.time() - start

        return {
            "content": result["content"][0]["text"],
            "usage": result["usage"],
            "latency_ms": round(elapsed * 1000, 1),
            "source": "on-demand-fallback",
        }

    def get_utilization_metrics(self) -> dict:
        """Return current utilization metrics for capacity planning."""
        window_seconds = time.time() - self._window_start
        window_minutes = max(window_seconds / 60, 1)

        return {
            "requests_per_minute": round(self._request_count / window_minutes, 1),
            "tokens_per_minute": round(self._token_count / window_minutes, 1),
            "total_requests": self._request_count,
            "total_tokens": self._token_count,
            "window_seconds": round(window_seconds, 1),
        }
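
A usage sketch of the manager; the unit count and the Japanese sample query are illustrative, and the fallback path covers the window while a new reservation is still provisioning:

manager = ProvisionedThroughputManager()

# One-time setup: reserve two model units of Sonnet for peak traffic.
# (A new reservation can take several minutes to become InService.)
arn = manager.create_provisioned_throughput(
    model_id="anthropic.claude-3-sonnet-20240229-v1:0",
    model_units=2,
    commitment="OneMonth",
)

# Request path: provisioned capacity first, on-demand Haiku as fallback.
result = manager.invoke_provisioned(
    provisioned_arn=arn,
    messages=[{"role": "user", "content": "One Piece の最新巻はいつ発売？"}],
    system_prompt="You are MangaAssist, a helpful JP manga store assistant.",
)
print(result["source"], result["latency_ms"])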

3.2 Provisioned Throughput Sizing

graph TD
    A[Estimate Peak Traffic] --> B[Calculate Tokens/Minute]
    B --> C{Tokens/min < 50K?}
    C -- Yes --> D[1 Model Unit<br/>~$23/hr Sonnet]
    C -- No --> E{Tokens/min < 150K?}
    E -- Yes --> F[2–3 Model Units]
    E -- No --> G[4+ Model Units<br/>Contact AWS]

    D --> H[Choose Commitment]
    F --> H
    G --> H

    H --> I{Predictable for 1 month?}
    I -- Yes --> J[1-Month Commitment<br/>~30% discount]
    I -- No --> K{Predictable for 6 months?}
    K -- Yes --> L[6-Month Commitment<br/>~50% discount]
    K -- No --> M[No Commitment<br/>Hourly billing]

    style A fill:#e3f2fd
    style H fill:#fff9c4
    style J fill:#c8e6c9
    style L fill:#c8e6c9
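
The same flowchart as code. The thresholds and the ~$23/hr-per-unit figure come from the diagram above; the 2-vs-3-unit split at 100K tokens/min is an assumed midpoint:

def size_provisioned_throughput(peak_tokens_per_minute: int) -> dict:
    """Map estimated peak tokens/minute to model units (sketch)."""
    if peak_tokens_per_minute < 50_000:
        units = 1
    elif peak_tokens_per_minute < 150_000:
        units = 2 if peak_tokens_per_minute < 100_000 else 3  # assumed split
    else:
        units = 4  # 4+ units: engage AWS for capacity planning
    return {
        "model_units": units,
        "approx_sonnet_hourly_usd": units * 23,  # ~$23/hr per unit (diagram)
    }

print(size_provisioned_throughput(80_000))  # MangaAssist peak: 2 units, ~$46/hr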

3.3 MangaAssist Provisioned Throughput Plan

| Time Window | Model | Model Units | Estimated Tokens/min | Hourly Cost | Strategy |
| --- | --- | --- | --- | --- | --- |
| Peak (18:00–24:00 JST) | Sonnet | 2 | 80K | ~$46 | 1-month committed |
| Business (09:00–18:00) | Haiku | 1 | 40K | ~$5 | 1-month committed |
| Off-peak (00:00–09:00) | Haiku | 0 (on-demand) | <10K | Pay-per-token | No provisioning |

4. Pattern 3 — SageMaker Real-Time Endpoints

SageMaker endpoints provide full control over hardware, model artifacts, and serving infrastructure. This pattern suits organizations needing custom models, specific GPU types, or advanced deployment features like multi-model endpoints.

4.1 SageMaker Endpoint Architecture

"""
MangaAssist — SageMaker endpoint deployment pattern.
Hosts custom or fine-tuned models with auto-scaling.
"""

import json
import time
import logging
import boto3
import sagemaker
from sagemaker.huggingface import HuggingFaceModel
from typing import Optional

logger = logging.getLogger(__name__)


class SageMakerFMDeployment:
    """
    Manages SageMaker endpoint deployment for MangaAssist.
    Supports custom model hosting with auto-scaling.
    """

    def __init__(self, region: str = "ap-northeast-1", role_arn: Optional[str] = None):
        self.region = region
        self.sm_client = boto3.client("sagemaker", region_name=region)
        self.sm_runtime = boto3.client("sagemaker-runtime", region_name=region)
        self.autoscaling = boto3.client(
            "application-autoscaling", region_name=region
        )
        self.session = sagemaker.Session(
            boto_session=boto3.Session(region_name=region)
        )
        # get_execution_role() only resolves inside SageMaker-managed
        # environments (notebooks, Studio); pass an explicit role ARN elsewhere.
        self.role = role_arn or sagemaker.get_execution_role()

    def deploy_huggingface_model(
        self,
        model_id: str,
        instance_type: str = "ml.g5.2xlarge",
        instance_count: int = 1,
        endpoint_name: str = "manga-assist-fm",
    ) -> str:
        """
        Deploy a HuggingFace model to a SageMaker endpoint.
        Uses the HuggingFace DLC (Deep Learning Container) for optimized inference.
        """
        hub_config = {
            "HF_MODEL_ID": model_id,
            "HF_TASK": "text-generation",
            "SM_NUM_GPUS": "1",
            "MAX_INPUT_LENGTH": "4096",
            "MAX_TOTAL_TOKENS": "8192",
            "MAX_BATCH_TOTAL_TOKENS": "16384",
        }

        huggingface_model = HuggingFaceModel(
            env=hub_config,
            role=self.role,
            # The TGI image URI below pins the full framework stack, so the
            # transformers/pytorch/py version kwargs are not needed here.
            image_uri=self._get_tgi_image_uri(),
        )

        predictor = huggingface_model.deploy(
            initial_instance_count=instance_count,
            instance_type=instance_type,
            endpoint_name=endpoint_name,
            container_startup_health_check_timeout=600,
            model_data_download_timeout=600,
        )

        logger.info(
            "Model deployed",
            extra={
                "endpoint": endpoint_name,
                "instance_type": instance_type,
                "model_id": model_id,
            },
        )
        return endpoint_name

    def _get_tgi_image_uri(self) -> str:
        """Get the Text Generation Inference container image URI."""
        return sagemaker.image_uris.retrieve(
            framework="huggingface-llm",
            region=self.region,
            version="2.0.2",
            image_scope="inference",
            instance_type="ml.g5.2xlarge",
        )

    def configure_auto_scaling(
        self,
        endpoint_name: str,
        variant_name: str = "AllTraffic",
        min_capacity: int = 1,
        max_capacity: int = 4,
        target_invocations_per_instance: int = 10,
        scale_in_cooldown: int = 300,
        scale_out_cooldown: int = 60,
    ) -> None:
        """
        Configure auto-scaling for a SageMaker endpoint.
        Uses InvocationsPerInstance metric for scaling decisions.
        """
        resource_id = (
            f"endpoint/{endpoint_name}/variant/{variant_name}"
        )

        # Register scalable target
        self.autoscaling.register_scalable_target(
            ServiceNamespace="sagemaker",
            ResourceId=resource_id,
            ScalableDimension="sagemaker:variant:DesiredInstanceCount",
            MinCapacity=min_capacity,
            MaxCapacity=max_capacity,
        )

        # Configure target tracking scaling policy
        self.autoscaling.put_scaling_policy(
            PolicyName=f"{endpoint_name}-scaling-policy",
            ServiceNamespace="sagemaker",
            ResourceId=resource_id,
            ScalableDimension="sagemaker:variant:DesiredInstanceCount",
            PolicyType="TargetTrackingScaling",
            TargetTrackingScalingPolicyConfiguration={
                "TargetValue": float(target_invocations_per_instance),
                "CustomizedMetricSpecification": {
                    "MetricName": "InvocationsPerInstance",
                    "Namespace": "AWS/SageMaker",
                    "Dimensions": [
                        {"Name": "EndpointName", "Value": endpoint_name},
                        {"Name": "VariantName", "Value": variant_name},
                    ],
                    "Statistic": "Average",
                },
                "ScaleInCooldown": scale_in_cooldown,
                "ScaleOutCooldown": scale_out_cooldown,
            },
        )

        logger.info(
            "Auto-scaling configured",
            extra={
                "endpoint": endpoint_name,
                "min": min_capacity,
                "max": max_capacity,
                "target_invocations": target_invocations_per_instance,
            },
        )

    def invoke_endpoint(
        self,
        endpoint_name: str,
        prompt: str,
        max_new_tokens: int = 512,
        temperature: float = 0.7,
    ) -> dict:
        """Invoke a SageMaker endpoint for text generation."""
        start = time.time()

        payload = {
            "inputs": prompt,
            "parameters": {
                "max_new_tokens": max_new_tokens,
                "temperature": temperature,
                "do_sample": True,
                "top_p": 0.9,
            },
        }

        response = self.sm_runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/json",
            Accept="application/json",
            Body=json.dumps(payload),
        )

        result = json.loads(response["Body"].read().decode())
        elapsed = time.time() - start

        return {
            "generated_text": result[0]["generated_text"],
            "latency_ms": round(elapsed * 1000, 1),
            "source": "sagemaker",
        }

    def get_endpoint_metrics(self, endpoint_name: str) -> dict:
        """Retrieve endpoint performance metrics from CloudWatch."""
        cw = boto3.client("cloudwatch", region_name=self.region)

        metrics = {}
        for metric_name in [
            "Invocations",
            "ModelLatency",
            "OverheadLatency",
            "InvocationsPerInstance",
        ]:
            response = cw.get_metric_statistics(
                Namespace="AWS/SageMaker",
                MetricName=metric_name,
                Dimensions=[
                    {"Name": "EndpointName", "Value": endpoint_name},
                    {"Name": "VariantName", "Value": "AllTraffic"},
                ],
                StartTime=time.time() - 3600,
                EndTime=time.time(),
                Period=300,
                Statistics=["Average", "Maximum", "Sum"],
            )
            datapoints = response.get("Datapoints", [])
            if datapoints:
                latest = sorted(datapoints, key=lambda x: x["Timestamp"])[-1]
                metrics[metric_name] = {
                    "average": latest.get("Average"),
                    "maximum": latest.get("Maximum"),
                    "sum": latest.get("Sum"),
                }

        return metrics
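
A usage sketch tying deployment, scaling, and invocation together; the IAM role ARN and the Hugging Face model id are placeholders:

deployment = SageMakerFMDeployment(
    role_arn="arn:aws:iam::123456789012:role/SMExecRole"  # placeholder role
)

# Deploy a hypothetical manga-tuned 7B chat model, then attach auto-scaling.
endpoint = deployment.deploy_huggingface_model(
    model_id="manga-assist/manga-7b-chat",  # placeholder Hugging Face model id
    instance_type="ml.g5.2xlarge",
)
deployment.configure_auto_scaling(endpoint, min_capacity=1, max_capacity=4)

result = deployment.invoke_endpoint(
    endpoint, prompt="おすすめの少年漫画を教えてください", max_new_tokens=256
)
print(result["latency_ms"], result["generated_text"][:80])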

4.2 SageMaker Instance Selection for FM Workloads

| Instance Type | GPU | GPU Memory | vCPUs | RAM | Cost/hr (Tokyo) | Best For |
| --- | --- | --- | --- | --- | --- | --- |
| ml.g5.xlarge | 1x A10G | 24 GB | 4 | 16 GB | ~$1.41 | Small models (<7B) |
| ml.g5.2xlarge | 1x A10G | 24 GB | 8 | 32 GB | ~$1.89 | Medium models (7–13B) |
| ml.g5.12xlarge | 4x A10G | 96 GB | 48 | 192 GB | ~$7.09 | Large models (13–30B) |
| ml.p4d.24xlarge | 8x A100 | 320 GB | 96 | 1152 GB | ~$40.97 | Very large models (70B+) |
| ml.inf2.xlarge | 1x Inferentia2 | 32 GB (accelerator) | 4 | 16 GB | ~$0.99 | Cost-optimized inference |
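
A rough rule of thumb for matching model size to GPU memory; a sketch, with the 1.3x overhead factor an assumed allowance for KV cache and activations:

def fits_on_gpu(params_billions: float, gpu_memory_gb: float,
                bytes_per_param: float = 2.0, overhead: float = 1.3) -> bool:
    """Estimate whether a model fits in GPU memory at fp16/bf16.

    Weights alone need params * 2 bytes; the overhead multiplier is a
    rough allowance for KV cache and activations (assumption).
    """
    needed_gb = params_billions * bytes_per_param * overhead
    return needed_gb <= gpu_memory_gb

# 7B at fp16 -> ~14 GB weights, ~18 GB with overhead: fits a 24 GB A10G.
print(fits_on_gpu(7, 24))   # True
# 13B -> ~34 GB with overhead: needs quantization or multi-GPU sharding.
print(fits_on_gpu(13, 24))  # False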

5. Hybrid Architecture — MangaAssist Production Design

5.1 Traffic-Based Routing

graph TB
    API[API Gateway WebSocket] --> Router[Traffic Router<br/>ECS Fargate]

    Router --> |"Peak hours<br/>18:00–24:00 JST"| PT[Bedrock Provisioned<br/>Sonnet 2 MU]
    Router --> |"Business hours<br/>Complex queries"| OD_S[Bedrock On-Demand<br/>Sonnet]
    Router --> |"Simple queries<br/>Any time"| OD_H[Bedrock On-Demand<br/>Haiku]
    Router --> |"Custom model<br/>Manga-specific"| SM[SageMaker Endpoint<br/>Fine-tuned 7B]

    PT --> |Fallback| OD_S
    SM --> |Fallback| OD_H

    PT --> Resp[Response<br/>Aggregator]
    OD_S --> Resp
    OD_H --> Resp
    SM --> Resp

    Resp --> API

    style Router fill:#fff9c4
    style PT fill:#c8e6c9
    style OD_S fill:#e3f2fd
    style OD_H fill:#e3f2fd
    style SM fill:#fce4ec

5.2 Hybrid Router Implementation

"""
MangaAssist — Hybrid deployment router.
Routes requests to the optimal deployment pattern based on
time-of-day, query complexity, and current capacity.
"""

import json
import time
import logging
from datetime import datetime, timezone, timedelta
from enum import Enum
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)

JST = timezone(timedelta(hours=9))


class DeploymentTarget(Enum):
    """Available deployment targets."""
    BEDROCK_PROVISIONED_SONNET = "provisioned-sonnet"
    BEDROCK_ONDEMAND_SONNET = "ondemand-sonnet"
    BEDROCK_ONDEMAND_HAIKU = "ondemand-haiku"
    SAGEMAKER_CUSTOM = "sagemaker-custom"


@dataclass
class RoutingDecision:
    """Result of the routing decision process."""
    target: DeploymentTarget
    reason: str
    estimated_latency_ms: int
    estimated_cost_per_request: float
    fallback: Optional[DeploymentTarget] = None


class HybridDeploymentRouter:
    """
    Routes MangaAssist requests to the optimal deployment pattern.

    Routing logic:
    1. Peak hours (18:00-24:00 JST) -> Provisioned Sonnet (committed capacity)
    2. Complex queries -> On-demand Sonnet (quality-first)
    3. Simple queries -> On-demand Haiku (cost-optimized)
    4. Manga-specific queries -> SageMaker custom model (specialized)
    """

    PEAK_START_HOUR = 18
    PEAK_END_HOUR = 24
    COMPLEXITY_THRESHOLD = 0.7
    MANGA_KEYWORDS = [
        "recommend", "similar", "genre", "author", "series",
        "rating", "review", "chapter", "volume", "latest",
    ]

    def __init__(self):
        self._daily_cost = 0.0
        self._daily_budget = 500.0  # USD per day
        self._request_count = 0

    def route(
        self,
        query: str,
        complexity_score: float,
        session_context: Optional[dict] = None,
    ) -> RoutingDecision:
        """
        Determine the optimal deployment target for a request.
        """
        now = datetime.now(JST)
        is_peak = self.PEAK_START_HOUR <= now.hour < self.PEAK_END_HOUR

        # Budget guard: force Haiku if budget is nearly exhausted
        if self._daily_cost >= self._daily_budget * 0.95:
            logger.warning("Daily budget nearly exhausted, forcing Haiku")
            return RoutingDecision(
                target=DeploymentTarget.BEDROCK_ONDEMAND_HAIKU,
                reason="budget_guard",
                estimated_latency_ms=800,
                estimated_cost_per_request=0.0005,
            )

        # Manga-specific query -> SageMaker custom model
        if self._is_manga_specific(query):
            return RoutingDecision(
                target=DeploymentTarget.SAGEMAKER_CUSTOM,
                reason="manga_specific_query",
                estimated_latency_ms=600,
                estimated_cost_per_request=0.001,
                fallback=DeploymentTarget.BEDROCK_ONDEMAND_HAIKU,
            )

        # Peak hours + complex -> Provisioned Sonnet
        if is_peak and complexity_score >= self.COMPLEXITY_THRESHOLD:
            return RoutingDecision(
                target=DeploymentTarget.BEDROCK_PROVISIONED_SONNET,
                reason="peak_complex",
                estimated_latency_ms=500,
                estimated_cost_per_request=0.003,
                fallback=DeploymentTarget.BEDROCK_ONDEMAND_SONNET,
            )

        # Peak hours + simple -> Provisioned Sonnet (within capacity)
        if is_peak:
            return RoutingDecision(
                target=DeploymentTarget.BEDROCK_PROVISIONED_SONNET,
                reason="peak_simple",
                estimated_latency_ms=400,
                estimated_cost_per_request=0.002,
                fallback=DeploymentTarget.BEDROCK_ONDEMAND_HAIKU,
            )

        # Off-peak + complex -> On-demand Sonnet
        if complexity_score >= self.COMPLEXITY_THRESHOLD:
            return RoutingDecision(
                target=DeploymentTarget.BEDROCK_ONDEMAND_SONNET,
                reason="offpeak_complex",
                estimated_latency_ms=1200,
                estimated_cost_per_request=0.005,
                fallback=DeploymentTarget.BEDROCK_ONDEMAND_HAIKU,
            )

        # Default: On-demand Haiku (cheapest)
        return RoutingDecision(
            target=DeploymentTarget.BEDROCK_ONDEMAND_HAIKU,
            reason="offpeak_simple",
            estimated_latency_ms=800,
            estimated_cost_per_request=0.0005,
        )

    def _is_manga_specific(self, query: str) -> bool:
        """Check if the query is manga-catalog-specific."""
        query_lower = query.lower()
        return any(kw in query_lower for kw in self.MANGA_KEYWORDS)

    def record_cost(self, cost: float) -> None:
        """Record the cost of a completed request.

        Counters are assumed to be reset at the JST day boundary by an
        external scheduler (not shown here).
        """
        self._daily_cost += cost
        self._request_count += 1
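
A usage sketch of the router; the sample query and cost figure are illustrative:

router = HybridDeploymentRouter()

# Catalog-style wording ("recommend", "similar", "series") trips the
# manga-specific check, so this routes to the SageMaker custom model
# regardless of time of day, with on-demand Haiku as the declared fallback.
decision = router.route(
    query="Recommend a seinen series similar to Vinland Saga",
    complexity_score=0.4,
)
print(decision.target, decision.reason)  # SAGEMAKER_CUSTOM manga_specific_query

# Feed realized request cost back so the daily budget guard stays accurate.
router.record_cost(0.001)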

6. Cost Comparison Analysis

6.1 MangaAssist Monthly Cost Projection

| Scenario | Lambda + On-Demand | Bedrock Provisioned | SageMaker Endpoint | Hybrid |
| --- | --- | --- | --- | --- |
| 1M msgs/day (simple) | $15,000 | $10,800 | $8,200 | $9,500 |
| 1M msgs/day (mixed) | $28,000 | $16,200 | $12,400 | $14,800 |
| 100K msgs/day (simple) | $1,500 | $10,800 | $8,200 | $2,100 |
| 100K msgs/day (mixed) | $2,800 | $10,800 | $8,200 | $3,200 |
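
Normalizing the 1M msgs/day (mixed) row to per-message cost makes the spread easier to compare:

# Per-message cost at 1M msgs/day (mixed), using the table's monthly figures.
monthly_messages = 1_000_000 * 30
for pattern, monthly_usd in {
    "lambda-on-demand": 28_000,
    "bedrock-provisioned": 16_200,
    "sagemaker-endpoint": 12_400,
    "hybrid": 14_800,
}.items():
    print(f"{pattern}: ${monthly_usd / monthly_messages:.5f}/msg")
# hybrid lands near $0.00049/msg, comparable to the router's Haiku estimate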

6.2 Break-Even Analysis

graph LR
    subgraph "Break-Even: On-Demand vs Provisioned"
        A[Daily Volume] --> B{"> 300K tokens/hr<br/>sustained?"}
        B -- Yes --> C[Provisioned Wins<br/>30-50% savings]
        B -- No --> D{"> 8 hrs/day<br/>active?"}
        D -- Yes --> E[Provisioned Wins<br/>if predictable]
        D -- No --> F[On-Demand Wins<br/>pay only for use]
    end

    subgraph "Break-Even: Bedrock vs SageMaker"
        G[Monthly Budget] --> H{"> $10K/month?"}
        H -- Yes --> I{Need custom model?}
        I -- Yes --> J[SageMaker Wins]
        I -- No --> K[Bedrock Provisioned<br/>lower ops cost]
        H -- No --> L[Bedrock On-Demand<br/>simplest option]
    end

    style C fill:#c8e6c9
    style E fill:#c8e6c9
    style F fill:#e3f2fd
    style J fill:#fce4ec
    style K fill:#c8e6c9
    style L fill:#e3f2fd

7. Decision Framework

7.1 Pattern Selection Flowchart

flowchart TD
    Start[New FM Deployment] --> Q1{Need custom model<br/>or specific GPU?}
    Q1 -- Yes --> SM[SageMaker Endpoint]
    Q1 -- No --> Q2{Predictable high<br/>throughput?}

    Q2 -- Yes --> Q3{Can commit 1+ months?}
    Q3 -- Yes --> PT[Bedrock Provisioned<br/>Throughput]
    Q3 -- No --> Q4{Budget > $5K/mo?}
    Q4 -- Yes --> PT_NC[Bedrock Provisioned<br/>No Commitment]
    Q4 -- No --> OD[Lambda + Bedrock<br/>On-Demand]

    Q2 -- No --> Q5{Bursty / unpredictable<br/>traffic?}
    Q5 -- Yes --> OD
    Q5 -- No --> Q6{Multiple models<br/>needed?}
    Q6 -- Yes --> HY[Hybrid Architecture]
    Q6 -- No --> OD

    SM --> Deploy[Deploy & Monitor]
    PT --> Deploy
    PT_NC --> Deploy
    OD --> Deploy
    HY --> Deploy

    style Start fill:#e3f2fd
    style SM fill:#fce4ec
    style PT fill:#c8e6c9
    style PT_NC fill:#c8e6c9
    style OD fill:#fff9c4
    style HY fill:#e1bee7
    style Deploy fill:#f5f5f5

7.2 Decision Criteria Summary

| Criterion | Favors Lambda On-Demand | Favors Provisioned | Favors SageMaker |
| --- | --- | --- | --- |
| Traffic < 100K msgs/day | Strong | Weak | Weak |
| Traffic > 500K msgs/day | Weak | Strong | Medium |
| Latency < 500ms required | Weak | Medium | Strong |
| Custom model needed | Not possible | Not possible | Required |
| Zero idle cost | Strong | Not possible | Not possible |
| Ops team capacity: small | Strong | Strong | Weak |
| Fine-tuned Bedrock model | Medium | Strong | N/A |
| Multi-model serving | Medium | Weak | Strong |
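
The section 7.1 flowchart rendered as a function, with the same branch order; the return labels are illustrative:

def select_pattern(needs_custom_model: bool, predictable_high_tps: bool,
                   can_commit_month: bool, monthly_budget_usd: float,
                   bursty: bool, multiple_models: bool) -> str:
    """Code rendering of the section 7.1 selection flowchart (sketch)."""
    if needs_custom_model:
        return "sagemaker-endpoint"
    if predictable_high_tps:
        if can_commit_month:
            return "bedrock-provisioned-committed"
        return ("bedrock-provisioned-no-commitment"
                if monthly_budget_usd > 5_000 else "lambda-bedrock-on-demand")
    if bursty:
        return "lambda-bedrock-on-demand"
    return "hybrid" if multiple_models else "lambda-bedrock-on-demand"

# MangaAssist peak traffic: predictable, high volume, month-long commitment OK.
print(select_pattern(False, True, True, 15_000, False, True))
# -> bedrock-provisioned-committed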

Key Takeaways

  1. Lambda on-demand is the starting point — zero idle cost, simplest ops, and scales automatically; best for dev/test, bursty traffic, and workloads under 100K messages/day.

  2. Bedrock provisioned throughput reduces latency and cost at scale — once traffic is predictable and exceeds ~300K tokens/hour sustained, provisioned model units deliver 30–50% savings over on-demand with consistent sub-second latency.

  3. SageMaker endpoints provide full control — when you need custom models, specific GPU types, multi-model hosting, or advanced deployment features (A/B testing, shadow deployments), SageMaker is the only option despite higher operational complexity.

  4. Hybrid architecture is the production answer — MangaAssist routes peak-hour traffic to provisioned throughput, simple queries to Haiku on-demand, complex queries to Sonnet, and manga-specific queries to a SageMaker-hosted fine-tuned model.

  5. Cost optimization requires time-of-day awareness — provisioned throughput during peak hours (18:00–24:00 JST for a JP manga store) combined with on-demand during off-peak eliminates wasted committed capacity.

  6. Always implement fallback paths — every deployment pattern should fall back to a simpler, cheaper option (e.g., provisioned -> on-demand Sonnet -> on-demand Haiku) to maintain availability when capacity is exhausted.

  7. Break-even analysis drives the decision — the crossover from on-demand to provisioned occurs around 300K tokens/hour sustained; the crossover from Bedrock to SageMaker occurs when custom model requirements or >$10K/month budgets make infrastructure control worthwhile.