FM Deployment Patterns Architecture
MangaAssist context: JP Manga store chatbot on AWS — Bedrock Claude 3 (Sonnet at $3/$15 per 1M tokens input/output, Haiku at $0.25/$1.25), OpenSearch Serverless (vector store), DynamoDB (sessions/products), ECS Fargate (orchestrator), API Gateway WebSocket, ElastiCache Redis. Target: useful answer in under 3 seconds, 1M messages/day scale.
Skill Mapping
| Dimension | Value |
|---|---|
| Certification | AWS Certified AI Practitioner (AIF-C01) |
| Task | 2.2 — Select and implement model deployment strategies |
| Skill | 2.2.1 — Deploy FMs using Lambda on-demand, Bedrock provisioned throughput, SageMaker hybrid |
| This File | Deployment patterns architecture, decision framework, cost comparison |
Skill Scope
This file covers the architecture of FM deployment patterns across three primary approaches on AWS: Lambda with on-demand Bedrock invocation, Bedrock provisioned throughput, and SageMaker endpoint hosting. We examine when each pattern is appropriate, how the patterns compare on latency, throughput, and cost, and how MangaAssist uses a hybrid approach to balance performance with budget at 1M messages/day. The decision framework presented here helps architects select the right deployment pattern for a production GenAI workload.
Mind Map
mindmap
root((FM Deployment Patterns))
Lambda On-Demand
Pay-per-invocation
Zero idle cost
Cold start latency
15-minute timeout
Bedrock API calls
Bursty traffic
Bedrock Provisioned Throughput
Reserved model units
Predictable latency
Committed capacity
No infrastructure mgmt
Custom model support
Hourly billing
SageMaker Endpoints
Real-time inference
Custom containers
Auto-scaling policies
Multi-model endpoints
GPU instance selection
Full model control
Hybrid Architecture
Traffic-based routing
Cost optimization
Latency tiers
Failover paths
Peak vs off-peak
Gradual migration
Decision Framework
Latency requirements
Traffic predictability
Cost constraints
Operational complexity
Model customization
Scaling speed
Cost Analysis
Per-token pricing
Provisioned commitments
Instance hours
Data transfer
Total cost of ownership
Break-even points
1. Deployment Pattern Overview
Foundation model deployment on AWS follows three primary patterns, each with distinct trade-offs across latency, throughput, cost, and operational complexity.
1.1 Pattern Comparison Matrix
| Dimension | Lambda + Bedrock On-Demand | Bedrock Provisioned Throughput | SageMaker Real-Time Endpoint |
|---|---|---|---|
| Latency (p50) | 800ms–2s | 400ms–1.2s | 300ms–1s (warm) |
| Latency (p99) | 3–8s (cold start) | 600ms–1.8s | 500ms–1.5s |
| Max throughput | Elastic (account limits) | Reserved model units | Instance-bound |
| Scaling speed | Seconds (warm) / minutes (cold) | Instant (within capacity) | 5–15 min (new instances) |
| Idle cost | $0 | Hourly commitment | Instance hours |
| Per-request cost | Highest per-token | Medium (committed) | Lowest at scale |
| Ops complexity | Lowest | Low | Highest |
| Model customization | Base models only | Base + Bedrock fine-tuned models | Any model / framework |
| GPU control | None | None | Full (instance type) |
1.2 Architecture Diagram — Three Patterns
graph TB
subgraph "Pattern 1: Lambda On-Demand"
Client1[API Gateway<br/>WebSocket] --> Lambda1[Lambda Function<br/>256MB–10GB RAM]
Lambda1 --> BR1[Bedrock API<br/>On-Demand]
BR1 --> Claude1[Claude 3 Sonnet/Haiku<br/>Pay-per-token]
end
subgraph "Pattern 2: Bedrock Provisioned Throughput"
Client2[API Gateway<br/>WebSocket] --> ECS2[ECS Fargate<br/>Orchestrator]
ECS2 --> BR2[Bedrock API<br/>Provisioned]
BR2 --> Claude2[Claude 3 Sonnet<br/>Reserved Model Units]
end
subgraph "Pattern 3: SageMaker Endpoint"
Client3[API Gateway<br/>WebSocket] --> ECS3[ECS Fargate<br/>Orchestrator]
ECS3 --> SM3[SageMaker Endpoint<br/>ml.g5.2xlarge]
SM3 --> Model3[Custom/Fine-tuned Model<br/>GPU Instance]
end
style Client1 fill:#e1f5fe
style Client2 fill:#e1f5fe
style Client3 fill:#e1f5fe
style Lambda1 fill:#fff3e0
style ECS2 fill:#fff3e0
style ECS3 fill:#fff3e0
style BR1 fill:#e8f5e9
style BR2 fill:#e8f5e9
style SM3 fill:#fce4ec
2. Pattern 1 — Lambda On-Demand with Bedrock
Lambda on-demand is the simplest deployment pattern: a Lambda function calls the Bedrock API, paying only for tokens consumed. This pattern suits bursty, unpredictable workloads and development environments.
2.1 Architecture Deep Dive
"""
MangaAssist — Lambda on-demand deployment pattern.
Invokes Bedrock Claude models via Lambda for low-traffic or bursty workloads.
"""
import json
import time
import logging
import boto3
from typing import Any
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
# Initialize Bedrock client outside handler for connection reuse
bedrock_runtime = boto3.client("bedrock-runtime", region_name="ap-northeast-1")
# Model configuration
MODEL_CONFIG = {
"haiku": {
"model_id": "anthropic.claude-3-haiku-20240307-v1:0",
"max_tokens": 1024,
"cost_per_1k_input": 0.00025,
"cost_per_1k_output": 0.00125,
},
"sonnet": {
"model_id": "anthropic.claude-3-sonnet-20240229-v1:0",
"max_tokens": 2048,
"cost_per_1k_input": 0.003,
"cost_per_1k_output": 0.015,
},
}
def lambda_handler(event: dict, context: Any) -> dict:
"""
Lambda handler for on-demand Bedrock invocation.
Routes to Haiku or Sonnet based on query complexity.
"""
start_time = time.time()
remaining_ms = context.get_remaining_time_in_millis()
body = json.loads(event.get("body", "{}"))
user_message = body.get("message", "")
session_id = body.get("session_id", "unknown")
complexity = body.get("complexity", "simple")
# Select model based on complexity
model_key = "sonnet" if complexity == "complex" else "haiku"
model_cfg = MODEL_CONFIG[model_key]
logger.info(
"Processing request",
extra={
"session_id": session_id,
"model": model_key,
"remaining_ms": remaining_ms,
},
)
# Guard: if less than 5s remaining, use Haiku for faster response
if remaining_ms < 5000 and model_key == "sonnet":
model_key = "haiku"
model_cfg = MODEL_CONFIG["haiku"]
logger.warning("Falling back to Haiku due to time constraint")
try:
response = bedrock_runtime.invoke_model(
modelId=model_cfg["model_id"],
contentType="application/json",
accept="application/json",
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": model_cfg["max_tokens"],
"messages": [
{"role": "user", "content": user_message}
],
"system": (
"You are MangaAssist, a helpful JP manga store assistant. "
"Answer in the user's language. Be concise and accurate."
),
}),
)
result = json.loads(response["body"].read())
output_text = result["content"][0]["text"]
input_tokens = result["usage"]["input_tokens"]
output_tokens = result["usage"]["output_tokens"]
elapsed_ms = (time.time() - start_time) * 1000
# Calculate cost for observability
cost = (
(input_tokens / 1000) * model_cfg["cost_per_1k_input"]
+ (output_tokens / 1000) * model_cfg["cost_per_1k_output"]
)
logger.info(
"Request completed",
extra={
"session_id": session_id,
"model": model_key,
"input_tokens": input_tokens,
"output_tokens": output_tokens,
"cost_usd": round(cost, 6),
"elapsed_ms": round(elapsed_ms, 1),
},
)
return {
"statusCode": 200,
"body": json.dumps({
"response": output_text,
"metadata": {
"model": model_key,
"latency_ms": round(elapsed_ms, 1),
"tokens": {
"input": input_tokens,
"output": output_tokens,
},
},
}),
}
except bedrock_runtime.exceptions.ThrottlingException:
logger.error("Bedrock throttling encountered", extra={"model": model_key})
return {
"statusCode": 429,
"body": json.dumps({"error": "Service busy, please retry"}),
}
except Exception as e:
logger.exception("Invocation failed")
return {
"statusCode": 500,
"body": json.dumps({"error": str(e)}),
}
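Before wiring the handler to API Gateway, it can be smoke-tested locally with a synthetic event and a stub context. The event shape and `FakeContext` below are illustrative, not part of the MangaAssist codebase; the call still needs AWS credentials with `bedrock:InvokeModel` in ap-northeast-1.

```python
# Illustrative local smoke test for lambda_handler (same module as the handler above).
import json


class FakeContext:
    """Minimal stand-in for the Lambda context object."""

    def get_remaining_time_in_millis(self) -> int:
        return 30_000  # pretend a full 30s timeout remains


if __name__ == "__main__":
    event = {
        "body": json.dumps({
            "message": "One Piece の最新刊はいつ発売されますか？",
            "session_id": "local-test-001",
            "complexity": "simple",  # routes to Haiku
        })
    }
    result = lambda_handler(event, FakeContext())
    print(result["statusCode"])
    print(json.loads(result["body"]).get("metadata"))
```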
2.2 Lambda Configuration for FM Workloads
| Setting | Recommended Value | Rationale |
|---|---|---|
| Memory | 1024–3008 MB | More memory = more CPU; Bedrock calls are I/O bound but JSON parsing benefits from CPU |
| Timeout | 30s | API Gateway integrations time out at ~29s, so a longer Lambda timeout adds no value; Sonnet can take 5–15s for long responses |
| Provisioned Concurrency | 10–50 for prod | Eliminates cold starts for baseline traffic |
| Reserved Concurrency | 500 | Prevents runaway costs; matches Bedrock account limits |
| Ephemeral Storage | 512 MB (default) | No model files needed; Bedrock handles model hosting |
| Architecture | arm64 | 20% cheaper, comparable performance for API calls |
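These settings can be applied with boto3. The sketch below is illustrative and assumes a hypothetical function name (`manga-assist-chat`) and alias (`live`); it simply mirrors the table's recommended values.

```python
# Illustrative sketch: apply the table's concurrency settings with boto3.
# Function name and alias are hypothetical placeholders.
import boto3

lambda_client = boto3.client("lambda", region_name="ap-northeast-1")

FUNCTION_NAME = "manga-assist-chat"  # hypothetical
ALIAS = "live"                       # provisioned concurrency attaches to an alias or version

# Reserved concurrency: hard cap that contains cost and stays under Bedrock account limits.
lambda_client.put_function_concurrency(
    FunctionName=FUNCTION_NAME,
    ReservedConcurrentExecutions=500,
)

# Provisioned concurrency: pre-initialized environments for baseline traffic (no cold starts).
lambda_client.put_provisioned_concurrency_config(
    FunctionName=FUNCTION_NAME,
    Qualifier=ALIAS,
    ProvisionedConcurrentExecutions=20,  # within the table's 10-50 production range
)
```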
2.3 Cold Start Mitigation
graph LR
subgraph "Cold Start Timeline"
A[Request Arrives] --> B[Init Runtime<br/>~300ms]
B --> C[Load Handler<br/>~100ms]
C --> D[Init boto3 Client<br/>~200ms]
D --> E[Bedrock API Call<br/>~800ms–2s]
E --> F[Response<br/>Total: 1.4–2.6s]
end
subgraph "Warm Invocation"
G[Request Arrives] --> H[Bedrock API Call<br/>~800ms–2s]
H --> I[Response<br/>Total: 0.8–2s]
end
style A fill:#ffcdd2
style G fill:#c8e6c9
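Provisioned concurrency (previous table) removes init latency for baseline traffic. Where it is not justified, such as dev stages, a scheduled EventBridge rule that pings the function keeps a few environments warm. The `warmup` flag below is an assumed convention, not an AWS feature; the sketch shows only the handler-side guard.

```python
# Illustrative keep-warm guard. Assumes an EventBridge schedule invokes the
# function with {"warmup": true} every few minutes; returns before any Bedrock call.
from typing import Optional


def handle_warmup(event: dict) -> Optional[dict]:
    if isinstance(event, dict) and event.get("warmup"):
        return {"statusCode": 200, "body": '{"status": "warm"}'}
    return None

# Inside lambda_handler, before parsing the body:
#     warm = handle_warmup(event)
#     if warm:
#         return warm
```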
3. Pattern 2 — Bedrock Provisioned Throughput
Provisioned Throughput reserves dedicated capacity for a model, guaranteeing consistent latency and throughput. This pattern is optimal for predictable, high-volume workloads like MangaAssist's peak hours.
3.1 Provisioned Throughput Architecture
"""
MangaAssist — Bedrock Provisioned Throughput deployment pattern.
Uses reserved model units for consistent latency at scale.
"""
import json
import time
import logging
import boto3
from dataclasses import dataclass
from typing import Optional
logger = logging.getLogger(__name__)
@dataclass
class ProvisionedModelConfig:
"""Configuration for a provisioned throughput model."""
provisioned_model_arn: str
model_units: int
commitment_duration: str # "OneMonth" | "SixMonths" | "NoCommitment"
max_tokens: int
estimated_tokens_per_minute: int
class ProvisionedThroughputManager:
"""
Manages Bedrock provisioned throughput for MangaAssist.
Handles capacity tracking, fallback to on-demand, and cost monitoring.
"""
def __init__(self, region: str = "ap-northeast-1"):
self.bedrock = boto3.client("bedrock", region_name=region)
self.bedrock_runtime = boto3.client("bedrock-runtime", region_name=region)
self.region = region
# Provisioned models — set after creation
self.provisioned_models: dict[str, ProvisionedModelConfig] = {}
# Usage tracking for capacity monitoring
self._request_count = 0
self._token_count = 0
self._window_start = time.time()
def create_provisioned_throughput(
self,
model_id: str,
model_units: int,
commitment: str = "OneMonth",
name_suffix: str = "manga-assist",
) -> str:
"""
Create a provisioned throughput reservation.
Returns the provisioned model ARN.
"""
        # commitmentDuration accepts "OneMonth" or "SixMonths"; for no-commitment
        # (hourly) capacity, omit the parameter entirely.
        request_kwargs = {
            "modelUnits": model_units,
            "provisionedModelName": f"{name_suffix}-{model_id.split('.')[-1]}",
            "modelId": model_id,
        }
        if commitment in ("OneMonth", "SixMonths"):
            request_kwargs["commitmentDuration"] = commitment
        response = self.bedrock.create_provisioned_model_throughput(**request_kwargs)
provisioned_arn = response["provisionedModelArn"]
logger.info(
"Created provisioned throughput",
extra={
"model_id": model_id,
"model_units": model_units,
"commitment": commitment,
"arn": provisioned_arn,
},
)
return provisioned_arn
def list_provisioned_models(self) -> list[dict]:
"""List all provisioned throughput models and their status."""
response = self.bedrock.list_provisioned_model_throughputs()
summaries = response.get("provisionedModelSummaries", [])
models = []
for summary in summaries:
models.append({
"name": summary["provisionedModelName"],
"arn": summary["provisionedModelArn"],
"status": summary["status"],
"model_units": summary["modelUnits"],
"model_id": summary["modelId"],
"commitment": summary.get("commitmentDuration", "NoCommitment"),
"created": str(summary.get("creationTime", "")),
})
return models
def invoke_provisioned(
self,
provisioned_arn: str,
messages: list[dict],
system_prompt: str,
max_tokens: int = 1024,
) -> dict:
"""
Invoke a provisioned throughput model.
Falls back to on-demand if provisioned fails.
"""
start = time.time()
try:
response = self.bedrock_runtime.invoke_model(
modelId=provisioned_arn,
contentType="application/json",
accept="application/json",
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": max_tokens,
"messages": messages,
"system": system_prompt,
}),
)
result = json.loads(response["body"].read())
elapsed = time.time() - start
self._request_count += 1
self._token_count += result["usage"]["input_tokens"]
self._token_count += result["usage"]["output_tokens"]
return {
"content": result["content"][0]["text"],
"usage": result["usage"],
"latency_ms": round(elapsed * 1000, 1),
"source": "provisioned",
}
except Exception as e:
logger.error(f"Provisioned invocation failed: {e}")
return self._fallback_on_demand(messages, system_prompt, max_tokens)
def _fallback_on_demand(
self,
messages: list[dict],
system_prompt: str,
max_tokens: int,
) -> dict:
"""Fall back to on-demand Bedrock when provisioned is unavailable."""
start = time.time()
logger.warning("Falling back to on-demand Bedrock")
response = self.bedrock_runtime.invoke_model(
modelId="anthropic.claude-3-haiku-20240307-v1:0",
contentType="application/json",
accept="application/json",
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": max_tokens,
"messages": messages,
"system": system_prompt,
}),
)
result = json.loads(response["body"].read())
elapsed = time.time() - start
return {
"content": result["content"][0]["text"],
"usage": result["usage"],
"latency_ms": round(elapsed * 1000, 1),
"source": "on-demand-fallback",
}
def get_utilization_metrics(self) -> dict:
"""Return current utilization metrics for capacity planning."""
window_seconds = time.time() - self._window_start
window_minutes = max(window_seconds / 60, 1)
return {
"requests_per_minute": round(self._request_count / window_minutes, 1),
"tokens_per_minute": round(self._token_count / window_minutes, 1),
"total_requests": self._request_count,
"total_tokens": self._token_count,
"window_seconds": round(window_seconds, 1),
}
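A minimal usage sketch for the manager above. The provisioned model ARN and messages are placeholders, and the creation call is commented out because provisioned capacity bills from the moment it is created.

```python
# Illustrative usage of ProvisionedThroughputManager; the ARN is a placeholder.
manager = ProvisionedThroughputManager(region="ap-northeast-1")

# One-time capacity purchase (bills from creation; run deliberately, not on every deploy):
# arn = manager.create_provisioned_throughput(
#     model_id="anthropic.claude-3-sonnet-20240229-v1:0",
#     model_units=2,
#     commitment="OneMonth",
# )

arn = "arn:aws:bedrock:ap-northeast-1:123456789012:provisioned-model/EXAMPLE"  # placeholder
reply = manager.invoke_provisioned(
    provisioned_arn=arn,
    messages=[{"role": "user", "content": "おすすめの少年漫画を3冊教えてください"}],
    system_prompt="You are MangaAssist, a helpful JP manga store assistant.",
    max_tokens=512,
)
print(reply["source"], reply["latency_ms"], "ms")
print(manager.get_utilization_metrics())
```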
3.2 Provisioned Throughput Sizing
graph TD
A[Estimate Peak Traffic] --> B[Calculate Tokens/Minute]
B --> C{Tokens/min < 50K?}
C -- Yes --> D[1 Model Unit<br/>~$23/hr Sonnet]
C -- No --> E{Tokens/min < 150K?}
E -- Yes --> F[2–3 Model Units]
E -- No --> G[4+ Model Units<br/>Contact AWS]
D --> H[Choose Commitment]
F --> H
G --> H
H --> I{Predictable for 1 month?}
I -- Yes --> J[1-Month Commitment<br/>~30% discount]
I -- No --> K{Predictable for 6 months?}
K -- Yes --> L[6-Month Commitment<br/>~50% discount]
K -- No --> M[No Commitment<br/>Hourly billing]
style A fill:#e3f2fd
style H fill:#fff9c4
style J fill:#c8e6c9
style L fill:#c8e6c9
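The flowchart above can be reduced to a back-of-the-envelope calculation. The 50K tokens/minute-per-model-unit figure below is taken from the flowchart's first branch and is an assumption, not a published quota; confirm the throughput quoted for your model, region, and commitment before purchasing.

```python
# Illustrative provisioned-throughput sizing helper. The per-model-unit
# capacity is an assumption taken from the flowchart, not an AWS quota.
import math

TOKENS_PER_MINUTE_PER_MU = 50_000  # assumed capacity of one model unit


def estimate_model_units(
    peak_messages_per_minute: float,
    avg_input_tokens: float,
    avg_output_tokens: float,
    headroom: float = 1.2,  # 20% buffer for spikes
) -> int:
    """Back-of-the-envelope model-unit estimate for a peak traffic window."""
    tokens_per_minute = peak_messages_per_minute * (avg_input_tokens + avg_output_tokens)
    return max(1, math.ceil(tokens_per_minute * headroom / TOKENS_PER_MINUTE_PER_MU))


# Roughly matches the section 3.3 plan: ~100 msgs/min routed to Sonnet at peak,
# ~600 input + 200 output tokens each -> ~80K tokens/min -> 2 model units.
print(estimate_model_units(100, 600, 200))  # -> 2
```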
3.3 MangaAssist Provisioned Throughput Plan
| Time Window | Model | Model Units | Estimated Tokens/min | Hourly Cost | Strategy |
|---|---|---|---|---|---|
| Peak (18:00–24:00 JST) | Sonnet | 2 | 80K | ~$46 | 1-month committed |
| Business (09:00–18:00) | Haiku | 1 | 40K | ~$5 | 1-month committed |
| Off-peak (00:00–09:00) | Haiku | 0 (on-demand) | <10K | Pay-per-token | No provisioning |
4. Pattern 3 — SageMaker Real-Time Endpoints
SageMaker endpoints provide full control over hardware, model artifacts, and serving infrastructure. This pattern suits organizations needing custom models, specific GPU types, or advanced deployment features like multi-model endpoints.
4.1 SageMaker Endpoint Architecture
"""
MangaAssist — SageMaker endpoint deployment pattern.
Hosts custom or fine-tuned models with auto-scaling.
"""
import json
import time
import logging
import boto3
import sagemaker
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer
from typing import Optional
logger = logging.getLogger(__name__)
class SageMakerFMDeployment:
"""
Manages SageMaker endpoint deployment for MangaAssist.
Supports custom model hosting with auto-scaling.
"""
def __init__(self, region: str = "ap-northeast-1"):
self.region = region
self.sm_client = boto3.client("sagemaker", region_name=region)
self.sm_runtime = boto3.client("sagemaker-runtime", region_name=region)
self.autoscaling = boto3.client(
"application-autoscaling", region_name=region
)
self.session = sagemaker.Session(
boto_session=boto3.Session(region_name=region)
)
        # Note: get_execution_role() only resolves a role inside SageMaker
        # notebooks/Studio; pass an explicit IAM role ARN when running elsewhere.
        self.role = sagemaker.get_execution_role()
def deploy_huggingface_model(
self,
model_id: str,
instance_type: str = "ml.g5.2xlarge",
instance_count: int = 1,
endpoint_name: str = "manga-assist-fm",
) -> str:
"""
Deploy a HuggingFace model to a SageMaker endpoint.
Uses the HuggingFace DLC (Deep Learning Container) for optimized inference.
"""
hub_config = {
"HF_MODEL_ID": model_id,
"HF_TASK": "text-generation",
"SM_NUM_GPUS": "1",
"MAX_INPUT_LENGTH": "4096",
"MAX_TOTAL_TOKENS": "8192",
"MAX_BATCH_TOTAL_TOKENS": "16384",
}
huggingface_model = HuggingFaceModel(
env=hub_config,
role=self.role,
transformers_version="4.37.0",
pytorch_version="2.1.0",
py_version="py310",
image_uri=self._get_tgi_image_uri(),
)
predictor = huggingface_model.deploy(
initial_instance_count=instance_count,
instance_type=instance_type,
endpoint_name=endpoint_name,
container_startup_health_check_timeout=600,
model_data_download_timeout=600,
)
logger.info(
"Model deployed",
extra={
"endpoint": endpoint_name,
"instance_type": instance_type,
"model_id": model_id,
},
)
return endpoint_name
def _get_tgi_image_uri(self) -> str:
"""Get the Text Generation Inference container image URI."""
return sagemaker.image_uris.retrieve(
framework="huggingface-llm",
region=self.region,
version="2.0.2",
image_scope="inference",
instance_type="ml.g5.2xlarge",
)
def configure_auto_scaling(
self,
endpoint_name: str,
variant_name: str = "AllTraffic",
min_capacity: int = 1,
max_capacity: int = 4,
target_invocations_per_instance: int = 10,
scale_in_cooldown: int = 300,
scale_out_cooldown: int = 60,
) -> None:
"""
Configure auto-scaling for a SageMaker endpoint.
Uses InvocationsPerInstance metric for scaling decisions.
"""
resource_id = (
f"endpoint/{endpoint_name}/variant/{variant_name}"
)
# Register scalable target
self.autoscaling.register_scalable_target(
ServiceNamespace="sagemaker",
ResourceId=resource_id,
ScalableDimension="sagemaker:variant:DesiredInstanceCount",
MinCapacity=min_capacity,
MaxCapacity=max_capacity,
)
# Configure target tracking scaling policy
self.autoscaling.put_scaling_policy(
PolicyName=f"{endpoint_name}-scaling-policy",
ServiceNamespace="sagemaker",
ResourceId=resource_id,
ScalableDimension="sagemaker:variant:DesiredInstanceCount",
PolicyType="TargetTrackingScaling",
TargetTrackingScalingPolicyConfiguration={
"TargetValue": float(target_invocations_per_instance),
"CustomizedMetricSpecification": {
"MetricName": "InvocationsPerInstance",
"Namespace": "AWS/SageMaker",
"Dimensions": [
{"Name": "EndpointName", "Value": endpoint_name},
{"Name": "VariantName", "Value": variant_name},
],
"Statistic": "Average",
},
"ScaleInCooldown": scale_in_cooldown,
"ScaleOutCooldown": scale_out_cooldown,
},
)
logger.info(
"Auto-scaling configured",
extra={
"endpoint": endpoint_name,
"min": min_capacity,
"max": max_capacity,
"target_invocations": target_invocations_per_instance,
},
)
def invoke_endpoint(
self,
endpoint_name: str,
prompt: str,
max_new_tokens: int = 512,
temperature: float = 0.7,
) -> dict:
"""Invoke a SageMaker endpoint for text generation."""
start = time.time()
payload = {
"inputs": prompt,
"parameters": {
"max_new_tokens": max_new_tokens,
"temperature": temperature,
"do_sample": True,
"top_p": 0.9,
},
}
response = self.sm_runtime.invoke_endpoint(
EndpointName=endpoint_name,
ContentType="application/json",
Accept="application/json",
Body=json.dumps(payload),
)
result = json.loads(response["Body"].read().decode())
elapsed = time.time() - start
return {
"generated_text": result[0]["generated_text"],
"latency_ms": round(elapsed * 1000, 1),
"source": "sagemaker",
}
def get_endpoint_metrics(self, endpoint_name: str) -> dict:
"""Retrieve endpoint performance metrics from CloudWatch."""
cw = boto3.client("cloudwatch", region_name=self.region)
metrics = {}
for metric_name in [
"Invocations",
"ModelLatency",
"OverheadLatency",
"InvocationsPerInstance",
]:
response = cw.get_metric_statistics(
Namespace="AWS/SageMaker",
MetricName=metric_name,
Dimensions=[
{"Name": "EndpointName", "Value": endpoint_name},
{"Name": "VariantName", "Value": "AllTraffic"},
],
StartTime=time.time() - 3600,
EndTime=time.time(),
Period=300,
Statistics=["Average", "Maximum", "Sum"],
)
datapoints = response.get("Datapoints", [])
if datapoints:
latest = sorted(datapoints, key=lambda x: x["Timestamp"])[-1]
metrics[metric_name] = {
"average": latest.get("Average"),
"maximum": latest.get("Maximum"),
"sum": latest.get("Sum"),
}
return metrics
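A minimal end-to-end usage sketch for the class above. The HuggingFace model ID is an example Japanese 7B model, not the MangaAssist production model, and the g5 instance bills hourly from deployment.

```python
# Illustrative end-to-end usage of SageMakerFMDeployment.
deployment = SageMakerFMDeployment(region="ap-northeast-1")

endpoint = deployment.deploy_huggingface_model(
    model_id="elyza/ELYZA-japanese-Llama-2-7b-instruct",  # example choice, assumed
    instance_type="ml.g5.2xlarge",
    endpoint_name="manga-assist-fm",
)
deployment.configure_auto_scaling(endpoint, min_capacity=1, max_capacity=4)

answer = deployment.invoke_endpoint(
    endpoint,
    prompt="ユーザー: 進撃の巨人が好きです。似た作品を教えてください。\nアシスタント:",
    max_new_tokens=256,
)
print(answer["latency_ms"], "ms")
print(answer["generated_text"][:200])
```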
4.2 SageMaker Instance Selection for FM Workloads
| Instance Type | GPU | GPU Memory | vCPUs | RAM | Cost/hr (Tokyo) | Best For |
|---|---|---|---|---|---|---|
| ml.g5.xlarge | 1x A10G | 24 GB | 4 | 16 GB | ~$1.41 | Small models (<7B) |
| ml.g5.2xlarge | 1x A10G | 24 GB | 8 | 32 GB | ~$1.89 | Medium models (7–13B) |
| ml.g5.12xlarge | 4x A10G | 96 GB | 48 | 192 GB | ~$7.09 | Large models (13–30B) |
| ml.p4d.24xlarge | 8x A100 | 320 GB | 96 | 1152 GB | ~$40.97 | Very large models (70B+) |
| ml.inf2.xlarge | 1x Inferentia2 | 32 GB (accelerator) | 4 | 16 GB | ~$0.99 | Cost-optimized inference |
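The "Best For" column follows a rough rule of thumb: model weights need roughly parameters times bytes-per-parameter of accelerator memory, plus headroom for the KV cache and activations. The sketch below is a coarse estimate under those assumptions, not a sizing guarantee.

```python
# Coarse GPU-memory estimate behind the "Best For" column. Real usage also
# depends on sequence length, batch size, and the serving stack (e.g., TGI paging).
def estimate_gpu_memory_gb(
    params_billion: float,
    bytes_per_param: float = 2.0,   # fp16/bf16 weights; ~1.0 for int8, ~0.5 for 4-bit
    overhead_factor: float = 1.3,   # KV cache, activations, CUDA context (assumed)
) -> float:
    return params_billion * bytes_per_param * overhead_factor


for size in (7, 13, 30, 70):
    print(f"{size}B params -> ~{estimate_gpu_memory_gb(size):.0f} GB accelerator memory (fp16)")
# 7B  -> ~18 GB  (fits ml.g5.2xlarge, 24 GB)
# 13B -> ~34 GB  (ml.g5.12xlarge or quantization)
# 70B -> ~182 GB (multi-GPU, e.g., ml.p4d.24xlarge)
```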
5. Hybrid Architecture — MangaAssist Production Design
5.1 Traffic-Based Routing
graph TB
API[API Gateway WebSocket] --> Router[Traffic Router<br/>ECS Fargate]
Router --> |"Peak hours<br/>18:00–24:00 JST"| PT[Bedrock Provisioned<br/>Sonnet 2 MU]
Router --> |"Business hours<br/>Complex queries"| OD_S[Bedrock On-Demand<br/>Sonnet]
Router --> |"Simple queries<br/>Any time"| OD_H[Bedrock On-Demand<br/>Haiku]
Router --> |"Custom model<br/>Manga-specific"| SM[SageMaker Endpoint<br/>Fine-tuned 7B]
PT --> |Fallback| OD_S
SM --> |Fallback| OD_H
PT --> Resp[Response<br/>Aggregator]
OD_S --> Resp
OD_H --> Resp
SM --> Resp
Resp --> API
style Router fill:#fff9c4
style PT fill:#c8e6c9
style OD_S fill:#e3f2fd
style OD_H fill:#e3f2fd
style SM fill:#fce4ec
5.2 Hybrid Router Implementation
"""
MangaAssist — Hybrid deployment router.
Routes requests to the optimal deployment pattern based on
time-of-day, query complexity, and current capacity.
"""
import json
import time
import logging
from datetime import datetime, timezone, timedelta
from enum import Enum
from dataclasses import dataclass
from typing import Optional
logger = logging.getLogger(__name__)
JST = timezone(timedelta(hours=9))
class DeploymentTarget(Enum):
"""Available deployment targets."""
BEDROCK_PROVISIONED_SONNET = "provisioned-sonnet"
BEDROCK_ONDEMAND_SONNET = "ondemand-sonnet"
BEDROCK_ONDEMAND_HAIKU = "ondemand-haiku"
SAGEMAKER_CUSTOM = "sagemaker-custom"
@dataclass
class RoutingDecision:
"""Result of the routing decision process."""
target: DeploymentTarget
reason: str
estimated_latency_ms: int
estimated_cost_per_request: float
fallback: Optional[DeploymentTarget] = None
class HybridDeploymentRouter:
"""
Routes MangaAssist requests to the optimal deployment pattern.
Routing logic:
1. Peak hours (18:00-24:00 JST) -> Provisioned Sonnet (committed capacity)
2. Complex queries -> On-demand Sonnet (quality-first)
3. Simple queries -> On-demand Haiku (cost-optimized)
4. Manga-specific queries -> SageMaker custom model (specialized)
"""
PEAK_START_HOUR = 18
PEAK_END_HOUR = 24
COMPLEXITY_THRESHOLD = 0.7
MANGA_KEYWORDS = [
"recommend", "similar", "genre", "author", "series",
"rating", "review", "chapter", "volume", "latest",
]
def __init__(self):
self._daily_cost = 0.0
self._daily_budget = 500.0 # USD per day
self._request_count = 0
def route(
self,
query: str,
complexity_score: float,
session_context: Optional[dict] = None,
) -> RoutingDecision:
"""
Determine the optimal deployment target for a request.
"""
now = datetime.now(JST)
is_peak = self.PEAK_START_HOUR <= now.hour < self.PEAK_END_HOUR
# Budget guard: force Haiku if budget is nearly exhausted
if self._daily_cost >= self._daily_budget * 0.95:
logger.warning("Daily budget nearly exhausted, forcing Haiku")
return RoutingDecision(
target=DeploymentTarget.BEDROCK_ONDEMAND_HAIKU,
reason="budget_guard",
estimated_latency_ms=800,
estimated_cost_per_request=0.0005,
)
# Manga-specific query -> SageMaker custom model
if self._is_manga_specific(query):
return RoutingDecision(
target=DeploymentTarget.SAGEMAKER_CUSTOM,
reason="manga_specific_query",
estimated_latency_ms=600,
estimated_cost_per_request=0.001,
fallback=DeploymentTarget.BEDROCK_ONDEMAND_HAIKU,
)
# Peak hours + complex -> Provisioned Sonnet
if is_peak and complexity_score >= self.COMPLEXITY_THRESHOLD:
return RoutingDecision(
target=DeploymentTarget.BEDROCK_PROVISIONED_SONNET,
reason="peak_complex",
estimated_latency_ms=500,
estimated_cost_per_request=0.003,
fallback=DeploymentTarget.BEDROCK_ONDEMAND_SONNET,
)
# Peak hours + simple -> Provisioned Sonnet (within capacity)
if is_peak:
return RoutingDecision(
target=DeploymentTarget.BEDROCK_PROVISIONED_SONNET,
reason="peak_simple",
estimated_latency_ms=400,
estimated_cost_per_request=0.002,
fallback=DeploymentTarget.BEDROCK_ONDEMAND_HAIKU,
)
# Off-peak + complex -> On-demand Sonnet
if complexity_score >= self.COMPLEXITY_THRESHOLD:
return RoutingDecision(
target=DeploymentTarget.BEDROCK_ONDEMAND_SONNET,
reason="offpeak_complex",
estimated_latency_ms=1200,
estimated_cost_per_request=0.005,
fallback=DeploymentTarget.BEDROCK_ONDEMAND_HAIKU,
)
# Default: On-demand Haiku (cheapest)
return RoutingDecision(
target=DeploymentTarget.BEDROCK_ONDEMAND_HAIKU,
reason="offpeak_simple",
estimated_latency_ms=800,
estimated_cost_per_request=0.0005,
)
def _is_manga_specific(self, query: str) -> bool:
"""Check if the query is manga-catalog-specific."""
query_lower = query.lower()
return any(kw in query_lower for kw in self.MANGA_KEYWORDS)
def record_cost(self, cost: float) -> None:
"""Record the cost of a completed request."""
self._daily_cost += cost
self._request_count += 1
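The router consumes a `complexity_score` in [0, 1] but this file does not define how it is produced. One lightweight heuristic is sketched below; the signals and weights are assumptions for illustration, not part of the MangaAssist design.

```python
# One possible complexity heuristic (illustrative; signals and weights are assumptions).
def score_complexity(query: str, history_turns: int = 0) -> float:
    """Return a rough complexity score in [0, 1] for routing decisions."""
    score = 0.0
    if len(query) > 200:                              # long, detailed questions
        score += 0.3
    if query.count("?") > 1:                          # multi-part questions
        score += 0.2
    if any(w in query.lower() for w in ("compare", "explain why", "difference", "比較", "理由")):
        score += 0.3
    if history_turns >= 4:                            # long conversations need more reasoning
        score += 0.2
    return min(score, 1.0)


query = "Compare Vinland Saga and Vagabond and explain why their ratings differ."
decision = HybridDeploymentRouter().route(query, complexity_score=score_complexity(query))
print(decision.target.value, decision.reason)  # manga keyword "rating" -> sagemaker-custom
```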
6. Cost Comparison Analysis
6.1 MangaAssist Monthly Cost Projection
| Scenario | Lambda + On-Demand | Bedrock Provisioned | SageMaker Endpoint | Hybrid |
|---|---|---|---|---|
| 1M msgs/day (simple) | $15,000 | $10,800 | $8,200 | $9,500 |
| 1M msgs/day (mixed) | $28,000 | $16,200 | $12,400 | $14,800 |
| 100K msgs/day (simple) | $1,500 | $10,800 | $8,200 | $2,100 |
| 100K msgs/day (mixed) | $2,800 | $10,800 | $8,200 | $3,200 |
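The on-demand columns follow directly from per-token pricing. The sketch below shows the arithmetic with assumed token counts and model mix; with roughly 600 input and 250 output tokens per message it lands in the same ballpark as the table's on-demand figures, which were modeled with their own assumptions.

```python
# Illustrative monthly on-demand cost estimate; token counts and model mix are assumptions.
PRICES_PER_1K = {  # USD per 1K tokens, from the Bedrock pricing in the context block
    "haiku":  {"in": 0.00025, "out": 0.00125},
    "sonnet": {"in": 0.003,   "out": 0.015},
}


def monthly_on_demand_cost(msgs_per_day: int, sonnet_share: float,
                           in_tokens: int = 600, out_tokens: int = 250) -> float:
    def per_msg(model: str) -> float:
        p = PRICES_PER_1K[model]
        return (in_tokens / 1000) * p["in"] + (out_tokens / 1000) * p["out"]

    blended = sonnet_share * per_msg("sonnet") + (1 - sonnet_share) * per_msg("haiku")
    return blended * msgs_per_day * 30


print(f"1M msgs/day, all Haiku:  ${monthly_on_demand_cost(1_000_000, 0.0):,.0f}/month")  # ~$13.9K
print(f"1M msgs/day, 10% Sonnet: ${monthly_on_demand_cost(1_000_000, 0.1):,.0f}/month")  # ~$29K
```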
6.2 Break-Even Analysis
graph LR
subgraph "Break-Even: On-Demand vs Provisioned"
A[Daily Volume] --> B{"> 300K tokens/hr<br/>sustained?"}
B -- Yes --> C[Provisioned Wins<br/>30-50% savings]
B -- No --> D{"> 8 hrs/day<br/>active?"}
D -- Yes --> E[Provisioned Wins<br/>if predictable]
D -- No --> F[On-Demand Wins<br/>pay only for use]
end
subgraph "Break-Even: Bedrock vs SageMaker"
G[Monthly Budget] --> H{"> $10K/month?"}
H -- Yes --> I{Need custom model?}
I -- Yes --> J[SageMaker Wins]
I -- No --> K[Bedrock Provisioned<br/>lower ops cost]
H -- No --> L[Bedrock On-Demand<br/>simplest option]
end
style C fill:#c8e6c9
style E fill:#c8e6c9
style F fill:#e3f2fd
style J fill:#fce4ec
style K fill:#c8e6c9
style L fill:#e3f2fd
7. Decision Framework
7.1 Pattern Selection Flowchart
flowchart TD
Start[New FM Deployment] --> Q1{Need custom model<br/>or specific GPU?}
Q1 -- Yes --> SM[SageMaker Endpoint]
Q1 -- No --> Q2{Predictable high<br/>throughput?}
Q2 -- Yes --> Q3{Can commit 1+ months?}
Q3 -- Yes --> PT[Bedrock Provisioned<br/>Throughput]
Q3 -- No --> Q4{Budget > $5K/mo?}
Q4 -- Yes --> PT_NC[Bedrock Provisioned<br/>No Commitment]
Q4 -- No --> OD[Lambda + Bedrock<br/>On-Demand]
Q2 -- No --> Q5{Bursty / unpredictable<br/>traffic?}
Q5 -- Yes --> OD
Q5 -- No --> Q6{Multiple models<br/>needed?}
Q6 -- Yes --> HY[Hybrid Architecture]
Q6 -- No --> OD
SM --> Deploy[Deploy & Monitor]
PT --> Deploy
PT_NC --> Deploy
OD --> Deploy
HY --> Deploy
style Start fill:#e3f2fd
style SM fill:#fce4ec
style PT fill:#c8e6c9
style PT_NC fill:#c8e6c9
style OD fill:#fff9c4
style HY fill:#e1bee7
style Deploy fill:#f5f5f5
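The same flowchart expressed as a small function, for illustration; it mirrors the branches above rather than adding new criteria.

```python
# Illustrative translation of the selection flowchart into code.
def select_deployment_pattern(
    needs_custom_model_or_gpu: bool,
    predictable_high_throughput: bool,
    can_commit_one_month: bool,
    monthly_budget_usd: float,
    bursty_traffic: bool,
    multiple_models_needed: bool,
) -> str:
    if needs_custom_model_or_gpu:
        return "sagemaker-endpoint"
    if predictable_high_throughput:
        if can_commit_one_month:
            return "bedrock-provisioned-committed"
        if monthly_budget_usd > 5_000:
            return "bedrock-provisioned-no-commitment"
        return "lambda-bedrock-on-demand"
    if bursty_traffic:
        return "lambda-bedrock-on-demand"
    return "hybrid" if multiple_models_needed else "lambda-bedrock-on-demand"


# MangaAssist peak profile: predictable traffic, can commit, no fully custom model required.
print(select_deployment_pattern(False, True, True, 15_000, False, True))
# -> bedrock-provisioned-committed
```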
7.2 Decision Criteria Summary
| Criterion | Favors Lambda On-Demand | Favors Provisioned | Favors SageMaker |
|---|---|---|---|
| Traffic < 100K msgs/day | Strong | Weak | Weak |
| Traffic > 500K msgs/day | Weak | Strong | Medium |
| Latency < 500ms required | Weak | Medium | Strong |
| Custom model needed | Not possible | Not possible | Required |
| Zero idle cost | Strong | Not possible | Not possible |
| Ops team capacity: small | Strong | Strong | Weak |
| Fine-tuned Bedrock model | Not supported | Strong | N/A |
| Multi-model serving | Medium | Weak | Strong |
Key Takeaways
- Lambda on-demand is the starting point: zero idle cost, simplest ops, and automatic scaling; best for dev/test, bursty traffic, and workloads under 100K messages/day.
- Bedrock provisioned throughput reduces latency and cost at scale: once traffic is predictable and exceeds ~300K tokens/hour sustained, provisioned model units deliver 30–50% savings over on-demand with consistent sub-second latency.
- SageMaker endpoints provide full control: when you need custom models, specific GPU types, multi-model hosting, or advanced deployment features (A/B testing, shadow deployments), SageMaker is the only option despite higher operational complexity.
- Hybrid architecture is the production answer: MangaAssist routes peak-hour traffic to provisioned throughput, simple queries to Haiku on-demand, complex queries to Sonnet, and manga-specific queries to a SageMaker-hosted fine-tuned model.
- Cost optimization requires time-of-day awareness: provisioned throughput during peak hours (18:00–24:00 JST for a JP manga store) combined with on-demand during off-peak eliminates wasted committed capacity.
- Always implement fallback paths: every deployment pattern should fall back to a simpler, cheaper option (e.g., provisioned -> on-demand Sonnet -> on-demand Haiku) to maintain availability when capacity is exhausted.
- Break-even analysis drives the decision: the crossover from on-demand to provisioned occurs around 300K tokens/hour sustained; the crossover from Bedrock to SageMaker occurs when custom model requirements or >$10K/month budgets make infrastructure control worthwhile.