Task 2.2: Implement Model Deployment Strategies
Overview
This task covers deploying foundation models using the right compute, container strategies, and optimization approaches for production workloads.
Skill 2.2.1: Deploy FMs Based on Application Needs and Performance Requirements
Core Concepts
- Lambda for On-Demand: Serverless FM invocation for sporadic, event-driven workloads
- Bedrock Provisioned Throughput: Reserved model capacity for predictable, high-volume workloads
- SageMaker Endpoints: Custom model hosting with full infrastructure control
- Hybrid Solutions: Combining multiple deployment strategies for different traffic patterns
User Story 9: Multi-Tier FM Deployment for Content Platform
As a content platform CTO, I want FM deployments that match each use case's latency, throughput, and cost requirements, So that we optimize the $200K/month AI infrastructure budget across 5 different AI features.
Deep Dive Scenario
Company: ContentAI - generates, summarizes, and moderates 10M pieces of content/day
Problem: A single deployment strategy doesn't fit all use cases: some features need instant responses, some can batch, some have predictable traffic, and some are spiky.
Deployment Decision Matrix:
| Feature | Pattern | QPS | Latency Req | Deployment | Cost Model |
|---|---|---|---|---|---|
| Content moderation | Real-time | 500 constant | <200ms | Bedrock Provisioned | Reserved capacity |
| Article summarization | Batch | 50K/day batched | <5 min | Lambda + Bedrock On-Demand | Pay-per-use |
| Chatbot responses | Interactive | 10-1000 (spiky) | <2s | Lambda + Bedrock On-Demand | Pay-per-use |
| Image captioning | Real-time | 100 constant | <1s | SageMaker Endpoint | Instance-based |
| Custom fine-tuned model | Real-time | 200 constant | <500ms | SageMaker Endpoint | Instance-based |
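One way an orchestration layer might encode this matrix is a plain routing table; a minimal sketch (the feature keys and target names mirror the matrix above, but the dict structure itself is this example's assumption):
# Hypothetical routing table derived from the decision matrix above
DEPLOYMENT_ROUTES = {
    "content_moderation":    {"target": "bedrock_provisioned",      "max_latency_ms": 200},
    "article_summarization": {"target": "lambda_bedrock_on_demand", "max_latency_ms": 300_000},
    "chatbot":               {"target": "lambda_bedrock_on_demand", "max_latency_ms": 2_000},
    "image_captioning":      {"target": "sagemaker_endpoint",       "max_latency_ms": 1_000},
    "custom_fine_tuned":     {"target": "sagemaker_endpoint",       "max_latency_ms": 500},
}

def route(feature: str) -> str:
    """Return the deployment target for a given AI feature."""
    return DEPLOYMENT_ROUTES[feature]["target"]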
1. Lambda for On-Demand Invocation (chatbot, sporadic usage):
# Lambda function for chatbot - scales to zero when idle
import boto3
import json
bedrock_runtime = boto3.client('bedrock-runtime')
def lambda_handler(event, context):
"""On-demand FM invocation via Lambda.
    Benefits:
    - Scales to zero (no cost when idle)
    - Auto-scales with demand
    - No infrastructure management
    Limitations:
    - Max 15-minute execution time
    - Cold start latency (100-500ms)
    - 10GB memory limit
    - No GPU access
    - 6MB response payload limit
    """
    # API Gateway proxy integrations deliver body as a JSON string; direct invocations may pass a dict
    body = event["body"] if isinstance(event["body"], dict) else json.loads(event["body"])
    user_message = body["message"]
    conversation_history = body.get("history", [])
response = bedrock_runtime.invoke_model(
modelId="anthropic.claude-sonnet-4-20250514",
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 1024,
"messages": conversation_history + [
{"role": "user", "content": user_message}
],
"system": "You are a helpful content assistant."
}),
contentType="application/json"
)
result = json.loads(response['body'].read())
return {
"statusCode": 200,
"body": json.dumps({
"response": result["content"][0]["text"],
"usage": result["usage"] # Track token consumption
})
}
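For a quick local test, a sketch of the event shape this handler expects (the message/history field names follow the handler above; an API Gateway proxy integration delivers the body as a JSON string):
# Hypothetical test event mimicking an API Gateway proxy payload
test_event = {
    "body": json.dumps({
        "message": "Summarize today's top articles for me.",
        "history": [
            {"role": "user", "content": "Hi!"},
            {"role": "assistant", "content": "Hello! How can I help?"}
        ]
    })
}
print(lambda_handler(test_event, context=None))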
2. Bedrock Provisioned Throughput (content moderation, high constant volume):
# Provisioned throughput for predictable, high-volume workloads
import boto3
import json

bedrock = boto3.client('bedrock')                  # control plane: manage provisioned throughput
bedrock_runtime = boto3.client('bedrock-runtime')  # data plane: invoke models
# Step 1: Create provisioned throughput
provisioned = bedrock.create_provisioned_model_throughput(
modelId="anthropic.claude-haiku-4-5-20251001",
provisionedModelName="content-moderation-haiku",
modelUnits=2, # Each unit = specific token throughput
commitmentDuration="SixMonths", # Discounted pricing
tags=[
{"key": "Team", "value": "ContentSafety"},
{"key": "UseCase", "value": "Moderation"}
]
)
# Step 2: Use the provisioned model ARN for invocations
provisioned_arn = provisioned["provisionedModelArn"]
def moderate_content(content_text):
"""High-throughput content moderation using provisioned capacity.
Benefits:
- Guaranteed throughput (no throttling)
- Consistent latency under load
- Cost-effective for high, predictable volume
- No cold starts
Considerations:
- Pay regardless of usage (committed capacity)
- Need to estimate throughput accurately
    - No-commitment (hourly), 1-month, or 6-month commitment options
"""
response = bedrock_runtime.invoke_model(
modelId=provisioned_arn, # Use provisioned ARN, not base model ID
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 256,
"messages": [{
"role": "user",
"content": f"Classify this content as SAFE, WARN, or BLOCK. "
f"Respond with only the classification and a brief reason.\n\n"
f"Content: {content_text}"
}]
})
)
return json.loads(response['body'].read())
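Provisioned capacity is not usable the moment the create call returns; a minimal polling sketch using the Bedrock control-plane API (the 30-second interval is an arbitrary choice):
import time

def wait_until_in_service(arn, poll_seconds=30):
    """Block until the provisioned model is ready (or fails)."""
    while True:
        status = bedrock.get_provisioned_model_throughput(
            provisionedModelId=arn
        )["status"]  # Creating | InService | Updating | Failed
        if status == "InService":
            return
        if status == "Failed":
            raise RuntimeError("Provisioned throughput creation failed")
        time.sleep(poll_seconds)

wait_until_in_service(provisioned_arn)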
3. SageMaker AI Endpoint for Custom/Hybrid Solutions:
# SageMaker endpoint for custom fine-tuned model
import boto3
import sagemaker
from sagemaker.huggingface import HuggingFaceModel
# Deploy custom fine-tuned model
huggingface_model = HuggingFaceModel(
model_data="s3://my-models/custom-caption-model/model.tar.gz",
role="arn:aws:iam::123456789:role/SageMakerRole",
transformers_version="4.37",
pytorch_version="2.1",
py_version="py310",
env={
"HF_TASK": "image-to-text",
"MAX_BATCH_SIZE": "8",
"MODEL_CACHE_ROOT": "/opt/ml/model"
}
)
# Deploy with auto-scaling
predictor = huggingface_model.deploy(
initial_instance_count=2,
instance_type="ml.g5.2xlarge", # GPU instance for vision model
endpoint_name="image-captioning-endpoint",
wait=True
)
# Configure auto-scaling
autoscaling_client = boto3.client('application-autoscaling')
autoscaling_client.register_scalable_target(
ServiceNamespace='sagemaker',
    ResourceId='endpoint/image-captioning-endpoint/variant/AllTraffic',
ScalableDimension='sagemaker:variant:DesiredInstanceCount',
MinCapacity=2,
MaxCapacity=10
)
# Scale based on invocations per instance
autoscaling_client.put_scaling_policy(
PolicyName='image-captioning-scaling',
ServiceNamespace='sagemaker',
    ResourceId='endpoint/image-captioning-endpoint/variant/AllTraffic',
ScalableDimension='sagemaker:variant:DesiredInstanceCount',
PolicyType='TargetTrackingScaling',
TargetTrackingScalingPolicyConfiguration={
'TargetValue': 100.0, # 100 invocations per instance
'PredefinedMetricSpecification': {
'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
},
'ScaleInCooldown': 300,
'ScaleOutCooldown': 60
}
)
Deployment Strategy Decision Tree:
Is the model a standard Bedrock model?
├── YES: Is traffic predictable and high volume (>100 QPS steady)?
│ ├── YES: Use Bedrock Provisioned Throughput
│ └── NO: Is it event-driven or sporadic?
│ ├── YES: Use Lambda + Bedrock On-Demand
│ └── NO: Use Bedrock On-Demand (standard)
└── NO: Is it a custom/fine-tuned model?
├── YES: Does it need GPU?
│ ├── YES: Use SageMaker Endpoint (GPU instances)
│ └── NO: Use SageMaker Serverless Inference
└── NO: Hybrid approach
├── Lambda for API/orchestration layer
├── Bedrock for foundation model calls
└── SageMaker for custom model inference
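The serverless branch above has no example in this section; a minimal sketch of SageMaker Serverless Inference for a small, CPU-friendly custom model (paths, role, and framework versions are placeholders):
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.serverless import ServerlessInferenceConfig

model = HuggingFaceModel(
    model_data="s3://my-models/small-classifier/model.tar.gz",  # placeholder
    role="arn:aws:iam::123456789:role/SageMakerRole",
    transformers_version="4.37",
    pytorch_version="2.1",
    py_version="py310"
)

# Serverless endpoints scale to zero and bill per inference, but are CPU-only,
# which is why the decision tree reserves them for models that don't need a GPU
predictor = model.deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=4096,  # 1024-6144 MB
        max_concurrency=10
    )
)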
Skill 2.2.2: Deploy FM Solutions Addressing Unique LLM Challenges
Core Concepts
- Container-Based Deployment: Docker containers optimized for LLM memory, GPU, and token processing
- Model Loading Strategies: Pre-loading, lazy loading, model sharding across GPUs
- Memory Requirements: FP16 weights take ~2 bytes per parameter, and KV cache plus activations push total GPU memory well beyond the raw weight size
- Token Processing Capacity: Throughput measured in tokens/second, not just requests/second
User Story 10: Deploying a 70B Parameter Model for Enterprise Use
As an ML platform engineer, I want to deploy a 70B parameter model in production with optimal resource utilization, So that inference costs stay under $0.02/request while maintaining <3s latency at p99.
Deep Dive Scenario
Company: FinanceAI - deploying Llama-3 70B fine-tuned for financial analysis
Problem: 70B model requires ~140GB GPU memory (at FP16). Single GPU max is 80GB (A100). Need multi-GPU sharding and optimized container deployment.
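The arithmetic behind those numbers as a small sketch (the 20% headroom for KV cache and activations is a rough assumption, not a fixed rule):
BYTES_PER_PARAM = {"fp16": 2, "int8": 1, "int4": 0.5}

def estimate_gpu_memory_gb(params_billions, precision, overhead=1.2):
    """Weight footprint plus rough headroom for KV cache and activations."""
    return params_billions * BYTES_PER_PARAM[precision] * overhead

# 70B at FP16: 70 x 2 = 140GB of weights alone -> must shard across A100 80GB GPUs
for precision in ("fp16", "int8", "int4"):
    print(precision, round(estimate_gpu_memory_gb(70, precision)), "GB")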
Container-Based Deployment Pattern:
# Dockerfile optimized for LLM inference
FROM nvidia/cuda:12.1.0-base-ubuntu22.04
# Base CUDA image ships without Python; install it plus curl for the health check
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 python3-pip curl \
    && rm -rf /var/lib/apt/lists/*
# Pre-install inference framework
RUN pip3 install --no-cache-dir \
    vllm==0.3.0 \
    torch==2.1.2 \
    transformers==4.37.0 \
    accelerate==0.26.0
# Model loading optimization: SageMaker downloads and extracts model_data to
# /opt/ml/model before the container starts, so vLLM reads weights from local
# disk with no download at startup
ENV MODEL_ID="/opt/ml/model"
ENV HF_HOME="/opt/ml/model"
# Memory optimization settings
ENV CUDA_MEMORY_FRACTION=0.95
ENV MAX_MODEL_LEN=4096
ENV GPU_MEMORY_UTILIZATION=0.90
# Shard across 4 GPUs via tensor parallelism (ENV lines can't carry inline comments)
ENV TENSOR_PARALLEL_SIZE=4
# Token processing configuration
ENV MAX_BATCH_SIZE=32
ENV MAX_NUM_SEQS=256
ENV SWAP_SPACE_GB=4
# Health check for load balancer
HEALTHCHECK --interval=30s --timeout=10s \
CMD curl -f http://localhost:8000/health || exit 1
COPY serve.py /opt/ml/code/
ENTRYPOINT ["python", "/opt/ml/code/serve.py"]
Serving Script with LLM-Specific Optimizations:
# serve.py - vLLM server with production optimizations
from vllm import LLM, SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
import uvicorn
from fastapi import FastAPI
import os
import uuid
app = FastAPI()
# Model loading strategy: Pre-load at container start
# This adds to startup time but eliminates first-request latency
engine_args = AsyncEngineArgs(
model=os.environ["MODEL_ID"],
# GPU Memory Management
tensor_parallel_size=int(os.environ.get("TENSOR_PARALLEL_SIZE", 4)),
gpu_memory_utilization=float(os.environ.get("GPU_MEMORY_UTILIZATION", 0.90)),
max_model_len=int(os.environ.get("MAX_MODEL_LEN", 4096)),
# Token Processing Optimization
max_num_batched_tokens=32768, # Max tokens processed in one batch
max_num_seqs=256, # Max concurrent sequences
    # Quantization for memory reduction (70B: FP16 = 140GB, INT8 = 70GB, INT4 = ~35GB)
    quantization="awq",  # 4-bit AWQ; requires an AWQ-quantized checkpoint
# KV Cache optimization
block_size=16, # Token block size for KV cache
swap_space=4, # GB of CPU swap for KV cache overflow
# Advanced scheduling
enable_prefix_caching=True, # Cache common prompt prefixes
disable_log_stats=False # Enable for monitoring
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
@app.post("/v1/completions")
async def generate(request: dict):
"""Handle inference request with token-aware processing."""
prompt = request["prompt"]
sampling_params = SamplingParams(
max_tokens=request.get("max_tokens", 512),
temperature=request.get("temperature", 0.7),
top_p=request.get("top_p", 0.9)
)
results = []
async for output in engine.generate(prompt, sampling_params, request_id=str(uuid.uuid4())):
results.append(output)
final = results[-1]
return {
"text": final.outputs[0].text,
"usage": {
"prompt_tokens": len(final.prompt_token_ids),
"completion_tokens": len(final.outputs[0].token_ids),
"total_tokens": len(final.prompt_token_ids) + len(final.outputs[0].token_ids)
}
}
@app.get("/health")
async def health():
return {"status": "healthy", "model_loaded": engine.is_running}
SageMaker Deployment Configuration:
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

sagemaker_role = sagemaker.get_execution_role()  # or an explicit IAM role ARN
# Deploy 70B model across multi-GPU instances
model = HuggingFaceModel(
model_data="s3://models/llama3-70b-finance-ft/model.tar.gz",
role=sagemaker_role,
image_uri="123456789.dkr.ecr.us-east-1.amazonaws.com/llm-inference:latest",
env={
"TENSOR_PARALLEL_SIZE": "4",
"QUANTIZATION": "awq",
"MAX_MODEL_LEN": "4096",
"GPU_MEMORY_UTILIZATION": "0.90"
}
)
predictor = model.deploy(
initial_instance_count=2,
instance_type="ml.p4d.24xlarge", # 8x A100 80GB GPUs
container_startup_health_check_timeout=600, # 10 min for model loading
model_data_download_timeout=1200, # 20 min for large model download
volume_size=500, # GB of EBS for model storage
endpoint_name="llama3-70b-finance"
)
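Invoking the deployed endpoint via the SageMaker runtime API (the payload shape assumes the custom container routes invocations to the serve.py handler above):
import boto3
import json

runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="llama3-70b-finance",
    ContentType="application/json",
    Body=json.dumps({
        "prompt": "Assess the liquidity risk signals in this balance sheet: ...",
        "max_tokens": 512
    })
)
print(json.loads(response["Body"].read()))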
LLM vs Traditional ML Deployment Comparison:
| Aspect | Traditional ML | LLM Deployment |
|---|---|---|
| Model Size | MBs to low GBs | 10s to 100s of GBs |
| Memory | CPU RAM sufficient | GPU VRAM required |
| GPU | Optional | Essential for reasonable latency |
| Scaling Unit | Request/second | Tokens/second |
| Cold Start | Seconds | Minutes (model loading) |
| Batching | Simple request batching | Continuous batching (token-level) |
| Caching | Response caching | KV cache + prefix caching |
| Quantization | Rarely needed | Critical for cost optimization |
| Multi-GPU | Rarely needed | Required for large models |
| Health Check | Quick endpoint check | Model-loaded verification |
Model Loading Strategies:
| Strategy | Description | Best For |
|---|---|---|
| Pre-load at startup | Load full model into GPU on container start | Production endpoints with consistent traffic |
| Lazy loading | Load model on first request | Dev/test, rarely-used models |
| Model sharding | Split model across multiple GPUs (tensor parallel) | Models > single GPU memory |
| Quantization | Reduce precision (FP16->INT8->INT4) | Cost reduction, fitting on fewer GPUs |
| Model caching | Cache model weights on local NVMe | Faster restarts after container recycling |
| Pipeline parallel | Split model layers across GPUs sequentially | Very large models (100B+) |
Skill 2.2.3: Develop Optimized FM Deployment Approaches
User Story 11: Cost-Optimized AI Platform with Model Cascading
As a VP of Engineering, I want an AI platform that uses the cheapest model that can handle each request, So that we reduce FM costs by 60% without impacting user experience.
Deep Dive Scenario
Company: SupportAI - AI customer support handling 100K conversations/day
Problem: Using Claude Sonnet for everything costs $150K/month. 70% of queries are simple (password reset, order status) that a smaller model handles fine.
Model Cascading Architecture:
[User Query]
|
v
[Complexity Classifier] (Haiku - fast, cheap)
|
|--- Simple (70%): Route to Haiku ($0.0001/query)
| |--- "What's my order status?"
| |--- "How do I reset my password?"
| |--- Simple FAQ lookups
|
|--- Medium (20%): Route to Sonnet ($0.003/query)
| |--- "Compare these two plans for my usage pattern"
| |--- "Help me troubleshoot my connection issues"
| |--- Multi-step reasoning needed
|
|--- Complex (10%): Route to Opus ($0.015/query)
| |--- "Analyze my 12-month usage and suggest optimization"
| |--- "Draft a custom enterprise agreement"
| |--- Deep analysis and creative tasks
|
v
[Quality Check] (verify response meets threshold)
|--- If below quality threshold: Escalate to next tier
|--- If adequate: Return response
Implementation:
import boto3
import json
bedrock_runtime = boto3.client('bedrock-runtime')
class ModelCascade:
"""API-based model cascading for cost optimization."""
TIERS = [
{
"name": "haiku",
"model_id": "anthropic.claude-haiku-4-5-20251001",
"cost_per_1k_input": 0.001,
"cost_per_1k_output": 0.005,
"max_complexity": "simple",
"quality_threshold": 0.85
},
{
"name": "sonnet",
"model_id": "anthropic.claude-sonnet-4-20250514",
"cost_per_1k_input": 0.003,
"cost_per_1k_output": 0.015,
"max_complexity": "medium",
"quality_threshold": 0.90
},
{
"name": "opus",
"model_id": "anthropic.claude-opus-4-20250514",
"cost_per_1k_input": 0.015,
"cost_per_1k_output": 0.075,
"max_complexity": "complex",
"quality_threshold": 0.95
}
]
def classify_complexity(self, query: str) -> str:
"""Use smallest model to classify query complexity."""
response = bedrock_runtime.invoke_model(
modelId="anthropic.claude-haiku-4-5-20251001",
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 10,
"messages": [{
"role": "user",
"content": f"Classify this query as SIMPLE, MEDIUM, or COMPLEX. "
f"Reply with only one word.\n\nQuery: {query}"
}]
})
)
result = json.loads(response['body'].read())
return result["content"][0]["text"].strip().lower()
def invoke_with_cascade(self, query: str, system_prompt: str) -> dict:
"""Invoke the optimal model tier with automatic escalation."""
complexity = self.classify_complexity(query)
tier_index = {"simple": 0, "medium": 1, "complex": 2}.get(complexity, 1)
for tier in self.TIERS[tier_index:]:
response = bedrock_runtime.invoke_model(
modelId=tier["model_id"],
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 1024,
"system": system_prompt,
"messages": [{"role": "user", "content": query}]
})
)
result = json.loads(response['body'].read())
answer = result["content"][0]["text"]
# Quality check: Does the response meet our threshold?
quality = self.assess_quality(query, answer)
if quality >= tier["quality_threshold"]:
return {
"response": answer,
"model_used": tier["name"],
"complexity": complexity,
"quality_score": quality,
"cost_estimate": self.estimate_cost(tier, result["usage"])
}
            # Quality too low: cascade to the next tier
        # All tiers exhausted: return the last (most capable) tier's response
        return {"response": answer, "model_used": self.TIERS[-1]["name"], "escalated": True}
def assess_quality(self, query: str, response: str) -> float:
"""Quick quality assessment using Haiku as judge."""
judge_response = bedrock_runtime.invoke_model(
modelId="anthropic.claude-haiku-4-5-20251001",
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 10,
"messages": [{
"role": "user",
"content": f"Rate 0.0-1.0 how well this response answers the query. "
f"Reply with just the number.\n\nQuery: {query}\n\nResponse: {response}"
}]
})
)
        result = json.loads(judge_response['body'].read())
        try:
            return float(result["content"][0]["text"].strip())
        except ValueError:
            return 0.0  # unparseable judge output counts as failing quality
def estimate_cost(self, tier, usage):
input_cost = (usage["input_tokens"] / 1000) * tier["cost_per_1k_input"]
output_cost = (usage["output_tokens"] / 1000) * tier["cost_per_1k_output"]
return input_cost + output_cost
# Usage
cascade = ModelCascade()
system_prompt = "You are a helpful customer support assistant for SupportAI."  # example prompt
# Simple query -> Haiku handles it (~$0.0001)
result = cascade.invoke_with_cascade("What's my order status for #12345?", system_prompt)
# result: {"model_used": "haiku", "cost_estimate": 0.0001, ...}
# Harder query -> routed to Sonnet (~$0.003)
result = cascade.invoke_with_cascade("Analyze my usage patterns and recommend the optimal plan", system_prompt)
# result: {"model_used": "sonnet", "cost_estimate": 0.003, ...}
Cost Savings Analysis (illustrative per-query rates, assuming short single-turn queries):
Before (all Sonnet):
100K queries × $0.003/query = $300/day = $9,000/month
After (cascading):
70K simple × $0.0001 = $7/day
20K medium × $0.003 = $60/day
10K complex × $0.015 = $150/day
Classification overhead: 100K × $0.00005 = $5/day
Total: $222/day = $6,660/month
Savings: 26% ($2,340/month)
With accurate routing optimization: up to 60% savings
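The same arithmetic as a checkable sketch (volumes and per-query rates copied from the analysis above):
QUERIES_PER_DAY = 100_000
baseline = QUERIES_PER_DAY * 0.003               # all-Sonnet: $300/day

cascaded = (
    70_000 * 0.0001                              # simple -> Haiku: $7/day
    + 20_000 * 0.003                             # medium -> Sonnet: $60/day
    + 10_000 * 0.015                             # complex -> Opus: $150/day
    + QUERIES_PER_DAY * 0.00005                  # classifier overhead: $5/day
)                                                # total: $222/day

savings = (baseline - cascaded) / baseline * 100
print(f"${cascaded:.0f}/day vs ${baseline:.0f}/day -> {savings:.0f}% savings")  # ~26%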
Exam-Relevant Points:
- Model cascading routes queries to the cheapest capable model
- Smaller pre-trained models (Haiku) handle routine queries at roughly 10x lower cost
- Quality-check gates ensure adequate response quality before returning
- Complexity classification itself should use the cheapest model
- Bedrock Provisioned Throughput suits high-volume, predictable workloads
- Lambda suits sporadic, event-driven FM invocations
- SageMaker suits custom models needing GPUs or full infrastructure control
- Quantization (INT8/INT4) dramatically reduces GPU memory requirements
- Tensor parallelism splits large models across multiple GPUs
- LLM containers need extended health-check timeouts for model loading
- KV cache and prefix caching are LLM-specific performance optimizations