Task 2.2: Implement Model Deployment Strategies
Overview
This task covers deploying foundation models using the right compute, container strategies, and optimization approaches for production workloads.
Skill 2.2.1: Deploy FMs Based on Application Needs and Performance Requirements
Core Concepts
- Lambda for On-Demand: Serverless FM invocation for sporadic, event-driven workloads
- Bedrock Provisioned Throughput: Reserved model capacity for predictable, high-volume workloads
- SageMaker Endpoints: Custom model hosting with full infrastructure control
- Hybrid Solutions: Combining multiple deployment strategies for different traffic patterns
User Story 9: Multi-Tier FM Deployment for Content Platform
As a content platform CTO, I want FM deployments that match each use case's latency, throughput, and cost requirements, So that we optimize the $200K/month AI infrastructure budget across 5 different AI features.
Deep Dive Scenario
Company: ContentAI - generates, summarizes, and moderates 10M pieces of content/day
Problem: A single deployment strategy doesn't fit all use cases: some features need instant responses, some can batch, some have predictable traffic, and some are spiky.
Deployment Decision Matrix:
| Feature | Pattern | QPS | Latency Req | Deployment | Cost Model |
|---|---|---|---|---|---|
| Content moderation | Real-time | 500 constant | <200ms | Bedrock Provisioned | Reserved capacity |
| Article summarization | Batch | 50K/day batched | <5 min | Lambda + Bedrock On-Demand | Pay-per-use |
| Chatbot responses | Interactive | 10-1000 (spiky) | <2s | Lambda + Bedrock On-Demand | Pay-per-use |
| Image captioning | Real-time | 100 constant | <1s | SageMaker Endpoint | Instance-based |
| Custom fine-tuned model | Real-time | 200 constant | <500ms | SageMaker Endpoint | Instance-based |
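One way an orchestration layer might encode this matrix is a plain routing table; a minimal sketch (the feature keys and target names mirror the matrix above, but the dict structure itself is this example's assumption):
# Hypothetical routing table derived from the decision matrix above
DEPLOYMENT_ROUTES = {
    "content_moderation":    {"target": "bedrock_provisioned",      "max_latency_ms": 200},
    "article_summarization": {"target": "lambda_bedrock_on_demand", "max_latency_ms": 300_000},
    "chatbot":               {"target": "lambda_bedrock_on_demand", "max_latency_ms": 2_000},
    "image_captioning":      {"target": "sagemaker_endpoint",       "max_latency_ms": 1_000},
    "custom_fine_tuned":     {"target": "sagemaker_endpoint",       "max_latency_ms": 500},
}

def route(feature: str) -> str:
    """Return the deployment target for a given AI feature."""
    return DEPLOYMENT_ROUTES[feature]["target"]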
1. Lambda for On-Demand Invocation (chatbot, sporadic usage):
# Lambda function for chatbot - scales to zero when idle
import boto3
import json
bedrock_runtime = boto3.client('bedrock-runtime')
def lambda_handler(event, context):
"""On-demand FM invocation via Lambda.
    Benefits:
    - Scales to zero (no cost when idle)
    - Auto-scales with demand
    - No infrastructure management
    Limitations:
    - Max 15-minute execution time
    - Cold start latency (100-500ms)
    - 10GB memory limit
    - No GPU access
    - 6MB response payload limit
    """
    # API Gateway proxy integrations deliver body as a JSON string; direct invocations may pass a dict
    body = event["body"] if isinstance(event["body"], dict) else json.loads(event["body"])
    user_message = body["message"]
    conversation_history = body.get("history", [])
response = bedrock_runtime.invoke_model(
modelId="anthropic.claude-sonnet-4-20250514",
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 1024,
"messages": conversation_history + [
{"role": "user", "content": user_message}
],
"system": "You are a helpful content assistant."
}),
contentType="application/json"
)
result = json.loads(response['body'].read())
return {
"statusCode": 200,
"body": json.dumps({
"response": result["content"][0]["text"],
"usage": result["usage"] # Track token consumption
})
}
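For a quick local test, a sketch of the event shape this handler expects (the message/history field names follow the handler above; an API Gateway proxy integration delivers the body as a JSON string):
# Hypothetical test event mimicking an API Gateway proxy payload
test_event = {
    "body": json.dumps({
        "message": "Summarize today's top articles for me.",
        "history": [
            {"role": "user", "content": "Hi!"},
            {"role": "assistant", "content": "Hello! How can I help?"}
        ]
    })
}
print(lambda_handler(test_event, context=None))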
2. Bedrock Provisioned Throughput (content moderation, high constant volume):
# Provisioned throughput for predictable, high-volume workloads
import boto3
import json

bedrock = boto3.client('bedrock')                  # control plane: manage provisioned throughput
bedrock_runtime = boto3.client('bedrock-runtime')  # data plane: invoke models
# Step 1: Create provisioned throughput
provisioned = bedrock.create_provisioned_model_throughput(
modelId="anthropic.claude-haiku-4-5-20251001",
provisionedModelName="content-moderation-haiku",
modelUnits=2, # Each unit = specific token throughput
commitmentDuration="SixMonths", # Discounted pricing
tags=[
{"key": "Team", "value": "ContentSafety"},
{"key": "UseCase", "value": "Moderation"}
]
)
# Step 2: Use the provisioned model ARN for invocations
provisioned_arn = provisioned["provisionedModelArn"]
def moderate_content(content_text):
"""High-throughput content moderation using provisioned capacity.
Benefits:
- Guaranteed throughput (no throttling)
- Consistent latency under load
- Cost-effective for high, predictable volume
- No cold starts
Considerations:
- Pay regardless of usage (committed capacity)
- Need to estimate throughput accurately
    - No-commitment (hourly), 1-month, or 6-month commitment options
"""
response = bedrock_runtime.invoke_model(
modelId=provisioned_arn, # Use provisioned ARN, not base model ID
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 256,
"messages": [{
"role": "user",
"content": f"Classify this content as SAFE, WARN, or BLOCK. "
f"Respond with only the classification and a brief reason.\n\n"
f"Content: {content_text}"
}]
})
)
return json.loads(response['body'].read())
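Provisioned capacity is not usable the moment the create call returns; a minimal polling sketch using the Bedrock control-plane API (the 30-second interval is an arbitrary choice):
import time

def wait_until_in_service(arn, poll_seconds=30):
    """Block until the provisioned model is ready (or fails)."""
    while True:
        status = bedrock.get_provisioned_model_throughput(
            provisionedModelId=arn
        )["status"]  # Creating | InService | Updating | Failed
        if status == "InService":
            return
        if status == "Failed":
            raise RuntimeError("Provisioned throughput creation failed")
        time.sleep(poll_seconds)

wait_until_in_service(provisioned_arn)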
3. SageMaker AI Endpoint for Custom/Hybrid Solutions:
# SageMaker endpoint for custom fine-tuned model
import boto3
import sagemaker
from sagemaker.huggingface import HuggingFaceModel
# Deploy custom fine-tuned model
huggingface_model = HuggingFaceModel(
model_data="s3://my-models/custom-caption-model/model.tar.gz",
role="arn:aws:iam::123456789:role/SageMakerRole",
transformers_version="4.37",
pytorch_version="2.1",
py_version="py310",
env={
"HF_TASK": "image-to-text",
"MAX_BATCH_SIZE": "8",
"MODEL_CACHE_ROOT": "/opt/ml/model"
}
)
# Deploy with auto-scaling
predictor = huggingface_model.deploy(
initial_instance_count=2,
instance_type="ml.g5.2xlarge", # GPU instance for vision model
endpoint_name="image-captioning-endpoint",
wait=True
)
# Configure auto-scaling
autoscaling_client = boto3.client('application-autoscaling')
autoscaling_client.register_scalable_target(
ServiceNamespace='sagemaker',
    ResourceId='endpoint/image-captioning-endpoint/variant/AllTraffic',
ScalableDimension='sagemaker:variant:DesiredInstanceCount',
MinCapacity=2,
MaxCapacity=10
)
# Scale based on invocations per instance
autoscaling_client.put_scaling_policy(
PolicyName='image-captioning-scaling',
ServiceNamespace='sagemaker',
    ResourceId='endpoint/image-captioning-endpoint/variant/AllTraffic',
ScalableDimension='sagemaker:variant:DesiredInstanceCount',
PolicyType='TargetTrackingScaling',
TargetTrackingScalingPolicyConfiguration={
'TargetValue': 100.0, # 100 invocations per instance
'PredefinedMetricSpecification': {
'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
},
'ScaleInCooldown': 300,
'ScaleOutCooldown': 60
}
)
Deployment Strategy Decision Tree:
Is the model a standard Bedrock model?
├── YES: Is traffic predictable and high volume (>100 QPS steady)?
│ ├── YES: Use Bedrock Provisioned Throughput
│ └── NO: Is it event-driven or sporadic?
│ ├── YES: Use Lambda + Bedrock On-Demand
│ └── NO: Use Bedrock On-Demand (standard)
└── NO: Is it a custom/fine-tuned model?
├── YES: Does it need GPU?
│ ├── YES: Use SageMaker Endpoint (GPU instances)
│ └── NO: Use SageMaker Serverless Inference
└── NO: Hybrid approach
├── Lambda for API/orchestration layer
├── Bedrock for foundation model calls
└── SageMaker for custom model inference
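The serverless branch above has no example in this section; a minimal sketch of SageMaker Serverless Inference for a small, CPU-friendly custom model (paths, role, and framework versions are placeholders):
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.serverless import ServerlessInferenceConfig

model = HuggingFaceModel(
    model_data="s3://my-models/small-classifier/model.tar.gz",  # placeholder
    role="arn:aws:iam::123456789:role/SageMakerRole",
    transformers_version="4.37",
    pytorch_version="2.1",
    py_version="py310"
)

# Serverless endpoints scale to zero and bill per inference, but are CPU-only,
# which is why the decision tree reserves them for models that don't need a GPU
predictor = model.deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=4096,  # 1024-6144 MB
        max_concurrency=10
    )
)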
Skill 2.2.2: Deploy FM Solutions Addressing Unique LLM Challenges
Core Concepts
- Container-Based Deployment: Docker containers optimized for LLM memory, GPU, and token processing
- Model Loading Strategies: Pre-loading, lazy loading, model sharding across GPUs
- Memory Requirements: FP16 weights take ~2 bytes per parameter, and KV cache plus activations push total GPU memory well beyond the raw weight size
- Token Processing Capacity: Throughput measured in tokens/second, not just requests/second
User Story 10: Deploying a 70B Parameter Model for Enterprise Use
As an ML platform engineer, I want to deploy a 70B parameter model in production with optimal resource utilization, So that inference costs stay under $0.02/request while maintaining <3s latency at p99.
Deep Dive Scenario
Company: FinanceAI - deploying Llama-3 70B fine-tuned for financial analysis
Problem: 70B model requires ~140GB GPU memory (at FP16). Single GPU max is 80GB (A100). Need multi-GPU sharding and optimized container deployment.
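The arithmetic behind those numbers as a small sketch (the 20% headroom for KV cache and activations is a rough assumption, not a fixed rule):
BYTES_PER_PARAM = {"fp16": 2, "int8": 1, "int4": 0.5}

def estimate_gpu_memory_gb(params_billions, precision, overhead=1.2):
    """Weight footprint plus rough headroom for KV cache and activations."""
    return params_billions * BYTES_PER_PARAM[precision] * overhead

# 70B at FP16: 70 x 2 = 140GB of weights alone -> must shard across A100 80GB GPUs
for precision in ("fp16", "int8", "int4"):
    print(precision, round(estimate_gpu_memory_gb(70, precision)), "GB")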
Container-Based Deployment Pattern:
# Dockerfile optimized for LLM inference
FROM nvidia/cuda:12.1.0-base-ubuntu22.04
# Base CUDA image ships without Python; install it plus curl for the health check
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 python3-pip curl \
    && rm -rf /var/lib/apt/lists/*
# Pre-install inference framework
RUN pip3 install --no-cache-dir \
    vllm==0.3.0 \
    torch==2.1.2 \
    transformers==4.37.0 \
    accelerate==0.26.0
# Model loading optimization: SageMaker downloads and extracts model_data to
# /opt/ml/model before the container starts, so vLLM reads weights from local
# disk with no download at startup
ENV MODEL_ID="/opt/ml/model"
ENV HF_HOME="/opt/ml/model"
# Memory optimization settings
ENV CUDA_MEMORY_FRACTION=0.95
ENV MAX_MODEL_LEN=4096
ENV GPU_MEMORY_UTILIZATION=0.90
# Shard across 4 GPUs via tensor parallelism (ENV lines can't carry inline comments)
ENV TENSOR_PARALLEL_SIZE=4
# Token processing configuration
ENV MAX_BATCH_SIZE=32
ENV MAX_NUM_SEQS=256
ENV SWAP_SPACE_GB=4
# Health check for load balancer
HEALTHCHECK --interval=30s --timeout=10s \
CMD curl -f http://localhost:8000/health || exit 1
COPY serve.py /opt/ml/code/
ENTRYPOINT ["python", "/opt/ml/code/serve.py"]
Serving Script with LLM-Specific Optimizations:
# serve.py - vLLM server with production optimizations
from vllm import LLM, SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
import uvicorn
from fastapi import FastAPI
import os
import uuid
app = FastAPI()
# Model loading strategy: Pre-load at container start
# This adds to startup time but eliminates first-request latency
engine_args = AsyncEngineArgs(
model=os.environ["MODEL_ID"],
# GPU Memory Management
tensor_parallel_size=int(os.environ.get("TENSOR_PARALLEL_SIZE", 4)),
gpu_memory_utilization=float(os.environ.get("GPU_MEMORY_UTILIZATION", 0.90)),
max_model_len=int(os.environ.get("MAX_MODEL_LEN", 4096)),
# Token Processing Optimization
max_num_batched_tokens=32768, # Max tokens processed in one batch
max_num_seqs=256, # Max concurrent sequences
    # Quantization for memory reduction (70B: FP16 = 140GB, INT8 = 70GB, INT4 = ~35GB)
    quantization="awq",  # 4-bit AWQ; requires an AWQ-quantized checkpoint
# KV Cache optimization
block_size=16, # Token block size for KV cache
swap_space=4, # GB of CPU swap for KV cache overflow
# Advanced scheduling
enable_prefix_caching=True, # Cache common prompt prefixes
disable_log_stats=False # Enable for monitoring
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
@app.post("/v1/completions")
async def generate(request: dict):
"""Handle inference request with token-aware processing."""
prompt = request["prompt"]
sampling_params = SamplingParams(
max_tokens=request.get("max_tokens", 512),
temperature=request.get("temperature", 0.7),
top_p=request.get("top_p", 0.9)
)
results = []
async for output in engine.generate(prompt, sampling_params, request_id=str(uuid.uuid4())):
results.append(output)
final = results[-1]
return {
"text": final.outputs[0].text,
"usage": {
"prompt_tokens": len(final.prompt_token_ids),
"completion_tokens": len(final.outputs[0].token_ids),
"total_tokens": len(final.prompt_token_ids) + len(final.outputs[0].token_ids)
}
}
@app.get("/health")
async def health():
return {"status": "healthy", "model_loaded": engine.is_running}
SageMaker Deployment Configuration:
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

sagemaker_role = sagemaker.get_execution_role()  # or an explicit IAM role ARN
# Deploy 70B model across multi-GPU instances
model = HuggingFaceModel(
model_data="s3://models/llama3-70b-finance-ft/model.tar.gz",
role=sagemaker_role,
image_uri="123456789.dkr.ecr.us-east-1.amazonaws.com/llm-inference:latest",
env={
"TENSOR_PARALLEL_SIZE": "4",
"QUANTIZATION": "awq",
"MAX_MODEL_LEN": "4096",
"GPU_MEMORY_UTILIZATION": "0.90"
}
)
predictor = model.deploy(
initial_instance_count=2,
instance_type="ml.p4d.24xlarge", # 8x A100 80GB GPUs
container_startup_health_check_timeout=600, # 10 min for model loading
model_data_download_timeout=1200, # 20 min for large model download
volume_size=500, # GB of EBS for model storage
endpoint_name="llama3-70b-finance"
)
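Invoking the deployed endpoint via the SageMaker runtime API (the payload shape assumes the custom container routes invocations to the serve.py handler above):
import boto3
import json

runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="llama3-70b-finance",
    ContentType="application/json",
    Body=json.dumps({
        "prompt": "Assess the liquidity risk signals in this balance sheet: ...",
        "max_tokens": 512
    })
)
print(json.loads(response["Body"].read()))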
LLM vs Traditional ML Deployment Comparison:
| Aspect | Traditional ML | LLM Deployment |
|---|---|---|
| Model Size | MBs to low GBs | 10s to 100s of GBs |
| Memory | CPU RAM sufficient | GPU VRAM required |
| GPU | Optional | Essential for reasonable latency |
| Scaling Unit | Request/second | Tokens/second |
| Cold Start | Seconds | Minutes (model loading) |
| Batching | Simple request batching | Continuous batching (token-level) |
| Caching | Response caching | KV cache + prefix caching |
| Quantization | Rarely needed | Critical for cost optimization |
| Multi-GPU | Rarely needed | Required for large models |
| Health Check | Quick endpoint check | Model-loaded verification |
Model Loading Strategies:
| Strategy | Description | Best For |
|---|---|---|
| Pre-load at startup | Load full model into GPU on container start | Production endpoints with consistent traffic |
| Lazy loading | Load model on first request | Dev/test, rarely-used models |
| Model sharding | Split model across multiple GPUs (tensor parallel) | Models > single GPU memory |
| Quantization | Reduce precision (FP16->INT8->INT4) | Cost reduction, fitting on fewer GPUs |
| Model caching | Cache model weights on local NVMe | Faster restarts after container recycling |
| Pipeline parallel | Split model layers across GPUs sequentially | Very large models (100B+) |
Skill 2.2.3: Develop Optimized FM Deployment Approaches
User Story 11: Cost-Optimized AI Platform with Model Cascading
As a VP of Engineering, I want an AI platform that uses the cheapest model that can handle each request, So that we reduce FM costs by 60% without impacting user experience.
Deep Dive Scenario
Company: SupportAI - AI customer support handling 100K conversations/day
Problem: Using Claude Sonnet for everything costs $150K/month. 70% of queries are simple (password reset, order status) that a smaller model handles fine.
Model Cascading Architecture:
[User Query]
|
v
[Complexity Classifier] (Haiku - fast, cheap)
|
|--- Simple (70%): Route to Haiku ($0.0001/query)
| |--- "What's my order status?"
| |--- "How do I reset my password?"
| |--- Simple FAQ lookups
|
|--- Medium (20%): Route to Sonnet ($0.003/query)
| |--- "Compare these two plans for my usage pattern"
| |--- "Help me troubleshoot my connection issues"
| |--- Multi-step reasoning needed
|
|--- Complex (10%): Route to Opus ($0.015/query)
| |--- "Analyze my 12-month usage and suggest optimization"
| |--- "Draft a custom enterprise agreement"
| |--- Deep analysis and creative tasks
|
v
[Quality Check] (verify response meets threshold)
|--- If below quality threshold: Escalate to next tier
|--- If adequate: Return response
Implementation:
import boto3
import json
bedrock_runtime = boto3.client('bedrock-runtime')
class ModelCascade:
"""API-based model cascading for cost optimization."""
TIERS = [
{
"name": "haiku",
"model_id": "anthropic.claude-haiku-4-5-20251001",
"cost_per_1k_input": 0.001,
"cost_per_1k_output": 0.005,
"max_complexity": "simple",
"quality_threshold": 0.85
},
{
"name": "sonnet",
"model_id": "anthropic.claude-sonnet-4-20250514",
"cost_per_1k_input": 0.003,
"cost_per_1k_output": 0.015,
"max_complexity": "medium",
"quality_threshold": 0.90
},
{
"name": "opus",
"model_id": "anthropic.claude-opus-4-20250514",
"cost_per_1k_input": 0.015,
"cost_per_1k_output": 0.075,
"max_complexity": "complex",
"quality_threshold": 0.95
}
]
def classify_complexity(self, query: str) -> str:
"""Use smallest model to classify query complexity."""
response = bedrock_runtime.invoke_model(
modelId="anthropic.claude-haiku-4-5-20251001",
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 10,
"messages": [{
"role": "user",
"content": f"Classify this query as SIMPLE, MEDIUM, or COMPLEX. "
f"Reply with only one word.\n\nQuery: {query}"
}]
})
)
result = json.loads(response['body'].read())
return result["content"][0]["text"].strip().lower()
def invoke_with_cascade(self, query: str, system_prompt: str) -> dict:
"""Invoke the optimal model tier with automatic escalation."""
complexity = self.classify_complexity(query)
tier_index = {"simple": 0, "medium": 1, "complex": 2}.get(complexity, 1)
for tier in self.TIERS[tier_index:]:
response = bedrock_runtime.invoke_model(
modelId=tier["model_id"],
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 1024,
"system": system_prompt,
"messages": [{"role": "user", "content": query}]
})
)
result = json.loads(response['body'].read())
answer = result["content"][0]["text"]
# Quality check: Does the response meet our threshold?
quality = self.assess_quality(query, answer)
if quality >= tier["quality_threshold"]:
return {
"response": answer,
"model_used": tier["name"],
"complexity": complexity,
"quality_score": quality,
"cost_estimate": self.estimate_cost(tier, result["usage"])
}
            # Quality too low: cascade to the next tier
        # All tiers exhausted: return the last (most capable) tier's response
        return {"response": answer, "model_used": self.TIERS[-1]["name"], "escalated": True}
def assess_quality(self, query: str, response: str) -> float:
"""Quick quality assessment using Haiku as judge."""
judge_response = bedrock_runtime.invoke_model(
modelId="anthropic.claude-haiku-4-5-20251001",
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 10,
"messages": [{
"role": "user",
"content": f"Rate 0.0-1.0 how well this response answers the query. "
f"Reply with just the number.\n\nQuery: {query}\n\nResponse: {response}"
}]
})
)
        result = json.loads(judge_response['body'].read())
        try:
            return float(result["content"][0]["text"].strip())
        except ValueError:
            return 0.0  # unparseable judge output counts as failing quality
def estimate_cost(self, tier, usage):
input_cost = (usage["input_tokens"] / 1000) * tier["cost_per_1k_input"]
output_cost = (usage["output_tokens"] / 1000) * tier["cost_per_1k_output"]
return input_cost + output_cost
# Usage
cascade = ModelCascade()
system_prompt = "You are a helpful customer support assistant for SupportAI."  # example prompt
# Simple query -> Haiku handles it (~$0.0001)
result = cascade.invoke_with_cascade("What's my order status for #12345?", system_prompt)
# result: {"model_used": "haiku", "cost_estimate": 0.0001, ...}
# Harder query -> routed to Sonnet (~$0.003)
result = cascade.invoke_with_cascade("Analyze my usage patterns and recommend the optimal plan", system_prompt)
# result: {"model_used": "sonnet", "cost_estimate": 0.003, ...}
Cost Savings Analysis (illustrative per-query rates, assuming short single-turn queries):
Before (all Sonnet):
100K queries × $0.003/query = $300/day = $9,000/month
After (cascading):
70K simple × $0.0001 = $7/day
20K medium × $0.003 = $60/day
10K complex × $0.015 = $150/day
Classification overhead: 100K × $0.00005 = $5/day
Total: $222/day = $6,660/month
Savings: 26% ($2,340/month)
With accurate routing optimization: up to 60% savings
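The same arithmetic as a checkable sketch (volumes and per-query rates copied from the analysis above):
QUERIES_PER_DAY = 100_000
baseline = QUERIES_PER_DAY * 0.003               # all-Sonnet: $300/day

cascaded = (
    70_000 * 0.0001                              # simple -> Haiku: $7/day
    + 20_000 * 0.003                             # medium -> Sonnet: $60/day
    + 10_000 * 0.015                             # complex -> Opus: $150/day
    + QUERIES_PER_DAY * 0.00005                  # classifier overhead: $5/day
)                                                # total: $222/day

savings = (baseline - cascaded) / baseline * 100
print(f"${cascaded:.0f}/day vs ${baseline:.0f}/day -> {savings:.0f}% savings")  # ~26%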
Exam-Relevant Points:
- Model cascading routes queries to the cheapest capable model
- Smaller pre-trained models (Haiku) handle routine queries at roughly 10x lower cost
- Quality-check gates ensure adequate response quality before returning
- Complexity classification itself should use the cheapest model
- Bedrock Provisioned Throughput suits high-volume, predictable workloads
- Lambda suits sporadic, event-driven FM invocations
- SageMaker suits custom models needing GPUs or full infrastructure control
- Quantization (INT8/INT4) dramatically reduces GPU memory requirements
- Tensor parallelism splits large models across multiple GPUs
- LLM containers need extended health-check timeouts for model loading
- KV cache and prefix caching are LLM-specific performance optimizations