CD-03: ML Model Deployment Pipeline# CD-03: ML Model Deployment Pipeline

| Quality gate pass rate | Step Functions | < 80% (indicates training issues) || Model endpoint latency P95 | SageMaker CloudWatch | Varies by model type || Shadow vs champion agreement | S3 analytics | < 90% agreement || Canary rollback rate | Step Functions + CloudWatch | > 20% of deployments || Mean time from approval to production | Custom CloudWatch | > 48 hours || Model deployment success rate | Step Functions | < 90% over 30 days ||---|---|---|| Metric | Source | Alert Threshold |## Monitoring This Pipeline---Key Insight: The time-based gates (24h shadow, 1h canary) are not arbitrary — they ensure the model sees enough traffic diversity (morning vs evening, weekday vs weekend) to validate across user behavior patterns. A model that works great at 2 PM on Tuesday might fail at 9 PM on Saturday when users ask different types of questions.| Emergency hotfix | Skip | 15 min | ~20 min | Critical regression fix || Major (> 5% change or architecture change) | 24 hours | 4 hours | ~30 hours | New model architecture || Moderate (2-5% change) | 4 hours | 1 hour | ~6 hours | New training data added || Minor (< 2% accuracy change) | Skip | 1 hour | ~1.5 hours | Weekly intent classifier retrain ||---|---|---|---|---|| Change Magnitude | Shadow | Canary | Total Time | Example |### Resolution: Tiered Deployment Speed by Change MagnitudeDS1 ---|"But metrics prove it"| AR2 AR3 ---|"Delays"| PM2 PM3 ---|"Amplifies"| DS3 DS3 ---|"Tension"| AR1 end PM3["Sides with Data Science"] PM2["Competitor just improved their chatbot"] PM1["NPS score dropping"] subgraph "Product Manager" end AR3["Statistical significance required"] AR2["Canary minimum 1 hour"] AR1["Shadow test minimum 24 hours"] subgraph "Architect" end DS3["Want to deploy NOW"] DS2["Customers complaining about current model"] DS1["Model accuracy improved 3%"] subgraph "Data Science Lead"graph TDmermaid### The Debate: Model Deployment Speed vs Quality Assurance## Tradeoffs---Rationale: LLM adapter changes directly affect response generation — the customer-facing output. Unlike the intent classifier (binary classification, measurable), LLM quality has subjective dimensions (tone, helpfulness, accuracy of nuanced answers) that automated metrics may miss. A human review of 10-20 sample outputs takes 15 minutes and catches issues that metrics cannot.G -->|No| H[Reject with feedback] G -->|Yes| E F --> G{ML Lead approves?} D --> F[Notify ML Lead via Slack] C --> E[Proceed to Shadow/Canary] B -->|LLM LoRA Adapter| D[Require human review] B -->|Cross-Encoder| C B -->|Embedding Model| C B -->|Intent Classifier| C[Auto-approve if all metrics pass] A[Model Passes Quality Gates] --> B{Model Type?}flowchart TDmermaidDecision: Hybrid — Automated gates with human approval for LLM adapter changes| Audit Trail | Automated logs | Manual + automated | Full trail both paths || Team Scalability | Infinite | Bottleneck at reviewer | Scalable with manual override || Catches Edge Cases | No — only tests what metrics measure | Yes — human judgment | Yes — human reviews anomalies || Consistency | Perfect — same thresholds always | Variable — reviewer fatigue | Consistent with human override || Speed | Minutes | Hours to days | Hours (fast path available) ||---|---|---|---|| Criteria | Fully Automated | Human-in-the-Loop | Hybrid (Current) |### Decision 3: Model Approval — Fully Automated vs Human-in-the-Loop---| Cross-Encoder Reranker | Direct replace with monitoring | Low risk — affects ranking order, not correctness || Embedding Model | Canary → Full (skip shadow) | Impact is retrieval quality, observable via RAG metrics || LLM LoRA Adapter | Shadow → Canary → Full | Affects response quality directly — high risk || Intent Classifier | Shadow → Canary → Full | Misclassification causes wrong response path — highest risk ||---|---|---|| Model | Pattern | Justification |Decision: Tiered approach by model criticality| Best For | Critical models (intent classifier) | Optional models (recommendations) | Config-only changes || Complexity | High (shadow infra + canary logic) | Medium (routing logic) | Low || Cost | Higher (2x endpoints during shadow) | Higher (2x endpoints always) | Lowest || Time to Full Deploy | Long (shadow + canary = hours) | Medium (immediate A/B) | Instant || Validation Quality | High — real traffic, no impact | High — real traffic, real impact | None — faith-based || Risk Level | Very Low — validated before traffic | Low — simultaneous serving | High — untested in prod ||---|---|---|---|| Criteria | Shadow + Canary (Current) | Champion-Challenger | Direct Replace |### Decision 2: Model Deployment Pattern — Shadow + Canary vs Champion-Challenger vs Direct Replace---Why not Kubeflow? The team has zero Kubernetes experience. Running a K8s cluster for ML pipeline orchestration when Step Functions + SageMaker provides the same capability serverlessly is unjustifiable overhead for a 1-2 person team.Why not SageMaker Pipelines alone? SageMaker Pipelines excel at training workflows but are limited for deployment orchestration. They can't natively: check CloudWatch canary alarms, send formatted Slack notifications, or implement complex branching logic based on multiple metric evaluations performed in parallel.Rationale: The model deployment pipeline requires custom logic beyond pure ML — shadow traffic mirroring, canary metric evaluation, Slack notifications, CloudWatch alarm checks. Step Functions handle this heterogeneous workflow naturally (Lambda for logic, SageMaker for model operations, SNS for notifications). SageMaker Pipelines force everything into the SageMaker paradigm, making non-ML steps awkward.Decision: AWS Step Functions (orchestrator) + SageMaker (execution)| Weighted Score | 7.2/10 | 8.3/10 | 4.8/10 || Error Handling (5%) | 6/10 — basic retry | 9/10 — rich error handling, catch, retry | 7/10 — retry policies || Team Familiarity (10%) | 6/10 | 8/10 — team uses Lambda | 2/10 — no K8s experience || Cost (10%) | 7/10 | 9/10 — $0.025/1K transitions | 4/10 — K8s cluster cost || Operational Overhead (15%) | 8/10 — fully managed | 9/10 — fully managed | 3/10 — requires K8s cluster || Visualization (15%) | 7/10 — SM Studio lineage | 8/10 — Step Functions console | 9/10 — Kubeflow UI || Custom Logic Support (20%) | 5/10 — limited to SM steps | 9/10 — any Lambda/ECS task | 8/10 — any container || SageMaker Integration (25%) | 10/10 — native | 7/10 — via SDK calls | 4/10 — separate system ||---|---|---|---|| Criteria (Weight) | SageMaker Pipelines | Step Functions | Kubeflow |### Decision 1: ML Pipeline Orchestrator — SageMaker Pipelines vs Step Functions vs Kubeflow## Critical Decisions---NotifyTeam --> [*] PromoteChampion --> [*] Rejected --> NotifyTeam Rollback --> NotifyTeam Monitor24h --> Rollback: Regression Monitor24h --> PromoteChampion: Stable FullProduction --> Monitor24h CanaryPass --> FullProduction CanaryEval --> Rollback: Metrics bad CanaryEval --> CanaryPass: Metrics OK CanaryWait --> CanaryEval CanaryDeploy --> CanaryWait: Wait 1 hour ShadowPass --> CanaryDeploy ShadowEval --> Rollback: Metrics bad ShadowEval --> ShadowPass: Metrics OK ShadowWait --> ShadowEval ShadowDeploy --> ShadowWait: Wait N hours GatesPass --> ShadowDeploy QualityGates --> Rejected: Any fail QualityGates --> GatesPass: All pass } LITReview --> [*] CaptumAudit --> LITReview [*] --> CaptumAudit state QualityGates { ValidateArtifact --> QualityGates [*] --> ValidateArtifactstateDiagram-v2mermaid### 6. Step Functions Orchestrationreturn 'challenger' if (hash_value % 100) < 50 else 'champion' hash_value = int(hashlib.sha256(hash_input.encode()).hexdigest(), 16) hash_input = f"{user_id}:{experiment_id}" """ Same user always gets same group for consistent experience. """Deterministic A/B assignment based on user_id hash.def get_ab_group(user_id: str, experiment_id: str) -> str:import hashlibpythonA/B Assignment Logic:H -->|"p > 0.05 after 7 days"| K[No Significant Difference — Keep Champion] H -->|"Not yet"| J[Continue A/B Test] H -->|"p < 0.05, N > 10K"| I[Auto-Promote Winner] G --> H{Statistical Significance?} F --> G[Athena Analysis] E --> F[Kinesis Data Stream] D --> E C --> E[Log: user_id, group, prediction, latency] B -->|"Group B (50%)"| D[Challenger Model v2.4] B -->|"Group A (50%)"| C[Champion Model v2.3] A[User Request] --> B{A/B Assignment}flowchart TDmermaid### 5. A/B Testing Framework) ] {'VariantName': challenger_variant, 'DesiredWeight': 100}, {'VariantName': champion_variant, 'DesiredWeight': 0}, DesiredWeightsAndCapacities=[ EndpointName=endpoint_name, sm.update_endpoint_weights_and_capacities( # All checks passed — promote to 100% raise CanaryFailure(f"Canary failed at check {check}: {metrics}") ) ] {'VariantName': challenger_variant, 'DesiredWeight': 0}, {'VariantName': champion_variant, 'DesiredWeight': 100}, DesiredWeightsAndCapacities=[ EndpointName=endpoint_name, sm.update_endpoint_weights_and_capacities( # ROLLBACK if not meets_thresholds(metrics): metrics = get_canary_metrics(endpoint_name, challenger_variant) time.sleep(300) for check in range(12): # Check every 5 min # Wait and evaluate (1 hour) ) ] {'VariantName': challenger_variant, 'DesiredWeight': 5}, {'VariantName': champion_variant, 'DesiredWeight': 95}, DesiredWeightsAndCapacities=[ EndpointName=endpoint_name, sm.update_endpoint_weights_and_capacities( # Stage 1: 5% traffic to challenger cw = boto3.client('cloudwatch') sm = boto3.client('sagemaker')def canary_deploy(endpoint_name: str, challenger_variant: str, champion_variant: str):pythonCanary Traffic Shifting:Note over SM: Challenger is now Champion OPS->>SM: Update endpoint — 100% Challenger end end OPS->>SM: Update endpoint — 100% Champion CW->>OPS: Trigger rollback CW->>PD: Alert — canary failing else Metrics degraded CW-->>OPS: Continue canary alt Metrics healthy CW->>CW: Evaluate challenger metrics loop Every 5 minutes for 1 hour OPS->>CW: Start canary alarm evaluation Note over SM: Champion: 95%, Challenger: 5% OPS->>SM: Update endpoint — 5% to Challenger participant PD as PagerDuty participant CW as CloudWatch participant SM as SageMaker Endpoint participant OPS as MLOps PipelinesequenceDiagrammermaid### 4. Canary Deployment — Weighted Traffic} 'pass': agreement > 0.95 and shadow_accuracy > 0.94, 'sample_size': len(predictions), 'shadow_latency_p95': shadow_latency_p95, 'shadow_accuracy': shadow_accuracy, 'agreement_rate': agreement, return { shadow_latency_p95 = self._calculate_p95_latency(predictions) shadow_accuracy = self._calculate_accuracy(predictions) # Calculate quality metrics on shadow ) / len(predictions) if s['predicted_intent'] == c['predicted_intent'] 1 for s, c in zip(predictions, champion_predictions) agreement = sum( # Calculate agreement rate champion_predictions = self._load_champion_predictions(duration_hours) predictions = self._load_shadow_predictions(duration_hours) # Query S3 for shadow prediction logs """Compare shadow predictions against champion.""" def evaluate_shadow_results(self, duration_hours: int = 4) -> dict: return f"{self.model_name}-shadow" ) EndpointConfigName=endpoint_config_name, EndpointName=f"{self.model_name}-shadow", self.sm.create_endpoint( ) }], 'InitialVariantWeight': 1, 'InitialInstanceCount': 1, 'InstanceType': self._get_instance_type(), 'ModelName': self._create_model_from_package(self.challenger_arn), 'VariantName': 'shadow-variant', ShadowProductionVariants=[{ }], 'InitialInstanceCount': 1, 'InstanceType': self._get_instance_type(), 'ModelName': self._create_model_from_package(self.challenger_arn), 'VariantName': 'shadow', ProductionVariants=[{ EndpointConfigName=endpoint_config_name, self.sm.create_endpoint_config( endpoint_config_name = f"{self.model_name}-shadow-{int(datetime.utcnow().timestamp())}" """Deploy challenger model as shadow endpoint.""" def create_shadow_endpoint(self) -> str: self.challenger_arn = challenger_arn self.model_name = model_name self.sm_runtime = boto3.client('sagemaker-runtime') self.sm = boto3.client('sagemaker') def __init__(self, model_name: str, challenger_arn: str):class ShadowDeployer:import jsonimport boto3pythonShadow Deployment Implementation:style F fill:#8C4FFF,color:#fff style E fill:#8C4FFF,color:#fff end G --> H[Shadow Metrics Dashboard] F --> G[Compare with Champion] E --> F[Log Predictions to S3] B -->|"Mirror 1% traffic"| E[Shadow Model Endpoint] subgraph "Shadow Path (No Customer Impact)" end C --> D[Response to User] B --> C[Champion Model Endpoint] A[User Request] --> B[ALB] subgraph "Production Traffic"flowchart LRmermaid### 3. Shadow Deployment — Mirrored Traffic| Golden Set Pass Rate | >= 98% | >= 95% | >= 90% || Captum Attribution | No single token > 40% attribution | N/A | N/A || P95 Latency | Must be < 150ms | Must be < 200ms | Must be < 2000ms (first token) || LLM-as-Judge Score | N/A | N/A | Must be >= champion - 0.1 (0-1 scale) || Retrieval Recall@10 | N/A | Must be >= champion - 2% | N/A || F1 Score vs Champion | Must be >= champion - 1% | N/A | N/A || Accuracy vs Champion | Must be >= champion - 0.5% | N/A (evaluated via retrieval recall) | N/A (evaluated via LLM-as-judge) ||---|---|---|---|| Metric | Intent Classifier | Embedding Model | LLM Adapter |Quality Gate Thresholds:end GW->>SNS: "Model rejected — {reason}" GW->>REG: Update status = Rejected else Gate failure GW->>SNS: "Model approved — ready for shadow" GW->>REG: Update status = Approved alt All gates pass GW->>GW: Evaluate against thresholds end LIT-->>GW: Accuracy, F1, latency results GW->>LIT: Run 500 golden test cases and Golden Set Review CAP-->>GW: Attribution report GW->>CAP: Run Captum integrated gradients par Attribution Audit GW->>GW: Load model artifact REG->>GW: ModelPackage status = PendingApproval participant SNS as SNS Notifications participant LIT as LIT Golden Set participant CAP as Captum Attribution participant GW as Quality Gate Lambda participant REG as Model RegistrysequenceDiagrammermaid### 2. Quality Gate — Automated Evaluationreturn response['ModelPackageArn'] ) }, 'lit_review': training_metrics['lit_status'], 'captum_audit': training_metrics['captum_status'], 'latency_p95_ms': str(training_metrics['latency_p95']), 'f1_score': str(training_metrics['f1']), 'accuracy': str(training_metrics['accuracy']), 'training_job_id': training_metrics['job_id'], CustomerMetadataProperties={ }, }, }, 'S3Uri': training_metrics['quality_report_s3'], 'ContentType': 'application/json', 'Statistics': { 'ModelQuality': { ModelMetrics={ ModelApprovalStatus='PendingManualApproval', }, 'SupportedResponseMIMETypes': ['application/json'], 'SupportedContentTypes': ['application/json'], }], 'ModelDataUrl': model_artifact_s3, 'Image': get_inference_image(model_name), 'Containers': [{ InferenceSpecification={ ModelPackageDescription=f"v{model_version} — {datetime.utcnow().isoformat()}", ModelPackageGroupName=f"mangaassist-{model_name}", response = sm.create_model_package( sm = boto3.client('sagemaker')) -> str: training_metrics: dict, model_version: str, model_name: str, model_artifact_s3: str,def register_model(from datetime import datetimeimport boto3python### 1. SageMaker Model Registry Integration## Low-Level Design---| Cross-Encoder Reranker | SageMaker ml.g5.xlarge | SageMaker | Monthly | 24 hours | 4 hours || LLM LoRA Adapter (Llama) | SageMaker 4x A100 (vLLM) | SageMaker | Quarterly | 48 hours | 24 hours || Embedding Model (e5-large) | SageMaker ml.g5.xlarge | SageMaker | Monthly | 24 hours | 4 hours || Intent Classifier (DistilBERT) | Inferentia ml.inf1.xlarge | SageMaker | Weekly | 4 hours | 1 hour ||---|---|---|---|---|---|| Model | Hardware | Registry | Cadence | Shadow Duration | Canary Duration |### Model Types and Deployment Cadencestyle R fill:#146eb4,color:#fff style N fill:#DD344C,color:#fff style H fill:#DD344C,color:#fff style I fill:#1B660F,color:#fff end Q -->|Yes| R[Model Promoted to Champion] Q -->|No| N P --> Q{Metrics stable?} O --> P[Monitor 24h post-deploy] subgraph "Post-Deploy" end M -->|Yes| O[Full Production — 100% traffic] M -->|No| N[Auto-Rollback to Champion] L --> M{1-hour canary OK?} K -->|Yes| L[Canary Deployment — 5% real traffic] K -->|No| H J --> K{Shadow Metrics OK?} I --> J[Shadow Deployment — 1% mirrored traffic] subgraph "Deployment Stages" end G -->|Yes| I[Status: Approved] G -->|No| H[Reject — Notify Team] F --> G{All Gates Pass?} E --> F[LIT Golden-Set Review] D --> E[Captum Attribution Audit] subgraph "Quality Gates" end C --> D[Status: PendingManualApproval] B --> C[Register in SageMaker Model Registry] A[Model Training Complete] --> B[Artifact Validation] subgraph "Training Pipeline (CD-03 input)"flowchart TDmermaid### Model Deployment Lifecycle## High-Level Design---- [ ] Deployment completes within 30 minutes (shadow excluded — shadow runs async)- [ ] Quality gates include Captum attribution audit and LIT golden-set review (per existing training pipeline)- [ ] Pipeline supports three model types: intent classifier (weekly), embeddings (monthly), LLM adapters (quarterly)- [ ] Full audit trail: who approved, what metrics were evaluated, when model went live- [ ] Model deployment does not require application code redeployment (model endpoint swap)- [ ] A/B testing framework supports comparing champion vs challenger models with statistical significance- [ ] Automated rollback triggers if accuracy drops > 1%, latency increases > 20%, or error rate exceeds 0.5%- [ ] Canary deployment runs for minimum 1 hour with automated quality metric evaluation- [ ] Intent classifier (DistilBERT on Inferentia) deploys with shadow traffic validation (1% real traffic, no customer impact)- [ ] Models are promoted from SageMaker Model Registry through defined stages: Staging → Shadow → Canary → Production## Acceptance Criteria---So that model updates improve chatbot quality without risking production availability — with automated rollback if any quality gate fails.I want to establish an automated pipeline for deploying ML models (intent classifier on Inferentia, embedding models, LLM LoRA adapters) from the SageMaker Model Registry through shadow testing, canary validation, and A/B testing to full production,As a MLOps Engineer on the MangaAssist AI Chatbot team,## User Story

User Story

As a ML Engineer / Senior Developer on the MangaAssist AI Chatbot team, I want to establish an automated ML model deployment pipeline that safely promotes trained models from the SageMaker Model Registry through shadow deployment, canary validation, and A/B testing to full production, So that model updates (intent classifier, embedding models, LLM LoRA adapters) are deployed with rigorous quality gates — preventing model regressions from reaching customers while maintaining a weekly update cadence for the intent classifier.

Acceptance Criteria

High-Level Design

Model Deployment Landscape

graph TB
    subgraph "Model Types"
        IC["Intent Classifier<br/>DistilBERT on Inferentia<br/>Weekly updates"]
        EMB["Embedding Models<br/>Titan + e5-large<br/>Monthly updates"]
        LLM["LLM LoRA Adapters<br/>Llama 70B via vLLM<br/>Quarterly updates"]
    end

    subgraph "Deployment Pipeline"
        REG["SageMaker Model Registry"]
        SHADOW["Shadow Deployment (1%)"]
        CANARY["Canary Deployment (5%)"]
        AB["A/B Test (optional)"]
        PROD["Full Production (100%)"]
    end

    IC --> REG
    EMB --> REG
    LLM --> REG

    REG --> SHADOW
    SHADOW -->|Auto-pass| CANARY
    CANARY -->|Auto-pass for IC/EMB| PROD
    CANARY -->|Human approval for LLM| AB
    AB --> PROD

    style IC fill:#ff9900,color:#000
    style EMB fill:#146eb4,color:#fff
    style LLM fill:#8C4FFF,color:#fff

Deployment Cadence and Risk Profile

Model Type	Cadence	Training Cost	Risk Level	Approval
Intent Classifier (DistilBERT)	Weekly	0.5 GPU-hr ($1.60)	Medium — misroutes queries	Automated
Embedding Adapter (e5-large)	Monthly	2-4 GPU-hr ($6-13)	Medium — affects RAG quality	Automated
Cross-Encoder Reranker	Monthly	2 GPU-hr ($6.40)	Low — affects ranking only	Automated
LLM LoRA Adapter (Llama 70B)	Quarterly	48 GPU-hr ($154)	HIGH — affects all responses	Human required

Low-Level Design

1. Model Registration in SageMaker Model Registry

import sagemaker
from sagemaker.model_metrics import ModelMetrics, MetricsSource

def register_model(
    model_artifact_uri: str,
    model_type: str,  # 'intent-classifier' | 'embedding' | 'llm-adapter'
    training_job_name: str,
    metrics: dict,
) -> str:
    """Register a trained model with full lineage metadata."""

    sm_client = sagemaker.Session().sagemaker_client

    model_package_group = f"mangaassist-{model_type}"

    # Create model metrics from evaluation results
    model_metrics = ModelMetrics(
        model_statistics=MetricsSource(
            content_type="application/json",
            s3_uri=f"s3://mangaassist-models/metrics/{training_job_name}/eval.json",
        )
    )

    response = sm_client.create_model_package(
        ModelPackageGroupName=model_package_group,
        ModelPackageDescription=f"Trained by {training_job_name}",
        InferenceSpecification={
            "Containers": [{
                "Image": get_inference_image(model_type),
                "ModelDataUrl": model_artifact_uri,
            }],
            "SupportedContentTypes": ["application/json"],
            "SupportedResponseMIMETypes": ["application/json"],
        },
        ModelApprovalStatus="PendingManualApproval" if model_type == "llm-adapter" else "Approved",
        CustomerMetadataProperties={
            "model_type": model_type,
            "training_job": training_job_name,
            "accuracy": str(metrics.get("accuracy", "N/A")),
            "f1_score": str(metrics.get("f1_score", "N/A")),
            "latency_p95_ms": str(metrics.get("latency_p95_ms", "N/A")),
            "training_data_version": metrics.get("data_version", "unknown"),
            "captum_audit": str(metrics.get("captum_audit_passed", False)),
        },
    )

    return response["ModelPackageArn"]

2. Pipeline Orchestration via Step Functions

stateDiagram-v2
    [*] --> ModelRegistered
    ModelRegistered --> ValidateArtifacts

    ValidateArtifacts --> ShadowDeploy: Artifacts valid
    ValidateArtifacts --> Failed: Artifacts corrupt/missing

    ShadowDeploy --> ShadowEval: 1% shadow traffic (30 min)
    ShadowEval --> CanaryDeploy: Shadow metrics pass
    ShadowEval --> Rollback: Shadow metrics fail

    CanaryDeploy --> CanaryMonitor: 5% real traffic
    CanaryMonitor --> WaitForCanary: Monitor 1 hour

    WaitForCanary --> CheckModelType

    CheckModelType --> AutoPromote: Intent classifier / Embedding
    CheckModelType --> HumanApproval: LLM adapter

    HumanApproval --> FullDeploy: Approved
    HumanApproval --> Rollback: Rejected

    AutoPromote --> FullDeploy
    FullDeploy --> Production: 100% traffic

    Rollback --> [*]: Champion model restored
    Production --> [*]: New champion
    Failed --> [*]: Alert sent

Step Functions Definition (key states):

# Step Functions state machine definition
MODEL_PIPELINE_DEFINITION = {
    "Comment": "ML Model Deployment Pipeline",
    "StartAt": "ValidateArtifacts",
    "States": {
        "ValidateArtifacts": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:validate-model-artifacts",
            "Next": "ShadowDeploy",
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
        },
        "ShadowDeploy": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:deploy-shadow-endpoint",
            "Next": "WaitForShadowTraffic",
        },
        "WaitForShadowTraffic": {
            "Type": "Wait",
            "Seconds": 1800,  # 30 minutes of shadow traffic
            "Next": "EvaluateShadowMetrics",
        },
        "EvaluateShadowMetrics": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:evaluate-shadow-metrics",
            "Next": "ShadowDecision",
        },
        "ShadowDecision": {
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.shadow_passed",
                    "BooleanEquals": True,
                    "Next": "CanaryDeploy",
                }
            ],
            "Default": "RollbackShadow",
        },
        "CanaryDeploy": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:deploy-canary",
            "Parameters": {"traffic_percentage": 5},
            "Next": "WaitForCanary",
        },
        "WaitForCanary": {
            "Type": "Wait",
            "Seconds": 3600,  # 1 hour canary
            "Next": "EvaluateCanaryMetrics",
        },
        "EvaluateCanaryMetrics": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:evaluate-canary-metrics",
            "Next": "CheckModelType",
        },
        "CheckModelType": {
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.model_type",
                    "StringEquals": "llm-adapter",
                    "Next": "HumanApproval",
                }
            ],
            "Default": "FullDeploy",
        },
        "HumanApproval": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
            "Parameters": {
                "QueueUrl": "MODEL_APPROVAL_QUEUE_URL",
                "MessageBody": {
                    "pipeline_id.$": "$$.Execution.Id",
                    "model_type.$": "$.model_type",
                    "metrics.$": "$.canary_metrics",
                    "task_token.$": "$$.Task.Token",
                },
            },
            "Next": "FullDeploy",
        },
        "FullDeploy": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:promote-to-production",
            "Parameters": {"traffic_percentage": 100},
            "Next": "Success",
        },
        "Success": {"Type": "Succeed"},
        "RollbackShadow": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:rollback-model",
            "Next": "NotifyFailure",
        },
        "NotifyFailure": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:notify-failure",
            "Next": "Failed",
        },
        "Failed": {"Type": "Fail"},
    },
}

3. Shadow Deployment Implementation

Shadow deployment routes a copy of production traffic to the new model without affecting customer responses. The champion model always serves the actual response.

sequenceDiagram
    participant User
    participant Orchestrator
    participant Champion as Champion Model (v1.8)
    participant Shadow as Shadow Model (v1.9)
    participant Metrics as CloudWatch Metrics

    User->>Orchestrator: "I want to return my order"

    par Serve from Champion
        Orchestrator->>Champion: Classify intent
        Champion-->>Orchestrator: order_return (98ms)
        Orchestrator-->>User: Response based on champion
    and Mirror to Shadow (async)
        Orchestrator->>Shadow: Classify intent (async, fire-and-forget)
        Shadow-->>Metrics: Log: order_return (85ms)
    end

    Metrics->>Metrics: Compare champion vs shadow:<br/>accuracy, latency, agreement rate

# Shadow traffic router
import asyncio
from typing import Any

async def classify_intent_with_shadow(
    message: str,
    champion_endpoint: str,
    shadow_endpoint: str | None,
    shadow_percentage: float = 0.01,
) -> dict[str, Any]:
    """Route traffic to champion + optionally mirror to shadow."""
    import random

    # Always call champion (this serves the response)
    champion_result = await call_endpoint(champion_endpoint, message)

    # Probabilistically mirror to shadow (async, non-blocking)
    if shadow_endpoint and random.random() < shadow_percentage:
        asyncio.create_task(
            _shadow_call(shadow_endpoint, message, champion_result)
        )

    return champion_result


async def _shadow_call(
    shadow_endpoint: str,
    message: str,
    champion_result: dict,
) -> None:
    """Fire-and-forget shadow call. Log comparison metrics."""
    try:
        shadow_result = await call_endpoint(shadow_endpoint, message)

        # Log comparison metrics (never affects user response)
        publish_comparison_metric(
            champion_label=champion_result["intent"],
            shadow_label=shadow_result["intent"],
            champion_latency=champion_result["latency_ms"],
            shadow_latency=shadow_result["latency_ms"],
            agreement=champion_result["intent"] == shadow_result["intent"],
        )
    except Exception:
        # Shadow failures are logged but never affect production
        pass

4. Quality Gate Evaluation

# Quality gate evaluation for model promotion decisions
import numpy as np
from scipy import stats

def evaluate_model_quality(
    champion_metrics: dict,
    challenger_metrics: dict,
    model_type: str,
) -> dict:
    """
    Compare challenger against champion using statistical tests.
    Returns pass/fail decision with reasoning.
    """
    gates = []

    # Gate 1: Accuracy must not regress (one-sided test)
    if model_type in ("intent-classifier", "embedding"):
        champion_acc = champion_metrics["accuracy"]
        challenger_acc = challenger_metrics["accuracy"]

        # Allow ±0.5% tolerance (noise margin)
        accuracy_gate = challenger_acc >= (champion_acc - 0.005)
        gates.append({
            "gate": "accuracy_non_regression",
            "passed": accuracy_gate,
            "champion": f"{champion_acc:.4f}",
            "challenger": f"{challenger_acc:.4f}",
            "threshold": f">= {champion_acc - 0.005:.4f}",
        })

    # Gate 2: Latency P95 must stay within budget
    latency_budgets = {
        "intent-classifier": 150,   # ms — per architecture spec
        "embedding": 200,            # ms
        "llm-adapter": 1500,          # ms — first token
    }
    budget = latency_budgets[model_type]
    challenger_p95 = challenger_metrics["latency_p95_ms"]
    latency_gate = challenger_p95 <= budget
    gates.append({
        "gate": "latency_p95",
        "passed": latency_gate,
        "value": f"{challenger_p95:.0f}ms",
        "budget": f"{budget}ms",
    })

    # Gate 3: Captum attribution audit (intent classifier only)
    if model_type == "intent-classifier":
        captum_passed = challenger_metrics.get("captum_audit_passed", False)
        gates.append({
            "gate": "captum_attribution_audit",
            "passed": captum_passed,
            "detail": "Checks for lexical shortcut reliance",
        })

    # Gate 4: Agreement rate with champion (shadow traffic)
    if "agreement_rate" in challenger_metrics:
        agreement = challenger_metrics["agreement_rate"]
        # If agreement < 85%, something fundamentally changed — flag for review
        agreement_gate = agreement >= 0.85
        gates.append({
            "gate": "champion_agreement_rate",
            "passed": agreement_gate,
            "value": f"{agreement:.2%}",
            "threshold": ">= 85%",
        })

    # Gate 5: Statistical significance (canary traffic)
    if "canary_error_rate" in challenger_metrics:
        champion_errors = champion_metrics["canary_error_rate"]
        challenger_errors = challenger_metrics["canary_error_rate"]

        # Two-proportion z-test
        n = challenger_metrics.get("canary_sample_size", 1000)
        z_stat, p_value = stats.proportions_ztest(
            [int(challenger_errors * n), int(champion_errors * n)],
            [n, n],
            alternative="larger",  # Test if challenger is worse
        )
        sig_gate = p_value > 0.05  # Not significantly worse
        gates.append({
            "gate": "statistical_significance",
            "passed": sig_gate,
            "p_value": f"{p_value:.4f}",
            "detail": "Challenger not significantly worse than champion (p > 0.05)",
        })

    all_passed = all(g["passed"] for g in gates)

    return {
        "overall_passed": all_passed,
        "gates": gates,
        "recommendation": "PROMOTE" if all_passed else "REJECT",
    }

5. Model-Specific Deployment Targets

flowchart TD
    subgraph "Intent Classifier"
        IC1[SageMaker Endpoint<br/>ml.inf1.xlarge — Inferentia]
        IC2["Neuron-compiled DistilBERT<br/>INT8 quantized"]
        IC3["Auto-scaling: 1-4 instances<br/>Target: 150ms P95"]
    end

    subgraph "Embedding Models"
        EM1[SageMaker Endpoint<br/>ml.g5.xlarge — GPU]
        EM2["multilingual-e5-large<br/>+ Titan Embeddings via Bedrock"]
        EM3["Auto-scaling: 1-2 instances<br/>Target: 200ms P95"]
    end

    subgraph "LLM Adapter"
        LA1["Option A: vLLM on SageMaker<br/>ml.g5.12xlarge (4x A10G)"]
        LA2["Option B: Bedrock Custom Model<br/>Fine-tuned via Bedrock"]
        LA3["Adapter hot-swap via LoRA<br/>No endpoint restart needed"]
    end

Critical Decisions

Decision 1: ML Orchestration — SageMaker Pipelines vs Step Functions vs Kubeflow

Criteria (Weight)	SageMaker Pipelines	Step Functions	Kubeflow
ML-Specific Features (25%)	10/10 — built for ML workflows	5/10 — generic orchestrator	9/10 — ML-native
AWS Integration (20%)	10/10 — native SageMaker	9/10 — any AWS service	4/10 — needs EKS
Operational Overhead (20%)	7/10 — managed service	9/10 — serverless	3/10 — must run K8s cluster
Flexibility (15%)	6/10 — limited to ML steps	10/10 — any compute	8/10 — arbitrary containers
Cost (10%)	7/10 — per-step pricing	9/10 — $0.025/1000 transitions	3/10 — EKS cluster cost
Team Size Fit (10%)	8/10 — no infra to manage	9/10 — zero ops	2/10 — needs K8s expertise
Weighted Score	8.2/10	8.0/10	5.0/10

Decision: Step Functions (with SageMaker SDK calls)

Rationale: Despite SageMaker Pipelines scoring slightly higher on ML-specific features, Step Functions was chosen because:

The deployment pipeline is not a training pipeline — it orchestrates deployments (shadow → canary → promote), which is a workflow problem, not an ML problem
Step Functions can call SageMaker APIs natively — CreateEndpoint, UpdateEndpointWeightsAndCapacities, etc. are direct SDK integrations
Step Functions also orchestrates non-ML steps — Lambda quality gates, SNS notifications, SQS human approval, CloudWatch metric queries — all first-class integrations
Operational simplicity — $0.025/1000 state transitions, no infrastructure to manage, visual debugging via console

Why not SageMaker Pipelines? SageMaker Pipelines excels at training orchestration (data processing → training → evaluation → registration). But the deployment pipeline needs conditional logic (check model type → route to different approval flows), human approval via SQS, and CloudWatch metric evaluation — all of which are awkward in SageMaker Pipelines but native in Step Functions.

Why not Kubeflow? Running a Kubernetes cluster (EKS) for ML orchestration when the entire application runs on ECS Fargate introduces a second container orchestrator. For a 1-2 person team, this is an unacceptable operational burden.

Decision 2: Deployment Pattern — Shadow vs Canary vs Champion-Challenger

Pattern	How It Works	Risk Exposure	Cost	Validation Quality
Shadow	Mirror traffic to new model; champion always serves	Zero (shadow is invisible)	2x compute during test	Medium — no real user feedback
Canary	Route small % of real traffic to new model	Low (5% of users)	1.05x compute	High — real user responses
Champion-Challenger	Persistent A/B test between models	Medium (50/50 split)	2x compute (persistent)	Highest — statistically significant

Decision: Shadow → Canary → (optional A/B for LLM)

flowchart LR
    A[New Model] --> B["Shadow (1%, 30 min)<br/>Zero risk validation"]
    B --> C["Canary (5%, 1 hour)<br/>Low risk, real feedback"]
    C --> D{Model Type?}
    D -->|"Intent/Embedding"| E["Auto-promote to 100%"]
    D -->|"LLM Adapter"| F["A/B Test (50/50, 1 week)<br/>Statistical significance"]
    F --> G["Human review → Promote or Reject"]

Rationale: The three-stage approach matches risk to model type: - Shadow first (all models): Validates latency, error rates, and basic correctness without any customer impact. Catches most deployment failures (wrong model artifact, OOM, inference errors). - Canary second (all models): Validates real-world performance on actual user queries. Catches subtle quality regressions that shadow can't detect (e.g., model is slower under real traffic patterns). - A/B test third (LLM only): LLM changes affect every response. A week-long A/B test with human evaluation is necessary to catch subtle quality shifts (tone, helpfulness, hallucination rate) that automated metrics miss.

Decision 3: Model Approval — Fully Automated vs Human-in-the-Loop

Criteria	Fully Automated	Human-in-the-Loop	Tiered (Our Choice)
Speed	Fast (minutes)	Slow (hours to days)	Depends on model type
Risk	Higher — no human judgment	Lower — expert review	Balanced
Scalability	Scales with pipeline	Bottleneck on reviewers	Scales for low-risk, controlled for high-risk
Cost	Low	High (engineer time)	Medium
Compliance	May not satisfy auditors	Satisfies compliance	Satisfies with clear trails

Decision: Tiered approval — automated for intent/embedding, human for LLM adapters

Model Type	Quality Gates (Automated)	Approval	Rationale
Intent Classifier	Accuracy ≥ champion − 0.5%, P95 < 150ms, Captum pass	Automated	Well-scoped output (intent labels), easily validated by metrics
Embedding Model	Recall@10 ≥ champion, P95 < 200ms	Automated	Output is embeddings — quality measured by retrieval metrics
Cross-Encoder	NDCG ≥ champion, P95 < 100ms	Automated	Ranking changes are measurable and bounded
LLM LoRA Adapter	All automated gates + 1-week A/B + human eval	Human required	Open-ended text generation — subtle quality shifts need human judgment

Tradeoffs

The Debate: Model Deployment Speed vs Quality Assurance

graph TD
    subgraph "Data Science Lead"
        DS1["Model accuracy improved 3%"]
        DS2["Customers are misclassified NOW"]
        DS3["Our improvement is proven on eval set"]
    end

    subgraph "Architect"
        DS1 ---|"But..."| AR1
        AR1["Eval set ≠ production traffic"]
        AR2["Shadow + canary takes 2 hours minimum"]
        AR3["Last fast-tracked model caused 5% error spike"]
    end

    subgraph "Product Manager"
        DS2 ---|"Agrees"| PM1
        PM1["Customers filing support tickets"]
        PM2["Every day delayed = lost CSAT"]
        PM3["Can we shorten the canary?"]
    end

    AR3 ---|"Tension"| PM3
    DS3 ---|"Partial answer to"| AR1

Resolution

The tiered deployment pipeline resolves this tension:

Model Change Magnitude	Pipeline Path	Total Time	Rationale
Hotfix (< 0.5% accuracy change, bug fix)	Validate → Canary (15 min) → Promote	30 minutes	Known-safe change, rapid recovery
Minor (0.5-2% change)	Validate → Shadow (30 min) → Canary (1 hr) → Promote	2 hours	Standard improvement, automated gates sufficient
Major (> 2% change or new architecture)	Validate → Shadow → Canary → A/B (1 week) → Human	1-2 weeks	Fundamental change requires statistical validation

Key Tradeoff: Single vs Multi-Model Endpoint

Approach	Pros	Cons	Cost
Single endpoint, update in-place	Simple, low cost	Downtime during update (30-60s)	1x
Multi-model endpoint	A/B testing native, zero-downtime	Higher memory usage, routing complexity	1.3x
Separate endpoints per version	Full isolation, independent scaling	Most expensive, endpoint proliferation	2x

Decision: Multi-model endpoint for intent classifier and embeddings (high frequency updates), separate endpoints for LLM (too large for multi-model).