# CD-03: ML Model Deployment Pipeline
## User Story
As an ML Engineer / Senior Developer on the MangaAssist AI Chatbot team,
I want to establish an automated ML model deployment pipeline that safely promotes trained models from the SageMaker Model Registry through shadow deployment, canary validation, and A/B testing to full production,
So that model updates (intent classifier, embedding models, LLM LoRA adapters) are deployed behind rigorous quality gates, preventing model regressions from reaching customers while maintaining a weekly update cadence for the intent classifier.
## Acceptance Criteria
- Models are registered in SageMaker Model Registry with versioning and metadata
- Every model promotion triggers a shadow deployment (1% traffic, no customer impact)
- Shadow metrics are compared against the champion model using automated statistical tests
- Canary deployment (5% real traffic) runs for a minimum of 1 hour before auto-promotion
- Quality gates include: accuracy threshold, latency P95, attribution audit (Captum), and hallucination rate
- Human-in-the-loop approval is required for LLM adapter promotions; intent classifier promotions are fully automated
- Automated rollback triggers if canary metrics degrade beyond thresholds (see the alarm sketch after this list)
- Model deployment history is tracked with full lineage (training data version, hyperparameters, metrics)
- A/B testing framework supports comparing 2+ model versions on real traffic
- Pipeline supports all three model types: intent classifier (Inferentia), embeddings (SageMaker), LLM adapters (Bedrock/vLLM)
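A minimal sketch of the rollback trigger wiring, assuming CloudWatch alarms on the challenger variant feed an SNS rollback topic; the alarm name, endpoint/variant names, and thresholds here are illustrative, not the project's actual configuration:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical canary guardrail: alarm if the challenger variant returns
# 5xx errors for three consecutive 5-minute periods during the canary hour.
cloudwatch.put_metric_alarm(
    AlarmName="mangaassist-intent-canary-5xx",          # assumed naming scheme
    Namespace="AWS/SageMaker",
    MetricName="Invocation5XXErrors",
    Dimensions=[
        {"Name": "EndpointName", "Value": "mangaassist-intent"},  # placeholder
        {"Name": "VariantName", "Value": "challenger"},
    ],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=3,
    Threshold=5,                                        # illustrative count
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:REGION:ACCOUNT:model-rollback"],   # placeholder
)
```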
## High-Level Design
### Model Deployment Landscape
```mermaid
graph TB
subgraph "Model Types"
IC["Intent Classifier<br/>DistilBERT on Inferentia<br/>Weekly updates"]
EMB["Embedding Models<br/>Titan + e5-large<br/>Monthly updates"]
LLM["LLM LoRA Adapters<br/>Llama 70B via vLLM<br/>Quarterly updates"]
end
subgraph "Deployment Pipeline"
REG["SageMaker Model Registry"]
SHADOW["Shadow Deployment (1%)"]
CANARY["Canary Deployment (5%)"]
AB["A/B Test (optional)"]
PROD["Full Production (100%)"]
end
IC --> REG
EMB --> REG
LLM --> REG
REG --> SHADOW
SHADOW -->|Auto-pass| CANARY
CANARY -->|Auto-pass for IC/EMB| PROD
CANARY -->|Human approval for LLM| AB
AB --> PROD
style IC fill:#ff9900,color:#000
style EMB fill:#146eb4,color:#fff
style LLM fill:#8C4FFF,color:#fff
```
### Deployment Cadence and Risk Profile
| Model Type | Cadence | Training Cost | Risk Level | Approval |
|---|---|---|---|---|
| Intent Classifier (DistilBERT) | Weekly | 0.5 GPU-hr ($1.60) | Medium — misroutes queries | Automated |
| Embedding Adapter (e5-large) | Monthly | 2-4 GPU-hr ($6-13) | Medium — affects RAG quality | Automated |
| Cross-Encoder Reranker | Monthly | 2 GPU-hr ($6.40) | Low — affects ranking only | Automated |
| LLM LoRA Adapter (Llama 70B) | Quarterly | 48 GPU-hr ($154) | HIGH — affects all responses | Human required |
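The cadence/risk table can double as the pipeline's policy input rather than staying documentation-only. A minimal sketch of that encoding; the structure and field names are an illustrative assumption, not an existing config:

```python
# Hypothetical deployment policy derived from the cadence/risk table above.
DEPLOYMENT_POLICY: dict[str, dict] = {
    "intent-classifier": {
        "cadence": "weekly",
        "shadow_seconds": 1800,   # 30 min shadow (see state machine below)
        "canary_seconds": 3600,   # 1 hour canary
        "approval": "automated",
    },
    "embedding": {
        "cadence": "monthly",
        "shadow_seconds": 1800,
        "canary_seconds": 3600,
        "approval": "automated",
    },
    "llm-adapter": {
        "cadence": "quarterly",
        "shadow_seconds": 1800,
        "canary_seconds": 3600,
        "approval": "human",      # plus the 1-week A/B test per Decision 3
    },
}

def approval_mode(model_type: str) -> str:
    """Look up the approval path for a registered model type."""
    return DEPLOYMENT_POLICY[model_type]["approval"]
```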
## Low-Level Design
### 1. Model Registration in SageMaker Model Registry
```python
import boto3

def register_model(
    model_artifact_uri: str,
    model_type: str,  # 'intent-classifier' | 'embedding' | 'llm-adapter'
    training_job_name: str,
    metrics: dict,
) -> str:
    """Register a trained model with full lineage metadata."""
    sm_client = boto3.client("sagemaker")
    model_package_group = f"mangaassist-{model_type}"
    response = sm_client.create_model_package(
        ModelPackageGroupName=model_package_group,
        ModelPackageDescription=f"Trained by {training_job_name}",
        InferenceSpecification={
            "Containers": [{
                "Image": get_inference_image(model_type),
                "ModelDataUrl": model_artifact_uri,
            }],
            "SupportedContentTypes": ["application/json"],
            "SupportedResponseMIMETypes": ["application/json"],
        },
        # Attach the evaluation report so the registry carries the metrics
        ModelMetrics={
            "ModelQuality": {
                "Statistics": {
                    "ContentType": "application/json",
                    "S3Uri": f"s3://mangaassist-models/metrics/{training_job_name}/eval.json",
                },
            },
        },
        # LLM adapters wait for human sign-off; other types are pre-approved
        ModelApprovalStatus="PendingManualApproval" if model_type == "llm-adapter" else "Approved",
        CustomerMetadataProperties={
            "model_type": model_type,
            "training_job": training_job_name,
            "accuracy": str(metrics.get("accuracy", "N/A")),
            "f1_score": str(metrics.get("f1_score", "N/A")),
            "latency_p95_ms": str(metrics.get("latency_p95_ms", "N/A")),
            "training_data_version": metrics.get("data_version", "unknown"),
            "captum_audit": str(metrics.get("captum_audit_passed", False)),
        },
    )
    return response["ModelPackageArn"]
```
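Registration alone does not start a deployment. One plausible wiring, sketched under assumptions (rule name, ARNs, and the prefix filter are placeholders), is an EventBridge rule on the registry's "SageMaker Model Package State Change" event that starts the Step Functions pipeline defined next:

```python
import json
import boto3

events = boto3.client("events")

# Hypothetical rule: fire whenever a mangaassist model package changes state.
events.put_rule(
    Name="mangaassist-model-registered",
    EventPattern=json.dumps({
        "source": ["aws.sagemaker"],
        "detail-type": ["SageMaker Model Package State Change"],
        "detail": {"ModelPackageGroupName": [{"prefix": "mangaassist-"}]},
    }),
)
# Route matching events into the deployment state machine.
events.put_targets(
    Rule="mangaassist-model-registered",
    Targets=[{
        "Id": "model-deploy-pipeline",
        "Arn": "arn:aws:states:REGION:ACCOUNT:stateMachine:model-deploy",  # placeholder
        "RoleArn": "arn:aws:iam::ACCOUNT:role/events-to-sfn",              # placeholder
    }],
)
```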
### 2. Pipeline Orchestration via Step Functions
```mermaid
stateDiagram-v2
[*] --> ModelRegistered
ModelRegistered --> ValidateArtifacts
ValidateArtifacts --> ShadowDeploy: Artifacts valid
ValidateArtifacts --> Failed: Artifacts corrupt/missing
ShadowDeploy --> ShadowEval: 1% shadow traffic (30 min)
ShadowEval --> CanaryDeploy: Shadow metrics pass
ShadowEval --> Rollback: Shadow metrics fail
CanaryDeploy --> CanaryMonitor: 5% real traffic
CanaryMonitor --> WaitForCanary: Monitor 1 hour
WaitForCanary --> CheckModelType: Canary metrics pass
WaitForCanary --> Rollback: Canary metrics fail
CheckModelType --> AutoPromote: Intent classifier / Embedding
CheckModelType --> HumanApproval: LLM adapter
HumanApproval --> FullDeploy: Approved
HumanApproval --> Rollback: Rejected
AutoPromote --> FullDeploy
FullDeploy --> Production: 100% traffic
Rollback --> [*]: Champion model restored
Production --> [*]: New champion
Failed --> [*]: Alert sent
```
Step Functions Definition (key states):
```python
# Step Functions state machine definition
MODEL_PIPELINE_DEFINITION = {
"Comment": "ML Model Deployment Pipeline",
"StartAt": "ValidateArtifacts",
"States": {
"ValidateArtifacts": {
"Type": "Task",
"Resource": "arn:aws:lambda:REGION:ACCOUNT:function:validate-model-artifacts",
"Next": "ShadowDeploy",
"Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
},
"ShadowDeploy": {
"Type": "Task",
"Resource": "arn:aws:lambda:REGION:ACCOUNT:function:deploy-shadow-endpoint",
"Next": "WaitForShadowTraffic",
},
"WaitForShadowTraffic": {
"Type": "Wait",
"Seconds": 1800, # 30 minutes of shadow traffic
"Next": "EvaluateShadowMetrics",
},
"EvaluateShadowMetrics": {
"Type": "Task",
"Resource": "arn:aws:lambda:REGION:ACCOUNT:function:evaluate-shadow-metrics",
"Next": "ShadowDecision",
},
"ShadowDecision": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.shadow_passed",
"BooleanEquals": True,
"Next": "CanaryDeploy",
}
],
"Default": "RollbackShadow",
},
"CanaryDeploy": {
"Type": "Task",
"Resource": "arn:aws:lambda:REGION:ACCOUNT:function:deploy-canary",
"Parameters": {"traffic_percentage": 5},
"Next": "WaitForCanary",
},
"WaitForCanary": {
"Type": "Wait",
"Seconds": 3600, # 1 hour canary
"Next": "EvaluateCanaryMetrics",
},
"EvaluateCanaryMetrics": {
"Type": "Task",
"Resource": "arn:aws:lambda:REGION:ACCOUNT:function:evaluate-canary-metrics",
"Next": "CheckModelType",
},
"CheckModelType": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.model_type",
"StringEquals": "llm-adapter",
"Next": "HumanApproval",
}
],
"Default": "FullDeploy",
},
"HumanApproval": {
"Type": "Task",
"Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
"Parameters": {
"QueueUrl": "MODEL_APPROVAL_QUEUE_URL",
"MessageBody": {
"pipeline_id.$": "$$.Execution.Id",
"model_type.$": "$.model_type",
"metrics.$": "$.canary_metrics",
"task_token.$": "$$.Task.Token",
},
},
"Next": "FullDeploy",
},
"FullDeploy": {
"Type": "Task",
"Resource": "arn:aws:lambda:REGION:ACCOUNT:function:promote-to-production",
"Parameters": {"traffic_percentage": 100},
"Next": "Success",
},
"Success": {"Type": "Succeed"},
"RollbackShadow": {
"Type": "Task",
"Resource": "arn:aws:lambda:REGION:ACCOUNT:function:rollback-model",
"Next": "NotifyFailure",
},
"NotifyFailure": {
"Type": "Task",
"Resource": "arn:aws:lambda:REGION:ACCOUNT:function:notify-failure",
"Next": "Failed",
},
"Failed": {"Type": "Fail"},
},
}
```
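The `deploy-canary` and `promote-to-production` Lambdas referenced above largely reduce to one SageMaker call that re-weights endpoint variants. A minimal sketch, assuming `champion` and `challenger` production variants (variant names are illustrative):

```python
import boto3

sm = boto3.client("sagemaker")

def shift_traffic(endpoint_name: str, challenger_weight: int) -> None:
    """Move `challenger_weight`% of traffic to the challenger variant.

    Called with 5 by deploy-canary and 100 by promote-to-production;
    calling with 0 restores the champion, which is the rollback path.
    """
    sm.update_endpoint_weights_and_capacities(
        EndpointName=endpoint_name,
        DesiredWeightsAndCapacities=[
            {"VariantName": "champion", "DesiredWeight": 100 - challenger_weight},
            {"VariantName": "challenger", "DesiredWeight": challenger_weight},
        ],
    )
```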
### 3. Shadow Deployment Implementation
Shadow deployment routes a copy of production traffic to the new model without affecting customer responses. The champion model always serves the actual response.
```mermaid
sequenceDiagram
participant User
participant Orchestrator
participant Champion as Champion Model (v1.8)
participant Shadow as Shadow Model (v1.9)
participant Metrics as CloudWatch Metrics
User->>Orchestrator: "I want to return my order"
par Serve from Champion
Orchestrator->>Champion: Classify intent
Champion-->>Orchestrator: order_return (98ms)
Orchestrator-->>User: Response based on champion
and Mirror to Shadow (async)
Orchestrator->>Shadow: Classify intent (async, fire-and-forget)
Shadow-->>Metrics: Log: order_return (85ms)
end
Metrics->>Metrics: Compare champion vs shadow:<br/>accuracy, latency, agreement rate
```
```python
# Shadow traffic router
import asyncio
import random
from typing import Any
async def classify_intent_with_shadow(
message: str,
champion_endpoint: str,
shadow_endpoint: str | None,
shadow_percentage: float = 0.01,
) -> dict[str, Any]:
"""Route traffic to champion + optionally mirror to shadow."""
# Always call champion (this serves the response)
champion_result = await call_endpoint(champion_endpoint, message)
# Probabilistically mirror to shadow (async, non-blocking)
if shadow_endpoint and random.random() < shadow_percentage:
asyncio.create_task(
_shadow_call(shadow_endpoint, message, champion_result)
)
return champion_result
async def _shadow_call(
shadow_endpoint: str,
message: str,
champion_result: dict,
) -> None:
"""Fire-and-forget shadow call. Log comparison metrics."""
try:
shadow_result = await call_endpoint(shadow_endpoint, message)
# Log comparison metrics (never affects user response)
publish_comparison_metric(
champion_label=champion_result["intent"],
shadow_label=shadow_result["intent"],
champion_latency=champion_result["latency_ms"],
shadow_latency=shadow_result["latency_ms"],
agreement=champion_result["intent"] == shadow_result["intent"],
)
except Exception:
# Shadow failures are logged but never affect production
        pass
```
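`call_endpoint` is used above but not shown. A minimal sketch, assuming a JSON-in/JSON-out endpoint that returns a dict with an `intent` field; the synchronous boto3 call is pushed onto a thread so it can be awaited:

```python
import asyncio
import json
import time
import boto3

smr = boto3.client("sagemaker-runtime")

async def call_endpoint(endpoint_name: str, message: str) -> dict:
    """Invoke a SageMaker endpoint; return its parsed JSON plus latency."""
    def _invoke() -> dict:
        start = time.perf_counter()
        resp = smr.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/json",
            Body=json.dumps({"inputs": message}),
        )
        payload = json.loads(resp["Body"].read())
        payload["latency_ms"] = (time.perf_counter() - start) * 1000
        return payload

    # boto3 is synchronous; run it off the event loop
    return await asyncio.to_thread(_invoke)
```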
### 4. Quality Gate Evaluation
```python
# Quality gate evaluation for model promotion decisions
# Note: the two-proportion z-test lives in statsmodels, not scipy
from statsmodels.stats.proportion import proportions_ztest
def evaluate_model_quality(
champion_metrics: dict,
challenger_metrics: dict,
model_type: str,
) -> dict:
"""
Compare challenger against champion using statistical tests.
Returns pass/fail decision with reasoning.
"""
gates = []
# Gate 1: Accuracy must not regress (one-sided test)
if model_type in ("intent-classifier", "embedding"):
champion_acc = champion_metrics["accuracy"]
challenger_acc = challenger_metrics["accuracy"]
# Allow ±0.5% tolerance (noise margin)
accuracy_gate = challenger_acc >= (champion_acc - 0.005)
gates.append({
"gate": "accuracy_non_regression",
"passed": accuracy_gate,
"champion": f"{champion_acc:.4f}",
"challenger": f"{challenger_acc:.4f}",
"threshold": f">= {champion_acc - 0.005:.4f}",
})
# Gate 2: Latency P95 must stay within budget
latency_budgets = {
"intent-classifier": 150, # ms — per architecture spec
"embedding": 200, # ms
"llm-adapter": 1500, # ms — first token
}
budget = latency_budgets[model_type]
challenger_p95 = challenger_metrics["latency_p95_ms"]
latency_gate = challenger_p95 <= budget
gates.append({
"gate": "latency_p95",
"passed": latency_gate,
"value": f"{challenger_p95:.0f}ms",
"budget": f"{budget}ms",
})
# Gate 3: Captum attribution audit (intent classifier only)
if model_type == "intent-classifier":
captum_passed = challenger_metrics.get("captum_audit_passed", False)
gates.append({
"gate": "captum_attribution_audit",
"passed": captum_passed,
"detail": "Checks for lexical shortcut reliance",
})
# Gate 4: Agreement rate with champion (shadow traffic)
if "agreement_rate" in challenger_metrics:
agreement = challenger_metrics["agreement_rate"]
# If agreement < 85%, something fundamentally changed — flag for review
agreement_gate = agreement >= 0.85
gates.append({
"gate": "champion_agreement_rate",
"passed": agreement_gate,
"value": f"{agreement:.2%}",
"threshold": ">= 85%",
})
# Gate 5: Statistical significance (canary traffic)
if "canary_error_rate" in challenger_metrics:
champion_errors = champion_metrics["canary_error_rate"]
challenger_errors = challenger_metrics["canary_error_rate"]
# Two-proportion z-test
n = challenger_metrics.get("canary_sample_size", 1000)
        z_stat, p_value = proportions_ztest(
[int(challenger_errors * n), int(champion_errors * n)],
[n, n],
alternative="larger", # Test if challenger is worse
)
sig_gate = p_value > 0.05 # Not significantly worse
gates.append({
"gate": "statistical_significance",
"passed": sig_gate,
"p_value": f"{p_value:.4f}",
"detail": "Challenger not significantly worse than champion (p > 0.05)",
})
all_passed = all(g["passed"] for g in gates)
return {
"overall_passed": all_passed,
"gates": gates,
"recommendation": "PROMOTE" if all_passed else "REJECT",
    }
```
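For illustration, a call with invented canary numbers:

```python
verdict = evaluate_model_quality(
    champion_metrics={"accuracy": 0.942, "canary_error_rate": 0.004},
    challenger_metrics={
        "accuracy": 0.951,
        "latency_p95_ms": 132,
        "captum_audit_passed": True,
        "agreement_rate": 0.93,
        "canary_error_rate": 0.005,
        "canary_sample_size": 12000,
    },
    model_type="intent-classifier",
)
print(verdict["recommendation"])  # PROMOTE or REJECT
```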
### 5. Model-Specific Deployment Targets
```mermaid
flowchart TD
subgraph "Intent Classifier"
IC1[SageMaker Endpoint<br/>ml.inf1.xlarge — Inferentia]
IC2["Neuron-compiled DistilBERT<br/>INT8 quantized"]
IC3["Auto-scaling: 1-4 instances<br/>Target: 150ms P95"]
end
subgraph "Embedding Models"
EM1[SageMaker Endpoint<br/>ml.g5.xlarge — GPU]
EM2["multilingual-e5-large<br/>+ Titan Embeddings via Bedrock"]
EM3["Auto-scaling: 1-2 instances<br/>Target: 200ms P95"]
end
subgraph "LLM Adapter"
LA1["Option A: vLLM on SageMaker<br/>ml.g5.12xlarge (4x A10G)"]
LA2["Option B: Bedrock Custom Model<br/>Fine-tuned via Bedrock"]
LA3["Adapter hot-swap via LoRA<br/>No endpoint restart needed"]
end
```
## Critical Decisions
### Decision 1: ML Orchestration — SageMaker Pipelines vs Step Functions vs Kubeflow
| Criteria (Weight) | SageMaker Pipelines | Step Functions | Kubeflow |
|---|---|---|---|
| ML-Specific Features (25%) | 10/10 — built for ML workflows | 5/10 — generic orchestrator | 9/10 — ML-native |
| AWS Integration (20%) | 10/10 — native SageMaker | 9/10 — any AWS service | 4/10 — needs EKS |
| Operational Overhead (20%) | 7/10 — managed service | 9/10 — serverless | 3/10 — must run K8s cluster |
| Flexibility (15%) | 6/10 — limited to ML steps | 10/10 — any compute | 8/10 — arbitrary containers |
| Cost (10%) | 7/10 — per-step pricing | 9/10 — $0.025/1000 transitions | 3/10 — EKS cluster cost |
| Team Size Fit (10%) | 8/10 — no infra to manage | 9/10 — zero ops | 2/10 — needs K8s expertise |
| Weighted Score | 8.3/10 | 8.2/10 | 5.4/10 |
Decision: Step Functions (with SageMaker SDK calls)
Rationale: Despite SageMaker Pipelines scoring slightly higher overall, Step Functions was chosen because:
- The deployment pipeline is not a training pipeline — it orchestrates deployments (shadow → canary → promote), which is a workflow problem, not an ML problem
- Step Functions can call SageMaker APIs natively — CreateEndpoint, UpdateEndpointWeightsAndCapacities, etc. are direct SDK integrations
- Step Functions also orchestrates non-ML steps — Lambda quality gates, SNS notifications, SQS human approval, CloudWatch metric queries — all first-class integrations
- Operational simplicity — $0.025/1000 state transitions, no infrastructure to manage, visual debugging via console
Why not SageMaker Pipelines? SageMaker Pipelines excels at training orchestration (data processing → training → evaluation → registration). But the deployment pipeline needs conditional logic (check model type → route to different approval flows), human approval via SQS, and CloudWatch metric evaluation — all of which are awkward in SageMaker Pipelines but native in Step Functions.
Why not Kubeflow? Running a Kubernetes cluster (EKS) for ML orchestration when the entire application runs on ECS Fargate introduces a second container orchestrator. For a 1-2 person team, this is an unacceptable operational burden.
### Decision 2: Deployment Pattern — Shadow vs Canary vs Champion-Challenger
| Pattern | How It Works | Risk Exposure | Cost | Validation Quality |
|---|---|---|---|---|
| Shadow | Mirror traffic to new model; champion always serves | Zero (shadow is invisible) | 2x compute during test | Medium — no real user feedback |
| Canary | Route small % of real traffic to new model | Low (5% of users) | 1.05x compute | High — real user responses |
| Champion-Challenger | Persistent A/B test between models | Medium (50/50 split) | 2x compute (persistent) | Highest — statistically significant |
Decision: Shadow → Canary → (optional A/B for LLM)
```mermaid
flowchart LR
A[New Model] --> B["Shadow (1%, 30 min)<br/>Zero risk validation"]
B --> C["Canary (5%, 1 hour)<br/>Low risk, real feedback"]
C --> D{Model Type?}
D -->|"Intent/Embedding"| E["Auto-promote to 100%"]
D -->|"LLM Adapter"| F["A/B Test (50/50, 1 week)<br/>Statistical significance"]
F --> G["Human review → Promote or Reject"]
```
Rationale: The three-stage approach matches risk to model type:
- Shadow first (all models): Validates latency, error rates, and basic correctness without any customer impact. Catches most deployment failures (wrong model artifact, OOM, inference errors).
- Canary second (all models): Validates real-world performance on actual user queries. Catches subtle quality regressions that shadow can't detect (e.g., the model is slower under real traffic patterns).
- A/B test third (LLM only): LLM changes affect every response. A week-long A/B test with human evaluation is necessary to catch subtle quality shifts (tone, helpfulness, hallucination rate) that automated metrics miss. Assignment must be sticky per user, as sketched below.
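A minimal hash-bucketing sketch for deterministic A/B assignment, so a returning user never flips between models mid-conversation (the 50/50 split is illustrative):

```python
import hashlib

def get_ab_group(user_id: str, experiment_id: str) -> str:
    """Deterministic A/B assignment: the same user always lands in the
    same group for a given experiment, keeping conversations consistent."""
    hash_input = f"{user_id}:{experiment_id}"
    hash_value = int(hashlib.sha256(hash_input.encode()).hexdigest(), 16)
    return "challenger" if (hash_value % 100) < 50 else "champion"
```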
### Decision 3: Model Approval — Fully Automated vs Human-in-the-Loop
| Criteria | Fully Automated | Human-in-the-Loop | Tiered (Our Choice) |
|---|---|---|---|
| Speed | Fast (minutes) | Slow (hours to days) | Depends on model type |
| Risk | Higher — no human judgment | Lower — expert review | Balanced |
| Scalability | Scales with pipeline | Bottleneck on reviewers | Scales for low-risk, controlled for high-risk |
| Cost | Low | High (engineer time) | Medium |
| Compliance | May not satisfy auditors | Satisfies compliance | Satisfies with clear trails |
Decision: Tiered approval — automated for intent/embedding, human for LLM adapters
| Model Type | Quality Gates (Automated) | Approval | Rationale |
|---|---|---|---|
| Intent Classifier | Accuracy ≥ champion − 0.5%, P95 < 150ms, Captum pass | Automated | Well-scoped output (intent labels), easily validated by metrics |
| Embedding Model | Recall@10 ≥ champion, P95 < 200ms | Automated | Output is embeddings — quality measured by retrieval metrics |
| Cross-Encoder | NDCG ≥ champion, P95 < 100ms | Automated | Ranking changes are measurable and bounded |
| LLM LoRA Adapter | All automated gates + 1-week A/B + human eval | Human required | Open-ended text generation — subtle quality shifts need human judgment |
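The human-approval step parks the execution on a task token in SQS. The reviewer side can be a small Lambda; the sketch below assumes a hypothetical `approval_handler` payload shape with `decision`, `reviewer`, and `task_token` fields:

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

def approval_handler(event: dict, context) -> None:
    """Hypothetical handler behind the ML Lead's approve/reject action."""
    body = json.loads(event["body"])   # assumed payload shape
    token = body["task_token"]
    if body["decision"] == "approve":
        # Resume the pipeline; FullDeploy runs next
        sfn.send_task_success(
            taskToken=token,
            output=json.dumps({"approved_by": body["reviewer"]}),
        )
    else:
        # Surfaces as a state error, caught and routed to rollback
        sfn.send_task_failure(
            taskToken=token,
            error="ModelRejected",
            cause=body.get("reason", "Rejected by reviewer"),
        )
```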
## Tradeoffs
### The Debate: Model Deployment Speed vs Quality Assurance
```mermaid
graph TD
subgraph "Data Science Lead"
DS1["Model accuracy improved 3%"]
DS2["Customers are misclassified NOW"]
DS3["Our improvement is proven on eval set"]
end
subgraph "Architect"
DS1 ---|"But..."| AR1
AR1["Eval set ≠ production traffic"]
AR2["Shadow + canary takes 2 hours minimum"]
AR3["Last fast-tracked model caused 5% error spike"]
end
subgraph "Product Manager"
DS2 ---|"Agrees"| PM1
PM1["Customers filing support tickets"]
PM2["Every day delayed = lost CSAT"]
PM3["Can we shorten the canary?"]
end
AR3 ---|"Tension"| PM3
DS3 ---|"Partial answer to"| AR1
```
### Resolution
The tiered deployment pipeline resolves this tension:
| Model Change Magnitude | Pipeline Path | Total Time | Rationale |
|---|---|---|---|
| Hotfix (< 0.5% accuracy change, bug fix) | Validate → Canary (15 min) → Promote | 30 minutes | Known-safe change, rapid recovery |
| Minor (0.5-2% change) | Validate → Shadow (30 min) → Canary (1 hr) → Promote | 2 hours | Standard improvement, automated gates sufficient |
| Major (> 2% change or new architecture) | Validate → Shadow → Canary → A/B (1 week) → Human | 1-2 weeks | Fundamental change requires statistical validation |
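A sketch of how the pipeline could select a path from this table; the thresholds mirror the table, while the function and stage names are illustrative assumptions:

```python
def select_pipeline_path(accuracy_delta: float, new_architecture: bool) -> list[str]:
    """Map change magnitude to deployment stages (per the table above)."""
    if new_architecture or abs(accuracy_delta) > 0.02:
        return ["validate", "shadow", "canary", "ab_test", "human_approval"]
    if abs(accuracy_delta) >= 0.005:
        return ["validate", "shadow", "canary", "promote"]
    return ["validate", "canary", "promote"]  # hotfix fast path
```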
### Key Tradeoff: Single vs Multi-Model Endpoint
| Approach | Pros | Cons | Cost |
|---|---|---|---|
| Single endpoint, update in-place | Simple, low cost | Downtime during update (30-60s) | 1x |
| Multi-model endpoint | A/B testing native, zero-downtime | Higher memory usage, routing complexity | 1.3x |
| Separate endpoints per version | Full isolation, independent scaling | Most expensive, endpoint proliferation | 2x |
Decision: Multi-model endpoint for the intent classifier and embeddings (high-frequency updates); separate endpoints for the LLM (too large for multi-model hosting).
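## Monitoring This Pipeline

| Metric | Source | Alert Threshold |
|---|---|---|
| Model deployment success rate | Step Functions | < 90% over 30 days |
| Mean time from approval to production | Custom CloudWatch | > 48 hours |
| Canary rollback rate | Step Functions + CloudWatch | > 20% of deployments |
| Shadow vs champion agreement | S3 analytics | < 90% agreement |
| Model endpoint latency P95 | SageMaker CloudWatch | Varies by model type |
| Quality gate pass rate | Step Functions | < 80% (indicates training issues) |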