Scenarios and Runbooks — FM Customization and Lifecycle Management
MangaAssist context: JP Manga store chatbot on AWS — Bedrock Claude 3 (Sonnet at $3/$15 per 1M tokens input/output, Haiku at $0.25/$1.25), OpenSearch Serverless (vector store), DynamoDB (sessions/products), ECS Fargate (orchestrator), API Gateway WebSocket, ElastiCache Redis. Target: useful answer in under 3 seconds, 1M messages/day scale.
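For cost intuition at that scale, a quick back-of-envelope comparison of the two Bedrock tiers (the per-message token counts below are illustrative assumptions, not measurements):
import annotations  # noqa: delete this line if unused; sketch only

MESSAGES_PER_DAY = 1_000_000
IN_TOKENS, OUT_TOKENS = 500, 150  # assumed average tokens per message

def daily_cost(in_price_per_m: float, out_price_per_m: float) -> float:
    """Daily Bedrock spend in USD, given $/1M-token input/output prices."""
    return MESSAGES_PER_DAY * (IN_TOKENS * in_price_per_m + OUT_TOKENS * out_price_per_m) / 1_000_000

print(f"Sonnet: ${daily_cost(3.00, 15.00):,.2f}/day")  # $3,750.00/day
print(f"Haiku:  ${daily_cost(0.25, 1.25):,.2f}/day")   # $312.50/day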
Skill Mapping
| Dimension | Detail |
|---|---|
| Certification | AWS AIF-C01 — AI Practitioner |
| Domain | 1 — Foundation Model Integration, Data Management, and Compliance |
| Task | 1.2 — Select and configure FMs |
| Skill | 1.2.4 — Implement FM customization deployment and lifecycle management |
| This File | Five production scenarios with detection flowcharts, root cause analysis, resolution code, and prevention strategies |
Skill Scope Statement
This file presents five real-world failure scenarios that MangaAssist has encountered (or would encounter) in production when managing the lifecycle of fine-tuned and adapter-augmented foundation models. Each scenario covers a distinct phase of the FM customization lifecycle: version tracking, training data freshness, automated deployment gates, endpoint warm-pool management, and model retirement. Each scenario includes: a problem statement, a mermaid detection flowchart, root cause analysis, Python resolution code using the boto3 SageMaker client, and prevention measures. These runbooks are designed for on-call ML engineers and MLOps practitioners responding to alerts from SageMaker, CloudWatch, and deployment pipelines.
Mind Map
mindmap
root((FM Customization<br/>Lifecycle Failures))
Version Tracking Gap
No Model Registry Entry
Missing Lineage Metadata
Rollback Impossible
Stale Training Data
LoRA Adapter Trained on Old Catalog
New Titles Absent from Adapter
Recommendation Blind Spot
Missing Quality Gate
Loss Metric Passes
Business KPI Fails
CTR Regression
Cold Start Latency
Scale-to-Zero Policy
No Warm Pool
Session Timeout
Parallel Endpoint Confusion
Old Model Not Retired
Traffic Split Inconsistency
Divergent User Experience
Scenario Overview
| # | Scenario | Severity | Blast Radius | Typical Detection Time |
|---|---|---|---|---|
| 1 | Fine-tuned model deployed without Model Registry version tracking — rollback impossible after quality regression | P1 — Critical | All recommendation requests on affected endpoint | 15-30 minutes via user complaint spike |
| 2 | LoRA adapter trained on 6-month-old product catalog — new manga releases absent from recommendations | P2 — High | All users browsing or asking about recent releases | 24-72 hours via catalog coverage audit |
| 3 | Automated pipeline promotes model that passes loss metric but fails business KPI (CTR drops 15%) | P1 — Critical | All recommendation-driven sessions | 4-8 hours via A/B CTR dashboard |
| 4 | SageMaker endpoint cold start after scale-to-zero kills active user sessions — no warm pool configured | P2 — High | Users during off-peak hours at scale-up moment | 30-90 seconds per session timeout burst |
| 5 | Old fine-tuned endpoint not retired — parallel endpoints split traffic causing inconsistent recommendations | P2 — High | ~50% of users seeing stale behavior | Hours to days if not monitored |
Scenario 1: Fine-Tuned Model Deployed Without Version Tracking — Rollback Impossible
Problem
MangaAssist's ML team fine-tuned a manga recommendation model on SageMaker and deployed it directly to a real-time inference endpoint by uploading model.tar.gz to S3 and creating the SageMaker model and endpoint config by hand, bypassing the SageMaker Model Registry entirely. Three days later, user engagement metrics showed a 22% drop in recommendation acceptance rate. When the team tried to roll back, there was no registered model version, no approval trail, and no way to identify which training job had produced the previous artifact. The S3 bucket held dozens of model.tar.gz files with timestamp-only names and no lineage.
Detection
flowchart TD
A[Alert: Recommendation acceptance rate drops >10%] --> B{Is model version\nidentifiable?}
B -- No --> C[Check Model Registry for\nregistered model packages]
C --> D{Any package for\nthis endpoint?}
D -- None found --> E[Attempt to trace S3 artifact\nfrom endpoint config]
E --> F{Artifact matches\na training job?}
F -- No clear match --> G[BLOCKED: Cannot rollback\nwithout known prior artifact]
G --> H[Escalate: Manual artifact\nforensics required]
D -- Found --> I[Identify previous approved\npackage version]
I --> J[Redeploy previous version\nvia Model Registry]
F -- Match found --> K[Redeploy artifact manually\nwith risk of wrong version]
B -- Yes --> I
Root Cause
The deployment script used sagemaker.Model() directly with a hardcoded model_data S3 URI, skipping register() and the Model Registry approval workflow. No CI/CD step enforced registry registration as a prerequisite. Without Model Registry entries, SageMaker has no lineage graph linking endpoint → model package → training job → dataset, making rollback operationally impossible without manual investigation.
Resolution
import boto3
import json
from datetime import datetime, timezone
from botocore.exceptions import ClientError
sm_client = boto3.client("sagemaker", region_name="us-east-1")
ENDPOINT_NAME = "mangaassist-recommendation-endpoint"
MODEL_PACKAGE_GROUP = "mangaassist-recommendation-mpg"
def get_latest_approved_model_package(group_name: str, skip_version: str = None) -> dict:
"""Return the most recently approved model package, optionally skipping a known bad version."""
paginator = sm_client.get_paginator("list_model_packages")
approved = []
for page in paginator.paginate(
ModelPackageGroupName=group_name,
ModelApprovalStatus="Approved",
SortBy="CreationTime",
SortOrder="Descending",
):
for pkg in page["ModelPackageSummaryList"]:
if skip_version and pkg["ModelPackageArn"] == skip_version:
continue
approved.append(pkg)
if not approved:
raise ValueError(f"No approved model packages in group '{group_name}'")
return approved[0] # Most recent approved, skipping current bad version
def rollback_endpoint_to_previous_version(
endpoint_name: str, model_package_group: str
) -> str:
"""
Rollback endpoint to the previous approved model package version.
Returns the ARN of the model package deployed.
"""
# Identify the currently deployed model package ARN from endpoint config
try:
ep_desc = sm_client.describe_endpoint(EndpointName=endpoint_name)
current_config_name = ep_desc["EndpointConfigName"]
ep_config = sm_client.describe_endpoint_config(
EndpointConfigName=current_config_name
)
current_model_name = ep_config["ProductionVariants"][0]["ModelName"]
current_model_desc = sm_client.describe_model(ModelName=current_model_name)
current_pkg_arn = current_model_desc.get("PrimaryContainer", {}).get(
"ModelPackageName"
)
except ClientError as e:
print(f"[WARN] Could not determine current model package ARN: {e}")
current_pkg_arn = None
# Find previous approved version
previous_pkg = get_latest_approved_model_package(
model_package_group, skip_version=current_pkg_arn
)
previous_pkg_arn = previous_pkg["ModelPackageArn"]
print(f"[INFO] Rolling back to model package: {previous_pkg_arn}")
# Create a new model from the previous package
rollback_model_name = (
f"mangaassist-rollback-{datetime.now(timezone.utc).strftime('%Y%m%d%H%M%S')}"
)
sm_client.create_model(
ModelName=rollback_model_name,
PrimaryContainer={"ModelPackageName": previous_pkg_arn},
ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
)
# Create endpoint config using the rollback model
rollback_config_name = f"{rollback_model_name}-config"
sm_client.create_endpoint_config(
EndpointConfigName=rollback_config_name,
ProductionVariants=[
{
"VariantName": "AllTraffic",
"ModelName": rollback_model_name,
"InstanceType": "ml.m5.xlarge",
"InitialInstanceCount": 2,
"InitialVariantWeight": 1.0,
}
],
Tags=[
{"Key": "RollbackFrom", "Value": current_pkg_arn or "unknown"},
{"Key": "RollbackTo", "Value": previous_pkg_arn},
{"Key": "RollbackTimestamp", "Value": datetime.now(timezone.utc).isoformat()},
],
)
# Update the live endpoint
sm_client.update_endpoint(
EndpointName=endpoint_name,
EndpointConfigName=rollback_config_name,
)
print(f"[OK] Endpoint '{endpoint_name}' rollback initiated to {previous_pkg_arn}")
return previous_pkg_arn
if __name__ == "__main__":
deployed_arn = rollback_endpoint_to_previous_version(
ENDPOINT_NAME, MODEL_PACKAGE_GROUP
)
print(f"Rollback complete. Deployed package ARN: {deployed_arn}")
Prevention
- Enforce Model Registry registration in CI/CD: Add a pipeline step that calls `register()` before any `create_model()` or endpoint creation; fail the pipeline if no `ModelPackageArn` is returned (a sketch of this step follows the list).
- Deny direct `CreateModel` without a registry link: Use an IAM SCP or SageMaker condition key (`sagemaker:ModelPackageName`) to require all models to reference a registered package.
- Tag every endpoint config with `ModelPackageArn`, `TrainingJobName`, and `DatasetVersion` for forensic traceability.
- Automated approval gates: Require two-person approval in the Model Registry before the `Approved` status can be set, preserving an audit trail.
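A minimal sketch of that CI/CD registration gate, assuming the pipeline knows the training job name. The container image and content types are read back from the training job here for brevity; real pipelines usually supply a dedicated inference image. Failure raises, so the pipeline aborts before any `create_model()` call:
import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")

def register_model_or_fail(training_job_name: str) -> str:
    """Register the training job's artifact in the Model Registry, or abort the pipeline."""
    job = sm.describe_training_job(TrainingJobName=training_job_name)
    response = sm.create_model_package(
        ModelPackageGroupName="mangaassist-recommendation-mpg",
        ModelPackageDescription=f"Produced by training job {training_job_name}",
        InferenceSpecification={
            "Containers": [
                {
                    "Image": job["AlgorithmSpecification"]["TrainingImage"],
                    "ModelDataUrl": job["ModelArtifacts"]["S3ModelArtifacts"],
                }
            ],
            "SupportedContentTypes": ["application/json"],
            "SupportedResponseMIMETypes": ["application/json"],
        },
        # Approval itself stays with the two-person review described above
        ModelApprovalStatus="PendingManualApproval",
    )
    arn = response.get("ModelPackageArn")
    if not arn:
        raise RuntimeError("No ModelPackageArn returned; failing the pipeline")
    return arn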
Scenario 2: LoRA Adapter Trained on Stale Catalog — New Releases Missing from Recommendations
Problem
MangaAssist uses a LoRA adapter fine-tuned on top of a base embedding model to improve manga recommendation relevance. The adapter was trained in September on the product catalog snapshot at that date. By March, over 400 new manga titles had been added to the DynamoDB product table, but none appeared in the adapter's training corpus. Users searching for the latest releases — including several bestselling new series — received zero recommendations or irrelevant matches. The adapter had no awareness of titles published after the training cutoff.
Detection
flowchart TD
A[Alert: Zero-result rate for\n'new release' queries > 5%] --> B[Pull sample of zero-result queries]
B --> C{Do queried titles exist\nin DynamoDB products table?}
C -- Yes, titles exist --> D[Titles are in catalog\nbut missing from recommendations]
D --> E{Check adapter training\ndata manifest}
E --> F{Training cutoff date\nvs. product creation date}
F -- Title created AFTER cutoff --> G[ROOT CAUSE: LoRA adapter\nnot trained on recent titles]
G --> H[Calculate catalog coverage gap:\ncount(titles after cutoff) / total titles]
H --> I{Coverage gap > 10%?}
I -- Yes --> J[Trigger adapter retraining\non full current catalog]
I -- No --> K[Monitor; schedule retraining\nnext sprint]
C -- No, titles absent --> L[Data pipeline issue;\ncheck DynamoDB ingestion]
F -- Title predates cutoff --> M[Investigate embedding quality\nor retrieval config]
Root Cause
LoRA adapters encode domain-specific knowledge at training time. Because the adapter was not retrained after the catalog update, new manga titles have no learned representation in the adapter weights. The base model's general embeddings are insufficient to match user queries like "latest Shonen Jump releases this month" to catalog entries that were never seen during fine-tuning. There was no automated trigger to recheck catalog coverage or schedule adapter retraining when catalog growth exceeded a threshold.
Resolution
import boto3
import json
from datetime import datetime, timezone, timedelta
from botocore.exceptions import ClientError
sm_client = boto3.client("sagemaker", region_name="us-east-1")
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
s3_client = boto3.client("s3", region_name="us-east-1")
PRODUCTS_TABLE = "mangaassist-products"
TRAINING_DATA_BUCKET = "mangaassist-training-data"
LORA_TRAINING_JOB_PREFIX = "mangaassist-lora-adapter"
MODEL_PACKAGE_GROUP = "mangaassist-lora-mpg"
EXECUTION_ROLE = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"
def compute_catalog_coverage_gap(adapter_training_cutoff: datetime) -> dict:
"""
Count products created after the LoRA adapter training cutoff.
Returns coverage gap statistics.
"""
table = dynamodb.Table(PRODUCTS_TABLE)
response = table.scan(
FilterExpression="created_at > :cutoff",
ExpressionAttributeValues={
":cutoff": adapter_training_cutoff.isoformat()
},
ProjectionExpression="product_id, title, created_at",
)
new_items = response.get("Items", [])
# Handle pagination
while "LastEvaluatedKey" in response:
response = table.scan(
FilterExpression="created_at > :cutoff",
ExpressionAttributeValues={":cutoff": adapter_training_cutoff.isoformat()},
ProjectionExpression="product_id, title, created_at",
ExclusiveStartKey=response["LastEvaluatedKey"],
)
new_items.extend(response.get("Items", []))
    # COUNT scans are also paginated at 1 MB of evaluated items, so page
    # through for an accurate total at catalog scale.
    total_response = table.scan(Select="COUNT")
    total_count = total_response["Count"]
    while "LastEvaluatedKey" in total_response:
        total_response = table.scan(
            Select="COUNT", ExclusiveStartKey=total_response["LastEvaluatedKey"]
        )
        total_count += total_response["Count"]
gap_pct = len(new_items) / total_count * 100 if total_count > 0 else 0
return {
"total_products": total_count,
"products_after_cutoff": len(new_items),
"coverage_gap_pct": round(gap_pct, 2),
"sample_missed_titles": [item["title"] for item in new_items[:10]],
}
def export_full_catalog_for_retraining(output_prefix: str) -> str:
"""
Export the current full DynamoDB product catalog to S3 JSONL for LoRA retraining.
Returns the S3 URI of the exported file.
"""
table = dynamodb.Table(PRODUCTS_TABLE)
response = table.scan()
items = response.get("Items", [])
while "LastEvaluatedKey" in response:
response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
items.extend(response.get("Items", []))
# Format as JSONL fine-tuning corpus
jsonl_lines = []
for item in items:
record = {
"text": (
f"Title: {item.get('title', '')}. "
f"Genre: {item.get('genre', '')}. "
f"Author: {item.get('author', '')}. "
f"Synopsis: {item.get('synopsis', '')}"
)
}
jsonl_lines.append(json.dumps(record, ensure_ascii=False))
timestamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
s3_key = f"{output_prefix}/catalog_full_{timestamp}.jsonl"
s3_client.put_object(
Bucket=TRAINING_DATA_BUCKET,
Key=s3_key,
Body="\n".join(jsonl_lines).encode("utf-8"),
)
s3_uri = f"s3://{TRAINING_DATA_BUCKET}/{s3_key}"
print(f"[OK] Exported {len(items)} products to {s3_uri}")
return s3_uri
def trigger_lora_adapter_retraining(training_data_uri: str) -> str:
"""
Launch a SageMaker training job to retrain the LoRA adapter on the updated catalog.
Returns the training job name.
"""
timestamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
job_name = f"{LORA_TRAINING_JOB_PREFIX}-{timestamp}"
sm_client.create_training_job(
TrainingJobName=job_name,
AlgorithmSpecification={
"TrainingImage": "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training:2.1.0-transformers4.36.0-gpu-py310-cu121-ubuntu20.04",
"TrainingInputMode": "File",
},
RoleArn=EXECUTION_ROLE,
InputDataConfig=[
{
"ChannelName": "train",
"DataSource": {
"S3DataSource": {
"S3DataType": "S3Prefix",
"S3Uri": training_data_uri,
"S3DataDistributionType": "FullyReplicated",
}
},
"ContentType": "application/jsonlines",
}
],
OutputDataConfig={
"S3OutputPath": f"s3://{TRAINING_DATA_BUCKET}/lora-adapter-output/"
},
ResourceConfig={
"InstanceType": "ml.g5.2xlarge",
"InstanceCount": 1,
"VolumeSizeInGB": 50,
},
HyperParameters={
"lora_r": "16",
"lora_alpha": "32",
"lora_dropout": "0.05",
"num_train_epochs": "3",
},
StoppingCondition={"MaxRuntimeInSeconds": 86400},
Tags=[
{"Key": "Project", "Value": "MangaAssist"},
{"Key": "CatalogSnapshotDate", "Value": datetime.now(timezone.utc).date().isoformat()},
],
)
print(f"[OK] LoRA retraining job '{job_name}' started.")
return job_name
if __name__ == "__main__":
# Adapter was last trained September 1 of last year
training_cutoff = datetime(2025, 9, 1, tzinfo=timezone.utc)
gap_stats = compute_catalog_coverage_gap(training_cutoff)
print(f"Coverage gap: {gap_stats['coverage_gap_pct']}% "
f"({gap_stats['products_after_cutoff']} new titles)")
print(f"Sample missed: {gap_stats['sample_missed_titles']}")
if gap_stats["coverage_gap_pct"] > 5.0:
s3_uri = export_full_catalog_for_retraining("lora-training-data")
job_name = trigger_lora_adapter_retraining(s3_uri)
print(f"Retraining triggered: {job_name}")
Prevention
- Catalog drift alarm: Set a CloudWatch metric (or scheduled Lambda) to count DynamoDB items with `created_at > adapter_training_cutoff`; alert when the gap exceeds 5% of the total catalog (a sketch of this Lambda follows the list).
- Scheduled retraining pipeline: Use EventBridge to trigger a Step Functions workflow for LoRA retraining on a monthly cadence or on catalog growth events.
- Adapter metadata tag: Store `catalog_snapshot_date` as a tag on every model package in the Model Registry so drift is immediately visible.
- Shadow evaluation: After each retraining, run a shadow A/B evaluation against the previous adapter on a held-out query set that includes the newest 10% of titles.
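A minimal sketch of the drift-check Lambda from the first bullet, assuming `compute_catalog_coverage_gap` from the resolution code is packaged with the function and that the adapter's training cutoff lives in Parameter Store (the parameter name and metric namespace are assumptions):
import boto3
from datetime import datetime

cloudwatch = boto3.client("cloudwatch")
ssm = boto3.client("ssm")

def lambda_handler(event, context):
    """Scheduled by EventBridge: publish the catalog coverage gap as a CloudWatch metric."""
    cutoff_iso = ssm.get_parameter(
        Name="/mangaassist/lora/training-cutoff"
    )["Parameter"]["Value"]
    stats = compute_catalog_coverage_gap(datetime.fromisoformat(cutoff_iso))
    cloudwatch.put_metric_data(
        Namespace="MangaAssist/ML",
        MetricData=[
            {
                "MetricName": "CatalogCoverageGapPct",
                "Value": stats["coverage_gap_pct"],
                "Unit": "Percent",
            }
        ],
    )
    return stats
A standard CloudWatch alarm on CatalogCoverageGapPct > 5 then pages the team or starts the retraining workflow directly.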
Scenario 3: Automated Pipeline Promotes Model That Fails Business KPI
Problem
MangaAssist's MLOps pipeline automatically promotes a newly fine-tuned recommendation model to production when the validation loss on the held-out set drops below a threshold. A new model trained on March data passed the loss gate (0.31 vs. allowed max 0.35) and was auto-promoted. Within 6 hours of deployment, the business dashboard showed click-through rate (CTR) on recommended manga dropping from 8.4% to 7.1% — a 15% regression. The loss metric had improved because the model memorized popular historical titles, but it became less sensitive to long-tail and niche genres that are MangaAssist's core differentiator.
Detection
flowchart TD
A[Alert: Recommendation CTR <\n7.5% for > 30 minutes] --> B{When was the last\nendpoint update?}
B -- Within last 24h --> C{Did a pipeline promotion\noccur recently?}
C -- Yes --> D[Retrieve model package ARN\nfrom current endpoint config]
D --> E[Fetch Model Registry entry\nand training job name]
E --> F{What quality gates\ndid this version pass?}
F -- Only loss metric checked --> G[ROOT CAUSE: Business KPI\nnot included in promotion gate]
G --> H[Retrieve previous approved\nmodel package ARN]
H --> I[Rollback endpoint to\nprevious version]
I --> J[Verify CTR recovers\nwithin 15 minutes]
C -- No --> K[Investigate external factors:\ncatalog change, traffic shift]
B -- No recent update --> K
F -- CTR gate also checked --> L[Investigate other causes:\nA/B experiment interference]
Root Cause
The CI/CD promotion gate only evaluated a model-level loss metric, which reflects prediction accuracy on a fixed test set but does not capture user engagement quality. Loss metrics can improve through overfitting to frequent patterns while degrading on the diversity and exploration signals that drive CTR. The pipeline had no integration with the business metrics dashboard (CloudWatch custom metrics from the application layer), so it had no visibility into real-user behavioral outcomes before promotion.
Resolution
import boto3
import time
from datetime import datetime, timezone
from botocore.exceptions import ClientError
sm_client = boto3.client("sagemaker", region_name="us-east-1")
cw_client = boto3.client("cloudwatch", region_name="us-east-1")
ENDPOINT_NAME = "mangaassist-recommendation-endpoint"
MODEL_PACKAGE_GROUP = "mangaassist-recommendation-mpg"
CTR_METRIC_NAME = "RecommendationClickThroughRate"
CTR_NAMESPACE = "MangaAssist/Business"
CTR_PROMOTION_THRESHOLD = 0.075 # 7.5% minimum CTR required for promotion
SHADOW_TRAFFIC_PCT = 10 # % of traffic for shadow evaluation
SHADOW_EVAL_DURATION_SECONDS = 1800 # 30-minute shadow window
def get_candidate_model_ctr_via_shadow(
endpoint_name: str, candidate_model_name: str
) -> float:
"""
Add a shadow variant to the endpoint, route SHADOW_TRAFFIC_PCT of traffic,
wait for SHADOW_EVAL_DURATION_SECONDS, and return observed CTR for the candidate.
"""
current_config = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointConfigName"]
current_ep_config = sm_client.describe_endpoint_config(EndpointConfigName=current_config)
existing_variants = current_ep_config["ProductionVariants"]
# Add shadow variant
shadow_config_name = f"shadow-test-{datetime.now(timezone.utc).strftime('%Y%m%d%H%M%S')}"
shadow_variants = [
{**v, "InitialVariantWeight": (100 - SHADOW_TRAFFIC_PCT) / 100.0}
for v in existing_variants
] + [
{
"VariantName": "ShadowCandidate",
"ModelName": candidate_model_name,
"InstanceType": "ml.m5.xlarge",
"InitialInstanceCount": 1,
"InitialVariantWeight": SHADOW_TRAFFIC_PCT / 100.0,
}
]
sm_client.create_endpoint_config(
EndpointConfigName=shadow_config_name,
ProductionVariants=shadow_variants,
)
sm_client.update_endpoint(
EndpointName=endpoint_name,
EndpointConfigName=shadow_config_name,
)
print(f"[INFO] Shadow variant active. Waiting {SHADOW_EVAL_DURATION_SECONDS}s...")
time.sleep(SHADOW_EVAL_DURATION_SECONDS)
# Query CloudWatch for shadow variant CTR
end_time = datetime.now(timezone.utc)
start_time = datetime.fromtimestamp(
end_time.timestamp() - SHADOW_EVAL_DURATION_SECONDS, tz=timezone.utc
)
response = cw_client.get_metric_statistics(
Namespace=CTR_NAMESPACE,
MetricName=CTR_METRIC_NAME,
Dimensions=[
{"Name": "EndpointName", "Value": endpoint_name},
{"Name": "VariantName", "Value": "ShadowCandidate"},
],
StartTime=start_time,
EndTime=end_time,
Period=SHADOW_EVAL_DURATION_SECONDS,
Statistics=["Average"],
)
datapoints = response.get("Datapoints", [])
observed_ctr = datapoints[0]["Average"] if datapoints else 0.0
print(f"[INFO] Shadow variant observed CTR: {observed_ctr:.4f}")
return observed_ctr
def enforce_business_kpi_gate(
endpoint_name: str,
candidate_model_name: str,
model_package_arn: str,
) -> bool:
"""
Run shadow evaluation and enforce CTR gate before full promotion.
Returns True if promotion is approved; rejects and rolls back otherwise.
"""
try:
observed_ctr = get_candidate_model_ctr_via_shadow(
endpoint_name, candidate_model_name
)
except ClientError as e:
print(f"[ERROR] Shadow evaluation failed: {e}")
return False
if observed_ctr < CTR_PROMOTION_THRESHOLD:
print(
f"[REJECT] Model '{candidate_model_name}' failed CTR gate: "
f"{observed_ctr:.4f} < {CTR_PROMOTION_THRESHOLD:.4f}. Rejecting."
)
# Update Model Registry status to Rejected
sm_client.update_model_package(
ModelPackageArn=model_package_arn,
ModelApprovalStatus="Rejected",
ApprovalDescription=(
f"Failed CTR gate: observed {observed_ctr:.4f}, "
f"required {CTR_PROMOTION_THRESHOLD:.4f}"
),
)
return False
print(
f"[APPROVE] Model '{candidate_model_name}' passed CTR gate: "
f"{observed_ctr:.4f} >= {CTR_PROMOTION_THRESHOLD:.4f}. Approving."
)
sm_client.update_model_package(
ModelPackageArn=model_package_arn,
ModelApprovalStatus="Approved",
ApprovalDescription=f"CTR gate passed: {observed_ctr:.4f}",
)
return True
Prevention
- Never use model-only metrics as the sole promotion gate: Require at least one business KPI (CTR, conversion rate, session depth) as a hard gate before `Approved` status is set in the Model Registry.
- Shadow traffic A/B evaluation: Before full promotion, route 5-10% of traffic to the candidate and measure real user behavior for at least 30 minutes before full cutover.
- CloudWatch alarm on CTR: Set a `RecommendationClickThroughRate < 7.5%` alarm that triggers an automatic rollback Step Functions workflow (a sketch follows this list).
- Separate the loss gate from the business gate: The pipeline should have two sequential stages, (1) an offline quality gate using loss/precision metrics and (2) an online business gate using shadow CTR, and only proceed if both pass.
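A sketch of the CTR alarm from the third bullet; the alarm name, account ID, and the SNS topic wired to the rollback Step Functions workflow are assumptions:
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="mangaassist-ctr-regression",
    Namespace="MangaAssist/Business",
    MetricName="RecommendationClickThroughRate",
    Dimensions=[{"Name": "EndpointName", "Value": "mangaassist-recommendation-endpoint"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=6,  # 6 x 5 minutes = 30 minutes of sustained regression
    Threshold=0.075,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=[
        # Assumed SNS topic that starts the rollback workflow
        "arn:aws:sns:us-east-1:123456789012:mangaassist-ctr-rollback"
    ],
)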
Scenario 4: SageMaker Endpoint Cold Start After Scale-to-Zero Kills Active Sessions
Problem
To reduce costs during off-peak hours (2 AM–6 AM JST), the MangaAssist team configured the SageMaker inference endpoint with an auto-scaling policy that scales down to zero instances when invocations drop to near zero. On a Friday night, a promotional campaign drove unexpected traffic at 3 AM JST, causing scale-out from zero. The cold start took 4-7 minutes for model loading, during which 340 active chat sessions received HTTP 503 errors and timed out at the API Gateway level. Users lost their session context and had to restart conversations. No warm pool had been configured.
Detection
flowchart TD
A[Alert: 503 error rate >\n5% on /recommend endpoint] --> B[Check SageMaker endpoint status]
B --> C{Endpoint state?}
C -- Updating / InService\nwith 0 production instances --> D[Scale-out in progress\nfrom zero-instance state]
D --> E[Check last invocation\ntimestamp before incident]
E --> F{Gap in invocations\nbefore traffic spike?}
F -- Yes, >30 minute gap --> G[Scale-to-zero policy\ntriggered during quiet period]
G --> H[Check warm pool\nconfiguration]
H --> I{Warm pool instances\nconfigured?}
I -- None --> J[ROOT CAUSE: No warm pool,\ncold start on scale-out]
J --> K[Immediate: increase min\ninstances to 1; re-enable endpoint]
I -- Configured but\nnot enough --> L[Increase warm pool size]
C -- InService with active\ninstances --> M[Investigate application-level\ntimeout or routing error]
F -- No gap --> M
Root Cause
The auto-scaling policy used ScaleInCooldown=0 and MinCapacity=0, which allowed complete scale-to-zero during quiet periods. With MinCapacity=0, SageMaker keeps no warm capacity at all: for real-time endpoints the auto-scaling minimum is the only warm capacity, since managed warm pools apply to training jobs rather than inference. Loading the fine-tuned recommendation model took approximately 4.5 minutes because large weights had to be fetched from S3 on every scale-out from zero. The API Gateway WebSocket integration timeout is capped at 29 seconds, so virtually all requests during the cold-start window failed.
Resolution
import boto3
from botocore.exceptions import ClientError
sm_client = boto3.client("sagemaker", region_name="us-east-1")
aas_client = boto3.client("application-autoscaling", region_name="us-east-1")
ENDPOINT_NAME = "mangaassist-recommendation-endpoint"
VARIANT_NAME = "AllTraffic"
RESOURCE_ID = f"endpoint/{ENDPOINT_NAME}/variant/{VARIANT_NAME}"
def get_endpoint_instance_count(endpoint_name: str, variant_name: str) -> int:
"""Return current desired instance count for a production variant."""
try:
desc = sm_client.describe_endpoint(EndpointName=endpoint_name)
for variant in desc.get("ProductionVariants", []):
if variant["VariantName"] == variant_name:
return variant.get("CurrentInstanceCount", 0)
except ClientError as e:
print(f"[ERROR] Cannot describe endpoint: {e}")
return 0
def configure_warm_pool_and_min_capacity(
endpoint_name: str,
variant_name: str,
min_instances: int = 1,
max_instances: int = 10,
warm_pool_size: int = 1,
scale_in_cooldown: int = 300,
scale_out_cooldown: int = 60,
) -> None:
"""
Set minimum capacity to prevent cold starts and configure a warm pool.
Reconfigures the Application Auto Scaling policy for the endpoint variant.
"""
resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
# Register scalable target with minimum capacity >= 1
try:
aas_client.register_scalable_target(
ServiceNamespace="sagemaker",
ResourceId=resource_id,
ScalableDimension="sagemaker:variant:DesiredInstanceCount",
MinCapacity=min_instances,
MaxCapacity=max_instances,
)
print(
f"[OK] Scalable target registered: min={min_instances}, max={max_instances}"
)
except ClientError as e:
print(f"[ERROR] Failed to register scalable target: {e}")
raise
# Configure target tracking: scale on SageMakerVariantInvocationsPerInstance
try:
aas_client.put_scaling_policy(
PolicyName=f"{endpoint_name}-invocations-tracking",
ServiceNamespace="sagemaker",
ResourceId=resource_id,
ScalableDimension="sagemaker:variant:DesiredInstanceCount",
PolicyType="TargetTrackingScaling",
TargetTrackingScalingPolicyConfiguration={
"TargetValue": 500.0, # target invocations per instance per minute
"PredefinedMetricSpecification": {
"PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
},
"ScaleInCooldown": scale_in_cooldown,
"ScaleOutCooldown": scale_out_cooldown,
"DisableScaleIn": False,
},
)
print(f"[OK] Target tracking policy configured.")
except ClientError as e:
print(f"[ERROR] Failed to put scaling policy: {e}")
raise
    # NOTE: Real-time inference endpoints expose no warm-pool API (SageMaker
    # managed warm pools apply to training jobs), so the effective warm pool
    # is the MinCapacity floor. Enforce warm_pool_size as an additional floor
    # when it exceeds min_instances.
    if warm_pool_size > min_instances:
        try:
            aas_client.register_scalable_target(
                ServiceNamespace="sagemaker",
                ResourceId=resource_id,
                ScalableDimension="sagemaker:variant:DesiredInstanceCount",
                MinCapacity=warm_pool_size,
                MaxCapacity=max_instances,
            )
        except ClientError as e:
            print(f"[ERROR] Failed to raise warm capacity floor: {e}")
            raise
    print(f"[OK] Warm capacity floor: {max(min_instances, warm_pool_size)} instance(s).")
    # On future update_endpoint deployments, pass DeploymentConfig with an
    # AutoRollbackConfiguration referencing the 5xx alarm
    # ("mangaassist-recommendation-5xx-alarm") so bad rollouts revert automatically.
def emergency_scale_up(
endpoint_name: str, variant_name: str, target_count: int = 2
) -> None:
"""Immediately scale up the endpoint during a cold-start incident."""
current = get_endpoint_instance_count(endpoint_name, variant_name)
print(f"[INFO] Current instance count: {current}. Scaling to {target_count}.")
resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
try:
aas_client.register_scalable_target(
ServiceNamespace="sagemaker",
ResourceId=resource_id,
ScalableDimension="sagemaker:variant:DesiredInstanceCount",
MinCapacity=target_count,
MaxCapacity=max(target_count, 10),
)
print(f"[OK] Emergency scale-up initiated to {target_count} instances.")
except ClientError as e:
print(f"[ERROR] Emergency scale-up failed: {e}")
raise
if __name__ == "__main__":
# Immediate response: scale up from zero
emergency_scale_up(ENDPOINT_NAME, VARIANT_NAME, target_count=2)
# Permanent fix: configure warm pool and min capacity
configure_warm_pool_and_min_capacity(
ENDPOINT_NAME,
VARIANT_NAME,
min_instances=1,
max_instances=10,
warm_pool_size=1,
scale_in_cooldown=300,
scale_out_cooldown=60,
)
Prevention
- Never set `MinCapacity=0` for user-facing endpoints: Set `MinCapacity=1` to guarantee at least one warm instance at all times. The cost of one standby `ml.m5.xlarge` (~$0.23/hr) is justified by reliability.
- Treat the auto-scaling minimum as the warm pool: Real-time endpoints have no inference warm-pool API (SageMaker managed warm pools apply to training jobs), so warm capacity comes from the `MinCapacity` floor; serverless endpoints can use provisioned concurrency instead.
- Increase `ScaleInCooldown`: Use at least 300 seconds (5 minutes) to prevent rapid scale-in/scale-out oscillation during off-peak traffic patterns.
- Do not rely on a longer gateway timeout: The API Gateway WebSocket integration timeout is capped at 29 seconds, so implement exponential-backoff retries with queue buffering at the ECS Fargate orchestrator level instead.
- Predictive scheduling: Use Application Auto Scaling scheduled actions to pre-warm instances before known high-traffic windows such as promotional campaign start times (a sketch follows this list).
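A sketch of the pre-warm schedule from the last bullet, assuming a recurring campaign window opening at 03:00 JST; the action name, cron expression, and capacities are illustrative:
import boto3

aas_client = boto3.client("application-autoscaling", region_name="us-east-1")

# Raise the capacity floor 15 minutes before the campaign window opens.
aas_client.put_scheduled_action(
    ServiceNamespace="sagemaker",
    ScheduledActionName="mangaassist-prewarm-campaign",
    ResourceId="endpoint/mangaassist-recommendation-endpoint/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    Schedule="cron(45 17 * * ? *)",  # 17:45 UTC = 02:45 JST the next day
    ScalableTargetAction={"MinCapacity": 4, "MaxCapacity": 10},
)
A matching scheduled action after the window closes can lower MinCapacity back to 1 so the standing cost stays bounded.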
Scenario 5: Old Fine-Tuned Endpoint Not Retired — Parallel Endpoints Split Traffic Inconsistently
Problem
When MangaAssist deployed a new fine-tuned recommendation model (v2), the old endpoint (v1, mangaassist-recommendation-v1) was left running and was never decommissioned. The API Gateway routing configuration still had the old endpoint URL in 40% of client configurations (mobile app version 3.x and older web sessions). Both endpoints were serving production traffic: v1 with training data frozen at September 2024, v2 with training data through February 2026. Users complained that recommendations "felt different on mobile vs. web" and that the same query returned different results in different sessions. The cost of running two parallel endpoints was $680/month in unnecessary spend.
Detection
flowchart TD
A[User report: Inconsistent\nrecommendations across sessions] --> B[Check all active SageMaker\nendpoints with 'InService' status]
B --> C{More than one endpoint\nmatching 'recommendation' prefix?}
C -- Yes --> D[List endpoint configs and\nassociated model package ARNs]
D --> E{Do they reference\ndifferent model versions?}
E -- Yes --> F[Identify which clients\nare routing to each endpoint]
F --> G[Check API Gateway, ECS env vars,\nand mobile app config for endpoint URLs]
G --> H{Old endpoint URL still\nactive in any config?}
H -- Yes --> I[ROOT CAUSE: Old endpoint not retired\nafter v2 deployment]
I --> J[Update all routing configs\nto point to v2 endpoint]
J --> K[Verify all traffic migrated\nto v2 via CloudWatch metrics]
K --> L[Delete v1 endpoint and config]
H -- No --> M[Check load balancer target\ngroup health and weights]
C -- No --> N[Investigate application-level\ncaching returning stale responses]
E -- No --> N
Root Cause
The deployment runbook for v2 included steps to create the new endpoint and update the primary ECS environment variable, but did not include an explicit step to decommission the v1 endpoint or audit all consumers of the old endpoint URL. Mobile app versions below 4.0 hard-coded the endpoint URL hostname rather than resolving it from a configuration service, so they continued pointing to v1 after the web deployment updated to v2. There was no automated endpoint inventory audit or "active consumer" check blocking deletion.
Resolution
import boto3
from datetime import datetime, timezone, timedelta
from botocore.exceptions import ClientError
sm_client = boto3.client("sagemaker", region_name="us-east-1")
cw_client = boto3.client("cloudwatch", region_name="us-east-1")
def list_recommendation_endpoints() -> list[dict]:
"""List all InService endpoints matching the MangaAssist recommendation naming pattern."""
paginator = sm_client.get_paginator("list_endpoints")
endpoints = []
for page in paginator.paginate(StatusEquals="InService", NameContains="recommendation"):
endpoints.extend(page["Endpoints"])
return endpoints
def get_endpoint_invocations_last_hour(endpoint_name: str) -> float:
"""Return total invocation count on the endpoint in the last 60 minutes."""
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(hours=1)
response = cw_client.get_metric_statistics(
Namespace="AWS/SageMaker",
MetricName="Invocations",
Dimensions=[{"Name": "EndpointName", "Value": endpoint_name}],
StartTime=start_time,
EndTime=end_time,
Period=3600,
Statistics=["Sum"],
)
datapoints = response.get("Datapoints", [])
return sum(dp["Sum"] for dp in datapoints)
def get_endpoint_model_package_arn(endpoint_name: str) -> str | None:
"""Retrieve the Model Registry ARN associated with the endpoint's deployed model."""
try:
ep_desc = sm_client.describe_endpoint(EndpointName=endpoint_name)
config_name = ep_desc["EndpointConfigName"]
config_desc = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
model_name = config_desc["ProductionVariants"][0]["ModelName"]
model_desc = sm_client.describe_model(ModelName=model_name)
return model_desc.get("PrimaryContainer", {}).get("ModelPackageName")
except ClientError as e:
print(f"[WARN] Could not retrieve model package ARN for {endpoint_name}: {e}")
return None
def audit_and_retire_stale_endpoints(
canonical_endpoint_name: str,
dry_run: bool = True,
) -> list[str]:
"""
Identify all recommendation endpoints other than the canonical one,
check if they still receive traffic, and retire inactive ones.
Returns list of endpoints retired (or that would be retired in dry run).
"""
all_endpoints = list_recommendation_endpoints()
stale = [ep for ep in all_endpoints if ep["EndpointName"] != canonical_endpoint_name]
if not stale:
print("[OK] No stale endpoints found.")
return []
retired = []
for ep in stale:
name = ep["EndpointName"]
pkg_arn = get_endpoint_model_package_arn(name)
invocations = get_endpoint_invocations_last_hour(name)
print(
f"\n[AUDIT] Endpoint: {name}\n"
f" Model Package ARN: {pkg_arn or 'unknown'}\n"
f" Invocations (last 1h): {invocations}"
)
if invocations > 100:
print(
f" [WARNING] Endpoint {name} still receiving traffic ({invocations} inv/hr). "
"Investigate before retiring — clients may still be routing here."
)
continue
if dry_run:
print(f" [DRY RUN] Would delete endpoint '{name}' (0 recent invocations).")
else:
try:
                # Capture the config name, then delete the endpoint before its config
ep_desc = sm_client.describe_endpoint(EndpointName=name)
config_name = ep_desc["EndpointConfigName"]
sm_client.delete_endpoint(EndpointName=name)
print(f" [OK] Deleted endpoint '{name}'.")
try:
sm_client.delete_endpoint_config(EndpointConfigName=config_name)
print(f" [OK] Deleted endpoint config '{config_name}'.")
except ClientError as e:
print(f" [WARN] Could not delete endpoint config: {e}")
retired.append(name)
except ClientError as e:
print(f" [ERROR] Failed to delete endpoint '{name}': {e}")
return retired
if __name__ == "__main__":
CANONICAL = "mangaassist-recommendation-v2"
print("=== DRY RUN — no deletions will occur ===")
audit_and_retire_stale_endpoints(CANONICAL, dry_run=True)
# Uncomment after verifying no stale endpoints have active traffic:
# print("\n=== LIVE RUN ===")
# retired = audit_and_retire_stale_endpoints(CANONICAL, dry_run=False)
# print(f"\nRetired endpoints: {retired}")
Prevention
- Include endpoint retirement as a mandatory deployment step: The deployment runbook for any new model version must contain an explicit checklist item: "Decommission previous endpoint after confirming 0 active consumers."
- Centralize endpoint URL configuration: Store the active endpoint URL in AWS AppConfig or Parameter Store rather than hard-coding it in client apps; all clients (web, mobile, ECS tasks) resolve the URL at runtime from the config store (see the sketch after this list).
- Automated endpoint inventory audit: Run a weekly Lambda that lists all `InService` SageMaker endpoints and alerts on any endpoint that has not received a `model_package_approved_version` tag update within 60 days.
- Cost allocation tags: Tag every endpoint with `lifecycle_status=active|deprecated|pending_retirement` and alert when `deprecated` endpoints accumulate cost above $50/month.
- Mobile app feature flag: Use a remote config flag for the endpoint URL in mobile releases so the ML team can update routing for legacy app versions without requiring an app store update.
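A sketch of runtime endpoint resolution from the second bullet; the Parameter Store key is an assumed name:
import boto3

ssm = boto3.client("ssm", region_name="us-east-1")

def resolve_recommendation_endpoint() -> str:
    """Resolve the canonical endpoint name at runtime instead of hard-coding it."""
    return ssm.get_parameter(
        Name="/mangaassist/recommendation/active-endpoint"
    )["Parameter"]["Value"]
With every consumer (web backend, ECS tasks, the mobile BFF) reading this parameter at startup or on a short TTL, retiring v1 becomes a one-line parameter update instead of a fleet-wide config hunt.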