Scenarios and Runbooks — FM Customization and Lifecycle Management
MangaAssist context: JP Manga store chatbot on AWS — Bedrock Claude 3 (Sonnet at $3/$15 per 1M tokens input/output, Haiku at $0.25/$1.25), OpenSearch Serverless (vector store), DynamoDB (sessions/products), ECS Fargate (orchestrator), API Gateway WebSocket, ElastiCache Redis. Target: useful answer in under 3 seconds, 1M messages/day scale.
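For cost intuition at that scale, a quick back-of-envelope comparison of the two Bedrock tiers (the per-message token counts below are illustrative assumptions, not measurements):
import annotations  # noqa: delete this line if unused; sketch only

MESSAGES_PER_DAY = 1_000_000
IN_TOKENS, OUT_TOKENS = 500, 150  # assumed average tokens per message

def daily_cost(in_price_per_m: float, out_price_per_m: float) -> float:
    """Daily Bedrock spend in USD, given $/1M-token input/output prices."""
    return MESSAGES_PER_DAY * (IN_TOKENS * in_price_per_m + OUT_TOKENS * out_price_per_m) / 1_000_000

print(f"Sonnet: ${daily_cost(3.00, 15.00):,.2f}/day")  # $3,750.00/day
print(f"Haiku:  ${daily_cost(0.25, 1.25):,.2f}/day")   # $312.50/day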
Skill Mapping
| Dimension | Detail |
|---|---|
| Certification | AWS AIF-C01 — AI Practitioner |
| Domain | 1 — Foundation Model Integration, Data Management, and Compliance |
| Task | 1.2 — Select and configure FMs |
| Skill | 1.2.4 — Implement FM customization deployment and lifecycle management |
| This File | Five production scenarios with detection flowcharts, root cause analysis, resolution code, and prevention strategies |
Skill Scope Statement
This file presents five real-world failure scenarios that MangaAssist has encountered (or would encounter) in production when managing the lifecycle of fine-tuned and adapter-augmented foundation models. Each scenario covers a distinct phase of the FM customization lifecycle: version tracking, training data freshness, automated deployment gates, endpoint warm-pool management, and model retirement. Each scenario includes: a problem statement, a mermaid detection flowchart, root cause analysis, Python resolution code using the boto3 SageMaker client, and prevention measures. These runbooks are designed for on-call ML engineers and MLOps practitioners responding to alerts from SageMaker, CloudWatch, and deployment pipelines.
Mind Map
mindmap
root((FM Customization<br/>Lifecycle Failures))
Version Tracking Gap
No Model Registry Entry
Missing Lineage Metadata
Rollback Impossible
Stale Training Data
LoRA Adapter Trained on Old Catalog
New Titles Absent from Adapter
Recommendation Blind Spot
Missing Quality Gate
Loss Metric Passes
Business KPI Fails
CTR Regression
Cold Start Latency
Scale-to-Zero Policy
No Warm Pool
Session Timeout
Parallel Endpoint Confusion
Old Model Not Retired
Traffic Split Inconsistency
Divergent User Experience
Scenario Overview
| # | Scenario | Severity | Blast Radius | Typical Detection Time |
|---|---|---|---|---|
| 1 | Fine-tuned model deployed without Model Registry version tracking — rollback impossible after quality regression | P1 — Critical | All recommendation requests on affected endpoint | 15-30 minutes via user complaint spike |
| 2 | LoRA adapter trained on 6-month-old product catalog — new manga releases absent from recommendations | P2 — High | All users browsing or asking about recent releases | 24-72 hours via catalog coverage audit |
| 3 | Automated pipeline promotes model that passes loss metric but fails business KPI (CTR drops 15%) | P1 — Critical | All recommendation-driven sessions | 4-8 hours via A/B CTR dashboard |
| 4 | SageMaker endpoint cold start after scale-to-zero kills active user sessions — no warm pool configured | P2 — High | Users during off-peak hours at scale-up moment | 30-90 seconds per session timeout burst |
| 5 | Old fine-tuned endpoint not retired — parallel endpoints split traffic causing inconsistent recommendations | P2 — High | ~50% of users seeing stale behavior | Hours to days if not monitored |
Scenario 1: Fine-Tuned Model Deployed Without Version Tracking — Rollback Impossible
Problem
MangaAssist's ML team fine-tuned a manga recommendation model on SageMaker and deployed it directly to a real-time inference endpoint by uploading model.tar.gz to S3 and creating the SageMaker model and endpoint config by hand, bypassing the SageMaker Model Registry entirely. Three days later, user engagement metrics showed a 22% drop in recommendation acceptance rate. When the team tried to roll back, there was no registered model version, no approval trail, and no way to identify which training job had produced the previous artifact. The S3 bucket held dozens of model.tar.gz files with timestamp-only names and no lineage.
Detection
flowchart TD
A[Alert: Recommendation acceptance rate drops >10%] --> B{Is model version\nidentifiable?}
B -- No --> C[Check Model Registry for\nregistered model packages]
C --> D{Any package for\nthis endpoint?}
D -- None found --> E[Attempt to trace S3 artifact\nfrom endpoint config]
E --> F{Artifact matches\na training job?}
F -- No clear match --> G[BLOCKED: Cannot rollback\nwithout known prior artifact]
G --> H[Escalate: Manual artifact\nforensics required]
D -- Found --> I[Identify previous approved\npackage version]
I --> J[Redeploy previous version\nvia Model Registry]
F -- Match found --> K[Redeploy artifact manually\nwith risk of wrong version]
B -- Yes --> I
Root Cause
The deployment script used sagemaker.Model() directly with a hardcoded model_data S3 URI, skipping register() and the Model Registry approval workflow. No CI/CD step enforced registry registration as a prerequisite. Without Model Registry entries, SageMaker has no lineage graph linking endpoint → model package → training job → dataset, making rollback operationally impossible without manual investigation.
Resolution
import boto3
import json
from datetime import datetime, timezone
from botocore.exceptions import ClientError
sm_client = boto3.client("sagemaker", region_name="us-east-1")
ENDPOINT_NAME = "mangaassist-recommendation-endpoint"
MODEL_PACKAGE_GROUP = "mangaassist-recommendation-mpg"
def get_latest_approved_model_package(group_name: str, skip_version: str = None) -> dict:
"""Return the most recently approved model package, optionally skipping a known bad version."""
paginator = sm_client.get_paginator("list_model_packages")
approved = []
for page in paginator.paginate(
ModelPackageGroupName=group_name,
ModelApprovalStatus="Approved",
SortBy="CreationTime",
SortOrder="Descending",
):
for pkg in page["ModelPackageSummaryList"]:
if skip_version and pkg["ModelPackageArn"] == skip_version:
continue
approved.append(pkg)
if not approved:
raise ValueError(f"No approved model packages in group '{group_name}'")
return approved[0] # Most recent approved, skipping current bad version
def rollback_endpoint_to_previous_version(
endpoint_name: str, model_package_group: str
) -> str:
"""
Rollback endpoint to the previous approved model package version.
Returns the ARN of the model package deployed.
"""
# Identify the currently deployed model package ARN from endpoint config
try:
ep_desc = sm_client.describe_endpoint(EndpointName=endpoint_name)
current_config_name = ep_desc["EndpointConfigName"]
ep_config = sm_client.describe_endpoint_config(
EndpointConfigName=current_config_name
)
current_model_name = ep_config["ProductionVariants"][0]["ModelName"]
current_model_desc = sm_client.describe_model(ModelName=current_model_name)
current_pkg_arn = current_model_desc.get("PrimaryContainer", {}).get(
"ModelPackageName"
)
except ClientError as e:
print(f"[WARN] Could not determine current model package ARN: {e}")
current_pkg_arn = None
# Find previous approved version
previous_pkg = get_latest_approved_model_package(
model_package_group, skip_version=current_pkg_arn
)
previous_pkg_arn = previous_pkg["ModelPackageArn"]
print(f"[INFO] Rolling back to model package: {previous_pkg_arn}")
# Create a new model from the previous package
rollback_model_name = (
f"mangaassist-rollback-{datetime.now(timezone.utc).strftime('%Y%m%d%H%M%S')}"
)
sm_client.create_model(
ModelName=rollback_model_name,
PrimaryContainer={"ModelPackageName": previous_pkg_arn},
ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
)
# Create endpoint config using the rollback model
rollback_config_name = f"{rollback_model_name}-config"
sm_client.create_endpoint_config(
EndpointConfigName=rollback_config_name,
ProductionVariants=[
{
"VariantName": "AllTraffic",
"ModelName": rollback_model_name,
"InstanceType": "ml.m5.xlarge",
"InitialInstanceCount": 2,
"InitialVariantWeight": 1.0,
}
],
Tags=[
{"Key": "RollbackFrom", "Value": current_pkg_arn or "unknown"},
{"Key": "RollbackTo", "Value": previous_pkg_arn},
{"Key": "RollbackTimestamp", "Value": datetime.now(timezone.utc).isoformat()},
],
)
# Update the live endpoint
sm_client.update_endpoint(
EndpointName=endpoint_name,
EndpointConfigName=rollback_config_name,
)
print(f"[OK] Endpoint '{endpoint_name}' rollback initiated to {previous_pkg_arn}")
return previous_pkg_arn
if __name__ == "__main__":
deployed_arn = rollback_endpoint_to_previous_version(
ENDPOINT_NAME, MODEL_PACKAGE_GROUP
)
print(f"Rollback complete. Deployed package ARN: {deployed_arn}")
Prevention
- Enforce Model Registry registration in CI/CD: Add a pipeline step that calls `register()` before any `create_model()` or endpoint creation; fail the pipeline if no `ModelPackageArn` is returned (a sketch of this step follows the list).
- Deny direct `CreateModel` without a registry link: Use an IAM SCP or SageMaker condition key (`sagemaker:ModelPackageName`) to require all models to reference a registered package.
- Tag every endpoint config with `ModelPackageArn`, `TrainingJobName`, and `DatasetVersion` for forensic traceability.
- Automated approval gates: Require two-person approval in the Model Registry before the `Approved` status can be set, preserving an audit trail.
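A minimal sketch of that CI/CD registration gate, assuming the pipeline knows the training job name. The container image and content types are read back from the training job here for brevity; real pipelines usually supply a dedicated inference image. Failure raises, so the pipeline aborts before any `create_model()` call:
import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")

def register_model_or_fail(training_job_name: str) -> str:
    """Register the training job's artifact in the Model Registry, or abort the pipeline."""
    job = sm.describe_training_job(TrainingJobName=training_job_name)
    response = sm.create_model_package(
        ModelPackageGroupName="mangaassist-recommendation-mpg",
        ModelPackageDescription=f"Produced by training job {training_job_name}",
        InferenceSpecification={
            "Containers": [
                {
                    "Image": job["AlgorithmSpecification"]["TrainingImage"],
                    "ModelDataUrl": job["ModelArtifacts"]["S3ModelArtifacts"],
                }
            ],
            "SupportedContentTypes": ["application/json"],
            "SupportedResponseMIMETypes": ["application/json"],
        },
        # Approval itself stays with the two-person review described above
        ModelApprovalStatus="PendingManualApproval",
    )
    arn = response.get("ModelPackageArn")
    if not arn:
        raise RuntimeError("No ModelPackageArn returned; failing the pipeline")
    return arn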
Scenario 2: LoRA Adapter Trained on Stale Catalog — New Releases Missing from Recommendations
Problem
MangaAssist uses a LoRA adapter fine-tuned on top of a base embedding model to improve manga recommendation relevance. The adapter was trained in September on the product catalog snapshot at that date. By March, over 400 new manga titles had been added to the DynamoDB product table, but none appeared in the adapter's training corpus. Users searching for the latest releases — including several bestselling new series — received zero recommendations or irrelevant matches. The adapter had no awareness of titles published after the training cutoff.
Detection
flowchart TD
A[Alert: Zero-result rate for\n'new release' queries > 5%] --> B[Pull sample of zero-result queries]
B --> C{Do queried titles exist\nin DynamoDB products table?}
C -- Yes, titles exist --> D[Titles are in catalog\nbut missing from recommendations]
D --> E{Check adapter training\ndata manifest}
E --> F{Training cutoff date\nvs. product creation date}
F -- Title created AFTER cutoff --> G[ROOT CAUSE: LoRA adapter\nnot trained on recent titles]
G --> H[Calculate catalog coverage gap:\ncount(titles after cutoff) / total titles]
H --> I{Coverage gap > 10%?}
I -- Yes --> J[Trigger adapter retraining\non full current catalog]
I -- No --> K[Monitor; schedule retraining\nnext sprint]
C -- No, titles absent --> L[Data pipeline issue;\ncheck DynamoDB ingestion]
F -- Title predates cutoff --> M[Investigate embedding quality\nor retrieval config]
Root Cause
LoRA adapters encode domain-specific knowledge at training time. Because the adapter was not retrained after the catalog update, new manga titles have no learned representation in the adapter weights. The base model's general embeddings are insufficient to match user queries like "latest Shonen Jump releases this month" to catalog entries that were never seen during fine-tuning. There was no automated trigger to recheck catalog coverage or schedule adapter retraining when catalog growth exceeded a threshold.
Resolution
import boto3
import json
from datetime import datetime, timezone, timedelta
from botocore.exceptions import ClientError
sm_client = boto3.client("sagemaker", region_name="us-east-1")
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
s3_client = boto3.client("s3", region_name="us-east-1")
PRODUCTS_TABLE = "mangaassist-products"
TRAINING_DATA_BUCKET = "mangaassist-training-data"
LORA_TRAINING_JOB_PREFIX = "mangaassist-lora-adapter"
MODEL_PACKAGE_GROUP = "mangaassist-lora-mpg"
EXECUTION_ROLE = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"
def compute_catalog_coverage_gap(adapter_training_cutoff: datetime) -> dict:
"""
Count products created after the LoRA adapter training cutoff.
Returns coverage gap statistics.
"""
table = dynamodb.Table(PRODUCTS_TABLE)
response = table.scan(
FilterExpression="created_at > :cutoff",
ExpressionAttributeValues={
":cutoff": adapter_training_cutoff.isoformat()
},
ProjectionExpression="product_id, title, created_at",
)
new_items = response.get("Items", [])
# Handle pagination
while "LastEvaluatedKey" in response:
response = table.scan(
FilterExpression="created_at > :cutoff",
ExpressionAttributeValues={":cutoff": adapter_training_cutoff.isoformat()},
ProjectionExpression="product_id, title, created_at",
ExclusiveStartKey=response["LastEvaluatedKey"],
)
new_items.extend(response.get("Items", []))
    # COUNT scans are also paginated at 1 MB of evaluated items, so page
    # through for an accurate total at catalog scale.
    total_response = table.scan(Select="COUNT")
    total_count = total_response["Count"]
    while "LastEvaluatedKey" in total_response:
        total_response = table.scan(
            Select="COUNT", ExclusiveStartKey=total_response["LastEvaluatedKey"]
        )
        total_count += total_response["Count"]
gap_pct = len(new_items) / total_count * 100 if total_count > 0 else 0
return {
"total_products": total_count,
"products_after_cutoff": len(new_items),
"coverage_gap_pct": round(gap_pct, 2),
"sample_missed_titles": [item["title"] for item in new_items[:10]],
}
def export_full_catalog_for_retraining(output_prefix: str) -> str:
"""
Export the current full DynamoDB product catalog to S3 JSONL for LoRA retraining.
Returns the S3 URI of the exported file.
"""
table = dynamodb.Table(PRODUCTS_TABLE)
response = table.scan()
items = response.get("Items", [])
while "LastEvaluatedKey" in response:
response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
items.extend(response.get("Items", []))
# Format as JSONL fine-tuning corpus
jsonl_lines = []
for item in items:
record = {
"text": (
f"Title: {item.get('title', '')}. "
f"Genre: {item.get('genre', '')}. "
f"Author: {item.get('author', '')}. "
f"Synopsis: {item.get('synopsis', '')}"
)
}
jsonl_lines.append(json.dumps(record, ensure_ascii=False))
timestamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
s3_key = f"{output_prefix}/catalog_full_{timestamp}.jsonl"
s3_client.put_object(
Bucket=TRAINING_DATA_BUCKET,
Key=s3_key,
Body="\n".join(jsonl_lines).encode("utf-8"),
)
s3_uri = f"s3://{TRAINING_DATA_BUCKET}/{s3_key}"
print(f"[OK] Exported {len(items)} products to {s3_uri}")
return s3_uri
def trigger_lora_adapter_retraining(training_data_uri: str) -> str:
"""
Launch a SageMaker training job to retrain the LoRA adapter on the updated catalog.
Returns the training job name.
"""
timestamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
job_name = f"{LORA_TRAINING_JOB_PREFIX}-{timestamp}"
sm_client.create_training_job(
TrainingJobName=job_name,
AlgorithmSpecification={
"TrainingImage": "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training:2.1.0-transformers4.36.0-gpu-py310-cu121-ubuntu20.04",
"TrainingInputMode": "File",
},
RoleArn=EXECUTION_ROLE,
InputDataConfig=[
{
"ChannelName": "train",
"DataSource": {
"S3DataSource": {
"S3DataType": "S3Prefix",
"S3Uri": training_data_uri,
"S3DataDistributionType": "FullyReplicated",
}
},
"ContentType": "application/jsonlines",
}
],
OutputDataConfig={
"S3OutputPath": f"s3://{TRAINING_DATA_BUCKET}/lora-adapter-output/"
},
ResourceConfig={
"InstanceType": "ml.g5.2xlarge",
"InstanceCount": 1,
"VolumeSizeInGB": 50,
},
HyperParameters={
"lora_r": "16",
"lora_alpha": "32",
"lora_dropout": "0.05",
"num_train_epochs": "3",
},
StoppingCondition={"MaxRuntimeInSeconds": 86400},
Tags=[
{"Key": "Project", "Value": "MangaAssist"},
{"Key": "CatalogSnapshotDate", "Value": datetime.now(timezone.utc).date().isoformat()},
],
)
print(f"[OK] LoRA retraining job '{job_name}' started.")
return job_name
if __name__ == "__main__":
# Adapter was last trained September 1 of last year
training_cutoff = datetime(2025, 9, 1, tzinfo=timezone.utc)
gap_stats = compute_catalog_coverage_gap(training_cutoff)
print(f"Coverage gap: {gap_stats['coverage_gap_pct']}% "
f"({gap_stats['products_after_cutoff']} new titles)")
print(f"Sample missed: {gap_stats['sample_missed_titles']}")
if gap_stats["coverage_gap_pct"] > 5.0:
s3_uri = export_full_catalog_for_retraining("lora-training-data")
job_name = trigger_lora_adapter_retraining(s3_uri)
print(f"Retraining triggered: {job_name}")
Prevention
- Catalog drift alarm: Set a CloudWatch metric (or scheduled Lambda) to count DynamoDB items with `created_at > adapter_training_cutoff`; alert when the gap exceeds 5% of the total catalog (a sketch of this Lambda follows the list).
- Scheduled retraining pipeline: Use EventBridge to trigger a Step Functions workflow for LoRA retraining on a monthly cadence or on catalog growth events.
- Adapter metadata tag: Store `catalog_snapshot_date` as a tag on every model package in the Model Registry so drift is immediately visible.
- Shadow evaluation: After each retraining, run a shadow A/B evaluation against the previous adapter on a held-out query set that includes the newest 10% of titles.
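A minimal sketch of the drift-check Lambda from the first bullet, assuming `compute_catalog_coverage_gap` from the resolution code is packaged with the function and that the adapter's training cutoff lives in Parameter Store (the parameter name and metric namespace are assumptions):
import boto3
from datetime import datetime

cloudwatch = boto3.client("cloudwatch")
ssm = boto3.client("ssm")

def lambda_handler(event, context):
    """Scheduled by EventBridge: publish the catalog coverage gap as a CloudWatch metric."""
    cutoff_iso = ssm.get_parameter(
        Name="/mangaassist/lora/training-cutoff"
    )["Parameter"]["Value"]
    stats = compute_catalog_coverage_gap(datetime.fromisoformat(cutoff_iso))
    cloudwatch.put_metric_data(
        Namespace="MangaAssist/ML",
        MetricData=[
            {
                "MetricName": "CatalogCoverageGapPct",
                "Value": stats["coverage_gap_pct"],
                "Unit": "Percent",
            }
        ],
    )
    return stats
A standard CloudWatch alarm on CatalogCoverageGapPct > 5 then pages the team or starts the retraining workflow directly.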
Scenario 3: Automated Pipeline Promotes Model That Fails Business KPI
Problem
MangaAssist's MLOps pipeline automatically promotes a newly fine-tuned recommendation model to production when the validation loss on the held-out set drops below a threshold. A new model trained on March data passed the loss gate (0.31 vs. allowed max 0.35) and was auto-promoted. Within 6 hours of deployment, the business dashboard showed click-through rate (CTR) on recommended manga dropping from 8.4% to 7.1% — a 15% regression. The loss metric had improved because the model memorized popular historical titles, but it became less sensitive to long-tail and niche genres that are MangaAssist's core differentiator.
Detection
flowchart TD
A[Alert: Recommendation CTR <\n7.5% for > 30 minutes] --> B{When was the last\nendpoint update?}
B -- Within last 24h --> C{Did a pipeline promotion\noccur recently?}
C -- Yes --> D[Retrieve model package ARN\nfrom current endpoint config]
D --> E[Fetch Model Registry entry\nand training job name]
E --> F{What quality gates\ndid this version pass?}
F -- Only loss metric checked --> G[ROOT CAUSE: Business KPI\nnot included in promotion gate]
G --> H[Retrieve previous approved\nmodel package ARN]
H --> I[Rollback endpoint to\nprevious version]
I --> J[Verify CTR recovers\nwithin 15 minutes]
C -- No --> K[Investigate external factors:\ncatalog change, traffic shift]
B -- No recent update --> K
F -- CTR gate also checked --> L[Investigate other causes:\nA/B experiment interference]
Root Cause
The CI/CD promotion gate only evaluated a model-level loss metric, which reflects prediction accuracy on a fixed test set but does not capture user engagement quality. Loss metrics can improve through overfitting to frequent patterns while degrading on the diversity and exploration signals that drive CTR. The pipeline had no integration with the business metrics dashboard (CloudWatch custom metrics from the application layer), so it had no visibility into real-user behavioral outcomes before promotion.
Resolution
import boto3
import time
from datetime import datetime, timezone
from botocore.exceptions import ClientError
sm_client = boto3.client("sagemaker", region_name="us-east-1")
cw_client = boto3.client("cloudwatch", region_name="us-east-1")
ENDPOINT_NAME = "mangaassist-recommendation-endpoint"
MODEL_PACKAGE_GROUP = "mangaassist-recommendation-mpg"
CTR_METRIC_NAME = "RecommendationClickThroughRate"
CTR_NAMESPACE = "MangaAssist/Business"
CTR_PROMOTION_THRESHOLD = 0.075 # 7.5% minimum CTR required for promotion
SHADOW_TRAFFIC_PCT = 10 # % of traffic for shadow evaluation
SHADOW_EVAL_DURATION_SECONDS = 1800 # 30-minute shadow window
def get_candidate_model_ctr_via_shadow(
endpoint_name: str, candidate_model_name: str
) -> float:
"""
Add a shadow variant to the endpoint, route SHADOW_TRAFFIC_PCT of traffic,
wait for SHADOW_EVAL_DURATION_SECONDS, and return observed CTR for the candidate.
"""
current_config = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointConfigName"]
current_ep_config = sm_client.describe_endpoint_config(EndpointConfigName=current_config)
existing_variants = current_ep_config["ProductionVariants"]
# Add shadow variant
shadow_config_name = f"shadow-test-{datetime.now(timezone.utc).strftime('%Y%m%d%H%M%S')}"
shadow_variants = [
{**v, "InitialVariantWeight": (100 - SHADOW_TRAFFIC_PCT) / 100.0}
for v in existing_variants
] + [
{
"VariantName": "ShadowCandidate",
"ModelName": candidate_model_name,
"InstanceType": "ml.m5.xlarge",
"InitialInstanceCount": 1,
"InitialVariantWeight": SHADOW_TRAFFIC_PCT / 100.0,
}
]
sm_client.create_endpoint_config(
EndpointConfigName=shadow_config_name,
ProductionVariants=shadow_variants,
)
sm_client.update_endpoint(
EndpointName=endpoint_name,
EndpointConfigName=shadow_config_name,
)
print(f"[INFO] Shadow variant active. Waiting {SHADOW_EVAL_DURATION_SECONDS}s...")
time.sleep(SHADOW_EVAL_DURATION_SECONDS)
# Query CloudWatch for shadow variant CTR
end_time = datetime.now(timezone.utc)
start_time = datetime.fromtimestamp(
end_time.timestamp() - SHADOW_EVAL_DURATION_SECONDS, tz=timezone.utc
)
response = cw_client.get_metric_statistics(
Namespace=CTR_NAMESPACE,
MetricName=CTR_METRIC_NAME,
Dimensions=[
{"Name": "EndpointName", "Value": endpoint_name},
{"Name": "VariantName", "Value": "ShadowCandidate"},
],
StartTime=start_time,
EndTime=end_time,
Period=SHADOW_EVAL_DURATION_SECONDS,
Statistics=["Average"],
)
datapoints = response.get("Datapoints", [])
observed_ctr = datapoints[0]["Average"] if datapoints else 0.0
print(f"[INFO] Shadow variant observed CTR: {observed_ctr:.4f}")
return observed_ctr
def enforce_business_kpi_gate(
endpoint_name: str,
candidate_model_name: str,
model_package_arn: str,
) -> bool:
"""
Run shadow evaluation and enforce CTR gate before full promotion.
Returns True if promotion is approved; rejects and rolls back otherwise.
"""
try:
observed_ctr = get_candidate_model_ctr_via_shadow(
endpoint_name, candidate_model_name
)
except ClientError as e:
print(f"[ERROR] Shadow evaluation failed: {e}")
return False
if observed_ctr < CTR_PROMOTION_THRESHOLD:
print(
f"[REJECT] Model '{candidate_model_name}' failed CTR gate: "
f"{observed_ctr:.4f} < {CTR_PROMOTION_THRESHOLD:.4f}. Rejecting."
)
# Update Model Registry status to Rejected
sm_client.update_model_package(
ModelPackageArn=model_package_arn,
ModelApprovalStatus="Rejected",
ApprovalDescription=(
f"Failed CTR gate: observed {observed_ctr:.4f}, "
f"required {CTR_PROMOTION_THRESHOLD:.4f}"
),
)
return False
print(
f"[APPROVE] Model '{candidate_model_name}' passed CTR gate: "
f"{observed_ctr:.4f} >= {CTR_PROMOTION_THRESHOLD:.4f}. Approving."
)
sm_client.update_model_package(
ModelPackageArn=model_package_arn,
ModelApprovalStatus="Approved",
ApprovalDescription=f"CTR gate passed: {observed_ctr:.4f}",
)
return True
Prevention
- Never use model-only metrics as the sole promotion gate: Require at least one business KPI (CTR, conversion rate, session depth) as a hard gate before `Approved` status is set in the Model Registry.
- Shadow traffic A/B evaluation: Before full promotion, route 5-10% of traffic to the candidate and measure real user behavior for at least 30 minutes before full cutover.
- CloudWatch alarm on CTR: Set a `RecommendationClickThroughRate < 7.5%` alarm that triggers an automatic rollback Step Functions workflow (a sketch follows this list).
- Separate the loss gate from the business gate: The pipeline should have two sequential stages, (1) an offline quality gate using loss/precision metrics and (2) an online business gate using shadow CTR, and only proceed if both pass.
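A sketch of the CTR alarm from the third bullet; the alarm name, account ID, and the SNS topic wired to the rollback Step Functions workflow are assumptions:
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="mangaassist-ctr-regression",
    Namespace="MangaAssist/Business",
    MetricName="RecommendationClickThroughRate",
    Dimensions=[{"Name": "EndpointName", "Value": "mangaassist-recommendation-endpoint"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=6,  # 6 x 5 minutes = 30 minutes of sustained regression
    Threshold=0.075,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=[
        # Assumed SNS topic that starts the rollback workflow
        "arn:aws:sns:us-east-1:123456789012:mangaassist-ctr-rollback"
    ],
)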
Scenario 4: SageMaker Endpoint Cold Start After Scale-to-Zero Kills Active Sessions
Problem
To reduce costs during off-peak hours (2 AM–6 AM JST), the MangaAssist team configured the SageMaker inference endpoint with an auto-scaling policy that scales down to zero instances when invocations drop to near zero. On a Friday night, a promotional campaign drove unexpected traffic at 3 AM JST, causing scale-out from zero. The cold start took 4-7 minutes for model loading, during which 340 active chat sessions received HTTP 503 errors and timed out at the API Gateway level. Users lost their session context and had to restart conversations. No warm pool had been configured.
Detection
flowchart TD
A[Alert: 503 error rate >\n5% on /recommend endpoint] --> B[Check SageMaker endpoint status]
B --> C{Endpoint state?}
C -- Updating / InService\nwith 0 production instances --> D[Scale-out in progress\nfrom zero-instance state]
D --> E[Check last invocation\ntimestamp before incident]
E --> F{Gap in invocations\nbefore traffic spike?}
F -- Yes, >30 minute gap --> G[Scale-to-zero policy\ntriggered during quiet period]
G --> H[Check warm pool\nconfiguration]
H --> I{Warm pool instances\nconfigured?}
I -- None --> J[ROOT CAUSE: No warm pool,\ncold start on scale-out]
J --> K[Immediate: increase min\ninstances to 1; re-enable endpoint]
I -- Configured but\nnot enough --> L[Increase warm pool size]
C -- InService with active\ninstances --> M[Investigate application-level\ntimeout or routing error]
F -- No gap --> M
Root Cause
The auto-scaling policy used ScaleInCooldown=0 and MinCapacity=0, which allowed complete scale-to-zero during quiet periods. With MinCapacity=0, SageMaker keeps no warm capacity at all: for real-time endpoints the auto-scaling minimum is the only warm capacity, since managed warm pools apply to training jobs rather than inference. Loading the fine-tuned recommendation model took approximately 4.5 minutes because large weights had to be fetched from S3 on every scale-out from zero. The API Gateway WebSocket integration timeout is capped at 29 seconds, so virtually all requests during the cold-start window failed.
Resolution
import boto3
from botocore.exceptions import ClientError
sm_client = boto3.client("sagemaker", region_name="us-east-1")
aas_client = boto3.client("application-autoscaling", region_name="us-east-1")
ENDPOINT_NAME = "mangaassist-recommendation-endpoint"
VARIANT_NAME = "AllTraffic"
RESOURCE_ID = f"endpoint/{ENDPOINT_NAME}/variant/{VARIANT_NAME}"
def get_endpoint_instance_count(endpoint_name: str, variant_name: str) -> int:
"""Return current desired instance count for a production variant."""
try:
desc = sm_client.describe_endpoint(EndpointName=endpoint_name)
for variant in desc.get("ProductionVariants", []):
if variant["VariantName"] == variant_name:
return variant.get("CurrentInstanceCount", 0)
except ClientError as e:
print(f"[ERROR] Cannot describe endpoint: {e}")
return 0
def configure_warm_pool_and_min_capacity(
endpoint_name: str,
variant_name: str,
min_instances: int = 1,
max_instances: int = 10,
warm_pool_size: int = 1,
scale_in_cooldown: int = 300,
scale_out_cooldown: int = 60,
) -> None:
"""
Set minimum capacity to prevent cold starts and configure a warm pool.
Reconfigures the Application Auto Scaling policy for the endpoint variant.
"""
resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
# Register scalable target with minimum capacity >= 1
try:
aas_client.register_scalable_target(
ServiceNamespace="sagemaker",
ResourceId=resource_id,
ScalableDimension="sagemaker:variant:DesiredInstanceCount",
MinCapacity=min_instances,
MaxCapacity=max_instances,
)
print(
f"[OK] Scalable target registered: min={min_instances}, max={max_instances}"
)
except ClientError as e:
print(f"[ERROR] Failed to register scalable target: {e}")
raise
# Configure target tracking: scale on SageMakerVariantInvocationsPerInstance
try:
aas_client.put_scaling_policy(
PolicyName=f"{endpoint_name}-invocations-tracking",
ServiceNamespace="sagemaker",
ResourceId=resource_id,
ScalableDimension="sagemaker:variant:DesiredInstanceCount",
PolicyType="TargetTrackingScaling",
TargetTrackingScalingPolicyConfiguration={
"TargetValue": 500.0, # target invocations per instance per minute
"PredefinedMetricSpecification": {
"PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
},
"ScaleInCooldown": scale_in_cooldown,
"ScaleOutCooldown": scale_out_cooldown,
"DisableScaleIn": False,
},
)
print(f"[OK] Target tracking policy configured.")
except ClientError as e:
print(f"[ERROR] Failed to put scaling policy: {e}")
raise
    # NOTE: Real-time inference endpoints expose no warm-pool API (SageMaker
    # managed warm pools apply to training jobs), so the effective warm pool
    # is the MinCapacity floor. Enforce warm_pool_size as an additional floor
    # when it exceeds min_instances.
    if warm_pool_size > min_instances:
        try:
            aas_client.register_scalable_target(
                ServiceNamespace="sagemaker",
                ResourceId=resource_id,
                ScalableDimension="sagemaker:variant:DesiredInstanceCount",
                MinCapacity=warm_pool_size,
                MaxCapacity=max_instances,
            )
        except ClientError as e:
            print(f"[ERROR] Failed to raise warm capacity floor: {e}")
            raise
    print(f"[OK] Warm capacity floor: {max(min_instances, warm_pool_size)} instance(s).")
    # On future update_endpoint deployments, pass DeploymentConfig with an
    # AutoRollbackConfiguration referencing the 5xx alarm
    # ("mangaassist-recommendation-5xx-alarm") so bad rollouts revert automatically.
def emergency_scale_up(
endpoint_name: str, variant_name: str, target_count: int = 2
) -> None:
"""Immediately scale up the endpoint during a cold-start incident."""
current = get_endpoint_instance_count(endpoint_name, variant_name)
print(f"[INFO] Current instance count: {current}. Scaling to {target_count}.")
resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
try:
aas_client.register_scalable_target(
ServiceNamespace="sagemaker",
ResourceId=resource_id,
ScalableDimension="sagemaker:variant:DesiredInstanceCount",
MinCapacity=target_count,
MaxCapacity=max(target_count, 10),
)
print(f"[OK] Emergency scale-up initiated to {target_count} instances.")
except ClientError as e:
print(f"[ERROR] Emergency scale-up failed: {e}")
raise
if __name__ == "__main__":
# Immediate response: scale up from zero
emergency_scale_up(ENDPOINT_NAME, VARIANT_NAME, target_count=2)
# Permanent fix: configure warm pool and min capacity
configure_warm_pool_and_min_capacity(
ENDPOINT_NAME,
VARIANT_NAME,
min_instances=1,
max_instances=10,
warm_pool_size=1,
scale_in_cooldown=300,
scale_out_cooldown=60,
)
Prevention
- Never set `MinCapacity=0` for user-facing endpoints: Set `MinCapacity=1` to guarantee at least one warm instance at all times. The cost of one standby `ml.m5.xlarge` (~$0.23/hr) is justified by reliability.
- Treat the auto-scaling minimum as the warm pool: Real-time endpoints have no inference warm-pool API (SageMaker managed warm pools apply to training jobs), so warm capacity comes from the `MinCapacity` floor; serverless endpoints can use provisioned concurrency instead.
- Increase `ScaleInCooldown`: Use at least 300 seconds (5 minutes) to prevent rapid scale-in/scale-out oscillation during off-peak traffic patterns.
- Do not rely on a longer gateway timeout: The API Gateway WebSocket integration timeout is capped at 29 seconds, so implement exponential-backoff retries with queue buffering at the ECS Fargate orchestrator level instead.
- Predictive scheduling: Use Application Auto Scaling scheduled actions to pre-warm instances before known high-traffic windows such as promotional campaign start times (a sketch follows this list).
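A sketch of the pre-warm schedule from the last bullet, assuming a recurring campaign window opening at 03:00 JST; the action name, cron expression, and capacities are illustrative:
import boto3

aas_client = boto3.client("application-autoscaling", region_name="us-east-1")

# Raise the capacity floor 15 minutes before the campaign window opens.
aas_client.put_scheduled_action(
    ServiceNamespace="sagemaker",
    ScheduledActionName="mangaassist-prewarm-campaign",
    ResourceId="endpoint/mangaassist-recommendation-endpoint/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    Schedule="cron(45 17 * * ? *)",  # 17:45 UTC = 02:45 JST the next day
    ScalableTargetAction={"MinCapacity": 4, "MaxCapacity": 10},
)
A matching scheduled action after the window closes can lower MinCapacity back to 1 so the standing cost stays bounded.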
Scenario 5: Old Fine-Tuned Endpoint Not Retired — Parallel Endpoints Split Traffic Inconsistently
Problem
When MangaAssist deployed a new fine-tuned recommendation model (v2), the old endpoint (v1, mangaassist-recommendation-v1) was left running and was never decommissioned. The API Gateway routing configuration still had the old endpoint URL in 40% of client configurations (mobile app version 3.x and older web sessions). Both endpoints were serving production traffic: v1 with training data frozen at September 2024, v2 with training data through February 2026. Users complained that recommendations "felt different on mobile vs. web" and that the same query returned different results in different sessions. The cost of running two parallel endpoints was $680/month in unnecessary spend.
Detection
flowchart TD
A[User report: Inconsistent\nrecommendations across sessions] --> B[Check all active SageMaker\nendpoints with 'InService' status]
B --> C{More than one endpoint\nmatching 'recommendation' prefix?}
C -- Yes --> D[List endpoint configs and\nassociated model package ARNs]
D --> E{Do they reference\ndifferent model versions?}
E -- Yes --> F[Identify which clients\nare routing to each endpoint]
F --> G[Check API Gateway, ECS env vars,\nand mobile app config for endpoint URLs]
G --> H{Old endpoint URL still\nactive in any config?}
H -- Yes --> I[ROOT CAUSE: Old endpoint not retired\nafter v2 deployment]
I --> J[Update all routing configs\nto point to v2 endpoint]
J --> K[Verify all traffic migrated\nto v2 via CloudWatch metrics]
K --> L[Delete v1 endpoint and config]
H -- No --> M[Check load balancer target\ngroup health and weights]
C -- No --> N[Investigate application-level\ncaching returning stale responses]
E -- No --> N
Root Cause
The deployment runbook for v2 included steps to create the new endpoint and update the primary ECS environment variable, but did not include an explicit step to decommission the v1 endpoint or audit all consumers of the old endpoint URL. Mobile app versions below 4.0 hard-coded the endpoint URL hostname rather than resolving it from a configuration service, so they continued pointing to v1 after the web deployment updated to v2. There was no automated endpoint inventory audit or "active consumer" check blocking deletion.
Resolution
import boto3
from datetime import datetime, timezone, timedelta
from botocore.exceptions import ClientError
sm_client = boto3.client("sagemaker", region_name="us-east-1")
cw_client = boto3.client("cloudwatch", region_name="us-east-1")
def list_recommendation_endpoints() -> list[dict]:
"""List all InService endpoints matching the MangaAssist recommendation naming pattern."""
paginator = sm_client.get_paginator("list_endpoints")
endpoints = []
for page in paginator.paginate(StatusEquals="InService", NameContains="recommendation"):
endpoints.extend(page["Endpoints"])
return endpoints
def get_endpoint_invocations_last_hour(endpoint_name: str) -> float:
"""Return total invocation count on the endpoint in the last 60 minutes."""
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(hours=1)
response = cw_client.get_metric_statistics(
Namespace="AWS/SageMaker",
MetricName="Invocations",
Dimensions=[{"Name": "EndpointName", "Value": endpoint_name}],
StartTime=start_time,
EndTime=end_time,
Period=3600,
Statistics=["Sum"],
)
datapoints = response.get("Datapoints", [])
return sum(dp["Sum"] for dp in datapoints)
def get_endpoint_model_package_arn(endpoint_name: str) -> str | None:
"""Retrieve the Model Registry ARN associated with the endpoint's deployed model."""
try:
ep_desc = sm_client.describe_endpoint(EndpointName=endpoint_name)
config_name = ep_desc["EndpointConfigName"]
config_desc = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
model_name = config_desc["ProductionVariants"][0]["ModelName"]
model_desc = sm_client.describe_model(ModelName=model_name)
return model_desc.get("PrimaryContainer", {}).get("ModelPackageName")
except ClientError as e:
print(f"[WARN] Could not retrieve model package ARN for {endpoint_name}: {e}")
return None
def audit_and_retire_stale_endpoints(
canonical_endpoint_name: str,
dry_run: bool = True,
) -> list[str]:
"""
Identify all recommendation endpoints other than the canonical one,
check if they still receive traffic, and retire inactive ones.
Returns list of endpoints retired (or that would be retired in dry run).
"""
all_endpoints = list_recommendation_endpoints()
stale = [ep for ep in all_endpoints if ep["EndpointName"] != canonical_endpoint_name]
if not stale:
print("[OK] No stale endpoints found.")
return []
retired = []
for ep in stale:
name = ep["EndpointName"]
pkg_arn = get_endpoint_model_package_arn(name)
invocations = get_endpoint_invocations_last_hour(name)
print(
f"\n[AUDIT] Endpoint: {name}\n"
f" Model Package ARN: {pkg_arn or 'unknown'}\n"
f" Invocations (last 1h): {invocations}"
)
if invocations > 100:
print(
f" [WARNING] Endpoint {name} still receiving traffic ({invocations} inv/hr). "
"Investigate before retiring — clients may still be routing here."
)
continue
if dry_run:
print(f" [DRY RUN] Would delete endpoint '{name}' (0 recent invocations).")
else:
try:
                # Capture the config name, then delete the endpoint before its config
ep_desc = sm_client.describe_endpoint(EndpointName=name)
config_name = ep_desc["EndpointConfigName"]
sm_client.delete_endpoint(EndpointName=name)
print(f" [OK] Deleted endpoint '{name}'.")
try:
sm_client.delete_endpoint_config(EndpointConfigName=config_name)
print(f" [OK] Deleted endpoint config '{config_name}'.")
except ClientError as e:
print(f" [WARN] Could not delete endpoint config: {e}")
retired.append(name)
except ClientError as e:
print(f" [ERROR] Failed to delete endpoint '{name}': {e}")
return retired
if __name__ == "__main__":
CANONICAL = "mangaassist-recommendation-v2"
print("=== DRY RUN — no deletions will occur ===")
audit_and_retire_stale_endpoints(CANONICAL, dry_run=True)
# Uncomment after verifying no stale endpoints have active traffic:
# print("\n=== LIVE RUN ===")
# retired = audit_and_retire_stale_endpoints(CANONICAL, dry_run=False)
# print(f"\nRetired endpoints: {retired}")
Prevention
- Include endpoint retirement as a mandatory deployment step: The deployment runbook for any new model version must contain an explicit checklist item: "Decommission previous endpoint after confirming 0 active consumers."
- Centralize endpoint URL configuration: Store the active endpoint URL in AWS AppConfig or Parameter Store rather than hard-coding it in client apps; all clients (web, mobile, ECS tasks) resolve the URL at runtime from the config store (see the sketch after this list).
- Automated endpoint inventory audit: Run a weekly Lambda that lists all `InService` SageMaker endpoints and alerts on any endpoint that has not received a `model_package_approved_version` tag update within 60 days.
- Cost allocation tags: Tag every endpoint with `lifecycle_status=active|deprecated|pending_retirement` and alert when `deprecated` endpoints accumulate cost above $50/month.
- Mobile app feature flag: Use a remote config flag for the endpoint URL in mobile releases so the ML team can update routing for legacy app versions without requiring an app store update.
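A sketch of runtime endpoint resolution from the second bullet; the Parameter Store key is an assumed name:
import boto3

ssm = boto3.client("ssm", region_name="us-east-1")

def resolve_recommendation_endpoint() -> str:
    """Resolve the canonical endpoint name at runtime instead of hard-coding it."""
    return ssm.get_parameter(
        Name="/mangaassist/recommendation/active-endpoint"
    )["Parameter"]["Value"]
With every consumer (web backend, ECS tasks, the mobile BFF) reading this parameter at startup or on a short TTL, retiring v1 becomes a one-line parameter update instead of a fleet-wide config hunt.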