
09. SageMaker and Azure Inference APIs for LLMs and Traditional ML Models

"Most inference confusion disappears once you separate control-plane APIs from runtime APIs."


Quick Mental Model

| Layer | What it does | Examples |
|---|---|---|
| Control plane | Creates, updates, secures, and routes endpoints | CreateEndpoint, Azure online endpoint create, list keys |
| Runtime or data plane | Sends inputs and receives predictions | InvokeEndpoint, scoring URI POST, /openai/v1/responses |
| Batch or job plane | Submits large offline scoring work | CreateTransformJob, Azure batch endpoint invoke |
| Monitoring plane | Observes latency, errors, throughput, and drift | CloudWatch, Azure Monitor, Log Analytics |

If you mix these together, the API surface looks more confusing than it really is.


Common SageMaker APIs

1. SageMaker control-plane APIs

These APIs set up or manage the endpoint itself.

| API | What it is used for | Typical scenario |
|---|---|---|
| CreateModel | Registers model artifacts and the serving container | First deployment of a model |
| CreateEndpointConfig | Defines hosting config such as instance type, variants, or async settings | Preparing a deployment configuration |
| CreateEndpoint | Creates the live endpoint | Initial rollout |
| UpdateEndpoint | Replaces the model version or hosting configuration | Blue-green update or model refresh |
| DescribeEndpoint | Checks provisioning and health status | Deployment automation and health checks |
| DeleteEndpoint | Removes the live endpoint | Cleanup or retirement |
| UpdateEndpointWeightsAndCapacities | Changes traffic weights across variants | Canary, A/B test, gradual rollout |
| CreateInferenceComponent | Deploys a model unit on a shared endpoint | Fine-grained model placement on one endpoint |
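With boto3, the first three control-plane calls chain together roughly as below. All names, the image URI, and the instance type are illustrative, and the live calls are shown as comments because they require AWS credentials:

```python
def endpoint_config(config_name, model_name, instance_type="ml.m5.xlarge"):
    """Build a CreateEndpointConfig request body with a single variant.
    Instance type and counts are illustrative defaults."""
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [{
            "VariantName": "primary",
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 1.0,
        }],
    }

# With boto3 and AWS credentials configured, the flow is:
#   import boto3
#   sm = boto3.client("sagemaker")
#   sm.create_model(ModelName="churn-model",
#                   PrimaryContainer={"Image": image_uri,
#                                     "ModelDataUrl": s3_model_url},
#                   ExecutionRoleArn=role_arn)
#   sm.create_endpoint_config(**endpoint_config("churn-config", "churn-model"))
#   sm.create_endpoint(EndpointName="churn-endpoint",
#                      EndpointConfigName="churn-config")
```

UpdateEndpoint then takes a new EndpointConfigName, which is what makes blue-green swaps a config change rather than an in-place mutation.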

2. SageMaker runtime APIs

These are the APIs applications call to actually run inference.

| API | Best for | Notes |
|---|---|---|
| InvokeEndpoint | Normal synchronous inference | Most common runtime API for traditional ML and self-hosted LLMs |
| InvokeEndpointWithResponseStream | Streaming inference | Useful for token streaming or chunked output |
| InvokeEndpointAsync | Queued long-running inference | Returns an S3 output location instead of an immediate prediction |

Common request-routing features in SageMaker runtime:

  • TargetVariant routes a request to a specific production variant
  • TargetModel selects a model on a multi-model endpoint
  • InferenceComponentName targets an inference component on a shared endpoint
  • InferenceId tags a request for later traceability
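A hedged sketch of how these routing fields attach to an InvokeEndpoint call; the endpoint and variant names are illustrative:

```python
import json

def invoke_args(endpoint_name, payload, *, variant=None, target_model=None,
                component=None, inference_id=None):
    """Build keyword arguments for sagemaker-runtime InvokeEndpoint,
    adding the optional routing fields only when they are set."""
    args = {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",
        "Body": json.dumps(payload),
    }
    if variant:
        args["TargetVariant"] = variant
    if target_model:
        args["TargetModel"] = target_model
    if component:
        args["InferenceComponentName"] = component
    if inference_id:
        args["InferenceId"] = inference_id
    return args

# With boto3 and credentials configured:
#   smr = boto3.client("sagemaker-runtime")
#   resp = smr.invoke_endpoint(**invoke_args("churn-endpoint",
#                                            {"features": [0.1, 0.9]},
#                                            variant="canary"))
#   prediction = json.loads(resp["Body"].read())
```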

3. SageMaker batch APIs

| API | Best for |
|---|---|
| CreateTransformJob | Offline batch scoring over data in S3 |
| DescribeTransformJob | Checking batch job status |
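A CreateTransformJob request for CSV data in S3 looks roughly like this; bucket paths, names, and the instance type are placeholders:

```python
def transform_job(job_name, model_name, input_s3, output_s3):
    """Build a CreateTransformJob request body for line-split CSV input."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix", "S3Uri": input_s3}},
            "ContentType": "text/csv",
            "SplitType": "Line",        # score one CSV row per request
        },
        "TransformOutput": {"S3OutputPath": output_s3},
        "TransformResources": {"InstanceType": "ml.m5.xlarge",
                               "InstanceCount": 1},
    }

# With boto3:
#   sm.create_transform_job(**transform_job("churn-nightly", "churn-model",
#                                           "s3://bucket/in/", "s3://bucket/out/"))
#   status = sm.describe_transform_job(
#       TransformJobName="churn-nightly")["TransformJobStatus"]
```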

Traditional ML on SageMaker

The most common pattern is:

  1. CreateModel
  2. CreateEndpointConfig
  3. CreateEndpoint
  4. InvokeEndpoint for real-time predictions
  5. CreateTransformJob for offline batch scoring

Typical traditional ML examples:

  • XGBoost credit scoring
  • scikit-learn churn prediction
  • Random forest risk classification
  • PyTorch or TensorFlow image classification

LLMs on SageMaker

If you are hosting the LLM yourself on SageMaker, the common runtime pattern is:

  • InvokeEndpoint for normal request-response generation
  • InvokeEndpointWithResponseStream for streaming tokens
  • InvokeEndpointAsync for long document or large-payload jobs
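Streaming responses arrive as an event stream of PayloadPart chunks. A small sketch of draining it; the request payload shape depends on the model container and is an assumption here:

```python
def stream_tokens(event_stream):
    """Yield decoded text chunks from an InvokeEndpointWithResponseStream
    event stream; each PayloadPart event carries raw bytes."""
    for event in event_stream:
        part = event.get("PayloadPart")
        if part:
            yield part["Bytes"].decode("utf-8")

# With boto3:
#   resp = smr.invoke_endpoint_with_response_stream(
#       EndpointName="llm-endpoint",
#       ContentType="application/json",
#       Body=json.dumps({"inputs": prompt,
#                        "parameters": {"max_new_tokens": 256}}))
#   for chunk in stream_tokens(resp["Body"]):
#       print(chunk, end="", flush=True)
```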

Important nuance:

  • If the goal is a fully managed third-party or first-party foundation model on AWS, many teams choose Amazon Bedrock instead of SageMaker.
  • If the goal is to host your own open-source or fine-tuned model with custom containers or low-level serving control, SageMaker is the more natural fit.

Common Azure APIs

Azure has two main inference families:

  1. Azure Machine Learning for custom models, classic ML, and custom-hosted inference
  2. Azure AI Foundry / Azure OpenAI / Azure AI Model Inference APIs for managed foundation models

Azure Machine Learning APIs

1. Azure ML control-plane APIs

| API or interface | What it is used for |
|---|---|
| Online Endpoints Create Or Update | Create or update a real-time endpoint |
| Online Endpoints Get | Fetch endpoint details such as scoringUri |
| Online Endpoints List Keys | Retrieve key-based credentials |
| Online Endpoints Get Token | Retrieve an AML token for endpoint auth |
| Batch Endpoints Create Or Update | Create a batch inference endpoint |
| Batch Endpoints Get or List | Manage batch endpoints |
| Online deployment create or update | Attach a model deployment under an endpoint |
| Batch deployment create or update | Attach a batch model or pipeline deployment |
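These control-plane calls map onto the azure-ai-ml v2 SDK. A sketch under stated assumptions: names and VM sizes are illustrative, the traffic helper is a convenience added here, and the SDK calls are shown as comments because they need Azure credentials:

```python
def traffic_split(**weights):
    """Validate a traffic allocation for endpoint.traffic; weights are
    percentages and must sum to 100 (e.g. blue=90, green=10)."""
    total = sum(weights.values())
    if total != 100:
        raise ValueError(f"traffic weights must sum to 100, got {total}")
    return dict(weights)

# With the azure-ai-ml v2 SDK and Azure credentials configured:
#   from azure.identity import DefaultAzureCredential
#   from azure.ai.ml import MLClient
#   from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment
#
#   ml_client = MLClient(DefaultAzureCredential(),
#                        subscription_id, resource_group, workspace_name)
#   endpoint = ManagedOnlineEndpoint(name="churn-endpoint", auth_mode="key")
#   ml_client.online_endpoints.begin_create_or_update(endpoint).result()
#
#   deployment = ManagedOnlineDeployment(
#       name="blue", endpoint_name="churn-endpoint",
#       model=model, instance_type="Standard_DS3_v2", instance_count=1)
#   ml_client.online_deployments.begin_create_or_update(deployment).result()
#
#   endpoint.traffic = traffic_split(blue=90, green=10)
#   ml_client.online_endpoints.begin_create_or_update(endpoint).result()
```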

2. Azure ML runtime and invocation patterns

| API or pattern | Best for | Notes |
|---|---|---|
| Endpoint scoringUri over HTTPS POST | Synchronous real-time inference | Common for traditional ML and custom-hosted models |
| az ml online-endpoint invoke | Developer-friendly real-time invocation | Wraps the scoring URI and auth flow |
| ml_client.online_endpoints.invoke() | Python SDK invocation | Common in notebooks, services, and automation |
| Batch endpoint invoke | Long-running batch scoring | Starts a batch job using data references |
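ml_client.online_endpoints.invoke() reads its payload from a file. A minimal sketch, assuming a score.py that expects an {"input_data": ...} body — that shape is a common convention, not a fixed contract:

```python
import json
import tempfile

def write_request_file(payload):
    """Write a JSON payload to a temp file and return its path, in the
    {"input_data": ...} shape a typical score.py expects (an assumption)."""
    f = tempfile.NamedTemporaryFile("w", suffix=".json", delete=False)
    json.dump({"input_data": payload}, f)
    f.close()
    return f.name

# With an MLClient in hand:
#   response = ml_client.online_endpoints.invoke(
#       endpoint_name="churn-endpoint",
#       deployment_name="blue",          # optional: pin one deployment
#       request_file=write_request_file([[0.3, 0.7]]))
```

Omitting deployment_name lets the endpoint's traffic split decide which deployment serves the request.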

Useful Azure ML details:

  • Online endpoints support key-based auth, AML token auth, or Microsoft Entra token auth.
  • One endpoint can host multiple deployments and split traffic across them.
  • Batch endpoints are for long-running asynchronous inference over stored input data.
  • Managed online endpoints are the recommended default for most Azure ML real-time inference scenarios.

Traditional ML on Azure ML

The most common pattern is:

  1. Create an online endpoint
  2. Add one or more deployments
  3. Retrieve the scoringUri
  4. Call the scoringUri over HTTPS POST
  5. Use batch endpoints for offline large-scale scoring
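Steps 3 and 4 above amount to a plain HTTPS POST. A stdlib-only sketch using key-based auth; the URI, key, and payload shape are placeholders:

```python
import json
import urllib.request

def build_scoring_request(scoring_uri, key, payload):
    """Build an HTTPS POST to an Azure ML online endpoint with key auth.
    The payload shape depends on your score.py; {"input_data": ...}
    is a common convention, not a guarantee."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        scoring_uri,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {key}",
        },
        method="POST",
    )

# req = build_scoring_request(scoring_uri, key, {"input_data": [[0.3, 0.7]]})
# with urllib.request.urlopen(req) as resp:
#     prediction = json.loads(resp.read())
```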

Typical traditional ML examples:

  • scikit-learn demand forecasting
  • XGBoost churn scoring
  • PyTorch image classification
  • custom BYOC inference container

LLMs on Azure ML

Use Azure ML online endpoints for LLMs when:

  • you are hosting an open-source or custom fine-tuned model yourself
  • you need custom inference containers or BYOC
  • you want to manage rollout, autoscaling, and routing like any other custom model

This is the "I host the model" path on Azure.


Azure AI Foundry / Azure OpenAI / Azure AI Model Inference APIs

This is the "Azure hosts the foundation model for me" path.

Common LLM inference APIs used on Azure today

| API | Typical use | Best fit |
|---|---|---|
| /openai/v1/responses | Unified generation API | Preferred current choice for Azure OpenAI chat-style generation when supported |
| /openai/v1/chat/completions | Chat-completions-compatible generation | Common when using chat syntax or cross-provider chat-compatible models |
| /openai/v1/embeddings | Embedding generation | RAG, semantic search, clustering, retrieval |
| /chat/completions | Azure AI Model Inference generic chat API | Common interface across Foundry models |
| /embeddings | Azure AI Model Inference generic embeddings API | Provider-agnostic embedding access |
| /info | Azure AI Model Inference model metadata lookup | Dynamic clients and tooling |
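A minimal call against /openai/v1/responses can be sketched as follows. The base URL and deployment name are placeholders, and the commented call assumes the openai Python SDK pointed at an Azure resource:

```python
def responses_payload(model, text):
    """Minimal body for POST {endpoint}/openai/v1/responses: a model
    (deployment) name and the input text."""
    return {"model": model, "input": text}

# With the openai SDK against an Azure OpenAI resource (v1 API):
#   from openai import OpenAI
#   client = OpenAI(
#       base_url="https://<resource>.openai.azure.com/openai/v1/",
#       api_key=api_key)
#   result = client.responses.create(
#       **responses_payload("gpt-4o-mini", "Summarize this ticket: ..."))
#   print(result.output_text)
```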

Important Azure nuance

Current Microsoft Learn guidance, verified on March 23, 2026, is:

  • prefer the v1 Azure OpenAI APIs instead of older dated API-version patterns when that fits the service you are using
  • use the Responses API for Azure OpenAI in Foundry Models when it fits your workload
  • use Azure AI Model Inference APIs when you want a common model-agnostic interface across different providers in Foundry

LLM scenarios on Azure

| Scenario | Common Azure API choice |
|---|---|
| Chatbot or copilot using managed OpenAI-compatible models | /openai/v1/responses |
| Existing app already built around chat-completions format | /openai/v1/chat/completions |
| Embedding generation for RAG | /openai/v1/embeddings or Azure AI Model Inference /embeddings |
| Cross-provider Foundry integration | Azure AI Model Inference /chat/completions and /embeddings |
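For the embedding row, the request body has the same basic shape for /openai/v1/embeddings and the model-inference /embeddings. A small hedged sketch; the model name is illustrative:

```python
def embeddings_payload(model, texts):
    """Body for POST /openai/v1/embeddings (or /embeddings on the
    Azure AI Model Inference API): a model name and a list of inputs."""
    return {"model": model, "input": list(texts)}

# With the openai SDK client from the previous sketch:
#   resp = client.embeddings.create(
#       **embeddings_payload("text-embedding-3-small",
#                            ["doc one", "doc two"]))
#   vectors = [item.embedding for item in resp.data]
```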

Side-by-Side: AWS and Azure by Workload

| Need | AWS choice | Azure choice |
|---|---|---|
| Real-time traditional ML | SageMaker real-time endpoint plus InvokeEndpoint | Azure ML online endpoint plus scoringUri |
| Bursty low-volume inference | SageMaker Serverless Inference | Azure standard deployment for supported foundation models, otherwise a small managed online endpoint |
| Long-running async inference | SageMaker Async Inference plus InvokeEndpointAsync | Azure ML batch endpoint or job-style batch invocation |
| Nightly offline scoring | SageMaker Batch Transform | Azure ML batch endpoint |
| Streaming LLM tokens | SageMaker InvokeEndpointWithResponseStream | Azure OpenAI or Foundry streaming APIs |
| Managed foundation-model API | Often Bedrock on AWS | Azure OpenAI or Azure AI Foundry |
| Self-hosted open-source LLM | SageMaker GPU endpoint | Azure ML online endpoint or BYOC |
| Many small tenant-specific models | SageMaker multi-model endpoint | Azure ML with multiple deployments or a custom routing layer |

Fast Selection Rules

  • Choose SageMaker endpoint APIs when you own the model container and want AWS-managed hosting.
  • Choose Azure ML online or batch endpoints when you own the model artifact, scoring code, or container on Azure.
  • Choose Azure OpenAI or Azure AI Foundry inference APIs when you want managed LLM access with minimal infrastructure work.
  • Put FastAPI in front of any of these when you want one product-facing API that hides cloud-specific details.
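The last rule can be sketched as a thin facade: one product-facing route that resolves a model name to a backend. The registry, backend names, and predict_* helpers are hypothetical:

```python
# Hypothetical registry mapping product-facing model names to
# (backend, target) pairs; both columns are illustrative.
MODEL_BACKENDS = {
    "churn": ("sagemaker", "churn-endpoint"),
    "assistant": ("azure_openai", "gpt-4o-mini"),
}

def route(model_name):
    """Resolve a product-facing model name to (backend, target)."""
    try:
        return MODEL_BACKENDS[model_name]
    except KeyError:
        raise ValueError(f"unknown model: {model_name}")

# FastAPI wiring (requires fastapi; predict_* helpers are hypothetical
# wrappers around InvokeEndpoint and the Azure OpenAI client):
#   from fastapi import FastAPI, HTTPException
#   app = FastAPI()
#
#   @app.post("/predict/{model_name}")
#   def predict(model_name: str, payload: dict):
#       try:
#           backend, target = route(model_name)
#       except ValueError:
#           raise HTTPException(status_code=404, detail="unknown model")
#       if backend == "sagemaker":
#           return predict_sagemaker(target, payload)
#       return predict_azure_openai(target, payload)
```

Callers see one stable API; swapping a model between clouds becomes a registry change rather than a client change.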

References

  • AWS SageMaker Runtime APIs: https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_Operations_Amazon_SageMaker_Runtime.html
  • AWS InvokeEndpoint: https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html
  • AWS InvokeEndpointAsync: https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpointAsync.html
  • AWS InvokeEndpointWithResponseStream: https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpointWithResponseStream.html
  • AWS inference options: https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model-options.html
  • AWS inference components: https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateInferenceComponent.html
  • Azure ML endpoints for inference: https://learn.microsoft.com/en-us/azure/machine-learning/concept-endpoints?view=azureml-api-2
  • Azure ML online endpoints: https://learn.microsoft.com/en-us/azure/machine-learning/concept-endpoints-online?view=azureml-api-2
  • Azure ML online endpoints REST: https://learn.microsoft.com/en-us/rest/api/azureml/online-endpoints
  • Azure ML batch endpoints concept: https://learn.microsoft.com/en-us/azure/machine-learning/concept-endpoints-batch?view=azureml-api-2
  • Azure AI Model Inference REST API: https://learn.microsoft.com/en-us/rest/api/aifoundry/modelinference/
  • Azure OpenAI in Foundry Models v1 API: https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-models/how-to/use-chat-completions