
09. SageMaker and Azure Inference APIs for LLMs and Traditional ML Models

"Most inference confusion disappears once you separate control-plane APIs from runtime APIs."


Quick Mental Model

| Layer | What it does | Examples |
|---|---|---|
| Control plane | Creates, updates, secures, and routes endpoints | CreateEndpoint, Azure online endpoint create, list keys |
| Runtime or data plane | Sends inputs and receives predictions | InvokeEndpoint, scoring URI POST, /openai/v1/responses |
| Batch or job plane | Submits large offline scoring work | CreateTransformJob, Azure batch endpoint invoke |
| Monitoring plane | Observes latency, errors, throughput, and drift | CloudWatch, Azure Monitor, Log Analytics |

If you mix these together, the API surface looks more confusing than it really is.


Common SageMaker APIs

1. SageMaker control-plane APIs

These APIs set up or manage the endpoint itself.

| API | What it is used for | Typical scenario |
|---|---|---|
| CreateModel | Registers model artifacts and the serving container | First deployment of a model |
| CreateEndpointConfig | Defines hosting config such as instance type, variants, or async settings | Preparing a deployment configuration |
| CreateEndpoint | Creates the live endpoint | Initial rollout |
| UpdateEndpoint | Replaces the model version or hosting configuration | Blue-green update or model refresh |
| DescribeEndpoint | Checks provisioning and health status | Deployment automation and health checks |
| DeleteEndpoint | Removes the live endpoint | Cleanup or retirement |
| UpdateEndpointWeightsAndCapacities | Changes traffic weights across variants | Canary, A/B test, gradual rollout |
| CreateInferenceComponent | Deploys a model unit on a shared endpoint | Fine-grained model placement on one endpoint |
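With boto3, the first three control-plane calls chain together roughly as below. All names, the image URI, and the instance type are illustrative, and the live calls are shown as comments because they require AWS credentials:

```python
def endpoint_config(config_name, model_name, instance_type="ml.m5.xlarge"):
    """Build a CreateEndpointConfig request body with a single variant.
    Instance type and counts are illustrative defaults."""
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [{
            "VariantName": "primary",
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 1.0,
        }],
    }

# With boto3 and AWS credentials configured, the flow is:
#   import boto3
#   sm = boto3.client("sagemaker")
#   sm.create_model(ModelName="churn-model",
#                   PrimaryContainer={"Image": image_uri,
#                                     "ModelDataUrl": s3_model_url},
#                   ExecutionRoleArn=role_arn)
#   sm.create_endpoint_config(**endpoint_config("churn-config", "churn-model"))
#   sm.create_endpoint(EndpointName="churn-endpoint",
#                      EndpointConfigName="churn-config")
```

UpdateEndpoint then takes a new EndpointConfigName, which is what makes blue-green swaps a config change rather than an in-place mutation.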

2. SageMaker runtime APIs

These are the APIs applications call to actually run inference.

| API | Best for | Notes |
|---|---|---|
| InvokeEndpoint | Normal synchronous inference | Most common runtime API for traditional ML and self-hosted LLMs |
| InvokeEndpointWithResponseStream | Streaming inference | Useful for token streaming or chunked output |
| InvokeEndpointAsync | Queued long-running inference | Returns an S3 output location instead of an immediate prediction |

Common request-routing features in SageMaker runtime:

  • TargetVariant routes a request to a specific production variant
  • TargetModel selects a model on a multi-model endpoint
  • InferenceComponentName targets an inference component on a shared endpoint
  • InferenceId tags a request for later traceability
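A hedged sketch of how these routing fields attach to an InvokeEndpoint call; the endpoint and variant names are illustrative:

```python
import json

def invoke_args(endpoint_name, payload, *, variant=None, target_model=None,
                component=None, inference_id=None):
    """Build keyword arguments for sagemaker-runtime InvokeEndpoint,
    adding the optional routing fields only when they are set."""
    args = {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",
        "Body": json.dumps(payload),
    }
    if variant:
        args["TargetVariant"] = variant
    if target_model:
        args["TargetModel"] = target_model
    if component:
        args["InferenceComponentName"] = component
    if inference_id:
        args["InferenceId"] = inference_id
    return args

# With boto3 and credentials configured:
#   smr = boto3.client("sagemaker-runtime")
#   resp = smr.invoke_endpoint(**invoke_args("churn-endpoint",
#                                            {"features": [0.1, 0.9]},
#                                            variant="canary"))
#   prediction = json.loads(resp["Body"].read())
```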

3. SageMaker batch APIs

| API | Best for |
|---|---|
| CreateTransformJob | Offline batch scoring over data in S3 |
| DescribeTransformJob | Checking batch job status |
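A CreateTransformJob request for CSV data in S3 looks roughly like this; bucket paths, names, and the instance type are placeholders:

```python
def transform_job(job_name, model_name, input_s3, output_s3):
    """Build a CreateTransformJob request body for line-split CSV input."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix", "S3Uri": input_s3}},
            "ContentType": "text/csv",
            "SplitType": "Line",        # score one CSV row per request
        },
        "TransformOutput": {"S3OutputPath": output_s3},
        "TransformResources": {"InstanceType": "ml.m5.xlarge",
                               "InstanceCount": 1},
    }

# With boto3:
#   sm.create_transform_job(**transform_job("churn-nightly", "churn-model",
#                                           "s3://bucket/in/", "s3://bucket/out/"))
#   status = sm.describe_transform_job(
#       TransformJobName="churn-nightly")["TransformJobStatus"]
```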

Traditional ML on SageMaker

The most common pattern is:

  1. CreateModel
  2. CreateEndpointConfig
  3. CreateEndpoint
  4. InvokeEndpoint for real-time predictions
  5. CreateTransformJob for offline batch scoring

Typical traditional ML examples:

  • XGBoost credit scoring
  • scikit-learn churn prediction
  • Random forest risk classification
  • PyTorch or TensorFlow image classification

LLMs on SageMaker

If you are hosting the LLM yourself on SageMaker, the common runtime pattern is:

  • InvokeEndpoint for normal request-response generation
  • InvokeEndpointWithResponseStream for streaming tokens
  • InvokeEndpointAsync for long document or large-payload jobs
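Streaming responses arrive as an event stream of PayloadPart chunks. A small sketch of draining it; the request payload shape depends on the model container and is an assumption here:

```python
def stream_tokens(event_stream):
    """Yield decoded text chunks from an InvokeEndpointWithResponseStream
    event stream; each PayloadPart event carries raw bytes."""
    for event in event_stream:
        part = event.get("PayloadPart")
        if part:
            yield part["Bytes"].decode("utf-8")

# With boto3:
#   resp = smr.invoke_endpoint_with_response_stream(
#       EndpointName="llm-endpoint",
#       ContentType="application/json",
#       Body=json.dumps({"inputs": prompt,
#                        "parameters": {"max_new_tokens": 256}}))
#   for chunk in stream_tokens(resp["Body"]):
#       print(chunk, end="", flush=True)
```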

Important nuance:

  • If the goal is a fully managed third-party or first-party foundation model on AWS, many teams choose Amazon Bedrock instead of SageMaker.
  • If the goal is to host your own open-source or fine-tuned model with custom containers or low-level serving control, SageMaker is the more natural fit.

Common Azure APIs

Azure has two main inference families:

  1. Azure Machine Learning for custom models, classic ML, and custom-hosted inference
  2. Azure AI Foundry / Azure OpenAI / Azure AI Model Inference APIs for managed foundation models

Azure Machine Learning APIs

1. Azure ML control-plane APIs

| API or interface | What it is used for |
|---|---|
| Online Endpoints Create Or Update | Create or update a real-time endpoint |
| Online Endpoints Get | Fetch endpoint details such as scoringUri |
| Online Endpoints List Keys | Retrieve key-based credentials |
| Online Endpoints Get Token | Retrieve an AML token for endpoint auth |
| Batch Endpoints Create Or Update | Create a batch inference endpoint |
| Batch Endpoints Get or List | Manage batch endpoints |
| Online deployment create or update | Attach a model deployment under an endpoint |
| Batch deployment create or update | Attach a batch model or pipeline deployment |
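These control-plane calls map onto the azure-ai-ml v2 SDK. A sketch under stated assumptions: names and VM sizes are illustrative, the traffic helper is a convenience added here, and the SDK calls are shown as comments because they need Azure credentials:

```python
def traffic_split(**weights):
    """Validate a traffic allocation for endpoint.traffic; weights are
    percentages and must sum to 100 (e.g. blue=90, green=10)."""
    total = sum(weights.values())
    if total != 100:
        raise ValueError(f"traffic weights must sum to 100, got {total}")
    return dict(weights)

# With the azure-ai-ml v2 SDK and Azure credentials configured:
#   from azure.identity import DefaultAzureCredential
#   from azure.ai.ml import MLClient
#   from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment
#
#   ml_client = MLClient(DefaultAzureCredential(),
#                        subscription_id, resource_group, workspace_name)
#   endpoint = ManagedOnlineEndpoint(name="churn-endpoint", auth_mode="key")
#   ml_client.online_endpoints.begin_create_or_update(endpoint).result()
#
#   deployment = ManagedOnlineDeployment(
#       name="blue", endpoint_name="churn-endpoint",
#       model=model, instance_type="Standard_DS3_v2", instance_count=1)
#   ml_client.online_deployments.begin_create_or_update(deployment).result()
#
#   endpoint.traffic = traffic_split(blue=90, green=10)
#   ml_client.online_endpoints.begin_create_or_update(endpoint).result()
```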

2. Azure ML runtime and invocation patterns

| API or pattern | Best for | Notes |
|---|---|---|
| Endpoint scoringUri over HTTPS POST | Synchronous real-time inference | Common for traditional ML and custom-hosted models |
| az ml online-endpoint invoke | Developer-friendly real-time invocation | Wraps the scoring URI and auth flow |
| ml_client.online_endpoints.invoke() | Python SDK invocation | Common in notebooks, services, and automation |
| Batch endpoint invoke | Long-running batch scoring | Starts a batch job using data references |
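ml_client.online_endpoints.invoke() reads its payload from a file. A minimal sketch, assuming a score.py that expects an {"input_data": ...} body — that shape is a common convention, not a fixed contract:

```python
import json
import tempfile

def write_request_file(payload):
    """Write a JSON payload to a temp file and return its path, in the
    {"input_data": ...} shape a typical score.py expects (an assumption)."""
    f = tempfile.NamedTemporaryFile("w", suffix=".json", delete=False)
    json.dump({"input_data": payload}, f)
    f.close()
    return f.name

# With an MLClient in hand:
#   response = ml_client.online_endpoints.invoke(
#       endpoint_name="churn-endpoint",
#       deployment_name="blue",          # optional: pin one deployment
#       request_file=write_request_file([[0.3, 0.7]]))
```

Omitting deployment_name lets the endpoint's traffic split decide which deployment serves the request.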

Useful Azure ML details:

  • Online endpoints support key-based auth, AML token auth, or Microsoft Entra token auth.
  • One endpoint can host multiple deployments and split traffic across them.
  • Batch endpoints are for long-running asynchronous inference over stored input data.
  • Managed online endpoints are the recommended default for most Azure ML real-time inference scenarios.

Traditional ML on Azure ML

The most common pattern is:

  1. Create an online endpoint
  2. Add one or more deployments
  3. Retrieve the scoringUri
  4. Call the scoringUri over HTTPS POST
  5. Use batch endpoints for offline large-scale scoring
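Steps 3 and 4 above amount to a plain HTTPS POST. A stdlib-only sketch using key-based auth; the URI, key, and payload shape are placeholders:

```python
import json
import urllib.request

def build_scoring_request(scoring_uri, key, payload):
    """Build an HTTPS POST to an Azure ML online endpoint with key auth.
    The payload shape depends on your score.py; {"input_data": ...}
    is a common convention, not a guarantee."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        scoring_uri,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {key}",
        },
        method="POST",
    )

# req = build_scoring_request(scoring_uri, key, {"input_data": [[0.3, 0.7]]})
# with urllib.request.urlopen(req) as resp:
#     prediction = json.loads(resp.read())
```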

Typical traditional ML examples:

  • scikit-learn demand forecasting
  • XGBoost churn scoring
  • PyTorch image classification
  • custom BYOC inference container

LLMs on Azure ML

Use Azure ML online endpoints for LLMs when:

  • you are hosting an open-source or custom fine-tuned model yourself
  • you need custom inference containers or BYOC
  • you want to manage rollout, autoscaling, and routing like any other custom model

This is the "I host the model" path on Azure.


Azure AI Foundry / Azure OpenAI / Azure AI Model Inference APIs

This is the "Azure hosts the foundation model for me" path.

Common LLM inference APIs used on Azure today

| API | Typical use | Best fit |
|---|---|---|
| /openai/v1/responses | Unified generation API | Preferred current choice for Azure OpenAI chat-style generation when supported |
| /openai/v1/chat/completions | Chat-completions-compatible generation | Common when using chat syntax or cross-provider chat-compatible models |
| /openai/v1/embeddings | Embedding generation | RAG, semantic search, clustering, retrieval |
| /chat/completions | Azure AI Model Inference generic chat API | Common interface across Foundry models |
| /embeddings | Azure AI Model Inference generic embeddings API | Provider-agnostic embedding access |
| /info | Azure AI Model Inference model metadata lookup | Dynamic clients and tooling |
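A minimal call against /openai/v1/responses can be sketched as follows. The base URL and deployment name are placeholders, and the commented call assumes the openai Python SDK pointed at an Azure resource:

```python
def responses_payload(model, text):
    """Minimal body for POST {endpoint}/openai/v1/responses: a model
    (deployment) name and the input text."""
    return {"model": model, "input": text}

# With the openai SDK against an Azure OpenAI resource (v1 API):
#   from openai import OpenAI
#   client = OpenAI(
#       base_url="https://<resource>.openai.azure.com/openai/v1/",
#       api_key=api_key)
#   result = client.responses.create(
#       **responses_payload("gpt-4o-mini", "Summarize this ticket: ..."))
#   print(result.output_text)
```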

Important Azure nuance

Current Microsoft Learn guidance, verified on March 23, 2026, is:

  • prefer the v1 Azure OpenAI APIs instead of older dated API-version patterns when that fits the service you are using
  • use the Responses API for Azure OpenAI in Foundry Models when it fits your workload
  • use Azure AI Model Inference APIs when you want a common model-agnostic interface across different providers in Foundry

LLM scenarios on Azure

| Scenario | Common Azure API choice |
|---|---|
| Chatbot or copilot using managed OpenAI-compatible models | /openai/v1/responses |
| Existing app already built around chat-completions format | /openai/v1/chat/completions |
| Embedding generation for RAG | /openai/v1/embeddings or Azure AI Model Inference /embeddings |
| Cross-provider Foundry integration | Azure AI Model Inference /chat/completions and /embeddings |
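For the embedding row, the request body has the same basic shape for /openai/v1/embeddings and the model-inference /embeddings. A small hedged sketch; the model name is illustrative:

```python
def embeddings_payload(model, texts):
    """Body for POST /openai/v1/embeddings (or /embeddings on the
    Azure AI Model Inference API): a model name and a list of inputs."""
    return {"model": model, "input": list(texts)}

# With the openai SDK client from the previous sketch:
#   resp = client.embeddings.create(
#       **embeddings_payload("text-embedding-3-small",
#                            ["doc one", "doc two"]))
#   vectors = [item.embedding for item in resp.data]
```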

Side-by-Side: AWS and Azure by Workload

| Need | AWS choice | Azure choice |
|---|---|---|
| Real-time traditional ML | SageMaker real-time endpoint plus InvokeEndpoint | Azure ML online endpoint plus scoringUri |
| Bursty low-volume inference | SageMaker Serverless Inference | Azure standard deployment for supported foundation models, otherwise a small managed online endpoint |
| Long-running async inference | SageMaker Async Inference plus InvokeEndpointAsync | Azure ML batch endpoint or job-style batch invocation |
| Nightly offline scoring | SageMaker Batch Transform | Azure ML batch endpoint |
| Streaming LLM tokens | SageMaker InvokeEndpointWithResponseStream | Azure OpenAI or Foundry streaming APIs |
| Managed foundation-model API | Often Bedrock on AWS | Azure OpenAI or Azure AI Foundry |
| Self-hosted open-source LLM | SageMaker GPU endpoint | Azure ML online endpoint or BYOC |
| Many small tenant-specific models | SageMaker multi-model endpoint | Azure ML with multiple deployments or a custom routing layer |

Fast Selection Rules

  • Choose SageMaker endpoint APIs when you own the model container and want AWS-managed hosting.
  • Choose Azure ML online or batch endpoints when you own the model artifact, scoring code, or container on Azure.
  • Choose Azure OpenAI or Azure AI Foundry inference APIs when you want managed LLM access with minimal infrastructure work.
  • Put FastAPI in front of any of these when you want one product-facing API that hides cloud-specific details.
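The last rule can be sketched as a thin facade: one product-facing route that resolves a model name to a backend. The registry, backend names, and predict_* helpers are hypothetical:

```python
# Hypothetical registry mapping product-facing model names to
# (backend, target) pairs; both columns are illustrative.
MODEL_BACKENDS = {
    "churn": ("sagemaker", "churn-endpoint"),
    "assistant": ("azure_openai", "gpt-4o-mini"),
}

def route(model_name):
    """Resolve a product-facing model name to (backend, target)."""
    try:
        return MODEL_BACKENDS[model_name]
    except KeyError:
        raise ValueError(f"unknown model: {model_name}")

# FastAPI wiring (requires fastapi; predict_* helpers are hypothetical
# wrappers around InvokeEndpoint and the Azure OpenAI client):
#   from fastapi import FastAPI, HTTPException
#   app = FastAPI()
#
#   @app.post("/predict/{model_name}")
#   def predict(model_name: str, payload: dict):
#       try:
#           backend, target = route(model_name)
#       except ValueError:
#           raise HTTPException(status_code=404, detail="unknown model")
#       if backend == "sagemaker":
#           return predict_sagemaker(target, payload)
#       return predict_azure_openai(target, payload)
```

Callers see one stable API; swapping a model between clouds becomes a registry change rather than a client change.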

References

  • AWS SageMaker Runtime APIs: https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_Operations_Amazon_SageMaker_Runtime.html
  • AWS InvokeEndpoint: https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html
  • AWS InvokeEndpointAsync: https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpointAsync.html
  • AWS InvokeEndpointWithResponseStream: https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpointWithResponseStream.html
  • AWS inference options: https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model-options.html
  • AWS inference components: https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateInferenceComponent.html
  • Azure ML endpoints for inference: https://learn.microsoft.com/en-us/azure/machine-learning/concept-endpoints?view=azureml-api-2
  • Azure ML online endpoints: https://learn.microsoft.com/en-us/azure/machine-learning/concept-endpoints-online?view=azureml-api-2
  • Azure ML online endpoints REST: https://learn.microsoft.com/en-us/rest/api/azureml/online-endpoints
  • Azure ML batch endpoints concept: https://learn.microsoft.com/en-us/azure/machine-learning/concept-endpoints-batch?view=azureml-api-2
  • Azure AI Model Inference REST API: https://learn.microsoft.com/en-us/rest/api/aifoundry/modelinference/
  • Azure OpenAI in Foundry Models v1 API: https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-models/how-to/use-chat-completions