# 09. SageMaker and Azure Inference APIs for LLMs and Traditional ML Models

> "Most inference confusion disappears once you separate control-plane APIs from runtime APIs."

## Quick Mental Model
| Layer | What it does | Examples |
|---|---|---|
| Control plane | Creates, updates, secures, and routes endpoints | CreateEndpoint, Azure online endpoint create, list keys |
| Runtime or data plane | Sends inputs and receives predictions | InvokeEndpoint, scoring URI POST, /openai/v1/responses |
| Batch or job plane | Submits large offline scoring work | CreateTransformJob, Azure batch endpoint invoke |
| Monitoring plane | Observes latency, errors, throughput, and drift | CloudWatch, Azure Monitor, Log Analytics |
If you mix these together, the API surface looks more confusing than it really is.
## Common SageMaker APIs

### 1. SageMaker control-plane APIs

These APIs set up or manage the endpoint itself.

| API | What it is used for | Typical scenario |
|---|---|---|
| `CreateModel` | Registers model artifacts and the serving container | First deployment of a model |
| `CreateEndpointConfig` | Defines hosting config such as instance type, variants, or async settings | Preparing a deployment configuration |
| `CreateEndpoint` | Creates the live endpoint | Initial rollout |
| `UpdateEndpoint` | Replaces model version or hosting configuration | Blue-green update or model refresh |
| `DescribeEndpoint` | Checks provisioning and health status | Deployment automation and health checks |
| `DeleteEndpoint` | Removes the live endpoint | Cleanup or retirement |
| `UpdateEndpointWeightsAndCapacities` | Changes traffic weights across variants | Canary, A/B test, gradual rollout |
| `CreateInferenceComponent` | Deploys a model unit on a shared endpoint | Fine-grained model placement on one endpoint |
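The first three control-plane calls compose into a standard rollout flow. A minimal sketch of the parameter shapes, with all names, ARNs, image URIs, and S3 paths hypothetical; the dicts are built in plain Python and the actual boto3 calls are shown as comments:

```python
# Sketch of a first SageMaker rollout: CreateModel -> CreateEndpointConfig -> CreateEndpoint.
# Every name, ARN, and path below is a hypothetical placeholder.

def rollout_params(name, image_uri, model_data_url, role_arn, instance_type="ml.m5.large"):
    """Build parameter dicts for the three control-plane calls."""
    model = {
        "ModelName": name,
        "PrimaryContainer": {"Image": image_uri, "ModelDataUrl": model_data_url},
        "ExecutionRoleArn": role_arn,
    }
    config = {
        "EndpointConfigName": f"{name}-config",
        "ProductionVariants": [{
            "VariantName": "primary",
            "ModelName": name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 1.0,   # all traffic until a second variant exists
        }],
    }
    endpoint = {"EndpointName": name, "EndpointConfigName": f"{name}-config"}
    return model, config, endpoint

model, config, endpoint = rollout_params(
    "churn-model",
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/churn:latest",
    "s3://my-bucket/churn/model.tar.gz",
    "arn:aws:iam::123456789012:role/SageMakerExecRole",
)

# With boto3 (not executed here):
# sm = boto3.client("sagemaker")
# sm.create_model(**model)
# sm.create_endpoint_config(**config)
# sm.create_endpoint(**endpoint)
```

`UpdateEndpoint` later takes a new `EndpointConfigName`, which is what makes blue-green refreshes a config swap rather than an in-place mutation.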
### 2. SageMaker runtime APIs

These are the APIs applications call to actually run inference.

| API | Best for | Notes |
|---|---|---|
| `InvokeEndpoint` | Normal synchronous inference | Most common runtime API for traditional ML and self-hosted LLMs |
| `InvokeEndpointWithResponseStream` | Streaming inference | Useful for token streaming or chunked output |
| `InvokeEndpointAsync` | Queued long-running inference | Returns locations for output instead of immediate prediction |
Common request-routing features in SageMaker runtime:

- `TargetVariant` for production variants
- `TargetModel` for multi-model endpoints
- `InferenceComponentName` for inference components
- `InferenceId` for request traceability
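These routing fields ride along as keyword arguments on the invoke call. A sketch, assuming a hypothetical endpoint name and CSV feature layout; the helper builds the arguments locally and the boto3 call is shown as a comment:

```python
def invoke_args(endpoint_name, features, variant=None, inference_id=None):
    """Build keyword arguments for sagemaker-runtime InvokeEndpoint.
    Endpoint name and feature layout are hypothetical."""
    args = {
        "EndpointName": endpoint_name,
        "ContentType": "text/csv",
        "Accept": "application/json",
        "Body": ",".join(str(x) for x in features),
    }
    if variant:
        args["TargetVariant"] = variant      # pin a production variant (canary testing)
    if inference_id:
        args["InferenceId"] = inference_id   # correlate this request in logs and data capture
    return args

args = invoke_args("churn-prod", [0.12, 3, 41], variant="canary", inference_id="req-001")

# With boto3 (not executed here):
# smr = boto3.client("sagemaker-runtime")
# resp = smr.invoke_endpoint(**args)
# prediction = json.loads(resp["Body"].read())
```

Omitting `TargetVariant` lets the endpoint route by the configured variant weights, which is the normal production path.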
### 3. SageMaker batch APIs

| API | Best for |
|---|---|
| `CreateTransformJob` | Offline batch scoring over data in S3 |
| `DescribeTransformJob` | Checking batch job status |
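A `CreateTransformJob` request mostly describes where the data lives and how to split it. A sketch of the request body, with hypothetical bucket names and job name; the boto3 calls are shown as comments:

```python
def transform_job_params(job_name, model_name, s3_in, s3_out, instance_type="ml.m5.xlarge"):
    """Build the CreateTransformJob request body for offline S3 scoring.
    Bucket paths and the job name are hypothetical."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": s3_in}},
            "ContentType": "text/csv",
            "SplitType": "Line",          # score one record per line
        },
        "TransformOutput": {"S3OutputPath": s3_out},
        "TransformResources": {"InstanceType": instance_type, "InstanceCount": 1},
    }

params = transform_job_params(
    "churn-nightly-0001", "churn-model",
    "s3://my-bucket/scoring/in/", "s3://my-bucket/scoring/out/",
)

# With boto3 (not executed here):
# sm = boto3.client("sagemaker")
# sm.create_transform_job(**params)
# sm.describe_transform_job(TransformJobName=params["TransformJobName"])
```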
## Traditional ML on SageMaker

The most common pattern is:

1. `CreateModel`
2. `CreateEndpointConfig`
3. `CreateEndpoint`
4. `InvokeEndpoint` for real-time predictions
5. `CreateTransformJob` for offline batch scoring
Typical traditional ML examples:
- XGBoost credit scoring
- scikit-learn churn prediction
- Random forest risk classification
- PyTorch or TensorFlow image classification
## LLMs on SageMaker

If you are hosting the LLM yourself on SageMaker, the common runtime pattern is:

- `InvokeEndpoint` for normal request-response generation
- `InvokeEndpointWithResponseStream` for streaming tokens
- `InvokeEndpointAsync` for long-document or large-payload jobs
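For the streaming case, the boto3 response body is an event stream of `PayloadPart` chunks that the caller reassembles. A sketch of that reassembly, exercised here with fake events instead of a live endpoint; the `PayloadPart` shape follows the boto3 response structure, but how tokens are encoded inside each chunk depends on the serving container:

```python
def stream_text(event_stream):
    """Yield decoded text chunks from an InvokeEndpointWithResponseStream body."""
    for event in event_stream:
        part = event.get("PayloadPart")
        if part:
            yield part["Bytes"].decode("utf-8")

# Exercised with fake events rather than a live endpoint:
fake_stream = [
    {"PayloadPart": {"Bytes": b"The answer"}},
    {"PayloadPart": {"Bytes": b" is 42."}},
]
text = "".join(stream_text(fake_stream))

# With boto3 (not executed here):
# resp = smr.invoke_endpoint_with_response_stream(
#     EndpointName="my-llm", ContentType="application/json", Body=payload)
# for chunk in stream_text(resp["Body"]):
#     print(chunk, end="", flush=True)
```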
Important nuance:
- If the goal is a fully managed third-party or first-party foundation model on AWS, many teams choose Amazon Bedrock instead of SageMaker.
- If the goal is to host your own open-source or fine-tuned model with custom containers or low-level serving control, SageMaker is the more natural fit.
## Common Azure APIs

Azure has two main inference families:

- Azure Machine Learning for custom models, classic ML, and custom-hosted inference
- Azure AI Foundry / Azure OpenAI / Azure AI Model Inference APIs for managed foundation models

### Azure Machine Learning APIs

#### 1. Azure ML control-plane APIs
| API or interface | What it is used for |
|---|---|
| Online Endpoints Create Or Update | Create or update a real-time endpoint |
| Online Endpoints Get | Fetch endpoint details such as scoringUri |
| Online Endpoints List Keys | Retrieve key-based credentials |
| Online Endpoints Get Token | Retrieve an AML token for endpoint auth |
| Batch Endpoints Create Or Update | Create a batch inference endpoint |
| Batch Endpoints Get or List | Manage batch endpoints |
| Online deployment create or update | Attach a model deployment under an endpoint |
| Batch deployment create or update | Attach a batch model or pipeline deployment |
#### 2. Azure ML runtime and invocation patterns

| API or pattern | Best for | Notes |
|---|---|---|
| Endpoint `scoringUri` over HTTPS POST | Synchronous real-time inference | Common for traditional ML and custom-hosted models |
| `az ml online-endpoint invoke` | Developer-friendly real-time invocation | Wraps the scoring URI and auth flow |
| `ml_client.online_endpoints.invoke()` | Python SDK invocation | Common in notebooks, services, and automation |
| Batch endpoint invoke | Long-running batch scoring | Starts a batch job using data references |
Useful Azure ML details:
- Online endpoints support key-based auth, AML token auth, or Microsoft Entra token auth.
- One endpoint can host multiple deployments and split traffic across them.
- Batch endpoints are for long-running asynchronous inference over stored input data.
- Managed online endpoints are the recommended default for most Azure ML real-time inference scenarios.
## Traditional ML on Azure ML

The most common pattern is:

- Create an online endpoint
- Add one or more deployments
- Retrieve the `scoringUri`
- Call the `scoringUri` over HTTPS POST
- Use batch endpoints for offline large-scale scoring
Typical traditional ML examples:
- scikit-learn demand forecasting
- XGBoost churn scoring
- PyTorch image classification
- custom BYOC inference container
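The `scoringUri` POST itself is plain HTTPS. A stdlib-only sketch that builds the request, with a hypothetical URI, key, and input schema; with key-based auth the key travels as a Bearer token in the `Authorization` header. The network call is shown as a comment:

```python
import json
import urllib.request

def build_scoring_request(scoring_uri, key, rows):
    """Build an HTTPS POST for an Azure ML online endpoint.
    URI, key, and the "input_data" schema are hypothetical placeholders."""
    body = json.dumps({"input_data": rows}).encode("utf-8")
    return urllib.request.Request(
        scoring_uri,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {key}",   # key or AML token, same header
        },
        method="POST",
    )

req = build_scoring_request(
    "https://churn-endpoint.eastus.inference.ml.azure.com/score",
    "FAKE-KEY",
    [[0.12, 3, 41]],
)

# Sending it (not executed here):
# with urllib.request.urlopen(req) as resp:
#     prediction = json.loads(resp.read())
```

The exact JSON input schema is defined by the deployment's scoring script or MLflow signature, so treat `"input_data"` as an illustration rather than a fixed contract.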
## LLMs on Azure ML
Use Azure ML online endpoints for LLMs when:
- you are hosting an open-source or custom fine-tuned model yourself
- you need custom inference containers or BYOC
- you want to manage rollout, autoscaling, and routing like any other custom model
This is the "I host the model" path on Azure.
## Azure AI Foundry / Azure OpenAI / Azure AI Model Inference APIs

This is the "Azure hosts the foundation model for me" path.

### Common LLM inference APIs used on Azure today
| API | Typical use | Best fit |
|---|---|---|
| `/openai/v1/responses` | Unified generation API | Preferred current choice for Azure OpenAI chat-style generation when supported |
| `/openai/v1/chat/completions` | Chat-completions-compatible generation | Common when using chat syntax or cross-provider chat-compatible models |
| `/openai/v1/embeddings` | Embedding generation | RAG, semantic search, clustering, retrieval |
| `/chat/completions` | Azure AI Model Inference generic chat API | Common interface across Foundry models |
| `/embeddings` | Azure AI Model Inference generic embeddings API | Provider-agnostic embedding access |
| `/info` | Azure AI Model Inference model metadata lookup | Dynamic clients and tooling |
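A minimal Responses API body is small. A sketch, assuming a hypothetical deployment name; only the core fields are shown, and the HTTP details appear as comments rather than a live call:

```python
import json

def responses_payload(model, prompt, max_output_tokens=256):
    """Build a minimal request body for the Azure OpenAI v1 Responses API.
    The model/deployment name is hypothetical; optional fields are omitted."""
    return {
        "model": model,                        # Azure OpenAI deployment name
        "input": prompt,
        "max_output_tokens": max_output_tokens,
    }

body = json.dumps(responses_payload("gpt-4.1-mini", "Summarize our churn metrics."))

# POSTing it (not executed here), assuming a resource endpoint and api-key auth:
# POST https://<resource>.openai.azure.com/openai/v1/responses
# headers: {"api-key": "<key>", "Content-Type": "application/json"}
```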
### Important Azure nuance

Current Microsoft Learn guidance, verified on March 23, 2026, is:

- prefer the `v1` Azure OpenAI APIs instead of older dated API-version patterns when that fits the service you are using
- use the `Responses` API for Azure OpenAI in Foundry Models when it fits your workload
- use Azure AI Model Inference APIs when you want a common model-agnostic interface across different providers in Foundry
### LLM scenarios on Azure

| Scenario | Common Azure API choice |
|---|---|
| Chatbot or copilot using managed OpenAI-compatible models | `/openai/v1/responses` |
| Existing app already built around chat-completions format | `/openai/v1/chat/completions` |
| Embedding generation for RAG | `/openai/v1/embeddings` or Azure AI Model Inference `/embeddings` |
| Cross-provider Foundry integration | Azure AI Model Inference `/chat/completions` and `/embeddings` |
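The embedding row is a useful place to see the two families side by side. A sketch, with a hypothetical model name: the OpenAI-style route expects the model in the body, while the Azure AI Model Inference `/embeddings` route typically fixes the deployment via the URL, so the field can be omitted:

```python
def embeddings_payload(texts, model=None):
    """Build an embeddings request body. Model name is a hypothetical example;
    include it for /openai/v1/embeddings, omit it when the deployment is
    already selected by the Azure AI Model Inference endpoint URL."""
    body = {"input": texts}
    if model:
        body["model"] = model
    return body

docs = ["refund policy", "shipping times"]
openai_style = embeddings_payload(docs, model="text-embedding-3-small")
model_inference_style = embeddings_payload(docs)
```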
## Side-by-Side: AWS and Azure by Workload

| Need | AWS choice | Azure choice |
|---|---|---|
| Real-time traditional ML | SageMaker real-time endpoint plus `InvokeEndpoint` | Azure ML online endpoint plus `scoringUri` |
| Bursty low-volume inference | SageMaker Serverless Inference | Azure standard deployment for supported foundation models, otherwise a small managed online endpoint |
| Long-running async inference | SageMaker Async Inference plus `InvokeEndpointAsync` | Azure ML batch endpoint or job-style batch invocation |
| Nightly offline scoring | SageMaker Batch Transform | Azure ML batch endpoint |
| Streaming LLM tokens | SageMaker `InvokeEndpointWithResponseStream` | Azure OpenAI or Foundry streaming APIs |
| Managed foundation-model API | Often Bedrock on AWS | Azure OpenAI or Azure AI Foundry |
| Self-hosted open-source LLM | SageMaker GPU endpoint | Azure ML online endpoint or BYOC |
| Many small tenant-specific models | SageMaker multi-model endpoint | Azure ML with multiple deployments or a custom routing layer |
## Fast Selection Rules
- Choose SageMaker endpoint APIs when you own the model container and want AWS-managed hosting.
- Choose Azure ML online or batch endpoints when you own the model artifact, scoring code, or container on Azure.
- Choose Azure OpenAI or Azure AI Foundry inference APIs when you want managed LLM access with minimal infrastructure work.
- Put FastAPI in front of any of these when you want one product-facing API that hides cloud-specific details.
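The last rule is mostly an interface exercise: the product-facing layer needs one `predict` contract, and each cloud gets an adapter behind it. A pure-Python sketch with stubbed backends (the real versions would call `InvokeEndpoint` and the `scoringUri` respectively; class and endpoint names are hypothetical, and the FastAPI route that would call `serve` is left out to keep the block self-contained):

```python
from typing import Protocol

class InferenceBackend(Protocol):
    """What a product-facing API (e.g. a FastAPI route) needs from any cloud."""
    def predict(self, payload: dict) -> dict: ...

class SageMakerBackend:
    """Would wrap sagemaker-runtime InvokeEndpoint; stubbed here."""
    def __init__(self, endpoint_name: str):
        self.endpoint_name = endpoint_name
    def predict(self, payload: dict) -> dict:
        # real version: smr.invoke_endpoint(EndpointName=self.endpoint_name, ...)
        return {"backend": "sagemaker", "echo": payload}

class AzureMLBackend:
    """Would POST to the endpoint scoringUri; stubbed here."""
    def __init__(self, scoring_uri: str):
        self.scoring_uri = scoring_uri
    def predict(self, payload: dict) -> dict:
        # real version: HTTPS POST to self.scoring_uri with Bearer auth
        return {"backend": "azureml", "echo": payload}

def serve(backend: InferenceBackend, payload: dict) -> dict:
    """The single code path a product API handler would call."""
    return backend.predict(payload)

out = serve(SageMakerBackend("churn-prod"), {"features": [0.12, 3, 41]})
```

Swapping clouds then means swapping the constructor, not the handler, which is the point of putting one API in front.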
## References

- AWS SageMaker Runtime APIs: https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_Operations_Amazon_SageMaker_Runtime.html
- AWS `InvokeEndpoint`: https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html
- AWS `InvokeEndpointAsync`: https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpointAsync.html
- AWS `InvokeEndpointWithResponseStream`: https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpointWithResponseStream.html
- AWS inference options: https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model-options.html
- AWS inference components: https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateInferenceComponent.html
- Azure ML endpoints for inference: https://learn.microsoft.com/en-us/azure/machine-learning/concept-endpoints?view=azureml-api-2
- Azure ML online endpoints: https://learn.microsoft.com/en-us/azure/machine-learning/concept-endpoints-online?view=azureml-api-2
- Azure ML online endpoints REST: https://learn.microsoft.com/en-us/rest/api/azureml/online-endpoints
- Azure ML batch endpoints concept: https://learn.microsoft.com/en-us/azure/machine-learning/concept-endpoints-batch?view=azureml-api-2
- Azure AI Model Inference REST API: https://learn.microsoft.com/en-us/rest/api/aifoundry/modelinference/
- Azure OpenAI in Foundry Models v1 API: https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-models/how-to/use-chat-completions