08. SageMaker Endpoints, FastAPI, asyncio, and REST for Model Inference
"An endpoint is the stable production contract in front of a model. It is not the model itself."
Quick Answer
| Topic | Short answer |
|---|---|
| SageMaker endpoint | A managed HTTPS inference endpoint in front of model-serving containers on SageMaker |
| FastAPI | A Python web framework used to expose inference APIs or orchestration layers |
| asyncio | Python's async concurrency model for I/O-bound work |
| REST API | The HTTP contract clients call to send input and receive predictions |
The most common source of confusion is that these are not alternatives in the same category.
- A SageMaker endpoint is a managed hosting surface.
- FastAPI is one way to build an application service.
- asyncio is an implementation technique inside that service.
- REST is the interface style exposed to callers.
One production system can use all four together.
What Is a SageMaker Endpoint?
A SageMaker endpoint is a managed HTTPS endpoint that hosts one or more model-serving containers and gives your application a stable runtime target for inference.
In practice, the flow usually looks like this:
- Package model artifacts and choose a serving container.
- Create a SageMaker model object.
- Create an endpoint configuration.
- Create or update the endpoint.
- Call the endpoint through SageMaker Runtime APIs.
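A minimal boto3 sketch of that flow, assuming the serving image and model artifacts already exist; every name, ARN, and URI below is a placeholder:

```python
import boto3

sm = boto3.client("sagemaker")
smr = boto3.client("sagemaker-runtime")

# Steps 1-2: register the model (S3 artifacts + serving container image).
sm.create_model(
    ModelName="churn-model",
    ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerRole",
    PrimaryContainer={
        "Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/serving-image:latest",
        "ModelDataUrl": "s3://my-bucket/model.tar.gz",
    },
)

# Step 3: describe the hosting setup (instance type, variants, scaling).
sm.create_endpoint_config(
    EndpointConfigName="churn-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "churn-model",
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
    }],
)

# Step 4: create the live HTTPS endpoint (this provisions instances).
sm.create_endpoint(EndpointName="churn-endpoint", EndpointConfigName="churn-config")

# Step 5: once the endpoint is InService, invoke it through the runtime API.
response = smr.invoke_endpoint(
    EndpointName="churn-endpoint",
    ContentType="application/json",
    Body=b'{"features": [0.1, 0.4, 0.7]}',
)
print(response["Body"].read())
```

Note that the model, the endpoint configuration, and the endpoint are three separate resources, which is exactly the distinction the next table draws.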
Important distinction
| Term | Meaning |
|---|---|
| Model | The model artifacts plus the inference container definition |
| Endpoint configuration | The hosting setup such as instance type, variants, async config, or serverless config |
| Endpoint | The live managed HTTPS surface the application calls |
| Runtime API | The API used to invoke the endpoint, such as InvokeEndpoint |
Why SageMaker endpoints exist
- They keep a stable URL while the underlying instances change.
- They manage scaling, health checks, and replacement of unhealthy instances.
- They support controlled rollouts through variants and other deployment patterns.
- They provide integration with IAM, VPC controls, metrics, logs, and monitoring.
- They let application teams call inference without managing raw EC2 fleets.
A simple mental model
Think of a SageMaker endpoint as:
Application -> Managed HTTPS endpoint -> Model container(s) -> GPU/CPU instances
Your application talks to the endpoint. SageMaker handles the rest of the serving fleet.
Endpoint Types for ML Models in General
When people say "types of endpoints," they usually mean one of two things:
- The serving mode
- The API protocol
1. Serving-mode endpoint types
| Type | Request pattern | Best for | Example workloads | Main tradeoff |
|---|---|---|---|---|
| Real-time synchronous endpoint | Request and response happen immediately | User-facing low-latency inference | fraud score, intent classification, ranking | Needs warm capacity or careful scaling |
| Serverless real-time endpoint | Synchronous, but infrastructure scales on demand | Bursty or low-volume traffic | internal tools, occasional image classification | Cold starts |
| Asynchronous endpoint | Submit now, get result later | Long-running or large-payload jobs | document summarization, OCR, video analysis | Not instant UX |
| Batch endpoint or batch job | Run inference over large datasets or files | Offline or scheduled scoring | churn scoring, catalog enrichment, demand forecasting | Not interactive |
| Streaming endpoint | Response arrives in chunks | Token streaming or progressive output | chat LLMs, speech systems | More complex client and server handling |
| Multi-model or shared endpoint | Many models share one fleet | Lots of small models with uneven traffic | per-tenant XGBoost, many lightweight recommenders | Shared noisy-neighbor risk and cold loads |
| Pipeline endpoint | Preprocessing, prediction, and postprocessing are chained | Multi-step inference in one deployment | normalize -> predict -> calibrate | Harder debugging than separate services |
| Edge or on-device endpoint | Inference happens near the device | Ultra-low latency, privacy, or disconnected operation | cameras, mobile apps, factory equipment | Device resource limits |
2. API protocol styles
| Protocol style | Best for | Example |
|---|---|---|
| REST over HTTP | Default cross-language model invocation | JSON request and JSON response |
| gRPC | High-throughput internal service-to-service calls | low-latency binary RPC between backend services |
| SSE or chunked HTTP streaming | Token streaming from LLMs | progressive chat output |
| WebSocket | Bidirectional real-time interaction | live transcription or interactive co-pilot |
| Queue or event-driven trigger | Fire-and-forget or async workflows | submit job and process later |
For most application teams, the main architectural choice is between real-time, async, batch, and streaming.
How This Maps to SageMaker
| SageMaker option | What it means | Use when | Avoid when |
|---|---|---|---|
| Real-time inference endpoint | Persistent managed REST endpoint | Low-latency synchronous inference | Traffic is very rare and cost sensitivity is extreme |
| Serverless Inference | Managed sync endpoint that scales to zero | Intermittent traffic, prototypes, internal tools | Cold-start-sensitive production flows |
| Asynchronous Inference | Queue-backed endpoint that returns results later | Payloads are large or runtime can be long | User needs an immediate answer |
| Batch Transform | Offline batch scoring without a persistent endpoint | Large datasets in S3 and no interactive need | User-facing APIs |
| Multi-model endpoint | Many models share one fleet and container | Large numbers of similar smaller models | One hot model dominates traffic or latency is strict |
| Inference pipeline | Several containers chained in one managed deployment | Preprocess + predict + postprocess together | Steps need independent scaling or ownership |
| Inference components | Multiple deployable model units on one endpoint | Fine-grained resource sharing on one endpoint | Teams want fully separate lifecycles |
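To make the asynchronous option concrete, here is a hedged sketch of invoking an Asynchronous Inference endpoint with boto3. The endpoint name and S3 paths are placeholders, and the endpoint is assumed to have been created with an async inference config:

```python
import boto3

smr = boto3.client("sagemaker-runtime")

# Async inference: the input is staged in S3 and the call returns immediately
# with the S3 location where the result will eventually appear.
response = smr.invoke_endpoint_async(
    EndpointName="doc-summarizer",                    # placeholder async endpoint
    InputLocation="s3://my-bucket/inputs/report.pdf", # placeholder input object
    ContentType="application/pdf",
)

# Poll this S3 key, or subscribe to the endpoint's SNS success/error topics.
print(response["OutputLocation"])
```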
Rule of thumb
- Traditional ML model with low-latency API needs: real-time endpoint.
- Bursty internal tool: serverless inference.
- Large payload or multi-minute job: async inference.
- Nightly or hourly scoring over files: batch transform.
- Hundreds of tenant-specific small models: multi-model endpoint.
- One request always needing preprocess + model + postprocess: inference pipeline.
- Self-hosted open-source LLM: GPU-backed real-time or streaming endpoint.
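For the nightly-scoring case, a minimal Batch Transform sketch under the same caveats (bucket names, job name, and instance type are illustrative):

```python
import boto3

sm = boto3.client("sagemaker")

# Batch Transform: score a whole S3 prefix without keeping an endpoint alive.
sm.create_transform_job(
    TransformJobName="churn-nightly-2024-01-01",      # placeholder job name
    ModelName="churn-model",                          # same model object as before
    TransformInput={
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-bucket/users/",
        }},
        "ContentType": "text/csv",
        "SplitType": "Line",                          # one record per CSV line
    },
    TransformOutput={"S3OutputPath": "s3://my-bucket/scores/"},
    TransformResources={"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
)
```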
FastAPI vs asyncio vs REST API
These are not mutually exclusive.
A very common production pattern is:
Client -> REST API -> FastAPI service -> asyncio fan-out -> model endpoints / vector DB / business services
What each one is
| Term | What it is | Use it when | It is a poor fit when |
|---|---|---|---|
| FastAPI | A Python framework for building HTTP APIs | You need validation, auth, orchestration, or custom logic around inference | You want fully managed serving with no app servers to operate |
| asyncio | Python async concurrency runtime | Your service spends time waiting on network calls, caches, DBs, or other model endpoints | The main bottleneck is CPU-bound model compute in one Python process |
| REST API | An HTTP interface style | You want a simple, language-agnostic contract for clients | You need low-level binary RPC or bidirectional real-time communication |
When to use FastAPI
Use FastAPI when:
- you need one product-facing API in front of multiple backends
- you need request validation, auth, rate limiting, or response shaping
- you want custom business rules around model output
- you need orchestration across vector search, cache, profile service, and model calls
- you are serving a small or medium model directly from a Python app
Good examples:
- A RAG API that calls embeddings, vector DB, reranker, and LLM
- A fraud API that combines model score with policy rules
- An internal moderation service that normalizes input before scoring
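A minimal sketch of what such a gateway can look like, assuming a JSON-returning SageMaker endpoint behind it; the endpoint name, feature schema, and decision threshold are illustrative:

```python
import json

import boto3
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
smr = boto3.client("sagemaker-runtime")

class ScoreRequest(BaseModel):
    user_id: str
    features: list[float]

class ScoreResponse(BaseModel):
    score: float
    decision: str

@app.post("/score", response_model=ScoreResponse)
def score(req: ScoreRequest) -> ScoreResponse:
    # Input validation already happened via the Pydantic model above.
    # A plain `def` route is used because boto3 is blocking; FastAPI runs
    # sync routes in a threadpool so the event loop is not stalled.
    resp = smr.invoke_endpoint(
        EndpointName="fraud-model",                   # placeholder endpoint
        ContentType="application/json",
        Body=json.dumps({"features": req.features}),
    )
    model_score = float(json.loads(resp["Body"].read())["score"])
    # Business rule layered on top of the raw model output.
    return ScoreResponse(score=model_score,
                         decision="review" if model_score > 0.8 else "allow")
```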
Do not rely on FastAPI alone when:
- you need highly optimized GPU batching for large LLM throughput
- you want the cloud platform to manage deployment and autoscaling for you
- you need a specialized serving runtime such as Triton, vLLM, or TGI
When to use asyncio
Use asyncio when the bottleneck is waiting, not raw math.
Good asyncio scenarios:
- call multiple model endpoints in parallel
- fetch vector search results, user profile, and feature flags concurrently
- stream tokens from an LLM to the client
- implement timeouts, cancellation, retries, and circuit breakers
- build a high-concurrency API gateway
Bad asyncio scenarios:
- a CPU-heavy scikit-learn loop inside one Python worker
- a PyTorch forward pass that is compute-bound inside one process
- any workload where the real bottleneck is GPU or CPU execution time rather than I/O wait time
In those cases, use worker processes, optimized runtimes, batching, or managed endpoints instead of expecting asyncio to speed up compute.
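A sketch of the fan-out pattern, using httpx as an assumed async HTTP client; the internal URLs and payloads are placeholders:

```python
import asyncio

import httpx

async def build_context(user_id: str) -> dict:
    # Three network-bound lookups run concurrently instead of sequentially,
    # with a shared 2-second timeout on every call.
    async with httpx.AsyncClient(timeout=2.0) as client:
        profile, vectors, flags = await asyncio.gather(
            client.get(f"https://profile.internal/users/{user_id}"),
            client.post("https://vectordb.internal/search", json={"q": user_id}),
            client.get(f"https://flags.internal/{user_id}"),
        )
    # Total wall time tracks the slowest call, not the sum of all three.
    return {
        "profile": profile.json(),
        "matches": vectors.json(),
        "flags": flags.json(),
    }

# asyncio.run(build_context("u-123"))
```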
When REST API is the right choice
REST is the default choice when:
- the client can make normal HTTP calls
- request and response fit well in JSON or standard HTTP payloads
- interoperability matters more than absolute efficiency
- the result can be returned in one response
Good REST scenarios:
- predict house price
- classify support-ticket intent
- rerank search results
- return a final summary after one model run
Use streaming instead of plain one-shot REST when:
- the user should see tokens as they are generated
- responses are long and progressive UX matters
- the job can take long enough that immediate partial output is useful
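A minimal FastAPI streaming sketch; the token generator here is a stand-in for a real streaming model client:

```python
import asyncio

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def fake_token_stream():
    # Stand-in for an LLM streaming client: yields tokens as they "arrive".
    for token in ["The", " answer", " is", " 42", "."]:
        await asyncio.sleep(0.05)
        yield token

@app.get("/chat")
async def chat():
    # The client sees output progressively instead of waiting for one
    # complete response body.
    return StreamingResponse(fake_token_stream(), media_type="text/plain")
```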
Practical Scenarios
| Scenario | Best fit | Why |
|---|---|---|
| Credit-risk score during checkout | Real-time endpoint | Low latency and synchronous decision |
| Nightly churn scoring for millions of users | Batch endpoint or job | Large volume and no immediate user waiting |
| Upload a 500-page PDF and get a summary later | Async endpoint | Long-running job and large payload |
| Chat assistant with live token streaming | Streaming API plus FastAPI gateway | Better UX and orchestration around the LLM |
| Internal analytics tool used 20 times a day | Serverless endpoint | Avoids paying for idle capacity |
| SaaS platform with one small model per tenant | Multi-model endpoint | Better utilization than one endpoint per tenant |
| RAG assistant needing cache, vector DB, profile, and LLM | FastAPI plus asyncio plus managed endpoints | The orchestration layer is I/O-heavy |
| Simple tabular model called by many backend teams | REST API | Easiest cross-language contract |
Interview-Friendly Distinctions
- A SageMaker endpoint is the managed serving surface, not just the model artifact.
- FastAPI is an implementation framework.
- REST is the interface contract.
- asyncio is the concurrency model inside the service.
- A single production inference system can use all of them together.
References
- AWS SageMaker inference options: https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model-options.html
- AWS SageMaker real-time inference: https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html
- AWS SageMaker Serverless Inference: https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html
- AWS SageMaker Asynchronous Inference: https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference.html
- AWS SageMaker Batch Transform: https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html
- AWS SageMaker multi-model endpoints: https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoints.html
- AWS SageMaker inference pipelines: https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html