08. SageMaker Endpoints, FastAPI, asyncio, and REST for Model Inference
"An endpoint is the stable production contract in front of a model. It is not the model itself."
Quick Answer
| Topic | Short answer |
|---|---|
| SageMaker endpoint | A managed HTTPS inference endpoint in front of model-serving containers on SageMaker |
| FastAPI | A Python web framework used to expose inference APIs or orchestration layers |
| asyncio | Python's async concurrency model for I/O-bound work |
| REST API | The HTTP contract clients call to send input and receive predictions |
The most common source of confusion is that these are not alternatives in the same category.
- A SageMaker endpoint is a managed hosting surface.
- FastAPI is one way to build an application service.
- asyncio is an implementation technique inside that service.
- REST is the interface style exposed to callers.
One production system can use all four together.
What Is a SageMaker Endpoint?
A SageMaker endpoint is a managed HTTPS endpoint that hosts one or more model-serving containers and gives your application a stable runtime target for inference.
In practice, the flow usually looks like this:
- Package model artifacts and choose a serving container.
- Create a SageMaker model object.
- Create an endpoint configuration.
- Create or update the endpoint.
- Call the endpoint through SageMaker Runtime APIs.
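A minimal boto3 sketch of that flow, assuming the serving image and model artifacts already exist; every name, ARN, and URI below is a placeholder:

```python
import boto3

sm = boto3.client("sagemaker")
smr = boto3.client("sagemaker-runtime")

# Steps 1-2: register the model (S3 artifacts + serving container image).
sm.create_model(
    ModelName="churn-model",
    ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerRole",
    PrimaryContainer={
        "Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/serving-image:latest",
        "ModelDataUrl": "s3://my-bucket/model.tar.gz",
    },
)

# Step 3: describe the hosting setup (instance type, variants, scaling).
sm.create_endpoint_config(
    EndpointConfigName="churn-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "churn-model",
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
    }],
)

# Step 4: create the live HTTPS endpoint (this provisions instances).
sm.create_endpoint(EndpointName="churn-endpoint", EndpointConfigName="churn-config")

# Step 5: once the endpoint is InService, invoke it through the runtime API.
response = smr.invoke_endpoint(
    EndpointName="churn-endpoint",
    ContentType="application/json",
    Body=b'{"features": [0.1, 0.4, 0.7]}',
)
print(response["Body"].read())
```

Note that the model, the endpoint configuration, and the endpoint are three separate resources, which is exactly the distinction the next table draws.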
Important distinction
| Term | Meaning |
|---|---|
| Model | The model artifacts plus the inference container definition |
| Endpoint configuration | The hosting setup such as instance type, variants, async config, or serverless config |
| Endpoint | The live managed HTTPS surface the application calls |
| Runtime API | The API used to invoke the endpoint, such as InvokeEndpoint |
Why SageMaker endpoints exist
- They keep a stable URL while the underlying instances change.
- They manage scaling, health checks, and replacement of unhealthy instances.
- They support controlled rollouts through variants and other deployment patterns.
- They provide integration with IAM, VPC controls, metrics, logs, and monitoring.
- They let application teams call inference without managing raw EC2 fleets.
A simple mental model
Think of a SageMaker endpoint as:
Application -> Managed HTTPS endpoint -> Model container(s) -> GPU/CPU instances
Your application talks to the endpoint. SageMaker handles the rest of the serving fleet.
Endpoint Types for ML Models in General
When people say "types of endpoints," they usually mean one of two things:
- The serving mode
- The API protocol
1. Serving-mode endpoint types
| Type | Request pattern | Best for | Example workloads | Main tradeoff |
|---|---|---|---|---|
| Real-time synchronous endpoint | Request and response happen immediately | User-facing low-latency inference | fraud score, intent classification, ranking | Needs warm capacity or careful scaling |
| Serverless real-time endpoint | Synchronous, but infrastructure scales on demand | Bursty or low-volume traffic | internal tools, occasional image classification | Cold starts |
| Asynchronous endpoint | Submit now, get result later | Long-running or large-payload jobs | document summarization, OCR, video analysis | Not instant UX |
| Batch endpoint or batch job | Run inference over large datasets or files | Offline or scheduled scoring | churn scoring, catalog enrichment, demand forecasting | Not interactive |
| Streaming endpoint | Response arrives in chunks | Token streaming or progressive output | chat LLMs, speech systems | More complex client and server handling |
| Multi-model or shared endpoint | Many models share one fleet | Lots of small models with uneven traffic | per-tenant XGBoost, many lightweight recommenders | Shared noisy-neighbor risk and cold loads |
| Pipeline endpoint | Preprocessing, prediction, and postprocessing are chained | Multi-step inference in one deployment | normalize -> predict -> calibrate | Harder debugging than separate services |
| Edge or on-device endpoint | Inference happens near the device | Ultra-low latency, privacy, or disconnected operation | cameras, mobile apps, factory equipment | Device resource limits |
2. API protocol styles
| Protocol style | Best for | Example |
|---|---|---|
| REST over HTTP | Default cross-language model invocation | JSON request and JSON response |
| gRPC | High-throughput internal service-to-service calls | low-latency binary RPC between backend services |
| SSE or chunked HTTP streaming | Token streaming from LLMs | progressive chat output |
| WebSocket | Bidirectional real-time interaction | live transcription or interactive co-pilot |
| Queue or event-driven trigger | Fire-and-forget or async workflows | submit job and process later |
For most application teams, the main architectural choice is between real-time, async, batch, and streaming.
How This Maps to SageMaker
| SageMaker option | What it means | Use when | Avoid when |
|---|---|---|---|
| Real-time inference endpoint | Persistent managed REST endpoint | Low-latency synchronous inference | Traffic is very rare and cost sensitivity is extreme |
| Serverless Inference | Managed sync endpoint that scales to zero | Intermittent traffic, prototypes, internal tools | Cold-start-sensitive production flows |
| Asynchronous Inference | Queue-backed endpoint that returns results later | Payloads are large or runtime can be long | User needs an immediate answer |
| Batch Transform | Offline batch scoring without a persistent endpoint | Large datasets in S3 and no interactive need | User-facing APIs |
| Multi-model endpoint | Many models share one fleet and container | Large numbers of similar smaller models | One hot model dominates traffic or latency is strict |
| Inference pipeline | Several containers chained in one managed deployment | Preprocess + predict + postprocess together | Steps need independent scaling or ownership |
| Inference components | Multiple deployable model units on one endpoint | Fine-grained resource sharing on one endpoint | Teams want fully separate lifecycles |
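To make the asynchronous option concrete, here is a hedged sketch of invoking an Asynchronous Inference endpoint with boto3. The endpoint name and S3 paths are placeholders, and the endpoint is assumed to have been created with an async inference config:

```python
import boto3

smr = boto3.client("sagemaker-runtime")

# Async inference: the input is staged in S3 and the call returns immediately
# with the S3 location where the result will eventually appear.
response = smr.invoke_endpoint_async(
    EndpointName="doc-summarizer",                    # placeholder async endpoint
    InputLocation="s3://my-bucket/inputs/report.pdf", # placeholder input object
    ContentType="application/pdf",
)

# Poll this S3 key, or subscribe to the endpoint's SNS success/error topics.
print(response["OutputLocation"])
```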
Rule of thumb
- Traditional ML model with low-latency API needs: real-time endpoint.
- Bursty internal tool: serverless inference.
- Large payload or multi-minute job: async inference.
- Nightly or hourly scoring over files: batch transform.
- Hundreds of tenant-specific small models: multi-model endpoint.
- One request always needing preprocess + model + postprocess: inference pipeline.
- Self-hosted open-source LLM: GPU-backed real-time or streaming endpoint.
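For the nightly-scoring case, a minimal Batch Transform sketch under the same caveats (bucket names, job name, and instance type are illustrative):

```python
import boto3

sm = boto3.client("sagemaker")

# Batch Transform: score a whole S3 prefix without keeping an endpoint alive.
sm.create_transform_job(
    TransformJobName="churn-nightly-2024-01-01",      # placeholder job name
    ModelName="churn-model",                          # same model object as before
    TransformInput={
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-bucket/users/",
        }},
        "ContentType": "text/csv",
        "SplitType": "Line",                          # one record per CSV line
    },
    TransformOutput={"S3OutputPath": "s3://my-bucket/scores/"},
    TransformResources={"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
)
```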
FastAPI vs asyncio vs REST API
These are not mutually exclusive.
A very common production pattern is:
Client -> REST API -> FastAPI service -> asyncio fan-out -> model endpoints / vector DB / business services
What each one is
| Term | What it is | Use it when | It is a poor fit when |
|---|---|---|---|
| FastAPI | A Python framework for building HTTP APIs | You need validation, auth, orchestration, or custom logic around inference | You want fully managed serving with no app servers to operate |
| asyncio | Python async concurrency runtime | Your service spends time waiting on network calls, caches, DBs, or other model endpoints | The main bottleneck is CPU-bound model compute in one Python process |
| REST API | An HTTP interface style | You want a simple, language-agnostic contract for clients | You need low-level binary RPC or bidirectional real-time communication |
When to use FastAPI
Use FastAPI when:
- you need one product-facing API in front of multiple backends
- you need request validation, auth, rate limiting, or response shaping
- you want custom business rules around model output
- you need orchestration across vector search, cache, profile service, and model calls
- you are serving a small or medium model directly from a Python app
Good examples:
- A RAG API that calls embeddings, vector DB, reranker, and LLM
- A fraud API that combines model score with policy rules
- An internal moderation service that normalizes input before scoring
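A minimal sketch of what such a gateway can look like, assuming a JSON-returning SageMaker endpoint behind it; the endpoint name, feature schema, and decision threshold are illustrative:

```python
import json

import boto3
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
smr = boto3.client("sagemaker-runtime")

class ScoreRequest(BaseModel):
    user_id: str
    features: list[float]

class ScoreResponse(BaseModel):
    score: float
    decision: str

@app.post("/score", response_model=ScoreResponse)
def score(req: ScoreRequest) -> ScoreResponse:
    # Input validation already happened via the Pydantic model above.
    # A plain `def` route is used because boto3 is blocking; FastAPI runs
    # sync routes in a threadpool so the event loop is not stalled.
    resp = smr.invoke_endpoint(
        EndpointName="fraud-model",                   # placeholder endpoint
        ContentType="application/json",
        Body=json.dumps({"features": req.features}),
    )
    model_score = float(json.loads(resp["Body"].read())["score"])
    # Business rule layered on top of the raw model output.
    return ScoreResponse(score=model_score,
                         decision="review" if model_score > 0.8 else "allow")
```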
Do not rely on FastAPI alone when:
- you need highly optimized GPU batching for large LLM throughput
- you want the cloud platform to manage deployment and autoscaling for you
- you need a specialized serving runtime such as Triton, vLLM, or TGI
When to use asyncio
Use asyncio when the bottleneck is waiting, not raw math.
Good asyncio scenarios:
- call multiple model endpoints in parallel
- fetch vector search results, user profile, and feature flags concurrently
- stream tokens from an LLM to the client
- implement timeouts, cancellation, retries, and circuit breakers
- build a high-concurrency API gateway
Bad asyncio scenarios:
- a CPU-heavy scikit-learn loop inside one Python worker
- a PyTorch forward pass that is compute-bound inside one process
- any workload where the real bottleneck is GPU or CPU execution time rather than I/O wait time
In those cases, use worker processes, optimized runtimes, batching, or managed endpoints instead of expecting asyncio to speed up compute.
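A sketch of the fan-out pattern, using httpx as an assumed async HTTP client; the internal URLs and payloads are placeholders:

```python
import asyncio

import httpx

async def build_context(user_id: str) -> dict:
    # Three network-bound lookups run concurrently instead of sequentially,
    # with a shared 2-second timeout on every call.
    async with httpx.AsyncClient(timeout=2.0) as client:
        profile, vectors, flags = await asyncio.gather(
            client.get(f"https://profile.internal/users/{user_id}"),
            client.post("https://vectordb.internal/search", json={"q": user_id}),
            client.get(f"https://flags.internal/{user_id}"),
        )
    # Total wall time tracks the slowest call, not the sum of all three.
    return {
        "profile": profile.json(),
        "matches": vectors.json(),
        "flags": flags.json(),
    }

# asyncio.run(build_context("u-123"))
```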
When REST API is the right choice
REST is the default choice when:
- the client can make normal HTTP calls
- request and response fit well in JSON or standard HTTP payloads
- interoperability matters more than absolute efficiency
- the result can be returned in one response
Good REST scenarios:
- predict house price
- classify support-ticket intent
- rerank search results
- return a final summary after one model run
Use streaming instead of plain one-shot REST when:
- the user should see tokens as they are generated
- responses are long and progressive UX matters
- the job can take long enough that immediate partial output is useful
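A minimal FastAPI streaming sketch; the token generator here is a stand-in for a real streaming model client:

```python
import asyncio

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def fake_token_stream():
    # Stand-in for an LLM streaming client: yields tokens as they "arrive".
    for token in ["The", " answer", " is", " 42", "."]:
        await asyncio.sleep(0.05)
        yield token

@app.get("/chat")
async def chat():
    # The client sees output progressively instead of waiting for one
    # complete response body.
    return StreamingResponse(fake_token_stream(), media_type="text/plain")
```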
Practical Scenarios
| Scenario | Best fit | Why |
|---|---|---|
| Credit-risk score during checkout | Real-time endpoint | Low latency and synchronous decision |
| Nightly churn scoring for millions of users | Batch endpoint or job | Large volume and no immediate user waiting |
| Upload a 500-page PDF and get a summary later | Async endpoint | Long-running job and large payload |
| Chat assistant with live token streaming | Streaming API plus FastAPI gateway | Better UX and orchestration around the LLM |
| Internal analytics tool used 20 times a day | Serverless endpoint | Avoids paying for idle capacity |
| SaaS platform with one small model per tenant | Multi-model endpoint | Better utilization than one endpoint per tenant |
| RAG assistant needing cache, vector DB, profile, and LLM | FastAPI plus asyncio plus managed endpoints | The orchestration layer is I/O-heavy |
| Simple tabular model called by many backend teams | REST API | Easiest cross-language contract |
Interview-Friendly Distinctions
- A SageMaker endpoint is the managed serving surface, not just the model artifact.
- FastAPI is an implementation framework.
- REST is the interface contract.
- asyncio is the concurrency model inside the service.
- A single production inference system can use all of them together.
References
- AWS SageMaker inference options: https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model-options.html
- AWS SageMaker real-time inference: https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html
- AWS SageMaker Serverless Inference: https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html
- AWS SageMaker Asynchronous Inference: https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference.html
- AWS SageMaker Batch Transform: https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html
- AWS SageMaker multi-model endpoints: https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoints.html
- AWS SageMaker inference pipelines: https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html