
How do I deploy a model on VESSL AI Service (serverless vs provisioned) and expose an endpoint?
Most teams hit the same wall: the model is trained, weights look good, but turning it into a reliable, callable endpoint is slow, brittle, or blocked by GPUs. VESSL AI Service removes most of that friction, but you still need to choose how you run it: serverless or provisioned.
This guide walks through both paths on VESSL AI Service—how to deploy a model, expose an endpoint, and decide when serverless vs provisioned is the right fit.
Understand VESSL AI Service: serverless vs provisioned
VESSL AI Service gives you two ways to run inference:
- Serverless Service
  - VESSL owns the autoscaling and lifecycle.
  - You pay per compute-minute (backed by VESSL Cloud capacity).
  - Cold starts may apply.
  - Best for bursty traffic, early-stage features, or public demos.
- Provisioned Service
  - You pin capacity: GPU type, replica count, and region.
  - Higher reliability and predictable latency; no cold starts.
  - Best for production APIs, SLAs, and latency-sensitive workloads.
Both modes:
- Run on A100/H100/H200/B200/GB200/B300-class GPUs (depending on your selection).
- Use the same Web Console and CLI (vessl run) primitives.
- Expose HTTPS endpoints you can call from apps, agents, or other services.
Prerequisites: what you need ready
Before you deploy:
- A model artifact
  - Model weights in a format your serving code can load (e.g., .safetensors, .pt, .bin, or a Hugging Face repo).
  - Optionally packaged into a container image if you want full control.
- Serving code
  - An HTTP server (FastAPI, Flask, or similar) with:
    - A /health or equivalent health endpoint.
    - An inference endpoint (e.g., /predict or /generate) that:
      - Accepts JSON.
      - Returns JSON.
  - A startup command (e.g., uvicorn app:app --host 0.0.0.0 --port 8000).
- VESSL AI access
  - VESSL account and access to VESSL Cloud.
  - vessl CLI installed and authenticated if you prefer CLI over Web Console.
- Storage for model assets
  - One of:
    - Bundled in the container image, or
    - Pulled from VESSL Cluster Storage/Object Storage, or
    - Pulled from an external source (S3, Hugging Face, etc.).
Key design decision: when to choose serverless vs provisioned
Use this as a quick decision matrix for your endpoint:
Choose Serverless Service if:
- Traffic is spiky or low but unpredictable.
- You’re running:
  - Internal prototypes.
  - Feature experiments.
  - Public demos and hackathon endpoints.
- You want:
  - No capacity planning.
  - Automatic scale-to-zero and scale-up.
Choose Provisioned Service if:
- You need consistent latency and no cold starts.
- You’re serving:
  - Production LLM APIs.
  - RAG backends for user-facing apps.
  - Physical AI / robotics controllers that can’t tolerate startup lag.
- You want:
  - Guaranteed GPU capacity.
  - Control over replica counts and GPU SKUs.
  - The option to align with On-Demand or Reserved tiers.
You can start serverless for discovery and graduate to provisioned once traffic stabilizes or SLAs appear.
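The checklist above can be read as a small decision rule. The sketch below is only an illustration of that reasoning: the `Workload` fields and the `recommend_mode` helper are hypothetical names, not part of any VESSL API.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical summary of an endpoint's requirements."""
    steady_traffic: bool          # traffic is predictable rather than bursty
    has_sla: bool                 # a latency or availability SLA applies
    tolerates_cold_starts: bool   # occasional startup delay is acceptable


def recommend_mode(w: Workload) -> str:
    """Return "provisioned" or "serverless" following the checklist above."""
    # An SLA or intolerance of startup lag rules out cold starts.
    if w.has_sla or not w.tolerates_cold_starts:
        return "provisioned"
    # Steady traffic justifies pinned capacity.
    if w.steady_traffic:
        return "provisioned"
    # Bursty, SLA-free traffic fits scale-to-zero.
    return "serverless"


demo = Workload(steady_traffic=False, has_sla=False, tolerates_cold_starts=True)
print(recommend_mode(demo))
```

Treat the result as a starting point; cost and team preferences can reasonably override it.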
Common setup for both modes
Whether you go serverless or provisioned, you define the same core pieces:
- Image & environment
  - Base image (examples):
    - nvcr.io/nvidia/pytorch:24.01-py3 for PyTorch.
    - A custom image you built and pushed (e.g., ghcr.io/org/model-service:latest).
  - Environment variables:
    - Paths, HF tokens (if needed), model names.
  - Ports:
    - The port your HTTP server listens on (e.g., 8000).
- Model loading logic
  In app.py or equivalent:
from fastapi import FastAPI
from pydantic import BaseModel
import torch

app = FastAPI()
model = None  # set once at startup

class InferenceRequest(BaseModel):
    text: str

class InferenceResponse(BaseModel):
    output: str

@app.on_event("startup")
def load_model():
    global model
    # Load your model from disk or remote storage.
    # weights_only=False is needed to unpickle a full model object on recent
    # PyTorch versions; only use it with files you trust.
    model = torch.load("/models/model.pt", map_location="cuda", weights_only=False)
    model.eval()

@app.get("/health")
def health():
    return {"status": "ok"}

@app.post("/predict", response_model=InferenceResponse)
def predict(req: InferenceRequest):
    # Run inference (simplified); generate() stands in for your model's own API
    with torch.no_grad():
        output = model.generate(req.text)
    return InferenceResponse(output=str(output))
- Startup command
  Example:
uvicorn app:app --host 0.0.0.0 --port 8000
  You’ll reference this in your service spec (Web Console or CLI).
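Since paths, tokens, and ports are passed as environment variables, it can help to read them in one place at startup. This is a minimal sketch using only the standard library; the variable names `MODEL_PATH`, `PORT`, and `HF_TOKEN` are assumptions chosen for illustration, not names VESSL defines.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ServiceSettings:
    model_path: str
    port: int
    hf_token: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> ServiceSettings:
    """Build settings from environment variables (names are hypothetical)."""
    return ServiceSettings(
        model_path=env.get("MODEL_PATH", "/models/model.pt"),
        port=int(env.get("PORT", "8000")),
        # Treat an empty token the same as a missing one.
        hf_token=env.get("HF_TOKEN") or None,
    )


# Passing a plain dict keeps the example independent of the real environment.
settings = load_settings({"MODEL_PATH": "/mnt/weights/model.pt"})
print(settings)
```

The same settings object then works unchanged for serverless and provisioned deployments; only the values set in the service spec differ.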
Path 1: Deploy as a Serverless Service
Step 1 – Define the serverless service (Web Console)
- Go to VESSL Web Console.
- Navigate to Service → Create Service.
- Choose Serverless as the service type.
Fill out:
- Name: e.g., llm-serverless-api.
- Runtime / Image:
  - Select a base image or your custom container.
- Command:
  - uvicorn app:app --host 0.0.0.0 --port 8000.
- Port:
  - 8000 (or whatever your app uses).
- Resources:
  - GPU: choose the class (e.g., NVIDIA A100 80GB, H100, etc.).
  - CPU and memory: enough for model load and inference.
- Autoscaling (serverless):
  - Set min replicas (often 0 for scale-to-zero).
  - Set max replicas based on your expected burst (e.g., 5).
Attach any volumes or Object Storage where your model weights live, or configure your app to download them on startup.
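To build intuition for what the min and max replica settings bound, here is a rough sketch of a concurrency-based scaling rule. It is not VESSL's autoscaling algorithm, which this guide does not describe; `desired_replicas` and the per-replica concurrency figure are assumptions for illustration.

```python
import math


def desired_replicas(in_flight: int, per_replica: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Replicas needed for `in_flight` concurrent requests, clamped to the bounds."""
    if per_replica <= 0:
        raise ValueError("per_replica must be positive")
    needed = math.ceil(in_flight / per_replica)
    return max(min_replicas, min(max_replicas, needed))


# With min=0 and max=5: idle traffic scales to zero, a burst is capped at five.
for load in (0, 3, 40):
    print(load, desired_replicas(load, per_replica=4, min_replicas=0, max_replicas=5))
```

A max that is too low turns bursts into queued or rejected requests, while a min of 0 trades idle cost for cold starts.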
Step 2 – Deploy
- Click Deploy.
- VESSL will:
  - Build/start the underlying container.
  - Add autoscaling and a service mesh layer.
  - Expose an HTTPS endpoint.
Wait for status to become Running / Healthy.
Step 3 – Get the serverless endpoint URL
Once running:
- Open the service details page.
- Copy the Endpoint URL (e.g., https://llm-serverless-api-xxxxx.vessl.ai).
This is your public API endpoint.
Step 4 – Call the serverless endpoint
Example curl:
curl -X POST \
"https://llm-serverless-api-xxxxx.vessl.ai/predict" \
-H "Content-Type: application/json" \
-d '{"text": "Hello from VESSL AI"}'
Example from Python:
import requests
url = "https://llm-serverless-api-xxxxx.vessl.ai/predict"
payload = {"text": "Hello from VESSL AI"}
resp = requests.post(url, json=payload, timeout=30)
resp.raise_for_status()
print(resp.json())
You can integrate this into your agents, web backends, or batch jobs.
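Because a scaled-to-zero service may need time to start a replica, callers often wrap the request in a retry with exponential backoff. The helper below is a sketch under that assumption; `call_with_retry` is a hypothetical name, and the demo uses a fake sender instead of a real network call.

```python
import time
from typing import Callable, Tuple, Type


def call_with_retry(
    send: Callable[[], dict],
    attempts: int = 4,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call `send`, retrying with exponential backoff on the given exceptions."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return send()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demo with a fake sender that fails twice, as a cold replica might.
calls = {"n": 0}


def flaky_send() -> dict:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("replica still starting")
    return {"output": "ok"}


delays = []
result = call_with_retry(flaky_send, sleep=delays.append)
print(result, delays)
```

With the requests library, `send` could wrap the `requests.post(...)` call shown above, passing `retry_on=(requests.ConnectionError, requests.Timeout)`, since requests raises its own exception types.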
When to move off serverless
If you see:
- Frequent cold-start penalties.
- Steady or rising QPS that justifies dedicated capacity.
- Requirements for strict SLAs or lower p95 latency.
Then you likely want to re-deploy the same container as a Provisioned Service.
Path 2: Deploy as a Provisioned Service
Provisioned Service gives you pinned capacity and more direct control—think of it as a dedicated inference cluster wrapped in a clean API.
Step 1 – Define the provisioned service (Web Console)
- Go to Service → Create Service.
- Choose Provisioned as the service type.
Configure:
- Name: e.g., llm-prod-api.
- Image & Command: same as serverless:
  - Image: ghcr.io/org/llm-service:latest (or base image).
  - Command: uvicorn app:app --host 0.0.0.0 --port 8000.
- Port: 8000.
Step 2 – Choose GPU and reliability tier
Here you align with how VESSL Cloud manages capacity:
- GPU SKU
  - For LLM post-training or heavy inference: A100 80GB, H100, H200.
  - For cutting-edge or high-density: B200, GB200, B300 as available.
- Replica count
  - Start with 1–2 replicas for staging.
  - Scale up as you approach production traffic.
Decide the backing capacity mode (if exposed):
- On-Demand
  - Reliable capacity with automatic failover.
  - VESSL can switch providers when a region or vendor fails.
  - Best default for production services.
- Reserved
  - Guaranteed capacity, with discounts of up to ~40% for a commitment.
  - Recommended for steady, mission-critical workloads.
Attach Cluster Storage/Object Storage if your service loads models or data from shared volumes.
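A quick break-even calculation can help when choosing between On-Demand and Reserved. The sketch below assumes Reserved capacity is billed for the whole committed period while On-Demand is billed only for hours the replicas run; the hourly rate is a made-up figure, not a VESSL price.

```python
def monthly_cost_on_demand(hourly_rate: float, hours_used: float) -> float:
    """Cost when paying the list rate only for hours actually used."""
    return hourly_rate * hours_used


def monthly_cost_reserved(hourly_rate: float, discount: float,
                          hours_committed: float) -> float:
    """Cost when paying a discounted rate for every committed hour."""
    return hourly_rate * (1.0 - discount) * hours_committed


HOURS_IN_MONTH = 730
RATE = 2.0        # hypothetical USD per GPU-hour
DISCOUNT = 0.40   # upper end of the discount range mentioned above

reserved = monthly_cost_reserved(RATE, DISCOUNT, HOURS_IN_MONTH)
for utilization in (0.3, 0.5, 0.9):
    on_demand = monthly_cost_on_demand(RATE, utilization * HOURS_IN_MONTH)
    cheaper = "reserved" if reserved < on_demand else "on-demand"
    print(f"{utilization:.0%} of the month running: {cheaper} is cheaper")
```

Under these assumptions Reserved only pays off once the replicas run for more than about `1 - discount` of the committed time, which is why it suits steady workloads.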
Step 3 – Deploy
- Click Deploy.
- VESSL will:
  - Provision GPUs in your chosen provider/region.
  - Create a Multi-Cluster–aware service if you use multiple regions.
  - Set up health checks on /health (or your configured path).
Wait until all replicas show Healthy.
Step 4 – Expose and secure the provisioned endpoint
On the service details page:
- Copy the primary Service URL.
- Configure auth if required:
  - API keys, JWT, or upstream gateway integration.
- Optionally map a custom domain (e.g., api.my-llm.com) through your DNS and VESSL configuration.
Example call:
curl -X POST \
"https://llm-prod-api-xxxxx.vessl.ai/predict" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $API_TOKEN" \
-d '{"text": "Run this in production"}'
From an app server:
import os
import requests
url = "https://llm-prod-api-xxxxx.vessl.ai/predict"
token = os.environ["LLM_API_TOKEN"]
resp = requests.post(
    url,
    json={"text": "Run this in production"},
    headers={"Authorization": f"Bearer {token}"},
    timeout=5,
)
resp.raise_for_status()
print(resp.json())
Operational tips: keep your endpoints reliable
1. Use health checks and readiness probes
- Always expose /health.
- Keep it fast and lightweight (no full inference).
- If you do long model loads, use a readiness flag so the service isn’t marked ready too early.
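One way to implement the readiness flag is a thread-safe event that the health handler consults. The sketch below is framework-agnostic and uses only the standard library; the `Readiness` class is a hypothetical helper, and it assumes the platform's health check treats a non-2xx status as "not ready".

```python
import threading
from typing import Callable, Tuple


class Readiness:
    """Tracks whether the model has finished loading."""

    def __init__(self) -> None:
        self._ready = threading.Event()

    def mark_ready(self) -> None:
        self._ready.set()

    def health(self) -> Tuple[int, dict]:
        """Return (HTTP status, JSON body) for the health endpoint."""
        if self._ready.is_set():
            return 200, {"status": "ok"}
        return 503, {"status": "loading"}


def load_model_in_background(readiness: Readiness,
                             loader: Callable[[], None]) -> threading.Thread:
    """Run `loader` in a thread and flip the flag once it returns."""
    def _run() -> None:
        loader()
        readiness.mark_ready()

    thread = threading.Thread(target=_run, daemon=True)
    thread.start()
    return thread


readiness = Readiness()
before = readiness.health()
# A no-op loader stands in for the real (slow) model load.
load_model_in_background(readiness, loader=lambda: None).join()
after = readiness.health()
print(before, after)
```

In a FastAPI app, the /health route would return the body from `health()` with the matching status code, so the check itself never touches the model.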
2. Monitor and iterate
Use the VESSL Web Console to monitor:
- Latency (p50/p95).
- GPU utilization.
- Replica health and restart patterns.
If you see underutilized GPUs, you can:
- Move from H100 to A100 to save cost.
- Decrease replicas (provisioned) or max scale (serverless).
If you see saturation:
- Increase replicas.
- Scale vertically to larger GPU SKUs.
- Move critical workloads from Spot-backed capacity to On-Demand or Reserved.
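If you also collect latencies on the client side, p50 and p95 can be computed with the standard library. This is a small sketch; `latency_summary` is a hypothetical helper, and the sample values are synthetic rather than measurements.

```python
import statistics
from typing import Dict, Sequence


def latency_summary(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Return the p50 and p95 of latency samples given in milliseconds."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # n=100 yields 99 cut points; index 49 is p50 and index 94 is p95.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}


# Synthetic samples: 1 ms through 100 ms.
summary = latency_summary(list(range(1, 101)))
print(summary)
```

Watching p95 instead of the mean makes cold starts and saturation visible, since both show up in the tail first.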
3. Design for failover when it matters
For production, lean on VESSL’s reliability primitives:
- Auto Failover
  - Configure it so that a provider or region outage triggers a seamless switch to another cluster.
  - Your endpoint stays reachable; you reduce “job wrangling” when something fails.
- Multi-Cluster
  - Keep a unified view across regions.
  - Useful when you run the same model in multiple geographic regions for latency or compliance.
Choosing your deployment mode: quick recommendation
- If you’re experimenting, demoing, or running bursty traffic:
  - Start with Serverless Service.
  - Accept some cold starts.
  - Let VESSL handle autoscaling and capacity.
- If you’re running production APIs with clear traffic patterns:
  - Use Provisioned Service with On-Demand capacity.
  - Turn on Auto Failover across providers/regions.
  - Watch metrics and adjust replicas and SKUs.
- If you’re running mission-critical workloads with tight SLOs:
  - Move hot paths to Provisioned Service + Reserved capacity.
  - Lock in GPU SKUs (A100/H100/H200/B200/GB200/B300) and guarantee capacity.
  - Use Multi-Cluster for resilience and regional control.
Next step: deploy your first VESSL AI Service
You can go from model weights to a live HTTPS endpoint in minutes—without chasing individual cloud quotas or babysitting GPU clusters.
Pick your deployment mode, plug in your container and command, and let VESSL handle the orchestration, scaling, and failover.