
How do I deploy a model on VESSL AI Service (serverless vs provisioned) and expose an endpoint?
Most teams reach the same point: the model trains fine, but turning it into a reliable, callable endpoint is where things stall. On VESSL AI Service, you have two clear ways to ship that model: Serverless for fully managed, autoscaling inference, and Provisioned for long-running, dedicated capacity. Both expose an HTTPS endpoint; the difference is how you pay for and control the underlying GPUs.
This guide walks through:
- When to use serverless vs provisioned
- How to define your model “service” (image, handler, and resources)
- How to deploy with the Web Console or the vessl CLI
- How to get a URL and call your endpoint from your app
Step 1: Decide between serverless and provisioned
Think about your traffic pattern and operational needs first. That will determine which mode you start with.
When to use VESSL AI Service (serverless)
Use serverless if:
- You have spiky or unpredictable traffic
- You want VESSL to manage autoscaling and concurrency
- You’d rather pay per request / compute used than keep GPUs warm 24/7
- Cold-start latency is acceptable (i.e., the API is not ultra latency-sensitive)
Typical fits:
- Internal demo endpoints for an LLM or vision model
- Lightweight GPT-style tools used by PMs / analysts
- Experimental endpoints you’re iterating on quickly
What serverless gives you:
- No cluster management
- Automatic scale-out and scale-to-zero
- One HTTPS URL per service, no load balancer work
- Good starting point if you’re still tuning the model or usage pattern
Tradeoffs:
- Cold starts for idle services
- Less control over exact pod lifecycle and node-level tuning
- For always-busy production traffic, provisioned can be more predictable
When to use VESSL AI Service (provisioned)
Use provisioned if:
- You have steady or high baseline traffic
- You want a fixed pool of GPUs/CPUs always ready
- You care about warm latency and consistent performance
- The service is production-critical and you want explicit capacity control
Typical fits:
- Public-facing inference APIs with constant load
- In-house “platform” endpoints other teams depend on
- Serving large LLMs on A100/H100-class GPUs with tight SLOs
What provisioned gives you:
- Dedicated pods on dedicated resources
- Stable performance, no cold starts during normal operation
- Easier capacity planning for known workloads
Tradeoffs:
- You’re essentially “holding” the GPUs even when idle
- You’ll think in terms of replicas and cluster capacity, not just requests
If you’re unsure, start serverless for experimentation. Once traffic is steady or latency SLOs get stricter, move the same container and handler to a provisioned service.
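One way to make that call concrete is a utilization break-even estimate. The sketch below uses a placeholder $2.50/GPU-hour rate, not a VESSL list price; plug in your actual provisioned and per-use rates.

```python
# Back-of-the-envelope cost comparison. The $2.50/GPU-hour figure is a
# placeholder, not a VESSL list price -- substitute your actual rates.
HOURS_PER_MONTH = 730

def monthly_cost_provisioned(gpu_hourly_usd, replicas):
    # Provisioned: you pay for the replicas whether or not they serve traffic.
    return gpu_hourly_usd * replicas * HOURS_PER_MONTH

def monthly_cost_serverless(gpu_hourly_usd, utilization):
    # Serverless: roughly, you pay only for the fraction of time a replica is busy.
    return gpu_hourly_usd * utilization * HOURS_PER_MONTH

rate = 2.50  # hypothetical $/GPU-hour
always_on = monthly_cost_provisioned(rate, replicas=1)   # 1825.0
spiky = monthly_cost_serverless(rate, utilization=0.10)  # 182.5
```

At 10% utilization the serverless bill is an order of magnitude lower; as utilization approaches 100% the two converge, and in practice provisioned wins earlier once cold starts and per-request overhead are factored in.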
Step 2: Package your model into a runnable container
Both serverless and provisioned use the same basic ingredients:
- A Docker image with:
  - Your model weights (baked in or downloaded at startup)
  - Inference code (Python, FastAPI/Flask, or your own server)
  - Any runtime dependencies (CUDA, PyTorch, transformers, etc.)
- A service handler:
  - An HTTP server listening on a port (usually 8000 or 8080)
  - A /predict or similar route that:
    - Accepts JSON/bytes
    - Runs the model
    - Returns JSON (or another agreed format)
- A resource configuration:
  - GPU type (e.g., A100/H100/H200/B200/GB200/B300)
  - GPU count
  - CPU and memory limits
  - Optional storage mounts (Cluster Storage / Object Storage)
A minimal Python example (FastAPI):
# app.py
from fastapi import FastAPI
from pydantic import BaseModel
import torch

app = FastAPI()

# Load your model at startup (global or in a startup event)
model = torch.load("/models/model.pt")
model.eval()

class InferenceRequest(BaseModel):
    text: str

class InferenceResponse(BaseModel):
    output: str

@app.post("/predict", response_model=InferenceResponse)
def predict(req: InferenceRequest):
    # Replace with your actual preprocessing/inference
    with torch.no_grad():
        output = req.text.upper()
    return InferenceResponse(output=output)
Dockerfile:
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y python3 python3-pip && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt
COPY . .
ENV PORT=8000
EXPOSE 8000
# Use gunicorn/uvicorn for production
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
requirements.txt:
fastapi
uvicorn[standard]
torch
Build and push:
docker build -t ghcr.io/your-org/your-model:latest .
docker push ghcr.io/your-org/your-model:latest
Once you have a container that runs uvicorn app:app, you’re ready to deploy to VESSL AI Service.
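Before pushing, it's worth smoke-testing the container locally (e.g., after docker run --rm -p 8000:8000 ghcr.io/your-org/your-model:latest). This stdlib-only sketch assumes the /predict route and JSON shape from the app.py example above:

```python
# Local smoke test for the container. Assumes the /predict route and the
# {"text": ...} -> {"output": ...} contract from the app.py example above.
import json
import urllib.request

def smoke_test(base_url="http://localhost:8000"):
    req = urllib.request.Request(
        base_url + "/predict",
        data=json.dumps({"text": "hello"}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        assert resp.status == 200, f"unexpected status {resp.status}"
        body = json.load(resp)
    assert "output" in body, f"missing 'output' key in {body}"
    return body
```

If this passes locally, the same container should behave identically once VESSL schedules it.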
Step 3: Deploy as a serverless service and get an endpoint
3.1 Deploy via Web Console (serverless)
- Log in to VESSL Cloud Web Console.
- Go to Service > Create Service.
- Select Serverless as the deployment type.
- Configure Image:
  - Image: ghcr.io/your-org/your-model:latest
  - Command (if needed): leave empty if CMD is defined in the Dockerfile
  - Port: 8000 (or whatever your app listens on)
- Configure Resources:
  - GPU: choose a SKU (e.g., 1 × A100 80GB or 1 × H100 80GB)
  - CPU/Memory: set appropriate requests/limits
- Set Autoscaling:
  - Min replicas: 0 (allow scale to zero)
  - Max replicas: e.g., 10
  - Concurrency: how many requests per replica (start with 1–4)
- Configure Networking:
  - Public exposure: Enabled (to get a public HTTPS endpoint)
  - Path: /predict (or your route)
- Click Create / Deploy.
VESSL will then schedule the service, pull your image, and wire up a public HTTPS URL.
3.2 Get the endpoint URL (serverless)
After deployment:
- Open the service detail page.
- Look for the Endpoint URL section.
- Copy the public HTTPS URL, which will look like:
https://<service-name>-<hash>.svc.vessl.ai/predict
You can test it with curl:
curl -X POST \
-H "Content-Type: application/json" \
-d '{"text": "hello"}' \
https://<service-url>/predict
Or from Python:
import requests
url = "https://<service-url>/predict"
payload = {"text": "hello"}
resp = requests.post(url, json=payload)
resp.raise_for_status()
print(resp.json())
If your org uses authentication (API keys, tokens), attach them in headers as configured in your environment or gateway.
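For production clients, it also helps to handle the cold starts mentioned earlier. A stdlib-only sketch that attaches a bearer token and retries 503s (which a scaled-to-zero endpoint may return while a replica warms); the Authorization header name is an assumption, so match it to whatever your gateway actually expects:

```python
# Client sketch: bearer token plus retry-on-503 for cold starts.
# The Authorization header is an assumption -- adapt to your gateway.
import json
import time
import urllib.error
import urllib.request

def build_request(url, payload, api_key=None):
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = "Bearer " + api_key
    return urllib.request.Request(
        url, data=json.dumps(payload).encode(), headers=headers, method="POST"
    )

def call_with_retries(url, payload, api_key=None, attempts=4, base_backoff=2.0):
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(build_request(url, payload, api_key), timeout=60) as resp:
                return json.load(resp)
        except urllib.error.HTTPError as err:
            if err.code != 503 or attempt == attempts - 1:
                raise  # non-transient error, or retries exhausted
            time.sleep(base_backoff * 2 ** attempt)  # back off while the replica warms
```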
3.3 Deploy via CLI (serverless)
If you prefer versioned infra-as-code, you can define your service spec in YAML and deploy with the CLI.
Example service.serverless.yaml:
kind: service
name: my-model-serverless
spec:
type: serverless
container:
image: ghcr.io/your-org/your-model:latest
port: 8000
resources:
gpu:
type: a100-80gb
count: 1
cpu: "4"
memory: "16Gi"
autoscaling:
minReplicas: 0
maxReplicas: 10
targetConcurrency: 2
networking:
public: true
routes:
- path: /predict
method: POST
Deploy:
vessl service apply -f service.serverless.yaml
Then use:
vessl service get my-model-serverless
to retrieve status and endpoint URL.
Step 4: Deploy as a provisioned service and expose an endpoint
Provisioned uses the same container but gives you dedicated, always-on capacity.
4.1 Deploy via Web Console (provisioned)
- Go to Service > Create Service.
- Select Provisioned as the deployment type.
- Configure Image:
  - Image: ghcr.io/your-org/your-model:latest
  - Port: 8000
- Configure Resources per replica:
  - GPU: choose a specific SKU (A100/H100/H200/B200/GB200/B300)
  - CPU/Memory: match your model footprint
- Configure Replicas:
  - Replica count: e.g., 2 for HA
- Set Placement (optional):
  - Choose region/provider if you’re aligning with other workloads
- Configure Networking:
  - Public exposure: Enabled
  - Route path: /predict
- Click Create / Deploy.
4.2 Get the endpoint URL (provisioned)
Just like serverless:
- Open the service details.
- Copy the public HTTPS URL.
- Test with curl or your preferred client.
Example:
curl -X POST \
-H "Content-Type: application/json" \
-d '{"text": "provisioned test"}' \
https://<provisioned-service-url>/predict
You’ll see lower and more consistent latency, since pods stay warm and capacity is reserved for you.
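To verify that claim for your own model, measure latency percentiles against both modes. A stdlib-only sketch (the URL is a placeholder for your endpoint):

```python
# Measure p50/p95 latency against a JSON POST endpoint. URL is a placeholder.
import json
import statistics
import time
import urllib.request

def measure_latencies(url, payload, n=50):
    data = json.dumps(payload).encode()
    latencies = []
    for _ in range(n):
        req = urllib.request.Request(
            url, data=data, headers={"Content-Type": "application/json"}
        )
        start = time.perf_counter()
        urllib.request.urlopen(req, timeout=30).read()
        latencies.append(time.perf_counter() - start)
    return latencies

def p50_p95(latencies):
    # quantiles(n=20) yields 19 cut points; index 9 is p50, index 18 is p95
    cuts = statistics.quantiles(latencies, n=20)
    return cuts[9], cuts[18]
```

A wide gap between p50 and p95 on serverless usually points at cold starts; on provisioned the two should sit close together.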
4.3 Deploy via CLI (provisioned)
Example service.provisioned.yaml:
kind: service
name: my-model-provisioned
spec:
type: provisioned
container:
image: ghcr.io/your-org/your-model:latest
port: 8000
resources:
gpu:
type: h100-80gb
count: 1
cpu: "8"
memory: "32Gi"
replicas: 2
networking:
public: true
routes:
- path: /predict
method: POST
Deploy:
vessl service apply -f service.provisioned.yaml
Check:
vessl service get my-model-provisioned
Copy the endpoint and integrate it into your application.
Step 5: Manage versions, rollouts, and scaling
Once you have an endpoint, the operational questions start:
- How do I update the model without downtime?
- How do I tune cost vs performance?
- How do I test new versions safely?
Rolling out new versions
Typical pattern:
- Build and push a new image tag:
  docker build -t ghcr.io/your-org/your-model:v2 .
  docker push ghcr.io/your-org/your-model:v2
- Update the service spec (Web Console or YAML) to use :v2.
- Redeploy:
  - Web Console: click Update / Redeploy
  - CLI: vessl service apply -f service.provisioned.yaml
VESSL will roll pods, so traffic moves to the new version as the new replicas become healthy.
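If you want an explicit gate on top of the rolling update, you can poll the new revision until it answers with several consecutive 200s before pointing clients at it. The probe route and threshold below are assumptions; adapt them to your handler:

```python
# Pre-cutover health gate sketch. Probe body, route, and the
# five-consecutive-200s threshold are assumptions, not VESSL behavior.
import json
import time
import urllib.request

def consecutive_ok(statuses, needed=5):
    """True once the trailing `needed` probe results are all HTTP 200."""
    return len(statuses) >= needed and all(s == 200 for s in statuses[-needed:])

def wait_until_healthy(url, probe, needed=5, interval=2.0, max_probes=30):
    statuses = []
    data = json.dumps(probe).encode()
    for _ in range(max_probes):
        try:
            req = urllib.request.Request(
                url, data=data, headers={"Content-Type": "application/json"}
            )
            statuses.append(urllib.request.urlopen(req, timeout=10).status)
        except Exception:
            statuses.append(0)  # count network errors as failed probes
        if consecutive_ok(statuses, needed):
            return True
        time.sleep(interval)
    return False
```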
Adjusting scale and capacity
- Serverless:
  - Increase maxReplicas for throughput.
  - Adjust targetConcurrency to match model latency and memory.
  - If autoscaling is too slow, raise min replicas (e.g., 1–2) to reduce cold-start impact.
- Provisioned:
  - Increase/decrease replicas based on sustained QPS and latency.
  - For heavy LLMs, scale by GPU type first (A100 → H100/H200/B200/GB200/B300) before simply adding more replicas; you may get better throughput per dollar on newer SKUs.
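A quick Little's-law estimate ties these knobs together: in-flight requests are roughly QPS × latency, and dividing by per-replica concurrency gives a starting replica count (or maxReplicas ceiling):

```python
# Little's law sizing sketch: in-flight requests ~= QPS * latency;
# divide by per-replica concurrency to estimate the replica count.
import math

def replicas_needed(qps, latency_s, concurrency_per_replica):
    in_flight = qps * latency_s
    return max(1, math.ceil(in_flight / concurrency_per_replica))

# e.g., 20 QPS at 0.8 s latency with targetConcurrency=2 -> 8 replicas
```

Treat the result as a floor, then add headroom for traffic spikes and p95 (not median) latency.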
Storage and shared assets
For larger models and artifacts:
- Use Cluster Storage for shared, high-performance file access (e.g., multiple services sharing the same checkpoint directory).
- Use Object Storage for lower-cost model blobs, datasets, and logs that are downloaded at startup.
Both can be mounted into your service spec so your container sees them as file paths.
Serverless vs provisioned: quick decision cheatsheet
Use serverless if:
- You’re in early experimentation
- Traffic is sporadic or unpredictable
- You want VESSL to handle autoscaling and scale-to-zero
- Slight cold start latency is okay
Use provisioned if:
- You have steady or growing production traffic
- You need predictable, low latency and warm GPUs
- You want explicit control over replicas and capacity
- The service is mission-critical for your app or users
Both modes:
- Run the same container pattern
- Expose a public HTTPS endpoint
- Integrate cleanly with your existing apps via REST
From model to endpoint with less job wrangling
The whole point of VESSL AI Service is to take you from “I have a model checkpoint” to “my team is hitting a stable HTTPS endpoint” without weeks of cluster wiring. Package your model in a container once, then choose:
- Serverless for managed, autoscaling inference.
- Provisioned for dedicated, always-on capacity.
In both cases, you get an endpoint you can call from your app, with GPUs across providers available through a single control surface.