How do I deploy a model on VESSL AI Service (serverless vs provisioned) and expose an endpoint?

Most teams hit the same wall: the model is trained, weights look good, but turning it into a reliable, callable endpoint is slow, brittle, or blocked by GPU availability. VESSL AI Service removes most of that friction, but you still need to choose how you run it: serverless or provisioned.

This guide walks through both paths on VESSL AI Service—how to deploy a model, expose an endpoint, and decide when serverless vs provisioned is the right fit.

Understand VESSL AI Service: serverless vs provisioned

VESSL AI Service gives you two ways to run inference:

Serverless Service
- VESSL owns the autoscaling and lifecycle.
- You pay per compute-minute (backed by VESSL Cloud capacity).
- Cold starts may apply.
- Best for bursty traffic, early-stage features, or public demos.
Provisioned Service
- You pin capacity: GPU type, replica count, and region.
- Higher reliability and predictable latency; no cold starts.
- Best for production APIs, SLAs, and latency-sensitive workloads.

Both modes:

Run on A100/H100/H200/B200/GB200/B300-class GPUs (depending on your selection).
Use the same Web Console and CLI (vessl run) primitives.
Expose HTTPS endpoints you can call from apps, agents, or other services.

Prerequisites: what you need ready

Before you deploy:

A model artifact
- Model weights in a format your serving code can load (e.g., .safetensors, .pt, .bin, or a Hugging Face repo).
- Optionally packaged into a container image if you want full control.
Serving code
- An HTTP server (FastAPI, Flask, or similar) with:
  - A /health or equivalent health endpoint.
  - An inference endpoint (e.g., /predict or /generate) that:
    - Accepts JSON.
    - Returns JSON.
- A startup command (e.g., uvicorn app:app --host 0.0.0.0 --port 8000).
VESSL AI access
- VESSL account and access to VESSL Cloud.
- vessl CLI installed and authenticated if you prefer CLI over Web Console.
Storage for model assets
- Either:
  - Bundled in the container image, or
  - Pulled from VESSL Cluster Storage/Object Storage, or
  - Pulled from an external source (S3, Hugging Face, etc.).
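
The three storage options above can be folded into a single lookup order inside your serving code. This is a minimal sketch, not a VESSL API: the directories `/app/models` and `/mnt/models` and the `MODEL_URI` environment variable are hypothetical names chosen for illustration.

```python
import os
from pathlib import Path

# Hypothetical locations; adjust to your image layout and volume mounts.
BUNDLED_DIR = Path("/app/models")   # baked into the container image
MOUNTED_DIR = Path("/mnt/models")   # attached storage volume


def resolve_model_source(filename, env=None, bundled=BUNDLED_DIR, mounted=MOUNTED_DIR):
    """Return (kind, location) for the first available model source."""
    env = os.environ if env is None else env
    # Prefer local files: bundled in the image first, then a mounted volume.
    for kind, directory in (("bundled", bundled), ("mounted", mounted)):
        candidate = Path(directory) / filename
        if candidate.is_file():
            return kind, str(candidate)
    # Fall back to a remote location (e.g. S3 or Hugging Face) if configured.
    remote = env.get("MODEL_URI")
    if remote:
        return "remote", remote
    raise FileNotFoundError(f"No source found for {filename!r}")
```

Checking local paths first keeps startup fast when the weights are already present, which matters most when cold starts are a concern.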

Key design decision: when to choose serverless vs provisioned

Use this as a quick decision matrix for your endpoint:

Choose Serverless Service if:

Traffic is spiky, or low and unpredictable.
You’re running:
- Internal prototypes.
- Feature experiments.
- Public demos and hackathon endpoints.
You want:
- No capacity planning.
- Automatic scale-to-zero and scale-up.

Choose Provisioned Service if:

You need consistent latency and no cold starts.
You’re serving:
- Production LLM APIs.
- RAG backends for user-facing apps.
- Physical AI / robotics controllers that can’t tolerate startup lag.
You want:
- Guaranteed GPU capacity.
- Control over replica counts and GPU SKUs.
- The option to align with On-Demand or Reserved tiers.

You can start serverless for discovery and graduate to provisioned once traffic stabilizes or SLAs appear.
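
If you want the checklist above in a form you can reuse in planning scripts, it reduces to a few boolean questions. The sketch below is one possible encoding; the `Workload` fields and the returned labels are illustrative names, not VESSL settings.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    needs_no_cold_starts: bool     # startup lag is unacceptable
    has_sla: bool                  # latency or availability targets apply
    steady_critical: bool = False  # steady, mission-critical load


def recommend_mode(w: Workload) -> str:
    """Map the decision checklist onto a deployment recommendation."""
    if w.steady_critical:
        return "provisioned/reserved"
    if w.needs_no_cold_starts or w.has_sla:
        return "provisioned/on-demand"
    # Prototypes, demos, and bursty traffic default to serverless.
    return "serverless"
```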

Common setup for both modes

Whether you go serverless or provisioned, you define the same core pieces:

Image & environment
- Base image (examples):
  - nvcr.io/nvidia/pytorch:24.01-py3 for PyTorch.
  - A custom image you built and pushed (e.g., ghcr.io/org/model-service:latest).
- Environment variables:
  - Paths, HF tokens (if needed), model names.
- Ports:
  - The port your HTTP server listens on (e.g., 8000).
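
One way to keep the image identical across serverless and provisioned deployments is to read every setting from environment variables with defaults. The variable names below (`MODEL_PATH`, `MODEL_NAME`, `PORT`, `HF_TOKEN`) are assumptions for this sketch; use whatever names your service spec defines.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ServiceConfig:
    model_path: str
    model_name: str
    port: int
    hf_token: Optional[str]


def load_config(env=None) -> ServiceConfig:
    """Build the service configuration from environment variables."""
    env = os.environ if env is None else env
    return ServiceConfig(
        model_path=env.get("MODEL_PATH", "/models/model.pt"),
        model_name=env.get("MODEL_NAME", "default"),
        port=int(env.get("PORT", "8000")),
        # Treat an empty token the same as a missing one.
        hf_token=env.get("HF_TOKEN") or None,
    )
```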

Model loading logic

In app.py or equivalent:

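from fastapi import FastAPI
from pydantic import BaseModel
import torch

app = FastAPI()
model = None  # populated once at startup

class InferenceRequest(BaseModel):
    text: str

class InferenceResponse(BaseModel):
    output: str

@app.on_event("startup")
def load_model():
    global model
    # Load your model from disk or remote storage.
    # weights_only=False is needed to unpickle a full model object on
    # recent PyTorch releases; only use it with files you trust.
    model = torch.load("/models/model.pt", map_location="cuda", weights_only=False)
    model.eval()  # disable dropout and other training-only behavior

@app.get("/health")
def health():
    return {"status": "ok"}

@app.post("/predict", response_model=InferenceResponse)
def predict(req: InferenceRequest):
    # Run inference (simplified): assumes the loaded object exposes a
    # generate() method that accepts raw text. Most models need a
    # tokenization step before this call and decoding after it.
    with torch.no_grad():
        output = model.generate(req.text)
    return InferenceResponse(output=str(output))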

Startup command

Example:
```
uvicorn app:app --host 0.0.0.0 --port 8000
```
You’ll reference this in your service spec (Web Console or CLI).

Path 1: Deploy as a Serverless Service

Step 1 – Define the serverless service (Web Console)

Go to the VESSL Web Console.
Navigate to Service → Create Service.
Choose Serverless as the service type.

Fill out:

Name: e.g., llm-serverless-api.
Runtime / Image:
- Select a base image or your custom container.
Command:
- uvicorn app:app --host 0.0.0.0 --port 8000.
Port:
- 8000 (or whatever your app uses).
Resources:
- GPU: choose the class (e.g., NVIDIA A100 80GB, H100, etc.).
- CPU and memory: enough for model load and inference.
Autoscaling (serverless):
- Set min replicas (often 0 for scale-to-zero).
- Set max replicas based on your expected burst (e.g., 5).

Attach any volumes or Object Storage where your model weights live, or configure your app to download them on startup.
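
Before entering these values in the console, it can help to write them down in code and sanity-check them. The dataclass below simply mirrors the form fields described above; it is a hypothetical structure for review purposes, not VESSL's actual service schema, and the default GPU label is a placeholder.

```python
from dataclasses import dataclass


@dataclass
class ServerlessSpec:
    # Field names mirror the console form; they are illustrative,
    # not VESSL's actual API schema.
    name: str
    image: str
    command: str
    port: int = 8000
    gpu: str = "A100-80GB"
    min_replicas: int = 0
    max_replicas: int = 5

    def validate(self):
        """Return a list of human-readable problems (empty if none)."""
        problems = []
        if not 1 <= self.port <= 65535:
            problems.append("port must be between 1 and 65535")
        if self.min_replicas < 0:
            problems.append("min_replicas must be >= 0")
        if self.max_replicas < max(1, self.min_replicas):
            problems.append("max_replicas must be >= 1 and >= min_replicas")
        # A common mistake: the server listens on a different port
        # than the one the service exposes.
        if str(self.port) not in self.command:
            problems.append("command does not mention the configured port")
        return problems
```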

Step 2 – Deploy

Click Deploy.
VESSL will:
- Build/start the underlying container.
- Add autoscaling and a service mesh layer.
- Expose an HTTPS endpoint.

Wait for the status to become Running / Healthy.

Step 3 – Get the serverless endpoint URL

Once running:

Open the service details page.
Copy the Endpoint URL (e.g., https://llm-serverless-api-xxxxx.vessl.ai).

This is your public API endpoint.
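
Before wiring the URL into clients, a script can confirm that the endpoint actually answers. The sketch below polls a health URL using only the standard library. It assumes your app's `/health` route is reachable through the endpoint URL, and the attempt count and delay are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request


def is_healthy(url, timeout=5.0):
    """Return True if GET <url> answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors, refused connections, and timeouts.
        return False


def wait_until_healthy(url, check=is_healthy, attempts=30, delay=10.0, sleep=time.sleep):
    """Poll the health URL until it passes or attempts run out."""
    for attempt in range(attempts):
        if check(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

The `check` and `sleep` parameters exist so the loop can be tested without a network or real waiting.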

Step 4 – Call the serverless endpoint

Example curl:

curl -X POST \
  "https://llm-serverless-api-xxxxx.vessl.ai/predict" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello from VESSL AI"}'

Example from Python:

import requests

url = "https://llm-serverless-api-xxxxx.vessl.ai/predict"
payload = {"text": "Hello from VESSL AI"}

resp = requests.post(url, json=payload, timeout=30)
resp.raise_for_status()
print(resp.json())

You can integrate this into your agents, web backends, or batch jobs.
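
Because a scaled-to-zero service may need time to start, clients often wrap the call in a retry with exponential backoff. The following is a minimal sketch using only the standard library instead of `requests`, so it runs without extra dependencies. The retry count and delays are assumptions to tune against your own cold-start times.

```python
import json
import time
import urllib.error
import urllib.request


def post_json(url, payload, timeout=30.0):
    """POST a JSON payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def call_with_retry(url, payload, send=post_json, retries=4, base_delay=1.0, sleep=time.sleep):
    """Retry transient failures with exponential backoff (1s, 2s, 4s, ...)."""
    for attempt in range(retries + 1):
        try:
            return send(url, payload)
        except urllib.error.HTTPError as err:
            # Client errors (4xx) will not fix themselves; fail fast.
            if err.code < 500 or attempt == retries:
                raise
        except OSError:
            # Connection errors and timeouts, typical during a cold start.
            if attempt == retries:
                raise
        sleep(base_delay * 2 ** attempt)
```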

When to move off serverless

If you see:

Frequent cold-start penalties.
Steady or rising QPS that justifies dedicated capacity.
Requirements for strict SLAs or lower p95 latency.

Then you likely want to re-deploy the same container as a Provisioned Service.

Path 2: Deploy as a Provisioned Service

Provisioned Service gives you pinned capacity and more direct control—think of it as a dedicated inference cluster wrapped in a clean API.

Step 1 – Define the provisioned service (Web Console)

Go to Service → Create Service.
Choose Provisioned as the service type.

Configure:

Name: e.g., llm-prod-api.
Image & Command: same as serverless:
- Image: ghcr.io/org/llm-service:latest (or base image).
- Command: uvicorn app:app --host 0.0.0.0 --port 8000.
Port:
- 8000.

Step 2 – Choose GPU and reliability tier

Here you align with how VESSL Cloud manages capacity:

GPU SKU
- For large LLMs or heavy inference: A100 80GB, H100, H200.
- For cutting-edge or high-density: B200, GB200, B300 as available.
Replica count
- Start with 1–2 replicas for staging.
- Scale up as you approach production traffic.
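
For a first estimate of the production replica count, Little's law is a reasonable starting point: the number of requests in flight equals the arrival rate multiplied by the average latency. The sketch below applies it with a utilization headroom. The 70% default target and the notion of a fixed per-replica concurrency are simplifying assumptions, so measure your own service before relying on the result.

```python
import math


def estimate_replicas(peak_qps, avg_latency_s, concurrency_per_replica, target_utilization=0.7):
    """Estimate replicas via Little's law: in-flight = arrival rate x latency."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    in_flight = peak_qps * avg_latency_s
    # Leave headroom so replicas are not run at 100% of their capacity.
    usable = concurrency_per_replica * target_utilization
    return max(1, math.ceil(in_flight / usable))
```

For example, 20 requests per second at 0.5 s average latency, with 4 concurrent requests per replica and the default 70% target, yields 4 replicas.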

Decide the backing capacity mode (if exposed):

On-Demand
- Reliable capacity with automatic failover.
- VESSL can switch providers when a region or vendor fails.
- Best default for production services.
Reserved
- Guaranteed capacity, often discounted by up to ~40% with a commitment.
- Recommended for steady, mission-critical workloads.
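
To see what a commitment discount means for an always-on service, the arithmetic is straightforward. The sketch below uses a hypothetical hourly rate purely for illustration; substitute the actual price for your GPU SKU and the discount you are offered.

```python
def monthly_cost(hourly_rate, replicas, hours=730, discount=0.0):
    """Monthly cost for always-on replicas at a given hourly GPU rate."""
    return hourly_rate * replicas * hours * (1.0 - discount)


def reserved_savings(hourly_rate, replicas, discount, hours=730):
    """Difference between undiscounted and discounted monthly cost."""
    full = monthly_cost(hourly_rate, replicas, hours)
    discounted = monthly_cost(hourly_rate, replicas, hours, discount)
    return full - discounted
```

With a made-up rate of 2.00 per GPU-hour and 2 replicas, the undiscounted month costs 2,920, and a 40% discount saves 1,168 of that.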

Attach Cluster Storage/Object Storage if your service loads models or data from shared volumes.

Step 3 – Deploy

Click Deploy.
VESSL will:
- Provision GPUs in your chosen provider/region.
- Create a Multi-Cluster–aware service if you use multiple regions.
- Set up health checks on /health (or your configured path).

Wait until all replicas show Healthy.

Step 4 – Expose and secure the provisioned endpoint

On the service details page:

Copy the primary Service URL.
Configure auth if required:
- API keys, JWT, or upstream gateway integration.
Optionally map a custom domain (e.g., api.my-llm.com) through your DNS and VESSL configuration.

Example call:

curl -X POST \
  "https://llm-prod-api-xxxxx.vessl.ai/predict" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_TOKEN" \
  -d '{"text": "Run this in production"}'

From an app server:

import os
import requests

url = "https://llm-prod-api-xxxxx.vessl.ai/predict"
token = os.environ["LLM_API_TOKEN"]

resp = requests.post(
    url,
    json={"text": "Run this in production"},
    headers={"Authorization": f"Bearer {token}"},
    timeout=5,
)
resp.raise_for_status()
print(resp.json())
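
If you enforce the bearer token in your own serving code rather than at a gateway, the comparison should run in constant time so it does not leak how much of the token matched. This is a minimal sketch; reusing the `LLM_API_TOKEN` environment variable on the server side is an assumption, and the function would be called from whatever request hook your framework provides.

```python
import hmac
import os


def is_authorized(authorization_header, expected_token=None):
    """Check an `Authorization: Bearer <token>` header in constant time."""
    if expected_token is None:
        expected_token = os.environ.get("LLM_API_TOKEN", "")
    # Reject when no token is configured or no header was sent.
    if not expected_token or not authorization_header:
        return False
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    return hmac.compare_digest(token.strip().encode(), expected_token.encode())
```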

Operational tips: keep your endpoints reliable

1. Use health checks and readiness probes

Always expose /health.
Keep it fast and lightweight (no full inference).
If model loading takes a long time, use a readiness flag so the service isn’t marked ready before the model is in memory.
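
A readiness flag can be as simple as a `threading.Event` that the loader sets when it finishes. The sketch below is framework-agnostic: `health_response` returns a status code and body that you would return from your `/health` route. Using 503 while the model is still loading is a common convention, assumed here rather than required by VESSL.

```python
import threading


class Readiness:
    """Tracks whether the model has finished loading."""

    def __init__(self):
        self._ready = threading.Event()

    def mark_ready(self):
        self._ready.set()

    def health_response(self):
        """Return (status_code, body) for a health or readiness route."""
        if self._ready.is_set():
            return 200, {"status": "ok"}
        return 503, {"status": "loading"}


def load_in_background(loader, readiness):
    """Run a slow loader off the main thread, then flip the flag."""
    def _run():
        loader()
        readiness.mark_ready()

    thread = threading.Thread(target=_run, daemon=True)
    thread.start()
    return thread
```

Loading in a background thread lets the HTTP server start answering health checks immediately, while the check itself stays cheap because it only reads a flag.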

2. Monitor and iterate

Use the VESSL Web Console to monitor:

Latency (p50/p95).
GPU utilization.
Replica health and restart patterns.

If you see underutilized GPUs, you can:

Move from H100 to A100 to save cost.
Decrease replicas (provisioned) or max scale (serverless).

If you see saturation:

Increase replicas.
Scale vertically to larger GPU SKUs.
Move critical workloads from Spot-backed capacity to On-Demand or Reserved.
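
If you export latency samples and GPU utilization from your monitoring, the rules above can be turned into a coarse automated hint. The sketch below uses the nearest-rank percentile method. The utilization thresholds (30% and 85%) are placeholder assumptions, not VESSL defaults.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def scaling_hint(latencies_ms, gpu_util, p95_target_ms, low_util=0.3, high_util=0.85):
    """Turn latency samples and mean GPU utilization into a coarse hint."""
    p95 = percentile(latencies_ms, 95)
    # Saturation: latency over target or GPUs near their limit.
    if p95 > p95_target_ms or gpu_util >= high_util:
        return "scale up"
    # Underutilization: latency is fine and GPUs are mostly idle.
    if gpu_util <= low_util:
        return "scale down"
    return "hold"
```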

3. Design for failover when it matters

For production, lean on VESSL’s reliability primitives:

Auto Failover
- Configure so a provider or region outage triggers a seamless switch to another cluster.
- Your endpoint stays reachable; you reduce “job wrangling” when something fails.
Multi-Cluster
- Keep a unified view across regions.
- Useful when you run the same model in multiple geographic regions for latency or compliance.

Choosing your deployment mode: quick recommendation

If you’re experimenting, demoing, or running bursty traffic:
- Start with Serverless Service.
- Accept some cold starts.
- Let VESSL handle autoscaling and capacity.
If you’re running production APIs with clear traffic patterns:
- Use Provisioned Service with On-Demand capacity.
- Turn on Auto Failover across providers/regions.
- Watch metrics and adjust replicas and SKUs.
If you’re running mission-critical workloads with tight SLOs:
- Move hot paths to Provisioned Service + Reserved capacity.
- Lock in GPU SKUs (A100/H100/H200/B200/GB200/B300) and guarantee capacity.
- Use Multi-Cluster for resilience and regional control.

Next step: deploy your first VESSL AI Service

You can go from model weights to a live HTTPS endpoint in minutes—without chasing individual cloud quotas or babysitting GPU clusters.

Pick your deployment mode, plug in your container and command, and let VESSL handle the orchestration, scaling, and failover.
