
How do I set up a persistent GPU Workspace in VESSL AI with Jupyter + SSH access?
Most teams don’t just need a GPU; they need a workspace that stays put. Same environment, same data, same ports—whether you’re in Jupyter, SSH, or coming back tomorrow to pick up where you left off.
Below is a practical, step‑by‑step walkthrough for how to set up a persistent GPU Workspace in VESSL AI with both Jupyter and SSH access, tuned for LLM post‑training, Physical AI, and AI‑for‑Science workflows.
What “persistent GPU Workspace” means in VESSL AI
When people say “persistent workspace,” they usually want three things:
- Persistent storage – Your code, data, and checkpoints survive restarts.
- Stable environment – Same image, same dependencies, same ports every time.
- Reusable access – You can reconnect via Jupyter in the browser or SSH from your terminal, without re‑wiring everything.
In VESSL AI, you get this by combining:
- A Workspace running on GPUs (A100, H100, H200, B200, GB200, B300, etc.).
- Attached persistent volumes (e.g., Cluster Storage / Object Storage).
- Exposed endpoints for Jupyter and SSH.
- A saved Workspace configuration you can start again in one click.
Step 1 – Decide your capacity: Spot, On‑Demand, or Reserved
Start from the constraints you are most likely to hit: quota ceilings, waitlists, and outages.
Before you spin anything up, decide which reliability tier fits the workspace:
- Spot – Best for:
  - Exploratory notebooks
  - Non‑critical experiments
  - Budget‑sensitive workloads
  Tradeoff: can be preempted. Persistent storage survives, but the running session can be interrupted.
- On‑Demand – Best for:
  - Daily development work
  - Long‑running Jupyter sessions
  - Teams that need automatic failover across providers
  Tradeoff: higher hourly cost than Spot, but much more stable.
- Reserved – Best for:
  - Mission‑critical labs and production‑like dev environments
  - Teams who want guaranteed capacity on specific SKUs (e.g., H100, B200)
  Tradeoff: you commit capacity upfront in exchange for discounts (often up to ~40%) and dedicated support.
Rule of thumb:
Use On‑Demand or Reserved for a persistent Workspace you depend on every day. Keep Spot for bursty or throwaway work.
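The rule of thumb above can be written down as a small lookup. This is only an illustrative sketch: the function name `suggest_tier` and the workload labels are hypothetical and are not VESSL AI settings.

```shell
#!/bin/sh
# Hypothetical helper: maps a workload label to the capacity tier suggested
# by the rule of thumb in this guide. The labels are made up for illustration.
suggest_tier() {
    case "$1" in
        exploration|experiment|budget) echo "Spot" ;;
        daily-dev|long-notebook)       echo "On-Demand" ;;
        mission-critical|guaranteed)   echo "Reserved" ;;
        *)                             echo "On-Demand" ;;  # assumed default for unlisted labels
    esac
}

# Example usage:
suggest_tier daily-dev
```

The default branch falls back to On-Demand because that is the tier the guide recommends for a workspace you depend on every day.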
Step 2 – Create a GPU Workspace in the Web Console
- Log into VESSL AI
  - Go to https://vessl.ai and open the Web Console.
  - Make sure you’re in the right organization / project.
- Open the Workspaces module
  - In the left sidebar, select something like Workspaces (or Runs / Sessions depending on your account’s naming).
  - Click New Workspace or Create Workspace.
- Choose your GPU and capacity tier
  - Pick from available SKUs: A100, H100, H200, B200, GB200, B300, etc.
  - Choose the capacity type:
    - Spot
    - On‑Demand
    - Reserved (if your team has a reservation)
  - Set:
    - Number of GPUs (start with 1; scale up as needed).
    - vCPUs / RAM according to your workload.
- Select a base image
  - Choose a Docker image that includes:
    - CUDA + drivers compatible with your GPU
    - Python, Jupyter, and common ML frameworks (PyTorch, TensorFlow, etc.)
  - If your org has a standard dev image, use that; it makes collaboration easier.
  - You can always extend with pip or conda later, but baking your stack into the image reduces “it worked yesterday” issues.
- Name the Workspace
  - Use something descriptive like llm-dev-h100-jupyter-ssh or physics-sim-b200-lab.
  - This name will show up in monitoring and activity logs.
Step 3 – Attach persistent storage so your work survives
You can’t call a workspace “persistent” if your datasets and notebooks disappear on restart.
In VESSL AI, the main storage primitives are:
- Cluster Storage – High‑performance shared file system for:
- Project codebases
- Checkpoints
- Intermediate artifacts
- Object Storage – Lower‑cost blob store for:
- Large datasets
- Logs and model artifacts
When creating (or editing) your Workspace:
- Add a persistent volume
  - In the Storage or Volumes section, attach:
    - A Cluster Storage volume (e.g., cluster-shared) mounted at /workspace.
    - Optionally, mount Object Storage buckets for datasets, e.g., at /data.
  - Make sure Read/Write is enabled if you want to save from this Workspace.
- Standardize mount paths
  - Use consistent paths across team workspaces:
    - /workspace – code + notebooks
    - /checkpoints – model checkpoints
    - /data – datasets
  - This reduces path‑related breakage when you move between Workspace and batch runs.
- Check persistence behavior
  - Confirm that deleting or restarting the Workspace does not delete the volume.
  - Your Jupyter notebooks, virtual environments, and checkpoints should live on the mounted volume, not on the ephemeral container filesystem.
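One way to run the persistence check from a terminal inside the Workspace is sketched below. The function name `check_persistent_path` is hypothetical, and the mount test assumes the util-linux `mountpoint` tool is present in your image; if it is not, only the existence and write checks run.

```shell
#!/bin/sh
# Hypothetical helper: checks that a directory exists, is writable, and
# (when the util-linux `mountpoint` tool is available) is a real mount.
# Pass the mount path itself (e.g. /workspace), not a subdirectory of it.
check_persistent_path() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "MISSING: $dir"
        return 1
    fi
    probe="$dir/.write-test.$$"
    if ! ( : > "$probe" ) 2>/dev/null; then
        echo "READ-ONLY: $dir"
        return 1
    fi
    rm -f "$probe"
    if command -v mountpoint >/dev/null 2>&1 && ! mountpoint -q "$dir"; then
        echo "NOT A MOUNT: $dir (files here may live on the container filesystem)"
        return 2
    fi
    echo "OK: $dir"
    return 0
}

# Example, using the paths suggested in this guide:
#   for p in /workspace /checkpoints /data; do check_persistent_path "$p"; done
```

A "NOT A MOUNT" result for a path you expected to be persistent is a hint that the volume was not attached, which is the situation that leads to lost notebooks after a restart.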
Step 4 – Enable Jupyter access in your Workspace
Next, wire up the browser‑based development flow.
A. Start Jupyter inside the Workspace
Most VESSL base images and common ML images already include Jupyter or JupyterLab. Once your Workspace is running:
- Open a terminal in the Web Console
  - Go to the Workspace detail page.
  - Open a terminal or shell session.
- Launch Jupyter (example with JupyterLab)
cd /workspace # make sure you’re on the persistent volume
jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --NotebookApp.token='' --NotebookApp.password=''
- --ip=0.0.0.0 binds Jupyter to all network interfaces so it is reachable from outside the container.
- --port=8888 is a common default; you can change it if needed.
- --no-browser prevents it from trying to open a browser in the container.
- Setting token and password to empty strings disables Jupyter’s own authentication. That is fine only if VESSL is fronting it with its own secure access; otherwise, configure auth as your security policy requires. On JupyterLab 3 and later these options are named --ServerApp.token and --ServerApp.password; the NotebookApp spellings are deprecated.
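If you relaunch Jupyter often, a small wrapper can avoid starting a second server by accident. The sketch below is an assumption-laden example, not a VESSL feature: the function names are hypothetical, and it deliberately leaves Jupyter's default token authentication on, so the login URL with its token appears in the log file.

```shell
#!/bin/sh
# Hypothetical helpers; names and defaults are illustrative.

# Print the launch command for a given port (auth left at Jupyter's default).
jupyter_command() {
    port="${1:-8888}"
    echo "jupyter lab --ip=0.0.0.0 --port=$port --no-browser"
}

# Start JupyterLab from the persistent directory unless one is already running.
start_jupyter() {
    workdir="${1:-/workspace}"
    port="${2:-8888}"
    if [ ! -d "$workdir" ]; then
        echo "No such directory: $workdir"
        return 1
    fi
    if pgrep -f 'jupyter[- ]lab' >/dev/null 2>&1; then
        echo "JupyterLab is already running"
        return 0
    fi
    (
        cd "$workdir" || exit 1
        # nohup keeps the server alive after the terminal closes.
        nohup $(jupyter_command "$port") > jupyter.log 2>&1 &
    )
    echo "Started JupyterLab on port $port (log: $workdir/jupyter.log)"
}

# Example:
#   start_jupyter /workspace 8888
```

Running the server from the persistent directory keeps both the notebooks and the log on the mounted volume.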
B. Expose the Jupyter endpoint
In the Workspace configuration:
- Find the Ports / Endpoints section.
- Add a new endpoint:
  - Type: HTTP
  - Port: 8888 (or whatever you used)
  - Name: jupyter or jupyterlab
- Save and apply the configuration.
Once the Workspace is running and Jupyter is up:
- You should see a Jupyter / Open in Jupyter button or a URL in the Web Console.
- Clicking it opens the notebook UI over a secure connection, anchored on your persistent storage (e.g., /workspace).
Step 5 – Enable SSH access for terminal‑first workflows
Some teams live in VS Code, tmux, and CLI tools. For them, SSH is the main interface.
A. Configure SSH in the image
Your Workspace container needs an SSH server. If your image doesn’t already include one:
- Install and enable SSH inside the Workspace
In the terminal:
# Example for Debian/Ubuntu-based images
sudo apt-get update
sudo apt-get install -y openssh-server
sudo service ssh start
Make sure /var/run/sshd exists and your user is allowed to log in. For production‑style use, bake this into your custom image rather than installing on every run.
- Set SSH credentials
  - Either:
    - Use password authentication (quick for internal team dev, but less ideal), or
    - Configure SSH public keys by adding your public key to ~/.ssh/authorized_keys.
  - Persist the .ssh directory to the mounted volume (e.g., /workspace/.ssh) and symlink it if you want keys to survive restarts.
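The persist-and-symlink step can be sketched as a small function. The name `persist_ssh_dir` is hypothetical, and the example paths are the ones suggested in this guide. The permissions matter: with OpenSSH's default StrictModes setting, sshd can reject keys when the directory or authorized_keys file is writable by other users, so if key login fails, check ownership and modes on the persistent path first.

```shell
#!/bin/sh
# Hypothetical helper: keeps ~/.ssh on a persistent volume via a symlink.
# $1 = home directory, $2 = persistent directory (e.g. /workspace/.ssh)
persist_ssh_dir() {
    home_dir="$1"
    persist_dir="$2"
    mkdir -p "$persist_dir" || return 1
    # Copy keys from an existing, non-symlinked ~/.ssh onto the volume once,
    # then keep the old directory as a backup.
    if [ -d "$home_dir/.ssh" ] && [ ! -L "$home_dir/.ssh" ]; then
        cp -Rp "$home_dir/.ssh/." "$persist_dir/" || return 1
        mv "$home_dir/.ssh" "$home_dir/.ssh.bak.$$"
    fi
    touch "$persist_dir/authorized_keys"
    chmod 700 "$persist_dir"
    chmod 600 "$persist_dir/authorized_keys"
    # -n replaces an existing symlink instead of descending into it.
    ln -sfn "$persist_dir" "$home_dir/.ssh"
}

# Example:
#   persist_ssh_dir "$HOME" /workspace/.ssh
```

The function is safe to run on every Workspace start: on later runs it only re-creates the symlink and re-applies the permissions.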
B. Expose an SSH endpoint
In the Workspace configuration:
- Go to Ports / Endpoints.
- Add a new endpoint:
  - Type: TCP
  - Port: 22 (or custom SSH port you configured)
  - Name: ssh
- Save and apply.
Once the Workspace is running:
- The Web Console should show an SSH endpoint—either a hostname + port or a ready‑to‑copy SSH command.
- From your local terminal, you can connect:
ssh -p <PORT> <USER>@<HOST>
(Use the exact values VESSL provides.)
You can now:
- Use VS Code Remote SSH to develop directly on the GPU box.
- Run tmux sessions for long‑running scripts.
- Use rsync or scp for quick file transfer, though in most cases you’ll rely on Cluster/Object Storage instead.
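For VS Code Remote SSH and repeated terminal logins, it can help to store the endpoint in your local ~/.ssh/config so you connect by alias. The sketch below prints such an entry; the function name and the alias `vessl-workspace` are hypothetical, and the host, port, and user must come from the Web Console.

```shell
#!/bin/sh
# Hypothetical helper: prints an ~/.ssh/config entry for the Workspace.
# Arguments: alias, host, port, user (take host/port/user from the Web Console).
ssh_config_entry() {
    cat <<EOF
Host $1
    HostName $2
    Port $3
    User $4
    ServerAliveInterval 60
EOF
}

# Example with placeholder values; append the output to ~/.ssh/config:
ssh_config_entry vessl-workspace "<HOST>" "<PORT>" "<USER>"
```

With the entry in place, `ssh vessl-workspace` connects directly, and the same alias shows up in VS Code's Remote SSH host list. ServerAliveInterval sends a keepalive every 60 seconds, which can help idle sessions stay connected.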
Step 6 – Make the Workspace truly “persistent”
To turn this from a one‑off session into a repeatable environment:
A. Save the Workspace configuration
- In the Web Console, ensure your Workspace configuration is saved:
- GPU type and count
- Capacity tier (Spot / On‑Demand / Reserved)
- Image
- Mounted volumes
- Exposed ports (Jupyter, SSH)
- Some setups support saving this as a template—so other team members can instantiate the exact same stack.
B. Move state onto persistent volumes
Inside the Workspace:
- Store notebooks in /workspace/notebooks (or similar) on Cluster Storage.
- Store checkpoints in /checkpoints.
- Store config and SSH keys under a persistent path, e.g.:
  - /workspace/.ssh
  - /workspace/.venv (if you don’t bake dependencies into the image)
This way, even if:
- The Workspace restarts,
- You switch from Spot to On‑Demand,
- Or you migrate between providers,
your artifacts remain intact.
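For the virtual environment case, a guard like the one below creates the environment on the persistent volume only once and reuses it afterwards. This is a sketch under assumptions: the function name `ensure_venv` is hypothetical, and it assumes `python3` with the standard `venv` module is available in the image.

```shell
#!/bin/sh
# Hypothetical helper: creates a virtual environment on the persistent volume
# only if one is not already there, so restarts reuse installed packages.
ensure_venv() {
    venv_dir="${1:-/workspace/.venv}"
    if [ -f "$venv_dir/bin/activate" ]; then
        echo "Reusing existing environment at $venv_dir"
        return 0
    fi
    python3 -m venv "$venv_dir" || return 1
    echo "Created new environment at $venv_dir"
}

# Typical use after each Workspace start:
#   ensure_venv /workspace/.venv && . /workspace/.venv/bin/activate
```

A virtual environment records the path of the Python interpreter it was created with, so if you switch to an image with a different Python, delete the directory and let the function recreate it.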
C. Use consistent labels and naming
Give your persistent Workspace a clear identity:
- Name: team-llm-dev-h100-persistent
- Labels/tags (if supported):
  - env=dev
  - type=workspace
  - interface=jupyter-ssh
This makes tracking, monitoring, and cost accounting much easier.
Step 7 – Handle reliability: failover and multi‑cloud
A persistent workspace is useless if the provider goes down and you’re blocked for hours.
VESSL AI adds reliability primitives on top of GPUs:
- Auto Failover – Seamless provider switching on On‑Demand runs. If a region or provider fails, VESSL can transparently migrate capacity (subject to SKU availability).
- Multi‑Cluster – Unified view and control across regions; you’re not locked into a single cloud location.
For persistent workspaces:
- Prefer On‑Demand or Reserved if downtime hurts your team.
- Use Multi‑Cluster to keep storage and configs visible across regions.
- Keep your Workspace image and storage mounts cloud‑agnostic where possible (no provider‑specific paths).
The outcome: when one provider hits a quota wall or an outage, you don’t have to rebuild everything—your workspace configuration is already portable.
Step 8 – Reconnect and resume work
Once your persistent GPU Workspace is set up:
- To resume via Jupyter:
  - Start the Workspace (if stopped).
  - Confirm your persistent volume is mounted.
  - Relaunch jupyter lab if needed.
  - Click the Jupyter endpoint in the Web Console.
- To resume via SSH:
  - Start the Workspace.
  - Use the SSH command from the Web Console.
  - Reattach to your tmux sessions if the Workspace kept running; after a stop or restart, tmux sessions are gone, so start a new one and rerun your scripts.
You’re back where you left off—same GPUs, same data, same environment—without job wrangling.
Common pitfalls and how to avoid them
1. “My notebooks disappeared.”
Likely cause: you saved them in the container’s root filesystem, not on a persistent volume.
Fix: always work under /workspace (or your mounted path) and confirm the volume is attached.
2. “Jupyter URL doesn’t load.”
- Check that Jupyter is actually running: ps aux | grep jupyter.
- Confirm the container port matches the endpoint port (e.g., 8888).
- Restart the Workspace after changing endpoint config.
3. “SSH connection refused.”
- Ensure sshd is running inside the Workspace.
- Confirm the correct port (22 or custom) is exposed.
- Check firewall / security group rules on your cloud provider, if applicable, though VESSL typically abstracts this.
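For both connectivity pitfalls, a quick first check is whether anything is listening on the expected port inside the container. The sketch below filters `ss -ltn` output; the function name `listening_on` is hypothetical, and it assumes the iproute2 `ss` tool is installed in your image.

```shell
#!/bin/sh
# Hypothetical helper: reads `ss -ltn` style output on stdin and succeeds
# if some socket is listening on the given port.
listening_on() {
    awk -v port="$1" '
        # Column 4 of `ss -ltn` is the local address and port, e.g. 0.0.0.0:8888
        $1 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
        END { exit !found }
    '
}

# Inside the Workspace:
#   ss -ltn | listening_on 8888 && echo "Jupyter port is open"
#   ss -ltn | listening_on 22   && echo "sshd port is open"
```

If the port is listening inside the container but the URL or SSH command still fails, the problem is more likely in the endpoint configuration than in Jupyter or sshd.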
4. “Workspace got preempted mid‑run.”
You probably used Spot. For truly persistent long‑running work, switch to On‑Demand or Reserved and let VESSL’s Auto Failover handle provider churn.
When to move from Workspace to production‑style runs
Your persistent GPU Workspace is ideal for:
- Interactive development
- Model debugging
- Data exploration
- Small‑scale training loops
Once workloads become:
- Large‑scale training jobs,
- Scheduled batch processes, or
- Production inference services,
you should:
- Keep using the same image and storage mounts you validated in your Workspace.
- Move execution into more controlled jobs / runs via vessl run in the CLI.
- Rely on On‑Demand with Auto Failover or Reserved for production‑grade reliability.
This is how teams go from “one dev box with Jupyter” to a reproducible, multi‑cloud training pipeline without re‑architecting.
Final takeaway
A persistent GPU Workspace in VESSL AI is just a clean combination of:
- The right GPU tier (Spot / On‑Demand / Reserved),
- Persistent storage (Cluster + Object Storage) mounted at stable paths,
- Exposed endpoints for Jupyter and SSH, and
- A saved configuration you can restart in minutes.
Set it up once, store everything on persistent volumes, and you get a workspace you can reconnect to every day—without chasing GPUs, re‑installing dependencies, or rebuilding environments.
Get started and spin up your first persistent Jupyter + SSH GPU Workspace in a few minutes.