How do I set up a persistent GPU Workspace in VESSL AI with Jupyter + SSH access?

Most teams don’t just need a GPU; they need a workspace that stays put. Same environment, same data, same ports—whether you’re in Jupyter, SSH, or coming back tomorrow to pick up where you left off.

Below is a practical, step‑by‑step walkthrough for how to set up a persistent GPU Workspace in VESSL AI with both Jupyter and SSH access, tuned for LLM post‑training, Physical AI, and AI‑for‑Science workflows.


What “persistent GPU Workspace” means in VESSL AI

When people say “persistent workspace,” they usually want three things:

  • Persistent storage – Your code, data, and checkpoints survive restarts.
  • Stable environment – Same image, same dependencies, same ports every time.
  • Reusable access – You can reconnect via Jupyter in the browser or SSH from your terminal, without re‑wiring everything.

In VESSL AI, you get this by combining:

  • A Workspace running on GPUs (A100, H100, H200, B200, GB200, B300, etc.).
  • Attached persistent volumes (e.g., Cluster Storage / Object Storage).
  • Exposed endpoints for Jupyter and SSH.
  • A saved Workspace configuration you can start again in one click.

Step 1 – Decide your capacity: Spot, On‑Demand, or Reserved

Start from the constraints you are most likely to hit: quota ceilings, waitlists, and provider outages.

Before you spin anything up, decide which reliability tier fits the workspace:

  • Spot – Best for:

    • Exploratory notebooks
    • Non‑critical experiments
    • Budget‑sensitive workloads
      Tradeoff: can be preempted. Persistent storage survives, but the running session can be interrupted.
  • On‑Demand – Best for:

    • Daily development work
    • Long‑running Jupyter sessions
    • Teams that need automatic failover across providers
      Tradeoff: higher hourly cost than Spot, but much more stable.
  • Reserved – Best for:

    • Mission‑critical labs and production‑like dev environments
    • Teams who want guaranteed capacity on specific SKUs (e.g., H100, B200)
      Tradeoff: you commit capacity upfront in exchange for discounts (up to roughly 40%) and dedicated support.

Rule of thumb:
Use On‑Demand or Reserved for a persistent Workspace you depend on every day. Keep Spot for bursty or throwaway work.
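
One way to sanity-check the Reserved tradeoff is a break-even calculation. The sketch below rests on a simplified billing model that is an assumption, not something VESSL confirms here: Reserved capacity is billed for every committed hour at a flat discount, while On-Demand is billed only for the hours you use. Under that model, Reserved becomes cheaper once utilization exceeds 100% minus the discount. The function names and the example rate are hypothetical.

```shell
# Assumption: reserved capacity is billed for every committed hour at a flat
# discount; on-demand is billed only for hours actually used.

# Print the utilization (in percent) above which reserved is cheaper.
# $1 = reserved discount in percent (e.g. 40)
breakeven_utilization() {
  awk -v d="$1" 'BEGIN { printf "%.0f\n", 100 - d }'
}

# Print the cost of a billing period.
# $1 = hourly rate, $2 = billed hours, $3 = discount in percent (default 0)
period_cost() {
  awk -v r="$1" -v h="$2" -v d="${3:-0}" \
    'BEGIN { printf "%.2f\n", r * h * (1 - d / 100) }'
}

# Example with a hypothetical 3.00/hour rate over a 720-hour month:
#   period_cost 3.00 720 40   # reserved, billed for all 720 hours
#   period_cost 3.00 400      # on-demand, 400 hours used
```

With a 40% discount the break-even sits at 60% utilization, so a Workspace that runs most of the day is the case where a reservation tends to pay off.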


Step 2 – Create a GPU Workspace in the Web Console

  1. Log into VESSL AI

    • Go to https://vessl.ai and open the Web Console.
    • Make sure you’re in the right organization / project.
  2. Open the Workspaces module

    • In the left sidebar, select Workspaces (the label may differ, e.g., Runs or Sessions, depending on your account).
    • Click New Workspace or Create Workspace.
  3. Choose your GPU and capacity tier

    • Pick from available SKUs: A100, H100, H200, B200, GB200, B300, etc.
    • Choose the capacity type:
      • Spot
      • On‑Demand
      • Reserved (if your team has a reservation)
    • Set:
      • Number of GPUs (start with 1; scale up as needed).
      • vCPUs / RAM according to your workload.
  4. Select a base image

    • Choose a Docker image that includes:
      • CUDA + drivers compatible with your GPU
      • Python, Jupyter, and common ML frameworks (PyTorch, TensorFlow, etc.)
    • If your org has a standard dev image, use that—it makes collaboration easier.
    • You can always extend with pip or conda later, but baking your stack into the image reduces “it worked yesterday” issues.
  5. Name the Workspace

    • Use something descriptive like llm-dev-h100-jupyter-ssh or physics-sim-b200-lab.
    • This name will show up in monitoring and activity logs.
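
Once the Workspace is running, it is worth confirming from a terminal that the container actually sees the GPUs you requested. A minimal sketch, assuming the image ships the NVIDIA driver tooling (nvidia-smi); the helper names are illustrative and not part of VESSL:

```shell
# Print one line per visible GPU: index, name, total memory.
gpu_report() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found: the image may lack NVIDIA tooling, or no GPU is attached" >&2
    return 1
  fi
  nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader
}

# Succeed only if the number of visible GPUs matches the expected count.
# $1 = expected number of GPUs
expect_gpu_count() {
  report=$(gpu_report) || return 1
  found=$(printf '%s\n' "$report" | grep -c . || true)
  if [ "$found" -ne "$1" ]; then
    echo "expected $1 GPU(s), found $found" >&2
    return 1
  fi
  echo "ok: $found GPU(s) visible"
}

# Usage: expect_gpu_count 1
```

Running this right after the first start catches a mismatched SKU or GPU count before you spend time setting up the rest of the environment.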

Step 3 – Attach persistent storage so your work survives

You can’t call a workspace “persistent” if your datasets and notebooks disappear on restart.

In VESSL AI, the main storage primitives are:

  • Cluster Storage – High‑performance shared file system for:
    • Project codebases
    • Checkpoints
    • Intermediate artifacts
  • Object Storage – Lower‑cost blob store for:
    • Large datasets
    • Logs and model artifacts

When creating (or editing) your Workspace:

  1. Add a persistent volume

    • In the Storage or Volumes section, attach:
      • A Cluster Storage volume (e.g., cluster-shared) mounted at /workspace.
      • Optionally, mount Object Storage buckets for datasets, e.g., at /data.
    • Make sure Read/Write is enabled if you want to save from this Workspace.
  2. Standardize mount paths

    • Use consistent paths across team workspaces:
      • /workspace – code + notebooks
      • /checkpoints – model checkpoints
      • /data – datasets
    • This reduces path‑related breakage when you move between Workspace and batch runs.
  3. Check persistence behavior

    • Confirm that deleting or restarting the Workspace does not delete the volume.
    • Your Jupyter notebooks, virtual environments, and checkpoints should live on the mounted volume, not on the ephemeral container filesystem.
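
The persistence checks above can be partly automated from inside the Workspace: verify that each standard path exists, is writable, and is not backed by the container's root filesystem. The sketch below uses only df, awk, and mktemp. The function names are hypothetical, and the "mounted on /" heuristic assumes the container root is the only ephemeral filesystem.

```shell
# Succeed if a file can be created and removed under the given directory.
can_write() {
  [ -d "$1" ] || return 1
  probe=$(mktemp "$1/.write-probe.XXXXXX" 2>/dev/null) || return 1
  rm -f "$probe"
}

# Print the mount point that backs a path, as reported by df.
backing_mount() {
  df -P "$1" 2>/dev/null | awk 'NR == 2 { print $6 }'
}

# Check every path given as an argument; return non-zero if any looks ephemeral.
check_persistent_paths() {
  rc=0
  for target in "$@"; do
    if ! can_write "$target"; then
      echo "FAIL $target: missing or not writable"
      rc=1
    elif [ "$(backing_mount "$target")" = "/" ]; then
      echo "WARN $target: backed by the container root filesystem"
      rc=1
    else
      echo "OK   $target: mounted on $(backing_mount "$target")"
    fi
  done
  return "$rc"
}

# Usage: check_persistent_paths /workspace /checkpoints /data
```

A read-only dataset mount would show up as FAIL here, which is expected if you attached it without Read/Write.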

Step 4 – Enable Jupyter access in your Workspace

Next, wire up the browser‑based development flow.

A. Start Jupyter inside the Workspace

Most VESSL base images and common ML images already include Jupyter or JupyterLab. Once your Workspace is running:

  1. Open a terminal in the Web Console

    • Go to the Workspace detail page.
    • Open a terminal or shell session.
  2. Launch Jupyter (example with JupyterLab)

cd /workspace   # make sure you’re on the persistent volume
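jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --ServerApp.token='' --ServerApp.password=''
  • --ip=0.0.0.0 binds Jupyter to all network interfaces, so the Workspace endpoint can reach it from outside the container.
  • --port=8888 is a common default; you can change it if needed.
  • --no-browser prevents it from trying to open a browser in the container.
  • --ServerApp.token='' and --ServerApp.password='' turn off Jupyter’s own authentication. JupyterLab 3 and later read these settings from ServerApp; older Notebook-based installs use --NotebookApp.token and --NotebookApp.password instead. Only disable authentication if VESSL is fronting the endpoint with its own secure access; otherwise, keep the default token or configure auth as your security policy requires.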
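
A Jupyter process started from a Web Console terminal may stop when that terminal session ends. One way around this is to start it in the background with nohup and keep the log and PID file on the persistent volume. This is a sketch under assumptions: the start_jupyter helper and the .logs directory are made up for illustration, Jupyter's default token authentication is left on, and --ServerApp.root_dir assumes JupyterLab 3 or later.

```shell
# Start JupyterLab in the background unless a previous instance is still alive.
# $1 = notebook root on the persistent volume (default /workspace)
# $2 = port to listen on (default 8888)
start_jupyter() {
  root=${1:-/workspace}
  port=${2:-8888}
  log_dir=$root/.logs            # hypothetical location for log and PID file
  pid_file=$log_dir/jupyter.pid

  mkdir -p "$log_dir" || return 1

  # Reuse the running server if the recorded PID is still alive.
  if [ -s "$pid_file" ] && kill -0 "$(cat "$pid_file")" 2>/dev/null; then
    echo "Jupyter already running (pid $(cat "$pid_file"))"
    return 0
  fi

  # nohup makes the server ignore the hangup signal sent when the terminal closes.
  nohup jupyter lab --ip=0.0.0.0 --port="$port" --no-browser \
    --ServerApp.root_dir="$root" >"$log_dir/jupyter.log" 2>&1 &
  echo $! >"$pid_file"
  echo "Started Jupyter (pid $!), log: $log_dir/jupyter.log"
}

# Usage: start_jupyter /workspace 8888
```

Because token authentication stays enabled in this variant, the login URL with its token ends up in the log file; jupyter server list prints it as well.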

B. Expose the Jupyter endpoint

In the Workspace configuration:

  1. Find the Ports / Endpoints section.
  2. Add a new endpoint:
    • Type: HTTP
    • Port: 8888 (or whatever you used)
    • Name: jupyter or jupyterlab
  3. Save and apply the configuration.

Once the Workspace is running and Jupyter is up:

  • You should see a Jupyter / Open in Jupyter button or a URL in the Web Console.
  • Clicking it opens the notebook UI over a secure connection, anchored on your persistent storage (e.g., /workspace).

Step 5 – Enable SSH access for terminal‑first workflows

Some teams live in VS Code, tmux, and CLI tools. For them, SSH is the main interface.

A. Configure SSH in the image

Your Workspace container needs an SSH server. If your image doesn’t already include one:

  1. Install and enable SSH inside the Workspace

In the terminal:

# Example for Debian/Ubuntu-based images
sudo apt-get update
sudo apt-get install -y openssh-server
sudo service ssh start

Make sure /var/run/sshd exists and your user is allowed to log in. For production‑style use, bake this into your custom image rather than installing on every run.

  2. Set SSH credentials
    • Either:
      • Use password authentication (quick for internal team dev, but weaker than keys), or
      • Configure SSH public keys:
        • Add your public key to ~/.ssh/authorized_keys.
    • To make keys survive restarts, keep the .ssh directory on the mounted volume (e.g., /workspace/.ssh) and symlink ~/.ssh to it. Keep the directory at mode 700 and authorized_keys at mode 600, since sshd can reject keys whose files are writable by other users.
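
The persist-and-symlink step can be made repeatable so it is safe to run on every start. A minimal sketch, assuming a POSIX shell and that the volume honors file modes; persist_ssh_dir and add_public_key are hypothetical helpers. On later runs the copy on the volume wins over whatever the image placed at ~/.ssh, so a fresh container does not overwrite your stored keys.

```shell
# Keep ~/.ssh on the persistent volume and symlink it into the home directory.
# $1 = persistent root (e.g. /workspace), $2 = home directory (e.g. $HOME)
persist_ssh_dir() {
  store=$1/.ssh
  link=$2/.ssh

  if [ ! -d "$store" ]; then
    # First run: create the store and seed it from any existing real ~/.ssh.
    mkdir -p "$store" || return 1
    if [ -d "$link" ] && [ ! -L "$link" ]; then
      cp -Rp "$link"/. "$store"/ || return 1
    fi
  fi

  # Later runs: the store wins over whatever the image placed at ~/.ssh.
  if [ -d "$link" ] && [ ! -L "$link" ]; then
    rm -rf "$link"
  fi

  touch "$store/authorized_keys"
  chmod 700 "$store"
  chmod 600 "$store/authorized_keys"
  ln -sfn "$store" "$link"
}

# Append a public key unless that exact line is already present.
# $1 = persistent root, $2 = full public key line
add_public_key() {
  keys=$1/.ssh/authorized_keys
  grep -qxF -- "$2" "$keys" 2>/dev/null || printf '%s\n' "$2" >>"$keys"
}

# Usage:
#   persist_ssh_dir /workspace "$HOME"
#   add_public_key /workspace "$(cat id_ed25519.pub)"
```

If key login still fails after this, check the permissions of the volume's mount directory itself: with sshd's default StrictModes setting, a group- or world-writable parent directory can be enough for keys to be refused.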

B. Expose an SSH endpoint

In the Workspace configuration:

  1. Go to Ports / Endpoints.
  2. Add a new endpoint:
    • Type: TCP
    • Port: 22 (or custom SSH port you configured)
    • Name: ssh
  3. Save and apply.

Once the Workspace is running:

  • The Web Console should show an SSH endpoint—either a hostname + port or a ready‑to‑copy SSH command.
  • From your local terminal, you can connect:
ssh -p <PORT> <USER>@<HOST>

(Use the exact values VESSL provides.)

You can now:

  • Use VS Code Remote SSH to develop directly on the GPU box.
  • Run tmux sessions for long‑running scripts.
  • Use rsync or scp for quick file transfers, though in most cases you’ll rely on Cluster/Object Storage instead.
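
To avoid retyping the host and port, the connection details can live in an ~/.ssh/config entry on your local machine. The sketch below generates such an entry; ssh_config_entry and the vessl-dev alias are hypothetical, the host, port, and user must come from the Web Console, and the 60-second keepalive is an arbitrary choice.

```shell
# Print an ~/.ssh/config entry for the Workspace.
# $1 = alias, $2 = host, $3 = port, $4 = user, $5 = private key path (optional)
ssh_config_entry() {
  printf 'Host %s\n' "$1"
  printf '    HostName %s\n' "$2"
  printf '    Port %s\n' "$3"
  printf '    User %s\n' "$4"
  if [ -n "${5:-}" ]; then
    printf '    IdentityFile %s\n' "$5"
  fi
  # Send a keepalive every 60 seconds so idle sessions are less likely to drop.
  printf '    ServerAliveInterval 60\n'
}

# Usage (replace the placeholders with the values shown in the Web Console):
#   ssh_config_entry vessl-dev "<HOST>" "<PORT>" "<USER>" ~/.ssh/id_ed25519 >>~/.ssh/config
#   ssh vessl-dev
```

VS Code Remote SSH reads the same config file, so the alias shows up there too. If the host or port changes after a restart, update the entry rather than appending a second one with the same alias.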

Step 6 – Make the Workspace truly “persistent”

To turn this from a one‑off session into a repeatable environment:

A. Save the Workspace configuration

  • In the Web Console, ensure your Workspace configuration is saved:
    • GPU type and count
    • Capacity tier (Spot / On‑Demand / Reserved)
    • Image
    • Mounted volumes
    • Exposed ports (Jupyter, SSH)
  • Some setups support saving this as a template—so other team members can instantiate the exact same stack.

B. Move state onto persistent volumes

Inside the Workspace:

  • Store notebooks in /workspace/notebooks (or similar) on Cluster Storage.
  • Store checkpoints in /checkpoints.
  • Store config and SSH keys under a persistent path, e.g.:
    • /workspace/.ssh
    • /workspace/.venv (if you don’t bake dependencies into the image)

This way, even if:

  • The Workspace restarts,
  • You switch from Spot to On‑Demand,
  • Or you migrate between providers,

your artifacts remain intact.
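
If you keep a virtual environment on the volume rather than in the image, a small guard avoids recreating it on every start. A sketch, assuming python3 with the venv module is available in the image; ensure_venv is a hypothetical helper. Note that a venv records the absolute path of the interpreter that created it, so it is only reusable across restarts if the image keeps Python at the same path and version.

```shell
# Create the virtual environment on first use and print its activate script.
# $1 = venv path on the persistent volume (default /workspace/.venv)
ensure_venv() {
  venv=${1:-/workspace/.venv}
  if [ ! -x "$venv/bin/python" ]; then
    python3 -m venv "$venv" || return 1
  fi
  echo "$venv/bin/activate"
}

# Usage: . "$(ensure_venv /workspace/.venv)"
```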

C. Use consistent labels and naming

Give your persistent Workspace a clear identity:

  • Name: team-llm-dev-h100-persistent
  • Labels/tags (if supported):
    • env=dev
    • type=workspace
    • interface=jupyter-ssh

This makes tracking, monitoring, and cost accounting much easier.


Step 7 – Handle reliability: failover and multi‑cloud

A persistent workspace is useless if the provider goes down and you’re blocked for hours.

VESSL AI adds reliability primitives on top of GPUs:

  • Auto Failover – Seamless provider switching on On‑Demand runs. If a region or provider fails, VESSL can transparently migrate capacity (subject to SKU availability).
  • Multi‑Cluster – Unified view and control across regions; you’re not locked into a single cloud location.

For persistent workspaces:

  • Prefer On‑Demand or Reserved if downtime hurts your team.
  • Use Multi‑Cluster to keep storage and configs visible across regions.
  • Keep your Workspace image and storage mounts cloud‑agnostic where possible (no provider‑specific paths).

The outcome: when one provider hits a quota wall or an outage, you don’t have to rebuild everything—your workspace configuration is already portable.


Step 8 – Reconnect and resume work

Once your persistent GPU Workspace is set up:

  • To resume via Jupyter:

    • Start the Workspace (if stopped).
    • Confirm your persistent volume is mounted.
    • Relaunch jupyter lab if needed.
    • Click the Jupyter endpoint in the Web Console.
  • To resume via SSH:

    • Start the Workspace.
    • Use the SSH command from the Web Console.
    • Reattach to your tmux sessions or directly run scripts.

You’re back where you left off—same GPUs, same data, same environment—without job wrangling.


Common pitfalls and how to avoid them

1. “My notebooks disappeared.”
Likely cause: you saved them in the container’s root filesystem, not on a persistent volume.
Fix: always work under /workspace (or your mounted path) and confirm the volume is attached.

2. “Jupyter URL doesn’t load.”

  • Check that Jupyter is actually running: ps aux | grep jupyter.
  • Confirm the container port matches the endpoint port (e.g., 8888).
  • Restart the Workspace after changing endpoint config.

3. “SSH connection refused.”

  • Ensure sshd is running inside the Workspace.
  • Confirm the correct port (22 or custom) is exposed.
  • Check firewall / security group rules on your cloud provider, if applicable, though VESSL typically abstracts this.
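
For pitfalls 2 and 3, a quick first check is whether anything inside the container is listening on the port the endpoint points at. The sketch below parses the output of ss -ltn; it assumes ss (from iproute2) is installed in the image, and the helper names are illustrative.

```shell
# Extract listening TCP port numbers from `ss -ltn` output read on stdin.
listening_ports() {
  awk 'NR > 1 && NF >= 4 { n = split($4, parts, ":"); print parts[n] }' | sort -un
}

# Return 0 if the port is listening, 1 if not, 2 if ss is unavailable.
port_is_listening() {
  command -v ss >/dev/null 2>&1 || return 2
  ss -ltn 2>/dev/null | listening_ports | grep -qx -- "$1"
}

# Report whether a named service is listening on its expected port.
# $1 = label (e.g. jupyter), $2 = port
check_service_port() {
  rc=0
  port_is_listening "$2" || rc=$?
  case $rc in
    0) echo "$1: listening on port $2" ;;
    1) echo "$1: nothing listening on port $2"; return 1 ;;
    *) echo "$1: cannot check, ss not found (install iproute2)"; return 2 ;;
  esac
}

# Usage:
#   check_service_port jupyter 8888
#   check_service_port ssh 22
```

If the port is listening but the URL or SSH command still fails, the problem is more likely in the endpoint configuration than in the process itself.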

4. “Workspace got preempted mid‑run.”
You probably used Spot. For truly persistent long‑running work, switch to On‑Demand or Reserved and let VESSL’s Auto Failover handle provider churn.


When to move from Workspace to production‑style runs

Your persistent GPU Workspace is ideal for:

  • Interactive development
  • Model debugging
  • Data exploration
  • Small‑scale training loops

Once workloads become:

  • Large‑scale training jobs,
  • Scheduled batch processes, or
  • Production inference services,

you should:

  • Keep using the same image and storage mounts you validated in your Workspace.
  • Move execution into more controlled jobs / runs via vessl run in the CLI.
  • Rely on On‑Demand with Auto Failover or Reserved for production‑grade reliability.

This is how teams go from “one dev box with Jupyter” to a reproducible, multi‑cloud training pipeline without re‑architecting.


Final takeaway

A persistent GPU Workspace in VESSL AI is just a clean combination of:

  • The right GPU tier (Spot / On‑Demand / Reserved),
  • Persistent storage (Cluster + Object Storage) mounted at stable paths,
  • Exposed endpoints for Jupyter and SSH, and
  • A saved configuration you can restart in minutes.

Set it up once, store everything on persistent volumes, and you get a workspace you can reconnect to every day—without chasing GPUs, re‑installing dependencies, or rebuilding environments.
