How do I start a training run on VESSL AI using the CLI (vessl run) with a YAML file?

Most teams hit the same wall: you finally get access to H100s, but every run needs manual setup, cloud console clicks, and debugging before you can even start training. The whole point of vessl run is to flip that around—declare your job once in YAML, then launch it on VESSL Cloud from any terminal in a single command.

Below is a step-by-step walkthrough of how to start a training run on VESSL AI using the CLI (vessl run) with a YAML file, plus a reference template you can adapt for your own workloads.


Why use vessl run with YAML?

Running from YAML gives you:

  • Repeatability – Same environment, same GPUs, same volumes every time.
  • Versioning – Store job definitions next to your code in Git.
  • Portability – Move from 1 to 100 GPUs or between providers without rewriting scripts.
  • Less job wrangling – Declare what you want and let VESSL handle orchestration, monitoring, and failover.

You describe your job once. VESSL’s control plane takes care of provisioning GPUs across providers, wiring in storage, and exposing logs and metrics in the Web Console.


Prerequisites

Before you run your first training job from a YAML file, make sure you have:

  1. A VESSL AI account

  2. VESSL CLI installed

    • Install it (typically via pip with pip install vessl) or follow the install instructions from your onboarding or docs.
    • Confirm it’s working:
      vessl --version
      
  3. CLI authenticated to your VESSL workspace

    • Log in:
      vessl login
      
    • This opens a browser window or prompts for a token so the CLI can use your VESSL identity and project context.
  4. A project/workspace selected

    • List available projects:
      vessl project list
      
    • Set the active project (if supported by your version):
      vessl project use <PROJECT_NAME_OR_ID>
      

Once this is done, you’re ready to define your training run in YAML.


Step 1: Create a YAML spec for your training run

The YAML file is a declarative spec for your run: container image, command, GPUs, storage, environment variables, and more.

Below is a minimal but realistic example for a PyTorch training job using an H100-class GPU.

# train.yaml
name: "pytorch-cifar10-train"

# Container environment
image: "pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime"

# Resources & scheduling
resources:
  # GPU SKU and count (examples: A100, H100, H200, B200, GB200, B300)
  gpu:
    type: "H100"
    count: 1

  # Optional: CPU and memory (implementation-specific)
  cpu: "8"
  memory: "64Gi"

  # Reliability tier: spot / on-demand / reserved (depends on your VESSL plan)
  tier: "on-demand"

# Code + entrypoint
command:
  - "bash"
  - "-lc"
  - |
    python train.py \
      --dataset cifar10 \
      --batch-size 256 \
      --epochs 90

# Working directory inside the container
workdir: "/workspace"

# Mount your project files (e.g., synced from Git, object storage, or a volume)
volumes:
  - name: "project"
    mountPath: "/workspace"
    # Backend details depend on your setup:
    # type: "cluster-storage" or "object-storage"
    # path: "your-dataset-or-repo-path"

# Environment variables (for hyperparameters, tokens, etc.)
env:
  - name: "CUDA_VISIBLE_DEVICES"
    value: "0"
  - name: "EXPERIMENT_NAME"
    value: "cifar10-h100-baseline"

This spec gives VESSL enough information to:

  • Provision an H100 GPU in your selected region/provider.
  • Start a PyTorch container.
  • Mount your code and data into /workspace.
  • Run python train.py ... as your training entrypoint.
  • Pipe logs and metrics back to the Web Console and CLI.
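The entrypoint in the spec only works if train.py actually parses the flags the YAML command passes in. A minimal sketch of such a script is below; the flag names (--dataset, --batch-size, --epochs) are assumptions that must match whatever your YAML command supplies, and the training loop itself is elided:

```python
# train.py -- minimal entrypoint sketch matching the YAML `command` above.
import argparse


def parse_args(argv=None):
    """Parse the CLI flags the YAML command passes to this script."""
    parser = argparse.ArgumentParser(description="Toy training entrypoint")
    parser.add_argument("--dataset", default="cifar10")
    parser.add_argument("--batch-size", type=int, default=256)
    parser.add_argument("--epochs", type=int, default=90)
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    print(f"Training on {args.dataset}: {args.epochs} epochs, "
          f"batch size {args.batch_size}")
    # ... build the model and dataloaders, then run the training loop ...


if __name__ == "__main__":
    main()
```

Keeping defaults in the script means the same file runs locally (python train.py) and on VESSL with the YAML-supplied overrides.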

Common YAML fields you’ll likely use

Depending on your workload, you can extend the spec with:

  • Multi-GPU / multi-node

    resources:
      gpu:
        type: "A100"
        count: 8
      tier: "spot"
    

    Scale a single job to 8 GPUs for faster LLM post-training or vision model training.

  • Job name and labels

    name: "llm-finetune-mixtral"
    labels:
      project: "chatbot"
      stage: "post-train"
    
  • Artifacts / outputs

    outputs:
      - name: "checkpoints"
        path: "/workspace/checkpoints"
        type: "object-storage"
        bucket: "my-bucket"
    

VESSL handles the plumbing so you don’t need to stitch together storage, logs, or GPU scheduling manually.
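On the training-code side, the outputs mount only helps if your script actually writes checkpoints under the configured path. A hedged sketch follows: the /workspace/checkpoints default mirrors the YAML example above, the CHECKPOINT_DIR variable name is illustrative, and JSON stands in for torch.save:

```python
import json
import os
from typing import Optional

# Default mirrors the `outputs` mount path in the YAML above; override via the
# CHECKPOINT_DIR env var or the ckpt_dir argument (both names are illustrative).
DEFAULT_CKPT_DIR = "/workspace/checkpoints"


def save_checkpoint(state: dict, epoch: int, ckpt_dir: Optional[str] = None) -> str:
    """Write a tiny JSON checkpoint; swap json.dump for torch.save in real training."""
    ckpt_dir = ckpt_dir or os.environ.get("CHECKPOINT_DIR", DEFAULT_CKPT_DIR)
    os.makedirs(ckpt_dir, exist_ok=True)
    path = os.path.join(ckpt_dir, f"epoch-{epoch:04d}.json")
    with open(path, "w") as f:
        json.dump({"epoch": epoch, **state}, f)
    return path
```

Anything written under the configured path then lands in the artifacts view without extra upload code on your side.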


Step 2: Place the YAML file in your project

Put train.yaml at the root of your repository or in a vessl/ folder—just keep it somewhere easy to reference:

your-project/
  train.py
  requirements.txt
  train.yaml

You can commit this YAML into Git so everyone on your team can invoke the same spec without copy-paste configs.


Step 3: Run the training job with vessl run

Once your YAML is ready:

# From your project directory
vessl run -f train.yaml

Typical behavior:

  • The CLI validates your YAML.
  • VESSL schedules the job on the requested GPU SKU (e.g., H100, A100, B200).
  • A run ID is returned, and you can open the Web Console to follow logs and metrics.

Example output (structure may differ slightly):

$ vessl run -f train.yaml
Submitting run "pytorch-cifar10-train"...
Run ID: run-abc123
Status: PENDING -> SCHEDULED

View logs and metrics:
https://console.vessl.ai/runs/run-abc123

Useful CLI flags (if supported by your version)

  • --name – Override the job name:

    vessl run -f train.yaml --name cifar10-test-run
    
  • --env – Override or add env vars at submit time:

    vessl run -f train.yaml --env EPOCHS=30
    
  • --detach – Submit and return immediately, without streaming logs:

    vessl run -f train.yaml --detach
    

Check your current CLI help for the exact flag set:

vessl run --help

Step 4: Monitor and debug the run

VESSL is designed to minimize “babysitting” runs. Once submitted, you’ve got two main views:

1. Web Console

Open the run link from the CLI output, or navigate from the VESSL dashboard:

  • Real-time logs – stdout/stderr from your container.
  • Metrics & utilization – GPU, CPU, memory usage.
  • Status & events – Scheduling, provisioning, failover events if On-Demand or Reserved tiers are used.
  • Artifacts – Outputs written to configured storage.

This is where “fire-and-forget” becomes real: once it’s green, you can move on to the next experiment.

2. CLI

Use the CLI when you prefer terminal-first workflows:

  • List runs:
    vessl run list
    
  • Stream logs for a specific run:
    vessl run logs run-abc123
    
  • Describe a run:
    vessl run describe run-abc123
    

Step 5: Iterate on your YAML for different workloads

Once you’ve got a basic training run working, you can keep YAML variants that match different workload types and reliability needs.

Example: Cheap experimentation with Spot GPUs

Use Spot for large sweeps where preemption is acceptable:

resources:
  gpu:
    type: "A100"
    count: 4
  tier: "spot"

Pros:

  • Lower cost for batch experiments and non-critical runs.

Tradeoff:

  • Jobs can be preempted. Plan to resume from checkpoints.
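If you take the checkpoint-and-resume route, the training script needs a small bit of startup logic to pick up the latest checkpoint after a preemption. A sketch under assumptions (the epoch-NNNN file naming matches nothing VESSL-specific; it is just a convention your own save code would need to follow):

```python
import os
import re
from typing import Optional


def find_latest_checkpoint(ckpt_dir: str) -> Optional[str]:
    """Return the path of the highest-numbered epoch-NNNN checkpoint, if any."""
    if not os.path.isdir(ckpt_dir):
        return None
    best, best_epoch = None, -1
    for name in os.listdir(ckpt_dir):
        m = re.search(r"epoch-(\d+)", name)
        if m and int(m.group(1)) > best_epoch:
            best_epoch = int(m.group(1))
            best = os.path.join(ckpt_dir, name)
    return best


# At startup: resume if a checkpoint exists, otherwise start from epoch 0.
# latest = find_latest_checkpoint("/workspace/checkpoints")
# start_epoch = (restore(latest) + 1) if latest else 0  # restore() is yours
```

With this in place, a preempted Spot run restarted by the scheduler continues from its last save instead of from scratch.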

Example: Production training with automatic failover

For business-critical training where you don’t want region/provider outages to break your run, move to On-Demand with VESSL’s reliability primitives (like Auto Failover):

resources:
  gpu:
    type: "H100"
    count: 4
  tier: "on-demand"

On-Demand is best when:

  • You need reliable capacity with automatic failover across providers.
  • You want to keep service-level SLOs without manually rescheduling runs during outages.

Example: Reserved capacity for mission-critical jobs

For teams running recurring, heavy training cycles:

resources:
  gpu:
    type: "B200"
    count: 16
  tier: "reserved"

Reserved is ideal when:

  • You want guaranteed capacity on specific SKUs (e.g., H100/H200/B200).
  • You’re okay with a capacity commitment in exchange for discounts (up to ~40%) and dedicated support.

By switching the tier and gpu type in YAML, you can move the same workload from cheap experimentation to production-grade reliability without rewriting code.


Example: Full end-to-end workflow

Putting it all together for a typical LLM post-training scenario:

  1. Clone your repo and write training code (train.py).

  2. Create train.yaml for a single-GPU baseline:

    name: "llm-post-train-baseline"
    image: "nvcr.io/nvidia/pytorch:24.01-py3"
    resources:
      gpu:
        type: "A100"
        count: 1
      tier: "spot"
    workdir: "/workspace"
    command:
      - "bash"
      - "-lc"
      - "python train.py --model mixtral --epochs 3"
    
  3. Run it:

    vessl run -f train.yaml
    
  4. Refine the experiment, then scale:

    resources:
      gpu:
        type: "H100"
        count: 8
      tier: "on-demand"
    command:
      - "bash"
      - "-lc"
      - "python train.py --model mixtral --epochs 10 --global-batch-size 2048"
    
  5. Submit again and let VESSL orchestrate multi-GPU capacity across providers.

This is where the “GPU liquidity layer” shows up in practice: same YAML shape, different GPU SKU and reliability tier, no manual reshuffling of cloud accounts or quotas.


Tips for smoother runs on VESSL

  • Pin exact images – Use a specific tag (2.3.0-cuda12.1) instead of latest to avoid “it worked yesterday” issues.
  • Checkpoint frequently – Especially on Spot. Save to Cluster Storage or Object Storage so runs can resume.
  • Use environment variables for configs – They’re easier to tweak at submit time than hard-coding values.
  • Keep YAMLs small and focused – One YAML per job type (e.g., train.yaml, eval.yaml, inference.yaml) beats one giant spec for everything.
  • Leverage labels – Tag runs by experiment, dataset, or model size to keep your dashboard manageable.
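The env-var tip is simple to apply in code: read hyperparameters from the environment with sensible defaults, so an override at submit time changes behavior without editing the script. A sketch, with illustrative variable names (EPOCHS, BATCH_SIZE, LEARNING_RATE are not anything VESSL reserves):

```python
import os


def env_int(name: str, default: int) -> int:
    """Read an integer hyperparameter from the environment, falling back to a default."""
    return int(os.environ.get(name, default))


# Each of these can be overridden at submit time via the YAML `env` block
# (or an --env flag, where your CLI version supports it).
EPOCHS = env_int("EPOCHS", 90)
BATCH_SIZE = env_int("BATCH_SIZE", 256)
LEARNING_RATE = float(os.environ.get("LEARNING_RATE", "3e-4"))
```

This keeps one script serving many experiments: the YAML (or submit-time override) carries the config, and the code just reads it.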

From YAML to production: what changes?

Almost nothing. The control plane is the same:

  • Web Console for visual cluster management.
  • CLI (vessl run) for native workflows.
  • Auto Failover and Multi-Cluster for keeping workloads alive across providers and regions.
  • SOC 2 Type II / ISO 27001 for security and procurement readiness.
  • Published per-SKU pricing and Reserved discounts when you’re ready to lock in capacity.

You start with a YAML file and a single vessl run. From there, scaling from 1 GPU to 100, or from Spot to Reserved, is just a few lines changed in your spec.


Next Step

Get started with your first YAML-driven training run on VESSL AI and see how much “job wrangling” you can reclaim for actual experiment design.

Get Started