
How do I start a training run on VESSL AI using the CLI (vessl run) with a YAML file?
Most teams hit the same wall: you finally get access to GPUs, and then you lose days wiring up YAML, Docker, and cluster configs just to launch a single training job. On VESSL AI, vessl run plus one YAML file is the escape hatch—you describe the run once, then fire-and-forget from your terminal.
This guide walks through exactly how to start a training run on VESSL AI using the CLI (vessl run) with a YAML file, from first install to a working example.
Prerequisites
Before you run anything:
- A VESSL AI account
- Access to VESSL Cloud (or your VESSL-managed clusters)
- Python 3.8+ (recommended for CLI install)
- Docker image with your training code (optional but typical)
If you’re blocked by cloud quotas, waitlists, or missing GPU SKUs (A100/H100/H200/B200-class), that’s exactly the problem VESSL is solving. You’ll access those GPUs through one CLI and one YAML file instead of juggling multiple providers.
Step 1 – Install and configure the VESSL CLI
Install the CLI
Using pip (the most common setup):
pip install vessl
Or with pipx (avoids polluting your global Python):
pipx install vessl
Verify installation:
vessl --version
You should see a version string instead of an error.
Authenticate the CLI
Log in to VESSL AI from your terminal:
vessl login
This typically opens a browser window or prompts you for an access token. Once authenticated, your CLI sessions can start and monitor runs against your VESSL workspace.
Step 2 – Understand the vessl run + YAML pattern
On VESSL, you don’t hand-write long shell commands for every experiment. You define a run specification once in a YAML file, then trigger that spec with:
vessl run -f path/to/your-run.yaml
The YAML file is your contract:
- What image to use
- Which GPUs (A100/H100/H200/B200/GB200/B300, etc.)
- How many nodes
- What command to execute
- Volumes, datasets, environment variables
- Retry, preemption, and other runtime policies
vessl run reads the file, schedules the job across VESSL’s unified GPU pool, and gives you a handle for monitoring the run and streaming its logs, so you don't have to juggle providers or clusters by hand.
Step 3 – Create a minimal training YAML file
Start with a simple example. Create train.yaml in your project root.
Below is a generic pattern you can adapt; exact fields may differ based on your cluster setup and the VESSL version you’re using, but this shows the typical shape:
name: my-first-training-run
description: "Example: train ResNet on CIFAR-10 using VESSL CLI"
# Compute & environment
image: my-docker-user/my-training-image:latest  # Your Docker image
hardware:
  gpu:
    type: "A100"  # Or H100/H200/B200/etc.
    count: 1
  cpu:
    count: 8
  memory:
    size_gb: 32
# Optional: choose reliability / cost mode conceptually
# mode: "spot" | "on-demand" | "reserved"
# Code & command
working_dir: /workspace/project
command:
  - bash
  - -lc
  - |
    python train.py \
      --epochs 50 \
      --batch-size 256 \
      --lr 0.1
# Environment variables
env:
  - name: DATA_DIR
    value: /workspace/data
  - name: OUTPUT_DIR
    value: /workspace/output
# Storage (examples)
volumes:
  - name: project-code
    mount_path: /workspace/project
    # Implementation detail depends on your VESSL storage setup
  - name: project-output
    mount_path: /workspace/output
# Logging & monitoring options (example)
logs:
  stdout: true
  stderr: true
Key ideas:
- image – Where your training code lives. Build once, then reuse.
- hardware – Declare GPU/CPU/memory; VESSL finds capacity across providers.
- command – The exact shell command that runs inside the container.
- env, volumes – Bind configuration and storage without touching the command.
Once this is in place, you don’t need to remember long shell flags. You update the YAML and re-run.
Step 4 – Run the training job with vessl run
From the directory containing train.yaml:
vessl run -f train.yaml
The CLI will:
- Validate your YAML.
- Submit the run to VESSL’s control plane.
- Print a run ID and status link (e.g., a URL to the Web Console).
Output looks roughly like this (illustrative; exact wording varies by CLI version):
Submitting run from train.yaml...
Run submitted successfully.
Run ID: run-1234567890abcdef
View logs: https://console.vessl.ai/runs/run-1234567890abcdef
Now you have two control paths:
- Web Console – Visual cluster and run monitoring, no extra setup.
- CLI – Native tracking from your terminal or scripts.
Step 5 – Monitor and manage the run from the CLI
Use the run ID from submission.
Check run status
vessl run status run-1234567890abcdef
You’ll see states like PENDING, RUNNING, SUCCEEDED, or FAILED.
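If a script needs to block until the run finishes, a small polling loop is usually enough. The sketch below is one possible shape, not a VESSL API: wait_for_run and its parameters are hypothetical names, the terminal states are the ones listed in this guide, and get_status is whatever callable you write to fetch the current state (for example, by wrapping the status command and parsing its output, whose format is not specified here).

```python
import time

# Terminal states as listed in this guide; any other state means "keep waiting".
TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def wait_for_run(get_status, poll_seconds=30.0, timeout_seconds=3600.0, sleep=time.sleep):
    """Poll get_status() until it returns a terminal state or the timeout passes.

    get_status is any zero-argument callable that returns a state string.
    sleep is injectable so the loop can be exercised without real waiting.
    """
    waited = 0.0
    while True:
        state = get_status()
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_seconds:
            raise TimeoutError(f"run still {state} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

Passing the status source in as a callable keeps the loop independent of how the CLI formats its output, so only that one wrapper needs updating if the format changes.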
Stream logs
vessl run logs run-1234567890abcdef
Use -f for follow (tail):
vessl run logs -f run-1234567890abcdef
This is where “fire-and-forget” shows up in practice: cluster management, GPU rebalancing, and multi-cloud details are handled for you. You just watch your training output.
Cancel a run
vessl run cancel run-1234567890abcdef
Useful when an experiment clearly isn’t working and you don’t want to burn more GPU hours.
Step 6 – Parameterize your YAML for fast iteration
You don’t want a new file for every learning rate or batch size. Instead, keep one base YAML and pass overrides at runtime.
Option A – Use CLI overrides (when available)
Many teams structure their YAML so that tunable values live in environment variables that the command reads (the same idea works with CLI arguments). A typical pattern:
In train.yaml:
env:
  - name: LR
    value: "0.1"
  - name: BATCH_SIZE
    value: "256"
command:
  - bash
  - -lc
  - |
    python train.py \
      --epochs 50 \
      --batch-size ${BATCH_SIZE} \
      --lr ${LR}
Then launch different runs by editing env values or templating your YAML with your own script.
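As a minimal sketch of the "template it with your own script" route, the snippet below renders one YAML text per hyperparameter combination using Python's standard string.Template. The lowercase placeholders ($lr, $batch_size), the function names, and the output filenames are all hypothetical choices for illustration; the YAML fragment mirrors the env/command pattern above and is not a complete run spec.

```python
from itertools import product
from pathlib import Path
from string import Template

# Hypothetical template: lowercase $placeholders are filled in by this script,
# while the uppercase ${LR} / ${BATCH_SIZE} are left for the shell to expand.
BASE_YAML = Template("""\
env:
  - name: LR
    value: "$lr"
  - name: BATCH_SIZE
    value: "$batch_size"
command:
  - bash
  - -lc
  - |
    python train.py \\
      --epochs 50 \\
      --batch-size ${BATCH_SIZE} \\
      --lr ${LR}
""")


def render_variants(learning_rates, batch_sizes):
    """Return {filename: yaml_text} for every (lr, batch size) combination."""
    variants = {}
    for lr, bs in product(learning_rates, batch_sizes):
        # safe_substitute ignores placeholders it has no value for, so the
        # shell-style ${LR} and ${BATCH_SIZE} pass through unchanged.
        variants[f"train-lr{lr}-bs{bs}.yaml"] = BASE_YAML.safe_substitute(
            lr=lr, batch_size=bs
        )
    return variants


def write_variants(variants, out_dir="."):
    """Write each rendered YAML to out_dir and return the paths written."""
    paths = []
    for filename, text in variants.items():
        path = Path(out_dir) / filename
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths
```

Calling write_variants(render_variants([0.1, 0.01], [128, 256])) would produce four files, each of which can then be passed to vessl run -f. Using safe_substitute rather than substitute matters here: the stricter method would raise on the shell variables in the command.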
Option B – Create variations per workload tier
Keep separate YAMLs per reliability/cost tier, while reusing most configuration:
- train-spot.yaml – For cheap, preemptible experimentation.
- train-ondemand.yaml – For baselines or nightly runs that must finish.
- train-reserved.yaml – For mission-critical production training with guaranteed capacity.
Each file primarily differs in mode and GPU count; you still start them all with vessl run -f <file>.
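If maintaining three near-identical files by hand gets tedious, the tier files can be derived from one base spec. The following is a sketch under assumptions: the mode field and the hardware.gpu.count layout are the illustrative ones from this guide's example YAML, the GPU counts per tier are made-up numbers, and deep_merge and spec_for_tier are hypothetical helper names.

```python
import copy

# Hypothetical per-tier overrides; the GPU counts are placeholders to adjust.
TIER_OVERRIDES = {
    "spot": {"mode": "spot", "hardware": {"gpu": {"count": 1}}},
    "ondemand": {"mode": "on-demand", "hardware": {"gpu": {"count": 4}}},
    "reserved": {"mode": "reserved", "hardware": {"gpu": {"count": 8}}},
}


def deep_merge(base, overrides):
    """Return a new dict: base with overrides applied recursively."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them instead of replacing.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def spec_for_tier(base_spec, tier):
    """Build the run spec for one tier without mutating base_spec."""
    return deep_merge(base_spec, TIER_OVERRIDES[tier])
```

With PyYAML installed, the base spec can be loaded with yaml.safe_load and each result written back with yaml.safe_dump(spec, fh, sort_keys=False), giving train-spot.yaml, train-ondemand.yaml, and train-reserved.yaml from a single source of truth.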
Step 7 – Map YAML to workload and GPU strategy
The power of vessl run with YAML is that it lets you match your GPU strategy directly to each workload type:
- Exploratory training & sweeps
  - YAML: small gpu.count, mode: spot
  - Goal: maximize experiments per dollar, accept preemptions.
- Core model training & baselines
  - YAML: mode: on-demand, explicit GPU SKUs (e.g., H100), maybe multi-node
  - Goal: jobs finish, with automatic failover across providers/regions if something goes down.
- Production / mission-critical runs
  - YAML: mode: reserved, pinned GPU counts
  - Goal: guaranteed capacity backed by VESSL support and SLAs.
You don’t re-engineer code to move between these. You adjust YAML and keep using vessl run.
Step 8 – Integrate vessl run into your existing workflows
Once you’re comfortable running a single training job, it’s easy to plug VESSL into your tooling.
Shell scripts
#!/usr/bin/env bash
set -euo pipefail
CONFIG=${1:-train.yaml}
vessl run -f "$CONFIG"
CI/CD pipelines
- Add a step that checks out your repo.
- Build/push a Docker image.
- Run vessl run -f train.yaml as the training step.
- Use the run ID to gate further steps (e.g., evaluation, deployment).
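To gate later pipeline steps on the run ID, a CI job first has to capture it from the submission output. Here is a rough sketch of that step. It assumes the CLI prints a line of the form "Run ID: <id>", as in the sample output in this guide, and it invokes the command exactly as this guide writes it; extract_run_id and submit are hypothetical helper names, and the pattern should be adjusted to whatever your CLI version actually prints.

```python
import re
import subprocess

# Assumption: submission output contains a line like "Run ID: run-...".
RUN_ID_PATTERN = re.compile(r"^Run ID:\s*(\S+)\s*$", re.MULTILINE)


def extract_run_id(cli_output):
    """Return the run ID found in submission output, or None if absent."""
    match = RUN_ID_PATTERN.search(cli_output)
    return match.group(1) if match else None


def submit(config_path):
    """Submit a run and return its ID; raises if submission fails."""
    result = subprocess.run(
        ["vessl", "run", "-f", config_path],  # command as used in this guide
        capture_output=True,
        text=True,
        check=True,  # non-zero exit raises CalledProcessError, failing the CI step
    )
    run_id = extract_run_id(result.stdout)
    if run_id is None:
        raise RuntimeError("no run ID found in CLI output")
    return run_id
```

Failing loudly when no ID is found is deliberate: a pipeline that continues without a run ID would evaluate or deploy against nothing.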
Research workflows
For labs and research groups (like BAIR, MIT, etc.) already on VESSL, this is how you replace “job wrangling” with a single command, letting students and researchers trigger long training jobs that survive provider issues without babysitting the cluster.
Common issues and how to avoid them
YAML parse errors
- Use spaces, not tabs.
- Validate with a linter, or run python -c 'import yaml,sys; yaml.safe_load(sys.stdin)' < train.yaml if you prefer (requires PyYAML).
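Tabs are easy to miss by eye, so a tiny check can point at the offending lines before a parser complains. This is a minimal sketch using only the standard library; find_tab_indented_lines is a hypothetical helper name, and it only looks at leading whitespace, since tabs inside values are not an indentation problem.

```python
def find_tab_indented_lines(yaml_text):
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    bad = []
    for number, line in enumerate(yaml_text.splitlines(), start=1):
        # Slice off everything after the leading run of spaces and tabs.
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if "\t" in indent:
            bad.append(number)
    return bad
```

For example, find_tab_indented_lines(open("train.yaml", encoding="utf-8").read()) returns an empty list for a file indented purely with spaces.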
Image pull failures
- Make sure image is public or your registry credentials are configured with VESSL.
- Confirm the tag exists in the registry by pulling it yourself: docker pull my-docker-user/my-training-image:latest.
Insufficient GPU capacity
- Adjust gpu.type to a supported SKU in your region.
- For large multi-GPU runs, talk to VESSL about Reserved capacity so your YAML translates into guaranteed slots.
Command exits immediately
- Double-check working_dir and command.
- Ensure your training script path is correct inside the container.
Putting it all together
Launching training on VESSL AI with vessl run and a YAML file follows a simple pattern:
- Install and log in to the VESSL CLI.
- Describe your run once in a YAML spec (image, hardware, command, env, storage).
- Start the job with vessl run -f train.yaml.
- Monitor, cancel, and iterate directly from your terminal.
- Evolve the YAML to match workload tiers: Spot for cheap experiments, On-Demand for reliable baselines, Reserved for guaranteed production capacity.
Instead of fighting cloud quotas and cluster quirks, you get a single control surface—one CLI, one YAML—to run training across multi-cloud GPUs.