How do I start a training run on VESSL AI using the CLI (vessl run) with a YAML file?

Most teams hit the same wall: you finally get access to H100s, but every run needs manual setup, cloud console clicks, and debugging before you can even start training. The whole point of vessl run is to flip that around—declare your job once in YAML, then launch it on VESSL Cloud from any terminal in a single command.

Below is a step-by-step walkthrough of how to start a training run on VESSL AI using the CLI (vessl run) with a YAML file, plus a reference template you can adapt for your own workloads.


Why use vessl run with YAML?

Running from YAML gives you:

  • Repeatability – Same environment, same GPUs, same volumes every time.
  • Versioning – Store job definitions next to your code in Git.
  • Portability – Move from 1 to 100 GPUs or between providers without rewriting scripts.
  • Less job wrangling – Declare what you want and let VESSL handle orchestration, monitoring, and failover.

You describe your job once. VESSL’s control plane takes care of provisioning GPUs across providers, wiring in storage, and exposing logs and metrics in the Web Console.


Prerequisites

Before you run your first training job from a YAML file, make sure you have:

  1. A VESSL AI account

  2. VESSL CLI installed

    • Install it (typically via pip with pip install vessl) or follow the install instructions from your onboarding or docs.
    • Confirm it’s working:
      vessl --version
      
  3. CLI authenticated to your VESSL workspace

    • Log in:
      vessl login
      
    • This opens a browser window or prompts for a token so the CLI can use your VESSL identity and project context.
  4. A project/workspace selected

    • List available projects:
      vessl project list
      
    • Set the active project (if supported by your version):
      vessl project use <PROJECT_NAME_OR_ID>
      

Once this is done, you’re ready to define your training run in YAML.


Step 1: Create a YAML spec for your training run

The YAML file is a declarative spec for your run: container image, command, GPUs, storage, environment variables, and more.

Below is a minimal but realistic example for a PyTorch training job using an H100-class GPU.

# train.yaml
name: "pytorch-cifar10-train"

# Container environment
image: "pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime"

# Resources & scheduling
resources:
  # GPU SKU and count (examples: A100, H100, H200, B200, GB200, B300)
  gpu:
    type: "H100"
    count: 1

  # Optional: CPU and memory (implementation-specific)
  cpu: "8"
  memory: "64Gi"

  # Reliability tier: spot / on-demand / reserved (depends on your VESSL plan)
  tier: "on-demand"

# Code + entrypoint
command:
  - "bash"
  - "-lc"
  - |
    python train.py \
      --dataset cifar10 \
      --batch-size 256 \
      --epochs 90

# Working directory inside the container
workdir: "/workspace"

# Mount your project files (e.g., synced from Git, object storage, or a volume)
volumes:
  - name: "project"
    mountPath: "/workspace"
    # Backend details depend on your setup:
    # type: "cluster-storage" or "object-storage"
    # path: "your-dataset-or-repo-path"

# Environment variables (for hyperparameters, tokens, etc.)
env:
  - name: "CUDA_VISIBLE_DEVICES"
    value: "0"
  - name: "EXPERIMENT_NAME"
    value: "cifar10-h100-baseline"

This spec gives VESSL enough information to:

  • Provision an H100 GPU in your selected region/provider.
  • Start a PyTorch container.
  • Mount your code and data into /workspace.
  • Run python train.py ... as your training entrypoint.
  • Pipe logs and metrics back to the Web Console and CLI.
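The entrypoint in the spec only works if train.py actually parses the flags the YAML command passes in. A minimal sketch of such a script is below; the flag names (--dataset, --batch-size, --epochs) are assumptions that must match whatever your YAML command supplies, and the training loop itself is elided:

```python
# train.py -- minimal entrypoint sketch matching the YAML `command` above.
import argparse


def parse_args(argv=None):
    """Parse the CLI flags the YAML command passes to this script."""
    parser = argparse.ArgumentParser(description="Toy training entrypoint")
    parser.add_argument("--dataset", default="cifar10")
    parser.add_argument("--batch-size", type=int, default=256)
    parser.add_argument("--epochs", type=int, default=90)
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    print(f"Training on {args.dataset}: {args.epochs} epochs, "
          f"batch size {args.batch_size}")
    # ... build the model and dataloaders, then run the training loop ...


if __name__ == "__main__":
    main()
```

Keeping defaults in the script means the same file runs locally (python train.py) and on VESSL with the YAML-supplied overrides.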

Common YAML fields you’ll likely use

Depending on your workload, you can extend the spec with:

  • Multi-GPU / multi-node

    resources:
      gpu:
        type: "A100"
        count: 8
      tier: "spot"
    

    Scale a single job to 8 GPUs for faster LLM post-training or vision model training.

  • Job name and labels

    name: "llm-finetune-mixtral"
    labels:
      project: "chatbot"
      stage: "post-train"
    
  • Artifacts / outputs

    outputs:
      - name: "checkpoints"
        path: "/workspace/checkpoints"
        type: "object-storage"
        bucket: "my-bucket"
    

VESSL handles the plumbing so you don’t need to stitch together storage, logs, or GPU scheduling manually.
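On the training-code side, the outputs mount only helps if your script actually writes checkpoints under the configured path. A hedged sketch follows: the /workspace/checkpoints default mirrors the YAML example above, the CHECKPOINT_DIR variable name is illustrative, and JSON stands in for torch.save:

```python
import json
import os
from typing import Optional

# Default mirrors the `outputs` mount path in the YAML above; override via the
# CHECKPOINT_DIR env var or the ckpt_dir argument (both names are illustrative).
DEFAULT_CKPT_DIR = "/workspace/checkpoints"


def save_checkpoint(state: dict, epoch: int, ckpt_dir: Optional[str] = None) -> str:
    """Write a tiny JSON checkpoint; swap json.dump for torch.save in real training."""
    ckpt_dir = ckpt_dir or os.environ.get("CHECKPOINT_DIR", DEFAULT_CKPT_DIR)
    os.makedirs(ckpt_dir, exist_ok=True)
    path = os.path.join(ckpt_dir, f"epoch-{epoch:04d}.json")
    with open(path, "w") as f:
        json.dump({"epoch": epoch, **state}, f)
    return path
```

Anything written under the configured path then lands in the artifacts view without extra upload code on your side.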


Step 2: Place the YAML file in your project

Put train.yaml at the root of your repository or in a vessl/ folder—just keep it somewhere easy to reference:

your-project/
  train.py
  requirements.txt
  train.yaml

You can commit this YAML into Git so everyone on your team can invoke the same spec without copy-paste configs.


Step 3: Run the training job with vessl run

Once your YAML is ready:

# From your project directory
vessl run -f train.yaml

Typical behavior:

  • The CLI validates your YAML.
  • VESSL schedules the job on the requested GPU SKU (e.g., H100, A100, B200).
  • A run ID is returned, and you can open the Web Console to follow logs and metrics.

Example output (structure may differ slightly):

$ vessl run -f train.yaml
Submitting run "pytorch-cifar10-train"...
Run ID: run-abc123
Status: PENDING -> SCHEDULED

View logs and metrics:
https://console.vessl.ai/runs/run-abc123

Useful CLI flags (if supported by your version)

  • --name – Override the job name:

    vessl run -f train.yaml --name cifar10-test-run
    
  • --env – Override or add env vars at submit time:

    vessl run -f train.yaml --env EPOCHS=30
    
  • --detach – Submit and return immediately, without streaming logs:

    vessl run -f train.yaml --detach
    

Check your current CLI help for the exact flag set:

vessl run --help

Step 4: Monitor and debug the run

VESSL is designed to minimize “babysitting” runs. Once submitted, you’ve got two main views:

1. Web Console

Open the run link from the CLI output, or navigate from the VESSL dashboard:

  • Real-time logs – stdout/stderr from your container.
  • Metrics & utilization – GPU, CPU, memory usage.
  • Status & events – Scheduling, provisioning, failover events if On-Demand or Reserved tiers are used.
  • Artifacts – Outputs written to configured storage.

This is where “fire-and-forget” becomes real: once it’s green, you can move on to the next experiment.

2. CLI

Use the CLI when you prefer terminal-first workflows:

  • List runs:
    vessl run list
    
  • Stream logs for a specific run:
    vessl run logs run-abc123
    
  • Describe a run:
    vessl run describe run-abc123
    

Step 5: Iterate on your YAML for different workloads

Once you’ve got a basic training run working, you can keep YAML variants that match different workload types and reliability needs.

Example: Cheap experimentation with Spot GPUs

Use Spot for large sweeps where preemption is acceptable:

resources:
  gpu:
    type: "A100"
    count: 4
  tier: "spot"

Pros:

  • Lower cost for batch experiments and non-critical runs.

Tradeoff:

  • Jobs can be preempted. Plan to resume from checkpoints.
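If you take the checkpoint-and-resume route, the training script needs a small bit of startup logic to pick up the latest checkpoint after a preemption. A sketch under assumptions (the epoch-NNNN file naming matches nothing VESSL-specific; it is just a convention your own save code would need to follow):

```python
import os
import re
from typing import Optional


def find_latest_checkpoint(ckpt_dir: str) -> Optional[str]:
    """Return the path of the highest-numbered epoch-NNNN checkpoint, if any."""
    if not os.path.isdir(ckpt_dir):
        return None
    best, best_epoch = None, -1
    for name in os.listdir(ckpt_dir):
        m = re.search(r"epoch-(\d+)", name)
        if m and int(m.group(1)) > best_epoch:
            best_epoch = int(m.group(1))
            best = os.path.join(ckpt_dir, name)
    return best


# At startup: resume if a checkpoint exists, otherwise start from epoch 0.
# latest = find_latest_checkpoint("/workspace/checkpoints")
# start_epoch = (restore(latest) + 1) if latest else 0  # restore() is yours
```

With this in place, a preempted Spot run restarted by the scheduler continues from its last save instead of from scratch.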

Example: Production training with automatic failover

For business-critical training where you don’t want region/provider outages to break your run, move to On-Demand with VESSL’s reliability primitives (like Auto Failover):

resources:
  gpu:
    type: "H100"
    count: 4
  tier: "on-demand"

On-Demand is best when:

  • You need reliable capacity with automatic failover across providers.
  • You want to keep service-level SLOs without manually rescheduling runs during outages.

Example: Reserved capacity for mission-critical jobs

For teams running recurring, heavy training cycles:

resources:
  gpu:
    type: "B200"
    count: 16
  tier: "reserved"

Reserved is ideal when:

  • You want guaranteed capacity on specific SKUs (e.g., H100/H200/B200).
  • You’re okay with a capacity commitment in exchange for discounts (up to ~40%) and dedicated support.

By switching the tier and gpu type in YAML, you can move the same workload from cheap experimentation to production-grade reliability without rewriting code.


Example: Full end-to-end workflow

Putting it all together for a typical LLM post-training scenario:

  1. Clone your repo and write training code (train.py).

  2. Create train.yaml for a single-GPU baseline:

    name: "llm-post-train-baseline"
    image: "nvcr.io/nvidia/pytorch:24.01-py3"
    resources:
      gpu:
        type: "A100"
        count: 1
      tier: "spot"
    workdir: "/workspace"
    command:
      - "bash"
      - "-lc"
      - "python train.py --model mixtral --epochs 3"
    
  3. Run it:

    vessl run -f train.yaml
    
  4. Refine the experiment, then scale:

    resources:
      gpu:
        type: "H100"
        count: 8
      tier: "on-demand"
    command:
      - "bash"
      - "-lc"
      - "python train.py --model mixtral --epochs 10 --global-batch-size 2048"
    
  5. Submit again and let VESSL orchestrate multi-GPU capacity across providers.

This is where the “GPU liquidity layer” shows up in practice: same YAML shape, different GPU SKU and reliability tier, no manual reshuffling of cloud accounts or quotas.


Tips for smoother runs on VESSL

  • Pin exact images – Use a specific tag (2.3.0-cuda12.1) instead of latest to avoid “it worked yesterday” issues.
  • Checkpoint frequently – Especially on Spot. Save to Cluster Storage or Object Storage so runs can resume.
  • Use environment variables for configs – They’re easier to tweak at submit time than hard-coding values.
  • Keep YAMLs small and focused – One YAML per job type (e.g., train.yaml, eval.yaml, inference.yaml) beats one giant spec for everything.
  • Leverage labels – Tag runs by experiment, dataset, or model size to keep your dashboard manageable.
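The env-var tip is simple to apply in code: read hyperparameters from the environment with sensible defaults, so an override at submit time changes behavior without editing the script. A sketch, with illustrative variable names (EPOCHS, BATCH_SIZE, LEARNING_RATE are not anything VESSL reserves):

```python
import os


def env_int(name: str, default: int) -> int:
    """Read an integer hyperparameter from the environment, falling back to a default."""
    return int(os.environ.get(name, default))


# Each of these can be overridden at submit time via the YAML `env` block
# (or an --env flag, where your CLI version supports it).
EPOCHS = env_int("EPOCHS", 90)
BATCH_SIZE = env_int("BATCH_SIZE", 256)
LEARNING_RATE = float(os.environ.get("LEARNING_RATE", "3e-4"))
```

This keeps one script serving many experiments: the YAML (or submit-time override) carries the config, and the code just reads it.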

From YAML to production: what changes?

Almost nothing. The control plane is the same:

  • Web Console for visual cluster management.
  • CLI (vessl run) for native workflows.
  • Auto Failover and Multi-Cluster for keeping workloads alive across providers and regions.
  • SOC 2 Type II / ISO 27001 for security and procurement readiness.
  • Published per-SKU pricing and Reserved discounts when you’re ready to lock in capacity.

You start with a YAML file and a single vessl run. From there, scaling from 1 GPU to 100, or from Spot to Reserved, is just a few lines changed in your spec.


Next Step

Get started with your first YAML-driven training run on VESSL AI and see how much “job wrangling” you can reclaim for actual experiment design.

Get Started