How do I mount S3/object storage or a GitHub repo into a VESSL AI run or workspace?

Most teams hit the same wall on day one with VESSL Cloud: you can boot A100s in minutes, but your data and code still live in S3 buckets or GitHub. Until those are mounted cleanly into a run or workspace, you’re stuck copying tarballs and debugging paths instead of training.

This guide walks through how to mount S3/object storage and GitHub repos into a VESSL AI run or workspace using patterns that scale beyond a single experiment. The examples assume you’re using the vessl CLI and the Web Console, and that you care about the same things I do: fewer moving parts, fewer “where did my data go?” questions, and more fire-and-forget jobs.


At-a-Glance: Ways to Bring External Data & Code into VESSL

There are three common patterns:

  • Object storage → Mounted path
    • Use S3 or compatible object stores (AWS S3, GCS with S3 API, MinIO, etc.).
    • Mount into your run at a known path (/mnt/data, /datasets, etc.) at container start.
  • GitHub repo → Working directory
    • Pull your repo into the run or workspace, either at container startup or via dev container config.
  • Persistent VESSL storage → Reusable across runs
    • Use Cluster Storage or Object Storage inside VESSL as a caching layer so you’re not re-pulling 1 TB from S3 every time.

The right approach depends on:

  • Dataset size and update frequency
  • Whether you’re in a one-off run or a long-lived workspace
  • How much you care about startup time vs. keeping the source of truth in S3/Git

Below I’ll break down each option, with concrete workflows, pros/cons, and when to pick which.


Comparison: S3 vs. GitHub vs. VESSL Storage

| Rank | Option | Best For | Primary Strength | Watch Out For |
|------|--------|----------|------------------|---------------|
| 1 | S3/Object Storage Mount | Large datasets and artifacts used across many runs | Scales with data size; cloud-native; flexible | Needs credentials and IAM hygiene |
| 2 | GitHub Repo Mount/Clone | Codebases, configs, lightweight assets (<10–20 GB) | Simple dev flow; fits GitOps and CI/CD patterns | Git is slow/fragile for very large binaries |
| 3 | VESSL Cluster/Object Storage | Cached datasets and shared project storage inside VESSL | Fast access inside clusters; share across users | Initial sync from external S3/Git still required |

We’ll cover all three, but the question most people really have is: How do I wire S3 and GitHub into a vessl run or workspace without turning my setup into a bash script farm?


How we’ll structure the answer

To match how you actually work, I’ll split by context:

  1. Mounting S3/object storage into a one-off run (vessl run)
  2. Mounting S3/object storage into a persistent workspace
  3. Pulling a GitHub repo into a run
  4. Pulling a GitHub repo into a workspace
  5. Hardening and scaling: credentials, performance, and repeatability

Where relevant, I’ll also show how VESSL’s own Object Storage and Cluster Storage help you stop hitting external S3 every time.


1. Mount S3/Object Storage into a VESSL Run

You have two main strategies:

  1. Treat S3 as a remote filesystem (mount-like behavior via tools like s3fs, goofys, rclone, or a simple sync).
  2. Use VESSL’s Object Storage / Cluster Storage as a caching layer and sync from S3 once per dataset, not per run.

1.1. Pattern A: Sync from S3 at Run Start

This is the simplest, most robust pattern. You don’t literally “mount” S3; you sync data into the container at startup, run your job, then (optionally) sync results back to S3.

Steps:

  1. Wire credentials into the run

Use environment variables or a secret mechanism (recommended):

vessl run \
  --name s3-sync-example \
  --env AWS_ACCESS_KEY_ID=... \
  --env AWS_SECRET_ACCESS_KEY=... \
  --env AWS_DEFAULT_REGION=us-east-1 \
  ...

Or point to an IAM role if your provider supports it.

  2. Install the AWS CLI or your object store's client in the image

In your Dockerfile:

RUN pip install awscli

  3. Sync at startup

Create a small startup script, bootstrap.sh, and include it in your image so it is present in the container's working directory:

#!/usr/bin/env bash
set -euo pipefail

echo "[bootstrap] Syncing S3 dataset..."
aws s3 sync s3://my-bucket/path/to/dataset /mnt/dataset

echo "[bootstrap] Done. Starting training..."
exec python train.py --data_dir /mnt/dataset

Make it executable:

chmod +x bootstrap.sh

Then run it via vessl run:

vessl run \
  --name s3-sync-run \
  --image your-image:latest \
  --command "bash bootstrap.sh"

Pros

  • Simple, no FUSE layers, behaves like local files.
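  • Works with any S3-compatible store (AWS S3, MinIO, GCS through its S3-compatible API); for non-AWS stores, point the AWS CLI at the right endpoint with --endpoint-url.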

Cons

  • Every run pulls the dataset unless you pair it with a cache (see Cluster Storage below).
  • If you’re pulling 1 TB, startup will be slow.
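
The optional sync-back step can be sketched as a small helper added to bootstrap.sh. This is only an illustration under stated assumptions: the /mnt/results directory, the results prefix in the bucket, and the --output_dir flag on train.py are hypothetical names, not part of the original setup.

```shell
#!/usr/bin/env bash
# Sketch: upload training outputs to S3 after the job finishes.
# Hypothetical names: /mnt/results, s3://my-bucket/path/to/results, --output_dir.

upload_results() {
  local src="$1" dest_uri="$2"
  # Skip quietly if training never produced an output directory.
  if [ ! -d "$src" ]; then
    echo "[bootstrap] no results at $src, skipping upload" >&2
    return 0
  fi
  aws s3 sync "$src" "$dest_uri"
}

# Usage in bootstrap.sh, replacing the final `exec python ...` line so the
# script keeps running after training and still reports the training exit code:
#
#   status=0
#   python train.py --data_dir /mnt/dataset --output_dir /mnt/results || status=$?
#   upload_results /mnt/results s3://my-bucket/path/to/results
#   exit "$status"
```

Capturing the exit code before uploading means a failed training run still uploads its partial outputs for debugging, while the run itself is reported as failed.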

1.2. Pattern B: Use VESSL Cluster/Object Storage as a Cache

If you’re reusing the same dataset across many runs or users:

  1. First run: sync from S3 into VESSL storage
    • Mount a Cluster Storage volume (fast, shared file-like storage inside the cluster) into your run at /cluster-data.
    • Sync S3 → /cluster-data/datasets/imagenet.
  2. Subsequent runs: skip S3 and mount only the Cluster Storage volume

Inside VESSL, this looks like:

  • Cluster Storage

    • Best for: shared training data, fast I/O, collaborative jobs.
    • Behaves like a shared NFS-style filesystem across runs and users.
  • Object Storage

    • Best for: lower-cost artifacts, checkpoints, logs.
    • Think of it as S3 inside VESSL.

Operationally: you still use the method from 1.1 to ingest data once, but all future runs mount the VESSL-managed volume, so you’re not re-hammering external S3.
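
One way to make the "ingest once" step automatic is a marker file: the first run syncs and writes the marker, and later runs see it and skip S3. The sketch below assumes the Cluster Storage volume is already mounted at /cluster-data; the .sync-complete file name is a hypothetical convention, not something VESSL defines.

```shell
#!/usr/bin/env bash
# Sketch: sync a dataset from S3 into a mounted cache volume only once.
# Assumes a Cluster Storage volume is already mounted at /cluster-data.
# The .sync-complete marker file is a hypothetical convention.

ensure_dataset() {
  local src_uri="$1" dest="$2"
  local marker="$dest/.sync-complete"

  if [ -f "$marker" ]; then
    echo "[cache] hit: $dest" >&2
    return 0
  fi

  echo "[cache] miss: syncing $src_uri -> $dest" >&2
  mkdir -p "$dest"
  # Write the marker only after a successful sync, so an interrupted
  # transfer is retried on the next run instead of being trusted.
  aws s3 sync "$src_uri" "$dest" && touch "$marker"
}

# Usage:
#   ensure_dataset s3://my-bucket/path/to/dataset /cluster-data/datasets/imagenet
```

If two runs start at the same moment on an empty cache, both will sync. That is wasteful but safe, since aws s3 sync only copies files that are missing or changed.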


2. Mount S3/Object Storage into a Workspace

A workspace is where you live day-to-day: notebooks, editors, interactive debugging. You want your S3 buckets to appear at known paths every time the workspace boots.

You’ll usually follow the same patterns as runs, but with some tweaks:

2.1. Bootstrapping S3 in a workspace

Use a startup script or devcontainer config that:

  1. Exposes credentials (env vars / secrets).
  2. Installs your object storage client.
  3. Syncs or mounts S3 into a path like /home/vessl/data.

Example workspace-bootstrap.sh:

#!/usr/bin/env bash
set -euo pipefail

mkdir -p /home/vessl/data

# One-time sync or periodic update
aws s3 sync s3://my-bucket/project-data /home/vessl/data

# Optional: sync models/checkpoints too
aws s3 sync s3://my-bucket/models /home/vessl/models

# Keep the workspace alive with your usual process (e.g., Jupyter).
# Token auth is left enabled: with --ip=0.0.0.0, an empty token would expose the server to anyone who can reach it.
exec jupyter lab --ip=0.0.0.0 --no-browser

Configure your workspace to run this script at start (via the Web Console or your dev container config).

2.2. Choosing between sync vs. FUSE “mount”

For workspaces, it’s tempting to use s3fs or rclone mount to make S3 look like a live filesystem. That works, but:

  • FUSE mounts can break under high concurrency.
  • Latency is non-trivial; every ls hits S3.
  • If the mount dies mid-session, your notebook sees weird errors.

What I recommend:

  • Small/medium datasets or code: aws s3 sync into local or Cluster Storage.
  • Huge archives you rarely touch: consider a read-only mount, but only if your access pattern is truly sparse.

Example with rclone mount in a workspace:

rclone config create myremote s3 provider=AWS env_auth=true
mkdir -p /home/vessl/s3
rclone mount myremote:my-bucket /home/vessl/s3 --vfs-cache-mode full &

Then work with /home/vessl/s3 as a mount-like path. Just know it’s still remote.
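
Because a FUSE mount can be missing or dead without any obvious error, it may help to check the mount explicitly before starting work. The helper below is a minimal sketch that assumes the mountpoint utility (from util-linux) is installed in the workspace image; the function name and the 30-second default are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: wait for a FUSE mount to appear, and fail loudly if it does not.
# Uses `mountpoint` from util-linux (assumed to be installed in the image).

wait_for_mount() {
  local path="$1" timeout_s="${2:-30}"
  local waited=0
  until mountpoint -q "$path"; do
    if [ "$waited" -ge "$timeout_s" ]; then
      echo "[mount] $path is not mounted after ${timeout_s}s" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
}

# Usage, right after starting rclone:
#   wait_for_mount /home/vessl/s3 30 || exit 1
```

The same function can be called later in a session as a quick health check when a notebook starts throwing I/O errors.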


3. Bring a GitHub Repo into a VESSL Run

For code, GitHub is the source of truth. The question is: clone at build time (in the image) or at run time?

3.1. Option A: Bake the repo into the image (best for stable code)

In your Dockerfile:

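# REPO_REF is a hypothetical build argument; pass a commit SHA or tag to pin the code
ARG REPO_REF=main
WORKDIR /app
RUN git clone https://github.com/your-org/your-repo.git . \
    && git checkout "$REPO_REF" \
    && pip install -r requirements.txt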

Then:

vessl run \
  --name train-from-github-image \
  --image your-image:with-repo \
  --command "python train.py"

Good when:

  • Code doesn’t change hourly.
  • You want reproducible runs bound to a specific commit.

3.2. Option B: Clone the repo at run start (best for active development)

Use a startup script:

#!/usr/bin/env bash
set -euo pipefail

git clone https://github.com/your-org/your-repo.git /workspace
cd /workspace

# Optional: switch branch or commit
git checkout main

# Install deps
pip install -r requirements.txt

# Start your job
exec python train.py

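Run it via the following, assuming the script has been made available in the container at /root/bootstrap.sh (the stock python:3.11 image does not include it):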

vessl run \
  --name github-run-latest \
  --image python:3.11 \
  --command "bash /root/bootstrap.sh"

For private repos, inject a token:

vessl run \
  --env GITHUB_TOKEN=ghp_... \
  ...

And use it in the clone URL (note that git stores this URL, token included, in the clone's .git/config):

git clone https://$GITHUB_TOKEN@github.com/your-org/your-private-repo.git /workspace
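
If you would rather keep the token out of .git/config, one alternative is to pass it through an inline git credential helper. This is a sketch of a different technique from the URL form above, not a VESSL feature. It assumes GITHUB_TOKEN is exported in the environment; the function name is hypothetical, and the x-access-token username is a placeholder assumption, since the token is what is sent as the password.

```shell
#!/usr/bin/env bash
# Sketch: clone over HTTPS with a token supplied by an inline credential
# helper, so the token is not written into the clone's .git/config.
# Assumes GITHUB_TOKEN is exported in the environment.

clone_with_token() {
  local repo_url="$1" dest="$2"
  # `git -c` applies configuration to this single command only.
  # The first (empty) helper entry clears any helpers configured elsewhere;
  # the second runs a tiny shell function whose stdout git reads as credentials.
  # "x-access-token" is a placeholder username (an assumption).
  git -c credential.helper= \
      -c credential.helper='!f() { echo "username=x-access-token"; echo "password=${GITHUB_TOKEN}"; }; f' \
      clone "$repo_url" "$dest"
}

# Usage:
#   clone_with_token https://github.com/your-org/your-private-repo.git /workspace
```

Later git fetch or git pull commands in that clone need the same -c options (or another credential mechanism), because nothing was persisted.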

4. Bring a GitHub Repo into a Workspace

Workspaces are where developers want their normal Git flow:

  • git status
  • git pull
  • git commit

You have two solid options.

4.1. Workspace image with pre-cloned repo

If you’re fine with workspaces starting from a specific commit:

WORKDIR /workspace
RUN git clone https://github.com/your-org/your-repo.git . \
    && pip install -r requirements-dev.txt

Start a workspace from this image. You’ll land directly in /workspace with everything ready.

4.2. Workspace bootstrap script with clone/pull

Better for teams pushing code all day:

#!/usr/bin/env bash
set -euo pipefail

if [ ! -d "/workspace/.git" ]; then
  git clone https://github.com/your-org/your-repo.git /workspace
  cd /workspace
else
  cd /workspace
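  git fetch origin
  # Fast-forward only; don't fail the whole boot if local work has diverged
  git pull --ff-only origin main || echo "[bootstrap] could not fast-forward; leaving working tree as-is" >&2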
fi

pip install -r requirements-dev.txt

# Start whatever dev server or just a shell
exec bash

Add this script to your workspace template so every boot:

  • Clones the repo if it’s missing.
  • Pulls the latest changes if it’s already there.

For private repos, same GITHUB_TOKEN pattern as above.


5. Hardening and Scaling These Patterns

Mounting S3/object storage or GitHub into a VESSL run/workspace is one thing; making it not break under load is another. Here’s how to keep it sane.

5.1. Credentials: don’t scatter secrets

  • Prefer ephemeral credentials (IAM roles, short-lived access keys) over long-lived static keys.
  • Use VESSL’s secret/env tooling instead of hard-coding in images.
  • Avoid writing tokens to disk if you can; rely on environment variables and in-memory use.
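
A small fail-fast check at the top of a bootstrap script catches missing credentials before any sync starts, without printing their values. The sketch below uses a hypothetical require_env helper and relies on printenv, so it only sees exported variables.

```shell
#!/usr/bin/env bash
# Sketch: fail fast when expected credentials are missing, without echoing values.

require_env() {
  local missing=0 name
  for name in "$@"; do
    # printenv prints the value of an exported variable; empty output means
    # the variable is unset or empty.
    if [ -z "$(printenv "$name")" ]; then
      echo "[bootstrap] missing required env var: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage at the top of a bootstrap script:
#   require_env AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_DEFAULT_REGION || exit 1
```

Only variable names are logged, so the check is safe to leave in scripts whose output ends up in shared run logs.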

5.2. Performance: reduce repeated syncs

Pair S3 with VESSL storage:

  • First time: S3 → Cluster Storage or Object Storage.
  • All subsequent runs: mount only the VESSL volume.
  • For multi-GPU training (A100/H100/H200/B200/GB200/B300), this keeps I/O local and reduces S3 bottlenecks.

This directly cuts down job wrangling: fewer “why is this job still in ‘syncing data’ after 40 minutes?” pings.

5.3. Reproducibility: pin versions and commits

When you’re doing LLM post-training or AI-for-Science runs, you need the same commit and same dataset:

  • In S3, version datasets or store them under commit-like prefixes (s3://bucket/datasets/v1.2.3/).

  • In GitHub, check out a specific commit:

    git checkout 3e9f1c1
    
  • Bake the commit hash into the run metadata or logs (e.g., print it on startup).
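
Printing the commit and dataset location at startup can be as simple as the sketch below. The log_provenance name and the log format are illustrative assumptions; the only real command involved is git rev-parse.

```shell
#!/usr/bin/env bash
# Sketch: print the code commit and dataset location at startup so both
# end up in the run logs.

log_provenance() {
  local repo_dir="$1" dataset_uri="$2"
  local commit
  # Fail if the directory is not a git checkout rather than logging a blank hash.
  commit="$(git -C "$repo_dir" rev-parse HEAD)" || return 1
  echo "[provenance] commit=$commit"
  echo "[provenance] dataset=$dataset_uri"
}

# Usage:
#   log_provenance /workspace s3://bucket/datasets/v1.2.3/
```

With these two lines in every run log, "which commit did this run use?" can be answered by searching the logs.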

5.4. Reliability: make startup scripts idempotent

Your bootstrap scripts should be safe to re-run:

  • Use aws s3 sync (idempotent) instead of cp -r.
  • Check if .git exists before re-cloning.
  • Handle partial failures: if sync fails, exit non-zero so VESSL marks the job failed rather than silently running on incomplete data.
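
Transient network errors are a common reason a sync fails, so a bounded retry in front of the sync is often enough. This is a generic sketch: the retry helper, the attempt count, and the doubling delay are illustrative choices.

```shell
#!/usr/bin/env bash
# Sketch: retry a command a fixed number of times with a doubling delay,
# then return non-zero so the job is marked failed.

retry() {
  local attempts="$1" delay_s="$2"
  shift 2
  local n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "[retry] giving up after $n attempts: $*" >&2
      return 1
    fi
    echo "[retry] attempt $n failed, sleeping ${delay_s}s" >&2
    sleep "$delay_s"
    n=$((n + 1))
    delay_s=$((delay_s * 2))
  done
}

# Usage:
#   retry 3 5 aws s3 sync s3://my-bucket/path/to/dataset /mnt/dataset || exit 1
```

Because aws s3 sync only transfers what is missing, each retry resumes from where the previous attempt stopped.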

Putting it together: recommended default setups

If you just want a practical starting point that works for most teams:

  • For experiments (Spot jobs)

    • Use vessl run with a startup script that:
      • Clones your GitHub repo.
      • Syncs from S3 into a Cluster Storage mount the first time.
    • Reuse that Cluster Storage volume across runs so Spot preemptions don’t force full re-downloads.
  • For production / On-Demand and Reserved jobs

    • Bake your GitHub repo + exact commit into the image.
    • Point to datasets in VESSL Cluster/Object Storage (pre-synced from S3).
    • Use automatic failover at the compute layer; keep data paths stable.
  • For workspaces

    • Workspace image with dev tooling.
    • Bootstrap script that:
      • Clones/pulls your GitHub repo.
      • Syncs or mounts S3 into a fixed path, or mounts Cluster Storage.

That way, whether you’re on a single A100 Spot or a 64x H100 Reserved cluster, the path to your data and code is the same, and you’re not burning time on manual mounts every run.


Final Verdict

Mounting S3/object storage or GitHub into VESSL runs and workspaces isn’t about one magic “mount” flag; it’s about a repeatable boot pattern:

  • S3/object storage: use sync-on-start plus VESSL Cluster/Object Storage as a cache so data is close to the GPUs and you’re not saturating external buckets.
  • GitHub: bake stable code into images for production; clone at startup for active development and workspaces.

Once you standardize these patterns, most “job wrangling” goes away: no more hand-copied datasets, no more “which commit did this run use?” questions. You can also scale from 1 to 100 GPUs without rethinking how data and code get mounted each time.
