
How do I mount S3/object storage or a GitHub repo into a VESSL AI run or workspace?
Most teams hit the same wall on day one with VESSL Cloud: you can boot A100s in minutes, but your data and code still live in S3 buckets or GitHub. Until those are mounted cleanly into a run or workspace, you’re stuck copying tarballs and debugging paths instead of training.
This guide walks through how to mount S3/object storage and GitHub repos into a VESSL AI run or workspace using patterns that scale beyond a single experiment. The examples assume you’re using the vessl CLI and the Web Console, and that you care about the same things I do: fewer moving parts, fewer “where did my data go?” questions, and more fire-and-forget jobs.
At-a-Glance: Ways to Bring External Data & Code into VESSL
There are three common patterns:
- Object storage → Mounted path
  - Use S3 or compatible object stores (AWS S3, GCS with S3 API, MinIO, etc.).
  - Mount into your run at a known path (/mnt/data, /datasets, etc.) at container start.
- GitHub repo → Working directory
  - Pull your repo into the run or workspace, either at container startup or via dev container config.
- Persistent VESSL storage → Reusable across runs
  - Use Cluster Storage or Object Storage inside VESSL as a caching layer so you’re not re-pulling 1 TB from S3 every time.
The right approach depends on:
- Dataset size and update frequency
- Whether you’re in a one-off run or a long-lived workspace
- How much you care about startup time vs. keeping the source of truth in S3/Git
Below I’ll break down each option, with concrete workflows, pros/cons, and when to pick which.
Comparison: S3 vs. GitHub vs. VESSL Storage
| Rank | Option | Best For | Primary Strength | Watch Out For |
|---|---|---|---|---|
| 1 | S3/Object Storage Mount | Large datasets and artifacts used across many runs | Scales with data size; cloud-native; flexible | Needs credentials and IAM hygiene |
| 2 | GitHub Repo Mount/Clone | Codebases, configs, lightweight assets (<10–20 GB) | Simple dev flow; fits GitOps and CI/CD patterns | Git is slow/fragile for very large binaries |
| 3 | VESSL Cluster/Object Storage | Cached datasets and shared project storage inside VESSL | Fast access inside clusters; share across users | Initial sync from external S3/Git still required |
We’ll cover all three, but the question most people really have is: How do I wire S3 and GitHub into a vessl run or workspace without turning my setup into a bash script farm?
How we’ll structure the answer
To match how you actually work, I’ll split by context:
- Mounting S3/object storage into a one-off run (vessl run)
- Mounting S3/object storage into a persistent workspace
- Pulling a GitHub repo into a run
- Pulling a GitHub repo into a workspace
- Hardening and scaling: credentials, performance, and repeatability
Where relevant, I’ll also show how VESSL’s own Object Storage and Cluster Storage help you stop hitting external S3 every time.
1. Mount S3/Object Storage into a VESSL Run
You have two main strategies:
- Treat S3 as a remote filesystem (mount-like behavior via tools like s3fs, goofys, rclone, or a simple sync).
- Use VESSL’s Object Storage / Cluster Storage as a caching layer and sync from S3 once per dataset, not per run.
1.1. Pattern A: Sync from S3 at Run Start
This is the simplest, most robust pattern. You don’t literally “mount” S3; you sync data into the container at startup, run your job, then (optionally) sync results back to S3.
Steps:
- Wire credentials into the run
Use environment variables or a secret mechanism (recommended):
vessl run \
--name s3-sync-example \
--env AWS_ACCESS_KEY_ID=... \
--env AWS_SECRET_ACCESS_KEY=... \
--env AWS_DEFAULT_REGION=us-east-1 \
...
Or point to an IAM role if your provider supports it.
- Install AWS CLI or your object API in the image
In your Dockerfile:
RUN pip install awscli
- Sync at startup
Create a small startup script, bootstrap.sh:
#!/usr/bin/env bash
set -euo pipefail
echo "[bootstrap] Syncing S3 dataset..."
aws s3 sync s3://my-bucket/path/to/dataset /mnt/dataset
echo "[bootstrap] Done. Starting training..."
exec python train.py --data_dir /mnt/dataset
Make it executable:
chmod +x bootstrap.sh
Then run it via vessl run:
vessl run \
--name s3-sync-run \
--image your-image:latest \
--command "bash bootstrap.sh"
Pros
- Simple, no FUSE layers, behaves like local files.
- Works with any S3-compatible store (AWS S3, MinIO, GCS S3 API).
Cons
- Every run pulls the dataset unless you pair it with a cache (see Cluster Storage below).
- If you’re pulling 1 TB, startup will be slow.
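The optional "sync results back to S3" step needs one change to bootstrap.sh: exec replaces the shell with the training process, so nothing placed after it ever runs. Here is a minimal sketch of one way to handle that. The helper name run_then_upload, the output directory, and the destination prefix are all hypothetical, and the only external call is the same aws s3 sync used above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of VESSL or the AWS CLI):
# run a command, upload OUT_DIR to DEST_URI afterwards, and return
# the command's own exit code so a failed job is still reported as failed.
run_then_upload() {
  local out_dir="$1" dest_uri="$2" status=0
  shift 2

  # Run the job without exec so the shell survives to do the upload.
  "$@" || status=$?

  # Upload whatever was produced, even after a failure, to keep partial outputs.
  if ! aws s3 sync "$out_dir" "$dest_uri"; then
    echo "[bootstrap] upload to $dest_uri failed" >&2
    # Report the upload failure only if the job itself succeeded.
    [ "$status" -ne 0 ] || status=1
  fi
  return "$status"
}

# Example (paths and bucket are placeholders), replacing the final exec line:
#   run_then_upload /mnt/outputs s3://my-bucket/path/to/results \
#     python train.py --data_dir /mnt/dataset
```

The trade-off is that the training process is no longer PID 1's direct replacement, so signal handling differs slightly from the exec version; for short scripts like this one that is usually acceptable.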
1.2. Pattern B: Use VESSL Cluster/Object Storage as a Cache
If you’re reusing the same dataset across many runs or users:
- First run: sync from S3 into VESSL storage
  - Mount a Cluster Storage volume (fast, shared file-like storage inside the cluster) into your run at /cluster-data.
  - Sync S3 → /cluster-data/datasets/imagenet.
- Subsequent runs: skip S3 and mount only the Cluster Storage volume
Inside VESSL, this looks like:
- Cluster Storage
  - Best for: shared training data, fast I/O, collaborative jobs.
  - Behaves like a shared NFS-style filesystem across runs and users.
- Object Storage
  - Best for: lower-cost artifacts, checkpoints, logs.
  - Think of it as S3 inside VESSL.
Operationally: you still use the method from 1.1 to ingest data once, but all future runs mount the VESSL-managed volume, so you’re not re-hammering external S3.
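The "ingest once, reuse afterwards" flow can be sketched with a marker file on the shared volume. This is an assumption-laden example rather than a VESSL feature: the function name sync_once and the .sync-complete marker are conventions I made up for illustration, and it assumes the Cluster Storage volume is already mounted at the cache path.

```shell
#!/usr/bin/env bash
# Illustrative convention, not a VESSL feature: a marker file that
# records that the cache directory holds a complete copy of the dataset.
MARKER_NAME=".sync-complete"

# Hypothetical helper: populate CACHE_DIR from SRC_URI only if no
# completed sync has been recorded there yet.
sync_once() {
  local src_uri="$1" cache_dir="$2"

  if [ -f "$cache_dir/$MARKER_NAME" ]; then
    echo "[cache] hit: $cache_dir"
    return 0
  fi

  echo "[cache] miss: syncing $src_uri -> $cache_dir"
  mkdir -p "$cache_dir"

  # Write the marker only after a fully successful sync, so an interrupted
  # run is retried next time instead of being mistaken for a complete copy.
  aws s3 sync "$src_uri" "$cache_dir" || return 1
  date -u +%Y-%m-%dT%H:%M:%SZ > "$cache_dir/$MARKER_NAME"
}

# Example (bucket path is a placeholder):
#   sync_once s3://my-bucket/path/to/dataset /cluster-data/datasets/imagenet
```

Two runs starting at the same moment could both see a miss and sync in parallel. Because aws s3 sync only copies what is missing or changed, that costs bandwidth rather than correctness, but a lock would be needed if you want to avoid it entirely.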
2. Mount S3/Object Storage into a Workspace
A workspace is where you live day-to-day: notebooks, editors, interactive debugging. You want your S3 buckets to appear at known paths every time the workspace boots.
You’ll usually follow the same patterns as runs, but with some tweaks:
2.1. Bootstrapping S3 in a workspace
Use a startup script or devcontainer config that:
- Exposes credentials (env vars / secrets).
- Installs your object storage client.
- Syncs or mounts S3 into a path like /home/vessl/data.
Example workspace-bootstrap.sh:
#!/usr/bin/env bash
set -euo pipefail
mkdir -p /home/vessl/data
# One-time sync or periodic update
aws s3 sync s3://my-bucket/project-data /home/vessl/data
# Optional: sync models/checkpoints too
aws s3 sync s3://my-bucket/models /home/vessl/models
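# Keep the workspace alive with your usual process (e.g., Jupyter).
# Note: an empty token turns off Jupyter's token authentication, so only
# use it when access to the workspace is already restricted some other way.
exec jupyter lab --ip=0.0.0.0 --no-browser --NotebookApp.token=''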
Configure your workspace to run this script at start (via the Web Console or your dev container config).
2.2. Choosing between sync vs. FUSE “mount”
For workspaces, it’s tempting to use s3fs or rclone mount to make S3 look like a live filesystem. That works, but:
- FUSE mounts can break under high concurrency.
- Latency is non-trivial; every ls hits S3.
- If the mount dies mid-session, your notebook sees weird errors.
What I recommend:
- Small/medium datasets or code: aws s3 sync into local or Cluster Storage.
- Huge archives you rarely touch: consider a read-only mount, but only if your access pattern is truly sparse.
Example with rclone mount in a workspace:
rclone config create myremote s3 provider=AWS env_auth=true
mkdir -p /home/vessl/s3
rclone mount myremote:my-bucket /home/vessl/s3 --vfs-cache-mode full &
Then work with /home/vessl/s3 as a mount-like path. Just know it’s still remote.
3. Bring a GitHub Repo into a VESSL Run
For code, GitHub is the source of truth. The question is: clone at build time (in the image) or at run time?
3.1. Option A: Bake the repo into the image (best for stable code)
In your Dockerfile:
WORKDIR /app
RUN git clone https://github.com/your-org/your-repo.git . \
&& pip install -r requirements.txt
Then:
vessl run \
--name train-from-github-image \
--image your-image:with-repo \
--command "python train.py"
Good when:
- Code doesn’t change hourly.
- You want reproducible runs bound to a specific commit.
3.2. Option B: Clone the repo at run start (best for active development)
Use a startup script:
#!/usr/bin/env bash
set -euo pipefail
git clone https://github.com/your-org/your-repo.git /workspace
cd /workspace
# Optional: switch branch or commit
git checkout main
# Install deps
pip install -r requirements.txt
# Start your job
exec python train.py
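Run it via the command below. This assumes the script has already been made available inside the container at /root/bootstrap.sh (for example through a mounted volume or a derived image), since the stock python:3.11 image does not contain it: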
vessl run \
--name github-run-latest \
--image python:3.11 \
--command "bash /root/bootstrap.sh"
For private repos, inject a token:
vessl run \
--env GITHUB_TOKEN=ghp_... \
...
And use:
git clone https://$GITHUB_TOKEN@github.com/your-org/your-private-repo.git /workspace
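Embedding the token in the clone URL works, but git saves that URL, token included, in the clone's .git/config. If you would rather keep the token off disk, one alternative is a one-shot credential helper that reads the token from the environment at clone time. The sketch below is my own suggestion rather than anything VESSL-specific: the function name clone_private is hypothetical, and x-access-token is just a placeholder username, which GitHub ignores when the password is a token.

```shell
#!/usr/bin/env bash
# Hypothetical helper: clone a private GitHub repo using GITHUB_TOKEN
# from the environment, without putting the token in the URL.
clone_private() {
  local repo_url="$1" dest="$2"

  # Fail early with a clear message if the token was not injected.
  : "${GITHUB_TOKEN:?GITHUB_TOKEN must be set}"

  # Single quotes keep ${GITHUB_TOKEN} literal here: git runs this snippet
  # in a shell on demand, and only then is the token read from the environment.
  local helper='!f() { echo "username=x-access-token"; echo "password=${GITHUB_TOKEN}"; }; f'

  # The first -c clears any configured helpers; the second adds ours.
  # Settings passed with -c apply to this command only and are not saved.
  git -c credential.helper= -c "credential.helper=$helper" clone "$repo_url" "$dest"
}

# Example:
#   clone_private https://github.com/your-org/your-private-repo.git /workspace
```

Because the helper is not persisted, a later git pull or git push inside the clone needs the same -c options (or a configured credential helper) to authenticate again.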
4. Bring a GitHub Repo into a Workspace
Workspaces are where developers want their normal Git flow:
- git status
- git pull
- git commit
You have two solid options.
4.1. Workspace image with pre-cloned repo
If you’re fine with workspaces starting from a specific commit:
WORKDIR /workspace
RUN git clone https://github.com/your-org/your-repo.git . \
&& pip install -r requirements-dev.txt
Start a workspace from this image. You’ll land directly in /workspace with everything ready.
4.2. Workspace bootstrap script with clone/pull
Better for teams pushing code all day:
#!/usr/bin/env bash
set -euo pipefail
if [ ! -d "/workspace/.git" ]; then
  git clone https://github.com/your-org/your-repo.git /workspace
  cd /workspace
else
  cd /workspace
  git fetch origin
  git pull origin main
fi
pip install -r requirements-dev.txt
# Start whatever dev server or just a shell
exec bash
Add this script to your workspace template so every boot:
- Clones the repo if it’s missing.
- Pulls the latest changes if it’s already there.
For private repos, same GITHUB_TOKEN pattern as above.
5. Hardening and Scaling These Patterns
Mounting S3/object storage or GitHub into a VESSL run/workspace is one thing; making it not break under load is another. Here’s how to keep it sane.
5.1. Credentials: don’t scatter secrets
- Prefer ephemeral credentials (IAM roles, short-lived access keys) over long-lived static keys.
- Use VESSL’s secret/env tooling instead of hard-coding in images.
- Avoid writing tokens to disk if you can; rely on environment variables and in-memory use.
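Whichever mechanism injects the credentials, it helps to check for them before any sync or clone starts, so a missing secret produces one clear error instead of a confusing failure halfway through. A minimal sketch, assuming a hypothetical helper named require_env; it prints only variable names, never values.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify that each named environment variable is
# set and non-empty. Reports every missing name, then returns non-zero.
require_env() {
  local missing=0 name
  for name in "$@"; do
    # printenv only sees exported variables, which is what injected
    # secrets are; it prints nothing if the variable is unset.
    if [ -z "$(printenv "$name")" ]; then
      echo "[bootstrap] missing required env var: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example, near the top of a bootstrap script:
#   require_env AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_DEFAULT_REGION || exit 1
```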
5.2. Performance: reduce repeated syncs
Pair S3 with VESSL storage:
- First time: S3 → Cluster Storage or Object Storage.
- All subsequent runs: mount only the VESSL volume.
- For multi-GPU training (A100/H100/H200/B200/GB200/B300), this keeps I/O local and reduces S3 bottlenecks.
This directly cuts down job wrangling: fewer “why is this job still in ‘syncing data’ after 40 minutes?” pings.
5.3. Reproducibility: pin versions and commits
When you’re doing LLM post-training or AI-for-Science runs, you need the same commit and same dataset:
- In S3, version datasets or store them under commit-like prefixes (s3://bucket/datasets/v1.2.3/).
- In GitHub, check out a specific commit:
  git checkout 3e9f1c1
- Bake the commit hash into the run metadata or logs (e.g., print it on startup).
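Printing the commit and dataset version on startup can be as small as one function. This is a sketch under my own naming: log_provenance and the [provenance] log prefix are hypothetical, and it assumes the repo has already been cloned into the directory you pass in.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the checked-out commit and the dataset
# location on one line, so every run's logs record what it used.
log_provenance() {
  local repo_dir="$1" dataset_uri="$2" commit

  # Resolve the current commit; fail rather than log a blank hash.
  commit=$(git -C "$repo_dir" rev-parse --short HEAD) || return 1

  echo "[provenance] commit=$commit dataset=$dataset_uri"
}

# Example, right before starting training:
#   log_provenance /workspace s3://bucket/datasets/v1.2.3/
```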
5.4. Reliability: make startup scripts idempotent
Your bootstrap scripts should be safe to re-run:
- Use aws s3 sync (idempotent) instead of cp -r.
- Check if .git exists before re-cloning.
- Handle partial failures: if sync fails, exit non-zero so VESSL marks the job failed rather than silently running on incomplete data.
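One extension beyond the list above, which I find pairs well with idempotent commands, is retrying transient failures before giving up with a non-zero exit. The sketch below assumes a hypothetical retry helper with doubling backoff; the attempt count and delay are arbitrary example values.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command up to ATTEMPTS times, doubling the
# delay between tries. Returns 0 on the first success, 1 if every try fails.
retry() {
  local attempts="$1" delay="$2" n=1
  shift 2

  while true; do
    "$@" && return 0

    if [ "$n" -ge "$attempts" ]; then
      echo "[retry] giving up after $n attempts: $*" >&2
      return 1
    fi

    echo "[retry] attempt $n failed; sleeping ${delay}s" >&2
    sleep "$delay"
    n=$((n + 1))
    delay=$((delay * 2))
  done
}

# Example: safe to repeat because aws s3 sync only copies what is missing.
#   retry 3 5 aws s3 sync s3://my-bucket/path/to/dataset /mnt/dataset || exit 1
```

Retrying only makes sense for commands that are safe to repeat, which is why it fits sync and fetch steps but not, say, a training command that appends to an existing log.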
Putting it together: recommended default setups
If you just want a practical starting point that works for most teams:
- For experiments (Spot jobs)
  - Use vessl run with a startup script that:
    - Clones your GitHub repo.
    - Syncs from S3 into a Cluster Storage mount the first time.
  - Reuse that Cluster Storage volume across runs so Spot preemptions don’t force full re-downloads.
- For production / On-Demand and Reserved jobs
  - Bake your GitHub repo + exact commit into the image.
  - Point to datasets in VESSL Cluster/Object Storage (pre-synced from S3).
  - Use automatic failover at the compute layer; keep data paths stable.
- For workspaces
  - Workspace image with dev tooling.
  - Bootstrap script that:
    - Clones/pulls your GitHub repo.
    - Syncs or mounts S3 into a fixed path, or mounts Cluster Storage.
That way, whether you’re on a single A100 Spot or a 64x H100 Reserved cluster, the path to your data and code is the same, and you’re not burning time on manual mounts every run.
Final Verdict
Mounting S3/object storage or GitHub into VESSL runs and workspaces isn’t about one magic “mount” flag; it’s about a repeatable boot pattern:
- S3/object storage: use sync-on-start plus VESSL Cluster/Object Storage as a cache so data is close to the GPUs and you’re not saturating external buckets.
- GitHub: bake stable code into images for production; clone at startup for active development and workspaces.
Once you standardize these patterns, most of the “job wrangling” goes away: no more hand-copied datasets, no more “which commit did this run use?” questions. You can then scale from 1 to 100 GPUs without rethinking how your data and code get mounted every time.