
How do I mount S3/object storage or a GitHub repo into a VESSL AI run or workspace?
Most teams hit the same wall on day one with VESSL AI: you have data in S3 (or another object store) and code in GitHub, but you want everything ready inside a run or workspace with zero manual copying. The good news is you can wire both in cleanly and keep your jobs “fire-and-forget.”
Below is a practical, step‑by‑step guide for how to mount S3/object storage or a GitHub repo into a VESSL AI run or workspace using patterns that scale from 1 to 100 GPUs.
At-a-Glance: Options for mounting storage and code
There are three common patterns teams use:
- Pattern 1 – Direct object storage access in the container: use AWS/GCP/other SDKs or CLI inside your run/workspace to pull from or push to S3‑compatible buckets.
- Pattern 2 – Mount-like behavior via sync: on startup, sync objects from S3 into local or shared storage (e.g., Cluster Storage), run your workload, then sync artifacts back out.
- Pattern 3 – Git-based code checkout: clone or pull your GitHub repo at the start of the run/workspace, optionally pinned to a branch, tag, or commit.
In practice, most teams combine Pattern 2 + Pattern 3: sync data from S3 into shared storage, clone code from GitHub, and keep the container image small and generic.
Core concepts in a VESSL AI run or workspace
Before wiring in storage and GitHub, it helps to anchor on a few primitives:
- Compute: you request GPU capacity (A100/H100/H200/B200/GB200/B300) using Spot, On‑Demand (with automatic failover), or Reserved. Storage and Git configuration should be stable across all three.
- Storage types:
  - Cluster Storage – high‑performance shared file system, ideal for collaborative training jobs where multiple runs need the same dataset.
  - Object Storage – lower‑cost storage for datasets, checkpoints, and artifacts. S3 or S3‑compatible stores fall into this category.
- Execution path:
  - Runs: fire‑and‑forget jobs you launch via the Web Console or vessl run.
  - Workspaces: long‑lived dev environments for notebooks, debugging, and interactive work.
Mounting S3 or GitHub essentially means making them available at a predictable path as early as possible in this execution path.
Pattern 1: Direct S3/object storage access inside the container
When to use this pattern
Use direct access when:
- You want to stream data on demand (e.g., samples from a very large dataset).
- You don’t want to maintain a large local copy of the entire bucket.
- Your training code is already written around AWS/GCP/other SDKs (e.g., boto3, gcsfs, s3fs).
Step 1: Provide credentials to the run/workspace
You have three common options:
- Environment variables (simplest): set AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and optionally AWS_SESSION_TOKEN and AWS_DEFAULT_REGION as environment variables when launching the run/workspace.
- Credentials file (AWS profile style): mount or create ~/.aws/credentials and ~/.aws/config inside the container (e.g., via startup commands).
- Workload-specific IAM role (cloud-native): if your VESSL cluster is bound to AWS and configured with role-based access, the run may inherit an IAM role automatically. In that case, you usually don’t need static keys.
Regardless of approach, keep secrets out of the image. Pass them at runtime.
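As a concrete illustration of the credentials-file option, here is a minimal Python sketch that materializes ~/.aws/credentials from runtime environment variables. The `write_aws_credentials` helper and its `home` parameter are hypothetical names, not part of any VESSL or AWS API:

```python
import configparser
import os
from pathlib import Path


def write_aws_credentials(home=None):
    """Write ~/.aws/credentials from environment variables injected at
    runtime (never baked into the image). `home` is overridable for tests."""
    aws_dir = Path(home or os.path.expanduser("~")) / ".aws"
    aws_dir.mkdir(parents=True, exist_ok=True)

    creds = configparser.ConfigParser()
    creds["default"] = {
        "aws_access_key_id": os.environ["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": os.environ["AWS_SECRET_ACCESS_KEY"],
    }
    # Session tokens are optional (e.g., for temporary STS credentials).
    if "AWS_SESSION_TOKEN" in os.environ:
        creds["default"]["aws_session_token"] = os.environ["AWS_SESSION_TOKEN"]

    path = aws_dir / "credentials"
    with open(path, "w") as f:
        creds.write(f)
    path.chmod(0o600)  # readable only by the owner
    return path
```

Running this in your startup commands keeps the image generic while still supporting tools that expect the AWS profile-file convention.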
Step 2: Access S3 via SDK or CLI
Inside your run or workspace:
# Minimal check: list a bucket
aws s3 ls s3://my-bucket-name/
# Copy a dataset locally
aws s3 sync s3://my-bucket-name/datasets/cifar10 /mnt/data/cifar10
Python example with boto3:
import boto3

s3 = boto3.client("s3")
bucket = "my-bucket-name"
prefix = "datasets/cifar10"

response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
for obj in response.get("Contents", []):
    print(obj["Key"])
For S3‑compatible providers (e.g., MinIO, some cloud vendors), set a custom endpoint:
export AWS_ENDPOINT_URL="https://my-s3-compatible-endpoint"
aws s3 ls --endpoint-url "$AWS_ENDPOINT_URL"
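To keep the endpoint configurable in Python as well, you can build the boto3 client arguments from the same environment variables. This is a small sketch; `s3_client_kwargs` is a hypothetical helper, and only `endpoint_url`/`region_name` (real boto3 client parameters) are passed through:

```python
import os


def s3_client_kwargs(env=os.environ):
    """Build keyword arguments for boto3.client("s3"), honoring an optional
    custom endpoint for S3-compatible providers (e.g., MinIO)."""
    kwargs = {}
    endpoint = env.get("AWS_ENDPOINT_URL")
    if endpoint:
        kwargs["endpoint_url"] = endpoint
    region = env.get("AWS_DEFAULT_REGION")
    if region:
        kwargs["region_name"] = region
    return kwargs


# Usage (requires boto3):
#   import boto3
#   s3 = boto3.client("s3", **s3_client_kwargs())
```

Because the endpoint comes from the environment, the same code works against AWS S3 and S3-compatible stores without edits.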
Pros and cons
Pros
- No extra sync or copy step needed if your code can read from S3 directly.
- Works well for very large datasets or streaming use cases.
Cons
- Training speed depends on S3/object storage throughput and latency.
- You must manage credentials carefully for each run/workspace.
Pattern 2: “Mount” S3 via sync into Cluster Storage or local disk
This pattern gives you mount-like behavior without a kernel-level mount: you sync data in at startup and sync artifacts back out at the end.
When to use this pattern
- You want high-performance local reads during training (GPU‑heavy runs).
- Multiple jobs should reuse the same dataset from a shared path.
- You want a predictable, POSIX-like directory structure (/mnt/data/...).
Step 1: Choose the target path
Typical paths inside a VESSL run/workspace:
- /mnt/data – for dataset inputs
- /mnt/checkpoints – for model checkpoints and logs
If you use Cluster Storage, you can make these paths shared across runs.
Step 2: Sync data from S3 into the run/workspace
In your run or workspace startup commands:
# Ensure directories exist
mkdir -p /mnt/data /mnt/checkpoints
# Sync dataset from S3 into /mnt/data
aws s3 sync s3://my-bucket-name/datasets/imagenet /mnt/data/imagenet
# Optionally, sync existing checkpoints
aws s3 sync s3://my-bucket-name/checkpoints/project-x /mnt/checkpoints
If you’re using Object Storage from a different provider, use their CLI or SDK instead of aws s3.
Step 3: Run training against the synced paths
Point your training script to the local paths you defined:
python train.py \
--data-root /mnt/data/imagenet \
--output-dir /mnt/checkpoints
Your framework (PyTorch, TensorFlow, JAX) just sees a local filesystem; no need to change dataset loaders for S3.
Step 4: Sync checkpoints and artifacts back to S3
At the end of the run, sync outputs back:
aws s3 sync /mnt/checkpoints s3://my-bucket-name/checkpoints/project-x
aws s3 sync /mnt/logs s3://my-bucket-name/logs/project-x
For long‑running workspaces, you can run this periodically (e.g., via cron or a small Python script) for incremental backups.
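The "small Python script" variant can be as simple as a loop around aws s3 sync. A minimal sketch, assuming the aws CLI is on PATH; `sync_command` and `backup_loop` are hypothetical names:

```python
import subprocess
import time


def sync_command(local_dir, bucket, prefix):
    """Build the aws s3 sync command for pushing artifacts out."""
    return ["aws", "s3", "sync", local_dir, f"s3://{bucket}/{prefix}"]


def backup_loop(local_dir, bucket, prefix, interval_sec=900, iterations=None):
    """Periodically sync local checkpoints to S3.

    iterations=None runs forever (e.g., inside a long-lived workspace);
    a finite value is handy for testing.
    """
    n = 0
    while iterations is None or n < iterations:
        # check=False: a transient sync failure should not kill the loop.
        subprocess.run(sync_command(local_dir, bucket, prefix), check=False)
        n += 1
        if iterations is None or n < iterations:
            time.sleep(interval_sec)
```

Because aws s3 sync is incremental, re-running it every 15 minutes only uploads files that changed since the last pass.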
Pros and cons
Pros
- Faster, more predictable training performance.
- Simple code: everything operates on local paths.
- Works well with shared Cluster Storage for multi‑run reuse.
Cons
- Requires some startup and shutdown time for large syncs.
- You must keep an eye on storage usage inside the cluster.
Pattern 3: Mount a GitHub repo by cloning into the run or workspace
Instead of baking your full codebase into a container, you can keep images slim and pull GitHub on startup.
When to use this pattern
- You iterate on code frequently and don’t want to rebuild images each time.
- Multiple researchers share the same repo across runs/workspaces.
- You want reproducibility by pinning to commit hashes or tags.
Step 1: Provide GitHub access
Depending on whether the repo is public or private:
- Public repos (simplest)
  No auth needed:
  git clone https://github.com/my-org/my-repo.git
- Private repos via PAT (Personal Access Token)
  Store your PAT as a secret in your CI system or VESSL config, not in the image, and inject it into the run/workspace as an environment variable, e.g., GITHUB_TOKEN. Then clone using the token:
  git clone https://$GITHUB_TOKEN@github.com/my-org/my-private-repo.git
- SSH key-based access
  Inject your private key as a secret, then in startup commands:
  mkdir -p ~/.ssh
  echo "$GITHUB_SSH_KEY" > ~/.ssh/id_rsa
  chmod 600 ~/.ssh/id_rsa
  ssh-keyscan github.com >> ~/.ssh/known_hosts
  git clone git@github.com:my-org/my-private-repo.git
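One gotcha with the PAT approach: the token is embedded in the clone URL, so any script that echoes the URL can leak it into run logs. A small sketch of a sanitizer you might run output through before logging; `mask_tokens` is a hypothetical helper, not a VESSL or GitHub API:

```python
import re


def mask_tokens(text):
    """Replace anything that looks like an embedded credential in an
    https://token@host or https://user:token@host URL with '***'."""
    return re.sub(r"(https://)[^/@\s]+@", r"\1***@", text)
```

Plain URLs without credentials pass through unchanged, so the helper is safe to apply to all log lines.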
Step 2: Choose the checkout location
Typical pattern:
mkdir -p /workspace
cd /workspace
git clone https://github.com/my-org/my-repo.git
cd my-repo
This keeps code under /workspace/my-repo, which is easy to reference from your training scripts or notebooks.
Step 3: Pin to a branch, tag, or commit for reproducibility
After cloning, you can checkout a specific ref:
cd /workspace/my-repo
# Specific branch
git checkout feature/new-eval-pipeline
# Or specific tag
git checkout v1.2.0
# Or commit hash
git checkout 3f5c6d9
Pinning a commit makes your run reproducible even as the main branch moves.
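If you want startup scripts to enforce this, a quick check for "is this ref actually a pinned commit?" can warn when someone launches a run against a moving branch. A minimal sketch; `is_pinned_commit` is a hypothetical helper:

```python
import re


def is_pinned_commit(ref):
    """True only for a full 40-character git commit SHA.

    Branches and tags can move, and abbreviated hashes can become
    ambiguous, so full SHAs are the safest ref to record for
    reproducibility.
    """
    return bool(re.fullmatch(r"[0-9a-f]{40}", ref))
```

A startup script could log a warning (or refuse to run in "strict" experiments) when the ref fails this check.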
Step 4: Integrate with your run command or notebook
Examples:
- CLI (vessl run):
  cd /workspace/my-repo
  python train.py --config configs/exp.yaml
- Workspace (Jupyter, VS Code, etc.):
  Open /workspace/my-repo and run your notebooks or scripts directly there.
Pros and cons
Pros
- Fast iteration: push to GitHub, start a new run, code is up to date.
- Smaller images; language/runtime dependencies live in the image, code stays in Git.
- Easy to track exactly what code ran (commit hash).
Cons
- Requires network access to GitHub on run/workspace start.
- Needs token/SSH management for private repos.
Putting it together: S3 data + GitHub code in a single VESSL run
Here’s how you might combine everything in a typical vessl run-style startup sequence.
Example startup script
#!/usr/bin/env bash
set -e
# 1. Prepare directories
mkdir -p /mnt/data /mnt/checkpoints /workspace
# 2. Sync dataset from S3 into local or Cluster Storage
aws s3 sync s3://my-bucket-name/datasets/imagenet /mnt/data/imagenet
# 3. Clone code from GitHub
cd /workspace
git clone https://$GITHUB_TOKEN@github.com/my-org/my-private-repo.git
cd my-private-repo
# Optional: pin to a specific commit
git checkout 3f5c6d9
# 4. Launch training
python train.py \
--data-root /mnt/data/imagenet \
--output-dir /mnt/checkpoints
# 5. Sync checkpoints back to S3
aws s3 sync /mnt/checkpoints s3://my-bucket-name/checkpoints/project-x
Wire this script into your run configuration (via Web Console or CLI) as the container’s main command. Every time you submit a run, VESSL provisions GPUs (Spot/On‑Demand/Reserved as requested), pulls data and code, trains, and pushes results out—with minimal job wrangling.
Best practices for stable, scalable mounting
1. Keep secrets out of images
- Pass AWS keys, GitHub tokens, and SSH keys as secrets or environment variables at run time.
- Rotate them periodically and limit permissions to required buckets/repos.
2. Use shared Cluster Storage for heavy datasets
- Put frequently used datasets on Cluster Storage and only refresh from S3 when necessary.
- Let multiple runs reuse the same local copy to save time and bandwidth.
3. Separate code and data
- Code lives in GitHub.
- Datasets and artifacts live in S3/object storage.
- Your run/workspace just pulls what it needs into a predictable directory structure.
4. Plan for provider hiccups
One reason teams move to VESSL is to avoid being hostage to a single provider. When you mix:
- Unified GPU access (across clouds)
- Auto Failover (seamless provider switching)
- Multi-Cluster (unified view across regions)
you want your data and code mounting pattern to keep up:
- Keep S3 endpoints configurable (env vars or config files).
- Make sure your GitHub access doesn’t hard‑code IP‑based allowlists that break when a region changes.
- Test runs across multiple regions/providers to confirm S3 and GitHub access remains stable.
5. Log what you mounted
For reproducibility, log:
- The S3 bucket and prefix used (s3://my-bucket/datasets/imagenet@2025-01-01).
- The Git commit hash checked out.
- Any relevant environment variables (sanitized, without secrets).
Write this metadata into your run’s logs or a JSON file in /mnt/checkpoints and sync it back to S3.
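A minimal sketch of that metadata file, assuming git is available in the container; `write_run_metadata` and the run_metadata.json filename are hypothetical conventions, not a VESSL API:

```python
import json
import subprocess
from pathlib import Path


def write_run_metadata(out_dir, bucket, prefix, repo_dir=None):
    """Record what was mounted into this run as a small JSON file, so
    every artifact synced back to S3 is traceable to its inputs."""
    meta = {"s3_bucket": bucket, "s3_prefix": prefix}
    if repo_dir is not None:
        # Ask git for the exact commit that was checked out.
        meta["git_commit"] = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        ).stdout.strip()
    path = Path(out_dir) / "run_metadata.json"
    path.write_text(json.dumps(meta, indent=2))
    return path
```

Writing this file into /mnt/checkpoints means the final aws s3 sync ships it out alongside the checkpoints with no extra step.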
How this fits the way VESSL AI is used
Most teams I work with follow this progression:
- Exploration in a workspace
  - Mount S3 or other object storage via direct SDK access or small syncs.
  - Clone GitHub into /workspace and iterate on notebooks and scripts.
- Batch training via runs
  - Stabilize a startup script that syncs S3 → Cluster Storage, clones GitHub, launches training, and syncs results back to S3.
  - Use Spot where possible for cost, On‑Demand with automatic failover for important experiments.
- Production post‑training and fine‑tuning
  - Move to Reserved capacity for critical jobs, keeping the same storage and Git mounting pattern.
  - Rely on Auto Failover and Multi‑Cluster to keep jobs alive through provider outages.
The mounting patterns don’t change as you move from 1 GPU to 100 GPUs—only the scale, reliability tier, and how aggressively you cache datasets in Cluster Storage.
Next step
If you’re ready to stop wrestling with storage mounts and start running experiments, you can configure your first run or workspace—and wire in your S3/object storage and GitHub repos—in a few minutes.