Best data version control tools for ML datasets in S3 (large files) that support branching/merging like Git



Most ML teams hit the same wall: Git works well for code, but once your dataset lives in S3 and grows past a few gigabytes, normal Git workflows fall apart. You still need branching, merging, and reproducibility, but now you’re dealing with multi-GB Parquet files, terabytes of images, and model weights that don’t belong in a Git repository.

Quick Answer: The data version control tools for ML datasets in S3 that feel most like Git are Oxen.ai, lakeFS, and DVC; Quilt and Dolt cover adjacent needs (versioned data packages over S3 and version-controlled SQL tables, respectively). Each makes different tradeoffs in how it handles large files, branching/merging, and collaboration. For Git-style workflows on big datasets (with data review, fine-tuning, and deployment in the same loop), Oxen.ai and lakeFS are usually the most practical starting points; DVC is strong if you’re deeply invested in Git and don’t mind more plumbing.

Why This Matters

If you can’t answer “which data trained which model?” for every production release, you’re flying blind. When datasets live in ad-hoc S3 buckets with custom scripts, you get:

  • Reproducibility failures: experiments that can’t be recreated months later.
  • Broken collaboration: data scientists, engineers, and product/creative teams all editing files differently.
  • Slower iteration: every branch of a dataset is a full copy; merges are manual and error-prone.

Data version control tools that support Git-like branching and merging against S3 are how you scale beyond “prototype on my laptop” to “production model with a paper trail.” They give you traceability across datasets, model weights, and endpoints—so you can ship faster without guessing which dataset snapshot is in prod.

Key Benefits:

  • Reproducible pipelines: Tie every model, run, and endpoint back to a specific dataset version and S3 state.
  • Safer collaboration: Let multiple people branch, review, and merge changes to datasets instead of overwriting each other.
  • Cheaper iteration: Avoid full S3 copies for every experiment; use deduplication and intelligent storage so branching doesn’t 10x your bill.

Core Concepts & Key Points

| Concept | Definition | Why it's important |
| --- | --- | --- |
| Data branching | Creating independent “lines” of a dataset (e.g., main, exp/new-labels) without copying all underlying data. | Lets you experiment freely (labeling, filtering, augmentations) without risking production data. |
| Data merging | Combining changes from one dataset branch into another, resolving conflicts at file or row level. | Makes it possible to promote reviewed changes to production and keep a clean lineage. |
| Object-store-backed versioning | Using S3 (or similar) as the underlying storage while a control plane tracks versions, commits, and metadata. | You keep cheap, scalable S3; the tool adds Git-like history, diffs, and rollbacks on top. |
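To make the branching row concrete, here is a minimal sketch of how branching without copying can work. It is a toy model under stated assumptions, not the implementation of any tool in this article: the class TinyDataRepo and its methods are hypothetical, and an in-memory dict stands in for S3. Blobs are stored once under their content hash, and a branch is only a pointer to a commit.

```python
import hashlib


class TinyDataRepo:
    """Toy model: blobs stored once by content hash; branches are pointers."""

    def __init__(self):
        self.blobs = {}     # content hash -> bytes (stands in for S3 objects)
        self.commits = {}   # commit id -> {"parent": id or None, "tree": {path: hash}}
        self.branches = {}  # branch name -> commit id
        self.branches["main"] = self._write_commit(None, {})

    def _write_commit(self, parent, tree):
        # A commit id is derived from its parent and its path -> hash mapping.
        payload = repr((parent, sorted(tree.items()))).encode()
        commit_id = hashlib.sha256(payload).hexdigest()[:12]
        self.commits[commit_id] = {"parent": parent, "tree": dict(tree)}
        return commit_id

    def branch(self, name, source="main"):
        # Creating a branch copies a pointer, not data.
        self.branches[name] = self.branches[source]

    def commit(self, branch, changes):
        """changes maps path -> bytes (add/update) or None (delete)."""
        parent = self.branches[branch]
        tree = dict(self.commits[parent]["tree"])
        for path, data in changes.items():
            if data is None:
                tree.pop(path, None)
            else:
                digest = hashlib.sha256(data).hexdigest()
                self.blobs.setdefault(digest, data)  # identical content stored once
                tree[path] = digest
        self.branches[branch] = self._write_commit(parent, tree)
        return self.branches[branch]

    def read(self, branch, path):
        tree = self.commits[self.branches[branch]]["tree"]
        return self.blobs[tree[path]]


repo = TinyDataRepo()
repo.commit("main", {"images/cat.jpg": b"cat-bytes", "labels.csv": b"cat.jpg,cat\n"})
repo.branch("exp/new-labels")
repo.commit("exp/new-labels", {"labels.csv": b"cat.jpg,kitten\n"})
```

After the experiment branch changes one file, the store holds three blobs rather than four: the image is shared by both branches, and only the edited labels file adds storage.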

How It Works (Step-by-Step)

At a high level, all the serious tools follow a similar pattern: keep S3 as your blob store, add a “version control brain” on top, and expose Git-ish operations.

  1. Connect your datasets to a versioning layer
    You point the tool at S3 buckets, prefixes, or a data lake, and it starts tracking object versions and metadata. How much data moves depends on the tool: some can register existing objects in place, while others upload files into storage they manage. In Oxen.ai, you create repositories for datasets and model weights; lakeFS sits in front of S3 as a gateway that your jobs read and write through.

  2. Use Git-like workflows for data changes
    Just as you use git branch and git commit for code, you branch datasets, apply transforms or labeling, and commit those changes. The tool stores metadata and content hashes while the large files stay in S3, and each branch maps to a different view of the underlying objects.

  3. Review, merge, and ship to production
    Data changes go through review (diffs, comments, approvals), then get merged into main or prod branches. Pipelines and training jobs read from specific branches/commits, so your model artifacts are always linked to a precise dataset snapshot.
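The merge in step 3 can be sketched as a three-way merge over path-to-hash mappings. This is an illustrative simplification, not how any particular tool resolves merges: the function name three_way_merge and the short hash strings are hypothetical, and the policy of keeping "ours" on conflict is an assumption chosen for the example.

```python
def three_way_merge(base, ours, theirs):
    """Merge path -> content-hash maps. Returns (merged, conflicts)."""
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            winner = o            # both sides agree (including both deleted)
        elif o == b:
            winner = t            # only theirs changed
        elif t == b:
            winner = o            # only ours changed
        else:
            conflicts.append(path)
            winner = o            # keep ours until someone resolves it
        if winner is not None:    # None means the file is deleted
            merged[path] = winner
    return merged, conflicts


# Common ancestor, the target branch, and the experiment branch.
base = {"a.jpg": "h1", "labels.csv": "h2", "old.jpg": "h3"}
main = {"a.jpg": "h1", "labels.csv": "h2-main", "old.jpg": "h3"}
exp = {"a.jpg": "h1", "labels.csv": "h2-exp", "b.jpg": "h4"}

merged, conflicts = three_way_merge(base, main, exp)
```

Here the experiment's new file (b.jpg) and its deletion (old.jpg) merge cleanly, while labels.csv is reported as a conflict because both branches changed it. A file-level merge like this cannot combine two edits to the same file; that requires a row-level or format-aware merge.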


Below is a pragmatic rundown of the best tools that match “data version control for ML datasets in S3 (large files) that support branching/merging like Git,” with an eye on what it actually feels like to use them.

Oxen.ai – Version Datasets, Fine-Tune, Deploy in One Loop

Oxen.ai is an end-to-end platform where dataset version control is the first-class citizen, and fine-tuning + serverless deployment sit on top of that. It’s built for exactly the “Git but for large ML assets” problem.

What it does:

  • Version Every Asset: Large datasets, model weights, and other multimodal assets live in Oxen repositories with full history. Instead of pushing 100 GB into Git, you version them in Oxen while keeping structured access and diffs.
  • Git-style branches and commits: Create branches per experiment (exp/new-augmentation, exp/filter-spam), commit data changes, and merge when they’re validated.
  • Structured dataset workflows: Version, query, explore, and collaborate on datasets used for training, fine-tuning, and evaluation—fits both image-heavy and tabular/text workloads.
  • Zero-code fine-tuning: Start from an existing model, feed in a dataset version, and fine-tune in a few clicks—no infra management. The platform handles training.
  • One-click serverless endpoints: Once fine-tuned, deploy to a serverless endpoint with a click; consume via API in your app, pay-as-you-go.
  • Collaboration built-in: All stakeholders—ML engineering, data science, product, creative—can review and edit data together. “More eyes on the data” is not just a slogan; it’s baked into the UI.

Why it’s a strong fit for S3 + large files:

  • Designed for large, multi-modal assets (datasets, model weights) where Git breaks down.
  • Git-like workflows without Git’s size limits: no more “do we really put this in the repo?” conversations.
  • Not just versioning: you get the full flywheel—dataset → fine-tune → deploy—on top of the same versioned artifacts.

Best for:
Teams who want Git-like data workflows plus an opinionated path to training and deploying models, without building all the plumbing around S3 themselves.


lakeFS – Git for Object Stores (S3, GCS, Azure)

lakeFS is often the first recommendation when someone says “I want Git on top of S3.” It sits between your compute and your object store, turning your S3 bucket into a versioned data lake.

What it does:

  • Branching over S3: Create branches inside a lakeFS repository that is backed by an S3 bucket or prefix. Branches such as staging, prod, and per-feature branches all point at the same underlying objects with copy-on-write behavior, so creating one does not duplicate data.
  • Commit and revert: Every change (add/update/delete of objects) becomes a commit you can roll back.
  • Merges and diffs: Compare branches, see which objects differ, and merge changes with policies to protect prod.
  • Transparent integration: lakeFS exposes an S3-compatible endpoint, so tools like Spark, Presto, and other data engines can read lakeFS paths (repository, then branch, then object path) with minimal code changes.
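As a rough illustration of what a lakeFS path looks like through the S3-compatible endpoint, the sketch below builds object URIs in which the repository takes the place of the bucket and the branch (or commit ID) is the first segment of the key. The helper names, the repository name ml-datasets, and the branch names are hypothetical; the sketch also assumes the ref is a single path segment, so check the lakeFS documentation for the exact naming rules in your version.

```python
def lakefs_object_uri(repo: str, ref: str, path: str, scheme: str = "s3") -> str:
    """Build '<scheme>://<repo>/<ref>/<path>' for an S3-compatible client."""
    if not repo or not ref or "/" in repo or "/" in ref:
        raise ValueError("repo and ref must be single, non-empty path segments")
    return f"{scheme}://{repo}/{ref}/{path.lstrip('/')}"


def split_lakefs_key(key: str) -> tuple[str, str]:
    """Split an object key into (ref, path) at the first slash."""
    ref, _, path = key.partition("/")
    return ref, path


# Same logical file, read from two different branches of one repository.
prod_uri = lakefs_object_uri("ml-datasets", "main", "images/train/0001.jpg")
exp_uri = lakefs_object_uri("ml-datasets", "exp-new-labels", "images/train/0001.jpg")
```

Because the branch is part of the path, switching a training job from production data to an experiment branch is a one-string change rather than a copy of the bucket.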

Why it’s a strong fit:

  • Purely focused on Git-like semantics over S3—branch, commit, merge, revert.
  • Great if you already have a data lake and just need safety rails and reproducibility for datasets feeding your models.

Tradeoffs:

  • lakeFS is infrastructure: you manage it (unless you use managed offerings), and it doesn’t include model training, fine-tuning, or serving. It’s “control plane for S3,” not an end-to-end ML platform.
  • Collaboration UX is more data-engineer flavored; you’ll likely still need separate tools for labeling, review, and model lifecycle.

Best for:
Data teams with a mature S3 data lake who want Git semantics at the storage layer, and are comfortable owning their own ML tooling on top.


DVC – Data Version Control Tied Directly to Git Repos

DVC (Data Version Control) makes data and models feel like part of your Git repo without actually storing big blobs there. It uses Git for metadata plus a remote (e.g., S3) for payloads.

What it does:

  • Git-managed metadata: Your repo tracks .dvc files that describe datasets/models; the actual files live in S3 (or other remotes).
  • Branching/merging via Git: You branch and merge exactly as you would for code. Data references follow the Git history.
  • Pipelines: Define stages and dependencies (dvc.yaml) to track data and model provenance across steps in a pipeline.
  • Reproducibility: Check out a Git commit, then run dvc pull (or dvc checkout if the files are already cached locally) to restore the exact dataset and model files that commit references.
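The pointer-file idea behind this workflow can be sketched in a few lines. This is a simplified imitation, not DVC's code or file format: DVC's own .dvc files are YAML and record an MD5 hash, whereas this sketch writes a JSON pointer with a SHA-256 hash, and a local directory stands in for the S3 remote. The functions track, restore, and demo are hypothetical names.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def track(data_path: Path, remote_dir: Path) -> Path:
    """Copy the file into a content-addressed store and write a small pointer."""
    data = data_path.read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    blob = remote_dir / digest[:2] / digest[2:]
    blob.parent.mkdir(parents=True, exist_ok=True)
    if not blob.exists():  # identical content is stored only once
        blob.write_bytes(data)
    pointer = data_path.with_name(data_path.name + ".pointer.json")
    meta = {"sha256": digest, "size": len(data), "path": data_path.name}
    pointer.write_text(json.dumps(meta, indent=2))
    return pointer


def restore(pointer: Path, remote_dir: Path) -> Path:
    """Rebuild the data file next to its pointer from the store."""
    meta = json.loads(pointer.read_text())
    blob = remote_dir / meta["sha256"][:2] / meta["sha256"][2:]
    target = pointer.with_name(meta["path"])
    target.write_bytes(blob.read_bytes())
    return target


def demo() -> bytes:
    with tempfile.TemporaryDirectory() as tmp:
        workspace, remote = Path(tmp, "workspace"), Path(tmp, "remote")
        workspace.mkdir()
        remote.mkdir()
        dataset = workspace / "train.csv"
        dataset.write_bytes(b"id,label\n1,spam\n")
        pointer = track(dataset, remote)  # the pointer is what Git would commit
        dataset.unlink()                  # simulate a fresh clone without data
        return restore(pointer, remote).read_bytes()
```

Only the small pointer file needs to live in Git; checking out an older commit brings back an older pointer, which in turn identifies the exact bytes to fetch from the remote.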

Why it’s a strong fit for S3 + Git-centric teams:

  • If your team already lives in Git, DVC fits into existing workflows naturally.
  • You retain full control over S3 layout and IAM.

Tradeoffs:

  • Requires more manual wiring: you manage S3 remotes, storage lifecycle, and access controls.
  • Collaboration UX is Git-first; non-engineering stakeholders (PMs, labelers, creatives) may struggle without extra tooling.
  • Branching/merging operates on references to data files: Git merges the small .dvc pointer files, not the data. If two branches change the same tracked file, you have to reconcile the contents yourself (e.g., in tabular merges).

Best for:
Engineering-heavy teams that want to keep everything code-centric, and are okay with a bit of DevOps around S3 and CI.


Quilt – Data Packaging and Search on Top of S3

Quilt focuses on “data packages” and structured search, layering organization and metadata on top of S3.

What it does:

  • Data packages: Group related S3 objects as versioned bundles.
  • Catalog + search: Browse, search, and preview datasets; annotate with metadata and docs.
  • Versioning: Track package versions and roll back as needed.

Why it’s a partial fit:

  • Useful if your main pain is discoverability and cataloging of datasets in S3.
  • Does offer versioning semantics, but branching/merging is less Git-like than lakeFS or DVC.

Tradeoffs:

  • Branching/merging is more about package versions than full Git-style branch workflows.
  • Less focused specifically on the ML dataset → model lifecycle.

Best for:
Teams who primarily need a catalog and packaging layer over S3 with versioned bundles, not full-blown Git semantics.


Dolt – Git for SQL Databases

Dolt is a version-controlled SQL database (MySQL-compatible) where you can branch, diff, and merge data as if it were code.

What it does:

  • Branching/merging tables: Create branches of your database, make row-level changes, and merge with conflict resolution.
  • Diff and history: Inspect how tables changed over time at fine granularity.
  • SQL-first: Interact via familiar SQL tooling.
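To show what a row-level diff means in practice, here is a small sketch that compares two versions of a table keyed by primary key. It is not Dolt's implementation (Dolt computes diffs inside the database and exposes them through SQL); the function diff_rows, the column names, and the sample rows are all hypothetical.

```python
def diff_rows(old_rows, new_rows, key="id"):
    """Compare two table versions (lists of dicts) by primary key."""
    old = {row[key]: row for row in old_rows}
    new = {row[key]: row for row in new_rows}
    added = [new[k] for k in sorted(new.keys() - old.keys())]
    removed = [old[k] for k in sorted(old.keys() - new.keys())]
    modified = []
    for k in sorted(old.keys() & new.keys()):
        columns = sorted(set(old[k]) | set(new[k]))
        # Record (before, after) for every column whose value differs.
        changed = {
            c: (old[k].get(c), new[k].get(c))
            for c in columns
            if old[k].get(c) != new[k].get(c)
        }
        if changed:
            modified.append((k, changed))
    return {"added": added, "removed": removed, "modified": modified}


main_rows = [
    {"id": 1, "text": "buy now", "label": "spam"},
    {"id": 2, "text": "hello", "label": "ham"},
    {"id": 3, "text": "free $$$", "label": "ham"},
]
branch_rows = [
    {"id": 1, "text": "buy now", "label": "spam"},
    {"id": 3, "text": "free $$$", "label": "spam"},
    {"id": 4, "text": "meeting at 3", "label": "ham"},
]

changes = diff_rows(main_rows, branch_rows)
```

The result pinpoints that row 3 was relabeled from ham to spam, row 2 was removed, and row 4 was added, which is the granularity a reviewer needs when approving a labeling change.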

Why it’s relevant:

  • If your ML datasets are primarily tabular and fit into Dolt’s model, you get beautiful Git semantics and row-level diffs/merges.

Tradeoffs:

  • Not S3-native; it’s a database system, not a wrapper around S3 object storage.
  • Not designed for huge image/video corpora or multi-terabyte file sets.

Best for:
Data science teams whose primary datasets are relational and who want version-controlled SQL tables rather than object-store blobs.


How to Choose the Right Tool for Your S3-Based ML Datasets

Use these criteria to narrow down:

  1. Where do you want Git semantics to live?

    • Storage layer (S3 buckets): lakeFS.
    • Repo-level artifacts (datasets + weights + code): Oxen.ai, DVC.
    • SQL/tables: Dolt.
  2. Do you want just versioning or full ML lifecycle?

    • Versioning + lake semantics only: lakeFS.
    • Versioning + pipelines: DVC.
    • Versioning + fine-tuning + serverless endpoints: Oxen.ai.
  3. Who needs to touch the datasets?

    • Mostly engineers: lakeFS or DVC make sense.
    • Mixed stakeholders (ML, product, creative, labelers): Oxen.ai’s collaboration and UI will matter more; CLI-only workflows are a non-starter for them.
  4. How much infrastructure do you want to own?

    • Comfortable managing services: lakeFS, DVC (self-hosted remotes).
    • Prefer managed infra with pay-as-you-go inference: Oxen.ai.

Common Mistakes to Avoid

  • Treating S3 as an unversioned dumping ground:
    Without a version control layer, you’ll end up with dataset_v3_final_final buckets and no traceability. Pick a tool and standardize on it.

  • Ignoring collaboration workflows:
    A CLI-only solution might be “clean” for engineering, but if PMs and labelers can’t review data changes, you’ll ship models trained on outdated or low-quality data. Choose a platform that non-engineers can actually use.


Real-World Example

Imagine you’re building a multimodal model that detects harmful content in user-uploaded images and text. Your raw data lives in S3; you iterate quickly:

  1. Create a dataset repo in Oxen.ai and sync in your raw S3 data. Now images, captions, and labels live in a versioned repository.
  2. Branch for a labeling sprint: exp/hate-speech-tightening. Labelers and PMs work in that branch, updating labels and adding edge cases.
  3. Review and merge: ML engineers review diffs (see which files changed, which labels were adjusted), comment, and merge into main once everyone’s aligned.
  4. Fine-tune a model in a few clicks against the main dataset snapshot; no infrastructure to stand up.
  5. Deploy a serverless endpoint directly from the new fine-tuned model and wire it into your production moderation system via API.
  6. If incident reports spike, roll back: you can instantly see which dataset snapshot fed the current model and revert to a known-good version while you debug.

No zipping giant folders. No manual S3 copy scripts. No guessing which “final” dataset is in production.

Pro Tip: Treat dataset branches like you treat feature branches in code: small, focused, and short-lived. Point your training pipelines at specific branches/commits (not “latest in S3”) so every model has a clear data lineage you can audit later.
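One way to enforce this tip is to have every training job write a manifest that records the exact dataset commit and refuses floating references. The sketch below is an assumption-laden example, not part of any tool's API: the TrainingRun class, its field names, and the sample values are hypothetical, and the hex-string check is only a heuristic (a branch that happens to be named like a hex string would pass).

```python
import json
import re
from dataclasses import asdict, dataclass

# Heuristic: commit IDs look like 8 to 64 lowercase hex characters.
COMMIT_ID = re.compile(r"^[0-9a-f]{8,64}$")


@dataclass(frozen=True)
class TrainingRun:
    model_name: str
    dataset_repo: str
    dataset_commit: str  # must be a commit ID, never "main" or "latest"
    code_commit: str

    def __post_init__(self):
        for field_name in ("dataset_commit", "code_commit"):
            value = getattr(self, field_name)
            if not COMMIT_ID.match(value):
                raise ValueError(
                    f"{field_name}={value!r} is not a commit ID; "
                    "resolve the branch to a commit before training"
                )

    def to_manifest(self) -> str:
        """Serialize the lineage record to store next to the model weights."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


run = TrainingRun(
    model_name="moderation-v7",
    dataset_repo="harmful-content",
    dataset_commit="9f2c1ab47d03",
    code_commit="4e8d0c2a91bb",
)
manifest = run.to_manifest()
```

Storing the manifest alongside the model weights means an audit can start from a deployed model and walk back to the dataset snapshot that trained it.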

Summary

If you’re working with large ML datasets in S3 and you want branching and merging that feel like Git, you need a dedicated data version control layer. lakeFS, DVC, Quilt, Dolt, and Oxen.ai all solve pieces of this; the right choice depends on whether you just need storage-level Git semantics or an integrated loop from dataset → fine-tune → deploy.

For most ML teams moving from prototype to production, the sweet spot is a platform that:

  • Versions every asset—datasets and model weights—not just code.
  • Supports Git-like branching and merging so experimentation is cheap and safe.
  • Connects directly to model training and serving, so you always know which data trained which endpoint.

That’s the gap Oxen.ai is built to fill.
