How do I create a dataset branch in Oxen.ai, make edits, and merge it back (and resolve conflicts if needed)?

Quick Answer: In Oxen.ai, you create a dataset branch much as you would a Git branch: branch off main (or any other branch), make your edits in isolation, then merge the branch back into the target branch. If the same rows or files changed on both branches, Oxen flags merge conflicts so you can review them, pick the correct version, and keep your dataset history clean and reproducible.

Why This Matters

If you’re serious about reproducible ML, you can’t have everyone hammering on the same “production” dataset. You need safe branches for experimentation, clear history of who changed what, and a clean way to merge curated data back into the main training set—without losing work or silently overwriting rows. Oxen.ai brings Git-style branching and merging to large datasets and model artifacts so you can answer “which data trained which model?” for every release.

Key Benefits:

  • Isolate risky changes: Create branches for labeling, cleaning, and augmentation without breaking your main dataset.
  • Collaborate safely: Let ML engineers, data scientists, and product/creative teams all edit data concurrently, then review and merge.
  • Keep training reproducible: Version every dataset change, resolve conflicts explicitly, and tie models back to the exact dataset revision.

Core Concepts & Key Points

| Concept | Definition | Why it's important |
| --- | --- | --- |
| Dataset branch | A parallel line of history for a dataset repository (e.g., main → improve-labels-q2) where you can edit data without changing the source branch. | Lets you experiment, clean, or relabel data safely and independently. |
| Merge | The operation that combines changes from one branch (e.g., improve-labels-q2) back into another (e.g., main). | This is how curated data and fixes land in the canonical training dataset. |
| Conflict resolution | The process of resolving differences when the same rows/files were changed in both branches. | Prevents silent overwrites and forces explicit, auditable decisions about your training data. |

How It Works (Step-by-Step)

Think of your Oxen.ai dataset like a Git repo: it has branches, commits, and a canonical main branch that your production models depend on.

1. Create a Dataset Branch

You’ll typically branch from main (or from the last dataset snapshot used for a production model):

  1. Open your dataset repository in Oxen.ai.
  2. In the “Branches” or repo header area, locate the current branch (often main).
  3. Click the branch selector and choose Create Branch (or similar).
  4. Name your branch clearly, e.g.:
    • fix-mislabeled-animals
    • add-2026-q1-logs
    • filter-toxic-content-v2
  5. Confirm to create the branch. Oxen snapshots the current state so you can work independently.

You now have an isolated branch where you can edit rows, upload new files, or remove bad data without touching main.

2. Make Edits on the Branch

On your new branch, make all the dataset changes you need:

  • Upload new data:
    • Add new images, audio, video, or text files.
    • Import new tabular data (e.g., JSONL, CSV, Parquet) using the UI or API.
  • Edit existing rows:
    • Fix labels or annotations.
    • Normalize fields (e.g., unify category names, clean punctuation).
    • Add metadata columns for future filtering/evaluation.
  • Remove or quarantine data:
    • Delete corrupted rows/files.
    • Remove low-quality or non-compliant samples.
    • Move suspect data to a different split or mark it for review.

Each change is tracked as a commit in the branch’s history, so you always know what changed and when.
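
Steps 1 and 2 can also be driven from a terminal. The sketch below only builds and prints the commands as a dry run; it does not execute anything. It assumes the Oxen CLI mirrors Git for these operations (`oxen checkout -b`, `oxen add`, `oxen commit -m`, `oxen push`), which should be confirmed against the Oxen documentation before use. The helper name `branch_workflow_commands` and the file path are hypothetical.

```python
import shlex


def branch_workflow_commands(branch, paths, message, remote="origin"):
    """Return the CLI calls for: create branch, stage edits, commit, push.

    Assumption: the Oxen CLI follows Git's syntax for these subcommands.
    Verify each one against the Oxen docs before running it for real.
    """
    commands = [["oxen", "checkout", "-b", branch]]
    commands += [["oxen", "add", path] for path in paths]
    commands.append(["oxen", "commit", "-m", message])
    commands.append(["oxen", "push", remote, branch])
    return commands


plan = branch_workflow_commands(
    branch="fix-mislabeled-animals",
    paths=["annotations/train.csv"],  # hypothetical file path
    message="Fix incorrect dog vs wolf labels",
)

for argv in plan:
    # Dry run: print each command, properly quoted, without executing it.
    print(shlex.join(argv))
```

Printing the plan first makes it easy to review the branch name and commit message before anything touches the repository.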

Collaborate on the Branch

Oxen.ai is built for multi-role teams:

  • Invite ML engineers, annotators, and PMs/creative stakeholders to the same branch.
  • Use the UI to review diffs between commits (e.g., “Which labels changed?”).
  • Add comments or notes tied to specific files or rows (depending on your workflow).

This is where “more eyes the better” actually works—you get data quality feedback before merging into main.

3. Review Changes Before Merge

Before merging back, do a quick sanity check:

  • View the branch diff: Compare your branch against main:
    • How many rows/files were added, modified, or removed?
    • Are any critical columns (labels, splits, IDs) changed?
  • Run checks or evaluations:
    • If you fine-tuned a model on this branch’s dataset, attach the evaluation results.
    • Compare metrics vs. the previous dataset version so you’re not merging blind.

Treat this like a code review, but for training data.
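
To make the review questions above concrete, here is a minimal sketch of a diff summary over two in-memory snapshots keyed by row ID. It is plain Python for illustration, not Oxen's diff implementation; the function name `summarize_diff`, the row IDs, and the choice of `label` and `split` as critical columns are assumptions.

```python
def summarize_diff(base_rows, branch_rows, critical=("label", "split")):
    """Compare two {row_id: row_dict} snapshots and report what changed."""
    added = sorted(branch_rows.keys() - base_rows.keys())
    removed = sorted(base_rows.keys() - branch_rows.keys())
    modified = []
    critical_changes = []
    for row_id in sorted(base_rows.keys() & branch_rows.keys()):
        before, after = base_rows[row_id], branch_rows[row_id]
        if before == after:
            continue
        modified.append(row_id)
        # Record which critical columns differ for this row.
        for column in critical:
            if before.get(column) != after.get(column):
                critical_changes.append((row_id, column))
    return {
        "added": added,
        "removed": removed,
        "modified": modified,
        "critical_changes": critical_changes,
    }


main_rows = {
    "img_1": {"label": "dog", "split": "train"},
    "img_2": {"label": "wolf", "split": "train"},
    "img_3": {"label": "cat", "split": "validation"},
}
branch_rows = {
    "img_1": {"label": "coyote", "split": "train"},  # relabeled on the branch
    "img_2": {"label": "wolf", "split": "train"},    # unchanged
    "img_4": {"label": "fox", "split": "train"},     # new row
}

report = summarize_diff(main_rows, branch_rows)
print(report)
```

A report like this answers both review questions at once: the counts of added, modified, and removed rows, and whether any label or split column was touched.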

4. Merge the Branch Back into Main

Once you’re happy with the changes:

  1. In the branch view, look for Merge or Open Merge / Pull Request.
  2. Choose the target branch (usually main).
  3. Add a descriptive message:
    • “Fix incorrect dog vs wolf labels in 2,100 rows”
    • “Add Q1 2026 chat logs; filtered PII; updated toxicity labels”
  4. Submit the merge operation.

Oxen compares the branch to the target and tries to automatically merge compatible changes.

5. Resolve Merge Conflicts (If Needed)

Conflicts happen when:

  • The same row/file was modified in both branches since the branch point.
  • A row was deleted in one branch and edited in the other.
  • Both branches set the same label/field to different values.

Oxen.ai will:

  • Flag the conflicting rows/files.
  • Prevent the merge from finalizing until conflicts are resolved.
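
The conflict rules above follow the logic of a generic three-way comparison against the branch point. The sketch below illustrates that idea on in-memory rows; it is not Oxen's actual merge algorithm, and the function name `find_conflicts` and the sample rows are assumptions.

```python
def find_conflicts(base, target, source):
    """Three-way comparison of {row_id: row_dict} snapshots.

    base is the state at the branch point, target is the branch being
    merged into (e.g. main), source is the branch being merged.
    A row conflicts when both sides changed it and the results differ.
    """
    conflicts = {}
    for row_id in sorted(set(base) | set(target) | set(source)):
        b = base.get(row_id)    # None means the row is absent
        t = target.get(row_id)
        s = source.get(row_id)
        if t == s or t == b or s == b:
            continue  # same result on both sides, or only one side changed
        if t is None or s is None:
            conflicts[row_id] = "modified_vs_deleted"
        else:
            conflicts[row_id] = "both_modified"
    return conflicts


base = {
    "image_123.jpg": {"label": "cat"},
    "image_456.jpg": {"label": "dog"},
    "image_789.jpg": {"label": "fox"},
}
main = {
    "image_123.jpg": {"label": "lynx"},   # edited on main
    "image_789.jpg": {"label": "fox"},    # untouched; image_456 was deleted
}
branch = {
    "image_123.jpg": {"label": "bobcat"},   # edited on the branch too
    "image_456.jpg": {"label": "wolf"},     # edited here, deleted on main
    "image_789.jpg": {"label": "red fox"},  # only the branch changed this
}

conflicts = find_conflicts(base, main, branch)
print(conflicts)
```

Note that image_789.jpg is not reported: only one side changed it, so the change can be merged automatically.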

Typical Conflict Patterns

  1. Conflicting label edits

    • Example: On main, image_123.jpg is updated from cat → lynx. On your branch, the same image is updated from cat → bobcat.
    • Resolution: Decide which label is correct, or create a more granular class if your taxonomy allows. Update the conflict with the correct value.
  2. Row modified vs deleted

    • Example: On main, a noisy row is deleted. On your branch, that row’s label was fixed.
    • Resolution: Decide whether the row is worth keeping. Either keep your fixed version or accept the deletion.
  3. Conflicting metadata changes

    • Example: Both branches changed split for the same row (e.g., train vs validation).
    • Resolution: Choose the split that matches your evaluation and train/validation design.

Resolving Conflicts Step-by-Step

Exact UI labels may differ, but the flow is:

  1. Open the merge details or conflict view.
  2. For each conflicted row/file:
    • Inspect both versions (source branch vs target branch).
    • Pick one version or manually edit to create a combined “best” version.
  3. Save your resolutions.
  4. Once all conflicts are resolved, confirm the merge.

Oxen records this as a clean commit in main, with a clear history of:

  • Original version
  • Branch edits
  • Final resolved version

Now you can confidently say which data your next model will see.
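
The resolution flow can be sketched as a function that requires an explicit decision for every conflicted row before it produces a result. This is an illustration in plain Python, not Oxen's API; `resolve_conflicts`, the decision values, and the sample rows are assumptions.

```python
def resolve_conflicts(conflicted_ids, source, target, decisions):
    """Build the resolved rows for conflicted ids from explicit decisions.

    decisions maps row_id to "source", "target", or a dict holding a
    hand-edited row. A missing decision raises, so nothing is resolved
    implicitly.
    """
    unresolved = [row_id for row_id in conflicted_ids if row_id not in decisions]
    if unresolved:
        raise ValueError(f"unresolved conflicts: {unresolved}")

    resolved = {}
    for row_id in conflicted_ids:
        choice = decisions[row_id]
        if isinstance(choice, dict):
            resolved[row_id] = choice  # manually combined version
        elif choice in ("source", "target"):
            side = source if choice == "source" else target
            # If the chosen side deleted the row, the deletion is accepted.
            if row_id in side:
                resolved[row_id] = side[row_id]
        else:
            raise ValueError(f"unknown decision for {row_id}: {choice!r}")
    return resolved


source = {
    "image_123.jpg": {"label": "bobcat"},
    "image_456.jpg": {"label": "wolf"},
}
target = {
    "image_123.jpg": {"label": "lynx"},  # image_456.jpg was deleted here
}
conflicted = ["image_123.jpg", "image_456.jpg"]
decisions = {
    "image_123.jpg": "source",  # keep the branch's relabel
    "image_456.jpg": "target",  # accept the deletion made on main
}

resolved = resolve_conflicts(conflicted, source, target, decisions)
print(resolved)
```

Raising on a missing decision mirrors the rule that a merge cannot be finalized while conflicts remain open.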

Common Mistakes to Avoid

  • Using main for experiments:
    Avoid making ad-hoc edits directly on main. Create a branch for every meaningful change set—labeling campaigns, new data imports, or normalization passes.

  • Merging without review:
    Don’t merge big data changes without checking the diff and your model metrics. Always review what changed and how it affects performance before you bless a new dataset revision.

  • Ignoring conflicts:
    Never “force” away conflicts just to get a merge through. Every conflict is a signal that model behavior might change—resolve them thoughtfully and document decisions.

  • Mixing unrelated changes in one branch:
    Keep branches focused. Don’t combine “add new data” with a massive “rename labels” change in the same branch; it makes review, rollback, and debugging harder.

Real-World Example

Say you maintain a vision dataset used to fine-tune a detection model. You’ve noticed mislabeling in your “urban animals” subset and want to fix it without disrupting the rest of the team.

  1. From the dataset repo’s main, you create a branch: fix-urban-animals-q2.
  2. On this branch:
    • You relabel 3,000 images where “dog” vs “coyote” vs “wolf” was inconsistent.
    • A PM helps review ambiguous cases directly in Oxen.
    • You add a new field confidence_source to track if labels came from vendor A vs internal annotators.
  3. Another teammate, meanwhile, adds new images and minor fixes directly on main.
  4. When you open a merge from fix-urban-animals-q2 → main, Oxen flags conflicts on 200 images that both branches touched.
  5. You step through the conflicts:
    • For some, your branch is correct (careful relabeling).
    • For others, main already carries labels from a newer label schema, so you keep main's version.
  6. After resolving all conflicts, you merge. main now has:
    • New images.
    • Cleaned labels.
    • A documented, reproducible history that ties into the next fine-tuned model.

You can now fine-tune your model on the merged dataset, evaluate it, and deploy a new serverless endpoint—knowing exactly which branch and commit the training came from.
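
One lightweight way to keep that link is to store a small manifest next to each trained model. The sketch below is an assumption about how a team might do this, not an Oxen feature; the class name, field names, model name, and repository name are hypothetical, and the commit ID is left as a placeholder to fill in from the merge.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainingRecord:
    """Ties one trained model to the dataset revision it was trained on."""
    model_name: str
    dataset_repo: str
    dataset_branch: str
    dataset_commit: str


record = TrainingRecord(
    model_name="urban-animals-detector",          # hypothetical model name
    dataset_repo="my-org/urban-animals",          # hypothetical repository
    dataset_branch="main",
    dataset_commit="<commit id of the merge>",    # placeholder, fill in
)

# Serialize with sorted keys so the manifest diffs cleanly between runs.
manifest = json.dumps(asdict(record), indent=2, sort_keys=True)
print(manifest)
```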

Pro Tip: Treat dataset branches like code branches. Use small, focused branches (e.g., “normalize-labels”, “add-q2-data”) and merge them frequently. The smaller the diff, the easier the review, conflict resolution, and rollback if a model regression appears.

Summary

Oxen.ai brings Git-style branching and merging to datasets and large AI assets so you can:

  • Create branches from main to isolate data work.
  • Edit, clean, and augment data collaboratively on those branches.
  • Merge changes back with a clear diff and explicit conflict resolution.

That workflow turns “random changes in S3” into a disciplined loop: version dataset → fine-tune model → evaluate → deploy → repeat, with a clean answer to “which data trained which model?” every time.
