How do I create a dataset branch in Oxen.ai, make edits, and merge it back (and resolve conflicts if needed)?



Most teams coming into Oxen.ai already know Git-style workflows for code, but hit friction when they try to apply the same discipline to datasets. The good news: creating a dataset branch, making edits, and merging back (with conflict resolution) works very much like Git—just designed for large, multi-modal AI assets instead of tiny text files.

Quick Answer: In Oxen.ai you create a dataset branch from your main repository, make your changes (add/remove/update files, tweak metadata, curate labels), then open a merge request to bring those edits back into the main branch. If Oxen detects conflicts—like two branches editing the same rows or files—you’ll resolve them in a review workflow before completing the merge, preserving a clear history of “which data trained which model.”

Why This Matters

Branching datasets is the difference between “YOLO edits in S3” and a reproducible AI workflow. When you isolate changes in a branch, you can review, test, and benchmark those changes before they impact production models. Oxen.ai makes that Git-like flow work for datasets and model artifacts, so you can tie every fine-tuned model back to the exact dataset version that produced it.

Key Benefits:

  • Safe experimentation: Try new labeling schemes, filters, or augmentations on a branch without risking your primary training dataset.
  • Reproducible training: Every merge creates a traceable dataset version you can point to when asking “which data trained which model?”.
  • Collaborative review: Product, ML, and creative teams can propose changes in branches, then review and resolve conflicts before merging.

Core Concepts & Key Points

  • Dataset branch: a named line of development in an Oxen repository that snapshots your dataset and lets you make changes in isolation. Why it matters: you can run parallel experiments and reviews without overwriting your main training/eval data.
  • Commit & merge: a commit records a set of dataset changes; a merge applies those commits from one branch into another (usually back into main). Why it matters: it creates a permanent, auditable history of how your dataset evolved.
  • Conflict resolution: the process of handling overlapping edits (e.g., the same row/file changed on two branches) when merging branches. Why it matters: it prevents silent overwrites and ensures data-quality decisions are explicit, not accidental.

How It Works (Step-by-Step)

At a high level, you’ll:

  1. Create a dataset branch from your main branch.
  2. Make your edits on that branch (upload, remove, or modify data).
  3. Open a merge request, review changes, and resolve conflicts if they exist.

Below, I’ll walk through the flow as if you’re using Oxen’s standard repo + branch model. If you’re used to Git, this will feel familiar, just with datasets instead of .py files.

1. Create a dataset branch

You usually start from main (or whatever branch backs your production model).

  1. Open your repository

    • Go to your dataset repository in Oxen.ai (for example, a training-images or Thinking-LLMs style repo).
    • Confirm you’re on the correct base branch (commonly main).
  2. Create the branch

    • Use the UI’s branch control to create a new branch.
    • Name it by intent and scope, for example:
      • add-spanish-translations
      • clean-noisy-labels
      • augment-product-photos-v2
    • The new branch starts as a copy of the base branch at its current commit.
  3. Confirm isolation

    • Once created, you’re “checked out” to that branch in the UI.
    • Any changes you make now (uploads, deletes, metadata edits) apply only to this branch until you merge.

2. Make edits on the branch

This is where you curate your dataset. You can treat the branch as a sandbox for changes that would be risky to make directly on main.

Common operations:

  • Upload or add data

    • Add new images, audio, video, or text files.
    • Upload new metadata files (e.g., JSONL, CSV) with additional columns like category, language, or quality_score.
    • For text repos (like the Thinking-LLMs examples), you might:
      • Add new conversation examples.
      • Append a category or difficulty column.
  • Modify existing data

    • Fix labels or annotations (bounding boxes, captions, tags).
    • Normalize fields (e.g., unify category naming: dog vs. Dog vs. dogs).
    • Delete out-of-policy or low-quality samples.
  • Run curation passes

    • Filter out duplicates or near-duplicates.
    • Flag edge cases for human review.
    • Split data into train / val / test subsets.
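The curation passes above can be sketched as plain-Python transforms over a JSONL-style metadata table. The field names (`text`, `category`, `split`), the naive singularization, and the 80/10/10 ratios below are illustrative assumptions, not part of Oxen's API:

```python
import random

def curate(rows, seed=42):
    """Normalize labels, drop exact duplicates, and assign splits.

    `rows` is a list of dicts with illustrative fields `text` and
    `category`, mirroring one row per line of a JSONL metadata file.
    """
    # Normalize category naming (dog vs. Dog vs. dogs -> dog).
    # rstrip("s") is a naive singularization, fine for this toy example.
    for row in rows:
        row["category"] = row["category"].strip().lower().rstrip("s")

    # Filter out exact duplicates by text content, keeping the first copy.
    seen, unique = set(), []
    for row in rows:
        if row["text"] not in seen:
            seen.add(row["text"])
            unique.append(row)

    # Deterministic train/val/test split (80/10/10) so reruns agree.
    rng = random.Random(seed)
    rng.shuffle(unique)
    n = len(unique)
    for i, row in enumerate(unique):
        row["split"] = "train" if i < 0.8 * n else "val" if i < 0.9 * n else "test"
    return unique
```

Because the seed is fixed, committing the output of a pass like this to your branch gives reviewers the exact rows your next fine-tune will see.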

As you iterate:

  1. Save and commit changes

    • Group related edits into logical commits with clear messages, for example:
      • fix mislabeled cats as dogs in batch_07
      • add 5k spanish QA examples
      • drop low-resolution product images < 256px
    • Each commit becomes part of the branch history, which is critical later when you want to explain why model behavior changed.
  2. (Optional) Fine-tune and evaluate

    • On this branch, you can:
      • Use Oxen’s zero-code fine-tuning to train a custom model on the curated dataset in a few clicks.
      • Run evaluations and benchmarks (e.g., on a held-out eval set).
    • This lets you tie performance metrics directly to “dataset branch X at commit Y.”
  3. Share for review

    • Invite collaborators: ML engineers, product managers, content/creative reviewers.
    • Let them:
      • Inspect diffs (which files/rows were added/changed/removed).
      • Leave feedback or request tweaks.
    • This is where “the more eyes, the better” actually shows up in the workflow instead of in a slide deck.
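A lightweight way to get the “dataset branch X at commit Y” traceability described above is to key every evaluation run by its branch and commit id. The run-log structure, branch names, commit ids, and metric values below are all hypothetical:

```python
import json

def log_eval_run(runs, branch, commit_id, metrics):
    """Append one evaluation record keyed by the dataset version behind it.

    Keeping branch + commit next to the metrics is what lets you answer
    "which data trained which model?" later. All names are illustrative.
    """
    runs.append({"branch": branch, "commit": commit_id, "metrics": metrics})
    return runs

runs = []
log_eval_run(runs, "clean-noisy-labels", "a1b2c3", {"accuracy": 0.91})
log_eval_run(runs, "main", "d4e5f6", {"accuracy": 0.88})

# The best run points straight at the dataset branch/commit that produced it.
best = max(runs, key=lambda r: r["metrics"]["accuracy"])
print(json.dumps(best, indent=2))
```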

3. Open a merge request back to main

Once you’re confident in your changes and (ideally) have evaluation evidence they improve the model:

  1. Create the merge request

    • From your branch, open a merge request targeting main (or another branch).
    • Add a concise description:
      • What changed in the dataset.
      • Why you made the change.
      • Links to benchmarks or models trained on this dataset version.
  2. Review the dataset diff

    • Oxen will show:
      • Files/rows added
      • Files/rows removed
      • Files/rows modified (e.g., label or metadata changes)
    • This is where you confirm:
      • No accidental large deletes.
      • No sensitive or out-of-policy data slipped in.
      • Labeling and metadata follow your standards.
  3. Handle approvals

    • Depending on your team process:
      • Data owners/ML leads sign off on the change.
      • Product/content leads sign off on policy and brand concerns.
    • Once approved, you’re ready to merge—unless there are conflicts.
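The diff review above boils down to comparing two snapshots of the same rows. A minimal sketch, assuming each snapshot is a dict mapping a row id to its parsed JSONL row; the 10% delete guardrail is an illustrative team policy, not an Oxen setting:

```python
def diff_summary(base, branch, max_delete_frac=0.1):
    """Summarize row-level changes between two dataset snapshots.

    `base` and `branch` map row id -> row dict. The delete-fraction
    threshold is a hypothetical review guardrail against accidental
    large deletes.
    """
    added = branch.keys() - base.keys()
    removed = base.keys() - branch.keys()
    modified = {k for k in base.keys() & branch.keys() if base[k] != branch[k]}

    summary = {"added": len(added), "removed": len(removed), "modified": len(modified)}
    # Flag merges that silently drop a large share of the dataset.
    summary["delete_warning"] = len(removed) > max_delete_frac * max(len(base), 1)
    return summary
```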

Common Mistakes to Avoid

  • Editing directly on main:
    How to avoid it: Always create a dedicated branch for non-trivial edits, especially when multiple people are involved or when the dataset backs a production model.

  • Merging without reviewing diffs or conflicts:
    How to avoid it: Treat dataset merges like code merges—always inspect what’s added/removed/changed and walk through any conflicts deliberately.

Resolving Conflicts When Merging Back

Conflicts happen when two branches make incompatible changes to the same underlying data. In code, that’s two people editing the same function; in datasets, it’s often:

  • The same image or row labeled differently on two branches.
  • The same metadata file edited in two places.
  • One branch deletes an item that another branch edits.

Oxen detects these conflicts at merge time and forces you to pick a winner instead of silently overwriting data.

1. Identify conflict types

Expect conflicts like:

  • Label/annotation conflicts
    • Example: image_1023.jpg labeled cat in main, but dog in your branch.
  • Metadata conflicts
    • Example: JSONL/CSV row with different category or status between branches.
  • Existence conflicts
    • Example: A file is deleted on main but modified on your branch.
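All three conflict types reduce to the same three-way check: did both branches change the same row, relative to their common ancestor, in incompatible ways? A minimal sketch, assuming rows keyed by id with deletions modeled as missing keys (this illustrates the logic, not Oxen's internal merge implementation):

```python
def find_conflicts(base, ours, theirs):
    """Detect row-level merge conflicts against a common ancestor.

    Each argument maps row id -> row dict. A conflict is any row both
    branches changed relative to `base` in different ways, including
    one side deleting a row the other side edited.
    """
    conflicts = {}
    for key in base.keys() | ours.keys() | theirs.keys():
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        ours_changed = o != b
        theirs_changed = t != b
        # Only a real conflict if both sides changed it AND they disagree.
        if ours_changed and theirs_changed and o != t:
            conflicts[key] = {"base": b, "ours": o, "theirs": t}
    return conflicts
```

If only one branch touched a row, or both made the identical change, the merge proceeds cleanly; everything else lands in the conflict set for explicit resolution.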

2. Use a clear resolution strategy

Instead of guessing, decide upfront how your team resolves conflicts:

  • Trust one branch for specific fields
    • Example: “Always trust the label-cleanup branch for labels, but trust main for the split column.”
  • Prefer newer manual labels over older/automatic ones
    • Use commit messages and branch intent to decide which labels to keep.
  • Fallback to manual review
    • For high-impact items (e.g., safety-sensitive content), pull in a human to inspect the conflict directly.
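A “trust one branch for specific fields” policy like the first bullet can be sketched as a per-field merge, where a `field_owners` map encodes the team’s decision. The function name, field names, and fallback rule are all illustrative:

```python
def resolve(conflict, field_owners, default="ours"):
    """Merge one conflicting row using per-field trust rules.

    `field_owners` says which side wins each field, e.g.
    {"label": "ours", "split": "theirs"}; unlisted fields fall back to
    `default`. In practice you'd commit the merged row back to the branch.
    """
    ours, theirs = conflict["ours"] or {}, conflict["theirs"] or {}
    merged = {}
    for field in ours.keys() | theirs.keys():
        side = field_owners.get(field, default)
        winner, loser = (ours, theirs) if side == "ours" else (theirs, ours)
        # If the winning side lacks the field, fall back to the other side.
        merged[field] = winner.get(field, loser.get(field))
    return merged
```

Writing the policy down as data like this (rather than deciding ad hoc per row) is what makes large conflict sets tractable and the resolution auditable.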

3. Apply resolutions and re-run checks

  1. Resolve each conflict in the review UI or by updating the branch

    • Choose the desired version of the file/row.
    • Or manually construct a merged version that combines both branches’ changes.
  2. Recommit the resolved data

    • Commit your conflict resolution changes back to the branch.
    • The merge request will update, showing conflicts as resolved.
  3. Re-run your evaluations

    • If conflict resolution changed labels or data composition meaningfully, re-run:
      • Model fine-tuning (if needed).
      • Evaluation metrics and spot checks.
    • This confirms your final merged dataset still delivers the performance you expect.
  4. Complete the merge

    • Once conflicts are cleared and the diff looks correct, complete the merge into main.
    • Oxen now records this merged state as the latest dataset version, ready to back fine-tuned models and serverless endpoints.

Real-World Example

Imagine you’re maintaining a conversational dataset for a customer-support LLM—similar to a Thinking-LLMs style repo with a big examples.jsonl file.

  1. Your main dataset powers a fine-tuned support bot already deployed behind an Oxen serverless endpoint.
  2. You create a branch called add-billing-escalations-v2 from main.
  3. On that branch you:
    • Add 15,000 new billing-related conversations.
    • Fix some outdated responses where the refund policy changed.
    • Add a new category column (e.g., billing, shipping, technical) to examples.jsonl.
  4. Meanwhile, a teammate creates a cleanup-off-topic branch and:
    • Removes hundreds of off-topic or low-quality exchanges.
    • Relabels some conversations from general to account.

When you open your merge request, Oxen flags conflicts in examples.jsonl where:

  • Your branch labeled a conversation as billing, and your teammate’s branch labeled the same one as account.
  • Your branch added a category column; their branch didn’t, but changed the response text.

You resolve conflicts by:

  1. Reviewing each overlapping conversation.
  2. Agreeing as a team:
    • Use your new category column (since future routing depends on it).
    • In case of disagreement, prefer the branch that used the updated policy doc.
  3. Committing the merged version to your branch and re-running evaluation:
    • Fine-tune the support model on the merged dataset via Oxen’s zero-code fine-tuning.
    • Validate that ticket resolution metrics improve on your internal benchmark set.
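The resolution steps above can be sketched as a three-way merge of a single examples.jsonl row, where the team’s “our branch owns `category`” decision breaks the tie. All field values are made up for illustration:

```python
# One conversation row, in three versions: common ancestor plus two branches.
base   = {"prompt": "Refund?", "response": "Old policy text", "category": "general"}
ours   = {"prompt": "Refund?", "response": "Old policy text", "category": "billing"}  # our relabel
theirs = {"prompt": "Refund?", "response": "New policy text", "category": "account"}  # teammate's edit + relabel

merged = dict(base)
for field in base:
    o_changed = ours[field] != base[field]
    t_changed = theirs[field] != base[field]
    if o_changed and t_changed:
        # Team policy: our branch owns `category` (future routing depends on it).
        merged[field] = ours[field] if field == "category" else theirs[field]
    elif o_changed:
        merged[field] = ours[field]
    elif t_changed:
        merged[field] = theirs[field]

print(merged)  # category from our branch, response from the policy-update branch
```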

Once the merge completes:

  • main now reflects the curated, conflict-resolved dataset.
  • You promote the new fine-tuned support model to a serverless endpoint in one click—backed by a clear dataset history.

Pro Tip: When you’re planning large curation passes, create focused branches (e.g., label-cleanup-billing-q1-2026, add-es-es-data) rather than one mega-branch. Smaller, well-scoped branches mean fewer conflicts, faster reviews, and cleaner “this change did X to model quality” stories.

Summary

Branching datasets in Oxen.ai lets you apply Git-like discipline to the asset that matters most for model quality: your data. Create a branch, make your changes, validate them with fine-tuning and evaluation, then merge back into main with explicit conflict resolution when needed. The result is a clean, auditable history of how your training and evaluation data evolves—and a repeatable loop from dataset → fine-tuned model → deployed endpoint.
