
How do I create a dataset branch in Oxen.ai, make edits, and merge it back (and resolve conflicts if needed)?
Most teams coming into Oxen.ai already know Git-style workflows for code, but hit friction when they try to apply the same discipline to datasets. The good news: creating a dataset branch, making edits, and merging back (with conflict resolution) works very much like Git—just designed for large, multi-modal AI assets instead of tiny text files.
Quick Answer: In Oxen.ai you create a dataset branch from your main repository, make your changes (add/remove/update files, tweak metadata, curate labels), then open a merge request to bring those edits back into the main branch. If Oxen detects conflicts—like two branches editing the same rows or files—you’ll resolve them in a review workflow before completing the merge, preserving a clear history of “which data trained which model.”
Why This Matters
Branching datasets is the difference between “YOLO edits in S3” and a reproducible AI workflow. When you isolate changes in a branch, you can review, test, and benchmark those changes before they impact production models. Oxen.ai makes that Git-like flow work for datasets and model artifacts, so you can tie every fine-tuned model back to the exact dataset version that produced it.
Key Benefits:
- Safe experimentation: Try new labeling schemes, filters, or augmentations on a branch without risking your primary training dataset.
- Reproducible training: Every merge creates a traceable dataset version you can point to when asking “which data trained which model?”
- Collaborative review: Product, ML, and creative teams can propose changes in branches, then review and resolve conflicts before merging.
Core Concepts & Key Points
| Concept | Definition | Why it's important |
|---|---|---|
| Dataset branch | A named line of development in an Oxen repository that snapshots your dataset and lets you make changes in isolation. | Lets you run parallel experiments and reviews without overwriting your main training/eval data. |
| Commit & merge | A commit records a set of dataset changes; a merge applies those commits from one branch into another (usually back into main). | Creates a permanent, auditable history of how your dataset evolved. |
| Conflict resolution | The process of handling overlapping edits (e.g., same row/file changed on two branches) when merging branches. | Prevents silent overwrites and ensures data quality decisions are explicit, not accidental. |
How It Works (Step-by-Step)
At a high level, you’ll:
- Create a dataset branch from your main branch.
- Make your edits on that branch (upload, remove, or modify data).
- Open a merge request, review changes, and resolve conflicts if they exist.
Below I’ll walk through the flow as if you’re using Oxen’s standard repo + branch model. If you’re used to Git, this will feel familiar—just with datasets instead of .py files.
1. Create a dataset branch
You usually start from main (or whatever branch backs your production model).
- Open your repository
  - Go to your dataset repository in Oxen.ai (for example, a `training-images` or `Thinking-LLMs` style repo).
  - Confirm you’re on the correct base branch (commonly `main`).
- Create the branch
  - Use the UI’s branch control to create a new branch.
  - Name it by intent and scope, for example: `add-spanish-translations`, `clean-noisy-labels`, `augment-product-photos-v2`.
  - The new branch starts as a copy of the base branch at its current commit.
- Confirm isolation
  - Once created, you’re “checked out” to that branch in the UI.
  - Any changes you make now (uploads, deletes, metadata edits) apply only to this branch until you merge.
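The branch model above can be sketched in plain Python. This is an illustrative simulation, not the Oxen API: the point is that a branch is just a named pointer to a commit, so creating one is cheap and copies no data.

```python
# Illustrative simulation of the branch model (not the Oxen API):
# a branch is a named pointer to a commit, and creating a branch
# copies the base branch's current commit without copying any data.
branches = {"main": "commit_a1b2c3"}

def create_branch(branches, name, base="main"):
    if name in branches:
        raise ValueError(f"branch '{name}' already exists")
    if base not in branches:
        raise ValueError(f"base branch '{base}' not found")
    branches[name] = branches[base]  # new branch starts at the base's head
    return branches[name]

head = create_branch(branches, "add-spanish-translations")
print(head)  # the new branch points at main's current commit
```

Because the new branch shares the base branch’s commit until you change something, isolation comes for free: edits advance the branch’s pointer, while `main` stays put.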
2. Make edits on the branch
This is where you curate your dataset. You can treat the branch as a sandbox for changes that would be risky to make directly on main.
Common operations:
- Upload or add data
  - Add new images, audio, video, or text files.
  - Upload new metadata files (e.g., JSONL, CSV) with additional columns like `category`, `language`, or `quality_score`.
  - For text repos (like the `Thinking-LLMs` examples), you might:
    - Add new conversation examples.
    - Append a `category` or `difficulty` column.
- Modify existing data
  - Fix labels or annotations (bounding boxes, captions, tags).
  - Normalize fields (e.g., unify category naming: `dog` vs. `Dog` vs. `dogs`).
  - Delete out-of-policy or low-quality samples.
- Run curation passes
  - Filter out duplicates or near-duplicates.
  - Flag edge cases for human review.
  - Split data into `train`/`val`/`test` subsets.
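A curation pass like the category normalization above can be sketched as a small script over parsed JSONL records. The alias map and field names here are hypothetical examples, not anything Oxen prescribes:

```python
import json

# Hypothetical alias map: unify "dog" / "Dog" / "dogs" into one canonical label.
CATEGORY_ALIASES = {"dogs": "dog", "cats": "cat"}

def normalize_category(value: str) -> str:
    canonical = value.strip().lower()
    return CATEGORY_ALIASES.get(canonical, canonical)

def normalize_rows(rows):
    # rows are parsed JSONL records; only the "category" field is touched
    for row in rows:
        if "category" in row:
            row["category"] = normalize_category(row["category"])
    return rows

rows = [json.loads(line) for line in [
    '{"id": 1, "category": "Dog"}',
    '{"id": 2, "category": "dogs"}',
    '{"id": 3, "category": "cat"}',
]]
print(normalize_rows(rows))
```

Running a pass like this on the branch, then committing the rewritten file, keeps the cleanup reviewable as a single logical change.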
As you iterate:
- Save and commit changes
  - Group related edits into logical commits with clear messages, for example: `fix mislabeled cats as dogs in batch_07`, `add 5k spanish QA examples`, `drop low-resolution product images < 256px`.
  - Each commit becomes part of the branch history, which is critical later when you want to explain why model behavior changed.
- (Optional) Fine-tune and evaluate
  - On this branch, you can:
    - Use Oxen’s zero-code fine-tuning to train a custom model on the curated dataset in a few clicks.
    - Run evaluations and benchmarks (e.g., on a held-out eval set).
  - This lets you tie performance metrics directly to “dataset branch X at commit Y.”
- Share for review
  - Invite collaborators: ML engineers, product managers, content/creative reviewers.
  - Let them:
    - Inspect diffs (which files/rows were added/changed/removed).
    - Leave feedback or request tweaks.
  - This is where “the more eyes the better” actually shows up in the workflow instead of in a slide deck.
3. Open a merge request back to main
Once you’re confident in your changes and (ideally) have evaluation evidence they improve the model:
- Create the merge request
  - From your branch, open a merge request targeting `main` (or another branch).
  - Add a concise description:
    - What changed in the dataset.
    - Why you made the change.
    - Links to benchmarks or models trained on this dataset version.
- Review the dataset diff
  - Oxen will show:
    - Files/rows added
    - Files/rows removed
    - Files/rows modified (e.g., label or metadata changes)
  - This is where you confirm:
    - No accidental large deletes.
    - No sensitive or out-of-policy data slipped in.
    - Labeling and metadata follow your standards.
- Handle approvals
  - Depending on your team process:
    - Data owners/ML leads sign off on the change.
    - Product/content leads sign off on policy and brand concerns.
  - Once approved, you’re ready to merge—unless there are conflicts.
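The added/removed/modified view a merge request shows can be sketched as a diff between two snapshots of rows keyed by id. This is a pure-Python illustration of the idea, not Oxen’s diff engine:

```python
def diff_rows(base, branch):
    """Compute added/removed/modified rows between two dataset snapshots.

    base and branch map a row id to that row's contents; this mirrors the
    kind of summary a merge request diff shows (illustrative only).
    """
    added = {k: branch[k] for k in branch.keys() - base.keys()}
    removed = {k: base[k] for k in base.keys() - branch.keys()}
    modified = {
        k: (base[k], branch[k])
        for k in base.keys() & branch.keys()
        if base[k] != branch[k]
    }
    return added, removed, modified

base = {"img_1": {"label": "cat"}, "img_2": {"label": "dog"}}
branch = {"img_1": {"label": "cat"}, "img_2": {"label": "cow"}, "img_3": {"label": "cat"}}
added, removed, modified = diff_rows(base, branch)
print(added, removed, modified)
```

Scanning the `removed` and `modified` buckets before merging is exactly how you catch accidental large deletes or label regressions.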
Common Mistakes to Avoid
- Editing directly on `main`:
  How to avoid it: Always create a dedicated branch for non-trivial edits, especially when multiple people are involved or when the dataset backs a production model.
- Merging without reviewing diffs or conflicts:
  How to avoid it: Treat dataset merges like code merges—always inspect what’s added/removed/changed and walk through any conflicts deliberately.
Resolving Conflicts When Merging Back
Conflicts happen when two branches make incompatible changes to the same underlying data. In code, that’s two people editing the same function; in datasets, it’s often:
- The same image or row labeled differently on two branches.
- The same metadata file edited in two places.
- One branch deletes an item that another branch edits.
Oxen detects these conflicts at merge time and forces you to pick a winner instead of silently overwriting data.
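The detection rule can be sketched as a three-way check: a row conflicts when both branches diverged from the common base and also disagree with each other. This is a simplified illustration, not Oxen’s actual conflict detector:

```python
def detect_conflicts(base, ours, theirs):
    """Flag row ids where both branches changed the same row differently.

    Each argument maps a row id to its contents; a missing id means the
    row was deleted on that branch. Illustrative three-way check only.
    """
    conflicts = []
    for key in set(base) | set(ours) | set(theirs):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o != t and o != b and t != b:
            conflicts.append(key)  # both sides diverged from base, and disagree
    return sorted(conflicts)

base = {"img_1023.jpg": "cat", "img_2048.jpg": "dog"}
ours = {"img_1023.jpg": "dog", "img_2048.jpg": "dog"}      # we relabeled 1023
theirs = {"img_1023.jpg": "tiger", "img_2048.jpg": "dog"}  # they relabeled it too
print(detect_conflicts(base, ours, theirs))  # ['img_1023.jpg']
```

Note that `get` returning `None` for a missing id also makes delete-vs-edit show up as a conflict, matching the “existence conflict” case below.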
1. Identify conflict types
Expect conflicts like:
- Label/annotation conflicts
- Example:
image_1023.jpglabeledcatinmain, butdogin your branch.
- Example:
- Metadata conflicts
- Example: JSONL/CSV row with different
categoryorstatusbetween branches.
- Example: JSONL/CSV row with different
- Existence conflicts
- Example: A file is deleted on
mainbut modified on your branch.
- Example: A file is deleted on
2. Use a clear resolution strategy
Instead of guessing, decide upfront how your team resolves conflicts:
- Trust one branch for specific fields
  - Example: “Always trust the `label-cleanup` branch for labels, but trust `main` for the `split` column.”
- Prefer newer manual labels over older/automatic ones
  - Use commit messages and branch intent to decide which labels to keep.
- Fall back to manual review
  - For high-impact items (e.g., safety-sensitive content), pull in a human to inspect the conflict directly.
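A “trust one branch per field” policy like the first strategy can be written down as a small resolver. The field-owner map below is a hypothetical team policy, and the function is a sketch, not an Oxen feature:

```python
def resolve_row(ours, theirs, field_owners):
    """Resolve a conflicting row by trusting a named side per field.

    field_owners maps a column name to "ours" or "theirs"; unlisted
    fields default to "ours". Hypothetical policy, illustration only.
    """
    merged = {}
    for field in set(ours) | set(theirs):
        owner = field_owners.get(field, "ours")
        source = ours if owner == "ours" else theirs
        fallback = theirs if owner == "ours" else ours
        # fall back to the other side if the trusted side lacks the field
        merged[field] = source.get(field, fallback.get(field))
    return merged

# "Trust the label-cleanup branch (ours) for labels, but main (theirs) for split."
ours = {"label": "dog", "split": "train"}
theirs = {"label": "cat", "split": "val"}
result = resolve_row(ours, theirs, {"label": "ours", "split": "theirs"})
print(result)  # label comes from ours ("dog"), split from theirs ("val")
```

Encoding the policy once and applying it to every conflicting row keeps resolutions consistent, instead of re-deciding per row.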
3. Apply resolutions and re-run checks
- Resolve each conflict in the review UI or by updating the branch
  - Choose the desired version of the file/row.
  - Or manually construct a merged version that combines both branches’ changes.
- Recommit the resolved data
  - Commit your conflict resolution changes back to the branch.
  - The merge request will update, showing conflicts as resolved.
- Re-run your evaluations
  - If conflict resolution changed labels or data composition meaningfully, re-run:
    - Model fine-tuning (if needed).
    - Evaluation metrics and spot checks.
  - This confirms your final merged dataset still delivers the performance you expect.
- Complete the merge
  - Once conflicts are cleared and the diff looks correct, complete the merge into `main`.
  - Oxen now records this merged state as the latest dataset version, ready to back fine-tuned models and serverless endpoints.
Real-World Example
Imagine you’re maintaining a conversational dataset for a customer-support LLM—similar to a Thinking-LLMs style repo with a big `examples.jsonl` file.
- Your main dataset powers a fine-tuned support bot already deployed behind an Oxen serverless endpoint.
- You create a branch called `add-billing-escalations-v2` from `main`.
- On that branch you:
  - Add 15,000 new billing-related conversations.
  - Fix some outdated responses where the refund policy changed.
  - Add a new `category` column (e.g., `billing`, `shipping`, `technical`) to `examples.jsonl`.
- Meanwhile, a teammate creates a `cleanup-off-topic` branch and:
  - Removes hundreds of off-topic or low-quality exchanges.
  - Relabels some conversations from `general` to `account`.
When you open your merge request, Oxen flags conflicts in `examples.jsonl` where:
- Your branch labeled a conversation as `billing`, and your teammate’s branch labeled the same one as `account`.
- Your branch added a `category` column; their branch didn’t, but changed the `response` text.
You resolve conflicts by:
- Reviewing each overlapping conversation.
- Agreeing as a team:
  - Use your new `category` column (since future routing depends on it).
  - In case of disagreement, prefer the branch that used the updated policy doc.
- Committing the merged version to your branch and re-running evaluation:
  - Fine-tune the support model on the merged dataset via Oxen’s zero-code fine-tuning.
  - Validate that ticket resolution metrics improve on your internal benchmark set.
Once the merge completes:
- `main` now reflects the curated, conflict-resolved dataset.
- You promote the new fine-tuned support model to a serverless endpoint in one click—backed by a clear dataset history.
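The column-add vs. response-edit situation in this example is the friendly case: the two branches touched different fields of the same record, so a field-by-field three-way merge can combine them without losing either change. A pure-Python sketch of that idea (not Oxen’s merge implementation):

```python
def merge_row(base, ours, theirs):
    """Three-way merge of one JSONL record, field by field.

    Changes that touch different fields combine cleanly; a field both
    sides changed differently is returned as a conflict to review.
    Illustrative only, not Oxen's merge implementation.
    """
    merged, conflicts = {}, {}
    for field in set(base) | set(ours) | set(theirs):
        b = base.get(field)
        o = ours.get(field, b)
        t = theirs.get(field, b)
        if o == t:
            merged[field] = o
        elif o == b:          # only they changed it
            merged[field] = t
        elif t == b:          # only we changed it
            merged[field] = o
        else:                 # both changed it differently: real conflict
            conflicts[field] = (o, t)
    return merged, conflicts

base = {"id": 7, "response": "Refunds take 30 days."}
ours = {"id": 7, "response": "Refunds take 30 days.", "category": "billing"}
theirs = {"id": 7, "response": "Refunds take 14 days."}
merged, conflicts = merge_row(base, ours, theirs)
# merged keeps their updated response AND our new category column
```

Only the rows where `conflicts` is non-empty need the team discussion described above; everything else merges mechanically.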
Pro Tip: When you’re planning large curation passes, create focused branches (e.g., `label-cleanup-billing-q1-2026`, `add-es-es-data`) rather than one mega-branch. Smaller, well-scoped branches mean fewer conflicts, faster reviews, and cleaner “this change did X to model quality” stories.
Summary
Branching datasets in Oxen.ai lets you apply Git-like discipline to the asset that matters most for model quality: your data. Create a branch, make your changes, validate them with fine-tuning and evaluation, then merge back into main with explicit conflict resolution when needed. The result is a clean, auditable history of how your training and evaluation data evolves—and a repeatable loop from dataset → fine-tuned model → deployed endpoint.