Sometimes multiple team members work on the same DVC-tracked data. When the time comes to combine their changes, merge conflicts can occur in Git-tracked DVC files, which need to be resolved.
Conflicts in dvc.yaml are no different from what we would see in source code. See Git Merging.
    stages:
      prepare:
        cmd: python src/prepare.py data/data.xml
        deps:
    <<<<<<< HEAD
          - data/big.xml
    =======
          - data/small.xml
    >>>>>>> branch
          - src/prepare.py
        params:
          - prepare.seed
          - prepare.split
        outs:
          - data/prepared
dvc commit can also be a good option, but only for the specific case where the HEAD version is chosen.
There are three main variations in the structure of .dvc files, which differ by the command that generated them: dvc add, dvc import, and dvc import-url.
    outs:
    <<<<<<< HEAD
    - md5: a304afb96060aad90176268345e10355
      size: 12
    =======
    - md5: 35dd1fda9cfb4b645ae431f4621fa324
      size: 100
    >>>>>>> branch
      path: data.xml
You can pick one of the versions:
    outs:
    - md5: 35dd1fda9cfb4b645ae431f4621fa324
      size: 100
      path: data.xml
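Picking one side can also be scripted. The following is a minimal sketch, not part of DVC: `pick_side` is a hypothetical helper that keeps either the HEAD ("ours") or the incoming ("theirs") half of each conflict block, and it assumes Git's default two-way conflict markers (no diff3-style `|||||||` sections).

```python
def pick_side(text: str, side: str = "theirs") -> str:
    """Resolve Git conflict blocks in `text` by keeping one side.

    side="ours" keeps the lines between '<<<<<<<' and '=======';
    side="theirs" keeps the lines between '=======' and '>>>>>>>'.
    """
    if side not in ("ours", "theirs"):
        raise ValueError("side must be 'ours' or 'theirs'")
    resolved = []
    state = "outside"  # one of: outside, ours, theirs
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            state = "ours"
        elif line.startswith("=======") and state == "ours":
            state = "theirs"
        elif line.startswith(">>>>>>>") and state == "theirs":
            state = "outside"
        elif state == "outside" or state == side:
            # Keep shared lines and lines from the chosen side only.
            resolved.append(line)
    if state != "outside":
        raise ValueError("unterminated conflict block")
    return "\n".join(resolved) + "\n"


# The conflicted .dvc file from the example above.
conflicted = """\
outs:
<<<<<<< HEAD
- md5: a304afb96060aad90176268345e10355
  size: 12
=======
- md5: 35dd1fda9cfb4b645ae431f4621fa324
  size: 100
>>>>>>> branch
  path: data.xml
"""

print(pick_side(conflicted, side="theirs"))
```

After writing the resolved text back to the .dvc file, the workspace data would still need to be brought in line with it, for example with dvc checkout.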
But if you want to actually merge the data files (or directories) of both versions, then you can follow this process:
- Run dvc checkout data.xml on both versions;
- Copy the data into temporary locations;
- Merge it by hand;
- Finally, run dvc add data.xml to overwrite the conflicted .dvc file.
If you have a directory, DVC provides a Git merge driver that can automatically resolve many merge conflicts for you. To use it, first set it up in your Git repo:
    $ git config merge.dvc.name 'DVC merge driver'
    $ git config merge.dvc.driver \
        'dvc git-hook merge-driver --ancestor %O --our %A --their %B'
And add this line to your .gitattributes file (in the root of your Git repo):

    *.dvc merge=dvc
Now, when a merge conflict occurs, DVC will simply combine data from both branches.
If the same file was added or changed in both branches, the merge driver will fail unless the changes are the same. If the same file was deleted in both branches, the merge driver will fail.
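To make these rules concrete, here is a rough sketch of a three-way merge over directory listings. It is an illustration of the behavior described above, not DVC's actual implementation: `merge_listings`, `MergeError`, and the path-to-hash dictionaries are hypothetical. The text does not say what happens when a file is deleted in only one branch, so the sketch assumes the deletion is applied if the other branch left the file untouched.

```python
class MergeError(Exception):
    """Raised when the sketch cannot combine the two sides automatically."""


def merge_listings(ancestor: dict, ours: dict, theirs: dict) -> dict:
    """Three-way merge of {relative_path: content_hash} mappings."""
    merged = {}
    for path in sorted(set(ancestor) | set(ours) | set(theirs)):
        base = ancestor.get(path)
        a = ours.get(path)
        b = theirs.get(path)
        if a == b:
            if a is None:
                # Deleted in both branches: fails, per the rule above.
                raise MergeError(f"{path}: deleted in both branches")
            # Unchanged, or the same addition/change on both sides.
            merged[path] = a
        elif a == base:
            # Only "theirs" touched the file.
            if b is not None:
                merged[path] = b
            # Otherwise deleted only in "theirs" (assumption: drop it).
        elif b == base:
            # Only "ours" touched the file.
            if a is not None:
                merged[path] = a
            # Otherwise deleted only in "ours" (assumption: drop it).
        else:
            # Added or changed differently in both branches.
            raise MergeError(f"{path}: changed differently in both branches")
    return merged


ancestor = {"a.txt": "h1", "b.txt": "h2"}
ours = {"a.txt": "h1", "b.txt": "h2", "c.txt": "h3"}     # adds c.txt
theirs = {"a.txt": "h1x", "b.txt": "h2", "d.txt": "h4"}  # edits a.txt, adds d.txt

print(merge_listings(ancestor, ours, theirs))
```

Here the two branches touch different files, so the result is simply the union of their changes. If both had changed a.txt to different hashes, the function would raise `MergeError` instead.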
    <<<<<<< HEAD
    md5: 263395583f35403c8e0b1b94b30bea32
    =======
    md5: 520d2602f440d13372435d91d3bfa176
    >>>>>>> branch
    frozen: true
    deps:
    - path: get-started/data.xml
      repo:
        url: https://github.com/iterative/dataset-registry
    <<<<<<< HEAD
        rev_lock: f31f5c4cdae787b4bdeb97a717687d44667d9e62
    =======
        rev_lock: 06be1104741f8a7c65449322a1fcc8c5f1070a1e
    >>>>>>> branch
    outs:
    <<<<<<< HEAD
    - md5: a304afb96060aad90176268345e10355
      size: 12
    =======
    - md5: 35dd1fda9cfb4b645ae431f4621fa324
      size: 100
    >>>>>>> branch
      path: data.xml
With the conflicted hashes (md5 and rev_lock) removed, you get something like this:
    frozen: true
    deps:
    - path: get-started/data.xml
      repo:
        url: https://github.com/iterative/dataset-registry
    outs:
    - path: data.xml
Note that updating will bring in the latest version of the data from its source, which may not correspond to any of the hashes that were removed.
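Removing the hashes can be expressed as a small transformation on the parsed file. This is only a sketch under stated assumptions: `strip_hashes` is a hypothetical helper, and its input is one side of the conflict already loaded into a plain dictionary (for instance with a YAML parser), since a file that still contains conflict markers is not valid YAML.

```python
import copy


def strip_hashes(dvc_file: dict) -> dict:
    """Return a copy of a parsed import .dvc file without its hashes."""
    cleaned = copy.deepcopy(dvc_file)
    cleaned.pop("md5", None)  # top-level hash of the .dvc file
    for dep in cleaned.get("deps", []):
        # Drop the pinned revision of the source repo, if present.
        dep.get("repo", {}).pop("rev_lock", None)
    for out in cleaned.get("outs", []):
        out.pop("md5", None)
        out.pop("size", None)
    return cleaned


# The HEAD side of the conflicted import .dvc file shown above.
head_side = {
    "md5": "263395583f35403c8e0b1b94b30bea32",
    "frozen": True,
    "deps": [
        {
            "path": "get-started/data.xml",
            "repo": {
                "url": "https://github.com/iterative/dataset-registry",
                "rev_lock": "f31f5c4cdae787b4bdeb97a717687d44667d9e62",
            },
        }
    ],
    "outs": [
        {"md5": "a304afb96060aad90176268345e10355", "size": 12, "path": "data.xml"}
    ],
}

print(strip_hashes(head_side))
```

Because every conflicting field is a hash or a size, both sides of the conflict reduce to the same stripped structure, so it does not matter which side is used as the starting point.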