Sometimes multiple members of a team work on the same DVC-tracked data. When the time comes to combine their changes, merge conflicts can happen in Git-tracked DVC files, and these need to be resolved.
Conflicts in `dvc.yaml` are no different from what we would see in source code. See Git Merging.
```yaml
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
<<<<<<< HEAD
      - data/big.xml
=======
      - data/small.xml
>>>>>>> branch
      - src/prepare.py
    params:
      - prepare.seed
      - prepare.split
    outs:
      - data/prepared
```
`dvc commit` can also be a good option, but only for the specific case where the `HEAD` version is chosen.
There are three main variations in the structure of these files, which differ by the command that generated them:
```yaml
outs:
<<<<<<< HEAD
  - md5: a304afb96060aad90176268345e10355
    size: 12
=======
  - md5: 35dd1fda9cfb4b645ae431f4621fa324
    size: 100
>>>>>>> branch
    path: data.xml
```
You can pick one of the versions:
```yaml
outs:
  - md5: 35dd1fda9cfb4b645ae431f4621fa324
    size: 100
    path: data.xml
```
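Picking one version amounts to keeping one side of every conflict block and dropping the markers. The following is a minimal Python sketch of that idea, not part of DVC: `resolve_conflicts` is a hypothetical helper, and it assumes Git's default two-way conflict style (it does not handle the `|||||||` section that the `diff3` style adds).

```python
def resolve_conflicts(text: str, keep: str = "theirs") -> str:
    """Return text with each Git conflict block replaced by one side.

    keep="ours" keeps the lines between '<<<<<<<' and '=======';
    keep="theirs" keeps the lines between '=======' and '>>>>>>>'.
    """
    if keep not in ("ours", "theirs"):
        raise ValueError("keep must be 'ours' or 'theirs'")

    resolved = []
    side = None  # None outside a conflict block, else "ours" or "theirs"
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            side = "ours"
        elif line.startswith("=======") and side == "ours":
            side = "theirs"
        elif line.startswith(">>>>>>>") and side == "theirs":
            side = None
        elif side is None or side == keep:
            # Keep lines outside any conflict, plus the chosen side.
            resolved.append(line)
    return "\n".join(resolved) + "\n"


# The conflicted .dvc content from the example above.
conflicted = """\
outs:
<<<<<<< HEAD
  - md5: a304afb96060aad90176268345e10355
    size: 12
=======
  - md5: 35dd1fda9cfb4b645ae431f4621fa324
    size: 100
>>>>>>> branch
    path: data.xml
"""

print(resolve_conflicts(conflicted, keep="theirs"))
```

With `keep="theirs"` the sketch keeps the `branch` version, and with `keep="ours"` it keeps the `HEAD` version. In practice, editing the file by hand or with a merge tool achieves the same result.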
But if you want to actually merge the data files (or directories) of both versions, then you can follow this process:
`dvc checkout data.xml` on both branches
`dvc add data.xml` to overwrite the conflicted `.dvc` file
If you have an "append-only" dataset, where people only add new files/directories, DVC provides a so-called merge-driver that can automatically resolve Git conflicts for you. To use it, first set it up in your Git repo:
```shell
$ git config merge.dvc.name 'DVC merge driver'
$ git config merge.dvc.driver \
  'dvc git-hook merge-driver --ancestor %O --our %A --their %B'
```
And add this line to your `.gitattributes` (in the root of your Git repo):
Now, when a merge conflict occurs, DVC will simply combine data from both branches.
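To make "combine data from both branches" concrete, here is a small Python sketch of an append-only union. It is an illustration only and not DVC's merge driver implementation: `merge_append_only` and the path-to-hash dictionaries are hypothetical, and the sketch assumes that a change which modifies or deletes an existing file falls outside the append-only case and should be rejected.

```python
def merge_append_only(ancestor, ours, theirs):
    """Union two directory listings (path -> hash) that only add files.

    `ancestor` is the common base; `ours` and `theirs` are the two branches.
    Raises ValueError if either side removed or changed an existing file,
    or if both sides added the same path with different content.
    """
    merged = dict(ancestor)
    for name, side in (("ours", ours), ("theirs", theirs)):
        removed = set(ancestor) - set(side)
        if removed:
            raise ValueError(f"{name} removed files: {sorted(removed)}")
        for path, file_hash in side.items():
            if path in merged and merged[path] != file_hash:
                raise ValueError(f"conflicting content for {path}")
            merged[path] = file_hash
    return merged


# Hypothetical listings: each branch added one new file to the same directory.
ancestor = {"a.txt": "hash-a"}
ours = {"a.txt": "hash-a", "b.txt": "hash-b"}
theirs = {"a.txt": "hash-a", "c.txt": "hash-c"}

print(merge_append_only(ancestor, ours, theirs))
```

The merged listing contains the files from the common base plus the files added on each branch, which is why this approach only suits datasets where nobody edits or removes existing entries.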
```yaml
<<<<<<< HEAD
md5: 263395583f35403c8e0b1b94b30bea32
=======
md5: 520d2602f440d13372435d91d3bfa176
>>>>>>> branch
frozen: true
deps:
  - path: get-started/data.xml
    repo:
      url: https://github.com/iterative/dataset-registry
<<<<<<< HEAD
      rev_lock: f31f5c4cdae787b4bdeb97a717687d44667d9e62
=======
      rev_lock: 06be1104741f8a7c65449322a1fcc8c5f1070a1e
>>>>>>> branch
outs:
<<<<<<< HEAD
  - md5: a304afb96060aad90176268345e10355
    size: 12
=======
  - md5: 35dd1fda9cfb4b645ae431f4621fa324
    size: 100
>>>>>>> branch
    path: data.xml
```
So you get something like this:
```yaml
frozen: true
deps:
  - path: get-started/data.xml
    repo:
      url: https://github.com/iterative/dataset-registry
outs:
  - path: data.xml
```
Note that updating will bring in the latest version of the data from its source, which may not correspond to any of the hashes that were removed.
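The hash removal shown above can be sketched in Python on an already-parsed file. This is a hedged illustration rather than a DVC feature: `strip_hashes` and `HASH_KEYS` are hypothetical names, and the set of keys (`md5`, `rev_lock`, `size`) is an assumption taken from the difference between the conflicted and cleaned examples above.

```python
# Keys assumed to hold hash-related values, based on the examples above.
HASH_KEYS = {"md5", "rev_lock", "size"}


def strip_hashes(node):
    """Recursively copy a parsed .dvc structure without hash-related keys."""
    if isinstance(node, dict):
        return {
            key: strip_hashes(value)
            for key, value in node.items()
            if key not in HASH_KEYS
        }
    if isinstance(node, list):
        return [strip_hashes(item) for item in node]
    return node


# One side of the conflicted import file, written as a Python dict.
import_file = {
    "md5": "263395583f35403c8e0b1b94b30bea32",
    "frozen": True,
    "deps": [
        {
            "path": "get-started/data.xml",
            "repo": {
                "url": "https://github.com/iterative/dataset-registry",
                "rev_lock": "f31f5c4cdae787b4bdeb97a717687d44667d9e62",
            },
        }
    ],
    "outs": [
        {"md5": "a304afb96060aad90176268345e10355", "size": 12, "path": "data.xml"}
    ],
}

print(strip_hashes(import_file))
```

The result keeps only `frozen`, the dependency's `path` and `url`, and the output `path`, matching the cleaned file shown above.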