DVC lets you store and version data files, ML models, directories, and intermediate results alongside Git, without checking the file contents themselves into Git. Let's get an example dataset to play with:
$ mkdir data
$ dvc get https://github.com/iterative/dataset-registry \
          get-started/data.xml -o data/data.xml
dvc get can use any DVC repository to find the appropriate remote storage and download data artifacts from it (analogous to wget, but for repositories). In this case we use dataset-registry as the source repo. (Refer to Data Registries for more info about this setup.)
To track a file (or a directory) with DVC, just run dvc add on it. For example:
$ dvc add data/data.xml
DVC stores information about the added data in a special file called a DVC-file. DVC-files are small text files with a human-readable format and they can be committed with Git:
$ git add data/.gitignore data/data.xml.dvc
$ git commit -m "Add raw data to project"
Committing DVC-files with Git allows us to track different versions of the project data as it evolves with the source code tracked by Git.
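To see why only data/.gitignore and data/data.xml.dvc are staged above, the following sketch reproduces the same layout by hand using plain Git, in a throwaway directory. It does not call DVC at all. The ignore entry and the contents of the stand-in .dvc file are assumptions made for illustration; a real DVC-file is generated by dvc add and looks different.

```shell
# Sketch only: mimics by hand, with plain Git, the layout that `dvc add` sets up.
set -eu

demo="$(mktemp -d)"   # throwaway workspace, not your real project
cd "$demo"
git init -q .

mkdir data
printf '<rows/>\n' > data/data.xml            # stand-in for the real dataset
printf '/data.xml\n' > data/.gitignore        # assumed ignore entry for the data file
printf 'placeholder\n' > data/data.xml.dvc    # stand-in; NOT a real DVC-file

# Stage the same two paths as in the tutorial.
git add data/.gitignore data/data.xml.dvc

# List what Git tracks, and check whether the data file is ignored.
tracked="$(git ls-files)"
echo "$tracked"
if git check-ignore -q data/data.xml; then ignored=yes; else ignored=no; fi
echo "data/data.xml ignored by Git: $ignored"
```

In this sketch Git tracks the small pointer file and the ignore rule, while the data file stays out of the Git index.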
If your workspace uses Git, without DVC you would have to manually put each data file or directory into .gitignore. DVC commands that track data files automatically take care of this for you! (You just have to add the changes with