One of the main uses of DVC repositories is the versioning of data and model files, with commands such as dvc add. To enable reuse of these data artifacts between different projects, DVC also provides dvc import and dvc get. This means that your projects can depend on data from other DVC repositories, similar to package management systems for data science.
This means we can build a DVC project dedicated to tracking and versioning datasets (or any large files, directories, ML models, etc.). The repository would have all the metadata and history of changes in the different datasets. We could see who updated what and when, and use pull requests to update data, the same way we do with code.
This is what we call a data registry: a kind of data management middleware between ML projects and cloud storage.
Data registries start like any other DVC repository, with git init and dvc init. A good way to organize their contents is to use different directories to group similar data (e.g. a natural-language/ directory). For example, our dataset registry uses a directory for each part of our docs.
Adding datasets to a registry can be as simple as placing the data file or
directory in question inside the workspace, and telling DVC to
track it, with
dvc add. For example:
$ mkdir -p music/songs
$ cp -r ~/Downloads/millionsongsubset_full music/songs
$ dvc add music/songs/
This sample dataset actually exists. See MillionSongSubset.
A regular Git workflow can be followed with the tiny .dvc file that substitutes for the actual data (music/songs.dvc in this example). This enables team collaboration on data at the same level as with source code (commit history, branching, pull requests, reviews, etc.):
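To see why the metafile is so Git-friendly, it helps to look at the kind of metadata a .dvc file records: essentially a content hash and a path, never the data itself. The snippet below is an illustrative sketch with a made-up hash value; the exact fields and layout depend on the DVC version, so treat it as an approximation rather than the official format.

```python
# Illustrative sketch of a .dvc metafile and a minimal reader for it.
# Field names are approximate and the hash is invented for the example;
# the point is that only a hash and a path are stored, not the data.

SAMPLE_DVC_FILE = """\
outs:
- md5: a304afb96060aad90176268345e10355.dir
  path: songs
"""

def parse_outs(text):
    """Minimal parser for the flat 'outs' list shown above."""
    outs, current = [], None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith('- '):
            current = {}
            outs.append(current)
            stripped = stripped[2:]
        if current is not None and ':' in stripped:
            key, _, value = stripped.partition(':')
            current[key.strip()] = value.strip()
    return outs

print(parse_outs(SAMPLE_DVC_FILE))
# [{'md5': 'a304afb96060aad90176268345e10355.dir', 'path': 'songs'}]
```

Because the file holds only this handful of short strings, it diffs, merges, and reviews like any other source file, no matter how large the data it points to.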
$ git add music/songs.dvc music/.gitignore
$ git commit -m "Track 1.8 GB 10,000 song dataset in music/"
The actual data can then be uploaded to remote storage with dvc push, making it available to others:

$ dvc remote add -d myremote s3://bucket/path
$ dvc push
To explore the contents of a DVC data repository in search of the right data, use the dvc list command (analogous to ls, or third-party tools like aws s3 ls):
$ dvc list -R https://github.com/iterative/dataset-registry
.gitignore
README.md
get-started/.gitignore
get-started/data.xml
get-started/data.xml.dvc
images/.gitignore
images/dvc-logo-outlines.png
...
Both Git-tracked files and DVC-tracked data and models are listed.
dvc get is analogous to using direct download tools like aws s3 cp (S3), etc. To get a dataset from a DVC repo, we can run something like:

$ dvc get https://github.com/example/registry \
          music/songs
This downloads the
music/songs directory from the project's
default remote and places it in the
current working directory (this can be used anywhere in the file system).
To download the data while also recording a dependency on its source, use dvc import instead:

$ dvc import https://github.com/example/registry \
             images/faces
Besides downloading the data, importing saves the dependency from the local project to the data source (registry repo). This is achieved by generating a .dvc file, which contains this metadata and can be committed with Git.
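To make the idea of "saved dependency" concrete, the sketch below shows the approximate shape of what an import records: the source repository URL and the path inside it, alongside the local output. This is illustrative only; real import .dvc files store more fields (such as content hashes and a locked revision), and the exact layout varies by DVC version.

```python
# Approximate shape of the dependency metadata recorded by an import.
# Illustrative only: real .dvc files include hashes, a locked source
# revision, and other fields, and the layout varies by DVC version.

import_metadata = {
    'deps': [{
        'path': 'images/faces',  # path inside the source repo
        'repo': {'url': 'https://github.com/example/registry'},
    }],
    'outs': [{'path': 'faces'}],  # where the data lands locally
}

# The consumer project commits this metafile with Git, so the link back
# to the registry travels with the project history.
print(import_metadata['deps'][0]['repo']['url'])
```

Because the source URL and path live in the metafile, later commands can re-resolve the data against the registry instead of treating it as an anonymous local copy.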
A benefit of the import workflow is that we can easily bring the data up to date in our consumer project(s) with dvc update whenever the dataset changes in the source repo (data registry):
$ dvc update dataset.dvc
This downloads new and changed files, or removes deleted ones, from the images/faces directory, based on the latest commit in the source repo. It also updates the project dependency metadata in the import .dvc file.
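Conceptually, an update reconciles the local copy against the latest file hashes in the source: anything new or changed is downloaded, anything no longer present is removed. The toy sketch below models that reconciliation with plain dictionaries; it is a conceptual aid, not DVC's actual algorithm.

```python
# Conceptual model of an update: compare the source's latest file hashes
# against the local copy, then plan downloads and removals.
# A toy sketch, not DVC's real implementation.

def plan_update(local, source):
    """Both arguments map relative path -> content hash."""
    download = [p for p, h in source.items() if local.get(p) != h]
    remove = [p for p in local if p not in source]
    return sorted(download), sorted(remove)

local = {'faces/001.jpg': 'aaa', 'faces/002.jpg': 'bbb'}
source = {'faces/001.jpg': 'aaa', 'faces/002.jpg': 'ccc', 'faces/003.jpg': 'ddd'}

print(plan_update(local, source))
# (['faces/002.jpg', 'faces/003.jpg'], [])
```

Here faces/001.jpg is untouched, faces/002.jpg changed upstream and is re-downloaded, and faces/003.jpg is new. Content hashes are what make this cheap: unchanged files are detected without transferring any data.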
DVC's Python API, included with the dvc package, provides an open function to load or stream data directly from external DVC projects:
import pickle

import dvc.api

model_path = 'model.pkl'
repo_url = 'https://github.com/example/registry'

with dvc.api.open(model_path, repo=repo_url, mode='rb') as fd:
    model = pickle.load(fd)
    # ... Use the model!
This opens model.pkl as a file descriptor, streaming the data rather than saving it to disk first. The example above illustrates a simple ML model deployment method.
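The pickle.load pattern shown above works with any binary file-like object, so the round trip can be demonstrated without a registry or network access. The sketch below simulates the streamed file descriptor with io.BytesIO; the model dict is a stand-in for a real trained model.

```python
# The pickle.load pattern works with any binary file-like object.
# Here io.BytesIO stands in for the streamed file descriptor, and a
# small dict stands in for a real model, so no network access is needed.

import io
import pickle

model = {'weights': [0.1, 0.2], 'bias': 0.5}  # stand-in for a real model
stream = io.BytesIO(pickle.dumps(model))      # simulated file descriptor

with stream as fd:
    restored = pickle.load(fd)

print(restored == model)  # True
```

This is also why the API example opens the file in binary mode: pickle operates on bytes, not text.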
Datasets evolve, and DVC is prepared to handle it. Just change the data in the
registry, and apply the updates by running
dvc add again:
$ cp 1000/more/songs/* music/songs/
$ dvc add music/songs/
DVC then modifies the corresponding
.dvc file to reflect the changes in the
data, and this will be picked up by Git:
$ git status
Changes not staged for commit:
...
	modified: music/songs.dvc
$ git commit -am "Add 1,000 more songs to music/ dataset."
Iterating on this process for several datasets can give shape to a robust registry. The result is basically a repo that versions a set of metafiles. Let's see an example:
$ tree --filelimit=10
.
├── images
│   ├── .gitignore
│   ├── cats-dogs [2800 entries]  # Listed in .gitignore
│   ├── faces [10000 entries]     # Listed in .gitignore
│   ├── cats-dogs.dvc
│   └── faces.dvc
├── music
│   ├── .gitignore
│   ├── songs [11000 entries]     # Listed in .gitignore
│   └── songs.dvc
├── text
...
And remember to push the data itself to the registry's remote storage, so that others can obtain it:

$ dvc push