One of the main uses of DVC repositories is the versioning of data and model files. DVC also enables cross-project reusability of these data artifacts. This means that your projects can depend on data from other DVC repositories — like a package management system for data science.
We can build a DVC project dedicated to versioning datasets (or data features, ML models, etc.). The repository would have all the metadata and change history for the data it tracks. We could see who changed what and when, and use pull requests to update data like we do with code. This is what we call a data registry — data management middleware between ML projects and cloud storage.
Advantages of data registries:
dvc import commands, similar to software package management systems like
Adding datasets to a registry can be as simple as placing the data file or directory in question inside the workspace and tracking it with dvc add. A regular Git workflow can then be followed with the .dvc files that substitute the actual data (e.g. music/songs.dvc below). This enables team collaboration on data at the same level as with source code:
This sample dataset actually exists.
$ mkdir -p music/songs
$ cp ~/Downloads/millionsongsubset_full music/songs
$ dvc add music/songs/
$ git add music/songs.dvc music/.gitignore
$ git commit -m "Track 1.8 GB 10,000 song dataset in music/"
$ dvc remote add -d myremote s3://mybucket/dvcstore
$ dvc push
💡 A good way to organize DVC repositories into data registries is to use directories to group similar data, e.g. natural-language/, etc. For example, our dataset registry has directories like use-cases/, matching parts of this website.
To explore the contents of a DVC repository in search of the right data, use the dvc list command (similar to ls and 3rd-party tools like aws s3 ls):
$ dvc list -R https://github.com/iterative/dataset-registry
.gitignore
README.md
get-started/.gitignore
get-started/data.xml
get-started/data.xml.dvc
images/.gitignore
images/dvc-logo-outlines.png
...
Both Git-tracked files and DVC-tracked data (or models, etc.) are listed.
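Because the listing mixes Git-tracked files with DVC-tracked data, it can help to pick out the DVC-tracked entries when searching a registry. The following is a minimal sketch, assuming the listing has already been captured as text (here, the truncated excerpt shown above) and that a data path is DVC-tracked when a sibling entry with the .dvc suffix appears. The helper name dvc_tracked_paths is hypothetical and not part of DVC.

```python
from __future__ import annotations


def dvc_tracked_paths(listing: str) -> list[str]:
    """Return data paths that have a matching .dvc metafile in the listing."""
    entries = [line.strip() for line in listing.splitlines() if line.strip()]
    present = set(entries)
    tracked = []
    for entry in entries:
        if entry.endswith(".dvc"):
            data_path = entry[: -len(".dvc")]
            # Only report the path if the data entry itself is listed too.
            if data_path in present:
                tracked.append(data_path)
    return tracked


# Excerpt of the listing shown above (the elided "..." part is left out).
sample_listing = """\
.gitignore
README.md
get-started/.gitignore
get-started/data.xml
get-started/data.xml.dvc
images/.gitignore
images/dvc-logo-outlines.png
"""

print(dvc_tracked_paths(sample_listing))
```

This heuristic only covers data tracked through individual .dvc files. The dvc list command also offers a --dvc-only option, which is likely the simpler choice when running the command directly.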
dvc get is analogous to using direct download tools like aws s3 cp (S3), etc. To get a dataset from a DVC repo, we can run something like this:
$ dvc get https://github.com/example/registry music/songs
This downloads music/songs from the project's default remote and places it in the current working directory.
$ dvc import https://github.com/example/registry images/faces
Besides downloading the data, importing saves the information about the dependency that the local project has on the data source (registry repo). This is achieved by generating a special import .dvc file, which contains this information.
Whenever the dataset changes in the registry, we can bring the data up to date with:
$ dvc update faces.dvc
This downloads new and changed files, and removes deleted ones, based on the latest commit in the source repo. It also updates the .dvc file accordingly.
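To see which source and revision an import currently points at, the dependency information in the import .dvc file can be read directly. The sketch below assumes the file records the source under a repo entry with url and rev_lock fields. The sample text is illustrative only: the revision is a placeholder, hash fields are omitted, and the real layout may differ between DVC versions. The helper import_source is hypothetical.

```python
import re


def import_source(dvc_file_text: str) -> dict:
    """Extract the source repo URL and locked revision, if present."""
    info = {}
    for key in ("url", "rev_lock"):
        # Match an indented "key: value" line anywhere in the text.
        match = re.search(rf"^\s*{key}:\s*(\S+)\s*$", dvc_file_text, re.MULTILINE)
        if match:
            info[key] = match.group(1)
    return info


# Illustrative content for faces.dvc (assumed layout, placeholder revision).
sample = """\
frozen: true
deps:
- path: images/faces
  repo:
    url: https://github.com/example/registry
    rev_lock: 0123456789abcdef0123456789abcdef01234567
outs:
- path: faces
"""

source = import_source(sample)
print(source["url"], source["rev_lock"])
```

Under this assumption, comparing rev_lock before and after dvc update would show whether the import moved to a newer commit of the registry.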
Our Python API, which comes with the dvc package installed with DVC, provides the open function to load or stream data directly from external DVC projects:
import pickle

import dvc.api

model_path = 'model.pkl'
repo_url = 'https://github.com/example/registry'

# mode='rb' because pickle needs a binary file object
with dvc.api.open(model_path, repo_url, mode='rb') as fd:
    model = pickle.load(fd)
    # ... Use the model!
This opens model.pkl as a file descriptor. This example illustrates a simple ML model deployment method, but it could be extended to more advanced scenarios such as a model zoo.
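For deployment, it is often useful to pin the model to a specific revision of the registry and to be able to test the loading code without network access. The following is a rough sketch under those assumptions: load_model and fake_opener are hypothetical helpers, "v1.0" is a made-up tag, and the opener parameter exists only so a stand-in can replace dvc.api.open in tests.

```python
import io
import pickle


def load_model(path, repo, rev=None, opener=None):
    """Unpickle a model stored in a DVC repo.

    `opener` defaults to dvc.api.open; pass a stand-in for offline tests.
    """
    if opener is None:
        import dvc.api  # imported lazily so the sketch loads without DVC

        opener = dvc.api.open
    # Pickle data is binary, so the file is opened with mode="rb".
    with opener(path, repo=repo, rev=rev, mode="rb") as fd:
        return pickle.load(fd)


def fake_opener(path, repo=None, rev=None, mode="r"):
    """Stand-in opener that serves an in-memory pickle (for testing only)."""
    return io.BytesIO(pickle.dumps({"weights": [0.1, 0.2]}))


model = load_model(
    "model.pkl",
    "https://github.com/example/registry",
    rev="v1.0",  # hypothetical tag in the registry repo
    opener=fake_opener,
)
print(model)
```

Leaving out opener makes the function call dvc.api.open against the real repository, with rev selecting the Git revision to read from.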
Datasets evolve, and DVC is prepared to handle it. Just change the data in the registry, and apply the updates by running dvc add again:
$ cp 1000/more/images/* music/songs/
$ dvc add music/songs/
DVC modifies the corresponding .dvc file to reflect the changes, and this is picked up by Git:
$ git status
...
	modified:   music/songs.dvc
$ git commit -am "Add 1,000 more songs to music/ dataset."
Iterating on this process for several datasets can give shape to a robust registry. The result is basically a repo that versions a set of metafiles. Let's see an example:
$ tree --filelimit=10
.
├── images
│   ├── .gitignore
│   ├── cats-dogs [2800 entries]  # Listed in .gitignore
│   ├── faces [10000 entries]     # Listed in .gitignore
│   ├── cats-dogs.dvc
│   └── faces.dvc
├── music
│   ├── .gitignore
│   ├── songs [11000 entries]     # Listed in .gitignore
│   └── songs.dvc
├── text
...
$ dvc push
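The update cycle used in this section (re-run dvc add, commit the changed .dvc file, then dvc push) could be scripted when a registry is updated regularly. This is only a sketch: publish_commands and publish are hypothetical helpers, and the script assumes it runs from the root of the registry repo with a default remote already configured.

```python
import shlex
import subprocess


def publish_commands(data_dir, message):
    """Build the commands that record and upload a new dataset version."""
    return [
        ["dvc", "add", data_dir],           # refresh the .dvc metafile
        ["git", "commit", "-am", message],  # version the metafile change
        ["dvc", "push"],                    # upload data to remote storage
    ]


def publish(data_dir, message, dry_run=True):
    """Print the commands (dry run) or execute them in order."""
    for cmd in publish_commands(data_dir, message):
        if dry_run:
            print("$", shlex.join(cmd))
        else:
            # check=True stops the sequence at the first failing command.
            subprocess.run(cmd, check=True)


publish("music/songs/", "Add 1,000 more songs to music/ dataset.")
```

The dry-run default only prints the commands, which makes it easy to review them before anything is executed.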