
Data Registries

One of the main uses of DVC repositories is the versioning of data and model files, with commands such as dvc add. To enable reuse of these data artifacts between different projects, DVC also provides commands like dvc import and dvc get. This means that your projects can depend on data from other DVC repositories, similar to package management systems for data science.

Data and models as code

This means we can build a DVC project dedicated to tracking and versioning datasets (or any large files, directories, ML models, etc.). The repository would have all the metadata and history of changes in the different datasets. We could see who updated what and when, and use pull requests to update data, the same way we do with code.

This is what we call a data registry: a kind of data management middleware between ML projects and cloud storage. Here are its advantages:

  • Reusability: reproduce and organize feature stores with a simple CLI (dvc get and dvc import commands, similar to software package management systems like pip).
  • Persistence: the DVC registry-controlled remote storage (e.g. an S3 bucket) improves data security. There is less chance that someone will delete or overwrite a model, for example.
  • Storage optimization: track data shared by multiple projects centralized in a single location (with the ability to create distributed copies on other remotes). This simplifies data management and optimizes space requirements.
  • Data as code: leverage Git workflow such as commits, branching, pull requests, reviews, and even CI/CD for your data and models lifecycle. Think "Git for cloud storage", but without ad-hoc conventions.
  • Security: registries can be set up to have read-only remote storage (e.g. an HTTP location). Git versioning of DVC metafiles allows us to track and audit data changes.

Building registries

Data registries start like any other DVC repository, with git init and dvc init. A good way to organize their contents is to use different directories to group similar data, e.g. images/, natural-language/, etc. For example, our dataset registry uses a directory for each section of our docs, such as get-started/ and use-cases/.

Adding datasets to a registry can be as simple as placing the data file or directory in question inside the workspace, and telling DVC to track it, with dvc add. For example:

$ mkdir -p music/songs
$ cp -r ~/Downloads/millionsongsubset_full music/songs
$ dvc add music/songs/

This sample dataset actually exists. See MillionSongSubset.

A regular Git workflow can be followed with the tiny .dvc file that substitutes for the actual data (music/songs.dvc in this example). This enables team collaboration on data at the same level as with source code (commit history, branching, pull requests, reviews, etc.):

$ git add music/songs.dvc music/.gitignore
$ git commit -m "Track 1.8 GB 10,000 song dataset in music/"

The actual data is stored in the project's cache and can be pushed to one or more remote storage locations, so the registry can be accessed from other locations or by other people:

$ dvc remote add -d myremote s3://bucket/path
$ dvc push
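Under the hood, the cache that dvc push uploads is content-addressable: each file is stored under a path derived from the hash of its contents. Here is a minimal sketch of that layout in Python (simplified; real DVC also handles directories, file links, and its own hash naming):

```python
import hashlib
import os
import tempfile

def cache_path(cache_dir, data):
    # DVC-style layout: the first two hex characters of the MD5
    # become a subdirectory, the rest become the file name.
    md5 = hashlib.md5(data).hexdigest()
    return os.path.join(cache_dir, md5[:2], md5[2:])

cache_dir = tempfile.mkdtemp()
data = b"some track data"
path = cache_path(cache_dir, data)
os.makedirs(os.path.dirname(path), exist_ok=True)
with open(path, "wb") as f:
    f.write(data)
# Identical content always maps to the same cache path,
# so a file shared by several datasets is stored only once.
```

This deduplication is also why tracking data shared by multiple projects in one registry optimizes storage space.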

Using registries

The main methods to consume artifacts from a data registry are the dvc import and dvc get commands, as well as the Python API, dvc.api. But first, you may want to explore its contents.

Listing data

To explore the contents of a data DVC repo in search of the right data, use the dvc list command (analogous to ls, or 3rd-party tools like aws s3 ls):

$ dvc list -R https://github.com/iterative/dataset-registry
.gitignore
README.md
get-started/.gitignore
get-started/data.xml
get-started/data.xml.dvc
images/.gitignore
images/dvc-logo-outlines.png
...

Both Git-tracked files and DVC-tracked data and models are listed.

Simple downloads

dvc get is analogous to using direct download tools like wget (HTTP), aws s3 cp (S3), etc. To get a dataset from a DVC repo, we can run something like this:

$ dvc get https://github.com/example/registry \
          music/songs

This downloads the music/songs directory from the project's default remote and places it in the current working directory (this can be used anywhere in the file system).

Note that this command (as well as dvc import) has a --rev option to download the data from a specific commit of the source repository.
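For example, to download the dataset as it existed at a particular tag or commit (the v1.0 tag below is hypothetical):

```shell
$ dvc get --rev v1.0 \
          https://github.com/example/registry \
          music/songs
```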

Import workflow

dvc import uses the same syntax as dvc get:

$ dvc import https://github.com/example/registry \
             images/faces

Note that unlike dvc get, which can be used from any directory, dvc import needs to run within an existing DVC project.

Besides downloading the data, importing saves the dependency from the local project to the data source (registry repo). This is achieved by generating a special import .dvc file, which contains this metadata, and can be committed with Git.
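As a sketch, the import stub for the example above might look roughly like this (field values are placeholders, and the exact layout varies with the DVC version):

```yaml
# faces.dvc -- illustrative import stub
md5: ...
frozen: true          # imports stay pinned until `dvc update`
deps:
- path: images/faces
  repo:
    url: https://github.com/example/registry
    rev_lock: ...     # source repo commit the import is pinned to
outs:
- md5: ...
  path: faces
```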

Once a dataset has been imported, we can easily bring it up to date in our consumer project(s) with dvc update whenever the dataset changes in the source repo (data registry):

$ dvc update faces.dvc

This downloads new and changed files, or removes deleted ones, from the images/faces directory, based on the latest commit in the source repo. It also updates the project dependency metadata in the import .dvc file.

Using DVC data from Python code

Our Python API, included with the dvc package, provides an open function to load or stream data directly from external DVC projects:

import pickle

import dvc.api

model_path = 'model.pkl'
repo_url = 'https://github.com/example/registry'

with dvc.api.open(model_path, repo_url) as fd:
    model = pickle.load(fd)
    # ... Use the model!

This opens model.pkl as a file descriptor. The example above illustrates a simple ML model deployment method.

See also the dvc.api.read() and dvc.api.get_url() functions, which return a file's contents and its remote storage URL, respectively.

Updating registries

Datasets evolve, and DVC is prepared to handle it. Just change the data in the registry, and apply the updates by running dvc add again:

$ cp 1000/more/songs/* music/songs/
$ dvc add music/songs/

DVC then modifies the corresponding .dvc file to reflect the changes in the data, and this will be picked up by Git:

$ git status
Changes not staged for commit:
...
	modified:   music/songs.dvc
$ git commit -am "Add 1,000 more songs to music/ dataset."

Iterating on this process for several datasets can give shape to a robust registry. The result is basically a repo that versions a set of metafiles. Let's see an example:

$ tree --filelimit=10
.
├── images
│   ├── .gitignore
│   ├── cats-dogs [2800 entries]  # Listed in .gitignore
│   ├── faces [10000 entries]     # Listed in .gitignore
│   ├── cats-dogs.dvc
│   └── faces.dvc
├── music
│   ├── .gitignore
│   ├── songs [11000 entries]     # Listed in .gitignore
│   └── songs.dvc
├── text
...

And let's not forget to dvc push data changes to the remote storage, so others can obtain them!

$ dvc push