Data Registries

One of the main uses of DVC repositories is the versioning of data and model files, with commands such as dvc add. To enable reusability of these data artifacts between different projects, DVC also provides the dvc import and dvc get commands, among others. This means that a project can depend on data from an external DVC project, similar to package management systems, but for data science projects.

Data and models as code

Keeping this in mind, we could build a DVC project dedicated to tracking and versioning datasets (or any large data, even ML models). This way we would have a repository with all the metadata and history of changes of different datasets. We could see who updated what, and when, and use pull requests to update data (the same way we do with code). This is what we call a data registry, which can work as data management middleware between ML projects and cloud storage.

Note that a single dedicated repository is just one possible pattern to create data registries with DVC.

Advantages of using a DVC data registry project:

  • Reusability: Reproduce and organize feature stores with a simple CLI (dvc get and dvc import commands, similar to software package management systems like pip).
  • Persistence: The DVC registry-controlled remote storage (e.g. an S3 bucket) improves data security. There is less chance that someone will delete or rewrite a model, for example.
  • Storage Optimization: Track data shared by multiple projects centralized in a single location (with the ability to create distributed copies on other remotes). This simplifies data management and optimizes space requirements.
  • Security: Registries can be set up to have read-only remote storage (e.g. an HTTP location). Git versioning of DVC-files allows us to track and audit data changes.
  • Data as code: Leverage the Git workflow, such as commits, branching, pull requests, reviews, and even CI/CD, for the lifecycle of your data and models. Think Git for cloud storage, but without ad-hoc conventions.

Building registries

Data registries can be created like any other DVC repository, with git init and dvc init. A good way to organize them is with different directories that group the data by use, such as images/, natural-language/, etc. For example, our dataset-registry uses a directory for each section in our website documentation, like get-started/, use-cases/, etc.
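
As a minimal sketch, a new registry could be initialized like this (the repository and directory names below are just examples):

$ mkdir dataset-registry && cd dataset-registry
$ git init
$ dvc init
$ git commit -m "Initialize DVC data registry"
$ mkdir images natural-language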

Adding datasets to a registry can be as simple as placing the data file or directory in question inside the workspace, and telling DVC to track it, with dvc add. For example:

$ mkdir -p music/songs
$ cp -r ~/Downloads/millionsongsubset_full/* music/songs/
$ dvc add music/songs

This example dataset actually exists. See MillionSongSubset.

A regular Git workflow can be followed with the tiny DVC-files that substitute the actual data (music/songs.dvc in this example). This enables team collaboration on data at the same level as with source code (commit history, branching, pull requests, reviews, etc.):

$ git add music/songs.dvc music/.gitignore
$ git commit -m "Track 1.8 GB 10,000 song dataset in music/"
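
Optionally, we can mark this dataset version with a Git tag (songs-v1.0 is a hypothetical name), which consumers can later reference with the --rev option shown later:

$ git tag -a songs-v1.0 -m "10,000 song dataset, initial version"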

The actual data is stored in the project's cache and should be pushed to one or more remote storage locations, so the registry can be accessed from other locations or by other people:

$ dvc remote add -d myremote s3://bucket/path
$ dvc push
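
Additional remotes can hold distributed copies of the data. As a sketch (the Google Cloud Storage path below is hypothetical), a second remote can be added and pushed to explicitly:

$ dvc remote add mybackup gs://backup-bucket/path
$ dvc push -r mybackup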

Using registries

The main methods to consume data artifacts from a data registry are the dvc import and dvc get commands, as well as the dvc.api Python API.

Simple download (get)

This is analogous to using direct download tools like wget (HTTP), aws s3 cp (S3), etc. To download a dataset, for example, we can run something like:

$ dvc get https://github.com/example/registry \
          music/songs/

This downloads music/songs/ from the project's default remote and places it in the current working directory (the command works anywhere in the file system where the user has write access).

Note that this command (as well as dvc import) has a --rev option to download specific versions of the data.
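
For example, to download the dataset as it was at a specific Git revision in the registry (the songs-v1.0 tag below is hypothetical):

$ dvc get --rev songs-v1.0 \
          https://github.com/example/registry \
          music/songs/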

Import workflow

dvc import uses the same syntax as dvc get:

$ dvc import https://github.com/example/registry \
             images/faces/

Note that unlike dvc get, which can be used from any directory, dvc import needs to run within an initialized DVC project.

Besides downloading, importing saves the dependency of the local project on the data source (registry repository). This is achieved by creating a particular kind of DVC-file (a.k.a. import stage). This file can be staged and committed with Git.
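
For example, assuming the import above created faces.dvc in the current working directory (the default output location for dvc import), we can track it like any other DVC-file:

$ git add faces.dvc .gitignore
$ git commit -m "Import faces dataset from example registry"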

Thanks to the saved dependency, we can easily bring the data up to date in our consumer project with dvc update whenever the dataset changes in the source project (data registry):

$ dvc update faces.dvc

dvc update downloads new and changed files, or removes deleted ones, from the imported faces/ directory, based on the latest version of the source project. It also updates the project dependency metadata in the import stage (DVC-file).

Programmatic reusability of DVC data

Our Python API, included when DVC is installed with the dvc package, exposes the open function to load or stream data directly from external DVC projects:

import pickle

import dvc.api

model_path = 'model.pkl'
repo_url = 'https://github.com/example/registry'

# Open in binary mode, since pickle reads bytes
with dvc.api.open(model_path, repo=repo_url, mode='rb') as fd:
    model = pickle.load(fd)
    # ... Use the model!

This opens model.pkl as a file-like object. The example above illustrates a simple, hardcoded ML model deployment method.

Notice that the dvc.api.get_url and dvc.api.read functions are also available.
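
As a sketch of both (using the same hypothetical repository and file as above): read loads a file's full contents into memory in one call, and get_url resolves the file's location on the registry's remote storage.

import dvc.api

repo_url = 'https://github.com/example/registry'

# Load the whole file into memory at once (binary mode for pickled data)
data = dvc.api.read('model.pkl', repo=repo_url, mode='rb')

# Resolve where the file lives on the registry's remote storage
# (a path derived from the file's hash, e.g. inside the S3 bucket)
url = dvc.api.get_url('model.pkl', repo=repo_url)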

Updating registries

Datasets evolve, and DVC is prepared to handle it. Just change the data in the registry, and apply the updates by running dvc add again:

$ cp -r /path/to/1000/more/songs/* music/songs/
$ dvc add music/songs

DVC then modifies the corresponding DVC-file to reflect the changes in the data, and this will be noticed by Git:

$ git status
Changes not staged for commit:
...
	modified:   music/songs.dvc
$ git commit -am "Add 1,000 more songs to music/ dataset."

Iterating on this process for several datasets can give shape to a robust registry: basically a repository that mainly versions a bunch of DVC-files, as you can see in the hypothetical example below.

$ tree --filelimit=100
.
├── images
│   ├── .gitignore
│   ├── cats-dogs [2800 entries]  # Listed in .gitignore
│   ├── faces [10000 entries]     # Listed in .gitignore
│   ├── cats-dogs.dvc
│   └── faces.dvc
├── music
│   ├── .gitignore
│   ├── songs [11000 entries]     # Listed in .gitignore
│   └── songs.dvc
├── text
...

And let's not forget to dvc push data changes to the remote storage, so others can obtain them!

$ dvc push
