
Data Registry

One of the main uses of DVC repositories is the versioning of data and model files. DVC also enables cross-project reusability of these data artifacts. This means that your projects can depend on data from other DVC repositories — like a package management system for data science.

Figure: a data registry acts as data management middleware between ML projects and cloud storage.

We can build a DVC project dedicated to versioning datasets (or data features, ML models, etc.). The repository contains the necessary metadata, as well as the entire change history. The data itself is stored in one or more DVC remotes. This is what we call a data registry — data management middleware between ML projects and cloud storage. Advantages:

  • Reusability: Reproduce and organize feature stores with a simple CLI (dvc get and dvc import commands, similar to software package management like pip).
  • Persistence: Separating metadata from storage on reliable platforms (Git, cloud locations) improves the durability of your data.
  • Storage optimization: Centralize data shared by multiple projects in a single location (distributed copies are possible too). This simplifies data management and optimizes space requirements.
  • Data as code: Leverage Git workflows such as commits, branching, pull requests, reviews, and even CI/CD for your data and models lifecycle. Think "Git for cloud storage".
  • Security: DVC-controlled remote storage (e.g. Amazon S3) can be configured to limit data access. For example, you can set up read-only endpoints (e.g. an HTTP server) to prevent data deletions or alterations.

See also Model Registry.

Building registries

Adding datasets to a registry can be as simple as placing the data file or directory in question inside the workspace, and tracking it with dvc add. A regular Git workflow can be followed with the .dvc files that substitute for the actual data (e.g. music/songs.dvc below). This enables team collaboration on data at the same level as with source code:

This sample dataset actually exists.

$ mkdir -p music/songs
$ cp ~/Downloads/millionsongsubset_full music/songs

$ dvc add music/songs/

$ git add music/songs.dvc music/.gitignore
$ git commit -m "Track 1.8 GB 10,000 song dataset in music/"
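For reference, the generated metafile is a small YAML file that points to the cached data. A sketch of what music/songs.dvc might contain (the hash, size, and file count shown here are hypothetical):

```yaml
outs:
- md5: a304a9a3f4d4e4d95b87a5f4fbf6f2f4.dir
  size: 1933312000
  nfiles: 10000
  path: songs
```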

The actual data is stored in the project's cache, and can be pushed to one or more remote storage locations so the registry can be accessed from other locations and by other people:

$ dvc remote add -d myremote s3://mybucket/dvcstore
$ dvc push

💡 A good way to organize DVC repositories into data registries is to use directories to group similar data, e.g. images/, natural-language/, etc. For example, our dataset registry has directories like get-started/ and use-cases/, matching parts of this website.

Using registries

The main methods to consume artifacts from a data registry are the dvc import and dvc get commands, as well as the Python API dvc.api. But first, we may want to explore its contents.

Listing data

To explore the contents of a DVC repository in search of the right data, use the dvc list command (similar to ls and 3rd-party tools like aws s3 ls):

$ dvc list -R

Both Git-tracked files and DVC-tracked data (or models, etc.) are listed.
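dvc list can also be scoped to a subdirectory of the source repo. For example, to list only the music/ directory of the example dataset registry (path shown is illustrative; this requires network access to the repo):

```shell
$ dvc list music/
```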

Data downloads

dvc get is analogous to using direct download tools like wget (HTTP), aws s3 cp (S3), etc. To get a dataset from a DVC repo, we can run something like this:

$ dvc get music/songs

This downloads music/songs from the project's default remote and places it in the current working directory.
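dvc get also accepts an output path via -o (--out), to place the download somewhere other than the current working directory. A sketch, with a hypothetical target directory (requires network access to the source repo):

```shell
$ dvc get music/songs -o datasets/songs
```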

Data import workflow

dvc import uses the same syntax as dvc get:

$ dvc import images/faces

Besides downloading the data, importing saves the information about the dependency that the local project has on the data source (registry repo). This is achieved by generating a special import .dvc file, which contains this metadata.
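The import metafile records both the data source and the exact revision that was imported. A sketch of what such a faces.dvc might look like (all hashes and the pinned revision here are hypothetical):

```yaml
md5: 96a7e2b6e2f5f1d3c4b8a9d0e1f2a3b4
frozen: true
deps:
- path: images/faces
  repo:
    url:
    rev_lock: f31d1d8a9c7e
outs:
- md5: 5b4d2f3e6a7c8d9e0f1a2b3c4d5e6f7a.dir
  path: faces
```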

Whenever the dataset changes in the registry, we can bring the data up to date with dvc update:

$ dvc update faces.dvc

This downloads new and changed files, and removes deleted ones, based on the latest commit in the source repo. It also updates the .dvc file accordingly.

Note that dvc get, dvc import, and dvc update have a --rev option to download data from a specific commit of the source repository.
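For example, assuming the registry repo has a Git tag cats-dogs-v1 marking an earlier version of a dataset (tag name hypothetical), that version could be fetched with:

```shell
$ dvc get --rev cats-dogs-v1 use-cases/cats-dogs
```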

Using DVC data from Python code

Our Python API, included in the dvc package, provides the function to load/stream data directly from external DVC projects:


import pickle

import dvc.api

model_path = 'model.pkl'
repo_url = ''  # URL of the registry repo

with, repo=repo_url, mode='rb') as fd:
    model = pickle.load(fd)
    # ... Use the model!

This opens model.pkl as a file descriptor. This example illustrates a simple ML model deployment method, but it could be extended to more advanced scenarios such as a model zoo.

See also the and dvc.api.get_url() functions.
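As a sketch of those two functions (repo URL and file paths are placeholders, and running this requires the dvc package plus network access): reads a file's contents into memory, while dvc.api.get_url() resolves the location of the data on remote storage without downloading it:

```python
import dvc.api

# Hypothetical registry repo and paths, for illustration only.
repo_url = ''

# Read a small file's contents directly into memory.
text ='natural-language/vocab.txt', repo=repo_url)

# Resolve the remote storage URL (e.g. an s3:// path) without downloading.
url = dvc.api.get_url('music/songs', repo=repo_url)
```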

Updating registries

Datasets evolve, and DVC is prepared to handle this. Just change the data in the registry, and apply the updates by running dvc add again:

$ cp 1000/more/songs/* music/songs/
$ dvc add music/songs/

DVC modifies the corresponding .dvc file to reflect the changes, and this is picked up by Git:

$ git status
	modified:   music/songs.dvc
$ git commit -am "Add 1,000 more songs to music/ dataset."

Iterating on this process for several datasets can give shape to a robust registry. The result is basically a repo that versions a set of metafiles. Let's see an example:

$ tree --filelimit=10
├── images
│   ├── .gitignore
│   ├── cats-dogs [2800 entries]  # Listed in .gitignore
│   ├── faces [10000 entries]     # Listed in .gitignore
│   ├── cats-dogs.dvc
│   └── faces.dvc
├── music
│   ├── .gitignore
│   ├── songs [11000 entries]     # Listed in .gitignore
│   └── songs.dvc
├── text

And let's not forget to dvc push data changes to the remote storage, so others can obtain them!

$ dvc push
