One of the main uses of DVC repositories is the
versioning of data and model files,
with commands such as
dvc add. To enable reuse of these
data artifacts between different projects, DVC also provides the
dvc import and
dvc get commands, among others. This means that a project can
depend on data from an external DVC project, similar to package
management systems, but for data science projects.
Data and models as code
Keeping this in mind, we could build a DVC project dedicated to tracking and versioning datasets (or any large data, even ML models). This way we would have a repository with all the metadata and history of changes of different datasets. We could see who updated what, and when, and use pull requests to update data (the same way we do with code). This is what we call a data registry, which can work as data management middleware between ML projects and cloud storage.
Note that a single dedicated repository is just one possible pattern to create data registries with DVC.
Advantages of using a DVC data registry project:
- Reusability: Reproduce and organize feature stores with simple CLI
commands such as dvc import, similar to software package management systems.
- Persistence: The DVC registry-controlled remote storage (e.g. an S3 bucket) improves data security. It is less likely that someone can delete or overwrite a model, for example.
- Storage Optimization: Track data shared by multiple projects centralized in a single location (with the ability to create distributed copies on other remotes). This simplifies data management and optimizes space requirements.
- Security: Registries can be set up to have read-only remote storage (e.g. an HTTP location). Git versioning of DVC-files allows us to track and audit data changes.
- Data as code: Leverage Git workflows such as commits, branching, pull requests, reviews, and even CI/CD for your data and model lifecycle. Think Git for cloud storage, but without ad-hoc conventions.
Data registries can be created like any other DVC repository with
git init and
dvc init. A good way to organize them is with different
directories, to group the data into separate uses, such as
natural-language/, etc. For example, our
dataset-registry uses a
directory for each section in our website documentation.
Adding datasets to a registry can be as simple as placing the data file or
directory in question inside the workspace, and telling DVC to
track it, with
dvc add. For example:
$ mkdir -p music
$ cp -r ~/Downloads/millionsongsubset_full music/songs
$ dvc add music/songs
This example dataset actually exists. See MillionSongSubset.
A regular Git workflow can be followed with the tiny
DVC-files that substitute the actual data
(music/songs.dvc in this example). This enables team collaboration on data at
the same level as with source code (commit history, branching, pull requests,
and so on):
$ git add music/songs.dvc music/.gitignore
$ git commit -m "Track 1.8 GB 10,000 song dataset in music/"
$ dvc remote add -d myremote s3://bucket/path
$ dvc push
Simple download (get)
$ dvc get https://github.com/example/registry \
          music/songs/
This downloads music/songs/ from the project's
default remote and places it in the
current working directory (anywhere in the file system with user write access).
Note that this command (as well as
dvc import) has a
--rev option to download specific versions of the data.
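As a minimal sketch of how the --rev option fits into the command line, the Python helper below assembles the arguments for dvc get with an optional revision. The helper name build_get_command and the v1.0 tag are hypothetical; only the dvc get subcommand and its --rev option come from the text above. Running the resulting command (for example with subprocess.run) would require DVC to be installed.

```python
from typing import List, Optional


def build_get_command(repo_url: str, path: str, rev: Optional[str] = None) -> List[str]:
    """Assemble a `dvc get` invocation (hypothetical helper, for illustration)."""
    cmd = ["dvc", "get", repo_url, path]
    if rev is not None:
        # --rev selects a Git revision (branch, tag, or commit) of the registry.
        cmd += ["--rev", rev]
    return cmd


# "v1.0" is an assumed tag name, not one known to exist in the registry.
cmd = build_get_command(
    "https://github.com/example/registry", "music/songs/", rev="v1.0"
)
print(" ".join(cmd))
```

Building the command as a list, rather than a single string, avoids shell quoting issues if the path or revision contains spaces.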
$ dvc import https://github.com/example/registry \
             images/faces/
Besides downloading, importing saves the dependency of the local project on the data source (registry repository). This is achieved by creating a particular kind of DVC-file (a.k.a. import stage). This file can be staged and committed with Git.
As an addition to the import workflow, and enabled by the saved dependency, we can
easily bring the data up to date in our consumer project with
dvc update whenever
the dataset changes in the source project (data registry):
$ dvc update faces.dvc
dvc update downloads new and changed files, or removes deleted ones, from
images/faces/, based on the latest version of the source project. It also
updates the project dependency metadata in the import stage (DVC-file).
Programmatic reusability of DVC data
Our Python API, included with the
dvc package installed with DVC, includes the
open function to load/stream data directly from external DVC projects:
import pickle

import dvc.api

model_path = 'model.pkl'
repo_url = 'https://github.com/example/registry'

# Open in binary mode, since pickle reads bytes.
with dvc.api.open(model_path, repo=repo_url, mode='rb') as fd:
    model = pickle.load(fd)
    # ... Use the model!
This opens model.pkl as a file descriptor. The example above
illustrates a hardcoded ML model deployment method.
Note that other functions, such as
dvc.api.read, are also available.
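As a hedged sketch of how dvc.api.read might be used, the wrapper below returns the contents of a tracked text file from a registry at a chosen revision. The function name read_registry_text and the optional reader parameter are hypothetical and exist only so the wrapper can be tried without network access. The call itself assumes dvc.api.read accepts the path plus repo and rev keyword arguments.

```python
from typing import Callable, Optional


def read_registry_text(
    path: str,
    repo_url: str,
    rev: Optional[str] = None,
    reader: Optional[Callable[..., str]] = None,
) -> str:
    """Return the contents of a tracked text file from a DVC registry."""
    if reader is None:
        # Imported lazily so this module loads even where DVC is not installed.
        import dvc.api

        reader = dvc.api.read
    # Assumed signature: read(path, repo=..., rev=...), returning text by default.
    return reader(path, repo=repo_url, rev=rev)
```

Unlike dvc.api.open, which yields a file object for streaming, this approach loads the whole file into memory, so it suits small files such as labels or configuration.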
Datasets evolve, and DVC is prepared to handle it. Just change the data in the
registry, and apply the updates by running
dvc add again:
$ cp /path/to/1000/image/dir music/songs
$ dvc add music/songs
DVC then modifies the corresponding DVC-file to reflect the changes in the data, and this will be noticed by Git:
$ git status
Changes not staged for commit:
...
	modified:   music/songs.dvc
$ git commit -am "Add 1,000 more songs to music/ dataset."
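To see why Git notices the change, recall that a DVC-file records a content hash of the tracked data rather than the data itself. The sketch below is a simplified illustration, not DVC's actual implementation: it hashes a single temporary file with MD5 before and after appending a line, showing that a change in content yields a different hash and therefore a modified DVC-file. The file name and contents are made up.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the hex MD5 digest of a file's bytes."""
    return hashlib.md5(path.read_bytes()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "songs.csv"  # hypothetical data file
    data.write_text("title,artist\n")
    before = file_md5(data)

    # Simulate an update to the dataset by appending one record.
    with data.open("a") as f:
        f.write("Yesterday,The Beatles\n")
    after = file_md5(data)

# The two digests differ because the file contents differ.
print(before != after)
```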
Iterating on this process for several datasets can give shape to a robust registry, which is basically a repository that mainly versions a set of DVC-files, as you can see in the hypothetical example below.
$ tree --filelimit=100
.
├── images
│   ├── .gitignore
│   ├── cats-dogs [2800 entries]  # Listed in .gitignore
│   ├── faces [10000 entries]     # Listed in .gitignore
│   ├── cats-dogs.dvc
│   └── faces.dvc
├── music
│   ├── .gitignore
│   ├── songs [11000 entries]     # Listed in .gitignore
│   └── songs.dvc
├── text
...
$ dvc push