One of the main uses of DVC repositories is the versioning of data and model
files. This is enabled by commands such as `dvc add` and `dvc run`, which allow
tracking of datasets and any other data artifacts.
To enable reuse of these versioned artifacts across different projects (similar
to package management systems, but for data), DVC also includes the
`dvc import` and `dvc update` commands. For example, project A may use a data
file to begin its data pipeline, but project B requires this same file as well.
Instead of adding it to both projects, B can simply import it from A.
Furthermore, the version of the data file imported to B can be an older
iteration than the one currently used in A.
Keeping this in mind, we could build a DVC project dedicated to tracking and
versioning datasets (or any kind of large files). This way we would have a
repository with all the metadata and change history for the project's data. We
can see who updated what, and when; use pull requests to update data the same
way we do with code; and we don't need ad-hoc conventions to store different
data versions. Other projects can share the data in this registry by
downloading (`dvc get`) or importing (`dvc import`) it for use in their own
data pipelines.
The advantages of using a DVC data registry project are:
- Data as code: Improve lifecycle management with versioning of simple directory structures (like Git for your cloud storage), without ad-hoc conventions. Leverage Git and Git hosting features such as change history, branching, pull requests, reviews, and even continuous deployment of ML models.
- Reusability: Reproduce and organize feature stores with simple CLI commands (`dvc get`, `dvc import`), similar to software package management systems.
- Persistence: The DVC registry-controlled remote storage (e.g. an S3 bucket) improves data security. There is less chance that someone will delete or rewrite a model, for example.
- Storage Optimization: Track data shared by multiple projects in a single, centralized location (with the ability to create distributed copies on other remotes). This simplifies data management and optimizes space requirements.
- Security: Registries can be set up to have read-only remote storage (e.g. an HTTP location). Git versioning of DVC-files allows us to track and audit data changes.
A dataset we use for several of our examples and tutorials is one containing
2800 images of cats and dogs. We partitioned the dataset in two for our
Versioning Tutorial, backed up the parts on a storage server, and downloaded
them with `wget` in our examples. This setup was later revised to download the
dataset with `dvc get` instead, so we created the dataset-registry repository,
a DVC project hosted on GitHub, to version the dataset.
However, there are a few problems with the way this dataset is structured. Most
importantly, this single dataset is tracked by 2 different DVC-files, instead
of 2 versions of the same one, which would better reflect the intention behind
this dataset. Fortunately, we have also prepared an improved alternative in the
`use-cases/` directory of the same DVC repository.
```
$ tree use-cases/cats-dogs --filelimit 3
use-cases/cats-dogs
└── data
    ├── train
    │   ├── cats [500 image files]
    │   └── dogs [500 image files]
    └── validation
        ├── cats [400 image files]
        └── dogs [400 image files]
```
In a local DVC project, we could have downloaded this dataset at this point
with the following command:

```
$ dvc import git@github.com:iterative/dataset-registry.git \
             use-cases/cats-dogs
```
Importing keeps the connection between the local project and the source data
registry from which we are downloading the dataset. This is achieved by
creating a special kind of DVC-file that uses the `repo` field (a.k.a. import
stage). This file can be used for versioning the import with Git.
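As a rough sketch, such an import stage DVC-file records the source repository and the pinned revision of the data; the field values below (revision, checksums) are illustrative placeholders, not real values:

```yaml
# cats-dogs.dvc (import stage) -- illustrative values only
deps:
- path: use-cases/cats-dogs
  repo:
    url: git@github.com:iterative/dataset-registry.git
    rev_lock: a1b2c3d  # hypothetical pinned Git revision of the registry
outs:
- path: cats-dogs
  md5: 0123456789abcdef0123456789abcdef.dir  # hypothetical checksum
```

Because the source `repo` and its revision are written down here, DVC can later check the registry for newer versions of the same data.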
Back in our dataset-registry project, a new version of our dataset was created
by extracting the second part, with 1000 additional images (500 cats, 500
dogs), into the same directory structure. Then, we simply ran
`dvc add use-cases/cats-dogs` again.
In our local project, all we have to do in order to obtain this latest version of the dataset is to run:
```
$ dvc update cats-dogs.dvc
```
This is possible because of the connection that the import stage saved between the local and source projects, as explained earlier.
This downloads new and changed files in `cats-dogs/` from the source project,
and updates the metadata in the import stage DVC-file.
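Since `dvc update` modifies the import stage DVC-file, the new dataset version can then be recorded in the local project's Git history; a minimal sketch (the commit message is our own):

```shell
$ git add cats-dogs.dvc
$ git commit -m "Update cats-dogs dataset to latest registry version"
```

This way the project history shows exactly when each dataset version was adopted, and checking out an older commit brings back the matching import stage.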