One of the main uses of DVC repositories is the versioning of data and model
files. This is enabled by commands such as `dvc add` and `dvc run`, which allow
tracking of datasets and any other data artifacts.
To enable reuse of these versioned artifacts across different projects (similar
to package management systems, but for data), DVC also includes the
`dvc import` and `dvc update` commands. For example, project A may use a data
file to begin its data pipeline, but project B requires this same file as well.
Instead of adding it to both projects, B can simply import it from A.
Furthermore, the version of the data file imported to B can be an older
iteration than the one currently used in A.
Keeping this in mind, we could build a DVC project dedicated to tracking and
versioning datasets (or any kind of large files). This way we would have a
repository with all the metadata and change history for the project's data. We
can see who updated what, and when; use pull requests to update data the same
way we do with code; and we don't need ad-hoc conventions to store different
data versions. Other projects can share the data in this registry by
downloading (`dvc get`) or importing (`dvc import`) it for use in their own
data pipelines.
The advantages of using a DVC data registry project are:
- Data as code: Improve lifecycle management with versioning of simple directory structures (like Git for your cloud storage), without ad-hoc conventions. Leverage Git and Git hosting features such as change history, branching, pull requests, reviews, and even continuous deployment of ML models.
- Reusability: Reproduce and organize feature stores with simple CLI commands (`dvc get`, `dvc import`), similar to software package management systems.
- Persistence: The DVC registry-controlled remote storage (e.g. an S3 bucket) improves data security. There is less chance that someone will delete or rewrite a model, for example.
- Storage Optimization: Track data shared by multiple projects in a single, centralized location (with the ability to create distributed copies on other remotes). This simplifies data management and optimizes space requirements.
- Security: Registries can be set up to have read-only remote storage (e.g. an HTTP location). Git versioning of DVC-files allows us to track and audit data changes.
A dataset we use for several of our examples and tutorials is one containing
2800 images of cats and dogs. We partitioned the dataset in two for our
Versioning Tutorial, backed up the parts on a storage server, and downloaded
them with `wget` in our examples. This setup was later revised to download the
dataset with `dvc get` instead, so we created the dataset-registry repository,
a DVC project hosted on GitHub, to version the dataset.
However, there are a few problems with the way this dataset is structured. Most
importantly, this single dataset is tracked by 2 different DVC-files, instead
of 2 versions of the same one, which would better reflect the intention behind
this dataset. Fortunately, we have also prepared an improved alternative in the
`use-cases/` directory of the same DVC repository.
```
$ tree use-cases/cats-dogs --filelimit 3
use-cases/cats-dogs
└── data
    ├── train
    │   ├── cats [500 image files]
    │   └── dogs [500 image files]
    └── validation
        ├── cats [400 image files]
        └── dogs [400 image files]
```
In a local DVC project, we could have downloaded this dataset at this point
with the following command:

```
$ dvc import git@github.com:iterative/dataset-registry.git \
             use-cases/cats-dogs
```
Importing keeps the connection between the local project and the source data
registry from which we are downloading the dataset. This is achieved by
creating a special kind of DVC-file that uses the `repo` field (a.k.a. import
stage). This file can be used for versioning the import with Git.
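As a rough sketch, such an import stage DVC-file records the source repository and the pinned revision of the data; the field values below (revision, checksums) are illustrative placeholders, not real values:

```yaml
# cats-dogs.dvc (import stage) -- illustrative values only
deps:
- path: use-cases/cats-dogs
  repo:
    url: git@github.com:iterative/dataset-registry.git
    rev_lock: a1b2c3d  # hypothetical pinned Git revision of the registry
outs:
- path: cats-dogs
  md5: 0123456789abcdef0123456789abcdef.dir  # hypothetical checksum
```

Because the source `repo` and its revision are written down here, DVC can later check the registry for newer versions of the same data.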
Back in our dataset-registry project, a new version of our dataset was created
by extracting the second part, with 1000 additional images (500 cats, 500
dogs), into the same directory structure. Then, we simply ran
`dvc add use-cases/cats-dogs` again.
In our local project, all we have to do in order to obtain this latest version of the dataset is to run:
```
$ dvc update cats-dogs.dvc
```
This is possible because of the connection that the import stage saved between the local and source projects, as explained earlier.
This downloads new and changed files in `cats-dogs/` from the source project,
and updates the metadata in the import stage DVC-file.
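Since `dvc update` modifies the import stage DVC-file, the new dataset version can then be recorded in the local project's Git history; a minimal sketch (the commit message is our own):

```shell
$ git add cats-dogs.dvc
$ git commit -m "Update cats-dogs dataset to latest registry version"
```

This way the project history shows exactly when each dataset version was adopted, and checking out an older commit brings back the matching import stage.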