What Is DVC?
Today the data science community still lacks good practices for organizing
projects and collaborating effectively. ML algorithms and methods are no
longer tribal knowledge, but they remain difficult to implement, manage, and
reuse. One of the biggest challenges in managing ML projects, and hence in
reusing them, is reproducibility.
Data Version Control, or DVC, is a new type of experiment management software
built on top of Git. DVC reduces the gap between existing tools and data science
needs, allowing users to take advantage of experiment management while reusing
existing skills and intuition.
DVC codifies data and ML experiments
Leveraging an underlying source code management system eliminates the need to
use 3rd-party services. Data science experiment sharing and collaboration can be
done through regular Git features (commit messages, merges, pull requests, etc)
the same way it works for software engineers.
DVC uses a few core concepts:
- Experiment: Equivalent to a
  Git revision. Each experiment (extracting
  new features, changing model hyperparameters, cleaning data, adding a new
  data source) can be performed in a separate branch or tag. DVC allows
  experiments to be integrated into a Git repository history and never needs
  to recompute the results after a successful merge.
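As a sketch, the branch-per-experiment workflow looks like this (the branch name and commit message are illustrative):

```shell
# Start a new experiment in its own Git branch (name is illustrative).
git checkout -b bigger-model

# Modify code or hyperparameters, then reproduce the affected pipeline stages.
dvc repro

# Commit the resulting DVC-files so the experiment state
# is captured in Git history.
git add .
git commit -m "Experiment: increase model capacity"
```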
- Experiment state or state: Equivalent to a Git snapshot (all committed
  files). A Git commit hash, branch or tag name, etc. can be used as a
  reference to an experiment state.
- Reproducibility: Action to reproduce an experiment state. This action
  generates output files (or directories) based on a set of input files and
  source code. This action usually changes the experiment state.
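Reproduction is driven by `dvc repro`; a minimal sketch (the stage file name is hypothetical):

```shell
# Re-run only the stages whose dependencies changed; stages whose
# inputs are unchanged are restored from cache, not recomputed.
dvc repro train.dvc
```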
- Pipeline: Dependency graph or series of commands to reproduce data
  processing results. The commands are connected by their inputs
  (dependencies) and outputs. Pipelines are defined by
  special stage files (similar to Makefiles).
  Refer to pipeline for more information.
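A pipeline stage ties a command to its dependencies and outputs. A sketch with `dvc run` (file and script names are illustrative):

```shell
# Create a stage file recording the command, its input dependencies (-d)
# and its outputs (-o); outputs are placed under DVC cache control.
dvc run -d process.py -d data/raw.csv \
        -o data/clean.csv \
        python process.py data/raw.csv data/clean.csv

# A second stage that depends on the first stage's output,
# forming a dependency graph.
dvc run -d train.py -d data/clean.csv \
        -o model.pkl \
        python train.py data/clean.csv model.pkl
```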
- Workflow: Set of experiments and relationships among them. Workflow
corresponds to the entire Git repository.
- Data files: Cached files (for large files). Data files are stored outside
of the Git repository on a local/shared hard drive or remote storage, but
DVC-files describing that data
are stored in Git for DVC needs (to maintain pipelines and reproducibility).
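Tracking a large data file looks like this (the file name is illustrative):

```shell
# Move data.csv into the DVC cache and create data.csv.dvc,
# a small text file that Git tracks in place of the data itself.
dvc add data.csv
git add data.csv.dvc .gitignore
git commit -m "Track raw data with DVC"
```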
- Cache directory: Directory with all data files on a local hard drive or in
cloud storage, but not in the Git repository. See
dvc cache dir.
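The cache location can be changed with `dvc cache dir`, for example to point it at a shared drive (the path is illustrative):

```shell
# Relocate the cache to a shared mount so colleagues on the
# same machine can reuse cached data files.
dvc cache dir /mnt/shared/dvc-cache
```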
- Cloud storage support: an optional complement to the core DVC features.
  This is how a data scientist transfers large data files or shares a
  GPU-trained model with those who do not have GPUs available.
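Sharing data through remote storage is a push/pull workflow; a sketch assuming an S3 bucket (remote and bucket names are illustrative):

```shell
# Configure a default remote backed by cloud storage.
dvc remote add -d myremote s3://my-bucket/dvc-storage

# Upload cached data files and models to the remote...
dvc push

# ...and download them on another machine after 'git clone' or 'git pull'.
dvc pull
```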
DVC streamlines large data files and binary models into a single Git
environment without requiring that binary files be stored in your Git
repository.
The diagram below describes all the DVC commands and relationships between a
local cache and remote storage:
DVC data management