Setting up an effective data science project structure can be challenging. Do you organize ML models in nested directory trees, link large datasets from different locations, or identify variations with ad hoc filename conventions? Adding versioning and dependency management needs can easily make this nearly impossible.
A DVC project structure is simplified by encapsulating data versioning and pipelining (e.g. machine learning workflows), among other features. This leaves a workspace directory with a clean view of your working raw data, source code, data artifacts, etc. and a few metafiles that enable these features. A single version of the project is visible at a time.
The DVC workspace is analogous to the working tree in Git.
Files and directories in the workspace can be added to DVC (dvc add), or they can be downloaded from external sources (dvc import-url). Changes to the data, notebooks, models, and any related machine learning artifact can be tracked (dvc commit), and their content can be synchronized (dvc checkout). Tracked data can also be removed from the workspace.
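
To make the idea of tracking and synchronizing workspace files more concrete, here is a toy model in Python. It is only a sketch of the general pattern (a small metafile records a content hash, and the content itself is kept in a cache), not DVC's actual implementation. The function names toy_add and toy_checkout, the .meta.json suffix, the .toy_cache directory, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path
from typing import Tuple


def toy_add(path: Path, cache: Path) -> Path:
    """Copy a file's content into the cache and write a small metafile beside it."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    cache.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(path, cache / digest)  # content is stored under its hash
    meta = path.with_name(path.name + ".meta.json")  # hypothetical metafile name
    meta.write_text(json.dumps({"path": path.name, "hash": digest}))
    return meta


def toy_checkout(meta: Path, cache: Path) -> Path:
    """Restore the workspace file so it matches the hash recorded in the metafile."""
    record = json.loads(meta.read_text())
    target = meta.with_name(record["path"])
    shutil.copyfile(cache / record["hash"], target)
    return target


def demo() -> Tuple[str, str]:
    """Track a file, modify it, then sync it back to the tracked version."""
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        cache = workspace / ".toy_cache"
        data = workspace / "data.csv"
        data.write_text("id,value\n1,10\n")

        meta = toy_add(data, cache)          # start tracking the file
        data.write_text("id,value\n1,99\n")  # local change that is not tracked
        modified = data.read_text()

        toy_checkout(meta, cache)            # workspace matches the metafile again
        return modified, data.read_text()


if __name__ == "__main__":
    modified, restored = demo()
    print("after edit:    ", repr(modified))
    print("after checkout:", repr(restored))
```

In this sketch the workspace only ever shows one version of data.csv at a time, while the cache holds the content that the metafile points to. That mirrors, in a very simplified way, the "single visible version" idea described above.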