

Setting up an effective data science project structure can be challenging. Do you organize ML models in nested directory trees, link large datasets from different locations, or identify variations with ad hoc filename conventions? Adding versioning and dependency management on top of that can easily make the task nearly impossible.

DVC simplifies the project structure by encapsulating data versioning and pipelining (e.g. machine learning workflows), among other features. This leaves a workspace directory with a clean view of your raw data, source code, data artifacts, etc., plus a few metafiles that enable these features. Only a single version of the project is visible at a time.

The DVC workspace is analogous to the working tree in Git.

Files and directories in the workspace can be added to DVC (dvc add), or they can be downloaded from external sources (dvc get, dvc import, dvc import-url). Changes to the data, notebooks, models, and any related machine learning artifacts can be tracked (dvc commit), and their content in the workspace can be synchronized with the versions recorded in the metafiles (dvc checkout). Tracked data can be removed (dvc remove) from the workspace.
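As a rough illustration of how these commands fit together, the sketch below maps each workspace operation mentioned above to its DVC subcommand and prints the resulting command lines. The operation labels (`track`, `sync`, etc.), the helper functions, and the `data/raw.csv` path are hypothetical names chosen for this example; only the `dvc` subcommands themselves come from the text. By default nothing is executed, so it is safe to run without DVC installed.

```python
import shutil
import subprocess

# Workspace operations from the text, mapped to DVC subcommands.
# The keys are hypothetical labels used only in this sketch.
WORKSPACE_COMMANDS = {
    "track": ["dvc", "add"],
    "download": ["dvc", "get"],
    "import": ["dvc", "import"],
    "import_url": ["dvc", "import-url"],
    "record_changes": ["dvc", "commit"],
    "sync": ["dvc", "checkout"],
    "untrack": ["dvc", "remove"],
}


def build_command(operation, *args):
    """Return the argument list for one workspace operation."""
    return WORKSPACE_COMMANDS[operation] + list(args)


def run(operation, *args, dry_run=True):
    """Print the command; execute it only if dry_run is False and dvc is on PATH."""
    cmd = build_command(operation, *args)
    print(" ".join(cmd))
    if not dry_run and shutil.which("dvc") is not None:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Assumed example path; replace with a file in your own workspace.
    run("track", "data/raw.csv")               # start tracking a file
    run("record_changes", "data/raw.csv.dvc")  # record changes to tracked data
    run("sync")                                # update the workspace from the metafiles
    run("untrack", "data/raw.csv.dvc")         # stop tracking the file
```

Passing `dry_run=False` would invoke the real commands inside an initialized DVC project; keeping the dry run as the default makes the mapping easy to inspect first.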
