
Get Started

Assuming DVC is already installed, let's initialize it by running dvc init inside a Git project:

⚙️ Expand to prepare the project.

Expandable sections that start with the ⚙️ emoji provide extra detail for readers who want to run the commands. Choose whichever way of reading suits you: read only the main text and skip sections like this one, which should be enough to understand the idea of DVC, or run the commands and get first-hand experience.

We'll be building an NLP project from scratch together. The end result is published on GitHub.

Let's start with git init:

$ mkdir example-get-started
$ cd example-get-started
$ git init
$ dvc init
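
Since the commands above assume that both Git and DVC are already installed, it can help to confirm that before running them. The following is a minimal sketch, not part of DVC itself: `require_cmd` and `check_prereqs` are hypothetical helper names introduced here for illustration, and the only thing they rely on is the standard `command -v` lookup.

```shell
#!/bin/sh
# require_cmd is a hypothetical helper (not part of DVC or Git):
# it succeeds only if the named command can be found on PATH.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "missing required command: $1" >&2
    return 1
}

# check_prereqs reports whether both tools used in this section are available.
# It checks every tool, so all missing commands are listed in one pass.
check_prereqs() {
    rc=0
    for cmd in git dvc; do
        require_cmd "$cmd" || rc=1
    done
    return "$rc"
}
```

One way to use it is `check_prereqs && echo "ready to run git init and dvc init"`, which prints the message only when both tools are found.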

A few directories and files are created that should be added to Git:

$ git status
Changes to be committed:
        new file:   .dvc/.gitignore
        new file:   .dvc/config
        ...
$ git commit -m "Initialize DVC"
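
To confirm that initialization worked, one option is to look for the files listed in the `git status` output above. The sketch below assumes that the presence of `.dvc/config` is a good enough signal; `is_dvc_project` is a hypothetical helper name, not a DVC command.

```shell
#!/bin/sh
# is_dvc_project is a hypothetical helper: it treats a directory as
# DVC-initialized if it contains .dvc/config, one of the files that
# appears in the git status output after dvc init.
# The directory argument is optional and defaults to the current one.
is_dvc_project() {
    [ -f "${1:-.}/.dvc/config" ]
}

# Report on the current directory.
if is_dvc_project .; then
    echo "DVC is initialized here"
else
    echo "no .dvc/config found; run dvc init first"
fi
```

This only checks that the file exists. It does not validate the contents of the configuration.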

DVC features can be grouped into layers. We'll explore them one by one in the next few sections:

  • Data versioning is the core of DVC: it handles versioning and efficient sharing of large files, datasets, and machine learning models. We'll show how to use a regular Git workflow without storing large files in Git. Think "Git for data".
  • Data access shows how to use data artifacts from outside of the project and how to import data artifacts from another DVC project. This helps when you need to download a specific version of an ML model to a deployment server or import a model into another project.
  • Data pipelines describe how models and other data artifacts are built, and provide an efficient way to reproduce them. Think "Makefiles for data and ML projects" done right.
  • Experiments attach parameters, metrics, and plots. You can capture and navigate experiments without leaving Git. Think "Git for machine learning".
