How cool would it be to make Git handle arbitrarily large files and directories with the same performance it has with small code files? Imagine doing a git clone and seeing data files and machine learning models in the workspace. Or switching to a different version of a 100 GB file in less than a second with a git checkout.
The foundation of DVC consists of a few commands that you can run along with git to track large files, directories, or ML models. Think "Git for data".
Read on or watch our video to learn about versioning data with DVC!
To start tracking a file or directory, use dvc add:
$ dvc add data/data.xml
DVC stores information about the added file (or directory) in a special .dvc file named data/data.xml.dvc, a small text file with a human-readable format. This file can be easily versioned like source code with Git, as a placeholder for the original data (which gets listed in .gitignore):
$ git add data/data.xml.dvc data/.gitignore
$ git commit -m "Add raw data"
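For reference, the .dvc file is just a few lines in a simple YAML format. It might look something like this (the hash and size values below are illustrative, not exact):

$ cat data/data.xml.dvc
outs:
- md5: 22a1a2931c8370d3aeedd7183606fd7f
  size: 14445097
  path: data.xml

The md5 value identifies the exact content of the tracked file, which DVC keeps in its internal cache.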
You can upload DVC-tracked data or models with dvc push, so they're safely stored remotely. This also means they can be retrieved in other environments later with dvc pull. First, we need to set up remote storage:
$ dvc remote add -d storage s3://mybucket/dvcstore
$ git add .dvc/config
$ git commit -m "Configure remote storage"
DVC supports the following remote storage types: Google Drive, Amazon S3, Azure Blob Storage, Google Cloud Storage, Aliyun OSS, SSH, HDFS, and HTTP. Please refer to dvc remote add for more details and examples.
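For example, switching the example above to an SSH remote would look like this (the user, host, and path are hypothetical):

$ dvc remote add -d storage ssh://user@example.com/path/to/dvcstore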
$ dvc push
Usually, we also want to git commit and git push the corresponding .dvc files.
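In this walkthrough the .dvc file was already committed above, so a git push is all that's left:

$ git push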
Once DVC-tracked data is stored remotely, it can be downloaded when needed in other copies of this project with dvc pull. Usually, we run it after git clone and git pull.
$ dvc pull
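Putting it together, setting up a fresh copy of the project might look like this (the repository URL is hypothetical):

$ git clone https://github.com/example/project.git
$ cd project
$ dvc pull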
📖 See also Sharing Data and Model Files for more on basic collaboration workflows.
When you make a change to a file or directory, run dvc add again to track the latest version:
$ dvc add data/data.xml
Usually you would also run git commit and dvc push to save the changes:
$ git commit data/data.xml.dvc -m "Dataset updates"
$ dvc push
The regular workflow is to use git checkout first (to switch to a branch, check out a commit, or restore a specific revision of a .dvc file), and then run dvc checkout to sync the data:
$ git checkout <...>
$ dvc checkout
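For example, to go back to the previous version of the dataset (assuming the last commit changed data/data.xml.dvc):

$ git checkout HEAD~1 data/data.xml.dvc
$ dvc checkout

If the old version is still in DVC's cache, dvc checkout can restore it almost instantly by linking it into the workspace instead of copying it.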
Yes, DVC is technically not even a version control system! The contents of .dvc files define data file versions, while Git itself provides the version control. DVC in turn creates these .dvc files, updates them, and synchronizes DVC-tracked data in the workspace efficiently to match them.
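You can see this by diffing a .dvc file across commits; here is abridged output with made-up hashes and sizes:

$ git diff HEAD~1 -- data/data.xml.dvc
 outs:
-- md5: 22a1a2931c8370d3aeedd7183606fd7f
-  size: 14445097
+- md5: a304afb96060aad90176268345e10355
+  size: 23250027
   path: data.xml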
In cases where you process very large datasets, you need an efficient mechanism (in terms of space and performance) to share a lot of data, including different versions of that data. Do you use network-attached storage (NAS)? Or a large external volume?
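One building block for such setups is pointing DVC's cache at a shared location with dvc cache dir, sketched here with a hypothetical mount point:

$ dvc cache dir /mnt/shared/dvc-cache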
While these cases are not covered in the Get Started, we recommend reading the following sections next to learn more about advanced workflows: