How cool would it be to track large datasets and machine learning models alongside your code, sidestepping all the limitations of storing them in Git? Imagine cloning a repository and immediately seeing your datasets, checkpoints, and models staged in your workspace. Imagine switching to a different version of a 100 GB file in less than a second with a git checkout.

💫 DVC is your "Git for data"!
Working inside an initialized project directory, let's pick a piece of data to work with. We'll use an example data.xml file, though any text or binary file (or directory) will do. Start by downloading it into the data/ directory with dvc get:

$ dvc get https://github.com/iterative/dataset-registry \
      get-started/data.xml -o data/data.xml

Use dvc add to start tracking the dataset file:
$ dvc add data/data.xml
DVC stores information about the added file in a special
.dvc file named
data/data.xml.dvc. This small, human-readable metadata file acts as a
placeholder for the original data for the purpose of Git tracking.
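To make this concrete, you can print the metadata file. The sketch below assumes the data.xml download above; the md5 and size values are illustrative placeholders, and the exact fields vary slightly between DVC versions:

$ cat data/data.xml.dvc
outs:
- md5: 22a1a2931c8370d3aeedd7183606fd7f
  size: 14445097
  path: data.xml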
Next, run the following commands to track changes in Git:
$ git add data/data.xml.dvc data/.gitignore
$ git commit -m "Add raw data"
Now the metadata about your data is versioned alongside your source code, while the original data file was added to .gitignore.
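Concretely, dvc add appended an entry to data/.gitignore so that Git ignores the raw file, which is why we staged that file above. You can check (the output below assumes the single data.xml file from this chapter):

$ cat data/.gitignore
/data.xml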
You can upload DVC-tracked data to a variety of storage systems (remote or local) referred to as "remotes". For simplicity, we will use a "local remote" for this guide, which is just a directory in the local file system.
Before pushing data to a remote, we need to set it up using the dvc remote add command:
On Mac/Linux:

$ mkdir /tmp/dvcstore
$ dvc remote add -d myremote /tmp/dvcstore

On Windows (cmd):

$ mkdir %TEMP%\dvcstore
$ dvc remote add -d myremote %TEMP%\dvcstore
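The -d flag marks this remote as the default. DVC records remote settings in the Git-tracked .dvc/config file; as a rough sketch (assuming the Mac/Linux path above; exact formatting may differ by version), it now contains something like:

$ cat .dvc/config
[core]
    remote = myremote
['remote "myremote"']
    url = /tmp/dvcstore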
DVC supports many remote storage types, including Amazon S3, NFS, SSH, Google Drive, Azure Blob Storage, and HDFS.
An example for a common use case is configuring an Amazon S3 remote:
$ dvc remote add -d storage s3://mybucket/dvcstore
For this to work, you'll need an AWS account and credentials set up to allow access.
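One way to supply such credentials, sketched here with placeholder values, is dvc remote modify with the --local flag, so that secrets land in .dvc/config.local (which DVC keeps out of Git) rather than the shared config:

$ dvc remote modify --local storage access_key_id 'mykeyid'
$ dvc remote modify --local storage secret_access_key 'mysecret'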
To learn more about storage remotes, see the Remote Storage Guide.
Now that a storage remote is configured, run dvc push to upload data:
$ dvc push
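If you used the local remote from above, you can peek at what was uploaded. The exact directory layout depends on your DVC version, but you should see content-addressed files named after hashes of the data rather than the original filename:

$ ls -R /tmp/dvcstore
# e.g. a nested path like files/md5/22/a1a2931c... (illustrative; layout varies)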
Usually, we would also want to Git track any code changes that led to the data change (git commit and git push).
Once DVC-tracked data and models are stored remotely, they can be downloaded with dvc pull when needed (e.g. in other copies of this project). Usually, we run it after git pull or git clone. Let's try this now:
$ dvc pull
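If nothing happens because everything is already in place, you can make the pull do real work by deleting the cached and workspace copies first (the paths below assume the default cache location and our data/ layout):

$ rm -rf .dvc/cache
$ rm -f data/data.xml
$ dvc pull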
Next, let's say we obtained more data from some external source. We will simulate this by doubling the dataset contents:
On Mac/Linux:

$ cp data/data.xml /tmp/data.xml
$ cat /tmp/data.xml >> data/data.xml

On Windows (cmd):

$ copy data\data.xml %TEMP%\data.xml
$ type %TEMP%\data.xml >> data\data.xml
After modifying the data, run
dvc add again to track the latest version:
$ dvc add data/data.xml
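dvc add rewrites data/data.xml.dvc with the new hash and size, which Git sees as an ordinary modification; a quick check:

$ git status -s data/data.xml.dvc
 M data/data.xml.dvc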
Now we can run dvc push to upload the changes to the remote storage, followed by git commit to track them:
$ dvc push
$ git commit data/data.xml.dvc -m "Dataset updates"
The regular workflow is to use git checkout first (to switch a branch or checkout a .dvc file version) and then run dvc checkout to sync data:

$ git checkout <...>
$ dvc checkout
Let's go back to the original version of the data:
$ git checkout HEAD~1 data/data.xml.dvc
$ dvc checkout
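To sanity-check the revert (assuming you doubled the file as above), look at the file size; it should be back to roughly half of the doubled size:

$ ls -lh data/data.xml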
Let's commit it (no need to do
dvc push this time since this original version
of the dataset was already saved):
$ git commit data/data.xml.dvc -m "Revert dataset updates"
As you can see, DVC is technically not a version control system by itself! It manipulates .dvc files, whose contents define the data file versions. Git is already used to version your code, and now it can also version your data alongside it.
Your tracked data can be imported and fetched from anywhere using DVC. For example, you may want to download a specific version of an ML model to a deployment server, or import a dataset into another project, as we did at the top of this chapter. To learn how DVC enables this, see the Discovering and Accessing Data Guide.