Get Started: Data Versioning


How cool would it be to make Git handle arbitrarily large files and directories with the same performance it has with small code files? Imagine cloning a repository and seeing data files and machine learning models in the workspace. Or switching to a different version of a 100 GB file in less than a second with git checkout. Think "Git for data".

Having initialized a project in the previous section, we can get the data file (which we'll be using later) like this:

$ dvc get https://github.com/iterative/dataset-registry \
          get-started/data.xml -o data/data.xml

We use the fancy dvc get command to jump ahead a bit and show how a Git repo becomes a source for datasets or models — what we call a data registry. dvc get can download any file or directory tracked in a DVC repository.

To start tracking a file or directory, use dvc add:

$ dvc add data/data.xml

DVC stores information about the added file in a special .dvc file named data/data.xml.dvc — a small text file with a human-readable format. This metadata file is a placeholder for the original data, and can be easily versioned like source code with Git:

$ git add data/data.xml.dvc data/.gitignore
$ git commit -m "Add raw data"

The data, meanwhile, is listed in .gitignore.

dvc add moved the data to the project's cache, and linked it back to the workspace. The .dvc/cache should look like this:

└── 22
    └── a1a2931c8370d3aeedd7183606fd7f

The hash value of the data.xml file we just added (22a1a29...) determines the cache path shown above. And if you check data/data.xml.dvc, you will find it there too:

outs:
- md5: 22a1a2931c8370d3aeedd7183606fd7f
  path: data.xml
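The cache layout above follows directly from the hash: the first two characters of the digest name a directory, and the rest name the file inside it. A minimal Python sketch of that mapping (illustrative only, not DVC's actual code):

```python
import hashlib

def cache_path(data: bytes) -> str:
    """Map file contents to a DVC-style cache path: the first two hex
    characters of the MD5 digest name the directory, the rest the file."""
    digest = hashlib.md5(data).hexdigest()
    return f"{digest[:2]}/{digest[2:]}"

# A file hashing to 22a1a2931c8370d3aeedd7183606fd7f would therefore
# live at .dvc/cache/22/a1a2931c8370d3aeedd7183606fd7f.
print(cache_path(b"example contents"))
```

Because the path is derived from content, two identical files always share one cache object, and any change to the content produces a new one.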

Storing and sharing

You can upload DVC-tracked data or model files with dvc push, so they're safely stored remotely. This also means they can be retrieved on other environments later with dvc pull. First, we need to set up a remote storage location:

$ dvc remote add -d storage s3://mybucket/dvcstore
$ git add .dvc/config
$ git commit -m "Configure remote storage"

DVC supports many remote storage types, including Amazon S3, SSH, Google Drive, Azure Blob Storage, and HDFS. See dvc remote add for more details and examples.

DVC remotes let you store a copy of the data tracked by DVC outside of the local cache (usually a cloud storage service). For simplicity, let's set up a local remote in a temporary dvcstore/ directory (create the dir first if needed):

On Linux/macOS:

$ dvc remote add -d myremote /tmp/dvcstore
$ git commit .dvc/config -m "Configure local remote"

On Windows:

$ dvc remote add -d myremote %TEMP%\dvcstore
$ git commit .dvc\config -m "Configure local remote"

While the term "local remote" may seem contradictory, it doesn't have to be. The "local" part refers to the type of location: another directory in the file system. "Remote" is what we call storage for DVC projects. It's essentially a local data backup.

$ dvc push

Usually, we also want to git commit and git push the corresponding .dvc files.

dvc push copied the data cached locally to the remote storage we set up earlier. The remote storage directory should look like this:

└── 22
    └── a1a2931c8370d3aeedd7183606fd7f
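Conceptually, dvc push is a one-way sync: every cache object missing from the remote is copied over, keeping the same two-character directory layout. A rough stdlib sketch of that idea (a simplification, not DVC's implementation, which also handles integrity checks, concurrency, and many storage backends):

```python
import shutil
from pathlib import Path

def push(cache_dir: Path, remote_dir: Path) -> int:
    """Copy every cache object to the remote, preserving the
    two-character directory layout; skip objects already there."""
    copied = 0
    for obj in cache_dir.glob("*/*"):
        dest = remote_dir / obj.relative_to(cache_dir)
        if not dest.exists():
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(obj, dest)
            copied += 1
    return copied
```

dvc pull is the same copy in the opposite direction, from the remote into .dvc/cache.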


With DVC-tracked data and models stored remotely, they can be downloaded when needed into other copies of this project with dvc pull. Usually, we run it after git clone and git pull.

If you've run dvc push, you can delete the cache (.dvc/cache) and data/data.xml to experiment with dvc pull:

On Linux/macOS:

$ rm -rf .dvc/cache
$ rm -f data/data.xml

On Windows:

$ rmdir /s /q .dvc\cache
$ del data\data.xml

Then, on either platform:

$ dvc pull

See dvc remote for more information on remote storage.

Making changes

When you make a change to a file or directory, run dvc add again to track the latest version.

Let's say we obtained more data from some external source. We can pretend this is the case by doubling the dataset:

On Linux/macOS:

$ cp data/data.xml /tmp/data.xml
$ cat /tmp/data.xml >> data/data.xml

On Windows:

$ copy data\data.xml %TEMP%\data.xml
$ type %TEMP%\data.xml >> data\data.xml

Then track the new version:

$ dvc add data/data.xml

Usually you would also run git commit and dvc push to save the changes:

$ git commit data/data.xml.dvc -m "Dataset updates"
$ dvc push
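Because cache objects are addressed by content, the updated dataset lands in the cache next to the original rather than replacing it, which is what makes switching versions later cheap. A small illustrative sketch (the `add_to_cache` helper is hypothetical, not part of DVC):

```python
import hashlib
import tempfile
from pathlib import Path

def add_to_cache(cache: Path, data: bytes) -> str:
    """Store data under its MD5-derived cache path and return the digest."""
    digest = hashlib.md5(data).hexdigest()
    obj = cache / digest[:2] / digest[2:]
    obj.parent.mkdir(parents=True, exist_ok=True)
    obj.write_bytes(data)
    return digest

cache = Path(tempfile.mkdtemp())
v1 = b"<data/>\n"
v2 = v1 + v1                      # the doubled dataset, as with cat >> above
h1, h2 = add_to_cache(cache, v1), add_to_cache(cache, v2)

# Both versions now sit in the cache side by side; the .dvc file
# simply records whichever digest is current.
print(h1, h2, sep="\n")
```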

Switching between versions

The regular workflow is to use git checkout first (to switch a branch or checkout a .dvc file version) and then run dvc checkout to sync data:

$ git checkout <...>
$ dvc checkout

Let's go back to the original version of the data:

$ git checkout HEAD~1 data/data.xml.dvc
$ dvc checkout

Let's commit it (no need to do dvc push this time since this original version of the dataset was already saved):

$ git commit data/data.xml.dvc -m "Revert dataset updates"

Yes, DVC is technically not a version control system! Git itself provides that layer. DVC in turn manipulates .dvc files, whose contents define the data file versions. DVC also synchronizes DVC-tracked data in the workspace efficiently to match them.
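Under simplifying assumptions (a single tracked file and the MD5 cache layout shown earlier), the sync step of dvc checkout can be pictured as reading the digest from the .dvc file and restoring that object from the cache. A sketch, not DVC's actual code, which uses file links where possible instead of copies:

```python
import shutil
from pathlib import Path

def checkout(dvc_file: Path, cache: Path, workspace: Path) -> Path:
    """Restore the data version recorded in a .dvc file: read its md5
    and path fields, then copy the matching cache object back into
    the workspace."""
    fields = {}
    for line in dvc_file.read_text().splitlines():
        key, sep, value = line.strip().lstrip("- ").partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    digest = fields["md5"]
    target = workspace / fields["path"]
    shutil.copy2(cache / digest[:2] / digest[2:], target)
    return target
```

After git checkout rewinds the .dvc file to an older md5, a step like this is all that is needed to bring the workspace data back in sync.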

Versioning large datasets

In cases where you process very large datasets, you need an efficient mechanism (in terms of space and performance) to share a lot of data, including different versions. Do you use network attached storage (NAS)? Or a large external volume? You can learn more about advanced workflows using these links:

  • A shared cache can be set up to store, version and access a lot of data on a large shared volume efficiently.
  • A quite advanced scenario is to track and version data directly on the remote storage (e.g. S3). See Managing External Data to learn more.
