Get Started: Data Versioning
How cool would it be to make Git handle arbitrarily large files and directories with the same performance it has with small code files? Imagine cloning a repository and seeing data files and machine learning models in the workspace. Or switching to a different version of a 100 GB file in less than a second with a `git checkout`. Think "Git for data".
Having initialized a project in the previous section, we can get the data file (which we'll be using later) like this:
$ dvc get https://github.com/iterative/dataset-registry \
get-started/data.xml -o data/data.xml
We use the fancy `dvc get` command to jump ahead a bit and show how a Git repo becomes a source for datasets or models (what we call a data registry). `dvc get` can download any file or directory tracked in a DVC repository.
To start tracking a file or directory, use `dvc add`:
$ dvc add data/data.xml
DVC stores information about the added file in a special `.dvc` file named `data/data.xml.dvc`, a small text file with a human-readable format. This metadata file is a placeholder for the original data, and can be easily versioned like source code with Git:
$ git add data/data.xml.dvc data/.gitignore
$ git commit -m "Add raw data"
The data itself, meanwhile, is listed in `.gitignore`.
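`dvc add` generated that `data/.gitignore` entry so Git never picks up the raw file. As a sketch, it should contain a single line like this (the exact path format may vary between DVC versions):

/data.xml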
`dvc add` moved the data to the project's cache and linked it back to the workspace. The `.dvc/cache` directory should look like this:
.dvc/cache
└── 22
    └── a1a2931c8370d3aeedd7183606fd7f
The hash value of the `data.xml` file we just added (`22a1a29...`) determines the cache path shown above. And if you check `data/data.xml.dvc`, you will find it there too:
outs:
- md5: 22a1a2931c8370d3aeedd7183606fd7f
  path: data.xml
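To convince yourself that the cache path is simply the file's MD5 hash split after its first two characters, you can hash the file directly. A minimal check, assuming a POSIX system with `md5sum` available:

$ md5sum data/data.xml
22a1a2931c8370d3aeedd7183606fd7f  data/data.xml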
Storing and sharing
You can upload DVC-tracked data or model files with `dvc push`, so they're safely stored remotely. This also means they can be retrieved on other environments later with `dvc pull`. First, we need to set up a remote storage location:
$ dvc remote add -d storage s3://mybucket/dvcstore
$ git add .dvc/config
$ git commit -m "Configure remote storage"
DVC supports many remote storage types, including Amazon S3, SSH, Google Drive, Azure Blob Storage, and HDFS. See `dvc remote add` for more details and examples.
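For illustration, here's roughly what the command looks like for a few of these storage types; the host, container, and folder names below are placeholders, not working endpoints (normally you'd configure just one default remote):

$ dvc remote add -d storage ssh://user@example.com/path/to/dvcstore
$ dvc remote add -d storage azure://mycontainer/dvcstore
$ dvc remote add -d storage gdrive://<folder-id>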
DVC remotes let you store a copy of the data tracked by DVC outside of the local cache, usually on a cloud storage service. For simplicity, though, let's set up a local remote in a temporary `dvcstore/` directory for this guide (create the directory first if needed):
Mac/Linux:

$ dvc remote add -d myremote /tmp/dvcstore
$ git commit .dvc/config -m "Configure local remote"

Windows (cmd):

$ dvc remote add -d myremote %TEMP%\dvcstore
$ git commit .dvc\config -m "Configure local remote"
While the term "local remote" may seem contradictory, it doesn't have to be. The "local" part refers to the type of location: another directory in the file system. "Remote" is what we call storage for DVC projects. It's essentially a local data backup.
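Either way, the `-d` (default) flag is reflected in `.dvc/config`, which should now contain something along these lines (a sketch of DVC's INI-style config, shown for the local remote):

['remote "myremote"']
    url = /tmp/dvcstore
[core]
    remote = myremote

With the remote configured, we can push: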
$ dvc push
Usually, we also want to `git commit` and `git push` the corresponding `.dvc` files.
`dvc push` copied the data cached locally to the remote storage we set up earlier. The remote storage directory should look like this:
.../dvcstore
└── 22
    └── a1a2931c8370d3aeedd7183606fd7f
Retrieving
Once DVC-tracked data and models are stored remotely, they can be downloaded when needed in other copies of this project with `dvc pull`. Usually, we run it after `git clone` and `git pull`.
$ dvc pull
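If you'd like to watch `dvc pull` actually transfer data right now, one way is to simulate a fresh environment by deleting the local copies first; this is safe here because the data was already pushed (POSIX commands shown):

$ rm -rf .dvc/cache
$ rm -f data/data.xml
$ dvc pull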
See `dvc remote` for more information on remote storage.
Making changes
When you make a change to a file or directory, run `dvc add` again to track the latest version:
Let's say we obtained more data from some external source. We can pretend this is the case by doubling the dataset:
Mac/Linux:

$ cp data/data.xml /tmp/data.xml
$ cat /tmp/data.xml >> data/data.xml

Windows (cmd):

$ copy data\data.xml %TEMP%\data.xml
$ type %TEMP%\data.xml >> data\data.xml
$ dvc add data/data.xml
Usually you would also run `git commit` and `dvc push` to save the changes:
$ git commit data/data.xml.dvc -m "Dataset updates"
$ dvc push
Switching between versions
The regular workflow is to use `git checkout` first (to switch a branch or check out a `.dvc` file version) and then run `dvc checkout` to sync data:
$ git checkout <...>
$ dvc checkout
Let's go back to the original version of the data:
$ git checkout HEAD~1 data/data.xml.dvc
$ dvc checkout
Let's commit it (no need to do `dvc push` this time since this original version of the dataset was already saved):
$ git commit data/data.xml.dvc -m "Revert dataset updates"
Yes, DVC is technically not a version control system! Git itself provides that layer. DVC in turn manipulates `.dvc` files, whose contents define the data file versions. DVC also synchronizes the DVC-tracked data in the workspace efficiently to match them.
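One concrete way to see this division of labor is to diff the metafile across the two dataset versions: only the hash recorded in `data/data.xml.dvc` changes in Git history, never the data itself. Illustrative output, with the doubled dataset's hash elided:

$ git diff HEAD~1 -- data/data.xml.dvc
 outs:
-- md5: <hash of the doubled dataset>
+- md5: 22a1a2931c8370d3aeedd7183606fd7f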
Large datasets versioning
In cases where you process very large datasets, you need an efficient mechanism (in terms of space and performance) to share a lot of data, including different versions. Do you use network attached storage (NAS)? Or a large external volume? You can learn more about advanced workflows using these links:
- A shared cache can be set up to store, version and access a lot of data on a large shared volume efficiently (a minimal configuration sketch follows this list).
- A more advanced scenario is to track and version data directly on the remote storage (e.g. S3). See Managing External Data to learn more.
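As a hedged sketch of the first option: pointing a project at a shared cache on a mounted volume takes a couple of commands. The path below is hypothetical, and `symlink` is just one of the link types DVC supports:

$ dvc cache dir /mnt/shared-volume/dvc-cache
$ dvc config cache.type symlink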