Tutorial: Data Registry Basics
Building registries
Adding datasets to a registry can be as simple as placing the data file or directory in question inside the workspace and tracking it with dvc add. A standard Git workflow can be followed with the .dvc files that stand in for the actual data (e.g. music/songs.dvc below). This enables team collaboration on data at the same level as with source code:
This sample dataset actually exists.
$ mkdir -p music/songs
$ cp -r ~/Downloads/millionsongsubset_full music/songs
$ dvc add music/songs/
$ git add music/songs.dvc music/.gitignore
$ git commit -m "Track 1.8 GB 10,000 song dataset in music/"
The actual data is stored in the project's cache, and can be pushed to one or more [remote storage] locations so the registry can be accessed from other locations and by other people:
$ dvc remote add -d myremote s3://mybucket/dvcstore
$ dvc push
💡 A good way to organize DVC repositories into data registries is to use directories to group similar data, e.g. images/, natural-language/, etc. For example, our dataset registry has directories like get-started/ and use-cases/, matching parts of this website.
Using registries
The main methods to consume artifacts from a data registry are the dvc import and dvc get commands, as well as the Python API dvc.api.
But first, we may want to explore its contents.
Listing data
To explore the contents of a DVC repository in search of the right data, use the dvc list command (similar to ls and 3rd-party tools like aws s3 ls):
$ dvc list -R https://github.com/iterative/dataset-registry
.gitignore
README.md
get-started/.gitignore
get-started/data.xml
get-started/data.xml.dvc
images/.gitignore
images/dvc-logo-outlines.png
...
Both Git-tracked files and DVC-tracked data (or models, etc.) are listed.
Data downloads
dvc get is analogous to using direct download tools like wget (HTTP), aws s3 cp (S3), etc. To get a dataset from a DVC repo, we can run something like this:
$ dvc get https://github.com/example/registry music/songs
This downloads music/songs from the project's default remote and places it in the current working directory.
Data import workflow
dvc import uses the same syntax as dvc get:
$ dvc import https://github.com/example/registry images/faces
Besides downloading the data, importing saves the information about the dependency that the local project has on the data source (registry repo). This is achieved by generating a special import .dvc file, which contains this metadata.
Whenever the dataset changes in the registry, we can bring the data up to date with dvc update:
$ dvc update faces.dvc
This downloads new and changed files, and removes deleted ones, based on the latest commit in the source repo. It also updates the .dvc file accordingly.
Note that dvc get, dvc import, and dvc update have a --rev option to download data from a specific commit of the source repository.
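To make the --rev option concrete, here is a minimal sketch of a helper that assembles these commands from a script. The names dvc_command and run_dvc, and the tag v1.0, are hypothetical and used only for illustration; the sketch assumes the dvc executable is on the PATH when run_dvc is called.

```python
import subprocess


def dvc_command(action, *args, rev=None):
    """Build the argument list for `dvc get`, `dvc import`, or `dvc update`."""
    if action not in ("get", "import", "update"):
        raise ValueError(f"unsupported action: {action}")
    cmd = ["dvc", action, *args]
    if rev is not None:
        # --rev takes a Git revision of the source repo (commit, tag, or branch).
        cmd += ["--rev", rev]
    return cmd


def run_dvc(action, *args, rev=None):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(dvc_command(action, *args, rev=rev), check=True)


# Pin a download to a hypothetical tag instead of the latest commit.
print(dvc_command("get", "https://github.com/example/registry",
                  "music/songs", rev="v1.0"))
```

Building the argument list separately from running it keeps the command easy to inspect or log before anything is downloaded.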
Using DVC data from Python code
Our Python API (dvc.api), part of the dvc package that is installed with DVC, includes the open function to load or stream data directly from external DVC projects:
import pickle

import dvc.api

model_path = 'model.pkl'
repo_url = 'https://github.com/example/registry'
# Open in binary mode, since pickle reads bytes.
with dvc.api.open(model_path, repo=repo_url, mode='rb') as f:
    model = pickle.load(f)
# ... Use the model!
This opens model.pkl as a file-like object. This example illustrates a simple ML model deployment method, but it could be extended to more advanced scenarios such as a model zoo.
See also the dvc.api.read() and dvc.api.get_url() functions.
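As a rough illustration of those two functions, the sketch below returns both the storage URL and the contents of a tracked file. The function name describe_artifact and its api parameter are assumptions made for this example: api lets a stand-in object be passed for offline experiments, and when omitted the real dvc.api module is imported.

```python
def describe_artifact(path, repo, rev=None, api=None):
    """Return the storage URL and text contents of a DVC-tracked file."""
    if api is None:
        import dvc.api as api  # requires the dvc package to be installed
    # get_url() resolves where the file lives in remote storage.
    url = api.get_url(path, repo=repo, rev=rev)
    # read() returns the complete contents in one call (text mode by default).
    text = api.read(path, repo=repo, rev=rev)
    return {"url": url, "text": text}


# Example call against the registry listed earlier (needs network access):
# info = describe_artifact(
#     "get-started/data.xml",
#     "https://github.com/iterative/dataset-registry",
# )
# print(info["url"], len(info["text"]))
```

dvc.api.read() loads the whole file into memory, so for large artifacts the streaming dvc.api.open() shown above is likely the better fit.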
Updating registries
Datasets evolve, and DVC is prepared to handle it. Just change the data in the registry, and apply the updates by running dvc add again:
$ cp 1000/more/songs/* music/songs/
$ dvc add music/songs/
DVC modifies the corresponding .dvc file to reflect the changes, and this is picked up by Git:
$ git status
...
modified: music/songs.dvc
$ git commit -am "Add 1,000 more songs to music/ dataset."
Iterating on this process for several datasets can give shape to a robust registry. The result is basically a repo that versions a set of metafiles. Let's see an example:
$ tree --filelimit=10
.
├── images
│   ├── .gitignore
│   ├── cats-dogs [2800 entries]  # Listed in .gitignore
│   ├── faces [10000 entries]     # Listed in .gitignore
│   ├── cats-dogs.dvc
│   └── faces.dvc
├── music
│   ├── .gitignore
│   ├── songs [11000 entries]     # Listed in .gitignore
│   └── songs.dvc
├── text
...
And let's not forget to dvc push data changes to the [remote storage], so others can obtain them!
$ dvc push
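The update cycle in this section (dvc add, git commit, dvc push) could be scripted along these lines. This is only a sketch: update_steps and publish_update are hypothetical names, and the default dry-run mode prints the commands rather than executing them, so it can be tried without DVC installed.

```python
import shlex
import subprocess


def update_steps(dataset_dir, message):
    """Return the commands that publish a changed dataset, in order."""
    return [
        ["dvc", "add", dataset_dir],        # re-track the data, rewriting the .dvc file
        ["git", "commit", "-am", message],  # version the modified .dvc file
        ["dvc", "push"],                    # upload the new data to remote storage
    ]


def publish_update(dataset_dir, message, dry_run=True):
    """Print the steps (dry run) or execute them, stopping at the first failure."""
    for cmd in update_steps(dataset_dir, message):
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)


publish_update("music/songs/", "Add 1,000 more songs to music/ dataset.")
```

Passing dry_run=False runs the same three steps for real, from inside the registry repository.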