Discovering and accessing data

Assuming you've learned the basics of how to track and version data with DVC, you might wonder: How can we access and use these artifacts outside of the DVC project? How do we download a model to deploy it? How do we download a specific version of a model? How do we reuse datasets across different projects?

These questions tend to come up when you browse the files that DVC saves to remote storage (e.g. s3://dvc-public/remote/get-started/fb/89904ef053f04d64eafcc3d70db673 😱 instead of the original file name such as model.pkl or data.xml).

Remember those .dvc files dvc add generates? Those files (and dvc.lock) have their history in Git. DVC's remote storage config is also saved in Git, and contains all the information needed to access and download any version of datasets, files, and models. It means that a Git repository with DVC files becomes an entry point, and can be used instead of accessing files directly.
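To make the link between a hash recorded in Git and an object in remote storage more concrete, here is a minimal sketch of the mapping implied by the example remote path mentioned earlier. The function name `remote_object_path` is hypothetical and not part of DVC, and the two-character directory prefix is inferred from that example path only; the layout can differ between DVC versions, so treat this as an illustration rather than a description of DVC internals.

```python
def remote_object_path(remote_url: str, md5: str) -> str:
    """Build the storage path for a content hash (illustrative only)."""
    if len(md5) != 32:
        raise ValueError("expected a 32-character MD5 hex digest")
    # Assumption: the first two hex characters become a directory name,
    # as seen in the example path from the text.
    return f"{remote_url.rstrip('/')}/{md5[:2]}/{md5[2:]}"


path = remote_object_path(
    "s3://dvc-public/remote/get-started",
    "fb89904ef053f04d64eafcc3d70db673",
)
print(path)
```

In practice you should not need to compute such paths yourself; the commands below resolve them from the Git repository.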

Find a file or directory

You can use dvc list to explore a DVC repository hosted on any Git server. For example, let's see what's in the get-started/ directory of our dataset-registry repo:

$ dvc list https://github.com/iterative/dataset-registry get-started
.gitignore
data.xml
data.xml.dvc

The benefit of this command over browsing a Git hosting website is that the list includes files and directories tracked by both Git and DVC (data.xml is not visible if you check GitHub).
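As a rough illustration of how to read such a listing, the sketch below picks out the entries that have a matching .dvc file next to them. The helper name `dvc_tracked` is hypothetical, and the pairing rule is only a heuristic: it assumes data was added with dvc add, and it would miss outputs tracked through other DVC files.

```python
def dvc_tracked(entries):
    """Return names that have a matching '<name>.dvc' entry in the same listing."""
    names = set(entries)
    return sorted(name for name in names if f"{name}.dvc" in names)


# The listing printed by the dvc list example above.
listing = [".gitignore", "data.xml", "data.xml.dvc"]
tracked = dvc_tracked(listing)
print(tracked)
```

The dvc list command also has a --dvc-only flag, which is likely the simpler choice when you only need the DVC-tracked entries.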

Download

One way to access the data is to simply download it with dvc get. This is useful when working outside of a DVC project environment, for example in an automated ML model deployment task:

$ dvc get https://github.com/iterative/dataset-registry \
          use-cases/cats-dogs
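For an automated deployment task, the same command can be driven from a script. The following is a minimal sketch, assuming the dvc executable is on the PATH; the helper names `dvc_get_command` and `download` are hypothetical, and the tag `cats-dogs-v1` and output directory are assumed values used only for illustration. The --rev option selects a Git revision (tag, branch, or commit) and -o sets the destination.

```python
import subprocess


def dvc_get_command(repo, path, rev=None, out=None):
    """Build the argument list for a dvc get call."""
    cmd = ["dvc", "get", repo, path]
    if rev is not None:
        cmd += ["--rev", rev]  # download a specific version
    if out is not None:
        cmd += ["-o", out]  # destination path
    return cmd


def download(repo, path, rev=None, out=None):
    """Run dvc get; raises CalledProcessError if the command fails."""
    subprocess.run(dvc_get_command(repo, path, rev, out), check=True)


cmd = dvc_get_command(
    "https://github.com/iterative/dataset-registry",
    "use-cases/cats-dogs",
    rev="cats-dogs-v1",  # assumed tag name, for illustration
    out="data/cats-dogs",  # assumed destination
)
print(" ".join(cmd))
```

Calling `download(...)` with the same arguments would perform the actual download; it is left uncalled here so the sketch can be tried without network access.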

When working inside another DVC project though, this is not the best strategy because the connection between the projects is lost: others won't know where the data came from or whether new versions are available.

Import file or directory

dvc import downloads any file or directory as well, and also creates a .dvc file (which can be saved in the project):

$ dvc import https://github.com/iterative/dataset-registry \
             get-started/data.xml -o data/data.xml

This is similar to dvc get + dvc add, but the resulting .dvc file includes metadata to track changes in the source repository. This allows you to bring in changes from the data source later using dvc update.

The dataset registry repository doesn't actually contain a get-started/data.xml file. Like dvc get, dvc import downloads from remote storage.

.dvc files created by dvc import have special fields, such as the data source repo and path (under deps):

+deps:
+- path: get-started/data.xml
+  repo:
+    url: https://github.com/iterative/dataset-registry
+    rev_lock: 96fdd8f12c14fa58a1b7354f15c7adb50e4e8542
 outs:
 - md5: 22a1a2931c8370d3aeedd7183606fd7f
   path: data.xml

The url and rev_lock subfields under repo are used to save the origin and version of the dependency, respectively.
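To show how these fields could be read by a script, here is a small sketch that extracts url and rev_lock from the text of a .dvc file. The function name `import_source` is hypothetical, and the regular-expression approach is a deliberate simplification to keep the example free of dependencies; a real tool would more likely use a YAML parser.

```python
import re

# Contents of the .dvc file shown above, without the diff markers.
DVC_FILE = """\
deps:
- path: get-started/data.xml
  repo:
    url: https://github.com/iterative/dataset-registry
    rev_lock: 96fdd8f12c14fa58a1b7354f15c7adb50e4e8542
outs:
- md5: 22a1a2931c8370d3aeedd7183606fd7f
  path: data.xml
"""


def import_source(text):
    """Extract the source repo URL and pinned commit from .dvc file text."""
    def field(name):
        match = re.search(rf"^\s*{name}:\s*(\S+)\s*$", text, re.MULTILINE)
        return match.group(1) if match else None

    return {"url": field("url"), "rev_lock": field("rev_lock")}


source = import_source(DVC_FILE)
print(source)
```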

Python API

It's also possible to integrate your data or models directly in source code with DVC's Python API. This lets you access the data contents directly from within an application at runtime. For example:

import dvc.api

with dvc.api.open(
    'get-started/data.xml',
    repo='https://github.com/iterative/dataset-registry'
) as f:
    # f is a file-like object which can be processed normally
    first_line = f.readline()
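Because the object returned by dvc.api.open behaves like a regular file, ordinary stream-processing code can be reused with it. The sketch below is one way this might look; the helper names `first_lines` and `preview_tracked_file` are hypothetical, the sample text is a made-up stand-in for real data, and the `rev` argument is passed through to dvc.api.open to select a Git revision.

```python
import io
from itertools import islice


def first_lines(stream, n=5):
    """Return up to n lines from any text file-like object."""
    return [line.rstrip("\n") for line in islice(stream, n)]


def preview_tracked_file(path, repo, rev=None, n=5):
    """Preview a DVC-tracked file; needs the dvc package and network access."""
    import dvc.api  # imported lazily so first_lines works without DVC installed

    with dvc.api.open(path, repo=repo, rev=rev) as f:
        return first_lines(f, n)


# Made-up local stand-in for a remote file, so the helper can be tried offline.
sample = io.StringIO("line one\nline two\nline three\n")
preview = first_lines(sample, 2)
print(preview)
```

With DVC installed, `preview_tracked_file('get-started/data.xml', repo='https://github.com/iterative/dataset-registry')` would apply the same helper to the remote file.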