We've learned how to track data and models with DVC, and how to commit their versions to Git. The next questions are: How can we use these artifacts outside of the project? How do we download a model to deploy it? How to download a specific version of a model? Or reuse datasets across different projects?
These questions tend to come up when you browse the files that DVC saves to remote storage (e.g.
s3://dvc-public/remote/get-started/fb/89904ef053f04d64eafcc3d70db673 😱 instead of the original file name).
Read on or watch our video to see how to find and access models and datasets with DVC.
Remember the .dvc files that dvc add generates? Those files (and dvc.lock,
which we'll cover later) have their history in Git. DVC's remote storage config
is also saved in Git, and contains all the information needed to access and
download any version of datasets, files, and models. It means that a Git
repository with DVC files becomes an entry point, and can be used
instead of accessing files directly.
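The hashed remote path from the note above can be reproduced in a few lines of Python. This is a minimal sketch, assuming the layout visible in that example path (the first two hex characters of an MD5 digest form a directory and the remaining characters the file name). The function name and bucket URL are hypothetical, and the exact layout may differ between DVC versions.

```python
import hashlib


def remote_object_path(remote_url: str, content: bytes) -> str:
    """Build a content-addressed path: <remote>/<first 2 hex chars>/<rest>.

    Assumption: MD5 digest split after two characters, as suggested by the
    example path in the text. This is an illustration, not DVC's own code.
    """
    digest = hashlib.md5(content).hexdigest()
    return f"{remote_url.rstrip('/')}/{digest[:2]}/{digest[2:]}"


if __name__ == "__main__":
    # Hypothetical bucket; the file name never appears in the resulting path.
    print(remote_object_path("s3://example-bucket/remote", b"hello"))
```

Because the path depends only on the file contents, the original file name has to be recovered from the metadata stored in Git, which is why the Git repository serves as the entry point.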
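Use dvc list to explore the contents of a DVC repository hosted on a Git server, for example the get-started directory of the dataset-registry repository:

$ dvc list https://github.com/iterative/dataset-registry get-started
.gitignore
data.xml
data.xml.dvc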
The benefit of this command over browsing a Git hosting website is that the list
includes files and directories tracked by both Git and DVC (data.xml is not
visible if you browse the repository on GitHub).
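To make the Git-versus-DVC distinction concrete, the listing can be post-processed in Python. This is a rough sketch using a heuristic that is an assumption rather than DVC behavior you should rely on: an entry is treated as DVC-tracked when a sibling entry with the same name plus a .dvc suffix is present. The function names are hypothetical, and the sample text is the listing shown in this section.

```python
from typing import List


def parse_listing(output: str) -> List[str]:
    """Split the text printed by `dvc list` into one entry per line."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def dvc_tracked(entries: List[str]) -> List[str]:
    """Return entries that have a matching '<name>.dvc' file next to them.

    Heuristic only: it assumes the .dvc file sits in the same directory.
    """
    names = set(entries)
    return sorted(e for e in entries if f"{e}.dvc" in names)


SAMPLE_OUTPUT = """.gitignore
data.xml
data.xml.dvc
"""

if __name__ == "__main__":
    print(dvc_tracked(parse_listing(SAMPLE_OUTPUT)))
```

For the sample listing, only data.xml is reported: it is the entry whose contents live in remote storage rather than in the Git repository itself.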
One way to access the data is to simply download it with
dvc get. This is useful when
working outside of a DVC project environment, for example in an
automated ML model deployment task:
$ dvc get https://github.com/iterative/dataset-registry \
          use-cases/cats-dogs
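In an automated deployment task, the same download can be scripted. The following is a minimal sketch, assuming the dvc command-line tool is installed and on PATH. The helper names are hypothetical, while -o and --rev are existing dvc get options for the output path and the Git revision (branch, tag, or commit).

```python
import subprocess
from typing import List, Optional


def build_dvc_get_command(
    repo_url: str,
    path: str,
    out: Optional[str] = None,
    rev: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for `dvc get` without running it."""
    cmd = ["dvc", "get", repo_url, path]
    if out is not None:
        cmd += ["-o", out]      # where to place the downloaded file or directory
    if rev is not None:
        cmd += ["--rev", rev]   # Git revision of the source repository
    return cmd


def fetch_artifact(repo_url: str, path: str, **options: Optional[str]) -> None:
    """Run `dvc get`; raises CalledProcessError if the download fails."""
    subprocess.run(build_dvc_get_command(repo_url, path, **options), check=True)


if __name__ == "__main__":
    fetch_artifact(
        "https://github.com/iterative/dataset-registry",
        "use-cases/cats-dogs",
    )
```

Separating command construction from execution keeps the script easy to check without network access, and passing rev pins the deployment to one version of the artifact.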
When working inside another DVC project though, this is not the best strategy because the connection between the projects is lost — others won't know where the data came from or whether new versions are available.
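In that case, use dvc import, which also downloads a file or directory but creates a .dvc file for it in your project:

$ dvc import https://github.com/iterative/dataset-registry \
             get-started/data.xml -o data/data.xml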
This is similar to dvc get + dvc add, but the resulting .dvc file
includes metadata to track changes in the source repository. This allows you to
bring in changes from the data source later using dvc update.
It's also possible to integrate your data or models directly in source code with DVC's Python API. This lets you access the data contents directly from within an application at runtime. For example:
import dvc.api

with dvc.api.open(
    'get-started/data.xml',
    repo='https://github.com/iterative/dataset-registry'
) as fd:
    # fd is a file-like object that can be processed normally,
    # for example by reading its full contents
    contents = fd.read()
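The question of accessing a specific version can be handled from Python as well, since dvc.api.open accepts a rev argument naming a Git revision (branch, tag, or commit) of the source repository. The sketch below is illustrative only: count_lines and its opener parameter are hypothetical names, and the opener hook exists purely so the function can be exercised without DVC or network access.

```python
from typing import Callable, Optional


def count_lines(
    path: str,
    repo: str,
    rev: Optional[str] = None,
    opener: Optional[Callable] = None,
) -> int:
    """Count the lines of a DVC-tracked text file at a given Git revision."""
    if opener is None:
        # Imported lazily so this module loads even where DVC is not installed.
        import dvc.api

        opener = dvc.api.open
    # rev=None means the default branch of the source repository.
    with opener(path, repo=repo, rev=rev) as fd:
        return sum(1 for _ in fd)


if __name__ == "__main__":
    total = count_lines(
        "get-started/data.xml",
        repo="https://github.com/iterative/dataset-registry",
    )
    print(total)
```

To pin a version, pass a revision that exists in the source repository as rev; none is named here because the text does not list the available tags or commits.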