Import Data

We've seen how to push data to and pull data from a DVC project's remote. But what if we want to integrate a dataset or ML model produced in one project into another?

One way is to manually download the data (with wget or dvc get, for example) and use dvc add to track it, but the connection between the projects would be lost. We wouldn't be able to tell where the data came from or whether there are new versions available. A better alternative is the dvc import command:

$ dvc import https://github.com/iterative/dataset-registry \
             get-started/data.xml

This downloads data.xml from our dataset-registry project into the current working directory, adds it to .gitignore, and creates the data.xml.dvc DVC-file to track changes in the source data. With imports, we can use dvc update to bring in changes in the external data source before reproducing any pipeline that depends on this data.
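As a minimal sketch of that update step, assuming the import stage created above is named data.xml.dvc and sits in the current directory, the refresh could look like the following. The guard and the import_status variable are illustrative additions, not part of DVC.

```shell
# Refresh the imported data only when DVC is installed and the
# import stage file exists; otherwise do nothing.
if command -v dvc >/dev/null 2>&1 && [ -f data.xml.dvc ]; then
    # Bring in any changes from the source repository.
    dvc update data.xml.dvc
    import_status="updated"
else
    # Hypothetical fallback so the script is safe to run anywhere.
    import_status="skipped"
fi

echo "import status: $import_status"
```

Running the update before reproducing a pipeline means that the stages depending on data.xml see the latest version of the source data.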

More about imports

Note that the dataset-registry repository doesn't actually contain a get-started/data.xml file. Instead, DVC inspects get-started/data.xml.dvc and tries to retrieve the file using the project's default remote (configured in the dataset-registry repository).

DVC-files created by dvc import are called import stages. They use the repo field in the dependencies section (deps) to track changes in the source data (as an external dependency), which makes data artifacts reusable across projects. For example:

md5: fd56a1794c147fea48d408f2bc95a33a
locked: true
deps:
  - path: get-started/data.xml
    repo:
      url: https://github.com/iterative/dataset-registry
      rev_lock: 7476a858f6200864b5755863c729bff41d0fb045
outs:
  - md5: a304afb96060aad90176268345e10355
    path: data.xml
    cache: true
    metric: false
    persist: false

The url and rev_lock subfields under repo are used to save the origin and version of the dependency, respectively.
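To see these two subfields in practice, here is a rough sketch that reads them back from an import stage with sed. It writes the example contents shown above to a temporary file so that it runs on its own. The variable names are hypothetical, and the text matching is a simplification that assumes a single import per file; it is not a real YAML parser.

```shell
# Write the sample import stage to a temporary file so the sketch
# is self-contained (contents copied from the example above).
dvcfile="$(mktemp)"
cat > "$dvcfile" <<'EOF'
md5: fd56a1794c147fea48d408f2bc95a33a
locked: true
deps:
  - path: get-started/data.xml
    repo:
      url: https://github.com/iterative/dataset-registry
      rev_lock: 7476a858f6200864b5755863c729bff41d0fb045
outs:
  - md5: a304afb96060aad90176268345e10355
    path: data.xml
    cache: true
    metric: false
    persist: false
EOF

# Strip the "url:" and "rev_lock:" keys and keep only their values.
repo_url="$(sed -n 's/^ *url: *//p' "$dvcfile")"
rev_lock="$(sed -n 's/^ *rev_lock: *//p' "$dvcfile")"

echo "origin:  $repo_url"
echo "version: $rev_lock"

# Clean up the temporary file.
rm -f "$dvcfile"
```

In a real project, the same two sed commands could be pointed at data.xml.dvc directly to check which repository and commit an import came from.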

Note that dvc update updates the rev_lock field of the corresponding DVC-file (when there are changes to bring in).

Since this is not an official part of the Get Started tutorial, restore the project to its previous state with:

$ git reset --hard
$ rm -f data.*

See also dvc import-url.
