
External Data

To version data that lives outside your local project, you can import it. You choose whether to download the data and whether to push copies to your DVC remote, so importing is useful even when you want to track the data in place at its original source location.

See external dependencies and outputs if you want to work with external data in a pipeline.

How importing external data works

Import external data using import-url:

$ dvc import-url https://data.dvc.org/get-started/data.xml
Importing 'https://data.dvc.org/get-started/data.xml' -> 'data.xml'

This downloads the file to data.xml (see Avoiding duplication if you want to skip this step). It also creates the data.xml.dvc file, which tracks the source data.

# ...
deps:
  - etag: '"f432e270cd634c51296ecd2bc2f5e752-5"'
    path: https://data.dvc.org/get-started/data.xml
outs:
  - md5: a304afb96060aad90176268345e10355
    path: data.xml
    cache: true
    persist: false

DVC checks the headers returned by the server, looking for an HTTP ETag or a Content-MD5 header, and uses that value to determine whether the source has changed and the file needs to be downloaded again.
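The header check described above can be illustrated with a minimal sketch. This is not DVC's implementation: the function names change_token and source_changed are hypothetical, preferring ETag over Content-MD5 is an assumption, and raising an error when neither header is present is also an assumption about how a tool might reasonably behave.

```python
from typing import Mapping, Optional, Tuple

# Hypothetical helper names; this is an illustration, not DVC's internal code.
Token = Tuple[str, str]


def change_token(headers: Mapping[str, str]) -> Optional[Token]:
    """Return (header_name, value) for the first usable header, or None."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    # Assumption: ETag is preferred when both headers are present.
    for kind in ("etag", "content-md5"):
        value = lowered.get(kind)
        if value:
            return kind, value
    return None


def source_changed(recorded: Token, headers: Mapping[str, str]) -> bool:
    """Compare the recorded token with the one the server reports now."""
    current = change_token(headers)
    if current is None:
        # Assumption: without either header there is nothing to compare.
        raise ValueError("no ETag or Content-MD5 header in the response")
    return current != recorded


# The ETag value recorded in data.xml.dvc above.
recorded = ("etag", '"f432e270cd634c51296ecd2bc2f5e752-5"')
same = {"ETag": '"f432e270cd634c51296ecd2bc2f5e752-5"'}
print(source_changed(recorded, same))  # False: no new download needed
```

The comparison is a plain equality test on the header value, so the file contents never need to be fetched just to learn whether the source changed.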

To check the source location for updates, run dvc update:

$ dvc update data.xml.dvc
'data.xml.dvc' didn't change, skipping

During dvc push, DVC will upload the version of the data tracked by data.xml.dvc to the DVC remote so that it is backed up in case you need to recover it.

DVC will never overwrite the source location of the data. Instead, DVC can check out any version of that data locally. DVC is designed to protect the original data from accidental overwrites or changes that other users might not expect, so you can recover old versions without losing what is currently stored in the source location.

Avoiding duplication

Making copies of the external data may be unnecessary or impractical in some cases, for example if your data is too big to download locally, if you stream it directly from its source location, or if you already use cloud versioning to back up old versions.

Use --no-download to skip the download step when you import or update the data. DVC will save the metadata in data.xml.dvc but won't download data.xml locally:

$ dvc import-url --no-download https://data.dvc.org/get-started/data.xml
Importing 'https://data.dvc.org/get-started/data.xml' -> 'data.xml'

$ ls
data.xml.dvc

To recover this version of the data later, use dvc pull, and DVC will try to download it from its original source location. However, if you have overwritten the original source data, dvc pull may fail. To version the data so you can recover any version, either push the data to the DVC remote or use cloud versioning.

Example: Push to remote

dvc import-url --to-remote will not download the data locally but will push the data to the DVC remote:

$ dvc import-url --to-remote https://data.dvc.org/get-started/data.xml

$ ls
data.xml.dvc

$ dvc push
Everything is up to date.

Example: Cloud versioning

If you are importing from a supported cloud versioning provider, dvc import-url --no-download --version-aware will not download the data locally but will track the cloud provider's version IDs for the data. dvc pull will try to download those version IDs as long as they are available. dvc push will not upload anything because DVC assumes the versions are available at the source location:

$ dvc import-url --no-download --version-aware s3://myversionedbucket/data.xml
Importing 's3://myversionedbucket/data.xml' -> 'data.xml'

$ ls
data.xml.dvc

$ dvc push
Everything is up to date.
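If you script imports, the variants shown on this page can be assembled from one small helper. The sketch below is an illustration only: import_url_command is a hypothetical name, it uses only the flags that appear above, and it does not check which flag combinations DVC accepts.

```python
from typing import List


def import_url_command(
    url: str,
    *,
    no_download: bool = False,
    version_aware: bool = False,
    to_remote: bool = False,
) -> List[str]:
    """Build the argument list for one of the dvc import-url variants above."""
    cmd = ["dvc", "import-url"]
    if no_download:
        cmd.append("--no-download")  # save metadata only, skip the local copy
    if version_aware:
        cmd.append("--version-aware")  # track the cloud provider's version IDs
    if to_remote:
        cmd.append("--to-remote")  # push to the DVC remote, skip the local copy
    cmd.append(url)
    return cmd


# Reproduces the cloud versioning example above.
print(import_url_command(
    "s3://myversionedbucket/data.xml",
    no_download=True,
    version_aware=True,
))
```

The returned list can be passed to subprocess.run(cmd, check=True) from inside a DVC project, which is the only place the command would succeed.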