To version data that lives outside of your local project, you can import it. You can choose whether to download that data and whether to push copies to your DVC remote. This makes importing the data useful even if you want to track the data in-place at its original source location.
See external dependencies and outputs if you want to work with external data in a pipeline.
Import external data using dvc import-url:
$ dvc import-url https://data.dvc.org/get-started/data.xml
Importing 'https://data.dvc.org/get-started/data.xml' -> 'data.xml'
This downloads the file to data.xml (see Avoiding duplication if you want to skip this step). It also creates the data.xml.dvc file, which tracks the source data.
To check the source location for updates, run dvc update:
$ dvc update data.xml.dvc
'data.xml.dvc' didn't change, skipping
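When a project holds several imports, it can help to find them before checking for updates. The following is a minimal sketch, not part of DVC: the helper name list_imports is hypothetical, and it assumes that .dvc files created by an import contain a top-level deps: key while files created by a plain dvc add do not.

```shell
# list_imports DIR: print the .dvc files in DIR that look like imports.
# Assumption: import .dvc files contain a top-level "deps:" key.
list_imports() {
    dir="${1:-.}"
    for f in "$dir"/*.dvc; do
        # Skip the unexpanded glob when DIR has no .dvc files.
        [ -f "$f" ] || continue
        if grep -q '^deps:' "$f"; then
            printf '%s\n' "$f"
        fi
    done
}

# Example (requires DVC, so it is left commented out):
# list_imports . | while read -r f; do dvc update "$f"; done
```

The function only lists candidates, so it is safe to run before deciding which imports to update.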
DVC will never overwrite the data at its source location. Instead, DVC can check out any version of that data locally. DVC is designed to protect the original data from accidental overwrites or changes that other users might not expect, so you can recover old versions without losing what's currently stored in the source location.
Making copies of the external data may be unnecessary or impractical in some cases, for example if your data is too big to download locally, if you stream it directly from its source location, or if you already use cloud versioning to back up old versions.
Use --no-download to skip the download step when you import or update the data. DVC will save the metadata in data.xml.dvc but won't download data.xml:
$ dvc import-url --no-download https://data.dvc.org/get-started/data.xml
Importing 'https://data.dvc.org/get-started/data.xml' -> 'data.xml'
$ ls
data.xml.dvc
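To tell a metadata-only import from a downloaded one in a script, you can check whether the data file exists next to its .dvc file. This is a rough sketch rather than a DVC feature: the helper name is_downloaded is hypothetical, and it assumes the data is stored under the same path as the .dvc file minus the .dvc suffix (data.xml.dvc tracks data.xml).

```shell
# is_downloaded FILE.dvc: succeed if the data tracked by FILE.dvc exists locally.
# Assumption: the data path is the .dvc path without the ".dvc" suffix.
is_downloaded() {
    case "$1" in
        *.dvc) ;;
        *) return 2 ;;   # not a .dvc file
    esac
    data_path="${1%.dvc}"
    [ -e "$data_path" ]
}

# Example:
# if is_downloaded data.xml.dvc; then
#     echo "data.xml is present"
# else
#     echo "only metadata is tracked"
# fi
```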
To recover this version of the data later, use dvc pull, and DVC will try to download it from its original source location. However, if you have overwritten the original source data, dvc pull may fail. To version the data so you can recover any version, either push the data to the DVC remote or use cloud versioning.
$ dvc import-url --to-remote https://data.dvc.org/get-started/data.xml
$ ls
data.xml.dvc
$ dvc push
Everything is up to date.
If you are importing from a supported cloud versioning provider, dvc import-url --no-download --version-aware will not download the data locally but will track the cloud provider's version IDs for the data. dvc pull will try to download those version IDs as long as they are available. dvc push will not upload anything because DVC assumes the versions are available at the source location.
$ dvc import-url --no-download --version-aware s3://myversionedbucket/data.xml
Importing 's3://myversionedbucket/data.xml' -> 'data.xml'
$ ls
data.xml.dvc
$ dvc push
Everything is up to date.
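The import strategies described above differ only in the flags passed to dvc import-url. As a summary, here is a small sketch that prints the command for each strategy without running it. The function name import_command and the mode names (download, no-download, to-remote, version-aware) are hypothetical labels chosen for this illustration. Only the flags themselves come from the examples above.

```shell
# import_command MODE URL: print the dvc import-url command for a strategy.
# The mode names are illustrative labels, not DVC terminology.
import_command() {
    mode="$1"
    url="$2"
    case "$mode" in
        download)      flags="" ;;                                # download a local copy
        no-download)   flags="--no-download" ;;                   # track metadata only
        to-remote)     flags="--to-remote" ;;                     # store a copy on the DVC remote
        version-aware) flags="--no-download --version-aware" ;;   # track cloud version IDs
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    # ${flags:+$flags } emits the flags plus a trailing space only when set.
    printf 'dvc import-url %s%s\n' "${flags:+$flags }" "$url"
}

# Example:
# import_command to-remote https://data.dvc.org/get-started/data.xml
# prints: dvc import-url --to-remote https://data.dvc.org/get-started/data.xml
```

Printing the command instead of executing it keeps the sketch safe to try and makes the chosen flags easy to review before running them.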