⚠️ This is an advanced feature for very specific situations and is not recommended unless there is absolutely no alternative. In most cases, alternatives like the to-cache or to-remote strategies of dvc import-url are more convenient. Note that external outputs are not pushed to or pulled from remote storage.
There are cases when data is so large, or its processing is organized in such a way, that it's impossible to handle it on the local machine's disk. Examples include versioning existing data on network-attached storage (NAS), processing data on HDFS, running Dask via SSH, or any code that generates massive files directly to the cloud.
External outputs (and external dependencies) provide ways to track and version data outside of the project.
To use existing files or directories in an external location as outputs, give their remote URLs or external paths to dvc add, or put them in the deps field. Use the same format as the url of the supported dvc remote types/protocols.
Avoid using the same DVC remote (the one used for dvc pull, etc.) as the external cache, because it may cause data collisions: the hash of an external output could collide with that of a local file with different content.
Note that remote storage is a different feature.
DVC requires that the project's cache is configured in the same external location as the data that will be tracked (external outputs). This avoids transferring files to the local environment and enables file links within the external storage.
As an example, let's create a directory external to the workspace and set it up as cache:
$ mkdir -p /home/shared/dvcstore
$ dvc cache dir /home/shared/dvcstore
💡 Note that in real-life scenarios, the directory will often be in a remote location, such as ssh://email@example.com/cache.
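For a remote location, the equivalent setup could look roughly like the sketch below. It assumes a DVC release that supports per-protocol external caches (the cache.ssh setting) and that SSH access already works. The helper name setup_external_cache and the remote name myxcache are placeholders chosen for illustration, not part of DVC.

```shell
# Sketch: register an external cache that lives on an SSH server.
# setup_external_cache is a hypothetical helper, not a DVC command.
setup_external_cache() {
    name="$1"   # remote name to create, e.g. myxcache (placeholder)
    url="$2"    # cache location, e.g. ssh://email@example.com/cache

    # Register the location as a DVC remote so it can be referred to by name.
    dvc remote add "$name" "$url" || return 1

    # Use that remote as the cache for outputs addressed with ssh:// URLs.
    dvc config cache.ssh "$name"
}

# Usage (run inside a DVC project):
#   setup_external_cache myxcache ssh://email@example.com/cache
```

Wrapping the two commands in one function keeps the remote name consistent between them, which is easy to get wrong when typing them by hand.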
⚠️ An external cache could be shared among copies of a DVC project. Please do not use external outputs in that scenario, as dvc checkout in any project would overwrite the working data for all projects.
Let's take a look at the following operations on all the supported location types:
1. Set up an external cache (added as a dvc remote*) in the same location as the external data.
2. Track existing data in the external location with dvc add (--external option needed). This produces a .dvc file that records the external URL or path.
3. Create a simple stage (--external option needed) that moves a local file to the external location. This produces an external output.
* Note that for certain remote storage authentication methods, extra config steps are required (see dvc remote modify for details). Once access is set up, use the special remote:// URL format in step 2. For example: dvc add --external remote://myxcache/existing-data.
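Tracking existing external data, and defining a stage that writes to the external location, might look like the sketch below. It assumes a remote named myxcache is already configured as the external cache and that the installed DVC version still accepts --external. The helper names, the stage name upload, and the scp-based command are illustrative assumptions rather than anything prescribed here.

```shell
# Sketch: track data that already exists in the external location.
# track_existing is a hypothetical helper; myxcache is an assumed remote name.
track_existing() {
    # $1 is a path relative to the remote's URL, e.g. existing-data
    dvc add --external "remote://myxcache/$1"
}

# Sketch: define a stage whose output lives in the external location.
# $1 = local file, $2 = SSH login (user@host), $3 = absolute path on that host
add_upload_stage() {
    dvc stage add -n upload \
        -d "$1" \
        --external \
        -o "ssh://$2$3" \
        scp "$1" "$2:$3"
}

# Usage (run inside a DVC project):
#   track_existing existing-data
#   add_upload_stage data.txt email@example.com /data/data.txt
```

The stage definition only records the command; nothing is copied until the stage is run, for example with dvc repro.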