Managing External Data
⚠️ This is an advanced feature for very specific situations, and it is not recommended unless there is absolutely no other alternative. In most cases, alternatives like the to-cache or to-remote strategies of dvc import-url are more convenient. Note that external outputs are not pushed to or pulled from remote storage.
There are cases when data is so large, or its processing is organized in such a way, that it's impossible to handle it on the local machine's disk. Examples include versioning existing data on network-attached storage (NAS), processing data on HDFS, running Dask via SSH, or running any code that generates massive files directly in the cloud.
External outputs (and external dependencies) provide ways to track and version data outside of the project.
How external outputs work
External outputs will be tracked by DVC for versioning, detecting when they change (reported by dvc status, for example).
To use existing files or directories in an external location as outputs, give their remote URLs or external paths to dvc add, or put them in dvc.yaml (deps field). Use the same format as the url of the following supported dvc remote types/protocols:
- Amazon S3
- Local files and directories outside the workspace
Avoid using the same DVC remote used for dvc pull, etc. as an external cache, because it may cause data collisions: the hash of an external output could collide with that of a local file with different content.
Note that remote storage is a different feature.
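As a rough illustration of the tracking flow described in this section, the following shell sketch wraps the two commands involved. It assumes a DVC release that still accepts the --external option (newer releases may not), and the function name track_external and any path passed to it are hypothetical.

```shell
# Sketch: track an existing external directory, then check it for changes.
# Assumes a DVC release that still accepts --external; run it from inside
# a DVC project. track_external is a hypothetical helper name.
track_external() {
    target="$1"
    if [ -z "$target" ]; then
        echo "usage: track_external <external-path-or-url>" >&2
        return 2
    fi
    # Creates a .dvc file in the workspace that points at the external location.
    dvc add --external "$target" || return 1
    # Reports whether tracked data changed since it was last added.
    dvc status
}

# Usage (placeholder path): track_external /home/shared/existing-data
```

The data itself stays where it is; only the small .dvc file lands in the workspace and can be committed with Git.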
Setting up an external cache
DVC requires that the project's cache is configured in the same external location as the data that will be tracked (external outputs). This avoids transferring files to the local environment and enables file links within the external storage.
As an example, let's create a directory external to the workspace and set it up as cache:
$ mkdir -p /home/shared/dvcstore
$ dvc cache dir /home/shared/dvcstore
See dvc cache dir and dvc config cache for more information.
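The setup above can also be wrapped in a small helper that checks its input first. This is only a sketch: the function name setup_external_cache is hypothetical, and insisting on an absolute path is a choice made for this example, not a DVC requirement.

```shell
# Sketch: create a directory outside the workspace and make it the project cache.
# Run from inside a DVC project. The default path is the one used in the text.
setup_external_cache() {
    cache_dir="${1:-/home/shared/dvcstore}"
    case "$cache_dir" in
        /*) ;;  # absolute path, so it cannot silently resolve inside the workspace
        *)  echo "expected an absolute path: $cache_dir" >&2
            return 2 ;;
    esac
    mkdir -p "$cache_dir" || return 1
    # Records the new cache location in the project's DVC config.
    dvc cache dir "$cache_dir"
}

# Usage: setup_external_cache /home/shared/dvcstore
```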
💡 Note that in real-life scenarios, the directory will often be in a remote location, e.g. ssh://firstname.lastname@example.org/cache.
⚠️ An external cache could be shared among copies of a DVC project. Do not use external outputs in that scenario, as dvc checkout in any project would overwrite the working data for all projects.
Let's take a look at the following operations on all the supported location types:
1. Configure an external cache directory (added as a dvc remote*) in the same location as the external data, using dvc config.
2. Track existing data on the external location using dvc add (--external option needed). This produces a .dvc file with an external URL or path in its outs field.
3. Create a simple stage with dvc stage add (--external option needed) that moves a local file to the external location. This produces an external output in dvc.yaml.
* Note that for certain remote storage authentication methods, extra config steps are required (see dvc remote modify for details). Once access is set up, use the special remote:// URL format in step 2. For example: dvc add --external remote://myxcache/existing-data.
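For an Amazon S3 location, the three operations above might look roughly like the sketch below. Several things here are assumptions rather than something this page states: a DVC release that supports --external and the cache.s3 config option, a working aws CLI with credentials, a local data.txt, the stage name upload_data, and the bucket URL passed as an argument. The remote name myxcache follows the footnote. Plain s3:// URLs are used instead of the remote:// form to keep the paths explicit.

```shell
# Sketch of the three operations for an S3 location (see assumptions above).
# external_s3_example is a hypothetical helper; $1 is a placeholder bucket URL.
external_s3_example() {
    bucket="$1"   # e.g. s3://mybucket
    if [ -z "$bucket" ]; then
        echo "usage: external_s3_example <s3://bucket>" >&2
        return 2
    fi

    # 1. Register a cache location on the same storage as the data.
    dvc remote add myxcache "$bucket/cache" || return 1
    dvc config cache.s3 myxcache || return 1

    # 2. Track data that already exists there; this writes a .dvc file locally.
    dvc add --external "$bucket/existing-data" || return 1

    # 3. Define a stage whose output lives on the external location.
    #    This only records the stage; dvc repro would execute it later.
    dvc stage add -n upload_data -d data.txt \
        --external -o "$bucket/data.txt" \
        aws s3 cp data.txt "$bucket/data.txt"
}

# Usage (placeholder bucket): external_s3_example s3://mybucket
```

The cache remote is kept separate from any remote used for regular push and pull operations, in line with the collision warning earlier on this page.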