Managing External Data
⚠️ This is an advanced feature for very specific situations and only recommended when there is absolutely no other alternative. In most cases, alternatives like the to-cache or to-remote strategies of `dvc add` and `dvc import-url` are more convenient. Note that external outputs are not pushed to or pulled from remote storage.
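As a minimal sketch of the to-remote strategy, assuming a default DVC remote is already configured (the bucket URL is illustrative):

# transfer the data to the default remote without keeping a local copy
$ dvc import-url s3://mybucket/existing-data --to-remote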
There are cases when data is so large, or its processing is organized in such a way, that it's impossible to handle it on the local machine's disk. For example: versioning existing data on network-attached storage (NAS), processing data on HDFS, running Dask via SSH, or any code that generates massive files directly to the cloud.
External outputs (and external dependencies) provide ways to track and version data outside of the project.
How external outputs work
External outputs will be tracked by DVC for versioning, detecting when they change (reported by `dvc status`, for example).
To use existing files or directories in an external location as outputs, give their remote URLs or external paths to `dvc add`, or put them in `dvc.yaml` (`outs` field); see the sketch after the list below. Use the same format as the `url` of the following supported `dvc remote` types/protocols:
- Amazon S3
- SSH
- HDFS
- Local files and directories outside the workspace
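Either way, the external location ends up in an `outs` entry. As a minimal sketch, the `.dvc` file written by `dvc add --external` for an S3 path might look roughly like this (the hash, size, and bucket name below are placeholders):

outs:
- md5: 1a2b3c4d5e6f...          # placeholder content hash
  size: 437887                  # placeholder size in bytes
  path: s3://mybucket/existing-data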
Avoid using the same DVC remote that is used for `dvc push`, `dvc pull`, etc. as an external cache, because it may cause data collisions: the hash of an external output could collide with that of a local file with different content.
Note that remote storage is a different feature.
Setting up an external cache
DVC requires that the project's cache is configured in the same external location as the data that will be tracked (external outputs). This avoids transferring files to the local environment and enables file links within the external storage.
As an example, let's create a directory external to the workspace and set it up as cache:
$ mkdir -p /home/shared/dvcstore
$ dvc cache dir /home/shared/dvcstore
See `dvc cache dir` and `dvc config cache` for more information.
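For reference, `dvc cache dir` records the new location in `.dvc/config`, roughly as follows (the path matches the command above):

[cache]
    dir = /home/shared/dvcstore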
💡 Note that in real-life scenarios, often the directory will be in a remote location, e.g. `s3://mybucket/cache` or `ssh://user@example.com/cache` (see the examples below).
⚠️ An external cache could be shared among copies of a DVC project. Do not use external outputs in that scenario, as `dvc checkout` in any project would overwrite the working data for all projects.
Examples
Let's take a look at the following operations on all the supported location types:
1. Configure an external cache directory (added as a `dvc remote`*) in the same location as the external data, using `dvc config`.
2. Track existing data in the external location using `dvc add` (`--external` option needed). This produces a `.dvc` file with an external URL or path in its `outs` field.
3. Create a simple stage with `dvc stage add` (`--external` option needed) that moves a local file to the external location. This produces an external output in `dvc.yaml`.
* Note that for certain remote storage authentication methods, extra config steps are required (see `dvc remote modify` for details). Once access is set up, use the special `remote://` URL format in step 2. For example: `dvc add --external remote://myxcache/existing-data`.
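As a hypothetical sketch for an S3 cache remote that needs explicit credentials (the remote name reuses `s3cache` from the example below; the key values are placeholders, and `--local` keeps them out of Git):

$ dvc remote modify --local s3cache access_key_id 'mykey'
$ dvc remote modify --local s3cache secret_access_key 'mysecret'
$ dvc add --external remote://s3cache/existing-data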
Amazon S3
$ dvc remote add s3cache s3://mybucket/cache
$ dvc config cache.s3 s3cache
$ dvc add --external s3://mybucket/existing-data
$ dvc stage add -n upload -d data.txt \
    --external \
    -o s3://mybucket/data.txt \
    aws s3 cp data.txt s3://mybucket/data.txt
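The resulting stage in `dvc.yaml` would look roughly like this (the stage name matches the `-n` value above; hashes are recorded in `dvc.lock` once the stage runs):

stages:
  upload:
    cmd: aws s3 cp data.txt s3://mybucket/data.txt
    deps:
    - data.txt
    outs:
    - s3://mybucket/data.txt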
SSH
$ dvc remote add sshcache ssh://user@example.com/cache
$ dvc config cache.ssh sshcache
$ dvc add --external ssh://user@example.com/existing-data
$ dvc stage add -n upload -d data.txt \
    --external \
    -o ssh://user@example.com/data.txt \
    scp data.txt user@example.com:/data.txt
DVC requires both SSH and SFTP access to work with SSH remote storage. Check that you can connect both ways with tools like `ssh` and `sftp` (GNU/Linux).

Note that your server's SFTP root might differ from its physical root (`/`).
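As a quick sanity check (the host is illustrative), both of these should connect successfully:

$ ssh user@example.com 'echo SSH works'
$ sftp user@example.com    # run `pwd` at the sftp> prompt to see the SFTP root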
HDFS
$ dvc remote add hdfscache hdfs://user@example.com/cache
$ dvc config cache.hdfs hdfscache
$ dvc add --external hdfs://user@example.com/existing-data
$ dvc stage add -n upload -d data.txt \
    --external \
    -o hdfs://user@example.com/data.txt \
    hdfs dfs -copyFromLocal \
        data.txt \
        hdfs://user@example.com/data.txt
Note that as long as there is an `hdfs://...` URL for your data, DVC can handle it. So systems like Hadoop, Hive, and HBase are supported!
WebHDFS
$ dvc remote add webhdfscache webhdfs://user@example.com/cache
$ dvc config cache.webhdfs webhdfscache
$ dvc add --external webhdfs://user@example.com/existing-data
$ dvc stage add -n upload -d data.txt \
    --external \
    -o webhdfs://user@example.com/data.txt \
    curl --upload-file data.txt \
        "http://user@example.com:50075/webhdfs/v1/data.txt?op=CREATE"
Local file system
The default cache is in `.dvc/cache`, so there is no need to set a custom cache location for local paths outside of your project.
The exception is external data on a different storage device or partition mounted on the same file system (e.g. `/mnt/raid/data`). In that case, set up an external cache on that same drive to enable file links and avoid copying data.
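For example, if the external data lived on a hypothetical `/mnt/raid` device, the cache would go on that same device:

$ mkdir -p /mnt/raid/dvcstore
$ dvc cache dir /mnt/raid/dvcstore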
$ dvc add --external /home/shared/existing-data
$ dvc stage add -n copy -d data.txt \
    --external \
    -o /home/shared/data.txt \
    cp data.txt /home/shared/data.txt