There are cases when data is so large, or its processing is organized in such a way, that it's preferable to avoid moving it from its current external location. Examples include data on network-attached storage (NAS), data processed on HDFS, Dask running via SSH, or a script that streams data from S3 to process it.
External dependencies and external outputs provide ways to track and version data outside of the project.
DVC tracks external dependencies and detects when they
change (triggering stage executions on
dvc repro, for example).
To define files or directories in an external location as
stage dependencies, specify their remote URLs or
external paths in
dvc.yaml (deps field). Use the same URL format as the corresponding
dvc remote types. Only some remote types are currently supported as external locations.
Note that remote storage is a different feature.
Let's take a look at defining and running a
download_file stage that simply
downloads a file from an external location, on all the supported location types.
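As a minimal sketch for one location type, the snippet below hand-writes a dvc.yaml in which the download_file stage lists an S3 object under deps. The bucket and path (s3://mybucket/data.txt) are hypothetical, and the file is created in a scratch directory so that no real project is touched.

```shell
# Scratch directory so an existing dvc.yaml is never overwritten.
workdir=$(mktemp -d)

# Hand-written stage definition. s3://mybucket/data.txt is a hypothetical
# external location; the same URL appears in the command and under deps.
cat > "$workdir/dvc.yaml" <<'EOF'
stages:
  download_file:
    cmd: aws s3 cp s3://mybucket/data.txt data.txt
    deps:
      - s3://mybucket/data.txt
    outs:
      - data.txt
EOF

# Show the resulting stage file.
cat "$workdir/dvc.yaml"
```

In a DVC project with the AWS CLI installed and access to the bucket, dvc repro would be expected to run this stage, and to run it again when the S3 object changes.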
See the Remote alias example for information on using remote locations that require manual authentication setup.
You may want to encapsulate external locations as configurable entities that can be managed independently. This is useful if the connection requires authentication, if multiple dependencies (or stages) reuse the same location, or if the URL is likely to change in the future.
Let's see an example using SSH. First, register and configure the remote:
$ dvc remote add myssh ssh://email@example.com
$ dvc remote modify --local myssh password 'mypassword'
Please refer to
dvc remote modify for more details, such as setting up access credentials for the different remote types.
Now, use an alias to this remote when defining the stage:
$ dvc run -n download_file \
          -d remote://myssh/path/to/data.txt \
          -o data.txt \
          wget https://example.com/data.txt -O data.txt
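To make the alias notation concrete, here is a small illustrative sketch of how a remote:// path can be read as the remote's URL plus a relative path. The resolve_alias function is hypothetical and is not part of DVC. It only mimics the substitution with plain string handling, assuming the alias expands to the configured URL followed by the path.

```shell
# Hypothetical helper (not a DVC command): expand remote://<name>/<path>
# into <remote url>/<path> using shell string operations only.
resolve_alias() {
  alias_path="$1"   # e.g. remote://myssh/path/to/data.txt
  remote_name="$2"  # e.g. myssh
  remote_url="$3"   # URL registered for that remote
  prefix="remote://${remote_name}/"
  case "$alias_path" in
    # Path uses this alias: swap the prefix for the remote's URL.
    "$prefix"*) printf '%s/%s\n' "${remote_url%/}" "${alias_path#"$prefix"}" ;;
    # Anything else is passed through unchanged.
    *) printf '%s\n' "$alias_path" ;;
  esac
}

resolved=$(resolve_alias remote://myssh/path/to/data.txt myssh ssh://email@example.com)
echo "$resolved"
```

Because the stage refers only to the alias, changing the URL registered for myssh is enough to point the dependency at a new location without editing the stage.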
In the previous examples, special downloading tools were used:
aws s3 cp, etc.
dvc import-url simplifies the downloading for all the
supported external path or URL types.
$ dvc import-url https://data.dvc.org/get-started/data.xml
Importing 'https://data.dvc.org/get-started/data.xml' -> 'data.xml'
The command above creates the import .dvc file
data.xml.dvc, which contains
an external dependency (in this case an HTTPS URL).
$ dvc import firstname.lastname@example.org:iterative/example-get-started model.pkl
Importing 'model.pkl (email@example.com:iterative/example-get-started)' -> 'model.pkl'
The command above creates
model.pkl.dvc, where the external dependency is
specified (with the