Sometimes you need to stream your data dependencies directly from their source locations outside your local project, or stream your data outputs directly to some external location, like cloud storage or HDFS.
To version external data without a pipeline, see importing external data.
External dependencies will be tracked by DVC, detecting when they
change (triggering stage executions on
dvc repro, for example).
To define files or directories in an external location as stage
dependencies, specify their remote URLs or external paths in
dvc.yaml (deps field). Use the same format as the
url of the following supported
dvc remote types/protocols:
- Amazon S3
- Microsoft Azure Blob Storage
- Google Cloud Storage
- Local files and directories outside the workspace
Let's take a look at defining and running a
download_file stage that simply
downloads a file from an external location, on all the supported location types.
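As a minimal sketch of such a stage for the Amazon S3 case, the shell snippet below writes a dvc.yaml by hand. The directory name external_dep_demo, the bucket mybucket, and the file data.txt are placeholders chosen for illustration, and actually running the stage would assume a configured aws CLI. The other location types should differ only in the URL or path given under deps (for example azure://mycontainer/data.txt, gs://mybucket/data.txt, or an absolute path such as /home/shared/data.txt) and in the copy command used.

```shell
# Hypothetical demo directory; any directory inside a DVC project would do.
mkdir -p external_dep_demo

# Write the stage definition by hand. The external dependency is listed
# under deps by its URL; the downloaded copy is a regular local output.
cat > external_dep_demo/dvc.yaml <<'EOF'
stages:
  download_file:
    cmd: aws s3 cp s3://mybucket/data.txt data.txt
    deps:
      - s3://mybucket/data.txt
    outs:
      - data.txt
EOF

# Show the resulting stage definition.
cat external_dep_demo/dvc.yaml
```

With this file in place, dvc repro would be expected to rerun download_file whenever the object at the S3 URL changes. The same stage could also be created with dvc stage add, passing the URL to -d and data.txt to -o.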
See the Remote alias example for information on using remote locations that require manual authentication setup.
External outputs will be tracked by DVC, detecting when they change, but not saved in the cache for versioning.
Saving external outputs to an external cache has been deprecated in DVC 3.0.
Stay tuned as we work on versioning external outputs using cloud versioning.
To define files or directories in an external location as stage outputs,
give their remote URLs or external paths to
dvc stage add -O, or put them in
dvc.yaml (outs field). For supported external output types and expected URL
formats, see the examples above for external dependencies.
Let's take a look at defining and running an
upload_file stage that simply
uploads a file to an external location.
$ dvc stage add -n upload_file \
                -d data.txt \
                -O s3://mybucket/data.txt \
                aws s3 cp data.txt s3://mybucket/data.txt
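For reference, the -O flag is the short form of --outs-no-cache, so the output is recorded with cache: false. The snippet below is a sketch of the dvc.yaml entry that the command above is expected to produce, written by hand into a placeholder directory named external_out_demo (an assumption for illustration, as are the bucket and file names).

```shell
# Hypothetical demo directory; any directory inside a DVC project would do.
mkdir -p external_out_demo

# Hand-written equivalent of the upload_file stage. The external output
# carries cache: false, so DVC tracks changes to it without caching it.
cat > external_out_demo/dvc.yaml <<'EOF'
stages:
  upload_file:
    cmd: aws s3 cp data.txt s3://mybucket/data.txt
    deps:
      - data.txt
    outs:
      - s3://mybucket/data.txt:
          cache: false
EOF

# Show the resulting stage definition.
cat external_out_demo/dvc.yaml
```

Writing the entry by hand and using dvc stage add -O should be interchangeable; the CLI form is shorter, while the YAML form makes the cache: false setting explicit.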