
push

Upload tracked files or directories to remote storage based on the current dvc files.

Synopsis

usage: dvc push [-h] [-q | -v] [-j <number>] [-r <name>] [-a] [-T]
                [--all-commits] [--glob] [-d] [-R]
                [--run-cache | --no-run-cache]
                [targets [targets ...]]

positional arguments:
  targets       Limit command scope to these tracked files/directories,
                .dvc files, or stage names.

Description

The dvc push and dvc pull commands are the means for uploading and downloading data to and from remote storage (S3, SSH, GCS, etc.). These commands are similar to git push and git pull, respectively. Data sharing across environments, and preserving data versions (input datasets, intermediate results, models, dvc metrics, etc.) remotely are the most common use cases for these commands.

dvc push uploads data from the cache to a dvc remote.

Note that pushing data does not affect code, dvc.yaml, or .dvc files; those should be uploaded with git push. Data downloaded via dvc import is also ignored by this command.

The dvc remote used is determined in this order:

  1. the remote fields in the dvc.yaml or .dvc files.
  2. the value passed to the --remote (-r) option via CLI.
  3. the value of the core.remote config option (see dvc remote default).
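As a quick sketch of that resolution order (the remote name myremote is illustrative, not from this project):

```shell
# 1. A `remote` field inside a .dvc file or dvc.yaml entry takes
#    precedence; it is set in the metafile itself, not on the CLI.

# 2. Otherwise, an explicit remote given on the command line wins:
$ dvc push --remote myremote

# 3. Failing that, DVC falls back to the default remote (core.remote):
$ dvc remote default myremote
$ dvc push
```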

Without arguments, it uploads the files and directories referenced in the current workspace (found in all dvc.yaml and .dvc files) that are missing from the remote. Any targets given to this command limit what to push. It accepts paths to tracked files or directories (including paths inside tracked directories), .dvc files, and stage names (found in dvc.yaml).
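For instance, using a stage name from the pipeline shown later on this page, plus hypothetical file paths:

```shell
# Push only the outputs of one stage (defined in dvc.yaml):
$ dvc push featurize

# Push the data tracked by a specific .dvc file:
$ dvc push data.zip.dvc

# Push a path inside a DVC-tracked directory (path is illustrative):
$ dvc push data/images/cats
```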

The --all-branches, --all-tags, and --all-commits options enable pushing files/dirs referenced in multiple Git commits.
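For example, to push everything referenced across all branches and all tags (the -a and -T flags combine), or across the entire commit history:

```shell
$ dvc push -aT
$ dvc push --all-commits
```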

💡 For convenience, a Git hook is available to automate running dvc push after git push. See dvc install for more details.

For all outputs referenced in each target, DVC finds the corresponding files and directories in the cache (identified by hash values saved in dvc.lock and .dvc files). DVC then gathers a list of files missing from the remote storage, and uploads them.

Note that the dvc status -c command can list files tracked by DVC that are new in the cache (compared to the default remote). It can be used to see which files dvc push would upload.
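A typical check-then-push sequence might look like this (the file name and remote name are borrowed from the examples later on this page; the output shown is illustrative):

```shell
$ dvc status -c
...
        new:            data/model.p

$ dvc push
$ dvc status -c
Cache and remote 'r1' are in sync.
```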

Options

  • -a, --all-branches - determines the files to upload by examining dvc.yaml and .dvc metafiles in all Git branches, as well as in the workspace. It's useful if branches are used to track experiments. Note that this can be combined with -T below, for example using the -aT flags.

  • -T, --all-tags - examines metafiles in all Git tags, as well as in the workspace. Useful if tags are used to mark certain versions of an experiment or project. Note that this can be combined with -a above, for example using the -aT flags.

  • -A, --all-commits - examines metafiles in all Git commits, as well as in the workspace. This uploads tracked data for the entire commit history of the project.

  • -d, --with-deps - only meaningful when specifying targets. This determines files to push by resolving all dependencies of the targets: DVC searches backward from the targets in the corresponding pipelines. This will not push files referenced in later stages than the targets.

  • -R, --recursive - determines the files to push by searching each target directory and its subdirectories for dvc.yaml and .dvc files to inspect. If there are no directories among the targets, this option has no effect.

  • -r <name>, --remote <name> - name of the dvc remote to push to (see dvc remote list).

  • --run-cache, --no-run-cache - whether to upload all available history of stage runs to the dvc remote. Default is --no-run-cache.

  • -j <number>, --jobs <number> - parallelism level for DVC to upload data to remote storage. The default value is 4 * cpu_count(). Note that the default value can be set using the jobs config option with dvc remote modify. Using more jobs may speed up the operation.

  • --glob - allows pushing files and directories that match the pattern specified in targets. Shell-style wildcards are supported: *, ?, [seq], [!seq], and **.

  • -h, --help - prints the usage/help message, and exits.

  • -q, --quiet - do not write anything to standard output. Exit with 0 if no problems arise, otherwise 1.

  • -v, --verbose - displays detailed tracing information.
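As a sketch of the --glob option (the data/ layout is hypothetical), note that the pattern should be quoted so the shell does not expand it before DVC sees it:

```shell
# Push everything tracked by .dvc files directly under data/:
$ dvc push --glob 'data/*.dvc'
```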

Examples

To use dvc push (without options), a dvc remote default must be defined (see also dvc remote add --default). Let's see an SSH remote example:

$ dvc remote add --default r1 \
                 ssh://user@example.com/path/to/dvc/cache/directory

For existing projects, remotes are usually already set up. You can use dvc remote list to check them:

$ dvc remote list
r1	ssh://user@example.com/path/to/dvc/cache/directory

Push entire data cache from the current workspace to the default remote:

$ dvc push

Push files related to a specific .dvc file only:

$ dvc push data.zip.dvc

Example: With dependencies

Demonstrating the --with-deps option requires a larger example. First, assume a pipeline has been set up with these stages: clean-posts, featurize, test-posts, and matrix-train.

Imagine the project has been modified such that the outputs of some of these stages need to be uploaded to remote storage.

$ dvc status --cloud
...
	new:            data/model.p
	new:            data/matrix-test.p
	new:            data/matrix-train.p

One could do a simple dvc push to share all the data, but what if you only want to upload part of the data?

$ dvc push --with-deps test-posts

# Do some work based on the partial update...
# Then push the rest of the data:

$ dvc push --with-deps matrix-train

$ dvc status --cloud
Cache and remote 'r1' are in sync.

We specified a stage in the middle of this pipeline (test-posts) with the first push. --with-deps caused DVC to start with that stage and search backwards through the pipeline for data files to upload.

Because the matrix-train stage occurs later (it's the last one), its data was not pushed. However, we then specified it in the second push, so all remaining data was uploaded.

Finally, we used dvc status to double check that all data had been uploaded.

Example: What happens in the cache?

Let's take a detailed look at what happens to the cache directory as you run an experiment locally and push data to remote storage. For this example, assume a project has been created with some code, some data, and a dvc remote set up.

Some work has been performed in the workspace, and new data is ready for uploading to the remote. dvc status --cloud will list several files in new state. We can see exactly what that means by looking in the project's cache:

$ tree .dvc/cache/files/md5
.dvc/cache/files/md5
├── 02
│   └── 423d88d184649a7157a64f28af5a73
├── 0b
│   └── d48000c6a4e359f4b81285abf059b5
├── 38
│   └── 64e70211d3bdb367ad1432bfc14c1f.dir
├── 4a
│   └── 8c47036c79c01522e79ac0f518d0f7
├── 6c
│   └── 3074754e3a9b563b62c8f1a38670dc
├── 77
│   └── bea77463abe2b7c6b4d13f00d2c7b4
└── 88
    └── c3db1c257136090dbb4a7ddf31e678.dir

10 directories, 9 files

$ tree ~/vault/recursive/files/md5
~/vault/recursive/files/md5
├── 0b
│   └── d48000c6a4e359f4b81285abf059b5
├── 4a
│   └── 8c47036c79c01522e79ac0f518d0f7
└── 88
    └── c3db1c257136090dbb4a7ddf31e678.dir

5 directories, 5 files

The directory .dvc/cache is the local cache, while ~/vault/recursive is a "local remote" (another directory in the local file system). This listing shows the cache having more files in it than the remote, which is what the new state means.

Refer to Structure of cache directory for more info.

Next we can copy the remaining data from the cache to the remote using dvc push:

$ dvc push

$ tree ~/vault/recursive/files/md5
~/vault/recursive/files/md5
├── 02
│   └── 423d88d184649a7157a64f28af5a73
├── 0b
│   └── d48000c6a4e359f4b81285abf059b5
├── 38
│   └── 64e70211d3bdb367ad1432bfc14c1f.dir
├── 4a
│   └── 8c47036c79c01522e79ac0f518d0f7
├── 6c
│   └── 3074754e3a9b563b62c8f1a38670dc
├── 77
│   └── bea77463abe2b7c6b4d13f00d2c7b4
└── 88
    └── c3db1c257136090dbb4a7ddf31e678.dir

10 directories, 10 files

$ dvc status --cloud
Cache and remote 'r1' are in sync.

Running dvc status --cloud again verifies that there are indeed no more files to push to remote storage.
