push

Upload tracked files or directories to remote storage based on the current dvc.yaml and .dvc files.

Synopsis

usage: dvc push [-h] [-q | -v] [-j <number>] [-r <name>] [-a] [-T]
                [--all-commits] [-d] [-R] [--run-cache]
                [targets [targets ...]]

positional arguments:
  targets       Limit command scope to these tracked files/directories,
                .dvc files, or stage names.

Description

The dvc push and dvc pull commands are the means for uploading and downloading data to and from remote storage (S3, SSH, GCS, etc.). These commands are similar to git push and git pull, respectively. Data sharing across environments, and preserving data versions (input datasets, intermediate results, models, metrics, etc.) remotely are the most common use cases for these commands.

dvc push uploads data from the cache to remote storage.

Note that pushing data does not affect code, dvc.yaml, or .dvc files. Those should be uploaded with git push.

The default remote is used (see dvc remote default) unless the --remote option is used. See dvc remote for more information on how to configure a remote.

Without arguments, it uploads all files and directories missing from remote storage, found as outputs of the stages or .dvc files present in the workspace. The --all-branches, --all-tags, and --all-commits options enable pushing data for multiple Git commits.

The targets given to this command (if any) limit what to push. It accepts paths to tracked files or directories (including paths inside tracked directories), .dvc files, and stage names (found in dvc.yaml).

💡 For convenience, a Git hook is available to automate running dvc push after git push. See dvc install for more details.

For all outputs referenced in each target, DVC finds the corresponding files and directories in the cache (identified by hash values saved in dvc.lock and .dvc files). DVC then gathers a list of files missing from the remote storage, and uploads them.

Note that the dvc status -c command can list files tracked by DVC that are new in the cache (compared to the default remote). It can be used to see which files dvc push would upload.
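
The "gather what is missing, then upload it" step described above can be pictured with two plain directories. This is only a rough sketch of the idea, not how DVC is implemented: the temporary cache and remote directories, the placeholder object names and contents, and the list_objects helper are all assumptions made for illustration.

```shell
# Sketch only: mimic "find what is missing on the remote, then upload it"
# using two ordinary directories. Not DVC's actual implementation.
set -eu

work=$(mktemp -d)
cache="$work/cache"     # hypothetical stand-in for the local cache
remote="$work/remote"   # hypothetical stand-in for a "local remote"
mkdir -p "$cache/02" "$cache/0b" "$remote/0b"

# Two placeholder objects in the cache; only one is already on the remote.
echo "model" > "$cache/02/423d88d184649a7157a64f28af5a73"
echo "data"  > "$cache/0b/d48000c6a4e359f4b81285abf059b5"
cp "$cache/0b/d48000c6a4e359f4b81285abf059b5" "$remote/0b/"

# Print object paths relative to a root directory, sorted for comm.
list_objects() { (cd "$1" && find . -type f | sort); }

list_objects "$cache"  > "$work/cache.txt"
list_objects "$remote" > "$work/remote.txt"

# comm -23 keeps lines found only in the first file: cache-only objects.
comm -23 "$work/cache.txt" "$work/remote.txt" > "$work/missing.txt"

# "Upload" each missing object, creating its directory on the remote if needed.
uploaded=0
while IFS= read -r obj; do
  mkdir -p "$remote/$(dirname "$obj")"
  cp "$cache/$obj" "$remote/$obj"
  uploaded=$((uploaded + 1))
done < "$work/missing.txt"

echo "uploaded $uploaded object(s)"
```

Running the script a second time against the same directories would find nothing left to copy, which mirrors the way a repeated push has nothing new to upload.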

Options

  • -a, --all-branches - determines the files to upload by examining dvc.yaml and .dvc files in all Git branches instead of just those present in the current workspace. It's useful if branches are used to track experiments or project checkpoints. Note that this can be combined with -T below, for example using the -aT flag.
  • -T, --all-tags - same as -a above, but applies to Git tags as well as the workspace. Useful if tags are used to track "checkpoints" of an experiment or project. Note that both options can be combined, for example using the -aT flag.
  • --all-commits - same as -a or -T above, but applies to all Git commits as well as the workspace. Useful for uploading all the data used in the entire existing commit history of the project.
  • -d, --with-deps - determines files to upload by tracking dependencies to the targets. If none are provided, this option is ignored. By traversing all stage dependencies, DVC searches backward from the target stages in the corresponding pipelines. This means DVC will not push files referenced in later stages than the targets.
  • -R, --recursive - determines the files to push by searching each target directory and its subdirectories for dvc.yaml and .dvc files to inspect. If there are no directories among the targets, this option is ignored.
  • -r <name>, --remote <name> - name of the remote storage to push to (see dvc remote list).
  • --run-cache - uploads all available history of stage runs to the remote repository.
  • -j <number>, --jobs <number> - parallelism level for DVC to upload data to remote storage. The default value is 4 * cpu_count(). For SSH remotes, the default is 4. Using more jobs may improve the overall transfer speed.
  • -h, --help - prints the usage/help message, and exits.
  • -q, --quiet - do not write anything to standard output. Exit with 0 if no problems arise, otherwise 1.
  • -v, --verbose - displays detailed tracing information.
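
As a rough illustration of the --jobs default mentioned above, the snippet below computes 4 * cpu_count() in the shell and builds an explicit command line with it. It assumes that getconf _NPROCESSORS_ONLN reports the same CPU count that DVC sees, and it reuses r1 (the example remote name from this page) as a placeholder. The command is only printed, not executed.

```shell
# Sketch: reproduce the documented default of 4 * cpu_count() and
# build an explicit dvc push command line with it.
set -eu

# Assumption: the number of online CPUs matches what cpu_count() reports.
cpus=$(getconf _NPROCESSORS_ONLN)
jobs=$((4 * cpus))

# "r1" is the example remote name used elsewhere on this page.
cmd="dvc push --remote r1 --jobs $jobs"

# Print the command instead of running it, so this works without a DVC project.
echo "$cmd"
```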

Examples

To use dvc push (without options), a default remote storage must be defined (see option --default of dvc remote add). Let's see an SSH remote example:

$ dvc remote add --default r1 \
                 ssh://_username_@_host_/path/to/dvc/cache/directory

For existing projects, remotes are usually already set up. You can use dvc remote list to check them:

$ dvc remote list
r1	ssh://_username_@_host_/path/to/dvc/cache/directory

Push entire data cache from the current workspace to the default remote:

$ dvc push

Push outputs of a specific .dvc file only:

$ dvc push data.zip.dvc

Example: With dependencies

Demonstrating the --with-deps option requires a larger example. First, assume a pipeline has been set up with these stages: clean-posts, featurize, test-posts, and matrix-train.

Imagine the project has been modified such that the outputs of some of these stages need to be uploaded to remote storage.

$ dvc status --cloud
...
    new:            data/model.p
    new:            data/matrix-test.p
    new:            data/matrix-train.p

One could do a simple dvc push to share all the data, but what if you only want to upload part of the data?

$ dvc push --with-deps test-posts

... Do some work based on the partial update

$ dvc push --with-deps matrix-train

... Push the rest of the data

$ dvc status --cloud
Data and pipelines are up to date.

We specified a stage in the middle of this pipeline (test-posts) with the first push. --with-deps caused DVC to start with that stage and search backward through the pipeline for data files to upload.

Because the matrix-train stage occurs later (it's the last one), its data was not pushed. However, we then specified it in the second push, so all remaining data was uploaded.

Finally, we used dvc status to double check that all data had been uploaded.
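
The backward search can be sketched in a few lines of shell. The dependency edges below are an assumption (a simple chain in the order the stages are listed), since this page names the stages but not how they connect. The deps_of and collect_upstream helpers are hypothetical and not part of DVC.

```shell
# Sketch of the backward search that --with-deps performs.
# Assumption: the four stages form a simple chain.
set -eu

# Hypothetical edges: print the stage that the given stage depends on.
deps_of() {
  case "$1" in
    featurize)    echo "clean-posts" ;;
    test-posts)   echo "featurize" ;;
    matrix-train) echo "test-posts" ;;
    *)            echo "" ;;   # clean-posts has no upstream stage
  esac
}

# Walk upstream from the target, collecting every stage visited
# (printed in pipeline order, earliest stage first).
collect_upstream() {
  stage=$1
  result=""
  while [ -n "$stage" ]; do
    result="$stage${result:+ $result}"
    stage=$(deps_of "$stage")
  done
  echo "$result"
}

first_push=$(collect_upstream test-posts)
second_push=$(collect_upstream matrix-train)

echo "push --with-deps test-posts   covers: $first_push"
echo "push --with-deps matrix-train covers: $second_push"
```

Under this assumed chain, the first walk never reaches matrix-train, while the second walk includes it. A real pipeline is a graph in which a stage may have several dependencies, so this single-parent walk is a simplification.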

Example: What happens in the cache?

Let's take a detailed look at what happens to the cache directory as you run an experiment locally and push data to remote storage. To set up the example, assume you have created a workspace that contains some code and data, and have set up a remote.

Some work has been performed in the workspace, and new data is ready for uploading to the remote. dvc status --cloud will list several files in the new state. We can see exactly what that means by looking in the project's cache:

$ tree .dvc/cache
.dvc/cache
├── 02
│   └── 423d88d184649a7157a64f28af5a73
├── 0b
│   └── d48000c6a4e359f4b81285abf059b5
├── 38
│   └── 64e70211d3bdb367ad1432bfc14c1f.dir
├── 4a
│   └── 8c47036c79c01522e79ac0f518d0f7
├── 6c
│   └── 3074754e3a9b563b62c8f1a38670dc
├── 77
│   └── bea77463abe2b7c6b4d13f00d2c7b4
└── 88
    └── c3db1c257136090dbb4a7ddf31e678.dir

10 directories, 9 files

$ tree ~/vault/recursive
~/vault/recursive
├── 0b
│   └── d48000c6a4e359f4b81285abf059b5
├── 4a
│   └── 8c47036c79c01522e79ac0f518d0f7
└── 88
    └── c3db1c257136090dbb4a7ddf31e678.dir

5 directories, 5 files

The directory .dvc/cache is the local cache, while ~/vault/recursive is a "local remote" (another directory in the local file system). This listing shows the cache having more files in it than the remote, which is what the new state means.

Refer to Structure of cache directory for more info.
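
Judging from the listings above, each object sits in a directory named after the first two characters of its hash, with the remaining characters used as the file name. The snippet below is a small sketch of that split, using a hash taken from the listing. The variable names are illustrative only.

```shell
# Sketch: split a 32-character hash into the directory and file name
# seen in the cache listing. The hash value is taken from that listing.
set -eu

hash="02423d88d184649a7157a64f28af5a73"

prefix=$(printf '%s' "$hash" | cut -c1-2)   # first two characters: directory
rest=$(printf '%s' "$hash" | cut -c3-)      # remaining 30 characters: file name

object_path=".dvc/cache/$prefix/$rest"
echo "$object_path"
```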

Next we can copy the remaining data from the cache to the remote using dvc push:

$ dvc push

$ tree ~/vault/recursive
~/vault/recursive
├── 02
│   └── 423d88d184649a7157a64f28af5a73
├── 0b
│   └── d48000c6a4e359f4b81285abf059b5
├── 38
│   └── 64e70211d3bdb367ad1432bfc14c1f.dir
├── 4a
│   └── 8c47036c79c01522e79ac0f518d0f7
├── 6c
│   └── 3074754e3a9b563b62c8f1a38670dc
├── 77
│   └── bea77463abe2b7c6b4d13f00d2c7b4
└── 88
    └── c3db1c257136090dbb4a7ddf31e678.dir

10 directories, 10 files

$ dvc status --cloud
Data and pipelines are up to date.

Finally, running dvc status --cloud confirms that there are no more files to push to remote storage.
