```
usage: dvc pull [-h] [-q | -v] [-j <number>] [-r <name>]
                [-a] [-T] [-d] [-f] [-R] [--glob]
                [--all-commits] [--run-cache]
                [targets [targets ...]]

positional arguments:
  targets       Limit command scope to these tracked files/directories,
                .dvc files, or stage names.
```
The `dvc push` and `dvc pull` commands are the means for uploading and downloading data to and from remote storage (S3, SSH, GCS, etc.). They are similar to `git push` and `git pull`, respectively. Sharing data across environments and preserving data versions (input datasets, intermediate results, models, metrics, etc.) remotely are the most common use cases for these commands.
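As a quick sketch of the typical round trip between two environments (the file path `data/data.xml` is only illustrative):

```shell
# On the machine that produced the data:
$ dvc add data/data.xml   # track the file with DVC
$ dvc push                # upload its cached copy to remote storage

# On another machine with a clone of the repo:
$ git pull                # get the latest .dvc files and dvc.yaml
$ dvc pull                # download the data they reference
```

Note that Git transfers only the small metafiles; the data itself moves through the DVC remote.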
```
Tracked files                Commands
---------------- ---------------------------------
remote storage
      +
      |         +------------+
      | - - - - | dvc fetch  | ++
      v         +------------+   +   +----------+
project's cache                  ++  | dvc pull |
      +         +------------+   +   +----------+
      | - - - - |dvc checkout| ++
      |         +------------+
      v
  workspace
```
Without arguments, it downloads all files and directories referenced in the current workspace (found in `.dvc` files) that are missing from the workspace. Any `targets` given to this command limit what to pull. It accepts paths to tracked files or directories (including paths inside tracked directories), `.dvc` files, and stage names (found in `dvc.yaml`).

The `--all-branches`, `--all-tags`, and `--all-commits` options enable pulling files/dirs referenced in multiple Git commits.
`-a`, `--all-branches` - determines the files to download by examining `.dvc` metafiles in all Git branches, as well as in the workspace. It's useful if branches are used to track experiments. Note that this can be combined with `-T` below, for example using the `-aT` flags.

`-T`, `--all-tags` - examines metafiles in all Git tags, as well as in the workspace. Useful if tags are used to mark certain versions of an experiment or project. Note that this can be combined with `-a` above, for example using the `-aT` flags.

`--all-commits` - examines metafiles in all Git commits, as well as in the workspace. This downloads tracked data for the entire commit history of the project.
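For example, to download the data referenced in every Git branch and tag at once (useful before going offline), the two flags can be combined:

```shell
$ dvc pull -aT    # same as: dvc pull --all-branches --all-tags
```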
`-d`, `--with-deps` - determines files to download by tracking dependencies to the `targets`. If none are provided, this option is ignored. By traversing all stage dependencies, DVC searches backward from the target stages in the corresponding pipelines. This means DVC will not pull files referenced in later stages than the `targets`.
`-R`, `--recursive` - determines the files to pull by searching each target directory and its subdirectories for `.dvc` files to inspect. If there are no directories among the `targets`, this option is ignored.
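For instance, assuming a hypothetical `data/` directory that contains several tracked subdirectories, `--recursive` pulls everything referenced by `.dvc` files found under it:

```shell
$ dvc pull --recursive data/
```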
`-f`, `--force` - does not prompt when removing workspace files, which occurs when these files no longer match the current stages or `.dvc` files. This option surfaces behavior from the `dvc checkout` command, since `dvc pull` in effect performs those two functions (fetch and checkout) in a single command.
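A sketch of when this matters: if workspace files were modified locally and no longer match the tracked versions, `--force` replaces them without asking for confirmation:

```shell
$ dvc pull --force    # or -f; overwrites mismatched workspace files without prompting
```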
`-r <name>`, `--remote <name>` - name of the remote storage to pull from (see `dvc remote list`).
`--run-cache` - downloads all available history of stage runs from the `dvc remote` (to the cache only, like `dvc fetch --run-cache`). Note that `dvc repro <stage_name>` is necessary to checkout these files (into the workspace) and update `dvc.lock`.
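For example, to restore the stage run history and then bring a stage's outputs into the workspace (a stage named `train` is assumed to exist in `dvc.yaml`, as in the examples in this page):

```shell
$ dvc pull --run-cache    # download stage run history into the cache only
$ dvc repro train         # checkout the stage's outputs and update dvc.lock
```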
`-j <number>`, `--jobs <number>` - parallelism level for DVC to download data from remote storage. The default value is `4 * cpu_count()`. For SSH remotes, the default is `4`. Note that the default value can be set using the `jobs` config option with `dvc remote modify`. Using more jobs may speed up the operation.
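For example, to raise the parallelism for a single pull, or to persist it for a hypothetical remote named `r1`:

```shell
$ dvc pull --jobs 8              # one-off: use 8 parallel downloads
$ dvc remote modify r1 jobs 8    # make 8 the default for this remote
```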
`--glob` - allows pulling files and directories that match the pattern specified in `targets`. Shell style wildcards supported: `*`, `?`, `[seq]`, `[!seq]`, and `**`.
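For example, assuming hypothetical targets named `models-a`, `models-b`, etc., a wildcard can select them all (quote the pattern so the shell doesn't expand it first):

```shell
$ dvc pull --glob 'models-*'
```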
`-h`, `--help` - prints the usage/help message, and exits.

`-q`, `--quiet` - do not write anything to standard output. Exits with 0 if no problems arise, otherwise 1.

`-v`, `--verbose` - displays detailed tracing information.
```
.
├── data
│   └── data.xml.dvc
├── dvc.lock
├── dvc.yaml
...
└── src
    └── <code files here>
```
We can now just run `dvc pull` to download the most recent `model.pkl` and other DVC-tracked files into the workspace:
```
$ dvc pull

$ tree
.
├── data
│   ├── data.xml
│   ├── data.xml.dvc
...
└── model.pkl
```
We can also download only the outputs of a specific stage:
```
$ dvc pull train
```
If you tried the previous examples, please delete the `.dvc/cache` directory first (with `rm -Rf .dvc/cache`) to follow this one.
Imagine the remote storage has been modified such that the data in some of these stages should be updated in the workspace.
```
$ dvc status -c
...
    deleted:            data/features/test.pkl
    deleted:            data/features/train.pkl
    deleted:            model.pkl
...
```
One could do a simple
dvc pull to get all the data, but what if you only want
to retrieve part of the data?
```
$ dvc pull --with-deps featurize

# Use the partial update...

# Then pull the remaining data:
$ dvc pull
Everything is up to date.
```
With the first `dvc pull` we specified a stage in the middle of this pipeline (`featurize`) while using `--with-deps`. DVC started with that stage and searched backwards through the pipeline for data files to download. Later we ran `dvc pull` to download all the remaining data files.
To use the `dvc pull` command, a remote storage must be defined (see `dvc remote add`). For an existing project, remotes are usually already set up; you can use `dvc remote list` to check them. To remember how it's done, and to set a context for the example, let's define a default SSH remote:
```
$ dvc remote add -d r1 ssh://firstname.lastname@example.org/path/to/dvc/remote/storage

$ dvc remote list
r1      ssh://firstname.lastname@example.org/path/to/dvc/remote/storage
```
DVC supports several remote types.
To download DVC-tracked data from a specific DVC remote, use the `--remote` (`-r`) option of `dvc pull`:

```
$ dvc pull --remote r1
```