Download files or directories from remote storage to the cache.
```
usage: dvc fetch [-h] [-q | -v] [-j <number>] [-r <name>] [-a] [-T]
                 [--all-commits] [-d] [-R] [--run-cache]
                 [targets [targets ...]]

positional arguments:
  targets        Limit command scope to these tracked files/directories,
                 .dvc files, or stage names.
```
Downloads tracked files and directories from remote storage into the cache, without placing them in the workspace like `dvc pull` does. This makes the tracked data available for linking (or copying) into the workspace (see `dvc checkout`).

`dvc pull` already includes fetching:
```
Tracked files                Commands
---------------- ---------------------------------
remote storage
      +
      |          +------------+
      | - - - -  | dvc fetch  | ++
      v          +------------+    +  +----------+
project's cache                   ++  | dvc pull |
      +          +------------+    +  +----------+
      | - - - -  |dvc checkout| ++
      |          +------------+
      v
  workspace
```
Here are some scenarios in which `dvc fetch` is useful instead of pulling: for example, comparing data across project versions with `dvc metrics show` or `dvc plots diff`, without bringing that data into the workspace.
Without arguments, it downloads all files and directories referenced in the current workspace (found in `dvc.yaml` and `.dvc` files) that are missing from the cache. Any `targets` given to this command limit what to fetch. It accepts paths to tracked files or directories (including paths inside tracked directories), `.dvc` files, and stage names (found in `dvc.yaml`).
The `--all-branches`, `--all-tags`, and `--all-commits` options enable fetching files/dirs referenced in multiple Git commits.
The `dvc remote` used is determined in order, based on:

- the `remote` fields in the `dvc.yaml` or `.dvc` files;
- the value passed to the `--remote` option via CLI;
- the `core.remote` config option (see `dvc remote default`).
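This precedence can be sketched as a small helper (an illustration only, not DVC's actual code; the parameter names are hypothetical):

```python
def resolve_remote(output_remote=None, cli_remote=None, core_remote=None):
    """Pick which remote to fetch from, following the precedence
    described above: the `remote` field on the output, then the
    `--remote` CLI option, then the `core.remote` config option.

    A simplified sketch; parameter names are made up for illustration.
    """
    for candidate in (output_remote, cli_remote, core_remote):
        if candidate is not None:
            return candidate
    return None
```

For instance, `resolve_remote(cli_remote="myremote", core_remote="backup")` returns `"myremote"`, since the CLI option outranks the config default.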
--with-deps - only meaningful when specifying `targets`. It determines files to download by resolving all dependencies of the targets: DVC searches backward from the targets in the corresponding pipelines. This will not fetch files referenced in stages later than the targets.
--recursive - determines the files to fetch by searching each target
directory and its subdirectories for
.dvc files to inspect.
If there are no directories among the
targets, this option has no effect.
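How `--recursive` expands a directory target can be sketched like this (a simplification: real DVC goes on to inspect each discovered `.dvc` file to decide what to download):

```python
from pathlib import Path

def find_dvc_files(target: Path) -> list:
    # Search the target directory and all of its subdirectories
    # for .dvc files to inspect, as --recursive does.
    return sorted(target.rglob("*.dvc"))
```

A non-directory target would simply never reach this search, which is why the option has no effect without directories among the targets.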
--jobs <number> - parallelism level for DVC to download data
from remote storage. The default value is
4 * cpu_count(). Note that the
default value can be set using the
jobs config option with
dvc remote modify. Using more jobs may speed up the operation.
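The default described above amounts to the following (a sketch of the documented default value, not DVC's internals):

```python
import os

def default_fetch_jobs() -> int:
    # Default download parallelism: 4 * cpu_count(), falling back
    # to 4 if the CPU count cannot be determined.
    return 4 * (os.cpu_count() or 1)
```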
--all-branches - fetch cache for all Git branches, as well as for the workspace. This means DVC may download files needed to reproduce different versions of a `.dvc` file, not just the ones currently in the workspace. Note that this can be combined with `-T` below, for example using the `-aT` flag combination.
--all-tags - fetch cache for all Git tags, as well as for the workspace. Note that this can be combined with `-a` above, for example using `-aT`.
--all-commits - fetch cache for all Git commits, as well as for the workspace. This downloads tracked data for the entire commit history of the project.
--help - prints the usage/help message, and exits.
--quiet - do not write anything to standard output. Exit with 0 if no
problems arise, otherwise 1.
--verbose - displays detailed tracing information.
The workspace looks like this:
```
.
├── data
│   └── data.xml.dvc
├── dvc.lock
├── dvc.yaml
├── params.yaml
├── prc.json
├── scores.json
└── src
    └── <code files here>
```
```
$ dvc status --cloud
...
    deleted:            data/features/train.pkl
    deleted:            model.pkl

$ dvc fetch
$ tree .dvc/cache
.dvc/cache
├── 20
│   └── b786b6e6f80e2b3fcf17827ad18597.dir
├── c8
│   ├── d307aa005d6974a8525550956d5fb3
│   └── ...
...
```
Note that the `.dvc/cache` directory was created and populated. All the data needed in this version of the project is now in your cache: hash value `c8d307a...`, for example, corresponds to the `model.pkl` file.
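The cache layout visible in the `tree` output above (a two-character subdirectory, then the rest of the hash as the file name) can be sketched as:

```python
from pathlib import Path

def cache_path(cache_dir: str, entry: str) -> Path:
    # Map a content-hash entry to its location in the local cache:
    # the first two hex characters form the subdirectory, the
    # remainder the file name (directory entries carry a .dir suffix).
    return Path(cache_dir) / entry[:2] / entry[2:]
```

For example, `cache_path(".dvc/cache", "20b786b6e6f80e2b3fcf17827ad18597.dir")` yields `.dvc/cache/20/b786b6e6f80e2b3fcf17827ad18597.dir`, matching the tree above.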
To link these files to the workspace:
$ dvc checkout
If you tried the previous example, please delete the `.dvc/cache` directory first (e.g. `rm -Rf .dvc/cache`) to follow this one.
`dvc fetch` only downloads the tracked data corresponding to any given `targets`:
```
$ dvc fetch prepare
$ tree .dvc/cache
.dvc/cache
├── 20
│   └── b786b6e6f80e2b3fcf17827ad18597.dir
├── 32
│   └── b715ef0d71ff4c9e61f55b09c15e75
└── 6f
    └── 597d341ceb7d8fbbe88859a892ef81
```
Cache entries for the `data/prepared` directory (output of the `prepare` target), as well as the actual data files it contains (`train.tsv` among them), were downloaded. Their hash values are shown above.
Note that you can fetch data within tracked directories. For example, the `featurize` stage has the entire `data/features` directory as an output, but we can get just one file from it:
$ dvc fetch data/features/test.pkl
If you check `.dvc/cache` again, you'll see a couple more files were downloaded: the cache entries for the `data/features` directory, and the `test.pkl` file itself.
After following the previous example (Specific stages), only the files
associated with the
prepare stage have been fetched. Several
dependencies/outputs of other pipeline stages are still missing from the cache:
```
$ dvc status -c
...
    deleted:            data/features/test.pkl
    deleted:            data/features/train.pkl
    deleted:            model.pkl
```
One could do a simple `dvc fetch` to get all the data, but what if you only want to retrieve the data up to our third stage, `train`? We can use the `--with-deps` option:
```
$ dvc fetch --with-deps train
$ tree .dvc/cache
.dvc/cache
├── 20
│   └── b786b6e6f80e2b3fcf17827ad18597.dir
├── c8
│   ├── 43577f9da31eab5ddd3a2cf1465f9b
│   └── d307aa005d6974a8525550956d5fb3
├── 32
│   └── b715ef0d71ff4c9e61f55b09c15e75
├── 54
│   └── c0f3ef1f379563e0b9ba4accae6807
├── 6f
│   └── 597d341ceb7d8fbbe88859a892ef81
├── a1
│   └── 414b22382ffbb76a153ab1f0d69241.dir
└── a3
    └── 04afb96060aad90176268345e10355
```
--with-deps starts with the target stage (
train) and searches
backwards through its pipeline for data to download into the project's cache.
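The backward search can be sketched as a walk over the stage dependency graph (the graph below is illustrative, modeled on this example project's pipeline):

```python
from collections import deque

# Illustrative stage graph: each stage maps to the stages it depends on.
PIPELINE = {
    "prepare": [],
    "featurize": ["prepare"],
    "train": ["featurize"],
    "evaluate": ["train"],
}

def stages_to_fetch(target: str) -> set:
    # Walk backwards from the target through its dependencies,
    # collecting every upstream stage whose data must be fetched.
    # Stages later in the pipeline than the target are never visited.
    seen, queue = set(), deque([target])
    while queue:
        stage = queue.popleft()
        if stage not in seen:
            seen.add(stage)
            queue.extend(PIPELINE[stage])
    return seen
```

Here `stages_to_fetch("train")` returns `{"prepare", "featurize", "train"}`, and `evaluate` is never visited, matching the behavior described above.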
All the data for the second and third stages (`featurize` and `train`) has now been downloaded to the cache. We could now use `dvc checkout` to get the data files needed to reproduce this pipeline up to the third stage into the workspace (before using `dvc repro train`).
Note that in this example project, the last stage (`evaluate`) doesn't add any more data files than those from previous stages, so at this point all of the data for this pipeline is cached, and `dvc status -c` would output `Cache and remote 'storage' are in sync.`