fetch
Download files or directories from remote storage to the cache.
Synopsis
usage: dvc fetch [-h] [-q | -v] [-j <number>] [-r <name>] [-a] [-T]
                 [--all-commits] [-d] [-R] [--run-cache | --no-run-cache]
                 [--max-size <bytes>] [--type {metrics,plots}]
                 [targets [targets ...]]
positional arguments:
  targets        Limit command scope to these tracked files/directories,
                 .dvc files, or stage names.
Description
Downloads tracked files and directories from a dvc remote into the cache
(without placing them in the workspace like dvc pull). This makes the tracked
data available for linking (or copying) into the workspace (see dvc checkout).
Note that dvc pull includes fetching.
  Tracked files             Commands
---------------- ---------------------------------
 remote storage
       +
       |         +------------+
       | - - - - | dvc fetch  | ++
       v         +------------+   +   +----------+
project's cache                    ++ | dvc pull |
       +         +------------+   +   +----------+
       | - - - - |dvc checkout| ++
       |         +------------+
       v
   workspace
Here are some scenarios in which dvc fetch is useful, instead of pulling:
- After checking out a fresh copy of a DVC repository, to get DVC-tracked data from multiple project branches or tags onto your machine.
- To use comparison commands across different Git commits, for example dvc metrics show with its --all-branches option, or dvc plots diff.
- If you want to avoid linking files from the cache, or keep the workspace clean for any other reason.
Without arguments, it downloads all files and directories referenced in the
current workspace (found in dvc.yaml and .dvc files) that are missing from the
workspace. Any targets given to this command limit what to fetch. It accepts
paths to tracked files or directories (including paths inside tracked
directories), .dvc files, and stage names (found in dvc.yaml).
The --all-branches, --all-tags, and --all-commits options enable fetching
files/dirs referenced in multiple Git commits.
The dvc remote used is determined in order, based on:
- the remote fields in the dvc.yaml or .dvc files.
- the value passed to the --remote option via CLI.
- the value of the core.remote config option (see dvc remote default).
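The precedence listed above can be pictured as "first non-empty value wins". The following is only a sketch of that rule, not DVC's implementation: resolve_remote is a hypothetical helper, and the remote names backup and archive are made up for illustration.

```shell
# resolve_remote is a hypothetical helper that models the precedence only.
# Arguments (any of them may be empty):
#   $1  remote field from dvc.yaml or a .dvc file
#   $2  value given to --remote / -r
#   $3  core.remote config value
resolve_remote() {
    for candidate in "$1" "$2" "$3"; do
        if [ -n "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    echo "no remote configured" >&2
    return 1
}

resolve_remote "" "backup" "storage"    # no remote field: the --remote value is used
resolve_remote "" "" "storage"          # nothing else set: falls back to core.remote
```

In this model, a remote field set in dvc.yaml or a .dvc file is returned even when --remote is also given, which mirrors the order of the list.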
Options
- -r <name>, --remote <name> - name of the dvc remote to fetch from (see dvc remote list).
- -d, --with-deps - only meaningful when specifying targets. This determines files to download by resolving all dependencies of the targets: DVC searches backward from the targets in the corresponding pipelines. This will not fetch files referenced in later stages than the targets.
- -R, --recursive - determines the files to fetch by searching each target directory and its subdirectories for dvc.yaml and .dvc files to inspect. If there are no directories among the targets, this option has no effect.
- --run-cache, --no-run-cache - whether to download all available history of stage runs from the remote repository. See the same option in dvc push. Default is --no-run-cache.
- -j <number>, --jobs <number> - parallelism level for DVC to download data from remote storage. The default value is 4 * cpu_count(). Note that the default value can be set using the jobs config option with dvc remote modify. Using more jobs may speed up the operation.
- -a, --all-branches - fetch cache for all Git branches, as well as for the workspace. This means DVC may download files needed to reproduce different versions of a .dvc file, not just the ones currently in the workspace. Note that this can be combined with -T below, for example using the -aT flags.
- -T, --all-tags - fetch cache for all Git tags, as well as for the workspace. Note that this can be combined with -a above, for example using the -aT flags.
- -A, --all-commits - fetch cache for all Git commits, as well as for the workspace. This downloads tracked data for the entire commit history of the project.
- --max-size <bytes> - fetch only data files/directories that are each below the specified size (in bytes). Note that the size is determined by the corresponding size field in the .dvc or dvc.lock file. This means that even if some files or subdirectories inside a DVC-tracked directory are smaller, the whole directory is still skipped.
- --type <type> - fetch data files/directories that are of a particular type. Currently only metrics and plots are supported.
- -h, --help - prints the usage/help message, and exits.
- -q, --quiet - do not write anything to standard output. Exit with 0 if no problems arise, otherwise 1.
- -v, --verbose - displays detailed tracing information.
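To illustrate the --max-size behavior described above, here is a small model of the filtering rule. It is a sketch under stated assumptions, not DVC code: filter_by_max_size is a hypothetical helper, the sizes are invented, and "below" is assumed to mean strictly smaller than the limit.

```shell
# filter_by_max_size is a hypothetical helper, not part of DVC.
# It reads "<size in bytes> <path>" lines on stdin and prints the paths whose
# recorded size is strictly below the limit passed as $1.
# Paths containing spaces are not handled in this sketch.
filter_by_max_size() {
    awk -v max="$1" '$1 + 0 < max + 0 { print $2 }'
}

# Made-up sizes for illustration. data/features is a single tracked directory,
# so only its total recorded size is compared, not the files inside it.
printf '%s\n' \
    '2500000 data/features' \
    '900000 model.pkl' |
    filter_by_max_size 1000000
```

Only model.pkl passes the filter here. The directory entry is dropped as a whole, even though individual files inside it could be smaller than the limit.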
Examples
Let's employ a simple workspace with some data, code, ML models, and pipeline
stages, such as the DVC project created for Get Started. Then we can see what
dvc fetch does in different scenarios.
Start by cloning our example repo if you don't already have it:
$ git clone https://github.com/iterative/example-get-started
$ cd example-get-started
The workspace looks like this:
.
├── data
│   └── data.xml.dvc
├── dvc.lock
├── dvc.yaml
├── params.yaml
├── prc.json
├── scores.json
└── src
    └── <code files here>
This project comes with a predefined HTTP remote storage. We can now just run
dvc fetch to download the most recent model.pkl, data.xml, and other
DVC-tracked files into our local cache.
$ dvc status --cloud
...
deleted: data/features/train.pkl
deleted: model.pkl
$ dvc fetch
$ tree .dvc/cache/files/md5
.dvc/cache/files/md5
├── 20
│   └── b786b6e6f80e2b3fcf17827ad18597.dir
├── c8
│   ├── d307aa005d6974a8525550956d5fb3
│   └── ...
...
dvc status --cloud compares the cache contents against the default remote. Refer to dvc status.
Note that the .dvc/cache directory was created and populated.
All the data needed in this version of the project is now in your cache: File
names 20b786b... and c8d307a... correspond to the data/features/ directory and
model.pkl file, respectively.
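Judging from the tree output above, each cache entry sits in a subdirectory named after the first two characters of its hash, with the remaining characters as the file name. The sketch below rebuilds such a path from a full hash. It only illustrates the layout visible in this example: cache_path is a hypothetical helper, not a DVC command.

```shell
# cache_path is a hypothetical helper, not a DVC command.
# It reproduces the layout seen in the tree output:
#   <cache dir>/files/md5/<first two characters>/<remaining characters>
cache_path() {
    hash=$1
    cache_dir=${2:-.dvc/cache}
    rest=${hash#??}            # everything after the first two characters
    prefix=${hash%"$rest"}     # the first two characters
    printf '%s/files/md5/%s/%s\n' "$cache_dir" "$prefix" "$rest"
}

cache_path 20b786b6e6f80e2b3fcf17827ad18597.dir
cache_path c8d307aa005d6974a8525550956d5fb3
```

Passing the result to ls is a quick way to check whether a particular entry has already been fetched.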
To link these files to the workspace:
$ dvc checkout
Example: Specific files or directories
If you tried the previous example, delete the .dvc/cache directory first (e.g. rm -Rf .dvc/cache) to follow this one.
dvc fetch only downloads the tracked data corresponding to any given targets:
$ dvc fetch prepare
$ tree .dvc/cache/files/md5
.dvc/cache/files/md5
├── 20
│   └── b786b6e6f80e2b3fcf17827ad18597.dir
├── 32
│   └── b715ef0d71ff4c9e61f55b09c15e75
└── 6f
    └── 597d341ceb7d8fbbe88859a892ef81
Cache entries for the data/prepared directory (output of the prepare target),
as well as the actual test.tsv and train.tsv files, were downloaded. Their hash
values are shown above.
Note that you can fetch data within tracked directories. For example, the
featurize stage has the entire data/features directory as output, but we can
fetch just one file from it:
$ dvc fetch data/features/test.pkl
If you check .dvc/cache again, you'll see a couple more files were downloaded:
the cache entries for the data/features directory, and data/features/test.pkl
itself.
Example: With dependencies
After following the previous example (Specific files or directories), only the
files associated with the prepare stage have been fetched. Several
dependencies/outputs of other pipeline stages are still missing from the cache:
$ dvc status -c
...
deleted: data/features/test.pkl
deleted: data/features/train.pkl
deleted: model.pkl
One could do a simple dvc fetch to get all the data, but what if you only want
to retrieve the data up to our third stage, train? We can use the --with-deps
(or -d) option:
$ dvc fetch --with-deps train
$ tree .dvc/cache/files/md5
.dvc/cache/files/md5
├── 20
│   └── b786b6e6f80e2b3fcf17827ad18597.dir
├── c8
│   ├── 43577f9da31eab5ddd3a2cf1465f9b
│   └── d307aa005d6974a8525550956d5fb3
├── 32
│   └── b715ef0d71ff4c9e61f55b09c15e75
├── 54
│   └── c0f3ef1f379563e0b9ba4accae6807
├── 6f
│   └── 597d341ceb7d8fbbe88859a892ef81
├── a1
│   └── 414b22382ffbb76a153ab1f0d69241.dir
└── a3
    └── 04afb96060aad90176268345e10355
Fetching using --with-deps starts with the target stage (train) and searches
backwards through its pipeline for data to download into the project's cache.
All the data for the second and third stages (featurize and train) has now
been downloaded to the cache. We could now use dvc checkout to get the data
files needed to reproduce this pipeline up to the third stage into the workspace
(with dvc repro train).
Note that in this example project, the last stage evaluate doesn't add any more data files than those from previous stages, so at this point all of the data for this pipeline is cached and dvc status -c would output Cache and remote 'storage' are in sync.
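The backward search performed by --with-deps can be pictured with a toy model of the pipeline. This is a sketch under simplifying assumptions, not how DVC resolves dependencies: stage_deps and upstream are hypothetical helpers, and the pipeline is hard-coded as a linear chain (prepare, featurize, train, evaluate), whereas a real pipeline is a graph in which a stage may depend on several others.

```shell
# stage_deps and upstream are hypothetical helpers, not DVC commands.
# stage_deps hard-codes a simplified, linear version of the example pipeline:
#   prepare -> featurize -> train -> evaluate
stage_deps() {
    case $1 in
        featurize) echo prepare ;;
        train)     echo featurize ;;
        evaluate)  echo train ;;
    esac
}

# Print the target stage followed by every stage it depends on, walking
# backward until a stage with no dependencies is reached.
upstream() {
    stage=$1
    while [ -n "$stage" ]; do
        echo "$stage"
        stage=$(stage_deps "$stage")
    done
}

upstream train
```

Starting from train, the walk visits featurize and prepare but never evaluate, which matches the statement that files referenced in stages later than the target are not fetched.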