Track data files or directories with DVC by creating a corresponding .dvc file.
usage: dvc add [-h] [-q | -v] [-R] [--no-commit] [--external]
               [--glob] [--file <filename>] [-o <path>] [--to-remote]
               [-r <name>] [-j <number>] [--desc <text>]
               targets [targets ...]

positional arguments:
  targets               Files or directories to add
This command can be used to track large files, models, dataset directories, etc. that are too big for Git to handle directly. This enables versioning them indirectly with Git.
After checking that each
target hasn't been added before (or tracked with
other DVC commands), a few actions are taken under the hood:
- The data is stored in the cache (by default in .dvc/cache), or in remote storage if --to-remote is given, using the file hash to form the cached file path. (See Structure of cache directory for more details.)
- A .dvc file is created to track the file, using its path and hash to identify the cached data (with -o, an external path is moved to the workspace). The .dvc file lists the DVC-tracked file as an output (outs field). Unless the --file option is used, the .dvc file name generated by default is <file>.dvc, where <file> is the file name of the first target.
- The targets are added to .gitignore in order to prevent them from being committed to the Git repository (unless dvc init --no-scm was used when initializing the DVC project).
- Instructions are printed showing git commands for staging .dvc files (or they are staged automatically if DVC is configured to do so).
Summarizing, the result is that the target data is replaced by small .dvc
files that can be easily tracked with Git.
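The hashing and naming steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not DVC's implementation: the helper names `file_md5` and `default_dvc_name` are hypothetical, and the sketch assumes the hash is a plain MD5 digest of the file's bytes.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def default_dvc_name(target):
    """Hypothetical helper: the default metadata file is <file>.dvc."""
    return target + ".dvc"


# Demo on a throwaway file (the content is made up for illustration).
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "data.xml")
    with open(target, "wb") as f:
        f.write(b"<data/>\n")
    demo_hash = file_md5(target)
    demo_dvc_name = os.path.basename(default_dvc_name(target))

print(demo_hash, demo_dvc_name)
```

Reading in chunks keeps memory use flat, which matters for the large files this command is meant for.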
It's possible to prevent files or directories from being added by DVC by
entering the corresponding patterns in a .dvcignore file.
You can also undo
dvc add to stop
tracking files or directories.
By default, DVC tries to use reflinks (see File link types)
to avoid copying any file contents and to optimize
.dvc file operations for
large files. DVC also supports other link types for use on file systems without
reflink support, but they have to be specified manually. Refer to the
cache.type config option in
dvc config cache for more information.
Every file in the directory is cached normally (unless the
--no-commit option is
used), but DVC does not produce individual
.dvc files for each one. Instead, a single
.dvc file references a special JSON file in the cache (with
.dir extension), which in turn points to the added files.
Refer to Structure of cache directory for more information.
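To make the idea of a directory-level JSON file more concrete, here is a rough Python sketch. The entry layout (`md5` and `relpath` keys), the serialization, and the way the directory hash is derived are all assumptions made for illustration; DVC's actual format may differ, and `build_dir_manifest` is a hypothetical name.

```python
import hashlib
import json


def build_dir_manifest(file_hashes):
    """Build a manifest from a dict of relative path -> file hash.

    Returns (manifest_name, entries). The name is the hash of the
    serialized entries plus a ".dir" suffix (an assumed scheme).
    """
    entries = [
        {"md5": md5, "relpath": relpath}
        for relpath, md5 in sorted(file_hashes.items())
    ]
    payload = json.dumps(entries, sort_keys=True).encode("utf-8")
    manifest_name = hashlib.md5(payload).hexdigest() + ".dir"
    return manifest_name, entries


# Two made-up files, hashed from their (illustrative) contents.
hashes = {
    "file2": hashlib.md5(b"file_two\n").hexdigest(),
    "file1": hashlib.md5(b"file_one\n").hexdigest(),
}
manifest_name, entries = build_dir_manifest(hashes)
print(manifest_name)
print(json.dumps(entries, indent=2))
```

The point of the sketch is the indirection: one small metadata file refers to one manifest, and the manifest lists every file in the directory.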
Note that DVC commands that use tracked data support granular targeting of files
and directories, even when contained in a parent directory added as a whole.
This applies to dvc import, among other commands.
As a rarely needed alternative, the
--recursive option causes every file in
the hierarchy to be added individually. A corresponding
.dvc file will be
generated for each file in the same location. This may be helpful to save time
adding several data files grouped in a structural directory, but it's
undesirable for data directories with a large number of files.
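As a rough picture of what adding every file individually means, the following Python sketch walks a directory tree and lists the .dvc file that would sit next to each data file. It does not call DVC; `recursive_dvc_files` is a hypothetical helper and the directory layout is made up.

```python
import os
import tempfile


def recursive_dvc_files(target_dir):
    """List the per-file .dvc paths for every file under target_dir."""
    result = []
    for root, _dirs, files in os.walk(target_dir):
        for name in files:
            if name.endswith(".dvc"):
                continue  # do not generate metadata for metadata files
            result.append(os.path.join(root, name + ".dvc"))
    return sorted(result)


# Build a tiny throwaway tree and list the .dvc files it would get.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ("pics/train/cats/img1.jpg", "pics/train/dogs/img1.jpg"):
        full = os.path.join(tmp, *rel.split("/"))
        os.makedirs(os.path.dirname(full), exist_ok=True)
        open(full, "wb").close()
    generated = [
        os.path.relpath(p, tmp).replace(os.sep, "/")
        for p in recursive_dvc_files(os.path.join(tmp, "pics"))
    ]

print(generated)
```

The list grows by one entry per data file, which is why this mode becomes unwieldy for directories with many files.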
To avoid adding files inside a directory accidentally, you can add the
corresponding patterns to .dvcignore.
dvc add supports symlinked files as
targets. But if a target path is a
directory symlink, or if it contains any intermediate directory symlinks, it
cannot be added to DVC.
For example, given the following project structure:
.
├── .dvc
├── dir
│   └── file
├── link_to_dir -> dir
├── link_to_external_dir -> /path/to/dir
├── link_to_external_file -> /path/to/file
└── link_to_file -> dir/file
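link_to_file and link_to_external_file are both valid symlink targets to
dvc add. But link_to_dir and link_to_external_dir are not, since they are directory symlinks, and neither is a path such as link_to_dir/file, which passes through one.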
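The symlink rule can be expressed as a small check. The Python sketch below is an assumption-laden illustration rather than DVC's own logic: `is_valid_add_target` is a hypothetical function, and the demo recreates the project structure above in temporary directories (it needs a platform where creating symlinks is permitted).

```python
import os
import tempfile


def is_valid_add_target(path, root):
    """Reject directory symlinks and paths that pass through a symlink."""
    parts = os.path.relpath(path, root).split(os.sep)
    current = root
    for i, part in enumerate(parts):
        current = os.path.join(current, part)
        is_last = i == len(parts) - 1
        if os.path.islink(current):
            if not is_last:
                return False  # intermediate directory symlink
            if os.path.isdir(current):
                return False  # the target itself is a directory symlink
    return True


with tempfile.TemporaryDirectory() as root, tempfile.TemporaryDirectory() as ext:
    os.mkdir(os.path.join(root, "dir"))
    open(os.path.join(root, "dir", "file"), "w").close()
    open(os.path.join(ext, "file"), "w").close()
    os.symlink("dir", os.path.join(root, "link_to_dir"))
    os.symlink(os.path.join("dir", "file"), os.path.join(root, "link_to_file"))
    os.symlink(ext, os.path.join(root, "link_to_external_dir"))
    os.symlink(os.path.join(ext, "file"),
               os.path.join(root, "link_to_external_file"))
    names = ["dir/file", "link_to_file", "link_to_external_file",
             "link_to_dir", "link_to_external_dir", "link_to_dir/file"]
    results = {
        n: is_valid_add_target(os.path.join(root, *n.split("/")), root)
        for n in names
    }

for name, ok in results.items():
    print(name, "valid" if ok else "invalid")
```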
--recursive - determines the files to add by searching each target
directory and its subdirectories for data files. If there are no directories
among the targets, this option has no effect. For each file found, a new
.dvc file is created using the process outlined in this command's description.
--file <filename> - specify the name of the
.dvc file it generates. This
option works only if there is a single target. By default the name of the
.dvc file is <target>.dvc, where
<target> is the file name of
the given target. This option allows setting both the name and the path of the generated .dvc file.
--glob - allows adding files and directories that match the
pattern specified in targets.
Shell-style wildcards are supported.
--external - allow tracking
targets outside of the DVC repository. See
Managing External Data.
⚠️ Note that this is an advanced feature for very specific situations and not recommended except if there's absolutely no other alternative. Additionally, this typically requires an external cache setup (see link above).
--out <path> - specify a
path to the desired location in the
workspace to place the
targets (copying them from their current location).
This enables targeting data outside the project (see the example of adding external data further below).
--to-remote - add a target that's outside the project, but neither cache it
nor place it in the workspace yet.
Transfer it directly to remote storage
instead (the default one unless one is specified with --remote).
Implies --out . (the current directory). Use
dvc pull to get the data locally.
--remote <name> - name of the
remote storage to transfer the external target to
(can only be used with --to-remote).
--jobs <number> - parallelism level for DVC to transfer data
when using --to-remote. The default value is
4 * cpu_count(). For SSH
remotes, the default is
4. Using more jobs may speed up the operation.
--desc <text> - user description of the data (optional). This doesn't affect
any DVC operations.
--help - prints the usage/help message and exits.
--quiet - do not write anything to standard output. Exit with 0 if no
problems arise, otherwise 1.
--verbose - displays detailed tracing information.
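The shell-style wildcards mentioned for --glob can be tried out with Python's standard `fnmatch` module, used here purely as a stand-in. Whether DVC's matching agrees with `fnmatch` in every detail is an assumption, and the file names are made up.

```python
import fnmatch

candidates = [
    "data.xml", "data.csv",
    "img1.jpg", "img2.jpg", "img10.jpg",
    "notes.txt",
]

# "*" matches any run of characters.
matches_star = fnmatch.filter(candidates, "data.*")

# "?" matches exactly one character, so img10.jpg is excluded.
matches_qmark = fnmatch.filter(candidates, "img?.jpg")

# "[!1]" matches one character that is not "1".
matches_negated = fnmatch.filter(candidates, "img[!1].jpg")

print(matches_star)
print(matches_qmark)
print(matches_negated)
```

When passing such patterns on a command line, quoting them keeps the shell from expanding them before the command sees them.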
Track a file with DVC:
$ dvc add data.xml
As indicated above, a
.dvc file has been created for
data.xml. Let's explore the result:
$ tree
.
├── data.xml
└── data.xml.dvc
Let's check the
data.xml.dvc file inside:
outs:
  - md5: 6137cde4893c59f76f005a8123d8e8e6
    path: data.xml
This is a standard
.dvc file with only one output (
outs field). The hash value (
md5 field) corresponds to a file path in the cache.
$ file .dvc/cache/61/37cde4893c59f76f005a8123d8e8e6
.dvc/cache/61/37cde4893c59f76f005a8123d8e8e6: ASCII text
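The mapping from hash to cache path seen in this example (the first two hex characters become a subdirectory, the rest the file name) can be written as a one-line helper. This is a sketch inferred from the paths shown in this document; `cache_path` is a hypothetical name, and other cache layouts may exist.

```python
import posixpath


def cache_path(md5, cache_dir=".dvc/cache"):
    """Split the hash into a 2-character directory and the remaining name."""
    return posixpath.join(cache_dir, md5[:2], md5[2:])


path = cache_path("6137cde4893c59f76f005a8123d8e8e6")
print(path)
```

Spreading entries across subdirectories like this keeps any single directory from accumulating a very large number of files.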
⚠️ Tracking compressed files (e.g. ZIP or TAR archives) is not recommended, as
dvc add supports tracking directories (details below).
Let's suppose your goal is to build an algorithm to identify cats and dogs in pictures. You may then have hundreds or thousands of pictures of these animals in a directory, and this is your training dataset:
$ tree pics --filelimit 3
pics
├── train
│   ├── cats [many image files]
│   └── dogs [many image files]
└── validation
    ├── cats [more image files]
    └── dogs [more image files]
Tracking a directory with DVC is as simple as tracking a single file:
$ dvc add pics
There are no
.dvc files generated within this directory structure to match
each image, but the image files are all cached. A single pics.dvc
file is generated for the top-level directory, and it contains:
outs:
  - md5: ce57450aa92ab8f2b957c24b0df73edc.dir
    path: pics
Refer to Adding entire directories for more info.
This allows us to treat the entire directory structure as a single data artifact. For example, you can pass it as a dependency to a stage definition:
$ dvc stage add -n train \
                -d train.py -d pics \
                -M metrics.json -o model.h5 \
                python train.py
To try this example, see the versioning tutorial.
If instead we use the --recursive (
-R) option, the output looks like this:
$ dvc add -R pics
In this case, a
.dvc file is generated for each file in the pics directory structure:
$ tree pics
pics
├── train
│   ├── cats
│   │   ├── img1.jpg
│   │   ├── img1.jpg.dvc
│   │   ├── img2.jpg
│   │   ├── img2.jpg.dvc
│   │   ├── ...
│   └── dogs
│       ├── img1.jpg
│       ├── img1.jpg.dvc
│   ...
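The next example shows how .dvcignore affects dvc add. Start with a directory containing two files:
$ mkdir dir
$ echo file_one > dir/file1
$ echo file_two > dir/file2
Ignore dir/file1 with a .dvcignore pattern, then add the directory:
$ echo dir/file1 > .dvcignore
$ dvc add dir
Changes to the ignored file are not detected by DVC:
$ echo file_one_changed > dir/file1
$ dvc status
Data and pipelines are up to date.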
Let's have a look at the cache directory:
$ tree .dvc/cache
.dvc/cache
├── 0a
│   └── ec3a687bd65c3e6a13e3cf20f3a6b2.dir
└── 52
    └── 4bcc8502a70ac49bf441db350eafc2
Only the hash values of the
dir/ directory (with
.dir file extension) and
file2 have been cached.
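The effect of the ignore pattern in this example can be mimicked with a simple filter. The Python sketch below uses `fnmatch` as a simplified stand-in for the real .dvcignore matching rules, which are likely richer; `filter_ignored` is a hypothetical helper.

```python
import fnmatch


def filter_ignored(paths, patterns):
    """Keep only the paths that match none of the ignore patterns."""
    return [
        p for p in paths
        if not any(fnmatch.fnmatchcase(p, pat) for pat in patterns)
    ]


# Same files and pattern as in the example above.
tracked = filter_ignored(["dir/file1", "dir/file2"], ["dir/file1"])
print(tracked)
```

Only dir/file2 survives the filter, which mirrors the single file entry found in the cache.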
When you want to add a large dataset that is outside of your project (e.g. online), you would normally need to download or copy it into the workspace first. But you may not have enough local storage space.
You can however set up an external cache that can handle the data. To avoid
ever making a local copy, target the outside data with
dvc add while specifying an --out (
-o) path inside of your project. This way the data will
be transferred to the cache directly, and then linked into your workspace.
Let's add a
data.xml file via HTTP, putting it in data.xml:
$ dvc add https://data.dvc.org/get-started/data.xml -o data.xml
...
$ ls
data.xml data.xml.dvc
The resulting .dvc file will save the provided local
path as if the data was
already in the workspace, while the
md5 hash points to the copy of the data
that has now been transferred to the cache. Let's check the contents of
data.xml.dvc in this case:
outs:
  - md5: a304afb96060aad90176268345e10355
    nfiles: 1
    path: data.xml
Sometimes there's not enough space in the local environment to import a large dataset, but you still want to track it in the project so it can be pulled later.
As long as you have set up remote storage that can handle the data, this can be
achieved with the
--to-remote flag. It creates a
.dvc file without
downloading anything, transferring a target directly to a DVC remote instead.
Let's add a
data.xml file via HTTP straight to remote:
$ dvc add https://data.dvc.org/get-started/data.xml --to-remote
...
$ ls
data.xml.dvc
The data can be downloaded later with dvc pull:
$ dvc pull data.xml.dvc
A       data.xml
1 file added