Track data files or directories with DVC, by creating a corresponding .dvc
file.
usage: dvc add [-h] [-q | -v] [-R] [--no-commit] [--external]
[--glob] [--file <filename>] [-o <path>] [--to-remote]
[-r <name>] [-j <number>] [--desc <text>]
targets [targets ...]
positional arguments:
targets Files or directories to add
The dvc add command is analogous to git add, in that it makes DVC aware of the target data, in order to start versioning it. It creates a .dvc file to track the added data.
This command can be used to track large files, models, dataset directories, etc. that are too big for Git to handle directly. This enables versioning them indirectly with Git.
The targets are the files or directories to add. They get stored in the cache by default (use the --no-commit option to avoid this, and dvc commit to finish the process when needed).
See also dvc.yaml and dvc run for more advanced ways to track and version intermediate and final results (like ML models).
After checking that each target hasn't been added before (or tracked with other DVC commands), a few actions are taken under the hood:
1. Calculate the file hash.
2. Move the file contents to the cache (by default in .dvc/cache) (or to remote storage if --to-remote is given), using the file hash to form the cached file path. (See Structure of cache directory for more details.)
3. Attempt to replace the file with a link to the cached data. A new link is not created if --to-remote is used.
4. Create a corresponding .dvc file to track the file, using its path and hash to identify the cached data (with --to-remote/-o, an external path is moved to the workspace). The .dvc file lists the DVC-tracked file as an output (outs field). Unless the --file option is used, the .dvc file name generated by default is <file>.dvc, where <file> is the file name of the first target.
5. Add the targets to .gitignore in order to prevent them from being committed to the Git repository (unless dvc init --no-scm was used when initializing the DVC project).
6. Print instructions showing the git commands for staging .dvc files (or they are staged automatically if core.autostage is set).
Summarizing, the result is that the target data is replaced by small .dvc files that can be easily tracked with Git.
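The steps above can be sketched in Python for a single file. This is only an illustration of the bookkeeping, not DVC's implementation: sketch_add and file_md5 are hypothetical helpers, the data is copied rather than moved and linked back, and the hash is a plain MD5 of the file bytes, which may differ from what DVC computes in some cases.

```python
import hashlib
import shutil
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sketch_add(target: Path, cache_dir: Path) -> dict:
    """Hypothetical helper: mimic the hash -> cache -> .dvc bookkeeping."""
    md5 = file_md5(target)
    # The first two hex characters name a subdirectory, the rest the file.
    cached = cache_dir / md5[:2] / md5[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    # Assumption: a plain copy stands in for "move to cache, then link back".
    shutil.copyfile(target, cached)
    return {
        "dvc_file": target.name + ".dvc",  # default name: <file>.dvc
        "cached_path": cached,
        "outs": [{"md5": md5, "path": target.name}],
    }
```

Calling sketch_add(Path("data.xml"), Path(".dvc/cache")) would return the default .dvc file name, the cached path, and an outs entry shaped like the one a .dvc file lists.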
It's possible to prevent files or directories from being added by DVC by entering the corresponding patterns in a .dvcignore file.
You can also undo dvc add to stop tracking files or directories.
By default, DVC tries to use reflinks (see File link types) to avoid copying any file contents and to optimize .dvc file operations for large files. DVC also supports other link types for use on file systems without reflink support, but they have to be specified manually. Refer to the cache.type config option in dvc config cache for more information.
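The idea of trying link types in a configured order can be sketched as follows. This is a minimal illustration, not DVC's code: link_from_cache is a hypothetical helper, and it covers only hardlink, symlink, and copy, because the Python standard library has no portable reflink call.

```python
import os
import shutil
from pathlib import Path


def link_from_cache(cached: Path, workspace: Path,
                    link_types=("hardlink", "symlink", "copy")) -> str:
    """Try each link type in order and return the name of the one that worked."""
    linkers = {
        "hardlink": os.link,       # same inode, no extra disk space
        "symlink": os.symlink,     # pointer to the cached file
        "copy": shutil.copyfile,   # full copy, always the last resort
    }
    for name in link_types:
        try:
            linkers[name](cached, workspace)
            return name
        except OSError:
            # This link type is unsupported here; fall through to the next one.
            continue
    raise OSError(f"none of {link_types} worked for {workspace}")
```

The order of link_types plays the role that a list value for cache.type plays in the configuration: earlier entries are preferred, later ones are fallbacks.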
A dvc add target can be either a file or a directory. In the latter case, a .dvc file is created for the top of the hierarchy (with default name <dir_name>.dvc).
Every file in the directory is cached normally (unless the --no-commit option is used), but DVC does not produce individual .dvc files for each one. Instead, the single .dvc file references a special JSON file in the cache (with .dir extension), which in turn points to the added files.
Refer to Structure of cache directory for more information on .dir cache entries.
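As a rough picture of what such a .dir entry holds, the following Python sketch builds a JSON-style manifest for a directory. The key names md5 and relpath, the sort order, and the way the manifest itself is hashed are assumptions made for illustration; dir_manifest and dir_hash are hypothetical helpers, and the result is not guaranteed to match the hash DVC produces.

```python
import hashlib
import json
from pathlib import Path


def dir_manifest(directory: Path) -> list:
    """List every file under `directory` with its hash and relative path."""
    entries = []
    for path in sorted(p for p in directory.rglob("*") if p.is_file()):
        entries.append({
            "md5": hashlib.md5(path.read_bytes()).hexdigest(),
            "relpath": path.relative_to(directory).as_posix(),
        })
    return entries


def dir_hash(entries: list) -> str:
    """Assumed scheme: hash the serialized manifest and append '.dir'."""
    payload = json.dumps(entries, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest() + ".dir"
```

The point of the design is that one small manifest stands in for the whole directory, so a single .dvc file can reference it while each file is still cached individually.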
Note that DVC commands that use tracked data support granular targeting of files and directories, even when contained in a parent directory added as a whole. Examples: dvc push, dvc pull, dvc get, dvc import, etc.
As a rarely needed alternative, the --recursive option causes every file in the hierarchy to be added individually. A corresponding .dvc file will be generated for each file in the same location. This may be helpful to save time adding several data files grouped in a structural directory, but it's undesirable for data directories with a large number of files.
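The difference between the two modes comes down to which .dvc files end up on disk. The sketch below only predicts file names from the rules described above; expected_dvc_files is a hypothetical helper and does not call DVC.

```python
from pathlib import Path


def expected_dvc_files(target: Path, recursive: bool = False) -> list:
    """Predict where .dvc files would be created for a target."""
    if target.is_dir() and recursive:
        # One .dvc file next to every file in the hierarchy.
        return sorted(
            p.with_name(p.name + ".dvc")
            for p in target.rglob("*")
            if p.is_file()
        )
    # A single .dvc file next to the file or at the top of the directory.
    return [target.with_name(target.name + ".dvc")]
```

For a directory with thousands of files, the recursive branch returns thousands of paths, which is why adding the directory as a whole is usually preferable.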
To avoid adding files inside a directory accidentally, you can add the corresponding patterns to .dvcignore.
dvc add supports symlinked files as targets. But if a target path is a directory symlink, or if it contains any intermediate directory symlinks, it cannot be added to DVC.
For example, given the following project structure:
.
├── .dvc
├── dir
│ └── file
├── link_to_dir -> dir
├── link_to_external_dir -> /path/to/dir
├── link_to_external_file -> /path/to/file
└── link_to_file -> dir/file
link_to_file and link_to_external_file are both valid symlink targets to dvc add. But link_to_dir, link_to_external_dir, and link_to_dir/file are not.
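The rule illustrated by this structure can be written as a small check. This is a sketch of the rule as stated here, not of DVC's own validation; is_valid_add_target is a hypothetical helper that expects a target path relative to the project root.

```python
from pathlib import Path


def is_valid_add_target(root: Path, relative_target: str) -> bool:
    """Apply the symlink rule: file symlinks are fine, directory symlinks are not."""
    target = root / relative_target
    if not target.exists():
        return False
    # No directory between the project root and the target may be a symlink.
    for parent in target.parents:
        if parent == root:
            break
        if parent.is_symlink():
            return False
    # The target itself may be a symlink only if it resolves to a file.
    if target.is_symlink() and target.is_dir():
        return False
    return True
```

With the structure shown above, the check accepts link_to_file and link_to_external_file and rejects link_to_dir, link_to_external_dir, and link_to_dir/file.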
-R, --recursive - determines the files to add by searching each target directory and its subdirectories for data files. If there are no directories among the targets, this option is ignored. For each file found, a new .dvc file is created using the process described in this command's description.
--no-commit - do not store targets in the cache (the .dvc file is still created). Use dvc commit to finish the operation (similar to git commit after git add).
--file <filename> - specify the name of the .dvc file it generates. This option works only if there is a single target. By default, the name of the generated .dvc file is <target>.dvc, where <target> is the file name of the given target. This option allows setting both the name and the path of the generated .dvc file.
--glob - allows adding files and directories that match the pattern specified in targets. Shell style wildcards supported: *, ?, [seq], [!seq], and **
--external - allow targets that are outside of the DVC repository. See Managing External Data.
⚠️ Note that this is an advanced feature for very specific situations and not recommended except if there's absolutely no other alternative. Additionally, this typically requires an external cache setup (see link above).
-o <path>, --out <path> - destination path to make a local target copy, or to transfer an external target into the cache (and link to workspace). Note that this can be combined with --to-remote to avoid storing the data locally, while still adding it to the project.
--to-remote - import an external target, but don't move it into the workspace, nor cache it. Transfer it directly to remote storage (the default one, unless -r is specified) instead. Use dvc pull to get the data locally.
-r <name>, --remote <name> - name of the remote storage to transfer external target to (can only be used with --to-remote).
--desc <text> - user description of the data (optional). This doesn't affect any DVC operations.
-h, --help - prints the usage/help message, and exit.
-q, --quiet - do not write anything to standard output. Exit with 0 if no problems arise, otherwise 1.
-v, --verbose - displays detailed tracing information.
Track a file with DVC:
$ dvc add data.xml
As indicated above, a .dvc file has been created for data.xml. Let's explore the result:
$ tree
.
├── data.xml
└── data.xml.dvc
Let's check the data.xml.dvc file inside:
outs:
  - md5: 6137cde4893c59f76f005a8123d8e8e6
    path: data.xml
This is a standard .dvc file with only one output (outs field). The hash value (md5 field) corresponds to a file path in the cache.
$ file .dvc/cache/61/37cde4893c59f76f005a8123d8e8e6
.dvc/cache/61/37cde4893c59f76f005a8123d8e8e6: ASCII text
⚠️ Tracking compressed files (e.g. ZIP or TAR archives) is not recommended, as dvc add supports tracking directories (details below).
Let's suppose your goal is to build an algorithm to identify cats and dogs in pictures. You may then have hundreds or thousands of pictures of these animals in a directory, and this is your training dataset:
$ tree pics --filelimit 3
pics
├── train
│ ├── cats [many image files]
│ └── dogs [many image files]
└── validation
    ├── cats [more image files]
    └── dogs [more image files]
Tracking a directory with DVC is as simple as tracking a single file:
$ dvc add pics
There are no .dvc files generated within this directory structure to match each image, but the image files are all cached. A single pics.dvc file is generated for the top-level directory, and it contains:
outs:
  - md5: ce57450aa92ab8f2b957c24b0df73edc.dir
    path: pics
Refer to Adding entire directories for more info.
This allows us to treat the entire directory structure as a single data artifact. For example, you can pass it as a dependency to a dvc run stage definition:
$ dvc run -n train \
-d train.py -d pics \
-M metrics.json -o model.h5 \
python train.py
To try this example, see the versioning tutorial.
If instead we use the --recursive (-R) option, the output looks like this:
$ dvc add -R pics
In this case, a .dvc file is generated for each file in the pics/ directory tree:
$ tree pics
pics
├── train
| ├── cats
| | ├── img1.jpg
| | ├── img1.jpg.dvc
| | ├── img2.jpg
| | ├── img2.jpg.dvc
| | ├── ...
| └── dogs
| ├── img1.jpg
| ├── img1.jpg.dvc
| ...
Note that no top-level .dvc file is generated, which is typically less convenient. For example, we cannot use the directory structure as one unit with dvc run or other commands.
Let's take an example to illustrate how .dvcignore interacts with dvc add.
$ mkdir dir
$ echo file_one > dir/file1
$ echo file_two > dir/file2
Now add file1 to .dvcignore and track the entire dir directory with dvc add.
$ echo dir/file1 > .dvcignore
$ dvc add dir
Let's now modify file1 (which is listed in .dvcignore) and run dvc status:
$ echo file_one_changed > dir/file1
$ dvc status
Data and pipelines are up to date.
dvc status ignores changes to files listed in .dvcignore.
Let's have a look at the cache directory:
$ tree .dvc/cache
.dvc/cache
├── 0a
│ └── ec3a687bd65c3e6a13e3cf20f3a6b2.dir
└── 52
    └── 4bcc8502a70ac49bf441db350eafc2
Only the hash values of the dir/ directory (with .dir file extension) and file2 have been cached.
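The effect of the ignore file on what gets cached can be approximated in a few lines. Note the assumption: real .dvcignore files follow gitignore-style pattern rules, which fnmatch only approximates, and files_to_cache is a hypothetical helper rather than part of DVC.

```python
import fnmatch
from pathlib import Path


def files_to_cache(root: Path, target: str, ignore_patterns: list) -> list:
    """Return the files under `target` that are not matched by any pattern."""
    kept = []
    for path in sorted(p for p in (root / target).rglob("*") if p.is_file()):
        # Patterns are compared against paths relative to the project root.
        rel = path.relative_to(root).as_posix()
        if any(fnmatch.fnmatch(rel, pattern) for pattern in ignore_patterns):
            continue
        kept.append(rel)
    return kept
```

For the example above, calling files_to_cache(Path("."), "dir", ["dir/file1"]) leaves only dir/file2, which matches the single file entry seen in the cache next to the .dir entry.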
When you have a large dataset in an external location, you may want to add it to the project without having to copy it into the workspace. Maybe your local disk doesn't have enough space, but you have set up an external cache that could handle it.
The --out option lets you add external paths in a way that they are cached first, and then linked to a given path inside the workspace.
Let's add a data.xml file via HTTP, for example, putting it in a local path in our project:
$ dvc add https://data.dvc.org/get-started/data.xml -o data.xml
...
$ ls
data.xml data.xml.dvc
The resulting .dvc file will save the provided local path as if the data was already in the workspace, while the md5 hash points to the copy of the data that has now been transferred to the cache. Let's check the contents of data.xml.dvc in this case:
outs:
  - md5: a304afb96060aad90176268345e10355
    nfiles: 1
    path: data.xml
When you have a large dataset in an external location, you may want to track it as if it were in your project, but without downloading it locally (for now). The --to-remote option lets you do so, while storing a copy remotely so it can be pulled later.
Let's set up a sample remote and add data.xml to our remote storage from the given external location:
$ mkdir /tmp/dvcstore
$ dvc remote add myremote /tmp/dvcstore
$ dvc add https://data.dvc.org/get-started/data.xml -o data.xml \
--to-remote -r myremote
...
The only difference is that the dataset is transferred straight to remote storage, so DVC does not control the external location you gave; instead, it manages the copy that is now in your remote storage. The operation still results in a .dvc file:
$ ls
data.xml.dvc
Whenever anyone wants to actually download the added data (for example from a system that can handle it), they can use dvc pull as usual:
$ dvc pull data.xml.dvc -r myremote
A data.xml
1 file added and 1 file fetched
For a similar operation that actually keeps a connection to the data source, please see an import-url example.