add

Track data files or directories with DVC, by creating a corresponding .dvc file.

Synopsis

usage: dvc add [-h] [-q | -v] [-R] [--no-commit] [--external]
               [--glob] [--file <filename>] [-o <path>] [--to-remote]
               [-r <name>] [-j <number>] [--desc <text>]
               targets [targets ...]

positional arguments:
  targets               Files or directories to add

Description

The dvc add command is analogous to git add, in that it makes DVC aware of the target data, in order to start versioning it. It creates a .dvc file to track the added data.

This command can be used to track large files, models, dataset directories, etc. that are too big for Git to handle directly. This enables versioning them indirectly with Git.

The targets are the files or directories to add. They get stored in the cache by default (use the --no-commit option to avoid this, and dvc commit to finish the process when needed).

See also dvc.yaml and dvc run for more advanced ways to track and version intermediate and final results (like ML models).

After checking that each target hasn't been added before (or tracked with other DVC commands), a few actions are taken under the hood:

  1. Calculate the file hash.
  2. Move the file contents to the cache (by default in .dvc/cache) (or to remote storage if --to-remote is given), using the file hash to form the cached file path. (See Structure of cache directory for more details.)
  3. Attempt to replace the file with a link to the cached data (more details on file linking further down). Skipped if --to-remote is used.
  4. Create a corresponding .dvc file to track the file, using its path and hash to identify the cached data (with --to-remote/-o, an external path is moved to the workspace). The .dvc file lists the DVC-tracked file as an output (outs field). Unless the --file option is used, the .dvc file name generated by default is <file>.dvc, where <file> is the file name of the first target.
  5. Add the targets to .gitignore in order to prevent them from being committed to the Git repository (unless dvc init --no-scm was used when initializing the DVC project).
  6. Instructions are printed showing git commands for staging .dvc files (or they are staged automatically if core.autostage is set).

Summarizing, the result is that the target data is replaced by small .dvc files that can be easily tracked with Git.

It's possible to prevent files or directories from being added by DVC by entering the corresponding patterns in a .dvcignore file.

You can also undo dvc add to stop tracking files or directories.

By default, DVC tries to use reflinks (see File link types) to avoid copying any file contents and to optimize .dvc file operations for large files. DVC also supports other link types for use on file systems without reflink support, but they have to be specified manually. Refer to the cache.type config option in dvc config cache for more information.

Adding entire directories

A dvc add target can be either a file or a directory. In the latter case, a .dvc file is created for the top of the hierarchy (with default name <dir_name>.dvc).

Every file in the dir is cached normally (unless the --no-commit option is used), but DVC does not produce individual .dvc files for each one. Instead, the single .dvc file references a special JSON file in the cache (with .dir extension), that in turn points to the added files.

Refer to Structure of cache directory for more info on .dir cache entries.

Note that DVC commands that use tracked data support granular targeting of files and directories, even when contained in a parent directory added as a whole. Examples: dvc push, dvc pull, dvc get, dvc import, etc.

As a rarely needed alternative, the --recursive option causes every file in the hierarchy to be added individually. A corresponding .dvc file will be generated for each file in the same location. This may be helpful to save time adding several data files grouped in a structural directory, but it's undesirable for data directories with a large number of files.

To avoid adding files inside a directory accidentally, you can add the corresponding patterns to .dvcignore.

dvc add supports symlinked files as targets. But if a target path is a directory symlink, or if it contains any intermediate directory symlinks, it cannot be added to DVC.

For example, given the following project structure:

.
├── .dvc
├── dir
│   └── file
├── link_to_dir -> dir
├── link_to_external_dir -> /path/to/dir
├── link_to_external_file -> /path/to/file
└── link_to_file -> dir/file

link_to_file and link_to_external_file are both valid symlink targets to dvc add. But link_to_dir, link_to_external_dir, and link_to_dir/file are not.
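
The rule can be sketched as a small shell check. The helper name is_valid_target is hypothetical and the logic is only an approximation of the rule stated above, not DVC's actual validation. It recreates part of the example structure in a scratch directory.

```shell
# Illustration only: recreate part of the structure above in a scratch project.
project=$(mktemp -d)
mkdir "$project/dir"
echo data > "$project/dir/file"
ln -s dir "$project/link_to_dir"
ln -s dir/file "$project/link_to_file"

# Hypothetical helper: succeeds when a path (relative to the project root)
# is neither a directory symlink nor located under a symlinked directory.
is_valid_target() {
  target=$1
  # The target itself must not be a symlink to a directory.
  if [ -L "$target" ] && [ -d "$target" ]; then
    return 1
  fi
  # No intermediate directory on the way to the target may be a symlink.
  parent=$(dirname "$target")
  while [ "$parent" != "." ] && [ "$parent" != "/" ]; do
    if [ -L "$parent" ]; then
      return 1
    fi
    parent=$(dirname "$parent")
  done
  return 0
}

cd "$project"
for t in dir/file link_to_file link_to_dir link_to_dir/file; do
  if is_valid_target "$t"; then echo "$t: ok"; else echo "$t: rejected"; fi
done
```

With this structure, the loop reports dir/file and link_to_file as ok, and link_to_dir and link_to_dir/file as rejected.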

Options

  • -R, --recursive - determines the files to add by searching each target directory and its subdirectories for data files. If there are no directories among the targets, this option is ignored. For each file found, a new .dvc file is created using the process described in this command's description.
  • --no-commit - do not store targets in the cache (the .dvc file is still created). Use dvc commit to finish the operation (similar to git commit after git add).
  • --file <filename> - specify the name of the generated .dvc file. This option works only if there is a single target. By default, the name of the generated .dvc file is <target>.dvc, where <target> is the file name of the given target. This option lets you set both the name and the path of the generated .dvc file.
  • --glob - allows adding files and directories that match the pattern specified in targets. Shell-style wildcards are supported: *, ?, [seq], [!seq], and **.
  • --external - allow targets that are outside of the DVC repository. See Managing External Data.

    ⚠️ Note that this is an advanced feature for very specific situations and not recommended except if there's absolutely no other alternative. Additionally, this typically requires an external cache setup (see link above).

  • -o <path>, --out <path> - destination path to make a local target copy, or to transfer an external target into the cache (and link to workspace). Note that this can be combined with --to-remote to avoid storing the data locally, while still adding it to the project.
  • --to-remote - import an external target, but don't move it into the workspace, nor cache it. Transfer it directly to remote storage (the default one, unless -r is specified) instead. Use dvc pull to get the data locally.
  • -r <name>, --remote <name> - name of the remote storage to transfer external target to (can only be used with --to-remote).
  • --desc <text> - user description of the data (optional). This doesn't affect any DVC operations.
  • -h, --help - prints the usage/help message, and exits.
  • -q, --quiet - do not write anything to standard output. Exit with 0 if no problems arise, otherwise 1.
  • -v, --verbose - displays detailed tracing information.

Example: Single file

Track a file with DVC:

$ dvc add data.xml

As indicated above, a .dvc file has been created for data.xml. Let's explore the result:

$ tree
.
├── data.xml
└── data.xml.dvc

Let's look inside the data.xml.dvc file:

outs:
  - md5: 6137cde4893c59f76f005a8123d8e8e6
    path: data.xml

This is a standard .dvc file with only one output (outs field). The hash value (md5 field) corresponds to a file path in the cache.

$ file .dvc/cache/61/37cde4893c59f76f005a8123d8e8e6
.dvc/cache/61/37cde4893c59f76f005a8123d8e8e6: ASCII text
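
The mapping from the md5 field to the cache path can be reproduced by hand. This is a sketch only: it recreates the .dvc file shown above in a scratch directory and uses a naive text match rather than a real YAML parser, which is an assumption that only holds for a simple single-output file like this one.

```shell
# Illustration only: recreate the .dvc file shown above in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/data.xml.dvc" <<'EOF'
outs:
  - md5: 6137cde4893c59f76f005a8123d8e8e6
    path: data.xml
EOF

# Extract the value of the md5 field (naive match, not a YAML parser).
hash=$(sed -n 's/^ *- *md5: *//p' "$workdir/data.xml.dvc")

# First two characters: cache subdirectory. Remaining characters: file name.
cache_path=".dvc/cache/$(printf '%s' "$hash" | cut -c1-2)/$(printf '%s' "$hash" | cut -c3-)"

echo "$cache_path"   # .dvc/cache/61/37cde4893c59f76f005a8123d8e8e6
```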

⚠️ Tracking compressed files (e.g. ZIP or TAR archives) is not recommended, as dvc add supports tracking directories (details below).

Example: Directory

Let's suppose your goal is to build an algorithm to identify cats and dogs in pictures. You may then have hundreds or thousands of pictures of these animals in a directory, and this is your training dataset:

$ tree pics --filelimit 3
pics
├── train
│   ├── cats [many image files]
│   └── dogs [many image files]
└── validation
    ├── cats [more image files]
    └── dogs [more image files]

Tracking a directory with DVC is as simple as with a single file:

$ dvc add pics

There are no .dvc files generated within this directory structure to match each image, but the image files are all cached. A single pics.dvc file is generated for the top-level directory, and it contains:

outs:
  - md5: ce57450aa92ab8f2b957c24b0df73edc.dir
    path: pics

Refer to Adding entire directories for more info.

This allows us to treat the entire directory structure as a single data artifact. For example, you can pass it as a dependency to a dvc run stage definition:

$ dvc run -n train \
          -d train.py -d pics \
          -M metrics.json -o model.h5 \
          python train.py

To try this example, see the versioning tutorial.

If instead we use the --recursive (-R) option, the output looks like this:

$ dvc add -R pics

In this case, a .dvc file is generated for each file in the pics/ directory tree:

$ tree pics
pics
├── train
│   ├── cats
│   │   ├── img1.jpg
│   │   ├── img1.jpg.dvc
│   │   ├── img2.jpg
│   │   ├── img2.jpg.dvc
│   │   ├── ...
│   └── dogs
│       ├── img1.jpg
│       ├── img1.jpg.dvc
│       ...

Note that no top-level .dvc file is generated, which is typically less convenient. For example, we cannot use the directory structure as one unit with dvc run or other commands.
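
To get a feel for how many .dvc files the recursive mode would produce, the following sketch lists the names implied by the naming rule above (one <file>.dvc next to each data file) without running DVC at all. The directory contents are empty placeholder files created just for the example.

```shell
# Illustration only: placeholder directory tree in a scratch location.
workdir=$(mktemp -d)
mkdir -p "$workdir/pics/train/cats" "$workdir/pics/train/dogs"
touch "$workdir/pics/train/cats/img1.jpg" \
      "$workdir/pics/train/cats/img2.jpg" \
      "$workdir/pics/train/dogs/img1.jpg"

# Default mode: a single .dvc file named after the directory.
default_dvc="pics.dvc"

# Recursive mode: one .dvc file per data file, in the same location.
recursive_dvc=$(cd "$workdir" && find pics -type f | sort | sed 's/$/.dvc/')

echo "$default_dvc"
echo "$recursive_dvc"
```

For three placeholder images this prints one name for the default mode and three for the recursive mode, and the gap grows with every file in the dataset.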

Example: .dvcignore

Let's take an example to illustrate how .dvcignore interacts with dvc add.

$ mkdir dir
$ echo file_one > dir/file1
$ echo file_two > dir/file2

Now add file1 to .dvcignore and track the entire dir directory with dvc add.

$ echo dir/file1 > .dvcignore
$ dvc add dir

Let's now modify file1 (which is listed in .dvcignore) and run dvc status:

$ echo file_one_changed > dir/file1
$ dvc status
Data and pipelines are up to date.

dvc status ignores changes to files listed in .dvcignore.

Let's have a look at the cache directory:

$ tree .dvc/cache
.dvc/cache
├── 0a
│   └── ec3a687bd65c3e6a13e3cf20f3a6b2.dir
└── 52
    └── 4bcc8502a70ac49bf441db350eafc2

Only the hash values of the dir/ directory (with .dir file extension) and file2 have been cached.
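
The effect can be approximated by filtering the directory listing through .dvcignore. This sketch treats every line of .dvcignore as a literal path, which is a deliberate simplification (real entries are patterns), and it does not call DVC.

```shell
# Illustration only: recreate the example layout in a scratch directory.
workdir=$(mktemp -d)
mkdir "$workdir/dir"
echo file_one > "$workdir/dir/file1"
echo file_two > "$workdir/dir/file2"
echo dir/file1 > "$workdir/.dvcignore"

# Keep only the files whose path is not listed in .dvcignore.
# -F: literal strings, -x: whole-line match, -v: invert, -f: read from file.
tracked=$(cd "$workdir" && find dir -type f | sort | grep -v -x -F -f .dvcignore)

echo "$tracked"   # dir/file2
```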

Example: Transfer to the cache

When you have a large dataset in an external location, you may want to add it to the project without having to copy it into the workspace. Maybe your local disk doesn't have enough space, but you have set up an external cache that could handle it.

The --out option lets you add external paths in a way that they are cached first, and then linked to a given path inside the workspace.

Let's add a data.xml file via HTTP, for example, putting it in a local path in our project:

$ dvc add https://data.dvc.org/get-started/data.xml -o data.xml
...
$ ls
data.xml data.xml.dvc

The resulting .dvc file will save the provided local path as if the data was already in the workspace, while the md5 hash points to the copy of the data that has now been transferred to the cache. Let's check the contents of data.xml.dvc in this case:

outs:
  - md5: a304afb96060aad90176268345e10355
    nfiles: 1
    path: data.xml

Example: Transfer to remote storage

When you have a large dataset in an external location, you may want to track it as if it was in your project, but without downloading it locally (for now). The --to-remote option lets you do so, while storing a copy remotely so it can be pulled later.

Let's set up a sample remote and add data.xml to it from the given external location:

$ mkdir /tmp/dvcstore
$ dvc remote add myremote /tmp/dvcstore

$ dvc add https://data.dvc.org/get-started/data.xml -o data.xml \
                 --to-remote -r myremote
...

The only difference is that the dataset is transferred straight to the remote, so DVC won't control the external location you gave, but will instead manage the copy now stored in your remote storage. The operation still results in a .dvc file:

$ ls
data.xml.dvc

Whenever anyone wants to actually download the added data (for example from a system that can handle it), they can use dvc pull as usual:

$ dvc pull data.xml.dvc -r myremote

A       data.xml
1 file added and 1 file fetched

For a similar operation that actually keeps a connection to the data source, please see an import-url example.