In order to track the data files and directories added with
dvc add or
dvc run, DVC moves all these files to the cache. A
project's cache is the hidden storage (by default located in
.dvc/cache) for files that are tracked by DVC, and their different versions.
dvc cache and
DVC Files and Directories for more
However, the versions of the tracked files that
match the current code are also
needed in the workspace, so a subset of the cached files can be
kept in the working directory (using
dvc checkout). Does this mean that some
files will be duplicated between the workspace and the cache? That would not
be efficient! Especially with large files (several Gigabytes or larger).
In order to have the files present in both directories without duplication, DVC can automatically create file links to the cached data in the workspace. In fact, by default it will attempt to use reflinks* if supported by the file system.
File links are lightweight entries in the file system that don't hold the file contents, but work as shortcuts to where the original data is actually stored. They're more common in file systems used with UNIX-like operating systems, and come in different kinds that differ in how they connect file names to inodes in the system.
Inodes are metadata file records to locate and store permissions to the actual file contents. See Linking files in this doc for technical details (Linux). Use
ls -ito list inodes on Linux.
There are pros and cons to the 3 supported link types: Hard links, Soft or
Symbolic links, and Reflinks in more recent systems. While reflinks bring all
the benefits and none of the worries, they're not commonly supported in most
platforms yet. Hard/soft links optimize speed and space in the file
system, but may break your workflow since updating hard/sym-linked files tracked
by DVC in the workspace causes cache corruption. To
protect against that, DVC makes hardlinks and symlinks links read-only, which
requires the user to use
dvc unprotect before modifying them.
Finally, a 4th "linking" alternative is to actually copy files from/to the cache, which is safe but inefficient – especially for large files (several GBs or more).
File link type benefits summary:
Each file linking method is further detailed below, in function of their efficiency:
reflink: Copy-on-write* links or "reflinks" are the best possible link
type, when available. They're is as efficient as hard/symlinks, but don't
carry a risk of cache corruption since the file system takes care of copying
the file if you try to edit it in place, thus keeping the linked cache file
Unfortunately reflinks are currently supported on a limited number of file systems only (Linux: Btrfs, XFS, OCFS2; MacOS: APFS), but in the future they will be supported by the majority of file systems in use.
hardlink: Hard links are the most efficient way to link your data to
cache if both your repo and your cache directory are located on the same
partition or storage device.
Please note that hardlinked data files should never be edited in place, but instead deleted and then replaced with a new file, otherwise it might cause cache corruption – and automatic deletion of cached files by DVC.
symlink: Symbolic (a.k.a. "soft") links are the most efficient way to
link your data to cache if your repo and your cache directory are located on
different file systems/drives (i.e. repo is located on SSD for performance,
but cache dir is located on HDD for bigger storage).
Please note that symlinked data files should never be edited in place, but instead deleted and then replaced with a new file, otherwise it might cause cache corruption – and automatic deletion of cached files by DVC.
copy: An inefficient "linking" strategy, yet supported on all file systems. Using
copymeans there will be no file links, but that the tracked files will be duplicated as copies existing in both the cache and workspace. Suitable for scenarios with relatively small data files, where copying them is not a storage performance concern.
hardlinktypes by default to protect user from accidental cache corruption. Refer to the Update a Tracked File guide to learn more.
By default, DVC tries to use reflinks for the cache if available on your system, however this is not the most common case at this time, so it falls back to the copying strategy. If you wish to enable hard or soft links, you can configure DVC like this:
$ dvc config cache.type hardlink,symlink
dvc config cachefor more details.
Note that with this
cache.type, your workspace files will be in read-only mode
in order to protect the cache from corruption. Please refer to
Update a Tracked File on how to
manage tracked files under these cache configurations.
* copy-on-write links or "reflinks" are a relatively new way to link files in UNIX-style file systems. Unlike hardlinks or symlinks, they support transparent copy on write. This means that editing a reflinked file is always safe as all the other links to the file will reflect the changes.