Large Dataset Optimization
In order to track the data files and directories added with
dvc repro, etc. DVC moves all these files to the project's cache.
However, the versions of the tracked files that
match the current code are also needed in
the workspace, so a subset of the cached files can be kept in the
working directory (using
dvc checkout). Does this mean that some files will be
duplicated between the workspace and the cache? That would not be efficient!
Especially with large files (several Gigabytes or larger).
In order to have the files present in both directories without duplication, DVC can automatically create file links to the cached data in the workspace. In fact, by default it will attempt to use reflinks* if supported by the file system.
File link types for the DVC cache
File links are lightweight entries in the file system that don't hold the file contents, but work as shortcuts to where the original data is actually stored. They're more common in file systems used with UNIX-like operating systems, and come in different kinds that differ in how they connect file names to inodes in the system.
Inodes are metadata file records to locate and store permissions to the actual file contents. See Linking files in this doc for technical details (Linux). Use
ls -ito list inodes on Linux.
There are pros and cons to the 3 supported link types: Hard links, Soft or
Symbolic links, and Reflinks in more recent systems. While reflinks bring all
the benefits and none of the worries, they're not commonly supported in most
platforms yet. Hard/soft links optimize speed and space in the file
system, but may break your workflow since updating hard/sym-linked files tracked
by DVC in the workspace would cause cache corruption.
To protect against this, DVC protects hardlinks and symlinks by making them
read-only, requiring using
dvc unprotect to be able to modify them safely.
Finally, a 4th "linking" alternative is to actually copy files from/to the cache, which is safe but inefficient – especially for large files (several GBs or more).
Some versions of Windows (e.g. Windows Server 2012+ and Windows 10 Enterprise) support hard or soft links on the NTFS and ReFS file systems.
File link type benefits summary:
Each file linking method is further detailed below, in function of their efficiency:
reflink: Copy-on-write* links or "reflinks" are the best possible link type, when available. They're is as efficient as hard/symlinks, but don't carry any risk of cache corruption since the file system takes care of copying the file if you try to edit it in place, thus keeping the linked cache file intact.
Unfortunately reflinks are currently supported on a limited number of file systems only (Linux: Btrfs, XFS, OCFS2; macOS: APFS), but in the future they will be supported by the majority of file systems in use.
hardlink: Hard links are the most efficient way to link your data to cache if both your repo and your cache directory are located on the same partition or storage device. The number of hardlinks to one file can be limited by the file system (NTFS: 1024, EXT4: 65,000). DVC will fall back to the next available linking strategy when the number of links exceeds this limit which can happen for repos with very many identical files.
Note that hard-linked data files can't be edited in place, so DVC avoids these by default. It's however possible to unlink or delete them, and then replace them with a new file.
symlink: Symbolic (a.k.a. "soft") links are the most efficient way to link your data to cache if your repo and your cache directory are located on different file systems/drives (i.e. repo is located on SSD for performance, but cache dir is located on HDD for bigger storage).
Note that symlinked data files can't be edited in place, so DVC avoids these by default. It's however possible to unlink or delete them, and then replace them with a new file.
copy: An inefficient "linking" strategy, yet supported on all file systems. Using
copymeans there will be no file links, but that the tracked files will be duplicated as copies existing in both the cache and workspace. Suitable for scenarios with relatively small data files, where copying them is not a storage performance concern.
Configuring DVC cache file link type
By default, DVC tries to use reflinks for the cache if available on your system, however this is not the most common case at this time, so it falls back to the copying strategy. If you wish to enable hard or soft links, you can configure DVC like this:
$ dvc config cache.type hardlink,symlink
dvc config cachefor more details.
Note that with this
cache.type, your workspace files will be in read-only mode
in order to protect the cache from corruption. Refer to
Update a Tracked File on how to
manage tracked files under these cache configurations.
To make sure that the data files in the workspace are consistent with the
cache.type config value, you may use
dvc checkout --relink. See
dvc checkout for more information.
* (copy-on-write) links or reflinks are a type of file linking available in modern file systems. Unlike hard links or symlinks, editing reflinks is always safe, as the original cached data will remain unchanged.