Dependency: A file (e.g. data, code), directory (e.g. datasets), or parameter used as input for a stage in a DVC pipeline. These are specified as paths in the deps field of dvc.yaml or .dvc files. Stages are invalidated (considered outdated) when any of their dependencies change. See dvc stage add.
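As a sketch (the stage, script, and file names are hypothetical), dependencies are listed under the deps field of a stage in dvc.yaml:

```yaml
stages:
  train:
    cmd: python train.py
    deps:
      - train.py        # code dependency
      - data/train.csv  # data dependency
    outs:
      - model.pkl
```

If either train.py or data/train.csv changes, DVC considers the train stage outdated and will re-run it on the next dvc repro.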
DVC Cache: The DVC cache is a hidden storage location (by default in .dvc/cache) for files and directories tracked by DVC, and for their different versions. Learn more about its structure here.
DVC Project: Initialized by running dvc init in the workspace (typically a Git repository). It will contain the .dvc/ directory, as well as .dvc files created with commands such as dvc add or dvc run. More info
Experiment: An attempt to reach desired/better/interesting results during data pipelining or ML model development. DVC is designed to help manage experiments, with built-in mechanisms like the run-cache and the dvc exp commands (available in DVC 2.0 and above).
External Dependency: A stage dependency (deps field in dvc.yaml, or in an import stage .dvc file) whose origin is an external source, for example HTTP, SSH, Amazon S3, or Google Cloud Storage remote locations, or even other DVC repositories. See External Dependencies.
File linking: A way to have a file appear in multiple different folders without occupying additional physical space on the storage disk. This is both fast and economical. See large dataset optimization and dvc config cache for more on file linking.
Import stage: A .dvc file created with dvc import or dvc import-url, which represents a file or directory from an external source. It has an external dependency (the data source), an implicit download command, and the imported data itself as an output.
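A minimal sketch of such a .dvc file, as dvc import-url might produce it (the URL and file names are hypothetical, and the exact fields vary by DVC version):

```yaml
# data.csv.dvc
frozen: true
deps:
  - path: https://example.com/data.csv   # external dependency: the data source
outs:
  - path: data.csv                       # the imported data, tracked by DVC
```

Running dvc update on this file re-checks the source and re-downloads the data if it changed.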
Output: A file or directory tracked by DVC, recorded in the outs section of a stage (in dvc.yaml or a .dvc file). Outputs are usually the result of stages. See dvc import, among others.
Parameters: Hyperparameters or other config values used by your code, loaded from a structured file (params.yaml by default). They can be tracked as granular dependencies for stages of DVC pipelines (defined in dvc.yaml). DVC can also compare them among machine learning experiments (useful for optimization).
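As a sketch (the parameter names and values are hypothetical), a stage tracks only the keys it actually uses:

```yaml
# params.yaml (the default parameters file)
learning_rate: 0.01
epochs: 10
```

```yaml
# dvc.yaml
stages:
  train:
    cmd: python train.py
    params:
      - learning_rate   # keys resolved against params.yaml
      - epochs
```

Because only these keys are dependencies, changing an unrelated value in params.yaml does not invalidate the train stage.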
Pipeline: DVC pipelines describe data processing workflows in a standard declarative YAML format (dvc.yaml). This guarantees DVC can reproduce them consistently. DVC also helps automate their execution and caches their results. See Defining Pipelines for more details.
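A two-stage sketch (scripts and paths are hypothetical) showing how stages chain into a pipeline:

```yaml
stages:
  prepare:
    cmd: python prepare.py data/raw.csv
    deps:
      - prepare.py
      - data/raw.csv
    outs:
      - data/prepared.csv
  train:
    cmd: python train.py data/prepared.csv
    deps:
      - train.py
      - data/prepared.csv   # output of prepare, so train runs after it
    outs:
      - model.pkl
```

DVC infers the execution order from these outs/deps relationships, so dvc repro runs prepare before train and skips stages whose inputs are unchanged.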
Run-cache: A log of stages that have been run in the project. It consists of dvc.lock file backups, identified by combinations of dependencies, commands, and outputs that correspond to each other. dvc repro and dvc run populate and reuse the run-cache. See Run-cache for more details.
Stage: A stage represents an individual command, script, or piece of source code that accomplishes a step in your project's workflow. For example, python train.py may generate a machine learning model. DVC stages include their data input(s) and resulting output(s), if any. Learn more.
Workspace: Directory containing all your DVC project files, e.g. raw data, source code, ML models. One project version at a time is visible in the workspace. More info