Glossary

Dependency: A file or directory (possibly tracked by DVC) recorded in the deps section of a stage (in dvc.yaml) or of a .dvc file. See dvc run. Stages are invalidated (considered outdated) when any of their dependencies change.
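
The invalidation rule can be illustrated with a minimal sketch: record a content hash for each dependency, then compare it against the current files. This is a conceptual illustration only, not DVC's implementation; the helper names (file_digest, changed_deps), the file names, and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return a hex digest of the file's contents (SHA-256 is an illustrative choice)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def changed_deps(recorded: dict[str, str], root: Path) -> list[str]:
    """List dependencies that are missing or whose content differs from the record."""
    return [
        name
        for name, old_digest in recorded.items()
        if not (root / name).exists() or file_digest(root / name) != old_digest
    ]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data.csv").write_text("a,b\n1,2\n")
    (root / "train.py").write_text("print('train')\n")

    # Hypothetical record of dependency hashes, taken when the stage last ran.
    recorded = {name: file_digest(root / name) for name in ("data.csv", "train.py")}
    unchanged = changed_deps(recorded, root)  # nothing was edited yet

    # Editing a dependency makes the stage outdated.
    (root / "data.csv").write_text("a,b\n1,3\n")
    outdated = changed_deps(recorded, root)
```

A stage would be considered outdated whenever the returned list is non-empty.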

DVC Cache: Hidden storage (by default in .dvc/cache) for the files and directories tracked by DVC, and their different versions. Its structure is described in the DVC documentation.

DVC File: dvc.yaml, dvc.lock, or .dvc files. DVC commands create these in the workspace to codify pipelines and/or to track data for versioning. See also dvc repro, dvc add.

DVC Project: Initialized by running dvc init in the workspace (typically a Git repository). It contains the .dvc/ directory, as well as the dvc.yaml and .dvc files created with commands such as dvc add or dvc run.

Experiment: An attempt to reach desired/better/interesting results during data pipelining or ML model development. DVC is designed to help manage experiments, having built-in mechanisms like the run-cache and the dvc exp commands (available on DVC 2.0 and above).

External Dependency: A stage dependency (deps field in dvc.yaml or in an import stage .dvc file) with origin in an external source, for example HTTP, SSH, Amazon S3, Google Cloud Storage remote locations, or even other DVC repositories. See External Dependencies.

File Linking: A way to have a file appear in multiple different folders without occupying more physical space on the storage disk. This is both fast and economical. See large dataset optimization and dvc config cache for more on file linking.
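
As a rough illustration of the idea, the sketch below uses a hard link, which is only one kind of file link, to make the same data appear under two directories. The directory and file names are hypothetical, and the example assumes a file system that supports hard links.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    cache_dir = root / "cache"        # stands in for a cache directory
    work_dir = root / "workspace"     # stands in for the workspace
    cache_dir.mkdir()
    work_dir.mkdir()

    cached = cache_dir / "dataset.bin"
    cached.write_bytes(b"x" * 1024)

    # A hard link is a second name for the same data on disk; nothing is copied.
    linked = work_dir / "dataset.bin"
    os.link(cached, linked)

    same_file = os.path.samefile(cached, linked)  # both paths refer to one file
    link_count = cached.stat().st_nlink           # number of names for that file
    same_content = linked.read_bytes() == cached.read_bytes()
```

Because both paths refer to the same underlying file, creating the link is fast and takes no additional space for the data itself.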

Import Stage: A .dvc file created with dvc import or dvc import-url, which represents a file or directory from an external source. It has an external dependency (the data source), an implicit download command, and the imported data itself as output.

Output: A file or directory tracked by DVC, recorded in the outs section of a stage (in dvc.yaml) or .dvc file. Outputs are usually the result of stages. See dvc add, dvc run, dvc import, among others.

Parameter Dependency: Pipeline stages (defined in dvc.yaml) can depend on specific values inside an arbitrary YAML, JSON, TOML, or Python file (params.yaml by default). Stages are invalidated (considered outdated) when any of their parameter values change. See dvc params.
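
The point of a parameter dependency is that only the specific values a stage depends on matter, not the whole file. The sketch below illustrates this with two versions of a JSON parameters document; the helper names (get_param, changed_params), the dotted-key lookup, and the parameter names are assumptions for the example, not DVC's code.

```python
import json


def get_param(params: dict, dotted_key: str):
    """Look up a nested value using a dotted key such as 'train.lr'."""
    value = params
    for part in dotted_key.split("."):
        value = value[part]
    return value


def changed_params(tracked: list[str], old: dict, new: dict) -> list[str]:
    """Return the tracked keys whose values differ between two parameter sets."""
    return [key for key in tracked if get_param(old, key) != get_param(new, key)]


# Two versions of a hypothetical parameters file.
old_params = json.loads('{"train": {"lr": 0.01, "epochs": 10}, "seed": 42}')
new_params = json.loads('{"train": {"lr": 0.01, "epochs": 20}, "seed": 7}')

# The stage depends only on these two values, not on "seed".
tracked = ["train.lr", "train.epochs"]
changed = changed_params(tracked, old_params, new_params)
```

Here "seed" also changed, but since the stage does not depend on it, only "train.epochs" would cause the stage to be considered outdated.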

Pipeline (DAG): A set of inter-dependent stages. This is also called a dependency graph.
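
A dependency graph of this kind determines a valid execution order: each stage can only run after the stages it depends on. A minimal sketch using Python's standard graphlib module (Python 3.9+), with hypothetical stage names:

```python
from graphlib import TopologicalSorter

# Map each (hypothetical) stage to the set of stages it depends on.
pipeline = {
    "prepare": set(),
    "featurize": {"prepare"},
    "train": {"featurize"},
    "evaluate": {"train", "featurize"},
}

# static_order() yields every stage after all of its dependencies.
order = list(TopologicalSorter(pipeline).static_order())
```

TopologicalSorter raises graphlib.CycleError if the stages depend on each other in a loop, which is why a pipeline has to be acyclic.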

Run-cache: A log of stages that have been run in the project. It consists of dvc.lock file backups, identified as combinations of dependencies, commands, and outputs that correspond to each other. dvc repro and dvc run populate and reuse the run-cache. See Run-cache for more details.
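
Conceptually, this behaves like memoization: if the same command has already been run against the same dependencies, the recorded outputs can be reused instead of running the stage again. The sketch below is an in-memory illustration under that assumption; run_key, run_stage, and the dictionary layout are hypothetical and do not reflect DVC's on-disk format.

```python
import hashlib
import json

# Hypothetical in-memory run-cache: key -> recorded outputs of a past run.
run_cache: dict[str, dict[str, str]] = {}


def run_key(cmd: str, dep_hashes: dict[str, str]) -> str:
    """Derive a stable key from the command and the hashes of its dependencies."""
    payload = json.dumps({"cmd": cmd, "deps": dep_hashes}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def run_stage(cmd: str, dep_hashes: dict[str, str], execute):
    """Return (outputs, cache_hit); call execute() only on a cache miss."""
    key = run_key(cmd, dep_hashes)
    if key in run_cache:
        return run_cache[key], True
    outputs = execute()
    run_cache[key] = outputs
    return outputs, False


calls = []


def fake_train():
    """Stand-in for real work; records that it was actually executed."""
    calls.append("ran")
    return {"model.pkl": "hash-of-model"}


deps = {"data.csv": "abc123", "train.py": "def456"}
first_outs, first_hit = run_stage("python train.py", deps, fake_train)
second_outs, second_hit = run_stage("python train.py", deps, fake_train)
changed_outs, changed_hit = run_stage(
    "python train.py", {**deps, "data.csv": "zzz999"}, fake_train
)
```

The second call is served from the cache, while the third runs again because one dependency hash differs.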

Stage: An individual data process, including its inputs and resulting outputs. Stages can be combined to build detailed machine learning pipelines.

Workspace: The directory containing all your DVC project files, e.g. raw data, source code, ML models. Only one project version at a time is visible in the workspace.
