Pipelines represent data workflows that you want to reproduce reliably, so the results are consistent. Their stages can be defined in dvc.yaml files, for example:
```yaml
stages:
  prepare: ... # stage 1 definition
  train: ... # stage 2 definition
  evaluate: ... # stage 3 definition
```
We call this file-based definition codification (YAML format in our case). It has the added benefit of letting you develop pipelines with standard Git workflows (and GitOps).
Stages usually take some data and run some code, producing an output (e.g. an ML model). The pipeline is formed by making stages interdependent: the output of one stage becomes the input of another, and so on. Technically, this forms a dependency graph, specifically a directed acyclic graph (DAG).
Note that while each pipeline is a graph, this doesn't mean a single
dvc.yaml file. DVC checks the entire project tree and validates all such
files to find stages, rebuilding all the pipelines that these may define.
See the full specification of stage entries.
Each stage wraps around an executable shell command and specifies any file-based dependencies as well as outputs. Let's look at a sample stage that depends on the script it runs as well as on a raw data input (ideally tracked by DVC already):
```yaml
stages:
  prepare:
    cmd: source src/cleanup.sh
    deps:
      - src/cleanup.sh
      - data/raw
    outs:
      - data/clean.csv
```
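For illustration only, src/cleanup.sh could be a script like the following. Its contents are hypothetical; DVC only requires that the command reads the declared dependencies and writes the declared outputs:

```bash
#!/usr/bin/env bash
# Hypothetical cleanup step: merge raw CSV parts and drop blank
# lines, producing the output file the stage declares.
cat data/raw/*.csv | sed '/^$/d' > data/clean.csv
```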
We use GNU/Linux in these examples, but Windows or other shells can be used too.
While you can write dvc.yaml files manually (recommended), you can also create
stages with dvc stage add — a limited command-line interface to set up
pipelines. Let's add another stage this way and look at the resulting
dvc.yaml:
```bash
$ dvc stage add --name train \
                --deps src/model.py \
                --deps data/clean.csv \
                --outs data/predict.dat \
                python src/model.py data/clean.csv
```
```yaml
stages:
  prepare:
    ...
    outs:
      - data/clean.csv
  train:
    cmd: python src/model.py data/clean.csv
    deps:
      - src/model.py
      - data/clean.csv
    outs:
      - data/predict.dat
```
Notice that the new train stage depends on the output of the prepare stage
(data/clean.csv), forming the pipeline (DAG).
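You can verify the resulting graph with dvc dag, which renders the stages and their connections as ASCII art. The output below is illustrative; the exact rendering varies by DVC version:

```bash
$ dvc dag
+---------+
| prepare |
+---------+
      *
      *
      *
+-------+
| train |
+-------+
```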
Stage execution order is determined entirely by the DAG, not by the
order in which stages appear in dvc.yaml.
There's more than one type of stage dependency. A simple dependency is a file or
directory used as input by the stage command. When its contents have changed,
DVC "invalidates" the stage: it knows that it needs to run again (see
dvc status). This in turn may cause a chain reaction in which subsequent
stages of the pipeline are also reproduced.
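For example, modifying the raw data would invalidate the prepare stage, and with it anything downstream. A hypothetical session (data/raw/part1.csv is an invented file inside the tracked directory, and the exact dvc status output varies by DVC version):

```bash
$ echo "a,new,row" >> data/raw/part1.csv   # change a dependency's contents
$ dvc status
prepare:
    changed deps:
        modified:           data/raw
```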
DVC calculates a hash of the file/directory contents to compare against previous
versions. This sets it apart from traditional build tools like GNU Make, which
typically rely on file timestamps.
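One practical consequence: updating a file's timestamp without changing its contents does not invalidate anything. An illustrative session, again using a hypothetical file inside data/raw and assuming the pipeline was already run:

```bash
$ touch data/raw/part1.csv   # timestamp changes, contents do not
$ dvc status
Data and pipelines are up to date.
```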
A more granular type of dependency is the parameter (
params field of
dvc.yaml), or hyperparameters in machine learning. These are any values used
inside your code to tune data processing, or that affect stage execution in any
other way. For example, training a neural network usually requires batch
size and epoch values.
Instead of hard-coding param values, your code can read them from a structured
file (e.g. YAML format). DVC can track any key/value pair in a supported
parameters file (
params.yaml by default). Params are granular dependencies
because DVC only invalidates stages when the corresponding part of the params
file has changed.
```yaml
stages:
  train:
    cmd: ...
    deps: ...
    params: # from params.yaml
      - learning_rate
      - nn.epochs
      - nn.batch_size
    outs: ...
```
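The keys above would map onto a params.yaml like the following (the values here are made up for illustration); note how the dotted names address nested keys:

```yaml
learning_rate: 0.001
nn:
  epochs: 50
  batch_size: 64
```

With this setup, changing nn.epochs invalidates train, while editing an unrelated key in the same file would not.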
See more details about this syntax.
Use dvc params diff to compare parameters across project versions.
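For example, after editing the learning rate in your workspace, the comparison could look like this (values invented; the table layout approximates dvc params diff output):

```bash
$ dvc params diff
Path         Param          HEAD    workspace
params.yaml  learning_rate  0.001   0.002
```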
Stage outputs are files (or directories) written by pipelines, for
example machine learning models and intermediate artifacts. These files are
cached by DVC automatically, and tracked with the help of dvc.lock files (or
.dvc files).
Outputs can be dependencies of subsequent stages (as explained earlier). So when they change, DVC may need to reproduce downstream stages as well (handled automatically).
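As a sketch, the dvc.lock for the pipeline above could contain entries like these (simplified, with invented hashes). DVC compares these stored hashes against the current workspace contents to decide which stages must run:

```yaml
schema: '2.0'
stages:
  prepare:
    cmd: source src/cleanup.sh
    deps:
    - path: src/cleanup.sh
      md5: 4ba8ac2d50b7a2d65dbc9a10ac3d1a2f
    - path: data/raw
      md5: 9f1b2c3d4e5f60718293a4b5c6d7e8f9.dir
    outs:
    - path: data/clean.csv
      md5: 0c12de34f56a78b90c12de34f56a78b9
```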