Pipelines represent data workflows that you want to reproduce reliably, so that results are consistent. The typical pipelining process involves:
dvc import the project's initial data requirements (see Data Versioning). This caches the data and generates .dvc files.
Define the pipeline stages in dvc.yaml files (more on this later). Example structure:
```yaml
stages:
  prepare: ... # stage 1 definition
  train: ... # stage 2 definition
  evaluate: ... # stage 3 definition
```
Capture other useful metadata such as runtime parameters, performance metrics, and plots to visualize. DVC supports multiple file formats for these.
We call this file-based definition codification (YAML format in our case). It has the added benefit of letting you develop pipelines within standard Git workflows (and GitOps).
Stages usually take some data and run some code, producing an output (e.g. an ML model). The pipeline is formed by making them interdependent, meaning that the output of a stage becomes the input of another, and so on. Technically, this is called a dependency graph, specifically a directed acyclic graph (DAG).
Note that while each pipeline is a graph, this doesn't mean a single dvc.yaml file. DVC checks the entire project tree and validates all such files to find stages, rebuilding all the pipelines that these may define.
See the full specification of stage entries.
Each stage wraps around an executable shell command and specifies any file-based dependencies as well as outputs. Let's look at a sample stage: it depends on a script file it runs as well as on a raw data input (ideally tracked by DVC already):
```yaml
stages:
  prepare:
    cmd: source src/cleanup.sh
    deps:
      - src/cleanup.sh
      - data/raw
    outs:
      - data/clean.csv
```
We use GNU/Linux in these examples, but Windows or other shells can be used too.
While you can write dvc.yaml files manually (recommended), you can also create stages with dvc stage add, a limited command-line interface to set up pipelines. Let's add another stage this way and look at the resulting dvc.yaml:
```cli
$ dvc stage add --name train \
                --deps src/model.py \
                --deps data/clean.csv \
                --outs data/predict.dat \
                python src/model.py data/clean.csv
```
```yaml
stages:
  prepare:
    ...
    outs:
      - data/clean.csv
  train:
    cmd: python src/model.py data/clean.csv
    deps:
      - src/model.py
      - data/clean.csv
    outs:
      - data/predict.dat
```
One advantage of using
dvc stage add is that it will verify the validity of
the arguments provided (otherwise the stage definition won't be checked until
execution). A disadvantage is that some advanced features such as templating
are not available this way.
Notice that the new train stage depends on the output from the prepare stage (data/clean.csv), forming the pipeline (DAG).
Stage execution sequences will be determined entirely by the DAG, not by the order in which stages are found in dvc.yaml.
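To build intuition for how execution order falls out of the DAG, here is a simplified sketch (not DVC's actual implementation; the evaluate stage and its file names are hypothetical additions for illustration):

```python
# Sketch: derive stage execution order from deps/outs alone.
# A stage must run after any stage whose outputs it depends on.
stages = {
    "prepare": {"deps": ["src/cleanup.sh", "data/raw"], "outs": ["data/clean.csv"]},
    "train": {"deps": ["src/model.py", "data/clean.csv"], "outs": ["data/predict.dat"]},
    # Hypothetical third stage consuming train's output:
    "evaluate": {"deps": ["data/predict.dat"], "outs": ["metrics.json"]},
}

def execution_order(stages):
    # Map each output path to the stage that produces it.
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    order, visited = [], set()

    def visit(name):
        if name in visited:
            return
        visited.add(name)
        for dep in stages[name]["deps"]:
            if dep in producer:       # dep is another stage's output...
                visit(producer[dep])  # ...so run the upstream stage first
        order.append(name)

    for name in stages:
        visit(name)
    return order

print(execution_order(stages))  # → ['prepare', 'train', 'evaluate']
```

Note that the order depends only on which outputs feed which deps, matching the point above about dvc.yaml ordering being irrelevant.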
There's more than one type of stage dependency. A simple dependency is a file or
directory used as input by the stage command. When its contents have changed,
DVC "invalidates" the stage — it knows that it needs to run again (see
dvc status). This in turn may cause a chain reaction in which subsequent
stages of the pipeline are also reproduced.
DVC calculates a hash of file/dir contents to compare vs. previous versions. This is a distinctive mechanism compared to traditional build tools like make, which rely on file timestamps instead.
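That content-hash check can be sketched in a few lines (a simplified illustration using MD5, which DVC also uses for files; the file name and helper are made up):

```python
import hashlib
import os
import tempfile

def file_hash(path):
    """Hash file contents in chunks; timestamps play no role."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            md5.update(chunk)
    return md5.hexdigest()

with tempfile.TemporaryDirectory() as workdir:
    raw = os.path.join(workdir, "raw.csv")  # hypothetical dependency
    with open(raw, "w") as f:
        f.write("a,b\n1,2\n")
    recorded = file_hash(raw)  # hash stored after the last successful run

    with open(raw, "a") as f:
        f.write("3,4\n")       # the dependency's contents change
    stage_invalidated = file_hash(raw) != recorded
    print(stage_invalidated)   # → True: the stage needs to run again
```

Rewriting the file with identical contents would leave the hash, and therefore the stage, unchanged, which is exactly where this differs from timestamp-based tools.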
File system-level dependencies are defined in the deps field of dvc.yaml stages; alternatively, using the --deps (-d) option of dvc stage add (see the previous section's example).
A more granular type of dependency is the parameter (
params field of
dvc.yaml), or hyperparameters in machine learning. These are any values used
inside your code to tune data processing, or that affect stage execution in any
other way. For example, training a neural network usually requires batch size and epoch values.
Instead of hard-coding param values, your code can read them from a structured
file (e.g. YAML format). DVC can track any key/value pair in a supported
parameters file (
params.yaml by default). Params are granular dependencies
because DVC only invalidates stages when the corresponding part of the params
file has changed.
```yaml
stages:
  train:
    cmd: ...
    deps: ...
    params: # from params.yaml
      - learning_rate
      - nn.epochs
      - nn.batch_size
    outs: ...
```
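The keys above would map onto a parameters file along these lines (the values shown are illustrative):

```yaml
# params.yaml (illustrative values)
learning_rate: 0.001
nn:
  epochs: 10
  batch_size: 64
```

Changing nn.epochs here would invalidate the train stage, while editing an unrelated key in the same file would not.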
See more details about this syntax.
Use dvc params diff to compare parameters across project versions.
Stage outputs are files (or directories) written by pipelines, for
example machine learning models, intermediate artifacts, as well as data plots
and performance metrics. These files are cached by DVC
automatically, and tracked with the help of dvc.lock files (or .dvc files).
Outputs can be dependencies of subsequent stages (as explained earlier). So when they change, DVC may need to reproduce downstream stages as well (handled automatically).
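For reference, a dvc.lock entry records each stage's command plus the content hashes of its deps and outs (the shape below is illustrative and the hashes are made up):

```yaml
# dvc.lock (illustrative; hashes are made up)
stages:
  prepare:
    cmd: source src/cleanup.sh
    deps:
      - path: data/raw
        md5: 1a2b3c4d5e6f7890aabbccddeeff0011.dir
      - path: src/cleanup.sh
        md5: 0011ffeeddccbbaa0987f6e5d4c3b2a1
    outs:
      - path: data/clean.csv
        md5: 9f8e7d6c5b4a3210ffeeddccbbaa9988
```

Comparing these recorded hashes against the current workspace is how DVC decides which stages (and which downstream stages) need reproducing.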
The types of outputs are:
Files and directories: Typically data to feed to intermediate stages, as well as the final results of a pipeline (e.g. a dataset or an ML model).
Metrics: DVC supports small text files that usually contain model performance metrics from the evaluation, validation, or testing phases of the ML lifecycle. DVC lets you compare produced metrics with one another using dvc metrics diff, and presents the results as a table with dvc metrics show or dvc exp show.
Plots: Different kinds of data that can be visually graphed, for example to contrast ML performance statistics or continuous metrics from multiple experiments. dvc plots show can generate charts for certain data files or render custom image files for you, and you can compare different ones with dvc plots diff.
Outputs are produced by stage commands. DVC does not make any assumptions regarding this process; they should just match the path specified in dvc.yaml.