Defining Pipelines

Pipelines represent data workflows that you want to reproduce reliably — so the results are consistent. The typical pipelining process involves:

  • Obtain and dvc add or dvc import the project's initial data requirements (see Data Versioning). This caches the data and generates .dvc files (see the example after this list).

  • Define the pipeline stages in dvc.yaml files (more on this later). Example structure:

    stages:
      prepare: ... # stage 1 definition
      train: ... # stage 2 definition
      evaluate: ... # stage 3 definition
  • Capture other useful metadata such as runtime parameters, performance metrics, and plots to visualize. DVC supports multiple file formats for these.
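
For the first bullet above, tracking an initial dataset typically looks something like this (the paths here are just examples):

$ dvc add data/raw
$ git add data/raw.dvc data/.gitignore
$ git commit -m "Track raw data with DVC"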

We call this file-based definition codification (YAML format in our case). It has the added benefit of allowing you to develop pipelines with standard Git workflows (and GitOps).

Stages usually take some data and run some code, producing an output (e.g. an ML model). The pipeline is formed by making them interdependent, meaning that the output of a stage becomes the input of another, and so on. Technically, this is called a dependency graph (DAG).

Note that while each pipeline is a graph, this doesn't mean a single dvc.yaml file. DVC checks the entire project tree and validates all such files to find stages, rebuilding all the pipelines that these may define.

DVC represents a pipeline internally as a graph where the nodes are stages and the edges are directed dependencies (e.g. A before B). And in order for DVC to run a pipeline, its topology should be acyclic — because executing cycles (e.g. A -> B -> C -> A …) would continue indefinitely. More about DAGs.

Use dvc dag to visualize (or export) them.
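
For a pipeline with the prepare and train stages defined below, the output looks roughly like this (exact rendering may vary):

$ dvc dag
+---------+
| prepare |
+---------+
      *
      *
      *
 +-------+
 | train |
 +-------+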

Stages

See the full specification of stage entries.

Each stage wraps around an executable shell command and specifies any file-based dependencies as well as outputs. Let's look at a sample stage: it depends on a script file it runs as well as on a raw data input (ideally tracked by DVC already):

stages:
  prepare:
    cmd: source src/cleanup.sh
    deps:
      - src/cleanup.sh
      - data/raw
    outs:
      - data/clean.csv

We use GNU/Linux in these examples, but Windows or other shells can be used too.

Besides writing dvc.yaml files manually (recommended), you can also create stages with dvc stage add — a limited command-line interface to set up pipelines. Let's add another stage this way and look at the resulting dvc.yaml:

$ dvc stage add --name train \
                --deps src/model.py \
                --deps data/clean.csv \
                --outs data/predict.dat \
                python src/model.py data/clean.csv
stages:
  prepare:
    ...
    outs:
      - data/clean.csv
  train:
    cmd: python src/model.py data/clean.csv
    deps:
      - src/model.py
      - data/clean.csv
    outs:
      - data/predict.dat

One advantage of using dvc stage add is that it verifies the validity of the arguments provided (otherwise, stage definitions aren't checked until execution). A disadvantage is that some advanced features, such as templating, are not available this way.
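
For reference, templating lets you parameterize dvc.yaml with values declared under vars (or loaded from parameters files). A rough sketch, using made-up values and assuming a hypothetical cleanup script that accepts the paths as arguments:

vars:
  - paths:
      raw: data/raw
      clean: data/clean.csv

stages:
  prepare:
    cmd: sh src/cleanup.sh ${paths.raw} ${paths.clean}
    deps:
      - src/cleanup.sh
      - ${paths.raw}
    outs:
      - ${paths.clean}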

Notice that the new train stage depends on the output from stage prepare (data/clean.csv), forming the pipeline (DAG).

Stage execution sequences will be determined entirely by the DAG, not by the order in which stages are found in dvc.yaml.

Simple dependencies

There's more than one type of stage dependency. A simple dependency is a file or directory used as input by the stage command. When its contents change, DVC "invalidates" the stage — it knows that it needs to run again (see dvc status). This in turn may cause a chain reaction in which subsequent stages of the pipeline are also reproduced.

DVC calculates a hash of the file/directory contents to compare against previous versions. This sets it apart from traditional build tools like make, which rely on file timestamps.
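
For instance, if data/raw changes after the prepare stage above has run, dvc status reports something along these lines (output abbreviated and illustrative):

$ dvc status
prepare:
    changed deps:
        modified:           data/raw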

File system-level dependencies are defined in the deps field of dvc.yaml stages, or with the --deps (-d) option of dvc stage add (see the example in the previous section).

Parameter dependencies

A more granular type of dependency is the parameter (params field of dvc.yaml), or hyperparameters in machine learning. These are any values used inside your code to tune data processing, or that affect stage execution in any other way. For example, training a Neural Network usually requires batch size and epoch values.

Instead of hard-coding param values, your code can read them from a structured file (e.g. YAML format). DVC can track any key/value pair in a supported parameters file (params.yaml by default). Params are granular dependencies because DVC only invalidates stages when the corresponding part of the params file has changed.

stages:
  train:
    cmd: ...
    deps: ...
    params: # from params.yaml
      - learning_rate
      - nn.epochs
      - nn.batch_size
    outs: ...

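The keys listed above would come from a params.yaml file along these lines (values are only illustrative):

learning_rate: 0.001
nn:
  epochs: 70
  batch_size: 128
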
See more details about this syntax.

Use dvc params diff to compare parameters across project versions.

Outputs

Stage outputs are files (or directories) written by pipelines, for example machine learning models and intermediate artifacts. These files are cached by DVC automatically, and tracked with the help of dvc.lock files (or .dvc files, see dvc add).
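
As a rough illustration (exact fields vary across DVC versions, and the hashes below are placeholders), the dvc.lock entry for the prepare stage above could look like:

schema: '2.0'
stages:
  prepare:
    cmd: source src/cleanup.sh
    deps:
      - path: data/raw
        md5: 11111111111111111111111111111111.dir
        size: 123456
      - path: src/cleanup.sh
        md5: 22222222222222222222222222222222
        size: 789
    outs:
      - path: data/clean.csv
        md5: 33333333333333333333333333333333
        size: 456789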

Outputs can be dependencies of subsequent stages (as explained earlier). So when they change, DVC may need to reproduce downstream stages as well (handled automatically).

DVC can also track metrics and plots files. These can be added as stage outputs, optionally with cache: false in dvc.yaml, since they are often small enough to store in Git.
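
For example, a hypothetical evaluation stage could declare a small metrics file that is stored in Git rather than the DVC cache:

stages:
  evaluate:
    cmd: python src/evaluate.py data/predict.dat metrics.json
    deps:
      - data/predict.dat
    metrics:
      - metrics.json:
          cache: false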

Outputs are produced by stage commands. DVC makes no assumptions about how this happens; the resulting files just need to match the paths specified in dvc.yaml.
