dvc.yaml

You can construct data science or machine learning pipelines by defining individual stages in one or more dvc.yaml files. Stages form a pipeline when they connect with each other (forming a dependency graph, see dvc dag). Refer to Get Started: Data Pipelines.

A helper command, dvc stage, is available to create and list stages.

dvc.yaml files can be versioned with Git.

These files use the YAML 1.2 file format and a human-friendly schema, explained below. We encourage you to get familiar with it so you can modify, write, or generate stages and pipelines on your own.

We use GNU/Linux in most of our examples.

Stages

The list of stages contains one or more user-defined stages. Here's a simple one named transpose:

stages:
  transpose:
    cmd: ./trans.r rows.txt > columns.txt
    deps:
      - rows.txt
    outs:
      - columns.txt

See also dvc stage add, a helper command to write stages in dvc.yaml.

The most important part of a stage is the terminal command(s) it executes (cmd field). This is what DVC runs when the stage is reproduced (see dvc repro).

If a command reads input files, these (or their directory locations) can be defined as dependencies (deps). DVC will check whether they have changed to decide whether the stage requires re-execution (see dvc status).

If it writes files or dirs, they can be defined as outputs (outs). DVC will track them going forward (similar to using dvc add).

Output files may be viable data sources for top-level plots.

See the full stage entry specification.

Parameter dependencies

Parameters are a special type of stage dependency. They consist of a list of params to track in one of these formats:

  1. A param key/value pair that can be found in params.yaml (default params file);
  2. A dictionary named by the file path to a custom params file, and with a list of param key/value pairs to find in it;
  3. An empty set (give no value or use null) named by the file path to a params file: to track all the params in it dynamically.

Note that the file paths used must point to a valid YAML, JSON, TOML, or Python parameters file.

stages:
  preprocess:
    cmd: bin/cleanup raw.txt clean.txt
    deps:
      - raw.txt
    params:
      - threshold # track specific param (from params.yaml)
      - passes
      - myparams.yaml: # track specific params from custom file
          - epochs
      - config.json: # track all parameters in this file
    outs:
      - clean.txt

This allows several stages to depend on values of a shared structured file (which can be versioned directly with Git). See also dvc params diff.

Metrics and Plots outputs

Like common outputs, metrics and plots files are produced by the stage cmd. However, their purpose is different. Typically they contain metadata to evaluate pipeline processes. Example:

stages:
  build:
    cmd: python train.py
    deps:
      - features.csv
    outs:
      - model.pt
    metrics:
      - accuracy.txt:
          cache: false
    plots:
      - auc.json:
          cache: false

cache: false is typical here, since these files are small enough for Git to version directly.

The commands in dvc metrics and dvc plots help you display and compare metrics and plots.

Templating

New in DVC 2.0 (see dvc version)

dvc.yaml supports a templating format to insert values from different sources in the YAML structure itself. These sources can be parameters files, or vars defined in dvc.yaml instead.

Let's say we have params.yaml (default params file) with the following contents:

models:
  us:
    threshold: 10
    filename: 'model-us.hdf5'

Those values can be used anywhere in dvc.yaml with the ${} substitution expression:

stages:
  build-us:
    cmd: >-
      python train.py
      --thresh ${models.us.threshold}
      --out ${models.us.filename}
    outs:
      - ${models.us.filename}:
          cache: true

DVC will track simple param values (numbers, strings, etc.) used in ${} (they will be listed by dvc params diff).

Dict Unpacking

Only inside the cmd entries, you can also reference a dictionary inside ${} and DVC will unpack it. For example, given the following params.yaml:

dict:
  foo: foo
  bar: 2
  bool: true
  nested:
    foo: bar
  list: [1, 2, 'foo']

You can reference dict in the cmd section of a dvc.yaml:

stages:
  train:
    cmd: python train.py ${dict}

And DVC will unpack the values inside dict, creating the following cmd call:

$ python train.py --foo 'foo' --bar 2 --bool \
                  --nested.foo 'bar' --list 1 2 'foo'

This is useful to avoid writing out every argument passed to cmd, or having to modify dvc.yaml when adding or removing arguments.

The parsing section of dvc config can be used to customize the syntax used for some ambiguous types like booleans and lists.

Vars

Alternatively, values for substitution can be listed as top-level vars like this:

vars:
  - models:
      us:
        threshold: 10
  - desc: 'Reusable description'

stages:
  build-us:
    desc: ${desc}
    cmd: python train.py --thresh ${models.us.threshold}

Values from vars are not tracked like parameters.

To load additional params files, list them in the top-level vars, in the desired order. Params file paths are evaluated based on wdir, if specified. For example:

vars:
  - params.json
  - myvar: 'value'
  - config/myapp.yaml

Note that the default params.yaml file is always loaded first, if present.

It's also possible to specify what to include from additional params files, with a : colon:

vars:
  - params.json:clean,feats

stages:
  featurize:
    cmd: ${feats.exec}
    deps:
      - ${clean.filename}
    outs:
      - ${feats.dirname}

Stage-specific values are also supported, with inner vars. You may also load additional params files locally. For example:

stages:
  build-us:
    vars:
      - params.json:build
      - model:
          filename: 'model-us.hdf5'
    cmd: python train.py ${build.epochs} --out ${model.filename}
    outs:
      - ${model.filename}

DVC merges values from params files and vars in each scope when possible. For example, {"grp": {"a": 1}} merges with {"grp": {"b": 2}}, but not with {"grp": {"a": 7}}.
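
The merge rule above can be sketched as follows. The grp keys mirror the example in the text; the stage name and the echo command are hypothetical and only serve to show both values being available in the same scope.

```yaml
# params.yaml
---
grp:
  a: 1

# dvc.yaml
---
vars:
  - grp:
      b: 2 # new key under grp: merges with grp from params.yaml
  # - grp:
  #     a: 7 # would not merge: grp.a is already defined in params.yaml

stages:
  show:
    cmd: echo ${grp.a} ${grp.b}
```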

⚠️ Known limitations of local vars:

  • wdir cannot use values from local vars, as DVC uses the working directory first (to load any values from params files listed in vars).
  • foreach is also incompatible with local vars at the moment.

The substitution expression supports these forms:

${param} # Simple
${param.key} # Nested values through . (period)
${param.list[0]} # List elements via index in [] (square brackets)

To use the expression literally in dvc.yaml (so DVC does not replace it for a value), escape it with a backslash, e.g. \${....
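
As a rough illustration of these forms together, the sketch below uses a nested value, a list element, and an escaped expression in one command. The params.yaml values, the script name, and its flags are hypothetical. The escaped expression is assumed to reach the shell as a literal ${USER}, which the shell may then expand as an environment variable.

```yaml
# params.yaml (hypothetical values)
---
train:
  epochs: 5
  layers: [64, 32]

# dvc.yaml
---
stages:
  train:
    cmd: >-
      python train.py
      --epochs ${train.epochs}
      --first-layer ${train.layers[0]}
      --user \${USER}
```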

foreach stages

New in DVC 2.0 (see dvc version)

You can define more than one stage in a single dvc.yaml entry with the following syntax. A foreach element accepts a list or dictionary with values to iterate on, while do contains the regular stage fields (cmd, outs, etc.). Here's a simple example:

stages:
  cleanups:
    foreach: # List of simple values
      - raw1
      - labels1
      - raw2
    do:
      cmd: clean.py "${item}"
      outs:
        - ${item}.cln

Upon dvc repro, each item in the list is expanded into its own stage by substituting its value in expression ${item}. The item's value is appended to each stage name after a @. The final stages generated by the foreach syntax are saved to dvc.lock:

schema: '2.0'
stages:
  cleanups@labels1:
    cmd: clean.py "labels1"
    outs:
      - path: labels1.cln
  cleanups@raw1:
    cmd: clean.py "raw1"
    outs:
      - path: raw1.cln
  cleanups@raw2:
    cmd: clean.py "raw2"
    outs:
      - path: raw2.cln

For lists containing complex values (e.g. dictionaries), the substitution expression can use the ${item.key} form. Stage names will be appended with a zero-based index. For example:

stages:
  train:
    foreach:
      - epochs: 3
        thresh: 10
      - epochs: 10
        thresh: 15
    do:
      cmd: python train.py ${item.epochs} ${item.thresh}

# dvc.lock
schema: '2.0'
stages:
  train@0:
    cmd: python train.py 3 10
  train@1:
    cmd: python train.py 10 15

DVC can also iterate on a dictionary given directly to foreach, resulting in two substitution expressions being available: ${key} and ${item}. The former is used for the stage names:

stages:
  build:
    foreach:
      uk:
        epochs: 3
        thresh: 10
      us:
        epochs: 10
        thresh: 15
    do:
      cmd: python train.py '${key}' ${item.epochs} ${item.thresh}
      outs:
        - model-${key}.hdfs

# dvc.lock
schema: '2.0'
stages:
  build@uk:
    cmd: python train.py 'uk' 3 10
    outs:
      - path: model-uk.hdfs
        md5: 17b3d1efc339b416c4b5615b1ce1b97e
  build@us: ...

Importantly, dictionaries from parameters files can be used in foreach stages as well:

stages:
  mystages:
    foreach: ${myobject} # From params.yaml
    do:
      cmd: ./script.py ${key} ${item.prop1}
      outs:
        - ${item.prop2}

Note that this feature is not compatible with templating at the moment.
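
For the foreach: ${myobject} example above, a matching params.yaml might look like the sketch below. The keys first and second and their prop1/prop2 values are hypothetical. Following the dictionary rules described earlier, this would be expected to expand into stages named mystages@first and mystages@second.

```yaml
# params.yaml (hypothetical contents)
---
myobject:
  first:
    prop1: 10
    prop2: out-first.txt
  second:
    prop1: 20
    prop2: out-second.txt
```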

Stage entries

These are the fields that are accepted in each stage:

| Field | Description |
| --- | --- |
| cmd | (Required) One or more commands executed by the stage (may contain either a single value or a list). Commands are executed sequentially until all are finished or until one of them fails (see dvc repro). |
| wdir | Working directory for the stage command to run in (relative to the file's location). Any paths in other fields are also based on this. It defaults to . (the file's location). |
| deps | List of dependency paths of this stage (relative to wdir). |
| outs | List of stage output paths (relative to wdir). These can contain optional subfields. |
| params | List of parameter dependency keys (field names) to track from params.yaml (in wdir). The list may also contain other parameters file names, with a sub-list of the param names to track in them. |
| metrics | List of metrics files, and optionally, whether or not this metrics file is cached (true by default). See the --metrics-no-cache (-M) option of dvc run. |
| plots | List of plot metrics, and optionally, their default configuration (subfields matching the options of dvc plots modify), and whether or not this plots file is cached (true by default). See the --plots-no-cache option of dvc run. |
| frozen | Whether or not this stage is frozen from reproduction. |
| always_changed | Causes this stage to be always considered as changed by commands such as dvc status and dvc repro. false by default. |
| meta | (Optional) Arbitrary metadata can be added manually with this field. Any YAML content is supported. meta contents are ignored by DVC, but they can be meaningful for user processes that read or write .dvc files directly. |
| desc | (Optional) User description for this stage. This doesn't affect any DVC operations. |
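
To show how several of these fields fit together, here is a minimal sketch of one stage. The directory, script, and file names, as well as the meta contents, are hypothetical; only the field names come from the table above.

```yaml
stages:
  prepare:
    desc: Clean the raw data and split it
    wdir: pipelines/prepare # paths in the other fields are relative to this
    cmd: # a list of commands, run in order until one fails
      - python clean.py raw.csv clean.csv
      - python split.py clean.csv train.csv test.csv
    deps:
      - raw.csv
    outs:
      - train.csv
      - test.csv
    frozen: false
    always_changed: false
    meta:
      owner: data-team # ignored by DVC
```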

dvc.yaml files also support # comments.

Note that we maintain a dvc.yaml schema that can be used by editors like VSCode or PyCharm to enable automatic syntax validation and auto-completion.

See also How to Merge Conflicts.

Output subfields

These include a subset of the fields in .dvc file output entries.

| Field | Description |
| --- | --- |
| cache | Whether or not this file or directory is cached (true by default). See the --no-commit option of dvc add. |
| remote | (Optional) Name of the remote to use for pushing/fetching. |
| persist | Whether the output file/dir should remain in place during dvc repro (false by default: outputs are deleted when dvc repro starts). |
| checkpoint | (Optional) Set to true to let DVC know that this output is associated with checkpoint experiments. These outputs are reverted to their last cached version at dvc exp run and also persist during the stage execution. |
| desc | (Optional) User description for this output. This doesn't affect any DVC operations. |

⚠️ Note that using the checkpoint field in dvc.yaml is not compatible with dvc repro.
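
As an illustration, output subfields are nested under the output path, in the same way as cache: false in the metrics example earlier. The file names, the description, and the remote name below are hypothetical.

```yaml
stages:
  train:
    cmd: python train.py
    deps:
      - features.csv
    outs:
      - model.pt:
          desc: Trained model weights
          persist: true # keep the file in place when dvc repro starts
          remote: models-remote # hypothetical remote name
      - summary.txt:
          cache: false # small file, left for Git to version
```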

Top-level plot definitions

The list of plots contains one or more user-defined top-level plots (paths relative to the location of dvc.yaml).

Every plot has to have its own ID. Configuration, if provided, should be a dictionary.

In the simplest use case, a user can provide the file path as the plot ID and not provide configuration at all:

# dvc.yaml
---
plots:
  logs.csv:

In that case, the default behavior applies: DVC takes data from the logs.csv file and applies the linear plot template to the last column found (CSV, TSV files) or the last field found (JSON, YAML).

We can customize the plot by adding appropriate fields to the configuration:

# dvc.yaml
---
plots:
  confusion_matrix:
    y:
      confusion_matrix_data.csv: predicted_class
    x: actual_class
    template: confusion

Here we provided confusion_matrix as the plot ID. It is displayed as the plot title unless overridden with the title field. The data source is given in the y axis definition: data comes from confusion_matrix_data.csv, with the predicted_class field on the y axis and the actual_class field on the x axis. Note that DVC assumes that actual_class is also inside confusion_matrix_data.csv.

We can provide multiple columns/fields from the same file:

# dvc.yaml
---
plots:
  multiple_series:
    y:
      logs.csv: [accuracy, loss]
    x: epoch

In this case, the accuracy and loss fields are plotted against the epoch column, all coming from the logs.csv file.

We can source the data from multiple files too:

# dvc.yaml
---
plots:
  multiple_files:
    y:
      train_logs.csv: accuracy
      test_logs.csv: accuracy
    x: epoch

In this case, the accuracy field from both train_logs.csv and test_logs.csv is plotted against epoch. Note that both files must contain an epoch field.

Available configuration fields

  • x - Name of the field that the X axis data comes from. It has to be a string. An auto-generated step field is used by default.

  • y - Name of the field (or fields) that the Y axis data comes from.

    • Top-level plots: It can be a string, list, or dictionary. If it's a string or list, the plot ID is assumed to be the path to the data source, and the string or list elements are the names of data columns or fields within that file. If it's a dictionary, its keys are assumed to be paths to data sources, and its values must be strings or lists naming the column(s)/field(s) within the respective files.
    • Plot outputs: It is the name of the field that the Y axis data comes from.

  • x_label - X axis label. The X field name is the default.

  • y_label - Y axis label. If all provided Y entries share the same field name, that name is the default; otherwise the default is the string y.

  • title - Plot title. Defaults:

    • Top-level plots: path/to/dvc.yaml::plot_id
    • Plot outputs: Path to the file.
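
The fields above can be combined in a single top-level plot definition. The sketch below reuses the file and field names from the earlier multiple-files example; the plot ID, labels, and title are hypothetical.

```yaml
# dvc.yaml
---
plots:
  training_curves:
    template: linear
    x: epoch
    y:
      train_logs.csv: accuracy
      test_logs.csv: accuracy
    x_label: Epoch
    y_label: Accuracy
    title: Accuracy per epoch
```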

dvc.lock file

⚠️ Avoid editing these files. DVC will create and update them for you.

To record the state of your pipeline(s) and help track its outputs, DVC will maintain a dvc.lock file for each dvc.yaml. Their purposes include:

  • Allowing DVC to detect when stage definitions or their dependencies have changed. Such conditions invalidate stages, requiring their reproduction (see dvc status).
  • Tracking intermediate and final outputs of a pipeline, similar to .dvc files.
  • Supporting several DVC commands that need it to operate, such as dvc checkout or dvc get.

Here's an example:

schema: '2.0'
stages:
  features:
    cmd: jupyter nbconvert --execute featurize.ipynb
    deps:
      - path: data/clean
        md5: d8b874c5fa18c32b2d67f73606a1be60
    params:
      params.yaml:
        levels.no: 5
    outs:
      - path: features
        md5: 2119f7661d49546288b73b5730d76485
        size: 154683
      - path: performance.json
        md5: ea46c1139d771bfeba7942d1fbb5981e
        size: 975
      - path: logs.csv
        md5: f99aac37e383b422adc76f5f1fb45004
        size: 695947

Stages are listed again in dvc.lock, in order to know if their definitions change in dvc.yaml.

Regular dependency entries and all forms of output entries (including metrics and plots files) are also listed (per stage) in dvc.lock, including a content hash field (md5, etag, or checksum).

Full parameter dependencies (both key and value) are listed too (under params), under each parameters file name. For templated dvc.yaml files, the actual values are written to dvc.lock (no ${} expression). As for foreach stages, individual stages are expanded (no foreach structures are preserved).
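
As a sketch of what this means for templating, the build-us stage from the Templating section would be expected to appear in dvc.lock with its substituted values, roughly as below. Content hash and size fields are omitted here, and other entries DVC may record for the stage are not shown.

```yaml
# dvc.lock (content hash and size fields omitted)
---
schema: '2.0'
stages:
  build-us:
    cmd: python train.py --thresh 10 --out model-us.hdf5
    outs:
      - path: model-us.hdf5
```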
