dvc.yaml

You can construct machine learning pipelines by defining individual stages in one or more dvc.yaml files. Stages constitute a pipeline when they connect with each other (forming a dependency graph, see dvc dag).

dvc.yaml uses the YAML 1.2 format and a human-friendly schema explained below. We encourage you to get familiar with it so you can modify, write, or generate dvc.yaml files by your own means.

dvc.yaml files are designed to be small enough so you can easily version them with Git along with other DVC files and your project's code.

Stages

The list of stages is typically the most important part of a dvc.yaml file. It contains one or more user-defined stages. Here's a simple one named transpose:

stages:
  transpose:
    cmd: ./trans.r rows.txt > columns.txt
    deps:
      - rows.txt
    outs:
      - columns.txt

A helper command group, dvc stage, is available to create and list stages.

The only required part of a stage is the shell command(s) it executes (cmd field). This is what DVC runs when the stage is reproduced (see dvc repro).

We use GNU/Linux in our examples, but Windows or other shells can be used too.

If a stage command reads input files, these (or their directory locations) can be defined as dependencies (deps). DVC will check whether they have changed to decide whether the stage requires re-execution (see dvc status).

If it writes files or directories, these can be defined as outputs (outs). DVC will track them going forward (similar to using dvc add on them).

Output files may be viable data sources for top-level plots.

See the full stage entry specification.

Stage commands

The command(s) defined in the stages (cmd field) can be anything your system terminal would accept and run, for example a shell built-in, an expression, or a binary found in PATH.

Surround the command with double quotes " if it includes special characters like | or <, >. Use single quotes ' instead if there are environment variables in it that should be evaluated dynamically.

The same applies to the command argument for helper commands (dvc stage add, dvc exp init), otherwise they would apply to the DVC call itself:

$ dvc stage add -n a_stage "./a_script.sh > /dev/null 2>&1"
$ dvc exp init './another_script.sh $MYENVVAR'

See also Templating (and Dictionary unpacking) for useful ways to parametrize cmd strings.

We don't want to tell anyone how to write their code or what programs to use! However, please be aware that in order to prevent unexpected results when DVC reproduces pipeline stages, the underlying code should ideally follow these rules:

  • Read/write exclusively from/to the specified dependencies and outputs (including parameters files, metrics, and plots).
  • Completely rewrite outputs. Do not append or edit.
  • Stop reading and writing files when the command exits.

Also, if your pipeline reproducibility goals include consistent output data, its code should be deterministic (produce the same output for any given input): avoid code that increases entropy (e.g. random numbers, time functions, hardware dependencies, etc.).
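As an illustration of these rules, here is a minimal Python sketch of a stage script. The function name, file names, and seed value are hypothetical. It only shows three habits: seeding the random generator so repeated runs give identical output, rewriting the output file completely instead of appending to it, and closing every file before the command exits.

```python
import random
from pathlib import Path


def sample_rows(src: Path, dst: Path, k: int = 2, seed: int = 42) -> None:
    """Write k randomly chosen lines from src to dst, reproducibly."""
    rows = src.read_text().splitlines()
    rng = random.Random(seed)  # fixed seed: same input gives same output
    chosen = rng.sample(rows, min(k, len(rows)))
    # write_text() truncates and rewrites the file (it never appends),
    # and the file is closed again before the function returns.
    dst.write_text("\n".join(chosen) + "\n")
```

Such a script would list src under deps and dst under outs, so that it reads and writes only what the stage declares.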

Parameters

Parameters are simple key/value pairs consumed by the command code from a structured parameters file. They are defined per-stage in the params field of dvc.yaml and should contain one of these:

  1. A param name that can be found in params.yaml (default params file);
  2. A dictionary named by the file path to a custom params file, and with a list of param key/value pairs to find in it;
  3. An empty set (give no value or use null) named by the file path to a params file: to track all the params in it dynamically.

Dot-separated param names become tree paths to locate values in the params file.

stages:
  preprocess:
    cmd: bin/cleanup raw.txt clean.txt
    deps:
      - raw.txt
    params:
      - threshold # track specific param (from params.yaml)
      - nn.batch_size
      - myparams.yaml: # track specific params from custom file
          - epochs
      - config.json: # track all parameters in this file
    outs:
      - clean.txt

Params are a more granular type of stage dependency: multiple stages can use the same params file, but only certain values will affect their state (see dvc status).
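To make the idea of granular param dependencies concrete, here is a small Python sketch (not DVC's implementation). It resolves dot-separated names against nested data and reports which tracked params differ between two versions of a params file. The helper names and sample values are hypothetical.

```python
_MISSING = object()  # marker for names that are absent from the file


def get_param(params: dict, dotted_name: str):
    """Follow a dot-separated name (e.g. "nn.batch_size") through nested dicts."""
    node = params
    for key in dotted_name.split("."):
        if not isinstance(node, dict) or key not in node:
            return _MISSING
        node = node[key]
    return node


def changed_params(old: dict, new: dict, tracked: list) -> list:
    """Return the tracked names whose values differ between old and new."""
    return [
        name for name in tracked
        if get_param(old, name) != get_param(new, name)
    ]
```

With this approach, a stage tracking only threshold and nn.batch_size is unaffected when some other value in the same file changes.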

Parameters files

The supported params file formats are YAML 1.2, JSON, TOML 1.0, and Python. Parameter key/value pairs should be organized in tree-like hierarchies inside. Supported value types are: string, integer, float, boolean, and arrays (groups of params).

These files are typically written manually (or generated) and they can be versioned directly with Git along with other workspace files.

See also dvc params diff to compare params across project versions.

Metrics and Plots outputs

Like common outputs, metrics and plots files are produced by the stage cmd. However, their purpose is different. Typically they contain metadata to evaluate pipeline processes. Example:

stages:
  build:
    cmd: python train.py
    deps:
      - features.csv
    outs:
      - model.pt
    metrics:
      - accuracy.json:
          cache: false
    plots:
      - auc.json:
          cache: false

cache: false is typical here, since these files are small enough for Git to store directly.

The commands in dvc metrics and dvc plots help you display and compare metrics and plots.

Stage entries

These are the fields that are accepted in each stage:

| Field | Description |
| --- | --- |
| cmd | (Required) One or more shell commands to execute (may contain either a single value or a list). cmd values may use dictionary substitution from param files. Commands are executed sequentially until all are finished or until one of them fails (see dvc repro). |
| wdir | Working directory for the cmd to run in (relative to the file's location). Any paths in other fields are also based on this. It defaults to . (the file's location). |
| deps | List of dependency paths (relative to wdir). |
| outs | List of output paths (relative to wdir). These can contain certain optional subfields. |
| params | List of parameter dependency keys (field names) to track from params.yaml (in wdir). The list may also contain other parameters file names, with a sub-list of the param names to track in them. |
| metrics | List of metrics files, and optionally, whether or not this metrics file is cached (true by default). See the --metrics-no-cache (-M) option of dvc run. |
| plots | List of plot metrics, and optionally, their default configuration (subfields matching the options of dvc plots modify), and whether or not this plots file is cached (true by default). See the --plots-no-cache option of dvc run. |
| frozen | Whether or not this stage is frozen (prevented from execution during reproduction). |
| always_changed | Causes this stage to be always considered as changed by commands such as dvc status and dvc repro. false by default. |
| meta | (Optional) Arbitrary metadata can be added manually with this field. Any YAML content is supported. meta contents are ignored by DVC, but they can be meaningful for user processes that read or write .dvc files directly. |
| desc | (Optional) User description. This doesn't affect any DVC operations. |
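The sequential behavior described for cmd lists (run each command in order, stop at the first failure) can be sketched in Python as follows. This is only an illustration of the semantics, not how DVC runs commands internally. The function name and return shape are hypothetical.

```python
import subprocess


def run_stage_commands(commands):
    """Run shell commands in order and stop at the first one that fails.

    Returns the commands that were started and the last exit code.
    """
    if isinstance(commands, str):  # cmd may be a single value or a list
        commands = [commands]
    started = []
    for command in commands:
        started.append(command)
        result = subprocess.run(command, shell=True)
        if result.returncode != 0:
            return started, result.returncode  # later commands never run
    return started, 0
```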

dvc.yaml files also support # comments.

We maintain a dvc.yaml schema that can be used by editors like VSCode or PyCharm to enable automatic syntax validation and auto-completion.

Output subfields

These include a subset of the fields in .dvc file output entries.

| Field | Description |
| --- | --- |
| cache | Whether or not this file or directory is cached (true by default). See the --no-commit option of dvc add. |
| remote | (Optional) Name of the remote to use for pushing/fetching. |
| persist | Whether the output file/dir should remain in place during dvc repro (false by default: outputs are deleted when dvc repro starts). |
| checkpoint | (Optional) Set to true to let DVC know that this output is associated with checkpoint experiments. These outputs are reverted to their last cached version at dvc exp run and also persist during the stage execution. |
| desc | (Optional) User description for this output. This doesn't affect any DVC operations. |

Using the checkpoint field in dvc.yaml is not compatible with dvc repro.

Templating

dvc.yaml supports a templating format to insert values from different sources in the YAML structure itself. These sources can be parameters files, or vars defined in dvc.yaml instead.

Let's say we have params.yaml (default params file) with the following contents:

models:
  us:
    threshold: 10
    filename: 'model-us.hdf5'

Those values can be used anywhere in dvc.yaml with the ${} substitution expression, for example to pass parameters as command-line arguments to a stage command:

stages:
  build-us:
    cmd: >-
      python train.py
      --thresh ${models.us.threshold}
      --out ${models.us.filename}
    outs:
      - ${models.us.filename}:
          cache: true

DVC will track simple param values (numbers, strings, etc.) used in ${} (they will be listed by dvc params diff).

Only inside the cmd entries, you can also reference a dictionary inside ${} and DVC will unpack it. This can be useful to avoid writing every argument passed to the command, or having to modify dvc.yaml when arguments change.

An alternative for loading parameters from Python code is the dvc.api.params_show() API function.

For example, given the following params.yaml:

mydict:
  foo: foo
  bar: 1
  bool: true
  nested:
    baz: bar
  list: [2, 3, 'qux']

You can reference mydict in a stage command like this:

stages:
  train:
    cmd: R train.r ${mydict}

DVC will unpack the values inside mydict, creating the following cmd call:

$ R train.r --foo 'foo' --bar 1 --bool \
                  --nested.baz 'bar' --list 2 3 'qux'

You can combine this with argument parsing libraries such as R argparse or Julia ArgParse to do all the work for you.
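The following Python sketch approximates the unpacking shown above and then parses the result with argparse, standing in for the R or Julia parser the called script would define. It is not DVC's implementation. The helper names are hypothetical, and the handling of false booleans (omitted entirely) is an assumption.

```python
import argparse
import shlex


def fmt(value) -> str:
    # Strings are wrapped in single quotes, as in the example above.
    # Embedded quotes are not escaped in this sketch.
    return f"'{value}'" if isinstance(value, str) else str(value)


def unpack(values: dict, prefix: str = "") -> list:
    """Turn a (possibly nested) dict into command-line tokens."""
    tokens = []
    for key, value in values.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            tokens += unpack(value, prefix=f"{name}.")  # e.g. --nested.baz
        elif isinstance(value, bool):
            if value:  # assumption: True is a bare flag, False is omitted
                tokens.append(f"--{name}")
        elif isinstance(value, list):
            tokens += [f"--{name}", *map(fmt, value)]
        else:
            tokens += [f"--{name}", fmt(value)]
    return tokens


mydict = {"foo": "foo", "bar": 1, "bool": True,
          "nested": {"baz": "bar"}, "list": [2, 3, "qux"]}
command = " ".join(unpack(mydict))

# Receiving side: a parser that the called script could define.
parser = argparse.ArgumentParser()
parser.add_argument("--foo")
parser.add_argument("--bar", type=int)
parser.add_argument("--bool", action="store_true")
parser.add_argument("--nested.baz", dest="nested_baz")
parser.add_argument("--list", nargs="+")
args = parser.parse_args(shlex.split(command))
```

Note that list elements arrive as strings on the receiving side, so the script has to convert them if it needs numbers.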

dvc config parsing can be used to customize the syntax used for ambiguous types like booleans and lists.

Variables

As an alternative to relying on parameters files, values for substitution can be listed as top-level vars like this:

vars:
  - models:
      us:
        threshold: 10
  - desc: 'Reusable description'

stages:
  build-us:
    desc: ${desc}
    cmd: python train.py --thresh ${models.us.threshold}

Values from vars are not tracked like parameters.

To load additional params files, list them in the top vars, in the desired order, e.g.:

vars:
  - params.json
  - myvar: 'value'
  - config/myapp.yaml

Notes

  • The default params.yaml file is always loaded first, if present.
  • Param file paths will be evaluated based on wdir, if specified.

It's also possible to specify what to include from additional params files, with a : colon:

vars:
  - params.json:clean,feats

stages:
  featurize:
    cmd: ${feats.exec}
    deps:
      - ${clean.filename}
    outs:
      - ${feats.dirname}
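A rough Python sketch of how an entry such as params.json:clean,feats could be interpreted is shown below. It is an illustration only. The function names are hypothetical, and splitting on the first colon is an assumption that would not suit paths which themselves contain a colon.

```python
def parse_vars_entry(entry: str):
    """Split "path:key1,key2" into the path and the list of selected keys."""
    path, sep, keys = entry.partition(":")
    # Without a ":" the whole file is loaded, signalled here by None.
    return path, (keys.split(",") if sep else None)


def select_keys(loaded: dict, keys) -> dict:
    """Keep only the selected top-level keys of an already loaded file."""
    if keys is None:
        return loaded
    return {key: loaded[key] for key in keys}
```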

Stage-specific values are also supported, with inner vars. You may also load additional params files locally. For example:

stages:
  build-us:
    vars:
      - params.json:build
      - model:
          filename: 'model-us.hdf5'
    cmd: python train.py ${build.epochs} --out ${model.filename}
    outs:
      - ${model.filename}

DVC merges values from params files and vars in each scope when possible. For example, {"grp": {"a": 1}} merges with {"grp": {"b": 2}}, but not with {"grp": {"a": 7}}.
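The merge rule can be pictured with a short recursive function. This is a sketch under the assumption that any attempt to redefine an existing non-dictionary value is rejected. It is not DVC's code, and the function name is hypothetical.

```python
def merge_vars(base: dict, extra: dict, path: str = "") -> dict:
    """Merge extra into a copy of base, refusing to overwrite existing values."""
    merged = dict(base)
    for key, value in extra.items():
        where = f"{path}.{key}" if path else key
        if key not in merged:
            merged[key] = value  # new key: simply added
        elif isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = merge_vars(merged[key], value, where)  # descend
        else:
            raise ValueError(f"cannot redefine '{where}'")
    return merged
```

Here merge_vars({"grp": {"a": 1}}, {"grp": {"b": 2}}) yields {"grp": {"a": 1, "b": 2}}, while merging with {"grp": {"a": 7}} raises an error for grp.a.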

Known limitations of local vars:

  • wdir cannot use values from local vars, as DVC uses the working directory first (to load any values from params files listed in vars).
  • foreach is also incompatible with local vars at the moment.

The substitution expression supports these forms:

${param} # Simple
${param.key} # Nested values through . (period)
${param.list[0]} # List elements via index in [] (square brackets)

To use the expression literally in dvc.yaml (so DVC does not replace it for a value), escape it with a backslash, e.g. \${....
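The three forms and the backslash escape can be mimicked with a small regular-expression-based resolver. This Python sketch is illustrative only (DVC's own parser is not shown in this document), and all names in it are hypothetical.

```python
import re

# "${...}" not preceded by a backslash; group 1 is the expression inside.
_EXPR = re.compile(r"(?<!\\)\$\{([^}]+)\}")
# Path tokens: either a name, or an integer index in square brackets.
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def lookup(context: dict, expression: str):
    """Resolve "param", "param.key", or "param.list[0]" against context."""
    node = context
    for name, index in _TOKEN.findall(expression.strip()):
        node = node[int(index)] if index else node[name]
    return node


def substitute(text: str, context: dict) -> str:
    """Replace unescaped ${...} expressions and unescape literal ones."""
    replaced = _EXPR.sub(lambda m: str(lookup(context, m.group(1))), text)
    return replaced.replace("\\${", "${")
```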

foreach stages

You can define more than one stage in a single dvc.yaml entry with the following syntax. A foreach element accepts a list or dictionary with values to iterate on, while do contains the regular stage fields (cmd, outs, etc.). Here's a simple example:

stages:
  cleanups:
    foreach: # List of simple values
      - raw1
      - labels1
      - raw2
    do:
      cmd: clean.py "${item}"
      outs:
        - ${item}.cln

Upon dvc repro, each item in the list is expanded into its own stage by substituting its value in expression ${item}. The item's value is appended to each stage name after a @. The final stages generated by the foreach syntax are saved to dvc.lock:

schema: '2.0'
stages:
  cleanups@labels1:
    cmd: clean.py "labels1"
    outs:
      - path: labels1.cln
  cleanups@raw1:
    cmd: clean.py "raw1"
    outs:
      - path: raw1.cln
  cleanups@raw2:
    cmd: clean.py "raw2"
    outs:
      - path: raw2.cln

For lists containing complex values (e.g. dictionaries), the substitution expression can use the ${item.key} form. Stage names will be appended with a zero-based index. For example:

stages:
  train:
    foreach:
      - epochs: 3
        thresh: 10
      - epochs: 10
        thresh: 15
    do:
      cmd: python train.py ${item.epochs} ${item.thresh}

# dvc.lock
schema: '2.0'
stages:
  train@0:
    cmd: python train.py 3 10
  train@1:
    cmd: python train.py 10 15

DVC can also iterate on a dictionary given directly to foreach, resulting in two substitution expressions being available: ${key} and ${item}. The former is used for the stage names:

stages:
  build:
    foreach:
      uk:
        epochs: 3
        thresh: 10
      us:
        epochs: 10
        thresh: 15
    do:
      cmd: python train.py '${key}' ${item.epochs} ${item.thresh}
      outs:
        - model-${key}.hdfs

# dvc.lock
schema: '2.0'
stages:
  build@uk:
    cmd: python train.py 'uk' 3 10
    outs:
      - path: model-uk.hdfs
        md5: 17b3d1efc339b416c4b5615b1ce1b97e
  build@us: ...

Both resulting stages (train@1, build@uk) and source groups (train, build) may be used in commands that accept stage targets, such as dvc repro and dvc stage list.
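The naming rules described in this section can be summarized in a few lines of Python. This is a sketch of the rules as stated here, not DVC's implementation. The function name is hypothetical, and deciding per item whether to use the value or the index is an assumption for lists that mix simple and complex values.

```python
def expand_foreach(group: str, foreach) -> dict:
    """Map each generated stage name to the values usable in its do block."""
    stages = {}
    if isinstance(foreach, dict):
        # Dictionaries expose both ${key} and ${item}; the key names the stage.
        for key, item in foreach.items():
            stages[f"{group}@{key}"] = {"key": key, "item": item}
    else:
        for index, item in enumerate(foreach):
            # Simple values name the stage; complex ones use a zero-based index.
            suffix = index if isinstance(item, (dict, list)) else item
            stages[f"{group}@{suffix}"] = {"item": item}
    return stages
```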

Importantly, dictionaries from parameters files can be used in foreach stages as well:

stages:
  mystages:
    foreach: ${myobject} # From params.yaml
    do:
      cmd: ./script.py ${key} ${item.prop1}
      outs:
        - ${item.prop2}

Both individual foreach stages (train@1) and groups of foreach stages (train) may be used in commands that accept stage targets.

Top-level plot definitions

The list of plots contains one or more user-defined dvc plots configurations. Every plot must have a unique ID, which may be either a file or directory path (relative to the location of dvc.yaml) or an arbitrary string. If the ID is an arbitrary string, a data source must be provided in the y field (x data source is always optional and cannot be the only data source provided). Optional configuration fields can be provided as well.

Here's an example plotting ROC and precision-recall curves on the same plot:

plots:
  - roc_vs_prc:
      x:
        precision_recall.json: recall
        roc.json: fpr
      y:
        precision_recall.json: precision
        roc.json: tpr
      title: ROC vs Precision-Recall

Refer to Visualizing Plots and dvc plots show for more examples.

Available configuration fields

  • y - source of the Y axis data:

    • Top-level plots: accepts string, list, or dictionary (like data_source_path: column/field_name).

    • Plot outputs: column/field name found in the source plots file.

  • x (string) - source of the X axis data. An auto-generated step field is used by default.

    • Top-level plots: multiple x values are supported, but only if they match the number of y values and are specified as a dictionary (list is not supported).

    • Plot outputs: column/field name found in the source plots file.

  • y_label (string) - Y axis label. If all y data sources have the same field name, that will be the default. Otherwise, it's "y".

  • x_label (string) - X axis label. If all x data sources have the same field name, that will be the default. Otherwise, it's "x".

  • title (string) - header for the plot(s). Defaults:

    • Top-level plots: path/to/dvc.yaml::plot_id
    • Plot outputs: path/to/data.csv
  • template (string) - plot template. Defaults to linear.

dvc.lock file

To record the state of your pipeline(s) and help track its outputs, DVC will maintain a dvc.lock file for each dvc.yaml. Their purposes include:

  • Allowing DVC to detect when stage definitions or their dependencies have changed. Such conditions invalidate stages, requiring their reproduction (see dvc status).
  • Tracking intermediate and final outputs of a pipeline, similar to .dvc files.
  • Supporting several DVC commands that need this file to operate, such as dvc checkout or dvc get.

Avoid editing these files. DVC will create and update them for you.

Here's an example:

schema: '2.0'
stages:
  features:
    cmd: jupyter nbconvert --execute featurize.ipynb
    deps:
      - path: data/clean
        md5: d8b874c5fa18c32b2d67f73606a1be60
    params:
      params.yaml:
        levels.no: 5
    outs:
      - path: features
        md5: 2119f7661d49546288b73b5730d76485
        size: 154683
      - path: performance.json
        md5: ea46c1139d771bfeba7942d1fbb5981e
        size: 975
      - path: logs.csv
        md5: f99aac37e383b422adc76f5f1fb45004
        size: 695947

Stages are listed again in dvc.lock, in order to know if their definitions change in dvc.yaml.

Regular dependency entries and all forms of output entries (including metrics and plots files) are also listed (per stage) in dvc.lock, including a content hash field (md5, etag, or checksum).
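To give an intuition for how a recorded content hash lets a tool notice changes, the following Python sketch computes the MD5 of a single file and compares it with a stored value. It is a simplified illustration: how DVC hashes directories or other storage types is not covered, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path: Path, recorded_md5: str) -> bool:
    """True when the file content no longer matches the recorded hash."""
    return file_md5(path) != recorded_md5
```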

Full parameter dependencies (both key and value) are listed too (under params), under each parameters file name. For templated dvc.yaml files, the actual values are written to dvc.lock (no ${} expression). As for foreach stages, individual stages are expanded (no foreach structures are preserved).
