dvc.yaml

You can configure machine learning projects in one or more dvc.yaml files. The list of stages is typically the most important part of a dvc.yaml file, though the file can also be used to configure artifacts, metrics, params, and plots, either as part of a stage definition or on their own.

dvc.yaml files use the YAML 1.2 format and a human-friendly schema explained below. We encourage you to get familiar with it so that you can modify, write, or generate these files yourself.

dvc.yaml files are designed to be small enough so you can easily version them with Git along with other DVC files and your project's code.

Artifacts

This section allows you to declare structured metadata about your artifacts.

artifacts:
  cv-classification: # artifact ID (name)
    path: models/resnet.pt
    type: model
    desc: 'CV classification model, ResNet50'
    labels:
      - resnet50
      - classification
    meta:
      framework: pytorch

For every artifact ID you can specify the following elements (only path is mandatory):

  • path (string) - The path to the artifact, either relative to the root of the repository or a full path in an external storage such as S3.
  • type (string) - You can specify artifacts of any type. By default, the DVC-based model registry will show any artifacts with type model (you can adjust the filters to also show artifacts of other types).
  • desc (string) - A description of your artifact.
  • labels (list) - Any labels you want to add to the artifact.
  • meta - Any extra information. The content of this element is ignored by DVC and will not show up in the model registry.

Artifact IDs must consist of letters and numbers, and use '-' as separator (but not at the start or end).
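
As a rough illustration of the naming rule above, here is a minimal Python sketch. The pattern is only one reading of that sentence (ASCII letters and digits, with '-' allowed anywhere except the first and last position); it is not DVC's own validator, and the function name is hypothetical.

```python
import re

# Assumption: this pattern is one reading of the rule stated above,
# not the check that DVC itself performs.
ARTIFACT_ID = re.compile(r"[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?")


def is_valid_artifact_id(name: str) -> bool:
    """Return True if name uses only letters, digits and inner '-' separators."""
    return ARTIFACT_ID.fullmatch(name) is not None


print(is_valid_artifact_id("cv-classification"))  # True
print(is_valid_artifact_id("-resnet"))            # False
```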

To migrate from the old GTO-based Model Registry by moving artifact annotations from artifacts.yaml to dvc.yaml, use this helper script.

Metrics

The list of metrics contains one or more paths to metrics files. Here's an example:

metrics:
  - metrics.json

Metrics are key/value pairs saved in structured files that map a metric name to a numeric value. See dvc metrics for more information and how to compare among experiments, or DVCLive for a helper to log metrics.
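
A metrics file like the metrics.json listed above could be produced by a training script along these lines. This is only a sketch: the metric names and values are made up, and DVCLive (mentioned above) is the helper the documentation points to for real logging.

```python
import json
from pathlib import Path

# Hypothetical metric names and values, for illustration only.
metrics = {"accuracy": 0.92, "loss": 0.31}

# Rewrite the whole file on each run instead of appending to it.
Path("metrics.json").write_text(json.dumps(metrics, indent=2))
```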

Params

The list of params contains one or more paths to parameters files. Here's an example:

params:
  - params.yaml

Parameters are key/value pairs saved in structured files. Unlike stage-level parameter dependencies, which are granular, top-level parameters are defined at the file level and include all parameters in the file. See dvc params for more information and how to compare between experiments.

Plots

The list of plots contains one or more user-defined dvc plots configurations. Every plot must have a unique ID, which may be either a file or directory path (relative to the location of dvc.yaml) or an arbitrary string. If the ID is an arbitrary string, a file path must be provided in the y field (x file path is always optional and cannot be the only path provided).

Refer to Visualizing Plots and dvc plots show for more examples, and refer to DVCLive for a helper to log plots.

Available configuration fields

  • y (string, list, dict) - source for the Y axis data:

    If plot ID is a path, one or more column/field names are expected, or the last column/field is used by default. For example:

    plots:
      - regression_hist.csv:
          y: mean_squared_error
      - classifier_hist.csv:
          y: [acc, loss]

    If plot ID is an arbitrary string, a dictionary of file paths mapped to column/field names is expected. For example:

    plots:
      - train_val_test:
          y:
            train.csv: [train_acc, val_acc]
            test.csv: test_acc
  • x (string, dict) - source for the X axis data. An auto-generated step field is used by default.

    If plot ID is a path, one column/field name is expected. For example:

    plots:
      - classifier_hist.csv:
          y: [acc, loss]
          x: epoch

    If plot ID is an arbitrary string, x may either be one column/field name, or a dictionary of file paths each mapped to one column/field name (the number of column/field names must match the number in y).

    plots:
      - train_val_test: # single x
          y:
            train.csv: [train_acc, val_acc]
            test.csv: test_acc
          x: epoch
      - roc_vs_prc: # x dict
          y:
            precision_recall.json: precision
            roc.json: tpr
          x:
            precision_recall.json: recall
            roc.json: fpr
      - confusion: # different x and y paths
          y:
            dir/preds.csv: predicted
          x:
            dir/actual.csv: actual
          template: confusion
  • y_label (string) - Y axis label. If all y data sources have the same field name, that will be the default. Otherwise, it's "y".

  • x_label (string) - X axis label. If all x data sources have the same field name, that will be the default. Otherwise, it's "x".

  • title (string) - header for the plot(s). Defaults to path/to/dvc.yaml::plot_id.

  • template (string) - plot template. Defaults to linear.
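
To make the column/field names concrete, the sketch below writes a file shaped like the classifier_hist.csv used in the examples above, with epoch, acc, and loss columns that the x and y fields can refer to. The numbers are invented for illustration.

```python
import csv

# Hypothetical training history; column names match the classifier_hist.csv example.
history = [
    {"epoch": 1, "acc": 0.71, "loss": 0.62},
    {"epoch": 2, "acc": 0.83, "loss": 0.41},
]

with open("classifier_hist.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["epoch", "acc", "loss"])
    writer.writeheader()
    writer.writerows(history)
```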

Stages

You can construct machine learning pipelines by defining individual stages in one or more dvc.yaml files. Stages constitute a pipeline when they connect with each other (forming a dependency graph, see dvc dag).

The list of stages contains one or more user-defined stages. Here's a simple one named transpose:

stages:
  transpose:
    cmd: ./trans.r rows.txt > columns.txt
    deps:
      - rows.txt
    outs:
      - columns.txt

A helper command group, dvc stage, is available to create and list stages.

The only required part of a stage is the shell command(s) it executes (cmd field). This is what DVC runs when the stage is reproduced (see dvc repro).

We use GNU/Linux in our examples, but Windows or other shells can be used too.

If a stage command reads input files, these (or their directory locations) can be defined as dependencies (deps). DVC will check whether they have changed to decide whether the stage requires re-execution (see dvc status).

If it writes files or directories, these can be defined as outputs (outs). DVC will track them going forward (similar to using dvc add on them).

Output files may be viable data sources for plots.

See the full stage entry specification.
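
The way stages connect into a dependency graph can be sketched in a few lines of Python: a stage depends on another when one of its deps is listed in the other's outs. This is an illustration of the idea only, not DVC's implementation, and the summarize stage is a hypothetical addition to the transpose example.

```python
from graphlib import TopologicalSorter

# Hypothetical two-stage pipeline; "transpose" is the stage from the example above.
stages = {
    "transpose": {"deps": ["rows.txt"], "outs": ["columns.txt"]},
    "summarize": {"deps": ["columns.txt"], "outs": ["summary.txt"]},
}

# Map every output path to the stage that produces it.
producers = {out: name for name, s in stages.items() for out in s.get("outs", [])}

# A stage depends on whichever stages produce its dependencies.
graph = {
    name: {producers[d] for d in s.get("deps", []) if d in producers}
    for name, s in stages.items()
}

order = list(TopologicalSorter(graph).static_order())
print(order)  # ['transpose', 'summarize']
```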

Stage commands

The command(s) defined in the stages (cmd field) can be anything your system terminal would accept and run, for example a shell built-in, an expression, or a binary found in PATH.

Surround the command with double quotes " if it includes special characters like | or <, >. Use single quotes ' instead if there are environment variables in it that should be evaluated dynamically.

The same applies to the command argument for helper commands (dvc stage add), otherwise they would apply to the DVC call itself:

$ dvc stage add -n a_stage "./a_script.sh > /dev/null 2>&1"

See also Templating (and Dictionary unpacking) for useful ways to parametrize cmd strings.

We don't want to tell anyone how to write their code or what programs to use! However, please be aware that in order to prevent unexpected results when DVC reproduces pipeline stages, the underlying code should ideally follow these rules:

  • Read/write exclusively from/to the specified dependencies and outputs (including parameters files, metrics, and plots).
  • Completely rewrite outputs. Do not append or edit.
  • Stop reading and writing files when the command exits.

Also, if your pipeline reproducibility goals include consistent output data, its code should be deterministic (produce the same output for any given input): avoid code that increases entropy (e.g. random numbers, time functions, hardware dependencies, etc.).
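
As a small sketch of these rules, the hypothetical stage script below reads only its declared input, rewrites its output from scratch, and uses a fixed seed so that repeated runs give the same result. The function name, arguments, and seed handling are assumptions made for illustration.

```python
import random


def sample_rows(in_path: str, out_path: str, k: int, seed: int = 0) -> None:
    """Write k randomly chosen lines of in_path to out_path, reproducibly."""
    with open(in_path) as f:
        rows = f.readlines()
    rng = random.Random(seed)          # fixed seed: same input gives same output
    chosen = rng.sample(rows, k)
    with open(out_path, "w") as f:     # "w" truncates: the output is fully rewritten
        f.writelines(chosen)
```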

Parameters

Parameters are simple key/value pairs consumed by the command code from a structured parameters file. They are defined per-stage in the params field of dvc.yaml and should contain one of these:

  1. A param name that can be found in params.yaml (default params file);
  2. A dictionary named by the file path to a custom params file, and with a list of param key/value pairs to find in it;
  3. An empty set (give no value or use null) named by the file path to a params file: to track all the params in it dynamically.

Dot-separated param names become tree paths to locate values in the params file.

stages:
  preprocess:
    cmd: bin/cleanup raw.txt clean.txt
    deps:
      - raw.txt
    params:
      - threshold # track specific param (from params.yaml)
      - nn.batch_size
      - myparams.yaml: # track specific params from custom file
          - epochs
      - config.json: # track all parameters in this file
    outs:
      - clean.txt

Params are a more granular type of stage dependency: multiple stages can use the same params file, but only certain values will affect their state (see dvc status).
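
The dot-separated lookup described above (for example nn.batch_size) can be pictured as walking a nested dictionary. The following sketch assumes the params file has already been parsed into a Python dict; the values and the lookup helper are hypothetical.

```python
from functools import reduce

# Hypothetical contents of params.yaml, already parsed into a dict.
params = {"threshold": 0.5, "nn": {"batch_size": 32, "layers": 4}}


def lookup(tree: dict, dotted_name: str):
    """Follow a dot-separated name such as 'nn.batch_size' down the tree."""
    return reduce(lambda node, key: node[key], dotted_name.split("."), tree)


print(lookup(params, "nn.batch_size"))  # 32
```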

Parameters files

The supported params file formats are YAML 1.2, JSON, TOML 1.0, and Python. Parameter key/value pairs should be organized in tree-like hierarchies inside. Supported value types are: string, integer, float, boolean, and arrays (groups of params).

These files are typically written manually (or generated) and they can be versioned directly with Git along with other workspace files.

See also dvc params diff to compare params across project versions.

foreach stages

Check out matrix stages for a more powerful way to define multiple stages.

You can define more than one stage in a single dvc.yaml entry with the following syntax. A foreach element accepts a list or dictionary with values to iterate on, while do contains the regular stage fields (cmd, outs, etc.). Here's a simple example:

stages:
  cleanups:
    foreach: # List of simple values
      - raw1
      - labels1
      - raw2
    do:
      cmd: clean.py "${item}"
      outs:
        - ${item}.cln

Upon dvc repro, each item in the list is expanded into its own stage by substituting its value in expression ${item}. The item's value is appended to each stage name after a @. The final stages generated by the foreach syntax are saved to dvc.lock:

schema: '2.0'
stages:
  cleanups@labels1:
    cmd: clean.py "labels1"
    outs:
      - path: labels1.cln
  cleanups@raw1:
    cmd: clean.py "raw1"
    outs:
      - path: raw1.cln
  cleanups@raw2:
    cmd: clean.py "raw2"
    outs:
      - path: raw2.cln

For lists containing complex values (e.g. dictionaries), the substitution expression can use the ${item.key} form. Stage names will be appended with a zero-based index. For example:

stages:
  train:
    foreach:
      - epochs: 3
        thresh: 10
      - epochs: 10
        thresh: 15
    do:
      cmd: python train.py ${item.epochs} ${item.thresh}

# dvc.lock
schema: '2.0'
stages:
  train@0:
    cmd: python train.py 3 10
  train@1:
    cmd: python train.py 10 15

DVC can also iterate on a dictionary given directly to foreach, resulting in two substitution expressions being available: ${key} and ${item}. The former is used for the stage names:

stages:
  build:
    foreach:
      uk:
        epochs: 3
        thresh: 10
      us:
        epochs: 10
        thresh: 15
    do:
      cmd: python train.py '${key}' ${item.epochs} ${item.thresh}
      outs:
        - model-${key}.hdfs

# dvc.lock
schema: '2.0'
stages:
  build@uk:
    cmd: python train.py 'uk' 3 10
    outs:
      - path: model-uk.hdfs
        md5: 17b3d1efc339b416c4b5615b1ce1b97e
  build@us: ...

Both resulting stages (train@1, build@uk) and source groups (train, build) may be used in commands that accept stage targets, such as dvc repro and dvc stage list.

Importantly, dictionaries from parameters files can be used in foreach stages as well:

stages:
  mystages:
    foreach: ${myobject} # From params.yaml
    do:
      cmd: ./script.py ${key} ${item.prop1}
      outs:
        - ${item.prop2}

Both individual foreach stages (train@1) and groups of foreach stages (train) may be used in commands that accept stage targets.
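
The three naming rules above (value for simple lists, zero-based index for lists of dictionaries, key for dictionaries) can be summarized in a short Python sketch. It mimics the expansion for cmd strings only and is not how DVC implements it; render and expand_foreach are hypothetical helpers.

```python
import re


def render(template: str, key, item) -> str:
    """Replace ${key}, ${item} and ${item.name} in template."""
    def sub(match):
        expr = match.group(1)
        if expr == "key":
            return str(key)
        if expr == "item":
            return str(item)
        return str(item[expr.split(".", 1)[1]])  # ${item.name}
    return re.sub(r"\$\{(key|item(?:\.\w+)?)\}", sub, template)


def expand_foreach(name: str, foreach, cmd: str) -> dict:
    """Return {stage_name: cmd} for a foreach list or dictionary."""
    if isinstance(foreach, dict):
        pairs = list(foreach.items())                  # suffix is the dict key
    elif all(isinstance(v, (str, int, float)) for v in foreach):
        pairs = [(v, v) for v in foreach]              # suffix is the value itself
    else:
        pairs = list(enumerate(foreach))               # suffix is a zero-based index
    return {f"{name}@{key}": render(cmd, key, item) for key, item in pairs}


print(expand_foreach("cleanups", ["raw1", "labels1", "raw2"], 'clean.py "${item}"'))
print(expand_foreach(
    "train",
    [{"epochs": 3, "thresh": 10}, {"epochs": 10, "thresh": 15}],
    "python train.py ${item.epochs} ${item.thresh}",
))
```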

matrix stages

matrix allows you to define multiple stages based on combinations of variables. A matrix element accepts one or more variables, each iterating over a list of values. For example:

stages:
  train:
    matrix:
      model: [cnn, xgb]
      feature: [feature1, feature2, feature3]
    cmd: ./train.py --feature ${item.feature} ${item.model}
    outs:
      - ${key}.pkl # equivalent to: ${item.model}-${item.feature}.pkl

You can reference each variable in your stage definition through the item dictionary. In the above example, you can access item.model and item.feature. In addition, matrix exposes a key value, which combines the current item values into one expression (as in the outs field of the example above).

On dvc repro, DVC expands the definition into one stage for each possible combination of the variables. In the above example, DVC creates six stages, one for each combination of model and feature. Stage names are generated by appending the values of the variables to the stage name after a @, as with foreach. For example, DVC will create the following stages:

$ dvc stage list
train@cnn-feature1  Outputs cnn-feature1.pkl
train@cnn-feature2  Outputs cnn-feature2.pkl
train@cnn-feature3  Outputs cnn-feature3.pkl
train@xgb-feature1  Outputs xgb-feature1.pkl
train@xgb-feature2  Outputs xgb-feature2.pkl
train@xgb-feature3  Outputs xgb-feature3.pkl

Both individual matrix stages (e.g. train@cnn-feature1) and groups of matrix stages (train) may be used in commands that accept stage targets.
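
The expansion into combinations can be pictured with itertools.product. The sketch below covers simple values only and reproduces the six stage names listed above; expand_matrix is a hypothetical helper, not part of DVC.

```python
from itertools import product

matrix = {"model": ["cnn", "xgb"], "feature": ["feature1", "feature2", "feature3"]}


def expand_matrix(name: str, matrix: dict) -> dict:
    """Return {stage_name: item} for every combination of simple values."""
    stages = {}
    for values in product(*matrix.values()):
        item = dict(zip(matrix.keys(), values))
        key = "-".join(str(v) for v in values)   # plays the role of ${key}
        stages[f"{name}@{key}"] = item
    return stages


for stage_name in expand_matrix("train", matrix):
    print(stage_name)
```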

The values in variables can be simple values such as strings or integers, or composite values such as lists or dictionaries. For example:

matrix:
  config:
    - n_estimators: 150
      max_depth: 20
    - n_estimators: 120
      max_depth: 30
  labels:
    - [label1, label2, label3]
    - [labelX, labelY, labelZ]

When a value is a list or a dictionary, DVC generates stage names from the variable name and the index of the value. In the above example, generated stages may look like train@labels0-config0.

Templating can also be used inside matrix, so you can reference variables defined elsewhere. For example, you can define values in params.yaml file and use them in matrix.

# params.yaml
datasets: [dataset1/, dataset2/]
processors: [processor1, processor2]

# dvc.yaml
stages:
  preprocess:
    matrix:
      processor: ${processors}
      dataset: ${datasets}
    cmd: ./preprocess.py ${item.dataset} ${item.processor}
    deps:
    - ${item.dataset}
    outs:
    - ${item.dataset}-${item.processor}.json

Stage entries

These are the fields that are accepted in each stage:

  • cmd - (Required) One or more shell commands to execute (may contain either a single value or a list). cmd values may use dictionary substitution from param files. Commands are executed sequentially until all are finished or until one of them fails (see dvc repro).
  • wdir - Working directory for the cmd to run in (relative to the file's location). Any paths in other fields are also based on this. It defaults to . (the file's location).
  • deps - List of dependency paths (relative to wdir).
  • outs - List of output paths (relative to wdir). These can contain certain optional subfields.
  • params - List of parameter dependency keys (field names) to track from params.yaml (in wdir). The list may also contain other parameters file names, with a sub-list of the param names to track in them.
  • frozen - Whether or not this stage is frozen (prevented from execution during reproduction).
  • always_changed - Causes this stage to be always considered as changed by commands such as dvc status and dvc repro. false by default.
  • meta - (Optional) Arbitrary metadata can be added manually with this field. Any YAML content is supported. meta contents are ignored by DVC, but they can be meaningful for user processes that read or write .dvc files directly.
  • desc - (Optional) User description. This doesn't affect any DVC operations.

dvc.yaml files also support # comments.

We maintain a dvc.yaml schema that can be used by editors like VSCode or PyCharm to enable automatic syntax validation and auto-completion.

Output subfields

These include a subset of the fields in .dvc file output entries.

  • cache - Whether or not this file or directory is cached (true by default). See the --no-commit option of dvc add. If any output of a stage has cache: false, the run cache will be deactivated for that stage.
  • remote - (Optional) Name of the remote to use for pushing/fetching.
  • persist - Whether the output file/dir should remain in place during dvc repro (false by default: outputs are deleted when dvc repro starts).
  • desc - (Optional) User description for this output. This doesn't affect any DVC operations.
  • push - Whether or not this file or directory, when previously cached, is uploaded to remote storage by dvc push (true by default).

Templating

dvc.yaml supports a templating format to insert values from different sources in the YAML structure itself. These sources can be parameters files, or vars defined in dvc.yaml instead.

Let's say we have params.yaml (default params file) with the following contents:

models:
  us:
    threshold: 10
    filename: 'model-us.hdf5'
codedir: src

Those values can be used anywhere in dvc.yaml with the ${} substitution expression, for example to pass parameters as command-line arguments to a stage command:

artifacts:
  model-us:
    path: ${models.us.filename}
    type: model

stages:
  build-us:
    cmd: >-
      python ${codedir}/train.py
      --thresh ${models.us.threshold}
      --out ${models.us.filename}
    outs:
      - ${models.us.filename}:
          cache: true

DVC will track simple param values (numbers, strings, etc.) used in ${} (they will be listed by dvc params diff).

Only inside the cmd entries, you can also reference a dictionary inside ${} and DVC will unpack it. This can be useful to avoid writing every argument passed to the command, or having to modify dvc.yaml when arguments change.

An alternative for loading parameters from Python code is the dvc.api.params_show() API function.

For example, given the following params.yaml:

mydict:
  foo: foo
  bar: 1
  bool: true
  nested:
    baz: bar
  list: [2, 3, 'qux']

You can reference mydict in a stage command like this:

stages:
  train:
    cmd: R train.r ${mydict}

DVC will unpack the values inside mydict, creating the following cmd call:

$ R train.r --foo 'foo' --bar 1 --bool \
                  --nested.baz 'bar' --list 2 3 'qux'

You can combine this with argument parsing libraries such as R argparse or Julia ArgParse to do all the work for you.

dvc config parsing can be used to customize the syntax used for ambiguous types like booleans and lists.
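
To show the shape of this unpacking, here is a Python sketch that reproduces the argument string from the example above. It is an approximation for illustration: the helper names are hypothetical, and leaving out flags whose value is false is an assumption, since the syntax for booleans and lists is configurable.

```python
def flatten(d: dict, prefix: str = "") -> dict:
    """Turn nested dictionaries into dot-separated keys."""
    flat = {}
    for key, value in d.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def fmt(value) -> str:
    """Quote strings, leave numbers as they are."""
    return f"'{value}'" if isinstance(value, str) else str(value)


def unpack(d: dict) -> str:
    """Render a params dictionary as command-line arguments."""
    args = []
    for name, value in flatten(d).items():
        if isinstance(value, bool):
            if value:                      # assumption: a false flag is left out
                args.append(f"--{name}")
        elif isinstance(value, list):
            args.append(f"--{name} " + " ".join(fmt(v) for v in value))
        else:
            args.append(f"--{name} {fmt(value)}")
    return " ".join(args)


mydict = {"foo": "foo", "bar": 1, "bool": True, "nested": {"baz": "bar"}, "list": [2, 3, "qux"]}
print(unpack(mydict))
# --foo 'foo' --bar 1 --bool --nested.baz 'bar' --list 2 3 'qux'
```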

Variables

Alternatively (to relying on parameter files), values for substitution can be listed as top-level vars like this:

vars:
  - models:
      us:
        threshold: 10
        filename: 'model-us.hdf5'
  - codedir: src

artifacts:
  model-us:
    path: ${models.us.filename}
    type: model

stages:
  build-us:
    cmd: >-
      python ${codedir}/train.py --thresh ${models.us.threshold} --out
      ${models.us.filename}
    outs:
      - ${models.us.filename}:
          cache: true

Values from vars are not tracked like parameters.

To load additional params files, list them in the top-level vars, in the desired order, e.g.:

vars:
  - params.json
  - myvar: 'value'
  - config/myapp.yaml

Notes

Param file paths will be evaluated relative to the directory the dvc.yaml file is in. The default params.yaml is always loaded first, if present.

It's also possible to specify what to include from additional params files, with a : colon:

vars:
  - params.json:clean,feats

stages:
  featurize:
    cmd: ${feats.exec}
    deps:
      - ${clean.filename}
    outs:
      - ${feats.dirname}

DVC merges values from param files or values specified in vars. For example, {"grp": {"a": 1}} merges with {"grp": {"b": 2}}, but not with {"grp": {"a": 7}}.
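
The merge rule can be sketched as a recursive dictionary merge that refuses to redefine an existing value. This is an illustration of the behaviour described above, not DVC's code; in particular, treating any repeated leaf key as a conflict (even with an equal value) is an assumption of this sketch.

```python
def merge(base: dict, extra: dict, path: str = "") -> dict:
    """Recursively merge extra into a copy of base; refuse to overwrite values."""
    merged = dict(base)
    for key, value in extra.items():
        where = f"{path}{key}"
        if key not in merged:
            merged[key] = value
        elif isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = merge(merged[key], value, path=f"{where}.")
        else:
            raise ValueError(f"cannot redefine '{where}'")
    return merged


print(merge({"grp": {"a": 1}}, {"grp": {"b": 2}}))  # {'grp': {'a': 1, 'b': 2}}
```

Calling merge({"grp": {"a": 1}}, {"grp": {"a": 7}}) raises ValueError in this sketch, which mirrors the second case in the sentence above.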

The substitution expression supports these forms:

${param} # Simple
${param.key} # Nested values through . (period)
${param.list[0]} # List elements via index in [] (square brackets)

To use the expression literally in dvc.yaml (so DVC does not replace it for a value), escape it with a backslash, e.g. \${....
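
The three expression forms and the backslash escape can be imitated with a small interpolation function. This is a sketch under stated assumptions: keys are assumed to consist of word characters only, the escaping backslash is assumed to be dropped from the result, and EXPR, STEP, resolve, and interpolate are hypothetical names.

```python
import re

# Matches ${...} unless it is preceded by a backslash.
EXPR = re.compile(r"(?<!\\)\$\{([^}]+)\}")
# One path step: either a (dot-prefixed) name or a [index].
STEP = re.compile(r"\.?(\w+)|\[(\d+)\]")


def resolve(context: dict, expr: str):
    """Walk 'param.key' and 'param.list[0]' style paths through context."""
    node = context
    for name, index in STEP.findall(expr):
        node = node[int(index)] if index else node[name]
    return node


def interpolate(text: str, context: dict) -> str:
    """Substitute every unescaped ${...} and drop the escaping backslash."""
    filled = EXPR.sub(lambda m: str(resolve(context, m.group(1))), text)
    return filled.replace(r"\${", "${")


context = {"codedir": "src", "models": {"us": {"threshold": 10}}, "param": {"list": ["a", "b"]}}
print(interpolate("python ${codedir}/train.py --thresh ${models.us.threshold}", context))
```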

dvc.lock file

To record the state of your pipeline(s) and help track its outputs, DVC will maintain a dvc.lock file for each dvc.yaml. Their purposes include:

  • Allowing DVC to detect when stage definitions or their dependencies have changed. Such conditions invalidate stages, requiring their reproduction (see dvc status).
  • Tracking intermediate and final outputs of a pipeline, similar to .dvc files.
  • Supporting several DVC commands that need it to operate, such as dvc checkout or dvc get.

Avoid editing these files. DVC will create and update them for you.

Here's an example:

schema: '2.0'
stages:
  features:
    cmd: jupyter nbconvert --execute featurize.ipynb
    deps:
      - path: data/clean
        md5: d8b874c5fa18c32b2d67f73606a1be60
    params:
      params.yaml:
        levels.no: 5
    outs:
      - path: features
        md5: 2119f7661d49546288b73b5730d76485
        size: 154683
      - path: performance.json
        md5: ea46c1139d771bfeba7942d1fbb5981e
        size: 975
      - path: logs.csv
        md5: f99aac37e383b422adc76f5f1fb45004
        size: 695947

Stages are listed again in dvc.lock, in order to know if their definitions change in dvc.yaml.

Regular dependency entries and all forms of output entries (including metrics and plots files) are also listed (per stage) in dvc.lock, including a content hash field (md5, etag, or checksum).
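
The role of a content hash can be illustrated by comparing a file's current digest with a previously recorded one. The sketch below assumes a plain MD5 over the raw file bytes; DVC's own hashing (for directories, or in other versions) may differ, and both helper names are hypothetical.

```python
import hashlib


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path: str, recorded_md5: str) -> bool:
    """Compare the current hash with the one recorded earlier."""
    return file_md5(path) != recorded_md5
```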

Full parameter dependencies (both key and value) are listed too (under params), under each parameters file name. For templated dvc.yaml files, the actual values are written to dvc.lock (no ${} expression). As for foreach stages and matrix stages, individual stages are expanded (no foreach or matrix structures are preserved).
