dvc.yaml file

dvc.yaml files describe data science or machine learning pipelines (similar to how Makefiles work for building software). Their YAML structure contains a list of stages, which can be written manually or generated by user code.

A helper command, dvc run, is available to add or update stages in dvc.yaml. Additionally, dvc run and dvc repro create or update a dvc.lock file to record the pipeline state.

Here's a comprehensive dvc.yaml example:

stages:
  features:
    cmd: jupyter nbconvert --execute featurize.ipynb
    deps:
      - data/clean
    params:
      - levels.no
    outs:
      - features
    metrics:
      - performance.json
  training:
    desc: Train model with Python
    cmd:
      - pip install -r requirements.txt
      - python train.py --out ${model_file}
    deps:
      - requirements.txt
      - train.py
      - features
    outs:
      - ${model_file}:
          desc: My model description
    plots:
      - logs.csv:
          x: epoch
          x_label: Epoch
    meta: 'For deployment'
    # User metadata and comments are supported.

💡 Keep in mind that there may be multiple dvc.yaml files in each DVC project. All of them are checked for consistency during operations that require rebuilding DAGs (like dvc repro).

Accepted fields

dvc.yaml files consist of a set of stages with names provided by the user (for example with the --name option of dvc run). Each stage entry can contain the following fields:

  • cmd (always present): One or more commands executed by the stage (may contain either a single value, or a list). Commands are executed sequentially until all are finished or until one of them fails (see dvc repro for details).
  • wdir: Working directory for the stage command to run in (relative to the file's location). Any paths in other fields are also based on this. It defaults to . (the file's location).
  • deps: List of dependency file or directory paths of this stage (relative to wdir).
  • params: List of parameter dependency keys (field names) to track from params.yaml (in wdir). The list may also contain other YAML, JSON, TOML, or Python file names, with a sub-list of the param names to track in them.
  • outs: List of output file or directory paths of this stage (relative to wdir). See Output entries for more details.
  • metrics: List of metrics files, and optionally, whether or not this metrics file is cached (true by default). See the --metrics-no-cache (-M) option of dvc run.
  • plots: List of plot metrics, and optionally, their default configuration (subfields matching the options of dvc plots modify), and whether or not this plots file is cached (true by default). See the --plots-no-cache option of dvc run.
  • frozen: Whether or not this stage is frozen from reproduction.
  • always_changed: Whether or not this stage is always considered as changed by commands such as dvc status and dvc repro (false by default).
  • meta (optional): Arbitrary metadata can be added manually with this field. Any YAML content is supported. meta contents are ignored by DVC, but they can be meaningful for user processes that read or write .dvc files directly.
  • desc (optional): User description for this stage. This doesn't affect any DVC operations.
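As a minimal sketch of how several of these fields combine, consider the hypothetical stage below. The stage name, file names, and parameter keys are invented for illustration. It runs from a src working directory, tracks one key from the default params.yaml plus one key from a separate JSON file, disables caching for its metrics file, and is frozen:

```yaml
stages:
  evaluate:                  # hypothetical stage name
    desc: Evaluate the trained model
    cmd: python evaluate.py  # executed inside wdir
    wdir: src                # relative to this dvc.yaml file's location
    deps:
      - evaluate.py          # resolved relative to wdir, i.e. src/evaluate.py
    params:
      - threshold            # key tracked from params.yaml in wdir
      - config.json:         # another parameters file, with the keys to track
          - model.layers
    metrics:
      - scores.json:
          cache: false       # this metrics file is not cached
    frozen: true             # stage is frozen from reproduction
    always_changed: false    # the default, shown only for completeness
```

Because every path is resolved relative to wdir, moving the stage to another directory should only require changing that one field.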

See Advanced dvc.yaml Usage for info on the ${} syntax, as well as foreach/do fields.

dvc.yaml files also support # comments.

Note that we maintain a dvc.yaml schema that can be used by editors like VSCode or PyCharm to enable automatic syntax validation and auto-completion.

Output entries

outs fields can contain these subfields:

  • cache: Whether or not this file or directory is cached (true by default). See the --no-commit option of dvc add.
  • persist: Whether the output file/dir should remain in place while dvc repro runs (false by default: outputs are deleted when dvc repro starts).
  • desc (optional): User description for this output. This doesn't affect any DVC operations.
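The subfields above are written as a nested mapping under each output path. A minimal sketch, assuming a hypothetical train stage whose file names are invented for illustration:

```yaml
stages:
  train:                       # hypothetical stage name
    cmd: python train.py
    deps:
      - train.py
    outs:
      - model.pkl:
          desc: Serialized model   # description only, no effect on DVC operations
      - checkpoints:
          persist: true            # left in place when dvc repro starts
      - summary.txt:
          cache: false             # listed as an output but not cached
```

An output that needs none of these subfields can stay a plain path string, as in the comprehensive example earlier.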

dvc.lock file

For every dvc.yaml file, a matching dvc.lock (YAML) file usually exists. It's created or updated by DVC commands such as dvc run and dvc repro. dvc.lock describes the latest pipeline state. It has several purposes:

  • Tracking of intermediate and final outputs of a pipeline, similar to .dvc files.
  • Allowing DVC to detect when stage definitions or their dependencies have changed. Such conditions invalidate stages, requiring their reproduction (see dvc status, dvc repro).
  • dvc.lock is needed internally for several DVC commands to operate, such as dvc checkout, dvc get, and dvc import.

Here's an example dvc.lock (based on the dvc.yaml example above):

stages:
  features:
    cmd: jupyter nbconvert --execute featurize.ipynb
    deps:
      - path: data/clean
        md5: d8b874c5fa18c32b2d67f73606a1be60
    params:
      params.yaml:
        levels.no: 5
    outs:
      - path: features
        md5: 2119f7661d49546288b73b5730d76485
      - path: performance.json
        md5: ea46c1139d771bfeba7942d1fbb5981e
      - path: logs.csv
        md5: f99aac37e383b422adc76f5f1fb45004

Stage commands are listed again in dvc.lock so that DVC can detect when their definitions change in dvc.yaml.

Regular dependencies and all kinds of outputs (including metrics and plots files) are also listed (per stage) in dvc.lock, but with an additional field with a hash of their last known contents. Specifically: md5, etag, or checksum are used (same as in deps and outs entries of .dvc files).

Full parameter dependencies (key and value) are listed too (under params), grouped by parameters file. In the case of templated dvc.yaml files, the actual values are substituted into the dvc.lock YAML structure.
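To illustrate the substitution, here is a simplified sketch that assumes a hypothetical params.yaml defining model_file: model.pkl. The stage is reduced to a single command, and the hash is a placeholder rather than a real value:

```yaml
# dvc.yaml (templated); assumes params.yaml contains "model_file: model.pkl"
stages:
  training:
    cmd: python train.py --out ${model_file}
    outs:
      - ${model_file}
---
# Corresponding dvc.lock entry: the template is replaced by its actual value
stages:
  training:
    cmd: python train.py --out model.pkl
    outs:
      - path: model.pkl
        md5: <hash of model.pkl contents>   # placeholder, not a real hash
```

Since dvc.lock records the resolved command and paths, changing the value of model_file should be enough for DVC to see the stage definition as changed.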
