dvc.yaml file

dvc.yaml files describe data science or machine learning pipelines (similar to
how Makefiles work for building software). Their YAML structure contains a list
of stages, which can be written manually or generated by user code.

A helper command, dvc run, is also available to add or update stages in
dvc.yaml. Additionally, a dvc.lock file is created or updated by dvc run and
dvc repro to record the pipelines' state.

Here's a comprehensive dvc.yaml example:

    stages:
      features:
        cmd: jupyter nbconvert --execute featurize.ipynb
        deps:
          - data/clean
        params:
          - levels.no
        outs:
          - features
        metrics:
          - performance.json
      training:
        desc: Train model with Python
        cmd:
          - pip install -r requirements.txt
          - python train.py --out ${model_file}
        deps:
          - requirements.txt
          - train.py
          - features
        outs:
          - ${model_file}:
              desc: My model description
        plots:
          - logs.csv:
              x: epoch
              x_label: Epoch
        meta: 'For deployment'
        # User metadata and comments are supported.

💡 Keep in mind that there may be multiple dvc.yaml files in each DVC project.
All of them are checked for consistency during operations that require
rebuilding DAGs (like dvc repro).

dvc.yaml files consist of a set of stages with names provided by the user (for
example with the --name option of dvc run). Each stage entry can contain the
following fields:

- cmd (always present): One or more commands executed by the stage (may
  contain either a single value or a list). Commands are executed sequentially
  until all are finished or until one of them fails (see dvc repro for
  details).
- wdir: Working directory for the stage command to run in (relative to the
  file's location). Any paths in other fields are also based on this. It
  defaults to . (the file's location).
- deps: List of dependency file or directory paths of this stage (relative to
  wdir).
- params: List of parameter dependency keys (field names) to track from
  params.yaml (in wdir). The list may also contain other YAML, JSON, TOML, or
  Python file names, with a sub-list of the param names to track in them.
- outs: List of output file or directory paths of this stage (relative to
  wdir). See Output entries for more details.
- metrics: List of metrics files, and optionally, whether or not each metrics
  file is cached (true by default). See the --metrics-no-cache (-M) option of
  dvc run.
- plots: List of plot metrics, and optionally, their default configuration
  (subfields matching the options of dvc plots modify), and whether or not
  each plots file is cached (true by default). See the --plots-no-cache option
  of dvc run.
- frozen: Whether or not this stage is frozen from reproduction.
- always_changed: Whether or not this stage is considered as changed by
  commands such as dvc status and dvc repro. false by default.
- meta (optional): Arbitrary metadata can be added manually with this field.
  Any YAML content is supported. meta contents are ignored by DVC, but they
  can be meaningful for user processes that read or write .dvc files directly.
- desc (optional): User description for this stage. This doesn't affect any
  DVC operations.

See Advanced dvc.yaml Usage for info on the ${} syntax, as well as foreach/do
fields.
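
As a rough illustration of the less common fields in the list above, the sketch
below combines wdir, a second parameters file, uncached metrics and plots, and
the frozen and always_changed flags. Every stage name, path, and parameter key
in it is hypothetical and chosen only for the example.

```yaml
# dvc.yaml (sketch; all names and paths are made up for illustration)
stages:
  prepare:
    cmd: python prepare.py
    wdir: src                # paths in deps, params and outs are relative to src/
    deps:
      - prepare.py
    params:
      - split.ratio          # key tracked from params.yaml (looked up in wdir)
      - config.json:         # another parameters file, with the keys to track
          - seed
    outs:
      - prepared
    frozen: true             # this stage is frozen from reproduction
  evaluate:
    cmd: python evaluate.py
    deps:
      - evaluate.py
      - src/prepared
    metrics:
      - scores.json:
          cache: false       # still a metrics file, but not cached
    plots:
      - roc.csv:
          x: fpr             # default plot configuration
          y: tpr
          cache: false
    always_changed: true     # dvc status and dvc repro treat it as changed
```

The two flags are shown on separate stages because they pull in opposite
directions: one keeps a stage from being reproduced, the other marks it as
changed every time.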

dvc.yaml files also support # comments.

Note that we maintain a dvc.yaml schema that can be used by editors like VSCode
or PyCharm to enable automatic syntax validation and auto-completion.

outs fields can contain these subfields:

- cache: Whether or not this file or directory is cached (true by default).
  See the --no-commit option of dvc add.
- persist: Whether the output file/dir should remain in place while dvc repro
  runs (false by default: outputs are deleted when dvc repro starts).
- desc (optional): User description for this output. This doesn't affect any
  DVC operations.

dvc.lock file

For every dvc.yaml file, a matching dvc.lock (YAML) file usually exists. It's
created or updated by DVC commands such as dvc run and dvc repro.

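dvc.lock describes the latest pipeline state. It has several purposes:

- Tracking the intermediate and final outputs of a pipeline, similar to .dvc
  files.
- Allowing DVC to detect when stage definitions or dependencies have changed,
  which invalidates stages and requires their reproduction (see dvc status,
  dvc repro).
- dvc.lock is needed internally for several DVC commands to operate, such as
  dvc checkout, dvc get, and dvc import.

Here's an example dvc.lock (based on the dvc.yaml example above):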

    stages:
      features:
        cmd: jupyter nbconvert --execute featurize.ipynb
        deps:
          - path: data/clean
            md5: d8b874c5fa18c32b2d67f73606a1be60
        params:
          params.yaml:
            levels.no: 5
        outs:
          - path: features
            md5: 2119f7661d49546288b73b5730d76485
          - path: performance.json
            md5: ea46c1139d771bfeba7942d1fbb5981e
          - path: logs.csv
            md5: f99aac37e383b422adc76f5f1fb45004

Stage commands are listed again in dvc.lock, in order to know when their
definitions change in dvc.yaml.

Regular dependencies and all kinds of outputs (including metrics and plots
files) are also listed (per stage) in dvc.lock, but with an additional field
with a hash of their last known contents. Specifically: md5, etag, or checksum
are used (same as in deps and outs entries of .dvc files).
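
The following sketch shows how a hash field other than md5 might appear. It
assumes the convention used in .dvc files, where a dependency at an HTTP
location is identified by an ETag rather than an MD5 hash. The stage name, URL,
and both hash values are placeholders, not real output.

```yaml
# dvc.lock (sketch; stage, URL and hash values are hypothetical)
stages:
  download:
    cmd: python download.py
    deps:
      - path: https://example.com/data.csv
        etag: '"0123456789abcdef"'              # placeholder ETag, used instead of md5
    outs:
      - path: data.csv
        md5: 0123456789abcdef0123456789abcdef   # placeholder MD5 of the local output
```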

Full parameter dependencies (key and value) are listed too (under params),
grouped by parameters file. And in the case of templated dvc.yaml files, their
actual values are substituted into the dvc.lock YAML structure.
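
To illustrate the substitution, assume (hypothetically) that the value behind
${model_file} in the dvc.yaml example is model.pkl. The training stage would
then be recorded in dvc.lock roughly as sketched below, with the resolved value
in place of the template expression. The hash is a placeholder, and the deps
entries are left out for brevity.

```yaml
# dvc.lock (sketch; assumes model_file resolves to model.pkl)
stages:
  training:
    cmd:
      - pip install -r requirements.txt
      - python train.py --out model.pkl         # was ${model_file} in dvc.yaml
    outs:
      - path: model.pkl                         # was ${model_file} in dvc.yaml
        md5: 0123456789abcdef0123456789abcdef   # placeholder hash
```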