dvc.yaml
You can configure machine learning projects in one or more dvc.yaml files. The
list of stages is typically the most important part of a dvc.yaml file, though
the file can also be used to configure metrics, params, and plots, either as
part of a stage definition or on their own.
dvc.yaml uses the YAML 1.2 format and a human-friendly schema explained below.
We encourage you to get familiar with it so you can modify, write, or generate
these files by your own means.
dvc.yaml files are designed to be small enough so you can easily version them
with Git along with other DVC files and your project's code.
Metrics
The list of metrics contains one or more paths to metrics files.
Here's an example:
metrics:
- metrics.json
Metrics are key/value pairs saved in structured files that map a metric name to
a numeric value. See dvc metrics for more information and how to compare them
across experiments.
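As a rough illustration of the file shape described above, here is a minimal Python sketch that writes such a metrics file. The helper name `write_metrics` and the metric names and values are hypothetical; only the idea of a name-to-number mapping stored in a structured file comes from the text.

```python
import json
from pathlib import Path


def write_metrics(path, metrics):
    """Write a name -> number mapping as JSON, replacing any previous file.

    Hypothetical helper for illustration; not part of DVC.
    """
    for name, value in metrics.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"metric {name!r} must be numeric, got {value!r}")
    Path(path).write_text(json.dumps(metrics, indent=2))


if __name__ == "__main__":
    # Illustrative values only.
    write_metrics("metrics.json", {"accuracy": 0.91, "loss": 0.27})
```

A training script could call a helper like this at the end of a run so that the file listed under metrics always reflects the latest results.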
Params
The list of params contains one or more paths to parameters files. Here's an
example:
params:
- params.yaml
Parameters are key/value pairs saved in structured files. Unlike stage-level
parameter dependencies, which are granular, top-level parameters are defined at
the file level and include all parameters in the file. See dvc params for more
information and how to compare between experiments.
Plots
The list of plots contains one or more user-defined dvc plots configurations.
Every plot must have a unique ID, which may be either a file or directory path
(relative to the location of dvc.yaml) or an arbitrary string. If the ID is an
arbitrary string, a file path must be provided in the y field (an x file path is
always optional and cannot be the only path provided).
Refer to Visualizing Plots and dvc plots show for more examples.
Available configuration fields
- y - source for the Y axis data:
  - Top-level plots (string, list, dict):
    If plot ID is a path, one or more column/field names are expected. For example:
      plots:
        - regression_hist.csv:
            y: mean_squared_error
        - classifier_hist.csv:
            y: [acc, loss]
    If plot ID is an arbitrary string, a dictionary of file paths mapped to column/field names is expected. For example:
      plots:
        - train_val_test:
            y:
              train.csv: [train_acc, val_acc]
              test.csv: test_acc
  - Plot outputs (string): one column/field name.
- x - source for the X axis data. An auto-generated step field is used by default.
  - Top-level plots (string, dict):
    If plot ID is a path, one column/field name is expected. For example:
      plots:
        - classifier_hist.csv:
            y: [acc, loss]
            x: epoch
    If plot ID is an arbitrary string, x may either be one column/field name, or a dictionary of file paths each mapped to one column/field name (the number of column/field names must match the number in y).
      plots:
        - train_val_test: # single x
            y:
              train.csv: [train_acc, val_acc]
              test.csv: test_acc
            x: epoch
        - roc_vs_prc: # x dict
            y:
              precision_recall.json: precision
              roc.json: tpr
            x:
              precision_recall.json: recall
              roc.json: fpr
        - confusion: # different x and y paths
            y:
              dir/preds.csv: predicted
            x:
              dir/actual.csv: actual
            template: confusion
  - Plot outputs (string): one column/field name.
- y_label (string) - Y axis label. If all y data sources have the same field name, that will be the default. Otherwise, it's "y".
- x_label (string) - X axis label. If all y data sources have the same field name, that will be the default. Otherwise, it's "x".
- title (string) - header for the plot(s). Defaults:
  - Top-level plots: path/to/dvc.yaml::plot_id
  - Plot outputs: path/to/data.csv
- template (string) - plot template. Defaults to linear.
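To make the column/field names used in x and y more concrete, here is a small Python sketch that writes a tabular file such as the classifier_hist.csv used in the examples above. The helper name `write_history` and the row values are hypothetical; the column names epoch, acc, and loss are taken from the examples.

```python
import csv


def write_history(path, rows):
    """Write one row per epoch; the header row supplies the field names
    that a plot definition can reference in x and y.

    Hypothetical helper for illustration; not part of DVC.
    """
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["epoch", "acc", "loss"])
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    # Illustrative values only.
    write_history(
        "classifier_hist.csv",
        [
            {"epoch": 1, "acc": 0.70, "loss": 0.61},
            {"epoch": 2, "acc": 0.82, "loss": 0.44},
        ],
    )
```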
Stages
You can construct machine learning pipelines by defining individual stages in
one or more dvc.yaml files. Stages constitute a pipeline when they connect with
each other (forming a dependency graph, see dvc dag).
The list of stages contains one or more user-defined stages.
Here's a simple one named transpose:
stages:
  transpose:
    cmd: ./trans.r rows.txt > columns.txt
    deps:
      - rows.txt
    outs:
      - columns.txt
A helper command group, dvc stage, is available to create and list stages.
The only required part of a stage is the shell command(s) it executes (cmd
field). This is what DVC runs when the stage is reproduced (see dvc repro).
We use GNU/Linux in our examples, but Windows or other shells can be used too.
If a stage command reads input files, these (or their directory locations) can
be defined as dependencies (deps). DVC will check whether they have changed to
decide whether the stage requires re-execution (see dvc status).
If it writes files or directories, these can be defined as outputs (outs). DVC
will track them going forward (similar to using dvc add on them).
Output files may be viable data sources for top-level plots.
See the full stage entry specification.
Stage commands
The command(s) defined in the stages (cmd field) can be anything your system
terminal would accept and run, for example a shell built-in, an expression, or a
binary found in PATH.
Surround the command with double quotes " if it includes special characters
like | or <, >. Use single quotes ' instead if there are environment variables
in it that should be evaluated dynamically.
The same applies to the command argument for helper commands (dvc stage add),
otherwise they would apply to the DVC call itself:
$ dvc stage add -n a_stage "./a_script.sh > /dev/null 2>&1"
See also Templating (and Dictionary unpacking) for useful ways to parametrize
cmd strings.
We don't want to tell anyone how to write their code or what programs to use! However, please be aware that in order to prevent unexpected results when DVC reproduces pipeline stages, the underlying code should ideally follow these rules:
- Read/write exclusively from/to the specified dependencies and outputs (including parameters files, metrics, and plots).
- Completely rewrite outputs. Do not append or edit.
- Stop reading and writing files when the command exits.
Also, if your pipeline reproducibility goals include consistent output data, its code should be deterministic (produce the same output for any given input): avoid code that increases entropy (e.g. random numbers, time functions, hardware dependencies, etc.).
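The rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `run_stage`, its arguments, and the shuffling task are hypothetical. It reads only its declared input, rewrites its output from scratch, closes the file before returning, and uses a fixed seed so repeated runs give the same result.

```python
import random
from pathlib import Path


def run_stage(src, dst, seed=0):
    """Shuffle the lines of src into dst, deterministically.

    Hypothetical stage code for illustration; not part of DVC.
    """
    rng = random.Random(seed)  # fixed seed: same input gives same output
    lines = Path(src).read_text().splitlines()
    rng.shuffle(lines)
    # Mode "w" truncates the file, so the output is fully rewritten
    # rather than appended to or edited in place.
    with open(dst, "w") as out:
        out.write("\n".join(lines) + "\n")
    # The with-block has closed the file: nothing is still being
    # written once the command exits.
```

Exposing the seed as a tracked parameter is one way to keep the randomness under version control instead of removing it.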
Parameters
Parameters are simple key/value pairs consumed by the command code from a
structured parameters file. They are defined per-stage in the params field of
dvc.yaml and should contain one of these:
- A param name that can be found in params.yaml (default params file);
- A dictionary named by the file path to a custom params file, and with a list of param key/value pairs to find in it;
- An empty set (give no value or use null) named by the file path to a params file: to track all the params in it dynamically.
Dot-separated param names become tree paths to locate values in the params file.
stages:
  preprocess:
    cmd: bin/cleanup raw.txt clean.txt
    deps:
      - raw.txt
    params:
      - threshold # track specific param (from params.yaml)
      - nn.batch_size
      - myparams.yaml: # track specific params from custom file
          - epochs
      - config.json: # track all parameters in this file
    outs:
      - clean.txt
Params are a more granular type of stage dependency: multiple stages can use
the same params file, but only certain values will affect their state (see
dvc status).
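The granularity described above can be sketched in Python. This is not DVC's implementation; the helpers `lookup` and `changed_params` and the sample values are hypothetical. A dot-separated name is walked as a path through the parsed params file, and only the tracked names are compared.

```python
def lookup(params, dotted_name):
    """Follow a dot-separated name through nested dictionaries."""
    node = params
    for part in dotted_name.split("."):
        node = node[part]
    return node


def changed_params(tracked, old, new):
    """Return the tracked names whose values differ between two versions."""
    return [name for name in tracked if lookup(old, name) != lookup(new, name)]


if __name__ == "__main__":
    # Illustrative contents of params.yaml before and after an edit.
    old = {"threshold": 0.5, "nn": {"batch_size": 32, "lr": 0.01}}
    new = {"threshold": 0.5, "nn": {"batch_size": 32, "lr": 0.02}}
    # Only nn.lr changed, and it is not tracked by this stage.
    print(changed_params(["threshold", "nn.batch_size"], old, new))
```

In this sketch a stage that tracks only threshold and nn.batch_size is unaffected by the edit to nn.lr, while a stage tracking nn.lr would be reported as changed.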
Parameters files
The supported params file formats are YAML 1.2, JSON, TOML 1.0, and Python. Parameter key/value pairs should be organized in tree-like hierarchies inside. Supported value types are: string, integer, float, boolean, and arrays (groups of params).
These files are typically written manually (or generated) and they can be versioned directly with Git along with other workspace files.
See also dvc params diff to compare params across project versions.
Metrics and Plots outputs
Like common outputs, metrics and plots files are produced by the stage cmd.
However, their purpose is different. Typically they contain metadata to evaluate
pipeline processes. Example:
stages:
  build:
    cmd: python train.py
    deps:
      - features.csv
    outs:
      - model.pt
    metrics:
      - accuracy.json:
          cache: false
    plots:
      - auc.json:
          cache: false
cache: false is typical here, since they're small enough for Git to store
directly.
The commands in dvc metrics and dvc plots help you display and compare metrics
and plots.
Stage entries
These are the fields that are accepted in each stage:
Field | Description |
---|---|
cmd | (Required) One or more shell commands to execute (may contain either a single value or a list). cmd values may use dictionary substitution from param files. Commands are executed sequentially until all are finished or until one of them fails (see dvc repro ). |
wdir | Working directory for the cmd to run in (relative to the file's location). Any paths in other fields are also based on this. It defaults to . (the file's location). |
deps | List of dependency paths (relative to wdir ). |
outs | List of output paths (relative to wdir ). These can contain certain optional subfields. |
params | List of parameter dependency keys (field names) to track from params.yaml (in wdir ). The list may also contain other parameters file names, with a sub-list of the param names to track in them. |
metrics | List of metrics files, and optionally, whether or not this metrics file is cached (true by default). See the --metrics-no-cache (-M ) option of dvc stage add . |
plots | List of plot metrics, and optionally, their default configuration (subfields matching the options of dvc plots modify ), and whether or not this plots file is cached ( true by default). See the --plots-no-cache option of dvc stage add . |
frozen | Whether or not this stage is frozen (prevented from execution during reproduction) |
always_changed | Causes this stage to be always considered as changed by commands such as dvc status and dvc repro . false by default |
meta | (Optional) arbitrary metadata can be added manually with this field. Any YAML content is supported. meta contents are ignored by DVC, but they can be meaningful for user processes that read or write .dvc files directly. |
desc | (Optional) user description. This doesn't affect any DVC operations. |
dvc.yaml files also support # comments.
See also How to Merge Conflicts.
Output subfields
These include a subset of the fields in .dvc file output entries.
Field | Description |
---|---|
cache | Whether or not this file or directory is cached (true by default). See the --no-commit option of dvc add . If any output of a stage has cache: false , the run cache will be deactivated for that stage. |
remote | (Optional) Name of the remote to use for pushing/fetching |
persist | Whether the output file/dir should remain in place during dvc repro (false by default: outputs are deleted when dvc repro starts) |
checkpoint | (Optional) Set to true to let DVC know that this output is associated with checkpoint experiments. These outputs are reverted to their last cached version at dvc exp run and also persist during the stage execution. |
desc | (Optional) User description for this output. This doesn't affect any DVC operations. |
push | Whether or not this file or directory, when previously cached, is uploaded to remote storage by dvc push (true by default). |
Templating
dvc.yaml supports a templating format to insert values from different sources
in the YAML structure itself. These sources can be parameters files, or vars
defined in dvc.yaml instead.
Let's say we have params.yaml (default params file) with the following
contents:
models:
  us:
    threshold: 10
    filename: 'model-us.hdf5'
Those values can be used anywhere in dvc.yaml with the ${} substitution
expression, for example to pass parameters as command-line arguments to a
stage command:
stages:
  build-us:
    cmd: >-
      python train.py
      --thresh ${models.us.threshold}
      --out ${models.us.filename}
    outs:
      - ${models.us.filename}:
          cache: true
DVC will track simple param values (numbers, strings, etc.) used in ${} (they
will be listed by dvc params diff).
Only inside the cmd entries, you can also reference a dictionary inside ${}
and DVC will unpack it. This can be useful to avoid writing every argument
passed to the command, or having to modify dvc.yaml when arguments change.
An alternative for loading parameters from Python code is the
dvc.api.params_show() API function.
For example, given the following params.yaml:
mydict:
  foo: foo
  bar: 1
  bool: true
  nested:
    baz: bar
  list: [2, 3, 'qux']
You can reference mydict in a stage command like this:
stages:
  train:
    cmd: R train.r ${mydict}
DVC will unpack the values inside mydict, creating the following cmd call:
$ R train.r --foo 'foo' --bar 1 --bool \
    --nested.baz 'bar' --list 2 3 'qux'
You can combine this with argument parsing libraries such as R argparse or Julia ArgParse to do all the work for you.
dvc config parsing can be used to customize the syntax used for ambiguous
types like booleans and lists.
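To show how a dictionary can turn into the command-line arguments above, here is a Python sketch that reproduces that output for the mydict example. It approximates the documented default behavior and is not DVC's code; the function names `unpack` and `fmt` are hypothetical, and omitting false booleans and using naive single quoting are assumptions of this sketch.

```python
def fmt(value):
    """Quote strings naively; leave other scalars as they are."""
    return f"'{value}'" if isinstance(value, str) else str(value)


def unpack(mapping, prefix=""):
    """Flatten a nested dict into a list of command-line tokens."""
    args = []
    for key, value in mapping.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested keys are joined with a period: --nested.baz
            args += unpack(value, prefix=f"{name}.")
        elif isinstance(value, bool):
            # Checked before numbers because bool is a subclass of int.
            # Assumption: true becomes a bare flag, false is omitted.
            if value:
                args.append(f"--{name}")
        elif isinstance(value, list):
            args.append(f"--{name}")
            args += [fmt(item) for item in value]
        else:
            args += [f"--{name}", fmt(value)]
    return args


if __name__ == "__main__":
    mydict = {
        "foo": "foo",
        "bar": 1,
        "bool": True,
        "nested": {"baz": "bar"},
        "list": [2, 3, "qux"],
    }
    print("R train.r " + " ".join(unpack(mydict)))
```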
Variables
Alternatively (to relying on parameter files), values for substitution can be
listed as top-level vars like this:
vars:
  - models:
      us:
        threshold: 10
  - desc: 'Reusable description'
stages:
  build-us:
    desc: ${desc}
    cmd: python train.py --thresh ${models.us.threshold}
Values from vars are not tracked like parameters.
To load additional params files, list them in the top vars, in the desired
order, e.g.:
vars:
  - params.json
  - myvar: 'value'
  - config/myapp.yaml
Notes
The default params.yaml file is always loaded first, if present.
Param file paths will be evaluated based on wdir, if specified.
It's also possible to specify what to include from additional params files, with
a : colon:
vars:
  - params.json:clean,feats
stages:
  featurize:
    cmd: ${feats.exec}
    deps:
      - ${clean.filename}
    outs:
      - ${feats.dirname}
Stage-specific values are also supported, with inner vars. You may also load
additional params files locally. For example:
stages:
  build-us:
    vars:
      - params.json:build
      - model:
          filename: 'model-us.hdf5'
    cmd: python train.py ${build.epochs} --out ${model.filename}
    outs:
      - ${model.filename}
DVC merges values from params files and vars in each scope when possible. For
example, {"grp": {"a": 1}} merges with {"grp": {"b": 2}}, but not with
{"grp": {"a": 7}}.
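The merge rule in the example above can be sketched as a recursive dictionary merge that refuses to overwrite an existing value. This is an illustration only, not DVC's implementation; the function name `merge` is hypothetical, and treating every repeated leaf key as a conflict (even when the values are equal) is an assumption of this sketch.

```python
def merge(base, extra, path=""):
    """Merge two nested dicts, raising if both define the same leaf key."""
    merged = dict(base)
    for key, value in extra.items():
        where = f"{path}.{key}" if path else key
        if key not in merged:
            merged[key] = value
        elif isinstance(merged[key], dict) and isinstance(value, dict):
            # Both sides are groups: merge their contents.
            merged[key] = merge(merged[key], value, where)
        else:
            raise ValueError(f"conflicting values for '{where}'")
    return merged


if __name__ == "__main__":
    print(merge({"grp": {"a": 1}}, {"grp": {"b": 2}}))
    try:
        merge({"grp": {"a": 1}}, {"grp": {"a": 7}})
    except ValueError as err:
        print(err)
```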
Known limitations of local vars:
- wdir cannot use values from local vars, as DVC uses the working directory first (to load any values from params files listed in vars).
- foreach is also incompatible with local vars at the moment.
The substitution expression supports these forms:
${param} # Simple
${param.key} # Nested values through . (period)
${param.list[0]} # List elements via index in [] (square brackets)
To use the expression literally in dvc.yaml (so DVC does not replace it for a
value), escape it with a backslash, e.g. \${... .
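The three forms and the backslash escape can be illustrated with a small Python sketch. It is a simplified stand-in for the real resolver, not DVC's code: the names `TOKEN`, `PART`, `resolve`, and `substitute` are hypothetical, and the assumption here is that an escaped expression is emitted without its backslash.

```python
import re

# Optional leading backslash, then ${ ... }
TOKEN = re.compile(r"(\\)?\$\{([^}]+)\}")
# A path is a series of names (split on periods) and [index] parts.
PART = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def resolve(context, expr):
    """Walk names and list indexes, e.g. 'param.list[0]'."""
    node = context
    for key, index in PART.findall(expr):
        node = node[int(index)] if index else node[key]
    return node


def substitute(text, context):
    """Replace every unescaped ${...} in text with its value."""
    def repl(match):
        if match.group(1):
            # Escaped expression: drop the backslash, keep the rest.
            return match.group(0)[1:]
        return str(resolve(context, match.group(2).strip()))
    return TOKEN.sub(repl, text)


if __name__ == "__main__":
    context = {"models": {"us": {"threshold": 10}}, "param": {"list": [2, 3]}}
    print(substitute("python train.py --thresh ${models.us.threshold}", context))
    print(substitute("first=${param.list[0]}", context))
    print(substitute(r"echo \${HOME}", context))
```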
foreach stages
You can define more than one stage in a single dvc.yaml entry with the
following syntax. A foreach element accepts a list or dictionary with values
to iterate on, while do contains the regular stage fields (cmd, outs, etc.).
Here's a simple example:
stages:
  cleanups:
    foreach: # List of simple values
      - raw1
      - labels1
      - raw2
    do:
      cmd: clean.py "${item}"
      outs:
        - ${item}.cln
Upon dvc repro, each item in the list is expanded into its own stage by
substituting its value in expression ${item}. The item's value is appended to
each stage name after a @. The final stages generated by the foreach syntax
are saved to dvc.lock:
schema: '2.0'
stages:
  cleanups@labels1:
    cmd: clean.py "labels1"
    outs:
      - path: labels1.cln
  cleanups@raw1:
    cmd: clean.py "raw1"
    outs:
      - path: raw1.cln
  cleanups@raw2:
    cmd: clean.py "raw2"
    outs:
      - path: raw2.cln
For lists containing complex values (e.g. dictionaries), the substitution
expression can use the ${item.key} form. Stage names will be appended with a
zero-based index. For example:
stages:
  train:
    foreach:
      - epochs: 3
        thresh: 10
      - epochs: 10
        thresh: 15
    do:
      cmd: python train.py ${item.epochs} ${item.thresh}
# dvc.lock
schema: '2.0'
stages:
  train@0:
    cmd: python train.py 3 10
  train@1:
    cmd: python train.py 10 15
DVC can also iterate on a dictionary given directly to foreach, resulting in
two substitution expressions being available: ${key} and ${item}. The former
is used for the stage names:
stages:
  build:
    foreach:
      uk:
        epochs: 3
        thresh: 10
      us:
        epochs: 10
        thresh: 15
    do:
      cmd: python train.py '${key}' ${item.epochs} ${item.thresh}
      outs:
        - model-${key}.hdfs
# dvc.lock
schema: '2.0'
stages:
  build@uk:
    cmd: python train.py 'uk' 3 10
    outs:
      - path: model-uk.hdfs
        md5: 17b3d1efc339b416c4b5615b1ce1b97e
  build@us: ...
Both resulting stages (train@1, build@uk) and source groups (train, build) may
be used in commands that accept stage targets, such as dvc repro and
dvc stage list.
Importantly, dictionaries from parameters files can be used in foreach stages
as well:
stages:
  mystages:
    foreach: ${myobject} # From params.yaml
    do:
      cmd: ./script.py ${key} ${item.prop1}
      outs:
        - ${item.prop2}
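The naming rules for expanded foreach stages described in this section can be summarized in a short Python sketch. It is not DVC's implementation; the function name `expand_foreach` is hypothetical, and switching the whole list to index-based names as soon as any item is complex is an assumption of this sketch.

```python
def expand_foreach(group, foreach):
    """Map each generated stage name to the values available in `do`."""
    if isinstance(foreach, dict):
        # Dictionaries expose both ${key} and ${item}; the key names the stage.
        return {
            f"{group}@{key}": {"key": key, "item": item}
            for key, item in foreach.items()
        }
    # Lists of simple values use the value itself as the suffix;
    # lists with complex values fall back to a zero-based index.
    use_index = any(isinstance(item, (dict, list)) for item in foreach)
    stages = {}
    for index, item in enumerate(foreach):
        suffix = index if use_index else item
        stages[f"{group}@{suffix}"] = {"item": item}
    return stages


if __name__ == "__main__":
    print(list(expand_foreach("cleanups", ["raw1", "labels1", "raw2"])))
    print(list(expand_foreach("train", [{"epochs": 3}, {"epochs": 10}])))
    print(list(expand_foreach("build", {"uk": {"epochs": 3}, "us": {"epochs": 10}})))
```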
dvc.lock file
To record the state of your pipeline(s) and help track its outputs, DVC will
maintain a dvc.lock file for each dvc.yaml. Their purposes include:
- Allowing DVC to detect when stage definitions or their dependencies have changed. Such conditions invalidate stages, requiring their reproduction (see dvc status).
- Tracking intermediate and final outputs of a pipeline, similar to .dvc files.
- Enabling several DVC commands to operate, such as dvc checkout or dvc get.
Avoid editing these files. DVC will create and update them for you.
Here's an example:
schema: '2.0'
stages:
  features:
    cmd: jupyter nbconvert --execute featurize.ipynb
    deps:
      - path: data/clean
        md5: d8b874c5fa18c32b2d67f73606a1be60
    params:
      params.yaml:
        levels.no: 5
    outs:
      - path: features
        md5: 2119f7661d49546288b73b5730d76485
        size: 154683
      - path: performance.json
        md5: ea46c1139d771bfeba7942d1fbb5981e
        size: 975
      - path: logs.csv
        md5: f99aac37e383b422adc76f5f1fb45004
        size: 695947
Stages are listed again in dvc.lock, in order to know if their definitions
change in dvc.yaml.
Regular dependency entries and all forms of output entries (including metrics
and plots files) are also listed (per stage) in dvc.lock, including a content
hash field (md5, etag, or checksum).
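To give a feel for how a recorded content hash lets a tool notice changes, here is a Python sketch that hashes a single file with MD5 and compares it to a stored value. It is an illustration only: the helpers `file_md5` and `is_changed` are hypothetical, and DVC's own hashing (for example of directories or remote objects) may differ from this plain file hash.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_changed(path, recorded_md5):
    """True if the file content no longer matches the recorded hash."""
    return file_md5(path) != recorded_md5
```

A stage whose dependency reports a changed hash in this way would be considered invalid and due for reproduction.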
Full parameter dependencies (both key and value) are listed too (under params),
under each parameters file name.
For templated dvc.yaml files, the actual values are written to dvc.lock (no
${} expression). As for foreach stages, individual stages are expanded (no
foreach structures are preserved).