dvc.yaml
You can configure machine learning projects in one or more dvc.yaml files. The
list of stages is typically the most important part of a dvc.yaml file, though
the file can also be used to configure artifacts, metrics, params, and plots,
either as part of a stage definition or on their own.
dvc.yaml uses the YAML 1.2 format and a human-friendly schema explained below.
We encourage you to get familiar with it so you can modify, write, or generate
these files yourself.
dvc.yaml files are designed to be small enough that you can easily version them
with Git along with other DVC files and your project's code.
Artifacts
This section allows you to declare structured metadata about your artifacts.
artifacts:
  cv-classification: # artifact ID (name)
    path: models/resnet.pt
    type: model
    desc: 'CV classification model, ResNet50'
    labels:
      - resnet50
      - classification
    meta:
      framework: pytorch
For every artifact ID you can specify the following elements (only path is
mandatory):
- path (string) - The path to the artifact, either relative to the root of the repository or a full path in an external storage such as S3.
- type (string) - You can specify artifacts of any type. By default, the DVC-based model registry will show any artifacts with type model (you can adjust the filters to also show artifacts of other types).
- desc (string) - A description of your artifact.
- labels (list) - Any labels you want to add to the artifact.
- meta - Any extra information. The content of this element will be ignored by DVC and will not show up in the model registry.
Artifact IDs must consist of letters and numbers, and use '-' as separator (but not at the start or end).
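The naming rule above can be checked before committing a dvc.yaml file. The following is a minimal sketch, not DVC's own validator: the function name is hypothetical, and rejecting consecutive hyphens is an assumption, since the rule only rules out a separator at the start or end.

```python
import re

# Assumed pattern: runs of letters/digits joined by single '-' separators.
# DVC's real check may differ (for example on consecutive hyphens).
ARTIFACT_ID = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def is_valid_artifact_id(name: str) -> bool:
    """Return True if `name` uses only letters, digits, and inner '-' separators."""
    return ARTIFACT_ID.fullmatch(name) is not None


print(is_valid_artifact_id("cv-classification"))  # ID from the example above
print(is_valid_artifact_id("-resnet"))            # separator at the start
```
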
To migrate from the old GTO-based Model Registry by moving artifact annotations
from artifacts.yaml
to dvc.yaml
, use
this helper script.
Metrics
The list of metrics
contains one or more paths to metrics files.
Here's an example:
metrics:
- metrics.json
Metrics are key/value pairs saved in structured files that map a metric name to
a numeric value. See dvc metrics
for more information and how to compare among
experiments, or DVCLive for a helper to log metrics.
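As a rough illustration of what a metrics file holds, the sketch below writes a flat name-to-number mapping as JSON using only the Python standard library. The function name and the metric names and values are made up for the example; DVCLive or your own training code would normally produce this file.

```python
import json
from pathlib import Path


def write_metrics(path: str, metrics: dict) -> None:
    """Overwrite `path` with a JSON mapping of metric names to numeric values."""
    Path(path).write_text(json.dumps(metrics, indent=2) + "\n")


if __name__ == "__main__":
    # Hypothetical values, written to the file listed under `metrics:`.
    write_metrics("metrics.json", {"accuracy": 0.91, "loss": 0.27})
```
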
Params
The list of params
contains one or more paths to parameters
files. Here's an example:
params:
- params.yaml
Parameters are key/value pairs saved in structured files. Unlike stage-level
parameter dependencies, which are granular, top-level parameters
are defined at the file level and include all parameters in the file. See
dvc params
for more information and how to compare between experiments.
Plots
The list of plots
contains one or more user-defined dvc plots
configurations. Every plot must have a unique ID, which may be either a file or
directory path (relative to the location of dvc.yaml
) or an arbitrary string.
If the ID is an arbitrary string, a file path must be provided in the y
field
(x
file path is always optional and cannot be the only path provided).
Refer to Visualizing Plots and dvc plots show
for more examples, and refer to
DVCLive for a helper to log plots.
Available configuration fields
- y (string, list, dict) - source for the Y axis data:

  If plot ID is a path, one or more column/field names is expected, or the last column/field is used by default. For example:

  plots:
    - regression_hist.csv:
        y: mean_squared_error
    - classifier_hist.csv:
        y: [acc, loss]

  If plot ID is an arbitrary string, a dictionary of file paths mapped to column/field names is expected. For example:

  plots:
    - train_val_test:
        y:
          train.csv: [train_acc, val_acc]
          test.csv: test_acc

- x (string, dict) - source for the X axis data. An auto-generated step field is used by default.

  If plot ID is a path, one column/field name is expected. For example:

  plots:
    - classifier_hist.csv:
        y: [acc, loss]
        x: epoch

  If plot ID is an arbitrary string, x may either be one column/field name, or a dictionary of file paths each mapped to one column/field name (the number of column/field names must match the number in y).

  plots:
    - train_val_test: # single x
        y:
          train.csv: [train_acc, val_acc]
          test.csv: test_acc
        x: epoch
    - roc_vs_prc: # x dict
        y:
          precision_recall.json: precision
          roc.json: tpr
        x:
          precision_recall.json: recall
          roc.json: fpr
    - confusion: # different x and y paths
        y:
          dir/preds.csv: predicted
        x:
          dir/actual.csv: actual
        template: confusion
- y_label (string) - Y axis label. If all y data sources have the same field name, that will be the default. Otherwise, it's "y".
- x_label (string) - X axis label. If all y data sources have the same field name, that will be the default. Otherwise, it's "x".
- title (string) - header for the plot(s). Defaults to path/to/dvc.yaml::plot_id.
- template (string) - plot template. Defaults to linear.
Stages
You can construct machine learning pipelines by defining individual
stages in one or more dvc.yaml
files. Stages
constitute a pipeline when they connect with each other (forming a dependency
graph, see dvc dag
).
The list of stages
contains one or more user-defined stages.
Here's a simple one named transpose
:
stages:
  transpose:
    cmd: ./trans.r rows.txt > columns.txt
    deps:
      - rows.txt
    outs:
      - columns.txt
A helper command group, dvc stage
, is available to create and list stages.
The only required part of a stage is the shell command(s) it executes (cmd
field). This is what DVC runs when the stage is reproduced (see dvc repro).
We use GNU/Linux in our examples, but Windows or other shells can be used too.
If a stage command reads input files, these (or their
directory locations) can be defined as dependencies (deps
). DVC
will check whether they have changed to decide whether the stage requires
re-execution (see dvc status
).
If it writes files or directories, these can be defined as outputs
(outs
). DVC will track them going forward (similar to using dvc add
on
them).
Output files may be viable data sources for plots.
See the full stage entry specification.
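To make the idea of a dependency graph more concrete, here is a small sketch of how stages could be ordered by matching one stage's outs to another stage's deps. It is an illustration only and not how DVC is implemented; the plot stage is a hypothetical addition to the transpose example, and graphlib requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def execution_order(stages: dict) -> list:
    """Order stage names so that producers run before their consumers."""
    # Map every output path to the stage that writes it.
    producers = {
        out: name for name, stage in stages.items() for out in stage.get("outs", [])
    }
    # A stage depends on whichever stages produce its dependencies.
    graph = {
        name: {producers[dep] for dep in stage.get("deps", []) if dep in producers}
        for name, stage in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


stages = {
    # Hypothetical downstream stage that consumes columns.txt.
    "plot": {"deps": ["columns.txt"], "outs": ["plot.png"]},
    "transpose": {"deps": ["rows.txt"], "outs": ["columns.txt"]},
}
print(execution_order(stages))
```

Because rows.txt is not produced by any stage, it is treated as a plain input file and does not add an edge to the graph.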
Stage commands
The command(s) defined in the stages
(cmd
field) can be anything your system
terminal would accept and run, for example a shell built-in, an expression, or a
binary found in PATH
.
Surround the command with double quotes "
if it includes special characters
like |
or <
, >
. Use single quotes '
instead if there are environment
variables in it that should be evaluated dynamically.
The same applies to the command
argument for helper commands
(dvc stage add
), otherwise they would apply to the DVC call itself:
$ dvc stage add -n a_stage "./a_script.sh > /dev/null 2>&1"
See also Templating (and Dictionary unpacking) for useful
ways to parametrize cmd
strings.
We don't want to tell anyone how to write their code or what programs to use! However, please be aware that in order to prevent unexpected results when DVC reproduces pipeline stages, the underlying code should ideally follow these rules:
- Read/write exclusively from/to the specified dependencies and outputs (including parameters files, metrics, and plots).
- Completely rewrite outputs. Do not append or edit.
- Stop reading and writing files when the command exits.
Also, if your pipeline reproducibility goals include consistent output data, its code should be deterministic (produce the same output for any given input): avoid code that increases entropy (e.g. random numbers, time functions, hardware dependencies, etc.).
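As one possible illustration of these rules, the sketch below draws a random sample with a fixed seed and rewrites its output file completely on every run. The function name, file name, and seed are hypothetical; the point is only that repeated runs with the same inputs yield the same output.

```python
import random
from pathlib import Path


def write_sample(out_path: str, population: list, k: int, seed: int = 0) -> list:
    """Write `k` sampled items to `out_path`, replacing any previous content."""
    rng = random.Random(seed)  # fixed seed: the same inputs give the same sample
    sample = rng.sample(population, k)
    # write_text truncates the file first, so the output is rewritten, not appended to.
    Path(out_path).write_text("\n".join(str(item) for item in sample) + "\n")
    return sample


if __name__ == "__main__":
    first = write_sample("sample.txt", list(range(100)), k=5)
    second = write_sample("sample.txt", list(range(100)), k=5)
    print(first == second)
```
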
Parameters
Parameters are simple key/value pairs consumed by the command
code from a structured parameters file. They are defined
per-stage in the params
field of dvc.yaml
and should contain one of these:
- A param name that can be found in params.yaml (default params file);
- A dictionary named by the file path to a custom params file, and with a list of param key/value pairs to find in it;
- An empty set (give no value or use null) named by the file path to a params file: to track all the params in it dynamically.
Dot-separated param names become tree paths to locate values in the params file.
stages:
  preprocess:
    cmd: bin/cleanup raw.txt clean.txt
    deps:
      - raw.txt
    params:
      - threshold # track specific param (from params.yaml)
      - nn.batch_size
      - myparams.yaml: # track specific params from custom file
          - epochs
      - config.json: # track all parameters in this file
    outs:
      - clean.txt
Params are a more granular type of stage dependency: multiple stages
can use
the same params file, but only certain values will affect their state (see
dvc status
).
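The way a dot-separated name such as nn.batch_size locates a value can be pictured as walking a nested dictionary one key at a time. This is a minimal sketch of that idea, not DVC's parser; the lookup function and the sample values are assumptions made for illustration.

```python
from functools import reduce


def lookup(params: dict, dotted_name: str):
    """Follow a dot-separated name through nested dictionaries."""
    return reduce(lambda node, key: node[key], dotted_name.split("."), params)


# Hypothetical contents of params.yaml, already loaded into a dictionary.
params = {"threshold": 0.5, "nn": {"batch_size": 32, "dropout": 0.1}}

print(lookup(params, "threshold"))
print(lookup(params, "nn.batch_size"))
```

Only the values reached this way would matter to the stage, which is why a change to nn.dropout here would leave a stage tracking nn.batch_size unaffected.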
Parameters files
The supported params file formats are YAML 1.2, JSON, TOML 1.0, and Python. Parameter key/value pairs should be organized in tree-like hierarchies inside. Supported value types are: string, integer, float, boolean, and arrays (groups of params).
These files are typically written manually (or generated) and they can be versioned directly with Git along with other workspace files.
See also dvc params diff
to compare params across project versions.
foreach stages
Check out matrix stages for a more powerful way to define
multiple stages.
You can define more than one stage in a single dvc.yaml
entry with the
following syntax. A foreach
element accepts a list or dictionary with values
to iterate on, while do
contains the regular stage fields (cmd
, outs
,
etc.). Here's a simple example:
stages:
  cleanups:
    foreach: # List of simple values
      - raw1
      - labels1
      - raw2
    do:
      cmd: clean.py "${item}"
      outs:
        - ${item}.cln
Upon dvc repro
, each item in the list is expanded into its own stage by
substituting its value in expression ${item}
. The item's value is appended to
each stage name after a @
. The final stages generated by the foreach
syntax
are saved to dvc.lock
:
schema: '2.0'
stages:
  cleanups@labels1:
    cmd: clean.py "labels1"
    outs:
      - path: labels1.cln
  cleanups@raw1:
    cmd: clean.py "raw1"
    outs:
      - path: raw1.cln
  cleanups@raw2:
    cmd: clean.py "raw2"
    outs:
      - path: raw2.cln
For lists containing complex values (e.g. dictionaries), the substitution
expression can use the ${item.key}
form. Stage names will be appended with a
zero-based index. For example:
stages:
  train:
    foreach:
      - epochs: 3
        thresh: 10
      - epochs: 10
        thresh: 15
    do:
      cmd: python train.py ${item.epochs} ${item.thresh}

# dvc.lock
schema: '2.0'
stages:
  train@0:
    cmd: python train.py 3 10
  train@1:
    cmd: python train.py 10 15
DVC can also iterate on a dictionary given directly to foreach
, resulting in
two substitution expressions being available: ${key}
and ${item}
. The former
is used for the stage names:
stages:
  build:
    foreach:
      uk:
        epochs: 3
        thresh: 10
      us:
        epochs: 10
        thresh: 15
    do:
      cmd: python train.py '${key}' ${item.epochs} ${item.thresh}
      outs:
        - model-${key}.hdfs

# dvc.lock
schema: '2.0'
stages:
  build@uk:
    cmd: python train.py 'uk' 3 10
    outs:
      - path: model-uk.hdfs
        md5: 17b3d1efc339b416c4b5615b1ce1b97e
  build@us: ...
Both resulting stages (train@1
, build@uk
) and source groups (train
,
build
) may be used in commands that accept stage targets, such as dvc repro
and dvc stage list
.
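The naming rules described so far (the value for simple list items, a zero-based index for complex list items, and the key for dictionaries) can be summarized in a short sketch. This is an illustration of the documented behavior under those assumptions, not DVC's code, and expand_foreach is a hypothetical name.

```python
def expand_foreach(name: str, foreach) -> dict:
    """Map each generated stage name to the values available for substitution."""
    if isinstance(foreach, dict):
        # Dictionaries expose both ${key} and ${item}; the key names the stage.
        return {
            f"{name}@{key}": {"key": key, "item": item}
            for key, item in foreach.items()
        }
    stages = {}
    for index, item in enumerate(foreach):
        # Complex items (dicts, lists) are named by index, simple ones by value.
        suffix = index if isinstance(item, (dict, list)) else item
        stages[f"{name}@{suffix}"] = {"item": item}
    return stages


print(list(expand_foreach("cleanups", ["raw1", "labels1", "raw2"])))
print(list(expand_foreach("train", [{"epochs": 3}, {"epochs": 10}])))
print(list(expand_foreach("build", {"uk": {"epochs": 3}, "us": {"epochs": 10}})))
```
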
Importantly, dictionaries from
parameters files can be used in
foreach
stages as well:
stages:
  mystages:
    foreach: ${myobject} # From params.yaml
    do:
      cmd: ./script.py ${key} ${item.prop1}
      outs:
        - ${item.prop2}
Both individual foreach stages (train@1
) and groups of foreach stages
(train
) may be used in commands that accept stage targets.
matrix stages
matrix allows you to define multiple stages based on combinations of
variables. A matrix element accepts one or more variables, each iterating over
a list of values. For example:
stages:
  train:
    matrix:
      model: [cnn, xgb]
      feature: [feature1, feature2, feature3]
    cmd: ./train.py --feature ${item.feature} ${item.model}
    outs:
      - ${key}.pkl # equivalent to: ${item.model}-${item.feature}.pkl
You can reference each variable in your stage definition using the item
dictionary key. In the above example, you can access item.model and
item.feature. Moreover, matrix exposes a key value, which combines the
current item values into one expression (as in the outs entry above).
On dvc repro, DVC will expand the definition into multiple stages, one for each
possible combination of the variables. In the above example, DVC will create six
stages, one for each combination of model and feature. Stage names are
generated by appending the values of the variables to the stage name
after a @, as with foreach. For example, DVC will create the
following stages:
$ dvc stage list
train@cnn-feature1 Outputs cnn-feature1.pkl
train@cnn-feature2 Outputs cnn-feature2.pkl
train@cnn-feature3 Outputs cnn-feature3.pkl
train@xgb-feature1 Outputs xgb-feature1.pkl
train@xgb-feature2 Outputs xgb-feature2.pkl
train@xgb-feature3 Outputs xgb-feature3.pkl
Both individual matrix stages (e.g. train@cnn-feature1) and groups of matrix
stages (train) may be used in commands that accept stage targets.
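The expansion into one stage per combination is a Cartesian product of the variable lists. The sketch below reproduces the six stage names listed above for simple values only; it is an illustration rather than DVC's implementation, and expand_matrix is a hypothetical name.

```python
from itertools import product


def expand_matrix(name: str, matrix: dict) -> dict:
    """Map each generated stage name to its `item` dictionary (simple values only)."""
    variables = list(matrix)
    stages = {}
    for combo in product(*(matrix[var] for var in variables)):
        key = "-".join(str(value) for value in combo)  # what ${key} would hold
        stages[f"{name}@{key}"] = dict(zip(variables, combo))
    return stages


stages = expand_matrix(
    "train",
    {"model": ["cnn", "xgb"], "feature": ["feature1", "feature2", "feature3"]},
)
for stage_name, item in stages.items():
    print(stage_name, item)
```
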
The values in variables can be simple values such as strings and integers, or composite values such as lists and dictionaries. For example:
matrix:
  config:
    - n_estimators: 150
      max_depth: 20
    - n_estimators: 120
      max_depth: 30
  labels:
    - [label1, label2, label3]
    - [labelX, labelY, labelZ]
When using a list or a dictionary, DVC generates stage names based on the
variable name and the index of the value. In the above example, generated stages
may look like train@labels0-config0.
Templating can also be used inside matrix, so you can reference
variables defined elsewhere. For example, you can define values in a
params.yaml file and use them in matrix.
# params.yaml
datasets: [dataset1/, dataset2/]
processors: [processor1, processor2]

# dvc.yaml
stages:
  preprocess:
    matrix:
      processor: ${processors}
      dataset: ${datasets}
    cmd: ./preprocess.py ${item.dataset} ${item.processor}
    deps:
      - ${item.dataset}
    outs:
      - ${item.dataset}-${item.processor}.json
Stage entries
These are the fields that are accepted in each stage:
Field | Description |
---|---|
cmd | (Required) One or more shell commands to execute (may contain either a single value or a list). cmd values may use dictionary substitution from param files. Commands are executed sequentially until all are finished or until one of them fails (see dvc repro ). |
wdir | Working directory for the cmd to run in (relative to the file's location). Any paths in other fields are also based on this. It defaults to . (the file's location). |
deps | List of dependency paths (relative to wdir ). |
outs | List of output paths (relative to wdir ). These can contain certain optional subfields. |
params | List of parameter dependency keys (field names) to track from params.yaml (in wdir ). The list may also contain other parameters file names, with a sub-list of the param names to track in them. |
frozen | Whether or not this stage is frozen (prevented from execution during reproduction) |
always_changed | Causes this stage to be always considered as changed by commands such as dvc status and dvc repro . false by default |
meta | (Optional) arbitrary metadata can be added manually with this field. Any YAML content is supported. meta contents are ignored by DVC, but they can be meaningful for user processes that read or write .dvc files directly. |
desc | (Optional) user description. This doesn't affect any DVC operations. |
dvc.yaml
files also support # comments
.
See also How to Merge Conflicts.
Output subfields
These include a subset of the fields in .dvc
file
output entries.
Field | Description |
---|---|
cache | Whether or not this file or directory is cached (true by default). See the --no-commit option of dvc add . If any output of a stage has cache: false , the run cache will be deactivated for that stage. |
remote | (Optional) Name of the remote to use for pushing/fetching |
persist | Whether the output file/dir should remain in place during dvc repro (false by default: outputs are deleted when dvc repro starts) |
desc | (Optional) User description for this output. This doesn't affect any DVC operations. |
push | Whether or not this file or directory, when previously cached, is uploaded to remote storage by dvc push (true by default). |
Templating
dvc.yaml
supports a templating format to insert values from different sources
in the YAML structure itself. These sources can be
parameters files, or vars
defined in
dvc.yaml
instead.
Let's say we have params.yaml
(default params file) with the following
contents:
models:
  us:
    threshold: 10
    filename: 'model-us.hdf5'
codedir: src
Those values can be used anywhere in dvc.yaml
with the ${}
substitution
expression, for example to pass parameters as command-line arguments to a
stage command:
artifacts:
  model-us:
    path: ${models.us.filename}
    type: model

stages:
  build-us:
    cmd: >-
      python ${codedir}/train.py
      --thresh ${models.us.threshold}
      --out ${models.us.filename}
    outs:
      - ${models.us.filename}:
          cache: true
DVC will track simple param values (numbers, strings, etc.) used in ${}
(they
will be listed by dvc params diff
).
Only inside the cmd
entries, you can also reference a dictionary inside ${}
and DVC will unpack it. This can be useful to avoid writing every argument
passed to the command, or having to modify dvc.yaml
when arguments change.
Alternatively, parameters can be loaded from Python code with the
dvc.api.params_show()
API function.
For example, given the following params.yaml
:
mydict:
  foo: foo
  bar: 1
  bool: true
  nested:
    baz: bar
  list: [2, 3, 'qux']
You can reference mydict
in a stage command like this:
stages:
  train:
    cmd: R train.r ${mydict}
DVC will unpack the values inside mydict
, creating the following cmd
call:
$ R train.r --foo 'foo' --bar 1 --bool \
--nested.baz 'bar' --list 2 3 'qux'
You can combine this with argument parsing libraries such as R argparse or Julia ArgParse to do all the work for you.
dvc config parsing
can be used to customize the syntax used for ambiguous
types like booleans and lists.
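To show roughly how a dictionary turns into command-line arguments, here is a sketch that reproduces the cmd call shown for mydict above. It only mimics the default style seen in that example (true booleans as bare flags, lists as space-separated values, nested keys joined with a period); omitting false booleans is an assumption, and the function names are hypothetical rather than part of DVC.

```python
def quote(value) -> str:
    """Wrap strings in single quotes; leave numbers as they are."""
    return f"'{value}'" if isinstance(value, str) else str(value)


def unpack(params: dict, prefix: str = "") -> str:
    """Flatten a (possibly nested) dictionary into a string of --flags."""
    parts = []
    for key, value in params.items():
        flag = f"--{prefix}{key}"
        if isinstance(value, dict):
            parts.append(unpack(value, prefix=f"{prefix}{key}."))
        elif isinstance(value, bool):
            if value:  # assumption: false booleans are simply left out
                parts.append(flag)
        elif isinstance(value, list):
            parts.append(f"{flag} " + " ".join(quote(v) for v in value))
        else:
            parts.append(f"{flag} {quote(value)}")
    return " ".join(part for part in parts if part)


mydict = {
    "foo": "foo",
    "bar": 1,
    "bool": True,
    "nested": {"baz": "bar"},
    "list": [2, 3, "qux"],
}
print("R train.r " + unpack(mydict))
```
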
Variables
As an alternative to relying on parameters files, values for substitution can be
listed as top-level vars
like this:
vars:
  - models:
      us:
        threshold: 10
        filename: 'model-us.hdf5'
  - codedir: src

artifacts:
  model-us:
    path: ${models.us.filename}
    type: model

stages:
  build-us:
    cmd: >-
      python ${codedir}/train.py --thresh ${models.us.threshold} --out
      ${models.us.filename}
    outs:
      - ${models.us.filename}:
          cache: true
Values from vars
are not tracked like parameters.
To load additional params files, list them in the top-level vars
, in the
desired order, e.g.:
vars:
- params.json
- myvar: 'value'
- config/myapp.yaml
If you have multiple pipelines in your repository, or if your params.yaml
file
is not in the same directory as the dvc.yaml
file, you must specify the params
file path in vars
in order to use the values for templating.
Notes
Param file paths will be evaluated relative to the directory the dvc.yaml
file
is in. The default params.yaml
is always loaded first, if present.
It's also possible to specify what to include from additional params files, with
a :
colon:
vars:
  - params.json:clean,feats

stages:
  featurize:
    cmd: ${feats.exec}
    deps:
      - ${clean.filename}
    outs:
      - ${feats.dirname}
DVC merges values from param files or values specified in vars
. For example,
{"grp": {"a": 1}}
merges with {"grp": {"b": 2}}
, but not with
{"grp": {"a": 7}}
.
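The merge rule in this example can be sketched as a recursive dictionary merge that combines nested groups but refuses to redefine an existing value. Treating the non-mergeable case as an error is an assumption made for this illustration, and merge is a hypothetical helper, not a DVC function.

```python
def merge(base: dict, new: dict, path: str = "") -> dict:
    """Merge `new` into a copy of `base`; raise if an existing value is redefined."""
    merged = dict(base)
    for key, value in new.items():
        where = f"{path}.{key}" if path else key
        if key not in merged:
            merged[key] = value
        elif isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = merge(merged[key], value, where)
        else:
            # Assumption: a repeated leaf key is reported as a conflict.
            raise ValueError(f"cannot redefine '{where}'")
    return merged


print(merge({"grp": {"a": 1}}, {"grp": {"b": 2}}))
try:
    merge({"grp": {"a": 1}}, {"grp": {"a": 7}})
except ValueError as error:
    print(error)
```
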
The substitution expression supports these forms:
${param} # Simple
${param.key} # Nested values through . (period)
${param.list[0]} # List elements via index in [] (square brackets)
To use the expression literally in dvc.yaml
(so DVC does not replace it with
a value), escape it with a backslash, e.g. \${...
.
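The three expression forms and the backslash escape can be mimicked with a small resolver. This is a simplified sketch and not DVC's templating engine: it assumes an escaped expression is emitted without its backslash, and the render and resolve names and the sample context are hypothetical.

```python
import re

# A path token is either a name (between periods) or an index in brackets.
TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")
# An expression is ${...}, optionally preceded by an escaping backslash.
EXPR = re.compile(r"(\\)?\$\{([^}]+)\}")


def resolve(path: str, context: dict):
    """Walk `context` following names and list indexes in `path`."""
    node = context
    for name, index in TOKEN.findall(path):
        node = node[int(index)] if index else node[name]
    return node


def render(text: str, context: dict) -> str:
    """Replace every unescaped ${...} in `text` with its value from `context`."""
    def replace(match):
        if match.group(1):  # escaped: keep the expression, drop the backslash
            return match.group(0)[1:]
        return str(resolve(match.group(2).strip(), context))
    return EXPR.sub(replace, text)


context = {"param": {"key": "value", "list": [10, 20]}}
print(render("--k ${param.key} --first ${param.list[0]}", context))
print(render(r"literal \${param.key}", context))
```
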
dvc.lock file
To record the state of your pipeline(s) and help track its outputs,
DVC will maintain a dvc.lock
file for each dvc.yaml
. Their purposes include:
- Allow DVC to detect when stage definitions, or their dependencies, have changed. Such conditions invalidate stages, requiring their reproduction (see dvc status).
- Tracking of intermediate and final outputs of a pipeline, similar to .dvc files.
- Needed for several DVC commands to operate, such as dvc checkout or dvc get.
Avoid editing these files. DVC will create and update them for you.
Here's an example:
schema: '2.0'
stages:
  features:
    cmd: jupyter nbconvert --execute featurize.ipynb
    deps:
      - path: data/clean
        md5: d8b874c5fa18c32b2d67f73606a1be60
    params:
      params.yaml:
        levels.no: 5
    outs:
      - path: features
        md5: 2119f7661d49546288b73b5730d76485
        size: 154683
      - path: performance.json
        md5: ea46c1139d771bfeba7942d1fbb5981e
        size: 975
      - path: logs.csv
        md5: f99aac37e383b422adc76f5f1fb45004
        size: 695947
Stages are listed again in dvc.lock so that DVC can tell whether their
definitions have changed in dvc.yaml.
Regular
dependency entries
and all forms of
output entries
(including metrics and
plots files) are also listed (per stage) in
dvc.lock
, including a content hash field (md5
, etag
, or checksum
).
Full parameter dependencies (both key and value) are listed too
(under params
), under each parameters file name.
For templated dvc.yaml files, the actual values are written to
dvc.lock (no ${} expression). As for foreach stages and
matrix stages, individual stages are expanded (no foreach
or matrix structures are preserved).