Helper command to create or update stages in `dvc.yaml`. Requires a name and a command.

```
usage: dvc run [-h] [-q | -v] -n <name> [-d <path>] [-o <path>] [-O <path>]
               [-p [<path>:]<params_list>] [-m <path>] [-M <path>]
               [--plots <path>] [--plots-no-cache <path>] [-w <path>]
               [--no-exec] [-f] [--no-run-cache] [--no-commit]
               [--outs-persist <path>] [--outs-persist-no-cache <path>]
               [--always-changed] [--external] [--desc <text>]
               command

positional arguments:
  command        Command for the stage.
```
Stages represent individual data processes, including their input and resulting outputs. They can be combined to capture simple data workflows, organize data science projects, or build detailed machine learning pipelines.
A stage name is required and can be provided using the `-n` (`--name`) option. The other available options are mostly meant to describe different kinds of stage dependencies and outputs. Any remaining terminal input provided to `dvc run` after the options (`-`/`--` flags) will become the required `command` argument.

`dvc run` executes stage commands, unless the `--no-exec` option is used.
By specifying lists of dependencies (`-d` option) and/or outputs (`-o` and `-O` options) for each stage, we can create a dependency graph (DAG) that connects them, i.e. the output of a stage becomes the input of another, and so on (see `dvc dag`). This graph can be restored by DVC later to modify or reproduce the full pipeline. For example:

```
$ dvc run -n printer -d write.sh -o pages ./write.sh
$ dvc run -n scanner -d read.sh -d pages -o signed.pdf ./read.sh pages
```
Stage dependencies can be any file or directory, either untracked, or more commonly tracked by DVC or Git. Outputs will be tracked and cached by DVC when the stage is run. Every output version will be cached when the stage is reproduced (see also `dvc gc`).

It's good practice to specify the command's source code file(s) as `-d` dependencies. This ensures that when the source code changes, DVC knows that the stage needs to be reproduced. (You can choose whether to do this.)
- `dvc run` checks the dependency graph integrity before creating a new stage. For example: two stages cannot specify the same output or overlapping output paths, there should be no cycles, etc.
- Directory outputs are represented by a single `.dir` entry in the cache (refer to Structure of cache directory for more info).
- Outputs are deleted by DVC before executing the stage command (e.g. on `dvc repro`) if their paths are found as existing files/directories (unless `--outs-persist` is used). This also means that the stage command needs to recreate any directory structures defined as outputs every time it's executed by DVC.
- It is possible to add missing dependencies/outputs to an existing stage in `dvc.yaml` without having to execute it again, updating `dvc.yaml` and the project's cache accordingly.
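For instance, a stage command that writes into a directory output must recreate that directory itself, since DVC removes it before each run. A minimal sketch (the input file and `processed/` directory are made-up names, not part of DVC):

```shell
# Set up a sample input file, only to keep this sketch self-contained.
printf 'a\nb\nc\n' > input.txt

# Hypothetical stage command: DVC deletes the 'processed/' directory output
# before re-executing the stage, so the command must recreate it itself.
mkdir -p processed
wc -l < input.txt > processed/line_count.txt
```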
Parameter dependencies (`--params` option) are a special type of key/value dependencies. Multiple parameter dependencies can be specified from within one or more YAML, JSON, TOML, or Python parameters files (e.g. `params.yaml`). This allows tracking experimental hyperparameters easily.
Special types of output files, metrics (`-m` and `-M` options) and plots (`--plots` and `--plots-no-cache` options), are also supported. Metrics and plots files have
specific formats (JSON, YAML, CSV, or TSV) and allow displaying and comparing
data science experiments.
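As a sketch of what a stage command might emit (the file names and values below are made up for illustration, not prescribed by DVC), a metrics file and a plots file could be produced like this:

```python
import csv
import json

# Hypothetical scalar metrics, e.g. registered with `-M metrics.json`.
with open("metrics.json", "w") as fd:
    json.dump({"accuracy": 0.93, "loss": 0.07}, fd)

# Hypothetical data series for plotting, e.g. registered with
# `--plots-no-cache loss.csv`.
with open("loss.csv", "w", newline="") as fd:
    writer = csv.writer(fd)
    writer.writerow(["epoch", "loss"])
    writer.writerows([(1, 0.9), (2, 0.5), (3, 0.07)])
```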
The `command` sent to `dvc run` can be anything your terminal would accept and run directly, for example a shell built-in, expression, or binary found in `PATH`. Please remember that any flags sent after the `command` are interpreted by the command itself, not by `dvc run`.
⚠️ Note that while DVC is platform-agnostic, the commands defined in your pipeline stages may only work on some operating systems and require certain software packages to be installed.
Wrap the command with double quotes (`"`) if there are special characters in it like `|` (pipe) or `>` (redirection), otherwise they would apply to `dvc run` itself. Use single quotes (`'`) instead if there are environment variables in it that should be evaluated dynamically. Examples:

```
$ dvc run -n first_stage "./a_script.sh > /dev/null 2>&1"
$ dvc run -n second_stage './another_script.sh $MYENVVAR'
```
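The underlying shell behavior can be seen without DVC at all: double quotes let the current shell expand variables immediately, while single quotes pass the text through literally (so it can be evaluated later, at stage runtime). A quick demonstration:

```shell
GREETING=hello
echo "expanded now: $GREETING"   # the shell substitutes the value here
echo 'kept literal: $GREETING'   # the text $GREETING is passed through as-is
```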
`-n <stage>`, `--name <stage>` (required) - specify a name for the stage generated by this command (e.g. `-n train`). Stage names can only contain letters, numbers, dash (`-`), and underscore (`_`).
`-d <path>`, `--deps <path>` - specify a file or a directory the stage depends on. Multiple dependencies can be specified like this: `-d data.csv -d process.py`. Usually, each dependency is a file or a directory with data, or a code file, or a configuration file. DVC also supports certain external dependencies. When you use `dvc repro`, the list of dependencies helps DVC analyze whether any dependencies have changed, and thus whether the stage needs to be executed again to regenerate its outputs.
`-o <path>`, `--outs <path>` - specify a file or directory that is the result of running the `command`. Multiple outputs can be specified: `-o model.pkl -o output.log`. DVC builds a dependency graph (pipeline) to connect different stages with each other based on this list of outputs and dependencies (see `-d`). DVC tracks all output files and directories and puts them into the cache (this is similar to what happens when you use `dvc add`).
`-O <path>`, `--outs-no-cache <path>` - the same as `-o` except that outputs are not tracked by DVC. This means that they are never cached, so it's up to the user to manage them separately. This is useful if the outputs are small enough to be tracked by Git directly; or large, yet you prefer to regenerate them every time (see `dvc repro`); or unwanted in storage for any other reason.
`--outs-persist <path>` - declare an output file or directory that will not be removed when `dvc repro` starts (but it can still be modified, overwritten, or even deleted by the stage command(s)).
`--outs-persist-no-cache <path>` - the same as `--outs-persist` except that outputs are not tracked by DVC (same as with `-O` above).
`-p [<path>:]<params_list>`, `--params [<path>:]<params_list>` - specify a set of parameter dependencies the stage depends on, from a parameters file. This is done by sending a comma separated list as argument, e.g. `-p learning_rate,epochs`. The default parameters file name is `params.yaml`, but this can be redefined with a prefix in the argument sent to this option, e.g. `-p parse_params.yaml:threshold`. See `dvc params` to learn more about parameters.
`-m <path>`, `--metrics <path>` - specify a metrics file produced by this stage. This option behaves like `-o` but registers the file in a `metrics` field inside the `dvc.yaml` stage. Metrics are usually small, human readable files (JSON or YAML) with scalar numbers or other simple information that describes a model (or any other data artifact). See `dvc metrics` to learn more about metrics.
`-M <path>`, `--metrics-no-cache <path>` - the same as `-m` except that DVC does not track the metrics file (same as with `-O` above). This means that it is never cached, so it's up to the user to manage it separately. This is typically desirable with metrics because they are small enough to be tracked with Git directly.
`--plots <path>` - specify a plots file produced by this stage. This option behaves like `-o` but registers the file in a `plots` field inside the `dvc.yaml` stage. Plot metrics are data series stored in tabular (CSV or TSV) or hierarchical (JSON or YAML) files, with complex information that describes a model (or any other data artifact). See `dvc plots` to learn more about plots.
`--plots-no-cache <path>` - the same as `--plots` except that DVC does not track the plots file (same as with `-M` above). This may be desirable with plots, if they are small enough to be tracked with Git directly.
`-w <path>`, `--wdir <path>` - specifies a working directory for the `command` to run in (uses the location of `dvc.yaml` by default). Dependency and output files (including metrics and plots) should be specified relative to this directory. It's used by `dvc repro` to change the working directory before executing the `command`.
`--no-exec` - write the stage to `dvc.yaml`, but do not execute the `command`. DVC will still add the outputs to `.gitignore`, but they won't be cached or recorded in `dvc.lock` (see `--no-commit` below). This is useful if you need to define a pipeline quickly, and `dvc repro` it later; or if the stage outputs already exist and you want to "DVCfy" this state of the project (see also `dvc commit`).
`-f`, `--force` - overwrite an existing stage in the `dvc.yaml` file without asking for confirmation.
`--no-run-cache` - execute the stage `command` even if it has already been run with the same dependencies/outputs/etc. before. Useful for example if the command's code is non-deterministic (not recommended).
`--no-commit` - do not store the outputs of this execution in the cache (`dvc.yaml` and `dvc.lock` are still created or updated); useful to avoid caching unnecessary data when exploring different data or stages. You can use `dvc commit` to finish the operation.

`--always-changed` - always consider this stage as changed, so that `dvc repro` always executes it. Note that regular `.dvc` files (without dependencies) are automatically considered "always changed", so this option has no effect in those cases.
`--external` - allow writing outputs outside of the DVC repository. See Managing External Data.
`--desc <text>` - user description of the stage (optional). This doesn't affect any DVC operations.

`-h`, `--help` - prints the usage/help message, and exit.

`-q`, `--quiet` - do not write anything to standard output. Exit with 0 if no problems arise, otherwise 1.

`-v`, `--verbose` - displays detailed tracing information.
Let's create a DVC project and a stage (that counts the number of lines in a `test.txt` file):

```
$ mkdir example && cd example
$ git init
$ dvc init
$ dvc run -n count \
          -d test.txt \
          -o lines \
          "cat test.txt | wc -l > lines"
Running stage 'count' with command:
    cat test.txt | wc -l > lines
Creating 'dvc.yaml'
Adding stage 'count' in 'dvc.yaml'
Generating lock file 'dvc.lock'

$ tree
.
├── dvc.lock
├── dvc.yaml
├── lines
└── test.txt
```
This results in the following stage entry in `dvc.yaml`:

```yaml
stages:
  count:
    cmd: 'cat test.txt | wc -l > lines'
    deps:
      - test.txt
    outs:
      - lines
```
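For comparison, `dvc.lock` records the same stage along with content hashes of its dependencies and outputs, so DVC can later tell whether anything changed. A sketch (the `md5` values below are placeholders, not real hashes):

```yaml
count:
  cmd: 'cat test.txt | wc -l > lines'
  deps:
    - path: test.txt
      md5: <hash-of-test.txt>
  outs:
    - path: lines
      md5: <hash-of-lines>
```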
The following stage runs a Python script that trains an ML model on the training matrix (`20180226` is a seed value):

```
$ dvc run -n train \
          -d train_model.py -d matrix-train.p -o model.p \
          python train_model.py 20180226 model.p
```
To update a stage that is already defined, the `-f` (`--force`) option is needed. Let's update the seed for the `train` stage:

```
$ dvc run -n train --force \
          -d train_model.py -d matrix-train.p -o model.p \
          python train_model.py 18494003 model.p
```
Let's move to a subdirectory and create a stage there. This generates a separate `dvc.yaml` file in that location. The stage command itself reads `data.in` and writes its result to `result.out`:

```
$ cd more_stages/
$ dvc run -n process_data \
          -d data.in \
          -o result.out \
          ./my_script.sh data.in result.out

$ tree ..
..
├── dvc.yaml
├── dvc.lock
├── file1
├── ...
└── more_stages/
    ├── data.in
    ├── dvc.lock
    ├── dvc.yaml
    └── result.out
```
DVC pipelines are constructed by connecting the outputs of a stage to the dependencies of the following one(s).
Extract an XML file from an archive to the `data/` directory:

```
$ mkdir data
$ dvc run -n extract \
          -d Posts.xml.zip \
          -o data/Posts.xml \
          unzip Posts.xml.zip -d data/
```

Note that the last `-d` applies to the stage's command (`unzip`), not to `dvc run`.
Execute an R script that parses the XML file:
```
$ dvc run -n parse \
          -d parsingxml.R -d data/Posts.xml \
          -o data/Posts.csv \
          Rscript parsingxml.R data/Posts.xml data/Posts.csv
```
To visualize how these stages are connected into a pipeline (given their outputs and dependencies), we can use `dvc dag`:

```
$ dvc dag
+---------+
| extract |
+---------+
      *
      *
      *
+---------+
|  parse  |
+---------+
```
To use specific values inside a parameters file as dependencies, create a simple YAML file named `params.yaml` (default params file name, see `dvc params` to learn more):

```yaml
seed: 20180226

train:
  lr: 0.0041
  epochs: 75
  layers: 9

processing:
  threshold: 0.98
  bow_size: 15000
```
Define a stage with both regular dependencies as well as parameter dependencies:
```
$ dvc run -n train \
          -d train_model.py -d matrix-train.p -o model.p \
          -p seed,train.lr,train.epochs \
          python train_model.py 20200105 model.p
```
`train_model.py` will include some code to open and parse the parameters:

```python
import yaml

with open("params.yaml", 'r') as fd:
    params = yaml.safe_load(fd)

seed = params['seed']
lr = params['train']['lr']
epochs = params['train']['epochs']
```
DVC will keep an eye on these param values (same as with the regular dependency files) and know that the stage should be reproduced if/when they change. See `dvc params` for more details.