run

Helper command to create or update stages in dvc.yaml. Requires a name and a command.

Synopsis

usage: dvc run [-h] [-q | -v] [-n <name>] [--no-exec] [--no-run-cache]
               [--no-commit] [-d <path>] [-o <filename>] [-O <filename>]
               [-p [<filename>:]<params_list>] [-m <path>] [-M <path>]
               [--plots <path>] [--plots-no-cache <path>] [--live <path>]
               [--live-no-cache <path>] [--live-no-summary] [--live-no-html]
               [-w <path>] [-f] [--outs-persist <filename>]
               [--outs-persist-no-cache <filename>] [-c <filename>]
               [--always-changed] [--external] [--desc <text>]
               command

positional arguments:
  command               Command for the stage.

Description

dvc run is a helper for creating or updating pipeline stages in a dvc.yaml file (located in the current working directory).

Stages represent individual data processes, including their input and resulting outputs. They can be combined to capture simple data workflows, organize data science projects, or build detailed machine learning pipelines.

A stage name is required and can be provided using the -n (--name) option. The other available options are mostly meant to describe different kinds of stage dependencies and outputs. The remaining terminal input provided to dvc run after -/-- flags will become the required command argument.

dvc run executes stage commands, unless the --no-exec option is used.

💡 Avoiding unexpected behavior

We don't want to tell anyone how to write their code or what programs to use! However, please be aware that in order to prevent unexpected results when DVC reproduces pipeline stages, the underlying code should ideally follow these rules:

  • Read/write exclusively from/to the specified dependencies and outputs (including parameters files, metrics, and plots).
  • Completely rewrite outputs. Do not append or edit.
  • Stop reading and writing files when the command exits.

Also, if your pipeline reproducibility goals include consistent output data, its code should be deterministic (produce the same output for any given input): avoid code that increases entropy (e.g. random numbers, time functions, hardware dependencies, etc.).
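As a minimal sketch of these rules, a stage command written in Python could read only its declared dependency, seed its random generator, and rewrite its output from scratch. The function and file names here are hypothetical and not part of DVC:

```python
import random


def shuffle_lines(in_path, out_path, seed):
    """Read a declared dependency and fully rewrite a declared output."""
    with open(in_path) as fd:
        lines = fd.read().splitlines()

    # A fixed seed makes the shuffle repeatable for the same input.
    rng = random.Random(seed)
    rng.shuffle(lines)

    # Mode "w" truncates the file: the output is rewritten, never appended to.
    with open(out_path, "w") as fd:
        fd.write("\n".join(lines) + "\n")
```

Both files are closed when the function returns, so nothing keeps reading or writing after the command exits.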

Dependencies and outputs

By specifying lists of dependencies (-d option) and/or outputs (-o and -O options) for each stage, we can create a dependency graph (DAG) that connects them, i.e. the output of a stage becomes the input of another, and so on (see dvc dag). This graph can be restored by DVC later to modify or reproduce the full pipeline. For example:

$ dvc run -n printer -d write.sh -o pages ./write.sh
$ dvc run -n scanner -d read.sh -d pages -o signed.pdf ./read.sh pages

Stage dependencies can be any file or directory, either untracked, or more commonly tracked by DVC or Git. Outputs will be tracked and cached by DVC when the stage is run. Every output version will be cached when the stage is reproduced (see also dvc gc).

Relevant notes:

  • Typically, scripts being run (or possibly a directory containing the source code) are included among the specified -d dependencies. This ensures that when the source code changes, DVC knows that the stage needs to be reproduced. (You can choose whether to do this.)
  • dvc run checks the dependency graph integrity before creating a new stage. For example: two stages cannot specify the same output or overlapping output paths, there should be no cycles, etc.
  • DVC does not feed dependency files to the command being run. The program will have to read by itself the files specified with -d.
  • Entire directories produced by the stage can be tracked as outputs by DVC, which generates a single .dir entry in the cache (refer to Structure of cache directory for more info.)
  • External dependencies and external outputs (outside of the workspace) are also supported (except metrics and plots).
  • Outputs are deleted from the workspace before executing the command (including at dvc repro) if their paths are found as existing files/directories (unless --outs-persist is used). This also means that the stage command needs to recreate any directory structures defined as outputs every time it's executed by DVC.
  • In some situations, we have previously executed a stage, and later notice that some of the files/directories used by the stage as dependencies, or created as outputs are missing from dvc.yaml. It is possible to add missing dependencies/outputs to an existing stage without having to execute it again.
  • Renaming dependencies or outputs requires a manual process to update dvc.yaml and the project's cache accordingly.
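Because outputs are removed before the command runs, a stage script that writes into an output directory has to create that directory itself. The following is a minimal sketch, assuming a hypothetical script that writes the pages output of the printer stage above; the function name and file layout are illustrative only:

```python
import os


def write_pages(out_dir, pages):
    """Write one text file per page into a directory declared as an output."""
    # DVC removes declared outputs before running the command, so the
    # directory may not exist: recreate it on every run.
    os.makedirs(out_dir, exist_ok=True)
    for number, text in enumerate(pages, start=1):
        with open(os.path.join(out_dir, f"page{number}.txt"), "w") as fd:
            fd.write(text)
```

Using exist_ok=True keeps the script working both on a fresh run and when the directory is kept with --outs-persist.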

For displaying and comparing data science experiments

Parameters (-p/--params option) are a special type of key/value dependencies. Multiple parameter dependencies can be specified from within one or more YAML, JSON, TOML, or Python parameters files (e.g. params.yaml). This allows tracking experimental hyperparameters easily.

Special types of output files, metrics (-m and -M options) and plots (--plots and --plots-no-cache options), are also supported. Metrics and plots files have specific formats (JSON, YAML, CSV, or TSV) and allow displaying and comparing data science experiments.
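For illustration, a stage script might produce a JSON metrics file along these lines. This is a sketch only: the metric names and the helper function are assumptions, not something DVC prescribes.

```python
import json


def write_metrics(path, accuracy, loss):
    """Save scalar metrics as JSON, rewriting the file on every run."""
    metrics = {"accuracy": accuracy, "loss": loss}
    with open(path, "w") as fd:
        json.dump(metrics, fd, indent=2)
    return metrics
```

The resulting file could then be declared with -m or -M when defining the stage.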

The command argument

The command sent to dvc run can be anything your terminal would accept and run directly, for example a shell built-in, expression, or binary found in PATH. Please remember that any flags sent after the command are interpreted by the command itself, not by dvc run.

โš ๏ธ While DVC is platform-agnostic, the commands defined in your pipeline stages may only work on some operating systems and require certain software packages to be installed.

Wrap the command with double quotes " if there are special characters in it like | (pipe) or <, > (redirection), otherwise they would apply to dvc run itself. Use single quotes ' instead if there are environment variables in it that should be evaluated dynamically. Examples:

$ dvc run -n first_stage "./a_script.sh > /dev/null 2>&1"
$ dvc run -n second_stage './another_script.sh $MYENVVAR'

Options

  • -n <stage>, --name <stage> (required) - specify a name for the stage generated by this command (e.g. -n train). Stage names can only contain letters, numbers, dash - and underscore _.
  • -d <path>, --deps <path> - specify a file or a directory the stage depends on. Multiple dependencies can be specified like this: -d data.csv -d process.py. Usually, each dependency is a file or a directory with data, or a code file, or a configuration file. DVC also supports certain external dependencies.

    When you use dvc repro, the list of dependencies helps DVC determine whether any dependencies have changed and, therefore, which stages need to be executed to regenerate their outputs.

  • -o <path>, --outs <path> - specify a file or directory that is the result of running the command. Multiple outputs can be specified: -o model.pkl -o output.log. DVC builds a dependency graph (pipeline) to connect different stages with each other based on this list of outputs and dependencies (see -d). DVC tracks all output files and directories and puts them into the cache (this is similar to what's happening when you use dvc add).
  • -O <path>, --outs-no-cache <path> - the same as -o except that outputs are not tracked by DVC. This means that they are never cached, so it's up to the user to manage them separately. This is useful if the outputs are small enough to be tracked by Git directly; or large, yet you prefer to regenerate them every time (see dvc repro); or unwanted in storage for any other reason.
  • --outs-persist <path> - declare output file or directory that will not be removed when dvc repro starts (but it can still be modified, overwritten, or even deleted by the stage command(s)).
  • --outs-persist-no-cache <path> - the same as --outs-persist except that outputs are not tracked by DVC (same as with -O above).
  • -c <path>, --checkpoints <path> - the same as -o but also marks the output as a checkpoint. Implies --no-exec. This makes the stage incompatible with dvc repro.
  • -p [<path>:]<params_list>, --params [<path>:]<params_list> - specify a set of parameter dependencies the stage depends on, from a parameters file. This is done by sending a comma separated list as argument, e.g. -p learning_rate,epochs. The default parameters file name is params.yaml, but this can be redefined with a prefix in the argument sent to this option, e.g. -p parse_params.yaml:threshold. See dvc params to learn more about parameters.
  • -m <path>, --metrics <path> - specify a metrics file produced by this stage. This option behaves like -o but registers the file in a metrics field inside the dvc.yaml stage. Metrics are usually small, human readable files (JSON or YAML) with scalar numbers or other simple information that describes a model (or any other data artifact). See dvc metrics to learn more about metrics.
  • -M <path>, --metrics-no-cache <path> - the same as -m except that DVC does not track the metrics file (same as with -O above). This means that they are never cached, so it's up to the user to manage them separately. This is typically desirable with metrics because they are small enough to be tracked with Git directly.
  • --plots <path> - specify a plot metrics file produced by this stage. This option behaves like -o but registers the file in a plots field inside the dvc.yaml stage. Plot metrics are data series stored in tabular (CSV or TSV) or hierarchical (JSON or YAML) files, with complex information that describes a model (or any other data artifact). See dvc plots to learn more about plots.
  • --plots-no-cache <path> - the same as --plots except that DVC does not track the plots file (same as with -O and -M above). This may be desirable with plots, if they are small enough to be tracked with Git directly.
  • -w <path>, --wdir <path> - specifies a working directory for the command to run in (uses the wdir field in dvc.yaml). Dependency and output files (including metrics and plots) should be specified relative to this directory. It's used by dvc repro to change the working directory before executing the command.
  • --no-exec - write the stage to dvc.yaml, but do not execute the command. DVC will still add the outputs to .gitignore, but they won't be cached or recorded in dvc.lock (like with --no-commit below). This is useful if you need to define a pipeline quickly, and dvc repro it later; or if the stage outputs already exist and you want to "DVCfy" this state of the project (see also dvc commit).
  • -f, --force - overwrite an existing stage in dvc.yaml file without asking for confirmation.
  • --no-run-cache - execute the stage command(s) even if they have already been run with the same dependencies and outputs (see the details). Useful, for example, if the stage command is non-deterministic (not recommended).
  • --no-commit - do not store the outputs of this execution in the cache (dvc.yaml and dvc.lock are still created or updated); useful to avoid caching unnecessary data when exploring different data or stages. You can use dvc commit to finish the operation.
  • --always-changed - always consider this stage as changed (uses the always_changed field in dvc.yaml). As a result dvc status will report it as always changed and dvc repro will always execute it.

    Note that regular .dvc files (without dependencies) are automatically considered "always changed", so this option has no effect in those cases.

  • --external - allow writing outputs outside of the DVC repository. See Managing External Data.
  • --desc <text> - user description of the stage (optional). This doesn't affect any DVC operations.
  • --live <path> - specify the directory path for Dvclive to write logs in. path will be tracked (cached) by DVC. Saved in the live field of dvc.yaml.
  • --live-no-cache <path> - the same as --live except that the path is not tracked by DVC. Useful if you prefer to track it with Git.
  • --live-no-summary - passes summary=False to Dvclive config.
  • --live-no-html - passes html=False to Dvclive config.
  • -h, --help - print the usage/help message and exit.
  • -q, --quiet - do not write anything to standard output. Exit with 0 if no problems arise, otherwise 1.
  • -v, --verbose - displays detailed tracing information.
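The [<path>:]<params_list> syntax accepted by -p can be read as an optional file name followed by a comma-separated list of parameter names. The helper below is a hypothetical illustration of that reading; it is not DVC's actual parser:

```python
DEFAULT_PARAMS_FILE = "params.yaml"


def split_params_arg(arg):
    """Split a '[<path>:]<params_list>' value into (path, [names])."""
    # Illustration only: split on the last colon, if there is one.
    path, sep, names = arg.rpartition(":")
    if not sep:
        # No prefix was given, so the default parameters file applies.
        path = DEFAULT_PARAMS_FILE
    return path, names.split(",")
```

With this sketch, learning_rate,epochs maps to the default params.yaml, while parse_params.yaml:threshold names a custom file.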

Examples

Let's create a stage (that counts the number of lines in a test.txt file):

$ dvc run -n count \
          -d test.txt \
          -o lines \
          "cat test.txt | wc -l > lines"
Running stage 'count' with command:
        cat test.txt | wc -l > lines
Creating 'dvc.yaml'
Adding stage 'count' in 'dvc.yaml'
Generating lock file 'dvc.lock'

$ tree
.
├── dvc.lock
├── dvc.yaml
├── lines
└── test.txt

This results in the following stage entry in dvc.yaml:

stages:
  count:
    cmd: 'cat test.txt | wc -l > lines'
    deps:
      - test.txt
    outs:
      - lines

Example: Overwrite an existing stage

The following stage runs a Python script that trains an ML model on the training dataset (20180226 is a seed value):

$ dvc run -n train \
          -d train_model.py -d matrix-train.p -o model.p \
          python train_model.py 20180226 model.p

To update a stage that is already defined, the -f (--force) option is needed. Let's update the seed for the train stage:

$ dvc run -n train --force \
          -d train_model.py -d matrix-train.p -o model.p \
          python train_model.py 18494003 model.p

Example: Separate stages in a subdirectory

Let's move to a subdirectory and create a stage there. This generates a separate dvc.yaml file in that location. The stage command runs my_script.sh, which takes data.in as input and writes result.out.

$ cd more_stages/
$ dvc run -n process_data \
          -d data.in \
          -o result.out \
          ./my_script.sh data.in result.out
$ tree ..
.
├── dvc.yaml
├── dvc.lock
├── file1
├── ...
└── more_stages/
    ├── data.in
    ├── dvc.lock
    ├── dvc.yaml
    └── result.out

Example: Chaining stages

DVC pipelines are constructed by connecting the outputs of a stage to the dependencies of the following one(s).

Extract an XML file from an archive to the data/ folder:

$ dvc run -n extract \
          -d Posts.xml.zip \
          -o data/Posts.xml \
          unzip Posts.xml.zip -d data/

Note that the last -d applies to the stage's command (unzip), not to dvc run.

Execute an R script that parses the XML file:

$ dvc run -n parse \
          -d parsingxml.R -d data/Posts.xml \
          -o data/Posts.csv \
          Rscript parsingxml.R data/Posts.xml data/Posts.csv

To visualize how these stages are connected into a pipeline (given their outputs and dependencies), we can use dvc dag:

$ dvc dag
+---------+
| extract |
+---------+
      *
      *
      *
+---------+
|  parse  |
+---------+
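The way outputs and dependencies link stages can be sketched in a few lines of Python. This is a simplified illustration that matches paths exactly; it is not how DVC builds its graph internally, and the dictionary layout is an assumption modeled on the deps and outs fields of dvc.yaml:

```python
def stage_edges(stages):
    """Return (upstream, downstream) pairs where an output feeds a dependency."""
    # Map every output path to the stage that produces it.
    producers = {}
    for name, stage in stages.items():
        for out in stage.get("outs", []):
            producers[out] = name

    # A dependency produced by another stage creates an edge between them.
    edges = []
    for name, stage in stages.items():
        for dep in stage.get("deps", []):
            upstream = producers.get(dep)
            if upstream is not None and upstream != name:
                edges.append((upstream, name))
    return edges


stages = {
    "extract": {"deps": ["Posts.xml.zip"], "outs": ["data/Posts.xml"]},
    "parse": {
        "deps": ["parsingxml.R", "data/Posts.xml"],
        "outs": ["data/Posts.csv"],
    },
}
print(stage_edges(stages))  # [('extract', 'parse')]
```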

Example: Using parameter dependencies

To use specific values inside a parameters file as dependencies, create a simple YAML file named params.yaml (default params file name, see dvc params to learn more):

seed: 20180226

train:
  lr: 0.0041
  epochs: 75
  layers: 9

processing:
  threshold: 0.98
  bow_size: 15000

Define a stage with both regular dependencies as well as parameter dependencies:

$ dvc run -n train \
          -d train_model.py -d matrix-train.p -o model.p \
          -p seed,train.lr,train.epochs \
          python train_model.py 20200105 model.p

train_model.py will include some code to open and parse the parameters:

import yaml

with open("params.yaml", 'r') as fd:
    params = yaml.safe_load(fd)

seed = params['seed']
lr = params['train']['lr']
epochs = params['train']['epochs']

DVC will keep an eye on these param values (same as with the regular dependency files) and know that the stage should be reproduced if/when they change. See dvc params for more details.
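Conceptually, only the listed parameter names matter when deciding whether the stage has changed. The sketch below illustrates that idea with dotted names such as train.lr; the helper functions are hypothetical and do not reflect DVC's implementation:

```python
def get_param(params, dotted_name):
    """Look up a value such as 'train.lr' in nested parameters."""
    value = params
    for key in dotted_name.split("."):
        value = value[key]
    return value


def changed_params(old, new, names):
    """Return the tracked names whose values differ between two versions."""
    return [n for n in names if get_param(old, n) != get_param(new, n)]


old = {"seed": 20180226, "train": {"lr": 0.0041, "epochs": 75, "layers": 9}}
new = {"seed": 20180226, "train": {"lr": 0.0041, "epochs": 100, "layers": 12}}
tracked = ["seed", "train.lr", "train.epochs"]
print(changed_params(old, new, tracked))  # ['train.epochs']
```

Here train.layers also differs, but it is not in the tracked list, so it is not reported.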