repro

Reproduce complete or partial pipelines by executing commands defined in their stages in the correct order. The commands to be executed are determined by recursively analyzing dependencies and outputs of the target stages.

Synopsis

usage: dvc repro [-h] [-q | -v] [-f] [-s] [-m] [--dry] [-i]
                 [-p] [-P] [-R] [--no-run-cache] [--force-downstream]
                 [--no-commit] [--downstream] [--pull] [--glob]
                 [targets [<target> ...]]

positional arguments:
  targets       Limit command scope to these .dvc or dvc.yaml files,
                or stage names.

See targets for more details.

Description

Provides a way to regenerate data pipeline results by restoring the dependency graph (a DAG) implicitly defined by the stages listed in dvc.yaml files. The commands defined in these stages are then executed in the correct order.
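
To illustrate how a dependency graph can be implied by stage definitions, here is a minimal Python sketch. It is not DVC's implementation: the `stages` dictionary is a hand-written stand-in for parsed dvc.yaml content (it borrows the filter and count stages from the Examples section below), and `build_graph` and `execution_order` are hypothetical helper names. The only rule modeled is that a stage depends on another stage when one of its dependencies is that stage's output.

```python
from graphlib import TopologicalSorter

# Hand-written stand-in for parsed dvc.yaml stages (an assumption for
# illustration; DVC's real data structures differ).
stages = {
    "count": {"deps": ["numbers.txt", "process.py"], "outs": ["count.txt"]},
    "filter": {"deps": ["text.txt"], "outs": ["numbers.txt"]},
}


def build_graph(stages):
    """Map each stage to the set of stages that produce its dependencies."""
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    return {
        name: {producer[dep] for dep in s["deps"] if dep in producer}
        for name, s in stages.items()
    }


def execution_order(stages):
    """Return stage names so that every stage follows its upstream stages."""
    return list(TopologicalSorter(build_graph(stages)).static_order())


order = execution_order(stages)
print(order)  # ['filter', 'count']
```

Files such as text.txt and process.py are produced by no stage, so they add no edges to the graph; they are the initial dependencies of the pipeline.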

For stages with multiple commands (having a list or a multiline string in the cmd field), commands are run one after the other in the order they are defined. The failure of any command halts the execution of the remaining commands in the stage and raises an error.
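
The run-in-order, stop-on-first-failure behavior can be sketched in Python as follows. This is an illustration of the semantics only, not how DVC runs commands internally; `run_stage_commands` and its `on_run` callback are hypothetical names introduced for this example.

```python
import subprocess


def run_stage_commands(commands, on_run=None):
    """Run shell commands in order; stop at the first one that fails.

    `on_run` is an optional callback invoked with each command before it
    starts (a hypothetical hook, used here only to observe what ran).
    """
    for command in commands:
        if on_run is not None:
            on_run(command)
        # check=True raises CalledProcessError on a non-zero exit status,
        # so the remaining commands are never started.
        subprocess.run(command, shell=True, check=True)


executed = []
try:
    run_stage_commands(["echo first", "exit 3", "echo never"], executed.append)
except subprocess.CalledProcessError as error:
    print(f"command failed with exit status {error.returncode}")

print(executed)  # ['echo first', 'exit 3']
```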

Pipeline stages are defined in dvc.yaml (either manually or by using dvc run) while initial data dependencies can be registered with dvc add.

This command is similar to Make in software build automation, but DVC captures build requirements (dependencies and outputs) and caches the pipeline's outputs along the way.

💡 For convenience, a Git hook is available to remind you to dvc repro when needed after a git commit. See dvc install for more details.

By default, this command checks all pipeline stages to determine which ones have changed. Then it executes the corresponding commands (cmd field of dvc.yaml).

There are a few ways to restrict what will be regenerated by this command: by providing specific reproduction targets, or by using certain command options, such as --single-item or --all-pipelines.

Note that stages without dependencies are considered always changed, so dvc repro always executes them.

Stage outputs are deleted from the workspace before executing the stage commands that produce them (unless persist: true is used in dvc.yaml).

dvc repro does not run dvc fetch, dvc pull or dvc checkout to get data files, intermediate or final results (except if the --pull option is used).

It stores all the data files, intermediate or final results in the cache (unless the --no-commit option is used), and updates the hash values of changed dependencies and outputs in the dvc.lock and .dvc files.
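
As a rough illustration of how stored hash values make change detection possible, the sketch below compares the current hash of each file with a previously recorded one. It assumes MD5 as the hash function and uses a flat path-to-hash mapping; this is a simplification for illustration, not the real dvc.lock layout, and `file_md5` and `changed_files` are hypothetical helpers.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(recorded, root="."):
    """List paths whose current hash differs from the recorded one.

    `recorded` maps a relative path to its previously stored hash.
    A missing file also counts as changed.
    """
    changed = []
    for rel_path, old_hash in recorded.items():
        path = Path(root) / rel_path
        if not path.is_file() or file_md5(path) != old_hash:
            changed.append(rel_path)
    return changed


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as workspace:
    source = Path(workspace) / "text.txt"
    source.write_text("dvc\n1231\n")
    recorded = {"text.txt": file_md5(source)}
    before_edit = changed_files(recorded, workspace)
    source.write_text("dvc\n1231\nis\n")
    after_edit = changed_files(recorded, workspace)

print(before_edit, after_edit)  # [] ['text.txt']
```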

Parallel stage execution

Currently, dvc repro is not able to parallelize stage execution automatically. If you need to do this, you can launch dvc repro multiple times manually. For example, let's say a pipeline graph looks something like this:

$ dvc dag
+--------+          +--------+
|   A1   |          |   B1   |
+--------+          +--------+
     *                   *
     *                   *
     *                   *
+--------+          +--------+
|   A2   |          |   B2   |
+--------+          +--------+
          *         *
           **     **
             *   *
        +------------+
        |    train   |
        +------------+

This pipeline consists of two parallel branches (A and B), and the final train stage, where the branches merge. If you run dvc repro at this point, it would reproduce each branch sequentially before train. To reproduce both branches simultaneously, you could run dvc repro A2 and dvc repro B2 at the same time (e.g. in separate terminals). After both finish successfully, you can then run dvc repro train: DVC will know that both branches are already up-to-date and only execute the final stage.
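
The manual procedure described above can also be scripted. The following Python sketch starts one `dvc repro` process per branch, waits for all of them, and only then reproduces the final stage. It assumes `dvc` is on the PATH and that the script runs inside the DVC project; `repro` and `repro_branches_then` are hypothetical helper names, and the `run` parameter exists only so the process launcher can be swapped out (for example, in tests).

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def repro(target, run=subprocess.run):
    """Run `dvc repro <target>` and raise if it exits with an error."""
    return run(["dvc", "repro", target], check=True)


def repro_branches_then(branch_targets, final_target, run=subprocess.run):
    """Reproduce independent branches concurrently, then the final stage."""
    with ThreadPoolExecutor(max_workers=len(branch_targets)) as pool:
        # Each worker thread just waits on its own `dvc repro` process.
        futures = [pool.submit(repro, t, run) for t in branch_targets]
        for future in futures:
            future.result()  # re-raises if that branch failed
    # Reached only when every branch succeeded.
    return repro(final_target, run)
```

For the graph shown above, the call would be `repro_branches_then(["A2", "B2"], "train")`. Threads are sufficient here because the actual work happens in the child processes.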

Options

  • targets (optional command argument) - what to reproduce (the pipeline(s) in ./dvc.yaml by default). Different things can be provided as targets depending on the flags used (more details in each option). Examples:

  • -R, --recursive - looks for dvc.yaml files to reproduce in any directories given as targets, and in their subdirectories. If there are no directories among the targets, this option has no effect.
  • --glob - causes the targets to be interpreted as wildcard patterns to match for stage names. For example: train-* (certain stage names) or models/dvc.yaml:train-* (stages in specific dvc.yaml file). Note that patterns are matched only against the stage names in the specified file, not against the file path.
  • -s, --single-item - reproduce only a single stage by turning off the recursive search for changed dependencies. Multiple stages are executed (non-recursively) if multiple stage names are given as targets.
  • -f, --force - reproduce a pipeline, regenerating its results, even if no changes were found. This executes all of the stages by default, but it can be limited with the targets argument, or the -s, -p options.
  • --no-commit - do not store the outputs of this execution in the cache (dvc.yaml and dvc.lock are still created or updated); useful to avoid caching unnecessary data when exploring different data or stages. Use dvc commit to finish the operation.
  • -m, --metrics - show metrics after reproduction. The target pipelines must have at least one metrics file defined either with the dvc metrics command, or by the -M or -m options of the dvc run command.
  • --dry - only print the commands that would be executed without actually executing the commands.
  • -i, --interactive - ask for confirmation before reproducing each stage. The stage is only executed if the user types "y".
  • -p, --pipeline - reproduce the entire pipelines that the targets belong to. Use dvc dag <target> to show the parent pipeline of a target.
  • -P, --all-pipelines - reproduce all pipelines for all dvc.yaml files present in the DVC project. Specifying targets has no effect with this option, as all possible targets are already included.
  • --no-run-cache - execute stage commands even if they have already been run with the same dependencies/outputs/etc. before.
  • --force-downstream - in cases like ... -> A (changed) -> B -> C it will reproduce A first and then B, even if B was previously executed with the same inputs from A (cached). To be precise, it reproduces all descendants of a changed stage or the stages following the changed stage, even if their direct dependencies did not change.

    It can be useful when we have a common dependency among all stages, and want to specify it only once (for stage A here). For example, if we know that all stages (A and below) depend on requirements.txt, we can specify it in A, and omit it in B and C.

    Like with the --force option on dvc run, this is a way to force-execute stages without changes. This can also be useful for pipelines containing stages that produce non-deterministic (semi-random) outputs, where outputs can vary on each execution, meaning the cache cannot be trusted for such stages.

  • --downstream - only execute the stages after the given targets in their corresponding pipelines, including the target stages themselves. This option has no effect if targets are not provided.
  • --pull - pulls dependencies and outputs involved in the stages being reproduced, if they are found in the default remote storage. Note that it checks the local run-cache too (available history of stage runs).

    Has no effect if combined with --no-run-cache.

  • -h, --help - prints the usage/help message, and exits.
  • -q, --quiet - do not write anything to standard output. Exit with 0 if all stages are up to date or if all stages are successfully executed, otherwise exit with 1. The command defined in the stage is free to write output regardless of this flag.
  • -v, --verbose - displays detailed tracing information.
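
When `dvc repro` is called from a script, the options above can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of DVC: it covers only a handful of the flags listed in this section and returns an argument list suitable for `subprocess.run`. Rejecting --downstream without targets is a design choice of this sketch (the option simply has no effect in that case).

```python
def build_repro_command(targets=(), force=False, single_item=False,
                        dry=False, downstream=False, pull=False, glob=False):
    """Assemble a `dvc repro` argument list from a few documented options."""
    if downstream and not targets:
        # --downstream has no effect without targets, so treat it as a mistake.
        raise ValueError("--downstream requires at least one target")
    flags = [
        ("--force", force),
        ("--single-item", single_item),
        ("--dry", dry),
        ("--downstream", downstream),
        ("--pull", pull),
        ("--glob", glob),
    ]
    command = ["dvc", "repro"]
    command += [flag for flag, enabled in flags if enabled]
    return command + list(targets)


print(build_repro_command(["count"], downstream=True))
# ['dvc', 'repro', '--downstream', 'count']
```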

Examples

To get hands-on experience with data science and machine learning pipelines, see Get Started: Data Pipelines.

Let's build and reproduce a simple pipeline. It takes this text.txt file:

dvc
1231
is
3
the
best

And runs a few simple transformations to filter and count numbers:

$ dvc run -n filter -d text.txt -o numbers.txt \
           "cat text.txt | egrep '[0-9]+' > numbers.txt"

$ dvc run -n count -d numbers.txt -d process.py -M count.txt \
           "python process.py numbers.txt > count.txt"

Where process.py is a script that, for simplicity, just prints the number of lines:

import sys
num_lines = 0
with open(sys.argv[1], 'r') as f:
    for line in f:
        num_lines += 1
print(num_lines)

The result of executing these dvc run commands should look like this:

$ tree
.
├── count.txt      <---- result: "2"
├── dvc.lock       <---- file to record pipeline state
├── dvc.yaml       <---- file containing list of stages.
├── numbers.txt    <---- intermediate result of the first stage
├── process.py     <---- code that implements data transformation
└── text.txt       <---- text file to process

You may want to check the contents of dvc.lock and count.txt for later reference.

Ok, now let's run dvc repro:

$ dvc repro
Stage 'filter' didn't change, skipping
Stage 'count' didn't change, skipping
Data and pipelines are up to date.

This makes sense, since we haven't changed any of the dependencies of this pipeline (text.txt and process.py). Now, let's imagine we want to print a description, so we add this line to process.py:

...
print('Number of lines:')
print(num_lines)

If we now run dvc repro, we should see this:

$ dvc repro
Stage 'filter' didn't change, skipping
Running stage 'count' with command:
        python process.py numbers.txt > count.txt
Updating lock file 'dvc.lock'

You can now check that dvc.lock and count.txt have been updated with the new information: updated dependency/output file hash values, and a new result, respectively.

Example: Downstream from a target stage

This example continues the previous one.

The --downstream option, when used with a target stage, allows us to only reproduce results from commands after that specific stage in a pipeline. To demonstrate how it works, let's make a change in text.txt (the input of our first stage, created in the previous example):

...
The answer to universe is 42
- The Hitchhiker's Guide to the  Galaxy

Let's say we also want to print the filename in the description, so we update process.py as follows:

print(f'Number of lines in {sys.argv[1]}:')
print(num_lines)

Now, using the --downstream option with dvc repro results in the execution of only the target (count) and following stages (none in this case):

$ dvc repro --downstream count
Running stage 'count' with command:
        python process.py numbers.txt > count.txt
Updating lock file 'dvc.lock'

The change in text.txt is ignored because that file is a dependency in the filter stage, which wasn't executed by the dvc repro above. This is because filter happens before the target (count) in the pipeline (see dvc dag), as shown below:

$ dvc dag

  +--------+
  | filter |
  +--------+
      *
      *
      *
  +-------+
  | count |
  +-------+

Note that using dvc repro without --downstream in the above example results in the execution of the target (count), and the preceding stages (only 'filter' in this case).
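
The difference between the two modes can be modeled with a small Python sketch over the pipeline above. This only models which stages are considered for a given target; it is not DVC's code, the function names are hypothetical, and DVC still executes only the considered stages that have actually changed.

```python
# Each stage maps to the stages it directly depends on (the pipeline above).
upstream_of = {"filter": set(), "count": {"filter"}}


def ancestors(graph, target):
    """Return every stage that `target` depends on, directly or not."""
    found, stack = set(), list(graph[target])
    while stack:
        stage = stack.pop()
        if stage not in found:
            found.add(stage)
            stack.extend(graph[stage])
    return found


def descendants(graph, target):
    """Return every stage that depends on `target`, directly or not."""
    return {stage for stage in graph if target in ancestors(graph, stage)}


def candidate_stages(graph, target, downstream=False):
    """Stages considered for a target, with or without --downstream."""
    if downstream:
        related = descendants(graph, target)
    else:
        related = ancestors(graph, target)
    return related | {target}


print(sorted(candidate_stages(upstream_of, "count")))
# ['count', 'filter']
print(sorted(candidate_stages(upstream_of, "count", downstream=True)))
# ['count']
print(sorted(candidate_stages(upstream_of, "filter", downstream=True)))
# ['count', 'filter']
```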
