Running pipelines

To run a pipeline, use dvc exp run. This executes the pipeline and saves the results as an experiment:

$ dvc exp run
'data/data.xml.dvc' didn't change, skipping
Running stage 'prepare':
> python src/prepare.py data/data.xml
Updating lock file 'dvc.lock'

Running stage 'featurize':
> python src/featurization.py data/prepared data/features
Updating lock file 'dvc.lock'

Running stage 'train':
> python src/train.py data/features model.pkl
Updating lock file 'dvc.lock'

Running stage 'evaluate':
> python src/evaluate.py model.pkl data/features
Updating lock file 'dvc.lock'

Ran experiment(s): barer-acts
Experiment results have been applied to your workspace.

If you do not want to save the results as an experiment, use dvc repro instead. It runs the pipeline in the same way but does not save an experiment or provide the other experiment-related features of dvc exp run.

Before a stage's command is executed, DVC deletes that stage's outputs from the workspace (unless persist: true is set for the output in dvc.yaml).

DAG

DVC runs the stages of the DAG (directed acyclic graph) sequentially, in the order determined by their dependencies and outputs. Consider this example dvc.yaml:

stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
      - data/data.xml
      - src/prepare.py
    params:
      - prepare.seed
      - prepare.split
    outs:
      - data/prepared
  featurize:
    cmd: python src/featurization.py data/prepared data/features
    deps:
      - data/prepared
      - src/featurization.py
    params:
      - featurize.max_features
      - featurize.ngrams
    outs:
      - data/features

The prepare stage will always precede the featurize stage because data/prepared is an output of prepare and a dependency of featurize.
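
The ordering rule above can be pictured with a short Python sketch. This is only an illustration of the idea, not DVC's actual implementation: it assumes stages are plain dictionaries with deps and outs lists, and the helper name execution_order is made up for this example. It links each stage to the stages that produce its dependencies and sorts the result topologically.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Stage definitions mirroring the dvc.yaml above (only deps and outs matter here).
stages = {
    "prepare": {
        "deps": ["data/data.xml", "src/prepare.py"],
        "outs": ["data/prepared"],
    },
    "featurize": {
        "deps": ["data/prepared", "src/featurization.py"],
        "outs": ["data/features"],
    },
}


def execution_order(stages):
    """Return stage names ordered so that producers run before consumers.

    Hypothetical helper for illustration; not part of DVC.
    """
    # Map every output path to the stage that produces it.
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    # A stage depends on whichever stages produce one of its dependencies.
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(stages))  # ['prepare', 'featurize']
```

Because data/prepared appears in the outs of prepare and in the deps of featurize, the sketch always places prepare first, regardless of the order in which the stages are listed.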

Caching Stages

DVC will try to avoid recomputing stages that have been run before. If you run a stage without changing its commands, dependencies, or parameters, DVC will skip that stage:

Stage 'prepare' didn't change, skipping

DVC will also recover the outputs from previous runs using the run cache:

Stage 'prepare' is cached - skipping run, checking out outputs

If you want a stage to run every time, you can set always_changed: true for it in dvc.yaml:

stages:
  pull_latest:
    cmd: python pull_latest.py
    deps:
      - pull_latest.py
    outs:
      - latest_results.csv
    always_changed: true
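
One way to picture the skip decision described in this section is as a comparison of signatures. The Python sketch below is an assumption-laden illustration, not how DVC stores or compares state internally: the functions stage_signature and needs_run, the in-memory file contents, and the parameter values are all hypothetical.

```python
import hashlib
import json


def stage_signature(cmd, dep_contents, params):
    """Hash everything that should trigger a rerun when it changes.

    Hypothetical helper; dep_contents maps a path to that file's bytes.
    """
    payload = {
        "cmd": cmd,
        # Hash each dependency's content rather than embedding it whole.
        "deps": {
            path: hashlib.sha256(data).hexdigest()
            for path, data in dep_contents.items()
        },
        "params": params,
    }
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def needs_run(signature, recorded_signature, always_changed=False):
    """A stage runs if it is marked always_changed or its signature differs."""
    return always_changed or signature != recorded_signature


# Made-up inputs standing in for the prepare stage.
cmd = "python src/prepare.py data/data.xml"
deps = {"data/data.xml": b"<rows/>", "src/prepare.py": b"print('prepare')"}
params = {"prepare.seed": 1, "prepare.split": 0.2}

recorded = stage_signature(cmd, deps, params)  # signature saved after the last run

print(needs_run(stage_signature(cmd, deps, params), recorded))  # False
changed = {**params, "prepare.split": 0.3}
print(needs_run(stage_signature(cmd, deps, changed), recorded))  # True
print(needs_run(recorded, recorded, always_changed=True))  # True
```

The first call corresponds to the "didn't change, skipping" message, the second to a rerun after a parameter edit, and the third to a stage that is forced to run every time.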

Debugging Stages

If you are using advanced features to interpolate values for your pipeline, like templating or Hydra composition, you can inspect the interpolated values by running dvc exp run -vv. The trace output includes information like:

2023-05-18 07:38:43,955 TRACE: Hydra composition enabled.
Contents dumped to params.yaml: {'model': {'batch_size':
512, 'latent_dim': 8, 'lr': 0.01, 'duration': '00:00:30:00',
'max_epochs': 2}, 'data_path': 'fra.txt', 'num_samples':
100000, 'seed': 423}
2023-05-18 07:38:44,027 TRACE: Context during resolution of
stage download: {'model': {'batch_size': 512, 'latent_dim':
8, 'lr': 0.01, 'duration': '00:00:30:00', 'max_epochs': 2},
'data_path': 'fra.txt', 'num_samples': 100000, 'seed': 423}
2023-05-18 07:38:44,073 TRACE: Context during resolution of
stage train: {'model': {'batch_size': 512, 'latent_dim': 8,
'lr': 0.01, 'duration': '00:00:30:00', 'max_epochs': 2},
'data_path': 'fra.txt', 'num_samples': 100000, 'seed': 423}
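
To make the "context during resolution" in the trace more concrete, the following Python sketch substitutes ${...} placeholders using the context dictionary shown above. It is a simplified illustration, not DVC's resolver: the functions lookup and resolve and the example command string are hypothetical, and features such as nested interpolation or non-string values are not handled.

```python
import re

# Matches ${dotted.key} placeholders.
PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def lookup(context, dotted_key):
    """Walk a nested dict following a dotted key such as 'model.lr'."""
    value = context
    for part in dotted_key.split("."):
        value = value[part]
    return value


def resolve(template, context):
    """Replace every ${key} in template with the matching context value."""
    return PLACEHOLDER.sub(
        lambda match: str(lookup(context, match.group(1).strip())), template
    )


# Context copied from the trace output above.
context = {
    "model": {
        "batch_size": 512,
        "latent_dim": 8,
        "lr": 0.01,
        "duration": "00:00:30:00",
        "max_epochs": 2,
    },
    "data_path": "fra.txt",
    "num_samples": 100000,
    "seed": 423,
}

# Hypothetical templated command, not taken from the original pipeline.
cmd = "python train.py --lr ${model.lr} --epochs ${model.max_epochs} --data ${data_path}"
print(resolve(cmd, context))
# python train.py --lr 0.01 --epochs 2 --data fra.txt
```

Comparing the context printed by -vv with the placeholders in dvc.yaml in this way can help spot a misspelled key or an unexpected value.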