# Running pipelines
To run a pipeline, you can use `dvc exp run`. This will run the pipeline and save the results as an experiment:
```
$ dvc exp run
'data/data.xml.dvc' didn't change, skipping
Running stage 'prepare':
> python src/prepare.py data/data.xml
Updating lock file 'dvc.lock'

Running stage 'featurize':
> python src/featurization.py data/prepared data/features
Updating lock file 'dvc.lock'

Running stage 'train':
> python src/train.py data/features model.pkl
Updating lock file 'dvc.lock'

Running stage 'evaluate':
> python src/evaluate.py model.pkl data/features
Updating lock file 'dvc.lock'

Ran experiment(s): barer-acts
Experiment results have been applied to your workspace.
```
If you do not want to save the results as an experiment, you can use `dvc repro`, which is similar but does not save an experiment or have the other experiment-related features of `dvc exp run`.
Stage outputs are deleted from the workspace before executing the stage commands that produce them (unless `persist: true` is used in `dvc.yaml`).
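For example, an output can be marked to persist across runs like this (an illustrative snippet; the `logs.txt` output name is made up):

```yaml
stages:
  train:
    cmd: python src/train.py data/features model.pkl
    outs:
      - model.pkl
      - logs.txt:
          persist: true # not deleted before the stage re-runs
```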
## DAG
DVC runs the DAG stages sequentially, in the order defined by the dependencies and outputs. Consider this example `dvc.yaml`:
```yaml
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
      - data/data.xml
      - src/prepare.py
    params:
      - prepare.seed
      - prepare.split
    outs:
      - data/prepared
  featurize:
    cmd: python src/featurization.py data/prepared data/features
    deps:
      - data/prepared
      - src/featurization.py
    params:
      - featurize.max_features
      - featurize.ngrams
    outs:
      - data/features
```
The `prepare` stage will always precede the `featurize` stage because `data/prepared` is an output of `prepare` and a dependency of `featurize`.
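The ordering can be sketched as a topological sort over output-to-dependency edges. Below is a simplified illustration of the idea in Python (not DVC's actual implementation), using the two stages from the example above:

```python
from graphlib import TopologicalSorter

# Simplified stage definitions mirroring the dvc.yaml example.
stages = {
    "prepare": {"deps": ["data/data.xml"], "outs": ["data/prepared"]},
    "featurize": {"deps": ["data/prepared"], "outs": ["data/features"]},
}

# Map each output path to the stage that produces it.
produced_by = {out: name for name, s in stages.items() for out in s["outs"]}

# A stage must run after any stage whose output it lists as a dependency.
graph = {
    name: {produced_by[d] for d in s["deps"] if d in produced_by}
    for name, s in stages.items()
}

print(list(TopologicalSorter(graph).static_order()))  # prepare before featurize
```

Since `data/prepared` appears in both `prepare`'s `outs` and `featurize`'s `deps`, the sort always places `prepare` first.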
## Caching Stages
DVC will try to avoid recomputing stages that have been run before. If you run a stage without changing its commands, dependencies, or parameters, DVC will skip that stage:
```
Stage 'prepare' didn't change, skipping
```
DVC will also recover the outputs from previous runs using the run cache:
```
Stage 'prepare' is cached - skipping run, checking out outputs
```
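Conceptually, a stage can be skipped when a fingerprint of its command, dependency contents, and parameter values matches a previous run. Here is a rough sketch of that idea (not DVC's actual hashing scheme; the dependency hashes and parameter values are made up):

```python
import hashlib
import json

def stage_fingerprint(cmd: str, dep_contents: dict, params: dict) -> str:
    """Hash everything that should trigger a re-run when it changes."""
    payload = json.dumps(
        {"cmd": cmd, "deps": dep_contents, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

run_cache = {}  # fingerprint -> outputs recorded from a previous run

fp = stage_fingerprint(
    "python src/prepare.py data/data.xml",
    {"data/data.xml": "abc123"},  # content hashes of dependencies
    {"prepare.seed": 20170428, "prepare.split": 0.2},
)

if fp in run_cache:
    print("cached - skipping run, checking out outputs")
else:
    run_cache[fp] = {"data/prepared": "..."}  # run the stage, record outputs
```

If the command, every dependency, and every parameter are unchanged, the fingerprint is identical and the cached outputs can be checked out instead of re-running the stage.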
If you want a stage to run every time, you can set `always_changed: true` in `dvc.yaml`:
```yaml
stages:
  pull_latest:
    cmd: python pull_latest.py
    deps:
      - pull_latest.py
    outs:
      - latest_results.csv
    always_changed: true
```
## Debugging Stages
If you are using advanced features to interpolate values for your pipeline, like templating or Hydra composition, you can get the interpolated values by running `dvc exp run -vv`, which will include information like:
```
2023-05-18 07:38:43,955 TRACE: Hydra composition enabled.
Contents dumped to params.yaml: {'model': {'batch_size':
512, 'latent_dim': 8, 'lr': 0.01, 'duration': '00:00:30:00',
'max_epochs': 2}, 'data_path': 'fra.txt', 'num_samples':
100000, 'seed': 423}
2023-05-18 07:38:44,027 TRACE: Context during resolution of
stage download: {'model': {'batch_size': 512, 'latent_dim':
8, 'lr': 0.01, 'duration': '00:00:30:00', 'max_epochs': 2},
'data_path': 'fra.txt', 'num_samples': 100000, 'seed': 423}
2023-05-18 07:38:44,073 TRACE: Context during resolution of
stage train: {'model': {'batch_size': 512, 'latent_dim': 8,
'lr': 0.01, 'duration': '00:00:30:00', 'max_epochs': 2},
'data_path': 'fra.txt', 'num_samples': 100000, 'seed': 423}
```
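For reference, templating interpolates `${...}` placeholders in `dvc.yaml` from values in `params.yaml`, which is the context shown in the trace above. An illustrative sketch (the stage and script names are made up):

```yaml
# params.yaml
model:
  batch_size: 512
  lr: 0.01

# dvc.yaml
stages:
  train:
    cmd: python train.py --batch-size ${model.batch_size} --lr ${model.lr}
```

With `-vv`, the resolved values substituted into each stage are printed as the "Context during resolution" lines.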