Get Started: Experimenting Using Pipelines

If you've been following the guide in order, you might have gone through the chapter about data pipelines already. Here, we will use the same functionality as a basis for an experimentation build system.

In DVC, you run an experiment by executing a pipeline. The term "experiment" refers to the set of trackable changes associated with that execution, including code changes and resulting artifacts such as plots, charts and models. The dvc exp subcommands let you execute, share and manage experiments. Below, we'll build an experiment pipeline and use dvc exp run to execute it, taking advantage of capabilities such as experiment queueing and parametrization.

Stepping up and out of the notebook

After some time spent in your IPython notebook (e.g. Jupyter) doing data exploration and basic modeling, managing your notebook cells may start to feel fragile, and you may want to structure your project and code for reproducible execution, testing and further automation. When you are ready to migrate from notebooks to scripts, DVC Pipelines help you standardize your workflow following software engineering best practices:

  • Modularization: Split the different logical steps in your notebook into separate scripts.

  • Parametrization: Adapt your scripts to decouple the configuration from the source code.

Creating the experiment pipeline

In our example repo, we first extract data preparation logic from the original notebook into data_split.py. We parametrize this script by reading parameters from params.yaml:

from ruamel.yaml import YAML

yaml = YAML(typ="safe")

def data_split():
    # Load the parameters, closing the file once it has been read
    with open("params.yaml", encoding="utf-8") as f:
        params = yaml.load(f)
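
As a rough illustration of what the rest of such a function might do with the loaded values, here is a minimal sketch. The data_split.test_pct parameter appears later in this guide; the base.random_seed key, the split_items helper and the file names are hypothetical, and the parameters are given inline so that the snippet runs without a params.yaml file.

```python
import random


def split_items(items, test_pct, seed):
    """Shuffle items deterministically and split them into (train, test)."""
    rng = random.Random(seed)  # a fixed seed keeps the split reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_pct)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for the dict returned by yaml.load(); key names other than
# data_split.test_pct are assumptions made for this sketch.
params = {
    "base": {"random_seed": 42},
    "data_split": {"test_pct": 0.1},
}

items = [f"img_{i:03d}.jpg" for i in range(20)]  # hypothetical file names
train, test = split_items(
    items,
    test_pct=params["data_split"]["test_pct"],
    seed=params["base"]["random_seed"],
)
print(len(train), len(test))
```

Because every tunable value comes from the params dict, changing the split ratio or the seed requires no edit to the script itself.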

We now use dvc stage add commands to transform our scripts into individual stages starting with a data_split stage for data_split.py:

$ dvc stage add --name data_split \
  --params base,data_split \
  --deps data/pool_data --deps src/data_split.py \
  --outs data/train_data --outs data/test_data \
  python src/data_split.py

A dvc.yaml file is automatically generated with the stage details.

It includes information about the stage we added, like the executable command (python src/data_split.py), its dependencies, parameters, and outputs:

stages:
  data_split:
    cmd: python src/data_split.py
    deps:
      - src/data_split.py
      - data/pool_data
    params:
      - base
      - data_split
    outs:
      - data/train_data
      - data/test_data
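
To make the correspondence between the dvc stage add flags and the dvc.yaml keys (cmd, deps, params, outs) explicit, the following sketch builds the same structure as a plain Python dict. It is only an illustration of the mapping, not how DVC generates the file; make_stage is a hypothetical helper.

```python
import json


def make_stage(cmd, deps=(), params=(), outs=()):
    """Build a dict shaped like one stage entry in dvc.yaml."""
    stage = {"cmd": cmd}
    # Only include the sections that were actually provided
    for key, values in (("deps", deps), ("params", params), ("outs", outs)):
        if values:
            stage[key] = list(values)
    return stage


pipeline = {
    "stages": {
        "data_split": make_stage(
            cmd="python src/data_split.py",
            deps=["data/pool_data", "src/data_split.py"],  # --deps
            params=["base", "data_split"],                 # --params
            outs=["data/train_data", "data/test_data"],    # --outs
        )
    }
}

print(json.dumps(pipeline, indent=2))
```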

Now, we create the train and evaluate stages using train.py and evaluate.py to train the model and evaluate its performance respectively:

$ dvc stage add -n train \
  -p base,train \
  -d src/train.py -d data/train_data \
  -o models/model.pkl \
  python src/train.py

$ dvc stage add -n evaluate \
  -p base,evaluate \
  -d src/evaluate.py -d models/model.pkl -d data/test_data \
  -o results python src/evaluate.py

The dvc.yaml file is updated automatically and should include all the stages now.

stages:
  data_split:
    cmd: python src/data_split.py
    deps:
      - data/pool_data
      - src/data_split.py
    params:
      - base
      - data_split
    outs:
      - data/test_data
      - data/train_data
  train:
    cmd: python src/train.py
    deps:
      - data/train_data
      - src/train.py
    params:
      - base
      - train
    outs:
      - models/model.pkl
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - data/test_data
      - models/model.pkl
      - src/evaluate.py
    params:
      - base
      - evaluate
    outs:
      - results

As the number of stages grows, the dvc dag command becomes handy for visualizing the pipeline without manually inspecting the dvc.yaml file:

$ dvc dag
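    +--------------------+
    | data/pool_data.dvc |
    +--------------------+
               *
               *
               *
        +------------+
        | data_split |
        +------------+
         **        **
       **            **
      *                **
+-------+                *
| train |              **
+-------+            **
         **        **
           **    **
             *  *
         +----------+
         | evaluate |
         +----------+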
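
The graph follows from the stage definitions: a stage sits downstream of another when one of its dependencies is that stage's output. The sketch below derives the execution order this way using Python's standard graphlib module. It illustrates the idea rather than DVC's own implementation, and it covers only the three stages, leaving out data/pool_data.dvc.

```python
from graphlib import TopologicalSorter

# Dependencies and outputs copied from the three stages defined above
stages = {
    "data_split": {
        "deps": ["data/pool_data", "src/data_split.py"],
        "outs": ["data/train_data", "data/test_data"],
    },
    "train": {
        "deps": ["src/train.py", "data/train_data"],
        "outs": ["models/model.pkl"],
    },
    "evaluate": {
        "deps": ["src/evaluate.py", "models/model.pkl", "data/test_data"],
        "outs": ["results"],
    },
}

# Map every output path to the stage that produces it
producer = {out: name for name, s in stages.items() for out in s["outs"]}

# A stage's upstream stages are the producers of its dependencies
graph = {
    name: {producer[d] for d in s["deps"] if d in producer}
    for name, s in stages.items()
}

order = list(TopologicalSorter(graph).static_order())
print(order)  # ['data_split', 'train', 'evaluate']
```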

Now that you have a DVC pipeline set up, you can iterate on it by running dvc exp run to create and track new experiment runs. This also gives you features such as queueing experiments, as well as a canonical way to work with parameters and hyperparameters.

Modifying parameters

You can modify parameters from the CLI using --set-param:

$ dvc exp run --set-param "train.img_size=128"

The flag can be repeated to modify multiple parameters in a single call, even when they belong to different stages:

$ dvc exp run \
-S "data_split.test_pct=0.1" -S "train.img_size=384"
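
Conceptually, each override addresses a nested key in params.yaml by its dotted path. The sketch below shows one way such an override could be applied to an in-memory dict. The apply_override helper and the starting values are hypothetical, and DVC's own override parsing handles more cases than this.

```python
import json


def apply_override(params, override):
    """Apply one 'section.key=value' override to a nested dict in place."""
    path, raw = override.split("=", 1)
    *parents, leaf = path.split(".")
    node = params
    for key in parents:
        node = node.setdefault(key, {})  # create missing sections on the way down
    try:
        node[leaf] = json.loads(raw)     # numbers, booleans, null
    except json.JSONDecodeError:
        node[leaf] = raw                 # anything else stays a string
    return params


# Hypothetical starting values standing in for the contents of params.yaml
params = {"data_split": {"test_pct": 0.2}, "train": {"img_size": 256}}
for override in ("data_split.test_pct=0.1", "train.img_size=384"):
    apply_override(params, override)

print(params)
```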

Hyperparameter tuning

You can provide multiple values for the same parameter:

$ dvc exp run \
--queue --set-param "train.batch_size=8,16,24"
Queueing with overrides '{'params.yaml': ['train.batch_size=8']}'.
Queueing with overrides '{'params.yaml': ['train.batch_size=16']}'.
Queueing with overrides '{'params.yaml': ['train.batch_size=24']}'.

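You can build a grid search by providing multiple values for more than one parameter, which queues one experiment per combination of values. To make the experiments from the grid search easier to identify, you can also provide a --name, which is used as a prefix for the numbered experiment names: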

$ dvc exp run --name "arch-size" --queue \
-S 'train.arch=alexnet,resnet34,squeezenet1_1' \
-S 'train.img_size=128,256'
Queueing with overrides '{'params.yaml': ['train.arch=alexnet', 'train.img_size=128']}'.
Queued experiment 'arch-size-1' for future execution.
Queueing with overrides '{'params.yaml': ['train.arch=alexnet', 'train.img_size=256']}'.
Queued experiment 'arch-size-2' for future execution.
Queueing with overrides '{'params.yaml': ['train.arch=resnet34', 'train.img_size=128']}'.
Queued experiment 'arch-size-3' for future execution.
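
The combinations in a grid search are the Cartesian product of the value lists. As a small illustration, separate from DVC itself, the following sketch enumerates them with itertools.product, using the values from the command above and numbering them in the same style as the experiment names.

```python
from itertools import product

# Values taken from the -S flags in the command above
grid = {
    "train.arch": ["alexnet", "resnet34", "squeezenet1_1"],
    "train.img_size": [128, 256],
}

# One list of overrides per element of the Cartesian product
combos = [
    [f"{name}={value}" for name, value in zip(grid, values)]
    for values in product(*grid.values())
]

for i, overrides in enumerate(combos, start=1):
    print(f"arch-size-{i}: {overrides}")
```

With three architectures and two image sizes, this yields six combinations.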

Learn more about Running Experiments

Queuing experiments

You can enqueue experiments for later execution using --queue:

$ dvc exp run --queue --set-param "train.img_size=512"
Queueing with overrides '{'params.yaml': ['train.img_size=512']}'.

Once you have put some experiments in the queue, you can run them all with:

$ dvc exp run --run-all
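
If you prefer to drive the queue from a script, the same commands can be assembled programmatically. The sketch below only builds and prints the command lines, using the flags shown in this guide. The queue_command helper and the image sizes other than 512 are hypothetical.

```python
import shlex


def queue_command(overrides):
    """Return the argv list for queueing one experiment with the given overrides."""
    cmd = ["dvc", "exp", "run", "--queue"]
    for override in overrides:
        cmd += ["--set-param", override]
    return cmd


commands = [queue_command([f"train.img_size={size}"]) for size in (128, 256, 512)]
commands.append(["dvc", "exp", "run", "--run-all"])  # run everything queued so far

for cmd in commands:
    # Printed rather than executed; pass cmd to subprocess.run(cmd, check=True)
    # inside a DVC repository to actually run it.
    print(shlex.join(cmd))
```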

Learn more about The experiments queue