DVC 2.0 Pre-Release

The new release is a result of our learning from our users. There are four major features coming:
🔗 ML pipeline templating and iterative foreach stages
🧪 Lightweight ML experiments
📍 ML model checkpoints
📈 DVC-Live – a new open-source library for metrics logging

Dmitry Petrov

February 17, 2021

11 minutes read

Install

First things first. You can install the 2.0 pre-release from the master branch in our repo (instruction here) or through pip:

$ pip install --upgrade --pre dvc

ML pipelines parameterization and foreach stages

After we introduced the multi-stage pipeline file dvc.yaml, it was quickly adopted by our users. The DVC team got tons of positive feedback from them, as well as feature requests.

Pipeline parameters from `vars`

The most requested feature was the ability to use parameters in dvc.yaml. For example, you can pass the same seed value or filename to multiple stages in the pipeline.

vars:
    train_matrix: train.pkl
    test_matrix: test.pkl
    seed: 20210215

...

stages:
    process:
        cmd: python process.py 
                --seed ${seed} 
                --train ${train_matrix} 
                --test ${test_matrix}
        outs:
        - ${test_matrix}
        - ${train_matrix}

        ...

    train:
        cmd: python train.py ${train_matrix} --seed ${seed}
        deps:
        - ${train_matrix}

It also lets you keep all the important parameters in a single vars block and experiment with them. This is natural for scenarios like NLP, or when hyperparameter optimization happens not only in the model training code but in the data processing as well.

Pipeline parameters from params files

It is quite common to define pipeline parameters in a config file or a parameters file (like params.yaml) instead of in the pipeline file dvc.yaml itself. These parameters defined in params.yaml can also be used in dvc.yaml.

# params.yaml
models:
  us:
    thresh: 10
    filename: 'model-us.hdf5'

# dvc.yaml
stages:
  build-us:
    cmd: >-
      python script.py
        --out ${models.us.filename}
        --thresh ${models.us.thresh}
    outs:
      - ${models.us.filename}

DVC properly tracks params dependencies for each stage starting from the previous DVC version 1.0. See the --params option of dvc run for more details.
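To make the templating idea concrete, here is a minimal Python sketch of how `${...}` placeholders could be resolved against a nested params dictionary. It is only an illustration, not DVC's actual implementation; the helper names `lookup` and `interpolate` are made up for this example.

```python
import re
from typing import Any, Mapping

# Matches ${some.dotted.key}
_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def lookup(context: Mapping[str, Any], dotted_key: str) -> Any:
    """Walk a nested mapping using a dotted key such as 'models.us.thresh'."""
    value: Any = context
    for part in dotted_key.split("."):
        value = value[part]  # raises KeyError if the key is missing
    return value


def interpolate(template: str, context: Mapping[str, Any]) -> str:
    """Replace every ${dotted.key} in the template with its value."""
    return _PLACEHOLDER.sub(
        lambda match: str(lookup(context, match.group(1).strip())), template
    )


# Same structure as the params.yaml example above.
params = {"models": {"us": {"thresh": 10, "filename": "model-us.hdf5"}}}
cmd = interpolate(
    "python script.py --out ${models.us.filename} --thresh ${models.us.thresh}",
    params,
)
print(cmd)  # python script.py --out model-us.hdf5 --thresh 10
```

A missing key fails loudly with a KeyError here, which is usually what you want for a pipeline definition.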

Iterating over params with foreach stages

Iterating over params was a frequently requested feature. Now users can define multiple similar stages with a templatized command.

stages:
  build:
    foreach:
      gb:
        thresh: 15
        filename: 'model-gb.hdf5'
      us:
        thresh: 10
        filename: 'model-us.hdf5'
    do:
      cmd: >-
        python script.py --out ${item.filename} --thresh ${item.thresh}
      outs:
        - ${item.filename}
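Conceptually, a foreach stage is a template that gets expanded once per item. The Python sketch below shows one way such an expansion could look. It is not DVC's code: the function names are hypothetical, and the `build@gb` style of naming the generated stages is an assumption made for this illustration.

```python
from typing import Any, Dict


def fill(text: str, item: Dict[str, Any]) -> str:
    """Substitute ${item.<field>} placeholders with values from one item."""
    for field, value in item.items():
        text = text.replace("${item." + field + "}", str(value))
    return text


def expand_foreach(
    name: str, foreach: Dict[str, Dict[str, Any]], do: Dict[str, Any]
) -> Dict[str, Dict[str, Any]]:
    """Turn one templated stage into one concrete stage per foreach item."""
    stages = {}
    for key, item in foreach.items():
        # Assumed naming scheme: <stage>@<item key>
        stages[f"{name}@{key}"] = {
            "cmd": fill(do["cmd"], item),
            "outs": [fill(out, item) for out in do["outs"]],
        }
    return stages


foreach = {
    "gb": {"thresh": 15, "filename": "model-gb.hdf5"},
    "us": {"thresh": 10, "filename": "model-us.hdf5"},
}
do = {
    "cmd": "python script.py --out ${item.filename} --thresh ${item.thresh}",
    "outs": ["${item.filename}"],
}
stages = expand_foreach("build", foreach, do)
for stage_name, stage in stages.items():
    print(stage_name, "->", stage["cmd"])
```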

Lightweight ML experiments

DVC uses Git versioning as the basis for ML experiments. This solid foundation makes each experiment reproducible and accessible from the project’s history. This Git-based approach works very well for ML projects with mature models when only a few new experiments per day are run.

However, in more active development, when dozens or hundreds of experiments need to be run in a single day, Git creates overhead: each experiment run requires additional Git commands (git add/commit), and comparing all experiments is difficult.

We introduce lightweight experiments in DVC 2.0! This is how you can auto-track ML experiments without any overhead for ML engineers.

⚠️ Note, our new ML experiment features (dvc exp) are experimental in the coming release. This means that the commands might change a bit in the following minor releases.

dvc exp run can run an ML experiment with a new hyperparameter from params.yaml while dvc exp diff shows metrics and params difference:

$ dvc exp run --set-param featurize.max_features=3000

Reproduced experiment(s): exp-bb55c
Experiment results have been applied to your workspace.

$ dvc exp diff
Path         Metric    Value    Change
scores.json  auc       0.57462  0.0072197

Path         Param                   Value    Change
params.yaml  featurize.max_features  3000     1500

More experiments:

$ dvc exp run --set-param featurize.max_features=4000
Reproduced experiment(s): exp-9bf22
Experiment results have been applied to your workspace.

$ dvc exp run --set-param featurize.max_features=5000
Reproduced experiment(s): exp-63ee0
Experiment results have been applied to your workspace.

$ dvc exp run --set-param featurize.max_features=5000 \
                --set-param featurize.ngrams=3
Reproduced experiment(s): exp-80655
Experiment results have been applied to your workspace.

In the examples above, hyperparameters were changed with the --set-param option, but you can make these changes by modifying the params file instead. In fact, any code or data files can be changed, and dvc exp run will capture the variations.
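An override like featurize.max_features=3000 is just a dotted path plus a new value. As a rough Python sketch of that idea (not how DVC implements it; `set_param` and `parse_value` are hypothetical helpers), applying it to a nested params dictionary could look like this:

```python
import copy
from typing import Any, Dict


def parse_value(raw: str) -> Any:
    """Interpret the right-hand side as int, then float, else keep the string."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def set_param(params: Dict[str, Any], assignment: str) -> Dict[str, Any]:
    """Return a copy of params with an 'a.b.c=value' override applied."""
    dotted_key, raw_value = assignment.split("=", 1)
    updated = copy.deepcopy(params)  # leave the original untouched
    *parents, leaf = dotted_key.split(".")
    node = updated
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = parse_value(raw_value)
    return updated


base = {"featurize": {"max_features": 1500, "ngrams": 2}}
experiment = set_param(base, "featurize.max_features=3000")
print(experiment)  # {'featurize': {'max_features': 3000, 'ngrams': 2}}
```

Working on a copy keeps the baseline params available for comparison, which mirrors the Value/Change view of dvc exp diff.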

See all the runs:

$ dvc exp show --no-pager --no-timestamp \
        --include-params featurize.max_features,featurize.ngrams

 ─────────────────────────────────────────────────────────────────────
  Experiment          auc   featurize.max_features   featurize.ngrams
 ─────────────────────────────────────────────────────────────────────
  workspace       0.56359   5000                     3
  master           0.5674   1500                     2
  ├── exp-80655   0.56359   5000                     3
  ├── exp-63ee0    0.5515   5000                     2
  ├── exp-9bf22   0.56448   4000                     2
  └── exp-bb55c   0.57462   3000                     2
 ─────────────────────────────────────────────────────────────────────

Under the hood, DVC uses Git to store the experiments’ meta-information. A straightforward implementation would create visible branches and auto-commit in them, but that approach would pollute the branch namespace very quickly. To avoid this issue, we introduced custom Git references exps, the same way GitHub uses custom references pulls to track pull requests (this is an interesting technical topic that deserves a separate blog post). Below you can see how it works.

No artificial branches, only custom references exps (do not worry if you don’t understand this part – it is an implementation detail):

$ git branch
* master

$ git show-ref
5649f62d845fdc29e28ea6f7672dd729d3946940 refs/exps/exec/EXEC_APPLY
5649f62d845fdc29e28ea6f7672dd729d3946940 refs/exps/exec/EXEC_BRANCH
5649f62d845fdc29e28ea6f7672dd729d3946940 refs/exps/71/67904d89e116f28daf7a6e4c0878268117c893/exp-80655
f16e7b7c804cf52d91d1d11850c15963fb2a8d7b refs/exps/97/d69af70c6fb4bc59aefb9a87437dcd28b3bde4/exp-63ee0
0566d42cddb3a8c4eb533f31027f0febccbbc2dd refs/exps/91/94265d5acd847e1c439dd859aa74b1fc3d73ad/exp-bb55c
9bb067559583990a8c5d499d7435c35a7c9417b7 refs/exps/49/5c835cd36772123e82e812d96eabcce320f7ec/exp-9bf22
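If you are curious, the git show-ref output above is easy to inspect programmatically. The Python sketch below parses that text and lists the experiment refs. It assumes the layout seen in this listing (refs/exps/<2 chars>/<38 chars>/<name>); reading the two middle components as one 40-character hash is my interpretation for this example, and `parse_exp_refs` is a hypothetical helper.

```python
from typing import Dict, NamedTuple


class ExpRef(NamedTuple):
    commit: str  # hash the ref points to
    group: str   # the two middle path components joined together
    name: str    # experiment name, e.g. exp-80655


def parse_exp_refs(show_ref_output: str) -> Dict[str, ExpRef]:
    """Collect refs of the form refs/exps/<xx>/<rest>/<name>."""
    refs = {}
    for line in show_ref_output.strip().splitlines():
        commit, ref = line.split(maxsplit=1)
        parts = ref.split("/")
        # Skip anything outside refs/exps and the refs/exps/exec/* helpers.
        if parts[:2] != ["refs", "exps"] or len(parts) != 5:
            continue
        refs[parts[4]] = ExpRef(commit, parts[2] + parts[3], parts[4])
    return refs


sample = """
5649f62d845fdc29e28ea6f7672dd729d3946940 refs/exps/exec/EXEC_APPLY
5649f62d845fdc29e28ea6f7672dd729d3946940 refs/exps/exec/EXEC_BRANCH
5649f62d845fdc29e28ea6f7672dd729d3946940 refs/exps/71/67904d89e116f28daf7a6e4c0878268117c893/exp-80655
f16e7b7c804cf52d91d1d11850c15963fb2a8d7b refs/exps/97/d69af70c6fb4bc59aefb9a87437dcd28b3bde4/exp-63ee0
0566d42cddb3a8c4eb533f31027f0febccbbc2dd refs/exps/91/94265d5acd847e1c439dd859aa74b1fc3d73ad/exp-bb55c
9bb067559583990a8c5d499d7435c35a7c9417b7 refs/exps/49/5c835cd36772123e82e812d96eabcce320f7ec/exp-9bf22
"""
exp_refs = parse_exp_refs(sample)
for name, ref in sorted(exp_refs.items()):
    print(name, ref.commit[:7])
```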

The best experiment can be promoted to the workspace and committed to Git.

$ dvc exp apply exp-bb55c
$ git add .
$ git commit -m 'optimize max feature size'

Alternatively, an experiment can be promoted to a branch (big_fr_size branch in this case):

$ dvc exp branch exp-80655 big_fr_size
Git branch 'big_fr_size' has been created from experiment 'exp-80655'.
To switch to the new branch run:

git checkout big_fr_size

Remove all the experiments that were not used:

$ dvc exp gc --workspace --force

Model checkpoints

ML model checkpoints are an essential part of deep learning. ML engineers prefer to save the model files (or weights) at checkpoints during a training process and return to them when metrics start diverging or learning is not fast enough.

Checkpoints create a different dynamic around the ML modeling process and need special support from the toolset:

Track and save model checkpoints (DVC outputs) periodically, not only the final result or training epoch.
Save metrics corresponding to each of the checkpoints.
Reuse checkpoints – warm-start training with an existing model file, corresponding code, dataset version and metrics.

This new behavior is supported in DVC 2.0. Now, DVC can version all your checkpoints with corresponding code and data. It brings the reproducibility of DL processes to the next level – every checkpoint is reproducible.

This is how you define checkpoints with live-metrics:

$ dvc stage add -n train \
        -d users.csv -d train.py \
        -p dropout,epochs,lr,process \
        --checkpoint model.h5 \
        --live logs \
    python train.py

Creating 'dvc.yaml'
Adding stage 'train' in 'dvc.yaml'

Note that we use the dvc stage add command instead of dvc run. Starting from DVC 2.0, we are moving all stage-specific functionality under the dvc stage umbrella. dvc run still works, but it will be deprecated in the next major DVC version (most likely 3.0).

Start the training process and interrupt it after 5 epochs:

$ dvc exp run
'users.csv.dvc' didn't change, skipping
Running stage 'train':
> python train.py
...
^CTraceback (most recent call last):
...
KeyboardInterrupt

Navigate in checkpoints:

$ dvc exp show --no-pager --no-timestamp

 ──────────────────────────────────────────────────────────────────────
  Experiment      step     loss   accuracy   val_loss   …   epochs   …
 ──────────────────────────────────────────────────────────────────────
  workspace          4   2.0702    0.30388      2.025   …   5        …
  master             -        -          -          -   …   5        …
  │ ╓ exp-e15bc      4   2.0702    0.30388      2.025   …   5        …
  │ ╟ 5ea8327        4   2.0702    0.30388      2.025   …   5        …
  │ ╟ bc0cf02        3   2.1338    0.23988     2.0883   …   5        …
  │ ╟ f8cf03f        2   2.1989    0.17932     2.1542   …   5        …
  │ ╟ 7575a44        1   2.2694    0.12833      2.223   …   5        …
  ├─╨ a72c526        0   2.3416     0.0959     2.2955   …   5        …
 ──────────────────────────────────────────────────────────────────────

Each of the checkpoints above is a separate experiment with all data, code, parameters and metrics. You can use the same dvc exp apply command to extract any of these.

Another run continues this process. You can see how accuracy metrics are increasing – DVC does not remove the model/checkpoint and training code trains on top of it:

$ dvc exp run
Existing checkpoint experiment 'exp-e15bc' will be resumed
...
^C
KeyboardInterrupt

$ dvc exp show --no-pager --no-timestamp

 ──────────────────────────────────────────────────────────────────────
  Experiment      step     loss   accuracy   val_loss   …   epochs   …
 ──────────────────────────────────────────────────────────────────────
  workspace          9   1.7845    0.58125     1.7381   …   5        …
  master             -        -          -          -   …   5        …
  │ ╓ exp-e15bc      9   1.7845    0.58125     1.7381   …   5        …
  │ ╟ 205a8d3        9   1.7845    0.58125     1.7381   …   5        …
  │ ╟ dd23d96        8   1.8369    0.54173     1.7919   …   5        …
  │ ╟ 5bb3a1f        7   1.8929    0.49108     1.8474   …   5        …
  │ ╟ 6dc5610        6    1.951    0.43433     1.9046   …   5        …
  │ ╟ a79cf29        5   2.0088    0.36837     1.9637   …   5        …
  │ ╟ 5ea8327        4   2.0702    0.30388      2.025   …   5        …
  │ ╟ bc0cf02        3   2.1338    0.23988     2.0883   …   5        …
  │ ╟ f8cf03f        2   2.1989    0.17932     2.1542   …   5        …
  │ ╟ 7575a44        1   2.2694    0.12833      2.223   …   5        …
  ├─╨ a72c526        0   2.3416     0.0959     2.2955   …   5        …
 ──────────────────────────────────────────────────────────────────────
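Resuming only works if the training script itself picks up the existing checkpoint file. The toy Python sketch below shows that warm-start pattern with a JSON file standing in for a real model file. Everything in it (the `train` function, the fake "weight" update) is a made-up stand-in, not the train.py used in this post.

```python
import json
import os
import tempfile


def train(checkpoint_path: str, epochs: int) -> dict:
    """Toy training loop that resumes from checkpoint_path when it exists."""
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)  # warm start from the saved checkpoint
    else:
        state = {"step": -1, "weight": 0.0}  # cold start

    for _ in range(epochs):
        state["step"] += 1
        state["weight"] += 0.1  # stand-in for a real optimization step
        # Save a checkpoint after every epoch, not only at the end.
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)
    return state


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.json")
    first = train(path, epochs=5)   # steps 0..4
    second = train(path, epochs=5)  # continues with steps 5..9
    print(first["step"], second["step"])  # 4 9
```

The second call continues the step counter instead of starting over, which is the same behavior the two tables above show for steps 0 to 4 and then 5 to 9.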

After modifying the code, data, or params, the same process can be resumed. DVC recognizes the change and shows it (see experiment b363267):

$ vi train.py     # modify code
$ vi params.yaml  # modify params

$ dvc exp run
Modified checkpoint experiment based on 'exp-e15bc' will be created
...

$ dvc exp show --no-pager --no-timestamp

 ──────────────────────────────────────────────────────────────────────────────
  Experiment              step     loss   accuracy   val_loss   …   epochs   …
 ──────────────────────────────────────────────────────────────────────────────
  workspace                 13   1.5841    0.69262     1.5381   …   15       …
  master                     -        -          -          -   …   5        …
  │ ╓ exp-7ff06             13   1.5841    0.69262     1.5381   …   15       …
  │ ╟ 6c62fec               12   1.6325    0.67248     1.5857   …   15       …
  │ ╟ 4baca3c               11   1.6817    0.64855     1.6349   …   15       …
  │ ╟ b363267 (2b06de7)     10   1.7323    0.61925     1.6857   …   15       …
  │ ╓ 2b06de7                9   1.7845    0.58125     1.7381   …   5        …
  │ ╟ 205a8d3                9   1.7845    0.58125     1.7381   …   5        …
  │ ╟ dd23d96                8   1.8369    0.54173     1.7919   …   5        …
  │ ╟ 5bb3a1f                7   1.8929    0.49108     1.8474   …   5        …
  │ ╟ 6dc5610                6    1.951    0.43433     1.9046   …   5        …
  │ ╟ a79cf29                5   2.0088    0.36837     1.9637   …   5        …
  │ ╟ 5ea8327                4   2.0702    0.30388      2.025   …   5        …
  │ ╟ bc0cf02                3   2.1338    0.23988     2.0883   …   5        …
  │ ╟ f8cf03f                2   2.1989    0.17932     2.1542   …   5        …
  │ ╟ 7575a44                1   2.2694    0.12833      2.223   …   5        …
  ├─╨ a72c526                0   2.3416     0.0959     2.2955   …   5        …
 ──────────────────────────────────────────────────────────────────────────────

Sometimes you might need to train the model from scratch. The reset option removes the checkpoint file before training: dvc exp run --reset.

Metrics logging

Continuously logging ML metrics is a very common practice in the ML world. Instead of a simple command-line output with the metrics values, many ML engineers prefer visuals and plots. These plots can be organized in a “database” of ML experiments to keep track of a project. There are many specialized solutions for metrics collection and experiment tracking, such as sacred, mlflow, Weights & Biases, neptune.ai, and others.

With DVC 2.0, we are releasing a new open-source library, DVC-Live, that tracks model metrics and organizes them in simple text files so that DVC can visualize them and navigate them through Git history. So, DVC can show you the metrics difference between the current model and a model in master or any other branch.

This approach is similar to the other metrics tracking tools, with the difference that Git becomes a “database” of ML experiments.

Generate metrics file

Install the library:

$ pip install dvclive

Instrument your code:

import dvclive
from dvclive.keras import DvcLiveCallback

dvclive.init("logs") #, summarize=True)

...

model.fit(...
          # Set up DVC-Live callback:
          callbacks=[ DvcLiveCallback() ]
         )

During training, you will see the metrics files that are continuously populated after each epoch:

$ ls logs/
accuracy.tsv     loss.tsv         val_accuracy.tsv val_loss.tsv

$ head logs/accuracy.tsv
timestamp	step	accuracy
1613645582716	0	0.7360000014305115
1613645585478	1	0.8349999785423279
1613645587322	2	0.8830000162124634
1613645589125	3	0.9049999713897705
1613645590891	4	0.9070000052452087
1613645592681	5	0.9279999732971191
1613645594490	6	0.9430000185966492
1613645596232	7	0.9369999766349792
1613645598034	8	0.9430000185966492

In addition to the continuous metrics files, you will see the summary metrics file and HTML file with the same file prefix. The summary file contains the result of the latest epoch:

$ cat logs.json | python -m json.tool
{
    "step": 41,
    "loss": 0.015958430245518684,
    "accuracy": 0.9950000047683716,
    "val_loss": 13.705962181091309,
    "val_accuracy": 0.5149999856948853
}
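The file layout shown above (one TSV per metric plus a JSON summary with the latest values) is simple enough to reproduce by hand. The Python sketch below imitates it with a tiny logger class. This is not the dvclive API or its implementation; `ToyMetricsLogger` and its `log` method are hypothetical names used only to illustrate the format.

```python
import json
import os
import tempfile
import time


class ToyMetricsLogger:
    """Appends rows to <dir>/<metric>.tsv and rewrites <dir>.json each call."""

    def __init__(self, directory: str) -> None:
        self.directory = directory
        self.summary_path = directory + ".json"  # same prefix as the directory
        self.summary = {}
        os.makedirs(directory, exist_ok=True)

    def log(self, name: str, value: float, step: int) -> None:
        path = os.path.join(self.directory, name + ".tsv")
        is_new = not os.path.exists(path)
        with open(path, "a") as f:
            if is_new:
                f.write(f"timestamp\tstep\t{name}\n")  # header row
            f.write(f"{int(time.time() * 1000)}\t{step}\t{value}\n")
        # The summary always reflects the most recent values.
        self.summary.update({"step": step, name: value})
        with open(self.summary_path, "w") as f:
            json.dump(self.summary, f)


workdir = tempfile.mkdtemp()
logger = ToyMetricsLogger(os.path.join(workdir, "logs"))
for step, accuracy in enumerate([0.736, 0.835, 0.883]):
    logger.log("accuracy", accuracy, step)

with open(os.path.join(workdir, "logs", "accuracy.tsv")) as f:
    rows = [line.rstrip("\n").split("\t") for line in f]
with open(os.path.join(workdir, "logs.json")) as f:
    summary = json.load(f)
print(rows[0], summary)
```

Because everything is plain text, the resulting files diff cleanly in Git.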

The HTML file contains all the visuals for continuous metrics as well as the summary metrics on a single page:

Note that the HTML and the summary metrics files are generated automatically at each epoch. So, you can monitor model performance in real time.

Git-navigation with the metrics file

A DVC repository is NOT required to use the live metrics functionality described above. It works independently of DVC.

A DVC repository becomes useful when the metrics and plots are committed in your Git repository and you need to navigate around the metrics.

Metrics difference between workspace and the last Git commit:

$ git status -s
 M logs.json
 M logs/accuracy.tsv
 M logs/loss.tsv
 M logs/val_accuracy.tsv
 M logs/val_loss.tsv
 M train.py
?? model.h5

$ dvc metrics diff --target logs.json
Path       Metric        Old       New      Change
logs.json  accuracy      0.995     0.99     -0.005
logs.json  loss          0.01596   0.03036  0.0144
logs.json  step          41        36       -5
logs.json  val_accuracy  0.515     0.5175   0.0025
logs.json  val_loss      13.70596  3.29033  -10.41563

The difference between a particular commit/branch/tag or between two commits:

$ dvc metrics diff --target logs.json HEAD^ 47b85c
Path       Metric        Old       New      Change
logs.json  accuracy      0.995     0.998    0.003
logs.json  loss          0.01596   0.01951  0.00355
logs.json  step          41        82       41
logs.json  val_accuracy  0.515     0.51     -0.005
logs.json  val_loss      13.70596  5.83056  -7.8754
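The Change column in the tables above is simply the new value minus the old one for each metric. Here is a small Python sketch of that comparison for two summary dictionaries; `metrics_diff` is a hypothetical helper written for this post, not the code behind dvc metrics diff, and the rounding to five digits is an assumption chosen to match the printed tables.

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[str, Optional[float], Optional[float], Optional[float]]


def metrics_diff(
    old: Dict[str, float], new: Dict[str, float], digits: int = 5
) -> List[Row]:
    """Return (metric, old, new, change) rows, sorted by metric name."""
    rows: List[Row] = []
    for name in sorted(set(old) | set(new)):
        old_value, new_value = old.get(name), new.get(name)
        change = None  # undefined when a metric exists on one side only
        if old_value is not None and new_value is not None:
            change = round(new_value - old_value, digits)
        rows.append((name, old_value, new_value, change))
    return rows


# Values taken from the first dvc metrics diff table above.
old = {"step": 41, "accuracy": 0.995, "val_loss": 13.70596}
new = {"step": 36, "accuracy": 0.99, "val_loss": 3.29033}
for row in metrics_diff(old, new):
    print(*row, sep="\t")
```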

The same Git-navigation works with the plots:

$ dvc plots diff --target logs
file:///Users/dmitry/src/exp-dc/plots.html

Another nice thing about the live metrics – they work across ML experiments and checkpoints, if properly set up in dvc stages. To set up live metrics, you need to specify the metrics directory in the live section of a stage:

stages:
  train:
    cmd: python train.py
    live:
      logs:
        cache: false
        summary: true
        report: true
    deps:
      - data

Thank you!

I’d like to thank all of you DVC community members for the feedback that we are constantly getting. This feedback helps us build new functionalities in DVC and make it more stable.

Please be in touch with us on Twitter and our Discord channel.
