DVC 2.0 Release

Today is DVC 2.0 release day! Watch a video from the DVC team in which we explain the new features, and read more details in this blog post.
New features:
🧪 Lightweight ML experiments
📍 ML model checkpoints versioning
📈 DVC-Live – new open-source library for metrics logging
🔗 ML pipeline templating and iterative foreach-stages
🤖 CML – new way to get GPU/CPU in clouds and GitHub Actions support

Dmitry Petrov

March 3, 2021

15 minutes read

TL;DR; video

What is new in DVC 2.0?

We have been working on DVC for almost 4 years. In previous versions, we built a solid foundation for versioning data, code and ML models that helps make your ML projects reproducible.

With the 2.0 release, we are going deeper into machine learning and deep learning scenarios such as experiment management, ML model checkpoints and ML metrics logging. These scenarios are widely adopted by ML practitioners and instrumented with custom tools or external frameworks and SaaS services. Our vision is to make the ML experimentation experience distributed (like Git) and independent of external SaaS platforms, and to introduce proper data and model management to ML experiments.

⚠️ DVC 2.0 is the first release with ML experiments, which are still in experimentation mode (yeah, experiments in experimentation mode 😅), so the API might change a bit in the following releases.

ML pipeline parameterization is another big improvement in DVC 2.0. This was the most requested feature during the last year. We are introducing variables in pipelines as well as foreach-stages. This is a significant improvement for users who work on multi-stage ML projects, which are very common in NLP.

Better CPU/GPU resource allocation is another important direction for DVC. Together with DVC 2.0, we are releasing version 0.3 of CML (CI/CD for ML). It aims to hide the complexity of clouds from data scientists and ML engineers. We developed a brand new Iterative Terraform Provider to reach this goal and simplify the end-user experience. In future releases, we expect DVC to use this Terraform provider to access cloud resources directly.

Last but not least, we made the new release with minimal breaking changes to our API. That makes migration to DVC 2.0 smooth and low-risk.

Install

The new version is generally available!

Install DVC 2.0 through OS packages or as a Python library:

$ pip install --upgrade dvc

CML is pre-installed in the CML docker containers (e.g. iterativeai/cml:0-dvc2-base1) and also available as an NPM package:

$ npm i -g @dvcorg/cml

Lightweight ML experiments

DVC uses Git versioning as the basis for ML experiments. This solid foundation makes each experiment reproducible and accessible from the project’s history. This Git-based approach works very well for ML projects with mature models when only a few new experiments per day are run.

However, in more active development, when dozens or hundreds of experiments need to be run in a single day, Git creates overhead — each experiment run requires additional Git commands git add/commit, and comparing all experiments is difficult.

We are introducing lightweight experiments in DVC 2.0! This is how you can auto-track ML experiments without any overhead.

⚠️ Note, our new ML experiment features (dvc exp) are experimental. This means that the commands might change a bit in the following minor releases.

dvc exp run can run an ML experiment with a new hyperparameter from params.yaml while dvc exp diff shows metrics and params difference:

$ dvc exp run --set-param featurize.max_features=3000

Reproduced experiment(s): exp-bb55c
Experiment results have been applied to your workspace.

$ dvc exp diff
Path         Metric    Value    Change
scores.json  auc       0.57462  0.0072197

Path         Param                   Value    Change
params.yaml  featurize.max_features  3000     1500

More experiments:

$ dvc exp run --set-param featurize.max_features=4000
Reproduced experiment(s): exp-9bf22
Experiment results have been applied to your workspace.

$ dvc exp run --set-param featurize.max_features=5000
Reproduced experiment(s): exp-63ee0
Experiment results have been applied to your workspace.

$ dvc exp run --set-param featurize.max_features=5000 \
                --set-param featurize.ngrams=3
Reproduced experiment(s): exp-80655
Experiment results have been applied to your workspace.

In the examples above, hyperparameters were changed with the --set-param option, but you can make these changes by modifying the params file instead. In fact any code can be changed and dvc exp run will capture the variations.
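When many values need to be tried, the command lines themselves can be generated. The sketch below only builds the `dvc exp run --set-param ...` argument lists for every combination in a grid; the helper name `exp_run_commands` and the grid values are made up for this illustration, and actually executing the commands (for example with `subprocess.run`) is left out on purpose.

```python
from itertools import product


def exp_run_commands(grid):
    """Return one `dvc exp run` argument list per combination in `grid`.

    `grid` maps a parameter name from params.yaml to the values to try.
    (Hypothetical helper for illustration; not part of DVC.)
    """
    names = list(grid)
    commands = []
    for values in product(*(grid[name] for name in names)):
        cmd = ["dvc", "exp", "run"]
        for name, value in zip(names, values):
            cmd += ["--set-param", f"{name}={value}"]
        commands.append(cmd)
    return commands


# Example grid (values chosen only for illustration).
grid = {"featurize.max_features": [4000, 5000], "featurize.ngrams": [2, 3]}
for cmd in exp_run_commands(grid):
    print(" ".join(cmd))
```

Each printed line is a complete command that could be pasted into a shell, one experiment per combination.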

See all the runs:

$ dvc exp show --no-pager --no-timestamp \
        --include-params featurize.max_features,featurize.ngrams

 ─────────────────────────────────────────────────────────────────────
  Experiment          auc   featurize.max_features   featurize.ngrams
 ─────────────────────────────────────────────────────────────────────
  workspace       0.56359   5000                     3
  master           0.5674   1500                     2
  ├── exp-80655   0.56359   5000                     3
  ├── exp-63ee0    0.5515   5000                     2
  ├── exp-9bf22   0.56448   4000                     2
  └── exp-bb55c   0.57462   3000                     2
 ─────────────────────────────────────────────────────────────────────

Under the hood, DVC uses Git to store the experiments’ meta-information. A straightforward implementation would create visible branches and auto-commit to them, but that approach would pollute the branch namespace very quickly. To avoid this issue, we introduced custom Git references exps, the same way GitHub uses custom references pulls to track pull requests (this is an interesting technical topic that deserves a separate blog post). Below you can see how it works.

No artificial branches, only custom references exps (do not worry if you don’t understand this part – it is an implementation detail):

$ git branch
* master

$ git show-ref
5649f62d845fdc29e28ea6f7672dd729d3946940 refs/exps/exec/EXEC_APPLY
5649f62d845fdc29e28ea6f7672dd729d3946940 refs/exps/exec/EXEC_BRANCH
5649f62d845fdc29e28ea6f7672dd729d3946940 refs/exps/71/67904d89e116f28daf7a6e4c0878268117c893/exp-80655
f16e7b7c804cf52d91d1d11850c15963fb2a8d7b refs/exps/97/d69af70c6fb4bc59aefb9a87437dcd28b3bde4/exp-63ee0
0566d42cddb3a8c4eb533f31027f0febccbbc2dd refs/exps/91/94265d5acd847e1c439dd859aa74b1fc3d73ad/exp-bb55c
9bb067559583990a8c5d499d7435c35a7c9417b7 refs/exps/49/5c835cd36772123e82e812d96eabcce320f7ec/exp-9bf22
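To make the reference layout a little more concrete, here is a small sketch that parses `git show-ref` output like the listing above and maps each experiment name to its commit. The function name `parse_exp_refs` is hypothetical, and treating `refs/exps/exec/*` as internal bookkeeping entries is an assumption based only on this listing.

```python
def parse_exp_refs(show_ref_output):
    """Map experiment name -> commit SHA from `git show-ref` text."""
    experiments = {}
    for line in show_ref_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # ignore blank or malformed lines
        sha, ref = parts
        # Keep only refs under refs/exps/, skipping the exec/* entries
        # (assumed here to be internal, not user-facing experiments).
        if not ref.startswith("refs/exps/") or ref.startswith("refs/exps/exec/"):
            continue
        experiments[ref.rsplit("/", 1)[-1]] = sha
    return experiments


sample = """\
5649f62d845fdc29e28ea6f7672dd729d3946940 refs/exps/exec/EXEC_APPLY
5649f62d845fdc29e28ea6f7672dd729d3946940 refs/exps/71/67904d89e116f28daf7a6e4c0878268117c893/exp-80655
0566d42cddb3a8c4eb533f31027f0febccbbc2dd refs/exps/91/94265d5acd847e1c439dd859aa74b1fc3d73ad/exp-bb55c
"""
print(parse_exp_refs(sample))
```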

The best experiment can be promoted to the workspace and committed to Git.

$ dvc exp apply exp-bb55c
$ git add .
$ git commit -m 'optimize max feature size'

Alternatively, an experiment can be promoted to a branch (big_fr_size branch in this case):

$ dvc exp branch exp-80655 big_fr_size
Git branch 'big_fr_size' has been created from experiment 'exp-80655'.
To switch to the new branch run:

git checkout big_fr_size

Remove all the experiments that were not used:

$ dvc exp gc --workspace --force

ML model checkpoints versioning

ML model checkpoints are an essential part of deep learning. ML engineers prefer to save the model files (or weights) at checkpoints during the training process and return to them when metrics start diverging or learning is not fast enough.

Checkpoints create a different dynamic around the ML modeling process and need special support from the toolset:

Track and save model checkpoints (DVC outputs) periodically, not only the final result or training epoch.
Save metrics corresponding to each of the checkpoints.
Reuse checkpoints – warm-start training with an existing model file, corresponding code, dataset version and metrics.

This new behavior is supported in DVC 2.0. Now, DVC can version all your checkpoints with corresponding code and data. It brings the reproducibility of DL processes to the next level – every checkpoint is reproducible.

This is how you define checkpoints with live-metrics:

$ dvc stage add -n train \
        -d users.csv -d train.py \
        -p dropout,epochs,lr,process \
        --checkpoint model.h5 \
        --live logs \
    python train.py

Creating 'dvc.yaml'
Adding stage 'train' in 'dvc.yaml'

Note that we use the dvc stage add command instead of dvc run. Starting with DVC 2.0, we are moving all stage-specific functionality under the dvc stage umbrella. dvc run still works, but it will be deprecated in the next major DVC version (most likely 3.0).

Start the training process and interrupt it after 5 epochs:

$ dvc exp run
'users.csv.dvc' didn't change, skipping
Running stage 'train':
> python train.py
...
^CTraceback (most recent call last):
...
KeyboardInterrupt

Navigate in checkpoints:

$ dvc exp show --no-pager --no-timestamp

 ──────────────────────────────────────────────────────────────────────
  Experiment      step     loss   accuracy   val_loss   …   epochs   …
 ──────────────────────────────────────────────────────────────────────
  workspace          4   2.0702    0.30388      2.025   …   5        …
  master             -        -          -          -   …   5        …
  │ ╓ exp-e15bc      4   2.0702    0.30388      2.025   …   5        …
  │ ╟ 5ea8327        4   2.0702    0.30388      2.025   …   5        …
  │ ╟ bc0cf02        3   2.1338    0.23988     2.0883   …   5        …
  │ ╟ f8cf03f        2   2.1989    0.17932     2.1542   …   5        …
  │ ╟ 7575a44        1   2.2694    0.12833      2.223   …   5        …
  ├─╨ a72c526        0   2.3416     0.0959     2.2955   …   5        …
 ──────────────────────────────────────────────────────────────────────

Each of the checkpoints above is a separate experiment with all data, code, parameters and metrics. You can use the same dvc exp apply command to extract any of these.

Another run continues this process. You can see how accuracy metrics are increasing – DVC does not remove the model/checkpoint and training code trains on top of it:

$ dvc exp run
Existing checkpoint experiment 'exp-e15bc' will be resumed
...
^C
KeyboardInterrupt

$ dvc exp show --no-pager --no-timestamp

 ──────────────────────────────────────────────────────────────────────
  Experiment      step     loss   accuracy   val_loss   …   epochs   …
 ──────────────────────────────────────────────────────────────────────
  workspace          9   1.7845    0.58125     1.7381   …   5        …
  master             -        -          -          -   …   5        …
  │ ╓ exp-e15bc      9   1.7845    0.58125     1.7381   …   5        …
  │ ╟ 205a8d3        9   1.7845    0.58125     1.7381   …   5        …
  │ ╟ dd23d96        8   1.8369    0.54173     1.7919   …   5        …
  │ ╟ 5bb3a1f        7   1.8929    0.49108     1.8474   …   5        …
  │ ╟ 6dc5610        6    1.951    0.43433     1.9046   …   5        …
  │ ╟ a79cf29        5   2.0088    0.36837     1.9637   …   5        …
  │ ╟ 5ea8327        4   2.0702    0.30388      2.025   …   5        …
  │ ╟ bc0cf02        3   2.1338    0.23988     2.0883   …   5        …
  │ ╟ f8cf03f        2   2.1989    0.17932     2.1542   …   5        …
  │ ╟ 7575a44        1   2.2694    0.12833      2.223   …   5        …
  ├─╨ a72c526        0   2.3416     0.0959     2.2955   …   5        …
 ──────────────────────────────────────────────────────────────────────

After modifying the code, data, or params, the same process can be resumed. DVC recognizes the change and shows it (see experiment b363267):

$ vi train.py     # modify code
$ vi params.yaml  # modify params

$ dvc exp run
Modified checkpoint experiment based on 'exp-e15bc' will be created
...

$ dvc exp show --no-pager --no-timestamp

 ──────────────────────────────────────────────────────────────────────────────
  Experiment              step     loss   accuracy   val_loss   …   epochs   …
 ──────────────────────────────────────────────────────────────────────────────
  workspace                 13   1.5841    0.69262     1.5381   …   15       …
  master                     -        -          -          -   …   5        …
  │ ╓ exp-7ff06             13   1.5841    0.69262     1.5381   …   15       …
  │ ╟ 6c62fec               12   1.6325    0.67248     1.5857   …   15       …
  │ ╟ 4baca3c               11   1.6817    0.64855     1.6349   …   15       …
  │ ╟ b363267 (2b06de7)     10   1.7323    0.61925     1.6857   …   15       …
  │ ╓ 2b06de7                9   1.7845    0.58125     1.7381   …   5        …
  │ ╟ 205a8d3                9   1.7845    0.58125     1.7381   …   5        …
  │ ╟ dd23d96                8   1.8369    0.54173     1.7919   …   5        …
  │ ╟ 5bb3a1f                7   1.8929    0.49108     1.8474   …   5        …
  │ ╟ 6dc5610                6    1.951    0.43433     1.9046   …   5        …
  │ ╟ a79cf29                5   2.0088    0.36837     1.9637   …   5        …
  │ ╟ 5ea8327                4   2.0702    0.30388      2.025   …   5        …
  │ ╟ bc0cf02                3   2.1338    0.23988     2.0883   …   5        …
  │ ╟ f8cf03f                2   2.1989    0.17932     2.1542   …   5        …
  │ ╟ 7575a44                1   2.2694    0.12833      2.223   …   5        …
  ├─╨ a72c526                0   2.3416     0.0959     2.2955   …   5        …
 ──────────────────────────────────────────────────────────────────────────────

Sometimes you might need to train the model from scratch. The reset option removes the checkpoint file before training: dvc exp run --reset.
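The resume-or-restart behavior relies on the training script itself picking up an existing checkpoint file. The toy sketch below illustrates that pattern under loud assumptions: a JSON file stands in for model.h5, the "training" is a fake weight update, and the function name `train` is made up. It is not how DVC or any particular framework implements checkpoints.

```python
import json
import os
import tempfile


def train(checkpoint_path, epochs):
    """Toy training loop that resumes from `checkpoint_path` if it exists."""
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)              # warm start from the checkpoint
    else:
        state = {"step": -1, "weight": 0.0}   # no checkpoint: start from scratch

    for _ in range(epochs):
        state["step"] += 1
        state["weight"] += 0.5                # stand-in for a real model update
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)               # write a checkpoint every epoch
    return state


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.json")
    first = train(path, epochs=5)     # steps 0..4
    second = train(path, epochs=5)    # resumes from the file: steps 5..9
    os.remove(path)                   # roughly what removing the checkpoint does
    fresh = train(path, epochs=1)     # starts over at step 0
    print(first["step"], second["step"], fresh["step"])
```

The step counters mirror the tables above: the first run stops at step 4, the resumed run reaches step 9, and a run after the checkpoint file is removed begins again at step 0.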

Metrics logging

Continuously logging ML metrics is a very common practice in the ML world. Instead of simple command-line output with the metric values, many ML engineers prefer visuals and plots. These plots can be organized in a “database” of ML experiments to keep track of a project. There are many specialized solutions for metrics collection and experiment tracking, such as Sacred, MLflow, Weights & Biases, neptune.ai, and others.

With DVC 2.0, we are releasing a new open-source library, DVC-Live, that tracks model metrics and stores them in simple text files, so that DVC can visualize the metrics and navigate them through Git history. As a result, DVC can show you the metrics difference between the current model and a model in master or any other branch.

This approach is similar to other metrics tracking tools, with the difference that Git becomes the “database” of ML experiments.

Generate metrics file

Install the library:

$ pip install dvclive

Instrument your code:

import dvclive
from dvclive.keras import DvcLiveCallback

dvclive.init("logs")  # directory where the metrics files are written

...

model.fit(...
          # Set up DVC-Live callback:
          callbacks=[ DvcLiveCallback() ]
         )

During training you will see the metrics files being populated continuously, once per epoch:

$ ls logs/
accuracy.tsv     loss.tsv         val_accuracy.tsv val_loss.tsv

$ head logs/accuracy.tsv
timestamp       step    accuracy
1613645582716   0       0.7360000014305115
1613645585478   1       0.8349999785423279
1613645587322   2       0.8830000162124634
1613645589125   3       0.9049999713897705
1613645590891   4       0.9070000052452087
1613645592681   5       0.9279999732971191
1613645594490   6       0.9430000185966492
1613645596232   7       0.9369999766349792
1613645598034   8       0.9430000185966492
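Because these are plain text files, they are easy to consume from your own scripts. The following is a minimal sketch that assumes the layout shown above (tab-separated columns named timestamp, step and the metric name); the helper `read_metric` is hypothetical and an in-memory string stands in for logs/accuracy.tsv.

```python
import csv
import io


def read_metric(tsv_file, name):
    """Return a list of (step, value) pairs from a tab-separated metrics file."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return [(int(row["step"]), float(row[name])) for row in reader]


# In-memory stand-in for logs/accuracy.tsv (first rows of the listing above).
sample = io.StringIO(
    "timestamp\tstep\taccuracy\n"
    "1613645582716\t0\t0.7360000014305115\n"
    "1613645585478\t1\t0.8349999785423279\n"
)
history = read_metric(sample, "accuracy")

# Pick the step with the highest accuracy seen so far.
best_step, best_value = max(history, key=lambda pair: pair[1])
print(best_step, best_value)
```

With a real file, `open("logs/accuracy.tsv", newline="")` could be passed in place of the `io.StringIO` object.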

In addition to the continuous metrics files, you will see the summary metrics file and HTML file with the same file prefix. The summary file contains the result of the latest epoch:

$ cat logs.json | python -m json.tool
{
    "step": 41,
    "loss": 0.015958430245518684,
    "accuracy": 0.9950000047683716,
    "val_loss": 13.705962181091309,
    "val_accuracy": 0.5149999856948853
}

The HTML file contains all the visuals for continuous metrics as well as the summary metrics on a single page.

Note that the HTML and summary metrics files are regenerated automatically during training, so you can monitor model performance in real time.

Git-navigation with the metrics file

A DVC repository is NOT required to use the live metrics functionality described above. It works independently of DVC.

A DVC repository becomes useful when the metrics and plots are committed to your Git repository and you need to navigate around the metrics.

Metrics difference between workspace and the last Git commit:

$ git status -s
 M logs.json
 M logs/accuracy.tsv
 M logs/loss.tsv
 M logs/val_accuracy.tsv
 M logs/val_loss.tsv
 M train.py
?? model.h5

$ dvc metrics diff --target logs.json
Path       Metric        Old       New      Change
logs.json  accuracy      0.995     0.99     -0.005
logs.json  loss          0.01596   0.03036  0.0144
logs.json  step          41        36       -5
logs.json  val_accuracy  0.515     0.5175   0.0025
logs.json  val_loss      13.70596  3.29033  -10.41563
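The Change column is simply the new value minus the old one. As a rough sketch of that arithmetic (not DVC's actual implementation; the helper name `metrics_diff` and the rounding to five digits are assumptions made for this example):

```python
def metrics_diff(old, new, ndigits=5):
    """Return {metric: (old, new, change)} for metrics present in both dicts."""
    return {
        name: (old[name], new[name], round(new[name] - old[name], ndigits))
        for name in old
        if name in new
    }


# A few values taken from the table above.
old = {"step": 41, "accuracy": 0.995, "val_accuracy": 0.515}
new = {"step": 36, "accuracy": 0.99, "val_accuracy": 0.5175}

for name, (before, after, change) in metrics_diff(old, new).items():
    print(f"{name:<14}{before:<8}{after:<8}{change}")
```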

The difference between a particular commit/branch/tag or between two commits:

$ dvc metrics diff --target logs.json HEAD^ 47b85c
Path       Metric        Old       New      Change
logs.json  accuracy      0.995     0.998    0.003
logs.json  loss          0.01596   0.01951  0.00355
logs.json  step          41        82       41
logs.json  val_accuracy  0.515     0.51     -0.005
logs.json  val_loss      13.70596  5.83056  -7.8754

The same Git-navigation works with the plots:

$ dvc plots diff --target logs
file:///Users/dmitry/src/exp-dc/plots.html

Another nice thing about live metrics is that they work across ML experiments and checkpoints when properly set up in DVC stages. To set up live metrics, you need to specify the metrics directory in the live section of a stage:

stages:
  train:
    cmd: python train.py
    live:
      logs:
        cache: false
        summary: true
        report: true
    deps:
      - data

ML pipelines parameterization and foreach stages

After we introduced the multi-stage pipeline file dvc.yaml, it was quickly adopted by our users. The DVC team got tons of positive feedback from them, as well as feature requests.

Pipeline parameters from `vars`

The most requested feature was the ability to use parameters in dvc.yaml. For example, you can pass the same seed value or filename to multiple stages in the pipeline.

vars:
  - train_matrix: train.pkl
  - test_matrix: test.pkl
  - seed: 20210215

...

stages:
    process:
        cmd: python process.py 
                --seed ${seed} 
                --train ${train_matrix} 
                --test ${test_matrix}
        outs:
        - ${test_matrix}
        - ${train_matrix}

        ...

    train:
        cmd: python train.py ${train_matrix} --seed ${seed}
        deps:
        - ${train_matrix}

It also lets you keep all the important parameters in a single vars block and experiment with them. This is natural for scenarios like NLP, or when hyperparameter optimization happens not only in the model training code but in the data processing as well.

Pipeline parameters from params files

It is quite common to define pipeline parameters in a config file or a parameters file (like params.yaml) instead of in the pipeline file dvc.yaml itself. These parameters defined in params.yaml can also be used in dvc.yaml.

# params.yaml
models:
  us:
    thresh: 10
    filename: 'model-us.hdf5'

# dvc.yaml
stages:
  build-us:
    cmd: >-
      python script.py
        --out ${models.us.filename}
        --thresh ${models.us.thresh}
    outs:
      - ${models.us.filename}

DVC properly tracks params dependencies for each stage starting from the previous DVC version 1.0. See the --params option of dvc run for more details.
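To illustrate what the ${...} substitution amounts to, here is a minimal sketch that resolves dotted keys against a nested dictionary. It is only an approximation for explanation purposes: the names `PLACEHOLDER`, `lookup` and `render` are hypothetical, and DVC's real templating engine is not shown here and may handle more cases (for example escaping).

```python
import re

# Matches ${some.dotted.key}; the allowed characters are an assumption.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z0-9_.\-]+)\}")


def lookup(params, dotted_key):
    """Follow a dotted key such as 'models.us.thresh' through nested dicts."""
    value = params
    for part in dotted_key.split("."):
        value = value[part]
    return value


def render(template, params):
    """Replace every ${key} in `template` with its value from `params`."""
    return PLACEHOLDER.sub(lambda m: str(lookup(params, m.group(1))), template)


# Same structure as the params.yaml example above.
params = {"models": {"us": {"thresh": 10, "filename": "model-us.hdf5"}}}
cmd = "python script.py --out ${models.us.filename} --thresh ${models.us.thresh}"
print(render(cmd, params))
```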

Iterating over params with foreach stages

Iterating over params was a frequently requested feature. Now users can define multiple similar stages with a templatized command.

stages:
  build:
    foreach:
      gb:
        thresh: 15
        filename: 'model-gb.hdf5'
      us:
        thresh: 10
        filename: 'model-us.hdf5'
    do:
      cmd: >-
        python script.py --out ${item.filename} --thresh ${item.thresh}
      outs:
        - ${item.filename}
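Conceptually, a foreach stage is expanded into one concrete stage per item. The sketch below mimics that expansion for the example above; it is an illustration only, the function `expand_foreach` is hypothetical, and the `build@gb` / `build@us` naming follows the name@key convention described in the DVC documentation, which should be treated as an assumption here.

```python
def expand_foreach(name, foreach, do):
    """Expand a foreach/do definition into one concrete stage per item."""
    stages = {}
    for key, item in foreach.items():

        def fill(text):
            # Substitute every ${item.<field>} placeholder for this item.
            for field, value in item.items():
                text = text.replace("${item." + field + "}", str(value))
            return text

        stages[f"{name}@{key}"] = {
            "cmd": fill(do["cmd"]),
            "outs": [fill(out) for out in do["outs"]],
        }
    return stages


foreach = {
    "gb": {"thresh": 15, "filename": "model-gb.hdf5"},
    "us": {"thresh": 10, "filename": "model-us.hdf5"},
}
do = {
    "cmd": "python script.py --out ${item.filename} --thresh ${item.thresh}",
    "outs": ["${item.filename}"],
}
stages = expand_foreach("build", foreach, do)
for stage_name, stage in stages.items():
    print(stage_name, "->", stage["cmd"])
```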

New method to provision cloud compute in new CML release

We are releasing CML 0.3 together with DVC 2.0. We developed a brand new CML command, cml runner, that hides much of the complexity of configuring and provisioning an instance, keeping your workflows free of bash scripting clutter.

The new approach uses our new Iterative Terraform Provider under the hood instead of Docker Machine, as in the first version of CML.

This example workflow launches an EC2 instance from GitHub Actions and then trains a model. We hope you’ll agree it’s shorter, sweeter, and more powerful than ever!

name: 'Train in the cloud'
on: [push]

jobs:
  deploy-runner:
    runs-on: [ubuntu-latest]
    steps:
      - uses: iterative/setup-cml@v1
      - uses: actions/checkout@v2
      - name: deploy
        shell: bash
        env:
          repo_token: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          cml runner \
          --cloud aws \
          --cloud-region us-west \
          --cloud-type=t2.micro \
          --labels=cml-runner
  train-model:
    needs: deploy-runner
    runs-on: [self-hosted, cml-runner]
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: '3.x'
      - name: 'Train my model'
        run: |
          pip install -r requirements.txt
          python train.py

You’ll get a pull request that looks something like this:

All the code to replicate this example is up on a brand new demo repository.

Please find more details in the CML 0.3 pre-release blog post or on the CML website.

GitHub Actions in new CML release

One more thing: you might’ve noticed in our example workflow above that there’s a new CML GitHub Action! The new Action helps you set up CML, giving you one more way to mix and match the CML suite of functions with your preferred environment.

The new Action is designed to be a straightforward, all-in-one install that gives you immediate use of functions like cml publish and cml runner. You’ll add this step to your workflow:

steps:
  - uses: actions/checkout@v2
  - uses: iterative/setup-cml@v1

More details are in the docs!

In the same way, you can reference DVC as a GitHub Action:

steps:
  - uses: actions/checkout@v2
  - uses: iterative/dvc-action@v1

See DVC GitHub Action

Breaking changes

We put a lot of effort into keeping the number of breaking changes in this release to a minimum, to simplify migration to the new version:

Dropped support for external outputs in Google Cloud Storage and changed the default checksum from md5 to etag.
Dropped support for login with p12 files on service authentication for Google Drive.
Stages without dependencies are no longer always treated as changed. To keep that behavior, use --always-changed.
Environment variables inside the cmd of a stage using ${VAR} syntax must be escaped as \${VAR} in 2.0 due to the use of ${} syntax for templating.

Thank you!

Thank you to all DVC users and community members for the help. Please try out the new DVC and CML releases and do not get lost in your ML experiments!
