Edit on GitHub

Checkpoints

ML checkpoints are an important part of deep learning because ML engineers like to save the model files at certain points during a training process.

With DVC experiments and checkpoints, you can:

  • Implement the best practice in deep learning to save your model weights as checkpoints.
  • Track all code and data changes corresponding to the checkpoints.
  • See when metrics start diverging and revert to the optimal checkpoint.
  • Automate the process of tracking every training epoch.

The way checkpoints are implemented by DVC utilizes ephemeral experiment commits and experiment branches within DVC. They are created using the metadata from experiments and are tracked with the exps custom Git reference.

You can add experiments to your Git history by committing the experiment you want to track, which you'll see later in this tutorial.

This tutorial is going to cover how to implement checkpoints in an ML project using DVC. We're going to train a model to identify handwritten digits based on the MNIST dataset.

โš™๏ธ Setting up the project

You can follow along with the steps here or you can clone the repo directly from GitHub and play with it. To clone the repo, run the following commands.

$ git clone https://github.com/iterative/checkpoints-tutorial
$ cd checkpoints-tutorial

It is highly recommended you create a virtual environment for this example. You can do that by running:

$ python3 -m venv .venv

Once your virtual environment is installed, you can start it up with one of the following commands.

  • On Mac/Linux: source .venv/bin/activate
  • On Windows: .\.venv\Scripts\activate

Once you have your environment set up, you can install the dependencies by running:

$ pip install -r requirements.txt

This will download all of the packages you need to run the example. Now you have everything you need to get started with experiments and checkpoints.

Setting up a DVC pipeline

DVC versions data and it also can version the machine learning model weights file as checkpoints during the training process. To enable this, you will need to set up a DVC pipeline to train your model.

Adding a DVC pipeline only takes a few commands. At the root of the project, run:

$ dvc init

This sets up the files you need for your DVC pipeline to work.

Now we need to add a stage for training our model within a DVC pipeline. We'll do that with dvc stage add, which we'll explain more later. For now, run the following command:

$ dvc stage add --name train --deps data/MNIST --deps train.py \
              --checkpoints model.pt --plots-no-cache predictions.json \
              --params seed,lr,weight_decay --live dvclive python train.py

The --live dvclive option enables our special logger DVCLive, which helps you register checkpoints from your code.

The checkpoints need to be enabled in DVC at the pipeline level. The -c / --checkpoint option of the dvc stage add command defines the checkpoint file or directory. The checkpoint file, model.pt, is an output from one checkpoint that becomes a dependency for the next checkpoint, such as the model weights file.

The rest of the dvc stage add command sets up our dependencies for running the training code, which parameters we want to track (which are defined in the params.yaml), some configurations for our plots, showing the training metrics, and specifying where the logs produced by the training process will go.

After running the command above to setup your train stage, your dvc.yaml should have the following code.

stages:
  train:
    cmd: python train.py
    deps:
      - data/MNIST
      - train.py
    params:
      - lr
      - seed
      - weight_decay
    outs:
      - model.pt:
          checkpoint: true
    plots:
      - predictions.json:
          cache: false
    live:
      dvclive:
        summary: true
        html: true

Before we go any further, this is a great point to add these changes to your Git history. You can do that with the following commands:

$ git add .
$ git commit -m "created DVC pipeline"

Now that you know how to enable checkpoints in a DVC pipeline, let's move on to setting up checkpoints in your code.

Registering checkpoints in your code

Take a look at the train.py file and you'll see how we train a convolutional neural network to classify handwritten digits. The main area of this code most relevant to checkpoints is when we iterate over the training epochs.

This is where DVC will be able to track our metrics over time and where we will add our checkpoints to give us the points in time with our model that we can switch between.

Now we need to enable checkpoints at the code level. We are interested in tracking the metrics along with each checkpoint, so we'll need to add a few lines of code.

In the train.py file, import the dvclive package with the other imports:

import dvclive

It's also possible to use DVC's Python API to register checkpoints, or to use custom code to do so. See dvc.api.make_checkpoint() for details.

Then update the following lines of code in the main method inside of the training epoch loop.

# Iterate over training epochs.
for i in range(1, EPOCHS+1):
    # Train in batches.
    train_loader = torch.utils.data.DataLoader(
            dataset=list(zip(x_train, y_train)),
            batch_size=512,
            shuffle=True)
    for x_batch, y_batch in train_loader:
        train(model, x_batch, y_batch, params["lr"], params["weight_decay"])
    torch.save(model.state_dict(), "model.pt")
    # Evaluate and checkpoint.
    metrics = evaluate(model, x_test, y_test)
    for k, v in metrics.items():
        print('Epoch %s: %s=%s'%(i, k, v))
+       dvclive.log(k, v)
+   dvclive.next_step()

The line torch.save(model.state_dict(), "model.pt") updates the checkpoint file.

You can read about what the line dvclive.log(k, v) does in the dvclive.log() reference.

The dvclive.next_step() line tells DVC that it can take a snapshot of the entire workspace and version it with Git. It's important that with this approach only code with metadata is versioned in Git (as an ephemeral commit), while the actual model weight file will be stored in the DVC data cache.

Running experiments

With checkpoints enabled and working in our code, let's run the experiment. You can run an experiment with the following command:

$ dvc exp run

You'll see output similar to this in your terminal while the training process is going on.

Epoch 1: loss=1.9428282976150513
Epoch 1: acc=0.5715
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'
Checkpoint experiment iteration 'd99d81c'.

file:///Users/milecia/Repos/checkpoints-tutorial/dvclive.html
Epoch 2: loss=1.25374174118042
Epoch 2: acc=0.7738
Updating lock file 'dvc.lock'
Checkpoint experiment iteration '963b396'.

file:///Users/milecia/Repos/checkpoints-tutorial/dvclive.html
Epoch 3: loss=0.7242147922515869
Epoch 3: acc=0.8284
Updating lock file 'dvc.lock'
Checkpoint experiment iteration 'd630b92'.

file:///Users/milecia/Repos/checkpoints-tutorial/dvclive.html
Epoch 4: loss=0.5083536505699158
Epoch 4: acc=0.8538
Updating lock file 'dvc.lock'
Checkpoint experiment iteration '0911c09'.

...

After a few epochs have completed, stop the training process with [Ctrl] C. Now it's time to take a look at the metrics we're working with.

If you don't have a number of training epochs defined and you don't terminate the process, the experiment will run for 100 epochs.

Viewing checkpoints

You can see a table of your experiments and checkpoints in the terminal by running:

$ dvc exp show
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”“
โ”ƒ Experiment              โ”ƒ Created  โ”ƒ step โ”ƒ loss    โ”ƒ acc    โ”ƒ seed   โ”ƒ lr     โ”ƒ weight_decay โ”ƒ
โ”กโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ฉ
โ”‚ workspace               โ”‚ -        โ”‚ 6    โ”‚ 0.33246 โ”‚ 0.9044 โ”‚ 473987 โ”‚ 0.0001 โ”‚ 0            โ”‚
โ”‚ main                    โ”‚ 01:19 PM โ”‚ -    โ”‚ -       โ”‚ -      โ”‚ 473987 โ”‚ 0.0001 โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•“ d90179a [exp-02ba1] โ”‚ 01:24 PM โ”‚ 6    โ”‚ 0.33246 โ”‚ 0.9044 โ”‚ 473987 โ”‚ 0.0001 โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•Ÿ 5eb4025             โ”‚ 01:24 PM โ”‚ 5    โ”‚ 0.36601 โ”‚ 0.8943 โ”‚ 473987 โ”‚ 0.0001 โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•Ÿ d665a31             โ”‚ 01:24 PM โ”‚ 4    โ”‚ 0.41666 โ”‚ 0.8777 โ”‚ 473987 โ”‚ 0.0001 โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•Ÿ 0911c09             โ”‚ 01:24 PM โ”‚ 3    โ”‚ 0.50835 โ”‚ 0.8538 โ”‚ 473987 โ”‚ 0.0001 โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•Ÿ d630b92             โ”‚ 01:23 PM โ”‚ 2    โ”‚ 0.72421 โ”‚ 0.8284 โ”‚ 473987 โ”‚ 0.0001 โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•Ÿ 963b396             โ”‚ 01:23 PM โ”‚ 1    โ”‚ 1.2537  โ”‚ 0.7738 โ”‚ 473987 โ”‚ 0.0001 โ”‚ 0            โ”‚
โ”‚ โ”œโ”€โ•จ d99d81c             โ”‚ 01:23 PM โ”‚ 0    โ”‚ 1.9428  โ”‚ 0.5715 โ”‚ 473987 โ”‚ 0.0001 โ”‚ 0            โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Starting from an existing checkpoint

Since you have all of these different checkpoints, you might want to resume training from a particular one. For example, maybe your accuracy started decreasing at a certain checkpoint and you want to make some changes to fix that.

First, we need to apply the checkpoint we want to begin our new experiment from. To do that, run the following command:

$ dvc exp apply 963b396

where 963b396 is the id of the checkpoint you want to reference.

Next, we'll change the learning rate set in the params.yaml to 0.000001 and start a new experiment based on an existing checkpoint with the following command:

$ dvc exp run --set-param lr=0.00001

You'll be able to see where the experiment starts from the existing checkpoint by running:

$ dvc exp show

You should seem something similar to this in your terminal.

โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”“
โ”ƒ Experiment              โ”ƒ Created  โ”ƒ step โ”ƒ loss    โ”ƒ acc    โ”ƒ seed   โ”ƒ lr     โ”ƒ weight_decay โ”ƒ
โ”กโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ฉ
โ”‚ workspace               โ”‚ -        โ”‚ 8    โ”‚ 0.83515 โ”‚ 0.8185 โ”‚ 473987 โ”‚ 1e-05  โ”‚ 0            โ”‚
โ”‚ main                    โ”‚ 01:19 PM โ”‚ -    โ”‚ -       โ”‚ -      โ”‚ 473987 โ”‚ 0.0001 โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•“ 726d32f [exp-3b52b] โ”‚ 01:38 PM โ”‚ 8    โ”‚ 0.83515 โ”‚ 0.8185 โ”‚ 473987 โ”‚ 1e-05  โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•Ÿ 3f8efc5             โ”‚ 01:38 PM โ”‚ 7    โ”‚ 0.88414 โ”‚ 0.814  โ”‚ 473987 โ”‚ 1e-05  โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•Ÿ 23f04c2             โ”‚ 01:37 PM โ”‚ 6    โ”‚ 0.9369  โ”‚ 0.8105 โ”‚ 473987 โ”‚ 1e-05  โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•Ÿ d810ed0             โ”‚ 01:37 PM โ”‚ 5    โ”‚ 0.99302 โ”‚ 0.804  โ”‚ 473987 โ”‚ 1e-05  โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•Ÿ 3e46262             โ”‚ 01:37 PM โ”‚ 4    โ”‚ 1.0528  โ”‚ 0.799  โ”‚ 473987 โ”‚ 1e-05  โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•Ÿ bf580a5             โ”‚ 01:37 PM โ”‚ 3    โ”‚ 1.1164  โ”‚ 0.7929 โ”‚ 473987 โ”‚ 1e-05  โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•Ÿ ea2d11b (963b396)   โ”‚ 01:37 PM โ”‚ 2    โ”‚ 1.1833  โ”‚ 0.7847 โ”‚ 473987 โ”‚ 1e-05  โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•“ d90179a [exp-02ba1] โ”‚ 01:24 PM โ”‚ 6    โ”‚ 0.33246 โ”‚ 0.9044 โ”‚ 473987 โ”‚ 0.0001 โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•Ÿ 5eb4025             โ”‚ 01:24 PM โ”‚ 5    โ”‚ 0.36601 โ”‚ 0.8943 โ”‚ 473987 โ”‚ 0.0001 โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•Ÿ d665a31             โ”‚ 01:24 PM โ”‚ 4    โ”‚ 0.41666 โ”‚ 0.8777 โ”‚ 473987 โ”‚ 0.0001 โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•Ÿ 0911c09             โ”‚ 01:24 PM โ”‚ 3    โ”‚ 0.50835 โ”‚ 0.8538 โ”‚ 473987 โ”‚ 0.0001 โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•Ÿ d630b92             โ”‚ 01:23 PM โ”‚ 2    โ”‚ 0.72421 โ”‚ 0.8284 โ”‚ 473987 โ”‚ 0.0001 โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•Ÿ 963b396             โ”‚ 01:23 PM โ”‚ 1    โ”‚ 1.2537  โ”‚ 0.7738 โ”‚ 473987 โ”‚ 0.0001 โ”‚ 0            โ”‚
โ”‚ โ”œโ”€โ•จ d99d81c             โ”‚ 01:23 PM โ”‚ 0    โ”‚ 1.9428  โ”‚ 0.5715 โ”‚ 473987 โ”‚ 0.0001 โ”‚ 0            โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

The existing checkpoint is referenced at the beginning of the new experiment. The new experiment is referred to as a modified experiment because it's taking existing data and using it as the starting point.

Metrics diff

When you've run all the experiments you want to and you are ready to compare metrics between checkpoints, you can run the command:

$ dvc metrics diff d90179a 726d32f

Make sure that you replace d90179a and 726d32f with checkpoint ids from your table with the checkpoints you want to compare. You'll see something similar to this in your terminal.

Path          Metric    Old      New      Change
dvclive.json  acc       0.9044   0.8185   -0.0859
dvclive.json  loss      0.33246  0.83515  0.50269
dvclive.json  step      6        8        2

These are the same numbers you see in the metrics table, just in a different format.

Looking at plots

You also have the option to generate plots to visualize the metrics about your training epochs. Running:

$ dvc plots diff d90179a 726d32f

where d90179a and 726d32f are the checkpoint ids you want to compare, will generate a plots.html file that you can open in a browser and it will display plots for you similar to the ones below.

Plots generated from running experiments on MNIST dataset using DVC

Resetting checkpoints

Usually when you start training a model, you won't begin with accuracy this high. There might be a time when you want to remove all of the existing checkpoints to start the training from scratch. You can reset your checkpoints with the following command:

$ dvc exp run --reset

This resets all of the existing checkpoints and re-runs the code to generate a new set of checkpoints under a new experiment branch.

โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”“
โ”ƒ Experiment              โ”ƒ Created  โ”ƒ step โ”ƒ loss    โ”ƒ acc    โ”ƒ seed   โ”ƒ lr     โ”ƒ weight_decay โ”ƒ
โ”กโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ฉ
โ”‚ workspace               โ”‚ -        โ”‚ 6    โ”‚ 2.0912  โ”‚ 0.5607 โ”‚ 473987 โ”‚ 1e-05  โ”‚ 0            โ”‚
โ”‚ main                    โ”‚ 01:19 PM โ”‚ -    โ”‚ -       โ”‚ -      โ”‚ 473987 โ”‚ 0.0001 โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•“ b21235e [exp-6c6fa] โ”‚ 01:56 PM โ”‚ 6    โ”‚ 2.0912  โ”‚ 0.5607 โ”‚ 473987 โ”‚ 1e-05  โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•Ÿ 53e5d64             โ”‚ 01:56 PM โ”‚ 5    โ”‚ 2.1385  โ”‚ 0.5012 โ”‚ 473987 โ”‚ 1e-05  โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•Ÿ 7e0e0fe             โ”‚ 01:56 PM โ”‚ 4    โ”‚ 2.1809  โ”‚ 0.4154 โ”‚ 473987 โ”‚ 1e-05  โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•Ÿ f7eafc5             โ”‚ 01:55 PM โ”‚ 3    โ”‚ 2.2177  โ”‚ 0.2518 โ”‚ 473987 โ”‚ 1e-05  โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•Ÿ dcfa3ff             โ”‚ 01:55 PM โ”‚ 2    โ”‚ 2.2486  โ”‚ 0.1264 โ”‚ 473987 โ”‚ 1e-05  โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•Ÿ bfd54b4             โ”‚ 01:55 PM โ”‚ 1    โ”‚ 2.2736  โ”‚ 0.1015 โ”‚ 473987 โ”‚ 1e-05  โ”‚ 0            โ”‚
โ”‚ โ”œโ”€โ•จ 189bbcb             โ”‚ 01:55 PM โ”‚ 0    โ”‚ 2.2936  โ”‚ 0.0892 โ”‚ 473987 โ”‚ 1e-05  โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•“ 726d32f [exp-3b52b] โ”‚ 01:38 PM โ”‚ 8    โ”‚ 0.83515 โ”‚ 0.8185 โ”‚ 473987 โ”‚ 1e-05  โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•Ÿ 3f8efc5             โ”‚ 01:38 PM โ”‚ 7    โ”‚ 0.88414 โ”‚ 0.814  โ”‚ 473987 โ”‚ 1e-05  โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•Ÿ 23f04c2             โ”‚ 01:37 PM โ”‚ 6    โ”‚ 0.9369  โ”‚ 0.8105 โ”‚ 473987 โ”‚ 1e-05  โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•Ÿ d810ed0             โ”‚ 01:37 PM โ”‚ 5    โ”‚ 0.99302 โ”‚ 0.804  โ”‚ 473987 โ”‚ 1e-05  โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•Ÿ 3e46262             โ”‚ 01:37 PM โ”‚ 4    โ”‚ 1.0528  โ”‚ 0.799  โ”‚ 473987 โ”‚ 1e-05  โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•Ÿ bf580a5             โ”‚ 01:37 PM โ”‚ 3    โ”‚ 1.1164  โ”‚ 0.7929 โ”‚ 473987 โ”‚ 1e-05  โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•Ÿ ea2d11b (963b396)   โ”‚ 01:37 PM โ”‚ 2    โ”‚ 1.1833  โ”‚ 0.7847 โ”‚ 473987 โ”‚ 1e-05  โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•“ d90179a [exp-02ba1] โ”‚ 01:24 PM โ”‚ 6    โ”‚ 0.33246 โ”‚ 0.9044 โ”‚ 473987 โ”‚ 0.0001 โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•Ÿ 5eb4025             โ”‚ 01:24 PM โ”‚ 5    โ”‚ 0.36601 โ”‚ 0.8943 โ”‚ 473987 โ”‚ 0.0001 โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•Ÿ d665a31             โ”‚ 01:24 PM โ”‚ 4    โ”‚ 0.41666 โ”‚ 0.8777 โ”‚ 473987 โ”‚ 0.0001 โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•Ÿ 0911c09             โ”‚ 01:24 PM โ”‚ 3    โ”‚ 0.50835 โ”‚ 0.8538 โ”‚ 473987 โ”‚ 0.0001 โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•Ÿ d630b92             โ”‚ 01:23 PM โ”‚ 2    โ”‚ 0.72421 โ”‚ 0.8284 โ”‚ 473987 โ”‚ 0.0001 โ”‚ 0            โ”‚
โ”‚ โ”‚ โ•Ÿ 963b396             โ”‚ 01:23 PM โ”‚ 1    โ”‚ 1.2537  โ”‚ 0.7738 โ”‚ 473987 โ”‚ 0.0001 โ”‚ 0            โ”‚
โ”‚ โ”œโ”€โ•จ d99d81c             โ”‚ 01:23 PM โ”‚ 0    โ”‚ 1.9428  โ”‚ 0.5715 โ”‚ 473987 โ”‚ 0.0001 โ”‚ 0            โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Adding checkpoints to Git

When you terminate training, you'll see a few commands in the terminal that will allow you to add these changes to Git.

To track the changes with git, run:

        git add dvclive.json dvc.yaml .gitignore train.py dvc.lock

Reproduced experiment(s): exp-263da
Experiment results have been applied to your workspace.

To promote an experiment to a Git branch run:

        dvc exp branch <exp>

You can run the following command to save your experiments to the Git history.

$ git add dvclive.json dvc.yaml .gitignore train.py dvc.lock

You can take a look at what will be committed to your Git history by running:

$ git status

You should see something similar to this in your terminal.

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        new file:   .gitignore
        new file:   dvc.lock
        new file:   dvclive.json

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        data/
        dvclive.html
        dvclive/
        model.pt
        plots.html
        predictions.json

All that's left is to commit these changes with the following command:

$ git commit -m 'saved files from experiment'

We can also remove all of the experiments we don't promote to our Git workspace with the following command:

$ dvc exp gc --workspace --force

Now that you know how to use checkpoints in DVC, you'll be able to resume training from different checkpoints to try out new hyperparameters or code and you'll be able to track all of the changes you make while trying to create the best possible model.