Make an in-code checkpoint.
    from dvc.api import make_checkpoint

    while True:
        # ... write a stage output
        make_checkpoint()

To track successive steps in a longer experiment, you can write
your code so it registers checkpoints with DVC during runtime (similar to a
logger). This function can be called by the code in stages executed by
dvc exp run.
make_checkpoint() does the following:
1. Checks that the DVC_ROOT env var is set. It means this code is being executed via dvc exp run, and it contains the location of the correct .dvc/ directory for this experiment (which can vary).
2. Creates the $DVC_ROOT/.dvc/tmp/DVC_CHECKPOINT signal file so DVC knows that a checkpoint should be captured now.

💡 Note that for non-Python code, the way to register checkpoints with DVC is to implement the steps above yourself.
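
As a rough illustration, the steps listed above could be reproduced along the following lines; the same sequence can be ported to other languages. This is a sketch, not DVC's actual implementation: the function name `signal_checkpoint`, its `wait` and `poll_interval` parameters, the creation of the `.dvc/tmp` directory when it is missing, and the wait for DVC to delete the signal file are all assumptions made for this example. Only the `DVC_ROOT` variable and the signal file path come from the description above.

```python
import os
import time


def signal_checkpoint(wait=True, poll_interval=0.1):
    """Hypothetical stand-in for make_checkpoint().

    Returns True if a signal file was written, False otherwise.
    """
    root = os.environ.get("DVC_ROOT")
    if root is None:
        # Not running under `dvc exp run`: there is nothing to signal.
        return False

    signal_file = os.path.join(root, ".dvc", "tmp", "DVC_CHECKPOINT")
    # Assumption: create the directory if it is missing, for robustness.
    os.makedirs(os.path.dirname(signal_file), exist_ok=True)

    # Create the signal file so DVC knows a checkpoint should be captured now.
    with open(signal_file, "w") as fobj:
        fobj.flush()
        os.fsync(fobj.fileno())

    # Assumption: DVC removes the file once the checkpoint has been captured,
    # so wait for it to disappear before letting the caller continue.
    while wait and os.path.exists(signal_file):
        time.sleep(poll_interval)
    return True
```

Calling `signal_checkpoint(wait=False)` only writes the signal file and returns immediately, which may be convenient when trying the function outside of `dvc exp run`.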
Let's consider the following dvc.yaml file:

    stages:
      train:
        cmd: python train.py
        outs:
          - model:
              checkpoint: true

The code in train.py will train a model up to a number of epochs. Every 100 epochs, it saves the model, evaluates it, and makes a checkpoint for DVC:

    from dvc.api import make_checkpoint

    for epoch in range(epochs):
        train(model, x_train, y_train)
        if epoch % 100 == 0:
            save_model(model, "model")
            evaluate(model, x_test, y_test)
            make_checkpoint()

Because checkpoint outputs in effect implement a circular dependency, dvc repro does not support running this stage. Let's execute the stage with dvc exp run instead, and interrupt the process manually moments later:

    $ dvc exp run
    Running stage 'every100':
    > python iterate.py
    Generating lock file 'dvc.lock'
    Updating lock file 'dvc.lock'
    Checkpoint experiment iteration 'd832784'.
    Updating lock file 'dvc.lock'
    Checkpoint experiment iteration '6f5009b'.
    Updating lock file 'dvc.lock'
    Checkpoint experiment iteration '75ff5e0'.
    ^C
    Reproduced experiment(s): exp-8a3bd
    Experiment results have been applied to your workspace.

⚠️ It's important to handle interruptions and any other errors in your code for DVC checkpoints to behave as expected.
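
One possible way to handle an interruption is to catch `KeyboardInterrupt` around the training loop, so the script exits cleanly after the last completed checkpoint. The sketch below is an assumption about what such handling might look like, not a DVC requirement. `run_training`, `train_step`, `save`, and `checkpoint` are hypothetical names, with `checkpoint` standing in for `make_checkpoint`.

```python
def run_training(epochs, train_step, save, checkpoint, every=100):
    """Run the loop; return the last epoch at which a checkpoint was made (or None)."""
    last_checkpoint = None
    try:
        for epoch in range(epochs):
            train_step(epoch)
            if epoch % every == 0:
                save(epoch)
                checkpoint()  # e.g. dvc.api.make_checkpoint
                last_checkpoint = epoch
    except KeyboardInterrupt:
        # Stop cleanly; the output saved at the last checkpoint is left untouched.
        print(f"Interrupted; last checkpoint at epoch {last_checkpoint}")
    return last_checkpoint
```

Passing the callables in as arguments keeps the loop easy to try without DVC installed; in a real stage, `checkpoint` would simply be `make_checkpoint`.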
In this example, we killed the process (with Ctrl+C) after 3 checkpoints (at epochs 0, 100, and 200). The cache will contain those 3 versions of model.

dvc exp show will display these checkpoints as an experiment branch:

    $ dvc exp show
    ──────────────────────────────
      Experiment      Created
    ──────────────────────────────
      workspace       -
      master          Feb 10, 2021
      │ ╓ exp-8a3bd   02:07 PM
      │ ╟ 75ff5e0     01:54 PM
      │ ╟ 6f5009b     01:54 PM
      ├─╨ d832784     01:54 PM
    ──────────────────────────────
    # Press q to exit this screen.

If we use
dvc exp run again, the process will start from 200 (since that's
what the workspace reflects).
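
For a run to continue where the workspace left off, the training script itself has to read its starting point from the checkpoint output. A minimal sketch, assuming the saved model is a JSON file that records the epoch at which it was written (the `epoch` and `weights` fields and the helpers `save_state` and `load_state` are hypothetical, not part of DVC):

```python
import json
import os


def save_state(path, epoch, weights):
    """Write the model state together with the epoch it was saved at."""
    with open(path, "w") as fobj:
        json.dump({"epoch": epoch, "weights": weights}, fobj)


def load_state(path):
    """Return (epoch, weights) from an earlier checkpoint, or (0, None) on a fresh start."""
    if not os.path.exists(path):
        return 0, None
    with open(path) as fobj:
        state = json.load(fobj)
    return state["epoch"], state["weights"]
```

A training loop could then call `start_epoch, weights = load_state("model")` and iterate over `range(start_epoch, epochs)` instead of always starting at 0.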
See Experiment Management for more details on managing experiments.