Data science and machine learning are iterative processes that require many attempts before reaching a target level on some metric. Experimentation is part of developing data features, exploring hyperparameter spaces, optimizing deep learning models, etc.
Some of DVC's base features already help you codify and analyze experiments. Parameters are simple values in a formatted text file that you can tweak and use in your code. On the other hand, metrics (and plots) let you define, visualize, and compare quantitative measures of your results.
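As a minimal sketch of how parameters and metrics connect, here is a hypothetical `params.yaml` (the file name is DVC's default; the keys, script, and output names are assumptions for illustration) alongside the `dvc.yaml` sections that track them:

```yaml
# params.yaml — simple values your code reads (hypothetical keys)
train:
  epochs: 10
  lr: 0.001
```

```yaml
# dvc.yaml — a hypothetical stage that depends on those params
# and produces a metrics file DVC can compare across versions
stages:
  train:
    cmd: python train.py
    deps:
      - train.py
    params:
      - train.epochs
      - train.lr
    metrics:
      - metrics.json:
          cache: false
```

Tweaking a value in `params.yaml` and reproducing the pipeline yields a new `metrics.json`, which DVC can then diff against previous versions.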
New in DVC 2.0!
DVC experiment management features form a comprehensive framework to organize, execute, manage, and share ML experiments. They support these main approaches:
Compare parameters and metrics of existing project versions (for example, Git branches) against each other or against new results in the workspace. See `dvc exp diff`.
Run and capture multiple experiments (derived from any project version as baseline) without polluting your Git history. DVC tracks them for you, letting you compare and share them. 📖 More info in the Experiments Overview.
👨‍💻 See Get Started: Experiments for a hands-on introduction to DVC experiments.
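A quick sketch of the second approach, using real `dvc exp` commands (the parameter name `train.lr` is a hypothetical example, not part of any default project):

```shell
# Run an experiment, tweaking a parameter on the fly
$ dvc exp run --set-param train.lr=0.002

# List experiments derived from the current baseline
$ dvc exp show

# Compare an experiment's params and metrics against the baseline
$ dvc exp diff
```

None of this touches your Git history until you decide to persist an experiment, so you can iterate freely and discard the rest.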
It's up to you to decide how to organize completed experiments. These are the main alternatives:
Git tags and branches - use the repo's "time dimension" to distribute your experiments. This makes the most sense for experiments that build on each other. Git-based experiment structures are especially helpful along with Git history exploration tools like GitHub.
Directories - the project's "space dimension" can be structured with directories (folders) to organize experiments. Useful when you want to see all your experiments at the same time (without switching versions) by just exploring the file system.
Hybrid - combining an intuitive directory structure with a good repo branching strategy tends to be the best option for complex projects. Completely independent experiments live in separate directories (and can be foreach stages, for example), while their progress can be found in different branches.
Labels - in general, you can record experiments in a separate system and structure them using custom labeling. This is typical in dedicated experiment tracking tools. A possible problem with this approach is that it's easy to lose the connection between your project history and the logged experiments.
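To illustrate the hybrid approach, independent experiments can live in separate directories and be generated as foreach stages in `dvc.yaml` (real DVC 2.0 syntax; the directory names, script, and parameters below are hypothetical):

```yaml
# dvc.yaml — one training stage per experiment directory,
# expanded from a list via DVC's `foreach`/`do` syntax
stages:
  train:
    foreach:
      - exp-baseline
      - exp-augmented
    do:
      cmd: python train.py --config ${item}/config.yaml
      deps:
        - train.py
        - ${item}/config.yaml
      outs:
        - ${item}/model.pkl
```

Each directory holds one experiment's inputs and outputs, while the ongoing progress of each can still be tracked in its own Git branch.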
Every time you reproduce a pipeline with DVC, it logs the unique signature of each stage run (in `.dvc/cache/runs` by default). If a stage has never run before, its command(s) are executed normally. Every subsequent time it runs under the same conditions, the previous results can be restored instantly, without wasting time or computing resources.
✅ This built-in feature is called run-cache, and it can dramatically improve performance. It's enabled out of the box (but can be disabled), which means DVC is already saving all of your tests and experiments behind the scenes. There's just no easy way to explore them.
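In practice the run-cache needs no setup; it kicks in on every `dvc repro` (the commands below are real, though the effect described in the comments depends on your pipeline having unchanged dependencies):

```shell
# First run: the stage executes normally and its signature
# is recorded in .dvc/cache/runs
$ dvc repro

# Re-running under the same conditions restores the previous
# results instantly from the run-cache
$ dvc repro

# Bypass the run-cache to force re-execution of the pipeline
$ dvc repro --no-run-cache
```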