New in DVC 2.0
Data science and ML are iterative processes that often require a large number of attempts before a metric reaches a target level. Experimentation is part of developing data features, exploring hyperparameter spaces, optimizing deep learning models, etc. DVC helps you codify and manage all of your experiments, supporting these main approaches:
Make experiments or checkpoints persistent by committing them to your repository. Or create these versions from scratch like typical project changes.
At this point you may also want to consider the different ways to organize experiments in your project (as Git branches, as folders, etc.).
DVC also provides specialized features to codify and analyze experiments. Parameters are simple values you can tweak in a human-readable text file; they cause different behaviors in your code and models. At the other end, metrics (and plots) let you define, visualize, and compare meaningful measures of the experimental results.
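As an illustration, a project might keep its tweakable values in a `params.yaml` file (DVC's default parameters file name); the specific parameter names below are hypothetical:

```yaml
# params.yaml — hypothetical hyperparameters read by the training code
train:
  epochs: 10
  learning_rate: 0.001
model:
  hidden_units: 128
```

Stages in `dvc.yaml` can then declare which of these values they depend on (`params:`) and which files hold their results (`metrics:`), so DVC can detect parameter changes and compare metrics across experiments.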
👨💻 See Get Started: Experiments for a hands-on introduction to DVC experiments.
`dvc exp` commands let you automatically track variations of an established
data pipeline. You can create multiple isolated
experiments this way, as well as review, compare, and restore them later, or
roll back to the baseline. The basic workflow goes like this:

- Use `dvc exp run` (instead of `dvc repro`) to execute the pipeline. The results are reflected in your workspace, and tracked automatically.
- Review and compare experiments with `dvc exp show` or `dvc exp diff`. Repeat 🔄
- Use `dvc exp apply` to roll back to the best one.
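Assuming the project already has a DVC pipeline and a `params.yaml` file (the parameter and experiment names below are made up), a session with this workflow could look like this sketch:

```shell
# Run the pipeline as an experiment after tweaking a (hypothetical) parameter:
dvc exp run --set-param train.learning_rate=0.002

# Tabulate all experiments against the baseline; repeat runs as needed:
dvc exp show

# Summarize how params and metrics changed relative to the baseline:
dvc exp diff

# Restore the chosen experiment's results into the workspace:
dvc exp apply exp-abc12
```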
To track successive steps in a longer experiment, you can register checkpoints from your code at runtime. This allows you, for example, to track the progress in deep learning techniques such as evolving neural networks.
These experiments track a series of variations (the checkpoints), and their
execution can be stopped and resumed as needed. You interact with them using
`dvc exp run` and its `--reset` option.
📖 To learn more, see the dedicated Checkpoints guide.
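Checkpoints are enabled by marking a stage output with `checkpoint: true` in `dvc.yaml`. A minimal sketch, assuming a hypothetical `train.py` that periodically overwrites `model.pt`:

```yaml
# dvc.yaml — hypothetical training stage with a checkpoint output
stages:
  train:
    cmd: python train.py
    deps:
      - train.py
      - data/
    params:
      - train.epochs
    outs:
      - model.pt:
          checkpoint: true # tell DVC to track successive versions of this output
```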
When your experiments are good enough to save or share, you may want to store them persistently as Git commits in your repository.
Whether the results were produced with
`dvc repro` directly, or after a
`dvc exp` workflow (refer to the previous sections), the
`dvc.yaml` and `dvc.lock` pair in the workspace will codify the experiment as a new project
version. The right outputs (including
metrics) should also be present, or available from the DVC cache.
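Making an experiment persistent then comes down to committing those files with Git as usual, for example (commit message and remote setup assumed):

```shell
# After dvc repro or dvc exp apply, the workspace holds the results.
git add dvc.yaml dvc.lock params.yaml
git commit -m "Register tuned experiment"

# Optionally, upload the cached outputs to remote storage:
dvc push
```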
DVC takes care of arranging
`dvc exp` experiments and the data
cache under the hood. But when it comes to full-blown persistent
experiments, it's up to you to decide how to organize them in your project,
for example as Git branches or tags, as directories, or as a hybrid of both.
Every time you
`dvc repro` pipelines or
`dvc exp run` experiments, DVC logs the
unique signature of each stage run (to
`.dvc/cache/runs` by default). If a stage has
never run before, its command(s) are executed normally. Every
subsequent time a stage runs under the same
conditions, the previous results can be restored instantly, without wasting time
or computing resources.
✅ This built-in feature is called run-cache and it can
dramatically improve performance. It's enabled out of the box (but can be
disabled with the
`--no-run-cache` command option).
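The idea behind the run-cache can be sketched in a few lines of Python. This is an illustrative analogue, not DVC's actual implementation: a stage run is keyed by a hash of its command plus its dependency contents, and repeated runs under the same conditions are restored from the cache instead of re-executed.

```python
import hashlib
import json


class RunCache:
    """Toy illustration of the run-cache idea: key each stage run by a
    hash of its command and input contents, and reuse stored outputs
    when the same combination is seen again."""

    def __init__(self):
        self._cache = {}    # signature -> outputs
        self.executions = 0

    def _signature(self, cmd, deps):
        # Hash the command and dependency contents deterministically.
        payload = json.dumps({"cmd": cmd, "deps": deps}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def run(self, cmd, deps, func):
        sig = self._signature(cmd, deps)
        if sig in self._cache:      # same conditions: restore instantly
            return self._cache[sig]
        self.executions += 1        # first time: execute normally
        outputs = func(deps)
        self._cache[sig] = outputs
        return outputs


cache = RunCache()
train = lambda deps: {"acc": len(deps["data"]) / 10}

out1 = cache.run("python train.py", {"data": "abcde"}, train)
out2 = cache.run("python train.py", {"data": "abcde"}, train)  # cache hit
assert cache.executions == 1 and out1 == out2
```

Changing either the command string or any dependency's content yields a new signature, so the stage executes again — the same conditions DVC checks before reusing a previous stage run.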