Hydra Composition
Hydra is a framework to configure complex applications. DVC supports Hydra's config composition as a way to configure experiment runs.
You must explicitly enable this feature with:
$ dvc config hydra.enabled True
How it works
Upon dvc exp run
, Hydra will be used to compose a single params.yaml
before
executing the experiment. Its parameters will configure the
underlying DVC pipeline that contains your experiment.
DVC pipelines can run multiple stages with different shell commands instead of
a single script, and offer advanced features like templating and foreach
stages.
Setting up Hydra
First you need a conf/
directory for Hydra config groups. For example:
conf
โโโ config.yaml
โโโ dataset
โ โโโ imagenette.yaml
โ โโโ imagewoof.yaml
โโโ train
โโโ baseline.yaml
โโโ model
โ โโโ alexnet.yaml
โ โโโ efficientnet.yaml
โ โโโ resnet.yaml
โโโ optimizer
โโโ adam.yaml
โโโ sgd.yaml
You also need a conf/config.yaml
file defining the Hydra defaults list:
defaults:
- dataset: imagenette
- train/model: resnet
- train/optimizer: sgd
Use dvc config hydra
options to change the default locations for the config
groups directory and defaults list file.
Now let's look at what the resulting params.yaml
could be based on the setup
above:
dataset:
url: https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz
output_folder: imagenette
train:
model:
name: ResNet
size: 50
weights: ResNet50_Weights.IMAGENET1K_V2
optimizer:
name: SGD
lr: 0.001
momentum: 0.9
Running experiments
Let's build an experimental pipeline with 2 stages. The first one downloads a dataset
and uses the parameters defined in the dataset
section of params.yaml
. The second
stage trains an ML model and uses the rest of the parameters (entire train
group).
stages:
setup-dataset:
cmd:
- wget ${dataset.url} -O tmp.tgz
- mkdir -p ${dataset.output_folder}
- tar zxvf tmp.tgz -C ${dataset.output_folder}
- rm tmp.tgz
outs:
- ${dataset.output_folder}
train:
cmd: python train.py
deps:
- ${dataset.output_folder}
params:
- train
We parametrize the shell commands above (mkdir
, tar
, wget
) as well as
output and dependency paths (outs
, deps
) using
templating (${}
expression).
You can load the params with any YAML parsing library. In Python, you can use
the built-in dvc.api.params_show()
or OmegaConf.load("params.yaml")
(which
comes with Hydra).
We can now use dvc exp run --set-param
to modify the config composition
on-the-fly, for example loading the model config from
train/model/efficientnet.yaml
(instead of resnet.yaml
from the
defaults list):
$ dvc exp run --set-param 'train/model=efficientnet'
Stage 'setup-dataset' didn't change skipping
Running stage 'train':
> python train.py
--model.name 'EfficientNet' --model.size 'b0' --model.weights 'ResNet50_Weights.IMAGENET1K_V2'
--optimizer.name 'SGD' --optimizer.lr 0.001 --optimizer.momentum 0.9
...
Add -vv
(dvc exp run -vv --set-param 'train/model=efficientnet'
) if you need
to debug the composed values.
We can also modify specific values from any config section:
$ dvc exp run --set-param 'train.optimizer.lr=0.1'
Stage 'setup-dataset' didn't change, skipping
Running stage 'train':
> python train.py
--model.name 'EfficientNet' --model.size 'b0' --model.weights 'ResNet50_Weights.IMAGENET1K_V2'
--optimizer.name 'SGD' --optimizer.lr 0.01 --optimizer.momentum 0.9
...
We can also load multiple config groups in an experiments queue, for example to run a grid search of ML hyperparameters:
$ dvc exp run --queue \
-S 'train/optimizer=adam,sgd' \
-S 'train/model=resnet,efficientnet'
Queueing with overrides '{'params.yaml': ['optimizer=adam', 'model=resnet']}'.
Queued experiment 'ed3b4ef' for future execution.
Queueing with overrides '{'params.yaml': ['optimizer=adam', 'model=efficientnet']}'.
Queued experiment '7a10d54' for future execution.
Queueing with overrides '{'params.yaml': ['optimizer=sgd', 'model=resnet']}'.
Queued experiment '0b443d8' for future execution.
Queueing with overrides '{'params.yaml': ['optimizer=sgd', 'model=efficientnet']}'.
Queued experiment '0a5f20e' for future execution.
$ dvc queue start
...
Note that DVC keeps a cache of all runs, so many permutations will be completed
without actually having to run the experiment. In the above example, the
experiment with ['optimizer=sgd', 'model=resnet']
will not waste computing
time because the results are already in the run cache. You can
confirm this with dvc queue logs
:
$ dvc queue logs 0b443d8
Stage 'setup-dataset' didn't change, skipping
Stage 'train' didn't change, skipping
dvc exp run
will compose a new params.yaml
each time you run it, so it is
not a reliable way to reproduce past experiments. Instead, use dvc repro
when
you want to reproduce a previously run experiment.
Migrating Hydra Projects
If you already have Hydra configured and want to start using DVC alongside it,
you may need to refactor your code slightly. DVC will not pass the Hydra config
to @hydra.main()
, so it should be dropped from the code. Instead, DVC composes
the Hydra config before your code runs and dumps the results to params.yaml
.
Using the example above, here's how the Python code in train.py
might look
using Hydra without DVC:
import hydra
from omegaconf import DictConfig
@hydra.main(version_base=None, config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
# train model using cfg parameters
if __name__ == "__main__":
main()
To convert the same code to use DVC with Hydra composition enabled:
from omegaconf import OmegaConf
def main() -> None:
cfg = OmegaConf.load("params.yaml")
# train model using cfg parameters
if __name__ == "__main__":
main()
You no longer need to import Hydra into your code. A main()
method is included
in this example because it is good practice, but it's not necessary. This
separation between config and code can help debug because the entire config
generated by Hydra gets written to params.yaml
before the experiment starts.
You can run the same code with or without Hydra (or DVC). You can also reuse
params.yaml
across multiple scripts in different stages of a DVC pipeline.
Advanced Hydra config
You can configure how DVC works with Hydra.
By default, DVC will look for Hydra config groups in a conf
directory, but you
can set a different directory using dvc config hydra.config_dir other_dir
. This
is equivalent to the config_path
argument in @hydra.main()
.
Within that directory, DVC will look for defaults list in config.yaml
, but you
can set a different path using dvc config hydra.config_name other.yaml
. This is
equivalent to the config_name
argument in @hydra.main()
.
Hydra will automatically discover plugins in the hydra_plugins
directory. By
default, DVC will look for hydra_plugins
in the root directory of the DVC
repository, but you can set a different path with
dvc config hydra.plugins_path other_path
.
Custom resolvers
You can register OmegaConf custom resolvers as plugins by writing them to a
file inside hydra_plugins
. DVC will use these custom resolvers when composing
the Hydra config. For example, add a custom resolver to
hydra_plugins/my_resolver.py
:
import os
from omegaconf import OmegaConf
OmegaConf.register_new_resolver('join', lambda x, y : os.path.join(x, y))
You can use that custom resolver inside the Hydra config:
dir: raw/data
relpath: dataset.csv
fullpath: ${join:${dir},${relpath}}
The final params.yaml
will look like:
dir: raw/data
relpath: dataset.csv
fullpath: raw/data/dataset.csv