Install Git hooks into the DVC repository to automate certain common actions.
DVC provides an intelligent data repository on top of a regular Git repo that
stores code and configuration files. With
dvc install, the two are more tightly
integrated, so that certain convenient actions happen automatically.
Note that this command requires the DVC project to be a Git
repository. But the hooks won't activate if the current branch (commit, tag,
etc.) doesn't have DVC initialized (no
.dvc/ directory present).
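The activation rule above can be pictured as a small shell check. This is only an illustration of the rule (nothing happens when the .dvc/ directory is absent), not the script that DVC actually installs; the function name dvc_initialized is hypothetical.

```shell
#!/bin/sh
# Hypothetical guard illustrating the rule above; not DVC's actual hook code.
# Succeeds only when the given directory (default: current) contains .dvc/.
dvc_initialized() {
    [ -d "${1:-.}/.dvc" ]
}

# Possible use at the top of a hand-written hook:
#   dvc_initialized || exit 0   # skip silently on revisions without DVC
```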
Checkout: For any commit hash, branch or tag,
git checkout restores the
DVC project files corresponding to that
version. Some of these files, in turn, refer to data stored in the
cache, but not necessarily current in the workspace.
Normally, it's necessary to use
dvc checkout to also update the workspace.
This hook automates
dvc checkout after git checkout.
Commit/Reproduce: Before committing DVC changes with Git, it may be necessary to run
dvc commit to store new data files not yet in cache. Or the
changes might require reproducing the corresponding pipeline (with
dvc repro) to regenerate the
project's results (which implicitly commits them to DVC as well).
Push: While publishing changes to the Git remote with
git push, it's easy
to forget that the
dvc push command is necessary to upload new or updated data
files and directories tracked by DVC to remote storage.
This hook automates
dvc push before git push.
A hook runs after git checkout to automatically update the workspace with the correct data file versions.
A hook runs before git commit to inform the user about the differences between cache and workspace.
A hook runs before git push to upload files and directories tracked by DVC to the
default remote (see dvc remote default).
If a hook already exists, DVC will raise an exception. In that case, try to manually edit the existing file or remove it and retry install.
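Before running dvc install in a repository that may already have hooks, it can help to check which of the hook files DVC creates (post-checkout, pre-commit and pre-push) are already present. The following is a minimal sketch, assuming the standard .git/hooks location of a non-bare repository; the function name existing_hooks is hypothetical.

```shell
#!/bin/sh
# Print the names of hook files that already exist in a hooks directory.
# The default path assumes a standard, non-bare repository layout.
existing_hooks() {
    hooks_dir="${1:-.git/hooks}"
    for name in post-checkout pre-commit pre-push; do
        if [ -e "$hooks_dir/$name" ]; then
            printf '%s\n' "$name"
        fi
    done
}

# Example: run from the repository root before installing.
#   existing_hooks
```

An empty result suggests that none of these three files is in the way; any name printed points to a file to edit or remove first.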
For more information about Git hooks, refer to the git-scm documentation.
When you use
dvc install, it creates three files under the .git/hooks directory:
.git/hooks
├── post-checkout
├── pre-commit
└── pre-push
To disable the hooks, remove or edit those files.
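One way to disable the hooks without losing them is to rename the files, since Git only runs hook files whose names match exactly. The sketch below assumes the standard .git/hooks location; the function name disable_dvc_hooks and the .disabled suffix are conventions made up for this example. Note that it renames whatever file carries the hook name, including any content that was not written by DVC.

```shell
#!/bin/sh
# Disable the three hooks by renaming them. The ".disabled" suffix is our
# own convention; Git simply ignores files that do not match a hook name.
disable_dvc_hooks() {
    hooks_dir="${1:-.git/hooks}"
    for name in post-checkout pre-commit pre-push; do
        if [ -f "$hooks_dir/$name" ]; then
            mv "$hooks_dir/$name" "$hooks_dir/$name.disabled"
        fi
    done
}

# Example: run from the repository root.
#   disable_dvc_hooks
```

Renaming a file back to its original name re-enables that hook.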
To manage the DVC hooks with the pre-commit tool instead, the pre-commit config file needs entries like the following:
repos:
  - hooks:
      - id: dvc-pre-commit
        language_version: python3
        stages:
          - commit
      - id: dvc-pre-push
        # use s3/gs/etc instead of all to only install specific cloud support
        additional_dependencies: ['.[all]']
        language_version: python3
        stages:
          - push
      - always_run: true
        id: dvc-post-checkout
        language_version: python3
        stages:
          - post-checkout
    repo: https://github.com/iterative/dvc
    rev: main
    # rev should be set to a specific revision (e.g. 2.9.5) since pre-commit
    # does not allow using mutable references.
    # If using `main`, see pre-commit guide:
    # https://pre-commit.com/#using-the-latest-version-for-a-repository
Note that by default, the pre-commit tool only installs
pre-commit hooks. To also enable the pre-push and
post-checkout hooks, you must explicitly configure
the tool this way:
$ pre-commit install --hook-type pre-push --hook-type post-checkout --hook-type pre-commit
--use-pre-commit-tool - configures the DVC pre-commit, pre-push, and post-checkout Git hooks in the pre-commit config file.
--help - prints the usage/help message and exits.
--quiet - does not write anything to standard output. Exits with 0 if no problems arise, otherwise 1.
--verbose - displays detailed tracing information.
Let's employ a simple workspace with some data, code, ML models, and
pipeline stages, such as the DVC project created in our
Get Started section. Then we can see what happens with
dvc install in different situations.
Switching from one Git commit to another (with
git checkout) may change the
set of DVC files in the workspace. This could mean
that the data currently present no longer matches the project's version (which
can be fixed with dvc checkout).
Let's first list the available tags in the Get Started repo:
$ git tag
0-git-init
1-dvc-init
2-track-data
3-config-remote
4-import-data
5-source-code
6-prepare-stage
7-ml-pipeline
8-evaluation
9-bigrams-model
10-bigrams-experiment
...
These tags are used to mark points in the development of the project, and to
document specific experiments conducted in it. To take a look at one, we
check it out with Git:
$ git checkout 7-ml-pipeline
Note: checking out '7-ml-pipeline'.
You are in 'detached HEAD' state...
$ dvc status
featurize:
    changed outs:
        modified:           data/features
...
$ dvc checkout
$ dvc status
Data and pipelines are up to date.
After git checkout, we are also shown a message saying You are in
'detached HEAD' state. Returning the workspace to a normal state requires
running git checkout master.
We also see that the first
dvc status reports differences between the
project's cache and the data files currently in the workspace. Git
changed the DVC files in the workspace, which changed the references to data files,
so the data files in the workspace no longer
matched the hash values in the corresponding DVC files.
dvc checkout then brings them up to date, and a second dvc status
tells us that the data files now match the DVC files.
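The manual sequence described above (switch the revision with Git, then sync the data with DVC) could be wrapped in a small helper. This is a sketch, not part of DVC: the names run and checkout_both and the DRY_RUN variable are hypothetical, and the helper only chains the two commands already used in this example.

```shell
#!/bin/sh
# Run a command, or just print it when DRY_RUN=1 (handy for a preview).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "+ $*"
    else
        "$@"
    fi
}

# Switch to a Git revision, then bring DVC-tracked data in line with it.
# dvc checkout is skipped if git checkout fails.
checkout_both() {
    rev="$1"
    run git checkout "$rev" && run dvc checkout
}

# Example:
#   checkout_both 7-ml-pipeline
```

With the post-checkout hook installed, the second step becomes unnecessary because it runs automatically.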
$ git checkout master
Previous HEAD position was 6666298 Create ML pipeline stages
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
$ dvc checkout
We've seen the default behavior, with no Git hooks installed. Now we want
to see how the behavior changes after installing them. We must first
reset the workspace to the
HEAD commit before installing the hooks.
$ dvc install
$ cat .git/hooks/pre-commit
#!/bin/sh
exec dvc status
$ cat .git/hooks/post-checkout
#!/bin/sh
exec dvc checkout
The Git hooks have been installed, and the one of interest for this exercise
is the post-checkout script, which runs after git checkout.
We can now repeat the command run earlier, to see the difference.
$ git checkout 7-ml-pipeline
HEAD is now at 6666298 Create ML pipeline stages
M       model.pkl
M       data/features/
$ dvc status
Data and pipelines are up to date.
Looking carefully at this output, it is clear that the
dvc checkout command
has indeed been run. As a result, the workspace is up to date, with the data files
matching what is referenced in the DVC files.
To follow this example, start with the same workspace as before, making sure it
is not in a detached HEAD state by running
git checkout master.
If we simply edit one of the code files:
$ vi src/featurization.py
$ git commit -a -m "modified featurization"
featurize:
    changed deps:
        modified:           src/featurization.py
[master 1116ddc] modified featurization
 1 file changed, 1 insertion(+), 1 deletion(-)
We see that the output of
dvc status has appeared in the git commit
interaction. This new behavior comes from the installed Git hook, and it
informs us that the workspace is out of sync. Therefore, we know that the
dvc repro command is needed:
$ dvc repro
...
To track the changes with git run:
    git add dvc.lock
$ git status -s
 M dvc.lock
$ git commit -a -m "updated data after modified featurization"
Data and pipelines are up to date.
[master 78d0c44] modified featurization
 5 files changed, 12 insertions(+), 12 deletions(-)
After reproducing the pipeline, the data files should be in sync with the code
and configuration, and we want to commit the changes with Git. In doing so,
dvc status runs automatically again and confirms that the data files are
now in sync, with the message
Data and pipelines are up to date.
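The reproduce-then-commit sequence from this example can also be written as one helper, so that the commit is only attempted when reproduction succeeds. This is a sketch under assumptions: the names run and repro_and_commit and the DRY_RUN variable are hypothetical, and staging only dvc.lock follows the hint printed by dvc repro above; other projects may need to stage more files.

```shell
#!/bin/sh
# Run a command, or just print it when DRY_RUN=1 (handy for a preview).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "+ $*"
    else
        "$@"
    fi
}

# Reproduce the pipeline, stage the updated lock file, then commit.
# Each step runs only if the previous one succeeded.
repro_and_commit() {
    message="$1"
    run dvc repro &&
        run git add dvc.lock &&
        run git commit -m "$message"
}

# Example:
#   repro_and_commit "updated data after modified featurization"
```

With the pre-commit hook installed, the final git commit step still prints the dvc status report before the commit is recorded.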