Record changes to files or directories tracked by DVC.
Stores the current contents of files and directories tracked by DVC in the
cache, and updates
.dvc files if/as needed. This
forces DVC to accept any changed contents of tracked data currently in the
workspace. We explore the scenarios in which this can be useful
dvc commit provides a way to complete DVC commands that track data (
dvc import, etc.), when they have been used with the
--no-exec options. Those options cause the command to skip
these step(s) during the process of tracking each file or directory:
Skipping these steps is typically done to avoid caching unfinished data, for example when exploring different data or stages.
More specifically, scenarios for
dvc commit include:
Often we edit source code, configuration, or other files that are specified as
deps field) in a way that doesn't
cause any changes to stage outputs. For example: reformatting
input data, adding code comments, etc. However, DVC notices all changes to
dependencies and expects you to reproduce the corresponding pipeline
dvc repro). You can use
dvc commit instead to force accepting these new
versions without having to execute stage commands.
Sometimes after executing a stage, we realize that not all of its dependencies
or outputs are defined in
dvc.yaml. It is possible to
add the missing deps/outs
without having to re-execute stages, and
dvc commit is needed to finalize
the operation (see link).
dvc unprotect(or removing the outputs) is usually required before rewriting files/dirs tracked by DVC.
--with-deps - determines files to commit by tracking dependencies to
the target stages or
.dvc files. If no
targets are provided, this option
is ignored. By traversing all stage dependencies, DVC searches backward from
the target stages in the corresponding pipelines. This means DVC will not
commit files referenced in later stages than the
--recursive - determines the files to commit by searching each target
directory and its subdirectories for stages or
.dvc files to inspect. If
there are no directories among the
targets, this option is ignored.
--force - commit data even if hash values for dependencies or outputs
did not change.
--help - prints the usage/help message, and exit.
--quiet - do not write anything to standard output. Exit with 0 if no
problems arise, otherwise 1.
--verbose - displays detailed tracing information from executing the
dvc add command.
Let's employ a simple workspace with some data, code, ML models,
pipeline stages, such as the DVC project created for the
Get Started. Then we can see what happens with
git commit and
dvc commit in different situations.
Sometimes we want to iterate through multiple changes to configuration, code, or
data, trying different ways to improve the output of a stage. To avoid filling
the cache with undesired intermediate results, you can use the
--no-commit option of
dvc repro. Once your progress is good enough,
dvc commit can be used to store data files in the cache.
src/featurization.py is executed. A useful change to
make is adjusting the parameters for that script. The parameters are defined in
params.yaml file. Updating the value of the
max_features param to 6000
changes the resulting model:
featurize: max_features: 6000 ngrams: 2
This edit introduces a change that would cause the
evaluate stages to execute if we ran
dvc repro. But if we want to try
several values for
max_features and save only the best result to the cache, we
can run it like this:
$ dvc repro --no-commit
We can run this command as many times as we like, editing
params.yaml any way
we like, and so long as we use
--no-commit, the data does not get saved to the
cache. Let's verify that's the case:
$ dvc status featurize: changed outs: not in cache: data/features train: changed outs: not in cache: model.pkl
Now we can look in the cache directory to see if the new version of
is not in cache indeed. Let's look at the latest state of
train: cmd: python src/train.py data/features model.pkl deps: - path: data/features md5: de03a7e34e003e54dde0d40582c6acf4.dir - path: src/train.py md5: ad8e71b2cca4334a7d3bb6495645068c params: params.yaml: train.n_estimators: 100 train.seed: 20170428 outs: - path: model.pkl md5: 9aba000ba83b341a423a81eed8ff9238
To verify this instance of
model.pkl is not in the cache, we must know the
path to the cached file. In the cache directory, the first two characters of the
hash value are used as a subdirectory name, and the remaining characters are the
file name. Therefore, had the file been committed to the cache, it would appear
in the directory
.dvc/cache/9a. Let's check:
$ ls .dvc/cache/9a ls: .dvc/cache/9a: No such file or directory
If we've determined the changes to
params.yaml were successful, we can execute
this set of commands:
$ dvc commit $ dvc status Data and pipelines are up to date. $ ls .dvc/cache/70 ba000ba83b341a423a81eed8ff9238
We've verified that
dvc commit has saved the changes into the cache, and that
the new instance of
model.pkl is there.
Sometimes you may want to execute stage commands manually (instead of using
dvc repro). You won't have DVC helping you, but you'll have the freedom to run
any command, even ones not defined in
dvc.yaml. For example:
$ python src/featurization.py data/prepared data/features $ python src/train.py data/features model.pkl $ python src/evaluate.py model.pkl data/features auc.metric
Sometimes we want to clean up a code or configuration file in a way that doesn't cause a change in its results. We might write in-line documentation with comments, change indentation, remove some debugging printouts, or any other change that doesn't produce different output of pipeline stages.
$ git status -s M src/train.py $ dvc status train: changed deps: modified: src/train.py
Let's edit one of the source code files. It doesn't matter which one. You'll see that both Git and DVC recognize a change was made.
If we ran
dvc repro at this point, this pipeline would be reproduced. But
since the change was inconsequential, that would be a waste of time and CPU.
That's especially critical if the corresponding stages take lots of resources to
$ git add src/train.py $ git commit -m "CHANGED" [master 72327bd] CHANGED 1 file changed, 2 insertions(+) $ dvc commit dependencies ['src/train.py'] of 'train.dvc' changed. Are you sure you commit it? [y/n] y $ dvc status Data and pipelines are up to date.
Instead of reproducing the pipeline for changes that do not produce different
results, just use
commit on both Git and DVC.