Record changes to DVC-tracked files in the project, by updating DVC-files and saving outputs to the cache.
usage: dvc commit [-h] [-q | -v] [-f] [-d] [-R] [targets [targets ...]]

positional arguments:
  targets        Limit command scope to these stages or .dvc files.
                 Using -R, directories to search for stages or .dvc
                 files can also be given.
The dvc commit command is useful in several scenarios where data already
tracked by DVC changes: when a stage or
pipeline is in development/experimentation; when
manually editing or generating DVC outputs; or to force DVC-file
updates without reproducing stages or pipelines. These scenarios are further
detailed below.
Code or data for a stage is under active development, with multiple iterations
(experiments) in code, configuration, or data. Use the
--no-commit option of
DVC commands (such as
dvc run or dvc repro) to avoid caching unnecessary
data repeatedly. Use
dvc commit when the DVC-tracked data is final.
dvc unprotect). Once a desirable result is reached, use
dvc commit as appropriate to update DVC-files and store changed data in the cache.
dvc commit to force-update the related DVC-files and cache.
Let's take a closer look at what happens in the first scenario. Normally,
DVC commands like
dvc repro or
dvc run commit the data to the
cache after creating a DVC-file. Committing means that DVC:
.gitignore). (Note that if the project was initialized with no Git support (
dvc init --no-scm), this does not happen.)
There are many cases where the last step is not desirable (for example rapid
iterations on an experiment). The
--no-commit option prevents the last step
from occurring (on the commands where it's available), saving time and space by
not storing unwanted data artifacts. The file hash is still
computed and added to the DVC-file, but the actual data file is not saved in the
cache. This is where the
dvc commit command comes into play. It performs that
last step (saving the data in cache).
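The split between recording a hash and saving data to the cache can be illustrated with a small toy model. This is only a sketch and not DVC's actual implementation: the `track` and `commit_to_cache` functions, the dictionary standing in for a DVC-file entry, and the temporary cache directory are all hypothetical names introduced here.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def commit_to_cache(path: Path, cache_dir: Path, digest: str) -> Path:
    """Toy model of the deferred step: store the data under its hash."""
    target = cache_dir / digest[:2] / digest[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(path, target)
    return target


def track(path: Path, cache_dir: Path, commit: bool = True) -> dict:
    """Always record the hash; copy into the cache only when commit=True."""
    digest = hashlib.md5(path.read_bytes()).hexdigest()
    record = {"path": path.name, "md5": digest}  # stand-in for a DVC-file entry
    if commit:
        commit_to_cache(path, cache_dir, digest)
    return record


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    cache = workspace / "cache"
    model = workspace / "model.pkl"
    model.write_bytes(b"intermediate result")

    # Analogous to running with --no-commit: hash recorded, nothing cached.
    record = track(model, cache, commit=False)
    cached = cache / record["md5"][:2] / record["md5"][2:]
    in_cache_before = cached.exists()

    # Analogous to a later commit: the data is now stored in the cache.
    commit_to_cache(model, cache, record["md5"])
    in_cache_after = cached.exists()

print(in_cache_before, in_cache_after)
```

In this model the recorded hash is identical in both cases; only the presence of the data in the cache differs.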
Note that it's best to avoid the last two scenarios. They essentially force-update the DVC-files and save data to cache. They are still useful, but keep in mind that DVC can't guarantee reproducibility in those cases.
--with-deps - determines the files to commit by tracking dependencies to the target DVC-files (stages). If no targets are provided, this option is ignored. By traversing all stage dependencies, DVC searches backward from the target stages in the corresponding pipelines. This means DVC will not commit files referenced in stages later than the targets.
--recursive - determines the files to commit by searching each target directory and its subdirectories for DVC-files to inspect. If there are no directories among the targets, this option is ignored.
--force - commits data even if hash values for dependencies or outputs did not change.
--help - prints the usage/help message and exits.
--quiet - does not write anything to standard output. Exits with 0 if no problems arise, otherwise 1.
--verbose - displays detailed tracing information from executing the dvc commit command.
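The backward search described for --with-deps can be pictured as a graph traversal. The sketch below is an illustration only, not DVC's code: the `STAGE_DEPS` mapping and the `stages_with_deps` function are hypothetical, and the `prepare.dvc` and `featurize.dvc` stage names are assumed for the example.

```python
from collections import deque

# Hypothetical pipeline: each stage maps to the stages it depends on.
STAGE_DEPS = {
    "prepare.dvc": [],
    "featurize.dvc": ["prepare.dvc"],
    "train.dvc": ["featurize.dvc"],
    "evaluate.dvc": ["train.dvc", "featurize.dvc"],
}


def stages_with_deps(targets, stage_deps):
    """Return the targets plus every stage found by walking dependencies backward."""
    seen = set()
    queue = deque(targets)
    while queue:
        stage = queue.popleft()
        if stage in seen:
            continue
        seen.add(stage)
        queue.extend(stage_deps.get(stage, []))
    return seen


selected = stages_with_deps(["train.dvc"], STAGE_DEPS)
print(sorted(selected))
```

With `train.dvc` as the target, the later `evaluate.dvc` stage is not selected, which matches the behavior described for stages downstream of the targets.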
Let's employ a simple workspace with some data, code, ML models, and
pipeline stages, such as the DVC project created for
Get Started. Then we can see what happens with
git commit and
dvc commit in different situations.
Sometimes we want to iterate through multiple changes to configuration, code, or
data, trying different ways to improve the output of a stage. To avoid filling
the cache with undesired intermediate results, we can run a single stage with
dvc run --no-commit, or reproduce an entire pipeline using
dvc repro --no-commit. This prevents data from being saved to the cache. When
development of the stage is finished,
dvc commit can be used to store data
files in the cache.
src/featurize.py is executed. A useful change to
make is adjusting a parameter to
CountVectorizer in that script. Namely,
max_features value in the line below changes the resulting
bag_of_words = CountVectorizer(stop_words='english', max_features=6000, ngram_range=(1, 2))
This edit introduces a change that would cause the
evaluate.dvc stages to execute if we ran
dvc repro. But if we want to
try several values for
max_features and save only the best result to the
cache, we can run it like this:
$ dvc repro --no-commit evaluate.dvc
We can run this command as many times as we like, editing
featurize.py any way
we like, and so long as we use
--no-commit, the data does not get saved to the
cache. Let's verify that's the case:
$ dvc status
evaluate.dvc:
    changed deps:
        modified:           data/features
        modified:           model.pkl
train.dvc:
    changed outs:
        not in cache:       model.pkl
Now we can look in the cache directory to see if the new version of
model.pkl is indeed not in cache as claimed. Look at
train.dvc first:
cmd: python src/train.py data/features model.pkl
deps:
- md5: d05e0201a3fb47c878defea65bd85e4d
  path: src/train.py
- md5: b7a357ba7fa6b726e615dd62b34190b4.dir
  path: data/features
md5: b91b22bfd8d9e5af13e8f48523e80250
outs:
- cache: true
  md5: 70599f166c2098d7ffca91a369a78b0d
  metric: false
  path: model.pkl
  persist: false
wdir: .
To verify this instance of
model.pkl is not in the cache, we must know the
path to the cached file. In the cache directory, the first two characters of the
hash value are used as a subdirectory name, and the remaining characters are the
file name. Therefore, had the file been committed to the cache, it would appear
in the directory
.dvc/cache/70. Let's check:
$ ls .dvc/cache/70
ls: .dvc/cache/70: No such file or directory
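The mapping from a hash value to its cache location can be written as a small helper. This is a minimal sketch of the layout described above; the `cache_path` function is a hypothetical name introduced here, and `.dvc/cache` is taken from the example.

```python
from pathlib import PurePosixPath


def cache_path(md5: str, cache_dir: str = ".dvc/cache") -> PurePosixPath:
    """First two hash characters form the directory, the rest the file name."""
    return PurePosixPath(cache_dir) / md5[:2] / md5[2:]


path = cache_path("70599f166c2098d7ffca91a369a78b0d")
print(path)  # .dvc/cache/70/599f166c2098d7ffca91a369a78b0d
```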
If we've determined the changes to
featurize.py were successful, we can
execute this set of commands:
$ dvc commit
$ dvc status
Data and pipelines are up to date.
$ ls .dvc/cache/70
599f166c2098d7ffca91a369a78b0d
We've verified that
dvc commit has saved the changes into the cache, and that
the new instance of
model.pkl is there.
It is also possible to execute the commands that are executed by
dvc repro by
hand. You won't have DVC helping you, but you have the freedom to run any
command you like, even ones not defined in a
DVC-file. For example:
$ python src/featurization.py data/prepared data/features
$ python src/train.py data/features model.pkl
$ python src/evaluate.py model.pkl data/features auc.metric
Sometimes we want to clean up a code or configuration file in a way that doesn't cause a change in its results. We might write in-line documentation with comments, change indentation, remove some debugging printouts, or any other change that doesn't produce different output of pipeline stages.
Let's edit one of the source code files. It doesn't matter which one. You'll see that both Git and DVC recognize a change was made.
$ git status -s
M src/train.py
$ dvc status
train.dvc:
    changed deps:
        modified:           src/train.py
If we ran
dvc repro at this point, this pipeline would be reproduced. But
since the change was inconsequential, that would be a waste of time and CPU.
That's especially critical if the corresponding stages take lots of resources to
execute.
$ git add src/train.py
$ git commit -m "CHANGED"
[master 72327bd] CHANGED
 1 file changed, 2 insertions(+)
$ dvc commit
dependencies ['src/train.py'] of 'train.dvc' changed. Are you sure you commit it? [y/n] y
$ dvc status
Data and pipelines are up to date.
Instead of reproducing the pipeline for changes that do not produce different
results, just use
commit on both Git and DVC.
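A cosmetic edit is still reported as a change because the file's hash changes even when its behavior does not. The sketch below illustrates this under the assumption that change detection relies on a content hash such as the md5 values shown in the DVC-file earlier; the `double` function and both helper functions are hypothetical.

```python
import hashlib

original = "def double(x):\n    return x * 2\n"
# Same behavior, with an added comment (a purely cosmetic edit).
edited = "# Double the input.\ndef double(x):\n    return x * 2\n"


def md5_of(source: str) -> str:
    """Hash the source text, as a stand-in for hashing a dependency file."""
    return hashlib.md5(source.encode("utf-8")).hexdigest()


def run_double(source: str, value: int) -> int:
    """Execute the toy source and call the function it defines."""
    namespace = {}
    exec(source, namespace)  # acceptable for this toy snippet only
    return namespace["double"](value)


hash_changed = md5_of(original) != md5_of(edited)
same_result = run_double(original, 21) == run_double(edited, 21)
print(hash_changed, same_result)
```

Both values are true here: the hash differs while the result is the same, which is the situation where committing instead of reproducing saves time.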