Because of the way DVC links data files in the cache to their counterparts in
the workspace (refer to Large Dataset Optimization), tracked files have to be
updated with caution. Editing a linked file in place can corrupt the cache
when the DVC config option cache.type is set to hardlink or symlink. See
dvc config cache for more details on setting the cache file link type.
For an example of the cache corruption problem see issue #599 in our GitHub repository.
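
The corruption risk can be reproduced without DVC at all. The following is a minimal sketch that uses a plain hard link as a stand-in for a DVC cache link; the directory layout and file names are hypothetical and do not reflect how DVC actually organizes its cache.

```shell
#!/bin/sh
# Illustration only: a plain hard link stands in for a DVC cache link.
# All paths below are hypothetical and unrelated to DVC's real cache layout.
set -eu

demo_dir=$(mktemp -d)
trap 'rm -rf "$demo_dir"' EXIT
mkdir "$demo_dir/cache" "$demo_dir/workspace"

# A "cached" file and its hard-linked workspace counterpart.
echo "original" > "$demo_dir/cache/train.tsv"
ln "$demo_dir/cache/train.tsv" "$demo_dir/workspace/train.tsv"

# Appending in place writes through the inode shared by both names.
echo "new data item" >> "$demo_dir/workspace/train.tsv"

# The "cached" copy has changed too, so it no longer matches what was stored.
cached_content=$(cat "$demo_dir/cache/train.tsv")
printf 'cache now contains:\n%s\n' "$cached_content"
```

Because both names refer to the same data on disk, the edit made in the workspace is visible through the cache path as well. This is the situation the procedures on this page are meant to avoid.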
Assume train.tsv is tracked by DVC and you want to update it. Here, updating
may mean either replacing train.tsv with a new file of the same name or
editing the content of the file.
If you run dvc repro, there is no need to manage generated (output) files
manually. DVC removes them for you before executing the stage that generates
them.
If you use DVC to track a file that is generated during your pipeline (e.g. an
intermediate result or a final model file such as model.pkl) and you don't use
dvc run and dvc repro to manage your pipeline, use the procedure below (run
dvc unprotect or dvc remove) to unlink it from the DVC cache before executing
the script that modifies it.
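
Before running a script that overwrites a tracked output, it can help to check whether the file currently shares storage with another path. The sketch below is one possible check, not part of DVC: is_linked is a hypothetical helper that reports symbolic links and files with more than one hard link. It handles single files only, cannot tell whether the other link actually lives in the DVC cache, and does not detect reflinks.

```shell
#!/bin/sh
# Hypothetical helper, not a DVC command: succeeds if the given path is a
# symbolic link or a regular file with more than one hard link.
is_linked() {
    # Symbolic link?
    [ -L "$1" ] && return 0
    # Regular file whose inode is referenced by more than one name?
    [ -n "$(find "$1" -type f -links +1)" ]
}

# Demo with stand-in files; the names are illustrative only.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
echo "data" > "$work/cached"
ln "$work/cached" "$work/model.pkl"     # hard-linked, like a linked output
echo "data" > "$work/standalone"        # ordinary file with a single link

for f in "$work/model.pkl" "$work/standalone"; do
    if is_linked "$f"; then
        echo "$(basename "$f"): linked, unlink it before modifying"
    else
        echo "$(basename "$f"): not linked"
    fi
done
```

A positive result is a hint to run dvc unprotect or dvc remove first; a negative result does not replace following the procedure below.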
If you want to replace the file, you can take the following steps.
First, un-track the file. This will remove
train.tsv from the workspace:
$ dvc remove train.tsv.dvc
Next, replace the file with new content:
$ echo new > train.tsv
And start tracking it again:
$ dvc add train.tsv
$ git add train.tsv.dvc .gitignore
$ git commit -m "new train data"
If you want to modify the content of the file instead, take the following steps.
"Unlink" the file with
dvc unprotect. This will make
train.tsv safe to edit:
$ dvc unprotect train.tsv
Edit the content of the file:
$ echo "new data item" >> train.tsv
Add the new version of the file back with DVC:
$ dvc add train.tsv
$ git add train.tsv.dvc
$ git commit -m "modify train data"
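
To see why unlinking first makes the edit safe, here is a small sketch that imitates the effect with plain shell tools. It assumes a hard link stands in for the DVC cache link and that swapping the link for an independent copy approximates what dvc unprotect achieves; the paths are hypothetical and the real command may work differently internally.

```shell
#!/bin/sh
# Illustration only: imitates unlinking a file from a cache with cp and mv.
# All paths are hypothetical and unrelated to DVC's real cache layout.
set -eu

demo_dir=$(mktemp -d)
trap 'rm -rf "$demo_dir"' EXIT
mkdir "$demo_dir/cache" "$demo_dir/workspace"

echo "original" > "$demo_dir/cache/train.tsv"
ln "$demo_dir/cache/train.tsv" "$demo_dir/workspace/train.tsv"

# Stand-in for unprotecting: replace the link with an independent copy.
cp "$demo_dir/workspace/train.tsv" "$demo_dir/workspace/train.tsv.tmp"
mv "$demo_dir/workspace/train.tsv.tmp" "$demo_dir/workspace/train.tsv"

# The edit now touches only the workspace copy.
echo "new data item" >> "$demo_dir/workspace/train.tsv"

cache_content=$(cat "$demo_dir/cache/train.tsv")
workspace_lines=$(wc -l < "$demo_dir/workspace/train.tsv" | tr -d ' ')
echo "cache: $cache_content"
echo "workspace lines: $workspace_lines"
```

The cached file keeps its original content while the workspace file gains the new line, which is why the edited file can then be added back as a new version.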