Even in basic scenarios, the commands
described in the previous sections can be used independently and provide a
useful framework to track, save, and share models and large data files. To
achieve full reproducibility, though, we have to connect code and
configuration with the data it processes to produce the result.
$ dvc run -f prepare.dvc \
          -d src/prepare.py -d data/data.xml \
          -o data/prepared \
          python src/prepare.py data/data.xml
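To make the result of this command concrete, here is a sketch of what the generated `prepare.dvc` file could look like. The exact fields and all checksum values below are illustrative, not copied from a real run:

```yaml
# Hypothetical sketch of prepare.dvc (checksums invented for illustration)
cmd: python src/prepare.py data/data.xml
deps:
- path: src/prepare.py      # the code is a dependency, too
  md5: 1a2b3c...
- path: data/data.xml       # the raw input data
  md5: 4d5e6f...
outs:
- path: data/prepared       # the output directory produced by the command
  md5: 7a8b9c...
  cache: true
```

The key idea is that the stage records the command plus checksums of both its dependencies and its outputs, so DVC can tell when any of them has changed.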
dvc run generates the
prepare.dvc DVC-file. It has the same
format as the file we created in the
previous section to track data/data.xml,
except in this case it has additional information about the
output (a directory where two files, among them
test.tsv, will be written
to), and about the Python command that is required to build it.
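The contents of src/prepare.py are not shown in this section. As a rough illustration of the kind of script such a stage wraps, here is a minimal sketch that reads a raw input file and writes split outputs into an output directory; the split logic and the output file names are assumptions, not the actual script:

```python
# Hypothetical sketch of a data-preparation script like src/prepare.py.
# Reads a line-based input file and writes train/test splits into out_dir.
import os


def prepare(input_path, out_dir, split_ratio=0.8):
    """Split input lines into train/test files; returns the split sizes."""
    os.makedirs(out_dir, exist_ok=True)
    with open(input_path) as f:
        lines = f.readlines()
    cut = int(len(lines) * split_ratio)
    with open(os.path.join(out_dir, "train.tsv"), "w") as f:
        f.writelines(lines[:cut])
    with open(os.path.join(out_dir, "test.tsv"), "w") as f:
        f.writelines(lines[cut:])
    return cut, len(lines) - cut
```

Because the stage declares data/prepared as an output (`-o`), everything such a script writes there is tracked and cached by DVC automatically.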
You don't need to run
dvc add to track output files (such as
prepared/test.tsv) with DVC;
dvc run takes care of this. You only need to run
dvc push (usually along with
git commit) to save them to the remote when
you are done.
Let's commit the changes to save the stage we built:
$ git add data/.gitignore prepare.dvc
$ git commit -m "Create data preparation stage"
$ dvc push