To add a data file to your data science environment, you need to copy the file into the repository. We'll create a data/ directory for the data files and download a 40MB data archive into this directory.
$ mkdir data
$ wget -P data https://data.dvc.org/tutorial/nlp/100K/Posts.xml.zip
$ du -sh data/*
41M data/Posts.xml.zip
At this time, data/Posts.xml.zip is a regular (untracked) file. We can track it with DVC using dvc add (see below). After executing the command, you will see a new file data/Posts.xml.zip.dvc and a change in data/.gitignore. Both of these files have to be committed to the repository.
$ dvc add data/Posts.xml.zip
$ du -sh data/*
41M data/Posts.xml.zip
4.0K data/Posts.xml.zip.dvc
$ git status -s data/
?? data/.gitignore
?? data/Posts.xml.zip.dvc
$ git add .
$ git commit -m "add raw dataset"
You have probably already noticed that the actual data file was not committed to the repository. The reason is that DVC added the file to data/.gitignore, so Git ignores this data file from now on.
DVC will always exclude data files from the Git repository by listing them in .gitignore.
If you take a look at the DVC-file created by dvc add, you will see that outputs are tracked in the outs field. In this file, only one output is specified. The output contains the data file path in the repository and its MD5 hash. This hash value determines the location of the actual content file in the cache directory, .dvc/cache.
$ cat data/Posts.xml.zip.dvc
md5: 7559eb45beb7e90f192e836be8032a64
outs:
- cache: true
  md5: ec1d2935f811b77cc49b031b999cbf17
  path: Posts.xml.zip
$ du -sh .dvc/cache/ec/*
41M .dvc/cache/ec/1d2935f811b77cc49b031b999cbf17
Outputs from DVC-files define the relationship between the data file path in a repository and the path in the cache directory.
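The mapping from a hash to a cache location can be sketched in a few lines of Python. This is only an illustration of the layout visible in the listing above (the first two hash characters name a subdirectory, the rest name the file); the helper names file_md5 and cache_path_for_md5 are hypothetical and are not part of DVC.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path_for_md5(md5_hex, cache_dir=".dvc/cache"):
    """Map a hash to <cache_dir>/<first 2 chars>/<remaining chars>.

    Assumption: this mirrors the layout shown in the tutorial output only.
    """
    return (Path(cache_dir) / md5_hex[:2] / md5_hex[2:]).as_posix()


# Hash taken from the outs field of data/Posts.xml.zip.dvc above.
print(cache_path_for_md5("ec1d2935f811b77cc49b031b999cbf17"))
```

With the hash from the DVC-file, the function returns the same path that du reports under .dvc/cache/ec/.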
Keeping the actual file contents in the cache, and copying the cached file into the workspace during git checkout, is a regular trick that Git-LFS (Git for Large File Storage) uses. This trick works fine for tracking small files with source code. For large data files, this might not be the best approach, because a checkout operation for a 10GB data file might take several seconds and a 50GB file checkout (think copy) might take a few minutes.
DVC was designed with large data files in mind. This means gigabytes or even hundreds of gigabytes in file size. Instead of copying files from cache to workspace, DVC can create reflinks or other file link types.
When reflinks are not supported by the file system, DVC defaults to copying files, which doesn't optimize file storage. However, it's easy to enable other file link types on most systems. See File link types for more information.
Creating file links is a quick file system operation. So, with DVC you can easily check out a few dozen files of any size. A file link prevents you from using twice as much space on the hard drive. Even if each of the files contains 41MB of data, the overall size of the repository is still 41MB. Both of the files correspond to the same inode (a file metadata record) in the file system. Refer to Large Dataset Optimization for more details.
Note that on systems supporting reflinks, you should use the df command to confirm that free space on the drive didn't decline by the size of the file being added, so no duplication takes place. du may be inaccurate with reflinks.
$ ls -i data/Posts.xml.zip
78483929 data/Posts.xml.zip
$ ls -i .dvc/cache/ec/
78483929 88519f8465218abb23ce0e0e8b1384
$ du -sh .
41M .
ls -i prints the index number (78483929) of each file. The inode for data/Posts.xml.zip is the same as the inode of its cache file.
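The shared-inode behavior can be reproduced outside DVC with a small Python sketch. It uses a hard link, which is one of the other link types mentioned above and not a reflink; the file names are placeholders, and the code is an assumption-laden illustration rather than what DVC runs internally.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-ins for the cache file and the workspace file.
    cache_file = os.path.join(tmp, "cache_copy")
    workspace_file = os.path.join(tmp, "workspace_copy")

    with open(cache_file, "wb") as f:
        f.write(b"x" * 1024)

    # A hard link is a second name for the same inode, not a second copy.
    os.link(cache_file, workspace_file)

    same_inode = os.stat(cache_file).st_ino == os.stat(workspace_file).st_ino
    link_count = os.stat(cache_file).st_nlink

print(same_inode, link_count)
```

Both names report the same inode number, and the link count of 2 shows that the data exists once on disk under two names.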
Once the data files are in the workspace, you can start processing the data and training ML models on the data files. DVC helps you define the stages of your ML process and easily connect them into an ML pipeline.
dvc run executes any command that you pass it as a list of parameters.
However, the command to run alone is not as interesting as its role within a
larger data pipeline, so we'll need to specify its dependencies and
outputs. We call all this a pipeline stage. Dependencies may
include input files or directories, and the actual command to run. Outputs are
files written to by the command, if any.
- -d file.tsv should be used to specify a dependency file or directory. The dependency can be a regular file from a repository or a data file.
- -o file.tsv (lower case o) specifies an output data file. DVC will track this data file by creating a corresponding DVC-file (as if running dvc add file.tsv after the command).
- -O file.tsv (upper case O) specifies a simple output file (not to be added to DVC).
It's important to specify dependencies and outputs before the command to run itself.
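To make the ordering rule concrete, here is a minimal Python sketch that assembles a dvc run command line with every option placed before the command. The helper build_dvc_run and its parameter names are hypothetical and exist only for illustration; the sketch uses only the -d, -o, -O, and -f options named in this tutorial.

```python
import shlex


def build_dvc_run(command, deps=(), outs=(), plain_outs=(), stage_file=None):
    """Assemble a `dvc run` argument list with options before the command."""
    args = ["dvc", "run"]
    for dep in deps:
        args += ["-d", dep]          # dependency file or directory
    for out in outs:
        args += ["-o", out]          # output tracked by DVC
    for out in plain_outs:
        args += ["-O", out]          # output not added to DVC
    if stage_file is not None:
        args += ["-f", stage_file]   # explicit stage file name
    # The command to run always comes last.
    return args + list(command)


args = build_dvc_run(
    ["unzip", "data/Posts.xml.zip", "-d", "data/"],
    deps=["data/Posts.xml.zip"],
    outs=["data/Posts.xml"],
)
print(shlex.join(args))
```

Keeping the command as the final segment means its own flags (such as the -d of unzip) are never mistaken for dvc run options.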
Let's see how an extraction command
unzip works under DVC, for example:
$ dvc run -d data/Posts.xml.zip -o data/Posts.xml \
          unzip data/Posts.xml.zip -d data/
Running command:
    unzip data/Posts.xml.zip -d data/
Archive: data/Posts.xml.zip
  inflating: data/Posts.xml
Saving information to 'Posts.xml.dvc'.
To track the changes with git run:
    git add data/.gitignore Posts.xml.dvc
$ du -sh data/*
145M data/Posts.xml
41M data/Posts.xml.zip
4.0K data/Posts.xml.zip.dvc
In this command, the -d option at the end belongs to unzip and specifies an output directory for the extracted files. For dvc run, -d data/Posts.xml.zip defines the input file and -o data/Posts.xml the resulting extracted data file.
The unzip command extracts the data file data/Posts.xml.zip to a regular file data/Posts.xml. It knows nothing about data files or DVC. DVC executes the command and does some additional work if the command was successful:
- DVC transforms all the outputs (-o option) into tracked data files (similar to using dvc add for each of them). As a result, all the actual data contents go to the cache directory .dvc/cache, and each of the file names will be added to .gitignore.
- For reproducibility purposes, dvc run creates the Posts.xml.dvc stage file in the project with information about this pipeline stage. (See DVC-File Format). Note that the name of this file can be specified by using the -f option.
Let's take a look at the resulting stage file created by
dvc run above:
$ cat Posts.xml.dvc
cmd: ' unzip data/Posts.xml.zip -d data/'
deps:
- md5: ec1d2935f811b77cc49b031b999cbf17
  path: data/Posts.xml.zip
md5: 16129387a89cb5a329eb6a2aa985415e
outs:
- cache: true
  md5: c1fa36d90caa8489a317eee917d8bf03
  path: data/Posts.xml
Sections of the file above include:
cmd: The command to run
deps: Dependencies with MD5 hashes
outs: Outputs with MD5 hashes
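One reason to record hashes next to each dependency is that a later comparison can reveal whether an input has changed. The following Python sketch illustrates that idea under loud assumptions: the stage is a plain dict with the same field names as the stage file, the helper names md5_of and changed_deps are hypothetical, and nothing here claims to be how DVC implements the check.

```python
import hashlib
import os
import tempfile


def md5_of(path):
    """Hex MD5 of a file (reads the whole file; fine for a small demo)."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


def changed_deps(stage, root="."):
    """Return dependency paths whose current hash differs from the recorded one."""
    changed = []
    for dep in stage["deps"]:
        full = os.path.join(root, dep["path"])
        if not os.path.exists(full) or md5_of(full) != dep["md5"]:
            changed.append(dep["path"])
    return changed


with tempfile.TemporaryDirectory() as root:
    dep_path = os.path.join(root, "input.txt")
    with open(dep_path, "wb") as f:
        f.write(b"first version")

    # Hypothetical stage record with the same field names as a stage file.
    stage = {
        "cmd": "echo placeholder",
        "deps": [{"path": "input.txt", "md5": md5_of(dep_path)}],
    }
    before = changed_deps(stage, root)

    with open(dep_path, "wb") as f:
        f.write(b"second version")
    after = changed_deps(stage, root)

print(before, after)
```

Before the edit no dependency is reported; after the file content changes, input.txt is listed because its hash no longer matches the recorded value.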
As with the dvc add command, the data/.gitignore file was modified. It now includes the extracted output file, Posts.xml.
$ git status -s
 M data/.gitignore
?? Posts.xml.dvc
$ cat data/.gitignore
Posts.xml.zip
Posts.xml
The output file Posts.xml was transformed by DVC into a data file in accordance with the -o option. You can find the corresponding cache file by its hash value, as a path under .dvc/cache/c1/:
$ ls .dvc/cache/
2f/ a8/
$ du -sh .dvc/cache/c1/* .dvc/cache/ec/*
41M .dvc/cache/ec/1d2935f811b77cc49b031b999cbf17
145M .dvc/cache/c1/fa36d90caa8489a317eee917d8bf03
$ du -sh .
186M .
Let's commit the result of the
unzip command. This will be the first stage of
our ML pipeline.
$ git add .
$ git commit -m "extract data"
A single stage of our ML pipeline was created and committed to the repository. It isn't necessary to commit stages right after their creation. You can create a few and commit them with Git together later.
Let's create the following stages: converting an XML file to TSV, and then separating training and testing datasets:
$ dvc run -d data/Posts.xml -d code/xml_to_tsv.py -d code/conf.py \
          -o data/Posts.tsv \
          python code/xml_to_tsv.py
Using 'Posts.tsv.dvc' as a stage file
Reproducing 'Posts.tsv.dvc':
    python code/xml_to_tsv.py
$ dvc run -d data/Posts.tsv -d code/split_train_test.py \
          -d code/conf.py \
          -o data/Posts-test.tsv -o data/Posts-train.tsv \
          python code/split_train_test.py 0.33 20180319
Using 'Posts-test.tsv.dvc' as a stage file
Reproducing 'Posts-test.tsv.dvc':
    python code/split_train_test.py 0.33 20180319
Positive size 2049, negative size 97951
The result of the commands above is two stage files, one for each command: Posts-test.tsv.dvc and Posts.tsv.dvc. Also, a code/conf.pyc file was created. This type of file should not be tracked by Git. Let's manually add this type of file to .gitignore.
$ git status -s
 M data/.gitignore
?? Posts-test.tsv.dvc
?? Posts.tsv.dvc
?? code/conf.pyc
$ echo "*.pyc" >> .gitignore
As mentioned before, both stage files can be committed to the repository together:
$ git add .
$ git commit -m "Process to TSV and separate test and train"
Let's run and save the following commands for our pipeline. First, define the feature extraction stage, which takes the train and test TSVs and generates the corresponding matrix files:
$ dvc run -d code/featurization.py -d code/conf.py \
          -d data/Posts-train.tsv -d data/Posts-test.tsv \
          -o data/matrix-train.p -o data/matrix-test.p \
          python code/featurization.py
Using 'matrix-train.p.dvc' as a stage file
Reproducing 'matrix-train.p.dvc':
    python code/featurization.py
The input data frame data/Posts-train.tsv size is (66999, 3)
The output matrix data/matrix-train.p size is (66999, 5002) and data type is float64
The input data frame data/Posts-test.tsv size is (33001, 3)
The output matrix data/matrix-test.p size is (33001, 5002) and data type is float64
Train a model using the train matrix file:
$ dvc run -d data/matrix-train.p -d code/train_model.py \
          -d code/conf.py -o data/model.p \
          python code/train_model.py 20180319
Using 'model.p.dvc' as a stage file
Reproducing 'model.p.dvc':
    python code/train_model.py 20180319
Input matrix size (66999, 5002)
X matrix size (66999, 5000)
Y matrix size (66999,)
And evaluate the result of the trained model using the test feature matrix:
$ dvc run -d data/model.p -d data/matrix-test.p \
          -d code/evaluate.py -d code/conf.py -M data/eval.txt \
          -f Dvcfile \
          python code/evaluate.py
Reproducing 'Dvcfile':
    python code/evaluate.py
Note that using -f Dvcfile with dvc run above isn't necessary, as the default stage file name is Dvcfile when there are no outputs (option -o).
The model evaluation stage is the last one for this tutorial. To help in the
pipeline's reproducibility, we use stage file name
Dvcfile. (This will be
discussed in more detail in the next chapter.)
Note that the output file data/eval.txt was transformed by DVC into a metric file in accordance with the -M option.
The result of the last three dvc run commands is three stage files and a modified .gitignore file. All the changes should be committed with Git:
$ git status -s
 M data/.gitignore
?? Dvcfile
?? data/eval.txt
?? matrix-train.p.dvc
?? model.p.dvc
$ git add .
$ git commit -m Evaluate
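The six stage files created so far connect through their dependencies and outputs: an output of one stage is a dependency of another. The Python sketch below derives a valid execution order from that information. It is an illustration only: the stages dict is transcribed by hand from the dvc run commands in this tutorial (code dependencies omitted for brevity), graphlib requires Python 3.9 or later, and this is not a description of DVC's internal algorithm.

```python
from graphlib import TopologicalSorter

# Data dependencies and outputs, copied from the dvc run commands above.
stages = {
    "Posts.xml.dvc": {"deps": ["data/Posts.xml.zip"],
                      "outs": ["data/Posts.xml"]},
    "Posts.tsv.dvc": {"deps": ["data/Posts.xml"],
                      "outs": ["data/Posts.tsv"]},
    "Posts-test.tsv.dvc": {"deps": ["data/Posts.tsv"],
                           "outs": ["data/Posts-test.tsv",
                                    "data/Posts-train.tsv"]},
    "matrix-train.p.dvc": {"deps": ["data/Posts-train.tsv",
                                    "data/Posts-test.tsv"],
                           "outs": ["data/matrix-train.p",
                                    "data/matrix-test.p"]},
    "model.p.dvc": {"deps": ["data/matrix-train.p"],
                    "outs": ["data/model.p"]},
    "Dvcfile": {"deps": ["data/model.p", "data/matrix-test.p"],
                "outs": ["data/eval.txt"]},
}

# Which stage produces each file?
producer = {out: name for name, s in stages.items() for out in s["outs"]}

# For each stage, the set of stages that must run before it.
graph = {
    name: {producer[d] for d in s["deps"] if d in producer}
    for name, s in stages.items()
}

order = list(TopologicalSorter(graph).static_order())
print(order)
```

Because each stage here feeds the next, the resulting order matches the order in which the stages were created in this tutorial, ending with Dvcfile.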
The output of the evaluation stage contains the target value in a simple text form:
$ cat data/eval.txt
AUC: 0.624652
You can also show the metrics using the dvc metrics command:
$ dvc metrics show
data/eval.txt:AUC: 0.624652
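Because the metric file is simple text, it is also easy to read from your own scripts. The following is a minimal Python sketch that assumes the file contains lines of the form NAME: value, as in data/eval.txt above; the function name parse_metrics is hypothetical.

```python
def parse_metrics(text):
    """Parse lines of the form 'NAME: value' into a dict of floats."""
    metrics = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        name, _, value = line.partition(":")
        metrics[name.strip()] = float(value)
    return metrics


# Content copied from data/eval.txt in this tutorial.
print(parse_metrics("AUC: 0.624652\n"))
```

In a real project you would pass the contents of data/eval.txt read from disk instead of the inline string.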
This is probably not the best AUC that you have seen. In this document, our focus is DVC, not ML modeling, and we use a relatively small dataset without any advanced ML techniques.
In the next chapter we will try to improve the metrics by changing our modeling code and using reproducibility in our pipeline.