The goal of this example is to give you some hands-on experience with a basic machine learning version control scenario: managing multiple datasets and ML models using DVC. We'll work with a tutorial that François Chollet put together to show how to build a powerful image classifier using a pretty small dataset.
Dataset to classify cats and dogs
We highly recommend reading François' tutorial itself. It's a great demonstration of how a general pre-trained model can be leveraged to build a new high-performance model, with very limited resources.
We first train a classifier model using 1000 labeled images, then we double the
number of images (2000) and retrain our model. We capture both datasets and
classifier results and show how to use
dvc checkout to switch between
The specific algorithm used to train and validate the classifier is not important, and no prior knowledge of Keras is required. We'll reuse the script from the original blog post as a black box – it takes some data and produces a model file.
We have tested our tutorials and examples with Python 3. We don't recommend using earlier versions.
If you're using Windows, please review Running DVC on Windows for important tips to improve your experience.
Okay! Let's first download the code and set up a Git repository:
$ git clone https://github.com/iterative/example-versioning.git $ cd example-versioning
This command pulls a DVC project with a single script
that will train the model.
Let's now install the requirements. But before we do that, we strongly recommend creating a virtual environment:
$ python3 -m venv .env $ source .env/bin/activate $ pip install -r requirements.txt
Now that we're done with preparations, let's add some data and then train the first model. We'll capture everything with DVC, including the input dataset and model metrics.
$ dvc get https://github.com/iterative/dataset-registry \ tutorial/ver/data.zip $ unzip -q data.zip $ rm -f data.zip
dvc getcan download any file or directory tracked in a DVC repository (and stored remotely). It's like
wget, but for DVC or Git repos. In this case we use our dataset registry repo as the data source (refer to Data Registries for more info.)
This command downloads and extracts our raw dataset, consisting of 1000 labeled images for training and 800 labeled images for validation. In total, it's a 43 MB dataset, with a directory structure like this:
data ├── train │ ├── dogs │ │ ├── dog.1.jpg │ │ ├── ... │ │ └── dog.500.jpg │ └── cats │ ├── cat.1.jpg │ ├── ... │ └── cat.500.jpg └── validation ├── dogs │ ├── dog.1001.jpg │ ├── ... │ └── dog.1400.jpg └── cats ├── cat.1001.jpg ├── ... └── cat.1400.jpg
(Who doesn't love ASCII directory art?)
Let's capture the current state of this dataset with
$ dvc add data
You can use this command instead of
git add on files or directories that are
too large to be tracked with Git: usually input datasets, models, some
intermediate results, etc. It tells Git to ignore the directory and puts it into
the cache (while keeping a
to it in the workspace, so you can continue working the same way as
before). This is achieved by creating a tiny, human-readable
.dvc file that
serves as a pointer to the cache.
Next, we train our first model with
train.py. Because of the small dataset,
this training process should be small enough to run on most computers in a
reasonable amount of time (a few minutes). This command outputs a
bunch of files, among them
metrics.csv, weights of the trained
model, and metrics history. The simplest way
to capture the current version of the model is to use
dvc add again:
$ python train.py $ dvc add model.h5
We manually added the model output here, which isn't ideal. The preferred way of capturing command outputs is with
dvc run. More on this later.
Let's commit the current state:
$ git add data.dvc model.h5.dvc metrics.csv .gitignore $ git commit -m "First model, trained with 1000 images" $ git tag -a "v1.0" -m "model v1.0, 1000 images"
Note that executing
train.pyproduced other intermediate files. This is OK, we will use them later.
$ git status ... bottleneck_features_train.npy bottleneck_features_validation.npy`
Let's imagine that our image dataset doubles in size. The next command extracts
500 new cat images and 500 new dog images into
$ dvc get https://github.com/iterative/dataset-registry \ tutorial/ver/new-labels.zip $ unzip -q new-labels.zip $ rm -f new-labels.zip
For simplicity's sake, we keep the validation subset the same. Now our dataset has 2000 images for training and 800 images for validation, with a total size of 67 MB:
data ├── train │ ├── dogs │ │ ├── dog.1.jpg │ │ ├── ... │ │ └── dog.1000.jpg │ └── cats │ ├── cat.1.jpg │ ├── ... │ └── cat.1000.jpg └── validation ├── dogs │ ├── dog.1001.jpg │ ├── ... │ └── dog.1400.jpg └── cats ├── cat.1001.jpg ├── ... └── cat.1400.jpg
We will now want to leverage these new labels and retrain the model:
$ dvc add data $ python train.py $ dvc add model.h5
Let's commit the second version:
$ git add data.dvc model.h5.dvc metrics.csv $ git commit -m "Second model, trained with 2000 images" $ git tag -a "v2.0" -m "model v2.0, 2000 images"
That's it! We've tracked a second version of the dataset, model, and metrics in
DVC and committed the
.dvc files that point to them with Git. Let's now look
at how DVC can help us go back to the previous version if we need to.
The DVC command that helps get a specific committed version of data is designed
to be similar to
git checkout. All we need to do in our case is to
dvc checkout to get the right data into the
There are two ways of doing this: a full workspace checkout or checkout of a specific data or model file. Let's consider the full checkout first. It's pretty straightforward:
$ git checkout v1.0 $ dvc checkout
These commands will restore the workspace to the first snapshot we made: code,
data files, model, all of it. DVC optimizes this operation to avoid copying data
or model files each time. So
dvc checkout is quick even if you have large
datasets, data files, or models.
On the other hand, if we want to keep the current code, but go back to the previous dataset version, we can target specific data, like this:
$ git checkout v1.0 data.dvc $ dvc checkout data.dvc
If you run
git status you'll see that
data.dvc is modified and currently
points to the
v1.0 version of the dataset, while code and model files are from
dvc add makes sense when you need to keep track of different versions of
datasets or model files that come from source projects. The
above (with cats and dogs images) is a good example.
On the other hand, there are files that are the result of running some code. In
train.py produces binary files (e.g.
bottleneck_features_train.npy), the model file
model.h5, and the
When you have a script that takes some data as an input and produces other data
outputs, a better way to capture them is to use
If you tried the commands in the Switching between workspace versions section, go back to the master branch code and data, and remove the
$ git checkout master $ dvc checkout $ dvc remove model.h5.dvc
$ dvc run -n train -d train.py -d data \ -o model.h5 -o bottleneck_features_train.npy \ -o bottleneck_features_validation.npy -M metrics.csv \ python train.py
dvc run writes a pipeline stage named
train (specified using the
dvc.yaml. It tracks all outputs (
-o) the same way as
dvc run also tracks dependencies (
-d) and the
python train.py) that was run to produce the result.
At this point you could run
git add .and
git committo save the
trainstage and its outputs to the repository.
dvc repro will run the
train stage if any of its dependencies (
changed. For example, when we added new images to built the second version of
our model, that was a dependency change. It also updates outputs and puts them
into the cache.
To make things a little simpler:
dvc add and
dvc checkout provide a basic
mechanism for model and large dataset versioning.
dvc run and
provide a build system for machine learning models, which is similar to
Make in software build automation.
In this example, our focus was on giving you hands-on experience with dataset
and ML model versioning. We specifically looked at the
dvc add and
dvc checkout commands. We'd also like to outline some topics and ideas you
might be interested to try next to learn more about DVC and how it makes
managing ML projects simpler.
First, you may have noticed that the script that trains the model is written in
a monolithic way. It uses the
save_bottleneck_feature function to
pre-calculate the bottom, "frozen" part of the net every time it is run.
Features are written into files. The intention was probably that the
save_bottleneck_feature can be commented out after the first run, but it's not
very convenient having to remember to do so every time the dataset changes.
Here's where the pipelines feature of DVC comes in
handy. We touched on it briefly when we described
dvc run and
dvc repro. The
next step would be splitting the script into two parts and utilizing pipelines.
See Data Pipelines to get hands-on experience with
pipelines, and try to apply it here. Don't hesitate to join our
community and ask any questions!
Another detail we only brushed upon here is the way we captured the
metrics.csv metrics file with the
-M option of
dvc run. Marking this
output as a metric enables us to compare its values across Git tags
or branches (for example, representing different experiments). See
Comparing Changes, and
Comparing Many Experiments
to learn more about managing metrics with DVC.