The data science process is inherently iterative and R&D-like. A data scientist may try many different approaches and hyperparameter values, and "fail" many times, before reaching the required level of a metric.
DVC is built to provide a way to capture different experiments and navigate easily between them. Let's say we want to try a modified feature extraction:
$ vi src/featurization.py    # edit to use bigrams (see above)
$ dvc repro train.dvc        # regenerate the new model.pkl
$ git commit -am "Reproduce model using bigrams"
Now we have a new model.pkl captured and saved. To get back to the initial version, we run git checkout followed by the dvc checkout command:
$ git checkout baseline-experiment
$ dvc checkout
DVC is designed to check out large data files, no matter how large they are, into your workspace almost instantly, using file links, on almost all modern operating systems. See Large Dataset Optimization for more information.
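To give a feel for why a link-based checkout can be nearly instant, here is a minimal sketch that uses only standard Unix tools, not DVC itself. It assumes a hard link as the link type (one of several kinds of file link; which one DVC actually uses depends on the file system and configuration), and the cache/ and workspace/ directory names are hypothetical stand-ins, not DVC's real layout.

```shell
#!/bin/sh
# Illustration only: plain coreutils, not DVC's actual implementation.
set -eu

# Scratch directory with a stand-in "cache" and "workspace" (hypothetical names).
workdir=$(mktemp -d)
mkdir "$workdir/cache" "$workdir/workspace"

# Stand-in for a cached data file: 1 MiB of zeros.
dd if=/dev/zero of="$workdir/cache/model.pkl" bs=1024 count=1024 2>/dev/null

# "Check out" the file by hard-linking it rather than copying its contents.
ln "$workdir/cache/model.pkl" "$workdir/workspace/model.pkl"

# Both paths refer to the same inode, so no data was duplicated on disk.
cache_inode=$(ls -i "$workdir/cache/model.pkl" | awk '{print $1}')
work_inode=$(ls -i "$workdir/workspace/model.pkl" | awk '{print $1}')
echo "cache inode:     $cache_inode"
echo "workspace inode: $work_inode"
```

Because creating the link only adds a directory entry, the cost does not grow with the size of the file. Remove the scratch directory with rm -rf "$workdir" when you are done.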