Versioning large data files and directories for data science is great, but not enough. How is data filtered, transformed, or used to train ML models? DVC introduces a mechanism to capture data pipelines — series of data processes that produce a final result.
DVC pipelines and their data can also be easily versioned (using Git). This allows you to better organize projects, and reproduce your workflow and results later — exactly as they were built originally! For example, you could capture a simple ETL workflow, organize a data science project, or build a detailed machine learning pipeline.
Follow along with the code example below!
Use dvc stage add to create stages. These represent processes (source code tracked with Git) which form the steps of a pipeline. Stages also connect code to its corresponding data input and output. Let's transform a Python script into a stage:
Get the sample code like this:
$ wget https://code.dvc.org/get-started/code.zip
$ unzip code.zip
$ rm -f code.zip
$ tree
.
├── params.yaml
└── src
    ├── evaluate.py
    ├── featurization.py
    ├── prepare.py
    ├── requirements.txt
    └── train.py
Now let's install the requirements:
We strongly recommend creating a virtual environment first.
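For example, using Python's built-in venv module (any virtual environment tool will do):
$ python3 -m venv .venv
$ source .venv/bin/activate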
$ pip install -r src/requirements.txt
Please also add or commit the source code directory with Git at this point.
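For example (the commit message here is just illustrative):
$ git add src params.yaml
$ git commit -m "Add source code and params"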
$ dvc stage add -n prepare \
-p prepare.seed,prepare.split \
-d src/prepare.py -d data/data.xml \
-o data/prepared \
python src/prepare.py data/data.xml
A dvc.yaml file is generated. It includes information about the command we want to run (python src/prepare.py data/data.xml), its dependencies, and outputs.
DVC uses these metafiles to track the data used and produced by the stage, so there's no need to use dvc add on data/prepared manually.
The command options used above mean the following:
-n prepare specifies a name for the stage. If you open the dvc.yaml file, you will see a section named prepare.
-p prepare.seed,prepare.split defines special types of dependencies — parameters. We'll get to them later in the Metrics, Parameters, and Plots page, but the idea is that the stage can depend on field values from a parameters file (params.yaml by default):
prepare:
  split: 0.20
  seed: 20170428
-d src/prepare.py and -d data/data.xml mean that the stage depends on these files to work. Notice that the source code itself is marked as a dependency. If any of these files change later, DVC will know that this stage needs to be reproduced.
-o data/prepared specifies an output directory for this script, which writes two files into it. This is what the workspace should look like after the run:
.
├── data
│   ├── data.xml
│   ├── data.xml.dvc
+│   └── prepared
+│       ├── test.tsv
+│       └── train.tsv
+├── dvc.yaml
+├── dvc.lock
├── params.yaml
└── src
    ├── ...
The last line, python src/prepare.py data/data.xml, is the command to run in this stage, and it's saved to dvc.yaml, as shown below.
The resulting prepare stage contains all of the information above:
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
    - src/prepare.py
    - data/data.xml
    params:
    - prepare.seed
    - prepare.split
    outs:
    - data/prepared
Once you've added a stage, you can run the pipeline with dvc repro. Next, you can use dvc push if you wish to save all the data to remote storage (usually along with git commit to version DVC metafiles).
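A minimal sketch of that sequence (assuming a DVC remote has already been configured, for example with dvc remote add):
$ dvc repro    # execute the pipeline, producing data/prepared and dvc.lock
$ dvc push     # upload the cached data to remote storage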
By using dvc stage add multiple times, and specifying outputs of one stage as dependencies of another, we can describe a sequence of commands that gets us to a desired result. This is what we call a data pipeline or dependency graph.
Let's create a second stage chained to the outputs of prepare, to perform feature extraction:
$ dvc stage add -n featurize \
-p featurize.max_features,featurize.ngrams \
-d src/featurization.py -d data/prepared \
-o data/features \
python src/featurization.py data/prepared data/features
The dvc.yaml file is updated automatically and should include two stages now. The changes to dvc.yaml should look like this:
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
    - data/data.xml
    - src/prepare.py
    params:
    - prepare.seed
    - prepare.split
    outs:
    - data/prepared
+  featurize:
+    cmd: python src/featurization.py data/prepared data/features
+    deps:
+    - data/prepared
+    - src/featurization.py
+    params:
+    - featurize.max_features
+    - featurize.ngrams
+    outs:
+    - data/features
Let's add the training itself. Nothing new this time; just the same dvc stage add command with the same set of options:
$ dvc stage add -n train \
-p train.seed,train.n_est,train.min_split \
-d src/train.py -d data/features \
-o model.pkl \
python src/train.py data/features model.pkl
Please check dvc.yaml again; it should have one more stage now.
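Based on the options we just passed, the new train entry at the end of dvc.yaml should look roughly like this (the exact ordering of fields may differ slightly):
  train:
    cmd: python src/train.py data/features model.pkl
    deps:
    - data/features
    - src/train.py
    params:
    - train.min_split
    - train.n_est
    - train.seed
    outs:
    - model.pkl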
This is a good time to commit the changes with Git. These include .gitignore, dvc.lock, and dvc.yaml — which describe our pipeline.
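A possible sequence (the exact .gitignore paths depend on where your outputs live):
$ git add .gitignore data/.gitignore dvc.yaml dvc.lock
$ git commit -m "Create ML pipeline stages"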
The whole point of creating this dvc.yaml file is the ability to easily reproduce a pipeline:
$ dvc repro
Let's play with it a little. First, change one of the parameters for the training stage: open params.yaml and change n_est to 100.
Then run dvc repro again. You should see:
$ dvc repro
Stage 'prepare' didn't change, skipping
Stage 'featurize' didn't change, skipping
Running stage 'train' with command: ...
DVC detected that only train should be run, and skipped everything else! All the intermediate results are being reused.
Now, let's change it back to 50 and run dvc repro again:
$ dvc repro
Stage 'prepare' didn't change, skipping
Stage 'featurize' didn't change, skipping
As before, there was no need to rerun prepare, featurize, etc. But this time it also doesn't rerun train! The previous run with the same set of inputs (parameters & data) was saved in DVC's run-cache, and reused here.
dvc repro relies on the DAG definition from dvc.yaml, and uses dvc.lock to determine what exactly needs to be run.
The dvc.lock file is similar to a .dvc file — it captures hashes (in most cases md5s) of the dependencies and values of the parameters that were used. It can be considered a state of the pipeline:
schema: '2.0'
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
    - path: data/data.xml
      md5: a304afb96060aad90176268345e10355
    - path: src/prepare.py
      md5: 285af85d794bb57e5d09ace7209f3519
    params:
      params.yaml:
        prepare.seed: 20170428
        prepare.split: 0.2
    outs:
    - path: data/prepared
      md5: 20b786b6e6f80e2b3fcf17827ad18597.dir
The dvc status command can be used to compare this state with the actual state of the workspace.
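For example, if nothing has changed since the last run, the output would look roughly like this (exact wording may differ between DVC versions):
$ dvc status
Data and pipelines are up to date.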
DVC pipelines (the dvc.yaml file, and the dvc stage add and dvc repro commands) solve a few important problems, reproducibility chief among them: the dvc.yaml and dvc.lock files describe what data to use and which commands will generate the pipeline results (such as an ML model), and storing these files in Git makes it easy to version and share them.
Having built our pipeline, we need a good way to understand its structure. Seeing a graph of connected stages would help. DVC lets you do so without leaving the terminal!
$ dvc dag
     +---------+
     | prepare |
     +---------+
          *
          *
          *
    +-----------+
    | featurize |
    +-----------+
          *
          *
          *
      +-------+
      | train |
      +-------+
Refer to dvc dag to explore other ways this command can visualize a pipeline.
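For instance, recent DVC versions can export the graph in DOT format, which Graphviz can render to an image (flag availability depends on your DVC version; see dvc dag --help):
$ dvc dag --dot > pipeline.dot
$ dot -Tpng pipeline.dot -o pipeline.png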