Versioning large data files and directories for data science is great, but not enough. How is data filtered, transformed, or used to train ML models? DVC introduces a mechanism to capture data pipelines — series of data processes that produce a final result.
DVC pipelines and their data can also be easily versioned (using Git). This allows you to better organize projects, and reproduce your workflow and results later — exactly as they were built originally! For example, you could capture a simple ETL workflow, organize a data science project, or build a detailed machine learning pipeline.
Follow along with the code example below!
Use dvc stage add to create stages. These represent processes (source code
tracked with Git) which form the steps of a pipeline. Stages also connect code
to its corresponding data input and output. Let's transform a Python script
into a stage:
$ dvc stage add -n prepare \
                -p prepare.seed,prepare.split \
                -d src/prepare.py -d data/data.xml \
                -o data/prepared \
                python src/prepare.py data/data.xml
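The script itself isn't listed in this guide. As a hypothetical sketch (not the actual example code), a prepare step might deterministically shuffle and split the raw data using the two values passed with -p (prepare.seed and prepare.split):

```python
import random


def split_dataset(rows, seed, split):
    """Shuffle rows deterministically with the given seed and split them
    into (train, test) parts, where `split` is the train fraction."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * split)
    return shuffled[:cut], shuffled[cut:]


# In a real stage, `rows` would be parsed from data/data.xml and the
# parameters read from params.yaml; here we use toy values.
train, test = split_dataset(list(range(10)), seed=20170428, split=0.8)
```

Because the stage's command, code, and parameters are all declared to DVC, changing any of them (e.g. the split fraction) is enough for DVC to know the stage must be re-run.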
A dvc.yaml file is generated. It includes information about the command we
want to run (python src/prepare.py data/data.xml), its
dependencies, and outputs.
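For reference, the generated dvc.yaml should look roughly like the following (a sketch reconstructed from the flags used above):

```yaml
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
      - data/data.xml
      - src/prepare.py
    params:
      - prepare.seed
      - prepare.split
    outs:
      - data/prepared
```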
Once you've added a stage, you can run the pipeline with
dvc repro. Next, you can use
dvc push if you wish to save all the data to remote storage (usually
along with git commit to version the DVC metafiles).
By using dvc stage add multiple times, and specifying the outputs of
one stage as dependencies of another, we can describe a sequence
of commands which gets to a desired result. This is what we call a data
pipeline.

Let's create a second stage chained to the outputs of
prepare, to perform feature extraction:
$ dvc stage add -n featurize \
                -p featurize.max_features,featurize.ngrams \
                -d src/featurization.py -d data/prepared \
                -o data/features \
                python src/featurization.py data/prepared data/features
The dvc.yaml file is updated automatically and should include two stages now.
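The new entry added for this stage should look roughly like this (again, a sketch based on the flags above):

```yaml
  featurize:
    cmd: python src/featurization.py data/prepared data/features
    deps:
      - data/prepared
      - src/featurization.py
    params:
      - featurize.max_features
      - featurize.ngrams
    outs:
      - data/features
```

Note how data/prepared, the output of prepare, is declared as a dependency here; that is what chains the two stages together.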
The whole point of creating this
dvc.yaml file is the ability to easily
reproduce a pipeline:
$ dvc repro
The dvc.yaml and dvc.lock files describe what data to use and which commands will generate the pipeline results (such as an ML model). Storing these files in Git makes it easy to version and share them.
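After dvc repro, the dvc.lock file records exactly which versions of the code, data, and parameters produced the current results. An illustrative fragment (with hashes, sizes, and parameter values replaced by placeholders) might look like:

```yaml
schema: '2.0'
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
      - path: data/data.xml
        md5: <hash>
        size: <bytes>
      - path: src/prepare.py
        md5: <hash>
        size: <bytes>
    params:
      params.yaml:
        prepare.seed: <value>
        prepare.split: <value>
    outs:
      - path: data/prepared
        md5: <hash>
        size: <bytes>
```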
Having built our pipeline, we need a good way to understand its structure. Seeing a graph of connected stages would help. DVC lets you do so without leaving the terminal!
$ dvc dag
    +---------+
    | prepare |
    +---------+
         *
         *
         *
  +-----------+
  | featurize |
  +-----------+
         *
         *
         *
    +-------+
    | train |
    +-------+
See dvc dag to explore other ways this command can visualize a pipeline.