Versioning large data files and directories for data science is great, but not enough. How is data filtered, transformed, or used to train ML models? DVC introduces a mechanism to capture data pipelines — series of data processes that produce a final result.
DVC pipelines and their data can also be easily versioned (using Git). This allows you to better organize projects, and reproduce your workflow and results later — exactly as they were built originally! For example, you could capture a simple ETL workflow, organize a data science project, or build a detailed machine learning pipeline.
Follow along with the code examples below!
Use `dvc run` to create stages. These represent processes (source code tracked with Git) which form the steps of a pipeline. Stages also connect code to its corresponding data input and output. Let's transform a Python script into a stage:
```
$ dvc run -n prepare \
          -p prepare.seed,prepare.split \
          -d src/prepare.py -d data/data.xml \
          -o data/prepared \
          python src/prepare.py data/data.xml
```
A `dvc.yaml` file is generated. It includes information about the command we ran (`python src/prepare.py data/data.xml`), its dependencies, and outputs.
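You can inspect the result right away. A sketch of what the generated file may contain (field order and exact layout can vary between DVC versions):

```
$ cat dvc.yaml
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
    - data/data.xml
    - src/prepare.py
    params:
    - prepare.seed
    - prepare.split
    outs:
    - data/prepared
```

The `params` entries refer to keys defined in `params.yaml` by default.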
There's no need to use `dvc add` for DVC to track stage outputs (`data/prepared` in this case); `dvc run` already took care of this. You only need to run `dvc push` if you want to save them to remote storage (usually along with `git commit` to version `dvc.yaml` itself).
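For example, a typical follow-up might look like this (assuming a DVC remote is already configured; `dvc run` also writes a `.gitignore` entry for the tracked output, typically `data/.gitignore` here):

```
$ git add dvc.yaml dvc.lock data/.gitignore
$ git commit -m "Create prepare stage"
$ dvc push
```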
By using `dvc run` multiple times, and specifying outputs of a stage as dependencies of another one, we can describe a sequence of commands that gets to a desired result. This is what we call a data pipeline or dependency graph.
Let's create a second stage, chained to the outputs of `prepare`, to perform feature extraction:
```
$ dvc run -n featurize \
          -p featurize.max_features,featurize.ngrams \
          -d src/featurization.py -d data/prepared \
          -o data/features \
          python src/featurization.py data/prepared data/features
```
The `dvc.yaml` file is updated automatically and should include two stages now.
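The new entry appended under `stages:` should look roughly like this (a sketch; details can vary):

```
stages:
  # prepare: ...unchanged from before
  featurize:
    cmd: python src/featurization.py data/prepared data/features
    deps:
    - data/prepared
    - src/featurization.py
    params:
    - featurize.max_features
    - featurize.ngrams
    outs:
    - data/features
```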
The whole point of creating this `dvc.yaml` file is the ability to easily reproduce a pipeline:
```
$ dvc repro
```
The `dvc.yaml` and `dvc.lock` files describe what data to use and which commands will generate the pipeline results (such as an ML model). Storing these files in Git makes it easy to version and share them.
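A common loop is to tweak something and let DVC figure out what needs to re-run. For instance (the edit below is hypothetical; change whichever keys your `params.yaml` defines):

```
$ vim params.yaml   # e.g. adjust featurize.max_features
$ dvc repro         # re-runs only stages whose dependencies or params changed
```

Unchanged stages are skipped, so only the affected part of the pipeline is executed.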
Having built our pipeline, we need a good way to understand its structure. Seeing a graph of connected stages would help. DVC lets you do so without leaving the terminal!
```
$ dvc dag
+---------+
| prepare |
+---------+
      *
      *
      *
+-----------+
| featurize |
+-----------+
      *
      *
      *
  +-------+
  | train |
  +-------+
```
The `train` stage shown above is created the same way, chained to the outputs of `featurize`. Refer to `dvc dag` to explore other ways this command can visualize a pipeline.
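The command also supports alternative renderings; for example (flag availability depends on your DVC version, see `dvc dag --help`):

```
$ dvc dag --outs             # show the graph in terms of data files instead of stages
$ dvc dag --dot > graph.dot  # export in DOT format for Graphviz and similar tools
```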