Versioning large data files and directories for data science is great, but not enough. How is data filtered, transformed, or used to train ML models? DVC introduces a mechanism to capture data pipelines — series of data processes that produce a final result.
DVC pipelines and their data can also be easily versioned (using Git). This allows you to better organize projects, and reproduce your workflow and results later — exactly as they were built originally! For example, you could capture a simple ETL workflow, organize a data science project, or build a detailed machine learning pipeline.
Use dvc stage add to create stages. These represent processes (source code tracked with Git) which form the steps of a pipeline. Stages also connect code to its corresponding data input and output. Let's transform a Python script into a stage:
$ dvc stage add -n prepare \
                -p prepare.seed,prepare.split \
                -d src/prepare.py -d data/data.xml \
                -o data/prepared \
                python src/prepare.py data/data.xml
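The -p option lists parameters the stage depends on, which DVC reads from params.yaml by default. A minimal sketch of what that file might contain (the values below are illustrative, not taken from an actual project):

# params.yaml (illustrative values)
prepare:
  seed: 42
  split: 0.2

Changing one of these values later will make DVC consider the prepare stage out of date and rerun it the next time the pipeline is reproduced.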
A dvc.yaml file is generated. It includes information about the command we want to run (python src/prepare.py data/data.xml), its dependencies, and outputs.
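Based on the command above, the generated stage entry should look roughly like the following (a sketch of the file's structure, not output copied from a real run):

# dvc.yaml (sketch of the generated stage)
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
      - data/data.xml
      - src/prepare.py
    params:
      - prepare.seed
      - prepare.split
    outs:
      - data/prepared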
Once you have added a stage, you can run the pipeline with dvc repro. Next, you can use dvc push if you wish to save all the data to remote storage (usually along with git commit to version the DVC metafiles).
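Put together, a typical iteration might look like this (assuming a DVC remote has already been configured; the commit message is just an example):

$ dvc repro                    # run the pipeline, writing results and dvc.lock
$ git add dvc.yaml dvc.lock    # version the DVC metafiles with Git
$ git commit -m "Add prepare stage"
$ dvc push                     # upload cached outputs to remote storage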
By using dvc stage add multiple times, and defining outputs of one stage as dependencies of another, we can describe a sequence of commands which gets to some desired result. This is what we call a dependency graph, and it's what forms a cohesive pipeline.
Let's create a second stage, chained to the outputs of prepare, to perform feature extraction:
$ dvc stage add -n featurize \
                -p featurize.max_features,featurize.ngrams \
                -d src/featurization.py -d data/prepared \
                -o data/features \
                python src/featurization.py data/prepared data/features
The dvc.yaml file is updated automatically and should now include two stages.
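The new featurize entry is appended under stages alongside prepare; based on the command above it should look roughly like this (again a sketch, not copied from a real run):

  featurize:
    cmd: python src/featurization.py data/prepared data/features
    deps:
      - data/prepared
      - src/featurization.py
    params:
      - featurize.max_features
      - featurize.ngrams
    outs:
      - data/features

Because data/prepared is an output of prepare and a dependency of featurize, DVC knows the two stages are connected.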
The whole point of creating this
dvc.yaml file is the ability to easily
reproduce a pipeline:
$ dvc repro
Reproducing pipelines this way solves a few important problems:
- Automation: run a sequence of steps in a "smart" way that makes iterating on your project faster. DVC automatically determines which parts of a project need to be run, and it caches "runs" and their results to avoid unnecessary reruns.
- Reproducibility: dvc.lock files describe what data to use and which commands will generate the pipeline results (such as an ML model). Storing these files in Git makes them easy to version and share (see the sketch of a dvc.lock file after this list).
- Continuous Delivery and Continuous Integration (CI/CD) for ML: describing projects in a way that can be reproduced (built) is the first necessary step before introducing CI/CD systems. See our sister project CML for some examples.
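For reference, dvc.lock records, for every stage, the exact command plus hashes of each dependency and output from the last run. A sketch of what the prepare entry might look like (the hash values are placeholders, not real checksums):

# dvc.lock (sketch; hashes are placeholders)
schema: '2.0'
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
      - path: data/data.xml
        md5: <hash of data/data.xml>
      - path: src/prepare.py
        md5: <hash of src/prepare.py>
    params:
      params.yaml:
        prepare.seed: 42
        prepare.split: 0.2
    outs:
      - path: data/prepared
        md5: <hash of the data/prepared directory>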
Having built our pipeline, we need a good way to understand its structure. Seeing a graph of connected stages would help. DVC lets you do so without leaving the terminal!
$ dvc dag
+---------+
| prepare |
+---------+
      *
      *
      *
+-----------+
| featurize |
+-----------+
      *
      *
      *
  +-------+
  | train |
  +-------+
See dvc dag to explore other ways this command can visualize a pipeline.