Versioning large data files and directories for data science is powerful, but often not enough. Data needs to be filtered, cleaned, and transformed before training ML models. For that purpose, DVC introduces a build system to define, execute, and track data pipelines: series of data processing stages that produce a final result.
💫 DVC is a "Makefile" system for machine learning projects!
DVC pipelines are versioned using Git, and allow you to better organize projects and reproduce complete workflows and results at will. You could capture a simple ETL workflow, organize your project, or build a complex DAG (Directed Acyclic Graph) pipeline.
Later, we will see that DVC lets you manage machine learning experiments on top of these pipelines: controlling their execution, injecting parameters, etc.
Working inside an initialized DVC project, let's get some sample code for the next steps:
$ wget https://code.dvc.org/get-started/code.zip
$ unzip code.zip && rm -f code.zip
The DVC-tracked data needed to run this example can be downloaded with:
$ dvc get https://github.com/iterative/dataset-registry \
          get-started/data.xml -o data/data.xml
Now, let's go through some usual project setup steps (virtualenv, requirements, Git).
First, create and use a virtual environment (it's not a must, but we strongly recommend it):
$ virtualenv venv && echo "venv" > .gitignore
$ source venv/bin/activate
Next, install the Python requirements:
$ pip install -r src/requirements.txt
Finally, this is a good time to commit our code to Git:
$ git add .github/ data/ params.yaml src .gitignore
$ git commit -m "Initial commit"
Use `dvc stage add` to create stages. These represent processing steps (usually scripts/code tracked with Git) and combine to form the pipeline. Stages allow connecting code to its corresponding data input and output.
Let's transform a Python script into a stage:
$ dvc stage add -n prepare \
                -p prepare.seed,prepare.split \
                -d src/prepare.py -d data/data.xml \
                -o data/prepared \
                python src/prepare.py data/data.xml
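The `-p` (`--params`) flags reference parameter names that DVC reads from a `params.yaml` file by default. The sample code includes such a file; as a rough sketch of its `prepare` section (the values below are illustrative, the real ones ship with the downloaded code):

```yaml
prepare:
  split: 0.20
  seed: 20170428
```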
A `dvc.yaml` file is generated. It includes information about the command we want to run (`python src/prepare.py data/data.xml`), its dependencies, and outputs.
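Assuming the `dvc stage add` invocation above, the generated stage entry should look roughly like this (field order may differ):

```yaml
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
      - data/data.xml
      - src/prepare.py
    params:
      - prepare.seed
      - prepare.split
    outs:
      - data/prepared
```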
DVC uses the pipeline definition to automatically track the data used and produced by any stage, so there's no need to manually run `dvc add` on outputs such as `data/prepared`.
Once you've added a stage, you can run the pipeline with `dvc repro`.
By using `dvc stage add` multiple times, and defining outputs of one stage as dependencies of another, we can describe a sequence of dependent commands that gets to some desired result. This is what we call a dependency graph, which forms a full, cohesive pipeline.
Let's create a 2nd stage chained to the outputs of `prepare`, to perform feature extraction:
$ dvc stage add -n featurize \
                -p featurize.max_features,featurize.ngrams \
                -d src/featurization.py -d data/prepared \
                -o data/features \
                python src/featurization.py data/prepared data/features
The `dvc.yaml` file will now be updated to include the two stages.
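The chaining happens because `data/prepared`, the output of `prepare`, is declared as a dependency of `featurize`. The new entry (nested under the same `stages:` key) should look roughly like:

```yaml
  featurize:
    cmd: python src/featurization.py data/prepared data/features
    deps:
      - data/prepared
      - src/featurization.py
    params:
      - featurize.max_features
      - featurize.ngrams
    outs:
      - data/features
```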
And finally, let's add a 3rd stage for training:
$ dvc stage add -n train \
                -p train.seed,train.n_est,train.min_split \
                -d src/train.py -d data/features \
                -o model.pkl \
                python src/train.py data/features model.pkl
The `dvc.yaml` file should now list all 3 stages.
This would be a good time to commit the changes with Git. These include the `.gitignore` files and `dvc.yaml`, which describes our pipeline.
$ git add .gitignore data/.gitignore dvc.yaml
$ git commit -m "pipeline defined"
Great! Now we're ready to run the pipeline.
The pipeline definition in `dvc.yaml` allows us to easily reproduce the pipeline:
$ dvc repro
You'll notice a `dvc.lock` file (a "state file") was created to capture the reproduction's results.
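For each stage, `dvc.lock` records the exact command run, the resolved parameter values, and content hashes of dependencies and outputs. A sketch of its `prepare` entry (the hashes and values below are placeholders; yours will differ):

```yaml
schema: '2.0'
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
      - path: data/data.xml
        md5: 22a1a2931c8370d3aeedd7183606fd7f  # placeholder hash
    params:
      params.yaml:
        prepare.seed: 20170428
        prepare.split: 0.2
    outs:
      - path: data/prepared
        md5: 153aad06d376b6595932470e459ef42a.dir  # placeholder hash
```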
It's good practice to immediately commit `dvc.lock` to Git after its creation or modification, to record the current state & results:
$ git add dvc.lock && git commit -m "first pipeline repro"
Having built our pipeline, we need a good way to understand its structure. Visualizing it as a graph of connected stages helps with that. DVC lets you do so without leaving the terminal!
$ dvc dag
+---------+
| prepare |
+---------+
      *
      *
      *
+-----------+
| featurize |
+-----------+
      *
      *
      *
  +-------+
  | train |
  +-------+
Refer to the `dvc dag` documentation to explore other ways this command can visualize a pipeline.
Summing up, DVC pipelines give you:
- Automation: run a sequence of steps in a "smart" way which makes iterating on your project faster. DVC automatically determines which parts of a project need to be run, and it caches "runs" and their results to avoid unnecessary reruns.
- Reproducibility: `dvc.yaml` and `dvc.lock` files describe what data to use and which commands will generate the pipeline results (such as an ML model). Storing these files in Git makes it easy to version and share them.
- Continuous Delivery and Continuous Integration (CI/CD) for ML: describing projects in a way that can be built and reproduced is the first necessary step before introducing CI/CD systems. See our sister project CML for some examples.