Open-source Version Control System for Data Science Projects

How it works

$ dvc add images.zip
$ dvc run -d images.zip -o model.p ./cnn.py
$ dvc remote add myrepo s3://mybucket
$ dvc push
DVC streamlines machine learning projects
DVC is an open-source framework and distributed version control system for machine learning projects. DVC is designed to handle large files, models, and metrics as well as code.

ML project version control

Keep pointers in Git to large data input files, ML models, and intermediate data files along with the code. Use S3, Azure, GCP, or any network-accessible storage to store file contents.
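As a rough illustration of the pointer idea, the sketch below copies a file into a content-addressed cache and produces a small JSON record that could be committed to Git in its place. This is not DVC's actual metafile format or implementation: the function name `make_pointer`, the JSON layout, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def make_pointer(data_path: Path, cache_dir: Path) -> dict:
    """Store a file under its content hash and return a small pointer record.

    Hypothetical helper for illustration only; not part of DVC.
    """
    content = data_path.read_bytes()
    digest = hashlib.sha256(content).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / digest
    if not cached.exists():
        # Identical content is stored only once, whatever the file is called.
        cached.write_bytes(content)
    return {"path": data_path.name, "sha256": digest}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    data = root / "images.zip"
    data.write_bytes(b"pretend this is a large archive")

    pointer = make_pointer(data, root / "cache")
    cached_ok = (root / "cache" / pointer["sha256"]).exists()

    # Only this short text would go into Git; the bytes live in the cache.
    pointer_text = json.dumps(pointer, sort_keys=True)
    print(pointer_text)
```

The pointer stays a few dozen bytes regardless of how large the data file is, which is why Git operations on it remain cheap.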

Full code and data provenance helps track the complete evolution of every ML experiment. This guarantees reproducibility and makes it easy to switch back and forth between experiments.

ML experiment management

Harness the full power of Git branches to try different ideas instead of sloppy file suffixes and comments in code. Use automatic metric tracking to navigate between experiments instead of paper and pencil.

DVC was designed to keep branching as simple and fast as in Git, no matter the data file size. With metrics and ML pipelines treated as first-class citizens, a project gets a cleaner structure. It's easy to compare ideas and pick the best one. Iterations become faster with intermediate artifact caching.
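One way to picture intermediate artifact caching is a stage whose result is reused whenever its inputs have not changed. The sketch below is a simplified model under that assumption, not DVC's cache: `fingerprint`, `StageCache`, and the toy `featurize` step are hypothetical names introduced for illustration.

```python
import hashlib


def fingerprint(inputs: dict) -> str:
    """Hash input names and contents into one key (illustrative scheme)."""
    h = hashlib.sha256()
    for name in sorted(inputs):
        h.update(name.encode())
        h.update(b"\0")
        h.update(inputs[name])
        h.update(b"\0")
    return h.hexdigest()


class StageCache:
    """Re-run a stage only when the fingerprint of its inputs changes."""

    def __init__(self):
        self._results = {}
        self.runs = 0  # how many times the stage body actually executed

    def run(self, inputs, compute):
        key = fingerprint(inputs)
        if key not in self._results:
            self.runs += 1
            self._results[key] = compute(inputs)
        return self._results[key]


def featurize(inputs):
    # Stand-in for an expensive step: here it just measures the input size.
    return len(inputs["images.zip"])


cache = StageCache()
first = cache.run({"images.zip": b"abc"}, featurize)
second = cache.run({"images.zip": b"abc"}, featurize)   # same inputs: reused
third = cache.run({"images.zip": b"abcd"}, featurize)   # changed input: recomputed
print(first, second, third, cache.runs)
```

Switching back to an earlier experiment then costs a lookup rather than a recomputation, as long as its inputs were seen before.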

Deployment & Collaboration

Instead of ad-hoc scripts, use push/pull commands to move consistent bundles of ML models, data, and code into production, remote machines, or a colleague's computer.

DVC introduces lightweight pipelines as a first-class mechanism in Git. They are language-agnostic and connect multiple steps into a DAG (directed acyclic graph). These pipelines remove friction from getting code into production.
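To make the DAG idea concrete, the sketch below models a pipeline as a mapping from each stage to the stages it depends on, then derives a valid execution order with Python's standard `graphlib` module (Python 3.9+). The stage names are hypothetical and this is only a model of the concept, not how DVC stores or schedules its pipelines.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each stage maps to the set of stages it depends on.
pipeline = {
    "unpack": set(),
    "featurize": {"unpack"},
    "train": {"featurize"},
    "evaluate": {"train", "featurize"},
}

# A topological order runs every stage after all of its dependencies.
order = list(TopologicalSorter(pipeline).static_order())
print(order)
```

Because the graph records only which outputs feed which steps, each stage can be written in any language; the ordering logic never needs to look inside a stage.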

For data scientists, by data scientists

Use cases

Save and reproduce your experiments

At any time, fetch the full context about any experiment you or your team has run. DVC guarantees that all files and metrics will be consistent and in the right place to reproduce the experiment or use it as a baseline for a new iteration.

Version control data files

DVC keeps metafiles in Git instead of Google Docs to describe and version control your data sets and models. DVC supports a variety of external storage types as a remote cache for large files.

Establish workflow for deployment & collaboration

DVC defines rules and processes for working effectively and consistently as a team. It serves as a protocol for collaboration, sharing results, and getting and running a finished model in a production environment.

DVC.org
Copyright © 2018 Iterative, Inc