What is DVC?
Data Version Control (DVC) is a free, open-source tool for data management, ML pipeline automation, and experiment management. It helps data science and machine learning teams manage large datasets, make projects reproducible, and collaborate better.
DVC takes advantage of the existing software engineering toolset your team already knows (Git, your IDE, CI/CD, cloud storage, etc.). Its design follows this set of principles:
- Codification: Define any aspect of your ML project (data and model versions, ML pipelines and experiments) in human-readable metafiles. This lets you apply established engineering best practices and toolsets, narrowing the gap between software engineering and data science.
- Versioning: Use Git (or any SCM) to version and share your entire ML project including its source code and configuration, parameters and metrics, as well as data assets and processes by committing DVC metafiles (as placeholders).
- Secure collaboration: Control the access to all aspects of your project and share them with the people and teams you choose.
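The codification and versioning principles above can be pictured with a small sketch. The Python below is not DVC's implementation or file format; it only illustrates the general idea of a placeholder metafile: a small, human-readable file that records a content hash of a large data file, so the small file can be committed to Git while the data itself is stored elsewhere. The function names, the `.meta.json` suffix, and the choice of SHA-256 are assumptions made for this illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def write_placeholder(data_path: Path) -> Path:
    """Write a small metafile next to data_path and return the metafile's path.

    Hypothetical helper: the JSON layout and the '.meta.json' suffix are
    illustrative only and do not match any real DVC file format.
    """
    digest = hashlib.sha256(data_path.read_bytes()).hexdigest()
    meta = {
        "path": data_path.name,
        "sha256": digest,
        "size": data_path.stat().st_size,
    }
    meta_path = data_path.with_name(data_path.name + ".meta.json")
    meta_path.write_text(json.dumps(meta, indent=2, sort_keys=True) + "\n")
    return meta_path


def matches_placeholder(data_path: Path, meta_path: Path) -> bool:
    """Return True if the data file still has the hash recorded in the metafile."""
    meta = json.loads(meta_path.read_text())
    current = hashlib.sha256(data_path.read_bytes()).hexdigest()
    return current == meta["sha256"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "train.csv"
        data.write_text("x,y\n1,2\n")

        # The metafile is tiny, so it is the part you would commit to Git.
        meta = write_placeholder(data)
        print(meta.read_text())
        print("unchanged:", matches_placeholder(data, meta))

        # Any change to the data is detected through the recorded hash.
        data.write_text("x,y\n1,3\n")
        print("after edit:", matches_placeholder(data, meta))
```

Because the metafile is plain text, Git can diff, branch, and merge it like source code, and each committed version of the metafile identifies exactly one version of the data.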
Characteristics
- DVC comes as a VS Code Extension, as a command line interface, and as a Python API. These options provide a familiar and intuitive user experience to a broad range of users.
- Easy to use: DVC is quick to install and works out of the box. It doesn't require special infrastructure, nor does it depend on APIs or external services. Optional integrations with existing solutions and platforms (Git hosting, SSH, and cloud storage providers, among others) are included.
- Works on top of Git repositories and has a similar feel and workflow. Stick to the regular Git workflow (commits, branching, pull requests, etc.) and don't reinvent the wheel! DVC can also work stand-alone, but without versioning capabilities.
- Bring your own resources: Provision or reuse existing resources on-premises or in the cloud, including storage, compute, CI workers, etc., and use DVC on top. You're not locked in to any one provider!
- DVC is platform agnostic: It runs on all major operating systems (Linux, macOS, and Windows), and works independently of programming languages (Python, R, Julia, shell scripts, etc.) or ML libraries (Keras, TensorFlow, PyTorch, SciPy, etc.).
Comparison with Related Technologies
DVC combines a number of existing ideas into a single tool, with the goal of bringing best practices from software engineering into the data science field.