DVC works on top of Git repositories, with a command line interface and workflow similar to Git's. DVC can also work stand-alone, but without versioning capabilities.
DVC codifies data and ML experiments:
Data versioning is enabled by replacing large files, dataset directories, machine learning models, etc. with small metafiles (easy to handle with Git). These placeholders point to the original data, which is decoupled from source code management.
Data storage: On-premises or cloud storage can be used to store the project's data separately from its code base. This is how data scientists can transfer large datasets or share a GPU-trained model with others.
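To make the placeholder idea concrete, here is a minimal Python sketch of how a large file could be swapped for a small metafile that points to a copy kept in separate storage. It illustrates the concept only and is not DVC's actual file format or implementation. The function names `track` and `restore`, the `.meta.json` suffix, the use of SHA-256, and the local `storage` directory standing in for on-premises or cloud storage are all assumptions made for this example.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def track(data_file: Path, storage_dir: Path) -> Path:
    """Copy data_file into content-addressed storage and write a small placeholder.

    The placeholder (hypothetical ".meta.json" format) is what would be
    committed to Git instead of the data itself.
    """
    digest = file_digest(data_file)
    storage_dir.mkdir(parents=True, exist_ok=True)
    stored = storage_dir / digest
    if not stored.exists():  # identical content is stored only once
        shutil.copy2(data_file, stored)
    meta_file = data_file.with_name(data_file.name + ".meta.json")
    meta = {"path": data_file.name, "sha256": digest, "size": data_file.stat().st_size}
    meta_file.write_text(json.dumps(meta, indent=2))
    return meta_file


def restore(meta_file: Path, storage_dir: Path) -> Path:
    """Recreate the original file next to its placeholder from storage."""
    meta = json.loads(meta_file.read_text())
    target = meta_file.with_name(meta["path"])
    shutil.copy2(storage_dir / meta["sha256"], target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        model = root / "model.bin"
        model.write_bytes(b"\0" * 1_000_000)  # stand-in for a large artifact
        meta = track(model, root / "storage")
        model.unlink()  # only the placeholder remains in the "repository"
        restore(meta, root / "storage")
        print(meta.stat().st_size, "byte placeholder for", model.stat().st_size, "bytes of data")
```

In this sketch, a collaborator with only the placeholder and access to the storage directory can recreate the data. That is the sharing workflow described above, reduced to its essentials.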
DVC makes data science projects reproducible by creating lightweight pipelines using implicit dependency graphs, and by codifying the data and artifacts involved.
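An implicit dependency graph can be sketched in a few lines of Python: stages only declare the files they read and write, and the edges between stages are inferred wherever one stage's output is another stage's dependency. The stage names, file names, and the helpers `build_graph` and `execution_order` below are hypothetical. This is an illustration of the idea, not a description of how DVC builds its pipelines internally. It relies on `graphlib`, which is in the standard library from Python 3.9 onward.

```python
from graphlib import TopologicalSorter

# Hypothetical stages: each lists the files it reads (deps) and writes (outs).
STAGES = {
    "evaluate": {"deps": ["model.pkl", "data/clean.csv"], "outs": ["metrics.json"]},
    "train": {"deps": ["data/clean.csv", "train.py"], "outs": ["model.pkl"]},
    "prepare": {"deps": ["data/raw.csv"], "outs": ["data/clean.csv"]},
}


def build_graph(stages):
    """Map each stage to the set of stages that produce its dependencies.

    No stage names another stage directly; edges are inferred from files.
    """
    producer = {out: name for name, stage in stages.items() for out in stage["outs"]}
    return {
        name: {producer[dep] for dep in stage["deps"] if dep in producer}
        for name, stage in stages.items()
    }


def execution_order(stages):
    """Return the stages in an order where every producer runs before its consumers."""
    return list(TopologicalSorter(build_graph(stages)).static_order())


if __name__ == "__main__":
    print(execution_order(STAGES))  # ['prepare', 'train', 'evaluate']
```

Files such as `data/raw.csv` and `train.py` have no producing stage, so they act as external inputs. In a reproducible pipeline, a change to any of them would mean rerunning the stages downstream of it.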
DVC is platform agnostic: It runs on all major operating systems (Linux, macOS, and Windows), and works independently of the programming languages (Python, R, Julia, shell scripts, etc.) or ML libraries (Keras, TensorFlow, PyTorch, SciPy, etc.) used in the project.
Easy to use: DVC is quick to install and doesn't require special infrastructure, nor does it depend on APIs or external services. It's a stand-alone CLI tool.
Git servers, SSH, and cloud storage providers are supported, however.
DVC combines a number of existing ideas into a single tool, with the goal of bringing best practices from software engineering into the data science field.