August '20 Heartbeat

Catch our monthly updates, featuring the CML release, a DVC meetup recap, a new video tutorial series, and the best reading about pipelines and DataOps.

  • Elle O'Brien
  • August 10, 2020 · 7 min read

DeeVee avoids the summer sun at Mount Rainier National Park.

Welcome to our August roundup of cool news, new releases, and recommended reading in the MLOps world!


CML release

At the beginning of July, we went live with a new project: Continuous Machine Learning, or CML for short. If you haven't heard, CML is an open-source toolkit for adapting popular continuous integration systems like GitHub Actions and GitLab CI for machine learning and data science. This release marks a new stage for our organization: while CML can work with DVC, and both are built around Git, CML is designed for standalone use. That means we're supporting TWO projects now!

Luckily, we received plenty of encouraging and helpful feedback following the CML release. CML was on the front page of Hacker News for most of release day! We also got covered on Heise, a popular German IT news source. I (Elle, a proud part of the CML team!) also gave a talk presenting our approach as part of the MLOps World meeting, which is now available for online viewing.

Of course, we're fielding lots of questions too! We've compiled some of the most common questions (and their answers!) in our last Community Gems post, and CML developer David G. Ortega has written a tutorial for a much-asked-for use case: doing continuous integration with on-demand GPUs.

If you have comments, questions, or feature requests about CML, we really want to hear from you. A few ways to be in touch:

July Meetup

Last week, we had another meetup! DVC Ambassador Marcel kicked us off with a short talk about how he's using DVC as part of his causal modeling approach to bioinformatics. It's cool stuff. Then, I talked a bit about CML and did some live-coding. The beauty of live-coding is getting to answer questions in real time, and if you're totally new to the idea of continuous integration (or want to understand how CML works with GitHub Actions/GitLab CI), seeing a project in action is one of the best ways to learn.

You can watch a recording of the meetup online now (it's lightly edited to remove some pesky Zoom trolls), and join our Meetup group to get updates for the next one. In future meetups, we'd love to support community members sharing their work, so get in touch if you'd like to present.

New video series

We're starting up some new YouTube features! If you haven't seen our channel, check it out and consider subscribing for hands-on tutorials and demos. Our first video introduced continuous integration and GitHub Actions, and the second showed how to use DVC and free Google Drive storage to add external data storage to a GitHub project.

In the coming weeks, we'll be covering:

  • Using CML and GitHub Actions with hardware for deep learning, like on-premise GPUs
  • Understanding Vega plots and making data viz part of your CI system
  • Some DVC basics to supplement our docs

From the community

SpaCy + DVC = ❤️

We're huge fans of a recent Python Bytes episode featuring Ines Montani, founder of Explosion and one of the makers of the incredible SpaCy library for NLP (seriously, I have the highest recommendations for SpaCy).

Ines' episode discussed DVC, and DVC is going to be integrated with SpaCy in their 3.0 release. SpaCy + DVC is going to be a powerhouse and we can't wait.

Take a stab at shtab

Another cool software project: Casper da Costa-Luis, DVC contributor and creator of the popular tqdm library, has published a tab-completion script generator for Python applications! shtab, as it's called, was originally designed for DVC, but Casper developed it into a generic tool that can be used for virtually any Python CLI application. Check out shtab on GitHub and read the release blog.

(Tab) Complete Any Python Application in 1 Minute or Less

We've made a painless tab-completion script generator for Python applications!
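
To make the idea of a "completion script generator" a bit more concrete, here's a toy sketch using only the standard library. To be clear, this is not shtab's implementation or API: the helper names (`option_strings`, `bash_completion`, `build_parser`) and the `mytool` program are made up for illustration, and the sketch leans on argparse's private `_actions` attribute. It just walks a parser, collects its flags, and emits a one-line bash `complete -W` command.

```python
import argparse


def option_strings(parser):
    """Collect every flag (e.g. -h, --help) that the parser accepts."""
    flags = []
    # Assumption: _actions is a private argparse attribute, used here
    # only to keep the sketch short.
    for action in parser._actions:
        flags.extend(action.option_strings)
    return sorted(flags)


def bash_completion(parser):
    """Return a one-line bash command that offers the parser's flags."""
    words = " ".join(option_strings(parser))
    return f"complete -W '{words}' {parser.prog}"


def build_parser():
    """A hypothetical CLI with two options, for demonstration only."""
    parser = argparse.ArgumentParser(prog="mytool")
    parser.add_argument("--input")
    parser.add_argument("-v", "--verbose", action="store_true")
    return parser


if __name__ == "__main__":
    # Sourcing the printed line in bash would enable flag completion
    # for a command named "mytool".
    print(bash_completion(build_parser()))
```

A real generator like shtab has to handle much more than this (subcommands, file and directory completion, multiple shells), which is exactly why a generic tool is handy.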

DVC 1.0 migration script

Our friends at DAGsHub have released a script to help DVC users upgrade their pipelines to the new DVC 1.0 format! Says Simon, a DAGsHub engineer, in his tutorial:

In this post, I'll walk you through the process of migrating your existing project from DVC ≤ 0.94 to DVC 1.X using a single automated script, and then demonstrate a way to check that your migration was successful.

Read the blog and get migrating (but don't worry if you can't; DVC 1.0 is backwards compatible).

Automatically migrate your project from DVC≤ 0.94 to DVC 1.x

Migrating your project from DVC ≤ 0.94 to DVC 1.x can be a very involved process. Here’s an easy way to do it.

Here are some of our favorite blogs from around the internet 🌏.

  • Déborah Mesquita, data scientist (and an excellent writer to follow), published a tutorial about DVC pipelines that is truly deserving of the moniker "ultimate guide". It's a start-to-finish case study about a typical machine learning project, with DVC pipelines to automate everything from grabbing the data to training and evaluating a model. Also, it comes with a video tutorial if you prefer to watch instead of read!

The ultimate guide to building maintainable Machine Learning pipelines using DVC

Learn the principles for building maintainable Machine Learning pipelines using DVC

  • Software engineer Vaithy Narayanan created the first ever ☝️ CML user blog! Vaithy created a pipeline that covers data collection to model training and testing, and used CML to automate the pipeline execution whenever the project's GitHub repository is updated. He ends with some insightful discussion about the strengths and weaknesses of the approach.

Using Continuous Machine Learning to Run Your ML Pipeline

Vaithy Narayanan

  • Ryan Gross, a VP at Pariveda Solutions, blogged about the future of data governance and the lessons from DevOps that might save the day. Honestly, you should probably start reading for this cover image alone.

    DataOps is accurately depicted as a badass flaming eagle. Check out the blog here:

The Rise of DataOps (from the ashes of Data Governance)

Legacy Data Governance is broken in the ML era. Let’s rebuild it as an engineering discipline to drive orders-of-magnitude improvements.

And, there's a noteworthy counterpoint by Michael Kaminsky. Read them both!

Thanks everyone, that's it for this month. We hope you're staying safe and making cool things!
