January ’21 Heartbeat

Monthly updates are here! Read all about our new R language tutorial, putting DVC to work on an image segmentation pipeline, and a fast new way to set up your DVC remote.

Elle O'Brien

January 20, 2021

5 minutes read

News

Welcome to the first Heartbeat of 2021! Here’s some new year news.

We’re still hiring

Our search continues for a Developer Advocate to support and inspire developers by creating new content like blogs, tutorials, and videos, and by leading outreach through meetups and conferences.

Does this sound like you or someone you know? Be in touch!

7000 stars on GitHub

We recently passed 7000 stars on the DVC GitHub repository! We crossed the 7k mark extremely close to midnight on New Year’s Eve, so we probably hit it in time for the new year in at least one time zone. Anyway, it made for a very suspenseful countdown to midnight. Woot woot!

The repo is HQ for DVC development, meaning that if you have an issue to report, a feature to request, or a pull request to offer, this is where you should start!

New video for R users

A lot of our videos about GitHub Actions have used Python scripts, but there’s no reason to restrict Continuous Machine Learning to one language. We’ve just released our first-ever R language video, which covers:

How to install R on a GitHub Actions runner
How to manage R package dependencies for continuous integration (teaser: CRAN binaries are amazing)
Putting a ggplot or a kable table in your pull request

Watch and follow along! If you make something based on this approach, or if you think there’s a better way, please tell us. We’re eager to see what the R community thinks.

Workshops and talks

On Friday, January 24, I (Elle) spoke with Alexey Grigorev (author of a Data Science Bookcamp) on his podcast about being a developer advocate in the machine learning space! If you’re curious about what the role entails, or what to look for when hiring a developer advocate for your machine learning project, please give it a listen. The event is up on YouTube, and will soon be available as a podcast for your listening pleasure 🎧

From the community

As ever, we have much to share from the great citizens of the DVC community.

Where’s Baby Yoda?

There’s a brand new blog post we love, and only half of that has to do with its impressive collection of Baby Yoda pics. Simon Lousky, developer at DAGsHub, published a blog provocatively titled Datasets should behave like git repositories. He writes:

While data versioning solves the problem of managing data in the context of your machine learning project, it brings with it a new approach to managing datasets. This approach, also described as data registries here, consists of creating a git repository entirely dedicated to managing a dataset. This means that instead of training models on frozen datasets – something researchers, students, kagglers, and open source machine learning contributors often do – you could link your project to a dataset (or to any file for that matter), and treat it as a dependency. After all, data can and should be treated as code, and follow through a review process.

We agree! Lousky goes on to show us a brilliant code example wherein he segments instances of Baby Yoda out of frames from The Mandalorian. DVC plays a key role in keeping track of all the Baby Yodas, which is pretty much the most important use case we could’ve imagined.

There’s also a lively discussion about the post on Reddit. Check it out and consider contributing your own Baby Yoda image annotations to grow the dataset!

Data Version Control Explained

Researcher Nimra Ejaz published a fantastically detailed introduction to DVC. She even included a “History of DVC” section, which is pretty cool for us. This might be a first!

Her blog covers not only the key features of DVC, but also a thoughtful pros-and-cons list and a case study about using DVC in an image classification project. If you want an up-to-date, high-level overview of DVC and some help deciding if it fits your needs, I couldn’t recommend Nimra’s blog more.

Data Version Control Explained

Nimra Ejaz

crowdbotics.com

One more thing from DAGsHub

Dean Pleban, CEO of DAGsHub, shared an important update: they now offer FREE dataset and model hosting for DVC projects (up to 10 GB per user and project, with flexibility for public projects)! And with no configuration!

That means you don’t have to configure your DVC remote to use DVC with model and data storage in the cloud: DAGsHub will handle all of it. In other words, your DVC remote can be added as easily as a Git remote. Read the announcement, and then dig into their basic tutorial to get started.

Free Dataset & Model Hosting with Zero Configuration – Launching DAGsHub Storage

Dean Pleban

dagshub.com

A nice tweet

Bilgin Ibryam, author of the Kubernetes Patterns book, gave us a shoutout for being an interesting data engineering project (according to a list by another expert we trust, Dmitry Ryabov). Thanks Bilgin and Dmitry, we think you’re very interesting too!

Five Interesting Data Engineering Projects (@getdbt, @PrefectIO, @dask_dev, @DVCorg, greatexpectations) https://t.co/XXeLXYDp0M by @squarecog
— Bilgin Ibryam (@bibryam) December 23, 2020