October '21 Heartbeat

This month you will find:

🗺 MLOps workflows,

🤔 Lots of ways to learn,

🎥 Meetup and Conference videos,

📖 Docs updates,

🚀 Info on our growing team, and more!

  • Jeny De Figueiredo
  • October 15, 2021
  • 12 min read

From the Community

This month we have been flooded with content from our Community. We are grateful and inspired to keep serving you!

Thank you!

Ricardo Manhães Savii: Trying to turn Machine Learning into value

If we can't turn machine learning into value, what good are we? Ricardo Manhães Savii wrote a piece on Medium where he tackles how to technically and visually define the steps to deliver an Intelligent System with the same level of best practice maturity that software development has today. He combines and synthesizes the ideas of some of the best-known thinkers in the space to build a thorough architecture of machine learning best practices. You won't want to miss this post, so take the time to wrap your head around these diagrams!

CI/CD for Machine Learning: Ricardo Manhães Savii's addendum to François Chollet's (https://medium.com/@francois.chollet) figure on the result of machine learning (Source link)

RappiBank: How to build an efficient machine learning project workflow

Continuing the theme of ML workflow complexity, Daniel Baena wrote a great overview and tutorial piece outlining the challenges that his team at RappiBank encountered and solved with DVC, including:

  • confusing experiment files with different names
  • disjointed messaging about training and models and dataset changes
  • keeping progress in your head or your own notes, where it is not visible to the rest of the team
  • heavy run and re-run times without a modularized system

Daniel shows how all of these things can be solved using DVC.🏆

How to Build an Efficient Machine Learning Project Workflow Using Data Version Control (DVC)

Daniel Baena's overview of common MLOps challenges encountered at RappiBank and how they are solved with DVC.

DAGsHub: Production Oriented Work

Next up, Nir Barazida from DAGsHub created a video on Production-oriented work using a monorepo strategy and focusing on moving from research to production-ready code using Git and DVC. If you are a data scientist trying to wrap your head around going from your notebook to production, this may help!

Production-Oriented Work with Git, DVC and DAGsHub

Nir Barazida's tutorial and video on how to use a monorepo strategy and go from your notebook to production-ready code.

ML Data Versioning with DVC: How to Manage Machine Learning Data

Piotr Storożenko of Appsilon wrote a great tutorial taking into account the many challenges data scientists and ML engineers encounter in their data versioning efforts and how DVC solves them. Do these scenarios from his article look familiar?

Was it in model_3final.pth or model_last.pth that I used a bigger learning rate?

When did I start using data preprocessing, during model_2a.pth or model_2aa.pth?

Is model_7.pth trained on the new dataset or on the old one?

Oh, gosh, which set of parameters and data have I used to train model_2.pth? It was pretty good in the end…
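To make the idea behind those questions concrete, here is a minimal sketch in plain Python of recording which data and parameters produced a model file. It is not DVC's implementation or API: the function names, the JSON sidecar layout, and the use of SHA-256 are all assumptions made for illustration. DVC tracks this kind of information for you through its own metafiles.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_training_record(model_path: Path, data_path: Path, params: dict) -> Path:
    """Write a JSON sidecar next to the model describing how it was trained.

    Hypothetical helper: the record layout is an assumption, not a DVC format.
    """
    record = {
        "model": model_path.name,
        "model_sha256": file_sha256(model_path),
        "data": data_path.name,
        "data_sha256": file_sha256(data_path),
        "params": params,
    }
    record_path = model_path.with_name(model_path.name + ".json")
    record_path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return record_path


def trained_on(record_path: Path, data_path: Path) -> bool:
    """Check whether the recorded data hash matches the given data file."""
    record = json.loads(record_path.read_text())
    return record["data_sha256"] == file_sha256(data_path)


if __name__ == "__main__":
    # Throwaway files stand in for a real model and two dataset versions.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        model = root / "model_7.pth"
        model.write_bytes(b"fake weights")
        old_data = root / "old.csv"
        old_data.write_text("x,y\n1,2\n")
        new_data = root / "new.csv"
        new_data.write_text("x,y\n3,4\n")

        record = write_training_record(model, old_data, {"lr": 0.01})
        print("trained on old dataset:", trained_on(record, old_data))
        print("trained on new dataset:", trained_on(record, new_data))
```

With a record like this stored next to each model, "Is model_7.pth trained on the new dataset or on the old one?" becomes a lookup instead of a guess. A tool such as DVC goes further by tying these records to Git commits.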

Learning Opportunities

Raviraja Ganta's 10-week course on Basic MLOps

Twitter and LinkedIn were ablaze in the last month when Raviraja Ganta announced his 10-Week Course on MLOps basics. This course is chock-full of resources and practical tutorials to build your MLOps platform and knowledge. Week 3 of the course is about DVC and its ability to solve your versioning and reproducibility challenges. Be sure to check out the course repo as well!

MLOps Community is hosting him to speak about his course on October 20th. Sign up to attend here!

Raviraja Ganta's 10-Week Course on MLOps Basics (Source link)

Josh Wills video on COVID simulations with DVC

This week, this Tweet comment led me to this work by Josh Wills. Josh was tapped by DJ Patil to participate in some COVID simulation research early on in the pandemic in which he used DVC. In his presentation about the project, he tells of the tools he used and challenges of the use case. Nice DVC shout out at 19:56! Ah, the fruits of a Twitter 🐇🕳!

September Office Hours Video: Transfer Learning with Milecia McGregor

If you missed last month's Office Hours Meetup, you can now catch the video! Milecia's presentation was based on her blog post on the same topic: Using Experiments for Transfer Learning. If you're curious about transfer learning in general, AlexNet and SqueezeNet in particular, or using DVC experiments and checkpoints to track all that you do, this video's for you!

Quoc-Tien Au: Continuously Learning on the Job as a Data Scientist

This Towards Data Science article by Quoc-Tien Au, entitled "The What, Where, and How about continuously learning on the job as a data scientist," speaks to some higher points on the need to have a mindset for continuous learning in the Data Science field. It's packed with great thought processes and resources on what to learn, where to learn, and how to keep learning while still getting your work done. Who struggles with this? 😅

DVC News

Amsterdam Off-site

Most of our team members from Europe got together in Amsterdam recently for a couple days of brainstorming and team bonding. They went on a Treasure Hunt, ate Ramen (a favorite among our team) and had great discussions on how to make our tools and our team even better! Pictured below from front of the room left, going clockwise (to the back of the room and back up) are David Ortega, Helio Machado, David de la Iglesia Castro, Laurens Duijvesteijn, Ruslan Kupriev (hidden), Dmitry Petrov, Jelle Bouwman, Batuhan Taskaya, Svetlana Sachkovskaya, and Paweł Redzyński.

Be sure to check out this section next month as our Americas team members will meet in San Francisco!

Iterative Team Members meet in Amsterdam (Source: David Ortega)

New Team Members

Jordan Weber joins us from Los Angeles, California as our new Chief of Staff. She has previously held similar roles at venture capital and FinTech firms. In Jordan's free time she enjoys cooking, tennis, dance, and hiking! 🎾

Ken Thom joins us from Palo Alto, California as our new Director of Operations. His past work includes business operations, product management, software and hardware development. In his spare time he likes to spend time with his family, swim, ski, and hike! 🥾

Jon Burdo joins us from Boston, Massachusetts as a Senior Software Engineer. He's been working for the past few years as a machine learning engineer with a focus on NLP. In his last role he used DVC and loved it, which is how he eventually ended up here! 🎉 In his spare time, Jon likes learning about open source software, tinkering with Linux, and inline skating.

Stephanie Roy joins the team as a Senior Software Engineer from Quebec, Canada. Our first Canadian team member! She has previously worked at LogMeIn on one of their mobile apps. In her spare time she likes taking care of her plants in her indoor grow house, playing roller derby, and discovering new things to watch, listen to and eat! 😋

Welcome to all our new team members! We are so glad you are here! 🙌🏼

Open Positions

And wouldn't you know it? We're still hiring! Use this link to find details of all the positions including:

  • Senior Software Engineer (ML, Labeling, Python)
  • Senior Software Engineer (ML, DevTools, Python)
  • Field Data Scientist / Sales Engineer
  • Developer Advocate (ML)
  • Director / VP of Engineering (ML, DevTools)
  • Director / VP of Product (ML, Data Infra, SaaS)
  • Head of Talent
  • Head of DevRel

Please pass this info on to anyone you know that may fit the bill. We look forward to new team members! 🎉

Docs Updates

Here are a few important docs updates you may want to take a look at this month!

📖 PyTorch Lightning

We all have Ilia Sirotkin to thank for his contribution to our docs. He created the PyTorch Lightning integration docs for all to use!

📖 CML with DVC guide

Our refreshed CML with DVC guide provides updated code and streamlined information on cloud storage provider credentials and GitHub Actions setup.

name: CML & DVC
on: [push]
jobs:
  run:
    runs-on: ubuntu-latest
    container: docker://ghcr.io/iterative/cml:0-dvc2-base1
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0
      - name: Train model
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          pip install -r requirements.txt  # Install dependencies
          dvc pull data --run-cache        # Pull data & run-cache from S3
          dvc repro                        # Reproduce pipeline
      - name: Create CML report
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          echo "## Metrics" >> report.md
          dvc metrics diff master --show-md >> report.md

          # Publish confusion matrix diff
          echo "## Plots" >> report.md
          echo "### Class confusions" >> report.md
          dvc plots diff \
            --target classes.csv \
            --template confusion \
            -x actual \
            -y predicted \
            --show-vega master > vega.json
          vl2png vega.json -s 1.5 | cml publish --md >> report.md

          # Publish regularization function diff
          echo "### Effects of regularization" >> report.md
          dvc plots diff \
            --target estimators.csv \
            -x Regularization \
            --show-vega master > vega.json
          vl2png vega.json -s 1.5 | cml publish --md >> report.md

          cml send-comment report.md

📖 Shtab

Team member Casper da Costa-Luis has created a docs website for shtab, his Python tab-completion script generator project. For more info, check out the original blog post about it as well.

Next Meetups

For the second class of DVC Learn, join us to learn about getting started running experiments! This lesson will include information on how to use our checkpoints feature as well. We look forward to seeing you there!

DVC Learn - Getting Started with Running Experiments

Milecia McGregor shows us how to get started with DVC Experiments and Checkpoints

Be sure to join us at the November Office Hours Meetup, where Maykon Shots will talk about how he used DVC and CML to create an internal Kaggle competition for his team to arrive at their best models in their work for the largest bank in Brazil.

DVC Office Hours - Creating an Internal Kaggle Competition with DVC and CML

Maykon Shots shows us how he used DVC and CML to create an internal Kaggle competition for his team

Tweet Love ❤️

This month, it was exceedingly hard to pick just one Tweet. I'm leaving you with one that ballooned our followers over the last month. But there have been many! I encourage you to visit our newly created Wall of Love ❤️ to see all the beautiful Iterative tool love. 🛠❤️🤗


Do you have any use case questions or need support? Join us in Discord!

Head to the DVC Forum to discuss your ideas and best practices.
