April '22 Community Gems

A roundup of technical Q&As from the DVC and CML community. This month: CML updates, working with multiple datasets, using DVC stages, and more.

Milecia McGregor
April 28, 2022 • 3 min read

When I run `dvc repro` on a stage, does it automatically push any outputs to my remote?

Great question from @tina_rey!

The `dvc repro` command doesn't automatically push any outputs or data to your remote. The outputs are stored in your local cache, and they only reach the remote when you run `dvc push`, which uploads them from the cache.
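
If you want both steps to happen together, one option is a small wrapper script. The following is a minimal Python sketch, not an official DVC utility: the function name `repro_and_push` and the injectable `run` parameter are assumptions made for illustration, and it only shells out to the `dvc repro` and `dvc push` commands mentioned above.

```python
import subprocess


def repro_and_push(stage, run=subprocess.run):
    """Reproduce one stage, then upload the cached outputs to the remote.

    `run` defaults to subprocess.run; it is a parameter only so the
    function can be exercised without DVC installed (hypothetical design).
    """
    # `dvc repro` writes outputs to the workspace and the local cache only.
    run(["dvc", "repro", stage], check=True)
    # `dvc push` is the separate step that uploads cached data to the remote.
    run(["dvc", "push"], check=True)
```

Inside a DVC project with a configured remote, calling `repro_and_push("train")` would run the hypothetical `train` stage and then push. Because `check=True` is passed, a failed `dvc repro` raises an exception and nothing is pushed.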

Is `dvc dag` based on `deps` and `outs`, so that a stage that depends on the output of another stage will always be executed after the former has finished?

This is a good question from @johnysku!

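That is correct! DVC builds the graph from each stage's `deps` and `outs`, so a stage that depends on the output of another stage always runs after that stage has finished. Independent pipelines or independent stages, on the other hand, may run in any order, so without an explicit dependency link, stages could be executed in an unexpected order.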
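
To make the ordering rule concrete, here is a minimal Python sketch of how an execution order can be derived from `deps` and `outs`. It is an illustration only, not DVC's internal implementation, and the stage names and file paths are hypothetical. It uses `graphlib` from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical stages, each with the files it reads (deps) and writes (outs).
stages = {
    "prepare": {"deps": ["data/raw.csv"], "outs": ["data/clean.csv"]},
    "train": {"deps": ["data/clean.csv"], "outs": ["model.pkl"]},
    "evaluate": {"deps": ["model.pkl", "data/clean.csv"], "outs": ["metrics.json"]},
}


def execution_order(stages):
    """Return stage names so that every producer comes before its consumers."""
    # Map each output path to the stage that produces it.
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    # A stage's predecessors are the stages that produce any of its deps.
    # Deps that no stage produces (such as raw input files) add no edge.
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(stages)
print(order)
```

Here `train` consumes the output of `prepare`, and `evaluate` consumes the output of `train`, so the order is fully determined. Two stages with no shared files would have no edge between them, which is why their relative order is not guaranteed.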

If I want to use the `foreach` utility in `dvc repro`, is there a way I can use glob patterns to create the list DVC needs to iterate over?

Another interesting question from @copah!

If you have `mystage` which uses `foreach`, you can run `dvc repro mystage` to iterate over every `mystage` stage.

How does DVC handle files that have been deleted from remote storage?

Really good question from @Meme Philosopher!

DVC will fail when you try to pull files that have been deleted from the remote and notify you that those files are missing in remote storage.

Can I run CML separately from the GitHub Actions VM, on GCP or AWS, so that training and testing happen in those cloud environments?

Thanks for the question @Atsu!

This is supported out-of-the-box! Here's how it works:

Within GitHub Actions, CML launches a self-hosted runner on GCP or AWS using `cml runner --labels=cml --cloud=gcp` (or `--cloud=aws`).
GitHub Actions runs the rest of the workflow on the self-hosted runner using `runs-on: [self-hosted, cml]` and the maximum allowable `timeout-minutes: 4320`.
If GitHub Actions is about to time out, CML will restart the workflow, so make sure your code regularly caches and restores data if it's expected to take more than 3 days to run.

You can follow along with this doc to get started.

The key is requesting GitHub's maximum `timeout-minutes: 4320`. This signals CML to restart the workflow just before the timeout. You'll also have to write your code to cache results so that the restarted workflow can reuse previous results (e.g. see https://dvc.org/doc/user-guide/experiment-management/checkpoints#caching-checkpoints and https://github.com/iterative/dvc/issues/6823).
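
As a rough illustration of what "cache results" can look like in training code, here is a minimal Python sketch of a resumable loop. Everything in it is an assumption made for illustration: the `train_state.json` file name, the function names, and the placeholder "training" step. Persisting the state file between workflow runs (for example with DVC, as in the links above) is not shown.

```python
import json
import os

STATE_FILE = "train_state.json"  # hypothetical checkpoint path


def load_state(path=STATE_FILE):
    """Return the saved progress, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0, "loss": None}


def save_state(state, path=STATE_FILE):
    """Write to a temp file and rename it, so an interrupted write
    never leaves a partial checkpoint behind."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def train(total_epochs, path=STATE_FILE):
    """Run (or resume) training, checkpointing after every epoch."""
    state = load_state(path)
    # Start from the first epoch that has not been completed yet.
    for epoch in range(state["epoch"], total_epochs):
        # Placeholder for one real training epoch.
        state["loss"] = 1.0 / (epoch + 1)
        state["epoch"] = epoch + 1
        save_state(state, path)
    return state
```

If the workflow is restarted partway through, calling `train` again with the same checkpoint file picks up at the first unfinished epoch instead of starting over.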

When running an experiment from the web interface with DVC, is there any way to get the new metrics to show on the commit created by Iterative Studio for the experiment?

Awesome question about Studio from @Benjamin-Etheredge!

To show the experiment results in Studio, you have to commit and push the results as part of your CI (continuous integration) action. Here's an example GitHub Actions script that does this.

We do understand that it is not ideal that there are 2 commits, one with your changes and one with the results. We have been thinking about how this can be improved and it would be great to hear if you have any thoughts/ideas!

Is there a way to get DVC to import from a private repository?

Good question from @qubvel!

You can use SSH to handle this and run the following command:

$ dvc import git@gitlab.com:<repository location> <data_path>

If I use a local remote and a shared cache, will the data be symlinked from the remote to the cache?

Very interesting question from @cajoek!

The data will not be symlinked from the remote to the cache.

The cache is often treated as temporary storage, so it can accumulate a lot of data that will never be used again, such as outputs from failed experiments. In that case, it's useful to have a local remote that keeps the important data for the important versions of your project.

That way, when your cache later grows too big and the project takes up too much space, you can remove `.dvc/cache` and download the latest important version from the remote.

At our May Office Hours Meetup, Matt Squire of Fuzzy Labs will join us to share his views on open source MLOps tools! RSVP for the Meetup here to stay up to date with specifics as we get closer to the event!

Join us in Discord to get all your DVC and CML questions answered!