February '22 Community Gems

A roundup of technical Q&As from the DVC and CML community. This month: comparing experiments, working with data, working with pipelines, and more.

Milecia McGregor
February 28, 2022 • 3 min read

How can I delete DVC-tracked files from cloud storage?

Thanks for the question @fireballpoint1!

You can find the best way to delete files from your cloud storage in our docs. Be very careful when deleting data from the cloud because it's an irreversible action. Here's an example of a deletion command that clears out everything in your cloud storage except what is referenced in your workspace:

$ dvc gc --workspace --cloud

This combination of options keeps only the files and directories referenced in the workspace and removes everything else, including data in the cloud and in the cache. By default, the command uses the default remote you have set. You can specify a different remote storage with the --remote option like this:

$ dvc gc --workspace --cloud --remote name_of_remote

I'm using DVC experiments, but the Git index gets corrupted with large (4GB) files. What is the best workaround?

Great question from @charles.melby-thompson!

Experiment files may be tracked by Git or DVC. For large files, we generally recommend tracking them with DVC, in which case file size shouldn't be an issue.

By default, experiments track all other files with Git. However, Git can fail when it has to handle too much data. If there are files you don't want to track at all (such as large temporary or intermediate files), you can add them to your .gitignore file.
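
As a minimal sketch, assuming your experiments write scratch files to a tmp/ directory and produce files ending in .intermediate (both names are made up for illustration), you could append ignore patterns from the root of your repository like this:

```shell
# Hypothetical patterns: replace them with the temporary or intermediate
# files your own experiments actually produce.
printf '%s\n' 'tmp/' '*.intermediate' >> .gitignore

# Show the resulting ignore rules.
cat .gitignore
```

Files matching these patterns are then skipped by Git, so they won't be picked up when an experiment is saved.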

Check out this open issue with experiments for more details and to provide feedback.

Is there an easy way to visualize DVC experiment results without using the command line?

Good question @LucZ[Mad]!

If you bring those experiments into your regular Git workflow, e.g. using dvc exp branch to create a branch for any experiment you want to share, you could use DVC Studio to visualize them.
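
For instance, a small shell helper along these lines could promote an experiment to a branch and push it to your Git server. The helper name, the experiment and branch names you pass in, and the Git remote called origin are all assumptions made for this sketch; dvc exp show lists the experiment names available in your project.

```shell
# Hypothetical helper: turn an experiment into a Git branch and publish it.
# $1 = experiment name (as listed by `dvc exp show`), $2 = new branch name.
share_experiment() {
  dvc exp branch "$1" "$2" &&
    git push origin "$2"   # assumes your Git remote is called "origin"
}
```

You would call it as share_experiment <experiment-name> <branch-name>, and the pushed branch then behaves like any other branch in your repository.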

We're working on support for viewing any pushed experiments in Studio right now, so if there's anything you want to see, make sure to comment on and follow this issue.

Can CML self-hosted runners stop the instance after the idle timeout instead of terminating?

This is another fantastic question from @jotsif!

No, we deliberately terminate the instance to avoid unexpected costs. Stopped but unterminated instances can still cost the same as running ones. It's best to let the CML runner terminate and create new instances, running dvc pull to restore your data each time.

However, if you're trying to preserve data (e.g. cache dependencies to speed up experimentation time) on an AWS EC2 instance, you could connect persistent AWS S3 remote storage.
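
A rough sketch of that setup, assuming a remote named storage and a bucket URL such as s3://my-bucket/dvc-store (both placeholders), might look like the following. The two helper functions are hypothetical and only wrap standard dvc remote add, dvc pull, and dvc push calls.

```shell
# Hypothetical helper, run once: register a persistent S3 location as the
# default DVC remote. $1 = remote name, $2 = S3 URL (placeholders).
configure_s3_remote() {
  dvc remote add -d "$1" "$2"
}

# Hypothetical helper, run on each fresh runner: restore data from the
# remote, run the given command, then upload any new results.
run_with_remote() {
  dvc pull &&
    "$@" &&
    dvc push
}
```

After committing the resulting .dvc/config to Git, a new instance could call something like run_with_remote python train.py, where train.py stands in for your own training command.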

What's the difference between DVC Studio free and enterprise versions?

Thanks for asking @Abdi!

You can find more info about the different DVC Studio tiers here.

The Free tier has all the features most individual users need, like connecting to ML repositories, creating views, submitting experiments, and generating plots. The Teams tier allows you to create large teams for better collaboration and sharing of views and settings with everyone. The Enterprise tier is more for needs around compliance, dedicated support, and on-premise installation.

If you are trying to decide which plan to select, please email us at support@iterative.ai and we'll help you figure it out based on your needs.

How can I use one `dvc.yaml` file with multiple pipeline folders with different `params.yaml` files?

@louisv, thanks for this question!

It seems like you're looking for the parametrization functionality. You can learn more about how it works in this doc, but here's an example of what that might look like in dvc.yaml:

stages:
  cleanups:
    foreach: # List of simple values
      - raw1
      - labels1
      - raw2
    do:
      cmd: clean.py "${item}"
      outs:
        - ${item}.cln
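
With a foreach stage like the one above, DVC generates one stage per item, named <stage>@<item>, so each one can be reproduced on its own. A minimal sketch (the helper name is hypothetical):

```shell
# Hypothetical helper: reproduce only the generated stage for one item,
# e.g. "cleanups@raw1" for the item "raw1" in the dvc.yaml above.
repro_cleanup() {
  dvc repro "cleanups@$1"
}
```

For example, repro_cleanup raw1 would run only the raw1 cleanup, while a plain dvc repro runs all generated stages.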

Is it possible to change the x-label in DVC Studio?

A great question about Studio from @PythonF!

You can set custom properties for your plot in your dvc.yaml like this:

plots:
  - plots_no_cache.csv:
      cache: false
      x: r

You can also use dvc plots modify to change which fields are used for the x and y axes (or, with the --x-label and --y-label options, the axis labels themselves) using a command similar to the following:

$ dvc plots modify plots_no_cache.csv -x r -y q
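
If you only want to change the text shown on the axes, dvc plots modify also accepts --x-label and --y-label. The helper below is a hypothetical wrapper around that command, and the label values you pass in are up to you; whether Studio displays them may depend on the version you are using.

```shell
# Hypothetical helper: set only the displayed axis labels for a plots file.
# $1 = plots file, $2 = x-axis label, $3 = y-axis label.
label_plot() {
  dvc plots modify "$1" --x-label "$2" --y-label "$3"
}
```

For example, label_plot plots_no_cache.csv "Recall" "Precision" would keep the plotted fields as they are and change only the labels.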


At our March Office Hours Meetup, we will be talking about how you can create, run, and benchmark DVC pipelines with ZnTrack! RSVP for the Meetup here to stay up to date with specifics as we get closer to the event!

Join us in Discord to get all your DVC and CML questions answered!