May '21 Community Gems

A roundup of technical Q&A's from the DVC and CML community. This month: remote storage integration, removing old experiments, ideas for running CML pipeline reports and more.

  • Milecia McGregor
  • May 28, 2021 β€’ 9 min read

Each month we go through our Discord messages to pull out some of the best questions from our community. AKA: Community Gems. πŸ’Ž This month we'd like to thank @asraniel, @PythonF, @mattlbeck, @Ahti, @yikeqicn, @lexzen, @EdAb, @FreshLettuce for inspiring this month's gems!

As always, join us in Discord to get all your DVC and CML questions answered!

DVC

What is the best way to commit 2 experiment runs?

You want to use dvc exp branch if you want to keep multiple experiments. That way, each one is in a separate branch rather than trying to apply one experiment on top of another.

How can I clean up the remote caches after a lot of experiments and branches have been pushed?

dvc exp gc requires some kind of flags to operate. At the very least, --workspace. So, with --workspace, dvc will try to read all of the pointer files: .dvc files and dvc.yaml files in the workspace. It will read all of them and will determine all the cache objects/files that need to be preserved (since they are being used in the current workspace). The rest of the files in the .dvc/cache are removed.

This does not require any Git operations!

You can also use the --all-branches flag. It will read all of the files present in the current workspace and from the commits in the branches you have locally. Then it will use that list to determine what to keep and what to remove.

If you need to read pointer files from given tags you have locally, the --all-tags flag is the best option.

The --all-commits flag reads pointer files from every commit and it will make a list of all the files that are in the cache/remote and if the .dvc file isn't found in any commits of the Git repo, it will delete those files.

You're looking for the -r / --remote option for dvc push. The command looks like this:

$ dvc push --remote <name_of_remote_storage>

It will push directly to the remote storage you defined in the command above.

Take a look at the new experiments feature! It enables you to easily experiment with different parameter values.

You could script a grid search pretty easily by queueing an experiment for each set of parameter values you want to try. For example:

$ dvc exp run --queue -S alpha={alpha},beta={beta}
$ dvc exp run --run-all --jobs 2

The --jobs 2 flag means you're running 2 queued experiments in parallel. By default, the --run-all flag runs 1 queued experiment at a time.

Then you can compare the results with dvc exp show.

┏━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Experiment    ┃ avg_prec ┃ roc_auc ┃ train.n_est┃ train.min_split ┃
┑━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
β”‚ workspace     β”‚  0.56191 β”‚ 0.93345 β”‚ 50         β”‚ 2               β”‚
β”‚ master        β”‚  0.55259 β”‚ 0.91536 β”‚ 50         β”‚ 2               β”‚
β”‚ β”œβ”€β”€ exp-bfe64 β”‚  0.57833 β”‚ 0.95555 β”‚ 50         β”‚ 8               β”‚
β”‚ β”œβ”€β”€ exp-b8082 β”‚  0.59806 β”‚ 0.95287 β”‚ 50         β”‚ 64              β”‚
β”‚ β”œβ”€β”€ exp-c7250 β”‚  0.58876 β”‚ 0.94524 β”‚ 100        β”‚ 2               β”‚
β”‚ β”œβ”€β”€ exp-b9cd4 β”‚  0.57953 β”‚ 0.95732 β”‚ 100        β”‚ 8               β”‚
β”‚ β”œβ”€β”€ exp-98a96 β”‚  0.60405 β”‚  0.9608 β”‚ 100        β”‚ 64              β”‚
β”‚ └── exp-ad5b1 β”‚  0.56191 β”‚ 0.93345 β”‚ 50         β”‚ 2               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

We are working on developing experiments to have features or documented patterns explicitly for grid search support, so definitely share any feedback to help drive the future direction of that!

When importing/getting data from a repo, how do I provide credentials to the source repo remote storage without saving it into that Git repo?

There's a bit of context behind this question that might give it more meaning. Here's the background information given by @EdAb in Discord:


I set up a private GitHub repo to be a data registry and I have set up a private Azure remote where I have pushed some datasets.

I am now trying to read those datasets from another repository ("my-project-repo"), using dvc get (e.g. dvc get git@github.com:data-registry-repo.git path/data.csv) but I get this error:

ERROR: failed to get 'path/data.csv' from 'git@github.com:data-registry-repo.git' - Authentication to Azure Blob Storage via default credentials (https://azuresdkdocs.blob.core.windows.net/$web/python/azure-identity/1.4.0/azure.identity.html#azure.identity.DefaultAzureCredential) failed.
Learn more about configuration settings at <https://man.dvc.org/remote/modify>: unable to connect to account for Must provide either a connection_string or account_name with credentials!!

Generally, there are two ways solve this issue:

  • ENV vars
  • Setup some options using the --global or --system flags to update the DVC config

If you're going to update the DVC config to include your cloud credentials, use the dvc remote modify command. Here's an example of how you can do that with Azure using the --global flag.

$ dvc remote modify --global myremote connection_string 'mysecret'

You should initialize myremote in the config file with dvc remote add and remove the URL to rely on the one that comes from the repo being imported.

This will modify the global config file, instead of the .dvc/config file. You could also use the --system flag to modify the system file if that's necessary for your project. You can take a look at the specific config file locations here.

Is there any way to ensure that dvc import uses the cache from the config file and how can I keep the cache consistent for multiple team members?

This is another great question where a little context might be useful.


I'm trying to import a dataset project called dvcdata into another DVC project.

The config for dvcdata is:

[core]
    remote = awsremote
[cache]
    type = symlink
    dir = /home/user/dvc_cache
['remote "awsremote"']
    url = s3://...

When I run dvc import git@github.com:user/dvcdata.git my_data, it starts to download it. I have double checked that I have pushed this config file to master and don't understand why it's not pulling the data from my cache instead of downloading the data again.


The repo you are importing into has its own cache directory. If you want to use the same cache directory across both projects, you have to configure cache.dir in both projects. You also have the option to configure the cache.type.

You can set up the cache dir and cache link type in your own global config and then when project 1 imports dvcdata, it will be cached there. Finally when project 2 imports dvcdata, it will just be linked or copied, depending on the config, from the cache without downloading.

We recommend you use the --global or --system flags in the dvc config command for updating the configs globally. An example of this would be:

$ dvc config --global cache.dir path/to/cache/

If you set up a cache that is not shared and located on a separate volume and you have a lot of data - consider also enabling symlinks as described here - Large Data Optimizations

You might also consider using the local URL of the source project to avoid the import downloading from the remote storage. That would look something like this:

$ dvc import /home/user/dvcdata my_data

If your concern is keeping these configs consistent for multiple users on the same machine, check out the doc on shared server development to get more details!

CML

I have an ML model that retrains every 24 hours with updated data, but I do not want to create a merge request every time. I just need a nice way to look at the results. Is there a solution on how to report the results of a pipeline in Gitlab?

Great question! CML doesn't currently have a feature that takes care of this, but here are a couple of solutions (only one is needed):

  1. Keep a separate branch with unrelated history for committing the reports.
  2. Keep a single report file on the repository and update it with each commit.

I've run into an error trying to get CML to orchestrate runs in my AWS account. It doesn't seem to be a permissions issue as the AWSEc2FullAccess policy seems to have worked, but I can't see the security group. What could be going wrong?

Check to make sure you are deploying to the correct region. Use the argument --cloud-region <region> (us-west for example) to mark the region where the instance is deployed.

Head to these docs for more information on the optional arguments that the CML runner accepts.

Until next month…

Join us in Discord to get all your DVC and CML questions answered and contribute to the MLOps community! πŸš€

Subscribe for updates. We won't spam you.