February '21 Community Gems

A roundup of technical Q&As from the DVC community. This month: best practices for config files, pipeline dependency management, and caching data for CI/CD. Plus a new CML feature to launch cloud compute with Terraform!

  • Elle O'Brien
  • February 26, 2021 · 8 min read

DVC Questions

Q: I noticed I have a DVC config file and a config.local file. What's best practice for committing these to my Git repository?

DVC uses the config and config.local files to link your remote data repository to your project. config is intended to be committed to Git, while config.local is not: it's where you store sensitive information (e.g. your personal credentials for remote storage, such as username, password, and access keys) or settings that are specific to your local environment.

Usually, you don't have to worry about ensuring your config.local file is being ignored by Git: the only way to create a config.local file is to use the --local flag explicitly in commands like dvc remote and dvc config, so you'll know you've made one! And your config.local file is .gitignored by default. If you're concerned, take a look and make sure there are no settings in your config.local file that you actually want in your regular config file.
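As a rough illustration of the split, here is what the two files might look like after running something like dvc remote add -d myremote s3://mybucket/dvcstore followed by dvc remote modify --local myremote access_key_id and secret_access_key. The remote name, bucket URL, and credential placeholders are assumptions made up for this sketch, not values from a real project.

```ini
# .dvc/config (committed to Git): shared, non-sensitive settings
[core]
    remote = myremote
['remote "myremote"']
    url = s3://mybucket/dvcstore

# .dvc/config.local (ignored by Git): only written by commands run with --local
['remote "myremote"']
    access_key_id = <your-access-key-id>
    secret_access_key = <your-secret-access-key>
```

Settings in config.local are layered on top of config, so teammates share the remote URL while each person keeps their own credentials.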

To learn more about config and config.local, read up in our docs.

Q: What's the best way to install the new version of DVC in a Conda environment? I'm concerned about the paramiko dependency.

When you install DVC via conda, it will come with dependencies like paramiko.

The one exception is installing DVC as a Python library with pip: there, you might want to specify the kind of remote storage you need to make sure all dependencies are present (like boto for S3). You can run pip install "dvc[<option>]", with supported options like [s3], [azure], [gdrive], [gs], [oss], [ssh]. Or, use [all] to include them all.

For more about installing DVC and its dependencies, check out our docs.

Q: How do I keep track of changes in modules that my DVC pipeline depends on? For example, I have a pipeline stage that runs a script prepare.py, which imports a module module.py. If module.py changes, how will DVC know to rerun the pipeline stage?

If your DVC pipeline only lists prepare.py as a dependency, then changing code in module files won't trigger a re-run of the pipeline. This means that if you run dvc repro after updating module.py, DVC will simply return the result of your last pipeline run and a message that nothing has changed.

To explain further why this happens:

DVC is platform agnostic and it doesn't know whether your command's executable is python, some other script interpreter, or a compiled binary for that matter.

E.g. this is a valid stage: dvc run -o hello.txt 'echo "Hello!" > hello.txt' (where the executable is echo).

DVC also doesn't know what's going on inside the command's source code. Therefore, any file that your code requires internally should be explicitly specified as a pipeline stage dependency (on the CLI, with dvc run -d; in YAML, under deps:) for DVC to track it.
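For the prepare.py and module.py case from the question, a dvc.yaml stage along these lines would make the dependency explicit. This is a minimal sketch: the stage name prepare and the output path data/prepared are hypothetical, chosen only for illustration.

```yaml
stages:
  prepare:
    cmd: python prepare.py
    deps:
      - prepare.py
      - module.py      # imported by prepare.py; listed so edits to it trigger a rerun
    outs:
      - data/prepared  # hypothetical output path
```

With module.py listed under deps, dvc repro should detect a change to it and re-execute the stage.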

If you're not interested in adding modules as explicit dependencies, there are a few other approaches:

  • Make your requirements.txt file a stage dependency (if the loaded module comes from a package).
  • Manually rebuild the pipeline (with dvc repro --force <stage>.dvc) when you know an unmarked dependency has changed – although this is prone to human error.
  • Have a version/build number comment in the main script that always gets updated when an unmarked dependency changes – this could be automated.

See here for more information on similar use cases.

We also have an ongoing discussion about this issue on our GitHub repository, and we'd love your input. Please join the conversation there if you can!

Q: My DVC pipeline has a lot of dependencies, and I don't want to manually write them all out in my dvc.yaml file. Are there any ways to use wildcards (like *) or specify directories as dependencies?

Yes, you can set a directory to be a dependency or an output of a DVC pipeline stage. This means you can have tens, hundreds, thousands or millions of dependency files in one directory, and all you have to declare in the pipeline is the address of that directory.
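A sketch of what this could look like in dvc.yaml, assuming a hypothetical train stage that reads images from data/images and writes model.pkl (all three names are placeholders for this example):

```yaml
stages:
  train:
    cmd: python train.py
    deps:
      - train.py
      - data/images  # a whole directory; a change to any file inside counts as a change
    outs:
      - model.pkl    # hypothetical output file
```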

Check out all the options here.

CML Questions

Q: I heard there's a new CML feature using Terraform to provision runners. When is this coming out?

You're in luck, because we just shared this feature as part of the CML 0.3.0 pre-release! The pre-release introduces a new function, cml-runner, which upgrades our previous Docker Machine-based method for launching cloud instances from a CI workflow. With cml-runner, which is built on Terraform, you can deploy instances on AWS and Azure with a single command (it used to take about 30 lines of code!). For example, to launch a t2.micro instance on AWS from your GitHub Actions or GitLab CI workflow, you'll run:

cml-runner \
	--cloud aws \
	--cloud-region us-west \
	--cloud-type=t2.micro

Check out the pre-release notes and our example project repository to get started.

Q: My CI workflow creates a report.md document that gets published to my pull request by CML. I want to save the report.md file to my repository, too. Is this possible?

By default, files that are created in a GitHub Actions or GitLab CI workflow only exist on the runner: as soon as the runner turns off, they vanish. Functions like cml-publish and cml-send-comment create persistent links to data visualizations, tables, and other outputs of your workflow so you can view them long after your run ends. However, by design, CML doesn't commit files to your repository (not all users want this!).

What you're likely looking for is an auto-commit, to essentially git add and git commit files generated by the workflow to your repository. You can manually write this code into your workflow file, or you can use a GitHub Action tool like the Auto Commit or Add & Commit Actions.
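If you go the manual route, the shell step could look roughly like the sketch below. It is not a CML feature: the function name commit_generated_file and the ci-bot identity are placeholders invented for this example, and it assumes the job has a branch checked out (not a detached HEAD) and that the runner's credentials allow pushing.

```shell
# Commit a file generated during the CI job back to the current branch.
# Usage: commit_generated_file <path> [commit message]
commit_generated_file() {
  cgf_file="$1"
  cgf_message="${2:-Update $1 from CI}"

  # CI runners usually have no Git identity configured; these are placeholders.
  git config user.name "ci-bot"
  git config user.email "ci-bot@example.com"

  git add -- "$cgf_file"

  # Skip the commit when nothing is staged, so reruns don't fail.
  if git diff --cached --quiet; then
    echo "No changes to commit for $cgf_file"
    return 0
  fi

  git commit -m "$cgf_message"
  git push origin HEAD
}
```

In a workflow you would call it after the step that writes the report, for example commit_generated_file report.md "Add CML report".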

Q: Do you have any suggested caching strategies with CML and DVC? My DVC pipeline runs in a CI workflow, and it depends on ~15 GB of data. I don't want to download this dataset to my runner every time the workflow runs.

Downloading data to a runner on every CI workflow can be needlessly time-consuming, particularly when the data rarely changes.

While we don't have a CML-specific mechanism in the works for this use case, there are two main approaches we see as viable:

  1. Attach an EBS volume to the instance that runs your workflow. If you're using DVC, DVC needs to run in that volume (at the very least, your DVC cache must be there). A user recently let us know that this approach is working well for them and prevents unnecessary re-downloads of their DVC cache. They also recommended this article for setup guidelines.
  2. Use a shared DVC cache. Currently, many DVC users configure their cache in shared NFS. A similar setup that might help here is using a single shared development server; check out our docs for a use case.
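For either approach, the project has to point its cache at the persistent location. As a sketch, running dvc cache dir /mnt/shared/dvc-cache, dvc config cache.shared group, and dvc config cache.type symlink would leave a .dvc/config section roughly like the one below. The mount path is an assumption for illustration, and whether symlinks are the right link type depends on your file system.

```ini
# .dvc/config
[cache]
    dir = /mnt/shared/dvc-cache  # hypothetical path on the EBS volume or NFS share
    shared = group               # make cache files usable by other users in the group
    type = symlink               # link files into the workspace instead of copying them
```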

As always, if you have any use case questions or need support, join us in Discord! Or head to the DVC Forum to discuss your ideas and best practices.

You can also follow us on Twitter and LinkedIn!
