April '20 Community Gems

A roundup of technical Q&As from the DVC community. This month, we discuss the DVC cache, cloud storage options, and concurrency.

  • Elle O'Brien
  • April 16, 2020 · 4 min read

Discord gems

Here are some Q&As from our Discord channel that we think are worth sharing.

Q: How can I view and download files that are being tracked by DVC in a repository?

To list the files being tracked by DVC and Git in a project repository, you can use dvc list. This will display the contents of that repository, including .dvc files. To download the contents corresponding to a particular .dvc file, use dvc get.

Let's consider an example using both commands. Assume we're working with DVC's data registry example repository. To list the files present, run:

$ dvc list -R https://github.com/iterative/dataset-registry
.gitignore
README.md
get-started/.gitignore
get-started/data.xml
get-started/data.xml.dvc
...

Note the -R flag, which enables dvc list to display the contents of directories inside the repository. Now assume you want to download data.xml, which we can see is being tracked by DVC. To download the dataset to your local workspace, you would then run:

$ dvc get https://github.com/iterative/dataset-registry get-started/data.xml

For more examples and information, see the documentation for dvc list and dvc get.

Q: I'm setting up cloud remote storage for DVC and I'd like to forbid dvc gc --cloud so users can't accidentally delete files in the remote. Will it be sufficient to restrict deletion in the remote's settings?

You're right to be careful, because dvc gc --cloud can be dangerous in the wrong hands: it'll remove any unused files in your remote (for more info, see our docs). To prevent users from having this power, setting your bucket policy to block object deletions should do the trick. How to do this depends on your cloud storage provider; we found some relevant docs for GCP, S3, and Azure. For the full list of supported remote storage types, see here.
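
For instance, here's a minimal sketch of an S3 bucket policy that denies object deletion. The bucket name my-dvc-remote is hypothetical, and you should review the policy against your own access requirements before applying anything like it:

$ cat > no-delete-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyObjectDeletion",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:DeleteObject",
      "Resource": "arn:aws:s3:::my-dvc-remote/*"
    }
  ]
}
EOF
$ aws s3api put-bucket-policy --bucket my-dvc-remote \
    --policy file://no-delete-policy.json

With this policy in place, dvc push and dvc pull still work normally, but dvc gc --cloud should fail when it tries to delete objects.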

Q: My team is interested in DVC, and we have all of our data in remote storage. Do we need to install a centralized enterprise version of DVC on a dedicated server? And do we also need a GitHub repository?

There's no need for a DVC server. DVC remote storage works on top of most kinds of cloud storage by default, including S3, Google Cloud Storage, Azure, Google Drive, and Aliyun, with no additional infrastructure required. As for GitHub (or Bitbucket, or GitLab, etc.), this is only needed if you're interested in sharing your project with others over that channel. We like sharing projects on GitHub, but you don't have to. Any Git repository, even a local one, will do.

So a "minimal" DVC project for you might consist of a local workspace with Git enabled (which you do need), a local Git repository, and your S3 remote storage. Check out our use cases to see some examples of infrastructure and workflow for teams.

Q: Could there be any issues with concurrent dvc push-es to the same remote?

There are a few ways concurrency can occur: multiple jobs running in parallel on the same machine, or different users pushing from different machines. In any case, the answer is the same: there's nothing to worry about! When pushing a file to a DVC remote, all operations are non-destructive and atomic.
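
The reason is that files in a DVC remote are addressed by a hash of their contents, so overlapping pushes upload the same objects and skip anything that already exists. For example, pushing from two machines at once is safe:

$ dvc push    # machine A
$ dvc push    # machine B, possibly at the same time

If you want to parallelize uploads within a single push, dvc push also accepts a --jobs flag to set the number of concurrent upload threads.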

Q: How do I only download part of my remote repository? For example, I only need the final output of my pipeline, not the raw data or intermediate steps.

We support granular operations on DVC project repositories! Say your project contains several .dvc files corresponding to different stages of your pipeline: 0_process_data.dvc, 1_split_test_train.dvc, and 2_train_model.dvc. If you're only interested in the files output by the final stage of the pipeline (2_train_model.dvc), you can run:

$ dvc pull 2_train_model.dvc

You can also use dvc pull at the level of individual files. This might be needed if your DVC pipeline creates 10 outputs, for example, and you only want to pull one (say, model.pkl, your trained model) from remote DVC storage. You'd simply run:

$ dvc pull model.pkl

Q: How can I remove a .dvc file, but keep the associated files in my workspace?

Sometimes, you realize you don't want to put a file under DVC tracking after all. That's okay, and it's easy to fix. Simply remove the .dvc file like any other file: rm <file>.dvc. DVC will then stop tracking the file, and the associated target file will still be in your local workspace. Note that the file's contents will still be in your DVC cache unless you clear it with dvc gc.
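
For example, to stop tracking a hypothetical data.xml while keeping it in your workspace:

$ rm data.xml.dvc
$ dvc gc

Depending on your DVC version, dvc gc may require a scope flag (such as -w for --workspace) to confirm which files should be kept.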

Q: I'm trying to move a stage file with dvc move, but I'm getting an error. What's going on?

The dvc move command renames a file or directory and simultaneously updates its corresponding .dvc file. It's handy because it keeps DVC in sync when you rename a tracked file in your local workspace (see an example here). However, the command doesn't work on "stage files" from DVC pipelines. There's currently no easy way to safely move dvc.yaml files, and it's an open issue we're working on. Until then, you can manually update dvc.yaml, or make a new one in the desired location.
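
For an ordinary tracked data file, though, dvc move works as expected. A quick sketch, using a hypothetical data.csv:

$ dvc move data.csv data/data.csv

This renames the file in your workspace and updates the corresponding .dvc file so DVC keeps tracking it at the new location.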

Q: I just started using DVC and noticed that when I dvc push files to remote cloud storage, the directory in my remote looks like my DVC cache, not my local workspace directory. Is this right?

Yep, that's exactly how it should be! In order to provide deduplication and some other optimizations, your DVC remote's directory structure mirrors the DVC cache (which is by default in your local workspace under .dvc/cache). Effectively, DVC uses your Git repository to store .dvc files, which act as keys for cache files on your remote. So looking inside your remote won't be particularly enlightening if you're looking for human-readable filenames: the file names will look like hashes (because, well, they are). Luckily, DVC handles all the conversions between the filenames in your local workspace and these hashes.
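
For instance, peeking inside the local cache (the remote mirrors the same layout) might show something like this, where the hash is hypothetical; DVC splits each file's hash into a two-character directory name plus the remainder as the file name:

$ tree .dvc/cache
.dvc/cache
└── a3
    └── 04afb96060aad90176268345e10355

The .dvc file committed to Git records that same hash, which is how DVC maps a human-readable name like data.xml back to its cached contents.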

To get some more intuition about this, check out some of our docs about how DVC organizes files.
