A roundup of technical Q&A's from the DVC community. This month: DVC pipeline configs, working with remotes, file handling and more.
Thanks to @PythonF from Discord for asking this question that led to this Gem! 💎
DVC experiments use custom Git refs internally, similar to the way GitHub handles PRs: each experiment is a custom DVC Git ref pointing to a Git commit. Here's an example.
```
$ git show-ref exp-26220
c42f48168830148b946f6a75d1bdbb25cda46f35 refs/exps/f1/37703af59ba1b80e77505a762335805d05d212/exp-26220
```
If you want to see your local experiments (that have not been pushed), you can run `dvc exp list --all`.
You can read more about how we handle our custom Git refs in this blog post.
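You can try the underlying mechanism yourself with plain Git. The sketch below uses only standard Git commands; the ref name is made up for illustration and is not one DVC would actually generate:

```shell
# Create a throwaway repo with one commit
workdir=$(mktemp -d) && cd "$workdir"
git init -q demo && cd demo
git -c user.email=a@b -c user.name=demo commit -q --allow-empty -m "baseline"

# Point a custom ref (hypothetical name) at the commit, the way DVC
# points refs under refs/exps/ at experiment commits
git update-ref refs/exps/demo/exp-example HEAD

# Custom refs are listed like any other ref
git for-each-ref refs/exps
```

Because these are ordinary Git refs, all the usual plumbing (`show-ref`, `for-each-ref`, fetch/push refspecs) works on them.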
Thanks to @Chandana for asking this question about experiments!
Yes! You can quickly look at all of the experiments in any repo with:
$ dvc exp list --all <git repo URL>
$ dvc exp list --all <git remote>
Thanks again @Chandana for this gem!
Another great question from @Chandana!
Right now, we support GitHub and GitLab.
Azure DevOps and GCP (Google Cloud Platform) support are on the roadmap. Stay tuned for more updates!
I ran `dvc push` to my DVC remote, but there are a few files that couldn't be pushed at the time. If I run `dvc push` again, will it just upload the missing files?
Thanks for the question @petek!
Yes! You can just re-run `dvc push` and it will only upload the missing files.
It might be a little slower than you would expect because DVC has to do some checks to make sure that the other files were uploaded successfully before, but as far as the actual data transfer goes, only the missing files will be uploaded.
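If you want to see which cache files haven't made it to the remote before re-pushing, `dvc status -c` (short for `--cloud`) compares your local cache against the remote:

```
$ dvc status -c
$ dvc push
```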
Thanks for such a great question @LucZ!
These are good questions for common problems in MLOps from @Phoenix!
To answer the first part, say you are getting new data every week. When you use DVC, you don't have to worry about getting duplicate data.
DVC supports file-level deduplication right now, so if your data is in the shape of a directory of files, each unique file will only be stored once. Chunk-level deduplication is on our todo list. You can see how it's going in this issue we have on GitHub.
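To illustrate what file-level deduplication means, here is a toy sketch of a content-addressed store in plain shell. This simplifies DVC's real cache layout (which shards objects into subdirectories by hash prefix), but the principle is the same:

```shell
# Toy content-addressed cache: files are stored under their content hash,
# so identical content is stored exactly once (simplified vs. DVC's layout)
workdir=$(mktemp -d) && cd "$workdir"
mkdir -p cache data
echo "same bytes" > data/week1.csv
cp data/week1.csv data/week2.csv        # new week, duplicate content

for f in data/*; do
  hash=$(md5sum "$f" | cut -d' ' -f1)
  # skip the copy if this content is already cached
  [ -e "cache/$hash" ] || cp "$f" "cache/$hash"
done

ls cache | wc -l                         # -> 1: duplicates collapse
```

Both weekly files hash to the same object, so the cache holds a single copy no matter how many times the same content reappears under new names.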
For the second part of the question, you can use data management with DVC and have your own pipelines. Just treat it as Git for data, then be sure to `dvc pull` and you should be set. Hooks, like `post-pipeline-run`, are a good way to go about it.
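As one sketch of the hook approach: `post-merge` is a standard Git hook name, and running `dvc pull` from it is just one reasonable choice (DVC's `dvc install` command can also set up its own Git hooks for you). The hook body below assumes `dvc` is on the `PATH` in the hook's environment:

```shell
# In a Git+DVC repo: pull tracked data automatically after every merge
workdir=$(mktemp -d) && cd "$workdir"
git init -q .

cat > .git/hooks/post-merge <<'EOF'
#!/bin/sh
# Fetch the DVC-tracked files recorded by the newly merged commits
exec dvc pull
EOF
chmod +x .git/hooks/post-merge
```

With this in place, teammates who merge a branch that updates `.dvc` files get the matching data without having to remember a separate pull step.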
When you have a remote that is not on your default AWS profile, and you access it via the `awscli` using something like `aws s3 --profile=second_profile ls`, you'll need to update your remote config.
You can run a command like:
$ dvc remote modify myremote profile myprofile
Check out the docs on `dvc remote modify` for all the remote config options.
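After running the command above, your `.dvc/config` should contain something along these lines (the remote name, bucket URL, and profile name here are placeholders):

```
['remote "myremote"']
    url = s3://my-bucket/path
    profile = myprofile
```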
Great question @Avi!
At our July Office Hours Meetup we will be demoing pipelines as well as CML. RSVP for the Meetup here to stay up to date with specifics as we get closer to the event!
Join us in Discord to get all your DVC and CML questions answered!