A roundup of technical Q&As from the DVC community. This month, we discuss using CI/CD to validate models, advanced DVC pipeline scenarios, and how CML adds pictures to your GitHub and GitLab comments.
Here are some of our top Q&As from around the community. With the launch of CML earlier in the month, we've got some new ground to cover!
You can think of your DVC remote as similar to your Git remote, but for data and model artifacts: it's a place to back up and share artifacts. It also gives you a way to push and pull those artifacts to and from your team.
Your DVC cache (by default, it's located in
.dvc/cache) serves a similar
purpose to your Git objects database (which is by default located in
.git/objects). They're both local caches that store files (including various
versions of them) in a content-addressable format, which helps you quickly
check out different versions into your local workspace. The difference is that
.dvc/cache is for data/model artifacts, and
.git/objects is for code.
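
To make "content-addressable" concrete, here is a minimal sketch of how a file's location in such a store can be derived from its contents. It assumes an MD5-based layout in which the first two hex characters of the hash name a subdirectory and the remaining characters name the file. DVC caches have commonly been laid out this way, but the exact layout may differ between DVC versions, and cache_path is a hypothetical helper, not a DVC command.

```shell
#!/bin/sh
# cache_path FILE [CACHE_DIR]
# Hypothetical helper: prints where a content-addressed store would keep FILE.
# Assumed layout: <cache>/<first 2 hex chars of MD5>/<remaining 30 chars>.
cache_path() {
    file="$1"
    cache_dir="${2:-.dvc/cache}"
    # Hash the contents only (reading from stdin keeps the file name out of it).
    sum=$(md5sum < "$file" | cut -d' ' -f1)
    prefix=$(printf '%s' "$sum" | cut -c1-2)
    rest=$(printf '%s' "$sum" | cut -c3-)
    printf '%s/%s/%s\n' "$cache_dir" "$prefix" "$rest"
}
```

Two files with identical contents map to the same path, which is why storing many versions of a dataset does not duplicate the files that stayed unchanged.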
Usually, your DVC remote is a superset of .dvc/cache: everything in your cache
is a copy of something in your remote. The two can differ, though: your remote
may hold files that you have never pulled locally, and your cache may hold
files that you have not yet pushed.
In theory, if you are using an external cache- meaning a DVC cache configured on a separate volume (like NAS, large HDD, etc.) outside your project path- and all your projects and all your teammates use that external cache, and you know that the storage is highly reliable, you don't need to also have a DVC remote. If you have any doubts about access to your external cache or its reliability, we'd recommend also keeping a remote.
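
Pointing a project at an external cache is a configuration step. As a hedged sketch, assume the shared volume is mounted at /mnt/shared/dvc-cache (a placeholder path) and that teammates belong to a common Unix group. use_external_cache is a hypothetical wrapper around the dvc cache dir and dvc config commands.

```shell
#!/bin/sh
# Placeholder location of the shared volume (an assumption for illustration).
SHARED_CACHE="/mnt/shared/dvc-cache"

# use_external_cache DIR
# Hypothetical wrapper: point the current project's DVC cache at DIR and
# make cached files group-writable so teammates can share them.
use_external_cache() {
    dvc cache dir "$1" || return 1
    dvc config cache.shared group
}

# Usage, from inside a DVC project:
#   use_external_cache "$SHARED_CACHE"
```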
Yes! There are two approaches. We'll be assuming you have a pipeline stage that
outputs a file, myfile. If you haven't created the stage with
dvc run yet, then you'll do it like this:
$ dvc run -n <stage name> -d <dependency> -O myfile
Note that instead of using the flag
-o to specify the output, we use
-O, which is shorthand for
--outs-no-cache. You can
read about this flag in our docs.
Alternatively, if the stage already exists, you can edit
dvc.yaml and manually add the field
cache: false to the stage's output as follows:
outs:
  - myfile:
      cache: false
Please note one special case: if you previously enabled hardlinks or symlinks in
dvc config cache, you may need to run
dvc unprotect myfile to fully unlink
myfile from your DVC cache. If you haven't enabled these types of file
links (and if you're not sure, you probably didn't!), this step is unnecessary.
See our docs for more.
Yes, this is straightforward: you change your parameters file to the one you want in
your workspace (params.json in this example), and then use it in dvc run:
$ dvc run -p params.json:myparam ...
Alternately, if your pipeline stage has already been created, you can manually
edit your dvc.yaml file to replace the parameters file it references.
For more about parameters,
see our docs.
We don't know of any published guide. One of our users shared their procedure for disabling LFS:
$ git lfs uninstall
$ git rm .gitattributes
$ git rm .lfsconfig
Note that, if you're going to delete any LFS files, make sure you're certain the corresponding data has been transferred to DVC.
We don't have special support for this use case, and there may be some security downsides to using a confidential validation dataset with someone else's code (be sure nothing in their code could expose your data!). But, there are ways to implement this if you're sure about it.
One possible approach is to create a separate "data registry" repository using a
private cloud bucket to store your validation dataset
(see our docs about the why and how of data registries).
Your CI system can be set up to have access to the data registry via secrets
(called "variables" in GitLab). Then, when you run validation via
dvc repro validate, you could use
dvc get to pull the private data from the data registry.
The data is never exposed to the user in an interactive setting, only on the runner- and there it's ephemeral, meaning it does not exist once the runner shuts down.
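
As a rough sketch of what the validation job on the runner might do: the registry URL, dataset path, and output directory below are placeholders, run_validation is a hypothetical wrapper, and the runner is assumed to already hold credentials for the private bucket through CI secrets.

```shell
#!/bin/sh
# Placeholders (assumptions for illustration, not from the original post):
REGISTRY_URL="https://github.com/example-org/data-registry"
DATA_PATH="validation/data"
OUT_DIR="data/validation"

# Hypothetical wrapper for the CI job's validation step.
run_validation() {
    # Download the private dataset from the registry; it exists only on the runner.
    dvc get "$REGISTRY_URL" "$DATA_PATH" -o "$OUT_DIR" || return 1
    # Reproduce the pipeline stage named "validate".
    dvc repro validate
}
```

Because the download happens inside the job, contributors never need read access to the bucket themselves.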
If your workflow is set to trigger on a push (as in the CML use cases), it isn't
enough to git commit locally: you need to push to your GitHub or GitLab
repository. If you want every commit to trigger your workflow, you'll need to
push each one!
What if you don't want a push to trigger your workflow? In GitLab, you
can use the
[ci skip] flag: make sure
your commit message contains
[ci skip] or
[skip ci], and GitLab CI won't run
the pipeline.
At the time of writing, GitHub Actions doesn't support this flag, so you can manually cancel any workflows in the Actions dashboard. For a programmatic fix, check out this workaround by Tim Heuer.
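
One way to approximate a skip flag yourself is to check the latest commit message at the start of a job and exit early when it contains a skip marker. The sketch below is an illustration only, not necessarily the approach taken in the linked workaround; should_skip_ci is a hypothetical helper name.

```shell
#!/bin/sh
# should_skip_ci MESSAGE
# Hypothetical helper: succeeds (exit status 0) when MESSAGE contains
# "[ci skip]" or "[skip ci]", and fails otherwise.
should_skip_ci() {
    case "$1" in
        *"[ci skip]"*|*"[skip ci]"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Possible use as the first step of a job:
#   if should_skip_ci "$(git log -1 --pretty=%B)"; then
#       echo "Skip marker found, not running the workflow."
#       exit 0
#   fi
```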
Definitely! This is a desirable workflow in several cases:
CML is very flexible, and one strong use case is sanity checking and evaluating a model in a CI system post-training. When you have a model that you're satisfied with, you can check it into your CI system and use CML to evaluate the model in a production-like environment (such as a custom Docker container) and report its behavior and informative metrics. Then you can decide if it's ready to be merged into your main branch.
Definitely. This is what
dvc metrics diff is for- like a
git diff, but for
model metrics instead of code. We made a video about how to do this in CML!
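
In a CI job, that comparison can be written into a Markdown report and then posted as a comment. A minimal sketch, assuming the baseline branch is called master and the report file is report.md (both placeholders); write_metrics_report is a hypothetical wrapper around dvc metrics diff.

```shell
#!/bin/sh
# write_metrics_report [BASE_REV] [REPORT_FILE]
# Hypothetical wrapper: writes a Markdown table comparing the current
# workspace's metrics with those at BASE_REV.
write_metrics_report() {
    base="${1:-master}"
    report="${2:-report.md}"
    printf '## Metrics compared to %s\n\n' "$base" > "$report"
    # --show-md asks DVC to print the diff as a Markdown table.
    dvc metrics diff "$base" --show-md >> "$report"
}

# Possible use in a CML workflow step:
#   write_metrics_report master report.md
#   cml-send-comment report.md
```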
cml-publish, it looks like you're uploading published files to
https://asset.cml.dev. Why don't you just save images in the Git repository?
If an image file is created as part of your workflow, it's ephemeral: it doesn't exist outside of your CI runner, and will disappear when your runner is shut down. To include an image in a GitHub or GitLab comment, a link to the image needs to persist. You could commit the image to your repository, but typically, it's undesirable to automatically commit results of a CI workflow.
We created a publishing service to help you host files for CML reports. Under the hood, our service uploads your file to an S3 bucket and uses a key-value store to share the file with you.
This covers a lot of cases, but if the files you wish to publish can't be shared
with our service for security or privacy reasons, you can emulate the
cml-publish function with your own storage. You would push your file to
storage and include a link to its address in your markdown report.
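
A minimal sketch of that do-it-yourself route: upload the file with whatever tool your storage provides, then append a Markdown image link to the report. The URL and file names are placeholders, and append_image is a hypothetical helper.

```shell
#!/bin/sh
# append_image ALT_TEXT URL REPORT_FILE
# Hypothetical helper: appends a Markdown image link to REPORT_FILE.
append_image() {
    alt="$1"
    url="$2"
    report="$3"
    printf '![%s](%s)\n' "$alt" "$url" >> "$report"
}

# Possible use, after uploading plot.png to your own storage
# (for example with your cloud provider's CLI):
#   append_image "Confusion matrix" "https://storage.example.com/plot.png" report.md
#   cml-send-comment report.md
```

The link has to stay readable for as long as people view the comment, so the storage location needs to outlive the CI runner.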