
How can I track a new file added to my data folder if the data folder is already tracked by DVC, yet ignored by Git?
Great question on how DVC handles data tracking from @NgHoangDat!
Since you already track the data folder, when you add a new file into it, all
you need to do is update your DVC history. You can use either dvc add data or
dvc commit to start tracking the new file.
DVC will also only recalculate the changed files. If you add or modify a small number of files in that folder, the update will not take very long.
What would be the best method to get the remote URL of a given dataset inside a Python environment?
Wonderful question from @come_arvis!
You can use the get_url method of the
DVC Python API to do this. Here's an
example of a script you might run to get the remote URL.
import dvc.api
resource_url = dvc.api.get_url(
'get-started/data.xml',
repo='https://github.com/iterative/dataset-registry'
)
print(resource_url)
# https://remote.dvc.org/dataset-registry/a3/04afb96060aad90176268345e10355
This URL is built from the remote URL in the project configuration file,
.dvc/config, and the md5 file hash stored in the .dvc file corresponding
to the data file or directory whose storage location you want.
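To make the URL construction concrete, here is a minimal sketch that joins a remote base URL and an md5 hash the way the example output above suggests: the first two hex characters become a directory and the rest becomes the file name. The helper name build_cache_url is hypothetical, and the layout is inferred from the example URL rather than taken from DVC's source. Other DVC versions may lay out storage differently, so treat dvc.api.get_url as the source of truth.

```python
def build_cache_url(remote_url: str, md5: str) -> str:
    """Join a remote base URL and an md5 hash into a storage URL.

    Hypothetical helper: the two-character directory split is an
    assumption inferred from the example output above.
    """
    return f"{remote_url.rstrip('/')}/{md5[:2]}/{md5[2:]}"


# Values taken from the example output above.
url = build_cache_url(
    "https://remote.dvc.org/dataset-registry",
    "a304afb96060aad90176268345e10355",
)
print(url)
```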
I'm excited about MLEM helping expose API endpoints to our model, but heard it was experimental. Where can I learn more about how to deploy models with this tool?
Great question from @raveman^2!
There are a few ways you can expose API endpoints to your model:
- Run mlem serve to generate a FastAPI endpoint with your model.
- Export the model as a Python package for your own custom-built API.
- Use the experimental deployment to Heroku.
You can find more details here in the MLEM docs: https://mlem.ai/doc/get-started
You can also see an example of deploying a model with MLEM in this blog post tutorial.
How do I revert a dvc add command to stop tracking data?
This is a good question from @Nwoke!
If you have accidentally added the wrong directory or files for DVC to track,
you can easily remove them with the dvc remove command. This is used to remove
the .dvc file and ensure that the original data file is no longer being
tracked. Here's an example of this command being used:
$ dvc remove data.csv.dvc
Sometimes when you stop tracking data, you also want to remove it from your
cache. You can do this with the dvc gc command, which removes all unused data
from the cache, not just the target of dvc remove. If you want to remove all
of the data and its previous versions from the cache, you can do that with the
following command:
$ dvc gc -w
The -w option only keeps the files and directories referenced in the
workspace, so once you have removed the data you don't want to track, this is
how DVC knows what to keep and what to discard.
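One way to picture what dvc gc -w decides is as a set difference between what sits in the cache and what the workspace still references. The sketch below is only a conceptual model under that assumption, not DVC's implementation. The function name plan_gc and the short hashes are made up for illustration.

```python
def plan_gc(cached: set, referenced_in_workspace: set) -> tuple:
    """Split cached hashes into those to keep and those to discard.

    Conceptual model only: anything the workspace no longer references
    is treated as garbage.
    """
    keep = cached & referenced_in_workspace
    discard = cached - referenced_in_workspace
    return keep, discard


# Made-up hashes: one is still referenced, two are left over from removed data.
cached = {"a304af", "b1c2d3", "ffee00"}
referenced = {"a304af"}

keep, discard = plan_gc(cached, referenced)
print("keep:", sorted(keep))
print("discard:", sorted(discard))
```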
You can learn more about removing tracked data in the docs here.
Is it normal for the outs of a stage to be removed when dvc repro is run?
Fantastic question from @Nish!
This is the expected behavior of DVC. It removes the outs of a stage before
running it unless the persist: true value is set for that output. You can
learn more about how this works in our docs here.
Here's an example of a stage with the persist value set.
stages:
  train:
    cmd: date > data/external/date
    outs:
      - data/external:
          persist: true
Even if you don't persist your outs, you can still check out an older version
of the pipeline to get older outs with dvc checkout. This is based on what's
in the dvc.lock and .dvc files and it will update your workspace to match
the experiment you check out. This is usually run after checking out a different
Git branch. So the flow might look like:
$ git checkout experiment-branch
$ dvc checkout
git checkout restores the dvc.lock and .dvc files for the experiment you want
to go back to from your Git history. dvc checkout then brings your data to the
matching version so you can reproduce your entire experiment. You can learn
more about these details in the dvc checkout docs here.
Is there a way to have a plot with multiple y-axes?
Wonderful question from @shortcipher3!
If you update DVC to version 2.12.1 or higher, you should be able to define
multiple y-axes in your DVC pipeline. Here's an example of how this may look in
a dvc.yaml:
# dvc.yaml
stages: ...
plots:
  some_file.csv:
    x: x_column_name
    y: [col1, col2, col3]
  # alternative 1:
  multiple_rocs:
    x: x_column_name
    y:
      some_file.csv: [col1, col2, col3]
  # in case of multiple files:
  multiple_rocs_from_multiple_files:
    x: x_column_name
    y:
      file1.csv: [col1, col2]
      file2.csv: [col3]
A quick note: make sure that plots is on the same level as stages in your
dvc.yaml file.
How do you structure the dvc.yaml file to run in stages in a specific order?
Awesome question from @srb302!
You would need to set up outputs and dependencies for each stage. A stage that runs first would generate an output, and the stage that is supposed to run second would use the first stage's output as a dependency.
Otherwise, DVC does not guarantee any particular execution order for stages that are independent of each other. DVC determines the structure of your DAG based on file outputs and dependencies, and there isn't another way to enforce the order of stage execution in DVC.
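To see how outputs and dependencies imply an order, here is a small sketch that derives a run order from stage definitions with Python's standard graphlib module (Python 3.9+). It illustrates the general idea of a DAG built from file links and is not DVC's actual scheduler. The stage names and file paths are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical stages: each one lists the files it reads and writes.
stages = {
    "prepare": {"deps": ["data/raw.csv"], "outs": ["data/prepared.csv"]},
    "train": {"deps": ["data/prepared.csv"], "outs": ["model.pkl"]},
    "evaluate": {"deps": ["model.pkl", "data/prepared.csv"], "outs": ["metrics.json"]},
}


def execution_order(stages: dict) -> list:
    """Order stages so every producer runs before its consumers."""
    # Map each output file to the stage that produces it.
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    # A stage depends on whichever stages produce its dependencies.
    graph = {
        name: {producer[dep] for dep in s["deps"] if dep in producer}
        for name, s in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(stages)
print(order)
```

Stages that share no files would have no edge between them in this graph, which is why their relative order is not guaranteed.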
How do I know when I should track a file with Git or DVC?
This is a really good question from @vadim.sukhov!
Let's take a look at an example dvc.yaml.
stages:
  evaluate:
    ...
    plots:
      - prc.json:
          cache: false
          x: recall
          y: precision
      - roc.json:
          cache: false
          x: fpr
          y: tpr
In this scenario, the prc.json and roc.json files are not being tracked
by DVC because of the cache: false value. Since these files aren't tracked by
DVC, they aren't saved to a remote storage location outside of Git, like data
files are. So if you have cache: false on a file that you want to keep track
of, you'll need to commit it to your project with Git.
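As a rough aid for applying this rule, the sketch below walks a list of outputs shaped like the plots entries above and separates the paths DVC caches from the ones you would commit with Git. It assumes the stage has already been loaded into plain Python lists and dicts, and that cache defaults to true when it is not set. The function name split_by_tracker and the model.pkl entry are hypothetical.

```python
def split_by_tracker(outputs: list) -> tuple:
    """Return (dvc_tracked, git_tracked) paths for a list of outputs."""
    dvc_tracked, git_tracked = [], []
    for entry in outputs:
        if isinstance(entry, str):
            # A bare path has no options, so the default applies.
            path, opts = entry, {}
        else:
            # A single-key mapping of path -> options.
            path, opts = next(iter(entry.items()))
            opts = opts or {}
        if opts.get("cache", True):
            dvc_tracked.append(path)
        else:
            git_tracked.append(path)
    return dvc_tracked, git_tracked


plots = [
    {"prc.json": {"cache": False, "x": "recall", "y": "precision"}},
    {"roc.json": {"cache": False, "x": "fpr", "y": "tpr"}},
    "model.pkl",  # hypothetical cached output, added for contrast
]

dvc_tracked, git_tracked = split_by_tracker(plots)
print("DVC:", dvc_tracked)
print("Git:", git_tracked)
```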

Check out our docs to get all your DVC, CML, and MLEM questions answered!
Join us in Discord to chat with the community!
📰 Join our Newsletter to stay up to date with news and contributions from the Community!
