Databricks

Databricks Git folders (Databricks Repos) don't expose the underlying Git repository, so Git-related DVC functionality is not supported inside them (e.g. experiments and the --rev, --all-commits, and --all-tags options). Everything works as normal if you git clone a project yourself or use DVC directly with remote projects.

Setup

%pip install dvc

To work in Databricks Repos, apply this workaround, which tells DVC not to expect a Git repository for this project:

!dvc config core.no_scm true --local

DVC API

You can use your existing DVC projects through the Python API as normal, for example:

import dvc.api

with dvc.api.open(
    'get-started/data.xml',
    repo='https://github.com/iterative/dataset-registry',
) as fobj:
    ...

Secrets

If you need to use secrets to access your data, first add them to Databricks secrets and then use them with DVC, for example:

import dvc.api

remote_config = {
    'access_key_id': dbutils.secrets.get(scope='test_scope', key='aws_access_key_id'),
    'secret_access_key': dbutils.secrets.get(scope='test_scope', key='aws_secret_access_key'),
}

with dvc.api.open(
    'recent-grads.csv',
    repo='https://github.com/efiop/mydataregistry',
    remote_config=remote_config
) as fobj:
    ...
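
If several notebooks need the same credentials, the dictionary above can be built by a small helper. The following is a minimal sketch, not part of DVC or Databricks: the function name `build_remote_config` is hypothetical, and it assumes the secret keys are named as in the example above.

```python
def build_remote_config(get_secret, scope):
    """Build a DVC remote_config dict from a secret getter.

    get_secret is any callable that accepts scope= and key= keyword
    arguments, such as dbutils.secrets.get in a Databricks notebook.
    """
    # Map DVC remote option names to the secret keys used in the example above
    # (the key names are an assumption; adjust them to your own secret scope).
    option_to_key = {
        "access_key_id": "aws_access_key_id",
        "secret_access_key": "aws_secret_access_key",
    }
    config = {}
    for option, key in option_to_key.items():
        value = get_secret(scope=scope, key=key)
        if not value:
            # Fail early instead of passing an empty credential to DVC.
            raise ValueError(f"Secret {key!r} in scope {scope!r} is empty")
        config[option] = value
    return config
```

In a notebook this would be called as `remote_config = build_remote_config(dbutils.secrets.get, "test_scope")` and passed to `dvc.api.open` as before. Taking the getter as an argument keeps the helper usable outside Databricks, where `dbutils` is not defined.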

Running DVC commands

Databricks doesn't provide a classic terminal by default, so you'll need to use magic commands to run DVC commands in your notebook. If your workspace has the web terminal enabled, you can also run DVC commands there as normal.

Example: set up a shared DVC cache on DBFS

!dvc config cache.dir /dbfs/dvc/cache

Example: add data

!dvc add data

If you are working in Databricks Repos, the limitations described at the beginning of this page and the no_scm workaround mean that DVC cannot automatically add new entries to the corresponding .gitignore files, so you'll need to add them manually.
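
The manual step can be done from a notebook cell. The following is a minimal sketch, assuming the tracked path is `data` in the current directory, so the entry DVC would normally write is `/data`. The helper name `add_to_gitignore` is hypothetical and not part of DVC.

```python
from pathlib import Path


def add_to_gitignore(entry, gitignore=".gitignore"):
    """Append entry to a .gitignore file unless it is already listed.

    Returns True if the file was changed, False otherwise.
    """
    path = Path(gitignore)
    text = path.read_text() if path.exists() else ""
    # Skip the write if the exact entry is already present.
    if entry in (line.strip() for line in text.splitlines()):
        return False
    with path.open("a") as f:
        # Start on a fresh line if the file does not end with a newline.
        if text and not text.endswith("\n"):
            f.write("\n")
        f.write(entry + "\n")
    return True


# Hypothetical usage after `!dvc add data`:
# add_to_gitignore("/data")
```

The check makes the call safe to repeat when a cell is re-run, since the entry is written only once.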

Example: import data

!dvc import-url https://archive.ics.uci.edu/static/public/186/wine+quality.zip

Live experiment updates

If you are working in Databricks Repos, you need to set both the DVC_STUDIO_TOKEN and DVC_EXP_GIT_REMOTE environment variables to see live experiment updates in DVC Studio.

import getpass
import os

# Prompt for the token so it is not stored in the notebook source.
os.environ["DVC_STUDIO_TOKEN"] = getpass.getpass()
os.environ["DVC_EXP_GIT_REMOTE"] = "https://github.com/<org>/<repo>"
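
Because both variables are required, it can help to check them before starting a run. The following is a minimal sketch; the helper name `missing_studio_vars` is hypothetical and not part of DVC.

```python
import os

# The two variables named above; both must be set and non-empty.
REQUIRED_VARS = ("DVC_STUDIO_TOKEN", "DVC_EXP_GIT_REMOTE")


def missing_studio_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_studio_vars()
if missing:
    print("Set these before running experiments:", ", ".join(missing))
```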