dvc.api.get_url()

Returns the URL to the storage location of a data file or directory tracked in a DVC project.

def get_url(path: str,
            repo: str = None,
            rev: str = None,
            remote: str = None) -> str

Usage:

import dvc.api

resource_url = dvc.api.get_url(
    'get-started/data.xml',
    repo='https://github.com/iterative/dataset-registry')

# resource_url is now "https://remote.dvc.org/dataset-registry/a3/04afb96060aad90176268345e10355"

Description

Returns the URL string of the storage location (in a DVC remote) where a target file or directory, specified by its path in a repo (DVC project), is stored.

The URL is formed by reading the project's remote configuration and the dvc.yaml or .dvc file where the given path is found (outs field). The scheme of the returned URL (for example https:// or s3://) depends on the type of remote used (see the Parameters section).

If the target is a directory, the returned URL will end in .dir. Refer to Structure of cache directory and dvc add to learn more about how DVC handles data directories.
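Because directory targets are recognizable by this suffix, a caller can check the returned string before deciding how to handle it. The following is a minimal sketch; the helper name is_directory_url and the sample directory URL are hypothetical and not part of dvc.api.

```python
def is_directory_url(url: str) -> bool:
    """Return True if a URL returned by get_url() refers to a tracked directory."""
    # Per the description above, URLs for directories end in ".dir".
    return url.endswith(".dir")


# File URL taken from the usage example on this page.
file_url = "https://remote.dvc.org/dataset-registry/a3/04afb96060aad90176268345e10355"
# Made-up URL, used only to illustrate the ".dir" suffix.
dir_url = "https://remote.dvc.org/dataset-registry/ab/cdef0123456789.dir"

print(is_directory_url(file_url))
print(is_directory_url(dir_url))
```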

โš ๏ธ This function does not check for the actual existence of the file or directory in the remote storage.

💡 With the resource's URL, you can download it directly using an appropriate library, such as boto3 or paramiko.
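As a rough illustration of a direct download, the sketch below dispatches on the URL scheme. The helpers split_s3_url and download are hypothetical names, not part of dvc.api. The S3 branch assumes boto3 is installed and credentials are configured, and only the HTTP(S) and S3 cases are covered.

```python
from urllib.parse import urlparse
from urllib.request import urlretrieve


def split_s3_url(url: str) -> tuple:
    """Split an s3://bucket/key URL into (bucket, key)."""
    parsed = urlparse(url)
    if parsed.scheme != "s3":
        raise ValueError(f"not an S3 URL: {url}")
    return parsed.netloc, parsed.path.lstrip("/")


def download(url: str, dest: str) -> None:
    """Download the object at `url` to the local path `dest`."""
    scheme = urlparse(url).scheme
    if scheme in ("http", "https"):
        # Plain HTTP(S) remotes can be fetched with the standard library.
        urlretrieve(url, dest)
    elif scheme == "s3":
        # Imported lazily so that HTTP-only users do not need boto3.
        import boto3

        bucket, key = split_s3_url(url)
        boto3.client("s3").download_file(bucket, key, dest)
    else:
        raise NotImplementedError(f"no handler for scheme {scheme!r}")
```

Since get_url() does not verify that the object exists, a real caller would also want to handle download errors such as a missing key or an HTTP 404.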

Parameters

  • path - location and file name of the file or directory in repo, relative to the project's root.
  • repo - specifies the location of the DVC project. It can be a URL or a file system path. Both HTTP and SSH protocols are supported for online Git repos (e.g. [user@]server:project.git). Default: The current project is used (the current working directory tree is walked up to find it).
  • rev - Git commit (any revision such as a branch or tag name, or a commit hash). If repo is not a Git repo, this option is ignored. Default: HEAD.
  • remote - name of the DVC remote to use to form the returned URL string. Default: The default remote of repo is used.

Exceptions

  • dvc.exceptions.NoRemoteError - no remote is found.

Example: Getting the URL to a DVC-tracked file

import dvc.api

resource_url = dvc.api.get_url(
    'get-started/data.xml',
    repo='https://github.com/iterative/dataset-registry',
)

print(resource_url)

The script above prints:

https://remote.dvc.org/dataset-registry/a3/04afb96060aad90176268345e10355

This URL represents the location where the data is stored, and is built by reading the corresponding .dvc file (get-started/data.xml.dvc) where the md5 file hash is stored,

outs:
  - md5: a304afb96060aad90176268345e10355
    path: get-started/data.xml

and the project configuration (.dvc/config) where the remote URL is saved:

['remote "storage"']
url = https://remote.dvc.org/dataset-registry
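To make the relationship between these two pieces concrete, the sketch below combines the remote URL and the md5 hash by hand. The helper build_storage_url is hypothetical. It only mirrors the layout visible in the output above (the first two hash characters as a directory, the rest as the file name) and is not DVC's actual implementation. Other remote types or DVC versions may lay out storage differently.

```python
def build_storage_url(remote_url: str, md5: str) -> str:
    """Join a remote URL and an md5 hash following the layout shown above."""
    # First two hex characters form a directory; the remainder is the file name.
    return f"{remote_url.rstrip('/')}/{md5[:2]}/{md5[2:]}"


# Values copied from .dvc/config and get-started/data.xml.dvc in this example.
url = build_storage_url(
    "https://remote.dvc.org/dataset-registry",
    "a304afb96060aad90176268345e10355",
)
print(url)
```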