Opens a tracked file.
def open(path: str, repo: str = None, rev: str = None, remote: str = None, mode: str = "r", encoding: str = None)
    import dvc.api

    with dvc.api.open(
        'get-started/data.xml',
        repo='https://github.com/iterative/dataset-registry'
    ) as fd:
        # ... fd is a file descriptor that can be processed normally.
Open a data or model file tracked in a DVC project and generate a corresponding file object. The file can be tracked by DVC or by Git.
The exact type of file object depends on the mode used. For more details, please refer to Python's open() built-in, which is used under the hood.
dvc.api.open() may only be used as a context manager (using the with keyword, as shown in the examples).
This function makes a direct connection to the remote storage (except for Google Drive), so the file contents can be streamed. Your code can process the data buffer as it's streamed, which optimizes memory usage.
Use dvc.api.read() to load the complete file contents in a single function call, with no context manager involved. Neither function uses disk space.
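As a rough illustration of processing a stream in chunks, the sketch below hashes a binary file object piece by piece, so that only one chunk is held in memory at a time. The helper name sha256_of_stream and the chunk size are assumptions made for this example and are not part of the DVC API. An in-memory io.BytesIO stands in for the file object that dvc.api.open() would yield, assuming the file is opened in binary mode.

```python
import hashlib
import io


def sha256_of_stream(fd, chunk_size=64 * 1024):
    """Hash a binary file object without loading it fully into memory."""
    digest = hashlib.sha256()
    # iter() with a sentinel keeps calling fd.read(chunk_size) until it returns b"".
    for chunk in iter(lambda: fd.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Stand-in for a streamed file (hypothetical data). With DVC, `sample` would
# be the `fd` obtained from `with dvc.api.open(..., mode="rb") as fd:`.
sample = io.BytesIO(b"<data>example</data>" * 1000)
checksum = sha256_of_stream(sample, chunk_size=4096)
```

The same pattern applies to any per-chunk or per-line processing: the function only needs a readable file object, so it does not care whether the bytes come from local disk or from remote storage.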
path - location and file name of the file in repo, relative to the project's root.
repo - specifies the location of the DVC project. It can be a URL or a file system path. Both HTTP and SSH protocols are supported for online Git repos (e.g. [user@]server:project.git). Default: The current project is used (the current working directory tree is walked up to find it).
rev - Git commit (any revision such as a branch or tag name, or a commit hash). If repo is not a Git repo, this option is ignored. Default:
remote - name of the DVC remote to look for the target data. Default: The default remote of repo is used if a remote argument is not given. For local projects, the cache is tried before the default remote.
mode - specifies the mode in which the file is opened. Defaults to "r" (read). Mirrors the namesake parameter in builtin open().
encoding - codec used to decode the file contents to a string. This should only be used in text mode. Defaults to "utf-8". Mirrors the namesake parameter in builtin open().
dvc.exceptions.FileMissingError - file in path is missing from repo.
path cannot be found in repo.
repo is not a DVC project.
Any data artifact hosted online can be processed directly in your Python code with this API. For example, an XML file tracked in a public DVC repo on GitHub can be processed like this:
    from xml.sax import parse
    import dvc.api
    from mymodule import mySAXHandler

    with dvc.api.open(
        'get-started/data.xml',
        repo='https://github.com/iterative/dataset-registry'
    ) as fd:
        parse(fd, mySAXHandler)
Notice that we use a SAX XML parser here because dvc.api.open() is able to stream the data from remote storage. (The mySAXHandler object should handle the event-driven parsing of the document in this case.) This keeps memory usage low, and is typically faster than loading the whole data into memory.
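For readers unfamiliar with event-driven parsing, here is a minimal sketch of what a handler such as mySAXHandler might look like. The class name TagCounter and the sample XML are made up for this illustration, and an in-memory io.BytesIO stands in for the file object that dvc.api.open() would provide.

```python
import io
from xml.sax import parse
from xml.sax.handler import ContentHandler


class TagCounter(ContentHandler):
    """Count how many times each element name appears in the document."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # Called once per opening tag as the parser moves through the stream.
        self.counts[name] = self.counts.get(name, 0) + 1


handler = TagCounter()
# Hypothetical sample document. With DVC, pass the `fd` from dvc.api.open() instead.
xml_stream = io.BytesIO(b"<posts><row Id='1'/><row Id='2'/></posts>")
parse(xml_stream, handler)
print(handler.counts)
```

Because the handler only keeps running totals, memory use stays roughly constant regardless of how large the document is.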
If you just need to load the complete file contents into memory, you can use dvc.api.read() instead:
    from xml.dom.minidom import parseString
    import dvc.api

    xmldata = dvc.api.read(
        'get-started/data.xml',
        repo='https://github.com/iterative/dataset-registry'
    )
    # read() returns the contents as a string, so parse it with parseString()
    xmldom = parseString(xmldata)
This is just a matter of using the right repo argument, for example an SSH URL (requires that the credentials are configured):
    import dvc.api

    with dvc.api.open(
        'features.dat',
        repo='user@server:path/to/repo.git'
    ) as fd:
        # ... Process 'features'
The rev argument lets you specify any Git commit to look for an artifact. This way any previous version or alternative experiment can be accessed programmatically. For example, let's say your DVC repo has tagged releases of a CSV dataset:
    import csv
    import dvc.api

    with dvc.api.open(
        'clean.csv',
        rev='v1.1.0'
    ) as fd:
        reader = csv.reader(fd)
        # ... Process 'clean' data from version 1.1.0
Also, notice that we didn't supply a
repo argument in this example. DVC will
attempt to find a DVC project to use in the current working
directory tree, and look for the file contents of
clean.csv in its local
cache; no download will happen if found. See the
Parameters section for more info.
Sometimes we may want to choose the remote data source, for example if the repo has no default remote set. This can be done by providing a remote argument:
    import re
    import dvc.api

    with dvc.api.open(
        'activity.log',
        repo='location/of/dvc/project',
        remote='my-s3-bucket'
    ) as fd:
        for line in fd:
            match = re.search(r'user=(\w+)', line)
            # ... Process users activity log
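One possible way to complete the log processing step is sketched below. The function name count_users and the sample log lines are invented for this illustration. The function accepts any iterable of text lines, so the fd yielded by dvc.api.open() could be passed in directly.

```python
import re
from collections import Counter

USER_PATTERN = re.compile(r"user=(\w+)")


def count_users(lines):
    """Tally `user=<name>` occurrences across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        match = USER_PATTERN.search(line)
        if match:  # lines without a user field are skipped
            counts[match.group(1)] += 1
    return counts


# Made-up sample lines. With DVC, pass the `fd` from dvc.api.open() instead.
sample_log = [
    "2020-01-01 user=alice action=login",
    "2020-01-01 user=bob action=login",
    "2020-01-02 heartbeat",
    "2020-01-02 user=alice action=logout",
]
user_counts = count_users(sample_log)
```

Since the log is consumed one line at a time, this works on a streamed file without holding the whole log in memory.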
To choose which codec to open a text file with, send an encoding argument:
    import dvc.api

    with dvc.api.open(
        'data/nlp/words_ru.txt',
        encoding='koi8_r'
    ) as fd:
        # ... Process Russian words
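To show why the codec matters, the following sketch uses only the standard library: it encodes two Russian words as KOI8-R bytes, then decodes them with the matching codec and with a mismatched one. The sample words are made up, and io.TextIOWrapper merely stands in for a file opened in text mode with an encoding argument.

```python
import io

# Bytes as they might be stored for a KOI8-R encoded word list (sample data).
raw = "привет\nмир\n".encode("koi8_r")

# Wrapping the byte stream with the matching codec yields the original text,
# analogous to passing encoding='koi8_r' when opening the file.
with io.TextIOWrapper(io.BytesIO(raw), encoding="koi8_r") as fd:
    words = [line.strip() for line in fd]

# Decoding the same bytes with a mismatched codec succeeds silently
# but produces different characters.
garbled = raw.decode("latin-1")
```

A mismatched codec does not always raise an error, so it is worth stating the encoding explicitly whenever the file is not UTF-8.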