Edit on GitHub

DVCFileSystem

New in DVC 2.27.0 (see dvc version)

DVCFileSystem provides a pythonic file interface ( fsspec-compatible) for a DVC repo. It is a read-only filesystem, hence it does not support any write operations, like put_file, cp, rm, mv, mkdir etc.

DVCFileSystem provides a unified view of all the files/directories in your repository, be it Git-tracked or DVC-tracked, or untracked (in case of a local repository). It can reuse the files in DVC cache and can otherwise stream from supported remote storage.

>>> from dvc.api import DVCFileSystem
# opening a local repository
>>> fs = DVCFileSystem("/path/to/local/repository")
# opening a remote repository
>>> url = "https://github.com/iterative/example-get-started.git"
>>> fs = DVCFileSystem(url, rev="main")

The optional positional argument can be a URL or a local path to the DVC project. If unspecified, the DVC project in current working directory is used.

The optional rev argument can be passed to open a filesystem from a certain Git commit (any revision such as a branch or a tag name, a commit hash, or an experiment name).

Opening a file

>>> with fs.open("model.pkl") as f:
        model = pickle.load(f)

This is similar to dvc.api.open() which returns a file-like object. Note that, unlike dvc.api.open(), the mode defaults to binary mode, i.e. "rb". You can also specify encoding argument in case of text mode ("r").

Reading a file

>>> text = fs.read_text("get-started/data.xml", encoding="utf-8")

This is similar to dvc.api.read(), which returns the contents of the file as a string.

To get the binary contents of the file, you can use read_bytes() or cat_file().

>>> contents = fs.read_bytes("get-started/data.xml")

Listing all DVC-tracked files recursively

>>> fs.find("/", detail=False, dvc_only=True)
[
    '/data/data.xml',
    '/data/features/test.pkl',
    '/data/features/train.pkl',
    '/data/prepared/test.tsv',
    '/data/prepared/train.tsv',
    '/evaluation/importance.png',
    '/model.pkl'
]

This is similar to dvc ls --recursive --dvc-only CLI command. Note that the "/" is considered as the root of the Git repo. You can specify sub-paths to only return entries in that directory. Similarly, there is fs.ls() that is non-recursive.

Listing all files (including Git-tracked)

>>> fs.find("/", detail=False)
[
    ...
    '/.gitignore',
    '/README.md',
    '/data/.gitignore',
    '/data/data.xml',
    '/data/features/test.pkl',
    '/data/features/train.pkl',
    '/data/prepared/test.tsv',
    '/data/prepared/train.tsv',
    ...
    '/evaluation/.gitignore',
    '/evaluation/importance.png',
    '/evaluation/plots/confusion_matrix.json',
    '/evaluation/plots/precision_recall.json',
    '/evaluation/plots/roc.json',
    '/model.pkl',
    ...
]

This is similar to dvc ls --recursive CLI command. It returns all of the files tracked by DVC and Git and if filesystem is opened locally, it also includes the local untracked files.

Downloading a file or a directory

>>> fs.get_file("data/data.xml", "data.xml")

This downloads "data/data.xml" file to the current working directory as "data.xml" file. The DVC-tracked files may be downloaded from the cache if it exists or may get streamed from the remote.

>>> fs.get("data", "data", recursive=True)

This downloads all the files in "data" directory - be it Git-tracked or DVC-tracked into a local directory "data". Similarly, DVC might fetch files from remote if they don't exist in the cache.

API Reference

As DVCFileSystem is based on fsspec, it is compatible with most of the APIs that it offers. For more details check out the fsspec's API Reference.